By Aidan Thurling
And ever since then, farmers have been analyzing patterns and drawing their own correlations in an effort to predict and control their crop’s performance. For example, the ancient Egyptians carefully planned their planting and harvest around the flooding patterns of the Nile River. Medieval farmers rotated their crops to maximize yield. And within the last few centuries, growers have used tools such as the Old Farmer’s Almanac to predict each year’s growing season conditions.
Today, advanced analytics and modeling techniques allow us to accurately forecast crop performance. These predictions assist agriculture retailers and distributors in planning their transportation and storage logistics, enable agricultural insurance companies to make informed loss estimates, and help agronomists make decisions about crop varieties and fertility treatments. We can train accurate machine learning models using environmental variables at a scale far beyond what a single farmer could access alone. We can discover and derive advanced metrics about our fields using technologies such as satellite imagery, soil analysis, elevation models, and more. We then take all the factors that we suspect will influence yield and combine them with historic harvest data to build a machine learning model that can provide accurate, timely, and scalable predictions.
As a member of Esri’s commercial agriculture team, I decided to run through the exercise of training a machine-learning model to forecast sugarcane yield using ArcGIS Pro. The first step to creating a yield forecasting model is to establish which variables you have access to. For this exercise, I used sugarcane field boundaries, historic sugarcane harvest data for my boundaries (7 years), irrigation method, cut stage, regional soil data, climatic data from TerraClimate, and Sentinel-2 multispectral satellite imagery.
Because Sentinel-2 imagery is captured so frequently, I had to decide which timeframe to use when treating the Normalized Difference Vegetation Index (NDVI) as an explanatory variable for yield. Because sugarcane in the region of interest has a 12-month growing season, I opted to use imagery from 6 months prior to the field’s harvest date (+/- 1 month depending on imagery availability). This ensures that the model can be run during the middle of the growing season as opposed to the brief period just before harvest (which would not be very helpful for predicting yield if harvest was about to happen anyway!). Ultimately, this resulted in a cumulative field boundary layer containing 60 “snapshots” across a 7-year period, or 1,936 distinct field records, each containing all the explanatory variables surrounding that record’s actual harvest date.
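For readers who want to see the arithmetic, here is a minimal Python sketch of the two ideas in this step: computing NDVI from near-infrared and red reflectance (bands B8 and B4 on Sentinel-2), and choosing the scene closest to six months before harvest within a one-month tolerance. The function names, the 182-day offset, the 30-day tolerance, and the sample dates are illustrative assumptions, not part of the ArcGIS workflow described here.

```python
from datetime import date, timedelta


def ndvi(nir, red):
    """NDVI for one pixel: (NIR - Red) / (NIR + Red)."""
    total = nir + red
    # Guard against division by zero for empty or masked pixels.
    return 0.0 if total == 0 else (nir - red) / total


def mean_ndvi(nir_pixels, red_pixels):
    """Average NDVI over the pixels that fall inside one field."""
    values = [ndvi(n, r) for n, r in zip(nir_pixels, red_pixels)]
    return sum(values) / len(values)


def pick_scene(harvest_date, scene_dates, offset_days=182, tolerance_days=30):
    """Return the scene date closest to (harvest - offset), or None
    if no scene falls within the tolerance window."""
    target = harvest_date - timedelta(days=offset_days)
    candidates = [d for d in scene_dates
                  if abs((d - target).days) <= tolerance_days]
    if not candidates:
        return None
    return min(candidates, key=lambda d: abs((d - target).days))


# Hypothetical field harvested on 1 October 2022 with three available scenes.
scenes = [date(2022, 3, 1), date(2022, 4, 10), date(2022, 6, 1)]
chosen = pick_scene(date(2022, 10, 1), scenes)
```

Returning `None` when no scene qualifies makes gaps in imagery explicit, so a field record can be skipped rather than silently paired with an image from the wrong part of the season.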
ArcGIS Pro provides many different machine learning tools and methodologies. Most of them are as simple as a geoprocessing tool where you can input your training data and specify the variable you wish to predict (in our case, the yield value) along with the variables that you suspect influence yield, such as precipitation.
For this exercise, I used a random forest regression algorithm to train a model and determine which variables influence yield the most. You can learn more about the ArcGIS geoprocessing tool I used here. I trained three versions of the model. The first used the random forest algorithm with 100 trees. The second used the random forest algorithm with 1,000 trees. The third used the Extreme Gradient Boosting (XGBoost) algorithm with 1,000 trees. The most accurate result came from the third model.
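To give a feel for what "a forest with N trees" means, here is a deliberately tiny Python sketch. It is a toy, not the ArcGIS tool: each "tree" is a single-split stump fitted to a bootstrap sample on one randomly chosen variable, and the forest prediction is the average of all stumps. The function names and the four-row dataset (NDVI and an irrigation flag against yield) are made up for illustration.

```python
import random


def fit_stump(rows, targets, feature):
    """Find the single threshold on one feature that minimizes squared error."""
    best = None
    for threshold in sorted({r[feature] for r in rows}):
        left = [t for r, t in zip(rows, targets) if r[feature] <= threshold]
        right = [t for r, t in zip(rows, targets) if r[feature] > threshold]
        if not left or not right:
            continue
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        # No valid split in this sample: always predict the sample mean.
        mean = sum(targets) / len(targets)
        return (feature, float("inf"), mean, mean)
    return (feature, best[1], best[2], best[3])


def fit_forest(rows, targets, n_trees=100, seed=0):
    """Bagging: each stump sees a bootstrap sample and one random feature."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        feature = rng.randrange(n_features)
        forest.append(fit_stump([rows[i] for i in idx],
                                [targets[i] for i in idx], feature))
    return forest


def predict(forest, row):
    """Average the predictions of every stump in the forest."""
    preds = [lo if row[f] <= thr else hi for f, thr, lo, hi in forest]
    return sum(preds) / len(preds)


# Hypothetical records: [mean NDVI, irrigation flag] -> yield
rows = [[0.2, 1], [0.3, 0], [0.7, 1], [0.8, 0]]
targets = [50.0, 55.0, 90.0, 95.0]
forest = fit_forest(rows, targets, n_trees=100, seed=0)
```

Raising `n_trees` averages over more bootstrap samples, which tends to stabilize predictions at the cost of training time. Gradient boosting methods such as XGBoost differ in that trees are built sequentially, each one fitted to the errors left by the trees before it, rather than independently and averaged.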
Not only did the model successfully create trained features, but it also performed well when run independently on the 2022 field data. ArcGIS also output many supplemental details about the regression analysis itself and how each variable fared during the model’s training. The chart of variable importance illustrates that, of all the variables I incorporated into my training data, the most important for predicting sugarcane yield in this region of interest are the field’s mean NDVI six months prior to harvest, its irrigation method, solar radiance, and soil moisture.
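One generic way to reason about variable importance outside of ArcGIS is permutation importance: shuffle one variable's values across the held-out records and measure how much the prediction error grows. The sketch below assumes this technique purely for illustration; it is not necessarily how ArcGIS calculates the importance values in its chart, and the stand-in model and records are hypothetical.

```python
import math
import random


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


def permutation_importance(predict_fn, rows, targets, feature,
                           n_repeats=5, seed=0):
    """Average increase in RMSE after shuffling one feature column."""
    rng = random.Random(seed)
    baseline = rmse(targets, [predict_fn(r) for r in rows])
    increases = []
    for _ in range(n_repeats):
        column = [r[feature] for r in rows]
        rng.shuffle(column)
        # Rebuild each row with the shuffled value in place of the original.
        shuffled = [r[:feature] + [v] + r[feature + 1:]
                    for r, v in zip(rows, column)]
        error = rmse(targets, [predict_fn(r) for r in shuffled])
        increases.append(error - baseline)
    return sum(increases) / n_repeats


# Hypothetical held-out records: [mean NDVI, unrelated variable]
held_out = [[0.1, 5.0], [0.2, 3.0], [0.4, 9.0],
            [0.6, 1.0], [0.8, 7.0], [0.9, 2.0]]
actual_yield = [100 * r[0] for r in held_out]


def stand_in_model(row):
    """Stand-in for a trained model: uses NDVI only."""
    return 100 * row[0]


ndvi_importance = permutation_importance(stand_in_model, held_out, actual_yield, 0)
other_importance = permutation_importance(stand_in_model, held_out, actual_yield, 1)
```

A variable the model relies on produces a large error increase when shuffled, while a variable the model ignores produces none. The same `rmse` helper can be used on its own to score predictions against a held-out year such as 2022.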
From a temporal perspective, I also compared the predicted annual yield means to the actual annual yield means for all my fields across nearly seven years. This illustrates how closely aligned the prediction is with the actual yield results.
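The annual comparison described above amounts to grouping field records by harvest year and averaging the actual and predicted yields within each group. A minimal sketch, assuming each record is a hypothetical `(year, actual, predicted)` tuple with made-up values:

```python
from collections import defaultdict


def annual_means(records):
    """Map each year to (mean actual yield, mean predicted yield)."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # actual, predicted, count
    for year, actual, predicted in records:
        entry = sums[year]
        entry[0] += actual
        entry[1] += predicted
        entry[2] += 1
    return {year: (a / n, p / n) for year, (a, p, n) in sorted(sums.items())}


# Hypothetical field records: (harvest year, actual yield, predicted yield)
records = [(2021, 80, 82), (2021, 90, 88), (2022, 70, 75)]
means = annual_means(records)
```

Comparing yearly means smooths over field-level noise, so it is best read alongside per-field error rather than instead of it.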
It is important to bear in mind that your own yield forecasting model will only be as accurate as the quality and quantity of data you’re able to procure. A model cannot accurately predict circumstances that it was never trained on. For example, a yield forecasting model that was trained on Iowa corn yield data will likely not be able to accurately forecast corn yield in North Carolina. Ultimately, the best yield forecasting models will be trained on thousands of high-resolution inputs spanning multiple years and many different variables. That includes decisions made by the grower and the grower’s team such as seed variety, fertility application, or planting rate as well as environmental variables that are not as easily controlled, such as rainfall and temperature.
I hope this case study excites you as much as it excites me. The opportunity to forecast yield values can provide invaluable insight from year to year. It enables growers to predict profit margins and strategically adjust their field management over the course of the growing season. It provides agronomists, retailers, and distributors with the data needed to make informed and cost-saving logistical decisions. It can even help entire nations effectively identify and plan actions against food insecurity.
We can think back to our farmer friends of both today and yesteryear and seek not to replace their centuries of collective traditions and methodologies, but rather to complement and augment tried-and-true farming practices with cutting-edge technology and advanced analytics made simple by ArcGIS.
Aidan Thurling
As a Solution Engineer on Esri's Natural Resources team, Aidan works on geospatial workflows within the agriculture, forestry, and energy industries. She holds a BS in Biology (Ecology) from Cal Poly SLO and a Masters in GIS Tech from NC State. She has experience in geospatial data science & analytics, web GIS, remote sensing, process automation, and parallel parking large vehicles.