Alexey Stepanov, Konstantin Dubrovin, Aleksei Sorokin
a Far Eastern Agriculture Research Institute,Vostochnoe,680521 Khabarovsk,Russia
b Computing Center of the Far Eastern Branch of the Russian Academy of Sciences,680000 Khabarovsk,Russia
Keywords: NDVI; Function fitting; Early prediction; Yield; Soybean
ABSTRACT Forecasting crop yields from remote sensing data is one of the most important tasks in agriculture. Soybean is the main crop in the Russian Far East. It is desirable to forecast soybean yield as early as possible while maintaining high accuracy. This study investigated seasonal time series of the normalized difference vegetation index (NDVI) to achieve early forecasting of soybean yield. The research used data from the Moderate Resolution Imaging Spectroradiometer (MODIS), an arable-land mask obtained from the VEGA-Science web service, and soybean yield data for 2008-2017 for the Jewish Autonomous Region (JAR) districts. Four approximating functions were fitted to model the NDVI time series: Gaussian, double logistic (DL), and quadratic and cubic polynomials. In the period from calendar weeks 22-42 (end of May to mid-October), averaged over two districts, the model using the DL function showed the highest accuracy (mean absolute percentage error: 4.0%, root mean square error (RMSE): 0.029, P < 0.01). The accuracy of the yield forecast in weeks 25-30 in JAR municipalities using the parameters of the Gaussian function was higher (P < 0.05) than that using the other functions. The mean forecast error for the Gaussian function was 14.9% in week 25 (RMSE was 0.21 t ha⁻¹) and 5.1%-12.9% in weeks 26-30 (RMSE varied from 0.06 to 0.15 t ha⁻¹) according to the 2013-2017 data. In weeks 31-32, the error was 5.0%-5.4% (RMSE was 0.07 t ha⁻¹) using the Gaussian parameters and 7.4%-7.7% (RMSE was 0.09-0.11 t ha⁻¹) for the DL function. When the method was applied to municipal districts of other soy-producing regions of the Russian Far East, RMSE was 0.14-0.32 t ha⁻¹ in weeks 25-26 and did not exceed 0.20 t ha⁻¹ in subsequent weeks.
One of the main tasks in agricultural practice is the prediction of crop yield.In recent decades,remote-sensing data have been used for this purpose.Forecasts are usually based on regression models,in which vegetation indices,including the normalized difference vegetation index (NDVI) and enhanced vegetation index (EVI),as well as climate characteristics are used as independent variables[1-3].
One of the main problems in NDVI time-series processing is data smoothing and noise reduction. Usually, approximating functions (asymmetric Gaussian, double logistic (DL), and polynomial functions) are used for curve smoothing [4-6]. Shao et al. [6] used the Savitzky-Golay and Whittaker filters and discrete Fourier transformation smoothing algorithms for noise reduction, as well as asymmetric Gaussian and DL function fitting. These smoothing algorithms were applied to crop classification, and the highest classification accuracy was achieved by the Whittaker smoother (6% higher than that of the other methods). Atkinson et al. [4] applied four techniques (Fourier analysis, asymmetric Gaussian modeling, DL modeling, and Whittaker filtering) to simulate seasonal variation in vegetation indices (VIs). The asymmetric Gaussian and DL functions performed well only for one-harvest regions. Cao et al. [5] described an iterative logistic fitting method for modeling EVI in meadows. Seo et al. [7] used two logistic curves, one for the early and one for the later part of the growth period, to fit corn and soybean NDVI time series. Berger et al. [8] presented soybean NDVI forecasts based on historical data in Uruguay. They studied a set of fields with an area of at least 250 ha to fit annual NDVI time series, using two models: polynomial and DL function fitting. Vorobyova and Chernov [9] performed NDVI fitting using piecewise linear, asymmetric Gaussian, and DL functions, Fourier series, polynomials, and a cubic spline in the Samara region (Russia). They reported the highest approximation accuracy for the cubic spline. Hird and McDermid [10] studied four alternative filters for noise reduction in addition to asymmetric Gaussian and DL functions. In most cases, the use of asymmetric Gaussian and DL functions greatly reduced the noise level while maintaining the relevant NDVI signal integrity. However, in some special cases (for example, in montane regions), alternative filters performed better.
Asymmetric Gaussian and DL function fitting can be used in the absence of NDVI composites to filter out outliers, but such applications are limited to one-harvest regions [11]. VI time-series modeling using approximating functions or other methods is usually applied for arable land classification and for creating masks of individual crops. For example, in the VEGA services developed by SRI RAS, land classification is performed using NDVI time-series values. It has been demonstrated [12,13] that it is possible to identify crops by analyzing the time series of variance characteristics and to reduce noise (according to high-resolution data from the Sentinel satellites).
In 2008,the USDA created a map of U.S.arable land (CDL)showing the distributions of individual crops [14].A decision tree-supervised classification method was used to generate freely available state-level crop cover classifications.This service continuously collects large amounts of data,which are used for continuously retraining the model and increasing classification accuracy.Currently,this product uses images obtained from the Landsat and Sentinel satellites for crop classification across the USA [15].In the southern part of Ontario(Canada),masks of major crops were created to predict crop yield.Fuzzy logic methods were used for classification,and the EVI2 time series from the Moderate Resolution Imaging Spectroradiometer (MODIS) instrument in 2011-2013 was used as data[16].The results emphasize the possibility of using vegetation indices to classify arable land in the Russian Far East,which has a climate similar to that of some provinces of Canada.In the state of Mato Grosso(Brazil),a soybean mask was also designed to predict yield [17].EVI time series obtained from MODIS over 10 years and a regression method based on Gaussian processes were used for soybean mapping.In 2015,an international group of scientists[18]created a global map of arable land based on MODIS data.Stepanov et al.[19]studied NDVI time series for several crops (soybean,barley,wheat,forage grasses),using Gaussian function fitting for NDVI to model and predict crop yields.
Regression models most often use the maximum VI to predict the yields of both winter and spring crops. The NDVI maximum is the most stable indicator among composites and, as a predictor in the regression model, provides the highest forecast accuracy [20-22]. In the temperate latitudes of the Northern Hemisphere, winter crops are characterized by two NDVI maxima: spring and summer. Spring crops, particularly cereals and legumes, are characterized by one maximum in July-August [23]. Usually, the term "early prediction" refers to winter crop yield forecasting using the spring maximum. For example, Bereza et al. [24] described a winter wheat yield forecasting model for the Volga region in Russia with a preliminary yield calculation in May. For Central European countries, the best result in winter crop forecasting is achieved using NDVI in April, which also corresponds to the winter crop maximum [25]. Similarly, forecast models adjusted for seasonality were constructed for winter and spring crops in the Southern Hemisphere [26].
For the Russian Far East,as well as for regions characterized by a long,cold winter,it is most relevant to forecast the yield of spring crops,particularly soybean.Because soybean is the leading crop of the Russian Far East(occupying more than 65%of arable land[27])and the basis of agricultural exports,early forecasting of its yield with high accuracy is economically crucial [28].
A previously developed model for predicting the yield of main regional crops at the municipal level used NDVI and climate characteristics as independent variables [19,29].Thus,the main purpose of this study was to assess the accuracy of NDVI time-series modeling for one of the major soybean-producing regions,the Jewish Autonomous Region (JAR),using approximating functions,and to assess the possibility,accuracy,and timing of forecasting the NDVI maximum using composites of calendar weeks preceding the maximum.We proposed to use the parameters of approximating functions calculated for previous years to predict the NDVI maximum.
The study area comprised the Oktyabrskiy (OD) and Leninskiy (LD) districts located in the southwestern JAR (Fig. S1). The study area covers approximately 12,500 km². The southern natural border of the area is the Amur River. Meadow and alluvial soils and warm summers (throughout LD and in the southern part of OD) with sufficient precipitation are suitable for crop growth.
The Oktyabrskiy and Leninskiy districts are among the leading agricultural municipalities in the southern Russian Far East.The total area of arable land was 69,535 ha for LD and 43,889 ha for OD in 2017 (77% of arable land in the JAR).
As shown in Table S1,agricultural enterprises in the study area specialize in growing soybeans:93.9%of arable land(41,214 ha)in OD and 95.8%(66,636 ha)in LD.These municipalities provided 80%of the JAR soybean gross yield in 2017.Among other crops,the main grains are oat (3.2% for OD and 2% for LD) and spring barley(0.9% for OD and 0.5% for LD).
Table 1 Mean MAPEs (%) for four fitting functions in LD and OD.
Table 2 Confidence probabilities for Tukey’s method (MAPE).
Table 3 Mean RMSE (t ha⁻¹) for four fitting functions in LD and OD.
Table 4 Maximum NDVI forecasting accuracy assessment for two districts of the JAR in individual weeks (2013-2017) using Gaussian and DL functions.
Remote measurements of the spectral reflectance of arable land are provided by MODIS. We used weekly aggregated cloud-free images in the spectral regions of 0.620-0.670 μm and 0.841-0.876 μm with 250-m spatial resolution (MOD09 [30]) to compute district-mean NDVI. Calculations were performed using an arable land mask provided by the VEGA-Science web service (http://sci-vega.ru, accessed on December 4, 2021) [31]. Weekly NDVI composites for OD and LD were used to plot seasonal NDVI curves.
NDVI is computed as follows:

$$\mathrm{NDVI} = \frac{\mathrm{NIR} - \mathrm{RED}}{\mathrm{NIR} + \mathrm{RED}} \tag{1}$$
where NIR and RED are spectral reflectance measurements acquired in the near-infrared and red regions,respectively [32].Estimates published by Rosstat(https://rosstat.gov.ru,accessed on September 5,2021) were used as the soybean yields in the LD and OD for model validation.
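The NDVI definition and the masked district average can be illustrated with a short sketch. It assumes that the district mean is taken over per-pixel NDVI values inside the arable-land mask; the function names, the tiny reflectance arrays, and the mask are hypothetical and are not taken from the study data.

```python
import numpy as np

def ndvi(nir, red):
    """Per-pixel NDVI = (NIR - RED) / (NIR + RED); NaN where the denominator is zero."""
    nir = np.asarray(nir, dtype=float)
    red = np.asarray(red, dtype=float)
    denom = nir + red
    out = np.full(denom.shape, np.nan)
    np.divide(nir - red, denom, out=out, where=denom != 0)
    return out

def district_mean_ndvi(nir, red, arable_mask):
    """Mean NDVI over pixels flagged as arable land (boolean mask), ignoring NaNs."""
    values = ndvi(nir, red)[np.asarray(arable_mask, dtype=bool)]
    return float(np.nanmean(values))

# Hypothetical 2x2 tile of surface reflectance (not real MOD09 values)
nir = np.array([[0.45, 0.50], [0.30, 0.0]])
red = np.array([[0.05, 0.10], [0.10, 0.0]])
mask = np.array([[True, True], [False, True]])
mean_ndvi = district_mean_ndvi(nir, red, mask)
```

Pixels with zero total reflectance are set to NaN and skipped, so fill values do not bias the district mean.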
Four different functions were used to approximate seasonal NDVI curves: Gaussian and DL functions and quadratic and cubic polynomials.
The Gaussian function is
where i is the calendar week number, b characterizes the growth peak, and c is the active vegetation duration [33]; Vmax is the NDVI maximum.
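A standard three-parameter Gaussian consistent with these definitions can be written as shown here. The symbol $\hat{V}_i$ for the modeled NDVI in week i and the factor of 2 in the denominator of the exponent are assumptions, so c may differ from the authors' parameter by a constant factor.

$$\hat{V}_i = V_{\max}\,\exp\!\left(-\frac{(i-b)^2}{2c^2}\right)$$

Under this form, the curve peaks at i = b with value Vmax, and c controls the width of the seasonal peak.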
The DL function is given by Eq.(3):
where c1 is the NDVI minimum; c2 is the range of NDVI variation; a1 is the inflection point where the curve has a positive slope; a2 is the rate of this growth; a3 is the inflection point where the curve has a negative slope; and a4 is the rate of decrease [34].
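One common double-logistic parameterization that matches these parameter roles is given here as an illustration. The exact arrangement of the rates and signs in the exponents is an assumption, and the form in [34] may differ.

$$\hat{V}_i = c_1 + c_2\left(\frac{1}{1+e^{-a_2 (i-a_1)}} - \frac{1}{1+e^{-a_4 (i-a_3)}}\right)$$

With a1 < a3 and positive rates, the first logistic term describes green-up and the second describes senescence, so the curve rises from c1 toward c1 + c2 and then returns to c1.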
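The quadratic and cubic polynomials were computed as follows:

$$\hat{V}_i = a i^2 + b i + c \tag{4}$$

$$\hat{V}_i = a i^3 + b i^2 + c i + d \tag{5}$$

where $\hat{V}_i$ is the modeled NDVI in calendar week i, and a, b, c, and d are model parameters.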
Curve fitting was performed by the least-squares method using the Levenberg-Marquardt algorithm.
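As an illustration of such a fit, the sketch below uses SciPy's `curve_fit` with `method="lm"` (Levenberg-Marquardt) on synthetic weekly NDVI values. The Gaussian form with 2c² in the denominator, the starting values, and the data are assumptions made for the example; this is not the software used in the study.

```python
import numpy as np
from scipy.optimize import curve_fit

def gaussian(i, v_max, b, c):
    # Assumed form: V_i = V_max * exp(-(i - b)^2 / (2 c^2))
    return v_max * np.exp(-((i - b) ** 2) / (2.0 * c ** 2))

weeks = np.arange(22, 43)  # calendar weeks 22..42
rng = np.random.default_rng(0)
# Synthetic "observed" NDVI: a known Gaussian plus small noise (not real data)
observed = gaussian(weeks, 0.80, 32.0, 7.0) + rng.normal(0.0, 0.01, weeks.size)

# method="lm" selects Levenberg-Marquardt, which requires an unbounded problem
params, _ = curve_fit(gaussian, weeks, observed, p0=[0.7, 30.0, 5.0], method="lm")
v_max_fit, b_fit, c_fit = params

fitted = gaussian(weeks, *params)
rmse = float(np.sqrt(np.mean((observed - fitted) ** 2)))
```

The same call works for the DL function or the polynomials by swapping the model function and the initial guess `p0`; Levenberg-Marquardt is a local method, so a reasonable starting point matters most for the six-parameter DL curve.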
NDVI composites from weeks 22 to 42 (end of May to mid-October) were used. This period falls within the growth season of soybean, the main JAR crop. The parameters of Eqs. (2)-(5) and the model errors (mean absolute percentage error (MAPE) [35] and root mean square error (RMSE) [36]) were calculated for every year from 2008 to 2017.
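The MAPE and RMSE were calculated to estimate the model accuracy as follows:

$$\mathrm{MAPE} = \frac{100\%}{n-m+1}\sum_{i=m}^{n}\left|\frac{V_i-\hat{V}_i}{V_i}\right| \tag{6}$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{n-m+1}\sum_{i=m}^{n}\left(V_i-\hat{V}_i\right)^2} \tag{7}$$

where m is the week number at the start of the period, n is the week number at the end of the period, $\hat{V}_i$ is the predicted NDVI for the ith week, and $V_i$ is the observed NDVI for the ith week.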
The significance of differences in accuracy between the methods was assessed by two-way ANOVA, and post hoc comparison was performed using Tukey's test (α = 0.05).
In yield forecasting, the NDVI maximum of the studied year is used either as the single predictor in a simple regression or as one of the predictors in a multiple regression model. In practice, such models can be applied only after the NDVI maximum has been reached. Approximating functions permit earlier prediction, because the NDVI maximum can be calculated from the weekly NDVI composites of the preceding weeks. In the present case, the function parameters were calculated from the mean values of weekly NDVI composites for the five years preceding the forecast year.
Equation (2) yields an expression for calculating the maximum NDVI using the Gaussian function:
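If the Gaussian is assumed to have the form V_i = Vmax · exp(−(i − b)² / (2c²)), solving for the maximum gives Vmax = V_i · exp((i − b)² / (2c²)), where V_i is the composite observed in week i and b and c come from the preceding five years. The sketch below applies this inversion; the assumed Gaussian form, the function name, and all numeric values are hypothetical.

```python
import math

def predict_vmax(ndvi_i, week, b, c):
    """Invert the assumed Gaussian V_i = V_max * exp(-(i - b)^2 / (2 c^2)) for V_max."""
    return ndvi_i * math.exp((week - b) ** 2 / (2.0 * c ** 2))

# Hypothetical parameters standing in for values fitted to the five preceding years
b_prev, c_prev = 32.0, 7.0
# Hypothetical NDVI composite observed in week 27 of the forecast year
vmax_forecast = predict_vmax(0.62, 27, b_prev, c_prev)
```

The correction factor grows with the distance between the current week and b, which is one reason the forecast is more sensitive to parameter errors in early weeks than near the peak.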
To calculate the maximum NDVI when approximating the DL function (Eq.(3)),the following formula is used:
Applying polynomials to predict maximum NDVI values from weekly NDVI composites is not possible.
For a comparative assessment of the Vmax prediction accuracy in calendar weeks 25-32, the APE (absolute percentage error) [37] and RMSE indicators were calculated as follows:
where i is the calendar week number,j is the year number,and m is the number of years (2013-2017).
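A minimal sketch of these per-week indicators follows. It assumes that the APE of each forecast is taken relative to the observed Vmax of the same year and that both indicators are aggregated over the m forecast years; the function names and numbers are hypothetical and are not values from Table 4.

```python
import math

def ape(predicted, observed):
    """Absolute percentage error of one forecast, in percent."""
    return abs(observed - predicted) / observed * 100.0

def weekly_scores(forecasts, observed_max):
    """forecasts: {week: [forecast per year]}; observed_max: [observed Vmax per year].
    Returns {week: (mean APE in %, RMSE)} computed over the years."""
    m = len(observed_max)
    scores = {}
    for week, preds in forecasts.items():
        mean_ape = sum(ape(p, o) for p, o in zip(preds, observed_max)) / m
        rmse = math.sqrt(sum((o - p) ** 2 for p, o in zip(preds, observed_max)) / m)
        scores[week] = (mean_ape, rmse)
    return scores

# Hypothetical values for two years
observed_max = [0.80, 0.84]
forecasts = {25: [0.72, 0.924], 31: [0.80, 0.84]}
scores = weekly_scores(forecasts, observed_max)
```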
Assessment of significant differences in the forecasting accuracy for each calendar week using the Gaussian and DL functions for both territories was performed by two-way ANOVA.
Previously, to predict the soybean yields for LD and OD, we proposed a regression model fitted to data from 2001 to 2018. Its independent predictors are the maximum NDVI (Vmax) and the number of days with active temperatures above 10 °C (D), counted from the beginning of the predicted year to the calendar week corresponding to Vmax. The regression models for LD and OD are as follows [29]:
where y denotes the mean soybean yield (t ha⁻¹).
The predicted values of Vmax were used in the regression model to calculate the yield in calendar weeks 25-32 for each function. When determining the values of D for individual years, we took into account that during the study period (2008-2017) in OD and LD, no day after week 25 (mid-June) had a mean daily temperature below 10 °C. Thus, D can be calculated as the number of days with a mean daily temperature above 10 °C from the beginning of the year to week 25, plus the number of days remaining until the calendar week of the NDVI maximum. To assess the accuracy of the yield forecast for weeks 25-32, the APE and RMSE indicators were calculated:
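The rule for D can be sketched as follows. The example assumes that the temperature record ends on the last day of week 25 and that every later day up to the end of the expected peak week counts as an active day; the function name, the week-to-day conversion, and the temperatures are hypothetical.

```python
def active_days(temps_through_week25, peak_week, last_observed_week=25, threshold=10.0):
    """D = observed days above the threshold + all remaining days up to the peak week."""
    observed = sum(1 for t in temps_through_week25 if t > threshold)
    remaining = 7 * (peak_week - last_observed_week)  # assumed warm, as in 2008-2017
    return observed + remaining

# Hypothetical daily means: 120 cold days, then 55 warm days up to the end of week 25
temps = [-5.0] * 120 + [15.0] * 55
d_value = active_days(temps, peak_week=32)
```

Because the remaining days are counted rather than measured, D for a forecast issued in week 25 depends only on temperatures already observed and on the expected week of the NDVI maximum.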
Fig.1.NDVI time series and fitting functions.a) LD,2016.b) OD,2016.c) LD,2017.d) OD,2017.
According to 2008-2017 data,the maximum values among the weekly composites of NDVI during a calendar year corresponded to weeks 31-34 for both OD (iavg=32.3 ± 0.7) and LD (iavg=32.3 ±0.9).Fig.1 shows graphs of the annual NDVI curves in 2016-2017,as well as graphs of the approximating functions:Gaussian function (Gauss),DL function and quadratic (Quadratic)and cubic (Cubic) polynomials.
Fig. 2. Confidence intervals for the mean MAPE (P < 0.01). Vertical bars denote 0.95 confidence intervals.
The Gaussian function better fits symmetric distributions,and the DL function is suitable for asymmetric data distributions.All of the presented functions fitted the annual NDVI series with sufficiently high accuracy.Two-way ANOVA showed that the mean MAPE differed significantly depending on the modeling method and the analyzed district (Table 1).A posteriori analysis using Tukey’s test revealed that the accuracy of the model using the DL function-fitting method was significantly higher than that when using the Gaussian function or polynomials (Table 2).The MAPE for the Gaussian method was 5.88% and those for the quadratic and cubic polynomials were 6.80% and 6.40%,respectively.For the DL function,the MAPE was 3.98%.Fig.2 shows the mean MAPE confidence intervals for the four methods.
Similarly, the analysis of variance showed that the RMSE depended significantly on the approximating function. However, no significant differences were found between the districts (Table 3). The RMSE for the DL function-fitting method was 0.029, those for the Gaussian function and the cubic polynomial were both 0.042, and that for the quadratic polynomial was 0.046. Post hoc analysis using Tukey's test confirmed that the accuracy of the DL function was higher (P < 0.01) than those of the Gaussian and polynomial models (similar to the MAPE results).
Fig.3.Vmax prediction APE in calendar weeks 25-32 using the Gaussian and DL functions (2013-2017).
Table 4 shows the APE and RMSE of the Vmax predictions obtained with the Gaussian and DL functions from the observed NDVI values of the forecast year. Two-way ANOVA revealed that the forecast error using the Gaussian function was significantly lower than that using the DL function in weeks 25-30. Averaged over the two districts, the mean APEs in weeks 25 and 26 were 9.48% and 7.02% for the Gaussian function and 17.12% and 11.68% for the DL function, respectively. The RMSEs were 0.035 for the Gaussian function and 0.075 for the DL function in week 25, and 0.028 and 0.047, respectively, in week 26. In weeks 27-30, the APE and RMSE for the Gaussian function were around half the corresponding indicators for the DL function. When the NDVI peak was reached (in weeks 31-32), the RMSE dropped to 0.008-0.012. There were no significant differences between the districts in the early prediction of Vmax in weeks 25-32. This finding is expected, given the similar climatic conditions and identical sowing dates, which produce similar vegetation index curves.
Fig.3 shows that the prediction accuracy increased (as expected)when approaching the actual Vmax.The mean APE of prediction by the Gaussian function decreased from 7% in week 26 to 0.7%-2.0% in weeks 31-32 and from more than 12% in week 26 to 1.9%-3.0% in weeks 31-32 using the DL function.
We assessed the quality of regression models (12)-(13) by calculating R² and performing cross-validation. R² was 0.59 (adjusted R² = 0.54) for both model (12) and model (13). Observed and predicted yield, observed NDVI maxima, and number of days with active temperature for OD and LD are presented in Tables S2-S4. The consistency of the regression models was confirmed using one-year cross-validation (Tables S5-S6). The mean cross-validation MAPE for the Leninskiy district was 6.71% and that for the Oktyabrskiy district was 4.52%.
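The cross-validation step can be illustrated with a leave-one-year-out scheme, which is one reading of "one-year cross-validation" and is an assumption here. The sketch fits an ordinary least-squares model with an intercept, Vmax, and D on synthetic data; the coefficients and data are invented and do not reproduce models (12)-(13).

```python
import numpy as np

def loo_mape(X, y):
    """Leave-one-year-out MAPE (%) of an ordinary least-squares model y ~ 1 + X."""
    X = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])
    y = np.asarray(y, dtype=float)
    errors = []
    for k in range(len(y)):
        train = np.arange(len(y)) != k  # hold out year k
        coef, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        pred = X[k] @ coef
        errors.append(abs(y[k] - pred) / y[k] * 100.0)
    return float(np.mean(errors))

# Synthetic data (not the study's): 18 years, yield linear in Vmax and D plus noise
rng = np.random.default_rng(1)
vmax = rng.uniform(0.75, 0.90, 18)
days = rng.uniform(90, 120, 18)
yield_t_ha = 0.2 + 1.0 * vmax + 0.002 * days + rng.normal(0, 0.03, 18)
cv_mape = loo_mape(np.column_stack([vmax, days]), yield_t_ha)
```

Each year is predicted by a model that never saw it, so the resulting MAPE is a less optimistic estimate than the in-sample fit summarized by R².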
Table 5 Yield forecasting accuracy assessment in individual weeks for two districts of the JAR in 2013-2017 using the Gaussian and DL functions.
Using the calculated NDVI maxima and previous regression models,the accuracy of the predicted soybean yields in LD and OD in 2013-2017 was evaluated.Two-way ANOVA showed that the accuracy of the forecast using the Gaussian function in weeks 25-30 was higher than that using the DL function(Table 5;Fig.4).
Table 6 Yield forecasting RMSE (t ha⁻¹) in different weeks for the different municipalities in the Russian Far East (2013-2017).
Fig.4.Yield forecasting APE in weeks 25-32 using Gaussian and DL functions(2013-2017).
The APEs for the Gaussian function were 14.87% and 12.92% for weeks 25 and 26, respectively, whereas those for the DL function were 32.84% and 22.51%. The RMSE of the yield prediction based on the Gaussian function was 0.21 t ha⁻¹ and 0.15 t ha⁻¹ in weeks 25 and 26, respectively, and the RMSEs of the prediction using the DL function were 0.38 and 0.24 t ha⁻¹, respectively. The mean APE of the yield prediction in weeks 27-30 using the Gaussian function decreased from 9.31% to 5.11% (the error in individual years did not exceed 20%), and the mean RMSE was 0.06-0.12 t ha⁻¹. The forecasting APE using the DL function in weeks 27-30 was 9.93%-16.29%. No significant difference in prediction accuracy was found in weeks 31-32 between the two types of approximating functions, nor was there any statistically significant difference in accuracy between the two districts.
Evaluation of the regression model using the real NDVI maximum, reached by calendar week 33, gave the following results: MAPE = 4.55% and RMSE = 0.06 t ha⁻¹ for OD, and MAPE = 6.83% and RMSE = 0.09 t ha⁻¹ for LD. Thus, the use of the maximum forecasting model is justified, including in the weeks approaching the onset of the real maximum.
We also predicted soybean yield in three other soybean-producing regions of the Russian Far East. One municipality was selected in each region to test the model. We used our regression model with the maximum NDVI value (actual or predicted from calendar week 25) and the growing season duration (to the week of the maximum) as independent variables (using data from 2001 to 2018). R² (adjusted R²) values were 0.76 (0.44) for Tambovskiy district, 0.68 (0.60) for Vyazemskiy district, and 0.83 (0.62) for Khorolskiy district. RMSE was 0.07 t ha⁻¹ for Tambovskiy district, 0.09 t ha⁻¹ for Vyazemskiy district, and 0.05 t ha⁻¹ for Khorolskiy district. The RMSE and R² values for the selected municipality of each of the three regions (Vyazemskiy, Tambovskiy, and Khorolskiy districts) are quite satisfactory.
Table 6 presents the RMSE of early forecasting. In the Tambovskiy and Vyazemskiy districts, the maximum NDVI was reached by calendar week 30, and in Khorolskiy by week 32. The accuracy of the method using the Gaussian function increased when approaching the calendar week of the maximum and matched the accuracy of the method using the real maximum. The RMSE in weeks 25-26 fell in the range of 0.14-0.32 t ha⁻¹ using both types of fit functions, which is also a good result.
This study established that the NDVI seasonal time series for arable land in the JAR can be modeled by fitting Gaussian and DL functions and quadratic and cubic polynomials. For 2008-2017, the MAPEs of the DL function were 3.7% for LD and 4.2% for OD, lower than the MAPEs of the other functions. The MAPEs for Gaussian fitting were 5.8% and 6.0%, those for the quadratic polynomial were 6.3% and 7.3%, and those for the cubic polynomial were 5.9% and 7.0%. The mean RMSE (for both districts) using the DL function was 0.029, that using the Gaussian function and cubic polynomial was 0.042, and that using the quadratic polynomial was 0.046. Function fitting for individual soybean fields in Uruguay (with a total area of 2554 ha) [8] showed that the RMSE for the DL model was 0.10, that for the polynomial function was 0.15, and that for the "crop growth model" was 0.07. However, in Berger et al. [8], the modeling period was longer (from December to May) than that in the present article. Usually, the accuracy of function-fitting models is lower at the beginning or end of the growing season. When we applied modeling from calendar weeks 19 to 45, the RMSE increased to 0.038 for the DL model, 0.052 for the Gaussian function, and 0.065 for the polynomials. In the study by Vorobyova and Chernov [9], the RMSEs of the function-fitting models (DL, Gaussian, polynomials, etc.) were in the range of 0.040-0.047 for early and late spring crops. Their model was built using data from early April to late August (153 days), an interval comparable to the duration of the period from calendar weeks 22 to 42 (147 days). Han et al. [38] used NDVI time series in calendar weeks 21-40 to assess soybean growth in Heilongjiang province adjacent to the JAR (with similar climatic conditions). In summary, the accuracy of the NDVI approximation for arable land of the JAR using the proposed functions is quite high.
Because the maximum NDVI is often used as a predictor in crop yield regression models, a methodology was developed for forecasting the maximum, and its accuracy in the period preceding the maximum of the NDVI curve was evaluated. The parameters of the DL and Gaussian functions for the five years preceding the forecast year were used to predict the maximum (which is observed in the JAR during calendar weeks 31-34). The accuracy of early Vmax prediction using the Gaussian function was significantly higher than that using the DL function. The APE using the Gaussian function decreased from 9.5% to 2.4% over weeks 25-30, and in weeks 31-32 the APE was 0.8%-1.9%. The forecast accuracy of the DL model was significantly lower: its APE was 17.1% in week 25 and 4.3% in week 30. The APE in weeks 31-32 for DL was 1.9%-3.0%.
Soybean yield at the municipal level was estimated using a previously developed regression model, in which the forecast Vmax was used as one of the independent predictors. From calendar week 25 to 30, the accuracy of the yield forecast was higher using the Gaussian function than using the other models. According to the 2013-2017 data, the mean forecast error for the Gaussian function was 14.9% in week 25 (RMSE was 0.21 t ha⁻¹) and 5.1%-12.9% in weeks 26-30 (RMSE was 0.06-0.15 t ha⁻¹). In each of these weeks, the APE of the yield estimate using the DL function was almost twice that using the Gaussian function. In weeks 31-32, the error was 5.0%-5.4% (RMSE was 0.07 t ha⁻¹) using Gaussian parameters and 7.4%-7.7% (RMSE was 0.09-0.11 t ha⁻¹) for the DL model.
Because early forecasting is desirable in agricultural practice, some researchers have used NDVI values before the maximum in forecasting models. Lopresti [26] predicted wheat yield at 257 and 273 days (maximum in October), with R² values of 0.13 and 0.16, respectively. Cao [39] showed that the R² of a soybean yield prediction model increases only immediately before the maximum, also showing the need to use the maximum (or near-maximum) NDVI values. In the study by Sakamoto [40], predicting yield 6 days before pod setting (corresponding to the maximum WDRVI) using random forest yielded an RMSE of 0.22 t ha⁻¹, which is comparable to our values obtained in earlier time periods. Shamni [41] also predicted soybean yield based on 50 and 70 days of the growing season (approximately 5 and 2 weeks before the peak of NDVI). The NRMSE exceeded 0.2, which, with a mean yield of 1.5 t ha⁻¹, gives an RMSE of about 0.3 t ha⁻¹. In Liao [42], the RMSE of the soybean yield forecast was 82.6 g m⁻². The RMSE obtained in the present study when forecasting four calendar weeks before the onset of the maximum for five districts of several regions of the Russian Far East did not exceed 0.32 t ha⁻¹. When forecasting 2 calendar weeks before the maximum, this figure did not exceed 0.17 t ha⁻¹. Future research in early forecasting using approximation functions should test this methodology on other regions and crops.
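The unit conversions behind these comparisons can be checked with a few lines. The sketch assumes that the NRMSE is normalized by the mean yield, as the text implies; the function names are hypothetical.

```python
def nrmse_to_rmse(nrmse, mean_yield):
    """RMSE implied by an NRMSE that is normalized by the mean yield."""
    return nrmse * mean_yield

def g_per_m2_to_t_per_ha(value):
    """1 g m^-2 = 10^4 g ha^-1 = 0.01 t ha^-1."""
    return value * 1e4 / 1e6

rmse_from_nrmse = nrmse_to_rmse(0.2, 1.5)   # about 0.3 t/ha
rmse_liao = g_per_m2_to_t_per_ha(82.6)      # about 0.826 t/ha
```

Expressed in the same units, 82.6 g m⁻² corresponds to roughly 0.83 t ha⁻¹, which makes the figure directly comparable with the values reported in t ha⁻¹.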
CRediT authorship contribution statement
Alexey Stepanov: Conceptualization, Methodology, Formal analysis, Validation, Writing - original draft. Konstantin Dubrovin: Data curation, Investigation, Software, Visualization, Writing - review & editing. Aleksei Sorokin: Funding acquisition, Software, Project administration, Resources.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This study used the results of processing satellite data obtained through the VEGA-Science web service,as well as the resources of the IKI-Monitoring Sharing Centers and the Data Center of the Far Eastern Branch of the Russian Academy of Sciences(Data Center of FEB RAS).
Appendix A.Supplementary data
Supplementary data for this article can be found online at https://doi.org/10.1016/j.cj.2021.12.013.