LI Qian-chuan, XU Shi-wei, ZHUANG Jia-yu, LIU Jia-jia, ZHOU Yi, ZHANG Ze-xi
1 Agricultural Information Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, P.R.China
2 Beijing Engineering Research Center for Agricultural Monitoring and Early Warning, Beijing 100081, P.R.China
3 Aerospace Information Research Institute, Chinese Academy of Sciences, Beijing 100094, P.R.China
4 The Department of Mathematics, Columbia University, NY 10027, USA
5 Key Laboratory of Agricultural Monitoring and Early Warning Technology, Ministry of Agriculture and Rural Affairs, Beijing 100081, P.R.China
Abstract The accurate prediction of soybean yield is of great significance for agricultural production, monitoring and early warning. Although previous studies have used machine learning algorithms to predict soybean yield based on meteorological data, it is not clear how different models can be used to effectively separate soybean meteorological yield from soybean yield in various regions. In addition, comprehensively integrating the advantages of various machine learning algorithms to improve the prediction accuracy through ensemble learning algorithms has not been studied in depth. This study used and analyzed various daily meteorological data and soybean yield data from 173 county-level administrative regions and meteorological stations in two principal soybean planting areas in China (Northeast China and the Huang–Huai region), covering 34 years. Three effective machine learning algorithms (K-nearest neighbor, random forest, and support vector regression) were adopted as the base-models to establish a high-precision and highly-reliable soybean meteorological yield prediction model based on the stacking ensemble learning framework. The model’s generalizability was further improved through 5-fold cross-validation, and the model was optimized by principal component analysis and hyperparameter optimization. The accuracy of the model was evaluated by using the five-year sliding prediction and four regression indicators of the 173 counties, which showed that the stacking model has higher accuracy and stronger robustness. The 5-year sliding estimations of soybean yield based on the stacking model in 173 counties showed that the prediction effect can reflect the spatiotemporal distribution of soybean yield in detail, and the mean absolute percentage error (MAPE) was less than 5%. The stacking prediction model of soybean meteorological yield provides a new approach for accurately predicting soybean yield.
Keywords: meteorological factors, ensemble learning, crop yield prediction, machine learning, county-level
Climate change has profound and severe impacts on global crop yields in several ways, driven primarily by changes in temperature, precipitation, and sunshine duration (Bongaarts 2019; Jägermeyr et al. 2021). Therefore, the prediction of crop yield based on meteorological factors is one of the most challenging topics in modern agriculture, and is of great significance to crop market prices, crop insurance, crop cultivation management, and food security early warning (Liakos et al. 2018; Sun et al. 2019). Soybean (Glycine max L.) is one of the world’s most important food and oil crops, and the most consumed oil crop in China. According to the Ministry of Agriculture and Rural Affairs of China, China produced 16.4 million tons of soybeans in 2021 and imported 96.52 million tons. China’s soybean consumption market plays an essential role in international trade. Climate risks and disasters occur every year in Northeast China and the Huang–Huai region, two major soybean planting regions in China (Wang L et al. 2018; Guo et al. 2021, 2022; Wang and Fan 2022). Therefore, the forecasting of soybean meteorological yield is of great significance to the field management and production of soybeans in China and even the world.
Soybean yield prediction is affected by the external climate environment (Yang et al. 2020; Xu et al. 2021). Temperature, precipitation, and sunshine duration are among the meteorological factors closely related to crop growth and yield (Kern et al. 2018; Kukal and Irmak 2018; Cai et al. 2019). Therefore, international scholars have done a great deal of research on the effects of climate change on crop yield and growth. For example, Srivastava R K et al. (2022) used the CMIP5 (Coupled Model Intercomparison Project Phase 5) climate prediction tool to assess the impact of climate change on crop yields and to adopt profitable cultivation management strategies in order to improve crop yields. Abdi-Dehkordi et al. (2018) designed an optimization model based on climate change, which can effectively affect and improve crop yields. Based on nearly 40 years of historical data, Yang et al. (2020) showed that the temperature, precipitation, and solar illumination factors of climate change have a strong impact on the yields of five major food crops in Ethiopia. Therefore, it is particularly crucial to quantitatively analyze the effects of temperature, precipitation, and sunshine duration on crop yield in order to use meteorological factors for crop yield monitoring, as well as early warning and field management.
Crop yield is driven by various complex characteristics, such as external factors and internal germplasm genes (Wang Z et al. 2018). To accurately study the effects of meteorological factors, crop yield can be macroscopically separated into trend yield and meteorological yield (Ji et al. 2021; Madhukar et al. 2021). Establishing a suitable method to separate the meteorological (climatic) yield and the trend yield from crop yield plays a vital role in the accuracy of crop yield predictions (Zhuang et al. 2018). No single trend yield fitting method can generically and effectively separate the meteorological yield from the yield in different regions (Grassini et al. 2013). Therefore, establishing a soybean meteorological yield decision-making system can effectively solve the problem of applying different trend yield fitting methods in different regions.
In recent years, the application of machine learning techniques to crop yield prediction has demonstrated the accuracy and robustness of the prediction results. For example, researchers used a convolutional neural network to predict winter wheat yield from county-level meteorological data and obtained good prediction evaluation indicators (Srivastava A K et al. 2022). By comparing support vector regression (SVR), random forest (RF), and K-nearest neighbor (KNN) for predicting cluster bean yield, Pangarkar et al. (2020) concluded that machine learning methods can achieve ideal results in predicting crop yield. However, due to the limitations of single machine learning methods, the prediction accuracy of such models is not high and their generalizability is not strong.
Ensemble learning can improve prediction accuracy and generalization by integrating different model types and exploiting the architectural differences between the models (Feng et al. 2020). Thus, it is helpful to introduce an ensemble learning method to solve the problem of crop yield prediction. As an ensemble learning method, stacking produces its predictions by training a set of base models and then a meta-model on their outputs (Li C et al. 2021). This method can significantly improve the accuracy and generalization of prediction results (Taghizadeh-Mehrjardi et al. 2020; Jia et al. 2021; Gu et al. 2022; Zhang and Zhu 2022). Therefore, applying the stacking method in the soybean yield prediction model can improve the prediction accuracy.
To the best of our knowledge, there is currently no research on predicting soybean yield from meteorological factor data using a stacking ensemble learning framework based on a soybean meteorological yield decision-making system. Therefore, the primary purposes of this study are four-fold. (1) The soybean yield was predicted from meteorological data, such as average temperature, precipitation, and sunshine duration, during the soybean growth period. (2) The meteorological yield decision-making system of soybean was established to effectively fit the trend yields of different regions and extract the meteorological yields. (3) The feasibility and accuracy of soybean yield prediction models from meteorological factors based on different machine learning algorithms (RF, SVR, KNN) were studied and compared. (4) The performance and effect of stacking ensemble prediction in improving the accuracy of soybean yield prediction were tested. To achieve these research objectives, it was necessary to first separate meteorological yield from soybean yield at the county level. Secondly, the data of soybean meteorological factors (average temperature, precipitation, and sunshine duration) were averaged over the different growth periods of soybean. Thirdly, principal component analysis was used to analyze the correlations of feature sets and reduce the dimensions. Then the feature set after dimensionality reduction was input into each machine learning method for modeling. Lastly, the stacking model integrated the prediction advantages and characteristics of the different machine learning algorithms to build an ensemble learning framework that could improve the accuracy and robustness of soybean yield prediction.
According to the records of meteorological factors affecting crop yield (Li et al. 2017; Zhuang et al. 2018; Li S et al. 2021) and the influences of individual meteorological factors in different regions of China on crop yield (Xuan et al. 2019; Liu et al. 2021; Rong et al. 2021; Tian et al. 2021), this study examined in detail the influences of meteorological factors on the yield of spring soybean in Northeast China and summer soybean in the Huang–Huai region, the two principal soybean production areas in China (Fig. 1). According to the 2020 data from the National Bureau of Statistics of China and the China Rural Statistical Yearbook (Department of Rural Socioeconomic Survey 2020), the soybean output of Heilongjiang, Jilin, and Liaoning provinces in Northeast China accounts for about 51.4% of the total output of the nation, while the soybean output of Anhui, Jiangsu, Henan and Shandong provinces in the Huang–Huai region accounts for about 15.0% of the total output of the country. Hence, the county-level research data selected in this study cover China’s two principal soybean output areas.
Fig. 1 Distribution of climatic zones in China, geographical information, and meteorological stations in China’s two most important soybean planting areas. DEM, digital elevation model.
Northeast China is a typical representative of the main planting areas of spring soybeans, located at 38–53°N. It has a temperate monsoon climate, and its characteristics as a dry farming area with high latitude and low heat provide comparatively suitable climatic conditions for the growth of spring soybeans (Fig. 1). Therefore, the output of soybeans in Northeast China accounts for more than half of the country’s total output. In a broad sense, the Huang–Huai region includes Anhui, Henan, Shandong, and Jiangsu provinces. It is located at 29–38°N and is a transitional area between a warm and humid climate and a cold and dry climate. From June to August, there is sufficient sunshine, appropriate temperatures, rain and heat in the same season, and moderate rainfall, which together meet the light, temperature, and water needs of soybeans (Fig. 1). Thus, as China’s second-largest soybean planting area, the Huang–Huai region is a typical representative of the summer soybean planting region. Therefore, based on the spatial and climatic features of the main soybean planting areas in China, this study selected as research objects the county-level soybean yield data for the 173 counties covered by the above two major areas from 1980 to 2013, and the meteorological data of 173 meteorological observation stations from the China Meteorological Administration.
Fig. 2 shows the framework of this research, which consists of four parts. (1) Data acquisition: The annual soybean yield data at the county level and the daily meteorological data related to soybean meteorological yield were collected from the China Meteorological Administration, the Ministry of Agriculture and Rural Affairs of China, and the National Bureau of Statistics of China. (2) Data processing: The meteorological data were quantitatively analyzed by soybean growth period, and the soybean meteorological yield decision-making system was used to control for other factors that may affect soybean yield and to screen out the meteorological yield affected by the meteorological elements. (3) Data analysis: Principal component analysis was used to optimize the data set, so as to reduce the dimensions of the feature set, and to study the meteorological factors with high correlations to soybean yield in different growth periods. (4) Stacking model establishment: The stacking ensemble learning algorithm uses the advantages and features of each base-model to improve the prediction accuracy and robustness of soybean yields.
Fig. 2 Technology roadmap of the soybean yield prediction framework. P, the prediction value; r, RF, random forest; s, SVR, support vector regression; k, KNN, K-nearest neighbor; m, the meta-model; T, the test set of each base model.
Yield data
Yield data sets including county-level soybean output and county-level soybean planting area were obtained from the database of the Ministry of Agriculture and Rural Affairs of China and the National Bureau of Statistics of China. From 1980 to 2013, a total of 173 county-level production and planting area datasets were used for the two most important soybean planting areas in China. A priori agricultural knowledge was used in processing the soybean yield data to ensure reasonable results that were not affected by registration errors. Soybean yield data were obtained by the following equation:
$$y_u = \frac{Y_t}{A_t}$$
where $y_u$ is the yield of soybean; $Y_t$ is the county-level soybean output; and $A_t$ is the planted area of soybean at the county level.
Meteorological data
Meteorological data sets were obtained from the China Meteorological Administration. This study used more than 2.29 million daily meteorological records from 173 meteorological observation stations in the two principal soybean planting areas from 1980 to 2013, including average temperature, precipitation, and sunshine duration. The distribution of the meteorological observation stations is shown in Fig. 1.
The meteorological data interpolation method used in this study was the inverse squared distance method (Bhowmik and Costa 2015; Berndt and Haberlandt 2018), which is a weighted moving average method. The meteorological influence factor Z of soybean yield required in this study was obtained at each grid point by interpolating the measured data of the nearby meteorological observation stations according to the inverse squared distance method. The equation is as follows:
$$Z = \frac{\sum_{i=1}^{m} Z_i / d_i^{2}}{\sum_{i=1}^{m} 1 / d_i^{2}}$$
where $Z$ is the meteorological data interpolated at the grid point; $Z_i$ is the measured meteorological data of the $i$th meteorological observation station near point $Z$; $d_i$ is the distance from point $Z$ to the $i$th meteorological observation station near it; and $m$ is the number of meteorological stations around point $Z$.
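The inverse squared distance interpolation can be illustrated with a minimal Python sketch. The function name `inverse_squared_distance`, the planar coordinates, and the station values are assumptions made for illustration only; the study does not state how distances between stations and grid points were computed.

```python
import math

def inverse_squared_distance(target, stations):
    """Interpolate a value at `target` (x, y) from (x, y, value) station tuples.

    Assumption: coordinates are planar, so straight-line distance is used.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for x, y, value in stations:
        d = math.hypot(x - target[0], y - target[1])
        if d == 0.0:
            return value  # the grid point coincides with a station
        w = 1.0 / d ** 2  # weight is the inverse of the squared distance
        weighted_sum += w * value
        weight_total += w
    return weighted_sum / weight_total

# Made-up stations: (x, y, average temperature)
stations = [(0.0, 0.0, 20.0), (2.0, 0.0, 24.0), (0.0, 2.0, 22.0)]
z = inverse_squared_distance((1.0, 0.0), stations)
print(round(z, 3))
```

The two equidistant stations receive equal weight, while the more distant third station contributes only one fifth as much, which is the behavior the weighting is meant to produce.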
The meteorological data were organized according to the stage-of-development descriptions for soybean introduced by Fehr and Caviness (1977). In this study, the soybean phenological period, geographical factors, and growth characteristics were comprehensively considered and quantitatively analyzed (Wang C et al. 2020). The growth period of soybean is classified into the following six stages: emergence stage (stage 1), seedling stage (stage 2), floral bud differentiation stage (stage 3), flowering and pod formation stage (stage 4), seed filling stage (stage 5), and maturation stage (stage 6). The specific information on these soybean growth stages is shown in Fig. 3. To reflect the crop growth mechanism and use as much of the meteorological variable data as possible, this study averaged the average temperature, precipitation, and sunshine duration over each growth stage, as shown in Fig. 3.
Fig. 3 Specific information diagram of the soybean growth stages in each of the seven provinces. The growth period of soybean is classified into the following six stages: emergence stage (stage 1), seedling stage (stage 2), floral bud differentiation stage (stage 3), flowering and pod formation stage (stage 4), seed filling stage (stage 5), and maturation stage (stage 6).
The soybean yield is affected by a wide variety of natural and human factors (Eulenstein et al. 2016). Soybean yield can be studied from two aspects: trend yield and meteorological yield. The trend yield is a relatively stable, long-term, and gradual trend, which is affected by many factors, such as agricultural production technology level, germplasm level, scientific and technological level, agricultural machinery input, human input, pesticides, and chemical fertilizer input. On the other hand, meteorological yield is fluctuating, sensitive, short-term, and is affected by meteorological factors (Shimoda et al. 2018; Yin and Leng 2020; Zymaroieva et al. 2020; dos Santos et al. 2022). This study modeled soybean yield according to the following equations:
$$Y_S(t) = Y_T(t) + Y_M(t) + \varepsilon$$
$$Y_M = b(M_{z,g})$$
where $Y_S$ is the yield of soybean per unit area; $Y_T$ is the trend yield of soybean; $Y_M$ is the meteorological yield of soybean; $\varepsilon$ is Gaussian white noise; $t$ is the specific observation year; $M_{z,g}$ is the meteorological index during soybean growth stages; $z$ is the meteorological factor (average temperature, precipitation, or sunshine duration); $g$ is the specific growth stage of soybean; and $b(M_{z,g})$ is the function relating meteorological factors to meteorological yield during the soybean growth stages.
This study used four regression models (the moving average method, the exponential smoothing method, the high-pass filtering method, and the logistic regression method) to establish the soybean yield trend, as shown in Fig. 4. The moving average method is a trend extrapolation technique used to fit the curve of a data sequence with an obvious trend. It calculates the moving average value by adding new data and dropping old data period by period, so as to eliminate accidental change factors and determine the development trend (Grassini et al. 2013; Iizumi et al. 2014; Nguyen-Huy et al. 2018). Considering the features of the data sets, this study used the 3-year interval moving average method to establish the trend yield model. The exponential smoothing method is a time series analysis and prediction method based on the moving average method (Liu and Wu 2020). The exponential smoothing model is suitable for non-stationary series with linear trends and periodic fluctuations, and it allows the model parameters to constantly adapt to the changes of non-stationary series (Ventura et al. 2019; Trull et al. 2020; Brito et al. 2021). Soybean yield is affected by long-term trend factors and short-term fluctuation elements, so the exponential smoothing model is suitable for establishing the soybean trend yield. The high-pass (HP) filter model is a decomposition method of time series in state space, which assumes that the time series is composed of two parts: long-term trend and short-term fluctuation. It can separate the high-frequency components in a specific period (Hodrick and Prescott 1997). Hence, the HP filtering method can be used to simulate the trend yields of crops. The logistic regression model is also called the self-inhibition equation, and it can reasonably fit the long-term growth trend of a time series. Therefore, this method has been widely used to establish models of crop trend yield (Zhou et al. 2020; Rong et al. 2021).
Fig. 4 Diagrams of the four trend models: moving average model, logistic model, exponential smoothing model, and high-pass filtering model.
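As a rough illustration of how a trend model yields a meteorological yield series, the following Python sketch fits a 3-year moving average and a simple exponential smoothing trend and subtracts the trend from the observed yield. The centered window, the handling of the series edges, the smoothing factor `alpha=0.3`, and the yield values are all assumptions for illustration; the study specifies only that a 3-year interval moving average was used.

```python
def moving_average_trend(yields, window=3):
    """Centered moving average; at the edges only the available years are averaged."""
    half = window // 2
    trend = []
    for i in range(len(yields)):
        lo = max(0, i - half)
        hi = min(len(yields), i + half + 1)
        trend.append(sum(yields[lo:hi]) / (hi - lo))
    return trend

def exponential_smoothing_trend(yields, alpha=0.3):
    """Simple exponential smoothing; alpha is a hypothetical smoothing factor."""
    trend = [yields[0]]
    for y in yields[1:]:
        trend.append(alpha * y + (1 - alpha) * trend[-1])
    return trend

def meteorological_yield(yields, trend):
    """Meteorological yield = observed yield minus trend yield, year by year."""
    return [y - t for y, t in zip(yields, trend)]

# Made-up county yields (kg/ha) for six consecutive years
yields = [1500.0, 1560.0, 1480.0, 1620.0, 1700.0, 1590.0]
ma = moving_average_trend(yields)
es = exponential_smoothing_trend(yields)
met = meteorological_yield(yields, ma)
print([round(v, 1) for v in met])
```

Positive values in `met` mark years in which the observed yield exceeded the fitted trend, and negative values mark years in which it fell short.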
The trend yield is relatively stable, while the climate yield fluctuates and is affected by meteorological factors. In order to accurately screen the key meteorological factors and improve the yield prediction, it is necessary to separate the climate yield from the soybean yield (Zhang and Lu 2020). In view of the sensitivity of soybean meteorological yield to meteorological elements (Zhuang et al. 2018), it is crucial to select the most suitable of the above four models to describe the trend yield data. In two steps, the soybean meteorological yield decision-making system selects each county’s optimal trend yield model among the four choices, and then separates the appropriate meteorological yield data from the soybean yield. The specific steps are shown in the decision tree in Fig. 5.
Fig. 5 Decision tree of the soybean meteorological yield decision-making system.
Firstly, the t-test (Napier-Munn et al. 1999) was used to test each county’s four soybean trend yield models (moving average method, exponential smoothing method, HP filtering method and logistic regression method). Specifically, the adjusted $R^2$ was used to select the appropriate county-level soybean trend yield model. The adjusted determination coefficient can be used to evaluate the goodness of fit of regression models (Liu and Sun 2019; Yu et al. 2021), and the specific equation is:
$$R_{adj}^{2} = 1 - \frac{(1 - R^{2})(n - 1)}{n - p - 1}$$
where $n$ is the number of statistical data samples; $p$ is the number of variables; and $R^2$ is the determination coefficient. The first two models with the smallest adjusted coefficient of determination are selected, and then the model with the worst performance is eliminated. The value of $R_{adj}^{2}$ indicates whether the selected model can reasonably show the trend of grain yield growth with the continuous improvement in the productivity level, and ensures the applicability of the model method.
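The adjusted coefficient of determination is a one-line computation, sketched below in Python. The sample size of 34 matches the 34 years of data used in this study, while the $R^2$ value of 0.90 and the single explanatory variable are made-up inputs for illustration.

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R^2 for n samples and p explanatory variables."""
    if n - p - 1 <= 0:
        raise ValueError("need more samples than variables plus one")
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Hypothetical trend fit: R^2 = 0.90 over 34 yearly observations, one variable
r2_adj = adjusted_r_squared(0.90, n=34, p=1)
print(round(r2_adj, 6))
```

Because the correction factor grows with the number of variables, a model with more parameters needs a correspondingly higher raw $R^2$ to reach the same adjusted value.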
Secondly, considering that the climatic characteristics of a given meteorological area are similar, and the occurrence of large-scale agrometeorological disasters is often related to large weather processes, the meteorological yield of soybeans in the same meteorological area or adjacent areas should have basically similar features (Huang et al. 2020; Li X et al. 2021; Milfont et al. 2021). By comparing the relative meteorological yield captured by each model with the county annals’ records of typical bumper harvest and lean harvest years for county-level soybean yields, the decision-making system can determine which trend model best captures the changes in soybean yield caused by meteorological factors in the most typical years. The primary method is based on the consistency between the regional average value and standard deviation sequence of relative meteorological yield obtained by the trend yield model and the typical years of regional crops (Maestrini and Basso 2018; Zhuang et al. 2018; Saxena and Mathur 2019). Model A, with the highest coincidence percentage, is the best trend yield model in the county. Finally, the trend yield model selected by the above method was used to obtain the annual trend yield value, and then the trend yield was subtracted from the observed yield of each year to obtain the meteorological yield of the corresponding year.
Stacking
Stacked generalization, referred to as stacking, is an integration strategy that combines multiple base models through meta models (Wolpert 1992), and its essence is a multi-layer learning system with a serial structure. Unlike the traditional ensemble frameworks of bootstrap aggregating (bagging) and boosting, the stacking framework combines different base learners for model fusion. In the primary stage of the stacking algorithm, the cross-validation method is used to convert the original features into secondary features, and then the transformed secondary features are trained and fitted by meta learners in the usual way (Fei et al. 2021; Li C et al. 2021).
The training process follows three steps. (1) The stacking ensemble learning method is used to call different types of learners to train and learn the data set. (2) A new training sample is formed from the training results of each base learner and input into the meta learner. (3) The output value of the meta learner in the secondary layer model is the final output result (Wang and Wang 2022).
Support vector regression (SVR)
SVR maps data from a low-dimensional space into a high-dimensional space in which an estimation function is formed, realizing a balance between the accuracy and computational complexity of the regression model (Cortes and Vapnik 1995; Liu et al. 2022). Its advantages are that it is ideal for solving small-sample classification and regression problems, it is especially suitable for data analysis and mining tasks such as time series, and it has strong generalization performance (Corrales et al. 2022; Mokhtar et al. 2022). The SVR attempts to find an optimal decision boundary. The training sample points that are closest to the hyperplane and meet specific conditions are called support vectors.
K-nearest neighbor (KNN)
KNN is widely used in machine learning tasks such as classification, regression, and pattern recognition. A KNN regression algorithm obtains the predicted value by calculating the average value of the nearest data points. Its advantages are high prediction accuracy, insensitivity to outliers, and no hypothetical restrictions on data input (Cover and Hart 1967; Selvaraj et al. 2020). This study used the Euclidean distance (Hu et al. 2016) to calculate the distances between data points. The Euclidean distance equation can be expressed as:
$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^{2}}$$
where $d$ is the Euclidean distance; $p$ and $q$ are data points composed of various dimensions; and $n$ is the number of dimensions of each data point.
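A minimal Python sketch of KNN regression with the Euclidean distance follows. The function names, the choice of `k=3`, and the toy data are assumptions for illustration, not the configuration used in this study.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_predict(train_x, train_y, query, k=3):
    """Predict the mean target value of the k training points nearest to `query`."""
    ranked = sorted(zip(train_x, train_y),
                    key=lambda pair: euclidean(pair[0], query))
    nearest = ranked[:k]
    return sum(y for _, y in nearest) / len(nearest)

# Made-up two-dimensional features and targets
train_x = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
train_y = [10.0, 12.0, 14.0, 40.0]
pred = knn_predict(train_x, train_y, (0.2, 0.2), k=3)
print(pred)
```

The distant point at (5, 5) is excluded from the three nearest neighbors, so it has no influence on the prediction for the query point.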
Random forest (RF)
RF is a supervised machine learning algorithm. When a node is split in any of its numerous regression decision trees, the optimal feature is selected from a subspace of the total feature set to branch and make decisions (Breiman 2001). This method’s advantage is that it ensures the independence and diversity of each decision tree, while avoiding overfitting to a certain degree (Rajković et al. 2021; Pang et al. 2022).
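The least absolute shrinkage and selection operator regression (LASSO)
LASSO was first proposed by Tibshirani (1996). The advantage of this model is its outstanding performance in regularization and feature selection (Shafiee et al. 2021). LASSO adds a model complexity function to the objective function $J(W)$ of the linear regression model, constructing a new penalized objective function, and takes the parameter values at which the new objective function reaches its extremum as the parameter estimates. Adding such a term is called regularization in machine learning, and the two forms of L2 regularization and L1 regularization are commonly used. LASSO applies L1 regularization (Das et al. 2020). The linear regression model can be expressed as:
$$\hat{y} = W^{\mathsf{T}} x$$
where its objective function is:
$$J(W) = \sum_{i=1}^{n} \left(y_i - W^{\mathsf{T}} x_i\right)^{2} \tag{10}$$
Adding L1 regularization to eq. (10) turns it into a LASSO regression model, and its objective function is:
$$J_{\mathrm{LASSO}}(W) = \sum_{i=1}^{n} \left(y_i - W^{\mathsf{T}} x_i\right)^{2} + \lambda \lVert W \rVert_{1}$$
where $x_i$ and $y_i$ are the feature vector and observed value of the $i$th sample; $n$ is the number of samples; $W$ is the vector of regression coefficients; and $\lambda \geq 0$ is the regularization strength.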
Ridge regression (RR)
RR is essentially a biased estimation regression method. It allows more reliable regression coefficients to be obtained at the cost of losing some information and reducing accuracy (Yuan 2020). Its advantage is that it gives up unbiasedness in exchange for higher numerical stability, and thereby obtains higher calculation accuracy. In machine learning, ridge regression applies L2 regularization. Its modeling speed is fast, since it does not have a complex calculation process.
The stacking ensemble learning method can integrate the prediction results of multiple learners. The specific algorithm steps of the stacking ensemble prediction model for the soybean yield framework proposed in this study are shown in Fig. 6.
Fig. 6 Framework diagram of the stacking ensemble learning model. P, the prediction value; r, RF, random forest; s, SVR, support vector regression; k, KNN, K-nearest neighbor; m, the meta-model; T, the test set of each base model; RR, ridge regression.
First, obtain the initial data sets $X_{train}$ and $Y_{train}$. The number of samples in the training set is $N_1$, and $X_{train}$ and $Y_{train}$ are divided into five folds (Fold1–Fold5). Second, learn and generate new data sets. In the first layer, three base learners are selected, and five-fold cross-validation is used to train the first layer models. After the development of the three models, the samples of the second layer training set $X_{train2}$ are generated. The three specific steps are: (1) Train the RF model with adjusted parameters on the four folds other than Fold1, and then predict the results of Fold1. Repeating this process for each of the five folds yields $N_1$ predicted values in the training set. (2) Repeat step (1) with SVR and KNN, and then the predicted values of SVR and KNN on the training set will be obtained, respectively. (3) The training set now has three groups of predicted values, which form a new training set $X_{train2}$. The dimension of $X_{train2}$ is $N_1 \times 3$. Third, use the trained RF, SVR and KNN models to forecast three groups of predicted values on the test set $X_{test}$, which has $N_2$ samples. In this way, a new feature set $X_{test2}$ is formed, whose dimension is $N_2 \times 3$. Finally, train and cross-validate the secondary layer ridge regression model on $X_{train2}$ and $Y_{train}$, and input the test set $X_{test2}$ into the trained ridge regression model to obtain the final prediction value.
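The construction of the second-layer training set from out-of-fold predictions can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the implementation used in this study: the three toy learners (`fit_mean`, `fit_nearest`, `fit_line`) are hypothetical stand-ins for the tuned RF, SVR and KNN models, the contiguous fold split is an assumption, and the ridge regression meta-model that would be trained on the resulting matrix is omitted.

```python
def make_folds(n_samples, n_folds=5):
    """Split indices 0..n_samples-1 into contiguous folds of near-equal size."""
    folds, start = [], 0
    for f in range(n_folds):
        size = n_samples // n_folds + (1 if f < n_samples % n_folds else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def out_of_fold_features(learners, x, y, n_folds=5):
    """Return an N x len(learners) matrix of out-of-fold predictions."""
    n = len(x)
    features = [[None] * len(learners) for _ in range(n)]
    for fold in make_folds(n, n_folds):
        held_out = set(fold)
        train_idx = [i for i in range(n) if i not in held_out]
        train_x = [x[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        for j, fit in enumerate(learners):
            predict = fit(train_x, train_y)  # train on the other four folds
            for i in fold:                   # predict only the held-out fold
                features[i][j] = predict(x[i])
    return features

# Hypothetical toy learners: each takes training data and returns a predictor.
def fit_mean(train_x, train_y):
    mean = sum(train_y) / len(train_y)
    return lambda row: mean

def fit_nearest(train_x, train_y):
    def predict(row):
        dists = [sum((a - b) ** 2 for a, b in zip(row, t)) for t in train_x]
        return train_y[dists.index(min(dists))]
    return predict

def fit_line(train_x, train_y):
    xs = [row[0] for row in train_x]
    mx, my = sum(xs) / len(xs), sum(train_y) / len(train_y)
    slope = (sum((a - mx) * (b - my) for a, b in zip(xs, train_y))
             / sum((a - mx) ** 2 for a in xs))
    return lambda row: my + slope * (row[0] - mx)

x = [(float(i),) for i in range(10)]   # made-up single-feature samples
y = [2.0 * i for i in range(10)]
x_train2 = out_of_fold_features([fit_mean, fit_nearest, fit_line], x, y)
print(len(x_train2), len(x_train2[0]))
```

Each row of `x_train2` holds predictions made by models that never saw that sample during training, which is what keeps the meta-model from simply learning to trust an overfitted base model.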
This study selected 23 variables as the feature set. It includes the three meteorological indicators of average temperature, precipitation and sunshine duration at each growth stage of soybeans. Since soybeans have six growth stages, there are 18 meteorological features. In addition, this study used the climate-affected meteorological yields of the previous five years as five further features. For example, if the soybean yield in 2023 is to be predicted, the meteorological yields of the previous five years are taken from 2018 to 2022.
The above feature sets may have the problem of information redundancy or poor generalizability. In addition, in the process of modeling, too many variables will lead to over-fitting and complex models. Moreover, adding variables with weak correlations to the model will significantly reduce the influences of the variables with strong correlations, which will undermine the prediction effect of the model and weaken its interpretability. Therefore, this study used principal component analysis (PCA) to evaluate the variables. PCA can transform a group of correlated variables into a group of uncorrelated variables while retaining the most important information of the original feature set, and so it can realize the effective dimensionality reduction of the feature set by retaining the low-order principal components (Singh et al. 2016; Wang Y et al. 2020; Yu et al. 2020). PCA uses the cumulative variance contribution rate to select $k$ principal components from the overall $n$ components to achieve the final data dimensionality reduction. The contribution rate of the $k$th principal component can be solved and accumulated to obtain the cumulative variance contribution rate using the following equation:
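$$W_k = \frac{\lambda_k}{\sum_{i=1}^{n} \lambda_i}, \qquad \sum_{j=1}^{k} W_j = \frac{\sum_{j=1}^{k} \lambda_j}{\sum_{i=1}^{n} \lambda_i}$$
where $\lambda_i$ is the $i$th feature value (eigenvalue) and $W_k$ is the contribution rate of the $k$th principal component.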
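Given the eigenvalues of the feature covariance matrix, choosing the number of retained components from the cumulative variance contribution rate can be sketched as below. The threshold of 0.85 and the eigenvalues are assumptions for illustration; in this study the cumulative variance contribution rate was itself tuned.

```python
def cumulative_contribution(eigenvalues):
    """Cumulative variance contribution rates, largest eigenvalue first."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    running, rates = 0.0, []
    for value in ordered:
        running += value
        rates.append(running / total)
    return rates

def components_needed(eigenvalues, threshold=0.85):
    """Smallest k whose cumulative contribution rate reaches the threshold."""
    for k, rate in enumerate(cumulative_contribution(eigenvalues), start=1):
        if rate >= threshold:
            return k
    return len(eigenvalues)

eigenvalues = [4.0, 3.0, 2.0, 1.0]   # made-up eigenvalues
rates = cumulative_contribution(eigenvalues)
k = components_needed(eigenvalues)
print(rates, k)
```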
At the same time, in order to avoid the influence of differing units of measurement (dimensions) on the prediction results, the Z-score standardization method was used to standardize the feature set, and the equation is:
$$Z = \frac{x - \mu}{\delta}$$
where $Z$ is the Z-score value; $x$ is the individual observation value; $\delta$ is the standard deviation of all samples; and $\mu$ is the average of all samples.
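A short Python sketch of the Z-score standardization of one feature follows. It assumes the population standard deviation (division by the number of samples), which is one reading of "all samples"; the input values are made up.

```python
import math

def z_scores(values):
    """Standardize values to zero mean and unit (population) standard deviation."""
    mu = sum(values) / len(values)
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / len(values))
    return [(v - mu) / sigma for v in values]

feature = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]   # made-up feature column
standardized = z_scores(feature)
print(standardized)
```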
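The accuracy of the prediction method was evaluated by four indicators: coefficient of determination ($R^2$), root mean square error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE). Among them, the coefficient of determination evaluates the goodness of fit of the prediction model. The closer it is to 1, the better the fitted regression equation. The dispersion degree between the predicted value and the measured value can be evaluated by RMSE, and the closer it is to 0, the more accurate the model prediction. MAE can better reflect the actual situation of the estimated value error. The closer it is to 0, the more precise the model. MAPE is the average of the relative errors between the predicted values and the measured values, and it more directly reflects the difference between the predicted result and the actual value. The closer it is to 0%, the more accurate the model. The four equations of the evaluating indicators are:
$$R^{2} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^{2}}{\sum_{i=1}^{n} (y_i - \bar{y})^{2}}$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^{2}}$$
$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$$
$$\mathrm{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|$$
where $y_i$ is the measured value; $\hat{y}_i$ is the predicted value; $\bar{y}$ is the mean of the measured values; and $n$ is the number of samples.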
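The four indicators can be computed together, as in the following Python sketch. The function name `regression_metrics` and the observed and predicted values are made up for illustration.

```python
import math

def regression_metrics(observed, predicted):
    """Return R2, RMSE, MAE and MAPE (in percent) for paired observations."""
    n = len(observed)
    mean_obs = sum(observed) / n
    errors = [o - p for o, p in zip(observed, predicted)]
    ss_res = sum(e ** 2 for e in errors)                  # residual sum of squares
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)   # total sum of squares
    return {
        "R2": 1 - ss_res / ss_tot,
        "RMSE": math.sqrt(ss_res / n),
        "MAE": sum(abs(e) for e in errors) / n,
        "MAPE": 100 * sum(abs(e / o) for e, o in zip(errors, observed)) / n,
    }

observed = [100.0, 200.0, 300.0, 400.0]
predicted = [110.0, 190.0, 310.0, 390.0]
metrics = regression_metrics(observed, predicted)
print(metrics)
```

Note that MAPE is undefined when a measured value is zero, so this sketch assumes strictly non-zero observations, as is the case for county-level yields.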
By applying the trend yield model establishment method, the meteorological yield separation method and the soybean meteorological yield decision-making system, the optimal trend yield model was obtained for each county in the two main soybean planting regions in China. Fig. 7 shows the results for the 173 counties in the Huang–Huai region and Northeast China.
Fig. 7 Diagram of preferred trend yield models for each of the 173 counties in the Huang–Huai region and Northeast China.
The data in Fig. 7 show that different soybean planting counties have different trend yield fitting methods due to the differences in their geographical environments, meteorological changes, and the occurrence of bumper harvest and lean harvest years. In terms of overall quantity, most counties apply the moving average method to fit the trend yield, indicating that its generalizability is the strongest. In second place is the HP filtering method. The counties using the exponential smoothing method are the fewest, and they are concentrated in the central-eastern Huang–Huai region and some regions of Heilongjiang Province. The logistic regression method is not applied in any of the counties, which shows that this method does not suit the sample data in this study. Through the meteorological yield decision-making system, the optimal trend yield model in each county was selected from the above four trend yield models. Then, the meteorological yield of soybean can be effectively separated from the soybean yield of each county by the meteorological yield separation method.
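The separation step can be illustrated with the most frequently selected trend model, the moving average. The sketch below assumes an additive decomposition (meteorological yield = actual yield minus trend yield), a centered window that shrinks at the series edges, and a window of 5 years; none of these details is stated in this section, and the yield series is synthetic.

```python
import numpy as np

def moving_average_trend(yields, window=5):
    """Centered moving-average trend; the window shrinks at the series edges."""
    y = np.asarray(yields, dtype=float)
    half = window // 2
    trend = np.empty_like(y)
    for i in range(len(y)):
        lo, hi = max(0, i - half), min(len(y), i + half + 1)
        trend[i] = y[lo:hi].mean()
    return trend

def meteorological_yield(yields, window=5):
    """Assumed additive separation: actual yield minus trend yield."""
    y = np.asarray(yields, dtype=float)
    return y - moving_average_trend(y, window)

# Synthetic county series (kg per ha): steady growth with one lean year.
years = np.arange(10)
yields = 1500.0 + 20.0 * years
yields[5] -= 200.0  # a hypothetical weather-driven loss in year 5

weather = meteorological_yield(yields)
print(np.round(weather, 1))
```

In this toy series the lean year stands out as a large negative residual while the smooth growth is absorbed by the trend, which is the behavior the decision-making system relies on when it compares trend models.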
Using the best hyperparameter settings obtained in the process of hyperparameter optimization and the best cumulative variance contribution rate obtained in the process of principal component analysis, the stacking model and the four single models were trained and tested (Table 1). In order to test the prediction effect of each model to the greatest extent, the method of 5-year sliding prediction was adopted; that is, when using the data from 1980 to 2013, the first prediction used the data from 1980 to 2008 as the training set and 2009 as the test set. Similarly, the second prediction used the data from 1980 to 2009 as the training set and 2010 as the test set. By analogy, the predictions continue until 2013 serves as the test set. Compared with the method of dividing the training set and the test set once, this method can improve the efficiency of data utilization and better test the prediction effect of the model. Table 1 shows the specific evaluation indicators of the stacking model and the four single models (KNN, LASSO, RF, SVR) for the 5-year sliding predictions of 173 administrative counties.
Table 1 Prediction results (MAPE, RMSE, MAE, and R²) for the predictions of soybean yield in each year by the four single machine learning models and the stacking model 1)
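The 5-year sliding prediction scheme amounts to an expanding training window with a single held-out test year. A minimal sketch of how the splits could be generated is shown below; the function name and its parameters are hypothetical, while the years are the ones given in the text.

```python
def sliding_splits(first_year=1980, first_test_year=2009, last_test_year=2013):
    """Yield (training years, test year) pairs for the sliding prediction."""
    for test_year in range(first_test_year, last_test_year + 1):
        # All years before the test year form the training set.
        train_years = list(range(first_year, test_year))
        yield train_years, test_year

splits = list(sliding_splits())
for train_years, test_year in splits:
    print(f"train {train_years[0]}-{train_years[-1]} -> test {test_year}")
```

Each later split reuses all earlier data plus one more year, so every model is evaluated five times on data it has never seen.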
The data in Table 1 show that the LASSO model is much worse than the other single models with respect to MAPE, RMSE, MAE, and R². Therefore, the LASSO linear regression model is not suitable for solving the prediction problem of the nonlinear relationship between meteorology and yield, and the LASSO model was excluded from the stacking base-models when building the stacking ensemble framework. The SVR model is inferior to the stacking ensemble learning model as indicated by the four indicators from 2009 to 2013. However, it performs better than the other single models in the four indicators, because it can better capture the nonlinear relationship between meteorological elements and yield and has ideal performance in dealing with small sample regression problems. The KNN model performs slightly better than the RF model in the MAPE, RMSE, MAE and R² indicators because of its high prediction accuracy and insensitivity to outliers. The stacking model is better than KNN, LASSO, RF and SVR in the MAPE, RMSE, MAE and R² indicators, showing that the stacking model has strong generalizability and high prediction accuracy in dealing with soybean yield prediction based on meteorological factors.
Based on the SVR, RF, KNN and stacking models, the soybean yields in 173 county-level administrative regions were estimated for the 5 years. In order to comprehensively test the prediction accuracy of the four models for soybean yield in each county, this study used the 5-year sliding prediction values of the four models for each county to calculate the MAPE, and the results are shown in Fig. 8. The overall performance of the stacking model is significantly better than the SVR, RF, and KNN models in the MAPE indicators for the 173 counties, showing that the stacking model has strong generalizability in estimating soybean yield and higher prediction accuracy than any of the single models. For counties in the Huang–Huai region, the MAPE values of the stacking model are mostly below 4%. Among the single models, the KNN and SVR models perform well, and the MAPE values are mainly distributed below 6%, which shows that the SVR model is good at dealing with nonlinear, small sample and high-dimension regression problems, and the KNN model has good prediction accuracy. In Northeast China, the estimation abilities of the four models are generally worse than in the Huang–Huai region. The reason is that there are many state-owned farms and agricultural cultivation companies in Northeast China, so the levels of agricultural mechanization and modern field management are high. Soybean planting is strongly affected by human factors, so the impact of weather factors on soybean yield is weakened. Moreover, Northeast China is the core area of the national soybean yield capacity improvement project, and the policy, technology and other factors associated with that project interfere with the ability of the soybean trend yield model to effectively separate meteorological yield.
Fig. 8 Accuracy evaluation of mean absolute percentage error (MAPE) for the 5-year sliding estimations of soybean yield in 173 county-level administrative regions.
In order to verify the performance of the model, the soybean meteorological yield prediction model based on the stacking ensemble learning framework was compared with each of the single learner models. Through the five-year sliding estimations of 173 counties from 2009 to 2013, the evaluation indicators of the soybean yield estimations by the RF, KNN, SVR and stacking models are shown in Table 2.
Table 2 Comparison between the single models and the stacking model in the 5-year sliding predictions 1)
Stacking performed the best with respect to R². The determination coefficient of stacking is 0.9272, which is 0.0286, 0.1981 and 0.0267 higher than those of SVR, RF and KNN, respectively, for an average increase of 0.0845. The MAE of stacking is 117.89, which is 5.06, 76.31 and 18.99 lower than those of SVR, RF and KNN, respectively, for an average reduction of 33.45. The RMSE of stacking is 155.59, which is 28.09, 144.60 and 26.37 lower than those of SVR, RF and KNN, respectively, for an average reduction of 66.35. The MAPE of stacking is 4.90%, which is 0.34, 3.28 and 0.95 percentage points lower than those of SVR, RF and KNN, respectively, for an average reduction of 1.52 percentage points. In summary, the fitting effect of the stacking ensemble learning algorithm is significantly better than that of any of the single models, with higher prediction accuracy and strong generalizability. Compared with SVR, RF and KNN, the stacking model has significantly improved MAPE, RMSE, MAE, and R² values, and it has the best effect on predicting soybean yield. The results of the comparison show that the stacking model can effectively make use of the characteristics and advantages of its base-models and effectively improve the prediction accuracy, so it is the best model for predicting the meteorological yield of soybean.
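The averages quoted above follow directly from the per-model differences. As a quick arithmetic check, the snippet below recomputes them from the numbers in the text and also derives the MAPE of each single model that those differences imply; the variable names are hypothetical.

```python
# Differences between stacking and (SVR, RF, KNN), as quoted in the text.
improvements = {
    "R2": [0.0286, 0.1981, 0.0267],    # increase in R2
    "MAE": [5.06, 76.31, 18.99],       # reduction in MAE
    "RMSE": [28.09, 144.60, 26.37],    # reduction in RMSE
    "MAPE": [0.34, 3.28, 0.95],        # reduction in percentage points
}

averages = {name: sum(vals) / len(vals) for name, vals in improvements.items()}

# MAPE of each single model implied by the stacking MAPE of 4.90%.
stacking_mape = 4.90
implied_mape = {
    model: stacking_mape + diff
    for model, diff in zip(["SVR", "RF", "KNN"], improvements["MAPE"])
}

for name, value in averages.items():
    print(f"{name}: average improvement {value:.4f}")
print({model: round(value, 2) for model, value in implied_mape.items()})
```

The largest implied MAPE belongs to RF and stays under the 8.2% bound reported in the conclusions.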
The predicted values from the soybean meteorological yield prediction model based on the stacking ensemble learning framework for the soybean yield per county in China’s two principal soybean planting areas from 2009 to 2013 were compared to the reported soybean yield statistical data for each county. The estimated results are basically consistent with the county-level database data from the Ministry of Agriculture and Rural Affairs in China (Fig. 9). The estimated yields of soybeans from 2009 to 2013 do not differ greatly. The yields of soybeans in northern Henan, central Anhui, and most parts of three provinces (Shandong, Jiangsu and Jilin) are more than 2 400 kg ha⁻¹. The yields of central Anhui, southern Jiangsu and northern Jilin can even reach 3 100 kg ha⁻¹. Since 2009, most areas in the seven provinces have shown increasing trends of soybean yield year by year, which is in line with their objectives such as the improvement of scientific and technological investment and the optimization of germplasm resources. The yield of soybeans in the Huang–Huai region is slightly higher than that in Northeast China because the average temperature in the Huang–Huai region is more suitable than that in Northeast China, and the precipitation is sufficient. Soybean is a thermophilic crop, and the climatic conditions in the Huang–Huai region are more conducive to its growth. Moreover, the soybeans grown in the Huang–Huai region are mainly high-yield and high-quality varieties of summer soybean, so the yield is high.
Fig. 9 Diagram of five-year spatiotemporal variations of soybean yields predicted by the stacking model in the 173 counties.
This study aims to separate the meteorological yield data through the crop meteorological yield decision-making system, and to establish the soybean stacking ensemble learning framework estimation model based on the relevant data for meteorological factors. The results of the crop meteorological yield decision-making system designed in this study (Fig. 7) are consistent with those of other studies in recent years; that is, optimized trend models can effectively separate the meteorological yield from the actual yield (Grassini et al. 2013; Zhuang et al. 2018). In addition, this study found that the moving average, exponential smoothing, and HP filtering methods can effectively fit the trend yields of soybean in the Huang–Huai region and Northeast China, because these methods can separate the high-frequency fluctuation features under the specific periodic trend. Combining the results in Tables 1 and 2, this study proves that the decision-making system can effectively separate the meteorological yield and plays a vital role in the accuracy of each soybean yield prediction model, because the meteorological factors significantly impact the meteorological yield of crops.
Based on the data in Table 2 and Fig. 8, this study demonstrates for the first time that the stacking algorithm can significantly improve the accuracy of soybean yield estimations based on meteorological factors by integrating the advantages and characteristics of each base-model. The stacking model was independently applied to 173 county-level administrative regions and achieved good estimation results, which proves that the stacking model has the characteristics of high generalization and high prediction accuracy when estimating soybean yield on a large scale. These findings are consistent with recent studies (Gao et al. 2022; Mota et al. 2022; Wang and Wang 2022), in that the stacking model has higher prediction accuracy and robustness than any of the single machine learning algorithms when solving the prediction problem of multi-feature and complex nonlinear relationships.
The performance of the base-models affects the final effect of the stacking model. Therefore, when selecting a base-model, it is necessary to fully consider the sufficiency and diversity of learners, that is, each base-model has good learning ability and the base-models are independent of one another, in order to realize the effective complementarity between the models (Wu et al. 2021). Considering the prediction abilities of the base learners, this study selected the models with strong learning ability and large architectural differences as the base learners in the first layer of the stacking ensemble learning framework, which helped to improve the overall prediction effect of the model (Zhang et al. 2022). The data in Table 1 show that the LASSO model is not suitable as a base-model of the stacking model. Furthermore, the actual yields and predicted yields of SVR, RF, KNN and the stacking model based on the first three single models are shown in Fig. 10 as a heatmap of 2D bin counts to further explore the potential base-models of the stacking algorithm. The data values are covered by an array of quadrilateral bins, and the color of each quadrilateral is determined by the number of data points it covers. The darker the quadrilateral, the lower the density of data points, and the brighter the quadrilateral, the higher the density. The deviation of a data point from the 1:1 line shows the residual distribution. The chart shows that the correlation coefficients of SVR, KNN and RF are 0.9499, 0.9509 and 0.8785, respectively, indicating that the predicted yields of these three models are highly correlated with the actual yields, and most of the predicted values are closely distributed around the 1:1 line. After using the first three (SVR, KNN, and RF) as the base-models, the stacking model shows improvements in correlation coefficient, determination coefficient and MAPE, which means its prediction accuracy is better, and there are no outliers that deviate too far from the 1:1 line. This is because the first layer of the stacking model should include strong models with good performance, and these base-models should combine high prediction accuracy with significant differences in model architecture. A stacking model that meets these characteristics will have a better fusion effect. Therefore, this study chose SVR, RF and KNN as the base-learners, and ridge regression as the meta-learner to fuse the integrated model. Among the three machine learning methods used in this study, random forest has excellent learning ability and can avoid over-fitting; SVR has advantages in solving small sample, nonlinear and high-dimension regression problems; and KNN has the characteristics of high prediction accuracy and insensitivity to outliers (Shakhovska et al. 2022). The second layer (meta-learner) typically uses a model with strong generalizability to correct for the biases of the multiple learning algorithms on the training set and avoid the over-fitting problem. Finally, the stacking soybean meteorological yield prediction model developed here overcomes the defects of the single models by combining a variety of algorithms, optimizes the input of ridge regression, and thus improves the estimation results.
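The two-layer structure described here (SVR, RF and KNN as base-learners, ridge regression as meta-learner, 5-fold cross-validation for the first-layer predictions) can be sketched with scikit-learn's `StackingRegressor`. This is an illustrative sketch only: the study's actual implementation is not given, and the hyperparameter values, the scaling pipelines and the synthetic data are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

def build_stacking_model(random_state=0):
    """Stack SVR, RF and KNN under a ridge meta-learner (assumed settings)."""
    base_learners = [
        ("svr", make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0))),
        ("rf", RandomForestRegressor(n_estimators=50, random_state=random_state)),
        ("knn", make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5))),
    ]
    # cv=5: the meta-learner is trained on 5-fold out-of-fold predictions.
    return StackingRegressor(
        estimators=base_learners, final_estimator=Ridge(alpha=1.0), cv=5
    )

# Synthetic nonlinear regression problem standing in for the feature set.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2 + rng.normal(scale=0.1, size=200)

model = build_stacking_model().fit(X[:150], y[:150])
pred = model.predict(X[150:])
print(np.round(model.final_estimator_.coef_, 3))  # ridge weight per base-learner
```

The ridge coefficients indicate how much weight the meta-learner places on each base-learner's prediction, which is one way to inspect whether the base-models complement one another.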
Fig. 10 Heatmap of 2D bin counts of the soybean yields predicted by different models. MAPE, mean absolute percentage error.
The data in Fig. 9 are basically consistent with the soybean yield data in the database of the Ministry of Agriculture and Rural Affairs of China, which further illustrates the accuracy and practicality of the stacking model in estimating and monitoring regional soybean yields. More importantly, the actual yield data are usually released by the government in January, but our high-precision yield results can be obtained three months in advance, that is, in September of the previous year. In the main soybean planting areas in China, which span different types of climatic zones and wide ranges of longitude and latitude, the stacking ensemble learning model developed and applied in this paper draws on a large amount of historical meteorological data and known crop growth rules to analyze the relationship between meteorology-related factors and crop yield, and it achieved good practical results. We believe this method is also suitable for monitoring and predicting the yields of wheat, corn, rice and other crops. Based on this study, when there are enough feature sets and data samples, the stacking model will in most cases perform better in prediction accuracy and robustness than single models, given suitable hyperparameter tuning and appropriate base-model selection. The significance of the method proposed in this paper is that the yield estimation model based on meteorological factors can not only help to improve the prediction accuracy of soybean yield, but can also realize the daily monitoring and early warning of soybean yield and guide crop field management. With increases in the number of years and amount of daily meteorological data, the accuracy and applicability of the meteorological factor crop yield prediction model will be further improved (Xu et al. 2015). In the future, additional research will be carried out on enriching the feature sets and taking deep learning frameworks as base-models for the stacking algorithm.
Soybean is one of the world’s most important oil and food crops. The accurate prediction of soybean yield is of great significance for agricultural production management, as well as agricultural monitoring and early warning. Based on the meteorology-related factors, this study separated the meteorological yield data through a meteorological yield decision-making system, and constructed a stacking ensemble learning model based on KNN, RF and SVR, which could realize the accurate estimation of soybean yields, and the corresponding distribution map of soybean yields in 173 counties was obtained. The results show three important features of this system. (1) The moving average, HP filtering and exponential smoothing methods can effectively separate the soybean meteorological yield data, and provide favorable support for improving the soybean yield estimation model based on climatic factors. (2) Compared with the KNN, RF, LASSO and SVR models, the stacking model is better with respect to the R², MAPE, MAE and RMSE indicators. The SVR, KNN, RF and stacking models were verified in 173 counties through 5-year sliding predictions. The MAPE indicators of the four models for soybean yield were all lower than 8.2%. The MAPE indicator of the stacking model in 173 counties in China’s main soybean planting areas reached 4.90%, showing that its prediction accuracy is high and its generalizability is strong. Therefore, the stacking model is the preferred model among the models and samples involved in this study. (3) The 5-year sliding prediction method showed that the prediction values of the stacking model for 173 counties were basically consistent with the actual situation, so the estimation results are reliable.
In general, this study proves the feasibility and effectiveness of using the stacking ensemble learning framework constructed by the KNN, SVR and RF models to predict soybean yields based on meteorological factors. Moreover, the stacking model can accurately predict soybean yields not only on the small scale (county-level) but also on the large scale (cross-province), and it provides a new approach for soybean yield estimation.
Acknowledgements
The research was supported by the Science and Technology Innovation Project of Chinese Academy of Agricultural Sciences (CAAS-ASTIP-2016-AII).
Declaration of competing interest
The authors declare that they have no conflict of interest.
Journal of Integrative Agriculture, 2023, Issue 6