Nian LIU, Zhongwei YAN, Xuan TONG, Jiang JIANG, Haochen LI, Jiangjiang XIA*, Xiao LOU, Rui REN, and Yi FANG
1Key Laboratory of Regional Climate-Environment for Temperate East Asia (RCE-TEA), Institute of Atmospheric Physics, Chinese Academy of Sciences, Beijing 100029, China
2University of Chinese Academy of Sciences, Chinese Academy of Sciences, Beijing 100049, China
3Center for Artificial Intelligence in Atmospheric Science, Institute of Atmospheric Physics, Chinese Academy of Sciences, Beijing 100029, China
4Qi Zhi Institute, Shanghai 200232, China
5Beijing Meteorological Service Center, BMSC, Beijing 100089, China
6School of Mathematical Sciences, Peking University, Beijing 100871, China
7School of Science, Beijing University of Posts and Telecommunications, Beijing 100876, China
8Lab of Meteorological Big Data, Beijing 100086, China
ABSTRACT
Key words: data reconstruction, meshless, machine learning, surface wind speed, random forest
Wind speed is one of the fundamental variables in basic atmospheric equations. Surface wind speeds at a hyperfine resolution are needed in many applications. For instance, wind energy plays a role in the global energy transition with regard to the mitigation of global warming (Bosch et al., 2017), and a continuous wind speed field is essential for evaluating the wind power capacity in different areas (Gielen et al., 2019). Street-scale wind fields with variations in building density and height in local areas are important for the diffusion and deposition of pollutants (Miao et al., 2017; Zhai et al., 2019; Pirhalla et al., 2020; Szewc et al., 2021; Zhang et al., 2021a). A high-resolution wind field over airport runways is required when assessing the risk of taking off and landing. Super-resolution wind field data are useful for planning airport construction (Prasanna et al., 2018; Nechaj et al., 2019). Moreover, many outdoor events at the Winter Olympic Games are restricted by wind speed, so the prediction of very local and temporal wind speeds is imperative (Joe et al., 2010; Bernier et al., 2014; Isaac et al., 2014). In such applications, continuous wind speed fields with a resolution of hundreds or even tens of meters are required. In short, historical super-resolution surface wind fields are useful in many applications. One problem is determining how to instantly obtain super-resolution wind fields based on limited information.
Observations and numerical simulations (including reanalysis) are common data sources for wind field construction. Observatory sites provide discrete reference records. However, station networks are usually sparse and inhomogeneous, as it is generally expensive to establish an intensive observation network. Reanalysis methods provide global gridded data, but the resolution is often insufficient for local application scenarios. Most reanalysis datasets are characterized by large biases and uncertainties in describing local wind climatology and climate trends (Rose and Apt, 2015, 2016; Torralba et al., 2017; Yu et al., 2019; Wang et al., 2020).
Conventional methods used to obtain high-resolution meteorological fields include interpolation and downscaling. Yan et al. (2002) modeled a continuous wind field in association with large-scale climate factors based on a generalized linear method, but such statistical techniques are difficult to apply when reconstructing local-scale winds. Downscaling can be divided into two broad classes: dynamic downscaling (DD) and statistical downscaling (SD). In DD, local-scale climate patterns are estimated via a high-resolution mesoscale dynamic model or regional climate model (RCM) coupled with a global coupled model (GCM), with boundary conditions determined from the GCM output (Salvação and Soares, 2018; Zhang et al., 2020). The mesoscale Weather Research and Forecasting (WRF) model and computational fluid dynamics (CFD) models are often applied for local wind applications (Liu et al., 2018b; Salvação and Soares, 2018; Keck and Sondell, 2020). DD provides continuous gridded results consistent with physical principles, but some inevitable problems remain (Willison et al., 2015; Liu et al., 2018b; Zhang et al., 2020). First, it is quite challenging to use numerical models to capture the detailed dynamic structures of near-surface wind trends dominated by local microtopography, and thus near-surface wind simulations are of limited effectiveness. Second, long-term DD simulations at high resolutions require vast computational resources, and for the current mesoscale models, such as the WRF model, more computing resources are required as the resolution becomes finer. Moreover, regional dynamical simulations are sensitive to boundary conditions, physical parameterization and systematic model error. Such errors can quickly accumulate and reach an unacceptable level.
SD provides more local information via the statistical relationships among local variables (usually observations) and large-scale variables (usually simulated by GCMs) (Liu et al., 2019). SD, as well as interpolation methods, produces fast and accessible results and requires far less computational time than DD, but there are also some limitations (Nikulin et al., 2018; Seiler et al., 2018; Alizadeh et al., 2019; Hou et al., 2019). First, SD relies on not only the accuracy of observations but also the validity of dynamical simulations, especially the relationships among the large-scale and local variables used. Second, SD, as well as many statistical interpolation methods, is usually based on a steady empirical relationship (function), which is characterized by a priori error for local variables. Moreover, SD may not fully consider the temporal physical interactions among variables at different scales, which can lead to spatial and temporal discontinuities in high-resolution outputs.
Machine learning (ML), which is popular due to its data-driven nature, has increasingly been applied in geoscience for data reconstruction (Reichstein et al., 2019). ML algorithms can address many challenges encountered in geoscience problems, such as those in remote sensing and model simulations (Karpatne and Liess, 2015; Rodrigues et al., 2018; Karpatne et al., 2019; Reichstein et al., 2019). ML algorithms have displayed high accuracy in various applications and can balance computational cost and run time objectives (Krasnopolsky and Fox-Rabinovitz, 2006). For instance, Jing et al. (2017) reconstructed precipitation data for the regions not covered by the Tropical Rainfall Measuring Mission 3B43 (TRMM) precipitation dataset by using a random forest model. Kadow et al. (2020) used a deep learning model to reconstruct historical sea surface temperatures from 1870 to 2005 based on two distinct datasets. Machine learning models have also been used in other fields to reconstruct data, such as soil (Hengl et al., 2017; Zhang et al., 2021b), cloud structure (Leinonen et al., 2019), and paleoclimate (Wei et al., 2021) data, among others. The data-driven nature of ML methods makes them unique, but ML models are often not easily explainable. Therefore, it is beneficial to clearly interpret and improve ML models by integrating a priori knowledge (Reichstein et al., 2019).
In this study, we propose a new approach to reconstruct the meshless field of hourly wind speed in Beijing. The model can fit the wind field to the resolution of the available geographical factors in the region. We select a random forest to build machine learning models. We introduce and preprocess model inputs and parameters and then evaluate the models by comparison with conventional methods in the following sections. A physical explanation of model performance is given based on the importance of the features involved. Finally, a summary of the conclusions is presented.
ML models require as many station observations as possible for training and testing. We collected data from 226 stations in Beijing (including urban and suburban areas) (Fig. 1). The stations used are fairly uniformly distributed, so the station datasets are representative. In addition, the Beijing surface wind speed field is influenced by geography and large-scale climate change (Liu et al., 2018a; Yang et al., 2020). Beijing is a highly developed city with a large population density; these factors may increase the uncertainty of local wind speed downscaling. Therefore, Beijing is a good location for the present data reconstruction study.
Fig. 1. Spatial distribution of observations (left) and elevation map in Beijing (right).
Table 1. The variables used in the models. All these variables come from the observed station records and are divided into three parts: meteorological, time, and geographic variables. Average wind variables represent the average wind speed and direction in a 10-minute period at 10-meter height. Wind components U and V represent the zonal and meridional components, respectively. Given the cyclical nature of the wind direction, the wind directions are not input into the models. Instead, we input four components (AWu, AWv, EWu, EWv) into the model to introduce the influence of wind speed and direction.
The hourly observations used in this study were collected from 226 stations in Beijing from 2015 to 2019. The dataset includes 17 variables, as shown in Table 1. We divided the 17 variables into three classes: meteorological variables (10), time variables (4) and geographical variables (3). We used these data to construct over 9 200 000 hourly samples as ML model inputs. In this study, we decompose wind speed into meridional and zonal wind speeds. This approach helps identify the effects of different predictive features from the perspective of atmospheric dynamics.
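The decomposition into zonal and meridional components can be sketched as follows. This is a minimal illustration that assumes the usual meteorological convention (direction measured clockwise from north and naming where the wind blows from); the function name `wind_components` is hypothetical, and the convention actually used for the station data is not stated in the text.

```python
import math

def wind_components(speed, direction_deg):
    """Return (u, v) for a wind of the given speed blowing FROM direction_deg.

    Assumption: meteorological convention, i.e. the direction is measured
    clockwise from north and names where the wind comes from.
    """
    theta = math.radians(direction_deg)
    u = -speed * math.sin(theta)  # zonal component, positive toward the east
    v = -speed * math.cos(theta)  # meridional component, positive toward the north
    return u, v

# A westerly wind (from 270 degrees) blows toward the east.
u, v = wind_components(10.0, 270.0)
```

Because the pair (u, v) varies continuously as the direction passes through north, it avoids the jump between 359 and 0 degrees that makes the raw direction awkward as a model input.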
2.3.1. Model and algorithm
The meshless data reconstruction (MDR) process here refers to a method that uses discrete station data to predict the wind speed at any location in the study region as long as the basic geographic information (latitude, longitude and altitude) is available for this location. Such a model allows us to map the wind speed field from the station distribution to gridded data of any resolution for which geographic information is available. The present model can reconstruct the 10-minute-mean wind speed at 10-meter height at any location for any historical moment.
We transform the data reconstruction problem into a regression problem and then solve the problem with a machine learning algorithm. Specifically, we train a machine learning data reconstruction model (MLDRM) to learn the map between the considered features (predictive factors) and the wind speed (label or target) at any given place and time. The features include the meteorological background (M), geographic information (G) and time variables (t). In this study, we treat one observed station as the supposed forecast point and exclude the data from this station from the model training process. In this way, forecast points have true values with which to evaluate model performance, and predicted stations remain independent of the model.
For a prediction point (station j) at time (i),
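One way to write this mapping, using the symbols M, G and t defined above, is given below. The label symbol $\hat{W}_{i,j}$ and the function symbol $F$ are assumed notation introduced here for illustration.

```latex
\hat{W}_{i,j} = F\left(M_i,\; t_i,\; G_j\right)
```

Here $M_i$ is the regional meteorological background at time $i$, $t_i$ the time variables, $G_j$ the geographic information of station $j$, and $F$ the trained regression model.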
A random forest (RF) is a supervised ensemble learning algorithm with better interpretability and fewer parameters than other machine learning methods (Jing et al., 2017). An RF can represent nonlinear relationships and outperform many conventional models based on fitting performance (Hengl et al., 2017). Additionally, the form of the objective function does not need to be preset, and complex interactions among features can be considered. Furthermore, an RF model can quantify the impact of features, thus aiding in assessing and improving model performance. Therefore, we apply an RF to build the MLDRM-RF model.
In this study, we choose the root mean squared error (RMSE) as a measure of model performance, and it is defined as
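In its standard form, for $N$ test samples with predicted wind speeds $\hat{y}_n$ and observed wind speeds $y_n$ (the symbol names are assumed here), the RMSE reads:

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{n=1}^{N}\left(\hat{y}_n - y_n\right)^{2}}
```

The RMSE has the same unit as the wind speed and penalizes large individual errors more heavily than the mean absolute error does.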
2.3.2. Framework of the MLDRM-RF model
The MLDRM-RF model can predict the wind speed at any location in the study region with meteorological background and local geographic information; to evaluate its performance, we used one of the stations as a supposed target location. All records from this station formed a testing set, and the other station records formed a training set. By repeating this modeling process for all stations, we maximized the utilization of data while guaranteeing the independence of the testing set from the training set. Based on such cross-validations, we could obtain a general assessment of the predictive ability of the model.
As shown in Fig. 2, there are four steps in building and evaluating the model of regional wind speed field reconstruction.
2.3.2.1. Step 1-Preprocess the station dataset (ST)
In this step, we processed the data into samples that could be input into the MLDRM-RF. Each sample was established based on the corresponding time and station (Fig. 3). At a certain time (i), we averaged the records from all stations at time (i) for each of 10 meteorological variables (Table 1) to obtain the hourly regional climate background fields (Mi). Moreover, we introduced four time scales (year, month, day, and hour) as time features (ti). For station (j), we added 3 geographic variables (Gj) as features: longitude, latitude, and altitude. Finally, every sample was composed of 17 features and 1 label (wind speed). The model dataset (ST) spanned 226 stations and 40896 hours, with over 9 200 000 samples.
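The sample construction described above can be sketched in a few lines. This is an illustrative sketch only: the function name `build_samples`, the dictionary layout of the records, and the variable names are assumptions, not the authors' code.

```python
from collections import defaultdict
from datetime import datetime

def build_samples(records, station_geo, met_vars, label="wind_speed"):
    """Build (features, label) pairs, one per (time, station) record.

    records:     {(time, station_id): {variable_name: value}}  (assumed layout)
    station_geo: {station_id: (longitude, latitude, altitude)}
    met_vars:    names of the meteorological variables to average regionally
    """
    # Collect each variable's values across all stations for every time step.
    by_time = defaultdict(lambda: defaultdict(list))
    for (time, _station), obs in records.items():
        for var in met_vars:
            by_time[time][var].append(obs[var])

    samples = []
    for (time, station), obs in sorted(records.items()):
        # M_i: regional mean of every meteorological variable at this time.
        m_i = [sum(by_time[time][v]) / len(by_time[time][v]) for v in met_vars]
        # t_i: year, month, day, hour.
        t_i = [time.year, time.month, time.day, time.hour]
        # G_j: longitude, latitude, altitude of the station.
        g_j = list(station_geo[station])
        samples.append((m_i + t_i + g_j, obs[label]))
    return samples
```

With the 10 meteorological variables of Table 1 passed as `met_vars`, each feature vector has 10 + 4 + 3 = 17 entries, matching the sample layout described in the text.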
Fig. 2. Flow chart of the wind speed reconstruction model (MLDRM-RF). Steps 1-4 are described in section 2.3.2. LOOCV represents “leave-one-out cross validation”. IDW represents the “inverse distance weighting” interpolation method. MDI represents “mean decrease in impurity”.
Fig. 3. Schematic diagram of the construction of sample datasets from observed data. Sample (i, j) refers to the record at time i at location j. Every sample consists of 17 variables (features) and one label, as shown in the figure. Mi, ti, Gj represent the three parts of these variables: meteorological background, time, and geographic variables, respectively. Wind speed is the label of this model.
2.3.2.2. Step 2-Preset the model parameters
Model parameters have a considerable influence on model performance. Here, we focus on optimizing two parameters, the tree depth and the number of regression trees, which are the two commonly used hyperparameters in RF modeling (Breiman, 2001). To obtain the best parameters for ST, we divided ST randomly into a training set and a validation set at an 80 to 20 percent ratio. Then, the model was repeatedly adjusted to obtain the parameters that optimize performance based on the training set. The number of trees and maximum depth of trees have the greatest impact on model performance. Therefore, we adjusted the number of decision trees from 5 to 100 and the model depth from 10 to 100. We used the RMSE and out-of-bag score (OOBS) as performance metrics.
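The parameter scan can be organized as a plain grid search. The sketch below is generic: `fit_and_score` is a hypothetical user-supplied callable that trains a model for one parameter pair and returns its RMSE, and the grid step sizes are assumptions, since the text gives only the ranges (5 to 100 trees, depth 10 to 100).

```python
from itertools import product

def grid_search(fit_and_score, n_trees_grid, max_depth_grid):
    """Return (best_rmse, n_trees, max_depth) over all parameter pairs.

    fit_and_score(n_trees, max_depth) is a hypothetical callable that trains
    a model with these settings and returns the resulting RMSE.
    """
    best = None
    for n_trees, max_depth in product(n_trees_grid, max_depth_grid):
        rmse = fit_and_score(n_trees, max_depth)
        if best is None or rmse < best[0]:
            best = (rmse, n_trees, max_depth)
    return best

# Assumed step sizes covering the ranges given in the text.
n_trees_grid = range(5, 101, 5)
max_depth_grid = range(10, 101, 10)
```

In practice the lowest RMSE is not the only criterion: when the error has flattened out, a smaller forest or shallower trees give nearly the same accuracy at lower cost.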
OOBD and OOBS: A random forest model consists of multiple decision trees. Each decision tree is built from a bootstrap resample of the training set. As a result, for each tree some of the training data are not drawn into its resample; these are called its “out-of-bag” data (OOBD). Because the OOBD of a tree are not involved in training that tree, they can be used to test the generalization capability of the model. The prediction errors on these out-of-bag data are averaged and normalized to give the out-of-bag score (OOBS).
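The out-of-bag idea can be illustrated with a single bootstrap resample. This is a minimal sketch using only the Python standard library; the function name `bootstrap_with_oob` is hypothetical, and the scoring of a real forest on its out-of-bag data is not reproduced here.

```python
import random

def bootstrap_with_oob(n_samples, rng):
    """Draw one bootstrap resample; return (in_bag_indices, oob_indices)."""
    # Sample n_samples indices with replacement: the data one tree is trained on.
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    chosen = set(in_bag)
    # Indices never drawn are this tree's out-of-bag data.
    oob = [k for k in range(n_samples) if k not in chosen]
    return in_bag, oob

rng = random.Random(0)
in_bag, oob = bootstrap_with_oob(10_000, rng)
# The chance that a sample is never drawn is (1 - 1/n)^n, about 0.368 for large n.
oob_fraction = len(oob) / 10_000
```

Roughly a third of the training data is therefore left out of each tree, which is what makes an out-of-bag error estimate possible without a separate hold-out set.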
As Fig. 4a shows, for a given number of trees, the RMSE always decreases with increasing depth of decision trees, but little change is observed after the depth reaches 30, and no change is observed at depths above 70. Figure 4b shows the error variations for different numbers of decision trees at a constant depth (70). Furthermore, we evaluate the influence of parameters based on another indicator, the wind speed prediction score (Table 2) (Yu et al., 2020), and obtain a similar result, as shown in Fig. 5. In Table 2, the vertical column on the left side of the table represents the range of observed wind speed (m s⁻¹), and the upper side represents the range of predicted wind speed. Every prediction value can be evaluated with a score based on this table, with a higher score indicating that the model predicts better.
Fig. 4. Error distribution with different tree counts and maximum depths of the decision trees. The boxes in figure (a) represent the errors of different combinations of numbers and depths based on the RMSE. In figure (b), for a given number of trees, we define “minimum RMSE” as the RMSE at which the error has stabilized with increasing depth. In our test, the stabilized RMSE is always the smallest RMSE for a given number of trees. The “maximum OOBS” is defined as the corresponding OOBS when the RMSE reaches stability. Figure (b) shows how “minimum RMSE” and “maximum OOBS” change with the number of trees.
In the parameter selection step, we aim to maximize model performance and improve model efficiency. According to the results in Fig. 4 and Fig. 5, we selected 50 trees and a maximum depth of 30 as the parameters of the MLDRM-RF in subsequent model evaluation steps.
2.3.2.3. Step 3-Build and evaluate the model
2.3.2.4. Step 4-Adjust the model data and re-evaluate the model
Fig. 5. Wind speed prediction score map with different model parameters (number and depth of trees). The horizontal axis represents the number of trees, and the vertical axis represents the depth of trees.
Repeat Step 3, but designate samples from one of the remaining stations (e.g., s2) as the testing set, resulting in a new model (model 2 in Fig. 6). This approach is similar to leave-one-out cross-validation (LOOCV) in statistical modeling, and it can enhance model reliability. This process was repeated until we built 226 models with different training and testing sets. Finally, we used the average performance of the 226 models to evaluate MLDRM-RF modeling performance.
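The station-wise cross-validation loop can be sketched as a generator that holds out one station at a time. The function name `leave_one_station_out` and the dictionary layout are assumptions made for illustration.

```python
def leave_one_station_out(samples_by_station):
    """Yield (held_out_station, train_samples, test_samples) for each station.

    samples_by_station: {station_id: [sample, ...]}  (assumed layout)
    """
    for held_out in sorted(samples_by_station):
        # All records of the held-out station form the testing set.
        test = samples_by_station[held_out]
        # Records of every other station form the training set.
        train = [
            sample
            for station, rows in samples_by_station.items()
            if station != held_out
            for sample in rows
        ]
        yield held_out, train, test
```

With 226 stations this yields 226 train/test splits, one model per split, and the overall score is the average of the per-station test errors.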
2.3.3. Inverse distance weighting (IDW) interpolation
To assess the performance of MLDRM-RF, we obtained an interpolated result from the IDW method as a baseline. The IDW method assumes that the dependent variable is affected by the distance from sampling locations and the power of this distance. IDW is one of the most frequently applied methods in spatial data interpolation (Li and Heap, 2011; Franco et al., 2020) because it provides relatively fast and reasonably accurate results. Here, we use a basic version of IDW, formulated as
Table 2. Wind speed prediction score. We adopt this evaluation index from the study of Yu et al. (2020). The vertical column on the left of the table represents the range of observed wind speed (m s⁻¹), and the upper side represents the range of predicted wind speed. Every prediction value can be evaluated with a score based on this table.
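A common basic form of IDW is shown here as a sketch. The symbol names and the exponent $p$ are assumptions, since the text does not state the power used.

```latex
\hat{z}(x_0) = \frac{\sum_{k=1}^{n} w_k \, z(x_k)}{\sum_{k=1}^{n} w_k},
\qquad
w_k = \frac{1}{d(x_0, x_k)^{p}}
```

Here $\hat{z}(x_0)$ is the interpolated wind speed at the target location $x_0$, $z(x_k)$ is the observation at station $k$, $d(x_0, x_k)$ is the distance between them, and $p > 0$ controls how quickly the influence of a station decays with distance ($p = 2$ is a frequent choice).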
RF and IDW methods require almost the same calculation time, and they are relatively easy to implement. MLDRM-RF can replace IDW as a new method for quickly obtaining regional wind fields, and it provides a higher resolution. It is worth noting that MLDRM-RF here is different from an interpolation approach. First, as shown in Fig. 3, MLDRM-RF contains various predictors, and both physical and geographic factors are considered. Second, MLDRM-RF does not use a preset mapping function; therefore, the upper limit of model performance is improved. MLDRM is a basic model with the potential to be extended to various features and data sources in practical applications.
3.1.1. Spatial distribution of the root mean square error (RMSE)
First, the performance of the random forest wind speed reconstruction model is evaluated based on the root mean square error. The spatial error distribution is shown in Fig. 7. The average RMSE for all 226 model prediction samples is 1.09 m s⁻¹. As illustrated by the probability density distribution in Fig. 7, the RMSE is less than 1.2 m s⁻¹ for most stations, with a median of 0.91 m s⁻¹. In general, the model performance is better for the southeastern part of Beijing than for the northwestern part. Large RMSE values (>2 m s⁻¹) occur in western Beijing, likely because the meteorological background features in that area are based on the mean conditions in the whole region. However, most stations are located in plains areas, and only a few mountainous stations are located in northern and western Beijing.
3.1.2. Seasonal differences in the error distribution
The geographical distribution of seasonal RMSE is shown in Fig. 8. The spatial average RMSE is 1.10 m s⁻¹ (DJF), 1.10 m s⁻¹ (MAM), 0.88 m s⁻¹ (JJA), and 0.94 m s⁻¹ (SON). According to Fig. 8, MLDRM-RF performs better in summer and autumn than in spring and winter. The large-scale background wind is strongest (weakest) in winter and spring (summer and autumn) in Beijing (Liu et al., 2018a; Yang et al., 2020), which probably causes seasonal variations in model performance.
As shown in Fig. 9, during each season, the RMSE of most stations is less than 1.0 m s⁻¹, and the probability density functions (PDFs) tend to be skewed by extremely large values. The median seasonal RMSE is 0.99 m s⁻¹ for winter, 0.97 m s⁻¹ for spring, 0.79 m s⁻¹ for summer, and 0.83 m s⁻¹ for autumn. The model exhibits the best performance in summer and autumn.
3.1.3. Comparison of machine learning and traditional models
To assess the potential advantage of ML models over conventional methods, we interpolated wind speeds at every station based on records from the other 225 stations by using the IDW method for comparison. The RMSE of MLDRM-RF is smaller than that of IDW for most stations: as shown in Fig. 10, at most sites the MLDRM-RF model performs better than IDW by approximately 0.5-1.0 m s⁻¹ in terms of the RMSE.
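The leave-one-out IDW baseline can be sketched as below. This is an illustration under stated assumptions: planar coordinates with Euclidean distance (real longitude and latitude would need a projected or great-circle distance), an assumed exponent of 2, and hypothetical function names.

```python
import math

def idw_predict(target_xy, neighbors, power=2.0):
    """Inverse-distance-weighted estimate at target_xy.

    neighbors: list of ((x, y), value). power=2 is an assumed, commonly used
    exponent; the exponent used in the study is not stated in the text.
    """
    num = den = 0.0
    for (x, y), value in neighbors:
        d = math.hypot(x - target_xy[0], y - target_xy[1])
        if d == 0.0:
            return value  # coincident point: use its value directly
        w = d ** -power
        num += w * value
        den += w
    return num / den

def idw_leave_one_out(stations):
    """stations: {id: ((x, y), value)} -> {id: estimate from all other stations}."""
    return {
        sid: idw_predict(xy, [v for other, v in stations.items() if other != sid])
        for sid, (xy, _value) in stations.items()
    }
```

Applying `idw_leave_one_out` at every hour and comparing the estimates with the withheld observations gives a per-station RMSE that is directly comparable with the cross-validated RMSE of the ML model.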
Figure 11 compares the PDFs of RMSE for the two models. The median RMSE of MLDRM-RF is considerably smaller than that of IDW (MLDRM-RF: 0.91 m s⁻¹; IDW: 1.14 m s⁻¹); additionally, the errors of MLDRM-RF are more concentrated in a smaller range, implying that it is more stable than IDW. The mean RMSE of predictions by IDW is 1.29 m s⁻¹, approximately 18% larger than that of the RF model. In addition, the tail of the PDF curve in Fig. 11 indicates that MLDRM-RF produces fewer extreme errors than IDW.
Fig. 7. RMSE (m s⁻¹) spatial distribution (a) and error probability density distribution (b). Error probability density represents the density distribution of the RMSE for 226 models. The horizontal axis shows the magnitude of the error, and the vertical axis shows the density.
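As a quick check using only the values reported in the text, the relative differences in the mean and median RMSE between IDW and the RF model are:

```latex
\frac{1.29 - 1.09}{1.09} \approx 0.18,
\qquad
\frac{1.14 - 0.91}{0.91} \approx 0.25
```

The first ratio reproduces the 18% figure for the mean RMSE; the second shows that the gap in the median RMSE is somewhat larger, at about 25%.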
In MLDRM-RF, different features influence local wind speed predictions to varying degrees. We calculated the feature importance by MDI on testing data predictions. We obtain the importance scores for all 226 models with all features and normalize the results to obtain Table 3. The most influential feature is the regional average 10-minute wind speed, with an importance level of 44.35%. The next three most important factors are altitude, longitude and latitude; these are all geographic features, and their total importance amounts to 21.2%. Therefore, the performance of MLDRM-RF relies heavily on the regional background wind speed. The basic geographic information influences the wind speed distribution in the region. The other basic ground meteorological variables and time features account for the remaining 34.5% of feature importance; among them, wind speed-related variables have the most significant influence, accounting for 18.23% of this contribution. Temperature, relative humidity and pressure have almost the same influence on the results at approximately 2% to 3%. Time variables are far less important, possibly because some diurnal and seasonal temporal factors are encompassed in the trends in meteorological variables. The low importance of precipitation may be related to the small number of hourly rainfall samples. The total importance of zonal wind (6.74%) is higher than that of meridional wind (4.82%), which indicates that zonal motion is important for reconstructing surface wind speed in Beijing.
Fig. 8. RMSE in different seasons and their spatial distributions, presented individually for (a) spring, (b) summer, (c) autumn, (d) winter.
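Combining the importance scores of many models into one percentage table can be sketched as follows. The averaging across models is an assumption, since the text says only that the scores of the 226 models were normalized, and the function name is hypothetical. The per-model MDI scores themselves would come from the trained forests and are not computed here.

```python
def mean_normalized_importance(per_model_importance):
    """Average importance per feature across models, rescaled to sum to 100%.

    per_model_importance: list of {feature_name: importance}, one per model,
    all sharing the same feature names (assumed layout).
    """
    features = list(per_model_importance[0])
    n_models = len(per_model_importance)
    mean = {
        f: sum(model[f] for model in per_model_importance) / n_models
        for f in features
    }
    total = sum(mean.values())
    return {f: 100.0 * value / total for f, value in mean.items()}
```

Rescaling to 100% makes the shares of feature groups (for example, the three geographic features) directly additive.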
Surface wind speed fields with fine resolution are useful in many applications. This paper introduces an ML model based on an RF algorithm (MLDRM-RF) to reconstruct meshless wind speed fields. We input time, geospatial attributes and meteorological background fields into the algorithm and evaluated the models based on cross-validation. The RMSE of MLDRM-RF is 1.09 m s⁻¹, and the median error is approximately 0.91 m s⁻¹. In terms of the RMSE, the model performs better in summer and autumn than in spring and winter by approximately 14.5%. Additionally, the results are better in the southeast than in the west and north of Beijing. We compared the reconstructed values with the results from the classic IDW method and verified that the MLDRM-RF outperforms the IDW method by approximately 18%, with both models displaying similar computational times. Furthermore, we ranked the importance of features by MDI to explore the reasonableness of the model prediction. We found that the most important feature is regional average wind speed, which contributed 44.35% of model performance. Geospatial features contributed 21.18%, while meteorological features other than average wind speed accounted for 27.61% of the model predictions. The prediction from the random forest model is reasonable and does not overfit to physically unreasonable variables. After specifying the reconstruction time point, the model uses the regional background wind field as the basis for prediction, introduces the comprehensive influence of the distribution of other meteorological variables at the location of the observation site on the wind speed, uses geographic information to obtain the location of the prediction point, and finally makes the prediction.
Fig. 9. Probability density distribution of RMSE (m s⁻¹) in different seasons, presented individually for (a) spring, (b) summer, (c) autumn, (d) winter. The horizontal axis represents the size of the RMSE.
Fig. 10. Spatial distribution of the error differences between MLDRM-RF and IDW. The colors represent the difference between the RMSEs (m s⁻¹) of MLDRM-RF and IDW. Positive values indicate that MLDRM-RF performs better than IDW.
Fig. 11. Probability density curve of MLDRM-RF and IDW RMSE (m s⁻¹). Blue shading represents the RMSE distribution curve for MLDRM-RF, and black shading represents the RMSE distribution curve for IDW.
Table 3. Importance rank of features that we use in the RF model. Importance is shown by percentages. Wind speed components U and V represent zonal and meridional components, respectively.
Our model is highly customizable and could be expanded to include additional features and samples. We can build special models for small regions and train these models with specific samples, such as high-elevation samples. For instance, the terrain has complex effects on wind patterns in many mountainous regions. When a reconstruction model is applied in these regions, we could introduce additional geographical characteristics as features to increase model performance; additionally, we can introduce meteorological variables associated with high pressure levels from reanalysis datasets as features to adapt to complex regions. MLDRM-RF has the potential to become a new baseline in the ML data reconstruction field.
Acknowledgements. This work was supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant No. XDA19030402), the Key Special Projects for International Cooperation in Science and Technology Innovation between Governments (Grant No. 2017YFE0133600), and the Beijing Municipal Natural Science Foundation Youth Project 8214066: Application Research of Beijing Road Visibility Prediction Based on Machine Learning Methods.
Advances in Atmospheric Sciences, 2022, Issue 10