Adven Masih
Department of System Analysis and Decision Making, Ural Federal University, 19 Mira, Ekaterinburg, 620002, Sverdlovskaya oblast, Russian Federation
Keywords: Data mining; Machine learning algorithms; NO2 prediction; Tree and meta-learning classifiers; Artificial Neural Networks; Environmental modelling; Air pollution
Abstract
Worldwide, air pollution has become a major concern because of its effects on human beings and crops. Rapid technological advancement and economic development have had a negative impact on urban air quality (Sfetsos and Vlachogiannis, 2010). Air pollution is directly linked to anthropogenic activity. Even a slight change in the concentration of atmospheric pollutants can alter the composition of the troposphere and promote the formation of acids and photochemical oxidants. According to the World Health Organization (WHO), rising concentrations of air pollutants (NOx, CO, SO2, O3, etc.) are strongly correlated with health problems such as cardiovascular, respiratory and skin diseases. Indoor emissions from household sources such as stoves, together with ambient sources such as vehicles and factories, are the major causes of air pollution (Bedoui et al., 2016).
Although all air pollutants are dangerous to human health, constant monitoring of NO2 in particular is important because (1) the increased concentration of NO2 has significantly raised environmental risk in urban areas (Xie et al., 2016), (2) NO2 can convert into nitric acid within 24 hours (Shon et al., 2011) and (3) even short exposure to NO2 can aggravate respiratory diseases, especially asthma. Besides asthma patients, children and elderly people in general are at high risk (Bell et al., 2014). Apart from constant monitoring of atmospheric pollutants, it is also essential to develop air quality modelling techniques that can accurately estimate pollutant concentrations. By providing early warnings well ahead of time, such high-precision models can help local authorities take effective precautionary measures before air pollution reaches or crosses the permissible limits set by the state.
Air pollution concentrations are mainly modelled using two types of approaches: (1) Chemical Transport Models (CTMs), which generally describe the ecological processes that contribute to air pollution, and (2) data-driven techniques, which apply statistical and machine learning algorithms to historical emission and meteorological data in order to predict future air pollution levels. Despite the capability of CTMs, their application in environmental science has become limited because they require comprehensive knowledge of the transport and chemical mixing of pollutants, which is costly and complex to obtain. Data-driven approaches have a feasibility advantage over CTMs and, as a result, are becoming popular for pollution prediction modelling.
Recently, data mining tools such as Artificial Neural Networks (ANNs) and Support Vector Machines (SVMs) have been successfully applied to short- and long-term prediction of air pollutants in several studies (Wang et al., 2008; Baawain and Al-Serihi, 2014; Bedoui et al., 2016; Brunelli et al., 2007; Elangasinghe et al., 2014; Juhos et al., 2008; Lu et al., 2003; Rahimi, 2017; Zito et al., 2008). Tree and meta-learning classifiers, especially Random Forest and Bagging, have been employed for classification and prediction in fields such as bioinformatics, oil pricing, marketing and medicine (Van Loon et al., 2007; Schlink et al., 2003; Alfaro et al., 2008; Gabralla and Abraham, 2014; Fathima et al., 2014); however, their application in environmental science has so far been limited (Masih, 2018; Zhan, 2018; Oprea, 2016; Singh et al., 2013).
Given these observations, 7 models based on Tree (Random Forest, REP Tree and M5P), meta-learning (Bagging and Random Subspace) and Function (MLP and SVM) classifiers were developed for this study. To validate the model results, the performance of these algorithms was compared against each other.
Owing to their wide application in air quality modelling, ANNs are considered one of the most common, reliable, widely adopted and studied machine learning tools for classification and regression in predicting air pollutant concentrations (Russo and Soares, 2014; Shaban et al., 2016; Capilla, 2014; Singh et al., 2012; Rahimi, 2017). In these studies, ANNs are preferred over classical statistical models for their ability to yield improved performance and to handle the non-linearity and complexity of emission inventory records. It was later observed, however, that ANNs suffer from local minima and overfitting problems. Attempts to overcome these issues were made by Lu et al. (2003) and Wang and Lu (2006), but neither succeeded in solving both problems simultaneously. Finally, the performance of the SVM algorithm was assessed against MLP by Lu and Wang (2014), who showed that SVM performs better than MLP on these structural issues. That study is considered a landmark in the field of atmospheric pollution prediction for addressing the overfitting and instability problems.
Interestingly, no study in the field of atmospheric modelling considered the meta-learning technique Bagging until 2000, when Cannon and Lord (2000) used bagging in a model to predict the maximum daytime concentration of ground-level ozone. Their work is divided into two phases. In the first, MLP and Multiple Linear Regression (MLR) were tested as independent classifiers; in the second, both were adopted within bagging as base classifiers. The results suggest that, as independent classifiers, both MLP and MLR suffered from overfitting and instability problems, whereas adopting them within Bagging as base classifiers enhanced their stability and accuracy. Later, a study based on Athens, Greece (Riga et al., 2009) developed 84 different models using the well-known machine learning toolkit WEKA and established that Tree and Rule classification algorithms such as J48, LMT, OneR, Decision Table and REP Tree perform significantly better than SMOreg (SVM) and linear regression. In another study, Singh et al. (2013) proposed an air quality forecasting model using Principal Component Analysis (PCA) and the meta-learning algorithms Bagging and Boosting, with PCA adopted to identify pollution sources. A significant improvement in accuracy was observed when the performance was compared with SVM.
Similarly, the application of Tree classifiers (e.g. Random Forest) to atmospheric prediction was recently explored by Jiang and Riley (2015). In that study, a classification model based on Random Forest was developed and, for validation, its results were compared against a Classification and Regression Tree (CART). This Sydney-based work concluded that the accuracy obtained with Random Forest (RF) is superior to that of a single base tree (CART). Another Random Forest based approach was tested by Yu et al. (2016) to predict the Air Quality Index (AQI). The study uses urban public data comprising road information, air quality and meteorological datasets for all regions of Shenyang as input predictors, with Naive Bayes, Logistic Regression, a single decision tree and ANN considered for model validation. The assessment showed that Random Forest outperformed all of these state-of-the-art classifiers.
Thus, the main contributions of this work are: (1) the application of Tree and meta-learning classification algorithms in the field of atmospheric sciences to predict the concentration of NO2 in air and (2) a performance comparison of Tree and meta-learning classifiers against the traditional machine learning models MLP and SVM.
Air quality of a region is generally characterized by the AQI. The AQI value is computed from the concentrations of six atmospheric pollutants, namely particulate matter (PM2.5, PM10), SO2, NO2, CO and ozone (O3). Besides emission records, several regional meteorological parameters, in particular air temperature, humidity, wind speed and wind direction, can influence the dispersion of emissions from one region to another (Gardner and Dorling, 1999; Gardner and Dorling, 1998; Gardner and Dorling, 2000). Given these observations, the prediction model considers both urban air pollutant and meteorological data, because both directly affect regional atmospheric turbulence and the AQI.
The study uses a time-series dataset recorded at the Marylebone Road monitoring site in London for a period of around 6 months, i.e. January 1st, 2013 to May 18th, 2013, at a sampling interval of one hour. It was obtained from the official website of the Department for Environment, Food and Rural Affairs (DEFRA). The hourly dataset combines 9 attributes: 4 atmospheric pollutants (SO2, NO2, CO and HCl) and 5 meteorological parameters (temperature, wind speed, wind direction, relative humidity and atmospheric pressure).
Prior to modelling, the environmental and pollution dataset was prepared in three main stages (data collection, data preprocessing and modelling), as shown in Fig. 1. A careful preliminary analysis of the raw dataset revealed that (1) around 7% of the meteorological records are missing and (2) extreme values are reported in both the pollution and meteorological datasets. Hence the data required thorough cleaning before modelling. To deal with missing values, a built-in WEKA filter named ‘ReplaceMissingValues’ was adopted, whereas to detect and remove outliers from the raw dataset another filter called ‘RemoveWithValues’ was applied. Finally, a dataset containing 2766 instances was prepared for pollution modelling of NO2.
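The two cleaning steps can be illustrated with a minimal Python sketch. It is not the WEKA implementation: it assumes that missing numeric values are replaced by the column mean and that outliers are rows whose value falls outside a chosen range. The function names, the sample records and the 0 to 300 range are hypothetical and serve only as an illustration.

```python
import math
from statistics import mean


def _is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def replace_missing_with_mean(rows):
    """Return a copy of rows with missing entries replaced by the column mean."""
    filled = [list(row) for row in rows]
    for j in range(len(rows[0])):
        present = [row[j] for row in rows if not _is_missing(row[j])]
        column_mean = mean(present)
        for row in filled:
            if _is_missing(row[j]):
                row[j] = column_mean
    return filled


def remove_rows_outside(rows, col, low, high):
    """Drop rows whose value in column `col` lies outside [low, high]."""
    return [row for row in rows if low <= row[col] <= high]


# Hypothetical hourly records [NO2, temperature]; the values are illustrative only.
raw = [[40.0, 5.0], [42.0, None], [900.0, 6.0], [38.0, 7.0]]
clean = remove_rows_outside(replace_missing_with_mean(raw), col=0, low=0.0, high=300.0)
```

In this toy input the missing temperature is filled with the mean of the three recorded values, and the record with an implausible NO2 reading is dropped.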
After an exhaustive exercise of developing several models from different classification categories, 7 classifiers were selected for comparison on the basis of their accuracy, namely MLP, SVM, RF, REP Tree, M5P, Bagging and Random Subspace. The experimental assessment of the models is presented in Fig. 2. To validate the accuracy of the models, all 7 machine learning algorithms were tested under a k-fold cross-validation design. Three widely known measures were used to quantify model accuracy: Correlation Coefficient (R2), Root Mean Square Error (RMSE) and Relative Absolute Error (RAE).
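The k-fold design can be sketched in a few lines of Python. This is a simplified illustration and not WEKA's procedure: the folds are contiguous and unshuffled, and the `fit` and `score` callables are hypothetical placeholders for any learner and any accuracy measure.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds whose sizes differ by at most 1."""
    base, extra = divmod(n, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(xs, ys, k, fit, score):
    """Average `score` over k folds; fit(train_x, train_y) must return a predict function."""
    scores = []
    for test_idx in k_fold_indices(len(xs), k):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        predict = fit(train_x, train_y)
        predictions = [predict(xs[i]) for i in test_idx]
        scores.append(score([ys[i] for i in test_idx], predictions))
    return sum(scores) / k


# Fold sizes for a dataset of 2766 instances under 10-fold cross-validation.
fold_sizes = [len(fold) for fold in k_fold_indices(2766, 10)]
```

Each instance is held out exactly once, so every model is always scored on data it was not trained on.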
Fig.1 Data Preparation scheme.
Fig.2 NO2 modelling approach.
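The correlation coefficient is taken here in its standard Pearson form, which is assumed to be the definition intended:

```latex
R = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```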
where x_i and x̄ are the original value and the mean of the target vector, respectively, and y_i and ȳ are the predicted value and its mean.
Table 1 Proficient assessment of classifiers.
where y_i and x_i are the predicted and observed values, respectively, ȳ is the average predicted value and n is the total number of instances.
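As a hedged illustration, the three measures can be computed as in the Python sketch below. It follows the usual textbook definitions. In particular, RAE is normalised here by the deviation of the observed values from their own mean, which is the common convention and an assumption on our part. The sample values are hypothetical.

```python
import math


def correlation(observed, predicted):
    """Pearson correlation between observed and predicted values."""
    mean_x = sum(observed) / len(observed)
    mean_y = sum(predicted) / len(predicted)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(observed, predicted))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in observed))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in predicted))
    return cov / (spread_x * spread_y)


def rmse(observed, predicted):
    """Root mean square error."""
    n = len(observed)
    return math.sqrt(sum((y - x) ** 2 for x, y in zip(observed, predicted)) / n)


def rae(observed, predicted):
    """Relative absolute error in percent, relative to always predicting the observed mean."""
    mean_x = sum(observed) / len(observed)
    absolute_error = sum(abs(y - x) for x, y in zip(observed, predicted))
    baseline_error = sum(abs(x - mean_x) for x in observed)
    return 100.0 * absolute_error / baseline_error


# Hypothetical observed and predicted concentrations.
obs = [1.0, 2.0, 3.0, 4.0]
pred = [1.5, 2.5, 3.5, 4.5]
metrics = (correlation(obs, pred), rmse(obs, pred), rae(obs, pred))
```

A constant offset of 0.5 leaves the correlation at 1 while RMSE and RAE are non-zero, which is why the error measures are reported alongside the correlation coefficient.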
All the experiments were conducted within “Explorer”, a working environment implemented in WEKA. For a fair and detailed analysis, the study adopts ‘CVParameterSelection’, an optimization tool implemented in WEKA, to tune the classifier parameters. For MLP, the experimental design uses 5 hidden layers, whereas for SVM a polynomial kernel was applied. For the rest of the classifiers, the default settings implemented in WEKA were adopted.
Tree and meta-learning approaches have recently been applied successfully in a number of fields; however, these techniques have rarely been adopted for air pollution modelling. Therefore, an inclusive study considering Tree-based (Random Forest, REP Tree, M5P), meta-learning (Bagging and Random Subspace) and Function-based (MLP and SVM) classifiers was performed to predict the atmospheric concentration of NO2.
A summary of the performance of the Tree, meta-learning and Function-based classifiers under different cross-validation (CV) settings (5-, 10- and 15-fold) is presented in Table 1. Fig. 3 shows the average correlation coefficient achieved by the tree, meta-learning and function classifiers. It clearly reflects that the performance of tree classifiers is, on average, significantly better than that of meta-learning and function classifiers.
The prediction performance of all 7 classifiers is shown in Fig. 4. Among the Tree classifiers, Random Forest performs best, with the highest correlation coefficient (R2 = 0.923), followed by M5P (R2 = 0.897), while REP Tree performed the worst. On the other hand, the performance of the meta-learning classifier Bagging (with REP Tree) was exceptionally good, with a correlation coefficient (R2 = 0.90) nearly equal to that of Random Forest. The prediction performance of Bagging is significantly better than that of the state-of-the-art function classifiers (MLP and SVM). Finally, the multilayer perceptron and support vector machine underperformed among the 7 models developed for comparison, each having the lowest R2 value of 0.83.
In data mining, it is often said that the performance of a classifier largely depends on the type of dataset used. The comparison charts in Fig. 3 and Fig. 4 show that Tree classifiers on average, and Random Forest in particular, attained significantly higher correlation coefficient values, followed by Bagging (with REP Tree) and M5P. Random Forest outperformed every other classifier in Table 1, which suggests that using Tree classifiers can significantly improve the prediction accuracy of atmospheric NO2 concentrations. The performance of both Function classifiers, MLP and SVM, was poor; however, the results suggest that the number of cross-validation folds does not really affect the performance of SVM, which indicates the stability of the algorithm.
Fig.3 Average correlation coefficient achieved by tree, meta-learning and function classifiers.
Fig.4 Performance comparison of Tree, meta-learning and Function classifiers.
Besides accuracy, the error functions (RMSE and RAE) were also calculated under the different cross-validation settings, i.e. 5-, 10- and 15-fold. The average error values presented in Table 2 indicate that the RMSE for RF is the lowest (0.0026); M5P and Bagging both achieved slightly higher values, whereas MLP, SVM, REP Tree and Random Subspace performed the worst. Likewise, the RAE values show that RF is the best classifier, with a significantly lower RAE of 34.28%, while Bagging and M5P are the second and third best classifiers, with slightly higher error values of 38.81% and 40.09%, respectively. Although the average correlation coefficients achieved by Bagging and M5P are the same (0.90), the error values in Table 2 indicate that Bagging has an edge over M5P, which supports the view that meta-learning classifiers have the potential to handle non-linearity as well as overfitting.
As a final point, Tables 1 and 2 show that the prediction performance of REP Tree is better than that of the function classifiers (MLP and SVM) but markedly lower than that of the other Tree classifiers, i.e. Random Forest and M5P. It is worth noting, however, that using REP Tree as a base classifier within Bagging significantly improved its prediction accuracy, raising the correlation coefficient from 0.852 to 0.90 and reducing the RAE by about 8 percentage points (from 46.7% to 38.8%), which is commendable. It is evident that Bagging is able to enhance the prediction accuracy for atmospheric NO2 concentration.
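The mechanism behind this improvement can be sketched in Python. The study itself used WEKA's Bagging with REP Tree as the base learner; the sketch below swaps in a much simpler one-split regression stump on a single feature, purely to show how bootstrap resampling and averaging work. All names, the number of models and the toy data are hypothetical.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Fit a one-split regression stump on a single numeric feature."""
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:  # all feature values identical: fall back to the mean
        overall = mean(ys)
        return lambda x: overall
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x < threshold else right_mean


def fit_bagging(xs, ys, n_models=25, seed=0):
    """Train stumps on bootstrap resamples and average their predictions."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        sample = [rng.randrange(n) for _ in range(n)]  # draw n indices with replacement
        models.append(fit_stump([xs[i] for i in sample], [ys[i] for i in sample]))
    return lambda x: mean(model(x) for model in models)


# Toy data: a step from 0 to 10 halfway along the feature axis.
feature = list(range(10))
target = [0.0] * 5 + [10.0] * 5
predict = fit_bagging(feature, target)
```

Because each stump sees a different resample, its split point varies slightly; averaging the ensemble smooths out this variation, which is the stabilising effect attributed to Bagging above.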
Table 2 Error values obtained for different classifiers.
During recent decades, high energy demand due to improved living standards, rising population, per capita energy use, industrialization and the rapidly increasing use of private cars instead of public transport appear to be the major factors contributing to atmospheric pollution (Masih, 2018). Its consequences include heightened health problems among children, growing numbers of heart and lung disease cases and, most importantly, pollution-related deaths, which according to the WHO reached seven million worldwide in 2012 alone, i.e. one out of eight global deaths was caused by the ever-increasing concentrations of air pollutants (Xie et al., 2016).
Several attempts have previously been made to analyze air pollutant concentrations and to build forecast models; however, the results suggest that the difference between linear and nonlinear models was insignificant (Pires et al., 2008). Besides linear regression, several state-of-the-art nonlinear regression models such as multivariate polynomial regression, artificial neural networks and support vector machines are able to determine precisely the underlying relationship between input and target data. Owing to their robustness and nonlinear mapping ability, the artificial neural network (MLP) and the support vector machine (SVM) are both popular for air pollutant prediction (Wang et al., 2008); however, the present study shows that tree and meta-learning classifiers, Random Forest and Bagging in particular, perform better than MLP and SVM.
Constant monitoring and forecasting of atmospheric pollutants is important because the introduction of chemicals and particulate matter into the atmosphere beyond a certain limit, due to increased anthropogenic activity, results in air pollution. Because of its negative impact on human health, other living organisms, crops and the natural environment, it is one of the vital issues that metropolitan and industrial cities need to address. Therefore, to raise awareness among the general public and to inform them in a timely manner about the AQI and the associated health risks, the development of a precise and reliable air pollution prediction model is vital.
The present study adopts 7 machine learning approaches: 3 Tree classifiers (Random Forest, REP Tree and M5P), 2 meta-learning classifiers (Bagging and Random Subspace) and 2 popular machine learning algorithms (MLP and SVM). To predict the concentration of atmospheric nitrogen dioxide (NO2), a total of 8 attributes were used: 3 air pollutants (SO2, CO, HCl) and 5 environmental parameters (temperature, humidity, wind speed, wind direction, atmospheric pressure). The results show that, on average, the prediction accuracy of Tree classifiers is greater than that of meta-learning and Function classifiers. Furthermore, the assessment of all 7 algorithms suggests that Random Forest performs best, with the highest correlation coefficient and lowest error values, followed by Bagging and M5P. Lastly, it was found that Bagging, which is known for mitigating overfitting and local minima problems, can also significantly enhance the prediction accuracy of REP Tree when the latter is used as its base classifier.
Journal of Environmental Accounting and Management, 2020, Issue 1