Guangyuan Pan, Liping Fu, Qili Chen, Ming Yu, and Matthew Muresan
Abstract—Road safety performance function (SPF) analysis using data-driven and nonparametric methods, especially recently developed deep learning approaches, has achieved increasing success. However, because the learning mechanisms are hidden in a "black box", traffic feature extraction and intelligent importance analysis in deep learning remain unsolved and difficult to generalize. This paper addresses this problem using a deciphered version of deep neural networks (DNN), one of the most popular deep learning models. The approach builds on visualization, feature importance, and sensitivity analysis, and can evaluate the contributions of input variables to the model's "black box" feature learning process and output decision. First, a visual feature importance (ViFI) method that describes the importance of input features is proposed by combining diagram-based and numerical analysis. Second, by observing the change of weights with ViFI during unsupervised training and fine-tuning of the DNN, the final contributions of the input features are calculated according to importance equations proposed for both steps. A case study based on road SPF analysis is then demonstrated, using data collected from a major Canadian highway, Highway 401. The proposed method allows effective deciphering of the model's inner workings, identification of the significant features, and elimination of uninformative ones. Finally, the revised dataset is used for crash modeling and vehicle collision prediction, and the testing results verify that the deciphered and revised model achieves state-of-the-art performance.
EVALUATING the safety effects of countermeasures relies greatly on collision prediction models or safety performance functions (SPF), an important topic in road safety studies. Safety performance functions are commonly developed separately for different types of highways or entities, and locally, using data collected from the study area representing the specific highway types to be modelled. Traditionally, they are documented in the highway safety manual (HSM), which provides several example SPFs for various types of highways and intersections from different jurisdictions [1], [2]. One of the most commonly used methods is parametric modeling (e.g., the negative binomial model, NB), which requires a series of trial-and-error steps before arriving at a final model structure with a set of significant variables [3]–[5]. Although such models are easy to understand and apply, their predictions have low accuracy due to the random nature of collision occurrences and the strong distributional assumptions. Another widely studied technique is non-parametric modeling (e.g., kernel regression (KR), support vector machines (SVM), and artificial neural networks (ANN)), which has achieved satisfactory prediction accuracy [6]–[9]. However, the parameters of safety performance functions obtained with these methods cannot be quantified, making the results difficult to generalize.
The recently developed technology of artificial intelligence (AI) offers new potential solutions to this problem. Artificial intelligence has already revolutionized many industries, bringing in new ideas and exciting technologies. It has also brought changes to nearly every scientific field, and many more advancements remain. Among the most notable techniques developed, deep learning (also called deep neural networks) is often considered one of the most remarkable [10]–[12]. Since its proposal, it has been successfully applied to solve complex problems in a variety of fields, including but not limited to pattern recognition, game theory, computer vision, medical treatment, transportation logistics, and finance [13]–[18]. In our previous research, we applied the deep belief network, one of the most popular deep learning models, to establish SPFs, and the trained model outperformed traditional methods [19], [20]. However, despite the seemingly endless benefits deep learning brings, it possesses a certain opacity that often causes doubt and resistance from policy makers and scientists. Some findings highlight the fact that although deep learning models are trained to solve tasks based on human knowledge, the models see those objects differently than humans do. As a result, these findings have made AI not fully trusted for scientific and industrial applications [21], [22]. One of the main causes of this distrust is that the black-box training process cannot be analyzed. To address these limitations, researchers have begun studying how to defend, strengthen, and decipher deep learning, and have proposed several detection methods for understanding the feature learning process, especially for convolutional neural networks [23]–[34]. However, to the best of our knowledge, there is still no general method for effectively deciphering DNNs for SPF analysis and variable selection. In the current literature, three kinds of detection methods have been developed to study the black-box problem. The first analyzes the black box based on its external input [25]–[27]. The second uses transparent algorithms that are comparable to those implemented in deep learning to examine the model's workings [28], [29]. Finally, the third uses other machine learning models, particularly those whose workings are more easily understood, as tools to study the situation [30]–[34].
This paper builds on work on feature importance and visualization [35], [36] and demonstrates a diagram- and numerical-analysis-based method, called visual feature importance (ViFI), to understand the black-box feature learning process and provide the potential to analyze the contributions of the various input features of SPFs. We specifically focus on the deep belief network (DBN). The method intuitively highlights which areas respond positively or negatively to the inputs and shows how a DBN model, especially during unsupervised learning, learns differently from other methods. Our previous efforts have already shown significant progress in applying DBNs for SPF development. In this study, we use the ViFI method as a tool to describe feature importance and establish a more reasonable road safety performance function.
This paper is organized as follows: Section II introduces our methodology, including the method used to generate a weights-based diagram and the calculation process used to identify a feature's importance. In the unsupervised setting, a visualization diagram utilizing the change in the weights' values is generated, and contrastive divergence is used to understand the knowledge learning process. Then, the importance of each input feature is calculated, including how knowledge is transferred in the input-hidden and hidden-hidden layers. Similarly, in the supervised setting, another diagram using the same method is generated but is analyzed with respect to stochastic gradient descent, and each feature's contribution is then determined. Section III demonstrates a case study based on our previous research, in which more details on implementing ViFI are explained. This case study uses data from Highway 401 in Canada for vehicle collision prediction and shows an improved and more convincing result obtained by this more intelligent way of training a model. Section IV summarizes the study as well as directions for future research.
The deep belief network (DBN) is one of the most typical models in the DNN area. What makes the DBN different is its unique training method, called greedy unsupervised training. By stacking several restricted Boltzmann machines (RBM, a kind of recursive ANN model that contains two layers, one input layer and one output layer) one upon another in training, the DBN learns the features of the input signals without needing a supervisor and obtains a better distributed representation of the input data, without requiring extra labelled data as back propagation does.
The proposed ViFI method is divided into four steps: 1) initialize a DBN structure with its training parameters; 2) observe the change of the weights during unsupervised learning, focusing primarily on the magnitudes associated with each input feature; 3) after unsupervised training, generate a reconstructed input layer using each hidden layer; by observing the activated and non-activated areas, the exact knowledge that is learned can be better understood; 4) continue with the supervised learning step and generate the weights diagram using both visualization and a numerical analysis that calculates the contribution of each input feature (either accepted or rejected).
In Step 1, for a given deep belief net with the structure V-H1-H2-O (V input neurons, H1 and H2 hidden neurons in two hidden layers, respectively, and O output neurons for prediction), the weights are randomly pre-set.
In Step 2, the first restricted Boltzmann machine (RBM, comprising V and H1) is trained using greedy unsupervised learning; the equations for feature learning and weight updating are given in (1)–(6). Equation (1) represents the starting state of the input data (values between 0 and 1), with the weights W between the two layers randomly given (all zeros are recommended here, for easier calculation in the following steps). v_i and h_j are neurons in V and H1. Equations (2)–(4) are the feature learning equations of an RBM; V0, H0, V1, and H1 are the four states recorded during the transformation. p(·) is the probability of a neuron being activated, w_ij is the weight between neuron i in V and neuron j in H1, and b_j and c_i are the biases. Finally, the weights are updated by applying (5) and (6). As the weights are all zero at first, during unsupervised learning, if the model senses that a feature should be important, the weights between that feature neuron and hidden layer 1 will be strengthened, which mathematically leads to a negative ΔW in (5) because more neurons will be 1 in V1 and H1. If a feature is considered useless by the model itself, ΔW will be positive because V1 and H1 are mostly 0, and W_{t+1} will keep increasing. An illustration is shown in Fig. 1. After unsupervised learning, a reconstructed input is generated from each layer (Fig. 1(b)). This step helps us understand what knowledge the hidden layers have learned, because the reconstructed data will highlight the truly useful features [30].
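To make the update rule concrete, the following minimal sketch (an assumed implementation, not the authors' code) trains a single RBM with one-step contrastive divergence (CD-1) and records the weight matrix after every epoch so that its changes can later be visualized. Parameter values such as the learning rate are illustrative, and the sign convention of the ΔW update follows the standard CD formulation, which may be the opposite of the convention used in (5).

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm_cd1(data, n_hidden, epochs=60, lr=0.1, rng=None):
    """Train one RBM with CD-1; return final weights and per-epoch weight history."""
    if rng is None:
        rng = np.random.default_rng(0)
    n_visible = data.shape[1]
    W = np.zeros((n_visible, n_hidden))   # zero initialization, as recommended in the text
    b = np.zeros(n_hidden)                # hidden biases
    c = np.zeros(n_visible)               # visible biases
    history = []
    for _ in range(epochs):
        # positive phase: V0 -> H0
        p_h0 = sigmoid(data @ W + b)
        h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
        # negative phase (one Gibbs step): H0 -> V1 -> H1
        p_v1 = sigmoid(h0 @ W.T + c)
        v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
        p_h1 = sigmoid(v1 @ W + b)
        # contrastive-divergence update: <V0 H0> - <V1 H1>
        dW = (data.T @ p_h0 - v1.T @ p_h1) / data.shape[0]
        W += lr * dW
        b += lr * (p_h0.mean(axis=0) - p_h1.mean(axis=0))
        c += lr * (data.mean(axis=0) - v1.mean(axis=0))
        history.append(W.copy())          # record W after each epoch for ViFI
    return W, np.array(history)
```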
Fig. 1. Illustration of visualization. (a) Visualization of the change of weights. (b) Visualization of layer reconstruction.
The mean value of the weights on each feature is then calculated using the results of the unsupervised learning together with those from supervised learning. As the weight updating equation (6) is linear, we define the contributions in feature learning using the linear functions shown in (11)–(13), in which FI_i is the importance of feature i, composed of its importance in unsupervised learning and its importance after fine-tuning; the remaining terms are the weights that connect to feature i in epoch n, the number of features V, and the number of hidden units H.
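One plausible reading of the unsupervised part of this measure is that the importance of a feature is the average change, over all hidden units, of the weights leaving that input neuron. Since (11)–(13) are not reproduced here, the helper below is only an illustrative stand-in under that assumption.

```python
import numpy as np

def unsupervised_importance(weight_history):
    """Average, per input feature, of the weight change accumulated during unsupervised training.

    weight_history: array of shape (epochs, V, H) recorded while training the RBM.
    Returns a length-V vector; larger absolute values suggest a more important feature.
    """
    total_change = weight_history[-1] - weight_history[0]   # (V, H) change from zero init
    return total_change.mean(axis=1)                        # mean over the H hidden units
```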
A. Experimental Design
To evaluate the effect of ViFI, an empirical study is conducted using historical data from Highway 401, a multilane access-controlled highway in Ontario, Canada. This highway is one of the busiest in North America and connects Quebec in the east to the Windsor-Detroit international border in the west. The total length of the highway is 817.9 km, of which approximately 800 km was selected for this study. According to 2008 traffic volume data, the annual average daily traffic ranges from 14 500 to 442 900, indicating a very busy road corridor. In this study, the processed crash and traffic data are integrated into a single dataset with homogeneous sections and year as the mapping fields, resulting in a total of 3762 records. The six input features included in this dataset are annual average daily commercial traffic (AADCT), median width, left shoulder width, right shoulder width, curve deflection, and exposure. A summary description of the continuous input features is provided in Table I, including the sample sizes for training and testing. After training, the performance of each model is evaluated using the mean absolute error (MAE) and root mean square error (RMSE), as defined in (14) and (15).
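MAE and RMSE in (14) and (15) correspond to the standard definitions; for completeness, a minimal sketch:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error over all test sections."""
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

def rmse(y_true, y_pred):
    """Root mean square error over all test sections."""
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))
```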
TABLE I Summary of the Dataset (Highway 401, Ontario)
In our previous research, we applied an improved version of the DBN (regularized DBN, R-DBN) to predict collisions; in comparison, it outperformed the negative binomial (NB) model, one of the most widely used techniques in road safety analysis and the one adopted by the highway safety manual [2], [9], [19]. The improved DBN utilizes a continuous version of the transfer function for unsupervised learning ((16)–(18)) and Bayesian regularization for fine-tuning ((19) and (20)). We keep using this model not only because ViFI generalizes to it but also for easier comparison with the published results. In the equations, x_i and y_j are the continuous values of units i and j in the two layers; w_ij is the weight between them; N(0,1) is a Gaussian random variable with mean 0 and variance 1; σ is a constant; φ(X) denotes a sigmoid-like function with asymptotes θ_H and θ_L; a is a variable that controls the noise; F_W is the new optimization function in fine-tuning; R_W is the Bayesian regularization term that reduces over-fitting by controlling the values of the weights; and α and β are the performance parameters that can be calculated during the iterations by applying Bayesian regularization.
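The continuous transfer function described by (16)–(18) matches the usual continuous-RBM formulation (a noisy weighted sum squashed by a bounded, sigmoid-like φ). The sketch below is an assumed rendering based only on the symbols listed in the text, not a verbatim transcription of the paper's equations.

```python
import numpy as np

def continuous_unit(x, W, sigma=0.1, a=1.0, theta_L=0.0, theta_H=1.0, rng=None):
    """Continuous RBM unit: noisy weighted input squashed by a bounded sigmoid.

    x: visible activations (length V), W: weights into the hidden layer (V x H),
    sigma: noise scale, a: slope/noise control, theta_L / theta_H: asymptotes of phi.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    s = x @ W + sigma * rng.standard_normal(W.shape[1])            # noisy input sum
    return theta_L + (theta_H - theta_L) / (1.0 + np.exp(-a * s))  # phi with asymptotes
```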
B. Applying ViFI in Unsupervised Learning
The model is initialized with six input neurons, one for each feature (exposure, AADCT, left shoulder width, median width, right shoulder width, curve deflection), two hidden layers with ten neurons each, and one output layer that contains a single neuron for vehicle collision prediction. In Step 2, the weights between the input layer and hidden layer 1 are written as W1 = (w_11, w_12, ..., w_110, w_21, ..., w_ij, ..., w_61, ..., w_610), where i (from 1 to 6) and j (from 1 to 10) index the neurons in the two layers. A visualization of the structure, highlighting how the weights form the connections between layers and how they are updated, is shown in Fig. 2. In Fig. 2(b), the top row is the first epoch and the bottom row is the sixtieth epoch. As the weights are set to zero at first, the colors start out all white; the vertical direction shows the change of the weights over epochs. During unsupervised learning, some weights become very dark along the vertical direction while others do not. According to the earlier numerical analysis, the more important a feature is, the more knowledge the hidden layer needs to learn from it, and thus the bigger the difference will be. Therefore, we infer that all features appear useful in unsupervised learning, especially exposure and curve deflection (input neurons 1 and 6). After Steps 1 and 2 are complete, the reconstruction equation is applied to the hidden layer to reconstruct the input data. After comparison, the patterns in the reconstructed features from the two hidden layers are also similar, which can be a sign of equal feature learning ability.
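A diagram like Fig. 2(b) can be reproduced with a few lines of matplotlib under the assumptions described above (one row per epoch, darker cells for weights that have moved further from their zero initialization); `weight_history` is assumed to be the per-epoch record of W1, e.g., an array of shape (60, 6, 10). This is an illustrative sketch, not the authors' plotting code.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_weight_change(weight_history):
    """Rows: epochs (top = first epoch); columns: the 6 x 10 weights of W1, flattened."""
    epochs = weight_history.shape[0]
    flat = np.abs(weight_history.reshape(epochs, -1))   # distance from the zero initialization
    plt.imshow(flat, cmap="Greys", aspect="auto")       # darker cell = larger |w|
    plt.xlabel("weight index (feature x hidden unit)")
    plt.ylabel("training epoch")
    plt.title("Change of W1 during unsupervised learning")
    plt.show()
```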
Fig. 2. Applying visualization in unsupervised learning. (a) The structure of the model used in the experiment. (b) The trained W1 after 60 epochs.
Fig. 3. Applying visualization in supervised learning. (a) The change of weights between input and hidden layer 1. (b) The change of weights on each feature.
C. Applying ViFI in Fine-tuning
The process then moves on to supervised training (fine-tuning), in which the model is trained for 5000 iterations, and the change of the weights between the input layer and hidden layer 1 is visualized (see Fig. 3(a)) and compared with the previous results from Step 2 in Fig. 2. This step studies how the black box uses the teacher's signal in supervised learning; it also acts as a validation of, and assists, the self-learning process. After fine-tuning, the weights that join each feature to the black box are drawn in Fig. 3(b). In each sub-figure of Fig. 3(b), the X axis shows the 5000 iterations while the Y axis shows the value of the weights. There are ten lines in each sub-figure, each representing a specific weight between one feature and a neuron in hidden layer 1. Applying the same analysis as in Section III, if the weights increase, the corresponding feature is considered more important than before; if they decrease, it could be a sign of a wrong judgement made during self-learning. Moreover, the principle of sparse connections suggests that the weights should become dispersed; otherwise over-fitting could result. According to Fig. 3, the weights of the first feature drop slightly at first and then increase again, the weights of the second feature keep decreasing, those of the third and fourth features increase at first and then fall slightly, and the weights of features five and six increase throughout. Based on the previous analysis, we conclude that the model has reduced the magnitude of Feature 2 and increased the importance of Features 5 and 6. In addition, the figure shows that the model maintains sparse connections, which indicates that the training is good and no over-fitting exists.
By using this equation, the calculated results are found to be [0.428, 0.117, 0.143, 0.084, 0.087, 0.393] for the six features (exposure, AADCT, left shoulder width, median width, right shoulder width, curve deflection), respectively. After fine-tuning, as the weight updating is based on a nonlinear function, we define the changes of the contributions using a sigmoid function; the contributions are calculated to be [0.928, –0.321, 0.688, 0.589, 0.635, 1.015]. The result is then plotted to compare the judgements (contributions of the features) in the two stages in Fig. 4 and Table II. In Fig. 4, the features (from left to right) are exposure, annual average daily commercial traffic (AADCT), left shoulder width, median width, right shoulder width, and curve deflection. In the beginning, Features 1 and 6 are found to be significant by self-learning (blue bars). After fine-tuning, the result is modified: all the features become more important except the second feature, AADCT, which surprisingly is even considered a distraction (negative contribution) to the training.
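The exact sigmoid-based definition used for the fine-tuning contributions is not reproduced in this section, so the following is only a plausible stand-in: it assumes the contribution is a sigmoid of the mean weight change per feature during supervised training, rescaled so that features whose weights shrink receive negative scores. It is not claimed to reproduce the values reported above.

```python
import numpy as np

def finetune_contribution(w_start, w_end, scale=1.0):
    """Illustrative sigmoid-based score of how fine-tuning revised each feature.

    w_start, w_end: (V, H) weight matrices before and after fine-tuning.
    Returns a length-V vector; negative values indicate a feature demoted by fine-tuning.
    """
    mean_change = (w_end - w_start).mean(axis=1)               # average change per feature
    return 2.0 / (1.0 + np.exp(-scale * mean_change)) - 1.0    # squashed into (-1, 1)
```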
TABLE II Model Testing Comparison
Fig. 4. The calculated feature importance in the two training stages.
D. Sensitivity Analysis and Discussion
The experiments above highlight the information and understanding gained from each step of the ViFI process. Step 2 expresses the importance of features in unsupervised learning, Step 3 presents the understanding of the black box, and Step 4 provides a fine-tuning calculation of feature magnitudes. For the specific case study considered, Features 1 and 6 (exposure and curve deflection) were confirmed to be the most important, while Features 3, 4, and 5 (left shoulder width, median width, and right shoulder width) were also found to be significant contributors. Feature 2 (AADCT) was identified as a distraction to the model. At the same time, Step 3 also showed that the second hidden layer was as important as the first layer because it has similar feature learning ability. To calibrate these findings, a sensitivity experiment is designed and implemented, as sketched below. A model of size 5-10-10-1 (input, hidden layer 1, hidden layer 2, output) with the same learning rate and number of epochs is designed. Then, six models are trained, each with one specific feature removed (starting with Feature 1 and ending with Feature 6). Finally, a model with all the features is trained as the baseline. An additional model with one of the hidden layers removed is also trained for comparison. The results are shown in Fig. 5 and Table II.
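The design of this sensitivity experiment can be organized as a simple leave-one-feature-out loop. The sketch below is illustrative only: `train_and_test` stands in for whatever routine trains the chosen model (e.g., a 5-10-10-1 R-DBN) and returns its testing MAE, and is not part of the original paper.

```python
import numpy as np

def leave_one_out_sensitivity(X, y, train_and_test, feature_names):
    """Train one baseline model and one model per excluded feature; report testing MAE.

    X: (n_samples, n_features) input matrix, y: collision counts,
    train_and_test(X_subset, y): user-supplied callable returning the testing MAE.
    """
    results = {"baseline (all features)": train_and_test(X, y)}
    for i, name in enumerate(feature_names):
        X_reduced = np.delete(X, i, axis=1)            # drop feature i
        results[f"without {name}"] = train_and_test(X_reduced, y)
    return results
```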
Fig. 5 indicates that when Feature 1 is excluded, the testing MAE increases dramatically; however, when a model without Feature 2 is trained, the result outperforms the baseline model. The left sub-figure in the first row compares the minimum testing MAE obtained by excluding each feature. It is evident that after eliminating the second feature (AADCT), the model outperforms all the others. However, if the first feature (exposure) is deleted, the performance becomes much worse, and when the other features are excluded, the results also worsen to various degrees. The same trend appears in the second figure of the first row when looking at the average testing MAE. It should be noted that when hidden layer 2 is deleted, the testing result is better than that of the regular DBN when the training dataset is small; this is a reasonable result because the dataset used in this analysis is small, and a large model can easily be overfitted during training. This result confirms the theory discussed previously. Fig. 6 shows a comparison of performance between the R-DBN without Feature 2 and three other models, NB, KR, and R-DBN, as a function of data size. As reported in our previous paper [20], the performance of NB does not change substantially as the training data increase. Similarly, for KR, the best results show some improvement but eventually reach a limit. The R-DBN method clearly shows an improvement, especially as the training data increase. In contrast, the decoded R-DBN, by eliminating the unwanted feature, achieves much better performance than the others. In Fig. 6(a), its minimum testing MAE is lower than the others throughout; in Fig. 6(c), it beats KR at a training data percentage of 40, much sooner than the previous R-DBN. The sub-figures in the second and third rows of Fig. 6 show the MAE by training dataset sample size using different model parameters. The central mark in each box is the median, and the edges of the box are the 25th and 75th percentiles. The MAE is high at low data sizes but decreases quickly as the data size increases. Generally, the lower the box, the more accurate the prediction, and the narrower it is, the more robust the model. Therefore, these figures provide another way of verifying the effectiveness of the model. In Table II, four models are compared: negative binomial (NB), one of the most popular models used in real-world applications; kernel regression (KR) and back propagation neural networks (BPNN), two popular traditional machine learning methods; and the regularized deep belief network (R-DBN), an improved version of the DBN, one of the most significant models in deep learning. From the results, R-DBN demonstrates excellent performance compared with the other traditional models, and the decoded R-DBN outperforms the original version, achieving a minimum MAE of 7.58 and a minimum RMSE of 15.03. Moreover, based on the results, the feature importance obtained using a traditional numerical method and the deep neural net are compared in Table II. Similar trends in feature importance can be observed, which shows that the deep neural net not only correctly identifies unwanted features but also makes better use of useful features.
Fig. 5. Testing results applying ViFI.
Fig. 6. Results comparison. (a) Testing minimum MAE. (b) Testing maximum MAE. (c) Testing average MAE.
Fig. 7. Testing result on extra noisy features.
E. Model Testing With Extra Noisy Features
The experiments above have shown that the deep neural network model is capable of distinguishing the contributions of features, but because Feature 1 (exposure) is already derived from traffic volume (AADT), it is reasonable to conclude that Feature 2 (AADCT) is a redundant feature. More evidence is therefore required: for example, can the model distinguish features that are totally unrelated to the dataset? To answer this question, in this experiment the model is further tested with deliberately designed extra noisy features.
Two extra input features are designed: Feature 7 takes random values between 0 and 1, and Feature 8 is a constant value of 0.5. The model size is set to 8 inputs, two hidden layers with 10 units each, and 1 output unit; the other training parameters are kept as before. After the training process, the feature importance is analyzed in Fig. 7. Fig. 7(a) shows the result in unsupervised learning using ViFI, and Fig. 7(b) shows the conclusion after fine-tuning. According to Fig. 7(a), the noisy features are easily distinguished in unsupervised learning because their connections remain weak. According to Fig. 7(b), the model confirms its previous judgement on the noisy features and assigns them large negative contributions. In addition, the figure shows that the model still makes the same decisions on the original features as before in Fig. 5. Table III compares the feature importance with and without the extra noisy features. It can be observed that the model makes very similar judgements on the six original features, and the two extra unrelated features are easily identified, with quantified contributions that are interestingly close (–0.502 and –0.501).
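Constructing the augmented input matrix for this experiment is straightforward. A minimal sketch, assuming `X` is the original feature matrix scaled to [0, 1] (names and shapes here are illustrative):

```python
import numpy as np

def add_noisy_features(X, rng=None):
    """Append Feature 7 (uniform random in [0, 1]) and Feature 8 (constant 0.5) to X."""
    if rng is None:
        rng = np.random.default_rng(0)
    f7 = rng.random((X.shape[0], 1))        # unrelated random noise
    f8 = np.full((X.shape[0], 1), 0.5)      # uninformative constant
    return np.hstack([X, f7, f8])
```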
TABLE III Feature Importance Comparison
In this paper, we have proposed a visual feature importance (ViFI) method for deep neural networks to quantify the significance of features in a road safety performance function. The method utilizes numerical analysis and visualization to help developers understand, in a more intuitive and quantitative way, the differences in feature learning between machines and humans, and it also helps us better understand and study the training process of deep learning. Our study highlights that, in unsupervised learning, larger changes in the weights correlate strongly with the apparent importance of a feature, while during fine-tuning, larger increases in the weights highlight the significance of the features they connect to. It should be noted that these conclusions assume that the training maintains the condition of sparse connections and that no over-fitting occurs. Finally, the effects of the six features identified as contributing factors to vehicle collisions on a highway were evaluated using the decoded model. The experiments also indicate that the model successfully distinguished useful from useless features, and that, by refining the training dataset accordingly, a more accurate and robust model is trained.
Despite the benefits and results achieved here, this research also has many issues and uncertainties that need further study. The first is the input dimension problem, which arises when a global general model has inputs with large dimensions, especially in settings where there are not only many input neurons but also neurons without specific physical meanings, because the features are too abstract and are distributed over a region rather than concentrated in a single neuron [37], [38]. To address this, a generalized version (the convolutional DBN, for example) should be studied. The second issue is: even if we fully understand the learning process inside the black box, should we re-teach the model if we find that it does not attend to the features we want it to learn, even when the output is perfectly correct? Would this change the learning process that we as humans have been using for years? While these questions may ultimately be answered in the future, for now the only conclusion that can be drawn is that machines learn differently from humans and, in some cases, differently from other machines.