Development of QSAR Model for Predicting the Mutagenicity of Aromatic Compounds①
LIU Yang-Hua(a)  ZHOU Zhi-Xiang(a)②  ZHANG Xiao-Long(a)  LI Han-Dong(b)
(a) (100124); (b) (100012)
A quantitative structure-activity relationship (QSAR) model was developed for predicting the mutagenicity of aromatic compounds. Log revertants data for the S. typhimurium TA98 strain from the Ames test were collected. The 225 aromatic compounds were randomly divided into a training set of 186 molecules and a test set of 39 molecules. Multiple linear regression (MLR) analysis was used to select six descriptors from thousands of descriptors calculated by the semi-empirical AM1 and E-dragon methods. The final six-descriptor QSAR model was validated both internally and externally. In addition, to assess the utility of the QSAR model for chemical evaluation, three aromatic compounds were tested experimentally to check the predictive ability and reliability of the model. The compounds were not selected on the basis of their predictions, so they span the range of predicted values. The resulting Ames test data were in good agreement with the predictions and support this approach as a useful means of predicting the likely mutagenic risk of aromatic compounds.
Keywords: aromatic compounds, quantitative structure-activity relationship (QSAR), multiple linear regression (MLR), mutagenicity, Ames test
Aromatic compounds include aromatic amines, nitro and nitroso compounds, halides and hydroxylamines. They are widely used in the chemical and pharmaceutical industries as solvents, cooling agents, polymers, insecticides, herbicides, drugs, food additives and so on[1-3]. Their widespread use leads to their release into the environment. Most benzene derivatives have been reported to be carcinogenic and mutagenic to living systems, including humans, animals and plants[1, 4]. The standard experimental measurements (animal and in vitro testing) used to evaluate the toxic effects of exposure to existing chemicals and their mixtures are too costly and time-consuming to assess the toxicological risks of the growing number of new benzene derivatives. One of the chief alternatives to animal and in vitro toxicity testing is the quantitative structure-activity relationship (QSAR) method. A QSAR is a mathematically derived rule that quantitatively describes the activities of compounds in terms of their molecular attributes[5, 6]. The aim of the QSAR method is to develop a consistent relationship between an activity and the physicochemical properties of a series of compounds, so that the relationship can be used to predict the activity of similar chemicals that have not been tested experimentally[2, 7]. In addition, QSAR models may be useful in understanding and rationalizing the mechanisms of biological action within a series of compounds[8].
One of the major methods used to evaluate the mutagenicity of benzene derivatives is the Ames test, which has been widely used for decades[9]. The Ames test (Salmonella typhimurium his reversion assay) is simple and inexpensive, and it avoids the shortcomings of the rodent carcinogenicity bioassay. Several studies have used Ames mutagenicity data to build QSAR models for particular classes of aromatic compounds, such as aromatic amines or nitro compounds[10-14]. However, a QSAR model covering the mutagenicity of aromatic compounds as a whole has not yet been established.
In the present paper, we report the results of QSAR studies carried out on a dataset of 225 aromatic compounds. Log revertants data for the TA98 strain from the Ames test were used to build the QSAR model of mutagenicity. The molecular descriptors of the final model were obtained by computational molecular simulation, and multiple linear regression (MLR) analysis was employed to develop the model, which correlates the molecular descriptors with the experimental mutagenic activity. Besides internal and external validation, we also used experimental verification to check the predictive ability and reliability of the model.
2. 1 Data sets
The mutagenicity data of 225 aromatic compounds published by Basak et al.[12] and Nair et al.[14] were used to develop the QSAR model. The mutagenic activities of these compounds in the S. typhimurium TA98 strain are expressed as log R, where R is the number of revertants/nmol. The compounds were randomly divided into a training set of 186 molecules and a test set of 39 molecules. The dataset covers diverse structural variations, ranging from a single benzene ring to six fused benzene rings. The CAS numbers of the compounds and their experimental log R values are listed in Table 1.
Table 1. Experimental and Predicted log R Values of Aromatic Compounds
“*” means the test set.
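The random division into 186 training and 39 test compounds can be sketched as follows. This is only an illustration: the function name `split_dataset` and the fixed seed are assumptions, since the source does not say how the random split was generated.

```python
import random

def split_dataset(compound_ids, n_test=39, seed=0):
    """Randomly hold out n_test compounds; the rest form the training set.

    compound_ids must be a list (for example CAS numbers).
    The seed is a hypothetical choice made only for reproducibility.
    """
    rng = random.Random(seed)
    test = set(rng.sample(compound_ids, n_test))
    train = [c for c in compound_ids if c not in test]
    return train, sorted(test)

# 225 placeholder identifiers standing in for the real compounds
ids = list(range(1, 226))
train_ids, test_ids = split_dataset(ids)
```

With 225 identifiers the call returns 186 training and 39 test entries with no overlap.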
2. 2 Software
The structures of the aromatic compounds were obtained from EPI (Estimation Programs Interface) Suite™ and energy-minimized with the semi-empirical AM1 Hamiltonian in the Scigress (v. 7.7) package. The molecular descriptors were calculated with the Dragon program (v. 3.0) (http://www.disat.unimib.it/chm) from the minimum-energy molecular geometries. MATLAB (R2010b) and SPSS (v. 18.0) (http://www.spss.com/spss) were used for the elimination of redundant data and the development of stepwise MLR models, respectively.
2. 3 Calculation and model building
To avoid multicollinearity among the independent variables, redundant descriptors were eliminated from the thousands of molecular descriptors produced by the Dragon software. In the first step of data cleaning, constant and almost constant variables were removed from the descriptor pool. In the second step, for each group of descriptors with high inter-correlation (> 0.9), the descriptor with the lower correlation coefficient with the property y was deleted and the other was kept. After these two steps, 497 descriptors remained for stepwise multiple linear regression (MLR).
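The two cleaning steps can be sketched in a few lines of Python. This is an illustrative reading of the procedure, not the MATLAB code used in the study: the function names, the 0.95 threshold for "almost constant" and the greedy ordering by correlation with y are all assumptions.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def prefilter(descriptors, y, max_same_fraction=0.95, r_cut=0.9):
    """descriptors: dict mapping descriptor name -> list of values, one per compound.

    Returns the names that survive both cleaning steps.
    """
    n = len(y)
    # Step 1: drop constant and almost-constant descriptors
    # (hypothetical rule: one value shared by 95% or more of the compounds).
    kept = []
    for name, col in descriptors.items():
        most_common = max(col.count(v) for v in set(col))
        if most_common / n < max_same_fraction:
            kept.append(name)
    # Step 2: visit descriptors from most to least correlated with y, so that
    # within a highly inter-correlated group the better one is kept.
    kept.sort(key=lambda k: abs(pearson(descriptors[k], y)), reverse=True)
    selected = []
    for name in kept:
        if all(abs(pearson(descriptors[name], descriptors[s])) <= r_cut
               for s in selected):
            selected.append(name)
    return selected

# Toy data: "b" nearly duplicates "a", "c" is constant.
toy = {
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 6, 8, 10, 12.5],
    "c": [1, 1, 1, 1, 1, 1],
    "d": [3, 1, 4, 1, 5, 9],
}
y_toy = [1, 2, 3, 4, 5, 6]
survivors = prefilter(toy, y_toy)
```

In the toy data, "c" is removed in the first step and "b" in the second, because it is highly correlated with "a" but slightly less correlated with y.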
MLR models are often used to describe QSARs in the form[15]

$$Y = b_1X_1 + b_2X_2 + \cdots + b_mX_m + b_0 \qquad (1)$$

where $Y$ is the predicted property (dependent variable); $X_1, X_2, \cdots, X_m$ are the $m$ descriptors (independent variables); $b_1, b_2, \cdots, b_m$ are the regression coefficients and $b_0$ is the constant term (intercept).
The stepwise MLR model was calculated with SPSS. The significance level for a variable to enter the model was set at 0.05, and the level for removal at 0.10. The final model was built with the six significant descriptors selected by stepwise MLR.
2. 4 Model validation
Model validation is the final part of QSAR model development before application. Both internal and external validation were used to evaluate the quality of the model in our study.
In general, the estimation ability was judged by several statistical parameters, such as the squared correlation coefficient ($R^2$), the correlation coefficient ($R$), the root mean square error (RMSE) and the standard error ($s$). In most cases, the closer $R^2$ is to 1 and the smaller RMSE and $s$ are, the stronger the estimation ability of the model. The formulas of these statistics are shown below:
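$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n}} \qquad (3)$$

$$s = \sqrt{\frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{n - m - 1}} \qquad (4)$$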
In the above equations, $\hat{y}_i$ and $y_i$ indicate the predicted and observed toxicity values, respectively, of the training set compounds; $n$ is the number of training set samples and $m$ is the number of independent variables.
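As a small worked example, the two error statistics can be computed from observed and predicted values as below. The sketch uses the usual textbook definitions (sum of squared residuals divided by n for RMSE and by n - m - 1 for s), which is an assumption about the exact form used in the study, and the function name is hypothetical.

```python
import math

def rmse_and_s(y_obs, y_pred, m):
    """Return (RMSE, s) for a model with m independent variables."""
    n = len(y_obs)
    # Sum of squared residuals between observed and predicted values
    sse = sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))
    rmse = math.sqrt(sse / n)
    s = math.sqrt(sse / (n - m - 1))
    return rmse, s

# Toy values, not data from the study
rmse, s = rmse_and_s([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0], m=1)
```

Because s divides by the residual degrees of freedom rather than by n, it is always slightly larger than RMSE for the same residuals.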
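The stability of a QSAR model is usually assessed by leave-one-out (LOO) cross-validation, one of the most common methods of internal validation. Its outcome, $q^2$, is commonly regarded as an ultimate criterion of both the stability and the predictive ability of the model[16]. The $q^2$ value is calculated as shown below:

$$q^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_{(i)})^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2} \qquad (5)$$

where $\hat{y}_{(i)}$ is the value predicted for compound $i$ by the model fitted without that compound and $\bar{y}$ is the mean observed value of the training set.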
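External validation has been considered more reliable than internal validation techniques for judging the prediction potential of QSAR models[5]. The external predictive capacity of the model was determined by $R^2_{\mathrm{test}}$, which was obtained by applying the model to predict the mutagenicity values of the test set. The $R^2_{\mathrm{test}}$ value is calculated as shown below:

$$R^2_{\mathrm{test}} = 1 - \frac{\sum_{i=1}^{n_{\mathrm{test}}}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n_{\mathrm{test}}}(y_i - \bar{y}_{\mathrm{train}})^2} \qquad (6)$$

where $n_{\mathrm{test}}$ is the number of test set compounds; $y_i$ is the experimental value of test compound $i$; and $\hat{y}_i$ and $\bar{y}_{\mathrm{train}}$ indicate the predicted value for the test set compound and the average experimental value of the training set, respectively.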
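Both validation coefficients share the same "one minus residual sum over reference sum" structure, which the following sketch makes explicit. The function names are hypothetical, and the LOO predictions are assumed to have been produced elsewhere by refitting the model once per left-out compound.

```python
def q2_loo(y_obs, y_loo_pred):
    """LOO cross-validated coefficient from observed values and LOO predictions."""
    mean = sum(y_obs) / len(y_obs)
    press = sum((o - p) ** 2 for o, p in zip(y_obs, y_loo_pred))
    ss = sum((o - mean) ** 2 for o in y_obs)
    return 1.0 - press / ss

def r2_test(y_obs_test, y_pred_test, y_mean_train):
    """External coefficient; deviations are taken from the TRAINING set mean."""
    press = sum((o - p) ** 2 for o, p in zip(y_obs_test, y_pred_test))
    ss = sum((o - y_mean_train) ** 2 for o in y_obs_test)
    return 1.0 - press / ss

# Toy values, not data from the study
q2_perfect = q2_loo([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
r2_half = r2_test([1.0, 3.0], [1.0, 2.0], y_mean_train=2.0)
```

Using the training set mean in the denominator means the test set is compared against the simplest predictor available before any test data were seen.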
2. 5 Experimental verification-Ames test
In addition to evaluating the model with an external test set, we also used experimental verification to check the predictive ability and reliability of the model we built.
Three aromatic compounds outside the training and test sets were tested according to the standard plate incorporation assay described by Ames[17,18], using Salmonella typhimurium strain TA98 (a gift of Professor Ping Zhang, Beijing University of Technology, China) in the presence of S9 mix (purchased from the Academy of Military Science, China). The three aromatic compounds were 4-aminobenzoic acid (CAS No. 150-13-0), 4-nitroanisole (CAS No. 100-17-4) and 2-nitrobenzaldehyde (CAS No. 552-89-6). All of them were purchased from J&K SCIENTIFIC LTD (Beijing, China) and purified (> 98%) using HPLC. To increase the sensitivity of the Ames test, a pre-incubation step was used for all three compounds. Briefly, a mixture containing the test compound in 0.1 mL of solvent, 0.1 mL of the TA98 cell culture and 0.5 mL of S9 mix was incubated at 37 ℃ for 30 min with shaking at 100 rpm before the bacteria were plated onto the Petri dish.
The mutagenicity value (experimental value) log R was obtained from the Ames test data by converting revertants/microgram of mutagen into log10 revertants/nmol.
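The unit conversion follows from the molar mass: one nanomole of a compound weighs M nanograms, i.e. M/1000 micrograms, where M is the molar mass in g/mol. A minimal sketch, with a hypothetical function name and illustrative numbers rather than measured data:

```python
import math

def log_revertants_per_nmol(revertants_per_ug, molar_mass_g_per_mol):
    """Convert an Ames potency in revertants/microgram to log10(revertants/nmol)."""
    # 1 nmol weighs molar_mass ng, which is molar_mass / 1000 micrograms
    ug_per_nmol = molar_mass_g_per_mol / 1000.0
    revertants_per_nmol = revertants_per_ug * ug_per_nmol
    return math.log10(revertants_per_nmol)

# Illustrative only: 10 revertants/microgram for a compound of molar mass 100 g/mol
example_log_r = log_revertants_per_nmol(10.0, 100.0)
```

In this example 10 revertants/microgram corresponds to 1 revertant/nmol, so log R is 0.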
3. 1 MLR model of mutagenicity
Stepwise MLR analysis and LOO cross-validation were used to develop the mutagenicity model and to evaluate its quality, respectively. The optimum model was determined from the values of both $R^2$ and $q^2$. In general, $R^2$ increases with the number of descriptors (independent variables) in a model. To avoid over-fitting and to select the most valuable model (one with fewer variables and a higher correlation coefficient), an increase in $R^2$ of less than 0.02 was chosen as the breakpoint criterion. At the same time, a higher $q^2$ value was considered an important condition for selecting the optimum model. Finally, a model with six variables was determined to be the optimum model, which is shown below:
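The breakpoint criterion can be expressed as a simple rule over the sequence of $R^2$ values obtained as descriptors are added. In the sketch below the function name is hypothetical and every $R^2$ value except the final model's 0.747 is invented for illustration; the additional check on $q^2$ is not shown.

```python
def choose_model_size(r2_by_size, min_gain=0.02):
    """r2_by_size[k-1] is the R^2 of the best model with k descriptors.

    Returns the number of descriptors after which adding one more
    raises R^2 by less than min_gain.
    """
    for k in range(1, len(r2_by_size)):
        if r2_by_size[k] - r2_by_size[k - 1] < min_gain:
            return k  # descriptor k+1 gains too little, stop at k
    return len(r2_by_size)

# Hypothetical R^2 trajectory for models with 1 to 7 descriptors
r2_values = [0.45, 0.55, 0.62, 0.68, 0.72, 0.747, 0.755]
best_size = choose_model_size(r2_values)
```

With this illustrative trajectory the seventh descriptor adds only 0.008 to $R^2$, so the rule stops at six descriptors.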
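$$\log R = 2.842\,\text{MWC10} - 0.286\,\text{RDF045v} + 0.435\,\text{F07[C-N]} - 0.199\,\text{RDF075m} - 1.242\,\text{B05[C-C]} - 2.352\,\text{GATS2i} - 27.992 \qquad (7)$$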
$R^2 = 0.747$, $R^2_{\mathrm{adj}} = 0.738$, $q^2 = 0.731$, $s = 1.091$, $F = 87.869$, $R^2_{\mathrm{test}} = 0.628$
$n_{\mathrm{train}} = 186$, $n_{\mathrm{test}} = 39$
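The reported statistics can be cross-checked against one another. Taking 0.747 as $R^2$, 0.738 as the adjusted $R^2$ and 87.869 as the F statistic (an interpretation based on the discussion of Eq. (7), since some symbols are ambiguous in this copy), the standard relations for a model with m = 6 descriptors and n = 186 training compounds give nearly the same numbers. The function names are hypothetical.

```python
def adjusted_r2(r2, n, m):
    """Adjusted R^2 for n samples and m independent variables."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - m - 1)

def f_statistic(r2, n, m):
    """Overall regression F statistic computed from R^2."""
    return (r2 / m) / ((1.0 - r2) / (n - m - 1))

adj = adjusted_r2(0.747, n=186, m=6)
f_val = f_statistic(0.747, n=186, m=6)
```

Both values come out close to the reported ones (about 0.739 and 88.1); the small differences are consistent with $R^2$ having been rounded to three decimals before this check.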
The linear relationship between the mutagenicity of the aromatic compounds (represented by log R) and the six Dragon descriptors is shown in Eq. (7). The variance inflation factor (VIF) is a measure of multicollinearity, and a value of less than 10 indicates that the model contains no serious multicollinearity. The VIF values of all descriptors in this model are well below 5 (Table 2), which is considered sufficient to reject linear dependence. The model gives a squared correlation coefficient $R^2$ of 0.747 and an adjusted $R^2$ of 0.738, which indicates a strong correlation. The high $q^2$ value from LOO cross-validation demonstrates the stability of the model. The external validation result, $R^2_{\mathrm{test}}$ = 0.628, shows that the model has strong predictive power for the test set[19]. The plot of experimental versus predicted values for the training and test sets is shown in Fig. 1, which indicates good correlation between the experimental and calculated values.
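For reference, the VIF thresholds quoted here can be related to how well each descriptor is explained by the others. In the usual definition (stated as background, not taken from the source), the VIF of descriptor $j$ is

```latex
\mathrm{VIF}_j = \frac{1}{1 - R_j^2}
```

where $R_j^2$ is the squared multiple correlation coefficient obtained by regressing descriptor $j$ on the remaining descriptors in the model. A VIF below 10 therefore corresponds to $R_j^2 < 0.9$, and a VIF below 5 to $R_j^2 < 0.8$.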
3. 2 Ames testing verification
Established QSAR models are generally validated for predictivity both externally and internally[4]. Because the Ames test is itself a valuable tool for assessing mutagenicity, new experimental Ames data provide a more reliable validation than the test set alone. Therefore, to further evaluate the utility of our QSAR model for aromatic compounds, three compounds, 4-aminobenzoic acid, 4-nitroanisole and 2-nitrobenzaldehyde, were tested for mutagenicity in Salmonella typhimurium strain TA98 in the presence of S9 mix. These chemicals were not selected on the basis of their predictions. The mutagenicity value log R, the quantity used to build the model, was obtained from the Ames test data by converting revertants/microgram of mutagen into log10 revertants/nmol. The experimental log R values of the three compounds were compared with the log R values predicted by the final model; both are shown in Table 3. The comparison between the experimental and predicted log R values of the three compounds further supports the high predictive power of the model.
Table 2. VIFs of Individual Descriptors
Table 3. Comparison of Experimental Values of Ames Assay and Predicted Values of Mutagenic Model
Fig. 1. Plot of experimental versus calculated log R values for the aromatic compounds model
3. 3 Descriptors interpretation
The model not only provides a means of predicting mutagenicity, but can also reveal aspects of the activation mechanism through interpretation of the descriptors it uses. The definitions of the six descriptors in the model are given in Table 4.
Table 4. Definition of Descriptors in the Mutagenicity Model of Aromatic Compounds
MWC10 belongs to the walk and path counts, a class of topological indices based on counting paths, walks and self-returning walks in an H-depleted molecular graph[20]. MWC10 is the molecular walk count of order 10, which is related to molecular branching and size and therefore to the complexity of the molecular graph (i.e., the larger and more complex the molecule, the larger the MWC10 value). In the model, MWC10 has a positive coefficient, which means that the mutagenicity of aromatic compounds increases with the complexity of the molecule.
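To make the idea of a walk count concrete, the sketch below counts all walks of a given length in an H-depleted molecular graph by repeatedly propagating counts along bonds. It returns the raw number of walks; conventions differ between authors and programs (some halve the count or report a logarithm), so this is not claimed to reproduce Dragon's MWC10 values. The function name and graph encoding are assumptions.

```python
def molecular_walk_count(adjacency, order):
    """Total number of walks of length `order` in a molecular graph.

    adjacency: dict mapping atom index -> list of bonded atom indices
    (hydrogens omitted). Walks may revisit atoms and bonds.
    """
    # walks[a] = number of walks of the current length that end at atom a
    walks = {a: 1 for a in adjacency}
    for _ in range(order):
        walks = {a: sum(walks[b] for b in adjacency[a]) for a in adjacency}
    return sum(walks.values())

# Carbon skeletons of cyclopropane and propane
cyclopropane = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
propane = {0: [1], 1: [0, 2], 2: [1]}
ring_walks = molecular_walk_count(cyclopropane, 10)
chain_walks = molecular_walk_count(propane, 10)
```

For the same number of atoms the ring gives a larger count than the chain, which illustrates why the index grows with both size and connectivity.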
RDF045v and RDF075m belong to the radial distribution function (RDF) descriptors. An RDF descriptor can be interpreted as the probability distribution of finding an atom in a spherical volume of radius R. It is independent of the number of atoms and is unique with respect to the 3D arrangement of the atoms. In addition, RDF descriptors can be restricted to specific atom types or distance ranges to represent specific information in a certain region of 3D structure space[21]. RDF045v is Radial Distribution Function-045 (weighted by van der Waals volume) and RDF075m is Radial Distribution Function-075 (weighted by mass). Both RDF descriptors have negative coefficients in the model: the greater the value of RDF045v (and likewise of RDF075m), the lower the mutagenicity of the aromatic compound.
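A common way of writing an RDF descriptor is as a sum of Gaussians centred on the interatomic distances, weighted by the product of the atomic properties of each atom pair. The sketch below follows that general form; the smoothing parameter `beta`, the `scale` factor and the reading of "045" and "075" as radii of 4.5 and 7.5 Å are assumptions, and the exact settings used by Dragon are not given in the source.

```python
import math

def rdf_descriptor(coords, weights, radius, beta=100.0, scale=1.0):
    """Weighted radial distribution function evaluated at one radius.

    coords:  list of (x, y, z) atomic coordinates in angstroms
    weights: atomic property per atom (e.g. mass or van der Waals volume)
    beta:    hypothetical smoothing parameter controlling peak width
    """
    total = 0.0
    n = len(coords)
    for i in range(n - 1):
        for j in range(i + 1, n):
            r_ij = math.dist(coords[i], coords[j])
            # Each atom pair contributes most when its distance equals the radius
            total += weights[i] * weights[j] * math.exp(-beta * (radius - r_ij) ** 2)
    return scale * total

# Two hypothetical atoms 4.5 angstroms apart with weights 2 and 3
pair_coords = [(0.0, 0.0, 0.0), (4.5, 0.0, 0.0)]
pair_weights = [2.0, 3.0]
at_45 = rdf_descriptor(pair_coords, pair_weights, radius=4.5)
at_75 = rdf_descriptor(pair_coords, pair_weights, radius=7.5)
```

The pair contributes fully at 4.5 Å and essentially nothing at 7.5 Å, so a large value at a given radius signals many heavily weighted atom pairs separated by about that distance.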
F07[C–N] and B05[C–C] belong to the 2D atom pair descriptors: F07[C–N] is a frequency atom pair and B05[C–C] is a binary atom pair. The value of F07[C–N] is the number of occurrences of a C–N atom pair at a topological distance of 7. B05[C–C] indicates the presence or absence of a C–C atom pair at a topological distance of 5. In the final model, F07[C–N] has a positive coefficient, while B05[C–C] has a negative coefficient. This suggests that the C–N atom pair in a molecular structure, which relates to an amino or nitro group attached to the benzene ring, may have a strong effect on the mutagenicity of aromatic compounds.
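The two kinds of atom pair descriptor can be illustrated with a breadth-first search over the H-depleted molecular graph. This is a sketch of the general idea under the assumption that topological distance means the number of bonds on the shortest path; the function names and the graph encoding are hypothetical, and aniline is used only as a small example.

```python
from collections import deque

def topological_distances(adjacency, start):
    """Shortest-path bond counts from `start` to every reachable atom (BFS)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        a = queue.popleft()
        for b in adjacency[a]:
            if b not in dist:
                dist[b] = dist[a] + 1
                queue.append(b)
    return dist

def atom_pair_frequency(elements, adjacency, elem_a, elem_b, distance):
    """Number of unordered atom pairs (elem_a, elem_b) at the given distance."""
    count = 0
    atoms = sorted(adjacency)
    for idx, i in enumerate(atoms):
        dist = topological_distances(adjacency, i)
        for j in atoms[idx + 1:]:
            if dist.get(j) == distance and {elements[i], elements[j]} == {elem_a, elem_b}:
                count += 1
    return count

def atom_pair_binary(elements, adjacency, elem_a, elem_b, distance):
    """1 if at least one such pair exists, otherwise 0."""
    return int(atom_pair_frequency(elements, adjacency, elem_a, elem_b, distance) > 0)

# Aniline without hydrogens: ring carbons 0-5, nitrogen 6 bonded to carbon 0
elements = {0: "C", 1: "C", 2: "C", 3: "C", 4: "C", 5: "C", 6: "N"}
adjacency = {0: [1, 5, 6], 1: [0, 2], 2: [1, 3], 3: [2, 4],
             4: [3, 5], 5: [4, 0], 6: [0]}
cn_at_2 = atom_pair_frequency(elements, adjacency, "C", "N", 2)
cc_at_5 = atom_pair_binary(elements, adjacency, "C", "C", 5)
```

In aniline the two ortho carbons are two bonds from the nitrogen, while no two carbons of a single benzene ring are five bonds apart, so a non-zero B05[C–C] requires a larger carbon framework.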
GATS2i, the last variable selected in the final model, is one of the 2D autocorrelations, which are spatial autocorrelations calculated from an H-filled molecular graph weighted by atomic physico-chemical properties[22]. GATS2i is the Geary autocorrelation[23] of lag 2 weighted by ionization potential. GATS2i has a negative coefficient in the mutagenicity model. The value of GATS2i decreases as spatial autocorrelation increases, which means that strong spatial autocorrelation produces small GATS2i values.
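For orientation, the Geary coefficient at lag $d$ is commonly written in the following form. This is the general textbook expression, given here as background; the source does not state the exact formula implemented in Dragon.

```latex
c(d) = \frac{\dfrac{1}{2\Delta}\sum_{i=1}^{A}\sum_{j=1}^{A}\delta_{ij}\,(w_i - w_j)^2}
            {\dfrac{1}{A-1}\sum_{i=1}^{A}(w_i - \bar{w})^2}
```

Here $A$ is the number of atoms, $w_i$ is the atomic property of atom $i$ (ionization potential for GATS2i), $\bar{w}$ is its mean over the molecule, $\delta_{ij}$ equals 1 when the topological distance between atoms $i$ and $j$ equals the lag $d$ (2 for GATS2i) and 0 otherwise, and $\Delta$ is the number of atom pairs at that distance. Values near 0 arise when atoms separated by $d$ bonds have similar property values (strong positive autocorrelation), values near 1 indicate no autocorrelation, and values above 1 indicate negative autocorrelation.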
In the present study, a model for the mutagenicity of aromatic compounds was developed by stepwise MLR analysis using descriptors calculated with the Dragon software. In addition to internal and external validation, experimental verification was used to check the predictive ability and reliability of the model. The results of all these validations show high stability and excellent predictive performance.
(1) Huff, J. Benzene-induced cancers: abridged history and occupational health impact. 2007, 13, 213-221.
(2) Salahinejad, M.; Ghasemi, J. B. 3D-QSAR studies on the toxicity of substituted benzenes to tetrahymena pyriformis: CoMFA, CoMSIA and VolSurf approaches. 2014, 105, 128-134.
(3) Kobetičová, K.; Simek, Z.; Brezovsky, J.; Hofman, J. Toxic effects of nine polycyclic monocyclic aromatic compounds on Enchytraeus crypticus in artificial soil in relation to their properties. 2011, 74, 1727-1733.
(4) Benigni, R.; Passerini, L. Carcinogenicity of the aromatic amines: from structure-activity relationships to mechanisms of action and risk assessment. 2002, 511, 191-206.
(5) Roy, K.; Ghosh, G. QSTR with extended topochemical atom (ETA) indices. 12. QSAR for the toxicity of diverse monocyclic aromatic compounds to tetrahymena pyriformis using chemometric tools. 2009, 77, 999-1009.
(6) McKinney, J. D.; Richard, A.; Waller, C. The practice of structure activity relationships (SAR) in toxicology. 2000, 56, 8-17.
(7) Zhang, X. L.; Zhou, Z. X.; Liu, Y. H.; Fan, X. L.; Li, H. D.; Wang, J. T. Predicting acute toxicity of aromatic amines by linear and nonlinear regression methods. 2014, 33, 244-252.
(8) Puzyn, T.; Leszczynski, J.; Cronin, M. T. Recent advances in QSAR studies. 2009.
(9) Benigni, R. Structure-activity relationship studies of chemical mutagens and carcinogens: mechanistic investigations and prediction approaches. 2005, 105, 1767-1800.
(10) Debnath, A. K.; Debnath, G.; Shusterman, A. J.; Hansch, C. A QSAR investigation of the role of hydrophobicity in regulating mutagenicity in the Ames test: 1. Mutagenicity of aromatic and heteroaromatic amines in Salmonella typhimurium TA98 and TA100. 1992, 19, 37-52.
(11) Benigni, R.; Alessandro, G.; Franke, R.; Gruska, A. Quantitative structure-activity relationships of mutagenic and carcinogenic aromatic amines. 2000, 100, 3697-3714.
(12) Basak, S. C.; Mills, D. R.; Balaban, A. T.; Gute, B. D. Prediction of mutagenicity of aromatic and heteroaromatic amines from structure: a hierarchical QSAR approach. 2001, 41, 671-678.
(13) Trieff, N. M.; Biagi, G. L.; Sadagopa Ramanujam, V. M.; Connor, T. H.; Cantelli-Forti, G.; Guerra, M. C.; Bunce, H.; Legator, M. S. Aromatic amines and acetamides in Salmonella typhimurium TA98 and TA100: a quantitative structure-activity relation study. 1989, 2, 53-65.
(14) Nair, P. C.; Sobhia, M. E. Comparative QSTR studies for predicting mutagenicity of nitro compounds. 2008, 26, 916-934.
(15) Varmuza, K.; Filzmoser, P.; Dehmer, M. Multivariate linear QSPR/QSAR models: rigorous evaluation of variable selection for PLS. 2013, 5, 1-10.
(16) Golbraikh, A.; Shen, M.; Xiao, Z.; Xiao, Y. D.; Lee, K. H.; Tropsha, A. Rational selection of training and test sets for the development of validated QSAR models. 2003, 17, 241-253.
(17) Ames, B. N.; McCann, J.; Yamasaki, E. Methods for detecting carcinogens and mutagens with the Salmonella/mammalian microsomal mutagenicity test. 1975, 31, 347-364.
(18) Maron, D.; Ames, B. N. Revised methods for the Salmonella test. 1983, 113, 173-215.
(19) Golbraikh, A.; Tropsha, A. Beware of q2! 2002, 20, 269-276.
(20) Rücker, G.; Rücker, C. Counts of all walks as atomic and molecular descriptors. 1993, 33, 683-695.
(21) Firoozpour, L.; Sadatnezhad, K.; Dehghani, S.; Pourbasheer, E.; Foroumadi, A.; Shafiee, A.; Amanlou, M. An efficient piecewise linear model for predicting activity of caspase-3 inhibitors. 2012, 20, 1-6.
(22) Scotti, M. T.; Emerenciano, V.; Ferreira, M. J.; Scotti, L.; Stefani, R.; da Silva, M. S.; Mendonça Junior, F. J. Self-organizing maps of molecular descriptors for sesquiterpene lactones and their application to the chemotaxonomy of the asteraceae. 2012, 17, 4684-4702.
(23) Geary, R. C. The contiguity ratio and statistical mapping. 1954, 5, 115-145.
Received 20 September 2014; accepted 16 December 2014
① Supported by the Ministry of Environmental Protection of China (No. 2011467037)
② Corresponding author. Zhou Zhi-Xiang, Ph.D. E-mail: zhouzhixiang@bjut.edu.cn
DOI: 10.14102/j.cnki.0254-5861.2011-0518