LI Jin-Feng LIAO Li-Min,b②
a (College of Chemistry and Chemical Engineering, Neijiang Normal University, Neijiang 641100, China)
b (Key Laboratory of Fruit Waste Treatment and Resource Recycling of Sichuan Provincial College, Neijiang 641100, China)
ABSTRACT Based on the three-dimensional structures of the compounds, the structures of 48 ester compounds were expressed parametrically. Through multiple linear regression (MLR) and partial least-squares regression (PLS), models relating the structures of the ester compounds to their aquatic toxicity log(1/IGC50) were established. The correlation coefficients (R²) of the two models were 0.9974 and 0.9940, and the standard deviations (SD) were 0.0469 and 0.0646, respectively. The stability of the models was evaluated by leave-one-out internal cross-validation. The cross-validation correlation coefficients (RCV²) of the models were 0.9939 and 0.8952, and the standard deviations (SDCV) were 0.0715 and 0.0925, respectively. External samples were used to test the predictive ability of the models; the correlation coefficients (Rtest²) of the external predictions were 0.9955 and 0.9955, and the standard deviations (SDtest) were 0.0720 and 0.0716, respectively. The molecular structure descriptors successfully represented the structural characteristics of the compounds, and the models showed good fitting, strong stability and high prediction accuracy. The present study provides a useful reference for studying the structure-toxicity relationships of toxic compounds in the environment.
Keywords: ester compounds, structural characterization, aquatic toxicity, simulation prediction
Esters are an important class of compounds produced in high volume. They are often used in the production of plastics and are commonly found in plastic pipes, furniture, floors, car interiors, insect repellents and cosmetics. Methyl p-hydroxybenzoate, the methyl ester of p-hydroxybenzoic acid (PHBA), is widely used in cosmetics, toothpaste, hair care products, moisturizers and deodorants. Because esters are so widely applied, more and more ester compounds enter the water environment and harm animals and plants[1-3]. Comprehensive acquisition of the property parameters of organic compounds is of great significance for standardizing their production and application[4,5]. At present, the toxicity of ester compounds is mainly determined by experiments, which consume resources such as chemical reagents and time. Moreover, the number of such compounds is huge, and it is difficult to measure all of their parameters by experimental means alone. The study of the relationship between the structures and properties of compounds is therefore of great significance for analyzing and evaluating their properties and environmental behaviors, and for assisting in their identification.
The parameterized characterization of compound structures is one of the key steps in establishing relationships between structures and properties. At present, two-dimensional structure characterization methods[6-8] and three-dimensional structure characterization methods[9-11] are widely used. Two-dimensional methods are simple and fast, but they hardly reflect the three-dimensional structural characteristics of compounds and cannot distinguish phenomena such as cis-trans isomerism. Three-dimensional methods are relatively complicated, but they are calculated from the three-dimensional structures of the molecules and can distinguish various kinds of isomerism. In the present study, three-dimensional structure descriptors were used to characterize the structures of some ester compounds; multiple linear regression (MLR) and partial least-squares regression (PLS) were then used to establish models relating compound structure to toxicity, and the structural factors affecting toxicity were analyzed. This paper can provide a reference for the study of the structure-property relationships of ester compounds.
In the present study, two QSAR models were proposed for modeling and predicting the aquatic toxicity log(1/IGC50) of 48 aliphatic esters. The experimental toxicity values toward the ciliate protozoan Tetrahymena pyriformis were taken from the literature[12]. The samples were divided into training and test sets, and the test set samples are marked with "*".
Table 1. Compounds and Their Toxicity Values
No.   Compound                  Exp.     Cal.1    Err.1     Cal.2    Err.2
41    Allyl heptanoate          0.7282   0.8174   0.0892    0.8435   0.1153
42*   Methyl nonanoate          1.0419   1.1275   0.0856    1.1337   0.0918
43    Vinyl 2-ethylhexanoate    1.0462   0.9630   –0.0832   0.9030   –0.1432
44    Octyl acetate             1.0570   1.0820   0.0250    1.1547   0.0977
45    Tert butyl formate        1.3719   1.3398   –0.0321   1.4095   0.0376
46    Methyl decanoate          1.3778   1.3225   –0.0553   1.2409   –0.1369
47    Methyl undecanoate        1.4248   1.5058   0.0810    1.4757   0.0509
48*   Decyl acetate             1.8794   1.8081   –0.0713   1.7688   –0.1106
2.2.1 Characterization of the compound structure
The three-dimensional holographic vector of atomic interaction field (3D-HoVAIF)[13-15] starts from two spatial invariants of the three-dimensional structure of a molecule, namely the relative distances between atoms and the properties of the atoms themselves, and is based on three classical non-bonding interaction modes between atoms: electrostatic, steric and hydrophobic interactions. It provides three-dimensional vector descriptors for characterizing the molecular structures of compounds without any experimental parameters. The molecules of common organic compounds usually contain hydrogen, carbon, nitrogen, phosphorus, oxygen, sulfur, fluorine, chlorine, bromine and iodine. These elements belong to five main groups of the periodic table (IA, IVA, VA, VIA and VIIA), so the atoms can be divided into 5 categories. To characterize the microenvironment of the molecular structure more accurately, the atoms were further subdivided into 10 types according to their hybridization state: (1) H; (2) C(sp3); (3) C(sp2); (4) C(sp); (5) N(sp3), P(sp3); (6) N(sp2), P(sp2); (7) N(sp), P(sp); (8) O(sp3), S(sp3); (9) O(sp2), S(sp2); and (10) F, Cl, Br, I. The number of distinct atom-type pairs that can interact within a molecule is therefore at most 10 × (10 + 1)/2 = 55. 3D-HoVAIF uses three potential energies (electrostatic, steric and hydrophobic) to express the different forms of interaction, so an organic molecule has at most 3 × 55 = 165 atomic interaction terms to characterize its structural information. Although these atomic interaction terms are not a direct representation of the compound, in most cases the 3D-HoVAIF descriptors contain a wealth of information on the potential energy distribution of organic compounds and can characterize the microenvironment of the molecules well.
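The descriptor counts quoted above follow from counting unordered atom-type pairs (same-type pairs included) and multiplying by the number of potentials. A minimal sketch of that arithmetic is given below; the function names are illustrative and are not part of 3D-HoVAIF.

```python
def pair_count(n_types: int) -> int:
    """Number of unordered atom-type pairs, same-type pairs included."""
    return n_types * (n_types + 1) // 2


def descriptor_count(n_types: int, n_potentials: int = 3) -> int:
    """Total interaction terms: one per type pair and per potential."""
    return n_potentials * pair_count(n_types)


# 10 atom types and 3 potentials, as in the general 3D-HoVAIF scheme
print(pair_count(10), descriptor_count(10))
```

The same functions applied to a molecule set that contains only a subset of the atom types give the correspondingly smaller number of non-trivial descriptors.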
2.2.1.1 Electrostatic interaction
The electrostatic effect between atoms is proportional to their charges and inversely proportional to the distance between them. As an important form of non-bonding interaction, the electrostatic interaction was expressed by the classical Coulomb's law (Eq. (1)), where rij (nm) is the Euclidean distance between atoms i and j; e is the elementary charge, 1.6021892 × 10⁻¹⁹ C; ε0 is the vacuum permittivity, 8.85418782 × 10⁻¹² C²/(J·m); Z is the net charge of an atom in units of the electron charge; and m and n are the atom types. The electrostatic potential between all pairs of atoms in the molecule was calculated by this formula and summed into 55 electrostatic interaction terms according to atom type.
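As an illustration of how Coulomb terms can be accumulated per atom-type pair, the sketch below uses the constants quoted in the text. Several points are assumptions rather than details taken from the paper: energies are returned in joules with distances converted from nm to m, every atom pair i < j is included, and the helper names and input layout are hypothetical.

```python
import math

E_CHARGE = 1.6021892e-19  # elementary charge in C (value quoted in the text)
EPS0 = 8.85418782e-12     # vacuum permittivity in C^2/(J*m) (value quoted in the text)


def coulomb_energy(z_i, z_j, r_nm):
    """Coulomb energy in joules for net charges z_i, z_j (units of e) at r_nm nanometres."""
    r_m = r_nm * 1e-9  # assumed unit conversion: nm to m
    return (E_CHARGE ** 2) * z_i * z_j / (4.0 * math.pi * EPS0 * r_m)


def electrostatic_terms(atoms, distance):
    """Sum pair energies into one term per unordered atom-type pair.

    atoms: list of (type_index, net_charge); distance(i, j) returns nm.
    Assumption: all pairs i < j are counted, bonded or not.
    """
    terms = {}
    for i in range(len(atoms)):
        for j in range(i + 1, len(atoms)):
            key = tuple(sorted((atoms[i][0], atoms[j][0])))
            energy = coulomb_energy(atoms[i][1], atoms[j][1], distance(i, j))
            terms[key] = terms.get(key, 0.0) + energy
    return terms
```

For example, `electrostatic_terms([(1, 0.1), (1, 0.1), (2, -0.2)], lambda i, j: 0.15)` returns one repulsive term for the type pair (1, 1) and one attractive term for (1, 2); the charges and the distance here are made-up inputs.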
2.2.1.2 Steric interaction
The steric interaction is the nondipole-dipole or dipole-induced interaction between atoms in space. The Lennard-Jones equation was used to describe this mode of action (Eq. (2)), where εij = (εii·εjj)^(1/2) is the depth of the atom-pair potential energy well, with values taken from the literature[16]; D is an empirically derived correction constant for the interatomic interaction energy, taken as 0.01[17]; and Rij* = (Ch·Rii* + Ch·Rjj*)/2 is the corrected van der Waals radius of the atom pair, in which the correction factor Ch is 1.00 for sp3 hybridization, 0.95 for sp2 hybridization and 0.90 for sp hybridization[17].
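The combining rules stated above can be sketched as follows. The well-depth and radius rules and the constants Ch and D come from the text; the 12-6 functional form in `steric_energy` is an assumption (a common way of writing the Lennard-Jones potential in terms of the minimum-energy distance), since Eq. (2) itself is not reproduced here. All function names and the example parameter values are hypothetical.

```python
import math

CH = {"sp3": 1.00, "sp2": 0.95, "sp": 0.90}  # correction factors quoted in the text
D = 0.01                                      # correction constant quoted in the text


def well_depth(eps_ii, eps_jj):
    """Geometric-mean combining rule for the pair well depth."""
    return math.sqrt(eps_ii * eps_jj)


def pair_radius(r_ii, hyb_i, r_jj, hyb_j):
    """Corrected pair van der Waals radius, one Ch factor per atom."""
    return (CH[hyb_i] * r_ii + CH[hyb_j] * r_jj) / 2.0


def steric_energy(r_ij, eps_ij, r_star):
    """ASSUMED form: eps_ij * D * [(R*/r)^12 - 2 (R*/r)^6]."""
    x = (r_star / r_ij) ** 6
    return eps_ij * D * (x * x - 2.0 * x)
```

Under this assumed form the energy reaches its minimum, -εij·D, at rij = Rij*, which is the usual reading of a well depth.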
2.2.1.3 Hydrophobic interaction
Hydrophobic interaction is one of the factors that affect the properties of compounds. Because the 3D-HoVAIF descriptors must express the interactions between atoms within the molecule, the HINT method proposed by Kellogg et al.[18-22] was used to express this type of potential field. HINT defines a simple expression for the hydrophobic interaction between two atoms (Eq. (3)), where S is the solvent-accessible surface area (SASA) of an atom, i.e., the surface traced out when a water-molecule probe (van der Waals radius 0.14 nm) rolls over the surface of the atom[23]; T is a binary discriminant function that indicates the direction of the entropy effect of the hydrophobic interaction between different types of atoms[18-22]; and a is the atomic hydrophobicity constant, taken from the literature[24].
ChemOffice 2006 was used to construct the three-dimensional molecular structures of the studied samples. The MOPAC semi-empirical quantum chemistry program bundled with Chem3D was used to optimize the molecular structures at the AM1 level and to obtain the position coordinates of the atoms, and Mulliken population analysis was used to calculate the net charge (in units of e) of each atom in a single-point calculation. As an example, the three-dimensional structure of ethyl acetate is shown in Fig. 1, and the position coordinates and net charge of each of its atoms are listed in Table 2. The coordinates of the atoms were used to calculate the interatomic distances rij, and finally the 165 3D-HoVAIF descriptors were obtained from Eqs. (1), (2) and (3).
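The step from atomic coordinates to the interatomic distances rij can be sketched as below. The coordinates in the usage note are made-up values chosen for easy checking, not the Table 2 data, and the function name is illustrative. The unit of the exported coordinates is not stated here, so a conversion to nm may be needed before the distances are used with Eq. (1).

```python
import math


def distance_matrix(coords):
    """Symmetric matrix of Euclidean distances between 3D points."""
    n = len(coords)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = math.dist(coords[i], coords[j])
    return d
```

For instance, `distance_matrix([(0, 0, 0), (3, 4, 0), (0, 0, 12)])` gives distances of 5, 12 and 13 between the three hypothetical points, in whatever unit the coordinates carry.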
Fig. 1. Three-dimensional structure diagram of ethyl acetate
Table 2. Partial Charges and Coordinates of Each Atom of Ethyl Acetate
2.2.2 Modeling and evaluation
Stepwise regression (SMR) is a commonly used method for variable screening, so it was used to screen the original descriptors. Multiple linear regression (MLR) and partial least-squares regression (PLS), both commonly used modeling methods, were used to build the models. An excellent model must meet the following requirements: 1) the modeling correlation coefficient (R²) ≥ 0.81, the leave-one-out cross-validation correlation coefficient (RCV²) ≥ 0.64 and the external prediction correlation coefficient (Rtest²) ≥ 0.64, all of which are stricter than the standards given in the literature[25]; 2) the ratio of each standard deviation (SD) to the value range (Vr) should be no more than 10%[26]; 3) the absolute prediction error of more than 80% of the samples should be no more than twice the standard deviation (2SD). The external prediction correlation coefficient (Rtest²) and standard deviation (SDtest) were calculated according to Eqs. (4) and (5), respectively.
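Requirements 1) and 2) are simple threshold checks, sketched below with the statistics reported in this paper for the two models and the value range of the samples (1.8794 – (–1.6092)). The function names and the dictionary layout are illustrative only.

```python
def meets_thresholds(r2, r2_cv, r2_test):
    """Requirement 1: R2 >= 0.81, RCV2 >= 0.64, Rtest2 >= 0.64."""
    return r2 >= 0.81 and r2_cv >= 0.64 and r2_test >= 0.64


def sd_ratio_percent(sd, value_range):
    """Requirement 2: standard deviation as a percentage of the value range."""
    return 100.0 * sd / value_range


VR = 1.8794 - (-1.6092)  # value range of log(1/IGC50) reported in the paper

# Reported statistics; sds = (SD, SDCV, SDtest)
MODELS = {
    "M1": {"r2": 0.9974, "r2_cv": 0.9939, "r2_test": 0.9955,
           "sds": (0.0469, 0.0715, 0.0720)},
    "M2": {"r2": 0.9940, "r2_cv": 0.8952, "r2_test": 0.9955,
           "sds": (0.0646, 0.0925, 0.0716)},
}

for name, m in MODELS.items():
    ratios = [sd_ratio_percent(sd, VR) for sd in m["sds"]]
    passed = meets_thresholds(m["r2"], m["r2_cv"], m["r2_test"]) and all(r <= 10.0 for r in ratios)
    print(name, passed, [round(r, 4) for r in ratios])
```

Both models pass these two requirements; requirement 3) needs the per-sample errors and is therefore checked separately.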
The research samples contained only six types of atoms, H, C(sp3), C(sp2), C(sp), O(sp3) and O(sp2), which produced a total of 63 structural descriptors: 21 electrostatic, 21 steric and 21 hydrophobic interaction terms. Because some of these many descriptors may have little correlation with compound toxicity, it was necessary to screen the variables before modeling. Stepwise regression was used to screen the variables, which were introduced into the model according to their significance. By observing the changes in the model correlation coefficient (R²), standard deviation (SD), cross-validation correlation coefficient (RCV²) and cross-validation standard deviation (SDCV), the best combination of variables was selected to build the model. When the 7 variables x1, x18, x33, x72, x80, x118 and x127 (listed in Table 3) were selected, R², SD, RCV² and SDCV reached ideal values at the same time. Among the selected variables, x1, x18 and x33 are electrostatic interaction terms, x72 and x80 are steric interaction terms, and x118 and x127 are hydrophobic interaction terms. The resulting 7-variable multiple linear regression model (M1) is given in Eq. (6).
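To see which atom-type pair a descriptor index refers to, the sketch below decodes an index under an assumed numbering convention: indices 1-55 are electrostatic, 56-110 steric and 111-165 hydrophobic, and within each block the type pairs run through the upper triangle row by row, (1,1), (1,2), ..., (1,10), (2,2), and so on. This convention is not stated explicitly in the text, but it reproduces the interaction classes given above for all seven selected variables and the atom pairs the paper assigns to x1, x18 and x80. The names below are illustrative.

```python
ATOM_TYPES = ["H", "C(sp3)", "C(sp2)", "C(sp)", "N/P(sp3)",
              "N/P(sp2)", "N/P(sp)", "O/S(sp3)", "O/S(sp2)", "halogen"]
INTERACTIONS = ["electrostatic", "steric", "hydrophobic"]


def decode_descriptor(k):
    """Map a 1-based descriptor index (1..165) to (interaction, type_m, type_n).

    ASSUMED ordering: blocks of 55 per interaction, upper triangle row by row.
    """
    if not 1 <= k <= 165:
        raise ValueError("descriptor index must be in 1..165")
    block, offset = divmod(k - 1, 55)
    m, row_len = 0, 10
    while offset >= row_len:  # skip complete rows of the upper triangle
        offset -= row_len
        row_len -= 1
        m += 1
    return INTERACTIONS[block], ATOM_TYPES[m], ATOM_TYPES[m + offset]


for k in (1, 18, 33, 72, 80, 118, 127):
    print(k, decode_descriptor(k))
```

Under this convention x33 decodes to a C(sp)/O(sp2) electrostatic term, which would be zero for every compound without an sp carbon.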
Table 3. Structural Descriptors Selected out by SMR for Modeling
No.   x1       x18      x33      x72       x80       x118     x127
28    0.7204   0.2281   0.0000   9.4395    16.9000   1.6298   –45.1417
29    0.7477   0.2267   0.0000   9.4438    18.8592   1.6757   –49.8691
30*   0.4712   0.1849   0.0000   17.5606   10.7128   1.7189   –62.1318
31    0.6710   0.2094   0.0000   9.4469    17.7720   2.2757   –50.5241
32    0.7266   0.2111   0.0000   9.4395    18.2193   1.7073   –8.5694
33    0.8912   0.2863   0.0000   9.4438    17.5452   1.7101   –8.5826
34    0.8137   0.2597   0.0000   9.4438    14.9115   2.1008   –49.3987
35    0.8998   0.2751   0.0000   9.4469    14.7182   1.5201   –31.6316
36*   0.8989   0.2102   0.0000   9.4438    19.7167   1.6299   –45.1410
37    0.8239   0.2281   0.0000   9.4438    14.7305   2.0523   –35.8980
38    0.8415   0.2828   0.0000   9.4438    13.5559   2.4284   –53.8515
39    0.9012   0.2218   0.0000   9.4499    13.0857   1.6625   –8.3638
40    1.0125   0.2792   0.0000   9.4438    13.5567   2.1621   –60.6598
41    1.0157   0.2399   0.0000   9.4499    13.5891   2.7693   –45.2683
42*   1.2021   0.2633   0.0000   9.4438    15.6672   2.6267   –83.5691
43    1.4141   0.3793   0.0000   9.4438    26.0487   2.3941   –14.7822
44    1.2429   0.3617   0.0000   9.4469    15.9724   3.0923   –63.8925
45    0.7323   0.1013   0.0000   9.4438    13.8681   5.9578   –68.1012
46    1.5014   0.2745   0.0000   9.4438    15.6676   0.3897   –18.0973
47    1.5707   0.3163   0.0000   9.4438    13.5576   0.5714   –44.5666
48*   1.7261   0.3412   0.0000   9.4469    14.7726   0.4994   –31.6074
Here, N is the number of regression points, R1² the correlation coefficient, SD1 the standard deviation and F1 the significance test value; RCV1² is the cross-validation correlation coefficient, SDCV1 the cross-validation standard deviation and FCV1 the cross-validation significance test value; Rtest1² is the external test correlation coefficient and SDtest1 the external test standard deviation. The correlation coefficient (R1²) of the above model was as high as 0.9974, well above the 0.81 standard, indicating that the model fit well. The value range (Vr) of the research samples was 1.8794 – (–1.6092) = 3.4886 and the standard deviation (SD1) was 0.0469, so that (0.0469/3.4886) × 100% = 1.3444%, far below the 10% standard, which meant that the fitting errors of the model were small. The cross-validation correlation coefficient (RCV1²) was 0.9939, well above the 0.64 standard; the cross-validation standard deviation (SDCV1) was 0.0715, and (0.0715/3.4886) × 100% = 2.0495%, far below the 10% standard, suggesting that the model was stable. The external test correlation coefficient (Rtest1²) was 0.9955, well above 0.64; the external test standard deviation (SDtest1) was 0.0720, and (0.0720/3.4886) × 100% = 2.0639%, far below the 10% standard, indicating strong predictive ability and small prediction errors.
To further understand the influence of the variables on compound toxicity, the structural descriptors in Table 3 were used as the independent variables X and the toxicity value log(1/IGC50) as the dependent variable Y, and partial least-squares regression was used to establish a second model (M2). The change of the correlation coefficients (R²/RCV²) with the number of principal components is shown in Fig. 2. When the number of principal components reached 3, the correlation coefficient (R²) of the model reached its maximum and the cross-validation correlation coefficient (RCV²) was close to its maximum. Therefore, 3 principal components were chosen to build the model.
Fig. 2. Change of the correlation coefficients (R²/RCV²) with the number of principal components
The scores of the 40 training set samples on the first 2 principal components of the PLS space are plotted in Fig. 3. The scores of 39 of the 40 samples (97.5%) fell within the 95% confidence ellipse, which indicates that the structural descriptors could represent the molecular structure characteristics of the ester compounds and behaved correctly in the statistical model. The only outlier was compound No. 13, 2-butynyl acetate, which contains a triple bond and is therefore structurally distinct from the other samples.
Fig. 3. PLS scores of samples in the top 2 principal components
For this model, R2² = 0.9940, SD2 = 0.0646; RCV2² = 0.8952, SDCV2 = 0.0925; Rtest2² = 0.9955, SDtest2 = 0.0716. The correlation coefficient (R2²) of the model was as high as 0.9940, well above the 0.81 standard, indicating that the model fit well; the standard deviation (SD2) was 0.0646, and (0.0646/3.4886) × 100% = 1.8517%, far below the 10% standard, so the fitting errors of the model were small. The cross-validation correlation coefficient (RCV2²) of 0.8952 was well above the 0.64 standard; the cross-validation standard deviation (SDCV2) was 0.0925, and (0.0925/3.4886) × 100% = 2.6515%, far below the 10% standard, which suggested that the model was stable. The external test correlation coefficient (Rtest2²) of 0.9955 was well above 0.64; the external test standard deviation (SDtest2) was 0.0716, and (0.0716/3.4886) × 100% = 2.0524%, far below the 10% standard, showing that the model had strong predictive ability and small prediction errors.
To verify that the excellent model results were not accidental, the model was validated by randomly permuting the Y vector 20 times. The R² and RCV² values of the resulting models are plotted against the correlation coefficient between the original Y vector and each randomly permuted Y vector in Fig. 4. According to the criteria proposed by Andersson et al.[27], the intercepts of R² and RCV² on the vertical axis should not exceed 0.300 and 0.050, respectively. Fig. 4 shows that the intercepts of R² and RCV² for the PLS model built in this paper were 0.072 and –0.400, respectively. Therefore, the excellent results of the model can be considered not accidental, and the model can be used to analyze the structures of ester compounds.
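The logic of the Y random permutation test can be sketched as follows. This is a simplified stand-in, not the procedure used in the paper: a single-descriptor ordinary least-squares fit replaces the 3-component PLS model, only the R² intercept is computed (the RCV² intercept would be obtained the same way from cross-validated values), and the data are synthetic. All function names are hypothetical.

```python
import random


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab / (saa * sbb) ** 0.5


def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    return my - slope * mx, slope


def r_squared(x, y):
    """R2 of a single-variable least-squares fit equals the squared correlation."""
    return pearson(x, y) ** 2


def y_randomization_intercept(x, y, n_perm=20, seed=0):
    """Intercept of R2 against |corr(Y, permuted Y)|, original model included."""
    rng = random.Random(seed)
    similarity, r2_values = [1.0], [r_squared(x, y)]
    for _ in range(n_perm):
        y_perm = list(y)
        rng.shuffle(y_perm)
        similarity.append(abs(pearson(y, y_perm)))
        r2_values.append(r_squared(x, y_perm))
    intercept, _ = fit_line(similarity, r2_values)
    return intercept


# Synthetic data: one descriptor strongly related to the response
x = [float(v) for v in range(20)]
y = [2.0 * v + ((v * 7) % 5 - 2) * 0.3 for v in range(20)]
print(y_randomization_intercept(x, y) <= 0.300)
```

A small intercept means that models fitted to scrambled responses explain little variance, so the fit to the real responses is unlikely to be a chance correlation.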
Fig. 4. Plot of Y random permutations test
To further study the influence of each variable on the toxicity log(1/IGC50) (Y), the PLS loadings are plotted in Fig. 5. The variables x1, x72, x118 and x127 lie in the upper right, which means that they are positively correlated with Y in both the first and second principal components; x1 lies relatively far from the origin, which reflects a relatively stronger correlation with Y. The variables x18 and x80 lie at the bottom right, indicating that they are positively correlated with Y in the first principal component and negatively correlated with Y in the second. The variable x33 lies at the upper left, which suggests that it is negatively correlated with Y in the first principal component and positively correlated with Y in the second.
Fig. 5. PLS loadings plot
The importance of a variable reflects the degree of its correlation with Y. Variables with variable importance in projection (VIP) values greater than 1 are generally considered highly correlated with the response, here the toxicity log(1/IGC50) of the ester compounds. The VIP values are shown in Fig. 6. The VIP values of the three variables x1, x18 and x80 were greater than 1, indicating that these three variables were highly correlated with the toxicity log(1/IGC50). x1 corresponds to the electrostatic interaction between hydrogen atoms, which suggests that the more hydrogen atoms a compound contains, the higher its toxicity log(1/IGC50) value may be. x18 corresponds to the electrostatic interaction between C(sp3) and O(sp2), and x80 corresponds to the steric interaction between C(sp2) and O(sp3). These results indicate that oxygen atoms had a considerable influence on the toxicity value.
Fig. 6. Variable importance projection
The toxicity log(1/IGC50) values calculated by the two models are listed in Table 1 as Cal.1 and Cal.2, with the corresponding errors as Err.1 and Err.2. For ease of comparison, the calculated values are plotted against the experimental values in Fig. 7, and the corresponding errors are plotted in Fig. 8. Fig. 7 shows that most of the sample points lie near the 45° diagonal, indicating that the calculated log(1/IGC50) values were highly correlated with, and close in magnitude to, the experimental values. The toxicity log(1/IGC50) could thus be predicted accurately, which again demonstrates the good predictive ability of the models.
Fig. 7. Plot of the predicted values vs. the experimental ones
A good prediction model usually requires that the prediction errors of most samples do not exceed twice the standard deviation (i.e., ±2SD). Fig. 8 shows that most of the sample errors were within ±2SD of the model. For model M1, only 1 sample (No. 1) had a prediction error exceeding ±2SD1; for model M2, only 3 samples (Nos. 1, 43 and 46) had prediction errors larger than ±2SD2. This shows that the models were accurate in predicting the toxicity log(1/IGC50) of the compounds and that the prediction errors were in an acceptable range, so the models could be used to predict the toxicity log(1/IGC50) of ester compounds. At the same time, the existence of samples with large errors indicated that some special structural information of the compounds had not been fully expressed, and that the molecular structure characterization method needed further improvement.
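The ±2SD check can be reproduced on the rows of Table 1 available here (compounds 41 to 48) with the reported standard deviations SD1 = 0.0469 and SD2 = 0.0646. The sketch below is only an illustration with hypothetical names; compound No. 1 is not among these rows, so it cannot appear in the output.

```python
# (No., Err.1, Err.2) for compounds 41-48 as listed in Table 1
ERRORS = [
    (41, 0.0892, 0.1153),
    (42, 0.0856, 0.0918),
    (43, -0.0832, -0.1432),
    (44, 0.0250, 0.0977),
    (45, -0.0321, 0.0376),
    (46, -0.0553, -0.1369),
    (47, 0.0810, 0.0509),
    (48, -0.0713, -0.1106),
]
SD1, SD2 = 0.0469, 0.0646  # reported standard deviations of M1 and M2


def flag_outliers(errors, sd, column):
    """Return compound numbers whose |error| in the given column exceeds 2*SD."""
    return [row[0] for row in errors if abs(row[column]) > 2.0 * sd]


print(flag_outliers(ERRORS, SD1, 1))  # M1: []
print(flag_outliers(ERRORS, SD2, 2))  # M2: [43, 46]
```

Within these rows no sample exceeds ±2SD1, and compounds 43 and 46 exceed ±2SD2, in agreement with the samples named in the text.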
Fig. 8. Scatter plot of the prediction residuals
By classifying the atoms in the compounds, the electrostatic, steric and hydrophobic interactions between atoms were calculated from the three-dimensional structures as structural descriptors, and the structures of 48 ester compounds were thereby expressed parametrically. Models relating compound structure to toxicity log(1/IGC50) were established through multiple linear regression (MLR) and partial least-squares regression (PLS), and the toxicity log(1/IGC50) of the ester compounds was found to be closely related to their molecular structures. The constructed structure-toxicity models can be used to predict the toxicity log(1/IGC50) of ester compounds. Because the prediction errors of a few samples were somewhat larger, there is still considerable room for improvement in the molecular structure characterization method, and related research is underway. This paper has reference value for the study of quantitative structure-toxicity relationships of toxic compounds in the environment.