Zhijuan Wang, Xiaobin Zhao, Wei Song and Antai Wang
Abstract: Readability is a fundamental problem in textbook assessment. For low-resource languages (LRLs), however, the readability of textbooks has received little investigation. In this paper, we propose a readability assessment method for textbooks in Tibetan, a low-resource language. We extract features from the output of Tibetan word segmentation and named entity recognition. We then calculate the correlation between features using the Pearson correlation coefficient and select candidate feature sets for the readability formula. A fitting test, an F test and a T test are applied to the selected features to generate a new readability assessment formula. Experiments show that the new formula is capable of assessing the readability of Tibetan textbooks.
Keywords: Readability assessment, low resource language, textbook in Tibetan, linear regression, named entity.
Readability is important in assessing text, and it is often used to rate whether a reader can read and understand a text easily. The study of readability has a long history, and its results have been widely used in education research, book publishing and online publishing [Dale and Chall (1949)].
There are many methods to assess the readability of textbooks in resource-rich languages such as English [Kane (1967)], Chinese [Pang (2006)], French [Uitdenbogerd (2005)] and German [Hancke, Vajjala, and Meurers (2012)]. These studies mainly focus on two parts: selecting features and designing a readability assessment model based on the selected features. However, there is little research on readability assessment for low-resource languages. Taking Tibetan, a low-resource language, as an example, we conduct an in-depth study on how to assess the readability of Tibetan textbooks. The rest of this paper is organized as follows. In the next section, we discuss the background, including the features used in readability assessment, the assessment methods and the readability assessment of Tibetan textbooks. In Section 3, we introduce the corpus we used. In Section 4, we propose the feature selection strategy and a new readability assessment formula for the Tibetan language. The paper concludes in Section 5 with guidance on constructing readability formulas for Tibetan and with future work.
Research on readability started in the United States in 1923. It mainly involves two parts: feature selection and the construction of an assessment model based on the selected features.
Readability assessment is based on various features. Vogel et al. [Vogel and Washburne (1928)] used the number of words, parts of speech, a list of difficult words and the number of phrases as features to assess the readability of text. Dale et al. [Dale and Chall (1948)] measured readability by determining the distribution of difficult words in the text against a list of 3,000 common words. Flesch [Flesch (1948)] obtained a readability index from 0 to 100 by calculating the number of syllables per 100 words and the average number of words per sentence. ATOS for Books [Fry (2000)], developed by an American commercial company, took the length of the text as an important feature in assessing readability. Part-of-speech-based grammatical features have also been used to assess readability [Heilman, Collins-Thompson, Callan et al. (2007); Leroy, Helmreich, Cowie et al. (2008)]. Feng et al. [Feng, Jansche, Huenerfauth et al. (2010)] argued that the number of named entities in a text affects the reader's memory burden, and used the number of named entities as one of the features to measure readability. Gemoets et al. [Gemoets, Rosemblat, Tse et al. (2004)] took personal pronouns as features to measure readability. François et al. [François and Fairon (2012)] used 46 textual features to assess the readability of French. Some commonly used features are listed in Tab. 1.
Table 1: The features commonly used in readability assessment
Methods for assessing readability can be divided into two categories: formulas and other methods.
2.2.1 Readability assessment Formulas
Two Americans, Lively and Pressey [Jia (2015)], designed the first readability assessment formula. More than one hundred readability formulas have since been produced, but only seven of them are commonly used [Tekfi (1987)]. Here, we briefly introduce these seven formulas.
1. Vogel & Washburne [Vogel and Washburne (1928)]
This formula was first synthesized by Vogel and Washburne in 1928. The Vogel & Washburne formula is as follows:
where X1 is the number of prepositions and X2 is the number of complex words (words with more than three syllables).
2. Flesch Reading Ease [Flesch (1948)]
This formula was designed by Flesch in 1948. He graded texts according to their score and proposed a readability scale of 0-100. The Flesch Reading Ease formula is as follows:
3. Gunning Fog Index [Gunning (1969)]
This formula was created by American professor Robert Gunning in 1952. The lower the Fog index of an article, the easier it is for readers to understand. The Gunning Fog Index formula is as follows:
where PHW is the percentage of hard words.
4. Automated Readability Index (ARI) [Senter and Smith (1967)]
This formula was proposed by Senter and Smith in 1967. It is based on linear regression analysis. The lower the ARI of an article, the easier it is for readers to understand. The Automated Readability Index formula is as follows:
5. Flesch-Kincaid Formula (FK) [Kincaid, Fishburne, Rogers et al. (1975)]
This formula was jointly designed by Kincaid and Flesch in 1975. It is the US Department of Defense's standard readability formula and is also a built-in readability formula in Microsoft Office. The Flesch-Kincaid Formula is as follows:
6. SMOG Grading [Mc Laughlin (1969)]
This formula was constructed by G. Harry McLaughlin in 1969, and it is the only one of the seven that uses a single feature. The lower the SMOG index of an article, the easier it is for readers to understand. The SMOG grading formula is as follows:
where X is the number of multi-syllable words in 30 sentences.
7. Dale-Chall [Dale and Chall (1948)]
This formula was originally designed by Dale and Chall in 1948 and was revised in 1995. The formula is based on a list of common words that has been expanded from the original 763 words to 3,000 words. The Dale-Chall formula is as follows:
where X1 is the percentage of uncommon words (based on the Dale list of 3,000 common words) and X2 is the average number of words per sentence.
Readability formulas are constructed from the selected features with methods such as logistic regression and linear regression.
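To make the structure of these formulas concrete, the sketch below implements six of them in Python using their commonly published English-language coefficients. These coefficients come from the general readability literature and are an assumption here: they may differ in presentation from the equations in this paper (for example, Flesch Reading Ease is sometimes written in terms of syllables per 100 words). The function names and argument names are illustrative only, and the Vogel & Washburne formula is left out.

```python
import math


def flesch_reading_ease(words, sentences, syllables):
    """Higher scores mean easier text (roughly a 0-100 scale)."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """US school grade level; lower means easier."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def gunning_fog(words, sentences, hard_words):
    """0.4 * (average sentence length + percentage of hard words, PHW)."""
    phw = 100.0 * hard_words / words
    return 0.4 * (words / sentences + phw)


def automated_readability_index(characters, words, sentences):
    """ARI uses characters per word instead of syllables."""
    return 4.71 * (characters / words) + 0.5 * (words / sentences) - 21.43


def smog_grade(polysyllables, sentences):
    """Single-feature formula: polysyllabic words normalised to 30 sentences."""
    return 1.0430 * math.sqrt(polysyllables * 30.0 / sentences) + 3.1291


def dale_chall_1948(uncommon_words, words, sentences):
    """X1 = percentage of uncommon words, X2 = average words per sentence."""
    x1 = 100.0 * uncommon_words / words
    x2 = words / sentences
    return 0.1579 * x1 + 0.0496 * x2 + 3.6365
```

All six share the same shape: a weighted sum of a word-difficulty measure and a sentence-length measure, which is also the shape of the linear regression formula sought for Tibetan.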
2.2.2 Other methods
Besides formulas, there are other methods to assess the readability of texts.
The cloze method was first proposed by Taylor [Taylor (1953)]. In the “Taylor cloze test”, the word after every fifteen words in an article was deleted, and students of different grade groups were then asked to fill in the deleted words. When the rate of correct answers in a group exceeds 50%, the article is classified as easy to read at that group's level.
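The cloze procedure described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the text is already split into words, that an answer counts as correct only on an exact match, and that a group's rate is the mean of its students' rates; the function names and the `gap` and `threshold` parameters are hypothetical.

```python
def make_cloze(words, gap=15, blank="____"):
    """Blank out the word that follows every `gap` words."""
    passage, answers = [], []
    for i, word in enumerate(words):
        if (i + 1) % (gap + 1) == 0:   # positions 16, 32, ... (1-based)
            passage.append(blank)
            answers.append(word)
        else:
            passage.append(word)
    return passage, answers


def correct_rate(responses, answers):
    """Share of blanks one student filled with the original word."""
    if not answers:
        return 0.0
    hits = sum(1 for r, a in zip(responses, answers) if r == a)
    return hits / len(answers)


def readable_for_group(group_responses, answers, threshold=0.5):
    """True if the group's mean correct rate exceeds the threshold."""
    rates = [correct_rate(r, answers) for r in group_responses]
    if not rates:
        return False
    return sum(rates) / len(rates) > threshold
```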
The subjective assessment method invites experts and scholars in related fields to judge the difficulty of a text manually. When highly knowledgeable experts and scholars are not available, questionnaires are distributed to teachers or students, and their responses are used to judge whether the texts can be read easily. The main advantage of this method is that it is simple and easy [Jia (2015)].
In recent years, with the development of machine learning, readability assessment methods based on machine learning have also begun to be explored. Chen et al. [Chen, Tsai and Chen (2011)] used TF-IDF and SVM to assess Chinese readability. François et al. [François and Fairon (2012)] drew on machine learning methods to study readability. Hancke [Hancke (2012)] and Vajjala et al. [Vajjala and Meurers (2012)] transformed the readability problem into a classification problem and used text classification to evaluate the readability of text.
At present, there is little research on the readability of Tibetan textbooks. Most studies focus on vocabulary statistics of Tibetan textbooks. For example, Cao et al. [Cao, Han and Dong (2012)] measured the vocabulary of junior high school and high school Tibetan textbooks compiled by the Tibetan Language Compilation Committee of the five provinces (districts). Zhang et al. [Zhang, Gao, Li et al. (2010)] used statistical methods to analyze the article genre, literary genre and material selection of the new and old versions of the primary school Tibetan textbooks. Wang et al. [Wang (2012)] used the junior middle school Tibetan textbook as a corpus to conduct a shallow syntactic analysis and proposed six Tibetan chunk types. Renqing Zhuoma et al. [Renqing Zhuoma (2015)] studied mistranslations in the translated textbooks.
As described above, little research has been carried out on the readability of Tibetan textbooks, which makes such a study necessary.
Tibetan, a low-resource language, belongs to the Sino-Tibetan language family and is spoken primarily by Tibetan peoples, who live across a wide area of eastern Central Asia. As an alphabetic writing system, Tibetan has 30 consonants and 4 vowel signs. Its smallest grammatical unit is the syllable, and the tsheg “་” marks the end of a syllable. One or more letters compose a syllable, and one or more syllables compose a word.
Fig. 1 is an example of a Tibetan sentence. A Tibetan syllable includes a root letter, prefix, head letter, vowel, suffix and post-suffix. There is no white space between Tibetan words.
Figure 1: An example of a Tibetan sentence
As shown in Tab. 1, many features used in assessing the readability of textbooks are based on words, so Tibetan texts must first be segmented. We use Tibetan word segmentation software developed by the Institute of Ethnology and Anthropology of the Chinese Academy of Social Sciences [Li, Liu, Long et al. (2018)] to segment the Tibetan words.
Here, we use Tibetan primary school textbooks as the assessment corpus. They were compiled by the Tibetan Language Teaching Materials Compilation Committee of five provinces (districts) and published by Qinghai Nationalities Publishing Press. There are 11 volumes, 261 articles and 55,198 words. Tab. 2 shows the basic information of this corpus.
At present, Tibetan part-of-speech tagging systems do not perform well enough to be used in measuring the readability of Tibetan textbooks. Therefore, eight features of Tibetan textbooks are extracted using the Tibetan segmentation system [Li, Liu, Long et al. (2018)] and a named entity recognition system [Liu and Wang (2017)]. They are:
· NW: the number of words per document
· KW: the number of distinct words (kinds of words) per document
· AWL: the average word length per document
· NNW: the number of new words per document
· ASL: the average sentence length per document
· ANS: the number of sentences per document
· NPRP: the number of personal pronouns per document
· NNE: the number of named entities per document
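The eight features listed above can be computed from segmented text roughly as follows. This is a sketch under stated assumptions, not the authors' implementation: the document is assumed to arrive as a list of sentences, each a list of word strings from the segmenter; word length is assumed to be counted in syllables by splitting on the tsheg; sentence length is assumed to be counted in words; "new words" are assumed to be word types absent from a caller-supplied vocabulary; and the pronoun and named entity sets stand in for the output of the respective tools. All names in the code are hypothetical.

```python
TSHEG = "\u0f0b"  # Tibetan syllable delimiter


def syllable_count(word):
    """Count syllables by splitting on the tsheg (assumed length unit)."""
    return len([s for s in word.split(TSHEG) if s])


def extract_features(sentences, known_words, pronouns, named_entities):
    """sentences: list of sentences, each a list of segmented words."""
    words = [w for sentence in sentences for w in sentence]
    nw = len(words)
    ns = len(sentences)
    kinds = set(words)
    return {
        "NW": nw,                                   # number of words
        "KW": len(kinds),                           # distinct words
        "AWL": sum(syllable_count(w) for w in words) / nw if nw else 0.0,
        "NNW": len(kinds - set(known_words)),       # unseen word types
        "ASL": nw / ns if ns else 0.0,              # words per sentence
        "ANS": ns,                                  # number of sentences
        "NPRP": sum(1 for w in words if w in pronouns),
        "NNE": sum(1 for w in words if w in named_entities),
    }
```

Each document then becomes one row of eight numbers, which is the input format expected by the correlation analysis and the regression.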
Table 2: The information of Tibetan textbooks
In this paper, linear regression is used to construct the readability assessment formula for Tibetan textbooks. The linear regression formula is shown in Eq. (8).
Fig. 2 shows the framework for constructing the readability assessment formula for Tibetan textbooks. It includes two parts: feature selection and formula construction.
Figure 2: The framework of the readability assessment formula for Tibetan
· Feature selection. In a multiple linear regression model, collinearity often occurs among variables, meaning that two or more variables tend to change in the same direction. It weakens the accuracy and stability of the parameter estimates in linear regression analysis [Yang (2004)]. So, if two features are correlated, only one of them is retained and the others are removed. This yields several candidate feature sets.
· Formula construction. Three tests are used to construct the formula. The fitting test reflects how well a feature set explains the function: the higher the fitting value (which is at most 1), the better the feature set expresses the function, so the feature set with the highest fitting value is selected. According to Christian's SPSS regression analysis, when the significance of the F test is less than 0.05, at least one feature can effectively predict the function, and the feature set passes the F test. The T test examines the regression coefficient of each feature in the regression analysis. If the significance of a feature is greater than 0.05, that feature does not pass the T test and should be removed. This process is repeated until all features pass the fitting test, F test and T test, which yields the readability assessment formula.
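The fit, test and remove loop described above can be sketched in plain Python. This is an illustration under assumptions rather than the SPSS procedure itself: the regression is ordinary least squares with an intercept, the fitting value is taken to be R², and, because the standard library has no t distribution, a feature is kept when the absolute value of its t statistic reaches a caller-supplied critical value `t_crit` (about 1.97 corresponds to a significance of 0.05 for a few hundred documents) instead of comparing a p-value with 0.05. The names `solve`, `ols` and `backward_eliminate` are hypothetical.

```python
import math


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [x - f * y for x, y in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(rows, y):
    """Least squares with intercept; returns coefficients, R^2, F and t."""
    n, k = len(y), len(rows[0])
    X = [[1.0] + list(r) for r in rows]
    p = k + 1
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(p)] for i in range(p)]
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, x)) for x in X]
    mean_y = sum(y) / n
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    mse = sse / (n - p)                      # residual variance estimate
    unit = lambda j: [1.0 if i == j else 0.0 for i in range(p)]
    inv_diag = [solve(xtx, unit(j))[j] for j in range(p)]
    t = [b / math.sqrt(mse * d) for b, d in zip(beta, inv_diag)]
    return {"beta": beta, "r2": 1 - sse / sst,
            "F": ((sst - sse) / k) / mse, "t": t}


def backward_eliminate(names, rows, y, t_crit=1.97):
    """Refit, dropping the weakest feature until all pass the t check."""
    cols = list(range(len(names)))
    while cols:
        fit = ols([[r[c] for c in cols] for r in rows], y)
        t_abs = [abs(v) for v in fit["t"][1:]]   # skip the intercept
        worst = min(range(len(cols)), key=lambda i: t_abs[i])
        if t_abs[worst] >= t_crit:
            return [names[c] for c in cols], fit
        del cols[worst]
    return [], None
```

Calling `backward_eliminate(["AWL", "NW", "ANS", "NNE"], rows, scores)` on per-document feature rows returns the surviving feature names together with the final fit, whose `beta` values are the intercept and coefficients of the resulting formula. Exact p-values, as reported by SPSS, would require a statistics package.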
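To select features, the Pearson correlation coefficient is used to calculate the correlation between features. Its equation is shown in Eq. (9).
$$r = \frac{N\sum xy - \sum x \sum y}{\sqrt{\left(N\sum x^{2} - \left(\sum x\right)^{2}\right)\left(N\sum y^{2} - \left(\sum y\right)^{2}\right)}} \tag{9}$$
where (x, y) is the specified pair of variables and N is the number of paired observations.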
Tab. 3 shows the Pearson correlation of the eight features, calculated on our Tibetan corpus.
Table 3: Pearson correlation analysis of the features
If the Pearson correlation is high (greater than 0.5), the two features are correlated and should not appear in the same feature set. For example, the Pearson correlation of NW (the total number of words) and KW (the kinds of words) is 0.643, so these two features cannot be used together; here, NW is retained. Using this as the basis for feature selection, five sets of features are selected, as shown in Tab. 4.
Table 4: Five sets of features
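The correlation screening described above can be illustrated with a short Python sketch. It computes the Pearson correlation coefficient from raw sums and reports the feature pairs that exceed the 0.5 threshold. Two points are assumptions of this sketch: the threshold is applied to the absolute value of the coefficient, and a feature with no variation is given a correlation of 0. The function names and the toy values in the usage note are hypothetical and are not taken from the corpus.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equally long lists of observations."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    den = math.sqrt((n * sxx - sx * sx) * (n * syy - sy * sy))
    if den == 0:
        return 0.0  # a constant feature carries no correlation signal
    return (n * sxy - sx * sy) / den


def correlated_pairs(features, threshold=0.5):
    """features: dict mapping a feature name to its per-document values."""
    pairs = []
    for a, b in combinations(features, 2):
        r = pearson(features[a], features[b])
        if abs(r) > threshold:
            pairs.append((a, b, r))
    return pairs


def is_admissible(feature_set, features, threshold=0.5):
    """True if no two features in the set exceed the threshold."""
    subset = {name: features[name] for name in feature_set}
    return not correlated_pairs(subset, threshold)
```

For instance, with toy values `{"NW": [1, 2, 3, 4], "KW": [2, 4, 6, 9], "NNE": [1, -1, -1, 1]}`, only the pair (NW, KW) is reported, so a candidate feature set may contain NW or KW but not both.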
We choose SPSS as the linear regression analysis tool. Tab. 5 shows the results of the fitting test for the five sets of features. The fitness ranking is: set 2 > set 1 > set 3 > set 5 > set 4. Therefore, set 2 (AWL, NW, ANS, NNE) is selected.
Table 5: Fitting test
Tab. 6 shows the F test of set 2. It passes the F test, as its significance is less than 0.05.
Table 6: F test
Tab. 7 shows the T test of the regression coefficients of the set 2 features. The significance values of the constant, AWL, ANS and NNE are less than 0.05, so they pass the T test. The significance of NW, however, is greater than 0.05. Therefore, NW must be removed.
Table 7: T test
Because the feature set has changed, linear regression analysis must be performed again until all features pass the fitting test, F test and T test. Tabs. 8 to 10 show the second fitting test, F test and T test, respectively. All the features pass the fitting test, F test and T test.
Table 8: The second fitting test
Table 9: The second F test
Table 10: The second T test
According to the linear regression analysis, Eq. (10) is the readability assessment formula for Tibetan textbooks.
Fig. 3 shows the readability of the Tibetan textbooks based on Eq. (10). The readability value rises gradually from Volume 2 to Volume 10, except for Volume 7, where it increases sharply, and Volume 8, where it falls back. The values for Volumes 11 and 12 are lower than that of Volume 10 and change only slightly.
Figure 3: Readability of Tibetan textbooks based on Eq. (10)
Formulas are among the most commonly used methods for evaluating the readability of texts, yet little research has been carried out on Tibetan readability assessment. We extract eight features using Tibetan NLP tools and select three of them (AWL (average word length), ANS (average number of sentences per document) and NNE (number of named entities per document)) to construct a readability formula. The new formula is constructed on the basis of the fitting test, F test and T test. It performs well and can be applied to assess the readability of Tibetan textbooks.
In the future, we will do more research on methods of feature selection. We will also try other machine learning models to assess the readability of low-resource languages.
Acknowledgements: This work was supported by the National Natural Science Foundation of China (No. 61331013) and the Young Faculty Scientific Research Ability Promotion Program of Minzu University of China.
Computers, Materials & Continua, 2019, Issue 10