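Construction of Log-Linear Models Based on the Bootstrap Method
XU Lingli1, CHEN Xuedong2
(1. College of Mathematics and Information Engineering, Zhejiang Normal University, Jinhua 321004, Zhejiang, China; 2. School of Science, Huzhou University, Huzhou 313000, Zhejiang, China)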
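For categorical variables whose data are hard to collect, the available sample is usually limited. A log-linear model built from such data alone is often unreliable, and the estimates of the interaction parameters may be imprecise. To address this problem, we first use Bootstrap sampling to generate multiple data sets of a given size, fit a log-linear model to each, and obtain an estimated parameter vector for each model; we then cluster all the estimated parameter vectors to obtain several representative parameter vectors. The experimental results show that, even when individual parameters differ from those of the true model, the K-L distance between the probability distribution of each of these models and that of the true model is small, i.e., the distributions are close. Moreover, among these vectors, the closer a vector lies to the confidence intervals of the corresponding parameters, the smaller its K-L distance from the true probability distribution.
categorical variables; log-linear model; Bootstrap sampling; clustering; K-L distance; confidence interval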
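Exploring the associations among categorical variables, or building a model for them, is of central importance [1], but the first step is to collect the required sample. When some of the variables are abstract or absent from existing data sets, the sample that can be obtained is usually limited. A log-linear model fitted to an insufficient sample has low credibility, and its parameter estimates are imprecise. To address this, this paper introduces Bootstrap sampling: the original sample is resampled repeatedly to produce multiple data sets, a saturated log-linear model is fitted to each, and the corresponding parameter vector is obtained. Here a parameter vector means the vector formed by all the parameters of the log-linear model.
Bootstrap sampling is seldom applied to categorical data. For categorical variables, research mainly concerns the associations among the variables, or the direct estimation of their log-linear model. Building a log-linear model close to the true model, however, places high demands on the sample size. One reason is that the model generally has no fewer than six parameters, so building it means estimating more than six parameters simultaneously. Another reason is that the numbers used to label the categories of a nominal variable, for example 1 for male and 2 for female, are not quantities whose difference can be read as a simple distance. Suppose a sample of size 200 is drawn from a chi-square distribution χ2(5) and its mean E is computed: E is generally very close to the true expectation, whereas a model built from an original sample of size 200 may differ considerably from the true model. Besides using the Bootstrap to generate data sets [2] and obtaining the parameter estimates of each new data set, this paper also clusters the B resulting parameter vectors, finally obtaining several groups of parameter vectors, i.e., several candidate models. To demonstrate the effectiveness of the method, the whole procedure is simulated in R with the true model given, and the K-L distance is used as the evaluation criterion to measure how well the probability distribution of the final model matches that of the true model.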
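The log-linear model is an effective statistical method for analyzing contingency tables. Conventional frequency-table methods can only analyze the association between two variables at a time; a high-dimensional contingency table must first be split into several two-way tables, and the pairwise associations are then analyzed. Such methods can therefore assess only the main effects and two-way interactions of each two-way table and cannot analyze higher-order interactions (such as three-way or four-way interactions), whereas a log-linear model can include all interaction terms [2-4].
Taking three variables X, Y, Z as an example, the saturated log-linear model is written as: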
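For reference, the saturated model for a three-way table is commonly written in the form below. The notation is an assumption taken from standard textbook usage ($\mu_{ijk}$ for the expected count in cell $(i,j,k)$, $\lambda$ terms for the main effects and interactions) and is not necessarily the notation used in this paper.

```latex
\log \mu_{ijk} = \lambda + \lambda_i^{X} + \lambda_j^{Y} + \lambda_k^{Z}
  + \lambda_{ij}^{XY} + \lambda_{ik}^{XZ} + \lambda_{jk}^{YZ}
  + \lambda_{ijk}^{XYZ}
```

The model is called saturated because it has as many free parameters as cells, so its fitted counts reproduce the observed counts exactly.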
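The Bootstrap is a statistical method based on repeated sampling with replacement from the observations of the original sample. It requires no distributional assumptions and no new observations; a computer simply resamples the original observations to produce a series of new samples. Its basic idea is as follows [5-6]: draw a sample with replacement from the original data set, usually of the same size as the original sample, and estimate the parameter θ from this new sample; repeating this process B times yields B estimates of θ, from which a confidence interval for θ is obtained. An interval obtained in this way is often narrower, and hence more precise, than one obtained by conventional methods.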
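The idea can be sketched in R as follows. This is a minimal illustration, not the procedure used in this paper: the χ2(5) sample mirrors the example mentioned in the introduction, the parameter θ is taken to be the mean, and the percentile interval is an assumed choice of interval type.

```r
# Illustrative sketch: percentile Bootstrap interval for a mean.
# The data, B, and the interval type are assumptions made for illustration.
set.seed(123)
x <- rchisq(200, df = 5)   # stand-in for the original sample
B <- 2000                  # number of Bootstrap replicates

# Resample with replacement (same size as x) and re-estimate the mean each time
boot_means <- replicate(B, mean(sample(x, size = length(x), replace = TRUE)))

# 95% percentile interval from the B estimates
ci <- quantile(boot_means, probs = c(0.025, 0.975))
print(ci)
```

Other interval constructions (for example the basic or BCa interval) could be substituted for the percentile rule without changing the resampling step.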
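When the original sample is given as a multi-way contingency table, the cell frequencies cannot be used directly as observations for Bootstrap sampling. Suppose the table is four-way and every categorical variable is binary, so that there are 2^4 = 16 cells with frequencies n1, n2, ..., n16. The original observations, data, then take the following form in R:
data <- c(rep(1, n1), rep(2, n2), …, rep(16, n16))
Each draw with replacement is then:
sample(data, size = n, replace = TRUE)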
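A runnable version of these two steps might look as follows. The frequencies are the count vector from the simulation section of this paper; the names counts, boot, and boot_counts are introduced here for illustration only.

```r
# Sketch: expand a 16-cell frequency vector into individual observations,
# draw one Bootstrap sample, and tabulate it back into cell frequencies.
set.seed(1)
counts <- c(7, 10, 16, 15, 9, 8, 20, 15, 7, 13, 11, 15, 12, 10, 20, 12)

# One observation per unit, labelled by its cell index 1..16
data <- rep(seq_along(counts), times = counts)
n <- length(data)

# One draw with replacement, same size as the original sample
boot <- sample(data, size = n, replace = TRUE)

# nbins keeps cells that happen to receive zero observations
boot_counts <- tabulate(boot, nbins = length(counts))
print(boot_counts)
```

Using tabulate with nbins, rather than table, keeps the result at length 16 even when a cell is empty in a given resample.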
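K-means clustering is a fast clustering method whose results are easy to interpret and which places modest demands on computing power, so it is widely used [7]. Proposed by MacQueen, its basic idea is simple: assign each sample to the cluster whose center (mean) is nearest. The algorithm proceeds as follows:
(1) Partition all samples into k initial clusters;
(2) Assign each sample to the nearest cluster by Euclidean distance, then recompute the centers of the k new clusters;
(3) Repeat step (2) until the k cluster centers no longer change.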
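The steps above correspond to the batch iteration available in R's kmeans function. The sketch below uses synthetic two-dimensional data (an assumption for illustration) and selects algorithm = "Lloyd", which alternates assignment and center updates as in steps (2) and (3); note that kmeans starts from randomly chosen centers rather than from an initial partition.

```r
# Sketch: K-means on two well-separated synthetic groups (illustrative data).
set.seed(42)
x <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 5, sd = 0.3), ncol = 2))

# nstart repeats the random initialisation and keeps the best solution
km <- kmeans(x, centers = 2, nstart = 10, iter.max = 50, algorithm = "Lloyd")

print(km$centers)        # one row per cluster center
print(table(km$cluster)) # cluster sizes
```

Because the result depends on the starting centers, setting nstart above 1 is a common precaution when the clusters are less clearly separated than in this example.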
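The log-linear model is a method for studying the associations among categorical variables. It generally contains many parameters, which makes the fitting process demanding in terms of sample size. The method proposed here targets the situation in which a sufficiently large sample cannot be obtained: multiple "new data sets" are generated by Bootstrap sampling, a log-linear model is fitted to each to obtain the corresponding parameter vector, and confidence intervals for the parameters are estimated. Finally, K-means clustering is applied to all the parameter vectors to produce several representative parameter vectors, which serve as candidate models.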
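To measure the distance between a fitted model and the true model, the K-L distance is introduced as the evaluation criterion [8]. K-L distance is short for Kullback-Leibler divergence, also called relative entropy. It measures the difference between two probability distributions P(x) and Q(x) defined on the same event space. The definition is given below.
The K-L distance is denoted D(P‖Q) and is computed as: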
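For two discrete distributions on a common finite set, the standard definition of the Kullback-Leibler divergence is as follows. The natural logarithm is assumed here; the base of the logarithm used in this paper is not stated in the text.

```latex
D(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}
```

The sum runs over all outcomes x with P(x) > 0, and the quantity is zero exactly when P and Q coincide.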
In general D(Q‖P)≠D(P‖Q), i.e., the measure is not symmetric. A modified (symmetrized) K-L distance is therefore introduced, expressed as:
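The asymmetry, and one way to remove it, can be illustrated in R. The helper names kl and kl_sym are hypothetical, and the symmetrization used here (the sum of the two directed divergences) is an assumption: it is one common choice and may differ from the modified distance used in this paper, for instance by a factor of one half.

```r
# Sketch: directed and symmetrized K-L distance for discrete distributions.
# kl() and kl_sym() are illustrative helpers, not functions from a package.
kl <- function(p, q) {
  p <- p / sum(p)              # renormalise in case of rounding
  q <- q / sum(q)
  keep <- p > 0                # terms with p = 0 contribute nothing
  sum(p[keep] * log(p[keep] / q[keep]))
}

# Assumed symmetrization: D(P||Q) + D(Q||P)
kl_sym <- function(p, q) kl(p, q) + kl(q, p)

p <- c(0.5, 0.3, 0.2)
q <- c(0.4, 0.4, 0.2)
print(c(kl(p, q), kl(q, p), kl_sym(p, q)))
```

The two directed values differ slightly for this pair, while kl_sym returns the same value regardless of the order of its arguments.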
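To verify the effectiveness of the method, the true model is assumed to be given. The specific steps are as follows:
(3) Use the Bootstrap to generate 2000 "new sample" data sets of size N;
(4) Fit a log-linear model to each of the 2000 new samples to obtain the corresponding parameter vectors γi, i=1,2,...,2000; R is the N×2000 parameter matrix formed by these vectors;
(5) From the parameter vectors γi in step (4), compute a confidence interval for each parameter, and use kmeans(t(R),k) to cluster the parameter vectors into k cluster-center vectors; these k vectors define k log-linear models, i.e., k probability distributions.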
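Steps (3) to (5) can be sketched in R as below. Several choices are assumptions made for illustration and are not taken from this paper: the cell ordering produced by expand.grid, R's default treatment contrasts, the constant 0.5 added to every cell to guard against empty cells, percentile confidence intervals, and B = 200 replicates instead of 2000 to keep the run short. Because a saturated model reproduces the observed counts exactly, its coefficients are obtained here by solving the linear system X γ = log(counts) directly; for strictly positive counts, glm with family = poisson and the same formula should give the same values.

```r
# Sketch of steps (3)-(5): Bootstrap, saturated log-linear fit, clustering.
set.seed(1)
count <- c(7, 10, 16, 15, 9, 8, 20, 15, 7, 13, 11, 15, 12, 10, 20, 12)
data  <- rep(seq_along(count), times = count)
n     <- length(data)

# Design matrix of the saturated model for four binary factors (16 x 16).
# The cell order of expand.grid is an assumption about how cells are indexed.
grid <- expand.grid(M = factor(1:2), E = factor(1:2),
                    P = factor(1:2), G = factor(1:2))
X <- model.matrix(~ M * E * P * G, data = grid)

B <- 200  # reduced from 2000 for speed (assumption)
R <- sapply(seq_len(B), function(b) {
  boot <- sample(data, size = n, replace = TRUE)
  freq <- tabulate(boot, nbins = 16) + 0.5   # 0.5 avoids log(0) (assumption)
  solve(X, log(freq))                        # saturated-model coefficients
})
# R has one column per Bootstrap replicate: 16 parameters x B replicates

# Percentile 95% interval for each parameter (2 x 16 matrix)
ci <- apply(R, 1, quantile, probs = c(0.025, 0.975))

# Cluster the B parameter vectors into k = 3 centers
km <- kmeans(t(R), centers = 3, nstart = 10, iter.max = 50)

# Convert each center into a probability distribution over the 16 cells
probs <- t(apply(km$centers, 1, function(g) {
  mu <- exp(X %*% g)
  as.vector(mu / sum(mu))
}))
print(round(probs, 3))
```

Each row of probs is one candidate distribution that can then be compared with the true distribution through the K-L distance.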
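The true model consists of four binary variables M, E, P, G; its parameters are given in Table 1.
Table 1 Parameters of the true model
From Table 1, the probability distribution of the true model is computed; written as a vector of the 16 cell probabilities, it is:
PL<-c(0.029,0.004,0.086,0.074,0.044,0.041,0.095,0.08,0.036,0.065,0.056,0.078,0.06,0.045,0.1,0.06)
A sample of size 200 drawn from the true model serves as the original sample; count gives the frequency of each cell:
count<-c(7,10,16,15,9,8,20,15,7,13,11,15,12,10,20,12)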
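Drawing such a sample can be sketched with rmultinom. The seed and the name count_sim are assumptions for illustration, so the simulated frequencies are not expected to reproduce the count vector above. The function rescales prob internally, so the rounded probabilities need not sum exactly to one.

```r
# Sketch: draw 200 observations from the 16-cell distribution PL.
set.seed(2017)
PL <- c(0.029, 0.004, 0.086, 0.074, 0.044, 0.041, 0.095, 0.08,
        0.036, 0.065, 0.056, 0.078, 0.06, 0.045, 0.1, 0.06)

# One multinomial draw of size 200; result is a 16 x 1 matrix of cell counts
count_sim <- as.vector(rmultinom(1, size = 200, prob = PL))
print(count_sim)
```

The resulting vector has the same form as count and can be expanded into individual observations with rep before resampling.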
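The Bootstrap is applied to the original sample to generate 2000 new data sets, and the corresponding log-linear models are fitted to obtain the parameter vectors γi and the parameter matrix R. With k=3, the transposed matrix t(R) is clustered; the parameter estimates at the three resulting cluster centers are shown in Table 2 and Fig. 1.
Table 2 Parameter estimates
The probability distributions of these three models follow readily from their parameters; they are denoted P1, P2, P3:
P1<-c(0.027,0.056,0.102,0.082,0.057,0.04,0.103,0.07,0.045,0.068,0.056,0.086,0.063,0.058,0.025,0.062)
P2<-c(0.035,0.043,0.084,0.073,0.044,0.038,0.095,0.078,0.034,0.063,0.052,0.074,0.06,0.048,0.1,0.083)
P3<-c(0.045,0.037,0.08,0.08,0.036,0.043,0.1,0.078,0.027,0.07,0.058,0.07,0.065,0.048,0.1,0.06)
The K-L distances between these three distributions and that of the true model are, respectively:
D(PL,P1)=0.092, D(PL,P2)=0.0058, D(PL,P3)=0.011.
The K-L distances show that all three distributions are close to that of the true model, but the best fit is given by the parameter vector of the second cluster center, which is also the one lying closest to the confidence intervals of the parameter estimates. The confidence intervals of the parameters are listed as confidence interval 1 in Table 3.
Table 3 Confidence intervals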
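The true model above is a saturated model, i.e., one that includes all interaction terms. To verify that the method remains effective for unsaturated models, an unsaturated model is now taken as the true model (the procedural steps are omitted). The true model parameters and the results are given below.
The true model again consists of four binary variables M, E, P, G; its parameters are given in Table 4.
Table 4 Parameters of the true model
A sample of 200 drawn from the true model serves as the original sample; the cell frequencies are:
count<-c(4,9,16,12,6,9,16,16,6,8,29,18,8,14,14,15)
The three resulting cluster centers are shown in Table 5 and Fig. 2.
Table 5 Parameter estimates
The probability distributions of the three models are denoted P1, P2, P3:
P1<-c(0.035,0.046,0.086,0.075,0.04,0.04,0.095,0.08,0.035,0.06,0.08,0.08,0.05,0.049,0.09,0.06)
P2<-c(0.04,0.037,0.077,0.076,0.036,0.04,0.097,0.077,0.026,0.065,0.0586,0.072,0.061,0.045,0.095,0.093)
P3<-c(0.04,0.058,0.075,0.068,0.058,0.05,0.075,0.072,0.053,0.064,0.06,0.07,0.06,0.06,0.077,0.063)
Their K-L distances from the probability distribution of the true model are, respectively:
D(PL,P1)=0.019, D(PL,P2)=0.055, D(PL,P3)=0.06.
Of these three K-L distances, that of the first model is the smallest, which is attributed to its being the closest to the confidence intervals of the parameters. The confidence intervals of the parameters are listed as confidence interval 2 in Table 3.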
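This paper addresses the construction of log-linear models for categorical data from small samples. The background is that categorical variables demand larger samples than ordinary continuous variables, so a robust model cannot be built when the sample is insufficient. The experimental results indicate that combining the Bootstrap with clustering is effective to a certain extent. It should be noted, however, that even when the parameters of two models differ, the corresponding probability distributions may show no significant difference.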
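[1] AGRESTI A. Categorical Data Analysis[M]. QI Yaqiang, trans. Chongqing: Chongqing University Press, 2012.
[2] ZHANG Yanbo, HE Dawei. Optimal model selection strategy for log-linear models[J]. Chinese Journal of Health Statistics, 1996(6): 4-7.
[3] TANG Xianyong. Selection strategy for log-linear models in 3-way contingency tables[J]. Journal of Hunan University of Science and Engineering, 2003, 1(1): 155-159.
[4] WEI Jie, MENG Jie. Log-linear model analysis of three-way contingency tables based on R[J]. Science & Technology Information, 2009(20): 727, 730.
[5] EFRON B, TIBSHIRANI R J. An Introduction to the Bootstrap[M]. New York: Chapman and Hall, 1993.
[6] EFRON B. Bootstrap methods: another look at the jackknife[J]. Annals of Statistics, 1979, 7(1): 1-26.
[7] SUN Jigui, LIU Jie, ZHAO Lianyu. Research on clustering algorithms[J]. Journal of Software, 2008(1): 48-61.
[8] YAO Zhijun, LIU Juntao, ZHOU Yu, et al. Similarity measure method based on KL distance[J]. Journal of Huazhong University of Science and Technology (Natural Science Edition), 2012, 39(11): 1-4.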
Logarithmic Linear Model Construction under Bootstrap
XU Lingli1, CHEN Xuedong2
(1. College of Mathematics and Information Engineering, Zhejiang Normal University, Jinhua 321004, China; 2. School of Science, Huzhou University, Huzhou 313000, China)
Since data on some categorical variables are difficult to collect, the samples obtained are usually limited. A logarithmic linear model built from such data alone is unreliable, and the accuracy of the parameter estimates for the interaction terms may be low. To address this problem, a number of data sets are generated by Bootstrap sampling, a logarithmic linear model is fitted to each of them, and the resulting estimated parameter vectors are clustered to obtain several representative parameter vectors. The experimental results show that, even if the parameters differ from those of the true model, the K-L distance between the probability distribution of the model corresponding to each of these vectors and that of the true model is small, that is, the distributions are close. Among these vectors, the closer a vector is to the confidence intervals of the corresponding parameters, the smaller its K-L distance from the true probability distribution.
categorical data; logarithmic linear model; Bootstrap sampling; K-L distance; confidence interval
2017-08-10
National Natural Science Foundation of China (1171105).
CHEN Xuedong, Ph.D., professor; research interests: applied statistics and actuarial science. E-mail: xdchen@zjhu.edu.cn
O212
A
1009-1734(2017)10-0001-05
MSC2010:62E07; 62P09
[Responsible editor: WU Zhihui]