A Review of Uncharted: Big Data as a Lens on Human Culture
SHAO Bin, CHEN Jingjing
(Zhejiang University of Finance and Economics, Hangzhou 310018)
Aiden, Erez & Jean-Baptiste Michel. 2013. Uncharted: Big Data as a Lens on Human Culture. New York: Riverhead Books. ISBN: 978-1594487453. 288 pp.
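In the era of big data, research that uses quantitative analysis of massive datasets to reveal trends in the evolution of human culture is known as “culturomics”. The concept originates in the paper “Quantitative analysis of culture using millions of digitized books”, published in Science in 2011 by the Harvard research group of J.-B. Michel and E. L. Aiden. Aiden and Michel later collaborated again and in 2013 published Uncharted: Big Data as a Lens on Human Culture①, which describes culturomics research and its applications in detail. Culturomics has brought the natural sciences and the humanities together and has fostered the formation of the new field of “digital humanities”. This review briefly introduces the book in the hope of drawing scholarly attention to culturomics and helping researchers grasp new trends in humanities research in the age of big data.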
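1. Overview of contents
The book consists of seven chapters. Chapter 1 gives a general definition of culturomics: the quantitative study of human culture using big data. The authors argue that big data will change the humanities, transform the social sciences, and redefine the relationship between the worlds inside and outside the ivory tower (8). The research is based on the Google Books corpus, which contains the digitized text of 30 million books published since the 16th century in seven languages (English, French, German, Spanish, Russian, Chinese and Hebrew), totalling 500 billion words and accounting for 6% of all books ever published. Because its texts span five centuries, the corpus can reflect changes in patterns of human behaviour, shifts in culture, and even the rise and fall of civilizations; it is therefore not only “big data” but also “long data”. Copyright restrictions, however, prevent researchers from working directly with the content of the books. The authors therefore developed the Google Books N-gram② Viewer (hereafter N-gram Viewer), which plots the year-by-year usage frequency of words in the corpus as a curve. It thus works like a prism through which the evolution of human culture can be observed.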
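The idea behind such a frequency curve can be sketched in a few lines of Python. This is a minimal illustration, not the N-gram Viewer's actual implementation: the function names, the toy corpus, and the choice to normalize a phrase's count by the number of same-length n-grams in that year are all assumptions made for the example.

```python
from collections import Counter

def extract_ngrams(tokens, max_n=5):
    """Return every contiguous sequence of 1 to max_n words in a token list."""
    ngrams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams

def yearly_relative_frequency(corpus_by_year, phrase):
    """Map each year to count(phrase) / number of n-grams of the same length."""
    n = len(phrase.split())
    series = {}
    for year, tokens in sorted(corpus_by_year.items()):
        # Keep only the n-grams that have the same length as the query phrase.
        same_length = [g for g in extract_ngrams(tokens, max_n=n)
                       if len(g.split()) == n]
        counts = Counter(same_length)
        total = len(same_length)
        series[year] = counts[phrase] / total if total else 0.0
    return series

# Toy, made-up corpus: one short "book" per year (hypothetical data).
toy_corpus = {
    1900: "the united states of america is large".split(),
    1950: "the united states and the united states navy".split(),
}

if __name__ == "__main__":
    for year, freq in yearly_relative_frequency(toy_corpus, "united states").items():
        print(year, round(freq, 3))
```

Plotting the resulting year-to-frequency mapping for a real corpus would give the kind of curve the chapter describes.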
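CHEN Jingjing is a graduate student at Zhejiang University of Finance and Economics. Main research interests: corpus linguistics and discourse analysis. Email: stephaniecjj@163.com
* This article is an interim result of the National Social Science Fund of China general project “A Study on Constructing a Model of Language Change Based on Emergent Affixes in English and Chinese” (No. 14BYY001).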
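Chapter 2 uses the corpus to explore grammatical change. Language is a part of culture that is relatively easy to delimit, so the book begins with language change. The case study is the regularization of irregular past-tense forms, with a focus on the relationship between a verb's usage frequency and its regularization. The study finds that of 177 irregular verbs in Old English, 145 remained irregular in Middle English and only 98 remain so in Modern English. The authors calculate that the half-life③ of an irregular verb is proportional to the square root of its usage frequency, so its rate of regularization is inversely proportional to that square root: if verb A is 1/100 as frequent as verb B, A regularizes 10 times as fast as B. They further estimate that verbs such as chide and shrive, with frequencies between 10^-6 and 10^-5, have a half-life of about 300 years, whereas verbs such as drink and speak, with frequencies between 10^-3 and 10^-2, have a half-life of about 5,400 years. The chapter concludes that the data speak for themselves: human language also undergoes natural selection, and usage frequency is the most important factor determining whether an English irregular verb survives (43).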
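The 1/100 example can be checked with a short calculation. The sketch below assumes the square-root scaling implied by that example (a verb 100 times rarer regularizes 10 times as fast, so its half-life is one tenth as long) together with simple exponential decay; the function names and the reference values passed in are illustrative, not taken from the book.

```python
import math

def scaled_half_life(reference_half_life, frequency_ratio):
    """Half-life of a verb whose frequency is frequency_ratio times that of a
    reference verb, assuming half-life grows with the square root of frequency."""
    return reference_half_life * math.sqrt(frequency_ratio)

def surviving_fraction(years, half_life):
    """Share of irregular verbs still irregular after `years`, assuming
    exponential decay with the given half-life."""
    return 0.5 ** (years / half_life)

if __name__ == "__main__":
    # Verb A is 1/100 as frequent as verb B (reference half-life: 5,400 years).
    half_life_a = scaled_half_life(5400, 1 / 100)
    print("half-life of A:", round(half_life_a), "years")
    print("A regularizes", round(5400 / half_life_a), "times as fast as B")
    # After one half-life, half of the verbs in a frequency band remain irregular.
    print("fraction left after 300 years (half-life 300):", surviving_fraction(300, 300))
```

Under these assumptions a hundredfold drop in frequency shortens the half-life tenfold, which matches the relationship stated in the chapter.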
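Chapter 3 uses the corpus to explore the “blind spots” of lexicography, that is, words not recorded in dictionaries. First, the study finds that most English dictionaries record only high-frequency words, while low-frequency words, which make up 52% of the total vocabulary, never enter the dictionary; these constitute the “lexical dark matter”④ of the lexicon. The authors therefore regard the English vocabulary as still, to some extent, an “undiscovered continent” (76). Second, the counts show that around 1900 the English vocabulary already exceeded 550,000 words; by 1950 it had grown only to 600,000, but by 2000 it had reached 1,000,000, and it now gains about 8,400 words a year, so vocabulary growth is accelerating. The study also finds that although lexicographers work hard to track new words, dictionaries still fail to reflect the latest changes in English vocabulary in time. Take the fourth edition of The American Heritage Dictionary, published in 2000: its newly added words include mesclun, netiquette and amplidyne, yet the N-gram Viewer shows that mesclun and netiquette had already reached the dictionary's frequency threshold for inclusion by 1992, while amplidyne peaked in frequency as early as 1950 and was already obsolete by 2000. The N-gram Viewer can thus trace the “rise and fall” of words, help keep dictionaries up to date, and explore the uncharted territory of the vocabulary.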
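Chapter 4 uses the N-gram Viewer to measure fame. If a person's fame is taken to be the frequency with which his or her name appears in Google Books, fame becomes measurable. Overall, the frequency curves of personal names in Google Books share a common shape with four stages: initial fame, rapid growth, peak, and slow decline. The authors measure fame in five respects: (1) age at initial fame; (2) the time it takes for fame to double; (3) age at peak fame; (4) the half-life of fame; (5) the relationship between fame and profession. The study finds that the age at which fame peaks is stable at about 75, whereas the other measures change over time. Comparing 1800 with 1950, the age of initial fame fell from 43 to 29, the doubling time of fame fell from 8.1 years to 3.3 years, and the half-life of fame fell from 120 years to 71 years. In short, people now become famous earlier and faster, but are also forgotten faster. The relationship between fame and profession also yields striking findings. The data show that actors become famous earliest, at around 30; writers become famous at around 40, end up more famous, and stay famous longer; politicians become famous late, at around 50, but attain the greatest fame; scientists become famous at around 60; artists and mathematicians are the least likely to become famous. The N-gram Viewer thus makes fame, a subjective matter, quantifiable and objectively measurable. In fact, Veres and Bohannon (2011) had already ranked more than 4,000 scientists by fame using quantitative methods and published “The science hall of fame” in Science; this chapter can be seen as an extension of that article.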
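As a rough worked illustration, the doubling times and half-lives quoted above can be converted into annual rates if one assumes constant exponential growth and decay. That assumption and the function names are illustrative and are not taken from the book.

```python
def annual_growth_from_doubling(doubling_years):
    """Constant yearly growth rate that doubles a quantity in `doubling_years`."""
    return 2 ** (1 / doubling_years) - 1

def annual_decline_from_half_life(half_life_years):
    """Constant yearly decline rate that halves a quantity in `half_life_years`."""
    return 1 - 0.5 ** (1 / half_life_years)

if __name__ == "__main__":
    # Figures quoted in the review for the 1800 and 1950 reference points.
    for label, doubling, half_life in [("1800", 8.1, 120), ("1950", 3.3, 71)]:
        growth = annual_growth_from_doubling(doubling)
        decline = annual_decline_from_half_life(half_life)
        print(f"{label}: fame grows about {growth:.1%} per year on the way up, "
              f"and fades about {decline:.2%} per year afterwards")
```

On this reading, the shorter doubling time corresponds to a yearly growth rate more than twice as high, while the shorter half-life corresponds to a noticeably faster yearly fade.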
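Chapter 5 shows how the N-gram Viewer can be used to trace publication censorship and political suppression. If certain words or names suddenly “vanish” from the corpus during a given period, it is very likely that they were banned from books. The authors compare the German and English Google Books corpora to examine censorship and political suppression in Nazi Germany. Google Books shows that the Jewish painter Marc Chagall began to gain fame around 1910. In English books his fame continued to rise, whereas in German books it fell to a trough between 1936 and 1944, clearly because the Nazi persecution of Jews “silenced” the painter. Historically, some campaigns of political suppression were large in scale and affected many people, and those suppressed, although blacklisted, were not necessarily documented; examples include the Great Purge in the Soviet Union under Stalin and the political censorship in the “Hollywood Ten” case in the United States. By examining changes in the frequency of words or names with the N-gram Viewer, however, one can automatically detect whether a person or an idea was subjected to censorship or suppression.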
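One simple way to turn this idea into a number is to compare a name's average frequency during the suspected period with its average frequency in the surrounding years. The sketch below is a hedged illustration: the 1936-1944 window comes from the Chagall example, but the baseline windows, the function names, and the toy frequencies are assumptions invented for the example and are not the authors' data or exact method.

```python
def mean_frequency(series, start, end):
    """Average frequency over the years start..end (inclusive)."""
    values = [freq for year, freq in series.items() if start <= year <= end]
    if not values:
        raise ValueError(f"no data between {start} and {end}")
    return sum(values) / len(values)

def suppression_index(series, window, baselines):
    """Mean frequency inside `window` divided by the mean over all baseline
    years. Values well below 1 suggest the name was suppressed in the window."""
    baseline_values = [freq for start, end in baselines
                       for year, freq in series.items() if start <= year <= end]
    if not baseline_values:
        raise ValueError("no baseline data")
    baseline_mean = sum(baseline_values) / len(baseline_values)
    return mean_frequency(series, *window) / baseline_mean

# Hypothetical frequencies for one name in two corpora (made-up numbers).
german = {y: (1e-7 if 1936 <= y <= 1944 else 4e-7) for y in range(1926, 1956)}
english = {y: 4e-7 for y in range(1926, 1956)}

if __name__ == "__main__":
    window = (1936, 1944)
    baselines = [(1926, 1935), (1946, 1955)]  # assumed comparison periods
    print("German index:", round(suppression_index(german, window, baselines), 2))
    print("English index:", round(suppression_index(english, window, baselines), 2))
```

Computing the same index in a second language, as in the English series here, gives a control: a drop that appears in only one corpus points to something specific to that language's publishing environment.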
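Chapter 6 uses big data to study collective memory and collective forgetting. The authors note that concepts such as collective memory have usually been excluded from scientific investigation, yet studying them with the N-gram Viewer is not difficult (153). Taking year numbers as an example, they track the frequency of a given year number over time to observe how the events of that year are remembered. The study shows that the forgetting of a given year is fast at first and then slows, in line with the Ebbinghaus forgetting curve. As society develops, however, people forget faster and faster and soon lose interest in the past: the year number 1872 has a half-life of 24 years, whereas 1973 has a half-life of only 10 years. The authors also examine the counterpart of collective forgetting, “collective learning”, that is, how new things enter the collective consciousness. Taking 147 patented inventions listed in Wikipedia as examples of how new things are accepted by the public, they find that in the early 19th century an advanced technology took about 65 years to be accepted by mainstream culture, whereas by the early 20th century it took only 26 years, which shows that people are accepting new things faster and faster.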
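The half-life of a year number can be read off a frequency series as the number of years between its peak and the first year in which its frequency has fallen to half the peak value. The sketch below assumes that definition; the function name and the toy series are invented for illustration and are not the authors' data.

```python
def half_life_after_peak(series):
    """Years from the peak of a frequency series until the frequency first
    falls to half the peak value or lower; None if that never happens."""
    years = sorted(series)
    peak_year = max(years, key=lambda y: series[y])
    half_of_peak = series[peak_year] / 2
    for year in years:
        if year > peak_year and series[year] <= half_of_peak:
            return year - peak_year
    return None

# Made-up relative frequencies of the string "1973" (arbitrary units).
toy_1973 = {
    1971: 20, 1972: 60, 1973: 100,
    1974: 93, 1975: 87, 1976: 81, 1977: 76, 1978: 71,
    1979: 66, 1980: 62, 1981: 57, 1982: 54, 1983: 50, 1984: 47,
}

if __name__ == "__main__":
    print("half-life:", half_life_after_peak(toy_1973), "years")
```

Applying the same function to the series for different year numbers would allow the kind of comparison the chapter makes between 1872 and 1973.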
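Chapter 7 extends the scope of big data. The authors argue that Google Books is only the tip of the big-data iceberg. In future, newspapers, manuscripts and even physical objects will be digitized, giving rise to a “big” digital humanities. For example, the 422 surviving letters of the American writer Edgar Allan Poe reveal his creative process, and the objects in his former home reflect the environment in which he wrote, yet none of this material is currently included in Google Books. Once digitized, these data will join the Google Books project to form a prism that reflects changes in human culture and refracts every aspect of human history. Big data can not only record the past and examine the present but also predict the future. The authors therefore close with the conclusion that “data is power”.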
2. Evaluation
The book's strengths lie mainly in the following two respects.
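First, it bridges science and the humanities and advances the “digital humanities”. Several years ago, Professor Gary King of Harvard University predicted that with the emergence and use of big data the empirical basis of social science research as a whole would change substantially, and that the convergence of qualitative and quantitative research might even accelerate (King 2009). Using quantitative analysis, the book explores grammatical change, lexicography, the measurement of fame, censorship and suppression, and collective forgetting and memory, all important topics in the humanities and social sciences. These areas were traditionally considered hard to study quantitatively, but the book presents them fairly objectively by means of a huge database. Culturomics can be said to offer humanities research a completely new method and to have advanced the development of the digital humanities. In just two or three years, scholars abroad have adopted a culturomics perspective to study sentiment mining, conflict prediction, changes in university rankings, climate change, the measurement of complex relationships and other topics, producing no fewer than a hundred papers, which shows how great and far-reaching the book's influence has been.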
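Second, it is reader-friendly and written in plain language. The book brings big data into humanities research and proposes the concept of culturomics, yet it is not crammed with jargon; it explains the application of big data in humanities research in an accessible and engaging way that does not intimidate readers. Each chapter poses one research question and describes in detail the theoretical background and related knowledge, often using stories and analogies to help readers understand the question. Because the authors are natural scientists, they occasionally borrow concepts from the natural sciences, such as “dark matter”, “half-life” and “genome”, but these are used aptly and explained clearly. The book is therefore suitable not only for readers trained in linguistics; general readers interested in culture can also benefit from it.
The book also has two shortcomings.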
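First, research based on the N-gram Viewer is detached from context and sometimes overgeneralizes. The N-gram Viewer relies too heavily on word-frequency analysis and cannot examine the context in which a word occurs. In the discussion of fame, for example, the frequency of a name in Google Books can measure only how famous a person is, not whether that fame is good or bad. Moreover, although using word frequency alone to represent cultural influence is easy to operationalize, a curve by itself makes it hard to judge whether a change is statistically significant. The research could be deepened if N-gram Viewer data were further processed with statistical methods; one example is Acerbi et al. (2013), who combined an affective lexicon (WordNet Affect) with Porter's Algorithm to study changes in the expression of emotion in 20th-century English-language Google Books.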
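Second, the data for some languages are insufficient, and the corpus is not representative enough. The English portion of the Google Books corpus is huge, at 350 billion words, whereas the Chinese portion contains only 13 billion words, far too few given the vast body of Chinese books. In other words, the Chinese corpus is not sufficiently representative, which inevitably affects the conclusions. For example, when the names “孔子” (Confucius) and “孟子” (Mencius) are searched in the Chinese Google Books corpus, the former rarely appears before 1800 and the latter is not mentioned until 1927, a result that plainly contradicts the facts. It arises because the Chinese data are unevenly distributed across historical periods.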
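Despite these shortcomings, the book's merits outweigh its flaws. As the first work to explain the concept of culturomics systematically and to introduce its applications, it is bound to leave a strong mark on the history of big data. In fact, over the past few years some humanities scholars in China have begun to pay attention to the digital humanities, for example Zhang Longxi (2011), and a culturomics perspective has even been used to trace a century of development in sociology, as in Chen Yunsong (2015). On the whole, however, related research in China has yet to get under way in earnest. This review therefore hopes to draw scholarly attention to culturomics and looks forward to more scholars taking up big-data research to explore the “uncharted” territory of the humanities and social sciences.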
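Notes
① Subsequent citations of this work give page numbers only.
② An N-gram (usually rendered in Chinese as “N元组”) is a sequence of one or more words extracted from a corpus, that is, a single word or a phrase. In this research N is restricted to the range 1 to 5, so N-grams run from 1-grams to 5-grams and include, for example, “America”, “United States” and “the United States of America”. The 2 billion N-grams of Google Books can be searched and downloaded at https://books.google.com/ngrams/.
③ The authors, Aiden and Michel, both have a science and engineering background and therefore often borrow terms from the natural sciences. Half-life originally refers to the time required for half of the atomic nuclei of a radioactive element to decay; here it is used to mean “the time required for frequency to fall by half”.
④ The authors set the frequency threshold at one occurrence per billion words in Google Books, i.e. 10^-9; words below this value count as low-frequency words.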
References
Acerbi, A., V. Lampos, P. Garnett & A. Bentley. 2013. The expression of emotions in 20th century books [J]. PLoS ONE 3: 1-6.
King, G. 2009. The changing evidence base of social science research [A]. In G. King, K. Schlozman & N. Nie. The Future of Political Science: 100 Perspectives [C]. New York: Routledge. 91-93.
Michel, J.-B., Y. K. Shen, A. P. Aiden, A. Veres, M. K. Gray, The Google Books Team, J. P. Pickett, D. Hoiberg, D. Clancy, P. Norvig, J. Orwant, S. Pinker, M. A. Nowak & E. L. Aiden. 2011. Quantitative analysis of culture using millions of digitized books [J]. Science 331(6014): 176-82.
Veres, A. & J. Bohannon. 2011. The science hall of fame [J]. Science 331(6014): 143.
Chen, Yunsong. 2015. A century of sociology in big data: A study of cultural influence based on millions of books [J]. Shehuixue Yanjiu (Sociological Studies) (01): 23-48. [In Chinese]
Zhang, Longxi. 2011. Humanities research and electronic information technology [J]. Shuwu (10): 52-54. [In Chinese]
(Editor in charge: XUAN Yan)
Abstracts of Major Papers in This Issue
English Education: Needs and Mission, by YE Xingguo, p.1
This speech begins by exploring the evolution of English education in China, probes into the relationship between English education and the state’s needs, analyses the significant contribution of English education to the realization of state strategies, and concludes that a university shall, at the juncture of the promulgation of the National Criteria of Teaching Quality for Bachelor Degree Foreign Language Programs, find a “niche” or specific state needs and aim at satisfying them by working out its own criteria of English teaching quality.
On Collaborative Innovation of Translation Education under the New Normal, by YE Xingguo, p.5
The speech sets forth the new normal of the translation circles, analyses the different value orientations of the subjects of the collaborative innovation, namely the relevant circles of administration, enterprises, education, research and clients, and points out the six main problems and their solutions.
Innovation of English Teaching under the New Normal, by YE Xingguo, p.9
The speaker talks about how the new domestic needs, international situation and ICT development are challenging English teaching, why new ideas, standards and methods shall be applied and what the new normal of English teaching is, and emphasizes the importance of keeping up with the times and teaching innovation.
An Overview of Linguistic Landscape Study in China and the Prospect, by ZHANG Baicheng, p.14
Linguistic landscape study in China dates back to the 1980s. In the past forty years, Chinese scholars have achieved remarkable progress in this domain, and the numerous studies mainly cover three themes: (1) Linguistic landscape translation and the norms; (2) Features of domain-specific linguistic landscapes; (3) Theory and methodology in linguistic landscape study. The studies investigate many types of linguistic landscapes including public signs/labels, publicizing language, slogans, street/road/store/institutional names, and couplets. The limitations of the studies lie in the four aspects: emphasizing description but ignoring interpretation, inadequacy of theoretical and methodological explorations, and not paying enough attention to multimodal signs per se. Future study can be furthered through focusing on five aspects, including shifting the research focus, exploring the theoretical and methodological issues and so on.
On the Nature of Middle Verbs and Middle Constructions, by YANG Yongzhong, p.19
Middle constructions are a well-studied topic in linguistics. Based on a summary of the properties and features of middle verbs, this paper proposes that middle constructions are composed of two verbs, of which the first verb, serving as the predicate, denotes an action characteristic of conventional property or features, while the second verb, serving as a complement clause, denotes result. The combination of the two verbs denotes a complete event. Based on this, it is argued that all middle verbs must be of this nature in terms of underlying structure. Once this has been accepted, many long-standing puzzles related to middle constructions are solved quite readily.
A General Review of Dynamic Assessment and Second Language Learning, by WANG Hua, p.25
In dynamic assessment, important information about a learner’s abilities and changes can be learned during the assessment. Dynamic assessment is a procedure for simultaneously assessing and promoting development of learners’ cognitive procedure and ability, which is confirmed and applied in the research on second language teaching and learning. This study is a brief review and comment on the research of dynamic assessment and its application in second language learning based on a wide retrieval of literature.
Innovation or “Old Wine in a New Glass”: On the Use of Neologism in Skinner’s Verbal Behavior, by JIANG Daohua, p.31
Verbal Behavior, from the functional perspective, analyzes the cause-and-effect relations of human verbal behavior, and the key to understanding its theoretical framework lies in its use of neologistic terms. Taking this as the point of departure, the paper discusses the misunderstandings of Skinner’s behavioral theory initiated by Noam Chomsky and points out the great innovation and insightfulness of Skinner’s work.
A Study on English Majors’ Pragmatic Awareness in English Gratitude Context: Sex Roles and Social Situations, by CAI Chen & WANG Yinyin, p.46
In this article, we found that more and more males and females show androgynous characteristics and that masculinity has a higher sensitivity on pragmatic awareness than femininity. Participants show different pragmatic awareness on social situations and demonstrate significant difference on the perception of the burden of kindness. Meanwhile, sex roles still have different perceptions on the same social situation. The results reveal that participants construct their pragmatic awareness in the communication process. A successful communication requires the participants to improve their sensitivity on the differences of sex roles and social situations, so intercultural communication teaching shall concentrate on cultivating students’ critical intercultural communicative competence.
Role of Information Grounding in Literary Translation for Discourse Structuring: A Study of Three English Translations of a Chinese Prose “Zuiwengting Ji”, by LI Ming, p.60
Any discourse, a conglomeration of different sentences, features background information as well as foreground information both at the clause level and at the discourse level. The information which knits the thread of a discourse and which moves the discourse forward is called foreground information and the information which does not immediately and crucially contribute to the speaker’s goal, but which merely assists, amplifies, or comments on it is background information. Information grounding theory holds that an acceptable discourse results from the modulation of both foreground information and background information. The present paper, by taking three translations of the first paragraph of the Chinese literary discourse “Zuiwengting Ji” as an instance and through extracting and back-translating into Chinese their respective foreground information, aims to make readers fully aware of the important role that foreground information plays in achieving global coherence in discourse structuring.
On Mental Access between Topic and Subject in Text Translation, by ZHONG Shuneng, YOU Liping & ZHANG Yunxia, p.65
It is revealed that a topic finds its way in a Chinese text by means of an NP, a pronoun or a zero-form and works as a subject, on the one hand. On the other hand, a topic is embodied in a corresponding English counterpart in the form of either an NP or a pronoun and functions as a subject. The present paper indicates that a topic chain is established by means of metonymy when a topic accesses itself to a series of subjects. It is concluded by claiming that the topic chain plays a crucial role in developing a naturally coherent text.
Exploration on the Compilation Method of Special Dictionaries Based on the English-Chinese Parallel Corpus, by ZHANG Yushuang & GUAN Xinchao, p.69
This article describes the compilation method of special dictionaries based on the English-Chinese parallel corpus. In comparison with the traditional compilation method, the corpus-based method can improve a dictionary’s systematization and standardization, whatever its size is. How to choose corpus texts and how to do word frequency statistics etc. are the keys to compilation. The meanings of general words in this kind of dictionary will contribute to learning functions and are useful for understanding of specialties. The determination of true special terms depends upon the compilation goal and dictionary users etc. The corpus-based dictionary can also provide a linking service for special dictionaries. Certainly, there are disadvantages to compiling the dictionary in this way, and they should be treated carefully during the compilation process.
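About the author: SHAO Bin is an associate professor at Zhejiang University of Finance and Economics. Main research interests: corpus linguistics, lexical semantics and cognitive linguistics. Email: seesky1978@163.com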