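Lexicostatistics

Lexicostatistics is a method of comparative linguistics that determines the genealogical relationships among languages by comparing the percentage of cognate vocabulary they share. Like historical-comparative linguistics, it is related to the comparative method, but it does not involve reconstructing a proto-language. Lexicostatistics should be distinguished from glottochronology, which uses lexicostatistical methods to estimate how long ago languages diverged and is therefore only one application of lexicostatistics. Other applications need not accept glottochronology's assumptions, such as a constant rate of change in core vocabulary.

The term "lexicostatistics" is somewhat misleading, because the method relies on simple mathematical equations rather than on statistics. Linguistic features other than the lexicon are also occasionally used. Whereas the comparative method uses shared innovations to establish subgroups within a family, lexicostatistics makes no such distinction. It is a distance-based method and, unlike the comparative method, does not examine linguistic features directly, which makes it a simpler and faster technique. It nevertheless has limitations, which are discussed later in this article. Its results can be validated by cross-checking the family trees it produces.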

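History

Lexicostatistics was developed by Morris Swadesh in the 1950s on the basis of earlier ideas.^[1]^[2]^[3] The earliest known use of the concept dates to 1834, when Jules Dumont d'Urville, comparing a number of Oceanic languages, proposed a method for calculating a coefficient of relationship between languages. Hymes (1960) and Embleton (1986) both review the history of lexicostatistics.^[4]^[5]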

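Method

Creating word lists

The aim is to create a list of universally used meanings (such as "hand", "mouth", "sky", and "I"). The researcher then collects, for each language under study, the words that express each meaning on the list. Swadesh reduced an originally much longer list to 200 meanings, and later reduced and revised it further to 100. On Wiktionary, the Swadesh list gives 207 meanings in total. Other lists apply stricter criteria, such as the Dolgopolsky list and the Leipzig-Jakarta list. There are also lists with a more specific scope, such as the 200 meanings that Dyen, Kruskal and Black (1992) compiled for 84 Indo-European languages.^[6]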

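Determining cognacy

Deciding which words are cognate requires a trained and experienced linguist, and the decisions need to be revised as knowledge of the languages deepens. Lexicostatistics does not, however, depend on every cognacy decision being correct. For each pair of words from different languages, the cognacy judgment can be "yes", "no", or "indeterminate". An indeterminate or mistaken judgment does not necessarily change the resulting classification.

Sometimes a language has more than one word for a single meaning. In English, for example, both "small" and "little" correspond to the list meaning "not big".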

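Calculating lexicostatistic percentages

The percentage reflects the proportion of meanings for which a given pair of languages has cognate words. It is the number of cognate word pairs divided by the number of word pairs whose cognacy could be determined, so indeterminate pairs are excluded from the total. When N languages are studied, every pair of languages yields one such figure. The figures are entered into an N × N distance table; because the table is symmetric, only a triangular half of it is filled in when the work is complete. The pair or pairs of languages with the highest cognate percentage can then be linked.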
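The calculation described above can be sketched in Python as follows. This is a minimal illustration rather than any published implementation: the language names `A`, `B`, `C`, the judgment data, and the function names `cognate_percentage` and `percentage_table` are all hypothetical, and cognacy judgments are assumed to be encoded as `True` (cognate), `False` (not cognate), or `None` (indeterminate).

```python
from itertools import combinations

# Hypothetical cognacy judgments, one entry per meaning on the word list.
# True = cognate, False = not cognate, None = indeterminate.
judgments = {
    ("A", "B"): [True, True, False, None, True],
    ("A", "C"): [True, False, False, False, None],
    ("B", "C"): [False, True, None, False, False],
}


def cognate_percentage(pair_judgments):
    """Cognate pairs divided by pairs with a determinate judgment, as a percentage."""
    decided = [j for j in pair_judgments if j is not None]
    if not decided:
        return None  # no determinate judgments, so no percentage can be computed
    return 100.0 * sum(decided) / len(decided)


def percentage_table(languages, judgments):
    """Fill the triangular half of the N x N table, one entry per language pair."""
    return {
        (a, b): cognate_percentage(judgments[(a, b)])
        for a, b in combinations(languages, 2)
    }


languages = ["A", "B", "C"]
table = percentage_table(languages, judgments)
for pair, pct in table.items():
    print(pair, pct)
```

With this sample data, the pair A and B has three cognates among four determinate judgments (75%), while the other two pairs each have one in four (25%), so A and B would be linked first.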

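Creating the tree

The tree is built solely from the table filled in at the previous step. Various subgrouping methods can be used; the one listed here is that used by Dyen, Kruskal and Black (1992):

All languages are placed in a pool.
The two members with the highest cognate percentage are removed from the pool, combined into a group, and the group is returned to the pool.
This step is repeated until the pool contains only a single group.

Each merge can be understood as joining two subtrees under a common parent node. Languages merged earlier sit closer together in the tree and are taken to be more closely related; languages merged later are more distant.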
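The pooling procedure above can be sketched as a simple agglomerative loop. The text does not say how the percentage between a merged group and the remaining pool members is obtained, so this sketch assumes it is the average over all pairs of member languages (average linkage); the procedure actually used by Dyen, Kruskal and Black may differ. The function name `build_tree`, the four languages, and their percentages are hypothetical.

```python
def build_tree(languages, percent):
    """Merge the most similar pool members until one group remains.

    `percent` maps frozenset({a, b}) to the cognate percentage of languages a and b.
    Returns a nested tuple in which each tuple is one merge (a parent node).
    """
    # Each pool entry is (subtree, list of the languages it contains).
    pool = [(lang, [lang]) for lang in languages]

    def similarity(x, y):
        # Assumption: average the percentages over all cross-group language pairs.
        values = [percent[frozenset((a, b))] for a in x[1] for b in y[1]]
        return sum(values) / len(values)

    while len(pool) > 1:
        # Find the indices of the two pool members with the highest similarity.
        i, j = max(
            ((i, j) for i in range(len(pool)) for j in range(i + 1, len(pool))),
            key=lambda ij: similarity(pool[ij[0]], pool[ij[1]]),
        )
        x, y = pool[i], pool[j]
        merged = ((x[0], y[0]), x[1] + y[1])
        # Remove both members and put the combined group back into the pool.
        pool = [p for k, p in enumerate(pool) if k not in (i, j)]
        pool.append(merged)
    return pool[0][0]


# Hypothetical cognate percentages for four languages.
percent = {
    frozenset(("A", "B")): 80.0,
    frozenset(("C", "D")): 70.0,
    frozenset(("A", "C")): 40.0,
    frozenset(("A", "D")): 35.0,
    frozenset(("B", "C")): 45.0,
    frozenset(("B", "D")): 30.0,
}
tree = build_tree(["A", "B", "C", "D"], percent)
print(tree)
```

With these figures, A and B merge first, C and D merge second, and the two groups are joined last, giving the nested tuple `(('A', 'B'), ('C', 'D'))`.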

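Applications

Isidore Dyen was a leading figure in the application of lexicostatistics.^[7]^[8]^[9]^[10] He used the method to classify the Austronesian^[11] and Indo-European^[6] languages. It has also been applied in studies of American and African languages.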

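Pama-Nyungan

The internal classification of the Pama-Nyungan family has long been a problem for linguists working on Australian languages. A widely held view was that the relationships among the family's more than 25 subgroups could not be worked out, and that the subgroups might not be related to one another at all.^[12]

In 2012, Claire Bowern and Quentin Atkinson published the results of applying computational phylogenetic methods to 194 documented languages or dialects of the family.^[13] Their model "recovered" many subgroupings that had previously been proposed and widely accepted. It also offered significant insight into more contested branches, such as Paman, which is complicated by a lack of data, and Ngumpin–Yapa, whose genealogy is obscured by very high rates of borrowing between languages. Their dataset is the largest for any language family spoken by hunter-gatherers, and the second largest overall after the one used in studies of Austronesian. They concluded that lexicostatistical methods, which have been applied successfully to other language families around the world, work for the Pama-Nyungan languages as well.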

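Criticism

Hoijer (1956) and others pointed out that it is sometimes difficult to find words that correspond exactly to the meanings on the list, so that the Swadesh list often has to be modified.^[14] Gudschinsky (1956) questioned whether a universally valid list exists at all. In addition, both the choice of core meanings for a list and the choice among synonyms for a given meaning are subjective, and both affect the results.

Other factors, such as borrowing, tradition, and cultural taboo, can also skew the results, although at present no method fully avoids this problem. Lexicostatistics has sometimes been carried out using "lexical similarity" in place of cognacy, which makes the method nearly equivalent to mass comparison.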

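Improved methods

Some modern computational statistical hypothesis-testing methods can be applied to improve on lexicostatistics, as they use similar word lists and distance calculations.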

References

[1] Swadesh, Morris. Towards Greater Accuracy in Lexicostatistic Dating. International Journal of American Linguistics. 1955-04, 21 (2). ISSN 0020-7071. doi:10.1086/464321.
[2] Roberts, Helen H.; Swadesh, Morris. Songs of the Nootka Indians of Western Vancouver Island. Transactions of the American Philosophical Society. 1955, 45 (3). ISSN 0065-9746. doi:10.2307/1005745.
[3] Swadesh, Morris. Salish Internal Relationships. International Journal of American Linguistics. 1950-10, 16 (4). ISSN 0020-7071. doi:10.1086/464084.
[4] Hymes, D. H. Lexicostatistics So Far. Current Anthropology. 1960-01, 1 (1). ISSN 0011-3204. doi:10.1086/200074.
[5] Embleton, Sheila. Principles of Historical Linguistics: Ad(D) Hock. Diachronica. 1986-01-01, 3 (2). ISSN 0176-4225. doi:10.1075/dia.3.2.06emb.
[6] Dyen, Isidore; Kruskal, Joseph B.; Black, Paul. An Indoeuropean Classification: A Lexicostatistical Experiment. Transactions of the American Philosophical Society. 1992, 82 (5). doi:10.2307/1006517.
[7] Dyen, Isidore. The Lexicostatistically Determined Relationship of a Language Group. International Journal of American Linguistics. 1962-07, 28 (3). ISSN 0020-7071. doi:10.1086/464687.
[8] Dyen, Isidore. Lexicostatistically Determined Borrowing and Taboo. Language. 1963-01, 39 (1). doi:10.2307/410762.
[9] Dyen, Isidore (ed.). Lexicostatistics in Genetic Linguistics. 1973-12-31. doi:10.1515/9783110880847.
[10] Dyen, Isidore. Linguistic Subgrouping and Lexicostatistics. 1975-12-31. doi:10.1515/9783110880830.
[11] Elbert, Samuel H. The Lexicostatistical Classification of the Austronesian Languages. Isidore Dyen. American Anthropologist. 1965-02, 67 (1). ISSN 0002-7294. doi:10.1525/aa.1965.67.1.02a00470.
[12] Dixon, Robert M. W. Australian Languages: Their Nature and Development. Cambridge University Press. 2002: 48, 53. "Australia provides a prototypical instance of a linguistic area. It has considerable time-depth, fairly uniform terrain leading to ease of interaction and communication, a fair proportion of reciprocal exogamous marriages, rampant multilingualism, and an open attitude to borrowing ... There is a basic uniformity to Australian languages which is the natural result of a long period of diffusion. Although no justification had been provided for 'Pama-Nyungan', it came to be accepted. People accepted it because it was accepted—as a species of belief. ... It is clear that 'Pama-Nyungan' cannot be supported as a genetic group. Nor is it a useful typological grouping."
[13] Bowern, Claire; Atkinson, Quentin. Computational phylogenetics and the internal structure of Pama-Nyungan. Language. 2012-12, 88 (4). ISSN 1535-0665. doi:10.1353/lan.2012.0081.
[14] Hoijer, Harry. Lexicostatistics: A Critique. Language. 1956-01, 32 (1). doi:10.2307/410652.
