Letter serial correlation points to the common descent of natural languages
By Mark Perakh
Posted March 9, 2005
This essay was prompted by Steve Reuland's post to the Panda's Thumb weblog titled "What good is half an underlying language structure?" (www.pandasthumb.org/pt-archives/000853.html), which refers to Carl Zimmer's posts to Loom (http://www.corante.com/loom/archives/2005/02/25/building_gab_part_one.php and http://www.corante.com/loom/archives/2005/03/01/building_gab_part_two.php). One point touched upon in passing by Zimmer and by some commenters was the question of whether or not all natural languages have evolved from the same proto-language.
Such an idea was, in particular, strongly pushed by Academician Nikolay Marr in the USSR. For some 30 years Marr had been acclaimed in the USSR as the greatest linguist of all time, whose teachings were supposedly in full agreement with Marxism-Leninism. Then, suddenly, in 1950, Stalin changed his mind, and millions of copies of a thin booklet were published whose author was claimed to be Stalin himself. It explained what "genuine Marxist linguistics" is. In this booklet Marr was declared a pseudo-scientist and his theory was denounced as anti-Marxist.
As I understand it, the notion of a single proto-language is shared by many linguists.
I would like to briefly report on some data which, I believe, provide strong empirical support for the notion of the intrinsic unity of all natural languages, specifically evident in their written form. The experimental data in question were obtained in work conducted a few years ago by Brendan McKay of the Australian National University (Canberra) and me.
We
developed a new method for a statistical analysis of texts dubbed Letter Serial
Correlation (LSC). Although we have
conducted hundreds of measurements on many texts in 12 languages as well as on
a number of gibberish strings created in various ways (and also on the famous
Voynich manuscript often referred to as the "most mysterious manuscript in the
world"), so far our results have only been reported in a series of articles on
my personal website (http://members.cox.net/marperak/Texts).
Twice in recent years we had a paper prepared for an international journal on computational linguistics with a concise presentation of our method and the results obtained, but in both cases we opted to postpone the planned publication because each time some new modifications improving the method came to mind. Besides, both Brendan and I have been busy with other projects and did not devote as much time to LSC as it, perhaps, deserved.
The data obtained by the LSC method demonstrated the intrinsic unity of the structure of all studied languages (Hebrew, Aramaic, Greek, Latin, English, German, Italian, Spanish, Russian, Czech, Finnish, and Yiddish). We studied most of the biblical texts (both in Hebrew and in translations) as well as such diverse texts as Moby Dick, The Song of Hiawatha, Macbeth, UN convention on Sea Trade, Tolstoy's War and Peace (in the Russian original and in translations), the full text of a Russian newspaper, and many others.
I believe our data vividly show that all meaningful texts, regardless of language, authorship, etc., have the same intrinsic structure, reflected in particular in the existence in all meaningful texts of what we called the Average Domain of Minimal Letter Variability (ADMLV). Gibberish strings, both highly ordered and highly disordered, do not possess this feature.
Briefly, the method of LSC is as follows. (I'll describe the latest version, which slightly differs from that reported in the articles posted to my site.) Our computer program performs several actions on a text stored on a disk, namely: (1) It counts the total number Mi of occurrences of each letter i in the entire text. (2) It chooses a "window" in the text which is n letters long, where n is an even number varying from 2 to L/2 if L is an even number, or to (L-1)/2 if L is an odd number (L is the total length of the text expressed in the number of letters). Each "window" is divided into two equal "panes," 1 and 2, each of length m=n/2. For each value of n the program counts the numbers Xi1 and Xi2 of occurrences of each letter i in panes 1 and 2, respectively. The window is moved along the text, and for each position of the window the program calculates the expression (Xi1 - Xi2)^2 for each letter. Then the program calculates the sum Sm of all such expressions over all positions of the window and over all letters of the alphabet.
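The steps above can be sketched in Python as follows. This is only an illustrative sketch, not the program used for the measurements: the function names (extract_letters, lsc_sum, lsc_curve) are hypothetical, and lowercasing the text and skipping every non-letter character are assumptions about preprocessing that the description above does not specify.

```python
from collections import Counter


def extract_letters(text):
    """Return the letters of the text, lowercased (an assumed preprocessing step)."""
    return [c for c in text.lower() if c.isalpha()]


def lsc_sum(letters, n):
    """Measured sum Sm for one even window length n, sliding the window by one letter."""
    if n < 2 or n % 2 != 0 or n > len(letters):
        raise ValueError("n must be even and between 2 and the number of letters")
    m = n // 2  # length of each pane
    total = 0
    for start in range(len(letters) - n + 1):
        pane1 = Counter(letters[start:start + m])
        pane2 = Counter(letters[start + m:start + n])
        # Letters absent from both panes contribute 0, so only letters
        # seen in at least one pane need to be visited.
        for letter in pane1.keys() | pane2.keys():
            total += (pane1[letter] - pane2[letter]) ** 2
    return total


def lsc_curve(text):
    """Map each even n from 2 up to half the text length to its measured sum."""
    letters = extract_letters(text)
    return {n: lsc_sum(letters, n) for n in range(2, len(letters) // 2 + 1, 2)}
```

For instance, lsc_sum(extract_letters("aabc"), 2) gives 4: the three window positions "aa", "ab" and "bc" contribute 0, 2 and 2. The sketch recounts both panes at every position for clarity; a faster version would update the counts incrementally as the window moves.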
The program generates a table where the values of Sm, the Measured Serial Correlation Sum, are listed for all values of n. Finally the program plots the graph of Sm vs. n. Simultaneously, the program computes the Expected Serial Correlation Sum (Se) as a function of n, using the theoretical formula we derived based on a random distribution of letters.
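The formula for Se is not reproduced in this essay. As a stand-in, the sketch below assumes that "a random distribution of letters" means a uniformly random permutation of the text's own letters. Under that assumption, each letter with total count M contributes an expected n*M*(L-M)/(L*(L-1)) at every window position, and there are L-n+1 positions. The function name expected_lsc_sum is hypothetical, and the result may differ from the formula actually used in the measurements.

```python
from collections import Counter


def expected_lsc_sum(letters, n):
    """Expected sum for window length n if the letters were randomly permuted.

    Assumption: the text is a uniformly random permutation of its own
    letters; this is not necessarily the formula used in the original work.
    """
    L = len(letters)
    if L < 2 or n < 2 or n % 2 != 0 or n > L:
        raise ValueError("need at least 2 letters and an even n between 2 and L")
    counts = Counter(letters)  # total occurrences of each letter
    spread = sum(M * (L - M) for M in counts.values())
    positions = L - n + 1  # number of window positions along the text
    return positions * n * spread / (L * (L - 1))
```

As a check by hand, for the letters of "aabc" and n=2 this gives 3 * 2 * 10 / 12 = 5.0, which matches a direct count: each of the 3 adjacent pairs holds two different letters with probability 5/6 and then contributes 2.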
On my site the results shown were obtained by an earlier version of the method, where the window was not moved along the text; instead the program divided the text into k equal "chunks" and measured the sums for each pair of adjacent chunks. The results obtained by the two versions differ only in secondary details: the newer version removes a certain inconvenience in the original method and generates a smoother curve, but does not generate fundamentally different "Sm vs. n" curves.
The "Sm vs. n" curves for all meaningful texts in all studied languages had quite a distinctive shape, with a number of characteristic points which were absent in the graphs for gibberish texts. Many such graphs can be seen on my site at http://members.cox.net/marperak/Texts.
One of the characteristic points seems to be of special interest. It is a distinctive deep minimum on the "Sm vs. n" graph which is present on all such curves for meaningful texts regardless of language, authorship, etc., but does not exist on the curves for gibberish texts (and, as expected, does not exist on "Se vs. n" curves).
This minimum testifies to the existence in meaningful texts of a distinctive Average Domain of Minimal Letter Variability. The ADMLV is a length of text within which the distribution of letter frequencies is characterized by a maximal frequency of occurrence of the same subset of letters. Within a stretch of text either shorter or longer than the ADMLV, the variability of letters' occurrences is larger than within a stretch of the ADMLV's length. Details of the measurements, calculation, and interpretation of data can be seen on my site.
The length
of ADMLV differs depending on language but varies only in a narrow range for
different texts in the same language. For example, for all Hebrew and Aramaic
texts, both biblical and secular, the length of ADMLV is invariably between 42
and 46 letters. In English texts the length of ADMLV varies between 60 and 140
letters, which corresponds to a certain extent to the difference between these
two writing systems -- in Hebrew there are no letters for vowels so the text's
portion in Hebrew containing a certain amount of a message necessarily
comprises fewer letters than a corresponding segment in English.
The natural
interpretation of the ADMLV is that it represents the average length of texts
wherein a specific topic or notion is the subject of the narrative and this
predetermines a relatively high frequency of repeated occurrences of the same
letters.
The
existence of ADMLV, which finds its empirical reflection in the minimum on the
LSC curves, seems to be an ineliminable feature of all meaningful texts,
regardless of language. It testifies to the deep unity of various languages and
supports the notion of all languages' evolution from the same proto-language
via descent with modification.
There seems to be an analogy between biological evolution and that of languages. The evolution of languages is a fact: for example, today's English is so different from Chaucer's that nobody in his right mind could deny such an evolution. I guess the creationists would say this is "microevolution," as Chaucer's English and today's English are both still English. And what about, say, Latin and its descendants -- Italian, French, Spanish, Portuguese, Romanian, etc.?
While the
fossil record, for obvious reasons, necessarily is incomplete and has many
gaps, the evolution of a language is often well recorded in all of its stages
because of the preservation of written texts.
There is no difference in principle between the evolution of a language from Chaucer's stage to today's stage and evolution resulting in the emergence of a new language: Italian from Latin, or Russian and Ukrainian from Old Slavic (two different languages stemming from the same "progenitor," the separation of which occurred around the 11th-12th centuries), or Czech, Polish, Bulgarian, Serbo-Croatian, and Macedonian from an even earlier proto-Slavic. The difference is in degree, so that evolution of a language can naturally grade into evolution to a new language, no longer understandable to the speakers of the original language, provided the two groups of speakers are geographically separated. Likewise, there is no reason why evolution within a species cannot extend to the loss of interbreeding ability between two geographically separated subspecies, thus resulting in the appearance of a new species, i.e. in "macroevolution."