It is a well-known but fascinating fact that the use of natural language, as reflected in texts or speech, obeys very strong statistical laws [1-4]. Although the most popular and best-studied of these is Zipf's law of word frequencies [5-9], the most fundamental linguistic statistical law is probably Heaps' law, also known as Herdan's law [10-13]. This law concerns the two main quantities needed to set up the statistical analysis of a text: the number of word tokens in the text, i.e. its length (in words), L, and the number of word types, called the vocabulary size of the text, V. Specifically, Heaps' law states that the relationship between V and L is reasonably well approximated by a power law, V ≈ K·L^α, with exponent α and constant K.

Font-Clos F, Boleda G, Corral A (2013) A scaling law beyond Zipf's law and its relation to Heaps' law. New J Phys 15(9):093033

I will demonstrate Heaps' law by looking at the books of the Bible and then at Jane Austen's novels. I'll also look at words that occur only once, which linguists call "hapax legomena."

Thus, the supposedly universal values of α and K for the texts mentioned in the introduction (i.e. α = 0.5 or K = 1) clearly do not apply to music, at least at the level of individual composers. In any case, it is clear that the larger the value of L, the larger the vocabulary size (on average), but the increase in V is rather modest because of the low value of the exponent α (in other words, we need a piece 7 times longer to see a doubling of the vocabulary size). The relatively large value of the constant K reported above for composers appears to compensate for the low value of α.
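The power-law relationship between V and L can be checked numerically. The following is a minimal sketch, assuming the usual form V ≈ K·L^α and an ordinary least-squares fit on log-log axes (the fitting method is an assumption here, not necessarily the one used by the authors). The function names are hypothetical. Under this form, doubling V requires multiplying L by 2^(1/α), so the sevenfold increase mentioned above corresponds to α = log 2 / log 7 ≈ 0.36.

```python
import math

def vocabulary_growth(tokens):
    """Return [(L, V)] after each token: L tokens read so far, V distinct types."""
    seen = set()
    curve = []
    for length, token in enumerate(tokens, start=1):
        seen.add(token)
        curve.append((length, len(seen)))
    return curve

def fit_heaps(points):
    """Least-squares fit of log V = log K + alpha * log L; returns (K, alpha).

    Assumption: a plain log-log regression, chosen only for illustration.
    """
    xs = [math.log(length) for length, _ in points]
    ys = [math.log(vocab) for _, vocab in points]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    alpha = sxy / sxx
    K = math.exp(mean_y - alpha * mean_x)
    return K, alpha

def doubling_factor(alpha):
    """Factor by which L must grow for V = K * L**alpha to double."""
    return 2.0 ** (1.0 / alpha)
```

In practice one would pass a subsample of the growth curve (for instance, points evenly spaced in log L) to `fit_heaps`, since consecutive points of the curve are strongly dependent.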
The application of Zipf's law is evident in most natural language processing and text compression algorithms.

In summary, the highest linear correlation among all the metrics that characterize composers is the one between relative richness R and entropy (about 0.95, Fig. 5(b)), but R is also strongly correlated with the Guiraud index. The Herdan index is also highly correlated with the Guiraud index and the type-token ratio. The correlations between ⟨F⟩ and the other indices are not as high. Interestingly, the highest correlation of the year that characterizes each composer is with the proposed relative richness R (with a value of 0.90, Fig. 5(b)). The correlation of R with log L is zero by construction. Replacing Pearson's linear correlation with Spearman or Kendall correlations does not qualitatively alter this correlation pattern (not shown).
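For readers unfamiliar with the indices named above, here is a small sketch of how they are commonly computed from a token sequence. It assumes the standard textbook definitions (type-token ratio V/L, Guiraud index V/√L, Herdan index log V / log L, and Shannon entropy of the relative type frequencies, here in bits). Whether the study uses exactly these conventions, for example the same logarithm base, is an assumption, and the function name is hypothetical.

```python
import math
from collections import Counter

def richness_indices(tokens):
    """Compute common vocabulary-richness indices for a list of tokens."""
    counts = Counter(tokens)
    L = len(tokens)   # text length in tokens
    V = len(counts)   # vocabulary size in types
    # Shannon entropy of the type frequency distribution, in bits (assumed base 2).
    entropy = -sum((c / L) * math.log2(c / L) for c in counts.values())
    return {
        "type_token_ratio": V / L,
        "guiraud": V / math.sqrt(L),
        # Herdan's index is undefined for a one-token text (log L = 0).
        "herdan": math.log(V) / math.log(L) if L > 1 else float("nan"),
        "entropy_bits": entropy,
    }
```

All of these indices depend on the length L of the text, which is one reason a length-corrected measure such as the relative richness R is attractive.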
It is a little surprising that Heaps' law applies so well to the books of the Bible, since the books were written over centuries and in two different languages. On the other hand, the same committee translated all the books into English at the same time. Perhaps Heaps' law applies better to translations than to original texts. To test this, I looked at Jane Austen's novels on Project Gutenberg. Here is the data:

This is very similar to the table of vocabulary size and total word count, suggesting that the number of hapax legomena also follows a power law like Heaps' law. This becomes clear when we plot the data again on a logarithmic scale and see a linear relationship.

Manaris B, Purewal T, McCormick C (2002) Progress towards recognizing and classifying beautiful music with computers: MIDI-encoded music and the Zipf-Mandelbrot law. In: Proceedings IEEE SoutheastCon 2002 (Cat. No. 02CH37283).
IEEE Press, New York, pp. 52-57.

Serrà J, Corral Á, Boguñá M, Haro M, Arcos JL (2012) Measuring the evolution of contemporary western popular music. Sci Rep 2(1):1–6

Liu L, Wei J, Zhang H, Xin J, Huang J (2013) A statistical physics view of pitch fluctuations in the classical music from Bach to Chopin: evidence for scaling. PLoS ONE 8(3):e58710

Ball P (2010) The Music Instinct. Oxford University Press, Oxford

In English, as in any other language, we usually find that certain words are used again and again and rank first in terms of frequency, such as "the", "from", "to", etc. Have you ever wondered why? The frequency distribution of words describes the statistical relationship between a text and its word counts, and that is what we will look at here. But before we move on to Heaps' law, let's look at Zipf's law.
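As a concrete starting point, the sketch below builds the rank-frequency table on which Zipf's law is stated. In its simplest form the law says that the frequency of a word is roughly inversely proportional to its rank, so rank × count stays roughly constant over the top of the table. The function names are hypothetical, and the tokenization (a plain whitespace split in the usage below) is an assumption. The second helper lists the words that occur exactly once, i.e. the hapax legomena that form the tail of the same table.

```python
from collections import Counter

def rank_frequency(tokens):
    """Return [(rank, word, count)] sorted by decreasing count.

    Ties are broken alphabetically so the output is deterministic.
    """
    counts = Counter(tokens)
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, count)
            for rank, (word, count) in enumerate(ordered, start=1)]

def hapax_legomena(tokens):
    """Return the sorted list of words that occur exactly once."""
    counts = Counter(tokens)
    return sorted(word for word, count in counts.items() if count == 1)
```

For example, `rank_frequency("the cat and the dog and the bird".split())` puts "the" at rank 1 with count 3 and "and" at rank 2 with count 2, while `hapax_legomena` on the same input returns the three words that appear once.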
Baeza-Yates and Ribeiro-Neto, Modern Information Retrieval, ACM Press, 1999.

Data supporting the results of this study are available from kunstderfuge.com, but restrictions apply to the availability of these data, which were used under license for this study and are therefore not publicly available.

The documented definition of Heaps' law (also known as Herdan's law) states that the number of unique words in a text of n words is approximated by V_R(n) = K·n^β. The relationship can be checked by experimenting with different text sizes, counting the unique words in each, and observing how the two quantities are related.

The authors state that they have no competing interests.

The method we use to determine the key is the Krumhansl-Schmuckler key-finding algorithm. This algorithm is based on the key profiles described in ref. , obtained through empirical experiments in which subjects rated how well each pitch fit into a prior context that established a key. The values of the major key profile are 6.35, 2.23, 3.48, 2.33, 4.38, 4.09, 2.52, 5.19, 2.39, 3.66, 2.29 and 2.88, where the first number is the rating for the tonic of the key, the second is the rating for the next of the 12 notes of the chromatic scale, etc.
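The core of a profile-based key finder can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes the input is a 12-bin pitch-class histogram (for example, total duration per pitch class, with index 0 standing for C), it uses Pearson correlation as the match score, and it only tests major keys with the major profile quoted above. The same function can be called with a minor profile. All function names are hypothetical.

```python
import math

# Major key profile as quoted in the text (tonic first, then ascending semitones).
MAJOR_PROFILE = [6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                 2.52, 5.19, 2.39, 3.66, 2.29, 2.88]

def pearson(xs, ys):
    """Pearson linear correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def key_correlations(pitch_class_weights, profile=MAJOR_PROFILE):
    """Correlate a 12-bin histogram with the profile placed on each tonic."""
    scores = []
    for tonic in range(12):
        # Rotate the profile so that its first entry lines up with `tonic`.
        rotated = [profile[(pc - tonic) % 12] for pc in range(12)]
        scores.append(pearson(pitch_class_weights, rotated))
    return scores

def best_major_tonic(pitch_class_weights):
    """Return the tonic (0-11) whose major profile correlates best."""
    scores = key_correlations(pitch_class_weights)
    return max(range(12), key=lambda tonic: scores[tonic])
```

A full key finder would compute the same 12 correlations with the minor profile and pick the best of the 24 candidates.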
The values of the minor key profile are 6.33, 2.68, 3.52, 5.38, 2.60, 3.53, 2.54, 4.75, 3.98, 2.69, 3.34, and 3.17.

Torre IG, Luque B, Lacasa L, Kello CT, Hernández-Fernández A (2019) On the physical origin of linguistic laws and lognormality in speech. R Soc Open Sci 6(8):191023

In Heaps' law, V_R is the part of the vocabulary V (V_R ⊆ V) represented by an instance text of size n. K and β are free parameters that are determined empirically.

If a piece maintains a constant key, transposition will not affect the size of its vocabulary, but if different pieces are merged into a single dataset (which we will do), it may be convenient to merge them after transposing them into the same key. In this case, if the pieces come in different keys, transposition obviously leads to a different vocabulary, in which we no longer deal with pitch classes but with tonal function (so the resulting pitch class C represents the tonic, G the dominant, etc. [38, 39]). Transposition can also be useful in revealing a reduced vocabulary, for example where a composer shows great apparent richness that actually results from a limited vocabulary transposed into a number of different keys.

Mandelbrot B (1961) On the theory of word frequencies and on related Markovian models of discourse.
In: Structure of Language and Its Mathematical Aspects, vol 12. Am. Math. Soc., Providence, pp 190-219

Ben-Naim A (2019) Entropy and Information Theory: Uses and Misuses. Entropy 21(12):1170

The parameter K is highly variable: stemming or lemmatization reduces the growth rate of the vocabulary, while including numbers and spelling mistakes can increase it. Heaps' law suggests that (i) vocabulary size continues to grow as more documents are added to the collection, but the growth rate slows down, and (ii) the vocabulary of a large collection is quite large.

Heaps' law allows us to develop an appropriate measure of vocabulary richness for each composer in relation to the rest of the corpus. Remarkably, we find that vocabulary richness in the history of music shows a significant upward trend, as one would expect from qualitative musical wisdom. Our approach allows a transparent quantification of this phenomenon.
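The two implications of Heaps' law mentioned above (growth that never stops but keeps slowing down) follow directly from the power-law form. The sketch below assumes the form V = K·n^β with 0 < β < 1. The parameter values used in the usage note are purely illustrative assumptions and are not taken from the study or from any particular collection. The function names are hypothetical.

```python
def heaps_vocabulary(n, K, beta):
    """Predicted vocabulary size for a collection of n tokens: V = K * n**beta."""
    return K * n ** beta

def new_types_per_token(n, K, beta):
    """Marginal growth dV/dn = K * beta * n**(beta - 1).

    For beta < 1 this decreases with n but never reaches zero.
    """
    return K * beta * n ** (beta - 1)
```

With the illustrative values K = 10 and β = 0.5, a collection of 10,000 tokens has a predicted vocabulary of about 1,000 types and gains roughly one new type every 20 tokens. At 1,000,000 tokens the predicted vocabulary is about 10,000 types, but a new type appears only about once every 200 tokens.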
We also show that our relative richness is strongly correlated with the entropy of the frequency distribution of codewords, so entropy can also be used to measure vocabulary richness once properly calibrated. Our metric has several advantages over previous richness indices, such as direct comparison with the richness of other composers and better interpretability of its values. At our level of resolution (that of individual composers), the evolution of richness does not show revolutions or sudden leaps; instead, it seems to be quite gradual and is well fitted by a linear increase. (The variability is too high to obtain meaningful results beyond the linear upward trend.)

We conduct our study on the Kunstderfuge corpus, which at the time of our analysis consisted of 17,418 MIDI files corresponding to works by 79 classical composers from the 12th to the 20th century (ref.