Rabindranath Tagore, who won the Nobel Prize in Literature in 1913, has long been one of my favorite authors. As one of my summer weekend projects, I collected some of his novels and short stories in Unicode text format and analyzed the behavior of their lexical growth, hoping to find specific patterns in his writing. Part 1 of this investigation employs one of the DFR (Divergence From Randomness) models, namely Bose-Einstein statistics, which Satyendranath Bose (a physicist at the University of Dhaka) originally derived in 1924 as one of the emerging quantum ensemble models and which was later backed up by Einstein in 1925.
The Bose-Einstein distribution has recently found applications outside the realm of describing the energy level occupation of bosons. One such application is describing the statistics of low-frequency words in a large text corpus [1]. It’s always interesting to investigate how a mathematical model describing a physical phenomenon can be used as an analogy for another problem. Here, words are analogous to bosons, which are indistinguishable from each other. Unlike fermions, for example, bosons can occupy a single quantum state in unlimited numbers. This property makes the related statistics suitable for analyzing words that have the same occurrence frequency in the corpus. An important aspect of this analogy is temperature. What does it mean for a piece of text to have “temperature”? As demonstrated in [1] and here, it can describe and distinguish between different authors, or between different novels written by the same author. It may also describe how the lexical wealth of a piece of writing evolves as we read through it.
The Bose-Einstein distribution describes the occupation of bosons at specific energy levels. An energy level is specified by j = 1, 2, …, n. The level j = 1 corresponds to the Bose-Einstein condensate. Here, as an analogy of this condensate, the authors of [1] identify the initial energy state with hapax legomena. This Greek term comes from Biblical studies and translates to “[something] said [only] once”. That is, words that occur only once are put in the first energy level, words with frequency 2 occupy the level j = 2, and so on. The occupation of an energy level j is given as:
$$N_j = \frac{1}{z^{-1} e^{\varepsilon_j / T} - 1}$$
Here z is the absolute activity, or fugacity, $\varepsilon_j$ is the energy of the jth level and T is the temperature. Following [1], the power-law energy spectrum for $\varepsilon_j$ is given by
$$\varepsilon_j = (j - 1)^{\alpha}$$
where $\alpha$ is a constant that can be determined by fitting. Since $\varepsilon_1 = 0$, z is determined from the first energy level using
$$N_1 = \frac{z}{1 - z}$$
and with the new definition of the power spectrum the B-E distribution now looks like
$$N_j = \frac{1}{\frac{N_1 + 1}{N_1} e^{(j - 1)^{\alpha} / T} - 1}$$
The parameters $\alpha$ and T are determined simultaneously by fitting the data in the occupation matrix (a matrix that contains the occupation distribution for each level j) using a nonlinear regression.
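As a rough illustration of this fit, here is a minimal Python sketch. The computations in this post were done in Mathematica, so this is not the code behind the figures. It assumes the power-law spectrum ε_j = (j − 1)^α used by Rovenchak and Buk, so that the first level fixes z = N_1 / (N_1 + 1), and it swaps the nonlinear regression for a coarse grid search over α and T to stay dependency-free. The function names and grid values are hypothetical.

```python
import math


def be_occupation(j, n1, alpha, temp):
    """Model occupation N_j for level j (1-based).

    Assumes eps_j = (j - 1)**alpha, so eps_1 = 0 and the observed
    first-level occupation n1 (> 0) fixes z = n1 / (n1 + 1).
    """
    z = n1 / (n1 + 1.0)
    x = (j - 1) ** alpha / temp
    if x > 700.0:  # exp(x) would overflow a float; occupation is ~0 here
        return 0.0
    return 1.0 / (math.exp(x) / z - 1.0)


def fit_alpha_temp(occupation, alphas, temps):
    """Coarse grid search for (alpha, T) minimising the squared error.

    occupation[0] is N_1, occupation[1] is N_2, and so on. This is a
    stand-in for a real nonlinear regression, not a replacement for one.
    """
    n1 = occupation[0]
    best = None
    for alpha in alphas:
        for temp in temps:
            err = sum(
                (be_occupation(j, n1, alpha, temp) - nj) ** 2
                for j, nj in enumerate(occupation, start=1)
            )
            if best is None or err < best[0]:
                best = (err, alpha, temp)
    return best[1], best[2]
```

Passing only the first 20 entries of a row of the occupation matrix restricts the fit to the low-frequency words. In practice a proper least-squares routine would replace the grid search.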
A document-term matrix is created from the set of novels or stories to be analyzed. Then, for each energy level j, the number of distinct words with frequency equal to j is counted and saved in the occupation matrix, so that each row of the matrix holds the occupation distribution of one text. By fitting the parameters to each row of data, we obtain T for the low-frequency words (lower energy levels). In the case of texts, the B-E distribution does not capture the statistics of the higher energy levels well.
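The counting step for one row of the occupation matrix can be sketched in Python as follows. This is only an illustration: the function name is hypothetical, and how a text is split into tokens (for example, on whitespace after stripping punctuation) is an assumption left to the caller.

```python
from collections import Counter


def occupation_levels(tokens, max_level=20):
    """Return [N_1, ..., N_max_level] for one text.

    N_j is the number of distinct words that occur exactly j times,
    so N_1 counts the hapax legomena.
    """
    word_freq = Counter(tokens)             # word -> frequency
    spectrum = Counter(word_freq.values())  # frequency j -> number of words
    return [spectrum.get(j, 0) for j in range(1, max_level + 1)]
```

For the tokens `a b b c c c d`, the words `a` and `d` occur once, `b` twice and `c` three times, so the first three levels hold 2, 1 and 1 words.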
To see how T evolves with N (the number of words) in a novel, I divide that novel's text into cumulatively increasing chunks and repeat the procedure above for each chunk.
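A minimal Python sketch of the cumulative chunking, under the assumption that the novel is already available as a list of tokens. The function name and the `step` parameter are hypothetical. Each yielded row could then be fitted for T to trace a temperature-versus-N curve.

```python
from collections import Counter


def cumulative_occupations(tokens, step, max_level=20):
    """Yield (N, [N_1, ..., N_max_level]) for growing prefixes of the text.

    The prefixes are tokens[:step], tokens[:2 * step], ... and the last
    one covers the whole text even if its final chunk is shorter.
    """
    word_freq = Counter()
    for start in range(0, len(tokens), step):
        # Add the next chunk to the running word counts.
        word_freq.update(tokens[start:start + step])
        spectrum = Counter(word_freq.values())
        n_words = min(start + step, len(tokens))
        yield n_words, [spectrum.get(j, 0) for j in range(1, max_level + 1)]
```

Updating a running counter avoids recounting each prefix from scratch, which matters for a text of more than 100000 words split into many chunks.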
I used Mathematica for all of these operations. 🙂
Results and Analysis
I have run the programs on a set of eight novels: Bou Thakuranir Haat, Chokher Bali, Ghore Baire, Gora, Noukadubi, Projapotir Nirbondho, Rajorshi and Shesher Kobita. All the novels and other short stories are collected from .
Figure 1. (a) Occupation matrix for all novels; colors represent the magnitude of occupation in each cell. Note the variability of word frequency for each novel. Novel #4 (Gora) is the largest in this set and shows more variable frequencies than the others. (b) Occupation vs. j log-log plot for Bou Thakuranir Haat for the first 1000 energy levels. The blue line is the fit found for the first 20 energy levels (low-frequency words).
Chokher Bali: This novel contains ~70000 words. The occupation matrix and the characteristic temperature curve are shown below.
Figure 2. (a) Occupation matrix for Chokher Bali. As the number of chunks increases, we find some amount of variability. (b) Temperature vs. N graph, the points are joined together with broken lines, not a fit. Note the rise of temperature until the middle of the novel and a gradual decrease as the size of the text increases.
Gora: This one is the largest in my collection, ~180000 words.
Figure 3. (a) Occupation matrix for Gora. Note that a lot of variability shows up as we increase the size of the text. There is a certain visible pyramid-like pattern. This regularity may indicate recurring usage of word sets over the course of the novel. (b) Temperature as the size of the text grows.
Noukadubi: This is one of the shorter novels, containing ~40000 words.
Figure 4. (a) Occupation matrix for Noukadubi. A similar pyramid structure is notable. (b) T vs. N graph.
Temperature Evolution Comparison
The temperature of a novel usually rises for the first ~30000 words or so and then falls off. What does this mean in terms of the physical analog?
The figure on the left shows all the temperature curves on the same plot, and on the right is a set of exponential fits for the first 25000 words. Since they resemble a Boltzmann-like distribution, I could have done a fit using that equation. Oh well! 🙂
I wonder if the peak in temperature around the 30000-word mark is a characteristic of Tagore’s writing (critical/transition temperature?). Note that temperature here refers to the net amount of variability across frequencies for low-frequency, i.e. rare, words. Different authors have different styles of writing. One may sit and finish a large piece in one go and never come back to it. However, many authors do come back to the same piece again to hone the variability of words.
In literature, lexical wealth is a measure of an author’s ability to use a varied set of words. Every author has a natural limit, though: rare words that describe particular events must cycle around in the novel. The results here could be a first step toward showing that Rabindranath Tagore’s longer works usually maintain a fairly distinct word frequency structure up to a roughly fixed length (~30000 to 50000 words), after which the structure breaks down: the rare words and sentence structures start repeating and cycling more often as the novel grows. This claim, however, should be viewed with doubt, as more experiments are needed to confirm it. Something similar should in general hold for any author, but finding a characteristic falloff for Tagore is quite interesting.
Part 2 of my analysis uses power laws derived in fractal theory and similarity measures used in high dimensional data analysis to find out more about the lexical wealth of Tagore’s writings.
[1] A. Rovenchak, S. Buk, Application of a quantum ensemble model to linguistic analysis, Physica A 390 (2011), 1326-1331.