# Social Dynamics of Terabithia

In the land of Terabithia, King Jesse and Princess May Belle have taken over the duty of ruling the kingdom. At the time of their inauguration, the people of the kingdom are divided into five tribes. These are liberal people, willing to accept and share the new ideas and fashions available in each tribe. Jesse has appointed Abul (the court mathematician) to figure out how, and at what rate, his newly introduced fashion of wearing a lungi will spread throughout the kingdom. He has also decided to divide his country into districts based on the different styles of wearing the lungi.

Now, there are several factors for Abul to consider here. Some of the tribes have accepted the lungi as their official dress code. However, they wear it in different styles. Tribe 1 likes wearing it with no shirt on, tribe 2 refused to accept it as their fashion, tribe 3 wears the lungi with a kurta, tribe 4 prefers a style of its own, and so on. The fashion designers in each tribe have made appealing designs to impress people from other tribes and encourage them to accept their fashion. Abul needs to forecast the chances of each tribe’s fashion surviving and/or flourishing over the whole kingdom.

Problem Statement

In sociology, economics, business planning and similar fields, researchers and analysts are always concerned with how competing cultures, languages, fashions and the like survive over a period of time. Some cultural trends are short-lived; some defeat the existing trends and take over completely. It is impossible to predict what individuals think about each trend, but it is within our reach to predict what the mass trend will be once a steady state is reached.

Individuals have different philosophies. In general we update our idea of fashion and culture based on movies, TV, books, what others around us do etc. Some of us are trend starters. Others follow them. Many times it is not one person, it is a group of people following a trend that motivates others to follow. In the formal terminology of social science and social dynamics, there are three stages of a cultural trend change before it reaches a steady state.

1. Emergence: A new cultural center emerges in a population. Someone, or a group of people, starts using a new colloquial word, begins wearing a new kind of jeans, or maybe starts listening to and promoting a new genre of metal songs.
2. Broadcast: This new trend is “shown off” to others occasionally. The trend starters use their newly coined colloquial word too often in public, wear the newly designed clothes in parties or release albums and organize concerts to promote the new genre of metal with some frequency.
3. Following and Competition: Others decide whether they will accept or reject the new trend. If someone decides to accept, occasionally he/she also becomes one of the broadcasters. If there are several competing trends available, then a decision is made based on factors like the current number of people following the trend and how long this trend has been followed or simply personal taste.

Every trend usually dies after some time. This can be months, years or even decades. Our general tendency is to accept the newest trend.

In the field of Social Dynamics, agent based modeling has been used by many mathematicians to study the behavior of such chaotic systems. It is usually hard to find closed form solutions of any mathematical model describing such a complex process, so discrete time Monte Carlo simulations or rule based cellular automata systems have been used to predict mass behavior in a society.

To help Abul, we modify and suggest some improvements to a model proposed in this paper and add some more functionality to study the culture of Terabithian people.

Monte Carlo Model

The Monte Carlo simulation is performed on a lattice of size $L \times L$, containing $L^{2}$ agents who interact with their neighbors. A fashion is defined as $I(s,a)$, where s is the cultural center and broadcaster site, and a is the age of the fashion.

The model assumes that newly arrived fashions are more likely to be accepted. An agent i that currently follows fashion $I(s_{i},a_{i})$ can change its preference to the fashion from site $s_{j}$ if $a_{j} < a_{i}$. Keeping this in mind, the proposed algorithm to take a timestep from t to t + 1 over the whole lattice is:

1. Emergence step: With a very small probability $p_{new}$, a new cultural center emerges. $I_{i}(t)=I(i,0)$.
2. Broadcast step: With a probability $p_{repeat}$, the cultural centers restrengthen themselves to affect their neighbors. $I(i,a_{i,t})=I(i,0)$.
3. Spread step: Each site i randomly chooses one of its nearest neighbors j (with equal probability) by whom it will be affected. If $a_{j} < a_{i}$, then the current fashion at i is changed to the fashion followed by j: $I(s_{i},a_{i})=I(s_{j},a_{j})$. Otherwise nothing changes for i.
4. The age at every lattice site is increased by 1.
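The four steps above can be sketched in a few lines. This is only an illustrative Python version (the original simulation was written in Mathematica), and several details are assumptions rather than part of the model description: the lattice is stored row by row in flat lists, broadcasters are tracked in a set called `centers`, the spread step is applied synchronously to a snapshot of the lattice, and the boundaries are periodic.

```python
import random

def step(site, age, centers, L, p_new, p_repeat, rng):
    """One timestep t -> t + 1 on an L x L lattice stored row by row.

    site[i]: cultural center whose fashion agent i follows (s in I(s, a))
    age[i]:  age of that fashion (a in I(s, a))
    centers: set of sites that act as broadcasters (assumed bookkeeping)
    """
    n = L * L
    site, age, centers = site[:], age[:], set(centers)
    # 1. Emergence: with probability p_new, site i starts its own fashion I(i, 0).
    for i in range(n):
        if rng.random() < p_new:
            site[i], age[i] = i, 0
            centers.add(i)
    # 2. Broadcast: with probability p_repeat, a center renews its fashion to I(i, 0).
    for i in sorted(centers):
        if rng.random() < p_repeat:
            site[i], age[i] = i, 0
    # 3. Spread: each site looks at one random nearest neighbor j (periodic
    #    boundaries) and copies j's fashion if that fashion is younger.
    new_site, new_age = site[:], age[:]
    for i in range(n):
        r, c = divmod(i, L)
        dr, dc = rng.choice([(-1, 0), (1, 0), (0, -1), (0, 1)])
        j = ((r + dr) % L) * L + (c + dc) % L
        if age[j] < age[i]:
            new_site[i], new_age[i] = site[j], age[j]
    # 4. Ageing: every fashion gets one step older.
    new_age = [a + 1 for a in new_age]
    return new_site, new_age, centers
```

Calling `step` repeatedly (500 times for the runs described later) and coloring each site by `site[i]` would give pictures of the same kind as the figures in this post, although the exact clusters depend on the random seed.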

Modifications

1. The original model proposes choosing a site randomly as an emerging cultural center. This limits Abul to concentrate on one tribe only. In other words, the model does not consider competition between several *existing* fashions. We propose to change this scheme and assume each tribe acts as their own ambassador. Thus we assign a cultural center to each tribe as an initial condition and all of them are broadcasters of their own fashion.
2. The paper considers geographic effects such as a closed boundary at the borders of a country. We extend this by considering geographic factors that can slow down the exchange of information, e.g. rivers and seas that divide countries and regions. To include this, we introduce another parameter $p_{cross}$, the probability of information crossing the given obstacle. In step 3, sites on opposite sides of the obstacle exchange information only with this probability.
3. Finally, we also propose changing the global $p_{new}$ and $p_{repeat}$ to local competing groups by converting these variables to arrays. This better captures the reality because different groups of people may have different tendencies to interact with others. It is too simple to assume that every competing group has the same level of interaction and motivation to spread their own culture.
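As a rough illustration of the second modification, the check below decides whether a site may be influenced by a chosen neighbor when an obstacle lies between them. It is a Python sketch, not the Mathematica code behind the figures, and the geometry is an assumption made for the example: the river is taken to run vertically between columns `river_col - 1` and `river_col`.

```python
import random

def crosses_river(c_from, c_to, river_col):
    """True if the two columns sit on opposite banks of the river.

    Assumption: the river runs between columns river_col - 1 and river_col.
    """
    return {c_from, c_to} == {river_col - 1, river_col}

def may_exchange(c_from, c_to, river_col, p_cross, rng):
    """Decide whether information passes between two neighboring columns."""
    if crosses_river(c_from, c_to, river_col):
        # Across the obstacle, information gets through only with probability p_cross.
        return rng.random() < p_cross
    return True  # on the same bank the spread step proceeds as usual
```

In the spread step, a site would call `may_exchange` for its chosen neighbor and skip the update when it returns `False`. Setting `p_cross = 0` turns the river into a complete barrier.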

Back to Terabithian Tribes

Mathematica was used to program this Monte Carlo simulation. We study the general model and our modifications one by one.

Terabithia is an infinite kingdom: it is a torus. That is, if you leave the kingdom on one side, you re-enter from the opposite side. There is no escape from the lungi or Jesse’s rules. We simulate this by imposing a periodic boundary condition on our lattice when choosing neighbors for a particular site.

For one of our examples though, we will use a closed boundary condition.

We model the kingdom by a $50 \times 50$ lattice, where each site represents a group of people (presumably a family or a household). Each site interacts with its four nearest neighbors.
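The difference between the periodic (torus) and closed boundary conditions comes down to how neighbors are looked up. A minimal Python sketch, assuming sites are numbered row by row from 0 (the function name and numbering are choices made for this example):

```python
def neighbors(i, L, periodic=True):
    """Indices of the four nearest neighbors of site i on an L x L lattice."""
    r, c = divmod(i, L)
    result = []
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        rr, cc = r + dr, c + dc
        if periodic:
            # Torus: leaving one side re-enters from the opposite side.
            result.append((rr % L) * L + (cc % L))
        elif 0 <= rr < L and 0 <= cc < L:
            # Closed boundary: neighbors outside the lattice do not exist.
            result.append(rr * L + cc)
    return result

print(neighbors(0, 50))                  # corner site on the torus
print(neighbors(0, 50, periodic=False))  # same corner with closed borders
```

With closed boundaries a corner site has only two neighbors and an edge site three, so sites at the border are influenced from fewer directions.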

Figure 1. (a) Tribe colors. (b) Initial distribution of different groups.

Figure 1 shows an initial random distribution of different groups of people who follow different tribes, colored according to 1(a).

We would like to run the original paper model first. We choose 500 iterations for the simulation. Estimating that only one active interaction (an interaction with a broadcaster that changes someone’s mind to start following a new fashion) occurs in a day, this would mean the simulation is the result after 500 days.

Figure 2. Steady state reached after 500 iterations (days).

Figure 2 shows how the fashions of different tribes have clustered over 500 days. Tribe 4 is less dominant for this particular number of iterations. The important point is that trends have emerged over different regions; the distribution of fashion is no longer random, as it was in 1(b).

Since we have imposed a periodic boundary condition, note how the clusters are continuous (they wrap around) at the horizontal and vertical ends of the lattice.

We have helped Abul solve the initial problem. He can show this simulation to the king and inform him about the possible clustering, and about the number of people following particular fashions in this culture war after different numbers of days. However, there are more modifications on the way.

Rivers and Borders in Terabithia

Often we have borders among regions that stop people from interacting with each other. There are rivers and seas that distance people out (assuming no internet, like in the ancient days) and slow down the exchange of information and culture. Abul wants to incorporate these factors in the model. We help him by changing our neighborhood selection algorithm to include a set of cells in our lattice that stop information exchange.

Figure 3. (a) A different tribe (the off-white cells forming a rectangle) is included among the cells and acts as an information barrier (a physical border, in a sense). (b) The two separate regions evolve independently, as expected.

Notice in Figure 3 that some cultural trends have disappeared; they could not survive for 500 days. However, they could come back if we ran the simulation for more iterations.

Figure 4. (a) A river that splits the kingdom into two equal regions. However, a slight probability of information exchange is allowed across the river. Closed boundary conditions are used. (b) Result after 500 iterations.

Figure 4 shows our modification of the original algorithm to include slow information exchange between two regions. 4(b) shows what we would expect: the same cultures have emerged on both sides of the river, since the two sides have some interaction. Note the consistency of the clusters on both sides. However, they are dominated by the trends that are stronger in the larger regions. Closed boundary conditions were used for this example.

Individual Probabilities for Tribes

As promised, we now help Abul make the model more realistic by assigning individual $p_{new}$ and $p_{repeat}$ variables to each tribe. The algorithm needs to be changed slightly to do this.
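One way to picture the change is to turn the scalar $p_{repeat}$ into a list with one entry per tribe and look up the entry that belongs to each broadcaster. The Python sketch below shows only the broadcast step; the dictionary `centers` (site index to tribe number) and the probability values are made up for illustration and are not the ones used for the figure.

```python
import random

def broadcast(age, centers, p_repeat, rng):
    """Renew each cultural center's fashion with its own tribe's probability.

    centers:  dict mapping a center's site index to its tribe number (0-based)
    p_repeat: list holding one broadcast probability per tribe
    """
    age = age[:]
    for s in sorted(centers):
        if rng.random() < p_repeat[centers[s]]:
            age[s] = 0  # the fashion I(s, a) is renewed to I(s, 0)
    return age

# Hypothetical values: tribe 5 (index 4) broadcasts far more often than the rest.
p_repeat = [0.05, 0.05, 0.05, 0.05, 0.5]
```

The same lookup would apply to $p_{new}$ in the emergence step. A tribe with a larger entry renews its fashion more often, so its fashion tends to stay younger and wins more spread-step comparisons.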

Figure 5. Biased and higher probabilities assigned to tribe #5, which is why its cultural trend dominates over the others.

This is much better because it lets us include real-world information in the model. Although the probability distributions for these random variables were chosen here by us, in reality we could conduct surveys and build realistic discrete distributions, from which we could run a much more reliable simulation of future events.

Conclusion

With the proposed modifications, we can expect that the model is useful in a real world setting to predict future outcomes for competing cultural trends in a society. The best way to integrate all the modifications is to create an MCMC (Markov Chain Monte Carlo) sampler to estimate parameters for the joint distribution function for $p_{new}$ and $p_{repeat}$ from real world data.

The resulting parameters will be used in the above simulation and we may get some useful results if we cleverly choose a good neighborhood coverage for each lattice site (e.g. considering information exchanged over internet among distant sites).

Finally, Abul can present these results to the king and expect to get a promotion in return. Math saves the day, once more.

# 100 poems in 1 minute (Bangla)

সে এসেছিল সে মোর  মর্ম গভীরতম উন্নত শির রয়েছে তুলিয়া  সকল,
তোমারি ভৈরবী মুকুল মম সুবাসে তব গোপনে সৌরভী আমার…
বন্ধু যদি চায় ভুলো না আপনায়  মিথ্যা দিয়ে চাতুরী দিয়ে রচিয়া।
সে কোন্‌ খেলা  কোথা কেটে গেল বৎসর বৎসর ক্ষয়িষ্ণু স্মৃতির ব্যর্থ!
সে দেয় বাঁটি ভরেছে মন গীত গন্ধে বসে

তোমারি এক প্রেমে  দুঃখসুখের বিচিত্র জীবনে  তুমি ছাড়া আর অন্য।।
বন্ধু যত মাথা তুলে বলতে পারে না হায়…
প্রাণের রথে বাহির হতে  পারব কবে  প্রবল প্রেমে সবার মাঝে,
সে আমার প্রাণে লাগে আমার বাণী আমার প্রাণে বাজায় বাঁশি॥

ও মা।।
যা চোখে পড়ে  তার পরে পথ ছাড়ো  তার পরদিনে উঠিবে প্রভাত  ফুটিবে,
বন্ধু কারে দিব দোষ বৃথা কর রোষ যারা শুধু মরে কিন্তু!
প্রাণের নিরর্থক চাঞ্চল্যে  ময়ূরটি ঘাড় বাঁকায় এদিকে ওদিকে  তার
উদাসীন।

What are the main ingredients of writing poetry? I know almost nothing about writing poetry. As a child I thought that making the lines rhyme was enough to make a poem. When I was a little older, I understood that this is basically called a rhyme (chhora). Then I started thinking that a poem must be very good if it has difficult words and sentence arrangements. Reading the work of up-and-coming poets certainly gave that impression. After my reading widened a bit (Rudra Mohammad Shahidullah, Purnendu Patri, some Nirmalendu Goon, etc.), and after I started to like the lyrics of modern songs (James, Marzuk Russell, etc.), I began to develop respect for the power of poetry. I understood right then that I could never become a poet myself. But perhaps this very frustration taught me to think about poetry in many ways. The main question keeps coming back to one thing: what are the main ingredients of writing poetry?

As an ordinary reader, let me state my ideas briefly. The choice of words and sentences is one of the main spices. Along with it there must be some definite mood (or an indefinite one, which in English is called a nonsense theme). Often it is enough for the sentence structure not to read like formulaic prose; the presence of meter is not essential, and depends on the mood of the poem and the taste of the poet. There is much more beyond this, but in short, this is my thinking.

Automatic Poetry

Even if we cannot write poetry, can we automate the process of writing it? Can a machine bound hand and foot by rules produce poetry fluently? Keeping the arrangement of words and sentences correct is possible; it is easy, and we actually did it long ago. But a poet's mood and vision are entirely the fruit of his or her own creative feeling. Should we expect creative feeling from a machine that only understands a few basic commands?

To put it more generally: if we take only a few elementary rules as our building blocks, how much complex behavior can we expect from any collection of them? The workings of the human brain are somewhat like this. Our neurons exchange a number of signals among themselves. We have more or less identified these signals. It has also been observed that a collection of a few neurons (a neural network) can create many kinds of complex patterns. A single neuron cannot create patterns this complex, but as soon as a few of them come together, chaos ("disorder") is created, and from it comes feeling. Interested readers can read this book: A New Kind of Science (Stephen Wolfram), in which this physicist and mathematician writes for the general public and shows how many kinds of complex yet comprehensible patterns can be obtained when just a few small programs interact with one another.

So what do we mean by feeling? Is it possible to bring it within the scope of mathematics? Many people are trying (Roger Penrose etc.), and their main arguments are built around chaos theory. The kind of complex behavior I described can perhaps be tamed by mathematics as well. As one such attempt, an algorithm for generating poetry automatically was first proposed in the 70s. Using a Markov model, this process can produce pseudorandom text from some given text, and when you read it, it is often hard to tell whether it is real or fake. The poem above, by the way, was generated with this very Markov model. 🙂

Markov Model

I mainly used a very simple model called the n-gram Markov Model. Here n can be any number, starting from 1.

Any Markov model is, in general, an evolving process. If I build a vector (a state) out of some numbers and keep multiplying that vector by a matrix (the matrix has to maintain certain properties, such as ergodicity and aperiodicity), then after some time the value of the vector stops changing. In chaos theory or in statistics we call this the stable state (the stationary distribution). The matrix is called the transition matrix. The matrix and the whole evolution scheme are built so that the current value of the vector is determined only by a fixed number of preceding states (the values of the vector in the previous iterations), and it gradually converges. In other words, given those states, the next vector does not depend on anything further back (this is called the Markov property).
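The idea of multiplying a state vector by a transition matrix until it stops changing can be tried out on a tiny example. The Python sketch below uses a made-up two-state transition matrix `P` (its numbers are purely illustrative) and treats the state as a row vector of probabilities.

```python
def evolve(v, P):
    """One step of the chain: row vector v multiplied by transition matrix P."""
    n = len(P)
    return [sum(v[i] * P[i][j] for i in range(n)) for j in range(n)]

def stable_state(v, P, tol=1e-12, max_iter=10000):
    """Keep multiplying until the vector no longer changes (within tol)."""
    for _ in range(max_iter):
        w = evolve(v, P)
        if max(abs(a - b) for a, b in zip(v, w)) < tol:
            return w
        v = w
    return v

# Hypothetical transition matrix; each row sums to 1.
P = [[0.9, 0.1],
     [0.5, 0.5]]
pi = stable_state([1.0, 0.0], P)
print(pi)
```

For this matrix the vector settles near (5/6, 1/6) regardless of the starting vector, which is what the ergodicity and aperiodicity conditions are there to guarantee.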

An n-gram Markov Model usually takes n words (from the given text); say this is a vector v[1..n]. It then searches the rest of the text for v[2..n]. From however many matches are found, we pick one at random. We take the word that follows it and build our next state, which we can call v’. That is, v’ = Join[v[2..n], {chosen word}].

For example: “আমি ফুচকা খাবার জন্য ফুচকা খাবার দোকানে গেলাম, কিন্তু ফুচকা খাবার দোকান বন্ধ ছিল।” (“I went to the fuchka shop to eat fuchka, but the fuchka shop was closed.”) From this sentence I can build a number of 3-grams (n = 3).

আমি ফুচকা খাবার, ফুচকা খাবার জন্য, খাবার জন্য ফুচকা, জন্য ফুচকা খাবার, ফুচকা খাবার দোকানে, খাবার দোকানে গেলাম, …, ফুচকা খাবার দোকান, দোকান বন্ধ ছিল।

Here, if I take v[1..n] = {আমি, ফুচকা, খাবার}, then v[2..n] = {ফুচকা, খাবার}. In the whole sentence, v[2..n] is found three times: ফুচকা খাবার জন্য, ফুচকা খাবার দোকানে and ফুচকা খাবার দোকান. So I have three options for the next word: জন্য, দোকানে and দোকান. Suppose I select the next word at random and get দোকান. Then v'[1..n] = {ফুচকা, খাবার, দোকান}.

With the new v’ I can run the same algorithm again. In this way, after a number of iterations, we get a new sentence. The larger n is, the more accurate the sentences will be, but that only gives us something close to the original text. If n is small, much more novelty comes in, but accuracy drops. That, however, is far more interesting.
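The procedure described above fits in a short Python sketch (the program behind the poems was written in Mathematica; this version is only an illustration). The function names `build_table` and `generate` are made up for the example, and the lookup table is a convenience: it stores, for every (n-1)-word context v[2..n], the words that follow it in the text.

```python
import random

def build_table(words, n):
    """Map every (n-1)-word context to the list of words that follow it."""
    table = {}
    for k in range(len(words) - n + 1):
        context = tuple(words[k:k + n - 1])
        table.setdefault(context, []).append(words[k + n - 1])
    return table

def generate(table, v, steps, rng):
    """Start from the n-word state v and extend it one random word at a time."""
    n = len(v)
    out = list(v)
    for _ in range(steps):
        options = table.get(tuple(out[len(out) - (n - 1):]))
        if not options:  # this context never continues in the text: stop early
            break
        out.append(rng.choice(options))
    return out

sentence = "আমি ফুচকা খাবার জন্য ফুচকা খাবার দোকানে গেলাম, কিন্তু ফুচকা খাবার দোকান বন্ধ ছিল।"
words = sentence.replace(",", "").replace("।", "").split()
table = build_table(words, 3)
line = generate(table, ("আমি", "ফুচকা", "খাবার"), 6, random.Random(1))
print(" ".join(line))
```

Here `table[("ফুচকা", "খাবার")]` holds exactly the three options from the example (জন্য, দোকানে, দোকান), and every three-word window of the generated line also occurs somewhere in the source sentence.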

This is a Markov model because the next vector depends only on the current vector, not on the vectors that came before it. What we are doing here is essentially the same as repeatedly multiplying a vector by one large matrix.

It is also worth noting that this Markov model is related to chaos theory. If we treat any story, novel or poem as a disorder made of some collection of words, then the job of the n-gram Markov model is to create another disorder with the same pattern out of it. Models of this kind come in quite handy for mood-driven literature such as poetry.

100 Poems in One Minute?

Can we really write lots of poems with this whenever we like? Quite a while ago I built a corpus by combining 500 poems by Rabindranath, Sukanta's poetry collection Chharpatra and several of Jibanananda's collections, with the aim of extracting the patterns of word disorder in poetry. Using that corpus, together with a number of theme keywords (which determine the main mood of the poem), it is possible to produce many poems within a minute. A fairly small program of mine, written in Mathematica, can do this job very quickly and deftly. The poem at the beginning is based mainly on Rabindranath's 500 poems. I have also made several customizations; for example, a biased probability can be brought in when making the random choice, so that any particular poet or poem can dominate the new poem.
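The biased random choice mentioned above could look something like the Python sketch below. It assumes, purely for illustration, that each candidate next word is tagged with the poet it came from and that the bias is expressed as one weight per poet; the actual Mathematica program may organize this differently.

```python
import random

def biased_choice(candidates, poet_weight, rng):
    """Pick one (word, poet) candidate, favoring poets with larger weights.

    candidates:  list of (word, poet) pairs that could continue the line
    poet_weight: dict of relative weights; poets not listed get weight 1.0
    At least one candidate must end up with a positive weight.
    """
    weights = [poet_weight.get(poet, 1.0) for _, poet in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

# Hypothetical candidates and weights: Rabindranath is favored five to one.
candidates = [("প্রাণ", "Rabindranath"), ("বিদায়", "Sukanta")]
word, poet = biased_choice(candidates, {"Rabindranath": 5.0}, random.Random(7))
```

Setting a poet's weight to 0 removes that poet's lines from consideration entirely, while equal weights reduce to the plain uniform choice.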

The next poem is proof that it will not always work very well. It was generated from a rather small number of poems, and the program was run with a few thematic words as targets, such as প্রাণ (life), বন্ধু (friend), সুখ (happiness), তোমারি (yours) and বিদায় (farewell).

প্রাণের বীণা নিয়ে যাব সেই অতলের সভা মাঝে চিরদিনের সুরটি বেঁধে শেষ

তোমারি চরণ ওগো মৌন না যদি জুটে নিজের বল

সে ত্রিশূলমূলে শিবের দেউলদ্বারে কাঁদছে স্মৃতি কে দেবে,
সুখে তৃপ্ত তুমি আমার গীতরাগে ভালো লাগে জানি আমি তোমার

তোমারি ইচ্ছা করো হে পূর্ণ আমার জীবন মাঝে যাচি হে তোমার,
বন্ধু চুপে চুপে চলিতেছে বনপথ ধরি আকাশের বুকে বুকে

প্রাণের বীণা নিয়ে যাব সেই অতলের সভা মাঝে চিরদিনের সুরটি বেঁধে শেষ।।
বন্ধু আজকে আঘাত দিও না তোমাদের পকেটে পকেটে আমরা!
সে যে সুদুর অজানা করবীকুঁড়ির পানে চোখ তার ভরে গেছে নীল!
সে ছুটিয়া যায় চকিত আকাশ পায় না নাগাল যার লুঘ।
যা ছিল গোপন স্বপন কদিন রয় এসেছে গোধূলি।।
প্রাণের বীণা নিয়ে যাব সেই অতলের সভা মাঝে চিরদিনের সুরটি বেঁধে শেষ!
বিদায় তুমি তাহা জানিলে না চলে গেছে মুসাফের কবে ফের।
বিদায় দেখেছ উঠল যে হাওয়া ঝোড়ো ঠিকানা রইল।
সে সকাল আসবেই যেদিন এই খবর পাবে প্রত্যেকের চোখেমুখে সকালের আলোয়!
প্রাণের বীণা নিয়ে যাব সেই অতলের সভা মাঝে চিরদিনের সুরটি বেঁধে শেষ

In this poem one line comes up again and again: প্রাণের বীণা নিয়ে যাব সেই অতলের সভা মাঝে চিরদিনের সুরটি বেঁধে শেষ।। The reason is that the corpus was small, so my poor program most likely could not find any other line related to প্রাণ. There is another meaningless line: বন্ধু আজকে আঘাত দিও না তোমাদের পকেটে পকেটে আমরা! (roughly, "Friend, do not hurt today, we are in your pockets, in every pocket!"). This line has no ordinary meaning, but it cannot quite be called completely meaningless either. Since poetry is literature of mood, a Markov model can easily get away with it here!

Conclusion

What I never understood is that this was the beginning,

the conclusion of the dream is yet to come…

– Topu.

Markov models are quite old. Many linguists, mathematicians and computer scientists have worked on them. Using n-gram Markov models for Automatic Content Generation and for detecting plagiarized writing is nothing new. What I have done is merely a beginning; mathematically, this model is now a widely pursued research topic. In biology the model is also used in cell culture dynamics and genetic code sequencing.

Its acceptability for writing poetry is open to question. Mathematically the model can be supported, and one can hope that in the future dynamic Markov models will be able to write poetry without the help of any given text. I do not know whether people will accept that. But first of all we need to know what we mean by poetry, or, to return to my original question: what are the main ingredients of poetry? Sometimes a selection of grammar-abiding words related to a theme can also be called a poem.

If I had actually told someone that I wrote the poem, people would have nearly believed it (somewhat like Turing's A.I. test). I asked a few friends who read a fair amount of poetry. Almost all of them assumed that I had written it (the first poem). A few chat logs give an idea:

: did you write this?
me: yeah, man
: whoa, dude
totally Nazrul-style
it came out
me: hehehe
: seriously

: did you write this?
me: yeah
how is it?
: damn
seems pretty good to me
who did you write it for?
me: hehehe
what do you think?

me: just tell me what you think
: you get your fragrance from your mother
she's the flower
you're the shourov (fragrance)
me: hmm
this looks like a matter of very high thought!


But I think that if I had said at the outset that it was written by a program, everyone would have started finding faults right away. We usually accept what an expert says and argue less. When great pundits say something we question it less, but when an ordinary or insignificant person says the same thing we cannot accept it.

From my side, I currently see two uses for this program. One is for the Bangla metal bands of the modern era. Their lyrics are usually hard to understand, and the less they can be understood and the harder the words they use, the greater the attitude. So why should the poor fellows toil through dictionaries to write lyrics, when they barely know Bangla anyway? If they ask me, I will send them songs for free. It will take only a few seconds to make them. The other use is to impress modern young women. Many people tell me that girls fall for serious poetry. Whatever else this program may do, it may help with wooing girls. 🙂

Note: Code

Leaving aside the hassle of input processing and data structures, the core Markov model code is very short and simple.


generateSentence[n_, size_, prtmain_, initword_] := Module[
  {res = {}, choices = {}, posch, i},
  (* prtmain is assumed to hold rows of n + 1 consecutive corpus words;
     start from a random row that begins with initword *)
  posch = RandomChoice[Position[prtmain[[;; , 1]], initword]];
  For[i = 1, i <= size, i++,
   (* emit the first word of the current row *)
   AppendTo[res, prtmain[[posch[[1]], 1]]];
   (* find the rows whose first n words match words 2..n+1 of the current row
      and pick a random one *)
   posch = RandomChoice[
     Position[prtmain[[;; , 1 ;; n]], prtmain[[posch[[1]], 2 ;; n + 1]]]];
   (* record how many continuations the newly chosen row has *)
   AppendTo[choices,
    Length[Position[prtmain[[;; , 1 ;; n]],
      prtmain[[posch[[1]], 2 ;; n + 1]]]]]
   ];
  {res, choices}
  ]

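(* result is assumed to be a list of generateSentence outputs, one per line of the poem *)
poem = {};
For[i = 1, i <= Length[result], i++,
 AppendTo[poem, StringJoin[
   (* put a space between consecutive words *)
   Insert[result[[i, 1]], " ",
    Table[{j}, {j, 2, Length[result[[i, 1]]]}]],
   (* end the line with a randomly chosen punctuation mark *)
   RandomChoice[{"...", "।", "।।", "!", "\n", ","}],
   "\n"
   ]]
 ]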



# Nail Portrait

I have a thing for stuff made from or with wood. Not that I am an expert; in fact, I have no skills whatsoever. The only time I attempted to put something together was during my sophomore year, when nails were hammered and/or screwed into a piece of wood to write sentences. Since then I have wanted to make portraits on wooden blocks using nails. That hasn’t happened so far. So, just to satisfy my desire, I have written a program in Mathematica that takes an image and converts it to a nail portrait. 🙂

The idea will be to use Mathematica functions only (and no other 3d models) to build the portrait from scratch.

Preprocessing

Here’s a test image of myself:


img = Darker[Import["me.png", ImageSize -> 250], 0.5]



The first step is to find the edges. The image is blurred a bit (with a 5-pixel radius) before the edge detection is run so that noise is suppressed.


edge = EdgeDetect[Blur[img, 5], 2, 0.01]



That needs to be cleaned up, so I decided to delete the small components of 25 pixels or fewer.


dedge = DeleteSmallComponents[edge, 25]



The components that are touching the border are deleted next.


dedge = DeleteBorderComponents[dedge]



Finally we do a morphological connected components search to detect the different regions.


cedge = MorphologicalComponents[dedge];

% // Colorize



Making the “nails”

Each component in the connected components matrix is found and the number of elements in the matrix for each component is calculated. The components are then sampled uniformly to create a point set representation of each component. The points are plotted using the ListPlot command. They were also rotated by a [0 1; -1 0] rotation matrix (-90 degrees) for viewing the resulting plot correctly.


cm = ComponentMeasurements[cedge, "Count"];

counts = Table[cm[[i, 2]], {i, 1, Length[cm]}];

rm = {{0, 1}, {-1, 0}}

Table[rm.Position[cedge, 1][[i]], {i, 1, counts[[1]], 10}];

selpt = Table[
rm.Position[cedge, i][[j]], {i, 1, Length[counts]}, {j, 1,
counts[[i]], 8}];

ListPlot[Flatten[selpt, 1] -
Table[{0, Min[Flatten[selpt]]}, {i, 1, Length[Flatten[selpt, 1]]}]]
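For readers without Mathematica, the sampling and rotation in the listing above can be mirrored in plain Python. This is only a sketch under a few assumptions: the label matrix is a list of lists with 0 as background, positions are 0-based (Mathematica's `Position` is 1-based), and the function name `sample_points` is invented for the example.

```python
def sample_points(labels, stride=8):
    """Turn a component-label matrix into a 2D point set of nail positions."""
    by_label = {}
    for r, row in enumerate(labels):
        for c, lab in enumerate(row):
            if lab != 0:
                by_label.setdefault(lab, []).append((r, c))
    pts = []
    for lab in sorted(by_label):
        # keep every stride-th pixel of the component, in row-major order
        for r, c in by_label[lab][::stride]:
            # rotation matrix {{0, 1}, {-1, 0}} applied to (r, c) gives (c, -r)
            pts.append((c, -r))
    # shift upwards so that the lowest point sits at y = 0
    min_y = min(y for _, y in pts)
    return [(x, y - min_y) for x, y in pts]

tiny = [[1, 1, 0],
        [0, 2, 2]]
print(sample_points(tiny, stride=1))
```

A larger `stride` means fewer nails per component. The value 8 matches the step used in the `selpt` table above.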



To create the nails from this image, all the points are converted from 2d to 3d by adding a random amount of elevation to each 2d coordinate. The randomization is mainly there to mimic the random heights of nails on a wooden board.


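(* flselpt2 is not defined in the listing above; it is assumed here to be
   the shifted point set that was passed to ListPlot *)
flselpt2 = Map[# - {0, Min[Flatten[selpt]]} &, Flatten[selpt, 1]];

zlength = Max[Flatten[selpt]]/10.0
flselpt3d =
Table[{flselpt2[[i, 1]], flselpt2[[i, 2]],
RandomReal[{zlength, zlength + zlength/2}]}, {i, 1,
Length[flselpt2]}];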
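The same lifting step, written as a small Python sketch: every 2D point gets a random height between `zlength` and 1.5 times `zlength`, as in the listing above. Taking `zlength` from the already shifted points (rather than from `selpt`) is a simplification made for this example.

```python
import random

def add_heights(points, rng):
    """Lift 2D nail positions to 3D with slightly random nail heights."""
    # one tenth of the largest coordinate serves as the base nail height
    zlength = max(max(x, y) for x, y in points) / 10.0
    lifted = [(x, y, rng.uniform(zlength, 1.5 * zlength)) for x, y in points]
    return lifted, zlength

nails, zlength = add_heights([(0, 40), (25, 10), (60, 0)], random.Random(2))
```

The returned `zlength` is kept because the board and the plot range in the later steps are sized relative to it.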



The resulting 3d coordinates are plotted using the ListPointPlot3D command. The Filling option of this command is set to Bottom (and a dark gray color is assigned to it) to show the height of each point. This will also act as the body of the nail in the final plot!


lst3d = ListPointPlot3D[flselpt3d, Filling -> Bottom,
PlotStyle -> Directive[Black, PointSize[Large]],
FillingStyle -> Directive[RGBColor[19/255, 19/255, 19/255], Thick],
Lighting -> Automatic]



Making the “board”

The “wooden board” is actually just another Mathematica plot: a region plot with a wood texture applied (the image stored in wood3, which is loaded separately). A cuboid region is generated by using a simple condition: x > 0. The limits are defined by the maximum and minimum range of the 3d “nail” points in the x, y and z directions.


rgplot = RegionPlot3D[
x > 0, {x, 0, Max[flselpt2[[;; , 1]]]}, {y, 0,
Max[flselpt2[[;; , 2]]]}, {z, 0, zlength},
PlotStyle -> Texture[wood3], BoxRatios -> Automatic, Mesh -> None]



Putting it all together

The 3d points plot and the region plot are shown together using the Show[] command. The lighting is adjusted, and the borders and axes are removed, to give the final image.


Show[lst3d, rgplot, PlotRange -> {0, 1.6*zlength}, Axes -> None,
Boxed -> False, BoxRatios -> Automatic]



This gives us the final 3d plot.

Here’s another view of the plot.

Some details are lost in this portrait; for example, the smile is missing (or can hardly be noticed)! Oh well, this is just a simple scheme. Reconstructing the smile would require a better image (one where the lighting conditions are suitable for finding the correct edges) and more playing around with the blur and edge detection parameters in the preprocessing stage.

Anyway, this gives me a general platform to produce nail portraits of nearly anything.

# Newspapers and Location-specific Coverage

The Daily Star and New Age are two English-language newspapers that are considered among the top sources of information in Bangladesh. The Daily Star has been around for quite some time and New Age is relatively young, although a few people have told me that New Age has been catching up in the race. While thinking about ways to visualize data from newspapers, I decided to compare the efficiency of these two newspapers in terms of news coverage.

Now, I’m sure news coverage can be interpreted or defined in many ways, but here I refer to the geographic aspect. Although most of the news in such “national” newspapers revolves around incidents in Dhaka and Chittagong (the two main divisions, and also cities, of Bangladesh), occasionally we notice news from areas of less interest or importance. In a big country like the USA, the notion of a “national” newspaper is ridiculous, but in a small country like Bangladesh every newspaper based in Dhaka (the capital) claims to provide a glimpse of all major incidents around the country.

Every newspaper usually has correspondents in distant regions, usually in all the major cities. My main goal was to carry out controlled experiments to see how many news items were covered from each district and metropolitan city, including and excluding Dhaka and Chittagong, over a fixed period of time. Hopefully that would provide a way to compare the two newspapers’ commitment to reaching all the districts and cities of Bangladesh.

The Experiments

The Daily Star (DS) has an online archive that starts from 2002. However, New Age (NA) only has a news archive starting from 2011. So I decided to run all the experiments from 01-01-2011 to 02-28-2012 for both newspapers, a period of 423 days overall.

Figure 1. Bar charts for (Left) all metropolitan cities and (Right) all districts of Bangladesh. The left panel in each chart shows the number of news items covered by DS and, likewise, the right panel shows the figures for NA.

Oh my! Just look at the number of news items covered for Dhaka! It clearly outweighs the numbers for all other cities (or districts) in both newspapers. Surprisingly, NA does not show a peak for Chittagong like DS does. One possible reason is its use of ‘ctg’ as an abbreviation of Chittagong in many places; my program only searches for the word “Chittagong” in the news.
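The ‘ctg’ problem suggests counting a place under several spellings. The Python sketch below is not the parser used for this post; it is a hypothetical illustration in which each place has a list of accepted forms and an article is counted once if any of them appears as a whole word. The sample headlines are invented.

```python
import re

def count_mentions(articles, aliases):
    """Count, per place, the articles that mention any of its accepted forms."""
    counts = {name: 0 for name in aliases}
    for text in articles:
        for name, forms in aliases.items():
            pattern = r"\b(?:" + "|".join(re.escape(f) for f in forms) + r")\b"
            if re.search(pattern, text, flags=re.IGNORECASE):
                counts[name] += 1
    return counts

# Made-up headlines, used only to show the effect of the alias list.
articles = ["Fire in Ctg port", "Dhaka traffic", "Chittagong and Dhaka trade"]
strict = count_mentions(articles, {"Chittagong": ["Chittagong"]})
relaxed = count_mentions(articles, {"Chittagong": ["Chittagong", "ctg"]})
print(strict, relaxed)
```

With the alias list, the first headline is also counted for Chittagong, which is the kind of undercount suspected for NA above.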

DS has a lot more news per city/district over the 423-day period compared to NA.

Figure 2. The same bar chart for districts, but this time the Dhaka and Chittagong districts are excluded to magnify the relative coverage for all the other districts.

Excluding Dhaka and Chittagong, we see some effort on both newspapers’ sides to cover more news in places of higher business interest, such as Rajshahi, Sylhet, Bogra and Khulna. However, DS has a clear win over NA here; its bars are visibly longer.

Visualizations

All of these data and figures would look much better if we could see a geographic representation of these comparisons. The following visualizations represent the above data in a slightly different but more intuitive way. Since DS clearly dominates NA in terms of the number of news items per district, I wanted to visualize the magnitude of their differences on a map of Bangladesh.


[Color scale: min (0) to max (1)]

Figure 3. Bangladesh’s map and a textual representation of the magnitude of the difference in the number of news covered per district. The transparency and color of the texts are varied according to the magnitude.

For these visualizations, I have taken Dhaka and Chittagong out of the calculations. In the above visualization, for each district, the number of news reported by NA was subtracted from the number of news reported by DS. After obtaining these difference values for the 62 districts (not 64, since I excluded Dhaka and Chittagong), I scaled them to [0, 1]. The transparency and color of the texts in the map are set accordingly – more opaque and reddish means more difference between the newspapers for a particular district.
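The scaling described above is a plain min-max normalization of the per-district differences. A small Python sketch follows; the district names and counts in it are invented for illustration and are not the real figures.

```python
def scaled_differences(ds, na):
    """Subtract NA's count from DS's count per district and scale to [0, 1]."""
    diff = {d: ds[d] - na.get(d, 0) for d in ds}
    lo, hi = min(diff.values()), max(diff.values())
    span = hi - lo
    # if every district has the same difference, map them all to 0.0
    return {d: (v - lo) / span if span else 0.0 for d, v in diff.items()}

# Hypothetical counts for three districts.
ds_counts = {"A": 10, "B": 4, "C": 7}
na_counts = {"A": 2, "B": 4, "C": 3}
print(scaled_differences(ds_counts, na_counts))
```

The scaled value can then be used directly as the opacity of a district's label, with 1 for the district where the two newspapers differ the most and 0 for the one where they differ the least.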

A stark difference between the two newspapers exists in some districts, e.g. Rajshahi, Dinajpur, Rangpur and Satkhira. DS covers more news in those areas compared to NA. In many other areas the differences are less pronounced; hence, those areas are more transparent.

The following visualization is a similar one, only this time bubbles are embedded with texts to emphasize the areas of interest.

Color scale: min (0) to max (1).

Figure 4. Circles whose radius, transparency and color are proportional to the difference between the news coverage of Daily Star and New Age.

Areas of No Coverage

So far, Daily Star seems to have outperformed New Age. However, this was just a relative measurement. Both newspapers were found to ignore some areas of Bangladesh in my data. In some cases this could be a flaw in the data, as described in the Methods section. Assuming no flaws, the following districts were ignored (not even a single mention over the 423-day period) by both DS and NA:

Gopalganj, Lakshmipur, Narsingdi.

Some of the municipal cities that were ignored by DS were:

Bhanga, Chenger, Damudiya, Galachipa, Goalunda, Jibannagar, Kalapara, Kuliar Char, Maheshpur, Mehendiganj, Mirkadim, Muktagacha, Nandail, Adamdighi, Shailkupa, Ullapara, Swarupkathi, Nilphamari, etc.

Some of the municipal cities ignored by NA (in addition to the above) were:

Abhaynagar, Akhaura, Alamdanga, Bakarganj, Bhola, Bhuapur, Birampur, Bochanganj, Chakaria, Char Fasson, Charghat, Daganbhuiyan, Damurhuda, Durgapur, Kaunia, Madhabpur, Muksudpur, Nabinagar, Sitakunda, Swarupkathi and quite a few more.

I would like to come back to this data when I get some time and make some more visualizations to show the no-coverage statistics vividly. Not all the cities are mentioned above; the lists were actually longer for both newspapers.

Methods

Parsing the online news archives was described in this post.

GIS Integration: The map polygon data for Bangladesh can be downloaded from Geocommons or the World Bank website. The polygon description file is in the .shp format, which Mathematica has been able to read since version 7. I first read in the .shp file to create a blank set of map polygons. The latitude and longitude data for each district were found using the Yahoo Map API rather than the Google Map API (which is more reliable), simply because of all the protocol hassles the latter imposes on each query sent to the service.

Integrating and visualizing the polygon set and the latitude and longitude data was straightforward using the geometry and graphics primitives of Mathematica.

What’s Next?

There’s so much that can actually be done with such data to compare newspapers. I have only presented some information visualizations, but statistical analysis of such data may provide much more insight into the performance and commitment of newspapers. I have not searched to see whether any previous research has been done in this direction. The next goal (on this idea) for me will be to do a literature search, if I ever decide to come back to this data to do something more in the future! 🙂

# Abyss

The idea was to create some sort of visual representation of all negative emotions/facts in a Bangladeshi newspaper, The Daily Star. By negative emotions here I mean news regarding death, accidents, robbery, abduction, rape, bribing, arson and any natural calamity.

Following are images that are in the first installment of the series Abyss. Abyss is one of my efforts to visualize emotions/facts through computational art. The images were programmed and generated using Mathematica.

Each circle represents the crime, disaster and calamity news per district or municipal city over three months, covering September 2007 to June 2012. Each bar in a circle represents a city or a district, with its angle and height proportional to the number of such news items reported by Daily Star for that region. A sequence of circles thus represents a timeline of negative emotions going from the present to the past.

Click to view larger versions of these images.

Abyss 1. A timeline of crime news reported for the districts of Bangladesh.

Abyss 2. Timeline of crime news reported for all cities of Bangladesh.

Abyss 3. Timeline of crime news reported for all the municipal cities of Bangladesh.

Method: The data were collected by parsing the online archive of Daily Star. The archive pages were downloaded using Mathematica for each date, and the news items were matched against a set of predefined words (along with their inflected forms) to separate out the crime/disaster reports. The selected items were then parsed again to look for city and district names in them. The names of cities and districts were parsed from this website.

A matrix containing the count for each city/district (columns) every three months (rows) was updated at each iteration of this parsing. The data in the matrix is then visualized as described above. All of these operations and visualization were done using Mathematica.
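The counting step might look roughly like the sketch below. The keyword list, place names and news items are hypothetical stand-ins, and each item is assumed to arrive already tagged with the index of its three-month period; the real parser also handled inflected forms, which this sketch ignores.

```mathematica
(* Hypothetical inputs; not the real keyword list or archive data *)
keywords = {"murder", "robbery", "flood"};
places = {"dhaka", "sylhet"};
news = {
   {1, "Robbery reported in Sylhet"},
   {1, "Flood hits Sylhet again"},
   {2, "Murder trial opens in Dhaka"},
   {2, "Cricket win celebrated in Dhaka"}};  (* {three-month period index, text} *)

(* rows: three-month periods, columns: places *)
counts = ConstantArray[0, {2, Length[places]}];

Do[
  With[{q = item[[1]], text = ToLowerCase[item[[2]]]},
   (* keep only items that mention at least one keyword *)
   If[! StringFreeQ[text, keywords],
    Do[
     If[! StringFreeQ[text, places[[c]]], counts[[q, c]]++],
     {c, Length[places]}]]],
  {item, news}];

counts  (* {{0, 2}, {1, 0}} *)
```

Each row of `counts` then supplies the bar heights for one circle in the timeline.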

# Analyzing Tagore’s Literature (Part 2)

In part 1, I employed the Bose-Einstein distribution to find out how the “temperature” of Tagore’s writing varies across different novels. In part 2, I delve into Zipf’s power law and the similarity metrics used to compare high dimensional vectors, in order to analyze the lexical wealth and similarity across different novels and short stories written by the legend.

Zipf’s Law

In fractal theory, Zipf’s power law in linguistics is a tried and accepted heuristic for comparing large texts [1]. This power-law statistic, derived from the behavior of certain kinds of fractals, can be used in many other disciplines too. In simple terms, Zipf’s law is stated as: $N = A k^{-\phi}$.

Taking logs on both sides: $\log N = \log A - \phi \log k$

we get a linear equation. Here, N is the total number of words in a corpus and k is the ratio of the number of distinct words n to N. A is a constant amplitude and $\phi$ is the exponent, a value that is unique for a given author. Using simple regression analysis, it is possible to find a characteristic $\phi$ for any author. The law merely dictates a simple fact: as the text size N increases, the fraction of distinct words k decreases. The rate at which this happens is related to the expertise of a writer in maintaining variability of words and sentence structures over the course of his novels.
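The regression can be sketched in Mathematica as below. The {n, N} pairs are made-up placeholders rather than the values for Tagore's texts, and `LinearModelFit` is just one convenient way to get the line; the variable names are my own (note that `N` itself is a built-in function, so it cannot be used as a variable name).

```mathematica
(* Hypothetical {distinct words n, total words N} pairs, one per text *)
texts = {{3200, 12000}, {5100, 30000}, {7400, 70000}, {11000, 180000}};

(* Points {log k, log N} with k = n/N *)
logData = {Log[#[[1]]/#[[2]]], Log[#[[2]]]} & /@ N[texts];

(* Straight-line fit: log N = log A - phi log k *)
fit = LinearModelFit[logData, x, x];
{intercept, slope} = fit["BestFitParameters"];

phi = -slope;              (* characteristic exponent *)
amplitude = Exp[intercept] (* the constant A *)
```

Because k falls as N grows, the fitted slope is negative and `phi` comes out positive.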

Demonstration

The following table shows n, N and k values for the same set of novels in part 1.

The following table shows the same data for a collection of short stories.

Note the higher values of k for the short stories. This could be mainly due to the smaller size of the text.

Figure 1. (Left) Data points in the Log(k)-Log(N) plane, and a linear fit equation showing the characteristic gradient $\phi$. (Right) Same experiment done on the short stories.

Figure 2. The linear fit equations for novels and stories on the same plot (red – stories, blue – novels). Clearly it demonstrates that the rate at which Tagore’s lexical wealth k falls is higher for novels. This could be due to the difference in the text size though.

Heaps’ Law

Heaps’ law is similar to Zipf’s law. It is a power law that describes how the number of unique elements in a set of randomly chosen elements grows as the size of the set increases. In our case, we would expect the number of unique words to increase as we increase the size of the text.
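For reference, the law is commonly written in the following form, using the same n (distinct words) and N (total words) as above. The symbols K and $\beta$ are the usual textbook constants and are my own addition here, not taken from the fits in this post:

$n = K N^{\beta}$

Taking logs on both sides gives the straight line seen in a log(n) vs. log(N) plot:

$\log n = \log K + \beta \log N$

Typically $0 < \beta < 1$, so n keeps growing with N but ever more slowly. Dividing by N gives $k = n/N = K N^{\beta - 1}$, and solving for N yields

$N = K^{\frac{1}{1-\beta}} k^{-\frac{1}{1-\beta}}$

which has the form of the Zipf relation used above, with $\phi = \frac{1}{1-\beta}$ and $A = K^{\frac{1}{1-\beta}}$. Under these assumptions the two fits describe the same data from two angles.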

Figure 3. (Left) Heaps’ law demonstrated for novels (log(n) vs. log(N) plot); (right) the same for short stories.

Figure 4. The two linear fit equations on the same plot (red – short stories, blue – novels). This demonstrates that although the short stories have more distinct words at small text sizes, ultimately the novels take over as the size of the text increases. This may indicate a better effort on Tagore’s side to polish and revise his novels to amplify the lexical wealth, whereas, statistically, this may be less true for his short stories.

Similarity Measure

The variability of distinct words across a set of novels or short stories can be captured by feature vectors, which are essentially rows of numbers in a document-term matrix. Comparing these high dimensional vectors to infer the similarity between short stories and novels of Tagore might be useful. Here, I use two schemes to compare these high dimensional vectors. One is the cosine of the angle between two vectors, and the other is the L2-norm of the difference between two vectors. These schemes project the high dimensional vectors to scalar values that can be easily compared. Histograms over all possible pairs are produced to analyze how similar or different the spans of words used in the short stories or novels are.
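Both schemes are short to write down in Mathematica. The sketch below uses a tiny made-up document-term matrix; whether the rows were normalized before comparison in the actual experiments is not stated, so raw counts are assumed here.

```mathematica
(* Hypothetical document-term matrix: one row per text, one column per word *)
dtm = {{4, 0, 1, 2}, {3, 1, 0, 2}, {0, 5, 2, 0}};

(* Cosine of the angle between two vectors *)
cosine[u_, v_] := u.v/(Norm[u] Norm[v]);

(* L2-norm of the difference between two vectors *)
l2[u_, v_] := Norm[u - v];

(* All unordered pairs of rows *)
pairs = Subsets[dtm, {2}];

cosines = N[cosine @@@ pairs];
distances = N[l2 @@@ pairs];

(* One histogram per scheme *)
{Histogram[cosines], Histogram[distances]}
```

A cosine close to 1 means two texts use words in nearly the same proportions, while the L2 distance also reacts to differences in text length.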

Figure 5. (Left) Histograms from L2-normed difference scheme, (right) from cosine scheme. Red – short stories, blue – novels. Note the bimodal distribution for both novels and stories, except the cosine heuristic for short stories. It seems there are two principal modes of similarity among all novels and stories, although this could be just a statistical property of texts that I am not aware of.

Note the width of the novels’ histograms in both cases; they are wider than those of the stories. For the cosine scheme the novels’ histogram has a mode that is closer to 1.0, whereas the average peak of the stories’ histogram is farther from 1.0. These two observations mean that similar words and sentence structures recur throughout the novels more than in the short stories. This is consistent with the inferences drawn in part 1 and from Zipf’s and Heaps’ laws for Tagore’s work.

Comparing Upendrakishor Raychowdhury’s Work

One last thing I try here is to see how these measures can be used to compare different authors’ works. Although my aim was to compare Kazi Najrul Islam with Tagore, unfortunately I could not find any of his work in text form. Instead, I found Upendrakishor Raychowdhury’s short story collection and decided to compare the lexical wealth of the two authors’ story collections. It should be noted that lexical wealth is only one of the (measurable) heuristics for comparing authors. Most comparisons in the field of literature are qualitative and depend on the taste of readers and critics. Nonetheless, lexical wealth does say a lot about an author’s expertise in not being monotonous.

The following table shows UR’s short stories that I have collected, along with their k values.

Figure 6. (Left) Zipf’s law linear fit for UR’s short stories. (Right) Zipf’s law linear fits for both UR’s (red) and Tagore’s (blue) short stories. Although it seems that UR has the upper hand over Tagore (a smaller falloff rate as we increase the lexical wealth k), it would be dubious to claim that UR is better at not being monotonous. It is quite risky to draw conclusions from such a small margin, and the lack of adequate data is another issue. I could say something more definite if I had a collection of hundreds of stories from both writers. 🙂

Conclusion

In part 1 I found out that a possible characteristic falloff of the lexical wealth may exist for Tagore’s writings. The experiments here in part 2 restate a celebrated fact in linguistics: every author has a natural limit after which his writings give way to being monotonous in terms of repeating words and sentence structures. Rabindranath Tagore was not so different from the group of his contemporary writers. It will be interesting to see how his works compare with other contemporary works when/if I get enough data. 🙂

[1] L. L. Goncalves, L. B. Goncalves, Fractal power law in literary English, Physica A 360 (2006) 557 – 575.

# Analyzing Tagore’s Literature (Part 1)

Rabindranath Tagore, the Nobel Laureate for literature in 1913, has been one of my favorite authors of all time. In my series of summer weekend projects, among other things, I have collected some novels and short stories written by this author in unicode text format and analyzed the behavior of their lexical growth, hoping to find specific patterns in his writing. Part 1 of this investigation employs one of the DFR (Divergence From Randomness) models, namely the Bose-Einstein statistics that was originally derived by Satyendranath Bose (a physicist at the University of Dhaka) in 1924 as one of the emerging quantum ensemble models that was later backed up by Einstein in 1925.

Bose-Einstein Distribution

The Bose-Einstein distribution has recently found applications outside the realm of describing the energy level occupation of bosons. Such applications include describing the statistics of low frequency words in a large text corpus [1]. It’s always interesting to investigate how a mathematical model describing a physical phenomenon can be used as an analogy for another problem. Here, words are analogous to bosons, which have the characteristic that they are indistinguishable from each other. Unlike fermions, for example, there is no limit to the number of bosons that can occupy a quantum state. This property of bosons makes the related statistics suitable for analyzing words that have the same occurrence frequency in the corpus. An important aspect of using such an analogy is temperature. What does it mean for a piece of text to have “temperature”? As demonstrated in [1] and here, it can describe and distinguish between different authors or different novels written by the same author. It may also describe how the lexical wealth of a piece of writing evolves as we read through it.

The Model

The Bose-Einstein distribution describes the occupation of bosons at specific energy levels. An energy level is specified by j = 1, 2, …, n. The level j = 1 corresponds to the Bose-Einstein condensate. Here, as an analogy to this condensate, the authors of [1] call the initial energy state hapax legomena. This Greek term originates from Biblical studies and translates to “[something] said [only] once”. That is, words that occur only once are put in the first energy level, words with frequency 2 occupy the level j = 2, and so on. The occupation of an energy level j is given as:

$N_{j}=\frac{1}{z^{-1}e^{\epsilon_{j}/T}-1}$

z is the absolute activity, or fugacity, $\epsilon_{j}$ is the energy of the jth level and T is the temperature. The power energy spectrum for $\epsilon_{j}$ is given by

$\epsilon_{j}=(j-1)^{\alpha}$

where $\alpha$ is a constant that can be determined by fitting. z is determined from the first energy level using

$N_{hapax}=\frac{z}{1-z}$

and with the new definition of the power spectrum the B-E distribution now looks like

$N_{j}=\frac{1}{z^{-1}e^{\frac{(j-1)^{\alpha}}{T}}-1}$

The parameters $\alpha$ and T are to be simultaneously determined by fitting the data present in the occupation matrix (a matrix that contains the occupation distribution for each level j) using a nonlinear regression.
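A minimal Mathematica sketch of this procedure is given below. Setting j = 1 in the distribution gives $N_{1}=\frac{z}{1-z}$, so z follows from the hapax count as $z=\frac{N_{hapax}}{1+N_{hapax}}$. The hapax count, the generating values of $\alpha$ and T, and the starting guesses are all invented for illustration, and `FindFit` stands in for whichever nonlinear regression routine was actually used.

```mathematica
(* Fugacity from the hapax count, by inverting Nhapax = z/(1 - z) *)
fugacity[nHapax_] := nHapax/(1 + nHapax);

(* Occupation of level j for fugacity z, exponent alpha and temperature temp *)
occupation[j_, z_, alpha_, temp_] := 1/(Exp[(j - 1)^alpha/temp]/z - 1);

z0 = fugacity[5000];  (* hypothetical hapax count *)

(* Synthetic, noise-free occupations for levels 2..20,
   generated with alpha = 1.2 and T = 500 *)
synthetic = Table[{j, occupation[j, z0, 1.2, 500.]}, {j, 2, 20}];

(* Fit alpha (a) and T (t) simultaneously; the starting values are guesses *)
fit = FindFit[synthetic, occupation[j, z0, a, t], {{a, 1.}, {t, 400.}}, j]
```

Level j = 1 is left out of the fitted points because it has already been used to fix z. With real occupation data the result can depend noticeably on the starting values, so it is worth trying a few.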

Algorithm

A document-term matrix is created from the set of novels or stories that are to be analyzed. Then, for each energy level j, the number of distinct words that have a frequency equal to j is found and saved in the occupation matrix. Each row in the matrix now holds the distribution of occupations over the levels. By fitting the parameters to each row of data, we obtain T for the low frequency words (lower energy levels). In the case of texts, the B-E distribution turns out not to capture the statistics of the higher energy levels very well.

In order to see how T evolves with N (the number of words) in a novel, I divide the particular novel corpus into cumulatively increasing chunks of texts and do the above for each chunk.
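The occupation counts and the cumulative chunking can be sketched as follows. The toy sentence, the function name `occupations` and the chunk size of 4 words are my own placeholders; a real run would use a whole novel and far larger chunks.

```mathematica
(* Toy text standing in for a novel *)
text = "the cat saw the dog and the dog saw a bird";
words = StringSplit[text];

(* Occupations {N_1, ..., N_jmax} for a list of words:
   N_j is the number of distinct words occurring exactly j times *)
occupations[w_List, jmax_Integer] := With[{freqs = Tally[w][[All, 2]]},
   Table[Count[freqs, j], {j, 1, jmax}]];

occupations[words, 3]   (* {4, 2, 1} *)

(* Cumulatively growing chunks: the first 4 words, then the first 8 *)
chunkEnds = Range[4, Length[words], 4];
occupationMatrix = Table[occupations[Take[words, m], 3], {m, chunkEnds}]
(* {{2, 1, 0}, {3, 1, 1}} *)
```

In the full text there are four hapax legomena (cat, and, a, bird), two words that occur twice and one that occurs three times. Each row of `occupationMatrix` is then fitted separately to obtain one temperature per chunk.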

Mathematica is the choice of programming language for all of these operations. 🙂

Results and Analysis

I have run the programs on a set of eight novels: Bou Thakuranir Haat, Chokher Bali, Ghore Baire, Gora, Noukadubi, Projapotir Nirbondho, Rajorshi and Shesher Kobita. All the novels and other short stories are collected from [2].

Figure 1. (a) Occupation matrix for all novels, colors represent the magnitude of occupation in each cell. Note the variability of words frequency for each novel. Novel #4 (Gora) is the largest in this set. It exhibits quite a variable frequency compared to others. (b) Occupation vs. j log-log plot for Bou Thakuranir Haat for the first 1000 energy levels. The blue line is the fit found for the first 20 energy levels (low frequency words).

Chokher Bali: This novel contains ~70000 words. The occupation matrix and the characteristic temperature curve are shown below.

Figure 2. (a) Occupation matrix for Chokher Bali. As the number of chunks increases, we find some amount of variability. (b) Temperature vs. N graph, the points are joined together with broken lines, not a fit. Note the rise of temperature until the middle of the novel and a gradual decrease as the size of the text increases.

Gora: This one is the largest in my collection, ~180000 words.

Figure 3. (a) Occupation matrix for Gora. Note that a lot of variability shows up as we increase the size of the text. There is a certain visible pyramid-like pattern. This regularity may indicate recurring usage of word sets over the course of the novel. (b) Temperature as the size of the text grows.

Noukadubi: This is one of the shorter novels, containing ~40000 words.

Figure 4. (a) Occupation matrix for Noukadubi. A similar pyramid structure is notable. (b) T vs. N graph.

Temperature Evolution Comparison

The novels usually have a rise in temperature for up to ~30000 words or so, then we see that they fall off. What does it mean in terms of the physical analog?

The figure on the left shows all the temperatures on the same plot, and on the right is a set of exponential fits for the first 25000 words. Since they resemble a Boltzmann-like distribution, I could have done a fit using that equation. Oh well! 🙂

Conclusion

I wonder if the peak in temperature at around the 30000-word mark is a characteristic of Tagore’s writing (critical/transition temperature?). Note that temperature here refers to the net amount of variability of different frequencies for low frequency, i.e. rare, words. Different authors have different styles of writing. One may sit and finish a large piece in one go and never come back to it. However, many authors do come back to the same piece again to hone the variability of words.

In literature, lexical wealth is a measure of the author’s ability to use different sets of words. Every author has a natural limit though: rare words that describe particular events must cycle around in the novel. The results here could be the first step in showing that Rabindranath Tagore’s longer pieces of writing usually maintain a fairly distinct word frequency structure up to an approximately fixed word limit (~30000 to 50000), after which it breaks and gives way: the rare words and sentence structures start repeating and cycling more often as the size of a novel increases. This claim, however, should be viewed with doubt, as more experiments are needed to confirm it. This should in general be true for any author, but finding a characteristic falloff for Tagore is quite interesting.

Part 2 of my analysis uses power laws derived in fractal theory and similarity measures used in high dimensional data analysis to find out more about the lexical wealth of Tagore’s writings.

[1] Application of a quantum ensemble model to linguistic analysis, A. Rovenchak, S. Buk, Physica A 390 (2011), 1326-1331.

# Painting a Novel

I have always wondered about the possibility of shrinking a book into a picture. I love to read, but there are times when I start reading a novel and discover by page 209 that I am not really liking the content, the author’s views or simply the plot. If I had an image that said something about the book in a timeline-like manner, that would be pretty useful. Having said that, I am actually talking about some of the most difficult challenges in NLP. NLP is not my field and my knowledge of that area is pretty naive. However, I made an attempt this weekend to create images from two novels and compared them to see how informative they are. The goal was to see how emotions evolve in a novel. These can be called basic infographics of novels; however, I have tried to keep the aesthetics of these images in mind so that they do not look too technical… whatever that means.

Preliminaries

There can be hundreds of categories for describing the characteristics of a novel, even just to achieve some sort of accuracy in comparing them. However, I have focused on two broader aspects: sentiment analysis and nature phenomena. Sentiment analysis is a pure NLP problem: the goal is to quantify positive, negative, arousal, sadness etc. sentiments in a sentence or a paragraph by matching the words against an existing database of sentiment/emotion words. Usually, these databases have scores/stats associated with each word that show the strength of its positive or negative emotion. I was hoping a global sentiment analysis might tell me something about how and what kinds of emotions show up in the timeline of the novel. The reason for looking at how nature phenomena show up in a novel is quite personal. I am one of those people who love to read descriptions of nature in a novel; they help me visualize the environment, and in many cases I feel more attached to the story.

Data Collection

A nice resource page for sentiment analysis is [1]. I selected a free database that is available immediately (i.e. you don’t have to request the database and wait for ages to get it). It is called the AFINN word list [2]. It has a collection of 2477 words collected from Twitter. Each word has been given a score from -5 to +5 (-5 for extremely negative and +5 for extremely positive), assigned manually by the list’s author. However, I was not entirely sure whether a list of words collected from Twitter feeds could fully capture the strength of emotions in a novel, especially one written a century ago (for obvious reasons)! So I found another list of emotion words [3] that seemed quite helpful category-wise. I manually copied and pasted each category of words into two text files as seemed appropriate, one for positive emotions and the other for negative emotions. I also found a list of ‘Nature’ related words online and decided to go with it for my experiments.

How it works

I have kept it pretty simple. My idea is not only to see how much emotion information I can accurately extract, but also how I could produce (sort of) nice images from a book. That’s not something a science guy should say, but I have a thing for nice looking abstract patterns. So another goal is to take out the discrete structure of the final image and replace it with a smoothed out version.

Using the AFINN list or the lists I manually compiled, there is a simple way of constructing scores for a sentence or a paragraph. Let’s look at a few examples.

1. “I hate the way he talks, he is disgusting.” The lists usually contain the emotion words, so ‘hate’ and ‘disgusting’ would be the two words we are likely to find in the compiled lists. The AFINN word list has both of these words associated with -4. Adding up, the sentence would get a score of -8.

2. “I like her, but she is quite an idiot.” The word ‘like’ gets +2 and ‘idiot’ gets -4 from AFINN. Net score could be summed up, or we could take the maximum of the magnitudes, preserving the sign at the end. Summing up, this sentence gets a score of -2.

3. “I love her and she is the one in my life.” AFINN has ‘love’ with +3. It doesn’t have a score for the other words. However, there were no other negative words in this sentence, so this would get +3 overall.

4. “The city was shrouded by black smoke. Elliot suddenly understood that its destruction was a matter of time.” According to AFINN, this sentence gets -4. No positive words detected.

I can easily come up with a better heuristic than net summation. However, I did not have much time to experiment with which scheme would work well, so I had to be satisfied with this and hope to see some observable patterns.
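The examples above can be reproduced with a few lines of Mathematica. This is only a sketch: the score table contains just the values quoted in the examples rather than the full AFINN file, the function names are my own, and it matches whole words, whereas the functions in the Code section below use substring matching.

```mathematica
(* Scores as quoted in the examples above; not read from the AFINN file *)
scores = <|"hate" -> -4, "disgusting" -> -4, "like" -> 2,
   "idiot" -> -4, "love" -> 3|>;

(* Lower-case the text and split it on runs of non-letter characters *)
tokens[s_String] := StringSplit[ToLowerCase[s], Except[LetterCharacter] ..];

(* Net summation: add the scores of all listed words, 0 for unlisted ones *)
netScore[s_String] := Total[Lookup[scores, tokens[s], 0]];

(* Alternative: keep only the score with the largest magnitude, sign included *)
strongestScore[s_String] :=
  First[MaximalBy[Lookup[scores, tokens[s], 0], Abs]];

netScore["I hate the way he talks, he is disgusting."]    (* -8 *)
netScore["I like her, but she is quite an idiot."]        (* -2 *)
strongestScore["I like her, but she is quite an idiot."]  (* -4 *)
```

The second sentence shows the difference between the two heuristics: net summation lets the mild positive word soften the result, while the maximum-magnitude rule reports only the strongest word.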

For the second word list, positive emotions get +1 and negative emotions -1. Based on the count of positive or negative words in each sentence or paragraph, I multiply the count with the respective sign.

Each sentence or paragraph will be allocated a pixel in the final image, and the pixel will be colored according to the intensity of emotion, i.e. the score obtained from the net summation of the emotion words.

Code

1. To compare a list of sentences or paragraphs against the AFINN list and assign a score, we treat the document set as an n-dimensional vector in which each sentence or paragraph (depending on what we are investigating) is assigned an element, so n is the number of sentences or paragraphs. The i’th element is updated whenever the corresponding sentence or paragraph matches a word from the emotion words list. At the end, the vector is smoothed by running an exponential moving average filter, and it is reshaped into a matrix for easy viewing and plotting. I have chosen Mathematica because of its many built-in functions that make these things easy.

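(* dat: list of text units (sentences or paragraphs); elist: list of {word, score} pairs *)
compareAFINNLists[dat_, elist_] := Module[
  {tmp, ntmp, ptmp, psum, nsum, ppar, npar, i, j},
  Monitor[
    (* running negative and positive totals, one entry per text unit *)
    nsum = Table[0, {i, 1, Length[dat]}];
    psum = Table[0, {i, 1, Length[dat]}];
    For[j = 1, j <= Length[elist], j++,
      (* the ___ wildcards make the pattern span the whole string, so each entry
         of tmp is 1 if the j-th word occurs in that unit and 0 otherwise *)
      tmp = StringCount[dat, ___ ~~ elist[[j, 1]] ~~ ___];
      (* contribution of the j-th word if its score is negative *)
      ntmp = Table[
        If[tmp[[i]] == 0,
          0,
          If[elist[[j, 2]] < 0,
            elist[[j, 2]]*tmp[[i]],
            0]
        ]
        ,
        {i, 1, Length[tmp]}];
      (* contribution of the j-th word if its score is positive *)
      ptmp = Table[
        If[tmp[[i]] == 0,
          0,
          If[elist[[j, 2]] > 0,
            elist[[j, 2]]*tmp[[i]],
            0]
        ]
        ,
        {i, 1, Length[tmp]}];
      psum = psum + ptmp;
      nsum = nsum + ntmp;
    ]
    ,
    (* progress bar over the word list *)
    ProgressIndicator[j, {1, Length[elist]}]
  ];
  (* smooth each series and reshape it into a roughly square matrix;
     Partition drops trailing entries that do not fill a complete row *)
  ppar = Partition[ExponentialMovingAverage[psum, 0.03],
    Round[Sqrt[Length[psum]]]];
  npar = Partition[ExponentialMovingAverage[nsum, 0.03],
    Round[Sqrt[Length[nsum]]]];
  Return[{ppar, npar, psum, nsum}]
]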

2. For the other word lists, we follow a similar algorithm. This time, the sign (+1 or -1) is also input as an argument so that this factor can be multiplied with the net score.

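(* dat: list of text units; elist: list of words; sign: +1 or -1 *)
compareLists[dat_, elist_, sign_] := Module[
  {tmp, tmp2, sumt, spar, i, j},
  Monitor[
    (* running total, one entry per text unit *)
    sumt = Table[0, {i, 1, Length[dat]}];
    For[j = 1, j <= Length[elist], j++,
      (* 1 if the j-th word occurs in the unit, 0 otherwise *)
      tmp = StringCount[dat, ___ ~~ elist[[j]] ~~ ___];
      (* apply the sign of the list to every match *)
      tmp2 = Table[
        If[tmp[[i]] == 0,
          0,
          sign*tmp[[i]]
        ]
        ,
        {i, 1, Length[tmp]}];
      sumt = sumt + tmp2;
    ]
    ,
    (* progress bar over the word list *)
    ProgressIndicator[j, {1, Length[elist]}]
  ];
  (* smooth the series and reshape it into a roughly square matrix *)
  spar = Partition[ExponentialMovingAverage[sumt, 0.03],
    Round[Sqrt[Length[sumt]]]];
  Return[{spar, sumt}]
]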

3. Loading the text files and parsing to extract the sentences and/or paragraphs is pretty straightforward.

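(* work relative to the folder that contains the notebook *)
SetDirectory[NotebookDirectory[]];
(* load the novel in lower case and split it into paragraphs at blank lines *)
data = ToLowerCase[Import["montezuma.txt", "Plaintext"]];
data = StringSplit[data, "\n\n"];

(* AFINN list: one tab-separated word and score per line, turned into {word, score} pairs *)
emot = StringSplit[StringSplit[Import["AFINN-111.txt"], {"\n"}], "\t"];
emot = Table[{emot[[i, 1]], ToExpression[emot[[i, 2]]]}, {i, 1, Length[emot]}];

(* manually compiled word lists, separated by spaces or commas *)
pemot = Select[StringSplit[StringTrim[Import["positive-emotions.txt"]], {" ", ","}], # != "" &];
nemot = Select[StringSplit[StringTrim[Import["negative-emotions.txt"]], {" ", ","}], # != "" &];
nature = ToLowerCase[Select[StringSplit[StringTrim[Import["nature.txt"]], {" ", ","}], # != "" &]];

(* score every paragraph against the AFINN list *)
res = compareAFINNLists[data, emot];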

The resulting output matrix is considered a 2D scalar density field and plotted using the ListDensityPlot command in Mathematica.
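A minimal sketch of the smoothing, reshaping and plotting steps is shown below. The score series is synthetic, and the color function and the row reversal are my assumptions rather than the settings used for the figures.

```mathematica
(* Stand-in for a per-sentence score series *)
scores = Table[Sin[i/5.]^2, {i, 1, 400}];

(* Same smoothing constant as in the functions above *)
smoothed = ExponentialMovingAverage[scores, 0.03];

(* Reshape into a roughly square matrix: 400 values -> 20 rows of 20 *)
side = Round[Sqrt[Length[smoothed]]];
image = Partition[smoothed, side];

(* ListDensityPlot draws the first row at the bottom, so the rows are
   reversed to make the timeline read top to bottom *)
ListDensityPlot[Reverse[image], ColorFunction -> "TemperatureMap",
 Frame -> False]
```

When the number of sentences is not an exact multiple of the row length, `Partition` silently drops the leftover entries, so the last few sentences of a text may not appear in the image.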

Archangel – W.C. Halbrooks

Time to do some experiments and see how the program performs. I chose two novels that were available at my hands immediately. The first one is Archangel, written by my freshman year roommate Carter (W.C. Halbrooks) when he was in high school. I had a copy in my computer, so naturally it became the subject of my first few experiments.

Sentence based analysis: Following are some images produced for sentence based sentiment analysis.

Color map: min to max.

Figure 1. (Left) Positive emotions, (Right) Negative emotions based on the AFINN list. The associated color map is shown below them.

Figure 2. (Left) Figure 1 images masked over each other with an alpha value of 0.4, (Right) Sum of positive emotions and abs(negative emotions) matrices.

Figure 3.  (Left) Histogram of scores for positive emotions, (Right) histogram of scores in negative emotions.

The images are to be read left to right, top to bottom, just as one would read English text. Each is a timeline representing how emotions evolve as we read through each sentence. Figure 1 shows such images for the AFINN word list. Figure 2 shows two ways of combining the evolution of positive and negative emotions. From the histograms of figure 3, we see that the average scores hover around 2 and -1.5.

Color map: min to max.

Figure 4. (Left) Positive emotions based on DeRose emotion dictionary, (Right) Negative emotions based on the same dictionary. The color map is shown below them.

Color map: min to max.

Figure 5. (Left) Nature timeline based on my nature words list, (Right) Histogram of scores from Nature words category.

Figure 4 shows positive and negative emotions timeline based on the DeRose emotions dictionary, and figure 5 shows the performance of the Nature word list I found online. Definitely it’s a poor word list (see histogram), only a few words from the list were found in the novel. The other explanation could be that the novel does not have a lot of descriptions of nature, but I will have a hard time believing that.

Paragraph based analysis: Often it is a good idea to look at the net score of a paragraph and see a timeline based on emotions in each paragraph.

Color map: min to max.

Figure 6. Paragraph based positive emotions timeline (left), negative emotions timeline (right). Note the prominence of negative sentiments in the paragraphs in the later stages of the novel.

Figure 7. DeRose dictionary based positive emotions timeline. From the score histogram, it seems that quite a lot of words were common between the list and the novel.

Montezuma’s Daughter – Henry Rider Haggard

I recently read this novel. Project Gutenberg [4] offers a free text for all. From the images, I could roughly relate a few events (wars, love and marriage between the protagonists, conspiracy against the empire etc) in the novel.

Paragraph based analysis: From Archangel, it seemed to me that paragraph based analysis is better. For one thing, we get less cluttered images!

Color map: min to max.

Figure 8. Positive emotions timeline (left), negative emotions timeline (right).

This, in contrast to Archangel, says a lot about the kind of language used a century ago in novels. Note the prominence of positive emotions throughout the novel. This makes the novel easier to analyze, because the negative emotions stand out when there are extreme events. There are approximately six brown shades in the negative sentiments timeline (right). Having read the novel, I can approximately relate its tragic events to those six lines. Note the dominance of blue at the very beginning of the positive timeline (left), and the dominance of brown at the very beginning of the negative timeline. The novel starts with a lot of lamentation for the murder of the protagonist’s mother, so it is not surprising to see that small patch of brown at the beginning of the negative timeline (or the blue patch at the beginning of the positive emotions timeline).

Color map: min to max.

Figure 9. Nature description propagation in Montezuma’s Daughter. For this larger corpus, the nature words list worked out well (to some extent), as seen from the score histogram. So we can more or less rely on this timeline picture and say that there are quite a lot of nature descriptions in the middle-to-late part of the novel, which is not quite wrong. Anahuac (present-day Mexico) in the 16th century is quite well described when the protagonist becomes the king of the tribes there, which happens at around the middle of the novel.

Conclusion

This was just a glimpse of what data could be visualized about novels to give readers some notion of the emotional experience as they read along. Much other useful information about novels could be encoded in these timeline-like pictures. The work here does not do justice to the title, I agree, but hey, this was just me spending some spare weekend time off research and other duties to explore what sort of patterns and pictures emerge from the novels I read!

The deciding factors here are (a) a comprehensive list of emotion/sentiment words and (b) a nice heuristic to compare sentences or paragraphs. Let’s be honest, the net summation scheme sucks for many logical reasons; for one thing, it leaves out small and detailed sentiment strengths in paragraphs or sentences. Nevertheless, I saw some patterns that I expected to see, so it did the job for now. A better scheme could be a Taylor-series-like summation: as more words from the emotion database are found in the novel, the squared, cubic etc. terms of those values would be added to the overall sentiment strength.

The information visualization and art aspects of such images can’t be ignored. From my Google search I have not found anything about such visualization, but it’s quite hard to believe no infovis researchers attempted such work. I am interested to see what sort of work has been done so far.

With a carefully chosen color map, such patterns can be quite artsy from the reader’s or writer’s perspective. The amount of information that can be embedded in a 2D image is limited though; the No Free Lunch theorem applies here. An image based on emotions and sentiments in a novel seemed logical to me; however, there can be other aspects equally important to the reader. The experience of reading a novel is quite personal, and different readers value different factors.