Blog Post #3 Visualization of Data

 Science Article: Quantitative Analysis of Culture Using Millions of Digitized Books

Science googlelabs

     Simplicity is good. But how simple can the study of history become? The great minds of the past were venerated through their literature; now we aim to categorize, define, simplify, and claim great understanding of their culture and language. This is the objective of quantitative analysis of digitized text.

     Culture decides which subjects become prominent, and language influences the vocabulary we use to speak about and record those subjects (Jean-Baptiste Michel et al.). Culturomics focuses on cultural and linguistic phenomena in the English language throughout history. With this method, we can learn new history out of the old. Without manual labour, we can ask great questions like: When did the F-word appear? When did slavery become a social issue in Europe? How was women’s sexuality represented in Victorian times? Who was the most popular person in the 1800s? How did Nazi Germany censor literature, and what can we learn about Germans’ true feelings toward Hitler during World War Two? All interesting questions. However, there is but one problem.

     We are limited by the “selection” of sources. Who selects which sources? For example, the article stated that 4% of all books ever printed have been digitized into a corpus. Since the majority of these texts have been selected and scanned by a third party, we are bound by their interests. Our knowledge, therefore, is shaped by selective historians. Or are they historians? They might be engineers for all we know. Culturomics will create a biased perspective when “authorities” decide the value of sources. However, what isn’t biased and subjective? Protagoras once said, “man is the measure of all things.”

     My favourite part of the article was the section on fame. According to the article, people are more likely to be famous now than in the past. Makes sense. High literacy, the development of media, and technology mean a higher probability of fame. Let’s just remember to avoid mathematics as a “route to fame” (Jean-Baptiste Michel et al.). The public doesn’t seem to favour mathematicians.


Google N-Gram

Fig 1. Women in university

Women in university
“The first woman to gain honours in a University examination which was intended to be equivalent to that taken by men for a degree was Annie Mary Anne Henley Rogers. In 1877 she gained first class honours in Latin and Greek in the Second Examination for Honours in the recently instituted ‘Examination of Women’.” – Oxford Online Archive

     According to the Oxford University Archive, women started attending classes and taking examinations from the late 1870s. This was the later part of the Victorian age, when democratic values were strengthening. The frequency drops suddenly between 1910 and 1920, reflecting the shift of public attention to World War One, and skyrockets afterwards. Historians recognize that women’s rights increased dramatically after the war.

Fig 2. Isaac Newton

Isaac Newton
The one who came up with calculus during school vacation. Mr. Gravity’s graph is very interesting. According to the graph, he was alive in 1608, which can’t be right, since he was not born until 1642. There must be a namesake. I was quite surprised at how often a namesake was mentioned in literature. The high spike in the graph shows belated appreciation for Isaac Newton’s work near and after his death.

Problems with N-Gram:

     A slight problem we can see here is that N-Gram favours general cultural phenomena over historic individuals. The phrase “women + university” was never used before the 1800s. No scholars would have written about it. Therefore, it’s easy to connect the dots to “the first women to graduate university,” which is what I was looking for. However, Isaac Newton’s results surprised me. A namesake means there may be other “doppelgangers” in the literature before or after the historic person in question was born, and the graph cannot tell them apart. N-Gram is better suited to cultural phenomena than to names.
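     The namesake problem is easier to see with a small sketch of how a phrase frequency might be counted. This is only a toy illustration, not Google's actual pipeline: the function name `ngram_frequency_by_year`, the two-year corpus, and its sentences are all invented for the example, and it assumes frequency is simply the share of same-length word sequences in a given year that match the phrase.

```python
def ngram_frequency_by_year(corpus, phrase):
    """Return {year: share of n-grams in that year that match `phrase`}.

    `corpus` maps a year to a list of texts. Matching is purely on the
    words themselves, so it cannot tell one "Isaac Newton" from another.
    """
    target = tuple(phrase.lower().split())
    n = len(target)
    frequencies = {}
    for year, texts in corpus.items():
        hits = 0
        total = 0
        for text in texts:
            tokens = text.lower().split()
            # Slide a window of n words across the text.
            grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
            total += len(grams)
            hits += grams.count(target)
        frequencies[year] = hits / total if total else 0.0
    return frequencies


# Invented toy corpus (assumption): the 1608 sentence is about a namesake.
toy_corpus = {
    1608: ["isaac newton the farmer sold his land"],
    1700: ["isaac newton published his work on optics", "the royal society met"],
}

freq = ngram_frequency_by_year(toy_corpus, "Isaac Newton")
for year in sorted(freq):
    print(year, round(freq[year], 3))
```

     Both years come back with a non-zero frequency, even though only one of them is about the physicist. A counter like this sees matching words, not people.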

     Lack of context is another predictable issue. Did the public really appreciate Isaac Newton’s contributions, or did they create a big controversy? If women’s social participation grew after World War One, which departments were they steered toward, and which were they barred from?

Mining the Dispatch

     According to Robert K. Nelson, “topic modeling and other distant reading methods are most valuable not when they allow us to see patterns that we can easily explain but when they reveal patterns that we can’t, patterns that surprise us and that prompt interesting and useful research questions.”

     Mining the Dispatch used MALLET software to explore the “rhythm” of daily life in Richmond by topic modelling the online archive of the Richmond Daily Dispatch newspaper. In conventional American history, there hasn’t been much discussion of the daily lives of Richmond residents. How resilient were the slaves during the war? How stable was the slave market? Did they have any real opportunity to seize their freedom? Topic modelling allows us to venture these questions.

     A total of forty topics are divided into nine categories; under each category, there are subcategories chosen by the frequency of keywords. For example, under the category of “Slavery,” there is “For Hire and Wanted Ads.” We are presented with a graph showing how frequently “For Hire and Wanted” ads appeared between January 1861 and January 1864.

However, according to Nelson, topic modelling has flaws.

    Ranaway.—$10 reward.

    —Ranaway from the subscriber, on the 3d inst., my slave woman Parthena. Had on a dark brown and white calico dress. She is of a ginger-bread color; medium size; the right fore-finger shortened and crooked, from a whitlow. I think she is harbored somewhere in or near Duvall’s addition. For her delivery to me I will pay $10.
    de 6—ts G. W. H. Tyler.

     This ad appeared in “the Dispatch in December 1861,” during the first year of the American Civil War and well before emancipation. Programs are not perfect. Associating keywords with topics does not always give a clean result, as the example above shows: even for an obvious runaway ad, only “90% of this piece comes from the fugitive slave ad topic and the other 10% from two other topics.”
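     To make the 90%/10% split less abstract, here is a rough sketch of how a document’s topic proportions can be read off once every word has been assigned to a topic. It is not MALLET’s API or Nelson’s code: the function `topic_proportions`, the topic labels, and the 20-word assignment list are assumptions made up for illustration (Nelson does not say how the remaining 10% is divided between the two other topics), and the sketch ignores the smoothing a real topic model would apply.

```python
from collections import Counter


def topic_proportions(token_topics):
    """Return {topic: share of the document's words assigned to it}."""
    counts = Counter(token_topics)
    total = sum(counts.values())
    return {topic: count / total for topic, count in counts.items()}


# Hypothetical assignments for a 20-word ad (assumption): 18 words fall
# under the fugitive slave ad topic, one each under two other topics.
assignments = ["fugitive_slave_ads"] * 18 + ["other_topic_a", "other_topic_b"]

proportions = topic_proportions(assignments)
dominant = max(proportions, key=proportions.get)

print(proportions)
print("dominant topic:", dominant)
```

     The point is that a document is treated as a mixture of topics rather than being filed under a single one, so even a clear-cut ad carries a small share of unrelated topics.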

Mining the Dispatch has done a fascinating job as a data mining site. Although the American Civil War is not one of my interests, this method will help capture unnoticed patterns among piles of documents. Nelson’s work is a great example of large data and close reading.