Exploring the Daschle Collection using Text Mining

Damon Bayer; Semhar Michael

arXiv:1904.12623·cs.IR·April 30, 2019


TL;DR

This paper demonstrates how natural language processing and topic modeling can efficiently analyze large historical document collections, revealing key themes and events with minimal manual effort.

Contribution

It applies LDA-based text mining to a political archive, showcasing a scalable method for summarizing extensive textual data in historical and political research.

Findings

01

Identified major topics related to Senator Daschle's career.

02

Detected significant events and issues through topic shifts.

03

Showed the effectiveness of NLP methods in large-scale document analysis.

Abstract

A U.S. Senator from South Dakota donated documents that were accumulated during his service as a House representative and senator to be housed at the Bridges library at South Dakota State University. This project investigated the utility of quantitative statistical methods to explore some portions of this vast document collection. The available scanned documents and emails from constituents are analyzed using natural language processing methods including the Latent Dirichlet Allocation (LDA) model. This model identified major topics being discussed in a given collection of documents. Important events and popular issues from Senator Daschle’s career are reflected in the changing topics from the model. These quantitative statistical methods provide a summary of the massive amount of text without requiring significant human effort or time and can be applied to similar collections.


Taxonomy

Topics: Advanced Text Analysis Techniques · Text and Document Classification Technologies · Web Data Mining and Analysis

Full text

Exploring the Daschle Collection using Text Mining Methods

Damon Bayer and Semhar Michael

Damon Bayer is a graduate student in the Department of Mathematics and Statistics at South Dakota State University, Brookings, SD 57007. Semhar Michael is an Assistant Professor in the Department of Mathematics and Statistics at South Dakota State University, Brookings, SD 57007, email: [email protected].



Keywords: topic-modeling, natural language processing, text data mining, Daschle collection, LDA

1 Background

Senator Thomas A. Daschle represented South Dakota in the United States Congress for 26 years, during which he served in the House of Representatives (1978 - 1986) and the Senate (1987 - 2004). While in the Senate, Daschle served as the leader of the Democratic Party from 1994 to 2004. The “Senator Thomas A. Daschle Congressional Research Study” (Daschle, 2017) at South Dakota State University houses a collection of more than 2,000 linear feet of documents from Daschle’s career in Congress. Among these are voting records, speeches, sponsored legislation, personal papers, and research documents on topics of special interest to the Senator, such as the effects of Agent Orange on veterans and services on American Indian reservations.

In addition to these physical items, the collection also includes over 12,000 emails from constituents which were sent to the Senator’s office from 2002 to 2004. Through this analysis, we examine this massive collection of text documents using topic modeling and a variety of other exploratory techniques in order to summarize the issues important to South Dakotans and the Senator during his career. These methods have an advantage over traditional approaches to text analysis because they yield valuable insights without manual reading of tens of thousands of documents, which may be time consuming or pose a privacy concern in the case of constituent emails.

We perform topic modeling on the Daschle Collection after pre-processing and initial summary analysis. In general, topic modeling is a statistical text mining tool for automatically identifying hidden patterns in a given collection of documents and assigning the documents to one or more topics. Following this procedure, a human can interpret the topics to give them a more precise and understandable label. In contrast to reading or skimming documents individually, this process reduces the human effort required to sort a collection to interpreting only a small number of computer-generated topics. Therefore, it becomes an effective tool for summarizing a massive collection of documents. Topic modeling has recently been used in the humanities to explore Civil War era newspaper archives (Nelson, 2011), 18th century diaries (Blevins, 2011), and even as an initial step in literary criticism (Buurma, 2015).

The size and scope of the Daschle Collection make it an ideal candidate for statistical modeling approaches that extract important features and information. Though much of the collection currently exists solely in paper form, 40GB of data, consisting of over 4000 scanned pages of speeches, legislation, financial information, and various other documents, is already available for pre-processing and analysis. Optical character recognition techniques such as those described by Chaudhuri et al. (2017) are used to convert these scanned documents into text files. These text documents are then converted into numeric representations, such as a term-document matrix or an n-gram representation, for further analysis. After initial exploratory analysis, we employ the probabilistic Latent Dirichlet Allocation model (Blei et al., 2003). This probabilistic modeling approach allows us to extend the methodology to a new dataset without additional human effort in processing the data. The currently digitized set of the Daschle Collection represents only a small part of the total collection. As more of the collection becomes digitized, the new documents can easily be incorporated into the analysis.

This paper develops a workflow and investigates the two types of text data in this collection. In the first case, we apply optical character recognition (OCR) methods to read in scanned documents and convert them into text. We then implement some common algorithms to clean the text and perform exploratory analysis. In the second case, we explore the emails that the Senator received from his constituents. Since these were already in text form, we performed cleaning and analysis directly.

The paper is organized as follows: Section 2 presents the preliminaries needed to analyze the two types of data. The analysis of the two datasets is presented in Section 3. The results of the analysis provide insight into the vast collection. The paper concludes with a brief discussion and future work in Section 4.

2 Methods

This section starts with a discussion of necessary preliminaries on text data mining methods. It then focuses on the Latent Dirichlet Allocation (LDA) model and its parameter estimation methods. It also discusses extensions of this model that incorporate time-varying topics.

2.1 Exploratory methods

2.1.1 Data cleaning methods

The primary method of text representation used in this paper is the bag-of-words model. For our data, one scanned document or a single email is considered to be a single document. This model represents a document as a list of unique words found in the document paired with the count of each word. Stacking these counts across documents gives a document-term matrix. Some modifications can be made to this process to make the data more usable. Generally this includes removing common words, or “stop words”, and “stemming”. Stop words such as “an”, “for”, and “the” typically do not carry much meaning for some analyses and can be eliminated to simplify the matrix. Stemming can also be applied to reduce a word to its root form. For most purposes, “fight”, “fights”, “fighting”, and “fighter” provide nearly the same information in a sentence. Stemming each of these terms to “fight” simplifies the matrix and accounts for their similarities.
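As a rough illustration of the bag-of-words representation, the following Python sketch lowercases a text, removes stop words, stems each token, and counts the results. The stop-word list and the `toy_stem` suffix stripper are deliberately tiny stand-ins invented for this example; they are not the stop-word list or Porter stemmer used in the analysis.

```python
from collections import Counter
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}

# Toy suffix stripper standing in for a real stemmer such as Porter's.
SUFFIXES = ("ing", "er", "s")


def toy_stem(word):
    """Strip one known suffix, keeping at least three leading characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def bag_of_words(text):
    """Lowercase, tokenize, drop stop words, stem, and count terms."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(toy_stem(t) for t in tokens if t not in STOP_WORDS)


doc = "The fighter fights for the fight, fighting in the fights."
bag = bag_of_words(doc)
print(bag)
```

In this toy sentence, all four surface forms collapse into the single term “fight”, which is the simplification the stemming step is meant to achieve.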

A language model is a model that assigns a probability to a sequence of words (Jurafsky and Martin, 2009). Language models are useful in many natural language tasks, such as spelling correction. A language model would enable a computer to correct the sequence “every night I dream of world piece” to “every night I dream of world peace”, even though “piece” and “peace” are both valid English words. Language models are also used in speech recognition, handwriting recognition, and machine translation, and will be used in one of our classification methods.
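To make the “piece” versus “peace” example concrete, here is a minimal sketch of a bigram language model with add-one smoothing, trained on a four-sentence toy corpus. The corpus, the smoothing choice, and the function names are assumptions made for illustration; they are not taken from the paper.

```python
from collections import Counter

# Toy training corpus (an assumption for illustration only).
corpus = [
    "we hope for world peace",
    "world peace is a dream",
    "a piece of cake",
    "she ate a piece of pie",
]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    words = sentence.split()
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

vocab_size = len(unigrams)


def bigram_prob(prev, word):
    """Add-one smoothed estimate of P(word | prev)."""
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size)


def sequence_prob(sentence):
    """Product of smoothed bigram probabilities over the sentence."""
    words = sentence.split()
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        prob *= bigram_prob(prev, word)
    return prob


candidates = ["dream of world piece", "dream of world peace"]
best = max(candidates, key=sequence_prob)
print(best)
```

Because “world peace” occurs in the toy corpus and “world piece” does not, the model assigns the higher probability to the sequence ending in “peace”.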

In linguistics, an individual word is called a unigram. Two consecutive words are called a bigram, three consecutive words are called a trigram, etc. These n-grams can be used to construct a “bag-of-n-grams” model. This enables us to capture the order of the words in the data, but can be problematic for smaller datasets, as each n-gram will occur less frequently than a given unigram.
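A bag-of-n-grams can be built in the same way as a bag-of-words, as the short sketch below suggests. The helper name `ngrams` and the example sentence are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]


tokens = "support the war and support the troops".split()
unigram_bag = Counter(ngrams(tokens, 1))
bigram_bag = Counter(ngrams(tokens, 2))

print(unigram_bag.most_common(2))
print(bigram_bag.most_common(1))
```

Even in this short sentence, the seven tokens yield only six bigrams, and most of them occur once, which hints at why n-gram counts become sparse in small datasets.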

2.1.2 Word frequency

Word clouds and bar charts are one way to summarize text data based on word frequency. To create word clouds and bar charts for frequent words, we first removed stop words (Meyer et al., 2008) and words common to all emails (“dear”, “sincerely”, “regards”) or to the other documents. The resulting bags-of-words were used as input to the wordcloud function from the wordcloud package (Ian, 2014) in the statistical software R (R Development Core Team, 2016). This function prints the most frequent words in a square plot, sized in proportion to their frequency.

Additionally, bar charts were constructed from a fixed number of the most frequent terms from the bags-of-words. In order to better understand which words best represented each class, we converted the frequency counts into relative frequencies. We calculated these relative frequencies for data in each topic and for the data outside of that topic. This can be thought of as calculating the unigram probability for a word in a class, $\frac{\#(word_{i}\mid topic_{k})}{\sum_{j=1}^{V}\#(word_{j}\mid topic_{k})}$, and the unigram probability for a word outside of a class, $\frac{\#(word_{i}\mid topic_{k}^{\prime})}{\sum_{j=1}^{V}\#(word_{j}\mid topic_{k}^{\prime})}$, where $\#(word_{i}\mid topic_{k})$ is the count of word $i$ in topic $k$, $topic_{k}^{\prime}$ denotes all data outside topic $k$, and $V$ is the number of distinct words. Computing the differences between the unigram probabilities in the different groups yielded a measure of each word’s importance to the topic.
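The relative-frequency comparison can be sketched in a few lines of Python. The counts below are invented for illustration, and the function names are hypothetical; the sketch only shows the arithmetic of subtracting the out-of-class unigram probability from the in-class one.

```python
from collections import Counter


def relative_freq(counts):
    """Convert raw term counts to relative frequencies."""
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


def term_importance(in_class, out_class):
    """Difference between in-class and out-of-class unigram probabilities."""
    p_in = relative_freq(in_class)
    p_out = relative_freq(out_class)
    terms = set(p_in) | set(p_out)
    return {t: p_in.get(t, 0.0) - p_out.get(t, 0.0) for t in terms}


# Hypothetical counts for one topic and for everything outside it.
in_topic = Counter({"veteran": 6, "benefit": 3, "vote": 1})
out_topic = Counter({"veteran": 1, "benefit": 1, "vote": 8})

scores = term_importance(in_topic, out_topic)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

A word that is common everywhere, such as “vote” in this made-up data, receives a negative score for the topic even though it appears there, while words concentrated in the topic rise to the top.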

2.2 Topic modeling

A topic model is a probabilistic model which aims to identify “hidden” topics in a corpus and assign words or documents to these topics. Latent Dirichlet allocation (LDA), first proposed by Blei and his colleagues (Blei et al., 2003), is the most common method of topic modeling and the one employed in this analysis. It is a generative statistical model with the following process:

For a corpus $D$ with $M$ documents and $K$ topics, let $i\in\{1,\ldots,M\}$ indicate a specific document, $k\in\{1,\ldots,K\}$ indicate a specific topic, and $j\in\{1,\ldots,N_{i}\}$ indicate a specific word in document $i$, which contains $N_{i}$ words.

1. For each topic $k$, choose $\beta_{k}\sim\text{Dirichlet}(\delta)$

2. For each document $i$, choose $\theta_{i}\sim\text{Dirichlet}(\alpha)$

3. For each of the $N_{i}$ words $w_{i,j}$ in document $i$

a. Choose a topic $z_{i,j}\sim\text{Multinomial}(\theta_{i})$

b. Choose a word $w_{i,j}\sim\text{Multinomial}(\beta_{z_{i,j}})$

Following this model, we can view a document as a mixture of topics, with each topic being a mixture of words. In our case, the parameters of these distributions are estimated using the variational expectation-maximization (VEM) algorithm (see Blei et al. (2003) for details). To find the optimal number of topics to model in our dataset, we used a variety of metrics proposed by Griffiths and Steyvers (2004), Deveaud et al. (2014), Cao et al. (2009), and Arun et al. (2010) as implemented in Nikita (2016).
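The generative process can be simulated directly, which may help clarify what “a mixture of topics” means. The sketch below uses only the Python standard library and draws Dirichlet samples by normalizing Gamma variates. The vocabulary, hyperparameter values, and function names are assumptions chosen for illustration, and the code simulates data from the model; it does not perform the VEM estimation used in the analysis.

```python
import random


def sample_dirichlet(rng, concentration, size):
    """Draw from a symmetric Dirichlet by normalizing Gamma variates."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(vocab, n_topics, doc_lengths, alpha, delta, seed=0):
    """Simulate documents from the LDA generative process."""
    rng = random.Random(seed)
    # Step 1: one term distribution beta_k per topic.
    beta = [sample_dirichlet(rng, delta, len(vocab)) for _ in range(n_topics)]
    corpus = []
    for n_words in doc_lengths:
        # Step 2: topic proportions theta_i for this document.
        theta = sample_dirichlet(rng, alpha, n_topics)
        words = []
        for _ in range(n_words):
            # Step 3a: draw a topic; step 3b: draw a word from that topic.
            z = rng.choices(range(n_topics), weights=theta)[0]
            words.append(rng.choices(vocab, weights=beta[z])[0])
        corpus.append(words)
    return beta, corpus


vocab = ["farm", "tribe", "veteran", "vote", "war"]
beta, corpus = generate_corpus(vocab, n_topics=2, doc_lengths=[8, 5],
                               alpha=0.5, delta=0.5)
print(corpus)
```

Fitting LDA reverses this simulation: given only the observed words, the estimation procedure recovers plausible values of the topic-term distributions and the per-document topic proportions.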

3 Data analysis

In this section, we discuss the steps taken to analyze both forms of documents, paper and email, and present the results obtained from both datasets.

3.1 Paper documents

While the total Daschle collection consists of over 2000 linear feet of materials, distributed among 750 boxes, only 4,165 text documents (8,034 pages) have thus far been digitized. Our analysis focuses on this smaller subset of data, which was digitized for use in prior research projects. As such, we expect the topics present to be a subset of those represented in the complete collection.

Analysis began with applying optical character recognition (OCR) to the documents. OCR aims to extract the text from the scanned images of the paper documents in the Daschle collection. We used the open-source software Tesseract (Smith, 2007) implemented in the tesseract R package (Ooms, 2017a) to accomplish this task. To assess the quality of these documents we used the open-source Hunspell (Nemeth, Accessed June 2017) dictionary implemented in the hunspell R package (Ooms, 2017b). We deemed a document to be low-quality if fewer than half of the “words” output from the OCR software were found in the Hunspell English dictionary. These low-quality documents were removed from the dataset and not analyzed (1,087 documents, or 26% of the total documents). We deemed a document to be high-quality if greater than 90% of the “words” output from the OCR software were found in the Hunspell English dictionary.

High-quality documents were subjected to automatic spelling correction via the Hunspell package (477 documents, or 11% of the total documents). We assumed the errors in high-quality documents were more correctable than those in mid-quality documents. Correcting mid-quality documents could inflate the noise in these readings if similar strings of noise were corrected to the same words. Removing the low-quality documents left 3,078 documents for analysis. From these documents, we removed stop words using the list provided in the tidytext R package (Silge and Robinson, 2016). “Stop words” are words such as “the,” “of,” and “is” that are presumed to contain very little meaning in the context of topic modeling. Additionally, we performed stemming using Porter’s algorithm implemented in the SnowballC R package (Bouchet-Valat, 2015). Stemming aims to combine words with very similar meaning (e.g. “dance,” “dancer”, and “dances” would all be stemmed to “dance”). Figure 1 presents examples of a low-quality and a high-quality document.
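The quality filter amounts to thresholding a dictionary hit rate. The following Python sketch assumes that a document is low-quality when fewer than half of its OCR tokens are dictionary words and high-quality when more than 90% are, with everything in between treated as mid-quality. The small `dictionary` set is a stand-in for the Hunspell dictionary, and the function names are hypothetical.

```python
def dictionary_hit_rate(tokens, dictionary):
    """Fraction of OCR tokens that appear in the dictionary."""
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t.lower() in dictionary)
    return hits / len(tokens)


def classify_quality(tokens, dictionary, low=0.5, high=0.9):
    """Label a document as low, mid, or high quality by its hit rate."""
    rate = dictionary_hit_rate(tokens, dictionary)
    if rate < low:
        return "low"
    if rate > high:
        return "high"
    return "mid"


# Tiny stand-in dictionary (an assumption; the paper uses Hunspell).
dictionary = {"claims", "filed", "with", "the", "agency", "as", "well"}

print(classify_quality("claims filed with the agency".split(), dictionary))
print(classify_quality("wA_ Mwmw mwwwwffl the".split(), dictionary))
```

Under this scheme only the low-quality documents are discarded, and automatic spelling correction would be applied only to documents labeled high-quality.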

The OCR output obtained for the low-quality and high-quality example documents is as follows:

•

Low-quality document OCR output: $wA\_.({\textbf{???}})Mwmw\_mwwwwffl:mmmszflwm\_\_M\_W..:WMmgggfiwwwm:m\textsection;\textcent fi\textsterling\&;flmjwmiwW:iwmMyflww:fim\_iw:\_fiWWW:WWfiivfiww:,1,meJig-fig;Wii:I:ngwwwmflwm:y\_g\_M14\%;\_w;-m\_mWWWWMTWWWm,\_f--WWWW\#4:.\_\_\_\_W\_\_\_.\_\_-\_\_\_r\_.:n\_,\_\_\_w\_\_\_\_m\_,g,.\_,\_\_w,W\_W\#Wmia;figfio\textsterling e15e$ …

•

High-quality document OCR output: claims filed with the agency as well as a breakdown of the specific type of illness claimed: ‘

The LDA topic model is fitted to the data for different numbers of topics. To find the optimal number of topics to model in our dataset, we used a variety of metrics as implemented in Nikita (2016). The re-scaled results of these are presented in Figure 2. As a compromise between the metrics, we initially attempted modeling with 15 topics. Although each modeled topic clearly represented a distinct actual topic, some topics appeared to be duplicated. To eliminate this redundancy, we created a new model with only 10 topics. These topics were largely similar to the initial 15 but without the redundancy. These 10 topics are used in the remainder of this analysis. Next, we labeled the resulting 10 topics by hand using the bar plots presented in Figure 3 and word clouds for each topic (not shown). Our labels are given in the title of each bar chart. Beta is $P(\text{term}\mid\text{topic})$ .

We also examined the overall proportion of the topics throughout the collection. From Figure 4, it is clear that documents about American Indians dominate the digitized collection, with approximately 50% of all the text coming from a topic related to Native Americans. Because the digitized documents are only a small subset of the total collection, this may not be representative of the contents of the rest of the documents, but it indicates the interest of researchers in this topic. The second most popular topic in the sample collection was about veterans and “agent orange”. As a representative of South Dakota, Daschle was, of course, concerned with Native American affairs and worked on many related bills. The effects of Agent Orange on Vietnam veterans were also a topic of importance to Daschle. The chemical was used in wartime to deforest areas where the enemy could hide, but had unintentional, long-lasting effects on soldiers who were exposed to it. Daschle advocated for a bill to provide permanent disability benefits to those who suffered under the chemical’s effects (Gough, 2003; Daschle et al., 2008).

3.2 Emails

While Daschle served in Congress from January 3, 1979 to January 3, 2005, the collection of emails available for analysis spans only the final years of his career, from April 11, 2002 to November 16, 2004. These emails range in length from one word to 59,740 words, with 80% of the emails containing fewer than 260 words. Before looking into the contents of the emails, we first attempt to identify major events in Daschle’s career by examining email frequency over time. The top 10 most active days are highlighted by black dots in Figure 5.
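Finding the most active days is a simple counting exercise, sketched below in Python. The dates are hypothetical placeholders rather than values from the email dataset, and in the real analysis the top 10 days would be requested instead of the top 2.

```python
from collections import Counter
from datetime import date

# Hypothetical received dates; these are not taken from the real dataset.
received = [
    date(2003, 3, 18), date(2003, 3, 18), date(2003, 3, 18),
    date(2003, 3, 19), date(2004, 11, 3), date(2004, 11, 3),
]

# Count emails per calendar day and keep the busiest days.
emails_per_day = Counter(received)
top_days = emails_per_day.most_common(2)
print(top_days)
```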

We looked into the specific dates to explain these dramatic increases in email volume by examining the historical context, as well as the content of the emails themselves. In Figure 6(b) we present the dates, a headline from a news source which we assume to be the cause of the influx of emails, and a word cloud detailing the contents of the emails from the relevant date range. The content of the emails seems to correspond with the news articles. In two cases where Daschle criticized the Iraq War, constituents were upset and wrote in to say he should “support” the war and they were “disappointed” with his “comment.” The emails in November 2003 were less focused. Many emailed in support of or opposition to the Democrats’ decision to block George W. Bush’s nomination of Miguel Estrada by using a “filibuster” to prevent his confirmation vote from taking place. Additionally, many emailed on these days about health care, opposing the “Breaux Amendment” to the Medicare Prescription Drug, Improvement, and Modernization Act. Emails about the amendment came almost exclusively from constituents who identified themselves as employees of the Black Hills Surgery Center in Rapid City and who felt that the amendment would be disastrous for their institution. The final spike in emails came on the night Daschle lost his reelection campaign to John Thune in 2004. Constituents used words like “vote” and “elect,” and thanked Daschle for his “service.” This is the final day for which emails are provided in the dataset.

Next, we examine the results of LDA topic modeling on all emails. To decide on a number of topics to use, we use a variety of metrics, displayed in Figure 7. After initially modeling with 15 topics, we determined that some topics appeared to be mixtures of two or more other topics. To improve the separation of topics, we created a new model with 25 topics, which seemed to separate the topics well without leading to duplication. We label the resulting 25 topics by hand using the bar plots presented in Figure 8 and word clouds for each topic (not shown). Our labels are given in the title of each bar chart. Beta is $P(\text{term}\mid\text{topic})$ .

We also examined the relative quantity of emails with respect to the topics. Topics reflecting the “major events” detailed above, such as “Iraq War” and “Judicial & Marriage,” are some of the most prominent in the entire corpus. “Political Noise,” the most popular topic, is one of the few without an obviously applicable label. It consists mostly of words like “south,” “dakota,” “democrat,” “republican,” and “vote.” We conjecture that this topic includes political words used in combination with a variety of issues. For example, a document discussing the Democrats’ stance on an education bill would be a mix of “Political Noise” and “Education,” and an email discussing farmers in South Dakota would be a mix of “Political Noise” and “Farming.”

Next we examine how the content of emails changed over time. Topics not presented in Figure 10 showed consistent presence over time. Of the topics with visible trends, most correspond with relevant events in history. The increase in “G. W. Bush” emails corresponds to Daschle’s controversial comments discussed earlier in this section. The spike in “Judicial & Marriage” emails relates to the nomination of Miguel Estrada in 2002, as well as the Federal Marriage Amendment, a bill that defined marriage in the United States as being between a man and a woman, in 2004. Emails involving high proportions of “Political Noise” peaked when Daschle lost his seat to Thune in the November 2004 election. The increase in “Tobacco & Guns” emails in late 2002 appears to be related to a very passionate individual or group who emailed Daschle the same message about tobacco and marketing to children almost daily for several weeks. These messages persist over time, but were particularly concentrated in late 2002. Similarly, one or more people repeatedly sent a message about gun laws in mid 2004. The rise in messages about the Iraq War coincides with the war’s beginning. Messages regarding “Clone, Title IX, Head Start” appear most prevalently in 2002, when the Human Cloning Prohibition Act of 2001 was being considered by Congress.
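One simple way to trace topic prevalence over time is to average each topic's estimated proportion over the emails received in each period. The Python sketch below assumes monthly periods and invented per-email topic proportions; the function name and the numbers are hypothetical and are not results from the fitted model.

```python
from collections import defaultdict


def monthly_topic_share(docs):
    """Average each topic's proportion over the documents in each month."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for month, theta in docs:
        counts[month] += 1
        for topic, share in theta.items():
            sums[month][topic] += share
    return {
        month: {topic: total / counts[month] for topic, total in topics.items()}
        for month, topics in sums.items()
    }


# Hypothetical (month, topic proportions) pairs for four emails.
docs = [
    ("2002-10", {"Iraq War": 0.2, "Tobacco & Guns": 0.8}),
    ("2002-10", {"Iraq War": 0.4, "Tobacco & Guns": 0.6}),
    ("2003-03", {"Iraq War": 0.9, "Tobacco & Guns": 0.1}),
    ("2003-03", {"Iraq War": 0.7, "Tobacco & Guns": 0.3}),
]

shares = monthly_topic_share(docs)
print(shares["2003-03"]["Iraq War"])
```

Plotting these monthly averages for each topic would give trend lines of the kind described above, where a rise in one topic's share marks a period in which constituents wrote about it more often.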

4 Discussion

Topic modeling on the Daschle collection appears to be a successful method of summarizing the concerns of Daschle and his constituents while maintaining the privacy of South Dakotans. In general, the topic models and other explorations revealed patterns which might be expected to be present in the data. Topics from the paper documents were dominated by the massive volumes of research on specific topics that were scanned to be used in further research. Email volume increased following controversial or surprising events, but was otherwise generally consistent in volume and in topic, focusing on war and family, as well as requests for autographs.

In addition to the contribution of converting the scanned document images to text form, several patterns and topics were uncovered throughout this project. Exploration of the currently digitized portion of the Daschle Collection shows that a sufficient quality and quantity of data is available. The results identified prominent topics, such as healthcare reform and veterans’ benefits, that Daschle was known for during his time in office, and may also expose less obvious themes that previous researchers have not identified. Consequently, the results of this project provide researchers with documents that are ready for further analysis and interpretation.

While our analysis is interesting in isolation, we believe many future reports could be aided by using our topic models to uncover information about specific events or concerns of constituents during Daschle’s tenure in Congress.

Bibliography


1. Arun et al. (2010). Rajkumar Arun, Venkatasubramaniyan Suresh, CE Veni Madhavan, and MN Narasimha Murthy. On finding the natural number of topics with latent Dirichlet allocation: Some observations. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 391–402. Springer, 2010.
2. Blei et al. (2003). D. Blei, A. Ng, and M. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
3. Blevins (2011). C. Blevins. Topic modeling historical sources: Analyzing the diary of Martha Ballard. Proceedings of Digital Humanities. Stanford, CA: Digital Humanities, 2011.
4. Bouchet-Valat (2015). Milan Bouchet-Valat. SnowballC: Snowball stemmers based on the C libstemmer UTF-8 library. https://CRAN.R-project.org/package=SnowballC, 2015.
5. Buurma (2015). Rachel Sagner Buurma. The fictionality of topic modeling: Machine reading Anthony Trollope’s Barsetshire series. Big Data & Society, 2(2), 2015.
6. Cao et al. (2009). Juan Cao, Tian Xia, Jintao Li, Yongdong Zhang, and Sheng Tang. A density-based method for adaptive LDA model selection. Neurocomputing, 72(7-9):1775–1781, 2009.
7. Chaudhuri et al. (2017). Arindam Chaudhuri, Krupa Mandaviya, Pratixa Badelia, and Soumya K Ghosh. Optical character recognition systems for English language. In Optical Character Recognition Systems for Different Languages with Soft Computing, pages 85–107. Springer, 2017.
8. Daschle (2017). T. Daschle. Senator Thomas A. Daschle congressional research. https://www.sdstate.edu/daschle-study, 2017. Accessed: 2017-09-18.