How do themes come up across an entire series of novels? If each book shares characters, settings, and events, shouldn't they also share thematic elements? This project tackles that problem in Anthony Trollope's Barsetshire Series—a collection of six novels published between 1855 and 1867 that take place in Barsetshire, a fictional English county.
This was the first project I did as a Digital Humanities Scholar at Swarthmore College. My advisors were Rachel Buurma of the Swarthmore English Department and Nabil Kashyap, the Digital Scholarship Librarian at Swarthmore College. Throughout the semester, I worked with a seminar on Victorian Literature led by Rachel. They helped identify topics that might give rise to a visual interpretation, and we converged on tracking thematic changes throughout Trollope's series.
The goal of this project was to show how different topics like gender, class, and religion are expressed throughout the novels in Trollope's series. The project stemmed from my independent study in Digital Humanities, where I was assigned to work with Rachel's seminar on Victorian literature. The students wanted to explore how themes develop across the series as a whole.
The project started off in a few different directions. We considered analyzing character appearances or doing some topic modeling, but we decided that the best route would be to look at word frequencies.
Text in a green box pertains to the technical aspects of the project.
When I first met with the seminar, I wasn't sure what to expect. Enthusiasm was abundant: I introduced myself and listened to what the students were interested in. Together we thought about how we could turn the information in these books into data, and then make something meaningful out of it.
I asked myself several questions before getting to the design portion of the project. What are the potential variables? How are the novels related? When were they published and in what context (i.e. in volumes, weekly serials, etc.)? How do characters and places change from book to book? How is each book broken up—by chapter, by volume? And lastly, what can we actually turn into data? These questions would frame our discussions and allow us to home in on the information we wanted to display, and how to display it.
The most important thing to keep sight of was that in the digital humanities, one has the responsibility of separating the qualitative from the quantitative. These are books, and our goal is only to enrich our understanding of them, not to replace it. No amount of analysis and no visualization can truly capture what makes a novel special, but it can point one in the right direction. Simply put, after the analysis and the designing, I wanted to make a tool: something someone could use to enjoy the series even more.
I consolidated the seminar's goals into three categories: recurrences of theme, style, and narrative based on word usage; connections between characters and places; and relationships between characters.
For recurrences, I sketched a heatmap that shows word frequency based on general themes throughout the novels. Each novel would be split up into chunks, and this visualization would show how frequently certain topics appear in each chunk. For example, we could look at how often words associated with religion (the topic) appear in each chapter (the chunk). I took inspiration from choropleth maps, which are often used to show frequencies over an area.
To show connections between characters and places in the books, I suggested a narrative chart, as seen on XKCD. This could show how plotlines, setting, and characters weave together throughout the series, all while keeping track of where we are in the series.
Lastly, for relationships between characters, I recommended a parallel coordinate chart, which would show occurrences and co-occurrences of characters. It could show when characters appear together, and would indicate relationships between different characters. It could even include minor characters or groups of characters.
We narrowed it down to two choices: showing character relationships or showing recurrences based on word usage.
The original idea was to use parallel coordinates to show how characters move between chapters. The issue was that parallel coordinates would suggest a direct temporal or narrative relationship between chapters, which is not necessarily the case. Instead, I suggested we use a stacked bar chart to display character appearances in each chapter. This would clearly distinguish different characters, and one could track relationships between them, as it would be easy to see how frequently some characters come up alongside others.
To show thematic, stylistic, or narrative occurrences, I sketched a heatmap that would show word frequencies throughout the chapters of each novel in the series. This would allow one to find patterns between books and chapters, and one could even track several topics at once, thus finding patterns not only across the series, but across themes and styles.
We decided to pursue the second visualization. Click on the images to read more about each one.
A nested data structure suited this project best. I ended up settling on the JSON structure that is sketched out; you can also refer to the actual data.
As for extracting the data, Nabil wrote a Python script that separated each book into chapters, and then suggested I use the NLTK library for the text analysis. You can see the scripts on GitHub. First I had to extract all the words from each novel and take out stopwords (common words like "her," "this," and "and"), and then I had to write a script that would create the data structure I wanted.
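The preprocessing step can be sketched roughly as follows. This is a minimal illustration, not the actual script: it uses a tiny hardcoded stopword list, whereas the real pipeline would draw on NLTK's stopwords corpus (`nltk.corpus.stopwords.words("english")`).

```python
import re

# Tiny stopword list for illustration only; in practice this would come
# from NLTK's much larger stopwords corpus.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "her", "his", "this", "that"}

def tokenize(text):
    """Lowercase the text and pull out alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def content_words(text):
    """Return a chapter's tokens with stopwords removed."""
    return [w for w in tokenize(text) if w not in STOPWORDS]

# Example: a made-up sentence standing in for a chapter's text
print(content_words("The Warden walked to the cathedral."))
# -> ['warden', 'walked', 'cathedral']
```

Each chapter, once reduced to its content words this way, becomes a simple list of tokens ready for counting.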
The basic structure can be read like so: Consider the topic we want to track across the series, then consider all of the books in the series. For each of these books, we look at each chapter, and we give each chapter a value that represents how it relates to the topic in question. One can see the nesting: there are several topics, and for each topic there are six novels to consider, and in each of those novels we look at each of the chapters, and that's when we finally get a measurement. This structure is directly related to how the visualization is designed, coded, and implemented.
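The nesting described above can be sketched as a small JSON document. The field names and numbers here are my own illustration, not the project's actual schema:

```python
import json

# Topics -> books -> chapters -> frequency value.
# All names and values below are illustrative placeholders.
data = {
    "topics": [
        {
            "name": "Religion",
            "books": [
                {
                    "title": "The Warden",
                    "chapters": [
                        {"number": 1, "frequency": 0.021},
                        {"number": 2, "frequency": 0.017},
                        # ... the rest of the chapters
                    ],
                },
                # ... the five other novels
            ],
        },
        # ... the other topics
    ]
}

print(json.dumps(data, indent=2))
```

Walking the structure mirrors the sentence above: pick a topic, pick a book within it, pick a chapter within that, and only then read off a measurement.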
With the chapters parsed and ready to analyze, I needed to actually determine a measurement for "word frequency." I had a sample set of words relating to the topic of "Gender and Family" that I used for testing purposes. I will refer to this set of words as "the corpus."
I wanted to compare the corpus to all the words in the series. I was looking at individual chapters, and at the end of the day, I was looking for a number that said "this represents how well the corpus fit into the chapter as a whole." I found there were several ways to get this number, but then there was the matter of choosing the right one. I considered a few possibilities after doing some research into Natural Language Processing.
The route I chose was to simply measure occurrences of corpus words divided by total words in the chapter. Basically, I'd run through the entire chapter, and if the word I was on was also in the corpus, I would mark it down. Once I went through all the words in a chapter, I took the total of the marks I made and divided it by the number of words in the chapter. I felt the idea, while simple, resonated with my goal of making this vis into a tool, and not into an assessment of the series as a whole.
As with all of my data vis projects, I made this one using d3.js, a tool for creating data visualizations on the web. d3 is usually the best option for me: I wanted this project to be interactive, and I wanted total control over the design and structure of the vis, which d3 gives me.
The first step was to make a prototype, which involved visualizing a dummy data set made up of random numbers (I designed the prototype before extracting/analyzing the data). You can see the first iteration in the first picture of this post. With the prototype in hand, I presented it to the seminar to make sure it was heading in the direction they wanted. They approved of the basic idea, and I kept going, making changes and adding features along the way.
I wanted the user to see word frequencies in the individual topics, but I also wanted them to make comparisons across topics. This way the user could find relationships between various topics—if one would expect that themes of class arise with themes of domesticity, they could check that. The solution was to have two viewing options: one where each topic was separated, and another where there was minimal separation, so one could make comparisons more easily.
With d3, I could include this kind of interactivity. It also gave me control over how I wanted to structure the data, which is not the easiest task, particularly when dealing with a nested data structure. I've provided a quick preview of the HTML DOM.
Lastly, I was aiming for a soft design, as if you were reading from an old book. The yellowed page and subdued blue rectangles offer a simple design that doesn't distract from the content. The vis does resemble a choropleth map, but you'd have to read the intro to understand what it's depicting. I want the users to engage not only with the data, but also with the form of the visualization.
With the backbone of the visualization in place, the rest of the project was a series of minor adjustments. I presented it to the seminar, and from there we refined our datasets, picking out the best corpus to measure each theme. We also workshopped which themes would make it into the visualization. The data collection was the most collaborative part of the project; the students and my advisors all contributed to getting the data set just right.
Eventually, we settled on the design that you see now. The next step was to make this customizable and reusable. If one knows the basics of web design and d3, they could make any changes they wanted to. However, I wanted this to be accessible to a wider audience—I designed this vis with the idea that someone could make their own corpus and see how that topic appears throughout the series.
Of course, one could use this for something other than the Barsetshire Chronicles. In fact, one could make one of these charts without any knowledge of web development. While it would require one to get the novels, and then write a program to analyze them, one wouldn't have to change any of the d3 code, which is a notoriously tedious task.
To test the feasibility of this, my next goal is to visualize how topics appear in The Lord of the Rings trilogy. With my current design, I should only have to get the novels as text files and then create topics and the corresponding corpuses. That's all the work I'll have to do; after that's all done, I can make a webpage just like the one for the Barsetshire Chronicles with little effort.
Stay tuned for more updates!