I am tackling what at times feels like a herculean Data Representation project, although on paper it sounds easy: visualize the connections between folks who use the ITP student listserv. In 2007, Joshua Knowles did just that.
This post is about how to not tackle this project.
For my first attempt at getting my Gmail data, I ran this Python script and downloaded 6,126 emails (as separate text files) sent to the ITP listserv from Sept 5, 2013 to the present. Luckily, Gmail has been archiving these emails for me in a separate folder in my Inbox (plus one for me!). I then attempted to adapt one of Dan O’Sullivan’s Gmail Word Count Processing sketches to loop through each text file, grab the From and Subject lines, and write them to a CSV file. This turns out to be exactly the kind of sketch that crashes Processing. It’s just too much data for Processing and my computer to manage, especially since I do not need the whole of each email (attachments included) for this project.
To fix this issue, Dan O’Sullivan helped me combine his Word Count Processing sketch with a sketch that reads across files, in order to create a directory of files, loop through the directory, and build an Integer Dictionary that keeps track of the number of times someone has emailed the list. It runs a little better, but still crashes Processing once it has finished looping through the files. Even so, I was able to use this sketch to print the From and Subject line from each message to a CSV file, although the table still contains data I don’t need.
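For comparison, here is a rough sketch of the same step in Python: read only the headers of each downloaded email, keep the From and Subject lines, and count how often each address has posted. This is not the Processing sketch described above. It assumes each text file holds one raw email with its headers at the top, and the `.txt` extension and all function names are my own placeholders.

```python
import csv
import io
from collections import Counter
from email.parser import HeaderParser
from email.utils import parseaddr
from pathlib import Path


def iter_raw_messages(directory):
    """Yield the text of each downloaded email (assumes one email per .txt file)."""
    for path in sorted(Path(directory).glob("*.txt")):
        yield path.read_text(encoding="utf-8", errors="replace")


def summarize_headers(raw_messages):
    """Return (rows, counts): one (sender, subject) row per email, plus posts per sender."""
    parser = HeaderParser()  # treats everything after the headers as an unparsed body
    rows = []
    counts = Counter()
    for raw in raw_messages:
        msg = parser.parsestr(raw)
        # parseaddr splits "Ann <ann@example.com>" into a name and an address
        _, address = parseaddr(msg.get("From", ""))
        address = address.lower()
        rows.append((address, msg.get("Subject", "")))
        counts[address] += 1
    return rows, counts


def rows_to_csv(rows):
    """Format the (sender, subject) rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["from", "subject"])
    writer.writerows(rows)
    return buffer.getvalue()
```

Calling `summarize_headers(iter_raw_messages("emails"))` (where `emails` is a hypothetical folder name) would give back the table rows and the per-person counts without ever unpacking an attachment.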
While this is all well and good, the information I really need is about conversation threads. Specifically, how many times do people appear together in the same thread? This would weight the relationship between two people in the visualization. While I could deduce this information from my table by analyzing the subject lines, it wouldn’t be entirely accurate if, for instance, someone changes the subject line in their response. I am also interested in visualizing who initiates the most conversations.
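One way to get at threads without trusting subject lines is to use the standard Message-ID, In-Reply-To, and References headers that mail clients add to replies. The sketch below is only an approximation under that assumption: it treats the first ID in References (or, failing that, In-Reply-To) as the start of the thread, so replies missing both headers would be counted as new conversations. The function names are hypothetical.

```python
from collections import Counter, defaultdict
from email.parser import HeaderParser
from email.utils import parseaddr
from itertools import combinations


def thread_root(msg):
    """Guess the ID of the message that started this thread."""
    references = msg.get("References", "").split()
    if references:
        return references[0]  # References lists the thread oldest-first
    in_reply_to = msg.get("In-Reply-To", "").split()
    if in_reply_to:
        return in_reply_to[0]
    return msg.get("Message-ID", "").strip()  # no reply headers: a new thread


def analyze_threads(raw_messages):
    """Return (pair_weights, initiators) for a collection of raw emails."""
    parser = HeaderParser()
    participants = defaultdict(set)  # thread root -> set of sender addresses
    initiators = Counter()           # sender -> number of threads started
    for raw in raw_messages:
        msg = parser.parsestr(raw)
        sender = parseaddr(msg.get("From", ""))[1].lower()
        root = thread_root(msg)
        if not sender or not root:
            continue  # skip emails we cannot place in a thread
        participants[root].add(sender)
        if root == msg.get("Message-ID", "").strip():
            initiators[sender] += 1
    pair_weights = Counter()
    for people in participants.values():
        # every pair of people in the same thread gets one more shared thread
        for pair in combinations(sorted(people), 2):
            pair_weights[pair] += 1
    return pair_weights, initiators
```

Here `pair_weights` maps an alphabetically sorted pair of addresses to the number of threads both people posted in, which could serve as the edge weight in the visualization, and `initiators` counts how many threads each person started.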
Dan Shiffman suggested I use Python to get and parse my Gmail data, and use Processing just for the visualization. I met with Adam Parrish yesterday and he suggested I find and use a Python library, such as this one, to fetch email data, including information on threads. This takes out a lot of the programming, as the library acts as an interface between programmer and Gmail.
Take-aways:
- Leave the task of data collection and parsing to Python
- Use a Python library as an interface to help get the data, and cut down on the amount of programming
- Use Processing to visualize the data
Next Steps:
- Brush up on my understanding of Python
- Run the Gmail Python library’s sample scripts to figure out how to get my Gmail data
- Write a Python program to get and save my Gmail data to a CSV file
- Bring the Gmail data into Processing to begin the task of visualizing!