The Story of How I Mined Twitter

“Isn’t it funny? You’re a woman in science researching about women in science. How about that?” These were some of the first words that my new supervisor, Mary Flanagan, said to me when she saw me working in the lab. On my first day at tiltfactor as a research assistant, a position set up through Dartmouth’s Women in Science Project, I was asked to build a tool to “mine Twitter” for the NSF REAL project. In the haze of my first day, I did not ask questions, just nodded and agreed. I knew Python, the language I used to build the tool, but I had no idea about Twitter development or research.

After extensive googling, I found some people who had built simple Python Twitter mining tools. With their tutorials, and a lot of referencing of the Twitter developer help pages, I built a tool that allowed me to create a list of search terms and then, essentially, “mine” Twitter.

The NSF REAL Project seeks to “increase students’ awareness and understanding of each other’s life experiences”, and promote a more positive, inclusive climate for all learners in STEM. The focus of the project is to have undergraduate students write and read fictionalized stories about their college life experiences in introductory STEM courses. We are investigating different ways to collect narratives about college life experiences in these classes, and social media platforms, such as Twitter in this case, are one such way to find them.

In order to “mine” for relevant information, I brainstormed a list of keywords and hashtags that I felt related to the topic. My final list included words such as “genderbias”, “womeninSTEM”, “girlsintech”, and “glassceiling”. I found that when I searched without a hashtag I got significantly more results than when I searched with one; so, while “#womeninSTEM” might produce 100-200 results, “womeninSTEM” could return two or three times as many.
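To make the with-and-without-hashtag comparison concrete, here is a minimal sketch of how the counts could be collected. It is not the code from my tool: `count_results`, `fake_search`, and `SAMPLE_POSTS` are made-up names for illustration, and `fake_search` is only a stand-in that does a plain substring match on a few invented posts, which is not how Twitter's real search works.

```python
def count_results(keywords, search):
    """Run each keyword with and without a leading '#' and count the hits.

    `search` is any function that takes a query string and returns a list
    of matching posts. In a real tool it would wrap a Twitter API call.
    """
    counts = {}
    for keyword in keywords:
        for query in (keyword, "#" + keyword):
            counts[query] = len(search(query))
    return counts


# Invented posts, used only so the sketch runs without network access.
SAMPLE_POSTS = [
    "Proud to support #womeninSTEM today",
    "New report from womeninSTEM.org on hiring",
    "The #glassceiling is still there",
]


def fake_search(query):
    """Stand-in search: case-insensitive substring match (an assumption)."""
    return [post for post in SAMPLE_POSTS if query.lower() in post.lower()]


counts = count_results(["womeninSTEM", "glassceiling"], fake_search)
for query, n in counts.items():
    print(query, n)
```

In this toy setup the plain term can never return fewer posts than the hashtag version, because every post containing “#womeninSTEM” also contains “womeninSTEM”.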

In the initial stages of research, and with my first version of the Twitter mining tool, I found a series of websites related to gender in STEM. I learned that because of the micro-blogging nature of Twitter, most people did not tend to post narratives or stories, but rather share or retweet compelling stories or groups that they found online. I was pleasantly surprised to find a few organizations and groups that are actively working to level the playing field in STEM fields, especially tech.

Groups such as Girls Who Code, Stemettes, and Black Girls Code all focus on the next generation: showing young girls that women code too and that they can be part of the tech force in the future. Other initiatives, such as CodeDoc and WonderWomenTech, seek to empower women by sharing their narratives or creating events, such as hackathons, to connect women in tech.

It was truly inspiring to find that there are so many groups out there devoted to helping reduce the gender gap in tech.

Originally my tool just exported the raw data returned by the API into text files, which I would then skim, which is how I found the above websites. However, I wanted a quicker and more efficient way to search through my results. So I added a bit to my code that would search through the text documents, find the most frequent words in all the tweets, and count how many times each was used.
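For anyone curious what that counting step can look like, here is a minimal sketch in Python. It is an illustration rather than my exact code: the function name `most_frequent_words`, the small `STOPWORDS` set, and the choice to strip links and skip very short words are all assumptions, and it works on a list of tweet strings instead of reading the exported text files.

```python
import re
from collections import Counter

# A tiny, hand-picked stopword list (an assumption; a real one would be longer).
STOPWORDS = {"the", "and", "are", "for", "who", "too", "that", "this", "with"}


def most_frequent_words(texts, top_n=10, stopwords=STOPWORDS):
    """Return the top_n most common words across texts as (word, count) pairs."""
    counts = Counter()
    for text in texts:
        # Lowercase and drop links so URL fragments are not counted as words.
        cleaned = re.sub(r"https?://\S+", " ", text.lower())
        # Keep a leading '#' or '@' so hashtags and mentions stay distinct.
        words = re.findall(r"[#@]?\w+", cleaned)
        # Skip stopwords and very short tokens such as "rt" or "in".
        counts.update(w for w in words if w not in stopwords and len(w) > 2)
    return counts.most_common(top_n)


# Invented example tweets, used only to show the function running.
sample_tweets = [
    "Women in STEM are amazing #womeninSTEM",
    "Support women who code! http://example.com",
    "RT women code too",
]

top_words = most_frequent_words(sample_tweets, top_n=2)
print(top_words)
```

The resulting list of words and counts is the kind of input a word cloud maker can work from.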

With this I was able to analyze the data I had found after running Twitter searches with my Python tool, and to visualize it. I decided to make word clouds based on the most frequently used words in all of the searches I ran. Because the tool searched the last 200 posts on Twitter relating to each term, the time span covered varied by search term, from a couple of weeks before the search to a couple of months. I used the online word cloud maker WordItOut to create visual representations of the most common words. Here are a couple of the resulting clouds:



Moving on from here, I think it would be interesting to see what types of research other people have done after mining Twitter. I found some examples of research, such as “Twitter as a Corpus for Sentiment Analysis and Opinion Mining”, in which the authors built a sentiment classifier that is able to determine positive, negative and neutral sentiments for a document, in this case specific Twitter posts. This type of sentiment analysis could be useful in asking questions such as “How are people reacting to organizations advocating for equal representation in STEM fields?” On the other hand, “Measuring Post Traumatic Stress Disorder in Twitter” “presents a novel method to obtain a PTSD classifier for social media using simple searches of available Twitter data” and “demonstrates its utility by examining differences in language use between PTSD and random individuals”. These show how social media platforms like Twitter can be used to study everyday phenomena and problems in people’s lives. They allow us to come up with questions such as “How much of an impact can Twitter have on a student’s interest in STEM?” and “Are organizations advocating for equal representation in STEM fields reaching their intended audience?”

This preliminary research for the REAL project is important because it revealed that there really are many stories to tell, and the stories and groups found will help inform the research studies we’ll conduct in the future. Additionally, many of the organizations that are already working on issues in STEM would likely be interested in our project. Overall, I am very pleased with the results. Looking at the word clouds made me optimistic because people really are talking about these issues and making a conscious effort to improve conditions for students in science.