Fundamentals of automated content analysis for social scientists

Friday, April 12, 2019 12:00 am - 12:00 am EDT (GMT -04:00)

This one-day workshop offers a practical introduction to the fundamentals of automated content analysis and to recent developments in the field. The workshop is designed with social scientists in mind, but participants from other fields (including digital humanities) are also welcome. We assume that participants have little to no prior experience with methods for automated content analysis.

The workshop will begin with a brief introduction to Python, followed by an overview of methods for cleaning and preparing natural language data for analysis. It then covers two related methods for turning text data into matrices that can be used for both supervised and unsupervised machine learning. Following a comparison of these two types of analysis, we will work through an example of using supervised machine learning to scale up traditional approaches to content analysis by training a model to classify documents into known categories using a subset of data that we have hand-coded. Next, we will work through examples of unsupervised machine learning, including a comparison of three different approaches to identifying and interpreting latent topics in a collection of documents. Finally, we will discuss ways of combining supervised and unsupervised methods.
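
To give a flavour of the supervised step described above, here is a minimal sketch of training a classifier on a handful of hand-coded documents and applying it to a new one. It is not the workshop's own code: the technique (a multinomial naive Bayes classifier with add-one smoothing), the whitespace tokenizer, the example sentences, and the category labels are all assumptions chosen for illustration, and the sketch uses only the Python standard library rather than spaCy or sklearn.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately crude tokenizer)."""
    return text.lower().split()


def train_naive_bayes(labelled_docs):
    """Count documents per label and words per label from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labelled_docs:
        label_counts[label] += 1
        for token in tokenize(text):
            word_counts[label][token] += 1
            vocabulary.add(token)
    return label_counts, word_counts, vocabulary


def classify(text, model):
    """Return the label with the highest log-probability for the text."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        # Start from the log prior: the share of hand-coded documents with this label.
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token not in vocabulary:
                continue  # Ignore words never seen in the hand-coded data.
            # Add-one smoothing keeps unseen word/label pairs from zeroing the score.
            smoothed = (word_counts[label][token] + 1) / (total_words + len(vocabulary))
            score += math.log(smoothed)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical hand-coded sample; real projects need far more coded documents.
hand_coded = [
    ("the council voted on the new climate policy", "politics"),
    ("the minister announced an election date", "politics"),
    ("the team won the final match", "sports"),
    ("the striker scored a late goal", "sports"),
]

model = train_naive_bayes(hand_coded)
predicted = classify("voters backed the climate policy", model)
print(predicted)
```

In practice the hand-coded subset would be split into training and test portions so that the classifier's accuracy can be checked before it is applied to the uncoded documents.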

Each session will include a brief lecture and hands-on computing time where you will have the opportunity to analyze your own data. If you do not have your own data ready to go, you are free to use one of our datasets. 

Morning Sessions 

  1. A brief introduction to Python and Jupyter Notebooks 
  2. Processing natural language data with spaCy 
  3. Two methods for quantifying text with sklearn 
  4. A high-level comparison of supervised vs. unsupervised machine learning
  5. Scaling up traditional content analysis: supervised machine learning methods for classifying documents into known categories
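
As a rough illustration of what "quantifying text" in session 3 can mean, the sketch below builds two document-term matrices from a toy corpus: one of raw word counts and one of TF-IDF weights. The listing does not say which two methods the session covers, so this pairing is an assumption. The code uses only the standard library and a textbook TF-IDF formula (count times log of N over document frequency); sklearn's `CountVectorizer` and `TfidfVectorizer` provide full implementations, and their default weighting and normalization differ from this simplified version.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; real pipelines would clean text first."""
    return text.lower().split()


def document_term_matrix(docs):
    """Return (vocabulary, matrix) where matrix[i][j] counts term j in document i."""
    tokenized = [tokenize(doc) for doc in docs]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


def tfidf_matrix(count_matrix):
    """Reweight counts so terms that appear in fewer documents score higher."""
    n_docs = len(count_matrix)
    n_terms = len(count_matrix[0])
    # Document frequency: in how many documents does each term occur?
    doc_freq = [sum(1 for row in count_matrix if row[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / df) for df in doc_freq]
    return [[row[j] * idf[j] for j in range(n_terms)] for row in count_matrix]


# Hypothetical three-document corpus.
documents = ["protest march downtown", "protest vote", "vote count"]

vocabulary, counts = document_term_matrix(documents)
weights = tfidf_matrix(counts)

print(vocabulary)
for row in counts:
    print(row)
```

Each row is one document and each column one vocabulary term, which is the shape both supervised and unsupervised methods expect as input.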

Afternoon Sessions 

  1. Unsupervised machine learning methods for identifying latent topics 
    1. Clustering documents based on the similarity of their content 
    2. Analyzing word co-occurrence networks 
    3. Topic models 
  2. Why, when, and how to combine supervised and unsupervised machine learning
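
Clustering documents by the similarity of their content starts from a measure of how alike two documents are. The sketch below computes cosine similarity between word-count vectors, one common choice; whether the workshop uses this particular measure is an assumption, and the example sentences are invented. It stops at the similarity scores and does not run a clustering algorithm.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the word-count vectors of two texts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    # Only words present in both texts contribute to the dot product.
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a.keys() & counts_b.keys())
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # An empty text is treated as similar to nothing.
    return dot / (norm_a * norm_b)


# Hypothetical documents: two on the same topic, one unrelated.
docs = [
    "carbon tax debate in parliament",
    "parliament debate on carbon pricing",
    "playoff hockey game tonight",
]

# Pairwise similarities; a clustering method would group the high-scoring pairs.
for i in range(len(docs)):
    for j in range(i + 1, len(docs)):
        print(i, j, round(cosine_similarity(docs[i], docs[j]), 2))
```

Scores range from 0 (no shared words) to 1 (identical word proportions), so the two parliamentary documents score well above either pairing with the hockey document.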

Software and Assumed Background


This workshop makes extensive use of the Python programming language, including the packages spaCy and sklearn. Some knowledge of Python is an asset, but it is not required. We will provide all participants with fully executable code for every topic covered in the workshop. Participants are encouraged to modify the code to suit their specific interests; doing so takes only minimal programming knowledge and is entirely optional. If you want to learn a bit of Python before the workshop, we highly recommend selecting something from DataCamp.

Participants will receive detailed instructions on what software to install, and how to install it, a couple of weeks before the start of the workshop.

Register for the Workshop 

To register for the workshop, please complete this short form. Please note that if you register for more than one workshop, you will need to process each registration separately. 

Space is limited, so we encourage you to register as soon as possible. 

Instructor & Workshop Organizer

John McLevey

John McLevey is an Assistant Professor in the Department of Knowledge Integration (Faculty of Environment) at the University of Waterloo. He is the Principal Investigator of a computational social science and social networks research lab called NETLAB, which is funded by grants from the Social Sciences and Humanities Research Council of Canada and an Early Researcher Award from the Ontario Ministry of Research and Innovation. 

John primarily works in the areas of computational social science and social network analysis, with substantive interests in environmental social science, the sociology of science, social movements, and cognitive social science. As a computational social scientist, his most general research goal is to advance our knowledge of how social networks and institutions affect collective cognition and behaviour, including the formation and diffusion of knowledge, beliefs, biases, and behaviours. He is currently involved in a number of research projects in service of that larger goal, including work on the effects of cognitive diversity and homophily in scientific networks, environmental governance conflicts in coastal regions, mobilization into environmental activism, and the diffusion of educational innovations. He is currently writing a book on computational social science for Sage's research methods series. He designed and developed the metaknowledge package with his former student Reid McIlroy-Young.

Partners

This workshop is held in partnership with the Department of Knowledge Integration, the Faculty of Environment at the University of Waterloo, and NETLAB.

Food and Accommodations

Coffee, tea, and snacks will be provided during the workshop. There are a variety of options for lunch and dinner on campus or within a short walk from campus. 

We will follow up with travelling participants about options for local accommodations.