Seminar by Gradon Nicholls

Thursday, November 14, 2024 10:00 am - 11:00 am EST (GMT -05:00)

Student seminar series 

Gradon Nicholls
PhD candidate

Room: M3 3127


Using Large Language Models to Catch Mistakes in Coding of Open-ended Survey Questions

Open-ended questions allow survey respondents to give answers in their own words without being biased by pre-specified response options. Analysis of these data typically depends on assigning a 'label' or 'code' to each textual response. Establishing an initial set of coded texts is necessary whether the goal is to use the codes directly in analysis, or to train and evaluate a classification model to automatically code additional texts. As a starting point, one can employ a coder to manually code each text ('single-coding'). Employing a second, independent coder ('double-coding') can detect potential mistakes made by the first whenever the two coders disagree, assuming we can reliably resolve their disagreements. A less costly approach is to double-code only a fraction of the texts and use a model to predict likely mistakes in the remainder of the data. In this paper, we explore the performance of models based solely on single-coded data in predicting coding mistakes, and use pre-trained Large Language Models (LLMs) to improve automated error-catching performance.
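
As a rough illustration of the error-catching idea described in the abstract (not the speaker's implementation), the Python sketch below treats an LLM as an inexpensive second coder: each response is re-coded independently, and any disagreement with the human coder's label is flagged as a candidate mistake for review. The function llm_assign_code is a hypothetical placeholder for a call to a pre-trained LLM.

from typing import Callable, Dict, List, Tuple


def llm_assign_code(text: str, codebook: Dict[str, str]) -> str:
    """Hypothetical LLM coder: in practice this would prompt a pre-trained
    LLM with the response text and the codebook, then parse its answer."""
    raise NotImplementedError("Replace with a real LLM call.")


def flag_potential_mistakes(
    responses: List[str],
    human_codes: List[str],
    codebook: Dict[str, str],
    coder: Callable[[str, Dict[str, str]], str] = llm_assign_code,
) -> List[Tuple[int, str, str]]:
    """Return (index, human_code, llm_code) for each response where the
    second coder disagrees with the single human coder. Disagreements are
    candidates for manual review, mimicking a cheap form of double-coding."""
    flagged = []
    for i, (text, human_code) in enumerate(zip(responses, human_codes)):
        llm_code = coder(text, codebook)
        if llm_code != human_code:
            flagged.append((i, human_code, llm_code))
    return flagged


if __name__ == "__main__":
    codebook = {"ECON": "economy / jobs", "HEALTH": "health care", "OTHER": "anything else"}
    responses = ["Too few jobs in my town", "Hospital wait times are awful"]
    human_codes = ["ECON", "OTHER"]  # the second label may be a coding mistake

    # Trivial keyword-based stand-in so the example runs without an LLM.
    def keyword_coder(text: str, cb: Dict[str, str]) -> str:
        t = text.lower()
        if "job" in t or "economy" in t:
            return "ECON"
        if "hospital" in t or "health" in t:
            return "HEALTH"
        return "OTHER"

    print(flag_potential_mistakes(responses, human_codes, codebook, coder=keyword_coder))

In this sketch the flagged pairs would be sent back to a human adjudicator; the paper's contribution concerns how well such models, trained or prompted using only single-coded data, can predict where those mistakes actually occur.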