Student seminar series
Gradon Nicholls
PhD candidate
Room: M3 3127
Using Large Language Models to Catch Mistakes in Coding of Open-ended Survey Questions
Open-ended questions allow survey respondents to give answers in their own words without being biased by pre-specified response options. Analysis of these data typically depends on assigning a 'label' or 'code' to each textual response. Establishing an initial set of coded texts is necessary whether the goal is to use the codes directly in analysis or to train and evaluate a classification model that automatically codes additional texts. As a starting point, one can employ a coder to manually code each text ('single-coding'). Employing a second, independent coder ('double-coding') can detect potential mistakes made by the first wherever the two coders disagree, assuming their disagreements can be reliably resolved. A less costly approach is to double-code only a fraction of the texts and use a model to predict likely mistakes in the remainder of the data. In this paper, we explore how well models based solely on single-coded data predict coding mistakes, and we use pre-trained Large Language Models (LLMs) to improve automated error-catching performance.
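
To make the idea concrete, the short sketch below (not drawn from the paper) illustrates one way a partially double-coded dataset could be used to flag likely coding mistakes: disagreements on the double-coded subset are treated as a proxy for mistakes, and a simple classifier then ranks the remaining single-coded texts by predicted mistake probability. The toy data, feature choices, and scikit-learn pipeline are all illustrative assumptions; in the work presented here, pre-trained LLMs would supply far richer representations than the bag-of-words features used in this stand-in.

# Illustrative sketch only: train a disagreement ("likely mistake") classifier
# on a small double-coded subset, then flag suspicious labels in the remaining
# single-coded texts for human review. Data, features, and model are assumed
# for illustration and are not the approach described in the talk.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: every text has a code from coder 1 ("single-coding");
# only the first few also have an independent code from coder 2 ("double-coding").
texts        = ["too expensive", "rude staff", "long wait times",
                "prices keep going up", "friendly service", "slow checkout"]
coder1_codes = ["PRICE", "SERVICE", "WAIT", "SERVICE", "SERVICE", "WAIT"]
coder2_codes = ["PRICE", "SERVICE", "WAIT", "PRICE"]   # double-coded subset only

# On the double-coded subset, treat coder disagreement as a proxy for a mistake.
n_double = len(coder2_codes)
disagree = [int(a != b) for a, b in zip(coder1_codes[:n_double], coder2_codes)]

# Represent each (text, assigned code) pair as one string so the model can
# learn which text/code combinations tend to produce disagreements.
def featurize(text, code):
    return f"{text} [CODE={code}]"

train_X = [featurize(t, c) for t, c in zip(texts[:n_double], coder1_codes[:n_double])]
rest_X  = [featurize(t, c) for t, c in zip(texts[n_double:], coder1_codes[n_double:])]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_X, disagree)

# Rank the single-coded remainder by predicted mistake probability;
# the highest-ranked texts are candidates for a second look.
probs = model.predict_proba(rest_X)[:, 1]
for text, code, p in sorted(zip(texts[n_double:], coder1_codes[n_double:], probs),
                            key=lambda r: -r[2]):
    print(f"{p:.2f}  {code:8s}  {text}")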