Please note: This PhD seminar will take place online.
Farshad Kazemi, PhD candidate
David R. Cheriton School of Computer Science
Supervisor: Professor Shane McIntosh
Automatic Code Reviewers (ACRs) are models trained to automate code review tasks, such as generating review comments. Indeed, prior work shows that state-of-the-art ACRs can generate review comments to initiate discussion threads; however, the capacity of ACRs to react to author responses is unclear. This is especially problematic when ACRs pose interrogative comments, i.e., comments that ask questions of other review participants.
In this paper, we study ACR-generated interrogative code review comments, analyzing their prevalence, their similarity to human-submitted interrogative comments, and the regularity with which they are generated. We empirically study three task-specific ACRs and three ACRs based on Large Language Models (LLMs) on data mined from the Gerrit project. We find that state-of-the-art ACRs: (1) generate interrogative comments at rates of 15.6% to 65.26%; (2) differ from humans in how they generate such comments, which can stifle conversations, particularly in discussions where questions could spark productive dialogue; (3) produce interrogative comments with high irregularity, especially as the number of generated comments increases; and (4) suffer from limitations in their capacity to communicate; for instance, task-specific and GPT-4-based ACRs never, and the LLaMA2-based ACR rarely (2.27%), pose rhetorical questions, which account for 8.74% of human-posed interrogative comments. Unlike task-specific ACRs, LLM-based ACRs can react to author responses. Hence, we further inspect 150 examples of their interrogative comments and reactions to author responses, observing that: (5) the interrogative comments that LLM-based ACRs pose can differ even more substantially from human behaviour than those of task-specific ACRs; and (6) LLM-based ACRs struggle to participate in code review discussions when compared to humans. While our results suggest that neither task-specific nor LLM-based ACRs can replace human reviewers yet, we observe opportunities for synergies. For example, ACRs raise pertinent questions about exception handling of common APIs more frequently than human reviewers do.