MASc Seminar Notice: "FlaKat: A Machine Learning-Based Categorization Framework for Flaky Tests" by Shizhe Lin

Thursday, December 1, 2022, 3:00 pm EST (GMT -05:00)

Name: Shizhe Lin

Date: Dec 1, 2022

Time: 3:00 pm

Location: online

Supervisor: Dr. Ladan Tahvildari

Title: FlaKat: A Machine Learning-Based Categorization Framework for Flaky Tests

Abstract: Flaky tests can pass or fail non-deterministically, without any changes to the software under test. Such tests are frequently encountered by practitioners and undermine the credibility of test suites. Flaky tests have therefore caught the attention of researchers in recent years, and numerous approaches have been published on defining, locating, and categorizing flaky tests, along with auto-repair strategies for specific types of flakiness. Several approaches to automated flaky test detection are viable. The most traditional ones rely on repeated execution of the test suite, accompanied by techniques such as shuffling the execution order or randomly perturbing the environment. State-of-the-art research also incorporates machine learning into flaky test detection and achieves reasonably good accuracy. Moreover, repair strategies for specific flaky test categories have been published and automated as well. However, a research gap remains between flaky test detection and category-specific flakiness repair.
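As a rough illustration of the rerun idea described above (this is not FlaKat's own detector; the `detect_flaky` helper and its parameters are hypothetical), a minimal Python sketch might look like:

```python
import random

def detect_flaky(tests, runs=10):
    """Rerun-based detection sketch: `tests` maps a test name to a
    zero-argument callable returning True on pass. A test whose
    outcome varies across reruns (with shuffled execution order)
    is flagged as potentially flaky."""
    outcomes = {name: set() for name in tests}
    for _ in range(runs):
        order = list(tests)
        random.shuffle(order)            # shuffled execution order
        for name in order:
            outcomes[name].add(tests[name]())
    # A non-deterministic outcome over identical code means flaky.
    return [name for name, seen in outcomes.items() if len(seen) > 1]

# Demo: one stable test and one test that fails intermittently.
suite = {
    "stable_test": lambda: True,
    "flaky_test":  lambda: random.random() < 0.7,
}
print(detect_flaky(suite))   # typically ['flaky_test']
```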

To address the aforementioned gap, this thesis proposes a novel categorization framework, called FlaKat, which uses machine-learning classifiers for fast and accurate prediction of the category of a given flaky test case. FlaKat first parses and converts raw flaky tests into vector embeddings. The dimensionality of the embeddings is then reduced, and the reduced embeddings are used to train machine-learning classifiers. Sampling techniques are applied to address the imbalance between flaky test categories in the dataset.
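A minimal sketch of such a pipeline, assuming scikit-learn-style components (the thesis does not prescribe these particular libraries; TF-IDF, truncated SVD, SMOTE, and a random forest stand in here for whichever embedding, reduction, sampling, and classifier choices FlaKat actually evaluates):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import SMOTE

# Toy corpus: raw flaky-test source snippets with category labels.
sources = [
    "assertEquals(expected, items)   // depends on list ordering",
    "assertTrue(set.iterator().hasNext())  // iteration order varies",
    "Thread.sleep(500); assertTrue(done)   // timing-sensitive wait",
    "double x = Math.random(); assertTrue(x < 0.99)",
    "File tmp = File.createTempFile(\"t\", null)",
]
labels = ["Order-Dependent", "Order-Dependent",
          "Implementation-Dependent", "Implementation-Dependent",
          "Implementation-Dependent"]

# 1. Parse and convert raw tests into vector embeddings.
X = TfidfVectorizer().fit_transform(sources)
# 2. Reduce the dimensionality of the embeddings.
X_red = TruncatedSVD(n_components=2).fit_transform(X)
# 3. Rebalance the categories (k_neighbors kept tiny for toy data).
X_bal, y_bal = SMOTE(k_neighbors=1).fit_resample(X_red, labels)
# 4. Train a classifier on the balanced, reduced embeddings.
clf = RandomForestClassifier(random_state=0).fit(X_bal, y_bal)
print(clf.predict(X_red[:1]))
```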

FlaKat was evaluated across different combinations of configurations using known flaky tests from 108 open-source Java projects. Notably, Implementation-Dependent and Order-Dependent flaky tests, which together represent almost 75% of the dataset, achieved F1 scores of 0.94 and 0.90 respectively, while the overall macro-average F1 score is 0.67.
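For readers unfamiliar with the metric, a macro-averaged F1 score is the unweighted mean of the per-category F1 scores, so strong results on the two dominant categories can coexist with a lower overall average when rarer categories score lower. A small illustration with hypothetical labels:

```python
from sklearn.metrics import f1_score

# Hypothetical predictions over four flakiness categories, purely
# to show the metric: macro averaging weighs every category equally,
# so rare, harder categories pull the mean below the dominant ones.
y_true = ["ID", "ID", "ID", "OD", "OD", "Async", "Time"]
y_pred = ["ID", "ID", "ID", "OD", "OD", "Time",  "Async"]
print(f1_score(y_true, y_pred, average=None,
               labels=["ID", "OD", "Async", "Time"]))  # per category
print(f1_score(y_true, y_pred, average="macro"))       # unweighted mean
```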

This research also proposes a new evaluation metric, called Flakiness Detection Capacity (FDC), which measures classifier accuracy from the perspective of information theory, and provides a proof of its effectiveness. The final FDC results agree with the F1 score on which classifier yields the best flakiness classification.
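The FDC formula itself is defined in the thesis. As a loose illustration of the information-theoretic perspective only, one standard quantity of this kind is the normalized mutual information between true and predicted labels:

```python
from sklearn.metrics import normalized_mutual_info_score

# Hypothetical predictions; this is NOT the FDC formula from the
# thesis, only a standard information-theoretic quantity (how much
# information the predictions carry about the true labels) to
# convey the general idea behind such metrics.
y_true = [0, 0, 0, 1, 1, 2, 3]
y_pred = [0, 0, 0, 1, 1, 3, 2]
print(normalized_mutual_info_score(y_true, y_pred))
```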