Project 4 - Evaluating Content-Defined Chunking Algorithms for Efficient Deduplication Systems | Women in Computer Science

Graduate mentor's supervisor: Prof. Samer Alkiswany

As the amount of digital data continues to grow rapidly, cloud storage and backup systems must manage enormous volumes of information. Industry studies have shown that 50–80% of stored data is redundant. Storing and transferring these duplicate copies consumes storage space, network bandwidth, and infrastructure resources. To reduce storage costs and network traffic, modern storage systems use data deduplication, a technique that identifies repeated content and stores it only once.

A key part of data deduplication is content-defined chunking (CDC), which divides files into smaller pieces ("chunks") at boundaries determined by the data itself rather than at fixed offsets, so that duplicate content can still be detected after files have been modified. Researchers have developed many CDC algorithms, each making different trade-offs between processing speed, storage efficiency, and chunk-size consistency. Some techniques are fast but produce highly variable chunk sizes, which can reduce deduplication efficiency and create practical challenges for real-world systems. As a result, there is still room to improve chunking algorithms and to better understand their trade-offs.
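To make the idea concrete, here is a minimal sketch of content-defined chunking in Python. It uses a Gear-style rolling hash, a well-known approach chosen only for illustration; it is not the algorithm this project evaluates. The names (cdc_chunks, shared_fraction) and the parameters (MIN_SIZE, MAX_SIZE, MASK) are assumptions made for this example.

```python
import hashlib
import random
from typing import List

# Illustrative parameters (assumptions, not values from any specific system).
MIN_SIZE = 64            # never cut a chunk shorter than this
MAX_SIZE = 1024          # always cut once a chunk reaches this size
MASK = 0xFF << 48        # cut when these 8 hash bits are all zero
                         # (about once per 256 positions on random data)
WORD = (1 << 64) - 1     # keep the hash within 64 bits

# One pseudo-random 64-bit value per byte value, seeded for repeatability.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(64) for _ in range(256)]


def cdc_chunks(data: bytes) -> List[bytes]:
    """Split data at positions chosen by a rolling hash of recent bytes."""
    chunks = []
    start = 0
    h = 0
    for i, b in enumerate(data):
        # Shifting left by one bit per byte means bytes older than 64
        # positions no longer influence the 64-bit hash value.
        h = ((h << 1) + GEAR[b]) & WORD
        size = i - start + 1
        at_boundary = size >= MIN_SIZE and (h & MASK) == 0
        if at_boundary or size >= MAX_SIZE:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def shared_fraction(original: bytes, edited: bytes) -> float:
    """Fraction of the original's distinct chunks that reappear in edited."""
    a = {hashlib.sha256(c).digest() for c in cdc_chunks(original)}
    b = {hashlib.sha256(c).digest() for c in cdc_chunks(edited)}
    return len(a & b) / len(a) if a else 0.0
```

Because boundaries depend on nearby content rather than on absolute offsets, inserting a few bytes near the start of a file typically changes only the chunks around the edit, and later chunks line up again. With fixed-size chunking, the same insertion would shift every subsequent chunk.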

In this project, students will evaluate a recently proposed CDC algorithm and compare it against state-of-the-art techniques. Working in teams, students will use open-source implementations and real-world datasets to conduct benchmarking experiments and measure metrics such as chunking throughput, chunk-size distributions, and deduplication efficiency. Team members will focus on complementary tasks, including experiment design, dataset preparation, benchmarking, data analysis, and visualization. The primary goal during the program is to evaluate the algorithm and understand its strengths and limitations. Students who continue beyond the program may also explore integrating the algorithm into an open-source benchmarking framework and investigating further improvements to chunking techniques.
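As a rough illustration of the metrics mentioned above, the sketch below measures chunking throughput, basic chunk-size statistics, and deduplication savings for any chunker passed in as a function. The function names, the fixed-size placeholder chunker, and the exact metric definitions (for example, savings computed as one minus unique bytes over total bytes) are assumptions for this example; the project's actual benchmarking tools may define them differently.

```python
import hashlib
import statistics
import time
from typing import Callable, Dict, Iterable, List


def fixed_size_chunker(data: bytes, size: int = 4096) -> List[bytes]:
    """Placeholder chunker; swap in the CDC implementation under test."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def evaluate(chunker: Callable[[bytes], List[bytes]],
             files: Iterable[bytes]) -> Dict[str, float]:
    """Run a chunker over in-memory files and summarize the results."""
    sizes: List[int] = []
    unique: Dict[bytes, int] = {}   # chunk digest -> chunk length
    total_bytes = 0
    chunk_seconds = 0.0

    for data in files:
        # Time only the chunking step, not the hashing used for dedup.
        t0 = time.perf_counter()
        chunks = chunker(data)
        chunk_seconds += time.perf_counter() - t0

        for c in chunks:
            sizes.append(len(c))
            total_bytes += len(c)
            unique.setdefault(hashlib.sha256(c).digest(), len(c))

    if not sizes:
        return {"chunk_count": 0}

    unique_bytes = sum(unique.values())
    throughput = (total_bytes / chunk_seconds / 1e6
                  if chunk_seconds > 0 else float("inf"))
    return {
        "throughput_MBps": throughput,
        "chunk_count": len(sizes),
        "mean_chunk_size": statistics.mean(sizes),
        "stdev_chunk_size": statistics.pstdev(sizes),
        "min_chunk_size": min(sizes),
        "max_chunk_size": max(sizes),
        # Share of bytes that need not be stored because they are duplicates.
        "dedup_savings": 1 - unique_bytes / total_bytes,
    }
```

For example, evaluate(fixed_size_chunker, [b"A" * 8192 + b"B" * 4096, b"A" * 4096]) produces four 4096-byte chunks, of which only two are distinct, so the reported deduplication savings is 0.5. Real experiments would read files from disk, repeat runs to reduce timing noise, and plot the full chunk-size distribution rather than summary statistics alone.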

The project is suitable for students interested in computer systems, performance evaluation, and experimental research. Familiarity with algorithms and basic systems concepts is helpful, and experience with C++, Go, or Python is beneficial but not required. No prior background in storage systems, data deduplication, or research is expected. Students will gain experience reading technical documentation, working with open-source software, and analyzing experimental results. The project provides an accessible introduction to research and is well suited to students interested in future opportunities such as research assistantships, graduate studies, or careers in systems and infrastructure software.