SAUL: Towards Effective Data Science
Data Systems and AI Lab, MIT
Cambridge, MA 02139, USA
An effective data system should satisfy the SAUL properties: it should be scalable, automatic, and easy to keep humans in the loop. It should automatically address low-level performance bottlenecks so that it scales to big data. It should be tuning-free, or at least easy for users to tune. And it should keep humans in the loop, so that users can easily customize the system to meet their domain-specific requirements. The goal of my research is to build data systems that satisfy the SAUL properties.
My talk will cover two systems we have built: an anomaly discovery system and a labeling system. They address fundamental problems in unsupervised and supervised machine learning, respectively. First, AutoAD, the self-tuning component of our anomaly discovery system, aims to free data scientists from manually determining which of the many unsupervised anomaly detection techniques is best suited to a given task. This is particularly challenging in the unsupervised setting, where no labels are available for cross-validation. AutoAD solves this problem with a fundamentally new strategy that unifies the merits of unsupervised anomaly detection and supervised classification. Second, our LANCET approach tackles the labeling problem, a key bottleneck that limits the success of cutting-edge machine learning techniques in enterprise deployments. These techniques often require millions of labeled data objects to train a robust model. Because relying on humans to supply so many labels is rarely practical, automated methods for label generation are needed. Built on a solid theoretical foundation, LANCET addresses the core challenges in auto-labeling: (1) which objects to ask humans to label, (2) how to automatically propagate labels to other objects, and (3) when to stop labeling.
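As a rough illustration of how the three auto-labeling questions fit together in a single loop, here is a minimal Python sketch. It is not LANCET's algorithm, which the abstract does not spell out. The farthest-point query rule, the nearest-neighbor propagation, and the `patience` stopping rule are simple stand-ins chosen for illustration, and every function name and parameter is hypothetical.

```python
import math

def nearest_labeled(point, labeled):
    """Return (distance, label) of the closest human-labeled point."""
    return min((math.dist(point, p), y) for p, y in labeled.items())

def propagate(points, labeled):
    """(2) Propagation stand-in: copy the label of the nearest labeled point."""
    return {p: nearest_labeled(p, labeled)[1] for p in points}

def pick_query(points, labeled):
    """(1) Selection stand-in: ask about the point farthest from any label."""
    unlabeled = [p for p in points if p not in labeled]
    return max(unlabeled, key=lambda p: nearest_labeled(p, labeled)[0])

def auto_label(points, oracle, seed, patience=2):
    """Query the oracle (a human, in practice) until propagated labels settle.

    (3) Stopping stand-in: stop once `patience` consecutive queries leave
    the propagated labels unchanged, or when every point has been queried.
    """
    labeled = {seed: oracle(seed)}
    guess = propagate(points, labeled)
    stable = 0
    while stable < patience and len(labeled) < len(points):
        query = pick_query(points, labeled)
        labeled[query] = oracle(query)
        new_guess = propagate(points, labeled)
        stable = stable + 1 if new_guess == guess else 0
        guess = new_guess
    return guess, len(labeled)

# Toy data (assumed for illustration): two well-separated clusters.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
oracle = lambda p: "a" if p[0] < 5 else "b"   # stands in for the human labeler
labels, n_queried = auto_label(points, oracle, seed=(0, 0))
```

On this toy input the loop labels every point while querying the oracle for only some of them, which is the kind of saving auto-labeling aims for. A real system would replace each stand-in with a principled choice.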
Dr. Lei Cao is a Postdoctoral Associate at MIT CSAIL, working with Prof. Samuel Madden and Prof. Michael Stonebraker in the Data Systems group. Before that, he worked at the IBM T.J. Watson Research Center as a Research Staff Member in the AI, Blockchain, and Quantum Solutions group. His recent research focuses on developing end-to-end tools that help data scientists make sense of data effectively.