Some of our current projects
Archives Unleashed aims to make petabytes of historical Internet content accessible to scholars and others interested in researching the recent past. We are developing web archive search and data analysis tools to enable scholars, librarians and archivists to access, share, and investigate recent history since the early days of the World Wide Web.
Bringing Order to Data develops tools and algorithms for mining rules and dependencies from large datasets, with applications to data profiling, knowledge discovery and query optimization. We are particularly interested in mining streaming and ordered data, e.g., discovering order dependencies among columns.
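To make the idea of an order dependency concrete, here is a minimal sketch of a check for whether one column orders another, meaning that sorting the rows by the first column also leaves the second column sorted. The function name `holds_order_dependency`, the toy table, and the brute-force sort-and-scan approach are assumptions made for illustration; they are not the project's algorithms, which target discovery over large datasets.

```python
from typing import Any, Mapping, Sequence


def holds_order_dependency(
    rows: Sequence[Mapping[str, Any]], lhs: str, rhs: str
) -> bool:
    """Return True if column `lhs` orders column `rhs` in `rows`.

    Assumed definition: for all rows s and t,
    s[lhs] <= t[lhs] implies s[rhs] <= t[rhs].
    """
    # Sort by (lhs, rhs) so that any violation shows up between neighbours.
    pairs = sorted((row[lhs], row[rhs]) for row in rows)
    for (a1, b1), (a2, b2) in zip(pairs, pairs[1:]):
        # Rows that tie on lhs must also tie on rhs.
        if a1 == a2 and b1 != b2:
            return False
        # rhs must never decrease as lhs increases.
        if b1 > b2:
            return False
    return True


# Hypothetical toy table: month orders quarter, but not the reverse.
rows = [
    {"month": 1, "quarter": 1, "sales": 40},
    {"month": 4, "quarter": 2, "sales": 25},
    {"month": 5, "quarter": 2, "sales": 60},
    {"month": 9, "quarter": 3, "sales": 10},
]

month_orders_quarter = holds_order_dependency(rows, "month", "quarter")
quarter_orders_month = holds_order_dependency(rows, "quarter", "month")
month_orders_sales = holds_order_dependency(rows, "month", "sales")
```

A query optimizer could use a dependency like month ordering quarter to skip a redundant sort: data already sorted by month is also sorted by quarter.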
Data Science for Social Good applies machine learning, graph mining and text mining techniques to problems of social benefit. Examples include smart meter data mining to save energy, social network mining to characterize public health, and educational data mining to promote gender equity in science and engineering.
Data Stream Management • Social media streams and sensor measurements are generated continuously over time. In addition to data volume, we must therefore deal with data velocity, which motivates new techniques for real-time processing of streaming data. We are investigating new scheduling algorithms for data-intensive streaming tasks and new read- and write-optimized data structures for storing real-time and historical data.
Graphflow is an in-memory graph database that we are building from scratch to evaluate both one-time and continuous queries. We study the fundamental components of graph databases, such as storage, query optimization, query processing, and triggers.
HiCAL • High-Recall Retrieval with Continuous Active Learning™ is an open-source project that facilitates the efficient identification of all or nearly all relevant documents in a corpus. Hi-CAL™ allows users to judge documents as fast as possible with no perceptible interface lag.
HoloClean is a statistical inference engine to impute, clean, and enrich data. As a weakly supervised machine learning system, HoloClean leverages available quality rules, value correlations, reference data, and other signals to build a probabilistic model that captures the data generation process, and uses the model in a variety of data curation tasks.
S-graffito • Streaming Graph Processing at Scale: Real-world graphs, such as the product networks of online retailers or the communication networks of ISPs, are highly dynamic: their edges arrive as a stream at a high rate. As the name suggests, S-graffito is a Streaming Graph Management System that aims to employ a variety of creative solutions to manage these Internet-scale streaming graphs.
Sirius: A High-Velocity Blockchain • Distributed ledgers such as blockchains are used to store transactions in a secure and verifiable manner without the need for a trusted third party. In the Sirius project, we are working on technologies to make blockchains more scalable, and we are investigating novel applications of high-velocity blockchains such as transactive energy and clean transportation.