Some of our current projects

Archives Unleashed aims to make petabytes of historical Internet content accessible to scholars and others interested in researching the recent past. We are developing web archive search and data analysis tools to enable scholars, librarians, and archivists to access, share, and investigate recent history since the early days of the World Wide Web.

Bringing Order to Data develops tools and algorithms for mining rules and dependencies from large datasets, with applications to data profiling, knowledge discovery, and query optimization. We are particularly interested in mining streaming and ordered data, e.g., discovering order dependencies among columns.
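As an illustration of what an order dependency between two columns means, here is a minimal sketch in Python. It only checks whether a single candidate dependency holds on a small in-memory table; it is not the project's discovery algorithm. The function name `holds_order_dependency` and the sample columns `salary` and `bracket` are hypothetical, chosen for this example.

```python
from typing import Any, Mapping, Sequence


def holds_order_dependency(rows: Sequence[Mapping[str, Any]], x: str, y: str) -> bool:
    """Return True if ordering the rows by column x also orders them by column y.

    Assumed definition: for all rows s and t, s[x] <= t[x] implies s[y] <= t[y].
    """
    # Sort by x, breaking ties by y, then inspect neighbouring rows only;
    # transitivity of <= extends the result to every pair of rows.
    ordered = sorted(rows, key=lambda r: (r[x], r[y]))
    for prev, curr in zip(ordered, ordered[1:]):
        if prev[x] == curr[x] and prev[y] != curr[y]:
            return False  # rows that tie on x must also tie on y
        if prev[y] > curr[y]:
            return False  # y decreased while x increased
    return True


# Hypothetical sample data: a higher salary never falls in a lower bracket.
rows = [
    {"salary": 30000, "bracket": 1},
    {"salary": 50000, "bracket": 2},
    {"salary": 60000, "bracket": 2},
    {"salary": 90000, "bracket": 3},
]

salary_orders_bracket = holds_order_dependency(rows, "salary", "bracket")
bracket_orders_salary = holds_order_dependency(rows, "bracket", "salary")
```

On this sample, `salary` orders `bracket`, but not the reverse, because two rows share a bracket while differing in salary.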

Data Science for Social Good applies machine learning, graph mining, and text mining techniques for social good. Examples include smart meter data mining to save energy, social network mining to characterize public health, and educational data mining to promote gender equity in science and engineering.

Data Stream Management • Social media streams and sensor measurements are generated continuously over time. In addition to data volume, we must therefore deal with data velocity, which motivates new techniques for real-time processing of streaming data. We are investigating new scheduling algorithms for data-intensive streaming tasks and new read- and write-optimized data structures for storing real-time and historical data.

Graphflow is an in-memory graph database that we are building from scratch to evaluate both one-time and continuous queries. We study the fundamental components of graph databases, such as storage, query optimization, query processing, and triggers, and build each of them ourselves.

HiCAL • High-Recall Retrieval with Continuous Active Learning™ is an open-source project that facilitates the efficient identification of all or nearly all relevant documents in a corpus. Hi-CAL™ allows users to judge documents as fast as possible with no perceptible interface lag.

HoloClean is a statistical inference engine to impute, clean, and enrich data. As a weakly supervised machine learning system, HoloClean leverages available quality rules, value correlations, reference data, and other signals to build a probabilistic model that captures the data generation process, and uses the model in a variety of data curation tasks.

S-graffito is a Streaming Graph Management System that addresses the processing of OLTP and OLAP queries on very large graphs with high streaming rates. These graphs are increasingly being deployed to capture relationships between entities (e.g., customers and catalog items in an online retail environment), both for transactional processing and for analytics (e.g., recommendation systems).

Sirius: A High-Velocity Blockchain • Distributed ledgers such as blockchains are used to store transactions in a secure and verifiable manner without the need for a trusted third party. In the Sirius project, we are working on technologies to make blockchains more scalable, and we are investigating novel applications of high-velocity blockchains such as transactive energy and clean transportation.