Some of our current projects
Archives Unleashed aims to make petabytes of historical Internet content accessible to scholars and others interested in researching the recent past. We are developing web archive search and data analysis tools to enable scholars, librarians and archivists to access, share, and investigate recent history since the early days of the World Wide Web.
Bringing Order to Data develops tools and algorithms for mining rules and dependencies from large datasets, with applications to data profiling, knowledge discovery and query optimization. We are particularly interested in mining streaming and ordered data, e.g., discovering order dependencies among columns.
Data Science for Social Good applies machine learning, graph mining and text mining techniques for social good. Examples include smart meter data mining to save energy, social network mining to characterize public health, and educational data mining to promote gender equity in science and engineering.
Data Stream Management • Social media streams and sensor measurements are generated continuously over time. In addition to data volume, we must therefore deal with data velocity, which motivates new techniques for real-time processing of streaming data. We are investigating new scheduling algorithms for data-intensive streaming tasks and new read- and write-optimized data structures for storing real-time and historical data.
Graphflow is an in-memory graph database we are building from scratch for evaluating both one-time and continuous queries. We study the fundamental components of graph databases, such as storage, query optimization, query processing, and triggers, and build each of them ourselves.
gStore is an RDF graph database system that employs a native graph representation. gStore combines a subgraph matching-based query strategy with a series of query optimization techniques and a structure-aware index to build an efficient graph-native SPARQL query engine. It supports SPARQL 1.1, the standard RDF query language. It can be deployed on a single machine or in a scale-out setting.
HiCAL • High-Recall Retrieval with Continuous Active Learning™ is an open-source project that facilitates the efficient identification of all or nearly all relevant documents in a corpus. Hi-CAL™ allows users to judge documents as fast as possible with no perceptible interface lag.
HoloClean is a statistical inference engine to impute, clean, and enrich data. As a weakly supervised machine learning system, HoloClean leverages available quality rules, value correlations, reference data, and other signals to build a probabilistic model that captures the data generation process, and uses the model in a variety of data curation tasks.
Kùzu is an in-process property graph database management system (GDBMS) built for graph data science workloads. Kùzu is optimized for query speed and scalability, and aims to perform well on complex, join-heavy analytical workloads over very large graph databases. We are building Kùzu as a feature-rich, usable GDBMS under a permissive license. In our research, we design and implement each component of the system.
S-graffito is a Streaming Graph Management System that addresses the processing of OLTP and OLAP queries on very large graphs with high streaming rates. These graphs are increasingly being deployed to capture relationships between entities (e.g., customers and catalog items in an online retail environment) both for transactional processing and for analytics (e.g., recommendation systems).
Sirius: A High-Velocity Blockchain • Distributed ledgers such as blockchains are used to store transactions in a secure and verifiable manner without the need for a trusted third party. In the Sirius project, we are working on technologies to make blockchains more scalable, and we are investigating novel applications of high-velocity blockchains such as transactive energy and clean transportation.
WatDiv is a benchmark designed to measure how an RDF data management system performs across a wide spectrum of SPARQL queries with varying structural characteristics and selectivity classes. It is a micro benchmark to stress test the performance of systems across a wide variety of queries over varying sizes of data sets.