Welcome to the Data Systems Group

The Data Systems Group builds innovative, high-impact platforms, systems, and applications for processing, managing, analyzing, and searching the vast collections of data that are integral to modern information societies — colloquially known as "big data" technologies.

Our capabilities span the full spectrum from unstructured text collections to relational data, and everything in between, including semi-structured sources such as time series, log data, graphs, and other data types. We work at multiple layers in the software stack, ranging from storage management and execution platforms to user-facing applications and studies of user behaviour.

Our research tackles all phases of the information lifecycle, from ingest and cleaning to inference and decision support.

  1. Oct. 22, 2018: Virginia first state to use technology-assisted review to classify publicly released Kaine Administration emails

    Technology-assisted review (TAR) — an automated process used to select and prioritize documents for review, pioneered by Research Professor Maura Grossman and Professor Gordon Cormack — was used for the first time by a state archive to classify emails from the administration of former Virginia Governor Tim Kaine for release to the public.

  2. Sep. 12, 2018: Dallas Fraser, Andrew Kane and Frank Tompa win best paper award at DocEng 2018

    Recent computer science PhD graduate and postdoctoral fellow Andrew Kane, MMath graduate Dallas Fraser, and Distinguished Professor Emeritus Frank Tompa have received the best paper award at DocEng 2018, the 18th ACM Symposium on Document Engineering.

  3. Sep. 10, 2018: Data systems researchers reveal real-world challenges in graph processing

    One often-heard complaint is that academics labour away in their ivory towers, divorced from happenings in the real world. A few years ago, Professor Semih Salihoglu of the Data Systems Group at the University of Waterloo's Cheriton School of Computer Science noticed exactly this for graph processing.

Read all news
  1. Nov. 14, 2018: PhD Seminar • Evaluating Subgraph Queries With a Mix of Tradition and Modernity

    Amine Mhedhbi, PhD candidate
    David R. Cheriton School of Computer Science

    We study the problem of optimizing subgraph queries (SQs) using the new worst-case optimal (WCO) join plans in Selinger-style cost-based optimizers. WCO plans evaluate SQs by matching one query vertex at a time using multiway intersections. The core problem in optimizing WCO plans is to pick an ordering of the query vertices to match. 

  2. Nov. 19, 2018: DSG Seminar Series • Hierarchical Dense Subgraph Discovery: Models, Algorithms, Applications

    A. Erdem Sarıyüce, University at Buffalo

    Abstract: Finding dense substructures in a network is a fundamental graph mining operation, with applications in bioinformatics, social networks, and visualization, to name a few. Yet most standard formulations of this problem (like clique, quasi-clique, densest at-least-k subgraph) are NP-hard. Furthermore, the goal is rarely to find the “true optimum” but to identify many (if not all) dense substructures, understand their distribution in the graph, and ideally determine relationships among them. In this talk, I will present a framework that we designed to find dense regions of the graph with hierarchical relations.

  3. Nov. 21, 2018: PhD Seminar • Distributed Dependency Discovery

    Hemant Saxena, PhD candidate
    David R. Cheriton School of Computer Science

    We address the problem of discovering dependencies from distributed big data. Existing (non-distributed) algorithms focus on minimizing computation by pruning the search space of possible dependencies. However, distributed algorithms must also optimize data communication costs, especially in current shared-nothing settings. To do this, we define a set of primitives for dependency discovery, which correspond to data processing steps separated by communication barriers, and we present efficient implementations that optimize both computation and communication costs. Using real data, we show that algorithms built using our primitives are significantly faster and more communication-efficient than straightforward distributed implementations.

All upcoming events