Welcome to the Data Systems Group

The Data Systems Group builds innovative, high-impact platforms, systems, and applications for processing, managing, analyzing, and searching the vast collections of data that are integral to modern information societies. These are colloquially known as "big data" technologies.

Our capabilities span the full spectrum from unstructured text collections to relational data, and everything in between, including semi-structured sources such as time series, log data, graphs, and other data types. We work at multiple layers in the software stack, ranging from storage management and execution platforms to user-facing applications and studies of user behaviour.

Our research tackles all phases of the information lifecycle, from ingest and cleaning to inference and decision support.

  1. Oct. 22, 2018 • Virginia first state to use technology-assisted review to classify publicly released Kaine Administration emails

    Technology-assisted review (TAR) — an automated process used to select and prioritize documents for review, pioneered by Research Professor Maura Grossman and Professor Gordon Cormack — was used for the first time by a state archive to classify emails from the administration of former Virginia Governor Tim Kaine for release to the public.

  2. Sep. 12, 2018 • Dallas Fraser, Andrew Kane and Frank Tompa win best paper award at DocEng 2018

    Recent computer science PhD graduate and postdoctoral fellow Andrew Kane, MMath graduate Dallas Fraser, and Distinguished Professor Emeritus Frank Tompa have received the best paper award at DocEng 2018, the 18th ACM Symposium on Document Engineering.

  3. Sep. 10, 2018 • Data systems researchers reveal real-world challenges in graph processing

    Photo: Sihem Amer-Yahia, Semih Salihoglu, Siddhartha Sahu, Amine Mhedhbi, Ozsu, Jimmy Lin

    One often-heard complaint is that academics labour away in their ivory towers, divorced from happenings in the real world. A few years ago, Professor Semih Salihoglu of the Data Systems Group at the University of Waterloo's Cheriton School of Computer Science noticed exactly this for graph processing.

Read all news
  1. Dec. 12, 2018 • PhD Seminar • GAL: Graph-Aware Layout for Disk-Resident Graph Databases

    Zeynep Korkmaz, PhD candidate
    David R. Cheriton School of Computer Science

    Analysis of graphs has a powerful impact on solving many social and scientific problems, and applications often perform expensive traversals on large-scale graphs. Caching approaches on top of persistent storage are among the classical solutions for handling high request throughput. However, graph processing applications have poor access locality, and caching algorithms do not improve disk I/O sufficiently. We present GAL, a graph-aware layout for disk-resident graph databases that generates a storage layout for large-scale graphs on disk with the objective of increasing locality of disk blocks and reducing the number of I/O operations for transactional workloads.

  2. Dec. 13, 2018 • PhD Seminar • Dynamic Sampling used in TREC Core 2018

    Haotian Zhang, PhD candidate
    David R. Cheriton School of Computer Science

    Dynamic sampling (DS) was applied to create a sampled set of relevance judgments for our participation in the TREC Common Core Track 2018. One goal was to test the effectiveness and efficiency of this technique with a set of non-expert, secondary relevance assessors. We consider NIST assessors to be the experts and the primary assessors. Another goal was to make available to other researchers a sampled set of relevance judgments (prels) and thus allow the estimation of retrieval metrics that have the potential to be more robust than the standard NIST-provided relevance judgments (qrels). In addition to creating the prels, we also submitted several runs based on our manual judging and the models produced by our HiCAL system.

  3. Jan. 14, 2019 • DSG Seminar Series • Adaptive Scalable Analytics in Multi-Engine Environments

    Speaker: Verena Kantere, University of Ottawa

    Abstract: Big Data analytics in science and industry are performed on a range of heterogeneous data stores, both traditional and modern, and on a diversity of query engines. Workflows are difficult to design and implement since they span a variety of systems. To reduce development time and processing costs, some automation is needed. In this talk we will present a new platform to manage analytics workflows.

All upcoming events