Data Science Lab

The Data Science Lab is located in the Department of Management Sciences and is directed by Lukasz Golab.

The long-term research agenda of the Data Science Lab is to develop new algorithms and software tools for data management and mining, and to use them for social good.  See Lukasz’s website for a list of publications.  See below for some recent projects.

Data science for social good:

Data analytics for a sustainable future

As concerns over climate change arise, we must seek new ways to ensure a sustainable future, in which resources such as water and energy are used responsibly and greenhouse gas emissions are reduced.  The following projects explore various ways to leverage big data for sustainability.

Smart meter data analytics

Smart meters are being deployed worldwide.  The operational advantage of smart meters is their ability to automatically collect fine-grained consumption data, enabling accurate billing without needing to send workers for manual reads. Additionally, smart meters allow utility providers to charge different prices at different hours of the day, and smart meter data can be mined to provide actionable insight for consumers, utility providers and policymakers.

Is time of use pricing effective?

Under time of use pricing, consumers pay different electricity prices throughout the day. For example, in Ontario, on summer weekdays, one kilowatt-hour of electricity costs 7.7 cents from 7 pm to 7 am, and 15.7 cents from noon to 5 pm. The price difference is meant to give residential customers an incentive to use less electricity during peak-demand times.  In this project, we wanted to find out whether time of use pricing was able to reduce peak demand.  This might sound simple: just compare electricity consumption data before and after the introduction of time of use pricing. However, weather changes from year to year, affecting how much people use fans, air conditioners, furnaces and space heaters.  Thus, we need a methodology to account for these confounding variables. We developed such a methodology and applied it to a large smart meter dataset from southern Ontario.  We found that time-of-use pricing reduced summer on-peak demand, but only by about 2.5 percent.

For more information:

R. Miller, L. Golab, C. Rosenberg, Modelling Weather Effects for Impact Analysis of Residential Time-of-Use Electricity PricingEnergy Policy 105 (2017) 534-546

Smart meter analytics systems

In this project, we developed SMAS, an open-source system for mining smart meter data.  We implemented SMAS using a PostgreSQL database and the MADLib machine learning toolkit.  SMAS can extract daily usage profiles and characterize the effect of temperature on consumption.  We also designed an open-source benchmark for smart meter analytics which includes implementations of common tasks in PostgreSQL, Matlab, MapReduce and Spark.

For more information:

X. Liu, L. Golab, W. Golab, I. Ilyas, S. Jin, Smart Meter Data Analytics: Systems, Algorithms and BenchmarkingTODS 42(1): 2:1-2:39, 2017 

X. Liu, L. Golab, I. Ilyas, SMAS: A Smart Meter Data Analytics SystemICDE 2015, 1476-1479

Data analytics for sustainable transportation

The Waterloo WeBike project 

Motivated by the environmental, public health, ecological and carbon-footprint issues associated with gasoline-powered automobiles, researchers, governments, and society as a whole have been engaged in a search for viable alternatives. Electric bicycles (e-bikes), which are propelled by a combination of pedaling and battery-powered electric motors, are a promising alternative to automobile transportation. Their primary advantages include lower purchase and operating costs compared to cars, ability to travel longer distances and with less physical effort compared to traditional bicycles, and zero emissions during operation.

To explore the potential of e-bikes in North America, we have conducted a three-year field trial, whose purpose was to collect e-bike usage data. In this field trial, called WeBike, a fleet of approximately 30 sensor-equipped e-bikes were distributed to Waterloo faculty, staff and students for their own use. We have collected approximately 200 Gigabytes of GPS, acceleration, and battery charge and discharge data.  We investigated e-bike usage and battery charging patterns, and we developed an algorithm to predict the remaining battery range of an e-bike based on route characteristics and cyclist’s profile.

For more information:

C. Gorenflo, I. Rios, L. Golab, S. Keshav, Usage Patterns of Electric Bicycles: An Analysis of the WeBike ProjectJournal of Advanced Transportation, 2017, Article ID 3739505

C. Gorenflo, L. Golab, S. Keshav, Managing Sensor Data Streams: Lessons Learned from the WeBike ProjectSSDBM 2017, 1:1-1:11

S. Fink, L. Golab, S. Keshav, H. de Meer, How Similar is the Usage of Electric Cars and Electric Bicycles?EV-Sys 2017, 334-340

I. Rios, L. Golab, S. Keshav, Analyzing the Usage Patterns of Electric BicyclesEV-Sys 2016, 2

L. Gebhard, L. Golab, S. Keshav, H. de Meer, Range prediction for electric bicyclese-Energy 2016, 224-234

Bikeshare Pool sizing

In this project, we studied multimodal transit systems in which commuters use public transit and shared bicycles.  Using queuing analysis, we calculated the optimal number of shared bicycles to place at each public transit station.

For more information:

G. Tang, S. Keshav, L. Golab, K. Wu, Bikeshare Pool Sizing for Bike-And-Ride Multimodal TransitTrans. on Intelligent Transportration Systems, 19(7): 2279-2289 

Mining EV opinions

Electric Vehicles (EVs) are envisioned to play a large role in the transition from fossil fuel to renewables based transportation. However, in most of the world, their sales thus far are nominal compared to traditional car sales. It has been difficult for manufacturers to measure owners’ initial perceptions in order to build improved vehicles more drivers are likely to adopt. Sentiments towards EVs have mostly been determined using either field trials or large surveys of drivers, both of which are problematic. In this project, we built a system that mines EV owners’ sentiments from online forums.  We found that negative sentiments were expressed about price, range anxiety and battery degradation (the effect of repeated charging and climate on battery capacity over time).  On the other hand, sentiments towards maintenance were overwhelmingly positive (the absence of an engine means fewer moving parts that can fail and fewer fluids to change).

For more information:

T. Carpenter, L. Golab, S. J. Syed, Is the grass greener? Mining electric vehicle opinionse-Energy 2014, 241-252

Educational data mining

Gender equity in science and engineering

Students applying to engineering programs at the University of Waterloo must answer an essay question: Why do you want to be an engineer?  We found that young men and women answer differently. After analyzing more than 30,000 undergraduate applications, we found that males often described how their technical skills and experience matched the profession, whereas females wanted a career that enables them to impact and improve society.  

For more information:

S. Chopra, H. Gautreau, A. Khan, M. Mirsafian, L. Golab, Gender Differences in Undergraduate Engineering Applicants: A Text Mining Approach, EDM 2018, 44-54

Data mining to improve co-operative education

In this project, we are analyzing data about undergraduate co-operative internships to understand the job market, the competition among students and employers, and satisfaction with the process.  We obtained a data set that includes job postings, interview information (which student interviewed with which employer), placement information (which student was hired by which employer) and end-of-work term evaluations.  We developed text and social network mining methodologies to answer questions such as the following:

  • Do senior students obtain more advanced positions and are they happier with their workterms than junior students?
  • Which academic programs, groups of employers and groups of students compete with each other?
  • Which skills are in-demand (i.e., mentioned frequently in job postings)?
  • What is the effect of entrepreneurship on co-op education and co-op job creation? 

For more information:

S. Chopra, L. Golab, Job Description Mining to Understand Work-Integrated LearningEDM 2018, 32-43

S. Chopra, Y. Jiang, A. Toulis, L. Golab, Data Analytics to Improve Co-Operative Education, 2018 EDBT workshop on Data Analytics Solutions for Real-Life Applications, 16-21

A. Andrade, S. Chopra, B. Nurlybayev, L. Golab, Quantifying the Impact of Entrepreneurship on Cooperative Job CreationInt. Journal of Work Integrated Learning, 19(1): 51-68

A. Toulis, L. Golab, Graph Mining to Characterize Competition for EmploymentNetwork Data Analytics workshop at SIGMOD 2017, 3:1-3:7

Y. Jiang, L. Golab, On Competition for Undergraduate Co-op Placements: A Graph Mining ApproachEDM 2016, 394-399

Y. Jiang, S. Lee, L. Golab, Analyzing student and employer satisfaction with cooperative education through multiple data sourcesAsia-Pacific Journal of Cooperative Education, 16(4):225-240, 2015

Data mining to understand public health

Tools and algorithms for big data:

Real time data stream processing

Interesting data sets such as social media streams and sensor measurements are continuously generated over time.  Thus, in addition to big data volume, we must deal with data velocity, which motivates new techniques for real-time processing of streaming data.  A fundamental technical challenge in data stream analytics is how to “keep up” with the data, i.e., process new data as quickly as possible.  Our recent projects on data stream processing include:

Incremental view maintenance for streaming data

Queries over data streams are continuous: their results change over time as new streaming data arrive.  Clearly, re-computing an answer from scratch whenever new data arrive is intractable.  Fortunately, may queries can be computed incrementally.  For a simple example, to maintain a running sum over a data stream, it suffices to add the value of a new data record to the current sum.  We developed a system running on top of the PostgreSQL database in which users can express declaratively, using the SQL query language, how to incrementally maintain queries.  For selected types of queries such as pattern matching, our system can automatically generate incremental query maintenance expressions.

Scheduling data-intensive tasks

Another interesting property of data stream processing systems is that many queries, issued by different users or applications, are continuously running at the same time.  Given such a workload, we investigate the following problem: when a batch of new data arrives, what is an optimal ordering of queries (tasks) that minimizes cache misses?  Furthermore, since we may not know the exact amount of cache space that is available to the stream processing system and we may not know the cache replacement algorithm, we want to generate a task ordering in a cache-oblivious way.  The insight behind our solution is simple: when a data item is accessed for the first time by some task or query and is placed in the cache, all other tasks that require this data item should be scheduled as soon as possible thereafter.

Another scheduling problem which we investigated is how to schedule mappers and reducers in a MapReduce-style system to minimize data transfer across the nodes in the cluster.  This is an important problem because data transfer (i.e., data shuffling) is a bottleneck in many data-intensive distributed algorithms.  We developed an optimal polynomial-time algorithm that decides which keys will be processed by which reducer (i.e., which node in the cluster) to leverage data locality and minimize data transfer.

Read and write optimized data structures for real-time and historical data

This is a new project in which we want to examine the tradeoff between read and write optimized data structures for real-time and historical data.  Write-optimized data structures can support high data rates but queries may be slow.  On the other hand, read-optimized data structures can speed up query runtimes but may not be able to handle high insertion rates.  We want to design data structures for data stream processing systems which allow both high data rates and fast queries.

For more information:

X. Meng, L. Golab, Optimal Reducer Placement to Minimize Data Transfer in MapReduce-Style ProcessingIEEE BigData 2017, 339-346

Y. Yang, L. Golab, M. T. Ozsu, ViewDF: Declarative Incremental View Maintenance for Streaming DataInformation Systems, 71 (2017) 55-67

A. Baer, P. Casas, A. D’Alconzo, P. Fiadino, L. Golab, M. Mellia, E. Schikuta, DBStream: A Holistic Approach to Large-Scale Network Traffic Monitoring and AnalysisComputer Networks 107 (2016) 5-19

A. Baer, L. Golab, S. Ruehrup, M. Schiavone, P. Casas, Cache-Oblivious Scheduling of Shared WorkloadsICDE 2015, 855-866

L. Golab, T. Johnson, S. Sen, J. Yates, A Sequence-Oriented Stream Warehouse Paradigm for Network Monitoring Applications, PAM 2012, 53-63

L. Golab, T. Johnson, Consistency in a Stream WarehouseCIDR 2011, 114-122

Mining patterns, column dependencies and business rules from data

Automatically discovering patterns, column dependencies, constraints and business rules from data is an important problem: it can help data analysts understand the semantics of an unfamiliar dataset, it can help identify data quality problems (records that do not satisfy an expected dependency or business rule may be erroneous), and it can be used during query optimization to produce efficient query plans.

A well-known column dependency is the Functional Dependency (FD).  Let X and Y be two sets of columns. An FD X-->Y asserts that if any two rows in the dataset have the same value of X, they must have the same value of Y. For example, in an address database, an FD {postal code}-->{city} implies that postal codes are not shared across cities.  In this project, we studied other interesting dependencies, including:

  • Order dependencies (ODs).  An OD X-->Y states that if we sort the dataset on X, it will also be sorted on Y. For instance, sorting a product catalog on price-before-tax also sorts it on price-after-tax, assuming a fixed tax rate.  We developed an efficient algorithm for discovering non-redundant ODs in a given dataset.  
  • Ontology Functional Dependencies (OFDs).  An OFD X-->Y asserts that if any two rows in a dataset have the same value of X, they must either have the same value of Y or the Y-values must be synonyms (as defined in an ontology).  We proposed an efficient algorithm for discovering OFDs from data.
  • Sequential dependencies (SDs).  We proposed SDs, which state that adjacent data records, when sorted by some field X, must have Y-values within a particular range.  For example, in an airport take-off and landing dataset, if we sort by airplane take-off times, adjacent take-offs might be at least 30 seconds apart.
  • Conservation dependencies (CDs) assert that the cumulative sum of the values in some column X should be approximately equal to the cumulative sum of the values another column Y over time. For example, in a credit card transaction dataset, the cumulative sum of credits should be equal to the cumulative sum of debits over time

We also developed a tool, called an Explanation Table, for exploring multi-dimensional datasets containing descriptive/dimension attributes and numeric measure attributes.  An Explanation Table uses information-theoretic methods to identify the most informative fragments of a dataset.  In particular, we recently demonstrated a prototype of this idea which is implemented in Javascript and runs entirely inside a browser to allow portable and interactive data exploration.

Finally, we developed new data cleaning strategies that identify violations of Functional Dependencies and suggest possible fixes.  In one of our techniques, we sample from the space of possible data “repairs” instead of suggesting just one repair which may not be optimal.  In another, we considered possible modifications of inconsistent data, and, at the same time, modifications to the business rules governing the data semantics.

For more information:

K. El Gebaly, G. Feng, L. Golab, F. Korn, D. Srivastava, Explanation TablesIEEE Data Engineering Bulletin, 41(3):43-51, 2018

J. Szlichta, P. Godfrey, L. Golab, M. Kargar, D. Srivastava, Effective and Complete Discovery of Bidirectional Order Dependencies via Set-based Axioms, VLDB Journal, 27(4): 573-591, 2018

A. Mihaylov, P. Godfrey, L. Golab, M. Kargar, D. Srivastava, J. Szlichta, FastOD: Bringing Order to Data, ICDE 2018, demo paper

Z. Zheng, M. Langouri, Z. Qu, I. Currie, F. Chiang, L. Golab, J. Szlichta, FastOFD: Contextual Data Cleaning with Ontology Functional DependenciesEDBT 2018, 694-697, demo paper

S. Baskaran, A. Keller, F. Chiang, L. Golab, J. Szlichta, Efficient Discovery of Ontology Functional DependenciesCIKM 2017, 1847-1856

J. Szlichta, P. Godfrey, L. Golab, M. Kargar, D. Srivastava, Effective and Complete Discovery of Order Dependencies via Set-based AxiomatizationPVLDB 10(7): 721-732, 2017

K. El Gebaly, L. Golab, J. Lin, Portable In-Browser Data Cube ExplorationIDEA workshop at KDD 2017, 35-39

G. Feng, L. Golab, D. Srivastava, Scalable Informative Rule MiningICDE 2017, 48. Tech report here

Z. Abedjan, L. Golab, F. Naumann, Profiling relational data - a surveyVLDB Journal 24(4): 557-581, 2015

K. El Gebaly, P. Agrawal, L. Golab, F. Korn, D. Srivastava, Interpretable and Informative Explanations of OutcomesPVLDB 8(1):61-72, 2014

L. Golab, H. Karloff, F. Korn, B. Saha, D. Srivastava, Discovering Conservation RulesTKDE, 26(6):1332-1348, 2014

G. Beskales, I. Ilyas, L. Golab, A. Galuillin, Sampling from Repairs of Conditional Functional Dependency ViolationsVLDB Journal, 23(1):103-128, 2014

G. Beskales, I. Ilyas, L. Golab, A. Galiullin, On the Relative Trust between Inconsistent Data and Inaccurate ConstraintsICDE 2013, 541-552

Graph and social network mining

In this project, we want to improve the efficiency and effectiveness of tools that extract useful information from graph data.  We focus on graphs whose vertices are labelled with text.  We proposed a new technique for identifying small subgraphs whose vertices contain all the keywords mentioned in a given keyword query.  We also developed a method for extracting teams of experts from social networks, where each vertex corresponds to a person, vertex labels correspond to the person’s areas of expertise, and the output consists of a small set of vertices such that the union of their labels includes all the terms mentioned in a given keyword query.

For more information:

M. Zihayat, A. An, L. Golab, M. Kargar, J. Szlichta, Effective Team Formation in Expert NetworksAMW 2018, 4:1-4:4

M. Zihayat, A. An, L. Golab, M. Kargar, J. Szlichta, Authority-Based Team Discovery in Social Networks, EDBT 2017, 498-501

M. Kargar, L. Golab, J. Szlichta, eGraphSearch: Effective Keyword Search in GraphsCIKM 2016, 2461-2464. Tech report here