PhD Seminar • Analytics for Everyone

Wednesday, September 27, 2017 12:30 pm - 12:30 pm EDT (GMT -04:00)

Kareem El Gebaly, PhD candidate
David R. Cheriton School of Computer Science

The process of analyzing relational data typically involves tasks that build familiarity with the data, yield insights, and lead to findings or conclusions. This process is usually carried out by data experts (data scientists), who share their output with an audience that may have less data expertise (everyone).

Our goal is to enable everyone to take part in this process rather than passively consume its outputs (analytics democratization). With data increasingly available on the web (data democratization) and personal computing capabilities already widespread, this goal is becoming more attainable.

Experts such as data journalists who want to share their data exploration tasks over the web face two main challenges. First, the infrastructure necessary for interactive data exploration is costly and hard to manage, especially in data journalism use cases. Second, their audiences need guidance: with too many possible starting points, they would not know where to begin exploring the data.

To eliminate the problems and costs of managing infrastructure, we propose an in-browser (serverless) SQL engine, i.e., a portable database. For databases that are too large for the browser, we propose a hybrid architecture: a one-time SQL query runs at the backend, and subsequent SQL queries run in the browser in response to the user's interactions.
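
The hybrid architecture can be illustrated with a minimal sketch. It assumes Python's built-in sqlite3 module as a stand-in for the in-browser engine, and the flights table, its columns, and all function names are hypothetical, chosen only for illustration. One backend query extracts the slice of interest; every later query is answered from the local copy.

```python
import sqlite3


def build_backend():
    """Hypothetical server-side database holding the full table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE flights (year INTEGER, day TEXT, origin TEXT, delayed INTEGER)"
    )
    db.executemany(
        "INSERT INTO flights VALUES (?, ?, ?, ?)",
        [
            (2016, "Fri", "SF", 1), (2016, "Sun", "London", 0),
            (2017, "Fri", "SF", 1), (2017, "Fri", "London", 1),
            (2017, "Sat", "SF", 0), (2017, "Sun", "London", 0),
        ],
    )
    return db


def one_time_extract(backend, year):
    """The single backend query: ship only the slice the reader will explore."""
    return backend.execute(
        "SELECT day, origin, delayed FROM flights WHERE year = ?", (year,)
    ).fetchall()


def load_local(extract):
    """Stand-in for the in-browser engine: an in-memory database fed by the extract."""
    local = sqlite3.connect(":memory:")
    local.execute("CREATE TABLE flights (day TEXT, origin TEXT, delayed INTEGER)")
    local.executemany("INSERT INTO flights VALUES (?, ?, ?)", extract)
    return local


def delays_by(local, column):
    """An interactive query, answered locally with no further backend round trip."""
    if column not in ("day", "origin"):  # whitelist, since the name is interpolated
        raise ValueError(f"unknown column: {column}")
    return local.execute(
        f"SELECT {column}, AVG(delayed) FROM flights GROUP BY {column} ORDER BY {column}"
    ).fetchall()


backend = build_backend()
local = load_local(one_time_extract(backend, 2017))
backend.close()  # the backend is no longer needed once the extract is loaded

by_day = delays_by(local, "day")
by_origin = delays_by(local, "origin")
```

Closing the backend connection before running the interactive queries makes the point of the design explicit: after the one-time extract, each user interaction costs only a local query.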

To guide the user's exploration, we introduce an information-theoretic technique that picks the most informative parts of the entire data cube of a relational table (explanation tables). We introduce optimizations that allow explanation tables to be created with the modest resources available in the browser, again without any external dependencies. Facilitating data exploration for everyone is one step toward analytics democratization, where everyone, not just the experts, can take part in data exploration.
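
To give a feel for what picking informative parts of a data cube can look like, here is a deliberately simplified greedy sketch. It is an assumption-laden illustration, not the technique or the optimizations presented in the talk: it scores each cube pattern by how many bits it corrects in a running estimate of a binary outcome, and it updates that estimate naively. The flights data, the column names, and all function names are hypothetical.

```python
import itertools
import math

WILDCARD = "*"


def bernoulli_kl(p, q):
    """KL(Bernoulli(p) || Bernoulli(q)) in bits, with q clamped away from 0 and 1."""
    eps = 1e-12
    q = min(max(q, eps), 1 - eps)
    total = 0.0
    if p > 0:
        total += p * math.log2(p / q)
    if p < 1:
        total += (1 - p) * math.log2((1 - p) / (1 - q))
    return total


def cube_patterns(rows, dims):
    """All data cube patterns: each dimension is a value seen in a row or a wildcard."""
    seen = set()
    for row in rows:
        for mask in itertools.product([True, False], repeat=len(dims)):
            seen.add(tuple(row[d] if keep else WILDCARD for d, keep in zip(dims, mask)))
    return sorted(seen)  # sorted so that ties are broken deterministically


def matches(pattern, row, dims):
    return all(p == WILDCARD or p == row[d] for p, d in zip(pattern, dims))


def explain(rows, dims, outcome, k):
    """Greedily pick up to k patterns; returns (pattern, row count, outcome rate) triples."""
    overall = sum(r[outcome] for r in rows) / len(rows)
    estimates = [overall] * len(rows)  # current guess of the outcome for each row
    table = [((WILDCARD,) * len(dims), len(rows), overall)]
    candidates = cube_patterns(rows, dims)
    for _ in range(k):
        best = None
        for pattern in candidates:
            idx = [i for i, r in enumerate(rows) if matches(pattern, r, dims)]
            actual = sum(rows[i][outcome] for i in idx) / len(idx)
            expected = sum(estimates[i] for i in idx) / len(idx)
            gain = len(idx) * bernoulli_kl(actual, expected)
            if best is None or gain > best[0]:
                best = (gain, pattern, idx, actual)
        gain, pattern, idx, actual = best
        if gain <= 1e-9:  # nothing left that the current estimates get wrong
            break
        table.append((pattern, len(idx), actual))
        for i in idx:  # naive update (an assumption): overwrite the matching estimates
            estimates[i] = actual
    return table


dims = ("day", "origin")
rows = [
    {"day": "Fri", "origin": "SF", "delayed": 1},
    {"day": "Fri", "origin": "SF", "delayed": 1},
    {"day": "Fri", "origin": "London", "delayed": 1},
    {"day": "Fri", "origin": "London", "delayed": 1},
    {"day": "Sat", "origin": "SF", "delayed": 0},
    {"day": "Sat", "origin": "London", "delayed": 0},
    {"day": "Sun", "origin": "SF", "delayed": 0},
    {"day": "Sun", "origin": "London", "delayed": 0},
]
summary = explain(rows, dims, "delayed", 1)
```

On this toy data, the first pattern chosen after the all-wildcard row is ("Fri", "*"): it covers four flights that are all delayed, while the overall delay rate is only one half. That gives a reader a concrete starting point for exploration without enumerating the cube by hand.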