DSG Seminar Series • Next Generation Indexes For Big Data Engineering

Thursday, May 10, 2018 2:00 pm - 2:00 pm EDT (GMT -04:00)

Daniel Lemire
Université TÉLUQ

Maximizing performance in data engineering is a daunting challenge. We present some of our work on designing faster indexes, with a particular emphasis on compressed indexes. Our prior work includes (1) Roaring indexes, which are part of multiple big-data systems such as Spark, Hive, Druid, Atlas, Pinot, and Kylin, and (2) EWAH indexes, which are part of Git (GitHub) and included in major Linux distributions.

We will present ongoing and future work on how we can process data faster while supporting the diverse systems found in the cloud (including upcoming ARM processors) and multiple programming languages (e.g., Java, C++, Go, Python). We seek to minimize the use of shared resources (e.g., RAM) while exploiting algorithms designed for the single-instruction-multiple-data (SIMD) instructions available on commodity processors. Our end goal is to process billions of records per second per core.

Presentation slides (PDF)

Daniel Lemire is a computer science professor at the Université du Québec (TELUQ). He has also been a research officer at the National Research Council of Canada and an entrepreneur. He has written over 70 peer-reviewed publications, including more than 40 journal articles. He has held competitive research grants for the last 15 years. He serves on the program committees of leading computer science conferences (e.g., ACM CIKM, WWW, ACM WSDM, ACM SIGIR, ACM RecSys).

He programs in C, C++, Java, JavaScript, Python, Swift and Go. He works primarily in an open-source setting. You can find his software in Git, Apache Hive, Druid, Apache Kylin, Netflix Atlas, LinkedIn Pinot, Microsoft Visual Studio Team Services and so forth. Some of his compression software is used by Apache Arrow and Apache Impala. In 2012, he was recognized by the Google Open Source Peer Bonus Program.

He is a long-time social media user: his blog has thousands of readers and was featured on Slashdot, Reddit and Hacker News. He was one of the first Twitter users: @lemire.

Location Information

Location Address: DC - William G. Davis Computer Research Centre
200 University Avenue West
1304
Waterloo, ON, CA N2L 3G1

Additional Information

Host: Data Systems Seminar Series (2017-2018)