Data systems researchers receive 2022 ACM SIGMOD Research Highlight Award | Data Systems Group

Cheriton students Jeremy Chen, Yuqing Huang and Mushi Wang and Professors Semih Salihoğlu and Ken Salem have received a 2022 ACM SIGMOD Research Highlight Award for their paper “Accurate summary-based cardinality estimation through the lens of cardinality estimation graphs.” No stranger to scholarly recognition, this research earlier received the Best Experiment, Analysis and Benchmark Award at VLDB 2022, the 48^th International Conference on Very Large Databases, where it was presented originally.

“The SIGMOD Research Highlight Award aims to showcase a set of research projects that exemplify core database research,” wrote Wim Martens, chair of the SIGMOD award selection committee in his letter to Professor Salihoğlu. “These projects address an important problem, represent a definitive milestone in solving the problem, and have the potential of significant impact. The initiative of the SIGMOD Research Highlights also aims to make the selected works widely known in the database community, to our industry partners, and potentially to the broader ACM community.”

photo of Ken Salem, Mushi Wang and Semih Salihoglu

L to R: Ken Salem, Mushi Wang and Semih Salihoğlu. Co-authors Jeremy Chen and Yuqing Huang have graduated and were unavailable for the photo.

Ken Salem is a Professor at the Cheriton School of Computer Science and a member of the Data Systems Group. His research interests are in data systems, especially synchronization, transactions and fault tolerance, and in distributed systems and cloud computing.

Mushi Wang is a master’s candidate at the Cheriton School of Computer Science, advised by Professor Salihoğlu. His research interests are in the integration of graph neural networks into graph database management systems.

Semih Salihoğlu is an Associate Professor at the Cheriton School of Computer Science and a member of the Data Systems Group. He conducts both systems and theoretical research in data management and processing. His systems work focuses on developing systems for managing, querying, or doing analytics on graph-structured data.

Jeremy Chen is a software engineer at Snowflake. He is fascinated by distributed systems, large-scale data processing, and databases.

Yuqing Huang is currently a master’s candidate in computer science at ETH Zurich. At Waterloo, she was an undergraduate student double majoring in computer science and combinatorics and optimization.

About this award-winning research

Software applications use database management systems (DBMSs) to store and query data, often organized as tables of records. A core feature of DBMSs is to support high-level declarative query languages, which allow application developers to describe what they want to retrieve from the database, without worrying about how to retrieve it. For example, in the most widely adopted query language called SQL, the query “SELECT name FROM Employee WHERE age < 20” asks the DBMS to return the names of all employees in the employee table who are less than 20 years old, without specifying how the employee table should be searched to identify them. A DBMS can choose to compute the result of even this simple query in many ways. A key feature of DBMSs is that they automatically generate and put together programs, which are called query plans, that correctly answer such high-level queries. DBMSs do this by first finding a set of correct plans for a query, then assign each plan an estimated cost, which is expected to correlate with how long the plan would take, and then selecting the plan with the minimum cost.

The most challenging part of assigning accurate cost estimates to plans is to estimate the number of records they would process. This is called the problem of cardinality estimation. The most common techniques for cardinality estimation are summary-based techniques, which store statistics about the tables and then put these statistics together to come up with an estimate.

In their paper, the research team examined two types of summary-based cardinality estimators — optimistic estimators that make uniformity and conditional independence assumptions and can both under-estimate and over-estimate actual cardinalities, and pessimistic estimators that use information theoretic linear programs that are guaranteed to never underestimate. They showed, surprisingly, that these two classes of estimators are connected and can be seen as instantiations of a more generic estimator, called a cardinality estimation graph (CEG)-based estimator. This discovery had interesting scientific implications, e.g., that the estimates made by the linear program-based pessimistic estimators could be solved using combinatorial algorithms that solve optimistic estimators.

The paper also made an important observation about optimistic estimators. Specifically, the paper showed that when making optimistic estimates, there is often more than one way — corresponding to paths in CEGs — to put the stored statistics together and no obvious technique to decide which one is better. They outlined and empirically evaluated a space of heuristics and showed that the heuristic that leads generally to most accurate estimates depends on the structure of the query and proposed specific estimators to use different classes of queries.

“An important life lesson to draw from the story of this paper is to not give up on a good paper you love,” noted Professor Salihoğlu anecdotally. “Keep stubbornly polishing its writing and positioning. Before winning a VLDB Best Paper Award and now a SIGMOD Research Highlight Award, this paper was rejected three times from SIGMOD and VLDB. It was painful. We knew it was a very strong paper with a clear contribution, but we could not figure out the right way to position it. After three rejections, we finally got it right.”

To learn more about the award-winning research on which this article is based, please see Jeremy Chen, Yuqing Huang, Mushi Wang, Semih Salihoglu, and Ken Salem. Accurate Summary-based Cardinality Estimation Through the Lens of Cardinality Estimation Graphs. PVLDB, 15(8): 1533–1545, 2022. doi:10.14778/3529337.3529339