One often-heard complaint is that academics labour away in their ivory towers, divorced from happenings in the real world. A few years ago, Professor Semih Salihoglu of the Data Systems Group at the University of Waterloo's Cheriton School of Computer Science noticed exactly this for graph processing.
Professor Salihoglu is an expert in data systems for storing, querying, and processing graphs, which can model everything from friendships on social networks to the interaction of genes in our bodies to the network of roads we drive on. He noticed that academic papers repeatedly studied the same graph algorithms over the same few datasets and wondered: do these accurately reflect the needs of industry practitioners working with graph data? Academic research should aim to deliver real-world impact. To this end, were they working on the right problems?
Despite the prevalence of graph data and a multitude of software for graph processing, little research has been conducted on how graph data is used in practice and the challenges that users face. To better understand these issues, Professor Salihoglu enlisted the help of his PhD students Siddhartha Sahu and Amine Mhedhbi as well as his colleagues Professors Jimmy Lin and Tamer Özsu.
Together, they conducted an online survey of nearly one hundred users of a wide range of graph software products and examined about two dozen code repositories. The team sought to answer four key questions: What types of graph data do users have? What computations do they run on their graphs? What software is used to perform the computations? What are the major challenges users face when processing graph data?
“We made several discoveries, some unexpected,” Salihoglu said. “In particular, we found that graphs in use represent a wide variety of entities, many of which would not naturally be thought of as vertices and edges. Traditional enterprise data comprised of products, orders, and transactions — the kind of data you typically see in a relational database — appears to be common among the datasets of survey participants.”
The authors also found that many graphs are quite large, often containing more than a billion edges. “These graphs represent an enormously wide range of entities and are used by organizations from small businesses to large enterprises,” said Siddhartha Sahu, noting that this finding runs counter to a common assumption that large graphs are problematic only for large organizations such as Google, Facebook, and Twitter.
“The survey also found that scalability is the most pressing challenge faced by users and the ability to process very large graphs efficiently is among the biggest limitation of existing software,” Sahu said, adding that the results also suggest that data visualization is a pressing challenge.
Their results, published in a paper titled “The Ubiquity of Large Graphs and Surprising Challenges of Graph Processing,” was presented at the 44thInternational Conference on Very Large Data Bases held this year in Rio de Janeiro, Brazil, from August 27 to 31, 2018. The study caught the attention of their colleagues at the annual meeting, as it received the VLDB 2018 Best Paper Award. The work was praised by the conference program committee as providing a “unique and timely contribution,” and that “the findings provide future research directions that benefit our community and others.”
To learn more about this research, please see Siddhartha Sahu, Amine Mhedhbi, Semih Salihoglu, Jimmy Lin, and M. Tamer Özsu. The Ubiquity of Large Graphs and Surprising Challenges of Graph Processing. Proceedings of the VLDB Endowment, 11(4): 420–431, December 2017.