DSG Seminar Series • Linked Data Science: Systems and Applications

Monday, November 21, 2022 10:30 am - 10:30 am EST (GMT -05:00)

Speaker: Essam Mansour, Concordia University

Location: DC 1304 and over Zoom (register here)

Abstract:

In recent years, we have witnessed a growing interest in data science not only from academia but particularly from companies investing in data science platforms to analyze large amounts of data. In this process, a myriad of data science artifacts, such as datasets and pipeline scripts, are created. Yet, there has so far been no systematic attempt to holistically exploit the collected knowledge and experiences that are implicitly contained in the specification of these pipelines, e.g., compatible datasets, cleansing steps, ML algorithms, parameters, etc.

Instead, data scientists still spend a considerable amount of their time trying to recover relevant information and experiences through colleagues, trial and error, lengthy exploration, etc. In this paper, we therefore propose a novel system (KGLiDS) that employs machine learning to extract the semantics of data science pipelines and captures them in a knowledge graph, which can then be exploited to assist data scientists in various ways. This abstraction is the key to enabling Linked Data Science, since it allows us to share the essence of pipelines between platforms, companies, and institutions without revealing critical internal information, focusing instead on the semantics of what is being processed and how. Our comprehensive evaluation uses thousands of datasets and more than thirteen thousand pipeline scripts extracted from data discovery benchmarks and the Kaggle portal, and shows that KGLiDS significantly outperforms state-of-the-art systems on related tasks, such as dataset recommendation and pipeline classification.

Bio:

Dr. Essam Mansour has been an assistant professor since 2019 in the Department of Computer Science and Software Engineering (CSSE) at Concordia University in Montreal, and is the head of the Cognitive Data Science lab (CoDS). His research program focuses on developing Cognitive Data Science Platforms for federated and big datasets. His research interests are in the broad areas of parallel/distributed systems, data management, knowledge graphs, and graph neural networks. Essam has spent more than 10 years doing world-class research in the areas of databases, parallel/distributed systems, big data analytics, and querying geo-distributed graphs. He is developing and optimizing big data systems to work at scale on supercomputers and cloud resources. During these years, his research contributions have led to more than 30 conference and journal papers (mostly in top-tier venues, such as VLDBJ, PVLDB, SIGMOD, ICDE, EDBT, and CIKM). He has been invited as a reviewer for top journals, such as ACM Transactions on Database Systems (TODS), VLDB Journal, and IEEE Transactions on Knowledge and Data Engineering (TKDE). Essam has also served as a program committee member for several top conferences, such as VLDB 2016 to 2023, SIGMOD 2023, and ICDE 2016.

Talk video