Please note: This PhD defence will be given online.
Fatema Tuz Zohora, PhD candidate
David R. Cheriton School of Computer Science
Supervisor: Professor Ming Li
Proteins are the main workhorses of biological functions and activities. Comparative analysis of protein samples from a healthy person and a disease-afflicted person can reveal disease biomarkers, which can be diagnostic or prognostic of the respective disease. Liquid chromatography with tandem mass spectrometry (LC-MS/MS) is the state-of-the-art technology for protein identification and quantification.
In this thesis, we target the first step in LC-MS/MS analysis: peptide feature detection from LC-MS data, which is very promising for disease biomarker discovery and protein quantitation. LC-MS data form a three-dimensional space in which peptide features appear as multi-isotopic patterns. Each dataset may contain hundreds of thousands of peptide features, which frequently overlap with each other, are tiny with respect to the background, and blend in closely with feature-like noisy traces. All of these characteristics make peptide feature detection very challenging. However, deep learning is bringing groundbreaking results in various pattern recognition contexts. Therefore, in this thesis, we investigate deep learning models to address the peptide feature detection problem.
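As a rough illustration of what a multi-isotopic pattern looks like along the m/z axis, the sketch below lists the expected positions of the first few isotope peaks of a single feature. It is not the detection method of the thesis. It assumes the common approximation that consecutive isotope peaks are separated by about 1.00335 Da divided by the charge; the function name and parameters are hypothetical.

```python
# Approximate mass difference between 13C and 12C, in daltons.
# Assumption: a single constant spacing is used for all isotope peaks.
ISOTOPE_SPACING_DA = 1.00335


def isotope_mz_positions(mono_mz, charge, n_isotopes=4):
    """Return approximate m/z values of the first n_isotopes peaks of a feature.

    mono_mz    : m/z of the monoisotopic (first) peak
    charge     : positive integer charge state of the peptide ion
    n_isotopes : number of peaks to list, including the monoisotopic one
    """
    if charge < 1:
        raise ValueError("charge must be a positive integer")
    # Peaks of a charge-z ion are spaced by roughly 1.00335 / z on the m/z axis.
    step = ISOTOPE_SPACING_DA / charge
    return [mono_mz + k * step for k in range(n_isotopes)]


if __name__ == "__main__":
    # Hypothetical doubly charged feature with monoisotopic peak at m/z 500.
    for mz in isotope_mz_positions(500.0, charge=2):
        print(f"{mz:.4f}")
```

A detector has to recognise such evenly spaced peak groups, each extended along the retention-time axis, while telling them apart from overlapping features and noise.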
Existing tools for peptide feature detection are designed with domain-specific parameters that are prone to human error and are rarely updated despite the vast amount of newly arriving proteomics data. As a solution, we chart, for the first time, a path toward automating peptide feature detection through deep learning models. The main strength of our approach is that it learns all the parameters itself through training on an appropriate dataset, and newly available information can be easily integrated by fine-tuning the model.
We first propose DeepIso, which combines a convolutional neural network (CNN) and a recurrent neural network (RNN) and provides better peptide feature detection than other existing models. Then we offer PointIso, a point-cloud-based deep learning model with attention-based segmentation, which improves feature detection and is three times faster than DeepIso. PointIso’s detection percentage of identified spiked peptides on a benchmark dataset is about 5% higher than that of other existing models. We then perform a quality assessment of the peptide features generated by PointIso, showing its potential in biomarker discovery. We also apply PointIso to relative peptide abundance calculation among multiple samples, demonstrating its utility in label-free quantitation. Finally, we adapt our 3D PointIso model to handle more advanced data types with four dimensions, achieving 4-6% higher detection than other algorithms on the human proteome dataset. Being generic and transferable to various data types should make our model more attractive for practical use. We believe our research makes a notable contribution to accelerating the progress of deep learning in proteomics, as well as in general pattern recognition.