PhD Defence • Bioinformatics • Deep Learning Methods for Novel Peptide Discovering and Function Prediction | Cheriton School of Computer Science

Please note: This PhD defence will take place online.

Shaokai Wang, PhD candidate
David R. Cheriton School of Computer Science

Supervisor: Professor Bin Ma

This thesis explores deep learning methodologies for protein identification and property prediction, encompassing two primary areas: mass spectrometry-based protein sequence identification and protein property prediction. We introduce a method that enhances the identification rate of MHC-I peptides and facilitates the discovery of novel mutated MHC-I peptides. In the domain of property prediction, we present three novel approaches: one for the early diagnosis of amyloidosis, one for the discovery of anti-cancer peptides, and one for the classification of anti-cancer peptide functional types.

NeoMS: Identification of Novel MHC-I Peptides with Tandem Mass Spectrometry. The study of immunopeptidomics requires the identification of both regular and mutated MHC-I peptides from mass spectrometry data. For the efficient identification of MHC-I peptides with either one or no mutation from a sequence database, we propose a novel workflow: NeoMS. It comprises three main modules: a tagging algorithm that generates an expanded sequence database, a machine learning-based scoring function that maximizes search sensitivity, and a careful target-decoy implementation that controls the false discovery rates (FDR) of both the regular and mutated peptides. Experimental results demonstrate that NeoMS both improved the identification rate of the regular peptides over other database search software and identified hundreds of mutated peptides that have not been identified by any current method. Further study supports the validity of these novel peptides.
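As a rough illustration of the target-decoy idea mentioned above, the following Python sketch estimates FDR as the ratio of decoy to target matches above a score cutoff, and does so separately for the regular and mutated peptide classes. It is a generic textbook-style sketch, not the NeoMS implementation: the function names, the (score, is_decoy) tuple layout, and the example scores are all assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical record layout: (match score, True if the match is to a decoy sequence).
Match = Tuple[float, bool]


def threshold_at_fdr(matches: List[Match], max_fdr: float) -> Optional[float]:
    """Return the lowest score cutoff whose estimated FDR is at most max_fdr.

    FDR is estimated as (#decoys) / (#targets) among matches scoring at or
    above the cutoff. Returns None if no cutoff satisfies the limit.
    """
    ranked = sorted(matches, key=lambda m: m[0], reverse=True)
    targets = 0
    decoys = 0
    best: Optional[float] = None
    for score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Remember the deepest point in the ranking that still meets the limit.
        if targets > 0 and decoys / targets <= max_fdr:
            best = score
    return best


def thresholds_by_class(
    groups: Dict[str, List[Match]], max_fdr: float
) -> Dict[str, Optional[float]]:
    """Apply the cutoff search independently to each peptide class."""
    return {name: threshold_at_fdr(ms, max_fdr) for name, ms in groups.items()}


# Made-up scores, for illustration only.
example = {
    "regular": [(9.0, False), (8.0, False), (7.0, True),
                (6.0, False), (5.0, False), (4.0, True)],
    "mutated": [(9.5, False), (7.5, True), (7.0, False)],
}
cutoffs = thresholds_by_class(example, max_fdr=0.25)
```

Computing a separate cutoff per class reflects the general concern that an expanded, mutation-containing search space can produce false matches at a different rate than the regular one, so a single pooled cutoff could understate the error among mutated peptides.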

Deep learning boosted amyloidosis diagnosis. Amyloid light chain (AL) amyloidosis is a disorder characterized by the deposition of antibody light chains in organs. The importance of early and accurate diagnosis in AL amyloidosis cannot be overstated, as it enables timely implementation of appropriate treatment strategies and improves patient outcomes. Therefore, developing a highly accurate method using antibody sequencing and computational techniques is crucial to address this urgent need. While several computational methods have been developed to predict AL amyloidosis, they depend heavily on manually extracted features, and their performance falls short of satisfactory levels. We present DeepAL, a deep learning-based approach to predict AL amyloidosis with high precision. DeepAL uses a pre-trained model to extract light chain features and is then trained with AL amyloidosis knowledge. In evaluations conducted on two benchmark datasets, DeepAL surpasses the performance of previous approaches. Additional experiments demonstrate that features extracted from the pre-trained model significantly enhance overall performance.

Anti-cancer peptide identification and activity type classification with protein sequence pre-training. Cancer remains a significant global health challenge, responsible for millions of deaths annually. Addressing this issue necessitates the discovery of novel anti-cancer drugs. Anti-cancer peptides (ACPs), with their unique ability to selectively target cancer cells, offer new hope for discovering anti-cancer drugs with few side effects. We introduce DUO-ACP, a model serving dual roles in ACP prediction: identification and functional type classification. DUO-ACP employs two embedding modules to acquire knowledge about global protein features and local ACP characteristics, complemented by a prediction module. When assessed on two publicly available datasets for each task, DUO-ACP surpasses all existing methods. We further interpret the contribution of each part of our model, including the two types of embeddings as well as ensemble learning. On a newly curated dataset, the prediction results of DUO-ACP closely match existing literature, highlighting DUO-ACP’s ability to generalize to previously unseen data and its potential for discovering novel ACPs.

Novel fine-tuning strategy on pre-trained protein model enhances ACP functional type classification. Cancer remains one of the most formidable health challenges globally. ACPs have recently emerged as a promising new therapeutic strategy, recognized for their targeted and efficient anti-cancer properties. To fully leverage the potential of ACPs, computational methods that can accurately discover and predict their functional types are indispensable. We present ACP-FT, a deep learning model that is fine-tuned from a pre-trained protein model specifically for predicting the functional types of ACPs. Employing a novel fine-tuning approach alongside an adversarial model training technique, our model surpasses existing methods in classification performance on two public datasets. Additionally, we provide a thorough analysis of our training strategy’s effectiveness. The experimental results demonstrate that our dual-phase fine-tuning approach effectively prevents catastrophic forgetting in the pre-trained model, while adversarial training enhances the model’s robustness. Together, these methodologies significantly increase the accuracy and reliability of ACP functional type predictions.

To attend this PhD defence on Zoom, please go to https://uwaterloo.zoom.us/j/94161905362.