PhD Defence • Bioinformatics • Deep Learning for Accurate and Reliable De Novo Peptide Sequencing: From Missing Fragmentation to Open Modification Discovery

Thursday, July 9, 2026 1:00 pm - 4:00 pm EDT (GMT -04:00)

Please note: This PhD defence will take place online.

Zeping Mao, PhD candidate
David R. Cheriton School of Computer Science

Supervisor: Professor Ming Li

De novo peptide sequencing aims to identify peptide sequences directly from tandem mass spectra without relying on a reference protein database. This capability is essential for discovering novel proteins, antibody sequences, immunopeptides, and other proteomic signals that may be missed by conventional database search. However, accurate de novo sequencing remains difficult because tandem mass spectra are often incomplete: missing fragment ions obscure the true peptide path and can cause errors to accumulate during sequence prediction. Moreover, most existing deep learning methods are limited to a closed vocabulary of known amino acids and predefined post-translational modifications, making it difficult to discover unexpected or previously unannotated peptide chemistries.

This defence presents a series of deep learning approaches that move de novo peptide sequencing from direct sequence prediction toward structured inference over mass spectra. The first part focuses on the missing-fragmentation problem. GraphNovo represents each spectrum as a graph and separates peptide prediction into path discovery and sequence completion, helping preserve the global structure of the peptide even when local fragmentation evidence is missing. The second part extends this graph-based view beyond the conventional closed vocabulary of canonical amino acids and predefined modifications. RNovA, a transformer framework enhanced with rotary positional embeddings, models mass differences between fragment ions and formulates sequencing as a sequential decision process. This enables zero-shot open modification discovery, allowing the model to reason over unresolved mass gaps and previously unseen modified residues without retraining or predefined modification lists. A supporting reliability framework further helps assess de novo predictions when database-derived ground truth is unavailable.

Together, these works advance de novo peptide sequencing toward accurate, reliable, database-independent, and open-ended proteomic discovery, expanding its potential for exploring previously inaccessible regions of the proteome.


Attend this PhD defence virtually on Zoom.