Please note: This PhD defence will take place online only. Previously, it was listed as a hybrid defence.
Fatemeh Alipour, PhD candidate
David R. Cheriton School of Computer Science
Supervisors: Professors Lila Kari and Yang Lu
Advancements in genomic sequencing have significantly increased the availability of DNA sequence data, introducing both opportunities and challenges in bioinformatics. This dissertation leverages advanced machine learning techniques to enhance taxonomic classification and clustering of DNA sequences, introducing several innovative algorithms.
We proposed a deep learning-based method for unsupervised clustering of DNA sequences that does not rely on prior taxonomic information and that demonstrates superior performance over traditional clustering methods such as K-Means++ and Gaussian Mixture Models across various genomic datasets. Additionally, we developed a hybrid approach that integrates k-mer composition analysis with host species data to address the taxonomic classification of emerging astroviruses, successfully assigning genus labels to previously unclassified genomes and tackling the challenges posed by interspecies transmission. Moreover, we introduced a novel method that employs twin contrastive learning with convolutional neural networks to cluster Chaos Game Representations of DNA sequences. This method has shown robust performance and improved clustering accuracy compared to existing methods. Collectively, these methodologies improve the accuracy and computational efficiency of genomic data analysis and highlight the transformative potential of machine learning in DNA sequence classification.
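For readers unfamiliar with the Chaos Game Representation mentioned above, the following is a minimal sketch of how a DNA sequence can be turned into a 2^k x 2^k grid of k-mer counts (often called a frequency CGR), which is the kind of image-like input a convolutional network can consume. It is an illustration only, not the dissertation's implementation: the function name `fcgr`, the corner assignment in `CORNERS`, and the default `k=3` are assumptions chosen for this example, and corner conventions differ between publications.

```python
# Illustrative sketch only; names and the corner layout are assumptions,
# not taken from the dissertation.

# Each nucleotide is assigned to one corner of the unit square as (x, y).
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def fcgr(seq, k=3):
    """Return a 2^k x 2^k grid of k-mer counts laid out as a CGR.

    In the chaos game, each base moves the current point halfway toward
    that base's corner. After k moves, the grid cell containing the point
    is fixed by the last k bases: the most recent base sets the highest
    bit of the column and row index, the oldest base sets the lowest bit.
    The cell is computed directly from those bits, which avoids
    floating-point drift on long sequences.
    """
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    seq = seq.upper()
    for start in range(len(seq) - k + 1):
        kmer = seq[start:start + k]
        # Skip windows containing ambiguous symbols such as N.
        if any(base not in CORNERS for base in kmer):
            continue
        col = row = 0
        for j, base in enumerate(kmer):
            cx, cy = CORNERS[base]
            col |= cx << j
            row |= cy << j
        grid[row][col] += 1
    return grid


if __name__ == "__main__":
    for line in fcgr("ACGTACGTTGCA", k=2):
        print(line)
```

Because each cell corresponds to exactly one k-mer, the same grid also serves as a k-mer composition vector once flattened, which is why the two representations are closely related.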