PhD Seminar • Bioinformatics • BarcodeBERT: Transformers for Biodiversity Analysis

Monday, December 4, 2023 3:00 pm - 4:00 pm EST (GMT -05:00)

Please note: This PhD seminar will take place in DC 2310 and online.

Pablo Millán Arias, PhD candidate
David R. Cheriton School of Computer Science

Supervisor: Professor Lila Kari

Understanding biodiversity is a global challenge in which DNA barcodes, short snippets of DNA that cluster by species, play a pivotal role. Invertebrates in particular, a highly diverse and under-explored group, pose unique taxonomic complexities. We explore machine learning approaches, comparing supervised CNNs, fine-tuned foundation models, and a DNA barcode-specific masking strategy across datasets of varying complexity. While simpler datasets and tasks favor supervised CNNs or fine-tuned transformers, challenging species-level identification demands a paradigm shift towards self-supervised pretraining.

We propose BarcodeBERT, the first self-supervised method for general biodiversity analysis, leveraging a reference library of 1.5 M invertebrate DNA barcodes. This work highlights how dataset specifics and coverage impact model selection, and underscores the role of self-supervised pretraining in achieving high-accuracy DNA barcode-based identification at the species and genus level. Indeed, without the fine-tuning step, BarcodeBERT pretrained on a large DNA barcode dataset outperforms DNABERT and DNABERT-2 on multiple downstream classification tasks.

Full paper available at https://arxiv.org/abs/2311.02401.


To attend this PhD seminar in person, please go to DC 2310. You can also attend virtually using Zoom at https://uwaterloo.zoom.us/j/92398875467.