Please note: This PhD seminar will take place in DC 2314 and online.
Bing Hu, PhD candidate
David R. Cheriton School of Computer Science
Supervisors: Professors Anita Layton, Helen Chen
We present DELBERT-2, a pretrained fingerprint language model that converts multiple sparse molecular fingerprints into a single dense token sequence and leverages self-supervised pretraining to improve virtual screening for small-molecule protein binders. Using a ModernBERT encoder trained with masked language modeling on more than 10 million unlabeled molecules from MOSES, ChEMBL, PubChem, and public DNA-encoded library (DEL) collections, DELBERT-2 learns transferable molecular representations that improve DEL binder prediction under distribution shift.
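The core idea of turning several sparse fingerprints into one dense token sequence, then corrupting it for masked language modeling, can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: the function names, the 15% mask rate, and the per-fingerprint vocabulary offsets are assumptions for the example.

```python
import numpy as np

def fingerprints_to_tokens(fp_bits_list, offsets):
    """Map the on-bits of several sparse fingerprints into one token sequence.

    Each fingerprint type gets its own vocabulary offset so tokens from
    different fingerprints never collide in the shared vocabulary.
    fp_bits_list: one list of on-bit indices per fingerprint type.
    offsets: vocabulary offset for each fingerprint type.
    """
    tokens = []
    for bits, offset in zip(fp_bits_list, offsets):
        tokens.extend(int(b) + offset for b in bits)
    return tokens

def mask_tokens(tokens, mask_id, mask_prob=0.15, rng=None):
    """Standard MLM corruption: replace ~15% of tokens with a [MASK] id."""
    rng = rng or np.random.default_rng(0)
    masked = list(tokens)
    labels = [-100] * len(tokens)  # -100 = position ignored by the loss
    for i in range(len(tokens)):
        if rng.random() < mask_prob:
            labels[i] = masked[i]  # remember the original token to predict
            masked[i] = mask_id
    return masked, labels

# Two toy fingerprints: an ECFP-like one (2048 bits) and a MACCS-like one (167 bits)
ecfp_on = [7, 512, 1999]
maccs_on = [3, 100]
tokens = fingerprints_to_tokens([ecfp_on, maccs_on], offsets=[0, 2048])
masked, labels = mask_tokens(tokens, mask_id=9999)
```

The offset trick is what makes the sequence "unified": every on-bit from every fingerprint maps to a unique token id, so one encoder vocabulary covers all fingerprint types at once.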
We evaluate DELBERT-2 across six AIRCHECK targets under three complementary out-of-distribution (OOD) validation protocols: hierarchical cluster splits, library splits, and building-block splits. Across these settings, DELBERT-2 generally improves PR-AUC and NDCG@1000 relative to strong LightGBM ensemble baselines and matched transformer models trained from scratch, with the largest gains appearing in stricter OOD regimes.
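For readers unfamiliar with the ranking metric above, NDCG@k rewards a model for placing true binders near the top of its ranked screening list. A minimal version for binary binder labels (an illustrative sketch, not the evaluation code used in the talk) looks like this:

```python
import numpy as np

def ndcg_at_k(y_true, scores, k=1000):
    """NDCG@k for binary labels: 1.0 means all binders ranked at the top.

    y_true: 0/1 binder labels; scores: model scores, higher = more likely binder.
    """
    order = np.argsort(scores)[::-1][:k]          # top-k predictions by score
    gains = np.asarray(y_true)[order]
    discounts = 1.0 / np.log2(np.arange(2, len(gains) + 2))
    dcg = float(np.sum(gains * discounts))
    ideal = np.sort(np.asarray(y_true))[::-1][:k]  # best possible ordering
    idcg = float(np.sum(ideal * discounts[: len(ideal)]))
    return dcg / idcg if idcg > 0 else 0.0
```

Unlike PR-AUC, which summarizes the whole precision-recall curve, NDCG@1000 focuses on the top of the ranking, which matches how virtual screening hits are actually triaged.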
To attend this PhD seminar in person, please go to DC 2314. You can also attend virtually on MS Teams.