PhD Seminar • Systems and Networking • On the Impact of Learning Rate Schedules on Performance and Transferability of Transformer-Based Traffic Classifiers

Friday, March 14, 2025 1:00 pm - 2:00 pm EDT (GMT -04:00)

Please note: This PhD seminar will take place in DC 1302.

Elham Akbari Azirani, PhD candidate
David R. Cheriton School of Computer Science

Supervisor: Professor Raouf Boutaba

Deep traffic classifiers have demonstrated outstanding performance across various use cases, such as malware detection, website fingerprinting attacks on Tor, and mobile application identification. However, their performance is known to decline significantly on new datasets and out-of-distribution (OOD) data. Because of these models’ black-box nature, the decline in performance is difficult to explain or predict without in-distribution labeled data, making them unreliable for real-world deployments.

Previous research in deep learning theory has emphasized the importance of training-process choices, as opposed to model design, for the generalizability of large deep models. We therefore assess the training process by investigating the impact of various learning rate schedules on several transformer-based architectures tailored for time series classification. With an emphasis on real-world transferability between traffic datasets, we select two traffic datasets collected independently by different research teams, each from a distinct large-scale network in 2021. We then preprocess these datasets into comparable sets by extracting common features and aligning units, labels, and data collection methods. The resulting pair of datasets exhibits a distribution shift that is large enough to be realistic yet small enough for transferability to remain relevant, serving as a testbed for evaluating transferred performance. A schedule of the kind studied here is sketched below.
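
For illustration only, since the abstract does not specify which schedules or frameworks are used: a minimal PyTorch sketch of one commonly studied learning rate schedule, linear warmup followed by cosine decay, as it might be attached to a generic transformer classifier. The function name, hyperparameter values, and the assumed model are hypothetical.

    import math
    import torch

    # Hypothetical sketch: linear warmup followed by cosine decay, one of the
    # commonly studied learning rate schedules; all values are illustrative.
    def make_warmup_cosine(optimizer, warmup_steps, total_steps):
        def lr_lambda(step):
            if step < warmup_steps:
                # Linear warmup from 0 to the base learning rate
                return step / max(1, warmup_steps)
            # Cosine decay from the base learning rate down to 0
            progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
            return 0.5 * (1.0 + math.cos(math.pi * progress))
        return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

    # Usage with any transformer-based classifier (model assumed to exist):
    # optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    # scheduler = make_warmup_cosine(optimizer, warmup_steps=1000, total_steps=50000)
    # ... call scheduler.step() after each optimizer.step()
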

Our investigation shows that adequate learning rate schedules are crucial for transformer-based models to succeed and that larger learning rates facilitate training. Moreover, higher same-dataset performance does indeed translate into higher transferred performance; however, the gap between the two remains large even when best practices are followed.