Abstract

Developing machine learning solutions that use medical data can be challenging, as many datasets rely on bespoke database schemas and data processing algorithms. Additionally, for privacy reasons, most medical datasets are not publicly available. However, the open-source HL7 FHIR data standard is increasingly adopted as a format for representing and communicating medical data. FHIR offers interoperability and can be used for electronic medical record storage, payment and insurance platforms, and, more recently, machine learning analytics. We design a data encoding method that operates on the FHIR standard. This method converts the JSON returned from a FHIR server query into a sequence of token IDs, based on the structure of the FHIR data. The token IDs can be used to train transformer language models on medical records. We validate the performance of this method on the open-source MIMIC-IV FHIR dataset for length-of-stay and mortality prediction tasks. We also explore potential methods to address two limitations: long sequence lengths and the discrete tokenization of continuous numeric data.
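As a rough illustration of structure-based tokenization, the sketch below flattens a small FHIR-style JSON resource into path and value tokens and maps them to integer IDs. It is not the method presented in the seminar: the function names (`flatten`, `tokenize`, `encode`), the `<path>` token format, the choice to drop list indices, and the example resource are all assumptions made for illustration.

```python
import json


def flatten(node, path=""):
    """Yield (path, value) pairs for every leaf of a parsed JSON value."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from flatten(value, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for item in node:
            # Assumption: list indices are dropped, so repeated elements
            # share the same path token.
            yield from flatten(item, path)
    else:
        yield path, node


def tokenize(resource):
    """Turn a resource into alternating path tokens and value tokens."""
    tokens = []
    for path, value in flatten(resource):
        tokens.append(f"<{path}>")  # hypothetical token for the JSON path
        tokens.append(str(value))   # leaf value kept as a single token
    return tokens


def encode(tokens, vocab):
    """Map tokens to integer IDs, assigning a new ID to each unseen token."""
    return [vocab.setdefault(token, len(vocab)) for token in tokens]


# Hypothetical, heavily simplified Encounter resource (not from MIMIC-IV).
resource_json = """
{
  "resourceType": "Encounter",
  "status": "finished",
  "class": {"code": "IMP"},
  "type": [{"text": "inpatient"}, {"text": "emergency"}]
}
"""

vocab = {}
tokens = tokenize(json.loads(resource_json))
ids = encode(tokens, vocab)
print(tokens)
print(ids)
```

Keeping every leaf value as one token makes the vocabulary grow with each distinct number or string, which is one way to see why the discrete tokenization of continuous numeric data is listed as a limitation.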

Presenter

Trevor Yu, MASc candidate in Systems Design Engineering

Join in person in EC4 2101A, or online

Attending this seminar will count towards the graduate student seminar attendance milestone!