Analysis of Language Embeddings for Classification of Unstructured Pathology Reports

Citation:

Allada, A. Krishna et al., 2021. Analysis of Language Embeddings for Classification of Unstructured Pathology Reports. In International Conference of the IEEE Engineering in Medicine and Biology Society. November. IEEE, p. 4.

Date Presented:

November

Abstract:

A pathology report is one of the most significant medical documents providing interpretive insights into the visual appearance of the patient's biopsy sample. In digital pathology, high-resolution images of tissue samples are stored along with pathology reports. Despite the valuable information that pathology reports hold, they are not used in any systematic manner to promote computational pathology. In this work, we focus on analyzing the reports, which are generally unstructured documents written in English with sophisticated and highly specialized medical terminology. We provide a comparative analysis of various embedding models like BioBERT, Clinical BioBERT, BioMed-RoBERTa and Term Frequency-Inverse Document Frequency (TF-IDF), a traditional NLP technique, as well as the combination of embeddings from pre-trained models with TF-IDF. Our results demonstrate the effectiveness of various word embedding techniques for pathology reports.
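
The abstract mentions combining embeddings from pre-trained models with TF-IDF but does not say how the two are combined. As a rough illustration only, the sketch below assumes simple concatenation of an L2-normalized dense vector with an L2-normalized TF-IDF vector. Everything here is hypothetical: the function names, the smoothed IDF formula, the toy report strings, and `placeholder_dense_embedding`, which is a hash-based stand-in for a real model such as BioBERT and carries no semantic meaning.

```python
import hashlib
import math
from collections import Counter


def l2_normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def tfidf_vectors(docs):
    """Return (vocabulary, TF-IDF vectors) for whitespace-tokenized documents."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed IDF (an assumed variant), so no term gets zero weight.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        length = max(len(toks), 1)  # guard against empty documents
        vectors.append(l2_normalize([counts[t] / length * idf[t] for t in vocab]))
    return vocab, vectors


def placeholder_dense_embedding(text, dim=8):
    """Hypothetical stand-in for a pre-trained model's sentence embedding.

    Derives dim (at most 32) deterministic values from a SHA-256 hash of the
    text. A real pipeline would call a model such as BioBERT here instead.
    """
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:dim]]


def combine_embeddings(dense, sparse):
    """Concatenate normalized dense and TF-IDF vectors (assumed strategy)."""
    return l2_normalize(dense) + l2_normalize(sparse)


# Toy report snippets, invented for illustration only.
reports = [
    "invasive ductal carcinoma of the breast",
    "benign polyp of the colon",
]
vocab, sparse_vectors = tfidf_vectors(reports)
features = [
    combine_embeddings(placeholder_dense_embedding(text), sparse)
    for text, sparse in zip(reports, sparse_vectors)
]
print(len(vocab), len(features[0]))
```

Normalizing each part before concatenation keeps either representation from dominating purely because of scale. The resulting feature vectors could then be passed to any standard classifier.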