
dc.contributor.advisor Pfahringer, Bernhard
dc.contributor.advisor Smith, Tony C.
dc.contributor.advisor Montiel, Jacob
dc.contributor.author Yogarajan, Vithya
dc.date.accessioned 2022-02-25T01:15:33Z
dc.date.available 2022-02-25T01:15:33Z
dc.date.issued 2022
dc.identifier.uri https://hdl.handle.net/10289/14757
dc.description.abstract Recent advances in machine learning-based multi-label classification of medical text can enhance understanding of the human body and support patient care. This research treats the prediction of medical codes from electronic health records (EHRs) as multi-label problems with between 15 and 923 labels. It is motivated by advances in domain-specific language models, which offer better representations of electronic health records and can improve the predictive accuracy of medical codes. The thesis presents an extensive empirical study of language models for binary and multi-label medical text classification. Domain-specific, multi-sourced fastText pre-trained embeddings are introduced, and experimental results show considerable improvements in predictive accuracy when these embeddings are used to represent medical text. Domain-specific transformer models are shown to outperform previous results on multi-label problems with fixed sequence lengths. When processing time is not a constraint for long medical text, TransformerXL is the best-performing model: experimental results show significant improvements over other models, including state-of-the-art results, when TransformerXL is used for downstream tasks such as predicting medical codes. The thesis also considers concatenated language models to handle long medical documents and text from multiple EHR sources. Experimental results show improvements in overall micro and macro F1 scores, achieved with fewer resources. In addition, concatenated domain-specific transformers improve F1 scores for infrequent labels across several multi-label problems, especially those with long-tailed label distributions.
dc.format.mimetype application/pdf
dc.language.iso en
dc.publisher The University of Waikato
dc.rights All items in Research Commons are provided for private study and research purposes and are protected by copyright with all rights reserved unless otherwise indicated.
dc.subject Machine learning
dc.subject Natural language processing
dc.subject Health care applications
dc.subject Transformers
dc.subject Multi-label classification
dc.subject.lcsh Medical records -- Data processing
dc.subject.lcsh Medical records -- Classification
dc.subject.lcsh Medical codes
dc.subject.lcsh Medical informatics
dc.subject.lcsh Natural language processing (Computer science)
dc.subject.lcsh Classification rule mining
dc.subject.lcsh Data mining
dc.subject.lcsh Statistical matching
dc.title Domain-specific language models for multi-label classification of medical text
dc.type Thesis
thesis.degree.grantor The University of Waikato
thesis.degree.level Doctoral
thesis.degree.name Doctor of Philosophy (PhD)
dc.date.updated 2022-02-18T03:10:35Z
pubs.place-of-publication Hamilton, New Zealand
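
To make the multi-label setup described in the abstract concrete, below is a minimal sketch (not taken from the thesis) of the standard formulation for medical-code prediction: one independent sigmoid output per label trained with binary cross-entropy, evaluated with the micro and macro F1 scores the abstract reports. The class name, dimensions, and label count are illustrative assumptions; the document representation could come from averaged fastText embeddings or a pooled transformer encoder.

```python
# Illustrative sketch only, assuming a precomputed document representation.
import torch
import torch.nn as nn
from sklearn.metrics import f1_score

class MultiLabelHead(nn.Module):
    """Maps a document vector (e.g. pooled transformer output or averaged
    fastText embeddings) to one independent logit per medical code."""
    def __init__(self, hidden_dim: int, num_labels: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, num_labels)

    def forward(self, doc_repr: torch.Tensor) -> torch.Tensor:
        return self.classifier(doc_repr)  # raw logits, one per label

# Toy dimensions: 768-d document vectors, 50 candidate codes.
head = MultiLabelHead(hidden_dim=768, num_labels=50)
loss_fn = nn.BCEWithLogitsLoss()  # each label scored independently

doc_repr = torch.randn(8, 768)                   # batch of 8 document vectors
targets = torch.randint(0, 2, (8, 50)).float()   # multi-hot gold code sets

logits = head(doc_repr)
loss = loss_fn(logits, targets)                  # would be backpropagated in training

# Threshold the per-label probabilities at 0.5 and report micro/macro F1.
probs = torch.sigmoid(logits).detach()
preds = (probs > 0.5).int().numpy()
gold = targets.int().numpy()
print("micro F1:", f1_score(gold, preds, average="micro", zero_division=0))
print("macro F1:", f1_score(gold, preds, average="macro", zero_division=0))
```

Using a per-label sigmoid with binary cross-entropy, rather than a softmax over codes, is what makes the task multi-label: each code is decided independently, so a single record can receive many codes, and macro F1 exposes how well infrequent (long-tail) codes are recovered.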

