
Domain-specific language models for multi-label classification of medical text

Recent advances in machine learning-based multi-label classification of medical text can enhance understanding of the human body and support patient care. This research treats the prediction of medical codes from electronic health records (EHRs) as multi-label problems, where the number of labels ranges from 15 to 923. It is motivated by advances in domain-specific language models, which can better understand and represent electronic health records and improve the predictive accuracy of medical codes. The thesis presents an extensive empirical study of language models for binary and multi-label medical text classification. Domain-specific multi-sourced fastText pre-trained embeddings are introduced. Experimental results show considerable improvements in predictive accuracy when such embeddings are used to represent medical text. It is shown that domain-specific transformer models improve results for multi-label problems with fixed sequence length. If processing time is not an issue for long medical text, TransformerXL is the best model to use. Experimental results show significant improvements over other models, including state-of-the-art results, when TransformerXL is used for downstream tasks such as predicting medical codes. The thesis also considers concatenated language models to handle long medical documents and text data from multiple EHR sources. Experimental results show improvements in overall micro and macro F1 scores, achieved with fewer resources. In addition, it is shown that concatenated domain-specific transformers improve F1 scores for infrequent labels across several multi-label problems, especially those with long-tail labels.
Type of thesis
The University of Waikato
All items in Research Commons are provided for private study and research purposes and are protected by copyright with all rights reserved unless otherwise indicated.