Improving information extraction from clinical notes with multiple domain models and clustering-based instance selection

Publication Type	dissertation
School or College	School of Computing
Department	Computer Science
Author	Kim, Youngjun
Title	Improving information extraction from clinical notes with multiple domain models and clustering-based instance selection
Date	2017
Description	Extracting information from electronic health records is a crucial task to acquire empirical evidence relevant to patient care. In this dissertation research, I aim to improve two clinical information extraction tasks: medical concept extraction and relation classification. First, my research investigates methods for creating effective concept extractors for specialty clinical notes. I present three new specialty area datasets consisting of Cardiology, Neurology, and Orthopedics clinical notes manually annotated with medical concepts. I analyze the medical concepts in each dataset and compare them with the widely used i2b2 2010 corpus. Then, I create several types of concept extraction models and examine the effects of training supervised learners with specialty area data versus i2b2 data. I find substantial differences in performance across the datasets, and obtain the best results for all three specialty areas by training with both i2b2 and specialty data. I also explore strategies to improve concept extraction on specialty notes with ensemble methods. I compare two types of ensemble methods (voting and stacking) and a domain adaptation model, and show that a stacked learning ensemble of classifiers trained with i2b2 and specialty data yields the best performance. Next, my research aims to improve relation classification using weakly supervised learning. Due to limited labeled data and extremely unbalanced class distributions, medical relation classification systems struggle to achieve good performance on less common relation types, which capture valuable information that is important to identify. I present two clustering-based instance selection methods that acquire a diverse and balanced set of additional training instances from unlabeled data. The first method selects one representative instance from each cluster containing only unlabeled data. The second method selects a counterpart for each training instance using clusters containing both labeled and unlabeled data. These new instance selection methods for weakly supervised learning achieve substantial performance gains for the minority relation classes compared to supervised learning, while yielding comparable performance on the majority relation classes.
Type	Text
Publisher	University of Utah
Dissertation Name	Doctor of Philosophy
Language	eng
Rights Management	(c) Youngjun Kim
Format Medium	application/pdf
ARK	ark:/87278/s60w73cc
Setname	ir_etd
ID	2498577
Reference URL	https://collections.lib.utah.edu/ark:/87278/s60w73cc
