Improving information extraction from clinical notes with multiple domain models and clustering-based instance selection

Publication Type dissertation
School or College School of Computing
Department Computer Science
Author Kim, Youngjun
Title Improving information extraction from clinical notes with multiple domain models and clustering-based instance selection
Date 2017
Description Extracting information from electronic health records is a crucial task to acquire empirical evidence relevant to patient care. In this dissertation research, I aim to improve two clinical information extraction tasks: medical concept extraction and relation classification. First, my research investigates methods for creating effective concept extractors for specialty clinical notes. I present three new specialty area datasets consisting of Cardiology, Neurology, and Orthopedics clinical notes manually annotated with medical concepts. I analyze the medical concepts in each dataset and compare them with the widely used i2b2 2010 corpus. Then, I create several types of concept extraction models and examine the effects of training supervised learners with specialty area data versus i2b2 data. I find substantial differences in performance across the datasets, and obtain the best results for all three specialty areas by training with both i2b2 and specialty data. I also explore strategies to improve concept extraction on specialty notes with ensemble methods. I compare two types of ensemble methods (voting and stacking) and a domain adaptation model, and show that a stacked learning ensemble of classifiers trained with i2b2 and specialty data yields the best performance. Next, my research aims to improve relation classification using weakly supervised learning. Due to limited labeled data and extremely unbalanced class distributions, medical relation classification systems struggle to achieve good performance on less common relation types, which capture valuable information that is important to identify. I present two clustering-based instance selection methods that acquire a diverse and balanced set of additional training instances from unlabeled data. The first method selects one representative instance from each cluster containing only unlabeled data. The second method selects a counterpart for each training instance using clusters containing both labeled and unlabeled data. These new instance selection methods for weakly supervised learning achieve substantial performance gains for the minority relation classes compared to supervised learning, while yielding comparable performance on the majority relation classes.
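The two instance selection ideas in the description can be illustrated with a minimal Python sketch. This is not the dissertation's implementation: the abstract does not say which clustering algorithm is used or how a representative or counterpart is chosen within a cluster. The sketch assumes cluster assignments are already given, picks the member closest to the cluster centroid as the representative, and picks the nearest unlabeled neighbor in the same cluster as the counterpart. The function names, the `features`, `cluster_ids`, and `labeled` variables, and the toy data are all hypothetical.

```python
from math import dist


def group_by_cluster(cluster_ids):
    """Map each cluster id to the list of instance indices assigned to it."""
    clusters = {}
    for idx, cid in enumerate(cluster_ids):
        clusters.setdefault(cid, []).append(idx)
    return clusters


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def select_representatives(features, cluster_ids, labeled):
    """First idea: one instance per cluster that contains only unlabeled data.

    Assumption: the representative is the member closest to the centroid.
    """
    picked = []
    for _, members in sorted(group_by_cluster(cluster_ids).items()):
        if any(i in labeled for i in members):
            continue  # skip clusters that contain labeled instances
        center = centroid([features[i] for i in members])
        picked.append(min(members, key=lambda i: dist(features[i], center)))
    return picked


def select_counterparts(features, cluster_ids, labeled):
    """Second idea: one unlabeled counterpart per labeled training instance.

    Assumption: the counterpart is the nearest unlabeled instance that
    shares a cluster with the labeled instance.
    """
    clusters = group_by_cluster(cluster_ids)
    picked = {}
    for i in sorted(labeled):
        candidates = [j for j in clusters[cluster_ids[i]] if j not in labeled]
        if candidates:
            picked[i] = min(candidates, key=lambda j: dist(features[i], features[j]))
    return picked


# Toy data: cluster 0 holds only unlabeled points; cluster 1 holds one labeled point.
features = [(0.0, 0.0), (0.2, 0.0), (1.0, 0.0),
            (5.0, 5.0), (5.1, 5.0), (6.0, 5.0)]
cluster_ids = [0, 0, 0, 1, 1, 1]
labeled = {3}  # indices of instances that already carry a relation label

print(select_representatives(features, cluster_ids, labeled))  # [1]
print(select_counterparts(features, cluster_ids, labeled))     # {3: 4}
```

In a weakly supervised setting such as the one described, the selected unlabeled instances would then receive labels from an existing classifier and be added to the training set; that step is omitted here.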
Type Text
Publisher University of Utah
Dissertation Name Doctor of Philosophy
Language eng
Rights Management (c) Youngjun Kim
Format Medium application/pdf
ARK ark:/87278/s60w73cc
Setname ir_etd
ID 2498577
Reference URL https://collections.lib.utah.edu/ark:/87278/s60w73cc