Title |
Concept bag: a new method for computing similarity |
Publication Type |
dissertation |
School or College |
School of Medicine |
Department |
Biomedical Informatics |
Author |
Bradshaw, Richard L. |
Date |
2016-05 |
Description |
Biomedical data are a rich source of information and knowledge. Not only are they useful for direct patient care, but they may also offer answers to important population-based questions. Creating an environment where advanced analytics can be performed against biomedical data is nontrivial, however. Biomedical data are currently scattered across multiple systems with heterogeneous data, and integrating these data is a bigger task than humans can realistically do by hand; therefore, automatic biomedical data integration is highly desirable but has never been fully achieved. This dissertation introduces new algorithms that were devised to support automatic and semiautomatic integration of heterogeneous biomedical data. The new algorithms incorporate both data mining and biomedical informatics techniques to create "concept bags" that are used to compute similarity between data elements in the same way that "word bags" are compared in data mining. Concept bags are composed of controlled medical vocabulary concept codes that are extracted from text using named-entity recognition software. To test the new algorithm, three biomedical text similarity use cases were examined: automatically aligning data elements between heterogeneous data sets, determining degrees of similarity between medical terms using a published benchmark, and determining similarity between ICU discharge summaries. The method is highly configurable and 5 different versions were tested. The concept bag method performed particularly well aligning data elements and outperformed the compared algorithms by more than 5%. Another configuration that included hierarchical semantics performed particularly well at matching medical terms, meeting or exceeding 30 of 31 other published results using the same benchmark. Results for the third scenario of computing ICU discharge summary similarity were less successful. Correlations between multiple methods were low, including between terminologists.
The concept bag algorithms performed consistently and comparatively well and appear to be viable options for multiple scenarios. New applications of the method and ideas for improving the algorithm are being discussed for future work, including several performance enhancements, configuration-based enhancements, and concept vector weighting using the TF-IDF formulas. |
Type |
Text |
Publisher |
University of Utah |
Subject MESH |
Algorithms; Data Mining; Vocabulary, Controlled; Systematized Nomenclature of Medicine; Unified Medical Language System; Benchmarking; Data Accuracy; Patient Care; Medical Informatics; Medical Informatics Applications; Medical Informatics Computing; Electronic Health Records; American Recovery and Reinvestment Act; Semantics |
Dissertation Institution |
University of Utah |
Dissertation Name |
Doctor of Philosophy |
Language |
eng |
Relation is Version of |
Digital version of Concept Bag: A New Method for Computing Similarity |
Rights Management |
Copyright © Richard L. Bradshaw 2016 |
Format |
application/pdf |
Format Medium |
application/pdf |
Format Extent |
3,784,876 bytes |
Source |
Original in Marriott Library Special Collections |
ARK |
ark:/87278/s61p18ds |
Setname |
ir_etd |
ID |
197366 |
Reference URL |
https://collections.lib.utah.edu/ark:/87278/s61p18ds |