Concept bag: a new method for computing similarity

Publication Type dissertation
School or College School of Medicine
Department Biomedical Informatics
Author Bradshaw, Richard L.
Title Concept bag: a new method for computing similarity
Date 2016-05
Description Biomedical data are a rich source of information and knowledge. Not only are they useful for direct patient care, but they may also offer answers to important population-based questions. Creating an environment where advanced analytics can be performed against biomedical data is nontrivial, however. Biomedical data are currently scattered across multiple systems with heterogeneous data, and integrating these data is a bigger task than humans can realistically do by hand; therefore, automatic biomedical data integration is highly desirable but has never been fully achieved. This dissertation introduces new algorithms that were devised to support automatic and semiautomatic integration of heterogeneous biomedical data. The new algorithms incorporate both data mining and biomedical informatics techniques to create "concept bags" that are used to compute similarity between data elements in the same way that "word bags" are compared in data mining. Concept bags are composed of controlled medical vocabulary concept codes that are extracted from text using named-entity recognition software. To test the new algorithm, three biomedical text similarity use cases were examined: automatically aligning data elements between heterogeneous data sets, determining degrees of similarity between medical terms using a published benchmark, and determining similarity between ICU discharge summaries. The method is highly configurable and 5 different versions were tested. The concept bag method performed particularly well aligning data elements and outperformed the compared algorithms by more than 5%. Another configuration that included hierarchical semantics performed particularly well at matching medical terms, meeting or exceeding 30 of 31 other published results using the same benchmark. Results for the third scenario of computing ICU discharge summary similarity were less successful. Correlations between multiple methods were low, including between terminologists.
The concept bag algorithms performed consistently and comparatively well and appear to be viable options for multiple scenarios. New applications of the method and ideas for improving the algorithm are being discussed for future work, including several performance enhancements, configuration-based enhancements, and concept vector weighting using the TF-IDF formulas.
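The description above says concept bags are compared the way word bags are, but it does not name a specific similarity measure. As a minimal sketch only, assuming set-based Jaccard overlap (one common bag-comparison measure, not necessarily the one used in the dissertation), the idea might look like the following. The function name and the concept codes are hypothetical placeholders, not real vocabulary codes, and the named-entity recognition step that would produce the codes is not shown.

```python
def jaccard_similarity(bag_a, bag_b):
    """Jaccard overlap between two bags of concept codes.

    Assumption: each bag is treated as a set of distinct codes.
    Returns 0.0 when both bags are empty.
    """
    a, b = set(bag_a), set(bag_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical concept bags for two data elements (placeholder codes).
element_1 = {"CODE_A", "CODE_B", "CODE_C"}
element_2 = {"CODE_B", "CODE_C", "CODE_D"}

similarity = jaccard_similarity(element_1, element_2)
print(similarity)  # 2 shared codes out of 4 distinct codes
```

Treating the bags as sets ignores how often a code occurs; a weighted variant, such as the TF-IDF weighting mentioned as future work, would need a vector representation instead.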
Type Text
Publisher University of Utah
Subject MESH Algorithms; Data Mining; Vocabulary, Controlled; Systematized Nomenclature of Medicine; Unified Medical Language System; Benchmarking; Data Accuracy; Patient Care; Medical Informatics; Medical Informatics Applications; Medical Informatics Computing; Electronic Health Records; American Recovery and Reinvestment Act; Semantics
Dissertation Institution University of Utah
Dissertation Name Doctor of Philosophy
Language eng
Relation is Version of Digital version of Concept Bag: A New Method for Computing Similarity
Rights Management Copyright © Richard L. Bradshaw 2016
Format Medium application/pdf
Format Extent 3,784,876 bytes
Source Original in Marriott Library Special Collections
ARK ark:/87278/s61p18ds
Setname ir_etd
ID 197366
Reference URL https://collections.lib.utah.edu/ark:/87278/s61p18ds