Automatic domain adaptation of word sense disambiguation based on sublanguage semantic schemata applied to clinical narrative

Title	Automatic domain adaptation of word sense disambiguation based on sublanguage semantic schemata applied to clinical narrative
Publication Type	dissertation
School or College	School of Medicine
Department	Biomedical Informatics
Author	Patterson, Olga
Date	2012-05
Description	Domain adaptation of natural language processing systems is challenging because it requires human expertise. While manual effort is effective in creating a high quality knowledge base, it is expensive and time consuming. Clinical text adds another layer of complexity to the task due to privacy and confidentiality restrictions that hinder the ability to share training corpora among different research groups. Semantic ambiguity is a major barrier for effective and accurate concept recognition by natural language processing systems. In my research I propose an automated domain adaptation method that utilizes sublanguage semantic schema for all-word word sense disambiguation of clinical narrative. According to the sublanguage theory developed by Zellig Harris, domain-specific language is characterized by a relatively small set of semantic classes that combine into a small number of sentence types. Previous research relied on manual analysis to create language models that could be used for more effective natural language processing. Building on previous semantic type disambiguation research, I propose a method of resolving semantic ambiguity utilizing automatically acquired semantic type disambiguation rules applied on clinical text ambiguously mapped to a standard set of concepts. This research aims to provide an automatic method to acquire Sublanguage Semantic Schema (S3) and apply this model to disambiguate terms that map to more than one concept with different semantic types. The research is conducted using unmodified MetaMap version 2009, a concept recognition system provided by the National Library of Medicine, applied on a large set of clinical text. The project includes creating and comparing models, which are based on unambiguous concept mappings found in seventeen clinical note types. The effectiveness of the final application was validated through a manual review of a subset of processed clinical notes using recall, precision and F-score metrics.
Type	Text
Publisher	University of Utah
Subject	Applied sciences; health and environmental sciences; clinical; domain adaptation; natural language processing; Nlp; sublanguage; Word sense disambiguation; Wsd
Subject MESH	Medical Informatics; Electronic Health Records; Health Insurance Portability and Accountability Act; Natural Language Processing; Unified Medical Language System; Sublanguage Semantic Schema; Word Sense Disambiguation
Dissertation Institution	University of Utah
Dissertation Name	Doctor of Philosophy
Language	eng
Relation is Version of	Digital reproduction of Automatic Domain Adaptation of Word Sense Disambiguation Based on Sublanguage Semantic Schemata Applied to Clinical Narrative. Spencer S. Eccles Health Sciences Library. Print version available at J. Willard Marriott Library Special Collections.
Rights Management	Copyright © Olga Patterson 2012
Format	application/pdf
Format Medium	application/pdf
Format Extent	1,159,278 bytes
Source	Original in Marriott Library Special Collections.
ARK	ark:/87278/s6fr34tp
DOI	https://doi.org/doi:10.26053/0H-Q1Z7-GR00
Setname	ir_etd
ID	196387
OCR Text	AUTOMATIC DOMAIN ADAPTATION OF WORD SENSE DISAMBIGUATION BASED ON SUBLANGUAGE SEMANTIC SCHEMATA APPLIED TO CLINICAL NARRATIVE
by
Olga Patterson
A dissertation submitted to the faculty of The University of Utah in partial fulfillment of the requirements for the degree of Doctor of Philosophy
Biomedical Informatics
The University of Utah
May 2012
Copyright © Olga Patterson 2012
All Rights Reserved
The University of Utah Graduate School
STATEMENT OF DISSERTATION APPROVAL
The dissertation of Olga Patterson has been approved by the following supervisory committee members:
John F. Hurdle, Chair (Date Approved: 3/14/2012)
Bruce Bray, Member (Date Approved: 3/21/2012)
Lewis Frey, Member (Date Approved: 3/22/2012)
Stephane Meystre, Member (Date Approved: none given)
Ellen Riloff, Member (Date Approved: 3/22/2012)
and by Joyce A. Mitchell, Chair of the Department of Biomedical Informatics, and by Charles A. Wight, Dean of The Graduate School.
ABSTRACT
Domain adaptation of natural language processing systems is challenging because it requires human expertise. While manual effort is effective in creating a high quality knowledge base, it is expensive and time consuming. Clinical text adds another layer of complexity to the task due to privacy and confidentiality restrictions that hinder the ability to share training corpora among different research groups. Semantic ambiguity is a major barrier for effective and accurate concept recognition by natural language processing systems.
In my research I propose an automated domain adaptation method that utilizes sublanguage semantic schema for all-word word sense disambiguation of clinical narrative. According to the sublanguage theory developed by Zellig Harris, domain-specific language is characterized by a relatively small set of semantic classes that combine into a small number of sentence types. Previous research relied on manual analysis to create language models that could be used for more effective natural language processing.
Building on previous semantic type disambiguation research, I propose a method of resolving semantic ambiguity utilizing automatically acquired semantic type disambiguation rules applied on clinical text ambiguously mapped to a standard set of concepts. This research aims to provide an automatic method to acquire Sublanguage Semantic Schema (S3) and apply this model to disambiguate terms that map to more than one concept with different semantic types. The research is conducted using unmodified MetaMap version 2009, a concept recognition system provided by the National Library of Medicine, applied on a large set of clinical text. The project includes creating and comparing models, which are based on unambiguous concept mappings found in seventeen clinical note types. The effectiveness of the final application was validated through a manual review of a subset of processed clinical notes using recall, precision and F-score metrics.
CONTENTS
ABSTRACT
LIST OF FIGURES
LIST OF TABLES
ACKNOWLEDGMENTS
CHAPTERS
1. INTRODUCTION
  1.1 Problem Statement
  1.2 Main Objectives
    1.2.1 Aim I
    1.2.2 Aim II
    1.2.3 Aim III
  1.3 Relevance to Biomedical Informatics
2. BACKGROUND
  2.1 Natural Language Processing
    2.1.1 Word Sense Disambiguation
  2.2 Sublanguage Theory
  2.3 NLP in Clinical and Biomedical Domains
    2.3.1 Systems Based on Sublanguage Principles
    2.3.2 Biomedical Language Processing Systems
    2.3.3 Clinical Language Processing Systems
  2.4 Project Statement
  2.5 Resources
    2.5.1 Computational Resources
    2.5.2 Corpus
    2.5.3 MetaMap
3. SUBLANGUAGE
  3.1 Methods
    3.1.1 Document Clustering
    3.1.2 Semantic Pattern Distribution
  3.2 Discussion
4. SUBLANGUAGE SEMANTIC SCHEMA SYSTEM
  4.1 Sublanguage Semantic Schema
  4.2 System Design
    4.2.1 Training Module
      4.2.1.1 Feature Vector Extraction
    4.2.2 Patterns
    4.2.3 Semantic Type Classification Model
      4.2.3.1 Sparse File Format
      4.2.3.2 Machine Learning Algorithm
  4.3 System Application
  4.4 Validation
    4.4.1 Annotations
      4.4.1.1 Sample Selection
      4.4.1.2 Annotation Process
  4.5 Measuring Performance
    4.5.1 Model Comparison
  4.6 Discussion
5. SYSTEM IMPROVEMENT
  5.1 Error Analysis
  5.2 Optimization
    5.2.1 Pattern Matching
    5.2.2 Classification Accuracy Improvement
    5.2.3 Determining Final Predictions
  5.3 Discussion
6. DISCUSSION
  6.1 Limitations
  6.2 Opportunities for Future Work
7. CONCLUSION
APPENDICES
A. SEMANTIC TYPES
B. INDEX
REFERENCES
LIST OF FIGURES
3.1 Cumulative relative frequency of the different semantic types used in Case Management Discharge Plan (CMD), Family Practice Clinic notes (FPC), and MEDLINE abstracts (MLN). The curves of other clinical note types fell between CMD and FPC lines and were excluded from the figure for visual clarity.
3.2 Cumulative relative frequency of patterns of format 8 for Ambulatory Nursing Notes (ANN), Operative Report (OPR), and MEDLINE abstracts (MLN). The curves of other clinical note types fell between ANN and OPR lines and were excluded from the figure for visual clarity.
4.1 S3 System training data flow.
4.2 Examples of feature vectors.
4.3 S3 System word sense disambiguation flow.
4.4 S3 System application data flow.
5.1 Classification accuracy as a function of the number of features and number of records for Admission History and Physical.
5.2 Classification accuracy as a function of the number of features and number of records for Cardiology Clinical Notes.
5.3 Classification accuracy as a function of the number of features and number of records for Discharge Summaries.
5.4 Classification accuracy as a function of the number of features and number of records for Social Service Notes.
5.5 Training processing time as a function of the number of features and number of records for Admission History and Physical.
5.6 Training processing time as a function of the number of features and number of records for Cardiology Clinical Notes.
5.7 Training processing time as a function of the number of features and number of records for Discharge Summaries.
5.8 Training processing time as a function of the number of features and number of records for Social Service Notes.
LIST OF TABLES
2.1 The note types and the corresponding file counts used in this project.
3.1 Clustering results of the data set consisting of 685 documents per document type. Values that represent less than 1% of the total note type count were excluded for visual clarity.
3.2 Results of hierarchical cluster analysis of the set of 17 document types (n=3,000 notes per set). The values are the counts of all documents of the particular type that were grouped into each of the clusters. Values that represent less than 1% of the total note type count were excluded for visual clarity.
3.3 Results of clustering 12 note types with 3,000 documents of each type. Values that represent less than 1% of the total note type count were excluded for clarity.
3.4 The formats of the patterns found within the window of size 3. The example evaluates the sentence "The patient[podg] reported[acty] severe[qlco] upper[spco] quadrant[spco] abdominal[blor] pain[sosy]" with the term "upper" as the term of interest.
4.1 Annotated corpus description.
4.2 Full data description for the four note types that were used in validation.
4.3 Accuracy of S3 System as tested on a manually annotated set of sentences with format level threshold of 2 and classification probability threshold of 0.1.
4.4 Comparison of accuracy of S3 System on Admission History and Physical and Discharge Summaries. Disambiguation was performed with pattern format Level 2 and classification probability threshold of 0.1.
4.5 Comparison of accuracy of the S3 System on Cardiology Clinic Notes (CCN) and Social Service Notes (SSN). Disambiguation was performed with pattern format Level 2 and classification probability threshold of 0.1. The values in parentheses represent the 95% confidence interval.
4.6 MetaMap performance as applied to the manually annotated set.
5.1 Clustering purity for all terms in the reference standard corpus when grouping into 10 clusters.
5.2 Accuracy of S3 System as tested on a manually annotated set of sentences with format level threshold of 2 and classification probability threshold of 0.1.
5.3 List of mismapped terms found in the validation corpus. Italicized terms were mapped unambiguously.
5.4 Feature counts based on different information gain thresholds.
ACKNOWLEDGMENTS
Words cannot express the depth of my gratitude to the people who made this work possible, so I am humbly saying thank you to:
John F. Hurdle - my graduate advisor, for his active mentorship, patience, and countless hours spent discussing my project.
Bruce Bray, Ellen Riloff, Stephane Meystre, Peter Haug, Dina Demner-Fushman and Lewis Frey - the members of my graduate committee, for their time and expertise that they provided over the years.
Sean Igo - for his programming support and cheerful acceptance of any programming challenges that I had to throw at him as I was building my system prototype.
Denise Beaudoin - for her time and effort adjudicating the reference standard.
Jenifer Williams and Tyler Forbush - for their annotation work on creating the reference standard.
Jason Patterson - my husband, for supporting me and providing financially and emotionally for our family while I was busy pursuing my dreams.
Larisa and Viktor Yatsenko - my parents, for being my inspiration and serving as examples that anything is possible as long as you put enough effort into achieving your goals.
This research has been supported by the NLM under grants T15LM007124 (fellowship), 5R21LM009967-02, and 3R21LM009967-01S1 (ARRA). An allocation of computer time from the Center for High Performance Computing at the University of Utah is gratefully acknowledged.
CHAPTER 1
INTRODUCTION
Imagine a clinical world in which clinicians dictate all patient information using natural speech into an Electronic Medical Record (EMR) system; the speech is automatically parsed into a structured form and the meaningful data is stored as database entries. Unfortunately, such a world is still in the realm of science fiction. The main reason such a world has not materialized despite decades of research is that phonetic, lexical, syntactic, and semantic ambiguity is characteristic of natural speech. Advances have been made to resolve each of these types of ambiguities, and simpler subproblems have been solved at a satisfactory level [1, 2]. However, an accurate, general-purpose, adaptable concept recognition system is still a hope for the future. This dissertation project tackles the problem of semantic ambiguity of the natural text found in clinical notes.
Building on previous research on semantic type disambiguation, I propose a method called the Sublanguage Semantic Schema (S3) to resolve semantic ambiguity utilizing automatically acquired semantic type disambiguation rules applied to clinical text that was ambiguously mapped to concepts from a standard terminology. MetaMap, a powerful system designed to map terms from text to the UMLS Metathesaurus, is used to illustrate the feasibility of a practical implementation of my proposed method.
1.1 Problem Statement
Clinical language is complex. At first look, it is inconsistent. It is unstructured and often ungrammatical. Individual clinicians have their personal opinions on what should and should not be noted about the patient in the medical record. The content and structure of the narrative depend on the type of service provided, kind of document, clinical setting, author's role, and subject matter domain [3]. In the current research, I focus on one document type, the clinical note. Regardless of the purpose of a clinical note, it potentially contains clinically relevant information in the free text that is created in the process of patient care [4].
Most EMR systems have structured forms and checklists that physicians use to record patient data. However, the healthcare environment is broad, often unpredictable, and nuanced. As of today, there are no "off-the-shelf" systems that are able to provide a user-friendly way to report all potentially important information about all patients, or the care that they received [5]. Such lack of essential functionality is the reason why free text persists as a method of keeping clinical records complete. Forcing clinicians to use structured forms that enable computer-friendly data entry has not been successful and usually causes strong user resistance [6].
Since the early years following the introduction of electronic medical record systems in clinical practice in the 1960s, information technology researchers have approached the clinical narrative as gold ore ready to be processed. Similar to gold mining, extracting meaningful information from clinical narratives has been a labor intensive process. Previous research efforts have identified "golden nuggets" using manually created rules for specific research questions [2]. As the technology and algorithms improve, developing new, more general, more precise, and more accurate methods is becoming more difficult. The latest wave of NLP research is directed at finding new ways of learning new information utilizing existing knowledge sources and technologies.
Since the EMR systems have been introduced, large repositories of clinical text have been accumulated. These repositories can be used for information extraction through Natural Language Processing (NLP). Such language processing methods often rely on human annotated text. In the general language processing world, several sets of annotated texts have been created and made available to researchers for shared use [7, 8]. However, in the clinical world this approach is complicated by the sensitive nature of the texts. Clinical texts often contain identifiable data about a patient that are covered by security and confidentiality requirements such as the Health Insurance Portability and Accountability Act (HIPAA) of 1996. In this environment, only a small number of people can have access to the texts. Such access restrictions make obtaining manual annotations difficult, because the process cannot be outsourced to a third party. Each organization that attempts annotation projects is challenged with finding trusted human resources within the organization.
When an NLP system is developed, it is optimized for the text that was used to create its knowledge base.
Therefore, when a system is transferred into a new clinical environment, the knowledge base has to be adjusted in order to achieve the highest level of performance. Since the knowledge base acquisition is labor intensive, system adaptation is a costly and time consuming task.
The knowledge-acquisition bottleneck is the major barrier for clinical NLP system implementation [9]. Unable to adapt an existing system, large organizations develop their own in-house systems that are not shared across organizations, whereas smaller medical facilities are forced to use off-the-shelf systems that have limited applicability and are not optimized for the organization's specific setting. Therefore, there is a clear need to develop methods of automatic knowledge base acquisition in order to enable system portability into a new clinical setting.
1.2 Main Objectives
Information retrieval, information extraction, question answering, machine translation, and most other natural language processing tasks rely on accurate concept recognition. However, clinical text is highly ambiguous; it challenges existing concept recognition systems. There is a clear need for a concept recognition system that serves a general purpose and is highly accurate. Enabling a fast, accurate and economical method of domain adaptation of a concept recognition system is the main vision of the current research.
Clinical NLP experts have always assumed that clinical language is not homogeneous but varies depending on the clinical setting. However, this assumption has never been tested on a wide range of clinical text. As the first step in this project, I show that even within the same organization, the clinical language varies.
1.2.1 Aim I
Demonstrate language variability across various clinical settings within the same organization.
Research question 1.1: Is there clear variation among the various clinical sublanguages?
Research question 1.2: Does the language variation depend on the clinical setting or a specific clinical subdomain?
When addressing Aim I, I identified the natural grouping of clinical text that resulted from unsupervised clustering of documents of different types, as described in Chapter 3.
1.2.2 Aim II
Design and validate a tool that automatically acquires and uses sublanguage characteristics for word sense disambiguation through semantic type disambiguation of terms mapped to multiple concepts with different semantic types.
Research question 2.1: Does the developed system work well for clinical term disambiguation in a range of clinical note types as compared to a manually annotated test set?
Research question 2.2: Does the system perform better than a baseline method such as MetaMap?
When addressing Aim II, I designed a system prototype and evaluated its performance on a set of manually annotated sentences extracted from clinical notes of four note types, as described in Chapter 4.
1.2.3 Aim III
Identify performance-improving steps on a range of clinical notes.
Research question 3.1: Can the feature space be substantially decreased without a significant loss of accuracy of the classification model?
Research question 3.2: Does the sublanguage feature space derived for those terms that were unambiguously mapped differ from the feature space of ambiguous terms?
When addressing Aim III, I define preprocessing and postprocessing steps that could potentially lead to improved system performance, as described in Chapter 5.
1.3 Relevance to Biomedical Informatics
According to Bernstam and colleagues, one of the main goals of biomedical informatics is to bridge the gap between the human information needs and the capabilities of the current technology [10]. Clinical informatics is a major part of the larger field of biomedical informatics. At this time, most of the information entered into a patient record in free text format is unreachable for computerized processing.
Accurate, robust, and fast natural language processing would enable a vast range of possible uses of data, from decision support at the point of care to reporting, surveillance, and research. My research advances the current language processing technology by enabling automatic domain adaptation of a computerized concept identification system. Improving portability of existing systems would promote collaboration between facilities and research groups.
CHAPTER 2
BACKGROUND
2.1 Natural Language Processing
NLP has a long history of both research projects and practical implementations. It is traditionally defined as computerized processing of text. NLP is a very broad field that incorporates a large variety of tasks that differ in their complexity and specificity [2]. The term "natural language processing" is often equated to computational linguistics; however, these terms are not interchangeable [11]. Unlike computational linguistics, NLP approaches text as a source of data about the state of the world. It is characterized by developing and applying computational methods for a particular task and a practical purpose.
The possibility of fully automatic language processing was first suggested in Weaver's memorandum that introduced the idea of machine translation [12]. Since then the field of NLP has grown to include a variety of methods and tasks of different levels of computational complexity and scope.
The list of NLP tasks ranges from low level general tasks, such as tokenization and sentence segmentation, to problem-specific tasks such as information extraction and question answering. High-level tasks rely on accurate performance of lower level tasks.
For example, the accuracy of information extraction depends on correct parsing (which in turn depends on morphological segmentation, tokenization, sentence segmentation, and part of speech tagging), named entity recognition, concept recognition (which relies on word sense disambiguation), co-reference resolution, and relationship extraction. The current state-of-the-art systems that solve low level tasks achieve high accuracy and, when used in a limited domain, perform comparably to a human annotator [13]. Word sense disambiguation, as one of the components of accurate concept recognition, is the focus of the current project.
2.1.1 Word Sense Disambiguation
Concept recognition (or term identification) is regarded as the single most important factor in accurate information extraction [14]. Traditional linguistic theory studies the language form and meaning as two separate though related elements of language. It recognizes that the same meaning can be expressed in various physical forms. It also posits that the same physical form can express a number of meanings. Therefore, determining the correct semantic interpretation that is implied by a specific textual representation (the physical form) in a specific context is an intermediary step in the process of concept recognition. The computational approach to this task is called Word Sense Disambiguation (WSD). The WSD process involves selecting one meaning out of a discrete number of known possible senses for a specific term [9, 15, 16].
The need for WSD arises as a consequence of semantic ambiguity that is characteristic of human language. The difficulty of the WSD task varies depending on the availability of an electronic dictionary that is used to create the sense inventory for each term, as well as on the similarity of the possible senses.
If the available dictionary does not have the true meaning of the term as one of the possible definitions, it is impossible for any WSD algorithm to identify the correct sense because it will be missing from the sense inventory. Like humans, a computerized algorithm has a hard time differentiating between similar concepts; therefore, the subtler the difference between meanings, the more errors will result from disambiguation [15].

The most common steps to perform WSD are as follows:

1. Identify a list of specific target words for the system;
2. Create a sense inventory for each target word;
3. Extract (or create) examples that use the target word in one of the identified senses;
4. Label each instance of the target words with one of the senses from the sense inventory using manual annotation;
5. Employ a machine learning or statistical approach to learn WSD rules using sample sentences;
6. Measure the system performance by applying it to another manually annotated set of examples [17].

The approaches to WSD can be grouped depending on several factors:

1. Based on the method of disambiguation model acquisition, a WSD algorithm can be: a) rule-based, b) example-based [18], or c) statistical [19, 20].
2. Based on the scope of disambiguation, a WSD algorithm targets: a) a restricted target word set, or b) all words.
3. Based on the extent of manual annotation performed to create the disambiguation model, a WSD algorithm can be: a) supervised, b) unsupervised, c) knowledge-based, or d) hybrid.

Rule-based systems derive their knowledge base from manually created rules for disambiguation. Example-based systems use example databases that contain examples of sentences that use a target word in one of the identified senses. Disambiguation is performed by finding the most similar sentence example. A variety of methods can be employed to select the most informative examples [18].
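As a minimal illustration of the example-based approach described above, the sketch below assigns the sense of the stored example sentence that is most similar to the target context. The example database, the sense labels, and the choice of Jaccard word overlap as the similarity measure are all assumptions made for illustration; they are not taken from the systems cited in [18].

```python
import re

# Hypothetical example database for the ambiguous word "cold":
# (sense label, example sentence) pairs invented for this sketch.
EXAMPLES = [
    ("common_cold", "patient reports a cold with cough and runny nose"),
    ("low_temperature", "extremities are cold and pale to the touch"),
]


def tokens(text):
    """Lowercase a sentence and return its set of alphabetic word tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard(a, b):
    """Jaccard overlap between two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def disambiguate(context, examples=EXAMPLES):
    """Return the sense of the stored example most similar to the context."""
    ctx = tokens(context)
    best_sense, _ = max(examples, key=lambda ex: jaccard(ctx, tokens(ex[1])))
    return best_sense
```

With this toy database, a context that shares more words with the first example receives the sense `common_cold`, and one closer to the second receives `low_temperature`. A realistic system would need far more examples per sense and a less brittle similarity measure.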
Statistical WSD systems rely on computational algorithms to build disambiguation models using lexical features of the target word context.

Supervised WSD is one of the widely used approaches for the development of WSD systems. It applies supervised machine learning to manually sense-annotated text and then uses the resulting model to perform word sense disambiguation on new text. The main disadvantage of such an approach is the high cost of manual annotation [21]. This approach is especially problematic in "all-words" WSD, where the system analyzes all ambiguous words in text and not just a specific limited subset.

Unsupervised methods are often called word sense discrimination [22] or sense discovery [23] because they aim to distinguish the word senses by clustering them in groups based on the context in which the word appears. The main disadvantage of such approaches is that after word sense discrimination is performed, human review is required to label the word sense clusters with the correct sense.

Semi-supervised or minimally supervised WSD methods use a small, manually annotated text in combination with a large untagged corpus. Two variations of semi-supervised approaches are bootstrapping and active learning methods. Bootstrapping uses a small corpus that was manually selected and tagged to learn the initial model and then utilizes the large untagged corpus to improve this model [9, 24]. The bootstrapping method as implemented by Yarowsky is a minimally supervised method that relies on the one-sense-per-collocation and one-sense-per-discourse principles [25]. Active learning methods identify the most informative examples from the large untagged corpus and present them to a human expert for disambiguation [26, 27].

Supervised and unsupervised WSD methods are also called corpus-based methods because they use language models learned from a training text dataset [28].
Knowledge-based WSD methods identify the word senses using external knowledge resources such as dictionaries, thesauri, or ontologies, or manually created disambiguation rules [29, 30]. These methods include identifying the most likely meaning of the word using a) selectional preferences that restrict the semantic type of the word sense based on the context [31]; b) information formats with slots for specific types of information [32]; c) the context using unambiguous meanings of the neighboring terms [33]; d) semantic similarity calculated using an ontology or semantic network [34]; or e) unambiguous meanings of the word in a different language using a parallel corpus [35].

Hybrid methods use variations of the approaches described above. Some examples of hybrid systems include Durham and SenseLearner. The Durham system utilizes word sense frequencies calculated using a manually annotated text and applies word collocations as well as WordNet contextual scores [36]. The SenseLearner system uses word collocations learned from a small manually annotated corpus enhanced by the WordNet taxonomy [37].

The wide availability of large general English lexical databases, such as WordNet, and specialized ontologies, such as Gene Ontology (GO) and the Unified Medical Language System (UMLS), makes it possible to develop a hybrid approach that combines knowledge-based methods and supervised learning algorithms. For example, the A-CUI algorithm created by McInnes calculates the similarity between the target word feature vector, which is based on the word's surrounding context, and the concept feature vector for each candidate concept extracted from the UMLS [38]. As the size and quality of the existing knowledge repositories increase, such hybrid approaches have great potential for solving the problem of word sense disambiguation.
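The vector comparison mentioned above for A-CUI can be illustrated with a toy sketch that selects the candidate concept whose description is closest to the target word's context under cosine similarity. This is not a reimplementation of the algorithm in [38]: the bag-of-words features, the concept identifiers, the description strings, and the function names are hypothetical and chosen only to show the idea.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words count vector for a whitespace-tokenized string."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def pick_concept(context, candidates):
    """Return the candidate concept id whose description best matches the context."""
    ctx = vectorize(context)
    return max(candidates, key=lambda cid: cosine(ctx, vectorize(candidates[cid])))


# Hypothetical candidate concepts for the ambiguous term "discharge";
# the identifiers and descriptions are invented, not UMLS content.
CANDIDATES = {
    "CONCEPT_A": "release of a patient from hospital care",
    "CONCEPT_B": "fluid drainage from a wound or body opening",
}
```

In this sketch the context "patient ready for discharge from hospital tomorrow" is closer to `CONCEPT_A`, while "purulent discharge noted from the wound" is closer to `CONCEPT_B`. Count vectors were chosen for brevity; weighted or second-order features would be a natural refinement.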
The approach presented in this dissertation is a hybrid method because it takes advantage of the UMLS Metathesaurus both as a knowledge repository for identifying training examples for language modeling and as a controlled vocabulary that determines the sense inventory for the terms to be disambiguated.

2.2 Sublanguage Theory

Human language is flexible enough to accommodate a wide range of communication purposes, including fairy tales and entertaining riddles. Many words can take a large number of meanings, making computerized language processing challenging. Zellig Harris, an American linguist, observed that the restricted use of language in the discourse of specialized domains placed strict limitations on the distribution of word classes and their co-occurrences. Harris determined that knowing these distributions can aid in determining the most appropriate meaning of terms within the boundaries of a specific domain.

Previous research has established that semantic and syntactic rules differ for narratives that come from different specialized domains. Such closed-matter subjects are characterized by a limited vocabulary, a relatively small set of word classes, and word-class sequences integrated as a sublanguage [39]. Although the specific word-recurrences in the successive sentences of a discourse are unique to that discourse, various types of co-occurrence patterns seem to characterize various types of discourses. The various types of word co-occurrence are worth studying as the inherent carriers of various information types. And the particular pattern of word co-occurrence in a given discourse or section is useful as a framework of the particular information and information processing in that discourse [40].

Since sublanguage theory was first introduced, there have been multiple attempts to implement sublanguage principles in computer applications.
The distinguishing characteristic of such an approach is performing WSD through semantic type disambiguation, which involves identifying a word class (or semantic type) for each ambiguous term and then selecting a concept that belongs to that word class out of a list of potential concepts [41]. This approach is based on selectional preferences or restrictions [9, 31, 42, 43]. A number of domains have been analyzed via sublanguage models, such as trouble tickets [44], technical maintenance manuals [45], stock market reports [46], and weather reports [47].

The work to produce the first computerized application based on sublanguage theory started in 1965 and resulted in the Linguistic String Project (LSP) [48]. That project is based on information formats for the content of text in a given domain. It started as an attempt at computerized processing of scientific text, based on the algorithm developed by Sager (N. Sager, Procedure for left-to-right recognition of sentence structure, T.D.A.P. No. 27, University of Pennsylvania, 1960) and theoretically grounded in the Linguistic String Theory suggested by Zellig Harris [49]. This theory states that any sentence can be built from the center string by adjunction, conjunction, and replacement. The center string is a sequence of noun+tensed verb or noun+tensed verb+noun. However, not all combinations of word categories result in a valid sentence due to a number of restrictions.

The earliest full-text accessible article about LSP is by Grishman [50]. He states that by 1973 LSP had been under development for 8 years and that the current version at the time was version 3. The grammar used by the parser consisted of:

1. a Backus-Naur Form context-free language grammar implemented as a set of elementary strings together with rules for combining them to form sentence strings;
2. a set of restrictions on those strings; and
3. a word dictionary, listing the categories and subcategories for each word.
Another early publication is by Sager in 1975 [51], where she discussed the hypothesis that the literature of science domains has certain restrictions on language usage. These restrictions were formalized as information formats, which are repeating patterns of the word classes (also called semantic types, term classes, or word categories) and word class relations in sentences of the text. Word classes were obtained by grouping words or phrases that occur within similar grammatical relations. The information formats contained slots for particular types of information. Sentences of the text of a specific domain were identified as instances of the corresponding format. The set of formats was considered to be a sublanguage grammar. Each slot had a fixed informational content and the sentences of a certain format carried specific types of information. The slots were based on the hierarchy of grammatical operators and operands; they were not determined solely by the linear sequence of words in sentences.

The main premise of sublanguage grammar is that narrow domain grammar rules are more restrictive than English grammar rules. A sentence may be well formed in general English but not well formed as a sentence in the specific domain. In the beginning of LSP, the researchers established that the semantic classes of words do not have to be specified a priori but can be extracted through the process of grouping terms that appear in the same co-occurrence patterns.

2.3 NLP in Clinical and Biomedical Domains

Since the early years of research in the natural language processing of English, newspaper and scientific literature have been the primary target languages. As a result of multiple research projects, a large number of disambiguation methods have been proposed and a large body of language samples has been annotated and made available for shared use. Availability of shared annotated corpora enabled new system evaluation and algorithm comparison.
Despite such advances in mainstream NLP technology, its penetration into the clinical domain has been limited to a few research projects and a handful of commercial systems. The reason for this situation is the difference between the lexical, syntactic, and semantic characteristics of clinical text and those of general language.

Clinical language shares some of the features of other telegraphic sublanguages, such as ill-formed and reduced sentences, lack of internal consistency, abundance of overloaded abbreviations and acronyms, misspellings, and extra linguistically meaningless tokens resulting from local and individual practices [1, 52]. The language of biomedical literature shares some characteristics of clinical language, such as a large vocabulary of terms that are virtually exclusive to the medical domain, but it also resembles general language because of the use of proper grammar and the wide availability of shared corpora. Because of these differences, clinical language and the language used in biomedical literature are distinct sublanguages [53].

A number of systems have been created to process narrative stored in electronic format. Some of them are general and some of them are project-specific, created either as a research project or implemented in a single organization.

2.3.1 Systems Based on Sublanguage Principles

Over the last 50 years, sublanguage theory has been used as the theoretical framework for a number of different systems that have been developed and implemented for the clinical and biomedical domains.

The Medical Language Processing (MLP) system is the first attempt to apply the LSP parser to medical text, initially reported in 1976 [32]. This effort was conducted by a research team that included Naomi Sager, Ralph Grishman, Ngo Nhan, and Carol Friedman. The target corpus included x-ray reports on patients with breast cancer.
For that system, word classes and information formats were derived through distributional analysis: on the parsed sentences to obtain word classes, and on the word classes to define formats. The distributional analysis starts with identifying words frequently occurring in the same syntactically defined environments. These words become the core of the new class. Then the environments of these words are enumerated and new words for the class are identified. Only one format was initially determined. The final version of the system had an additional 11 formats. During the first attempt for MLP, 176 out of 188 sentences (94%) were successfully formatted. The MLP system is designed to perform linguistic string analysis to determine the sentence structure, regularization of the sentence structures through general English transformations, and mapping of transformed parsed sentences into format slots.

Another major concept-mapping system based on sublanguage theory is the Medical Language Extraction and Encoding system (MedLEE), developed by a team led by Carol Friedman [54]. The grammar rules employed by MedLEE were developed manually, based on the distributional analysis of clinical notes of a specific note type - chest x-ray reports. Modifying the knowledge base to accommodate new domains required substantial human effort. Resolving the knowledge base acquisition bottleneck by making the process automatic would therefore simplify domain adaptation of natural language processing systems that use sublanguage restriction rules for semantic disambiguation.

Friedman first described the conceptual model for the MedLEE project in 1994 [55]. The model was designed by analyzing chest x-ray reports generated at Columbia Presbyterian Medical Center (CPMC).
The model included four conceptual levels: a) the structure of the report; b) the findings in the report; c) the structure of the medical concepts that make up the findings; d) the lexical information associated with individual words and multi-word phrases.

The clinical terms and their semantic types were defined in the Medical Entities Dictionary developed at CPMC. The initial analysis included 8000 chest x-ray reports. Once the conceptual levels had been defined, the prototype was implemented [55]. Initially, the semantic lexicon contained 3120 terms with associated semantic types. The semantic grammar contained 350 grammar rules. A first proof-of-concept study used 230 reports. Two person-years were required to create the first semantic grammar. Later the semantic grammar was extended to cover mammography, discharge summaries, all of radiology, electrocardiography, and pathology [56].

During the system adaptation project, the MedLEE developers concluded that creating a system that can be equally effective on text of different domains required obtaining additional rules that would enable and disable other grammar rules based on the target clinical subdomain. As the number of covered subdomains grows, maintaining the rules might become cost prohibitive.

2.3.2 Biomedical Language Processing Systems

Domain-specific vocabulary and a limited set of word categories, the main characteristics of sublanguages, have been successfully applied for word sense disambiguation in the biomedical domain. UMLS has been the knowledge base of choice for most NLP systems. The UMLS Metathesaurus provides a large vocabulary of medically relevant concepts and the UMLS Semantic Network provides a relatively small list of word classes (or semantic types) that are applicable to the biomedical domain. Similar to the UMLS Metathesaurus, another commonly used controlled vocabulary is Medical Subject Headings (MeSH).
To aid the development of new medical NLP systems, the National Library of Medicine (NLM) sponsored development of a manually annotated text collection for the purposes of training and testing word sense disambiguation [57].

A number of projects focused on processing biomedical text. Rindflesch and Aronson developed a set of rules that determined the semantic type of the term depending on the patterns of neighboring words and semantic types within the sentence. This set of rules was applied to a small set of instances and achieved 78% disambiguation accuracy [58].

Expanding on Rindflesch and Aronson's idea, Krauthammer and Nenadic suggested performing word sense disambiguation as a two-step process - term classification and term mapping. The goal of term classification is to label the term of interest with one of a small number of semantic categories using a machine learning model, which was acquired using annotated text. Once the semantic category is identified, the term mapping step arrives at the final match between the term and a concept from a controlled vocabulary such as the UMLS Metathesaurus [14]. Similarly to Krauthammer and Nenadic's approach [14], Fan and Friedman successfully exploited UMLS resources to perform word sense disambiguation through semantic type classification [41].

The idea of semantic type labeling as a step to concept recognition is further developed by Humphrey and colleagues [59]. They used Journal Descriptor Indexing (JDI) as a straightforward way to identify sublanguages within biomedical domains. The main assumption is that publications with the same JDI belong to the same sublanguage. Semantic type labeling is implemented by adjusting the likelihood of occurrence of a concept with a specific semantic type depending on the set of journal descriptors that are associated with the neighboring words. The average disambiguation precision was reported at around 78%.
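The two-step process attributed above to Krauthammer and Nenadic (term classification followed by term mapping) can be sketched as follows. This is a minimal sketch under stated assumptions: a hand-written cue-word lookup stands in for the trained classifier, and the sense inventory, cue words, and concept identifiers are hypothetical. The two semantic type names follow UMLS Semantic Network labels but are used here only as illustrative category names.

```python
# Hypothetical sense inventory: term -> {semantic type: concept id}.
INVENTORY = {
    "cold": {
        "Disease or Syndrome": "CONCEPT_COMMON_COLD",
        "Natural Phenomenon or Process": "CONCEPT_COLD_TEMPERATURE",
    },
}

# Hypothetical cue words standing in for a trained term classifier.
CUES = {
    "Disease or Syndrome": {"cough", "fever", "symptoms", "diagnosed"},
    "Natural Phenomenon or Process": {"weather", "exposure", "water", "air"},
}


def classify_term(context_words):
    """Step 1: choose the semantic type whose cue words overlap the context most."""
    words = set(context_words)
    return max(CUES, key=lambda st: len(CUES[st] & words))


def map_term(term, semantic_type):
    """Step 2: return the candidate concept of the chosen semantic type, if any."""
    return INVENTORY.get(term, {}).get(semantic_type)


def disambiguate(term, sentence):
    """Run both steps on a whitespace-tokenized sentence."""
    return map_term(term, classify_term(sentence.lower().split()))
```

Separating the steps keeps the classifier small, since it only has to choose among a handful of semantic categories rather than among all concepts in the vocabulary. Note that ties in this toy classifier are resolved by dictionary order, which a real model would replace with a probability estimate.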
Stevenson and Guo developed a hybrid WSD system that combined lexical features (such as lemmas of ambiguous words), syntactic features (such as part of speech), collocation features (such as combinations of other features in n-grams), and knowledge-based features (such as UMLS identifiers and MeSH terms). Using those features, the Naive Bayes and Support Vector Machine models were tested on the NLM test collection and a term disambiguation accuracy of 89.7% was achieved [60].

Similarly, a system developed by Liu et al. [61] creates a disambiguation model by learning a Naive Bayes classifier using a feature space consisting of stemmed words that appear in each evaluated abstract. The method used a one-meaning-per-discourse assumption to aid disambiguation.

2.3.3 Clinical Language Processing Systems

The sublanguage approach is not the only method used for NLP of clinical text. Several commercial, open access, and research applications have been developed.

One of the earliest systems was the special purpose radiology understanding system (SPRUS), designed to encode salient features from chest X-ray reports and implemented as a module in the Health Evaluation through Logical Processing Hospital Information (HELP) medical expert system at the LDS Hospital in Salt Lake City, UT [62]. Another NLP tool developed within the same organization is SymText, which uses Bayesian networks to model the context of radiological reports in order to automate coding tasks [63]. Chest radiology reports were also the initial target domain of another HELP module, a probabilistic medical language understanding system called MPLUS [64]. It uses Bayesian networks to represent the basic semantic types and relations in order to infer the most probable concepts consistent with the words found in a sentence. Using MPLUS as the starting point, the Automated Problem List (APL) system was designed to extract medical problems from electronic free-text documents [65].
Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES) is a pipeline system designed by Savova for the purpose of phenotype extraction from clinical notes [66]. It was built on publicly available technologies, such as the UIMA framework, OpenNLP, and the SPECIALIST Lexical Tools. The system annotates text with several clinical named entities, such as drugs, diseases/disorders, signs/symptoms, anatomical sites, and procedures. Each named entity has attributes for the text span, the ontology mapping code, whether the named entity is negated, and the context (family history of, history of, probable). The system has been submitted to the Open Health Natural Language Processing Consortium (OHNLP) and can be freely downloaded.

Medical Knowledge Analysis Tool (MedKAT/P) is another freely available tool donated to the OHNLP by IBM [67]. This modular and flexible system based on the UIMA framework is designed to extract structured information from narrative text in the clinical pathology domain such as pathology reports, clinical notes, discharge summaries, and medical literature. The system labels text with concepts such as primary tumor or lymph node status and a number of cancer-specific characteristics such as histology, anatomical site, node dimensions and sizes, and number of positive and excised nodes. MedKAT/P incorporates the NegEx algorithm developed by Chapman to identify the negation status of the concepts [68].

Health Information Text Extraction (HITEx) was initially specific to a research study on airway diseases such as asthma and chronic obstructive pulmonary disease. Now it is used as a general purpose NLP "cell" module in the i2b2 "hive" architecture. The main functionality of the system is to extract principal diagnoses, co-morbidities, and smoking status.
The knowledge base for the system includes a set of manually designed regular expressions, as well as machine learning models trained on a corpus consisting of discharge summaries of patients who had one or more related admission diagnoses defined by ICD9 codes [69].

Most of these systems have been developed within a single organization. Informatics for Integrating Biology and the Bedside (i2b2), an NIH-funded National Center for Biomedical Computing (NCBC) based at Partners HealthCare System in Boston, has promoted collaboration by organizing a series of NLP challenges and shared tasks. These challenges tackled the problems of de-identification [70], obesity and co-morbidities extraction [71], smoking status [72], and clinical concept extraction from clinical text [73, 74].

2.4 Project Statement

Using semantic type information has been successful in aiding the word sense disambiguation process as applied to both biomedical literature and clinical text. Similarly, limiting the scope of the NLP system also has been shown to boost the system's performance by limiting language variability. In combination, the limited system scope and semantic grammar rules have the potential to enable language processing of even the most irregular and idiosyncratic language. However, the knowledge base acquisition for such a system would be challenging due to the lack of training data.

A successful WSD tool, once created, produces satisfactory results on text that is similar in syntactic and semantic characteristics to the text that was used to build it. However, the performance of even the best WSD tool will inevitably decrease if the tool is applied to a text with syntactic and semantic characteristics that are different from the source text. Improving tool performance on a new text often involves either adding new semantic rules or retraining the statistical model on a new set of annotated texts.
The process of adapting a WSD tool to a new domain is expensive and time consuming because it involves human experts [75]. Making the process of WSD domain adaptation automatic would increase the tool's portability across domains. The current project demonstrates the variability of clinical language and suggests a model for dealing with this variability automatically using available knowledge resources.

2.5 Resources

2.5.1 Computational Resources

This project involves analysis of a large number of original clinical notes that have not been de-identified. Due to the amount of processing that was required, as well as in order to comply with privacy and security regulations, a powerful and secure environment was needed. I utilized a new secure, HIPAA-compliant, high-performance compute cluster located at the Center of High Performance Computing.

2.5.2 Corpus

The complete set of all clinical narrative types at our medical center (a large tertiary care teaching hospital) in use during the period January 2007-December 2008 was analyzed by a clinical expert to determine a study subset that was diverse across domains. Note types that consisted mostly of templated information, scanned hand-written documentation, or nonclinical documents were excluded. As a result, a set of 17 representative note types was selected for this study. These note types represented a cross-section of clinical narratives created by clinical personnel that varied by clinical role (physicians, nurses), specialty (cardiology, dermatology, ob/gyn, oncology, etc.), and clinical environment (ED, inpatient, outpatient).

A set of 683,061 notes was extracted from the University Hospital Electronic Data Warehouse. Files that were less than 100 bytes in length were excluded because most of them did not contain clinically relevant information. The remaining 559,029 files were processed by MetaMap. Only 557,571 of those files were successfully processed.
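The size-based exclusion described above could be expressed as in the sketch below. It assumes, purely for illustration, that each note is stored as a separate file in one directory; the function name and the directory layout are hypothetical, and only the 100-byte threshold comes from the text.

```python
from pathlib import Path

MIN_BYTES = 100  # threshold stated in the text


def select_notes(note_dir, min_bytes=MIN_BYTES):
    """Return the note files that are at least min_bytes long, sorted by path."""
    return sorted(
        p for p in Path(note_dir).iterdir()
        if p.is_file() and p.stat().st_size >= min_bytes
    )
```

Files shorter than the threshold are dropped and everything else is kept for downstream processing; a file of exactly 100 bytes is retained because the text excludes only files of less than 100 bytes.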
In addition to the clinical narratives, a random set of 35,000 MEDLINE abstracts published between 2000 and 2008 was selected. To ensure a valid comparison to the clinical texts, abstracts less than 100 bytes and those that failed to be processed were excluded. The full list of note types and their file counts is presented in Table 2.1.

Table 2.1: The note types and the corresponding file counts used in this project.

Note Type                         Abbr.   File count   Attempted to process   Processed successfully
Admission HP                      AHP         51,721                 43,142                   42,911
Ambulatory Nursing Note           ANN         77,542                 73,196                   73,167
Burn Clinic Note                  BCN         13,430                 13,428                   13,428
Cardiology Clinic Note            CCN         24,366                 24,306                   24,302
Case Mgmt Dschg Plan Note         CMD         30,213                 30,141                   30,046
Dermatology Clinic Note           DCN          6,251                  6,250                    6,249
Discharge Summary                 DIS         65,256                 65,220                   64,530
Emergency Dept Report             EDR        106,250                    685                      685
Family Practice Clinic Note       FPC         11,626                 11,270                   11,233
Hematology Oncology Clinic Note   HOC         36,785                 36,769                   36,760
Neurology Clinic Note             NCN         24,137                 23,944                   23,634
Obstetrics Gynecology Clinic Note OGC          9,355                  9,289                    9,277
Operative Report                  OPR         76,593                 76,556                   76,552
Orthopaedic Clinic Note           OCN        119,094                115,655                  115,654
Plastic Surgery Clinic Note       PSC          4,375                  4,371                    4,371
Rheumatology Clinic Note          RCN         22,647                 21,393                   21,358
Social Service Note IP            SSN          3,420                  3,414                    3,414
Total number of files:                       683,061                559,029                  557,571

2.5.3 MetaMap

MetaMap is a powerful concept recognition system developed by a team led by Aronson at the National Library of Medicine. Its primary aim is to map terms found in abstracts of MEDLINE citations, as well as user queries, to concepts in the UMLS Metathesaurus. A recent comprehensive overview of the MetaMap system is presented elsewhere [76]. Its comprehensiveness, robustness, free availability, and regular updates with the latest version of the UMLS make MetaMap very attractive for potential NLP users.
However, in spite of the good coverage of the clinical domain by the UMLS Metathesaurus, MetaMap has not been applied widely to the clinical domain beyond a few research projects. The main deterrent to broad application of MetaMap on clinical narratives is its failure to perform accurate word sense disambiguation. When a term from free text matches multiple UMLS concepts, MetaMap returns a list of all mappings, making information extraction ineffective. If MetaMap's WSD algorithm is used on clinical narratives, it often selects the wrong concept because it was trained on biomedical text.

The result of MetaMap processing is an XML output file that specifies sentence, phrase, syntax unit, and token boundaries, the part of speech and syntax type of each syntax unit, as well as a combination of concepts from the UMLS Metathesaurus. Along with the UMLS concept identifier, the XML file has the UMLS preferred concept name and one or more corresponding Semantic Types (STs) for each concept. For my project I used the version that was the latest at the time when I started processing the data - MetaMap binary 2009 V.2 [77].

CHAPTER 3 SUBLANGUAGE

Natural Language Processing systems employed in the clinical domain operate under one of two main assumptions about clinical language: 1) the narrative of patient notes constitutes one sublanguage, or 2) each clinical subdomain imposes its special set of selectional restrictions that aid concept recognition.

Examples of systems built on the first assumption are cTAKES and HITEx. The design of cTAKES is based on a reference standard that included 273 manually annotated clinical notes of a range of note types - consult notes, discharge summary, educational visit, general medical examination, limited exam, multisystem evaluation, reports, specialty evaluation, dismissal summary, subsequent visit, therapy, and notes of general category - miscellaneous.
The goal of creating such a mixture of text samples was to ensure that all areas of the clinical domain were covered by the disambiguation model [66]. As opposed to cTAKES, the HITEx system uses a reference standard corpus of 150 discharge summaries because their language and topic variability is believed to be an accurate representation of the variability across all subdomains [69]. A commercially available system, LifeCode, targets a large variety of notes but manages its performance by limiting the tasks that it can perform [78]. Another system, called KnowledgeMap, incorporates a range of clinical notes and medical textbook text in an attempt to create a general purpose knowledge base [79]. The main benefit of treating clinical language as a unified sublanguage is the relative speed of system development.

A very different approach to clinical language system development is designing systems for a specific clinical subdomain or note type. The Medical Language Extraction and Encoding System (MedLEE) is a successful and widely used general purpose system built with the assumption of language variability by note type. The initial system design was based on chest x-ray reports. When the system's use was expanded to include other types of clinical notes, the system's knowledge base was modified. However, instead of simply expanding the knowledge base to include additional semantic categories and terminology, a set of context-dependent switches was developed that would turn on or off certain grammar rules determined by the clinical subdomain where the system is used [53]. This setup made the system highly accurate for concept recognition in some clinical subdomains, but increased the financial and time cost of system adaptation to a new target subdomain.

Similar to MedLEE, an array of research projects and system development efforts were conducted under an assumption of language variability.
The most common way to deal with the idiosyncrasies of the narrative across domains is to specify the exact task or the clinical subdomain that the system targets and avoid making assumptions about possible system performance outside of those boundaries. The developers of the Medical Information Extraction System (MedEx) describe the purpose of the system as mapping medication information into a structured representation. Even though the system was trained using only Discharge Summaries, the authors claim wide applicability of the system due to the narrow scope of the task [80].

Both of these assumptions are largely untested. Therefore, as the first step in my research project, I am attempting to identify the boundaries of clinical sublanguages. The purpose of such analysis is to inform future system developers when they are making a decision about their system's scope and coverage. For example, if precision in the system's performance is the top priority, then the developers will be compelled to limit the coverage of their system to only one specific note type. Without additional knowledge about the sublanguage boundaries, it would be impossible to predict the potential system performance degradation when it is applied in a different setting.

3.1 Methods

According to the traditional definition of sublanguage grammar outlined by Harris, the languages of different narrow domains differ in their lexical component - vocabulary, and in their semantic component - semantic types and semantic type pattern distributions [40]. Therefore, I approach the sublanguage boundary definition at two levels - lexical and semantic. To show that clinical language is not homogeneous at a lexical level, I use unsupervised document clustering analysis (reported in [81]). The semantic level variability is demonstrated through semantic type pattern distribution analysis (reported in [82]).
3.1.1 Document Clustering

Document clustering is a common unsupervised machine learning method of binning analyzed documents into groups, treating each document as a single entity. Other approaches that I considered for the task of identifying sublanguage boundaries were document classification, topic discovery, and latent semantic analysis.

Document classification is a supervised method of organizing documents by assigning one of several predefined categories to each document. The steps for such analysis would include manual annotation of a set of representative documents that would then be used for training of a machine learning classifier. Once a classifier is acquired, it can be used to label a test set of documents with a category. The output of this analysis would indicate whether all notes of the same note type fall into the same category or not [83]. This method assumes that human experts can correctly discern language variations to inform the classifier. However, due to natural capacity limitations, humans are not able to perceive hidden patterns from large amounts of data, and therefore, besides the obvious cost associated with human annotations, the supervised approach is inferior to computational methods of pattern discovery in data.

Similar to document clustering, topic discovery methods identify documents that share similar content using unsupervised techniques. However, topic discovery labels documents with a set of possible topics, whereas document clustering labels each document with a single label, thus simplifying the grouping structure [84].

Like document clustering, Latent Semantic Analysis (LSA) uses term frequency weights for words used in each analyzed document. Also, as with document clustering, LSA evaluates word usage patterns for all words across all analyzed documents.
However, unlike document clustering, LSA focuses on individual words and their meaning, thus the LSA hierarchical model is directed at learning the relationships between words, rather than documents. Hierarchical LSA methods allow visualizing word relatedness, rather than document relatedness, which is beneficial for word sense disambiguation but not as a sublanguage similarity measure [85].

After considering other options, I chose document clustering as a method of grouping related documents because it does not require human annotation to inform the algorithm, and because hierarchical document clustering methods allow not only identifying what types of sublanguages are present in the corpus, but also the strength of sublanguage similarity.

Document clustering is a commonly used unsupervised text mining technique that has been used for a range of natural language processing tasks such as information retrieval, question answering and others. The goal of document clustering is to find a set of "natural" patterns within a large amount of unlabeled data inside the documents and then to organize similar documents into groups using some measure of similarity [86]. Cluster analysis typically consists of a) feature selection and extraction; b) selection or design of a clustering algorithm; and c) cluster quality evaluation [87].

The most popular data set format for document clustering is a bag-of-words vector-space model. This method represents the entire set of documents as a T × D matrix, where T is the size of the vocabulary used in the document set and D is the total number of documents in the data set [88]. Each document is represented as a vector of length T, and since most terms do not appear in any given document, these vectors are sparse. Typically, each value in these vectors represents the importance of the particular term (t) in the particular document (d).
In order to minimize the effect of the document size and extremes in the frequency of a specific term, the well known "term-frequency inverse-document-frequency (tf-idf)" measure is often used as the weight of each term in a document vector [89, 90]. This measure takes into account how frequent a specific term is within a specific document as well as the term distribution across all documents in the analyzed corpus. Thus, terms that appear only in a few documents have higher weights, but terms that appear in most documents will have lower weights.

The goal of cluster analysis is to place each document into one of K disjoint or overlapping clusters. Each cluster usually is defined by its centroid, which is the most representative vector in the cluster. Depending on the clustering algorithm used, the centroid can be either the average point for each dimension of the feature vector, or an actual point in the data set that is the closest to the average point.

Most clustering algorithms have three main components: a) similarity measure, used to measure vector relatedness; b) clustering method, the computational approach taken during the clustering process; and c) clustering criterion function, which is used for the optimization of the final clustering solution [91]. Similarity between two vectors can be measured by calculating a Euclidean distance, a cosine distance, or a correlation coefficient. The general clustering method can be either partitional, agglomerative, density-based, or grid-based. Depending on the final specific solution desired, the clustering methods can be either hierarchical or nonhierarchical [92]. The simplest and most widely used clustering method is K-means. Prior research concluded that a bisecting K-means algorithm performs quite well despite its simplicity and lower computational complexity [93]. This hierarchical algorithm iteratively splits the data set until the predefined number of clusters is reached.
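The tf-idf weighting and cosine similarity described above can be illustrated with a minimal sketch. It assumes one common tf-idf variant (term frequency normalized by document length, multiplied by the natural log of N/df); the exact weighting used by the clustering toolkit may differ, and the function names are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse {term: weight} vector per document.

    `docs` is a list of token lists. Weighting is an assumed variant:
    (count / document length) * ln(number of documents / document frequency).
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    vectors = []
    for tokens in docs:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either has zero length."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus: "patient" occurs in every document, so its weight is zero.
docs = [
    ["patient", "chest", "pain"],
    ["patient", "knee", "pain"],
    ["patient", "rash"],
]
vectors = tfidf_vectors(docs)
```

In this toy corpus a term that appears in every document contributes nothing to similarity, while a term confined to one document receives the highest weight, which is the behavior the paragraph above describes.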
Selection of a clustering criterion function influences the final clustering solution by putting more emphasis on cohesion or on separation of the resulting clusters.

The measure of cluster quality can be classified as either internal or external. Internal measures of cluster quality aim to assess how closely the elements in each cluster are related to each other, evaluating "cohesion" and "separation" of the clustering results. Cohesion can be measured as the average similarity of the members of the cluster to each other or to the cluster centroid. Separation evaluates the average dissimilarity of the members of a particular cluster to all other elements in the data set.

The external measures rely on knowing a true label of each of the documents. Clustering output can be measured externally in terms of purity and F-score. Purity is the proportion of each cluster that consists of the majority class. F-score evaluates precision and recall of each document type with respect to its cluster assignment. In evaluation of document clustering output, precision for each document type compares the largest number of documents of that type assigned to a single cluster to the total number of documents assigned to that cluster. Recall for each document type compares that same largest group to the total number of documents of that type. The F-score is the harmonic mean of precision and recall.

An optimal clustering solution will have 100% purity, which means that each cluster contains elements that belong to a single class [91]. Such purity can be achieved trivially when the number of clusters is equal to the number of elements in the data set. On the other hand, the perfect F-score will be achieved only if all documents of each type are grouped into a single cluster (100% recall) and no document types share a cluster label (100% precision). Using the note type as the true class label, I use purity and F-score measures in the analysis below.
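The external measures defined above can be sketched in a few lines. This is an illustrative reading of the definitions in the text, not the evaluation code used in the study; the function names and the toy labels are hypothetical.

```python
from collections import Counter, defaultdict


def cluster_purity(labels, clusters):
    """Per-cluster purity: share of each cluster taken by its majority class."""
    by_cluster = defaultdict(Counter)
    for label, cluster in zip(labels, clusters):
        by_cluster[cluster][label] += 1
    return {
        cluster: max(counts.values()) / sum(counts.values())
        for cluster, counts in by_cluster.items()
    }


def type_scores(labels, clusters):
    """Precision, recall and F-score per document type.

    For each type, the cluster holding the largest share of its documents is
    taken as the type's cluster, following the definitions in the text.
    """
    cluster_sizes = Counter(clusters)
    by_type = defaultdict(Counter)
    for label, cluster in zip(labels, clusters):
        by_type[label][cluster] += 1
    scores = {}
    for label, counts in by_type.items():
        best_cluster, hits = counts.most_common(1)[0]
        precision = hits / cluster_sizes[best_cluster]
        recall = hits / sum(counts.values())
        f_score = 2 * precision * recall / (precision + recall)
        scores[label] = (precision, recall, f_score)
    return scores


# Toy example: three notes of type A and three of type B in two clusters.
labels = ["A", "A", "A", "B", "B", "B"]
clusters = [0, 0, 1, 1, 1, 1]
purity = cluster_purity(labels, clusters)
scores = type_scores(labels, clusters)
```

In the toy example cluster 0 is pure, while cluster 1 holds one stray note of type A, so type A loses recall and type B loses precision.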
A feature vector file was created where each note was represented by the tf-idf value for each term that MetaMap matched to at least one UMLS concept. To decrease the feature space, multiword phrases were split into individual tokens and the base form of all tokens was obtained from the SPECIALIST lexicon using the Norm tool [94]. In addition to the lexical attributes, semantic types of those terms that were unambiguously mapped to a UMLS concept by MetaMap were also used as attributes. The derivation of what constitutes an unambiguously mapped term is more complex than simply choosing those terms with only one MetaMap semantic mapping. Those terms can be enriched with an algorithm that exploits the mapping scores provided by MetaMap as described in Section 4.2.1.1. Using only those terms that MetaMap successfully mapped to at least one concept minimizes the size of the feature vector and focuses on only those tokens that are potentially relevant in the clinical setting, thus excluding misspellings, unrecognized locally specific abbreviations, and other language characteristics, which are artifacts of the local practices rather than being typical of the clinical subdomain.

To perform clustering I chose bisecting k-means clustering using a cosine similarity measure with the "internal criterion function," which maximizes similarity between each document and the centroid of the cluster that it is assigned to. The clustering tool I chose was the CLUTO clustering toolkit [91]. This software package offers a set of clustering algorithms that approach clustering as an optimization process aiming to minimize or maximize the selected clustering criterion function. It is written in C, and thus is quite fast. It also manages memory well. The Java-based Weka cluster toolset was unable to process the full feature space, and was too slow to be practical for even small subsets. The selected clustering algorithm requires the number of clusters to be specified a priori.
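To make the procedure concrete, the following is a minimal sketch of bisecting k-means under cosine similarity. It is not CLUTO's implementation: the deterministic seeding, the choice to always split the largest cluster, and all function names are assumptions made for illustration only.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def cosine(u, v):
    """Cosine similarity of two unit-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def centroid(data, members):
    """Unit-length mean direction of the vectors whose indices are in `members`."""
    dim = len(data[members[0]])
    return normalize([sum(data[i][d] for i in members) for d in range(dim)])


def bisect(data, members, iterations=20):
    """Split one cluster in two with 2-means under cosine similarity."""
    # Assumed seeding: the first member and the member least similar to it.
    first = data[members[0]]
    second = data[min(members, key=lambda i: cosine(first, data[i]))]
    centers = [first, second]
    left, right = [], []
    for _ in range(iterations):
        left, right = [], []
        for i in members:
            nearer_first = cosine(data[i], centers[0]) >= cosine(data[i], centers[1])
            (left if nearer_first else right).append(i)
        if not left or not right:
            break
        updated = [centroid(data, left), centroid(data, right)]
        if updated == centers:
            break
        centers = updated
    return left, right


def bisecting_kmeans(vectors, k):
    """Return up to k clusters as lists of document indices."""
    data = [normalize(v) for v in vectors]
    clusters = [list(range(len(data)))]
    while len(clusters) < k:
        clusters.sort(key=len)
        largest = clusters.pop()  # assumed policy: split the largest cluster
        left, right = bisect(data, largest) if len(largest) > 1 else (largest, [])
        if not left or not right:
            clusters.append(largest)  # cannot be split further
            break
        clusters += [left, right]
    return clusters


# Toy data: two documents near each axis.
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
two_clusters = bisecting_kmeans(toy, 2)
```

Each pass assigns documents to the more similar of two centroids and recomputes the centroids, which is the sense in which the criterion maximizes similarity between each document and the centroid of its own cluster.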
In the current study, each clustering experiment used the same number of clusters as the number of the analyzed note types.

The full available corpus contained a variable number of documents for each note type. Since the selected algorithm is the most accurate when the number of documents in each class is the same, the corpus was reduced to 3000 randomly selected documents of each note type, except Emergency Department Reports, for which only 685 documents were available.

My initial experiment using a subset of 685 documents of each type (i.e., the size of the smallest note type, Emergency Department Reports) clustered into 18 clusters resulted in 74.8% average cluster purity. Analysis of the most descriptive and discriminating features (produced optionally by CLUTO) showed that several provider names in one type produced an unwarranted impact on clustering. After these names were identified, the feature vectors were recalculated and new clusters were analyzed. Review of the most important features showed that clinically irrelevant words, such as "phone" and "fax," were responsible for inflated cluster purity for Case Management Discharge Plan, thus skewing the clustering results.

The results of these two experiments led me to the conclusion that in order to acquire the most reasonable clusters, I had to exclude the lexical noise that resulted from the artifacts of the local practices and templates. So I manually designed a short stopword list that consisted of the most frequent person names and also the words "phone" and "fax." This stopword list also included five semantic types that were identified as the most common for all note types [82]. These semantic types are: Finding, Temporal Concept, Qualitative Concept, Quantitative Concept, and Functional Concept. After those stopwords were excluded, the new data set was analyzed and the average cluster purity of the resulting solution dropped to 73.3%.
This confirmed that the artifact terms, for example, terms that occurred frequently in section headers, were artificially improving the clustering for some note types.

To eliminate noisy terms more systematically, I calculated an additional set of stopwords that aimed to reduce the lexical artifacts for all the note types. The new stopword list excluded any term in a specific note type if that term appeared in more than 95% of all documents of this note type. These terms were eliminated from the feature set for that particular note type but not for the other note types. Processing the new data set resulted in an even lower average purity of 70.0%.

Even though eliminating artifacts of the local practices resulted in lower cluster purity, I believe that by doing so I achieved clustering that more faithfully reflects the lexical patterns of the analyzed clinical subdomains rather than lexical noise due to local practice. Purity is calculated in terms of the majority class for each cluster and reflects how well each cluster is represented by one of the document classes. Lower purity indicates that the cluster contains notes of different classes, thus showing that those document classes have some documents that are lexically related to each other. For example, Table 3.1 shows that cluster 13 mostly has documents from three note types - Ambulatory Nursing Notes, Case Management Discharge Plan, and Emergency Department Reports. On the other hand, cluster 6 is mostly represented by documents of a single note type - Rheumatology Clinic notes. When comparing the cluster assignment for Discharge Summaries and Admission History and Physical, it is notable that 8 out of 18 clusters have similar counts of these note types. This is indicative of the large overlap in the lexical and semantic patterns appearing in the documents of these note types.

The next set of experiments evaluated the effect of larger sample size on clustering.
Emergency Department Reports had only 685 notes available, so they were excluded from further processing. The feature vectors representing the remaining sixteen note types and MEDLINE abstracts, with 3,000 documents in each set, were clustered into seventeen groups. As Table 3.2 illustrates, most document types were each grouped into their own cluster. Several note types are shown to be more general than others, such as Admission History and Physical, Ambulatory Nursing Notes, Discharge Summary, Family Practice Clinic Notes and, not surprisingly, MEDLINE abstracts. Case Management Discharge Plan, Dermatology Clinic, and Plastic Surgery Notes exhibited a dichotomy in the lexical patterns. As the cluster hierarchy shows, despite such a split, each pair of the clusters is closely related, indicating similarity between the clusters. Increased sample size and removal of a more general document set (Emergency Department Reports) resulted in an increase of the average purity to 76%.

Table 3.1: Clustering results of the data set consisting of 685 documents per document type. Values that represent less than 1% of the total note type count were excluded for visual clarity.

Table 3.2: Results of hierarchical cluster analysis of the set of 17 document types (n=3,000 notes per set). The values are the counts of all documents of the particular type that were grouped into each of the clusters. Values that represent less than 1% of the total note type count were excluded for visual clarity.

Document type               Total   Precision   Recall   F-score
Admission History/           3000   0.22        0.46     0.29
Ambulatory Nursing Note      3000   0.96        0.53     0.68
Burn Clinic Note             3000   0.94        0.96     0.95
Cardiology Clinic Note       3000   0.80        0.99     0.88
Case Mgmt Dschg Plan         3000   0.90        0.58     0.70
Dermatology Clinic Note      3000   0.95        0.66     0.78
Discharge Summary            3000   0.27        0.58     0.37
Family Practice Clinic       3000   0.39        0.82     0.52
Hematology Oncology          3000   0.55        0.97     0.70
MEDLINE abstracts            3000   0.33        0.57     0.41
Neurology Clinic Note        3000   0.80        0.96     0.87
Orthopaedic Clinic Note      3000   0.64        0.96     0.77
Obstetrics Gynecology        3000   0.71        0.92     0.80
Operative Report             3000   0.89        0.93     0.91
Plastic Surgery Clinic       3000   0.93        0.81     0.87
Rheumatology Clinic Note     3000   0.97        0.96     0.97
Social Service Note IP       3000   0.86        0.98     0.92
Cluster size                51000   0.71        0.80     0.73
Cluster purity               0.76   --          --       --

[Per-cluster document counts for clusters 1-17 are not reproduced.]

The general note types, which are not specific to any clinical subdomain, span different topics and were excluded for the next experiment. I processed the new data set consisting of the documents of those 12 note types, which are more focused on a specific clinical subdomain. The resulting 12 clusters had an impressive level of purity, 95.5%. Average F-score was also 95.5% (Table 3.3).
This indicates that the overwhelming majority of the notes of each note type exhibit lexical patterns that are characteristic of that note type. Analysis of a slightly lower F-score for Orthopedic (OCN) and Plastic Surgery Clinic Notes (PSC) and Operative Report (OPR) indicated a topic overlap for a portion of these notes as pointed out by the descriptive features for cluster 12 (Table 3.3), which are {fracture, orthopedics, motion, knee, splint, radiographic}.

3.1.2 Semantic Pattern Distribution

A domain specific set of sentence types is one of the main characteristics of the sublanguage grammar definition outlined by Harris [95]. According to Harris, the more specialized a domain, the smaller the set of semantic type structures that are common in the narrative of that domain and that are designed to carry a specific type of information. Harris's sublanguage definition of semantic sentence structure links the semantic role relationships between words in sentences, such as predicate-argument relationships, with the semantic types of the words. For example, in a statement "Patient reported pain" the word "patient" has semantic type "Patient group" and the word "pain" is of the "Sign or Symptom" semantic type. In terms of semantic roles, the predicate is the verb "reported", "patient" is the subject, and "pain" is the object of the sentence. Thus, the semantic structure that can be derived from this statement for the verb "reported" is that the subject is "Patient group" and the object is "Sign or Symptom."

All semantic structures have a set of paraphrastic patterns, because the same information can be conveyed in various physical forms. Therefore, the same semantic sentence structure can be expressed by different linear word sequences (such as "Patient reported pain" and "Pain was reported by patient").
The full set of form and content relations in sentences of a specific domain can be expressed as a distribution of linear sequences of semantic types (or semantic type patterns) in sentences within that domain. For the purposes of such analysis, I created a set of semantic pattern distributions for each note type, according to the method described below, and compared the patterns across note types.

Table 3.3: Results of clustering 12 note types with 3000 documents of each type. Values that represent less than 1% of the total note type count were excluded for clarity.

The semantic type patterns consist of linear sequences of semantic types within the predefined window. Since the patterns are not defined in terms of simple co-occurrence in a sentence, the relative position of the terms participating in a pattern is meaningful. Therefore, the list of potential patterns is a cross-product of the semantic type set and positions relative to the term of interest.

Initial measurement of semantic type frequency revealed that almost all note types had an average of between two and three mappings per sentence. I concluded that, due to sparsity, evaluating patterns of more than three mappings would fail to produce useful patterns. Therefore, in order to minimize the number of patterns, each pattern consists of the term of interest and two other terms within the predefined window. Each position within the window was numbered according to the relative position from the term of interest, such that the term of interest was numbered (0); the term after the term of interest was numbered (1); the term before the term of interest was numbered (-1), and so on. The different combinations of positions of mappings within the window were grouped into fifteen formats as outlined in Table 3.4. Thus, a semantic type co-occurrence format is an abstract sequence of mapping positions relative to the center that corresponds to the position of the term of interest.
Table 3.4 gives examples of patterns derived from a sentence "The patient reported severe upper quadrant abdominal pain" with the term "upper" as the term of interest, assuming the following semantic types for each of the mappings (with their four-letter abbreviations, which are also described in Appendix A):

patient - Patient or Disabled Group - podg
reported - Health Care Activity - acty
severe - Qualitative Concept - qlco
upper - Spatial Concept - spco
quadrant - Spatial Concept - spco
abdominal - Body Location or Region - blor
pain - Sign or Symptom - sosy

The relative frequency of observed sequences of mappings that fell within the evaluation window was calculated. Those patterns that occurred only once were treated as outliers and were excluded from the analysis. Ambiguously mapped terms were counted as unmapped. Only unambiguously mapped terms were used in the patterns. Therefore, patterns of only some of the fifteen formats can be extracted from any given sentence.

Table 3.4: The formats of the patterns found within the window of size 3. The example evaluates sentence "The patient[podg] reported[acty] severe[qlco] upper[spco] quadrant[spco] abdominal[blor] pain[sosy]" with the term "upper" as the term of interest.

Format number   Format structure   Example of pattern   Corresponding terms
Format 1        (-3) (-2) ( 0)     podg acty spco       patient reported upper
Format 2        (-3) (-1) ( 0)     podg qlco spco       patient severe upper
Format 3        (-3) ( 0) ( 1)     podg spco spco       patient upper quadrant
Format 4        (-3) ( 0) ( 2)     podg spco blor       patient upper abdominal
Format 5        (-3) ( 0) ( 3)     podg spco sosy       patient upper pain
Format 6        (-2) (-1) ( 0)     acty qlco spco       reported severe upper
Format 7        (-2) ( 0) ( 1)     acty spco spco       reported upper quadrant
Format 8        (-2) ( 0) ( 2)     acty spco blor       reported upper abdominal
Format 9        (-2) ( 0) ( 3)     acty spco sosy       reported upper pain
Format 10       (-1) ( 0) ( 1)     qlco spco spco       severe upper quadrant
Format 11       (-1) ( 0) ( 2)     qlco spco blor       severe upper abdominal
Format 12       (-1) ( 0) ( 3)     qlco spco sosy       severe upper pain
Format 13       ( 0) ( 1) ( 2)     spco spco blor       upper quadrant abdominal
Format 14       ( 0) ( 1) ( 3)     spco spco sosy       upper quadrant pain
Format 15       ( 0) ( 2) ( 3)     spco blor sosy       upper abdominal pain

The evaluated co-occurrence patterns represented linear sequences of mappings and other terms found within the text of each clinical note type. The most common format was Format 8, where a mapping alternated with another term in a sequence. The second most common formats were those where a mapping was followed by another term and then two adjacent mappings, such as Formats 2, 7 and 15. Formats that represented mappings separated by two other terms, such as Formats 3 and 12, were not as common across all analyzed note types.

The sublanguage theory states that a specialized domain puts restrictions on the number of semantic types and semantic type patterns that are used in the sublanguage. So to evaluate whether languages of different clinical note types exhibit sublanguage characteristics, I compared the relative frequency of the collected patterns across all available note types.
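The fifteen formats of Table 3.4 correspond to every way of choosing the term of interest (0) plus two other positions from a window of three positions on each side. The sketch below enumerates them and extracts the patterns for the example sentence; the function names are hypothetical, and representing an unmapped or ambiguous term as None is an assumption made for illustration.

```python
from itertools import combinations

WINDOW = 3  # positions -3..3 around the term of interest


def pattern_formats(window=WINDOW):
    """All position triples within the window that include position 0."""
    positions = range(-window, window + 1)
    return [combo for combo in combinations(positions, 3) if 0 in combo]


def extract_patterns(semantic_types, center, window=WINDOW):
    """Map format number -> semantic type pattern around `center`.

    `semantic_types` holds one abbreviation per term, or None for a term
    that is unmapped or ambiguously mapped. A format is skipped when any of
    its positions falls outside the sentence or on a None entry.
    """
    patterns = {}
    for number, offsets in enumerate(pattern_formats(window), start=1):
        indexes = [center + offset for offset in offsets]
        if any(i < 0 or i >= len(semantic_types) for i in indexes):
            continue
        types = tuple(semantic_types[i] for i in indexes)
        if None not in types:
            patterns[number] = types
    return patterns


# "patient reported severe upper quadrant abdominal pain", center = "upper"
sentence_types = ["podg", "acty", "qlco", "spco", "spco", "blor", "sosy"]
formats = pattern_formats()
patterns = extract_patterns(sentence_types, center=3)

# If "reported" were ambiguous, every format that uses position -2 drops out.
with_ambiguity = extract_patterns(
    ["podg", None, "qlco", "spco", "spco", "blor", "sosy"], center=3
)
```

With the fully mapped example sentence all fifteen formats yield a pattern; when one neighbor is ambiguous, only the formats that avoid its position remain, which is why only some formats can be extracted from most sentences.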
When pattern frequencies are sorted in reverse order and ranked starting with the most frequent pattern, cumulative frequency can be visualized as illustrated in Figure 3.1 and Figure 3.2. These curves show how restricted the sublanguages used in each of the note types are. The steeper the curve is, the more constrained is the language. Comparing the cumulative relative frequency of the most frequent semantic types for MEDLINE abstracts and clinical note types indicates that the language of the clinical notes is more restricted.

Figure 3.1: Cumulative relative frequency of the different semantic types used in Case Management Discharge Plan (CMD), Family Practice Clinic notes (FPC), and MEDLINE abstracts (MLN). The curves of other clinical note types fell between CMD and FPC lines and were excluded from the figure for visual clarity.

As Figure 3.1 shows, a much larger number of semantic types is required to cover 90% of the unambiguously mapped concepts found in MEDLINE abstracts than in the analyzed clinical notes. For clinical notes, that number fell between 25 and 35 semantic types, whereas biomedical literature actively employed 57 semantic types.

Semantic type patterns also indicate that clinical notes exhibit sublanguage characteristics. According to the sublanguage theory, semantic type patterns indicate the type of information structures that are used in the text. Thus, the smaller the number of semantic type patterns, the more restricted is the sublanguage they describe. Figure 3.2 presents further evidence that the language used in the biomedical literature is more general than the language of clinical notes because the cumulative relative frequency curve has a gradual incline that does not flatten much until it reaches full coverage. The shape of the MEDLINE abstracts' cumulative relative frequency curve suggests that increasing the analyzed sample size would lead to discovery of more patterns.
On the other hand, Ambulatory Nursing Notes exhibit characteristics of a very constrained sublanguage because the curve rises quickly and plateaus at almost 100%. Thus, the curve indicates that a smaller number of sentences is needed to illustrate all possible types of information that are used in the text of those clinical notes.

Figure 3.2: Cumulative relative frequency of patterns of format 8 for Ambulatory Nursing Notes (ANN), Operative Report (OPR), and MEDLINE abstracts (MLN). The curves of other clinical note types fell between ANN and OPR lines and were excluded from the figure for visual clarity.

3.2 Discussion

The original aims for this step of my research focused on identifying the sublanguage boundaries among the notes that originated in various clinical subdomains and settings. Applying document clustering to a large set of clinical narratives allowed me to expose the differences in the lexical and semantic patterns used within different clinical environments as well as among different author types. This broad, systematic survey formally establishes what many clinical NLP researchers have suspected for a long time, namely that clinicians in different subdomains use language in a highly idiosyncratic way. Clustering also showed that, contrary to the commonly held belief, the clinical setting does not carry as much weight in determining a clinical sublanguage boundary.

The semantic pattern distribution curves indicate how restricted sentence semantic structures are, which is clear evidence that the language of different note types meets the requirements to be regarded as proper sublanguages. Together with the document clustering results, semantic pattern distributions indicate that the clinical language is not homogeneous, but rather is a collection of separate, though related, sublanguages. It is reasonable to expect that NLP systems that rely on statistical measures will perform differently on narratives that come from different clinical subdomains.
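The cumulative relative frequency curves used in Section 3.1.2, and the number of semantic types needed to reach a given coverage, can be computed as in the sketch below. The counts in the example are invented for illustration and are not taken from the study; the function names are hypothetical.

```python
def cumulative_relative_frequency(counts):
    """Cumulative share of observations, ranked from most to least frequent."""
    total = sum(counts.values())
    running = 0
    curve = []
    for count in sorted(counts.values(), reverse=True):
        running += count
        curve.append(running / total)
    return curve


def items_needed(counts, coverage=0.9):
    """Smallest number of top-ranked items whose share reaches `coverage`."""
    for rank, share in enumerate(cumulative_relative_frequency(counts), start=1):
        if share >= coverage:
            return rank
    return len(counts)


# Hypothetical counts of unambiguous mappings per semantic type.
toy_counts = {"sosy": 50, "blor": 30, "qlco": 15, "spco": 5}
curve = cumulative_relative_frequency(toy_counts)
needed = items_needed(toy_counts, coverage=0.9)
```

A steep curve reaches the coverage threshold after only a few ranks, which corresponds to a more constrained sublanguage; a curve that keeps rising slowly needs many more semantic types or patterns to reach the same coverage.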
CHAPTER 4 SUBLANGUAGE SEMANTIC SCHEMA SYSTEM

The sublanguage theory proposed and developed by Zellig Harris became the theoretical basis for my project [39]. In order to implement sublanguage principles for word sense disambiguation, I defined Sublanguage Semantic Schema (S3) and implemented it as the Sublanguage Semantic Schema System (S3 System). For clarity, the following definitions will be used in the remainder of this text:

token - the smallest lexical unit analyzed by MetaMap. Includes words, numbers, and punctuation.
supporting tokens - tokens marked by MetaMap with one of the following parts of speech: auxiliary verb, complement, conjunction, determiner, modal verb, preposition, pronoun, and punctuation. These tokens are skipped by the MetaMap algorithm during mapping.
term - one or more semantically linked tokens identified by MetaMap.
concept - a UMLS concept identified by MetaMap.
candidate - one of the concepts that represent the sense inventory of the mapped term. MetaMap identifies multiple candidates that are combined into a candidate set for each phrase. Disambiguation of the candidates is a task required for accurate mapping.
mapped term - a term that was mapped by MetaMap to at least one candidate.
mapping - a term that was unambiguously mapped to a UMLS concept using the rules of unambiguity I developed. The mapping has a UMLS concept identifier and a semantic type associated with it.
ambiguous term - a term that was mapped to multiple candidates.
ST - the semantic type of the UMLS concept associated with the mapping.

These definitions are also listed in Appendix B for reference.

4.1 Sublanguage Semantic Schema

Previous research operationalized the sublanguage grammar as Domain Information Schema (DIS) [96]. DIS consisted of a set of semantic classes, the words and phrases that belong to these classes, and the predicate-argument relationships among the members of these classes specific to the domain.
I analyzed the applicability of this approach to the clinical domain and realized that this definition of the sublanguage structure is not feasible due to the difficulty of obtaining the most integral part of such a schema: the predicate and argument labeling of terms. An accurate parser adapted to clinical text is rarely available in practice. Therefore, to make the approach more generalizable, I redefined the sublanguage structure and propose a slightly different interpretation of a sublanguage. Instead of patterns based on predication, I use linear sequences of semantic types as a manifestation of semantic type patterns.

For the purposes of this research, the Sublanguage Semantic Schema (S3) is defined as a semantic grammar that describes a sublanguage. S3 consists of:
- A set of semantic types and corresponding conditional probabilities of these semantic types in a sentence;
- A set of semantic type patterns and corresponding conditional probabilities of these patterns in a sentence;
- A semantic type classification model produced by a machine learning algorithm.

4.2 System Design

To demonstrate the feasibility of a sublanguage based approach to word sense disambiguation, I created a system prototype and called it the Sublanguage Semantic Schema System (S3 System). The system requirements included the following specifications:
1. General purpose - The system has to be able to learn the disambiguation model for all clinically relevant words in the text.
2. Unsupervised learning - System adaptation to a new clinical subdomain should not require clinical expertise or manual annotations.
3. Real time disambiguation - The system has to be able to provide real time processing during disambiguation. This requirement arises from the ultimate vision of creating a real time language processing system for the clinical domain. It is acceptable for the training phase to be computationally intensive.
4. Easy component upgrade and replacement - Since UMLS, the selected knowledge base, and MetaMap, the concept mapping engine, undergo yearly updates, it is essential that the S3 System provides a simple way to replace these components with newer versions without requiring extensive system modifications.

The first two requirements are the two main distinguishing features of the current project. WSD systems that satisfy these two requirements usually struggle to achieve high accuracy.

The system has two main parts: a training module and an application module. The complete prototype consists of the code that I developed to perform data manipulation, the MetaMap engine to perform text mapping to the UMLS Metathesaurus, and the MegaM algorithm to learn and apply a logistic regression model. I used the Groovy language for all programming, which I chose because it is a powerful, agile, and dynamic language that is based on the Java virtual machine and incorporates Java code and libraries. Thus, an application written in Groovy can be used with any operating system.

4.2.1 Training Module

Similar to most supervised statistical corpus-based disambiguation methods, the S3 System relies on a sample of text as training data for deriving a statistical model. Unlike supervised methods, the S3 System acquires annotated text automatically by taking advantage of the existing manually curated knowledge repository. The training module consists of the following parts:
1. Text parsing and concept mapping - The raw text is sent through the MetaMap engine to arrive at an automatically annotated text as described in Section 2.5.2.
2. Feature vector extraction - The XML files resulting from MetaMap processing of the training corpus are processed and a set of feature vectors is extracted as described in Section 4.2.1.1.
3. Pattern extraction - For each note type a set of linear sequences of semantic types is extracted as described in Section 3.1.2 and corresponding probabilities are calculated as described in Section 4.2.2.
4. Machine learning model training - A logistic regression model is obtained using the MegaM package as described in Section 4.2.3.

The general data flow for the training module is presented in Figure 4.1. A set of clinical notes of a specific note type is fed into the S3 System. The notes are then sent to MetaMap for processing. Once MetaMap completes mapping, the system converts the MetaMap output into feature vectors and applies a machine learning algorithm to acquire a semantic type classifier. MetaMap output is also processed to identify semantic type patterns for this dataset. As the final output of the training module, the patterns and classifier are stored for future use.

The S3 method is not limited to MetaMap. Any method of querying a large vocabulary can be used as long as it is powerful enough to perform mapping quickly. Similarly, logistic regression is not the only machine learning algorithm that can be used for obtaining an accurate classifier; any algorithm is suitable as long as it can handle multiclass data and is able to incorporate discrete or binary features.

4.2.1.1 Feature Vector Extraction

In the current implementation, the feature vector was created for each unambiguously mapped term, so the first step in the feature extraction process is to identify what a term is and which terms are unambiguous. The MetaMap output was delivered in XML format. A parsing module was written to parse the XML files and identify potential terms. Previous research with biomedical texts has used a simple definition of unambiguous mappings: those phrases that mapped to a single concept [41].
My initial calculation of the proportion of phrases mapped to a single concept in the clinical documents compared to MEDLINE abstracts showed that the proportion is two to three times higher in biomedical text than in clinical text. Therefore, I concluded that the single-candidate definition is not appropriate for clinical text because of its high level of ambiguity, which results in extremely limited mappings. After reviewing a large number of clinical text mappings produced by MetaMap, I derived additional rules of term boundaries and term unambiguity. These rules rely on the evaluation metric generated by MetaMap to measure the quality of the match between the term in the analyzed phrase and a Metathesaurus concept [97].

Figure 4.1: S3 System training data flow

Manual disambiguation of a number of sentences resulted in the development of the following text processing heuristics:
- Candidates that cover multiple tokens separated by one or more tokens other than supporting tokens are excluded from further analysis, because in most reviewed cases the correct meaning was represented by contiguous tokens (as opposed to disjoint tokens).
- The phrase chunking into terms is done starting from the last token of each phrase and determining the longest contiguous term right-to-left, because in English the head of a phrase is generally the last word of the phrase [98].
- If one or more numeric tokens belong to the same phrase and the phrase does not have any candidates, such a sequence of tokens is treated as a single token. This heuristic came from the fact that MetaMap does not have a mechanism to recognize dates and phone numbers as a single lexical unit.
- For the purposes of this research, punctuation tokens are ignored and are not considered in patterns and feature vectors. This heuristic resulted from the observation that clinical text is full of implicit tables and other formatting done by authors to improve human readability of the documents, and from the fact that inconsistent use of various punctuation marks is widespread.

These heuristics help to define which chunks of text represent terms and what candidates are included in the sense inventory for each term. Once each term is matched to a set of candidates, the following conditions are used to identify an unambiguous term:
1. MetaMap produced only one candidate for the term, even if variant generation was required to find this match (variants reduce the MetaMap mapping score, but here the mapping is still considered to be unambiguous).
2. MetaMap produced a single identical match except for spelling variation, capitalization, NOS suffixes and inversions such as Cancer, Lung vs. Lung Cancer.
3. MetaMap produced multiple matches, but all of the candidates have the same semantic type.
4. MetaMap produced a single match for the term such that the match evaluation score is either over 900 or, if no mappings over 900 are found, over 800.

Once an unambiguous term is identified, a set of features is collected from the MetaMap output. The exact feature set depends on the availability of information. The necessary component is the semantic type of the analyzed terms. Other than that, any additional information about the term is potentially useful. In the current research, MetaMap functionality determined what can potentially be used as the context information for word sense disambiguation. As I contemplated which features would be more important and which could be ignored, I decided to incorporate all available information into the language model and to evaluate the importance of each feature in a subsequent system optimization analysis.
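The four unambiguity conditions listed earlier in this section can be read as a short decision procedure. The following is a minimal sketch in Python, not the Groovy code of the prototype. The dictionary keys (`name`, `semtype`, `score`) and the `normalize` helper are hypothetical, and scores are assumed to be positive values on MetaMap's 0 to 1000 scale. Two simplifications are assumptions of the sketch: `normalize` only approximates condition 2 (it ignores case, NOS suffixes and word order, but does not handle spelling variants), and condition 4 is read as "exactly one candidate above the threshold".

```python
import re


def normalize(name):
    """Rough stand-in for condition 2: ignore case, NOS suffixes and word order."""
    name = re.sub(r"\bnos\b", " ", name.lower())
    tokens = re.findall(r"[a-z0-9]+", name)  # also drops commas used in inversions
    return tuple(sorted(tokens))


def is_unambiguous(candidates):
    """candidates: list of dicts with hypothetical keys 'name', 'semtype', 'score'."""
    if not candidates:
        return False
    # Condition 1: a single candidate.
    if len(candidates) == 1:
        return True
    # Condition 2: all candidates are the same string after normalization.
    if len({normalize(c["name"]) for c in candidates}) == 1:
        return True
    # Condition 3: all candidates share one semantic type.
    if len({c["semtype"] for c in candidates}) == 1:
        return True
    # Condition 4 (assumed reading): exactly one candidate over 900,
    # or, if none is over 900, exactly one candidate over 800.
    for threshold in (900, 800):
        above = [c for c in candidates if c["score"] > threshold]
        if above:
            return len(above) == 1
    return False
```

The order of the checks follows the order of the conditions in the text; whether the prototype evaluated them in this order is not stated in the source.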
MetaMap provides the following information about each sentence: a) utterance boundaries; b) phrase boundaries; c) syntax units; d) tokens; e) lexical category (part of speech); f) syntax type; g) UMLS concept identifier (UMLS CUI); h) UMLS preferred name; i) sources: the vocabularies and terminologies that were the original sources of the concept.

Previous research based on a similar method found that a feature set extracted using a larger window size does not consistently yield better accuracy than a set based on window size 3 [41]. Additionally, a full corpus analysis showed that the average number of phrases within sentences identified by MetaMap ranged between 5 and 8 depending on the note type. Therefore, for the full feature list, a window size of 3 within the sentence boundaries is selected. Thus, the full feature set has features for the term of interest as well as for the three terms before and the three terms after the term of interest. Seven terms are included in each feature vector.

The feature subset that directly describes the term of interest includes the part of speech and the syntax type as the features. For all other terms within the window, the feature subsets include the normalized tokens, part of speech, and syntax type. If the term was unambiguously mapped, the feature subset for that term also includes the semantic type, the UMLS preferred term, and a set of binary attributes that indicate whether the term is included in a specific terminology. The UMLS Metathesaurus combines concepts from more than a hundred different source vocabularies.

In addition to the described features, the initially extracted vectors also carry several metadata items that are potentially beneficial for cross-checking the extractor accuracy. These additional data items include the file name, the sentence number and the term number of the term of interest within the sentence.
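The window-based feature extraction described above can be illustrated with a small sketch. This is an assumption-laden simplification in Python, not the prototype's Groovy code: the term dictionaries, their keys, and the `name_offset` feature naming are hypothetical, and the UMLS preferred term and source-vocabulary attributes are left out for brevity. The semantic type of the term of interest is not a feature here because it serves as the class label.

```python
def window_features(terms, i, size=3):
    """Collect features for terms[i] from up to `size` neighbours on each side.

    terms: one sentence as a list of dicts with hypothetical keys
    'token', 'pos', 'syntax' and 'semtype' (None when not unambiguously mapped).
    """
    target = terms[i]
    # The term of interest contributes only part of speech and syntax type.
    features = {"pos_0": target["pos"], "syntax_0": target["syntax"]}
    for offset in range(-size, size + 1):
        j = i + offset
        if offset == 0 or not 0 <= j < len(terms):
            continue  # skip the target and positions outside the sentence
        neighbour = terms[j]
        features[f"token_{offset}"] = neighbour["token"]
        features[f"pos_{offset}"] = neighbour["pos"]
        features[f"syntax_{offset}"] = neighbour["syntax"]
        if neighbour.get("semtype") is not None:
            # Only unambiguously mapped neighbours carry a semantic type.
            features[f"semtype_{offset}"] = neighbour["semtype"]
    return features
```

Because the window never crosses the sentence boundary, terms near the start or end of a sentence simply produce fewer features.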
4.2.2 Patterns

At the time of model training, conditional probabilities for each semantic type are defined based on the presence of other semantic types in the predefined positions for each format. In addition to the formats described in Section 3.1.2, conditional probabilities were calculated for each semantic type separately, and for formats that contain only one other mapping within the predefined window. So, based on the number of participating mappings in the pattern, the formats have the following three levels:

Level 0 - the patterns of this format represent only the term of interest that is at position 0. There is only one Level 0 Format, because it represents the conditional probability of occurrence of a specific semantic type in notes of a specific note type regardless of the surrounding mappings.
Level 1 - the patterns of the Level 1 Formats include the term of interest and one other mapping within the predefined window. There are a total of 6 formats of Level 1.
Level 2 - the patterns of the Level 2 Formats include the term of interest and two other mappings within the predefined window. Note that the analysis in Section 3.1.2 is based on patterns of this format level.

Bayes' rule was used for the calculations of conditional probability. For example, for the pattern of Level 2 Format 1 from the example in Table 3.4, the conditional probability is calculated using this formula:

\[
P\big(\mathrm{spco}(0) \mid \mathrm{podg}(-3)\,\mathrm{acty}(-2)\big) = \frac{P\big(\mathrm{podg}(-3)\,\mathrm{acty}(-2)\,\mathrm{spco}(0)\big)}{P\big(\mathrm{podg}(-3)\,\mathrm{acty}(-2)\big)}
\]

In this example, the output is the probability of seeing a "spco" semantic type in position 0 if the term in position -3 is unambiguously mapped to a concept with semantic type "podg" and the term in position -2 has a mapping to a concept with semantic type "acty". These conditional probabilities are calculated for all linear sequences of unambiguously mapped terms in the training corpus.
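As an illustration of the calculation above, the conditional probability of a semantic type given its context can be estimated from pattern counts in the training corpus. The sketch below is a minimal Python example under stated assumptions: the representation of a context as a tuple of (semantic type, position) pairs and the function name are hypothetical, and relative frequencies are used as the probability estimates, which is one straightforward way to realise the formula rather than a description of the prototype's code.

```python
from collections import Counter


def conditional_probabilities(observations):
    """Estimate P(target ST | context) from observed patterns.

    observations: iterable of (context, target_semtype) pairs, where context is
    a hashable tuple such as (("podg", -3), ("acty", -2)). An empty tuple gives
    the Level 0 case: the relative frequency of each semantic type.
    """
    joint = Counter()           # count(context, target)
    context_totals = Counter()  # count(context)
    for context, target in observations:
        joint[(context, target)] += 1
        context_totals[context] += 1
    return {
        (context, target): n / context_totals[context]
        for (context, target), n in joint.items()
    }


# Made-up counts: the context is followed by "spco" three times and "fndg" once.
context = (("podg", -3), ("acty", -2))
observed = [(context, "spco")] * 3 + [(context, "fndg")]
probabilities = conditional_probabilities(observed)
```

With these made-up counts, `probabilities[(context, "spco")]` is 3/4, the ratio of the joint count to the context count.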
Footnote 1: My initial system design included creating and populating a MySQL database designed for the purposes of easy feature frequency calculation and feature analysis. After working with this database for some time I realized that in order to make querying and other processing fast, I needed to flatten the relational database into a single table, which would have made the table extremely large (over 100 million rows and over 1000 columns). Therefore, I abandoned that idea and wrote all data into a series of smaller text files processed by a set of Groovy modules specifically designed for each type of processing.

4.2.3 Semantic Type Classification Model

4.2.3.1 Sparse File Format

The full set of feature vectors extracted for each analyzed note type included a large number of feature vectors with a large number of features, some of which are binary and some of which are categorical. In order to acquire a classification model, I needed to find existing software or a classification algorithm implementation that could handle such a large dataset. I considered several software packages, including the widely used Weka data mining package [99]. Some algorithms are able to handle sparse data formats, so the S3 System module that deals with semantic type classification model acquisition includes converting full feature vectors (so-called dense vectors) into sparse vectors. The output of sparse conversion is a file in the sparse format where each discrete feature of the dense file is converted into a set of binary features. The sparse conversion module also outputs a full dictionary that links the sparse feature name to the feature in the dense dataset and its specific value. The sparse file format decreases the size of the file containing the dataset because only those features that are present in the feature vector are included. Figures 4.2a and 4.2b present examples of a dense and a sparse vector.
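The dense-to-sparse conversion described above amounts to giving every (feature, value) pair its own binary feature name and keeping a dictionary for the reverse lookup. The following Python sketch shows the idea under stated assumptions: the `MM<n>` naming imitates the feature names visible in Figure 4.2b, but the numbering scheme, the integer class identifiers, and the function name are hypothetical and not taken from the prototype.

```python
def to_sparse(dense_rows):
    """Convert dense feature vectors into sparse lines of binary feature names.

    dense_rows: list of (label, {feature_name: value}) pairs.
    Returns (lines, feature_ids, label_ids); feature_ids maps each
    (feature_name, value) pair to its sparse feature name.
    """
    feature_ids = {}
    label_ids = {}
    lines = []
    for label, features in dense_rows:
        class_id = label_ids.setdefault(label, len(label_ids))
        names = []
        for name, value in features.items():
            if value is None:
                continue  # absent values produce no sparse feature
            key = (name, value)
            if key not in feature_ids:
                feature_ids[key] = f"MM{len(feature_ids) + 1}"
            names.append(feature_ids[key])
        # One line per vector: class identifier followed by the active features.
        lines.append(" ".join([str(class_id)] + names))
    return lines, feature_ids, label_ids
```

Only features that are present appear on a line, which is what keeps the sparse file small; the returned `feature_ids` dictionary plays the role of the dictionary that links sparse names back to dense features and values.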
4.2.3.2 Machine Learning Algorithm

When selecting a machine learning algorithm to perform semantic type classification, I used the following guidelines:
- The algorithm has to be able to provide multiclass classification;
- The algorithm has to be able to handle categorical or binary features;
- The algorithm has to be scalable and able to process large data sets;
- The output of the algorithm should include not only the final prediction, but also a set of probabilities for other classes;
- The specific implementation of the algorithm has to be robust and relatively fast.

filename.txt, 1, 4, idcn, continue, verb, verb, 5, null, to, adv, adv, null, 3, null, will, modal, modal, null, 6, inpr, follow, verb, verb, NCI SNOMEDCT, 2, ocdi, social work, head, noun, AOD MSH MTH SNOMEDCT, 7, null, this, det, det, null, 1, zzzz, plan, verb, verb, null
(a) Dense feature vector.

114 MM142 MM143 MM1095 MM1096 MM1103 MM1135 MM1166 MM1167 MM1218 MM1219 MM1432 MM1485 MM2141 MM2670 MM11341 MM11342
(b) Sparse feature vector in MegaM format.

Figure 4.2: Examples of feature vectors.

After considering a number of different algorithms, I decided to use logistic regression as implemented by Hal Daume III in a package called MegaM [100]. MegaM satisfies all these requirements. This tool is based on maximum likelihood and maximum a posteriori optimization of maximum entropy models. By performing multiple iterations and weight adjustments, MegaM arrives at the optimal set of regression coefficients. MegaM accepts a sparse data file where each row represents a set of features that describe an unambiguous term. The output of MegaM processing is a logistic regression model that gives a weight to each feature.

Even though MegaM is quite robust, it runs out of memory when the dataset size exceeds the system capacity. So I had to decrease the data file to make it manageable by MegaM.
After some experimentation I found that a sample size of 50,000 was small enough for MegaM to process but large enough to produce a stable model. Therefore, the semantic type classification model was created using a reduced data set. Logistic regression models are prone to overfitting [101]. In order to mitigate this issue, the training records were selected randomly from the full data set.

4.3 System Application

After the sublanguage semantic schema is obtained, it can be automatically applied for run-time disambiguation. In order to disambiguate all terms in a specific text segment, a set of steps is performed (see Figure 4.3). First, MetaMap processes the text and produces an XML file that contains the full set of concepts mapped to the terms found in the text passage. The S3 System then analyzes the XML output and for each ambiguous term in the file it creates three components: 1) a feature vector according to the description presented in Section 4.2.1.1; 2) a set of semantic type sequences (patterns) taking into account the neighboring unambiguous mappings; and 3) a list of the corresponding UMLS concepts and semantic types. The latter serves as the sense inventory for the term.

Figure 4.3: S3 System word sense disambiguation flow.

Once sparse feature vectors and patterns are extracted, the S3 System uses classification and pattern matching to arrive at two semantic type predictions. Applying the previously acquired classification model to each feature vector, the S3 System classifies the terms with the most likely semantic types. Along with the most likely semantic type, the classification model also provides a list of probabilities of each semantic type in the sense inventory. It is possible that the most likely semantic type is actually not in the sense inventory. The next step analyzes the patterns extracted for the term of interest and matches them with the patterns in the sublanguage semantic schema.
This pattern matching step assigns probabilities to the semantic types from the sense inventories. The semantic type with the highest probability is selected as the pattern matching prediction. If none of the potential patterns matched any of those contained in the pattern set, then the term is marked as failed disambiguation.

After the previous steps are performed, each term has a sense inventory consisting of the semantic types, a classifier-predicted semantic type, and a pattern-matched semantic type. If the semantic type prediction produced by the logistic regression classifier differs from the one produced by the pattern matching, the classifier and pattern predictions have to be reconciled. There are several possible outcomes. If the pattern matching did not fail disambiguation, the most likely semantic type is chosen as the final prediction. If the pattern matching failed to find a probable semantic type, the classification semantic type is determined to be the final semantic type.

The final step is word sense disambiguation. The semantic type disambiguation results in the most probable semantic type. The sense inventory is reviewed and the UMLS concept that has the most probable semantic type is selected. If more than one concept has the same semantic type, the concept with the highest mapping score is selected. If all concepts with the specified semantic type have the same mapping score, they are assumed to be synonymous and all of the concepts are returned as the final concept selection.

Figure 4.4 presents the overview of the data flow during the application phase. The system accepts raw clinical text, either as a single sentence or a full clinical note. The current prototype does not have the functionality to determine which sublanguage semantic schema is applicable to the input text, so the operator has to specify that for the purposes of disambiguation.

Figure 4.4: S3 System application data flow
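The reconciliation and concept selection steps described earlier in this section can be sketched in a few lines. This Python example is an illustration under assumptions, not the prototype's code: it assumes that the pattern matching prediction takes precedence whenever pattern matching did not fail, which is one reading of the reconciliation rule, and the dictionary keys and placeholder concept identifiers (`CUI_A` and so on) are hypothetical.

```python
def final_semantic_type(pattern_st, classifier_st):
    """Reconcile the two predictions; pattern_st is None when pattern matching failed.

    Assumption of this sketch: the pattern-based prediction wins when available.
    """
    return pattern_st if pattern_st is not None else classifier_st


def select_concepts(sense_inventory, semantic_type):
    """Return the concept(s) with the chosen semantic type and the highest score.

    sense_inventory: list of dicts with hypothetical keys 'cui', 'semtype', 'score'.
    Several identifiers are returned when the best candidates tie on score,
    mirroring the rule that such concepts are treated as synonymous.
    """
    matching = [c for c in sense_inventory if c["semtype"] == semantic_type]
    if not matching:
        return []
    best = max(c["score"] for c in matching)
    return [c["cui"] for c in matching if c["score"] == best]
```

Returning a list rather than a single concept keeps the tie case explicit instead of hiding it behind an arbitrary choice.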
The S3 System then performs the processing steps outlined above, and the final output is the clinical note in which each meaningful term is annotated with a semantic type and a UMLS concept.

The pattern set for the sublanguage contains patterns of three format levels. Calculating the most probable semantic type starts with patterns of the most restrictive format level, Level 2. If no patterns of that format level match the patterns found for the term of interest, patterns of less restrictive format levels are applied. One of the parameters of the application phase is the lowest pattern format level to be applied. If the level parameter is set to 0, then the S3 System will disambiguate all terms. If the level parameter is set to 3, then the S3 System will not perform pattern matching, because 3 exceeds the highest available format level (Level 2).

Another parameter that can be manipulated during disambiguation is the lowest acceptable probability predicted by MegaM. If the most likely semantic type is not one of the semantic types from the sense inventory, the semantic types in the sense inventory are sorted by their probability. If the probability of the most likely semantic type is below the probability threshold, then the prediction is rejected and disambiguation is deemed to have failed. If the probability threshold is set to 1.0 or higher, then the classification prediction will not be considered for disambiguation, because the threshold is higher than any possible probability. If the probability threshold is set to 0.0, then the S3 System will disambiguate all terms simply based on the most likely semantic type.

By varying these two parameters, the S3 System operator can balance the number of false positives against the number of terms that the system fails to disambiguate. Thus, manipulation of the parameters changes the precision and recall of the system.

4.4 Validation

The S3 System is an implementation of the Sublanguage Semantic Schema approach to word sense disambiguation.
Once the system prototype is completed, validation is needed. The validation provides answers to the Aim II research questions:

Research question 2.1 - Does the developed system work well for clinical term disambiguation in a range of clinical note types as compared to a manually annotated test set?
Research question 2.2 - Does the system perform better than a baseline method such as MetaMap and the majority sense method?

To answer these research questions, I created a manually annotated corpus that can serve as a reference standard for the S3 System performance evaluation. In order to evaluate performance, the following definitions of the metrics are used in the analysis:

Reference Standard Positive - Total number of terms that were mapped by MetaMap to at least one UMLS concept that reflected their true meaning. These terms will be referred to as properly mapped terms.
Reference Standard Negative - Total number of terms that were mapped by MetaMap to at least one UMLS concept, but none of the concepts reflected the true meaning of the terms. These terms will be referred to as mismapped terms.
Total Positive - The number of terms that the S3 System was able to process and for which it produced a semantic type prediction. These terms will be referred to as disambiguated terms.
Total Negative - The number of terms that the S3 System failed to disambiguate.
True Positive - The number of disambiguated terms that were disambiguated correctly. A correctly disambiguated term is one for which the semantic type assigned by the system is the same as the one assigned by the human annotator.
True Negative - The number of mismapped terms that the S3 System was not able to disambiguate.
Accuracy - The proportion of the terms in the reference standard that were disambiguated correctly.
Recall - The proportion of the properly mapped terms in the reference standard that were correctly disambiguated by the S3 System.
Precision - The proportion of the disambiguated terms that were disambiguated correctly.

4.4.1 Annotations

The most common way to conduct rigorous system performance testing is to compare the system output against a reference standard [15]. The final accuracy of the S3 System depends on the performance of its components: the MedPost tagger, which performs sentence segmentation and part of speech tagging; and the MetaMap concept recognition engine, which performs mapping of terms to the concepts in the UMLS Metathesaurus. The S3 System performs word sense disambiguation on the sense inventories resulting from MetaMap processing. To ensure that the evaluation targets the component that I developed, the reference standard had to be created accounting for the limitations of other components. The annotators' task was limited to manual word sense disambiguation. The annotators were not asked to determine the sentence or phrase boundaries, or the intended meaning of the text beyond the sense inventory identified by MetaMap.

4.4.1.1 Sample Selection

Notes of seventeen note types are available to me; however, evaluating the S3 System performance on all note types is not feasible due to financial and time constraints. Therefore, four note types were selected for evaluation purposes: 1) Admissions History and Physical Notes; 2) Discharge Summaries; 3) Cardiology Clinic Notes; and 4) Social Service Notes.

The model employed by the S3 System performs disambiguation relying on the context information at a sentence level and disregards the larger context. Therefore, the annotators were presented with individual sentences for annotation. Each term in the corpus identified by MetaMap as a potentially relevant concept is an opportunity to fail. Therefore, the sample size is defined as the number of terms in the sample. Sample size is generally determined using Cochran's sample size formula.
\[
\text{Sample Size} = \frac{Z^{2}\, p\,(1 - p)}{c^{2}}
\]

where Z is the Z value corresponding to a specific confidence level (for a 95% confidence level Z = 1.96); p is the proportion of units in relation to the true proportion that one could expect (this value is assumed to be 0.5); and c is the degree of precision or confidence interval.

Assuming a confidence level of 95% and a confidence interval of 3%, the formula gives (1.96^2 x 0.5 x 0.5) / 0.03^2, so the sample size is determined to be 1067 terms for each note type. The method employed by the S3 System relies on the sentence context for disambiguation and is optimal if at least 3 terms are present in the sentence. Therefore, the sample for annotation was created by extracting a random set of sentences that contain at least three terms. Prior analysis estimated that such sentences have on average 6.7 terms per sentence. The simple calculation 1067 / 6.7 results in a sample size of 160 sentences per note type, or 640 sentences in total.

The sample size calculation for a comparison of two proportions with a confidence interval of 3% and a power of 90% for proportions 75% and 80% results in a sample size of 1193, or approximately 1200 terms. The number of sentences needed to get at least 1200 terms is approximately 180. Based on these calculations, I decided that a sample size of 200 sentences for each note type would provide enough power to accurately analyze the S3 System performance across all note types.

4.4.1.2 Annotation Process

Due to the strict privacy and security requirements directed by HIPAA, the notes are stored on a cluster specifically designed to store and process clinical data. The annotators were required to access the data remotely, passing through two levels of authentication. They were not allowed to copy the clinical text to their personal computers. To access the data, the annotators had to first log in to the CHPC via a Virtual Private Network client. Then they had to use a remote access application such as X Window System.
A Windows server was set up with Samba access to the data and an annotation application.

The extensible Human Oracle Suite of Tools (eHOST) application is an annotation tool developed by Brett South and a research team of the Office of Information & Technology (OI&T) at Veterans Affairs [102]. It can be used to annotate text by creating an annotation object represented by a span of text and the associated concept according to an annotation schema. The distinguishing feature of eHOST is its ability to accept annotations created outside of the application, called preannotations.

Since the goal of the annotation step is to create a reference standard to test performance of the S3 System, the annotation task is quite limited. For each term identified by MetaMap, the annotators had to choose one of the candidates that resulted from the MetaMap processing, or to specify that none of the candidates reflected the intended meaning of the term. For that purpose, each term had an associated sense inventory of UMLS concepts and corresponding semantic types. The preannotations were loaded into eHOST and presented to the annotators for disambiguation.

The UMLS semantic network contains 135 semantic types, which is a large number. To simplify the annotation scheme, only two markables were used: Concept and None. The markable type Concept was designed to contain information about the UMLS concept that MetaMap mapped to the term. The markable type None was designed to handle the case when none of the candidates represent the meaning of the corresponding term correctly. Each annotation object specified the span of each term, which was derived from the MetaMap output, and the concept description, which was a combination of the following MetaMap output elements that identify a specific UMLS concept: CandidateCUI, CandidateMatched, CandidatePreferred, and SemType.

Two annotators were recruited to perform annotations for this project.
They both have had previous experience with annotations and have used eHOST. Both of them are affiliated with the University of Utah as current and former employees. Both of the annotators completed Human Research training through CITI.

The annotators' task was to review the sense inventory for each term and select the concept that reflected the meaning of the term most accurately. Some of the terms were mislabeled by MetaMap, resulting in a set of annotations for a term, none of which reflect the meaning of the term. In those cases the annotators were asked to mark the term as markable type None.

A total of 5430 terms were disambiguated by the annotators. 374 of those terms were marked as "None of the above", indicating that MetaMap failed to produce at least one concept for the term that reflects the intended meaning of the word in the text (see Table 4.1). In addition, the annotators disagreed on whether one or none of the candidates represent the actual meaning for 653 terms, and 451 annotations were concepts with different semantic types, which indicates that a large portion of the concepts are so vague that their actual meaning might be interpreted in multiple ways. If two conflicting annotations had the same semantic type, these annotations were deemed equal and one of the annotations was selected randomly as the reference standard. For those terms that were marked with concepts that had different semantic types, another clinically trained person performed adjudication. The average annotator pair-wise agreement was 82.8%.

Table 4.1: Annotated corpus description.

Note Type                AHP     CCN     DIS     SSN     Total
Properly mapped terms    1252    1231    1329    1244    5056
Mismapped terms          54      113     70      137     374
Total term count         1306    1344    1399    1381    5430
Pair-wise agreement      82.5%   84.5%   84.3%   80.0%
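The recall, precision and F-score definitions from Section 4.4 reduce to simple ratios over three counts. The Python sketch below is a worked illustration rather than the evaluation code used in this research. It assumes that the "Match", "Mis-match" and "WSD Failed" rows of Table 4.3 correspond to correctly disambiguated, incorrectly disambiguated and not disambiguated properly mapped terms, and that the F-score is the balanced harmonic mean; the function name is hypothetical. The counts are those of the AHP column of Table 4.3.

```python
def evaluate(match, mismatch, failed):
    """Return (recall, precision, f_score) for one note type.

    match:    properly mapped terms disambiguated correctly
    mismatch: properly mapped terms disambiguated incorrectly
    failed:   properly mapped terms the system could not disambiguate
    """
    total = match + mismatch + failed       # properly mapped terms
    recall = match / total
    precision = match / (match + mismatch)  # share of disambiguated terms that are correct
    f_score = 2 * precision * recall / (precision + recall)
    return recall, precision, f_score


# AHP column of Table 4.3: 900 matches, 186 mismatches, 166 failures (1252 terms).
recall, precision, f_score = evaluate(900, 186, 166)
print(round(recall, 3), round(precision, 3), round(f_score, 3))  # 0.719 0.829 0.77
```

Under these assumptions the rounded values agree with the AHP column of Table 4.3 (0.719, 0.829, 0.770).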
4.5 Measuring Performance

To evaluate whether the S3 System performs well on clinical text, I compared the semantic types assigned to terms in the reference standard by the S3 System to those assigned by the human annotators. Following the steps outlined in Section 4.2.1, I trained models for the four note types using the full set of notes available to me. The full description of the training corpus for validation is presented in Table 4.2. After the models were acquired, I applied them to the manually annotated corpus and obtained the results shown in Table 4.3.

Table 4.2: Full data description for the four note types that were used in validation.

Note Type                                                      AHP          CCN         DIS          SSN
Number of processed files                                      42,911       24,302      64,530       3,414
Total number of unambiguous terms                              10,973,008   3,861,766   13,372,050   262,102
Number of unique patterns extracted from the training corpus   105,455      57,693      106,908      24,529

Table 4.3: Accuracy of S3 System as tested on a manually annotated set of sentences with format level threshold of 2 and classification probability threshold of 0.1.

Note Type      AHP     CCN     DIS     SSN     Average
Match          900     903     959     854
Mis-match      186     186     203     199
WSD Failed     166     142     167     191
Total Terms    1252    1231    1329    1244
Recall         0.719   0.734   0.722   0.686   0.715
Precision      0.829   0.829   0.825   0.811   0.824
F-score        0.770   0.778   0.770   0.744   0.765

4.5.1 Model Comparison

Previous analysis suggested that notes of different types belong to similar or dissimilar sublanguages. As an example, the hierarchical clustering tree in Table 3.2 showed that Discharge Summaries and Admission History and Physical notes are relatively similar and can be assumed to be of related sublanguages. On the
Reference URL	https://collections.lib.utah.edu/ark:/87278/s6fr34tp