Description |
Rule-based systems play an important role in clinical information extraction. In some specific information extraction tasks, rule-based systems show comparable or superior performance to state-of-the-art machine learning-based ones. Compared with machine learning approaches, rule-based systems require little or no human-annotated training data, and their rules are relatively easy to interpret and customize. Nevertheless, the rule-based approach still has some well-known shortcomings, such as the intensive labor required for rule development and poor scalability when processing large amounts of data. Previous work has sought to diminish these disadvantages. In addition to hardware advances, various novel data structures have been proposed to speed up string matching. However, these structures have not been tailored to facilitate rule-based natural language processing (NLP) tasks. Rather than re-tooling such tasks for integration into NLP pipelines, this dissertation aims to rethink rule processing to advance clinical NLP out of research and toward practical, enterprise-wide use. My work has three parts. The first study surveyed the clinical NLP literature and shared tasks to design a generic database schema that supports a wide range of clinical information extraction tasks. The schema was evaluated through a cognitive walk-through and a use case demonstration (in the third study). Second, an optimized rule processing engine was developed using a new Trie-based structure. This engine was evaluated by implementing three fundamental NLP components: a sentence segmenter, a named entity recognizer, and a clinical context detector. Evaluations showed that the performance of the components was comparable or superior to state-of-the-art or popular solutions. 
The third study proposed and evaluated a generic rule-based NLP pipeline using the generic database schema designed in the first study and the optimized components developed in the second study. Two distinct use-case studies, identifying pulmonary embolism in radiology reports and extracting comprehensive severity index indicators from clinical notes, established the face validity of the pipeline's generalizability and effectiveness. In summary, with the optimized rule processing engine and a properly constructed pipeline structure, this rule-based NLP pipeline can serve as a generic, effective, and efficient solution for different clinical NLP tasks at scale. |