Metadata Discovery and Integration to Support Reproducible Research using the Open Further Platform

Identifier 010_RR2016_Metadata_Discovery_Integration_WEN.pdf
Title Metadata Discovery and Integration to Support Reproducible Research using the Open Further Platform
Creator Jingran Wen; Peter Mo; Randy Madsen; Ryan Butcher; Phillip Warner; Ramkiran Gouripeddi; Julio C. Facelli; Department of Biomedical Informatics, University of Utah
Subject Biomedical Research; Research; Research Methods
Description Modern biomedical research often requires reusing and combining (federation and/or integration of) data from multiple disparate sources, such as clinical and electronic health records (phenotypes), public and private genomic annotations (genotypes), proteomics, metabolomics, biospecimen collections and environmental data. Each data source embeds, explicitly or implicitly, its own semantic (meaning) and syntactic (structural) descriptions of the data. Metadata, as described by the FAIR (Findable, Accessible, Interoperable, and Reusable) principles [1], are a requirement for reproducible research, which depends on discovering and understanding these metadata to facilitate proper use of the data. The current state of the art requires a great deal of manual curation, which renders these procedures non-scalable and consequently of limited practical value in the emerging big data biomedical science paradigm. To overcome these limitations, we are prototyping a computational infrastructure that supports automated and semi-automated mapping of metadata artifacts and terminologies. First, we advanced OpenFurther's metadata repository by adapting metadata specifications developed by the bioCADDIE consortium, so that it stores metadata for scalable interoperability between systems for creating, managing and using data. Second, we applied machine learning methods for automatically discovering metadata. Our preliminary results show that machine learning models were able to classify protein structure, genetic variant and general English corpus data with an average accuracy of 99%. Finally, we will use the findings from this work to develop a metadata and semantics discovery and mapping framework. The framework will be agnostic to specific mapping algorithms or tools, since many of these are domain-specific and dependent on the data, and it will choose the best available solution based on mapping performance, making it scalable and suitable for emerging big data applications. This will allow proper reuse, federation and integration of the metadata-enriched data as needed for supporting reproducible research.
References: [1] Wilkinson MD, et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data. 2016;3:160018. [2] Gouripeddi R, Facelli JC, et al. FURTHeR: An Infrastructure for Clinical, Translational and Comparative Effectiveness Research. AMIA Annual Fall Symposium. 2013; Washington, DC. [3] WG3 Members. (2015). WG3-MetadataSpecifications: NIH BD2K bioCADDIE Data Discovery Index WG3 Metadata Specification v1. Zenodo. doi:10.5281/zenodo.28019
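The algorithm-agnostic selection idea in the description (run several interchangeable mappers and keep whichever performs best) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the two toy mappers, the synonym table, the labelled field pairs, and the function names are hypothetical and are not taken from OpenFurther or the bioCADDIE specifications.

```python
from typing import Callable, Dict, List, Tuple

# A "mapper" is any function from a source field name to a target term.
# Treating mappers as plain callables keeps the selection step independent
# of how each mapper works internally.
Mapper = Callable[[str], str]

# Hypothetical synonym table, used only by the toy synonym mapper below.
SYNONYMS: Dict[str, str] = {"sex": "gender", "dob": "birth_date"}


def exact_mapper(field: str) -> str:
    """Toy mapper: normalise case and surrounding whitespace only."""
    return field.strip().lower()


def synonym_mapper(field: str) -> str:
    """Toy mapper: normalise, then translate known synonyms."""
    key = field.strip().lower()
    return SYNONYMS.get(key, key)


def accuracy(mapper: Mapper, labelled: List[Tuple[str, str]]) -> float:
    """Fraction of labelled (source, expected target) pairs mapped correctly."""
    if not labelled:
        raise ValueError("need at least one labelled pair")
    correct = sum(1 for source, target in labelled if mapper(source) == target)
    return correct / len(labelled)


def select_best(
    mappers: Dict[str, Mapper], labelled: List[Tuple[str, str]]
) -> Tuple[str, Dict[str, float]]:
    """Score every mapper on the same labelled set and return the top scorer."""
    scores = {name: accuracy(fn, labelled) for name, fn in mappers.items()}
    best = max(scores, key=lambda name: scores[name])
    return best, scores


# Hypothetical labelled examples: (source field, expected target term).
labelled = [("Sex", "gender"), ("DOB", "birth_date"), ("age", "age")]
mappers = {"exact": exact_mapper, "synonym": synonym_mapper}

best, scores = select_best(mappers, labelled)
print(best, scores)
```

Accuracy on a labelled set is only one possible performance measure; a real framework would likely compare mappers per domain and on held-out data.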
Relation is Part of 2016 Research Reproducibility Conference & Lectures
Publisher Spencer S. Eccles Health Sciences Library, University of Utah
Date Digital 2016
Date 2016
Format application/pdf
Rights Management Copyright 2016. For further information regarding the rights to this collection, please visit: https://NOVEL.utah.edu/about/copyright
Language eng
ARK ark:/87278/s67t21dh
Type Text
Setname ehsl_rr
ID 1400681
Reference URL https://collections.lib.utah.edu/ark:/87278/s67t21dh