Metadata Discovery and Integration to Support Reproducible Research using the Open Further Platform

Identifier	010_RR2016_Metadata_Discovery_Integration_WEN.pdf
Title	Metadata Discovery and Integration to Support Reproducible Research using the Open Further Platform
Creator	Jingran Wen; Peter Mo; Randy Madsen; Ryan Butcher; Phillip Warner; Ramkiran Gouripeddi; Julio C. Facelli; Department of Biomedical Informatics, University of Utah
Subject	Biomedical Research; Research; Research Methods
Description	Modern biomedical research often requires reusing and combining (through federation and/or integration) data from multiple disparate sources, such as clinical and electronic health records (phenotypes), public and private genomic annotations (genotypes), proteomics, metabolomics, biospecimen collections, and environmental data. Each data source embeds, explicitly or implicitly, its own semantic (meaning) and syntactic (structural) descriptions of the data. Metadata as described by the FAIR (Findable, Accessible, Interoperable, and Reusable) principles [1] is a requirement for reproducible research: these metadata must be discovered and understood before the data can be used properly. The current state of the art requires a great deal of manual human curation, which makes these procedures non-scalable and consequently of limited practical value in the emerging big data biomedical science paradigm. To overcome these limitations, we are prototyping a computational infrastructure that supports automated and semi-automated mapping of metadata artifacts and terminologies. First, we advanced OpenFurther's metadata repository to adapt metadata specifications developed by the bioCADDIE consortium to store metadata for scalable interoperability between systems for creating, managing, and using data. Second, we applied machine learning methods to discover metadata automatically. Our preliminary results show that machine learning models were able to classify protein structure, genetic variant, and general English corpus data with an average accuracy of 99%.
Finally, we will use the findings from this work to develop a metadata and semantics discovery and mapping framework. The framework will be agnostic to specific mapping algorithms and tools, since many of these are domain-specific and data-dependent, and will choose the best available solution based on mapping performance, making it scalable and suitable for emerging big data applications. This will allow proper reuse, federation, and integration of the metadata-enriched data as needed to support reproducible research.
[1] Wilkinson MD, et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data. 2016;3:160018.
[2] Gouripeddi R, Facelli JC, et al. FURTHeR: An Infrastructure for Clinical, Translational and Comparative Effectiveness Research. AMIA Annual Fall Symposium. 2013; Washington, DC.
[3] WG3 Members. WG3-MetadataSpecifications: NIH BD2K bioCADDIE Data Discovery Index WG3 Metadata Specification v1. Zenodo. 2015. doi:10.5281/zenodo.28019
Relation is Part of	2016 Research Reproducibility Conference & Lectures
Publisher	Spencer S. Eccles Health Sciences Library, University of Utah
Date Digital	2016
Date	2016
Format	application/pdf
Rights Management	Copyright 2016. For further information regarding the rights to this collection, please visit: https://NOVEL.utah.edu/about/copyright
Language	eng
ARK	ark:/87278/s67t21dh
Type	Text
Setname	ehsl_rr
ID	1400681
Reference URL	https://collections.lib.utah.edu/ark:/87278/s67t21dh
