Description |
Modern biomedical research often requires reusing and combining (federating and/or integrating) data from multiple disparate sources, such as clinical and electronic health records (phenotypes), public and private genomic annotations (genotypes), proteomics, metabolomics, biospecimen collections, and environmental data. Each data source embeds, explicitly or implicitly, its own semantic (meaning) and syntactic (structural) descriptions of the data. Metadata, as described by the FAIR [1] (Findable, Accessible, Interoperable, and Reusable) principles, are a requirement for reproducible research, which depends on discovering and understanding these metadata so that the data can be used properly. The current state of the art requires a great deal of manual curation, which makes these procedures non-scalable and consequently of limited practical value in the emerging big-data biomedical science paradigm. To overcome these limitations, we are prototyping a computational infrastructure that supports automated and semi-automated mapping of metadata artifacts and terminologies. First, we extended OpenFurther's metadata repository to adapt the metadata specifications developed by the bioCADDIE consortium, so that it can store metadata for scalable interoperability between systems for creating, managing, and using data. Second, we applied machine learning methods to discover metadata automatically. Our preliminary results show that machine learning models classified protein structure, genetic variant, and general English corpus data with an average accuracy of 99%. Finally, we will use the findings from this work to develop a metadata and semantics discovery and mapping framework. Because many mapping algorithms and tools are domain-specific and data-dependent, the framework will be agnostic to any particular one and will choose the best available solution based on mapping performance, making it scalable and suitable for emerging big-data applications. 
This will allow proper reuse, federation, and integration of the metadata-enriched data as needed to support reproducible research. References: [1] Wilkinson MD, et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data. 2016;3:160018. [2] Gouripeddi R, Facelli JC, et al. FURTHeR: An Infrastructure for Clinical, Translational and Comparative Effectiveness Research. AMIA Annual Fall Symposium. 2013; Washington, DC. [3] WG3 Members. WG3-MetadataSpecifications: NIH BD2K bioCADDIE Data Discovery Index WG3 Metadata Specification v1. Zenodo. 2015. doi:10.5281/zenodo.28019 |