Integrating structured data on the web

Update Item Information
Publication Type dissertation
School or College College of Engineering
Department Computing
Author Nguyen, Thanh Hoang
Title Integrating structured data on the web
Date 2013-05
Description The explosion of structured Web data (e.g., online databases, Wikipedia infoboxes) creates many opportunities for integrating and querying these data that go far beyond the simple search capabilities provided by search engines. Although much work has been devoted to data integration in the database community, the Web brings new challenges: the Web-scale (e.g., the large and growing volume of data) and the heterogeneity in Web data. Because there are so much data, scalable techniques that require little or no manual intervention and that are robust to noisy data are needed. In this dissertation, we propose a new and effective approach for matching Web-form interfaces and for matching multilingual Wikipedia infoboxes. As a further step toward these problems, we propose a general prudent schema-matching framework that matches a large number of schemas effectively. Our comprehensive experiments for Web-form interfaces and Wikipedia infoboxes show that it can enable on-the-fly, automatic integration of large collections of structured Web data. Another problem we address in this dissertation is schema discovery. While existing integration approaches assume that the relevant data sources and their schemas have been identified in advance, schemas are not always available for structured Web data. Approaches exist that exploit information in Wikipedia to discover the entity types and their associate schemas. However, due to inconsistencies, sparseness, and noise from the community contribution, these approaches are error prone and require substantial human intervention. Given the schema heterogeneity in Wikipedia infoboxes, we developed a new approach that uses the structured information available in infoboxes to cluster similar infoboxes and infer the schemata for entity types. Our approach is unsupervised and resilient to the unpredictable skew in the entity class distribution. Our experiments, using over one hundred thousand infoboxes extracted from Wikipedia, indicate that our approach is effective and produces accurate schemata for Wikipedia entities.
Type Text
Publisher University of Utah
Dissertation Institution University of Utah
Dissertation Name Doctor of Philosophy
Language eng
Rights Management Copyright © Thanh Hoang Nguyen 2013
Format Medium application/pdf
Format Extent 2,385,273 bytes
Identifier etd3/id/2137
ARK ark:/87278/s6s18hbq
Setname ir_etd
ID 195822
Reference URL https://collections.lib.utah.edu/ark:/87278/s6s18hbq
Back to Search Results