Searching for hidden-web databases

Publication Type Journal Article
School or College College of Engineering
Department Computing, School of
Creator Freire, Juliana
Other Author Barbosa, Luciano
Title Searching for hidden-web databases
Date 2005
Description Recently, there has been increased interest in the retrieval and integration of hidden-Web data with a view to leveraging high-quality information available in online databases. Although previous works have addressed many aspects of the actual integration, including matching form schemata and automatically filling out forms, the problem of locating relevant data sources has been largely overlooked. Given the dynamic nature of the Web, where data sources are constantly changing, it is crucial to automatically discover these resources. However, considering the number of documents on the Web (Google already indexes over 8 billion documents), automatically finding tens, hundreds or even thousands of forms that are relevant to the integration task is really like looking for a few needles in a haystack. Besides, since the vocabulary and structure of forms for a given domain are unknown until the forms are actually found, it is hard to define exactly what to look for. We propose a new crawling strategy to automatically locate hidden-Web databases which aims to achieve a balance between the two conflicting requirements of this problem: the need to perform a broad search while at the same time avoiding the need to crawl a large number of irrelevant pages. The proposed strategy does that by focusing the crawl on a given topic; by judiciously choosing links to follow within a topic that are more likely to lead to pages that contain forms; and by employing appropriate stopping criteria. We describe the algorithms underlying this strategy and an experimental evaluation which shows that our approach is both effective and efficient, leading to larger numbers of forms retrieved as a function of the number of pages visited than other crawlers.
Type Text
Publisher Workshop on the Web and Databases (WebDB)
First Page 16
Last Page 17
Subject Hidden-web databases; Web documents; Large scale information integration; Focused crawler; Form crawler
Subject LCSH Data mining; World Wide Web; Web databases; Database searching; Internet searching; Information retrieval
Language eng
Bibliographic Citation Barbosa, L., & Freire, J. (2005). Searching for hidden-web databases. Proceedings of the Eighth International Workshop on the Web and Databases (WebDB 2005), June 16-17, Baltimore, Maryland, 1-6.
Rights Management (c) Barbosa, L., & Freire, J.
Format Medium application/pdf
Format Extent 253,269 bytes
Identifier ir-main,12351
ARK ark:/87278/s6p27gnc
Setname ir_uspace
ID 706000
Reference URL https://collections.lib.utah.edu/ark:/87278/s6p27gnc