Organizing hidden-web databases by clustering visible web documents

Publication Type Journal Article
School or College College of Engineering
Department Computing, School of
Creator Freire, Juliana
Other Author Barbosa, Luciano; Silva, Altigran
Title Organizing hidden-web databases by clustering visible web documents
Date 2007-04
Description In this paper we address the problem of organizing hidden-Web databases. Given a heterogeneous set of Web forms that serve as entry points to hidden-Web databases, our goal is to cluster the forms according to the database domains to which they belong. We propose a new clustering approach that models Web forms as a set of hyperlinked objects and considers visible information in the form context (both within and in the neighborhood of forms) as the basis for similarity comparison. Since the clustering is performed over features that can be automatically extracted, the process is scalable. In addition, because it uses a rich set of metadata, our approach is able to handle a wide range of forms, including content-rich forms that contain multiple attributes, as well as simple keyword-based search interfaces. An experimental evaluation over real Web data shows that our strategy generates high-quality clusters, measured both in terms of entropy and F-measure. This indicates that our approach provides an effective and general solution to the problem of organizing hidden-Web databases.
Type Text
Publisher Institute of Electrical and Electronics Engineers (IEEE)
First Page 326
Last Page 335
DOI 10.1109/ICDE.2007.367878
Subject Hidden-web databases; Web documents
Subject LCSH Database management; Data mining; Document clustering; Computer graphics
Language eng
Conference Title 2007 IEEE 23rd International Conference on Data Engineering; Istanbul, Turkey
Bibliographic Citation Barbosa, L., Freire, J., & Silva, A. (2007). Organizing hidden-web databases by clustering visible web documents. IEEE 23rd International Conference on Data Engineering (ICDE), 326-35.
Rights Management (c) 2007 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works must be obtained from the IEEE. http://dx.doi.org/10.1109/ICDE.2007.367878
Format Medium application/pdf
Format Extent 428,646 bytes
Identifier ir-main,12338
ARK ark:/87278/s6xh089d
Setname ir_uspace
ID 703125
Reference URL https://collections.lib.utah.edu/ark:/87278/s6xh089d