JUCS - Journal of Universal Computer Science 14(11): 1857-1876, doi: 10.3217/jucs-014-11-1857
Structure-Based Crawling in the Hidden Web
Marcio Vidal, Altigran S. da Silva, Edleno S. De Moura, João M.B. Cavalcanti
‡ Federal University of Amazonas, Manaus, Brazil
Open Access
Abstract
The number of applications that need to crawl the Web to gather data is growing at an ever-increasing pace. In some cases, the criterion that determines which pages must be included in a collection is based on their contents; in others, a structure-based criterion is more appropriate. In this article, we present a proposal to build structure-based crawlers that requires only a few examples of the pages to be crawled and an entry point to the target web site. Our crawlers can deal with form-based web sites. Contrary to other proposals, ours does not require a sample database to fill in the forms, nor does it require heavy user interaction. Our experiments show a precision of 100% on seventeen real-world web sites, with both static and dynamic content, and a recall of 95% on the eleven static web sites examined.
Keywords
Web crawling, hidden web, tree-edit distance, web wrappers