Skip to main navigation Skip to search Skip to main content

Mining templates from search result records of search engines

Research output: Chapter in Book/Report/Conference proceedingConference contributionpeer-review

34 Scopus citations

Abstract

Metasearch engine, Comparison-shopping and Deep Web crawling applications need to extract search result records enwrapped in result pages returned from search engines in response to user queries. The search result records from a given search engine are usually formatted based on a template. Precisely identifying this template can greatly help extract and annotate the data units within each record correctly. In this paper, we propose a graph model to represent record template and develop a domain independent statistical method to automatically mine the record template for any search engine using sample search result records. Our approach can identify both template tags (HTML tags) and template texts (non-tag texts), and it also explicitly addresses the mismatches between the tag structures and the data structures of search result records. Our experimental results indicate that this approach is very effective.

Original languageEnglish
Title of host publicationKDD-2007
Subtitle of host publicationProceedings of the Thirteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Pages884-893
Number of pages10
DOIs
StatePublished - 2007
EventKDD-2007: 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - San Jose, CA, United States
Duration: Aug 12 2007Aug 15 2007

Publication series

NameProceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Conference

ConferenceKDD-2007: 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Country/TerritoryUnited States
CitySan Jose, CA
Period08/12/0708/15/07

Keywords

  • Information extraction
  • Search engine
  • Wrapper generation

Fingerprint

Dive into the research topics of 'Mining templates from search result records of search engines'. Together they form a unique fingerprint.

Cite this