Automated extraction of hit numbers from search result pages

  • Renmin University of China

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

4 Scopus citations

Abstract

When a query is submitted to a search engine, the search engine returns a dynamically generated result page that contains the number of hits (i.e., the number of matching results) for the query. The hit number is useful in many applications, such as obtaining document frequencies of terms, estimating the sizes of search engines, and generating search engine summaries. In this paper, we propose a novel technique for automatically identifying the hit number for any search engine and any query. The technique consists of three steps: first, segment each result page into a set of blocks; second, identify the block(s) that contain the hit number using a machine learning approach; and finally, extract the hit number from the identified block(s) by comparing the patterns in multiple blocks from the same search engine. Experimental results indicate that this technique is highly accurate.
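
The three steps in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the segmentation rule, the learned classifier, or the pattern comparison, so the sketch makes assumptions throughout. Blocks are taken to be the text runs between HTML tags, a hypothetical cue-word heuristic (`CUE_WORDS`, `looks_like_hit_block`) stands in for the machine learning step, and the pattern comparison is reduced to picking the number position whose value changes across result pages from the same engine. All function names and sample pages are illustrative.

```python
import re
from html.parser import HTMLParser

NUMBER = re.compile(r"\d[\d,]*")
# Assumption: a few cue words replace the paper's learned block classifier.
CUE_WORDS = ("results", "found", "matches", "hits", "about")


class _BlockSplitter(HTMLParser):
    """Collects each run of text between tags as one block."""

    def __init__(self):
        super().__init__()
        self.blocks = []

    def handle_data(self, data):
        text = " ".join(data.split())
        if text:
            self.blocks.append(text)


def segment(html):
    """Step 1 (simplified): split a result page into text blocks."""
    splitter = _BlockSplitter()
    splitter.feed(html)
    splitter.close()
    return splitter.blocks


def looks_like_hit_block(block):
    """Step 2 (heuristic stand-in): short block with a number and a cue word."""
    words = block.lower().split()
    has_cue = any(word in CUE_WORDS for word in words)
    return bool(NUMBER.search(block)) and has_cue and len(words) <= 15


def extract_hit_numbers(pages):
    """Step 3 (simplified): compare candidate blocks from several pages of
    the same engine and keep the number position whose value varies."""
    candidates = []
    for html in pages:
        block = next((b for b in segment(html) if looks_like_hit_block(b)), None)
        if block is None:
            raise ValueError("no candidate hit-number block found")
        numbers = [int(n.replace(",", "")) for n in NUMBER.findall(block)]
        candidates.append(numbers)
    width = min(len(c) for c in candidates)
    varying = [i for i in range(width) if len({c[i] for c in candidates}) > 1]
    if not varying:
        raise ValueError("no number position varies across pages")
    position = varying[-1]  # assumption: constant numbers are paging ranges
    return [c[position] for c in candidates]
```

Given two hypothetical pages whose summary lines read "Results 1 - 10 of about 1,230 for apple" and "Results 1 - 10 of about 48,900 for banana", the constant positions (1 and 10) are discarded and the varying position is returned for each page. A real system would need a more robust segmentation and a trained classifier, as the abstract describes.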

Original language: English
Title of host publication: Advances in Web-Age Information Management - 7th International Conference, WAIM 2006, Proceedings
Publisher: Springer Verlag
Pages: 73-84
Number of pages: 12
ISBN (Print): 3540352252, 9783540352259
State: Published - 2006
Event: 7th International Conference on Advances in Web-Age Information Management, WAIM 2006 - Hong Kong, China
Duration: Jun 17 2006 → Jun 19 2006

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 4016 LNCS

Conference

Conference: 7th International Conference on Advances in Web-Age Information Management, WAIM 2006
Country/Territory: China
City: Hong Kong
Period: 06/17/06 → 06/19/06
