TY - GEN
T1 - Mining templates from search result records of search engines
AU - Zhao, Hongkun
AU - Meng, Weiyi
AU - Yu, Clement
PY - 2007
Y1 - 2007
N2 - Metasearch engine, Comparison-shopping and Deep Web crawling applications need to extract search result records enwrapped in result pages returned from search engines in response to user queries. The search result records from a given search engine are usually formatted based on a template. Precisely identifying this template can greatly help extract and annotate the data units within each record correctly. In this paper, we propose a graph model to represent record template and develop a domain independent statistical method to automatically mine the record template for any search engine using sample search result records. Our approach can identify both template tags (HTML tags) and template texts (non-tag texts), and it also explicitly addresses the mismatches between the tag structures and the data structures of search result records. Our experimental results indicate that this approach is very effective.
AB - Metasearch engine, Comparison-shopping and Deep Web crawling applications need to extract search result records enwrapped in result pages returned from search engines in response to user queries. The search result records from a given search engine are usually formatted based on a template. Precisely identifying this template can greatly help extract and annotate the data units within each record correctly. In this paper, we propose a graph model to represent record template and develop a domain independent statistical method to automatically mine the record template for any search engine using sample search result records. Our approach can identify both template tags (HTML tags) and template texts (non-tag texts), and it also explicitly addresses the mismatches between the tag structures and the data structures of search result records. Our experimental results indicate that this approach is very effective.
KW - Information extraction
KW - Search engine
KW - Wrapper generation
UR - https://www.scopus.com/pages/publications/36849073188
U2 - 10.1145/1281192.1281286
DO - 10.1145/1281192.1281286
M3 - Conference contribution
SN - 1595936092
SN - 9781595936097
T3 - Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
SP - 884
EP - 893
BT - KDD-2007
T2 - KDD-2007: 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Y2 - 12 August 2007 through 15 August 2007
ER -