
Bootstrapping structured page segmentation

Research output: Contribution to journal › Conference article › peer-review

8 Scopus citations

Abstract

In this paper, we present an approach to the bootstrap learning of a page segmentation model. The idea evolved from attempts to segment dictionaries, which often have a consistent page structure, and is extended to the segmentation of more general structured documents. In cases of highly regular structure, the layout can be learned from examples of only a few pages. The system is first trained using a small number of samples, and a larger test set is processed based on the training result. After corrections are made to a selected subset of the test set, the corrected samples are combined with the original training samples to generate bootstrap samples. The newly created samples are used to retrain the system, refine the learned features, and resegment the test samples. This procedure is applied iteratively until the learned parameters are stable. With this approach, a large set of training samples does not need to be provided initially. We have applied this segmentation to many structured documents, such as dictionaries, phone books, and spoken language transcripts, and obtained satisfactory segmentation performance.
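The iterative procedure described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the "model" is a single indentation threshold separating "entry" lines from "continuation" lines, the subset chosen for correction is the lines closest to the current threshold, the bootstrap samples are simply the union of the seed samples and the corrected samples, and the function names (`train`, `segment`, `bootstrap_segmentation`) and the `correct` callback standing in for a human operator are hypothetical.

```python
from statistics import mean


def train(samples):
    """Learn one indentation threshold from (indent, label) pairs.

    Toy stand-in for the real layout model; assumes both classes are present.
    """
    entries = [x for x, label in samples if label == "entry"]
    conts = [x for x, label in samples if label == "continuation"]
    return (mean(entries) + mean(conts)) / 2


def segment(threshold, indents):
    """Label each line by comparing its indentation with the threshold."""
    return ["entry" if x < threshold else "continuation" for x in indents]


def bootstrap_segmentation(seed_samples, test_indents, correct,
                           subset_size=2, tol=1e-3, max_rounds=10):
    """Train on a few samples, then retrain on corrected output until stable."""
    pool = list(seed_samples)            # original training samples
    threshold = train(pool)
    uncorrected = list(range(len(test_indents)))

    for _ in range(max_rounds):
        if not uncorrected:
            break
        predicted = segment(threshold, test_indents)
        # Assumed selection rule: correct the lines nearest the threshold.
        uncorrected.sort(key=lambda i: abs(test_indents[i] - threshold))
        chosen = uncorrected[:subset_size]
        uncorrected = uncorrected[subset_size:]
        # Combine corrected samples with the existing pool (assumed
        # interpretation of "bootstrap samples").
        pool += [(test_indents[i], correct(test_indents[i], predicted[i]))
                 for i in chosen]
        new_threshold = train(pool)
        stable = abs(new_threshold - threshold) < tol
        threshold = new_threshold
        if stable:                       # learned parameter stopped moving
            break

    return threshold, segment(threshold, test_indents)


if __name__ == "__main__":
    seed = [(0.0, "entry"), (2.0, "continuation")]
    pages = [0.5, 1.0, 1.5, 6.0, 7.0, 8.0]
    # Simulated operator: ignores the prediction and returns the true label.
    oracle = lambda indent, predicted: "entry" if indent < 3 else "continuation"
    print(bootstrap_segmentation(seed, pages, oracle))
```

In this toy run the seed samples alone place the threshold too low, so some entry lines are initially mislabeled; each round of corrections moves the threshold until the test lines are separated. A real system would replace the threshold with the learned layout features and the oracle with manual correction.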

Original language: English
Pages (from-to): 179-188
Number of pages: 10
Journal: Proceedings of SPIE - The International Society for Optical Engineering
Volume: 5010
DOIs
State: Published - 2003
Event: Document Recognition and Retrieval X - Santa Clara, CA, United States
Duration: Jan 22 2003 → Jan 24 2003

Keywords

  • Bootstrap
  • Document Segmentation
  • Layout Analysis
  • OCR
