Abstract
In this paper, we present an approach to the bootstrap learning of a page segmentation model. The idea evolved from attempts to segment dictionaries, which often have a consistent page structure, and is extended to the segmentation of more general structured documents. When the structure is highly regular, the layout can be learned from examples of only a few pages. The system is first trained on a small number of samples, and a larger test set is then processed using the training result. After corrections are made to a selected subset of the test set, the corrected samples are combined with the original training samples to generate bootstrap samples. The newly created samples are used to retrain the system, refine the learned features, and resegment the test samples. This procedure is applied iteratively until the learned parameters are stable. With this approach, a large set of training samples does not need to be provided initially. We have applied this segmentation to many structured documents, such as dictionaries, phone books, and spoken language transcripts, and obtained satisfactory segmentation performance.
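The iterative procedure described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the "model" is a single indent threshold that separates entry-start lines from continuation lines, the hypothetical `oracle` callable stands in for a human making corrections, the lines nearest the threshold are the ones selected for correction, and "bootstrap samples" are formed by simply merging corrected lines into the training set.

```python
from statistics import mean


def train(samples):
    """Learn an indent threshold from (indent, is_entry_start) pairs.

    Toy stand-in for a segmentation model; needs at least one sample
    of each class, otherwise statistics.mean raises StatisticsError.
    """
    starts = [x for x, is_start in samples if is_start]
    conts = [x for x, is_start in samples if not is_start]
    return (mean(starts) + mean(conts)) / 2


def segment(threshold, indents):
    """Label each line; entry starts are assumed to be less indented."""
    return [(x, x < threshold) for x in indents]


def bootstrap(seed, unlabeled, oracle, batch=2, tol=1e-3, max_rounds=20):
    """Retrain on seed plus corrected samples until the threshold is stable."""
    training = list(seed)
    pending = list(unlabeled)
    threshold = train(training)
    for _ in range(max_rounds):
        if not pending:
            break
        # Segment the remaining test lines with the current model.
        predicted = segment(threshold, pending)
        # Select the least certain lines (closest to the threshold).
        predicted.sort(key=lambda s: abs(s[0] - threshold))
        chosen = [x for x, _ in predicted[:batch]]
        # "Correct" the selected subset and merge it into the training set.
        training.extend((x, oracle(x)) for x in chosen)
        for x in chosen:
            pending.remove(x)
        # Retrain and stop once the learned parameter no longer moves.
        new_threshold = train(training)
        stable = abs(new_threshold - threshold) < tol
        threshold = new_threshold
        if stable:
            break
    return threshold


# Hypothetical data: two labeled seed lines and six unlabeled indents.
seed = [(0.0, True), (4.0, False)]
unlabeled = [0.5, 1.0, 2.5, 3.5, 4.5, 5.0]
oracle = lambda x: x < 3.0  # stand-in for manual correction
learned = bootstrap(seed, unlabeled, oracle)
```

The loop mirrors the train, segment, correct, retrain cycle, with parameter stability as the stopping rule. A real system would learn many layout features rather than one threshold.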
| Original language | English |
|---|---|
| Pages (from-to) | 179-188 |
| Number of pages | 10 |
| Journal | Proceedings of SPIE - The International Society for Optical Engineering |
| Volume | 5010 |
| State | Published - 2003 |
| Event | Document Recognition and Retrieval X, Santa Clara, CA, United States (Jan 22 2003 → Jan 24 2003) |
Keywords
- Bootstrap
- Document Segmentation
- Layout Analysis
- OCR
Fingerprint
Dive into the research topics of 'Bootstrapping structured page segmentation'. Together they form a unique fingerprint.