TY - GEN
T1 - Handwritten document retrieval strategies
AU - Govindaraju, Venu
AU - Cao, Huaigu
AU - Bhardwaj, Anurag
PY - 2009
Y1 - 2009
N2 - With the continuous growth of the World Wide Web, there is an urgent need for an efficient information retrieval system that can search and retrieve handwritten documents when presented with user queries. However, unconstrained handwriting recognition remains a challenging task with inadequate performance (around 30% accuracy), thus proving to be a major hurdle in providing a robust search experience in the domain of handwritten documents. In this paper, we describe our recent research with a focus on information retrieval from noisy text output by imperfect recognizers applied to handwritten document images. We describe three techniques, each exploring a different approach for solving the noisy text retrieval task. The first method uses a novel bootstrapping mechanism to refine the OCR'ed text and uses the cleaned text for retrieval. The second method uses the uncorrected or raw OCR'ed text but modifies the standard vector space model for handling noisy text issues. The third method employs robust image features to index the documents instead of using noisy OCR'ed text. We describe these approaches in detail and also present their performance using standard IR evaluation metrics.
AB - With the continuous growth of the World Wide Web, there is an urgent need for an efficient information retrieval system that can search and retrieve handwritten documents when presented with user queries. However, unconstrained handwriting recognition remains a challenging task with inadequate performance (around 30% accuracy), thus proving to be a major hurdle in providing a robust search experience in the domain of handwritten documents. In this paper, we describe our recent research with a focus on information retrieval from noisy text output by imperfect recognizers applied to handwritten document images. We describe three techniques, each exploring a different approach for solving the noisy text retrieval task. The first method uses a novel bootstrapping mechanism to refine the OCR'ed text and uses the cleaned text for retrieval. The second method uses the uncorrected or raw OCR'ed text but modifies the standard vector space model for handling noisy text issues. The third method employs robust image features to index the documents instead of using noisy OCR'ed text. We describe these approaches in detail and also present their performance using standard IR evaluation metrics.
KW - Handwriting analysis
KW - Information retrieval
KW - Keyword spotting
KW - OCR correction
UR - https://www.scopus.com/pages/publications/70450161156
U2 - 10.1145/1568296.1568300
DO - 10.1145/1568296.1568300
M3 - Conference contribution
SN - 9781605584966
T3 - ACM International Conference Proceeding Series
SP - 3
EP - 7
BT - AND 2009 - Proceedings of the 3rd Workshop on Analytics for Noisy Unstructured Text Data
T2 - 3rd Workshop on Analytics for Noisy Unstructured Text Data, AND 2009
Y2 - 23 July 2009 through 24 July 2009
ER -