Cryptogram Decoding for Optical Character Recognition




Optical character recognition (OCR) systems for machine-printed documents typically require large numbers of font styles and character models to work well. When given a document printed in an unseen font, the performance of those systems degrades even in the absence of noise. In this paper, we perform OCR in an unsupervised fashion, without using any character models, by using a cryptogram decoding algorithm. We present results on real and artificial OCR data.

Optical character recognition is the task of converting images of text into their editable textual representations. Most OCR systems for machine-printed text need large collections of font styles and canonical character representations, and recognition proceeds by matching the input character images against these templates. Such systems are font dependent and suffer in accuracy when given documents printed in novel font styles. An alternative approach, which we examine here, groups together similar characters in the document and solves a cryptogram to assign labels to the resulting clusters of characters. This method does not require any character models, so it is able to handle arbitrary font styles. The approach subsumes the idea of adaptivity, in that it can take advantage of patterns that are particular to each document, such as regularities in image distortions. In addition, the cryptogram decoding procedure is well suited to performing OCR on images compressed using token-based methods such as DjVu, Silx, and DigiPaper.
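As a rough illustration of the decoding step described above, the following Python sketch treats each word as a sequence of cluster IDs and searches for a one-to-one assignment of letters to clusters under which every word appears in a lexicon. It is a toy under stated assumptions, not the algorithm used in this paper: it assumes the clustering is already done and error-free, that word boundaries are known, and that every word is in the lexicon. The names `pattern`, `decode`, `lexicon`, and `words` are hypothetical and chosen only for this example.

```python
from typing import Dict, Hashable, List, Optional, Sequence, Tuple

Word = Tuple[int, ...]  # one word, given as a sequence of cluster IDs


def pattern(seq: Sequence[Hashable]) -> Tuple[int, ...]:
    """Canonical repetition pattern, e.g. 'that' -> (0, 1, 2, 0)."""
    first_seen: Dict[Hashable, int] = {}
    return tuple(first_seen.setdefault(s, len(first_seen)) for s in seq)


def decode(words: List[Word], lexicon: List[str]) -> Optional[Dict[int, str]]:
    """Backtracking search for a one-to-one cluster -> letter mapping
    under which every cipher word is a lexicon word (assumed setup)."""
    # Index lexicon words by repetition pattern so that only
    # structurally compatible candidates are ever tried.
    by_pattern: Dict[Tuple[int, ...], List[str]] = {}
    for entry in lexicon:
        by_pattern.setdefault(pattern(entry), []).append(entry)

    # Handle the most constrained words (fewest candidates) first.
    unique = list(dict.fromkeys(words))
    order = sorted(unique, key=lambda w: len(by_pattern.get(pattern(w), [])))

    def search(i: int, mapping: Dict[int, str]) -> Optional[Dict[int, str]]:
        if i == len(order):
            return mapping
        cipher = order[i]
        for cand in by_pattern.get(pattern(cipher), []):
            trial = dict(mapping)
            used = set(trial.values())
            consistent = True
            for cluster, letter in zip(cipher, cand):
                if cluster in trial:
                    if trial[cluster] != letter:
                        consistent = False
                        break
                elif letter in used:
                    # Letter already assigned to a different cluster.
                    consistent = False
                    break
                else:
                    trial[cluster] = letter
                    used.add(letter)
            if consistent:
                result = search(i + 1, trial)
                if result is not None:
                    return result
        return None  # no candidate fits; caller backtracks

    return search(0, {})


# Hypothetical input: cluster IDs standing for the words "that the cat hat".
lexicon = ["the", "cat", "hat", "that", "at"]
words = [(0, 1, 2, 0), (0, 1, 4), (3, 2, 0), (1, 2, 0)]
mapping = decode(words, lexicon)
if mapping is not None:
    print(" ".join("".join(mapping[c] for c in w) for w in words))
```

Grouping lexicon words by repetition pattern is one simple way to prune the search; a practical system would also have to tolerate clustering errors and out-of-lexicon words, which this sketch does not attempt.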
