Margin noise removal is an important step prior to segmentation and optical character recognition (OCR) of a page. The presence of this noise causes segmentation algorithms and OCR systems to produce erroneous output. In this paper, we present a comparative study of four margin noise removal algorithms. For evaluation, we consider seven metrics. Three of them (Hamming distance, noise ratio, and page content removal) evaluate a margin noise removal algorithm either on the quantity of noise removed or on the amount of original image content retained. We also treat margin noise removal as a binary classification task and define four further evaluation metrics using confusion matrices obtained experimentally over a labeled test dataset generated explicitly for evaluating margin noise removal algorithms. The dataset consists of document images with varied layouts and margin noise. The labeled dataset is also made public to support comparative studies of different margin noise removal algorithms.
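To make the two families of metrics concrete, the following is a minimal sketch, not the evaluation code used in the paper. It assumes binary images stored as nested lists (1 = foreground pixel, 0 = background) and a hypothetical per-pixel noise mask standing in for the labeled ground truth. The abstract does not name the four confusion-matrix metrics, so precision, recall, accuracy, and F1 are used here purely as an illustrative assumption; all function names are likewise hypothetical.

```python
from typing import Dict, List

Image = List[List[int]]  # 1 = foreground (ink) pixel, 0 = background


def hamming_distance(a: Image, b: Image) -> int:
    """Count the pixel positions at which two same-sized binary images differ."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("images must have identical dimensions")
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def confusion_matrix(original: Image, cleaned: Image, noise_mask: Image) -> Dict[str, int]:
    """Classify every foreground pixel of the original image.

    Positive class = margin noise. A pixel is "predicted noise" when the
    algorithm removed it (foreground in original, background in cleaned).
    """
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for row_o, row_c, row_n in zip(original, cleaned, noise_mask):
        for p_o, p_c, p_n in zip(row_o, row_c, row_n):
            if p_o == 0:
                continue  # background pixels are not classified
            removed = p_c == 0
            is_noise = p_n == 1
            if removed and is_noise:
                counts["tp"] += 1
            elif removed and not is_noise:
                counts["fp"] += 1  # page content wrongly removed
            elif not removed and is_noise:
                counts["fn"] += 1  # noise left behind
            else:
                counts["tn"] += 1
    return counts


def scores(cm: Dict[str, int]) -> Dict[str, float]:
    """Assumed illustrative metrics; the paper's own four metrics may differ."""
    def ratio(num: float, den: float) -> float:
        return num / den if den else 0.0

    precision = ratio(cm["tp"], cm["tp"] + cm["fp"])
    recall = ratio(cm["tp"], cm["tp"] + cm["fn"])
    accuracy = ratio(cm["tp"] + cm["tn"], sum(cm.values()))
    f1 = ratio(2 * precision * recall, precision + recall)
    return {"precision": precision, "recall": recall, "accuracy": accuracy, "f1": f1}


# Toy 3x4 page: the left column is margin noise, the rest is page content.
original = [[1, 1, 0, 0], [1, 0, 1, 1], [1, 0, 1, 0]]
noise_mask = [[1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0]]
ground_truth = [[0, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 0]]
# Hypothetical algorithm output: misses one noise pixel, removes one content pixel.
cleaned = [[0, 1, 0, 0], [0, 0, 1, 0], [1, 0, 1, 0]]

distance = hamming_distance(cleaned, ground_truth)
cm = confusion_matrix(original, cleaned, noise_mask)
result = scores(cm)
```

In this toy case the cleaned image differs from the ground truth at two pixels: one noise pixel left behind and one content pixel wrongly removed. The Hamming distance and the confusion matrix capture the same two errors from different angles.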
Downloads
PFrmGFill Method (S.DEY DAR 2012)
PFrmGM and BCWFilter Methods: two algorithms from Shafait (IJDAR 2008 and IMC2009)
Groundtruth Generation Tool Anveshak
For any queries, please contact: