Visual Information Processing (VIP) Laboratory

Department of Computer Science & Engineering
Indian Institute of Technology Kharagpur, West Bengal, India

Margin Noise Removal Evaluation on MarNR Dataset




Margin noise removal is an important step before segmentation and optical character recognition (OCR) of a page, because the presence of this noise causes segmentation algorithms and OCR systems to produce erroneous output. In this paper, we present a comparative study of four margin noise removal algorithms, evaluated using seven metrics. Three of them (Hamming distance, noise ratio, and page content removal) evaluate an algorithm either on the quantity of noise removed or on how much of the original image content is retained. We also treat margin noise removal as a binary classification task and define four further metrics using confusion matrices obtained experimentally over a labeled test dataset generated explicitly for evaluating margin noise removal algorithms. The dataset consists of document images with varied layouts and margin noise. The labeled dataset is also made public for comparative study of different margin noise removal algorithms.
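
To illustrate the binary classification view described above, here is a minimal Python sketch that compares a predicted noise mask with a ground-truth mask pixel by pixel. It rests on several assumptions that are not taken from the paper: masks are nested lists in which 1 marks a margin-noise pixel and 0 marks everything else, Hamming distance is taken as the number of disagreeing pixels, and the four summary scores (precision, recall, accuracy, F1) are common confusion-matrix choices rather than the metrics actually defined in the paper. The function names are hypothetical; the Evaluation Code and Metrics Used for Evaluation downloads hold the actual definitions.

```python
from typing import Dict, List

# Assumed mask encoding: 1 = margin-noise pixel, 0 = page content or background.
Mask = List[List[int]]


def confusion_counts(predicted: Mask, truth: Mask) -> Dict[str, int]:
    """Count pixel-level TP/FP/FN/TN with 'noise' as the positive class."""
    if len(predicted) != len(truth):
        raise ValueError("masks must have the same number of rows")
    counts = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for p_row, t_row in zip(predicted, truth):
        if len(p_row) != len(t_row):
            raise ValueError("mask rows must have the same length")
        for p, t in zip(p_row, t_row):
            if p and t:
                counts["tp"] += 1   # noise correctly flagged
            elif p:
                counts["fp"] += 1   # content wrongly flagged as noise
            elif t:
                counts["fn"] += 1   # noise left in the image
            else:
                counts["tn"] += 1   # content correctly kept
    return counts


def hamming_distance(predicted: Mask, truth: Mask) -> int:
    """Number of pixels where the two masks disagree (assumed definition)."""
    c = confusion_counts(predicted, truth)
    return c["fp"] + c["fn"]


def summary_scores(c: Dict[str, int]) -> Dict[str, float]:
    """Illustrative scores derived from the confusion counts."""
    def ratio(num: float, den: float) -> float:
        return num / den if den else 0.0

    precision = ratio(c["tp"], c["tp"] + c["fp"])
    recall = ratio(c["tp"], c["tp"] + c["fn"])
    accuracy = ratio(c["tp"] + c["tn"], sum(c.values()))
    f1 = ratio(2 * precision * recall, precision + recall)
    return {"precision": precision, "recall": recall,
            "accuracy": accuracy, "f1": f1}


# Toy 2x4 masks, invented purely for illustration.
truth = [[1, 1, 0, 0],
         [1, 0, 0, 0]]
predicted = [[1, 0, 0, 0],
             [1, 1, 0, 0]]

counts = confusion_counts(predicted, truth)
scores = summary_scores(counts)
```

On the toy masks, one noise pixel is missed and one content pixel is wrongly flagged, so the two masks disagree on two pixels. Counting per pixel keeps the sketch independent of any image library; real document images would more likely be handled as arrays.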

Downloads (Right click to view/download)

MarNR Dataset

PFrmGFill Method (S. Dey, DAR 2012)

PFrmGM and BCWFilter Methods: two algorithms from Shafait (IJDAR 2008 and IMC2009)

OCROPUS 0.4

unpaper

Evaluation Code

Metrics Used for Evaluation

Ground Truth Generation Tool: Anveshak

For any queries, please contact:

  • Soumyadeep Dey
    Department of Computer Science and Engineering
    Indian Institute of Technology Kharagpur
    Email: soumyadeepdey AT gmail DOT com
  • Jayanta Mukherjee
    Department of Computer Science and Engineering
    Indian Institute of Technology Kharagpur
    Email: jay AT cse DOT iitkgp DOT ernet DOT in
  • Shamik Sural
    Department of Computer Science and Engineering
    Indian Institute of Technology Kharagpur
    Email: shamik AT sit DOT iitkgp DOT ernet DOT in