Abstract: The accuracy of Optical Character Recognition (OCR) decreases significantly when dealing with handwritten text, low-quality scans, and complex backgrounds, often resulting in fragmented, noisy, and syntactically incorrect output. These limitations degrade the accuracy of subsequent Natural Language Processing (NLP) tasks such as summarization, information extraction, and automated document analysis. To address these issues, this research work proposes a combined OCR–NLP method that automatically detects the text type using DenseNet-121 and applies either Tesseract or OCRSpace depending on whether the input contains printed or handwritten text. The raw OCR output is then refined using the Phi-3 language model to correct grammar, enhance readability, and restore contextual meaning. Experimental results on mixed printed and handwritten datasets show a substantial improvement in accuracy, with reductions in Character Error Rate (CER) and Word Error Rate (WER) after NLP post-processing. The proposed system demonstrates a robust, scalable, and automated pipeline suitable for educational digitization, archival processing, and large-scale text-driven applications.
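The routing step described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: `classify_text_type` stands in for the DenseNet-121 classifier, and the OCR engines and Phi-3 post-processor are injected as callables rather than invoked directly.

```python
def classify_text_type(image):
    """Placeholder for the DenseNet-121 classifier: returns
    'printed' or 'handwritten' for the input image.
    (Stubbed out here; a real system would run the CNN.)"""
    return image.get("label", "printed")

def select_ocr_engine(image):
    """Route printed text to Tesseract and handwritten text to
    OCRSpace, as in the proposed pipeline."""
    label = classify_text_type(image)
    return "tesseract" if label == "printed" else "ocrspace"

def run_pipeline(image, run_ocr, post_process):
    """End-to-end flow: classify -> OCR -> LLM post-processing
    (Phi-3 in the paper; passed in here as a callable)."""
    engine = select_ocr_engine(image)
    raw_text = run_ocr(engine, image)
    return post_process(raw_text)
```

For example, an image classified as handwritten is sent to OCRSpace, and whatever raw text that engine produces is handed to the post-processing model for grammar and context repair.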
Keywords: OCR, NLP, DenseNet-121, Handwritten Recognition, Printed Text Recognition, Phi-3, Post-Processing.
DOI: 10.17148/IARJSET.2025.1211040
[1] Ravi P, Thejashwini M A, Thanushree S R, Sonashree M S, Vignesh M G, "Optimizing OCR Output: A Post-Processing Approach Using NLP," International Advanced Research Journal in Science, Engineering and Technology (IARJSET), DOI: 10.17148/IARJSET.2025.1211040