Abstract: OmniRead AI is an integrated web-based system that converts images containing text or visual scenes into natural-sounding speech. It combines modern OCR and vision-language models with speech synthesis: users upload an image or enter text, the system extracts and optionally interprets the content, and then reads it aloud. The pipeline uses EasyOCR and Tesseract for text extraction, Moondream AI for prompt-driven vision-language understanding, and the gTTS (Google Text-to-Speech) library for audio generation. Implemented in Python with Flask, OmniRead AI demonstrates a seamless “image-to-voice” experience. This paper describes the system’s architecture (Fig. 1), preprocessing steps, OCR ensemble, generative AI integration, and TTS pipeline. We report example outputs and discuss application scenarios. OmniRead AI aims to enhance accessibility for visually impaired users and to serve as a multipurpose assistive tool for education and productivity.

Keywords: Image-to-Speech, Optical Character Recognition, Vision-Language Model, Text-to-Speech, EasyOCR, Tesseract OCR, Moondream AI, gTTS, Generative AI, Accessibility.
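
The flow summarized in the abstract (OCR ensemble, optional prompt-driven interpretation, then speech synthesis) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class name `ImageToVoicePipeline`, the stage signatures, and in particular the rule of keeping the longest OCR result are assumptions made for illustration, since the abstract does not state how the two OCR outputs are combined. Each stage is passed in as a plain callable so that real engines can be plugged in later.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

# Hypothetical stage signatures (not taken from the paper).
OcrEngine = Callable[[str], str]          # image path -> extracted text
VisionModel = Callable[[str, str], str]   # (image path, prompt) -> answer
Synthesizer = Callable[[str, str], None]  # (text, audio output path) -> None


@dataclass
class ImageToVoicePipeline:
    """Illustrative image-to-voice pipeline with pluggable stages."""

    ocr_engines: Sequence[OcrEngine]
    synthesize: Synthesizer
    vision_model: Optional[VisionModel] = None

    def extract_text(self, image_path: str) -> str:
        """Run every OCR engine and keep the longest non-empty result.

        Assumption: "longest wins" is a placeholder selection rule.
        """
        candidates = []
        for engine in self.ocr_engines:
            try:
                candidates.append(engine(image_path).strip())
            except Exception:
                # A failing engine should not abort the whole request.
                continue
        return max(candidates, key=len, default="")

    def run(self, image_path: str, out_path: str,
            prompt: Optional[str] = None) -> str:
        """Produce audio for an image and return the text that was spoken."""
        if prompt and self.vision_model is not None:
            # Prompt given: ask the vision-language model instead of OCR.
            text = self.vision_model(image_path, prompt).strip()
        else:
            text = self.extract_text(image_path)
        if not text:
            raise ValueError("no text to read aloud")
        self.synthesize(text, out_path)
        return text
```

In a real deployment the callables could wrap, for example, `pytesseract.image_to_string(Image.open(path))` for Tesseract, `easyocr.Reader(['en']).readtext(path, detail=0)` for EasyOCR (joining the returned strings), and `gTTS(text).save(out_path)` for synthesis. How Moondream is invoked is left open here because the abstract does not specify it.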


DOI: 10.17148/IARJSET.2025.12734
