Abstract: The increasing reliance on digital classrooms, virtual meetings, and multimedia content has created strong demand for systems that can quickly convert long audio-video streams into structured, meaningful information. This paper introduces a unified, AI-driven transcription and summarization framework that spans three components: a Windows-based standalone desktop application that transcribes system audio in real time using Stereo Mix; a Chrome browser extension that performs tab-level audio capture and streaming transcription through a floating overlay interface; and a Docker-containerized Flask web application, deployed on Google Cloud Run, that supports file uploads, URL processing, AI-driven summarization, translation, and subtitle generation (SRT/VTT). The system captures audio from multiple sources (system-level outputs, active browser tabs, uploaded media files, and external URLs) and transforms it into accurate transcripts through an optimized pipeline featuring chunk-based processing, adaptive buffering, low-latency data streaming, and efficient WebSocket/SSE communication. Speech recognition is performed by Soniox Speech-to-Text (STT), with real-time transcripts delivered through tokenized streaming, while Google Gemini models generate multilingual summaries, context-aware descriptions, and synchronized subtitles. Reliability is strengthened through UUID-based storage, parallel chunk processing, and noise-resilient preprocessing. Experimental evaluation confirms that the architecture handles long-form recordings, noisy audio streams, browser restrictions, and fluctuating network conditions. The proposed solution provides a scalable and flexible platform for students, educators, content creators, and accessibility-driven applications, enabling fast transcript generation, cross-platform usability, and AI-powered summarization.
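As an illustration of the subtitle generation step (SRT/VTT) mentioned above, the following is a minimal sketch of how timed transcript segments could be rendered as an SRT file. It is not the authors' implementation: the `Segment` type and the `srt_timestamp` and `to_srt` helpers are hypothetical names introduced here, and the sketch assumes the speech-to-text stage has already produced segments with start and end times in seconds.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Segment:
    """Hypothetical timed transcript segment (times in seconds)."""
    start: float
    end: float
    text: str


def srt_timestamp(seconds: float) -> str:
    """Format a time offset as HH:MM:SS,mmm, the timestamp form used by SRT."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments: Iterable[Segment]) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        timing = f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}"
        blocks.append(f"{index}\n{timing}\n{seg.text.strip()}\n")
    return "\n".join(blocks)


if __name__ == "__main__":
    demo = [
        Segment(0.0, 2.5, "Welcome to the lecture."),
        Segment(2.5, 6.0, "Today we cover streaming transcription."),
    ]
    print(to_srt(demo))
```

A WebVTT variant would differ mainly in starting the file with a `WEBVTT` header line and using a period instead of a comma before the milliseconds.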

Keywords: real-time transcription, audio processing, speech-to-text, multilingual summarization, Server-Sent Events (SSE), AI-based summarization, browser extension, Flask web application, desktop transcription application, WebSocket streaming, cloud deployment, Docker, Soniox STT.


DOI: 10.17148/IARJSET.2025.1211037

How to Cite:

[1] Chaitrashree R, Harshitha V, Sowrabha J N, Spandana J, Najibul Rehman, "AI Based Real Time Video Transcript Extraction and Summarization," International Advanced Research Journal in Science, Engineering and Technology (IARJSET), DOI: 10.17148/IARJSET.2025.1211037
