Abstract: Human emotions constitute complex psychological states that manifest through multiple communication channels, including facial expressions, speech patterns, and linguistic content. This paper presents a novel multimodal emotion recognition system that synergistically integrates visual, auditory, and textual modalities using specialized deep learning architectures. The visual processing pipeline employs a Convolutional Neural Network with wavelet-based preprocessing, achieving 97.1% accuracy on the FER-2013 dataset. For speech analysis, we implement a hybrid CNN-LSTM model that processes Mel-frequency cepstral coefficients with delta features. Textual emotion classification leverages a fine-tuned BERT model that captures nuanced contextual relationships. These modalities are fused through an attention mechanism that dynamically weights their contributions based on signal quality and contextual relevance. Comprehensive experiments demonstrate our system's superiority over unimodal approaches, with a 12.4% improvement in classification accuracy. The implemented web interface delivers real-time analysis with 47ms latency, enabling practical applications in mental health monitoring, human-computer interaction, and affective computing.
Keywords: Affective Computing, Multimodal Learning, Deep Neural Networks, Emotion Recognition, Human-Computer Interaction
DOI: 10.17148/IARJSET.2025.125347
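To make the attention-based fusion described in the abstract concrete, the following is a minimal sketch of an attention-weighted fusion layer over per-modality embeddings. The `AttentionFusion` class, the embedding dimensions, and the PyTorch implementation details are illustrative assumptions only, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    """Sketch of attention-weighted fusion of modality embeddings.

    Hypothetical example: each modality embedding (visual, audio, text)
    is projected into a shared space, scored by a small linear layer,
    and combined with softmax-normalised weights. Dimensions are
    placeholders, not values from the paper.
    """

    def __init__(self, dims=(512, 256, 768), fused_dim=256):
        super().__init__()
        # One projection per modality into the shared fused space.
        self.proj = nn.ModuleList([nn.Linear(d, fused_dim) for d in dims])
        # Scalar relevance score per projected modality embedding.
        self.score = nn.Linear(fused_dim, 1)

    def forward(self, visual, audio, text):
        # Stack projections: (batch, num_modalities, fused_dim)
        h = torch.stack(
            [p(x) for p, x in zip(self.proj, (visual, audio, text))], dim=1
        )
        # Softmax over the modality axis gives dynamic per-sample weights.
        weights = torch.softmax(self.score(h), dim=1)  # (batch, 3, 1)
        # Weighted sum of the projected modalities.
        return (weights * h).sum(dim=1)                # (batch, fused_dim)

# Usage with random stand-in embeddings for a batch of 4 samples.
fusion = AttentionFusion()
fused = fusion(torch.randn(4, 512), torch.randn(4, 256), torch.randn(4, 768))
print(fused.shape)  # torch.Size([4, 256])
```

Weighting modalities per sample in this way reflects the abstract's claim that contributions are balanced dynamically, for example down-weighting a noisy audio channel when the visual and textual signals are more informative.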