Abstract: The rise of deepfake technology, powered by advanced generative models, has introduced serious risks to digital media authenticity by enabling the creation of highly realistic but fabricated visual and auditory content. This research proposes a Multimodal DeepFake Detection System that integrates image and audio analysis to detect such forgeries effectively. The system uses the VGG-19 Convolutional Neural Network (CNN), fine-tuned via transfer learning on a curated dataset of real and manipulated facial images, to extract high-level visual features. For audio analysis, it employs Mel-Frequency Cepstral Coefficients (MFCCs) to represent speech characteristics and capture anomalies typical of synthetic or manipulated voices.
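A minimal sketch of this two-branch feature extraction is shown below, assuming a Keras VGG-19 backbone and librosa MFCCs; the input size, pooling choice, and 13-coefficient setting are illustrative assumptions rather than the authors' exact configuration.

```python
# Sketch only: visual features from a pretrained VGG-19 and audio features as MFCCs.
import numpy as np
import librosa
from tensorflow.keras.applications import VGG19
from tensorflow.keras.applications.vgg19 import preprocess_input

# VGG-19 backbone with ImageNet weights; global average pooling yields a
# fixed-length visual feature vector per face crop (transfer-learning setup).
vgg = VGG19(weights="imagenet", include_top=False, pooling="avg",
            input_shape=(224, 224, 3))

def visual_features(face_crop: np.ndarray) -> np.ndarray:
    """Return a 512-dim VGG-19 feature vector for one 224x224x3 face crop."""
    x = preprocess_input(face_crop[np.newaxis].astype("float32"))
    return vgg.predict(x, verbose=0)[0]

def audio_features(wav_path: str, n_mfcc: int = 13) -> np.ndarray:
    """Summarize an utterance as the mean and std of its MFCC frames."""
    signal, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)  # shape (n_mfcc, T)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])
```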
To improve robustness and generalization, data augmentation techniques are applied to both the visual and audio data. Features extracted from the two modalities are then fed to a Support Vector Machine (SVM) classifier, enabling accurate determination of content authenticity. The system achieves a classification accuracy of 92.5%, an F1 score of 92.1%, and an AUC-ROC of 0.96, outperforming several unimodal baselines. This research demonstrates that a multimodal approach significantly enhances deepfake detection performance and offers a scalable, real-time solution for combating misinformation, protecting identity, and preserving trust in digital communication.
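The fusion and classification stage could look like the following sketch, which assumes simple early fusion (feature concatenation) and a scikit-learn RBF-kernel SVM; the kernel, scaling, and fusion strategy are assumptions, not the published pipeline.

```python
# Sketch only: concatenate visual and audio features, then classify with an SVM.
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.metrics import f1_score, roc_auc_score

def fuse(visual_vec: np.ndarray, audio_vec: np.ndarray) -> np.ndarray:
    """Early fusion: concatenate the per-sample visual and audio feature vectors."""
    return np.concatenate([visual_vec, audio_vec])

# X_train / X_test hold fused feature vectors; y uses 1 for fake, 0 for real.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", probability=True))
# clf.fit(X_train, y_train)
# scores = clf.predict_proba(X_test)[:, 1]
# print(f1_score(y_test, scores > 0.5), roc_auc_score(y_test, scores))
```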
DOI: 10.17148/IARJSET.2025.125337