Abstract: This survey paper presents a comparative study of recent advancements in vision-language models, focusing on their methodologies, applications, and impact on tasks such as image captioning and multimodal understanding. It analyzes three key research papers: Gemini AI for Vision-Language Tasks (2024), BLIP: Bootstrapping Language-Image Pre-training (2022), and Transformers for Image Captioning (2020), each contributing uniquely to the field of artificial intelligence and computer vision.

The paper on Gemini AI introduces a state-of-the-art multimodal large language model designed for seamless integration of text, images, audio, and video. Its optimized transformer-based architecture enables extensive contextual understanding, making it highly effective for real-world multimodal tasks. However, its high computational requirements and potential challenges in handling complex real-world scenarios pose limitations.

The BLIP framework addresses the challenge of leveraging noisy web data for effective language-image pretraining. It implements a bootstrapped learning approach by combining synthetic caption generation and a filtering mechanism to improve dataset quality. This technique significantly enhances vision-language model performance across multiple benchmarks but remains dependent on the accuracy of its filtering strategy.
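The bootstrapping idea described above can be illustrated with a minimal sketch. The helpers below (`generate_caption`, `match_score`) are hypothetical stand-ins for BLIP's learned captioner and image-text matching model, not its actual API; the point is the pipeline shape: propose a synthetic caption for each web image, score both the noisy and synthetic captions, and keep only pairs that clear a quality threshold.

```python
def generate_caption(image_id):
    # Hypothetical stand-in for a learned captioner producing a synthetic caption.
    return f"a photo of item {image_id}"

def match_score(image_id, caption):
    # Hypothetical stand-in for a learned image-text matching model (score in [0, 1]).
    return 0.9 if str(image_id) in caption else 0.2

def bootstrap_dataset(web_pairs, threshold=0.5):
    """Replace noisy web captions with synthetic ones where better, then filter."""
    cleaned = []
    for image_id, noisy_caption in web_pairs:
        candidates = [noisy_caption, generate_caption(image_id)]
        # Keep the best-scoring caption, but only if it passes the filter.
        best = max(candidates, key=lambda c: match_score(image_id, c))
        if match_score(image_id, best) >= threshold:
            cleaned.append((image_id, best))
    return cleaned

pairs = [(1, "random alt text"), (2, "a photo of item 2")]
print(bootstrap_dataset(pairs))
```

As the abstract notes, the quality of the resulting dataset hinges entirely on the filter: if `match_score` misjudges a pair, bad captions survive and good ones are discarded.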

The study on Transformers for Image Captioning explores the application of self-attention mechanisms in generating coherent and contextually rich image descriptions. The transformer-based architecture allows for improved relationship modeling within images, leading to higher-quality captions. Despite its success, the model's high computational demands and dependency on large-scale datasets present challenges for practical deployment.
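The relationship modeling mentioned above comes from scaled dot-product self-attention: every image region attends to every other region, so the representation of each region mixes in information from related ones. A minimal sketch (identity query/key/value projections, no learned weights, purely illustrative):

```python
import numpy as np

def self_attention(x):
    """Scaled dot-product self-attention over a sequence of feature vectors.

    x: array of shape (seq_len, d). For simplicity, queries, keys, and values
    are all x itself; a real transformer uses learned linear projections.
    """
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)  # pairwise similarity, shape (seq, seq)
    # Row-wise softmax turns similarities into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x  # each position becomes a weighted mix of all positions

# Three "image region" feature vectors of dimension 4.
regions = np.array([[1.0, 0.0, 0.0, 0.0],
                    [0.0, 1.0, 0.0, 0.0],
                    [1.0, 1.0, 0.0, 0.0]])
out = self_attention(regions)
print(out.shape)  # (3, 4)
```

The quadratic `(seq, seq)` score matrix is also where the computational cost noted in the abstract comes from: attention scales with the square of the number of regions or tokens.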

Through this comparative analysis, the paper highlights the evolution of vision-language models, discussing their strengths, limitations, and future research directions. By understanding the advancements in multimodal AI, researchers can develop more efficient and inclusive assistive technologies, particularly in fields such as accessibility, content generation, and human-computer interaction.


DOI: 10.17148/IARJSET.2025.12219
