Abstract: The exponential growth of digital image repositories across enterprise systems and the internet demands intelligent, scalable retrieval mechanisms that are both accurate and efficient. This paper presents an AI-based large-scale image retrieval system that uses the Contrastive Language-Image Pretraining (CLIP) model, specifically its Vision Transformer ViT-B/32 backbone, to extract 512-dimensional visual embeddings from images. The system indexes images offline through batch processing, stores L2-normalized feature vectors, and computes cosine similarity at query time to retrieve the top-K most visually similar images in real time. Additionally, a Support Vector Machine (SVM) classifier trained on CLIP embeddings achieves 98.76% accuracy with a macro-average F1-score of 0.9804 across 27 image categories. The system is deployed as a responsive web application built with the Flask framework, which lets end users run real-time image-based searches through a browser interface. Comparative evaluation shows that the proposed approach substantially outperforms all baseline methods, including dummy classifiers and logistic regression. The results confirm that deep visual embeddings derived from large-scale multimodal pretraining are highly effective for content-based image retrieval at scale.
Keywords: CLIP Embeddings, Content-Based Image Retrieval, Cosine Similarity, Vision Transformer, SVM Classification, Flask Deployment, Deep Visual Features, ViT-B/32, L2 Normalization.
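
The retrieval step summarized in the abstract (L2-normalize the stored vectors offline, then rank by cosine similarity at query time) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `l2_normalize`, `build_index`, and `top_k` are hypothetical, the toy 3-dimensional vectors stand in for the 512-dimensional CLIP embeddings, and only the Python standard library is used for portability.

```python
import heapq
import math
from typing import Dict, List, Sequence, Tuple


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit L2 norm (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def build_index(embeddings: Dict[str, Sequence[float]]) -> Dict[str, List[float]]:
    """Offline step: normalize every stored embedding once."""
    return {image_id: l2_normalize(vec) for image_id, vec in embeddings.items()}


def top_k(
    query: Sequence[float], index: Dict[str, List[float]], k: int
) -> List[Tuple[str, float]]:
    """Query step: for unit vectors, cosine similarity equals the dot product."""
    q = l2_normalize(query)
    scored = (
        (image_id, sum(a * b for a, b in zip(q, vec)))
        for image_id, vec in index.items()
    )
    # Keep the k highest-scoring (image_id, similarity) pairs, best first.
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Hypothetical toy data; real embeddings would come from the CLIP image encoder.
index = build_index({
    "cat.jpg": [1.0, 0.0, 0.0],
    "dog.jpg": [0.0, 1.0, 0.0],
    "kitten.jpg": [0.9, 0.1, 0.0],
})
results = top_k([2.0, 0.0, 0.0], index, k=2)
```

Because normalization happens once at indexing time, each query costs only one dot product per stored image. In this toy example the query ranks `cat.jpg` first and `kitten.jpg` second.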


DOI: 10.17148/IARJSET.2026.13462

How to Cite:

[1] Nandha M, Dr. C. Karpagavalli, Dr. M. Kaliappan, Dr. E. Mariappan, "AI-BASED LARGE-SCALE IMAGE RETRIEVAL SYSTEM USING CLIP EMBEDDINGS AND COSINE SIMILARITY," International Advanced Research Journal in Science, Engineering and Technology (IARJSET), DOI: 10.17148/IARJSET.2026.13462
