Abstract
The exponential growth of digital image collections has necessitated the development of effective image retrieval systems. Methods relying on handcrafted features often fall short in capturing complex image semantics, prompting a shift towards deep learning-based approaches. Current state-of-the-art techniques automatically extract high-dimensional features using convolutional neural networks (CNNs) and self-supervised learning (SSL). However, there is still room for improvement in retrieval accuracy and efficiency, particularly when handling diverse datasets. This research presents an image retrieval model that integrates pre-trained EfficientNetB0 and ResNet18 backbones using SimCLR-based contrastive learning to enhance feature extraction, and reduces the dimensionality of the extracted features with a learnable fusion layer. This layer learns to selectively compress and prioritize the dimensions most beneficial for the learning task, with adaptive weights that automatically emphasize important dimensions from both backbones. The model extracts features from all database images and stores them as feature vectors in a MongoDB database, enabling scalable and efficient search. Once a query image (QI) is input, the model extracts its features using the trained hybrid model and compares them with the stored embeddings using Facebook AI Similarity Search (FAISS) for fast nearest-neighbor retrieval. The most similar images are then retrieved from the database and displayed on the web interface. Using self-supervised methods, the hybrid model is trained on different benchmark datasets to acquire useful representations without relying on labeled data. Experimentation demonstrates that the model achieves an average accuracy rate of 97%, reflecting substantial improvements. The combination of a learnable fusion layer, hybrid feature extraction, self-supervised learning, and scalable database management validates the system's capacity to provide precise and effective image retrieval. This work demonstrates how hybrid CNN backbones and contrastive learning work together to advance state-of-the-art content-based image retrieval (CBIR) solutions, which can be broadly applied to diverse and large-scale image datasets.
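To make the described pipeline concrete, the sketch below shows one plausible way to combine the two backbones through a learnable fusion layer and index the resulting embeddings with FAISS. It is a minimal illustration, not the authors' implementation: the backbone output sizes (1280 for EfficientNetB0, 512 for ResNet18), the fused dimension of 512, and all identifiers such as HybridEncoder are assumptions, and the SimCLR training loop and MongoDB persistence step are omitted for brevity.

import torch
import torch.nn as nn
import torchvision.models as models
import faiss

class HybridEncoder(nn.Module):
    # Hypothetical sketch of a hybrid EfficientNetB0 + ResNet18 encoder
    # with a learnable fusion layer, as described in the abstract.
    def __init__(self, fused_dim=512):
        super().__init__()
        effnet = models.efficientnet_b0(weights="IMAGENET1K_V1")
        resnet = models.resnet18(weights="IMAGENET1K_V1")
        # Keep the convolutional feature extractors; drop the classification heads.
        self.effnet = nn.Sequential(effnet.features, nn.AdaptiveAvgPool2d(1), nn.Flatten())  # -> 1280-d
        self.resnet = nn.Sequential(*list(resnet.children())[:-1], nn.Flatten())             # -> 512-d
        # Learnable fusion layer: compresses the concatenated features and learns
        # weights that emphasize the most useful dimensions from both backbones.
        self.fusion = nn.Linear(1280 + 512, fused_dim)

    def forward(self, x):
        feats = torch.cat([self.effnet(x), self.resnet(x)], dim=1)
        return self.fusion(feats)

encoder = HybridEncoder().eval()

# Random tensors stand in for real gallery and query images here.
with torch.no_grad():
    gallery = encoder(torch.randn(100, 3, 224, 224)).numpy().astype("float32")
    query = encoder(torch.randn(1, 3, 224, 224)).numpy().astype("float32")

# Index the stored embeddings with FAISS for fast nearest-neighbor search.
index = faiss.IndexFlatL2(gallery.shape[1])
index.add(gallery)
distances, ids = index.search(query, 5)  # indices of the 5 most similar images

In a deployed system, the gallery embeddings would be computed once, persisted (e.g., in MongoDB as the abstract describes), and loaded into the FAISS index at serving time, so that each query requires only one forward pass and one index search.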