Search Similar Images in a Large Unlabeled Dataset

Updated on: May 30, 2025, by Kishore Sahoo

How do you find similar images in a large unlabeled dataset using machine-learning-based image similarity or clustering techniques?

A Step-by-Step Guide for Beginners and Experts

Introduction

Let’s say you have 20,000+ images and no categories, labels, or tags. Now imagine trying to find just the cat images hidden in there. Manually? Impossible. But with modern AI tools, you can automate this task in a matter of minutes.

In this guide, we’ll walk you through how to:

Turn images into searchable vectors using CLIP (by OpenAI)
Find similar images using FAISS (by Meta)
Handle datasets with no labels at all
Optimize for both beginners and advanced users

Tools & Libraries

We’ll use the following libraries:

Tool	Purpose
`CLIP`	Turn images (or text!) into embeddings
`FAISS`	Perform fast similarity searches
`PIL` / `OpenCV`	Load and preprocess images
`NumPy`	Handle vectors and matrix operations
`Matplotlib`	Show results visually

Install them via pip:

pip install torch torchvision faiss-cpu ftfy regex tqdm pillow numpy matplotlib scikit-learn
pip install git+https://github.com/openai/CLIP.git

Part 1: Beginner-Friendly Step-by-Step Guide

Step 1: Load and Preprocess Images

Use Python to walk through a directory and load images:

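from glob import glob
from PIL import Image

# Collect every .jpg under the dataset folder (recursively); sort for a stable order
image_paths = sorted(glob("your_dataset_folder/**/*.jpg", recursive=True))

def load_image(path):
    # Convert to RGB so grayscale and RGBA files all reach CLIP with three channels
    return Image.open(path).convert("RGB")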
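
The glob pattern above only matches files ending in `.jpg`. If your folder mixes formats, a minimal sketch like the following could gather them all. The helper name `find_image_paths` and the extension list are assumptions for illustration, not part of any library.

```python
from pathlib import Path

# Assumed set of extensions; adjust it to match your dataset.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".webp"}

def find_image_paths(root):
    """Return a sorted list of image file paths under root (recursive)."""
    root = Path(root)
    return sorted(
        str(p)
        for p in root.rglob("*")
        # Compare the suffix case-insensitively so "photo.JPG" is also found
        if p.is_file() and p.suffix.lower() in IMAGE_EXTENSIONS
    )
```

You could then write `image_paths = find_image_paths("your_dataset_folder")` in place of the glob call.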

Step 2: Generate Embeddings with CLIP

import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def get_image_embedding(img):
    img_input = preprocess(img).unsqueeze(0).to(device)
    with torch.no_grad():
        return model.encode_image(img_input).cpu().numpy()

Now extract all embeddings:

embeddings = []
for path in image_paths:
    img = load_image(path)
    embedding = get_image_embedding(img)
    embeddings.append(embedding)
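
In a large scraped dataset, some files may be truncated or unreadable, and one failure would stop the loop above. The sketch below skips such files while keeping the list of paths aligned with the list of embeddings, which matters because search results are later mapped back to paths by position. The helper `embed_all` and its `embed_fn` parameter are hypothetical names; it assumes that loading errors surface as `OSError` or `ValueError`.

```python
def embed_all(paths, embed_fn):
    """Embed every path, skipping files that fail, and keep paths aligned with vectors.

    embed_fn is any callable that takes a path and returns an embedding
    (an assumption for this sketch, e.g. a wrapper around the functions above).
    """
    kept_paths, vectors = [], []
    for path in paths:
        try:
            vector = embed_fn(path)
        except (OSError, ValueError) as err:
            # Report and move on; the path is not recorded, so rows stay in sync
            print(f"Skipping {path}: {err}")
            continue
        kept_paths.append(path)
        vectors.append(vector)
    return kept_paths, vectors
```

With the functions defined earlier, a call might look like `image_paths, embeddings = embed_all(image_paths, lambda p: get_image_embedding(load_image(p)))`.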

Step 3: Build a FAISS Index

import faiss
import numpy as np

embedding_matrix = np.vstack(embeddings).astype("float32")
index = faiss.IndexFlatL2(embedding_matrix.shape[1])
index.add(embedding_matrix)
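
To make the index less of a black box: a flat L2 index does an exact, brute-force comparison of the query against every stored vector and reports squared Euclidean distances. The NumPy sketch below reproduces that idea on toy data. The function `brute_force_l2_search` is an illustrative stand-in, not FAISS code, and it will be far slower than FAISS on real datasets.

```python
import numpy as np

def brute_force_l2_search(database, queries, k):
    """Exact k-nearest-neighbour search by squared L2 distance.

    Returns (distances, indices), each of shape (num_queries, k),
    sorted by increasing distance, mirroring the layout of index.search.
    """
    # Pairwise differences via broadcasting: (num_queries, num_database, dim)
    diffs = queries[:, None, :] - database[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)
    # Sort each row by distance and keep the k closest entries
    order = np.argsort(sq_dists, axis=1, kind="stable")[:, :k]
    return np.take_along_axis(sq_dists, order, axis=1), order

# Toy data: three 2-D "embeddings" and one query
database = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0]], dtype="float32")
queries = np.array([[0.9, 0.1]], dtype="float32")
D_demo, I_demo = brute_force_l2_search(database, queries, k=2)
print(I_demo)  # row 1 is closest, then row 0
```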

Step 4: Find Similar Images (Query with Cat Example)

query_img = load_image("sample_cat.jpg")
query_emb = get_image_embedding(query_img).astype("float32")

D, I = index.search(query_emb, k=10)  # Get top 10 matches
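
`D` holds the distances and `I` holds the row positions of the matches, both with one row per query. A small sketch of turning those into readable results follows; `collect_matches` is a hypothetical helper name. It skips entries equal to -1, which FAISS uses to pad the output when fewer than `k` results are available.

```python
def collect_matches(distances, indices, paths):
    """Pair each returned index with its file path and distance (first query only)."""
    matches = []
    for dist, idx in zip(distances[0], indices[0]):
        if idx < 0:
            # -1 marks an empty slot, e.g. when k exceeds the number of indexed images
            continue
        matches.append((paths[int(idx)], float(dist)))
    return matches
```

For example, `for path, dist in collect_matches(D, I, image_paths): print(path, dist)` lists the matched files from closest to farthest.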

Step 5: Display Results

import matplotlib.pyplot as plt

def show_similar_images(query_path, indices):
    matches = indices[0]  # index.search returns shape (1, k) for a single query
    n_panels = len(matches) + 1  # the query image plus k matches

    plt.figure(figsize=(15, 5))
    plt.subplot(1, n_panels, 1)
    plt.imshow(load_image(query_path))
    plt.title("Query")
    plt.axis("off")

    for i, idx in enumerate(matches):
        plt.subplot(1, n_panels, i + 2)
        plt.imshow(load_image(image_paths[idx]))
        plt.title(f"Match {i+1}")
        plt.axis("off")
    plt.show()

show_similar_images("sample_cat.jpg", I)

Part 2: For Advanced Users

Speed Up with Batching & GPU

Instead of embedding images one at a time, send them through the model in batches:

from torch.utils.data import DataLoader, Dataset

class ImagePathDataset(Dataset):
    # Loads and preprocesses each image on demand instead of holding all of them in memory
    def __init__(self, paths):
        self.paths = paths

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, i):
        return preprocess(load_image(self.paths[i]))

# Batch embedding for speed (the default shuffle=False keeps rows in image_paths order)
batch_size = 32
loader = DataLoader(ImagePathDataset(image_paths), batch_size=batch_size)

batched_embeddings = []
with torch.no_grad():
    for batch in loader:
        batch = batch.to(device)
        emb = model.encode_image(batch).cpu().numpy()
        batched_embeddings.append(emb)

embedding_matrix = np.vstack(batched_embeddings).astype("float32")
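
Memory is worth a quick check when batching: holding every preprocessed image in RAM at once grows linearly with the dataset. The arithmetic below is a rough estimate, assuming 3-channel 224×224 inputs (the input size of `ViT-B/32`) stored as 32-bit floats. The function name is made up for this example.

```python
def preprocessed_bytes(num_images, channels=3, height=224, width=224, bytes_per_value=4):
    """Bytes needed to keep num_images preprocessed tensors in memory at once."""
    return num_images * channels * height * width * bytes_per_value

total = preprocessed_bytes(20_000)
print(f"{total / 1e9:.1f} GB for 20,000 images")  # about 12 GB

# By comparison, a single batch of 32 images is small
per_batch = preprocessed_bytes(32)
print(f"{per_batch / 1e6:.1f} MB per batch of 32")
```

This is why loading images lazily, batch by batch, tends to scale better than preprocessing the whole dataset up front.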

Save Embeddings for Future Use

np.save("embeddings.npy", embedding_matrix)
with open("paths.txt", "w") as f:
    f.writelines([p + "\n" for p in image_paths])
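
Reading the files back in a later session might look like the sketch below. The function name `load_index_data` is an assumption. The length check guards against the two files drifting out of sync, since search results are mapped to paths by row position.

```python
import numpy as np

def load_index_data(emb_file="embeddings.npy", paths_file="paths.txt"):
    """Load the saved embedding matrix and the matching list of image paths."""
    embedding_matrix = np.load(emb_file)
    with open(paths_file) as f:
        # Strip only the trailing newline so paths containing spaces survive
        paths = [line.rstrip("\n") for line in f]
    if len(paths) != embedding_matrix.shape[0]:
        raise ValueError(
            f"{len(paths)} paths but {embedding_matrix.shape[0]} embeddings"
        )
    return embedding_matrix, paths
```

After `embedding_matrix, image_paths = load_index_data()`, the index can be rebuilt with `index.add(embedding_matrix)` without re-running CLIP.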

Use Cosine Similarity (Optional)

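# normalize_L2 rescales each row to unit length in place and expects float32 arrays
faiss.normalize_L2(embedding_matrix)
faiss.normalize_L2(query_emb)
index = faiss.IndexFlatIP(embedding_matrix.shape[1])  # inner product of unit vectors = cosine similarity
index.add(embedding_matrix)
D, I = index.search(query_emb, k=10)  # D now holds similarities: higher means more similar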
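
Why normalizing first gives cosine similarity: once two vectors have unit length, their inner product equals the cosine of the angle between them, and their squared L2 distance equals 2 minus twice that value. The toy NumPy check below illustrates both facts. `l2_normalize` is a small helper written for this example and, unlike `faiss.normalize_L2`, returns a copy.

```python
import numpy as np

def l2_normalize(x):
    """Return a copy of x with each row scaled to unit length."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, 1e-12)  # guard against division by zero

a = np.array([[3.0, 4.0]])
b = np.array([[4.0, 3.0]])

# Cosine similarity computed directly from the definition
cosine = (a @ b.T)[0, 0] / (np.linalg.norm(a) * np.linalg.norm(b))

# Inner product and squared L2 distance of the normalized vectors
a_n, b_n = l2_normalize(a), l2_normalize(b)
inner = (a_n @ b_n.T)[0, 0]
sq_l2 = np.sum((a_n - b_n) ** 2)

print(cosine, inner, sq_l2)  # inner matches cosine; sq_l2 equals 2 - 2 * inner
```

On normalized embeddings the two metrics therefore produce the same ranking; only the score you read off differs.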

Cluster Similar Images (Optional)

from collections import defaultdict
from sklearn.cluster import KMeans

# n_clusters=10 is a starting guess; fix random_state so assignments are reproducible
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)
clusters = kmeans.fit_predict(embedding_matrix)

# Map cluster index to image paths
cluster_map = defaultdict(list)
for idx, label in enumerate(clusters):
    cluster_map[int(label)].append(image_paths[idx])
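
Since the clusters have no names, one way to inspect them is to look at the image closest to each cluster centre. The sketch below does that with NumPy. `cluster_representatives` is a hypothetical helper, and it assumes you pass in the labels from `fit_predict` and the centres from `kmeans.cluster_centers_`.

```python
import numpy as np

def cluster_representatives(embeddings, labels, centers):
    """Return {cluster label: row index of the embedding nearest that cluster's centre}."""
    embeddings = np.asarray(embeddings)
    labels = np.asarray(labels)
    centers = np.asarray(centers)
    reps = {}
    for label in np.unique(labels):
        members = np.flatnonzero(labels == label)  # rows assigned to this cluster
        dists = np.linalg.norm(embeddings[members] - centers[label], axis=1)
        reps[int(label)] = int(members[np.argmin(dists)])
    return reps
```

For example, `reps = cluster_representatives(embedding_matrix, clusters, kmeans.cluster_centers_)` followed by `image_paths[reps[0]]` gives a typical image for cluster 0.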

Bonus: Use Text to Find Images (CLIP Magic)

Want to find images with no example image? Just use a sentence.

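text_input = clip.tokenize(["a photo of a cat"]).to(device)
with torch.no_grad():
    text_features = model.encode_text(text_input).cpu().numpy().astype("float32")

# Normalize the text vector so it matches the normalized (cosine) index built above
faiss.normalize_L2(text_features)
D, I = index.search(text_features, k=10)

# There is no query image here, so list the matched paths instead of plotting a "Query" panel
for rank, idx in enumerate(I[0], start=1):
    print(rank, image_paths[idx])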

You can try:

“a mountain landscape”
“a person dancing”
“a close-up of food”

Conclusion

With just a few lines of Python and powerful open-source tools, you can:

Search and group unlabeled images
Build visual search systems
Enable smart content filtering or discovery

Whether you’re a data scientist, developer, or creative, this method opens up new ways to interact with image data.