How to Find Similar Images in a Large Unlabeled Dataset Using Image Similarity and Clustering
A Step-by-Step Guide for Beginners and Experts
Introduction
Let’s say you have 20,000+ images and no categories, labels, or tags. Now imagine trying to find just the cat images hidden in there. Sorting through them by hand is impractical. With a pretrained model and a vector index, you can automate the search, often in a matter of minutes on a GPU.
In this guide, we’ll walk you through how to:
- Turn images into searchable vectors using CLIP (by OpenAI)
- Find similar images using FAISS (by Meta)
- Handle datasets with no labels at all
- Start with a simple pipeline, then speed it up with batching and saved embeddings
Tools & Libraries
We’ll use the following libraries:
Tool | Purpose |
---|---|
CLIP | Turn images (or text!) into embeddings |
Faiss | Perform fast similarity searches |
PIL / OpenCV | Load and preprocess images |
NumPy | Handle vectors and matrix operations |
Matplotlib | Show results visually |
Install them via pip:
pip install torch torchvision faiss-cpu ftfy regex tqdm pillow numpy matplotlib scikit-learn
pip install git+https://github.com/openai/CLIP.git
Part 1: Beginner-Friendly Step-by-Step Guide
Step 1: Load and Preprocess Images
Use Python to walk through a directory and load images:
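from glob import glob
from PIL import Image
# Recursively collect every .jpg file under the dataset folder
image_paths = glob("your_dataset_folder/**/*.jpg", recursive=True)
def load_image(path):
    # Convert to RGB so grayscale and RGBA files share one format
    return Image.open(path).convert("RGB")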
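The glob pattern above only matches files ending in `.jpg`. If your folder also holds `.jpeg` or `.png` files, something like the sketch below may help. The function name `collect_image_paths` and the extension set are assumptions made for illustration, not part of the original guide.

```python
from pathlib import Path

# Hypothetical extension set; adjust it to whatever your dataset actually contains.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".webp"}

def collect_image_paths(root, extensions=IMAGE_EXTENSIONS):
    """Return a sorted list of image file paths under root (searched recursively)."""
    return sorted(
        str(p)
        for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in extensions
    )
```

You could then write `image_paths = collect_image_paths("your_dataset_folder")`. Sorting keeps the order stable between runs, which matters later because search results are matched to paths by position.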
Step 2: Generate Embeddings with CLIP
import torch
import clip
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
def get_image_embedding(img):
    # Add a batch dimension; the result has shape (1, 512) for ViT-B/32
    img_input = preprocess(img).unsqueeze(0).to(device)
    with torch.no_grad():
        return model.encode_image(img_input).cpu().numpy()
Now extract all embeddings:
embeddings = []
for path in image_paths:
    img = load_image(path)
    embedding = get_image_embedding(img)
    embeddings.append(embedding)
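The loop above assumes every file opens cleanly. In a large scraped dataset a few files are often corrupt, and if one is skipped by hand, the rows of the embedding matrix no longer line up with `image_paths`. Here is a minimal sketch of one way to keep the two aligned. The names `embed_all` and `embed_fn` are hypothetical. In this guide, `embed_fn` could be `lambda p: get_image_embedding(load_image(p))`.

```python
import numpy as np

def embed_all(paths, embed_fn):
    """Embed every path, skipping files that raise, and keep paths aligned with rows.

    embed_fn is any callable mapping a path to an array of shape (1, dim) or (dim,).
    Returns (kept_paths, matrix) where matrix[i] is the embedding of kept_paths[i].
    """
    kept_paths, rows = [], []
    for path in paths:
        try:
            vec = np.asarray(embed_fn(path), dtype="float32").reshape(1, -1)
        except Exception as exc:  # e.g. a corrupt or truncated image file
            print(f"Skipping {path}: {exc}")
            continue
        kept_paths.append(path)
        rows.append(vec)
    if not rows:
        return kept_paths, np.empty((0, 0), dtype="float32")
    return kept_paths, np.vstack(rows)
```

If you use this, build the index from the returned matrix and look up results in the returned `kept_paths` rather than in the original list.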
Step 3: Build a FAISS Index
import faiss
import numpy as np
embedding_matrix = np.vstack(embeddings).astype("float32")
index = faiss.IndexFlatL2(embedding_matrix.shape[1])
index.add(embedding_matrix)
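`IndexFlatL2` is an exact, brute-force index: it compares the query against every stored vector and reports squared Euclidean distances. To make that concrete, here is a NumPy sketch of the same computation. The function name `brute_force_l2_search` is made up for illustration, and the code is not how FAISS is implemented internally. It is only meant to show what the numbers mean.

```python
import numpy as np

def brute_force_l2_search(matrix, queries, k):
    """Exact nearest-neighbor search by squared Euclidean distance.

    matrix: (n, dim) database vectors; queries: (m, dim) query vectors.
    Returns (distances, indices), each of shape (m, k), nearest first.
    """
    # Squared distances via ||q - x||^2 = ||q||^2 - 2 q.x + ||x||^2
    q_sq = (queries ** 2).sum(axis=1, keepdims=True)  # shape (m, 1)
    x_sq = (matrix ** 2).sum(axis=1)                  # shape (n,)
    dists = q_sq - 2.0 * queries @ matrix.T + x_sq    # shape (m, n)
    dists = np.maximum(dists, 0.0)  # clip tiny negatives caused by rounding
    indices = np.argsort(dists, axis=1)[:, :k]
    return np.take_along_axis(dists, indices, axis=1), indices
```

For tens of thousands of images this exhaustive comparison is usually fast enough, which is why a flat index is a reasonable starting point.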
Step 4: Find Similar Images (Query with Cat Example)
query_img = load_image("sample_cat.jpg")
query_emb = get_image_embedding(query_img).astype("float32")
D, I = index.search(query_emb, k=10) # Get top 10 matches
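`index.search` returns two arrays with one row per query and `k` columns. `D` holds the distances (smaller means more similar for `IndexFlatL2`), and `I` holds row positions in the embedding matrix, which are also positions in `image_paths`. The sketch below pairs them up. The helper name `top_matches` is an assumption for illustration.

```python
def top_matches(distances, indices, paths):
    """Pair each returned index with its path and distance for the first query row."""
    matches = []
    for dist, idx in zip(distances[0], indices[0]):
        if idx < 0:  # FAISS pads with -1 when fewer than k results exist
            continue
        matches.append((paths[int(idx)], float(dist)))
    return matches
```

Calling `top_matches(D, I, image_paths)` gives a list of `(path, distance)` pairs that is easy to print or log before plotting anything.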
Step 5: Display Results
import matplotlib.pyplot as plt
def show_similar_images(query_path, indices):
    matches = indices[0]  # indices has shape (1, k); take the row for our single query
    n_cols = len(matches) + 1  # one extra column for the query image
    plt.figure(figsize=(15, 5))
    plt.subplot(1, n_cols, 1)
    plt.imshow(load_image(query_path))
    plt.title("Query")
    plt.axis("off")
    for i, idx in enumerate(matches):
        plt.subplot(1, n_cols, i + 2)
        plt.imshow(load_image(image_paths[idx]))
        plt.title(f"Match {i+1}")
        plt.axis("off")
    plt.show()
show_similar_images("sample_cat.jpg", I)
Part 2: For Advanced Users
Speed Up with Batching & GPU
Instead of one-by-one embedding:
from torch.utils.data import DataLoader, Dataset
class ImageDataset(Dataset):
    # Loads and preprocesses one image at a time so the whole dataset never sits in memory
    def __init__(self, paths):
        self.paths = paths
    def __len__(self):
        return len(self.paths)
    def __getitem__(self, i):
        return preprocess(load_image(self.paths[i]))
# Batch embedding for speed
batch_size = 32
loader = DataLoader(ImageDataset(image_paths), batch_size=batch_size)
batched_embeddings = []
with torch.no_grad():
    for batch in loader:
        batch = batch.to(device)
        emb = model.encode_image(batch).cpu().numpy()
        batched_embeddings.append(emb)
embedding_matrix = np.vstack(batched_embeddings).astype("float32")
Save Embeddings for Future Use
np.save("embeddings.npy", embedding_matrix)
with open("paths.txt", "w") as f:
    f.writelines([p + "\n" for p in image_paths])
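Saving only pays off if the files can be read back in a later session. Here is a minimal sketch of the reverse step, assuming the two file names used above. The function name `load_saved_embeddings` and the consistency check are illustrative additions.

```python
import numpy as np

def load_saved_embeddings(embeddings_file="embeddings.npy", paths_file="paths.txt"):
    """Reload the matrix and path list written earlier and check they still line up."""
    matrix = np.load(embeddings_file).astype("float32")
    with open(paths_file) as f:
        paths = [line.rstrip("\n") for line in f if line.strip()]
    if len(paths) != matrix.shape[0]:
        raise ValueError(
            f"{len(paths)} paths but {matrix.shape[0]} embedding rows; "
            "re-run the embedding step"
        )
    return matrix, paths
```

After `embedding_matrix, image_paths = load_saved_embeddings()`, the index can be rebuilt with `index.add(embedding_matrix)` without running CLIP again.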
Use Cosine Similarity (Optional)
faiss.normalize_L2(embedding_matrix)
faiss.normalize_L2(query_emb)
index = faiss.IndexFlatIP(embedding_matrix.shape[1])
index.add(embedding_matrix)
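Normalizing first works because the inner product of two unit-length vectors equals their cosine similarity, which is what `IndexFlatIP` then ranks by. The NumPy sketch below checks this on two toy vectors. The helper `l2_normalize` is a stand-in written for illustration and, unlike `faiss.normalize_L2`, returns a copy instead of modifying its input in place.

```python
import numpy as np

def l2_normalize(vectors):
    """Return a copy of vectors with each row scaled to unit length."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.maximum(norms, 1e-12)  # guard against all-zero rows

a = np.array([[3.0, 4.0]], dtype="float32")
b = np.array([[4.0, 3.0]], dtype="float32")

# Cosine similarity from the textbook definition
cosine = float((a @ b.T).item() / (np.linalg.norm(a) * np.linalg.norm(b)))

# Inner product of the normalized vectors, which is what IndexFlatIP scores
a_n, b_n = l2_normalize(a), l2_normalize(b)
inner = float((a_n @ b_n.T).item())

# For unit vectors, squared L2 distance equals 2 - 2 * cosine,
# so L2 and inner-product rankings agree once everything is normalized
sq_l2 = float(((a_n - b_n) ** 2).sum())
```

Note that with an inner-product index, larger scores in `D` mean more similar, the opposite of the L2 index.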
Cluster Similar Images (Optional)
# Requires scikit-learn (pip install scikit-learn)
from collections import defaultdict
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=10)
clusters = kmeans.fit_predict(embedding_matrix)
# Map cluster index to image paths
cluster_map = defaultdict(list)
for idx, label in enumerate(clusters):
    cluster_map[label].append(image_paths[idx])
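With no labels, a quick way to see what each cluster contains is to look at the image closest to its center. The sketch below finds that image for every cluster. The function name `cluster_representatives` is hypothetical, and "closest to the centroid" is just one reasonable choice of representative.

```python
import numpy as np

def cluster_representatives(embeddings, labels, centers):
    """For each cluster, return the row index of the embedding closest to its center."""
    reps = {}
    for cluster_id, center in enumerate(centers):
        members = np.flatnonzero(labels == cluster_id)
        if members.size == 0:  # empty clusters have no representative
            continue
        dists = ((embeddings[members] - center) ** 2).sum(axis=1)
        reps[cluster_id] = int(members[np.argmin(dists)])
    return reps
```

With the variables from the code above, `reps = cluster_representatives(embedding_matrix, clusters, kmeans.cluster_centers_)` gives one index per cluster, and `image_paths[reps[c]]` is the file to open for cluster `c`.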
Bonus: Use Text to Find Images (CLIP Magic)
Want to find images with no example image? Just use a sentence.
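text_input = clip.tokenize(["a photo of a cat"]).to(device)
with torch.no_grad():
    text_features = model.encode_text(text_input).cpu().numpy().astype("float32")
# Normalize the text vector; this assumes the cosine-similarity (IndexFlatIP) index from the section above
faiss.normalize_L2(text_features)
D, I = index.search(text_features, k=10)
# There is no query image for a text search, so list the matching files instead
for rank, idx in enumerate(I[0], start=1):
    print(rank, image_paths[idx])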
You can try:
- “a mountain landscape”
- “a person dancing”
- “a close-up of food”
Conclusion
With just a few lines of Python and powerful open-source tools, you can:
- Search and group unlabeled images
- Build visual search systems
- Enable smart content filtering or discovery
Whether you’re a data scientist, developer, or creative, this method opens up new ways to interact with image data.