Let’s understand OpenAI’s CLIP and how to use it to search and classify images with natural language.
How to Use CLIP in Python to Perform Zero-Shot Image Search and Classification
What Is CLIP?
CLIP (Contrastive Language–Image Pretraining) is a powerful vision-language model developed by OpenAI. It’s trained on 400 million image-text pairs and can understand images in the context of natural language — allowing you to search images with words, classify them with custom labels, or find similar images without needing labeled datasets.
Why CLIP Matters
- No training required — it’s zero-shot
- Works on unlabeled images
- Can compare images and text directly
- Enables semantic image search (e.g., “a cat sitting on a chair”)
What Can You Do With CLIP?
| Use Case | Example |
|---|---|
| Search with text | Find all images of “a dog on a beach” |
| Zero-shot classification | Classify images using custom text labels |
| Similar image detection | Group or retrieve related images |
| Visual search engines | Build search features without tags |
How CLIP Works (In Simple Terms)
CLIP has two neural networks:
- Text encoder: Converts natural language into a vector
- Image encoder: Converts images into a vector
Both outputs are mapped to the same embedding space, so you can compare a caption like “a red car” directly with an image vector. The closer the vectors, the more semantically similar they are.
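To make “the closer the vectors, the more semantically similar” concrete, here is a toy cosine-similarity calculation on made-up 3-dimensional vectors (real CLIP embeddings are 512-dimensional for the ViT-B/32 model used below):
import numpy as np
# Made-up vectors standing in for a text embedding and an image embedding
text_vec = np.array([0.2, 0.9, 0.1])
image_vec = np.array([0.25, 0.85, 0.15])
# Cosine similarity: normalize both vectors, then take the dot product
cosine = (text_vec / np.linalg.norm(text_vec)) @ (image_vec / np.linalg.norm(image_vec))
print(cosine)  # close to 1.0, i.e. highly similar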
Getting Started with CLIP in Python
1. Install Dependencies
pip install torch torchvision ftfy regex tqdm
pip install git+https://github.com/openai/CLIP.git
pip install matplotlib pillow
2. Load the Model
import clip
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
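If you want to see which other checkpoints you can pass to clip.load() (e.g., ViT-B/16 or the ResNet variants), the library exposes a helper:
print(clip.available_models())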
3. Encode Text and Images
from PIL import Image
# Encode a sentence
text = clip.tokenize(["a cat sitting on a couch"]).to(device)
with torch.no_grad():
    text_features = model.encode_text(text)
# Encode an image
image = preprocess(Image.open("cat.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    image_features = model.encode_image(image)
# Normalize and compare
text_features /= text_features.norm(dim=-1, keepdim=True)
image_features /= image_features.norm(dim=-1, keepdim=True)
similarity = (image_features @ text_features.T).item()
print("Similarity Score:", similarity)
4. Search Images Using Text
Embed all your dataset images into vectors and compare them to the query text:
# Get top 5 similar images
similarities = (image_embeddings @ text_features.T).squeeze(1)
top_k = similarities.topk(5)
You can display the top matches using matplotlib or return file paths in a web app.
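For a complete, minimal sketch of that search loop, assume the model, preprocess, and device from step 2 and a hypothetical images/ folder of .jpg files (adjust the path and the query to your data):
import glob
import torch
from PIL import Image
# Hypothetical image folder -- change the path/extension to match your dataset
image_paths = sorted(glob.glob("images/*.jpg"))
# Embed every image once (batching the images would be faster for large datasets)
embeddings = []
with torch.no_grad():
    for path in image_paths:
        img = preprocess(Image.open(path)).unsqueeze(0).to(device)
        feat = model.encode_image(img)
        feat /= feat.norm(dim=-1, keepdim=True)
        embeddings.append(feat)
image_embeddings = torch.cat(embeddings)  # shape: [num_images, 512]
# Embed the text query
text = clip.tokenize(["a dog on a beach"]).to(device)
with torch.no_grad():
    text_features = model.encode_text(text)
    text_features /= text_features.norm(dim=-1, keepdim=True)
# Rank images by cosine similarity and print the best matches
similarities = (image_embeddings @ text_features.T).squeeze(1)
values, indices = similarities.topk(min(5, len(image_paths)))
for score, idx in zip(values, indices):
    print(f"{image_paths[int(idx)]}  ->  {score.item():.3f}")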
Bonus: Zero-Shot Classification Example
Let’s say you want to classify an image using your own custom labels:
labels = ["a cat", "a dog", "a bird", "a car"]
text_tokens = clip.tokenize(labels).to(device)
with torch.no_grad():
    text_features = model.encode_text(text_tokens)
image_features /= image_features.norm(dim=-1, keepdim=True)
text_features /= text_features.norm(dim=-1, keepdim=True)
# Scale the cosine similarities before softmax, as in the official CLIP example
similarities = 100.0 * image_features @ text_features.T
probs = similarities.softmax(dim=-1).cpu().numpy()
print("Label Probabilities:", probs)
Real-World Use Cases
- E-commerce: Search products visually using natural language
- Media platforms: Detect or organize similar content
- Content moderation: Flag inappropriate or unsafe imagery
- Dataset curation: Find duplicates or cluster content
- Art generation: Guide AI art models with visual prompts
Next Steps & Add-ons
- Integrate with FAISS for scalable similarity search on 100k+ images (see the sketch after this list)
- Use Gradio or Streamlit to build a front-end UI
- Save and reuse embeddings with NumPy or HDF5
- Cluster similar images using KMeans or UMAP
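As a starting point for the FAISS bullet above, here is a minimal sketch that indexes the normalized image_embeddings and image_paths from the search example (it assumes pip install faiss-cpu; inner product on unit-length vectors equals cosine similarity):
import faiss
# FAISS expects float32 NumPy arrays
vectors = image_embeddings.cpu().numpy().astype("float32")
# Exact inner-product index; on unit-length vectors this ranks by cosine similarity
index = faiss.IndexFlatIP(vectors.shape[1])
index.add(vectors)
# Query with a normalized text embedding of shape [1, d]
query = text_features.cpu().numpy().astype("float32")
scores, ids = index.search(query, 5)
for score, idx in zip(scores[0], ids[0]):
    print(image_paths[idx], float(score))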
Conclusion
CLIP opens the door to advanced image understanding without the need for labels or fine-tuning. With just a few lines of code, you can build intelligent systems that “see” images the way we describe them — using natural language.
Whether you’re a developer, data scientist, or creative technologist, CLIP makes powerful visual AI accessible and incredibly flexible.