



Cross-modal search uses multimodal embeddings to find content in one modality with a query from another, such as searching images with text, retrieving captions for a photo, or finding videos from an audio description. It is powered by models like CLIP, ImageBind, and Google's Gemini multimodal embeddings, which map each modality into a shared embedding space, so semantically related items end up close together regardless of format and can be ranked with ordinary vector similarity. Typical query patterns:
Query: "sunset over mountains" Results: Matching images
Query: [photo] Results: Captions, descriptions, articles
Query: "basketball dunk compilation" Results: Relevant video clips
Query: [sound of ocean waves] Results: Beach imagery
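Before anything can be searched, the image collection has to be embedded and indexed. Here is a minimal indexing sketch, assuming a generic vectordb client with an upsert method (a hypothetical stand-in for whatever vector database is in use) and illustrative file paths:

import clip
import torch
from PIL import Image

model, preprocess = clip.load("ViT-B/32")

# Embed each image with CLIP's image tower (paths are illustrative)
for i, path in enumerate(["beach.jpg", "mountains.jpg"]):
    image = preprocess(Image.open(path)).unsqueeze(0)
    with torch.no_grad():
        emb = model.encode_image(image)
    emb /= emb.norm(dim=-1, keepdim=True)  # normalize for cosine similarity
    # Hypothetical client call: store the vector with its source path
    vectordb.upsert(
        collection="images",
        id=i,
        vector=emb[0].tolist(),
        metadata={"path": path},
    )

With the collection populated, a text query can be embedded and matched against it: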
import clip
import torch

# Load CLIP (ViT-B/32 checkpoint)
model, preprocess = clip.load("ViT-B/32")

# Embed the text query with CLIP's text tower
text = clip.tokenize(["sunset over mountains"])
with torch.no_grad():
    text_embedding = model.encode_text(text)
text_embedding /= text_embedding.norm(dim=-1, keepdim=True)  # match the indexed vectors

# Search the indexed image collection (vectordb is the placeholder client from above)
results = vectordb.search(
    collection="images",
    query_vector=text_embedding[0].tolist(),
    limit=10,
)
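The reverse direction works the same way: embed a photo with CLIP's image tower and search a collection of text embeddings. A sketch under the same assumptions (the hypothetical vectordb client and an already-indexed captions collection):

import clip
import torch
from PIL import Image

model, preprocess = clip.load("ViT-B/32")

# Embed the query photo (file path is illustrative)
image = preprocess(Image.open("query_photo.jpg")).unsqueeze(0)
with torch.no_grad():
    image_embedding = model.encode_image(image)
image_embedding /= image_embedding.norm(dim=-1, keepdim=True)

# Find the nearest caption/description vectors in the shared space
results = vectordb.search(
    collection="captions",
    query_vector=image_embedding[0].tolist(),
    limit=10,
)

Because both towers write into the same space, no translation step is needed between modalities; the query vector is compared directly against vectors produced from a different modality.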
Cost depends on the embedding model: CLIP is free, open source, and can run locally, while hosted models such as Gemini incur per-request API charges.