



Embeddings that map multiple modalities (text, images, video) into a shared vector space, enabling cross-modal search and retrieval using models like CLIP, SigLIP, and voyage-multimodal-3.
Multimodal embeddings map different data types (text, images, audio, video) into a shared vector space where semantic similarity is preserved across modalities.
Notable models:

CLIP (Contrastive Language-Image Pre-training): OpenAI's foundational model, which trains an image encoder and a text encoder jointly with a contrastive objective on image-caption pairs, so matching pairs land close together in the shared space.
SigLIP: Google's successor to CLIP that replaces the softmax contrastive loss with a pairwise sigmoid loss, which scales better with batch size and tends to improve zero-shot accuracy (see the sketch after this list).
Voyage Multimodal 3: Voyage AI's voyage-multimodal-3 model, which embeds interleaved text and images (such as screenshots and document pages) into a single vector space for retrieval.
Jina CLIP: Jina AI's CLIP variant, trained so the text encoder also performs well on text-to-text retrieval, letting one model serve both unimodal and cross-modal search.
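
Beyond sentence-transformers, SigLIP checkpoints can be loaded through the Hugging Face transformers library. A minimal sketch, assuming the google/siglip-base-patch16-224 checkpoint and a placeholder image path (any SigLIP checkpoint with a matching processor should behave the same way):

import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

checkpoint = "google/siglip-base-patch16-224"
model = AutoModel.from_pretrained(checkpoint)
processor = AutoProcessor.from_pretrained(checkpoint)

# SigLIP was trained with max-length padding, so the processor
# should pad the same way at inference time
inputs = processor(text=["a photo of a sunset"],
                   images=Image.open("photo.jpg"),  # placeholder path
                   padding="max_length", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# SigLIP scores each image-text pair independently with a sigmoid,
# rather than normalizing over a batch with a softmax
probs = torch.sigmoid(outputs.logits_per_image)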
Common applications:

Cross-Modal Search: query in one modality and retrieve results in another, such as finding images from a natural-language description (see the retrieval sketch at the end of this entry).
Zero-Shot Classification: classify an image by embedding each candidate label as text and picking the label whose embedding is closest to the image embedding, with no task-specific training (see the sketch after this list).
Visual Question Answering: retrieve the images or image regions most relevant to a textual question; shared embeddings typically supply the retrieval step in retrieval-based VQA pipelines.
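
A minimal zero-shot classification sketch, using the same sentence-transformers CLIP model as the example below; the image path and candidate labels are illustrative placeholders:

from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('clip-ViT-B-32')

# Candidate labels, phrased as captions to match CLIP's training data
labels = ["a photo of a cat", "a photo of a dog", "a photo of a sunset"]
label_embs = model.encode(labels)
image_emb = model.encode(Image.open('photo.jpg'))  # placeholder path

# The predicted class is the label whose text embedding is most
# similar to the image embedding
scores = util.cos_sim(image_emb, label_embs)
predicted = labels[int(scores.argmax())]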
The basic pattern, shown here with the sentence-transformers CLIP model:

from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('clip-ViT-B-32')

# Embed an image and a text query with the same model
image_emb = model.encode(Image.open('photo.jpg'))
text_emb = model.encode("a photo of a sunset")

# Cosine similarity between the image and text embeddings
similarity = util.cos_sim(image_emb, text_emb)
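
The same model extends directly to cross-modal search: embed an image corpus once, then retrieve by text query. A minimal sketch; the image paths are illustrative placeholders:

from PIL import Image
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('clip-ViT-B-32')

# Index a small image corpus
paths = ['beach.jpg', 'city.jpg', 'sunset.jpg']
corpus_embs = model.encode([Image.open(p) for p in paths])

# Retrieve the images closest to a text query in the shared space
query_emb = model.encode("a photo of a sunset")
hits = util.semantic_search(query_emb, corpus_embs, top_k=2)[0]
for hit in hits:
    print(paths[hit['corpus_id']], hit['score'])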