



A 0.9B-parameter multimodal embedding model with multilingual support for 89 languages, 512×512 image resolution, and Matryoshka representations that allow output dimensions to be truncated from 1024 down to 64 while maintaining strong performance.
Jina-CLIP v2 is a state-of-the-art multimodal embedding model that combines text and image understanding in a single unified model. It represents a significant improvement over v1 with enhanced multilingual capabilities and higher resolution image processing.
The model combines two specialized encoders: a multilingual text encoder (based on Jina-XLM-RoBERTa) and a vision encoder (based on EVA02).
Even an aggressive 75% reduction in embedding dimensions preserves over 99% of performance across text, image, and cross-modal tasks. The model also shows a 3% improvement over v1 in both text-image and text-text retrieval.
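The dimensional flexibility above comes from Matryoshka representation learning, which packs the most important information into the leading dimensions. A minimal sketch of how a client might truncate and re-normalize an embedding (using NumPy; the function name is illustrative, not part of any Jina SDK):

```python
import numpy as np

def truncate_embedding(embedding: np.ndarray, dim: int) -> np.ndarray:
    """Truncate a Matryoshka embedding to `dim` dimensions and re-normalize.

    Because Matryoshka training front-loads information into the leading
    dimensions, truncation plus L2 re-normalization retains most of the
    retrieval quality at a fraction of the storage cost.
    """
    truncated = embedding[:dim]
    norm = np.linalg.norm(truncated)
    return truncated / norm if norm > 0 else truncated

# Example: shrink a full 1024-dim vector to 64 dims (16x smaller storage).
full = np.random.default_rng(0).normal(size=1024)
small = truncate_embedding(full, 64)
```

Cosine similarity on the truncated vectors then works as usual, since they are unit-normalized.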
The model is available through the Jina Embeddings API under a commercial license, and on cloud marketplaces (AWS, Azure, GCP) with usage-based pricing.
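As a sketch of what an API request might look like, the snippet below builds a JSON payload for the Jina Embeddings endpoint. The endpoint path, field names, and the image URL are assumptions based on Jina's public documentation and should be verified before use:

```python
import json

# Assumed request shape for the Jina Embeddings API (verify against the
# official docs at jina.ai). Text and image inputs can be mixed, and the
# Matryoshka design lets you request a reduced dimensionality directly.
payload = {
    "model": "jina-clip-v2",
    "dimensions": 64,  # assumed parameter: truncated Matryoshka output
    "input": [
        {"text": "A scenic photo of mountains"},
        {"image": "https://example.com/mountains.jpg"},  # hypothetical URL
    ],
}
body = json.dumps(payload)
# POST `body` to https://api.jina.ai/v1/embeddings with an
# "Authorization: Bearer <token>" header (token elided here).
```

The response would contain one embedding per input item, making text-to-image and text-to-text retrieval use the same call.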