Jina-CLIP v2

A 0.9B multimodal embedding model with multilingual support for 89 languages, 512x512 image resolution, and Matryoshka representations that enable dimensional flexibility from 1024 down to 64 dimensions while maintaining strong performance.

Overview

Jina-CLIP v2 is a multimodal embedding model that combines text and image understanding in a single unified model. Compared with v1, it offers enhanced multilingual capabilities (89 languages) and higher-resolution image processing (512x512 instead of 224x224).

Architecture

The model combines two specialized encoders:

Text Encoder: Jina XLM-RoBERTa (561M parameters)
Vision Encoder: EVA02-L14 (304M parameters)
Total Parameters: 865M
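
The figures above can be cross-checked against the "0.9B" headline with a few lines of arithmetic. This is only a sanity check on the numbers quoted on this page; the variable names are illustrative.

```python
# Parameter counts as listed above, in millions.
TEXT_ENCODER_M = 561    # Jina XLM-RoBERTa
VISION_ENCODER_M = 304  # EVA02-L14

total_m = TEXT_ENCODER_M + VISION_ENCODER_M
total_b = round(total_m / 1000, 1)  # expressed in billions, one decimal place

print(f"total: {total_m}M (~{total_b}B)")
```

The sum matches the listed total, and rounding to one decimal place gives the 0.9B figure used in the summary.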

Key Features

Multilingual Support: Supports 89 languages for text-image retrieval with up to 4% improvement over comparable models
High Resolution: Processes 512x512 images, a significant upgrade from v1's 224x224 resolution
Matryoshka Representations: Allows truncating output dimensions from 1024 down to 64; a 75% reduction (to 256 dimensions) retains over 99% of full-dimension performance
State-of-the-Art Performance: Achieves 98.0% accuracy on Flickr30k image-to-text retrieval
Flexible Deployment: Available via Jina Embeddings API, AWS, Azure, and GCP
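
As a rough illustration of the API-based deployment option, the sketch below only builds a request and does not send it. The endpoint URL, the JSON field names ("model", "dimensions", "input", "text", "image"), the model identifier, and the helper name build_request are all assumptions made for illustration; check them against the official Jina Embeddings API reference before use.

```python
import json

# Assumed endpoint; verify against the official API documentation.
API_URL = "https://api.jina.ai/v1/embeddings"


def build_request(texts, image_urls, dimensions=1024, api_key="YOUR_API_KEY"):
    """Return (headers, body) for one mixed text/image embedding request.

    Field names are assumptions, not confirmed by this page.
    """
    if not 64 <= dimensions <= 1024:
        raise ValueError("dimensions must be between 64 and 1024")
    # One input object per item: texts first, then image URLs.
    inputs = [{"text": t} for t in texts] + [{"image": u} for u in image_urls]
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }
    body = {"model": "jina-clip-v2", "dimensions": dimensions, "input": inputs}
    return headers, json.dumps(body)


headers, body = build_request(
    texts=["a photo of a beach"],
    image_urls=["https://example.com/beach.jpg"],  # placeholder URL
    dimensions=256,
)
```

The resulting headers and body could then be passed to any HTTP client as a POST to API_URL.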

Performance

Even an aggressive 75% dimensional reduction (from 1024 to 256 dimensions) maintains over 99% of performance across text, image, and cross-modal tasks. The model shows a 3% performance improvement over v1 in both text-image and text-text retrieval tasks.
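
A minimal sketch of how Matryoshka-style truncation is commonly applied on the client side: keep the leading components of the vector and rescale to unit length so cosine similarity stays meaningful. The function name truncate_embedding is hypothetical, and the input vector is a synthetic placeholder rather than real model output. Whether re-normalization is needed depends on how the embeddings are produced, so treat this as one reasonable approach rather than the model's documented procedure.

```python
import math


def truncate_embedding(vec, dim):
    """Keep the first `dim` components and rescale to unit length."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return list(head)  # nothing to rescale
    return [x / norm for x in head]


FULL_DIM = 1024
# Synthetic stand-in for a 1024-dimensional embedding.
full = [math.sin(i + 1) for i in range(FULL_DIM)]

reduced_dim = FULL_DIM // 4  # a 75% reduction leaves 256 dimensions
small = truncate_embedding(full, reduced_dim)
smallest = truncate_embedding(full, 64)  # lower bound mentioned above

print(len(small), len(smallest))
```

Smaller vectors reduce storage and speed up similarity search roughly in proportion to their length, which is the trade-off the 99% figure above refers to.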

Use Cases

Cross-modal search (text-to-image, image-to-text)
Multilingual image retrieval
Visual question answering
Content-based recommendation systems
Multimodal RAG applications
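
Cross-modal search from the list above usually reduces to nearest-neighbor ranking in the shared embedding space. The sketch below ranks images against a text query by cosine similarity. The three-dimensional vectors, file names, and function names are toy placeholders chosen for readability; real embeddings would come from the model and have 64 to 1024 dimensions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def rank_images(query_vec, image_vecs, top_k=3):
    """Return the top_k (name, score) pairs, best match first."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in image_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy image index; each value stands in for an image embedding.
image_index = {
    "beach.jpg": [0.9, 0.1, 0.0],
    "forest.jpg": [0.1, 0.9, 0.1],
    "city.jpg": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # stands in for the embedding of a text query

results = rank_images(query, image_index, top_k=2)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

Image-to-text retrieval works the same way with the roles swapped: the query is an image embedding and the index holds text embeddings. For large collections, the linear scan would typically be replaced by an approximate nearest-neighbor index.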

Pricing

Available through Jina Embeddings API with commercial licensing. Also available on cloud marketplaces (AWS, Azure, GCP) with usage-based pricing.

Information

Website: jina.ai

Published: Mar 20, 2026

Tags

#multimodal #multilingual #embedding-model

Similar Products

BGE-M3

A versatile embedding model from BAAI that simultaneously supports dense retrieval, sparse retrieval, and multi-vector retrieval, with multilingual support for 100+ languages and multi-granularity processing from short sentences to 8192-token documents.

UForm

Pocket-sized multimodal AI for content understanding across multilingual texts, images, and video. Up to 5x faster than OpenAI CLIP with quantization-aware embeddings and support for 20+ languages.

Cohere Embed v4

Multilingual, multimodal enterprise embedding model supporting over 100 languages, including primary business languages, with advanced quantization for cost optimization.

Elasticsearch Vector Search

Lucene kNN vector plugin for the Elasticsearch search engine, enabling hybrid lexical and vector search, BM25 fusion, and HNSW/IVF indexes for approximate nearest neighbor (ANN) search. Used for enterprise search, RAG, and multimodal applications. Compared with standalone vector databases such as Weaviate, the integrated approach offers stronger hybrid text handling at the cost of a higher resource footprint.

Cohere Rerank v3.5

State-of-the-art foundational model for ranking with 4096 context length and multilingual support for 100+ languages. Offers exceptional performance on BEIR benchmarks and specialized domains including finance, e-commerce, and enterprise search.

Multimodal RAG

Retrieval-Augmented Generation extended to handle multiple modalities including text, images, video, and audio. Uses multimodal embeddings like Gemini Embedding 2 or CLIP to enable cross-modal search and generation.
