
BGE-M3
A versatile multilingual text embedding model from BAAI that supports 100+ languages and can handle inputs up to 8192 tokens. BGE-M3 is unique in supporting three retrieval methods simultaneously: dense retrieval, multi-vector retrieval, and sparse retrieval.
About this tool
Overview
BGE-M3 (BGE-Multi) is a text embedding model developed by BAAI (Beijing Academy of Artificial Intelligence) that stands out for its three "Multi" capabilities: multi-functionality, multi-linguality, and multi-granularity.
Three Multi Capabilities
1. Multi-Functionality
BGE-M3 is the first embedding model to simultaneously support all three common retrieval functionalities:
- Dense Retrieval: Traditional vector similarity search
- Multi-Vector Retrieval: ColBERT-style late interaction matching
- Sparse Retrieval: Lexical matching similar to BM25
This unique capability eliminates the need for multiple separate models.
2. Multi-Linguality
Trained on datasets covering 170+ different languages, BGE-M3 can work with over 100 languages in production. It achieves state-of-the-art performance on:
- Multi-lingual benchmarks (MIRACL)
- Cross-lingual benchmarks (MKQA)
Performance surpasses models from OpenAI in both English and other languages.
3. Multi-Granularity
Processes inputs of varying lengths:
- Short sentences (a few tokens)
- Medium documents (hundreds of tokens)
- Long documents (up to 8192 tokens)
This flexibility makes it suitable for diverse use cases from FAQ search to full document retrieval.
Technical Architecture
Base Model: XLM-RoBERTa
Output Dimensions: 1024-dimensional embeddings as primary output
Training: Trained on massive multilingual corpora with contrastive learning objectives for all three retrieval modes
Recommended Pipeline
The BGE-M3 developers recommend: Hybrid Retrieval + Re-ranking
This combination:
- Leverages strengths of dense, sparse, and multi-vector methods
- Offers higher accuracy than any single method
- Provides stronger generalization across languages and domains
Use Cases
- Multilingual semantic search engines
- Cross-lingual information retrieval
- Enterprise knowledge bases spanning multiple languages
- RAG systems for international applications
- Hybrid search implementations
Performance
- #1 on multi-lingual MIRACL benchmark
- #1 on cross-lingual MKQA benchmark
- Outperforms OpenAI models in multilingual scenarios
- Competitive with best English-only models on English tasks
Availability
Open-source and available through:
- Hugging Face (BAAI/bge-m3)
- FlagEmbedding library on GitHub
- Ollama model library
- Various AI inference platforms (NVIDIA NIM, DeepInfra, OVHcloud)
Pricing
Free and open-source under permissive licenses, making it cost-effective for commercial deployments compared to proprietary multilingual embedding APIs.
Loading more......
