
    BGE-M3

    A versatile multilingual text embedding model from BAAI that supports 100+ languages and can handle inputs up to 8192 tokens. BGE-M3 is unique in supporting three retrieval methods simultaneously: dense retrieval, multi-vector retrieval, and sparse retrieval.

    Overview

    BGE-M3 is a text embedding model developed by BAAI (Beijing Academy of Artificial Intelligence). The "M3" stands for its three "Multi" capabilities: multi-functionality, multi-linguality, and multi-granularity.

    Three Multi Capabilities

    1. Multi-Functionality

    BGE-M3 is the first embedding model to simultaneously support all three common retrieval functionalities:

    • Dense Retrieval: Traditional vector similarity search
    • Multi-Vector Retrieval: ColBERT-style late interaction matching
    • Sparse Retrieval: Lexical matching similar to BM25

    This unique capability eliminates the need for multiple separate models.
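
    As a minimal sketch of what this looks like in practice, the open-source FlagEmbedding library (listed under Availability below) can request all three outputs from a single encode call. Method and key names follow its documented API, but verify them against your installed version:

    ```python
    # Minimal sketch with the FlagEmbedding library (pip install -U FlagEmbedding).
    from FlagEmbedding import BGEM3FlagModel

    model = BGEM3FlagModel("BAAI/bge-m3", use_fp16=True)  # fp16 speeds up GPU inference

    sentences = [
        "What is BGE-M3?",
        "BGE-M3 is a multilingual embedding model from BAAI.",
    ]

    # A single encode call can return all three representations at once.
    output = model.encode(
        sentences,
        return_dense=True,         # 1024-dim dense vectors
        return_sparse=True,        # per-token lexical weights (BM25-style)
        return_colbert_vecs=True,  # per-token vectors for late interaction
    )

    print(output["dense_vecs"].shape)       # (2, 1024)
    print(output["lexical_weights"][0])     # {token_id: weight, ...}
    print(output["colbert_vecs"][0].shape)  # (num_tokens, dim)
    ```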

    2. Multi-Linguality

    Trained on data covering 170+ languages, BGE-M3 supports more than 100 working languages. It achieves state-of-the-art performance on:

    • Multi-lingual benchmarks (MIRACL)
    • Cross-lingual benchmarks (MKQA)

    On these benchmarks, its performance surpasses OpenAI's embedding models in both English and non-English retrieval.
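
    To illustrate cross-lingual matching, a hedged sketch using the same FlagEmbedding API as above; the example sentences are invented, and dense vectors are assumed to be L2-normalized so a dot product acts as cosine similarity:

    ```python
    # Illustrative cross-lingual sketch: an English query scored against
    # passages in other languages via the shared dense embedding space.
    from FlagEmbedding import BGEM3FlagModel

    model = BGEM3FlagModel("BAAI/bge-m3")

    query = "What is the capital of France?"
    passages = [
        "Paris est la capitale de la France.",      # French
        "Berlin ist die Hauptstadt Deutschlands.",  # German
    ]

    q_vec = model.encode([query])["dense_vecs"]
    p_vecs = model.encode(passages)["dense_vecs"]

    scores = q_vec @ p_vecs.T  # dot product ~ cosine on normalized vectors
    print(scores)  # the French passage should score higher
    ```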

    3. Multi-Granularity

    Processes inputs of varying lengths:

    • Short sentences (a few tokens)
    • Medium documents (hundreds of tokens)
    • Long documents (up to 8192 tokens)

    This flexibility makes it suitable for diverse use cases from FAQ search to full document retrieval.
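
    A short sketch of what multi-granularity means in practice, again with FlagEmbedding; `max_length` follows that library's documented API (default 8192), and the long input below is filler standing in for a real document:

    ```python
    # One call handles both a short query and a long document.
    from FlagEmbedding import BGEM3FlagModel

    model = BGEM3FlagModel("BAAI/bge-m3")

    short_text = "refund policy"
    long_text = " ".join(["word"] * 4000)  # stand-in for a multi-page document

    out = model.encode([short_text, long_text], max_length=8192)
    print(out["dense_vecs"].shape)  # (2, 1024) regardless of input length
    ```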

    Technical Architecture

    Base Model: XLM-RoBERTa

    Output Dimensions: 1024-dimensional dense embeddings (primary output)

    Training: Massive multilingual corpora, with contrastive learning objectives covering all three retrieval modes

    Recommended Pipeline

    The BGE-M3 developers recommend a hybrid retrieval + re-ranking pipeline.

    This combination:

    • Leverages strengths of dense, sparse, and multi-vector methods
    • Offers higher accuracy than any single method
    • Provides stronger generalization across languages and domains
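
    A hedged sketch of what such a pipeline might look like, pairing BGE-M3 with its companion cross-encoder reranker (BAAI/bge-reranker-v2-m3); the fusion weights and documents are illustrative placeholders, not tuned values:

    ```python
    # Hybrid retrieval (dense + sparse fusion), then cross-encoder re-ranking.
    from FlagEmbedding import BGEM3FlagModel, FlagReranker

    model = BGEM3FlagModel("BAAI/bge-m3")
    reranker = FlagReranker("BAAI/bge-reranker-v2-m3")

    query = "how do I renew a passport?"
    docs = [
        "Passport renewal requires form DS-82 and a recent photo.",
        "A simple recipe for tomato soup.",
    ]

    q = model.encode([query], return_dense=True, return_sparse=True)
    d = model.encode(docs, return_dense=True, return_sparse=True)

    # Fuse dense (vector) and sparse (lexical) scores per document.
    hybrid = []
    for i in range(len(docs)):
        dense = float(q["dense_vecs"][0] @ d["dense_vecs"][i])
        sparse = model.compute_lexical_matching_score(
            q["lexical_weights"][0], d["lexical_weights"][i]
        )
        hybrid.append(0.6 * dense + 0.4 * sparse)  # illustrative weights

    # Re-rank the top candidates with the cross-encoder for the final ordering.
    top = sorted(range(len(docs)), key=lambda i: hybrid[i], reverse=True)
    final_scores = reranker.compute_score([[query, docs[i]] for i in top])
    print(final_scores)
    ```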

    Use Cases

    • Multilingual semantic search engines
    • Cross-lingual information retrieval
    • Enterprise knowledge bases spanning multiple languages
    • RAG systems for international applications
    • Hybrid search implementations

    Performance

    • #1 on the multilingual MIRACL benchmark
    • #1 on the cross-lingual MKQA benchmark
    • Outperforms OpenAI models in multilingual scenarios
    • Competitive with best English-only models on English tasks

    Availability

    Open-source and available through:

    • Hugging Face (BAAI/bge-m3)
    • FlagEmbedding library on GitHub
    • Ollama model library
    • Various AI inference platforms (NVIDIA NIM, DeepInfra, OVHcloud)
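
    As a quick-start sketch, the model's dense path also loads through sentence-transformers, assuming the Hugging Face checkpoint's bundled configuration; the sparse and multi-vector heads still require the FlagEmbedding library:

    ```python
    # Dense-only loading via sentence-transformers.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("BAAI/bge-m3")
    embeddings = model.encode(["hello world"])
    print(embeddings.shape)  # (1, 1024)
    ```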

    Pricing

    Free and open-source under permissive licenses, making it cost-effective for commercial deployments compared to proprietary multilingual embedding APIs.


    Information

    Website: huggingface.co
    Published: Mar 20, 2026

    Categories

    Machine Learning Models

    Tags

    #Embeddings #Multilingual #Hybrid Search #Open Source

    Similar Products

    Qwen3 Embedding

    Multilingual embedding model supporting over 100 languages and ranking #1 on MTEB multilingual leaderboard. Offers flexible model sizes from 0.6B to 8B parameters with user-defined instructions.

    gte-Qwen2-1.5B-instruct

    A state-of-the-art multilingual text embedding model from Alibaba's GTE (General Text Embedding) series, built on the Qwen2-1.5B LLM. The model supports up to 8192 tokens and incorporates bidirectional attention mechanisms for enhanced contextual understanding across diverse domains.

    jina-embeddings-v5

    Jina AI's latest embedding model achieving the highest multilingual performance among models under 1B parameters with 71.7 average MTEB score and 67.7 MMTEB score.

    Nomic Embed Text v2

    Open-source multilingual embedding model using Mixture-of-Experts architecture, achieving excellent semantic performance with efficient inference and full offline support.

    GTE Embeddings

    General Text Embeddings from Alibaba DAMO Academy trained on large-scale relevance pairs. Available in three sizes (large, base, small) with GTE-v1.5 supporting 8192 context length.

    FlagEmbedding

    Open-source retrieval and RAG framework from BAAI featuring the BGE embedding model series. BGE-M3 supports multi-functionality (dense, sparse, multi-vector), multi-linguality (100+ languages), and multi-granularity (up to 8192 tokens).
