
BEIR Benchmark
A heterogeneous benchmark for evaluating information retrieval models across 18 diverse datasets and 9 different retrieval tasks. BEIR (Benchmarking IR) measures zero-shot retrieval performance, testing how well models generalize without task-specific fine-tuning, making it a standard evaluation for embedding models and retrieval systems.
About this tool
Overview
BEIR (Benchmarking IR) is a heterogeneous benchmark for information retrieval that evaluates models across 18 publicly available datasets spanning 9 different retrieval tasks. It is designed to test the zero-shot performance of retrieval models: a model is typically trained on a large general-purpose dataset such as MS MARCO and then evaluated unchanged on the target datasets.
Key Characteristics
Zero-Shot Evaluation
BEIR tests models in a zero-shot setting, meaning:
- No fine-tuning on target datasets
- Tests true generalization capability
- Reflects real-world deployment scenarios
- Measures robustness across domains
Diverse Datasets
18 datasets covering 9 tasks:
- Question Answering: Natural Questions, HotpotQA, FiQA-2018
- Fact Checking: FEVER, Climate-FEVER, SciFact
- Citation Prediction: SCIDOCS
- Duplicate Question Retrieval: Quora, CQADupStack
- Argument Retrieval: ArguAna, Touché-2020
- News Retrieval: TREC-NEWS, Robust04
- Bio-Medical IR: NFCorpus, TREC-COVID, BioASQ
- Entity Retrieval: DBPedia
- Tweet Retrieval: Signal-1M
Evaluation Metrics
Primary metrics:
- NDCG@10: Normalized Discounted Cumulative Gain at rank 10, the headline metric reported for every dataset (see the sketch below)
- Recall@100: Fraction of relevant documents retrieved in the top 100 results
- MAP: Mean Average Precision
- MRR: Mean Reciprocal Rank
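NDCG@10 rewards rankings that place highly relevant documents near the top, applying a logarithmic discount by rank. A minimal, self-contained sketch of the computation for a single query (the function name and data layout here are illustrative, not part of the BEIR library):

import math

def ndcg_at_k(ranked_ids, relevance, k=10):
    # DCG: graded relevance of each retrieved doc, discounted by log2(rank + 1).
    dcg = sum(relevance.get(doc_id, 0) / math.log2(i + 2)
              for i, doc_id in enumerate(ranked_ids[:k]))
    # Ideal DCG: the same sum over the best possible ordering of the judged docs.
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Example: a grade-2 relevant doc ranked second, a grade-1 doc missed entirely.
print(ndcg_at_k(["d3", "d1", "d7"], {"d1": 2, "d5": 1}))  # ~0.48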
Why BEIR Matters
- Industry Standard: Widely used for comparing retrieval and embedding models
- Generalization Test: Indicates how a model will perform on domains it was never trained on
- Domain Diversity: Tests across different types of queries and documents
- Research Baseline: Standard benchmark for evaluating new retrieval methods
Top Performing Models
Historically strong performers:
- Lexical baselines: BM25 remains a surprisingly strong zero-shot baseline on many datasets
- Dense retrieval models (ANCE, and late-interaction models such as ColBERT)
- Instruction-tuned embeddings (Instructor)
- Cross-encoders (used for reranking a first-stage candidate list)
- Hybrid approaches combining dense and sparse signals (see the fusion sketch below)
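One common way to build a hybrid retriever is reciprocal rank fusion (RRF), which merges ranked lists from, say, BM25 and a dense model without needing comparable scores. BEIR itself does not prescribe a fusion method; this is a minimal sketch of the general technique, with the conventional default constant k=60:

def reciprocal_rank_fusion(rankings, k=60):
    # rankings: list of ranked doc-id lists, e.g. [bm25_ranking, dense_ranking].
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # Each list contributes 1 / (k + rank); earlier ranks contribute more.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Example: fuse a lexical ranking with a dense ranking.
print(reciprocal_rank_fusion([["d1", "d2", "d3"], ["d2", "d3", "d1"]]))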
Limitations
- English-only datasets
- Relatively small dataset sizes for some tasks
- Static evaluation (doesn't capture temporal changes)
- Limited multimodal content
Use Cases
- Model Selection: Choose an embedding model for your application
- Research: Benchmark new retrieval architectures
- Production Validation: Estimate real-world performance
- Model Development: Identify weaknesses and improvement areas
Integration
The beir Python package handles dataset download, loading, and evaluation:
from beir import util                                    # dataset download/unzip helpers
from beir.datasets.data_loader import GenericDataLoader  # loads corpus, queries, and qrels
from beir.retrieval.evaluation import EvaluateRetrieval  # runs retrieval and computes metrics
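A minimal end-to-end sketch following the library's documented quickstart; the dataset (SciFact), the model checkpoint, and the download directory are illustrative choices, not requirements:

from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval import models
from beir.retrieval.evaluation import EvaluateRetrieval
from beir.retrieval.search.dense import DenseRetrievalExactSearch as DRES

# Download and unpack one BEIR dataset (SciFact is small and quick to run).
url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/scifact.zip"
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

# Wrap a SentenceTransformers checkpoint as a dense retriever.
model = DRES(models.SentenceBERT("msmarco-distilbert-base-tas-b"), batch_size=16)
retriever = EvaluateRetrieval(model, score_function="dot")

# Retrieve, then score against the relevance judgments at several cutoffs.
results = retriever.retrieve(corpus, queries)
ndcg, _map, recall, precision = retriever.evaluate(qrels, results, retriever.k_values)
print(ndcg["NDCG@10"])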
Resources
- GitHub: github.com/beir-cellar/beir
- Paper: "BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models" (arXiv:2104.08663)
- Leaderboard: Community-maintained rankings
Pricing
Free and open-source benchmark.