



Zero-shot benchmark for evaluating embedding models across 18 diverse datasets, using NDCG@10 and Recall@100 — quality metrics that complement the QPS/latency numbers production vector DBs are tuned for. Its heterogeneous tasks (QA, fact checking, biomedical retrieval) make model comparisons robust. Typical use case: selecting an embedding model for a RAG pipeline backed by a vector DB. It complements ANN-Benchmarks, which focuses on indexing performance, by evaluating retrieval quality, and differs from VectorDBBench, which tests full database systems.
BEIR (Benchmarking IR) is a heterogeneous benchmark for information retrieval that evaluates models across 18 publicly available datasets spanning 9 different retrieval tasks. It's designed to test zero-shot performance of retrieval models.
BEIR tests models in a zero-shot setting, meaning models are evaluated on datasets and domains they were never fine-tuned on, so scores reflect out-of-domain generalization rather than in-domain fit.
18 datasets covering 9 task types, e.g. fact checking (FEVER, Climate-FEVER, SciFact), question answering (NQ, HotpotQA, FiQA-2018), bio-medical retrieval (TREC-COVID, NFCorpus), argument retrieval (ArguAna, Touché-2020), duplicate-question retrieval (Quora, CQADupStack), entity retrieval (DBPedia-Entity), citation prediction (SCIDOCS), news retrieval (TREC-NEWS), and tweet retrieval (Signal-1M).
Primary metrics: nDCG@10 as the headline quality number (rank-aware and suited to graded relevance), with Recall@100 reported alongside it; both are averaged over queries for each dataset.
Historically strong performers: BM25 is a surprisingly robust zero-shot baseline that many early dense retrievers (e.g. DPR, ANCE) failed to beat on average; cross-encoder re-rankers and late-interaction models such as ColBERT have tended to score highest, at higher compute cost.
Easy to use via the beir Python package (pip install beir):
from beir import util
from beir.datasets.data_loader import GenericDataLoader
from beir.retrieval.evaluation import EvaluateRetrieval

url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/scifact.zip"
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_path).load(split="test")
Given a results dict of {query_id: {doc_id: score}} from any retriever, EvaluateRetrieval.evaluate(qrels, results, k_values=[10, 100]) then produces the nDCG, MAP, recall, and precision numbers.
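For intuition about what those k-values measure, here is a minimal pure-Python sketch of nDCG@10 and Recall@100 over BEIR-style qrels/results dicts. This is an illustration, not BEIR's actual implementation (which delegates to pytrec_eval), and the toy query/doc IDs and scores are made up:

```python
import math

def ndcg_at_k(qrels, results, k=10):
    """Mean nDCG@k over queries; qrels/results are BEIR-style nested dicts."""
    scores = []
    for qid, rels in qrels.items():
        run = results.get(qid, {})
        ranked = sorted(run, key=run.get, reverse=True)[:k]
        dcg = sum(rels.get(d, 0) / math.log2(i + 2) for i, d in enumerate(ranked))
        ideal = sorted(rels.values(), reverse=True)[:k]
        idcg = sum(r / math.log2(i + 2) for i, r in enumerate(ideal))
        scores.append(dcg / idcg if idcg > 0 else 0.0)
    return sum(scores) / len(scores)

def recall_at_k(qrels, results, k=100):
    """Fraction of relevant docs found in the top k, averaged over queries."""
    scores = []
    for qid, rels in qrels.items():
        run = results.get(qid, {})
        top = set(sorted(run, key=run.get, reverse=True)[:k])
        relevant = {d for d, r in rels.items() if r > 0}
        scores.append(len(top & relevant) / len(relevant) if relevant else 0.0)
    return sum(scores) / len(scores)

# Toy data in BEIR's shapes: {query_id: {doc_id: relevance}} and {query_id: {doc_id: score}}
qrels = {"q1": {"d1": 1, "d2": 1}}
results = {"q1": {"d1": 0.9, "d3": 0.8, "d2": 0.7}}

print(round(ndcg_at_k(qrels, results, 10), 4))  # 0.9197
print(recall_at_k(qrels, results, 100))         # 1.0
```

nDCG is penalized here because the irrelevant d3 outranks the relevant d2, while Recall@100 is still perfect since both relevant docs appear in the top 100.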
The benchmark is free and open source; code and dataset loaders are maintained in the beir GitHub repository.