IntelLabs's Vector Search Datasets
A collection of datasets curated by Intel Labs specifically for evaluating and benchmarking vector search algorithms and databases.
Features
- Provides code to generate several datasets for similarity search benchmarking and evaluation.
- Datasets are based on high-dimensional vectors from recent deep learning models.
- Includes multiple datasets (see the respective folders: dpr, openimages, rqa, text, wit).
- Each dataset comes with its own README file with details and usage instructions.
- Useful for researchers and developers working on vector search, similarity search, and related benchmarking tasks.
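The features above center on benchmarking similarity search over high-dimensional vectors. As a minimal sketch of the kind of evaluation such datasets support (not code from the repository; `exact_knn` and `recall_at_k` are illustrative names, and the random vectors stand in for real dataset embeddings), one typically computes exact nearest neighbors as ground truth and scores an approximate index by recall:

```python
import numpy as np

def exact_knn(queries, base, k=10):
    """Brute-force ground-truth k-NN under squared L2 distance."""
    # ||q - b||^2 = ||q||^2 - 2 q.b + ||b||^2, computed with broadcasting
    d = (np.sum(queries**2, axis=1, keepdims=True)
         - 2.0 * queries @ base.T
         + np.sum(base**2, axis=1))
    return np.argsort(d, axis=1)[:, :k]

def recall_at_k(approx_ids, exact_ids):
    """Average fraction of true neighbors recovered per query."""
    hits = [len(set(a) & set(e)) / len(e)
            for a, e in zip(approx_ids, exact_ids)]
    return float(np.mean(hits))

# Synthetic stand-ins for a base set and query set of embeddings
rng = np.random.default_rng(0)
base = rng.standard_normal((1000, 64)).astype(np.float32)
queries = rng.standard_normal((5, 64)).astype(np.float32)

gt = exact_knn(queries, base, k=10)
print(recall_at_k(gt, gt))  # a perfect index scores recall 1.0
```

In practice the ground-truth indices would be computed once per dataset and reused to score many index configurations.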
Notes
- Project Status: Not under active management. Intel has ceased development, maintenance, and contributions to this project.
- Users interested in further development or maintenance are encouraged to fork the repository.
Source
https://github.com/IntelLabs/VectorSearchDatasets
Tags
datasets, vector-search, benchmark, evaluation
Category
Curated Resource Lists
Similar Products
- BEIR (Benchmarking IR) is a benchmark suite for evaluating information retrieval and vector search systems across multiple tasks and datasets. Useful for comparing vector database performance.
- An annual competition focused on similarity search and indexing algorithms, including approximate nearest neighbor methods and high-dimensional vector indexing, providing benchmarks and results relevant to vector database research.
- The open-source repository containing the implementation, configuration, and scripts of VectorDBBench, enabling users to run standardized benchmarks across multiple vector database systems locally or in CI.
- A massive text embedding benchmark for evaluating the quality of text embedding models, crucial for vector database applications.
- ANN-Benchmarks is a benchmarking platform specifically for evaluating the performance of approximate nearest neighbor (ANN) search algorithms, which are foundational to vector database evaluation and comparison.
- A 2024 paper introducing CANDY, a benchmark for continuous ANN search with a focus on dynamic data ingestion, crucial for next-generation vector databases.