DocArray
An open-source library for creating, storing, and searching multimodal data and vector embeddings, supporting AI and ML workflows.
About this tool
DocArray is an open-source Python library designed for representing, storing, and retrieving multimodal data, making it suitable for AI and machine learning workflows involving complex data types such as images, text, audio, and video.
Features
- Multimodal Data Representation: Define and work with documents containing various data types (images, text, audio, video) using Python classes.
- Pydantic Compatibility: Built on top of Pydantic, allowing type validation and integration with other Pydantic-based tools.
- Custom Data Models: Create custom document schemas using BaseDoc, specifying fields for different modalities and types.
- Tensor Shape Specification: Specify tensor shapes for data fields, with support for frameworks like PyTorch, NumPy, and TensorFlow.
- Nested Documents: Compose complex, nested document structures for handling multimodal datasets.
- Batch Processing: Process and manipulate batches of documents via DocVec and DocList collections, enabling bulk operations and efficient workflows.
- Bulk Field Access: Retrieve and manipulate fields across all documents in a collection with simple syntax.
- Flexible Embedding Storage: Store and manage vector embeddings computed from any model, facilitating downstream search and retrieval tasks.
- Open Source: Distributed under the Apache License 2.0 and part of the LF AI & Data Foundation as a sandbox project.
- Python Ecosystem Integration: Works with Pydantic-based tools and with tensor libraries such as NumPy, PyTorch, and TensorFlow.
- Installation via pip: Install and update from PyPI with pip (pip install docarray).
Pricing
DocArray is open-source software and free to use under the Apache License 2.0.