



A Vision Language Model trained to produce high-quality multi-vector embeddings from document page images for efficient retrieval with ColBERT-style late interaction, eliminating the need for OCR pipelines.
ColPali is a Vision Language Model trained to produce high-quality multi-vector embeddings from images of document pages for efficient document retrieval.
ColPali uses PaliGemma-3B to encode page images.
ColPali removes the need for potentially complex and brittle layout recognition and OCR pipelines with a single model that can take into account both the textual and visual content (layout, charts, etc.) of a document.
ColPali runs a ColBERT-style "late interaction" operation to efficiently match query tokens to document patches: for each query token, it finds the document patch with the most similar representation, and these per-token maxima are combined into the page's score.
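The late interaction step can be sketched in a few lines of plain Python. This is only an illustrative sketch, not ColPali's actual implementation: it assumes each query token and each page patch has already been embedded as a vector, uses a dot product as the similarity, and sums the per-token maxima as in ColBERT. The names `late_interaction_score` and `rank_pages` and the toy 2-dimensional vectors are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Dot product, used here as the similarity between two embeddings."""
    return sum(a * b for a, b in zip(u, v))


def late_interaction_score(query_vectors: Sequence[Vector],
                           patch_vectors: Sequence[Vector]) -> float:
    """For each query token, take its best-matching patch; sum those maxima."""
    return sum(max(dot(q, p) for p in patch_vectors) for q in query_vectors)


def rank_pages(query_vectors: Sequence[Vector],
               pages: Sequence[Sequence[Vector]]) -> List[Tuple[int, float]]:
    """Return (page_index, score) pairs sorted from best to worst match."""
    scored = [(i, late_interaction_score(query_vectors, patches))
              for i, patches in enumerate(pages)]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy data (assumption): two query tokens, two pages with a few patches each.
query = [[1.0, 0.0], [0.0, 1.0]]
page_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # matches both query tokens
page_b = [[0.0, 1.0], [0.0, 0.5]]              # matches only the second token

ranking = rank_pages(query, [page_a, page_b])
print(ranking)  # [(0, 2.0), (1, 1.0)]
```

Because page embeddings do not depend on the query, they can be computed once at indexing time; only this lightweight matching step runs per query.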
Combined with a late interaction matching mechanism, ColPali substantially outperforms modern document retrieval pipelines while being drastically simpler, faster, and end-to-end trainable.
The Visual Document Retrieval Benchmark (ViDoRe) was introduced to assess how well retrievers can retrieve visually rich information from documents, with tasks spanning various topics, modalities (figures, tables, text), and languages.
Free and open-source.