Overview
Document Loaders are components in LLM frameworks (LangChain, LlamaIndex, Haystack) that fetch data from various sources and convert it into a standardized document format for downstream processing in RAG pipelines.
Common Document Loaders
Text-Based
- TextLoader: Plain text files
- CSVLoader: CSV files, typically loaded one document per row
- JSONLoader: Structured JSON data
- MarkdownLoader: Markdown documents
Rich Documents
- PyPDFLoader: PDF extraction
- UnstructuredPDFLoader: Advanced PDF parsing with layout preservation
- Docx2txtLoader: Microsoft Word documents
- UnstructuredHTMLLoader: HTML pages with structure
Web Sources
- WebBaseLoader: Website scraping
- GitHubLoader: GitHub repositories
- NotionDBLoader: Notion databases
- ConfluenceLoader: Confluence pages
Databases
- SQLDatabaseLoader: SQL databases
- MongoDBLoader: MongoDB collections
- BigQueryLoader: Google BigQuery
Multimedia
- YoutubeLoader: Video transcripts
- AudioLoader: Audio file transcription
- ImageLoader: Image content extraction
Document Structure
Loaders typically output documents with:
    {
      "page_content": "The actual text content",
      "metadata": {
        "source": "file.pdf",
        "page": 1,
        "author": "...",
        "created_at": "..."
      }
    }
Integration with RAG Pipeline
- Load: Document loader fetches and parses data
- Split: Text splitter chunks documents
- Embed: Embedding model creates vectors
- Store: Vector database stores embeddings
- Retrieve: Query retrieves relevant chunks
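The five steps above can be sketched end to end in plain Python. This is a toy, framework-agnostic illustration: the loader, splitter, and "embedding" below are stand-ins (a bag-of-words counter instead of a real embedding model, a list instead of a vector database), not any framework's actual API.

```python
# Sketch of the load -> split -> embed -> store -> retrieve flow.
import math
from collections import Counter

def load(source):
    # Load: a real loader would fetch and parse the source.
    return {"page_content": source, "metadata": {"source": "inline"}}

def split(doc, chunk_size=40):
    # Split: naive fixed-size character chunks; real splitters respect
    # sentence and paragraph boundaries.
    text = doc["page_content"]
    return [
        {"page_content": text[i:i + chunk_size], "metadata": doc["metadata"]}
        for i in range(0, len(text), chunk_size)
    ]

def embed(text):
    # Embed: bag-of-words stand-in for a real embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Store: a plain list playing the role of a vector database.
store = [(embed(c["page_content"]), c) for c in split(load(
    "Loaders fetch raw data. Splitters chunk it. Embeddings index it."))]

def retrieve(query, k=1):
    # Retrieve: rank stored chunks by similarity to the query.
    ranked = sorted(store, key=lambda e: cosine(embed(query), e[0]), reverse=True)
    return [chunk for _, chunk in ranked[:k]]
```

Note how the loader's metadata is carried through splitting into the store, which is what makes source citations possible at retrieval time.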
Advanced Features
Layout Preservation
Tools like Unstructured.io and LlamaParse:
- Preserve table structures
- Maintain multi-column layouts
- Extract images and captions
- Recognize document hierarchy
Metadata Extraction
- Automatic extraction of creation dates, authors
- URL and source tracking
- Page numbers and sections
- Custom metadata fields
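A minimal sketch of what such extraction produces, assuming only the field names shown in the document-structure example above; `extract_metadata` is a hypothetical helper, not a real library function, and real loaders would read timestamps and authors from the file itself.

```python
# Hypothetical metadata extraction: derive standard fields from a path.
import datetime
from pathlib import Path

def extract_metadata(path, page=None):
    p = Path(path)
    meta = {
        "source": str(p),                    # source tracking for citations
        "file_type": p.suffix.lstrip("."),
        # Stand-in value; a real loader reads the file's creation timestamp.
        "created_at": datetime.date.today().isoformat(),
    }
    if page is not None:
        meta["page"] = page                  # page number for paginated formats
    return meta
```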
Batch Processing
- Process multiple documents concurrently
- Resume from failures
- Progress tracking
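These three properties can be sketched with the standard library alone. `load_one` below is a stand-in for any loader's `load()` call; collecting failures instead of aborting is what lets a batch be resumed later.

```python
# Concurrent batch loading with failure collection.
from concurrent.futures import ThreadPoolExecutor, as_completed

def load_one(source):
    # Stand-in loader: raise on a marker value to simulate a bad file.
    if source == "corrupt.pdf":
        raise ValueError(f"cannot parse {source}")
    return {"page_content": f"contents of {source}", "metadata": {"source": source}}

def load_batch(sources, max_workers=4):
    docs, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(load_one, s): s for s in sources}
        for fut in as_completed(futures):     # completion order = progress tracking hook
            src = futures[fut]
            try:
                docs.append(fut.result())
            except Exception:
                failed.append(src)            # keep going; retry these to resume
    return docs, failed
```

Threads suit I/O-bound loading (network, disk); for CPU-heavy parsing a `ProcessPoolExecutor` is the usual swap-in.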
Popular Loader Libraries
LangChain
from langchain.document_loaders import PyPDFLoader
loader = PyPDFLoader("document.pdf")
documents = loader.load()
Note: in recent LangChain versions the loaders live in langchain_community.document_loaders; the langchain.document_loaders path shown above is the older import and is deprecated.
LlamaIndex (LlamaHub)
- 100+ connectors
- Specialized loaders for enterprise sources
- Community-contributed loaders
Unstructured
- Unified API for multiple formats
- Advanced parsing for complex layouts
- Pre-processing and chunking
Best Practices
- Choose Right Loader: Match loader to source format
- Preserve Metadata: Keep source tracking for citations
- Handle Errors: Implement retry logic and error handling
- Optimize for Scale: Use batch processing for large datasets
- Test Extraction Quality: Verify parsing accuracy
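The "Handle Errors" practice can be sketched as a small retry helper with exponential backoff, useful for flaky sources such as network loaders or rate-limited APIs. This is an illustrative standalone function, not taken from any specific framework.

```python
# Retry a callable with exponential backoff.
import time

def with_retries(fn, attempts=3, base_delay=0.01):
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise                              # out of attempts: surface the error
            time.sleep(base_delay * 2 ** attempt)  # back off before retrying
```

Usage would be e.g. `with_retries(loader.load)` around any loader call that can fail transiently.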
Common Challenges
- PDF table extraction
- Scanned document OCR
- Multi-column layouts
- Nested document structures
- Large file handling
Pricing
Most loaders are free and open-source. Commercial options:
- Unstructured.io API: Paid plans
- LlamaParse: Usage-based pricing
- Custom enterprise connectors: Varies