
Document Loaders
Components in LLM frameworks that fetch and parse data from various sources (PDFs, websites, databases) into a standardized format for processing. They are the essential first step in RAG pipelines, converting raw data into processable documents.
Overview
Document Loaders are components in LLM frameworks (LangChain, LlamaIndex, Haystack) that fetch data from various sources and convert it into a standardized document format for downstream processing in RAG pipelines.
Common Document Loaders
Text-Based
- TextLoader: Plain text files
- CSVLoader: Spreadsheet data
- JSONLoader: Structured JSON data
- MarkdownLoader: Markdown documents
Rich Documents
- PyPDFLoader: PDF extraction
- UnstructuredPDFLoader: Advanced PDF parsing with layout preservation
- Docx2txtLoader: Microsoft Word documents
- UnstructuredHTMLLoader: HTML pages with structure
Web Sources
- WebBaseLoader: Website scraping
- GitHubLoader: GitHub repositories
- NotionDBLoader: Notion databases
- ConfluenceLoader: Confluence pages
Databases
- SQLDatabaseLoader: SQL databases
- MongoDBLoader: MongoDB collections
- BigQueryLoader: Google BigQuery
Multimedia
- YoutubeLoader: Video transcripts
- AudioLoader: Audio file transcription
- ImageLoader: Image content extraction
Document Structure
Loaders typically output documents with:
{
  "page_content": "The actual text content",
  "metadata": {
    "source": "file.pdf",
    "page": 1,
    "author": "...",
    "created_at": "..."
  }
}
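The shape above can be mirrored with a minimal dataclass. This is a sketch, a stand-in for the document classes frameworks ship (e.g. LangChain's `langchain_core.documents.Document`), not a real framework API:

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    """Minimal stand-in for the object most loaders emit."""
    page_content: str               # the extracted text
    metadata: dict = field(default_factory=dict)  # source tracking, page numbers, etc.

doc = Document(
    page_content="The actual text content",
    metadata={"source": "file.pdf", "page": 1},
)
```

Downstream components (splitters, vector stores) read `page_content` and carry `metadata` along so retrieved chunks can be traced back to their source.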
Integration with RAG Pipeline
- Load: Document loader fetches and parses data
- Split: Text splitter chunks documents
- Embed: Embedding model creates vectors
- Store: Vector database stores embeddings
- Retrieve: Query retrieves relevant chunks
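The five stages can be sketched end-to-end with toy stand-ins. Everything here is illustrative: a naive fixed-size splitter, a bag-of-words counter in place of an embedding model, and a plain list in place of a vector database:

```python
import math
from collections import Counter

def load() -> list[str]:
    # Load: a real document loader would parse files into documents.
    return ["Cats are small domesticated mammals. Dogs are loyal companions."]

def split(docs: list[str], size: int = 40) -> list[str]:
    # Split: naive fixed-size chunking (real splitters respect sentence boundaries).
    return [d[i:i + size] for d in docs for i in range(0, len(d), size)]

def embed(text: str) -> Counter:
    # Embed: bag-of-words counts stand in for a dense embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Store: keep (chunk, vector) pairs in a list instead of a vector database.
store = [(chunk, embed(chunk)) for chunk in split(load())]

def retrieve(query: str, k: int = 1) -> list[str]:
    # Retrieve: rank stored chunks by similarity to the query embedding.
    qv = embed(query)
    ranked = sorted(store, key=lambda p: cosine(qv, p[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]
```

Swapping each toy function for a real loader, splitter, embedding model, and vector store yields the standard RAG ingestion path.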
Advanced Features
Layout Preservation
Tools like Unstructured.io and LlamaParse:
- Preserve table structures
- Maintain multi-column layouts
- Extract images and captions
- Recognize document hierarchy
Metadata Extraction
- Automatic extraction of creation dates, authors
- URL and source tracking
- Page numbers and sections
- Custom metadata fields
Batch Processing
- Process multiple documents concurrently
- Resume from failures
- Progress tracking
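A minimal sketch of those three batch-processing concerns, using only the standard library; `load_one` is a hypothetical per-file loader standing in for any real one:

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def load_one(path: str) -> str:
    # Hypothetical per-file loader; a real one would parse the file contents.
    if path.endswith(".bad"):
        raise IOError(f"cannot parse {path}")
    return f"contents of {path}"

def load_with_retry(path: str, retries: int = 2) -> str:
    # Resume from transient failures by retrying with a simple backoff.
    for attempt in range(retries + 1):
        try:
            return load_one(path)
        except IOError:
            if attempt == retries:
                raise
            time.sleep(0.01 * (attempt + 1))

def load_batch(paths: list[str], workers: int = 4) -> dict[str, str]:
    # Process documents concurrently; a failed path is skipped, not fatal.
    results, done = {}, 0
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(load_with_retry, p): p for p in paths}
        for fut in as_completed(futures):
            path, done = futures[fut], done + 1
            try:
                results[path] = fut.result()
            except IOError:
                pass  # log and continue in a real pipeline
            print(f"progress: {done}/{len(paths)}")  # progress tracking
    return results
```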
Popular Loader Libraries
LangChain
# In recent LangChain releases the loaders live in langchain_community.
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("document.pdf")
documents = loader.load()  # one Document per page, with source and page metadata
LlamaIndex (LlamaHub)
- 100+ connectors
- Specialized loaders for enterprise sources
- Community-contributed loaders
Unstructured
- Unified API for multiple formats
- Advanced parsing for complex layouts
- Pre-processing and chunking
Best Practices
- Choose Right Loader: Match loader to source format
- Preserve Metadata: Keep source tracking for citations
- Handle Errors: Implement retry logic and error handling
- Optimize for Scale: Use batch processing for large datasets
- Test Extraction Quality: Verify parsing accuracy
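The "Preserve Metadata" practice pays off at answer time: loader metadata becomes a citation. A small sketch, assuming the metadata keys from the document-structure example above (adjust for your loader's actual keys):

```python
def format_citation(doc: dict) -> str:
    # Build a human-readable citation from loader-provided metadata.
    meta = doc["metadata"]
    cite = meta.get("source", "unknown source")
    if "page" in meta:
        cite += f", p. {meta['page']}"
    return cite

doc = {"page_content": "...", "metadata": {"source": "file.pdf", "page": 1}}
```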
Common Challenges
- PDF table extraction
- Scanned document OCR
- Multi-column layouts
- Nested document structures
- Large file handling
Pricing
Most loaders are free and open-source. Commercial options:
- Unstructured.io API: Paid plans
- LlamaParse: Usage-based pricing
- Custom enterprise connectors: Varies
Information
Website: docs.langchain.com
Published: Mar 15, 2026