
Unstructured
Open-source library for preprocessing unstructured documents (PDFs, Word, HTML, images) for RAG and LLM applications. Handles extraction, chunking, and cleaning of diverse document types.
About this tool
Overview
Unstructured is an open-source library that parses raw documents into typed elements (titles, narrative text, tables, and so on) with metadata, so that RAG and LLM pipelines can chunk, embed, and index them.
Supported Formats
- PDF
- Word (DOCX)
- HTML
- Images (with OCR)
- PowerPoint
- Markdown
- Email (MSG, EML)
- CSV/Excel
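A minimal sketch of how files in the formats above might be loaded, assuming the open-source package is installed (`pip install unstructured`, plus format-specific extras for some file types). It relies on the library's auto-detecting `partition` function from `unstructured.partition.auto`; the helper names `partition_file` and `count_categories` are hypothetical and exist only for this illustration.

```python
from collections import Counter


def partition_file(path):
    """Partition a document into elements (hypothetical convenience wrapper)."""
    # Imported lazily so count_categories() stays usable without the library.
    from unstructured.partition.auto import partition

    # partition() inspects the file type and dispatches to a format-specific parser.
    return partition(filename=path)


def count_categories(elements):
    """Count elements by their category, e.g. Title, NarrativeText, Table."""
    return Counter(getattr(el, "category", "Unknown") for el in elements)


# Example usage (the path is a placeholder):
#   elements = partition_file("report.docx")
#   print(count_categories(elements))
```

Counting categories is a quick sanity check that a given file type was parsed into the kinds of elements you expect before building anything on top of it.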
Features
Extraction:
- Text extraction
- Table detection
- Image handling
- Layout analysis
Processing:
- Semantic chunking
- Metadata extraction
- Element classification
- Cleaning and normalization
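To make the chunking and cleaning steps above concrete, here is a simplified, standard-library illustration of title-based chunking with whitespace normalization. It is not the library's implementation: the function names and the `(category, text)` input shape are assumptions made for this sketch. The library itself ships its own helpers for this, such as `chunk_by_title` in `unstructured.chunking.title` and the cleaners in `unstructured.cleaners.core`.

```python
import re


def normalize_whitespace(text):
    """Collapse runs of whitespace into single spaces and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def chunk_by_heading(elements, max_characters=500):
    """Group (category, text) pairs into chunks that start at each Title.

    A chunk is also closed when adding the next element would exceed
    max_characters. A single oversized element is kept whole, not split.
    """
    chunks, current = [], ""
    for category, text in elements:
        text = normalize_whitespace(text)
        starts_section = category == "Title"
        too_long = len(current) + len(text) + 1 > max_characters
        if current and (starts_section or too_long):
            chunks.append(current)
            current = ""
        current = f"{current}\n{text}" if current else text
    if current:
        chunks.append(current)
    return chunks
```

Breaking at headings keeps each chunk on one topic, which tends to work better for retrieval than cutting at a fixed character count.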
Use Cases
- RAG document ingestion
- Knowledge base building
- Document search indexing
- Data pipeline preprocessing
Integration
- LangChain
- LlamaIndex
- Haystack
- Custom pipelines
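For a custom pipeline, one possible approach is to flatten parsed elements into plain records that any vector store or search index can ingest. The sketch below is duck-typed: it assumes each element exposes `text`, `category`, and a `metadata` object with `filename` and `page_number`, as Unstructured elements do. The function name `to_records` and the record layout are hypothetical.

```python
def to_records(elements, min_chars=1):
    """Convert parsed elements into plain dicts ready for indexing."""
    records = []
    for index, el in enumerate(elements):
        text = (getattr(el, "text", "") or "").strip()
        if len(text) < min_chars:
            continue  # skip empty or whitespace-only elements
        meta = getattr(el, "metadata", None)
        records.append(
            {
                "id": index,
                "text": text,
                "category": getattr(el, "category", None),
                "source": getattr(meta, "filename", None),
                "page": getattr(meta, "page_number", None),
            }
        )
    return records
```

Keeping the source file and page number on every record lets a RAG application cite where a retrieved passage came from.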
Availability
- Open-source Python library
- Managed API service
Information
Website: unstructured.io
Published: Mar 20, 2026