This is a demo directory website built with Ever Works
PyPDF2
A pure Python PDF library for extracting text, metadata, and other content from PDF documents, commonly used in data preprocessing pipelines for vector database applications involving research papers and technical documentation.
Sycamore
An open-source, LLM-powered document processing engine for ETL, RAG, and analytics on unstructured data. Features a DocSet abstraction similar to Apache Spark, with a reported 6x more accurate data chunking and 2x improved recall for hybrid search.
Unstructured
Open-source library for preprocessing unstructured documents (PDFs, Word, HTML, images) for RAG and LLM applications. Handles extraction, chunking, and cleaning of diverse document types.
LlamaParse
Advanced document parsing service from LlamaIndex for extracting structured data from PDFs, PowerPoints, and Word documents. Uses LLMs to understand document structure and maintain layout information.
Document Parsing for RAG
Critical preprocessing step for RAG systems: extracting text, tables, and images from various document formats (PDF, DOCX, HTML) using tools like Unstructured, LlamaParse, and PyPDF.
Document Loaders
Components in LLM frameworks that fetch and parse data from various sources (PDFs, websites, databases) into a standardized format for processing. Essential first step in RAG pipelines for converting raw data into processable documents.
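As a rough illustration of what "a standardized format" can mean, the sketch below reads plain-text files into a uniform record of text plus source metadata. The `Document` class and the `load_text_files` function are hypothetical names used only for this example; they are not the API of any particular framework, and real loaders also handle PDFs, websites, and databases.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Dict, List


@dataclass
class Document:
    """Hypothetical standardized record: raw text plus source metadata."""
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


def load_text_files(directory: str, pattern: str = "*.txt") -> List[Document]:
    """Read every file in `directory` matching `pattern` into a Document.

    Files are visited in sorted order so the result is deterministic.
    """
    docs = []
    for path in sorted(Path(directory).glob(pattern)):
        docs.append(
            Document(
                text=path.read_text(encoding="utf-8"),
                metadata={"source": path.name},
            )
        )
    return docs
```

Whatever the source, downstream steps such as chunking and embedding then only need to deal with this one record shape.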
Chunking Strategies for RAG
Methods for splitting documents into optimal pieces for vector embedding and retrieval. Includes fixed-size, recursive, semantic, and agentic chunking approaches.
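The simplest of these approaches, fixed-size chunking, can be sketched in a few lines. This is a minimal example under stated assumptions: sizes are counted in characters (production pipelines often count tokens instead), and the function name `chunk_fixed` and its `size` and `overlap` parameters are illustrative, not taken from any library.

```python
from typing import List


def chunk_fixed(text: str, size: int = 200, overlap: int = 20) -> List[str]:
    """Split `text` into chunks of at most `size` characters.

    Consecutive chunks share `overlap` characters so that content cut at a
    boundary still appears intact in one of the two neighbouring chunks.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")

    step = size - overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks


print(chunk_fixed("abcdefghij", size=4, overlap=1))
# prints ['abcd', 'defg', 'ghij']
```

Recursive and semantic strategies differ mainly in where they place the boundaries (paragraph or sentence breaks, or shifts in meaning) rather than at a fixed offset.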