
Contextual Compression
A RAG optimization technique that compresses retrieved documents by extracting only the most relevant portions relative to the query. Reduces token usage and improves LLM response quality by removing irrelevant context.
Overview
Contextual Compression is a technique that improves RAG by compressing retrieved documents, extracting only the parts most relevant to the user's query. This reduces context length, lowers costs, and often improves answer quality.
The Problem
Standard RAG passes full document chunks to the LLM. These chunks:
- Often contain irrelevant information
- Use unnecessary tokens
- Distract the LLM from the relevant content
- Increase costs
- Slow processing
How It Works
- Retrieve: Get relevant chunks via vector search
- Compress: Extract query-relevant portions from each chunk
- Context: Send only the compressed content to the LLM, as sketched below
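A minimal, framework-free sketch of the three steps; here retriever, compressor, and llm are hypothetical callables standing in for your own components.

def compressed_rag(query, retriever, compressor, llm):
    chunks = retriever(query)                               # 1. Retrieve via vector search
    parts = [compressor(chunk, query) for chunk in chunks]  # 2. Compress each chunk
    context = "\n\n".join(p for p in parts if p)            # drop chunks compressed to nothing
    # 3. Send only the compressed context to the LLM
    return llm(f"Context:\n{context}\n\nQuestion: {query}")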
Compression Techniques
Extractive Compression
- Extract sentences/paragraphs relevant to query
- Preserve original text
- Simple and interpretable, as in the sketch below
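A minimal extractive sketch, assuming word-overlap scoring and a naive sentence splitter; a real system would use better tokenization and scoring, but the shape is the same.

import re

def extractive_compress(text, query, keep=3):
    # Naive sentence split on punctuation; an assumption, not a robust tokenizer.
    sentences = re.split(r"(?<=[.!?])\s+", text)
    query_words = set(query.lower().split())
    # Score each sentence by word overlap with the query.
    scored = [(len(query_words & set(s.lower().split())), i, s)
              for i, s in enumerate(sentences)]
    # Keep the top-scoring sentences, restored to document order.
    top = sorted(sorted(scored, reverse=True)[:keep], key=lambda t: t[1])
    return " ".join(s for score, _, s in top if score > 0)

Note that the surviving sentences are passed through verbatim, which is what makes this approach interpretable.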
LLM-Based Compression
- Use small LLM to summarize/extract
- More sophisticated understanding
- Higher quality but slower; a prompt-based sketch follows
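A prompt-based sketch: ask a small model to extract only the relevant sentences. The complete argument stands in for whatever LLM client you use (any callable from prompt string to completion string), and the NO_OUTPUT sentinel is an assumed convention, not a fixed API.

EXTRACT_PROMPT = """Return, verbatim, only the sentences from the document
that help answer the question. If nothing is relevant, return NO_OUTPUT.

Question: {query}

Document: {doc}"""

def llm_compress(doc, query, complete):
    # `complete` is any prompt-in, text-out LLM call (an assumption here).
    result = complete(EXTRACT_PROMPT.format(query=query, doc=doc)).strip()
    return "" if result == "NO_OUTPUT" else result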
Embedding-Based Filtering
- Compare sentence embeddings to query
- Remove low-similarity sentences
- Fast and effective; see the filter sketch below
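A cosine-similarity sketch with NumPy; embed is a placeholder for any sentence-embedding model, and the 0.5 threshold is an arbitrary value you would tune.

import numpy as np

def embedding_filter(sentences, query, embed, threshold=0.5):
    # `embed` maps a list of strings to an (n, d) array of vectors.
    vecs = np.asarray(embed(sentences + [query]), dtype=float)
    sent_vecs, query_vec = vecs[:-1], vecs[-1]
    # Cosine similarity of each sentence against the query.
    sims = sent_vecs @ query_vec / (
        np.linalg.norm(sent_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9)
    return [s for s, sim in zip(sentences, sims) if sim >= threshold]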
Benefits
- Reduced token usage (30-70% savings; see the worked example below)
- Lower API costs
- Faster LLM processing
- Better focus on relevant information
- Improved answer quality
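For a rough sense of scale (the price here is an assumption): at $3 per million input tokens, compressing a 4,000-token context to 1,600 tokens (a 60% saving) cuts per-query input cost from $0.012 to $0.0048.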
Implementation
from langchain.retrievers import ContextualCompressionRetriever
from langchain.retrievers.document_compressors import LLMChainExtractor

# `llm` is any LangChain-compatible model; `vector_retriever` is a base
# retriever, e.g. vectorstore.as_retriever().
compressor = LLMChainExtractor.from_llm(llm)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=compressor,
    base_retriever=vector_retriever,
)
# Returns documents with only the query-relevant text retained.
# (Newer LangChain versions call compression_retriever.invoke(query) instead.)
compressed_docs = compression_retriever.get_relevant_documents(query)
Trade-offs
Advantages:
- Cost reduction
- Quality improvement
- Faster responses
Disadvantages:
- Additional compression latency
- Potential information loss
- More complex pipeline
Use Cases
- Long documents with sparse relevant content
- Cost-sensitive applications
- Latency-sensitive systems, where shorter prompts speed up generation
- When retrieved chunks are large
Pricing
Implementation-dependent. Compression may add a small latency overhead, but typically reduces overall LLM costs.