



An advanced chunking technique for long-context embeddings: the document is embedded as a whole first and then chunked, which preserves contextual information and improves retrieval quality, especially for technical documents.
Late chunking is a technique where you embed the entire document first with a long-context embedding model and then chunk the resulting contextualized token representations, rather than chunking the text first and embedding each chunk independently.
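The difference between the two pipelines can be shown with a toy sketch. Everything here is illustrative: the "model" is a stub that mixes each token's value with its neighbours, standing in for a real contextual embedding model.

```python
def contextual_token_vectors(tokens):
    # Stand-in for a long-context model: each token's "embedding" (a
    # single number here) averages up to two neighbours on each side,
    # so a token's representation depends on its surrounding context.
    vals = [float(len(t)) for t in tokens]
    out = []
    for i in range(len(vals)):
        window = vals[max(0, i - 2): i + 3]
        out.append(sum(window) / len(window))
    return out

def mean_pool(vecs):
    return sum(vecs) / len(vecs)

doc = "late chunking embeds the whole document before splitting".split()
size = 4  # tokens per chunk

# Standard pipeline: split first, then embed each chunk in isolation.
standard = [
    mean_pool(contextual_token_vectors(doc[i:i + size]))
    for i in range(0, len(doc), size)
]

# Late chunking: embed the whole document once, then pool per chunk.
token_vecs = contextual_token_vectors(doc)
late = [
    mean_pool(token_vecs[i:i + size])
    for i in range(0, len(doc), size)
]

# Chunk vectors differ near boundaries: late chunking saw cross-chunk context.
assert standard != late
```

The two pipelines produce different chunk vectors because tokens near chunk boundaries see neighbours from adjacent chunks only in the late-chunking pass.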
Standard Approach:
Issues:
Better Context:
Improved Retrieval:
Semantic Coherence:
Best For:
Not Needed For:
Long-Context Model:
Implementation Support:
# Late chunking sketch via Hugging Face transformers, shown with
# jina-embeddings-v3; any model exposing token-level hidden states
# works the same way. full_document, chunk_texts, and store() are
# assumed to be defined elsewhere.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "jinaai/jina-embeddings-v3"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)

# 1. Get token embeddings for the full doc in one forward pass
inputs = tokenizer(full_document, return_tensors="pt")
with torch.no_grad():
    token_embeddings = model(**inputs).last_hidden_state[0]  # (seq_len, dim)

# 2. Define chunk boundaries (token positions)
chunk_boundaries = [(0, 512), (512, 1024), ...]

# 3. Pool the contextualized token embeddings per chunk (mean pooling)
chunk_embeddings = [
    token_embeddings[start:end].mean(dim=0)
    for start, end in chunk_boundaries
]

# 4. Store each chunk's text with its context-aware embedding
for text, emb in zip(chunk_texts, chunk_embeddings):
    store(text, emb)
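Once the chunks are stored, retrieval works as with any embedding index: embed the query with the same model and rank chunks by cosine similarity. The sketch below uses hypothetical names (`search`, toy three-dimensional vectors) to show the ranking step only.

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def search(query_vec, chunks, k=2):
    # chunks: list of (text, embedding) pairs as stored in step 4
    scored = [(cosine(query_vec, emb), text) for text, emb in chunks]
    scored.sort(reverse=True)
    return [text for _, text in scored[:k]]

# Toy vectors standing in for real chunk embeddings
chunks = [
    ("intro", [1.0, 0.0, 0.0]),
    ("methods", [0.0, 1.0, 0.0]),
    ("results", [0.7, 0.7, 0.0]),
]
print(search([1.0, 0.1, 0.0], chunks, k=2))  # ['intro', 'results']
```

The only hard requirement is that query and chunk embeddings come from the same model, since late-chunked vectors live in the model's ordinary embedding space.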
Pros:
Cons:
vs Standard Chunking:
vs Sliding Window:
vs Hierarchical:
Late chunking is gaining traction as:
Compare:
Typical improvement: 3-8% better recall on technical documents.