



Advanced text splitting technique using embeddings to divide documents based on semantic content instead of arbitrary positions, preserving cohesive ideas within chunks for improved RAG performance.
Semantic chunking, sometimes called intelligent chunking, focuses on preserving the document's meaning and structure. Instead of using a fixed chunk size, it strategically divides the document at meaningful breakpoints—like paragraphs, sentences, or thematically linked sections.
Semantic chunking is an advanced technique that uses text embeddings to split documents based on their semantic content instead of arbitrary positions or formatting cues. Rather than slicing at fixed intervals, the algorithm looks for meaningful transitions in content and tries to preserve cohesive ideas within each chunk.
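The idea can be sketched in a few lines. This is a minimal illustration, not any framework's implementation: `embed` is a hypothetical bag-of-words stand-in for a real embedding model, and the function names, the regex sentence splitter, and the fixed 0.8 distance threshold are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def embed(sentence):
    """Toy stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z']+", sentence.lower()))


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def semantic_chunks(text, threshold=0.8):
    """Split text into sentences, then start a new chunk wherever the
    distance between adjacent sentences exceeds `threshold`."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks, current = [], [sentences[0]]
    for prev, vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine_distance(prev, vec) > threshold:
            # Large jump in meaning: close the current chunk here.
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


text = (
    "Cats purr when happy. Happy cats purr loudly. "
    "Stock markets fell today. Markets fell on weak stock earnings."
)
for chunk in semantic_chunks(text):
    print(chunk)
```

With a real embedding model the threshold would normally be derived from the distribution of distances rather than fixed by hand.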
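The breakpoint threshold can be chosen in several ways:
Percentile: splits occur where the difference between adjacent sentences exceeds a set percentile of all observed differences.
Standard deviation: splits occur where the semantic difference lies more than a set number of standard deviations above the mean, isolating major content shifts.
Interquartile: splits are chosen using the interquartile range of the differences, focusing on significant differences while ignoring minor variations.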
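The percentile, standard-deviation, and interquartile rules described above can each be expressed as a function that turns a list of adjacent-sentence distances into a cutoff. The sketch below uses only the standard library; the function names, the default amounts, and the "mean plus scaled IQR" formulation are illustrative assumptions, and a given framework may compute these differently.

```python
import statistics


def percentile_threshold(distances, percentile=95):
    """Cutoff at the given percentile (1-99) of the observed distances."""
    cuts = statistics.quantiles(distances, n=100, method="inclusive")
    return cuts[percentile - 1]


def stddev_threshold(distances, num_std=2.0):
    """Cutoff at the mean plus `num_std` population standard deviations."""
    return statistics.mean(distances) + num_std * statistics.pstdev(distances)


def iqr_threshold(distances, scale=1.5):
    """Cutoff at the mean plus `scale` times the interquartile range."""
    q1, _, q3 = statistics.quantiles(distances, n=4, method="inclusive")
    return statistics.mean(distances) + scale * (q3 - q1)


def breakpoints(distances, threshold):
    """Indices i where the gap between sentence i and i+1 exceeds the cutoff."""
    return [i for i, d in enumerate(distances) if d > threshold]


# Hypothetical distances between consecutive sentences; index 4 is a topic shift.
distances = [0.10, 0.12, 0.09, 0.11, 0.90, 0.10, 0.13, 0.10]
print(breakpoints(distances, percentile_threshold(distances)))
print(breakpoints(distances, stddev_threshold(distances)))
print(breakpoints(distances, iqr_threshold(distances)))
```

Note that on short documents a single outlier can never sit more than a few standard deviations above the mean, so a high `num_std` may yield no splits at all.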
Semantic chunking is one of the most accurate RAG chunking strategies for multi-topic documents:
Semantic chunking gives higher recall but costs more to run, as it requires embedding every sentence in your documents.
Recursive character splitting at 400-512 tokens with 10-20% overlap works well for most text content and is the recommended starting point before investing in semantic chunking.
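To make the size and overlap numbers concrete, here is a rough sketch of the windowing arithmetic only. It is a plain sliding window, not a recursive separator-based splitter, and it approximates tokens with whitespace-separated words; the function name and defaults are assumptions chosen for illustration.

```python
def sliding_window_chunks(text, chunk_size=512, overlap_ratio=0.15):
    """Cut text into windows of `chunk_size` words, each sharing
    roughly `overlap_ratio` of its words with the previous window."""
    tokens = text.split()  # crude token approximation: whitespace words
    overlap = int(chunk_size * overlap_ratio)
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap_ratio must leave a positive step size")
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(1000))
parts = sliding_window_chunks(sample, chunk_size=512, overlap_ratio=0.15)
print(len(parts), [len(p.split()) for p in parts])
```

A production splitter would count tokens with the model's own tokenizer and prefer paragraph and sentence boundaries before falling back to hard cuts.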
Chroma's research showed:
Implementations are available in several RAG frameworks, such as LangChain.