In the ever-evolving world of Generative AI (Gen AI), Retrieval-Augmented Generation (RAG) has emerged as a crucial technique. RAG enables users to enhance large language models (LLMs) with domain-specific information—like documents, PDFs, transcripts, and more—allowing the models to provide contextually relevant responses without needing to train them on these documents directly. This method is gaining immense popularity and continues to grow as a significant area of interest.
To create an effective RAG solution, these documents or relevant text need to be stored in a specialised database in vectorised form. This ensures that pertinent information can be quickly searched and retrieved, providing the LLM with sufficient context to respond accurately to a query. These documents can be substantial: imagine an entire book stored in the database. You'd likely want to retrieve information about just a single chapter or paragraph rather than the whole book, since pulling the entire book in one go would be highly inefficient and token-costly.
This is where chunking comes into play. By partitioning documents into smaller, manageable 'chunks,' we can ensure that only the relevant portions are retrieved and passed to the LLM. This not only optimises the use of tokens but also maintains the focus on the specific information needed for the query. Chunking strategies have thus become vital for any RAG solution, as the accuracy of the responses heavily depends on how the information is chunked. Factors like chunk size and the meaningfulness of each chunk are essential considerations.
In this blog, we'll explore various chunking strategies, shedding light on their advantages, drawbacks, and best-fit scenarios. Whether you're dealing with simple logs, structured academic papers, or intricate technical documents, understanding these strategies will help you implement a more effective and efficient RAG solution. So, let's dive in and discover the best ways to chunk your data for optimal performance.
LangChain
Using LangChain for chunking offers several advantages: it is designed to work effectively with large language models and provides flexible, efficient chunking strategies. LangChain assists in splitting large texts into chunks that fit within a model's context window, such as GPT-4o's 128k-token limit, ensuring effective processing without input truncation or failure. It supports splitting at logical boundaries like sentences or paragraphs and allows overlapping chunks, maintaining continuity between chunks and reducing the loss of context, which improves coherence in tasks such as summarisation or question answering.
LangChain integrates smoothly with document loaders and vector-based retrieval systems, enabling practical workflows for semantic search and knowledge management. Proper chunking directly impacts retrieval performance in retrieval-augmented generation systems by defining the unit of information stored and queried, facilitating optimised storage, latency reduction, and response quality for large-scale language model applications. LangChain can also split text with an awareness of document structure, adding metadata to chunks based on headers or other logical elements, which improves semantic understanding and retrieval.
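As a minimal sketch of that structure-aware splitting, the snippet below uses LangChain's MarkdownHeaderTextSplitter (assuming your source text is Markdown and the langchain-text-splitters package is installed) to produce chunks whose metadata records the headers they sit under. The sample text and metadata keys are illustrative only.

```python
from langchain_text_splitters import MarkdownHeaderTextSplitter

# Hypothetical Markdown source; in practice this would come from a document loader.
markdown_text = """
# Chunking Strategies
## Fixed-Size Chunking
Fixed-size chunking splits text into equal-sized pieces.
## Semantic Chunking
Semantic chunking splits text at logical boundaries.
"""

# Split on level-1 and level-2 headers; each header becomes a metadata field.
splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "title"), ("##", "section")]
)

chunks = splitter.split_text(markdown_text)
for chunk in chunks:
    # Each chunk is a Document whose metadata records where it sits in the structure.
    print(chunk.metadata, chunk.page_content[:60])
```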
Fixed-Size Chunking
Concept:
Fixed-size chunking is a straightforward method for breaking down text into equal-sized pieces based on character, token, or word counts. This approach is simple and efficient, making it easier to manage large amounts of data without complex content analysis. One of the main advantages is consistency; predictable chunk sizes help maintain uniformity, which is particularly beneficial for embedding and language models that perform better with consistent input lengths. This method also requires less processing power, making it ideal for large-scale systems where speed and resource optimisation are essential.
When dealing with large documents, fixed-size chunking speeds up processing and retrieval. Introducing a bit of overlap between chunks ensures the continuity of ideas. For example, if a chunk ends mid-sentence, the overlap allows the next chunk to pick up the sentence, maintaining context. This method is particularly effective for structured documents like manuals or datasets, where uniform chunks don't disrupt the flow of information. It also helps manage context window limits in language models, preventing important information from being lost.
Cost optimisation is another significant benefit. Fixed-size chunking reduces computational overhead, helping to control operational costs when working with large datasets. Adding slight overlaps or context can address issues like the "Lost in the Middle" problem, enhancing result quality without high computational costs. While it might sometimes break the natural flow of language compared to semantic chunking, fixed-size chunking remains a practical and resource-friendly approach suitable for many real-world scenarios.
Implementation:
- Loading and Preparing the PDF Document:
- Load the PDF file and extract its contents into a continuous string, preserving the structure with double newlines between pages.
- Configuration of the Character Text Splitter:
- Define parameters for chunk size (e.g., 2,048 characters) and overlap (e.g., 20%). Split text at double newline characters to maintain paragraph integrity, as sketched in the example below.
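A minimal sketch of these two steps, assuming the langchain-community and pypdf packages are installed and that "example.pdf" is a placeholder path:

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import CharacterTextSplitter

# Load the PDF and join the pages into one string, separated by double newlines.
pages = PyPDFLoader("example.pdf").load()  # "example.pdf" is a placeholder path
full_text = "\n\n".join(page.page_content for page in pages)

# Fixed-size chunks of ~2,048 characters with a ~20% overlap (~410 characters),
# splitting on double newlines to keep paragraphs intact where possible.
splitter = CharacterTextSplitter(
    separator="\n\n",
    chunk_size=2048,
    chunk_overlap=410,
)

chunks = splitter.split_text(full_text)
print(f"{len(chunks)} chunks, first chunk length: {len(chunks[0])}")
```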
Advantages:
- Straightforward and easy to implement.
- Uniform chunk size simplifies batch operations.
Drawbacks:
- May cut off sentences or paragraphs abruptly.
- Ignores natural semantic breaks.
Best Fit:
- Simple logs or straightforward text with consistent formatting.
Semantic Chunking
Concept:
Semantic chunking is a technique that involves breaking down documents at logical points, such as sentences, paragraphs, or sections. Think of it as chopping up a long, winding story into neat, digestible segments. This approach does wonders for preserving the flow of ideas and keeping related concepts snugly together.
One of the standout advantages of semantic chunking is how it boosts retrieval accuracy. By slicing documents at sensible places, like the end of a sentence or a paragraph, you ensure that each chunk contains complete thoughts and coherent ideas. This is especially helpful for documents like articles or academic papers, which naturally have clear sections. When a RAG model goes hunting for relevant information, it pulls back chunks that are rich in context and meaning. This not only makes the retrieved information more accurate but also ensures it’s more useful and relevant to the query at hand.
Implementation:
- Configuration of the Recursive Character Text Splitter:
- Use a hierarchy of separators (e.g., double newlines, single newlines, full stops, spaces) to segment text at the most natural breakpoints.
- Conversion to Document Objects with Enhanced Metadata:
- Transform each chunk into a Document object, adding metadata such as section titles, chunk identifiers, and semantic density (see the sketch after this list).
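A rough sketch of both steps, assuming plain-text input; the section title is hypothetical and "semantic density" is illustrated naively as the ratio of unique words to total words:

```python
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

text = "..."  # placeholder for your document text

# Try the most natural breakpoints first: paragraphs, then lines, sentences, words.
splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", ". ", " "],
    chunk_size=1024,
    chunk_overlap=128,
)

docs = []
for i, chunk in enumerate(splitter.split_text(text)):
    # Naive stand-in for "semantic density": unique words divided by total words.
    words = chunk.split()
    density = len(set(words)) / max(len(words), 1)
    docs.append(
        Document(
            page_content=chunk,
            metadata={"section": "Introduction", "chunk_id": i, "semantic_density": density},
        )
    )
```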
Advantages:
- Preserves the flow of ideas.
- Boosts retrieval accuracy for related concepts.
Drawbacks:
- More complex to implement.
- Variable chunk sizes.
Best Fit:
- Well-structured documents like articles or academic papers.
Recursive Chunking
Concept:
Recursive chunking is a technique that employs a hierarchy of separators, starting with the big ones and moving to finer ones if the chunks are still too large. Imagine it as a top-down approach to dividing your text, where you try to break it up at the most significant points first, and then progressively smaller points.
One of the primary advantages of recursive chunking is that it creates more context-aware splits compared to simple fixed-size approaches. By using high-level separators first, such as sections or chapters, and then moving down to paragraphs or sentences, the resulting chunks are more likely to be contextually rich and coherent. This is particularly powerful for structured text or code, where maintaining the integrity of blocks is crucial. For instance, in a Python code repository, using "def" or "class" as separators ensures that the chunks retain meaningful structures, making it easier for RAG models to retrieve and generate accurate, contextually appropriate responses.
Implementation:
- Language-Specific Code Splitting and Metadata Enhancement:
- Select an appropriate splitting strategy based on the detected programming language, and annotate chunks with detailed metadata, as shown in the example below.
- Extract Functions and Classes:
- Identify functions, classes, and imports within code chunks for better metadata.
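A minimal sketch, assuming Python source code and using LangChain's built-in language-aware splitter; the function/class/import extraction is a simple regex illustration rather than a full parser:

```python
import re

from langchain_text_splitters import Language, RecursiveCharacterTextSplitter

# Hypothetical Python source; in practice this would come from a repository loader.
python_code = '''
import math

def area(radius):
    return math.pi * radius ** 2

class Circle:
    def __init__(self, radius):
        self.radius = radius
'''

# Split along Python-aware separators such as class and function definitions.
splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=200,
    chunk_overlap=0,
)

docs = splitter.create_documents([python_code])
for doc in docs:
    # Annotate each chunk with the top-level functions, classes, and imports it contains.
    doc.metadata["functions"] = re.findall(r"^def (\w+)", doc.page_content, re.MULTILINE)
    doc.metadata["classes"] = re.findall(r"^class (\w+)", doc.page_content, re.MULTILINE)
    doc.metadata["imports"] = re.findall(r"^import (\w+)", doc.page_content, re.MULTILINE)
    print(doc.metadata)
```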
Advantages:
- Context-aware splits for structured text or code.
- Maintains coherence in technical documents.
Drawbacks:
- Requires domain-specific separators.
- More complex configuration.
Best Fit:
- Technical documents, especially code repositories.
Context-Enriched Chunking
Concept:
Context-enriched methods attach extra packets of information, such as summaries, to each chunk's metadata. Think of it as equipping each segment with a mini cheat sheet that helps maintain coherence and context.
One of the biggest advantages of context-enriched methods is how they help maintain coherence across different parts of a document. When each chunk comes with its own set of metadata or a summary, it ensures that the flow of ideas remains intact, even when the document is broken down into smaller pieces. This can significantly boost retrieval performance, especially in queries that span multiple segments. For example, in multi-chapter reports or interconnected research papers, understanding the interplay between sections is crucial. By attaching extra context to each chunk, the RAG models can retrieve and generate responses that are both accurate and contextually relevant.
Implementation:
- Initialising the Databricks Model and Configuring the Text Splitter:
- Utilise a Databricks-hosted LLM (or, alternatively, an OpenAI model) alongside a recursive character text splitter to segment the text into consistent chunks.
- Create Base Chunks and Set Up the Summarisation Chain:
- Split the document into base chunks using a text splitter, then create a summarisation chain that uses a prompt template and an LLM chain to generate a brief summary for each chunk, combining the summarised documents into a cohesive output (see the sketch after this list).
- Processing Chunks with Contextual Windows:
- Incorporate contextual information from neighbouring segments, enhancing each chunk with additional metadata.
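A rough sketch of the idea, assuming an OpenAI chat model (a Databricks-hosted endpoint could be substituted) and that the langchain-openai package is installed; each chunk's own summary and pointers to its neighbours' summaries are stored in the chunk's metadata:

```python
from langchain_core.documents import Document
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain_text_splitters import RecursiveCharacterTextSplitter

text = "..."  # placeholder for your document text

# 1. Create base chunks with a recursive character text splitter.
splitter = RecursiveCharacterTextSplitter(chunk_size=1500, chunk_overlap=150)
base_chunks = splitter.split_text(text)

# 2. Set up a simple summarisation chain: prompt -> LLM -> string output.
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
prompt = ChatPromptTemplate.from_template(
    "Summarise the following passage in one or two sentences:\n\n{chunk}"
)
summarise = prompt | llm | StrOutputParser()

# 3. Enrich each chunk with its own summary plus the summaries of its neighbours.
summaries = [summarise.invoke({"chunk": chunk}) for chunk in base_chunks]
docs = []
for i, chunk in enumerate(base_chunks):
    metadata = {
        "summary": summaries[i],
        "previous_summary": summaries[i - 1] if i > 0 else None,
        "next_summary": summaries[i + 1] if i < len(base_chunks) - 1 else None,
    }
    docs.append(Document(page_content=chunk, metadata=metadata))
```

The extra summaries increase storage and preprocessing cost, which is exactly the trade-off listed under the drawbacks below.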
Advantages:
- Maintains coherence across document sections.
- Enhances retrieval performance for interconnected content.
Drawbacks:
- Increases storage and memory requirements.
- Adds complexity to preprocessing.
Best Fit:
- Documents requiring understanding of interplay between sections, such as multi-chapter reports.
Conclusion
In conclusion, the choice of chunking strategy depends on the document type, the nature of the content, and the specific requirements of your RAG solution. Whether you opt for the simplicity of fixed-size chunking, the coherence of semantic chunking, the structure of recursive chunking, or the depth of context-enriched chunking, each approach offers unique advantages to enhance the efficiency and accuracy of information retrieval.
If you're looking to put these strategies into practice, whether that's optimising your existing RAG setup or exploring how Gen AI can be tailored to your organisation, our team at Advancing Analytics can help. Get in touch with us today to discuss how we can support your journey in building smarter, context-driven AI solutions.
Author
Luke Menzies