Imagine you're organising a fun event for small children with different games like football, cricket, and chess. To keep things organised, you decide to arrange the kids in rows based on the game they want to play. So, there's one row for football enthusiasts, another for cricket lovers, and so on. This makes it easy for the organisers to manage the children, count their numbers, and take them to their respective play areas. 

Now, think about what would happen if the rows weren't organised by game. Some chess players might end up on the football field, and cricket enthusiasts could be playing cricket with chess pieces! It would be chaotic and challenging for the organisers to manage the kids effectively.

This same idea applies to text chunking. But what is text chunking? In Natural Language Processing (NLP), text chunking is the method of dividing text into smaller, more manageable pieces, often called chunks. Some chunking techniques split text without considering whether the resulting pieces are related to each other. That's where semantic chunking comes in. Instead of blindly splitting the text, semantic chunking understands the meaning of sentences and groups similar sentences or paragraphs together, making the information easier to organise and retrieve. It's like ensuring that kids who love football end up in the football row, and those who enjoy chess stay with the chess group, creating order and clarity.

Why do we need chunking?

Have you ever found yourself frustrated while reading a long article or document? It happens to most of us. Imagine reading an entire book without chapters, or page after page of text without any breaks. Hard to read, right?

This is where text chunking becomes the superhero of readability. It slices a large amount of information into smaller pieces, making the text easier to read and to maintain.

When building Retrieval Augmented Generation (RAG) chatbots, we pass in tons of information in different formats, such as PDF, text, and CSV files.

A large language model cannot process that much data at once, and even if it could, it would take a long time. So, with the help of chunking, the information is broken into many smaller, manageable pieces.

By breaking down the text into meaningful chunks, the chatbot can quickly retrieve relevant pieces of information in response to user queries. It helps the LLM focus on specific parts of the information, improving its ability to respond appropriately.

CharacterTextSplitter

CharacterTextSplitter is a straightforward yet efficient tool designed to divide text without concern for the language or format of the document. Its primary function is to segment text based on a user-defined separator and a specified chunk size.

The process begins with the application identifying the chosen separator within the text; this could be a full stop ("."), a space (" "), a newline character ("\n"), or any other character the user selects. By default, LangChain uses a double newline ("\n\n").

Once the separator is determined, CharacterTextSplitter splits the text into smaller pieces at every occurrence of that separator.

After the initial split, consecutive small pieces are merged back into single chunks, ensuring that no merged chunk exceeds the chunk size limit. Note, however, that an individual piece produced by the split can itself be larger than the chunk size, since the splitter never breaks inside a piece.

By focusing solely on separators and chunk size, CharacterTextSplitter provides a universal solution that is adaptable to various text-splitting needs, maintaining simplicity and effectiveness at its core. 
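A minimal sketch of this in code (note: the exact import path varies across LangChain versions; newer releases expose the splitters through the langchain_text_splitters package):

```python
from langchain.text_splitter import CharacterTextSplitter

text = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph."

# Split on double newlines, then merge the pieces back together
# so that no merged chunk exceeds chunk_size characters.
splitter = CharacterTextSplitter(
    separator="\n\n",   # the default separator
    chunk_size=100,     # maximum size of a merged chunk
    chunk_overlap=0,    # no shared text between consecutive chunks
)

chunks = splitter.split_text(text)
for chunk in chunks:
    print(repr(chunk))
```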

But there is a good chance that text with the same semantic meaning ends up in different chunks.

RecursiveCharacterTextSplitter

RecursiveCharacterTextSplitter is an advanced version of CharacterTextSplitter. Its logic is a little less straightforward, but it produces more coherent chunks than a plain character split.

Its process also starts by scanning the text, but it takes an array of separators as a parameter and gives priority to the separators from left to right. The default list of separators in LangChain is ["\n\n", "\n", " ", ""].

So, in the first step, it splits the text into small chunks using the first separator. Then, if any chunk in the resulting list is larger than the pre-set chunk size, that chunk is split again with the next separator. This continues until every chunk is smaller than the chunk size or the list of separators is exhausted.

Then, just like CharacterTextSplitter, it merges consecutive chunks, ensuring that no merged chunk exceeds the pre-set chunk size.
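A minimal usage sketch, reusing the text from the previous example:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],  # tried in order, left to right
    chunk_size=100,
    chunk_overlap=20,  # a small overlap helps preserve context across chunks
)

chunks = splitter.split_text(text)
```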

Because of this ordering, it tries to keep paragraphs, sentences, and words together for as long as possible. But it still does not guarantee that chunks with similar meaning will be grouped together.

Semantic Chunking

So, we've discussed regular chunking, where text is split based on size, paragraphs, or separators without understanding the meaning. 

There is a good chance that texts on similar topics end up scattered across different chunks, while unrelated texts land together in the same chunk.

But what if I told you that we can create chunks that share the same semantic meaning? Is it possible? Yes! It is possible with the help of embeddings. So, let's dive deep into semantic chunking.

Process of Semantic Chunking -

This approach begins by dividing the text into individual sentences. A sentence is a basic unit of text that carries some meaning. We can do this with the help of regular expressions or any text splitter.
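For example, a naive regular-expression split on sentence-ending punctuation might look like this (a proper sentence tokenizer would be more robust):

```python
import re

text = (
    "RAG chatbots need good chunks. Semantic chunking groups related "
    "sentences. Chess, meanwhile, is a board game."
)

# Split after '.', '?' or '!' when followed by whitespace
sentences = re.split(r"(?<=[.?!])\s+", text)
```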

The next step involves pairing neighbouring sentences together. This pairing is crucial, as it reduces noise and captures relationships between sequential sentences.

How many sentences are combined depends on the buffer size. If the buffer size is 1, each sentence is combined with one preceding and one following sentence. We can change the buffer size according to our text and requirements.
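A minimal sketch of this combination step; the combine_sentences helper below is illustrative, not a library function:

```python
def combine_sentences(sentences, buffer_size=1):
    """Join each sentence with buffer_size neighbours on each side."""
    combined = []
    for i in range(len(sentences)):
        start = max(0, i - buffer_size)
        end = min(len(sentences), i + buffer_size + 1)
        combined.append(" ".join(sentences[start:end]))
    return combined

combined = combine_sentences(sentences, buffer_size=1)
```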

Following this, we reach the most important step, the one that makes semantic chunking different from other chunking techniques. Yes, you guessed correctly: embeddings. We convert these combined sentences into numerical representations called embeddings. They capture the underlying semantic meaning of the sentences, much like the emotions and understanding we pick up while reading a sentence or text ourselves.
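A sketch using LangChain's OpenAI embeddings (this assumes an OPENAI_API_KEY environment variable is set; any other embedding model, such as one from sentence-transformers, would work the same way, and in older LangChain versions the import is from langchain.embeddings):

```python
from langchain_openai import OpenAIEmbeddings

embedder = OpenAIEmbeddings()
embeddings = embedder.embed_documents(combined)  # one vector per combined sentence
```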

We then measure the similarity between these embeddings. This is done by calculating the cosine similarity between them, which essentially measures how close or far apart the meanings of the paired sentences are.

Cosine Similarity and Distance

The resulting distance shows us the level of relatedness between different sentences. A smaller distance indicates a closer relation, while a larger distance suggests the sentences are less related.
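Cosine distance is simply 1 minus cosine similarity, where the cosine similarity of two vectors a and b is (a · b) / (|a| |b|). A minimal NumPy sketch:

```python
import numpy as np

def cosine_distance(a, b):
    a, b = np.asarray(a), np.asarray(b)
    # cosine similarity = (a . b) / (|a| * |b|); distance = 1 - similarity
    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Distance between each combined sentence and the next one
distances = [
    cosine_distance(embeddings[i], embeddings[i + 1])
    for i in range(len(embeddings) - 1)
]
```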

Plotting the cosine distances of sequential sentence embeddings gives a visual representation of how closely related each sentence in the text is to the next one.
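Such a plot can be produced with matplotlib, for instance:

```python
import matplotlib.pyplot as plt

plt.plot(distances)
plt.xlabel("Position in text (sentence index)")
plt.ylabel("Cosine distance to next sentence")
plt.show()
```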

This brings us to the most crucial step: determining "chunks" of related sentences using a threshold value. This threshold can be adjusted. A lower value results in stricter, more closely related chunks, while a higher value allows less related sentences to be grouped together.

Highlighting the points where the distance crosses this threshold reveals the semantic breakpoints in the text, which are then used to create the semantic chunks.

But the data will be different every time. So, to make the threshold independent of the data, it is chosen with a percentile-based approach. For example, if we set it at the 95th percentile, any distance larger than 95% of all the measured distances is treated as a breakpoint, and the sentences between two breakpoints are merged into one chunk.
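A minimal sketch of this percentile-based breakpoint logic:

```python
import numpy as np

# Any distance above the 95th percentile marks a semantic breakpoint
threshold = np.percentile(distances, 95)
breakpoints = [i for i, d in enumerate(distances) if d > threshold]

# Merge the sentences between breakpoints into semantic chunks
chunks, start = [], 0
for bp in breakpoints:
    chunks.append(" ".join(sentences[start : bp + 1]))
    start = bp + 1
chunks.append(" ".join(sentences[start:]))
```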

In this way, we move past static chunking that splits text without understanding its context or meaning. With semantic chunking, the contents of each chunk are closely related to one another.

Conclusion -

Semantic chunking moves beyond the limitations of basic chunking methods. By prioritising meaning through techniques like embeddings and similarity measurements, this approach ensures that text segments with similar semantic meaning end up in the same chunk. This is especially helpful for retrieving information when building Retrieval Augmented Generation (RAG) chatbots.

The semantic chunking approach described here is taken from Greg Kamradt's notebook.

This blog details my learnings from my internship at Softude while working with Mradul Kanugo.

Sources of Article

LangChain, Greg Kamradt's notebook, Freepik
