Humans can understand language in a fraction of a second, but machines cannot. We need to convert text into a numerical form that a machine can process. Most machine learning and statistical models work with numeric data, hence we need to convert all the text into numbers.

Ok, how do we convert raw text into numbers or vectors? Many techniques are available, and the most famous ones are Bag of Words, TF-IDF and Word2Vec. In this article, we will see the working principle of Bag of Words and how to implement it in Python.

Let’s take an example.

Assume we have three sentences:

I like to play Cricket.

Cricket is an interesting sport.

Football is more popular than cricket.

If we want to analyse the above texts or build any ML model on them, we first need to convert them into numbers, as mentioned above.

The first step is to tokenize the sentences, which simply means splitting each of the three sentences into individual words. Since we have three sentences in our example, I name them sentence 1, sentence 2 and sentence 3. Such a collection of texts is also called a corpus. Each sentence is then split into its individual words (tokens).
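As a quick illustration before we use Keras later in the article, here is a minimal plain-Python sketch of this tokenization step (lowercasing and stripping punctuation so that 'Cricket.' and 'cricket' count as the same word):

```python
import re

sentences = [
    "I like to play Cricket.",
    "Cricket is an interesting sport",
    "Football is more popular than cricket.",
]

# Lowercase each sentence and split it into word tokens,
# dropping punctuation such as the trailing full stop.
tokenized = [re.findall(r"[a-z]+", s.lower()) for s in sentences]

print(tokenized[0])  # ['i', 'like', 'to', 'play', 'cricket']
```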

The second step is to find all the unique words in the corpus and build the vocabulary. We create a dictionary whose keys are the words present in our corpus and whose values are the frequencies of those words, i.e. how many times each word appears in the entire corpus.

Unique words from the corpus: 'cricket', 'is', 'i', 'like', 'to', 'play', 'an', 'interesting', 'sport', 'football', 'more', 'popular', 'than'. That is 13 words in total.
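Continuing the plain-Python sketch, the vocabulary and the word frequencies can be built with Python's Counter:

```python
from collections import Counter

# Tokenized sentences from the previous step.
tokenized = [
    ["i", "like", "to", "play", "cricket"],
    ["cricket", "is", "an", "interesting", "sport"],
    ["football", "is", "more", "popular", "than", "cricket"],
]

# Count how many times each word appears across the whole corpus.
counts = Counter(word for sentence in tokenized for word in sentence)

print(len(counts))        # 13 unique words
print(counts["cricket"])  # 3 (appears once in every sentence)
```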

Each sentence can now be represented as a vector over this vocabulary. The first vector represents sentence 1: the words 'i', 'like', 'to', 'play' and 'cricket' each appear once, so we put 1 in their positions, while 'is', 'an', 'interesting', 'sport', 'football', 'more', 'popular' and 'than' do not appear, so we put 0. The same rule applies to sentences 2 and 3: in sentence 2 the words 'cricket', 'is', 'an', 'interesting' and 'sport' appear, and in sentence 3 the words 'cricket', 'is', 'football', 'more', 'popular' and 'than' appear; those positions get 1 and all other positions get 0.

The vectors for the three sentences are:

Sentence 1 = [1,0,1,1,1,1,0,0,0,0,0,0,0]

Sentence 2 = [1,1,0,0,0,0,1,1,1,0,0,0,0]

Sentence 3 = [1,1,0,0,0,0,0,0,0,1,1,1,1]

This is the idea behind the Bag of Words model. The above vectors can be used as input to an ML or statistical model.
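To make the idea concrete, here is a minimal plain-Python sketch that rebuilds the three vectors above; the vocabulary order follows the list of unique words given earlier:

```python
vocab = ['cricket', 'is', 'i', 'like', 'to', 'play', 'an',
         'interesting', 'sport', 'football', 'more', 'popular', 'than']

tokenized = [
    ["i", "like", "to", "play", "cricket"],
    ["cricket", "is", "an", "interesting", "sport"],
    ["football", "is", "more", "popular", "than", "cricket"],
]

# Each sentence becomes a vector of word counts over the vocabulary.
vectors = [[sentence.count(word) for word in vocab] for sentence in tokenized]

print(vectors[0])  # [1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
```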

Let’s implement the Bag of Words model in Python.

I used Keras’s Tokenizer to implement the BoW model. We can use other libraries as well.

Making a list
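The original code listing did not survive here, so the following is a sketch of what this step looks like: putting the three sentences into a Python list.

```python
# The corpus: a plain Python list of the three example sentences.
sentences = [
    "I like to play Cricket.",
    "Cricket is an interesting sport",
    "Football is more popular than cricket.",
]

print(sentences)
```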


Tokenization and building vocabulary
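Again, the original listing is not reproduced here; a sketch of this step with Keras's Tokenizer looks like the following. By default the Tokenizer lowercases the text and strips punctuation, and fit_on_texts assigns every unique word an integer index (most frequent words first).

```python
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    "I like to play Cricket.",
    "Cricket is an interesting sport",
    "Football is more popular than cricket.",
]

# Build the vocabulary from the corpus.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)

print(tokenizer.word_index)
# e.g. {'cricket': 1, 'is': 2, 'i': 3, 'like': 4, 'to': 5, 'play': 6,
#       'an': 7, 'interesting': 8, 'sport': 9, 'football': 10,
#       'more': 11, 'popular': 12, 'than': 13}
```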


Converting a text into numbers by counting the words
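A sketch of this step: texts_to_matrix with mode='count' turns each sentence into its vector of word counts over the vocabulary built above.

```python
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    "I like to play Cricket.",
    "Cricket is an interesting sport",
    "Football is more popular than cricket.",
]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)

# mode='count' gives one row per sentence with the count of each word;
# column 0 is reserved by Keras and always stays zero.
matrix = tokenizer.texts_to_matrix(sentences, mode='count')

print(matrix)
# First row: [0. 1. 0. 1. 1. 1. 1. 0. 0. 0. 0. 0. 0. 0.]
```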


The above output shows the count vector for each sentence. Ignore the zero at the beginning of each vector: the Keras library reserves index zero and never assigns it to any word.

The BoW model is simple and very easy to implement. However, it ignores semantic meaning. For example, the words ‘happy’ and ‘joy’ are often used in the same context, but the model treats them as entirely different words and assigns them separate indices.

The vector size is another challenge of BoW for large documents: the vocabulary grows with the corpus, so the vectors become long and sparse, which requires a lot of computation and time.


Sources of Article

Image by nile from Pixabay 

