Humans can understand language in a fraction of a second, but machines cannot. Most machine learning and statistical models work with numeric data, so we first need to convert text into a numerical form that a machine can process.
So how do we convert raw text into numbers or vectors? Many techniques are available; the best known are Bag of Words, TF-IDF and Word2Vec. In this article, we will look at the working principle of Bag of Words and how to implement it in Python.
Let's take an example. Assume we have three sentences:
I like to play Cricket.
Cricket is an interesting sport
Football is more popular than cricket.
If we want to analyse the texts above or build any ML model on them, we first need to convert them into numbers, as mentioned above.
The first step is to tokenize the sentences, i.e. split each sentence into individual words. Since we have three sentences in our example, I will call them sentence 1, sentence 2 and sentence 3. Together they form our corpus, which is simply a collection of texts. Splitting each sentence into individual words gives the table below.
The second step is to collect all the unique words from the sentences and build the vocabulary. We create a dictionary whose keys are the words in our corpus and whose values are the frequencies of those words. In other words, we count how many times each word appears in each sentence.
Unique words in the corpus: 'cricket', 'is', 'i', 'like', 'to', 'play', 'an', 'interesting', 'sport', 'football', 'more', 'popular', 'than'. That is 13 words in total.
From the table above, we can read off the vector for each sentence. The first row is the vector for sentence 1: the words 'i', 'like', 'to', 'play' and 'cricket' each appear once, so those positions get a 1, while 'an', 'interesting', 'sport', 'football', 'more', 'popular' and 'than' do not appear, so those positions get a 0. The same rule applies to sentences 2 and 3: 'cricket', 'is', 'an', 'interesting' and 'sport' appear in sentence 2, and 'cricket', 'is', 'football', 'more', 'popular' and 'than' appear in sentence 3, so those positions get a 1 and every other position gets a 0.
The vectors are:
Sentence 1 = [1,0,1,1,1,1,0,0,0,0,0,0,0]
Sentence 2 = [1,1,0,0,0,0,1,1,1,0,0,0,0]
Sentence 3 = [1,1,0,0,0,0,0,0,0,1,1,1,1]
This is the idea behind the Bag of Words model. The vectors above can be used as input for an ML or statistical model.
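The steps above can be sketched in a few lines of plain Python. This is only a minimal illustration of the counting logic (here the vocabulary is ordered by first appearance, so the column order differs from the table above):

```python
sentences = [
    "I like to play Cricket.",
    "Cricket is an interesting sport",
    "Football is more popular than cricket.",
]

# Step 1: tokenize (lowercase, strip periods, split on whitespace).
tokenized = [s.lower().replace(".", "").split() for s in sentences]

# Step 2: build the vocabulary of unique words, in order of first appearance.
vocab = []
for words in tokenized:
    for w in words:
        if w not in vocab:
            vocab.append(w)

# Step 3: one count vector per sentence.
vectors = [[words.count(w) for w in vocab] for words in tokenized]

print(vocab)      # 13 unique words
for v in vectors:
    print(v)
```

Each vector has one position per vocabulary word, holding that word's count in the sentence.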
Let's implement Bag of Words in Python.
I used the Keras tokenizer to implement the BoW model. Other libraries can be used as well.
Making a list
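A reconstruction of this step (the original code cell is not shown; the sentences are taken from the example above):

```python
# The three example sentences as a Python list.
sentences = [
    "I like to play Cricket.",
    "Cricket is an interesting sport",
    "Football is more popular than cricket.",
]
print(sentences)
```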
Tokenization and building vocabulary
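A sketch of this step with the Keras `Tokenizer` (available as `tensorflow.keras.preprocessing.text.Tokenizer` in TensorFlow 2.x; this legacy preprocessing API may not exist in newer Keras 3 releases):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    "I like to play Cricket.",
    "Cricket is an interesting sport",
    "Football is more popular than cricket.",
]

# fit_on_texts lowercases the text, strips punctuation and builds the
# vocabulary; word_index maps each word to an integer index.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)
print(tokenizer.word_index)
```

The most frequent word ('cricket', which appears in all three sentences) gets index 1; index 0 is reserved and never assigned to a word.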
Converting a text into numbers by counting the words
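A sketch of the counting step with the same legacy Keras API (`texts_to_matrix` with `mode='count'` returns one row of raw word counts per sentence):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    "I like to play Cricket.",
    "Cricket is an interesting sport",
    "Football is more popular than cricket.",
]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(sentences)

# One row per sentence, one column per vocabulary index.
# Column 0 corresponds to the reserved index 0 and is always zero.
matrix = tokenizer.texts_to_matrix(sentences, mode="count")
print(matrix)
```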
The output above shows the count vector for each sentence. Ignore the zero at the beginning of each vector: Keras reserves index zero and never assigns it to any word.
The BoW model is simple and very easy to implement. However, it ignores semantic meaning: for example, the words 'happy' and 'joy' are often used in the same context, but the model treats them as entirely different words and assigns them separate indices.
Vector size is another challenge: for a large document collection the vocabulary, and hence each vector, becomes huge, which costs a lot of memory and computation time.