In our previous article, we covered Bag of Words. This post introduces another popular technique called TF-IDF and shows how to implement it in Python.

TF-IDF is an abbreviation for Term Frequency-Inverse Document Frequency, and it is one of the most widely used algorithms for converting text into numeric vectors. The technique is a common way to extract features across various NLP applications.

Before implementation, let’s check the theory behind TF-IDF.

Take the two sentences below:

Sentence 1 = “I like to be an Engineer”

Sentence 2 = “I like to play Cricket”

In order to compute TF-IDF, first we need to tokenize the sentences:

Tokens of Sentence 1 = [“i”, “like”, “to”, “be”, “an”, “engineer”]

Tokens of Sentence 2 = [“i”, “like”, “to”, “play”, “cricket”]
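This tokenization can be sketched in Python with a minimal approach, assuming we simply lowercase and split on whitespace:

```python
# Tokenize each sentence by lowercasing and splitting on whitespace
sentences = ["I like to be an Engineer", "I like to play Cricket"]
tokens = [s.lower().split() for s in sentences]

print(tokens[0])  # ['i', 'like', 'to', 'be', 'an', 'engineer']
print(tokens[1])  # ['i', 'like', 'to', 'play', 'cricket']
```

Real tokenizers handle punctuation and other edge cases, but whitespace splitting is enough for these two sentences.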

The next step is finding the TF-IDF values.

Term Frequency (TF): Mathematically, tf can be expressed as:

Tf = (frequency of the word in the sentence) / (total number of words in the sentence)

For example, let’s take the word “I” in Sentence 1. The word “I” appears only once, and the total number of words in Sentence 1 is 6, so its tf value is 1/6 ≈ 0.17.
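The tf calculation can be checked with a few lines of Python, a from-scratch sketch that counts tokens directly:

```python
# tf = (occurrences of the word in the sentence) / (total words in the sentence)
sentence1 = "I like to be an Engineer".lower().split()

def tf(word, tokens):
    return tokens.count(word) / len(tokens)

print(round(tf("i", sentence1), 2))  # 0.17 -> "i" occurs once among 6 words
```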

Inverse Document Frequency (IDF)

IDF computes the weight of words based on how rare they are across all sentences in the corpus: a word that appears in only a few sentences gets a high IDF score, while a word that appears in every sentence gets an IDF of zero.

Mathematically, IDF can be expressed as:

Idf = log((total number of sentences) / (number of sentences in which the word appears))

Now let’s calculate the idf value for the word “I”. The total number of sentences here is 2, and “I” appears in both of them.

So the idf value for “I” = log(2/2) = 0, which means “I” carries no distinguishing weight. By contrast, a word like “Engineer” appears only in Sentence 1, so its idf value = log(2/1) ≈ 0.30.
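Likewise, the idf values can be verified in Python (using log base 10, matching the hand calculation above):

```python
import math

sentences = [
    "I like to be an Engineer".lower().split(),
    "I like to play Cricket".lower().split(),
]

def idf(word, docs):
    # Count the sentences that contain the word
    containing = sum(1 for d in docs if word in d)
    return math.log10(len(docs) / containing)

print(round(idf("i", sentences), 2))         # 0.0 -> "i" appears in both sentences
print(round(idf("engineer", sentences), 2))  # 0.3 -> only in Sentence 1
```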

Calculate TF-IDF value

The TF-IDF value is simply the product of the two: tf-idf = tf × idf.

The Tf-Idf value for the word “I” is therefore 0.17 × 0 = 0, while for a sentence-specific word like “Engineer” it is 0.17 × 0.30 ≈ 0.05.

Calculating the Tf-Idf values for the entire sentences

From the above calculations, we can see that the Tf-Idf values are zero for the words that appear in both Sentence 1 and Sentence 2 (“I”, “like”, “to”), which means they are not significant. However, the Tf-Idf values for words like “be”, “an”, “engineer”, “play”, and “cricket” are non-zero, which means they are significant.
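Putting both pieces together, a from-scratch sketch can compute the TF-IDF value of every word in each sentence and confirm the pattern above. (Note that scikit-learn, used below, applies a smoothed natural-log variant of idf plus normalization, so its numbers will differ from this hand calculation.)

```python
import math

sentences = [
    "I like to be an Engineer".lower().split(),
    "I like to play Cricket".lower().split(),
]
vocabulary = sorted(set(word for s in sentences for word in s))

def tfidf(word, tokens, docs):
    # tf: relative frequency of the word in this sentence
    tf = tokens.count(word) / len(tokens)
    # idf: log10 of (total sentences / sentences containing the word)
    containing = sum(1 for d in docs if word in d)
    return tf * math.log10(len(docs) / containing)

# Shared words ("i", "like", "to") score 0.0 in both sentences;
# sentence-specific words get small non-zero scores
for word in vocabulary:
    scores = [round(tfidf(word, s, sentences), 3) for s in sentences]
    print(word, scores)
```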

Let’s see how we can implement TF-IDF in Python using the sklearn library.

# Importing the libraries
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

Let’s take a different example this time:

# assigning text
text = ["India is the second most populous country in the world",
        "It is the seventh-largest country by land area",
        "It is most populous democracy in the world"]

Initialize the vectorizer, then call fit_transform to calculate the TF-IDF values for the text:

# Initialize the vectorizer and fit it to the text
tfIdfVectorizer = TfidfVectorizer(use_idf=True)
tfIdf = tfIdfVectorizer.fit_transform(text)

Get the feature names, convert the result into a dataframe, and print the tf-idf values for the first sentence:

# Converting into a dataframe and printing the tf-idf values
df = pd.DataFrame(tfIdf[0].T.todense(),
                  index=tfIdfVectorizer.get_feature_names_out(),
                  columns=["TF-IDF"])
df = df.sort_values('TF-IDF', ascending=False)
print(df)

Output: a single-column dataframe of TF-IDF scores for the vocabulary, sorted in descending order for the first sentence.
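As an aside, scikit-learn also provides TfidfTransformer, which applies the idf weighting to a raw count matrix produced by CountVectorizer. A sketch of that two-step route, which yields the same weighting as TfidfVectorizer:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

text = ["India is the second most populous country in the world",
        "It is the seventh-largest country by land area",
        "It is most populous democracy in the world"]

# CountVectorizer builds the raw term-count matrix;
# TfidfTransformer then rescales those counts by idf
counts = CountVectorizer().fit_transform(text)
tfidf_matrix = TfidfTransformer(use_idf=True).fit_transform(counts)

print(tfidf_matrix.shape)  # (number of sentences, vocabulary size)
```

The two-step route is handy when you already have a count matrix, or when you want to reuse the same counts with different weighting schemes.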

The above is a simple introduction to TF-IDF. Thanks for reading.


