

Over the last few days, I have been visiting Kaggle for many reasons. Most of the time, it has been to find relevant datasets for banking use cases to train a Conversational AI product I have been building. The product is meant to be an analytics and insights-discovery platform where users query data and unearth insights in a conversational format, instead of doing all the heavy lifting themselves (from setting up Python / R notebooks, to establishing rules, to automating the jobs / pipelines). Let me halt the Conversational AI story right here for now, though, because my agenda today is to focus the lens on the latest concern on the scene: plagiarism in academics, where teachers find it difficult, and worrying, to distinguish between LLM-generated and manually written text in their students' assignments and exams. I want to talk about it because it has long been an area of interest for me.
As I said, I have been visiting Kaggle mostly to hunt for datasets. This one time, however, I was tempted to click on Competitions after almost 6 months, and I found a very interesting one. I shall not go into the details because of the data and code licensing restrictions that apply during an active competition. (PS: if you happen to read this article while that competition is active, I’d request you to please not share or amplify the code I have included further below, because that would not be legal and would be against the regulations. I am sharing the code for knowledge purposes only. So please just read the code and refrain from sharing it. Ty :) ) To give you a high-level view: the competition is about differentiating between LLM-generated (ChatGPT) text and manually written text.
That is what prompted me to dive a bit deeper into this subject, unearth some roots and analyse.
To begin with, I am sharing two paragraphs below. Each is a short essay on “Terrorism at workplace”. One of them is LLM-generated, while the other is written by me. How easy or difficult is it for you to identify which is which? Please leave your answer in the comments.
Terrorism at the workplace is a grave concern that demands unwavering attention from employers, employees, and authorities alike. This ominous threat manifests in various forms, from physical violence to cyber-attacks, creating an atmosphere of fear and uncertainty. Employers must prioritize the safety and well-being of their workforce by implementing comprehensive security measures, conducting regular risk assessments, and providing relevant training on identifying and responding to potential threats. Promoting a culture of open communication and vigilance is paramount to fostering a resilient workforce that can collectively thwart any potential acts of terrorism. Collaboration with law enforcement agencies, the establishment of emergency response plans, and the utilization of advanced technologies for surveillance and threat detection are crucial components of a multifaceted strategy to mitigate the risk of terrorism at the workplace. By fostering a proactive and vigilant environment, organizations can contribute to creating a workplace where the safety of employees is a top priority, ensuring that the threat of terrorism is minimized, and individuals can carry out their professional duties in a secure and protected environment.
Vigilance is the need of the hour. Whether it is at the border, defending the nation from the enemies, or our households ensuring safety and well-being of our women and children, or our workplaces where in the form of work pressure and politics, there is sometimes a subtle terrorism that prevails on the floor. What may sometimes originate in the form of a normal reprimand, can take the form of a severe intimidating remark too, often going unnoticed by the people around but impacting the stability and mental peace of the person on the receiving end. Whether it is around their personal ulterior motives, or around their genuine concerns for the sub-ordinate’s substandard performance, supervisors and HODs often end up losing their cool and also using threatening measures thinking that those measures may intimidate the person on the receiving end and lead him/her to either resort to their terms or withdraw from the position. In both the cases, it becomes a fundamental situational crisis for the victim and his/her safety and mental wellbeing is compromised. It is important that the organizations enforce strict measures on the floor towards making a safe and threat-free environment for the employees.
For some of you, it might be quite easy to distinguish LLM-generated text from manually written text, thanks to your expertise in the written and spoken forms of the language and in linguistics in general. However, a layperson who isn’t very savvy in linguistics will likely fail miserably at drawing the distinction.
Even primary school teachers who have, say, given their students an assignment to write essays on various topics have a hard time curbing plagiarism. Since students all over the world now have access to the ChatGPTs of the world, there is a high likelihood that they end up using these LLMs to write their assignments.
It is not too much to believe that even an expert in linguistics sometimes finds it too difficult to differentiate between LLM-generated language and manually written (natural) language, because of the variety both an LLM and a human brain can bring to the table. If not in the near future, then maybe in the longer one, both will meet at the crossroads (meaning AI will be developed so well that it can mimic a human brain almost 100%).
That was about an expert in linguistics. Here, we are dealing with primary school English teachers who may not have the deep tech finesse to differentiate between an LLM-generated text and a manually written one; hence it is an absolute knife-edge for them, IMO.
Nevermind :p
Before we look at the code, a note for beginners: we will be looking at a classification model that differentiates LLM-generated text from manually written text.
I have used MultinomialNB and RandomForestClassifier models to do the job.
Let’s see the kind of training dataset we will be working on. This is just a dummy representation of the actual data.
The data shall have four columns, namely :-

As the experts say, it is often better to go for ensembling methods with classifiers.
To get better accuracy, I decided to train a classification model twice: once with MultinomialNB and once with RandomForestClassifier.
In both cases, the accuracy does not go beyond 70% for now when I run the code on Kaggle, which is obviously a concern, but I am tackling that separately and will cover it in a different article.
A little later in the article, you will also read about my surreal experience with an unreal 100% accuracy when implementing the MultinomialNB classifier model.
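For the curious: in this article the two models are trained separately, but combining them into an actual ensemble could look roughly like the sketch below. It is only an illustration, not the code I ran on Kaggle. The toy essays, their labels and the choice of soft voting are all assumptions made for the example.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy essays; 0 = written by a student, 1 = LLM-generated.
texts = [
    "i like my school and my friends play with me",
    "my dog is very funny and he runs fast",
    "we went to the park and i ate ice cream",
    "my mom makes good food and i help her",
    "education is a multifaceted endeavor that fosters holistic growth",
    "pets provide companionship and foster a sense of responsibility",
    "public parks are crucial components of a thriving community",
    "nutrition plays a paramount role in fostering overall well-being",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Soft voting averages the class probabilities predicted by both models.
ensemble = VotingClassifier(
    estimators=[
        ("nb", MultinomialNB()),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=42)),
    ],
    voting="soft",
)

# TF-IDF features feed the ensemble, mirroring the pipeline used in this article.
model = make_pipeline(TfidfVectorizer(), ensemble)
model.fit(texts, labels)

probabilities = model.predict_proba(["my friends and i play at school"])
print(probabilities)  # one row: [P(student), P(LLM)]
```

Soft voting needs every member to expose predict_proba, which both of these classifiers do.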
MultinomialNB : The Multinomial Naive Bayes classifier is simple and effective for text classification tasks and has long been a popular choice. It is a probabilistic model, well known for handling discrete data such as word counts. All in all, it is a versatile algorithm for applications ranging from spam detection to sentiment analysis. Today, I have employed it for a classification use case where I look at a group of essays written by primary school students and try to identify which of them are LLM-generated and which were actually written by the students themselves. I am not saying I am going to help the teachers with this! Or, am I? 😝
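To make the "probabilistic nature" a little more concrete, here is a minimal sketch that recomputes the class probabilities of a fitted MultinomialNB by hand and compares them with what scikit-learn returns. The tiny word-count matrix and its labels are made up purely for illustration.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word-count matrix: rows are documents, columns are words.
X = np.array([
    [3, 0, 1],
    [2, 1, 0],
    [0, 4, 1],
    [1, 3, 2],
])
y = np.array([0, 0, 1, 1])

clf = MultinomialNB(alpha=1.0).fit(X, y)

new_doc = np.array([[1, 2, 0]])

# Joint log-likelihood per class:
# log P(class) + sum over words of count * log P(word | class).
joint = clf.class_log_prior_ + new_doc @ clf.feature_log_prob_.T

# Normalise across classes to turn the joint scores into probabilities.
joint = joint - joint.max(axis=1, keepdims=True)
manual_proba = np.exp(joint) / np.exp(joint).sum(axis=1, keepdims=True)

print(manual_proba)
print(clf.predict_proba(new_doc))
```

Both printed rows should agree, which shows that the model is just class priors plus smoothed per-class word frequencies.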
RandomForestClassifier : Random Forest is a supervised machine learning algorithm that is very popular in the world of Data Science and Machine Learning for Classification and Regression problems.
A fun way to appreciate the merits of RandomForestClassifier is to draw an analogy with forests: a forest comprises numerous trees, and the more trees it has, the more robust it tends to be.
One of the main advantages of RandomForestClassifier is that it can handle high-dimensional and sparse datasets, which are common in text analysis, like Virat handled the pressure on Oct 23, 2023 (if you know, you know). Random forests are also popular for coping with missing values (depending on the implementation), outliers, and imbalanced classes, which can affect the performance of other algorithms.
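As a small illustration of the points above, the sketch below fits a forest directly on a sparse TF-IDF matrix. The toy essays, the 5-to-2 class imbalance and the class_weight="balanced" setting are assumptions chosen for the example, not settings taken from my Kaggle runs.

```python
from scipy.sparse import issparse
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical, slightly imbalanced toy data; 0 = student, 1 = LLM-generated.
texts = [
    "i like my school and my friends play with me",
    "my dog is very funny and he runs fast",
    "we went to the park and i ate ice cream",
    "my mom makes good food and i help her",
    "i want to be a pilot when i grow up",
    "education is a multifaceted endeavor that fosters holistic growth",
    "public parks are crucial components of a thriving community",
]
labels = [0, 0, 0, 0, 0, 1, 1]

# TF-IDF with unigrams and bigrams gives a wide, mostly empty (sparse) matrix.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(texts)
print("Sparse input:", issparse(X), "shape:", X.shape)

# class_weight="balanced" reweights classes inversely to their frequency.
forest = RandomForestClassifier(
    n_estimators=25, class_weight="balanced", random_state=42
)
forest.fit(X, labels)

# The forest is literally a collection of fitted decision trees.
print("Number of trees:", len(forest.estimators_))

new_essay = vectorizer.transform(["parks are crucial for a thriving community"])
print(forest.predict_proba(new_essay))  # averaged over all trees
```

Raising n_estimators adds more trees to the vote, which is the "more trees, more robust" idea from the forest analogy, at the cost of longer training.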
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

# Loading the training data
train_data = pd.read_csv(r"-----path to the file which has the train dataset------")

# Splitting the data into training and validation sets
train_set, val_set = train_test_split(train_data, test_size=0.2, random_state=42)

# Preprocessing the text data using TF-IDF vectorization
vectorizer = TfidfVectorizer(max_features=5000, stop_words="english")
X_train = vectorizer.fit_transform(train_set["text"])
X_val = vectorizer.transform(val_set["text"])

# Creating the target labels
# Here "generated" is the column indicating whether the essay was written by a student or generated by an LLM
y_train = train_set["generated"]
y_val = val_set["generated"]

# Training a simple Naive Bayes classifier
classifier = MultinomialNB()
classifier.fit(X_train, y_train)

# Making predictions on the validation set
# accuracy_score and classification_report need hard class labels, not probabilities
predictions = classifier.predict(X_val)
# Probability score for the second class in classifier.classes_
# (assumed here to be the "LLM-generated" label)
val_probabilities = classifier.predict_proba(X_val)[:, 1]

# Evaluating the model
accuracy = accuracy_score(y_val, predictions)
report = classification_report(y_val, predictions)
print(f"Accuracy: {accuracy}")
print("Classification Report:")
print(report)

# Using the trained model to make predictions on the test set
test_data = pd.read_csv(r"-----path to the file which has the test dataset------")
X_test = vectorizer.transform(test_data["text"])
# The submission stores the probability score for each essay
test_predictions = classifier.predict_proba(X_test)[:, 1]

submission_df = pd.DataFrame({"id": test_data["id"], "generated": test_predictions})
submission_df.to_csv("submission.csv", index=False)
On running this code on Jupyter, I got the following accuracy :-

And this certainly got me worried, since I wasn’t really expecting the model to spit out a perfect 1.0.
To get to the bottom of it, I read up on why a MultinomialNB model may give you 100% accuracy.
So the reasons I found are :-
Revisiting my model and the training dataset closely, I tried to see which of the reasons might be true in my case. It turns out that my MultinomialNB model is being subjected to imbalanced classes and a small dataset, due to which the accuracy is strangely 100%.
Now that you and I are aware of the situation, my suggestion is that when you train your own model, you take care of these two points (class imbalance and dataset size) at your end, which will help you stay away from unreal 100% accuracy gimmicks.
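One way to catch this kind of misleading number early is to check the class balance and compare against a baseline that always predicts the majority class. The sketch below does that on made-up labels; the 18-to-2 split and the placeholder features are assumptions for illustration only.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical, deliberately imbalanced labels:
# 18 student essays (0) and 2 LLM-generated essays (1).
y = np.array([0] * 18 + [1] * 2)
# Placeholder features; the baseline below ignores them entirely.
X = np.zeros((len(y), 1))

# Step 1: look at the class balance before trusting any accuracy number.
class_counts = np.bincount(y)
print("Class counts:", class_counts)

# Step 2: stratify the split so both classes appear in train and validation.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=42
)

# Step 3: a baseline that always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
baseline_pred = baseline.predict(X_val)

baseline_accuracy = accuracy_score(y_val, baseline_pred)
baseline_balanced = balanced_accuracy_score(y_val, baseline_pred)
print("Baseline accuracy:", baseline_accuracy)
print("Baseline balanced accuracy:", baseline_balanced)
```

Here a model that has learnt nothing still scores 90% accuracy, while its balanced accuracy is only 0.5. If your real model does not clearly beat such a baseline, a high accuracy on a small, imbalanced dataset is not telling you much.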

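Here, on top of the previous standard MultinomialNB code, I have made a few adjustments to the parameters: a larger TF-IDF vocabulary (max_features=10000), unigrams plus bigrams (ngram_range=(1, 2)), and a smaller smoothing value (alpha=0.1).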
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Assuming you have already loaded your training data into train_data
# Split the data into training and validation sets
train_set, val_set = train_test_split(train_data, test_size=0.2, random_state=42)

# Preprocess the text data using TF-IDF vectorization (unigrams and bigrams)
vectorizer = TfidfVectorizer(max_features=10000, stop_words="english", ngram_range=(1, 2))
X_train = vectorizer.fit_transform(train_set["text"])
X_val = vectorizer.transform(val_set["text"])

# Create the target labels
# "generated" is the column indicating student or LLM
y_train = train_set["generated"]
y_val = val_set["generated"]

# Train a Multinomial Naive Bayes classifier with tuned hyperparameters
# Adjust alpha based on your hyperparameter tuning
classifier = MultinomialNB(alpha=0.1)
classifier.fit(X_train, y_train)

# Make predictions on the validation set
val_predictions = classifier.predict(X_val)

# Evaluate the model
accuracy = accuracy_score(y_val, val_predictions)
print("Accuracy:", accuracy)

# Print classification report and confusion matrix for more insights
print("Classification Report:")
print(classification_report(y_val, val_predictions))
print("Confusion Matrix:")
print(confusion_matrix(y_val, val_predictions))
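The alpha value used with MultinomialNB is meant to come out of hyperparameter tuning. A minimal sketch of one way to do that tuning, using a grid search with cross-validation, is below. The toy essays, the candidate alpha values and cv=2 are assumptions picked to keep the example small; on real data you would use your own grid and more folds.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy essays; 0 = written by a student, 1 = LLM-generated.
texts = [
    "i like my school and my friends play with me",
    "my dog is very funny and he runs fast",
    "we went to the park and i ate ice cream",
    "my mom makes good food and i help her",
    "education is a multifaceted endeavor that fosters holistic growth",
    "pets provide companionship and foster a sense of responsibility",
    "public parks are crucial components of a thriving community",
    "nutrition plays a paramount role in fostering overall well-being",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Keeping the vectorizer inside the pipeline means it is refitted on each
# training fold, so the validation fold never leaks into the vocabulary.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("nb", MultinomialNB()),
])

# Candidate smoothing values; "nb__alpha" targets the "nb" step's alpha.
param_grid = {"nb__alpha": [0.01, 0.1, 0.5, 1.0]}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

print("Best alpha:", search.best_params_["nb__alpha"])
print("Best cross-validated accuracy:", search.best_score_)
```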
# Random Forest version of the same pipeline
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Assuming you have already loaded your training data into train_data
# Split the data into training and validation sets
train_set, val_set = train_test_split(train_data, test_size=0.2, random_state=42)

# Preprocess the text data using TF-IDF vectorization (unigrams and bigrams)
vectorizer = TfidfVectorizer(max_features=10000, stop_words="english", ngram_range=(1, 2))
X_train = vectorizer.fit_transform(train_set["text"])
X_val = vectorizer.transform(val_set["text"])

# Create the target labels
# "generated" is the column indicating student or LLM
y_train = train_set["generated"]
y_val = val_set["generated"]

# Train a Random Forest classifier with tuned hyperparameters
classifier = RandomForestClassifier(n_estimators=100, max_depth=50, random_state=42)
classifier.fit(X_train, y_train)

# Make predictions on the validation set
val_predictions = classifier.predict(X_val)

# Evaluate the model
accuracy = accuracy_score(y_val, val_predictions)
print("Accuracy:", accuracy)

# Print classification report and confusion matrix for more insights
print("Classification Report:")
print(classification_report(y_val, val_predictions))
print("Confusion Matrix:")
print(confusion_matrix(y_val, val_predictions))
However, on Kaggle, my 3rd attempt gave me a further improvement in accuracy over what I was achieving in my 1st attempt with MultinomialNB. I reckon you know why the accuracies differ between my local runs and my runs on Kaggle. The reason is simple: the dataset I use locally is quite small, hence the 100% accuracy (as I mentioned earlier in the article, where I listed the possible reasons). On Kaggle, however, the community (or the jury, I should rather say) validates your accuracy against a hidden dataset which is much bigger and more diverse.
I have got to go and grab a cup of coffee now. I am not sure I want to write any more today. So, if you have a comment to drop, please do. I will try to look at it at the earliest and respond. I think this is my first article in 2024, so for all of you who have read this far: HAPPY NEW YEAR FRIENDS :)
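If your local dataset is small too, one way to get a more honest local estimate is to replace the single train/validation split with stratified k-fold cross-validation, so every essay gets used for validation once. A minimal sketch follows. The toy essays and the choice of 4 folds are assumptions for illustration, and this will still not replicate a hidden test set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy essays; 0 = written by a student, 1 = LLM-generated.
texts = [
    "i like my school and my friends play with me",
    "my dog is very funny and he runs fast",
    "we went to the park and i ate ice cream",
    "my mom makes good food and i help her",
    "education is a multifaceted endeavor that fosters holistic growth",
    "pets provide companionship and foster a sense of responsibility",
    "public parks are crucial components of a thriving community",
    "nutrition plays a paramount role in fostering overall well-being",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

model = make_pipeline(TfidfVectorizer(), MultinomialNB())

# Each fold keeps the same class ratio as the full dataset.
cv = StratifiedKFold(n_splits=4, shuffle=True, random_state=42)
scores = cross_val_score(model, texts, labels, cv=cv, scoring="accuracy")

print("Accuracy per fold:", scores)
print("Mean:", scores.mean(), "Std:", scores.std())
```

A large spread between folds is itself a hint that the dataset is too small for any single accuracy number to be trusted.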
The winters are gloomy, but there is a fire within.
And yes, if by any chance a teacher has read this piece till the end, I hope it made sense to you (as you were my target audience). I have spent close to a week explaining to my Mom how ChatGPT really works and, guess what, I can finally see a curious student in her now :p (role reversals, coming of age haha)
Godspeed. Bye :)

Medium - Shivam Dutt Sharma (myself)