Bangla Hate Speech Detection: Comparative Analysis of Machine Learning Models and Recurrent Neural Networks

Abstract
The spread of hate speech on the internet has been linked to an increase in violent acts against minority groups around the world, with incidents reported on nearly every continent. Social media usage is at its peak, and so is hate speech on social media platforms. Facebook alone is used by a large share of the world's population, and people increasingly converse on social media. Although several studies have addressed hate speech detection in the Bengali language, none has presented a comparative analysis of machine learning models and recurrent neural networks for Bengali hate speech detection. Bengali hate speech is on the rise on social media platforms, threatening the general public's mental health, so detecting it and preventing it from being posted is a suitable way to stop it from spreading. This study therefore aims to train different models that can detect Bengali hate speech on different social media platforms and to compare those models. The collected data were analyzed using statistical methods and manually labeled based on sentiment. A punctuation remover was used to remove punctuation marks, and regular expressions were used to remove foreign-language text from the dataset. The Bangla natural language toolkit was used to remove Bangla stop words. Label encoding was used to make the dataset machine readable. The natural language processing toolkit's Porter stemmer was used for tokenization, and term frequency-inverse document frequency (TF-IDF) was used for feature extraction. A hold-out validation approach was used for training and testing.
Several machine learning and recurrent neural network models were implemented: decision tree (DT), k-nearest neighbor (KNN), random forest (RF), support vector machine (SVM), multinomial naïve Bayes (MNB), long short-term memory (LSTM), bidirectional long short-term memory (Bi-LSTM), and convolutional long short-term memory (CNN-LSTM). Among the machine learning models, SVM achieved an accuracy of 0.96; among the recurrent neural network (RNN) models, Bi-LSTM achieved an accuracy of 0.94.
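The preprocessing steps named in the abstract can be illustrated with a minimal, standard-library-only Python sketch. It is not the thesis code. The function names, the three-word stop-word set, the `hate`/`not_hate` label names, the 80/20 split ratio, and the plain `tf * log(N / df)` weighting are all assumptions made for illustration; the thesis itself reports using a punctuation remover, a Bangla toolkit stop-word list, and library tooling whose exact settings are not given here.

```python
import math
import random
import re
from collections import Counter

# Anything outside the Bengali Unicode block (U+0980-U+09FF) or whitespace.
# This drops Latin text, ASCII digits, ASCII punctuation and the danda (U+0964).
NON_BENGALI = re.compile(r"[^\u0980-\u09FF\s]")

# Tiny illustrative stop-word set (assumption); a real list is much longer.
STOP_WORDS = {"এবং", "ও", "এই"}


def clean(text, stop_words=STOP_WORDS):
    """Strip non-Bengali characters, split on whitespace, drop stop words."""
    text = NON_BENGALI.sub(" ", text)
    return [tok for tok in text.split() if tok not in stop_words]


def encode_labels(labels):
    """Map each distinct label to an integer index (label encoding)."""
    classes = sorted(set(labels))
    index = {c: i for i, c in enumerate(classes)}
    return [index[label] for label in labels], classes


def tfidf(docs):
    """Textbook TF-IDF: (count / doc length) * log(N / document frequency)."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append(
            {tok: (c / len(doc)) * math.log(n / df[tok]) for tok, c in counts.items()}
        )
    return vectors


def hold_out(items, test_ratio=0.2, seed=0):
    """Shuffle once with a fixed seed, then split into train and test parts."""
    idx = list(range(len(items)))
    random.Random(seed).shuffle(idx)
    cut = len(items) - int(len(items) * test_ratio)
    return [items[i] for i in idx[:cut]], [items[i] for i in idx[cut:]]
```

For example, `clean("আমি ভালো আছি! hello 123")` keeps only the three Bengali tokens. The resulting TF-IDF vectors and encoded labels would then be passed to whichever classifier is being compared.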
Department Name
Electrical and Computer Engineering
Publisher
North South University
Printed Thesis