A Bengali Parts of Speech Tagger based on Bengali to English Word Alignment

Although Bengali is the fifth most-spoken native language and the seventh most-spoken language by total number of speakers worldwide, it still lags far behind widely studied languages such as English in the field of Natural Language Processing (NLP). The main reason is a severe scarcity of the high-quality Bengali data needed to train models. To mitigate this problem, we consider alternatives such as incorporating knowledge from English, on the premise that Bengali and English are semantically and morphologically similar. Because English NLP has advanced significantly over the years, we aim to transfer knowledge from English to Bengali by developing a word aligner model. At present, publicly available Bengali POS taggers are either based on Bengali “সাধুভাষা” (Sadhu Bhasha, the formal literary register) or trained on comparatively small datasets, which results in lower accuracy and leaves them unable to handle Bengali “চলিত ভাষা” (Cholito Bhasha, the standard colloquial register). Since there is no proper dataset for training a POS tagger for Bengali “চলিত ভাষা”, our goal is to combine knowledge of English with knowledge of Bengali. Notably, available English POS taggers such as those in NLTK, spaCy, and Flair achieve accuracies of around 93%-98%. With that in mind, we build a word aligner for Bengali and English based on their semantic similarity and then, with the help of the aligner, develop a Bengali POS tagging model. In addition, we have created and publicly released a manually annotated gold-standard Bn-En alignment dataset to evaluate our model. A demo of the full project implementation is live on Hugging Face.
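The idea of tagging Bengali words through their aligned English counterparts can be illustrated with a small sketch. This is not the thesis's implementation: the function name `project_pos_tags`, the representation of an alignment as (Bengali index, English index) pairs, the majority vote for many-to-one links, and the fallback tag `X` for unaligned words are all assumptions made for illustration. The English tags are assumed to come from any existing English tagger.

```python
from collections import Counter, defaultdict


def project_pos_tags(bn_tokens, en_tags, alignment, unknown_tag="X"):
    """Project English POS tags onto Bengali tokens through a word alignment.

    bn_tokens: list of Bengali tokens.
    en_tags:   list of POS tags, one per English token.
    alignment: iterable of (bn_index, en_index) pairs (hypothetical format).
    """
    # Collect the English tags linked to each Bengali token.
    votes = defaultdict(Counter)
    for bn_idx, en_idx in alignment:
        # Ignore links that point outside either sentence.
        if 0 <= bn_idx < len(bn_tokens) and 0 <= en_idx < len(en_tags):
            votes[bn_idx][en_tags[en_idx]] += 1

    projected = []
    for i, token in enumerate(bn_tokens):
        if votes[i]:
            # Majority vote when one Bengali token aligns to several English tokens.
            tag = votes[i].most_common(1)[0][0]
        else:
            # Unaligned Bengali tokens receive a placeholder tag.
            tag = unknown_tag
        projected.append((token, tag))
    return projected


# Toy example: "আমি ভাত খাই" aligned to "I eat rice".
bn_tokens = ["আমি", "ভাত", "খাই"]
en_tags = ["PRON", "VERB", "NOUN"]      # tags for "I", "eat", "rice"
alignment = [(0, 0), (1, 2), (2, 1)]    # Bengali is verb-final, so links cross
print(project_pos_tags(bn_tokens, en_tags, alignment))
# [('আমি', 'PRON'), ('ভাত', 'NOUN'), ('খাই', 'VERB')]
```

In this sketch the quality of the projected tags depends entirely on the quality of the alignment, which is why a gold-standard alignment dataset is useful for evaluation.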
Department Name
Electrical and Computer Engineering
North South University
Printed Thesis