Browsing by Author "Rajesh Kumar Mundotiya"
Conference Paper: Deep Learning-Based Similar Languages' POS Tagging: Experiments on Bhojpuri, Maithili, and Magahi (Springer Science and Business Media Deutschland GmbH, 2023)
Rajesh Kumar Mundotiya; Praveen Gatla; Nikita Kanwar; Anil Kumar Singh
Monolingual corpora and similar-language resources are widely available for only a few languages. These resources stimulate the exploration and building of potential NLP tools for new languages or dialects. This paper deals with part-of-speech (POS) tagging for three Indo-Aryan languages: Magahi, Maithili, and Bhojpuri. The POS model is trained with a BiLSTM-CRF, and the paper explores the effectiveness of Word2Vec and GloVe as word-level embeddings, and of FastText and BPE as subword-level embeddings, all trained on raw corpora of these languages. As all three languages are dialects of Hindi, multilingual embeddings at the BPE level have also been evaluated and yield better results than monolingual BPE embeddings. However, the best results have been obtained with word embeddings, i.e., GloVe on Maithili and Magahi, with 81.23% and 82.24%, respectively. © 2023, The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

Conference Paper: NLPRL@INLI-2018: Hybrid gated LSTM-CNN model for Indian native language identification (CEUR-WS, 2018)
Rajesh Kumar Mundotiya; Manish Singh; Anil Kumar Singh
Native language identification (NLI) focuses on determining the native language of an author from their writing style in English. Indian native language identification, based on users' comments and posts on social media, is a challenging task. To solve this problem, we present a hybrid gated LSTM-CNN model. The final vector of a sentence is generated at the hybrid gate by joining two distinct vectors of the sentence; the gate seeks the optimum mixture of the LSTM- and CNN-level outputs. The input words for the LSTM and CNN are projected into a high-dimensional space by an embedding technique.
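The gated combination described in the abstract can be sketched in minimal form: a sigmoid gate g produces an element-wise convex mixture g·h_lstm + (1 − g)·h_cnn of the two sentence vectors. This is a toy illustration of the general gating idea, not the paper's trained model; the scalar-per-dimension gate parameters and the example vectors below are invented for demonstration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_merge(h_lstm, h_cnn, w_gate, b_gate):
    """Element-wise gate: g * h_lstm + (1 - g) * h_cnn.

    g is computed per dimension from both inputs through a simple
    linear map (a simplification of a full gate weight matrix).
    """
    assert len(h_lstm) == len(h_cnn) == len(w_gate)
    merged = []
    for i in range(len(h_lstm)):
        g = sigmoid(w_gate[i] * (h_lstm[i] + h_cnn[i]) + b_gate)
        merged.append(g * h_lstm[i] + (1.0 - g) * h_cnn[i])
    return merged


# Toy sentence vectors (illustrative values only).
h_lstm = [0.5, -1.2, 0.3]
h_cnn = [1.0, 0.4, -0.7]
vec = gated_merge(h_lstm, h_cnn, w_gate=[1.0, 1.0, 1.0], b_gate=0.0)
```

Because g stays in (0, 1), each output dimension lies between the corresponding LSTM and CNN values, so the gate interpolates rather than replaces either representation.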
We obtained 88.50% accuracy during training on the provided social media dataset, while 17.10% was reported in the final testing done by the Indian native language identification (INLI) workshop organizers. © 2018 CEUR-WS. All Rights Reserved.

Article: Sarcasm Identification and Classification in Hindi Newspaper Headlines (Association for Computing Machinery, 2025)
Iram Ali Ahmad; Praveen Gatla; Rajesh Kumar Mundotiya
Sarcasm identification in textual data is one of the most captivating areas in current research trends. It is a challenging task for humans as well as for computers. In this article, we try to identify sarcasm in the headlines of two of the most-read Hindi newspapers in India, namely Hindustan and Dainik Jagran. Initially, we collected 88,518 Hindi newspaper headlines and identified 1,945 of them as sarcastic; these form the basis of the present study. The headlines belong to the political domain and were published during some of the recent Legislative Assembly Elections of 2020, 2021, and 2022. Various machine learning and deep learning techniques have been used to develop the baseline models. The work supports the assumption that sarcastic text does not always bear a negative sentiment; it may bear a positive sentiment depending on the context. The present article aims at the creation of a dataset consisting of 1,945 Hindi newspaper headlines; the training and testing of machine learning and deep learning models, namely the Extra Trees Classifier, Random Forest Classifier, XGBClassifier, fasttext-stackedTCN, and mBERT-stackedTCN, for sarcasm identification on the dataset; and a comparison of the results obtained by these models. Of all the chosen models, the Random Forest Classifier performs best, with a score of 92.11 before data augmentation and 90.68 after data augmentation. © 2025 Copyright held by the owner/author(s).
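The random-forest principle behind the best-performing baseline above can be illustrated with a toy sketch: many randomized weak learners (here, one-feature decision stumps standing in for full trees) each fit the training data and then vote on the label. The features, thresholds, and data below are invented for illustration; the paper's classifiers operate on real headline features.

```python
import random


def train_stump(X, y, feat):
    """Pick the threshold on one feature that best splits the training labels."""
    best, best_acc = (0.0, 0, 1), -1.0  # (threshold, label_below, label_above)
    for t in sorted({row[feat] for row in X}):
        for below, above in ((0, 1), (1, 0)):
            preds = [below if row[feat] <= t else above for row in X]
            acc = sum(p == yi for p, yi in zip(preds, y)) / len(y)
            if acc > best_acc:
                best_acc, best = acc, (t, below, above)
    return best


def forest_predict(stumps, row):
    """Majority vote over all stumps."""
    votes = [below if row[feat] <= t else above
             for feat, (t, below, above) in stumps]
    return int(sum(votes) * 2 >= len(votes))


random.seed(0)
# Toy data: two features; the label follows feature 0.
X = [[0.1, 0.9], [0.2, 0.1], [0.8, 0.5], [0.9, 0.2]]
y = [0, 0, 1, 1]
stumps = []
for _ in range(5):
    feat = random.randrange(2)  # each "tree" sees a random feature
    stumps.append((feat, train_stump(X, y, feat)))
pred = forest_predict(stumps, [0.85, 0.4])
```

A real random forest additionally bootstraps the training rows and grows full trees, but the ensemble-plus-vote structure is the same.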
