Title: Deep Learning-Based Similar Languages’ POS Tagging: Experiments on Bhojpuri, Maithili, and Magahi
Publisher
Springer Science and Business Media Deutschland GmbH
Abstract
Monolingual corpora and similar language resources are widely available for a few languages. These resources encourage the exploration and building of NLP tools for new languages or dialects. This paper addresses part-of-speech (POS) tagging for three Indo-Aryan languages: Magahi, Maithili, and Bhojpuri. The POS model is a BiLSTM-CRF, and the paper explores the effectiveness of Word2Vec and GloVe as word-level embeddings and of FastText and BPE as subword-level embeddings, all trained on the raw corpus of these languages. Because all these languages are dialects of Hindi, multilingual embedding at the BPE level has also been evaluated; it gives better results than monolingual BPE embedding. However, the best results have been obtained from word embeddings, i.e., GloVe on Maithili and Magahi, with 81.23% and 82.24%, respectively. © 2023, The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
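As a rough illustration of the CRF half of a BiLSTM-CRF tagger, the sketch below decodes the best tag sequence with the Viterbi algorithm. It is not the authors' implementation: the function name `viterbi_decode`, the two-tag set, and all scores are hypothetical, and the emission scores stand in for what a trained BiLSTM would produce from the word or subword embeddings.

```python
from typing import List, Sequence


def viterbi_decode(emissions: Sequence[Sequence[float]],
                   transitions: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring sequence of tag indices.

    emissions[t][j]   -- score of tag j at token t (assumed to come from a BiLSTM)
    transitions[i][j] -- score of moving from tag i to tag j (CRF parameters)
    """
    if not emissions:
        return []
    num_tags = len(emissions[0])
    score = list(emissions[0])   # best score of any path ending in each tag
    backpointers = []            # one list of best previous tags per later token
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for j in range(num_tags):
            # Best previous tag for reaching tag j at token t.
            best_prev = max(range(num_tags),
                            key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_prev] + transitions[best_prev][j]
                             + emissions[t][j])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)
    # Follow the backpointers from the best final tag to recover the path.
    path = [max(range(num_tags), key=lambda j: score[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical two-tag example; every number here is made up for illustration.
TAGS = ["NOUN", "VERB"]
emissions = [[2.0, 0.5], [0.4, 0.6], [0.3, 1.5]]
transitions = [[0.0, 1.0], [0.5, 0.0]]
best_path = viterbi_decode(emissions, transitions)
print([TAGS[i] for i in best_path])
```

The transition scores let the decoder prefer tag sequences that are plausible as a whole, which is the usual motivation for placing a CRF layer on top of per-token BiLSTM scores.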
