Title
Deep Learning-Based Similar Languages’ POS Tagging: Experiments on Bhojpuri, Maithili, and Magahi

Publisher

Springer Science and Business Media Deutschland GmbH

Abstract

Monolingual corpora and resources for similar languages are widely available for a few languages. Such resources stimulate the exploration and development of NLP tools for new languages or dialects. This paper addresses part-of-speech (POS) tagging for three Indo-Aryan languages that are dialects of Hindi: Magahi, Maithili, and Bhojpuri. The POS tagger is a BiLSTM-CRF model, and the paper explores the effectiveness of Word2Vec and GloVe as word-level embeddings and of FastText and BPE as subword-level embeddings, all trained on the raw corpora of these languages. Because all three languages are dialects of Hindi, multilingual embedding at the BPE level has also been evaluated; it gives better results than monolingual BPE embedding. However, the best results are obtained with word embeddings, namely GloVe on Maithili and Magahi, with 81.23% and 82.24%, respectively. © 2023, The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
