Pre-training Two BERT-Like Models for Moroccan Dialect: MorRoBERTa and MorrBERT

Date
2023-06-30
Publisher
Institute of Automation and Computer Science, Brno University of Technology
Abstract
This article presents a study on the pre-training of two language models for the Moroccan Dialect, MorRoBERTa and MorrBERT, using the Masked Language Modeling (MLM) objective. It details the data collection and pre-processing steps used to build a corpus of over six million sentences and roughly 71 million tokens sourced from social media platforms such as Facebook, Twitter, and YouTube. Pre-training was carried out with the HuggingFace Transformers API, and the paper elaborates on the models' configurations and training methodology. Both MorRoBERTa and MorrBERT achieve high accuracy on multiple downstream tasks, indicating their effectiveness in natural language processing applications specific to the Moroccan Dialect.
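The abstract describes MLM pre-training of a BERT-like encoder through the HuggingFace Transformers API. As a rough illustration of that workflow, the sketch below trains a RoBERTa-style model from scratch with the MLM objective; the corpus file, tokenizer path, and hyperparameters are illustrative assumptions, not the configuration published in the paper.

    # Minimal sketch of MLM pre-training with HuggingFace Transformers.
    # "darija_corpus.txt" and "./darija-tokenizer" are hypothetical: a
    # plain-text corpus of Moroccan Dialect sentences and a tokenizer
    # previously trained on it.
    from datasets import load_dataset
    from transformers import (
        RobertaConfig,
        RobertaForMaskedLM,
        RobertaTokenizerFast,
        DataCollatorForLanguageModeling,
        Trainer,
        TrainingArguments,
    )

    dataset = load_dataset("text", data_files={"train": "darija_corpus.txt"})
    tokenizer = RobertaTokenizerFast.from_pretrained("./darija-tokenizer")

    def tokenize(batch):
        # Truncate long posts; max_length is an assumed setting.
        return tokenizer(batch["text"], truncation=True, max_length=128)

    tokenized = dataset["train"].map(tokenize, batched=True, remove_columns=["text"])

    # RoBERTa-style encoder initialized from scratch; the collator masks
    # 15% of input tokens, the standard BERT/RoBERTa masking rate.
    config = RobertaConfig(vocab_size=tokenizer.vocab_size)
    model = RobertaForMaskedLM(config)
    collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=True, mlm_probability=0.15
    )

    trainer = Trainer(
        model=model,
        args=TrainingArguments(
            output_dir="./morroberta-sketch",
            per_device_train_batch_size=32,
            num_train_epochs=1,
        ),
        data_collator=collator,
        train_dataset=tokenized,
    )
    trainer.train()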
Citation
Mendel, 2023, vol. 29, no. 1, pp. 55-61. ISSN 1803-3814
https://mendel-journal.org/index.php/mendel/article/view/223
Document type
Peer-reviewed
Document version
Published version
Language of document
en
Document licence
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license
http://creativecommons.org/licenses/by-nc-sa/4.0