Pre-training Two BERT-Like Models for Moroccan Dialect: MorRoBERTa and MorrBERT

dc.contributor.author: Moussaoui, Otman
dc.contributor.author: El Younnoussi, Yacine
dc.coverage.issue: 1
dc.coverage.volume: 29
dc.date.accessioned: 2024-01-11T08:34:36Z
dc.date.available: 2024-01-11T08:34:36Z
dc.date.issued: 2023-06-30
dc.description.abstract: This research article presents a comprehensive study of the pre-training of two language models for the Moroccan Dialect, MorRoBERTa and MorrBERT, using the Masked Language Modeling (MLM) pre-training objective. The study details the data collection and pre-processing steps involved in building a large corpus of over six million sentences and 71 million tokens, sourced from social media platforms such as Facebook, Twitter, and YouTube. Pre-training was carried out with the HuggingFace Transformers API, and the paper elaborates on the configurations and training methodologies of the models. The study concludes by demonstrating the high accuracy rates achieved by both MorRoBERTa and MorrBERT on multiple downstream tasks, indicating their potential effectiveness in natural language processing applications specific to the Moroccan Dialect.
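
The abstract describes masked-language-model pre-training through the HuggingFace Transformers API. Below is a minimal sketch of that workflow, not the paper's actual setup: the corpus file name, tokenizer path, output directory, and hyperparameters are illustrative assumptions, and a RoBERTa-style configuration is shown (a BERT-style model such as MorrBERT would use the analogous BertConfig/BertForMaskedLM classes).

# Minimal MLM pre-training sketch with the HuggingFace Transformers API.
# Assumed inputs: darija_corpus.txt (one sentence per line) and a tokenizer
# previously trained on the same corpus; both names are hypothetical.
from datasets import load_dataset
from transformers import (
    DataCollatorForLanguageModeling,
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

dataset = load_dataset("text", data_files={"train": "darija_corpus.txt"})
tokenizer = RobertaTokenizerFast.from_pretrained("./darija_tokenizer")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# The collator randomly masks 15% of tokens, the standard MLM objective.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

# Fresh RoBERTa-style model sized to the tokenizer's vocabulary.
model = RobertaForMaskedLM(RobertaConfig(vocab_size=tokenizer.vocab_size))

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./morroberta",
        num_train_epochs=1,
        per_device_train_batch_size=32,
    ),
    data_collator=collator,
    train_dataset=tokenized["train"],
)
trainer.train()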
dc.format: text
dc.format.extent: 55-61
dc.format.mimetype: application/pdf
dc.identifier.citation: Mendel. 2023, vol. 29, no. 1, pp. 55-61. ISSN 1803-3814
dc.identifier.doi: 10.13164/mendel.2023.1.055
dc.identifier.issn: 2571-3701
dc.identifier.issn: 1803-3814
dc.identifier.uri: https://hdl.handle.net/11012/244241
dc.language.iso: en
dc.publisher: Institute of Automation and Computer Science, Brno University of Technology
dc.relation.ispartof: Mendel
dc.relation.uri: https://mendel-journal.org/index.php/mendel/article/view/223
dc.rights: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license
dc.rights.access: openAccess
dc.rights.uri: http://creativecommons.org/licenses/by-nc-sa/4.0
dc.subject: Moroccan Dialect
dc.subject: BERT
dc.subject: RoBERTa
dc.subject: Natural Language Processing
dc.subject: Pre-trained
dc.subject: Machine Learning
dc.title: Pre-training Two BERT-Like Models for Moroccan Dialect: MorRoBERTa and MorrBERT
dc.type.driver: article
dc.type.status: Peer-reviewed
dc.type.version: publishedVersion
eprints.affiliatedInstitution.faculty: Faculty of Mechanical Engineering
Files
Original bundle
Name: 223-Article Text-605-2-10-20230630.pdf
Size: 2.06 MB
Format: Adobe Portable Document Format