Pre-training Two BERT-Like Models for Moroccan Dialect: MorRoBERTa and MorrBERT

dc.contributor.author: Moussaoui, Otman
dc.contributor.author: El Younnoussi, Yacine
dc.coverage.issue: 1
dc.coverage.volume: 29
dc.date.accessioned: 2024-01-11T08:34:36Z
dc.date.available: 2024-01-11T08:34:36Z
dc.date.issued: 2023-06-30
dc.description.abstract: This research article presents a comprehensive study of the pre-training of two language models for the Moroccan Dialect, MorRoBERTa and MorrBERT, using the Masked Language Modeling (MLM) pre-training objective. The study details the data collection and pre-processing steps involved in building a large corpus of over six million sentences and 71 million tokens, sourced from social media platforms such as Facebook, Twitter, and YouTube. Pre-training was carried out with the HuggingFace Transformers API, and the paper elaborates on the configurations and training methodologies of the models. The study concludes by demonstrating the high accuracy rates achieved by both MorRoBERTa and MorrBERT on multiple downstream tasks, indicating their potential effectiveness in natural language processing applications specific to the Moroccan Dialect.
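
The abstract describes masked-language-model pre-training through the HuggingFace Transformers API. Below is a minimal sketch of that workflow, not the paper's actual setup: the corpus file name, tokenizer path, output directory, and hyperparameters are illustrative assumptions, and a RoBERTa-style configuration is shown (a BERT-style model such as MorrBERT would use the analogous BertConfig/BertForMaskedLM classes).

# Minimal MLM pre-training sketch with the HuggingFace Transformers API.
# Assumed inputs: darija_corpus.txt (one sentence per line) and a tokenizer
# previously trained on the same corpus; both names are hypothetical.
from datasets import load_dataset
from transformers import (
    DataCollatorForLanguageModeling,
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

dataset = load_dataset("text", data_files={"train": "darija_corpus.txt"})
tokenizer = RobertaTokenizerFast.from_pretrained("./darija_tokenizer")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# The collator randomly masks 15% of tokens, the standard MLM objective.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

# Fresh RoBERTa-style model sized to the tokenizer's vocabulary.
model = RobertaForMaskedLM(RobertaConfig(vocab_size=tokenizer.vocab_size))

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="./morroberta",
        num_train_epochs=1,
        per_device_train_batch_size=32,
    ),
    data_collator=collator,
    train_dataset=tokenized["train"],
)
trainer.train()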
dc.format: text
dc.format.extent: 55-61
dc.format.mimetype: application/pdf
dc.identifier.citation: Mendel. 2023, vol. 29, no. 1, pp. 55-61. ISSN 1803-3814
dc.identifier.doi: 10.13164/mendel.2023.1.055
dc.identifier.issn: 2571-3701
dc.identifier.issn: 1803-3814
dc.identifier.uri: https://hdl.handle.net/11012/244241
dc.language.iso: en
dc.publisher: Institute of Automation and Computer Science, Brno University of Technology
dc.relation.ispartof: Mendel
dc.relation.uri: https://mendel-journal.org/index.php/mendel/article/view/223
dc.rights: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license
dc.rights.access: openAccess
dc.rights.uri: http://creativecommons.org/licenses/by-nc-sa/4.0
dc.subject: Moroccan Dialect
dc.subject: BERT
dc.subject: RoBERTa
dc.subject: Natural Language Processing
dc.subject: Pre-trained
dc.subject: Machine Learning
dc.title: Pre-training Two BERT-Like Models for Moroccan Dialect: MorRoBERTa and MorrBERT
dc.type.driver: article
dc.type.status: Peer-reviewed
dc.type.version: publishedVersion
eprints.affiliatedInstitution.faculty: Faculty of Mechanical Engineering
Files
Original bundle
Name: 223-Article Text-605-2-10-20230630.pdf
Size: 2.06 MB
Format: Adobe Portable Document Format