DUFKOVÁ, A. Cross Lingual News Article Classification and Automatic Topic Discovery Using Multilingual Language Models [online]. Brno: Vysoké učení technické v Brně. Fakulta informačních technologií. 2023.

Reviews

Supervisor's review

Kesiraju, Santosh

- Overall, it is a very good piece of work. - The result of the thesis is a larger multilingual dataset for news classification, covering 16 languages and around 25 topics, which could be useful for the NLP research community.

Partial evaluation
Criterion Grade Points Verbal evaluation
Assignment information - The work is moderately challenging, as it requires careful data curation, cleaning, and preparation, followed by a series of experiments with state-of-the-art language models.
Work with literature - The student did a good job working with the prior literature and the relevant toolkits and libraries. - More work with the technical literature would have made it excellent.
Activity during the work, consultations, communication - The student consulted regularly and was very punctual in maintaining notes and tracking the progress of the thesis.
Activity during completion - The work was completed well within the time limit, and the final thesis draft received comments for a revision.
Publication activity, awards - The student collaborated with another PhD student on the SemEval challenge on multilingual and cross-lingual sentiment classification for low-resource African languages, which led to a workshop publication that will appear at SemEval 2023 (ACL).
Proposed grade
B
Points
88

Opponent's review

Fajčík, Martin

Overall, the thesis created a novel dataset with significant potential for future work. It established baseline results using two multilingual models. Furthermore, the thesis created a demo that allows users to experiment with the models created within this work. However, the claims in the thesis show a lack of comparison with existing work or are not backed up by released data. The technical description of the models and methods used within this work is very shallow, lacks formal notation, and is sometimes confusing. Due to the mixed outcomes of the thesis, I recommend grade C.

Partial evaluation
Criterion Grade Points Verbal evaluation
Difficulty of the assignment
Extent to which the assignment requirements were fulfilled
Extent of the technical report The description of the methods used, e.g., LaBSE, LASER, or the multilingual bag-of-words model, is very brief and informal. This contrasts with, e.g., Chapter 2, where the student described the dataset collection process and its analysis very extensively and clearly.
Presentation level of the technical report 70 Readability/Comprehensibility: Following a linear reading order, the parts on motivation, dataset collection, experimental design, and the application are very well written and easy to follow. However, as mentioned in "EXTENT OF THE TECHNICAL REPORT", the major technical concepts, which are the main concern of this work, are only vaguely explained and often not understandable. A minor problem is that the thesis is not clearly split into theoretical and practical parts. An example is Chapter 6, which is self-contained (i.e., it describes both methods and experiments). Technical Soundness: Some decisions were only poorly explained, as they were not backed up by data. Most notably, among others: Page 9: "All of them (reviewer's clarification: 25 categories in the created dataset) should be distinguishable." /Commentary: Missing analysis. Page 36: "This approach was giving good results, but due to the inability to integrate into the web application described in the introduction of this chapter, it was not used in the end." /Commentary: These results were never released. Page 38: "topic discovery works well." /Commentary: The proposed topic discovery method was never analyzed within the scope of this work.
Formal aspects of the technical report 75 The figures and language are of excellent quality. Each figure is discussed in depth. Some problems are: the thesis title contains a typo (it should be "cross-lingual", with a hyphen); the thesis mentions that a figure is in the appendices (a reference to a specific appendix chapter is necessary); equation numbering is missing; the equations on page 35 are confusing, as an undefined variable i is used there and it is not clear where it comes from; Table 5.1 contains some missing results, and the reason for this is left without commentary; occasional usage of subjective terms ("certainly", "Potěšujícím zjištěním", i.e., "a pleasing finding").
Work with literature 60 Some claims indicate insufficient literature research, e.g., the claim that there is no large multilingual dataset for topic classification available (abstract, introduction, ...), and therefore no previous dataset is mentioned. A quick search shows that there exists a popular multilingual news corpus, Reuters Corpus Volume 2 [1] (though with just 4 categories). Some citations refer to the preprint although the published version of the article is available (e.g., [5] and [9]). Some citations are missing (e.g., for a method in Chapter 6.3). Exact references to websites/tools are missing (gradio, Transformers, Datasets). [1] Holger Schwenk and Xian Li. 2018. A Corpus for Multilingual Document Classification in Eight Languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).
Implementation output 75 Pros: Collection of a large-scale dataset covering a wide range of categories and languages. Thoughtful validation of the licensing of the downloaded data. Dataset baseline setup, quantitative validation, and analysis were done. Simple MLP classifier in PyTorch and baseline scripts in scikit-learn. Web application with classifier and topic analysis. Cons: No qualitative study of the errors the models made on the dataset. Missing quantitative comparison between classifiers (just a verbal description, such as "XY was tested; it gave more or less similar results, not better ones.", see Section 5.5). The comparison between Table 5.2 and Table 5.5 could have been done more thoughtfully by reporting just the differences in the limited-data scenario; comparing 11^2 results between tables that are several pages apart is inconvenient. Some released articles in different languages should be translations of each other; however, there is no analysis of how many articles have an n-way translation available. The selected topic analysis method does not filter out stopwords.
Usability of results The dataset released within this work is an interesting contribution. With the missing analyses added and a comparison with existing work, the work has the potential to be accepted at top venues reporting on new language resources, such as the LREC conference.
Proposed grade
C
Points
71

Questions

eVSKP id 148255