KAMENICKÝ, D. Tvorba vícejazyčné datové sady pro fact-checking z existujících dat pro odpovídání na otázky [online]. Brno: Vysoké učení technické v Brně. Fakulta informačních technologií. 2023.

Reviews

Supervisor's review

Fajčík, Martin

The student worked actively during some parts of the semester. Toward the end, his activity increased, but this could not fully compensate for the work that remained to be done. In the last few days, we also uncovered some technical issues in the evaluation part, which could not be addressed in the time left. Hence, I propose grade D.

Partial evaluation

| Criterion | Grade | Points | Verbal evaluation |
|---|---|---|---|
| Information on the assignment | | | The assignment was of average difficulty. Most of the work was planned and could be foreseen from the start of the project. |
| Work with literature | | | The student followed the recommended literature and studied it thoroughly. |
| Activity during the work, consultations, communication | | | The student actively consulted on his progress. In some weeks, progress stalled and the student was not prepared. He missed the deadline for the semester project and finished his thesis only a few days before the thesis deadline. He tried to make up for what he had missed with extra activity during the last weeks. |
| Activity during completion | | | I managed to read his thesis, and we discussed the contents. However, due to a lack of time, we focused only on fixing crucial points in the thesis. |
| Publications, awards | | | The student created a new fact-checking dataset covering diverse language families, converted from the high-quality TyDi QA dataset. However, the fact-checking dataset suffers from poor translation of low-resource languages and still needs some work before it can be published. |
Proposed grade: D
Points: 65

Opponent's review

Aparovich, Maksim

Overall, the work is good. Its stronger side is the technical solution, which makes clear how to apply it in practice. The theoretical part is sufficient for a reader familiar with the field; nothing critical for understanding is missing, though I have minor reservations.

Partial evaluation

| Criterion | Grade | Points | Verbal evaluation |
|---|---|---|---|
| Difficulty of the assignment | | | The assignment is moderately difficult for a Master's thesis. The thesis aims to solve an open research problem: the lack of data for multilingual fact checking. Several approaches were proposed to solve the problem. This involves a decent amount of work with literature from two closely related but distinct domains (question answering and fact checking), reviewing existing datasets with their advantages and drawbacks, and using existing technical solutions as well as creating custom ones to implement the proposed approaches. |
| Extent to which the assignment requirements were met | | | All the requirements of the assignment were met, with minor reservations. The reservations are the following: 1. The problem of multilingual fact checking and the importance of using sources from different languages are covered in the Introduction with a clear yet short description. The thesis lacks an overview of the current state of multilingual fact checking and of how, or whether, the problem has so far been handled in practice without a sufficient amount of training data. 2. The difficulty of the problem in the introduced datasets was evaluated with a simple baseline: TF-IDF with logistic regression. Since more advanced approaches are used today to solve fact-checking tasks, this raises a concern about whether the baseline is appropriate for analyzing dataset complexity. |
| Length of the technical report | | | The technical report fits the requirements. A couple of pages are almost blank or half blank due to large figures or text wrap, but this does not look like an issue. |
| Presentation quality of the technical report | | 65 | Overall, the work is written in an understandable way. Where something is unclear, it is easy to find a citation with the corresponding details. Chapters are interlinked, and the high-level logical structure is well preserved. However, the logical structure of certain sections could be better. For instance, in Chapter 2 the choice of method is not justified, and the description of RNNs could be moved to a separate section for better readability, or omitted altogether, because RNNs are neither mentioned anywhere else in the thesis nor used in the proposed solution. Another issue is the lack of formal notation where it is needed, e.g. in section 2.1.1 Model Architecture or in the Classification subsection of section 2.5 TF-IDF. |
| Formal quality of the technical report | | 65 | The report contains typos and inaccuracies in descriptions. Separately, I would like to mention that the thesis contains pieces of text copied verbatim from cited papers. That said, it is sometimes hard to express information in other words, and I do not consider this an issue. Examples: the Pre-training tasks section in 2.3 Text-to-text transfer transformer, and the sentence starting with "A token ID that is unique to the sequence is assigned to ..." |
| Work with literature | | 65 | The citations used in the thesis are relevant. Some statements lack support from corresponding research. Some citations reference arXiv preprints although the works are available as conference or journal papers. For instance, citation [15] (the BERT paper) is available as an ACL conference paper. |
| Implementation output | | 80 | The provided software includes custom modules as well as existing tools that were used for the solution. The code of the custom modules is easy to read and clear even to a user unfamiliar with it. The existing tools used in the work fall into two groups: tools available out of the box, and solutions created by other researchers that are tricky to run and use. Both groups were successfully used to solve the proposed problem. Everything required to reproduce the thesis results was provided along with the code. |
| Usability of results | | | The thesis provides a theoretical framework and software for converting existing question-answering datasets into datasets for multilingual fact checking. This is a relatively new field with a lack of datasets, and the work proposes a solution to that problem. |
Proposed grade: C
Points: 70
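
For readers unfamiliar with the baseline named in the opponent's second reservation, the following is a minimal sketch of what a TF-IDF with logistic regression classifier looks like. It is not the thesis code: the toy claims, the labels (1 = supported, 0 = refuted), the hyperparameters, and all function names are hypothetical, and it is written in plain Python rather than with whatever library the thesis used.

```python
import math
from collections import Counter

# Hypothetical toy data: (claim text, label), 1 = supported, 0 = refuted.
TOY_EXAMPLES = [
    ("paris is the capital of france", 1),
    ("water boils at one hundred degrees celsius", 1),
    ("paris is not the capital of france", 0),
    ("water never boils at one hundred degrees celsius", 0),
]

def tokenize(text):
    """Lowercase whitespace tokenization (an assumption; real systems do more)."""
    return text.lower().split()

def fit_idf(texts):
    """Smoothed inverse document frequency for every term seen in training."""
    n = len(texts)
    df = Counter()
    for text in texts:
        df.update(set(tokenize(text)))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}

def tfidf(text, idf):
    """L2-normalised sparse TF-IDF vector; terms unseen in training are dropped."""
    tf = Counter(t for t in tokenize(text) if t in idf)
    vec = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {t: w / norm for t, w in vec.items()}

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def score(vec, weights, bias):
    """Probability of label 1 under the logistic model."""
    return sigmoid(bias + sum(weights.get(t, 0.0) * w for t, w in vec.items()))

def train_baseline(examples, epochs=200, lr=0.5):
    """Fit IDF weights, then logistic regression by stochastic gradient descent."""
    idf = fit_idf([text for text, _ in examples])
    data = [(tfidf(text, idf), label) for text, label in examples]
    weights, bias = {}, 0.0
    for _ in range(epochs):
        for vec, label in data:
            err = score(vec, weights, bias) - label  # gradient of the log loss
            for t, w in vec.items():
                weights[t] = weights.get(t, 0.0) - lr * err * w
            bias -= lr * err
    return idf, weights, bias

def classify(text, model):
    """Return the predicted probability that the claim is supported."""
    idf, weights, bias = model
    return score(tfidf(text, idf), weights, bias)

model = train_baseline(TOY_EXAMPLES)
for text, label in TOY_EXAMPLES:
    print(label, round(classify(text, model), 3), text)
```

Such a model sees only word weights, so on this toy data it can separate the classes only through surface cues such as "not" or "never". This illustrates why the opponent questions whether the baseline says much about how difficult a dataset really is.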

Questions

eVSKP id 143409