ŠAMÁNEK, J. Generating Code from Textual Description of Functionality [online]. Brno: Vysoké učení technické v Brně. Fakulta informačních technologií. 2023.

Reviews

Supervisor's review

Smrž, Pavel

The student worked actively, the objectives were challenging and he successfully achieved them. He collected and semi-automatically annotated a large dataset of relevant CUDA-based programs and performed a complex evaluation of the results.

Partial evaluation
Criterion Grade Points Verbal assessment
Information on the assignment The topic and the proposed objectives were complex due to the novelty of contextual language modelling and its use in generating code from a textual description of functionality. The student decided to focus on a specific subdomain, the generation of CUDA-based programs; he created a functional system, collected a challenging dataset, and performed a series of relevant experiments. The thesis's overall goal and objectives were achieved, and I am satisfied with the result.
Work with literature Jan Šamánek was active in collecting and using relevant study sources, gained knowledge of the most advanced methods applicable in the field, and prepared a convincing survey of recent approaches.
Activity during the work, consultations, communication The student's activity increased in the second semester; he regularly consulted on new developments and solved complex problems in adapting large models trained on the available GPU clusters.
Activity during completion The technical report was finished a week before the submission deadline; I was able to review its preliminary versions, and my comments were reflected in the updates of the text.
Publications, awards -
Proposed grade
B
Points
85

Opponent's review

Fajčík, Martin

The presentation of the work is concerning: the contributions are not clear from the abstract/introduction, the training and validation protocol lacks soundness, and the observations and subsequent hypotheses are not validated quantitatively (e.g., by analyzing at least ~10 samples to try to confirm the hypothesized phenomena). The background description with respect to datasets, methods, and metrics is of poor or non-existent quality. On the other hand, the student worked on a more difficult assignment. His work creates a new dataset of CUDA kernels, useful for training automatic code generators. The work evaluates the dataset using 5 different models and provides a qualitative and quantitative analysis of the achieved results. Due to the mixed outcome of the thesis, I propose a grade of C, with a score marginally close to D.

Partial evaluation
Criterion Grade Points Verbal assessment
Difficulty of the assignment There was a problem with the thesis assignment in English (it is missing from the work!). The assignment from the BUT system is as follows: Get acquainted with current methods for contextual language modelling and their use in generating code from a textual description of functionality. Collect and pre-annotate data for training and evaluation, focusing on the domain of GPU-accelerated code in C/C++. Design and implement a system for generating code with GPU acceleration; evaluate the result on the collected data. Create a poster presenting your work, its goals, and results. The student had to tackle several challenging problems in the thesis, mainly: (a) collecting and cleaning a dataset in the desired programming language, (b) preparing the models, (c) getting acquainted with two different domains: natural language processing and CUDA code for GPU acceleration.
Extent to which the assignment requirements were met The description of background work on generating code from a textual description of functionality is absent.
Length of the technical report
Presentation quality of the technical report 61 The narrative of the work follows a textbook style, and the language often takes a philosophical tone (see phrases such as, in the Introduction, "we will look at code"; in the chapter 2 intro, "it would be wise to talk about"; "it paints a picture"; etc.). Such a narrative does not fit the scope of a technical report. Neither the abstract nor the introduction describes the thesis contributions (the dataset and its size, the models, the achieved performance). The work contains a "mega-chapter" 5, which mixes theory and metric descriptions, data collection, preprocessing/postprocessing decisions, results, analysis, and their discussion. Some figures are of questionable importance (Figure 5.6). Some terms are undefined or do not follow their traditional meaning (e.g., "cross-validation" in the introduction usually refers to validation on multiple train-test splits, but this is not done; "definition range"; "derivable" function; "hand-crafted" dataset). The comparison with existing work is poor (more in the work-with-literature section). Research/company group branding is used ("... according to study from BigScience research", p. 10).
Formal quality of the technical report 59 The work includes many formal problems, among others: many of the references are not named (e.g., in section 5.2) or are referred to with varying names (a "figure" is sometimes called a "graph"); missing formula numbering (such as on page 41); not all figures are referenced (e.g., Fig. 2.7); an inconsistent chapter naming convention (appendices differ from the main text); page 21 contains bullet-point text of the same size as section names; on page 43, it is not clear what the max is defined over; Czech and English quoting styles (lower/upper quotes) are mixed; a typo in the name of chapter 5; minor English problems (Errors Analysis -> Error Analysis, Code Generating -> Code Generation); hyperlink URLs that cannot be followed in the printed version.
Work with literature 55 The thesis covers a description of basic NLP elements. However, a major deficiency of this work is that it never describes other coding datasets, other work on code generation, or the evaluation of code generation. Through a quick search, I found the following references to various code-specific evaluation metrics [1, 2, 3]. Despite my criticism of the semester project, the thesis continues to use natural-language generation metrics (BLEU, ROUGE, BERTScore) as the main metrics. Only in the analysis is this problem slightly compensated for by two simple keyword-matching techniques introduced in Table 5.9 and Table 5.10. False claims are present in this work, most notably: stating that the GPT-4 model has 1 trillion parameters; stating on p. 49 that "Given the fact that the models BART and also T5 were able to overfit on our corpus means that the corpus has to fulfill a certain quality of the dataset, and its data have to be somewhat clean."; and an incorrect proof of beam search complexity (p. 21). References to arXiv preprints are used when peer-reviewed proceedings versions are available (e.g., [9], [14], [17]). [1] Ren, Shuo, et al. "CodeBLEU: a method for automatic evaluation of code synthesis." arXiv preprint arXiv:2009.10297 (2020). [2] Chen, Mark, et al. "Evaluating large language models trained on code." arXiv preprint arXiv:2107.03374 (2021). [3] Zhou, Shuyan, et al. "CodeBERTScore: Evaluating code generation with pretrained models of code." arXiv preprint arXiv:2302.05527 (2023).
Implementation output 85 The work collected ~500k CUDA "functions" (so-called kernels), analyzed their contents, created an extensive set of baselines, evaluated (although questionably) their performance per subset and on the whole dataset, and discussed the results. The work also analyzed I/O correlations and attention scores (though with only limited success).
Usability of the results The collected dataset seems to be "very large". To assess the relevance of this work to the current state of the art, more extensive background research and an analysis of how the dataset's size impacts model performance are necessary.
Proposed grade
C
Points
70

Questions

eVSKP id 146415