Reviews of the final qualification thesis

In agreement with the external consultant from Avast/Gen Digital, I see Adam Kán's activity as exemplary and the practical results as fully satisfactory. I suggest the highest grade: A (excellent).

Partial evaluation
Criterion	Grade	Points	Verbal evaluation
Assignment information			The thesis topic was prepared in collaboration with Avast/Gen Digital and was directly linked to the company's research and development interests in optimising language model prompts (especially when using local or smaller/cheaper cloud-hosted models). The student started in the summer of 2024, collaborated intensively with experts from the company, and delivered results that the company has used directly. I am fully satisfied with the resulting system. Moreover, the external consultant from Gen Digital, Břetislav Šopík, also sees the results as excellent and fully satisfactory.
Work with literature			Adam Kán collected and studied relevant scientific publications, grasped the core of modern methods for automatic optimisation of large language models, and prepared a very good survey of recent research in this area.
Activity during the work, consultations, communication			The student's activity was exemplary in both semesters; he regularly consulted on his progress and continuously delivered updates of the work.
Activity during completion			The text was finished well in advance of the deadline, so I could review several preliminary versions of key chapters; my feedback was always reflected in the updates.
Publications, awards			-

Opponent's review

Aparovich, Maksim

The thesis fulfills the assignment with a clear code implementation and relevant literature support. There is room for improvement: non-standard terminology, scattered presentation of results, and a limited evaluation scope focused on binary SMS classification. While the work combines existing prompt optimization methods into a viable technical solution, it would benefit from a more comprehensive error analysis, broader applicability testing, and improved formal presentation, including standardized terminology and summary tables for easier interpretation of the results.

Partial evaluation
Criterion	Points	Verbal evaluation
Assignment difficulty		Given the rapid pace of development in the NLP field, the availability of datasets and approaches, the numerous models accessible via API, and the auto-regressive nature of LLMs, the difficulty of the subtasks varies: some are simpler (data collection, system design) and others more challenging (evaluation, literature review). Overall, the assignment can be summarized as moderately difficult.
Extent to which the assignment requirements are met		The assignment is fulfilled. The thesis covers in detail all the points listed in the assignment.
Scope of the technical report		The technical report contains all the necessary parts: sections 2-5 review the literature and approaches on the relevant topics (including the approach to evaluation); sections 6-7 introduce the solution and implementation details; section 8 covers the experimentation process and evaluation results.
Presentation level of the technical report	85	The report is structured clearly and consistently, and the sections develop logically. The thesis could benefit from more detailed explanations, for example of the specific choice of KNN for example selection and why it was favored over the other methods explored. The rationale for selecting the Nomic AI embeddings model could be grounded more explicitly, and more technical details of the experiments could be provided. While a non-trivial LLM-based candidate generation is used for NLIO, the work does not explicitly compare it against a more naive baseline of simple LLM self-refinement given errors.
Formal layout of the technical report	85	The work follows a standard academic structure with clear chapter headings, sections, figures, and equations. The terminology could be improved: the thesis uses non-standard terms such as "Good/Bad Instructions" for human/naive baselines and "LLM-as-an-evaluator" instead of the more common "LLM as a judge". The term "reasoning" is used instead of Chain-of-Thought, which could be confused with Reasoning Language Models such as DeepSeek-R1. Additionally, presenting the results as a set of figures per experiment makes them difficult to interpret and compare. The work lacks a summary table that directly contrasts the different models and approaches.
Work with literature	80	The thesis cites relevant literature that supports the author's claims, though the depth of the literature review varies across topics: coverage is comprehensive for areas such as in-context learning but less detailed for topics such as LLM-as-a-judge, candidate selection, and instruction search. Additionally, the citations frequently reference arXiv preprints rather than the versions published in conference proceedings.
Implementation output	85	The work presents an approach with a technical solution for binary SMS classification. The code is written clearly and is self-documenting. The applicability of the approach beyond simple classification tasks remains unclear. The thesis would benefit from an extensive error analysis with concrete examples to identify and highlight the approach's limitations and weak points.
Usability of results		The work presents an algorithm for automatic prompt optimization that combines existing methods and is validated on binary classification tasks. It offers a viable option for similar scenarios. The evaluation is limited to a single task type, which leaves open questions about the algorithm's broader applicability across different problem domains.

Reviews

Supervisor's review

Smrž, Pavel
