Language Models for Imperfect Speech and Handwriting Recognition Systems (Jazykové modely pro nedokonalé systémy rozpoznání řeči a písma)

The role of statistical language models is to discover and quantify recurring patterns in natural language. In this thesis, we use language models (LMs) to improve the accuracy of speech and handwriting recognition systems. First, we use fixed-size topic representations as a means of introducing longer context into otherwise computationally very cheap feed-forward neural LMs. We show that this simple technique closes roughly half of the performance gap between these LMs and much more powerful recurrent models. Then, we study the ability of these topic representations to smooth out recognition errors and thus improve the accuracy of second-pass decoding. The improvement obtained is consistent, albeit very small. Next, we study the training of neural LMs on machine-annotated data, with the aim of adapting the LM to a new domain with little human intervention. Demonstrating this approach on optical character recognition, we conclude that language models are fairly robust to errors in the machine annotation, which allows the developer of the LM to skip data filtering in most cases. In the most challenging scenario considered in our experiments, the original recognition system achieves a character error rate of 6.43 % (which can be reduced to 5.34 % by using an LM trained on human-annotated data), whereas using the machine-annotated data to the full extent reduces the error rate to 2.88 %. In the second part of the thesis, we study simple ways of regularizing language models by data augmentation that resembles errors made by automatic speech recognition (ASR) systems. We obtain the best results when the augmentation does not attempt to model the detailed error distribution of an actual ASR system, but only its overall statistics. Further analysis of this surprising result leads us to conclude that the improvements indeed come from a regularization effect rather than from the originally intended robustness to ASR-specific errors.
Finally, we demonstrate a way to reintroduce word-level confidences into the output of various end-to-end ASR systems: when their N-best outputs are rescored by a separate language model, we can effectively restore a capability of HMM-based systems that has been neglected in end-to-end systems. In addition to studying the quality of these confidence estimates, we show quantitatively that they considerably improve the fusion of multiple systems. Compared to a voting-based mechanism, proper confidences improve the accuracy of the fused ASR system approximately as much as adding one more ASR system to the fusion.

Citation

BENEŠ, K. Jazykové modely pro nedokonalé systémy rozpoznání řeči a písma [online]. Brno: Vysoké učení technické v Brně. Fakulta informačních technologií. 2025.

Language of document

en

Study field

Information Technology

Committee

doc. RNDr. Milan Češka, Ph.D. (chair)
doc. RNDr. Ondřej Bojar, Ph.D. (member)
doc. Ing. Jiří Málek, PhD. (member)
prof. Ing. Radomil Matoušek, Ph.D. (member)
doc. Mgr. Radek Pelánek, Ph.D. (member)

Date of acceptance

2025-03-10

Defence

The student presented the goals and results achieved in his dissertation. The student competently answered the questions of the committee members, reviewers, and guests. The discussion is recorded on the discussion sheets, which are attached to the protocol. Number of discussion sheets: 9. The committee agreed unanimously that the student has fulfilled the requirements for being awarded the academic title Ph.D. The committee and reviewers recommend that the thesis be considered for the Dean's prize, which is awarded to good theses.

Result of defence

The thesis was successfully defended

Document licence

Standard licence agreement - unrestricted access to the full text