End-to-end speech recognition for low-resource languages

Oblast automatického rozpoznávání řeči začala přijímat end-to-end řešení neuronové sítě pro vytváření rozpoznávačů řeči. Povaha datového hladu těchto typů systémů však umožňuje vytvářet rozpoznávače pouze pro jazyky s velkými zdroji, jako je angličtina, čínština nebo španělština. Ve scénářích s nízkými zdroji je třeba vyvinout některá řešení, která zmírní problém nedostatku dat. Jednou z nejúčinnějších technik je doladění předtrénovaného modelu. Problém se stávajícími přístupy ladění spočívá v tom, že sada tokenů cílového a zdrojového jazyka se obvykle liší. To je důvod, proč předchozí přístupy k učení vícejazyčného přenosu vyžadovaly změnu výstupní vrstvy nebo smíchání tokenů z různých jazyků ve výstupní vrstvě, případně použití univerzální sady tokenů anebo samostatné výstupní vrstvy pro každý jazyk. To je nežádoucí, jelikož sdílení napříč jazyky je v tomto případě latentní a neovladatelné ve výstupním prostoru, když jsou grafémy specifické pro daný jazyk disjunktní. Proto tato práce navrhuje mapování tokenů do společné sady před začátkem předtréninku. Stávající řešení spočívá v transliteraci zdrojového jazyka do cílového, novým přístupem je romanizace, kde je sada tokenů cílového jazyka romanizována tak, aby odpovídala anglické abecedě. Následně lze diakritiku z romanizovaných hypotéz obnovit pomocí dalšího modelu obnovy. To má výhodu ve zvýšení sdílení v prostoru výstupního grafému.
The automatic speech recognition area has started to adopt end-to-end neural network solutions for creating speech recognizers. However, the data hunger nature of these types of systems allows for the creation of recognizers only for high-resource languages, such as English, Chinese or Spanish. In low-resource scenarios, some solutions which alleviate the data scarcity problem have to be developed. One of the most effective techniques for this is fine-tuning a pre-trained model. The problem with the existing approaches of fine-tuning is that the token set of target and source languages does usually differ. That is why previous multi-lingual transfer learning approaches required the output layer to be changed, or mixed tokens from different languages in the output layer, or use universal token sets, or have separate output layers per language. This is undesirable because the sharing across languages in this case latent and not controllable in the output space when the language-specific graphemes are disjoint. Therefore this work proposes to map the tokens to the common set before the beginning of the pre-training. The existing solution was a transliteration of the source language to the target one, the novel approach is romanization where the token set of the target language is romanized to match the English alphabet. Subsequently, the diacritics from the romanized hypotheses can be restored using an additional restoration model. This has the advantage of increasing sharing in the output grapheme space.

Keywords

ASR , low-resource , transliteration , romanization , model , data , trénování , transfer learning , řeč , end-to-end , augmentation , fine-tuning , ASR , low-resource , transliteration , romanization , model , data , training , transfer learning , speech , end-to-end , augmentation , fine-tuning

Citation

SOKOLOVSKII, V. End-to-end speech recognition for low-resource languages [online]. Brno: Vysoké učení technické v Brně. Fakulta informačních technologií. 2022.

Language of document

en

Study field

Informační technologie

Comittee

prof. Ing. Adam Herout, Ph.D. (předseda) doc. Ing. František Zbořil, Ph.D. (místopředseda) doc. Ing. Michal Bidlo, Ph.D. (člen) Ing. Zbyněk Křivka, Ph.D. (člen) Ing. Zdeněk Materna, Ph.D. (člen)

Date of acceptance

2022-08-22

Defence

Student nejprve prezentoval výsledky, kterých dosáhl v rámci své práce. Komise se poté seznámila s hodnocením vedoucího a posudkem oponenta práce. Student následně odpověděl na otázky oponenta a na další otázky přítomných. Komise se na základě posudku oponenta, hodnocení vedoucího, přednesené prezentace a odpovědí studenta na položené otázky rozhodla práci hodnotit stupněm A. Otázky u obhajoby: Where do you see the largest potential to improve end-to-end speech transcription system for low resources languages - from transliteration/romanization, self-supervised techniques like Wave2Vec2/HuBERT/WavLM, training data augmentation, of from something else? The RNN Transducer architecture was used. What improvement do you expect from two pass system with attention-based architecture in the second pass? How would you extend the work into a scientific publication?

Result of defence

práce byla úspěšně obhájena

URI

http://hdl.handle.net/11012/208275

Collections

2022

Citace PRO

Full item page

End-to-end speech recognition for low-resource languages

Files

Date

Authors

Advisor

Referee

Mark

Journal Title

Journal ISSN

Volume Title

Publisher

ORCID

Abstract

Description

Keywords

Citation

Document type

Document version

Date of access to the full text

Language of document

Study field

Comittee

Date of acceptance

Defence

Result of defence

DOI

URI

Collections

Endorsement

Review

Supplemented By

Referenced By

Citace PRO