Neural Target Speech Extraction (Neurální extrakce řeči cílového řečníka)

As speech processing technologies are increasingly deployed in real-world applications, their robustness has become an important issue. In particular, processing speech corrupted by interfering, overlapping speakers remains a challenging problem. Speech separation approaches tackle this problem by decomposing the mixed speech into the signals of the individual speakers. These methods have advanced considerably in recent years by leveraging progress in deep learning. In many applications, such as smartphones or digital home assistants, the goal is to enhance the speech signal of one speaker of interest while suppressing other speakers and noise. In this work, we formulate this problem as target speech extraction and propose to solve it directly: a neural network takes the enrollment speech of the target speaker and the observed mixture as inputs, and outputs the extracted speech of the target speaker. We discuss and experimentally show the benefits of this approach compared to conventional speech separation, including that the speakers in the mixture need not be counted and that the output is more consistent for longer recordings. We explore different aspects of the neural target speech extraction pipeline, namely the speaker embeddings, the methods for informing the neural network about the target speaker, the input and output domains, and the loss function. Furthermore, we demonstrate how to combine target speech extraction with multi-channel methods, such as neural mask-based beamforming and spatial clustering. Such combinations make use of both conventional statistical methods, for processing the spatial information, and the strong modeling power of neural networks. Finally, we apply target speech extraction as pre-processing for two downstream tasks: automatic speech recognition and clustering-based diarization. We investigate how best to combine the front-end processing with the downstream systems, including joint optimization and training with a weakly supervised loss function based on speaker labels.

Keywords

target speech extraction, neural networks, multi-channel processing, multi-speaker automatic speech recognition, multi-speaker diarization

Citation

ŽMOLÍKOVÁ, K. Neurální extrakce řeči cílového řečníka [online]. Brno: Vysoké učení technické v Brně. Fakulta informačních technologií.

Language of document

cs

Study field

Computer Science and Engineering (Výpočetní technika a informatika)

Committee

prof. Ing. Lukáš Sekanina, Ph.D. (chair); prof. Ing. Zbyněk Koldovský, Ph.D. (member); prof. Mgr. Pavel Rajmic, Ph.D. (member); Ing. Jan Kleindienst, Ph.D. (member); Ing. Josef Šivic, PhD. (member)

Defence

The student presented the goals and results achieved in the dissertation. The student competently answered the questions of the committee members, reviewers, and guests. The discussion is recorded on the discussion sheets, which are attached to the protocol. Number of discussion sheets: 5. The committee has agreed by a majority/unanimously that the student has fulfilled the requirements for being awarded the academic title Ph.D. The committee recommends awarding the thesis the dean's prize.

Result of defence

The thesis was successfully defended

Document licence

Standard licence agreement - unrestricted access to the full text