Emotion Recognition from Analysis of a Person’s Speech

Táto práca sa zaoberá analýzou rozpoznávania emócií z ľudskej reči. Jej cieľom je navrhnúť a implementovať systém, ktorý je schopný automaticky klasifikovať emočný stav z rečových nahrávok. Riešenie je založené na neurónovej sieti typu Audio Spectrogram Transformer (AST), odvodenej z neurónovej siete Vision Transformer, ktorej vstupom je mel spektrogram. Implementácia riešenia pozostáva z dvoch častí. Prvá časť sa zaoberá extrakciou mel spektrogramu zo vstupnej nahrávky reči, zatiaľ čo v druhej časti predtrénovaný AST model počíta odozvu, ktorej výstupom sú pravdepodobnosti pre uvažované emočné triedy. Tréning a vyhodnotenie implementácie bolo uskutočnené na troch dátových sadách: RAVDESS, Emo-DB a EMOVO. Získané výsledky vo forme neváženej presnosti sú 84.5 % pre RAVDESS, 91.6 % pre Emo-DB a 73.8 % pre EMOVO. Počas tréningu modelu bolo zaznamenávané emitované množstvo CO2 na základe spotrebovanej energie grafickým procesorom. Hlavným výstupom tejto práce je využitie neurónovej siete vychádzajúcej z architektúry typu Transformer, určenej pôvodone pre obrazové úlohy, na rozpoznávanie emócií z ľudskej reči. Ďalším výstupom je hodnota uhlíkovej stopy tréningu neurónovej siete, vyjadrená ako hmotnosť vylúčeného CO2, ktorá dosiahla hodnotu 1058.37 gramov.
This thesis deals with the analysis of emotion recognition from human speech. It aims to design and implement a system that can automatically infer emotional states from speech recordings. The solution is based on the Audio Spectrogram Transformer (AST), a derivative of the Vision Transformer neural network, which accepts mel spectrogram as input. The implementation comprehends the pipeline with two stages. In the first stage, a mel spectrogram is obtained from the input speech recording and in the second stage, the pretrained AST model computes output in the form of probabilities of considered emotional classes. The AST implementation was trained and evaluated on three datasets: RAVDESS, Emo-DB and EMOVO. The obtained results in the form of unweighted accuracy are 84.5 % for RAVDESS, 91.6 % for Emo-DB and 73.8 % for EMOVO. During training, the consumed energy of the graphical processing unit was recorded for the calculation of the carbon footprint in terms of emitted CO2. The main contribution of this work is the utilization of neural network based on Transformer architecture, originally used for vision tasks, to classify emotions from speech. Another contribution is carbon footprint tracking of neural network training. The carbon footprint, expressed in emitted CO2 mass is 1058.37 grams.

Keywords

rozpoznávanie emócií z reči človeka, spracovanie rečového signálu, klasifikácia emócií, strojové účenie, hlboké učenie, Vision Transformer, Audio Spectrogram Transformer, uhlíková stopa, speech emotion recognition, speech signal processing, classification of emotions, machine learning, deep learning, Vision Transformer, Audio Spectrogram Transformer, carbon footprint

Citation

KNUTELSKÝ, M. Emotion Recognition from Analysis of a Person’s Speech [online]. Brno: Vysoké učení technické v Brně. Fakulta informačních technologií. 2023.

Language of document

en

Study field

Strojové učení

Comittee

doc. Ing. Lukáš Burget, Ph.D. (předseda) doc. Ing. Martin Čadík, Ph.D. (člen) doc. Ing. Vladimír Janoušek, Ph.D. (člen) Ing. Michal Hradiš, Ph.D. (člen) Ing. Jaroslav Rozman, Ph.D. (člen) Ing. Tomáš Milet, Ph.D. (člen)

Date of acceptance

2023-06-16

Defence

Student nejprve prezentoval výsledky, kterých dosáhl v rámci své práce. Komise se poté seznámila s hodnocením vedoucího a posudkem oponenta práce. Student následně odpověděl na otázky přítomných. Komise se na základě posudku oponenta, hodnocení vedoucího, přednesené prezentace a odpovědí studenta na položené otázky rozhodla práci hodnotit stupněm B.

Result of defence

práce byla úspěšně obhájena

Document licence

Standardní licenční smlouva - přístup k plnému textu bez omezení