Performance of Czech Speech Recognition with Language Models Created from Public Resources

Prochazka, Vaclav; Pollak, Petr; Zdansky, Jindrich; Nouza, Jan

Files

11_04_1002_1008.pdf (130.36 KB)

Date

2011-12

Authors

Prochazka, Vaclav

Pollak, Petr

Zdansky, Jindrich

Nouza, Jan

Publisher

Společnost pro radioelektronické inženýrství

Abstract

In this paper, we investigate the usability of publicly available n-gram corpora for creating language models (LMs) for Czech speech recognition systems. N-gram LMs with various parameters and settings were created from two publicly available sets: the Czech Web 1T 5-gram corpus provided by Google and a 5-gram corpus obtained from the Czech National Corpus Institute. For comparison, we also tested an LM built from a large private resource of newspaper and broadcast texts collected by a Czech media mining company. The LMs were analyzed and compared from a statistical point of view (mainly via their perplexity) and from a performance point of view when employed in large vocabulary continuous speech recognition (LVCSR) systems. Our study shows that the Web1T-based LMs, even after intensive cleaning and normalization, cannot compete with those built from smaller but more consistent corpora. Experiments on large test data also illustrate the impact of Czech, as a highly inflective language, on perplexity, out-of-vocabulary (OOV) rate, and recognition accuracy.
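
As an illustration of the two statistical measures named in the abstract, the following is a minimal sketch of how perplexity and OOV rate are commonly defined. It is not taken from the paper: the function names, the toy probabilities, and the example word list are assumptions made for illustration only.

```python
import math


def perplexity(word_probs):
    """Perplexity of a test text, given the LM probability assigned to each word.

    Uses the standard definition 2 ** (-(1/N) * sum(log2 p_i)).
    """
    log_sum = sum(math.log2(p) for p in word_probs)
    return 2 ** (-log_sum / len(word_probs))


def oov_rate(tokens, vocabulary):
    """Fraction of test tokens that are missing from the recognizer vocabulary."""
    missing = sum(1 for token in tokens if token not in vocabulary)
    return missing / len(tokens)


# Hypothetical toy data: a vocabulary that holds only the base form "hrad"
# misses the inflected forms "hradu" and "hradem" of the same word.
toy_vocabulary = {"hrad"}
toy_tokens = ["hrad", "hradu", "hradem", "hrad"]
toy_oov = oov_rate(toy_tokens, toy_vocabulary)

# Hypothetical per-word probabilities, all equal to 1/4.
toy_perplexity = perplexity([0.25, 0.25, 0.25, 0.25])
```

The toy word list hints at why inflection matters: each inflected form counts as a separate vocabulary item, so a vocabulary of fixed size covers less of a Czech text than of a text in a weakly inflected language.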

Keywords

speech recognition, LVCSR, n-gram language models, public language resources

Citation

Radioengineering. 2011, vol. 20, no. 4, pp. 1002-1008. ISSN 1210-2512
http://www.radioeng.cz/fulltexts/2011/11_04_1002_1008.pdf

Document type

Peer-reviewed

Document version

Published version

Language of document

en

URI

http://hdl.handle.net/11012/56902

Collections

2011/4

Creative Commons license

Except where otherwise noted, this item's license is described as Creative Commons Attribution 3.0 Unported License
