Performance of Czech Speech Recognition with Language Models Created from Public Resources

Prochazka, Vaclav; Pollak, Petr; Zdansky, Jindrich; Nouza, Jan

dc.contributor.author	Prochazka, Vaclav
dc.contributor.author	Pollak, Petr
dc.contributor.author	Zdansky, Jindrich
dc.contributor.author	Nouza, Jan
dc.coverage.issue	4	cs
dc.coverage.volume	20	cs
dc.date.accessioned	2016-03-01T09:16:14Z
dc.date.available	2016-03-01T09:16:14Z
dc.date.issued	2011-12	cs
dc.description.abstract	In this paper, we investigate the usability of publicly available n-gram corpora for creating language models (LMs) for Czech speech recognition systems. N-gram LMs with various parameters and settings were created from two publicly available sets: the Czech Web 1T 5-gram corpus provided by Google and a 5-gram corpus obtained from the Czech National Corpus Institute. For comparison, we also tested an LM built from a large private collection of newspaper and broadcast texts gathered by a Czech media mining company. The LMs were analyzed and compared from the statistical point of view (mainly via their perplexity) and from the performance point of view when employed in large vocabulary continuous speech recognition systems. Our study shows that the Web1T-based LMs, even after intensive cleaning and normalization, cannot compete with those built from smaller but more consistent corpora. The experiments on large test data also illustrate the impact of Czech as a highly inflective language on perplexity, OOV rate, and recognition accuracy.	en
dc.format	text	cs
dc.format.extent	1002-1008	cs
dc.format.mimetype	application/pdf	en
dc.identifier.citation	Radioengineering. 2011, vol. 20, č. 4, s. 1002-1008. ISSN 1210-2512	cs
dc.identifier.issn	1210-2512
dc.identifier.uri	http://hdl.handle.net/11012/56902
dc.language.iso	en	cs
dc.publisher	Společnost pro radioelektronické inženýrství	cs
dc.relation.ispartof	Radioengineering	cs
dc.relation.uri	http://www.radioeng.cz/fulltexts/2011/11_04_1002_1008.pdf	cs
dc.rights	Creative Commons Attribution 3.0 Unported License	en
dc.rights.access	openAccess	en
dc.rights.uri	http://creativecommons.org/licenses/by/3.0/	en
dc.subject	speech recognition	en
dc.subject	LVCSR	en
dc.subject	n-gram language models	en
dc.subject	public language resources	en
dc.title	Performance of Czech Speech Recognition with Language Models Created from Public Resources	en
dc.type.driver	article	en
dc.type.status	Peer-reviewed	en
dc.type.version	publishedVersion	en
eprints.affiliatedInstitution.faculty	Fakulta elektrotechniky a komunikačních technologií	cs

Files

Original bundle

Name: 11_04_1002_1008.pdf
Size: 130.36 KB
Format: Adobe Portable Document Format

Collections

2011/4