Automatická tvorba korpusů (Automatic Corpus Creation)

Šantavý, Marek

Files

final-thesis.pdf (312 KB)

review_25643.html (1.42 KB)

Authors

Šantavý, Marek

Advisor

Smrž, Pavel

Referee

Černocký, Jan

Mark

C

Publisher

Vysoké učení technické v Brně. Fakulta informačních technologií

Abstract

This thesis presents a way of formatting and tagging the text data of a corpus. On top of suitably represented documents, it builds a layer for comparing them with one another in order to determine how similar they are. The tools that compute this similarity form the basis of an automated system for creating a corpus and extending an existing one. Two basic approaches are available, and the choice between them depends on how informative the result needs to be. New data are acquired with a web-crawling tool.

Keywords

corpus, near-duplicate, Rabin fingerprint, redundancy, text-data similarity, web crawl, vertical format, SHA-384

Citation

ŠANTAVÝ, M. Automatická tvorba korpusů [online]. Brno: Vysoké učení technické v Brně. Fakulta informačních technologií.

Language of document

cs

Study field

Information Technology

Result of defence

The thesis was successfully defended.

URI

http://hdl.handle.net/11012/54503

Collections

2008
