Skripty pro hromadnou úpravu fontů v PDF dokumentech

This master's thesis deals with font encoding in PDF documents. Correct font encoding is necessary for searching a document and for copying text from it. The thesis describes the internal structure of PDF documents, page representation, font types and their encodings, and the objects needed to represent fonts correctly. Knowledge of these areas is essential for developing scripts that repair incorrect font encoding. Two Python scripts were developed as part of the thesis. The first verifies the integrity of the PDF files being repaired using SHA-256 hashes computed from their contents. The second repairs corrupted font encodings in the documents. The information both scripts need is stored in corresponding JSON structures. The repair script targets PostScript Type 1 fonts. Its key element is the generation of a ToUnicode object that correctly maps glyphs to Unicode code points within the font. The script was tested on approximately 200 electronic issues of a Czech magazine that were provided as sample data. From these sample files, those with completely corrupted font encodings were selected for repair. The remaining sample issues had corrupted encoding only for characters with diacritical marks; they were analyzed, but the script is unable to repair them. Commented Python source code, compiled Windows executables, and a user guide are available in the electronic attachment and in the author's GitHub repository.
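To illustrate the ToUnicode object mentioned in the abstract, the following is a minimal sketch and not the code from the thesis. The function name `build_tounicode_cmap` and the sample mapping are hypothetical. It assumes a simple font with single-byte character codes and follows the general layout of a ToUnicode CMap stream as described in the PDF specification.

```python
def build_tounicode_cmap(code_to_text: dict[int, str]) -> str:
    """Return the text of a ToUnicode CMap for a single-byte (simple) font.

    code_to_text maps a character code (0..255) to the Unicode text
    that the glyph at that code should represent.
    """
    entries = sorted(code_to_text.items())
    for code, _ in entries:
        if not 0 <= code <= 0xFF:
            raise ValueError(f"character code out of single-byte range: {code}")

    lines = [
        "/CIDInit /ProcSet findresource begin",
        "12 dict begin",
        "begincmap",
        "/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def",
        "/CMapName /Adobe-Identity-UCS def",
        "/CMapType 2 def",
        "1 begincodespacerange",
        "<00> <FF>",
        "endcodespacerange",
    ]

    # A single bfchar section may hold at most 100 entries, so split into chunks.
    for start in range(0, len(entries), 100):
        chunk = entries[start:start + 100]
        lines.append(f"{len(chunk)} beginbfchar")
        for code, text in chunk:
            # Destination values are written as UTF-16BE in hexadecimal.
            dst = text.encode("utf-16-be").hex().upper()
            lines.append(f"<{code:02X}> <{dst}>")
        lines.append("endbfchar")

    lines += [
        "endcmap",
        "CMapName currentdict /CMap defineresource pop",
        "end",
        "end",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical mapping: code 0x41 is "A", code 0x8A is "Š" (U+0160).
    print(build_tounicode_cmap({0x41: "A", 0x8A: "Š"}))
```

In a real repair, the resulting text would be stored as a stream object and referenced from the font dictionary's ToUnicode entry. How the thesis derives the code-to-character mapping is not shown here.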

Keywords

Font, JSON, PDF, Python, Script, Unicode

Citation

GMITTER, J. Skripty pro hromadnou úpravu fontů v PDF dokumentech [online]. Brno: Vysoké učení technické v Brně. Fakulta elektrotechniky a komunikačních technologií. 2024.

Language of document

sk

Study field

no specialization

Committee

doc. Ing. Jiří Mekyska, Ph.D. (chair); prof. Ing. Miroslav Vozňák, Ph.D. (vice-chair); Ing. Pavel Hanák, Ph.D. (member); Ing. Jaromír Hrad, Ph.D. (member); Ing. et Ing. Petr Musil (member); Ing. Kryštof Novotný (member); doc. Ing. Petr Sysel, Ph.D. (member)

Date of acceptance

2024-06-06

Defence

The student presented the results of his thesis, and the committee was acquainted with the reviews. Opponent's questions: When repairing the encoding and creating the ToUnicode object, you use a relatively large amount of auxiliary data and libraries. Could OCR be used to obtain the Unicode representation of a given character? What would need to be changed to automate the whole process, or to unify the individual standalone files (opravAR.exe and Type1toUnicode.exe)? The student defended the thesis and answered the questions of the committee members and the opponent.

Result of defence

the thesis was successfully defended

URI

http://hdl.handle.net/11012/246071

Collections

2024
