Counting Automata in Regular Expression Matching (Čítačové automaty ve vyhledávání podle regulárních výrazů)

Matching of regular expressions (regexes) is widely used, e.g., for searching, data validation, parsing, finding and replacing, data scraping, or syntax highlighting in many programming languages. It is a computationally intensive process that is often applied to large texts, so the predictability of its efficiency has a significant impact on the overall usability of software applications in practice. The problem is that standard approaches to regex matching suffer from high worst-case complexity: an unlucky combination of a regex and a text may increase the matching time by orders of magnitude. This opens the door to the so-called Regular expression Denial of Service (ReDoS) attack, in which the attacker causes a denial of service by providing a specially crafted regex or text. Automata-based matchers are the most efficient regex matching engines used in practice today, especially in performance-critical industrial applications. Years of empirical evidence show that their performance is much more stable than that of the more traditional backtracking-based matchers. Automata-based matchers may run into trouble too, however. Bounded repetition, i.e., expressions such as [ab]{100} with a specified number of repetitions of a certain pattern, has been recognised as a major source of problems for even the fastest matchers. This thesis studies the issue systematically. We first present a large-scale study of the vulnerability of automata-based matching, focused on bounded repetition. To this end, we propose a new ReDoS generator. It is the first generator capable of utilising bounded repetition to attack automata-based matchers, and in fact the first generator that can attack them at all. With it, we show experimentally that bounded repetition indeed poses a serious security threat for automata-based as well as backtracking-based matchers.
We then propose a solution to the problem of efficient matching of regexes with bounded repetition. The approach is to compile the regexes into nondeterministic counting automata (CAs) and then to determinise them. The main problem is to find a succinct deterministic representation that allows fast matching (naive determinisation builds deterministic finite automata (DFAs) that are exponentially large in the size of the regex and in the repetition bounds it contains). As a first step, we propose a determinisation algorithm based on the classical subset construction that generates deterministic CAs, which are exponentially more succinct than the corresponding DFAs. The main contribution of the thesis is obtained by elaborating this determinisation using the idea of representing many counters with counting sets. We propose a succinct transformation of a CA into a deterministic counting-set automaton (CsA), an automaton with a special type of registers that can hold a set of integer values. We also propose a novel compilation of regexes to CAs that generalises Antimirov's derivative construction, and we design a matching framework based on CsA simulation and Antimirov's derivative construction. We compare the matching speed of individual matching engines on a comprehensive set of real-world regexes with bounded repetition. We find that our algorithm is much more robust, outperforms state-of-the-art matchers on regexes with bounded repetition, and does not depend on the size of the repetition bounds. It easily solves most cases in which existing matchers struggle due to bounded repetition.
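The counting-set idea from the abstract can be illustrated with a small sketch. It is not the thesis's implementation: the class `CountingSet` and the helper `match_ends` are hypothetical names, and the pattern is fixed to `a.{n}` (with `.` standing for any character) to keep the example short. A DFA for this pattern has to remember which of the last n+1 characters were `a`, which can take up to 2^(n+1) states. The sketch instead keeps a single register holding the set of active counter values, stored as a queue with a shared offset so that each step costs amortised constant time regardless of n.

```python
from collections import deque


class CountingSet:
    """Set of distinct counter values (hypothetical sketch).

    Values are stored implicitly: each entry is the offset at which a
    counter was inserted, so its current value is offset - stamp.
    """

    def __init__(self):
        self._stamps = deque()  # oldest entry (largest value) on the left
        self._offset = 0        # number of increments performed so far

    def increment_all(self):
        # Incrementing every counter only moves the shared offset.
        self._offset += 1

    def insert_zero(self):
        # Add a counter with value 0 unless one is already present.
        if not self._stamps or self._stamps[-1] != self._offset:
            self._stamps.append(self._offset)

    def drop_above(self, bound):
        # Discard counters that have run past the repetition bound.
        while self._stamps and self._offset - self._stamps[0] > bound:
            self._stamps.popleft()

    def max(self):
        # Largest counter value, or None if the set is empty.
        return self._offset - self._stamps[0] if self._stamps else None


def match_ends(text, n):
    """Indices at which a match of 'a' followed by n arbitrary characters ends."""
    counters = CountingSet()
    ends = []
    for i, ch in enumerate(text):
        counters.increment_all()   # every open repetition consumes ch
        counters.drop_above(n)     # repetitions longer than n are dead
        if ch == "a":
            counters.insert_zero() # a new repetition may start here
        if counters.max() == n:
            ends.append(i)         # some repetition has reached exactly n
    return ends
```

For example, `match_ends("abab", 1)` reports the positions directly after each `a`. The work per input character does not grow with `n`, which is the property the abstract refers to as not depending on the size of the repetition bounds.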

Keywords

Regular expression matching, bounded repetition, ReDoS, determinisation, Antimirov's derivatives, counting automata, counting-set automata.

Citation

HOLÍKOVÁ, L. Čítačové automaty ve vyhledávání podle regulárních výrazů [online]. Brno: Vysoké učení technické v Brně. Fakulta informačních technologií. .

Language of document

en

Study field

Computer Science and Engineering

Committee

prof. Ing. Lukáš Sekanina, Ph.D. (chair)
doc. RNDr. Tomáš Brázdil, Ph.D. (member)
doc. RNDr. Tomáš Masopust, Ph.D. (member)
Prof. Dr. Roland Meyer (member)
doc. RNDr. Jan Strejček, Ph.D. (member)

Defence

The student presented the goals and results achieved in the dissertation. The student competently answered the questions of the committee members, reviewers, and guests. The discussion is recorded on the discussion sheets, which are attached to the protocol. Number of discussion sheets: 1. The committee agreed unanimously that the student has fulfilled the requirements for being awarded the academic title Ph.D. The committee recommends awarding the thesis the dean's prize.

Result of defence

The thesis was successfully defended

Document licence

Standard licence agreement - unrestricted access to the full text