Parsing with derivatives - Elegant matching of regular languages in clojure

Oct 2, 2016 • Yehonathan Sharvit

Introduction

This article is an interactive version of the first part of this paper: Parsing with derivatives - a Functional Pearl - the part that shows how to implement a regular expression matcher in a few lines of code using a concept from 1964 named Brzozowski’s derivative.

As David Nolen explains it in his talk about this paper at Papers We Love, clojure.spec implementation is based on this paper.

In this article, we are going to implement a regular expression matcher in clojure using Brzozowski’s derivative.

The theoretical part is a bit abstract but it worths making the effort as the implementation is really elegant.

Preliminary: Formal languages

Definition: A language L is a set of strings.

Definition: A string w is a sequence of characters from an alphabet A.

Examples of languages:

The empty language ∅ that contains no strings at all
The null language ϵ contains only the empty string ""
The singleton languages that contain a single string made of a single character from the alphabet

Regular languages

A regular language is a language that can be defined from atomic languages using 3 basic operations - without using recursion:

Concatenation of L1 and L2 - noted L1∘L2: the strings formed by concatenating a string from L1 followed by a string from L2.
Union of L1 and L2 - noted L1∪L2: the strings that belong either to L1 or to L2.
Repetition (a.k.a Kleene star) of L - noted L∗: the strings formed by concatenating a finite number (including zero) of strings from L.

We can encode a regular language programatically using types.

All the code snippets of this page are live and interactive powered by the klipse plugin:

Live: The code is executed in your browser
Interactive: You can modify the code and it is evaluated as you type

Here, in clojure, we are going to use types, type detectors and convenient construtors that deal with multi-arity.

The empty language:

xxxxxxxxxx

(deftype empty-type [])

(def empty ->empty-type)

(defn empty? [x] (= (type x) empty-type))

(empty? (empty))

the evaluation will appear here (soon)...

The null language ϵ (that contains only the empty string):

xxxxxxxxxx

(deftype eps-type [])

(def eps ->eps-type)

(defn eps? [x] (= (type x) eps-type))

(eps? (eps))

xxxxxxxxxx

the evaluation will appear here (soon)...

The language that contains a single string made of a single character:

xxxxxxxxxx

(deftype char-type [c])

(def char ->char-type)

(defn char? [x] (= (type x) char-type))

(char? (char \a))

xxxxxxxxxx

the evaluation will appear here (soon)...

The concatenation of two or more languages:

xxxxxxxxxx

(deftype cat-type [left right])

(defn cat

  ([a b]  (cat-type. a b))

  ([a b & args] (apply cat (cat-type. a b) args)))

(defn cat? [x] (= (type x) cat-type))

(cat? (cat (char \a) (char \b) (char \c)))

xxxxxxxxxx

the evaluation will appear here (soon)...

The union (a.k.a alt) of two or more languages:

xxxxxxxxxx

(deftype alt-type [this that])

(defn alt

  ([a b]  (alt-type. a b))

  ([a b & args] (apply alt (alt-type. a b) args)))

(defn alt? [x] (= (type x) alt-type))

(alt? (alt (char \a) (char \b) (char \c)))

xxxxxxxxxx

the evaluation will appear here (soon)...

The repetition of a language:

xxxxxxxxxx

(deftype rep-type [lang])

(def rep ->rep-type)

(defn rep? [x] (= (type x) rep-type))

(rep? (rep (char \a)))

xxxxxxxxxx

the evaluation will appear here (soon)...

For instance the regexp a*b[ce] is encoded like this:

xxxxxxxxxx

(cat (rep (char \a))

     (char \b)

     (alt (char \c)

          (char \e)))

xxxxxxxxxx

the evaluation will appear here (soon)...

The regexp for detecting a floating point number [-+]?[0-9]*\.?[0-9]+ is encoded this way:

xxxxxxxxxx

(def digit (alt (char \0) (char \1)(char \2)(char \3)(char \4)(char \5)(char \6)(char \7)(char \8)(char \9)))

(def float-number (cat (alt (eps)

                            (alt (char \+) (char \-)))

                       (rep digit)

                       (char \.)

                       digit

                       (rep digit)))

xxxxxxxxxx

the evaluation will appear here (soon)...

The nullability function

It turns out that the fact that a language contains the empty string or not is essential. (We will explain why later…)

Let’s define the nullability function δ:

δ(L)=true if ϵ∈L δ(L)=false if ϵ∉L

The nullability function has the following recursive properties:

δ(∅)=false δ(ϵ)=true δ(c)=false δ(L1∪L2)=δ(L1) or δ(L2) δ(L1∘L2)=δ(L1) and δ(L2) δ(L∗)=true

It’s trivial to code this function in clojure:

xxxxxxxxxx

(defn d "Nullability function - returns true if the language contains the empty string"

[L]

  (cond

    (empty? L) false

    (eps? L) true

    (char? L) false

    (rep? L) true

    (alt? L) (or (d (.-this L)) (d (.-that L)))

    (cat? L) (and (d (.-left L)) (d (.-right L)))

    :else (throw (str "not a language: " L ", c: " c))))

xxxxxxxxxx

the evaluation will appear here (soon)...

Let’s test this function:

xxxxxxxxxx

(d (cat (char \a) (char \b) (char \c)))

xxxxxxxxxx

the evaluation will appear here (soon)...

xxxxxxxxxx

(d (alt (char \a) (eps)))

xxxxxxxxxx

the evaluation will appear here (soon)...

xxxxxxxxxx

(d (alt (char \a) (rep (char \b))))

xxxxxxxxxx

the evaluation will appear here (soon)...

Brzozowski’s derivative

In 1964, Brzozowski (don’t try to pronounce his name) introduced a simple yet powerful concept in his work on recognition of regular languages:

Definition: The derivative of a language L with respect to a character c is a new language that has been “filtered” and “chopped” - Dc(L):

First, retain only the strings that start with the character c.
Second, chop that first character off every string.

In mathematical notation:

Dc(L)={w:cw∈L}.

Examples:

Db{foo,bar,baz}={ar,az} Df{foo,bar,baz}={oo} Da{foo,bar,baz}=∅

Code for Brzozowski’s derivative

The nice thing about the derivative is that it can be defined recursively from the atomic languages.

We will first enumerate its recursive properties and then implement it in a couple of lines of code.

For the atomic languages:

Dc(∅)=∅ Dc(ϵ)=∅ Dc(c)=ϵ Dc(c′)=∅ if c≠c′

For the union:

Dc(L1∪L2)=Dc(L1)∪Dc(L2)

For the Kleene star:

Dc(L∗)=Dc(L)∘L∗

For the concatenation:

Dc(L1∘L2)=Dc(L1)∘L2 if ϵ∉L1

Dc(L1∘L2)=Dc(L1)∘L2∪Dc(L2) if ϵ∈L1

And now the code:

xxxxxxxxxx

(defn D "returns Brzozowski's derivative of L by c"

  [c L]

  (cond

    (empty? L) (empty)

    (eps? L) (empty)

    (char? L) (if (= (.-c L) c)

                (eps)

                (empty))

    (alt? L) (alt (D c (.-this L))

                  (D c (.-that L)))

    (cat? L) (let [L1 (.-left L)

                   L2 (.-right L)]

               (if (d L1)

                 (alt (cat (D c L1) L2)

                      (D c L2))

                 (cat (D c L1) L2)))

    (rep? L) (cat (D c (.-lang L)) L)

    :else (throw (str "not a language: " L ", c: " c))))

xxxxxxxxxx

the evaluation will appear here (soon)...

The importance of the derivative

To determine membership of a string, derive the language with respect to each character of the string and check if the final language contains the null string: if yes, s belongs to the language; otherwise it doesn’t

For instance, let’s derivate the string abc with respect to a then b then c - and computes the nullability function of the result:

xxxxxxxxxx

(->> (cat (char \a) (char \b) (char \c))

  (D \a)

  (D \b)

  (D \c)

d)

xxxxxxxxxx

the evaluation will appear here (soon)...

We can code this property, very easilly:

xxxxxxxxxx

(defn matches?

  "Determines whether a string belongs to a language"

  [w L]

  (if (clojure.core/empty? w)

    (d L)

    (matches? (subs w 1) (D (first w) L))))

xxxxxxxxxx

the evaluation will appear here (soon)...

Note that we used the fully qualified clojure.core/empty? because we have defined empty? to be the type detector of the empty language.

Actually, we are done: matches? is our regular expression matcher in clojure.

Let’s see it in action with the float-number language that we have defined above:

xxxxxxxxxx

(map #(matches? % float-number) ["-2.0" "1" "" "+12.12" "1.0"])

xxxxxxxxxx

the evaluation will appear here (soon)...

All the code of this article is consolidated in an interactive gist.

Happy parsing!

If you enjoy this kind of interactive articles would you consider a (small) donation💸 on Patreon or at least giving a star⭐ for the Klispe repo on Github?

to stay up-to-date with the coolest interactive articles around the world.

Discover more cool interactive articles about javascript, clojure[script], python, ruby, scheme, c++ and even brainfuck!

Give Klipse a Github star to express how much you appreciate Code Interactivity.

Subscribe to the Klipse newsletter:

Feel free to email me [email protected] for getting practical tips and tricks in writing your first interactive blog post.

Parsing with derivatives - Elegant matching of regular languages in clojure

Parsing with derivatives - Elegant matching of regular languages in clojure

Introduction

Preliminary: Formal languages

Regular languages

The nullability function

Brzozowski’s derivative

Code for Brzozowski’s derivative

The importance of the derivative

Recommend

Custom defn macro with clojure.spec - part 1: conform/unform

Custom defn macro - part 2: playing with parse trees

bootrapped (self-host) namespace loading strategy in KLIPSE - insights into the...

KLIPSE talk at ClojureX 2016

Core.match interactive tutorial

In javascript, setTimeout is forced to be >= 4 msec

str vs. pr-str - humans vs. machines

JSX (react) in depth - an interactive tutorial

ES proposal: Object.entries() and Object.values()

React.js controlled components

About Joyk