Algorithm for converting a finite state machine into a regular expression

2010-10-05 by qntm

It is known that every regular language has a corresponding deterministic finite state machine (actually, several; finite state machines may be equivalent) and a corresponding regular expression (actually, several; regular expressions may be equivalent). Converting regexes into finite state machines can be done relatively simply by hand, and with a little effort this process can be automated. But converting finite state machines back into regular expressions is relatively difficult by hand, and even more challenging to accomplish programmatically.

I have found a new way to do this. I think.

Several such algorithms do exist. Many of these algorithms result in unreadably large and complex regexes. As mentioned, many regexes are equivalent, there is no canonical "simple" regex for a given regular language and programatically simplifying a regular expression is of unlimited algorithmic complexity. This means that an algorithm which generally produces a simple output is highly desirable.

Some examples.

String algebra

First, we consider the set of all states in our finite state machine. Call them "A", "B", "C" and so on. Use variables as shorthand terms, such as:

A = { strings a such that inputting a into our FSM would leave it in state "A" }

Next we can define juxtaposition as our "string concatenation" operator.

ab = { "ab" }
Ax = { a . "x" : a is in A }
AB = { a . b : a is in A, b is in B }

And define "*" as our "infinite loop" operator, and "|" as our "union" or "or" operator:

A* = { "" } | A | AA | AAA | ...
a* = { "", "a", "aa", "aaa", "aaaa", ... }

Here, "" is the empty string.

Brzozowski Algebraic Method

Suppose our machine has states "A", "B", "C", "D" and "E", a simple two-character alphabet {"a", "b"} and a transition function f where:

f("A", "a") = "B". f("A", "b") = "D". State "A" is the initial state.
f("B", "a") = "C". f("B", "b") = "E".
f("C", "a") = "C". f("C", "b") = "E". State "C" is final.
f("D", "a") = "B". f("D", "b") = "D".
f("E", "a") = "B". f("E", "b") = "D". State "E" is final.

We then formulate this as a set of simultaneous equations:

A = { "" } (since it is the initial state)
B = Aa | Da | Ea
C = Ba | Ca
D = Ab | Db | Eb
E = Bb | Cb

Now, because we want a list of all the strings which this FSM will accept, observe that what we seek to calculate is C union E. This can be achieved using algebraic substitution... of a sort.

The order in which substitutions are carried out is quite important. Since C union E is desired, these should be calculated last. Here is one way to do it.

Since A is known, eliminate it:

B = a | Da | Ea
C = Ba | Ca
D = b | Db | Eb
E = Bb | Cb

Next eliminate B:

C = aa | Ca | Daa | Eaa
D = b | Db | Eb
E = ab | Cb | Dab | Eab

Leave C for the moment. Notice how our expression for D contains D itself as a term on the right-hand side. This means that there is a loop present here: for any string which leads to "D", you can add a "b" on the end and you will still be at "D". Or, indeed, any number of instances of "b"! So we can simplify this:

C = aa | Ca | Daa | Eaa
D = bb* | Ebb*
E = ab | Cb | Dab | Eab

Eliminate D:

C = (aa|bb*aa) | Ca | E(aa|bb*aa)
E = (ab|bb*ab) | Cb | E(ab|bb*ab)

It's pot luck from here on. It's best to starify a term last thing before you eliminate it, so starify C:

C = (aa|bb*aa)a* | E(aa|bb*aa)a*
E = (ab|bb*ab) | Cb | E(ab|bb*ab)

Eliminate C:

E = ((ab|bb*ab)|(aa|bb*aa)a*b) | E((ab|bb*ab)|(aa|bb*aa)a*b)

Starify E:

E = ((ab|bb*ab)|(aa|bb*aa)a*b)((ab|bb*ab)|(aa|bb*aa)a*b)*

Substitute in to C:

C = (aa|bb*aa)a*|((ab|bb*ab)|(aa|bb*aa)a*b)((ab|bb*ab)|(aa|bb*aa)a*b)*(aa|bb*aa)a*

So our answer is the disgustingly ugly:

(aa|bb*aa)a*|((ab|bb*ab)|(aa|bb*aa)a*b)((ab|bb*ab)|(aa|bb*aa)a*b)*(aa|bb*aa)a*|((ab|bb*ab)|(aa|bb*aa)a*b)((ab|bb*ab)|(aa|bb*aa)a*b)*

This answer is technically correct but it's ridiculous. Still, it's a naive application of the algorithm. As alluded to above, there are many and various obvious ways to simplify each expression at each stage. A different substitution route might also have helped. This is the first time I've tried this calculation by hand, in fact, and I thought it would have worked out better. Maybe different methods could be applied to make this work out better.

Anyway my point is:

My method

Instead of constructing one equation for each state in the FSM, construct an equation for each state-set in C. Construct these state-sets recursively, working backwards from the state-set consisting of all the final states, in this case {"C", "E"} (write C|E). This is done by manufacturing an "inverse map", g. For each symbol in our alphabet, we find all the states where f(<this state>, <this symbol>) is in C|E. Thus:

For symbol "a", we have f("B", "a") = "C", f("C", "a") = "C". So say that g(C|E, "a") = B|C.
For symbol "b", we have f("B", "b") = "E", f("C", "b") = "E". So say that g(C|E, "b") = B|C.

A new state-set, B|C, has appeared. We repeat the process for this state-set, iterating backwards until we stop hitting repeats. (For a k-state FSA, there are only 2k possible state-sets so this is bound to happen eventually.)

For symbol "a", we have f("A", "a") = "B", f("B", "a") = "C", f("C", "a") = "C", f("D", "a") = "B", f("E", "a") = "B". So say that g(B|C, "a") = A|B|C|D|E.
For symbol "b", we have nothing. So say that g(B|C, "b") = {} (the empty state-set).

A new state-set A|B|C|D|E has now appeared. Ignore the empty set.

For symbol "a", we have f("A", "a") = "B", f("B", "a") = "C", f("C", "a") = "C", f("D", "a") = "B", f("E", "a") = "B". So say that g(A|B|C|D|E, "a") = A|B|C|D|E.
For symbol "b", we have f("A", "b") = "D", f("B", "b") = "E", f("C", "b") = "E", f("D", "b") = "D", f("E", "b") = "D". So say that g(A|B|C|D|E, "b") = A|B|C|D|E.

No new state-sets have appeared, so we are now done.

Finally, observe that A|B|C|D|E contains the initial state "A", so it can be reached with an empty string, "".

So we construct the following simultaneous equations. Putting them in the reverse of the order in which they were found makes the substitution simplest.

A|B|C|D|E = "" | (A|B|C|D|E)a | (A|B|C|D|E)b
B|C = (A|B|C|D|E)a
C|E = (B|C)a | (B|C)b

We can then substitute calculated values, much like the Brzozowski algorithm. First solve for A|B|C|D|E:

A|B|C|D|E
= "" | (A|B|C|D|E)a | (A|B|C|D|E)b
= "" | (A|B|C|D|E)(a|b)
= (a|b)*

Then for B|C:

B|C
= (A|B|C|D|E)a
= (a|b)*a

And finally for C|E:

C|E
= (B|C)a | (B|C)b
= (B|C)(a|b)
= (a|b)*a(a|b)

...which is our result.

Notes

I haven't done any exhaustive analysis of common regular expressions, commonly-arising finite state machines, clever ways to implement the Brzozowski algorithm, comparisons in the field between the relative efficiencies of my algorithm and Brzozowski's, or anything. I must confess, the amount of mathematical research I've done here is minimal; all I've done is make this new algorithm. (At least, I think it's new. What do I know?)

The finite state machine used here was picked because it is referred to here and proved to be extremely challenging for me to regexify programmatically (even by hand it's pretty tricky). By dumb luck it turns out that my algorithm handles it pretty well. There are FSMs for which even my algorithm produces pretty unreadable output. How much use is this? You tell me.

I'm just throwing this out there. If this hasn't been done already, it may be the first genuinely new piece of vaguely mathematical work I've ever done, which would be stunning. A Python implementation follows.

Algorithm for converting a finite state machine into a regular expression