2.1 Building Structure while Recognizing

In the previous chapter, we learned that finite state recognizers are machines that tell us whether a given input is accepted by some finite state automaton. We can give a word to a recognizer and the recognizer will say "yes" or "no". But often that's not enough: in addition to knowing that something is accepted by a certain FSA, we would like to have an explanation of why it was accepted. Finite state parsers give us that kind of explanation by returning the sequence of transitions that was made.

This distinction between recognizers and parsers is a standard one: recognizers just say "yes" or "no", while parsers also give an analysis of the input. It applies not only to finite state machines, but to all kinds of machines that check whether some input belongs to a language, and we will make use of it throughout the course.

2.1.1 Finite State Parsers

In the case of a finite state parser, the output should tell us which transitions were made in the FSA while the input was recognized. That is, the output should be a sequence of nodes and arcs. If, for example, we gave the input [h,a,h,a,!] to a parser for our first laughing automaton, it should give us [1,h,2,a,3,h,2,a,3,!,4].

There is a fairly standard technique in Prolog for turning a recognizer into a parser: add one or more extra arguments to keep track of the structure that was found. We will now use this technique to turn recognize1/2 of the last chapter into parse1/3, i.e. a parser for FSAs without jump arcs.

In the base clause, when the input has been read completely and the FSA is in a final state, all we have to do is record that final state. So, we turn

recognize1(Node,[]) :- final(Node).

into

parse1(Node,[],[Node]) :- final(Node).

Then let's look at the recursive clause. The recursive clause of recognize1/2 looked as follows:

recognize1(Node1,String) :-
    arc(Node1,Node2,Label),
    traverse1(Label,String,NewString),
    recognize1(Node2,NewString).

And here is the recursive clause of parse1/3:

parse1(Node1,String,[Node1,Label|Path]) :-
    arc(Node1,Node2,Label),
    traverse1(Label,String,NewString),
    parse1(Node2,NewString,Path).

The parser records the state the FSA is in and the symbol it is reading on the transition it is taking from this state. The rest of the path, i.e. the sequence of states and arcs that the FSA will take from Node2 onwards, will be specified in the recursive call of parse1 and collected in the variable Path.

The only thing left to do is to adapt the driver predicates test1/1 and generate1/1. The new driver predicates look as follows:

testparse1(Symbols,Parse) :- initial(Node), parse1(Node,Symbols,Parse).

genparse1(Symbols,Parse) :- testparse1(Symbols,Parse).
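To try the parser out, the pieces above can be put into a single file. The following is a minimal sketch: the facts are those of the first laughing automaton, and the definition of traverse1/3 is an assumption about how it was written in the last chapter (a single clause that consumes one input symbol equal to the arc label, with no jump arcs).

```prolog
% First laughing automaton.
initial(1).
final(4).
arc(1,2,h).
arc(2,3,a).
arc(3,4,!).
arc(3,2,h).

% Assumed definition from the last chapter (no jump arcs):
% succeed if the next input symbol equals the arc label.
traverse1(Label,[Label|Symbols],Symbols).

% Base clause: input used up and a final state reached.
parse1(Node,[],[Node]) :-
    final(Node).
% Recursive clause: record the current node and the label read,
% then let the recursive call fill in the rest of the path.
parse1(Node1,String,[Node1,Label|Path]) :-
    arc(Node1,Node2,Label),
    traverse1(Label,String,NewString),
    parse1(Node2,NewString,Path).

testparse1(Symbols,Parse) :-
    initial(Node),
    parse1(Node,Symbols,Parse).

genparse1(Symbols,Parse) :-
    testparse1(Symbols,Parse).

% ?- testparse1([h,a,h,a,!],Parse).
% Parse = [1,h,2,a,3,h,2,a,3,!,4].
%
% ?- genparse1(Symbols,Parse).
% Symbols = [h,a,!],
% Parse = [1,h,2,a,3,!,4] ;
% Symbols = [h,a,h,a,!],
% Parse = [1,h,2,a,3,h,2,a,3,!,4] ;
% ...
```

With the arcs listed in this order, generation enumerates the accepted strings from shortest to longest; a different order of the arc/3 facts may change the order of the answers.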

Now, let's step through an example to see how the output is built in the extra argument during recognition. Assume that we have loaded the Prolog representation of our first laughing automaton into the Prolog database. So the database contains the following facts:

initial(1).
final(4).
arc(1,2,h).
arc(2,3,a).
arc(3,4,!).
arc(3,2,h).

We ask Prolog the following query:

?- testparse1([h,a,!],Parse).

Prolog retrieves 1 as the only initial node in this FSA and calls parse1/3 instantiated as

parse1(1,[h,a,!],Parse).

Next, Prolog has to retrieve arcs starting in node 1 from the database. It finds arc(1,2,h), which it can use because the first symbol in the input is h as well. So, Parse is unified with [1,h|_G67] where _G67 is some Prolog internal variable. Prolog then makes a recursive call (the first recursive call) of parse1 with

parse1(2,[a,!],_G67).

Now, Prolog finds arc(2,3,a) in the database. So, _G67 gets unified with [2,a|_G68] (_G68 again being some internal variable) and Prolog makes the second recursive call of parse1:

parse1(3,[!],_G68).

Using arc(3,4,!) the last symbol of the input can be read and _G68 gets instantiated to [3,!|_G69]. The next recursive call of parse1 (parse1(4,[],_G69)) matches the base clause. Here, _G69 gets instantiated to [4]. Because each of these variables is the open tail of the list built one call earlier, this single binding also instantiates _G68 to [3,!,4], _G67 to [2,a,3,!,4], and Parse to [1,h,2,a,3,!,4], so the complete parse is available when Prolog comes back out of the recursion.

If you have trouble understanding how the output gets assembled, draw a search tree for the query parse1(1,[h,a,!],Parse). Note how with every recursive call of parse1 the third argument gets instantiated with a list. The first two elements of this list are the state the FSA is currently in and the next symbol it reads; the rest of the list is an uninstantiated variable at first, but gets further instantiated by the next recursive call of parse1.
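The way the result list grows through its open tail can also be replayed by hand, without any automaton. The sketch below uses a hypothetical helper, build_by_hand/1, which is not part of the parser; it simply performs the same four unifications as the three recursive calls and the base clause in the walkthrough, with T1, T2 and T3 standing in for _G67, _G68 and _G69.

```prolog
% Hypothetical helper (illustration only): replays the unifications
% from the walkthrough of parse1(1,[h,a,!],Parse).
build_by_hand(Parse) :-
    Parse = [1,h|T1],   % initial call: tail T1 is still unbound
    T1 = [2,a|T2],      % first recursive call fills T1
    T2 = [3,!|T3],      % second recursive call fills T2
    T3 = [4].           % base clause closes the list

% ?- build_by_hand(Parse).
% Parse = [1,h,2,a,3,!,4].
%
% Stopping after two steps shows the partially built list:
% ?- Parse = [1,h|T1], T1 = [2,a|T2].
% Parse = [1,h,2,a|T2],
% T1 = [2,a|T2].
```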

2.1.2 Separating out the Lexicon

In the practical session of the last chapter you were asked to construct a finite state automaton recognizing those English noun phrases that can be built from the words the, a, wizard, witch, broomstick, hermione, harry, ron, with, fast. The FSA that you came up with probably looked similar to the following one, given here in its Prolog representation:

initial(1).
final(3).
arc(1,2,a).
arc(1,2,the).
arc(2,2,brave).
arc(2,2,fast).
arc(2,3,witch).
arc(2,3,wizard).
arc(2,3,broomstick).
arc(2,3,rat).
arc(1,3,harry).
arc(1,3,ron).
arc(1,3,hermione).
arc(3,1,with).

Now, what would Prolog answer if we used the parser of the previous section on this automaton to parse the input [the,fast,wizard]? It would return [1,the,2,fast,2,wizard,3]. This tells us how the FSA was traversed in recognizing that this input is indeed a noun phrase. But in a way, it would be even nicer if we got a more abstract explanation saying, e.g., that [the,fast,wizard] is a noun phrase because it consists of a determiner followed by an adjective which is followed by a common noun. That is, we would like the parser to return something like this:

[1,det,2,adj,2,noun,3].

Actually, you were probably already making a similar abstraction when you were thinking about how to construct that FSA. You were probably thinking: "Well, a noun phrase starts with a determiner, can be followed by zero or more adjectives, and ends in a noun; the and a are the determiners that I have, so I need a the and an a transition from state 1 to state 2." And, in fact, it would be a lot nicer if you could specify transitions in the FSA based on categories like determiner, common noun, and so on, and additionally give a separate lexicon which specifies what words belong to each category. Like this, for example:

initial(1).
final(3).
arc(1,2,det).
arc(2,2,adj).
arc(2,3,cn).
arc(1,3,pn).
arc(3,1,prep).

lex(a,det).
lex(the,det).
lex(fast,adj).
lex(brave,adj).
lex(witch,cn).
lex(wizard,cn).
lex(broomstick,cn).
lex(rat,cn).
lex(harry,pn).
lex(hermione,pn).
lex(ron,pn).
lex(with,prep).
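Since the lexicon is just a set of lex/2 facts, it can be queried in both directions: from a word to its category, or from a category to its words. The short sketch below illustrates this; words_of/2 is a hypothetical helper introduced only for this example and relies on the standard findall/3.

```prolog
lex(a,det).
lex(the,det).
lex(fast,adj).
lex(brave,adj).
lex(witch,cn).
lex(wizard,cn).
lex(broomstick,cn).
lex(rat,cn).
lex(harry,pn).
lex(hermione,pn).
lex(ron,pn).
lex(with,prep).

% Hypothetical helper: collect all words of a given category,
% in the order in which the lex/2 facts are listed.
words_of(Category,Words) :-
    findall(Word, lex(Word,Category), Words).

% ?- lex(wizard,Category).
% Category = cn.
%
% ?- words_of(det,Words).
% Words = [a,the].
```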

It's not very difficult to change our recognizer to work with FSA specifications that, like the above, define their transitions in terms of categories instead of symbols and then use a lexicon to map those categories to symbols or the other way round. The only thing that changes is the definition of the traverse predicate. We don't simply compare the label of the arc with the next symbol of the input anymore, but have to access the lexicon to check whether the next symbol of the input is a word of the category specified by the label of the arc. That means, instead of

traverse2('#',String,String).
traverse2(Label,[Label|Symbols],Symbols).

we use

traverse3('#',String,String).
traverse3(Label,[Symbol|Symbols],Symbols) :-
    lex(Symbol,Label).
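The same change carries over to the parser of the previous section. As a sketch, assume a predicate parse3/3 that is identical to parse1/3 except that it calls traverse3/3; the names parse3 and testparse3 are hypothetical and chosen here only for illustration. Because the recursive clause records the arc label, and the arc labels are now categories, the path it returns is the more abstract analysis asked for above, with cn as the label for common nouns.

```prolog
% Category-based FSA for noun phrases.
initial(1).
final(3).
arc(1,2,det).
arc(2,2,adj).
arc(2,3,cn).
arc(1,3,pn).
arc(3,1,prep).

% Lexicon.
lex(a,det).
lex(the,det).
lex(fast,adj).
lex(brave,adj).
lex(witch,cn).
lex(wizard,cn).
lex(broomstick,cn).
lex(rat,cn).
lex(harry,pn).
lex(hermione,pn).
lex(ron,pn).
lex(with,prep).

% Jump arcs consume no input; otherwise the next word must
% belong to the category named by the arc label.
traverse3('#',String,String).
traverse3(Label,[Symbol|Symbols],Symbols) :-
    lex(Symbol,Label).

% Hypothetical parser: parse1/3 with traverse3/3 substituted.
parse3(Node,[],[Node]) :-
    final(Node).
parse3(Node1,String,[Node1,Label|Path]) :-
    arc(Node1,Node2,Label),
    traverse3(Label,String,NewString),
    parse3(Node2,NewString,Path).

% Hypothetical driver, analogous to testparse1/2.
testparse3(Symbols,Parse) :-
    initial(Node),
    parse3(Node,Symbols,Parse).

% ?- testparse3([the,fast,wizard],Parse).
% Parse = [1,det,2,adj,2,cn,3].
```

Note that this sketch is meant for parsing a given input. Calling it with an unbound input list does not enumerate all noun phrases, because the adjective loop at node 2 is tried before the common noun arc and can be taken indefinitely.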