8.2.1 Separating rules and lexicon

By ``separating rules and lexicon'' we mean that we want to eliminate all mention of individual words in our DCGs and instead record all the information about individual words separately in a lexicon. To see what is meant by this, let's return to our basic grammar, namely:

   np --> det,n.
   vp --> v,np.
   vp --> v.
   det --> [the].
   det --> [a].
   n --> [woman].
   n --> [man].
   v --> [shoots].

We are going to separate the rules from the lexicon. That is, we are going to write a DCG that generates exactly the same language, but in which no rule mentions any individual word. All the information about individual words will be recorded separately.

Here is an example of a (very simple) lexicon. Lexical entries are encoded by using a predicate lex/2 whose first argument is a word, and whose second argument is a syntactic category.

   lex(the,det).
   lex(a,det).
   lex(woman,n).
   lex(man,n).
   lex(shoots,v).

And here is a simple grammar that could go with this lexicon. Note that it is very similar to our basic DCG of the previous chapter. In fact, both grammars generate exactly the same language. The only rules that have changed are those that mentioned specific words, namely the det, n, and v rules.

   det --> [Word],{lex(Word,det)}.
   n --> [Word],{lex(Word,n)}.
   v --> [Word],{lex(Word,v)}.

Consider the new det rule. The ordinary part of the rule says ``a det can consist of a list containing a single element Word'' (note that Word is a variable). Then the extra test, the goal in curly brackets, adds the crucial stipulation: ``so long as Word matches something that is listed in the lexicon as a determiner''. With our present lexicon, this means that Word must be matched with either the word ``a'' or the word ``the''. So this single rule replaces the two previous DCG rules for det.
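To see the new rules at work, the lexicon and the grammar can be loaded together and queried with phrase/2. The following is a minimal sketch rather than a definitive listing: the np and vp rules are those of the basic grammar, and the s rule is an assumption based on the usual form of that grammar, since it is not listed in this section.

```prolog
% Lexicon: lex(Word, Category).
lex(the, det).
lex(a, det).
lex(woman, n).
lex(man, n).
lex(shoots, v).

% Phrase-structure rules; none of them mentions an individual word.
% The s rule is assumed here; it is not listed in this section.
s --> np, vp.
np --> det, n.
vp --> v, np.
vp --> v.

% Lexical rules: consume one word, then check its category in the lexicon.
det --> [Word], {lex(Word, det)}.
n --> [Word], {lex(Word, n)}.
v --> [Word], {lex(Word, v)}.

% Example queries:
% ?- phrase(det, [the]).                        % succeeds
% ?- phrase(det, [woman]).                      % fails
% ?- phrase(det, [W]).                          % W = the ; W = a
% ?- phrase(s, [a, woman, shoots, the, man]).   % succeeds
```

The third query shows the single det rule standing in for both of the old ones: on backtracking, Word is matched first with ``the'' and then with ``a''.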

This explains the ``how'' of separating rules from lexicon, but it doesn't explain the ``why''. Is it really so important? Is this new way of writing DCGs really that much better?

The answer is an unequivocal ``yes''! It's much better, and for at least two reasons.

The first reason is theoretical. Arguably rules should not mention specific lexical items. The purpose of rules is to list general syntactic facts, such as the fact that a sentence can be made up of a noun phrase followed by a verb phrase. The rules for s, np, and vp describe such general syntactic facts, but the old rules for det, n, and v don't. Instead, the old rules simply list particular facts: that ``a'' is a determiner, that ``the'' is a determiner, and so on. From a theoretical perspective it is much neater to have a single rule that says ``anything is a determiner (or a noun, or a verb, ...) if it is listed as such in the lexicon''. And this, of course, is precisely what our new DCG rules say.

The second reason is more practical. One of the key lessons computational linguists have learnt over the last twenty or so years is that the lexicon is by far the most interesting, important (and expensive!) repository of linguistic knowledge. Bluntly, if you want to get to grips with natural language from a computational perspective, you need to know a lot of words, and you need to know a lot about them.

Now, our little lexicon, with its simple two-place lex entries, is a toy. But a real lexicon is (most emphatically!) not. A real lexicon is likely to be very large (it may contain hundreds of thousands, or even millions, of words) and moreover, the information associated with each word is likely to be very rich. Our lex entries give only the syntactic category of each word, but a real lexicon will give much more, such as information about each word's phonological, morphological, semantic, and pragmatic properties.

Because real lexicons are big and complex, from a software engineering perspective it is best to write simple grammars that have a simple, well-defined way of pulling out the information they need from vast lexicons. That is, grammars should be thought of as separate entities which can access the information contained in lexicons. We can then use specialized mechanisms for efficiently storing the lexicon and retrieving data from it.
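One practical consequence of this separation is that the lexicon can grow without any rule being touched. The sketch below illustrates this under some assumptions of our own: lex/2 is declared dynamic so that an entry can be added at run time with assertz/1, and the word dog is a hypothetical addition that does not appear in the lexicon above. This is only one possible way of extending a lexicon, chosen here because it makes the effect easy to observe.

```prolog
% The lexicon is declared dynamic (an assumption made for this sketch)
% so that entries can be added while the program is running.
:- dynamic lex/2.

lex(the, det).
lex(a, det).
lex(woman, n).
lex(man, n).
lex(shoots, v).

% The grammar only consults lex/2; it never mentions a word itself.
np --> det, n.
det --> [Word], {lex(Word, det)}.
n --> [Word], {lex(Word, n)}.

% Example session (dog is a hypothetical new word):
% ?- phrase(np, [the, dog]).     % fails: dog is not in the lexicon
% ?- assertz(lex(dog, n)).       % add one lexical entry
% ?- phrase(np, [the, dog]).     % now succeeds, with the rules unchanged
```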

Our new DCG rules, though simple, illustrate the basic idea. The new rules really do just list general syntactic facts, and the extra tests act as an interface to our (admittedly simple) lexicon that lets the rules find exactly the information they need. Furthermore, we now take advantage of Prolog's first argument indexing, which makes looking up a word in the lexicon more efficient. First argument indexing is a technique for making access to Prolog's knowledge base more efficient: if the first argument of a query is instantiated, Prolog can ignore all clauses whose first argument has a different functor or arity. This means that we can get all the possible categories of a word such as man immediately, without having to even look at the lexicon entries for all the other hundreds or thousands of words that we might have in our lexicon.
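The two ways of querying the lexicon can be contrasted in a small sketch. The helper categories/2 is a hypothetical name introduced only for this illustration. Note that whether and how indexing is applied depends on the Prolog system in use; the answers are the same either way, and only the cost of the lookup differs.

```prolog
% Lexicon: the word is the first argument, so it is the indexed one.
lex(the, det).
lex(a, det).
lex(woman, n).
lex(man, n).
lex(shoots, v).

% categories(+Word, -Cats): hypothetical helper that collects every
% category listed for Word. Word is bound when lex/2 is called, so a
% system with first argument indexing can skip clauses for other words.
categories(Word, Cats) :-
    findall(Cat, lex(Word, Cat), Cats).

% Example queries:
% ?- lex(man, Cat).            % first argument bound: Cat = n
% ?- categories(man, Cats).    % Cats = [n]
% ?- lex(Word, n).             % first argument unbound, so first argument
%                              % indexing cannot narrow the search:
%                              % Word = woman ; Word = man
```

This is also why the lexical rules fit the lexicon well when parsing: in det --> [Word],{lex(Word,det)}, the word has already been read from the input list by the time lex/2 is called, so its first argument is instantiated.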