COSC347/2003
Comparative Languages


Problem 1 Parsing sentences in a constructed language
Any problem involving translation, such as compilation, requires that the original sentence be recognised as syntactically correct under the relevant rules. Parsing the sentence, i.e. identifying the roles that the various tokens (words) play in it, is not only an easy way to check correctness but also very useful for any further processing.

For this assignment you will write a program that takes in a sentence and produces a parse, if possible. Unfortunately, grammars for anything more than a very small subset of sentences in natural languages such as English are very complex (and usually ambiguous), and grammars for computer languages such as Java or C++ are extremely boring. However, there are now many constructed languages which have the expressive power of natural languages but with simple unambiguous grammars, and this assignment involves one of these. (For more on grammars, natural languages and constructed languages, see the notes for lecture 3.)

Attached to this document you will find a grammar of a (somewhat mangled) subset of Lojban. Lojban was devised to test the Sapir-Whorf hypothesis and is intended to be a 'pure' language, i.e. without any cultural or emotional baggage. For more about Lojban, see www.lojban.org/.

Write a program that will accept character strings and determine whether they are acceptable under the given grammar. If they are, produce a complete parse of the sentence; if they are not, report this fact, preferably with some indication of where the sentence failed, and how.

The Grammar:
(Note that 'parts of speech' are given Lojban names, since they do not correspond to terms used in English.)
sentence ::= statement | gek-sentence
statement ::= [NAI] [terms CU] bridi-tail
bridi-tail ::= SELBRI [CI sumti CA]
sumti ::= terms [sumti]
terms ::= term [JA terms]
term ::= DA [[NAI] tanru] | LA CMENE
tanru ::= BRIVLA [[A] tanru]
gek-sentence ::= GA statement FA statement
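
Each rule above can be read as one procedure that consumes tokens from left to right. The following is a minimal recursive-descent sketch of a recogniser, not a full solution: it only accepts or rejects, it works on a list of token class names rather than on words, and it does not check the GA/FA pairings. The names `ParseError` and `recognise` are illustrative, and the sketch assumes that the main predicate arrives as BRIVLA and is treated as SELBRI by its position.

```python
class ParseError(Exception):
    """Carries the 0-based index of the token class that caused the failure."""

    def __init__(self, pos, message):
        super().__init__(f"token {pos}: {message}")
        self.pos = pos


def recognise(classes):
    """Return True if the list of token classes forms a <sentence>.

    Raises ParseError otherwise. Assumption: SELBRI is supplied as BRIVLA.
    """
    pos = 0

    def peek():
        return classes[pos] if pos < len(classes) else None

    def expect(cls):
        nonlocal pos
        if peek() != cls:
            found = peek() or "end of input"
            raise ParseError(pos, f"expected {cls}, found {found}")
        pos += 1

    def tanru():                 # tanru ::= BRIVLA [[A] tanru]
        expect("BRIVLA")
        if peek() == "A":
            expect("A")
            tanru()
        elif peek() == "BRIVLA":
            tanru()

    def term():                  # term ::= DA [[NAI] tanru] | LA CMENE
        if peek() == "LA":
            expect("LA")
            expect("CMENE")
        else:
            expect("DA")
            if peek() == "NAI":
                expect("NAI")
                tanru()
            elif peek() == "BRIVLA":
                tanru()

    def terms():                 # terms ::= term [JA terms]
        term()
        if peek() == "JA":
            expect("JA")
            terms()

    def sumti():                 # sumti ::= terms [sumti]
        terms()
        if peek() in ("DA", "LA"):
            sumti()

    def bridi_tail():            # bridi-tail ::= SELBRI [CI sumti CA]
        expect("BRIVLA")         # a BRIVLA in this position plays the SELBRI role
        if peek() == "CI":
            expect("CI")
            sumti()
            expect("CA")

    def statement():             # statement ::= [NAI] [terms CU] bridi-tail
        if peek() == "NAI":
            expect("NAI")
        if peek() in ("DA", "LA"):
            terms()
            expect("CU")
        bridi_tail()

    def sentence():              # sentence ::= statement | gek-sentence
        if peek() == "GA":       # gek-sentence ::= GA statement FA statement
            expect("GA")
            statement()
            expect("FA")
            statement()
        else:
            statement()

    sentence()
    if pos != len(classes):
        raise ParseError(pos, f"unexpected {peek()} after end of sentence")
    return True
```

One token of lookahead is enough at every choice point in this grammar, which is why each procedure only ever calls `peek()` once per decision. Producing a parse rather than a yes/no answer would mean having each procedure return the subtree it recognised.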

The Vocabulary:
The vocabulary is split, conceptually, into three categories of words: predicates (brivla), names (cmene) and structure words ('little words', cmavo). The main predicate in a sentence is called selbri, but it is really only a brivla in disguise. Predicates correspond fairly closely to predicates in first order predicate calculus and either make claims about the world (<he> is a woman) or specify an action or relationship (x1 sells x2 to x3 for price x4; x1 is the (biological) mother of x2 by (father) x3; x1 talks to x2 about (topic) x3 in (language) x4). Predicates (basically all brivla) subsume most of what we would call nouns, verbs, adjectives and adverbs plus a healthy slice of the prepositions (such as 'above', 'below', 'before', 'after', etc.). All brivla are polysyllabic, typically bisyllabic, where a syllable is of the form CCV or CVC, except for the final syllable, which must be either CCV or CV. Here C stands for a Lojban consonant, i.e. one of b, c, d, f, g, j, k, l, m, n, p, r, s, t, v, x, z, and V stands for a Lojban vowel (a, e, i, o, u). There are rules as to what consonants may come together, either in a syllable or at junctions.
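
The syllable shape described above can be written as a regular expression. This is only a sketch of the shape test: the names `BRIVLA_SHAPE` and `looks_like_brivla` are made up for illustration, and the rules about which consonants may come together are not checked at all.

```python
import re

C = "[bcdfgjklmnprstvxz]"   # the Lojban consonants listed above
V = "[aeiou]"               # the Lojban vowels

# One or more non-final syllables (CCV or CVC), then a final CCV or CV.
BRIVLA_SHAPE = re.compile(rf"(?:{C}{C}{V}|{C}{V}{C})+(?:{C}{C}{V}|{C}{V})")


def looks_like_brivla(word):
    """True if the whole word has the syllable shape of a brivla."""
    return BRIVLA_SHAPE.fullmatch(word) is not None
```

Under this test `klama` (CCV + CV) and `bajra` (CVC + CV) pass, while `da` and `kris` do not.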

A name (cmene) is usually a transliteration from another language and must end in a consonant. (The names of the people involved with this course are la kris, la maikl and la reimnd.)

A structure word (cmavo) is a V, a CV or a CVV or various combinations of these.


Cmavo:
NAI  negation              no (there are others, but we won't worry about them)
CU   separator             cu
DA   <article>             da, de, di, do, du, ta, te, ti, to, tu
LA   before names          la
CI   open brackets         ci (as for NAI)
CA   close brackets        ca (as for NAI)
A    connectives           a, e, o, u
JA   connectives           ja, je, jo, ju
GA   beginning connective  ga, ge, go, gu
FA   medial connective     fa, fe, fo, fu
(GA and FA words must occur in the correct pairings, i.e. ga with fa, ge with fe, etc. They correspond to similar pairings in English such as 'if … then', 'either … or', etc. This is the biggest deviation from real Lojban, which does it better, but in a more complicated way.)
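
The cmavo classes above amount to a lookup table from word to token class. Here is a small sketch of a tokeniser built on that table. The function names and the `brivla` parameter are hypothetical, and the rule "any other word ending in a consonant is a CMENE" is a simplifying assumption rather than something the grammar requires.

```python
CMAVO = {
    "NAI": {"no"},
    "CU": {"cu"},
    "DA": {"da", "de", "di", "do", "du", "ta", "te", "ti", "to", "tu"},
    "LA": {"la"},
    "CI": {"ci"},
    "CA": {"ca"},
    "A": {"a", "e", "o", "u"},
    "JA": {"ja", "je", "jo", "ju"},
    "GA": {"ga", "ge", "go", "gu"},
    "FA": {"fa", "fe", "fo", "fu"},
}

# Invert the table: each cmavo belongs to exactly one class.
WORD_CLASS = {word: cls for cls, words in CMAVO.items() for word in words}

CONSONANTS = set("bcdfgjklmnprstvxz")


def tokenise(sentence, brivla):
    """Return a list of (class, word) pairs; raise ValueError on an unknown word.

    `brivla` is the set of predicate words the caller wants to accept.
    """
    tokens = []
    for word in sentence.lower().split():
        if word in WORD_CLASS:
            tokens.append((WORD_CLASS[word], word))
        elif word in brivla:
            tokens.append(("BRIVLA", word))
        elif word[-1] in CONSONANTS:     # assumption: consonant-final means a name
            tokens.append(("CMENE", word))
        else:
            raise ValueError(f"unknown word: {word!r}")
    return tokens


def correctly_paired(ga_word, fa_word):
    """ga pairs with fa, ge with fe, and so on: the vowels must agree."""
    return ga_word[1] == fa_word[1]
```

Keeping the original word alongside its class makes it possible to check the GA/FA pairing after the structure has been recognised, and to quote the offending word in an error message.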

Cmene:
(You can make these up yourselves. Check the web pages to ensure that you transliterate as accurately as possible. Remember the terminal C.)

Brivla:
Here is a list taken from the Lojban introductory text. It should be adequate.

bajra x1 runs on x2 (surface) using x3 (limbs) in manner x4 (gait)
blarno x1 (object/light source) is blue-green (modified from correct Lojban)
cutci x1 is a shoe/boot for x2 (foot) made of x3 (material)
gerku x1 is a dog of breed x2
kanro x1 is healthy by standard x2
klama x1 goes/comes to x2 (destination) from x3 (origin) via x4 (route) using x5 (transport)
kurji x1 takes care of x2
mamta x1 is the biological mother of x2 by father x3
melbi x1 (object/idea) is beautiful to x2 (observer) by standard x3
mlatu x1 is a cat / x1 'cats'
mrenu x1 is a male human (man)
ninmu x1 is a female human (woman)
patfu x1 is the biological father of x2 by mother x3
pluka x1 pleases/is pleasing to x2 (experiencer) under conditions x3
rirni x1 is a parent of x2
stali x1 stays/remains with x2
sutra x1 (agent) is fast at doing x2 (action)
tavla x1 (talker) talks to x2 (audience) about x3 (topic) in language x4
vecnu x1 (seller) sells x2 (goods) to x3 (buyer) for x4 (price)
verba x1 is a juvenile of age x2, immature by standard x3
zarci x1 is a market/store/shop selling x2 (products) operated by x3 (storekeeper)


Examples

melbi = beautiful!
SELBRI
<bridi-tail>
<statement>
<sentence>

da cu mlatu = <it> ‘cats’, i.e. <it> is a cat.
DA CU SELBRI
<term> CU SELBRI
<terms> CU SELBRI
<terms> CU <bridi-tail>
<statement>
<sentence>

da melbi e blarno verba mlatu cu pluka ci la kris jo la maikl de ca
= the (beautiful and blue-green) young cat pleases Chris and/or Michael under some conditions
DA BRIVLA A BRIVLA BRIVLA BRIVLA CU SELBRI CI LA CMENE JA LA CMENE DA CA
DA <tanru> CU SELBRI CI LA CMENE JA LA CMENE DA CA
<term> CU SELBRI CI <term> JA <term> <term> CA
<terms> CU SELBRI CI <terms> <terms> CA
<terms> CU SELBRI CI <sumti> CA
<terms> CU <bridi-tail>
<statement>
<sentence>

da melbi e blarno no verba mlatu cu pluka ci la kris jo la maikl de ca
= * the (beautiful and blue-green) not young cat pleases Chris and/or Michael under some conditions
DA BRIVLA A BRIVLA NAI BRIVLA BRIVLA CU SELBRI CI LA CMENE JA LA CMENE DA CA
DA <tanru> NAI <tanru> CU SELBRI CI LA CMENE JA LA CMENE DA CA
<term> NAI <tanru> CU SELBRI CI <term> JA <term> <term> CA
<terms> NAI <tanru> CU SELBRI CI <terms> <terms> CA
<terms> NAI <tanru> CU SELBRI CI <sumti> CA
<terms> NAI <tanru> CU <bridi-tail>
**** ERROR! **** (NAI may only begin a statement or directly follow DA, so <terms> must be followed by CU here.)