COSC347/2003
Comparative Languages
Problem 1 Parsing sentences in a constructed language
Any problem involving translation, for example compilation, requires
that the original sentence be recognised as syntactically correct
under the relevant rules. Parsing the sentence (i.e. identifying the
roles the various tokens (words) play in the sentence) is not only an
easy way to check correctness but also very useful for any further
processing.
For this assignment you will be required to write a program that will
take in a sentence and produce a parse, if possible. Unfortunately,
grammars for anything more than a very small subset of sentences in
natural languages such as English are very complex (and usually
ambiguous) and grammars for computer languages such as Java or C++
are extremely boring. However, there are now many constructed
languages which have the expressive power of natural languages but
with simple unambiguous grammars, and this assignment involves one of
these. (For more on grammars, natural languages and constructed
languages, see the notes for lecture 3.)
Attached to this document you will find a grammar of a (somewhat
mangled) subset of Lojban. Lojban was devised to test the Sapir-Whorf
hypothesis and is intended to be a 'pure' language, i.e. without any
cultural or emotional baggage. For more about Lojban see
www.lojban.org/.
Write a program that will accept character strings and determine
whether they are acceptable under the given grammar. If they are,
produce a complete parse of the sentence; if they are not, report
this fact, preferably with some indication of where the sentence
failed, and how.
The Grammar:
(Note that 'parts of speech' are given Lojban names, since they do
not correspond to terms used in English.)
sentence ::= statement | gek-sentence
statement ::= [NAI] [terms CU] bridi-tail
bridi-tail ::= SELBRI [CI sumti CA]
sumti ::= terms [sumti]
terms ::= term [JA terms]
term ::= DA [[NAI] tanru] | LA CMENE
tanru ::= BRIVLA [[A] tanru]
gek-sentence ::= GA statement FA statement
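Since each rule above can be decided by looking at the next token only, the grammar lends itself to a recursive-descent parser with one function per rule. The following is a minimal sketch in Python, not a model answer. It assumes that some earlier step (not shown) has already turned the words into a list of category names such as "DA" or "BRIVLA", and that the main predicate arrives as its own kind "SELBRI". The names Parser, ParseError and parse, and the nested-tuple form of the parse, are illustrative choices rather than anything the assignment prescribes.

```python
class ParseError(Exception):
    """Records the index of the offending token and what was expected there."""
    def __init__(self, pos, expected):
        super().__init__(f"token {pos}: expected {expected}")
        self.pos = pos
        self.expected = expected


class Parser:
    def __init__(self, kinds):
        self.kinds = list(kinds)   # token categories, assumed pre-classified
        self.pos = 0

    def peek(self):
        return self.kinds[self.pos] if self.pos < len(self.kinds) else None

    def expect(self, kind):
        if self.peek() != kind:
            raise ParseError(self.pos, kind)
        self.pos += 1
        return kind

    def sentence(self):            # sentence ::= statement | gek-sentence
        if self.peek() == "GA":    # gek-sentence ::= GA statement FA statement
            self.expect("GA")
            first = self.statement()
            self.expect("FA")
            second = self.statement()
            tree = ("sentence", ("gek-sentence", "GA", first, "FA", second))
        else:
            tree = ("sentence", self.statement())
        if self.peek() is not None:
            raise ParseError(self.pos, "end of input")
        return tree

    def statement(self):           # statement ::= [NAI] [terms CU] bridi-tail
        parts = []
        if self.peek() == "NAI":
            parts.append(self.expect("NAI"))
        if self.peek() in ("DA", "LA"):   # only a term can start with DA or LA
            parts.append(self.terms())
            parts.append(self.expect("CU"))
        parts.append(self.bridi_tail())
        return ("statement", *parts)

    def bridi_tail(self):          # bridi-tail ::= SELBRI [CI sumti CA]
        parts = [self.expect("SELBRI")]
        if self.peek() == "CI":
            parts.append(self.expect("CI"))
            parts.append(self.sumti())
            parts.append(self.expect("CA"))
        return ("bridi-tail", *parts)

    def sumti(self):               # sumti ::= terms [sumti]
        parts = [self.terms()]
        if self.peek() in ("DA", "LA"):
            parts.append(self.sumti())
        return ("sumti", *parts)

    def terms(self):               # terms ::= term [JA terms]
        parts = [self.term()]
        if self.peek() == "JA":
            parts.append(self.expect("JA"))
            parts.append(self.terms())
        return ("terms", *parts)

    def term(self):                # term ::= DA [[NAI] tanru] | LA CMENE
        if self.peek() == "LA":
            return ("term", self.expect("LA"), self.expect("CMENE"))
        parts = [self.expect("DA")]
        if self.peek() == "NAI":
            parts.append(self.expect("NAI"))
            parts.append(self.tanru())
        elif self.peek() == "BRIVLA":
            parts.append(self.tanru())
        return ("term", *parts)

    def tanru(self):               # tanru ::= BRIVLA [[A] tanru]
        parts = [self.expect("BRIVLA")]
        if self.peek() == "A":
            parts.append(self.expect("A"))
            parts.append(self.tanru())
        elif self.peek() == "BRIVLA":
            parts.append(self.tanru())
        return ("tanru", *parts)


def parse(kinds):
    """Return a nested-tuple parse of the token categories or raise ParseError."""
    return Parser(kinds).sentence()
```

For example, parse(["LA", "CMENE", "CU", "SELBRI"]) returns a nested tuple rooted at "sentence", while parse(["DA", "SELBRI"]) raises a ParseError whose pos is 1 and whose expected is "CU", which is the kind of indication of where and how a sentence failed that the task asks for.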
The Vocabulary:
The vocabulary is split, conceptually, into three categories of
words: predicates (brivla), names (cmene) and structure
words ('little words', cmavo). The main predicate in a
sentence is called selbri, but it is really only a
brivla in disguise. Predicates correspond fairly closely to
predicates in first order predicate calculus and either make claims
about the world (<he> is a woman) or specify an action or
relationship (x1 sells x2 to x3 for price x4; x1 is the (biological)
mother of x2 by (father) x3; x1 talks to x2 about (topic) x3 in
(language) x4). Predicates (basically all brivla) subsume most
of what we would call nouns, verbs, adjectives and adverbs plus a
healthy slice of the prepositions (such as 'above', 'below',
'before', 'after', etc.). All brivla are polysyllabic,
typically bisyllabic, where a syllable is of the form CCV or CVC,
except for the final syllable, which must be either CCV or CV. Here C
stands for a Lojban consonant (one of b, c, d, f, g, j, k, l,
m, n, p, r, s, t, v, x, z) and V stands for a Lojban vowel (a, e, i,
o, u). There are rules as to what consonants may come together,
either in a syllable or at junctions.
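The syllable description above translates fairly directly into a regular expression: one or more non-final syllables (CCV or CVC) followed by a final syllable (CCV or CV). The sketch below, in Python, checks only this overall shape. It does not attempt the rules about which consonants may come together, and the names BRIVLA_SHAPE and has_brivla_shape are made up for illustration.

```python
import re

CONSONANT = "[bcdfgjklmnprstvxz]"
VOWEL = "[aeiou]"

CCV = CONSONANT + CONSONANT + VOWEL
CVC = CONSONANT + VOWEL + CONSONANT
CV = CONSONANT + VOWEL

# One or more non-final syllables, then exactly one final syllable,
# so a match always has at least two syllables (polysyllabic).
BRIVLA_SHAPE = re.compile(f"(?:{CCV}|{CVC})+(?:{CCV}|{CV})")


def has_brivla_shape(word):
    """True if the whole word fits the syllable pattern described for brivla."""
    return BRIVLA_SHAPE.fullmatch(word) is not None
```

Under this check, words such as "selbri" (sel + bri) and "tanru" (tan + ru) are accepted, whereas "la" (a single CV syllable) and "kris" (ends in a consonant) are rejected.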
A name (cmene) is usually a transliteration from another
language and must end in a consonant. (The names of the people
involved with this course are la kris, la maikl and la reimnd.)
A structure word (cmavo) is a V, a CV or a CVV, or a combination
of these.
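The shapes given for names and structure words can be tested in much the same spirit. The following Python sketch assumes that a name is spelled only with the consonants and vowels listed earlier, and it recognises only the three basic cmavo shapes (V, CV, CVV), not combinations of them. The function names is_cmene_shape and is_cmavo_shape are hypothetical.

```python
import re

CONSONANTS = "bcdfgjklmnprstvxz"
VOWELS = "aeiou"

# A name: any run of Lojban letters that ends in a consonant.
CMENE_SHAPE = re.compile(f"[{CONSONANTS}{VOWELS}]*[{CONSONANTS}]")

# A basic structure word: V, CV or CVV (combinations are not handled here).
CMAVO_SHAPE = re.compile(f"[{CONSONANTS}]?[{VOWELS}]{{1,2}}")


def is_cmene_shape(word):
    """True if the word uses only the listed letters and ends in a consonant."""
    return CMENE_SHAPE.fullmatch(word) is not None


def is_cmavo_shape(word):
    """True if the word is a single V, CV or CVV; a VV word is also let through."""
    return CMAVO_SHAPE.fullmatch(word) is not None
```

With these definitions the three course names kris, maikl and reimnd all pass is_cmene_shape, and the little word la passes is_cmavo_shape but not is_cmene_shape. Note that the optional leading consonant means a bare VV word is accepted too, which is slightly more permissive than the three shapes listed.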