Abstract:
Grammar development through the traditional rule-based method remains a
challenge because the method is time-consuming, expensive, knowledge-intensive,
and laborious, particularly for under-resourced languages and even more so for the
spoken Bantu languages. Nevertheless, there is a high demand for these grammars in
deep natural language processing, generation of well-formed output, or both, as well as
in controlled natural language applications and high-precision machine translation. An
in-depth review of previous research on improving grammar development reveals that
these studies concentrated on well-resourced languages and neglected under-resourced
ones, and that they addressed only the syntax, ignoring the morphology in the shareable
grammar. Therefore, there is an urgent need for cost-efficient methodologies that can
accelerate grammar development, enabling these languages to thrive in the digital
ecosystem and narrowing the language technology digital divide with the well-resourced
languages. Consequently, this research investigated an approach to reducing grammar
development effort for under-resourced languages in a rule-based multilingual
environment by leveraging cross-linguistic similarities to develop a congruent Bantu
parameterized grammar and then using the shared parameterized grammar to bootstrap a
Swahili grammar.
The descriptive analysis method was used to analyze the descriptive grammar of each
geolinguistically and purposively chosen Bantu language to empirically identify the points
of generalization for parameters, regular expressions, and grammar rules. Furthermore,
universal and individual comparative analyses were used to produce a generalized
descriptive grammar for the subset of the Bantu languages. Then, quasi-experiments were
set up in Grammatical Framework (GF) using the morphology-driven approach to develop
the Bantu parameterized grammar and to bootstrap the Swahili grammar from it. The GF
regression method was used to test each grammar during development. Reusability was
evaluated using shared-rule and modified-rule metrics for shareability and portability,
respectively, while accuracy was evaluated using a test suite of 100 English sentences.
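The reusability metrics can be illustrated with a minimal sketch. It assumes that shareability is the percentage of a grammar's rules taken unchanged from the shared grammar and that portability is the percentage taken over with modification; the abstract does not spell out the formulas, so the function name `reuse_metrics`, the three-way split of rules, and the example counts are all hypothetical.

```python
def reuse_metrics(shared: int, modified: int, new: int) -> dict:
    """Return shareability and portability as percentages of all rules.

    Assumed definitions (not confirmed by the source):
      shareability = rules reused unchanged / total rules
      portability  = rules reused after modification / total rules
    """
    total = shared + modified + new
    if total == 0:
        raise ValueError("the grammar module contains no rules")
    return {
        "shareability": 100.0 * shared / total,
        "portability": 100.0 * modified / total,
    }


# Made-up counts for illustration only: 3 shared, 1 modified, 0 new rules.
metrics = reuse_metrics(shared=3, modified=1, new=0)
print(metrics)
```

Under these assumed definitions the two percentages sum to 100 only when a module contains no language-specific new rules.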
The Bantu parameterized grammar achieved a shareability of 68.75% for parameters
and 65.3% for paradigms at the morphology level and 89.57% at the syntax level, while its
portability was 18.75% for parameters and 14.29% for paradigms at the morphology level
and 10.43% at the syntax level. The bootstrapped
Swahili grammar had a shareability of 68.75% for parameters and 71.11% for paradigms
at the morphology level and 91.41% at the syntax level, while its portability was 18.75%
for parameters and 15.55% for paradigms at the morphology level and 8.59% at the
syntax level. In terms of accuracy,
the grammars had 4-gram BLEU scores of 83.05%, 77.95%, and 55.95%, word error rates
(WER) of 12.82%, 13.39%, and 23.90%, and position-independent error rates (PER) of
10.96%, 9.46%, and 19.49% for Kikamba, Swahili, and Ekegusii, respectively. The
research draws two conclusions. First, leveraging the cross-linguistic similarities of
principles and parameters significantly reduces the effort of developing multilingual
grammars. Second, leveraging a congruent grammar to bootstrap a similar grammar takes
less effort, since most of the rule base is inherited from the congruent grammar.
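The two error rates reported above can be sketched as follows. This is a minimal illustration, not the evaluation tooling used in the study: WER is computed as the word-level edit distance divided by the reference length, and PER follows one common definition that compares the two sentences as bags of words. Whitespace tokenization and the English example sentences are assumptions made for the illustration.

```python
from collections import Counter


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def per(reference: str, hypothesis: str) -> float:
    """Position-independent error rate: word order is ignored."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Words matched regardless of position (multiset intersection).
    matches = sum((Counter(ref) & Counter(hyp)).values())
    return (max(len(ref), len(hyp)) - matches) / len(ref)


ref = "the child reads a book"
hyp = "the child a book reads"
print(wer(ref, hyp), per(ref, hyp))
```

Because PER ignores word order, a hypothesis that contains the right words in the wrong order is penalized by WER but not by PER, which is why PER is usually no higher than WER for the same output.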
The study makes several contributions. First, it provides an approach to
bootstrapping the development of multilingual grammars that significantly reduces the
effort. Second, it extends GF reusability by providing standardized Swahili, Kikamba, and
Ekegusii grammars as open resources. Third, it contributes a 100-sentence test suite for
the evaluation of grammars. Finally, it contributes to the descriptive grammar by
supplying, through elicitation, parts that were missing, mainly in the numerals,
preposition fusion, and the subject marker morpheme of the verb.
Keywords: Parameterized grammar, grammar engineering, bootstrapping, grammar
sharing, grammar porting, complex morphology, under-resourced languages.