Ngeli (language modelling)

 

When building prediction models, it is faster and easier to make accurate predictions from a data set that has been organized under some classification. A classification makes it easier to compare the data points it prescribes with those actually observed in each event, and the trend in the divergence of the observed from the “true” can be extrapolated with all other factors held constant. This is similar to the comparison between “true north”, a mathematically defined point on the grid system where all lines of longitude meet at the North Pole, and observed norths such as magnetic north and the North Star. The divergence of these observed norths from the “true” one is taken into account when estimating position and navigating. Just as we need to account for divergence in direction to avoid getting lost, we need to account for divergence in language modelling to avoid being misinformed, or when intending to misinform an algorithm.

Classification systems exist in language too. The most apparent one in Kiswahili is the Ngeli (noun class) system.

Ngeli - noun class.

Ngeli ni makundi ya kisarufi ya majina katika lugha ya Kiswahili na lugha nyingine za Kibantu. Ngeli hizi huundwa kupitia utaratibu unaotumika kuweka nomino katika tabaka au makundi yanayofanana.

Translation: Ngeli are grammatical groups of nouns in Kiswahili and other Bantu languages. These groups are formed through a system that places nouns into classes of similar items.

When forming phrases and sentences, the morphology of a noun affects the morphology of the accompanying adjectives and verbs through inflection, especially at the beginning of the word. For example,

Ngeli ya KI-VI:

Kiatu-kikubwa-kimeshonwa. (noun-adjective-verb)

Viatu-vikubwa-vimeshonwa. (noun-adjective-verb)

Ngeli ya M-MI/U-I:

Mkoba-mkubwa-umeshonwa. (noun-adjective-verb)

Mikoba-mikubwa-imeshonwa. (noun-adjective-verb)
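The agreement pattern in these four examples can be sketched as a small lookup table. This is only an illustration: the table name CONCORDS and the function build_phrase are hypothetical, only the two classes shown above are covered, and the sound changes that real Kiswahili prefixes undergo before some stems are ignored.

```python
# Hypothetical concord table built from the examples above. For each noun
# class it lists the (singular, plural) prefix of the noun, of an agreeing
# adjective, and of the verb.
CONCORDS = {
    "KI-VI": {"noun": ("ki", "vi"), "adjective": ("ki", "vi"), "verb": ("ki", "vi")},
    "M-MI": {"noun": ("m", "mi"), "adjective": ("m", "mi"), "verb": ("u", "i")},
}


def build_phrase(ngeli, noun_stem, adjective_stem, verb_stem, plural=False):
    """Attach the prefixes of the given noun class to each stem."""
    index = 1 if plural else 0
    prefixes = CONCORDS[ngeli]
    words = [
        prefixes["noun"][index] + noun_stem,
        prefixes["adjective"][index] + adjective_stem,
        prefixes["verb"][index] + verb_stem,
    ]
    return " ".join(words)


print(build_phrase("KI-VI", "atu", "kubwa", "meshonwa"))
# kiatu kikubwa kimeshonwa
print(build_phrase("M-MI", "koba", "kubwa", "meshonwa", plural=True))
# mikoba mikubwa imeshonwa
```

Once the class of the noun is known, the prefixes of the following words are fixed by the table, which is what makes the next word predictable.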

From the first word (the noun), one can fairly reliably predict what the next word in the phrase will look like. This is thanks to the Ngeli (noun classification) system. In information science and ontology, a classification scheme is the product of arranging things into kinds of things (classes) or into groups of classes; this is similar to categorization, but with perhaps a more theoretical bent, as classification can be applied over a wide semantic spectrum. In ngeli, that semantic spectrum covers defining features of words such as animacy or inanimacy, shape, pronunciation, countability and size. Such designations are usually conventional rather than arbitrary.

Classifying the nouns of a language in this way has several benefits:

1. It allows a user to quickly find the correct inflections for a noun on the basis of its kind or group.

2. It makes it easier to detect grammatical mistakes.

3. It conveys the semantics (meaning) of some objects through the definition of their kind, where that meaning is not conveyed by the name of the individual object or by its spelling.

4. Knowledge and requirements about a kind of thing can be applied to other objects of that kind.
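The second benefit, detecting grammatical mistakes, can be sketched as a comparison between the words a speaker used and the words the noun class prescribes. The table AGREEMENT and the function agreement_errors are hypothetical names, and the table covers only the two classes used in the examples earlier in this article.

```python
# Illustrative table only: the prefix that each agreeing word takes for a
# given noun class and number, following the examples in this article.
AGREEMENT = {
    ("KI-VI", "singular"): {"adjective": "ki", "verb": "ki"},
    ("KI-VI", "plural"): {"adjective": "vi", "verb": "vi"},
    ("M-MI", "singular"): {"adjective": "m", "verb": "u"},
    ("M-MI", "plural"): {"adjective": "mi", "verb": "i"},
}


def agreement_errors(ngeli, number, stems, observed):
    """Return one message for each observed word that breaks agreement."""
    errors = []
    for role, prefix in AGREEMENT[(ngeli, number)].items():
        expected = prefix + stems[role]
        if observed[role].lower() != expected:
            errors.append(f"{role}: expected '{expected}', found '{observed[role]}'")
    return errors


stems = {"adjective": "kubwa", "verb": "meshonwa"}
# "Viatu vikubwa kimeshonwa": the verb keeps the singular prefix by mistake.
mistakes = agreement_errors(
    "KI-VI", "plural", stems, {"adjective": "vikubwa", "verb": "kimeshonwa"}
)
print(mistakes)
# ["verb: expected 'vimeshonwa', found 'kimeshonwa'"]
```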

In natural conversation, however, many people hardly form phrases in strict adherence to the ngeli system. Many speakers collapse nouns and their inflections into two or three classes, based mainly on animacy, inanimacy and countability (for inanimate things that are uncountable). A machine learning tool therefore has to work with classes based on the comparative probability of a particular inflection being used in a particular event. This determines which inflected word the machine encodes after the noun. In information theory as applied to machine learning, the measure used to compare two probability distributions in this way is called “cross-entropy”.
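A minimal sketch of how such a probability distribution over inflections might be built from recorded speech is given here. The observations are invented purely for illustration and are not real usage data; the function name model_distribution is also hypothetical.

```python
from collections import Counter


def model_distribution(observations):
    """Turn a list of observed inflections into relative frequencies."""
    counts = Counter(observations)
    total = sum(counts.values())
    return {inflection: count / total for inflection, count in counts.items()}


# Invented sample: verb prefixes that ten speakers used after a singular
# KI-VI noun. The standard prefix is "ki"; the others are collapsed forms.
observed = ["ki"] * 6 + ["i"] * 3 + ["li"] * 1

u = model_distribution(observed)
most_likely = max(u, key=u.get)
print(most_likely)
# ki
```

In this sketch the machine would encode the inflection with the highest observed probability for the event at hand.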

Cross-entropy compares two probability distributions. The first is the true distribution t. The second is the estimated, or model, distribution u. The true distribution t is often unknown, but it can be known, as in the case of the Kiswahili ngeli system, where inflections are standardized. The model distribution u is built from observed data: recorded inflections and the contexts in which people used them. The divergence of the model distribution u from the true distribution t in a particular event or context determines which inflection the machine encodes, the one from the true distribution or the one from the model distribution. When the observed usage of an inflection exceeds what the standard prescribes (a positive divergence), the cross-entropy rises and the model distribution u is used. When the observed usage falls short of the standard (a negative divergence), the cross-entropy also rises and the true distribution t is used. Taken over all inflections, the cross-entropy is never smaller than the entropy of t itself, so it grows whichever way u departs from t. In the analogy of finding directions, if one finds a large or inconsistent divergence between “true north” and one or both of the observed norths, the observed norths become misleading to the navigator, and repositioning or remapping is required.
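The comparison can be made concrete with the standard definition of cross-entropy, H(t, u) = -sum over x of t(x) * log2 u(x), where x ranges over the candidate inflections. The sketch below assumes that the true distribution puts all of its probability on the standard prefix and that u holds invented observed frequencies; the function names are hypothetical.

```python
import math


def cross_entropy(t, u):
    """H(t, u) in bits: the cost of describing events from t using u."""
    total = 0.0
    for x, p in t.items():
        if p == 0.0:
            continue  # inflections that t never produces contribute nothing
        q = u.get(x, 0.0)
        if q == 0.0:
            return math.inf  # u rules out an inflection that t produces
        total -= p * math.log2(q)
    return total


def entropy(t):
    """H(t) in bits, which equals the cross-entropy of t with itself."""
    return cross_entropy(t, t)


def kl_divergence(t, u):
    """Excess of the cross-entropy over the entropy of t; never negative."""
    return cross_entropy(t, u) - entropy(t)


# Assumed true distribution: the standard always uses "ki" in this context.
t = {"ki": 1.0}
# Invented model distribution built from observed usage.
u = {"ki": 0.6, "i": 0.3, "li": 0.1}

print(cross_entropy(t, u))
print(kl_divergence(t, u))
```

The further observed usage drifts from the standard prefix, the smaller u("ki") becomes and the larger both values grow.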

There are many languages in which the cross-entropy of inflections needs to be measured but the true distribution t is unknown or chaotic. In language modelling, a model is built from a training set, and this becomes the model distribution u. Since the true distribution is unknown, the cross-entropy cannot be calculated directly. Instead, the model is tested on each instance of encoding to assess how accurately it predicts what was actually used. In these cases, the cross-entropy is estimated with a Monte Carlo estimate: the average negative log-probability that the model assigns to each test sample.
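The Monte Carlo estimate can be written as H(T, u) ≈ -(1/N) * sum of log2 u(x_i), where x_1 to x_N are the inflections observed in the test samples. The sketch below is one way to compute it; the function name, the invented data and the floor parameter (a small probability given to inflections the model has never seen, so that the logarithm stays finite) are all assumptions.

```python
import math


def monte_carlo_cross_entropy(test_samples, u, floor=1e-12):
    """Average negative log2-probability that u assigns to the test samples."""
    if not test_samples:
        raise ValueError("at least one test sample is required")
    total = 0.0
    for x in test_samples:
        # Inflections unseen in training get the floor probability.
        total -= math.log2(max(u.get(x, 0.0), floor))
    return total / len(test_samples)


# Invented model distribution and test samples, for illustration only.
u = {"ki": 0.6, "i": 0.3, "li": 0.1}
test_samples = ["ki", "ki", "i", "ki"]

estimate = monte_carlo_cross_entropy(test_samples, u)
print(estimate)
```

A lower estimate means the model assigned higher probability to what speakers actually said in the test samples.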

 

References:

TUKI (2001), Kamusi Ya Kiswahili-Kiingereza; Swahili-English Dictionary. Published by Taasisi ya Uchunguzi wa Kiswahili (TUKI), Chuo Kikuu cha Dar es Salaam, Tanzania.

 
