When building prediction models, it is faster and easier to make accurate predictions from a data set that has already been placed under some classification. Classification makes it easier to compare the data points in a class with those observed in a given event, and the trend in divergence of the observed values from the “true” ones can be extrapolated with all other factors held constant. This is similar to the comparison made between “true north”, the mathematically defined point on the grid system where all lines of longitude meet at the North Pole, and the observed norths such as magnetic north and the North Star. The divergence of these observed norths from the “true” classification is taken into account when estimating position and navigating. Just as we must account for divergence in directions to avoid getting lost, we must account for divergence in language modelling to avoid being misinformed, or when intending to mislead an algorithm.
Classification systems also exist in language. The most apparent classification system in Kiswahili is the ngeli (noun class system).
Ngeli - noun class.
Ngeli ni makundi ya kisarufi ya majina katika lugha ya Kiswahili na lugha nyingine za Kibantu. Ngeli hizi huundwa kupitia utaratibu unaotumika kuweka nomino katika tabaka au makundi yanayofanana.
Translation:
Ngeli are the grammatical classes of nouns in Kiswahili and other Bantu languages. These classes are formed through a procedure used to place nouns into similar strata or groups.
When forming phrases and sentences, the various morphologies of the nouns affect the morphologies of the accompanying adjectives and verbs through inflection, especially at the beginning of the word. For example:

Ngeli ya KI-VI:
Kiatu-kikubwa-kimeshonwa. (noun-adjective-verb)
Viatu-vikubwa-vimeshonwa. (noun-adjective-verb)

Ngeli ya M-MI/U-I:
Mkoba-mkubwa-umeshonwa. (noun-adjective-verb)
Mikoba-mikubwa-imeshonwa. (noun-adjective-verb)
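The agreement pattern in the examples above can be sketched as a lookup table: once the noun's class is known, the adjective and verb prefixes follow. This is a minimal sketch; the table below is a hypothetical toy covering only the two ngeli shown, not a full grammar.

```python
# Toy table of ngeli agreement prefixes (hypothetical, covers only two classes).
NGELI = {
    "KI-VI (singular)": {"adjective": "ki", "verb": "ki"},
    "KI-VI (plural)":   {"adjective": "vi", "verb": "vi"},
    "M-MI (singular)":  {"adjective": "m",  "verb": "u"},
    "M-MI (plural)":    {"adjective": "mi", "verb": "i"},
}

def agree(ngeli_class, adjective_stem, verb_stem):
    """Inflect adjective and verb stems to agree with the given noun class."""
    prefixes = NGELI[ngeli_class]
    return prefixes["adjective"] + adjective_stem, prefixes["verb"] + verb_stem

# Kiatu kikubwa kimeshonwa: stems are "kubwa" (big) and "meshonwa" (has been sewn).
print(agree("KI-VI (singular)", "kubwa", "meshonwa"))  # ('kikubwa', 'kimeshonwa')
print(agree("M-MI (plural)", "kubwa", "meshonwa"))     # ('mikubwa', 'imeshonwa')
```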
From the first word (the noun), one can fairly well predict what the next word in the phrase will look like, thanks to the system of ngeli (noun classification). In information science and ontology, a classification scheme is the product of arranging things into kinds of things (classes) or into groups of classes; this bears similarity to categorization, but with perhaps a more theoretical bent, since classification can be applied over a wide semantic spectrum. In ngeli, that wide semantic spectrum involves defining features of words such as animacy or inanimacy, shape, pronunciation, countability, and size. Such designations are usually conventional rather than arbitrary.
Using classification schemes for the nouns of a language has many benefits. Some of these include:
1. It allows a user to quickly find the correct inflections for a word on the basis of its kind or group.
2. It makes grammatical mistakes easier to detect.
3. It conveys the semantics (meaning) of some objects from the definition of their kind, where that meaning is not conveyed by the name of the individual object or by its spelling.
4. Knowledge and requirements about a kind of thing can be applied to other objects of that kind.
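Benefit 2, detecting grammatical mistakes, can be sketched as a simple prefix-agreement check. The prefix pairs below are a hypothetical toy (longer prefixes are tried first so that "mi" is not mistaken for "m"), not a complete grammar.

```python
# Toy map from noun prefix to the adjective prefix it requires (hypothetical).
AGREEMENT = {"ki": "ki", "vi": "vi", "mi": "mi", "m": "m"}

def noun_class(word):
    """Return the noun's class prefix, checking longer prefixes first."""
    for prefix in sorted(AGREEMENT, key=len, reverse=True):
        if word.startswith(prefix):
            return prefix
    return None

def agreement_ok(noun, adjective):
    """True if the adjective's prefix agrees with the noun's class."""
    cls = noun_class(noun)
    return cls is not None and adjective.startswith(AGREEMENT[cls])

print(agreement_ok("kiatu", "kikubwa"))  # True: singular noun, singular adjective
print(agreement_ok("viatu", "kikubwa"))  # False: plural noun, singular adjective
```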
In natural conversation, however, many people hardly form phrases in strict adherence to the ngeli system. Many collapse nouns and their inflections into two or three classes, based mainly on animacy, inanimacy and countability (for inanimate objects that are uncountable). A machine learning tool therefore has to use classes based on comparative probability distributions for a particular inflection being used in a particular event; this determines the inflected word that the machine will encode after the noun. In information theory for machine learning, this method of determination by comparing probability distributions is termed “cross-entropy”.
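Concretely, the cross-entropy between a true distribution t and a model distribution u is H(t, u) = -Σ t(x) log u(x). A minimal sketch, with hypothetical inflection distributions (the numbers are made up for illustration):

```python
import math

def cross_entropy(t, u):
    """H(t, u) = -sum_x t(x) * log u(x), in nats."""
    return -sum(t[x] * math.log(u[x]) for x in t if t[x] > 0)

# Hypothetical distributions over which inflection follows a given noun:
t = {"ki": 0.7, "vi": 0.2, "i": 0.1}  # standardized ("true") usage
u = {"ki": 0.5, "vi": 0.3, "i": 0.2}  # usage modelled from conversation data

print(round(cross_entropy(t, u), 3))
print(round(cross_entropy(t, t), 3))  # entropy of t itself; never exceeds H(t, u)
```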
This is done by comparing two probability distributions. The first is the true distribution t; the second is the estimated/model distribution u. The true distribution t is often unknown, but it can be known, as in the case of the Kiswahili ngeli system, where inflections are standardized. The model distribution u is modelled from observed data: recorded inflections and the contexts in which people used them. The divergence of the model distribution u from the true distribution t in a particular event/context determines the most appropriate inflection for the machine to encode, that is, whether it should come from the true distribution or from the model distribution. The greater the positive divergence from the true distribution, the greater the entropy, and in that case the model distribution u is used; the greater the negative divergence from the true distribution, the greater the entropy, and in that case the true distribution t is used. In the analogy of finding directions, if one finds a large or inconsistent divergence between “true north” and either or both of the observed norths, the observed norths are rendered misinformative to the navigator, and repositioning or remapping is required.
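One standard way to quantify how far the model distribution u has drifted from the true distribution t is the Kullback-Leibler divergence, which equals the gap between the cross-entropy H(t, u) and the entropy of t. A minimal sketch, with the same hypothetical distributions as before and a hypothetical drift threshold for falling back to the standardized inflections:

```python
import math

def kl_divergence(t, u):
    """D_KL(t || u) = sum_x t(x) * log(t(x) / u(x)): how far u diverges from t."""
    return sum(t[x] * math.log(t[x] / u[x]) for x in t if t[x] > 0)

t = {"ki": 0.7, "vi": 0.2, "i": 0.1}  # standardized ("true") usage
u = {"ki": 0.5, "vi": 0.3, "i": 0.2}  # usage modelled from conversation data

drift = kl_divergence(t, u)
# Hypothetical decision rule in the spirit of the text: when the model has
# drifted too far from the standard, fall back to the true distribution.
THRESHOLD = 0.05  # made-up tolerance for illustration
chosen = t if drift > THRESHOLD else u
```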
There are many languages where inflection cross-entropy needs to be measured but the true distribution t is unknown or chaotic. In language modelling, a model is created from a training set; this becomes the model distribution u. Since the true distribution is unknown, cross-entropy cannot be calculated directly, so a live test is used for each instance of encoding to assess how accurately the model predicts each test case. In these cases, an estimate of the cross-entropy is calculated using a Monte Carlo estimate.
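The Monte Carlo estimate replaces the unknown true distribution with samples of real usage: given held-out observations x_1, ..., x_N drawn from actual text, H(t, u) is approximated by -(1/N) Σ log u(x_i). A sketch with a hypothetical model distribution and a hypothetical held-out test set:

```python
import math

def monte_carlo_cross_entropy(samples, u):
    """Estimate H(t, u) as -(1/N) * sum_i log u(x_i), for samples x_i ~ t."""
    return -sum(math.log(u[x]) for x in samples) / len(samples)

# Hypothetical model distribution over inflections, and a held-out
# test set of inflections observed in real usage:
u = {"ki": 0.5, "vi": 0.3, "i": 0.2}
test_set = ["ki", "ki", "vi", "ki", "i", "vi", "ki", "ki", "vi", "ki"]

print(round(monte_carlo_cross_entropy(test_set, u), 3))
```

As the test set grows, this average converges to the true cross-entropy, which is why held-out evaluation is the standard way to score a language model.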
References:
TUKI (2001). Kamusi ya Kiswahili-Kiingereza / Swahili-English Dictionary. Taasisi ya Uchunguzi wa Kiswahili (TUKI), Chuo Kikuu cha Dar es Salaam, Tanzania.