Linga
- compare, match, harmonize: ~ nguo - try on clothes, be measured for
clothes. (verb)
Lingana
- be equal, be similar, match. (verb)
Linganifu
- matching; corresponding; regular; symmetrical. (adjective/adverb)
Linganisha
- compare, equate, correlate; harmonize; musical key. (verb)
Linganua
- differentiate, make a contrast; distinguish. (verb, reversive)
Etymology
-lingana (infinitive kulingana), from Proto-Bantu *-dɪ̀ngana. Entry 995 in Bantu Lexical Reconstructions 3. https://www.africamuseum.be/
Kulinganisha na kulinganua, and also kuenenza, mean "compare and contrast".
Can
algorithms handle compare & contrast?
Understanding context, and measuring the source and magnitude of words, are important in compare & contrast analysis for judging impact. This requires common sense and heuristics. Word sense means the sense in which a word is used. For example, a dictionary may list over 50 different senses of the word "play", each having a different meaning based on the context of the word's usage in a sentence. These meanings are usually built up over a long period of time as convention and standard, not arbitrarily by the sudden whims of individuals or groups.
Free
speech is a cherished value among many human societies. However, with advanced
methods of information integration, this value has been under assault by agents
with various motives, using such integrations to victimize individuals they don’t
like. The most prevalent form of this assault today is social media bans. This
may seem benign at the moment but has the potential to morph into other forms
of autocracy that could ban people from accessing loans based on frivolous
non-financial factors or even ban people from accessing housing, travel
documents, government services, health services, water, electricity and so
forth. Some of these are fundamental human rights that should never be denied
under any circumstance.
The European Union has approved regulation requiring that citizens have a "right to explanation" in relation to any algorithmic decision-making. The European Union General Data Protection Regulation (enacted 2016, taking effect 2018) provides a legally disputed form of a right to an explanation, stated as such in Recital 71: "[the data subject should have] the right ... to obtain an explanation of the decision reached".
However, the extent to which the regulations themselves provide a "right to explanation" is heavily debated. There are two main strands of criticism. First, there are significant legal issues with the right as found in Article 22: recitals are not binding, and the right to an explanation is not mentioned in the binding articles of the text, having been removed during the legislative process. Second, there are significant restrictions on the types of automated decisions covered, which must be both "solely" based on automated processing and have legal or similarly significant effects; this significantly limits the range of automated systems and decisions to which the right would apply. In particular, the right is unlikely to apply in many of the cases of algorithmic controversy that have been picked up in the media.
In the United States, a similar requirement applies to the processing of "credit scores" for loans, enforced by the Consumer Financial Protection Bureau, formed after the 2007–08 financial crash. Under the Equal Credit Opportunity Act (Regulation B of the Code of Federal Regulations, Title 12, Chapter X, Part 1002, §1002.9), creditors are required to notify applicants who are denied credit of the specific reasons for the denial.
Creditors comply with this regulation by providing a list of reasons (generally at most four, per an interpretation of the regulations), each consisting of a numeric reason code (as identifier) and an associated explanation identifying the main factors affecting a credit score. An example might be:
“32:
Balances on bankcard or revolving accounts too high compared to credit limits.”
Number
32 is the numeric reason code. Other reasons would have different number codes.
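As a sketch, the code-to-explanation lookup behind such a notice can be modeled as a simple table. Code 32 below is the example quoted above; code 10, its wording, and the four-reason cap are illustrative assumptions, not taken from any real scoring model.

```python
# Hypothetical sketch: map numeric reason codes to adverse-action
# explanations, returning at most four reasons per notice.
REASON_CODES = {
    32: "Balances on bankcard or revolving accounts too high compared to credit limits.",
    10: "Proportion of loan balances to loan amounts is too high.",
}

def adverse_action_reasons(codes, limit=4):
    """Return at most `limit` (code, explanation) pairs for a denial notice."""
    return [(code, REASON_CODES[code]) for code in codes[:limit]]

for code, reason in adverse_action_reasons([32]):
    print(f"{code}: {reason}")
```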
Word
embedding for differentiation.
In natural language processing (NLP) for machine learning, a word embedding is a representation of a word. Word and phrase embeddings are used to boost performance in NLP tasks such as syntax analysis and sentiment analysis. Typically, the representation of the word is a real-valued vector that encodes the meaning of the word in such a way that words closer together in the vector space are expected to be similar in meaning. The vector space is the shared numerical space in which all the word vectors live; a word's position in it is determined by the contexts in which the word appears, so the vector of a word is effectively measured in relation to the vectors of its contexts. Put simply, "a word is characterized by the company it keeps". To generate these vectors, a number of unsupervised algorithmic techniques have been proposed, including applying neural networks and constructing a co-occurrence matrix. To fine-tune the results, it is proposed that this be followed by supervised techniques like dimensionality reduction, probabilistic distribution models, and even explicit representation and consideration of words appearing in a context, which would require direct human input.
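The co-occurrence-matrix approach can be sketched in a few lines. This is a minimal toy illustration, with an assumed three-sentence corpus and window size; real systems train on far larger corpora and compress the vectors.

```python
# Minimal sketch: word vectors from a co-occurrence matrix, compared by
# cosine similarity. "A word is characterized by the company it keeps":
# words appearing in similar contexts end up with similar vectors.
import numpy as np

corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat chased the dog",
]
window = 2  # words within this distance of each other count as co-occurring

tokens = [sentence.split() for sentence in corpus]
vocab = sorted({word for sentence in tokens for word in sentence})
idx = {word: i for i, word in enumerate(vocab)}

# M[i, j] counts how often word j appears within the window around word i.
M = np.zeros((len(vocab), len(vocab)))
for sentence in tokens:
    for i, word in enumerate(sentence):
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if j != i:
                M[idx[word], idx[sentence[j]]] += 1

def cosine(a, b):
    """Cosine similarity between the co-occurrence vectors of two words."""
    va, vb = M[idx[a]], M[idx[b]]
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))

# "cat" and "dog" keep similar company (the, sat, on), so their vectors
# are closer together than, say, "cat" and "rug".
print(cosine("cat", "dog") > cosine("cat", "rug"))  # True
```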
Currently existing word embedding techniques do not benefit from the rich semantic information present in structured or semi-structured text. Instead they are trained over a large corpus, such as a Wikipedia "dump", social media posts, or a collection of news articles, where any structure is ignored.
Moreover, the dimensionality of word sense is very high even in a very small community of people. This makes algorithmic calculation inadequate for such measurements, since "concentration of measure" techniques meant to collapse dimensionality can hardly keep up with cultural dynamism, social relations, slang, and even gaffes like spoonerisms, malapropisms, and catachresis. Contemporary examples include United States president Joe Biden and former world heavyweight champion Mike Tyson, with their catalogues of irregular verbal gaffes. It becomes a bit tricky to pin these down as either an innocent error, a comic act, or intentional misinformation to evade responsibility.
Embedding
technique for images
The algorithmic technique used in compare & contrast analysis for images is termed "triplet loss". Triplet loss is a loss function for machine learning algorithms where a reference image (called the anchor) is compared to a matching image (called the positive) and a non-matching image (called the negative). Typically, the distance from the anchor to the positive is minimized, and the distance from the anchor to the negative is maximized. By enforcing this order of distances, triplet loss models learn embeddings in which pairs of samples with the same label are closer to each other than pairs with different labels. In face recognition, triplet loss is used to learn good embeddings (or "encodings") of faces. In the embedding space, faces from the same person should be close together and form well separated clusters.
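The loss itself is short enough to write out. The sketch below uses Euclidean distance; the margin value and the toy two-dimensional embeddings are illustrative assumptions.

```python
# Minimal sketch of the triplet loss described above:
#   L = max(d(anchor, positive) - d(anchor, negative) + margin, 0)
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Penalize triplets where the negative is not at least `margin`
    farther from the anchor than the positive is."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(d_pos - d_neg + margin, 0.0)

anchor = np.array([0.0, 0.0])
positive = np.array([0.1, 0.0])   # same identity: embedded near the anchor
negative = np.array([1.0, 1.0])   # different identity: embedded far away
print(triplet_loss(anchor, positive, negative))  # 0.0 -- negative is far enough
```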
For example, a satirist may want to algorithmically edit himself into an image or video of a World Economic Forum meeting in Davos. The algorithm should be able to cluster the different characters that the satirist wants to use in his skit so that the video flows coherently. The algorithm would cluster the facial recognition images in this manner:
The arrangement of colour shades (embeddings) illustrates which images would be in the same class as each other. The anchor and positive have a similar sequence of shades while the negative has a different sequence. This automatically differentiates and clusters the negative image into a different group, regardless of the differing dimensions of the anchor and positive images. However, the greater the distance between the positive and anchor images, the more complex the facial recognition becomes. For example, if the positive image illustrated above were a frowning Obama bending his neck, the algorithm would need more training with closer matches in the same class to cluster properly for more accurate future outputs.
In triplet mining, the different combinations that are likely to be used can be categorized into three:
1. Easy triplets: triplets which have a loss of 0, because the negative is very far from the anchor compared to the positive.
2. Hard triplets: triplets where the negative is closer to the anchor than the positive.
3. Semi-hard triplets: triplets where the negative is not closer to the anchor than the positive, but which still have positive loss.
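The three categories above reduce to comparing the two distances against the margin. The sketch below assumes Euclidean distance and an illustrative margin of 0.2.

```python
# Sketch: classify a triplet as easy, hard, or semi-hard by comparing
# the anchor-positive and anchor-negative distances.
import numpy as np

def triplet_category(anchor, positive, negative, margin=0.2):
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    if d_neg > d_pos + margin:
        return "easy"        # loss is 0: negative is beyond the margin
    if d_neg < d_pos:
        return "hard"        # negative is closer to the anchor than positive
    return "semi-hard"       # within the margin: small but positive loss

anchor = np.array([0.0, 0.0])
print(triplet_category(anchor, np.array([0.1, 0.0]), np.array([2.0, 2.0])))  # easy
```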
References and further reading:
arXiv:1310.4546 [cs.CL].
arXiv:1702.06891 [cs.CL].
https://www.creditscoring.com/creditscore/fico/factors/reason-codes.html
Edwards, Lilian; Veale, Michael (2017). "Slave to the algorithm? Why a 'right to an explanation' is probably not the remedy you are looking for". Duke Law and Technology Review.
Jurafsky, Daniel; Martin, James H. (2000). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Upper Saddle River, N.J.: Prentice Hall.
Moindrot, Olivier (2018). "Triplet Loss and Online Triplet Mining in TensorFlow". Retrieved 22 January 2024 from https://omoindrot.github.io/triplet-loss.
Socher, Richard; et al. (2013). "Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank". EMNLP.
TUKI (2001). Kamusi ya Kiswahili–Kiingereza / Swahili–English Dictionary. Dar es Salaam: Taasisi ya Uchunguzi wa Kiswahili (TUKI), Chuo Kikuu cha Dar es Salaam, Tanzania.