How does your bot work?

This page introduces some of the concepts behind the natural language understanding of dydu bots.

Language structure

A sentence is divided into words or compound words. Each of these words is associated with a weight and a set of meanings. Because many words are homonyms or polysemous, all the meanings of a word are kept: no meaning is chosen a priori.

Each meaning is associated with a penalty, because the meanings of a word are not equally likely to be used.

The weight of a word depends on its frequency in the language used.

The overall structure is as follows:

Word 1

Weight

Meaning 1 - Penalty - Meaning 2 - Penalty - ...

Word 2
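The structure above can be pictured with a small data model. This is only an illustrative sketch: the class names, field names, weights and penalty values are assumptions made for the example, not dydu's internal representation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Meaning:
    label: str    # one possible sense of the word
    penalty: int  # penalty attached to this sense (illustrative value)


@dataclass
class Word:
    text: str
    weight: float  # depends on the word's frequency in the language
    meanings: List[Meaning] = field(default_factory=list)


# Every sense of an ambiguous word is kept; none is chosen in advance.
sentence = [
    Word("bank", weight=0.7, meanings=[
        Meaning("financial institution", penalty=0),
        Meaning("river bank", penalty=200),
    ]),
    Word("card", weight=0.3, meanings=[Meaning("card", penalty=0)]),
]

for word in sentence:
    print(word.text, word.weight, [(m.label, m.penalty) for m in word.meanings])
```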

Steps

Spelling correction

Spelling mistakes are common in the sentences handled by automatic chat, so a correction step is necessary.

The dydu technology uses a library based on the Hunspell spell checker used by OpenOffice and Firefox. This library has been adapted to the specific needs of dydu.

The spelling correction suggests several possible corrections. No choice is made at this stage: all close corrections are kept, and each one is associated with a penalty.
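The idea of keeping every close correction together with a penalty can be sketched as follows. This sketch does not use Hunspell: it relies on Python's standard `difflib` module as a stand-in, and the vocabulary, the cutoff and the penalty formula (1 minus the similarity ratio) are assumptions chosen for illustration.

```python
import difflib

# Hypothetical vocabulary; a real spell checker would use a full dictionary.
VOCABULARY = ["password", "passport", "passwords", "card", "loss"]


def correction_candidates(word, vocabulary=VOCABULARY, max_candidates=3):
    """Return every close correction with a penalty, without picking one."""
    close = difflib.get_close_matches(word, vocabulary, n=max_candidates, cutoff=0.7)
    candidates = []
    for candidate in close:
        similarity = difflib.SequenceMatcher(None, word, candidate).ratio()
        # Assumed convention: the penalty grows as the candidate gets less similar.
        candidates.append((candidate, round(1.0 - similarity, 3)))
    return candidates


print(correction_candidates("pasword"))
```

All candidates are returned so that later steps can weigh them; an exact match simply gets a penalty of 0.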

Identification of compound words

Possible compound words are identified in the sentence. The words that make up a compound are then grouped together under a new meaning.
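A minimal sketch of this step, assuming a hypothetical lexicon of known compounds: every window of consecutive words is looked up, and each hit is reported as a span so that the compound can be added alongside the individual words.

```python
# Hypothetical lexicon of compound words, stored as tuples of tokens.
COMPOUNDS = {("credit", "card"), ("as", "soon", "as", "possible")}


def find_compounds(tokens, compounds=COMPOUNDS, max_len=4):
    """Return (start, end, compound) for every known compound found in tokens."""
    found = []
    for start in range(len(tokens)):
        for length in range(2, max_len + 1):
            span = tuple(tokens[start:start + length])
            # Skip windows that run past the end of the sentence.
            if len(span) == length and span in compounds:
                found.append((start, start + length, " ".join(span)))
    return found


print(find_compounds(["i", "lost", "my", "credit", "card"]))
```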

Identification of lemmas

For each word, all the possible lemmas are looked up.

A lemma is the uninflected and unconjugated base form of a word, such as the infinitive of a verb or the masculine singular of an adjective.

Links to lemmas can be defined for common abbreviations, for example: "asap → as soon as possible".
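As a rough illustration, lemma identification can be seen as a dictionary lookup that may return several lemmas for one word, plus a table of abbreviations. The two tables below are hypothetical sample data, not dydu's linguistic resources.

```python
# Hypothetical sample data: a word may have more than one possible lemma.
LEMMAS = {
    "modified": ["modify"],
    "cards": ["card"],
    "left": ["leave", "left"],
}
ABBREVIATIONS = {"asap": "as soon as possible"}


def lemmas_for(word):
    """Return every lemma reachable from the word, including abbreviation links."""
    word = word.lower()
    found = list(LEMMAS.get(word, [word]))  # unknown words are their own lemma
    if word in ABBREVIATIONS:
        found.append(ABBREVIATIONS[word])
    return found


print(lemmas_for("left"))
print(lemmas_for("ASAP"))
```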

Identification of synonyms and hyperonyms

The synonyms of the lemmas as well as the hyperonyms are identified and associated with the word.

A hyperonym is a generalization of meaning.

Hyperonyms are essentially used to define a set of products or terms specific to the bot's business logic.

For example, dog and cat are not synonymous but animal is a hyperonym of both.

Synonyms are included in the language structure of the users' sentences, but not in that of the matches.

Business ontologies can be defined.

An ontology is composed of a hyperonym that designates it and of the hyponyms it contains. Using ontologies in the knowledge base reduces the number of formulations needed and improves the bot's understanding.

For example, the following ontologies can be defined:

  • Vital card: vital card, green card, etc.;

  • Attending physician: attending physician, city doctor, family doctor, etc.
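The two ontologies above can be sketched as a mapping from a hyperonym to its hyponyms, with a reverse index used to attach the hyperonym to a term found in a sentence. The variable and function names are assumptions made for this example.

```python
# Each ontology: the hyperonym that designates it -> the hyponyms it contains.
ONTOLOGIES = {
    "vital card": ["vital card", "green card"],
    "attending physician": ["attending physician", "city doctor", "family doctor"],
}

# Reverse index: hyponym -> hyperonym.
HYPERONYM_OF = {
    hyponym: hyperonym
    for hyperonym, hyponyms in ONTOLOGIES.items()
    for hyponym in hyponyms
}


def hyperonym(term):
    """Return the hyperonym of a term, or None if it belongs to no ontology."""
    return HYPERONYM_OF.get(term.lower())


print(hyperonym("family doctor"))
print(hyperonym("Green card"))
```

A formulation written once with the hyperonym then covers every hyponym of the ontology, which is how the number of formulations is reduced.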

Distance calculation

The user's sentence is represented as a flat structure, as presented above.

The formulations stored in the knowledge base can also use this flat structure, but increasingly they use a more elaborate structure that significantly reduces the workload needed for the bot to understand users.

Between two flat structures

Once we have the linguistic structure of the user's sentence, it can be compared with the sentence structures contained in the knowledge base (called matches).

This distance calculation is inspired by the TF-IDF algorithm (https://en.wikipedia.org/wiki/Tf-idf).

The distance is a sum of partial scores. Whenever a meaning is identified as being present in both the sentence and the match structures, the partial score is updated.

Each partial score depends on the weight of the word in the sentence and in the match, and on the penalty applied to the meaning in each of the two structures.
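To recall the intuition borrowed from TF-IDF, the sketch below computes a textbook smoothed inverse document frequency: a word that appears in many sentences receives a low weight, and a rare word a high one. This is the generic formula, shown on a hypothetical mini-corpus; it is not the weighting formula used by dydu, which this page does not specify.

```python
import math


def idf(word, documents):
    """Smoothed inverse document frequency: rarer words score higher."""
    containing = sum(1 for tokens in documents if word in tokens)
    return math.log((1 + len(documents)) / (1 + containing)) + 1


# Hypothetical mini-corpus, each sentence already split into words.
corpus = [
    {"how", "to", "modify", "my", "password"},
    {"loss", "of", "my", "card"},
    {"my", "card", "is", "blocked"},
]

print(idf("my", corpus))        # present in every sentence: lowest weight
print(idf("password", corpus))  # present in one sentence: higher weight
```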

Once the partial scores are calculated, we obtain a matrix containing them: one dimension of the matrix represents the words of the sentence and the other dimension represents the words of the match.

The next step is to find the maximum sum of these partial scores while counting each word only once. The search must not get stuck in a local maximum that is not optimal for the overall solution. This is an assignment problem, so we use the Hungarian algorithm (https://en.wikipedia.org/wiki/Hungarian_algorithm).

The Hungarian algorithm detects in this matrix the cells that maximize the sum.
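The sketch below illustrates the assignment problem itself on a tiny matrix. To stay short and dependency-free it uses an exhaustive search over all pairings rather than the Hungarian algorithm; both return the same optimum, but the exhaustive search is only usable on very small matrices. The function name and the sample scores are assumptions.

```python
from itertools import permutations


def best_assignment(scores):
    """scores[i][j] is the partial score between sentence word i and match word j.

    Returns (total, pairs) maximizing the sum, each word being used at most once.
    """
    n_rows, n_cols = len(scores), len(scores[0])
    # Permute the longer side so that every candidate is a one-to-one pairing.
    if n_rows <= n_cols:
        candidates = (list(zip(range(n_rows), perm))
                      for perm in permutations(range(n_cols), n_rows))
    else:
        candidates = (list(zip(perm, range(n_cols)))
                      for perm in permutations(range(n_rows), n_cols))
    best_total, best_pairs = float("-inf"), []
    for pairs in candidates:
        total = sum(scores[i][j] for i, j in pairs)
        if total > best_total:
            best_total, best_pairs = total, pairs
    return best_total, best_pairs


# Picking the largest cell first (9) forces the pairing 9 + 1 = 10,
# a local maximum; the optimal pairing is 8 + 8 = 16.
print(best_assignment([[9, 8], [8, 1]]))
```

In practice a polynomial-time solver replaces the exhaustive loop; SciPy's `scipy.optimize.linear_sum_assignment`, for instance, solves the same assignment problem.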

Since this algorithm raises performance problems when the sentences are long, we subdivide the matrix into disjoint subsets and apply the algorithm to each subset separately.
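This page does not describe how the subdivision is done. One plausible way, shown here purely as an assumption, is to treat a zero partial score as "no link" and to group the rows and columns connected by non-zero cells: each group can then be solved independently without changing the overall optimum.

```python
def split_into_blocks(scores):
    """Group rows and columns linked by non-zero cells into independent blocks."""
    n_rows, n_cols = len(scores), len(scores[0])
    seen_rows, blocks = set(), []
    for start in range(n_rows):
        if start in seen_rows:
            continue
        rows, cols, stack = set(), set(), [("r", start)]
        while stack:
            kind, index = stack.pop()
            if kind == "r" and index not in rows:
                rows.add(index)
                # Follow every non-zero cell of this row to its column.
                stack.extend(("c", j) for j in range(n_cols) if scores[index][j])
            elif kind == "c" and index not in cols:
                cols.add(index)
                # Follow every non-zero cell of this column to its row.
                stack.extend(("r", i) for i in range(n_rows) if scores[i][index])
        seen_rows |= rows
        if cols:  # rows without any non-zero cell contribute nothing
            blocks.append((sorted(rows), sorted(cols)))
    return blocks


# Rows 0 and 1 only compete for column 0; row 2 only touches column 2.
print(split_into_blocks([[5, 0, 0], [3, 0, 0], [0, 0, 7]]))
```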

The following example makes the calculation of this score more concrete.

The example computes a "flat" distance and does not consider possible combinations between matching groups.

Consider the match "Loss or theft of my life card" and the user's sentence "loss life card".

The final score lies between 0 and 1024: 0 means that the two sentences have nothing in common, and 1024 means that they are identical.

The following image comes from the debugger of this calculation. This tool is accessible only by the dydu team and allows a better understanding of the structure of a score.

Score

The score obtained between these two sentences is 770 out of 1024.

Note

If this score is among the best obtained across the bot's knowledge base, the bot responds with a reword, in which the user can confirm one of the proposed knowledge items or rephrase the sentence.

Before going into the detail of the score, note that these two formulations are close but are considered different by default. To obtain a direct answer, the user's sentence would therefore have to be added to the formulations associated with the knowledge.

Dark blue bubbles represent the words for which there is a match; the lighter blue bubbles represent the words with no match.

Weight

The weight of each word in the sentence is expressed as a percentage in the blue bubbles.

Penalty

The words in the pink bubbles represent the meanings associated with the word in the sentence. Some have a penalty of 1024; some have no penalty. Others, in lighter pink, have a penalty of 829, which is the penalty applied to synonyms.

With a tree structure

In many cases, the questions corresponding to a knowledge follow a linguistic structure that can be expressed with a very large number of formulations.

For example, let's take a look at "How to modify my password?"

This sentence is composed of two independent parts, each of which has a large number of formulations: "how to modify" on the one hand, and "my password" on the other.

Matching group "how to modify":

  • how to modify

  • modification

  • I would have to modify

  • how to update

  • change

Matching group "my password":

  • my password

  • my code

  • my confidential code

  • my secret code

⇒ In this case, defining every possible combination as flat structures would have required 5 * 4 = 20 formulations.

Only the 4 formulations for "my password" actually need to be created, since the formulations associated with "how to modify" are already defined in the solution.
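The count can be checked by enumerating the combinations that a flat structure would require. The two lists reproduce the formulations given above; the variable names are chosen for the example.

```python
from itertools import product

HOW_TO_MODIFY = [
    "how to modify", "modification", "I would have to modify",
    "how to update", "change",
]
MY_PASSWORD = ["my password", "my code", "my confidential code", "my secret code"]

# With flat structures, every pairing has to be written out explicitly.
flat_formulations = [
    f"{action} {target}" for action, target in product(HOW_TO_MODIFY, MY_PASSWORD)
]

print(len(flat_formulations))  # 20
print(flat_formulations[:3])
```

With matching groups, only the two lists (5 + 4 entries) have to be maintained instead of their 20 combinations.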

This simple example has only one level, but in reality "how to modify" itself uses the "how" matching group.

Matching group "how":

  • how

  • how does it happen when

  • how do I proceed

  • how to do

  • how to do in case of

  • know the procedure to follow

  • procedure for

  • the modalities for

This matching group contains several dozen formulations. Considering only the 8 formulations listed here, our basic example already corresponds to 8 * 20 = 160 formulations in a flat structure.

This fine-grained understanding of the language greatly reduces the workload while ensuring better understanding. It would be almost impossible to define all possible combinations with flat structures.

Enrichment of formulations

For your bot to answer users correctly, its knowledge base must contain a large number of formulations. Each knowledge is associated with a set of formulations that make it possible to recognize the sentences that must lead to the corresponding answer.

In general, a bot needs several thousand formulations in its configuration for its understanding to be correct.

Two tools are designed to significantly improve productivity when enriching the formulations. This enrichment is carried out by dydu.

  • One tool groups similar misunderstood sentences to identify the most frequent ones, so that they can be handled first;

  • Another tool uses the sentences that resulted in a reword and for which the user chose one of the rewords. The resulting associations can then be accepted or refused.

This enrichment is manual: only the suggestions are automated, for greater efficiency, and every change to the knowledge base is made by an authorized person.

Comparison of different matching algorithms

Other technologies are used by competing bots:

  • Syntax Analysis;

  • Matching keywords.

Syntax analysis

Syntax analysis consists of analyzing the sentence and highlighting its structure. It is tied to the language in which the sentence is written (for example SVO, subject-verb-object, in English).

The structure revealed by the analysis shows how the syntax rules are combined in the text. It can be represented by a syntax tree whose nodes can carry additional information for a finer analysis.

As a result, the meaning of the sentence is likely to be correctly understood and interpreted by the system, even when the nuance is subtle. On the other hand, this analysis cannot succeed when the sentences are grammatically incorrect.

Keyword matching

Keyword matching works the same way as a search engine.

The system looks in the user's sentence for the keywords that have been defined in the knowledge base. It returns the answer of the knowledge that contains one or two of the keywords found. Some systems implement keyword prioritization, or even keyword exclusion, to manage ambiguities.
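A minimal sketch of keyword matching, with a hypothetical keyword table and a hypothetical threshold of two keywords. It also shows a typical weakness: the words are counted without regard to negation.

```python
import re

# Hypothetical table: knowledge identifier -> keywords defined for it.
KEYWORDS = {
    "lost_card": {"loss", "card"},
    "change_password": {"modify", "password"},
    "trip_martinique": {"trip", "martinique"},
}


def keyword_match(sentence, keywords=KEYWORDS, min_hits=2):
    """Return the knowledge with the most keywords in the sentence, or None."""
    tokens = set(re.findall(r"\w+", sentence.lower()))
    hits = {name: len(tokens & words) for name, words in keywords.items()}
    best = max(hits, key=hits.get)
    return best if hits[best] >= min_hits else None


print(keyword_match("How to modify my password?"))
# Negation is ignored: the trip to Martinique is still selected.
print(keyword_match("I want to go on a trip, but not in Martinique"))
```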

Comparison

The following comparison lists the advantages and disadvantages of each technology.

Syntax analysis

  • Advantages: accurate understanding of the sentence; rare misinterpretations.

  • Disadvantages: complex configuration of the knowledge base; requires the input sentence to be grammatically correct (which is the case for fewer than 50% of the questions put to a bot); falls back on keyword matching when there is no result; costly in CPU and memory resources.

Keyword matching

  • Advantages: easy initial setup; very fast and inexpensive algorithm.

  • Disadvantages: frequent misinterpretations; the prioritization and exclusion rules can become tedious.

Distance calculation

  • Advantages: accurate understanding of the sentence; rare misinterpretations; fast algorithm that uses little CPU and memory; does not require the input sentence to be grammatically correct.

  • Disadvantages: a learning period is required on the first user questions to complete the formulations.

Here are some examples of the possibilities and problems of each technology:

User's sentence: "I am looking for a blue card"

  • Syntax analysis: the distinction is possible.

  • Keyword matching: the distinction is not possible, the keywords being "search" and "blue card".

  • Distance calculation: the distinction is possible.

User's sentence: "I want to go on a trip, but not in Martinique"

  • Syntax analysis: the system will not return trips to Martinique.

  • Keyword matching: the system will only return trips to Martinique.

  • Distance calculation: the knowledge base will have to be configured to take this case into account.

User's sentence: "How much does it cost per month?"

  • Syntax analysis: the question is grammatically incorrect, so the system will not understand it.

  • Keyword matching: the system will identify the keywords "month" and "cost" and will give the correct answer.

  • Distance calculation: the distance calculation will give the correct answer because it identifies the sentence as very close to the knowledge "how much does it cost per month?"

All rights reserved @ 2023 dydu.