External Contents

As a Dydu bot manager, you can centralize and organize your external content sources directly in the BMS. The bot can then generate instant responses based on these sources, which improves the quality of the answers provided to end users.

Create a collection

From the BMS navigation menu, open the External Content page: Content > External Content.

You will then arrive at your collections page, where you will find one collection created by default.

Click this collection to open its editing page.

Feed the collection

Importing your documents

You can import one or more documents of the following types: PDF, DOCX, PPTX, TXT. Each document must not exceed 10 MB.
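As an illustration, the constraints above can be checked locally before uploading. This is a minimal sketch, not part of the BMS: the function name `check_document` is hypothetical, and it assumes that "10 MB" means 10 × 1024 × 1024 bytes, which the documentation does not specify.

```python
from pathlib import Path

# Assumption: "10 MB" is read here as 10 * 1024 * 1024 bytes.
MAX_SIZE_BYTES = 10 * 1024 * 1024
ALLOWED_EXTENSIONS = {".pdf", ".docx", ".pptx", ".txt"}


def check_document(name: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks importable."""
    problems = []
    extension = Path(name).suffix.lower()
    if extension not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported type: {extension or 'none'}")
    if size_bytes > MAX_SIZE_BYTES:
        problems.append("larger than 10 MB")
    return problems


print(check_document("guide.pdf", 2_000_000))
print(check_document("sheet.xlsx", 2_000_000))
```

Running such a check over a folder before importing helps spot files that would likely be rejected.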

Adding SharePoint sources

The SharePoint indexing tool allows you to add your pages and files to your knowledge base.

To authorize this access, a new application with read permissions must be registered in your Microsoft environment. The complete process is explained in this official tutorial.


During configuration, at the API permissions stage, two authorizations are required for the Dydu application. In "Microsoft Graph," then "Application Permissions," you must select the following rights and validate with administrator consent (Grant Admin Consent):

  • Files.Read.All

  • Sites.Selected

To finalize the application link, four technical elements must then be collected and saved:

  • The clientId (client identifier)

  • The client Secret (secret value)

  • The tenant Id (environment identifier)

  • The SharePoint site ID

The steps below describe how to retrieve the values required for the Dydu LLM configuration from Azure.

Go to the Azure portal:

  1. Click on App registrations.

  2. Click on New registration.

  3. Provide a name and click "Register".

  4. The Application (client) ID is the client_id.

  5. Click on Certificates & secrets. Then, under the "Client secrets" tab, click on New client secret.

  6. Copy the generated Secret Value (client_secret).

  7. Click on API permissions. Then click on Add a permission.

  8. Click on Microsoft Graph.

  9. Then click on "Application permissions". Add the Sites.Selected and Files.Read.All permissions.

  10. Click on Grant admin consent for XXXX.

  11. To find the Tenant ID:

Go to the site: https://entra.microsoft.com/

Click on "Overview".

The tenant ID is the value labeled Tenant ID on this page.

  12. To find the SharePoint ID:

Compose the following URL: https://<tenant>.sharepoint.com/sites/<site-url>/_api/site/id

The SharePoint ID is found within the result.
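The URL pattern for the SharePoint ID can be assembled with a small helper. This is only a sketch: the function name `site_id_url` and the sample values `contoso` and `hr-portal` are hypothetical placeholders for your own tenant and site.

```python
from urllib.parse import quote


def site_id_url(tenant: str, site_url: str) -> str:
    """Build the URL that returns the SharePoint site ID."""
    # quote() percent-encodes characters such as spaces in the site name.
    return f"https://{tenant}.sharepoint.com/sites/{quote(site_url)}/_api/site/id"


print(site_id_url("contoso", "hr-portal"))
```

Open the resulting URL in a browser session that is already signed in to the SharePoint site to read the ID.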

The tool offers the following features:

  • Indexing of pages and files from an entire SharePoint site.

  • Standard RAG usage, with the source URL of the SharePoint document displayed in the provided response.

  • Optional authentication linking (SAML). In this case, the user must log in; the system retrieves their group memberships and filters responses according to their document access rights.


Items that are not indexed:

  • Files directly embedded within pages.

  • Videos and certain specific formats (Excel, WMF, etc.).

Document retrieval and indexing currently take several minutes, and sources can be refreshed at most once per day.

Adding a personalized FAQ

To set up a personalized FAQ, simply provide the following information:

  • Name: Corresponds to the API address (URL) to be used.

  • API Key.

  • API Secret.

  • List of IDs for the knowledge bases to be retrieved.

A single combination of API Key and API Secret allows access to multiple knowledge bases simultaneously.

Based on this information, all documents within the specified knowledge bases are automatically retrieved via the FAQ channel.

Adding a Salesforce configuration

To set up a Salesforce configuration, simply provide the following information:

  • Name: Name of the integration

  • Client ID: Client key from the Salesforce configuration

  • Client Secret: Secret key from the Salesforce configuration

  • Content Access URL: URL used to retrieve documents from your configuration

Adding Website sources

There are three types of Websites that can be indexed:

Domain

When you provide a web address to crawl, the tool first looks for the site's sitemap to identify pages.

If no sitemap is found, the crawl starts directly from the address you entered.


If this address corresponds to a specific folder on your site, the search will only take place from that precise location.
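The folder restriction can be pictured as a simple prefix test on the URL path. The sketch below is an assumption about how such a scope rule could work, not a description of the actual crawler; the function name `in_crawl_scope` is hypothetical.

```python
from urllib.parse import urlsplit


def in_crawl_scope(start_url: str, candidate_url: str) -> bool:
    """Return True if candidate_url is on the same host and under the start folder."""
    start, candidate = urlsplit(start_url), urlsplit(candidate_url)
    if start.netloc.lower() != candidate.netloc.lower():
        return False
    # Treat the start path as a folder so "/docs" does not match "/docs-old".
    folder = start.path if start.path.endswith("/") else start.path + "/"
    return (candidate.path + "/").startswith(folder)


print(in_crawl_scope("https://example.com/docs", "https://example.com/docs/setup"))
print(in_crawl_scope("https://example.com/docs", "https://example.com/blog/news"))
```

Under this reading, starting the crawl from a folder such as /docs would leave pages outside that folder unvisited.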

Sitemap

A sitemap acts like a map of a website. This file lists all the important pages of a site. If you select a sitemap, the tool will only crawl the addresses listed within it.

Specific URLs

By providing a list of web addresses (URLs), you precisely define the exact pages the tool should analyze.

Collection Details

The following information is displayed for each source added to your collection:

  • Name: The name of your source.

  • Added by: The bot manager's ID.

  • Creation date: The date you added your source.

  • Preparation: Status and actions related to source preparation.

  • Indexing: Status and actions related to source indexing.

  • Last indexed on: The date of the most recent indexing.

  • Actions: Available actions for sources (edit, delete, and view details).


Preparation is the individual data retrieval stage, during which the tool downloads and reads the content of each added source.

Indexing is the global stage that gathers and integrates all these sources into the knowledge base to allow the bot to generate responses. Any modification to a source requires restarting this global indexing.

Several statuses are available to track the progress of your content:

  • Waiting for action: No action has been initiated on this source yet.

  • Scheduled: Preparation or indexing of the source is scheduled and will run soon.

  • Canceled: The preparation or indexing process was interrupted.

  • Preparing: Downloading and reading the source data is in progress.

  • Ready: Data has been successfully retrieved; the source is now awaiting indexing.

  • Preparation failed: An error prevented data retrieval for this source.

  • Indexing in progress: Data integration into the knowledge base is being processed.

  • Indexed: The source is fully integrated into the base, and the bot can use it to generate responses.

  • Partial indexing: The base requires an update. For example, a new source was added but has not yet been indexed with the rest.

  • LLM config test failed: The process stopped due to an error in the language model configuration.

  • Configuration file not found: A technical server error prevented the operation from completing correctly.

Suggestions and Indexing

Prepare and index the collection

This main button allows you to simultaneously launch the preparation and indexing of your entire collection (including all configured sources).

By clicking the small adjacent arrow, you can access two specific options:

  • Prepare collection only (without launching indexing).

  • Index only items that have already been prepared.

It is also possible to act on an individual source: simply click directly on that source's status button to prepare or index it in isolation.

Suggest knowledge from the collection

This button allows you to prepare your collection in order to extract an Excel knowledge file, which you can then import directly into your bot.

Important points:

  • No indexing: This action does not index the collection.

  • No RAG: It does not allow the bot to use these documents to generate responses autonomously.

The sole purpose of this button is the creation of this export file.

Details for collection items with "Completed with errors" status

Once indexing or suggestion is completed, you may see a "Completed with errors" status.

Click the status to display a report with the error details.

  • Details of errors from Websites:

The report details show a percentage of successes and errors. A breakdown of HTTP error codes is provided.

Errors may be classified into different categories, such as server-side issues or others.

  • Details of errors from SharePoint:

The report details also show a percentage of successes and errors. The report provides full details on all pages that could not be retrieved, as well as the folders involved.

For each folder, it also specifies the particular files that could not be retrieved, allowing for clear identification of missing items.

Collection configuration

Customizing responses

Configuring the indexing parameters of a collection allows you to precisely adapt the bot’s behavior to your business needs and the desired user experience. Each collection has a dedicated card where you can adjust several options to optimize the relevance, length, and style of generated answers, as well as the selection of information sources.

  • Temperature defines the style of the bot’s answers: the higher the temperature, the more creative the answers can be; conversely, a low temperature favors strictly factual answers. This setting is especially useful to ensure that the tone and level of creativity of the bot match your usage context.

  • Number of output tokens refers to the length of generated answers. You can choose between short, medium, or detailed answers depending on the complexity of the topics covered or your users’ preferences. Adjusting this parameter helps deliver more concise or, on the contrary, more in-depth information.

  • Minimum score required for answer sources lets you filter the documents used by the bot: only sources with a score equal to or higher than the defined value will be considered in generating answers and displaying cited sources. This setting ensures that only sources deemed sufficiently relevant or reliable are used to build the answer.

  • Additional prompt gives you the possibility to add specific context or an instruction that will always be considered when generating answers for the relevant collection. This free-text field allows you, for example, to impose a tone, specify a business instruction, or guide the bot on a sensitive topic.

  • Additional prompt placement gives you better control over the final prompt sent to the model. You can view the complete final prompt and choose where your additional prompt is inserted: at the beginning, in the middle, or at the end.


You cannot modify the content of the final prompt itself; you can only insert the additional prompt and define its position to optimize the model's response.

Advanced Mode: Response Customization

By enabling advanced mode, new configuration options appear to control how the system selects information.

Minimum score required for response sources

In this block, you will find a new checkbox: Enable/Disable score filtering before response generation.

  • Disabled (default): All retrieved sources are used to generate the response, regardless of their relevance score.

  • Enabled: The system applies a strict upstream filter. Only sources with a score greater than or equal to your defined minimum will be kept and used to draft the response.
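The two checkbox states can be summarized in a few lines of code. This is an illustrative sketch only: the `Source` class and the `select_sources` function are hypothetical names, and the scores are made-up values.

```python
from dataclasses import dataclass


@dataclass
class Source:
    title: str
    score: float


def select_sources(sources: list[Source], min_score: float, filter_enabled: bool) -> list[Source]:
    """Apply the upstream score filter only when the checkbox is enabled."""
    if not filter_enabled:
        # Disabled (default): every retrieved source is passed on.
        return list(sources)
    # Enabled: keep only sources at or above the minimum score.
    return [s for s in sources if s.score >= min_score]


retrieved = [Source("Pricing", 0.82), Source("Legal notice", 0.41)]
print([s.title for s in select_sources(retrieved, 0.5, filter_enabled=True)])
```

With the filter enabled and a minimum score of 0.5, only the first source in this example would reach the model.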

Response Generation

This new block allows you to configure the amount of information sent to the model to build its response. You will find the following options:

  • Number of direct sources (Top K): This parameter is set on a scale of 1 to 10 and defines the number of text extracts (chunks) sent directly to the LLM. It is strongly advised to keep this value low (between 2 and 4). If set too high, the model may be overwhelmed by less relevant information, increasing the risk of hallucinations.

  • Enable/Disable LLM Rerank to improve response accuracy: This option activates a second review of the extracts to keep only the most relevant ones. While this slightly slows down response generation, it greatly improves the quality and accuracy of the final result.

Specific Rerank Parameters

When you check the LLM Rerank activation box, the interface adapts and new configuration options appear in the response generation block:

  • Pre-selection range (Top K): The initial parameter ("Number of direct sources") changes its name and behavior. The model will first scan these K extracts to identify the most relevant ones before keeping only the best (N) to answer. Unlike the classic mode, you can set a higher value here (between 10 and 30). This parameter is adjustable on a scale of 1 to 50.

  • Number of chunks to use (Top N): This new parameter (adjustable from 1 to 10) corresponds to the final number of extracts that will actually be used to write the response. The Rerank chooses these N best items from the pre-selection (K). It is recommended to keep this value low (between 2 and 4).

  • Processing power (Batch size): This setting defines the number of extracts processed at once from the Top K. A higher value speeds up the response processing time but requires more system resources. Note: The scale of this parameter automatically adapts to your Top K configuration (for example, if your Pre-selection range is set to 35, the Batch size can be configured from 1 to 35).
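To make the interplay between Top K, Top N, and Batch size more concrete, here is a minimal sketch of a rerank step. It is not the Dydu implementation: the `rerank` function is hypothetical, and `score_batch` stands in for the LLM call that rates each extract (here replaced by a toy scorer).

```python
from typing import Callable, Sequence


def rerank(
    chunks: Sequence[str],
    score_batch: Callable[[Sequence[str]], list[float]],
    top_k: int = 20,
    top_n: int = 3,
    batch_size: int = 10,
) -> list[str]:
    """Keep the top_n best chunks among the first top_k retrieved chunks."""
    if not 1 <= top_k <= 50:
        raise ValueError("top_k must be between 1 and 50")
    if not 1 <= top_n <= 10:
        raise ValueError("top_n must be between 1 and 10")
    if not 1 <= batch_size <= top_k:
        raise ValueError("batch_size must be between 1 and top_k")

    candidates = list(chunks[:top_k])  # pre-selection range (Top K)
    scores: list[float] = []
    for start in range(0, len(candidates), batch_size):
        # Each call scores at most batch_size chunks at once.
        scores.extend(score_batch(candidates[start:start + batch_size]))

    ranked = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in ranked[:top_n]]


def toy_scorer(batch: Sequence[str]) -> list[float]:
    # Stand-in for the model: longer chunks get a higher score.
    return [float(len(chunk)) for chunk in batch]


print(rerank(["a", "bbbb", "cc", "ddd", "eeeee"], toy_scorer, top_k=4, top_n=2, batch_size=2))
```

In this sketch a larger batch size means fewer scoring calls for the same Top K, which is one way to read the speed versus resources trade-off described above.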

Dynamic variables

Dynamic variables can be used in the additional prompt of each collection. For example, ${capture.user_name} is automatically replaced by the actual value retrieved during the conversation or from a web service.

If a variable is not available, it is ignored or replaced by an empty string. This makes it possible to personalize the instructions sent to the RAG engine, resulting in answers tailored to each user’s context.
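The substitution behavior described above can be sketched as follows. This is an illustration under assumptions, not the platform's code: the `render_prompt` function and the placeholder pattern are hypothetical, and missing variables are replaced by an empty string, which is one of the two behaviors mentioned.

```python
import re

# Matches placeholders such as ${capture.user_name}.
PLACEHOLDER = re.compile(r"\$\{([A-Za-z0-9_.]+)\}")


def render_prompt(template: str, variables: dict[str, str]) -> str:
    """Replace each ${name} with its value, or with an empty string if unknown."""
    return PLACEHOLDER.sub(lambda m: variables.get(m.group(1), ""), template)


template = "Address the user as ${capture.user_name}."
print(render_prompt(template, {"capture.user_name": "Alice"}))
print(render_prompt(template, {}))
```

Because a missing variable leaves a gap in the sentence, it can help to word the additional prompt so that it still reads correctly when the value is empty.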

In order for the capture variables to be correctly replaced in the prompt, they need to be added to the parameters of the Web service: Dydu_RAG. Here is an example with the capture variable user_name:


Contextualizing RAG with metadata

It is possible to precisely target the documents used by the RAG to generate a response. To do this, you can filter content using the metadata associated with each document (such as a URL or a category). All metadata can be used for this filtering, with the exception of the score.

Example of metadata:


It is possible to view the metadata of your documents by clicking the button of an indexed collection.

The configuration is done directly in the webservice named Dydu_RAG. Simply add a new parameter titled metadataFilters. The value of this parameter must be entered in this format: [{"key": "key", "operator": "operator", "value": "value"}].

Three operators are available to define your filter:

  • EQUALS: keeps only the content exactly matching the value.

  • NOT_EQUAL: excludes content exactly matching the value.

  • SUB_STRING: keeps content that includes all or part of the value.

For example, to limit the bot exclusively to Dydu product pages, use the SUB_STRING operator on the URL as follows: [{"key": "url", "operator": "SUB_STRING", "value": "https://www.dydu.ai/produits/"}].
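The sketch below shows one way to build the metadataFilters value and to reason about what each operator keeps. It rests on assumptions that the documentation does not confirm: the `matches` function is hypothetical, several filters are combined with a logical AND, a missing metadata key is treated as an empty string, and SUB_STRING is read as "the metadata value contains the filter value".

```python
import json


def matches(metadata: dict[str, str], filters: list[dict[str, str]]) -> bool:
    """Return True if the document metadata satisfies every filter (assumed AND)."""
    for f in filters:
        actual = metadata.get(f["key"], "")  # assumption: missing key -> ""
        expected = f["value"]
        if f["operator"] == "EQUALS":
            ok = actual == expected
        elif f["operator"] == "NOT_EQUAL":
            ok = actual != expected
        elif f["operator"] == "SUB_STRING":
            ok = expected in actual
        else:
            raise ValueError(f"unknown operator: {f['operator']}")
        if not ok:
            return False
    return True


filters = [{"key": "url", "operator": "SUB_STRING", "value": "https://www.dydu.ai/produits/"}]
print(json.dumps(filters))  # string to paste into the metadataFilters parameter
print(matches({"url": "https://www.dydu.ai/produits/chatbot"}, filters))
```

Generating the value with `json.dumps` avoids quoting mistakes when the filter list grows.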

Displaying the RAG Score in Responses

To display the metadata score attributed by the RAG (Retrieval-Augmented Generation) model for each response provided, it is necessary to modify the Dydu_RAG webservice and integrate the information into the response format.

To retrieve the RAG score, the variable must be extracted from the webservice's return JSON.

Add the following line inside the JSON structure to extract the score value:

After extracting the value, you must define the display variable and format how the score will be presented.

This code checks for the existence of the score and formats it for display, for example, by adding a line break and the label:

This completes the configuration; the response score will now be displayed with every answer generated by the RAG.


Properly configuring these parameters allows you to obtain relevant, reliable, and tailored answers, while maintaining control over how the bot interacts with your users for each indexed data collection.

Content management

Automatic reindexing

You can configure the reindexing frequency of collections using four modes: none, daily, weekly, or monthly.

  • None: no reindexing is scheduled, data remains unchanged.

  • Daily: reindexing is performed automatically every day at midnight.

  • Weekly: reindexing takes place every Monday at midnight.

  • Monthly: reindexing is performed on the first Monday of each month at midnight.

The day and time of reindexing are predefined and cannot be changed. This configuration allows you to adjust the data update frequency to your needs, while keeping the process simple and automatic.
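The four schedules can be expressed as a small date calculation. This is a sketch for illustration: the `next_reindex` function is hypothetical, and it assumes that "midnight" refers to the same time zone as the `now` value passed in, which the documentation does not specify.

```python
from datetime import datetime, timedelta
from typing import Optional


def next_reindex(now: datetime, mode: str) -> Optional[datetime]:
    """Return the next scheduled reindexing time strictly after `now`."""
    if mode == "none":
        return None
    if mode not in ("daily", "weekly", "monthly"):
        raise ValueError(f"unknown mode: {mode}")
    # Midnight at the start of the following day.
    candidate = (now + timedelta(days=1)).replace(hour=0, minute=0, second=0, microsecond=0)
    if mode == "daily":
        return candidate
    while True:
        is_monday = candidate.weekday() == 0
        if mode == "weekly" and is_monday:
            return candidate
        # The first Monday of a month always falls on day 1 to 7.
        if mode == "monthly" and is_monday and candidate.day <= 7:
            return candidate
        candidate += timedelta(days=1)


now = datetime(2024, 5, 15, 10, 30)  # a Wednesday
for mode in ("none", "daily", "weekly", "monthly"):
    print(mode, next_reindex(now, mode))
```

For a Wednesday in mid-May 2024, this gives the next day for daily, the following Monday for weekly, and the first Monday of June for monthly.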

Content access optimization

This option provides fast responses while maintaining high accuracy. It is essential when processing a large volume of data.

However, it may be less effective if your knowledge base is small. It is therefore recommended to test it first to verify its effectiveness on your content using the "Test the RAG" feature.
