Static corpus

The Q&A service lets you create a Q&A AI agent that uses static data sources: company website pages, product manuals, guidelines, FAQ pages, articles and so on.

You can define the following types of static corpuses in the dialog script:

  • Web corpus: retrieve information from website pages and PDF files available online

  • Text corpus: use plain text as an information source

Web corpus

To define a web corpus for your Q&A AI agent, use the corpus() function.

Dialog script
corpus({
    title: `HTTP corpus`,
    urls: [
        `https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview`,
        `https://developer.mozilla.org/en-US/docs/Web/HTTP/Messages`,
        `https://developer.mozilla.org/en-US/docs/Web/HTTP/Session`],
    auth: {username: 'johnsmith', password: 'password'},
    include: [/.*\.pdf/],
    exclude: [`https://developer.mozilla.org/en-US/docs/Web/HTTP/Basics_of_HTTP/Evolution_of_HTTP`],
    query: transforms.queries,
    transforms: transforms.answers,
    depth: 1,
    maxPages: 5,
    priority: 0,
});
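The include value in the example above is an ordinary JavaScript RegEx. The sketch below runs in plain JavaScript outside the dialog script and only shows what such a pattern matches. The candidate URLs, the stricter endsWithPdf pattern, and the matching() helper are hypothetical. How the Q&A service itself applies the patterns is covered in Corpus includes and excludes.

```javascript
// Hypothetical URLs a crawler might come across.
const candidates = [
    'https://example.com/docs/manual.pdf',
    'https://example.com/docs/manual.pdf?download=1',
    'https://example.com/docs/report.pdf.html',
    'https://example.com/docs/overview',
];

// Pattern from the example above: matches any URL that contains ".pdf".
const containsPdf = /.*\.pdf/;

// Stricter variant (an assumption about intent): ".pdf" must end the URL,
// optionally followed by a query string.
const endsWithPdf = /\.pdf(\?.*)?$/;

// Hypothetical helper: returns the URLs that the pattern matches.
function matching(urls, pattern) {
    return urls.filter((url) => pattern.test(url));
}

console.log(matching(candidates, containsPdf));
console.log(matching(candidates, endsWithPdf));
```

The unanchored pattern also matches report.pdf.html, which is an HTML page, so anchoring the pattern may be worth considering when only PDF files are wanted.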

Corpus parameters

| Name | Type | Required/Optional | Description |
|---|---|---|---|
| title | string | Optional | Corpus title. |
| urls | string array | Required | List of URLs from which information must be retrieved. You can define URLs of website folders and pages. |
| auth | JSON object | Optional | Credentials to access resources that require basic authentication: {username: 'johnsmith', password: 'password'}. For details, see Protected web resources. |
| include | string array | Optional | Resources that must be indexed. You can define an array of URLs or use RegEx to specify a rule. For details, see Corpus includes and excludes. |
| exclude | string array | Optional | Resources to be excluded from indexing. You can define an array of URLs or use RegEx to specify a rule. For details, see Corpus includes and excludes. |
| query | function | Optional | Transforms function used to process user queries. For details, see Static corpus transforms. |
| transforms | function | Optional | Transforms function used to format the corpus output. For details, see Static corpus transforms. |
| depth | integer | Optional | Crawl depth for web and PDF resources. The minimum value is 0 (only the page content is crawled, without linked resources). For details, see Crawling depth. |
| maxPages | integer | Optional | Maximum number of pages and files to index. If not set, only the single page at the defined URL is indexed. |
| priority | integer | Optional | Priority level assigned to the corpus. Corpuses with higher priority are considered more relevant when user requests are processed. For details, see Corpus priority. |
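To make the interplay of depth and maxPages more concrete, here is a minimal sketch of a breadth-first crawl over a hypothetical in-memory link graph. It is not the service's actual crawler. The links graph, the crawl() function, and the default depth of 0 are assumptions made for illustration. The default maxPages of 1 follows the description above.

```javascript
// Hypothetical link graph: page -> pages it links to.
const links = {
    '/start': ['/a', '/b'],
    '/a': ['/c'],
    '/b': ['/c', '/d'],
    '/c': [],
    '/d': [],
};

// Breadth-first traversal limited by depth and maxPages.
// depth = 0 visits only the start pages; each extra level follows links once more.
function crawl(startUrls, { depth = 0, maxPages = 1 } = {}) {
    const visited = new Set();
    let frontier = [...startUrls];
    for (let level = 0; level <= depth && frontier.length > 0; level++) {
        const next = [];
        for (const url of frontier) {
            if (visited.size >= maxPages) return [...visited]; // page budget used up
            if (visited.has(url)) continue;                    // skip duplicates
            visited.add(url);
            next.push(...(links[url] ?? []));
        }
        frontier = next;
    }
    return [...visited];
}

console.log(crawl(['/start'], { depth: 0, maxPages: 5 }));
console.log(crawl(['/start'], { depth: 1, maxPages: 5 }));
console.log(crawl(['/start'], { depth: 2, maxPages: 4 }));
console.log(crawl(['/start'], { depth: 1 }));
```

In this sketch, whichever limit is reached first stops the crawl: a large depth does not help if maxPages is left unset, and a large maxPages does not help if depth is 0.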


Note

Mind the following:

  • Make sure the websites and pages you define in the corpus() function are not protected from crawling. The Q&A service cannot retrieve content from such resources.

  • The indexing process may take some time. To check the progress and results, use the Alan AI Studio logs.

  • The maximum number of indexed pages depends on your pricing plan. For details, contact the Alan AI Sales Team.

Text corpus

To define a text corpus for your Q&A AI agent, pass plain text or Markdown-formatted text in the text parameter of the corpus() function:

Dialog script
corpus({
    title: `Async/await corpus`,
    text: `
        # Understanding **async/await** in JavaScript

       **async/await** is a feature in JavaScript that makes working with asynchronous code easier and more readable. It allows you to write asynchronous code that looks and behaves like synchronous code, making it easier to follow and understand.

       ## How Does **async/await** Work?

       ### **async** Keyword:

       - The **async** keyword is used to declare a function as asynchronous.
       - An **async** function returns a **Promise**, and it can contain **await** expressions that pause the execution of the function until the awaited **Promise** is resolved.

       ### **await** Keyword:

       - The **await** keyword can only be used inside an **async** function.
       - It pauses the execution of the function until the **Promise** passed to it is settled (either fulfilled or rejected).
       - The resolved value of the **Promise** is returned, allowing you to work with it like synchronous code.

       ## Why Use **async/await**?

       ### Readability:

       - By using **async/await**, you can avoid the complexity of chaining multiple **.then()** methods when dealing with Promises.
       - Your code looks more like traditional synchronous code, making it easier to read.

       ### Error Handling:

       - Error handling with **async/await** is simpler and more consistent with synchronous code.
       - You can use **try/catch** blocks to handle errors.
    `,
    query: transforms.queries,
    transforms: transforms.answers,
    priority: 0,
});
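When the source material already exists as structured data, the Markdown for the text parameter can be assembled in code rather than typed by hand. The sketch below is one possible way to do that. The faq entries and the faqToMarkdown() helper are hypothetical and not part of the Q&A service.

```javascript
// Hypothetical FAQ entries; in practice these would come from your own content.
const faq = [
    {
        question: 'What is HTTP?',
        answer: 'HTTP is a protocol for fetching resources such as HTML documents.',
    },
    {
        question: 'Is HTTP stateless?',
        answer: 'Yes. Each request is handled independently of earlier requests.',
    },
];

// Hypothetical helper: renders a title and question/answer pairs as Markdown.
function faqToMarkdown(title, entries) {
    const sections = entries.map(
        ({ question, answer }) => `## ${question}\n\n${answer.trim()}`
    );
    return [`# ${title}`, ...sections].join('\n\n');
}

const text = faqToMarkdown('HTTP FAQ', faq);
console.log(text);

// In a dialog script, the result could then be passed on, for example:
// corpus({ title: `HTTP FAQ`, text: text });
```

Generating the text this way keeps heading levels consistent across entries, which may make the corpus easier to maintain as the FAQ grows.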

| Name | Type | Required/Optional | Description |
|---|---|---|---|
| title | string | Optional | Corpus title. |
| text | string (plain text or Markdown) | Required | Text corpus presented as plain text strings or Markdown-formatted strings. |
| query | function | Optional | Transforms function used to process user queries. For details, see Static corpus transforms. |
| transforms | function | Optional | Transforms function used to format the corpus output. For details, see Static corpus transforms. |
| priority | integer | Optional | Priority level assigned to the corpus. Corpuses with higher priority are considered more relevant when user requests are processed. For details, see Corpus priority. |
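The parameter descriptions in this section can be turned into a quick sanity check that runs before a configuration is passed to corpus(). The checkCorpusConfig() helper below is hypothetical and only encodes the types and limits listed here. Treating an empty urls array as an error is an assumption, and the service may apply further rules of its own.

```javascript
// Hypothetical helper: returns a list of problems found in a corpus config.
function checkCorpusConfig(config) {
    const problems = [];
    const hasUrls = 'urls' in config;
    const hasText = 'text' in config;

    // A web corpus requires urls; a text corpus requires text.
    if (!hasUrls && !hasText) {
        problems.push('set urls (web corpus) or text (text corpus)');
    }
    if (hasUrls) {
        const ok = Array.isArray(config.urls) && config.urls.length > 0
            && config.urls.every((url) => typeof url === 'string');
        if (!ok) problems.push('urls must be a non-empty array of strings');
    }
    if (hasText && typeof config.text !== 'string') {
        problems.push('text must be a string');
    }
    // depth, maxPages and priority are integers when present.
    for (const name of ['depth', 'maxPages', 'priority']) {
        if (name in config && !Number.isInteger(config[name])) {
            problems.push(`${name} must be an integer`);
        }
    }
    // The minimum crawl depth is 0.
    if (Number.isInteger(config.depth) && config.depth < 0) {
        problems.push('depth must be 0 or greater');
    }
    // query and transforms are functions when present.
    for (const name of ['query', 'transforms']) {
        if (name in config && typeof config[name] !== 'function') {
            problems.push(`${name} must be a function`);
        }
    }
    return problems;
}

const webConfig = {
    title: 'HTTP corpus',
    urls: ['https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview'],
    depth: 1,
    maxPages: 5,
    priority: 0,
};

console.log(checkCorpusConfig(webConfig));
console.log(checkCorpusConfig({ title: 'No source', depth: -1 }));
```

An empty result means that nothing in the configuration contradicts the descriptions above. It does not guarantee that the URLs are reachable or open to crawling.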