Puppeteer crawler

In some situations, you may need to crawl websites with complex structures or interactions. To make sure you capture all necessary data, you can use the Puppeteer crawler.

Puppeteer works by programmatically interacting with websites during the crawling process. It can simulate user interactions such as clicks and scrolling and wait for content to fully load. This approach helps ensure that all required content is captured, even from dynamic or interactive pages.

Puppeteer can be used to crawl various types of content, including:

  • Specific page sections: extract the main content while excluding unrelated elements.

  • Dynamically loaded content: capture data from websites that load content through JavaScript after the initial page load.

  • Paginated or filtered results: crawl knowledge base articles, forums or portals that use dynamic filtering or pagination.

  • Versioned pages: access content from websites with versioned data.

To use the Puppeteer crawler, define it in the crawler parameter of the corpus() function and specify a function that crawls the required content.

Dialog script
corpus({
    title: `Knowledge Base`,
    urls: [`urls to crawl`],
    crawler: {
        puppeteer: crawlPages,
        browserLog: 'on',
        args: {arg1: 'value1', arg2: 'value2'},
    },
    auth: {username: 'johnsmith', password: 'password'},
    include: [/.*\.pdf/],
    exclude: [/.*\.zip/],
    query: transforms.queries,
    transforms: transforms.answers,
    depth: 10,
    maxPages: 10,
    priority: 1
});

async function* crawlPages({url, page, document, args}) {

    // crawlPages function code ...

}
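
The crawler function is an async generator that receives the URL of the page being crawled, the Puppeteer page object, the page document and the crawler args. As the examples below show, it can yield page content as {content, mimeType} objects and further URLs to crawl as {urls} objects. The following is a minimal, hedged sketch of what the body of such a function might look like; the button.load-more and div.article-body selectors are hypothetical placeholders that you would replace with selectors from your own pages:

Dialog script
async function* crawlPages({url, page, document, args}) {
    // Hypothetical selectors: adjust them to the structure of your pages
    const loadMoreSelector = 'button.load-more';
    const contentSelector = 'div.article-body';

    // Simulate a user interaction so that dynamically loaded content appears
    try {
        await page.click(loadMoreSelector);
        await page.waitForSelector(contentSelector, {timeout: 10000});
    } catch(e) {
        // The button or the content block may be missing on some pages
    }

    // Extract the HTML of the target element and convert it to Markdown
    const html = await page.evaluate((selector) => {
        const el = document.querySelector(selector);
        return el ? el.innerHTML : document.body.innerHTML;
    }, contentSelector);
    let content = await api.html2md_v2({html, url});
    yield {content, mimeType: 'text/markdown'};

    // Optionally, yield links found on the page for further crawling
    const urls = await page.evaluate(() =>
        Array.from(document.querySelectorAll('a')).map(a => a.href));
    yield {urls};
}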

Corpus parameters

  • title (string, optional): corpus title.

  • urls (string array, required): list of URLs from which information must be retrieved. You can define URLs of website folders and pages.

  • crawler (JSON object, required): type of crawler and function to be used to index corpus content. Crawler parameters:

    • puppeteer: function used to crawl data.

    • args: arguments to be passed to the crawler function.

    • browserLog: parameter that controls the Puppeteer logging mode. Set it to on to print logs from the browser interaction process to Alan Studio Logs, or to off to disable logging.

  • auth (JSON object, optional): credentials to access resources that require basic authentication: {username: 'johnsmith', password: 'password'}. For details, see Protected web resources.

  • include (string array, optional): resources that must be indexed. You can define an array of URLs or use RegEx to specify a rule. For details, see Corpus includes and excludes.

  • exclude (string array, optional): resources to be excluded from indexing. You can define an array of URLs or use RegEx to specify a rule. For details, see Corpus includes and excludes.

  • query (function, optional): transforms function used to process user queries. For details, see Puppeteer transforms.

  • transforms (function, optional): transforms function used to format the corpus output. For details, see Puppeteer transforms.

  • depth (integer, optional): crawl depth for web and PDF resources. The minimum value is 0 (crawling only the page content without linked resources). For details, see Crawling depth.

  • maxPages (integer, optional): maximum number of pages and files to index. If not set, only 1 page with the defined URL is indexed.

  • priority (integer, optional): priority level assigned to the corpus. Corpuses with higher priority are considered more relevant when user requests are processed. For details, see Corpus priority.
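
Only the urls and crawler parameters are required. A minimal corpus definition could therefore look like the sketch below; the URL is a placeholder, and the crawlPages() function is assumed to be defined elsewhere in the script. Note that with maxPages omitted, only the page with the defined URL is indexed:

Dialog script
corpus({
    urls: [`https://example.com/docs/`],
    crawler: {
        puppeteer: crawlPages,
    },
});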

Examples of use

Assume you need to crawl only the article content displayed on a page, without the header and footer. To do this, you can implement the following script:

Dialog script
async function* crawlPages({url, page, document}) {
    const html = await page.evaluate(() => {
        const selector = 'div.content_col';
        return document.querySelector(selector).innerHTML;
    });
    let content = await api.html2md_v2({html, url});
    yield {content, mimeType: 'text/markdown'};
}

corpus({
    title: `Slack Docs`,
    urls: [
        "https://slack.com/help/articles/360017938993-What-is-a-channel",
        "https://slack.com/help/articles/205239967-Join-a-channel",
        "https://slack.com/help/articles/201402297-Create-a-channel"],
    crawler: {
        puppeteer: crawlPages,
        browserLog: 'off',
    },
    maxPages: 10,
    depth: 3,
    priority: 1
});

Here is how the script works:

  1. The crawlPages() function specifies the behavior of the Puppeteer crawler and identifies which content should be crawled:

    1. Within each page to be indexed, the crawler extracts content from the element with the div.content_col selector.

    2. Each page’s content is then converted to Markdown with html2md_v2() and incorporated into the corpus.

  2. The corpus() function defines the URLs to be crawled and the crawler type and function to be used, and includes other parameters such as the maximum number of pages and the corpus priority. A more defensive variation of the crawler function is sketched below.
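
On some pages, the element matching div.content_col may be missing; document.querySelector() would then return null and the extraction would fail. A slightly more defensive variation of the same function, shown here as a hedged sketch, falls back to the whole page body when the expected element is not found:

Dialog script
async function* crawlPages({url, page, document}) {
    const html = await page.evaluate(() => {
        const el = document.querySelector('div.content_col');
        // Fall back to the whole page body if the expected element is missing
        return el ? el.innerHTML : document.body.innerHTML;
    });
    let content = await api.html2md_v2({html, url});
    yield {content, mimeType: 'text/markdown'};
}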

Assume you need to crawl a list of articles dynamically loaded through the HubSpot Knowledge Base portal. The articles should meet the following criteria:

  • Article type: Knowledge Base

  • User query: templates

  • Articles to index: results on pages 1-3

To do this, you can implement the following script:

Dialog script
function createKBUrls({types, query, maxPages=3}) {
    let res = [];
    for (let i = 0; i < maxPages; i++) {
        let page = i + 1;
        let url = `https://knowledge.hubspot.com/search?q=${query}#stq=${query}&stp=${page}&stf=${types}`;
        console.log(url);
        res.push(url);
    }
    return res;
}

async function* crawlKB({url, page, document}) {
    if (url.includes(`knowledge.hubspot.com/search?`)) {
        let linksSelector = `a.st-search-result-link`;
        await page.waitForSelector(linksSelector, { timeout: 20000 });
        const urls = await page.evaluate(() => {
            let linksSelector = `a.st-search-result-link`;
            return Array.from(document.querySelectorAll(linksSelector)).map(e=>e.href);
        });
        console.log(`# index [](${url}) => `, urls);
        yield {urls};
    } else if (url.includes(`knowledge.hubspot.com`)) {
        console.log(`# parse [](${url})`);
        await page.waitForSelector(`div.blog-post-wrapper.cell-wrapper`, {timeout: 20000});
        await api.sleep(4000);
        const html = await page.evaluate(() => {
            const selector = 'div.blog-post-wrapper.cell-wrapper';
            return document.querySelector(selector).innerHTML;
        });
        let content = await api.html2md_v2({html, url});
        yield {content, mimeType: 'text/markdown'};
    }
}

corpus({
    title: `HubSpot Knowledge Base`,
    urls: createKBUrls({types: ["knowledge"], query: ["templates"]}),
    crawler: {
        puppeteer: crawlKB,
        browserLog: 'off',
    },
    maxPages: 30,
    depth: 3,
    priority: 1
});

Here is how the script works:

  1. The createKBUrls() function generates a list of search page URLs that aggregate links to the articles to be crawled; the generated URLs are illustrated after this list. The list is then passed to the urls parameter of the corpus() function, along with the type of articles to index and the user query.

  2. The crawlKB() function specifies the behavior of the Puppeteer crawler and identifies which content should be crawled:

    1. On a search page, Puppeteer waits for elements matching the a.st-search-result-link selector to be displayed, collects the links to the KB articles and yields them as URLs for further crawling.

    2. To index an article, the crawler waits for the div.blog-post-wrapper.cell-wrapper selector to be displayed.

    3. Within the article, the crawler extracts content from the element with the div.blog-post-wrapper.cell-wrapper selector.

    4. Each article’s content is then converted to Markdown with html2md_v2() and incorporated into the corpus.

  3. The corpus() function defines the URLs to be crawled and the crawler type and function to be used, and includes other parameters such as the maximum number of pages and the corpus priority.
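
For reference, with the parameters used above (types: ["knowledge"], query: ["templates"] and the default maxPages of 3), createKBUrls() builds and logs three search URLs, one per result page:

  • https://knowledge.hubspot.com/search?q=templates#stq=templates&stp=1&stf=knowledge

  • https://knowledge.hubspot.com/search?q=templates#stq=templates&stp=2&stf=knowledge

  • https://knowledge.hubspot.com/search?q=templates#stq=templates&stp=3&stf=knowledge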

Assume you want to crawl Python documentation for a specific version: 3.11. Instead of manually selecting individual pages and adding them to corpus(), you can use the Puppeteer crawler to automate the process and pass the docs version as an argument to it.

You can implement the following script:

Dialog script
async function* crawlDocs({url, page, document, args}) {
    // Passing the docs version
    let {version} = args;

    if (url.includes(`docs.python.org`)) {
        // Waiting for the main docs container; ignoring the timeout if it does not appear
        try {
            await page.waitForSelector(`div.documentwrapper`, {timeout: 15000});
        } catch(e) {
        }

        const html = await page.evaluate(() => {
            const selector = 'div.body';
            return document.querySelector(selector).innerHTML;
        });

        // Getting a list of URLs from the page
        let urls = await page.evaluate(() => {
            const anchorElements = document.querySelectorAll('a');
            const linksArray = Array.from(anchorElements).map(anchor => anchor.href);
            return linksArray;
        });
        console.log("Initial URLs list: " + urls);

        // Getting the page content
        let content = await api.html2md_v2({html, url});
        console.log("Page content: " + content);
        yield {content, mimeType: 'text/markdown'};

        // Filtering URLs
        urls = urls.filter(f => f.includes(version));
        console.log("Filtered URLs: " + urls);
        yield {urls};
    }
}

corpus({
    title: `Python docs`,
    urls: [`https://docs.python.org/3/`],
    crawler: {
        args: {version: '3.11'},
        puppeteer: crawlDocs,
        browserLog: 'on',
    },
    maxPages: 10,
    depth: 100,
});

Here is how the script works:

  1. The crawlDocs() function specifies the behavior of the Puppeteer crawler and indicates which content should be crawled. It receives the args object, which carries the required docs version.

    1. Puppeteer waits for the div.documentwrapper selector to be displayed and then crawls the content of the main docs page passed to the corpus() function: https://docs.python.org/3/.

    2. From the main docs page, Puppeteer retrieves all links to sub-pages.

    3. The function filters these links to make sure they match the specified version 3.11. Only pages with URLs containing 3.11 are selected for further crawling.

    4. Each page’s content is then converted to Markdown with html2md_v2() and incorporated into the corpus.

    5. Steps 1-4 are repeated for each page from the filtered list until the maxPages limit is reached.

  2. The corpus() function defines the URL of the docs to be crawled, the crawler type and function to be used, and the arguments passed to the crawlDocs() function: args: {version: '3.11'}. It also includes other parameters such as the maximum number of pages and the crawl depth. Reusing the same crawler function for another docs version is sketched below.
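
Because the docs version is passed through args, the same crawlDocs() function can be reused for a different version by changing only the corpus definition. The sketch below assumes that links to the 3.12 pages are available on the start page, just as the original example assumes for 3.11:

Dialog script
corpus({
    title: `Python docs 3.12`,
    urls: [`https://docs.python.org/3/`],
    crawler: {
        args: {version: '3.12'},
        puppeteer: crawlDocs,
        browserLog: 'off',
    },
    maxPages: 10,
    depth: 100,
});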

APIs and functions

You can use Puppeteer functions and APIs to create data crawling functions: