Puppeteer crawler¶
Some websites have complex structures or rely on client-side interactions. To make sure you capture all necessary data from such sites, you can use the Puppeteer crawler.
Puppeteer programmatically interacts with websites during the crawling process: it can simulate user actions such as clicks and scrolling and wait for content to fully load. This helps ensure that all required content is captured, even from dynamic or interactive pages.
Puppeteer can be used to crawl various types of content, including:
- Specific page sections: extract the main content while excluding unrelated elements.
- Dynamically loaded content: capture data from websites that load content through JavaScript after the initial page load.
- Paginated or filtered results: crawl knowledge base articles, forums or portals that use dynamic filtering or pagination.
- Versioned pages: access content from websites with versioned data.
To use the Puppeteer crawler, define it in the `crawler` parameter of the `corpus()` function and specify a function to crawl the required content. You can:
- Use the default crawler API to exclude certain selectors.
- Create custom crawling functions for more specialized use cases.
Default crawler¶
The default crawler API allows you to wait for the page content to fully load and extract specific sections, such as the main article content, while excluding unrelated elements like the navigation menu, ads or footers.
To use the default crawler API, pass `api.defaultCrawler({})` in the `crawler` parameter of the `corpus()` function and define the crawling parameters:
corpus({
  title: `HTTP corpus`,
  urls: [
    `https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview`,
    `https://developer.mozilla.org/en-US/docs/Web/HTTP/Messages`,
    `https://developer.mozilla.org/en-US/docs/Web/HTTP/Session`
  ],
  crawler: {
    puppeteer: api.defaultCrawler({
      waitAfterLoad: 1000,
      excludeSelectors: [
        `div.sticky-header-container`,
        `aside.sidebar`,
        `aside.toc`,
        `aside.article-footer`,
        `div.bottom-banner-container`,
        `footer.page-footer`
      ]
    }),
  },
  auth: {username: 'johnsmith', password: 'password'},
  include: [/.*\.pdf/],
  exclude: [/.*\.zip/],
  query: transforms.queries,
  transforms: transforms.answers,
  depth: 3,
  maxPages: 3,
  priority: 1
});
Corpus parameters¶
Name | Type | Required/Optional | Description
---|---|---|---
`title` | string | Optional | Corpus title.
`urls` | string array | Required | List of URLs from which information must be retrieved. You can define URLs of website folders and pages.
`crawler` | JSON object | Required | Type of crawler and function to be used to index corpus content: the `puppeteer` key set to `api.defaultCrawler()` with its crawling parameters (`waitAfterLoad`, `excludeSelectors`).
`auth` | JSON object | Optional | Credentials to access resources that require basic authentication: `username` and `password`.
`include` | string array | Optional | Resources that must be indexed. You can define an array of URLs or use RegEx to specify a rule. For details, see Corpus includes and excludes.
`exclude` | string array | Optional | Resources to be excluded from indexing. You can define an array of URLs or use RegEx to specify a rule. For details, see Corpus includes and excludes.
`query` | function | Optional | Transforms function used to process user queries. For details, see Puppeteer transforms.
`transforms` | function | Optional | Transforms function used to format the corpus output. For details, see Puppeteer transforms.
`depth` | integer | Optional | Crawl depth for web and PDF resources. The minimum value is 0 (crawling only the page content without linked resources). For details, see Crawling depth.
`maxPages` | integer | Optional | Maximum number of pages and files to index. If not set, only 1 page with the defined URL will be indexed.
`priority` | integer | Optional | Priority level assigned to the corpus. Corpuses with higher priority are considered more relevant when user requests are processed. For details, see Corpus priority.
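According to the table, only `urls` and `crawler` are required. A minimal corpus definition could therefore look like the following sketch, which reuses one of the URLs and selectors from the example above:
corpus({
  urls: [`https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview`],
  crawler: {
    puppeteer: api.defaultCrawler({
      waitAfterLoad: 1000,
      excludeSelectors: [`footer.page-footer`]
    }),
  },
});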
Custom crawling functions¶
For more specialized cases, such as handling unique page structures or extracting specific types of content, you can implement custom crawling functions with Puppeteer. This approach allows you to achieve precise control over how data is gathered and processed.
To use a custom crawling function, in the `crawler` parameter of the `corpus()` function, pass the function name and define the additional `browserLog` and `args` parameters:
corpus({
  title: `Knowledge Base`,
  urls: [`urls to crawl`],
  crawler: {
    puppeteer: crawlPages,
    browserLog: 'on',
    args: {arg1: 'value1', arg2: 'value2'},
  },
  auth: {username: 'johnsmith', password: 'password'},
  include: [/.*\.pdf/],
  exclude: [/.*\.zip/],
  query: transforms.queries,
  transforms: transforms.answers,
  depth: 10,
  maxPages: 10,
  priority: 1
});

async function* crawlPages({url, page, document, args}) {
  // crawlPages function code ...
}
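As the examples below show, the crawling function is an async generator: it receives `{url, page, document, args}` and can yield either indexable content or additional URLs to crawl. A minimal sketch of such a function (the `main` selector and the commented-out URL are placeholders):
async function* crawlPages({url, page, document, args}) {
  // Extract the HTML of the element to index (placeholder selector)
  const html = await page.evaluate(() => {
    return document.querySelector('main').innerHTML;
  });
  // Convert the HTML to Markdown and add it to the corpus
  let content = await api.html2md_v2({html, url});
  yield {content, mimeType: 'text/markdown'};
  // Optionally, yield more URLs for the crawler to visit
  // yield {urls: [`https://example.com/next-page`]};
}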
Corpus parameters¶
Name | Type | Required/Optional | Description
---|---|---|---
`title` | string | Optional | Corpus title.
`urls` | string array | Required | List of URLs from which information must be retrieved. You can define URLs of website folders and pages.
`crawler` | JSON object | Required | Type of crawler and function to be used to index corpus content. Crawler parameters: `puppeteer` (the custom crawling function), `browserLog` (`'on'` or `'off'`), and `args` (arguments passed to the crawling function).
`auth` | JSON object | Optional | Credentials to access resources that require basic authentication: `username` and `password`.
`include` | string array | Optional | Resources that must be indexed. You can define an array of URLs or use RegEx to specify a rule. For details, see Corpus includes and excludes.
`exclude` | string array | Optional | Resources to be excluded from indexing. You can define an array of URLs or use RegEx to specify a rule. For details, see Corpus includes and excludes.
`query` | function | Optional | Transforms function used to process user queries. For details, see Puppeteer transforms.
`transforms` | function | Optional | Transforms function used to format the corpus output. For details, see Puppeteer transforms.
`depth` | integer | Optional | Crawl depth for web and PDF resources. The minimum value is 0 (crawling only the page content without linked resources). For details, see Crawling depth.
`maxPages` | integer | Optional | Maximum number of pages and files to index. If not set, only 1 page with the defined URL will be indexed.
`priority` | integer | Optional | Priority level assigned to the corpus. Corpuses with higher priority are considered more relevant when user requests are processed. For details, see Corpus priority.
Examples of use¶
Assume you need to crawl only the article content displayed on a page, without the header and footer. To do this, you can implement the following script:
async function* crawlPages({url, page, document}) {
  const html = await page.evaluate(() => {
    const selector = '.main-page-content';
    return document.querySelector(selector).innerHTML;
  });
  let content = await api.html2md_v2({html, url});
  yield {content, mimeType: 'text/markdown'};
}

corpus({
  title: `HTTP corpus`,
  urls: [
    `https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview`,
    `https://developer.mozilla.org/en-US/docs/Web/HTTP/Messages`,
    `https://developer.mozilla.org/en-US/docs/Web/HTTP/Session`
  ],
  crawler: {
    puppeteer: crawlPages,
    browserLog: 'off',
  },
  maxPages: 10,
  depth: 3,
  priority: 1
});
Here is how the script works:
1. The `crawlPages()` function specifies the behavior of the Puppeteer crawler and identifies which content should be crawled:
   a. Within each page to be indexed, the crawler extracts content from the element with the `main-page-content` selector.
   b. Each page's content is then converted to Markdown with `html2md_v2()` and incorporated into the corpus.
2. The `corpus()` function defines the URLs to be crawled, the crawler type and function to be used, and other parameters such as the maximum number of pages and the corpus priority.
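If some pages lack the `main-page-content` element, `document.querySelector()` returns `null` and the script above fails. A defensive variant (a sketch reusing the same `api.html2md_v2()` helper) can fall back to the full page body instead:
async function* crawlPages({url, page, document}) {
  const html = await page.evaluate(() => {
    const element = document.querySelector('.main-page-content');
    // Fall back to the full page body if the selector is missing
    return (element || document.body).innerHTML;
  });
  let content = await api.html2md_v2({html, url});
  yield {content, mimeType: 'text/markdown'};
}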
Assume you need to crawl a list of articles dynamically loaded through the HubSpot Knowledge Base portal. The articles should meet the following criteria:
- Article type: `Knowledge Base`
- User query: `templates`
- Articles to index: results on pages 1-3
To do this, you can implement the following script:
// Build the list of search result URLs for the given article types, query, and number of pages
function createKBUrls({types, query, maxPages = 3}) {
  let res = [];
  for (let i = 0; i < maxPages; i++) {
    let page = i + 1;
    let url = `https://knowledge.hubspot.com/search?q=${query}#stq=${query}&stp=${page}&stf=${types}`;
    console.log(url);
    res.push(url);
  }
  return res;
}

async function* crawlKB({url, page, document}) {
  if (url.includes(`knowledge.hubspot.com/search?`)) {
    // Search results page: collect links to the articles to be indexed
    let linksSelector = `a.st-search-result-link`;
    await page.waitForSelector(linksSelector, {timeout: 20000});
    const urls = await page.evaluate(() => {
      let linksSelector = `a.st-search-result-link`;
      return Array.from(document.querySelectorAll(linksSelector)).map(e => e.href);
    });
    console.log(`# index [](${url}) => `, urls);
    yield {urls};
  } else if (url.includes(`knowledge.hubspot.com`)) {
    // Article page: extract the article content and add it to the corpus
    console.log(`# parse [](${url})`);
    await page.waitForSelector(`div.blog-post-wrapper.cell-wrapper`, {timeout: 20000});
    await api.sleep(4000);
    const html = await page.evaluate(() => {
      const selector = 'div.blog-post-wrapper.cell-wrapper';
      return document.querySelector(selector).innerHTML;
    });
    let content = await api.html2md_v2({html, url});
    yield {content, mimeType: 'text/markdown'};
  }
}

corpus({
  title: `HubSpot Knowledge Base`,
  urls: createKBUrls({types: ["knowledge"], query: ["templates"]}),
  crawler: {
    puppeteer: crawlKB,
    browserLog: 'off',
  },
  maxPages: 30,
  depth: 3,
  priority: 1
});
Here is how the script works:
1. The `createKBUrls()` function generates a list of URLs that aggregate links to the articles to be crawled. The list is then passed to the `urls` parameter of the `corpus()` function, along with the type of articles to index and the user query.
2. The `crawlKB()` function specifies the behavior of the Puppeteer crawler and identifies which content should be crawled:
   a. Puppeteer waits for the `a.st-search-result-link` selector to be displayed and then accesses the KB articles opened through links with the same selector.
   b. To index an article, the crawler waits for the `div.blog-post-wrapper.cell-wrapper` selector to be displayed.
   c. Within the article, the crawler extracts content from the element with the `div.blog-post-wrapper.cell-wrapper` selector.
   d. Each article's content is then converted to Markdown with `html2md_v2()` and incorporated into the corpus.
3. The `corpus()` function defines the URLs to be crawled, the crawler type and function to be used, and other parameters such as the maximum number of pages and the corpus priority.
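Because the search URLs are generated by `createKBUrls()`, adapting the script to a different query or number of result pages only requires changing its arguments; the crawling function stays the same. For example (the `workflows` query and the parameter values below are illustrative):
corpus({
  title: `HubSpot Knowledge Base - workflows`,
  urls: createKBUrls({types: ["knowledge"], query: ["workflows"], maxPages: 2}),
  crawler: {
    puppeteer: crawlKB,
    browserLog: 'off',
  },
  maxPages: 20,
  depth: 3,
  priority: 1
});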
Assume you want to crawl the Python documentation for a specific version: `3.11`. Instead of manually selecting individual pages and adding them to `corpus()`, you can use the Puppeteer crawler to automate the process and pass the docs version to it as an argument.
You can implement the following script:
async function* crawlDocs({url, page, document, args}) {
  // Passing the docs version
  let {version} = args;
  if (url.includes(`docs.python.org`)) {
    // Waiting for the docs content to load
    await page.waitForSelector(`div.documentwrapper`, {timeout: 15000});
    const html = await page.evaluate(() => {
      const selector = 'div.body';
      return document.querySelector(selector).innerHTML;
    });
    // Getting a list of URLs from the page
    let urls = await page.evaluate(() => {
      const anchorElements = document.querySelectorAll('a');
      const linksArray = Array.from(anchorElements).map(anchor => anchor.href);
      return linksArray;
    });
    console.log("Initial URLs list: " + urls);
    // Getting the page content
    let content = await api.html2md_v2({html, url});
    console.log("Page content: " + content);
    yield {content, mimeType: 'text/markdown'};
    // Filtering URLs by the requested version
    urls = urls.filter(f => f.includes(version));
    console.log("Filtered URLs: " + urls);
    yield {urls};
  }
}
corpus({
  title: `Python docs`,
  urls: [`https://docs.python.org/3/`],
  crawler: {
    args: {version: '3.11'},
    puppeteer: crawlDocs,
    browserLog: 'on',
  },
  maxPages: 10,
  depth: 100,
});
Here is how the script works:
1. The `crawlDocs()` function specifies the behavior of the Puppeteer crawler and indicates which content should be crawled. It takes the `args` argument in which the required docs version is passed:
   a. Puppeteer waits for the `div.documentwrapper` selector to be displayed and then crawls the content of the main docs page passed to the `corpus()` function: `https://docs.python.org/3/`.
   b. From the main docs page, Puppeteer retrieves all links to sub-pages.
   c. The function filters these links to make sure they match the specified version `3.11`. Only pages with URLs containing `3.11` are selected for further crawling.
   d. Each page's content is then converted to Markdown with `html2md_v2()` and incorporated into the corpus.
   e. Steps a-d are repeated for each page from the filtered list until the `maxPages` limit is reached.
2. The `corpus()` function defines the URL of the docs to be crawled, the crawler type, the function to be used, and the arguments passed to the `crawlDocs()` function: `args: {version: '3.11'}`. It also includes other parameters such as the maximum number of pages and the crawl depth.
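To crawl another documentation version, only the `args` value needs to change; the `crawlDocs()` function stays the same. For example (the `3.12` value is illustrative):
corpus({
  title: `Python docs`,
  urls: [`https://docs.python.org/3/`],
  crawler: {
    // The version is passed to crawlDocs() through args
    args: {version: '3.12'},
    puppeteer: crawlDocs,
    browserLog: 'on',
  },
  maxPages: 10,
  depth: 100,
});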
APIs and functions¶
You can use Puppeteer functions and APIs to create data crawling functions: