Puppeteer crawler¶
Some websites have complex structures or rely on client-side interactions. To make sure you capture all necessary data from such sites, you can use the Puppeteer crawler.
Puppeteer programmatically interacts with websites during the crawling process: it can simulate user actions such as clicks and scrolling and wait for content to fully load. This helps ensure that all required content is captured, even from dynamic or interactive pages.
Puppeteer can be used to crawl various types of content, including:
- Specific page sections: extract the main content while excluding unrelated elements.
- Dynamically loaded content: capture data from websites that load content through JavaScript after the initial page load.
- Paginated or filtered results: crawl knowledge base articles, forums or portals that use dynamic filtering or pagination.
- Versioned pages: access content from websites with versioned data.
To use the Puppeteer crawler, define it in the `crawler` parameter of the `corpus()` function and specify a function to crawl the required content. You can:
- Use the default crawler API to exclude certain selectors.
- Create custom crawling functions for more specialized use cases.
Default crawler¶
The default crawler API allows you to wait for the page content to fully load and extract specific sections, such as the main article content, while excluding unrelated elements like the navigation menu, ads or footers.
To use the default crawler API, pass `api.defaultCrawler({})` in the `crawler` parameter of the `corpus()` function and define the crawling parameters:
corpus({
  title: `HTTP corpus`,
  urls: [
    `https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview`,
    `https://developer.mozilla.org/en-US/docs/Web/HTTP/Messages`,
    `https://developer.mozilla.org/en-US/docs/Web/HTTP/Session`
  ],
  crawler: {
    puppeteer: api.defaultCrawler({
      waitAfterLoad: 1000,
      excludeSelectors: [
        `div.sticky-header-container`,
        `aside.sidebar`,
        `aside.toc`,
        `aside.article-footer`,
        `div.bottom-banner-container`,
        `footer.page-footer`
      ]
    }),
  },
  auth: {username: 'johnsmith', password: 'password'},
  include: [/.*\.pdf/],
  exclude: [/.*\.zip/],
  query: transforms.queries,
  transforms: transforms.answers,
  depth: 3,
  maxPages: 3,
  priority: 1
});
Corpus parameters¶
Name | Type | Required/Optional | Description
---|---|---|---
`title` | string | Optional | Corpus title.
`urls` | string array | Required | List of URLs from which information must be retrieved. You can define URLs of website folders and pages.
`crawler` | JSON object | Required | Type of crawler and function to be used to index corpus content: the `puppeteer` key set to `api.defaultCrawler()` with its crawling parameters (`waitAfterLoad`, `excludeSelectors`).
`auth` | JSON object | Optional | Credentials to access resources that require basic authentication: `username` and `password`.
`include` | string array | Optional | Resources that must be indexed. You can define an array of URLs or use RegEx to specify a rule. For details, see Corpus includes and excludes.
`exclude` | string array | Optional | Resources to be excluded from indexing. You can define an array of URLs or use RegEx to specify a rule. For details, see Corpus includes and excludes.
`query` | function | Optional | Transforms function used to process user queries. For details, see Puppeteer transforms.
`transforms` | function | Optional | Transforms function used to format the corpus output. For details, see Puppeteer transforms.
`depth` | integer | Optional | Crawl depth for web and PDF resources. The minimum value is 0 (crawling only the page content without linked resources). For details, see Crawling depth.
`maxPages` | integer | Optional | Maximum number of pages and files to index. If not set, only 1 page with the defined URL will be indexed.
`priority` | integer | Optional | Priority level assigned to the corpus. Corpuses with higher priority are considered more relevant when user requests are processed. For details, see Corpus priority.
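According to the table, only `urls` and `crawler` are required. A minimal corpus definition could therefore look like the following sketch, which reuses one of the URLs and selectors from the example above:
corpus({
  urls: [`https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview`],
  crawler: {
    puppeteer: api.defaultCrawler({
      waitAfterLoad: 1000,
      excludeSelectors: [`footer.page-footer`]
    }),
  },
});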
Custom crawling functions¶
For more specialized cases, such as handling unique page structures or extracting specific types of content, you can implement custom crawling functions with Puppeteer. This approach allows you to achieve precise control over how data is gathered and processed.
To use a custom crawling function, in the `crawler` parameter of the `corpus()` function, pass the function name and define the additional `browserLog` and `args` parameters:
corpus({
  title: `Knowledge Base`,
  urls: [`urls to crawl`],
  crawler: {
    puppeteer: crawlPages,
    browserLog: 'on',
    args: {arg1: 'value1', arg2: 'value2'},
  },
  auth: {username: 'johnsmith', password: 'password'},
  include: [/.*\.pdf/],
  exclude: [/.*\.zip/],
  query: transforms.queries,
  transforms: transforms.answers,
  depth: 10,
  maxPages: 10,
  priority: 1
});

async function* crawlPages({url, page, document, args}) {
  // crawlPages function code ...
}
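As the examples below show, the crawling function is an async generator: it receives `{url, page, document, args}` and can yield either indexable content or additional URLs to crawl. A minimal sketch of such a function (the `main` selector and the commented-out URL are placeholders):
async function* crawlPages({url, page, document, args}) {
  // Extract the HTML of the element to index (placeholder selector)
  const html = await page.evaluate(() => {
    return document.querySelector('main').innerHTML;
  });
  // Convert the HTML to Markdown and add it to the corpus
  let content = await api.html2md_v2({html, url});
  yield {content, mimeType: 'text/markdown'};
  // Optionally, yield more URLs for the crawler to visit
  // yield {urls: [`https://example.com/next-page`]};
}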
Corpus parameters¶
Name | Type | Required/Optional | Description
---|---|---|---
`title` | string | Optional | Corpus title.
`urls` | string array | Required | List of URLs from which information must be retrieved. You can define URLs of website folders and pages.
`crawler` | JSON object | Required | Type of crawler and function to be used to index corpus content. Crawler parameters: `puppeteer` (the custom crawling function), `browserLog` (`'on'` or `'off'`), and `args` (arguments passed to the crawling function).
`auth` | JSON object | Optional | Credentials to access resources that require basic authentication: `username` and `password`.
`include` | string array | Optional | Resources that must be indexed. You can define an array of URLs or use RegEx to specify a rule. For details, see Corpus includes and excludes.
`exclude` | string array | Optional | Resources to be excluded from indexing. You can define an array of URLs or use RegEx to specify a rule. For details, see Corpus includes and excludes.
`query` | function | Optional | Transforms function used to process user queries. For details, see Puppeteer transforms.
`transforms` | function | Optional | Transforms function used to format the corpus output. For details, see Puppeteer transforms.
`depth` | integer | Optional | Crawl depth for web and PDF resources. The minimum value is 0 (crawling only the page content without linked resources). For details, see Crawling depth.
`maxPages` | integer | Optional | Maximum number of pages and files to index. If not set, only 1 page with the defined URL will be indexed.
`priority` | integer | Optional | Priority level assigned to the corpus. Corpuses with higher priority are considered more relevant when user requests are processed. For details, see Corpus priority.
Examples of use¶
Assume you need to crawl only the article content displayed on a page, without the header and footer. To do this, you can implement the following script:
async function* crawlPages({url, page, document}) {
  const html = await page.evaluate(() => {
    const selector = '.main-page-content';
    return document.querySelector(selector).innerHTML;
  });
  let content = await api.html2md_v2({html, url});
  yield {content, mimeType: 'text/markdown'};
}

corpus({
  title: `HTTP corpus`,
  urls: [
    `https://developer.mozilla.org/en-US/docs/Web/HTTP/Overview`,
    `https://developer.mozilla.org/en-US/docs/Web/HTTP/Messages`,
    `https://developer.mozilla.org/en-US/docs/Web/HTTP/Session`
  ],
  crawler: {
    puppeteer: crawlPages,
    browserLog: 'off',
  },
  maxPages: 10,
  depth: 3,
  priority: 1
});
Here is how the script works:
1. The `crawlPages()` function specifies the behavior of the Puppeteer crawler and identifies which content should be crawled:
   a. Within each page to be indexed, the crawler extracts content from the element with the `main-page-content` selector.
   b. Each page's content is then converted to Markdown with `html2md_v2()` and incorporated into the corpus.
2. The `corpus()` function defines the URLs to be crawled, the crawler type and function to be used, and other parameters such as the maximum number of pages and the corpus priority.
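If some pages lack the `main-page-content` element, `document.querySelector()` returns `null` and the script above fails. A defensive variant (a sketch reusing the same `api.html2md_v2()` helper) can fall back to the full page body instead:
async function* crawlPages({url, page, document}) {
  const html = await page.evaluate(() => {
    const element = document.querySelector('.main-page-content');
    // Fall back to the full page body if the selector is missing
    return (element || document.body).innerHTML;
  });
  let content = await api.html2md_v2({html, url});
  yield {content, mimeType: 'text/markdown'};
}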
Assume you need to crawl a list of articles dynamically loaded through the HubSpot Knowledge Base portal. The articles should meet the following criteria:
- Article type: `Knowledge Base`
- User query: `templates`
- Articles to index: results on pages 1-3
To do this, you can implement the following script:
// Build the list of search result URLs for the given article types, query, and number of pages
function createKBUrls({types, query, maxPages = 3}) {
  let res = [];
  for (let i = 0; i < maxPages; i++) {
    let page = i + 1;
    let url = `https://knowledge.hubspot.com/search?q=${query}#stq=${query}&stp=${page}&stf=${types}`;
    console.log(url);
    res.push(url);
  }
  return res;
}

async function* crawlKB({url, page, document}) {
  if (url.includes(`knowledge.hubspot.com/search?`)) {
    // Search results page: collect links to the articles to be indexed
    let linksSelector = `a.st-search-result-link`;
    await page.waitForSelector(linksSelector, {timeout: 20000});
    const urls = await page.evaluate(() => {
      let linksSelector = `a.st-search-result-link`;
      return Array.from(document.querySelectorAll(linksSelector)).map(e => e.href);
    });
    console.log(`# index [](${url}) => `, urls);
    yield {urls};
  } else if (url.includes(`knowledge.hubspot.com`)) {
    // Article page: extract the article content and add it to the corpus
    console.log(`# parse [](${url})`);
    await page.waitForSelector(`div.blog-post-wrapper.cell-wrapper`, {timeout: 20000});
    await api.sleep(4000);
    const html = await page.evaluate(() => {
      const selector = 'div.blog-post-wrapper.cell-wrapper';
      return document.querySelector(selector).innerHTML;
    });
    let content = await api.html2md_v2({html, url});
    yield {content, mimeType: 'text/markdown'};
  }
}

corpus({
  title: `HubSpot Knowledge Base`,
  urls: createKBUrls({types: ["knowledge"], query: ["templates"]}),
  crawler: {
    puppeteer: crawlKB,
    browserLog: 'off',
  },
  maxPages: 30,
  depth: 3,
  priority: 1
});
Here is how the script works:
1. The `createKBUrls()` function generates a list of URLs that aggregate links to the articles to be crawled. The list is then passed to the `urls` parameter of the `corpus()` function, along with the type of articles to index and the user query.
2. The `crawlKB()` function specifies the behavior of the Puppeteer crawler and identifies which content should be crawled:
   a. Puppeteer waits for the `a.st-search-result-link` selector to be displayed and then accesses the KB articles opened through links with the same selector.
   b. To index an article, the crawler waits for the `div.blog-post-wrapper.cell-wrapper` selector to be displayed.
   c. Within the article, the crawler extracts content from the element with the `div.blog-post-wrapper.cell-wrapper` selector.
   d. Each article's content is then converted to Markdown with `html2md_v2()` and incorporated into the corpus.
3. The `corpus()` function defines the URLs to be crawled, the crawler type and function to be used, and other parameters such as the maximum number of pages and the corpus priority.
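Because the search URLs are generated by `createKBUrls()`, adapting the script to a different query or number of result pages only requires changing its arguments; the crawling function stays the same. For example (the `workflows` query and the parameter values below are illustrative):
corpus({
  title: `HubSpot Knowledge Base - workflows`,
  urls: createKBUrls({types: ["knowledge"], query: ["workflows"], maxPages: 2}),
  crawler: {
    puppeteer: crawlKB,
    browserLog: 'off',
  },
  maxPages: 20,
  depth: 3,
  priority: 1
});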
Assume you want to crawl the Python documentation for a specific version: `3.11`. Instead of manually selecting individual pages and adding them to `corpus()`, you can use the Puppeteer crawler to automate the process and pass the docs version to it as an argument.
You can implement the following script:
async function* crawlDocs({url, page, document, args}) {
  // Passing the docs version
  let {version} = args;
  if (url.includes(`docs.python.org`)) {
    // Waiting for the docs content to load
    await page.waitForSelector(`div.documentwrapper`, {timeout: 15000});
    const html = await page.evaluate(() => {
      const selector = 'div.body';
      return document.querySelector(selector).innerHTML;
    });
    // Getting a list of URLs from the page
    let urls = await page.evaluate(() => {
      const anchorElements = document.querySelectorAll('a');
      const linksArray = Array.from(anchorElements).map(anchor => anchor.href);
      return linksArray;
    });
    console.log("Initial URLs list: " + urls);
    // Getting the page content
    let content = await api.html2md_v2({html, url});
    console.log("Page content: " + content);
    yield {content, mimeType: 'text/markdown'};
    // Filtering URLs by the requested version
    urls = urls.filter(f => f.includes(version));
    console.log("Filtered URLs: " + urls);
    yield {urls};
  }
}
corpus({
  title: `Python docs`,
  urls: [`https://docs.python.org/3/`],
  crawler: {
    args: {version: '3.11'},
    puppeteer: crawlDocs,
    browserLog: 'on',
  },
  maxPages: 10,
  depth: 100,
});
Here is how the script works:
1. The `crawlDocs()` function specifies the behavior of the Puppeteer crawler and indicates which content should be crawled. It takes the `args` argument in which the required docs version is passed:
   a. Puppeteer waits for the `div.documentwrapper` selector to be displayed and then crawls the content of the main docs page passed to the `corpus()` function: `https://docs.python.org/3/`.
   b. From the main docs page, Puppeteer retrieves all links to sub-pages.
   c. The function filters these links to make sure they match the specified version `3.11`. Only pages with URLs containing `3.11` are selected for further crawling.
   d. Each page's content is then converted to Markdown with `html2md_v2()` and incorporated into the corpus.
   e. Steps a-d are repeated for each page from the filtered list until the `maxPages` limit is reached.
2. The `corpus()` function defines the URL of the docs to be crawled, the crawler type, the function to be used, and the arguments passed to the `crawlDocs()` function: `args: {version: '3.11'}`. It also includes other parameters such as the maximum number of pages and the crawl depth.
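To crawl another documentation version, only the `args` value needs to change; the `crawlDocs()` function stays the same. For example (the `3.12` value is illustrative):
corpus({
  title: `Python docs`,
  urls: [`https://docs.python.org/3/`],
  crawler: {
    // The version is passed to crawlDocs() through args
    args: {version: '3.12'},
    puppeteer: crawlDocs,
    browserLog: 'on',
  },
  maxPages: 10,
  depth: 100,
});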
APIs and functions¶
You can use Puppeteer functions and APIs to create data crawling functions: