Scrapfly Scrape Website Tool
The ScrapflyScrapeWebsiteTool leverages Scrapfly’s web scraping API to extract content from websites in various formats.
ScrapflyScrapeWebsiteTool
Description
The ScrapflyScrapeWebsiteTool uses Scrapfly’s web scraping API to extract content from websites. It provides advanced scraping capabilities, including headless browser support, proxies, and anti-bot bypass, and it can return page data as raw HTML, markdown, or plain text, which makes it suitable for a wide range of web scraping tasks.
Installation
To use this tool, you need to install the Scrapfly SDK, published on PyPI as scrapfly-sdk (for example, with pip install scrapfly-sdk).
You’ll also need to obtain a Scrapfly API key by registering at scrapfly.io/register.
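Before going further, it can help to confirm that the SDK is importable and that an API key is available. The sketch below assumes the SDK installs a module named scrapfly and that the key is kept in an environment variable called SCRAPFLY_API_KEY. That variable name is a convention chosen for this example, not something the tool reads on its own.

```python
import importlib.util
import os


def sdk_installed(module_name: str = "scrapfly") -> bool:
    """Return True if the given top-level module can be imported here."""
    return importlib.util.find_spec(module_name) is not None


def load_api_key(env_var: str = "SCRAPFLY_API_KEY") -> str:
    """Read the API key from the environment (the variable name is hypothetical)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set {env_var} to your Scrapfly API key")
    return key
```

Keeping the key in the environment rather than in source code avoids committing it by accident; the value returned by load_api_key can then be passed to the tool at initialization.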
Steps to Get Started
To effectively use the ScrapflyScrapeWebsiteTool, follow these steps:
- Install Dependencies: Install the Scrapfly SDK as described under Installation.
- Obtain API Key: Register at Scrapfly to get your API key.
- Initialize the Tool: Create an instance of the tool with your API key.
- Configure Scraping Parameters: Customize the scraping parameters based on your needs.
Example
The following example demonstrates how to use the ScrapflyScrapeWebsiteTool to extract content from a website:
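A minimal sketch of that flow, assuming the tool is exported by the crewai_tools package and can be invoked through a run method with the parameters listed on this page. The helper names build_tool and scrape_markdown are hypothetical, and the import is deferred so the snippet loads even where crewai_tools is not installed.

```python
from typing import Any


def build_tool(api_key: str) -> Any:
    """Create the tool instance (assumes crewai_tools and the Scrapfly SDK are installed)."""
    from crewai_tools import ScrapflyScrapeWebsiteTool  # deferred import

    return ScrapflyScrapeWebsiteTool(api_key=api_key)


def scrape_markdown(tool: Any, url: str) -> Any:
    """Scrape one page, relying on the tool's default output format (markdown)."""
    return tool.run(url=url)


# Usage with a placeholder key and URL:
# tool = build_tool("your_scrapfly_api_key")
# print(scrape_markdown(tool, "https://example.com"))
```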
You can also customize the scraping parameters:
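As a sketch of such a customized call, the helper below passes an explicit output format and a dictionary of Scrapfly options. The function name scrape_with_options is hypothetical, and it assumes the tool accepts the run parameters described in the Parameters section.

```python
from typing import Any, Dict, Optional


def scrape_with_options(
    tool: Any,
    url: str,
    scrape_format: str = "text",
    scrape_config: Optional[Dict[str, Any]] = None,
) -> Any:
    """Call the tool with an explicit format and extra Scrapfly options."""
    return tool.run(
        url=url,
        scrape_format=scrape_format,
        scrape_config=scrape_config or {},
    )


# Usage (tool is an initialized ScrapflyScrapeWebsiteTool):
# content = scrape_with_options(
#     tool,
#     "https://example.com",
#     scrape_format="text",
#     scrape_config={"asp": True, "render_js": True},
# )
```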
Parameters
The ScrapflyScrapeWebsiteTool accepts the following parameters:
Initialization Parameters
- api_key: Required. Your Scrapfly API key.
Run Parameters
- url: Required. The URL of the website to scrape.
- scrape_format: Optional. The format in which to extract the web page content. Options are “raw” (HTML), “markdown”, or “text”. Default is “markdown”.
- scrape_config: Optional. A dictionary containing additional Scrapfly scraping configuration options.
- ignore_scrape_failures: Optional. Whether to ignore failures during scraping. If set to True, the tool will return None instead of raising an exception when scraping fails.
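The run parameters above can be assembled and checked before a request is made. The helper below is a client-side convenience sketched for illustration; build_run_kwargs is a hypothetical name, and this validation is not something the tool is documented to perform itself.

```python
from typing import Any, Dict, Optional

# Formats listed for scrape_format on this page.
ALLOWED_FORMATS = ("raw", "markdown", "text")


def build_run_kwargs(
    url: str,
    scrape_format: str = "markdown",
    scrape_config: Optional[Dict[str, Any]] = None,
    ignore_scrape_failures: bool = False,
) -> Dict[str, Any]:
    """Validate the run parameters and return them as keyword arguments."""
    if not url:
        raise ValueError("url is required")
    if scrape_format not in ALLOWED_FORMATS:
        raise ValueError(f"scrape_format must be one of {ALLOWED_FORMATS}")
    kwargs: Dict[str, Any] = {
        "url": url,
        "scrape_format": scrape_format,
        "ignore_scrape_failures": ignore_scrape_failures,
    }
    if scrape_config:
        kwargs["scrape_config"] = dict(scrape_config)  # copy to avoid sharing state
    return kwargs
```

The resulting dictionary can be unpacked into a call such as tool.run(**build_run_kwargs("https://example.com")).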
Scrapfly Configuration Options
The scrape_config parameter allows you to customize the scraping behavior with the following options:
- asp: Enable anti-scraping protection bypass.
- render_js: Enable JavaScript rendering with a cloud headless browser.
- proxy_pool: Select a proxy pool (e.g., “public_residential_pool”, “datacenter”).
- country: Select a proxy location (e.g., “us”, “uk”).
- auto_scroll: Automatically scroll the page to load lazy-loaded content.
- js: Execute custom JavaScript code in the headless browser.
For a complete list of configuration options, refer to the Scrapfly API documentation.
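One way to keep these options organized is a small typed container that only emits the settings you actually chose. The ScrapeOptions class below is a hypothetical helper, not part of the tool or the SDK; it covers only the options listed above and leaves unset ones out so that Scrapfly's own defaults apply.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, Optional


@dataclass
class ScrapeOptions:
    """Subset of Scrapfly options named on this page; None means 'not set'."""

    asp: Optional[bool] = None
    render_js: Optional[bool] = None
    proxy_pool: Optional[str] = None
    country: Optional[str] = None
    auto_scroll: Optional[bool] = None
    js: Optional[str] = None

    def to_scrape_config(self) -> Dict[str, Any]:
        """Return a dict suitable for scrape_config, omitting unset options."""
        return {key: value for key, value in asdict(self).items() if value is not None}


# Usage:
# config = ScrapeOptions(asp=True, render_js=True, country="us").to_scrape_config()
```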
Usage
When using the ScrapflyScrapeWebsiteTool with an agent, the agent will need to provide the URL of the website to scrape and can optionally specify the format and additional configuration options:
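A sketch of wiring the tool into an agent, assuming CrewAI's Agent and Task classes with their usual role, goal, backstory, tools, description, and expected_output fields. The role text, helper names, and task wording are illustrative choices, and the crewai imports are deferred so the snippet loads without the package installed.

```python
from typing import Any


def describe_scrape_task(url: str, scrape_format: str = "markdown") -> str:
    """Build a task description telling the agent what to scrape and in which format."""
    return f"Scrape {url} and return its content in {scrape_format} format."


def build_scraper_agent(scrape_tool: Any) -> Any:
    """Create an agent that has the scraping tool available."""
    from crewai import Agent  # deferred import; requires crewai

    return Agent(
        role="Web Scraper",
        goal="Extract information from websites",
        backstory="An expert in web scraping who can extract content from websites.",
        tools=[scrape_tool],
    )


def build_scrape_task(agent: Any, url: str, scrape_format: str = "markdown") -> Any:
    """Create a task that asks the agent to scrape one URL."""
    from crewai import Task  # deferred import; requires crewai

    return Task(
        description=describe_scrape_task(url, scrape_format),
        expected_output="The extracted content of the page",
        agent=agent,
    )
```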
For more advanced usage with custom configuration:
Error Handling
By default, the ScrapflyScrapeWebsiteTool raises an exception if scraping fails. Agents can be instructed to handle failures gracefully by setting the ignore_scrape_failures parameter:
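When calling the tool directly, the same behavior can be sketched as follows. The helper scrape_or_default is a hypothetical name; it relies only on the documented behavior that the tool returns None instead of raising when ignore_scrape_failures is True.

```python
from typing import Any


def scrape_or_default(tool: Any, url: str, default: str = "") -> Any:
    """Scrape a page, falling back to a default value when scraping fails."""
    result = tool.run(url=url, ignore_scrape_failures=True)
    # With ignore_scrape_failures=True, a failed scrape yields None.
    return default if result is None else result
```

Returning a default keeps downstream code simple, at the cost of hiding the cause of the failure; leave ignore_scrape_failures unset when the caller needs the exception.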
Implementation Details
The ScrapflyScrapeWebsiteTool uses the Scrapfly SDK to interact with the Scrapfly API:
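The following is an approximation of that interaction, not the tool's actual source. It assumes the SDK exposes ScrapflyClient and ScrapeConfig, that ScrapeConfig accepts a format argument, and that the response carries the page body under scrape_result["content"]. The function names are hypothetical.

```python
from typing import Any, Dict, Optional


def fetch_content(client: Any, config: Any, ignore_scrape_failures: bool = False) -> Any:
    """Run one scrape and return the page content, or None on an ignored failure."""
    try:
        response = client.scrape(config)
        return response.scrape_result["content"]
    except Exception:
        if ignore_scrape_failures:
            return None
        raise


def scrape_with_sdk(
    api_key: str,
    url: str,
    scrape_format: str = "markdown",
    scrape_config: Optional[Dict[str, Any]] = None,
    ignore_scrape_failures: bool = False,
) -> Any:
    """Build the SDK client and config, then delegate to fetch_content."""
    from scrapfly import ScrapeConfig, ScrapflyClient  # deferred; requires scrapfly-sdk

    client = ScrapflyClient(key=api_key)
    config = ScrapeConfig(url, format=scrape_format, **(scrape_config or {}))
    return fetch_content(client, config, ignore_scrape_failures)
```

Splitting the request logic from client construction makes the failure handling easy to exercise with a stand-in client.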
Conclusion
The ScrapflyScrapeWebsiteTool provides a powerful way to extract content from websites using Scrapfly’s web scraping capabilities. With headless browser support, proxies, and anti-bot bypass, it can handle complex websites and return content in several formats. It is particularly useful for data extraction, content monitoring, and research tasks that require reliable web scraping.