# ScrapflyScrapeWebsiteTool
## Description
The `ScrapflyScrapeWebsiteTool` is designed to leverage Scrapfly's web scraping API to extract content from websites. This tool provides advanced web scraping capabilities with headless browser support, proxies, and anti-bot bypass features. It allows for extracting web page data in various formats, including raw HTML, markdown, and plain text, making it ideal for a wide range of web scraping tasks.
## Installation
To use this tool, you need to install the Scrapfly SDK, available on PyPI as `scrapfly-sdk` (e.g., `pip install scrapfly-sdk`).

## Steps to Get Started
To effectively use the `ScrapflyScrapeWebsiteTool`, follow these steps:
- Install Dependencies: Install the Scrapfly SDK using the command above.
- Obtain API Key: Register at Scrapfly to get your API key.
- Initialize the Tool: Create an instance of the tool with your API key.
- Configure Scraping Parameters: Customize the scraping parameters based on your needs.
## Example
The following example demonstrates how to use the `ScrapflyScrapeWebsiteTool` to extract content from a website:
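A minimal sketch of direct usage, assuming the tool is exported from `crewai_tools` and that its `_run` method accepts the parameters documented below; the URL and API key are illustrative:

```python
from crewai_tools import ScrapflyScrapeWebsiteTool

# Initialize the tool with your Scrapfly API key
tool = ScrapflyScrapeWebsiteTool(api_key="your_scrapfly_api_key")

# Extract the page content as markdown (the default format)
result = tool._run(
    url="https://web-scraping.dev/products",
    scrape_format="markdown",
    ignore_scrape_failures=True,
)
print(result)
```

Running this requires a valid Scrapfly API key; with `ignore_scrape_failures=True`, `result` is `None` if the scrape fails.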
## Parameters
The `ScrapflyScrapeWebsiteTool` accepts the following parameters:
### Initialization Parameters
- `api_key`: Required. Your Scrapfly API key.
### Run Parameters
- `url`: Required. The URL of the website to scrape.
- `scrape_format`: Optional. The format in which to extract the web page content. Options are `"raw"` (HTML), `"markdown"`, or `"text"`. Default is `"markdown"`.
- `scrape_config`: Optional. A dictionary containing additional Scrapfly scraping configuration options.
- `ignore_scrape_failures`: Optional. Whether to ignore failures during scraping. If set to `True`, the tool returns `None` instead of raising an exception when scraping fails.
### Scrapfly Configuration Options
The `scrape_config` parameter allows you to customize the scraping behavior with the following options:
- `asp`: Enable anti-scraping protection bypass.
- `render_js`: Enable JavaScript rendering with a cloud headless browser.
- `proxy_pool`: Select a proxy pool (e.g., `"public_residential_pool"`, `"datacenter"`).
- `country`: Select a proxy location (e.g., `"us"`, `"uk"`).
- `auto_scroll`: Automatically scroll the page to load lazy-loaded content.
- `js`: Execute custom JavaScript code in the headless browser.
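For instance, a `scrape_config` dictionary combining several of these options might look like the following (all values are illustrative):

```python
# Illustrative scrape_config; each key maps to a Scrapfly API option
scrape_config = {
    "asp": True,                              # bypass anti-scraping protection
    "render_js": True,                        # render the page in a cloud headless browser
    "proxy_pool": "public_residential_pool",  # use residential proxies
    "country": "us",                          # route requests through US-based proxies
    "auto_scroll": True,                      # scroll to trigger lazy-loaded content
}
```

This dictionary is passed as the `scrape_config` run parameter and forwarded to Scrapfly with the request.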
## Usage
When using the `ScrapflyScrapeWebsiteTool` with an agent, the agent needs to provide the URL of the website to scrape and can optionally specify the format and additional configuration options:
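A sketch of wiring the tool into a CrewAI agent; the role, goal, backstory, and task text are illustrative, and actually running the crew requires both LLM and Scrapfly credentials:

```python
from crewai import Agent, Crew, Task
from crewai_tools import ScrapflyScrapeWebsiteTool

scrape_tool = ScrapflyScrapeWebsiteTool(api_key="your_scrapfly_api_key")

# Illustrative agent that can call the scraping tool
web_researcher = Agent(
    role="Web Researcher",
    goal="Extract and summarize content from websites",
    backstory="An analyst who gathers information from the web.",
    tools=[scrape_tool],
)

research_task = Task(
    description=(
        "Scrape https://web-scraping.dev/products in markdown format "
        "and summarize the product listings."
    ),
    expected_output="A short summary of the page content.",
    agent=web_researcher,
)

crew = Crew(agents=[web_researcher], tasks=[research_task])
# result = crew.kickoff()  # requires LLM and Scrapfly API keys
```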
## Error Handling
By default, the `ScrapflyScrapeWebsiteTool` raises an exception if scraping fails. Agents can be instructed to handle failures gracefully by setting the `ignore_scrape_failures` parameter:
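One way to sketch this, assuming `_run` returns `None` on failure when `ignore_scrape_failures=True` (the URL is illustrative):

```python
from crewai_tools import ScrapflyScrapeWebsiteTool

tool = ScrapflyScrapeWebsiteTool(api_key="your_scrapfly_api_key")

# With ignore_scrape_failures=True, a failed scrape returns None
# instead of raising an exception
result = tool._run(
    url="https://example.com/possibly-blocked-page",
    scrape_format="markdown",
    ignore_scrape_failures=True,
)
if result is None:
    print("Scraping failed; continuing without this page.")
```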
## Implementation Details
The `ScrapflyScrapeWebsiteTool` uses the Scrapfly SDK to interact with the Scrapfly API:
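Internally, a run reduces to something like the following sketch, assuming the Scrapfly SDK's `ScrapflyClient`/`ScrapeConfig` interface; the exact implementation in `crewai_tools` may differ:

```python
from scrapfly import ScrapeConfig, ScrapflyClient

client = ScrapflyClient(key="your_scrapfly_api_key")

# scrape_config options are merged into the ScrapeConfig for the request
response = client.scrape(
    ScrapeConfig(
        url="https://web-scraping.dev/products",
        format="markdown",  # "raw", "markdown", or "text"
        asp=True,           # example option passed through from scrape_config
    )
)
content = response.scrape_result["content"]
```

The extracted content is read from the response's `scrape_result`, and failures either raise or return `None` depending on `ignore_scrape_failures`.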
## Conclusion
The `ScrapflyScrapeWebsiteTool` provides a powerful way to extract content from websites using Scrapfly's advanced web scraping capabilities. With features like headless browser support, proxies, and anti-bot bypass, it can handle complex websites and extract content in various formats. This tool is particularly useful for data extraction, content monitoring, and research tasks where reliable web scraping is required.