ScrapflyScrapeWebsiteTool
Description
The ScrapflyScrapeWebsiteTool is designed to leverage Scrapfly's web scraping API to extract content from websites. This tool provides advanced web scraping capabilities with headless browser support, proxies, and anti-bot bypass features. It allows for extracting web page data in various formats, including raw HTML, markdown, and plain text, making it ideal for a wide range of web scraping tasks.
Installation
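Assuming the package name used by Scrapfly's own Python SDK distribution on PyPI (an assumption, since the original command was lost from this page), the install is:

```shell
pip install scrapfly-sdk
```

If you use the tool through CrewAI, the crewai-tools package must also be installed in the same environment.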
To use this tool, you need to install the Scrapfly SDK.
Steps to Get Started
To effectively use the ScrapflyScrapeWebsiteTool, follow these steps:
- Install Dependencies: Install the Scrapfly SDK using the command above.
- Obtain API Key: Register at Scrapfly to get your API key.
- Initialize the Tool: Create an instance of the tool with your API key.
- Configure Scraping Parameters: Customize the scraping parameters based on your needs.
Example
The following example demonstrates how to use the ScrapflyScrapeWebsiteTool to extract content from a website:
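A minimal sketch of that usage, assuming the tool is exposed from the crewai_tools package as other CrewAI tools are; the API key and URL are placeholders, and the exact entry point (run vs. _run) can vary by crewai-tools version:

```python
from crewai_tools import ScrapflyScrapeWebsiteTool

# Initialize the tool with your Scrapfly API key
tool = ScrapflyScrapeWebsiteTool(api_key="your_scrapfly_api_key")

# Extract the page content as markdown (the default format)
result = tool._run(
    url="https://web-scraping.dev/products",
    scrape_format="markdown",
    ignore_scrape_failures=True,
)
print(result)
```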
Parameters
The ScrapflyScrapeWebsiteTool accepts the following parameters:
Initialization Parameters
- api_key: Required. Your Scrapfly API key.
Run Parameters
- url: Required. The URL of the website to scrape.
- scrape_format: Optional. The format in which to extract the web page content. Options are "raw" (HTML), "markdown", or "text". Default is "markdown".
- scrape_config: Optional. A dictionary containing additional Scrapfly scraping configuration options.
- ignore_scrape_failures: Optional. Whether to ignore failures during scraping. If set to True, the tool will return None instead of raising an exception when scraping fails.
Scrapfly Configuration Options
The scrape_config parameter allows you to customize the scraping behavior with the following options:
- asp: Enable anti-scraping protection bypass.
- render_js: Enable JavaScript rendering with a cloud headless browser.
- proxy_pool: Select a proxy pool (e.g., "public_residential_pool", "datacenter").
- country: Select a proxy location (e.g., "us", "uk").
- auto_scroll: Automatically scroll the page to load lazy-loaded content.
- js: Execute custom JavaScript code in the headless browser.
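Putting these options together, here is a hedged sketch of passing a scrape_config through the tool. The dictionary keys come from the list above; the API key and URL are placeholders, and the _run entry point is an assumption about the crewai-tools interface:

```python
from crewai_tools import ScrapflyScrapeWebsiteTool

tool = ScrapflyScrapeWebsiteTool(api_key="your_scrapfly_api_key")

result = tool._run(
    url="https://web-scraping.dev/products",
    scrape_format="markdown",
    scrape_config={
        "asp": True,                              # anti-scraping protection bypass
        "render_js": True,                        # cloud headless browser rendering
        "proxy_pool": "public_residential_pool",  # residential proxy pool
        "country": "us",                          # proxy location
        "auto_scroll": True,                      # load lazy-loaded content
    },
)
```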
Usage
When using the ScrapflyScrapeWebsiteTool with an agent, the agent will need to provide the URL of the website to scrape and can optionally specify the format and additional configuration options:
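A sketch of attaching the tool to an agent, using CrewAI's standard Agent, Task, and Crew primitives; the role, goal, and task text are illustrative, and the API key is a placeholder:

```python
from crewai import Agent, Task, Crew
from crewai_tools import ScrapflyScrapeWebsiteTool

scrape_tool = ScrapflyScrapeWebsiteTool(api_key="your_scrapfly_api_key")

researcher = Agent(
    role="Web Researcher",
    goal="Extract product information from e-commerce pages",
    backstory="An analyst who gathers structured data from the web.",
    tools=[scrape_tool],
)

task = Task(
    description="Scrape https://web-scraping.dev/products and summarize the products listed.",
    expected_output="A bullet-point summary of the products on the page.",
    agent=researcher,
)

crew = Crew(agents=[researcher], tasks=[task])
result = crew.kickoff()
```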
Error Handling
By default, the ScrapflyScrapeWebsiteTool will raise an exception if scraping fails. Agents can be instructed to handle failures gracefully by specifying the ignore_scrape_failures parameter:
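A sketch of the tolerant configuration, assuming the same crewai_tools import as above; the URL and API key are placeholders:

```python
from crewai_tools import ScrapflyScrapeWebsiteTool

tool = ScrapflyScrapeWebsiteTool(api_key="your_scrapfly_api_key")

# With ignore_scrape_failures=True the tool returns None on failure
# instead of raising, so downstream code can degrade gracefully.
result = tool._run(
    url="https://example.com/might-be-blocked",
    ignore_scrape_failures=True,
)
if result is None:
    print("Scrape failed; skipping this page.")
```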
Implementation Details
The ScrapflyScrapeWebsiteTool uses the Scrapfly SDK to interact with the Scrapfly API:
Conclusion
The ScrapflyScrapeWebsiteTool provides a powerful way to extract content from websites using Scrapfly's advanced web scraping capabilities. With features like headless browser support, proxies, and anti-bot bypass, it can handle complex websites and extract content in various formats. This tool is particularly useful for data extraction, content monitoring, and research tasks where reliable web scraping is required.