ScrapegraphScrapeTool

Description

The ScrapegraphScrapeTool is designed to leverage Scrapegraph AI’s SmartScraper API to intelligently extract content from websites. This tool provides advanced web scraping capabilities with AI-powered content extraction, making it ideal for targeted data collection and content analysis tasks. Unlike traditional web scrapers, it can understand the context and structure of web pages to extract the most relevant information based on natural language prompts.

Installation

To use this tool, you need to install the Scrapegraph Python client:

uv add scrapegraph-py

You’ll also need to set up your Scrapegraph API key as an environment variable:

export SCRAPEGRAPH_API_KEY="your_api_key"

You can obtain an API key from Scrapegraph AI.
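Before wiring the tool into a crew, it can help to confirm the key is actually visible to your process. The following is a minimal sketch using only the standard library; the environment variable name is the one described above.

Code
import os

# Fail fast if the Scrapegraph API key is not configured.
api_key = os.getenv("SCRAPEGRAPH_API_KEY")
if not api_key:
    raise RuntimeError(
        "SCRAPEGRAPH_API_KEY is not set; export it or pass api_key= when "
        "initializing ScrapegraphScrapeTool."
    )
print("Scrapegraph API key found.")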

Steps to Get Started

To effectively use the ScrapegraphScrapeTool, follow these steps:

  1. Install Dependencies: Install the required package using the command above.
  2. Set Up API Key: Set your Scrapegraph API key as an environment variable or provide it during initialization.
  3. Initialize the Tool: Create an instance of the tool with the necessary parameters.
  4. Define Extraction Prompts: Create natural language prompts to guide the extraction of specific content.

Example

The following example demonstrates how to use the ScrapegraphScrapeTool to extract content from a website:

Code
from crewai import Agent, Task, Crew
from crewai_tools import ScrapegraphScrapeTool

# Initialize the tool
scrape_tool = ScrapegraphScrapeTool(api_key="your_api_key")

# Define an agent that uses the tool
web_scraper_agent = Agent(
    role="Web Scraper",
    goal="Extract specific information from websites",
    backstory="An expert in web scraping who can extract targeted content from web pages.",
    tools=[scrape_tool],
    verbose=True,
)

# Example task to extract product information from an e-commerce site
scrape_task = Task(
    description="Extract product names, prices, and descriptions from the featured products section of example.com.",
    expected_output="A structured list of product information including names, prices, and descriptions.",
    agent=web_scraper_agent,
)

# Create and run the crew
crew = Crew(agents=[web_scraper_agent], tasks=[scrape_task])
result = crew.kickoff()

You can also initialize the tool with predefined parameters:

Code
# Initialize the tool with predefined parameters
scrape_tool = ScrapegraphScrapeTool(
    website_url="https://www.example.com",
    user_prompt="Extract all product prices and descriptions",
    api_key="your_api_key"
)

Parameters

The ScrapegraphScrapeTool accepts the following parameters during initialization:

  • api_key: Optional. Your Scrapegraph API key. If not provided, it will look for the SCRAPEGRAPH_API_KEY environment variable.
  • website_url: Optional. The URL of the website to scrape. If provided during initialization, the agent won’t need to specify it when using the tool.
  • user_prompt: Optional. Custom instructions for content extraction. If provided during initialization, the agent won’t need to specify it when using the tool.
  • enable_logging: Optional. Whether to enable logging for the Scrapegraph client. Default is False.
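For reference, here is a sketch that sets every documented parameter explicitly. The values are placeholders; only enable_logging differs from the earlier examples.

Code
from crewai_tools import ScrapegraphScrapeTool

# All four initialization parameters set explicitly (placeholder values)
scrape_tool = ScrapegraphScrapeTool(
    website_url="https://www.example.com",
    user_prompt="Extract all product prices and descriptions",
    api_key="your_api_key",   # falls back to SCRAPEGRAPH_API_KEY if omitted
    enable_logging=True,      # default is False
)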

Usage

When using the ScrapegraphScrapeTool with an agent, the agent supplies the following parameters at run time (unless they were specified during initialization):

  • website_url: The URL of the website to scrape.
  • user_prompt: Optional. Custom instructions for content extraction. Default is “Extract the main content of the webpage”.

The tool will return the extracted content based on the provided prompt.

Code
# Example of using the tool with an agent
web_scraper_agent = Agent(
    role="Web Scraper",
    goal="Extract specific information from websites",
    backstory="An expert in web scraping who can extract targeted content from web pages.",
    tools=[scrape_tool],
    verbose=True,
)

# Create a task for the agent to extract specific content
extract_task = Task(
    description="Extract the main heading and summary from example.com",
    expected_output="The main heading and summary from the website",
    agent=web_scraper_agent,
)

# Run the task
crew = Crew(agents=[web_scraper_agent], tasks=[extract_task])
result = crew.kickoff()

Error Handling

The ScrapegraphScrapeTool may raise the following exceptions:

  • ValueError: When the API key is missing or the URL format is invalid.
  • RateLimitError: When API rate limits are exceeded.
  • RuntimeError: When the scraping operation fails (network issues, API errors).
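For illustration, here is a sketch of catching these errors when invoking the tool directly. Calling the tool via .run() outside a crew is an assumption here, and the final broad except stands in for rate-limit and other API failures whose exception classes are not imported in this snippet.

Code
from crewai_tools import ScrapegraphScrapeTool

scrape_tool = ScrapegraphScrapeTool(api_key="your_api_key")

try:
    # Direct invocation for illustration purposes
    content = scrape_tool.run(
        website_url="https://www.example.com",
        user_prompt="Extract the main heading and summary",
    )
    print(content)
except ValueError as exc:
    # Missing API key or invalid URL format
    print(f"Configuration error: {exc}")
except RuntimeError as exc:
    # Scraping operation failed (network issues, API errors)
    print(f"Scraping failed: {exc}")
except Exception as exc:
    # Catch-all that also covers rate-limit errors
    print(f"Unexpected error: {exc}")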

It’s recommended to instruct agents to handle potential errors gracefully:

Code
# Create a task that includes error handling instructions
robust_extract_task = Task(
    description="""
    Extract the main heading from example.com.
    Be aware that you might encounter errors such as:
    - Invalid URL format
    - Missing API key
    - Rate limit exceeded
    - Network or API errors
    
    If you encounter any errors, provide a clear explanation of what went wrong
    and suggest possible solutions.
    """,
    expected_output="Either the extracted heading or a clear error explanation",
    agent=web_scraper_agent,
)

Rate Limiting

The Scrapegraph API has rate limits that vary based on your subscription plan. Consider the following best practices:

  • Implement appropriate delays between requests when processing multiple URLs.
  • Handle rate limit errors gracefully in your application.
  • Check your API plan limits on the Scrapegraph dashboard.
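The sketch below illustrates the first two practices, assuming direct .run() invocation; the one-second delay and the retry budget are illustrative and should be tuned to your plan's limits.

Code
import time

from crewai_tools import ScrapegraphScrapeTool

scrape_tool = ScrapegraphScrapeTool(api_key="your_api_key")

urls = [
    "https://www.example.com/page-1",
    "https://www.example.com/page-2",
]

results = {}
for url in urls:
    for attempt in range(3):  # illustrative retry budget
        try:
            results[url] = scrape_tool.run(
                website_url=url,
                user_prompt="Extract the main content of the webpage",
            )
            break
        except Exception as exc:
            # Back off and retry on rate-limit or transient API errors.
            wait = 2 ** attempt
            print(f"Attempt {attempt + 1} for {url} failed ({exc}); retrying in {wait}s")
            time.sleep(wait)
    time.sleep(1)  # delay between requests to stay within rate limits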

Implementation Details

The ScrapegraphScrapeTool uses the Scrapegraph Python client to interact with the SmartScraper API:

Code
class ScrapegraphScrapeTool(BaseTool):
    """
    A tool that uses Scrapegraph AI to intelligently scrape website content.
    """
    
    # Implementation details...
    
    def _run(self, **kwargs: Any) -> Any:
        website_url = kwargs.get("website_url", self.website_url)
        user_prompt = (
            kwargs.get("user_prompt", self.user_prompt)
            or "Extract the main content of the webpage"
        )

        if not website_url:
            raise ValueError("website_url is required")

        # Validate URL format
        self._validate_url(website_url)

        try:
            # Make the SmartScraper request
            response = self._client.smartscraper(
                website_url=website_url,
                user_prompt=user_prompt,
            )

            return response
        # Error handling...

Conclusion

The ScrapegraphScrapeTool provides a powerful way to extract content from websites using AI-powered understanding of web page structure. By enabling agents to target specific information using natural language prompts, it makes web scraping tasks more efficient and focused. This tool is particularly useful for data extraction, content monitoring, and research tasks where specific information needs to be extracted from web pages.