> ## Documentation Index
> Fetch the complete documentation index at: https://docs.crewai.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Spider Scraper

> The `SpiderTool` is designed to extract and read the content of a specified website using Spider.

# `SpiderTool`

## Description

[Spider](https://spider.cloud/?ref=crewai) is the [fastest](https://github.com/spider-rs/spider/blob/main/benches/BENCHMARKS.md#benchmark-results)
open source scraper and crawler that returns LLM-ready data.
It converts any website into pure HTML, markdown, metadata or text while enabling you to crawl with custom actions using AI.

## Installation

To use the `SpiderTool` you need to download the [Spider SDK](https://pypi.org/project/spider-client/)
and the `crewai[tools]` SDK too:

```shell theme={null}
pip install spider-client 'crewai[tools]'
```

## Example

This example shows you how you can use the `SpiderTool` to enable your agent to scrape and crawl websites.
The data returned from the Spider API is already LLM-ready, so no need to do any cleaning there.

```python Code theme={null}
from crewai_tools import SpiderTool

def main():
    spider_tool = SpiderTool()

    searcher = Agent(
        role="Web Research Expert",
        goal="Find related information from specific URL's",
        backstory="An expert web researcher that uses the web extremely well",
        tools=[spider_tool],
        verbose=True,
    )

    return_metadata = Task(
        description="Scrape https://spider.cloud with a limit of 1 and enable metadata",
        expected_output="Metadata and 10 word summary of spider.cloud",
        agent=searcher
    )

    crew = Crew(
        agents=[searcher],
        tasks=[
            return_metadata,
        ],
        verbose=2
    )

    crew.kickoff()

if __name__ == "__main__":
    main()
```

## Arguments

| Argument                | Type     | Description                                                                                                                       |
| :---------------------- | :------- | :-------------------------------------------------------------------------------------------------------------------------------- |
| **api\_key**            | `string` | Specifies Spider API key. If not specified, it looks for `SPIDER_API_KEY` in environment variables.                               |
| **params**              | `object` | Optional parameters for the request. Defaults to `{"return_format": "markdown"}` to optimize content for LLMs.                    |
| **request**             | `string` | Type of request to perform (`http`, `chrome`, `smart`). `smart` defaults to HTTP, switching to JavaScript rendering if needed.    |
| **limit**               | `int`    | Max pages to crawl per website. Set to `0` or omit for unlimited.                                                                 |
| **depth**               | `int`    | Max crawl depth. Set to `0` for no limit.                                                                                         |
| **cache**               | `bool`   | Enables HTTP caching to speed up repeated runs. Default is `true`.                                                                |
| **budget**              | `object` | Sets path-based limits for crawled pages, e.g., `{"*":1}` for root page only.                                                     |
| **locale**              | `string` | Locale for the request, e.g., `en-US`.                                                                                            |
| **cookies**             | `string` | HTTP cookies for the request.                                                                                                     |
| **stealth**             | `bool`   | Enables stealth mode for Chrome requests to avoid detection. Default is `true`.                                                   |
| **headers**             | `object` | HTTP headers as a map of key-value pairs for all requests.                                                                        |
| **metadata**            | `bool`   | Stores metadata about pages and content, aiding AI interoperability. Defaults to `false`.                                         |
| **viewport**            | `object` | Sets Chrome viewport dimensions. Default is `800x600`.                                                                            |
| **encoding**            | `string` | Specifies encoding type, e.g., `UTF-8`, `SHIFT_JIS`.                                                                              |
| **subdomains**          | `bool`   | Includes subdomains in the crawl. Default is `false`.                                                                             |
| **user\_agent**         | `string` | Custom HTTP user agent. Defaults to a random agent.                                                                               |
| **store\_data**         | `bool`   | Enables data storage for the request. Overrides `storageless` when set. Default is `false`.                                       |
| **gpt\_config**         | `object` | Allows AI to generate crawl actions, with optional chaining steps via an array for `"prompt"`.                                    |
| **fingerprint**         | `bool`   | Enables advanced fingerprinting for Chrome.                                                                                       |
| **storageless**         | `bool`   | Prevents all data storage, including AI embeddings. Default is `false`.                                                           |
| **readability**         | `bool`   | Pre-processes content for reading via [Mozilla’s readability](https://github.com/mozilla/readability). Improves content for LLMs. |
| **return\_format**      | `string` | Format to return data: `markdown`, `raw`, `text`, `html2text`. Use `raw` for default page format.                                 |
| **proxy\_enabled**      | `bool`   | Enables high-performance proxies to avoid network-level blocking.                                                                 |
| **query\_selector**     | `string` | CSS query selector for content extraction from markup.                                                                            |
| **full\_resources**     | `bool`   | Downloads all resources linked to the website.                                                                                    |
| **request\_timeout**    | `int`    | Timeout in seconds for requests (5-60). Default is `30`.                                                                          |
| **run\_in\_background** | `bool`   | Runs the request in the background, useful for data storage and triggering dashboard crawls. No effect if `storageless` is set.   |
