The SpiderTool is designed to extract and read the content of a specified website using Spider.
To use the SpiderTool, you need to install the Spider SDK and the crewai[tools] SDK.
Use the SpiderTool to enable your agent to scrape and crawl websites. The data returned from the Spider API is already LLM-ready, so no further cleaning is needed.
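A minimal sketch of creating the tool, assuming the `crewai_tools` package exposes `SpiderTool` and that the API key is supplied through the `SPIDER_API_KEY` environment variable. The helper name `make_spider_tool` is hypothetical and not part of either SDK; it returns `None` instead of failing when the key or the packages are missing.

```python
import os


def make_spider_tool():
    """Return a SpiderTool instance, or None if it cannot be created here."""
    if not os.environ.get("SPIDER_API_KEY"):
        # Without an explicit api_key, the tool looks for SPIDER_API_KEY.
        return None
    try:
        # Assumption: SpiderTool is importable from crewai_tools (crewai[tools]).
        from crewai_tools import SpiderTool
        return SpiderTool()
    except ImportError:
        # One of the required SDKs is not installed.
        return None


spider_tool = make_spider_tool()
if spider_tool is None:
    print("SpiderTool unavailable: install the SDKs and set SPIDER_API_KEY")
else:
    print("SpiderTool ready")
```

If the tool is created, it would typically be handed to an agent through its `tools` list, for example `tools=[spider_tool]`.
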
Argument | Type | Description |
---|---|---|
api_key | string | Specifies Spider API key. If not specified, it looks for SPIDER_API_KEY in environment variables. |
params | object | Optional parameters for the request. Defaults to {"return_format": "markdown"} to optimize content for LLMs. |
request | string | Type of request to perform (http, chrome, smart). The smart mode defaults to HTTP and switches to JavaScript rendering if needed. |
limit | int | Max pages to crawl per website. Set to 0 or omit for unlimited. |
depth | int | Max crawl depth. Set to 0 for no limit. |
cache | bool | Enables HTTP caching to speed up repeated runs. Default is true. |
budget | object | Sets path-based limits for crawled pages, e.g., {"*":1} for root page only. |
locale | string | Locale for the request, e.g., en-US. |
cookies | string | HTTP cookies for the request. |
stealth | bool | Enables stealth mode for Chrome requests to avoid detection. Default is true. |
headers | object | HTTP headers as a map of key-value pairs for all requests. |
metadata | bool | Stores metadata about pages and content, aiding AI interoperability. Default is false. |
viewport | object | Sets Chrome viewport dimensions. Default is 800x600. |
encoding | string | Specifies encoding type, e.g., UTF-8, SHIFT_JIS. |
subdomains | bool | Includes subdomains in the crawl. Default is false. |
user_agent | string | Custom HTTP user agent. Defaults to a random agent. |
store_data | bool | Enables data storage for the request. Overrides storageless when set. Default is false. |
gpt_config | object | Allows AI to generate crawl actions, with optional chaining steps via an array for "prompt". |
fingerprint | bool | Enables advanced fingerprinting for Chrome. |
storageless | bool | Prevents all data storage, including AI embeddings. Default is false. |
readability | bool | Pre-processes content for reading via Mozilla’s readability. Improves content for LLMs. |
return_format | string | Format to return data: markdown, raw, text, html2text. Use raw for default page format. |
proxy_enabled | bool | Enables high-performance proxies to avoid network-level blocking. |
query_selector | string | CSS query selector for content extraction from markup. |
full_resources | bool | Downloads all resources linked to the website. |
request_timeout | int | Timeout in seconds for requests (5-60). Default is 30. |
run_in_background | bool | Runs the request in the background, useful for data storage and triggering dashboard crawls. No effect if storageless is set. |
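
To illustrate how the arguments above combine, here is a sketch of assembling a `params` object in plain Python. The helper `build_spider_params` is hypothetical and not part of the Spider or crewAI SDKs; its checks only mirror the allowed values and ranges listed in the table, and the actual API may validate differently.

```python
ALLOWED_REQUESTS = {"http", "chrome", "smart"}
ALLOWED_FORMATS = {"markdown", "raw", "text", "html2text"}


def build_spider_params(**overrides):
    """Merge overrides into the documented default and check a few fields."""
    params = {"return_format": "markdown"}  # documented default for params
    params.update(overrides)

    if params["return_format"] not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported return_format: {params['return_format']!r}")

    request = params.get("request")
    if request is not None and request not in ALLOWED_REQUESTS:
        raise ValueError(f"unsupported request type: {request!r}")

    timeout = params.get("request_timeout")
    if timeout is not None and not 5 <= timeout <= 60:
        raise ValueError("request_timeout must be between 5 and 60 seconds")

    return params


# Crawl only the root page, with metadata, using the smart request mode.
params = build_spider_params(
    request="smart",
    limit=1,
    metadata=True,
    budget={"*": 1},
    request_timeout=30,
)
print(params)
```

Validating locally like this is a design choice rather than a requirement: it surfaces an out-of-range timeout or a misspelled format before any request is sent.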