Tools
Spider Scraper
The SpiderTool
is designed to extract and read the content of a specified website using Spider.
SpiderTool
Description
Spider is the fastest open source scraper and crawler that returns LLM-ready data. It converts any website into pure HTML, markdown, metadata or text while enabling you to crawl with custom actions using AI.
Installation
To use the SpiderTool
you need to download the Spider SDK
and the crewai[tools]
SDK too:
Example
This example shows you how you can use the SpiderTool
to enable your agent to scrape and crawl websites.
The data returned from the Spider API is already LLM-ready, so no need to do any cleaning there.
Code
Arguments
Argument | Type | Description |
---|---|---|
api_key | string | Specifies Spider API key. If not specified, it looks for SPIDER_API_KEY in environment variables. |
params | object | Optional parameters for the request. Defaults to {"return_format": "markdown"} to optimize content for LLMs. |
request | string | Type of request to perform (http , chrome , smart ). smart defaults to HTTP, switching to JavaScript rendering if needed. |
limit | int | Max pages to crawl per website. Set to 0 or omit for unlimited. |
depth | int | Max crawl depth. Set to 0 for no limit. |
cache | bool | Enables HTTP caching to speed up repeated runs. Default is true . |
budget | object | Sets path-based limits for crawled pages, e.g., {"*":1} for root page only. |
locale | string | Locale for the request, e.g., en-US . |
cookies | string | HTTP cookies for the request. |
stealth | bool | Enables stealth mode for Chrome requests to avoid detection. Default is true . |
headers | object | HTTP headers as a map of key-value pairs for all requests. |
metadata | bool | Stores metadata about pages and content, aiding AI interoperability. Defaults to false . |
viewport | object | Sets Chrome viewport dimensions. Default is 800x600 . |
encoding | string | Specifies encoding type, e.g., UTF-8 , SHIFT_JIS . |
subdomains | bool | Includes subdomains in the crawl. Default is false . |
user_agent | string | Custom HTTP user agent. Defaults to a random agent. |
store_data | bool | Enables data storage for the request. Overrides storageless when set. Default is false . |
gpt_config | object | Allows AI to generate crawl actions, with optional chaining steps via an array for "prompt" . |
fingerprint | bool | Enables advanced fingerprinting for Chrome. |
storageless | bool | Prevents all data storage, including AI embeddings. Default is false . |
readability | bool | Pre-processes content for reading via Mozilla’s readability. Improves content for LLMs. |
return_format | string | Format to return data: markdown , raw , text , html2text . Use raw for default page format. |
proxy_enabled | bool | Enables high-performance proxies to avoid network-level blocking. |
query_selector | string | CSS query selector for content extraction from markup. |
full_resources | bool | Downloads all resources linked to the website. |
request_timeout | int | Timeout in seconds for requests (5-60). Default is 30 . |
run_in_background | bool | Runs the request in the background, useful for data storage and triggering dashboard crawls. No effect if storageless is set. |
Was this page helpful?