SpiderTool

Description

Spider is the fastest open source scraper and crawler that returns LLM-ready data. It converts any website into pure HTML, markdown, metadata, or text, and lets you crawl with custom AI-driven actions.

Installation

To use the SpiderTool you need to install the Spider Python client along with the crewai[tools] package:

pip install spider-client 'crewai[tools]'
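
SpiderTool authenticates against the Spider API with a key from spider.cloud. You can pass the key directly via the api_key argument (see Arguments below), or set the SPIDER_API_KEY environment variable the tool falls back to. A minimal sketch of the latter, with a placeholder key:

import os

# Placeholder key; SpiderTool reads SPIDER_API_KEY when no api_key is passed.
os.environ["SPIDER_API_KEY"] = "your-spider-api-key"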

Example

This example shows how to use the SpiderTool to enable your agent to scrape and crawl websites. The data returned from the Spider API is already LLM-ready, so no extra cleaning is needed.

Code
from crewai import Agent, Task, Crew
from crewai_tools import SpiderTool

def main():
    spider_tool = SpiderTool()

    searcher = Agent(
        role="Web Research Expert",
        goal="Find related information from specific URL's",
        backstory="An expert web researcher that uses the web extremely well",
        tools=[spider_tool],
        verbose=True,
    )

    return_metadata = Task(
        description="Scrape https://spider.cloud with a limit of 1 and enable metadata",
        expected_output="Metadata and 10 word summary of spider.cloud",
        agent=searcher
    )

    crew = Crew(
        agents=[searcher],
        tasks=[
            return_metadata,
        ],
        verbose=True
    )

    crew.kickoff()

if __name__ == "__main__":
    main()

Arguments

| Argument | Type | Description |
|----------|------|-------------|
| api_key | string | Specifies the Spider API key. If not specified, it looks for SPIDER_API_KEY in environment variables. |
| params | object | Optional parameters for the request. Defaults to {"return_format": "markdown"} to optimize content for LLMs. |
| request | string | Type of request to perform (http, chrome, smart). smart defaults to HTTP, switching to JavaScript rendering if needed. |
| limit | int | Max pages to crawl per website. Set to 0 or omit for unlimited. |
| depth | int | Max crawl depth. Set to 0 for no limit. |
| cache | bool | Enables HTTP caching to speed up repeated runs. Default is true. |
| budget | object | Sets path-based limits for crawled pages, e.g., {"*": 1} for the root page only. |
| locale | string | Locale for the request, e.g., en-US. |
| cookies | string | HTTP cookies for the request. |
| stealth | bool | Enables stealth mode for Chrome requests to avoid detection. Default is true. |
| headers | object | HTTP headers as a map of key-value pairs for all requests. |
| metadata | bool | Stores metadata about pages and content, aiding AI interoperability. Defaults to false. |
| viewport | object | Sets Chrome viewport dimensions. Default is 800x600. |
| encoding | string | Specifies the encoding type, e.g., UTF-8, SHIFT_JIS. |
| subdomains | bool | Includes subdomains in the crawl. Default is false. |
| user_agent | string | Custom HTTP user agent. Defaults to a random agent. |
| store_data | bool | Enables data storage for the request. Overrides storageless when set. Default is false. |
| gpt_config | object | Allows AI to generate crawl actions, with optional chaining steps via an array for "prompt". |
| fingerprint | bool | Enables advanced fingerprinting for Chrome. |
| storageless | bool | Prevents all data storage, including AI embeddings. Default is false. |
| readability | bool | Pre-processes content for reading via Mozilla's readability. Improves content for LLMs. |
| return_format | string | Format to return data in: markdown, raw, text, html2text. Use raw for the default page format. |
| proxy_enabled | bool | Enables high-performance proxies to avoid network-level blocking. |
| query_selector | string | CSS query selector for content extraction from markup. |
| full_resources | bool | Downloads all resources linked to the website. |
| request_timeout | int | Timeout in seconds for requests (5-60). Default is 30. |
| run_in_background | bool | Runs the request in the background, useful for data storage and triggering dashboard crawls. No effect if storageless is set. |
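
To change these defaults, you can configure the tool once at construction time instead of describing options inside a task. A minimal sketch, assuming api_key is accepted directly and the remaining options travel inside params (consistent with its {"return_format": "markdown"} default above); the key and parameter values are illustrative:

from crewai_tools import SpiderTool

# Ask Spider for markdown plus page metadata, crawl only the root page,
# and apply Mozilla's readability pass for cleaner LLM input.
spider_tool = SpiderTool(
    api_key="your-spider-api-key",  # omit to fall back to SPIDER_API_KEY
    params={
        "return_format": "markdown",
        "metadata": True,
        "budget": {"*": 1},
        "readability": True,
    },
)

Configured this way, task descriptions can stay focused on what to scrape rather than how to request it.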