Spider Cloud Tool

Enterprise web scraping, crawling and search with proxy support

The Spider Cloud tool provides enterprise-grade web scraping, crawling, and search with advanced proxy support and intelligent content extraction.

Installation

import "github.com/model-box/agent-kit/tool/spider_cloud"

Setup

Requirements

  1. Spider Cloud API Key: Sign up at Spider Cloud
  2. API Access: Various pricing tiers available

Environment Variables

export SPIDER_CLOUD_API_KEY="your-spider-cloud-api-key"

Usage

package main

import (
    "context"
    "fmt"
    "os"

    "github.com/model-box/agent-kit/agent"
    "github.com/model-box/agent-kit/model"
    "github.com/model-box/agent-kit/session"
    "github.com/model-box/agent-kit/tool/spider_cloud"
)

func main() {
    // Create Spider Cloud tools (requires SPIDER_CLOUD_API_KEY, see Setup above)
    spiderTools := spider_cloud.NewSpiderCloudTools()

    // Create model (named llm so the model package is not shadowed)
    llm := model.Model("gpt-4o").
        SetAPIKey(os.Getenv("OPENAI_API_KEY"))

    // Create agent with Spider Cloud tools
    researcher := agent.New().
        SetModel(llm).
        SetSystemPrompt("You are a web research assistant with advanced scraping capabilities.").
        AddTool(spiderTools.Scrape()).
        AddTool(spiderTools.Crawl()).
        AddTool(spiderTools.Search())

    // Create session and run
    sess := session.New(researcher)
    ctx := context.Background()

    response, err := sess.Run(ctx, []agent.ChatMessage{
        agent.NewUserMessage("Search for 'machine learning tutorials' and fetch the content of the top results"),
    }, nil)
    if err != nil {
        panic(err)
    }

    fmt.Println(response.GetLastMessage().GetContent())
}

Available Tools

spider_scrape

Advanced webpage scraping with multiple options.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| url | string | Yes | The URL to scrape |
| return_formats | []string | No | Content formats: ["markdown", "html", "text"] |
| request_type | string | No | "http", "chrome", "smart" (default: "http") |
| custom_headers | map[string]string | No | Custom HTTP headers |
| cookies | []Cookie | No | Cookies to send with request |
| proxy_config | ProxyConfig | No | Proxy configuration |
| store_cookies | bool | No | Store cookies from response |
| metadata | bool | No | Include metadata in response |
| readability | bool | No | Use readability mode (default: true) |
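
The parameters above arrive as a JSON arguments object when the model decides to call spider_scrape. As a minimal sketch, the program below builds a plausible set of arguments in Go and prints them as JSON; the values (URL, formats) are illustrative only, and the real payload is assembled by the agent runtime rather than by your code.

package main

import (
    "encoding/json"
    "fmt"
)

func main() {
    // Illustrative spider_scrape arguments, based on the parameter table above.
    // The agent runtime builds the actual payload when the model calls the tool.
    args := map[string]any{
        "url":            "https://example.com/pricing",
        "return_formats": []string{"markdown"},
        "request_type":   "chrome", // force browser rendering for a JavaScript-heavy page
        "readability":    true,     // strip navigation, ads, and other page chrome
        "metadata":       true,
    }

    out, err := json.MarshalIndent(args, "", "  ")
    if err != nil {
        panic(err)
    }
    fmt.Println(string(out))
}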

spider_crawl

Crawl websites with configurable depth and filters.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| url | string | Yes | The starting URL to crawl |
| limit | int | No | Maximum pages to crawl (default: 10, max: 500) |
| depth | int | No | Maximum crawl depth (default: 3) |
| allowed_domains | []string | No | Domains to restrict crawling to |
| blacklist_patterns | []string | No | URL patterns to exclude |
| whitelist_patterns | []string | No | URL patterns to include |
| return_formats | []string | No | Content formats: ["markdown", "html", "text"] |
| request_type | string | No | "http", "chrome", "smart" |
| readability | bool | No | Use readability mode (default: true) |
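
Reusing sess and ctx from the Usage example above, a prompt like the one below typically leads the model to call spider_crawl with limit, depth, and allowed_domains filled in; the model decides the exact arguments.

    // Continues the Usage example above: sess, ctx, and the fmt import are assumed.
    resp, err := sess.Run(ctx, []agent.ChatMessage{
        agent.NewUserMessage(
            "Crawl https://example.com/docs, staying on example.com, " +
                "at most 20 pages and depth 2, and summarize the markdown content."),
    }, nil)
    if err != nil {
        panic(err)
    }
    fmt.Println(resp.GetLastMessage().GetContent())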

spider_search

Search the web with optional content fetching.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| query | string | Yes | Search query |
| search_type | string | No | "search", "news", "images" (default: "search") |
| num_results | int | No | Number of results (default: 10, max: 100) |
| domain | string | No | Limit to specific domain |
| lang | string | No | Language code |
| country | string | No | Country code |
| fetch_page_content | bool | No | Fetch full page content for each result |
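
In the same illustrative style as the spider_scrape sketch earlier, a plausible spider_search argument set could look like the following (a drop-in replacement for the args map in that sketch; values are examples, not defaults).

    // Illustrative spider_search arguments, based on the parameter table above.
    args := map[string]any{
        "query":              "open source web crawlers",
        "search_type":        "news",
        "num_results":        5,
        "lang":               "en",
        "country":            "us",
        "fetch_page_content": true, // also scrape each result, as in the Usage example
    }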

Request Types

  • HTTP: Fast, basic HTTP requests
  • Chrome: Full browser rendering for JavaScript-heavy sites
  • Smart: Automatically detects if browser rendering is needed

Advanced Features

Proxy Support

Configure datacenter or residential proxies with country selection.
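
A proxy is passed through the proxy_config parameter of spider_scrape. The fragment below is a hypothetical sketch only; the nested field names are assumptions for illustration, not a confirmed schema, so check the Spider Cloud API reference for the actual ProxyConfig fields.

    // Hypothetical proxy_config for spider_scrape; field names are assumptions.
    args := map[string]any{
        "url": "https://example.com/pricing",
        "proxy_config": map[string]any{
            "type":    "residential", // or "datacenter"
            "country": "us",
        },
    }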

Content Extraction

Intelligent extraction with readability mode that removes:

  • Navigation menus
  • Advertisements
  • Sidebars
  • Footer content
  • Scripts and styles

Pattern Matching

Use URL patterns to control crawling:

  • Blacklist: */admin/*, *.pdf, *?print=true
  • Whitelist: /blog/*, */2024/*
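
As an illustration, the example patterns above would appear in spider_crawl arguments roughly like this (same sketch style as before; the agent runtime assembles the real payload).

    // Illustrative spider_crawl filters using the example patterns above:
    // crawl blog and 2024 content only, skip admin pages, PDFs, and print views.
    args := map[string]any{
        "url":                "https://example.com",
        "whitelist_patterns": []string{"/blog/*", "*/2024/*"},
        "blacklist_patterns": []string{"*/admin/*", "*.pdf", "*?print=true"},
    }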

Use Cases

  1. Competitive Analysis: Monitor competitor websites
  2. Content Aggregation: Collect articles from multiple sources
  3. Price Monitoring: Track product prices across e-commerce sites
  4. SEO Analysis: Analyze website structure and content
  5. Research: Gather information from academic or news sites
  6. Lead Generation: Extract business information
  7. Market Research: Analyze industry trends

Best Practices

  1. Respect robots.txt: Check site policies
  2. Use appropriate delays: Don't overwhelm servers
  3. Set user agent: Identify your bot
  4. Handle errors gracefully: Implement retry logic (see the sketch after this list)
  5. Use proxies wisely: For sites with rate limits
  6. Filter URLs: Use patterns to stay focused
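
As a minimal sketch of the retry advice above, the helper below wraps any call (such as a closure around sess.Run from the Usage example) with exponential backoff. It uses only the standard library, so no agent-kit types are named directly.

package main

import (
    "context"
    "errors"
    "fmt"
    "time"
)

// runWithRetry retries call up to attempts times with exponential backoff
// (1s, 2s, 4s, ...). It is generic over the result type, so it can wrap a
// closure around session.Run without depending on agent-kit types.
func runWithRetry[T any](ctx context.Context, attempts int, call func(context.Context) (T, error)) (T, error) {
    var zero T
    var err error
    for i := 0; i < attempts; i++ {
        var out T
        if out, err = call(ctx); err == nil {
            return out, nil
        }
        // Back off before the next attempt, but respect context cancellation.
        select {
        case <-time.After(time.Duration(1<<i) * time.Second):
        case <-ctx.Done():
            return zero, ctx.Err()
        }
    }
    return zero, err
}

func main() {
    ctx := context.Background()

    // Replace this closure with a real call such as:
    //   sess.Run(ctx, []agent.ChatMessage{agent.NewUserMessage("...")}, nil)
    result, err := runWithRetry(ctx, 3, func(ctx context.Context) (string, error) {
        return "", errors.New("transient scrape failure") // simulated flaky call
    })
    fmt.Println(result, err)
}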