Spider Cloud Tool

Enterprise web scraping, crawling and search with proxy support

The Spider Cloud tool provides enterprise-grade web scraping, crawling, and search with advanced proxy support and intelligent content extraction.

Installation

import "github.com/model-box/agent-kit/tool/spider_cloud"

Setup

Requirements

  1. Spider Cloud API Key: Sign up at Spider Cloud
  2. API Access: Various pricing tiers available

Environment Variables

export SPIDER_CLOUD_API_KEY="your-spider-cloud-api-key"

Usage

package main

import (
    "context"
    "fmt"
    "os"

    "github.com/model-box/agent-kit/agent"
    "github.com/model-box/agent-kit/model"
    "github.com/model-box/agent-kit/session"
    "github.com/model-box/agent-kit/tool/spider_cloud"
)

func main() {
    // Create Spider Cloud tools (requires SPIDER_CLOUD_API_KEY, see Setup above)
    spiderTools := spider_cloud.NewSpiderCloudTools()

    // Create model (named llm so the model package is not shadowed)
    llm := model.Model("gpt-4o").
        SetAPIKey(os.Getenv("OPENAI_API_KEY"))

    // Create agent with Spider Cloud tools
    researcher := agent.New().
        SetModel(llm).
        SetSystemPrompt("You are a web research assistant with advanced scraping capabilities.").
        AddTool(spiderTools.Scrape()).
        AddTool(spiderTools.Crawl()).
        AddTool(spiderTools.Search())

    // Create session and run
    sess := session.New(researcher)
    ctx := context.Background()

    response, err := sess.Run(ctx, []agent.ChatMessage{
        agent.NewUserMessage("Search for 'machine learning tutorials' and fetch the content of the top results"),
    }, nil)
    if err != nil {
        panic(err)
    }

    fmt.Println(response.GetLastMessage().GetContent())
}

Available Tools

spider_scrape

Advanced webpage scraping with multiple options.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| url | string | Yes | The URL to scrape |
| return_formats | []string | No | Content formats: ["markdown", "html", "text"] |
| request_type | string | No | "http", "chrome", "smart" (default: "http") |
| custom_headers | map[string]string | No | Custom HTTP headers |
| cookies | []Cookie | No | Cookies to send with request |
| proxy_config | ProxyConfig | No | Proxy configuration |
| store_cookies | bool | No | Store cookies from response |
| metadata | bool | No | Include metadata in response |
| readability | bool | No | Use readability mode (default: true) |
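
The parameters above arrive as a JSON arguments object when the model decides to call spider_scrape. As a minimal sketch, the program below builds a plausible set of arguments in Go and prints them as JSON; the values (URL, formats) are illustrative only, and the real payload is assembled by the agent runtime rather than by your code.

package main

import (
    "encoding/json"
    "fmt"
)

func main() {
    // Illustrative spider_scrape arguments, based on the parameter table above.
    // The agent runtime builds the actual payload when the model calls the tool.
    args := map[string]any{
        "url":            "https://example.com/pricing",
        "return_formats": []string{"markdown"},
        "request_type":   "chrome", // force browser rendering for a JavaScript-heavy page
        "readability":    true,     // strip navigation, ads, and other page chrome
        "metadata":       true,
    }

    out, err := json.MarshalIndent(args, "", "  ")
    if err != nil {
        panic(err)
    }
    fmt.Println(string(out))
}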

spider_crawl

Crawl websites with configurable depth and filters.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| url | string | Yes | The starting URL to crawl |
| limit | int | No | Maximum pages to crawl (default: 10, max: 500) |
| depth | int | No | Maximum crawl depth (default: 3) |
| allowed_domains | []string | No | Domains to restrict crawling to |
| blacklist_patterns | []string | No | URL patterns to exclude |
| whitelist_patterns | []string | No | URL patterns to include |
| return_formats | []string | No | Content formats: ["markdown", "html", "text"] |
| request_type | string | No | "http", "chrome", "smart" |
| readability | bool | No | Use readability mode (default: true) |
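
Reusing sess and ctx from the Usage example above, a prompt like the one below typically leads the model to call spider_crawl with limit, depth, and allowed_domains filled in; the model decides the exact arguments.

    // Continues the Usage example above: sess, ctx, and the fmt import are assumed.
    resp, err := sess.Run(ctx, []agent.ChatMessage{
        agent.NewUserMessage(
            "Crawl https://example.com/docs, staying on example.com, " +
                "at most 20 pages and depth 2, and summarize the markdown content."),
    }, nil)
    if err != nil {
        panic(err)
    }
    fmt.Println(resp.GetLastMessage().GetContent())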

spider_search

Search the web with optional content fetching.

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| query | string | Yes | Search query |
| search_type | string | No | "search", "news", "images" (default: "search") |
| num_results | int | No | Number of results (default: 10, max: 100) |
| domain | string | No | Limit to specific domain |
| lang | string | No | Language code |
| country | string | No | Country code |
| fetch_page_content | bool | No | Fetch full page content for each result |
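
In the same illustrative style as the spider_scrape sketch earlier, a plausible spider_search argument set could look like the following (a drop-in replacement for the args map in that sketch; values are examples, not defaults).

    // Illustrative spider_search arguments, based on the parameter table above.
    args := map[string]any{
        "query":              "open source web crawlers",
        "search_type":        "news",
        "num_results":        5,
        "lang":               "en",
        "country":            "us",
        "fetch_page_content": true, // also scrape each result, as in the Usage example
    }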

Request Types

  • HTTP: Fast, basic HTTP requests
  • Chrome: Full browser rendering for JavaScript-heavy sites
  • Smart: Automatically detects if browser rendering is needed

Advanced Features

Proxy Support

Configure datacenter or residential proxies with country selection.
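
A proxy is passed through the proxy_config parameter of spider_scrape. The fragment below is a hypothetical sketch only; the nested field names are assumptions for illustration, not a confirmed schema, so check the Spider Cloud API reference for the actual ProxyConfig fields.

    // Hypothetical proxy_config for spider_scrape; field names are assumptions.
    args := map[string]any{
        "url": "https://example.com/pricing",
        "proxy_config": map[string]any{
            "type":    "residential", // or "datacenter"
            "country": "us",
        },
    }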

Content Extraction

Intelligent extraction with readability mode that removes:

  • Navigation menus
  • Advertisements
  • Sidebars
  • Footer content
  • Scripts and styles

Pattern Matching

Use URL patterns to control crawling:

  • Blacklist: */admin/*, *.pdf, *?print=true
  • Whitelist: /blog/*, */2024/*
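
As an illustration, the example patterns above would appear in spider_crawl arguments roughly like this (same sketch style as before; the agent runtime assembles the real payload).

    // Illustrative spider_crawl filters using the example patterns above:
    // crawl blog and 2024 content only, skip admin pages, PDFs, and print views.
    args := map[string]any{
        "url":                "https://example.com",
        "whitelist_patterns": []string{"/blog/*", "*/2024/*"},
        "blacklist_patterns": []string{"*/admin/*", "*.pdf", "*?print=true"},
    }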

Use Cases

  1. Competitive Analysis: Monitor competitor websites
  2. Content Aggregation: Collect articles from multiple sources
  3. Price Monitoring: Track product prices across e-commerce sites
  4. SEO Analysis: Analyze website structure and content
  5. Research: Gather information from academic or news sites
  6. Lead Generation: Extract business information
  7. Market Research: Analyze industry trends

Best Practices

  1. Respect robots.txt: Check site policies
  2. Use appropriate delays: Don't overwhelm servers
  3. Set user agent: Identify your bot
  4. Handle errors gracefully: Implement retry logic (see the sketch after this list)
  5. Use proxies wisely: For sites with rate limits
  6. Filter URLs: Use patterns to stay focused
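
As a minimal sketch of the retry advice above, the helper below wraps any call (such as a closure around sess.Run from the Usage example) with exponential backoff. It uses only the standard library, so no agent-kit types are named directly.

package main

import (
    "context"
    "errors"
    "fmt"
    "time"
)

// runWithRetry retries call up to attempts times with exponential backoff
// (1s, 2s, 4s, ...). It is generic over the result type, so it can wrap a
// closure around session.Run without depending on agent-kit types.
func runWithRetry[T any](ctx context.Context, attempts int, call func(context.Context) (T, error)) (T, error) {
    var zero T
    var err error
    for i := 0; i < attempts; i++ {
        var out T
        if out, err = call(ctx); err == nil {
            return out, nil
        }
        // Back off before the next attempt, but respect context cancellation.
        select {
        case <-time.After(time.Duration(1<<i) * time.Second):
        case <-ctx.Done():
            return zero, ctx.Err()
        }
    }
    return zero, err
}

func main() {
    ctx := context.Background()

    // Replace this closure with a real call such as:
    //   sess.Run(ctx, []agent.ChatMessage{agent.NewUserMessage("...")}, nil)
    result, err := runWithRetry(ctx, 3, func(ctx context.Context) (string, error) {
        return "", errors.New("transient scrape failure") // simulated flaky call
    })
    fmt.Println(result, err)
}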