Spider Cloud Tool
Enterprise web scraping, crawling, and search with proxy support
The Spider Cloud tool provides enterprise-grade web scraping, crawling, and search with advanced proxy support and intelligent content extraction.
Installation
import "github.com/model-box/agent-kit/tool/spider_cloud"
Setup
Requirements
- Spider Cloud API Key: Sign up at Spider Cloud
- API Access: Various pricing tiers available
Environment Variables
export SPIDER_CLOUD_API_KEY="your-spider-cloud-api-key"
Usage
package main

import (
    "context"
    "fmt"
    "os"

    "github.com/model-box/agent-kit/agent"
    "github.com/model-box/agent-kit/model"
    "github.com/model-box/agent-kit/session"
    "github.com/model-box/agent-kit/tool/spider_cloud"
)

func main() {
    // Create Spider Cloud tools
    spiderTools := spider_cloud.NewSpiderCloudTools()

    // Create model
    llm := model.Model("gpt-4o").
        SetAPIKey(os.Getenv("OPENAI_API_KEY"))

    // Create agent with Spider Cloud tools. The variable is named so it
    // does not shadow the agent package, which is still needed below for
    // agent.ChatMessage and agent.NewUserMessage.
    researcher := agent.New().
        SetModel(llm).
        SetSystemPrompt("You are a web research assistant with advanced scraping capabilities.").
        AddTool(spiderTools.Scrape()).
        AddTool(spiderTools.Crawl()).
        AddTool(spiderTools.Search())

    // Create session and run
    sess := session.New(researcher)
    ctx := context.Background()

    response, err := sess.Run(ctx, []agent.ChatMessage{
        agent.NewUserMessage("Search for 'machine learning tutorials' and fetch the content of the top results"),
    }, nil)
    if err != nil {
        panic(err)
    }

    fmt.Println(response.GetLastMessage().GetContent())
}
Available Tools
spider_scrape
Scrape a single webpage with configurable output formats, request types, custom headers, cookies, and proxy settings.
Parameter | Type | Required | Description |
---|---|---|---|
url | string | Yes | The URL to scrape |
return_formats | []string | No | Content formats: ["markdown", "html", "text"] |
request_type | string | No | "http", "chrome", "smart" (default: "http") |
custom_headers | map[string]string | No | Custom HTTP headers |
cookies | []Cookie | No | Cookies to send with request |
proxy_config | ProxyConfig | No | Proxy configuration |
store_cookies | bool | No | Store cookies from response |
metadata | bool | No | Include metadata in response |
readability | bool | No | Use readability mode (default: true) |
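As a sketch of what a single spider_scrape invocation carries, the snippet below builds an argument payload using the parameter names from the table above. The JSON encoding and the example URL are illustrative assumptions, not the tool's documented wire format.

    package main

    import (
        "encoding/json"
        "fmt"
    )

    func main() {
        // Arguments for one spider_scrape call, keyed by the parameter
        // names documented above. The JSON shape is an assumption used
        // only to make the payload concrete.
        args := map[string]any{
            "url":            "https://example.com/article",
            "return_formats": []string{"markdown"},
            "request_type":   "smart",
            "readability":    true,
            "metadata":       true,
        }

        payload, _ := json.MarshalIndent(args, "", "  ")
        fmt.Println(string(payload))
    }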
spider_crawl
Crawl a website from a starting URL with configurable page limits, depth, and URL filters.
Parameter | Type | Required | Description |
---|---|---|---|
url | string | Yes | The starting URL to crawl |
limit | int | No | Maximum pages to crawl (default: 10, max: 500) |
depth | int | No | Maximum crawl depth (default: 3) |
allowed_domains | []string | No | Domains to restrict crawling to |
blacklist_patterns | []string | No | URL patterns to exclude |
whitelist_patterns | []string | No | URL patterns to include |
return_formats | []string | No | Content formats: ["markdown", "html", "text"] |
request_type | string | No | "http", "chrome", "smart" |
readability | bool | No | Use readability mode (default: true) |
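The same idea for spider_crawl: the struct below mirrors a subset of the parameters above to show the argument shape. The struct and its JSON encoding are a sketch for illustration, not a type exported by the spider_cloud package.

    package main

    import (
        "encoding/json"
        "fmt"
    )

    // CrawlArgs mirrors some of the spider_crawl parameters listed above.
    // It is illustrative only and not part of the spider_cloud package.
    type CrawlArgs struct {
        URL               string   `json:"url"`
        Limit             int      `json:"limit,omitempty"`
        Depth             int      `json:"depth,omitempty"`
        AllowedDomains    []string `json:"allowed_domains,omitempty"`
        BlacklistPatterns []string `json:"blacklist_patterns,omitempty"`
        ReturnFormats     []string `json:"return_formats,omitempty"`
    }

    func main() {
        args := CrawlArgs{
            URL:            "https://example.com/blog",
            Limit:          50,
            Depth:          2,
            AllowedDomains: []string{"example.com"},
            ReturnFormats:  []string{"markdown"},
        }

        payload, _ := json.MarshalIndent(args, "", "  ")
        fmt.Println(string(payload))
    }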
spider_search
Search the web with optional content fetching.
Parameter | Type | Required | Description |
---|---|---|---|
query | string | Yes | Search query |
search_type | string | No | "search", "news", "images" (default: "search") |
num_results | int | No | Number of results (default: 10, max: 100) |
domain | string | No | Limit to specific domain |
lang | string | No | Language code |
country | string | No | Country code |
fetch_page_content | bool | No | Fetch full page content for each result |
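And for spider_search, a payload that pairs a query with fetch_page_content so each result is also scraped. Again, the JSON shape is an assumption for illustration; only the parameter names come from the table above.

    package main

    import (
        "encoding/json"
        "fmt"
    )

    func main() {
        // Arguments for a spider_search call. fetch_page_content asks the
        // tool to also fetch the full content of each result.
        args := map[string]any{
            "query":              "machine learning tutorials",
            "search_type":        "search",
            "num_results":        5,
            "lang":               "en",
            "country":            "us",
            "fetch_page_content": true,
        }

        payload, _ := json.MarshalIndent(args, "", "  ")
        fmt.Println(string(payload))
    }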
Request Types
- HTTP: Fast, basic HTTP requests
- Chrome: Full browser rendering for JavaScript-heavy sites
- Smart: Automatically detects if browser rendering is needed
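If you want to set request_type explicitly rather than relying on "smart", a trivial heuristic might look like the sketch below. The chooseRequestType helper is written for this example and is not part of the package; "smart" already performs this kind of detection on the service side.

    package main

    import "fmt"

    // chooseRequestType picks a request_type value for spider_scrape or
    // spider_crawl. Illustrative only.
    func chooseRequestType(jsRendered, latencySensitive bool) string {
        switch {
        case jsRendered:
            return "chrome" // full browser rendering for JavaScript-heavy sites
        case latencySensitive:
            return "http" // fastest option for static pages
        default:
            return "smart" // let Spider Cloud decide
        }
    }

    func main() {
        fmt.Println(chooseRequestType(true, false))  // chrome
        fmt.Println(chooseRequestType(false, true))  // http
        fmt.Println(chooseRequestType(false, false)) // smart
    }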
Advanced Features
Proxy Support
Configure datacenter or residential proxies with country selection.
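A proxy_config block attached to a scrape call might look like the following sketch. The field names ("type", "country") are assumptions made for illustration; check the ProxyConfig type in the spider_cloud package for the actual fields.

    package main

    import (
        "encoding/json"
        "fmt"
    )

    func main() {
        // spider_scrape arguments with an assumed proxy_config block
        // selecting a residential proxy in a given country.
        args := map[string]any{
            "url":          "https://example.com/pricing",
            "request_type": "chrome",
            "proxy_config": map[string]string{
                "type":    "residential", // assumed field name
                "country": "us",          // assumed field name
            },
        }

        payload, _ := json.MarshalIndent(args, "", "  ")
        fmt.Println(string(payload))
    }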
Content Extraction
Intelligent extraction with readability mode that removes:
- Navigation menus
- Advertisements
- Sidebars
- Footer content
- Scripts and styles
Pattern Matching
Use URL patterns to control crawling:
- Blacklist: */admin/*, *.pdf, *?print=true
- Whitelist: /blog/*, */2024/*
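To make the pattern semantics concrete, the sketch below approximates * wildcard matching with regular expressions. It only illustrates how such patterns select or exclude URLs; it is not Spider Cloud's actual matching implementation.

    package main

    import (
        "fmt"
        "regexp"
        "strings"
    )

    // globToRegexp converts a simple * wildcard pattern into an anchored
    // regular expression. Illustrative only.
    func globToRegexp(pattern string) *regexp.Regexp {
        escaped := regexp.QuoteMeta(pattern)
        return regexp.MustCompile("^" + strings.ReplaceAll(escaped, `\*`, ".*") + "$")
    }

    func main() {
        blacklist := []string{"*/admin/*", "*.pdf", "*?print=true"}
        urls := []string{
            "https://example.com/blog/post-1",
            "https://example.com/admin/users",
            "https://example.com/files/report.pdf",
        }

        for _, u := range urls {
            excluded := false
            for _, p := range blacklist {
                if globToRegexp(p).MatchString(u) {
                    excluded = true
                    break
                }
            }
            fmt.Printf("%-45s excluded=%v\n", u, excluded)
        }
    }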
Use Cases
- Competitive Analysis: Monitor competitor websites
- Content Aggregation: Collect articles from multiple sources
- Price Monitoring: Track product prices across e-commerce sites
- SEO Analysis: Analyze website structure and content
- Research: Gather information from academic or news sites
- Lead Generation: Extract business information
- Market Research: Analyze industry trends
Best Practices
- Respect robots.txt: Check site policies
- Use appropriate delays: Don't overwhelm servers
- Set user agent: Identify your bot
- Handle errors gracefully: Implement retry logic (see the sketch after this list)
- Use proxies wisely: For sites with rate limits
- Filter URLs: Use patterns to stay focused
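A minimal retry-with-backoff sketch for the error-handling advice above. withRetry is a generic helper written for this example; it is not part of agent-kit or the spider_cloud package.

    package main

    import (
        "errors"
        "fmt"
        "math/rand"
        "time"
    )

    // withRetry retries fn with exponential backoff plus a little jitter,
    // which helps when a scrape or search call hits a transient failure
    // such as a rate limit.
    func withRetry(attempts int, baseDelay time.Duration, fn func() error) error {
        var err error
        for i := 0; i < attempts; i++ {
            if err = fn(); err == nil {
                return nil
            }
            // Backoff: baseDelay, 2x, 4x, ... plus up to 100ms of jitter.
            delay := (baseDelay << i) + time.Duration(rand.Intn(100))*time.Millisecond
            time.Sleep(delay)
        }
        return fmt.Errorf("all %d attempts failed: %w", attempts, err)
    }

    func main() {
        calls := 0
        err := withRetry(3, 500*time.Millisecond, func() error {
            calls++
            if calls < 3 {
                return errors.New("rate limited") // simulate a transient failure
            }
            return nil
        })
        fmt.Println("calls:", calls, "err:", err)
    }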