# Firecrawl Tool
The Firecrawl tool provides advanced web scraping and crawling with JavaScript rendering and anti-bot bypass capabilities.
## Installation

```go
import "github.com/model-box/agent-kit/tool/firecrawl"
```
## Setup

### Requirements

- **Firecrawl API Key**: Sign up at Firecrawl
- **API Access**: Free tier includes 500 credits per month

### Environment Variables

```bash
export FIRECRAWL_API_KEY="your-firecrawl-api-key"
```
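Since `NewFirecrawlTools()` takes no key argument in the Usage example below, it presumably reads `FIRECRAWL_API_KEY` from the environment. A minimal pre-flight check (an illustrative pattern, not part of the agent-kit API) fails fast when the key is missing:

```go
package main

import (
	"log"
	"os"
)

func main() {
	// Abort early with a clear message instead of surfacing an opaque
	// API error on the first scrape call.
	if os.Getenv("FIRECRAWL_API_KEY") == "" {
		log.Fatal("FIRECRAWL_API_KEY is not set")
	}
	// ... construct the Firecrawl tools and agent as shown in Usage below.
}
```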
## Usage
```go
package main

import (
	"context"
	"fmt"
	"os"

	"github.com/model-box/agent-kit/agent"
	"github.com/model-box/agent-kit/model"
	"github.com/model-box/agent-kit/session"
	"github.com/model-box/agent-kit/tool/firecrawl"
)

func main() {
	// Create Firecrawl tools
	firecrawlTools := firecrawl.NewFirecrawlTools()

	// Create model
	llm := model.Model("gpt-4o").
		SetAPIKey(os.Getenv("OPENAI_API_KEY"))

	// Create agent with Firecrawl tools (named webAgent so the agent
	// package is not shadowed; it is used again below)
	webAgent := agent.New().
		SetModel(llm).
		SetSystemPrompt("You are a web scraping assistant.").
		AddTool(firecrawlTools.Scrape()).
		AddTool(firecrawlTools.Crawl())

	// Create session and run
	sess := session.New(webAgent)
	ctx := context.Background()

	response, err := sess.Run(ctx, []agent.ChatMessage{
		agent.NewUserMessage("Scrape the main content from https://example.com in markdown format"),
	}, nil)
	if err != nil {
		panic(err)
	}

	fmt.Println(response.GetLastMessage().GetContent())
}
```
## Available Tools

### firecrawl_scrape
Scrape a single webpage with JavaScript rendering support.
| Parameter | Type | Required | Description |
|---|---|---|---|
| `url` | string | Yes | The URL to scrape |
| `formats` | []string | No | Output formats: `["markdown", "html", "rawHtml", "content", "links", "screenshot"]` |
| `only_main_content` | bool | No | Extract only main content (default: `true`) |
| `include_tags` | []string | No | HTML tags to include (e.g., `["article", "main"]`) |
| `exclude_tags` | []string | No | HTML tags to exclude (e.g., `["nav", "footer"]`) |
| `wait_for` | int | No | Wait time in milliseconds before scraping (max: 10000) |
| `timeout` | int | No | Timeout in milliseconds (default: 30000, max: 60000) |
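The model chooses the tool arguments, so parameters such as `formats` and `only_main_content` are steered through the prompt. A sketch reusing `sess` and `ctx` from the Usage example above (the URL and wording are illustrative):

```go
// Ask for a scrape the agent can map onto firecrawl_scrape arguments:
// formats ["markdown", "links"] and only_main_content true.
response, err := sess.Run(ctx, []agent.ChatMessage{
	agent.NewUserMessage(
		"Scrape https://example.com/blog, main content only, " +
			"and return both the markdown and all links on the page."),
}, nil)
if err != nil {
	panic(err)
}
fmt.Println(response.GetLastMessage().GetContent())
```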
### firecrawl_crawl
Crawl multiple pages from a website.
| Parameter | Type | Required | Description |
|---|---|---|---|
| `url` | string | Yes | The starting URL to crawl |
| `max_depth` | int | No | Maximum crawl depth (default: 2, max: 5) |
| `limit` | int | No | Maximum number of pages to crawl (default: 10, max: 100) |
| `allowed_domains` | []string | No | Domains to restrict crawling to |
| `exclude_paths` | []string | No | URL paths to exclude from crawling |
| `include_paths` | []string | No | URL paths to include in crawling |
| `only_main_content` | bool | No | Extract only main content (default: `true`) |
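Crawl parameters are steered the same way; another sketch under the same assumptions:

```go
// A crawl-scoped request the agent can translate into max_depth, limit,
// and include_paths arguments for firecrawl_crawl.
response, err := sess.Run(ctx, []agent.ChatMessage{
	agent.NewUserMessage(
		"Crawl https://example.com/docs to depth 2 and at most 20 pages, " +
			"staying under the /docs path, then summarize the docs structure."),
}, nil)
if err != nil {
	panic(err)
}
fmt.Println(response.GetLastMessage().GetContent())
```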
## Output Formats

- **markdown**: Clean, readable markdown with proper formatting for easy processing.
- **html**: Cleaned HTML with unnecessary elements removed.
- **rawHtml**: Complete HTML as rendered by the browser.
- **content**: Plain text content without any formatting.
- **links**: All links found on the page with their text and URLs.
- **screenshot**: Base64-encoded screenshot of the page.
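The `screenshot` format returns the image as a base64 string rather than a file. A minimal helper for saving it, assuming the payload is plain base64 without a `data:` URL prefix (how you extract the string from the tool output is up to the caller):

```go
import (
	"encoding/base64"
	"os"
)

// saveScreenshot decodes a base64-encoded screenshot string and writes the
// image bytes to path, e.g. saveScreenshot(b64, "page.png").
func saveScreenshot(b64, path string) error {
	img, err := base64.StdEncoding.DecodeString(b64)
	if err != nil {
		return err
	}
	return os.WriteFile(path, img, 0o644)
}
```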
## Features

- **JavaScript Rendering**: Handles modern SPAs and dynamic content
- **Anti-Bot Bypass**: Automatically handles many anti-scraping measures
- **Content Extraction**: Intelligent extraction of main content
- **Metadata Extraction**: Extracts title, description, and Open Graph tags
- **Link Extraction**: Collects all links with context
- **Screenshot Capture**: Can capture page screenshots
- **Batch Crawling**: Crawl entire websites efficiently
## Rate Limits and Credits

Each operation consumes credits:

- **Scraping**: 1 credit per page
- **Crawling**: 1 credit per page crawled

Monthly allowances by plan:

- **Free**: 500 credits/month
- **Starter**: 5,000 credits/month
- **Growth**: 50,000 credits/month
## Best Practices

- **Use specific selectors**: `include_tags` and `exclude_tags` give better content extraction
- **Set appropriate timeouts**: Raise `timeout` for slow-loading pages
- **Limit crawl depth**: Keep `max_depth` low to avoid excessive API usage
- **Filter domains**: Use `allowed_domains` when crawling to stay within scope
- **Use `wait_for`**: For pages that load content dynamically (see the sketch below)
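A sketch applying several of these practices to a single dynamic page, again reusing the Usage setup (the URL, wait, and timeout values are illustrative):

```go
// A request the agent can map onto wait_for, timeout, and exclude_tags:
// wait for client-side rendering, allow a longer timeout, trim boilerplate.
response, err := sess.Run(ctx, []agent.ChatMessage{
	agent.NewUserMessage(
		"Scrape https://example.com/app/dashboard as markdown. " +
			"Wait 5000 ms for dynamic content, use a 45000 ms timeout, " +
			"and exclude nav, footer, and aside tags."),
}, nil)
if err != nil {
	panic(err)
}
fmt.Println(response.GetLastMessage().GetContent())
```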