API Reference
Crawl API
Crawl multiple pages from a website with depth and path controls
The Crawl API enables multi-page website crawling with depth control, path filtering, and webhook notifications. Perfect for scraping entire websites or specific sections.
Endpoints
| Method | Path | Description |
|---|---|---|
| POST | /api/v1/crawl | Start a crawl job |
| GET | /api/v1/crawl/:id | Get crawl status and results |
| DELETE | /api/v1/crawl/:id | Cancel a crawl job |
| GET | /api/v1/crawl | List crawl history |
Start Crawl Job
POST /api/v1/crawl
Request Body
| Parameter | Type | Default | Description |
|---|---|---|---|
| url | string | required | Starting URL for the crawl |
| maxDepth | number | 3 | Maximum link depth to follow (1-10) |
| limit | number | 100 | Maximum pages to crawl (1-1000) |
| includePaths | string[] | - | URL patterns to include (glob patterns) |
| excludePaths | string[] | - | URL patterns to exclude (glob patterns) |
| allowExternalLinks | boolean | false | Follow links to external domains |
| scrapeOptions | object | - | Options for scraping each page (see Scrape API) |
| webhookUrl | string | - | Webhook URL for completion notification |
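Client-side, the constraints in the table above can be checked before a request is sent, catching out-of-range values without burning quota. A minimal sketch (the `validate_crawl_request` helper is hypothetical, not part of the API):

```python
def validate_crawl_request(body: dict) -> list[str]:
    """Return a list of validation errors for a crawl request body,
    based on the documented parameter constraints."""
    errors = []
    if not body.get("url"):
        errors.append("url is required")
    max_depth = body.get("maxDepth", 3)  # default 3, allowed 1-10
    if not (1 <= max_depth <= 10):
        errors.append("maxDepth must be between 1 and 10")
    limit = body.get("limit", 100)  # default 100, allowed 1-1000
    if not (1 <= limit <= 1000):
        errors.append("limit must be between 1 and 1000")
    return errors

# A request missing url and exceeding the page limit fails both checks:
print(validate_crawl_request({"limit": 5000}))
```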
Example Request
```bash
curl -X POST https://server.anyhunt.app/api/v1/crawl \
  -H "Authorization: Bearer ah_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com",
    "maxDepth": 3,
    "limit": 50,
    "includePaths": ["/docs/*", "/guides/*"],
    "excludePaths": ["/api/*"],
    "scrapeOptions": {
      "formats": ["markdown"],
      "onlyMainContent": true
    },
    "webhookUrl": "https://your-app.com/webhooks/crawl"
  }'
```
Response
The API returns a job ID. Poll GET /api/v1/crawl/:id to get status and results.
```json
{
  "id": "crawl_abc123",
  "status": "PENDING"
}
```
Get Crawl Status
GET /api/v1/crawl/:id
Retrieve the status and results of a crawl job.
Response (In Progress)
```json
{
  "id": "crawl_abc123",
  "status": "PROCESSING",
  "startUrl": "https://docs.example.com",
  "totalUrls": 45,
  "completedUrls": 32,
  "failedUrls": 2,
  "createdAt": "2024-01-15T10:30:00.000Z",
  "startedAt": "2024-01-15T10:30:01.000Z"
}
```
Response (Completed)
```json
{
  "id": "crawl_abc123",
  "status": "COMPLETED",
  "startUrl": "https://docs.example.com",
  "totalUrls": 45,
  "completedUrls": 43,
  "failedUrls": 2,
  "createdAt": "2024-01-15T10:30:00.000Z",
  "startedAt": "2024-01-15T10:30:01.000Z",
  "completedAt": "2024-01-15T10:32:15.000Z",
  "data": [
    {
      "url": "https://docs.example.com/intro",
      "depth": 1,
      "markdown": "# Introduction\n\nWelcome to...",
      "metadata": {
        "title": "Introduction",
        "description": "Getting started guide"
      },
      "links": ["https://docs.example.com/setup", "..."]
    }
  ]
}
```
Status Values:
| Status | Description |
|---|---|
| PENDING | Job is queued |
| PROCESSING | Crawl is in progress |
| COMPLETED | Crawl finished successfully |
| FAILED | Crawl failed with an error |
| CANCELLED | Crawl was cancelled |
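The poll-until-done flow can be sketched as a loop that stops once the job reaches one of the terminal statuses above. This sketch injects the status fetcher as a callable so it stays self-contained; in a real client, `fetch_status` would issue the GET /api/v1/crawl/:id request (HTTP details omitted here):

```python
import time

# Terminal statuses from the table above; PENDING/PROCESSING keep polling.
TERMINAL = {"COMPLETED", "FAILED", "CANCELLED"}

def poll_crawl(fetch_status, interval=0.0, max_polls=100):
    """Poll until the crawl job reaches a terminal status.

    fetch_status: callable returning the parsed JSON of GET /api/v1/crawl/:id
    """
    for _ in range(max_polls):
        job = fetch_status()
        if job["status"] in TERMINAL:
            return job
        time.sleep(interval)
    raise TimeoutError("crawl did not finish within max_polls")

# Simulate a job that completes on the third poll:
responses = iter([
    {"status": "PENDING"},
    {"status": "PROCESSING"},
    {"status": "COMPLETED", "completedUrls": 43},
])
final = poll_crawl(lambda: next(responses))
print(final["status"])  # COMPLETED
```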
Cancel Crawl Job
DELETE /api/v1/crawl/:id
Cancel a running crawl job.
Response
```json
{
  "id": "crawl_abc123",
  "status": "CANCELLED"
}
```
List Crawl History
GET /api/v1/crawl
Query Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| limit | number | 20 | Maximum results to return (1-100) |
| offset | number | 0 | Number of results to skip for pagination |
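Paging through a long crawl history with these parameters amounts to stepping the offset by the page size. A small sketch (the `history_pages` helper is illustrative, not part of the API):

```python
def history_pages(total: int, limit: int = 20):
    """Yield (limit, offset) query-parameter pairs covering `total` records."""
    for offset in range(0, total, limit):
        yield limit, offset

# Fetching 45 history entries 20 at a time takes three requests:
print(list(history_pages(45, 20)))  # [(20, 0), (20, 20), (20, 40)]
```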
Response
```json
[
  {
    "id": "crawl_abc123",
    "status": "COMPLETED",
    "startUrl": "https://docs.example.com",
    "totalUrls": 45,
    "completedUrls": 43,
    "failedUrls": 2,
    "createdAt": "2024-01-15T10:30:00.000Z"
  }
]
```
Webhook Payload
When a crawl job completes, the webhook receives:
```json
{
  "event": "crawl.completed",
  "data": {
    "id": "crawl_abc123",
    "status": "COMPLETED",
    "startUrl": "https://docs.example.com",
    "totalUrls": 45,
    "completedUrls": 43,
    "failedUrls": 2
  },
  "timestamp": "2024-01-15T10:32:15.000Z"
}
```
Path Filtering
Use glob patterns for includePaths and excludePaths:
| Pattern | Matches |
|---|---|
| /docs/* | /docs/intro, /docs/guide |
| /docs/** | /docs/intro, /docs/api/reference |
| *.pdf | Any URL ending in .pdf |
| /blog/2024-* | /blog/2024-01-post, /blog/2024-02-news |
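Per the examples in the table, `*` stops at path separators while `**` crosses them. That distinction can be sketched with a small pattern translator (an illustration of the assumed semantics, not the server's actual matcher):

```python
import re

def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a path glob: `**` matches across `/`, `*` does not."""
    out, i = [], 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            out.append(".*")       # ** crosses path separators
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")    # * stops at the next /
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out) + "$")

def matches(pattern: str, path: str) -> bool:
    return bool(glob_to_regex(pattern).match(path))

print(matches("/docs/*", "/docs/intro"))           # True
print(matches("/docs/*", "/docs/api/reference"))   # False: * stops at /
print(matches("/docs/**", "/docs/api/reference"))  # True: ** crosses /
```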
Best Practices
- Start small - Test with a low limit (10-20) before crawling entire sites
- Use path filters - Focus on relevant content with includePaths
- Set up webhooks - For crawls with limit > 50, use webhooks instead of polling
- Respect rate limits - Large crawls consume more quota and may take longer