Anyhunt
API Reference

Crawl API

Crawl multiple pages from a website with depth and path controls

The Crawl API enables multi-page website crawling with depth control, path filtering, and webhook notifications. Use it to scrape entire websites or specific sections of them.

Endpoints

| Method | Path | Description |
| --- | --- | --- |
| POST | /api/v1/crawl | Start a crawl job |
| GET | /api/v1/crawl/:id | Get crawl status and results |
| DELETE | /api/v1/crawl/:id | Cancel a crawl job |
| GET | /api/v1/crawl | List crawl history |

Start Crawl Job

POST /api/v1/crawl

Request Body

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| url | string | required | Starting URL for the crawl |
| maxDepth | number | 3 | Maximum link depth to follow (1-10) |
| limit | number | 100 | Maximum pages to crawl (1-1000) |
| includePaths | string[] | - | URL patterns to include (glob patterns) |
| excludePaths | string[] | - | URL patterns to exclude (glob patterns) |
| allowExternalLinks | boolean | false | Follow links to external domains |
| scrapeOptions | object | - | Options for scraping each page (see Scrape API) |
| webhookUrl | string | - | Webhook URL for completion notification |

Example Request

curl -X POST https://server.anyhunt.app/api/v1/crawl \
  -H "Authorization: Bearer ah_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com",
    "maxDepth": 3,
    "limit": 50,
    "includePaths": ["/docs/*", "/guides/*"],
    "excludePaths": ["/api/*"],
    "scrapeOptions": {
      "formats": ["markdown"],
      "onlyMainContent": true
    },
    "webhookUrl": "https://your-app.com/webhooks/crawl"
  }'

Response

The API returns a job ID. Poll GET /api/v1/crawl/:id to get status and results.

{
  "id": "crawl_abc123",
  "status": "PENDING"
}
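Because the endpoint is asynchronous, a client typically polls the status endpoint until the job settles. A minimal polling sketch in Python using only the standard library (the `poll_crawl` helper name and the five-second interval are illustrative, not part of the API):

```python
import json
import time
import urllib.request

# Statuses after which a crawl job will no longer change (see Status Values below).
TERMINAL_STATUSES = {"COMPLETED", "FAILED", "CANCELLED"}

def is_terminal(status: str) -> bool:
    return status in TERMINAL_STATUSES

def poll_crawl(base_url: str, api_key: str, job_id: str, interval: float = 5.0) -> dict:
    """Poll GET /api/v1/crawl/:id until the job reaches a terminal status."""
    while True:
        req = urllib.request.Request(
            f"{base_url}/api/v1/crawl/{job_id}",
            headers={"Authorization": f"Bearer {api_key}"},
        )
        with urllib.request.urlopen(req) as resp:
            job = json.load(resp)
        if is_terminal(job["status"]):
            return job
        time.sleep(interval)
```

For example, `poll_crawl("https://server.anyhunt.app", "ah_your_api_key", "crawl_abc123")` returns the final job object; on COMPLETED its data field holds the scraped pages.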

Get Crawl Status

GET /api/v1/crawl/:id

Retrieve the status and results of a crawl job.

Response (In Progress)

{
  "id": "crawl_abc123",
  "status": "PROCESSING",
  "startUrl": "https://docs.example.com",
  "totalUrls": 45,
  "completedUrls": 32,
  "failedUrls": 2,
  "createdAt": "2024-01-15T10:30:00.000Z",
  "startedAt": "2024-01-15T10:30:01.000Z"
}

Response (Completed)

{
  "id": "crawl_abc123",
  "status": "COMPLETED",
  "startUrl": "https://docs.example.com",
  "totalUrls": 45,
  "completedUrls": 43,
  "failedUrls": 2,
  "createdAt": "2024-01-15T10:30:00.000Z",
  "startedAt": "2024-01-15T10:30:01.000Z",
  "completedAt": "2024-01-15T10:32:15.000Z",
  "data": [
    {
      "url": "https://docs.example.com/intro",
      "depth": 1,
      "markdown": "# Introduction\n\nWelcome to...",
      "metadata": {
        "title": "Introduction",
        "description": "Getting started guide"
      },
      "links": ["https://docs.example.com/setup", "..."]
    }
  ]
}

Status Values:

| Status | Description |
| --- | --- |
| PENDING | Job is queued |
| PROCESSING | Crawl is in progress |
| COMPLETED | Crawl finished successfully |
| FAILED | Crawl failed with an error |
| CANCELLED | Crawl was cancelled |

Cancel Crawl Job

DELETE /api/v1/crawl/:id

Cancel a running crawl job.

Response

{
  "id": "crawl_abc123",
  "status": "CANCELLED"
}

List Crawl History

GET /api/v1/crawl

Query Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| limit | number | 20 | Max results (1-100) |
| offset | number | 0 | Skip results for pagination |

Response

[
  {
    "id": "crawl_abc123",
    "status": "COMPLETED",
    "startUrl": "https://docs.example.com",
    "totalUrls": 45,
    "completedUrls": 43,
    "failedUrls": 2,
    "createdAt": "2024-01-15T10:30:00.000Z"
  }
]
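Since limit caps out at 100, retrieving a long history means stepping offset page by page. A small pagination sketch (the injected fetch callable is an assumption to keep the sketch self-contained; it should wrap GET /api/v1/crawl?limit=...&offset=... and return the decoded JSON list):

```python
def paginate(fetch, limit=100):
    """Yield crawl records across pages until a short page signals the end.

    fetch(limit, offset) must return one page of results as a list.
    """
    offset = 0
    while True:
        batch = fetch(limit, offset)
        yield from batch
        if len(batch) < limit:
            return
        offset += limit
```

Stopping on the first short page avoids one final empty request in the common case.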

Webhook Payload

When a crawl job completes, the webhook receives:

{
  "event": "crawl.completed",
  "data": {
    "id": "crawl_abc123",
    "status": "COMPLETED",
    "startUrl": "https://docs.example.com",
    "totalUrls": 45,
    "completedUrls": 43,
    "failedUrls": 2
  },
  "timestamp": "2024-01-15T10:32:15.000Z"
}
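A receiver only needs to parse the JSON body and route on the event name. A minimal handler sketch (`handle_webhook` is illustrative; wire it into whatever HTTP framework serves your webhook URL, and note that only the crawl.completed event is documented here):

```python
import json

def handle_webhook(raw_body: bytes) -> dict:
    """Parse a crawl webhook payload and summarize the finished job."""
    payload = json.loads(raw_body)
    if payload.get("event") != "crawl.completed":
        raise ValueError(f"unexpected event: {payload.get('event')}")
    data = payload["data"]
    # The payload carries counts only; fetch the scraped pages separately
    # via GET /api/v1/crawl/:id if you need the data array.
    return {"id": data["id"], "status": data["status"], "failed": data["failedUrls"]}
```

Responding quickly with a 2xx and doing any heavy processing afterwards is the usual pattern for webhook endpoints.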

Path Filtering

Use glob patterns for includePaths and excludePaths:

| Pattern | Matches |
| --- | --- |
| `/docs/*` | /docs/intro, /docs/guide |
| `/docs/**` | /docs/intro, /docs/api/reference |
| `*.pdf` | Any PDF file |
| `/blog/2024-*` | /blog/2024-01-post, /blog/2024-02-news |
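The table implies that a single `*` stays within one path segment while `**` crosses segments. A sketch of that matching logic in Python (`path_allowed` and the exclusion-wins rule are assumptions about the server's matcher, shown here so you can sanity-check patterns locally before starting a crawl):

```python
import re

def _glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a path glob: `**` spans path segments, `*` stays within one."""
    parts, i = [], 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("^" + "".join(parts) + "$")

def path_allowed(path: str, include: list[str], exclude: list[str]) -> bool:
    """Mirror includePaths/excludePaths: exclusion wins; empty include allows all."""
    if any(_glob_to_regex(p).match(path) for p in exclude):
        return False
    if not include:
        return True
    return any(_glob_to_regex(p).match(path) for p in include)
```

Under these semantics `/docs/*` accepts /docs/intro but not /docs/api/reference, while `/docs/**` accepts both.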

Best Practices

  1. Start small - Test with a low limit (10-20) before crawling entire sites
  2. Use path filters - Focus on relevant content with includePaths
  3. Set up webhooks - For crawls with limit > 50, use webhooks instead of polling
  4. Respect rate limits - Large crawls consume more quota and may take longer