Anyhunt
API Reference

Crawl API

Crawl multiple pages from a website with depth and path controls

The Crawl API enables multi-page website crawling with depth control, path filtering, and webhook notifications. Use it to scrape entire websites or specific sections of them.

Endpoints

| Method | Path | Description |
| --- | --- | --- |
| POST | /api/v1/crawl | Start a crawl job |
| GET | /api/v1/crawl/:id | Get crawl status and results |
| DELETE | /api/v1/crawl/:id | Cancel a crawl job |
| GET | /api/v1/crawl | List crawl history |

Start Crawl Job

POST /api/v1/crawl

Request Body

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| url | string | required | Starting URL for the crawl |
| maxDepth | number | 3 | Maximum link depth to follow (1-10) |
| limit | number | 100 | Maximum pages to crawl (1-1000) |
| includePaths | string[] | - | URL patterns to include (glob patterns) |
| excludePaths | string[] | - | URL patterns to exclude (glob patterns) |
| allowExternalLinks | boolean | false | Follow links to external domains |
| scrapeOptions | object | - | Options for scraping each page (see Scrape API) |
| webhookUrl | string | - | Webhook URL for completion notification |

Example Request

curl -X POST https://server.anyhunt.app/api/v1/crawl \
  -H "Authorization: Bearer ah_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://docs.example.com",
    "maxDepth": 3,
    "limit": 50,
    "includePaths": ["/docs/*", "/guides/*"],
    "excludePaths": ["/api/*"],
    "scrapeOptions": {
      "formats": ["markdown"],
      "onlyMainContent": true
    },
    "webhookUrl": "https://your-app.com/webhooks/crawl"
  }'

Response

The API returns a job ID. Poll GET /api/v1/crawl/:id to get status and results.

{
  "id": "crawl_abc123",
  "status": "PENDING"
}
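Because the endpoint is asynchronous, a client typically polls the status endpoint until the job settles. A minimal polling sketch in Python using only the standard library (the `poll_crawl` helper name and the five-second interval are illustrative, not part of the API):

```python
import json
import time
import urllib.request

# Statuses after which a crawl job will no longer change (see Status Values below).
TERMINAL_STATUSES = {"COMPLETED", "FAILED", "CANCELLED"}

def is_terminal(status: str) -> bool:
    return status in TERMINAL_STATUSES

def poll_crawl(base_url: str, api_key: str, job_id: str, interval: float = 5.0) -> dict:
    """Poll GET /api/v1/crawl/:id until the job reaches a terminal status."""
    while True:
        req = urllib.request.Request(
            f"{base_url}/api/v1/crawl/{job_id}",
            headers={"Authorization": f"Bearer {api_key}"},
        )
        with urllib.request.urlopen(req) as resp:
            job = json.load(resp)
        if is_terminal(job["status"]):
            return job
        time.sleep(interval)
```

For example, `poll_crawl("https://server.anyhunt.app", "ah_your_api_key", "crawl_abc123")` returns the final job object; on COMPLETED its data field holds the scraped pages.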

Get Crawl Status

GET /api/v1/crawl/:id

Retrieve the status and results of a crawl job.

Response (In Progress)

{
  "id": "crawl_abc123",
  "status": "PROCESSING",
  "startUrl": "https://docs.example.com",
  "totalUrls": 45,
  "completedUrls": 32,
  "failedUrls": 2,
  "createdAt": "2024-01-15T10:30:00.000Z",
  "startedAt": "2024-01-15T10:30:01.000Z"
}

Response (Completed)

{
  "id": "crawl_abc123",
  "status": "COMPLETED",
  "startUrl": "https://docs.example.com",
  "totalUrls": 45,
  "completedUrls": 43,
  "failedUrls": 2,
  "createdAt": "2024-01-15T10:30:00.000Z",
  "startedAt": "2024-01-15T10:30:01.000Z",
  "completedAt": "2024-01-15T10:32:15.000Z",
  "data": [
    {
      "url": "https://docs.example.com/intro",
      "depth": 1,
      "markdown": "# Introduction\n\nWelcome to...",
      "metadata": {
        "title": "Introduction",
        "description": "Getting started guide"
      },
      "links": ["https://docs.example.com/setup", "..."]
    }
  ]
}

Status Values:

| Status | Description |
| --- | --- |
| PENDING | Job is queued |
| PROCESSING | Crawl is in progress |
| COMPLETED | Crawl finished successfully |
| FAILED | Crawl failed with an error |
| CANCELLED | Crawl was cancelled |

Cancel Crawl Job

DELETE /api/v1/crawl/:id

Cancel a running crawl job.

Response

{
  "id": "crawl_abc123",
  "status": "CANCELLED"
}

List Crawl History

GET /api/v1/crawl

Query Parameters

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| limit | number | 20 | Max results (1-100) |
| offset | number | 0 | Skip results for pagination |

Response

[
  {
    "id": "crawl_abc123",
    "status": "COMPLETED",
    "startUrl": "https://docs.example.com",
    "totalUrls": 45,
    "completedUrls": 43,
    "failedUrls": 2,
    "createdAt": "2024-01-15T10:30:00.000Z"
  }
]
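Since limit caps out at 100, retrieving a long history means stepping offset page by page. A small pagination sketch (the injected fetch callable is an assumption to keep the sketch self-contained; it should wrap GET /api/v1/crawl?limit=...&offset=... and return the decoded JSON list):

```python
def paginate(fetch, limit=100):
    """Yield crawl records across pages until a short page signals the end.

    fetch(limit, offset) must return one page of results as a list.
    """
    offset = 0
    while True:
        batch = fetch(limit, offset)
        yield from batch
        if len(batch) < limit:
            return
        offset += limit
```

Stopping on the first short page avoids one final empty request in the common case.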

Webhook Payload

When a crawl job completes, the webhook receives:

{
  "event": "crawl.completed",
  "data": {
    "id": "crawl_abc123",
    "status": "COMPLETED",
    "startUrl": "https://docs.example.com",
    "totalUrls": 45,
    "completedUrls": 43,
    "failedUrls": 2
  },
  "timestamp": "2024-01-15T10:32:15.000Z"
}
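A receiver only needs to parse the JSON body and route on the event name. A minimal handler sketch (`handle_webhook` is illustrative; wire it into whatever HTTP framework serves your webhook URL, and note that only the crawl.completed event is documented here):

```python
import json

def handle_webhook(raw_body: bytes) -> dict:
    """Parse a crawl webhook payload and summarize the finished job."""
    payload = json.loads(raw_body)
    if payload.get("event") != "crawl.completed":
        raise ValueError(f"unexpected event: {payload.get('event')}")
    data = payload["data"]
    # The payload carries counts only; fetch the scraped pages separately
    # via GET /api/v1/crawl/:id if you need the data array.
    return {"id": data["id"], "status": data["status"], "failed": data["failedUrls"]}
```

Responding quickly with a 2xx and doing any heavy processing afterwards is the usual pattern for webhook endpoints.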

Path Filtering

Use glob patterns for includePaths and excludePaths:

| Pattern | Matches |
| --- | --- |
| `/docs/*` | /docs/intro, /docs/guide |
| `/docs/**` | /docs/intro, /docs/api/reference |
| `*.pdf` | Any PDF file |
| `/blog/2024-*` | /blog/2024-01-post, /blog/2024-02-news |
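The table implies that a single `*` stays within one path segment while `**` crosses segments. A sketch of that matching logic in Python (`path_allowed` and the exclusion-wins rule are assumptions about the server's matcher, shown here so you can sanity-check patterns locally before starting a crawl):

```python
import re

def _glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a path glob: `**` spans path segments, `*` stays within one."""
    parts, i = [], 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("^" + "".join(parts) + "$")

def path_allowed(path: str, include: list[str], exclude: list[str]) -> bool:
    """Mirror includePaths/excludePaths: exclusion wins; empty include allows all."""
    if any(_glob_to_regex(p).match(path) for p in exclude):
        return False
    if not include:
        return True
    return any(_glob_to_regex(p).match(path) for p in include)
```

Under these semantics `/docs/*` accepts /docs/intro but not /docs/api/reference, while `/docs/**` accepts both.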

Best Practices

  1. Start small - Test with a low limit (10-20) before crawling entire sites
  2. Use path filters - Focus on relevant content with includePaths
  3. Set up webhooks - For crawls with limit > 50, use webhooks instead of polling
  4. Respect rate limits - Large crawls consume more quota and may take longer