Scrape API
Extract content from any webpage in multiple formats
The Scrape API is the primary endpoint for extracting content from web pages. It supports multiple output formats including Markdown, HTML, links, and screenshots.
Endpoints
| Method | Path | Description |
|---|---|---|
| POST | /api/v1/scrape | Create a scrape job |
| GET | /api/v1/scrape/:id | Get job status and result |
| GET | /api/v1/scrape | List scrape history |
Create Scrape Job
POST /api/v1/scrape
Request Body
Required Parameters
| Parameter | Type | Description |
|---|---|---|
| url | string | The URL to scrape (must be a valid HTTP/HTTPS URL) |
Output Format Options
| Parameter | Type | Default | Description |
|---|---|---|---|
| formats | string[] | ["markdown"] | Output formats: markdown, html, rawHtml, links, screenshot |
| onlyMainContent | boolean | true | Extract only main content (uses Readability algorithm) |
| includeTags | string[] | - | CSS selectors to include |
| excludeTags | string[] | - | CSS selectors to exclude (also hides matching elements in screenshots) |
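As an illustration, a request body that takes manual control of content selection might combine includeTags and excludeTags like this (the selectors here are hypothetical placeholders, not values the API prescribes):

```python
# Hypothetical request body: scope extraction to the article body
# while dropping navigation, ads, and the footer.
payload = {
    "url": "https://example.com/article",
    "formats": ["markdown"],
    "onlyMainContent": False,  # take manual control instead of Readability
    "includeTags": ["article", ".post-body"],
    "excludeTags": ["nav", ".ad-banner", "footer"],
}
```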
Page Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
| viewport | object | - | Custom viewport {width, height} |
| viewport.width | number | 1280 | Viewport width (100-3840) |
| viewport.height | number | 800 | Viewport height (100-2160) |
| mobile | boolean | false | Use mobile viewport and user agent |
| device | string | - | Device preset: desktop, tablet, mobile |
| darkMode | boolean | false | Enable dark mode |
| headers | object | - | Custom HTTP headers |
Timing Options
| Parameter | Type | Default | Description |
|---|---|---|---|
| waitFor | number \| string | - | Wait time in ms, or a CSS selector to wait for |
| timeout | number | 30000 | Page timeout in milliseconds |
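Because waitFor accepts either type, both of these request bodies are valid; a minimal sketch:

```python
# waitFor accepts either a millisecond count or a CSS selector string.
wait_fixed = {"url": "https://example.com", "waitFor": 2000}      # wait 2 seconds
wait_element = {"url": "https://example.com", "waitFor": "#app"}  # wait until #app appears
```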
Screenshot Options
These options apply only when formats includes screenshot:
| Parameter | Type | Default | Description |
|---|---|---|---|
| screenshotOptions.fullPage | boolean | false | Capture full page height |
| screenshotOptions.format | string | "png" | Image format: png, jpeg, webp |
| screenshotOptions.quality | number | 80 | Image quality (1-100) |
| screenshotOptions.clip | string | - | CSS selector for an element screenshot |
| screenshotOptions.response | string | "url" | Response type: url or base64 |
Page Actions
Execute interactions before scraping:
| Parameter | Type | Description |
|---|---|---|
| actions | Action[] | Array of actions to execute |
Action Types:
| Type | Parameters | Description |
|---|---|---|
| wait | milliseconds | Wait for the specified time |
| click | selector | Click an element |
| type | selector, text | Type text into an input |
| press | key | Press a keyboard key |
| scroll | direction (up/down), amount | Scroll the page |
| screenshot | - | Take an intermediate screenshot |
Example Request
curl -X POST https://server.anyhunt.app/api/v1/scrape \
  -H "Authorization: Bearer ah_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/article",
    "formats": ["markdown", "screenshot"],
    "onlyMainContent": true,
    "viewport": {
      "width": 1920,
      "height": 1080
    },
    "screenshotOptions": {
      "fullPage": true,
      "format": "webp",
      "quality": 85
    }
  }'
Response
The API returns a job ID. Poll GET /api/v1/scrape/:id to get results.
{
  "id": "scrape_abc123",
  "status": "PENDING"
}
Cache Hit Response:
If the same URL was recently scraped, the API returns cached results immediately:
{
  "id": "scrape_abc123",
  "url": "https://example.com/article",
  "fromCache": true,
  "markdown": "# Article Title\n\nArticle content...",
  "screenshot": {
    "url": "https://cdn.anyhunt.app/scraper/scrape_abc123.webp",
    "width": 1920,
    "height": 3500,
    "format": "webp",
    "fileSize": 245000,
    "expiresAt": "2024-02-15T10:30:00.000Z"
  },
  "metadata": {
    "title": "Article Title",
    "description": "Article description"
  }
}
Get Scrape Job
GET /api/v1/scrape/:id
Retrieve the status and result of a specific scrape job.
Response
{
  "id": "scrape_abc123",
  "url": "https://example.com/article",
  "status": "COMPLETED",
  "fromCache": false,
  "markdown": "# Article Title\n\nContent...",
  "metadata": {
    "title": "Article Title",
    "description": "Article description"
  },
  "screenshot": {
    "url": "https://cdn.anyhunt.app/scraper/scrape_abc123.webp",
    "width": 1920,
    "height": 3500,
    "format": "webp",
    "fileSize": 245000
  },
  "timings": {
    "queueWaitMs": 50,
    "fetchMs": 1200,
    "renderMs": 500,
    "transformMs": 100,
    "screenshotMs": 800,
    "totalMs": 2650
  }
}
Status Values:
| Status | Description |
|---|---|
| PENDING | Job is queued |
| PROCESSING | Job is being processed |
| COMPLETED | Job completed successfully |
| FAILED | Job failed with an error |
List Scrape History
GET /api/v1/scrape
List your recent scrape jobs.
Query Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| limit | number | 20 | Maximum results (1-100) |
| offset | number | 0 | Number of results to skip, for pagination |
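One way to walk the full history is to advance offset by limit until every entry has been fetched; a minimal sketch (the page-math helper below is illustrative, not part of the API):

```python
def history_pages(total, limit=20):
    """Yield limit/offset query parameters covering `total` history entries."""
    offset = 0
    while offset < total:
        yield {"limit": limit, "offset": offset}
        offset += limit

# e.g. 45 entries at the default page size require three requests
pages = list(history_pages(45))
```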
Response
[
  {
    "id": "scrape_abc123",
    "url": "https://example.com",
    "status": "COMPLETED",
    "fromCache": false,
    "createdAt": "2024-01-15T10:30:00.000Z",
    "completedAt": "2024-01-15T10:30:02.650Z"
  }
]
Code Examples
Node.js
// Start scrape job
const response = await fetch('https://server.anyhunt.app/api/v1/scrape', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer ah_your_api_key',
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    url: 'https://example.com',
    formats: ['markdown', 'links'],
    onlyMainContent: true,
  }),
});
const data = await response.json();

// If cache hit, data is already available
if (data.fromCache) {
  console.log(data.markdown);
} else {
  // Poll for results
  const result = await pollForResult(data.id);
  console.log(result.markdown);
}

async function pollForResult(id) {
  while (true) {
    const res = await fetch(`https://server.anyhunt.app/api/v1/scrape/${id}`, {
      headers: { 'Authorization': 'Bearer ah_your_api_key' },
    });
    const data = await res.json();
    if (data.status === 'COMPLETED') return data;
    if (data.status === 'FAILED') throw new Error(data.error?.message);
    await new Promise(r => setTimeout(r, 1000)); // Wait 1s
  }
}
Python
import requests
import time

# Start scrape job
response = requests.post(
    'https://server.anyhunt.app/api/v1/scrape',
    headers={
        'Authorization': 'Bearer ah_your_api_key',
        'Content-Type': 'application/json',
    },
    json={
        'url': 'https://example.com',
        'formats': ['markdown', 'links'],
        'onlyMainContent': True,
    },
)
data = response.json()

# If cache hit, data is already available
if data.get('fromCache'):
    print(data['markdown'])
else:
    # Poll for results
    while True:
        res = requests.get(
            f"https://server.anyhunt.app/api/v1/scrape/{data['id']}",
            headers={'Authorization': 'Bearer ah_your_api_key'},
        )
        result = res.json()
        if result['status'] == 'COMPLETED':
            print(result['markdown'])
            break
        elif result['status'] == 'FAILED':
            raise Exception(result.get('error', {}).get('message'))
        time.sleep(1)
With Page Actions
curl -X POST https://server.anyhunt.app/api/v1/scrape \
  -H "Authorization: Bearer ah_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown"],
    "actions": [
      {"type": "wait", "milliseconds": 1000},
      {"type": "click", "selector": "#load-more"},
      {"type": "scroll", "direction": "down"},
      {"type": "wait", "milliseconds": 500}
    ]
  }'
Error Codes
| Code | Status | Description |
|---|---|---|
| INVALID_URL | 400 | URL format is invalid or blocked |
| URL_NOT_ALLOWED | 400 | URL blocked by SSRF protection |
| PAGE_TIMEOUT | 504 | Page took too long to load |
| SELECTOR_NOT_FOUND | 400 | CSS selector not found on the page |
| BROWSER_ERROR | 500 | Browser crashed or encountered an error |
| NETWORK_ERROR | 500 | Network request failed |
| RATE_LIMITED | 429 | Too many requests |
| QUOTA_EXCEEDED | 429 | Monthly quota exhausted |
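For the two 429 codes, retrying with exponential backoff is a reasonable client-side strategy. The API's retry semantics (e.g. whether a Retry-After header is sent) are not documented here, so the helper below is a sketch under that assumption:

```python
def backoff_delays(attempts, base=1.0, cap=30.0):
    """Exponential backoff delays (in seconds) for retrying 429 responses,
    doubling from `base` and clamped at `cap`."""
    return [min(cap, base * (2 ** i)) for i in range(attempts)]
```

Sleep for each delay in turn between retries, and give up once the list is exhausted.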
Caching
Responses are cached for 1 hour by default. Cache hits are indicated by fromCache: true in the response and don't count against your quota.
The cache key is computed from SHA256(url + options), so identical requests will return cached results.
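The server's exact option serialization is an implementation detail; the sketch below (which assumes JSON canonicalization with sorted keys, an assumption on our part) only illustrates why identical url + options pairs map to the same cache entry:

```python
import hashlib
import json

def cache_key(url: str, options: dict) -> str:
    # Canonicalize options (sorted keys) so equivalent requests serialize
    # identically, then take SHA-256 over url + options.
    canonical = url + json.dumps(options, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two requests with the same URL but different formats, for example, produce different keys and are cached separately.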