Scrape API
Extract content from any webpage in multiple formats
The Scrape API is the primary endpoint for extracting content from web pages. It supports multiple output formats including Markdown, HTML, links, and screenshots.
Endpoints
| Method | Path | Description |
|---|---|---|
| POST | /api/v1/scrape | Create a scrape job |
| GET | /api/v1/scrape/:id | Get job status and result |
| GET | /api/v1/scrape | List scrape history |
Create Scrape Job
POST /api/v1/scrape
Request Body
Required Parameters
| Parameter | Type | Description |
|---|---|---|
| url | string | The URL to scrape (must be a valid HTTP/HTTPS URL) |
Output Format Options
| Parameter | Type | Default | Description |
|---|---|---|---|
| formats | string[] | ["markdown"] | Output formats: markdown, html, rawHtml, links, screenshot |
| onlyMainContent | boolean | true | Extract only main content (uses Readability algorithm) |
| includeTags | string[] | - | CSS selectors to include |
| excludeTags | string[] | - | CSS selectors to exclude (also hides matching elements in screenshots) |
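As an illustration, a request body that takes manual control of content selection might combine includeTags and excludeTags like this (the selectors here are hypothetical placeholders, not values the API prescribes):

```python
# Hypothetical request body: scope extraction to the article body
# while dropping navigation, ads, and the footer.
payload = {
    "url": "https://example.com/article",
    "formats": ["markdown"],
    "onlyMainContent": False,  # take manual control instead of Readability
    "includeTags": ["article", ".post-body"],
    "excludeTags": ["nav", ".ad-banner", "footer"],
}
```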
Page Configuration
| Parameter | Type | Default | Description |
|---|---|---|---|
| viewport | object | - | Custom viewport {width, height} |
| viewport.width | number | 1280 | Viewport width (100-3840) |
| viewport.height | number | 800 | Viewport height (100-2160) |
| mobile | boolean | false | Use mobile viewport and user agent |
| device | string | - | Device preset: desktop, tablet, mobile |
| darkMode | boolean | false | Enable dark mode |
| headers | object | - | Custom HTTP headers |
Timing Options
| Parameter | Type | Default | Description |
|---|---|---|---|
| waitFor | number \| string | - | Wait time in ms, or a CSS selector to wait for |
| timeout | number | 30000 | Page timeout in milliseconds |
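Because waitFor accepts either type, both of these request bodies are valid; a minimal sketch:

```python
# waitFor accepts either a millisecond count or a CSS selector string.
wait_fixed = {"url": "https://example.com", "waitFor": 2000}      # wait 2 seconds
wait_element = {"url": "https://example.com", "waitFor": "#app"}  # wait until #app appears
```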
Screenshot Options
These options apply only when formats includes screenshot:
| Parameter | Type | Default | Description |
|---|---|---|---|
| screenshotOptions.fullPage | boolean | false | Capture full page height |
| screenshotOptions.format | string | "png" | Image format: png, jpeg, webp |
| screenshotOptions.quality | number | 80 | Image quality (1-100) |
| screenshotOptions.clip | string | - | CSS selector for an element screenshot |
| screenshotOptions.response | string | "url" | Response type: url or base64 |
Page Actions
Execute interactions before scraping:
| Parameter | Type | Description |
|---|---|---|
| actions | Action[] | Array of actions to execute |
Action Types:
| Type | Parameters | Description |
|---|---|---|
| wait | milliseconds | Wait for the specified time |
| click | selector | Click an element |
| type | selector, text | Type text into an input |
| press | key | Press a keyboard key |
| scroll | direction (up/down), amount | Scroll the page |
| screenshot | - | Take an intermediate screenshot |
Example Request
curl -X POST https://server.anyhunt.app/api/v1/scrape \
  -H "Authorization: Bearer ah_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/article",
    "formats": ["markdown", "screenshot"],
    "onlyMainContent": true,
    "viewport": {
      "width": 1920,
      "height": 1080
    },
    "screenshotOptions": {
      "fullPage": true,
      "format": "webp",
      "quality": 85
    }
  }'
Response
The API returns a job ID. Poll GET /api/v1/scrape/:id to get results.
{
  "id": "scrape_abc123",
  "status": "PENDING"
}
Cache Hit Response:
If the same URL was recently scraped, the API returns cached results immediately:
{
  "id": "scrape_abc123",
  "url": "https://example.com/article",
  "fromCache": true,
  "markdown": "# Article Title\n\nArticle content...",
  "screenshot": {
    "url": "https://cdn.anyhunt.app/scraper/scrape_abc123.webp",
    "width": 1920,
    "height": 3500,
    "format": "webp",
    "fileSize": 245000,
    "expiresAt": "2024-02-15T10:30:00.000Z"
  },
  "metadata": {
    "title": "Article Title",
    "description": "Article description"
  }
}
Get Scrape Job
GET /api/v1/scrape/:id
Retrieve the status and result of a specific scrape job.
Response
{
  "id": "scrape_abc123",
  "url": "https://example.com/article",
  "status": "COMPLETED",
  "fromCache": false,
  "markdown": "# Article Title\n\nContent...",
  "metadata": {
    "title": "Article Title",
    "description": "Article description"
  },
  "screenshot": {
    "url": "https://cdn.anyhunt.app/scraper/scrape_abc123.webp",
    "width": 1920,
    "height": 3500,
    "format": "webp",
    "fileSize": 245000
  },
  "timings": {
    "queueWaitMs": 50,
    "fetchMs": 1200,
    "renderMs": 500,
    "transformMs": 100,
    "screenshotMs": 800,
    "totalMs": 2650
  }
}
Status Values:
| Status | Description |
|---|---|
| PENDING | Job is queued |
| PROCESSING | Job is being processed |
| COMPLETED | Job completed successfully |
| FAILED | Job failed with an error |
List Scrape History
GET /api/v1/scrape
List your recent scrape jobs.
Query Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| limit | number | 20 | Maximum results (1-100) |
| offset | number | 0 | Number of results to skip, for pagination |
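One way to walk the full history is to advance offset by limit until every entry has been fetched; a minimal sketch (the page-math helper below is illustrative, not part of the API):

```python
def history_pages(total, limit=20):
    """Yield limit/offset query parameters covering `total` history entries."""
    offset = 0
    while offset < total:
        yield {"limit": limit, "offset": offset}
        offset += limit

# e.g. 45 entries at the default page size require three requests
pages = list(history_pages(45))
```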
Response
[
  {
    "id": "scrape_abc123",
    "url": "https://example.com",
    "status": "COMPLETED",
    "fromCache": false,
    "createdAt": "2024-01-15T10:30:00.000Z",
    "completedAt": "2024-01-15T10:30:02.650Z"
  }
]
Code Examples
Node.js
// Start scrape job
const response = await fetch('https://server.anyhunt.app/api/v1/scrape', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer ah_your_api_key',
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    url: 'https://example.com',
    formats: ['markdown', 'links'],
    onlyMainContent: true,
  }),
});
const data = await response.json();

// If cache hit, data is already available
if (data.fromCache) {
  console.log(data.markdown);
} else {
  // Poll for results
  const result = await pollForResult(data.id);
  console.log(result.markdown);
}

async function pollForResult(id) {
  while (true) {
    const res = await fetch(`https://server.anyhunt.app/api/v1/scrape/${id}`, {
      headers: { 'Authorization': 'Bearer ah_your_api_key' },
    });
    const data = await res.json();
    if (data.status === 'COMPLETED') return data;
    if (data.status === 'FAILED') throw new Error(data.error?.message);
    await new Promise(r => setTimeout(r, 1000)); // Wait 1s
  }
}
Python
import requests
import time

# Start scrape job
response = requests.post(
    'https://server.anyhunt.app/api/v1/scrape',
    headers={
        'Authorization': 'Bearer ah_your_api_key',
        'Content-Type': 'application/json',
    },
    json={
        'url': 'https://example.com',
        'formats': ['markdown', 'links'],
        'onlyMainContent': True,
    },
)
data = response.json()

# If cache hit, data is already available
if data.get('fromCache'):
    print(data['markdown'])
else:
    # Poll for results
    while True:
        res = requests.get(
            f"https://server.anyhunt.app/api/v1/scrape/{data['id']}",
            headers={'Authorization': 'Bearer ah_your_api_key'},
        )
        result = res.json()
        if result['status'] == 'COMPLETED':
            print(result['markdown'])
            break
        elif result['status'] == 'FAILED':
            raise Exception(result.get('error', {}).get('message'))
        time.sleep(1)
With Page Actions
curl -X POST https://server.anyhunt.app/api/v1/scrape \
  -H "Authorization: Bearer ah_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown"],
    "actions": [
      {"type": "wait", "milliseconds": 1000},
      {"type": "click", "selector": "#load-more"},
      {"type": "scroll", "direction": "down"},
      {"type": "wait", "milliseconds": 500}
    ]
  }'
Error Codes
| Code | Status | Description |
|---|---|---|
| INVALID_URL | 400 | URL format is invalid or blocked |
| URL_NOT_ALLOWED | 400 | URL blocked by SSRF protection |
| PAGE_TIMEOUT | 504 | Page took too long to load |
| SELECTOR_NOT_FOUND | 400 | CSS selector not found on the page |
| BROWSER_ERROR | 500 | Browser crashed or encountered an error |
| NETWORK_ERROR | 500 | Network request failed |
| RATE_LIMITED | 429 | Too many requests |
| QUOTA_EXCEEDED | 429 | Monthly quota exhausted |
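For the two 429 codes, retrying with exponential backoff is a reasonable client-side strategy. The API's retry semantics (e.g. whether a Retry-After header is sent) are not documented here, so the helper below is a sketch under that assumption:

```python
def backoff_delays(attempts, base=1.0, cap=30.0):
    """Exponential backoff delays (in seconds) for retrying 429 responses,
    doubling from `base` and clamped at `cap`."""
    return [min(cap, base * (2 ** i)) for i in range(attempts)]
```

Sleep for each delay in turn between retries, and give up once the list is exhausted.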
Caching
Responses are cached for 1 hour by default. Cache hits are indicated by fromCache: true in the response and don't count against your quota.
The cache key is computed from SHA256(url + options), so identical requests will return cached results.
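The server's exact option serialization is an implementation detail; the sketch below (which assumes JSON canonicalization with sorted keys, an assumption on our part) only illustrates why identical url + options pairs map to the same cache entry:

```python
import hashlib
import json

def cache_key(url: str, options: dict) -> str:
    # Canonicalize options (sorted keys) so equivalent requests serialize
    # identically, then take SHA-256 over url + options.
    canonical = url + json.dumps(options, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Two requests with the same URL but different formats, for example, produce different keys and are cached separately.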