# Batch Scrape API

Scrape multiple URLs in parallel with a single API call.

The Batch Scrape API lets you scrape multiple URLs in one request. The URLs are processed in parallel with a shared set of scrape options, and webhook notifications are supported.
## Endpoints

| Method | Path | Description |
|---|---|---|
| POST | /api/v1/batch/scrape | Create a batch scrape job |
| GET | /api/v1/batch/scrape/:id | Get batch job status and results |
| GET | /api/v1/batch/scrape | Get batch job history |
## Create a Batch Scrape

`POST /api/v1/batch/scrape`

### Request Parameters

| Parameter | Type | Description |
|---|---|---|
| urls | string[] | Array of URLs to scrape (1-100) |
| scrapeOptions | object | Scrape options shared by all URLs (see the Scrape API) |
| webhookUrl | string | Webhook notification URL called on completion |
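Because `urls` accepts between 1 and 100 entries, it can help to validate the request body client-side before sending it. A minimal sketch; the `build_batch_request` helper is illustrative, not part of any official SDK:

```python
def build_batch_request(urls, scrape_options=None, webhook_url=None):
    """Build a /api/v1/batch/scrape request body, enforcing the 1-100 URL limit."""
    if not 1 <= len(urls) <= 100:
        raise ValueError("urls must contain between 1 and 100 entries")
    body = {"urls": list(urls)}
    if scrape_options is not None:
        body["scrapeOptions"] = scrape_options
    if webhook_url is not None:
        body["webhookUrl"] = webhook_url
    return body

body = build_batch_request(
    ["https://example.com/page-1"],
    scrape_options={"formats": ["markdown"]},
)
```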
### Request Example

```bash
curl -X POST https://server.anyhunt.app/api/v1/batch/scrape \
  -H "Authorization: Bearer ah_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": [
      "https://example.com/page-1",
      "https://example.com/page-2",
      "https://example.com/page-3"
    ],
    "scrapeOptions": {
      "formats": ["markdown", "links"],
      "onlyMainContent": true
    },
    "webhookUrl": "https://your-app.com/webhooks/batch"
  }'
```

### Response

```json
{
  "id": "batch_abc123",
  "status": "PENDING",
  "totalUrls": 3,
  "completedUrls": 0,
  "failedUrls": 0,
  "createdAt": "2024-01-15T10:30:00.000Z"
}
```

## Get Batch Job Status
`GET /api/v1/batch/scrape/:id`

### Response (In Progress)

```json
{
  "id": "batch_abc123",
  "status": "PROCESSING",
  "totalUrls": 3,
  "completedUrls": 2,
  "failedUrls": 0,
  "createdAt": "2024-01-15T10:30:00.000Z"
}
```

### Response (Completed)
```json
{
  "id": "batch_abc123",
  "status": "COMPLETED",
  "totalUrls": 3,
  "completedUrls": 3,
  "failedUrls": 0,
  "createdAt": "2024-01-15T10:30:00.000Z",
  "completedAt": "2024-01-15T10:30:15.000Z",
  "data": [
    {
      "url": "https://example.com/page-1",
      "status": "COMPLETED",
      "result": {
        "markdown": "# Page 1\n\nContent...",
        "links": ["https://example.com/other"]
      }
    },
    {
      "url": "https://example.com/page-2",
      "status": "COMPLETED",
      "result": {
        "markdown": "# Page 2\n\nContent...",
        "links": []
      }
    },
    {
      "url": "https://example.com/page-3",
      "status": "COMPLETED",
      "result": {
        "markdown": "# Page 3\n\nContent...",
        "links": []
      }
    }
  ]
}
```

Job status values:
| Status | Description |
|---|---|
| PENDING | Job is queued |
| PROCESSING | Batch is being processed |
| COMPLETED | All URLs have been processed |
| FAILED | The batch job failed |
Per-item status values:

| Status | Description |
|---|---|
| PENDING | URL not yet processed |
| COMPLETED | URL scraped successfully |
| FAILED | URL scrape failed |
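Per-item statuses let a client separate usable results from retry candidates after a job completes. A minimal sketch of walking the `data` array from the completed response above; the `split_results` helper is illustrative, not part of the API:

```python
def split_results(batch):
    """Separate successful results from failed URLs in a completed batch job."""
    succeeded = [item for item in batch["data"] if item["status"] == "COMPLETED"]
    failed = [item["url"] for item in batch["data"] if item["status"] == "FAILED"]
    return succeeded, failed

# Shape follows the documented completed-job response
batch = {
    "status": "COMPLETED",
    "data": [
        {"url": "https://example.com/page-1", "status": "COMPLETED",
         "result": {"markdown": "# Page 1", "links": []}},
        {"url": "https://example.com/page-2", "status": "FAILED"},
    ],
}
succeeded, failed = split_results(batch)
```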
## Get Batch Job History

`GET /api/v1/batch/scrape`

### Query Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| limit | number | 20 | Maximum number of results (1-100) |
| offset | number | 0 | Number of results to skip, for pagination |
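With `limit` and `offset`, the full history can be collected by paging until a short page comes back. A sketch of the loop; `fetch_page` stands in for the HTTP call to `GET /api/v1/batch/scrape` and is hypothetical:

```python
def fetch_all_jobs(fetch_page, limit=100):
    """Page through batch job history using limit/offset until a short page."""
    jobs, offset = [], 0
    while True:
        page = fetch_page(limit=limit, offset=offset)
        jobs.extend(page)
        if len(page) < limit:  # last page reached
            return jobs
        offset += limit

# Stand-in for the real endpoint: 250 fake history entries
history = [{"id": f"batch_{i}"} for i in range(250)]
def fake_fetch(limit, offset):
    return history[offset:offset + limit]

all_jobs = fetch_all_jobs(fake_fetch)
```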
### Response
```json
[
  {
    "id": "batch_abc123",
    "status": "COMPLETED",
    "totalUrls": 3,
    "completedUrls": 3,
    "failedUrls": 0,
    "createdAt": "2024-01-15T10:30:00.000Z"
  }
]
```

## Webhook Payload
When a batch job completes:
```json
{
  "event": "batch.completed",
  "data": {
    "id": "batch_abc123",
    "status": "COMPLETED",
    "totalUrls": 3,
    "completedUrls": 3,
    "failedUrls": 0
  },
  "timestamp": "2024-01-15T10:30:15.000Z"
}
```

## Code Examples
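A receiver for the `batch.completed` payload above can be sketched as follows; `handle_webhook` is illustrative, and the HTTP framework wiring (route registration, signature verification) is omitted:

```python
import json

def handle_webhook(raw_body):
    """Handle a batch.completed notification; payload shape as documented above."""
    payload = json.loads(raw_body)
    if payload.get("event") != "batch.completed":
        return None  # ignore other event types
    data = payload["data"]
    if data["failedUrls"] > 0:
        # e.g. re-queue the failed URLs or alert; here we just flag the job
        return ("retry", data["id"])
    return ("done", data["id"])

body = json.dumps({
    "event": "batch.completed",
    "data": {"id": "batch_abc123", "status": "COMPLETED",
             "totalUrls": 3, "completedUrls": 3, "failedUrls": 0},
    "timestamp": "2024-01-15T10:30:15.000Z",
})
result = handle_webhook(body)
```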
### Node.js
```javascript
// Create a batch scrape
const response = await fetch('https://server.anyhunt.app/api/v1/batch/scrape', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer ah_your_api_key',
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    urls: [
      'https://example.com/page-1',
      'https://example.com/page-2',
      'https://example.com/page-3',
    ],
    scrapeOptions: {
      formats: ['markdown'],
    },
  }),
});
const data = await response.json();
console.log('Batch job ID:', data.id);

// Poll for status
const checkStatus = async (id) => {
  const res = await fetch(`https://server.anyhunt.app/api/v1/batch/scrape/${id}`, {
    headers: { 'Authorization': 'Bearer ah_your_api_key' },
  });
  return res.json();
};

// Wait for completion (the job may still be PENDING right after creation)
let status = await checkStatus(data.id);
while (status.status === 'PENDING' || status.status === 'PROCESSING') {
  await new Promise(r => setTimeout(r, 2000));
  status = await checkStatus(data.id);
}
console.log('Results:', status.data);
```

### Python
```python
import requests
import time

# Create a batch scrape
response = requests.post(
    'https://server.anyhunt.app/api/v1/batch/scrape',
    headers={
        'Authorization': 'Bearer ah_your_api_key',
        'Content-Type': 'application/json',
    },
    json={
        'urls': [
            'https://example.com/page-1',
            'https://example.com/page-2',
            'https://example.com/page-3',
        ],
        'scrapeOptions': {
            'formats': ['markdown'],
        },
    },
)
batch_id = response.json()['id']
print(f'Batch job ID: {batch_id}')

# Poll for status (the job may still be PENDING right after creation)
while True:
    status = requests.get(
        f'https://server.anyhunt.app/api/v1/batch/scrape/{batch_id}',
        headers={'Authorization': 'Bearer ah_your_api_key'},
    ).json()
    if status['status'] not in ('PENDING', 'PROCESSING'):
        break
    time.sleep(2)

print('Results:', status['data'])
```

## Best Practices
- Use webhooks - For batches with more than 10 URLs, use webhooks instead of polling
- Group similar pages - URLs with a similar structure are processed more efficiently
- Monitor failures - Check the `failedUrls` count and the per-item statuses
- Limit batch size - Test with small batches (10-20) first to estimate processing time
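Since a single request accepts at most 100 URLs, larger URL lists can be split client-side and submitted as several jobs. A minimal chunking sketch; the `chunk_urls` helper is illustrative:

```python
def chunk_urls(urls, size=100):
    """Split a URL list into batches no larger than the API's 100-URL cap."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]

urls = [f"https://example.com/page-{i}" for i in range(1, 251)]
batches = chunk_urls(urls)
# Each entry in `batches` can then be POSTed to /api/v1/batch/scrape in turn.
```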