Anyhunt

Batch Scrape API

Scrape multiple URLs in parallel with a single API call

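The Batch Scrape API scrapes multiple URLs in a single request. It processes the URLs in parallel with one shared set of scrape options and supports webhook notifications.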

Endpoints

Method  Path                      Description
POST    /api/v1/batch/scrape      Create a batch scrape job
GET     /api/v1/batch/scrape/:id  Get the status and results of a batch job
GET     /api/v1/batch/scrape      Get batch job history

Create a batch scrape

POST /api/v1/batch/scrape

Request parameters

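Parameter      Type      Description
urls           string[]  Array of URLs to scrape (1-100 entries)
scrapeOptions  object    Scrape options shared by all URLs (see the Scrape API)
webhookUrl     string    URL that receives a webhook notification on completion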
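
The limits in the parameter table can be checked on the client before the request is sent. The sketch below is one way to do that. `build_batch_request` is a hypothetical helper, not part of any Anyhunt SDK, and it assumes that `scrapeOptions` and `webhookUrl` may be left out of the body; the source does not say which fields are required.

```python
from typing import Any, Optional

MAX_URLS_PER_BATCH = 100  # documented limit: 1-100 URLs per request


def build_batch_request(
    urls: list[str],
    scrape_options: Optional[dict[str, Any]] = None,
    webhook_url: Optional[str] = None,
) -> dict[str, Any]:
    """Build the JSON body for POST /api/v1/batch/scrape (hypothetical helper)."""
    if not 1 <= len(urls) <= MAX_URLS_PER_BATCH:
        raise ValueError(
            f"urls must contain 1-{MAX_URLS_PER_BATCH} entries, got {len(urls)}"
        )
    body: dict[str, Any] = {"urls": list(urls)}
    # Assumption: optional fields are simply omitted when not provided.
    if scrape_options is not None:
        body["scrapeOptions"] = scrape_options
    if webhook_url is not None:
        body["webhookUrl"] = webhook_url
    return body
```

The returned dictionary can be passed as the `json=` argument of `requests.post`.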

Request example

curl -X POST https://server.anyhunt.app/api/v1/batch/scrape \
  -H "Authorization: Bearer ah_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": [
      "https://example.com/page-1",
      "https://example.com/page-2",
      "https://example.com/page-3"
    ],
    "scrapeOptions": {
      "formats": ["markdown", "links"],
      "onlyMainContent": true
    },
    "webhookUrl": "https://your-app.com/webhooks/batch"
  }'

Response

{
  "id": "batch_abc123",
  "status": "PENDING",
  "totalUrls": 3,
  "completedUrls": 0,
  "failedUrls": 0,
  "createdAt": "2024-01-15T10:30:00.000Z"
}

Get batch job status

GET /api/v1/batch/scrape/:id

Response (in progress)

{
  "id": "batch_abc123",
  "status": "PROCESSING",
  "totalUrls": 3,
  "completedUrls": 2,
  "failedUrls": 0,
  "createdAt": "2024-01-15T10:30:00.000Z"
}

Response (completed)

{
  "id": "batch_abc123",
  "status": "COMPLETED",
  "totalUrls": 3,
  "completedUrls": 3,
  "failedUrls": 0,
  "createdAt": "2024-01-15T10:30:00.000Z",
  "completedAt": "2024-01-15T10:30:15.000Z",
  "data": [
    {
      "url": "https://example.com/page-1",
      "status": "COMPLETED",
      "result": {
        "markdown": "# Page 1\n\nContent...",
        "links": ["https://example.com/other"]
      }
    },
    {
      "url": "https://example.com/page-2",
      "status": "COMPLETED",
      "result": {
        "markdown": "# Page 2\n\nContent...",
        "links": []
      }
    },
    {
      "url": "https://example.com/page-3",
      "status": "COMPLETED",
      "result": {
        "markdown": "# Page 3\n\nContent...",
        "links": []
      }
        "links": []
      }
    }
  ]
}
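
Once a job has finished, the `data` array of the completed response can be reduced to whatever shape the caller needs. As a minimal sketch, the hypothetical helper `markdown_by_url` below maps each successfully scraped URL to its markdown. It assumes the response has been decoded into a Python dictionary and that `result.markdown` is only present when the `markdown` format was requested.

```python
def markdown_by_url(batch: dict) -> dict[str, str]:
    """Map each successfully scraped URL to its markdown output."""
    pages: dict[str, str] = {}
    # "data" may be absent while the job is still running, so default to [].
    for item in batch.get("data", []):
        if item.get("status") != "COMPLETED":
            continue  # skip items that failed or have not been processed
        markdown = (item.get("result") or {}).get("markdown")
        if markdown is not None:
            pages[item["url"]] = markdown
    return pages
```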

Job status values:

Status      Description
PENDING     Job is queued
PROCESSING  Batch processing is in progress
COMPLETED   All URLs have been processed
FAILED      The batch job failed

Item status values:

Status     Description
PENDING    URL has not been processed yet
COMPLETED  URL was scraped successfully
FAILED     URL scrape failed
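
The two status tables suggest a simple client-side rule: a job is finished once its status is `COMPLETED` or `FAILED`, and individual items are then inspected separately. The sketch below encodes that reading. `is_terminal` and `failed_urls` are hypothetical helpers. It is an assumption that a job marked `COMPLETED` ("all URLs have been processed") can still contain items with status `FAILED`; the `failedUrls` counter hints at this, but the source does not state it.

```python
# Job-level statuses after which polling can stop (per the job status table).
TERMINAL_STATUSES = {"COMPLETED", "FAILED"}


def is_terminal(batch: dict) -> bool:
    """Return True when the job is neither PENDING nor PROCESSING."""
    return batch["status"] in TERMINAL_STATUSES


def failed_urls(batch: dict) -> list[str]:
    """List the URLs whose item-level status is FAILED."""
    return [
        item["url"]
        for item in batch.get("data", [])  # "data" may be absent mid-run
        if item.get("status") == "FAILED"
    ]
```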

Get batch job history

GET /api/v1/batch/scrape

Query parameters

Parameter  Type    Default  Description
limit      number  20       Maximum number of results (1-100)
offset     number  0        Number of results to skip, for pagination
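
The `limit` and `offset` parameters can be combined to walk the full history page by page. The sketch below keeps the HTTP call out of the loop: `fetch_page` is a hypothetical callable that performs `GET /api/v1/batch/scrape` with the given `limit` and `offset` and returns the decoded JSON array. Because the sample response is a bare array with no total count, the loop assumes that a page shorter than `limit` is the last one.

```python
from typing import Callable, Iterator


def iter_batch_history(
    fetch_page: Callable[[int, int], list[dict]],
    limit: int = 20,
) -> Iterator[dict]:
    """Yield every batch job by requesting successive pages."""
    if not 1 <= limit <= 100:  # documented range for limit
        raise ValueError("limit must be between 1 and 100")
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        yield from page
        # Assumption: a short (or empty) page means there is nothing left.
        if len(page) < limit:
            return
        offset += limit
```

With `requests`, `fetch_page` could be a small function that calls `requests.get(url, params={"limit": limit, "offset": offset}, headers=...)` and returns `.json()`.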

Response

[
  {
    "id": "batch_abc123",
    "status": "COMPLETED",
    "totalUrls": 3,
    "completedUrls": 3,
    "failedUrls": 0,
    "createdAt": "2024-01-15T10:30:00.000Z"
  }
]

Webhook payload

When a batch job completes:

{
  "event": "batch.completed",
  "data": {
    "id": "batch_abc123",
    "status": "COMPLETED",
    "totalUrls": 3,
    "completedUrls": 3,
    "failedUrls": 0
  },
  "timestamp": "2024-01-15T10:30:15.000Z"
}
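
On the receiving side, the payload above can be parsed independently of any web framework. The sketch below is a hypothetical `parse_batch_webhook` function that takes the raw request body and returns a small summary. It assumes events other than `batch.completed` may arrive and ignores them. The sample payload carries counters only, so the scraped results would presumably still be fetched with `GET /api/v1/batch/scrape/:id`.

```python
import json
from typing import Optional


def parse_batch_webhook(raw_body: bytes) -> Optional[dict]:
    """Return a summary for batch.completed events, or None for anything else."""
    payload = json.loads(raw_body)
    if payload.get("event") != "batch.completed":
        return None  # assumption: unknown event types are ignored
    data = payload["data"]
    return {
        "batch_id": data["id"],
        "status": data["status"],
        # True only when every URL finished and none failed.
        "all_succeeded": data["failedUrls"] == 0
        and data["completedUrls"] == data["totalUrls"],
        "finished_at": payload["timestamp"],
    }
```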

Code examples

Node.js

// Create a batch scrape
const response = await fetch('https://server.anyhunt.app/api/v1/batch/scrape', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer ah_your_api_key',
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    urls: [
      'https://example.com/page-1',
      'https://example.com/page-2',
      'https://example.com/page-3',
    ],
    scrapeOptions: {
      formats: ['markdown'],
    },
  }),
});

const data = await response.json();
console.log('Batch job ID:', data.id);

// Helper that fetches the current job status, used for polling
const checkStatus = async (id) => {
  const res = await fetch(`https://server.anyhunt.app/api/v1/batch/scrape/${id}`, {
    headers: { 'Authorization': 'Bearer ah_your_api_key' },
  });
  return res.json();
};

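// Wait for completion: keep polling while the job is queued (PENDING) or running (PROCESSING)
let status = await checkStatus(data.id);
while (status.status === 'PENDING' || status.status === 'PROCESSING') {
  await new Promise(r => setTimeout(r, 2000)); // pause 2 seconds between polls
  status = await checkStatus(data.id);
}

// status.status is now COMPLETED or FAILED
console.log('Results:', status.data);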

Python

import requests
import time

# Create a batch scrape
response = requests.post(
    'https://server.anyhunt.app/api/v1/batch/scrape',
    headers={
        'Authorization': 'Bearer ah_your_api_key',
        'Content-Type': 'application/json',
    },
    json={
        'urls': [
            'https://example.com/page-1',
            'https://example.com/page-2',
            'https://example.com/page-3',
        ],
        'scrapeOptions': {
            'formats': ['markdown'],
        },
    },
)

batch_id = response.json()['id']
print(f'Batch job ID: {batch_id}')

# Poll the job status
while True:
    status = requests.get(
        f'https://server.anyhunt.app/api/v1/batch/scrape/{batch_id}',
        headers={'Authorization': 'Bearer ah_your_api_key'},
    ).json()

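    # Stop once the job is no longer queued (PENDING) or running (PROCESSING)
    if status['status'] not in ('PENDING', 'PROCESSING'):
        break
    time.sleep(2)  # pause 2 seconds between polls

# status['status'] is now COMPLETED or FAILED
print('Results:', status.get('data'))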

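Best practices

  1. Use webhooks - For batches of more than 10 URLs, use a webhook instead of polling
  2. Group similar pages - URLs with a similar structure are processed more efficiently
  3. Monitor failures - Check the failedUrls count and the status of each item
  4. Control batch size - Test with a small batch (10-20 URLs) first to estimate processing time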
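
For URL lists longer than a single request allows, or when starting with small test batches as suggested in item 4, the list can be split before submission. A minimal sketch follows; `chunk_urls` is a hypothetical helper, and the default size of 20 simply mirrors the upper end of the suggested test range.

```python
def chunk_urls(urls: list[str], batch_size: int = 20) -> list[list[str]]:
    """Split a URL list into consecutive batches of at most batch_size."""
    if not 1 <= batch_size <= 100:  # a request accepts 1-100 URLs
        raise ValueError("batch_size must be between 1 and 100")
    return [urls[i:i + batch_size] for i in range(0, len(urls), batch_size)]
```

Each resulting sublist can then be sent as the `urls` field of its own `POST /api/v1/batch/scrape` request.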