# Batch Scrape API

Scrape multiple URLs in parallel with a single API call.

The Batch Scrape API lets you scrape multiple URLs in one request. The URLs are processed in parallel with a shared set of scrape options, and webhook notifications are supported.
## Endpoints

| Method | Path | Description |
|---|---|---|
| POST | /api/v1/batch/scrape | Create a batch scrape job |
| GET | /api/v1/batch/scrape/:id | Get batch job status and results |
| GET | /api/v1/batch/scrape | Get batch job history |
## Create a Batch Scrape

`POST /api/v1/batch/scrape`

### Request Parameters

| Parameter | Type | Description |
|---|---|---|
| urls | string[] | Array of URLs to scrape (1-100) |
| scrapeOptions | object | Scrape options shared by all URLs (see the Scrape API) |
| webhookUrl | string | Webhook notification URL called on completion |
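Because `urls` accepts between 1 and 100 entries, it can help to validate the request body client-side before sending it. A minimal sketch; the `build_batch_request` helper is illustrative, not part of any official SDK:

```python
def build_batch_request(urls, scrape_options=None, webhook_url=None):
    """Build a /api/v1/batch/scrape request body, enforcing the 1-100 URL limit."""
    if not 1 <= len(urls) <= 100:
        raise ValueError("urls must contain between 1 and 100 entries")
    body = {"urls": list(urls)}
    if scrape_options is not None:
        body["scrapeOptions"] = scrape_options
    if webhook_url is not None:
        body["webhookUrl"] = webhook_url
    return body

body = build_batch_request(
    ["https://example.com/page-1"],
    scrape_options={"formats": ["markdown"]},
)
```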
### Request Example

```bash
curl -X POST https://server.anyhunt.app/api/v1/batch/scrape \
  -H "Authorization: Bearer ah_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": [
      "https://example.com/page-1",
      "https://example.com/page-2",
      "https://example.com/page-3"
    ],
    "scrapeOptions": {
      "formats": ["markdown", "links"],
      "onlyMainContent": true
    },
    "webhookUrl": "https://your-app.com/webhooks/batch"
  }'
```

### Response

```json
{
  "id": "batch_abc123",
  "status": "PENDING",
  "totalUrls": 3,
  "completedUrls": 0,
  "failedUrls": 0,
  "createdAt": "2024-01-15T10:30:00.000Z"
}
```

## Get Batch Job Status
`GET /api/v1/batch/scrape/:id`

### Response (In Progress)

```json
{
  "id": "batch_abc123",
  "status": "PROCESSING",
  "totalUrls": 3,
  "completedUrls": 2,
  "failedUrls": 0,
  "createdAt": "2024-01-15T10:30:00.000Z"
}
```

### Response (Completed)
```json
{
  "id": "batch_abc123",
  "status": "COMPLETED",
  "totalUrls": 3,
  "completedUrls": 3,
  "failedUrls": 0,
  "createdAt": "2024-01-15T10:30:00.000Z",
  "completedAt": "2024-01-15T10:30:15.000Z",
  "data": [
    {
      "url": "https://example.com/page-1",
      "status": "COMPLETED",
      "result": {
        "markdown": "# Page 1\n\nContent...",
        "links": ["https://example.com/other"]
      }
    },
    {
      "url": "https://example.com/page-2",
      "status": "COMPLETED",
      "result": {
        "markdown": "# Page 2\n\nContent...",
        "links": []
      }
    },
    {
      "url": "https://example.com/page-3",
      "status": "COMPLETED",
      "result": {
        "markdown": "# Page 3\n\nContent...",
        "links": []
      }
    }
  ]
}
```

Job status values:
| Status | Description |
|---|---|
| PENDING | Job is queued |
| PROCESSING | Batch is being processed |
| COMPLETED | All URLs have been processed |
| FAILED | The batch job failed |
Per-item status values:

| Status | Description |
|---|---|
| PENDING | URL not yet processed |
| COMPLETED | URL scraped successfully |
| FAILED | URL scrape failed |
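Per-item statuses let a client separate usable results from retry candidates after a job completes. A minimal sketch of walking the `data` array from the completed response above; the `split_results` helper is illustrative, not part of the API:

```python
def split_results(batch):
    """Separate successful results from failed URLs in a completed batch job."""
    succeeded = [item for item in batch["data"] if item["status"] == "COMPLETED"]
    failed = [item["url"] for item in batch["data"] if item["status"] == "FAILED"]
    return succeeded, failed

# Shape follows the documented completed-job response
batch = {
    "status": "COMPLETED",
    "data": [
        {"url": "https://example.com/page-1", "status": "COMPLETED",
         "result": {"markdown": "# Page 1", "links": []}},
        {"url": "https://example.com/page-2", "status": "FAILED"},
    ],
}
succeeded, failed = split_results(batch)
```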
## Get Batch Job History

`GET /api/v1/batch/scrape`

### Query Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| limit | number | 20 | Maximum number of results (1-100) |
| offset | number | 0 | Number of results to skip, for pagination |
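With `limit` and `offset`, the full history can be collected by paging until a short page comes back. A sketch of the loop; `fetch_page` stands in for the HTTP call to `GET /api/v1/batch/scrape` and is hypothetical:

```python
def fetch_all_jobs(fetch_page, limit=100):
    """Page through batch job history using limit/offset until a short page."""
    jobs, offset = [], 0
    while True:
        page = fetch_page(limit=limit, offset=offset)
        jobs.extend(page)
        if len(page) < limit:  # last page reached
            return jobs
        offset += limit

# Stand-in for the real endpoint: 250 fake history entries
history = [{"id": f"batch_{i}"} for i in range(250)]
def fake_fetch(limit, offset):
    return history[offset:offset + limit]

all_jobs = fetch_all_jobs(fake_fetch)
```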
### Response
```json
[
  {
    "id": "batch_abc123",
    "status": "COMPLETED",
    "totalUrls": 3,
    "completedUrls": 3,
    "failedUrls": 0,
    "createdAt": "2024-01-15T10:30:00.000Z"
  }
]
```

## Webhook Payload
When a batch job completes:
```json
{
  "event": "batch.completed",
  "data": {
    "id": "batch_abc123",
    "status": "COMPLETED",
    "totalUrls": 3,
    "completedUrls": 3,
    "failedUrls": 0
  },
  "timestamp": "2024-01-15T10:30:15.000Z"
}
```

## Code Examples
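A receiver for the `batch.completed` payload above can be sketched as follows; `handle_webhook` is illustrative, and the HTTP framework wiring (route registration, signature verification) is omitted:

```python
import json

def handle_webhook(raw_body):
    """Handle a batch.completed notification; payload shape as documented above."""
    payload = json.loads(raw_body)
    if payload.get("event") != "batch.completed":
        return None  # ignore other event types
    data = payload["data"]
    if data["failedUrls"] > 0:
        # e.g. re-queue the failed URLs or alert; here we just flag the job
        return ("retry", data["id"])
    return ("done", data["id"])

body = json.dumps({
    "event": "batch.completed",
    "data": {"id": "batch_abc123", "status": "COMPLETED",
             "totalUrls": 3, "completedUrls": 3, "failedUrls": 0},
    "timestamp": "2024-01-15T10:30:15.000Z",
})
result = handle_webhook(body)
```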
### Node.js
```javascript
// Create a batch scrape
const response = await fetch('https://server.anyhunt.app/api/v1/batch/scrape', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer ah_your_api_key',
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    urls: [
      'https://example.com/page-1',
      'https://example.com/page-2',
      'https://example.com/page-3',
    ],
    scrapeOptions: {
      formats: ['markdown'],
    },
  }),
});
const data = await response.json();
console.log('Batch job ID:', data.id);

// Poll for status
const checkStatus = async (id) => {
  const res = await fetch(`https://server.anyhunt.app/api/v1/batch/scrape/${id}`, {
    headers: { 'Authorization': 'Bearer ah_your_api_key' },
  });
  return res.json();
};

// Wait for completion (the job may still be PENDING right after creation)
let status = await checkStatus(data.id);
while (status.status === 'PENDING' || status.status === 'PROCESSING') {
  await new Promise(r => setTimeout(r, 2000));
  status = await checkStatus(data.id);
}
console.log('Results:', status.data);
```

### Python
```python
import requests
import time

# Create a batch scrape
response = requests.post(
    'https://server.anyhunt.app/api/v1/batch/scrape',
    headers={
        'Authorization': 'Bearer ah_your_api_key',
        'Content-Type': 'application/json',
    },
    json={
        'urls': [
            'https://example.com/page-1',
            'https://example.com/page-2',
            'https://example.com/page-3',
        ],
        'scrapeOptions': {
            'formats': ['markdown'],
        },
    },
)
batch_id = response.json()['id']
print(f'Batch job ID: {batch_id}')

# Poll for status (the job may still be PENDING right after creation)
while True:
    status = requests.get(
        f'https://server.anyhunt.app/api/v1/batch/scrape/{batch_id}',
        headers={'Authorization': 'Bearer ah_your_api_key'},
    ).json()
    if status['status'] not in ('PENDING', 'PROCESSING'):
        break
    time.sleep(2)

print('Results:', status['data'])
```

## Best Practices
- Use webhooks - For batches with more than 10 URLs, use webhooks instead of polling
- Group similar pages - URLs with a similar structure are processed more efficiently
- Monitor failures - Check the `failedUrls` count and the per-item statuses
- Limit batch size - Test with small batches (10-20) first to estimate processing time
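Since a single request accepts at most 100 URLs, larger URL lists can be split client-side and submitted as several jobs. A minimal chunking sketch; the `chunk_urls` helper is illustrative:

```python
def chunk_urls(urls, size=100):
    """Split a URL list into batches no larger than the API's 100-URL cap."""
    return [urls[i:i + size] for i in range(0, len(urls), size)]

urls = [f"https://example.com/page-{i}" for i in range(1, 251)]
batches = chunk_urls(urls)
# Each entry in `batches` can then be POSTed to /api/v1/batch/scrape in turn.
```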