Scrape API

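The Scrape API is the primary endpoint for extracting web page content. It supports multiple output formats, including Markdown, HTML, links, and screenshots.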

Endpoints

Method	Path	Description
POST	`/api/v1/scrape`	Create a scrape job
GET	`/api/v1/scrape/:id`	Get job status and results
GET	`/api/v1/scrape`	Get scrape history

Create a Scrape Job

POST /api/v1/scrape

Request Parameters

Required Parameters

Parameter	Type	Description
`url`	string	URL to scrape (must be a valid HTTP/HTTPS URL)

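Output Format Options

Parameter	Type	Default	Description
`formats`	string[]	`["markdown"]`	Output formats: `markdown`, `html`, `rawHtml`, `links`, `screenshot`
`onlyMainContent`	boolean	`true`	Extract only the main content (uses the Readability algorithm)
`includeTags`	string[]	-	CSS selectors to include
`excludeTags`	string[]	-	CSS selectors to exclude (these elements are also hidden in screenshots)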

Page Configuration

Parameter	Type	Default	Description
`viewport`	object	-	Custom viewport `{width, height}`
`viewport.width`	number	1280	Viewport width (100-3840)
`viewport.height`	number	800	Viewport height (100-2160)
`mobile`	boolean	`false`	Use a mobile viewport and User-Agent
`device`	string	-	Device preset: `desktop`, `tablet`, `mobile`
`darkMode`	boolean	`false`	Enable dark mode
`headers`	object	-	Custom HTTP request headers

Timing Options

Parameter	Type	Default	Description
`waitFor`	number | string	-	Time to wait in milliseconds, or a CSS selector to wait for
`timeout`	number	30000	Page timeout in milliseconds

Screenshot Options

These apply only when formats includes screenshot:

Parameter	Type	Default	Description
`screenshotOptions.fullPage`	boolean	`false`	Capture the full page height
`screenshotOptions.format`	string	`"png"`	Image format: `png`, `jpeg`, `webp`
`screenshotOptions.quality`	number	80	Image quality (1-100)
`screenshotOptions.clip`	string	-	CSS selector of the element to capture
`screenshotOptions.response`	string	`"url"`	Response type: `url` or `base64`
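
The allowed values in the table above can be checked on the client before a request is sent. The following is a minimal sketch, not part of the API itself: the helper name `build_screenshot_options` and its keyword arguments are hypothetical, and only the field names, defaults, and ranges come from the table.

```python
# Hypothetical client-side helper; the API does not ship this function.
ALLOWED_FORMATS = {"png", "jpeg", "webp"}
ALLOWED_RESPONSES = {"url", "base64"}


def build_screenshot_options(full_page=False, fmt="png", quality=80,
                             clip=None, response="url"):
    """Build a screenshotOptions object, enforcing the documented values."""
    if fmt not in ALLOWED_FORMATS:
        raise ValueError(f"format must be one of {sorted(ALLOWED_FORMATS)}")
    if not 1 <= quality <= 100:
        raise ValueError("quality must be between 1 and 100")
    if response not in ALLOWED_RESPONSES:
        raise ValueError(f"response must be one of {sorted(ALLOWED_RESPONSES)}")

    options = {
        "fullPage": full_page,
        "format": fmt,
        "quality": quality,
        "response": response,
    }
    # clip has no default, so it is only sent when a selector is given.
    if clip is not None:
        options["clip"] = clip
    return options
```

The returned dictionary can be passed as the `screenshotOptions` field of the request body when `formats` includes `screenshot`.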

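Page Interactions

Perform interactions on the page before scraping:

Parameter	Type	Description
`actions`	Action[]	Array of actions to perform

Action types:

Type	Parameters	Description
`wait`	`milliseconds`	Wait for the specified time
`click`	`selector`	Click an element
`type`	`selector`, `text`	Type text into an input field
`press`	`key`	Press a keyboard key
`scroll`	`direction` (`up`/`down`), `amount`	Scroll the page
`screenshot`	-	Take an intermediate screenshot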
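
Because each action type takes different parameters, a small builder can catch a malformed action before the request is sent. This is a sketch under assumptions: `make_action` and `REQUIRED_PARAMS` are hypothetical names, and `amount` is treated as optional for `scroll` only because the request example in this document omits it; the API may behave differently.

```python
# Hypothetical client-side helper; parameter names are taken from the table above.
REQUIRED_PARAMS = {
    "wait": ("milliseconds",),
    "click": ("selector",),
    "type": ("selector", "text"),
    "press": ("key",),
    "scroll": ("direction",),  # assumption: "amount" is optional
    "screenshot": (),
}


def make_action(action_type, **params):
    """Return one entry for the `actions` array, checking required parameters."""
    if action_type not in REQUIRED_PARAMS:
        raise ValueError(f"unknown action type: {action_type!r}")
    missing = [name for name in REQUIRED_PARAMS[action_type] if name not in params]
    if missing:
        raise ValueError(f"{action_type} action is missing: {', '.join(missing)}")
    if action_type == "scroll" and params["direction"] not in ("up", "down"):
        raise ValueError("scroll direction must be 'up' or 'down'")
    return {"type": action_type, **params}


# Example: click a button, then scroll down.
actions = [
    make_action("click", selector="#load-more"),
    make_action("scroll", direction="down"),
]
```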

Request Example

curl -X POST https://server.anyhunt.app/api/v1/scrape \
  -H "Authorization: Bearer ah_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/article",
    "formats": ["markdown", "screenshot"],
    "onlyMainContent": true,
    "viewport": {
      "width": 1920,
      "height": 1080
    },
    "screenshotOptions": {
      "fullPage": true,
      "format": "webp",
      "quality": 85
    }
  }'

Response

The API returns a job ID. Poll GET /api/v1/scrape/:id to retrieve the result.

{
  "id": "scrape_abc123",
  "status": "PENDING"
}

Cache hit response:

If the same URL was scraped recently with the same options, the API returns the cached result immediately:

{
  "id": "scrape_abc123",
  "url": "https://example.com/article",
  "fromCache": true,
  "markdown": "# Article Title\n\nArticle content...",
  "screenshot": {
    "url": "https://cdn.anyhunt.app/scraper/scrape_abc123.webp",
    "width": 1920,
    "height": 3500,
    "format": "webp",
    "fileSize": 245000,
    "expiresAt": "2024-02-15T10:30:00.000Z"
  },
  "metadata": {
    "title": "Article Title",
    "description": "Article description"
  }
}

Get a Scrape Job

GET /api/v1/scrape/:id

Returns the status and result of a specific scrape job.

Response

{
  "id": "scrape_abc123",
  "url": "https://example.com/article",
  "status": "COMPLETED",
  "fromCache": false,
  "markdown": "# Article Title\n\nContent...",
  "metadata": {
    "title": "Article Title",
    "description": "Article description"
  },
  "screenshot": {
    "url": "https://cdn.anyhunt.app/scraper/scrape_abc123.webp",
    "width": 1920,
    "height": 3500,
    "format": "webp",
    "fileSize": 245000
  },
  "timings": {
    "queueWaitMs": 50,
    "fetchMs": 1200,
    "renderMs": 500,
    "transformMs": 100,
    "screenshotMs": 800,
    "totalMs": 2650
  }
}

Status values:

Status	Description
`PENDING`	The job is queued
`PROCESSING`	The job is being processed
`COMPLETED`	The job completed successfully
`FAILED`	The job failed
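
Only `COMPLETED` and `FAILED` are final, so a client keeps polling while a job is `PENDING` or `PROCESSING`. The sketch below shows one way to bound that wait. `poll_until_done` is a hypothetical helper, `fetch_job` is any callable you supply that returns the parsed JSON of GET /api/v1/scrape/:id, and the 60-second limit is an arbitrary assumption rather than a documented value.

```python
import time

TERMINAL_STATUSES = {"COMPLETED", "FAILED"}


def poll_until_done(fetch_job, job_id, interval=1.0, max_wait=60.0, sleep=time.sleep):
    """Call fetch_job(job_id) until the job reaches a terminal status.

    Raises TimeoutError if the job is still running once max_wait seconds
    have passed. The sleep function is injectable so the loop can be tested
    without real delays.
    """
    deadline = time.monotonic() + max_wait
    while True:
        job = fetch_job(job_id)
        if job["status"] in TERMINAL_STATUSES:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job {job_id} still {job['status']} after {max_wait}s")
        sleep(interval)
```

Taking `fetch_job` as an argument keeps the loop independent of any HTTP library; in practice it would wrap a GET request that carries the Authorization header.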

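Get Scrape History

GET /api/v1/scrape

Returns a list of recent scrape jobs.

Query Parameters

Parameter	Type	Default	Description
`limit`	number	20	Maximum number of results (1-100)
`offset`	number	0	Number of results to skip, for pagination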
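
To walk the full history, `limit` and `offset` can be advanced together page by page. The sketch below assumes that a page shorter than `limit` marks the end of the list; the response does not include a total count, so this stopping rule is a guess. `iter_history` is a hypothetical helper and `fetch_page` is a callable you supply that performs the GET request and returns the parsed JSON array.

```python
def iter_history(fetch_page, page_size=20):
    """Yield every job from the history endpoint, one page at a time."""
    if not 1 <= page_size <= 100:
        raise ValueError("page_size must be between 1 and 100")
    offset = 0
    while True:
        page = fetch_page(limit=page_size, offset=offset)
        yield from page
        # Assumption: a short (or empty) page means there is nothing left.
        if len(page) < page_size:
            return
        offset += page_size
```

For example, `fetch_page` could call `requests.get` on `/api/v1/scrape` with `params={"limit": limit, "offset": offset}` and return `response.json()`.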

Response

[
  {
    "id": "scrape_abc123",
    "url": "https://example.com",
    "status": "COMPLETED",
    "fromCache": false,
    "createdAt": "2024-01-15T10:30:00.000Z",
    "completedAt": "2024-01-15T10:30:02.650Z"
  }
]

Code Examples

Node.js

// Start a scrape job
const response = await fetch('https://server.anyhunt.app/api/v1/scrape', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer ah_your_api_key',
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    url: 'https://example.com',
    formats: ['markdown', 'links'],
    onlyMainContent: true,
  }),
});
if (!response.ok) throw new Error(`Scrape request failed: HTTP ${response.status}`);

const data = await response.json();

// On a cache hit, the result is already available
if (data.fromCache) {
  console.log(data.markdown);
} else {
  // Otherwise poll for the result
  const result = await pollForResult(data.id);
  console.log(result.markdown);
}

// Poll once per second until the job finishes, giving up after maxAttempts tries
async function pollForResult(id, maxAttempts = 60) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const res = await fetch(`https://server.anyhunt.app/api/v1/scrape/${id}`, {
      headers: { 'Authorization': 'Bearer ah_your_api_key' },
    });
    const data = await res.json();
    if (data.status === 'COMPLETED') return data;
    if (data.status === 'FAILED') throw new Error(data.error?.message ?? 'Scrape job failed');
    await new Promise(r => setTimeout(r, 1000)); // wait 1 second
  }
  throw new Error(`Scrape job ${id} did not finish after ${maxAttempts} attempts`);
}

Python

import requests
import time

# Start a scrape job
response = requests.post(
    'https://server.anyhunt.app/api/v1/scrape',
    headers={
        'Authorization': 'Bearer ah_your_api_key',
        'Content-Type': 'application/json',
    },
    json={
        'url': 'https://example.com',
        'formats': ['markdown', 'links'],
        'onlyMainContent': True,
    },
)
response.raise_for_status()

data = response.json()

# On a cache hit, the result is already available
if data.get('fromCache'):
    print(data['markdown'])
else:
    # Otherwise poll once per second, giving up after 60 attempts
    for _ in range(60):
        res = requests.get(
            f"https://server.anyhunt.app/api/v1/scrape/{data['id']}",
            headers={'Authorization': 'Bearer ah_your_api_key'},
        )
        result = res.json()
        if result['status'] == 'COMPLETED':
            print(result['markdown'])
            break
        elif result['status'] == 'FAILED':
            raise Exception((result.get('error') or {}).get('message'))
        time.sleep(1)
    else:
        raise TimeoutError('Scrape job did not finish in time')

Using Page Interactions

curl -X POST https://server.anyhunt.app/api/v1/scrape \
  -H "Authorization: Bearer ah_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "formats": ["markdown"],
    "actions": [
      {"type": "wait", "milliseconds": 1000},
      {"type": "click", "selector": "#load-more"},
      {"type": "scroll", "direction": "down"},
      {"type": "wait", "milliseconds": 500}
    ]
  }'

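Error Codes

Error Code	HTTP Status	Description
`INVALID_URL`	400	The URL format is invalid or the URL is blocked
`URL_NOT_ALLOWED`	400	The URL was blocked by SSRF protection
`PAGE_TIMEOUT`	504	The page load timed out
`SELECTOR_NOT_FOUND`	400	The CSS selector was not found on the page
`BROWSER_ERROR`	500	The browser crashed or returned an error
`NETWORK_ERROR`	500	The network request failed
`RATE_LIMITED`	429	Too many requests
`QUOTA_EXCEEDED`	429	The monthly quota has been used up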
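
Both `RATE_LIMITED` and `QUOTA_EXCEEDED` return HTTP 429, so a client that wants to retry needs to look at the error code rather than the status alone. The sketch below is one possible policy, not guidance from the API: the set of retryable codes is a judgment call, and `is_retryable` and `backoff_delay` are hypothetical helper names.

```python
# Assumption: these codes describe transient conditions worth retrying.
# The other codes (invalid URL, blocked URL, missing selector, exhausted
# quota) would fail again with the same request, so they are not retried.
RETRYABLE_CODES = {"PAGE_TIMEOUT", "BROWSER_ERROR", "NETWORK_ERROR", "RATE_LIMITED"}


def is_retryable(error_code):
    """Return True if retrying the same request might succeed."""
    return error_code in RETRYABLE_CODES


def backoff_delay(attempt, base=1.0, cap=30.0):
    """Exponential backoff in seconds: base * 2**attempt, limited to cap."""
    return min(cap, base * (2 ** attempt))
```

With these defaults, a client would wait 1, 2, 4, and 8 seconds before its first four retries and never more than 30 seconds.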

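Caching

Responses are cached for 1 hour by default. On a cache hit, the response includes fromCache: true and the request does not count against your quota.

The cache key is computed as SHA256(url + options), so identical requests return the cached result.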
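
To illustrate why two requests with the same URL and options map to the same cache entry, the sketch below computes a key of the form SHA256(url + options). How the server serializes the options is not specified here, so the canonical JSON encoding with sorted keys is an assumption, and the resulting digests will not necessarily match the server's real keys.

```python
import hashlib
import json


def cache_key(url, options):
    """Illustrative SHA-256 key over the URL plus a canonical form of the options."""
    # Assumption: options are serialized as compact JSON with sorted keys so
    # that key order does not change the hash.
    canonical = json.dumps(options, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((url + canonical).encode("utf-8")).hexdigest()


same_a = cache_key("https://example.com", {"formats": ["markdown"], "mobile": False})
same_b = cache_key("https://example.com", {"mobile": False, "formats": ["markdown"]})
different = cache_key("https://example.com", {"formats": ["html"], "mobile": False})
```

Here `same_a` and `same_b` are equal because only the key order differs, while `different` changes because the requested format changed.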
