Extract API

The Extract API uses a large language model (LLM) to extract structured data from web pages. You define a JSON Schema, and the model returns data that matches it.

Endpoints

Method	Path	Description
POST	`/api/v1/extract`	Extract structured data

Extract Data

POST /api/v1/extract

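Request Parameters

Parameter	Type	Description
`urls`	string[]	URLs to extract from (1-20)
`prompt`	string	Extraction instructions (up to 5,000 characters)
`schema`	object	JSON Schema describing the output structure
`systemPrompt`	string	Custom system prompt (up to 2,000 characters)
`model`	string	LLM model to use (optional)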
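The limits in the table above can be checked on the client before a request is sent. The following is a minimal sketch, not part of the API: the function name `validate_extract_request` and the constants are hypothetical, `urls` is assumed to be required, and the character limits are assumed to count Unicode code points (the server may count differently).

```python
from typing import Any

# Limits taken from the request parameter table.
MAX_URLS = 20
MAX_PROMPT_CHARS = 5000
MAX_SYSTEM_PROMPT_CHARS = 2000


def validate_extract_request(payload: dict[str, Any]) -> list[str]:
    """Return a list of problems found in a request body (empty if none)."""
    problems: list[str] = []

    # Assumption: `urls` is required and must hold 1-20 strings.
    urls = payload.get("urls")
    if not isinstance(urls, list) or not all(isinstance(u, str) for u in urls):
        problems.append("urls must be a list of strings")
    elif not 1 <= len(urls) <= MAX_URLS:
        problems.append(f"urls must contain 1-{MAX_URLS} entries, got {len(urls)}")

    # Length limits are only checked when the field is present.
    prompt = payload.get("prompt")
    if isinstance(prompt, str) and len(prompt) > MAX_PROMPT_CHARS:
        problems.append(f"prompt exceeds {MAX_PROMPT_CHARS} characters")

    system_prompt = payload.get("systemPrompt")
    if isinstance(system_prompt, str) and len(system_prompt) > MAX_SYSTEM_PROMPT_CHARS:
        problems.append(f"systemPrompt exceeds {MAX_SYSTEM_PROMPT_CHARS} characters")

    return problems
```

Calling `validate_extract_request(payload)` before posting lets you reject an oversized batch or prompt locally instead of spending a request on it.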

Request Example

curl -X POST https://server.anyhunt.app/api/v1/extract \
  -H "Authorization: Bearer ah_your_api_key" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://example.com/product/123"],
    "prompt": "Extract the product information from this page",
    "schema": {
      "type": "object",
      "properties": {
        "name": {"type": "string"},
        "price": {"type": "number"},
        "currency": {"type": "string"},
        "description": {"type": "string"},
        "inStock": {"type": "boolean"}
      },
      "required": ["name", "price"]
    }
  }'

Response

{
  "results": [
    {
      "url": "https://example.com/product/123",
      "data": {
        "name": "Wireless Headphones Pro",
        "price": 199.99,
        "currency": "USD",
        "description": "Premium wireless headphones with noise cancellation",
        "inStock": true
      }
    }
  ]
}

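Error Response

If extraction fails for a URL, its entry in results contains an error field instead of data:

{
  "results": [
    {
      "url": "https://example.com/product/123",
      "error": "Extraction failed: page content is empty"
    }
  ]
}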
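Because failures are reported per URL inside `results`, a batch response can mix successes and errors. A minimal sketch of separating the two, assuming only the response shape shown above (the helper name `split_results` is hypothetical):

```python
from typing import Any


def split_results(body: dict[str, Any]) -> tuple[dict[str, Any], dict[str, str]]:
    """Split a response body into {url: data} and {url: error message}."""
    succeeded: dict[str, Any] = {}
    failed: dict[str, str] = {}
    for item in body.get("results", []):
        if "error" in item:
            # Failed entries carry an error message and no data.
            failed[item["url"]] = item["error"]
        else:
            succeeded[item["url"]] = item.get("data")
    return succeeded, failed
```

The `failed` mapping can then drive logging or a retry of just the affected URLs.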

JSON Schema

Use JSON Schema to define the structure of the data to extract.

Supported Types

Type	Description	Example
`string`	Text value	`"name": {"type": "string"}`
`number`	Numeric value	`"price": {"type": "number"}`
`boolean`	True/false	`"inStock": {"type": "boolean"}`
`array`	List of values	`"tags": {"type": "array", "items": {"type": "string"}}`
`object`	Nested object	`"author": {"type": "object", "properties": {...}}`

Schema Examples

Product Extraction

{
  "type": "object",
  "properties": {
    "name": {"type": "string"},
    "price": {"type": "number"},
    "description": {"type": "string"},
    "specifications": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "key": {"type": "string"},
          "value": {"type": "string"}
        }
      }
    }
  },
  "required": ["name", "price"]
}

Article Metadata

{
  "type": "object",
  "properties": {
    "title": {"type": "string"},
    "author": {"type": "string"},
    "publishedDate": {"type": "string"},
    "tags": {
      "type": "array",
      "items": {"type": "string"}
    },
    "summary": {"type": "string"}
  },
  "required": ["title"]
}

Company Information

{
  "type": "object",
  "properties": {
    "companyName": {"type": "string"},
    "founded": {"type": "number"},
    "headquarters": {"type": "string"},
    "employees": {"type": "number"},
    "products": {
      "type": "array",
      "items": {"type": "string"}
    }
  }
}
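LLM output may not always match the schema you sent, so it can be worth checking extracted data on the client. The sketch below is an illustrative checker that covers only the five types listed above plus `required`; it is not a full JSON Schema validator, and the source does not say whether the API validates results server-side. The names `check_against_schema` and `PRODUCT_SCHEMA` are hypothetical.

```python
from typing import Any

# A trimmed version of the product schema used in the examples.
PRODUCT_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number"},
        "inStock": {"type": "boolean"},
        "tags": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["name", "price"],
}


def check_against_schema(value: Any, schema: dict[str, Any], path: str = "$") -> list[str]:
    """Return a list of mismatches between value and schema (empty if none)."""
    expected = schema.get("type")
    actual = type(value).__name__

    if expected == "object":
        if not isinstance(value, dict):
            return [f"{path}: expected object, got {actual}"]
        problems = [
            f"{path}.{key}: missing required field"
            for key in schema.get("required", [])
            if key not in value
        ]
        # Recurse into each declared property that is present.
        for key, sub in schema.get("properties", {}).items():
            if key in value:
                problems += check_against_schema(value[key], sub, f"{path}.{key}")
        return problems

    if expected == "array":
        if not isinstance(value, list):
            return [f"{path}: expected array, got {actual}"]
        problems = []
        for i, item in enumerate(value):
            problems += check_against_schema(item, schema.get("items", {}), f"{path}[{i}]")
        return problems

    # bool is a subclass of int in Python, so exclude it from "number".
    scalar_ok = {
        "string": isinstance(value, str),
        "number": isinstance(value, (int, float)) and not isinstance(value, bool),
        "boolean": isinstance(value, bool),
    }
    if expected in scalar_ok and not scalar_ok[expected]:
        return [f"{path}: expected {expected}, got {actual}"]
    return []
```

For production use, an established JSON Schema validation library is likely a better choice than a hand-rolled checker like this one.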

Code Examples

Node.js

const response = await fetch('https://server.anyhunt.app/api/v1/extract', {
  method: 'POST',
  headers: {
    'Authorization': 'Bearer ah_your_api_key',
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    urls: ['https://example.com/product/123'],
    prompt: 'Extract the product details',
    schema: {
      type: 'object',
      properties: {
        name: { type: 'string' },
        price: { type: 'number' },
      },
      required: ['name', 'price'],
    },
  }),
});

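const data = await response.json();
// A failed URL has an `error` field instead of `data`.
const first = data.results[0];
console.log(first.error ? `Failed: ${first.error}` : first.data);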

Python

import requests

response = requests.post(
    'https://server.anyhunt.app/api/v1/extract',
    headers={
        'Authorization': 'Bearer ah_your_api_key',
        'Content-Type': 'application/json',
    },
    json={
        'urls': ['https://example.com/product/123'],
        'prompt': 'Extract the product details',
        'schema': {
            'type': 'object',
            'properties': {
                'name': {'type': 'string'},
                'price': {'type': 'number'},
            },
            'required': ['name', 'price'],
        },
    },
)

results = response.json()['results']
# A failed URL has an 'error' field instead of 'data'.
first = results[0]
print(first.get('error') or first['data'])

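Best Practices

Keep the schema simple - Start with a few fields and add more as needed
Write clear prompts - Describe what to extract in plain language
Mark required fields - Use the required array to make sure key fields are extracted
Test with a single URL first - Validate the schema before extracting in bulk
Handle errors gracefully - Check each result for an error field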
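The "single URL first, then bulk" advice can be combined with the 20-URL request limit. One possible way to plan the requests is sketched below; `plan_batches` is a hypothetical helper, and it only groups URLs without sending anything.

```python
from typing import Iterator

MAX_URLS_PER_REQUEST = 20  # limit from the request parameter table


def plan_batches(urls: list[str], batch_size: int = MAX_URLS_PER_REQUEST) -> Iterator[list[str]]:
    """Yield a one-URL trial batch first, then the remaining URLs in chunks."""
    if not 1 <= batch_size <= MAX_URLS_PER_REQUEST:
        raise ValueError(f"batch_size must be between 1 and {MAX_URLS_PER_REQUEST}")
    if not urls:
        return
    # Trial request: a single URL to confirm the schema and prompt work.
    yield urls[:1]
    rest = urls[1:]
    for start in range(0, len(rest), batch_size):
        yield rest[start:start + batch_size]
```

Because this is a generator, the caller can inspect the result of the trial batch and stop iterating before any bulk request is built.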

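Prompt Tips

A good prompt helps the LLM understand what to extract:

{
  "prompt": "Extract the main product information. Focus on the product name, the current price (not the original price), and whether it is in stock."
}

A poor prompt (too vague):

{
  "prompt": "Get data"
}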

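Pricing

Extract API calls consume quota based on the following factors:

Number of URLs processed
Page content size
Schema complexity

Each successful extraction counts as 1 API call and consumes quota.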
