# POST /crawl

Follow links from a seed URL and extract on each page.

```
POST https://api.scrapewithruno.com/v1/crawl
X-API-Key: sk_live_...
Content-Type: application/json
```

## Request Body
```json
{
  "seed_url": "https://example.com/blog",
  "schema": [
    { "field": "title", "type": "string", "example": "Example post" },
    { "field": "publishedAt", "type": "date", "example": "2024-12-20" }
  ],
  "crawl": {
    "follow_pattern": "https://example.com/blog/*",
    "max_pages": 50,
    "max_depth": 2,
    "allow_large_crawl": false
  },
  "options": { "render_js": "auto", "timeout_ms": 15000 },
  "async_mode": false
}
```

| Field | Type | Required | Notes |
|---|---|---|---|
| `seed_url` | string | yes | Where to start |
| `schema` | array | dynamic keys only | Applied to every visited page |
| `crawl.follow_pattern` | string | yes | Glob; only matching links are followed |
| `crawl.max_pages` | integer | yes | Hard ceiling on pages visited |
| `crawl.max_depth` | integer | no | Hops from the seed; default 2 |
| `crawl.allow_large_crawl` | boolean | no | Bypass the 25%-of-quota cap (see below) |
| `options.*` | - | no | Same as `/extract`, applied per page |
| `async_mode` | boolean | no | Route LLM extraction through the Batch API (up to 24 h) |
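To preview which discovered links a `crawl.follow_pattern` glob would keep, a quick client-side sketch can help. Python's `fnmatch` is used here as an approximation; the API's exact glob semantics are not specified on this page, and note that `fnmatch`'s `*` also crosses `/` boundaries.

```py
from fnmatch import fnmatch

# Illustration only: fnmatch is an assumption, not the service's matcher.
pattern = "https://example.com/blog/*"
links = [
    "https://example.com/blog/post-1",
    "https://example.com/about",
    "https://example.com/blog/2024/post-2",
]
followed = [u for u in links if fnmatch(u, pattern)]
print(followed)  # both /blog/ URLs; /about is dropped
```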
## Quota Safety Cap

`max_pages` is silently capped at `floor(remaining_monthly_quota × 0.25)` at the start of the call, and the excess units are refunded immediately. Set `crawl.allow_large_crawl: true` to remove the cap and reserve the full requested amount.
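The cap arithmetic can be sketched client-side to predict how many pages a call will actually reserve. This is an illustration of the rule above, not an API call:

```py
import math

def effective_max_pages(requested, remaining_quota, allow_large_crawl=False):
    """Predict the page ceiling the server applies, per the rule above."""
    if allow_large_crawl:
        return requested
    return min(requested, math.floor(remaining_quota * 0.25))

print(effective_max_pages(50, 120))        # 30: capped at floor(120 * 0.25)
print(effective_max_pages(50, 120, True))  # 50: cap bypassed
```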
## Behavior

- Respects `robots.txt`.
- Reads `sitemap.xml` when present.
- Applies per-host jitter and adaptive back-off so the target site doesn't ban you.
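The jitter and back-off above happen server-side against the target site. If you also wrap your own API calls in retries, a similar "full jitter" delay is a reasonable default; this is a generic sketch, not the service's exact algorithm:

```py
import random

def backoff_delay(attempt, base=0.5, cap=30.0):
    # Full jitter: pick uniformly between 0 and the exponential ceiling.
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

Call it with `attempt = 0, 1, 2, ...` between successive retries and sleep for the returned number of seconds.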
## Response

```json
{
  "seed_url": "https://example.com/blog",
  "results": [
    { "url": "https://example.com/blog/post-1", "status": "success", "render_mode": "fetch", "data": { "title": "...", "publishedAt": "2024-11-01" } },
    { "url": "https://example.com/blog/post-2", "status": "success", "render_mode": "fetch", "data": { "title": "...", "publishedAt": "2024-11-08" } }
  ],
  "crawl_meta": {
    "pages_visited": 17,
    "pages_skipped": 3,
    "pages_failed": 0,
    "cancelled": false
  }
}
```

## Cost
`max_pages` units are reserved at the start. Whatever isn't visited is refunded to your monthly counter (and to your Pro/Scale overage credit balance, if applicable).
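The refund arithmetic is simple enough to sketch (illustrative only, not an API call):

```py
def refunded_units(max_pages, pages_visited):
    # Units reserved up front minus pages actually visited.
    return max_pages - pages_visited

# With the sample request (max_pages 50) and sample response (17 visited):
print(refunded_units(50, 17))  # 33 units back on the monthly counter
```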
## Examples

```bash
curl -X POST https://api.scrapewithruno.com/v1/crawl \
  -H "X-API-Key: $RUNO_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "seed_url": "https://example.com/blog",
    "schema": [
      { "field": "title", "type": "string", "example": "Example post" },
      { "field": "publishedAt", "type": "date", "example": "2024-12-20" }
    ],
    "crawl": {
      "follow_pattern": "https://example.com/blog/*",
      "max_pages": 25,
      "max_depth": 2
    }
  }'
```
```js
const res = await fetch('https://api.scrapewithruno.com/v1/crawl', {
  method: 'POST',
  headers: {
    'X-API-Key': process.env.RUNO_API_KEY,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    seed_url: 'https://example.com/blog',
    schema: [
      { field: 'title', type: 'string', example: 'Example post' },
      { field: 'publishedAt', type: 'date', example: '2024-12-20' },
    ],
    crawl: {
      follow_pattern: 'https://example.com/blog/*',
      max_pages: 25,
      max_depth: 2,
    },
  }),
})
const { results, crawl_meta } = await res.json()
console.log(crawl_meta, results.length)
```
```py
import os, requests

res = requests.post(
    "https://api.scrapewithruno.com/v1/crawl",
    headers={"X-API-Key": os.environ["RUNO_API_KEY"]},
    json={
        "seed_url": "https://example.com/blog",
        "schema": [
            {"field": "title", "type": "string", "example": "Example post"},
            {"field": "publishedAt", "type": "date", "example": "2024-12-20"},
        ],
        "crawl": {
            "follow_pattern": "https://example.com/blog/*",
            "max_pages": 25,
            "max_depth": 2,
        },
    },
    timeout=600,
)
body = res.json()
print(body["crawl_meta"])
print(len(body["results"]), "pages")
```
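A crawl response can mix outcomes per page, so a small post-processing step keeps only the successful rows. `"success"` appears in the sample response above; the `"failed"` value used here is an assumption for illustration:

```py
from collections import Counter

results = [
    {"url": "https://example.com/blog/post-1", "status": "success"},
    {"url": "https://example.com/blog/post-2", "status": "failed"},  # assumed value
]
by_status = Counter(r["status"] for r in results)
succeeded = [r["url"] for r in results if r["status"] == "success"]
print(dict(by_status), succeeded)
```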