
Crawling & Batching

Two endpoints when one URL isn't enough.

/batch - Same Schema, Many URLs

Use when you already have the URL list. Returns one /extract response per URL, in order.

json
{
  "urls": [
    "https://example.com/p/1",
    "https://example.com/p/2",
    "https://example.com/p/3"
  ],
  "schema": [
    { "field": "title", "type": "string", "example": "Example title" },
    { "field": "price", "type": "float",  "example": 19.99 }
  ],
  "options": { "concurrency": 4, "fail_fast": false }
}
Option | Default | Meaning
concurrency | server-tuned | Max parallel fetches inside this batch
fail_fast | false | If true, return as soon as the first URL errors

Cost: 1 request per URL. Failures still count.
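The same body can be sent straight from the shell. A minimal sketch, assuming /batch is served at /v1/batch alongside the /v1/crawl endpoint used further down:

bash
# Sketch: send the batch body above to POST /batch.
# The /v1/batch path is an assumption mirroring the /v1/crawl example below.
curl -X POST https://api.scrapewithruno.com/v1/batch \
  -H "X-API-Key: $RUNO_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": [
      "https://example.com/p/1",
      "https://example.com/p/2",
      "https://example.com/p/3"
    ],
    "schema": [
      { "field": "title", "type": "string", "example": "Example title" },
      { "field": "price", "type": "float",  "example": 19.99 }
    ],
    "options": { "concurrency": 4, "fail_fast": false }
  }'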

See POST /batch for the full reference.

/crawl - Seed URL and a Pattern

Use when you have a starting URL and a pattern.

json
{
  "seed_url": "https://example.com/blog",
  "schema": [
    { "field": "title",       "type": "string", "example": "Example post" },
    { "field": "publishedAt", "type": "date",   "example": "2024-12-20" }
  ],
  "crawl": {
    "follow_pattern": "https://example.com/blog/*",
    "max_pages": 50,
    "max_depth": 2
  }
}
Crawl Option | Notes
follow_pattern | Only links matching this pattern are followed
max_pages | Hard ceiling on pages visited
max_depth | Hops from seed_url
allow_large_crawl | Bypass the 25%-of-quota safety cap (see below)

The crawler:

  • Respects robots.txt.
  • Reads sitemap.xml when present.
  • Applies per-host jitter and adaptive back-off so you don't get banned.
  • Returns an array of per-page results plus crawl_meta (pages visited / skipped / failed).

Quota safety: max_pages is silently capped at floor(remaining_monthly_quota × 0.25) up front; with 1,000 units remaining, for example, the effective cap is 250 pages. Unused units are refunded immediately. Pass "allow_large_crawl": true to remove the cap.

See POST /crawl for the full reference.

Picking Between Them

Scenario | Use
You have a known list of URLs | /batch
You have one URL and want to discover related pages | /crawl
You're paginating a single feed (?page=1..N) | Build the URLs and use /batch - cheaper and simpler (see the sketch below)
Site is JS-heavy and the link graph is hard to predict | /crawl with render_js: "always"
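For the pagination case you can generate the page URLs yourself and send them in a single /batch call. A minimal sketch using seq and jq; the ?page= feed URL and the /v1/batch path are assumptions for illustration:

bash
# Sketch: build ?page=1..10 URLs and send them to /batch in one request.
# The feed URL and the /v1/batch path are placeholders - swap in your own.
body=$(seq 1 10 \
  | jq -R '"https://example.com/feed?page=" + .' \
  | jq -s '{
      urls: .,
      schema: [ { field: "title", type: "string", example: "Example title" } ]
    }')

curl -X POST https://api.scrapewithruno.com/v1/batch \
  -H "X-API-Key: $RUNO_API_KEY" \
  -H "Content-Type: application/json" \
  -d "$body"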

Example: Blog Archive

bash
curl -X POST https://api.scrapewithruno.com/v1/crawl \
  -H "X-API-Key: $RUNO_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "seed_url": "https://example.com/blog",
    "schema": [
      { "field": "title", "type": "string", "example": "Example post" },
      { "field": "publishedAt", "type": "date", "example": "2024-12-20" }
    ],
    "crawl": {
      "follow_pattern": "https://example.com/blog/*",
      "max_pages": 25,
      "max_depth": 2
    }
  }'

Returns:

json
{
  "seed_url": "https://example.com/blog",
  "results": [
    { "url": "https://example.com/blog/post-1", "status": "success", "data": { "title": "...", "publishedAt": "2024-11-01" } },
    { "url": "https://example.com/blog/post-2", "status": "success", "data": { "title": "...", "publishedAt": "2024-11-08" } }
  ],
  "crawl_meta": {
    "pages_visited": 17,
    "pages_skipped": 3,
    "pages_failed": 0,
    "cancelled": false
  }
}
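The results array is easy to post-process with standard tools. A minimal jq sketch, assuming the response above has been saved to a local file named crawl-response.json (a hypothetical filename):

bash
# Sketch: keep only successfully crawled pages and print their titles.
jq -r '.results[] | select(.status == "success") | .data.title' crawl-response.json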
