
Crawling & Batching

Two endpoints when one URL isn't enough.

/batch - Same Schema, Many URLs

Use when you already have the URL list. Returns one /extract response per URL, in order.

json
{
  "urls": [
    "https://example.com/p/1",
    "https://example.com/p/2",
    "https://example.com/p/3"
  ],
  "schema": [
    { "field": "title", "type": "string", "example": "Example title" },
    { "field": "price", "type": "float",  "example": 19.99 }
  ],
  "options": { "concurrency": 4, "fail_fast": false }
}
Option | Default | Meaning
concurrency | server-tuned | Max parallel fetches inside this batch
fail_fast | false | If true, return as soon as the first URL errors

Cost: 1 request per URL. Failures still count.
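The same body can be sent straight from the shell. A minimal sketch, assuming /batch is served at /v1/batch alongside the /v1/crawl endpoint used further down:

bash
# Sketch: send the batch body above to POST /batch.
# The /v1/batch path is an assumption mirroring the /v1/crawl example below.
curl -X POST https://api.scrapewithruno.com/v1/batch \
  -H "X-API-Key: $RUNO_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": [
      "https://example.com/p/1",
      "https://example.com/p/2",
      "https://example.com/p/3"
    ],
    "schema": [
      { "field": "title", "type": "string", "example": "Example title" },
      { "field": "price", "type": "float",  "example": 19.99 }
    ],
    "options": { "concurrency": 4, "fail_fast": false }
  }'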

See POST /batch for the full reference.

/crawl - Seed URL and a Pattern

Use when you have a starting URL and a pattern.

json
{
  "seed_url": "https://example.com/blog",
  "schema": [
    { "field": "title",       "type": "string", "example": "Example post" },
    { "field": "publishedAt", "type": "date",   "example": "2024-12-20" }
  ],
  "crawl": {
    "follow_pattern": "https://example.com/blog/*",
    "max_pages": 50,
    "max_depth": 2
  }
}
Crawl Option | Notes
follow_pattern | Only links matching this pattern are followed
max_pages | Hard ceiling on pages visited
max_depth | Hops from seed_url
allow_large_crawl | Bypass the 25%-of-quota safety cap (see below)

The crawler:

  • Respects robots.txt.
  • Reads sitemap.xml when present.
  • Applies per-host jitter and adaptive back-off so you don't get banned.
  • Returns an array of per-page results plus crawl_meta (pages visited / skipped / failed).

Quota safety: max_pages is silently capped at floor(remaining_monthly_quota × 0.25) up front; with 1,000 units remaining, for example, the effective cap is 250 pages. Unused units are refunded immediately. Pass "allow_large_crawl": true to remove the cap.

See POST /crawl for the full reference.

Picking Between Them

Scenario | Use
You have a known list of URLs | /batch
You have one URL and want to discover related pages | /crawl
You're paginating a single feed (?page=1..N) | Build the URLs and use /batch - cheaper and simpler (see the sketch below)
Site is JS-heavy and the link graph is hard to predict | /crawl with render_js: "always"
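For the pagination case you can generate the page URLs yourself and send them in a single /batch call. A minimal sketch using seq and jq; the ?page= feed URL and the /v1/batch path are assumptions for illustration:

bash
# Sketch: build ?page=1..10 URLs and send them to /batch in one request.
# The feed URL and the /v1/batch path are placeholders - swap in your own.
body=$(seq 1 10 \
  | jq -R '"https://example.com/feed?page=" + .' \
  | jq -s '{
      urls: .,
      schema: [ { field: "title", type: "string", example: "Example title" } ]
    }')

curl -X POST https://api.scrapewithruno.com/v1/batch \
  -H "X-API-Key: $RUNO_API_KEY" \
  -H "Content-Type: application/json" \
  -d "$body"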

Example: Blog Archive

bash
curl -X POST https://api.scrapewithruno.com/v1/crawl \
  -H "X-API-Key: $RUNO_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "seed_url": "https://example.com/blog",
    "schema": [
      { "field": "title", "type": "string", "example": "Example post" },
      { "field": "publishedAt", "type": "date", "example": "2024-12-20" }
    ],
    "crawl": {
      "follow_pattern": "https://example.com/blog/*",
      "max_pages": 25,
      "max_depth": 2
    }
  }'

Returns:

json
{
  "seed_url": "https://example.com/blog",
  "results": [
    { "url": "https://example.com/blog/post-1", "status": "success", "data": { "title": "...", "publishedAt": "2024-11-01" } },
    { "url": "https://example.com/blog/post-2", "status": "success", "data": { "title": "...", "publishedAt": "2024-11-08" } }
  ],
  "crawl_meta": {
    "pages_visited": 17,
    "pages_skipped": 3,
    "pages_failed": 0,
    "cancelled": false
  }
}
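The results array is easy to post-process with standard tools. A minimal jq sketch, assuming the response above has been saved to a local file named crawl-response.json (a hypothetical filename):

bash
# Sketch: keep only successfully crawled pages and print their titles.
jq -r '.results[] | select(.status == "success") | .data.title' crawl-response.json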
