CLI reference

Every WebReaper subcommand, its flags, and the output it produces.

The webreaper binary exposes four subcommands: scrape for a single page, map to discover URLs, crawl for a whole site, and init to wire up the Claude Code skill. Markdown goes to stdout by default; pass a schema or an LLM prompt to get JSON instead. Update hints and progress messages are written to stderr, so redirecting stdout to a file stays clean.

scrape

Turn a single page into clean, LLM-ready Markdown printed to stdout:

webreaper scrape https://news.ycombinator.com

Write the Markdown to a file instead:

webreaper scrape https://news.ycombinator.com --output page.md

Extract structured data with a JSON schema. The result is JSON; a multi-page run emits JSON Lines, one object per page:

webreaper scrape https://news.ycombinator.com --schema schema.json

Extract with an LLM and a natural-language prompt instead of a schema. Bring any OpenAI-compatible endpoint (OpenAI, Ollama, OpenRouter, and so on); the API key is read from the WEBREAPER_LLM_API_KEY environment variable (or OPENAI_API_KEY), never from a flag:

webreaper scrape https://example.com --prompt "title and author" \
  --model gpt-4o-mini --llm-url https://api.openai.com/v1
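When the CLI is driven from a script, the same rule applies: the key belongs in the environment, not in the argument list. The following is a minimal Python sketch of that pattern, not part of WebReaper itself. The helper names (llm_scrape_command, llm_env, run_llm_scrape) are hypothetical; only the flags and the WEBREAPER_LLM_API_KEY variable come from this page.

```python
import os
import subprocess
from typing import Dict, List, Optional


def llm_scrape_command(url: str, prompt: str, model: str, llm_url: str) -> List[str]:
    """Build the argv for an LLM-backed scrape. The API key is deliberately absent."""
    return [
        "webreaper", "scrape", url,
        "--prompt", prompt,
        "--model", model,
        "--llm-url", llm_url,
    ]


def llm_env(api_key: str, base_env: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return a copy of the environment with the key set for the child process."""
    env = dict(os.environ if base_env is None else base_env)
    env["WEBREAPER_LLM_API_KEY"] = api_key
    return env


def run_llm_scrape(url: str, prompt: str, model: str, llm_url: str, api_key: str) -> str:
    """Run the scrape and return whatever the CLI printed to stdout."""
    result = subprocess.run(
        llm_scrape_command(url, prompt, model, llm_url),
        env=llm_env(api_key),
        capture_output=True,  # stderr carries progress, stdout carries the result
        text=True,
        check=True,
    )
    return result.stdout
```

Keeping the key out of argv means it does not show up in shell history or in process listings.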

Render JavaScript-heavy pages through a real browser, and let the CLI escalate to a stealth browser when it detects a bot-check:

webreaper scrape https://example.com --browser --auto-stealth

Key flags:

  • --output <file>: write Markdown to a file rather than stdout.
  • --schema <file>: extract JSON using the given schema (JSON Lines for many pages).
  • --prompt "<text>": schema-free LLM extraction, one call per page; needs --model and --llm-url.
  • --infer ["<goal>"]: infer a schema once with an LLM, then extract deterministically (cheap for many pages).
  • --model <id> / --llm-url <url>: the model and OpenAI-compatible endpoint for --prompt / --infer.
  • --output-dir <dir>: write one file per page (.md in Markdown mode, .json otherwise) instead of stdout; --open reveals the folder.
  • --browser: render the page with a real browser for JS-heavy sites.
  • --auto-stealth: escalate to a stealth browser automatically on a bot-check.
  • --no-auto-stealth: disable the escalation heuristic.

map

Discover URLs on a site without scraping their contents. Filter by a path substring and cap how many you collect:

webreaper map https://example.com --search /blog/ --max-urls 50

This prints the matching URLs, which is handy for feeding a follow-up scrape or crawl run, or for understanding a site's structure first.
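As an illustration of feeding map into a follow-up scrape, here is a small Python sketch. It assumes map prints one URL per line on stdout, which this page does not spell out, and the helper names and the page-NNN.md file naming are hypothetical. It only builds the scrape commands; running them is left to the caller.

```python
from typing import List


def parse_map_output(text: str) -> List[str]:
    """Collect URLs from map output, assuming one URL per line.

    Lines that do not look like http(s) URLs are ignored, and duplicates
    are dropped while keeping first-seen order.
    """
    urls: List[str] = []
    seen = set()
    for line in text.splitlines():
        url = line.strip()
        if url.startswith(("http://", "https://")) and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


def scrape_commands(urls: List[str], out_dir: str = "pages") -> List[List[str]]:
    """Build one `webreaper scrape <url> --output <file>` argv per URL."""
    return [
        ["webreaper", "scrape", url, "--output", f"{out_dir}/page-{i:03d}.md"]
        for i, url in enumerate(urls, start=1)
    ]
```

Each returned list can be passed to subprocess.run as-is; the sketch does not create out_dir, so the caller should make sure the directory exists.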

crawl

Walk every on-domain page and emit JSON Lines, one record per page. Each record carries the URL, the page title, and the page as Markdown:

webreaper crawl https://example.com > pages.jsonl

Each line looks like:

{"url": "https://example.com/about", "title": "About", "markdown": "# About\n..."}

Because the output is JSON Lines, you can stream it straight into jq, a database loader, or an embedding pipeline.
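A consumer on the Python side can be equally small. The sketch below reads records with the url, title, and markdown fields shown above; read_pages is a hypothetical helper name, and it accepts any iterable of lines, so an open pages.jsonl file or sys.stdin both work.

```python
import json
from typing import Dict, Iterable, Iterator


def read_pages(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield one dict per JSON Lines record, skipping blank lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate empty lines between records
        yield json.loads(line)


if __name__ == "__main__":
    sample = [
        '{"url": "https://example.com/about", "title": "About", "markdown": "# About\\n..."}',
    ]
    for page in read_pages(sample):
        # Each record carries the URL, the title, and the page as Markdown.
        print(page["url"], page["title"], len(page["markdown"]))
```

Because read_pages is a generator, it handles one record at a time and never holds the whole crawl in memory.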

crawl can also extract fields across the whole site with an LLM: --prompt "<text>" makes one call per page, while --infer "<goal>" infers a schema once and then extracts the remaining pages deterministically, for about one call in total. crawl --prompt asks for confirmation before a large run; pass --yes to skip the confirmation. Add --output-dir <dir> to write one file per page instead of a single JSON Lines stream.

init

Write the bundled Claude Code skill into the current project:

webreaper init

This writes the skill file to .claude/skills/webreaper/SKILL.md. After that, Claude Code routes scraping requests to the CLI on its own. See the Claude Code skill page for details.

Output shapes at a glance

  • scrape with no schema: Markdown to stdout (or to --output).
  • scrape --schema: JSON for one page, JSON Lines for many.
  • scrape --prompt / --infer: LLM-extracted JSON (JSON Lines for many pages).
  • map: a list of discovered URLs.
  • crawl: JSON Lines of {url, title, markdown}, one per page.
  • --output-dir <dir>: one file per page (.md in Markdown mode, .json otherwise).
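Since extraction output is a single JSON document for one page and JSON Lines for many, a script that consumes it may want to accept both. The Python sketch below shows one way to do that, assuming the single-page result is one JSON value and the multi-page result has one JSON value per line; load_records is a hypothetical helper, not part of the CLI.

```python
import json
from typing import Any, List


def load_records(text: str) -> List[Any]:
    """Return a list of records from either a JSON document or JSON Lines."""
    text = text.strip()
    if not text:
        return []
    try:
        # One page: the whole output is a single JSON value.
        return [json.loads(text)]
    except json.JSONDecodeError:
        # Many pages: fall back to one JSON value per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
```

The single-document attempt comes first so that a pretty-printed result spanning several lines is not mistaken for JSON Lines.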