CLI reference
Every WebReaper subcommand, its flags, and the output it produces.
The webreaper binary exposes four subcommands: scrape for a single page,
map to discover URLs, crawl for a whole site, and init to wire up the
Claude Code skill. Markdown goes to stdout by default; pass a schema or an LLM
prompt to get JSON.
Update hints and progress write to stderr, so piping to a file stays clean.
scrape
Turn a single page into clean, LLM-ready Markdown printed to stdout:
webreaper scrape https://news.ycombinator.comWrite the Markdown to a file instead:
webreaper scrape https://news.ycombinator.com --output page.mdExtract structured data with a JSON schema. The result is JSON; a multi-page run emits JSON Lines, one object per page:
webreaper scrape https://news.ycombinator.com --schema schema.jsonExtract with an LLM instead of a schema, by natural-language prompt. Bring any
OpenAI-compatible endpoint (OpenAI, Ollama, OpenRouter, and so on); the API key
is read from WEBREAPER_LLM_API_KEY (or OPENAI_API_KEY), never a flag:
webreaper scrape https://example.com --prompt "title and author" \
--model gpt-4o-mini --llm-url https://api.openai.com/v1Render JavaScript-heavy pages through a real browser, and let the CLI escalate to a stealth browser when it detects a bot-check:
webreaper scrape https://example.com --browser --auto-stealthKey flags:
--output <file>write Markdown to a file rather than stdout.--schema <file>extract JSON using the given schema (JSON Lines for many pages).--prompt "<text>"schema-free LLM extraction, one call per page; needs--modeland--llm-url.--infer ["<goal>"]infer a schema once with an LLM, then extract deterministically (cheap for many pages).--model <id>/--llm-url <url>the model and OpenAI-compatible endpoint for--prompt/--infer.--output-dir <dir>write one file per page (.mdin Markdown mode,.jsonotherwise) instead of stdout;--openreveals the folder.--browserrender the page with a real browser for JS-heavy sites.--auto-stealthescalate to a stealth browser automatically on a bot-check.--no-auto-stealthdisable the escalation heuristic.
map
Discover URLs on a site without scraping their contents. Filter by a path substring and cap how many you collect:
webreaper map https://example.com --search /blog/ --max-urls 50This prints the matching URLs, which is handy for feeding a follow-up scrape
or crawl run, or for understanding a site's structure first.
crawl
Walk every on-domain page and emit JSON Lines, one record per page. Each record carries the URL, the page title, and the page as Markdown:
webreaper crawl https://example.com > pages.jsonlEach line looks like:
{"url": "https://example.com/about", "title": "About", "markdown": "# About\n..."}Because the output is JSON Lines, you can stream it straight into jq, a
database loader, or an embedding pipeline.
Extract fields across the whole site with an LLM too: --prompt "<text>" (one
call per page) or --infer "<goal>" (infer a schema once, then extract the rest
deterministically, about one call total). crawl --prompt asks before a large
run; pass --yes to skip. Add --output-dir <dir> to write one file per page
instead of a single JSON Lines stream.
init
Write the bundled Claude Code skill into the current project:
webreaper initThis drops a SKILL.md at .claude/skills/webreaper/SKILL.md. After that,
Claude Code routes scraping requests to the CLI on its own. See the
Claude Code skill page for details.
Output shapes at a glance
scrapewith no schema: Markdown to stdout (or to--output).scrape --schema: JSON for one page, JSON Lines for many.scrape --prompt/--infer: LLM-extracted JSON (JSON Lines for many pages).map: a list of discovered URLs.crawl: JSON Lines of{url, title, markdown}, one per page.--output-dir <dir>: one file per page (.mdin Markdown mode,.jsonotherwise).