---
name: opencrawl
version: 1.0
url: https://opencrawl.opensora.store
description: Free web toolkit — sitemap discovery, page crawl, SEO audit, GEO check, design clone, structured extract.
---

# OpenCrawl Skill

Use this skill when the user wants to scrape a webpage, audit a URL for SEO/GEO,
extract structured data, discover all URLs on a domain, or pull a site's design system.

OpenCrawl runs every tool from a single backend with a shared cache: invoking any
single-page tool (Crawl / SEO / Clone) on a URL makes the other two return in
~200ms on the next call.

## Choose the right tool

| User intent | Tool | Endpoint |
|---|---|---|
| "Get me the text content of this page" | Crawl | `POST /api/fetch_one` |
| "List every URL on this website" | Sitemap | `GET /api/sitemap?domain=X` |
| "Audit this URL for SEO problems" | SEO | `POST /api/seo` |
| "Check if this page is AI-search ready" | GEO | _coming soon_ |
| "Extract the colors/fonts of this site" | Clone | `POST /api/clone` |
| "Scrape product cards / list results into JSON" | Extract | `POST /api/extract` |

## Auth

Every endpoint requires a Bearer token. Get one free at `https://opencrawl.opensora.store/sign-in`
(Google OAuth). Put it in the `OPENCRAWL_TOKEN` env var:

```bash
export OPENCRAWL_TOKEN="paste-token-here"
```

All requests:

```bash
curl -H "Authorization: Bearer $OPENCRAWL_TOKEN" \
     -H "Content-Type: application/json" \
     -X POST https://opencrawl.opensora.store/api/...
```

## Endpoints

### 1. Sitemap — discover every URL on a domain

```bash
curl -H "Authorization: Bearer $OPENCRAWL_TOKEN" \
     "https://opencrawl.opensora.store/api/sitemap?domain=stripe.com"
```

Returns: `{ domain, sitemaps[], total_urls, urls: [{ url, lastmod, priority }] }`.

Discovery checks robots.txt, 15 common fallback paths, and nested sitemap indexes.
Use this when the user asks "what pages does X have", or before running Crawl across
a whole site.
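A common follow-up is filtering the returned URL list by path keyword (e.g. to find all pricing or blog pages). A small sketch against the response shape above; the sample payload in the usage note is invented:

```python
def filter_sitemap_urls(sitemap: dict, keyword: str) -> list[str]:
    """Pick URLs from a /api/sitemap response whose URL contains a keyword.

    `sitemap` is the parsed JSON response: {"urls": [{"url": ...}, ...]}.
    Matching is case-insensitive.
    """
    kw = keyword.lower()
    return [
        entry["url"]
        for entry in sitemap.get("urls", [])
        if kw in entry["url"].lower()
    ]
```

For example, `filter_sitemap_urls(resp, "pricing")` on a response containing `/pricing` and `/blog` entries returns only the pricing URL.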

### 2. Crawl — single-page content extraction

```bash
curl -X POST -H "Authorization: Bearer $OPENCRAWL_TOKEN" \
     -H "Content-Type: application/json" \
     -d '{"url":"https://stripe.com/pricing"}' \
     https://opencrawl.opensora.store/api/fetch_one
```

Returns: `{ url, status, title, word_count, html_size, mhtml_saved, from_cache, ok }`.

Then fetch the content:
- `GET /api/page/markdown?url=...` → clean Markdown body
- `GET /api/page/images?url=...` → `{ count, images: [absolute URLs] }`
- `GET /api/page/mhtml?url=...` → MHTML file with images embedded
- `GET /api/page/bundle?url=...` → ZIP (markdown + mhtml + image-URL list)

Pass `{"refresh": true}` to force re-render (default reads from cache).
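The four content endpoints all take the page URL as a query parameter, which must be percent-encoded. A small sketch that builds the follow-up GET URLs for a page already run through `/api/fetch_one` (`content_urls` is an illustrative helper, not part of the API):

```python
from urllib.parse import urlencode

BASE = "https://opencrawl.opensora.store"

def content_urls(page_url: str) -> dict[str, str]:
    """Build the follow-up GET endpoints for a crawled page.

    Returns a mapping of format name -> full URL, with the page URL
    percent-encoded into the `url` query parameter.
    """
    q = urlencode({"url": page_url})
    return {
        fmt: f"{BASE}/api/page/{fmt}?{q}"
        for fmt in ("markdown", "images", "mhtml", "bundle")
    }
```

`urlencode` handles the escaping, so `https://stripe.com/pricing` becomes `url=https%3A%2F%2Fstripe.com%2Fpricing`.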

### 3. SEO — single-page audit report

```bash
curl -X POST -H "Authorization: Bearer $OPENCRAWL_TOKEN" \
     -H "Content-Type: application/json" \
     -d '{"url":"https://stripe.com/pricing"}' \
     https://opencrawl.opensora.store/api/seo
```

Returns a 10-section report: `basics` (title/desc/canonical/lang/charset),
`headings` (h1..h6 with full text), `social` (full OG + Twitter), `i18n`
(hreflang), `links` (internal/external counts + samples), `schema` (JSON-LD
types + items + microdata/RDFa flags), `media` (img/video/iframe + alt-text
coverage), `content` (word_count, html_size, text-to-HTML ratio), `tech`
(doctype/AMP/scripts), `robots_meta` (index/follow/noarchive).

Use for: SEO audit, competitor analysis, finding missing meta tags.
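When the user wants "problems", not the raw report, the sections above can be reduced to a short issue list locally. An illustrative sketch: the exact key names (`title`, `desc`, the `h1` list) and the thresholds are assumptions, so adjust them to the actual response:

```python
def seo_issues(report: dict) -> list[str]:
    """Flag common problems in a /api/seo report.

    Key names and thresholds here are illustrative, not authoritative.
    """
    issues = []
    basics = report.get("basics", {})
    if not basics.get("title"):
        issues.append("missing <title>")
    if not basics.get("desc"):
        issues.append("missing meta description")
    h1s = report.get("headings", {}).get("h1", [])
    if len(h1s) != 1:
        issues.append(f"expected exactly one h1, found {len(h1s)}")
    if report.get("content", {}).get("word_count", 0) < 300:
        issues.append("thin content (<300 words)")
    return issues
```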

### 4. Clone — design system extraction

```bash
curl -X POST -H "Authorization: Bearer $OPENCRAWL_TOKEN" \
     -H "Content-Type: application/json" \
     -d '{"url":"https://linear.app"}' \
     https://opencrawl.opensora.store/api/clone
```

Returns: `{ url, screenshot_available, mhtml_available, css_count, tokens, spec }`.

`spec` has clustered design tokens: `{ colors: { primary, accents[], neutrals[], semantics }, typography: { families[], scale[] }, spacing: { base, scale[] }, radii, shadows }`.

Exports:
- `GET /api/clone/export?url=X&format=tailwind` → `tailwind.config.js`
- `GET /api/clone/export?url=X&format=css` → CSS variables
- `GET /api/clone/export?url=X&format=tokens` → JSON tokens
- `GET /api/clone/zip?url=X` → full ZIP (HTML + CSS + tokens + screenshot)
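If you only need a couple of tokens rather than a full export, the `spec` object can be flattened into CSS custom properties client-side. A sketch that mirrors what `format=css` presumably produces; the token shapes are taken from the `spec` description above, and the variable names are my own convention:

```python
def spec_to_css_vars(spec: dict) -> str:
    """Render a Clone `spec` into CSS custom properties.

    Handles a subset of the spec (primary color, accents, spacing scale);
    the shapes are assumed from the documented response, not verified.
    """
    lines = [":root {"]
    colors = spec.get("colors", {})
    if "primary" in colors:
        lines.append(f"  --color-primary: {colors['primary']};")
    for i, c in enumerate(colors.get("accents", [])):
        lines.append(f"  --color-accent-{i}: {c};")
    for i, step in enumerate(spec.get("spacing", {}).get("scale", [])):
        lines.append(f"  --space-{i}: {step};")
    lines.append("}")
    return "\n".join(lines)
```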

### 5. Extract — structured field extraction (Scrapling-powered)

```bash
curl -X POST -H "Authorization: Bearer $OPENCRAWL_TOKEN" \
     -H "Content-Type: application/json" \
     -d '{
       "url": "https://news.ycombinator.com/",
       "item_selector": "tr.athing",
       "fields": [
         {"name":"title","selector":"span.titleline a"},
         {"name":"url","selector":"span.titleline a","attr":"href"}
       ],
       "mode": "http",
       "limit": 10
     }' \
     https://opencrawl.opensora.store/api/extract
```

Returns: `{ url, status, item_count, duration_ms, mode_used, items: [...] }`.

**Field spec:**
- `name` → output key
- `selector` → CSS selector relative to each item
- `attr` → optional, e.g. `"href"`, `"src"` — defaults to `.get_all_text()`
- `multiple` → `true` for list-valued fields

**Modes:**
- `http` (default) — fastest, no browser (~200ms-1s)
- `browser` — Playwright render for JS-heavy SPAs (~5-10s)
- `stealth` — anti-bot evasion for Cloudflare-protected sites (~10-15s)
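A reasonable strategy is to start with the cheapest mode and escalate only when needed. A tiny illustrative heuristic (the decision inputs are assumptions; the mode strings match the list above):

```python
def pick_mode(js_heavy: bool = False, bot_protected: bool = False) -> str:
    """Choose an /api/extract mode, starting cheap and escalating.

    stealth (anti-bot) > browser (JS render) > http (fastest).
    """
    if bot_protected:
        return "stealth"
    if js_heavy:
        return "browser"
    return "http"
```

In practice you can also escalate reactively: try `http`, and if `item_count` comes back 0, retry the same body with `mode: "browser"`.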

**Templates** — pre-configured selectors for common targets:

```bash
curl -H "Authorization: Bearer $OPENCRAWL_TOKEN" \
     https://opencrawl.opensora.store/api/extract/templates
```

Returns 5 starter templates (Hacker News, GitHub Trending, Product Hunt, Reddit subreddit, generic blog). Each gives `{ key, label, default_url, mode, item_selector, fields[] }` — drop into the `/api/extract` body and tweak the URL.

## Choosing between Crawl and Extract

- **Crawl** = "give me the whole page text" → user wants to read it / feed to RAG.
- **Extract** = "give me these specific fields as rows" → user wants to do data analysis.

If the user asks for "all the products on this page", that's **Extract** (structured).
If they ask for "the article content", that's **Crawl** (Markdown).

## Errors

- `401` — bad/missing token. Get a new one at `https://opencrawl.opensora.store/sign-in`.
- `404` — endpoint or page not found.
- `429` — rate limited (free tier: ~30 req/min).
- `500` with `detail: "fetch failed: ..."` — target URL unreachable / blocked.

## Worked example: "find competitor pricing pages"

```bash
# 1. discover URLs (jq -r strips JSON quotes so the URLs are usable below)
curl -H "Authorization: Bearer $OPENCRAWL_TOKEN" \
  "https://opencrawl.opensora.store/api/sitemap?domain=competitor.com" \
  | jq -r '.urls[].url' | grep -i pricing > pricing-urls.txt

# 2. for each pricing URL, run Crawl (warms the shared cache), then SEO
while read -r url; do
  curl -s -H "Authorization: Bearer $OPENCRAWL_TOKEN" -X POST \
       -H "Content-Type: application/json" -d "{\"url\":\"$url\"}" \
       https://opencrawl.opensora.store/api/fetch_one > /dev/null
  curl -s -H "Authorization: Bearer $OPENCRAWL_TOKEN" -X POST \
       -H "Content-Type: application/json" -d "{\"url\":\"$url\"}" \
       https://opencrawl.opensora.store/api/seo > "seo-$(basename "$url").json"
done < pricing-urls.txt
```
