Crawl & scrape demos
Five tasks that extract structured lists from live websites. Together they cover the patterns you'll use for most scraping workloads.
hackernews-top
Front-page Hacker News: rank, title, URL, and site for every story.
sequenceDiagram
    participant C as Client
    participant S as webtasks
    participant HN as news.ycombinator.com
    C->>S: POST /tasks/crawl/hackernews-top
    S->>HN: goto
    S->>HN: wait until tr.athing
    S->>HN: extract (repeat)
    S-->>C: JSON array of stories
Concepts: extract … repeat, CSS selectors, attr fields.
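The client side of the diagram above can be sketched in Python with only the standard library. This is an illustrative sketch, not part of webtasks: the helper names `build_request` and `run_task` are hypothetical, the base URL is taken from the curl examples on this page, and the `Content-Type` header is an assumption (the curl examples do not set one).

```python
import json
import urllib.request

# Assumption: same host and port as the curl examples on this page.
BASE_URL = "http://localhost:8765"


def build_request(task, inputs=None, base_url=BASE_URL):
    """Build the POST request for a task such as 'crawl/hackernews-top'."""
    body = json.dumps(inputs or {}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/tasks/{task}",
        data=body,
        # Assumption: the server accepts (or ignores) a JSON content type.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_task(task, inputs=None, base_url=BASE_URL, timeout=30):
    """Send the request and decode the JSON response body."""
    request = build_request(task, inputs, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server running, `run_task("crawl/hackernews-top")` should return the JSON array of stories shown as the last step of the diagram.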
github-trending
Trending repositories with language and time-period inputs.
curl -s -X POST localhost:8765/tasks/crawl/github-trending -d '{}'
curl -s -X POST localhost:8765/tasks/crawl/github-trending -d '{"language":"go","since":"weekly"}'
task "crawl/github-trending"
  pool default
  timeout 20000
  transport rest
  input language string default ""
  input since string default "daily"
  goto "https://github.com/trending/{{language}}?since={{since}}"
  wait until "article.Box-row" timeout 15000
  extract repos from "article.Box-row" repeat
    slug text "h2 a"
    href attr href on "h2 a"
    desc text "p"
    stars text "a[href$='/stargazers']" trim
  end
end
Concepts: optional inputs with defaults, URL path templating.
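To see how the defaults and the `{{…}}` placeholders combine, here is a small Python model of the substitution. It is a sketch of the concept only, not how webtasks implements it: the `render` function is hypothetical, and it does no URL-encoding, which the real engine may or may not do.

```python
import re

# Declared inputs and defaults, copied from the task definition above.
DEFAULTS = {"language": "", "since": "daily"}
TEMPLATE = "https://github.com/trending/{{language}}?since={{since}}"


def render(template, inputs, defaults=DEFAULTS):
    """Replace each {{name}} with the caller's value, falling back to the default."""
    values = {**defaults, **inputs}  # caller-supplied inputs override defaults

    def substitute(match):
        # Raises KeyError if the template names an undeclared input.
        return str(values[match.group(1)])

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)
```

Under this model, an empty body `{}` resolves to `https://github.com/trending/?since=daily`, and `{"language":"go","since":"weekly"}` resolves to `https://github.com/trending/go?since=weekly`.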
wikipedia-toc
Wikipedia table of contents — mixes single-object and repeated extraction in one task.
Concepts: one extract for a single record, another with repeat for a list in the same flow.
trending-papers
The canonical smoke test: about 100 trending papers from Hugging Face. Use it to verify a fresh deployment.
curl -s http://127.0.0.1:8765/health
curl -s -X POST localhost:8765/tasks/crawl/trending-papers -d '{}'
# expect ~100 papers with title + href
Concepts: production smoke test, complex selector lists.
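If you want to automate the "expect ~100 papers with title + href" check, something like the following Python sketch could validate the decoded response. It assumes the response body is a JSON array of objects with `title` and `href` keys; the `check_papers` helper and the threshold of 90 are illustrative choices, not values defined by webtasks.

```python
def check_papers(papers, minimum=90):
    """Return a list of problems; an empty list means the smoke test passed."""
    if not isinstance(papers, list):
        return [f"expected a JSON array, got {type(papers).__name__}"]

    problems = []
    # Assumption: "~100" tolerates a few missing entries, so 90 is the floor.
    if len(papers) < minimum:
        problems.append(f"only {len(papers)} papers, expected at least {minimum}")

    for index, paper in enumerate(papers):
        for field in ("title", "href"):
            # A field counts as missing if it is absent or empty.
            if not isinstance(paper, dict) or not paper.get(field):
                problems.append(f"paper {index} is missing {field!r}")
    return problems
```

Feed it the parsed output of the POST request above and fail the deployment check if the returned list is non-empty.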
quotes-paginated
Multi-page scraping against quotes.toscrape.com — follow "next" links until there are none.
Concepts: pagination loops, following links — a pattern you'll adapt for any paginated site. See Control flow → loop.
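The follow-until-no-next pattern is independent of the task language. As a rough Python illustration (not the webtasks loop syntax), the sketch below takes a hypothetical `fetch(url)` callable that returns a page's items plus its "next" URL, or `None` on the last page. The visited set and the `max_pages` cap are added safeguards, not something the demo is known to include.

```python
def crawl_pages(start_url, fetch, max_pages=50):
    """Collect items page by page until a page has no next link.

    `fetch(url)` is a hypothetical callable returning (items, next_url),
    with next_url set to None on the last page.
    """
    items = []
    seen = set()
    url = start_url
    # Stop on: no next link, a link already visited (cycle), or the page cap.
    while url is not None and url not in seen and len(seen) < max_pages:
        seen.add(url)
        page_items, url = fetch(url)
        items.extend(page_items)
    return items
```

Passing `fetch` in as a parameter keeps the loop testable against an in-memory dictionary of pages before pointing it at a live site.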
Selector tips
| Goal | Pattern |
|---|---|
| Every repeated row | tr.athing, article.Box-row |
| Field inside a row | .titleline > a |
| Attribute | attr href on ".titleline > a" |
| Trim whitespace | add trim after the field |
Full guide: Writing tasks.
What's next?
- Search demos — caller-driven queries
- Interaction demos — forms and scrolling