Actions DSL

The actions field is a tree. Leaves are selectors; branches are sub-trees that recurse into matched elements; fn nodes are imperative steps (navigate, click, fill, etc.).

Every selector is <scheme>=<value>@<attribute>.

css=h1@text → text of the first <h1>
xpath=//*[@id='x']@html → outer HTML of the element
css=img@src → absolutized src URL of the first <img>
"css=div[data-x='@']"@text → quoting lets you put '@' inside the selector
  • Scheme: css (default if omitted) or xpath.
  • Selector: any valid CSS selector or XPath expression. Wrap the selector in quotes if it contains @, as in the last example above.
  • Attribute: what to extract. A name that is not one of the special families below is read as the raw DOM attribute of that name.
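
The grammar above can be sketched as a small parser. This is only an illustration under stated assumptions: the attribute separator is taken to be the last @ outside a quoted selector, and a missing attribute is returned as None. The names parse_selector and ParsedSelector are hypothetical and are not part of the documented API; the real parser may treat edge cases differently.

```python
from typing import NamedTuple, Optional


class ParsedSelector(NamedTuple):
    scheme: str               # "css" or "xpath"
    value: str                # the CSS selector or XPath expression
    attribute: Optional[str]  # extractor name, or None if absent


def parse_selector(raw: str) -> ParsedSelector:
    """Split '<scheme>=<value>@<attribute>' into parts (illustrative only)."""
    attribute = None
    if raw.startswith('"'):
        # Quoted form: everything up to the closing quote is the selector,
        # so an '@' inside it is not mistaken for the attribute separator.
        end = raw.index('"', 1)
        body, rest = raw[1:end], raw[end + 1:]
        if rest.startswith("@"):
            attribute = rest[1:]
    else:
        # Unquoted form: ASSUMPTION that the last '@' separates the attribute.
        body, sep, tail = raw.rpartition("@")
        if sep:
            attribute = tail
        else:
            body = raw
    # The scheme prefix is optional and defaults to css.
    for scheme in ("css", "xpath"):
        if body.startswith(scheme + "="):
            return ParsedSelector(scheme, body[len(scheme) + 1:], attribute)
    return ParsedSelector("css", body, attribute)
```

Under these assumptions, parse_selector("xpath=//*[@id='x']@html") yields the scheme xpath, the value //*[@id='x'], and the attribute html, while the quoted form keeps an @ that belongs to the selector itself.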

| Family | Returns | Notes |
| --- | --- | --- |
| text / textOriginal | string | Visible text content. |
| html / innerHtml | string | Inner HTML of the element. |
| value | string \| null | Input value (<input value>). Browser engine only. |
| tagName | string | E.g. "div", "a". |
| cssSelector | string | A unique CSS selector for the matched element. |
| json / json:<path> | unknown | Parses the element’s text content as JSON. Optional <path> drills in. See JSON paths. |
| attributes | Record<string, string> | All DOM attributes as a map. |
| href / hrefOriginal | string | href absolutized against the page URL (hrefOriginal is the raw value). |
| src / srcOriginal | string | src absolutized; Original is the raw value. Mirrors poster / posterOriginal. |
| srcset / srcsetOriginal | { url, size }[] | Parsed srcset; Original keeps the raw URLs. |
| table / tableJson / tableArray | Record<string,string>[] or string[][] | Pick array form via options.headers = false. |
| md / fitMd | string | Element rendered as Markdown. fitMd strips boilerplate aggressively. |
| screenshot | string | Browser engine only; PNG. Public URL by default, data URL via returnType: 'dataUrl'. See options.fullPage. |
| anything else | string \| null | Returns the named DOM attribute via getAttribute(...). |

context / contextAlpha / ocr / htmlTree are not yet supported in v2.
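
To make the difference between the absolutized and Original variants concrete, here is a rough sketch using Python's standard urljoin. It assumes absolutization follows ordinary relative-URL resolution against the page URL, and that srcset candidates are comma-separated "url size" pairs. PAGE_URL, absolutize, and parse_srcset are hypothetical names for illustration; the service's own handling may differ, for example for URLs that contain commas.

```python
from urllib.parse import urljoin

PAGE_URL = "https://example.com/blog/post"  # hypothetical page URL


def absolutize(raw, page_url=PAGE_URL):
    """Resolve a raw href/src value against the page URL."""
    return urljoin(page_url, raw)


def parse_srcset(raw, page_url=PAGE_URL, keep_original=False):
    """Split a srcset string into {url, size} entries (naive comma split)."""
    entries = []
    for candidate in raw.split(","):
        parts = candidate.split()
        if not parts:
            continue  # skip empty candidates such as a trailing comma
        url = parts[0] if keep_original else urljoin(page_url, parts[0])
        size = parts[1] if len(parts) > 1 else None
        entries.append({"url": url, "size": size})
    return entries
```

With the page URL above, absolutize("/img/a.png") gives https://example.com/img/a.png, whereas keep_original=True leaves each srcset URL exactly as written in the markup.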

{
  "stories": {
    "selector": "css=.athing",
    "many": true,
    "output": {
      "title": "css=.titleline > a@text",
      "link": "css=.titleline > a@href",
      "rank": "css=.rank@text"
    }
  }
}
  • many: true — apply output once per matched element; returns an array.
  • many: false (default) — apply to the first match; returns one object.
  • output can be a string (treated as <attribute> on the matched element) or a nested branch.

Nested branches operate relative to the matched element, not the document root — useful for scoping repeated structures.
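
The scoping rule can be illustrated with a toy evaluator. This is a sketch, not the service's implementation: it uses Python's xml.etree.ElementTree paths in place of css= selectors, supports only text and plain attributes, and the names DOC, extract, and run_branch are hypothetical.

```python
import xml.etree.ElementTree as ET

# A tiny well-formed document standing in for a real page.
DOC = ET.fromstring("""
<table>
  <tr class="athing"><td class="rank">1.</td><td><a href="/a">First</a></td></tr>
  <tr class="athing"><td class="rank">2.</td><td><a href="/b">Second</a></td></tr>
</table>
""")


def extract(element, leaf):
    """Evaluate '<path>@<attribute>' relative to `element`, not the document root."""
    path, _, attribute = leaf.rpartition("@")
    found = element.find(path)
    if found is None:
        return None
    return found.text if attribute == "text" else found.get(attribute)


def run_branch(root, branch):
    """Apply `output` once per match (many) or to the first match only."""
    many = branch.get("many", False)
    matches = root.findall(branch["selector"])
    if not many:
        matches = matches[:1]
    rows = [
        {key: extract(el, leaf) for key, leaf in branch["output"].items()}
        for el in matches
    ]
    if many:
        return rows
    return rows[0] if rows else None


BRANCH = {
    "selector": ".//tr[@class='athing']",
    "many": True,
    "output": {"title": ".//a@text", "link": ".//a@href", "rank": ".//td[@class='rank']@text"},
}
```

Because every leaf is resolved against its own row, the second row's title can never be paired with the first row's rank. Setting many to False on the same branch returns only the first row as a single object.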

The json:<path> extractor parses an element’s text content as JSON, then drills in. Two syntaxes are supported:

| Form | Returns | Example |
| --- | --- | --- |
| Dot / bracket | the scalar at that path | json:items.0.name, json:items[0].name, json:items["k"] |
| JSONPath | array of all matches (null if zero) | json:items[*].name, json:items.*.name, json:$..price |

A path is treated as JSONPath when it contains *, .., or starts with $. Sloppy forms like items.[*].name are normalized.

{
  "first": "script#__NEXT_DATA__@json:items[0].name", // → "a"
  "names": "script#__NEXT_DATA__@json:items[*].name", // → ["a", "b"]
  "prices": "body@json:$..price" // recursive descent
}
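
The rule for telling the two syntaxes apart, plus a resolver for the simpler dot / bracket form, might look roughly like this. It is a sketch under assumptions: keys are taken not to contain dots, brackets, or quotes, and missing keys simply raise. The names is_jsonpath and resolve_scalar_path are hypothetical; full JSONPath evaluation is out of scope here.

```python
import re


def is_jsonpath(path):
    """A path counts as JSONPath if it contains '*' or '..', or starts with '$'."""
    return "*" in path or ".." in path or path.startswith("$")


def resolve_scalar_path(data, path):
    """Walk a dot / bracket path such as items[0].name or items["k"]."""
    # Tokens are the runs of characters between dots, brackets, and quotes.
    tokens = re.findall(r"[^.\[\]\"']+", path)
    current = data
    for token in tokens:
        if isinstance(current, list):
            current = current[int(token)]  # numeric index into an array
        else:
            current = current[token]       # key lookup in an object
    return current


SAMPLE = {"items": [{"name": "a", "price": 1}, {"name": "b", "price": 2}]}
```

Here resolve_scalar_path(SAMPLE, "items[0].name") and resolve_scalar_path(SAMPLE, "items.0.name") reach the same value, while is_jsonpath("items[*].name") reports that the wildcard form needs the JSONPath treatment.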

When selector ends in @json:<path> and output is a sub-action object, the parent path is fanned out into rows and each row is projected through output. Leaf strings are JSON paths relative to the row; a leaf containing @ would be a DOM extractor and is rejected here.

{
  "images": {
    "selector": "body@json:products[*].images[*]",
    "output": { "pos": "position", "src": "src" }
  }
}
// → [ { "pos": 1, "src": "a.jpg" }, { "pos": 2, "src": "b.jpg" }, … ]
  • Wildcard parent (products[*], images.*.foo, $..items) → array of projected rows.
  • Scalar parent (products[0]) → single projected object.
  • Sub-action output may nest groups; each nested string is still a row-relative JSON path.
  • Leaf paths support the same dot / bracket / JSONPath syntax as the top-level extractor (e.g. output: { allTags: "tags[*].k" }).

Use this instead of parallel many: true extractions when the source is JSON — it avoids the desync risk of zipping parallel arrays.
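
The desync risk is easiest to see with a row that lacks a field. The sketch below contrasts row-wise projection with zipping two independently collected arrays. The sample data and the fan_out helper are hypothetical, and returning None for a missing field is an assumption about how gaps are reported.

```python
def fan_out(rows, output):
    """Project each row through `output`, a map of result key -> row field."""
    return [{key: row.get(field) for key, field in output.items()} for row in rows]


PRODUCTS = [
    {"images": [{"position": 1, "src": "a.jpg"}, {"position": 2}]},  # second image has no src
    {"images": [{"position": 3, "src": "c.jpg"}]},
]

# Equivalent of the wildcard parent products[*].images[*]: one row per image.
ROWS = [image for product in PRODUCTS for image in product["images"]]

# Row-wise projection keeps each position next to its own src (or None).
projected = fan_out(ROWS, {"pos": "position", "src": "src"})

# Parallel extraction collects each field separately, then zips by index.
positions = [row["position"] for row in ROWS if "position" in row]
sources = [row["src"] for row in ROWS if "src" in row]
zipped = list(zip(positions, sources))
```

In the zipped result, position 2 ends up paired with c.jpg, which belongs to position 3, and position 3 is dropped. The projected rows keep the gap visible instead of shifting later values.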

{
  "login": {
    "fn": "fill",
    "selector": "css=input[name=email]",
    "args": "ada@example.com"
  },
  "submit": { "fn": "click", "selector": "css=button[type=submit]" },
  "wait": { "fn": "wait", "args": 1500 }
}

| fn | Args | Engines |
| --- | --- | --- |
| wait | number ms, or omit + selector to wait for it | both |
| goto | URL | browser |
| back / forward | | browser |
| click | none (uses selector) | browser |
| fill | string value (uses selector) | browser |
| selectOption | string or string[] | browser |
| 2fa | { kind: 'otpauth', secret: '…' } | browser |
| evaluate (default if no other arm matches) | inline JS string in args | both (synchronous in html, async race in browser) |

fn nodes don’t produce output keys directly; they sequence side effects. Mix them with extract leaves freely — keys are executed top-to-bottom.
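
Since keys run top-to-bottom, the order in which a client builds the actions object matters. A minimal sketch, assuming the payload is assembled in Python and serialized with the standard json module, which preserves dict insertion order. The login URL and the key names are hypothetical.

```python
import json

# Python dicts keep insertion order, so steps are added in execution order.
actions = {}
actions["open"] = {"fn": "goto", "args": "https://example.com/login"}
actions["email"] = {"fn": "fill", "selector": "css=input[name=email]", "args": "ada@example.com"}
actions["submit"] = {"fn": "click", "selector": "css=button[type=submit]"}
actions["settle"] = {"fn": "wait", "args": 1500}
actions["heading"] = "css=h1@text"  # extract leaf, evaluated after the steps above

payload = json.dumps({"actions": actions}, indent=2)

# Per the text above, fn nodes sequence side effects rather than produce output,
# so only the remaining keys are expected to carry extracted values.
output_keys = [
    key for key, node in actions.items()
    if not (isinstance(node, dict) and "fn" in node)
]
```

Serializing and re-parsing the payload keeps the keys in the order open, email, submit, settle, heading, which is the order the steps are meant to run in.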

Per-action options (set on a leaf or branch):

| Field | Meaning |
| --- | --- |
| timeout | Per-action timeout in ms. Hard cap: 30 s for extracts, 60 s for fn. |
| filter | Regex string. The extracted value must match (.test(value)); else the key is null. |
| match | Regex string with a capture group. Returns the first capture group instead of the raw value. |
| headers | tableJson only. false → 2D array; true (default) → array of objects keyed by header text. |
| excludeTags / includeTags | md / fitMd only. Override the default allow/deny tag lists. |
| returnType | screenshot only. 'url' (default) or 'dataUrl'. |
| fullPage | screenshot only. Full-page screenshot if true. |
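
The filter and match options can be approximated as follows. This sketch makes several assumptions that the text above does not settle: filter is applied before match, a match pattern that finds nothing yields None, and Python's re module stands in for the JavaScript regex engine (the two dialects differ in places). The function name apply_options is hypothetical.

```python
import re


def apply_options(value, filter_re=None, match_re=None):
    """Apply the filter and match options to one extracted string."""
    if value is None:
        return None
    # filter: unanchored test, like JavaScript's regex.test(value).
    if filter_re is not None and re.search(filter_re, value) is None:
        return None
    # match: return the first capture group instead of the raw value.
    if match_re is not None:
        found = re.search(match_re, value)
        return found.group(1) if found else None  # the pattern must define a group
    return value
```

For instance, apply_options("42 points", filter_re=r"points", match_re=r"(\d+)") keeps the value because it passes the filter, then narrows it to "42".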

Top-level options (sibling to actions):

| Field | Meaning |
| --- | --- |
| waitFor | Browser engine. A lifecycle (load / domcontentloaded / networkidle / commit), a number of ms, or a CSS selector to wait for. |
| timeoutMs | Whole-run timeout, capped at 120 s. |

{
  "url": "https://example.com/blog",
  "engine": "browser",
  "options": { "waitFor": "networkidle", "timeoutMs": 30000 },
  "actions": {
    "accept": { "fn": "click", "selector": "css=button:has-text('Accept')" },
    "posts": {
      "selector": "css=article",
      "many": true,
      "output": {
        "title": "css=h2@text",
        "url": "css=h2 a@href",
        "summary": { "selector": "css=p.summary@text", "options": { "match": "Summary:\\s(.+)" } }
      }
    }
  }
}
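
Before sending a request like the one above, a client could sanity-check the top-level options. The sketch below assumes that waitFor is interpreted by type and by membership in the lifecycle list, and that an oversized timeoutMs is clamped rather than rejected; the text only says the value is capped. classify_wait_for and clamp_timeout_ms are hypothetical helper names.

```python
LIFECYCLES = {"load", "domcontentloaded", "networkidle", "commit"}
MAX_RUN_TIMEOUT_MS = 120_000  # whole-run cap of 120 s


def classify_wait_for(value):
    """Guess how a waitFor value would be interpreted."""
    if isinstance(value, bool):
        raise TypeError("waitFor must be a string or a number of ms")
    if isinstance(value, (int, float)):
        return "delay_ms"
    if value in LIFECYCLES:
        return "lifecycle"
    return "selector"  # any other string is treated as a CSS selector


def clamp_timeout_ms(requested):
    """ASSUMPTION: values above the cap are reduced to the cap."""
    return min(requested, MAX_RUN_TIMEOUT_MS)
```

With these helpers, "networkidle" classifies as a lifecycle, 500 as a delay in milliseconds, and a requested timeout of 300000 ms is reduced to 120000 ms.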