Error Handling

Steps can fail: an HTTP request times out, an LLM call returns an error, a permission is denied, or a called machine throws an exception. By default, a failure halts the machine and returns an error. The on failure block lets you catch failures and recover: provide a fallback, log the error, notify someone, or retry with a different strategy.

on failure basics

Place on failure at the end of a flow. If any step in that flow fails, execution jumps to the on failure block:

implements
  ask fetch_weather, from: "@mashin/actions/http/get"
    url: "https://api.weather.com/current?city=${input.city}"
    assuming
      body: {temp: 72, conditions: "sunny"}
      status: 200

  compute format
    {temperature: steps.fetch_weather.body.temp, source: "live"}

  on failure
    compute fallback
      {
        temperature: null,
        source: "unavailable",
        error: error.message
      }

If the HTTP request fails (timeout, HTTP 500, DNS failure), execution skips the format step and runs the on failure block. The machine returns the fallback response instead of an error.
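The following Python model makes the control flow concrete. Steps run in order, and the first failure skips the remaining steps and hands an error to the handler. This is a sketch under stated assumptions, not mashin's runtime. `run_flow`, the shape of the context dict, and the rule that the last step's output becomes the result are all hypothetical.

```python
from typing import Any, Callable, Optional

# Hypothetical model: a step is a function that receives a context dict.
Step = Callable[[dict], Any]


def run_flow(steps: list[tuple[str, Step]], on_failure: Optional[Step], inputs: dict) -> Any:
    """Run steps in order; on the first failure, jump to the on_failure handler."""
    results: dict[str, Any] = {}
    output: Any = None
    for name, step in steps:
        try:
            output = step({"input": inputs, "steps": results})
        except Exception as exc:
            if on_failure is None:
                raise  # no handler: the failure halts the machine
            error = {"message": str(exc), "step": name}
            # The remaining steps are skipped; the handler's result is returned.
            return on_failure({"input": inputs, "error": error})
        results[name] = output
    return output  # assumption: the last step's output is the flow's result


def fetch_weather(ctx: dict) -> dict:
    raise TimeoutError("request timed out")  # simulate a failed HTTP call


def format_weather(ctx: dict) -> dict:
    return {"temperature": ctx["steps"]["fetch_weather"]["body"]["temp"], "source": "live"}


def fallback(ctx: dict) -> dict:
    return {"temperature": None, "source": "unavailable", "error": ctx["error"]["message"]}


result = run_flow(
    [("fetch_weather", fetch_weather), ("format", format_weather)],
    fallback,
    {"city": "Austin"},
)
print(result)  # {'temperature': None, 'source': 'unavailable', 'error': 'request timed out'}
```

Passing `None` as the handler reproduces the default behavior: the exception propagates and nothing recovers it.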

Error context

Inside an on failure block, the error is available through these references:

Reference	Description
`error.message`	Human-readable error description
`error.step`	Name of the step that failed
`error.type`	Error category: `"timeout"`, `"permission_denied"`, `"runtime"`
`error.details`	Additional error-specific data

on failure
  compute error_report
    {
      failed_step: error.step,
      error_type: error.type,
      message: error.message,
      status: "failed"
    }
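If it helps to think in types, the error references map onto a record like the one below. This Python sketch is illustrative only. The `StepError` class and `error_report` function are hypothetical, and the field types are assumptions because the reference table does not specify them.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class StepError:
    """Hypothetical stand-in for the error.* references inside on failure."""
    message: str  # error.message: human-readable description
    step: str  # error.step: name of the step that failed
    type: str  # error.type: e.g. "timeout", "permission_denied", "runtime"
    details: dict[str, Any] = field(default_factory=dict)  # error.details


def error_report(error: StepError) -> dict:
    """Builds the same shape as the compute error_report step above."""
    return {
        "failed_step": error.step,
        "error_type": error.type,
        "message": error.message,
        "status": "failed",
    }


err = StepError(message="request timed out", step="fetch_weather", type="timeout")
print(error_report(err))
# {'failed_step': 'fetch_weather', 'error_type': 'timeout', 'message': 'request timed out', 'status': 'failed'}
```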

Fallback to a different model

A common pattern is to try a primary model and fall back to a smaller one if it fails:

machine resilient_classifier

  accepts
    text as text, is required

  responds with
    category as text
    model_used as text

  ensures
    permissions
      allowed to
        llm_call

  implements
    flow classify
      ask primary, using: "anthropic:claude-sonnet-4-6"
        with task "Classify this text.\n\nText: ${input.text}"
        returns
          category as text
        assuming
          category: "general"

      compute result
        {category: steps.primary.category, model_used: "primary"}

      on failure
        ask fallback, using: "anthropic:claude-haiku-4-5"
          with task "Classify this text into a category.\n\nText: ${input.text}"
          returns
            category as text
          assuming
            category: "general"

        compute result
          {category: steps.fallback.category, model_used: "fallback"}
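Stripped of the machine structure, the fallback logic amounts to a try/except around a second model call. In this Python sketch, `resilient_classify`, `flaky_primary`, and `small_model` are hypothetical stand-ins for the two `ask` steps, and no real model is called.

```python
from typing import Callable

# Hypothetical: a classifier maps text to a category string.
Classifier = Callable[[str], str]


def resilient_classify(text: str, primary: Classifier, fallback: Classifier) -> dict:
    """Try the primary model; if it raises, ask the fallback model instead."""
    try:
        return {"category": primary(text), "model_used": "primary"}
    except Exception:
        # Mirrors the on failure block: a second model, and the result
        # records which model actually answered.
        return {"category": fallback(text), "model_used": "fallback"}


def flaky_primary(text: str) -> str:
    raise RuntimeError("model overloaded")  # simulated primary failure


def small_model(text: str) -> str:
    return "general"  # stands in for the smaller model's answer


print(resilient_classify("Quarterly earnings beat estimates.", flaky_primary, small_model))
# {'category': 'general', 'model_used': 'fallback'}
```

As in the machine, if the fallback itself raises, the exception propagates because there is no second level of recovery.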

Notify on failure

Fire off an alert when something goes wrong, then return a graceful error:

on failure
  launch send_alert
    machine: "@mashin/actions/notifications/send"
    channel: "slack"
    message: "Pipeline failed at " + error.step + ": " + error.message

  compute error_response
    {status: "failed", step: error.step, message: error.message}

launch is fire-and-forget: the alert is sent asynchronously and does not block the error response.
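A background thread can model the difference between waiting and fire-and-forget. In this Python sketch, `launch`, `send_alert`, and the `gate` event are hypothetical. The gate holds the alert open so that the error response visibly returns before the alert is delivered. The sketch does not model how mashin actually schedules launched machines.

```python
import threading

launched: list[threading.Thread] = []


def launch(target, *args) -> None:
    """Fire-and-forget: start target in the background and return immediately."""
    t = threading.Thread(target=target, args=args, daemon=True)
    launched.append(t)  # kept only so this demo can wait for it at the end
    t.start()


def send_alert(message: str, outbox: list, gate: threading.Event) -> None:
    gate.wait()  # simulate a slow notification service
    outbox.append(message)


def on_failure(error: dict, outbox: list, gate: threading.Event) -> dict:
    launch(send_alert, f"Pipeline failed at {error['step']}: {error['message']}", outbox, gate)
    # Built immediately; does not wait for the alert to be delivered.
    return {"status": "failed", "step": error["step"], "message": error["message"]}


outbox: list = []
gate = threading.Event()
response = on_failure({"step": "fetch", "message": "HTTP 500"}, outbox, gate)
print(response)  # {'status': 'failed', 'step': 'fetch', 'message': 'HTTP 500'}
print(outbox)    # [] because the alert is still pending
gate.set()       # let the "slow" alert finish
for t in launched:
    t.join()
print(outbox)    # ['Pipeline failed at fetch: HTTP 500']
```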

Per-flow error handling

In multi-flow machines, each flow has its own on failure block:

implements
  flows
    flow extract
      ask fetch, from: "@mashin/actions/http/get"
        url: input.url
        assuming
          body: {}
          status: 200
      on failure
        compute extract_error
          {stage: "extract", error: error.message}

    flow transform
      compute process
        {data: steps.fetch.body}
      on failure
        compute transform_error
          {stage: "transform", error: error.message}

A failure in extract does not trigger transform’s error handler. Each flow manages its own recovery.
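To model handler scoping, give each flow object its own handler. This Python sketch uses a hypothetical `Flow` class and helpers. It shows only which handler runs. It does not model how flows are sequenced or how one flow reads another flow's steps.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Flow:
    """Hypothetical model: each flow owns its steps and its own on failure handler."""
    name: str
    steps: list[Callable[[], Any]]
    on_failure: Callable[[str], Any]

    def run(self) -> Any:
        output = None
        for step in self.steps:
            try:
                output = step()
            except Exception as exc:
                # Only this flow's handler is consulted.
                return self.on_failure(str(exc))
        return output


calls: list[str] = []  # records which handlers ran


def handler(stage: str) -> Callable[[str], dict]:
    def on_failure(message: str) -> dict:
        calls.append(stage)
        return {"stage": stage, "error": message}
    return on_failure


def fail(message: str) -> Callable[[], Any]:
    def step() -> Any:
        raise RuntimeError(message)
    return step


extract = Flow("extract", [fail("connection refused")], handler("extract"))
transform = Flow("transform", [lambda: {"data": {}}], handler("transform"))

print(extract.run())    # {'stage': 'extract', 'error': 'connection refused'}
print(transform.run())  # {'data': {}}
print(calls)            # ['extract']: transform's handler never ran
```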

Runtime retry configuration

For automatic retry of transient failures, use runtime configuration:

implements
  runtime
    max_retries: 3
    timeout: 30000

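This retries any failed step up to 3 times before triggering on failure. Retries pay off for transient errors (network timeouts, rate limits), which may succeed on a later attempt. For permanent failures (invalid input, permission denied), every retry fails the same way and only delays the on failure block.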
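The next Python model walks through the retry-then-handle sequence. It assumes that `max_retries: 3` means one initial attempt plus up to three retries, for four attempts in total. `run_step_with_retries` and `Flaky` are hypothetical, and the sketch does not model the `timeout` setting. The second call shows why retries do not help with permanent failures: every attempt fails, and the handler simply runs later.

```python
from typing import Any, Callable, Optional


def run_step_with_retries(
    step: Callable[[], Any],
    max_retries: int,
    on_failure: Optional[Callable[[Exception], Any]] = None,
) -> Any:
    """Hypothetical model: 1 initial attempt + up to max_retries retries."""
    if max_retries < 0:
        raise ValueError("max_retries must be >= 0")
    last_error: Optional[Exception] = None
    for _attempt in range(max_retries + 1):
        try:
            return step()
        except Exception as exc:
            last_error = exc  # remember the error and try again
    if on_failure is None:
        raise last_error  # retries exhausted, no handler: halt
    return on_failure(last_error)  # retries exhausted: hand off to on failure


class Flaky:
    """A step that fails a fixed number of times, then succeeds."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.attempts = 0

    def __call__(self) -> str:
        self.attempts += 1
        if self.attempts <= self.failures:
            raise TimeoutError(f"attempt {self.attempts} timed out")
        return "ok"


transient = Flaky(failures=2)
print(run_step_with_retries(transient, max_retries=3), transient.attempts)  # ok 3

permanent = Flaky(failures=10)
print(run_step_with_retries(permanent, 3, lambda e: f"gave up: {e}"), permanent.attempts)
# gave up: attempt 4 timed out 4
```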

What gets recorded

Both the failure and the recovery are recorded in the behavioral ledger:

The original step failure (step name, error type, message)
Each step in the on failure block (with normal step recording)
The final output of the recovery path

This makes error-and-recovery sequences fully auditable. You can answer: “What failed? What did the machine do about it?”

If the on failure block itself fails, the machine halts with the secondary error. There is no nested error handling.
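The following Python model sketches the recording rules by appending one entry per event. Everything in it is hypothetical: the entry fields, `run_with_ledger`, and the use of the Python exception class name where mashin would report its own error category. The sketch does not record the secondary error, because this page does not say whether the ledger records it.

```python
from typing import Any, Callable

Step = tuple[str, Callable[[], Any]]


def run_with_ledger(steps: list[Step], recovery: list[Step], ledger: list[dict]) -> Any:
    """Hypothetical model of what gets recorded for a failure and its recovery."""
    output: Any = None
    for name, step in steps:
        try:
            output = step()
        except Exception as exc:
            # 1. The original failure: step name, error type, message.
            ledger.append({"event": "step_failed", "step": name,
                           "type": type(exc).__name__, "message": str(exc)})
            break
        ledger.append({"event": "step", "step": name, "output": output})
    else:
        return output  # no failure: the recovery path never runs

    # 2. Each recovery step is recorded like a normal step. If one raises,
    #    the exception propagates unhandled (no nested error handling).
    for name, step in recovery:
        output = step()
        ledger.append({"event": "step", "step": name, "output": output})
    # 3. The final output of the recovery path.
    ledger.append({"event": "final_output", "output": output})
    return output


def timed_out() -> Any:
    raise TimeoutError("upstream timed out")


def cache_missing() -> Any:
    raise RuntimeError("cache missing")


ledger: list[dict] = []
run_with_ledger(
    [("fetch", timed_out), ("format", lambda: "never runs")],
    [("fallback", lambda: {"source": "unavailable"})],
    ledger,
)
for entry in ledger:
    print(entry)

try:
    run_with_ledger([("fetch", timed_out)], [("fallback", cache_missing)], [])
except RuntimeError as secondary:
    print("halted with:", secondary)  # halted with: cache missing
```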

Try it

Build a machine that calls an HTTP API. Add an on failure block that returns cached data when the API is unavailable. Use error.type to distinguish between timeouts and other failures, returning different messages for each.
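Before writing the machine, it can help to pin down the branching logic. This Python sketch of the recovery decision uses a hypothetical in-memory `CACHE` and a hypothetical `recover` function. The error dict mimics `error.type` and `error.message`. Translating the logic into an on failure block is the exercise.

```python
# Hypothetical cached data, keyed by city.
CACHE = {"Austin": {"temp": 70, "conditions": "cloudy"}}


def recover(error: dict, city: str) -> dict:
    """Return cached data, with a message that depends on error["type"]."""
    if error["type"] == "timeout":
        message = "The weather API timed out; showing cached data."
    else:
        message = f"The weather API failed ({error['type']}): {error['message']}"
    return {"data": CACHE.get(city), "source": "cache", "message": message}


print(recover({"type": "timeout", "message": "no response"}, "Austin")["message"])
print(recover({"type": "runtime", "message": "HTTP 500"}, "Austin")["message"])
```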

Next steps

Flows - Multi-flow architecture
Governance - Permission denials as failures
on failure reference - Full specification