Lesson 7: Goals and Tests
Your email triage is deployed and running. It classifies 150 emails a day. But how do you know it is working? What if the AI starts routing newsletters as urgent? What if a new email format breaks the confidence score?
You need two things: a definition of success, and tests that verify it.
The achieves Section
Add a goal to your machine:
```
machine email_triage

  achieves
    goal "Route emails to the correct destination based on priority and confidence"
    succeeds when "urgent emails reach the team within 2 seconds"
    succeeds when "low-confidence emails go to human review"
    never "route an uncertain email without human review"
    never "expose email body content in Teams channel names"
    for example
      expect {routed_to: "notify_team"}
    for example
      expect {routed_to: "archive"}
```

achieves declares what success looks like. It has four parts:
- goal: A plain-language description of the machine's purpose. Koda uses this to explain the machine. Other machines can read it.
- succeeds when: Conditions that define correct behavior. These are not code; they are human-readable statements that guide the AI and document intent.
- never: Hard constraints. Things the machine must not do, regardless of input. These feed into governance checks.
- for example: Concrete input/output pairs. These are both documentation and executable tests.
The verifies Section
Examples inside achieves serve double duty: they document intent and they run as tests. But for thorough testing, use the verifies section:
```
verifies
  test "routes urgent emails to team"
    assuming classify {priority: "urgent", confidence: 0.95, action: "notify", reason: "Server outage"}
    expect {routed_to: "notify_team"}

  test "archives newsletters"
    assuming classify {priority: "ignore", confidence: 0.99, action: "archive", reason: "Marketing email"}
    expect {routed_to: "archive"}

  test "flags low confidence for human review"
    assuming classify {priority: "today", confidence: 0.55, action: "create_task", reason: "Unclear intent"}
    expect {routed_to: "human_review"}

  test "creates tasks for actionable non-urgent emails"
    assuming classify {priority: "today", confidence: 0.88, action: "create_task", reason: "Review request"}
    expect {routed_to: "create_task"}
```

The assuming Keyword
The key difference from for example is assuming. It mocks the AI step.
assuming classify {priority: "urgent", confidence: 0.95, ...} tells the test runner: “When the classify step runs, return this instead of calling the AI.” This means:
- Tests are deterministic. The AI does not run. The test always produces the same result.
- Tests are free. No API calls, no token costs.
- Tests are fast. Milliseconds, not seconds.
- Tests verify your logic, not the AI’s judgment. You are testing routing, not classification.
To test the AI’s classification quality, use /evaluate (Lesson 9). The verifies section tests the machine’s logic.
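Because verifies exercises the routing logic, the 0.7 confidence threshold deserves tests on both sides of the boundary. A sketch, reusing the machine's own step and flow names (the exact values and test names here are illustrative, not from the lesson):

```
test "confidence just below threshold goes to human review"
  assuming classify {priority: "today", confidence: 0.69, reason: "Borderline"}
  expect {routed_to: "human_review"}

test "confidence at threshold proceeds normally"
  assuming classify {priority: "today", confidence: 0.7, reason: "Borderline"}
  expect {routed_to: "create_task"}
```

Boundary tests like these catch off-by-one mistakes such as writing `<= 0.7` where `< 0.7` was intended.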
Running Tests
```
/test email_triage
```

You see:

```
email_triage
  [pass] routes urgent emails to team (2ms)
  [pass] archives newsletters (1ms)
  [pass] flags low confidence for human review (1ms)
  [pass] creates tasks for actionable non-urgent emails (2ms)

4/4 passed. 0 failed. 6ms total.
```

If a test fails, you see exactly what happened:
```
[FAIL] flags low confidence for human review
  Expected: {routed_to: "human_review"}
  Got: {routed_to: "create_task"}
  Step trace:
    [1] classify (mocked) -> {priority: "today", confidence: 0.55, ...}
    [2] route decide -> {routed_to: "create_task"}
  Issue: The confidence threshold check is not triggering.
```

The trace shows every step, including the mocked values, so you can see where the logic diverged from your expectation.
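A trace like this usually points at branch ordering in decide. For instance, a hypothetical buggy version that checks priority before confidence would produce exactly this failure, because a low-confidence "today" email matches the priority branch first and the confidence guard is never reached:

```
decide route
  if classify.priority == "today"
    run flow(create_task)
  else if classify.confidence < 0.7
    run flow(flag_for_human)
```

The fix is to check confidence first, so that uncertainty overrides priority no matter how the email is classified.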
Tests Run Before Deployment
When you run /deploy email_triage, mashin runs all tests first. If any fail, deployment is blocked:
```
/deploy email_triage

Running tests...
  [FAIL] flags low confidence for human review

Deployment blocked: 1 test failure. Fix the failing test before deploying.
```

This is not optional. You cannot deploy a machine with failing tests. The governance pipeline enforces it.
The Full Machine with Goals and Tests
Here is the complete email triage with everything from lessons 1 through 7:
```
machine email_triage

  achieves
    goal "Route emails to the correct destination based on priority and confidence"
    succeeds when "urgent emails reach the team within 2 seconds"
    succeeds when "low-confidence emails go to human review"
    never "route an uncertain email without human review"
    for example
      expect {routed_to: "notify_team"}

  accepts
    subject as text, is required
    sender as text, is required
    body as text

  responds with
    routed_to as text
    action_taken as text

  implements
    ask classify, using: "anthropic:claude-haiku-4"
      with task "Classify this email.\n\nFrom: ${input.sender}\nSubject: ${input.subject}\nBody: ${input.body}"
      returns
        priority as text, is required, choices: ["urgent", "today", "later", "ignore"]
        confidence as number, is required, range: [0.0, 1.0]
        reason as text

    decide route
      if classify.confidence < 0.7
        run flow(flag_for_human)
      else if classify.priority == "urgent"
        run flow(notify_team)
      else if classify.priority == "today"
        run flow(create_task)
      else
        {routed_to: "archive", action_taken: "none"}

  flows
    flow notify_team
      ask send_alert, from: "@mashin/actions/microsoft/teams/send_message"
        channel: "ops-alerts"
        message: "Urgent: " + input.subject + " from " + input.sender
      {routed_to: "notify_team", action_taken: "Teams alert sent"}

    flow create_task
      ask make_task, from: "@mashin/actions/microsoft/planner/create_task"
        title: input.subject
        description: classify.reason
      {routed_to: "create_task", action_taken: "Planner task created"}

    flow flag_for_human
      ask flag, from: "@mashin/actions/microsoft/teams/send_message"
        channel: "email-review"
        message: "Review needed: " + input.subject
      {routed_to: "human_review", action_taken: "Flagged for review"}

  verifies
    test "routes urgent emails to team"
      assuming classify {priority: "urgent", confidence: 0.95, reason: "Server outage"}
      expect {routed_to: "notify_team"}

    test "archives newsletters"
      assuming classify {priority: "ignore", confidence: 0.99, reason: "Marketing"}
      expect {routed_to: "archive"}

    test "flags low confidence for human review"
      assuming classify {priority: "today", confidence: 0.55, reason: "Unclear"}
      expect {routed_to: "human_review"}

    test "creates tasks for actionable emails"
      assuming classify {priority: "today", confidence: 0.88, reason: "Review request"}
      expect {routed_to: "create_task"}
```

Notice the section ordering: achieves (what it does), accepts/responds with (its contract), implements (how it works), verifies (proof it works). This is canonical mashin ordering. Define what it does, then expose it, then test it.
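One case the verifies section above does not pin down is the never constraint itself: "route an uncertain email without human review". A test worth adding, sketched with illustrative values, checks that an urgent email with low confidence still goes to human review, because the confidence guard runs before the priority check:

```
test "uncertain urgent emails still go to human review"
  assuming classify {priority: "urgent", confidence: 0.4, reason: "Ambiguous outage report"}
  expect {routed_to: "human_review"}
```

This test would start failing the moment someone reorders the decide branches, turning the never constraint from a statement of intent into something enforced on every run.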
What Goals Enable
Goals are not just documentation. Other parts of the system use them:
- Koda reads goals to explain machines: “This machine routes emails based on priority and confidence.”
- /verify checks that declared constraints are structurally enforceable.
- /improve uses goals as the optimization target: "Make this machine better at routing urgent emails."
- The evolution ledger records goals alongside version diffs, so you can trace why a machine changed.
Goals connect intent to implementation. Tests verify the connection holds.
What Comes Next
Your machine is deployed, tested, and goal-driven. Next lesson: memory. Your machine will learn from patterns and human corrections, so it gets better over time without you changing the code.