Lesson 7: Goals and Tests
Your email triage is deployed and running. It classifies 150 emails a day. But how do you know it is working? What if the AI starts routing newsletters as urgent? What if a new email format breaks the confidence score?
You need two things: a definition of success, and tests that verify it.
The achieves Section
Add a goal to your machine:
```
machine email_triage

  achieves
    goal "Route emails to the correct destination based on priority and confidence"
    succeeds when "urgent emails reach the team within 2 seconds"
    succeeds when "low-confidence emails go to human review"
    never "route an uncertain email without human review"
    never "expose email body content in Teams channel names"
    for example
      expect {routed_to: "notify_team"}
    for example
      expect {routed_to: "archive"}
```

achieves declares what success looks like. It has four parts:
- goal: A plain-language description of the machine's purpose. Koda uses this to explain the machine. Other machines can read it.
- succeeds when: Conditions that define correct behavior. These are not code; they are human-readable statements that guide the AI and document intent.
- never: Hard constraints. Things the machine must not do, regardless of input. These feed into governance checks.
- for example: Concrete input/output pairs. These are both documentation and executable tests.
The verifies Section
Examples inside achieves serve double duty: they document intent and they run as tests. But for thorough testing, use the verifies section:
```
verifies
  test "routes urgent emails to team"
    assuming classify {priority: "urgent", confidence: 0.95, action: "notify", reason: "Server outage"}
    expect {routed_to: "notify_team"}

  test "archives newsletters"
    assuming classify {priority: "ignore", confidence: 0.99, action: "archive", reason: "Marketing email"}
    expect {routed_to: "archive"}

  test "flags low confidence for human review"
    assuming classify {priority: "today", confidence: 0.55, action: "create_task", reason: "Unclear intent"}
    expect {routed_to: "human_review"}

  test "creates tasks for actionable non-urgent emails"
    assuming classify {priority: "today", confidence: 0.88, action: "create_task", reason: "Review request"}
    expect {routed_to: "create_task"}
```

The assuming Keyword
The key difference from for example is assuming. It mocks the AI step.
assuming classify {priority: "urgent", confidence: 0.95, ...} tells the test runner: “When the classify step runs, return this instead of calling the AI.” This means:
- Tests are deterministic. The AI does not run. The test always produces the same result.
- Tests are free. No API calls, no token costs.
- Tests are fast. Milliseconds, not seconds.
- Tests verify your logic, not the AI’s judgment. You are testing routing, not classification.
To test the AI’s classification quality, use /evaluate (Lesson 9). The verifies section tests the machine’s logic.
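Because verifies exercises the routing logic, the 0.7 confidence threshold deserves tests on both sides of the boundary. A sketch, reusing the machine's own step and flow names (the exact values and test names here are illustrative, not from the lesson):

```
test "confidence just below threshold goes to human review"
  assuming classify {priority: "today", confidence: 0.69, reason: "Borderline"}
  expect {routed_to: "human_review"}

test "confidence at threshold proceeds normally"
  assuming classify {priority: "today", confidence: 0.7, reason: "Borderline"}
  expect {routed_to: "create_task"}
```

Boundary tests like these catch off-by-one mistakes such as writing `<= 0.7` where `< 0.7` was intended.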
Running Tests
```
/test email_triage
```

You see:

```
email_triage
  [pass] routes urgent emails to team (2ms)
  [pass] archives newsletters (1ms)
  [pass] flags low confidence for human review (1ms)
  [pass] creates tasks for actionable non-urgent emails (2ms)

4/4 passed. 0 failed. 6ms total.
```

If a test fails, you see exactly what happened:
```
[FAIL] flags low confidence for human review
  Expected: {routed_to: "human_review"}
  Got: {routed_to: "create_task"}
  Step trace:
    [1] classify (mocked) -> {priority: "today", confidence: 0.55, ...}
    [2] route decide -> {routed_to: "create_task"}
  Issue: The confidence threshold check is not triggering.
```

The trace shows every step, including the mocked values, so you can see where the logic diverged from your expectation.
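A trace like this usually points at branch ordering in decide. For instance, a hypothetical buggy version that checks priority before confidence would produce exactly this failure, because a low-confidence "today" email matches the priority branch first and the confidence guard is never reached:

```
decide route
  if classify.priority == "today"
    run flow(create_task)
  else if classify.confidence < 0.7
    run flow(flag_for_human)
```

The fix is to check confidence first, so that uncertainty overrides priority no matter how the email is classified.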
Tests Run Before Deployment
When you run /deploy email_triage, mashin runs all tests first. If any fail, deployment is blocked:
```
/deploy email_triage

Running tests...
  [FAIL] flags low confidence for human review

Deployment blocked: 1 test failure. Fix the failing test before deploying.
```

This is not optional. You cannot deploy a machine with failing tests. The governance pipeline enforces it.
The Full Machine with Goals and Tests
Here is the complete email triage with everything from lessons 1 through 7:
```
machine email_triage

  achieves
    goal "Route emails to the correct destination based on priority and confidence"
    succeeds when "urgent emails reach the team within 2 seconds"
    succeeds when "low-confidence emails go to human review"
    never "route an uncertain email without human review"
    for example
      expect {routed_to: "notify_team"}

  accepts
    subject as text, is required
    sender as text, is required
    body as text

  responds with
    routed_to as text
    action_taken as text

  implements
    ask classify, using: "anthropic:claude-haiku-4"
      with task "Classify this email.\n\nFrom: ${input.sender}\nSubject: ${input.subject}\nBody: ${input.body}"
      returns
        priority as text, is required, choices: ["urgent", "today", "later", "ignore"]
        confidence as number, is required, range: [0.0, 1.0]
        reason as text

    decide route
      if classify.confidence < 0.7
        run flow(flag_for_human)
      else if classify.priority == "urgent"
        run flow(notify_team)
      else if classify.priority == "today"
        run flow(create_task)
      else
        {routed_to: "archive", action_taken: "none"}

  flows
    flow notify_team
      ask send_alert, from: "@mashin/actions/microsoft/teams/send_message"
        channel: "ops-alerts"
        message: "Urgent: " + input.subject + " from " + input.sender
      {routed_to: "notify_team", action_taken: "Teams alert sent"}

    flow create_task
      ask make_task, from: "@mashin/actions/microsoft/planner/create_task"
        title: input.subject
        description: classify.reason
      {routed_to: "create_task", action_taken: "Planner task created"}

    flow flag_for_human
      ask flag, from: "@mashin/actions/microsoft/teams/send_message"
        channel: "email-review"
        message: "Review needed: " + input.subject
      {routed_to: "human_review", action_taken: "Flagged for review"}

  verifies
    test "routes urgent emails to team"
      assuming classify {priority: "urgent", confidence: 0.95, reason: "Server outage"}
      expect {routed_to: "notify_team"}

    test "archives newsletters"
      assuming classify {priority: "ignore", confidence: 0.99, reason: "Marketing"}
      expect {routed_to: "archive"}

    test "flags low confidence for human review"
      assuming classify {priority: "today", confidence: 0.55, reason: "Unclear"}
      expect {routed_to: "human_review"}

    test "creates tasks for actionable emails"
      assuming classify {priority: "today", confidence: 0.88, reason: "Review request"}
      expect {routed_to: "create_task"}
```

Notice the section ordering: achieves (what it does), accepts/responds with (its contract), implements (how it works), verifies (proof it works). This is canonical mashin ordering. Define what it does, then expose it, then test it.
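One case the verifies section above does not pin down is the never constraint itself: "route an uncertain email without human review". A test worth adding, sketched with illustrative values, checks that an urgent email with low confidence still goes to human review, because the confidence guard runs before the priority check:

```
test "uncertain urgent emails still go to human review"
  assuming classify {priority: "urgent", confidence: 0.4, reason: "Ambiguous outage report"}
  expect {routed_to: "human_review"}
```

This test would start failing the moment someone reorders the decide branches, turning the never constraint from a statement of intent into something enforced on every run.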
What Goals Enable
Goals are not just documentation. Other parts of the system use them:
- Koda reads goals to explain machines: “This machine routes emails based on priority and confidence.”
- /verify checks that declared constraints are structurally enforceable.
- /improve uses goals as the optimization target: "Make this machine better at routing urgent emails."
- The evolution ledger records goals alongside version diffs, so you can trace why a machine changed.
Goals connect intent to implementation. Tests verify the connection holds.
What Comes Next
Your machine is deployed, tested, and goal-driven. Next lesson: memory. Your machine will learn from patterns and human corrections, so it gets better over time without you changing the code.