The Demo Went Well. That’s the problem.
If you can’t define good output before launch, you are not managing an AI product. You are reacting to one.
The most dangerous moment in an AI product cycle is the demo that goes well.
The answers sound smart. The flow feels smooth. The team starts talking about launch timing. Somebody says the model is “pretty good.” That phrase hangs in the air like it means something.
It usually means nobody has defined quality yet.
This is the core discipline of AI product management. Not prompt writing. Not model shopping. Not feature packaging. Evaluation design.
If one cannot define what good output looks like before launch, they are not managing an AI product. They are reacting to one.
Most teams spend weeks on the experience, the workflow, the orchestration, the retrieval layer, the prompt stack, then treat evaluation as a last-mile QA task. A few people test a few prompts. They get a feel for the outputs. They polish what looks rough. They launch.
That approach works just well enough to create false confidence.
Traditional software trained us to think in terms of exact behavior.
A button click should do one thing.
A calculation should return one answer.
A test should pass or fail.
AI systems break that rhythm. The same input can produce different outputs. Some answers can vary in wording and still be great. Some can sound polished and still be wrong in the one way that matters. The shape of the work changes.
The PM has to define acceptable behavior, not one perfect response.
Evaluations belong inside system design from day one. They are not the audit at the end. They are the spec.
Welcome to Code Like a Girl, the community where women in tech come to be seen, heard, and championed as they walk this path together.
If you have ever watched a demo go well and felt the quiet dread of knowing nobody has defined what good actually looks like yet, you are in the right place.
What Does Good Look Like for Single-Step Probabilistic Systems?
The first question on any AI use case is brutally simple: what would make this output good enough to trust?
Not “good enough to impress.”
Not “good enough in a demo.”
Good enough to use in a real workflow where the answer affects a customer, an employee, a decision, or a task someone cares about.
A customer support copilot is a useful place to work this through. A requirement that says “the assistant drafts helpful responses” sounds fine and tells nobody how to judge the system. A definition that survives contact with reality looks different. A good support draft should:
reflect actual policy,
pull in the right facts from the case,
ask a clarifying question when key information is missing,
keep a professional tone,
avoid invented promises, and
know when to hand the case to a human.
That list is already more valuable than a generic product requirement because it names the behaviors that matter. Write it down, and you can evaluate the system. Without it, you’re left with taste, instinct, and post-hoc rationalization.
This is where many AI enablers get stuck.
They inherited a mindset from deterministic software, so they look for the AI version of a unit test. They want a clean binary. Pass or fail. Correct or incorrect. Shipped or not shipped.
Production AI systems demand a scorecard, not a single test: a rubric that captures the range between excellent, acceptable, shaky, and unsafe. Two outputs can differ in wording, and both can be correct. The only measure is whether the system behaves within bounds you would accept in production.
Write acceptance criteria around outcomes and failure limits: define thresholds, name categories, and be explicit about which errors are annoying, which are expensive, and which are unacceptable.
In plain language, readiness for a support copilot looks like this:
In routine cases, the assistant should apply the policy correctly almost every time.
In cases with missing facts, it should ask a clarifying question rather than bluff.
On high-risk cases, it should escalate every time.
The draft can vary in wording, but it has to be accurate, useful, and safe to send after review. That is what “good enough” looks like on paper: not poetic, but useful.
Building a Golden Test Set
The next challenge: how do you evaluate before you have production data?
Build a golden test set, the closest thing an AI product team has to a pre-launch reality check. A curated set of examples that represents the work the system will face when real users arrive.
Let’s extend our support use case. For this one, to build a golden test set, you can pull from historical tickets, policy documents, edge-case scenarios, adversarial requests, and hard examples that expose weak judgment.
For internal search, build cases where the right answer lives in the docs, cases requiring synthesis, and cases where the right answer is “I don’t know from the available material.”
The word “golden” can make it sound pristine, but it starts rough. The point is coverage, not perfection.
Include the boring common cases, because that’s where volume lives. Include messy, ambiguous, and adversarial cases, because that’s where trust gets lost. Include policy traps and examples that look straightforward until the model takes a shortcut and invents an answer.
If you skip this step, you will test the cases you remember, the prompts you wrote, the scenarios that flatter the system. Then you will launch into the unknown and act surprised when production behaves like production.
The golden test set has a second job after launch. When the underlying model updates on a provider’s schedule, run it as a regression suite. If results shift meaningfully against the thresholds you set at launch, that is a product incident, not a curiosity. Build that expectation into the team’s operating rhythm from the start.
Avoid forcing every test case into one “perfect” answer unless the task truly has one exact right output.
A support reply can vary in phrasing and still solve the problem well.
A content summary can differ in structure and still capture the core idea.
A retrieval answer can cite different passages and still land on the right conclusion.
Rubrics work better than exact-match grading here. For each example, define what matters: accuracy, completeness, judgment, clarity, tone, escalation choice, and grounding in source material. Score against that rubric.
Now you have a way to evaluate probabilistic output without pretending it should behave like a calculator.
Run Offline evals in Three Layers
Once you have a golden test set and evaluation criteria laid out, the next step is to run offline evaluations. Offline evals let you test the system before users absorb the cost of mistakes. Each layer catches something the others miss.
Check Hard Constraints
Did the system return a valid structure?
Did it use the required source?
Did it follow the format?
Did it make a forbidden claim?
Did it call the right tool?
These checks are mechanical, and they catch more than most teams expect.
Score Quality
Did the output answer the question?
Did it use the right facts?
Did it handle ambiguity well?
Did it stay within policy?
You can automate some of this; the rest needs human judgment. Automated graders help you move quickly, but treat them as assistants, not judges.
Examine Failures
This is where the real product work lives.
Does the system over-answer when it should ask for more detail?
Does it sound certain when the evidence is weak?
Does it collapse on edge cases?
Does it drift into generic filler when the task calls for specifics?
Does it retrieve the right source and still draw the wrong conclusion?
The average score is a poor guide. The real work begins when you identify the error patterns: where the system breaks, how consistently, and why.
Evaluation Design for Multi-Step Systems
You’ve probably seen this. Each individual output looks fine. The end state is still wrong.
The three-layer framework applies cleanly to a single output. When the use case is multi-step, with an agent that gathers context, decides an action, calls a tool, and produces a result, the unit of evaluation changes.
In a multi-step workflow, failure often doesn’t happen inside any individual output. It happens at the handoff between steps.
Each step can pass its own output-level rubric while the end state is still wrong. The compounding math is unforgiving: if each step operates at 95% reliability across 20 steps, the workflow succeeds roughly 36% of the time.
There is also the loop problem: agents that repeatedly retry failed operations, or continue processing tasks already completed, without any single output triggering a rubric failure.
A workflow-level rubric asks different questions:
Is the end state correct?
Did the agent choose the right path at each decision point?
Did it handle failure at each handoff gracefully?
If your use case involves agentic workflows, the evaluation design has to match. An output-level rubric alone will not catch what breaks.
How do you Know Something is Ready to Ship?
The answer is concrete. A system is ready when five things are true:
The job is defined clearly enough to score.
You can describe what the system must do in language a reviewer can use to score an output: a specific behavioral standard, not a vague requirement.The golden test set has enough range to expose weak spots.
It covers routine cases, edge cases, adversarial inputs, and high-risk scenarios. It was built to challenge the system, not flatter it.Offline results clear the thresholds set in advance.
Not “results look reasonable.” The numbers against criteria defined before testing began.Human review cases are identified.
You know which categories of output require a human in the loop before reaching a user, and you have a plan for managing that review.A pilot plan exists.
You know what you will monitor once real users arrive: which metrics, which failure patterns, which thresholds trigger action.
Ship readiness is an evaluation decision, a clearly set scorecard.
In regulated environments, this checklist carries a second weight.
The golden test set, the threshold evidence, and the human review taxonomy are the artifacts a risk or compliance review asks for.
A well-run evaluation process gives the PM and the risk committee what they each need: the PM asks whether the product is ready, the risk committee asks whether the organization is covered.
For teams in financial services, healthcare, or any domain with formal model governance requirements, evaluation design is also the compliance record. Build it that way from the start. (Source: MIT Technology Review)
Pilot Discipline and HITL Principles:
Whether it is a one-step output or a multi-step workflow, the pilot stage is where weak teams lose discipline. They finally have live traffic, so they drop structure and start collecting anecdotes. One executive forwards a great answer. Another points out a bad one. The team bounces between excitement and panic depending on which screenshot arrived last.
Building a human review process turns the pilot into a learning loop, and that process is the discipline that the pilot stage demands.
For most serious use cases, start with a human in the loop for areas where trust still needs to be earned.
Make the review process concrete.
Reviewers need a rubric with clear categories: approve, edit, reject, and escalate. They need a place to tag why they intervened: wrong fact, missed nuance, bad judgment, weak tone, missing context, unsafe action, or invented policy.
The taxonomy matters because it tells you whether you have a prompt problem, a retrieval problem, a workflow problem, or a use-case problem.
Sometimes the pilot shows the model writes strong first drafts, but still needs a human to approve the final answer. That can still be a valuable product.
Sometimes the pilot shows the model handles routine cases well and only needs review on a narrow slice of risky work. Better still.
Sometimes the pilot shows the use case looks promising in theory, but breaks too often where it matters. Also, a win: you learn it before the launch.
Delayed shipment is a recoverable problem. Shipping without knowing which of those products you actually built is not.
From Launch to Production: Observability
This is where most teams lose the thread. The launch went well. Then production started behaving like production.
The evaluation set, along with a strong feedback loop that is built into the use case, sets you on the path to production. Once the system reaches production, however, the question changes completely.
Once in production, the value added by the workflow starts to take center stage. You stop asking only whether the model produced a good-looking answer and start asking whether it improved the workflow.
That requires a different evaluation mindset: one that tracks quality over time, monitors whether controls are holding, and reads user behavior for signs of genuine adoption versus polite tolerance.
A system with a high acceptance rate can still be failing if users accept outputs uncritically on low-stakes tasks and quietly abandon it where the stakes are real.
I’ve covered the full production monitoring framework in You Cannot Manage AI Trust Without Observability, organized around three dimensions: quality and reliability, risk and control health, and adoption behavior. If you’re moving from launch into production, that’s the operating view you need.
Offline evals and online evals play different roles.
Offline evals tell you whether the system clears the bar you set before launch.
Online evals tell you whether that bar mattered in the real workflow.
You need both, and they require different instruments.
That is the Job
Evaluations are part of business system design. They shape the prompt, the retrieval layer, the fallback logic, the escalation rules, the staffing model, and the launch decision. They define the operating boundaries of the product.
An AI team that cannot evaluate the system can’t really steer it. The team must watch outputs, react to anecdotes, and hold a meeting about quality. None of that is the same as control.
Control starts when you can say, before launch, what good looks like, how you will measure it, where you will use human review, and what evidence will convince you the system is ready.
Before your next sprint planning, write one sentence: What does good output look like? If you can’t write it, you are not ready to build it.
If This Resonated With You
We are grateful to Ankita Chatrath for sharing her expertise here. If you're building or managing AI products in enterprise and want writing that skips the theory and goes straight to what actually breaks in production, Lessons from the Trenches is worth your time.
You Belong Here
Ankita named the discipline most teams skip: defining what good looks like before anyone can see it in a demo. Code Like a Girl is the community for women in tech who are doing that kind of rigorous, unglamorous work and want to be around others who take it seriously.
If you are not subscribed yet, that is the place to start. It is free.
You Don’t Need to Do This Alone
Have you ever wished you had someone in your corner who has actually been where you are? When you go paid, you get direct access to Dinah Davis, founder of Code Like a Girl, via direct message. You bring your situation. She brings twenty years of having lived it. Together, you figure out your next move.
Once a month, Dinah shares practical steps you can act on right away. Not inspiration. Specific things you can do to advance your career when the system was not built for you.
Summer Out of Office
We’re taking two short breaks this summer: June 27 – July 7 and August 1 – 12.
Article Sources
McKinsey QuantumBlack, “Evaluations for the Agentic World” (2025) — multi-step workflow failure patterns, compounding reliability in agentic systems, and handoff boundary failures.
Cleanlab, “AI Agents in Production 2025” — enterprise production failure modes; 61% of multi-agent system failures originate at agent boundaries.
Datadog, “State of AI Engineering” (2025) — model update volatility; 34% of enterprises experienced unexpected behavior changes following a provider model update.
MIT Technology Review, “Operationalizing AI for Scale and Sovereignty” (May 2026) — governance and compliance requirements for enterprise AI deployments at scale.
Andreessen Horowitz, “How 100 Enterprise CIOs Are Building and Buying Gen AI in 2025” — use of internal golden datasets and developer feedback as enterprise model procurement evaluation tools.








"What does good output look like? If you can't write it, you are not ready to build it." This is one of the most overlooked foundation pieces. Thank you for touching on that.
this class is 2 weeks homework for me! (while im house/pet sitting) THANK YOU.