Project Open to Alpha Team

Anthropic Postmortem Calibration + Product.ai Eval-Bypass Mitigation — diagnose one equivalent risk and ship the gate

Read the Anthropic April 23, 2026 Claude Code postmortem end-to-end. Identify the structural mechanism by which three overlapping bugs evaded the world's most sophisticated eval suite for 50 days. Diagnose one equivalent eval-bypass risk pattern in a Product.ai surface. Ship the deterministic gate that closes it — not a "better eval," a structural change that makes the bypass mechanism physically impossible.

Project Overview

Discipline: founding-product-manager · AI Systems - AI Engineer · ai-systems-engineer
Duration: 2 weeks
Compensation: Your stated freelance rate
Surface: Product.ai · Engineering · Truth Graph
Kernels: productai · engineering · truth-graph
Outcomes: chat-expert · dev-integrate · team-velocity
Tier: Consequential
Alpha Team: Open to alpha members who want to take this on
Tooling: Claude Code or Co-work

Why we want this done

The Anthropic postmortem is the single highest-signal calibration device available. Three product-layer changes stacked at Anthropic between March 4 and April 20, 2026: reasoning effort silently dropped, thinking history cleared every turn, and system prompt updated between tool calls. Internal evals "did not initially reproduce the issues identified" despite multiple human and automated code reviews, unit tests, end-to-end tests, automated verification, and dogfooding. AMD's forensic analysis (234,760 tool calls across 6,852 sessions) documented the regression at scale. User /feedback was the only working detection mechanism.

The PM Phase 3 briefing identifies this as Probe 2 (Anthropic Postmortem Calibration) for hiring, but the same calibration applies at the product level. Product.ai has eval-bypass risks structurally identical to Anthropic's. Identifying them and shipping deterministic mitigations is the load-bearing PM and AI Engineer work in 2026. The candidate produces the diagnosis, ships the gate, and writes the case study that becomes the calibration device for future hiring and internal verification reviews.

Scope

  1. Read the Anthropic April 23, 2026 postmortem and Stella Laurenzo's GitHub audit (6,852 sessions, 234,760 tool calls)
  2. Identify the canonical bypass mechanism: which class of failure does the eval suite structurally not cover, and why
  3. Survey Product.ai surfaces — chat, Alloy, SimplyCodes code-verification, MCP — for equivalent bypass risk patterns
  4. Pick one — the highest-leverage equivalent risk
  5. Diagnose the mechanism in writing: what evals exist, what the bypass class looks like, why "more evals" is the wrong answer
  6. Design and ship the deterministic gate — a structural change (e.g., explicit version pinning, ablation harness, external-truth-anchor instrumentation) that makes the bypass mechanism physically impossible
  7. Write the case study — one page that becomes Product.ai's internal calibration device for future verification reviews
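
As one illustration of what "explicit version pinning" in step 6 could look like, here is a minimal sketch of a gate that refuses to run when the live configuration differs from a reviewed fingerprint. The `RuntimeConfig` fields, `fingerprint`, and `enforce_pin` are hypothetical names chosen for this example; they are not taken from Product.ai or Anthropic internals, and a real gate would pin whatever state the chosen surface actually depends on.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RuntimeConfig:
    # Hypothetical fields, loosely modeled on the three changes named in this brief.
    reasoning_effort: str
    retain_thinking_history: bool
    system_prompt: str


class PinMismatch(Exception):
    """Raised when the live config differs from the reviewed pin."""


def fingerprint(config: RuntimeConfig) -> str:
    """Return a stable SHA-256 hex digest of the config's canonical JSON."""
    payload = json.dumps(asdict(config), sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def enforce_pin(config: RuntimeConfig, pinned: str) -> None:
    """Refuse to proceed unless the config matches the pinned fingerprint."""
    actual = fingerprint(config)
    if actual != pinned:
        raise PinMismatch(f"config {actual[:12]} does not match pin {pinned[:12]}")


baseline = RuntimeConfig("high", True, "You are a coding assistant.")
PINNED = fingerprint(baseline)  # assumption: committed alongside a reviewed change

enforce_pin(baseline, PINNED)  # matches the pin, so nothing happens

drifted = RuntimeConfig("medium", True, "You are a coding assistant.")
try:
    enforce_pin(drifted, PINNED)
except PinMismatch as exc:
    print(f"blocked: {exc}")
```

The point of the sketch is that a silent change to any pinned field cannot reach users without someone updating the pin, which is a reviewable act. It scores nothing and samples nothing, so it does not depend on an eval happening to cover the changed behavior.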

What success looks like

  • The bypass mechanism is named explicitly — not "we need more evals," but "session-state-dependent bug invisible to single-turn eval" or "internal-vs-external dogfooding divergence" or "composite degradation across overlapping changes"
  • The deterministic gate ships in production — version pinning, ablation harness, instrumentation, structural change, or all of the above
  • The case study is one page; an engineer or PM reading it can apply the calibration to a different Product.ai surface without re-asking
  • The candidate explicitly rejects "more evals" as the answer (it's the wrong mental model per axiom D2)
  • The gate's effect is measurable — what specific failure class would now be caught that wasn't before
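
To make "composite degradation across overlapping changes" and the measurability criterion concrete, the following is a rough sketch of an ablation harness that checks every combination of pending changes and reports the smallest combinations that fail. The change names and the `toy_passes` check are invented for illustration; a real harness would replace `toy_passes` with a deterministic check against the chosen surface.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Iterable, Iterator, List


def ablation_subsets(changes: Iterable[str]) -> Iterator[FrozenSet[str]]:
    """Yield every non-empty subset of changes, smallest subsets first."""
    items = sorted(changes)
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            yield frozenset(combo)


def failing_subsets(
    changes: Iterable[str], passes: Callable[[FrozenSet[str]], bool]
) -> List[FrozenSet[str]]:
    """Return every subset of changes for which the check fails."""
    return [subset for subset in ablation_subsets(changes) if not passes(subset)]


def minimal_failures(failures: List[FrozenSet[str]]) -> List[FrozenSet[str]]:
    """Keep only failing subsets that contain no smaller failing subset."""
    return [s for s in failures if not any(other < s for other in failures)]


# Hypothetical change labels and a toy check: each change passes alone,
# but two of them together break the behavior under test.
CHANGES = ["effort_drop", "history_clear", "prompt_update"]


def toy_passes(active: FrozenSet[str]) -> bool:
    return not {"effort_drop", "history_clear"} <= active


failures = failing_subsets(CHANGES, toy_passes)
culprits = minimal_failures(failures)
print([sorted(s) for s in culprits])
```

In this toy setup each change passes when tested alone, which is the situation a per-change review would see, while the harness still surfaces the failing pair. The number of subsets grows as 2^n - 1, so this exhaustive form only suits a small set of overlapping changes.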

References

references.md
PM Phase 3 briefing axiom D2 (Eval-Suite Bypass), VERDICT 8 (Verification Infrastructure as Anti-Comfort-Architecture)
AI Engineering Phase 3 briefing axiom A5 (External Truth Anchors), D1 (Eval Gap Irreducible), D2 (LLM-as-Judge Mathematical Ceiling)
Backend Engineering Phase 3 briefing axiom D1 (Anthropic regression case study)
Anthropic April 23, 2026 Claude Code postmortem
AMD GitHub issue #42796 — forensic dataset
Stella Laurenzo's GitHub audit
Product.ai surface internals (chat, Alloy, SimplyCodes code-verification, MCP)

Constraints

  • Claude Code as primary substrate
  • The mitigation must be deterministic and structural — adding evaluators is not the answer
  • Self-hostable eval tooling only if any eval substrate is touched (Phoenix or Langfuse, not Braintrust)
  • The case study is one page; expansion into a multi-section memo is a fail
  • IP separation: surfaces are application-layer; methodology paths are out of scope
  • The candidate may not propose "we need more evals" as the load-bearing recommendation — that's the canonical theater answer per axiom D2
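
One possible reading of "deterministic and structural" is a runtime invariant enforced at the state boundary, as opposed to a scored evaluator. The sketch below assumes a hypothetical `Session` object and turn handlers; it checks after every turn that previously recorded thinking history survives as a prefix. Note that the buggy handler passes on the first turn, which is why a single-turn eval would not see it.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Session:
    # Hypothetical session state carried across turns.
    thinking_history: List[str] = field(default_factory=list)


class InvariantViolation(Exception):
    """Raised when a turn drops or rewrites earlier session state."""


def run_turn(
    session: Session, thought: str, step: Callable[[Session, str], None]
) -> None:
    """Run one turn, then verify earlier history is still present as a prefix."""
    before = list(session.thinking_history)
    step(session, thought)
    after = session.thinking_history
    if after[: len(before)] != before:
        raise InvariantViolation(
            f"history lost: had {len(before)} entries, now {len(after)}"
        )


def good_step(session: Session, thought: str) -> None:
    session.thinking_history.append(thought)


def buggy_step(session: Session, thought: str) -> None:
    # Simulated bug: history is cleared on every turn.
    session.thinking_history = [thought]


healthy = Session()
run_turn(healthy, "plan", good_step)
run_turn(healthy, "refine", good_step)

broken = Session()
run_turn(broken, "plan", buggy_step)  # turn 1: nothing to lose, so no error
try:
    run_turn(broken, "refine", buggy_step)  # turn 2: earlier history is gone
except InvariantViolation as exc:
    print(f"blocked: {exc}")
```

Because the check runs on every turn of every session, it does not rely on a test author anticipating the multi-turn scenario in advance.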

Apply

01  Read the Codex (10 min)
The operating principles we work by. If they resonate, the rest of this will land.

02  12-minute video screen
Hireflix, async. Questions are calibrated to this project specifically.

03  Chemistry call (30-60 min)
Direct call with the CEO. Strategic alignment and mutual fit. No problem-solving exercise.

04  Project begins within 2-3 weeks
1099 contractor agreement, NDA, paid at your stated rate. Day 1 in Santa Monica.

Alpha Team members can take this project without the screen-and-call sequence. Reach out via the Alpha Team channel.