Project Open to Alpha Team

Anthropic Postmortem Calibration + Product.ai Eval-Bypass Mitigation — diagnose one equivalent risk and ship the gate

Read the Anthropic April 23, 2026 Claude Code postmortem end-to-end. Identify the structural mechanism by which three overlapping bugs evaded the world's most sophisticated eval suite for 50 days. Diagnose one equivalent eval-bypass risk pattern in a Product.ai surface. Ship the deterministic gate that closes it — not a "better eval," a structural change that makes the bypass mechanism physically impossible.

Project Overview

Discipline

founding-product-manager · AI Systems — AI Engineer · ai-systems-engineer

Duration

2 weeks

Compensation

Your stated freelance rate

Surface

Product.ai · Engineering · Truth Graph

Kernels

productai · engineering · truth-graph

Outcomes

chat-expert · dev-integrate · team-velocity

Tier

Consequential

Alpha Team

Open to alpha members who want to take this on

Tooling

Claude Code or Co-work

Why we want this done

The Anthropic postmortem is the highest-signal calibration device available. Three product-layer changes stacked at Anthropic between March 4 and April 20, 2026: reasoning effort silently dropped, thinking history cleared every turn, and system-prompt updates applied between tool calls. Internal evals "did not initially reproduce the issues identified" despite multiple human and automated code reviews, unit tests, end-to-end tests, automated verification, and dogfooding. AMD's forensic analysis (234,760 tool calls across 6,852 sessions) documented the regression at scale. User /feedback was the only working detection mechanism.

The PM Phase 3 briefing identifies this as Probe 2 (Anthropic Postmortem Calibration) for hiring, but the same calibration applies at the product level. Product.ai has eval-bypass risks structurally identical to Anthropic's. Identifying them and shipping deterministic mitigations is the load-bearing PM and AI Engineer work in 2026. The candidate produces the diagnosis, ships the gate, and writes the case study that becomes the calibration device for future hiring and internal verification reviews.

Scope

Read the Anthropic April 23, 2026 postmortem and Stella Laurenzo's GitHub audit (6,852 sessions, 234,760 tool calls)
Identify the canonical bypass mechanism: which class of failure does the eval suite structurally not cover, and why
Survey Product.ai surfaces — chat, Alloy, SimplyCodes code-verification, MCP — for equivalent bypass risk patterns
Pick one — the highest-leverage equivalent risk
Diagnose the mechanism in writing: what evals exist, what the bypass class looks like, why "more evals" is the wrong answer
Design and ship the deterministic gate — a structural change (e.g., explicit version pinning, ablation harness, external-truth-anchor instrumentation) that makes the bypass mechanism physically impossible
Write the case study — one page that becomes Product.ai's internal calibration device for future verification reviews
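
As one illustration of the "explicit version pinning" option in the scope above, here is a minimal sketch of a deterministic config gate. Everything in it is an assumption made for illustration: `RuntimeConfig`, its fields, `fingerprint`, and `enforce_pin` are hypothetical names, not Product.ai or Anthropic APIs. The idea is that any change to a behavior-shaping setting (model, reasoning effort, system prompt, history handling) changes a digest, and the gate refuses to proceed until the new digest has been reviewed and pinned.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RuntimeConfig:
    """Hypothetical snapshot of the settings that shape model behavior."""
    model_id: str
    reasoning_effort: str
    system_prompt: str
    keep_thinking_history: bool


def fingerprint(config: RuntimeConfig) -> str:
    """Return a stable SHA-256 hex digest of the config."""
    # Sorted keys and fixed separators keep the serialization canonical,
    # so equal configs always produce equal digests.
    canonical = json.dumps(asdict(config), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class PinMismatch(RuntimeError):
    """Raised when the live config differs from the reviewed, pinned one."""


def enforce_pin(live: RuntimeConfig, pinned_digest: str) -> None:
    """Refuse to proceed unless the live config matches the pin exactly."""
    actual = fingerprint(live)
    if actual != pinned_digest:
        raise PinMismatch(
            f"config drift: expected {pinned_digest[:12]}, got {actual[:12]}"
        )
```

In this sketch the check is a pass/fail equality test rather than a scored evaluation: a silently lowered `reasoning_effort` cannot reach users without someone updating the pin, which is the kind of structural property the scope asks for. Where the pin lives and who may update it are left open here.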

What success looks like

The bypass mechanism is named explicitly — not "we need more evals," but "session-state-dependent bug invisible to single-turn eval" or "internal-vs-external dogfooding divergence" or "composite degradation across overlapping changes"
The deterministic gate ships in production — version pinning, ablation harness, instrumentation, structural change, or all of the above
The case study is one page; an engineer or PM reading it can apply the calibration to a different Product.ai surface without re-asking
The candidate explicitly rejects "more evals" as the answer (it's the wrong mental model per PM Phase 3 briefing axiom D2, Eval-Suite Bypass)
The gate's effect is measurable — what specific failure class would now be caught that wasn't before
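
To make "what specific failure class would now be caught" concrete, the toy sketch below contrasts a single-turn check with a deterministic multi-turn invariant for the "session-state-dependent bug invisible to single-turn eval" class named above. `ToySession` and both check functions are hypothetical stand-ins written for this example, not a real Product.ai interface, and the simulated bug (history wiped on every turn) is only loosely modeled on the pattern described in the postmortem.

```python
from dataclasses import dataclass, field


@dataclass
class ToySession:
    """Hypothetical stand-in for a chat session; not a real Product.ai API."""
    clears_history: bool = False
    history: list[str] = field(default_factory=list)

    def send(self, message: str) -> int:
        """Record a turn and return how many turns the session remembers."""
        if self.clears_history:
            self.history.clear()  # simulated bug: state wiped on every turn
        self.history.append(message)
        return len(self.history)


def single_turn_check(session: ToySession) -> bool:
    """Passes whenever one isolated turn is remembered."""
    return session.send("hello") == 1


def multi_turn_invariant(session: ToySession, turns: int = 3) -> bool:
    """Passes only if the remembered turn count grows by one each turn."""
    return all(session.send(f"turn {i}") == i for i in range(1, turns + 1))
```

With `clears_history=True`, `single_turn_check` still passes while `multi_turn_invariant` fails on the second turn, so the measurable claim becomes "state loss across turns is now caught; it was not before." The invariant is a pass/fail assertion on session state, not an additional scored evaluator.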

References

references.md

PM Phase 3 briefing axiom D2 (Eval-Suite Bypass), VERDICT 8 (Verification Infrastructure as Anti-Comfort-Architecture)
AI Engineering Phase 3 briefing axioms A5 (External Truth Anchors), D1 (Eval Gap Irreducible), and D2 (LLM-as-Judge Mathematical Ceiling)
Backend Engineering Phase 3 briefing axiom D1 (Anthropic regression case study)
Anthropic April 23, 2026 Claude Code postmortem
AMD GitHub issue #42796 — forensic dataset
Stella Laurenzo's GitHub audit
Product.ai surface internals (chat, Alloy, SimplyCodes code-verification, MCP)

Constraints

Claude Code as primary substrate
The mitigation must be deterministic and structural — adding evaluators is not the answer
If any eval substrate is touched, use self-hostable eval tooling only (Phoenix or Langfuse, not Braintrust)
The case study is one page; expansion into a multi-section memo is a fail
IP separation: surfaces are application-layer; methodology paths are out of scope
The candidate may not propose "we need more evals" as the load-bearing recommendation; that is the canonical theater answer per PM Phase 3 briefing axiom D2 (Eval-Suite Bypass)

Apply

Read the Codex (10 min)

The operating principles we work by. If they resonate, the rest of this will land.

12-minute video screen

Hireflix, async. Questions are calibrated to this project specifically.

Chemistry call (30-60 min)

Direct call with the CEO. Strategic alignment and mutual fit. No problem-solving exercise.

Project begins within 2-3 weeks

1099 contractor agreement, NDA, paid at your stated rate. Day 1 in Santa Monica.

Alpha Team members can take this project without the screen-and-call sequence. Reach out via the Alpha Team channel.


Agent View · application/ld+json

{
  "@context": "https://schema.org",
  "@graph": [
    {
      "@id": "https://product.ai/#organization"
    },
    {
      "@type": "SoftwareApplication",
      "@id": "https://product.ai/#application",
      "name": "Product.ai",
      "slogan": "Intelligent guidance at conversation speed",
      "description": "Product.ai is your AI shopping expert - verified product intelligence grounded in the Truth Graph, accessible to agents and humans alike.",
      "applicationCategory": "ShoppingApplication",
      "operatingSystem": "Web",
      "author": {
        "@id": "https://product.ai/#organization"
      },
      "offers": {
        "@type": "Offer",
        "price": "0",
        "priceCurrency": "USD"
      }
    },
    {
      "@type": "WebPage",
      "@id": "https://product.ai/join/projects/PRJ-23-anthropic-postmortem-bypass-mitigation/#webpage",
      "name": "Anthropic Postmortem Calibration + Product.ai Eval-Bypass Mitigation — diagnose one equivalent risk and ship the gate — Product.ai",
      "description": "Read the Anthropic April 23, 2026 Claude Code postmortem end-to-end. Identify the structural mechanism by which three overlapping bugs evaded the world's most sophisticated eval suite for 50 days. Diagnose one equivalent eval-bypass risk pattern in a Product.ai surface. Ship t...",
      "url": "https://product.ai/join/projects/PRJ-23-anthropic-postmortem-bypass-mitigation/",
      "isPartOf": {
        "@id": "https://product.ai/#website"
      },
      "about": {
        "@id": "https://product.ai/#organization"
      },
      "publisher": {
        "@id": "https://product.ai/#organization"
      }
    },
    {
      "@type": "JobPosting",
      "title": "Anthropic Postmortem Calibration + Product.ai Eval-Bypass Mitigation — diagnose one equivalent risk and ship the gate",
      "description": "Read the Anthropic April 23, 2026 Claude Code postmortem end-to-end. Identify the structural mechanism by which three overlapping bugs evaded the world's most sophisticated eval suite for 50 days. Diagnose one equivalent eval-bypass risk pattern in a Product.ai surface. Ship t...",
      "hiringOrganization": {
        "@id": "https://product.ai/#organization"
      },
      "employmentType": "CONTRACTOR",
      "datePosted": "2026-04-28",
      "jobLocation": {
        "@type": "Place",
        "address": {
          "@type": "PostalAddress",
          "addressLocality": "Los Angeles",
          "addressRegion": "CA",
          "addressCountry": "US"
        }
      },
      "baseSalary": {
        "@type": "MonetaryAmount",
        "currency": "USD",
        "value": {
          "@type": "QuantitativeValue",
          "unitText": "WEEK"
        }
      },
      "workHours": "2 weeks"
    },
    {
      "@type": "BreadcrumbList",
      "itemListElement": [
        {
          "@type": "ListItem",
          "position": 1,
          "name": "Product.ai",
          "item": "https://product.ai"
        },
        {
          "@type": "ListItem",
          "position": 2,
          "name": "Join",
          "item": "https://product.ai/join/"
        },
        {
          "@type": "ListItem",
          "position": 3,
          "name": "Projects",
          "item": "https://product.ai/join/projects/"
        },
        {
          "@type": "ListItem",
          "position": 4,
          "name": "Anthropic Postmortem Calibration + Product.ai Eval-Bypass Mitigation — diagnose one equivalent risk and ship the gate",
          "item": "https://product.ai/join/projects/PRJ-23-anthropic-postmortem-bypass-mitigation/"
        }
      ]
    }
  ]
}