Project Open to Alpha Team

Procedural Integrity Audit — measure the corrupt-success tax on one Product.ai agentic workflow

Take one Product.ai agentic workflow: Alloy, the ARC application layer (if accessible), signal-step-executor, or an AIOS skill that fans out sub-agents. Apply the Procedure-Aware Evaluation framework (Cao et al., March 2026, arXiv:2603.03116) to it. Measure the outcome success rate against the procedurally-clean success rate and surface the gap, the corrupt-success tax. Segment the result by model family, workflow step, and policy domain. Then write the calibration doc that lets the team interpret future Pass^k numbers correctly.
Project Overview
Discipline: AI Systems — Data Scientist / ML Engineer · AI Systems — AI Engineer · ai-systems-engineer
Duration: 2 weeks
Compensation: Your stated freelance rate
Surface: Product.ai · Truth Graph · Engineering
Kernels: productai · truth-graph · engineering
Outcomes: chat-expert · dev-integrate · team-velocity
Tier: Applied
Alpha Team: Open to alpha members who want to take this on
Tooling: Claude Code or Co-work

Why we want this done

The Procedure-Aware Evaluation paper reports that 27-78% of τ-bench task "successes" by frontier agents are procedurally corrupt: the agent reaches the right end state through policy violations, fabrication, or unauthorized actions. Gated Pass^4 (four consecutive policy-clean successes) collapses to 2-24%. Model families show distinct corruption signatures: GPT-5 spreads errors broadly; Kimi-K2-Thinking concentrates 78% in policy faithfulness; Mistral-Large-3 fabricates "professionally formatted, authoritative-looking summaries." Outcome-only evaluation cannot distinguish a correct state reached by a correct process from a correct state reached by a shortcut.

Product.ai's agentic workflows currently report outcome-only success, so the corrupt-success tax is invisible. Whoever ships the audit produces (a) the first measurement of the workflows' actual quality, (b) the calibration doc the team uses to interpret all future agentic numbers correctly, and (c) the substrate for procedural-integrity gating in the eval pipeline going forward.
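To make the difference between outcome-only and gated Pass^k concrete, here is a minimal Python sketch. It assumes the combinatorial estimator commonly used for τ-bench-style Pass^k (the chance that k trials drawn from a task's n recorded trials all succeed), averaged over tasks. The `Trial` record and its field names are hypothetical and are not taken from the paper or from any Product.ai code.

```python
from dataclasses import dataclass
from math import comb


@dataclass(frozen=True)
class Trial:
    """One run of one task (hypothetical record shape)."""
    outcome_ok: bool    # the end state matched the expected state
    procedure_ok: bool  # no policy violation, fabrication, or unauthorized action


def pass_hat_k(successes: int, n: int, k: int) -> float:
    """Estimate the chance that k trials drawn from n all succeed."""
    if k > n:
        raise ValueError("k cannot exceed the number of trials per task")
    return comb(successes, k) / comb(n, k)


def pass_k(tasks: dict[str, list[Trial]], k: int, gated: bool) -> float:
    """Average Pass^k over tasks.

    With gated=True a trial counts only if it is both outcome-correct
    and procedurally clean; with gated=False only the outcome counts.
    """
    scores = []
    for trials in tasks.values():
        wins = sum(t.outcome_ok and (t.procedure_ok or not gated) for t in trials)
        scores.append(pass_hat_k(wins, len(trials), k))
    return sum(scores) / len(scores)
```

A single corrupt success among four trials leaves the ungated score for that task untouched and drives its gated Pass^4 to zero, which is the collapse described above.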

Scope

  1. Pick one workflow (the candidate proposes; we pressure-test) — Alloy reasoning loop, an AIOS sub-agent fan-out, or another workflow that produces verifiable end-states
  2. Define corruption signatures relevant to the workflow — fabrication, policy violation, unauthorized action, hallucinated tool call
  3. Build the procedural-integrity classifier — for each completed run, classify as procedurally-clean or corrupt with reasons
  4. Run the workflow N=50+ times — measure outcome success vs procedurally-clean success
  5. Compute the corrupt-success tax — the gap between outcome-success and procedurally-clean-success rates
  6. Segment by model family (Opus / Sonnet / Haiku / GPT / Gemini if multi-provider), workflow step, policy domain
  7. Recommend procedural-integrity gating, with Pass^k reported as gated by default and a stated threshold below which the workflow fails
  8. One-page calibration doc — how to interpret future agentic numbers in light of procedural integrity
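Steps 3 to 6 above can be sketched in Python as follows. This is an illustration under stated assumptions, not the required design: the `RunRecord` shape, its field names, and the violation labels are hypothetical, and the classifier itself is assumed to have already attached its reasons to each run as `violations`.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    """One completed workflow run after classification (hypothetical shape)."""
    model_family: str
    policy_domain: str
    outcome_ok: bool
    violations: tuple[str, ...] = ()  # classifier reasons, e.g. "fabrication"

    @property
    def clean_ok(self) -> bool:
        # Procedurally clean success: right end state and no violations.
        return self.outcome_ok and not self.violations


def corrupt_success_tax(runs: list[RunRecord]) -> dict[str, float]:
    """Outcome success rate, clean success rate, and the gap between them."""
    n = len(runs)
    outcome_rate = sum(r.outcome_ok for r in runs) / n
    clean_rate = sum(r.clean_ok for r in runs) / n
    return {
        "n": n,
        "outcome_rate": outcome_rate,
        "clean_rate": clean_rate,
        "tax": outcome_rate - clean_rate,
    }


def segment(runs: list[RunRecord], key: str) -> dict[str, dict[str, float]]:
    """Compute the tax separately for each value of one RunRecord field."""
    groups: dict[str, list[RunRecord]] = defaultdict(list)
    for run in runs:
        groups[getattr(run, key)].append(run)
    return {name: corrupt_success_tax(group) for name, group in groups.items()}
```

Calling `segment(runs, "model_family")` or `segment(runs, "policy_domain")` gives the per-segment breakdown. Segmenting by workflow step would additionally require each violation to carry the step at which it occurred, which this sketch leaves out.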

What success looks like

  • Corrupt-success tax is measured on at least one workflow with N=50+ runs
  • The classifier is reproducible — a stranger could run it on the next workflow
  • Segmentation reveals at least one signature (a model family, a workflow step, or a policy domain has higher corruption than the rest)
  • The procedural-integrity gating recommendation is concrete — what threshold, what action when threshold fails
  • The calibration doc is one page; an engineer reading it can apply it to a different workflow without re-asking
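One way to make a gating recommendation concrete is to compare a confidence interval on the procedurally-clean success rate against a threshold, and to name an action for each result. The Python sketch below uses the Wilson score interval, which is a choice made here for illustration. The `min_clean_rate` default of 0.80 is a placeholder assumption, not a recommended value; choosing the threshold is part of the project.

```python
from math import sqrt


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


def gate(clean_successes: int, n: int, min_clean_rate: float = 0.80) -> str:
    """Return 'pass', 'fail', or 'inconclusive' for one workflow.

    min_clean_rate is a hypothetical threshold used only for illustration.
    """
    low, high = wilson_interval(clean_successes, n)
    if low >= min_clean_rate:
        return "pass"          # even the pessimistic bound clears the bar
    if high < min_clean_rate:
        return "fail"          # even the optimistic bound misses the bar
    return "inconclusive"      # collect more runs before deciding
```

At N=50 the interval around an observed rate of 50% spans roughly 37% to 63%, so a gate built this way will often return "inconclusive" near the threshold. The recommendation should say what happens in that case as well as on a clear fail.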

References

references.md
Data Science Phase 3 briefing axiom A19 (Procedural Integrity / Corrupt Success), A15 (Four-Layer Eval Stack)
Cao et al., March 2026, arXiv:2603.03116 — Procedure-Aware Evaluation framework
τ-bench corrupt-success rate data (27-78% across frontier agents)
Existing Product.ai agentic workflows (Alloy, signal-step-executor, AIOS sub-agent fan-outs)
AI Engineering Phase 3 briefing axiom A4 (Stripe Theorem), A5 (External Truth Anchors)

Constraints

  • Claude Code as primary substrate
  • N=50+ runs minimum; with fewer runs the confidence intervals on the measured rates are too wide to be informative
  • Classifier must be reproducible — a stranger should be able to apply it
  • Privacy-respecting trace handling
  • LLM-as-Judge for procedural classification is limited to sub-frontier triage and must be validated against human ground truth
  • IP separation: workflow is application-layer; methodology paths (aios-methods) are out of scope unless the candidate is on the methodology team — for trial contractors, default to application-layer
  • The candidate must report honest numbers; gaming the classifier to produce a clean tax is a fail
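For the human-ground-truth validation constraint above, one common check is chance-corrected agreement between the judge's labels and a human-labeled sample. The Python sketch below computes Cohen's kappa; using kappa is an assumption made here, since the posting does not name an agreement metric, and the label strings are hypothetical.

```python
from collections import Counter


def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Chance-corrected agreement between two label sequences of equal length."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    human_counts, judge_counts = Counter(human), Counter(judge)
    labels = set(human) | set(judge)
    # Agreement expected if both raters labeled independently at their own base rates.
    expected = sum(human_counts[label] * judge_counts[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

Reporting kappa alongside raw agreement matters when most runs are clean, because a judge that labels everything "clean" can score high raw agreement while catching no corrupt runs.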
Apply
01 · Read the Codex (10 min)

The operating principles we work by. If they resonate, the rest of this will land. Open the Codex →

02 · 12-minute video screen

Hireflix, async. Questions are calibrated to this project specifically.

03 · Chemistry call (30-60 min)

Direct call with the CEO. Strategic alignment and mutual fit. No problem-solving exercise.

04 · Project begins within 2-3 weeks

1099 contractor agreement, NDA, paid at your stated rate. Day 1 in Santa Monica.

Alpha Team members can take this project without the screen-and-call sequence. Reach out via the Alpha Team channel.