Procedure-Aware Evaluation demonstrates that 27-78% of τ-bench task "successes" by frontier agents are procedurally corrupt — agents reach the right state through policy violations, fabrication, or unauthorized actions. Gated Pass^4 (consecutive policy-clean successes) collapses to 2-24%. Different model families have distinct corruption signatures: GPT-5 spreads errors broadly; Kimi-K2-Thinking concentrates 78% in policy faithfulness; Mistral-Large-3 fabricates "professionally formatted, authoritative-looking summaries." Outcome-only evaluation cannot distinguish correct-state-via-correct-process from correct-state-via-shortcut. Product.ai's agentic workflows today report outcome-only success. The corrupt-success tax is invisible. Whoever ships the audit produces (a) the first measurement of the actual quality state, (b) the calibration doc the team uses to interpret all future agentic numbers correctly, and (c) the substrate for procedural-integrity gating in the eval pipeline going forward.
aios-methods) are out of scope unless the candidate is on the methodology team — for trial contractors, default to application-layerThe operating principles we work by. If they resonate, the rest of this will land. Open the Codex →
Hireflix, async. Questions are calibrated to this project specifically.
Direct call with the CEO. Strategic alignment and mutual fit. No problem-solving exercise.
1099 contractor agreement, NDA, paid at your stated rate. Day 1 in Santa Monica.
Alpha Team members can take this project without the screen-and-call sequence. Reach out via the Alpha Team channel.