Project Open to Alpha Team

Procedural Integrity Audit — measure the corrupt-success tax on one Product.ai agentic workflow

Take one Product.ai agentic workflow (candidates: Alloy, ARC application-layer if accessible, signal-step-executor, or an AIOS skill that fans out sub-agents). Apply the Procedure-Aware Evaluation framework (Cao et al., March 2026, arXiv:2603.03116). Measure the outcome success rate against the procedurally-clean success rate, and surface the gap between them: the corrupt-success tax. Segment by model family, workflow step, and policy domain. Write the calibration doc that lets the team interpret future Pass^k numbers correctly.

Project Overview

Discipline

AI Systems — Data Scientist / ML Engineer · AI Systems — AI Engineer · ai-systems-engineer

Duration

2 weeks

Compensation

Your stated freelance rate

Surface

Product.ai · Truth Graph · Engineering

Kernels

productai · truth-graph · engineering

Outcomes

chat-expert · dev-integrate · team-velocity

Tier

Applied

Alpha Team

Open to alpha members who want to take this on

Tooling

Claude Code or Co-work

Why we want this done

Procedure-Aware Evaluation demonstrates that 27-78% of τ-bench task "successes" by frontier agents are procedurally corrupt: the agent reaches the right end state through policy violations, fabrication, or unauthorized actions. Gated Pass^4 (consecutive policy-clean successes) collapses to 2-24%. Different model families have distinct corruption signatures: GPT-5 spreads errors broadly; Kimi-K2-Thinking concentrates 78% in policy faithfulness; Mistral-Large-3 fabricates "professionally formatted, authoritative-looking summaries." Outcome-only evaluation cannot distinguish correct-state-via-correct-process from correct-state-via-shortcut.

Product.ai's agentic workflows today report outcome-only success, so the corrupt-success tax is invisible. Whoever ships the audit produces (a) the first measurement of the actual quality state, (b) the calibration doc the team uses to interpret all future agentic numbers correctly, and (c) the substrate for procedural-integrity gating in the eval pipeline going forward.
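To make the difference between outcome-only and gated Pass^k concrete, here is a minimal Python sketch. It assumes the τ-bench-style estimator C(c, k) / C(n, k) for a task with c successes in n trials, averaged over tasks. Whether the paper computes gated Pass^k in exactly this way is an assumption, and the function names, task ids, and trial data are hypothetical.

```python
from math import comb
from statistics import mean

# Each trial is (outcome_success, procedurally_clean). Hypothetical shape.
Trial = tuple[bool, bool]


def pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Chance that k trials drawn without replacement from `trials` all succeed."""
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    if not 1 <= k <= trials:
        raise ValueError("k must be between 1 and trials")
    # math.comb returns 0 when k > successes, so the estimate is 0 in that case.
    return comb(successes, k) / comb(trials, k)


def pass_k_report(tasks: dict[str, list[Trial]], k: int) -> dict[str, float]:
    """Average outcome-only and gated Pass^k over all tasks."""
    outcome_only, gated = [], []
    for trials in tasks.values():
        n = len(trials)
        n_outcome = sum(1 for ok, _ in trials if ok)
        # Gated: a trial counts only if it succeeded AND was procedurally clean.
        n_clean = sum(1 for ok, clean in trials if ok and clean)
        outcome_only.append(pass_hat_k(n_outcome, n, k))
        gated.append(pass_hat_k(n_clean, n, k))
    return {"pass_k": mean(outcome_only), "gated_pass_k": mean(gated)}


# Illustrative data only: 8 trials per task.
tasks = {
    "task_a": [(True, True)] * 4 + [(True, False)] * 2 + [(False, False)] * 2,
    "task_b": [(True, True)] * 8,
}
report = pass_k_report(tasks, k=4)
print(report)
```

In this toy data, task_a has 6 outcome successes but only 4 clean ones, so its gated Pass^4 falls much further than its outcome-only Pass^4. That is the effect the audit is meant to quantify.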

Scope

Pick one workflow (the candidate proposes; we pressure-test) — Alloy reasoning loop, an AIOS sub-agent fan-out, or another workflow that produces verifiable end-states
Define corruption signatures relevant to the workflow — fabrication, policy violation, unauthorized action, hallucinated tool call
Build the procedural-integrity classifier — for each completed run, classify as procedurally-clean or corrupt with reasons
Run the workflow N=50+ times — measure outcome success vs procedurally-clean success
Compute the corrupt-success tax — the gap between outcome-success and procedurally-clean-success rates
Segment by model family (Opus / Sonnet / Haiku / GPT / Gemini if multi-provider), workflow step, policy domain
Recommend procedural-integrity gating: Pass^k metrics default to gated, with a stated threshold at which the workflow fails
One-page calibration doc — how to interpret future agentic numbers in light of procedural integrity
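The classification, tax, and segmentation steps in the scope list can be sketched as follows. This is a minimal illustration, not the classifier itself: it assumes each completed run has already been labeled with a list of violation reasons, and the record fields, label strings, and sample runs are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    """One completed run. Field names are illustrative assumptions."""
    model_family: str
    workflow_step: str
    policy_domain: str
    outcome_success: bool
    violations: tuple[str, ...] = ()  # e.g. "fabrication", "policy_violation"

    @property
    def clean_success(self) -> bool:
        # Procedurally clean success: right end state and no recorded violation.
        return self.outcome_success and not self.violations


def corrupt_success_tax(runs: list[RunRecord]) -> dict[str, float]:
    """Outcome rate, clean rate, and the gap between them."""
    n = len(runs)
    if n == 0:
        raise ValueError("need at least one run")
    n_outcome = sum(1 for r in runs if r.outcome_success)
    n_clean = sum(1 for r in runs if r.clean_success)
    return {
        "n": n,
        "outcome_rate": n_outcome / n,
        "clean_rate": n_clean / n,
        # Tax: gap between outcome-success and clean-success rates.
        "tax": (n_outcome - n_clean) / n,
        # Share of outcome successes that are corrupt.
        "corrupt_share": (n_outcome - n_clean) / n_outcome if n_outcome else 0.0,
    }


def segment(runs: list[RunRecord], key: str) -> dict[str, dict[str, float]]:
    """Compute the tax separately for each value of one RunRecord field."""
    groups: dict[str, list[RunRecord]] = defaultdict(list)
    for run in runs:
        groups[getattr(run, key)].append(run)
    return {name: corrupt_success_tax(group) for name, group in groups.items()}


# Illustrative data only.
runs = (
    [RunRecord("model_a", "lookup", "returns", True)] * 4
    + [RunRecord("model_a", "refund", "returns", True, ("policy_violation",))]
    + [RunRecord("model_b", "lookup", "returns", True)] * 2
    + [RunRecord("model_b", "refund", "returns", True, ("fabrication",))] * 2
    + [RunRecord("model_b", "refund", "returns", False)]
)
overall = corrupt_success_tax(runs)
by_step = segment(runs, "workflow_step")
by_model = segment(runs, "model_family")
print(overall)
print(by_step)
```

Storing the violation reasons on each record, rather than a single clean/corrupt flag, keeps the "with reasons" requirement intact and lets the same records be re-segmented by any field later.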

What success looks like

Corrupt-success tax is measured on at least one workflow with N=50+ runs
The classifier is reproducible — a stranger could run it on the next workflow
Segmentation reveals at least one signature (a model family, a workflow step, or a policy domain has higher corruption than the rest)
The procedural-integrity gating recommendation is concrete — what threshold, what action when threshold fails
The calibration doc is one page; an engineer reading it can apply it to a different workflow without re-asking
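A concrete gating recommendation pairs a threshold with an action. The sketch below shows one possible shape for that rule. The threshold values are placeholders rather than recommended numbers, and the names GatePolicy, GateAction, and evaluate_gate are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class GateAction(Enum):
    PASS = "pass"
    WARN = "warn"   # e.g. flag for human review
    FAIL = "fail"   # e.g. block the workflow from shipping


@dataclass(frozen=True)
class GatePolicy:
    # Placeholder thresholds on the corrupt-success tax; the audit should
    # replace these with measured, justified values.
    warn_tax: float = 0.05
    fail_tax: float = 0.15

    def __post_init__(self) -> None:
        if not 0.0 <= self.warn_tax <= self.fail_tax <= 1.0:
            raise ValueError("require 0 <= warn_tax <= fail_tax <= 1")


def evaluate_gate(
    outcome_successes: int,
    clean_successes: int,
    runs: int,
    policy: GatePolicy = GatePolicy(),
) -> GateAction:
    """Map a measured corrupt-success tax to a gate action."""
    if runs <= 0 or not 0 <= clean_successes <= outcome_successes <= runs:
        raise ValueError("require 0 <= clean <= outcome <= runs and runs > 0")
    tax = (outcome_successes - clean_successes) / runs
    if tax >= policy.fail_tax:
        return GateAction.FAIL
    if tax >= policy.warn_tax:
        return GateAction.WARN
    return GateAction.PASS


print(evaluate_gate(outcome_successes=45, clean_successes=30, runs=50))
```

Keeping the thresholds in one small policy object makes the recommendation easy to review and to carry over to a different workflow.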

References

references.md

Data Science Phase 3 briefing axioms A19 (Procedural Integrity / Corrupt Success) and A15 (Four-Layer Eval Stack)
Cao et al., March 2026, arXiv:2603.03116 — Procedure-Aware Evaluation framework
τ-bench corrupt-success rate data (27-78% across frontier agents)
Existing Product.ai agentic workflows (Alloy, signal-step-executor, AIOS sub-agent fan-outs)
AI Engineering Phase 3 briefing axioms A4 (Stripe Theorem) and A5 (External Truth Anchors)

Constraints

Claude Code as primary substrate
N=50+ runs minimum; with fewer runs the rate estimates are too noisy to interpret
Classifier must be reproducible — a stranger should be able to apply it
Privacy-respecting trace handling
LLM-as-Judge for procedural classification is limited to sub-frontier triage, with human-ground-truth validation
IP separation: the workflow is application-layer. Methodology paths (aios-methods) are out of scope unless the candidate is on the methodology team; trial contractors default to application-layer
The candidate must report honest numbers; gaming the classifier to produce a clean tax is a fail
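The N=50+ floor can be sanity-checked with a confidence interval. A clean success is by definition also an outcome success, so the corrupt-success tax equals the share of runs that are corrupt successes and can be treated as a single binomial proportion. The sketch below uses the Wilson score interval. That choice is an assumption made for illustration and is not prescribed by this brief, and the 30% observed tax is a made-up figure.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(hits: int, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion hits / n."""
    if n <= 0 or not 0 <= hits <= n:
        raise ValueError("require n > 0 and 0 <= hits <= n")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Same observed tax (30%) at three sample sizes; illustrative numbers only.
for n in (10, 50, 200):
    corrupt_successes = round(0.3 * n)
    low, high = wilson_interval(corrupt_successes, n)
    print(f"N={n:>3}: observed tax {corrupt_successes / n:.2f}, "
          f"95% interval {low:.2f} to {high:.2f}")
```

Even at N=50 the interval around an observed tax stays wide, and it widens further within each segment, so segment-level comparisons should be reported with their intervals rather than as point estimates.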

Apply

Read the Codex (10 min)

The operating principles we work by. If they resonate, the rest of this will land.

12-minute video screen

Hireflix, async. Questions are calibrated to this project specifically.

Chemistry call (30-60 min)

Direct call with the CEO. Strategic alignment and mutual fit. No problem-solving exercise.

Project begins within 2-3 weeks

1099 contractor agreement, NDA, paid at your stated rate. Day 1 in Santa Monica.

Alpha Team members can take this project without the screen-and-call sequence. Reach out via the Alpha Team channel.

Agent View · application/ld+json

{
  "@context": "https://schema.org",
  "@graph": [
    {
      "@id": "https://product.ai/#organization"
    },
    {
      "@type": "SoftwareApplication",
      "@id": "https://product.ai/#application",
      "name": "Product.ai",
      "slogan": "Intelligent guidance at conversation speed",
      "description": "Product.ai is your AI shopping expert - verified product intelligence grounded in the Truth Graph, accessible to agents and humans alike.",
      "applicationCategory": "ShoppingApplication",
      "operatingSystem": "Web",
      "author": {
        "@id": "https://product.ai/#organization"
      },
      "offers": {
        "@type": "Offer",
        "price": "0",
        "priceCurrency": "USD"
      }
    },
    {
      "@type": "WebPage",
      "@id": "https://product.ai/join/projects/PRJ-30-procedural-integrity-corrupt-success-audit/#webpage",
      "name": "Procedural Integrity Audit — measure the corrupt-success tax on one Product.ai agentic workflow — Product.ai",
      "description": "Take one Product.ai agentic workflow (Alloy, ARC application-layer if accessible, signal-step-executor, an AIOS skill that fans out sub-agents). Apply the Procedure-Aware Evaluation framework (Cao et al., March 2026, arXiv:2603.03116). Measure outcome success rate vs procedura...",
      "url": "https://product.ai/join/projects/PRJ-30-procedural-integrity-corrupt-success-audit/",
      "isPartOf": {
        "@id": "https://product.ai/#website"
      },
      "about": {
        "@id": "https://product.ai/#organization"
      },
      "publisher": {
        "@id": "https://product.ai/#organization"
      }
    },
    {
      "@type": "JobPosting",
      "title": "Procedural Integrity Audit — measure the corrupt-success tax on one Product.ai agentic workflow",
      "description": "Take one Product.ai agentic workflow (Alloy, ARC application-layer if accessible, signal-step-executor, an AIOS skill that fans out sub-agents). Apply the Procedure-Aware Evaluation framework (Cao et al., March 2026, arXiv:2603.03116). Measure outcome success rate vs procedura...",
      "hiringOrganization": {
        "@id": "https://product.ai/#organization"
      },
      "employmentType": "CONTRACTOR",
      "datePosted": "2026-04-28",
      "jobLocation": {
        "@type": "Place",
        "address": {
          "@type": "PostalAddress",
          "addressLocality": "Los Angeles",
          "addressRegion": "CA",
          "addressCountry": "US"
        }
      },
      "baseSalary": {
        "@type": "MonetaryAmount",
        "currency": "USD",
        "value": {
          "@type": "QuantitativeValue",
          "unitText": "WEEK"
        }
      },
      "workHours": "2 weeks"
    },
    {
      "@type": "BreadcrumbList",
      "itemListElement": [
        {
          "@type": "ListItem",
          "position": 1,
          "name": "Product.ai",
          "item": "https://product.ai"
        },
        {
          "@type": "ListItem",
          "position": 2,
          "name": "Join",
          "item": "https://product.ai/join/"
        },
        {
          "@type": "ListItem",
          "position": 3,
          "name": "Projects",
          "item": "https://product.ai/join/projects/"
        },
        {
          "@type": "ListItem",
          "position": 4,
          "name": "Procedural Integrity Audit — measure the corrupt-success tax on one Product.ai agentic workflow",
          "item": "https://product.ai/join/projects/PRJ-30-procedural-integrity-corrupt-success-audit/"
        }
      ]
    }
  ]
}