Product.ai / Join / Projects / Layer 4 Human-Decision Compliance Instrumentation — measure adoption-decay and override patterns on one Product.ai surface
Project Open to Alpha Team

Layer 4 Human-Decision Compliance Instrumentation — measure adoption-decay and override patterns on one Product.ai surface

Build Layer 4 instrumentation on one Product.ai surface — the surface where the agent's recommendation reaches a human and the human acts on it, overrides it, or ignores it. Track override patterns, adoption-decay curves, cognitive switching cost signatures by user cohort. Output is a working instrumentation pipeline producing real Layer 4 metrics on real users. This is the layer every commercial LLMOps vendor (Galileo, Arize Phoenix, Patronus, LangSmith, Maxim AI, DeepEval) does NOT cover.
Project Overview
Discipline
AI Systems — Data Scientist / ML Engineer · AI Systems — AI Engineer · product-manager-agent-commerce
Duration
3 weeks
Compensation
Your stated freelance rate
Surface
Product.ai · Truth Graph · Agent commerce
Kernels
productai · truth-graph · agent-commerce
Outcomes
chat-expert · dev-integrate · truth-graph-depth
Tier
Consequential
Alpha Team
Open to alpha members who want to take this on
Tooling
Claude Code or Co-work

Why we want this done

The Data Science Phase 3 briefing identifies system-level evaluation as Product.ai's highest-leverage second-act differentiation, conditional on customer pull. Four-layer eval stack: Layer 1 (model) and Layer 2 (agent trace) are commodity vendor coverage. Layer 3 (workflow execution) is partially covered by PAE / AgentEval / AgentCompass. Layer 4 (human-decision compliance) is uncovered by every published vendor and academic methodology. The Stripe canonical failure (model worked offline, agents adopted existing workflows and ignored ML prompts) is the canonical case — a recommendation can be 100% correct and 0% acted on. Hyperscalers (AWS Bedrock AgentCore GA March 31 2026, Anthropic, OpenAI, Google Cloud, Azure agent runtimes through 2026-2027) close the window on generalist Layer 1-3 within 12-18 months. Vertical specialization on Layer 4 is the durable moat ground for Product.ai. The candidate ships the v1 instrumentation on one surface — proves the methodology can deliver real signal — and produces the substrate that becomes the second-act differentiation.

Scope

  1. Pick one Product.ai surface where Layer 4 dynamics are observable (Alloy AI agent, Product.ai chat, the upcoming external MCP server, SimplyCodes recommendation surface)
  2. Define Layer 4 measurement primitives: override rate, adoption-decay curve, cognitive switching cost signature, override-pattern by cohort, escalation rate, "ignored" rate
  3. Instrument the surface — capture each primitive in production with structured event capture
  4. Build the dashboard surface (Cortex CEO Dashboard module or AIOS dashboard module) showing the four-to-six primitives, scannable, drill-in tray per signal
  5. Run the instrumentation through at least one production cycle with real users
  6. Write the methodology doc — what each primitive measures, what failure modes it catches that Layer 1-3 evals do not, where the methodology is fragile or gap-flagged
  7. The doc is what becomes Product.ai's vertical-specialization moat material — write it for the audience of an enterprise commerce customer evaluating system-level eval

What success looks like

  • All four-to-six primitives capture real data on the chosen surface during the trial window
  • One real surprise emerges — a user behavior the team had not seen in Layer 1-3 metrics
  • The dashboard surface follows aios/dashboard/CLAUDE.md Hard Rules
  • The methodology doc is concrete enough that an enterprise customer reading it can decide whether they want it deployed in their org
  • The candidate explicitly distinguishes Layer 4 from Layer 1-3 in writing — "ARC + Alloy answers 'is it true?'; Layer 4 answers 'is it acted on?'; conflation is brand erosion" (per axiom A17)

References

references.md
Data Science Phase 3 briefing axiom A15 (Four-Layer Eval Stack), A16 (Stripe Canonical Failure), A17 (ARC + Alloy ≠ System-Level Eval), A18 (Hyperscaler Window Closes 12-18 Months), VERDICT 6 (System-Level Evaluation as Differentiation)
Stripe LLM-powered support response failure (canonical case study)
AWS Bedrock AgentCore documentation (March 31 2026 GA)
Decades of human-machine compliance research (healthcare CDSS, aviation, military) for theoretical grounding
Existing Product.ai surfaces and their Layer 1-3 instrumentation
aios/dashboard/CLAUDE.md for design system; Cortex Dashboard architecture for Layer 4 visualization

Constraints

  • Claude Code or Co-work as primary substrate
  • Privacy-respecting instrumentation — explicit user consent and PII handling documented
  • Self-hostable eval substrate if any Layer 1-3 substrate is touched
  • Atomic HTML+JS commit on dashboard work per CLAUDE.md §5.2 rule 3
  • Methodology doc is written for an external commerce-vertical customer audience (the doc is the moat)
  • IP separation: surface is application-layer; methodology paths are out of scope; ARC + Alloy is a distinct capability (do not conflate)
  • 3-week duration cap; if scope creeps, the candidate negotiates the trade-off explicitly
Apply
01

Read the Codex (10 min)

The operating principles we work by. If they resonate, the rest of this will land. Open the Codex →

02

12-minute video screen

Hireflix, async. Questions are calibrated to this project specifically.

03

Chemistry call (30-60 min)

Direct call with the CEO. Strategic alignment and mutual fit. No problem-solving exercise.

04

Project begins within 2-3 weeks

1099 contractor agreement, NDA, paid at your stated rate. Day 1 in Santa Monica.

Alpha Team members can take this project without the screen-and-call sequence. Reach out via the Alpha Team channel.