The Data Science Phase 3 briefing identifies system-level evaluation as Product.ai's highest-leverage second-act differentiation, conditional on customer pull. Four-layer eval stack: Layer 1 (model) and Layer 2 (agent trace) are commodity vendor coverage. Layer 3 (workflow execution) is partially covered by PAE / AgentEval / AgentCompass. Layer 4 (human-decision compliance) is uncovered by every published vendor and academic methodology. The Stripe canonical failure (model worked offline, agents adopted existing workflows and ignored ML prompts) is the canonical case — a recommendation can be 100% correct and 0% acted on. Hyperscalers (AWS Bedrock AgentCore GA March 31 2026, Anthropic, OpenAI, Google Cloud, Azure agent runtimes through 2026-2027) close the window on generalist Layer 1-3 within 12-18 months. Vertical specialization on Layer 4 is the durable moat ground for Product.ai. The candidate ships the v1 instrumentation on one surface — proves the methodology can deliver real signal — and produces the substrate that becomes the second-act differentiation.
aios/dashboard/CLAUDE.md Hard Rulesaios/dashboard/CLAUDE.md for design system; Cortex Dashboard architecture for Layer 4 visualization
The operating principles we work by. If they resonate, the rest of this will land. Open the Codex →
Hireflix, async. Questions are calibrated to this project specifically.
Direct call with the CEO. Strategic alignment and mutual fit. No problem-solving exercise.
1099 contractor agreement, NDA, paid at your stated rate. Day 1 in Santa Monica.
Alpha Team members can take this project without the screen-and-call sequence. Reach out via the Alpha Team channel.