The core unit of Product.ai's shopping intelligence is the axiom — a testable, falsifiable truth about how a product or product category actually works. This practitioner account documents the build and evaluation of a prototype designed to pre-forge these truths at scale and test whether they change what people actually buy. Three product categories — smartphones, running shoes, and skincare — served as the test bed.

What the Team Built

The team produced axiom-grounded shopping intelligence across three product categories (smartphones, running shoes, skincare actives) by using frontier AI models as adversarial research instruments to pre-forge knowledge those same models cannot produce at runtime. We called the prototype Alloy internally. The methodology, protocol, and evaluation framework all emerged through the work itself — the axiom schema built bottom-up from production forging, the forge protocol evolving as parallel contributors reconciled independently authored systems, the verification rubric invented alongside the system it evaluates. In formalized evaluations, the prototype's grounded responses produced verified Truth Deltas — documented instances where an axiom-grounded answer changes the purchase decision a shopper would make. Running shoes and skincare actives (in progress) followed the same forge architecture with structurally different category physics, suggesting cross-domain generalizability.

The Problem: Why Every AI Shopping Answer Sounds the Same

Ask ChatGPT, Gemini, and Perplexity the same shopping question and you get, at best, the answers of a glorified search engine: parroted listicles and non-substantive claims that converge on a median. The phrasing and retrieval may vary, but the content is not fundamentally more intelligent. It is an extension of a marketing machine that triages on volume and availability and lacks epistemic rigor. That failure mode is the beige singularity (a concept we define in depth in our white paper, "Axiomatic Intelligence").

Modern AI shopping stacks all converge at a mediocre altitude. SEO-polluted training corpora bias models toward whatever the commercial web repeats. Capped inference budgets force shallow synthesis over deep verification. Polite RLHF rewards non-confrontational, high-agreement language over adversarial judgment. Add a thin layer of RAG and "web search" and you do not escape the trap. You just re-summarize the same polluted substrate, faster. More data does not fix this either. If the data distribution is corrupted, scaling it scales the corruption.

That failure is epistemic, not aesthetic. A probabilistic assistant treats competing claims as comparable signals unless you give it a way to privilege reality. Marketing copy that says "all-day battery life" and an independent teardown that measures a 3,200mAh cell land in the same blender. The model can paraphrase both. It cannot reliably adjudicate between them because the system lacks hard rules for weight, provenance, and falsification.

The Product.ai team built the prototype to industrialize the opposite behavior: use the reasoning horsepower now possible to fight the epistemic vacuum of commerce. Commercial misinformation. Incentive-shaped content. Content and comprehension asymmetries that make ordinary people lose time, money, and satisfaction. The product is not better-sounding answers. The product is a repeatable method for producing decisional truth under adversarial pressure.

Product.ai's thesis is that this distinction is the entire product. The Experiential Delta (the measurable difference between an axiomatic response and a standard LLM response to the same shopping query) must be decisional. It must change the purchase outcome, not just sound smarter. That delta is produced by pre-computed axioms: testable, falsifiable truths about how products and product categories actually work. They are forged offline through the Axiom Distillation Protocol (Product.ai's methodology for stress-testing claims through adversarial collision across multiple independent AI research agents), stored in a layered knowledge repository called the Truth Graph, and served to the runtime agent at constant retrieval cost (O(1): the same cost regardless of how many axioms exist). The agent does not re-derive the answer. It retrieves pre-forged wisdom and reasons within it. The prototype is not a rolodex of specs but a gestalt prism of physical truths against which a base LLM is adjudicated.
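To make the constant-cost retrieval claim concrete, here is a minimal sketch, assuming the pre-forged axioms sit in an in-memory hash index keyed by category and layer. The names `AXIOM_INDEX` and `retrieve`, the key shape, and the sample entries are hypothetical; this article does not describe how the Truth Graph is actually stored.

```python
from typing import Dict, List, Tuple

# Hypothetical index: (category, layer) -> pre-forged axiom statements.
# The statements paraphrase examples from this article; the storage layout
# is an assumption made purely for illustration.
AXIOM_INDEX: Dict[Tuple[str, str], List[str]] = {
    ("smartphones", "L1"): [
        "Shoppers replace phones mainly when battery degradation makes the device unreliable.",
    ],
    ("smartphones", "L2"): [
        "Sustained performance is bounded by chipset thermal design power and chassis heat dissipation.",
    ],
}


def retrieve(category: str, layer: str) -> List[str]:
    """Look up pre-forged axioms by key; nothing is re-derived at query time."""
    # A dict lookup is average-case constant time, independent of how many
    # other (category, layer) buckets the index holds.
    return AXIOM_INDEX.get((category, layer), [])
```

The point of the sketch is the shape of the work: the expensive reasoning happens when the index is built, and the runtime path is a single keyed lookup.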

Even proving that this works is structurally hard. You cannot A/B test axiom-grounded intelligence the way you test a UI change, because the delta is decisional, not tonal: same query, different purchase outcome. Standard AI evaluation frameworks measure fluency, coherence, and factual accuracy. None of them measure whether the answer would change what you buy. The team built the verification rubric alongside the system it evaluates. The evaluation framework co-evolves with the thing being evaluated. That is an inherent property of the problem, not a gap in execution.

This prototype is the proof of concept for that thesis. Three categories. Three fundamentally different physics domains. One forge methodology. The question it set out to answer: does axiom-grounded intelligence produce measurably superior shopping answers, and does the methodology generalize across categories?

The Build: Six Decisions That Shaped the Prototype

Each section below documents a specific architectural choice point. These are the decisions that shaped the system. Some were made deliberately. Some emerged from the work itself.

Why We Banned Simulated Users

The first engineering decision was also the most consequential. We chose physics over heuristics as the governing mandate for the entire prototype phase.

The temptation in modern AI engineering is to use models to test models. Generate a thousand synthetic personas. Have them ask a thousand synthetic questions. See if the system breaks. This is intuitive, scalable, and wrong. It creates what we call a derivative of a derivative. The model hallucinates a user. The user hallucinates a need. The system answers the query. If the answer fails, the failure's origin is untraceable. Is it the ontology? The prompt? Or did the synthetic persona ask a nonsensical question in the first place?

We banned this approach entirely. Not as a preference. As protocol. We also banned what we call the taxonomy trap: the instinct to build a perfect matrix of knowledge categories before testing the product against a single real query.

The alternative we adopted: build the spike, not the grid. Instead of mapping a complete ontology, we took a single product (the Google Pixel 9 Pro) and deconstructed it to its physical laws. Does the Pixel 9 Pro have a 5x telephoto lens? Yes or no. Why? What are the thermal constraints on sustained camera use? What is the actual watt-hour capacity, and does it physically support the "all-day battery" marketing claim at observed power draw?

This mandate extends beyond the product ontology to how the protocols themselves are authored. When you use AI to build AI (to write the prompts that govern the forge), the AI wants to be instructional and prescriptive. It produces procedure. But we are not writing code. We are doing ontological, epistemic work. The protocols that govern the forge encode principles, not steps. I found that the more I stripped down the protocols and hand-wrote them from principles, rather than letting AI generate procedural instructions, the better the ontologies the forge produced: more comprehensive, and requiring less human intervention. AI-generated prompts produce procedural checklists. Hand-written protocols grounded in principles produce emergence. The principle behind "physics over heuristics" is recursive: it governs both what the forge produces and how the forge itself is designed.

The mandate crystallized into a phrase that now governs how we build: "We do not simulate reality. We sample it." Every test against the system uses golden queries — real human questions rooted in physical reality, not synthetic populations. We do not automate what we do not yet understand.

Why We Built the Knowledge Layer Before the Chat Experience

The second most consequential decision was what we chose not to build.

The biggest mistake would have been focusing on response principles, shopper intent modeling, and conversational UX before proving the axiom layer. Frontier models already excel at reasoning, inference, query deconstruction, and natural conversation. Building those capabilities would have meant rebuilding what models already do well: dressing an existing model in new clothes rather than building something that truly addresses the beige singularity.

The discipline was to stay focused on forging and proofing the concept of a necessary wisdom battery — a pre-computed knowledge layer that the agent retrieves from rather than re-derives at runtime. The existential question was binary: do pre-forged axioms produce materially different answers from an ungrounded model? The only way to test that was to forge first, then plug the axioms into an off-the-shelf frontier model and observe whether the delta appeared.

I tested this by asking my own shopping questions against the axiom-grounded system. One query: "I want a shoe for jogging on sidewalks — I have pretty flat feet, and I'm a woman with a kind of uneven stride and have the most pain on top of my mid-foot when I run." Another: "What phone has the best camera and feature set for recording vlogs? I don't want to invest in more dedicated vlogging equipment just yet, and I need a new phone anyways."

Both queries are descriptive, not prescriptive. I did not say "stability shoe with a medial post and 8mm drop." I did not say "phone with optical image stabilization and external microphone support." I described my situation the way a consumer would type it into a chat box.

The axiom-grounded system gave genuinely different answers than Google and ChatGPT — highly analytical, lab-level verdicts that diverged from SEO results. What surprised me was how the system intuited what I needed physics-wise from a descriptive query. The running shoes axioms — over a thousand of them spanning buyer psychology, biomechanics, fit, materials, and traction — are dense and illegible to a consumer. But the LLM naturally knew what to do with them. It translated "pain on top of my mid-foot" into biomechanical constraints and produced a verdict that neither an ungrounded model nor a research aggregator could arrive at alone. Even testing labs cannot produce these verdicts alone, because the verdicts require synthesizing across physics domains that no single source covers.

This confirmed the architecture. Axioms are machine-readable wisdom, not human-readable content. They are written for the agent, not the shopper. The shopper gets the answer. The axioms power it. The system prompt handles consumer-facing psychology, response voice, and formatting — it does not contain domain knowledge. If the impressive thing about the response is the prompt, the product has no defensible value. If the impressive thing is that the system knew something the frontier model could not have derived on its own — the thermal design power of a chipset, the biomechanical interaction between pronation and lacing pressure, the pH stability window of a skincare active — that is the axiom layer working. That is the product.

Four Layers of Product Knowledge (and Why They Compound)

The Truth Graph — the layered knowledge repository where verified product truths are stored and retrieved — is not a flat database. It is a layered inheritance architecture where each layer compounds on the one below it.

L1 is Buyer Psychology. Why do humans "hire" products in this category? Category-specific decision psychology, jobs-to-be-done (a framework originating from Anthony Ulwick's work on outcome-driven innovation), cognitive heuristics, purchase motivation, regret mechanics. For smartphones, L1 encodes axioms like: consumers replace phones primarily when battery degradation makes the device unreliable, not when new features are released. That axiom carries a falsifiability criterion — if smartphone replacement cycles shorten to under 18 months driven by feature adoption rather than degradation, the axiom is invalidated. L1 is published publicly. It is the gravity well that pulls organic traffic to the Truth Graph.

L2 is Category Physics. Engineering constraints, material science, market dynamics, compatibility rules. What are the immutable laws governing this product type? For smartphones, L2 encodes axioms about the physical relationship between battery capacity, chipset thermal design power, and sustained performance under load. These are not opinions. They are engineering constraints that every smartphone manufacturer navigates. L2 is also public.

L3 is Brand Physics. Entity-level behavioral patterns. This layer maps how each brand navigates the category physics defined in L2. L3 is published publicly.

L4 is Product Verdicts. SKU-level buy/skip/conditional evaluations. L4 is not pre-forged. It is runtime inference: the agent uses L1 through L3 axioms plus real-time web search to produce product-level verdicts on the fly. (Note: in our initial public demo, web search is off. Because the demo runs on a model with a recent training data cutoff, its performance is still remarkable.) The durability math makes pre-forging untenable: a verdict on a specific SKU can be invalidated by a firmware update, a price change, or new test data. The axioms tell the agent what physics to check. Real-time search gives it current product data. The inference is cheap. The axiom forging that enabled it is what was expensive.
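The split between the three pre-forged layers and runtime L4 inference can be sketched roughly as follows. This is an illustration under stated assumptions, not the Truth Graph's real implementation: the `TruthGraph` class, its nested-dict storage, and the `grounding_context` method are all hypothetical names.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Only L1-L3 are stored; L4 verdicts are inferred at runtime and never saved.
PRE_FORGED_LAYERS = ("L1", "L2", "L3")


@dataclass
class TruthGraph:
    # Hypothetical storage: category -> layer -> axiom statements.
    axioms: Dict[str, Dict[str, List[str]]] = field(default_factory=dict)

    def add(self, category: str, layer: str, statement: str) -> None:
        """Store a pre-forged axiom; reject attempts to pre-forge L4."""
        if layer not in PRE_FORGED_LAYERS:
            raise ValueError(f"{layer} is not a pre-forged layer")
        self.axioms.setdefault(category, {}).setdefault(layer, []).append(statement)

    def grounding_context(self, category: str, live_data: Optional[str] = None) -> List[str]:
        """Assemble what the agent reasons within when producing an L4 verdict."""
        layers = self.axioms.get(category, {})
        context = [
            f"[{layer}] {statement}"
            for layer in PRE_FORGED_LAYERS          # L1 first, so lower layers lead
            for statement in layers.get(layer, [])
        ]
        if live_data is not None:                   # search may be off, as in the initial demo
            context.append(f"[live] {live_data}")
        return context
```

In this sketch the verdict itself is deliberately absent from storage: the graph only supplies the physics to check, and anything SKU-specific arrives through `live_data` or the model's own knowledge.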

The architectural decision was to forge depth-first, not breadth-first. For each category, we built the complete L1, then the complete L2, before touching L3.

The reason depth-first matters: L2 axioms are the differentiator. L1 reasoning (understanding user values) is something a well-prompted frontier model can approximate. L2 reasoning (understanding engineering constraints and material science) is where the Experiential Delta lives. When the agent can tell a shopper that the phone they are considering cannot physically sustain its advertised performance because the thermal design power of its chipset exceeds the heat dissipation capacity of its chassis, that is an assertion a generic AI cannot produce. It requires pre-forged knowledge about physical constraints that do not appear in marketing copy or consumer reviews.

What We Learned by Not Locking the Schema

We did not prescribe the full axiom schema before we started forging. This was deliberate, and it was one of the more uncomfortable decisions in the project.

The base requirements were minimal. Every axiom needed an underlying physics statement. Every axiom needed a confidence score. Every axiom needed a layer tag. Beyond that, I left the schema open. The reasoning was simple: we were deploying a new methodology across categories we had never mapped before. Prescribing a rigid schema before understanding what the forge would actually produce felt like the taxonomy trap applied to our own infrastructure.
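As a rough picture of how small that base schema was, the three requirements fit in a single record type. The field names, the 0-to-1 confidence scale, and the restriction of layer tags to L1 through L3 are assumptions for illustration; the article does not specify them.

```python
from dataclasses import dataclass

# Assumed tag set: only the pre-forged layers carry stored axioms.
VALID_LAYERS = {"L1", "L2", "L3"}


@dataclass(frozen=True)
class BaseAxiom:
    physics_statement: str  # the underlying physics claim
    confidence: float       # assumed to be a score in [0, 1]
    layer: str              # layer tag such as "L2"

    def __post_init__(self) -> None:
        # Enforce only the three base requirements; everything else stays open.
        if not self.physics_statement.strip():
            raise ValueError("an axiom needs an underlying physics statement")
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence is assumed to lie in [0, 1]")
        if self.layer not in VALID_LAYERS:
            raise ValueError(f"unknown layer tag: {self.layer}")
```

Anything beyond these three fields would be left to the forge output, which is the openness the paragraph above describes.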

Before the forge protocol stabilized, several production ontologies were created using the Axiom Distillation Protocol but with no codified output schema. Each came back in a different format. They lacked standardization.

This was not a problem. It was a dataset.

The field structures that appeared independently across these early ontologies became the empirical evidence basis for every schema decision that followed. When we later codified the axiom schema, we were not designing top-down. We were formalizing what had already emerged bottom-up from production use. The schema was codified because the forge was producing high-utility axioms — not the other way around.

Schema properties emerged that we did not design in advance. Different layers developed different field requirements — not because of inconsistency, but because each layer asks a fundamentally different epistemic question. Forcing identical fields across all layers would produce either bloat (carrying irrelevant fields) or compliance failure (operators skipping fields that do not apply). The forge revealed these distinctions. We codified them.
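One way to express layer-conditional fields without bloat or skipped fields is to validate each axiom against a base set plus a per-layer set. The sketch below assumes that approach; the per-layer field names in `LAYER_FIELDS` are invented placeholders, since the article does not list the real ones.

```python
from typing import Dict, List, Set

# The three base requirements named in this article (field names assumed).
BASE_FIELDS: Set[str] = {"physics_statement", "confidence", "layer"}

# Hypothetical per-layer extras, one per layer, purely for illustration.
LAYER_FIELDS: Dict[str, Set[str]] = {
    "L1": {"purchase_motivation"},
    "L2": {"engineering_constraint"},
    "L3": {"brand"},
}


def missing_fields(axiom: Dict[str, object]) -> List[str]:
    """Return the required fields absent from an axiom, given its layer tag."""
    layer = str(axiom.get("layer"))
    required = BASE_FIELDS | LAYER_FIELDS.get(layer, set())
    return sorted(required - set(axiom))
```

Because each layer is checked only against its own requirements, an L3 axiom is never asked for an L2 field, and an operator has nothing irrelevant to skip.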

The lesson is that building AI with AI produces emergent structure if you leave room for it. If I had locked the schema before the first forge run, we would have a tidier spreadsheet and a weaker system. Once the forge has produced depth, we can always trim the data shape for precision at compilation time.

Why We Chose Smartphones, Running Shoes, and Skincare

We chose smartphones, running shoes, and skincare actives not because they were the most commercially important categories, but because they are orthogonal. Hardware physics. Biomechanics. Biochemistry. Three fundamentally different domains governed by fundamentally different physical laws.

This was deliberate. We expected to learn different things from each category. The question was not "how do we learn about skincare" but rather: if we tackle drastically different problem spaces early, can we combine axiomatic engineering and our AxI philosophy to create protocols that are fundamentally generalizable in principle, even if we industrialize later?

Our starting point is always physics. But the territory changes the way you think about ontologies when the domain is biochemistry instead of hardware. Skincare actives are credence goods — products whose quality a consumer cannot evaluate even after use without specialized knowledge. Smartphones are closer to experience goods — quality becomes apparent through use. Running shoes sit between the two. This spectrum forced us to examine whether our ontology architecture was truly first-principles or whether it carried hidden assumptions from the hardware domain where we started.

We did not want to carbon-copy our process too early. We needed to learn whether the forge protocol, the layer architecture, the schema design, and the AxI methodology would hold when the underlying physics changed completely. Three categories was enough to test generalizability without overextending. The goal was not to prove we could cover three categories. It was to prove the methodology is rooted in principle rather than procedure — so that scaling to hundreds of categories is a matter of industrialization, not reinvention.

Why We Still Interview Real Shoppers

One of the decisions I am most deliberate about is running traditional jobs-to-be-done research in parallel with AxI-driven ontology forging. This is not hedging. It is acknowledging that AxI and JTBD answer different questions.

AxI tells us what is physically true about a product category. The engineering constraints. The material science. The economics. It gives us the laws of the domain. What it does not give us is how real people experience and perceive the mess of shopping. The hesitation in front of two nearly identical products. Getting a list of options from ChatGPT, and then proceeding to do another hour of personal research. The way a shopper's confidence collapses the moment they encounter contradictory reviews.

JTBD research gives us that signal. We are running panel studies with qualified participants through user interview platforms. These are not usability tests. We are not putting prototypes in front of anyone yet. This is exploratory research on the beige singularity itself: how people shop with AI today, how they perceive the answers they get, and where the experience breaks down. The findings sharpen our sense of where consumer perception stands on the beige singularity: what users have subconsciously accepted and learned to navigate.

The balance between these two inputs is itself a product decision. AxI-driven insights carry structural authority — they are adversarially verified. But JTBD findings carry experiential authority — they are grounded in what actual people actually do. A product that optimizes only for the first builds an impressive knowledge system that misses how humans navigate it. A product that optimizes only for the second builds a pleasant interface around shallow intelligence. We are building both simultaneously, and the tension between them is productive.

What We Traded

Category Depth Over Category Breadth

Decision: Launch the demo with 2-3 categories forged to L2+ depth rather than twenty categories at L1 only.

Alternative rejected: Broad coverage across many categories with shallow ontologies, relying on foundation model reasoning to fill gaps.

Constraint: L1 reasoning alone does not produce the Experiential Delta. A well-prompted frontier model can approximate L1-level value analysis. The delta lives at L2, where engineering constraints and material science create sentences that generic AI cannot produce. Three deep categories prove the thesis. Twenty shallow categories demonstrate coverage without proving anything.

Weakness accepted: The prototype cannot handle the vast majority of shopping queries in its demo phase. Any query outside smartphones, running shoes, or skincare actives falls back to standard model reasoning. We accepted this because the demo is not the product. The demo proves the methodology. The Beta runway addresses breadth on a timeline governed by forge capacity, not marketing pressure.

Axiom Layer Before Response Shaping

Decision: Build the axiom layer and prove the Experiential Delta before investing in response principles, shopper intent modeling, or conversational UX.

Alternative rejected: Build the consumer-facing experience in parallel with ontology forging — system prompt, conversation design, intent parsing, response formatting — so the demo would feel polished from day one.

Constraint: Frontier models already excel at reasoning, inference, query deconstruction, and conversation. Building response capabilities would have been building what models already do well. The existential question was whether pre-forged axioms produce materially different answers. The only way to test that was to forge first, then plug the axioms into an off-the-shelf frontier model and observe whether the delta appeared. If the axioms do not work, no amount of UX polish saves the product.

Weakness accepted: The demo's conversational experience is rougher than a response-engineering-first approach would produce. The system prompt is thin. The chat interface is functional, not refined. A visitor who evaluates the demo on UX polish rather than answer quality will be unimpressed. We accepted this because answer quality is the hypothesis being tested, not conversational finesse.

Emergent Schema Over Locked Specification

Decision: Forge ontologies with a minimal base schema and allow additional properties to emerge from the AxI process.

Alternative rejected: Define a complete axiom schema specification before the first forge run, ensuring consistency and machine-readability from day one.

Constraint: We had never applied the Axiom Distillation Protocol to commerce ontology forging at this scale. Prescribing a full schema before understanding what the forge would produce risked either constraining the outputs to fit our assumptions or requiring expensive schema migrations after the fact. Evidence types, falsifiability criteria, and layer-conditional fields — three of the most valuable schema properties in the current system — emerged from the forge itself. A prescribed schema would not have included them.

Weakness accepted: The early ontologies have inconsistent schema fields. Smartphones L1 (forged first, before methodology stabilized) carries different metadata than running shoes L1 (forged under tighter protocol). This creates reconciliation overhead when industrializing the forge pipeline. We have since reconciled the forge protocol, standardizing fields and conventions. The early ontologies still require retroactive normalization — a cleanup cost we accepted in exchange for schema properties we could not have designed in advance.

Curated Proof Over Regression Infrastructure

Decision: Pre-demo evaluation uses curated Truth Deltas as proof, not a regression testing suite.

Alternative rejected: Build automated regression infrastructure that tests the system against a comprehensive query bank before every deployment.

Constraint: Regression infrastructure requires a stable system to regress against. The prototype's axiom layer, system prompt, and inference architecture are all actively evolving. The ontology is a moving target — each eval run against a previous axiom layer state becomes partially stale when the layer changes. The evaluation methodology itself was being invented alongside the system it evaluates: the verification rubric, the failure taxonomy, and the conversion rate health metric all emerged during the same weeks the ontologies were being forged. Building regression testing now would produce a suite optimized for the current state that would need significant revision as the system matures. Truth Deltas — specific documented instances where the prototype's axiom-grounded answer produces a different purchase outcome than an ungrounded model — are a more honest proof at this stage.

Weakness accepted: Without regression infrastructure, we cannot systematically detect quality degradation as the system evolves. A change to the system prompt or a new axiom batch could silently degrade performance on previously verified queries. We mitigate this through the dogfooding gate — the team uses the product themselves and flags failures before launch — alongside our verification protocol and conversion rate health metrics. Regression infrastructure is scheduled for post-demo development.

What Broke

We violated our own first principle. The team that codified "physics over heuristics" as protocol — that banned synthetic evaluation and rejected the taxonomy trap — tried to industrialize Truth Delta mining before we understood the physics of what makes a Truth Delta real.

The instinct was understandable. We had forged ontologies. We had a grounded agent producing answers that felt qualitatively different. The next step seemed obvious: build a scalable mining pipeline to produce Truth Deltas at volume. Identify friction points in shopping conversations. Construct queries. Run them against the grounded system and an ungrounded baseline. Score the delta. Scale it up. We called it Kill Shot mining at the time.

This was the heuristics trap wearing a different costume. We were building process before we had physics. What does it actually mean for an answer to change a purchase decision? How do you distinguish a delta that is decisional (the shopper would buy a different product) from one that is merely tonal (the answer sounds smarter but leads to the same purchase)? How do you control for the baseline — when the ungrounded model gets it right by accident, or when both answers are wrong in different ways? We did not have rigorous answers to any of these questions. We had intuitions. We tried to scale the intuitions.

The mechanism of the failure is worth describing because it is a trap that any AI-native team will recognize. We kept using AI to help us design the eval itself. The models obliged — they produced test cases, scoring rubrics, query banks, evaluation matrices. All of it looked rigorous. All of it was obsolete on arrival. The test cases were grounded in the current state of the ontology, which was changing weekly. The scoring rubrics encoded assumptions about what "better" meant that we had not validated. We were asking the AI to furnish rooms in a building whose floor plan had not been drawn. What we needed was not test cases. What we needed was the evaluation architecture — the principles that would remain stable while everything downstream from them evolved.

This is a broader pattern we are learning to manage. As AI-native product people, the temptation to outsource or copilot cognitive load with models is constant and seductive. The models are fast. They produce plausible artifacts. But there is a foundation of knowledge work — first-principles strategy, evaluation design, framework architecture — where using AI to generate the answer may be structurally counterproductive. The model fills in the blanks with confident specifics, and those specifics lead the witness. They foreclose the open-ended reasoning that produces durable frameworks. The same pattern appeared in forge protocol authoring: AI-generated prompts produced procedural checklists, while hand-written protocols grounded in principles produced emergence. Eval design is the same problem in a different domain. The principles require human judgment, cultivated product sense, and the willingness to sit with ambiguity long enough for the right framework to surface. The AI can execute within a framework. It likely cannot reason out why the framework should exist if you as a product leader haven't put in the necessary rumination.

The time we lost was not catastrophic, but it was real. Early Truth Delta (TD) candidates were mined using top-down methods: working toward queries designed to showcase specific Truth Deltas, rather than starting from genuine shopping friction and discovering which axioms resolved it. This produced two failure modes we later formalized: parity failures (the grounded and ungrounded answers converge, so no divergence exists regardless of axiom depth) and gap failures (the ontology lacks the axiom needed to produce a delta, which is a hole in coverage rather than a failure of the method). Both modes were invisible under the ad-hoc methodology. We were producing candidates without a systematic way to diagnose why they failed or route failures to the correct fix.
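The two failure modes lend themselves to a small diagnostic routine. The following is a sketch under assumptions, not our actual tooling: the `diagnose` function, its inputs, the order of the checks, and the suggested follow-up in each label are all hypothetical.

```python
from enum import Enum
from typing import Optional


class Outcome(Enum):
    # The follow-up actions in these labels are illustrative assumptions.
    TRUTH_DELTA = "candidate Truth Delta: send to verification"
    PARITY_FAILURE = "parity failure: answers converge"
    GAP_FAILURE = "gap failure: ontology lacks the needed axiom"


def diagnose(grounded_pick: str, baseline_pick: str, axiom_used: Optional[str]) -> Outcome:
    """Classify one grounded-versus-baseline comparison for a single query."""
    # Assumed ordering: a missing axiom is checked first, because without it
    # convergence says nothing about what a deeper ontology would produce.
    if axiom_used is None:
        return Outcome.GAP_FAILURE
    if grounded_pick == baseline_pick:
        return Outcome.PARITY_FAILURE
    return Outcome.TRUTH_DELTA
```

Even a classifier this crude makes the failures visible and routable, which is what the ad-hoc methodology lacked.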

We caught it the same way we caught other methodology errors in the project: by applying our own principles reflexively. The forge protocol encodes "physics over heuristics." The forge protocol worked. The eval methodology did not encode that principle. The eval methodology did not work. The diagnosis was structural, not operational. We were not bad at mining. We were mining before we understood the physics of mining.

Four Truth Deltas exist today. All four passed the verification protocol we have since formalized — testing axiom provenance, physical truth, and decisional divergence. All four were mined early, before the failure taxonomy existed, using the ad-hoc methods that we now recognize as insufficient. They survived the formalized protocol retroactively. But four verified TDs on a small sample is early signal, not proof at scale.
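The three checks named above (axiom provenance, physical truth, decisional divergence) can be pictured as independent gates that must all pass. This sketch assumes each check reduces to a simple test on a candidate record; the `TruthDeltaCandidate` fields and the `verify` function are hypothetical, and in practice the physical-truth judgment is a human review, represented here only as a boolean.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class TruthDeltaCandidate:
    query: str
    cited_axiom_ids: List[str]  # axioms the grounded answer relied on
    physically_verified: bool   # reviewer's judgment that the claim is true
    grounded_pick: str          # product the grounded answer leads to
    baseline_pick: str          # product the ungrounded answer leads to


def verify(candidate: TruthDeltaCandidate, known_axiom_ids: Set[str]) -> Dict[str, bool]:
    """Run the three named checks and report each result separately."""
    checks = {
        # Provenance: at least one axiom cited, and every cited axiom exists.
        "axiom_provenance": bool(candidate.cited_axiom_ids)
        and set(candidate.cited_axiom_ids) <= known_axiom_ids,
        "physical_truth": candidate.physically_verified,
        # Decisional, not tonal: the two answers must lead to different purchases.
        "decisional_divergence": candidate.grounded_pick != candidate.baseline_pick,
    }
    checks["verified"] = all(checks.values())
    return checks
```

Reporting each gate separately, rather than a single pass/fail, keeps a rejected candidate diagnosable.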

The structural lesson is recursive: "physics over heuristics" is not a forge principle. It is a project principle. It applies to how you build ontologies, how you write protocols, how you design evaluation, and how you measure your own product's value. Every time we have tried to scale a process before understanding its underlying physics — every time — we have produced methodology we later had to redo. The eval problem is not solved. We have a formalized rubric, a failure taxonomy, and a small verified sample. We do not yet have a scalable design that reliably measures true Truth Deltas. That is the next physics problem to crack, and we will not try to industrialize it until we understand it.

The Horizon

Open Questions

Does This Work for Categories That Aren't Hardware?

Smartphones and running shoes share a common trait: their category physics are grounded in material science and engineering constraints. Skincare actives introduce biochemistry and regulatory science, which are structurally different knowledge domains. Whether the forge methodology generalizes to categories governed by fundamentally different physics (financial products, food and nutrition, professional services) is an open question. We chose orthogonal categories deliberately to stress-test the architecture, and the early evidence suggests the layered ontology design is domain-agnostic because it maps to epistemological layers (why people care, what is physically true, how brands behave, what specific products do) rather than to domain-specific knowledge structures. HYPOTHESIZED.

Can This Scale?

Manual forging produced high-quality ontologies but cannot sustain the Beta runway's category targets. The forge protocol is reconciled, and the team is evaluating scalable forging infrastructure. The unresolved tension: how much of the forge process can be automated without degrading axiom quality? Certain forge steps are mechanizable. Others — particularly the steps that require adversarial judgment and epistemic calibration — require practitioner involvement. The boundary is not yet precisely mapped. OBSERVED.

Is This Better Than Asking a Human Expert?

The prototype's axiom-grounded agent produces responses superior to those of a standard LLM on identical queries. It has not yet been tested against a human domain expert answering the same questions. This is the real benchmark. The Experiential Delta is necessary but intermediate. The product thesis is that axiom-grounded AI can approach (and eventually match) the quality of a knowledgeable human advisor at a fraction of the cost and at infinite scale. Our next phase is designed to close this gap by bringing domain experts into the forging process, but we do not yet have data on where it stands. HYPOTHESIZED.

Do the Axioms Still Work Three Months Later?

The smartphone and running shoes ontologies were forged between January and February 2026. Markets shift. New products launch. Axioms decay. Whether the current ontologies maintain their Experiential Delta after 90 days of market evolution, without re-forging, is the test of whether our kinetic axiom architecture works as designed — each axiom carries explicit decay rates that erode confidence over time, mutation triggers that force re-forging when reality changes, and falsifiability criteria that define what would prove it wrong. First review scheduled for May 2026. HYPOTHESIZED.
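To illustrate what an explicit decay rate could look like, here is a minimal sketch assuming exponential decay with a per-axiom half-life. The decay form, the `half_life_days` parameter, and the review floor are assumptions made for illustration; the article does not state the formula the kinetic axiom architecture uses.

```python
from datetime import date


def decayed_confidence(initial: float, forged_on: date, today: date,
                       half_life_days: float) -> float:
    """Erode an axiom's confidence with age (assumed exponential form)."""
    age_days = max((today - forged_on).days, 0)  # never boost future-dated axioms
    return initial * 0.5 ** (age_days / half_life_days)


def needs_review(confidence: float, floor: float = 0.5) -> bool:
    """Hypothetical rule: flag an axiom for re-forging below a confidence floor."""
    return confidence < floor
```

Under these assumptions, an axiom forged at 0.9 confidence with a 90-day half-life would sit at 0.45 after 90 days and be flagged for review, which is the kind of signal a 90-day checkpoint would look for.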

How Do You Test a System That's Still Evolving?

How do you build regression infrastructure for a system whose axiom layer, protocol, and eval rubric are all co-evolving? The pre-alpha posture — curated proof, not a regression suite — is honest but insufficient for beta. The post-alpha transition to durable regression testing requires ontology stabilization, protocol reconciliation completion, and automated scoring. The failure taxonomy provides signal to diagnose quality degradation without a frozen test set — distinguishing between missing axioms, missing product data, reasoning failures, and cases where no divergence exists — but whether that signal is sufficient to replace systematic regression testing at scale is unproven. HYPOTHESIZED.

Why Every AI Shopping Assistant Sounds the Same — and What It Took to Build One That Doesn't