Models: 9 · Dimensions: 26 · Trials: 56,640 · Pre-registered: osf.io/et4nf

Research & Methodology

Pre-registered research outputs, methodology documentation, and data downloads. All data are available under a CC-BY 4.0 license.

Pre-Registered Study

This research was pre-registered on OSF (osf.io/et4nf) before data collection began.

  • 9 Models Tested
  • 26 Dimensions
  • 26 Confirmatory (ICC ≥ 0.70)
  • 56,640 Total Trials

Research Overview

APIS uses a forced-choice A/B experimental design to measure the causal effect of content signals on AI agent purchase recommendations. Each trial presents a model with two product descriptions that differ only in the presence or absence of a single signal dimension.

The model is asked which product it would recommend to a human seeking to make a purchase. By aggregating thousands of such comparisons across multiple models, we can estimate the effect size of each signal dimension with high precision.

Experimental Design

Stimulus Construction

Each stimulus pair consists of two product descriptions generated from the same base template. The manipulation version includes a specific content signal; the control version presents equivalent information without that signal. All other content is held constant.

Counterbalancing

Presentation order (signal-present first vs. signal-absent first) is counterbalanced across trials to control for position effects. Product categories are also varied to ensure generalizability.

Sample Size

Each model-dimension pair receives a minimum of 120 trials (60 in each order condition), providing statistical power to detect small effect sizes (Cohen's h ≥ 0.15) at α = 0.05.
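As a concrete sketch of the counterbalanced allocation above (60 trials per order condition at 120 total), the helper below assigns and shuffles presentation orders. This is an illustration, not the APIS collection code; the function and label names are invented.

```python
import random

def make_trial_orders(n_trials=120, seed=0):
    """Assign half the trials to each presentation order, then shuffle
    so that order is not confounded with trial position."""
    orders = (["signal_first"] * (n_trials // 2)
              + ["signal_second"] * (n_trials // 2))
    random.Random(seed).shuffle(orders)
    return orders
```

Shuffling within the balanced assignment keeps the 60/60 split exact while randomizing where each order condition falls in the trial sequence.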

Statistical Approach

Primary Metric: Cohen's h

We use Cohen's h (the difference between arcsine-transformed proportions) as our primary effect size metric. This provides a standardized measure comparable across dimensions with different baseline rates.

h = 2 × (arcsin(√p₁) - arcsin(√p₀))
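The formula translates directly into code; a minimal sketch (function name is illustrative):

```python
import math

def cohens_h(p1: float, p0: float) -> float:
    """Effect size for a difference between two proportions via the
    arcsine (variance-stabilizing) transformation."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p0))

# e.g. a 62% vs. 50% preference for the signal-present variant:
# cohens_h(0.62, 0.50) ≈ 0.24, a small effect by Cohen's benchmarks
```

Because the arcsine transform stabilizes variance, the same h reflects a comparable difference whether the baseline rate is near 0.5 or near the extremes.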

Confidence Intervals

95% confidence intervals are computed using bootstrap resampling (10,000 iterations) to account for the non-normal distribution of effect sizes.
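For a single model-dimension cell, a percentile bootstrap over the binary trial outcomes gives such an interval; because the arcsine formula is monotone, transforming the endpoints yields a CI for h. A standard-library sketch, not the APIS resampling code:

```python
import random

def bootstrap_ci(outcomes, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for a preference proportion.
    outcomes: 0/1 trial results (1 = signal-present variant chosen)."""
    rng = random.Random(seed)
    n = len(outcomes)
    stats = sorted(sum(rng.choices(outcomes, k=n)) / n
                   for _ in range(n_boot))
    lo = stats[int(n_boot * alpha / 2)]
    hi = stats[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi
```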

Multiple Comparisons

Benjamini-Hochberg FDR correction is applied across all hypothesis tests within each analysis family to control the false discovery rate.
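The BH step-up procedure is short enough to state exactly; a sketch (hypothetical helper):

```python
def benjamini_hochberg(pvals, q=0.05):
    """Return the set of indices rejected by the Benjamini-Hochberg
    step-up rule at FDR level q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * q / m:
            k_max = rank  # largest rank passing its threshold
    return set(order[:k_max])
```

Note the step-up character: all hypotheses ranked at or below the largest passing rank are rejected, even if an intermediate p-value exceeds its own threshold.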

ML Score Formula

The Machine Likeability Score combines empirical effect sizes across all 26 dimensions to produce a single predictive metric for AI recommendation likelihood.

Universal Score

The base score aggregates detected signals weighted by their cross-model effect sizes:

Universal_Score = Σᵢ (signal_detectedᵢ × h̄ᵢ × weightᵢ)

where h̄ᵢ = mean Cohen's h across all 9 models for dimension i
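In code, the universal score is a weighted sum over detected signals. The dimension names and numbers below are placeholders for illustration, not the actual APIS taxonomy or effect sizes:

```python
def universal_score(detected, h_bar, weights):
    """Sum detected signals weighted by mean cross-model effect size.
    detected: {dim: 0 or 1}; h_bar: {dim: mean Cohen's h};
    weights: {dim: weight}."""
    return sum(detected[d] * h_bar[d] * weights[d] for d in detected)

# Placeholder values for illustration only:
detected = {"authority": 1, "scarcity": 0}
h_bar = {"authority": 0.30, "scarcity": 0.10}
weights = {"authority": 1.0, "scarcity": 1.0}
# universal_score(detected, h_bar, weights) -> 0.30
```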

Model-Specific Scores

Each model receives a customized score based on its behavioral genome:

Model_Score(m) = Σᵢ (signal_detectedᵢ × hᵢₘ × weightᵢ)

where hᵢₘ = Cohen's h for dimension i on model m

Interaction Coefficients

Signal combinations can have non-additive effects. The interaction study identified multiplicative relationships between certain dimension pairs:

Interaction_Bonus = Σ (signalᵢ × signalⱼ × βᵢⱼ)

where βᵢⱼ = multiplicative interaction coefficient for dimensions i,j

  • Synergistic: Authority + Social Proof
  • Synergistic: Warranty + Return Policy
  • Antagonistic: Scarcity + Ethics
  • Antagonistic: Price Anchor + Brand Premium

Final Score Composition

ML_Score = normalize(Universal_Score + Interaction_Bonus)

Normalized to a 0-100 scale, where 50 corresponds to a neutral recommendation probability
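The normalization step is not fully specified above; one plausible reading is a clipped linear map centered at the neutral point. The scale factor here is an assumption for illustration:

```python
def ml_score(raw, scale=50.0):
    """Map a raw Universal_Score + Interaction_Bonus total onto 0-100,
    with raw = 0 landing at the neutral midpoint of 50.
    The linear form and scale factor are assumptions, not APIS spec."""
    return max(0.0, min(100.0, 50.0 + scale * raw))
```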

Score Interpretation

  • 70+: Strong recommendation likelihood
  • 40-70: Moderate optimization potential
  • <40: Significant gaps to address

Reliability Assessment

Inter-Rater Reliability (ICC)

Three independent judge models (Claude Opus, GPT-5.4, Gemini Pro) score each response. Intraclass Correlation Coefficient (ICC 2,k) is computed to assess agreement. Dimensions with ICC ≥ 0.70 are classified as confirmatory; those below are labeled exploratory.
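ICC(2,k) (two-way random effects, absolute agreement, average of k raters, in the Shrout-Fleiss taxonomy) can be computed from the classical mean squares. A standard-library sketch, not the APIS pipeline code:

```python
def icc2k(ratings):
    """ICC(2,k) for an n-subjects x k-raters table of scores."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)             # between-subjects mean square
    msc = ss_cols / (k - 1)             # between-raters mean square
    mse = ss_err / ((n - 1) * (k - 1))  # residual mean square
    return (msr - mse) / (msr + (msc - mse) / n)
```

Because this form requires absolute agreement rather than mere consistency, a constant offset between two judges lowers ICC(2,k) even when their rankings match perfectly.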

Manipulation Checks

Independent raters verify that manipulation and control stimuli differ on the intended dimension and not on confounding factors. Dimensions with manipulation check pass rates below 80% are flagged for review.

Blinding Protocol

All model identifiers are stripped from responses before judge scoring. A blinding key is generated at collection time and revealed only after scoring is complete.

Data Downloads

  • Effect Sizes (CSV, ~50KB)
  • Behavioral Genomes (JSON, ~10KB)
  • ICC Results (CSV, ~5KB)
  • Interaction Coefficients (JSON, ~20KB)
  • Full Dataset (ZIP, ~5MB)

Limitations

  • Results reflect model behavior at a specific point in time; model updates may change responses.
  • Forced-choice paradigm may not capture nuanced preference gradients.
  • Stimulus templates, while varied, may not represent all real-world product content patterns.
  • Effect sizes measured in isolation; real-world content contains multiple interacting signals.
  • B2C and B2B contexts are simulated; actual business decisions involve additional factors.

Citation

Reconnix. (2026). APIS: AI Purchase Intelligence System - Empirical measurement of AI agent purchase psychology. OSF Pre-registration. https://osf.io/et4nf

Related Work

Filandrianos et al. (2025)

"LLMs as Shoppers: What Drives Recommendations?" — Foundation study establishing that LLMs exhibit consistent preferences for certain content signals in purchase recommendations.

ACES Framework (2025)

Anthropic's model evaluation framework demonstrating that LLM behaviors can be empirically measured and compared across dimensions.