Research & Methodology
Pre-registered research outputs, methodology documentation, and data downloads. All data are available under a CC BY 4.0 license.
Pre-Registered Study
This research was pre-registered on OSF before data collection began.
- 9 Models Tested
- 26 Dimensions
- 26 Confirmatory (ICC ≥ 0.70)
- 56,640 Total Trials
Research Overview
APIS uses a forced-choice A/B experimental design to measure the causal effect of content signals on AI agent purchase recommendations. Each trial presents a model with two product descriptions that differ only in the presence or absence of a single signal dimension.
The model is asked which product it would recommend to a human seeking to make a purchase. By aggregating thousands of such comparisons across multiple models, we can estimate the effect size of each signal dimension with high precision.
Experimental Design
Stimulus Construction
Each stimulus pair consists of two product descriptions generated from the same base template. The manipulation version includes a specific content signal; the control version presents equivalent information without that signal. All other content is held constant.
Counterbalancing
Presentation order (signal-present first vs. signal-absent first) is counterbalanced across trials to control for position effects. Product categories are also varied to ensure generalizability.
Sample Size
Each model-dimension pair receives a minimum of 120 trials (60 in each order condition), providing statistical power to detect small effect sizes (Cohen's h ≥ 0.15) at α = 0.05.
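The design above can be sketched as a per-trial record plus a simple aggregation step. This is a minimal illustration, not the project's actual schema; the field and dimension names (e.g. "urgency") are assumptions:

```python
from dataclasses import dataclass

@dataclass
class Trial:
    model: str          # model under test
    dimension: str      # signal dimension being manipulated
    signal_first: bool  # counterbalanced presentation order
    chose_signal: bool  # model picked the signal-present description

def choice_rate(trials, model, dimension):
    """Share of forced-choice trials in which the model picked the signal version."""
    hits = [t.chose_signal for t in trials
            if t.model == model and t.dimension == dimension]
    return sum(hits) / len(hits)

# 120 trials, 60 per order condition, mirroring the sampling plan:
trials = [Trial("model-a", "urgency", i % 2 == 0, i % 3 != 0) for i in range(120)]
rate = choice_rate(trials, "model-a", "urgency")
```

Under chance responding the expected rate is 0.5; deviations from that baseline are what the effect-size analysis quantifies.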
Statistical Approach
Primary Metric: Cohen's h
We use Cohen's h (arcsine transformation of proportion differences) as our primary effect size metric. This provides a standardized measure comparable across dimensions with different baseline rates.
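Cohen's h has a simple closed form, the difference of arcsine-transformed proportions. A minimal implementation:

```python
import math

def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: difference of arcsine-transformed proportions,
    phi = 2 * asin(sqrt(p))."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

# A model choosing the signal-present version 62% of the time vs. the
# 50% chance baseline yields a small-to-medium effect:
h = cohens_h(0.62, 0.50)  # ~0.24
```

The arcsine transform stabilizes variance, which is why h is comparable across dimensions whose baseline choice rates differ.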
Confidence Intervals
95% confidence intervals are computed using bootstrap resampling (10,000 iterations) to account for the non-normal distribution of effect sizes.
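A percentile-bootstrap version of this can be sketched over the binary trial outcomes of one model-dimension pair. Comparing against the fixed 0.5 chance baseline is an assumption here; the actual analysis may resample differently:

```python
import math
import random

def cohens_h(p1, p2):
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))

def bootstrap_ci(outcomes, iters=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for Cohen's h of the choice rate vs. 0.5.
    outcomes: 0/1 flags, 1 = model chose the signal-present version."""
    rng = random.Random(seed)
    n = len(outcomes)
    stats = sorted(
        cohens_h(sum(rng.choice(outcomes) for _ in range(n)) / n, 0.5)
        for _ in range(iters)
    )
    return stats[int(alpha / 2 * iters)], stats[int((1 - alpha / 2) * iters) - 1]

# 120 trials in which the model chose the signal version 74 times:
lo, hi = bootstrap_ci([1] * 74 + [0] * 46)
```

If the resulting interval excludes zero, the dimension shifts recommendations reliably for that model.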
Multiple Comparisons
Benjamini-Hochberg FDR correction is applied across all hypothesis tests within each analysis family to control the false discovery rate.
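The Benjamini-Hochberg step-up procedure is short enough to state directly; a self-contained sketch:

```python
def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg step-up: return which hypotheses are rejected
    while controlling the false discovery rate at level q."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    k = 0  # largest rank whose p-value clears its BH threshold (rank/m * q)
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * q:
            k = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        rejected[i] = rank <= k
    return rejected

# Four tests within one analysis family:
flags = benjamini_hochberg([0.01, 0.5, 0.02, 0.03])
```

Note the step-up rule: every p-value at or below the largest passing rank is rejected, even if an intermediate p-value misses its own threshold.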
ML Score Formula
The Machine Likeability Score combines empirical effect sizes across all 26 dimensions to produce a single predictive metric for AI recommendation likelihood.
Universal Score
The base score aggregates detected signals weighted by their cross-model effect sizes:
Universal_Score = Σᵢ (signalᵢ × h̄ᵢ × weightᵢ)
where signalᵢ ∈ {0, 1} indicates whether dimension i is detected, h̄ᵢ = mean Cohen's h for dimension i across all 9 models, and weightᵢ = the weight assigned to dimension i
Model-Specific Scores
Each model receives a customized score based on its behavioral genome:
Model_Score(m) = Σᵢ (signalᵢ × hᵢₘ × weightᵢ)
where hᵢₘ = Cohen's h for dimension i on model m
Interaction Coefficients
Signal combinations can have non-additive effects. The interaction study identified multiplicative relationships between certain dimension pairs:
Interaction_Bonus = Σ (signalᵢ × signalⱼ × βᵢⱼ)
where βᵢⱼ = multiplicative interaction coefficient for dimensions i,j
Final Score Composition
ML_Score = normalize(Universal_Score + Interaction_Bonus)
Scores are normalized to a 0-100 scale, where 50 corresponds to a neutral recommendation probability.
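Putting the pieces together, the composition can be sketched as below. The raw-score clipping bounds, dimension names, and coefficient values are illustrative assumptions; the source specifies only that a raw score of 0 maps to 50:

```python
def ml_score(signals, h_bar, weights, beta, lo=-2.0, hi=2.0):
    """ML Score sketch: universal term plus pairwise interaction bonus,
    min-max normalized to 0-100 so that a raw score of 0 maps to 50.
    lo/hi raw-score bounds are illustrative placeholders."""
    universal = sum(signals[i] * h_bar[i] * weights[i] for i in signals)
    bonus = sum(signals[i] * signals[j] * b for (i, j), b in beta.items())
    raw = max(lo, min(hi, universal + bonus))
    return 100 * (raw - lo) / (hi - lo)

# Hypothetical dimensions and coefficients:
signals = {"urgency": 1, "social_proof": 1, "spec_table": 0}
h_bar   = {"urgency": 0.24, "social_proof": 0.31, "spec_table": 0.18}
weights = {d: 1.0 for d in signals}
beta    = {("urgency", "social_proof"): 0.05}
score = ml_score(signals, h_bar, weights, beta)
```

Symmetric bounds around zero are what make the neutral point land exactly at 50 under min-max normalization.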
Reliability Assessment
Inter-Rater Reliability (ICC)
Three independent judge models (Claude Opus, GPT-5.4, Gemini Pro) score each response. The Intraclass Correlation Coefficient, ICC(2,k), is computed to assess agreement. Dimensions with ICC ≥ 0.70 are classified as confirmatory; those below are labeled exploratory.
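ICC(2,k) (two-way random effects, agreement of the average of k raters, per Shrout & Fleiss) can be computed from a mean-squares decomposition. A sketch with illustrative judge scores:

```python
def icc_2k(ratings):
    """ICC(2,k): two-way random effects, average of k raters (Shrout & Fleiss).
    ratings: n targets (responses) x k judges."""
    n, k = len(ratings), len(ratings[0])
    grand = sum(map(sum, ratings)) / (n * k)
    row_means = [sum(r) / k for r in ratings]
    col_means = [sum(r[j] for r in ratings) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)   # between targets
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)   # between judges
    ss_total = sum((x - grand) ** 2 for r in ratings for x in r)
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (ms_cols - ms_err) / n)

# Three judges scoring five responses (illustrative data):
scores = [[4, 4, 5], [2, 2, 2], [5, 4, 5], [1, 2, 1], [3, 3, 3]]
icc = icc_2k(scores)
```

Perfect agreement yields 1.0; values at or above the 0.70 threshold mark a dimension as confirmatory.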
Manipulation Checks
Independent raters verify that manipulation and control stimuli differ on the intended dimension and not on confounding factors. Dimensions with manipulation check pass rates below 80% are flagged for review.
Blinding Protocol
All model identifiers are stripped from responses before judge scoring. A blinding key is generated at collection time and revealed only after scoring is complete.
Data Downloads
- Effect Sizes (CSV): ~50KB
- Behavioral Genomes (JSON): ~10KB
- ICC Results (CSV): ~5KB
- Interaction Coefficients (JSON): ~20KB
- Full Dataset (ZIP): ~5MB
Limitations
- Results reflect model behavior at a specific point in time; model updates may change responses.
- The forced-choice paradigm may not capture nuanced preference gradients.
- Stimulus templates, while varied, may not represent all real-world product content patterns.
- Effect sizes are measured in isolation; real-world content contains multiple interacting signals.
- B2C and B2B contexts are simulated; actual business decisions involve additional factors.
Citation
Reconnix. (2026). APIS: AI Purchase Intelligence System - Empirical measurement of AI agent purchase psychology. OSF Pre-registration. https://osf.io/et4nf
Related Work
Filandrianos et al. (2025)
"LLMs as Shoppers: What Drives Recommendations?" — Foundation study establishing that LLMs exhibit consistent preferences for certain content signals in purchase recommendations.
ACES Framework (2025)
Anthropic's model evaluation framework demonstrating that LLM behaviors can be empirically measured and compared across dimensions.