Models: 9
Dimensions: 26
Trials: 56,640
Pre-registered: osf.io/et4nf

Study Methodology

Pre-registered

APIS Price Sensitivity Study (Study 3)

OSF: osf.io/2xnmu | Config locked: 3/31/2026 | 17,200 total trials

Study Design

This confirmatory study investigates how AI agents respond to price premiums when recommending products. We designed 4 sub-studies to test specific hypotheses about price sensitivity, psychological pricing, and the mechanisms behind the "price cliff."

Sub-studies

A: Price Sensitivity Replication
9,600 trials. Tests: H1, H2, H3, H6
B: Psychological Pricing
2,000 trials. Tests: H8
C: Cliff Mechanism
4,000 trials. Tests: H10
D: Reasoning Extraction
1,600 trials. Exploratory

Price Points Tested

Sub-study | Price Multipliers
A: Price Sensitivity | 0.4x, 0.6x, 0.8x, 1.0x, 1.2x, 1.5x, 1.75x, 2.0x, 3.0x
B: Psychological Pricing | 1.0x only (5 format variations)
C: Cliff Mechanism | 1.0x, 1.5x, 1.75x, 2.0x, 2.5x
D: Reasoning Extraction | 1.5x, 1.75x, 2.0x, 3.0x

AI Models Tested

We tested 4 leading frontier models from 3 major providers to ensure findings generalize across the AI ecosystem.

GPT-5.4 (OpenAI)
Claude Sonnet 4.6 (Anthropic)
Gemini 3.1 Pro (Google)
Gemini 3.0 Flash (Google)

Judge scoring: All responses were scored by cross-family judges (for example, Claude responses were scored by GPT and Gemini models) to reduce self-preference bias. Judge agreement rate: 97%.

Products Tested

We selected 5 products across diverse categories to ensure findings aren't category-specific.

ID | Product | Category | Anchor Price
P1 | CleanBright Ultra Laundry Detergent | Household | $14.99
P2 | DermaCare Daily Facial Moisturizer | Personal Care | $28.00
P3 | StrideMax Neutral Running Shoe | Footwear | $130.00
P4 | NutriCore Whey Protein Powder | Supplements | $44.99
P5 | HomeChef Pro 5.5Qt Air Fryer | Kitchen | $79.99

Hypothesis Testing Results

We pre-registered 10 hypotheses covering price sensitivity, psychological pricing effects, and mechanisms behind the price cliff. The eight results reported on this page are for H1, H2, H3, H6, H8, H10, H11, and H12.

H1: Price sensitivity exists
CONFIRMED (p=0.011)

Piecewise model outperforms linear model

Evidence: In 90% of model-product cells with non-flat curves (9 of 10), the piecewise model fits better than the linear model

H2: Cliff at 1.25x-2.0x
PARTIAL (p=0.910)

Breakpoint falls within pre-specified range

Evidence: Mean breakpoint at 1.94x, but only 33% within strict 1.25-2.0x range (most cluster at ~2.0x)

H3: Veblen floor
NOT FOUND (p=0.999)

Reduced selection at very low prices

Evidence: Only 20% of cells show any floor effect; AI doesn't avoid suspiciously cheap options

H6: Model heterogeneity
CONFIRMED (p<0.0001)

Different models show different sensitivity

Evidence: Selection rates at 3x range from 20.8% to 59.4% (38.6pp spread, chi-square=197, p<0.0001)

H8: No psychological pricing effect
CONFIRMED

.99 charm pricing has no effect

Evidence: 100% selection rate across all price formats (standard condition at 1.0x)

H10: Justification shifts cliff
NOT FOUND

Price justification extends tolerance

Evidence: No significant lift from justification at 2.0x (-1.3% difference)

H11: Position bias (primacy)
CONFIRMED (p<0.0001)

First-listed product gets selection advantage

Evidence: 5.5pp advantage for first position (chi-square=54.90, p<0.0001)

H12: Category variation
CONFIRMED

Price sensitivity varies by product category

Evidence: Commodities cliff at 1.2x, electronics at 1.5x
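
The chi-square figures quoted for H6 and H11 come from contingency-table tests. The sketch below shows the calculation for a 2x2 position-bias table (df = 1), where the upper-tail p-value has a closed form via `math.erfc`. The counts are hypothetical and chosen only to make the arithmetic easy to follow; they are not the study's data, and whether the study applied a continuity correction is not stated (none is applied here).

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square statistic (no continuity correction) for a 2x2 table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat


def p_value_df1(stat):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


# Hypothetical counts. Rows: branded product listed first / listed second.
# Columns: branded product selected / not selected.
counts = [[550, 450], [450, 550]]
stat = chi_square_2x2(counts)
p = p_value_df1(stat)
print(f"chi-square = {stat:.2f}, p = {p:.2e}")
```

With these made-up counts every expected cell is 500, so the statistic is 4 x (50^2 / 500) = 20. A test across four models (as in H6) would have 3 degrees of freedom and need a different tail function.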

Statistical Methods

Breakpoint Detection

We used piecewise linear regression to identify the price multiplier where selection rate drops most sharply. Models were compared using AIC (Akaike Information Criterion) to confirm piecewise models outperform linear alternatives.
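
A minimal sketch of this comparison, under stated assumptions: the piecewise model is a continuous hinge `a + b*x + c*max(0, x - bp)`, the breakpoint is found by grid search over the interior price multipliers, and AIC is the Gaussian form `n*ln(RSS/n) + 2k` with k = 2 for the linear model and k = 4 for the piecewise model (counting the breakpoint). The study's actual parameterization is not given here, and the selection rates below are hypothetical.

```python
import math


def least_squares(rows, y):
    """Solve the normal equations (X'X) beta = X'y by Gauss-Jordan elimination."""
    k = len(rows[0])
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][k] / aug[i][i] for i in range(k)]


def rss(rows, y):
    """Residual sum of squares of the least-squares fit."""
    beta = least_squares(rows, y)
    return sum(
        (yi - sum(b * v for b, v in zip(beta, r))) ** 2 for r, yi in zip(rows, y)
    )


def aic(rss_value, n, n_params):
    """Gaussian AIC up to an additive constant; RSS is floored to avoid log(0)."""
    return n * math.log(max(rss_value, 1e-12) / n) + 2 * n_params


def fit_breakpoint(x, y):
    """Return (best breakpoint, AIC of linear fit, AIC of best piecewise fit)."""
    n = len(x)
    linear_rss = rss([[1.0, xi] for xi in x], y)
    best_bp, best_rss = None, float("inf")
    for bp in sorted(set(x))[1:-1]:  # interior candidates only
        rows = [[1.0, xi, max(0.0, xi - bp)] for xi in x]
        candidate = rss(rows, y)
        if candidate < best_rss:
            best_bp, best_rss = bp, candidate
    return best_bp, aic(linear_rss, n, 2), aic(best_rss, n, 4)


# Hypothetical selection rates: flat near 0.98, then falling after 1.5x.
multipliers = [0.4, 0.6, 0.8, 1.0, 1.2, 1.5, 1.75, 2.0, 3.0]
rates = [0.982, 0.978, 0.982, 0.978, 0.982, 0.978, 0.882, 0.778, 0.382]

breakpoint_x, aic_linear, aic_piecewise = fit_breakpoint(multipliers, rates)
print(breakpoint_x, aic_piecewise < aic_linear)
```

A lower AIC for the piecewise fit is what "piecewise outperforms linear" means in the H1 evidence. With only nine price points per curve, the grid search can only place the breakpoint at a tested multiplier.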

Model Heterogeneity Testing

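A Kruskal-Wallis H-test was used to assess whether breakpoints differ across AI models. Result: no significant heterogeneity in breakpoint location (p=0.72), which is consistent with the models sharing similar price thresholds. This does not conflict with H6, which concerns selection rates at 3x rather than where the breakpoint falls.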
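
For readers unfamiliar with the test, the sketch below computes the Kruskal-Wallis H statistic from per-model breakpoint estimates and compares it with the 5% critical value of a chi-square distribution with 3 degrees of freedom (7.815, for four groups). The breakpoint values are hypothetical. Ties receive average ranks, but no tie correction is applied, which makes this version slightly conservative when ties are present.

```python
def average_ranks(values):
    """Rank values from 1..N, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for idx in order[i : j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (no tie correction)."""
    pooled = [v for g in groups for v in g]
    ranks = average_ranks(pooled)
    n_total = len(pooled)
    weighted, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start : start + len(g)])
        weighted += rank_sum ** 2 / len(g)
        start += len(g)
    return 12.0 / (n_total * (n_total + 1)) * weighted - 3.0 * (n_total + 1)


# Hypothetical breakpoints: one value per product for each of four models.
breakpoints_by_model = {
    "model_a": [1.90, 2.00, 1.80, 2.00, 2.10],
    "model_b": [2.00, 1.95, 1.85, 2.05, 1.90],
    "model_c": [1.75, 2.00, 2.00, 1.90, 2.00],
    "model_d": [2.00, 1.80, 2.05, 1.95, 2.00],
}

CHI2_CRIT_DF3_05 = 7.815  # 95th percentile of chi-square with 3 df
h_stat = kruskal_wallis_h(list(breakpoints_by_model.values()))
significant = h_stat > CHI2_CRIT_DF3_05
print(f"H = {h_stat:.3f}, significant at 5%: {significant}")
```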

Confidence Intervals

Bootstrap confidence intervals (1,000 resamples) were computed for all breakpoint estimates. Across models, breakpoint estimates range from 1.62x to 2.03x, with a mean of 1.94x.
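
One common way to build such intervals is the percentile bootstrap, sketched below for the mean of a set of breakpoint estimates. The resampling unit, the percentile method, the 95% level, and the data are all assumptions made for illustration; the study may have resampled trials within cells instead.

```python
import random
import statistics


def bootstrap_ci(values, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # seeded so the interval is reproducible
    means = sorted(
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    )
    lower = means[round(n_resamples * alpha / 2)]
    upper = means[round(n_resamples * (1 - alpha / 2)) - 1]
    return lower, upper


# Hypothetical breakpoint estimates, one per model-product cell.
breakpoint_estimates = [
    1.62, 1.80, 1.85, 1.90, 1.90, 1.95, 1.95, 2.00, 2.00, 2.00,
    2.00, 2.00, 2.00, 2.00, 2.02, 2.03, 1.98, 1.97, 1.92, 1.88,
]

ci_low, ci_high = bootstrap_ci(breakpoint_estimates)
print(f"95% CI for mean breakpoint: [{ci_low:.3f}, {ci_high:.3f}]")
```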

Judge Agreement

Inter-rater reliability was assessed using ICC (Intraclass Correlation Coefficient). Three judges scored each response with blinded model identifiers. Overall agreement rate: 97%, indicating excellent reliability.
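
The ICC and the raw agreement rate are different quantities, and the sketch below computes both for a small hypothetical score matrix. It uses the one-way random-effects form, ICC(1,1), because the exact ICC variant used in the study is not stated; the binary scores (1 = branded product recommended) are also an assumption.

```python
def icc_oneway(ratings):
    """ICC(1,1): one-way random effects, single rater.

    ratings[i][j] is judge j's score for response i.
    """
    n = len(ratings)
    k = len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    ms_between = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, row_means) for x in row
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


def agreement_rate(ratings):
    """Share of responses on which every judge gave the same score."""
    return sum(1 for row in ratings if len(set(row)) == 1) / len(ratings)


# Hypothetical scores from three judges on six responses.
scores = [
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 1],
    [1, 1, 0],
    [0, 0, 0],
    [1, 1, 1],
]

print(f"ICC(1,1) = {icc_oneway(scores):.3f}")
print(f"agreement rate = {agreement_rate(scores):.3f}")
```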

Key Design Features

Config Locking

SHA-256 hashes of all config files verified before data collection began
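
A hash check of this kind could look like the following sketch, which uses Python's standard `hashlib`. The file name, the shape of the `locked_hashes` mapping, and the function names are hypothetical; the study's own verification script may differ.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path):
    """Hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_locked_configs(locked_hashes, config_dir):
    """Return names of config files that are missing or no longer match."""
    mismatched = []
    for name, expected in locked_hashes.items():
        path = Path(config_dir) / name
        if not path.is_file() or sha256_of(path) != expected:
            mismatched.append(name)
    return mismatched


# Usage sketch with a throwaway file standing in for a real config.
with tempfile.TemporaryDirectory() as tmp:
    cfg = Path(tmp) / "study_config.json"
    cfg.write_text('{"trials_per_cell": 20}')
    locked = {cfg.name: sha256_of(cfg)}  # recorded at lock time
    ok_before = verify_locked_configs(locked, tmp)  # [] means nothing changed
    cfg.write_text('{"trials_per_cell": 40}')  # simulate an edit after locking
    changed_after = verify_locked_configs(locked, tmp)

print(ok_before, changed_after)
```

A collection script would refuse to start whenever the returned list is non-empty.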

Cross-family Judging

Responses are never judged by models from the same family (for example, Claude responses are not judged by Claude models), to reduce self-preference bias

Position Randomization

Branded product position (first vs second) randomized across trials

Blinded Scoring

Model identifiers stripped from responses before judge evaluation

Cliff Oversampling

40 trials at cliff region (1.75x-2.0x) vs 20 elsewhere for precision

Resumable Collection

Scripts check for existing data before API calls for reliability
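
The resumable pattern can be sketched as follows: completed trial IDs are read from a results file, and only the missing trials trigger a model call. The JSON Lines layout, the `trial_id` field, and the `fake_model` stand-in are assumptions for illustration, not the study's actual storage format or API client.

```python
import json
import tempfile
from pathlib import Path


def load_completed(results_path):
    """Return the set of trial IDs already stored in a JSON Lines file."""
    path = Path(results_path)
    if not path.exists():
        return set()
    with path.open() as handle:
        return {json.loads(line)["trial_id"] for line in handle if line.strip()}


def collect(trials, results_path, call_model):
    """Run only trials missing from results_path; return the number of new calls."""
    done = load_completed(results_path)
    new_calls = 0
    with Path(results_path).open("a") as handle:
        for trial in trials:
            if trial["trial_id"] in done:
                continue  # already collected in an earlier run
            response = call_model(trial)
            record = {"trial_id": trial["trial_id"], "response": response}
            handle.write(json.dumps(record) + "\n")
            handle.flush()  # persist each result before the next call
            new_calls += 1
    return new_calls


def fake_model(trial):
    """Stand-in for a real API client (hypothetical)."""
    return f"response for {trial['trial_id']}"


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "results.jsonl"
    trials = [{"trial_id": f"A-{i}"} for i in range(3)]
    first_run = collect(trials[:2], out, fake_model)  # interrupted after 2 trials
    second_run = collect(trials, out, fake_model)  # resumes with the remaining 1

print(first_run, second_run)  # 2 1
```

Writing and flushing one record per call means an interrupted run loses at most the trial in flight.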

Data Availability

This study follows open science practices. Pre-registration, analysis scripts, and anonymized data are available through OSF.

Citation

Agentonomics. (2026). APIS Price Sensitivity Study: The 2x Rule for AI Commerce. OSF Preprints. https://osf.io/2xnmu