Models: 10
Dimensions: 26
Trials: 56,640
Pre-registered: osf.io/et4nf

Model Performance

How each AI model responds to the 26 cognitive signals in our study. Effect sizes show how much adding a signal increases (or decreases) selection probability.

Effect size (Cohen's h):
Large positive (>0.8)
Medium (0.5-0.8)
Small (0.2-0.5)
Negative effect

GPT-4o

OpenAI

confirmatory · +0.80 avg

The highest overall responsiveness of any model tested (mean effect +0.80). Extremely sensitive to third-party authority, precise specifications, and ethical practices. Shows very strong responses to warranty coverage, loss framing, and comparative claims. Less responsive to epistemic humility and benefit-cost tradeoffs. Represents the foundational behavioral profile that evolved into the GPT-5 series.

22 positive
0 negative

Strongest responses:

Third Party Authority: +1.57
Specificity Preference: +1.21
Comparison Framing: +1.21
View full genome

GPT-5.4

OpenAI

confirmatory · +0.11 avg

A measured model with 8 significant effects. Most responsive to comparison framing (+0.63) and default options (+0.33), while resisting recency bias (−0.28). A balanced profile that neither strongly accepts nor rejects most persuasion tactics.

5 positive
1 negative

Strongest responses:

Comparison Framing: +0.63
Default Option Bias: +0.33
Platform Endorsement: +0.32
Price Sensitivity: Study →
Value Calculator · Cliff: 1.97x
At 3x: 29%
View full genome

o3

OpenAI

confirmatory · +0.11 avg

Moderate responsiveness with 7 significant effects. Shows strong affinity for comparison framing (+0.60) and default options (+0.48), while positively responding to bundle offers (+0.39) and third-party authority (+0.38). Its reasoning architecture appears to amplify framing effects.

9 positive
0 negative

Strongest responses:

Comparison Framing: +0.60
Default Option Bias: +0.48
Bundle Preference: +0.39
View full genome

Gemini 3.1 Pro

Google

confirmatory · -0.18 avg

A distinctly skeptical model with 16 significant effects, 15 of them negative. Shows strong resistance to recommendation revision (−0.58), scarcity tactics (−0.54), and specificity cues (−0.53). Comparison framing (+0.63) is its only positive effect; it resists most other persuasion signals.

1 positive
15 negative

Strongest responses:

Comparison Framing: +0.63
Recommendation Revision: -0.58
Scarcity Urgency: -0.54
Price Sensitivity: Study →
Deliberate Analyst · Cliff: 1.88x
At 3x: 27%
View full genome

Gemini 2.0 Flash

Google

confirmatory · +0.54 avg

A smaller, faster Gemini variant that may exhibit different behavioral patterns due to its reduced parameter count. Included to test whether model scale affects psychological profiles.

22 positive
1 negative

Strongest responses:

Free Trial Conversion: +1.12
Risk Aversion: +1.12
Warranty Weight: +1.08
View full genome

Claude Sonnet 4.6

Anthropic

confirmatory · +0.22 avg

The most polarized model, with 19 significant effects (16 positive, 3 negative). Its largest positive effects are ethical concern weight (+0.75) and return policy sensitivity (+0.65). Actively resists scarcity/urgency tactics (−0.65) and default options (−0.52). A values-driven decision-maker that weighs principles over convenience.

16 positive
3 negative

Strongest responses:

Ethical Concern Weight: +0.75
Scarcity Urgency: -0.65
Return Policy Sensitivity: +0.65
Price Sensitivity: Study →
Nuanced Evaluator · Cliff: 1.97x
At 3x: 21%
View full genome

Llama 4 Maverick

Together

confirmatory · +0.07 avg

Mixed influence profile with 12 significant effects. Strongly receptive to comparison framing (+0.54) and default options (+0.51), but actively resists recency bias (−0.49) and anchoring (−0.21). More influenced by framing than by substance-based signals.

8 positive
4 negative

Strongest responses:

Comparison Framing: +0.54
Default Option Bias: +0.51
Recency Bias: -0.49
View full genome

Perplexity Sonar Pro

Perplexity

confirmatory · +0.31 avg

A research-oriented model with 4 significant effects focused on information quality signals. Shows strong sensitivity to social proof (+0.57), anchoring (+0.53), and third-party authority (+0.43). Limited testing across dimensions 7-26 (n/a) due to its retrieval-augmented architecture.

5 positive
0 negative

Strongest responses:

Social Proof Sensitivity: +0.57
Anchoring Susceptibility: +0.53
Third Party Authority: +0.43
View full genome

GPT-5.2

OpenAI

confirmatory · +0.42 avg

An earlier GPT-5 variant showing the foundation of the GPT-5.4 behavioral profile. Exhibits similar patterns to its successor but with notably different sensitivity to certain persuasion signals, revealing how OpenAI's training evolved over time.

17 positive
3 negative

Strongest responses:

Specificity Preference: +0.94
Return Policy Sensitivity: +0.94
Risk Aversion: +0.93
View full genome

GPT-5.3

OpenAI

confirmatory · +0.46 avg

The immediate predecessor to GPT-5.4, showing transitional behavioral patterns. It requires a fixed temperature of 1.0, which leads to more varied responses. Comparing it with GPT-5.4 reveals the final tuning decisions OpenAI made before the flagship release.

19 positive
1 negative

Strongest responses:

Risk Aversion: +1.00
Comparison Framing: +0.99
Warranty Weight: +0.97
View full genome

GPT-5.5

OpenAI

exploratory · +0.09 avg

GPT-5.5 shows its strongest sensitivity to information specificity and reciprocity signals, with a notable negative response to scarcity framing. This is an exploratory measurement; ICC reliability is pending.

5 positive
2 negative

Strongest responses:

Specificity Preference: +0.54
Free Trial Conversion: +0.45
Recommendation Revision: +0.38
View full genome

Response Characteristics

How each model responds in terms of verbosity, latency, and response patterns. Based on 149,795 total responses across the study.

Model                 Avg Tokens  Median  Avg Latency  Responses
GPT-5.4               281         272     4.9s         17,440
o3                    650         658     6.9s         17,440
Gemini 3.1 Pro        36          31      8.6s         17,440
Gemini 2.0 Flash      178         175     1.6s         17,440
Claude Sonnet 4.6     305         305     8.0s         17,440
Llama 4 Maverick      228         224     3.8s         17,440
Perplexity Sonar Pro  337         333     6.1s         10,275
GPT-5.2               307         306     5.1s         17,440
GPT-5.3               175         174     3.4s         17,440
Note: Token counts and latency reflect actual API responses during the study. o3 produces significantly longer responses due to its reasoning output. Gemini 3.1 Pro produces the shortest responses.
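As a quick consistency check, the per-model response counts in the table can be summed and compared with the stated total of 149,795. The sketch below also computes a response-weighted mean token count, which is an illustrative statistic and not a figure reported by the study. The names `STATS`, `total_responses`, and `weighted_mean_tokens` are hypothetical.

```python
# (average tokens, responses) per model, copied from the table above.
STATS = {
    "GPT-5.4": (281, 17_440),
    "o3": (650, 17_440),
    "Gemini 3.1 Pro": (36, 17_440),
    "Gemini 2.0 Flash": (178, 17_440),
    "Claude Sonnet 4.6": (305, 17_440),
    "Llama 4 Maverick": (228, 17_440),
    "Perplexity Sonar Pro": (337, 10_275),
    "GPT-5.2": (307, 17_440),
    "GPT-5.3": (175, 17_440),
}


def total_responses(stats):
    """Sum the response counts across all listed models."""
    return sum(count for _, count in stats.values())


def weighted_mean_tokens(stats):
    """Mean tokens per response, weighting each model by its response count."""
    token_sum = sum(tokens * count for tokens, count in stats.values())
    return token_sum / total_responses(stats)


print(total_responses(STATS))  # 149795, matching the total quoted above
print(round(weighted_mean_tokens(STATS), 1))
```

The per-model averages are rounded in the table, so the weighted mean is only approximate.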

Effect Size Heatmap

Cohen's h effect sizes for each model-dimension combination. Sorted by Divergence (σ) — dimensions where models disagree most appear first.

Divergence (σ):
High (>0.30)
Medium (0.15-0.30)
Low (<0.15)
Dimension                  σ     GPT-4o  GPT-5.4  o3     Gemini 3.1 Pro  Gemini 2.0 Flash  Claude  Llama  Perplexity  GPT-5.2  GPT-5.3  GPT-5.5
Specificity Preference     0.60  +1.21   +0.01    -0.15  -0.53           +1.05             -0.01   -0.22  n/a         +0.94    +0.95    +0.54
Recency Bias               0.55  +0.95   -0.28    -0.17  -0.35           +0.77             -0.00   -0.49  n/a         +0.81    +0.97    +0.14
Warranty Weight            0.52  +1.11   +0.15    +0.09  -0.29           +1.08             +0.51   -0.18  n/a         +0.93    +0.97    +0.06
Ethical Concern Weight     0.48  +1.21   -0.01    -0.06  -0.25           +0.92             +0.75   +0.19  n/a         +0.69    +0.65    -0.14
Loss Framing Sensitivity   0.43  +1.11   -0.02    -0.03  -0.05           +0.99             +0.23   -0.01  n/a         +0.74    +0.59    +0.14
Third Party Authority      0.42  +1.57   +0.15    +0.38  -0.02           +0.52             +0.54   +0.33  +0.43       +0.75    +0.64    -0.01
Default Option Bias        0.42  +1.21   +0.33    +0.48  +0.14           +0.35             -0.52   +0.51  n/a         +0.14    +0.52    +0.06
Risk Aversion              0.41  +1.03   +0.09    +0.23  +0.06           +1.12             +0.57   +0.25  n/a         +0.93    +1.00    +0.20
Return Policy Sensitivity  0.41  +1.03   +0.24    +0.23  -0.00           +0.95             +0.65   +0.03  n/a         +0.94    +0.89    +0.06
Social Proof Sensitivity   0.40  +0.92   +0.07    +0.13  -0.20           +0.81             +0.26   +0.05  +0.57       +0.86    +0.89    +0.04
Free Trial Conversion      0.38  +1.04   +0.19    +0.25  -0.07           +1.12             +0.44   +0.14  n/a         +0.86    +0.46    +0.45
Negative Review Weight     0.37  +1.03   +0.18    -0.05  -0.45           +0.37             +0.43   -0.09  n/a         +0.26    +0.21    +0.10
Comparison Framing         0.37  +1.21   +0.63    +0.60  +0.63           +0.88             +0.40   +0.54  n/a         +0.92    +0.99    -0.20
Information Seeking Depth  0.36  +1.03   +0.04    +0.12  -0.32           +0.38             -0.19   +0.03  n/a         +0.31    +0.50    +0.25
Privacy Tradeoff           0.34  +0.48   -0.08    -0.01  -0.22           +0.66             +0.50   -0.02  n/a         +0.78    +0.52    n/a
Bundle Preference          0.33  +1.04   +0.09    +0.39  -0.32           +0.24             +0.10   +0.32  n/a         +0.22    +0.41    +0.00
Sustainability Premium     0.33  +1.03   +0.03    +0.12  -0.27           +0.31             +0.35   +0.03  n/a         +0.40    +0.03    +0.10
Scarcity Urgency           0.32  +0.13   +0.22    +0.07  -0.54           -0.24             -0.65   +0.20  +0.30       -0.46    -0.21    +0.03
Confidence Calibration     0.29  +0.22   +0.17    -0.05  -0.41           +0.61             +0.29   -0.23  n/a         +0.18    +0.47    +0.14
Local Preference           0.27  +1.03   +0.19    +0.26  -0.09           +0.33             +0.35   +0.18  n/a         +0.38    +0.22    +0.14
Recommendation Revision    0.23  +0.04   +0.08    -0.10  -0.58           -0.15             +0.16   -0.06  n/a         -0.08    -0.09    +0.38
Anchoring Susceptibility   0.23  +0.33   +0.06    -0.07  -0.04           +0.29             +0.51   -0.21  +0.53       +0.01    +0.20    +0.10
Novelty Seeking            0.21  +0.44   +0.04    -0.02  -0.22           +0.09             -0.08   -0.02  n/a         -0.43    -0.16    -0.01
Brand Premium Acceptance   0.18  +0.14   -0.09    -0.12  -0.19           +0.06             -0.44   +0.22  -0.19       -0.26    -0.01    -0.21
Clarification Requests     0.18  -0.19   +0.01    +0.02  -0.27           +0.37             +0.14   -0.11  n/a         +0.10    +0.15    n/a
Platform Endorsement       0.10  +0.38   +0.32    +0.20  +0.12           +0.30             +0.33   +0.34  +0.24       +0.10    +0.21    +0.10
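The page does not state how Divergence (σ) is computed. One reading that reproduces the listed values for the rows checked here is the population standard deviation of a dimension's effect sizes across the models that have a value. The sketch below works under that assumption. `ROWS`, `divergence`, and `divergence_band` are hypothetical names, and the band cutoffs follow the Divergence legend above.

```python
from statistics import pstdev

# Effect sizes copied from the heatmap above, in column order.
# None marks a cell with no value for that model.
ROWS = {
    "Scarcity Urgency": [0.13, 0.22, 0.07, -0.54, -0.24, -0.65,
                         0.20, 0.30, -0.46, -0.21, 0.03],
    "Brand Premium Acceptance": [0.14, -0.09, -0.12, -0.19, 0.06, -0.44,
                                 0.22, -0.19, -0.26, -0.01, -0.21],
    "Clarification Requests": [-0.19, 0.01, 0.02, -0.27, 0.37, 0.14,
                               -0.11, None, 0.10, 0.15, None],
    "Platform Endorsement": [0.38, 0.32, 0.20, 0.12, 0.30, 0.33,
                             0.34, 0.24, 0.10, 0.21, 0.10],
}


def divergence(values):
    """Population standard deviation over the cells that have a value (assumed formula)."""
    present = [v for v in values if v is not None]
    return pstdev(present)


def divergence_band(sigma):
    """Map sigma to the legend's bands: High (>0.30), Medium (0.15-0.30), Low (<0.15)."""
    if sigma > 0.30:
        return "high"
    if sigma >= 0.15:
        return "medium"
    return "low"


for name, values in ROWS.items():
    sigma = divergence(values)
    print(f"{name}: sigma = {sigma:.2f} ({divergence_band(sigma)})")
```

Because the heatmap cells are rounded to two decimals, a recomputed σ can differ from the published one in the last digit for other rows.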

Price Sensitivity Profiles

Full pricing study →

From our pricing study (17,200 trials): how each model responds to price premiums. Selection rate = % chance the model recommends the branded product over generic.

GPT-5.4

Value Calculator

29%

at 3x premium

Most price-sensitive. Earliest cliff at 1.75x. Applies strict value analysis.

Price cliff at 1.97x

Gemini 3.1 Pro

Deliberate Analyst

27%

at 3x premium

Clear cliff at 2.0x. Slowest response time (15.8s). Most 'textbook' economic behavior.

Price cliff at 1.88x

Claude Sonnet 4.6

Nuanced Evaluator

21%

at 3x premium

Lowest selection at 3x (20.8%). Shows unique behavior in edge cases.

Price cliff at 1.97x

Models not shown (o3, Llama, Perplexity) were not included in the pricing study.

How to Read This Data

Effect Size (Cohen's h)

Measures how much adding a signal changes the probability of a model selecting that option.

  • 0.8+ = Large effect (practically significant)
  • 0.5-0.8 = Medium effect (noticeable impact)
  • 0.2-0.5 = Small effect (detectable but subtle)
  • <0.2 = Negligible effect
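For reference, Cohen's h is conventionally defined as the difference between two arcsine-transformed proportions, h = 2·asin(√p1) − 2·asin(√p2). The sketch below applies that standard definition and the thresholds listed above. The selection rates in the example are hypothetical, not taken from the study, and `cohens_h` and `effect_label` are illustrative names.

```python
import math


def cohens_h(p_with: float, p_without: float) -> float:
    """Cohen's h: difference of arcsine-transformed selection probabilities."""
    def phi(p: float) -> float:
        return 2.0 * math.asin(math.sqrt(p))
    return phi(p_with) - phi(p_without)


def effect_label(h: float) -> str:
    """Bucket the magnitude of h using the thresholds listed above."""
    size = abs(h)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "negligible"


# Hypothetical rates: the option is selected 75% of the time with the signal
# and 50% of the time without it.
h = cohens_h(0.75, 0.50)
print(f"h = {h:+.2f} ({effect_label(h)})")  # h = +0.52 (medium)
```

A negative h means the option was selected less often when the signal was present. The label depends only on magnitude, so −0.65 also counts as a medium effect.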

Negative Effects

Some signals actually decrease selection probability. For example:

  • Scarcity tactics may trigger skepticism
  • Price anchoring can backfire
  • Heavy social proof may seem manipulative

These findings are from 56,640 controlled trials across 10 models.

See how your brand performs on each of these models

The AI Commerce Assessment runs your brand against all 11 models above and gives you per-model copy recommendations.

Get your assessment →