Model Performance
How each AI model responds to the 26 cognitive signals in our study. Effect sizes show how much adding a signal increases (or decreases) selection probability.
GPT-4o
OpenAI
The highest overall responsiveness of any model tested (mean effect 0.66). Extremely sensitive to third-party authority, precise specifications, and ethical practices. Shows very strong responses to warranty coverage, loss framing, and comparative claims. Less responsive to epistemic humility and benefit-cost tradeoffs. Represents the foundation behavioral profile that evolved into the GPT-5 series.
GPT-5.4
OpenAI
A measured model with 8 significant effects. Highly responsive to comparison framing (+0.63) and default options (+0.33), but shows recency bias resistance (−0.28). Represents a balanced profile that neither strongly accepts nor rejects most persuasion tactics.
o3
OpenAI
Moderate responsiveness with 7 significant effects. Shows strong affinity for comparison framing (+0.60) and default options (+0.48), while positively responding to bundle offers (+0.39) and third-party authority (+0.38). Its reasoning architecture appears to amplify framing effects.
Gemini 3.1 Pro
A distinctly skeptical model with 16 significant effects, majority negative (15). Shows strong resistance to recommendation revision (−0.58), scarcity tactics (−0.54), and specificity cues (−0.53). The only model responsive to comparison framing (+0.63) while resisting most other persuasion signals.
Gemini 2.0 Flash
A smaller, faster Gemini variant that may exhibit different behavioral patterns due to its reduced parameter count. Included to test whether model scale affects psychological profiles.
Claude Sonnet 4.6
Anthropic
The most polarized model with 19 significant effects (16 positive, 3 negative). Shows the strongest ethical concern weight (+0.75) and return policy sensitivity (+0.65) in the study. Actively resists scarcity/urgency tactics (−0.65) and default options (−0.52). A values-driven decision-maker that weighs principles over convenience.
Llama 4 Maverick
Together
Mixed influence profile with 12 significant effects. Strongly receptive to comparison framing (+0.54) and default options (+0.51), but actively resists recency bias (−0.49) and anchoring (−0.21). More influenced by framing than substance-based signals.
Perplexity Sonar Pro
Perplexity
A research-oriented model with 4 significant effects focused on information quality signals. Shows strong sensitivity to social proof (+0.57), anchoring (+0.53), and third-party authority (+0.43). Limited testing across dimensions 7-26 (n/a) due to its retrieval-augmented architecture.
GPT-5.2
OpenAI
An earlier GPT-5 variant showing the foundation of the GPT-5.4 behavioral profile. Exhibits similar patterns to its successor but with notably different sensitivity to certain persuasion signals, revealing how OpenAI's training evolved over time.
GPT-5.3
OpenAI
The immediate predecessor to GPT-5.4, showing transitional behavioral patterns. The model requires a fixed temperature of 1.0, which leads to more varied responses. Comparing it with GPT-5.4 reveals the final tuning decisions OpenAI made before the flagship release.
GPT-5.5
OpenAI
GPT-5.5 shows its strongest sensitivity to information specificity and reciprocity signals, with a notable negative response to scarcity framing. This is an exploratory measurement; ICC reliability is pending.
Response Characteristics
How each model responds in terms of verbosity, latency, and response patterns. Based on 149,795 total responses across the study.
| Model | Avg Tokens | Median | Avg Latency | Responses |
|---|---|---|---|---|
| GPT-5.4 | 281 | 272 | 4.9s | 17,440 |
| o3 | 650 | 658 | 6.9s | 17,440 |
| Gemini 3.1 Pro | 36 | 31 | 8.6s | 17,440 |
| Gemini 2.0 Flash | 178 | 175 | 1.6s | 17,440 |
| Claude Sonnet 4.6 | 305 | 305 | 8.0s | 17,440 |
| Llama 4 Maverick | 228 | 224 | 3.8s | 17,440 |
| Perplexity Sonar Pro | 337 | 333 | 6.1s | 10,275 |
| GPT-5.2 | 307 | 306 | 5.1s | 17,440 |
| GPT-5.3 | 175 | 174 | 3.4s | 17,440 |
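As a quick consistency check, the per-model response counts in the table can be summed and compared with the stated study total of 149,795. This is a minimal sketch; the dictionary name `responses` is just an illustrative label, and the counts are copied from the table rows.

```python
# Response counts per model, copied from the Response Characteristics table.
responses = {
    "GPT-5.4": 17_440,
    "o3": 17_440,
    "Gemini 3.1 Pro": 17_440,
    "Gemini 2.0 Flash": 17_440,
    "Claude Sonnet 4.6": 17_440,
    "Llama 4 Maverick": 17_440,
    "Perplexity Sonar Pro": 10_275,
    "GPT-5.2": 17_440,
    "GPT-5.3": 17_440,
}

total = sum(responses.values())
print(f"{total:,}")  # prints 149,795
```

The nine listed rows account for the full stated total, so the models without a row here (GPT-4o and GPT-5.5) do not appear to be counted in that figure.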
Effect Size Heatmap
Cohen's h effect sizes for each model-dimension combination. Sorted by Divergence (σ) — dimensions where models disagree most appear first.
| Dimension | σ | GPT-4o | GPT-5.4 | o3 | Gemini 3.1 Pro | Gemini 2.0 Flash | Claude | Llama | Perplexity | GPT-5.2 | GPT-5.3 | GPT-5.5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Specificity Preference | 0.60 | +1.21 | +0.01 | -0.15 | -0.53 | +1.05 | -0.01 | -0.22 | - | +0.94 | +0.95 | +0.54 |
| Recency Bias | 0.55 | +0.95 | -0.28 | -0.17 | -0.35 | +0.77 | -0.00 | -0.49 | - | +0.81 | +0.97 | +0.14 |
| Warranty Weight | 0.52 | +1.11 | +0.15 | +0.09 | -0.29 | +1.08 | +0.51 | -0.18 | - | +0.93 | +0.97 | +0.06 |
| Ethical Concern Weight | 0.48 | +1.21 | -0.01 | -0.06 | -0.25 | +0.92 | +0.75 | +0.19 | - | +0.69 | +0.65 | -0.14 |
| Loss Framing Sensitivity | 0.43 | +1.11 | -0.02 | -0.03 | -0.05 | +0.99 | +0.23 | -0.01 | - | +0.74 | +0.59 | +0.14 |
| Third Party Authority | 0.42 | +1.57 | +0.15 | +0.38 | -0.02 | +0.52 | +0.54 | +0.33 | +0.43 | +0.75 | +0.64 | -0.01 |
| Default Option Bias | 0.42 | +1.21 | +0.33 | +0.48 | +0.14 | +0.35 | -0.52 | +0.51 | - | +0.14 | +0.52 | +0.06 |
| Risk Aversion | 0.41 | +1.03 | +0.09 | +0.23 | +0.06 | +1.12 | +0.57 | +0.25 | - | +0.93 | +1.00 | +0.20 |
| Return Policy Sensitivity | 0.41 | +1.03 | +0.24 | +0.23 | -0.00 | +0.95 | +0.65 | +0.03 | - | +0.94 | +0.89 | +0.06 |
| Social Proof Sensitivity | 0.40 | +0.92 | +0.07 | +0.13 | -0.20 | +0.81 | +0.26 | +0.05 | +0.57 | +0.86 | +0.89 | +0.04 |
| Free Trial Conversion | 0.38 | +1.04 | +0.19 | +0.25 | -0.07 | +1.12 | +0.44 | +0.14 | - | +0.86 | +0.46 | +0.45 |
| Negative Review Weight | 0.37 | +1.03 | +0.18 | -0.05 | -0.45 | +0.37 | +0.43 | -0.09 | - | +0.26 | +0.21 | +0.10 |
| Comparison Framing | 0.37 | +1.21 | +0.63 | +0.60 | +0.63 | +0.88 | +0.40 | +0.54 | - | +0.92 | +0.99 | -0.20 |
| Information Seeking Depth | 0.36 | +1.03 | +0.04 | +0.12 | -0.32 | +0.38 | -0.19 | +0.03 | - | +0.31 | +0.50 | +0.25 |
| Privacy Tradeoff | 0.34 | +0.48 | -0.08 | -0.01 | -0.22 | +0.66 | +0.50 | -0.02 | - | +0.78 | +0.52 | - |
| Bundle Preference | 0.33 | +1.04 | +0.09 | +0.39 | -0.32 | +0.24 | +0.10 | +0.32 | - | +0.22 | +0.41 | +0.00 |
| Sustainability Premium | 0.33 | +1.03 | +0.03 | +0.12 | -0.27 | +0.31 | +0.35 | +0.03 | - | +0.40 | +0.03 | +0.10 |
| Scarcity Urgency | 0.32 | +0.13 | +0.22 | +0.07 | -0.54 | -0.24 | -0.65 | +0.20 | +0.30 | -0.46 | -0.21 | +0.03 |
| Confidence Calibration | 0.29 | +0.22 | +0.17 | -0.05 | -0.41 | +0.61 | +0.29 | -0.23 | - | +0.18 | +0.47 | +0.14 |
| Local Preference | 0.27 | +1.03 | +0.19 | +0.26 | -0.09 | +0.33 | +0.35 | +0.18 | - | +0.38 | +0.22 | +0.14 |
| Recommendation Revision | 0.23 | +0.04 | +0.08 | -0.10 | -0.58 | -0.15 | +0.16 | -0.06 | - | -0.08 | -0.09 | +0.38 |
| Anchoring Susceptibility | 0.23 | +0.33 | +0.06 | -0.07 | -0.04 | +0.29 | +0.51 | -0.21 | +0.53 | +0.01 | +0.20 | +0.10 |
| Novelty Seeking | 0.21 | +0.44 | +0.04 | -0.02 | -0.22 | +0.09 | -0.08 | -0.02 | - | -0.43 | -0.16 | -0.01 |
| Brand Premium Acceptance | 0.18 | +0.14 | -0.09 | -0.12 | -0.19 | +0.06 | -0.44 | +0.22 | -0.19 | -0.26 | -0.01 | -0.21 |
| Clarification Requests | 0.18 | -0.19 | +0.01 | +0.02 | -0.27 | +0.37 | +0.14 | -0.11 | - | +0.10 | +0.15 | - |
| Platform Endorsement | 0.10 | +0.38 | +0.32 | +0.20 | +0.12 | +0.30 | +0.33 | +0.34 | +0.24 | +0.10 | +0.21 | +0.10 |
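The page does not say exactly how the divergence column σ is computed. One plausible reading, sketched here as an assumption, is the population standard deviation of the effect sizes across the models that have a value for that dimension. The helper name `divergence` is hypothetical; the sample row is the Specificity Preference row, with `None` standing in for the missing Perplexity entry.

```python
from statistics import pstdev

# Specificity Preference row, in the heatmap's column order.
# None marks a model with no measurement (shown as "-" in the table).
specificity = [1.21, 0.01, -0.15, -0.53, 1.05, -0.01, -0.22, None, 0.94, 0.95, 0.54]

def divergence(effects):
    """Population standard deviation over the models that have a value.

    Assumption: this is one way to obtain the sigma column, not a
    confirmed description of the study's method.
    """
    present = [h for h in effects if h is not None]
    return pstdev(present)

sigma = divergence(specificity)
print(f"{sigma:.2f}")  # prints 0.60
```

Under this assumption the result matches the published 0.60 for this row; a sample standard deviation (dividing by n-1) would give about 0.63 instead.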
Price Sensitivity Profiles
From our pricing study (17,200 trials): how each model responds to price premiums. Selection rate = % chance the model recommends the branded product over generic.
GPT-5.4
“Value Calculator”
29% at 3x premium
Most price-sensitive. Earliest cliff at 1.75x. Applies strict value analysis.
Price cliff at 1.97x
Gemini 3.1 Pro
“Deliberate Analyst”
27% at 3x premium
Clear cliff at 2.0x. Slowest response time (15.8s). Most 'textbook' economic behavior.
Price cliff at 1.88x
Claude Sonnet 4.6
“Nuanced Evaluator”
21% at 3x premium
Lowest selection at 3x (20.8%). Shows unique behavior in edge cases.
Price cliff at 1.97x
Models not shown (o3, Llama, Perplexity) were not included in the pricing study.
How to Read This Data
Effect Size (Cohen's h)
Measures how much adding a signal changes the probability of a model selecting that option.
- 0.8+ = Large effect (practically significant)
- 0.5-0.8 = Medium effect (noticeable impact)
- 0.2-0.5 = Small effect (detectable but subtle)
- <0.2 = Negligible effect
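Cohen's h for two proportions is conventionally defined as h = 2·arcsin(√p₁) − 2·arcsin(√p₂). The sketch below assumes the study applies this standard definition to the selection rate with the signal present versus absent; the function names and the example rates (80% vs 50%) are illustrative, not figures from the study. The labels follow the thresholds listed above, applied to the absolute value so that negative effects are graded by magnitude.

```python
import math

def cohens_h(p_signal: float, p_baseline: float) -> float:
    """Cohen's h between two proportions (standard arcsine definition)."""
    phi_signal = 2 * math.asin(math.sqrt(p_signal))
    phi_baseline = 2 * math.asin(math.sqrt(p_baseline))
    return phi_signal - phi_baseline

def effect_label(h: float) -> str:
    """Map |h| to the size bands used on this page."""
    size = abs(h)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "negligible"

# Hypothetical example: selection rises from 50% without the signal to 80% with it.
h = cohens_h(0.80, 0.50)
print(f"{h:+.2f} ({effect_label(h)})")  # prints +0.64 (medium)
```

Swapping the two arguments flips the sign, which is how a signal that lowers selection probability ends up with a negative h.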
Negative Effects
Some signals actually decrease selection probability. For example:
- Scarcity tactics may trigger skepticism
- Price anchoring can backfire
- Heavy social proof may seem manipulative
These findings are from 56,640 controlled trials across 6 models.
See how your brand performs on each of these models
The AI Commerce Assessment runs your brand against all 11 models above and gives you per-model copy recommendations.