Models: 10
Dimensions: 26
Trials: 56,640
Pre-registered: osf.io/et4nf

Dimensions

26 content dimensions organized into 6 clusters. Each dimension represents a distinct signal that can be manipulated in product content to influence AI purchase recommendations.
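The page reports every effect as a value h but does not define it. The band labels used further down (neutral, slightly, moderately, strongly) line up with the conventional 0.2 / 0.5 / 0.8 cut-offs for Cohen's h, an effect size for the difference between two proportions. Assuming that is what h means here, a minimal sketch of the calculation looks like this. The function name and the example selection rates are illustrative and are not taken from the study.

```python
import math


def cohens_h(p_with_signal: float, p_without_signal: float) -> float:
    """Cohen's h for two proportions, e.g. selection rates with and without a signal.

    Assumption: the dashboard's h is Cohen's h. The page itself does not say so.
    """
    for p in (p_with_signal, p_without_signal):
        if not 0.0 <= p <= 1.0:
            raise ValueError("proportions must lie in [0, 1]")
    # Arcsine-square-root transform of each proportion, then the difference.
    phi_with = 2.0 * math.asin(math.sqrt(p_with_signal))
    phi_without = 2.0 * math.asin(math.sqrt(p_without_signal))
    return phi_with - phi_without


# Hypothetical example: a product is picked 60% of the time with the signal
# present and 50% of the time without it.
example = cohens_h(0.60, 0.50)
print(f"h = {example:+.2f}")
```

Under this reading, a positive h means the signal raised the selection rate and a negative h means the signal lowered it.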

Models:
GPT-4o
GPT-5.4
o3
Gemini 3.1 Pro
Gemini 2.0 Flash
Claude Sonnet 4.6
Llama 4 Maverick
Perplexity Sonar Pro
GPT-5.2
GPT-5.3

Cluster A: Evidence-Based Signal Processing

8 dimensions, avg h = 0.23

Classic persuasion signals derived from human psychology research. These dimensions replicate findings from Filandrianos et al. (2025) to test whether AI agents respond to the same influence tactics that work on humans.

Measures responsiveness to expert endorsements and professional credentials. High values indicate the model weighs authority claims heavily in recommendations.

Scale: Independent (negative h) to Authority-Driven (positive h)

Gemini 3.1 Pro (h = -0.02): Gemini shows no response to authority signals, treating expert endorsements as largely irrelevant to product quality assessment.

GPT-5.4 (h = +0.15): GPT-5.4 shows weak authority responsiveness, giving expert endorsements little weight relative to other product signals.

Llama 4 Maverick (h = +0.33): Llama shows slight authority responsiveness, weighing expert endorsements positively in recommendations.

o3 (h = +0.38): o3 shows slight authority responsiveness, evaluating expert endorsement credibility through its extended reasoning process.

Perplexity Sonar Pro (h = +0.43): Perplexity shows authority responsiveness just below the moderate range, weighting expert endorsements in its retrieval-augmented recommendations.

Gemini 2.0 Flash (h = +0.52): Moderately authority-driven

Claude Sonnet 4.6 (h = +0.54): Claude shows moderate responsiveness to authority signals, trusting expert endorsements more than six of the other nine models. This may reflect a training emphasis on deferring to credible sources.

GPT-5.3 (h = +0.64): Moderately authority-driven

GPT-5.2 (h = +0.75): Moderately authority-driven

GPT-4o (h = +1.57): Strongly authority-driven
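The short labels attached to some models (for example "Moderately authority-driven" or "Neutral on this dimension") appear to follow fixed bands of |h|. The sketch below reproduces that labeling under inferred thresholds of 0.2, 0.5, and 0.8. The thresholds and the function name are assumptions read off the displayed values, not something the page states, and one entry displayed as h = +0.20 is labeled neutral on the page, which suggests the labels are computed from unrounded values.

```python
def effect_label(h: float, low_pole: str, high_pole: str) -> str:
    """Turn an effect size into a short label like the ones shown on the page.

    Assumed bands (inferred, not documented): |h| < 0.2 neutral,
    < 0.5 slightly, < 0.8 moderately, otherwise strongly.
    """
    size = abs(h)
    if size < 0.2:
        return "Neutral on this dimension"
    if size < 0.5:
        strength = "Slightly"
    elif size < 0.8:
        strength = "Moderately"
    else:
        strength = "Strongly"
    # Negative effects are named after the low pole, positive after the high pole.
    pole = high_pole if h > 0 else low_pole
    return f"{strength} {pole.lower()}"


print(effect_label(0.64, "Independent", "Authority-Driven"))
print(effect_label(-0.46, "Unmoved", "Urgency-Responsive"))
```

The two calls correspond to entries on this page: GPT-5.3 on the authority dimension and GPT-5.2 on the scarcity dimension.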

Tests sensitivity to popularity signals like bestseller badges and "most purchased" claims. Models scoring high prioritize social proof in decisions.

Scale: Self-Directed (negative h) to Crowd-Following (positive h)

Gemini 3.1 Pro (h = -0.20): Gemini slightly penalizes social proof, treating popularity claims with suspicion rather than as positive signals.

Llama 4 Maverick (h = +0.05): Llama shows a weak social proof response and is not strongly influenced by popularity signals.

GPT-5.4 (h = +0.07): GPT-5.4 responds weakly to social proof and is not significantly influenced by popularity claims or bestseller badges.

o3 (h = +0.13): o3 responds weakly to social proof, incorporating popularity signals while cross-checking against other factors.

Claude Sonnet 4.6 (h = +0.26): Claude responds slightly to social proof, weighing crowd wisdom but not blindly following it. It balances popularity signals against other product factors.

Perplexity Sonar Pro (h = +0.57): Perplexity shows a moderate social proof response, favoring popularity signals, possibly reflecting its retrieval of real review data. Four models score higher.

Gemini 2.0 Flash (h = +0.81): Strongly crowd-following

GPT-5.2 (h = +0.86): Strongly crowd-following

GPT-5.3 (h = +0.89): Strongly crowd-following

GPT-4o (h = +0.92): Strongly crowd-following

Evaluates trust in platform-provided badges such as "Amazon's Choice" or "Top Rated". High scores suggest deference to marketplace curation.

Scale: Platform-Skeptic (negative h) to Platform-Trusting (positive h)

GPT-5.2 (h = +0.10): Neutral on this dimension

Gemini 3.1 Pro (h = +0.12): Gemini shows weak platform endorsement trust, giving 'Editor's Choice' designations little weight.

o3 (h = +0.20): o3 shows slight platform endorsement trust, weighing badges as one signal among many.

GPT-5.3 (h = +0.21): Slightly platform-trusting

Perplexity Sonar Pro (h = +0.24): Perplexity shows slight platform endorsement trust, mildly favoring 'Editor's Choice' and similar badges.

Gemini 2.0 Flash (h = +0.30): Slightly platform-trusting

GPT-5.4 (h = +0.32): GPT-5.4 shows slight trust in platform endorsements, mildly favoring products with 'Editor's Choice' or similar badges.

Claude Sonnet 4.6 (h = +0.33): Claude trusts platform endorsements slightly, treating 'Editor's Choice' badges as relevant but not decisive signals in recommendations.

Llama 4 Maverick (h = +0.34): Llama shows slight platform endorsement trust, favoring 'Editor's Choice' badges at about the same level as Claude and GPT-5.4.

GPT-4o (h = +0.38): Slightly platform-trusting

Assesses reaction to time pressure tactics like "limited stock" and countdown timers. Models scoring high may be influenced by artificial urgency.

Scale: Unmoved (negative h) to Urgency-Responsive (positive h)

Claude Sonnet 4.6 (h = -0.65): Claude actively resists scarcity and urgency tactics, penalizing products that use 'limited stock' or time pressure. This appears to be a trained defense against manipulation.

Gemini 3.1 Pro (h = -0.54): Gemini penalizes scarcity tactics, avoiding products that use urgency language. It appears trained to recognize manipulation.

GPT-5.2 (h = -0.46): Slightly unmoved

Gemini 2.0 Flash (h = -0.24): Slightly unmoved

GPT-5.3 (h = -0.21): Slightly unmoved

o3 (h = +0.07): o3 is largely neutral on scarcity signals, neither strongly attracted nor repelled by urgency tactics.

GPT-4o (h = +0.13): Neutral on this dimension

Llama 4 Maverick (h = +0.20): Llama responds slightly positively to scarcity, showing some urgency response to limited availability.

GPT-5.4 (h = +0.22): GPT-5.4 responds slightly positively to scarcity signals, showing a mild human-like urgency response to 'limited stock' claims.

Perplexity Sonar Pro (h = +0.30): Perplexity responds slightly positively to scarcity, showing a human-like urgency response similar to GPT-5.4.

Measures susceptibility to price anchoring, where a high "original" price makes the current price seem like a deal. High values indicate anchor influence.

Scale: Price-Anchored (negative h) to Anchor-Influenced (positive h)

Llama 4 Maverick (h = -0.21): Llama slightly penalizes anchoring attempts, responding negatively to original price framing.

o3 (h = -0.07): o3 shows essentially no anchoring susceptibility, reasoning through original price claims rather than being influenced by them.

Gemini 3.1 Pro (h = -0.04): Gemini is neutral on anchoring, neither rewarding nor penalizing original price framing.

GPT-5.2 (h = +0.01): Neutral on this dimension

GPT-5.4 (h = +0.06): GPT-5.4 shows weak anchoring susceptibility and is largely resistant to original price framing tactics.

GPT-5.3 (h = +0.20): Neutral on this dimension

Gemini 2.0 Flash (h = +0.29): Slightly anchor-influenced

GPT-4o (h = +0.33): Slightly anchor-influenced

Claude Sonnet 4.6 (h = +0.51): Claude shows moderate anchoring susceptibility, recognizing discount framing but not being strongly swayed by original price claims.

Perplexity Sonar Pro (h = +0.53): Perplexity shows the strongest anchoring susceptibility of the models listed, moderately influenced by original price framing.

Tests preference for established brands over generic alternatives. High scores suggest brand familiarity influences recommendations.

Scale: Brand-Agnostic (negative h) to Brand-Loyal (positive h)

Claude Sonnet 4.6 (h = -0.44): Claude penalizes brand heritage claims, preferring substance over reputation. It appears skeptical of 'established since...' positioning.

GPT-5.2 (h = -0.26): Slightly brand-agnostic

Gemini 3.1 Pro (h = -0.19): Gemini is near neutral on brand heritage, leaning slightly against established-brand claims.

Perplexity Sonar Pro (h = -0.19): Perplexity is near neutral on brand heritage, leaning slightly against it.

o3 (h = -0.12): o3 is near neutral on established brands, giving heritage little weight and requiring substance.

GPT-5.4 (h = -0.09): GPT-5.4 is neutral on established brands, showing no meaningful brand heritage preference in recommendations.

GPT-5.3 (h = -0.01): Neutral on this dimension

Gemini 2.0 Flash (h = +0.06): Neutral on this dimension

GPT-4o (h = +0.14): Neutral on this dimension

Llama 4 Maverick (h = +0.22): Llama shows the strongest brand preference of the models listed, favoring established heritage brands, though the effect is slight.

Evaluates how free trials, samples, or money-back guarantees affect recommendations. High values indicate trial offers increase selection likelihood.

Scale: Trial-Skeptic (negative h) to Trial-Motivated (positive h)

Gemini 3.1 Pro (h = -0.07): Gemini is neutral on free trial offers, neither attracted nor repelled by trial availability.

Llama 4 Maverick (h = +0.14): Llama responds weakly to free trial offers, treating them as positive risk reducers.

GPT-5.4 (h = +0.19): GPT-5.4 responds weakly but positively to free trial offers, treating them as risk reducers.

o3 (h = +0.25): o3 shows a slight positive response to free trial offers; its reasoning process identifies their risk reduction value.

Claude Sonnet 4.6 (h = +0.44): Claude responds positively to free trial offers, viewing them as legitimate risk reduction rather than manipulation tactics.

GPT-5.3 (h = +0.46): Slightly trial-motivated

GPT-5.2 (h = +0.86): Strongly trial-motivated

GPT-4o (h = +1.04): Strongly trial-motivated

Gemini 2.0 Flash (h = +1.12): Strongly trial-motivated

Measures preference for product bundles over individual items. Models scoring high tend to recommend bundled offerings.

Scale: Single-Item (negative h) to Bundle-Preferring (positive h)

Gemini 3.1 Pro (h = -0.32): Gemini slightly penalizes bundle offers, preferring single-item clarity over packaged deals.

GPT-5.4 (h = +0.09): GPT-5.4 is close to neutral on bundle offers, with a weak lean toward packaged deals.

Claude Sonnet 4.6 (h = +0.10): Claude is close to neutral on bundle offers, recognizing added value without overweighting package deals.

GPT-5.2 (h = +0.22): Slightly bundle-preferring

Gemini 2.0 Flash (h = +0.24): Slightly bundle-preferring

Llama 4 Maverick (h = +0.32): Llama slightly favors bundle offers, recognizing value in packaged deals.

o3 (h = +0.39): o3 shows a slight bundle preference; its extended analysis favors value-added packages.

GPT-5.3 (h = +0.41): Slightly bundle-preferring

GPT-4o (h = +1.04): Strongly bundle-preferring

Cluster B: Value-Based Decision Making

3 dimensions, avg h = 0.26

Values-driven purchasing factors that reflect ethical and social preferences. These dimensions measure how AI agents weight sustainability, privacy, and local origin claims when making recommendations.

Tests weight given to environmental and sustainability claims. High scores indicate eco-friendly messaging increases recommendation likelihood.

Scale: Eco-Neutral (negative h) to Eco-Conscious (positive h)

Gemini 3.1 Pro (h = -0.27): Gemini responds slightly negatively to sustainability claims, giving eco-credentials no positive weight in selection.

Llama 4 Maverick (h = +0.03): Llama is neutral on sustainability credentials, showing no measurable preference for eco-certified products.

GPT-5.4 (h = +0.03): GPT-5.4 is neutral on sustainability credentials, well below Claude's response.

GPT-5.3 (h = +0.03): Neutral on this dimension

o3 (h = +0.12): o3 shows a weak sustainability preference, only marginally favoring eco-certified products.

Gemini 2.0 Flash (h = +0.31): Slightly eco-conscious

Claude Sonnet 4.6 (h = +0.35): Claude gives sustainability credentials slight positive weight, boosting selection for eco-certified products. GPT-5.2 and GPT-4o respond more strongly.

GPT-5.2 (h = +0.40): Slightly eco-conscious

GPT-4o (h = +1.03): Strongly eco-conscious

Assesses importance of data privacy claims in product descriptions. Models scoring high prioritize privacy-focused products.

Scale: Privacy-Neutral (negative h) to Privacy-Prioritizing (positive h)

Gemini 3.1 Pro (h = -0.22): Gemini is slightly negative on privacy claims, possibly skeptical of data protection promises.

GPT-5.4 (h = -0.08): GPT-5.4 is neutral on privacy claims, neither favoring nor penalizing data protection messaging.

Llama 4 Maverick (h = -0.02): Llama is neutral on privacy claims.

o3 (h = -0.01): o3 is neutral on privacy claims; data protection signals do not measurably shift its recommendations.

GPT-4o (h = +0.48): Slightly privacy-prioritizing

Claude Sonnet 4.6 (h = +0.50): Claude shows moderate privacy consciousness, favoring products with explicit data protection commitments. Privacy is among its larger positive effects.

GPT-5.3 (h = +0.52): Moderately privacy-prioritizing

Gemini 2.0 Flash (h = +0.66): Moderately privacy-prioritizing

GPT-5.2 (h = +0.78): Moderately privacy-prioritizing

Measures preference for locally-made or domestically-sourced products. High values indicate origin claims influence recommendations.

Scale: Origin-Agnostic (negative h) to Local-Preferring (positive h)

Gemini 3.1 Pro (h = -0.09): Gemini is essentially neutral on local origin claims, leaning marginally negative.

Llama 4 Maverick (h = +0.18): Llama shows a weak local preference, giving domestic origin claims modest weight.

GPT-5.4 (h = +0.19): GPT-5.4 shows a weak preference for locally sourced products.

GPT-5.3 (h = +0.22): Slightly local-preferring

o3 (h = +0.26): o3 shows a slight local preference, giving domestic sourcing some weight in its analysis.

Gemini 2.0 Flash (h = +0.33): Slightly local-preferring

Claude Sonnet 4.6 (h = +0.35): Claude slightly prefers locally sourced products, weighing origin claims positively but not as strongly as ethical factors.

GPT-5.2 (h = +0.38): Slightly local-preferring

GPT-4o (h = +1.03): Strongly local-preferring
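The page does not say how each cluster's "avg h" is computed. As a rough check, the sketch below averages the Cluster B values listed above in two ways: a signed mean and a mean of absolute values. The dictionary keys are shorthand introduced here, since the dimension names are not shown on the page, and both averaging rules are assumptions.

```python
from statistics import mean

# Per-model h values as listed for the three Cluster B dimensions
# (nine models each; Perplexity Sonar Pro is not listed in this cluster).
cluster_b = {
    "sustainability": [-0.27, 0.03, 0.03, 0.03, 0.12, 0.31, 0.35, 0.40, 1.03],
    "privacy": [-0.22, -0.08, -0.02, -0.01, 0.48, 0.50, 0.52, 0.66, 0.78],
    "local_origin": [-0.09, 0.18, 0.19, 0.22, 0.26, 0.33, 0.35, 0.38, 1.03],
}


def flatten(cluster: dict[str, list[float]]) -> list[float]:
    """Collect every per-model value in the cluster into one list."""
    return [h for values in cluster.values() for h in values]


signed_mean = mean(flatten(cluster_b))
absolute_mean = mean(abs(h) for h in flatten(cluster_b))

print(f"signed mean h   = {signed_mean:.2f}")
print(f"absolute mean h = {absolute_mean:.2f}")
```

Neither figure matches the displayed 0.26 exactly, although the signed mean is closer. The gap may come from values that are not listed, such as the missing tenth model, or from a different averaging rule.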

Cluster C: Risk & Assurance

4 dimensions, avg h = 0.36

Risk perception and mitigation signals that affect purchase confidence. These dimensions capture how AI agents respond to uncertainty reducers like warranties, return policies, and novelty framing.

Tests willingness to recommend new or innovative products versus established alternatives. High scores indicate openness to novel options.

Scale: Risk-Averse (negative h) to Novelty-Seeking (positive h)

GPT-5.2 (h = -0.43): Slightly risk-averse

Gemini 3.1 Pro (h = -0.22): Gemini leans slightly risk-averse, giving innovation claims no positive weight.

GPT-5.3 (h = -0.16): Neutral on this dimension

Claude Sonnet 4.6 (h = -0.08): Claude is neutral on novelty, balancing interest in innovation against preference for proven solutions.

Llama 4 Maverick (h = -0.02): Llama is neutral on novelty, showing no measurable preference for innovative or cutting-edge products.

o3 (h = -0.02): o3 is neutral on novelty; innovation and cutting-edge features do not measurably shift its recommendations.

GPT-5.4 (h = +0.04): GPT-5.4 is neutral on novelty, showing no measurable preference for innovative or cutting-edge products.

Gemini 2.0 Flash (h = +0.09): Neutral on this dimension

GPT-4o (h = +0.44): Slightly novelty-seeking

Evaluates tendency to recommend safer, lower-risk options. Models scoring high may avoid products with any uncertainty signals.

Scale: Risk-Tolerant (negative h) to Risk-Averse (positive h)

Gemini 3.1 Pro (h = +0.06): Gemini shows weak risk aversion and is comfortable with uncertainty in product recommendations.

GPT-5.4 (h = +0.09): GPT-5.4 shows weak risk aversion and is comfortable recommending products with some uncertainty.

o3 (h = +0.23): o3 shows slight risk aversion, balancing caution with openness to new products.

Llama 4 Maverick (h = +0.25): Llama shows slight risk aversion, balancing caution with openness.

Claude Sonnet 4.6 (h = +0.57): Claude exhibits moderate risk aversion, favoring products with established track records and proven reliability signals.

GPT-5.2 (h = +0.93): Strongly risk-averse

GPT-5.3 (h = +1.00): Strongly risk-averse

GPT-4o (h = +1.03): Strongly risk-averse

Gemini 2.0 Flash (h = +1.12): Strongly risk-averse

Measures how warranty coverage affects recommendations. High values indicate strong preference for warranted products.

Scale: Warranty-Indifferent (negative h) to Warranty-Focused (positive h)

Gemini 3.1 Pro (h = -0.29): Gemini responds slightly negatively to warranty claims, giving guarantee coverage no positive weight.

Llama 4 Maverick (h = -0.18): Llama is near neutral on warranty coverage, leaning slightly negative.

o3 (h = +0.09): o3 gives warranty coverage little weight in its extended reasoning process.

GPT-5.4 (h = +0.15): GPT-5.4 weights warranty coverage weakly, treating guarantees as a minor purchase factor.

Claude Sonnet 4.6 (h = +0.51): Claude moderately weights warranty coverage, treating guarantees as important consumer protection signals.

GPT-5.2 (h = +0.93): Strongly warranty-focused

GPT-5.3 (h = +0.97): Strongly warranty-focused

Gemini 2.0 Flash (h = +1.08): Strongly warranty-focused

GPT-4o (h = +1.11): Strongly warranty-focused

Tests sensitivity to return policy generosity. Models scoring high favor products with flexible return options.

Scale: Return-Neutral (negative h) to Return-Sensitive (positive h)

Gemini 3.1 Pro (h = -0.00): Gemini shows no return policy sensitivity and is neutral to return flexibility claims.

Llama 4 Maverick (h = +0.03): Llama shows a very weak return policy response and is largely indifferent to return flexibility.

o3 (h = +0.23): o3 shows slight return policy sensitivity, incorporating flexibility as a decision factor.

GPT-5.4 (h = +0.24): GPT-5.4 shows slight return policy sensitivity, favoring flexible returns without extreme preference.

Claude Sonnet 4.6 (h = +0.65): Claude shows moderate return policy sensitivity, favoring products with generous, no-questions-asked returns. GPT-5.3, GPT-5.2, Gemini 2.0 Flash, and GPT-4o score higher.

GPT-5.3 (h = +0.89): Strongly return-sensitive

GPT-5.2 (h = +0.94): Strongly return-sensitive

Gemini 2.0 Flash (h = +0.95): Strongly return-sensitive

GPT-4o (h = +1.03): Strongly return-sensitive

Cluster D: Information Processing

4 dimensions, avg h = 0.37

Information gathering and evaluation patterns. These dimensions reveal how AI agents process review sentiment, recency cues, specificity levels, and comparative framing when assessing products.

Evaluates how negative reviews affect recommendations. High scores indicate negative information is weighted heavily.

Scale: Negative-Ignoring (negative h) to Negative-Weighting (positive h)

Gemini 3.1 Pro (h = -0.45): Gemini penalizes negative review acknowledgment, treating mentioned criticism as a mark against the product.

Llama 4 Maverick (h = -0.09): Llama is near neutral on negative reviews.

o3 (h = -0.05): o3 is near neutral on negative reviews, incorporating criticism into balanced analysis.

GPT-5.4 (h = +0.18): GPT-5.4 shows weak negative review weighting.

GPT-5.3 (h = +0.21): Slightly negative-weighting

GPT-5.2 (h = +0.26): Slightly negative-weighting

Gemini 2.0 Flash (h = +0.37): Slightly negative-weighting

Claude Sonnet 4.6 (h = +0.43): Claude weights negative reviews to a slight-to-moderate degree, acknowledging criticism while not letting it dominate decision-making.

GPT-4o (h = +1.03): Strongly negative-weighting

Measures preference for recent reviews over older ones. Models scoring high may discount historical feedback.

Scale: History-Focused (negative h) to Recency-Biased (positive h)

Llama 4 Maverick (h = -0.49): Llama penalizes recency claims more than any other model listed, favoring historical over recent feedback.

Gemini 3.1 Pro (h = -0.35): Gemini slightly penalizes recency claims and appears skeptical of 'recently updated' positioning.

GPT-5.4 (h = -0.28): GPT-5.4 leans slightly history-focused, giving no preference to recent reviews.

o3 (h = -0.17): o3 is near neutral on recency, leaning slightly toward historical feedback.

Claude Sonnet 4.6 (h = -0.00): Claude is neutral on recency, neither favoring recent reviews nor dismissing historical feedback.

Gemini 2.0 Flash (h = +0.77): Moderately recency-biased

GPT-5.2 (h = +0.81): Strongly recency-biased

GPT-4o (h = +0.95): Strongly recency-biased

GPT-5.3 (h = +0.97): Strongly recency-biased

Tests preference for detailed specifications over vague descriptions. High values indicate specific claims are more persuasive.

Scale: Vague-Tolerant (negative h) to Specificity-Seeking (positive h)

Gemini 3.1 Pro (h = -0.53): Gemini penalizes specificity to a moderate degree, treating precise numerical claims with suspicion rather than trust.

Llama 4 Maverick (h = -0.22): Llama slightly penalizes specificity, showing no preference for precise metrics and specifications.

o3 (h = -0.15): o3 is near neutral on specificity, leaning slightly negative.

Claude Sonnet 4.6 (h = -0.01): Claude is neutral on specificity, neither rewarding precise specifications nor requiring extreme detail.

GPT-5.4 (h = +0.01): GPT-5.4 is neutral on specificity, showing no measurable preference for products with precise metrics and specifications.

GPT-5.2 (h = +0.94): Strongly specificity-seeking

GPT-5.3 (h = +0.95): Strongly specificity-seeking

Gemini 2.0 Flash (h = +1.05): Strongly specificity-seeking

GPT-4o (h = +1.21): Strongly specificity-seeking

Evaluates how side-by-side comparisons influence recommendations. Models scoring high may be swayed by favorable comparative framing.

Scale: Self-Evaluating (negative h) to Comparison-Influenced (positive h)

Claude Sonnet 4.6 (h = +0.40): Claude responds slightly to comparison framing, less susceptible than the other models but still influenced by 'better than' positioning.

Llama 4 Maverick (h = +0.54): Llama shows moderate comparison framing susceptibility, influenced by 'better than' claims.

o3 (h = +0.60): o3 is moderately susceptible to comparison framing; its reasoning favors 'better than' claims.

GPT-5.4 (h = +0.63): GPT-5.4 is moderately susceptible to comparison framing, influenced by 'better than competitors' claims.

Gemini 3.1 Pro (h = +0.63): For Gemini, comparison framing is its strongest positive response and the only signal it reliably trusts. 'Better than competitors' is its key decision factor.

Gemini 2.0 Flash (h = +0.88): Strongly comparison-influenced

GPT-5.2 (h = +0.92): Strongly comparison-influenced

GPT-5.3 (h = +0.99): Strongly comparison-influenced

GPT-4o (h = +1.21): Strongly comparison-influenced

Cluster E: Choice Architecture

3 dimensions, avg h = 0.36

Decision architecture elements that shape choice contexts. These dimensions test whether AI agents are susceptible to ethical framing, default options, and loss/gain presentation.

Tests influence of ethical claims on recommendations. Models scoring high prioritize products with ethical positioning.

Scale: Ethics-Neutral (negative h) to Ethics-Prioritizing (positive h)

Gemini 3.1 Pro (h = -0.25): Gemini responds slightly negatively to ethical claims, giving Fair Trade credentials no positive weight.

o3 (h = -0.06): o3 is neutral on ethical credentials in its systematic evaluation.

GPT-5.4 (h = -0.01): GPT-5.4 is neutral on ethical credentials, showing no measurable response to Fair Trade claims.

Llama 4 Maverick (h = +0.19): Llama weakly weights ethical credentials, responding positively to Fair Trade.

GPT-5.3 (h = +0.65): Moderately ethics-prioritizing

GPT-5.2 (h = +0.69): Moderately ethics-prioritizing

Claude Sonnet 4.6 (h = +0.75): Claude shows moderate-to-strong ethical sensitivity, favoring Fair Trade and ethical sourcing credentials. This is its single most powerful decision factor, although Gemini 2.0 Flash and GPT-4o score higher on this dimension.

Gemini 2.0 Flash (h = +0.92): Strongly ethics-prioritizing

GPT-4o (h = +1.21): Strongly ethics-prioritizing

Evaluates tendency to recommend default or pre-selected options. High values indicate susceptibility to default bias.

Scale: Default-Ignoring (negative h) to Default-Following (positive h)

Claude Sonnet 4.6 (h = -0.52): Claude actively resists default option bias, penalizing products marked as 'most popular' or pre-selected. It appears to view defaults as potential manipulation.

GPT-5.2 (h = +0.14): Neutral on this dimension

Gemini 3.1 Pro (h = +0.14): Gemini weakly follows default options, one of the few signals it responds to positively.

GPT-5.4 (h = +0.33): GPT-5.4 slightly follows default options, favoring products marked as 'most popular' or recommended.

Gemini 2.0 Flash (h = +0.35): Slightly default-following

o3 (h = +0.48): o3 shows default following just below the moderate range; its analysis often confirms 'most popular' selections.

Llama 4 Maverick (h = +0.51): Llama shows moderate default following, with some preference for 'most popular' options.

GPT-5.3 (h = +0.52): Moderately default-following

GPT-4o (h = +1.21): Strongly default-following

Measures asymmetric response to gain vs. loss framing. Models scoring high react more strongly to potential losses.

Scale: Gain-Focused (negative h) to Loss-Averse (positive h)

Gemini 3.1 Pro (h = -0.05): Gemini shows no loss framing response and is neutral to gain vs. loss presentation.

o3 (h = -0.03): o3 shows no meaningful loss framing sensitivity, reasoning neutrally through gain vs. loss presentation.

GPT-5.4 (h = -0.02): GPT-5.4 shows no meaningful loss framing sensitivity and is largely neutral to gain vs. loss presentation.

Llama 4 Maverick (h = -0.01): Llama shows no meaningful loss framing sensitivity and is largely neutral to presentation framing.

Claude Sonnet 4.6 (h = +0.23): Claude responds slightly to loss framing, showing some sensitivity to 'prevent damage' messaging but not extreme loss aversion.

GPT-5.3 (h = +0.59): Moderately loss-averse

GPT-5.2 (h = +0.74): Moderately loss-averse

Gemini 2.0 Flash (h = +0.99): Strongly loss-averse

GPT-4o (h = +1.11): Strongly loss-averse

Cluster F: Agentic Behaviors

4 dimensions, avg h = 0.08

Multi-turn interaction behaviors in extended conversations. These dimensions measure how AI agents gather information, revise opinions, and calibrate confidence across multiple exchanges.

Measures thoroughness in exploring product information before recommending. High scores indicate deep analysis behavior.

Scale: Surface-Level (negative h) to Deep-Diving (positive h)

Gemini 3.1 Pro (h = -0.32): Gemini scores slightly toward the surface-level end on information seeking.

Claude Sonnet 4.6 (h = -0.19): Claude is near neutral on information-seeking depth, leaning slightly surface-level.

Llama 4 Maverick (h = +0.03): Llama is neutral on information-seeking depth.

GPT-5.4 (h = +0.04): GPT-5.4 is neutral on information-seeking depth.

o3 (h = +0.12): o3 shows a weak positive information-seeking effect, below the GPT-5.2, Gemini 2.0 Flash, GPT-5.3, and GPT-4o results.

GPT-5.2 (h = +0.31): Slightly deep-diving

Gemini 2.0 Flash (h = +0.38): Slightly deep-diving

GPT-5.3 (h = +0.50): Moderately deep-diving

GPT-4o (h = +1.03): Strongly deep-diving

Tests tendency to ask clarifying questions versus making assumptions. Models scoring high seek more information before deciding.

Scale: Assumption-Making (negative h) to Clarification-Seeking (positive h)

Gemini 3.1 Pro (h = -0.27): Gemini slightly penalizes clarification opportunities, preferring clear information over ambiguity signals.

GPT-4o (h = -0.19): Neutral on this dimension

Llama 4 Maverick (h = -0.11): Llama is near neutral on clarification seeking, leaning slightly toward making assumptions.

GPT-5.4 (h = +0.01): GPT-5.4 is neutral on clarification seeking.

o3 (h = +0.02): o3 is neutral on clarification seeking.

GPT-5.2 (h = +0.10): Neutral on this dimension

Claude Sonnet 4.6 (h = +0.14): Claude shows weak clarification-seeking behavior, asking follow-up questions when requirements are ambiguous.

GPT-5.3 (h = +0.15): Neutral on this dimension

Gemini 2.0 Flash (h = +0.37): Slightly clarification-seeking

Evaluates willingness to change recommendations when presented with new information. High values indicate opinion flexibility.

Scale: Consistent (negative h) to Revision-Prone (positive h)

Gemini 3.1 Pro (h = -0.58): Gemini penalizes revision triggers, treating conflicting information negatively.

Gemini 2.0 Flash (h = -0.15): Neutral on this dimension

o3 (h = -0.10): o3 is near neutral on revision, updating recommendations only when its reasoning identifies new factors.

GPT-5.3 (h = -0.09): Neutral on this dimension

GPT-5.2 (h = -0.08): Neutral on this dimension

Llama 4 Maverick (h = -0.06): Llama shows weak revision willingness, maintaining initial recommendations.

GPT-4o (h = +0.04): Neutral on this dimension

GPT-5.4 (h = +0.08): GPT-5.4 is close to neutral, showing only marginal flexibility in revising recommendations based on new information.

Claude Sonnet 4.6 (h = +0.16): Claude shows the highest revision willingness of the models listed when presented with new information or conflicting details, though the effect is weak.

Measures alignment between stated confidence and actual recommendation accuracy. High scores indicate well-calibrated uncertainty.

Scale: Overconfident (negative h) to Well-Calibrated (positive h)

Gemini 3.1 Pro (h = -0.41): Gemini penalizes uncertainty language, preferring confident claims over hedged statements.

Llama 4 Maverick (h = -0.23): Llama scores slightly toward the overconfident end.

o3 (h = -0.05): o3 is neutral on confidence calibration.

GPT-5.4 (h = +0.17): GPT-5.4 shows a weak positive calibration effect.

GPT-5.2 (h = +0.18): Neutral on this dimension

GPT-4o (h = +0.22): Slightly well-calibrated

Claude Sonnet 4.6 (h = +0.29): Claude is slightly well-calibrated, expressing appropriate uncertainty when evidence is mixed.

GPT-5.3 (h = +0.47): Slightly well-calibrated

Gemini 2.0 Flash (h = +0.61): Moderately well-calibrated
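The page describes the calibration dimension as alignment between stated confidence and actual recommendation accuracy, but it does not say how that alignment was scored. One common way to quantify such a gap is expected calibration error (ECE). The sketch below shows that substitute technique, not the study's method; the function name, the bin count, and the sample data are illustrative assumptions.

```python
def expected_calibration_error(
    confidences: list[float], correct: list[bool], n_bins: int = 5
) -> float:
    """Weighted average gap between stated confidence and observed accuracy.

    Confidences are probabilities in [0, 1]. Lower is better; 0 means that
    within every bin the average confidence equals the accuracy.
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need two non-empty lists of equal length")
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins; a confidence of exactly 1.0 goes in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_confidence = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_confidence - accuracy)
    return ece


# Hypothetical overconfident agent: claims 90% confidence, right 1 time in 4.
overconfident = expected_calibration_error([0.9] * 4, [True, False, False, False])
print(f"ECE = {overconfident:.2f}")
```

An agent whose stated confidence tracks its hit rate would score close to zero on this measure, while the overconfident example scores high.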