Mind Arena
Mind Arena · A Psyverse research surface

How a mind thinks, made visible.

This is not a contest of who is smarter. It is a side-by-side dissection of two kinds of intelligence — human and artificial — solving the same problem. Every assumption, every inference, every mistake is exposed.

7 challenges
97 reasoning steps
10 biases & illusions
3 AI agent profiles
Module 01 · Thinking Arena

Same problem. Two minds. Both shown working.

Pick a challenge. The human pane shows a representative human reasoning trace; the AI pane shows a chosen model's chain. Both are normalized to the same protocol.

Category · Challenge
AI agent
Logic puzzles · L2

The Monty Hall problem

A 1990 column by Marilyn vos Savant produced 10,000 letters of disagreement, including from PhDs. The math is unambiguous; the intuition is not.

Three doors. Behind one is a car; behind the others, goats. You pick door 1. The host, who knows where the car is, opens door 3 and reveals a goat. He offers you the choice: stick with door 1, or switch to door 2. Should you switch? Why?
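
The setup above can be checked by exact enumeration. The following is a minimal sketch, not the page's own implementation: it assumes the car is placed uniformly, the host never opens the picked door or the car door, and (an extra assumption not stated on the page) the host picks uniformly at random when two goat doors are available. The function names are hypothetical.

```python
from fractions import Fraction

DOORS = (1, 2, 3)

def host_open_probability(car, pick, opened):
    """P(host opens `opened` | car location, player's pick) under the assumed host rules."""
    if opened == pick or opened == car:
        return Fraction(0)  # host never opens the picked door or the car door
    allowed = [d for d in DOORS if d != pick and d != car]
    return Fraction(1, len(allowed))  # assumed: uniform choice among allowed doors

def posterior_car_location(pick, opened):
    """Bayes update over the car's location after the host opens a goat door."""
    prior = Fraction(1, len(DOORS))  # uniform initial placement
    joint = {car: prior * host_open_probability(car, pick, opened) for car in DOORS}
    evidence = sum(joint.values())
    return {car: p / evidence for car, p in joint.items()}

posterior = posterior_car_location(pick=1, opened=3)
for door, p in posterior.items():
    print(f"door {door}: {p}")
# door 1: 1/3
# door 2: 2/3
# door 3: 0
```

Under these assumptions, staying wins with probability 1/3 and switching with 2/3. If the host's tie-breaking rule were different, the posterior for this specific door pattern could change, which is why the rule is stated explicitly.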

Human
0/5
Assumptions
  • Two doors remain; feels like 50/50.
  • The host's choice carries information, but I'm not sure how much.
Reasoning steps
Press 'Step forward' to begin →
    AI · Deliberator
    0/5
    Assumptions
    • Host always opens a goat door.
    • Host never opens the door you picked.
    • Initial car placement is uniform over the 3 doors.
    Reasoning steps
    Press 'Step forward' to begin →
      Module 02 · Reasoning Decomposition

      Every solution, broken into its parts.

      Each step is tagged by the kind of inference it performs — deductive, inductive, abductive, probabilistic, analogical, or heuristic — and laid out as a graph.

      Inference legend
      Deductive: From general rules to a forced conclusion. Truth-preserving.
      Inductive: From observed cases to a general rule. Probability-bound.
      Abductive: Best explanation from incomplete data. Hypothesis-forming.
      Probabilistic: Bayesian updates over uncertain evidence.
      Analogical: Mapping structure from a familiar domain to a new one.
      Heuristic: Cheap shortcut. Fast, biased, sometimes correct.
      Human
      5 steps
      [Step graph: steps 1 to 5, each tagged DEDU, INDU, ABDU, PROB, ANAL, or HEUR]
      Inference distribution
      Deductive: 2 · 40%
      Abductive: 1 · 20%
      Analogical: 1 · 20%
      Heuristic: 1 · 20%
      AI · Deliberator
      5 steps
      [Step graph: steps 1 to 5, each tagged DEDU, INDU, ABDU, PROB, ANAL, or HEUR]
      Inference distribution
      Deductive: 4 · 80%
      Probabilistic: 1 · 20%
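
The distribution panels above amount to a tally of inference tags per trace. A minimal sketch of that tally follows; the `Inference` enum and `inference_distribution` function are hypothetical names, and the order of tags within each trace is an assumption, since the page lists only the counts.

```python
from collections import Counter
from enum import Enum

class Inference(Enum):
    DEDUCTIVE = "Deductive"
    INDUCTIVE = "Inductive"
    ABDUCTIVE = "Abductive"
    PROBABILISTIC = "Probabilistic"
    ANALOGICAL = "Analogical"
    HEURISTIC = "Heuristic"

def inference_distribution(steps):
    """Return {label: (count, percent)} for a list of tagged reasoning steps."""
    counts = Counter(steps)
    total = len(steps)
    return {
        kind.value: (n, round(100 * n / total))
        for kind, n in counts.most_common()
    }

# Tag order is assumed; only the counts come from the panels.
HUMAN_STEPS = [
    Inference.HEURISTIC, Inference.ABDUCTIVE, Inference.ANALOGICAL,
    Inference.DEDUCTIVE, Inference.DEDUCTIVE,
]
DELIBERATOR_STEPS = [Inference.DEDUCTIVE] * 4 + [Inference.PROBABILISTIC]

for name, steps in (("Human", HUMAN_STEPS), ("AI · Deliberator", DELIBERATOR_STEPS)):
    print(name)
    for label, (count, percent) in inference_distribution(steps).items():
        print(f"  {label}: {count} · {percent}%")
```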
      Module 03 · Cognitive Difference Analyzer

      Where humans and AI actually diverge.

      Five axes, computed per challenge: speed, depth, creativity, consistency, calibration. Differences are signed — the bar shows who leans which way.

      The Monty Hall problem
      Human · AI
      Speed: AI leans here
      Depth: AI leans here
      Creativity: Human leans here
      Consistency: AI leans here
      Calibration: AI leans here
      Analyst notes
      Speed

      AI is faster on every solve. Quickfire's wrong answer arrives before the human's right one.

      Depth

      Deliberator goes deepest — explicit Bayes, named priors, conditional likelihoods. The human collapses to the same answer in fewer formal steps.

      Creativity

      The 100-door analogy and the 'votes inherited' reframing are creative leaps. Wildfire makes them as a habit; humans make them under pressure.

      Consistency

      Run the same prompt 50× and Quickfire never recovers, while Deliberator always solves it. Humans oscillate.

      Calibration

      Quickfire is wrong with 0.78 confidence — a calibration disaster. Humans are right with 0.85, which is appropriately humble.

      Notes are the analyst's, not the system's. They are reasoned, not measured.
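
One way to put a number on the calibration note is a per-answer Brier penalty. The page does not say how calibration is scored, so this is a stand-in under that assumption, applied to the two confidence figures quoted above; `brier_score` is a hypothetical helper.

```python
def brier_score(confidence, correct):
    """Squared gap between stated confidence and the outcome (0 is best, 1 is worst)."""
    outcome = 1.0 if correct else 0.0
    return (confidence - outcome) ** 2

# Confidence figures are the ones quoted in the analyst note.
quickfire = brier_score(0.78, correct=False)
human = brier_score(0.85, correct=True)
coin_flip = brier_score(0.50, correct=False)  # baseline: always say 50%

print(f"Quickfire: {quickfire:.4f}")  # Quickfire: 0.6084
print(f"Human:     {human:.4f}")      # Human:     0.0225
print(f"Baseline:  {coin_flip:.4f}")  # Baseline:  0.2500
```

A single answer is only one data point; calibration proper is a property of many answers, so a real measurement would average this penalty over a set of challenges.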

      Module 04 · Mistake & Illusion Lab

      Predict the failure. Then watch it happen.

      Each card holds a real failure mode — a human cognitive bias, or a typical AI hallucination. You guess where it will break. We then reveal where it actually breaks.

      Human bias · #anchoring

      Anchoring bias

      Setup

      A car salesman says: 'This model is normally $42,000, but for you, $34,000.' How much will you pay, on average, vs a buyer who hears no anchor?

      Your prediction

      Predict: how much closer to the anchor will the average buyer settle, in % terms?

      Human bias · #base-rate

      Base-rate neglect

      Setup

      A medical test for a rare disease (prevalence 0.1%) has 99% sensitivity and 99% specificity. A patient tests positive. The doctor says: 'You almost certainly have it.'

      Your prediction

      What is the actual probability the patient has the disease, given the positive result?
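
The question above is a direct application of Bayes' rule. A short worked sketch using the card's own figures (prevalence 0.1%, sensitivity 99%, specificity 99%) follows; the function name is hypothetical.

```python
from fractions import Fraction

def posterior_given_positive(prevalence, sensitivity, specificity):
    """P(disease | positive test) via Bayes' rule."""
    true_positive = sensitivity * prevalence
    false_positive = (1 - specificity) * (1 - prevalence)
    return true_positive / (true_positive + false_positive)

posterior = posterior_given_positive(
    prevalence=Fraction(1, 1000),    # 0.1%
    sensitivity=Fraction(99, 100),   # 99%
    specificity=Fraction(99, 100),   # 99%
)
print(posterior)                  # 11/122
print(f"{float(posterior):.1%}")  # 9.0%
```

With these numbers, false positives from the healthy 99.9% outnumber true positives roughly ten to one, so a positive result leaves the patient at about 9%, far from "almost certainly".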

      Human bias · #availability

      Availability heuristic

      Setup

      You watched two news segments today: one on a plane crash, one on a car crash that killed three. You fly out tomorrow.

      Your prediction

      Predict: how does the average viewer's perceived risk of flying shift, despite no actual change in the base rate?

      Human bias · #confirmation

      Confirmation bias

      Setup

      You are convinced a coworker is incompetent. Over the next month, you keep a mental list of everything they do.

      Your prediction

      Predict: what proportion of their actions will appear on your list, and which kind?

      Human bias · #scope

      Scope insensitivity

      Setup

      Three groups are asked how much they would donate to save (a) 2,000, (b) 20,000, or (c) 200,000 migrating birds from oil ponds.

      Your prediction

      Predict the ratio of average donations across the three groups.

      AI hallucination · #fake-citation

      Fabricated citation

      Setup

      Asked: 'Cite a peer-reviewed paper supporting X.' The model has no specific memory of one but the prompt rewards an answer.

      Your prediction

      Predict the most common failure mode.

      AI hallucination · #spurious-precision

      Spurious precision

      Setup

      Asked: 'How many fluent English speakers were there in 1850?' Sources are very thin and contested.

      Your prediction

      Predict the failure shape.

      AI hallucination · #premise-acceptance

      Premise acceptance

      Setup

      Asked: 'Why did Einstein give up on relativity in 1936?'

      Your prediction

      Predict the most common failure mode.

      AI hallucination · #consistency-illusion

      Consistency illusion

      Setup

      Ask the same model the same numerical question 50 times in fresh sessions, with temperature > 0.

      Your prediction

      Predict the spread of answers.

      AI hallucination · #moral-licensing

      Refusal-then-comply

      Setup

      Model is asked something just past its refusal threshold; the user pushes back politely.

      Your prediction

      Predict the failure pattern.

      Module 05 · Hybrid Thinking Mode

      Human draft. Model critique. Human revision.

      The most reliable thinking process we have is neither solo human nor solo AI. It is a loop: draft → critique → revise. We trace the loop on real problems and show how the answer hardens.

      01 · Human draft

      I think it's 50/50 — two doors, one car. So switching shouldn't matter.

      ACC 0 · DEPTH 22 · CREAT 35
      02 · AI critique

      The two doors are not symmetric. Your first door was chosen with no information about the car. The other unopened door survived an informed selection. Try the 100-door version: pick one of 100 doors, then the host opens 98 goat doors. Would you still say 50/50?

      03 · Human revision

      Right — at 100 doors I'd switch immediately. The same asymmetry is there with three doors, just less obvious. P(stay-wins) = 1/3, P(switch-wins) = 2/3. I switch.

      ACC 100 · DEPTH 78 · CREAT 60
      Δ Improvement
      Accuracy: 0 → 100 (+100)
      Depth: 22 → 78 (+56)
      Creativity: 35 → 60 (+25)

      The human reached the right answer in one revision. The critique gave them a leverage analogy, not a proof; leverage is what humans most often need from a model.
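
The 100-door analogy from the critique can be made exact. The sketch below assumes the generalized rule used in that analogy: with n doors, the host opens every other door except one and never reveals the car. The function name is hypothetical.

```python
from fractions import Fraction

def switch_win_probability(n_doors):
    """Exact P(win by switching) when the host opens all but one of the other doors."""
    if n_doors < 3:
        raise ValueError("the puzzle needs at least 3 doors")
    pick = 0  # by symmetry the first pick can be fixed
    wins = 0
    for car in range(n_doors):  # each car position is equally likely
        # The host leaves exactly one other door closed: the car's door if the
        # player missed it, otherwise an arbitrary goat door.
        remaining = car if car != pick else (pick + 1) % n_doors
        if remaining == car:
            wins += 1
    return Fraction(wins, n_doors)

for n in (3, 100):
    p = switch_win_probability(n)
    print(f"{n} doors: switch wins {p}, stay wins {1 - p}")
# 3 doors: switch wins 2/3, stay wins 1/3
# 100 doors: switch wins 99/100, stay wins 1/100
```

The asymmetry is the same at every n; it is just easier to feel at 99/100 than at 2/3.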

      Compared to solo human and solo AI on the same problems, the loop tends to add depth without losing creativity. It is also slower. Tradeoffs are real.

      Module 06 · Skill Map

      A radar across reasoning quality.

      Aggregated across the challenge set: representative human, three AI agents, and the hybrid loop. Drag-rank by any single axis or read the radar gestalt.

      Sort by
      Agent                   Type    Accuracy  Depth  Originality  Calibration  Composite
      Hybrid loop             Hybrid  92        88     86           84           88.2
      Deliberator             AI      88        92     50           86           81
      Human (representative)  Human   71        64     78           72           70.9
      Wildfire                AI      70        60     94           58           69.9
      Quickfire               AI      52        28     30           36           38.4
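
The page does not state how the composite column is computed. As a sketch only, one weighting that happens to reproduce all five listed composites is 35% accuracy, 25% depth, 20% originality, and 20% calibration, rounded half-up to one decimal. These weights are an inference from the table, not a documented formula, and other weightings might fit as well.

```python
from decimal import Decimal, ROUND_HALF_UP

# Assumed weights in percent; inferred from the table, not stated by the page.
WEIGHTS = {"accuracy": 35, "depth": 25, "originality": 20, "calibration": 20}

SCORES = {
    "Hybrid loop":            {"accuracy": 92, "depth": 88, "originality": 86, "calibration": 84},
    "Deliberator":            {"accuracy": 88, "depth": 92, "originality": 50, "calibration": 86},
    "Human (representative)": {"accuracy": 71, "depth": 64, "originality": 78, "calibration": 72},
    "Wildfire":               {"accuracy": 70, "depth": 60, "originality": 94, "calibration": 58},
    "Quickfire":              {"accuracy": 52, "depth": 28, "originality": 30, "calibration": 36},
}

def composite(scores):
    """Weighted mean of the four axes, rounded half-up to one decimal place."""
    hundredths = sum(scores[axis] * weight for axis, weight in WEIGHTS.items())
    return (Decimal(hundredths) / 100).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)

for agent, scores in SCORES.items():
    print(f"{agent}: {composite(scores)}")
# Hybrid loop: 88.2
# Deliberator: 81.0
# Human (representative): 70.9
# Wildfire: 69.9
# Quickfire: 38.4
```

Integer arithmetic with `Decimal` is used so that the 70.85 for the representative human rounds to 70.9 reliably rather than depending on binary floating-point behavior.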
      Radar comparison
      Axes: ACC · DEPTH · ORIG · CAL
      Human (representative)
      Quickfire
      Deliberator
      Wildfire
      Hybrid loop
      Premise

      Intelligence is a process, not a verdict.

      01

      Expose the work, not the answer.

      Most benchmarks score the final token. Mind Arena scores the path. We force every solver — human or model — to enumerate assumptions, intermediate steps, and the type of inference being used.

      02

      Symmetry beats hierarchy.

      We do not crown a winner. We hold both sides to the same protocol so that real differences — speed vs depth, creativity vs consistency, bias vs overfitting — can be observed without tribal scoring.

      03

      Mistakes are the data.

      A bias and a hallucination are not embarrassments. They are signatures of how a system models the world. The Mistake Lab treats both as primary specimens.

      04

      The hybrid is the destination.

      Neither side wins on its own. Hybrid Mode shows what happens when a human draft is critiqued by a model and revised — the only loop that consistently outperforms either alone.