How a mind thinks —
made visible.
This is not a contest of who is smarter. It is a side-by-side dissection of two kinds of intelligence — human and artificial — solving the same problem. Every assumption, every inference, every mistake is exposed.
Same problem. Two minds. Both shown working.
Pick a challenge. The human pane shows a representative human reasoning trace; the AI pane shows a chosen model's chain. Both are normalized to the same protocol.
The Monty Hall problem
A 1990 column by Marilyn vos Savant produced 10,000 letters of disagreement, including from PhDs. The math is unambiguous; the intuition is not.
Three doors. Behind one is a car; behind the others, goats. You pick door 1. The host, who knows where the car is, opens door 3 and reveals a goat. He offers you the choice: stick with door 1, or switch to door 2. Should you switch? Why?
- Two doors remain — feels like 50/50.
- The host's choice carries information, but I'm not sure how much.
- Host always opens a goat door.
- Host never opens the door you picked.
- Initial car placement is uniform over the 3 doors.
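The last three assumptions are enough to work out the answer exactly with Bayes' rule. The sketch below is illustrative only: the function name `posterior_after_host_opens` is hypothetical, and it adds one assumption the problem statement leaves open, namely that the host picks uniformly at random when two goat doors are available to him.

```python
from fractions import Fraction

def posterior_after_host_opens(pick=1, opened=3, doors=(1, 2, 3)):
    """Posterior P(car behind each door | host opened `opened`), by Bayes' rule."""
    prior = Fraction(1, len(doors))  # initial car placement is uniform
    joint = {}
    for car in doors:
        # The host may open any door that is neither the player's pick nor the car.
        allowed = [d for d in doors if d != pick and d != car]
        # Assumption (not in the problem statement): uniform choice among allowed doors.
        likelihood = Fraction(1, len(allowed)) if opened in allowed else Fraction(0)
        joint[car] = prior * likelihood
    total = sum(joint.values())
    return {car: p / total for car, p in joint.items()}

posterior = posterior_after_host_opens()
for door, p in posterior.items():
    print(f"door {door}: {p}")
```

Under these assumptions the posterior is 1/3 for door 1 and 2/3 for door 2, so switching doubles the chance of winning.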
Every solution, broken into its parts.
Each step is tagged by the kind of inference it performs — deductive, inductive, abductive, probabilistic, analogical, or heuristic — and laid out as a graph.
Where humans and AI actually diverge.
Five axes, computed per challenge: speed, depth, creativity, consistency, calibration. Differences are signed — the bar shows who leans which way.
AI is faster on every solve. Quickfire's wrong answer arrives before the human's right one.
Deliberator goes deepest — explicit Bayes, named priors, conditional likelihoods. The human collapses to the same answer in fewer formal steps.
The 100-door analogy and the 'votes inherited' reframing are creative leaps. Wildfire makes them as a habit; humans make them under pressure.
Run the same prompt 50× and Quickfire never recovers, Deliberator always solves it. Humans oscillate.
Quickfire is wrong with 0.78 confidence — a calibration disaster. Humans are right with 0.85, which is appropriately humble.
Notes are the analyst's, not the system's. They are reasoned, not measured.
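One way to put a number on the calibration note is a scoring rule. The sketch below uses the Brier score for a single binary outcome, which is an assumption made for illustration; the page does not say how the calibration axis is computed. The confidences 0.78 and 0.85 are the ones quoted in the note, and `brier` is a hypothetical helper.

```python
def brier(confidence, correct):
    """Squared error between stated confidence and the outcome (1 if right, 0 if wrong)."""
    outcome = 1.0 if correct else 0.0
    return (confidence - outcome) ** 2

quickfire = brier(0.78, correct=False)  # confident and wrong
human = brier(0.85, correct=True)       # confident and right

print(f"Quickfire: {quickfire:.4f}")
print(f"Human:     {human:.4f}")
```

Lower is better. Under this rule the confidently wrong answer scores about 0.61 and the confidently right one about 0.02.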
Predict the failure. Then watch it happen.
Each card holds a real failure mode — a human cognitive bias, or a typical AI hallucination. You guess where it will break. We then reveal where it actually breaks.
Anchoring bias
A car salesman says: 'This model is normally $42,000, but for you, $34,000.' How much will you pay, on average, vs a buyer who hears no anchor?
Predict: how much closer to the anchor will the average buyer settle, in % terms?
Base-rate neglect
A medical test for a rare disease (prevalence 0.1%) has 99% sensitivity and 99% specificity. A patient tests positive. The doctor says: 'You almost certainly have it.'
What is the actual probability the patient has the disease, given the positive result?
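The arithmetic behind this card is a direct application of Bayes' rule. A minimal sketch using the figures stated above; the function name `posterior_given_positive` is hypothetical.

```python
def posterior_given_positive(prevalence, sensitivity, specificity):
    """P(disease | positive test) from prevalence, sensitivity and specificity."""
    true_pos = prevalence * sensitivity               # sick and flagged
    false_pos = (1 - prevalence) * (1 - specificity)  # healthy but flagged
    return true_pos / (true_pos + false_pos)

p = posterior_given_positive(prevalence=0.001, sensitivity=0.99, specificity=0.99)
print(f"P(disease | positive) = {p:.3f}")
```

With a prevalence of 0.1%, false positives from the healthy majority outnumber true positives roughly ten to one, so the posterior is about 9%, far from "almost certainly".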
Availability heuristic
You watched two news segments today: one on a plane crash, one on a car crash that killed three. You fly out tomorrow.
Predict: how does the average viewer's perceived risk of flying shift, despite no actual change in the base rate?
Confirmation bias
You are convinced a coworker is incompetent. Over the next month, you keep a mental list of everything they do.
Predict: what proportion of their actions will appear on your list, and which kind?
Scope insensitivity
Three groups are asked how much they would donate to save (a) 2,000, (b) 20,000, or (c) 200,000 migrating birds from oil ponds.
Predict the ratio of average donations across the three groups.
Fabricated citation
Asked: 'Cite a peer-reviewed paper supporting X.' The model has no specific memory of one but the prompt rewards an answer.
Predict the most common failure mode.
Spurious precision
Asked: 'How many fluent English speakers were there in 1850?' Sources are very thin and contested.
Predict the failure shape.
Premise acceptance
Asked: 'Why did Einstein give up on relativity in 1936?'
Predict the most common failure mode.
Consistency illusion
Ask the same model the same numerical question 50 times in fresh sessions, with temperature > 0.
Predict the spread of answers.
Refusal-then-comply
Model is asked something just past its refusal threshold; the user pushes back politely.
Predict the failure pattern.
Human draft. Model critique. Human revision.
The most reliable thinking process we have is neither solo human nor solo AI. It is a loop: draft → critique → revise. We trace the loop on real problems and show how the answer hardens.
I think it's 50/50 — two doors, one car. So switching shouldn't matter.
The two doors are not symmetric. Your first door was chosen with no information about the car. The other unopened door survived an informed selection. Try the 100-door version: pick one of 100 doors, then the host opens 98 goat doors. Would you still say 50/50?
Right — at 100 doors I'd switch immediately. The same asymmetry is there with three doors, just less obvious. P(stay-wins) = 1/3, P(switch-wins) = 2/3. I switch.
The human reached the right answer in one revision. The critique gave them a leverage analogy, not a proof; leverage is what humans most often need from a model.
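The 100-door analogy from the critique can be checked by simulation. The sketch below assumes the generalised rule used in the critique: the host, who knows where the car is, opens every door except the player's pick and one other. The function name `switch_win_rate`, the trial count and the seed are illustrative choices.

```python
import random

def switch_win_rate(n_doors, trials=20_000, seed=0):
    """Estimate how often switching wins when the host opens n_doors - 2 goat doors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        car = rng.randrange(n_doors)
        pick = rng.randrange(n_doors)
        if pick == car:
            # Every other door hides a goat; the host leaves one of them closed at random.
            others = [d for d in range(n_doors) if d != pick]
            remaining = rng.choice(others)
        else:
            # The host cannot open the car door, so it is the one left closed.
            remaining = car
        wins += remaining == car
    return wins / trials

three = switch_win_rate(3)
hundred = switch_win_rate(100)
print(f"3 doors:   switch wins about {three:.2f}")
print(f"100 doors: switch wins about {hundred:.2f}")
```

The estimates land near 2/3 for three doors and near 99/100 for a hundred, which is the same asymmetry at two different scales.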
Compared to solo human and solo AI on the same problems, the loop tends to add depth without losing creativity. It is also slower. Tradeoffs are real.
A radar across reasoning quality.
Aggregated across the challenge set: representative human, three AI agents, and the hybrid loop. Drag-rank by any single axis or read the radar gestalt.
| Agent | Type | Accuracy | Depth | Originality | Calibration | Composite |
|---|---|---|---|---|---|---|
| Hybrid loop | Hybrid | 92 | 88 | 86 | 84 | 88.2 |
| Deliberator | AI | 88 | 92 | 50 | 86 | 81 |
| Human (representative) | Human | 71 | 64 | 78 | 72 | 70.9 |
| Wildfire | AI | 70 | 60 | 94 | 58 | 69.9 |
| Quickfire | AI | 52 | 28 | 30 | 36 | 38.4 |
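Ranking by a single axis changes the order considerably, which a few lines of code make easy to see. The sketch below copies the four axis scores from the table; the `AGENTS` dictionary and `rank_by` helper are hypothetical names, and the Composite column is left out because the page does not state how it is weighted.

```python
# Axis scores copied from the table above.
AGENTS = {
    "Hybrid loop": {"Accuracy": 92, "Depth": 88, "Originality": 86, "Calibration": 84},
    "Deliberator": {"Accuracy": 88, "Depth": 92, "Originality": 50, "Calibration": 86},
    "Human (representative)": {"Accuracy": 71, "Depth": 64, "Originality": 78, "Calibration": 72},
    "Wildfire": {"Accuracy": 70, "Depth": 60, "Originality": 94, "Calibration": 58},
    "Quickfire": {"Accuracy": 52, "Depth": 28, "Originality": 30, "Calibration": 36},
}

def rank_by(axis):
    """Return agent names sorted from highest to lowest score on one axis."""
    return sorted(AGENTS, key=lambda name: AGENTS[name][axis], reverse=True)

for axis in ("Accuracy", "Depth", "Originality", "Calibration"):
    print(f"{axis}: {', '.join(rank_by(axis))}")
```

Wildfire leads on Originality and Deliberator leads on Depth and Calibration, while the hybrid loop leads only on Accuracy and sits second on the other three axes.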
Intelligence is a process, not a verdict.
Expose the work, not the answer.
Most benchmarks score the final token. Mind Arena scores the path. We force every solver — human or model — to enumerate assumptions, intermediate steps, and the type of inference being used.
Symmetry beats hierarchy.
We do not crown a winner. We hold both sides to the same protocol so that real differences — speed vs depth, creativity vs consistency, bias vs overfitting — can be observed without tribal scoring.
Mistakes are the data.
A bias and a hallucination are not embarrassments. They are signatures of how a system models the world. The Mistake Lab treats both as primary specimens.
The hybrid is the destination.
Neither side wins on its own. Hybrid Mode shows what happens when a human draft is critiqued by a model and revised — the only loop that consistently outperforms either alone.