Machine Theory of Mind in naturalistic social videos

WildToM: Benchmarking Machine-Theory-of-Mind in the Wild

Existing Theory-of-Mind (ToM) benchmarks are often clean but artificial. WildToM moves evaluation to in-the-wild interactions, where social intent, emotion, and knowledge must be inferred from multimodal evidence under perspective constraints.

Author list and affiliations to be inserted

Dataset

290 clips

889 QA pairs

5 dimensions

1st + 2nd order

Figure: Comparison of Theory-of-Mind evaluation paradigms. WildToM targets the ecological gap between controlled ToM settings and real social interactions.

Construction Strategy

A hybrid human-AI pipeline converts naturalistic videos into diagnostic ToM evaluation samples with explicit reasoning order labels.

Figure: WildToM-Bench construction pipeline, covering video selection, embodied elicitation, QA generation, and human revision with dedicated knowledge annotation.

1. Naturalistic Source

Clip selection favors socially rich interactions rather than scripted prompts.

2. Embodied Elicitation

Character-centered prompts recover latent states from situated perspectives.

3. Structured QA

Samples are rewritten into consistent multiple-choice (MC) and open-ended formats, each labeled with its reasoning order.

4. Human De-biasing

Annotators refine factual grounding, plausibility, and category/order correctness.
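To make the structured QA step concrete, here is a minimal sketch of what one sample record could look like. The class name `ToMSample`, its field names, and the example values are all hypothetical and chosen for illustration; the released schema may differ. Only the five dimensions, the two reasoning orders, and the MC versus open-ended split come from this page.

```python
from dataclasses import dataclass
from typing import List, Optional

# The five mental-state dimensions named on this page.
DIMENSIONS = ("Belief", "Desire", "Intention", "Emotion", "Knowledge")


@dataclass
class ToMSample:
    """Hypothetical record layout for one QA pair (not the official schema)."""

    clip_id: str
    dimension: str                       # one of DIMENSIONS
    order: int                           # 1 = first-order, 2 = second-order
    question: str
    answer: str                          # gold answer text
    options: Optional[List[str]] = None  # set for MC, None for open-ended

    def __post_init__(self) -> None:
        # Reject records that would break category or order statistics.
        if self.dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {self.dimension!r}")
        if self.order not in (1, 2):
            raise ValueError(f"order must be 1 or 2, got {self.order!r}")
        if self.options is not None and self.answer not in self.options:
            raise ValueError("gold answer must be one of the MC options")

    @property
    def is_multiple_choice(self) -> bool:
        return self.options is not None


# Placeholder content, invented for illustration only.
example = ToMSample(
    clip_id="clip_0001",
    dimension="Belief",
    order=2,
    question="What does A think B believes about the plan?",
    answer="That it was cancelled",
    options=["That it was cancelled", "That it is still on"],
)
```

Validating at construction time keeps mislabeled dimensions or orders from silently skewing per-category counts during the human revision pass.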

Dataset Dashboard

WildToM-Bench balances scale with social diversity across dimensions, scenes, and relationship types.

290 Video Clips

889 QA Pairs

357 First-order

532 Second-order

0.66 Cohen's kappa
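For readers unfamiliar with the agreement statistic, the sketch below computes Cohen's kappa for two annotators from scratch. This page does not say which labels the reported 0.66 was measured on, so the function name `cohens_kappa` and the toy label lists are assumptions used only to show the formula: observed agreement corrected by the agreement expected from each annotator's label frequencies.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)

    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Agreement expected by chance, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)

    if expected == 1.0:  # both annotators used a single identical label
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy labels, invented for illustration; not benchmark annotations.
annotator_a = ["Belief", "Belief", "Desire", "Desire"]
annotator_b = ["Belief", "Desire", "Desire", "Desire"]
kappa = cohens_kappa(annotator_a, annotator_b)  # 0.5 on this toy input
```

On the toy input the annotators agree on 3 of 4 items (0.75) while chance agreement is 0.5, which gives a kappa of 0.5.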

Figure: Question category and reasoning order distribution over all benchmark QA pairs.

224 Belief

155 Desire

188 Intention

143 Emotion

179 Knowledge

Scene Type

Category	Count	Ratio
Home	91	31.4%
Outdoor	65	22.4%
Workplace	57	19.7%
Restaurant	38	13.1%
Hospital	17	5.9%
Store	12	4.1%
School	10	3.4%

Relationship Type

Category	Count	Ratio
Friends	79	27.2%
Family	57	19.7%
Strangers	50	17.2%
Colleagues	50	17.2%
Romantic	30	10.3%
Authority	24	8.3%
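The dashboard figures can be cross-checked with a few lines of code. The sketch below copies the counts from this page into dictionaries (the variable and function names are illustrative assumptions), confirms that each breakdown sums to the stated totals of 290 clips and 889 QA pairs, and recomputes the ratio columns.

```python
# Counts copied from the dashboard and tables on this page.
order_counts = {"First-order": 357, "Second-order": 532}
dimension_counts = {
    "Belief": 224, "Desire": 155, "Intention": 188,
    "Emotion": 143, "Knowledge": 179,
}
scene_counts = {
    "Home": 91, "Outdoor": 65, "Workplace": 57, "Restaurant": 38,
    "Hospital": 17, "Store": 12, "School": 10,
}
relationship_counts = {
    "Friends": 79, "Family": 57, "Strangers": 50,
    "Colleagues": 50, "Romantic": 30, "Authority": 24,
}


def ratios(counts):
    """Percentage share of each category, rounded to one decimal place."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


# Each breakdown should account for every clip or QA pair exactly once.
assert sum(order_counts.values()) == 889
assert sum(dimension_counts.values()) == 889
assert sum(scene_counts.values()) == 290
assert sum(relationship_counts.values()) == 290

scene_ratios = ratios(scene_counts)
relationship_ratios = ratios(relationship_counts)
```

The recomputed percentages match the Ratio columns above, for example 31.4% for Home and 27.2% for Friends.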

Evidence of Difficulty

Performance gaps reveal that nested social reasoning is still a major bottleneck for current multimodal systems.

In the current draft, WildToM-Reasoner reaches 72.7% multiple-choice (MC) accuracy, 10.6 points above the strongest baseline, Qwen3-VL-32B, at 62.1%. The gap between first-order and second-order accuracy is not uniform across dimensions: it is small for Belief and Desire but widens sharply for Intention and Knowledge.
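One way to measure the per-dimension order gap is sketched below. It assumes each prediction is reduced to a `(dimension, order, is_correct)` triple and defines the gap as first-order accuracy minus second-order accuracy. The record layout, the function name `order_gap`, and the sign convention are assumptions, and the toy records are invented; they are not benchmark results.

```python
from collections import defaultdict


def order_gap(records):
    """Per-dimension gap: first-order accuracy minus second-order accuracy.

    `records` is an iterable of (dimension, order, is_correct) triples,
    where order is 1 or 2. Dimensions missing either order are skipped.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for dimension, order, is_correct in records:
        totals[(dimension, order)] += 1
        hits[(dimension, order)] += int(is_correct)

    gaps = {}
    for dimension in sorted({dim for dim, _ in totals}):
        n1 = totals.get((dimension, 1), 0)
        n2 = totals.get((dimension, 2), 0)
        if n1 == 0 or n2 == 0:
            continue  # cannot compare orders for this dimension
        acc1 = hits.get((dimension, 1), 0) / n1
        acc2 = hits.get((dimension, 2), 0) / n2
        gaps[dimension] = acc1 - acc2
    return gaps


# Toy records, invented for illustration only.
toy_records = [
    ("Belief", 1, True), ("Belief", 2, True),
    ("Intention", 1, True), ("Intention", 1, True),
    ("Intention", 2, True), ("Intention", 2, False),
]
toy_gaps = order_gap(toy_records)  # {"Belief": 0.0, "Intention": 0.5}
```

Under this convention a positive value means second-order questions in that dimension were answered less accurately than first-order ones.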

Figure: Dimension-specific difficulty gap between first-order and second-order reasoning.

Selected Video Cases

Curated examples from selected Social-IQ clips.

Citation

@misc{wildtom2026,
  title   = {WildToM: Benchmarking Machine-Theory-of-Mind in the Wild},
  author  = {Author list to be inserted},
  year    = {2026}
}