Machine Theory of Mind in naturalistic social videos

WildToM: Benchmarking Machine-Theory-of-Mind in the Wild

Existing Theory-of-Mind (ToM) benchmarks are often clean but artificial. WildToM moves evaluation to in-the-wild interactions, where social intent, emotion, and knowledge must be inferred from multimodal evidence under perspective constraints.

Author list and affiliations to be inserted

290 clips
889 QA pairs
5 dimensions
1st + 2nd order

Comparison of Theory-of-Mind evaluation paradigms
WildToM targets the ecological gap between controlled ToM settings and real social interactions.

Construction Strategy

A hybrid human-AI pipeline converts naturalistic videos into diagnostic ToM evaluation samples, each with an explicit reasoning-order label.

WildToM-Bench construction pipeline
Video selection, embodied elicitation, QA generation, and human revision with dedicated knowledge annotation.

1. Naturalistic Source

Clip selection favors socially rich interactions rather than scripted prompts.

2. Embodied Elicitation

Character-centered prompts recover latent states from situated perspectives.

3. Structured QA

Samples are rewritten into consistent multiple-choice (MC) and open-ended formats, each labeled with its reasoning order (first or second).

4. Human De-biasing

Annotators refine factual grounding, plausibility, and category/order correctness.
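
The four stages above imply a per-sample record that carries a mental-state dimension, a reasoning-order label, and either an MC or an open-ended answer. The following is a minimal sketch of such a record. The class name `ToMSample` and every field name are hypothetical and are not taken from the released data format.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# The five mental-state dimensions listed in the dataset dashboard.
DIMENSIONS = {"Belief", "Desire", "Intention", "Emotion", "Knowledge"}


@dataclass
class ToMSample:
    """Hypothetical record for one QA pair; field names are illustrative only."""
    clip_id: str
    question: str
    dimension: str                                     # one of DIMENSIONS
    order: int                                         # 1 = first-order, 2 = second-order
    options: List[str] = field(default_factory=list)   # MC options; empty if open-ended
    answer_index: Optional[int] = None                 # index into options for MC items
    open_answer: Optional[str] = None                  # reference answer if open-ended

    def validate(self) -> None:
        """Raise ValueError if the record breaks the assumed schema."""
        if self.dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {self.dimension}")
        if self.order not in (1, 2):
            raise ValueError(f"order must be 1 or 2, got {self.order}")
        if self.options:
            valid = (self.answer_index is not None
                     and 0 <= self.answer_index < len(self.options))
            if not valid:
                raise ValueError("MC sample needs a valid answer_index")
        elif not self.open_answer:
            raise ValueError("open-ended sample needs open_answer")
```

A check like `validate()` is one way the category and order correctness reviewed in stage 4 could be enforced mechanically before human revision.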

Dataset Dashboard

WildToM-Bench balances scale with social diversity across dimensions, scenes, and relationship types.

290 Video Clips
889 QA Pairs
357 First-order
532 Second-order
0.66 Cohen's kappa
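
For readers unfamiliar with the agreement statistic, Cohen's kappa compares the observed agreement between two annotators with the agreement expected by chance from their label frequencies. The page does not state which labels the reported 0.66 was computed over, so the sketch below is a generic two-annotator implementation. The function name `cohens_kappa` is illustrative and not part of any WildToM tooling.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one and the same label throughout
    return (p_o - p_e) / (1 - p_e)
```

With toy labels `[1, 1, 2, 2]` and `[1, 2, 2, 2]`, observed agreement is 0.75 and chance agreement is 0.5, giving a kappa of 0.5.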

Question category and reasoning order distribution
Category and order distribution over all benchmark QA pairs.
224 Belief
155 Desire
188 Intention
143 Emotion
179 Knowledge

Scene Type

Category      Count   Ratio
Home             91   31.4%
Outdoor          65   22.4%
Workplace        57   19.7%
Restaurant       38   13.1%
Hospital         17    5.9%
Store            12    4.1%
School           10    3.4%

Relationship Type

Category      Count   Ratio
Friends          79   27.2%
Family           57   19.7%
Strangers        50   17.2%
Colleagues       50   17.2%
Romantic         30   10.3%
Authority        24    8.3%
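
The ratios in both tables follow from the counts and the 290-clip total. The short check below uses the numbers from the tables; the helper name `ratios` is illustrative. It confirms that each table sums to 290 and recomputes the percentages.

```python
TOTAL_CLIPS = 290

scene_counts = {
    "Home": 91, "Outdoor": 65, "Workplace": 57, "Restaurant": 38,
    "Hospital": 17, "Store": 12, "School": 10,
}
relationship_counts = {
    "Friends": 79, "Family": 57, "Strangers": 50,
    "Colleagues": 50, "Romantic": 30, "Authority": 24,
}


def ratios(counts: dict, total: int) -> dict:
    """Return each category's share of `total` as a percentage, to one decimal."""
    return {name: round(100 * count / total, 1) for name, count in counts.items()}


# Every clip is assigned exactly one scene type and one relationship type.
assert sum(scene_counts.values()) == TOTAL_CLIPS
assert sum(relationship_counts.values()) == TOTAL_CLIPS

for table in (scene_counts, relationship_counts):
    for name, pct in ratios(table, TOTAL_CLIPS).items():
        print(f"{name:<12}{table[name]:>5}{pct:>8.1f}%")
```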

Evidence of Difficulty

Performance gaps reveal that nested social reasoning is still a major bottleneck for current multimodal systems.

In the current draft, WildToM-Reasoner reaches 72.7% multiple-choice (MC) accuracy, 10.6 points above the strongest baseline, Qwen3-VL-32B, at 62.1%. The gap between first-order and second-order accuracy is not uniform across dimensions: it is small for Belief and Desire but widens sharply for Intention and Knowledge.

Difficulty gap between first-order and second-order reasoning
Dimension-specific gap between 1st-order and 2nd-order reasoning.
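
A per-dimension gap of this kind can be computed from item-level results. The sketch below assumes the gap is first-order accuracy minus second-order accuracy, in percentage points; the page does not give the exact definition. The function name `order_gap` and the record layout are hypothetical, and the records in the usage note are toy placeholders, not benchmark results.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Record = Tuple[str, int, bool]  # (dimension, reasoning order, answered correctly)


def order_gap(records: Iterable[Record]) -> Dict[str, float]:
    """First-order minus second-order accuracy, in points, per dimension."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for dimension, order, is_correct in records:
        totals[(dimension, order)] += 1
        hits[(dimension, order)] += int(is_correct)

    gaps = {}
    for dimension in sorted({d for d, _ in totals}):
        n1 = totals.get((dimension, 1), 0)
        n2 = totals.get((dimension, 2), 0)
        if n1 == 0 or n2 == 0:
            continue  # a gap needs items of both orders
        acc1 = 100 * hits[(dimension, 1)] / n1
        acc2 = 100 * hits[(dimension, 2)] / n2
        gaps[dimension] = acc1 - acc2
    return gaps
```

For example, with toy records where Belief is answered equally well at both orders and Knowledge drops from 2/2 correct at first order to 1/2 at second order, `order_gap` returns `{"Belief": 0.0, "Knowledge": 50.0}`.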

Selected Video Cases

Curated examples from selected Social-IQ clips, organized by mental dimension.

Citation

@misc{wildtom2026,
  title   = {WildToM: Benchmarking Machine-Theory-of-Mind in the Wild},
  author  = {Author list to be inserted},
  year    = {2026}
}