In the current draft, WildToM-Reasoner reaches 72.7% MC accuracy, compared with 62.1% for the strongest baseline, Qwen3-VL-32B, a margin of 10.6 percentage points. The gap between first-order and second-order questions is not uniform: it is small for Belief and Desire but widens sharply for Intention and Knowledge.
1. Naturalistic Source
Clip selection favors socially rich interactions rather than scripted prompts.