Table 1: Evaluations of various MLLMs on ReasonMap. S. represents short questions (max map score = 20), L. denotes long questions (max map score = 40). Bold is best per group; Underline is second best.
Model | Type | Acc. (S.) | #Tokens (S.) | Acc. (L.) | #Tokens (L.) | Map Score (S. / L.) |
---|---|---|---|---|---|---|
Open-source Models | ||||||
Qwen2.5-VL-3B-Instruct | Base | 8.68% | 42 | 7.99% | 151 | 2.75 / 3.70 |
Qwen2.5-VL-32B-Instruct | Base | 16.49% | 36 | 15.71% | 112 | 3.88 / 6.84 |
Qwen2.5-VL-72B-Instruct | Base | 26.65% | 33 | 24.22% | 104 | 5.09 / 8.80 |
InternVL3-38B | Base | 14.84% | 43 | 13.45% | 68 | 3.48 / 6.31 |
InternVL3-78B | Base | 25.35% | 33 | 19.62% | 62 | 4.80 / 7.50 |
Kimi-VL-A3B-Instruct | Base | 12.76% | 41 | 12.33% | 41 | 3.30 / 5.37 |
Kimi-VL-A3B-Thinking | Reasoning | 5.47% | 754 | 5.47% | 1,287 | 2.44 / 3.17 |
Skywork-R1V-38B | Reasoning | 6.86% | 645 | 3.21% | 842 | 2.11 / 3.11 |
QvQ-72B-Preview | Reasoning | 9.03% | 1,279 | 4.25% | 1,619 | 1.59 / 1.55 |
Closed-source Models | ||||||
Doubao-115 | Base | 34.20% | 32 | 38.02% | 118 | 5.25 / 11.96 |
OpenAI 4o | Base | 41.15% | 34 | 42.80% | 58 | 6.84 / 13.57 |
Doubao-415 | Reasoning | 43.14% | 536 | 46.09% | 1,796 | 7.33 / 14.67 |
Doubao-428 | Reasoning | 37.15% | 532 | 37.85% | 2,167 | 5.52 / 11.73 |
Gemini-2.5-Flash | Reasoning | 46.09% | 806 | 29.86% | 1,419 | 7.64 / 9.98 |
OpenAI o3 | Reasoning | 63.02% | 1,236 | 59.11% | 2,372 | 9.53 / 17.96 |
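Because the maximum map score differs between question types (20 for short, 40 for long), the two map-score columns are easiest to compare after normalizing by their respective maxima. A minimal sketch, using a few values copied from Table 1:

```python
# Normalize map scores by their maxima so short (max 20) and long (max 40)
# questions are on the same 0-1 scale. Example values copied from Table 1.
MAX_SHORT, MAX_LONG = 20, 40

scores = {
    "OpenAI o3": (9.53, 17.96),
    "Doubao-415": (7.33, 14.67),
    "Qwen2.5-VL-72B-Instruct": (5.09, 8.80),
}

for model, (s, l) in scores.items():
    print(f"{model}: short {s / MAX_SHORT:.0%}, long {l / MAX_LONG:.0%}")
```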
Table 2: Evaluations of various MLLMs on ReasonMap without visual inputs. S. represents short questions (max map score = 20), L. denotes long questions (max map score = 40). Bold is best per group; Underline is second best. Green ↑ means improved, Red ↓ represents the value dropped from full input (Table 1).
Model | Type | Acc. (S.) | #Tokens (S.) | Acc. (L.) | #Tokens (L.) | Map Score (S. / L.) |
---|---|---|---|---|---|---|
Open-source Models | ||||||
Qwen2.5-VL-3B-Instruct | Base | 9.38% ↑ 0.70% | 47 | 9.72% ↑ 1.73% | 147 | 2.93 ↑ 0.18 / 4.51 ↑ 0.81
Qwen2.5-VL-72B-Instruct | Base | 16.41% ↓ 10.24% | 28 | 15.71% ↓ 8.51% | 108 | 4.03 ↓ 1.06 / 6.49 ↓ 2.31 |
Kimi-VL-A3B-Instruct | Base | 11.81% ↓ 0.95% | 41 | 9.81% ↓ 2.52% | 49 | 3.37 ↑ 0.07 / 5.32 ↓ 0.05 |
Kimi-VL-A3B-Thinking | Reasoning | 4.17% ↓ 1.30% | 1,039 | 2.08% ↓ 3.39% | 1,755 | 2.06 ↓ 0.38 / 1.64 ↓ 1.53 |
Closed-source Models | ||||||
Doubao-115 | Base | 13.72% ↓ 20.48% | 34 | 13.98% ↓ 24.04% | 99 | 3.50 ↓ 1.75 / 6.48 ↓ 5.48 |
Doubao-415 | Reasoning | 21.53% ↓ 21.61% | 352 | 17.19% ↓ 28.90% | 1,047 | 4.85 ↓ 2.48 / 7.68 ↓ 6.99 |
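The ↑/↓ annotations above are plain differences against the corresponding full-input values in Table 1. A small sketch of that convention, using one pair of numbers copied from the two tables:

```python
# Reproduce the delta annotations of Table 2: value without visual input
# compared against the corresponding full-input value from Table 1.
def annotate(no_visual: float, full_input: float) -> str:
    delta = no_visual - full_input
    arrow = "↑" if delta >= 0 else "↓"
    return f"{no_visual:.2f}% {arrow} {abs(delta):.2f}%"

# Doubao-115, short-question accuracy: 34.20% with the map, 13.72% without.
print(annotate(13.72, 34.20))  # -> "13.72% ↓ 20.48%"
```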
Figure 1: Accuracy across different cities for four representative MLLMs (Qwen2.5-VL-72B-I, InternVL3-78B, OpenAI o3, and Doubao-415); left: short questions, right: long questions. Each city is marked with its map difficulty and country flag. The test set contains a fixed number of samples per city (per model): 32 for Auckland, 34 for Los Angeles, 7 for Miami, 35 for Lisboa, 18 for Geneva, 40 for Beijing, 39 for Hangzhou, 17 for Budapest, 39 for Singapore, 40 for Rome, and 11 for Toronto.
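The per-city counts sum to 312 questions per model, so an overall accuracy across cities is the sample-weighted mean of the per-city accuracies. A minimal sketch; only the sample counts below come from the Figure 1 caption, the per-city accuracies are hypothetical placeholders:

```python
# Sample-weighted overall accuracy from per-city accuracies.
# Sample counts are from the Figure 1 caption; the accuracies below are
# hypothetical placeholders, not actual ReasonMap results.
samples = {
    "Auckland": 32, "Los Angeles": 34, "Miami": 7, "Lisboa": 35,
    "Geneva": 18, "Beijing": 40, "Hangzhou": 39, "Budapest": 17,
    "Singapore": 39, "Rome": 40, "Toronto": 11,
}
per_city_acc = {city: 0.50 for city in samples}  # placeholder accuracies

total = sum(samples.values())  # 312 questions per model
overall = sum(per_city_acc[c] * n for c, n in samples.items()) / total
print(f"{total} samples, overall accuracy {overall:.2%}")
```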
ReasonMap is designed to evaluate the fine-grained visual reasoning abilities of MLLMs. To use ReasonMap for evaluation, follow these steps:
from datasets import load_dataset
ds = load_dataset("FSCCS/ReasonMap")
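After loading, you can inspect the available splits and look at a single record. The exact split and field names are defined by the dataset itself, so treat what this sketch prints as the source of truth:

```python
# Inspect the loaded dataset: available splits, sizes, and the fields of one record.
print(ds)  # shows splits and number of rows

split_name = next(iter(ds))   # first available split
example = ds[split_name][0]   # one record
print(example.keys())         # field names provided by ReasonMap
```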
git clone https://github.com/fscdc/ReasonMap.git
conda create -n reasonmap python=3.10
conda activate reasonmap
pip install torchvision==0.17.2
pip install torch==2.2.2
pip install numpy==1.24.3
pip install transformers datasets
pip install flash-attn # if this does not work, please install flash-attn from source (LINK)
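With the environment set up, a typical evaluation run loops over the dataset, queries the model under test with each map image and question, and stores the predictions for scoring. The sketch below only shows the loop structure: `query_model` is a hypothetical stand-in for your MLLM backend, and the field names `image`, `question`, and `answer` are assumptions, so check the actual ReasonMap schema first.

```python
# Skeleton of an evaluation loop over ReasonMap.
# NOTE: query_model is a hypothetical placeholder for your MLLM backend, and the
# field names "image", "question", and "answer" are assumptions; check the
# actual ReasonMap schema (e.g., via example.keys()) before running.
from datasets import load_dataset

def query_model(image, question: str) -> str:
    """Hypothetical wrapper around the MLLM you want to evaluate."""
    raise NotImplementedError

ds = load_dataset("FSCCS/ReasonMap")
split = next(iter(ds))

predictions, references = [], []
for example in ds[split]:
    pred = query_model(example["image"], example["question"])
    predictions.append(pred)
    references.append(example["answer"])

# Pass predictions/references to the scoring scripts in the ReasonMap repository
# to obtain accuracy and map scores.
```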
If you encounter any issues, please open an issue at ReasonMap Issues, and we will assist you as soon as possible.
@article{feng2025reasonmap,
title={Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps},
author={Feng, Sicheng and Wang, Song and Ouyang, Shuyi and Kong, Lingdong and Song, Zikai and Zhu, Jianke and Wang, Huan and Wang, Xinchao},
journal={arXiv preprint arXiv:2505.18675},
year={2025},
}