RewardMap: Tackling Sparse Rewards in Fine-grained Visual Reasoning via Multi-Stage Reinforcement Learning

2025

1Westlake University, Hangzhou, China 2Tongji University, Shanghai, China 3Zhejiang University, Hangzhou, China 4National University of Singapore, Singapore
*Corresponding author: wanghuan@westlake.edu.cn

Overview of RewardMap. The framework enhances fine-grained visual understanding and reasoning in MLLMs through reinforcement learning with Group Relative Policy Optimization (GRPO). It consists of two key components: (1) a difficulty-aware reward design, which combines format, correctness, and detail rewards with difficulty-based weighting; and (2) a multi-stage RL curriculum, which schedules training data from simple perception tasks to complex reasoning tasks, ensuring effective optimization under sparse rewards.
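
To make the reward design concrete, here is a minimal sketch of how the three reward terms might be combined with a difficulty weight. The function signature, value ranges, and weighting rule are illustrative assumptions, not the official RewardMap implementation.

    # Illustrative sketch of a difficulty-aware reward (assumed ranges and weights).
    def difficulty_aware_reward(format_ok: bool,
                                correctness: float,   # answer credit, assumed in [0, 1]
                                detail_score: float,  # dense credit for fine-grained details, assumed in [0, 1]
                                difficulty: float,    # per-sample difficulty, assumed in [0, 1]
                                w_detail: float = 0.5) -> float:
        """Combine format, correctness, and detail rewards; weight harder samples more."""
        r_format = 1.0 if format_ok else 0.0
        base = r_format + correctness + w_detail * detail_score
        return (1.0 + difficulty) * base  # difficulty-based weighting (illustrative)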

Abstract

Fine-grained visual reasoning remains a core challenge for multimodal large language models (MLLMs). The recently introduced ReasonMap highlights this gap by showing that even advanced MLLMs struggle with spatial reasoning in structured and information-rich settings such as transit maps, a task of clear practical and scientific importance. However, standard reinforcement learning (RL) on such tasks is impeded by sparse rewards and unstable optimization. To address this, we first construct ReasonMap-Plus, an extended dataset that introduces dense reward signals through Visual Question Answering (VQA) tasks, enabling effective cold-start training of fine-grained visual understanding skills. Next, we propose RewardMap, a multi-stage RL framework designed to improve both visual understanding and reasoning capabilities of MLLMs. RewardMap incorporates two key designs. First, we introduce a difficulty-aware reward design that incorporates detail rewards, directly tackling the sparse-reward problem while providing richer supervision. Second, we propose a multi-stage RL scheme that bootstraps training from simple perception tasks to complex reasoning tasks, offering a more effective cold-start strategy than conventional Supervised Fine-Tuning (SFT). Experiments on ReasonMap and ReasonMap-Plus demonstrate that each component of RewardMap contributes to consistent performance gains, while their combination yields the best results. Moreover, models trained with RewardMap achieve an average improvement of 3.47% across 6 benchmarks spanning spatial reasoning, fine-grained visual reasoning, and general tasks beyond transit maps, underscoring enhanced visual understanding and reasoning capabilities.
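
As a rough illustration of the multi-stage scheme, the sketch below orders RL training data from perception-style questions (dense rewards) to full route-planning questions (sparse rewards). The task-type names and the three-stage split are assumptions for illustration, not the exact schedule used in the paper.

    # Illustrative curriculum builder: schedule data from simple perception to complex reasoning.
    from typing import Dict, List

    def build_curriculum(pool: Dict[str, List[dict]],
                         stages: List[List[str]]) -> List[List[dict]]:
        """pool maps a task type to its samples; stages lists the task types used in each RL stage."""
        return [[sample for task in stage for sample in pool.get(task, [])] for stage in stages]

    # Hypothetical task-type names and stage ordering:
    stages = [
        ["counting", "true_false"],    # stage 1: simple perception, dense reward signal
        ["local_vqa"],                 # stage 2: fine-grained visual understanding
        ["route_planning"],            # stage 3: complex reasoning on full transit maps
    ]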

ReasonMap-Plus Datasets

Overview of ReasonMap-Plus. ReasonMap-Plus comprises 4,018 questions spanning 5 extended question types, built on maps from 30 cities across 13 countries.

Results

Table 1: Evaluations of reference models and fine-tuned models on ReasonMap and ReasonMap-Plus. “S.” represents results for short questions, while “L.” denotes results for long questions. Bold indicates the best results among fine-tuned models, while underline represents the second best.
Table 2: Evaluation of reference models and fine-tuned models on various benchmarks. Bold indicates the best results among fine-tuned models, while underline represents the second best. Footnote markers (e.g., $, *, §) denote results taken from the corresponding technical report or the official Hugging Face repository, while all other results are obtained from our own experiments.

How to use RewardMap for training your model?

Follow the quickstart below to install dependencies, download the datasets, train, and merge your model. If you run into any issues, please open an issue on our GitHub repository and we will help as soon as possible.

  1. Install dependencies
    pip install -r requirements.txt

    If the installation fails, please open an issue on GitHub.

  2. Download the dataset

    You can obtain ReasonMap-Plus (for evaluation) and ReasonMap-Train (for RewardMap training) from Hugging Face, or run the script below; then place all data under the data/ directory.

    python utils/download_dataset.py
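    Alternatively, you can pull the datasets directly with huggingface_hub; the repository IDs below are placeholders, so substitute the actual dataset names on Hugging Face:

      # Manual alternative to utils/download_dataset.py (repo IDs are placeholders).
      from huggingface_hub import snapshot_download

      for repo_id, target in [
          ("<org>/ReasonMap-Plus", "data/ReasonMap-Plus"),    # evaluation set
          ("<org>/ReasonMap-Train", "data/ReasonMap-Train"),  # RewardMap training set
      ]:
          snapshot_download(repo_id=repo_id, repo_type="dataset", local_dir=target)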
  3. Training

    Start RewardMap training with the provided script:

    # RewardMap training
    bash scripts/reward_map.sh

    After training, merge the trained weights:

    # merge trained model
    bash scripts/merge_model.sh
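
    As a quick sanity check of the merged checkpoint, you can load it with Hugging Face transformers. The output path and the Auto* classes below are assumptions; adapt them to your base model:

      # Load the merged model for a quick test (path and model classes are assumptions).
      from PIL import Image
      from transformers import AutoModelForVision2Seq, AutoProcessor

      model_dir = "outputs/rewardmap_merged"  # hypothetical output path of merge_model.sh
      processor = AutoProcessor.from_pretrained(model_dir)
      model = AutoModelForVision2Seq.from_pretrained(model_dir, device_map="auto")

      image = Image.open("data/example_map.png")  # any transit-map image
      question = "Which line should I take from Station A to Station B?"
      inputs = processor(images=image, text=question, return_tensors="pt").to(model.device)
      output_ids = model.generate(**inputs, max_new_tokens=256)
      print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])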

BibTeX

@article{feng2025rewardmap,
  title={RewardMap: Tackling Sparse Rewards in Fine-grained Visual Reasoning via Multi-Stage Reinforcement Learning},
  author={Feng, Sicheng and Tuo, Kaiwen and Wang, Song and Kong, Lingdong and Zhu, Jianke and Wang, Huan},
  journal={arXiv preprint arXiv:2510.02240},
  year={2025}
}