RewardMap: Tackling Sparse Rewards in Fine-grained Visual Reasoning via Multi-Stage Reinforcement Learning

2025

1Westlake University, Hangzhou, China 2Tongji University, Shanghai, China 3Zhejiang University, Hangzhou, China 4National University of Singapore, Singapore
*Corresponding author: wanghuan@westlake.edu.cn

Overview of RewardMap. The framework enhances fine-grained visual understanding and reasoning in MLLMs through reinforcement learning with Group Relative Policy Optimization (GRPO). It consists of two key components: (1) a difficulty-aware reward design, which combines format, correctness, and detail rewards with difficulty-based weighting; and (2) a multi-stage RL curriculum, which schedules training data from simple perception tasks to complex reasoning tasks, ensuring effective optimization under sparse rewards.
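
To make the reward design concrete, here is a minimal sketch of how the three reward terms might be combined with a difficulty weight. The function signature, value ranges, and weighting rule are illustrative assumptions, not the official RewardMap implementation.

    # Illustrative sketch of a difficulty-aware reward (assumed ranges and weights).
    def difficulty_aware_reward(format_ok: bool,
                                correctness: float,   # answer credit, assumed in [0, 1]
                                detail_score: float,  # dense credit for fine-grained details, assumed in [0, 1]
                                difficulty: float,    # per-sample difficulty, assumed in [0, 1]
                                w_detail: float = 0.5) -> float:
        """Combine format, correctness, and detail rewards; weight harder samples more."""
        r_format = 1.0 if format_ok else 0.0
        base = r_format + correctness + w_detail * detail_score
        return (1.0 + difficulty) * base  # difficulty-based weighting (illustrative)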

Abstract

Fine-grained visual reasoning remains a core challenge for multimodal large language models (MLLMs). The recently introduced ReasonMap highlights this gap by showing that even advanced MLLMs struggle with spatial reasoning in structured and information-rich settings such as transit maps, a task of clear practical and scientific importance. However, standard reinforcement learning (RL) on such tasks is impeded by sparse rewards and unstable optimization. To address this, we first construct ReasonMap-Plus, an extended dataset that introduces dense reward signals through Visual Question Answering (VQA) tasks, enabling effective cold-start training of fine-grained visual understanding skills. Next, we propose RewardMap, a multi-stage RL framework designed to improve both visual understanding and reasoning capabilities of MLLMs. RewardMap incorporates two key designs. First, we introduce a difficulty-aware reward design that incorporates detail rewards, directly tackling the sparse-reward problem while providing richer supervision. Second, we propose a multi-stage RL scheme that bootstraps training from simple perception tasks to complex reasoning tasks, offering a more effective cold-start strategy than conventional Supervised Fine-Tuning (SFT). Experiments on ReasonMap and ReasonMap-Plus demonstrate that each component of RewardMap contributes to consistent performance gains, while their combination yields the best results. Moreover, models trained with RewardMap achieve an average improvement of 3.47% across 6 benchmarks spanning spatial reasoning, fine-grained visual reasoning, and general tasks beyond transit maps, underscoring enhanced visual understanding and reasoning capabilities.
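
As a rough illustration of the multi-stage scheme, the sketch below orders RL training data from perception-style questions (dense rewards) to full route-planning questions (sparse rewards). The task-type names and the three-stage split are assumptions for illustration, not the exact schedule used in the paper.

    # Illustrative curriculum builder: schedule data from simple perception to complex reasoning.
    from typing import Dict, List

    def build_curriculum(pool: Dict[str, List[dict]],
                         stages: List[List[str]]) -> List[List[dict]]:
        """pool maps a task type to its samples; stages lists the task types used in each RL stage."""
        return [[sample for task in stage for sample in pool.get(task, [])] for stage in stages]

    # Hypothetical task-type names and stage ordering:
    stages = [
        ["counting", "true_false"],    # stage 1: simple perception, dense reward signal
        ["local_vqa"],                 # stage 2: fine-grained visual understanding
        ["route_planning"],            # stage 3: complex reasoning on full transit maps
    ]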

ReasonMap-Plus Datasets

Overview of ReasonMap-Plus. ReasonMap-Plus comprises 4,018 questions spanning 5 extended question types, built on maps from 30 cities across 13 countries.

Results

Table 1: Evaluations of reference models and fine-tuned models on ReasonMap and ReasonMap-Plus. “S.” represents results for short questions, while “L.” denotes results for long questions. Bold indicates the best results among fine-tuned models, while underline represents the second best.
Table 2: Evaluation of reference models and fine-tuned models on various benchmarks. Bold indicates the best results among fine-tuned models, while underline represents the second best. Footnote markers (e.g., $, *, §) denote results taken from the corresponding technical report or the official Hugging Face repository, while all other results are obtained from our own experiments.

How to use RewardMap for training your model?

Follow the quickstart below to install dependencies, download the datasets, train, and merge your model. If you run into any issues, please open an issue on our GitHub repository and we will help as soon as possible.

  1. Install dependencies
    pip install -r requirements.txt

    If the installation fails, please open an issue on GitHub.

  2. Download the dataset

    You can obtain ReasonMap-Plus (for evaluation) and ReasonMap-Train (for RewardMap training) from Hugging Face, or run the script below; then place all data under the data/ directory.

    python utils/download_dataset.py
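    Alternatively, you can pull the datasets directly with huggingface_hub; the repository IDs below are placeholders, so substitute the actual dataset names on Hugging Face:

      # Manual alternative to utils/download_dataset.py (repo IDs are placeholders).
      from huggingface_hub import snapshot_download

      for repo_id, target in [
          ("<org>/ReasonMap-Plus", "data/ReasonMap-Plus"),    # evaluation set
          ("<org>/ReasonMap-Train", "data/ReasonMap-Train"),  # RewardMap training set
      ]:
          snapshot_download(repo_id=repo_id, repo_type="dataset", local_dir=target)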
  3. Training

    Start RewardMap training with the provided script:

    # RewardMap training
    bash scripts/reward_map.sh

    After training, merge the trained weights:

    # merge trained model
    bash scripts/merge_model.sh
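
    As a quick sanity check of the merged checkpoint, you can load it with Hugging Face transformers. The output path and the Auto* classes below are assumptions; adapt them to your base model:

      # Load the merged model for a quick test (path and model classes are assumptions).
      from PIL import Image
      from transformers import AutoModelForVision2Seq, AutoProcessor

      model_dir = "outputs/rewardmap_merged"  # hypothetical output path of merge_model.sh
      processor = AutoProcessor.from_pretrained(model_dir)
      model = AutoModelForVision2Seq.from_pretrained(model_dir, device_map="auto")

      image = Image.open("data/example_map.png")  # any transit-map image
      question = "Which line should I take from Station A to Station B?"
      inputs = processor(images=image, text=question, return_tensors="pt").to(model.device)
      output_ids = model.generate(**inputs, max_new_tokens=256)
      print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])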

BibTeX

@article{feng2025rewardmap,
  title={RewardMap: Tackling Sparse Rewards in Fine-grained Visual Reasoning via Multi-Stage Reinforcement Learning},
  author={Feng, Sicheng and Tuo, Kaiwen and Wang, Song and Kong, Lingdong and Zhu, Jianke and Wang, Huan},
  journal={arXiv preprint arXiv:2510.02240},
  year={2025}
}