SPECS: Decoupling Learning via Self-distilled Preference-based Cold Start for VLMs

1School of Artificial Intelligence, University of Chinese Academy of Sciences
2MAIS, Institute of Automation, Chinese Academy of Sciences
3Meituan
ICLR 2026
*Equal Contribution   Corresponding Author
SPECS Method Overview

Method Overview. SPECS is a three-stage pipeline: (1) self-distillation to generate preference data, (2) DPO-based pre-alignment as the cold start, and (3) final GRPO fine-tuning for enhanced reasoning.
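
Stage (1) builds the preference data from the model's own outputs. The snippet below is a minimal sketch of this idea under stated assumptions, not the authors' released code: generate_candidates stands in for sampling several responses from the policy itself, and the surface-form checker (a simple test for a <think>...</think> block and a \boxed{} answer) is an assumed, illustrative criterion for picking "chosen" and "rejected" responses.

import re
from typing import Callable, List, Tuple

# Assumed surface-form criteria (illustrative only): reward a response for
# having a <think>...</think> block and a final \boxed{} answer.
def format_score(response: str) -> int:
    score = 0
    if re.search(r"<think>.*?</think>", response, flags=re.DOTALL):
        score += 1
    if "\\boxed{" in response:
        score += 1
    return score

def build_preference_pair(
    prompt: str,
    generate_candidates: Callable[[str, int], List[str]],
    num_candidates: int = 8,
) -> Tuple[str, str, str]:
    """Self-distillation sketch: sample candidates from the policy itself and
    rank them by surface-form criteria (format/structure/style) only."""
    candidates = generate_candidates(prompt, num_candidates)
    ranked = sorted(candidates, key=format_score)
    return prompt, ranked[-1], ranked[0]  # (prompt, chosen, rejected)

# Toy usage with a stand-in generator.
def toy_generator(prompt: str, n: int) -> List[str]:
    return [
        "<think>reason step by step</think> The answer is \\boxed{4}.",
        "4",
    ][:n]

print(build_preference_pair("What is 2+2?", toy_generator, num_candidates=2))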

Abstract

Reinforcement learning (RL) with verifiable rewards has recently catalyzed a wave of "MLLM-r1" approaches that bring RL to vision-language models. Most representative paradigms begin with a cold start, typically supervised fine-tuning (SFT), to initialize the policy before RL. However, an SFT-based cold start entangles the reasoning paradigm with task solutions and output format, which may induce instruction-style overfitting, weaken out-of-distribution generalization, and ultimately hurt downstream RL. We revisit the cold start from two perspectives, its training method and its data construction, and introduce the Generalization Factor (GF) coefficient to quantify generalization under different training methods. Our empirical study finds that preference-based training methods (e.g., DPO) generalize better than SFT-based methods in the cold start. Motivated by this, we propose SPECS, a Self-distilled, Preference-based Cold Start framework that decouples multimodal learning: it (1) generates introspective preference pairs via self-distillation, avoiding reliance on larger teacher models or manual annotation; (2) performs preference-based training that learns shallow, transferable surface-form criteria (format, structure, style) rather than memorizing content; and (3) hands off deep reasoning to RL with verifiable rewards. Experimental results across multiple multimodal benchmarks show that our decoupled learning framework yields consistent performance gains over strong baselines, improving MEGA-Bench by 4.1% and MathVista by 12.2%. Additional experiments indicate that SPECS reduces in-distribution "stuckness," improves exploration, stabilizes training, and raises the performance ceiling.
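
Stage (2) performs the cold start with DPO. For reference, the following is a generic sketch of the standard DPO objective computed from sequence log-probabilities of the policy and a frozen reference model; it is not the authors' training code, and the beta value and shapes are illustrative assumptions.

import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_w | x), shape (batch,)
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_l | x), shape (batch,)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_w | x), shape (batch,)
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_l | x), shape (batch,)
    beta: float = 0.1,                    # illustrative default
) -> torch.Tensor:
    """Standard DPO loss: -log sigmoid(beta * (policy margin - reference margin)),
    where each margin is log p(chosen) - log p(rejected)."""
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy usage: a policy whose margin exceeds the reference margin gets a loss
# below -log sigmoid(0) ~= 0.693.
b = torch.randn(4)
print(dpo_loss(b + 1.0, b, b, b).item())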

Key Contributions

SPECS Framework

A three-stage cold-start strategy that generates preference data through self-distillation, uses DPO for cold-start training, and separates training objectives to provide a stronger starting point for RL.

Generalization Factor

A new metric that evaluates a model's generalization under different cold-start training methods by comparing performance on in-distribution and out-of-distribution tasks; an illustrative computation is sketched at the end of this section.

Decoupling Learning

Shows the importance of separating what is learned in the cold start from what is learned in RL, which improves exploration and reduces the risk of the model getting stuck on in-distribution solutions.

Consistent Performance Gains

Achieves consistent improvements across benchmarks: +4.1% on MEGA-Bench and +12.2% on MathVista over strong baselines.
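
The page does not spell out the exact formula behind the Generalization Factor. Purely to illustrate the idea of comparing in-distribution (ID) and out-of-distribution (OOD) performance, the sketch referenced above computes one assumed instantiation, the ratio of OOD gain to ID gain after a cold-start method is applied; the definition used in the paper may differ.

def generalization_factor(
    id_before: float, id_after: float,
    ood_before: float, ood_after: float,
    eps: float = 1e-8,
) -> float:
    """Assumed, illustrative GF: ratio of out-of-distribution gain to
    in-distribution gain for a given cold-start method. The paper's exact
    definition may differ."""
    id_gain = id_after - id_before
    ood_gain = ood_after - ood_before
    return ood_gain / (id_gain + eps)

# Example: a cold start that lifts ID accuracy 50 -> 60 but OOD only 40 -> 44
# yields GF = 4 / 10 = 0.4; a higher GF indicates better transfer beyond the
# cold-start distribution.
print(generalization_factor(50.0, 60.0, 40.0, 44.0))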

Experimental Results

Our model demonstrates superior performance across general and reasoning-specific multimodal benchmarks.

Analysis

Case Studies

BibTeX

@article{chen2025metis,
  title={Metis-SPECS: Decoupling Multimodal Learning via Self-distilled Preference-based Cold Start},
  author={Chen, Kun and Shi, Peng and Qiu, Haibo and Zeng, Zhixiong and Yang, Siqi and Mao, Wenji and Ma, Lin},
  journal={arXiv preprint arXiv:2510.25801},
  year={2025}
}