SPECS: Decoupling Learning via Self-distilled Preference-based Cold Start for VLMs
Abstract
Reinforcement learning (RL) with verifiable rewards has recently catalyzed a wave of "MLLM-r1" approaches that bring RL to vision-language models. Most representative paradigms begin with a cold start, typically supervised fine-tuning (SFT), to initialize the policy before RL. However, SFT-based cold start imitates a reasoning paradigm in which task solution and output format are intertwined, which may induce instruction-style overfitting, weaken out-of-distribution generalization, and ultimately hurt downstream RL. We revisit the cold start along two axes, its training method and its data construction, and introduce the Generalization Factor (GF) coefficient to quantify generalization capability under different training methods. Our empirical study finds that preference-based training methods (e.g., DPO) generalize better in cold start than SFT-based methods. Motivated by this, we propose SPECS—a Self-distilled, Preference-based Cold Start framework that decouples multimodal learning: it (1) generates introspective preference data pairs via self-distillation, avoiding reliance on larger teacher models or manual annotation; (2) performs preference-based training that focuses on shallow, transferable surface-form criteria (format, structure, style) rather than memorizing content; and (3) hands off to RL with verifiable rewards for deep reasoning. Experimental results across multiple multimodal benchmarks show that our decoupled learning framework yields consistent gains over strong baselines, improving MEGA-Bench by 4.1% and MathVista by 12.2%. Additional experiments indicate that SPECS reduces in-distribution "stuckness," improves exploration, stabilizes training, and raises the performance ceiling.
Key Contributions
SPECS Framework
A three-stage cold-start strategy that generates preference data through self-distillation, trains with DPO during the cold start, and separates training objectives to provide a stronger starting point for RL.
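The preference-based training in stage two is standard DPO. As a minimal illustration of its per-pair objective (the `beta` value and the log-probabilities below are hypothetical, not the paper's):

```python
import math

def dpo_loss(lp_chosen, lp_rejected, ref_lp_chosen, ref_lp_rejected, beta=0.1):
    """Per-pair DPO loss: -log sigmoid(beta * (policy margin - reference margin)).

    lp_* are total log-probabilities of the chosen/rejected responses under the
    policy; ref_lp_* are the same quantities under the frozen reference model.
    """
    margin = (lp_chosen - ref_lp_chosen) - (lp_rejected - ref_lp_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# The loss shrinks as the policy prefers the chosen (well-formatted) response
# more strongly than the reference does.
loss_aligned = dpo_loss(-10.0, -20.0, -15.0, -15.0)  # policy favors chosen
loss_neutral = dpo_loss(-15.0, -15.0, -15.0, -15.0)  # no preference yet: log 2
```

Because the loss depends only on the relative preference margin, training this way can reward surface-form correctness (tags, structure, style) without forcing the model to memorize the chosen content token-by-token, which is the decoupling SPECS relies on.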
Generalization Factor
A new metric that quantifies a model's generalization capability under different cold-start training methods by comparing performance gains on in-distribution and out-of-distribution tasks.
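The paper's exact GF formula is not reproduced on this page; one plausible instantiation, sketched purely to illustrate the idea (the function name and the example scores are hypothetical):

```python
def generalization_factor(id_before, id_after, ood_before, ood_after):
    """Hypothetical GF: ratio of out-of-distribution gain to in-distribution
    gain after a cold start. A higher value means the cold start transfers
    more of what it learned instead of memorizing the training distribution."""
    id_gain = id_after - id_before
    ood_gain = ood_after - ood_before
    return ood_gain / id_gain if id_gain != 0 else float("inf")

# Illustrative numbers (not from the paper): an SFT-style cold start that
# gains a lot in-distribution but little OOD scores a low GF.
gf_sft = generalization_factor(40.0, 50.0, 40.0, 41.0)  # 1.0 / 10.0 = 0.1
gf_dpo = generalization_factor(40.0, 44.0, 40.0, 43.0)  # 3.0 / 4.0  = 0.75
```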
Decoupling Learning
Shows the importance of separating the cold-start and RL phases: decoupling improves exploration and reduces the risk of the model getting stuck on in-distribution solutions.
Consistent Performance Gains
Achieves consistent improvements across benchmarks: +4.1% on MEGA-Bench and +12.2% on MathVista over strong baselines.
Experimental Results
Our model demonstrates superior performance across general and reasoning-specific multimodal benchmarks.
Table 1: Model Performance on MEGA-Bench Core
| Model | Know. | Math | Percept. | Coding | Info. Ex. | Plan. | Science | Metrics | Core |
|---|---|---|---|---|---|---|---|---|---|
| Open-Source General Models | |||||||||
| QwenVL-2-7B | 39.96 | 25.95 | 39.99 | 31.49 | 40.29 | 16.64 | 28.59 | 43.61 | 34.47 |
| QwenVL-2.5-7B | 38.84 | 27.67 | 41.24 | 28.93 | 50.23 | 16.32 | 36.75 | 41.64 | 35.07 |
| InternVL2-8B | 33.94 | 22.08 | 32.15 | 24.70 | 29.13 | 12.17 | 24.61 | 39.96 | 25.96 |
| InternVL2.5-8B | 34.78 | 25.86 | 33.27 | 25.45 | 35.10 | 15.97 | 28.83 | 44.96 | 28.34 |
| InternVL3-8B | 42.76 | 34.85 | 42.76 | 34.05 | 44.84 | 17.10 | 35.21 | 49.60 | 36.02 |
| Llava-OV-7B | 31.37 | 22.11 | 27.64 | 13.90 | 17.07 | 9.16 | 24.38 | 37.31 | 21.36 |
| Kimi-VL-A3B | 37.63 | 27.07 | 39.50 | 22.30 | 40.99 | 22.17 | 33.94 | 46.65 | 34.40 |
| Open-Source Reasoning Models | |||||||||
| R1-Onevision | 29.47 | 20.94 | 28.65 | 23.38 | 43.04 | 12.67 | 26.84 | 42.19 | 27.18 |
| VLAA-Thinking | 38.23 | 28.83 | 40.73 | 28.84 | 44.58 | 17.05 | 36.69 | 45.57 | 34.86 |
| Kimi-VL-A3B-Thinking | 33.45 | 17.76 | 28.11 | 14.69 | 41.14 | 12.64 | 28.60 | 43.97 | 27.08 |
| MM-Eureka-7B | 40.12 | 31.59 | 39.71 | 28.75 | 49.32 | 16.64 | 37.25 | 46.39 | 35.96 |
| VL-Rethinker-7B | 40.65 | 30.08 | 42.02 | 29.87 | 52.03 | 17.83 | 36.82 | 46.90 | 37.25 |
| Orsta-7B | 41.65 | 31.48 | 43.84 | 32.82 | 54.07 | 17.83 | 36.91 | 41.66 | 38.31 |
| Ours-zero | 42.44 | 29.87 | 43.77 | 32.80 | 49.59 | 17.76 | 37.39 | 47.32 | 37.96 |
| Ours-7B (SPECS) | 42.64 | 31.71 | 44.58 | 34.14 | 51.68 | 18.76 | 38.73 | 51.87 | 39.17 |
| Δ (Ours - Backbone) | +3.8 | +4.0 | +3.3 | +5.2 | +1.4 | +2.4 | +2.0 | +10.2 | +4.1 |
Table 2: Performance on Mathematical & General Benchmarks
| Model | MMMU (val) | MathVision | MathVista | MathVerse (vision) | Overall |
|---|---|---|---|---|---|
| QwenVL-2.5-7B (Backbone) | 54.2 | 25.40 | 63.70 | 38.20 | 45.38 |
| Reasoning Models | |||||
| R1-Onevision | 49.67 | 29.90 | 64.1 | 40.0 | 45.92 |
| VLAA-Thinking | 52.67 | 26.40 | 68.00 | 48.20 | 48.82 |
| MM-Eureka-7B | 55.55 | 26.90 | 73.00 | 47.58 | 50.76 |
| VL-Rethinker-7B | 56.7 | 29.70 | 73.60 | 48.98 | 52.25 |
| Orsta-7B | 54.33 | 25.76 | 70.20 | 32.10 | 45.60 |
| Ours-zero | 54.3 | 26.88 | 72.90 | 47.33 | 50.35 |
| Ours-7B (SPECS) | 56.78 | 29.50 | 75.90 | 48.73 | 52.73 |
| Δ (Ours - Backbone) | +2.5 | +4.1 | +12.2 | +10.5 | +7.3 |
Analysis
DPO vs. SFT on ID and OOD Tasks: DPO-based methods achieve better out-of-distribution generalization than SFT.
Training Efficiency & Stability: DPO cold-start leads to faster convergence, higher performance ceiling, and more stable training.
GF vs. Final Performance: Higher Generalization Factor during cold-start correlates with better final model performance.
Table 3: Ablation on Self-Distillation Sources & Data Strategies (values: Cold-Start / Final RL)
| Model | MegaBench | MMMU | MathVista | MathVision | MathVerse | AVG |
|---|---|---|---|---|---|---|
| Qwen-VL-2.5-7B | 35.07 | 54.2 | 63.70 | 25.40 | 38.20 | 43.31 |
| Distillation Comparison | ||||||
| - Qwen32B Distill | 27.04 / 29.87 | 51.44 / 56.67 | 66.90 / 71.50 | 25.53 / 28.03 | 43.53 / 46.07 | 42.89 / 46.43 |
| - Qwen72B Distill | 34.00 / 37.30 | 53.89 / 58.56 | 67.50 / 73.30 | 25.62 / 28.91 | 43.53 / 46.83 | 44.90 / 48.98 |
| - Base model Distill | 35.37 / 37.92 | 53.11 / 56.11 | 67.90 / 74.40 | 25.55 / 28.68 | 43.40 / 46.82 | 45.07 / 48.79 |
| - Self Distill | 37.52 / 39.17 | 54.89 / 56.78 | 72.00 / 75.90 | 25.75 / 29.50 | 46.19 / 48.73 | 47.27 / 50.02 |
| Data Strategy | ||||||
| - Coupled Data | 37.02 / 38.76 | 55.44 / 55.44 | 71.10 / 73.10 | 27.37 / 28.65 | 47.46 / 47.46 | 47.67 / 48.68 |
| - Decoupled Data | 37.52 / 39.17 | 54.89 / 56.78 | 72.00 / 75.90 | 25.75 / 29.50 | 46.19 / 48.73 | 47.27 / 50.02 |
Self-Distilled Preference Data: Example of a preference data pair with chosen (correct format) and rejected (corrupted format) responses.
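One way such a pair can be built, sketched under the assumption that "corrupted format" means breaking the `<think>`/`<answer>` scaffolding of a self-generated response while leaving its content untouched (the helper names and corruption set are illustrative, not the paper's):

```python
import random

def corrupt_format(response: str, rng: random.Random) -> str:
    """Produce a 'rejected' response by damaging only the surface format."""
    corruptions = [
        lambda s: s.replace("<think>", "").replace("</think>", ""),    # drop think tags
        lambda s: s.replace("<answer>", "").replace("</answer>", ""),  # drop answer tags
        lambda s: s.replace("</think>", "</think></think>"),           # duplicate a tag
    ]
    return rng.choice(corruptions)(response)

def make_preference_pair(response: str, seed: int = 0) -> dict:
    """Chosen = the model's own well-formatted output; rejected = a
    format-corrupted copy. Content is identical, so DPO can only learn
    the surface-form criteria, not the answer itself."""
    rng = random.Random(seed)
    return {"chosen": response, "rejected": corrupt_format(response, rng)}

pair = make_preference_pair("<think>2 + 2 = 4</think><answer>4</answer>")
```

Because both sides of the pair carry the same reasoning content, preferring "chosen" over "rejected" teaches format and structure only, leaving the deep-reasoning signal to the later RL stage.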
Case Studies
QA Case Analysis: Case #001
Question:
Hint: Please answer the question requiring an integer answer and provide the final value, e.g., 1, 2, 3, at the end.
Question: Subtract all yellow matte blocks. Subtract all tiny brown cylinders. How many objects are left?
Ground Truth: 5
Qwen2.5-VL-7B response: 6
<think> To solve the problem, first identify all objects in the image:
1. Yellow matte cylinder
2. Green metallic cylinder (left)
3. Green metallic cylinder (front)
4. Silver metallic cylinder
5. Tiny brown cylinder
6. Yellow matte block
7. Red matte sphere
Next, follow the instructions:
1. Subtract all yellow matte blocks: Removing the yellow matte block leaves 6 objects.
2. Subtract all tiny brown cylinders: Removing the tiny brown cylinder leaves 5 objects.
Thus, the remaining objects are 5.
</think>
<answer> The answer is \(\boxed{5}\)</answer>
To solve this, we first identify all objects in the image: there are 8 objects in total. The yellow matte block is one object, and the tiny brown cylinder is another. Subtracting these two objects leaves us with \(8 - 1 - 1 = 6\) objects. So, the number of objects left is 6.
QA Case Analysis: Case #002
Question:
Hint: Please answer the question requiring an integer answer and provide the final value, e.g., 1, 2, 3, at the end.
Question: Several people compared how many Web pages they had visited. What is the mean of the numbers?
Ground Truth: 64
Qwen2.5-VL-7B response: 67.75
<think> To find the mean of the numbers, first sum all the numbers and then divide by the count of numbers.
The numbers are: 84, 56, 70, 68, 60, 47, 75, 52.
Sum of the numbers:
\[ 84 + 56 + 70 + 68 + 60 + 47 + 75 + 52 \]
Calculating step by step:
\[ 84 + 56 = 140 \]
\[ 140 + 70 = 210 \]
\[ 210 + 68 = 278 \]
\[ 278 + 60 = 338 \]
\[ 338 + 47 = 385 \]
\[ 385 + 75 = 460 \]
\[ 460 + 52 = 512 \]
The sum of the numbers is 512.
Next, count the number of values:
There are 8 values.
The mean is the sum divided by the count:
\[ \text{Mean} = \frac{512}{8} = 64 \]
So, the mean of the numbers is 64.
</think>
<answer> The answer is $\boxed{64}$</answer>
To find the mean of the numbers, we need to follow these steps:
1. Add up all the numbers in the list.
2. Count how many numbers there are.
3. Divide the sum by the count.
Let's start with step 1: Add up all the numbers. The numbers are: 84, 56, 70, 68, 60, 47, 75, 52.
Sum = 84 + 56 + 70 + 68 + 60 + 47 + 75 + 52
Sum = 542
...
Mean = 542 / 8
Mean = 67.75
So, the mean of the numbers is 67.75.
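The backbone's error in Case #002 is purely arithmetic: the listed numbers sum to 512, not 542, which a two-line check confirms:

```python
pages = [84, 56, 70, 68, 60, 47, 75, 52]  # values read from the chart
total = sum(pages)                         # 512, not the 542 in the faulty response
mean = total / len(pages)                  # 64.0, matching the ground truth
```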
QA Case Analysis: Case #003
Question:
Hint: Please answer the question and provide the correct option letter, e.g., A, B, C, D, at the end.
Question: At a certain moment, there is a passenger ship at sea point P, and lighthouse A is measured in the direction 30.0° north by east of P, 50.0 nautical miles away. The passenger ship sails at a speed of 60.0 nautical miles/hour in the direction 60.0° north by west for \(\frac{2}{3}\) hours to reach point B; then tan\(\angle\)BAP =
Choices:
(A) \(\frac{4}{5}\)
(B) \(\frac{6}{5}\)
(C) \(\frac{\sqrt{5}}{5}\)
(D) \(\frac{2\sqrt{5}}{5}\)
Ground Truth: (A) \(\frac{4}{5}\)
Qwen2.5-VL-7B response: (B) \(\frac{6}{5}\)
<think> First, we need to determine the coordinates or distances relevant to the problem. The initial position of the passenger ship is at point \( P \), and it is 50 nautical miles away from lighthouse \( A \) in the direction 30.0 degrees north by east. The passenger ship then sails in the direction 60.0 degrees from north by west for \(\frac{2}{3}\) hours at a speed of 60 nautical miles per hour, reaching point \( B \).
The distance \( PB \) can be calculated as follows:
\[ PB = 60 \times \frac{2}{3} = 40 \text{ nautical miles} \]
The direction 60.0 degrees from north by west means that the angle \( \angle BPA = 30.0^\circ + 60.0^\circ = 90.0^\circ \). Therefore, \( \triangle ABP \) is a right triangle with \( \angle BAP = 30.0^\circ \).
In a right triangle, the tangent of an angle is the ratio of the opposite side to the adjacent side. Here, \( \tan \angle BAP = \frac{BP}{AP} \).
Since \( AP = 50 \) nautical miles and \( PB = 40 \) nautical miles, we can use the Pythagorean theorem to verify:
\[ AB = \sqrt{AP^2 + BP^2} = \sqrt{50^2 + 40^2} = \sqrt{2500 + 1600} = \sqrt{4100} = 10\sqrt{41} \]
However, we only need the ratio \( \frac{BP}{AP} \):
\[ \tan \angle BAP = \frac{BP}{AP} = \frac{40}{50} = \frac{4}{5} \]
Thus, the answer is \( \boxed{A} \).</think>
<answer> The answer is \(\boxed{A}\)</answer>
To solve this problem, we need to determine the coordinates of points A and B relative to point P, and then find the tangent of the angle $\angle$BAP.
1. Determine the coordinates of point A relative to point P:
...
2. Determine the coordinates of point B relative to point P:
...
3. Calculate the tangent of $\angle$BAP:
...
- Simplifying the expression, we get \(\frac{625\sqrt{3} - 500 - 1250\sqrt{3} - 1500}{625\sqrt{3} + 1500 - 625\sqrt{3} + 500} = \frac{-925\sqrt{3} - 1000}{2000} = \frac{-925\sqrt{3} - 1000}{2000} = \frac{-925\sqrt{3} - 1000}{2000} = \frac{-925\sqrt{3} - 1000}{2000} = \frac{6}{5}\).
Therefore, the correct answer is \(\boxed{B}\).
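The correct geometry in Case #003 can be verified numerically: placing P at the origin with north along +y, A lies 50 nautical miles away in direction N30°E and B lies 40 miles away (60 kn × 2/3 h) in direction N60°W. The check below confirms both the right angle at P and tan∠BAP = 4/5:

```python
import math

# P at the origin; x-axis = east, y-axis = north.
A = (50 * math.sin(math.radians(30)), 50 * math.cos(math.radians(30)))   # N30°E, 50 nmi
B = (-40 * math.sin(math.radians(60)), 40 * math.cos(math.radians(60)))  # N60°W, 40 nmi

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

# Angle at P between PA and PB: should be 90 degrees.
angle_APB = math.degrees(
    math.acos(dot(A, B) / (math.hypot(*A) * math.hypot(*B))))

# Angle at A between AP and AB: its tangent should be PB/AP = 40/50.
AP = (-A[0], -A[1])
AB = (B[0] - A[0], B[1] - A[1])
angle_BAP = math.acos(dot(AP, AB) / (math.hypot(*AP) * math.hypot(*AB)))
tan_BAP = math.tan(angle_BAP)  # 0.8 = 4/5, so option (A) is correct
```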
BibTeX
@article{chen2025metis,
title={Metis-SPECS: Decoupling Multimodal Learning via Self-distilled Preference-based Cold Start},
author={Chen, Kun and Shi, Peng and Qiu, Haibo and Zeng, Zhixiong and Yang, Siqi and Mao, Wenji and Ma, Lin},
journal={arXiv preprint arXiv:2510.25801},
year={2025}
}