We present GuidedVLA, a VLA paradigm in which the action decoder is explicitly guided to capture task-relevant information such as object grounding, spatial geometry, and temporal skill logic. Across simulation and real-robot experiments, GuidedVLA significantly improves success rates in both in-domain and out-of-domain settings, demonstrating the effectiveness of equipping action-decoder attention heads with explicit, factor-specific guidance.
Architecture of GuidedVLA. We introduce explicit, structured guidance into the multi-head attention layers of the VLA action decoder. Instead of relying on implicitly entangled representations, we repurpose dedicated attention heads to specialize in distinct task-relevant factors: (i) an Object Head, whose attention maps are supervised via ℒobject to explicitly ground task-relevant objects and suppress distractors; (ii) a Skill Head, whose internal feature representations are aligned with temporal skill phases (e.g., Pick → Place) through an auxiliary classification loss ℒskill; (iii) a Depth Head, which injects geometric cues by cross-attending only to features from a depth encoder. Together, these guidance signals make the policy explicitly aware of spatial, temporal, and geometric structure.
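Putting the three signals together, the training objective plausibly takes a form like the following (the weights λ and the name ℒaction are our placeholder notation, not taken from the source; the Depth Head contributes no loss term because its guidance is purely architectural):

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{action}} + \lambda_{\text{object}}\,\mathcal{L}_{\text{object}} + \lambda_{\text{skill}}\,\mathcal{L}_{\text{skill}}$$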
Object Head: Supervises attention maps to explicitly ground task-relevant objects and suppress distractors via an attention-mask alignment loss (ℒobject). Critical for precise localization of transparent/refractive objects and small targets.
Key insight: Forces action tokens to attend to semantically meaningful regions rather than incidental visual contrast.
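One plausible instantiation of the attention-mask alignment loss is to maximize the attention mass that action tokens place on visual tokens covered by a task-relevant object mask. This is a sketch under our own assumptions (binary token-level masks, a log-mass objective), not the paper's exact formulation:

```python
import torch

def object_attention_loss(attn, object_mask, eps=1e-8):
    """Hypothetical attention-mask alignment loss for the Object Head.

    attn:        (B, T_action, T_visual) attention weights from action
                 tokens to visual tokens (each row sums to 1).
    object_mask: (B, T_visual) binary mask, 1 where a visual token
                 overlaps a task-relevant object, 0 for distractors.
    Rewards attention mass on object tokens; everything else is
    implicitly penalized because the rows are normalized.
    """
    mask = object_mask.unsqueeze(1).float()            # (B, 1, T_visual)
    on_object = (attn * mask).sum(-1).clamp(min=eps)   # mass on object tokens
    return -on_object.log().mean()                     # maximize log-mass

# toy usage: attention concentrated on the masked tokens gives a low loss
attn = torch.softmax(torch.tensor([[[4.0, 4.0, -4.0, -4.0]]]), dim=-1)
mask = torch.tensor([[1, 1, 0, 0]])
loss = object_attention_loss(attn, mask)
```

Attention focused on distractor tokens instead would drive the same loss sharply upward, which is the mechanism that suppresses incidental visual contrast.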
Skill Head: Aligns internal feature representations with temporal skill phases (e.g., Pick → Place) through an auxiliary classification loss (ℒskill). Prevents stage-skipping in multi-step behaviors.
Key insight: Encodes temporal intent progression to maintain stage awareness across extended horizons.
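A minimal sketch of the auxiliary phase classification that ℒskill could correspond to: a small classifier on the Skill Head's pooled feature, trained with cross-entropy against ground-truth phase labels. The phase set, feature dimension, and pooling are our assumptions:

```python
import torch
import torch.nn as nn

class SkillPhaseHead(nn.Module):
    """Hypothetical auxiliary classifier: maps the Skill Head's pooled
    feature to logits over temporal skill phases (e.g. Reach, Pick,
    Transport, Place). Cross-entropy on these logits plays the role
    of the L_skill term; all dimensions are illustrative."""

    def __init__(self, feat_dim=64, num_phases=4):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, num_phases)

    def forward(self, skill_feat):              # (B, feat_dim)
        return self.classifier(skill_feat)      # (B, num_phases)

# toy usage: auxiliary loss on a batch of pooled Skill Head features
torch.manual_seed(0)
head = SkillPhaseHead()
feats = torch.randn(8, 64)                      # pooled Skill Head features
phase_labels = torch.randint(0, 4, (8,))        # ground-truth phase per step
skill_loss = nn.functional.cross_entropy(head(feats), phase_labels)
```

Because the label sequence is monotone within an episode (Reach before Pick before Place), supervising it gives the decoder an explicit notion of where it is in the task, which is what blocks stage-skipping.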
Depth Head: Injects explicit 3D spatial information by constraining dedicated attention heads to process only features from a frozen depth encoder (Depth Anything 3).
Key insight: Provides metric geometric reasoning for sub-centimeter precision tasks where monocular RGB cues are insufficient.
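The depth-restricted head can be sketched as standard scaled dot-product cross-attention whose keys and values are projected only from depth-encoder tokens, so the head can inject geometric cues but cannot attend to RGB tokens at all. Shapes and projection matrices here are illustrative, not the paper's:

```python
import torch
import torch.nn.functional as F

def depth_cross_attention(action_tokens, depth_feats, wq, wk, wv):
    """Hypothetical Depth Head: queries come from the action decoder,
    while keys/values are computed ONLY from frozen depth-encoder
    features (e.g. a Depth Anything 3 backbone).

    action_tokens: (B, T_a, D) action-decoder queries
    depth_feats:   (B, T_d, D) frozen depth-encoder tokens
    wq/wk/wv:      (D, D) projection matrices (illustrative)
    """
    q = action_tokens @ wq
    k = depth_feats @ wk
    v = depth_feats @ wv
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    return F.softmax(scores, dim=-1) @ v        # (B, T_a, D)

torch.manual_seed(0)
D = 32
out = depth_cross_attention(
    torch.randn(2, 5, D), torch.randn(2, 49, D),
    torch.randn(D, D), torch.randn(D, D), torch.randn(D, D))
```

Restricting the key/value source is what makes the guidance architectural: no auxiliary loss is needed, because the head simply has nothing but geometry to look at.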
GuidedVLA achieves significant performance gains across simulation benchmarks and real-world platforms, with particularly strong improvements under distribution shifts.
The proposed model achieves the highest average success rate (75.4%), a 7.2-point boost over its base model π0 (68.2%). Notably, single-head ablations reveal task-specific alignment: the object head is strongest among single-head variants on the Object and Long suites, the skill head gives the best single-head result on the Goal suite, and the depth head performs best on the Spatial suite.
Success rates across seven perturbation dimensions (Camera through Layout) and four task suites (Spatial through Long).

| Model | Camera | Robot | Language | Light | Backg. | Noise | Layout | Spatial | Object | Goal | Long | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OpenVLA | 0.8 | 3.5 | 23.0 | 8.1 | 34.8 | 15.2 | 28.5 | 19.4 | 14.0 | 15.1 | 14.3 | 15.6 |
| OpenVLA-OFT | 56.4 | 31.9 | 79.5 | 88.7 | 93.3 | 75.8 | 74.2 | 84.0 | 66.5 | 63.0 | 66.4 | 69.6 |
| NORA | 2.2 | 37.0 | 65.1 | 45.7 | 58.6 | 12.8 | 62.1 | 47.6 | 34.4 | 38.8 | 36.3 | 39.0 |
| WorldVLA | 0.1 | 27.9 | 41.6 | 43.7 | 17.1 | 10.9 | 38.0 | 32.5 | 28.6 | 31.8 | 8.2 | 25.0 |
| UniVLA | 1.8 | 46.2 | 69.6 | 69.0 | 81.0 | 21.2 | 31.9 | 55.5 | 36.7 | 40.7 | 39.9 | 43.9 |
| π0-FAST | 65.1 | 21.6 | 61.0 | 73.2 | 73.2 | 74.4 | 68.8 | 74.4 | 72.7 | 57.5 | 43.4 | 61.6 |
| RIPT-VLA | 55.2 | 31.2 | 77.6 | 88.4 | 91.6 | 73.5 | 74.2 | 85.8 | 64.3 | 58.0 | 67.5 | 68.4 |
| DreamVLA | 65.0 | 40.9 | 63.5 | 85.7 | 82.7 | 85.0 | 74.0 | 79.7 | 79.0 | 61.7 | 59.8 | 69.9 |
| AdaMoE | 53.8 | 17.5 | 20.6 | 73.7 | 73.8 | 58.6 | 65.8 | 51.0 | 57.9 | 53.3 | 38.1 | 50.1 |
| Spatial Forcing | 20.1 | 13.4 | 40.9 | 29.1 | 33.4 | 25.7 | 39.3 | 52.9 | 31.0 | 28.2 | 5.4 | 29.1 |
| VLA-Adapter | 36.2 | 37.9 | 74.6 | 70.6 | 76.1 | 58.0 | 69.7 | 85.0 | 46.3 | 56.0 | 50.4 | 59.1 |
| π0 | 62.3 | 39.8 | 63.1 | 86.0 | 82.8 | 82.4 | 69.6 | 77.7 | 74.1 | 61.4 | 60.1 | 68.2 |
| w/ object head | 71.7 | 45.8 | 63.5 | 92.4 | 86.9 | 85.1 | 77.4 | 80.6 | 82.5 | 67.1 | 64.0 | 73.4 |
| w/ skill head | 70.0 | 45.0 | 61.7 | 90.2 | 83.0 | 88.4 | 76.3 | 79.8 | 78.9 | 68.9 | 62.7 | 72.5 |
| w/ depth head | 68.1 | 43.9 | 65.8 | 90.7 | 83.4 | 85.6 | 72.8 | 81.4 | 79.0 | 65.4 | 61.8 | 71.7 |
| w/ all heads (Ours) | 73.7 | 51.4 | 62.6 | 94.6 | 89.0 | 85.2 | 79.9 | 84.0 | 80.9 | 70.8 | 66.2 | 75.4 |
RoboTwin 2.0 Benchmark Performance. Success rates across 8 manipulation tasks comparing the π0 baseline, single-head experts, and our full model. While specific heads excel at aligned tasks (e.g., depth head for geometry-heavy Beat Hammer Block), the full model (purple) integrates these capabilities to achieve the best overall average performance (90.63%).
Higher Factor Quality Leads to Better Task Performance. Top: Quantitative analysis on the LIBERO-Plus layout perturbation track shows that improving the quality of each specialized head consistently boosts success rates. (a) Object Head: as the proportion of attention focused on task-relevant object regions increases, success rises from 61.3% to 74.6%, highlighting the importance of precise object-centric attention. (b) Skill Head: higher skill-recognition accuracy, measured by a linear probe, correlates with improved performance (66.2% to 72.9%), indicating that better temporal understanding enhances control. (c) Depth Head: increasing the ratio of true depth features (versus noise) dramatically improves both qualitative depth estimation and quantitative success (15.6% to 76.7%), confirming that explicit 3D cues are critical for robust manipulation. Bottom: Qualitative visualizations show how changes along the x-axis metrics are reflected in the corresponding feature representations.
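The skill-recognition accuracy in (b) is measured with a linear probe on frozen features. A minimal version of that measurement, with synthetic features standing in for Skill Head activations, might look like this (the least-squares probe and the evaluation-on-training-data shortcut are our simplifications):

```python
import numpy as np

def linear_probe_accuracy(feats, labels, num_classes):
    """Fit a least-squares linear probe on frozen features and report
    its classification accuracy (evaluated on the fitting data for
    brevity) -- a common proxy for how linearly decodable the skill
    phases are from the representation."""
    one_hot = np.eye(num_classes)[labels]           # (N, C) targets
    w, *_ = np.linalg.lstsq(feats, one_hot, rcond=None)
    preds = (feats @ w).argmax(axis=1)
    return (preds == labels).mean()

# synthetic features: each phase occupies its own direction plus a bit
# of noise, so the probe should recover the labels near-perfectly
rng = np.random.default_rng(0)
labels = rng.integers(0, 4, size=200)
feats = np.eye(4)[labels] + 0.05 * rng.standard_normal((200, 4))
acc = linear_probe_accuracy(feats, labels, num_classes=4)
```

The x-axis of panel (b) is exactly this kind of accuracy: higher values mean the Skill Head's features carry cleaner phase information, which the plot shows translating into higher success rates.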
Cross-Platform Real-World Generalization. Success rates (n = 20 trials per task) across three generalization settings on the ALOHA and PSI-Bot platforms. Our method consistently outperforms the baseline in every setting (up to a 52.7% relative gain in average success rate), demonstrating robustness under challenging out-of-domain conditions. Tasks 1–6 correspond to: (1) pick up fruits and vegetables, (2) stack the bowls, (3) clean the tabletop, (4) place beaker in heating mantle, (5) stack beakers, and (6) heat beaker. The In-Domain setting includes variations in object positions within the training distribution.
| Generalization Setting | Method | Task 1 | Task 2 | Task 3 | Task 4 | Task 5 | Task 6 | Average (%) |
|---|---|---|---|---|---|---|---|---|
| In-Domain† | Base Policy | 10/20 | 11/20 | 9/20 | 12/20 | 12/20 | 13/20 | 55.8 |
| In-Domain† | Ours | 14/20 | 15/20 | 14/20 | 16/20 | 17/20 | 15/20 | 75.8 |
| Scene | Base Policy | 7/20 | 8/20 | 6/20 | 12/20 | 11/20 | 9/20 | 44.2 |
| Scene | Ours | 13/20 | 12/20 | 11/20 | 15/20 | 16/20 | 14/20 | 67.5 |
| Lighting | Base Policy | 11/20 | 9/20 | 10/20 | 14/20 | 12/20 | 13/20 | 57.5 |
| Lighting | Ours | 13/20 | 16/20 | 15/20 | 17/20 | 18/20 | 16/20 | 79.2 |

Tasks 1–3 run on the ALOHA AgileX platform; Tasks 4–6 run on the PSI-Bot RealMan platform.
Demonstrations of GuidedVLA executing complex long-horizon tasks across different domains.
If you find GuidedVLA useful in your research, please cite:
@misc{jia2026guidedvla,
title = {GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization},
author = {Xiaosong Jia and Bowen Yang and Zuhao Ge and Xian Nie and Yuchen Zhou and Cunxin Fan and Yufeng Li and Yilin Chai and Chao Jing and Zijian Liang and Qingwen Bu and Haidong Cao and Chao Wu and Qifeng Li and Zhenjie Yang and Chenhe Zhang and Hongyang Li and Zuxuan Wu and Junchi Yan and Yu-Gang Jiang},
year = {2026},
eprint = {2605.12369},
archivePrefix = {arXiv},
primaryClass = {cs.RO},
url = {https://arxiv.org/abs/2605.12369}
}