Vision-Language-Action (VLA) models have demonstrated remarkable capabilities and strong potential in complex robotic manipulation. However, their large parameter counts and high inference latency hinder real-world deployment, especially on resource-constrained platforms. To address this, we conduct a systematic empirical study of model compression for VLAs. Building on the insights from this study, we present RLRC, a three-stage compression and recovery pipeline consisting of structured pruning, performance recovery via supervised fine-tuning (SFT) and reinforcement learning (RL), and subsequent quantization. The RL stage incorporates a critic warm-up strategy and a behavior cloning (BC) loss regularizer to stabilize training and preserve policy behavior. RLRC achieves up to an 8× memory reduction and a 2.3× inference speedup while maintaining the original task success rate. Extensive experiments across multiple VLA backbones show that RLRC consistently outperforms existing compression baselines, highlighting its effectiveness for on-device deployment.
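To illustrate how a critic warm-up and a behavior cloning (BC) regularizer could fit together in the RL stage, here is a minimal sketch. It assumes an actor-critic setup in which the policy, value, and BC losses are already computed as scalars. The function name, the warm-up length, and the coefficient values are hypothetical and are not taken from the paper.

```python
def training_loss(step, policy_loss, value_loss, bc_loss,
                  warmup_steps=1000, value_coef=0.5, bc_coef=0.1):
    """Combine loss terms for one update step (illustrative only).

    All default hyperparameters are assumptions made for this sketch.
    """
    if step < warmup_steps:
        # Critic warm-up: only the value function is trained, so the
        # policy is not pushed around by an untrained critic.
        return value_coef * value_loss
    # After warm-up: RL objective plus a BC term that penalises
    # drifting away from the demonstrated (SFT) behaviour.
    return policy_loss + value_coef * value_loss + bc_coef * bc_loss
```

During the first `warmup_steps` updates only the value term contributes, which is one simple way to realise a warm-up. After that, `bc_coef` trades off reward maximisation against staying close to the demonstrations.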
We propose RLRC, a three-stage method: (1) we apply structured pruning to the VLA model, specifically targeting the LLM component, to remove redundant structures in a hardware-friendly manner; (2) we employ a performance recovery stage that combines SFT with RL to restore the model's effectiveness on downstream tasks; (3) we introduce optional quantization to further reduce the memory footprint, enabling efficient deployment on resource-constrained robotic platforms.
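Stages (1) and (3) can be sketched on a toy two-layer MLP block. This is not the authors' implementation: the importance score (the L2 norm of each hidden unit's incoming weights), the symmetric per-tensor quantizer, and the 8-bit default are all assumptions chosen for brevity, and the function names are hypothetical. Stage (2), the SFT and RL recovery, is omitted.

```python
import math

def prune_hidden_units(w_in, w_out, keep_ratio):
    """Structured pruning: drop whole hidden units of an MLP block.

    w_in  is hidden x d_in  (row i produces hidden unit i).
    w_out is d_out x hidden (column i consumes hidden unit i).
    Removing unit i deletes a full row and a full column, so the
    result is a smaller dense block (hardware-friendly).
    """
    hidden = len(w_in)
    n_keep = max(1, round(hidden * keep_ratio))
    # Assumed importance score: L2 norm of incoming weights.
    scores = [math.sqrt(sum(v * v for v in row)) for row in w_in]
    ranked = sorted(range(hidden), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:n_keep])  # preserve original unit order
    pruned_in = [w_in[i] for i in keep]
    pruned_out = [[row[i] for i in keep] for row in w_out]
    return pruned_in, pruned_out, keep

def quantize_symmetric(weights, bits=8):
    """Symmetric per-tensor quantization to signed integers."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for row in weights for v in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [[max(-qmax, min(qmax, round(v / scale))) for v in row]
         for row in weights]
    return q, scale

def dequantize(q, scale):
    """Map integer codes back to approximate float weights."""
    return [[v * scale for v in row] for row in q]

# Toy usage: keep half of four hidden units, then quantize.
w_in = [[0.1, 0.0], [2.0, 1.0], [0.0, 0.2], [1.0, -1.0]]
w_out = [[1.0, 2.0, 3.0, 4.0]]
p_in, p_out, kept = prune_hidden_units(w_in, w_out, keep_ratio=0.5)
q_in, scale = quantize_symmetric(p_in)
```

Because pruning removes entire rows and columns rather than individual weights, the pruned block needs no sparse kernels. Quantization then shrinks each remaining weight's storage.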
Main Results. Comparison of task success rates and efficiency metrics.
Training curves of SFT and RL fine-tuning (RLFT).
Post-SFT RLFT vs. RLFT from scratch.
Ablation studies on LIBERO.
Rollout examples from real-world experiments. Video frames are sampled at equal time intervals; OpenVLA is shown in the top row and OpenVLA-RLRC in the bottom row.
Real-world Results.
@article{chen2025rlrc,
title={RLRC: Reinforcement Learning-based Recovery for Compressed Vision-Language-Action Models},
author={Yuxuan Chen and Xiao Li},
journal={arXiv preprint arXiv:2506.17639},
year={2025}
}