Action chunking has become a go-to technique in both robotics and autonomous driving. In this post, I’ll give an overview of the recent literature, explain why action chunking works so well, and discuss some of its important limitations.
Setting the Scene
What is action chunking? Instead of predicting just the next action, a policy predicts a chunk of future actions—say, the next 4, 8, or 16 timesteps at once. During deployment, these chunks are typically executed using a receding-horizon approach—we start executing the chunk but replan before it’s fully completed, similar to model predictive control (MPC). This gives us the benefits of action chunks while still being able to react to changes in the environment.
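To make the receding-horizon execution concrete, here is a minimal sketch of the deployment loop. It assumes a hypothetical `policy.predict_chunk` method and a gym-style `env`; the chunk length and replanning interval are arbitrary placeholder values.

```python
CHUNK_LEN = 16      # actions predicted per policy call
REPLAN_EVERY = 8    # actions executed before replanning (receding horizon)

def run_receding_horizon(policy, env, episode_len=500):
    """Predict a chunk of actions, execute part of it, then replan (MPC-style)."""
    obs = env.reset()
    t = 0
    while t < episode_len:
        # Predict a whole chunk of future actions from the current observation.
        chunk = policy.predict_chunk(obs, horizon=CHUNK_LEN)  # shape: (CHUNK_LEN, action_dim)

        # Only execute the first REPLAN_EVERY actions, then replan from the new state,
        # so the policy can still react to changes in the environment.
        for action in chunk[:REPLAN_EVERY]:
            obs, reward, done, info = env.step(action)
            t += 1
            if done or t >= episode_len:
                return
```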
While other architectures work too, action chunking is most commonly implemented using Diffusion or Flow Matching1 policies. These generative models are stable to train, highly expressive, and naturally handle the continuous, multi-modal action distributions that arise in real-world control.
For a short primer on Offline RL, see my post on Offline-to-Online RL.
Why is Action Chunking so Good?
Action chunking solves several fundamental problems. Let’s go through them one by one.
Horizon Reduction
In both robotics and autonomous driving, control typically runs at high frequency (30-50Hz). Even a short 10-second maneuver becomes 300-500 timesteps. These long horizons create serious training challenges, especially for offline RL2.
The problem is that the temporal difference (TD) update used to learn Q doesn’t scale well to long horizons:
- Slow credit assignment: Reward information propagates backward only one step at a time, so training takes a long time.
- Error accumulation: With function approximation, each backward step introduces some error. Over hundreds of steps, these errors compound. And since we’re stuck with offline data, we can’t self-correct by trying things in the environment.
A common fix is n-step backups: instead of propagating reward information one step back, propagate it $n$ steps at once. The extreme case, where $n$ equals the length of the trajectory, is Monte Carlo estimation: there is no bootstrapping from the value function at all, and the reward gained in the rollout is used directly as the estimate for the value of $a_t$ in $s_t$. n-step backups speed up training because reward information propagates faster. They also reduce error accumulation, though at the expense of higher variance. Unfortunately, they introduce a new problem: off-policy bias3.
The problem is that n-step backups pass rewards backward along the actions in the dataset, not the actions our policy would take4. But when this data comes from a suboptimal policy, which is usually the case in offline RL, our optimized policy would have taken different actions, and the n-step values are wrong, i.e. biased.
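Written out (in generic notation, not necessarily that of the cited papers), the n-step target bootstraps after $n$ steps of rewards that were generated by the dataset's actions:

$$
y_t^{(n)} \;=\; \sum_{k=0}^{n-1} \gamma^k\, r_{t+k} \;+\; \gamma^n\, Q\big(s_{t+n}, a_{t+n}\big), \qquad a_{t+n} \sim \pi(\cdot \mid s_{t+n}).
$$

Every reward $r_{t+k}$ with $k \ge 1$ was produced by the dataset actions $a_{t+1}, \dots, a_{t+n-1}$, and this is exactly where the bias enters when $\pi$ would have acted differently.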
This is where action chunking shines. If we treat the entire chunk as a single action and learn $Q(s_t, a_{t:t+h})$, there's no off-policy bias. Sure, this might not be the chunk our policy would have chosen, but the learned value for that chunk is still correct.
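Concretely, the chunked TD target (same notational caveat as above, with chunk length $h$) looks like:

$$
Q\big(s_t, a_{t:t+h}\big) \;\leftarrow\; \sum_{k=0}^{h-1} \gamma^k\, r_{t+k} \;+\; \gamma^h\, Q\big(s_{t+h}, a_{t+h:t+2h}\big), \qquad a_{t+h:t+2h} \sim \pi(\cdot \mid s_{t+h}).
$$

The dataset actions $a_{t+1}, \dots, a_{t+h-1}$ now sit inside the Q-function's argument instead of being silently attributed to $\pi$, so the learned value is the correct value for exactly the chunk that was taken.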
Reduced Inference Frequency
Running the policy less often has several advantages—some obvious, some subtle.
Practical necessity: With large VLA models, inference can be slower than the frequency at which the robot/car needs to be controlled. You simply can’t run the model for every action. Action chunking provides a natural solution: predict a chunk, start executing it, and compute the next chunk in parallel. When the new prediction is ready, switch over.
Of course, naively switching chunks would create jerky motions, but there are techniques to smooth the transitions, e.g. RTC5, Training-Time RTC6, and VLASH7. These methods, in one way or another, interpolate between chunks and make sure newer chunks are compatible with older ones up to the “switch-over” time.
Reduced compounding errors: Policies trained on expert demonstrations suffer from covariate shift when deployed in closed loop. Covariate shift is a nasty feedback loop where small mistakes lead to unfamiliar states (i.e. states not in the expert data), which lead to bigger mistakes, which lead to even more unfamiliar states, and so on. By querying the policy less often (once per chunk instead of every timestep), we give this feedback loop fewer opportunities to spiral out of control.
Smoother motion: When the policy commits to a chunk rather than reconsidering at every timestep, the resulting behavior tends to be more fluid, as the policy has fewer opportunities to change its high-level plan.
Better Handling of Non-Markovian Data
Human demonstrations are messy. Especially in robotics, demonstrators pause arbitrarily, hesitate, or take breaks mid-task. A standard Markovian policy, one that maps the current state to an action, might therefore learn that “in this state, do nothing”, because that’s what the human did. The problem is that it has no way to know how long to pause, so it might get stuck forever.
A chunked policy handles this better. It can learn to “pause for a bit and then continue”. It’s not perfect, but it’s much better than freezing indefinitely.
Forcing the Policy to Actually Look at the Scene
In autonomous driving, when learning the policy directly from data, there’s a deceptively easy way to get a relatively low imitation learning loss: just repeat the previous action (e.g. acceleration and steering). This works surprisingly well because driving is temporally smooth: when you’re going straight, you usually keep going straight; when you’re turning, you usually keep turning. Only at a few timesteps do you actually need to change what you’re doing.
The problem is that a “repeat last action” policy never really learns to understand the scene. It’s a shortcut that happens to work most of the time.
Action chunking breaks this shortcut. When the policy must predict 2-6 seconds of future actions, simply repeating the last action produces obviously wrong trajectories. The policy is forced to actually reason about the scene (where the lanes are, what other vehicles are doing, what the traffic signals say) to produce a coherent plan. RL can also address this problem, but action chunking gives us this benefit already during imitation learning.
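As a toy illustration (synthetic numbers invented for this post, not from any real driving dataset), compare the error of a copy-last-action baseline one step ahead versus over a 3-second chunk, on a steering profile that is constant except for a single turn:

```python
import numpy as np

# Toy steering profile at 10 Hz: drive straight, take a 2-second turn, go straight again.
dt, duration_s = 0.1, 30.0
t = np.arange(0, duration_s, dt)
steer = np.where((t > 20) & (t < 22), 0.3, 0.0)   # rad; constant except during the turn

# One-step baseline: predict a_t by copying a_{t-1}.
one_step_err = np.mean((steer[1:] - steer[:-1]) ** 2)

# Chunk baseline: predict the next 3 s (30 actions) by repeating a_{t-1}.
H = 30
chunk_errs = [np.mean((steer[i:i + H] - steer[i - 1]) ** 2)
              for i in range(1, len(steer) - H)]
chunk_err = np.mean(chunk_errs)

print(f"copy-last-action error, one step ahead: {one_step_err:.5f}")  # tiny: only the turn boundaries hurt
print(f"copy-last-action error, 3 s chunk:      {chunk_err:.5f}")     # much larger: the shortcut misses the turn
```

Going one step ahead, the shortcut is only wrong at the two instants where the steering changes; over a whole chunk, it is wrong for every window that overlaps the turn.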
Better Exploration
During online RL fine-tuning, the policy needs to explore to discover better behaviors. But what does “exploration” even mean at 50Hz control? Adding random noise to individual actions just produces jittery versions of the same trajectories—you’re not actually trying anything new.
Exploration with action chunks is qualitatively different. Sampling a different chunk means committing to a different plan, not just adding noise. This leads to temporally coherent exploration that can actually discover meaningfully different behaviors.
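A rough way to see this (a toy 1-D integrator, not a real robot): i.i.d. per-step noise largely cancels out over a chunk, whereas committing to a perturbed plan, crudely modeled here as one offset held for the whole chunk, moves the final state much further:

```python
import numpy as np

rng = np.random.default_rng(0)
H, sigma, n_rollouts = 50, 0.1, 10_000   # chunk length, noise scale, Monte Carlo samples

# Final position after H steps of x_{k+1} = x_k + a_k, with nominal plan a_k = 0.
per_step  = rng.normal(0, sigma, size=(n_rollouts, H)).sum(axis=1)   # fresh noise every step
per_chunk = rng.normal(0, sigma, size=n_rollouts) * H                # one perturbation, held for the chunk

print(f"spread of final state, per-step noise : {per_step.std():.2f}")   # ~ sigma * sqrt(H) ≈ 0.7
print(f"spread of final state, per-chunk noise: {per_chunk.std():.2f}")  # ~ sigma * H       ≈ 5.0
```

The per-step random walk only spreads out like $\sigma\sqrt{H}$, while a committed chunk-level perturbation spreads like $\sigma H$, so it actually reaches meaningfully different states.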
There is, of course, a whole zoo of other RL methods that also tackle this exploration problem, but last time I checked, they all required significant setup and hyperparameter tuning.
Problems of Action Chunking
Of course, action chunking has its downsides. Here are the two most important ones.
The Stochasticity Problem
Action chunking and stochastic environments don’t mix well. Ironically, this manifests as two opposite failure modes.
Value Underestimation (Conservative Policies)
In stochastic environments, an action-chunking policy is fundamentally suboptimal. The obvious reason is that it can’t react quickly if we execute the full chunk. But the problem runs deeper.
Even with receding-horizon execution, where we re-plan before finishing the chunk, the policy remains suboptimal: when we query the chunked value function $Q(s_t, a_{t:t+h})$, it returns values computed as if the policy couldn’t react at all during those $h$ steps. The Q-function correctly accounts for this inability to react, which means it underestimates the value compared to what a truly reactive policy could achieve. Since policy extraction maximizes Q, we end up with overly conservative behavior.
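Schematically, in a stochastic environment the chunked value averages over the intermediate states while the actions stay fixed (same notation as above; the exact form is a sketch, not a quote from the cited papers):

$$
Q\big(s_t, a_{t:t+h}\big) \;=\; \mathbb{E}_{s_{t+1}, \dots, s_{t+h}}\!\left[\,\sum_{k=0}^{h-1} \gamma^k\, r(s_{t+k}, a_{t+k}) \;+\; \gamma^h\, V^\pi(s_{t+h})\right].
$$

Because $a_{t+1}, \dots, a_{t+h-1}$ are chosen before $s_{t+1}, \dots, s_{t+h-1}$ are observed, this value can only be lower than or equal to what a policy that picks each action after seeing the corresponding state could achieve.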
One related work (DQC3) proposes to distill $Q(s_t, a_{t:t+h})$ into $Q(s_t, a_{t:t+h'})$ with $h' < h$, i.e. into a Q-function over shorter chunks. This allows training a policy on shorter chunks and hence makes it easier to train, but it doesn’t fix the value underestimation, because the shorter Q-function is still bootstrapped from the longer one, so the conservative bias persists.
Value Overestimation (Spurious Correlations)
Confusingly, action chunking can also cause the opposite problem: value overestimation.
Here’s the issue: the training data was collected by a policy that could react to what happened in the environment, i.e. actions were chosen based on observed outcomes. But when we learn action-chunked Q-values from this data, the model gets causality backwards. It sees correlations between action chunks and favorable outcomes, and assumes that taking those actions causes those outcomes.
A concrete example: in driving, we accelerate when the traffic light turns green. An action-chunking policy trained on this data might learn to expect that “when I accelerate after some waiting, the traffic light turns green”, because that’s what happened in the training data. The model mistakes correlation for causation.
This failure mode is closely related to the problems that Upside-Down RL and Decision Transformers face in stochastic environments, as analyzed in ESPER8. The difference is that those approaches condition on returns and expect the environment to cooperate in realizing that return, while here we’re learning values for action chunks and expect the environment to play along with the chosen chunk. But the underlying confusion about causality is similar.
Limited Stitching
One can see offline RL as learning to stitch together pieces of multiple trajectories from the offline data for better performance: combining the best parts of different trajectories to synthesize behavior better than any single demonstration.
With action chunking, our stitching granularity becomes coarser. We can only combine chunks, not individual actions. If the optimal policy would switch strategies mid-chunk, we’re out of luck. This fundamentally limits how much we can improve beyond the demonstrated behavior.
Deployment
In practice, we want to overlap policy inference with action execution, i.e. compute the next chunk while executing the current one. This lets us handle policies that are slower than the control frequency and avoids awkward “thinking pauses” between chunks.
But this creates a timing problem. We start computing the next chunk using observations from time $t$, but by the time the computation finishes at time $t + \Delta$, we need to start executing from a different state than the one the policy planned from. If we just naively switch to the new chunk, we get discontinuities.
Several approaches address this:
- Guided diffusion (RTC5): Constrain the new chunk to align with the overlapping part of the old chunk during the diffusion process. While effective, this adds latency, since the guidance gradients have to be computed during inference.
- Training-time conditioning (Training-Time RTC6): This improves on RTC by training the policy to expect the overlap, so it naturally produces consistent transitions. At inference, we can simply clamp the overlapping actions to the executed values without having to compute guidance gradients.
- Future state prediction (VLASH7): Since robot dynamics are often predictable, we can roll the ego-state forward to time $t + \Delta$ and condition the policy on that predicted state. This is particularly relevant in robotics, where dynamics models are reliable, but less useful in highly unpredictable environments such as driving.
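To make the timing story concrete, here is a minimal sketch of the asynchronous pattern with VLASH-style future-state conditioning. The `policy.predict_chunk` and `dynamics.rollout` calls and the latency constant are placeholders made up for illustration; this is not the actual implementation of any of the cited methods.

```python
import threading

CHUNK_LEN = 16   # actions per chunk
LATENCY   = 4    # policy inference takes roughly this many control steps (assumed)

def control_loop(policy, dynamics, env, episode_len=500):
    """Execute the current chunk while the next one is computed in a background thread."""
    obs = env.reset()
    chunk = policy.predict_chunk(obs)        # bootstrap with one synchronous call
    next_chunk, worker, step_in_chunk = None, None, 0

    for _ in range(episode_len):
        # Start inference early enough that the result is ready at the chunk boundary.
        if worker is None and step_in_chunk == CHUNK_LEN - LATENCY:
            # Predict where the robot will be when inference finishes, and plan from there
            # (future-state conditioning in the spirit of VLASH).
            predicted_obs = dynamics.rollout(obs, chunk[step_in_chunk:], steps=LATENCY)

            def _infer():
                nonlocal next_chunk
                next_chunk = policy.predict_chunk(predicted_obs)

            worker = threading.Thread(target=_infer)
            worker.start()

        obs, reward, done, info = env.step(chunk[step_in_chunk])
        step_in_chunk += 1
        if done:
            return

        if step_in_chunk == CHUNK_LEN:       # chunk exhausted: switch over
            worker.join()                    # already finished if LATENCY was estimated honestly
            chunk, next_chunk, worker, step_in_chunk = next_chunk, None, None, 0
```

If inference regularly takes longer than `LATENCY` control steps, the `join()` blocks and the robot pauses at the chunk boundary, so in practice `LATENCY` should be a conservative estimate of the true inference time.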
Footnotes
1. Lipman, Y., Chen, R. T. Q., Ben-Hamu, H., Nickel, M., & Le, M. (2022). Flow Matching for Generative Modeling. arXiv preprint arXiv:2210.02747. Link
2. Park, S., Frans, K., Mann, D., Eysenbach, B., Kumar, A., & Levine, S. (2025). Horizon Reduction Makes RL Scalable. arXiv preprint arXiv:2506.04168. Link
3. Li, Q., Park, S., & Levine, S. (2025). Decoupled Q-Chunking. arXiv preprint arXiv:2512.10926. Link
4. In offline Q-learning, computing “on-policy” TD targets requires either co-training a policy or using TD backups for the optimal action—both of which need careful handling to avoid overestimation (Kumar et al., 2020; Kostrikov et al., 2021).
5. Black, K., Galliker, M. Y., & Levine, S. (2025). Real-Time Execution of Action Chunking Flow Policies. arXiv preprint arXiv:2506.07339. Link
6. Black, K., Ren, A. Z., Equi, M., & Levine, S. (2025). Training-Time Action Conditioning for Efficient Real-Time Chunking. arXiv preprint arXiv:2512.05964. Link
7. Tang, J., Sun, Y., Zhao, Y., Yang, S., Lin, Y., Zhang, Z., … & Han, S. (2025). VLASH: Real-Time VLAs via Future-State-Aware Asynchronous Inference. arXiv preprint arXiv:2512.01031. Link
8. Paster, K., McIlraith, S., & Ba, J. (2022). You Can’t Count on Luck: Why Decision Transformers and RvS Fail in Stochastic Environments. Advances in Neural Information Processing Systems, 35, 38966-38979. Link