ACT Policy Explained: Action Chunking with Transformers for Robot Learning
ACT — Action Chunking with Transformers — became one of the most widely adopted imitation learning algorithms for dexterous manipulation after its publication by Tony Zhao and collaborators at Stanford. Here is a practical explanation of how it works and how to use it.
What Is ACT?
ACT is an imitation learning algorithm designed for fine-grained manipulation tasks where the robot must make smooth, coordinated movements based on visual observations. At inference time, ACT takes a sequence of images from the robot's cameras and the current joint state, and outputs a chunk of future actions — a short sequence of joint position targets — rather than a single next action. The robot executes this chunk, then re-queries the policy for the next chunk. This predict-many-steps-ahead design is the defining feature of ACT and the source of most of its advantages over simpler behavior cloning.
ACT was introduced alongside the ALOHA bimanual manipulation system and demonstrated success on tasks previously considered out of reach for imitation learning: slotting a battery, opening a ziploc bag, threading a velcro cable tie. Its core insight — that chunked action prediction reduces compounding errors and smooths trajectories — has since spread widely: Diffusion Policy predicts action sequences in the same spirit, and chunked action heads appear in later vision-language-action models such as pi0.
How Action Chunking Works
Standard behavior cloning (BC) trains a policy to predict the next single action given the current observation. At inference time, prediction errors accumulate: each small mistake shifts the robot's state slightly, putting it in a distribution the policy was not trained on, which causes the next prediction to be worse, and so on. This compounding error is the central failure mode of naive BC on fine manipulation tasks. On a 200-step manipulation trajectory, even a 1% per-step error rate compounds to ~87% probability of deviation from the demonstration distribution by step 200.
Action chunking breaks this cycle by predicting a sequence of k future actions — typically 50–100 steps at 50 Hz, corresponding to 1–2 seconds of motion. The policy commits to this plan and executes it before re-querying. Because each plan is generated from a single consistent observation, the trajectory within a chunk is smooth and internally consistent. The policy only needs to handle 200/k "decision points" per episode instead of 200 (e.g., 4 re-queries for k=50), shrinking the horizon over which errors can compound.
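The compounding arithmetic above is easy to verify directly:

```python
def deviation_prob(p, decision_points):
    # Probability of at least one deviation over an episode, assuming an
    # independent per-step error rate p at each decision point.
    return 1 - (1 - p) ** decision_points

# Single-step policy: 200 decision points at 1% error each.
print(round(deviation_prob(0.01, 200), 2))  # 0.87

# Chunked policy with k=50: only 200/50 = 4 decision points.
print(round(deviation_prob(0.01, 4), 2))    # 0.04
```

The independence assumption is a simplification — real errors are correlated — but it captures why fewer decision points help so much.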
Temporal Ensembling
In practice, ACT doesn't execute the full chunk before re-querying. Instead, it re-queries every m steps (where m < k), producing overlapping action chunks. For each future time step, multiple chunks provide predictions, and these predictions are averaged with exponentially decaying weights (more recent chunks weighted higher). This temporal ensembling further smooths execution and reduces jitter at the boundaries between chunks.
The ensembling weight schedule uses an exponential decay: a prediction from a chunk generated t steps ago receives weight w(t) = exp(-t / temperature), where temperature controls the decay rate (weights are normalized to sum to 1 before averaging). With temperature = 10 at 50 Hz control, a prediction from the current chunk gets raw weight ~1.0, while one from a chunk generated 20 steps ago gets ~0.14. This produces smooth blending between consecutive predictions.
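A minimal sketch of this ensembling step, assuming a buffer holding the prediction each still-active chunk made for the current control step (names are illustrative):

```python
import numpy as np

def temporal_ensemble(predictions, ages, temperature=10.0):
    """Blend overlapping chunk predictions for one control step.

    predictions: (n_chunks, action_dim) - each row is what a past chunk
                 predicted for the current step.
    ages:        (n_chunks,) - how many steps ago each chunk was generated.
    """
    weights = np.exp(-np.asarray(ages, dtype=float) / temperature)
    weights /= weights.sum()                  # normalize to sum to 1
    return weights @ np.asarray(predictions)  # weighted average action

# Three overlapping chunks predicted for the current step,
# generated 0, 10, and 20 steps ago respectively.
preds = [[0.10], [0.12], [0.20]]
action = temporal_ensemble(preds, ages=[0, 10, 20])  # ~0.114, dominated by the newest chunk
```

Lower temperatures concentrate weight on the newest chunk (more reactive, sharper transitions); higher temperatures blend further back (smoother, more inertia) — the trade-off the hyperparameter table below describes.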
ACT Architecture: The Full Picture
ACT uses a CVAE (Conditional Variational Autoencoder) architecture with a transformer backbone. Understanding the architecture requires understanding three components: the CVAE encoder (training only), the transformer decoder (training and inference), and the vision backbone.
The CVAE Encoder (Training Only)
During training, an encoder processes the full training example — the current images and joint state plus the ground-truth action chunk — and produces a latent style variable z (dimension 32 in the original implementation) that captures the "style" of the demonstration. Different demonstrations of the same task may approach the object from different angles, use different grasp orientations, or move at different speeds. The latent z captures these variations, allowing the decoder to produce diverse but valid action sequences.
The encoder is a transformer that attends over the full action sequence (all k ground-truth actions in the chunk) and the current observation. It outputs a mean and log-variance that parameterize a Gaussian distribution, from which z is sampled via the reparameterization trick. The KL divergence loss term (weighted by beta, typically 10–100) regularizes z toward a standard normal prior.
At inference time, z is set to zero (the mean of the prior), making the policy deterministic given the observation. This is a deliberate design choice: at deployment, you want consistent, predictable behavior, not random sampling from the learned style distribution. The CVAE structure is purely a training aid that prevents mode collapse — without it, the model tends to average across demonstration styles, producing motions that are the mean of all approaches (and therefore none of them).
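The zero-latent inference path is only a few lines; a minimal sketch, assuming a hypothetical `model.decoder(images, joint_states, z)` interface matching the training-loss pseudocode later in this article:

```python
import torch

@torch.no_grad()
def act_inference(model, images, joint_states, latent_dim=32):
    # z = 0 is the mean of the N(0, I) prior: a deterministic "average style".
    batch = images.shape[0]
    z = torch.zeros(batch, latent_dim, device=images.device)
    # Returns a full action chunk: (B, chunk_size, action_dim)
    return model.decoder(images, joint_states, z)
```

Sampling z ~ N(0, I) instead would give stochastic style variation, which is occasionally useful for data augmentation but undesirable on a deployed robot.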
The Transformer Decoder (Training + Inference)
The decoder takes three inputs: the latent z (or zero at inference), the current observation tokens (from the vision backbone + proprioception), and a set of k learnable positional queries (one per action in the chunk). The decoder uses standard transformer cross-attention: the positional queries attend to the observation tokens and z, producing k action predictions in parallel.
Architecture details from the original implementation:
- Transformer layers: 4 encoder layers, 7 decoder layers (deliberately asymmetric — the decoder is deeper because it does more work)
- Hidden dimension: 512
- Attention heads: 8
- Feedforward dimension: 2048
- Total parameters: ~40M (excluding vision backbone)
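The query mechanism can be sketched with PyTorch's built-in transformer modules — an illustrative reconstruction using the dimensions above, not the original implementation:

```python
import torch
import torch.nn as nn

class ChunkDecoder(nn.Module):
    def __init__(self, chunk_size=100, d_model=512, n_heads=8,
                 n_layers=7, action_dim=14):
        super().__init__()
        # One learnable positional query per action in the chunk.
        self.queries = nn.Parameter(torch.randn(chunk_size, d_model))
        layer = nn.TransformerDecoderLayer(d_model, n_heads,
                                           dim_feedforward=2048,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.head = nn.Linear(d_model, action_dim)

    def forward(self, obs_tokens):
        # obs_tokens: (B, n_tokens, d_model) - visual + proprio + z tokens.
        B = obs_tokens.shape[0]
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        # Queries cross-attend to the observation; all k actions come out
        # in a single parallel forward pass (no autoregression).
        return self.head(self.decoder(q, obs_tokens))  # (B, chunk_size, action_dim)
```

The single parallel pass is what makes ACT's inference so fast relative to iterative approaches like diffusion denoising.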
The Vision Backbone
The vision backbone is typically a ResNet-18 processing each camera view independently. Each camera image (480×640 RGB) is processed through the ResNet to produce a feature map (15×20×512), which is flattened into 300 tokens per camera. With 2–4 cameras, this produces 600–1,200 visual tokens that are passed to the transformer decoder.
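The feature-map-to-token conversion is just a reshape; a shape-level sketch with a random tensor standing in for the backbone output:

```python
import torch

# ResNet-18 downsamples the 480x640 input by 32x -> a 15x20 spatial grid
# of 512-dim features (one feature vector per grid cell).
feat = torch.randn(1, 512, 15, 20)        # simulated backbone output, one frame
tokens = feat.flatten(2).transpose(1, 2)  # (1, 300, 512): 15*20 = 300 tokens

# With two cameras, the per-camera token streams are concatenated
# along the sequence axis before entering the transformer.
all_tokens = torch.cat([tokens, tokens], dim=1)  # (1, 600, 512)
```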
Multiple camera views — wrist cameras plus overhead cameras — each contribute a token stream, giving the policy rich spatial information about the manipulation scene. The minimum viable setup is 2 cameras: one wrist-mounted (for close-up manipulation detail) and one overhead or third-person (for spatial context). Three cameras (1 wrist + 2 third-person at different angles) is the SVRC standard for data collection and consistently outperforms 2-camera setups by 5–15% success rate on precision tasks.
Training ACT: The CVAE Objective
The training loss has two components:
# ACT training loss (simplified PyTorch)
import torch
import torch.nn.functional as F

def reparameterize(z_mean, z_logvar):
    # Sample z = mean + std * eps with eps ~ N(0, I), keeping gradients.
    std = torch.exp(0.5 * z_logvar)
    return z_mean + std * torch.randn_like(std)

def act_loss(model, batch):
    images = batch["images"]        # (B, n_cameras, 3, H, W)
    joint_states = batch["qpos"]    # (B, n_joints)
    actions = batch["actions"]      # (B, chunk_size, action_dim)

    # Encoder: observation + ground-truth action chunk -> latent z
    z_mean, z_logvar = model.encoder(images, joint_states, actions)
    z = reparameterize(z_mean, z_logvar)  # (B, latent_dim)

    # Decoder: predict action chunk from observation + z
    predicted_actions = model.decoder(images, joint_states, z)  # (B, chunk_size, action_dim)

    # Reconstruction loss: L1 on predicted actions
    reconstruction = F.l1_loss(predicted_actions, actions)

    # KL divergence: regularize z toward N(0, I)
    kl_div = -0.5 * torch.mean(
        1 + z_logvar - z_mean.pow(2) - z_logvar.exp()
    )

    # Total loss
    beta = 10.0  # KL weight - tune this carefully
    total_loss = reconstruction + beta * kl_div
    return total_loss, {"recon": reconstruction.item(), "kl": kl_div.item()}
The beta hyperparameter controls the balance between reconstruction accuracy and latent space regularity. Too low (beta < 1): the model memorizes demonstrations but z becomes meaningless, and inference with z=0 produces garbage. Too high (beta > 100): the model ignores z entirely, losing the ability to handle multi-modal demonstrations. The sweet spot is beta = 10–50 for most manipulation tasks. Monitor both the reconstruction loss and the KL divergence during training — healthy training shows KL stabilizing between 0.5–5.0 nats.
Hyperparameter Guidance
ACT has fewer hyperparameters than most deep RL algorithms, but the ones it has matter significantly. Here is guidance from our experience training ACT on 20+ manipulation tasks at SVRC:
| Hyperparameter | Default | Range to Search | Effect of Increasing | When to Increase |
|---|---|---|---|---|
| Chunk size (k) | 100 | 20–200 | Smoother trajectories, less reactive | Smooth tasks (wiping, pouring); decrease for reactive tasks |
| Latent dim (z) | 32 | 16–64 | More expressive style capture | High demonstration variance (different operators, strategies) |
| KL weight (beta) | 10 | 1–100 | Stronger regularization, less memorization | Small datasets (<50 demos) to prevent overfitting |
| Learning rate | 1e-5 | 5e-6–5e-5 | Faster convergence, risk of instability | Large datasets (>200 demos); use warmup |
| Temporal ensemble temp | 10 | 5–50 | Slower blending, more inertia | Smoother tasks; decrease for faster reactivity |
| Number of cameras | 2 | 1–4 | Richer spatial information, more compute | 3D reasoning tasks (stacking, insertion) |
| Backbone | ResNet-18 | ResNet-18/34/50 | Better visual features, more compute | Visually complex scenes (cluttered, varying lighting) |
| Training epochs | 3000 | 1000–8000 | Better fit, risk of overfitting | More demos or more complex tasks |
ACT vs Diffusion Policy: Detailed Comparison
On the original ALOHA tasks, ACT achieved success rates of 80–95% compared to 20–50% for standard BC on the same data. But the more interesting comparison is ACT vs. Diffusion Policy (Chi et al., 2023), the other dominant imitation learning algorithm.
| Dimension | ACT | Diffusion Policy | Verdict |
|---|---|---|---|
| Inference latency (50Hz control) | 2–5ms per chunk | 50–200ms per chunk (DDPM); 10–30ms (DDIM) | ACT wins by 5–10x |
| Multi-modality handling | CVAE latent (limited) | Full diffusion (excellent) | Diffusion wins |
| Data efficiency (<50 demos) | Good | Moderate | ACT slightly better |
| Data efficiency (>200 demos) | Good | Excellent | Diffusion slightly better |
| Training time (same data) | ~8 hours (A100) | ~12 hours (A100) | ACT faster |
| Trajectory smoothness | Excellent (temporal ensembling) | Very good | ACT slightly better |
| Precision tasks (<1mm tolerance) | Good | Excellent | Diffusion wins |
| Bimanual coordination | Excellent (designed for ALOHA) | Good | ACT wins |
| Model size | ~40M params | ~80M params | ACT smaller |
| Edge deployment (Jetson AGX Orin) | ~25 Hz | ~5 Hz (DDPM); ~12 Hz (DDIM) | ACT much better |
| Open-source implementations | LeRobot, original repo | LeRobot, original repo | Tie |
When to use ACT: When inference speed matters (edge deployment, high control rates), when you have limited data (<100 demos), for bimanual tasks, or when you need the simplest possible training pipeline. ACT is also the better choice when you're deploying on compute-constrained hardware like Jetson Orin Nano, where Diffusion Policy's iterative denoising becomes prohibitively slow.
When to use Diffusion Policy: When your demonstrations have significant multi-modality (multiple valid strategies for the same task), for precision assembly tasks where sub-millimeter accuracy matters, or when you have abundant data (>200 demos) and can afford the training and inference compute. Diffusion Policy's ability to represent the full distribution of valid actions (rather than collapsing to a mean via the CVAE) gives it an edge on tasks where there are genuinely multiple correct approaches.
Inference Speed Benchmarks
We benchmarked ACT inference across hardware platforms relevant to robotics deployment. All measurements use the standard ACT architecture (ResNet-18 backbone, 2 cameras, 7-DOF action space, chunk size 50):
| Hardware | ACT Inference (ms) | ACT Throughput (Hz) | DP-DDIM Inference (ms) | DP-DDIM Throughput (Hz) | Notes |
|---|---|---|---|---|---|
| NVIDIA A100 (80GB) | 1.8 | 555 | 12 | 83 | Datacenter training/inference |
| NVIDIA RTX 4090 | 2.5 | 400 | 15 | 67 | Desktop workstation |
| NVIDIA RTX 4070 | 4.2 | 238 | 25 | 40 | Budget desktop |
| Jetson AGX Orin (64GB) | 12 | 83 | 45 | 22 | Onboard robot compute |
| Jetson Orin Nano (8GB) | 40 | 25 | 180 | 5.5 | Min viable edge |
| Jetson Orin NX (16GB) | 22 | 45 | 85 | 12 | Mid-range edge |
| Apple M3 Pro (18GB) | 8 | 125 | 35 | 28 | Development laptop |
The critical threshold for real-time manipulation control is 20 Hz (50 ms per inference). ACT meets this on every platform we tested — even the lowest-end Jetson Orin Nano manages 25 Hz. Diffusion Policy with DDIM sampling meets the threshold on desktop GPUs and the AGX Orin but falls below it on the Orin Nano and Orin NX. For teams deploying on edge hardware, ACT is the clear choice.
Data Requirements and What Constitutes Good Data
ACT works well with 50–200 demonstrations per task in most published results. However, data quality matters more than quantity. Here is what we have learned about data quality from collecting 25,000+ demonstrations at SVRC:
Minimum Viable Datasets by Task Complexity
- Simple pick-and-place (one object, fixed location): 20–30 demos. ACT learns this reliably with minimal data.
- Pick-and-place with pose variation (random object position): 50–80 demos. Need to cover the workspace.
- Multi-step manipulation (pick, transport, place with precision): 100–150 demos. Each phase needs adequate coverage.
- Bimanual coordination (two arms, dependent motions): 150–250 demos. Coordination patterns require more examples.
- Contact-rich assembly (insertion, screwing, snapping): 200–400 demos. Force-sensitive phases need dense coverage.
Quality Requirements
Demonstrations should be smooth and purposeful — the ACT policy will learn whatever motion pattern is in the data, including hesitations, corrections, and suboptimal approaches. SVRC's data collection standard requires operators to restart an episode rather than continue after a visible error, ensuring the training dataset contains only intentional, successful behaviors. Specific quality criteria:
- No pauses: Demonstrations should flow continuously. Long pauses (>0.5s of no motion) confuse the action chunking — the policy learns to predict "do nothing" chunks.
- Consistent speed: Operators should execute at a consistent pace. Mixing fast and slow demonstrations within the same dataset forces the CVAE to waste latent capacity modeling speed variation instead of task-relevant variation.
- Clean starts/ends: Every episode should start from a similar neutral pose and end with the gripper clearly in the final configuration. Ragged episode boundaries create training artifacts.
- Success only: Remove all failed demonstrations. Unlike some RL approaches that can learn from failures, ACT treats all training data as expert demonstrations to imitate.
Camera consistency is also critical. If camera placement changes between recording sessions, the visual features the policy learned will no longer match the deployment setup. Use physical mounts rather than flexible arms, and log the camera calibration parameters with each dataset. SVRC's multi-camera recording pipeline enforces this automatically.
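The "no pauses" criterion above can be checked automatically during data validation; a minimal sketch, assuming 50 Hz joint-position logs (function name and thresholds are illustrative, not part of any SVRC tool):

```python
import numpy as np

def has_long_pause(qpos, hz=50, max_pause_s=0.5, motion_eps=1e-3):
    """Flag episodes containing more than max_pause_s of near-zero joint motion.

    qpos: (T, n_joints) array of joint positions sampled at `hz` Hz.
    """
    speed = np.abs(np.diff(qpos, axis=0)).max(axis=1)  # max joint delta per step
    still = speed < motion_eps
    run = longest = 0
    for s in still:
        run = run + 1 if s else 0   # length of the current motionless stretch
        longest = max(longest, run)
    return longest / hz > max_pause_s
```

Running this over every episode before training catches "do nothing" chunks before the policy learns them.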
Training Procedure Step by Step
Here is the end-to-end training procedure we use at SVRC for new ACT policies:
- Data collection: Record demonstrations using teleoperation with the leader-follower setup. 50 Hz control rate, 2–3 cameras, synchronized recording. Export to LeRobot HDF5 format.
- Data validation: Run the SVRC data quality checker: verify episode lengths are within 2x of each other, no camera frame drops, joint positions within limits, success labels verified.
- Split: 90/10 train/val split, stratified by object configuration if applicable. Hold out 50 episodes for real-robot evaluation (never seen during training).
- Training: Use the LeRobot ACT training script with the hyperparameters from the table above. Monitor reconstruction loss, KL divergence, and validation loss. Training typically converges in 2,000–5,000 epochs.
- Checkpoint selection: Do NOT use the final checkpoint. Select the checkpoint with the lowest validation loss. Overfitting is visible as a divergence between training and validation loss curves, typically after epoch 3,000–5,000 depending on dataset size.
- Sim evaluation (if available): Run 100 episodes in simulation with the selected checkpoint. If success rate <60%, retune hyperparameters (chunk size first, then beta, then learning rate).
- Real evaluation: Run 20 episodes on real hardware. Compare against the sim evaluation. A gap >20% suggests camera calibration mismatch or action space scaling issues.
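The checkpoint-selection rule in step 5 is easy to automate; a minimal sketch, assuming validation losses are logged as (epoch, val_loss) pairs:

```python
def select_checkpoint(history):
    """history: list of (epoch, val_loss) tuples logged during training."""
    best_epoch, best_loss = min(history, key=lambda item: item[1])
    return best_epoch

# Validation loss improves until epoch 3000, then overfitting sets in.
history = [(1000, 0.052), (2000, 0.041), (3000, 0.038),
           (4000, 0.044), (5000, 0.051)]
print(select_checkpoint(history))  # 3000
```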
# Full ACT training command via LeRobot
python lerobot/scripts/train.py \
    --policy.name=act \
    --dataset.repo_id=svrc/openarm_pick_place \
    --policy.chunk_size=50 \
    --policy.n_obs_steps=1 \
    --policy.dim_model=512 \
    --policy.n_heads=8 \
    --policy.n_encoder_layers=4 \
    --policy.n_decoder_layers=7 \
    --policy.latent_dim=32 \
    --policy.kl_weight=10.0 \
    --training.learning_rate=1e-5 \
    --training.batch_size=8 \
    --training.num_epochs=5000 \
    --training.eval_freq=500 \
    --training.save_freq=1000 \
    --device=cuda \
    --output_dir=outputs/act_openarm_pick_place/
Common Training Failures and Fixes
- Policy outputs zero/constant actions: KL weight (beta) is too high. The model is ignoring z and collapsing to the mean action across demonstrations. Reduce beta by 5–10x.
- Jerky execution at chunk boundaries: Temporal ensembling temperature is too low (blending is too sharp). Increase temperature from 10 to 20–30. Also verify that your execution loop is correctly implementing overlapping chunks.
- Good training loss, bad real performance: Camera mismatch between training and deployment. Check: are images being resized the same way? Is the camera FOV identical? Are the crop regions consistent?
- Works for first 2 seconds, then drifts: Chunk size is too large relative to task dynamics. The policy commits to a plan that's outdated by execution end. Reduce chunk size by 50%.
- Inconsistent between runs: Not enough demonstrations. With <30 demos, ACT's performance is highly sensitive to the train/val split. Collect more data.
- KL divergence stays very large (never stabilizes): Beta is too low. The encoder packs everything into z and the decoder leans on z instead of the observations, so inference with z=0 fails. Increase beta until KL stabilizes at 0.5–5.0.
ACT Variants and Extensions
Since the original ACT paper, several extensions have appeared:
- ACT+ (LeRobot): Adds proprioceptive history (last 2–5 joint states) to the observation, improving velocity estimation without explicit velocity computation. Default in LeRobot's ACT implementation.
- ACT with Diffusion (ACT-DP): Replaces the CVAE with a diffusion action head while keeping the chunked prediction structure. Combines ACT's smoothness with Diffusion Policy's multi-modality handling. Training is 50% slower but handles multi-modal demonstrations better.
- ACT with Language (ACT-L): Adds a language embedding (from CLIP or T5) to the observation tokens, enabling language-conditioned task specification. "Pick up the red cup" vs "pick up the blue cup" can be handled by a single policy.
- Mobile ALOHA ACT: Extends ACT to jointly control a mobile base (2D velocity) and two arms (2 x 7-DOF), for a total 16-DOF action space. Chunk size is typically reduced to 20–30 for the mobility component to maintain reactivity to obstacles.
Training ACT with SVRC Data
SVRC's data platform exports datasets in LeRobot-compatible HDF5 format, which is the standard input format for the open-source ACT training code. After downloading your dataset, training a baseline ACT policy requires a GPU with at least 16 GB VRAM and approximately 8 hours of training for a single task. SVRC engineering support is available to help teams configure training runs, tune chunk size and learning rate, and evaluate policy performance.
For teams without GPU infrastructure, SVRC offers managed training runs on A100 hardware: you provide the dataset, we return a trained checkpoint with evaluation metrics. Turnaround is typically 24–48 hours. Cost depends on dataset size and number of hyperparameter configurations searched.
For hardware to collect your own data, see our hardware catalog or explore robot leasing options. The OpenArm with dual-arm leader-follower configuration ($9,000 for the complete data collection station) is the most cost-effective setup for ACT data collection.