The Garbage-In-Garbage-Out Problem for Robot Policies

Imitation learning is a distribution matching problem: the trained policy will approximate the distribution of behaviors in its training data. If that distribution includes failed grasps, jerky motions, inconsistent strategies, and ambiguous success criteria, the policy faithfully learns to reproduce all of them. Unlike language model training, where individual noisy examples are smoothed out by billions of other examples, robot demonstration datasets are small enough that every single episode matters.

A dataset of 500 carefully controlled demonstrations will produce a better policy than a dataset of 5,000 uncontrolled demonstrations. This is not an aspirational claim -- it is a consistent empirical finding across research groups, task types, and policy architectures. The explanation is straightforward: a clean, diverse, consistent dataset gives the policy a clear signal about what to learn. A noisy, biased, inconsistent dataset gives the policy a confused signal that it resolves by averaging over conflicting behaviors, producing mediocre performance on everything rather than strong performance on anything.

The framework below identifies seven dimensions that distinguish high-quality robot demonstration data from mediocre data. Each dimension has a measurable metric and a practical threshold.

Dimension 1: Diversity

Diversity is the most important quality dimension because it directly determines how well the policy generalizes. A dataset must include variation across every axis that will differ between training and deployment: object instances, object positions, lighting conditions, background clutter, and operator behavior.

Object diversity: Include at least 10-20 distinct instances of each target object category, varying in size, color, material, and brand. If your task involves picking up cups, collect demonstrations with ceramic mugs, paper cups, plastic travel cups, glass cups, and metal cups. Each instance teaches the policy something different about the visual and physical properties of the category.

Position diversity: Vary object starting positions across the full reachable workspace, using a grid of at least 30 x 40 cm. Include different orientations -- upright, tilted, rotated 90 degrees. If the policy only sees objects in the center of the workspace during training, it will fail at the edges during deployment.

Lighting diversity: Collect under at least 3 distinct lighting conditions: warm overhead fluorescent, cool daylight from windows, and mixed or directional lighting. Lighting is one of the most common sources of deployment failure because it changes the appearance of every object and surface in the scene.

Operator diversity: Use at least 3 distinct operators, each contributing a roughly equal share of demonstrations. Each operator approaches the task differently -- different approach angles, grasp points, speeds, and recovery strategies. This diversity is valuable because it forces the policy to learn the task structure rather than a single person's idiosyncrasies.

Dimension 2: Consistency

Consistency means that the task definition, success criteria, and reset procedure are identical across all episodes. Inconsistency introduces ambiguity that the policy cannot resolve.

Success criteria must be binary and unambiguous. For pick-and-place tasks, success means the object is at the target location within a defined tolerance at the end of the episode. "Close enough" is not a criterion. Write the success criteria in your collection protocol and verify that all operators apply them the same way.

Reset procedure must be standardized. Between episodes, objects must be placed in new starting positions according to a defined randomization protocol, the workspace must be cleared of debris or displacement from the previous trial, and the robot must return to a consistent starting configuration. Sloppy resets introduce systematic biases -- objects that accumulate near certain locations because operators default to placing them there, and background clutter that drifts across episodes.

Operator calibration is essential. Before an operator's demonstrations count toward the dataset, they should complete a calibration session of 2-4 hours to learn the teleoperation interface, internalize the success criteria, and develop consistent approach strategies. Track per-operator quality metrics (success rate, trajectory smoothness, episode duration) and provide feedback. Uncalibrated operators produce demonstrations that actively harm policy performance.
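The per-operator metrics above can be tracked with a small aggregator. A minimal sketch, assuming each episode record carries illustrative `operator`, `success`, `jerk`, and `duration_s` fields (field names are placeholders, not a fixed schema):

```python
from statistics import mean


def operator_report(episodes: list[dict]) -> dict[str, dict[str, float]]:
    """Per-operator quality metrics for calibration feedback.

    Groups episodes by operator and reports success rate, mean jerk,
    and mean episode duration -- the three metrics suggested above.
    """
    by_op: dict[str, list[dict]] = {}
    for ep in episodes:
        by_op.setdefault(ep["operator"], []).append(ep)
    return {
        op: {
            "success_rate": mean(e["success"] for e in eps),
            "mean_jerk": mean(e["jerk"] for e in eps),
            "mean_duration_s": mean(e["duration_s"] for e in eps),
        }
        for op, eps in by_op.items()
    }
```

Running this after each session gives the per-operator feedback loop the calibration step depends on.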

Dimension 3: Completeness

Every episode must capture the full task from initial approach through final release, with all sensor streams synchronized and no missing data. Incomplete episodes corrupt the training signal in subtle ways that are difficult to diagnose after the fact.

No missing modalities. If your collection setup includes two cameras, joint encoders, and a force-torque sensor, every episode must have all four streams. A single episode with a dropped camera feed teaches the policy that the missing camera is sometimes zero, which confuses the visual processing pipeline.

Synchronized timestamps. All sensor streams must be time-aligned to within the control period (typically 5-20 ms). Misaligned streams create an inconsistent mapping between what the robot sees and what it does -- the action at time T is paired with the observation from time T minus 50 ms, producing a systematically shifted training signal. Verify synchronization automatically by checking timestamp alignment in every episode.
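The automatic synchronization check can be sketched as a pairwise comparison of per-frame timestamp arrays. A minimal version, assuming each stream exposes timestamps in seconds (the function name and return shape are illustrative):

```python
import numpy as np


def check_sync(streams: dict[str, np.ndarray], tolerance_s: float = 0.010) -> dict:
    """Worst-case pairwise timestamp offset between sensor streams.

    streams maps a stream name to its per-frame timestamp array (seconds).
    Episodes whose worst offset exceeds `tolerance_s` should be rejected.
    """
    names = sorted(streams)
    report = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            n = min(len(streams[a]), len(streams[b]))
            # Maximum absolute offset between corresponding frames.
            report[(a, b)] = float(np.abs(streams[a][:n] - streams[b][:n]).max())
    return report
```

Rejecting on `max(report.values()) > tolerance_s` gives the per-episode pass/fail described above.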

Full episode recording. Episodes that start mid-grasp (because recording was triggered late) or end before task completion (because recording was stopped early) are unusable for most policy training pipelines. Configure your collection system to start recording before the task begins and stop after the robot returns to its starting configuration.

Dimension 4: Accuracy

Accuracy covers the correctness of both the demonstrations themselves and the metadata attached to them.

Demonstration accuracy: Only fully successful episodes should be included in the imitation learning training set. The performance impact of including failed demonstrations is dramatic -- adding even 10% failed demonstrations to a training set typically causes a 20-30% drop in policy success rate. The mechanism is clear: the policy learns that "almost grasping" or "dropping halfway" is an acceptable terminal state. Filter rigorously: binary success classification on every episode, with human review on borderline cases.

Trajectory quality: Demonstrations should be smooth, deliberate, and efficient. Jerky trajectories -- caused by operator error, controller latency, or poor workspace ergonomics -- teach the policy to be jerky. Measure smoothness using the jerk metric (third derivative of joint positions), establish per-task baselines from your best operators, and filter demonstrations below 70% of that baseline.
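The jerk metric can be computed with a finite-difference sketch like the following; the function name is illustrative, and the threshold you filter against should come from your own per-task baselines:

```python
import numpy as np


def jerk_score(joint_positions: np.ndarray, dt: float) -> float:
    """Mean absolute jerk: third finite difference of joint positions.

    joint_positions: (T, n_joints) array sampled at fixed control period dt.
    Lower is smoother; compare against a per-task baseline established
    from your best operators before filtering.
    """
    jerk = np.diff(joint_positions, n=3, axis=0) / dt**3
    return float(np.abs(jerk).mean())
```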

Annotation accuracy: Success/failure labels, language instruction labels, task phase segmentation, and object identity tags must be correct. Incorrect annotations corrupt the training signal. Language instructions should be checked against actual task behavior ("pick up the red cup" should not be tagged on an episode where the operator picked up a blue bowl). Automated validation tools should flag mismatches between annotations and observations.

Dimension 5: Balance

A balanced dataset has roughly equal representation across conditions. Imbalance causes the policy to overfit to over-represented conditions and underperform on under-represented ones.

Object balance: If you have 15 object instances but 60% of demonstrations use 3 of them, the policy learns those 3 objects well and the other 12 poorly. Aim for equal demonstrations per object instance, with no more than a 2:1 ratio between the most-represented and least-represented instances.
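The 2:1 ceiling is cheap to monitor during collection. A quick illustrative sketch:

```python
from collections import Counter


def balance_ratio(instance_ids: list[str]) -> float:
    """Ratio between the most- and least-represented object instances.

    A ratio above ~2.0 signals imbalance worth correcting before
    collecting more episodes.
    """
    counts = Counter(instance_ids)
    return max(counts.values()) / min(counts.values())
```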

Position balance: If most demonstrations start with objects in the center of the workspace, the policy will be weak at workspace edges. Use a defined grid or randomization scheme that ensures spatial coverage.

Difficulty balance: Approximately 10-15% of demonstrations should cover deliberately challenging scenarios: objects at the edge of the reachable range, cluttered workspaces, unusual orientations, near-failure recoveries. These edge cases dramatically improve robustness without requiring proportionally more data. Under-representation of edge cases is one of the most common causes of unexpected deployment failures.

Dimension 6: Format

Data format determines how easily the dataset integrates with training pipelines, how efficiently it can be stored and transferred, and whether it is compatible with community standards.

Use established formats. HDF5 or Zarr for raw episode storage. The LeRobot HuggingFace format for sharing and community compatibility. Open X-Embodiment schema for cross-embodiment research. Custom formats create friction for every downstream consumer and are the leading cause of "we collected data but can't use it" failures.

Include complete metadata. Every episode should include: task description, success/failure label, language instruction, robot platform identifier, camera intrinsics and extrinsics, collection date, operator identifier, and any environment conditions that varied (lighting setup, table surface, object instance IDs). This metadata enables filtering, stratified analysis, and targeted retraining on specific conditions.

Validate format automatically. Run schema validation on every episode at write time. Catching format errors during collection is vastly cheaper than discovering them during training, when an engineer spends hours debugging why the dataloader crashes on episode 3,847.
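A minimal write-time schema check might look like the following; the stream and metadata field names are illustrative placeholders, not a fixed standard:

```python
# Illustrative required fields -- adapt to your own collection setup.
REQUIRED_STREAMS = {"cam_front", "cam_wrist", "joint_positions", "force_torque"}
REQUIRED_METADATA = {"task_description", "success", "language_instruction",
                     "robot_id", "operator_id", "collected_at"}


def validate_episode(episode: dict) -> list[str]:
    """Return a list of schema errors; an empty list means the episode passes.

    `episode` is assumed to hold a "streams" dict and a "metadata" dict.
    Run this at write time so bad episodes are rejected immediately.
    """
    errors = []
    missing_streams = REQUIRED_STREAMS - episode.get("streams", {}).keys()
    errors += [f"missing stream: {s}" for s in sorted(missing_streams)]
    missing_meta = REQUIRED_METADATA - episode.get("metadata", {}).keys()
    errors += [f"missing metadata: {k}" for k in sorted(missing_meta)]
    return errors
```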

Dimension 7: Failure Inclusion

This dimension is counterintuitive: good datasets include a controlled proportion of failure demonstrations. While Dimension 4 (Accuracy) requires excluding failures from the primary training set, a separate set of labeled failure demonstrations serves critical functions for robust policy training.

Why failures matter. A policy trained exclusively on successful demonstrations has no representation of what failure looks like. When it encounters a state that is drifting toward failure during deployment, it has no training signal to recognize or recover from the situation. Including 5-10% labeled failure demonstrations (clearly tagged as failures in the dataset metadata) enables several training techniques:

  • Contrastive learning: Train the policy to distinguish successful trajectories from failure trajectories, producing a more robust decision boundary.
  • Recovery behavior learning: Include demonstrations where the operator encounters a near-failure condition (dropped object, missed grasp) and recovers. These "recovery demonstrations" teach the policy what to do when things go wrong, rather than only what to do when things go right.
  • Failure detection: A separate classifier trained on success/failure data can serve as a runtime monitor that triggers human intervention when the policy enters a failure-like state.

How to collect failure data. Do not intentionally sabotage demonstrations. Instead, preserve the natural failures that occur during collection (every operator has a 10-30% failure rate, especially early in a session) rather than discarding them. Tag them with failure mode labels: "missed grasp," "dropped during transport," "incorrect placement," "collision with obstacle." This labeling enables targeted analysis of policy weaknesses and guides re-collection priorities.

Recommended ratio: 85-90% successful demonstrations in the primary training set, 5-10% recovery demonstrations (near-failure with successful recovery), and 5% pure failure demonstrations (clearly unsuccessful outcomes). The recovery demonstrations are the most valuable per-episode because they teach the policy to handle the distribution of states it is most likely to encounter when things start going wrong.
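The recommended mix can be audited with a small composition check. The label names below are illustrative; the ranges simply encode the ratios above:

```python
from collections import Counter

# Recommended fraction ranges from the text: 85-90% success,
# 5-10% recovery, up to 5% pure failure.
RECOMMENDED = {"success": (0.85, 0.90), "recovery": (0.05, 0.10), "failure": (0.00, 0.05)}


def composition(labels: list[str]) -> dict[str, float]:
    """Fraction of episodes per outcome label."""
    counts = Counter(labels)
    return {k: counts.get(k, 0) / len(labels) for k in RECOMMENDED}


def within_recommendation(labels: list[str]) -> bool:
    """True if every outcome fraction falls inside its recommended range."""
    frac = composition(labels)
    return all(lo <= frac[k] <= hi for k, (lo, hi) in RECOMMENDED.items())
```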

Data Quality Scoring Rubric

| Dimension | Score 1 (Poor) | Score 3 (Adequate) | Score 5 (Excellent) |
| --- | --- | --- | --- |
| Diversity | 1-2 objects, 1 lighting, 1 operator | 5-10 objects, 2 lightings, 2 operators | 15+ objects, 3+ lightings, 3+ operators |
| Consistency | No written criteria, ad-hoc resets | Written criteria, manual reset verification | Written criteria, automated reset verification, operator calibration |
| Completeness | Missing modalities in >5% of episodes | All modalities present, sync within 50 ms | All modalities present, sync within 10 ms, auto-validated |
| Accuracy | No filtering; includes failed demos | Binary success filter, basic smoothness check | Automated success classifier, jerk filtering, annotation validation |
| Balance | >5:1 ratio between most/least represented | 3:1 max ratio, basic spatial coverage | 2:1 max ratio, grid-based coverage, 10-15% edge cases |
| Format | Custom format, missing metadata | HDF5/Zarr, basic metadata | LeRobot/RLDS compatible, full metadata, auto-validated at write |
| Failure Inclusion | All failures discarded | Failures preserved but unlabeled | Labeled failures + recovery demos at 5-10% ratio |

Score each dimension 1-5. A dataset with an average score of 4+ across all seven dimensions is production-quality. A score of 3 in any dimension should be addressed before scaling up collection. A score of 1-2 in any dimension will actively harm policy training -- fix it before collecting more data.
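The decision rule above can be encoded directly. A sketch, with illustrative dimension keys and verdict strings:

```python
DIMENSIONS = ("diversity", "consistency", "completeness", "accuracy",
              "balance", "format", "failure_inclusion")


def rubric_verdict(scores: dict[str, int]) -> str:
    """Apply the rubric's decision rule to per-dimension scores (1-5)."""
    assert set(scores) == set(DIMENSIONS), "score every dimension"
    worst = min(scores.values())
    if worst <= 2:
        return "fix before collecting more data"
    if worst == 3:
        return "address weakest dimension before scaling"
    # All dimensions >= 4 implies the average is >= 4.
    return "production-quality"
```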

Common Anti-Patterns in Robot Data Collection

These are the mistakes SVRC sees most frequently in datasets submitted for policy training. Each one is a direct cause of policy failure:

  • The "graduate student special." One operator, one lighting condition, objects always in the same starting position, data collected over one afternoon. The resulting policy works perfectly in the lab and fails immediately in any other environment. Fix: diversity across all axes from the start.
  • The "quantity over quality" trap. 5,000 demonstrations collected as fast as possible with no quality filtering. 30% are failed grasps, 20% have jerky trajectories from fatigued operators, 10% have timestamp synchronization issues. The resulting policy learns the average of all behaviors, which is mediocre at everything. Fix: filter first, then scale.
  • The "demo hoarding" mistake. Collecting 2,000 demonstrations before training a single policy. Then discovering that the data has a systematic problem (wrong camera angle, missing modality, inconsistent success criteria) that requires re-collection of the entire dataset. Fix: train a test policy on 50-100 demos early to validate the pipeline before scaling.
  • The "easy only" bias. Operators naturally gravitate toward easy starting positions and object orientations because they produce faster, smoother demonstrations. The dataset becomes over-represented in easy conditions and under-represented in the edge cases that the policy will encounter in deployment. Fix: structured randomization protocol that explicitly includes difficult conditions.
  • The "silent failure" problem. Data collection proceeds for weeks, but the camera calibration drifted on day 3, and nobody noticed because there was no automated quality monitoring. All data after day 3 has systematically wrong extrinsic transforms. Fix: automated quality checks on every episode at write time.
  • The "format conversion nightmare." Data collected in a custom format, then converted to HDF5 for training. The conversion script has a subtle bug that drops every 10th timestep. The policy trains on corrupted data and fails in deployment. Fix: use established formats from the start and validate schema on every episode.

SVRC Quality Assurance Pipeline Detail

The SVRC QA pipeline runs on every episode at collection time -- not as a batch post-processing step. This means quality issues are caught within seconds of the demonstration completing, allowing immediate operator feedback and re-collection.

Pipeline stages:

  1. Schema validation (0.1s). Verifies all required data streams are present, data types are correct, and episode metadata is complete. Rejects immediately if any stream is missing.
  2. Timestamp synchronization check (0.2s). Computes pairwise timestamp alignment between all sensor streams. Rejects if any stream is misaligned by more than 10ms (configurable per task).
  3. Success classification (0.5s). A learned classifier (trained on previously labeled episodes from the same task) predicts success/failure. Borderline predictions (confidence 0.4-0.7) are flagged for human review.
  4. Trajectory smoothness scoring (0.3s). Computes jerk metric (third derivative of joint positions) and compares against per-task baselines established during operator calibration. Episodes scoring below 70% of the baseline are flagged.
  5. Coverage analysis (0.5s). Embeds the episode in a visual feature space (using a pre-trained vision encoder) and checks for diversity relative to the existing dataset. If the new episode is too similar to existing episodes (below a cosine distance threshold), it is flagged as redundant.
  6. Metadata completeness check (0.1s). Verifies all required annotation fields are populated: task description, language instruction, operator ID, environment conditions.

Total QA latency: under 2 seconds per episode. The operator sees a green/yellow/red indicator immediately after completing each demonstration, enabling real-time quality feedback without interrupting the collection flow.
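Stage 5's redundancy check can be sketched as a nearest-neighbor cosine-distance test. The threshold value here is an illustrative default, not a calibrated constant:

```python
import numpy as np


def is_redundant(new_embedding: np.ndarray,
                 dataset_embeddings: np.ndarray,
                 min_cosine_distance: float = 0.05) -> bool:
    """Flag an episode whose visual embedding is too close to an existing one.

    new_embedding: (d,) feature vector from a pre-trained vision encoder.
    dataset_embeddings: (N, d) matrix of embeddings already in the dataset.
    """
    e = new_embedding / np.linalg.norm(new_embedding)
    d = dataset_embeddings / np.linalg.norm(dataset_embeddings, axis=1, keepdims=True)
    # Cosine distance to the nearest existing episode.
    cos_dist = 1.0 - d @ e
    return bool(cos_dist.min() < min_cosine_distance)
```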

The 10-Point Data Quality Checklist

  1. At least 10 distinct object instances per target category
  2. At least 3 lighting conditions represented
  3. At least 3 operators contributing roughly equal episodes
  4. Written success criteria applied consistently across all episodes
  5. Standardized reset protocol documented and followed
  6. All sensor streams present and synchronized in every episode
  7. Full episode capture from approach through completion
  8. Binary success classification with human review on borderline cases
  9. Trajectory smoothness filtered against per-task baselines
  10. Data stored in an established format (HDF5/Zarr/LeRobot) with complete metadata

If your dataset passes all 10 points, it is production-quality data suitable for training reliable policies. If it fails on any point, fix that dimension before collecting more data -- more data with the same quality problem just produces more of the same problem.

How SVRC Ensures Quality in Managed Collection

All demonstration data collected through SVRC data services passes through an automated quality pipeline that enforces every dimension described above. The pipeline includes: automated success classification using learned classifiers, smoothness scoring with operator-specific baselines, coverage analysis in visual embedding space to verify object and environmental diversity, timestamp synchronization verification, schema validation on every episode, and human review on all borderline cases.

You receive quality-certified data with per-episode quality scores, aggregate coverage statistics showing diversity across all axes, and a quality report that documents how the dataset performs on each of the seven dimensions. This certification means you can train with confidence that your data meets the standard -- rather than discovering quality problems three weeks into a training run.

For teams building their own collection infrastructure, SVRC also offers quality audits on externally-collected datasets. We apply the same quality pipeline to your data and provide a detailed report identifying which dimensions need improvement. Contact the SVRC team to discuss your data quality needs.

Related Reading