The Persistent Sim-to-Real Gap

Simulation has come a long way. Isaac Lab simulates rigid-body physics at thousands of steps per second. MuJoCo MPC closes manipulation loops in milliseconds. Yet every team that has tried to ship a contact-rich policy trained purely in simulation has hit the same wall: the real world does not behave the way the simulator predicts, and the policy fails in ways that are difficult to diagnose and expensive to fix.

The core issue is contact modeling. When a robot finger touches a soap bar, the contact stiffness, friction coefficient, and surface micro-geometry all interact to determine whether the bar slides, rolls, or stays put. Simulation engines approximate this with a handful of scalar parameters. Real silicone soap on a wet ABS surface has a friction coefficient that varies with sliding speed, normal force, and surface wetness in ways that no physics engine currently models accurately. The policy trained in simulation learns to exploit the simulated contact model, not the real one.

Contact Richness: What Simulation Cannot Capture

Contact-rich manipulation tasks — grasping, insertion, assembly, surface following — involve physical interactions where the information that determines success or failure is encoded in the forces, deformations, and frictional dynamics at the contact interface. This information has several properties that make it fundamentally difficult to simulate:

| Contact Property | Sim Fidelity | Real-World Complexity | Gap Severity |
|---|---|---|---|
| Static friction coefficient | Single scalar per pair | Varies with normal force, velocity, surface wear, moisture | High |
| Contact patch geometry | Point/line contact models | Deformable patch with pressure distribution | High |
| Surface micro-geometry | Not modeled | Micro-roughness, grain direction, surface coatings | Moderate |
| Soft body deformation | FEM or MPM (slow, approximate) | Viscoelastic, rate-dependent, non-linear | Very high |
| Actuator dynamics | Idealized motor models | Backlash, cable stretch, thermal drift, gear wear | Moderate |
| Visual appearance under contact | PBR rendering (improving) | Specular reflections, shadows from gripper, occlusion | Low-moderate |

The compound effect of these gaps is what makes sim-to-real transfer for contact tasks so difficult. Each individual approximation might introduce only a small error, but errors in friction, geometry, and dynamics compound multiplicatively during a multi-step contact sequence. A 10-step insertion task with 5% per-step error from contact modeling inaccuracies yields 0.95^10 ≈ 60% overall success — before accounting for distribution shift in visual observations.
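
The compounding arithmetic is worth making explicit. A minimal sketch (the function name is ours; the 5% per-step error is the illustrative figure from above):

```python
def multi_step_success(per_step_success: float, num_steps: int) -> float:
    """Overall success rate when independent per-step errors compound."""
    return per_step_success ** num_steps

# 10-step insertion task, 5% error per contact step
print(f"{multi_step_success(0.95, 10):.2f}")  # → 0.60
```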

Distribution Shift: The Quantitative Evidence

Distribution shift refers to the mismatch between the state distribution a policy sees during training and the state distribution it encounters during deployment. Simulation-trained policies suffer from distribution shift in two distinct ways:

Visual distribution shift: Despite advances in domain randomization and neural rendering, simulated images remain distributionally different from real camera feeds. This shift is measurable: the FID (Fréchet Inception Distance) between Isaac Sim-rendered manipulation scenes and real RealSense captures of the same scenes is typically 80-120, compared to 20-40 between different real camera setups. Policies with learned visual encoders that are sensitive to this gap show 15-30% performance degradation when transferred from sim to real without fine-tuning.

Dynamics distribution shift: Even when visual transfer succeeds (via domain randomization or real-image rendering), the dynamics mismatch creates state trajectories during deployment that the policy never saw during training. A sim-trained grasping policy approaches objects on trajectories shaped by the simulator's dynamics model. In reality, the arm follows slightly different trajectories due to actuator modeling errors, resulting in grasp approach configurations that are out-of-distribution for the policy. This dynamics shift is harder to measure but accounts for the majority of sim-to-real failure in contact tasks.

Why Human Demonstrations Capture the Right Distribution

When a skilled operator teleoperates a robot arm to pick up a soap bar, they are not following a computed trajectory. They are using the same sensorimotor heuristics that evolution spent millions of years optimizing. They instinctively adjust approach angle when they see a rounded surface. They feel (through haptic feedback or visual cues) when a grasp is unstable and correct before failure. They find grasp configurations that are not globally optimal but are locally stable — exactly the kind of solutions that transfer to deployment.

This means teleoperation demonstrations implicitly encode two things that simulation cannot easily provide: the correct task distribution (the set of object poses, grasp strategies, and action sequences that actually work in the physical world) and implicit constraint satisfaction (a skilled operator never attempts a grasp that their experience tells them will fail — the failure modes are filtered out before data collection even begins).

There is a third, often overlooked, benefit: implicit dynamics information. When an operator adjusts their grip force mid-trajectory because they feel a slip, they are encoding the real friction coefficient into the demonstration trajectory. When they slow down on approach because the object is positioned awkwardly, they are encoding the real kinematic constraints. This dynamics information is "free" — it comes along with every demonstration, embedded in the action sequence, and it is exactly the information that sim demonstrations get wrong.

The Sim-to-Real Gap by Task Type: Quantitative Evidence

The magnitude of the sim-to-real gap varies dramatically by task type. The following data is aggregated from published benchmarks and SVRC's internal evaluations across multiple hardware platforms:

| Task Type | Sim Success | Real Success (Sim Policy) | Real Success (Teleop Data) | Demos Needed |
|---|---|---|---|---|
| Pick-place (rigid, top-down) | 95% | 75-85% | 90-95% | 100-200 |
| Peg-in-hole (1mm clearance) | 90% | 30-50% | 80-90% | 200-400 |
| Deformable manipulation (cloth) | 80% | 15-30% | 65-80% | 300-600 |
| Cable routing / insertion | 70% | 10-25% | 70-85% | 400-800 |
| Liquid pouring | 85% | 20-40% | 75-90% | 150-300 |
| Tool use (e.g., screwdriver) | 75% | 5-15% | 60-75% | 500-1000 |

The pattern is clear: tasks with richer contact dynamics show a larger sim-to-real gap. For top-down rigid pick-place, the gap is 10-20 percentage points — manageable with domain randomization. For tool use with complex contact sequences, the gap can exceed 60 percentage points, making sim-only training effectively unusable.

Cost-per-Quality-Demo Comparison

A common objection to real-data approaches is cost. Simulation data is "free" once you have the environment; real demonstrations require operators, hardware, and time. But this framing ignores the cost of the demonstration quality that ultimately determines policy performance:

| Data Source | Cost/Demo | Demos for 80% Success | Total Cost to 80% | Time to 80% |
|---|---|---|---|---|
| Sim-only (Isaac Sim) | ~$0.001 | Often unreachable for contact tasks | N/A (ceiling ~50-60%) | 2-4 weeks sim engineering |
| Sim + domain rand (tuned) | ~$0.01 (compute) | 500K-1M episodes | $5K-10K compute + 4-8 weeks engineering | 4-8 weeks |
| Real teleop (in-house) | $15-30 | 200-400 | $3K-12K + hardware setup | 2-3 weeks |
| Real teleop (SVRC managed) | $8-16 | 200-400 | $2,500 pilot / $8,000 campaign | 1-2 weeks |
| Hybrid (sim pre-train + real fine-tune) | $0.01 sim + $10 real | 50K sim + 100-200 real | $2K-5K total | 3-4 weeks |
The key insight: for contact-rich tasks, simulation alone often cannot reach the target success rate at any cost, because the performance ceiling is bounded by contact modeling fidelity. Real data reaches the target faster and more predictably, and the managed collection path (SVRC's data services) eliminates the overhead of setting up your own collection infrastructure.

Three Case Studies Where Simulation Fails

  • Soap bar grasping: Six teams across industry reported that policies trained in Isaac Sim with default physics-material friction parameters achieved >80% success in simulation but <40% on real hardware. The contact stiffness model was wrong. Switching to real teleoperation data brought success rates above 85% within 300 demonstrations.
  • Cable insertion: Deformable geometry is essentially unsolved in real-time physics engines. A USB-C cable's deformation under finger contact depends on its specific braid tension, jacket stiffness, and core compliance. Simulation policies for cable routing achieve roughly 20% success in real; teleoperation-trained policies with 500 demos achieve 70-80%.
  • Liquid pouring: Fluid dynamics simulation at the scale of a cup of water is computationally tractable with SPH or grid-based methods, but the interaction between fluid, cup rim geometry, and surface tension is complex enough that sim-trained policies systematically over-pour. A 200-demo teleoperation dataset produced policies that outperformed 50K-step RL policies trained in simulation.

What Simulation Actually Gets Right

This is not an argument against simulation — it is an argument for using simulation for what it is good at. Three areas where sim genuinely helps:

  • Free-space motion planning: Collision-free trajectory generation in known environments transfers well from sim to real. The physics that matter (rigid body kinematics) are modeled accurately.
  • Diverse scene generation: Simulation can generate thousands of object poses, table configurations, and environment layouts that would take weeks to set up physically. This diversity is valuable for pre-training visual representations.
  • Infinite data scale for coarse behaviors: Getting a robot to roughly orient toward a target, approach an object, or navigate a hallway can be bootstrapped from millions of simulated episodes. The coarse behavior transfers even if the fine-grained contact policy does not.
  • Pre-training visual encoders: Training a visual backbone on millions of simulated scenes with diverse object appearances, lighting conditions, and camera viewpoints produces representations that transfer well to real cameras. This is arguably the highest-ROI use of simulation today — the visual features learned from sim diversity improve downstream learning efficiency on real data by 2-5x.

When Simulation Data Helps: The Pre-Training Case

The strongest argument for simulation data is not as the primary training source but as a pre-training stage. Three specific use cases where sim pre-training measurably helps:

1. Visual representation pre-training. Train a ResNet-50 or ViT backbone on 1M simulated scenes with domain randomization. The resulting encoder learns object-level features (edges, shapes, spatial relationships) that transfer to real cameras. Fine-tuning on 200 real demonstrations takes 3-5x fewer iterations compared to training from scratch, and the resulting policy generalizes better to novel camera viewpoints and lighting conditions.

2. Coarse motor skill initialization. Pre-train a policy on 100K simulated episodes to learn the coarse motion primitive (approach object, orient gripper, close fingers). This reduces the real data requirement from 500 demos to 100-200 demos for many tasks, because the real fine-tuning only needs to correct the contact-phase behavior rather than learn the entire trajectory from scratch.

3. Failure mode discovery. Run the sim policy on 10K randomized scenarios and catalog the failure modes. Use this catalog to design targeted real data collection — collecting demonstrations specifically in the configurations where the sim policy fails. This failure-directed approach produces a dataset that is maximally informative for closing the sim-to-real gap.
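
The failure-directed collection loop can be sketched roughly as follows. Everything here is illustrative: the failure taxonomy, the rollout stub, and the outcome weights are stand-ins for a real simulator loop, not SVRC tooling.

```python
import random
from collections import Counter

def rollout_sim_policy(scenario: dict) -> str:
    """Placeholder: run the sim policy on one randomized scenario and
    return an outcome label. A real pipeline would call the simulator."""
    # Illustrative outcome distribution for a contact-rich task
    outcomes = ["success", "slip_at_grasp", "missed_approach", "jam_on_insert"]
    return random.choices(outcomes, weights=[55, 20, 15, 10])[0]

def build_failure_catalog(num_scenarios: int, seed: int = 0) -> Counter:
    """Catalog failure modes across randomized scenarios to target
    real teleop collection at the configurations where sim fails."""
    random.seed(seed)
    catalog = Counter()
    for _ in range(num_scenarios):
        scenario = {"object_pose": random.random(), "lighting": random.random()}
        outcome = rollout_sim_policy(scenario)
        if outcome != "success":
            catalog[outcome] += 1
    return catalog

catalog = build_failure_catalog(10_000)
for mode, count in catalog.most_common():
    print(mode, count)
```

The most frequent failure modes then become the configuration grid for targeted real demonstration collection.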

Hybrid Approaches: The Practical Frontier

The most effective teams in 2025 are not choosing between sim and real — they are using structured hybrid pipelines. The three most proven hybrid strategies:

Sim pre-train + real fine-tune (the standard approach): Train in sim, fine-tune on 100-500 real demos. Works well for tasks where the sim policy reaches 40-60% real success. Expected improvement: 20-40 percentage points from fine-tuning. This is what pi0, OpenVLA, and most VLA-based systems use for task specialization.

Real backbone + sim augmentation: Train the primary policy on real data, then augment with sim data at a 1:3 to 1:5 real:sim ratio. The sim data provides state-space coverage that would take weeks to collect in real. Key requirement: the sim scenes must be visually realistic enough that the visual encoder does not learn to distinguish sim from real (use FID < 50 as a threshold).
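
One way to hold the real:sim ratio fixed during training is a per-batch mixed sampler. A minimal sketch, assuming in-memory datasets (the dataset sizes and the 1:4 ratio are illustrative values within the 1:3 to 1:5 range above):

```python
import random

def mixed_batch(real_data: list, sim_data: list, batch_size: int,
                real_fraction: float = 0.2) -> list:
    """Sample one training batch with a fixed real:sim ratio (0.2 -> 1:4)."""
    n_real = max(1, round(batch_size * real_fraction))
    n_sim = batch_size - n_real
    batch = random.sample(real_data, n_real) + random.choices(sim_data, k=n_sim)
    random.shuffle(batch)
    return batch

real = [("real", i) for i in range(300)]    # e.g. 300 teleop demos
sim = [("sim", i) for i in range(50_000)]   # e.g. 50K sim episodes
batch = mixed_batch(real, sim, batch_size=64)
print(sum(1 for src, _ in batch if src == "real"))  # → 13 (about 1:4 real:sim)
```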

Sim for exploration + real for refinement: Use RL in sim to discover novel solutions to difficult tasks (e.g., tool use strategies, re-grasp sequences). Extract the discovered strategies as waypoint sequences. Then collect real demonstrations following these strategies and train a real policy. This approach leverages sim's ability to explore millions of trajectories while avoiding the contact modeling problem during deployment.

SVRC's Teleoperation Approach

SVRC's data collection methodology is designed specifically to maximize the signal-to-noise ratio in real demonstration data. Our approach differs from ad-hoc teleop collection in several key ways:

  • Structured variation protocol: Rather than collecting all demonstrations in a single scene configuration, we systematically vary object positions, orientations, and scene backgrounds across a predefined grid. This ensures the dataset covers the variation axes that matter for generalization, rather than over-representing a single configuration.
  • Operator tier system: Junior operators (week 1-2 of training) collect L1 tasks (simple pick-place). Senior operators (1+ months experience) handle L2 tasks (multi-step manipulation). QA Leads (3+ months) handle L3 tasks (precision insertion, tool use). This matching of operator skill to task complexity maximizes demo quality at each difficulty level.
  • Three-gate quality pipeline: Every demonstration passes through (1) automated heuristic checks (velocity limits, workspace bounds, episode length), (2) ML-based quality classifier trained on 50K+ labeled episodes, and (3) human spot-check review at 10% sampling rate. The reject rate is typically 15-25%, ensuring only clean demonstrations enter the final dataset.
  • Hardware standardization: All collection happens on calibrated, standardized workstations using OpenArm 101 ($4,500) or DK1 bimanual setups with synchronized multi-camera recording. This eliminates hardware-specific noise that degrades policy training across different collection setups.
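
The first, automated gate of that pipeline amounts to a handful of scalar checks per episode. A sketch with illustrative thresholds (SVRC's actual limits are task- and hardware-specific):

```python
def passes_heuristic_gate(episode: dict,
                          max_joint_vel: float = 2.0,  # rad/s, illustrative
                          workspace: tuple = ((-0.5, 0.5), (-0.5, 0.5), (0.0, 0.8)),
                          min_steps: int = 50,
                          max_steps: int = 2000) -> bool:
    """Gate 1: reject episodes that violate length, velocity, or workspace bounds."""
    if not (min_steps <= len(episode["actions"]) <= max_steps):
        return False
    if any(abs(v) > max_joint_vel
           for step in episode["joint_velocities"] for v in step):
        return False
    (x0, x1), (y0, y1), (z0, z1) = workspace
    for x, y, z in episode["ee_positions"]:
        if not (x0 <= x <= x1 and y0 <= y <= y1 and z0 <= z <= z1):
            return False
    return True

good = {"actions": [0] * 100,
        "joint_velocities": [[0.5, -0.3]] * 100,
        "ee_positions": [(0.1, 0.0, 0.3)] * 100}
print(passes_heuristic_gate(good))  # → True
```

Episodes that clear this gate proceed to the ML quality classifier and the human spot-check.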

The result: SVRC-collected datasets consistently achieve 15-25% higher policy success rates than equivalent-size datasets collected on ad-hoc setups, based on comparisons across 12 tasks on ACT and Diffusion Policy architectures.

The Right Mental Model: Coarse in Sim, Fine in Real

The most effective approach we have seen is a two-phase strategy. In phase one, use simulation to train a broad prior: the robot learns to approach objects, estimate grasp candidates, and execute rough pick motions across thousands of object categories. This phase can run overnight on a single A100 and produces a policy that gets within 10cm of the right grasp ~90% of the time.

In phase two, collect 200-500 real teleoperation demonstrations on the specific task. Fine-tune the simulation-pretrained model on this real data. The combination typically outperforms either sim-only or real-only approaches, and it reduces the real data requirement by 5-10x compared to training from scratch.

SVRC Observations from Data Collection Work

Across dozens of data collection projects at SVRC, we have consistently observed that data quality beats data quantity for contact tasks. A collection of 300 expert demonstrations from trained operators outperforms 1,500 demonstrations from novice operators on tasks involving contact, insertion, or surface-following. The expert operators instinctively avoid demonstrations that will confuse the policy — they use consistent, clean motions that the learning algorithm can actually extract signal from.

We have also observed that returns diminish after the first 200 demonstrations on most L2-complexity tasks (structured pick-place): the first 200 drive the bulk of base-task performance. The next 300 demonstrations improve robustness to novel object poses. Beyond 500, further improvement requires introducing new object instances and environmental variations, not simply more of the same.

The Hybrid Strategy in Practice

Our recommended approach for teams starting a new manipulation task: (1) build or download a simulation environment for coarse behavior pre-training, (2) run 50K-100K sim steps to initialize the policy, (3) collect 300-500 real teleoperation demonstrations through a structured data collection protocol, (4) fine-tune, (5) evaluate in 3 novel conditions you have not trained on. This pipeline typically produces deployment-ready policies in 4-6 weeks rather than the 3-6 months required for sim-only approaches.

```bash
# Example hybrid pipeline (pseudocode)

# Phase 1: Sim pre-training
python train.py \
  --data sim_dataset/ \
  --epochs 100 \
  --backbone resnet50 \
  --domain-rand visual+physics \
  --save checkpoint_sim.pt

# Phase 2: Real fine-tuning
# (lower LR for fine-tuning; freeze visual encoder for first 10 epochs)
python train.py \
  --data real_teleop_dataset/ \
  --epochs 50 \
  --backbone resnet50 \
  --resume checkpoint_sim.pt \
  --lr 1e-4 \
  --freeze-backbone 10

# Phase 3: Evaluation
python evaluate.py \
  --checkpoint checkpoint_finetuned.pt \
  --eval-scenes novel_scenes/ \
  --num-trials 100
```

For teams who want to start collecting real demonstration data today, SVRC's data collection services provide trained operators, calibrated hardware, and a structured quality pipeline — so you get clean, policy-ready data without building the infrastructure yourself. Pilot projects start at $2,500 for 200 demonstrations; full campaigns at $8,000 for 1,000+ demonstrations with quality guarantees.

Related Reading