Pushing Robotic Garment Folding Further with Limited Teleop Data and No Robot Access

Competition implementation report covering state conditioning, co-training, and future-latent policy conditioning.

Aakash Arul Mozhi Nambi

Chennai, India 25th June 2026

🏠 LeHome Challenge ICRA 2026
🥉 3rd in Real-World Evaluation (Finals) 🥉 3rd in Simulation Evaluation

I present my LeHome Challenge implementation for robotic garment folding under limited teleoperation data and no robot access. I had no development or testing time with the robot, so I could not run RL rollouts or collect additional data through teleoperation or human-in-the-loop DAgger, and was restricted to the teleoperation data provided by the organizers.

My implementation explores how to make heterogeneous human, simulation, and real-robot teleoperation data usable for VLA-based deformable manipulation while testing future conditioning for action chunks. I combined HaPTIC-processed human demonstrations, simulated robot data, and real-robot teleoperation logs through a validated FK/DIK pose contract, then tested state conditioning, co-training, and future-latent prefix tokens for π0.5 action-chunk prediction.

Due to limited time and compute resources, I was able to run only a few relevant ablations. More thorough ablations in a real-robot setting are needed to establish causality behind these results. I encourage labs and researchers to further explore the directions suggested in this writeup and to share critiques, thoughts, and improvements to this implementation.

TL;DR

Robotic garment folding stresses current VLA policies because cloth state is high-dimensional and contacts are unstable. In the LeHome Challenge, I had no development or testing access to the target robot, so I could not collect additional datasets, run human-in-the-loop DAgger-style collection, or run rollouts for RL. I therefore collected a large set of human demonstrations (4,180 episodes) and built a 16D pose-action contract that unifies the different embodiments for co-training, using forward kinematics, differential inverse kinematics, anchored normalization, and masking. I also made architecture changes to support future-latent conditioning, which gives the VLA model a compact look-ahead: action prediction conditioned on a predicted future, rather than only on the current state.

Across a common local simulation evaluation protocol, explicit robot-state conditioning produced the largest gain, while co-training and future-latent conditioning added smaller but useful improvements. The final policy improved the combined seen/unseen simulation average from 70.0% to 76.75%, improved unseen success from 64.0% to 70.25%, and was the model deployed in the LeHome Challenge real-world finals, where it placed third. The current evidence suggests future visual latents are useful, but does not yet prove whether they act as planning, denoising, progress estimation, or extra conditioning capacity.

As a small detour experiment, I also ran offline RL-style critic reranking in the simulation environment. The critic reranking notes and results are included later in the article as a side experiment, separate from the main training ladder.

| Training strategy | Seen avg | Unseen avg | Combined avg | What it tested |
| --- | --- | --- | --- | --- |
| Base, no state | 61.25% | 39.25% | 50.25% | Image-only action-chunk prediction |
| Base + state | 76.00% | 64.00% | 70.00% | Explicit current robot state as input |
| Human pretrain base (no normalization or masking) + state fine-tune | 79.25% | 63.50% | 71.38% | Human garment-diversity pretraining without normalization or noise filtering before robot fine-tuning |
| Co-train base | 81.25% | 65.75% | 73.50% | Weighted human + simulation + real-robot teleoperation mixture with normalization and masking |
| Co-train base + future latent | 83.25% | 70.25% | 76.75% | Injection of predicted horizon-10 visual-latent prefix tokens |
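
As a reading aid, the combined column is consistent with the unweighted mean of the seen and unseen averages in every row. This is inferred from the reported numbers rather than stated by the evaluation protocol. For the final model:

\[ \text{Combined} = \tfrac{1}{2}\left(\text{Seen} + \text{Unseen}\right) = \tfrac{1}{2}\left(83.25\% + 70.25\%\right) = 76.75\% \]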

What this write-up covers

  • A validated 16D camera-frame dual-arm pose contract joining human, simulation, and teleoperation data.
  • Anchored normalization and masking so noisy human-pose dimensions do not contaminate robot learning.
  • A future-latent sidecar that predicts horizon-10 visual embeddings and injects them as VLA prefix tokens.
  • A simulation-only offline critic reranker that scores sampled action chunks with predicted Monte Carlo return.
  • An honest ablation ladder under no-robot-access constraints.

What this does not prove

  • It does not prove future latents are explicit planning.
  • It does not include controlled real-robot ablations.
  • It does not separate future prediction from extra prefix-token capacity.
  • It does not solve grasping/contact robustness.

Why this may matter for robot foundation models

Scaling data collection through simulation, human demonstrations, and effective sim-to-real transfer is an active area of research. Robot foundation models need heterogeneous datasets to scale across tasks faster, but those datasets are not automatically compatible. This project suggests one way to train across datasets and embodiments with a unified representation contract, while also exploring whether future conditioning can improve VLA policy performance.

Approach

Data

The key constraint in this challenge was the lack of direct robot access, which ruled out additional data collection, robot telemetry gathering, and RL-based real-world tuning. To improve robustness under these constraints, I increased garment diversity by collecting human demonstrations of garment folding. I then extracted hand trajectories relative to the top-camera frame axes using HaPTIC-style hand-pose estimation.

Incorporating human demonstrations required moving away from raw robot joint-angle supervision, since joint states are not available for human data. Instead, I unified all datasets around end-effector pose representations. For the simulation and real-robot teleoperation data provided by the organizers, I built forward kinematics and differential inverse kinematics solvers to convert the original 12D dual-arm joint representation into a 16D dual-arm end-effector pose representation in the camera frame, using the LeRobot USD/URDF. I validated these transformations through FK/DIK consistency checks, quaternion-continuity checks, and visual trajectory inspection.

After conversion, I aligned coordinate conventions, axis definitions, and valid pose dimensions across human, simulation, and real-robot datasets. Real-robot teleoperation data served as the anchor reference for quantile scaling, reducing dataset-scale artifacts that the policy could otherwise exploit as shortcut features. Noisy or unreliable human-pose dimensions were masked during normalization and training. The resulting unified dataset was then merged with weighted sampling and used to fine-tune the π0.5 VLA policy.
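
The anchored quantile scaling described above can be sketched as a per-dimension affine map. This is a minimal illustration, not the exact pipeline used here: the function name `anchored_quantile_scale` and the 1%/99% quantile levels are assumptions chosen for the example.

```python
import numpy as np

def anchored_quantile_scale(source, anchor, lo=0.01, hi=0.99):
    """Affinely map each source dimension so that its [lo, hi] quantile
    range matches the same quantile range of the anchor dataset.

    source, anchor: arrays of shape [N, D] (N may differ between them).
    lo, hi: quantile levels (assumed values, not taken from the write-up).
    """
    source = np.asarray(source, dtype=np.float64)
    anchor = np.asarray(anchor, dtype=np.float64)
    s_lo, s_hi = np.quantile(source, [lo, hi], axis=0)
    a_lo, a_hi = np.quantile(anchor, [lo, hi], axis=0)
    # Guard against constant dimensions to avoid division by zero.
    span = np.where(np.abs(s_hi - s_lo) < 1e-12, 1.0, s_hi - s_lo)
    return (source - s_lo) / span * (a_hi - a_lo) + a_lo
```

In this sketch the real-robot teleoperation data would play the role of `anchor`, and the human or simulation data the role of `source`. Using quantiles rather than min/max keeps a few outlier frames from dominating the scale.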

The human, simulation, and real-robot top-camera configurations were significantly different, even though all three were top views.

LeHome dataset workflow from human HaPTIC data, simulation data, and real-robot teleoperation data into normalized co-training data.

| Dataset type | Human data | Simulation data (vanilla) | Real teleoperation data (vanilla) |
| --- | --- | --- | --- |
| Source | Custom data collection and post-processing pipeline | Provided by the competition organizers | Provided by the competition organizers |
| Cameras | Top (left wrist and right wrist not present) | Top, left wrist, right wrist | Top, left wrist, right wrist |
| Episodes | 4180 | 1000 | 500 |
| Total number of garments | ~250 (including different complex orientations) | 40 | 25 |
| Garment types | Long-sleeve tops, short-sleeve tops, long pants, shorts | Long-sleeve tops, short-sleeve tops, long pants, shorts | Long-sleeve tops, short-sleeve tops, long pants, shorts |
| Total number of frames (data points) | 724,399 | 265,798 | 187,135 |
| Duration | ~8.4 hours | ~2.5 hours | ~2.6 hours |
| Dataset FPS | ~24 | 30 | 20 |
| Dimension 0 | Left-arm end-effector X position relative to camera-frame axes | Left-arm shoulder pan | Left-arm shoulder pan |
| Dimension 1 | Left-arm end-effector Y position relative to camera-frame axes | Left-arm shoulder lift | Left-arm shoulder lift |
| Dimension 2 | Left-arm end-effector Z position relative to camera-frame axes | Left-arm elbow flex | Left-arm elbow flex |
| Dimension 3 | Left-arm end-effector orientation W quaternion relative to camera-frame axes | Left-arm wrist flex | Left-arm wrist flex |
| Dimension 4 | Left-arm end-effector orientation X quaternion relative to camera-frame axes | Left-arm wrist roll | Left-arm wrist roll |
| Dimension 5 | Left-arm end-effector orientation Y quaternion relative to camera-frame axes | Left-arm gripper value | Left-arm gripper value |
| Dimension 6 | Left-arm end-effector orientation Z quaternion relative to camera-frame axes | Right-arm shoulder pan | Right-arm shoulder pan |
| Dimension 7 | Not present | Right-arm shoulder lift | Right-arm shoulder lift |
| Dimension 8 | Right-arm end-effector X position relative to camera-frame axes | Right-arm elbow flex | Right-arm elbow flex |
| Dimension 9 | Right-arm end-effector Y position relative to camera-frame axes | Right-arm wrist flex | Right-arm wrist flex |
| Dimension 10 | Right-arm end-effector Z position relative to camera-frame axes | Right-arm wrist roll | Right-arm wrist roll |
| Dimension 11 | Right-arm end-effector orientation W quaternion relative to camera-frame axes | Right-arm gripper value | Right-arm gripper value |
| Dimension 12 | Right-arm end-effector orientation X quaternion relative to camera-frame axes | Not present | Not present |
| Dimension 13 | Right-arm end-effector orientation Y quaternion relative to camera-frame axes | Not present | Not present |
| Dimension 14 | Right-arm end-effector orientation Z quaternion relative to camera-frame axes | Not present | Not present |
| Dimension 15 | Not present | Not present | Not present |

Post-Processing & Unification

Why a Shared Pose Space Was Needed

The three available data sources were represented in different action spaces. The simulation and real-robot teleoperation datasets were provided by the LeHome Challenge organizers in the native robot joint-space format, while the additional human demonstrations were collected separately with a similar top camera view. For easier model learning and generalization across all three datasets, the data needed to share one representation space. Therefore the 16D dual-arm pose representation was chosen as the unified policy-facing representation. This allows the policy to learn from all three datasets consistently, while still outputting joint commands for the robot during deployment.

Unified Dual-Arm Pose (Relative to the Top-Camera Frame)

The unified policy-facing representation is a 16D dual-arm pose:

\[ \mathbf{p}^{16} = \left[ \mathbf{p}^{8}_{L}, \mathbf{p}^{8}_{R} \right] \]

Each arm pose is represented as:

\[ \mathbf{p}^{8} = \left[ x,\ y,\ z,\ q_w,\ q_x,\ q_y,\ q_z,\ g \right] \]

Expanded Representation

Here, \((x,y,z)\) denotes end-effector position, \((q_w,q_x,q_y,q_z)\) denotes orientation as a quaternion, and \(g\) denotes the gripper value.

\[ \mathbf{p}^{16} = \left[ x_L,\ y_L,\ z_L,\ q_{w,L},\ q_{x,L},\ q_{y,L},\ q_{z,L},\ g_L,\ x_R,\ y_R,\ z_R,\ q_{w,R},\ q_{x,R},\ q_{y,R},\ q_{z,R},\ g_R \right] \]

This 16D camera-frame representation aligns human, simulation, and real-robot demonstrations under one pose-observation-action contract.
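
The index layout of the 16D vector can be made explicit with a small helper. This is an illustrative sketch: the names `LEFT`, `RIGHT`, and `split_pose16` are hypothetical, and only the field ordering comes from the definition above.

```python
import numpy as np

# Ordering follows the text: [x, y, z, qw, qx, qy, qz, g] for the left arm,
# then the same eight fields for the right arm.
LEFT = slice(0, 8)
RIGHT = slice(8, 16)

def split_pose16(p16):
    """Split a 16D dual-arm pose into per-arm position, quaternion, gripper."""
    p16 = np.asarray(p16, dtype=np.float64)
    if p16.shape != (16,):
        raise ValueError(f"expected shape (16,), got {p16.shape}")
    arms = {}
    for name, sl in (("left", LEFT), ("right", RIGHT)):
        arm = p16[sl]
        arms[name] = {
            "pos": arm[0:3],        # x, y, z in the camera frame
            "quat_wxyz": arm[3:7],  # orientation, scalar-first
            "gripper": arm[7],      # gripper value g
        }
    return arms
```

With this layout, the human data's missing gripper values fall at indices 7 and 15, which matches the "Not present" rows in the dataset table.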

Robot Model and Kinematic Solver Selection

To convert the organizer-provided simulation and real-robot datasets from 12D joint space into 16D end-effector pose space, I implemented forward kinematics and differential inverse kinematics solvers using the robot model files provided with the challenge environment.

Since the simulation evaluation was performed in Isaac Sim, the available USD and URDF robot descriptions were compared through FK/DIK round-trip consistency, quaternion-continuity checks, and visual trajectory inspection. Based on these checks, the USD-derived kinematic chain was selected because it more closely matched the behavior of the LeHome simulation and evaluation setup.

Training-Time Transform

The transformation used during training converts robot joints into camera-frame end-effector pose:

\[ \mathbf{q}^{12} \xrightarrow{\mathrm{FK}} \mathbf{T}^{W}_{ee} \xrightarrow{T^{C}_{W}} \mathbf{p}^{16}_{C} \]

\(\mathbf{q}^{12}\) is the 12D dual-arm joint representation, \(\mathbf{T}^{W}_{ee}\) is the world-frame end-effector transform, \(T^{C}_{W}\) is the world-to-camera transform, and \(\mathbf{p}^{16}_{C}\) is the final camera-frame pose.
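
The world-to-camera step of this transform can be sketched with quaternion algebra. This is a minimal example that assumes \(T^{C}_{W}\) is available as a unit rotation quaternion plus a translation; the function names and the example camera extrinsics are hypothetical, and the FK step that produces the world-frame pose is not shown.

```python
import numpy as np

def quat_mul(a, b):
    """Hamilton product of two quaternions in (w, x, y, z) order."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return np.array([
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    ])

def quat_rotate(q, v):
    """Rotate a 3D vector v by a unit quaternion q in (w, x, y, z) order."""
    q = np.asarray(q, dtype=np.float64)
    v = np.asarray(v, dtype=np.float64)
    w, u = q[0], q[1:4]
    return v + 2.0 * w * np.cross(u, v) + 2.0 * np.cross(u, np.cross(u, v))

def world_to_camera(pos_w, quat_w, cam_quat, cam_trans):
    """Apply T^C_W (rotation cam_quat, translation cam_trans) to a world pose."""
    pos_c = quat_rotate(cam_quat, pos_w) + np.asarray(cam_trans, dtype=np.float64)
    quat_c = quat_mul(cam_quat, quat_w)
    return pos_c, quat_c / np.linalg.norm(quat_c)

# Hypothetical extrinsics: camera frame rotated 180 degrees about world x.
pos_c, quat_c = world_to_camera(
    pos_w=[0.1, 0.2, 0.3],
    quat_w=[1.0, 0.0, 0.0, 0.0],
    cam_quat=[0.0, 1.0, 0.0, 0.0],
    cam_trans=[0.0, 0.0, 1.0],
)
```

Applying this per arm and appending the gripper value gives one 8D block of the camera-frame pose.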

Deployment-Time Transform

During deployment, the policy predicts in 16D pose space, while the LeHome evaluation interface expects 12D joint commands:

\[ \mathbf{p}^{16}_{C} \xrightarrow{T^{W}_{C}} \mathbf{p}^{16}_{W} \xrightarrow{\mathrm{DIK}} \mathbf{q}^{12} \]

This lets the policy learn in a shared camera-frame end-effector representation while still producing the joint-space commands required by the robot controller.
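
To illustrate the general idea of differential IK seeded from the current joint state, the sketch below runs a damped least-squares update on a planar two-link arm. It is not the solver used in this project, which operates on the USD-derived dual-arm chain. The link lengths, iteration count, damping, and step size `alpha` are placeholder values.

```python
import numpy as np

L1, L2 = 1.0, 1.0  # hypothetical link lengths of a planar two-link arm

def fk(q):
    """End-effector (x, y) for joint angles q = (q1, q2)."""
    return np.array([
        L1 * np.cos(q[0]) + L2 * np.cos(q[0] + q[1]),
        L1 * np.sin(q[0]) + L2 * np.sin(q[0] + q[1]),
    ])

def jacobian(q):
    """Analytic 2x2 Jacobian of fk with respect to q."""
    s1, c1 = np.sin(q[0]), np.cos(q[0])
    s12, c12 = np.sin(q[0] + q[1]), np.cos(q[0] + q[1])
    return np.array([
        [-L1 * s1 - L2 * s12, -L2 * s12],
        [L1 * c1 + L2 * c12, L2 * c12],
    ])

def dik_solve(target, q_init, iters=50, damping=0.05, alpha=0.5):
    """Damped least-squares differential IK, seeded from the current joints."""
    q = np.array(q_init, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    for _ in range(iters):
        err = target - fk(q)
        J = jacobian(q)
        # dq = J^T (J J^T + lambda^2 I)^-1 err
        dq = J.T @ np.linalg.solve(J @ J.T + damping**2 * np.eye(2), err)
        q += alpha * dq
    return q
```

Seeding from the current joint state keeps the solution on the same kinematic branch as the robot's present configuration. The iteration count and damping trade accuracy against latency.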


Validation

FK/DIK Round-Trip Validation

The FK and DIK transforms were validated with a round-trip consistency check. The goal was to verify that a policy-space action remains stable after passing through the full conversion loop: joint-space action to camera-frame end-effector action, back to joint space, and then again to camera-frame end-effector action.

The round-trip validation showed that real-robot teleoperation data stayed accurate across the full distribution. Simulation data also remained accurate through the 99.9th percentile, with position error below 1 cm and orientation error below 0.5 degrees. The simulation maximum was treated as an outlier given the number of processed frames and arm samples. Higher iteration counts, lower damping, and multiple line-search alphas produced only small accuracy gains, while the added latency slowed robot execution.

\[ \mathbf{a}^{12} \rightarrow \mathbf{a}^{16}_{C} \rightarrow \hat{\mathbf{a}}^{12} \rightarrow \hat{\mathbf{a}}^{16}_{C} \]

For each selected frame, the original 12D joint-space state and action are loaded first:

\[ \mathbf{q}^{12}_{t}, \qquad \mathbf{a}^{12}_{t} \]

The input transform converts the 12D joint-space action into a 16D camera-frame action:

\[ \mathbf{a}^{16}_{C,t} = f_{\mathrm{FK}} \left( \mathbf{a}^{12}_{t} \right) \]

The output transform then converts this camera-frame target back into a 12D joint-space command using differential IK and the current 12D robot state:

\[ \hat{\mathbf{a}}^{12}_{t} = f_{\mathrm{DIK}} \left( \mathbf{a}^{16}_{C,t}, \mathbf{q}^{12}_{t} \right) \]

Finally, the recovered joint-space action is passed through the FK input transform again:

\[ \hat{\mathbf{a}}^{16}_{C,t} = f_{\mathrm{FK}} \left( \hat{\mathbf{a}}^{12}_{t} \right) \]

The comparison is made in camera-frame end-effector pose space, because that is the representation used by the policy during training and evaluation:

\[ \mathbf{a}^{16}_{C,t} \quad \text{vs.} \quad \hat{\mathbf{a}}^{16}_{C,t} \]

For each arm, position error is the Euclidean distance between the original and reconstructed camera-frame end-effector positions:

\[ e_{pos} = \left\| \mathbf{x}_{C} - \hat{\mathbf{x}}_{C} \right\|_2 \]

Orientation error is computed as quaternion angular distance using normalized quaternions, then reported in degrees:

\[ e_{rot} = 2\cos^{-1} \left( \left| \left\langle \mathbf{q}, \hat{\mathbf{q}} \right\rangle \right| \right) \] \[ e^{\circ}_{rot} = \frac{180}{\pi} e_{rot} \]

Round-trip validation flowchart showing original joint action, FK to 16D camera action, DIK back to joint action, and FK reconstruction.
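
The two error metrics and the percentile summary can be written compactly. This is a sketch of the formulas above; the function names are hypothetical, and quaternions are assumed to be in \((w, x, y, z)\) order.

```python
import numpy as np

def position_error(x, x_hat):
    """Euclidean distance between original and reconstructed positions."""
    diff = np.asarray(x, dtype=np.float64) - np.asarray(x_hat, dtype=np.float64)
    return float(np.linalg.norm(diff))

def rotation_error_deg(q, q_hat):
    """Quaternion angular distance in degrees, invariant to sign flips."""
    q = np.asarray(q, dtype=np.float64)
    q_hat = np.asarray(q_hat, dtype=np.float64)
    q = q / np.linalg.norm(q)
    q_hat = q_hat / np.linalg.norm(q_hat)
    # Clip guards against |dot| slightly exceeding 1 from rounding.
    dot = np.clip(abs(np.dot(q, q_hat)), 0.0, 1.0)
    return float(np.degrees(2.0 * np.arccos(dot)))

def summarize(errors):
    """Median, tail percentiles, and maximum of a set of errors."""
    errors = np.asarray(errors, dtype=np.float64)
    p50, p95, p99, p999 = np.percentile(errors, [50, 95, 99, 99.9])
    return {"median": p50, "p95": p95, "p99": p99, "p99.9": p999,
            "max": errors.max()}
```

The absolute value on the dot product makes \(\mathbf{q}\) and \(-\mathbf{q}\) compare as the same rotation, so the metric is unaffected by quaternion sign ambiguity.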

Round-Trip Validation Results

| Dataset | Coverage | Error type | Median | P95 | P99 | P99.9 | Max |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Human | Not applicable: already represented in ground-truth 16D end-effector pose format. | | | | | | |
| Simulation data | 1000 episodes, 265,798 frames, 531,596 arm samples | Position (m) | 6.88e-05 | 1.81e-04 | 8.58e-04 | 1.75e-03 | 2.21e-01 |
| Simulation data | 1000 episodes, 265,798 frames, 531,596 arm samples | Orientation (deg) | 2.40e-04 | 2.14e-02 | 1.00e-01 | 4.11e-01 | 55.105 |
| Real teleoperation data | 500 episodes, 187,135 frames, 374,270 arm samples | Position (m) | 8.43e-05 | 3.18e-04 | 7.46e-04 | 1.33e-03 | 4.60e-03 |
| Real teleoperation data | 500 episodes, 187,135 frames, 374,270 arm samples | Orientation (deg) | 2.08e-04 | 1.79e-03 | 6.38e-03 | 3.40e-01 | 0.591 |

Quaternion-Continuity Validation

In addition to round-trip accuracy, quaternion-continuity checks were used to avoid discontinuous orientation representations caused by quaternion sign ambiguity. Quaternions \(\mathbf{q}\) and \(-\mathbf{q}\) represent the same physical rotation, so consecutive frames can appear to jump if the sign convention is not made consistent over time.

For consecutive quaternions \(\mathbf{q}_{t}\) and \(\mathbf{q}_{t+1}\), sign consistency is enforced by checking the dot product:

\[ \left\langle \mathbf{q}_{t}, \mathbf{q}_{t+1} \right\rangle < 0 \]

If this condition is true, the next quaternion is flipped before it is stored or compared:

\[ \mathbf{q}_{t+1} \leftarrow - \mathbf{q}_{t+1} \]

This keeps the orientation trajectory smooth across time and prevents the model from seeing artificial discontinuities that do not correspond to real end-effector motion. The corrected trajectories were then inspected visually to confirm that the reconstructed end-effector paths followed the intended folding motion without sudden orientation flips.
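
The sign-consistency rule can be applied sequentially over a trajectory, as in the following sketch. The function name is hypothetical. Each quaternion is compared against the already-corrected previous one so that a flip propagates forward.

```python
import numpy as np

def enforce_quaternion_continuity(quats):
    """Flip signs so consecutive quaternions have a non-negative dot product.

    quats: array of shape [T, 4]. Returns a corrected copy.
    """
    quats = np.array(quats, dtype=np.float64)  # np.array copies the input
    for t in range(1, len(quats)):
        # Compare against the corrected previous frame, not the raw one.
        if np.dot(quats[t - 1], quats[t]) < 0.0:
            quats[t] = -quats[t]
    return quats
```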

Visual Validation

The relevant pose dimensions were plotted directly alongside the top-camera video for all three datasets. The camera-frame convention used here has \(z\) pointing into the image (away from the camera), \(x\) pointing right from the center of the video, and \(y\) pointing down from the center of the video.

[Visual validation panels, one per dataset type: human pretraining data, simulation data, real teleoperation data]

Training Strategy

π0.5 Base - No State Injection

Dataset: Simulation only

As the first baseline, I fine-tuned the base \(\pi_{0.5}\) VLA policy using only the transformed simulation dataset. I used simulation-only training because, without direct robot access, the only environment where I could perform systematic evaluation was Isaac Sim. The Isaac Sim evaluation environment was provided by the challenge organizers and used reference-point-based geometric checks as garment-folding checkpoints to determine whether each garment was folded correctly.

In this setup, the policy received only visual observations as input: the top camera image and both wrist-camera images. I did not inject the current robot state into the policy input. In other words, the current absolute end-effector positions, orientations, and gripper values were not provided to the model. The policy was therefore required to predict future action chunks directly from images alone.

The model output was a sequence of absolute end-effector targets in the unified 16D camera-frame pose space. Each action consisted of the position, orientation, and gripper value for both arms:

\[ \mathbf{a}^{16} = \left[ x_L, y_L, z_L, q_{w,L}, q_{x,L}, q_{y,L}, q_{z,L}, g_L, x_R, y_R, z_R, q_{w,R}, q_{x,R}, q_{y,R}, q_{z,R}, g_R \right]. \]

This baseline tested whether the policy could infer both the garment state and the required dual-arm end-effector motion purely from visual context. However, since the current end-effector state was not provided, the prediction problem was under-constrained: the same image observation can correspond to different robot configurations. This made the experiment useful as an initial reference point before adding explicit state conditioning in later training strategies.

The FK and DIK transforms were integrated into the \(\pi_{0.5}\) policy pipeline. The FK transform converted the original 12D simulation joint-space data into the 16D camera-frame end-effector representation used for training, while the DIK output transform converted predicted 16D end-effector targets back into the 12D joint-space commands required by the LeHome evaluation interface.

Evaluation Results

| Garment split | Long-sleeve top | Short-sleeve top | Long pants | Shorts | Overall average |
| --- | --- | --- | --- | --- | --- |
| Seen | 72.0% | 44.0% | 34.0% | 95.0% | 61.25% |
| Unseen | 66.0% | 16.0% | 29.0% | 46.0% | 39.25% |

The simulation-only, no-state baseline performed reasonably on some seen garments, especially shorts (95.0%), but struggled on short-sleeve tops and long pants, categories that appear to require more precise pose-conditioned manipulation. The low unseen scores, particularly on short-sleeve tops (16.0%) and long pants (29.0%), suggested that visual-only conditioning was insufficient for robust generalization. This motivated the next training variants, where the current end-effector state was injected and the training data was expanded beyond simulation-only demonstrations.

π0.5 Base - with State Injection

Dataset: Simulation only

The next baseline kept the same transformed simulation-only dataset and the same FK/DIK action pipeline, but added the current robot state to the policy input. The policy still received the top camera image and both wrist-camera images, but it was no longer forced to infer the robot configuration from images alone.

In this setup, the current 16D camera-frame end-effector state was provided as an additional input during model training. This state represented the current absolute position, orientation, and gripper value for both end effectors in the same unified pose convention used by the action targets.

The FK and DIK transforms remained the same as in the no-state baseline. FK converted the original 12D simulation joint-space logs into the 16D camera-frame state and action representation used for training, while DIK converted predicted 16D end-effector targets back into the 12D joint-space commands required by the LeHome evaluation interface.

Evaluation Results

| Garment split | Long-sleeve top | Short-sleeve top | Long pants | Shorts | Overall average |
| --- | --- | --- | --- | --- | --- |
| Seen | 75.0% | 78.0% | 58.0% | 93.0% | 76.0% |
| Unseen | 92.0% | 37.0% | 60.0% | 67.0% | 64.0% |

Adding the current end-effector state substantially improved the simulation baseline, especially for seen short-sleeve tops (44.0% to 78.0%) and unseen long-sleeve tops (66.0% to 92.0%). The remaining gap on unseen short-sleeve tops (37.0%) still pointed to category-level generalization limits, but the overall improvement indicated that explicit state conditioning made the action prediction problem better constrained.

π0.5 Pretraining + Robot State Fine-Tuning

Datasets: Human demonstrations without masking or normalization + simulation data

The third strategy used a two-stage training pipeline. In the first stage, I trained a vanilla \(\pi_{0.5}\) base model on the human demonstration dataset for four epochs. This stage exposed the model to a wider variety of garment shapes, folds, and initial configurations before it was optimized on the robot simulation data.

For the human-demonstration pretraining stage, I kept the full 16D pose representation, used the top-camera image, masked the unavailable wrist images, and injected the current state. Only the gripper dimensions were masked; all other dimensions were left unchanged. The goal was to let the model learn from the complete human trajectory signal in the same unified camera-frame end-effector convention used by the later robot training stage, even though many dimensions were noisy. I also injected state during human pretraining to help the model learn the relationship between the current state and the next action.

After the human-demonstration stage finished, the resulting weights were used as the initialization for a second four-epoch fine-tuning stage on the transformed simulation-only robot dataset. This fine-tuning stage used state injection, so the current 16D absolute end-effector positions, orientations, and gripper values were provided to the model along with the top, left-wrist, and right-wrist camera images.

The FK and DIK transforms remained unchanged. FK converted robot joint-space logs into the unified 16D camera-frame state/action representation for training, while DIK converted predicted 16D end-effector targets back into 12D joint-space commands for simulation evaluation.

Evaluation Results

| Garment split | Long-sleeve top | Short-sleeve top | Long pants | Shorts | Overall average |
| --- | --- | --- | --- | --- | --- |
| Seen | 77.0% | 81.0% | 64.0% | 95.0% | 79.25% |
| Unseen | 85.0% | 35.0% | 64.0% | 70.0% | 63.50% |

Compared with the \(\pi_{0.5}\) Base - with State Injection baseline, this two-stage pretraining and fine-tuning run improved the overall seen-garment score by 3.25 percentage points, but the unseen average slipped by 0.5 points. The combined seen/unseen mean still increased by 1.38 points. The clearest gains came from long pants, where performance increased by 6.0 points on seen garments and 4.0 points on unseen garments, while the drop on unseen tops showed that unmasked, unnormalized human pretraining was not uniformly beneficial.

Simulation-Only Offline RL Detour

Offline Critic Reranking from the \(\pi_{0.5}\) Base

After the two-stage \(\pi_{0.5}\) pretraining and robot-state fine-tuning pipeline, I implemented a small simulation-only critic reranking pass to push performance further. The critic was trained offline from rewarded simulation rollouts because collecting and scoring many candidate episodes across garments was much easier in simulation. Under the time constraints, this remained an offline simulation reranker rather than an online real-robot critic loop.


π0.5 Co-Training

Datasets: Human demonstrations + simulation + real-robot teleoperation

The fourth strategy moved from sequential pretraining and fine-tuning to direct co-training across all three available data sources. Human demonstrations, simulation demonstrations, and real-robot teleoperation data were mixed into a single training run after conversion into the shared policy-facing representation.

The dataset sampler was weighted so that human demonstrations used a weight of 1.5, simulation demonstrations used a weight of 4, and real-robot teleoperation demonstrations used a weight of 8. This made the effective number of samples from the three sources closer after accounting for dataset size, while still giving the strongest priority to the real-robot teleoperation data as the closest match to the evaluation embodiment.
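
The effect of these weights can be checked with simple arithmetic on the frame counts from the dataset table. The sketch assumes each weight acts as a per-frame sampling multiplier, which may differ from how the actual sampler applies it.

```python
# Frame counts from the dataset table and sampler weights from the text.
frames = {"human": 724_399, "simulation": 265_798, "real": 187_135}
weights = {"human": 1.5, "simulation": 4.0, "real": 8.0}

def effective_shares(frames, weights):
    """Share of sampled frames per source, assuming weight multiplies frames."""
    weighted = {name: frames[name] * weights[name] for name in frames}
    total = sum(weighted.values())
    return {name: value / total for name, value in weighted.items()}

shares = effective_shares(frames, weights)
for name, share in shares.items():
    print(f"{name:>10}: {share:.1%}")
```

Under that assumption the mixture is roughly 30% human, 29% simulation, and 41% real teleoperation, compared with roughly 62%, 23%, and 16% if frames were sampled uniformly.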

Before co-training, the datasets were passed through per-dimension scale normalization and masking. For human demonstrations, only the left end-effector \(x,y\) position and right end-effector \(x,y\) position were kept as reliable action/state dimensions; the remaining 12 dimensions in the 16D pose vector were masked because they were noisy. Human wrist-camera observations were also masked, so only the top image was used for human demonstrations. Simulation and real-robot teleoperation data used the same FK/DIK conversion path as the earlier robot-only runs.
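
The per-source dimension mask can be expressed as a boolean vector over the 16D pose. In the sketch below, the mask indices follow the text (left \(x,y\) at indices 0 and 1, right \(x,y\) at indices 8 and 9), while `masked_mse` is only an illustrative stand-in for how a mask can gate a loss. It is not the \(\pi_{0.5}\) training objective.

```python
import numpy as np

# Human demonstrations keep only left (x, y) and right (x, y).
HUMAN_VALID = np.zeros(16, dtype=bool)
HUMAN_VALID[[0, 1, 8, 9]] = True
# Robot data (simulation and real teleoperation) keeps all 16 dimensions.
ROBOT_VALID = np.ones(16, dtype=bool)

def masked_mse(pred, target, valid):
    """Mean squared error over valid dimensions only.

    pred, target: [B, 16]; valid: [B, 16] boolean mask.
    """
    valid = np.asarray(valid, dtype=np.float64)
    sq = (np.asarray(pred) - np.asarray(target)) ** 2 * valid
    return float(sq.sum() / max(valid.sum(), 1.0))
```

Because masked dimensions are multiplied by zero before aggregation, noisy human depth, orientation, and gripper values contribute nothing to the gradient.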

The human \(x,y,z\) position ranges were quantile-scaled to better match the robot datasets before the combined weighted dataset was passed into the \(\pi_{0.5}\) normalization pipeline. State was injected during co-training, and the policy continued to predict absolute action chunks in the unified 16D camera-frame end-effector space.

Evaluation Results

| Garment split | Long-sleeve top | Short-sleeve top | Long pants | Shorts | Overall average |
| --- | --- | --- | --- | --- | --- |
| Seen | 82.0% | 79.0% | 68.0% | 96.0% | 81.25% |
| Unseen | 90.0% | 28.0% | 65.0% | 80.0% | 65.75% |

Compared with the \(\pi_{0.5}\) pretraining + robot state fine-tuning run, co-training replaced sequential human pretraining and simulation fine-tuning with a single weighted mixture of human demonstrations, simulation data, and real-robot teleoperation data under a normalized state/action contract with noisy human dimensions masked. Evaluated in Isaac Sim, this produced a further 2.0 percentage-point gain on the seen average and a 2.25-point gain on the unseen average. The strongest improvements were on long-sleeve tops (+5.0 seen, +5.0 unseen), long pants (+4.0 seen, +1.0 unseen), and unseen shorts (+10.0). Short-sleeve tops dropped relative to the two-stage baseline (-2.0 seen, -7.0 unseen).

From State Conditioning to Future Conditioning

Up to this point, the training strategies mainly focused on making heterogeneous data usable: bringing human demonstrations, simulation rollouts, and real-robot teleoperation data into one normalized state/action contract, then masking noisy dimensions so the policy did not learn from unreliable supervision. This gave a clear improvement over robot-only training. The larger jump, however, came from adding the current robot state to the image observations. With state injection, the same top and wrist-camera images were grounded by the current end-effector positions, orientations, and gripper values, so the model no longer had to infer the robot configuration from pixels alone before predicting an action chunk.

This raised the next question: if the current state helps because it grounds where the robot is now, could the policy also be conditioned on where the scene is likely to go next? Directly generating future images would be computationally expensive and would have required more research time and resources than were available. Instead, I explored a lighter future-latent signal: predict compact future image embeddings from the current top, left-wrist, and right-wrist image embeddings, then provide those future embeddings to the policy as an additional conditioning stream. The goal was to give the VLA model a hint of the intended near-future visual state, so that the current images ground the current state while the future latents ground the intended next state.

The Future Latents

Predicting an entire future image from the current image state would be expensive and would require substantial data to train a reliable future image generator. A more practical target is the image embedding already used by the \(\pi_{0.5}\) model: a dense visual feature representation produced by the policy image encoder.

For this stage, I used the co-training base as the starting point because its image encoder had already learned garment-folding priors from human demonstrations, simulation data, and real-robot teleoperation data. During future-latent preparation, the image encoder was kept frozen. I generated embeddings for the top, left-wrist, and right-wrist images, then paired each current embedding with the embedding of a future frame at the chosen prediction horizon.
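
The current/future pairing can be sketched as index pairs within one episode. The sketch assumes that pairs never cross episode boundaries and that the last frames of an episode, which have no future frame at the chosen horizon, are dropped. Neither detail is specified above, and `future_pairs` is a hypothetical name.

```python
def future_pairs(num_frames, horizon=10):
    """Return (current, future) frame-index pairs within one episode.

    Frames in the final `horizon` steps have no future target and are skipped.
    """
    return [(t, t + horizon) for t in range(max(num_frames - horizon, 0))]

# Example: a 25-frame episode yields 15 training pairs at horizon 10.
pairs = future_pairs(25)
```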

The raw visual embedding is still large: each camera produces approximately \(256\) image tokens of width \(2048\) (a \(256 \times 2048\) embedding), so three cameras produce about \(768\) tokens at width \(2048\). Predicting that full future representation directly would be data-hungry and computationally heavy. Instead, I first compressed the image-token embeddings into a lower-dimensional latent space, trained a future predictor in that compact space, and then used a policy adapter to project the predicted future latents back into the \(\pi_{0.5}\) prefix-token width.

The image encoder remained frozen during this process. If the encoder changed while training the final future-latent policy, the separately trained resampler and future predictor would no longer operate on the same embedding space they were trained for. Therefore the resampler and future predictor were trained separately, loaded into the policy path, and used as fixed modules while the future-latent conditioning pathway was integrated with the VLA policy.

Future latent architecture showing pi0.5-style image embeddings, compression, future prediction, policy adapter, and prefix-token conditioning.

Resampler Autoencoder

I created datasets for training the resampler and future predictor by passing organizer-provided simulation and real-robot teleoperation images through the image encoder of the \(\pi_{0.5}\) co-training base. The resampler does not predict the future directly; instead, it compresses the large image-token embeddings produced by the frozen \(\pi_{0.5}\) image encoder into a compact latent space that the future predictor can learn over more efficiently.

The input to the resampler is the stacked embedding tensor for the three cameras: top, right wrist, and left wrist. Each camera contributes \(256\) image tokens with width \(2048\), giving an input shape of \([B, 3, 256, 2048]\). The camera-specific encoder first normalizes each camera embedding, projects the token width from \(2048\) to \(512\), adds learned input positional embeddings, and then uses 24 learned latent queries with two cross-attention blocks to produce a compact latent of \([B, 24, 512]\) per camera.

The decoder mirrors this compression path. For each camera, the compact latent receives learned latent positional embeddings, then 256 learned output queries attend back into the compact representation through two cross-attention blocks. A final projection expands the representation back from \(512\) to \(2048\), reconstructing \([B, 256, 2048]\) per camera and \([B, 3, 256, 2048]\) after stacking all cameras.
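The shape flow through one camera's encoder and decoder can be traced with a stripped-down sketch. It is deliberately simplified and should not be read as the trained architecture: it uses one single-head cross-attention step in each direction instead of two 16-head blocks, omits normalization and positional embeddings, and uses random weights. All variable names are illustrative.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(queries, context):
    """Single-head scaled dot-product cross-attention: queries read from context."""
    scores = queries @ context.T / np.sqrt(queries.shape[-1])
    return softmax(scores, axis=-1) @ context


rng = np.random.default_rng(0)
tokens = rng.standard_normal((256, 2048)).astype(np.float32)   # one camera's image tokens
w_in = (rng.standard_normal((2048, 512)) / np.sqrt(2048)).astype(np.float32)
w_out = (rng.standard_normal((512, 2048)) / np.sqrt(512)).astype(np.float32)
latent_queries = rng.standard_normal((24, 512)).astype(np.float32)    # learned in practice
output_queries = rng.standard_normal((256, 512)).astype(np.float32)   # learned in practice

projected = tokens @ w_in                          # [256, 512]
latent = cross_attend(latent_queries, projected)   # [24, 512] compact latent
decoded = cross_attend(output_queries, latent)     # [256, 512]
reconstruction = decoded @ w_out                   # [256, 2048]

compression = tokens.size / latent.size            # 524288 / 12288
```

The learned queries set the output token count, which is why 24 latent queries give a 24-token bottleneck and 256 output queries restore the original token count.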

I trained this resampler only on the simulation and real-robot teleoperation embedding datasets. Adding the human pretraining embeddings would have required a longer weighted-sampling run, and under the time constraints, I prioritized the robot-relevant simulation and teleoperation distributions. The dataset used a future offset of 10 frames, but this offset does not affect the autoencoder objective because the resampler reconstructs whichever current embedding is passed to it.

The loss combines reconstruction MSE and cosine distance. MSE penalizes elementwise error in the reconstructed embedding, while cosine loss penalizes directional mismatch between reconstructed and target token vectors. The total loss is \(\mathcal{L} = 1.0 \cdot \mathcal{L}_{MSE} + 0.2 \cdot \mathcal{L}_{cosine}\), with masked cameras ignored during loss aggregation. In the training plot, MSE and cosine are shown as separate curves so the reconstruction scale and directional-alignment term can be read independently.

\[ \mathcal{L}_{MSE} = \frac{1}{|\Omega|} \sum_{(b,c,t,d)\in\Omega} \left(\hat{e}_{b,c,t,d} - e_{b,c,t,d}\right)^2 \]
\[ \mathcal{L}_{cosine} = \frac{1}{|\Omega_{tok}|} \sum_{(b,c,t)\in\Omega_{tok}} \left( 1 - \frac{\hat{\mathbf{e}}_{b,c,t}\cdot\mathbf{e}_{b,c,t}} {\|\hat{\mathbf{e}}_{b,c,t}\|_2\|\mathbf{e}_{b,c,t}\|_2} \right) \]
\[ \mathcal{L}_{total} = 1.0\,\mathcal{L}_{MSE} + 0.2\,\mathcal{L}_{cosine} \]
Input embedding: \([B, 3, 256, 2048]\)
Compact latent: \([B, 3, 24, 512]\)
Compression: \(524{,}288 \rightarrow 12{,}288\) values per camera, about \(42.7\times\)
Attention blocks: 2 encoder + 2 decoder blocks per camera, 16 heads
Parameters: 44,983,296 total across three camera-specific autoencoders
Training data: Real-robot teleoperation + simulation embeddings, weighted 1.0 and 0.65 respectively
Training setup: 2 epochs
Validation: 2.5% split, evaluated every 1,000 steps over 2,500 batches
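The masked reconstruction objective can be written compactly as below. This is a sketch under assumptions: the function name, the `[B, C]` camera-mask layout, and the small `eps` added for numerical safety are not taken from the report, and the demo uses tiny tensors rather than the real \([B, 3, 256, 2048]\) shape.

```python
import numpy as np


def resampler_loss(pred, target, camera_mask, mse_weight=1.0, cos_weight=0.2, eps=1e-8):
    """pred, target: [B, C, T, D]; camera_mask: [B, C], 1 for valid cameras.

    Assumes at least one camera is valid in the batch.
    """
    mask = camera_mask.astype(pred.dtype)
    n_tokens = mask.sum() * pred.shape[2]
    # MSE averaged over every element of the valid cameras.
    sq_err = ((pred - target) ** 2) * mask[:, :, None, None]
    mse = sq_err.sum() / (n_tokens * pred.shape[3])
    # Cosine distance averaged over every token of the valid cameras.
    dot = (pred * target).sum(axis=-1)
    norms = np.linalg.norm(pred, axis=-1) * np.linalg.norm(target, axis=-1)
    cos_dist = (1.0 - dot / (norms + eps)) * mask[:, :, None]
    cosine = cos_dist.sum() / n_tokens
    return mse_weight * mse + cos_weight * cosine, mse, cosine


rng = np.random.default_rng(0)
target = rng.standard_normal((2, 3, 4, 8))
pred = target.copy()
pred[0, 1] += 5.0                          # corrupt one camera ...
mask = np.ones((2, 3))
mask[0, 1] = 0                             # ... and mask that camera out
masked_total, _, _ = resampler_loss(pred, target, mask)
unmasked_total, _, _ = resampler_loss(pred, target, np.ones((2, 3)))
```

With the corrupted camera masked, the loss is essentially zero; with the mask removed, the same error dominates the total, which is the behavior expected from ignoring masked cameras during aggregation.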

Future-Latent Predictor

Once the resampler could compress the frozen \(\pi_{0.5}\) image embeddings into compact latents, the next stage was to predict where those compact latents would move in the future. For this training stage, only the encoder side of the trained resampler was loaded and kept frozen. It compressed both the current full image embeddings and the future target embeddings into the same \([B, 3, 24, 512]\) compact-latent space.

The predictor receives the current compact latents \(z_t\), the current robot state \(s_t\), and a learned horizon token. Camera and token positional embeddings are added to the compact latents, then the three cameras and 24 latent tokens are flattened into a 72-token sequence. The 32D robot state is tokenized into four state tokens, giving a final transformer input of 77 tokens: one horizon token, four state tokens, and 72 compact-latent tokens.
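The 77-token input can be assembled as in the following sketch. The token ordering, the single linear projection used to turn the 32D state into four tokens, and all parameter names are assumptions for illustration; the report specifies only the token counts and the width.

```python
import numpy as np

rng = np.random.default_rng(0)
width = 512
z_t = rng.standard_normal((3, 24, width))     # compact latents for one sample
state = rng.standard_normal(32)               # 32D robot state

# Hypothetical learned parameters, randomly initialized for the sketch.
camera_pos = rng.standard_normal((3, 1, width))
token_pos = rng.standard_normal((1, 24, width))
horizon_token = rng.standard_normal((1, width))
state_proj = rng.standard_normal((32, 4 * width)) / np.sqrt(32)

# Add positional embeddings, then flatten 3 cameras x 24 tokens into 72 tokens.
latent_tokens = (z_t + camera_pos + token_pos).reshape(72, width)
# Assumed tokenization: one projection to 4 * 512 values, reshaped into 4 tokens.
state_tokens = (state @ state_proj).reshape(4, width)

sequence = np.concatenate([horizon_token, state_tokens, latent_tokens], axis=0)
```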

A 6-layer transformer encoder processes this sequence with model width 512, 16 attention heads, and a 2048-wide feed-forward block. The horizon and state-token outputs are dropped after the transformer, and the remaining latent-token outputs pass through a LayerNorm and zero-initialized delta head. The model predicts a residual compact-latent motion \(\hat{\Delta}_t\), and the final future latent is produced as \(\hat{z}_{t+h}=z_t+\hat{\Delta}_t\). In the sidecar future-pairing setup, the future offset was \(h=10\).
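One consequence of combining a residual output with a zero-initialized delta head is that the untrained predictor reproduces the current latent exactly. The sketch below shows this with stand-in transformer features; the bias term and the variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
z_t = rng.standard_normal((72, 512))
features = rng.standard_normal((72, 512))   # stand-in for transformer outputs


def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


# Zero-initialized delta head: weights and bias start at exactly zero.
w_delta = np.zeros((512, 512))
b_delta = np.zeros(512)

delta_hat = layer_norm(features) @ w_delta + b_delta
z_future_hat = z_t + delta_hat              # equals z_t before any training step
```

Starting from the copy solution means early training can only improve on the no-motion baseline rather than first having to recover it.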

The main comparison is against the copy baseline, which simply reuses the current compact latent \(z_t\) as the future prediction. If the future predictor is useful, its MSE and cosine loss should be lower than the copy baseline, and \(1 - \mathcal{L}_{MSE}/\mathcal{L}_{copy\_MSE}\) should stay positive. The final logged validation point had predictor MSE \(0.07265\) versus copy MSE \(0.13679\), corresponding to a validation MSE improvement of about \(46.9\%\).
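As a quick check of the reported figure, substituting the two logged validation values into the improvement ratio gives:

\[ 1 - \frac{0.07265}{0.13679} \approx 1 - 0.5311 = 0.4689 \approx 46.9\% \]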

Predictor Residual

\[ \hat{\Delta}_t = f_\theta(z_t, s_t), \qquad \hat{z}_{t+h}=z_t+\hat{\Delta}_t, \qquad \Delta_t^\star=z_{t+h}-z_t \]

The model predicts future motion in compact-latent space as a residual update from the current latent instead of generating the future latent from scratch.

MSE

\[ \mathcal{L}_{MSE} = \frac{1}{|\Omega|} \sum_{(b,c,i,d)\in\Omega} \left(\hat{z}_{b,c,i,d} - z^{\star}_{b,c,i,d}\right)^2 \]

Measures elementwise reconstruction error between the predicted future compact latent and the target future compact latent.

Cosine Loss

\[ \mathcal{L}_{cosine} = \frac{1}{|\Omega_{tok}|} \sum_{(b,c,i)\in\Omega_{tok}} \left( 1 - \frac{\hat{\mathbf{z}}_{b,c,i}\cdot\mathbf{z}^{\star}_{b,c,i}} {\|\hat{\mathbf{z}}_{b,c,i}\|_2\|\mathbf{z}^{\star}_{b,c,i}\|_2} \right) \]

Measures whether each predicted compact-latent token points in the same embedding direction as the target token, independent of raw scale.

Copy MSE

\[ \mathcal{L}_{copy\_MSE} = \frac{1}{|\Omega|} \sum_{(b,c,i,d)\in\Omega} \left(z_{b,c,i,d} - z^{\star}_{b,c,i,d}\right)^2 \]

Baseline error from doing nothing: it treats the current compact latent \(z_t\) as if it were the future latent \(z_{t+h}\).

MSE Improvement

\[ \mathcal{I}_{MSE} = 1-\frac{\mathcal{L}_{MSE}}{\mathcal{L}_{copy\_MSE}} \]

Positive values mean the predictor is closer to the future target than the copy baseline. Larger is better.

Delta MSE

\[ \mathcal{L}_{\Delta MSE} = \frac{1}{|\Omega|} \sum_{(b,c,i,d)\in\Omega} \left(\hat{\Delta}_{b,c,i,d}-\Delta^\star_{b,c,i,d}\right)^2 \]

Measures whether the predicted residual motion itself matches the true residual motion from current latent to future latent.

Delta Norm Ratio

\[ R_{\Delta norm} = \frac{1}{|\Omega_{tok}|} \sum_{(b,c,i)\in\Omega_{tok}} \frac{\|\hat{\Delta}_{b,c,i}\|_2}{\|\Delta^\star_{b,c,i}\|_2+\epsilon} \]

Shows whether the predicted latent motion is under-scaled or over-scaled. A value near 1 means the predicted residual magnitude matches the target residual magnitude.

Delta Projection Ratio

\[ R_{\Delta proj} = \frac{1}{|\Omega_{tok}|} \sum_{(b,c,i)\in\Omega_{tok}} \frac{\hat{\Delta}_{b,c,i}\cdot\Delta^\star_{b,c,i}} {\|\Delta^\star_{b,c,i}\|_2^2+\epsilon} \]

Measures how much of the target residual is recovered along the correct direction. Higher positive values mean more useful progress toward the target future latent.

Total Weighted Loss

\[ \mathcal{L}_{total} = 1.0\,\mathcal{L}_{MSE} + 0.2\,\mathcal{L}_{cosine} + 0.0\,\mathcal{L}_{\Delta MSE} \]

The training objective directly optimized future-latent MSE and cosine alignment. Delta MSE was logged as a diagnostic but had zero weight in this run.

Here, \(\Omega\) is the valid unmasked embedding-element set and \(\Omega_{tok}\) is the valid token set; \(i\) indexes the compact-latent token within each camera. Copy metrics measure how hard the future prediction problem is if the model does nothing. Improvement measures how much the predictor beats that no-motion baseline. Delta MSE checks elementwise residual error, delta cosine (the cosine similarity between the predicted and target residuals) checks residual direction alignment, delta norm ratio checks whether the predicted motion is under- or over-scaled, and delta projection ratio measures how much of the target residual is recovered along the correct direction.
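The diagnostics above can be computed together in a few lines. This sketch omits masking, treats every input as a flat `[N, D]` array of tokens, and assumes that delta cosine is the mean cosine similarity between predicted and target residuals; the function name is illustrative.

```python
import numpy as np


def predictor_diagnostics(z_t, z_hat, z_target, eps=1e-8):
    """All inputs are [N, D]: N compact-latent tokens of width D."""
    delta_hat = z_hat - z_t
    delta_true = z_target - z_t
    mse = np.mean((z_hat - z_target) ** 2)
    copy_mse = np.mean((z_t - z_target) ** 2)
    true_norm = np.linalg.norm(delta_true, axis=-1)
    hat_norm = np.linalg.norm(delta_hat, axis=-1)
    dot = (delta_hat * delta_true).sum(axis=-1)
    return {
        "mse": mse,
        "copy_mse": copy_mse,
        "mse_improvement": 1.0 - mse / copy_mse,
        "delta_mse": np.mean((delta_hat - delta_true) ** 2),
        "delta_norm_ratio": np.mean(hat_norm / (true_norm + eps)),
        "delta_projection_ratio": np.mean(dot / (true_norm ** 2 + eps)),
        # Assumed definition: cosine similarity between the two residuals.
        "delta_cosine": np.mean(dot / (hat_norm * true_norm + eps)),
    }


rng = np.random.default_rng(0)
z_t = rng.standard_normal((72, 512))
z_target = z_t + rng.standard_normal((72, 512))

copy = predictor_diagnostics(z_t, z_t, z_target)                            # do nothing
half = predictor_diagnostics(z_t, z_t + 0.5 * (z_target - z_t), z_target)   # half the true motion
```

The two toy predictors bracket the interpretation: the copy predictor scores zero on every residual diagnostic, while a predictor that moves halfway along the true residual has perfect direction, a norm and projection ratio of 0.5, and a 75% MSE improvement.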

Future Predictor Takeaway

The plots and retained logs captured the later part of training rather than the cold start: for both the resampler and future predictor, the recorded curves begin after the model had already started learning. At the true first predictor step, the residual head behaved like the copy baseline: MSE improvement and the delta-direction diagnostics were all \(0\). The logged validation window then started near the end of the first epoch, where the predictor had already reached \(41.2\%\) MSE improvement, and by the final logged checkpoint it reached \(46.9\%\).

MSE improvement: 0.0% \(\rightarrow\) 46.9%
Delta cosine: 0.000 \(\rightarrow\) 0.372
Delta norm ratio: 0.000 \(\rightarrow\) 0.807
Delta projection ratio: 0.000 \(\rightarrow\) 0.300

Interpreted from the beginning of training, the predictor moved from a pure copy baseline to a useful future-latent estimator, with MSE improvement rising from \(0\%\) to \(46.9\%\). The residual diagnostics also moved away from zero: delta cosine reached \(0.372\), delta norm ratio reached \(0.807\), and delta projection reached \(0.300\). Together, these values show that the model learned a meaningful component of the target future-latent direction, although the residual magnitude was still slightly under-scaled. The next stage is to plug the frozen future predictor together with the frozen resampler encoder into the \(\pi_{0.5}\) co-training base, so the policy can condition on predicted future latents while learning to produce better action chunks.

Frozen input module: Trained resampler encoder only
Compact latent: \([B, 3, 24, 512]\)
State tokens: 32D state \(\rightarrow\) 4 tokens of width 512
Transformer: 6 layers, 16 heads, FFN width 2048
Prediction mode: Residual, \(\hat{z}_{t+h}=z_t+\hat{\Delta}_t\)
Training data: Simulation + real-robot teleoperation embeddings, weights 0.15 and 1.0
Training setup: 1 epoch
Batching: Batch 16 per GPU, gradient accumulation 2
Optimizer: LR \(1\times10^{-4}\), warmup 500, weight decay 0.01
Validation: 2% split, every 500 steps over 500 batches

How It Entered π0.5

The policy adapter projected predicted future latents into the π0.5 prefix-token width and appended them alongside image tokens before the language/action pathway. Training mixed predicted, true, and dropped future latents so the policy learned to use the signal without depending on a perfect predictor.

π0.5 Co-Training + Future-Latent Fine-Tuning

Datasets: Simulation + real-robot teleoperation

This final strategy started from the strongest co-training base and added the future-latent pathway as a lightweight conditioning stream. The co-training base already contained the broad behavior learned from human demonstrations, simulation demonstrations, and real-robot teleoperation. The future-latent fine-tuning stage then focused on the robot-compatible simulation and real-robot teleoperation data, where future image embeddings and robot states are available through the sidecar future-latent dataset.

The frozen image encoder produces the current \(\pi_{0.5}\) image-token embeddings. A frozen resampler encoder compresses those embeddings into compact \([3, 24, 512]\) latents, and the frozen future predictor estimates the horizon-10 future compact latent. These predicted future latents cannot be appended directly to the VLA prefix stream because their token width is 512, while the \(\pi_{0.5}\) prefix-token width is 2048. Therefore, a small policy adapter projects compact future-latent tokens into the policy prefix-token space.

The policy adapter is deliberately simple: a linear projection from 512 to 1024, a Swish activation, and a linear projection from 1024 to 2048. I did not add cross-attention, attention heads, or a larger adapter architecture in this run because of time constraints. The goal was to test whether a minimal projection was enough for the VLA policy to read the future-latent signal as an additional prefix-token hint.
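A minimal sketch of such an adapter follows. The layer widths and the Swish activation come from the description above; the bias terms, the random initialization, and flattening the \([3, 24, 512]\) latents into 72 tokens before projection are assumptions.

```python
import numpy as np


def swish(x):
    return x / (1.0 + np.exp(-x))   # equivalent to x * sigmoid(x)


rng = np.random.default_rng(0)
w1 = rng.standard_normal((512, 1024)) / np.sqrt(512)
b1 = np.zeros(1024)
w2 = rng.standard_normal((1024, 2048)) / np.sqrt(1024)
b2 = np.zeros(2048)


def policy_adapter(future_latents):
    """Map [72, 512] compact future latents to [72, 2048] prefix tokens."""
    hidden = swish(future_latents @ w1 + b1)
    return hidden @ w2 + b2


future_latents = rng.standard_normal((3, 24, 512)).reshape(72, 512)
prefix_tokens = policy_adapter(future_latents)
n_params = w1.size + b1.size + w2.size + b2.size
```

With biases included, this adapter has about 2.6 million parameters, small next to the roughly 45 million parameters of the resampler autoencoders.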

During this fine-tuning stage, the image encoder, resampler encoder, and future predictor remained frozen. The policy adapter was trained faster than the rest of the \(\pi_{0.5}\) policy, while the already learned co-training base was updated with a smaller learning rate. This protected the useful behavior already learned by the co-training base: early in training, the adapter output was effectively a new noisy conditioning stream, so the base policy should not be overwritten before the adapter learned a stable representation. Once the adapter began mapping future latents into a useful prefix-token pattern, the rest of the policy could make smaller adjustments to use that signal for better action-chunk prediction.
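The freezing and two-speed learning-rate scheme can be expressed as a simple assignment over parameter names. Everything concrete in this sketch is hypothetical: the module prefixes, the function name, and especially the two learning-rate values, which are placeholders because the report does not state the rates used in this stage.

```python
FROZEN_PREFIXES = ("image_encoder.", "resampler_encoder.", "future_predictor.")
ADAPTER_PREFIX = "policy_adapter."


def assign_learning_rates(param_names, base_lr, adapter_lr):
    """Return {name: lr}; frozen parameters map to None and get no updates."""
    rates = {}
    for name in param_names:
        if name.startswith(FROZEN_PREFIXES):
            rates[name] = None            # frozen: excluded from the optimizer
        elif name.startswith(ADAPTER_PREFIX):
            rates[name] = adapter_lr      # new module: learns faster
        else:
            rates[name] = base_lr         # pretrained policy: small adjustments
    return rates


# Placeholder values only; the report does not state the learning rates used.
rates = assign_learning_rates(
    ["image_encoder.patch_embed", "policy_adapter.fc1", "action_expert.layer0"],
    base_lr=1e-5,
    adapter_lr=1e-4,
)
```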

Future-latent conditioning is also mixed during training instead of always being present. In 55% of samples, the prefix tokens receive predicted future latents from the future predictor. In 30% of samples, they receive true future latents obtained by encoding the future camera frame through the frozen resampler. In the remaining 15%, the future-latent stream is dropped. This keeps the policy robust: it can learn from predicted future context, benefit from ground-truth future supervision when available, and still preserve a fallback path when the future-latent signal is missing or unreliable.
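The 55/30/15 mixture amounts to a per-sample categorical draw, sketched below. The mode names and function names are illustrative, and whether the draw happened per sample or per batch in the actual run is not stated, so per-sample sampling is an assumption.

```python
import random

MODES = (("predicted", 0.55), ("true", 0.30), ("dropped", 0.15))


def mode_from_uniform(u):
    """Map u in [0, 1) to a conditioning mode using cumulative thresholds."""
    cumulative = 0.0
    for mode, probability in MODES:
        cumulative += probability
        if u < cumulative:
            return mode
    return MODES[-1][0]   # guard against floating-point round-off


def sample_mode(rng):
    return mode_from_uniform(rng.random())


rng = random.Random(0)
counts = {mode: 0 for mode, _ in MODES}
for _ in range(10_000):
    counts[sample_mode(rng)] += 1
```

In a training loop, the sampled mode would select which tensor feeds the policy adapter: the predictor output, the encoded true future frame, or nothing at all.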

Evaluation Results

| Garment split | Long-sleeve top | Short-sleeve top | Long pants | Shorts | Overall average |
|---|---|---|---|---|---|
| Seen | 84.0% | 84.0% | 69.0% | 96.0% | 83.25% |
| Unseen | 95.0% | 33.0% | 68.0% | 85.0% | 70.25% |

Compared with the \(\pi_{0.5}\) co-training run, future-latent fine-tuning added 2.0 percentage points on seen garments and 4.5 points on unseen garments, with gains across all unseen categories. Relative to the earlier \(\pi_{0.5}\) base with state injection (Base + state), the cumulative improvement reached +7.25 points on seen garments and +6.25 points on unseen garments, raising the combined seen/unseen average by 6.75 absolute points.

Cumulative Results & Lessons

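| Training strategy | Seen: Top L | Seen: Top S | Seen: Pants | Seen: Shorts | Seen: Avg | Seen: Gain | Unseen: Top L | Unseen: Top S | Unseen: Pants | Unseen: Shorts | Unseen: Avg | Unseen: Gain | Overall: Mean | Overall: Gain |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Base, no state | 72.0% (Base) | 44.0% (Base) | 34.0% (Base) | 95.0% (Base) | 61.25% | Baseline | 66.0% (Base) | 16.0% (Base) | 29.0% (Base) | 46.0% (Base) | 39.25% | Baseline | 50.25% | Baseline |
| Base + state | 75.0% (+3.0) | 78.0% (+34.0) | 58.0% (+24.0) | 93.0% (-2.0) | 76.00% | +14.75 | 92.0% (+26.0) | 37.0% (+21.0) | 60.0% (+31.0) | 67.0% (+21.0) | 64.00% | +24.75 | 70.00% | +19.75 |
| Human pretrain base (no normalization + no masking) + fine-tune | 77.0% (+2.0) | 81.0% (+3.0) | 64.0% (+6.0) | 95.0% (+2.0) | 79.25% | +3.25 | 85.0% (-7.0) | 35.0% (-2.0) | 64.0% (+4.0) | 70.0% (+3.0) | 63.50% | -0.50 | 71.38% | +1.38 |
| Co-train base | 82.0% (+5.0) | 79.0% (-2.0) | 68.0% (+4.0) | 96.0% (+1.0) | 81.25% | +2.00 | 90.0% (+5.0) | 28.0% (-7.0) | 65.0% (+1.0) | 80.0% (+10.0) | 65.75% | +2.25 | 73.50% | +2.13 |
| Co-train + future latent | 84.0% (+2.0) | 84.0% (+5.0) | 69.0% (+1.0) | 96.0% (+0.0) | 83.25% | +2.00 | 95.0% (+5.0) | 33.0% (+5.0) | 68.0% (+3.0) | 85.0% (+5.0) | 70.25% | +4.50 | 76.75% | +3.25 |

Gains, both in parentheses and in the Gain columns, are in percentage points relative to the row above.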

Result Conclusion

The largest single jump came from adding current robot state to the visual observations: the combined seen/unseen average moved from 50.25% to 70.00%. Human pretraining improved seen performance but slightly hurt unseen performance, while the normalized co-training mixture recovered and exceeded that unseen score. The final future-latent model produced the best cumulative simulation result, with a 76.75% combined average and the strongest unseen average at 70.25%, suggesting that predicted future visual embeddings gave the policy a useful nudge for action-chunk prediction.
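The averages and row-to-row gains quoted here follow directly from the per-category scores in the cumulative results table. The short script below recomputes them; the per-category numbers are copied from that table, while the dictionary layout, the shortened strategy labels, and the function name are illustrative.

```python
RESULTS = {
    "Base, no state": {"seen": [72.0, 44.0, 34.0, 95.0], "unseen": [66.0, 16.0, 29.0, 46.0]},
    "Base + state": {"seen": [75.0, 78.0, 58.0, 93.0], "unseen": [92.0, 37.0, 60.0, 67.0]},
    "Human pretrain + fine-tune": {"seen": [77.0, 81.0, 64.0, 95.0], "unseen": [85.0, 35.0, 64.0, 70.0]},
    "Co-train base": {"seen": [82.0, 79.0, 68.0, 96.0], "unseen": [90.0, 28.0, 65.0, 80.0]},
    "Co-train + future latent": {"seen": [84.0, 84.0, 69.0, 96.0], "unseen": [95.0, 33.0, 68.0, 85.0]},
}


def summarize(results):
    """Return (name, seen_avg, unseen_avg, combined_mean, gain_vs_previous_row)."""
    rows, previous = [], None
    for name, scores in results.items():
        seen = sum(scores["seen"]) / len(scores["seen"])
        unseen = sum(scores["unseen"]) / len(scores["unseen"])
        mean = (seen + unseen) / 2
        gain = None if previous is None else mean - previous
        rows.append((name, seen, unseen, mean, gain))
        previous = mean
    return rows


rows = summarize(RESULTS)
for name, seen, unseen, mean, gain in rows:
    gain_text = "baseline" if gain is None else f"{gain:+.3f}"
    print(f"{name:28s} seen={seen:.2f} unseen={unseen:.2f} mean={mean:.3f} gain={gain_text}")
```

The exact combined means for the human-pretrain and co-train rows are 71.375 and 73.500, so the tabulated 71.38, +1.38, and +2.13 are rounded values.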

All training-strategy evaluations reported here were performed in the organizer-provided Isaac Sim environment because I did not have direct robot access during development. For more details on the simulation environment, evaluation assumptions, and challenge setup, see the LeHome paper and the official LeHome Challenge repository.

Each reported score comes from 50 simulation episodes per garment. Seen scores were aggregated across the 40 seen garments, while unseen scores were aggregated across the eight released unseen garments. The success criterion was binary: each garment type had a set of geometric folding conditions, and a rollout received success only if all required conditions were satisfied. There were no partial points or partial rewards in these evaluation percentages.
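The all-or-nothing scoring rule can be made concrete with a short sketch. The condition names below are hypothetical placeholders; the actual geometric checks are defined by the challenge environment.

```python
def episode_success(condition_results):
    """An episode succeeds only if every required folding condition holds."""
    return all(condition_results.values())


def success_rate(episodes):
    """Percentage of episodes in which all conditions were satisfied."""
    return 100.0 * sum(episode_success(e) for e in episodes) / len(episodes)


# Hypothetical condition names; 42 full successes and 8 partial folds out of 50.
episodes = (
    [{"sleeves_folded": True, "hem_aligned": True}] * 42
    + [{"sleeves_folded": True, "hem_aligned": False}] * 8
)
rate = success_rate(episodes)
```

The eight partially folded episodes contribute nothing to the rate, so a policy that often gets most of the fold right can still score poorly under this criterion.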

The final co-training + future-latent policy was the model deployed in the LeHome Challenge finals. A few success and failure clips from the competition are shown below as representative examples from the final robot runs.

Evaluation Videos from the Finals (π0.5 Co-Training + Future-Latent)
Success Examples (1x speed)
Failure Examples (1x speed)
Failure 1: Shorts
Failure 2: Short-Sleeve Top
Failure 3: Pants
Failure 4: Short-Sleeve Top

Failure videos 1, 2, and 4 still show useful recovery behavior: the policy moves from unusual or uncertain cloth states back toward a plausible continuation and attempts to complete the fold. I never collected human-in-the-loop data or RL recovery data for this behavior, so this appears to be an emergent result of the data mixture, state + predicted future conditioning, and training strategy.

Failure video 3 is different: the policy appears to treat a pants sample more like a shorts case, and the rollout is then compounded by a grasping issue. Across these failure clips, the common source of failure is grasp quality after the end effectors reach the desired region, both in ideal folds and in unusual cloth configurations.

Evaluation Protocol and Limits

The organizer-provided training set contained 40 seen garments. For local simulation evaluation, I had access to eight sample unseen garments distributed across the garment categories. A larger hidden unseen garment set was reserved by the organizers and was not available during development, so the unseen numbers above should be read as the best available local signal rather than a complete measure of hidden-test generalization.

I could not run systematic real-robot evaluation before the finals because I did not have robot access during development. At the competition site, the available time was enough for quick checks and final evaluation, but not for a complete controlled success-and-failure study. The videos above are representative examples from the finals, while the official competition score should be taken from the LeHome leaderboard.

What Would Make the Evidence Stronger?

The following studies would make the evidence more robust, but were not feasible during the challenge because of limited robot access, time, and resources.

Future-latent causal ablations

Compare predicted future latents, copied current latents, true future latents, and dropped latents. This is the key test needed before claiming that future prediction, rather than extra token capacity, is the source of the gain.

Controlled real-robot study

If hardware access becomes available, repeat the training ladder on physical rollouts with matched garment identities, repeated trials, and explicit success-and-failure labels for grasping and folding.

Lessons

1. Most Real-Robot Failures Were Grasping Failures

The final policy usually reached plausible folding locations, but many failures came from grasp quality: missed cloth pickup, weak grasp closure, or unstable contact. The policy also showed some recovery-like behavior even though I did not explicitly train a recovery controller, which suggests that recovery priors were partially learned from the mixed demonstrations.

2. Delta Actions Were the Main Missed Training Choice

The models were trained to predict future absolute end-effector targets. In hindsight, training on action deltas would likely have made the control distribution narrower and easier to learn: the model would predict the change from the current robot state rather than repeatedly regress broad absolute positions. This should matter most for precise grasping, where small local errors can decide success or failure.
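The difference between the two parameterizations is easiest to see on translation components. The sketch below covers only xyz positions with made-up values; orientation components would need a proper relative-rotation composition, which is not shown, and the function names are illustrative.

```python
import numpy as np


def to_delta_chunk(absolute_targets, current_position):
    """Express each absolute target in a chunk as an offset from the current position."""
    return absolute_targets - current_position


def to_absolute_chunk(delta_targets, current_position):
    """Invert to_delta_chunk at execution time."""
    return delta_targets + current_position


current = np.array([0.42, -0.10, 0.25])            # hypothetical xyz in metres
chunk = current + np.array([[0.00, 0.00, 0.01],    # three-step absolute action chunk
                            [0.01, 0.00, 0.02],
                            [0.02, 0.01, 0.02]])
deltas = to_delta_chunk(chunk, current)
recovered = to_absolute_chunk(deltas, current)
```

In this toy chunk the absolute targets span the workspace scale while the deltas stay within a couple of centimetres, which illustrates the narrower regression target described above.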

3. Validation Should Hold Out Garments, Not Random Frames

My validation splits were random sample splits from the available data. A better protocol would hold out one or two full garments from each garment type and validate on those unseen garment identities. I did not use this split during the challenge because the training set was already small, and removing several garments from 40 seen garments would have reduced the available learning signal.
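A garment-level holdout can be sketched as follows. The garment IDs, the type names, and the even split of 40 garments across four types are hypothetical; the report does not give the per-type counts.

```python
import random
from collections import defaultdict


def split_by_garment(garments, holdout_per_type=1, seed=0):
    """garments: list of (garment_id, garment_type). Returns (train_ids, val_ids)."""
    by_type = defaultdict(list)
    for garment_id, garment_type in garments:
        by_type[garment_type].append(garment_id)
    rng = random.Random(seed)
    train_ids, val_ids = set(), set()
    for garment_type in sorted(by_type):
        ids = sorted(by_type[garment_type])
        rng.shuffle(ids)
        # Whole garments go to validation, so none of their frames leak into training.
        val_ids.update(ids[:holdout_per_type])
        train_ids.update(ids[holdout_per_type:])
    return train_ids, val_ids


# Hypothetical IDs: 40 seen garments spread evenly over four types.
types = ["top_long", "top_short", "pants", "shorts"]
garments = [(f"{t}_{i:02d}", t) for t in types for i in range(10)]
train_ids, val_ids = split_by_garment(garments, holdout_per_type=2)
```

Frames would then be assigned to a split by their garment ID rather than sampled at random, which trades eight training garments for a validation signal that reflects unseen garment identities.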

4. Future Latents Look Promising, but Need Multi-Horizon Training

The future-latent model appears to have used the predicted future embedding as a helpful prior rather than as a fully reliable plan. To unlock more of its potential, the predictor should learn multiple horizons such as t+5, t+10, and t+20. For the same current image and robot state, different future latents should correspond to different action chunks, forcing the policy to rely more directly on the future-conditioning signal.

Next Research Vision

An exploratory intuition rather than a proven claim; please read this direction with a grain of salt.

Future-latent predictors as cross-embodiment task-progress priors

A longer-term direction is to study stronger, action-free future-latent predictors for VLA policies. Instead of predicting future RGB frames, robot actions, or final goal states, such a predictor would estimate future visual embeddings conditioned on the current observation, task prompt, and time horizon. A visual foundation encoder such as DINO, V-JEPA, or a similar video representation model could provide the latent space, with the goal of predicting information-dense intermediate future states rather than pixels.

These predicted future states should not be treated as explicit goals. One hypothesis is that they are more useful as intermediate visual checkpoints along the path toward the goal, such as what the scene may look like at \(t+5\), \(t+10\), or \(t+20\). This distinction matters because the policy should not simply move toward a static final target; it should use the predicted latents as task-progress grounding while still deciding the embodiment-specific actions needed at each step.

A useful property of this direction is that the future predictor may not need robot actions, joint states, end-effector poses, or camera calibration. In principle, it could be trained from video demonstrations and language prompts alone, across human egocentric videos, robot videos, and multi-embodiment task demonstrations. This would need to be tested carefully, but the aim would be to learn a general notion of how tasks visually progress over time, independent of any single robot embodiment.

A VLA policy could then condition on the current images, robot state, prompt, and predicted future latents at one or more horizons. The future predictor would provide a representation of what intermediate task progress may look like, while the VLA would still learn how the current robot embodiment should move to realize that progress through its own kinematics, gripper, and action space.

This may also help with a limitation of unified pose-action contracts: end-effector dexterity. A dual-arm pose-action contract can unify reaching and gross bimanual motion across many embodiments, but grasping is harder because hands, grippers, and dexterous end effectors differ significantly. One hypothesis is that an extensively trained future predictor could provide useful future latents for these embodiment-specific dexterity moments, such as what a successful grasp or contact transition may look like visually. With a basic suite of teleoperation data for a new embodiment, a VLA may then learn how that embodiment's hand or gripper should realize the grasp implied by the predicted future latent.

The key open question is whether this separation between what intermediate progress should look like and how this robot should move can improve transfer to new robots, new tasks, and low-teleoperation settings. This may make heterogeneous-data training more scalable, but it would require careful ablations to verify that the future latents provide physically useful task-progress information rather than only extra conditioning capacity or visually plausible but unreachable predictions.

Special Thanks

One day before I was supposed to travel to ICRA 2026 in Vienna, my visa was rejected. Out of nowhere, I called my friend Manojkumar Srinivasan, who was studying at TU Dortmund in Germany, to ask whether he could represent me. He agreed on extremely short notice, traveled to ICRA in Vienna, downloaded the model, helped me run the evaluation, and made this result possible.

I am grateful to the LeHome organizers for allowing Manoj to participate on my behalf, and to the sponsor Lightwheel for supporting the competition. Their support let the finals evaluation go forward despite the last-minute logistics change.

Manojkumar Srinivasan receiving the LeHome Challenge third-place certificate with Alberta Longhini.
Left: Manojkumar Srinivasan, my friend who represented me for the competition. Right: Alberta Longhini, organizer.
LeHome Challenge finals setup with the certificate, robot arms, and garments.
LeHome Challenge finals setup at ICRA 2026, where the final policy first ran on the real robot.

LeHome Challenge Organizers

Special thanks to Ilia for helping set up the camera configuration while competing as a finalist. I also thank GolemCo, an Indian robotics WhatsApp community, for offering help when I could not attend ICRA 2026 in person.

Contact

Following a health break, I'm currently exploring my next steps while actively researching robotics and AI voice models.

Researchers and engineers working in robotics or embodied AI, whether in academia, industry, or startups, please feel free to reach out. I'm happy to chat, explore ideas, and discuss research problems anytime.

References
  1. LeHome: A Simulation Environment for Deformable Object Manipulation in Household Scenarios. arXiv:2604.22363
  2. LeHome Challenge 2026: Garment Manipulation Skill Learning in Household Scenarios. Challenge website
  3. LeHome Challenge official repository. GitHub
  4. LeHome simulation environment repository. GitHub
  5. Physical Intelligence OpenPI repository. GitHub
  6. π0.5 policy documentation in LeRobot. Hugging Face LeRobot docs
  7. π0.5: a Vision-Language-Action Model with Open-World Generalization. arXiv:2504.16054
  8. Human-to-Robot research from Physical Intelligence. Project page
  9. HaPTIC: Predicting 4D Hand Trajectory from Monocular Videos. Project page, arXiv:2501.08329, GitHub
  10. NVIDIA Isaac Sim. Product page