Robotic Garment Folding Implementation: LeHome Challenge 2026

TL;DR

Robotic garment folding stresses current VLA policies because cloth state is high-dimensional and contacts are unstable. In the LeHome Challenge, I had no access to the target robot during development or testing, so I could not collect additional robot data, run human-in-the-loop DAgger-style collection, or run rollouts for RL. I therefore collected a large set of human demonstrations (4,180 episodes) and built a 16D pose-action contract that unifies different embodiments for co-training, using forward kinematics, differential inverse kinematics, anchored normalization, and masking. I also made architecture changes to support future-latent conditioning, which gives the VLA model a compact look-ahead: actions are conditioned on a predicted future rather than only on the current state.

Across a common local simulation evaluation protocol, explicit robot-state conditioning produced the largest gain, while co-training and future-latent conditioning added smaller but useful improvements. Relative to the state-conditioned simulation-only baseline, the final policy improved the combined seen/unseen simulation average from 70.0% to 76.75% and unseen success from 64.0% to 70.25%. It was the model deployed in the LeHome Challenge real-world finals, where it placed third. The current evidence suggests future visual latents are useful, but does not yet prove whether they act as planning, denoising, progress estimation, or extra conditioning capacity.

As a small detour experiment, I also ran offline RL-style critic reranking in the simulation environment. The critic reranking notes and results are included later in the article as a side experiment, separate from the main training ladder.

Training strategy	Seen avg	Unseen avg	Combined avg	What it tested
Base, no state	61.25%	39.25%	50.25%	Image-only action-chunk prediction
Base + state	76.00%	64.00%	70.00%	Explicit current robot state as input
Human pretrain base (no normalization or masking) + state fine-tune	79.25%	63.50%	71.38%	Human garment-diversity pretraining without normalization or noise filtering before robot fine-tuning
Co-train base	81.25%	65.75%	73.50%	Weighted human + simulation + real-robot teleoperation mixture with normalization and masking
Co-train base + future latent	83.25%	70.25%	76.75%	Injection of predicted horizon-10 visual-latent prefix tokens

What this write-up covers

A validated 16D camera-frame dual-arm pose contract joining human, simulation, and teleoperation data.
Anchored normalization and masking so noisy human-pose dimensions do not contaminate robot learning.
A future-latent sidecar that predicts horizon-10 visual embeddings and injects them as VLA prefix tokens.
A simulation-only offline critic reranker that scores sampled action chunks with predicted Monte Carlo return.
An honest ablation ladder under no-robot-access constraints.

What this does not prove

It does not prove future latents are explicit planning.
It does not include controlled real-robot ablations.
It does not separate future prediction from extra prefix-token capacity.
It does not solve grasping/contact robustness.

Why this may matter for robot foundation models

Scaling data collection through simulation, human demonstrations, and effective sim-to-real transfer is an active area of research. Robot foundation models need heterogeneous datasets to scale across tasks faster, but those datasets are not automatically compatible. This project suggests one way to train across datasets and embodiments with a unified representation contract, while also exploring whether future conditioning can improve VLA policy performance.

Approach

1. Heterogeneous Data: Bring simulation, human demonstrations, and robot teleoperation logs into one policy-facing interface.
2. Pose Contract: Use FK/DIK, camera-frame transforms, validation checks, and anchored quantile normalization.
3. Training Ladder: Measure image-only, state-conditioned, human-pretrained, and co-trained policy variants.
4. Future Latents: Predict compact future visual latents and append them to the VLA model as prefix tokens.
5. Results and Limits: Report the result ladder, real-finals evidence, failure modes, and unsupported claims.
6. Next Research Direction: List the highest-value ablations to run once hardware or funding is available.

Data

The key constraint in this challenge was the lack of direct robot access, which ruled out additional robot data collection, robot telemetry gathering, and RL-based real-world tuning. To improve robustness under these constraints, I increased garment diversity by collecting human demonstrations of garment folding. I then extracted hand trajectories relative to the top-camera frame axes using HaPTIC-style hand-pose estimation.

Incorporating human demonstrations required moving away from raw robot joint-angle supervision, since joint states are not available for human data. Instead, I unified all datasets around end-effector pose representations. For the simulation and real-robot teleoperation data provided by the organizers, I built forward kinematics and differential inverse kinematics solvers to convert the original 12D dual-arm joint representation into a 16D dual-arm end-effector pose representation in the camera frame, using the LeRobot USD/URDF. I validated these transformations through FK/DIK consistency checks, quaternion-continuity checks, and visual trajectory inspection.

After conversion, I aligned coordinate conventions, axis definitions, and valid pose dimensions across human, simulation, and real-robot datasets. Real-robot teleoperation data served as the anchor reference for quantile scaling, reducing dataset-scale artifacts that the policy could otherwise exploit as shortcut features. Noisy or unreliable human-pose dimensions were masked during normalization and training. The resulting unified dataset was then merged with weighted sampling and used to fine-tune the π_0.5 VLA policy.

The human, simulation, and real-robot top-camera configurations were significantly different, even though all three were top views.

LeHome dataset workflow from human HaPTIC data, simulation data, and real-robot teleoperation data into normalized co-training data.

Dataset type	Human data	Simulation data (vanilla)	Real teleoperation data (vanilla)
Source	Custom data collection and post-processing pipeline	Provided by the competition organizers	Provided by the competition organizers
Cameras	Top only (left wrist and right wrist not present)	Top, left wrist, right wrist	Top, left wrist, right wrist
Episodes	4180	1000	500
Total number of garments	~250 (including different complex orientations)	40	25
Garment types	Long-sleeve tops, short-sleeve tops, long pants, shorts	Long-sleeve tops, short-sleeve tops, long pants, shorts	Long-sleeve tops, short-sleeve tops, long pants, shorts
Total number of frames (data points)	724,399	265,798	187,135
Duration	~8.4 hours	~2.5 hours	~2.6 hours
Dataset FPS	~24	30	20
Dimension 0	Left-arm end-effector X position relative to camera-frame axes	Left-arm shoulder pan	Left-arm shoulder pan
Dimension 1	Left-arm end-effector Y position relative to camera-frame axes	Left-arm shoulder lift	Left-arm shoulder lift
Dimension 2	Left-arm end-effector Z position relative to camera-frame axes	Left-arm elbow flex	Left-arm elbow flex
Dimension 3	Left-arm end-effector orientation W quaternion relative to camera-frame axes	Left-arm wrist flex	Left-arm wrist flex
Dimension 4	Left-arm end-effector orientation X quaternion relative to camera-frame axes	Left-arm wrist roll	Left-arm wrist roll
Dimension 5	Left-arm end-effector orientation Y quaternion relative to camera-frame axes	Left-arm gripper value	Left-arm gripper value
Dimension 6	Left-arm end-effector orientation Z quaternion relative to camera-frame axes	Right-arm shoulder pan	Right-arm shoulder pan
Dimension 7	Not present	Right-arm shoulder lift	Right-arm shoulder lift
Dimension 8	Right-arm end-effector X position relative to camera-frame axes	Right-arm elbow flex	Right-arm elbow flex
Dimension 9	Right-arm end-effector Y position relative to camera-frame axes	Right-arm wrist flex	Right-arm wrist flex
Dimension 10	Right-arm end-effector Z position relative to camera-frame axes	Right-arm wrist roll	Right-arm wrist roll
Dimension 11	Right-arm end-effector orientation W quaternion relative to camera-frame axes	Right-arm gripper value	Right-arm gripper value
Dimension 12	Right-arm end-effector orientation X quaternion relative to camera-frame axes	Not present	Not present
Dimension 13	Right-arm end-effector orientation Y quaternion relative to camera-frame axes	Not present	Not present
Dimension 14	Right-arm end-effector orientation Z quaternion relative to camera-frame axes	Not present	Not present
Dimension 15	Not present	Not present	Not present

Post-Processing & Unification

Why a Shared Pose Space Was Needed

The three available data sources were represented in different action spaces. The simulation and real-robot teleoperation datasets were provided by the LeHome Challenge organizers in the native robot joint-space format, while the additional human demonstrations were collected separately with a similar top camera view. For easier model learning and generalization across all three datasets, the data needed to share one representation space. Therefore the 16D dual-arm pose representation was chosen as the unified policy-facing representation. This allows the policy to learn from all three datasets consistently, while still outputting joint commands for the robot during deployment.

Unified Dual-Arm Pose (Relative to the Top-Camera Frame)

The unified policy-facing representation is a 16D dual-arm pose:

\mathbf{p}^{16} = \left[ \mathbf{p}^{8}_{L}, \mathbf{p}^{8}_{R} \right]

Each arm pose is represented as:

\mathbf{p}^{8} = \left[ x,\ y,\ z,\ q_w,\ q_x,\ q_y,\ q_z,\ g \right]

Expanded Representation

Here, \((x,y,z)\) denotes end-effector position, \((q_w,q_x,q_y,q_z)\) denotes orientation as a quaternion, and \(g\) denotes the gripper value.

\mathbf{p}^{16} = \left[ x_L,\ y_L,\ z_L,\ q_{w,L},\ q_{x,L},\ q_{y,L},\ q_{z,L},\ g_L,\ x_R,\ y_R,\ z_R,\ q_{w,R},\ q_{x,R},\ q_{y,R},\ q_{z,R},\ g_R \right]

This 16D camera-frame representation aligns human, simulation, and real-robot demonstrations under one pose-observation-action contract.
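The layout above can be made concrete with a small packing helper. This is a minimal sketch, assuming NumPy arrays and the index order given in the expanded representation; the function names `pack_pose16` and `unpack_pose16` are hypothetical and are not taken from the project code.

```python
import numpy as np

# Per-arm field order from the 8D layout above: position, quaternion (w, x, y, z), gripper.
ARM_FIELDS = ("x", "y", "z", "qw", "qx", "qy", "qz", "g")


def pack_pose16(left, right):
    """Concatenate two 8D arm poses into one 16D vector, left arm first."""
    left = np.asarray(left, dtype=np.float64)
    right = np.asarray(right, dtype=np.float64)
    if left.shape != (8,) or right.shape != (8,):
        raise ValueError("each arm pose must have shape (8,)")
    return np.concatenate([left, right])


def unpack_pose16(p16):
    """Split a 16D pose into per-arm position, quaternion (w, x, y, z), and gripper."""
    p16 = np.asarray(p16, dtype=np.float64)
    if p16.shape != (16,):
        raise ValueError("pose must have shape (16,)")
    arms = {}
    for name, start in (("left", 0), ("right", 8)):
        arm = p16[start:start + 8]
        arms[name] = {"pos": arm[0:3], "quat_wxyz": arm[3:7], "gripper": float(arm[7])}
    return arms
```

With this layout, indices 7 and 15 hold the two gripper values, which matches the dimensions listed as not present for the human data in the dataset table.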

Robot Model and Kinematic Solver Selection

To convert the organizer-provided simulation and real-robot datasets from 12D joint space into 16D end-effector pose space, I implemented forward kinematics and differential inverse kinematics solvers using the robot model files provided with the challenge environment.

Since the simulation evaluation was performed in Isaac Sim, the available USD and URDF robot descriptions were compared through FK/DIK round-trip consistency, quaternion-continuity checks, and visual trajectory inspection. Based on these checks, the USD-derived kinematic chain was selected because it more closely matched the behavior of the LeHome simulation and evaluation setup.

Training-Time Transform

The transformation used during training converts robot joints into camera-frame end-effector pose:

\mathbf{q}^{12} \xrightarrow{\mathrm{FK}} \mathbf{T}^{W}_{ee} \xrightarrow{T^{C}_{W}} \mathbf{p}^{16}_{C}

\(\mathbf{q}^{12}\) is the 12D dual-arm joint representation, \(\mathbf{T}^{W}_{ee}\) is the world-frame end-effector transform, \(T^{C}_{W}\) is the world-to-camera transform, and \(\mathbf{p}^{16}_{C}\) is the final camera-frame pose.
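The second arrow in the chain above, from a world-frame end-effector transform to a camera-frame pose, can be sketched as follows. This is an illustrative sketch, assuming 4x4 homogeneous matrices and the \((w, x, y, z)\) quaternion order used in this article; the FK step itself is not shown, and the helper names are hypothetical.

```python
import numpy as np


def rot_to_quat_wxyz(R):
    """Convert a 3x3 rotation matrix to a unit quaternion in (w, x, y, z) order."""
    tr = np.trace(R)
    if tr > 0.0:
        s = 2.0 * np.sqrt(tr + 1.0)  # s = 4 * w
        q = [0.25 * s,
             (R[2, 1] - R[1, 2]) / s,
             (R[0, 2] - R[2, 0]) / s,
             (R[1, 0] - R[0, 1]) / s]
    else:
        # Use the largest diagonal entry to pick the numerically stable branch.
        i = int(np.argmax(np.diag(R)))
        j, k = (i + 1) % 3, (i + 2) % 3
        s = 2.0 * np.sqrt(1.0 + R[i, i] - R[j, j] - R[k, k])
        q = [0.0, 0.0, 0.0, 0.0]
        q[0] = (R[k, j] - R[j, k]) / s
        q[1 + i] = 0.25 * s
        q[1 + j] = (R[j, i] + R[i, j]) / s
        q[1 + k] = (R[k, i] + R[i, k]) / s
    q = np.asarray(q, dtype=np.float64)
    return q / np.linalg.norm(q)


def world_to_camera_pose8(T_world_ee, T_cam_world, gripper):
    """Map a world-frame end-effector transform to an 8D camera-frame arm pose."""
    T_cam_ee = T_cam_world @ T_world_ee          # express the end effector in the camera frame
    pos = T_cam_ee[:3, 3]
    quat = rot_to_quat_wxyz(T_cam_ee[:3, :3])
    return np.concatenate([pos, quat, [gripper]])
```

Applying this to both arms and concatenating the two 8D results gives the 16D camera-frame pose.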

Deployment-Time Transform

During deployment, the policy predicts in 16D pose space, while the LeHome evaluation interface expects 12D joint commands:

\mathbf{p}^{16}_{C} \xrightarrow{T^{W}_{C}} \mathbf{p}^{16}_{W} \xrightarrow{\mathrm{DIK}} \mathbf{q}^{12}

This lets the policy learn in a shared camera-frame end-effector representation while still producing the joint-space commands required by the robot controller.
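To illustrate the DIK step, here is a minimal damped least-squares sketch on a toy two-link planar arm. It is not the solver used in this project: the arm, the finite-difference Jacobian, and the default damping and iteration values are all assumptions chosen to keep the example self-contained. The initial guess plays the role of the current robot state.

```python
import numpy as np


def numeric_jacobian(fk, q, eps=1e-6):
    """Finite-difference Jacobian of fk at joint vector q."""
    base = fk(q)
    J = np.zeros((base.size, q.size))
    for i in range(q.size):
        dq = np.zeros_like(q)
        dq[i] = eps
        J[:, i] = (fk(q + dq) - base) / eps
    return J


def dik_solve(fk, target, q_init, damping=1e-2, iters=50, tol=1e-6):
    """Iterate damped least-squares updates from q_init until fk(q) reaches target."""
    q = np.array(q_init, dtype=np.float64)
    for _ in range(iters):
        err = target - fk(q)
        if np.linalg.norm(err) < tol:
            break
        J = numeric_jacobian(fk, q)
        # dq = J^T (J J^T + lambda^2 I)^-1 err
        JJt = J @ J.T + (damping ** 2) * np.eye(J.shape[0])
        q = q + J.T @ np.linalg.solve(JJt, err)
    return q


def planar_fk(q, l1=0.3, l2=0.25):
    """Toy 2-link planar arm: joint angles -> end-effector (x, y)."""
    return np.array([
        l1 * np.cos(q[0]) + l2 * np.cos(q[0] + q[1]),
        l1 * np.sin(q[0]) + l2 * np.sin(q[0] + q[1]),
    ])
```

Starting from the current joint state keeps the solution close to the present configuration, which matters when several joint configurations reach the same end-effector pose.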


Validation

FK/DIK Round-Trip Validation

The FK and DIK transforms were validated with a round-trip consistency check. The goal was to verify that a policy-space action remains stable after passing through the full conversion loop: joint-space action to camera-frame end-effector action, back to joint space, and then again to camera-frame end-effector action.

The round-trip validation showed that real-robot teleoperation data stayed accurate across the full distribution. Simulation data also remained accurate through the 99.9th percentile, with position error below 1 cm and orientation error below 0.5 degrees. The simulation maximum was treated as an outlier given the number of processed frames and arm samples. Higher iteration counts, lower damping, and multiple line-search alphas produced only small accuracy gains, while the added latency slowed robot execution.

\mathbf{a}^{12} \rightarrow \mathbf{a}^{16}_{C} \rightarrow \hat{\mathbf{a}}^{12} \rightarrow \hat{\mathbf{a}}^{16}_{C}

For each selected frame, the original 12D joint-space state and action are loaded first:

\mathbf{q}^{12}_{t}, \qquad \mathbf{a}^{12}_{t}

The input transform converts the 12D joint-space action into a 16D camera-frame action:

\mathbf{a}^{16}_{C,t} = f_{\mathrm{FK}} \left( \mathbf{a}^{12}_{t} \right)

The output transform then converts this camera-frame target back into a 12D joint-space command using differential IK and the current 12D robot state:

\hat{\mathbf{a}}^{12}_{t} = f_{\mathrm{DIK}} \left( \mathbf{a}^{16}_{C,t}, \mathbf{q}^{12}_{t} \right)

Finally, the recovered joint-space action is passed through the FK input transform again:

\hat{\mathbf{a}}^{16}_{C,t} = f_{\mathrm{FK}} \left( \hat{\mathbf{a}}^{12}_{t} \right)

The comparison is made in camera-frame end-effector pose space, because that is the representation used by the policy during training and evaluation:

\mathbf{a}^{16}_{C,t} \quad \text{vs.} \quad \hat{\mathbf{a}}^{16}_{C,t}

For each arm, position error is the Euclidean distance between the original and reconstructed camera-frame end-effector positions:

e_{pos} = \left\| \mathbf{x}_{C} - \hat{\mathbf{x}}_{C} \right\|_2

Orientation error is computed as quaternion angular distance using normalized quaternions, then reported in degrees:

e_{rot} = 2\cos^{-1} \left( \left| \left\langle \mathbf{q}, \hat{\mathbf{q}} \right\rangle \right| \right)

e^{\circ}_{rot} = \frac{180}{\pi} e_{rot}
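The two error metrics can be written directly from the formulas above. This is a minimal sketch, assuming NumPy inputs and quaternions in a consistent component order; the function names are hypothetical.

```python
import numpy as np


def position_error(x, x_hat):
    """Euclidean distance between original and reconstructed positions."""
    return float(np.linalg.norm(np.asarray(x, dtype=np.float64) - np.asarray(x_hat, dtype=np.float64)))


def orientation_error_deg(q, q_hat):
    """Quaternion angular distance in degrees, invariant to the sign of either quaternion."""
    q = np.asarray(q, dtype=np.float64)
    q_hat = np.asarray(q_hat, dtype=np.float64)
    q = q / np.linalg.norm(q)
    q_hat = q_hat / np.linalg.norm(q_hat)
    # The absolute value treats q and -q as the same rotation; clip guards against rounding.
    dot = np.clip(abs(np.dot(q, q_hat)), 0.0, 1.0)
    return float(np.degrees(2.0 * np.arccos(dot)))
```

The absolute value inside the arccosine is what makes the metric insensitive to the quaternion sign ambiguity.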

Round-trip validation flowchart showing original joint action, FK to 16D camera action, DIK back to joint action, and FK reconstruction.

Round-Trip Validation Results

Dataset	Coverage	Error Type	Statistic	Value
Human	Not applicable: already represented in ground-truth 16D end-effector pose format.
Simulation data	1000 episodes 265,798 frames 531,596 arm samples	Position (m)	Median	6.88e-05
			P95	1.81e-04
			P99	8.58e-04
			P99.9	1.75e-03
			Max	2.21e-01
		Orientation (deg)	Median	2.40e-04
			P95	2.14e-02
			P99	1.00e-01
			P99.9	4.11e-01
			Max	55.105
Real teleoperation data	500 episodes 187,135 frames 374,270 arm samples	Position (m)	Median	8.43e-05
			P95	3.18e-04
			P99	7.46e-04
			P99.9	1.33e-03
			Max	4.60e-03
		Orientation (deg)	Median	2.08e-04
			P95	1.79e-03
			P99	6.38e-03
			P99.9	3.40e-01
			Max	0.591

Quaternion-Continuity Validation

In addition to round-trip accuracy, quaternion-continuity checks were used to avoid discontinuous orientation representations caused by quaternion sign ambiguity. Quaternions \(\mathbf{q}\) and \(-\mathbf{q}\) represent the same physical rotation, so consecutive frames can appear to jump if the sign convention is not made consistent over time.

For consecutive quaternions \(\mathbf{q}_{t}\) and \(\mathbf{q}_{t+1}\), sign consistency is enforced by checking the dot product:

\left\langle \mathbf{q}_{t}, \mathbf{q}_{t+1} \right\rangle < 0

If this condition is true, the next quaternion is flipped before it is stored or compared:

\mathbf{q}_{t+1} \leftarrow - \mathbf{q}_{t+1}

This keeps the orientation trajectory smooth across time and prevents the model from seeing artificial discontinuities that do not correspond to real end-effector motion. The corrected trajectories were then inspected visually to confirm that the reconstructed end-effector paths followed the intended folding motion without sudden orientation flips.
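The sign-consistency rule above amounts to a single pass over the trajectory. A minimal sketch, assuming the quaternions are stored as a `[T, 4]` NumPy array; the function name is hypothetical.

```python
import numpy as np


def enforce_quat_continuity(quats):
    """Flip quaternion signs so that consecutive quaternions have a non-negative dot product."""
    quats = np.array(quats, dtype=np.float64)  # copy so the caller's array is not modified
    for t in range(1, len(quats)):
        # Compare against the already-corrected previous quaternion.
        if np.dot(quats[t - 1], quats[t]) < 0.0:
            quats[t] = -quats[t]
    return quats
```

Because each quaternion is compared with the corrected previous one, a single flip propagates correctly through the rest of the trajectory.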

Visual Validation

The relevant pose dimensions were plotted directly alongside the top-camera video for all three datasets. The camera-frame convention used here has \(z\) pointing into the image (away from the camera), \(x\) to the right from the center of the video, and \(y\) downward from the center of the video.

Dataset type	Visual validation
Human pretraining data
Simulation data
Real teleoperation data

Training Strategy

π0.5 Base - No State Injection

Dataset: Simulation only

As the first baseline, I fine-tuned the base \(\pi_{0.5}\) VLA policy using only the transformed simulation dataset. I used simulation-only training because, without direct robot access, the only environment where I could perform systematic evaluation was Isaac Sim. The Isaac Sim evaluation environment was provided by the challenge organizers and used reference-point-based geometric checks as garment-folding checkpoints to determine whether each garment was folded correctly.

In this setup, the policy received only visual observations as input: the top camera image and both wrist-camera images. I did not inject the current robot state into the policy input. In other words, the current absolute end-effector positions, orientations, and gripper values were not provided to the model. The policy was therefore required to predict future action chunks directly from images alone.

The model output was a sequence of absolute end-effector targets in the unified 16D camera-frame pose space. Each action consisted of the position, orientation, and gripper value for both arms:

\mathbf{a}^{16} = \left[ x_L, y_L, z_L, q_{w,L}, q_{x,L}, q_{y,L}, q_{z,L}, g_L, x_R, y_R, z_R, q_{w,R}, q_{x,R}, q_{y,R}, q_{z,R}, g_R \right].

This baseline tested whether the policy could infer both the garment state and the required dual-arm end-effector motion purely from visual context. However, since the current end-effector state was not provided, the prediction problem was under-constrained: the same image observation can correspond to different robot configurations. This made the experiment useful as an initial reference point before adding explicit state conditioning in later training strategies.

The FK and DIK transforms were integrated into the \(\pi_{0.5}\) policy pipeline. The FK transform converted the original 12D simulation joint-space data into the 16D camera-frame end-effector representation used for training, while the DIK output transform converted predicted 16D end-effector targets back into the 12D joint-space commands required by the LeHome evaluation interface.

Evaluation Results

Garment split	Long-sleeve top	Short-sleeve top	Long pants	Shorts	Overall average
Seen	72.0%	44.0%	34.0%	95.0%	61.25%
Unseen	66.0%	16.0%	29.0%	46.0%	39.25%

The simulation-only, no-state baseline showed reasonable performance on some seen garments, especially shorts, but struggled on categories requiring more precise pose-conditioned manipulation, such as short-sleeve tops and long pants. The drop on unseen garments, particularly short-sleeve tops and long pants, suggested that visual-only conditioning was insufficient for robust generalization. This motivated the next training variants, where the current end-effector state was injected and the training data was expanded beyond simulation-only demonstrations.

π0.5 Base - with State Injection

Dataset: Simulation only

The next baseline kept the same transformed simulation-only dataset and the same FK/DIK action pipeline, but added the current robot state to the policy input. The policy still received the top camera image and both wrist-camera images, but it was no longer forced to infer the robot configuration from images alone.

In this setup, the current 16D camera-frame end-effector state was provided as an additional input during model training. This state represented the current absolute position, orientation, and gripper value for both end effectors in the same unified pose convention used by the action targets.

The FK and DIK transforms remained the same as in the no-state baseline. FK converted the original 12D simulation joint-space logs into the 16D camera-frame state and action representation used for training, while DIK converted predicted 16D end-effector targets back into the 12D joint-space commands required by the LeHome evaluation interface.

Evaluation Results

Garment split	Long-sleeve top	Short-sleeve top	Long pants	Shorts	Overall average
Seen	75.0%	78.0%	58.0%	93.0%	76.0%
Unseen	92.0%	37.0%	60.0%	67.0%	64.0%

Adding the current end-effector state substantially improved the simulation baseline, especially for seen short-sleeve tops and unseen long-sleeve tops. The remaining gap on unseen short-sleeve tops still pointed to category-level generalization limits, but the overall improvement confirmed that explicit state conditioning made the action prediction problem better constrained.

π0.5 Pretraining + Robot State Fine-Tuning

Datasets: Human demonstrations without masking or normalization + simulation data

The third strategy used a two-stage training pipeline. In the first stage, I trained a vanilla \(\pi_{0.5}\) base model on the human demonstration dataset for four epochs. This stage exposed the model to a wider variety of garment shapes, folds, and initial configurations before it was optimized on the robot simulation data.

For the human-demonstration pretraining stage, I kept the full 16D pose representation, used the top-camera image, masked the unavailable wrist images, and injected the current state. Only the gripper dimensions were masked; all other dimensions were left unchanged. The goal was to let the model learn from the complete human trajectory signal in the same unified camera-frame end-effector convention used by the later robot training stage, even though many dimensions were noisy. I also injected state during human pretraining to help the model learn the relationship between the current state and the next action.

After the human-demonstration stage finished, the resulting weights were used as the initialization for a second four-epoch fine-tuning stage on the transformed simulation-only robot dataset. This fine-tuning stage used state injection, so the current 16D absolute end-effector positions, orientations, and gripper values were provided to the model along with the top, left-wrist, and right-wrist camera images.

The FK and DIK transforms remained unchanged. FK converted robot joint-space logs into the unified 16D camera-frame state/action representation for training, while DIK converted predicted 16D end-effector targets back into 12D joint-space commands for simulation evaluation.

Key Training Configuration

Stage 1: Human Pretraining

Base model: \(\pi_{0.5}\)
Dataset: Human demonstrations
Action horizon: \(H = 10\)
State injection: Enabled
discrete_state_input: True
Action dimension: 16
Output action dimension: 12
Batch size: 256
GPUs: 8 x H100s
Training steps: 11400
Epochs: 4
Peak learning rate: \(1 \times 10^{-4}\)
Warmup steps: 1000
Decay steps: 11400
Decay learning rate: \(5 \times 10^{-6}\)

Stage 2: Robot Fine-Tuning

Base weights: Human-pretrained \(\pi_{0.5}\)
Dataset: Simulation only
Action horizon: \(H = 10\)
State injection: Enabled
discrete_state_input: True
Action dimension: 16
Output action dimension: 12
Batch size: 256
GPUs: 8 x H100s
Training steps: 4200
Epochs: 4
Peak learning rate: \(1 \times 10^{-4}\)
Warmup steps: 500
Decay steps: 4200
Decay learning rate: \(5 \times 10^{-6}\)
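The four learning-rate values in the configurations above describe a warmup followed by a decay to a floor. The sketch below assumes linear warmup and cosine decay; the schedule shape is an assumption, since only the peak, warmup steps, decay steps, and final learning rate are listed here. The defaults are the Stage 2 values, and the function name is hypothetical.

```python
import math


def lr_at(step, peak_lr=1e-4, warmup_steps=500, decay_steps=4200, end_lr=5e-6):
    """Learning rate at a given step: linear warmup, then cosine decay to end_lr (assumed shape)."""
    if step < warmup_steps:
        # Ramp linearly so that the last warmup step reaches peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    span = max(decay_steps - warmup_steps, 1)
    progress = min((step - warmup_steps) / span, 1.0)
    return end_lr + 0.5 * (peak_lr - end_lr) * (1.0 + math.cos(math.pi * progress))
```

Under this assumed shape, the learning rate reaches the listed floor exactly at the final training step, because the decay steps equal the training steps in both stages.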

Evaluation Results

Garment split	Long-sleeve top	Short-sleeve top	Long pants	Shorts	Overall average
Seen	77.0%	81.0%	64.0%	95.0%	79.25%
Unseen	85.0%	35.0%	64.0%	70.0%	63.50%

Compared with the \(\pi_{0.5}\) Base - with State Injection baseline, this two-stage pretraining and fine-tuning run improved the overall seen-garment score by 3.25 percentage points, but the unseen average slipped by 0.5 points. The combined seen/unseen mean still increased by 1.38 points. The clearest gains came from long pants, where performance increased by 6.0 points on seen garments and 4.0 points on unseen garments, while the drop on unseen tops showed that unmasked, unnormalized human pretraining was not uniformly beneficial.

Simulation-Only Offline RL Detour

Offline Critic Reranking from the \(\pi_{0.5}\) Base

After the two-stage \(\pi_{0.5}\) pretraining and robot-state fine-tuning pipeline, I implemented a small simulation-only critic reranking pass to push performance further. The critic was trained offline from rewarded simulation rollouts because collecting and scoring many candidate episodes across garments was much easier in simulation. Under the time constraints, this remained an offline simulation reranker rather than an online real-robot critic loop.


π0.5 Co-Training

Datasets: Human demonstrations + simulation + real-robot teleoperation

The fourth strategy moved from sequential pretraining and fine-tuning to direct co-training across all three available data sources. Human demonstrations, simulation demonstrations, and real-robot teleoperation data were mixed into a single training run after conversion into the shared policy-facing representation.

The dataset sampler was weighted so that human demonstrations used a weight of 1.5, simulation demonstrations used a weight of 4, and real-robot teleoperation demonstrations used a weight of 8. This made the effective number of samples from the three sources closer after accounting for dataset size, while still giving the strongest priority to the real-robot teleoperation data as the closest match to the evaluation embodiment.
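The balancing effect of these weights can be checked with the frame counts from the dataset table. This sketch assumes each weight acts as a per-frame multiplier on its dataset, which is one plausible reading of the sampler; the variable names are hypothetical.

```python
# Frame counts from the dataset table and sampler weights from the text above.
frames = {"human": 724_399, "simulation": 265_798, "real": 187_135}
weights = {"human": 1.5, "simulation": 4.0, "real": 8.0}

# Assumption: the weight multiplies how often each frame is drawn.
effective = {name: frames[name] * weights[name] for name in frames}
total_effective = sum(effective.values())
total_raw = sum(frames.values())

raw_share = {name: frames[name] / total_raw for name in frames}
weighted_share = {name: effective[name] / total_effective for name in frames}

for name in frames:
    print(f"{name:>10}: raw {raw_share[name]:.1%} -> weighted {weighted_share[name]:.1%}")
```

Under this assumption, the human data drops from roughly 62% of raw frames to about 30% of sampled frames, simulation rises from about 23% to about 29%, and real teleoperation rises from about 16% to about 41%.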

Before co-training, the datasets were passed through per-dimension scale normalization and masking. For human demonstrations, only the left end-effector \(x,y\) position and right end-effector \(x,y\) position were kept as reliable action/state dimensions; the remaining 12 dimensions in the 16D pose vector were masked because they were noisy. Human wrist-camera observations were also masked, so only the top image was used for human demonstrations. Simulation and real-robot teleoperation data used the same FK/DIK conversion path as the earlier robot-only runs.
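The dimension masking described above can be sketched as a per-sample mask over the 16D pose. The kept indices follow the 16D layout (left \(x,y\) at 0 and 1, right \(x,y\) at 8 and 9). The mean-squared error here is only a stand-in to show how the mask enters a loss; it is not the policy's actual training objective, and the function names are hypothetical.

```python
import numpy as np

HUMAN_KEEP_DIMS = (0, 1, 8, 9)  # left x, y and right x, y in the 16D layout


def build_dim_mask(is_human, dim=16):
    """Return a [B, dim] mask: all ones for robot samples, only x, y kept for human samples."""
    is_human = np.asarray(is_human, dtype=bool)
    mask = np.ones((len(is_human), dim), dtype=np.float64)
    human_row = np.zeros(dim, dtype=np.float64)
    human_row[list(HUMAN_KEEP_DIMS)] = 1.0
    mask[is_human] = human_row
    return mask


def masked_mse(pred, target, mask):
    """Mean squared error over unmasked dimensions. pred/target: [B, H, dim]; mask: [B, dim]."""
    sq = (pred - target) ** 2 * mask[:, None, :]
    denom = mask.sum() * pred.shape[1]
    return float(sq.sum() / max(denom, 1.0))
```

Errors in masked human dimensions contribute nothing to the loss, so noisy orientation, depth, and gripper estimates cannot pull the policy toward them.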

The human \(x,y,z\) position ranges were quantile-scaled to better match the robot datasets before the combined weighted dataset was passed into the \(\pi_{0.5}\) normalization pipeline. State was injected during co-training, and the policy continued to predict absolute action chunks in the unified 16D camera-frame end-effector space.
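One way to implement anchored quantile scaling is an affine map that aligns a chosen low and high quantile of the source data with the same quantiles of the anchor data, per dimension. This is a sketch under that assumption; the 1% and 99% levels and the function name are illustrative choices, not values taken from the project.

```python
import numpy as np


def anchored_quantile_scale(source, anchor, lo=0.01, hi=0.99, eps=1e-8):
    """Affinely rescale each source dimension so its [lo, hi] quantiles match the anchor's.

    source: [N, D] values to rescale (for example, human positions).
    anchor: [M, D] reference values (for example, real-robot teleoperation positions).
    """
    source = np.asarray(source, dtype=np.float64)
    anchor = np.asarray(anchor, dtype=np.float64)
    s_lo, s_hi = np.quantile(source, [lo, hi], axis=0)
    a_lo, a_hi = np.quantile(anchor, [lo, hi], axis=0)
    scale = (a_hi - a_lo) / np.maximum(s_hi - s_lo, eps)
    return (source - s_lo) * scale + a_lo
```

Using quantiles rather than the minimum and maximum keeps a few outlier frames from setting the scale for the whole dataset.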

Key Training Configuration

Base model: \(\pi_{0.5}\)
Datasets: Human + simulation + real-robot teleoperation
Sampling weights: Human 1.5 / simulation 4 / real-robot 8
Action horizon: \(H = 5\)
State injection: Enabled
discrete_state_input: True
Action target: Absolute 16D chunks
Output action dimension: 12
Human kept dims: Left/right EE \(x,y\)
Human masked dims: 12 of 16 pose dims
Human cameras: Top only
Normalization: Weighted per-dim scale
Batch size: 256
GPUs: 8 x H100s
Training steps: 15000
Epochs: 4 total; effective human 1.5 / simulation 4 / real-robot 8
Peak learning rate: \(1 \times 10^{-4}\)
Warmup steps: 1000
Decay steps: 15000
Decay learning rate: \(5 \times 10^{-6}\)

Evaluation Results

Garment split	Long-sleeve top	Short-sleeve top	Long pants	Shorts	Overall average
Seen	82.0%	79.0%	68.0%	96.0%	81.25%
Unseen	90.0%	28.0%	65.0%	80.0%	65.75%

Compared with the \(\pi_{0.5}\) pretraining + robot state fine-tuning run, co-training replaced sequential human pretraining and simulation fine-tuning with a single weighted mixture of human demonstrations, simulation data, and real-robot teleoperation data under a normalized state/action contract with noisy human dimensions masked. Evaluated in Isaac Sim, this produced a further 2.0 percentage-point gain on overall seen garments and a 2.25-point gain on overall unseen garments. The strongest improvements were on long-sleeve tops (+5.0 seen, +5.0 unseen), long pants (+4.0 seen, +1.0 unseen), and unseen shorts (+10.0), although short-sleeve tops dropped relative to the two-stage run (-2.0 seen, -7.0 unseen).

From State Conditioning to Future Conditioning

Up to this point, the training strategies mainly focused on making heterogeneous data usable: bringing human demonstrations, simulation rollouts, and real-robot teleoperation data into one normalized state/action contract, then masking noisy dimensions so the policy did not learn from unreliable supervision. This gave a clear improvement over robot-only training. The larger jump, however, came from adding the current robot state to the image observations. With state injection, the same top and wrist-camera images were grounded by the current end-effector positions, orientations, and gripper values, so the model no longer had to infer the robot configuration from pixels alone before predicting an action chunk.

This raised the next question: if the current state helps because it grounds where the robot is now, could the policy also be conditioned on where the scene is likely to go next? Directly generating future images would be computationally expensive and would have required more research time and resources than were available. Instead, I explored a lighter future-latent signal: predict compact future image embeddings from the current top, left-wrist, and right-wrist image embeddings, then provide those future embeddings to the policy as an additional conditioning stream. The goal was to give the VLA model a hint of the intended near-future visual state, so that the current images ground the policy in the present scene while the future latents ground it in the intended next state.

The Future Latents

Predicting an entire future image from the current image state would be expensive and would require substantial data to train a reliable future image generator. A more practical target is the image embedding already used by the \(\pi_{0.5}\) model: a dense visual feature representation produced by the policy image encoder.

For this stage, I used the co-training base as the starting point because its image encoder had already learned garment-folding priors from human demonstrations, simulation data, and real-robot teleoperation data. During future-latent preparation, the image encoder was kept frozen. I generated embeddings for the top, left-wrist, and right-wrist images, then paired each current embedding with the embedding of a future frame at the chosen prediction horizon.
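Pairing each current frame with its future frame reduces to index bookkeeping over episodes. The sketch below assumes frames are stored contiguously episode by episode and that frames too close to the end of an episode are dropped rather than padded; both are assumptions, and the function name is hypothetical.

```python
import numpy as np


def future_pairs(episode_lengths, horizon=10):
    """Return [N, 2] global frame indices (current, future) that never cross an episode boundary."""
    pairs = []
    offset = 0
    for n in episode_lengths:
        # Only frames with a valid future frame inside the same episode are kept.
        for t in range(max(n - horizon, 0)):
            pairs.append((offset + t, offset + t + horizon))
        offset += n
    return np.array(pairs, dtype=np.int64).reshape(-1, 2)
```

Keeping pairs inside one episode avoids training the predictor on a "future" frame that actually belongs to a different garment or attempt.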

The raw visual embedding is still large: each camera produces \(256\) image tokens of width \(2048\), so three cameras produce \(768\) tokens at width \(2048\). Predicting that full future representation directly would be data-hungry and computationally heavy. Instead, I first compressed the image-token embeddings into a lower-dimensional latent space, trained a future predictor in that compact space, and then used a policy adapter to project the predicted future latents back into the \(\pi_{0.5}\) prefix-token width.

The image encoder remained frozen during this process. If the encoder changed while training the final future-latent policy, the separately trained resampler and future predictor would no longer operate on the same embedding space they were trained for. Therefore the resampler and future predictor were trained separately, loaded into the policy path, and used as fixed modules while the future-latent conditioning pathway was integrated with the VLA policy.

Future latent architecture showing pi0.5-style image embeddings, compression, future prediction, policy adapter, and prefix-token conditioning.

Resampler Autoencoder

I created datasets for training the resampler and future predictor by passing organizer-provided simulation and real-robot teleoperation images through the image encoder of the \(\pi_{0.5}\) co-training base. The resampler does not predict the future directly; instead, it compresses the large image-token embeddings produced by the frozen \(\pi_{0.5}\) image encoder into a compact latent space that the future predictor can learn over more efficiently.

The input to the resampler is the stacked embedding tensor for the three cameras: top, right wrist, and left wrist. Each camera contributes \(256\) image tokens with width \(2048\), giving an input shape of \([B, 3, 256, 2048]\). The camera-specific encoder first normalizes each camera embedding, projects the token width from \(2048\) to \(512\), adds learned input positional embeddings, and then uses 24 learned latent queries with two cross-attention blocks to produce a compact latent of \([B, 24, 512]\) per camera.

The decoder mirrors this compression path. For each camera, the compact latent receives learned latent positional embeddings, then 256 learned output queries attend back into the compact representation through two cross-attention blocks. A final projection expands the representation back from \(512\) to \(2048\), reconstructing \([B, 256, 2048]\) per camera and \([B, 3, 256, 2048]\) after stacking all cameras.
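
The shape bookkeeping for the encoder and decoder paths can be written out as a small sketch. The stage names and the `resampler_shapes` helper are hypothetical labels for illustration; only the dimensions come from the description above.

```python
CAMERAS, TOKENS, WIDTH = 3, 256, 2048      # frozen image-encoder output per sample
LATENT_TOKENS, LATENT_WIDTH = 24, 512      # compact latent per camera


def resampler_shapes(batch):
    """Tensor shapes at each stage of the encoder and decoder (stage names are illustrative)."""
    return {
        "input": (batch, CAMERAS, TOKENS, WIDTH),
        "encoder_projected": (batch, CAMERAS, TOKENS, LATENT_WIDTH),
        "compact_latent": (batch, CAMERAS, LATENT_TOKENS, LATENT_WIDTH),
        "decoder_queries": (batch, CAMERAS, TOKENS, LATENT_WIDTH),
        "reconstruction": (batch, CAMERAS, TOKENS, WIDTH),
    }


values_in = TOKENS * WIDTH                   # values per camera before compression
values_latent = LATENT_TOKENS * LATENT_WIDTH  # values per camera after compression
compression = values_in / values_latent

shapes = resampler_shapes(batch=4)  # batch size 4 is an arbitrary example
print(shapes["compact_latent"], values_in, values_latent, round(compression, 1))
```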

I trained this resampler only on the simulation and real-robot teleoperation embedding datasets. Adding the human pretraining embeddings would have required a longer weighted-sampling run, and under the time constraints, I prioritized the robot-relevant simulation and teleoperation distributions. The dataset used a future offset of 10 frames, but this offset does not affect the autoencoder objective because the resampler reconstructs whichever current embedding is passed to it.

The loss combines reconstruction MSE and cosine distance. MSE penalizes elementwise error in the reconstructed embedding, while cosine loss penalizes directional mismatch between reconstructed and target token vectors. The total loss is \(\mathcal{L} = 1.0 \cdot \mathcal{L}_{MSE} + 0.2 \cdot \mathcal{L}_{cosine}\), with masked cameras ignored during loss aggregation. In the training plot, MSE and cosine are shown as separate curves so the reconstruction scale and directional-alignment term can be read independently.

\[ \mathcal{L}_{MSE} = \frac{1}{|\Omega|} \sum_{(b,c,t,d)\in\Omega} \left(\hat{e}_{b,c,t,d} - e_{b,c,t,d}\right)^2 \] \[ \mathcal{L}_{cosine} = \frac{1}{|\Omega_{tok}|} \sum_{(b,c,t)\in\Omega_{tok}} \left( 1 - \frac{\hat{\mathbf{e}}_{b,c,t}\cdot\mathbf{e}_{b,c,t}} {\|\hat{\mathbf{e}}_{b,c,t}\|_2\|\mathbf{e}_{b,c,t}\|_2} \right) \] \[ \mathcal{L}_{total} = 1.0\,\mathcal{L}_{MSE} + 0.2\,\mathcal{L}_{cosine} \]
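
A dependency-free sketch of this masked reconstruction loss for a single sample is shown below. The function name `masked_recon_loss`, the nested-list layout `[camera][token][dim]`, and the small `eps` in the cosine denominator are assumptions made for illustration; a real training loop would use batched tensor operations.

```python
import math


def masked_recon_loss(pred, target, camera_mask, w_mse=1.0, w_cos=0.2, eps=1e-8):
    """Weighted MSE + cosine loss over unmasked cameras.

    pred, target: nested lists indexed as [camera][token][dim].
    camera_mask:  one boolean per camera; False cameras are ignored.
    """
    sq_sum, n_elem, cos_sum, n_tok = 0.0, 0, 0.0, 0
    for cam_pred, cam_tgt, valid in zip(pred, target, camera_mask):
        if not valid:
            continue  # masked cameras do not contribute to either term
        for p, t in zip(cam_pred, cam_tgt):
            sq_sum += sum((a - b) ** 2 for a, b in zip(p, t))
            n_elem += len(p)
            dot = sum(a * b for a, b in zip(p, t))
            norms = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in t))
            cos_sum += 1.0 - dot / (norms + eps)
            n_tok += 1
    if n_elem == 0:
        raise ValueError("all cameras are masked")
    mse, cos = sq_sum / n_elem, cos_sum / n_tok
    return w_mse * mse + w_cos * cos, mse, cos


# Two cameras, one 2D token each: camera 0 is reconstructed exactly,
# camera 1 is orthogonal to its target.
pred = [[[1.0, 0.0]], [[0.0, 1.0]]]
target = [[[1.0, 0.0]], [[1.0, 0.0]]]
total_all, mse_all, cos_all = masked_recon_loss(pred, target, [True, True])
total_masked, _, _ = masked_recon_loss(pred, target, [True, False])
print(total_all, total_masked)
```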

Setting	Value
Input embedding	\([B, 3, 256, 2048]\)
Compact latent	\([B, 3, 24, 512]\)
Compression	\(524{,}288 \rightarrow 12{,}288\) values per camera, about \(42.7\times\)
Attention blocks	2 encoder + 2 decoder blocks per camera, 16 heads
Parameters	44,983,296 total across three camera-specific autoencoders
Training data	Real-robot teleoperation + simulation embeddings, weighted 1.0 and 0.65
Training setup	2 epochs
Validation	2.5% split, evaluated every 1,000 steps over 2,500 batches

Resampler autoencoder architecture showing camera-specific encoder compression and decoder reconstruction.

Resampler autoencoder train and validation curves for MSE reconstruction loss and cosine direction loss.

Future-Latent Predictor

Once the resampler could compress the frozen \(\pi_{0.5}\) image embeddings into compact latents, the next stage was to predict where those compact latents would move in the future. For this training stage, only the encoder side of the trained resampler was loaded and kept frozen. It compressed both the current full image embeddings and the future target embeddings into the same \([B, 3, 24, 512]\) compact-latent space.

The predictor receives the current compact latents \(z_t\), the current robot state \(s_t\), and a learned horizon token. Camera and token positional embeddings are added to the compact latents, then the three cameras and 24 latent tokens are flattened into a 72-token sequence. The 32D robot state is tokenized into four state tokens, giving a final transformer input of 77 tokens: one horizon token, four state tokens, and 72 compact-latent tokens.
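
The token bookkeeping can be sketched as follows. The ordering `[horizon | state | latents]` follows the order in which the tokens are listed above, but the exact ordering inside the model is an assumption, and `token_layout` is a hypothetical helper.

```python
N_CAMERAS, N_LATENT = 3, 24   # compact-latent tokens per camera
N_HORIZON, N_STATE = 1, 4     # one horizon token, four state tokens


def token_layout():
    """Map each token group to its slice in the assumed [horizon | state | latents] order."""
    h_end = N_HORIZON
    s_end = h_end + N_STATE
    z_end = s_end + N_CAMERAS * N_LATENT
    return {
        "horizon": slice(0, h_end),
        "state": slice(h_end, s_end),
        "latents": slice(s_end, z_end),
    }


layout = token_layout()
sequence = list(range(layout["latents"].stop))   # stand-in for the transformer outputs
latent_outputs = sequence[layout["latents"]]      # horizon and state outputs are dropped
print(len(sequence), len(latent_outputs))
```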

A 6-layer transformer encoder processes this sequence with model width 512, 16 attention heads, and a 2048-wide feed-forward block. The horizon and state-token outputs are dropped after the transformer, and the remaining latent-token outputs pass through a LayerNorm and zero-initialized delta head. The model predicts a residual compact-latent motion \(\hat{\Delta}_t\), and the final future latent is produced as \(\hat{z}_{t+h}=z_t+\hat{\Delta}_t\). In the sidecar future-pairing setup, the future offset was \(h=10\).
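
One consequence of the zero-initialized delta head is that the untrained predictor reproduces the current latent exactly. The toy sketch below illustrates this with a plain linear head on a single small token; the helper names and the width of 8 (standing in for 512) are illustrative assumptions, and the LayerNorm and transformer are replaced by random stand-in features.

```python
import random


def make_zero_head(width):
    """Linear head whose weights and bias start at zero."""
    return [[0.0] * width for _ in range(width)], [0.0] * width


def apply_head(head, features):
    weights, bias = head
    return [sum(w * x for w, x in zip(row, features)) + b for row, b in zip(weights, bias)]


def predict_future(z_t, features, head):
    """Residual prediction: z_hat = z_t + delta_hat."""
    delta_hat = apply_head(head, features)
    return [z + d for z, d in zip(z_t, delta_hat)]


rng = random.Random(0)
width = 8  # tiny stand-in for the 512-wide compact latent
z_t = [rng.gauss(0.0, 1.0) for _ in range(width)]
features = [rng.gauss(0.0, 1.0) for _ in range(width)]  # stand-in for transformer output
z_hat = predict_future(z_t, features, make_zero_head(width))
print(z_hat == z_t)
```

Starting from the copy baseline means any early improvement over copying comes from learned residual motion rather than from recovering the current latent.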

The main comparison is against the copy baseline, which simply reuses the current compact latent \(z_t\) as the future prediction. If the future predictor is useful, its MSE and cosine loss should be lower than the copy baseline, and \(1 - \mathcal{L}_{MSE}/\mathcal{L}_{copy\_MSE}\) should stay positive. The final logged validation point had predictor MSE \(0.07265\) versus copy MSE \(0.13679\), corresponding to a validation MSE improvement of about \(46.9\%\).
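
As a worked check of that figure, using only the two logged values quoted above:

\[ 1 - \frac{\mathcal{L}_{MSE}}{\mathcal{L}_{copy\_MSE}} = 1 - \frac{0.07265}{0.13679} \approx 1 - 0.5311 = 0.4689 \approx 46.9\% \]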

Predictor Residual \[ \hat{\Delta}_t = f_\theta(z_t, s_t), \qquad \hat{z}_{t+h}=z_t+\hat{\Delta}_t, \qquad \Delta_t^\star=z_{t+h}-z_t \] The model predicts future motion in compact-latent space as a residual update from the current latent instead of generating the future latent from scratch.

MSE \[ \mathcal{L}_{MSE} = \frac{1}{|\Omega|} \sum_{(b,c,i,d)\in\Omega} \left(\hat{z}_{b,c,i,d} - z^{\star}_{b,c,i,d}\right)^2 \] Measures elementwise reconstruction error between the predicted future compact latent and the target future compact latent.

Cosine Loss \[ \mathcal{L}_{cosine} = \frac{1}{|\Omega_{tok}|} \sum_{(b,c,i)\in\Omega_{tok}} \left( 1 - \frac{\hat{\mathbf{z}}_{b,c,i}\cdot\mathbf{z}^{\star}_{b,c,i}} {\|\hat{\mathbf{z}}_{b,c,i}\|_2\|\mathbf{z}^{\star}_{b,c,i}\|_2} \right) \] Measures whether each predicted compact-latent token points in the same embedding direction as the target token, independent of raw scale.

Copy MSE \[ \mathcal{L}_{copy\_MSE} = \frac{1}{|\Omega|} \sum_{(b,c,i,d)\in\Omega} \left(z_{b,c,i,d} - z^{\star}_{b,c,i,d}\right)^2 \] Baseline error from doing nothing: it treats the current compact latent \(z_t\) as if it were the future latent \(z_{t+h}\).

MSE Improvement \[ \mathcal{I}_{MSE} = 1-\frac{\mathcal{L}_{MSE}}{\mathcal{L}_{copy\_MSE}} \] Positive values mean the predictor is closer to the future target than the copy baseline. Larger is better.

Delta MSE \[ \mathcal{L}_{\Delta MSE} = \frac{1}{|\Omega|} \sum_{(b,c,i,d)\in\Omega} \left(\hat{\Delta}_{b,c,i,d}-\Delta^\star_{b,c,i,d}\right)^2 \] Measures whether the predicted residual motion itself matches the true residual motion from current latent to future latent.

Delta Norm Ratio \[ R_{\Delta norm} = \frac{1}{|\Omega_{tok}|} \sum_{(b,c,i)\in\Omega_{tok}} \frac{\|\hat{\Delta}_{b,c,i}\|_2}{\|\Delta^\star_{b,c,i}\|_2+\epsilon} \] Shows whether the predicted latent motion is under-scaled or over-scaled. A value near 1 means the predicted residual magnitude matches the target residual magnitude.

Delta Projection Ratio \[ R_{\Delta proj} = \frac{1}{|\Omega_{tok}|} \sum_{(b,c,i)\in\Omega_{tok}} \frac{\hat{\Delta}_{b,c,i}\cdot\Delta^\star_{b,c,i}} {\|\Delta^\star_{b,c,i}\|_2^2+\epsilon} \] Measures how much of the target residual is recovered along the correct direction. Higher positive values mean more useful progress toward the target future latent.

Total Weighted Loss \[ \mathcal{L}_{total} = 1.0\,\mathcal{L}_{MSE} + 0.2\,\mathcal{L}_{cosine} + 0.0\,\mathcal{L}_{\Delta MSE} \] The training objective directly optimized future-latent MSE and cosine alignment. Delta MSE was logged as a diagnostic but had zero weight in this run.

Here, \(\Omega\) is the valid unmasked embedding-element set and \(\Omega_{tok}\) is the valid token set; \(i\) indexes the compact-latent token within each camera. Copy metrics measure how hard the future prediction problem is if the model does nothing. Improvement measures how much the predictor beats that no-motion baseline. Delta MSE checks residual magnitude error, delta cosine checks residual direction alignment, delta norm ratio checks whether the predicted motion is under- or over-scaled, and delta projection ratio measures how much of the target residual is recovered along the correct direction.
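
The diagnostics can be computed together, as in the sketch below. It operates on plain lists of valid token vectors to stay dependency-free. The function name is hypothetical, and the delta cosine is assumed to be the cosine similarity between predicted and target residuals, since its formula is not written out above.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _sub(a, b):
    return [x - y for x, y in zip(a, b)]


def _mean_sq(xs, ys):
    total = sum(_dot(_sub(x, y), _sub(x, y)) for x, y in zip(xs, ys))
    return total / sum(len(x) for x in xs)


def predictor_diagnostics(z_t, z_hat, z_star, eps=1e-8):
    """z_t, z_hat, z_star: lists of token vectors (current, predicted, target future)."""
    mse = _mean_sq(z_hat, z_star)
    copy_mse = _mean_sq(z_t, z_star)
    norm_ratio = proj_ratio = delta_cos = 0.0
    for cur, pred, tgt in zip(z_t, z_hat, z_star):
        d_hat, d_star = _sub(pred, cur), _sub(tgt, cur)
        n_hat, n_star = math.sqrt(_dot(d_hat, d_hat)), math.sqrt(_dot(d_star, d_star))
        norm_ratio += n_hat / (n_star + eps)
        proj_ratio += _dot(d_hat, d_star) / (n_star ** 2 + eps)
        delta_cos += _dot(d_hat, d_star) / (n_hat * n_star + eps)  # assumed definition
    n_tok = len(z_star)
    return {
        "mse": mse,
        "copy_mse": copy_mse,
        "mse_improvement": 1.0 - mse / (copy_mse + eps),
        "delta_norm_ratio": norm_ratio / n_tok,
        "delta_projection_ratio": proj_ratio / n_tok,
        "delta_cosine": delta_cos / n_tok,
    }


# One token that moves halfway toward its target along the correct direction.
diag = predictor_diagnostics(z_t=[[0.0, 0.0]], z_hat=[[1.0, 0.0]], z_star=[[2.0, 0.0]])
print(diag)
```

In this toy case the prediction points in exactly the right direction (delta cosine near 1) but covers only half the distance, so the norm and projection ratios are both near 0.5.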

Future Predictor Takeaway

The plots and retained logs captured the later part of training rather than the cold start: for both the resampler and future predictor, the recorded curves begin after the model had already started learning. At the true first predictor step, the residual head behaved like the copy baseline: MSE improvement and the delta-direction diagnostics were all \(0\). The logged validation window then started near the end of the first epoch, where the predictor had already reached \(41.2\%\) MSE improvement, and by the final logged checkpoint it reached \(46.9\%\).

Diagnostic	First step \(\rightarrow\) final logged checkpoint
MSE improvement	0.0% \(\rightarrow\) 46.9%
Delta cosine	0.000 \(\rightarrow\) 0.372
Delta norm ratio	0.000 \(\rightarrow\) 0.807
Delta projection ratio	0.000 \(\rightarrow\) 0.300

Interpreted from the beginning of training, the predictor moved from a pure copy baseline to a useful future-latent estimator, with MSE improvement rising from \(0\%\) to \(46.9\%\). The residual diagnostics also moved away from zero: delta cosine reached \(0.372\), delta norm ratio reached \(0.807\), and delta projection reached \(0.300\). Together, these values show that the model learned a meaningful component of the target future-latent direction, although the residual magnitude was still slightly under-scaled. The next stage is to plug the frozen future predictor together with the frozen resampler encoder into the \(\pi_{0.5}\) co-training base, so the policy can condition on predicted future latents while learning to produce better action chunks.

Setting	Value
Frozen input module	Trained resampler encoder only
Compact latent	\([B, 3, 24, 512]\)
State tokens	32D state \(\rightarrow\) 4 tokens of width 512
Transformer	6 layers, 16 heads, FFN width 2048
Prediction mode	Residual: \(\hat{z}_{t+h}=z_t+\hat{\Delta}_t\)
Training data	Simulation + real-robot teleoperation embeddings, weights 0.15 and 1.0
Training setup	1 epoch
Batching	Batch 16 per GPU, gradient accumulation 2
Optimizer	LR \(1\times10^{-4}\), warmup 500, weight decay 0.01
Validation	2% split, every 500 steps over 500 batches

Future predictor architecture showing frozen resampler encoding, state tokenization, transformer prediction, and residual compact-latent output.

Future predictor comparison of predictor MSE and cosine loss against the copy baseline for train and validation.

Future predictor MSE improvement over the copy baseline for train and validation.

Future predictor delta diagnostics for delta MSE, delta cosine, delta norm ratio, and delta projection ratio.

How It Entered π0.5

The policy adapter projected predicted future latents into the \(\pi_{0.5}\) prefix-token width and appended them alongside image tokens before the language/action pathway. Training mixed predicted, true, and dropped future latents so the policy learned to use the signal without depending on a perfect predictor.

π0.5 Co-Training + Future-Latent Fine-Tuning

Datasets: Simulation + real-robot teleoperation

This final strategy started from the strongest co-training base and added the future-latent pathway as a lightweight conditioning stream. The co-training base already contained the broad behavior learned from human demonstrations, simulation demonstrations, and real-robot teleoperation. The future-latent fine-tuning stage then focused on the robot-compatible simulation and real-robot teleoperation data, where future image embeddings and robot states are available through the sidecar future-latent dataset.

The frozen image encoder produces the current \(\pi_{0.5}\) image-token embeddings. A frozen resampler encoder compresses those embeddings into compact \([3, 24, 512]\) latents, and the frozen future predictor estimates the horizon-10 future compact latent. These predicted future latents cannot be appended directly to the VLA prefix stream because their token width is 512, while the \(\pi_{0.5}\) prefix-token width is 2048. Therefore, a small policy adapter projects compact future-latent tokens into the policy prefix-token space.

The policy adapter is deliberately simple: a linear projection from 512 to 1024, a Swish activation, and a linear projection from 1024 to 2048. I did not add cross-attention, attention heads, or a larger adapter architecture in this run because of time constraints. The goal was to test whether a minimal projection was enough for the VLA policy to read the future-latent signal as an additional prefix-token hint.
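
The adapter's forward pass amounts to two linear layers with a Swish in between, applied to each future-latent token independently. The sketch below is a plain-Python illustration, not the trained module: the initialization scheme and helper names are assumptions, and the demo uses widths 8, 16, and 32 as a scaled-down stand-in for 512, 1024, and 2048 so it runs instantly.

```python
import math
import random


def swish(x):
    """x * sigmoid(x), written to avoid overflow for large negative x."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    e = math.exp(x)
    return x * e / (1.0 + e)


def init_linear(n_in, n_out, rng):
    """Uniform init scaled by 1/sqrt(n_in); the actual init is not given in the source."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def linear(params, x):
    weights, bias = params
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def adapter(token, layer1, layer2):
    """latent width -> hidden width -> Swish -> policy prefix width."""
    hidden = [swish(h) for h in linear(layer1, token)]
    return linear(layer2, hidden)


rng = random.Random(0)
latent_w, hidden_w, policy_w = 8, 16, 32   # stand-ins for 512, 1024, 2048
layer1 = init_linear(latent_w, hidden_w, rng)
layer2 = init_linear(hidden_w, policy_w, rng)
token = [rng.gauss(0.0, 1.0) for _ in range(latent_w)]
prefix_token = adapter(token, layer1, layer2)
print(len(prefix_token))
```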

During this fine-tuning stage, the image encoder, resampler encoder, and future predictor remained frozen. The policy adapter was trained with a higher learning rate than the rest of the \(\pi_{0.5}\) policy, while the already learned co-training base was updated at 0.1x the adapter learning rate. This protected the useful behavior already learned by the co-training base: early in training, the adapter output was effectively a new noisy conditioning stream, so the base policy should not be overwritten before the adapter learned a stable representation. Once the adapter began mapping future latents into a useful prefix-token pattern, the rest of the policy could make smaller adjustments to use that signal for better action-chunk prediction.

Future-latent conditioning is also mixed during training instead of always being present. In 55% of samples, the prefix tokens receive predicted future latents from the future predictor. In 30% of samples, they receive true future latents obtained by encoding the future camera frame through the frozen resampler. In the remaining 15%, the future-latent stream is dropped. This keeps the policy robust: it can learn from predicted future context, benefit from ground-truth future supervision when available, and still preserve a fallback path when the future-latent signal is missing or unreliable.
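
The per-sample choice of conditioning source can be sketched with a weighted draw. The names `LATENT_MIX` and `sample_latent_source` are hypothetical, and how a "dropped" stream is realized inside the model (for example zeroed tokens versus omitted tokens) is not specified here.

```python
import random
from collections import Counter

# Mixing proportions described above.
LATENT_MIX = (("predicted", 0.55), ("true", 0.30), ("dropped", 0.15))


def sample_latent_source(rng):
    """Pick which future-latent source a training sample receives."""
    names = [name for name, _ in LATENT_MIX]
    weights = [weight for _, weight in LATENT_MIX]
    return rng.choices(names, weights=weights, k=1)[0]


rng = random.Random(0)
n_draws = 20_000
counts = Counter(sample_latent_source(rng) for _ in range(n_draws))
freqs = {name: counts[name] / n_draws for name, _ in LATENT_MIX}
print(freqs)
```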

Key Training Configuration

Setting	Value
Strategy	\(\pi_{0.5}\) co-training + future-latent fine-tuning
Starting weights	Co-training base checkpoint
Fine-tuning data	Simulation + real-robot teleoperation sidecar data
Base data history	Human + simulation + real-robot teleoperation co-training
Action horizon	\(H = 10\)
State injection	Enabled
discrete_state_input	True
Future latent	Enabled
Future latent tokens	3 cameras x 24 tokens
Latent dimension	512
Policy token width	2048
Adapter	512 \(\rightarrow\) 1024 \(\rightarrow\) 2048, Swish
Frozen modules	Image encoder, resampler, future predictor
Latent mix	55% predicted / 30% true / 15% dropped
Action dimension	16
Output action dimension	12
Batch size	192
GPUs	8 x H100
Training steps	14,000
Effective epochs	Simulation 1 / real-robot 8
Learning rates	Adapter peak \(1 \times 10^{-4}\); base policy 0.1x adapter LR
Warmup steps	600
Decay steps	14,000
Decay learning rate	\(5 \times 10^{-6}\)
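
The learning-rate settings above can be read as a warmup-then-decay schedule with two parameter groups. The sketch below assumes a linear warmup followed by cosine decay and assumes the base-policy rate is 0.1x the adapter rate at every step; the configuration gives only the peak, warmup, decay steps, and final rate, so the curve shape and the function names are assumptions.

```python
import math


def lr_at(step, peak=1e-4, warmup=600, decay_steps=14_000, end=5e-6):
    """Adapter learning rate: linear warmup, then cosine decay to `end` (assumed shape)."""
    if step < warmup:
        return peak * (step + 1) / warmup
    progress = min(1.0, (step - warmup) / max(1, decay_steps - warmup))
    return end + 0.5 * (peak - end) * (1.0 + math.cos(math.pi * progress))


def group_lrs(step, base_scale=0.1):
    """Learning rates for the adapter and the co-training base at a given step."""
    adapter_lr = lr_at(step)
    return {"adapter": adapter_lr, "base_policy": base_scale * adapter_lr}


for step in (0, 600, 7_300, 14_000):
    print(step, group_lrs(step))
```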

Evaluation Results

Garment split	Long-sleeve top	Short-sleeve top	Long pants	Shorts	Overall average
Seen	84.0%	84.0%	69.0%	96.0%	83.25%
Unseen	95.0%	33.0%	68.0%	85.0%	70.25%

Compared with the \(\pi_{0.5}\) co-training run, future-latent fine-tuning added 2.0 percentage points on seen garments and 4.5 points on unseen garments, with gains across all unseen categories. Relative to the earlier \(\pi_{0.5}\) Base - with State Injection baseline, the cumulative improvement reached +7.25 points on seen garments and +6.25 points on unseen garments, raising the combined seen/unseen average by 6.75 absolute points.

Cumulative Results & Lessons

Training strategy	Seen Top L	Seen Top S	Seen Pants	Seen Shorts	Seen avg	Seen gain	Unseen Top L	Unseen Top S	Unseen Pants	Unseen Shorts	Unseen avg	Unseen gain	Combined avg	Combined gain
Base, no state	72.0% (base)	44.0% (base)	34.0% (base)	95.0% (base)	61.25%	Baseline	66.0% (base)	16.0% (base)	29.0% (base)	46.0% (base)	39.25%	Baseline	50.25%	Baseline
Base + state	75.0% (+3.0)	78.0% (+34.0)	58.0% (+24.0)	93.0% (-2.0)	76.00%	+14.75	92.0% (+26.0)	37.0% (+21.0)	60.0% (+31.0)	67.0% (+21.0)	64.00%	+24.75	70.00%	+19.75
Human pretrain base (no normalization + no masking) + fine-tune	77.0% (+2.0)	81.0% (+3.0)	64.0% (+6.0)	95.0% (+2.0)	79.25%	+3.25	85.0% (-7.0)	35.0% (-2.0)	64.0% (+4.0)	70.0% (+3.0)	63.50%	-0.50	71.38%	+1.38
Co-train base	82.0% (+5.0)	79.0% (-2.0)	68.0% (+4.0)	96.0% (+1.0)	81.25%	+2.00	90.0% (+5.0)	28.0% (-7.0)	65.0% (+1.0)	80.0% (+10.0)	65.75%	+2.25	73.50%	+2.13
Co-train + future latent	84.0% (+2.0)	84.0% (+5.0)	69.0% (+1.0)	96.0% (+0.0)	83.25%	+2.00	95.0% (+5.0)	33.0% (+5.0)	68.0% (+3.0)	85.0% (+5.0)	70.25%	+4.50	76.75%	+3.25

Values in parentheses and in the gain columns are changes in percentage points relative to the previous row.
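
The averages and gains in this table can be recomputed from the per-category scores. The snippet below copies the per-category success rates from the table and derives the seen, unseen, and combined averages, with each gain measured against the previous row; the short row label for the human-pretraining run and the `summarize` helper are illustrative.

```python
# Per-category success (%) in the order Top L, Top S, Pants, Shorts: (seen, unseen).
SCORES = {
    "Base, no state": ([72, 44, 34, 95], [66, 16, 29, 46]),
    "Base + state": ([75, 78, 58, 93], [92, 37, 60, 67]),
    "Human pretrain + fine-tune": ([77, 81, 64, 95], [85, 35, 64, 70]),
    "Co-train base": ([82, 79, 68, 96], [90, 28, 65, 80]),
    "Co-train + future latent": ([84, 84, 69, 96], [95, 33, 68, 85]),
}


def summarize(scores):
    """Return (name, seen_avg, unseen_avg, combined_avg, combined_gain) per strategy."""
    rows, previous = [], None
    for name, (seen, unseen) in scores.items():
        seen_avg = sum(seen) / len(seen)
        unseen_avg = sum(unseen) / len(unseen)
        combined = (seen_avg + unseen_avg) / 2
        gain = None if previous is None else combined - previous
        rows.append((name, seen_avg, unseen_avg, combined, gain))
        previous = combined
    return rows


rows = summarize(SCORES)
for row in rows:
    print(row)
```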

Result Conclusion

The largest single jump came from adding current robot state to the visual observations: the combined seen/unseen average moved from 50.25% to 70.00%. Human pretraining improved seen performance but slightly hurt unseen performance, while the normalized co-training mixture recovered and exceeded that unseen score. The final future-latent model produced the best cumulative simulation result, with a 76.75% combined average and the strongest unseen average at 70.25%, suggesting that predicted future visual embeddings gave the policy a useful nudge for action-chunk prediction.

All training-strategy evaluations reported here were performed in the organizer-provided Isaac Sim environment because I did not have direct robot access during development. For more details on the simulation environment, evaluation assumptions, and challenge setup, see the LeHome paper and the official LeHome Challenge repository.

Each reported garment score was evaluated over 50 simulation episodes per garment. Seen scores were aggregated across the 40 seen garments, while unseen scores were aggregated across the eight released unseen garments. The success criterion was binary: each garment type had a set of geometric folding conditions, and a rollout received success only if all required conditions were satisfied. There were no partial points or partial rewards in these evaluation percentages.
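
The all-or-nothing scoring can be sketched as follows. The condition names are hypothetical placeholders; the actual geometric conditions are defined by the challenge evaluation code.

```python
def rollout_success(conditions):
    """Binary criterion: a rollout succeeds only if every required condition holds."""
    return all(conditions.values())


def success_rate(rollouts):
    """Percentage of rollouts that satisfy all of their conditions."""
    return 100.0 * sum(rollout_success(r) for r in rollouts) / len(rollouts)


# Hypothetical condition names, for illustration only.
rollouts = [
    {"sleeves_folded": True, "bottom_folded": True},
    {"sleeves_folded": True, "bottom_folded": False},  # partial fold counts as a failure
]
rate = success_rate(rollouts)
print(rate)
```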

The final co-training + future-latent policy was the model deployed in the LeHome Challenge finals. A few success and failure clips from the final real-robot runs are shown below as representative examples.

Evaluation Videos from the Finals (π0.5 Co-Training + Future-Latent)

Success Examples 1x speed

Failure Examples 1x speed

Failure 1: Shorts

Failure 2: Short-Sleeve Top

Failure 3: Pants

Failure 4: Short-Sleeve Top

Failure videos 1, 2, and 4 still show useful recovery behavior: the policy moves from unusual or uncertain cloth states back toward a plausible continuation and attempts to complete the fold. I never collected human-in-the-loop data or RL recovery data for this behavior, so this appears to be an emergent result of the data mixture, state + predicted future conditioning, and training strategy.

Failure video 3 is different: the policy appears to treat a pants sample more like a shorts case, and the rollout is then compounded by a grasping issue. Across these failure clips, the common source of failure is grasp quality after the end effectors reach the desired region, both in ideal folds and in unusual cloth configurations.

Evaluation Protocol and Limits

The organizer-provided training set contained 40 seen garments. For local simulation evaluation, I had access to eight sample unseen garments distributed across the garment categories. A larger hidden unseen garment set was reserved by the organizers and was not available during development, so the unseen numbers above should be read as the best available local signal rather than a complete measure of hidden-test generalization.

I could not run systematic real-robot evaluation before the finals because I did not have robot access during development. At the competition site, the available time was enough for quick checks and final evaluation, but not for a complete controlled success-and-failure study. The videos above are representative examples from the finals; the official competition score should be taken from the LeHome leaderboard.

What Would Make the Evidence Stronger?

The following studies would make the evidence more robust, but were not feasible during the challenge because of limited robot access, time, and resources.

Future-latent causal ablations

Compare predicted future latents, copied current latents, true future latents, and dropped latents. This is the key test needed before claiming that future prediction, rather than extra token capacity, is the source of the gain.
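
One way to structure this ablation is to keep the policy fixed and swap only the source of the conditioning latent. The sketch below is a hypothetical harness, not something that was run: the arm names and the `conditioning_latent` helper are assumptions. The copied-current arm is the important control, because it supplies the same number of extra tokens as the predicted arm without any future information.

```python
ABLATION_ARMS = ("predicted", "copied_current", "true_future", "dropped")


def conditioning_latent(arm, z_t, z_hat, z_future):
    """Return the latent fed to the policy adapter for a given ablation arm.

    z_t: current compact latent, z_hat: predictor output, z_future: encoded
    future frame. None means the future-latent stream is dropped.
    """
    if arm == "predicted":
        return z_hat
    if arm == "copied_current":
        return z_t          # same token count, no future information
    if arm == "true_future":
        return z_future     # oracle upper bound, unavailable at deployment
    if arm == "dropped":
        return None
    raise ValueError(f"unknown ablation arm: {arm}")


z_t, z_hat, z_future = [0.0, 0.0], [0.8, 0.1], [1.0, 0.0]
selected = {arm: conditioning_latent(arm, z_t, z_hat, z_future) for arm in ABLATION_ARMS}
print(selected)
```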

Controlled real-robot study

If hardware access becomes available, repeat the training ladder on physical rollouts with matched garment identities, repeated trials, and explicit success-and-failure labels for grasping and folding.

Lessons

1. Most Real-Robot Failures Were Grasping Failures

The final policy usually reached plausible folding locations, but many failures came from grasp quality: missed cloth pickup, weak grasp closure, or unstable contact. The policy also showed some recovery-like behavior even though I did not explicitly train a recovery controller, which suggests that recovery priors were partially learned from the mixed demonstrations.

2. Delta Actions Were the Main Missed Training Choice

The models were trained to predict future absolute end-effector targets. In hindsight, training on action deltas would likely have made the control distribution narrower and easier to learn: the model would predict the change from the current robot state rather than repeatedly regress broad absolute positions. This should matter most for precise grasping, where small local errors can decide success or failure.
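
For the position components, the conversion between absolute targets and deltas is a subtraction against the current end-effector position. The sketch below covers positions only and uses made-up coordinates; orientation components would need a relative rotation rather than a subtraction, and the helper names are hypothetical.

```python
def to_delta_chunk(current_xyz, absolute_chunk):
    """Express each absolute target in the chunk as an offset from the current position."""
    return [[a - c for a, c in zip(target, current_xyz)] for target in absolute_chunk]


def from_delta_chunk(current_xyz, delta_chunk):
    """Recover absolute targets from deltas at deployment time."""
    return [[d + c for d, c in zip(delta, current_xyz)] for delta in delta_chunk]


current = [0.40, -0.10, 0.25]                          # illustrative position in meters
absolute = [[0.41, -0.10, 0.24], [0.43, -0.09, 0.22]]  # two-step action chunk
deltas = to_delta_chunk(current, absolute)
restored = from_delta_chunk(current, deltas)
print(deltas)
```

The delta targets stay within a few centimeters of zero regardless of where in the workspace the arm is, which is the narrower distribution the paragraph above refers to.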

3. Validation Should Hold Out Garments, Not Random Frames

My validation splits were random sample splits from the available data. A better protocol would hold out one or two full garments from each garment type and validate on those unseen garment identities. I did not use this split during the challenge because the training set was already small, and removing several garments from 40 seen garments would have reduced the available learning signal.
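
A garment-level hold-out split can be sketched as below. The even distribution of 10 garments per type in the example is an assumption made for illustration, as are the type labels and the helper name; every frame from a held-out garment would then go to validation.

```python
import random
from collections import defaultdict


def garment_holdout_split(garments, n_holdout_per_type, seed=0):
    """Split garment IDs so whole garments, not random frames, are held out.

    garments: list of (garment_id, garment_type) tuples.
    """
    by_type = defaultdict(list)
    for garment_id, garment_type in garments:
        by_type[garment_type].append(garment_id)
    rng = random.Random(seed)
    val_ids = set()
    for garment_type in sorted(by_type):
        ids = sorted(by_type[garment_type])
        val_ids.update(rng.sample(ids, min(n_holdout_per_type, len(ids))))
    train_ids = {garment_id for garment_id, _ in garments} - val_ids
    return train_ids, val_ids


# 40 garments; the even 10-per-type distribution is assumed for the example.
types = ["top_long", "top_short", "pants", "shorts"]
garments = [(f"{t}_{i:02d}", t) for t in types for i in range(10)]
train_ids, val_ids = garment_holdout_split(garments, n_holdout_per_type=2)
print(len(train_ids), sorted(val_ids))
```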

4. Future Latents Look Promising, but Need Multi-Horizon Training

The future-latent model appears to have used the predicted future embedding as a helpful prior rather than as a fully reliable plan. To unlock more of its potential, the predictor should learn multiple horizons such as t+5, t+10, and t+20. For the same current image and robot state, different future latents should correspond to different action chunks, forcing the policy to rely more directly on the future-conditioning signal.

Next Research Vision

An exploratory intuition rather than a proven claim; please read this direction with a grain of salt.

Future-latent predictors as cross-embodiment task-progress priors

A longer-term direction is to study stronger, action-free future-latent predictors for VLA policies. Instead of predicting future RGB frames, robot actions, or final goal states, such a predictor would estimate future visual embeddings conditioned on the current observation, task prompt, and time horizon. A visual foundation encoder such as DINO, V-JEPA, or a similar video representation model could provide the latent space, with the goal of predicting information-dense intermediate future states rather than pixels.

These predicted future states should not be treated as explicit goals. One hypothesis is that they are more useful as intermediate visual checkpoints along the path toward the goal, such as what the scene may look like at \(t+5\), \(t+10\), or \(t+20\). This distinction matters because the policy should not simply move toward a static final target; it should use the predicted latents as task-progress grounding while still deciding the embodiment-specific actions needed at each step.

A useful property of this direction is that the future predictor may not need robot actions, joint states, end-effector poses, or camera calibration. In principle, it could be trained from video demonstrations and language prompts alone, across human egocentric videos, robot videos, and multi-embodiment task demonstrations. This would need to be tested carefully, but the aim would be to learn a general notion of how tasks visually progress over time, independent of any single robot embodiment.

A VLA policy could then condition on the current images, robot state, prompt, and predicted future latents at one or more horizons. The future predictor would provide a representation of what intermediate task progress may look like, while the VLA would still learn how the current robot embodiment should move to realize that progress through its own kinematics, gripper, and action space.

This may also help with a limitation of unified pose-action contracts: end-effector dexterity. A dual-arm pose-action contract can unify reaching and gross bimanual motion across many embodiments, but grasping is harder because hands, grippers, and dexterous end effectors differ significantly. One hypothesis is that an extensively trained future predictor could provide useful future latents for these embodiment-specific dexterity moments, such as what a successful grasp or contact transition may look like visually. With a basic suite of teleoperation data for a new embodiment, a VLA may then learn how that embodiment's hand or gripper should realize the grasp implied by the predicted future latent.

The key open question is whether this separation between what intermediate progress should look like and how this robot should move can improve transfer to new robots, new tasks, and low-teleoperation settings. This may make heterogeneous-data training more scalable, but it would require careful ablations to verify that the future latents provide physically useful task-progress information rather than only extra conditioning capacity or visually plausible but unreachable predictions.

Flowchart showing an action-free future-latent predictor trained from videos and prompts to provide intermediate task-progress latents to a VLA policy.

Flowchart showing how future latents may support transfer to new robot embodiments while embodiment-specific teleoperation teaches grasping and control.

Special Thanks

One day before I was supposed to travel to ICRA 2026 in Vienna, my visa was rejected. Out of nowhere, I called my friend Manojkumar Srinivasan, who was studying at TU Dortmund in Germany, to ask whether he could represent me. He agreed on extremely short notice, traveled to ICRA in Vienna, downloaded the model, helped me run the evaluation, and made this result possible.

I am grateful to the LeHome organizers for allowing Manoj to participate on my behalf, and to the sponsor Lightwheel for supporting the competition. Their support let the finals evaluation go forward despite the last-minute logistics change.

Manojkumar Srinivasan receiving the LeHome Challenge third-place certificate with Alberta Longhini. Left: Manojkumar Srinivasan, my friend who represented me for the competition. Right: Alberta Longhini, organizer.

LeHome Challenge finals setup with the certificate, robot arms, and garments at ICRA 2026, where the final policy first ran on the real robot.

LeHome Challenge Organizers

Zeyi Li (Lightwheel)
Yuran Wang (PKU)
Yue Chen (PKU)
Ruihai Wu (PKU)
Alberta Longhini (KTH Royal Institute of Technology)
Haoran Lu (Northwestern University)
Hangting Liu (Lightwheel)
Shawn Xie (Lightwheel)
Kyle Xu (Lightwheel)
Shugao Liu (Lightwheel)

Special thanks to Ilia for helping set up the camera configuration while competing as a finalist. I also thank GolemCo, an Indian robotics WhatsApp community, for offering help when I could not attend ICRA 2026 in person.

Contact

Following a health break, I'm currently exploring my next steps while actively researching robotics and AI voice models.

Researchers and engineers working in robotics or embodied AI, whether in academia, industry, or startups, please feel free to reach out. I'm happy to chat, explore ideas, and discuss research problems anytime.

LinkedIn aakashjarvis1@gmail.com

Metric	Min	P1	P5	P25	Median	P75	P95	P99	Max	Mean
Spearman rank correlation	0.581	0.814	0.905	0.971	0.985	0.993	0.997	0.999	1.000	0.973
Pairwise ordering accuracy	70.8%	82.4%	88.7%	94.2%	96.3%	97.8%	99.0%	99.3%	100.0%	95.4%

Metric	Min	P1	P5	P25	Median	P75	P95	P99	Max	Mean
Spearman rank correlation	-0.220	-0.201	0.241	0.561	0.822	0.958	0.985	0.991	0.996	0.727
Pairwise ordering accuracy	42.1%	43.3%	59.5%	70.4%	82.9%	91.8%	95.3%	96.8%	97.7%	80.0%

Metric	Min	P1	P5	P25	Median	P75	P95	P99	Max	Mean
Spearman rank correlation	0.622	0.847	0.919	0.971	0.989	0.994	0.997	0.998	1.000	0.976
Pairwise ordering accuracy	72.3%	83.6%	88.9%	94.3%	96.9%	97.9%	98.7%	99.2%	100.0%	95.7%

Metric	Min	P1	P5	P25	Median	P75	P95	P99	Max	Mean
Spearman rank correlation	0.098	0.291	0.692	0.876	0.949	0.979	0.991	0.994	0.996	0.903
Pairwise ordering accuracy	52.9%	60.1%	75.0%	86.1%	91.6%	94.7%	96.7%	97.3%	98.1%	89.3%

Garment split	Long-sleeve top			Long pants
Garment split	Base	Critic	Gain	Base	Critic	Gain
Seen	77.0%	79.0%	+2.0	64.0%	67.0%	+3.0
Unseen	85.0%	93.0%	+8.0	64.0%	66.0%	+2.0
