Wuji Pick Place Vam Ti2V5B 30L Robotwin Ref Openwam 16G
This repository contains one Wuji pick-and-place VAM checkpoint from the May 26, 2026 OpenWAM/RobotWin-reference training run.
Identity
repo_id: knightnemo/wuji-pick-place-vam-ti2v5b-30l-robotwin-ref-openwam-16g
wandb project: wuji_pick_place
wandb run id: kwqim9i9
wandb run name: wuji_pick_place_vam_ti2v5b_30L_robotwin_ref_openwam_16g_0526_2127
local training dir: /cpfs/huangsq/VAM_Learn_from_Human_Video/src/vam/models/train/wuji_pick_place_vam_ti2v5b_30L_robotwin_ref_openwam_16g
checkpoint: step-20000.safetensors
checkpoint size: 12042172757 bytes
base model: Wan-AI/Wan2.2-TI2V-5B
action expert style: openwam
mask variant: v2
reference frames: 57
reference dropout: 0.1
action/proprio dim: 54 / 54
The checkpoint is a joint model: step-20000.safetensors contains the fine-tuned
video DiT weights plus the action_dit.* action-stream weights. The
Wan2.2-TI2V-5B base model is not included.
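Because both weight groups live in one file, separating them comes down to a key-prefix filter. The following is a minimal sketch, assuming the `safetensors` package (with PyTorch) is installed and that every action-stream tensor name starts with `action_dit.` as stated above. The helper names and the demo key names are hypothetical and are not taken from the training code.

```python
ACTION_PREFIX = "action_dit."  # prefix named in this model card


def split_keys(keys, action_prefix=ACTION_PREFIX):
    """Partition tensor names into (video_dit_keys, action_keys)."""
    action = sorted(k for k in keys if k.startswith(action_prefix))
    video = sorted(k for k in keys if not k.startswith(action_prefix))
    return video, action


def read_checkpoint_keys(path):
    """List tensor names in a .safetensors file without loading tensor data."""
    from safetensors import safe_open  # third-party; imported lazily

    with safe_open(path, framework="pt") as f:
        return list(f.keys())


if __name__ == "__main__":
    # Hypothetical key names, for illustration only.
    demo = ["blocks.0.self_attn.q.weight", "action_dit.blocks.0.ffn.0.weight"]
    video, action = split_keys(demo)
    print(len(video), "video keys;", len(action), "action keys")
```

On the real file, `split_keys(read_checkpoint_keys("step-20000.safetensors"))` would give the two groups, provided the prefix assumption holds.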
Files
step-20000.safetensors: final 20k-step checkpoint
model_config.json: compact machine-readable configuration and metrics
training_config.yaml: full W&B training config snapshot
wandb-summary.json: final scalar metrics exported by W&B
training_log_node0.txt: node-0 training log
training_log_node1.txt: node-1 training log
README.md: this model card
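The small metadata files can be fetched without pulling the roughly 12 GB checkpoint. The following is a minimal sketch, assuming the `huggingface_hub` package is installed and the repository is accessible; the `fetch` helper is a hypothetical wrapper around `hf_hub_download`.

```python
REPO_ID = "knightnemo/wuji-pick-place-vam-ti2v5b-30l-robotwin-ref-openwam-16g"
SMALL_FILES = ["model_config.json", "training_config.yaml", "wandb-summary.json"]
CHECKPOINT = "step-20000.safetensors"  # 12042172757 bytes per this card


def fetch(filenames, repo_id=REPO_ID):
    """Download the named files and return {filename: local cache path}."""
    from huggingface_hub import hf_hub_download  # third-party; imported lazily

    return {name: hf_hub_download(repo_id=repo_id, filename=name) for name in filenames}


if __name__ == "__main__":
    paths = fetch(SMALL_FILES)        # metadata only
    # paths.update(fetch([CHECKPOINT]))  # uncomment to pull the weights too
    for name, path in paths.items():
        print(name, "->", path)
```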
Final Step Metrics
These are the scalar values in wandb-summary.json at step=20000.
| Metric | Value |
|---|---|
| val/loss | 0.184351 |
| val/loss_action | 0.103875 |
| val/loss_video | 0.0804761 |
| val/action_mse | 0.00388935 |
| val/action_mae | 0.038713 |
| val/video_mse | 1043.3 |
| val/video_psnr | 20.521 |
| val/video_ssim | 0.750056 |
| val/video_lpips | 0.133122 |
| train/loss | 0.0919176 |
| train/loss_action | 0.0194353 |
| train/loss_video | 0.0724823 |
| train/grad_norm | 1.60938 |
| _runtime | 32573.5 |
| _step | 20000 |
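The reported totals are consistent with a weighted sum of the two component losses, using the `lambda_video` and `lambda_action` values of 1 from the training configuration. The following sketch checks that arithmetic. The weighted-sum form is an inference from the numbers, not a quote from the training code, and `combined_loss` is a hypothetical helper.

```python
LAMBDA_VIDEO = 1.0   # lambda_video in the training configuration
LAMBDA_ACTION = 1.0  # lambda_action in the training configuration


def combined_loss(loss_video, loss_action,
                  lambda_video=LAMBDA_VIDEO, lambda_action=LAMBDA_ACTION):
    """Weighted sum of the video and action losses (assumed form)."""
    return lambda_video * loss_video + lambda_action * loss_action


# Component values copied from the table above.
val_total = combined_loss(0.0804761, 0.103875)
train_total = combined_loss(0.0724823, 0.0194353)

# Both agree with the reported val/loss and train/loss to within rounding.
assert abs(val_total - 0.184351) < 1e-6
assert abs(train_total - 0.0919176) < 1e-6
```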
Saved Validation Losses
Only step-20000.safetensors is uploaded here. Earlier checkpoints were saved
locally every 2500 steps; their validation losses are listed for provenance.
Best saved checkpoint by aggregate validation loss:
step 5000: loss=0.167640, loss_video=0.057506, loss_action=0.110134
| Step | Loss | Video Loss | Action Loss |
|---|---|---|---|
| 2500 | 0.254150 | 0.084283 | 0.169867 |
| 5000 | 0.167640 | 0.057506 | 0.110134 |
| 7500 | 0.270681 | 0.103725 | 0.166956 |
| 10000 | 0.187771 | 0.064312 | 0.123460 |
| 12500 | 0.233794 | 0.081305 | 0.152489 |
| 15000 | 0.223417 | 0.086691 | 0.136726 |
| 17500 | 0.244115 | 0.104150 | 0.139964 |
| 20000 | 0.184351 | 0.080476 | 0.103875 |
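The table can also be ranked per column, which shows that the best step depends on the criterion. The following is a small sketch over the values copied from the table; `best_step` is a hypothetical helper.

```python
# (step, loss, video_loss, action_loss), copied from the table above.
ROWS = [
    (2500, 0.254150, 0.084283, 0.169867),
    (5000, 0.167640, 0.057506, 0.110134),
    (7500, 0.270681, 0.103725, 0.166956),
    (10000, 0.187771, 0.064312, 0.123460),
    (12500, 0.233794, 0.081305, 0.152489),
    (15000, 0.223417, 0.086691, 0.136726),
    (17500, 0.244115, 0.104150, 0.139964),
    (20000, 0.184351, 0.080476, 0.103875),
]
LOSS, VIDEO, ACTION = 1, 2, 3  # column indices into each row


def best_step(rows, column):
    """Return the step whose value in the given column is lowest."""
    return min(rows, key=lambda row: row[column])[0]


for name, column in [("loss", LOSS), ("video", VIDEO), ("action", ACTION)]:
    print(name, best_step(ROWS, column))
```

On these numbers, step 5000 is lowest on aggregate and video loss, while the uploaded step-20000 checkpoint is lowest on action loss.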
Training Configuration
| Key | Value |
|---|---|
| dataset_type | wuji_pick_place |
| wuji_robot_dataset_root | None |
| variant | clean_50 |
| height | 384 |
| width | 320 |
| num_frames | 33 |
| action_video_freq_ratio | 4 |
| action_horizon | None |
| action_dim | 54 |
| proprio_dim | 54 |
| action_format | absolute |
| action_space | joint |
| proprio_space | action |
| action_pad_mode | last |
| action_expert_style | openwam |
| action_mot_backbone_pretrained_path | /cpfs/huangsq/VAM_Learn_from_Human_Video/src/vam/models/pretrained/ActionMoT_openwam_linear_interp_Wan22_alphascale_1024hdim.pt |
| mask_variant | v2 |
| mask_tail_padding_loss | True |
| full_reference_video | True |
| max_ref_frames | 57 |
| reference_dropout | 0.1 |
| bridge_exclude_full_ref | True |
| extra_inputs | vace_reference_image,action_trajectory |
| target_camera | head_camera |
| reference_camera | head_camera |
| resize_mode | stretch |
| backbone | ti2v |
| model_paths | ["models/Wan-AI/Wan2.2-TI2V-5B/diffusion_pytorch_model-00001-of-00003.safetensors","models/Wan-AI/Wan2.2-TI2V-5B/diffusion_pytorch_model-00002-of-00003.safetensors","models/Wan-AI/Wan2.2-TI2V-5B/diffusion_pytorch_model-00003-of-00003.safetensors","models/Wan-AI/Wan2.2-TI2V-5B/models_t5_umt5-xxl-enc-bf16.pth","models/Wan-AI/Wan2.2-TI2V-5B/Wan2.2_VAE.pth"] |
| tokenizer_path | models/Wan-AI/Wan2.2-TI2V-5B/google/umt5-xxl |
| trainable_models | dit |
| learning_rate | 5e-05 |
| action_lr | None |
| weight_decay | 0.01 |
| warmup_steps | 500 |
| max_steps | 20000 |
| num_epochs | 1 |
| batch_size | 1 |
| gradient_accumulation_steps | 1 |
| dataset_repeat | 1 |
| dataset_num_workers | 8 |
| use_gradient_checkpointing | True |
| save_steps | 2500 |
| val_steps | 500 |
| video_log_steps | 2500 |
| max_val_samples | 20 |
| lambda_video | 1 |
| lambda_action | 1 |
| video_dim | 3072 |
| action_dit_dim | 1024 |
| action_dit_ffn_dim | 4096 |
| action_dit_num_heads | 24 |
| action_dit_num_layers | 30 |
| proprio_dropout | 0.1 |
| window_stride | 1 |
| val_ratio | 0.1 |
| output_path | /cpfs/huangsq/VAM_Learn_from_Human_Video/src/vam/models/train/wuji_pick_place_vam_ti2v5b_30L_robotwin_ref_openwam_16g |
| wandb_project | wuji_pick_place |
| wandb_run_name | wuji_pick_place_vam_ti2v5b_30L_robotwin_ref_openwam_16g_0526_2127 |
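A few derived quantities follow from the configuration by simple arithmetic. The sketch below rests on two loud assumptions that this card does not confirm: that `batch_size` is a per-device value, and that the `16g` suffix in the run name means 16 GPUs (`WORLD_SIZE` is therefore a guess). The seconds-per-step figure divides the W&B `_runtime` by `_step`, so it includes validation and logging time.

```python
BATCH_SIZE = 1            # batch_size from the table above
GRAD_ACCUM = 1            # gradient_accumulation_steps from the table above
MAX_STEPS = 20000         # max_steps from the table above
RUNTIME_SECONDS = 32573.5 # _runtime from wandb-summary.json
WORLD_SIZE = 16           # ASSUMPTION: inferred from "16g" in the run name


def effective_batch_size(batch_size, grad_accum, world_size):
    """Samples consumed per optimizer step under data parallelism."""
    return batch_size * grad_accum * world_size


effective = effective_batch_size(BATCH_SIZE, GRAD_ACCUM, WORLD_SIZE)
samples_seen = effective * MAX_STEPS
seconds_per_step = RUNTIME_SECONDS / MAX_STEPS

print("effective batch size:", effective)
print("training samples seen:", samples_seen)
print("wall-clock seconds per step:", round(seconds_per_step, 2))
```

If the run used a different number of devices, only `WORLD_SIZE` needs to change.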
Input/Output Contract
Expected inputs:
prompt: Pick up the ball with the left hand and place it in the basket.
target camera: head_camera
reference camera: head_camera
target video frames: 33
full reference frames: 57
image resolution: 384 x 320 (height x width)
action/proprio dim: 54 / 54
Expected outputs:
robot-view target video rollout
54-D absolute robot action targets
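The contract above can be turned into a quick pre-flight shape check. The following is a minimal sketch that works on plain shape tuples. The `(frames, height, width, channels)` layout, the rule that a reference clip may hold between 1 and 57 frames, and the names `Contract` and `check_shapes` are all assumptions for illustration, not the repository's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contract:
    target_frames: int = 33
    max_ref_frames: int = 57
    height: int = 384
    width: int = 320
    proprio_dim: int = 54


def check_shapes(target_video, reference_video, proprio, contract=Contract()):
    """Return a list of problems; an empty list means the shapes fit.

    Video shapes are assumed to be (frames, height, width, channels).
    """
    problems = []
    expected = (contract.target_frames, contract.height, contract.width)
    if tuple(target_video[:3]) != expected:
        problems.append(f"target video {tuple(target_video)} != {expected} + (C,)")
    ref_frames, ref_h, ref_w = reference_video[:3]
    if not 1 <= ref_frames <= contract.max_ref_frames:
        problems.append(f"reference has {ref_frames} frames, max {contract.max_ref_frames}")
    if (ref_h, ref_w) != (contract.height, contract.width):
        problems.append(f"reference resolution {(ref_h, ref_w)} is not 384 x 320")
    if tuple(proprio) != (contract.proprio_dim,):
        problems.append(f"proprio shape {tuple(proprio)} != ({contract.proprio_dim},)")
    return problems


if __name__ == "__main__":
    print(check_shapes((33, 384, 320, 3), (57, 384, 320, 3), (54,)))
```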
Masking Note
This run uses mask_variant=v2 with full cross-embodiment reference video
conditioning and records bridge_exclude_full_ref=True.