Wuji Pick Place Vam Ti2V5B 30L Robotwin Ref Openwam 16G

This repository contains one Wuji pick-and-place VAM checkpoint from the May 26, 2026 OpenWAM/RobotWin-reference training run.

Identity

repo_id:              knightnemo/wuji-pick-place-vam-ti2v5b-30l-robotwin-ref-openwam-16g
wandb project:        wuji_pick_place
wandb run id:         kwqim9i9
wandb run name:       wuji_pick_place_vam_ti2v5b_30L_robotwin_ref_openwam_16g_0526_2127
local training dir:   /cpfs/huangsq/VAM_Learn_from_Human_Video/src/vam/models/train/wuji_pick_place_vam_ti2v5b_30L_robotwin_ref_openwam_16g
checkpoint:           step-20000.safetensors
checkpoint size:      12042172757 bytes
base model:           Wan-AI/Wan2.2-TI2V-5B
action expert style:  openwam
mask variant:         v2
reference frames:     57
reference dropout:    0.1
action/proprio dim:   54 / 54

The checkpoint is a joint model: step-20000.safetensors contains the fine-tuned video DiT weights plus the action_dit.* action-stream weights. The Wan2.2-TI2V-5B base model is not included.
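
One way to confirm that split without loading roughly 12 GB of tensors is to read only the safetensors header, which is a length-prefixed JSON table of tensor names. The following is a minimal sketch using the standard library; the helper names are illustrative, and only the action_dit. prefix comes from this card.

```python
import json
import struct


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file without reading tensor data."""
    with open(path, "rb") as f:
        # The file starts with an unsigned little-endian 64-bit header length.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # "__metadata__" is an optional free-form entry, not a tensor.
    header.pop("__metadata__", None)
    return header


def split_streams(header, action_prefix="action_dit."):
    """Group tensor names into action-stream weights and everything else."""
    action = sorted(k for k in header if k.startswith(action_prefix))
    other = sorted(k for k in header if not k.startswith(action_prefix))
    return action, other
```

Calling split_streams(read_safetensors_header("step-20000.safetensors")) should return the action_dit.* names in the first list and the remaining (video DiT) names in the second.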

Files

step-20000.safetensors  final 20k-step checkpoint
model_config.json       compact machine-readable configuration and metrics
training_config.yaml    full W&B training config snapshot
wandb-summary.json      final scalar metrics exported by W&B
training_log_node0.txt  node-0 training log
training_log_node1.txt  node-1 training log
README.md               this model card
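
To fetch the checkpoint and compare it against the byte size listed under Identity, something like the following should work. This is a sketch: hf_hub_download is the download function from the huggingface_hub package, while download_checkpoint and size_matches are illustrative helper names. A size match is only a cheap check for truncated downloads, not a content hash.

```python
import os

REPO_ID = "knightnemo/wuji-pick-place-vam-ti2v5b-30l-robotwin-ref-openwam-16g"
CHECKPOINT = "step-20000.safetensors"
EXPECTED_BYTES = 12042172757  # "checkpoint size" from the Identity block


def download_checkpoint(local_dir="."):
    """Download the checkpoint (about 12 GB) and return its local path."""
    # Imported lazily so the size check below works without huggingface_hub.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=REPO_ID, filename=CHECKPOINT, local_dir=local_dir)


def size_matches(path, expected_bytes=EXPECTED_BYTES):
    """Return True if the file on disk has exactly the expected byte count."""
    return os.path.getsize(path) == expected_bytes
```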

Final Step Metrics

These are the scalar values in wandb-summary.json at step=20000.

Metric              Value
val/loss            0.184351
val/loss_action     0.103875
val/loss_video      0.0804761
val/action_mse      0.00388935
val/action_mae      0.038713
val/video_mse       1043.3
val/video_psnr      20.521
val/video_ssim      0.750056
val/video_lpips     0.133122
train/loss          0.0919176
train/loss_action   0.0194353
train/loss_video    0.0724823
train/grad_norm     1.60938
_runtime            32573.5
_step               20000
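
The reported values are consistent with the aggregate loss being the weighted sum lambda_video * loss_video + lambda_action * loss_action, with both weights equal to 1 in the training configuration. That reading is inferred from the numbers and is not confirmed against the training code. A quick check, with illustrative helper names:

```python
import math

LAMBDA_VIDEO = 1.0   # lambda_video from the training configuration
LAMBDA_ACTION = 1.0  # lambda_action from the training configuration

# Final-step values copied from the metrics listed above.
final = {
    "val": {"loss": 0.184351, "loss_video": 0.0804761, "loss_action": 0.103875},
    "train": {"loss": 0.0919176, "loss_video": 0.0724823, "loss_action": 0.0194353},
}


def weighted_sum(m):
    """Recombine the two component losses with the configured weights."""
    return LAMBDA_VIDEO * m["loss_video"] + LAMBDA_ACTION * m["loss_action"]


def is_consistent(m, tol=1e-6):
    """True if the logged aggregate equals the weighted sum up to rounding."""
    return math.isclose(m["loss"], weighted_sum(m), abs_tol=tol)
```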

Saved Validation Losses

Only step-20000.safetensors is uploaded here. Earlier local checkpoints were saved every 2500 steps and are listed for provenance.

Best saved checkpoint by aggregate validation loss:

step 5000: loss=0.167640, loss_video=0.057506, loss_action=0.110134

Step   Loss      Video Loss  Action Loss
2500   0.254150  0.084283    0.169867
5000   0.167640  0.057506    0.110134
7500   0.270681  0.103725    0.166956
10000  0.187771  0.064312    0.123460
12500  0.233794  0.081305    0.152489
15000  0.223417  0.086691    0.136726
17500  0.244115  0.104150    0.139964
20000  0.184351  0.080476    0.103875
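
Which checkpoint counts as best depends on the metric. A short sketch over the table above that selects the minimum per column; the tuple layout and the best_step helper are illustrative.

```python
# (step, loss, video_loss, action_loss) copied from the table above.
rows = [
    (2500, 0.254150, 0.084283, 0.169867),
    (5000, 0.167640, 0.057506, 0.110134),
    (7500, 0.270681, 0.103725, 0.166956),
    (10000, 0.187771, 0.064312, 0.123460),
    (12500, 0.233794, 0.081305, 0.152489),
    (15000, 0.223417, 0.086691, 0.136726),
    (17500, 0.244115, 0.104150, 0.139964),
    (20000, 0.184351, 0.080476, 0.103875),
]

COLUMNS = {"loss": 1, "video": 2, "action": 3}


def best_step(rows, column):
    """Return the step with the lowest value in the named column."""
    idx = COLUMNS[column]
    return min(rows, key=lambda r: r[idx])[0]
```

On these numbers, step 5000 has the lowest aggregate and video loss, while the uploaded step 20000 has the lowest action loss.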

Training Configuration

Key                                  Value
dataset_type                         wuji_pick_place
wuji_robot_dataset_root              None
variant                              clean_50
height                               384
width                                320
num_frames                           33
action_video_freq_ratio              4
action_horizon                       None
action_dim                           54
proprio_dim                          54
action_format                        absolute
action_space                         joint
proprio_space                        action
action_pad_mode                      last
action_expert_style                  openwam
action_mot_backbone_pretrained_path  /cpfs/huangsq/VAM_Learn_from_Human_Video/src/vam/models/pretrained/ActionMoT_openwam_linear_interp_Wan22_alphascale_1024hdim.pt
mask_variant                         v2
mask_tail_padding_loss               True
full_reference_video                 True
max_ref_frames                       57
reference_dropout                    0.1
bridge_exclude_full_ref              True
extra_inputs                         vace_reference_image,action_trajectory
target_camera                        head_camera
reference_camera                     head_camera
resize_mode                          stretch
backbone                             ti2v
model_paths                          ["models/Wan-AI/Wan2.2-TI2V-5B/diffusion_pytorch_model-00001-of-00003.safetensors","models/Wan-AI/Wan2.2-TI2V-5B/diffusion_pytorch_model-00002-of-00003.safetensors","models/Wan-AI/Wan2.2-TI2V-5B/diffusion_pytorch_model-00003-of-00003.safetensors","models/Wan-AI/Wan2.2-TI2V-5B/models_t5_umt5-xxl-enc-bf16.pth","models/Wan-AI/Wan2.2-TI2V-5B/Wan2.2_VAE.pth"]
tokenizer_path                       models/Wan-AI/Wan2.2-TI2V-5B/google/umt5-xxl
trainable_models                     dit
learning_rate                        5e-05
action_lr                            None
weight_decay                         0.01
warmup_steps                         500
max_steps                            20000
num_epochs                           1
batch_size                           1
gradient_accumulation_steps          1
dataset_repeat                       1
dataset_num_workers                  8
use_gradient_checkpointing           True
save_steps                           2500
val_steps                            500
video_log_steps                      2500
max_val_samples                      20
lambda_video                         1
lambda_action                        1
video_dim                            3072
action_dit_dim                       1024
action_dit_ffn_dim                   4096
action_dit_num_heads                 24
action_dit_num_layers                30
proprio_dropout                      0.1
window_stride                        1
val_ratio                            0.1
output_path                          /cpfs/huangsq/VAM_Learn_from_Human_Video/src/vam/models/train/wuji_pick_place_vam_ti2v5b_30L_robotwin_ref_openwam_16g
wandb_project                        wuji_pick_place
wandb_run_name                       wuji_pick_place_vam_ti2v5b_30L_robotwin_ref_openwam_16g_0526_2127
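
The step-interval settings imply a simple schedule. Assuming each periodic event fires at every positive multiple of its interval up to max_steps (an assumption about the trainer, although it matches the saved checkpoint steps 2500 through 20000 listed earlier), the schedule can be enumerated as follows. The event_steps helper is illustrative.

```python
MAX_STEPS = 20000        # max_steps
SAVE_STEPS = 2500        # save_steps
VAL_STEPS = 500          # val_steps
VIDEO_LOG_STEPS = 2500   # video_log_steps


def event_steps(interval, max_steps=MAX_STEPS):
    """Steps at which a periodic event fires: interval, 2*interval, ... <= max_steps."""
    return list(range(interval, max_steps + 1, interval))


checkpoint_steps = event_steps(SAVE_STEPS)       # 8 checkpoints
validation_steps = event_steps(VAL_STEPS)        # 40 validation passes
video_log_steps = event_steps(VIDEO_LOG_STEPS)   # 8 video logs
```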

Input/Output Contract

Expected inputs:

prompt:                 Pick up the ball with the left hand and place it in the basket.
target camera:          head_camera
reference camera:       head_camera
target video frames:    33
full reference frames:  57
image resolution:       384 x 320 (height x width)
action/proprio dim:     54 / 54

Expected outputs:

robot-view target video rollout
54-D absolute robot action targets
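
A shape check along these lines can catch a swapped height and width or a wrong action dimension before a sample reaches the model. It is a sketch under stated assumptions: the field names, the (frames, height, width, channels) layout, and the reading of 57 reference frames as an upper bound (following the max_ref_frames key) are not specified by this card.

```python
from dataclasses import dataclass

NUM_FRAMES = 33          # target video frames
MAX_REF_FRAMES = 57      # full reference frames, treated here as an upper bound
HEIGHT, WIDTH = 384, 320
ACTION_DIM = 54
PROPRIO_DIM = 54


@dataclass(frozen=True)
class SampleShapes:
    """Array shapes for one sample; field names and layouts are hypothetical."""
    target_video: tuple     # assumed (frames, height, width, channels)
    reference_video: tuple  # assumed (frames, height, width, channels)
    proprio: tuple          # assumed (..., proprio_dim)
    actions: tuple          # assumed (steps, action_dim)


def contract_violations(s):
    """Return a list of readable mismatches against the contract above."""
    problems = []
    if tuple(s.target_video[:3]) != (NUM_FRAMES, HEIGHT, WIDTH):
        problems.append(f"target_video {s.target_video}: expected 33 x 384 x 320")
    if not 1 <= s.reference_video[0] <= MAX_REF_FRAMES:
        problems.append(f"reference_video has {s.reference_video[0]} frames: expected 1..57")
    # The card lists one image resolution, so it is applied to both videos here.
    if tuple(s.reference_video[1:3]) != (HEIGHT, WIDTH):
        problems.append(f"reference_video {s.reference_video}: expected 384 x 320 frames")
    if s.proprio[-1] != PROPRIO_DIM:
        problems.append(f"proprio last dim {s.proprio[-1]}: expected 54")
    if s.actions[-1] != ACTION_DIM:
        problems.append(f"actions last dim {s.actions[-1]}: expected 54")
    return problems
```

The number of action steps is deliberately left unchecked, since action_horizon is None in the training configuration.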

Masking Note

This run uses mask_variant=v2 with full cross-embodiment reference video conditioning and records bridge_exclude_full_ref=True.
