Wuji Pick Place Vam Ti2V5B 30L Robotwin Ref Openwam 16G

This repository contains one Wuji pick-and-place VAM checkpoint from the May 26, 2026 OpenWAM/RobotWin-reference training run.

Identity

repo_id:              knightnemo/wuji-pick-place-vam-ti2v5b-30l-robotwin-ref-openwam-16g
wandb project:        wuji_pick_place
wandb run id:         kwqim9i9
wandb run name:       wuji_pick_place_vam_ti2v5b_30L_robotwin_ref_openwam_16g_0526_2127
local training dir:   /cpfs/huangsq/VAM_Learn_from_Human_Video/src/vam/models/train/wuji_pick_place_vam_ti2v5b_30L_robotwin_ref_openwam_16g
checkpoint:           step-20000.safetensors
checkpoint size:      12042172757 bytes
base model:           Wan-AI/Wan2.2-TI2V-5B
action expert style:  openwam
mask variant:         v2
reference frames:     57
reference dropout:    0.1
action/proprio dim:   54 / 54

The checkpoint is a joint model: step-20000.safetensors contains the fine-tuned video DiT weights plus the action_dit.* action-stream weights. The Wan2.2-TI2V-5B base model is not included.
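Because both weight groups live in one file, a loader usually has to separate them by key prefix. The following is a minimal sketch, assuming every action-stream tensor name begins with `action_dit.` as described above and that all remaining keys belong to the video DiT. The helper names `split_checkpoint_keys` and `list_checkpoint_keys` and the example key names are illustrative and do not come from the training code.

```python
from typing import Iterable, List, Tuple

ACTION_PREFIX = "action_dit."  # prefix named in this model card


def split_checkpoint_keys(
    keys: Iterable[str], prefix: str = ACTION_PREFIX
) -> Tuple[List[str], List[str]]:
    """Partition tensor names into (video_dit_keys, action_keys) by prefix."""
    video_keys: List[str] = []
    action_keys: List[str] = []
    for key in keys:
        if key.startswith(prefix):
            action_keys.append(key)
        else:
            video_keys.append(key)
    return video_keys, action_keys


def list_checkpoint_keys(path: str) -> List[str]:
    """Read tensor names without loading the weights.

    Requires the `safetensors` package (and PyTorch for framework="pt").
    The import is done lazily so the helper above stays dependency-free.
    """
    from safetensors import safe_open

    with safe_open(path, framework="pt") as f:
        return list(f.keys())


# Example with made-up key names (hypothetical; the real names may differ).
example_keys = ["blocks.0.self_attn.q.weight", "action_dit.blocks.0.ffn.0.weight"]
video_keys, action_keys = split_checkpoint_keys(example_keys)
```

On the real file, `split_checkpoint_keys(list_checkpoint_keys("step-20000.safetensors"))` would give the two groups, which can then be loaded into the video DiT and the action expert separately.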

Files

step-20000.safetensors  final 20k-step checkpoint
model_config.json       compact machine-readable configuration and metrics
training_config.yaml    full W&B training config snapshot
wandb-summary.json      final scalar metrics exported by W&B
training_log_node0.txt  node-0 training log
training_log_node1.txt  node-1 training log
README.md               this model card

Final Step Metrics

These are the scalar values in wandb-summary.json at step=20000. `_runtime` is the W&B run duration in seconds (about 9.05 hours).

Metric	Value
`val/loss`	0.184351
`val/loss_action`	0.103875
`val/loss_video`	0.0804761
`val/action_mse`	0.00388935
`val/action_mae`	0.038713
`val/video_mse`	1043.3
`val/video_psnr`	20.521
`val/video_ssim`	0.750056
`val/video_lpips`	0.133122
`train/loss`	0.0919176
`train/loss_action`	0.0194353
`train/loss_video`	0.0724823
`train/grad_norm`	1.60938
`_runtime`	32573.5
`_step`	20000
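The reported numbers are consistent with the aggregate loss being a weighted sum of the video and action losses, using the `lambda_video` and `lambda_action` values from the training configuration (both 1). That formula is an inference from the table, not something confirmed by the training code. A small check under that assumption:

```python
import math

# Values copied from the Final Step Metrics table.
metrics = {
    "val/loss": 0.184351,
    "val/loss_action": 0.103875,
    "val/loss_video": 0.0804761,
    "train/loss": 0.0919176,
    "train/loss_action": 0.0194353,
    "train/loss_video": 0.0724823,
}

# Loss weights from the Training Configuration table.
LAMBDA_VIDEO = 1.0
LAMBDA_ACTION = 1.0


def weighted_loss(split: str) -> float:
    """Assumed aggregate: lambda_video * loss_video + lambda_action * loss_action."""
    return (
        LAMBDA_VIDEO * metrics[f"{split}/loss_video"]
        + LAMBDA_ACTION * metrics[f"{split}/loss_action"]
    )


# Allow for rounding in the exported scalars.
consistent = {
    split: math.isclose(weighted_loss(split), metrics[f"{split}/loss"], abs_tol=1e-6)
    for split in ("val", "train")
}
```

Both entries of `consistent` come out `True` for the values above, so the component losses can be compared directly against the aggregate.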

Saved Validation Losses

Only step-20000.safetensors is uploaded here. Earlier local checkpoints were saved every 2500 steps and are listed for provenance.

Best saved checkpoint by aggregate validation loss:

step 5000: loss=0.167640, loss_video=0.057506, loss_action=0.110134

Step	Loss	Video Loss	Action Loss
2500	0.254150	0.084283	0.169867
5000	0.167640	0.057506	0.110134
7500	0.270681	0.103725	0.166956
10000	0.187771	0.064312	0.123460
12500	0.233794	0.081305	0.152489
15000	0.223417	0.086691	0.136726
17500	0.244115	0.104150	0.139964
20000	0.184351	0.080476	0.103875
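The "best saved checkpoint" above can be reproduced from the table, and the same comparison can be run per component. A short sketch using the rows as listed (the variable names are illustrative):

```python
# (step, loss, video_loss, action_loss) rows copied from the table above.
rows = [
    (2500, 0.254150, 0.084283, 0.169867),
    (5000, 0.167640, 0.057506, 0.110134),
    (7500, 0.270681, 0.103725, 0.166956),
    (10000, 0.187771, 0.064312, 0.123460),
    (12500, 0.233794, 0.081305, 0.152489),
    (15000, 0.223417, 0.086691, 0.136726),
    (17500, 0.244115, 0.104150, 0.139964),
    (20000, 0.184351, 0.080476, 0.103875),
]

best_total = min(rows, key=lambda r: r[1])   # lowest aggregate validation loss
best_video = min(rows, key=lambda r: r[2])   # lowest video loss
best_action = min(rows, key=lambda r: r[3])  # lowest action loss

print(f"best aggregate: step {best_total[0]}")  # best aggregate: step 5000
print(f"best video:     step {best_video[0]}")  # best video:     step 5000
print(f"best action:    step {best_action[0]}") # best action:    step 20000
```

Step 5000 has the lowest aggregate and video losses, while the uploaded step-20000 checkpoint has the lowest action loss among the saved checkpoints.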

Training Configuration

Key	Value
`dataset_type`	`wuji_pick_place`
`wuji_robot_dataset_root`	`None`
`variant`	`clean_50`
`height`	`384`
`width`	`320`
`num_frames`	`33`
`action_video_freq_ratio`	`4`
`action_horizon`	`None`
`action_dim`	`54`
`proprio_dim`	`54`
`action_format`	`absolute`
`action_space`	`joint`
`proprio_space`	`action`
`action_pad_mode`	`last`
`action_expert_style`	`openwam`
`action_mot_backbone_pretrained_path`	`/cpfs/huangsq/VAM_Learn_from_Human_Video/src/vam/models/pretrained/ActionMoT_openwam_linear_interp_Wan22_alphascale_1024hdim.pt`
`mask_variant`	`v2`
`mask_tail_padding_loss`	`True`
`full_reference_video`	`True`
`max_ref_frames`	`57`
`reference_dropout`	`0.1`
`bridge_exclude_full_ref`	`True`
`extra_inputs`	`vace_reference_image,action_trajectory`
`target_camera`	`head_camera`
`reference_camera`	`head_camera`
`resize_mode`	`stretch`
`backbone`	`ti2v`
`model_paths`	`["models/Wan-AI/Wan2.2-TI2V-5B/diffusion_pytorch_model-00001-of-00003.safetensors","models/Wan-AI/Wan2.2-TI2V-5B/diffusion_pytorch_model-00002-of-00003.safetensors","models/Wan-AI/Wan2.2-TI2V-5B/diffusion_pytorch_model-00003-of-00003.safetensors","models/Wan-AI/Wan2.2-TI2V-5B/models_t5_umt5-xxl-enc-bf16.pth","models/Wan-AI/Wan2.2-TI2V-5B/Wan2.2_VAE.pth"]`
`tokenizer_path`	`models/Wan-AI/Wan2.2-TI2V-5B/google/umt5-xxl`
`trainable_models`	`dit`
`learning_rate`	`5e-05`
`action_lr`	`None`
`weight_decay`	`0.01`
`warmup_steps`	`500`
`max_steps`	`20000`
`num_epochs`	`1`
`batch_size`	`1`
`gradient_accumulation_steps`	`1`
`dataset_repeat`	`1`
`dataset_num_workers`	`8`
`use_gradient_checkpointing`	`True`
`save_steps`	`2500`
`val_steps`	`500`
`video_log_steps`	`2500`
`max_val_samples`	`20`
`lambda_video`	`1`
`lambda_action`	`1`
`video_dim`	`3072`
`action_dit_dim`	`1024`
`action_dit_ffn_dim`	`4096`
`action_dit_num_heads`	`24`
`action_dit_num_layers`	`30`
`proprio_dropout`	`0.1`
`window_stride`	`1`
`val_ratio`	`0.1`
`output_path`	`/cpfs/huangsq/VAM_Learn_from_Human_Video/src/vam/models/train/wuji_pick_place_vam_ti2v5b_30L_robotwin_ref_openwam_16g`
`wandb_project`	`wuji_pick_place`
`wandb_run_name`	`wuji_pick_place_vam_ti2v5b_30L_robotwin_ref_openwam_16g_0526_2127`

Input/Output Contract

Expected inputs:

prompt:                 Pick up the ball with the left hand and place it in the basket.
target camera:          head_camera
reference camera:       head_camera
target video frames:    33
full reference frames:  57
image resolution:       384 x 320 (height x width)
action/proprio dim:     54 / 54

Expected outputs:

robot-view target video rollout
54-D absolute robot action targets
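The contract above can be encoded as a small shape check to run before inference. This is a sketch under stated assumptions: the tensor layouts (video as frames, height, width, channels; proprioception with the feature dimension last) are not specified in this model card, and treating 57 as an upper bound on reference frames is a guess based on the `max_ref_frames` config key. `InputContract` and `check_inputs` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class InputContract:
    # Values taken from the Input/Output Contract and Training Configuration.
    target_frames: int = 33
    max_ref_frames: int = 57
    height: int = 384
    width: int = 320
    proprio_dim: int = 54
    action_dim: int = 54


CONTRACT = InputContract()


def check_inputs(
    target_shape: Sequence[int],
    ref_shape: Sequence[int],
    proprio_shape: Sequence[int],
    c: InputContract = CONTRACT,
) -> List[str]:
    """Return a list of problems; an empty list means the shapes match.

    Assumed layouts (not from the model card): videos are
    (frames, height, width, channels); proprio has its feature dim last.
    """
    problems: List[str] = []
    if target_shape[0] != c.target_frames:
        problems.append(f"target video has {target_shape[0]} frames, expected {c.target_frames}")
    # Assumption: 57 is a maximum, following the `max_ref_frames` key name.
    if not 1 <= ref_shape[0] <= c.max_ref_frames:
        problems.append(f"reference video has {ref_shape[0]} frames, expected 1..{c.max_ref_frames}")
    for name, shape in (("target", target_shape), ("reference", ref_shape)):
        if tuple(shape[1:3]) != (c.height, c.width):
            problems.append(f"{name} video is {shape[1]}x{shape[2]}, expected {c.height}x{c.width}")
    if proprio_shape[-1] != c.proprio_dim:
        problems.append(f"proprio dim is {proprio_shape[-1]}, expected {c.proprio_dim}")
    return problems


ok = check_inputs((33, 384, 320, 3), (57, 384, 320, 3), (54,))
bad = check_inputs((32, 320, 384, 3), (57, 384, 320, 3), (54,))
```

Here `ok` is empty, while `bad` reports two problems: the wrong frame count and swapped height and width on the target video.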

Masking Note

This run uses mask_variant=v2 with full cross-embodiment reference video conditioning (full_reference_video=True). The config also sets bridge_exclude_full_ref=True.
