
Foundation Model Plan

This plan extends the current Xperience-10M scale-up path beyond the prepared Qwen3-Omni LoRA pilot. It separates immediate trainable work from later world-model and robot-policy branches, so the project can choose a backbone without mixing different research goals.

Current status: this is a planning artifact. The public repo has verified single-episode task heads and setup-stage Qwen3-Omni scripts. It has not yet run a held-out multi-episode foundation-model evaluation.

Backbone Decision

| Priority | Model family | Best role for this project | Why it fits Xperience-10M | Current decision |
| --- | --- | --- | --- | --- |
| 1 | Qwen3-Omni | Multimodal instruction model and JSON task predictor | Accepts video/audio/language directly; depth, pose, mocap, and IMU can enter through the existing sensor bridge | Keep as the first selected-episode LoRA pilot |
| 2 | Cosmos 3 | Embodied world model, action generation, and synthetic future prediction | Designed for physical-world video generation, action-conditioned world modeling, and robot/world simulation style objectives | Add as the first world-model branch after the data gate |
| 3 | NVIDIA GR00T | Humanoid/action-policy foundation model | Xperience-10M mocap, hand motion, contacts, and egocentric interaction can support retargeting and action-understanding probes | Track as a humanoid policy branch, not the first LoRA pilot |
| 4 | OpenVLA / OpenVLA-OFT | Open vision-language-action policy baseline | Useful when windows are converted into visual observation plus action-token targets | Use after action-space design is explicit |
| 5 | openpi pi0/pi0.5 | Open robot policy and action expert baseline | Useful for action chunking, policy fine-tuning, and embodiment transfer experiments | Candidate for policy branch once action labels are retargeted |
| 6 | Gemini Robotics | Closed/API embodied reasoning reference | Strong candidate for qualitative reasoning and task interpretation, but not a local fine-tune target | Use only as an external comparison or annotation assistant |
| 7 | Octo / SmolVLA-style lightweight policies | Smaller reproducible robot-policy baselines | Good for cheaper action-policy experiments, but less directly omni-modal | Optional baseline branch after selected-episode data preparation |
| Future | Xperience Embodied Foundation Model | Xperience-native domain model pretrained from scratch on full-corpus embodied experience | Would learn a shared temporal representation across video, audio, depth, pose, mocap, IMU, and language | Long-term goal after smaller pilots prove value and full-corpus storage/compute are available |

Why Qwen3-Omni Still Goes First

The immediate pilot exists to prove the full data path end to end:

  • prepared multi-episode Xperience-10M data,
  • episode-level train/test separation,
  • window-level supervised examples,
  • multimodal prompt construction,
  • sensor bridge for depth, pose, mocap, and IMU,
  • LoRA training,
  • held-out predictions and metrics.
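The episode-level train/test separation in the list above can be sketched as follows. This is a minimal illustration, not the repo's implementation: the `episode_id` field, the hash-based assignment, and the 20% test fraction are all assumptions made for the example. The point it shows is that the split is decided per episode, so every window inherits its episode's split and no episode can appear on both sides.

```python
import hashlib
from collections import defaultdict


def assign_split(episode_id: str, test_fraction: float = 0.2) -> str:
    """Deterministically map an episode to 'train' or 'test' by hashing its ID."""
    digest = hashlib.sha256(episode_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return "test" if bucket < test_fraction else "train"


def split_windows(windows, test_fraction: float = 0.2):
    """Group window records by split; each window follows its episode."""
    splits = defaultdict(list)
    for window in windows:
        split = assign_split(window["episode_id"], test_fraction)
        splits[split].append(window)
    return splits


def leaked_episodes(splits):
    """Return episode IDs that appear in both train and test (should be empty)."""
    train_eps = {w["episode_id"] for w in splits["train"]}
    test_eps = {w["episode_id"] for w in splits["test"]}
    return train_eps & test_eps


# Hypothetical window records: 10 episodes with 3 windows each.
windows = [
    {"episode_id": f"ep{i:03d}", "window_index": j}
    for i in range(10)
    for j in range(3)
]
splits = split_windows(windows)
print(len(splits["train"]), len(splits["test"]), leaked_episodes(splits))
```

Hashing the episode ID keeps the assignment stable when new episodes are added later, which a random shuffle would not.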

Qwen3-Omni is the most direct first target because the existing scripts already prepare video/audio/language prompts and adapter inputs. It is also suitable for the 12 current task contracts, which mostly produce labels, structured JSON, or short task answers.

The executable Qwen branch and future branch contracts are now represented as config files under configs/omni_backbones/. Validate them with:

python scripts/omni/backbone_registry.py --validate --json

The shared extension rules are in OMNI_MODEL_EXTENSION_CONTRACT.md. A new foundation branch should add a config first, then implement the exporter, trainer, evaluator, and launcher required by that config.
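The config-first rule can be illustrated with a small completeness check. This sketch does not reproduce the schema defined by OMNI_MODEL_EXTENSION_CONTRACT.md or the behavior of scripts/omni/backbone_registry.py; the field names and the example config are hypothetical and only mirror the four components named above.

```python
# Hypothetical field names taken from the four components named in the text.
REQUIRED_COMPONENTS = ("exporter", "trainer", "evaluator", "launcher")


def missing_components(config: dict) -> list:
    """Return the required component names that the config leaves unset."""
    return [name for name in REQUIRED_COMPONENTS if not config.get(name)]


# Hypothetical branch config with two components still unimplemented.
example = {
    "name": "hypothetical_branch",
    "exporter": "export.py",
    "trainer": "train.py",
    "evaluator": None,
}
print(missing_components(example))  # prints ['evaluator', 'launcher']
```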

Long-Term Native Pretraining Goal

Qwen3-Omni, Cosmos 3, GR00T, OpenVLA, and openpi are backbone choices for the next experiments. The longer-term goal is different: train an Xperience Embodied Foundation Model that is native to the Xperience-10M modality structure.

That model would not start as a general internet-scale omni model. It would be a domain model over synchronized embodied experience: multi-view egocentric video, audio, depth, pose/SLAM, hand and body mocap, IMU, calibration, and language annotations. Its pretraining should combine masked multimodal modeling, cross-modal contrastive alignment, future-state prediction, ego-motion and hand-motion forecasting, action/procedure prediction, language grounding, contact/affordance prediction, and optional policy-style targets after action conversion.
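Of the objectives listed, cross-modal contrastive alignment is the simplest to show in a few lines. The sketch below uses an InfoNCE-style loss, which is one common choice and an assumption here, since the plan does not name a specific loss. The embeddings are plain Python lists standing in for encoder outputs, and the temperature of 0.07 is illustrative.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def info_nce(anchors, positives, temperature: float = 0.07) -> float:
    """Mean one-directional InfoNCE loss; positives[i] is the match for anchors[i].

    Every other entry in `positives` acts as a negative for anchor i.
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, anchor in enumerate(a):
        # Cosine similarity to every candidate, scaled by the temperature.
        logits = [sum(x * y for x, y in zip(anchor, cand)) / temperature for cand in p]
        # Numerically stable log-sum-exp over all candidates.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - logits[i]
    return total / len(a)


# Toy embeddings for two synchronized windows (e.g. video vs. IMU).
video = [[1.0, 0.0], [0.0, 1.0]]
imu_aligned = [[0.9, 0.1], [0.1, 0.9]]
imu_shuffled = [[0.1, 0.9], [0.9, 0.1]]
print(info_nce(video, imu_aligned) < info_nce(video, imu_shuffled))  # prints True
```

A symmetric variant would average this loss with the same computation run in the other direction (positives as anchors).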

None of this exists in the repo yet. Native pretraining becomes appropriate only after:

  • the selected multi-episode pipeline trains and evaluates cleanly,
  • scaling from 128 episodes to thousands of episodes shows measurable gains on held-out metrics,
  • raw-corpus storage and derived-shard capacity are available,
  • distributed training and checkpoint/restart infrastructure are reliable,
  • evaluation covers held-out episodes, sessions, activities, objects, and missing-modality robustness.
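The missing-modality robustness requirement in the last item can be sketched as a leave-one-modality-out evaluation. Everything here is an assumption for illustration: the sample layout (a dict keyed by modality name), the modality names, and the `evaluate` callable, which stands in for whatever scoring function a branch actually provides.

```python
# Hypothetical modality keys; a real pipeline would use its own names.
MODALITIES = ("video", "audio", "depth", "pose", "mocap", "imu", "language")


def drop_modality(sample: dict, modality: str) -> dict:
    """Return a copy of the sample with one modality removed."""
    return {key: value for key, value in sample.items() if key != modality}


def robustness_report(samples, evaluate, modalities=MODALITIES) -> dict:
    """Score the full inputs, then each leave-one-modality-out condition."""
    report = {"full": evaluate(samples)}
    for modality in modalities:
        ablated = [drop_modality(sample, modality) for sample in samples]
        report[f"no_{modality}"] = evaluate(ablated)
    return report


def count_modalities(samples) -> float:
    """Placeholder scorer: mean number of modalities present per sample."""
    return sum(len(sample) for sample in samples) / len(samples)


samples = [{name: object() for name in MODALITIES} for _ in range(4)]
report = robustness_report(samples, count_modalities)
print(report["full"], report["no_imu"])  # prints 7.0 6.0
```

Comparing each `no_*` score against `full` shows which modalities a model depends on most.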

The full plan is documented in XPERIENCE_EMBODIED_FOUNDATION_MODEL_PRETRAINING.md.

Why Cosmos 3 Should Be Added Next

Cosmos 3 should not replace the Qwen3-Omni pilot. It should become the first world-model branch after the data gate, because the Xperience-10M modalities map directly onto the signals a physical-world model needs:

  • video streams for visual state,
  • embedded audio for event cues,
  • depth and calibration for spatial structure,
  • pose/SLAM for camera motion,
  • hand/body mocap for embodied state,
  • IMU for inertial dynamics,
  • language annotations for task semantics.

The practical Cosmos 3 branch should start with three targets:

  1. Future-window prediction: condition on earlier video/sensor windows and predict future visual or latent state.
  2. Action-conditioned world modeling: use mocap/action labels as controls and predict what changes in the scene.
  3. Synthetic data expansion: generate or score candidate futures, then test whether synthetic windows improve downstream task heads.
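The first target, future-window prediction, depends on pairing each context window with the window that follows it. A minimal sketch of that pairing over frame indices is below. The function name, the half-open index ranges, and the stride parameter are assumptions for illustration; it says nothing about how Cosmos 3 itself consumes inputs.

```python
def context_future_pairs(num_frames: int, context_len: int, future_len: int, stride: int):
    """Return (context_range, future_range) pairs of half-open frame ranges.

    Each future range starts exactly where its context range ends, so the
    prediction target never overlaps the conditioning frames.
    """
    if min(context_len, future_len, stride) <= 0:
        raise ValueError("context_len, future_len, and stride must be positive")
    pairs = []
    start = 0
    while start + context_len + future_len <= num_frames:
        context = (start, start + context_len)
        future = (start + context_len, start + context_len + future_len)
        pairs.append((context, future))
        start += stride
    return pairs


# A 10-frame episode, 4 context frames, 2 future frames, stride 2.
print(context_future_pairs(10, 4, 2, 2))
# prints [((0, 4), (4, 6)), ((2, 6), (6, 8)), ((4, 8), (8, 10))]
```

Pairs should be built per episode, after the train/test split, so that a context window and its future never straddle two episodes or two splits.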

A Cosmos 3 branch is ready to publish only after committed manifests, generated outputs, held-out metrics, and qualitative examples are available.

Policy-Model Branch

OpenVLA, openpi, GR00T, Octo, and SmolVLA-style models should be treated as policy/action branches. They need a clear action target before training:

  • egocentric action class,
  • next subtask,
  • hand trajectory chunk,
  • contact state,
  • object-affordance target,
  • retargeted humanoid/body action,
  • or robot-compatible action tokens.
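As one concrete instance of the targets above, a hand trajectory chunk can be represented as a fixed-length slice of per-frame hand positions. The sketch below is hypothetical: the `ActionChunk` type, the 3D point format, and the choice to drop a short trailing remainder are assumptions, not the repo's action-space design.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float, float]  # assumed (x, y, z) hand position


@dataclass
class ActionChunk:
    start_frame: int        # index of the first frame in the chunk
    positions: List[Point]  # one hand position per frame in the chunk


def chunk_trajectory(trajectory: List[Point], chunk_len: int) -> List[ActionChunk]:
    """Split a per-frame trajectory into non-overlapping fixed-length chunks.

    A trailing remainder shorter than chunk_len is dropped.
    """
    if chunk_len <= 0:
        raise ValueError("chunk_len must be positive")
    last_start = len(trajectory) - chunk_len
    return [
        ActionChunk(start, trajectory[start:start + chunk_len])
        for start in range(0, last_start + 1, chunk_len)
    ]


# Seven frames of a toy trajectory, chunked into groups of three.
trajectory = [(float(i), 0.0, 0.0) for i in range(7)]
chunks = chunk_trajectory(trajectory, 3)
print([chunk.start_frame for chunk in chunks])  # prints [0, 3]
```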

The current public sample is enough to prototype the data conversion, but policy quality requires multi-episode diversity. The first useful policy experiment should be a run over 64 to 128 episodes, not a one-sample demonstration.

Evaluation Additions

The foundation-model stage should add metrics beyond the current 12-task suite:

| Evaluation target | Metric family | Applies to |
| --- | --- | --- |
| Structured task prediction | JSON validity, macro-F1, accuracy, micro-F1 | Qwen3-Omni, Gemini Robotics comparison |
| Future state prediction | retrieval rank, temporal consistency, feature reconstruction, visual inspection | Cosmos 3 |
| Action-conditioned dynamics | transition accuracy, contact accuracy, next-action accuracy | Cosmos 3, OpenVLA, openpi, GR00T |
| Affordance and object interaction | object micro-F1, contact-object consistency, caption grounding | all branches |
| Cross-episode generalization | held-out episode metrics, held-out session metrics, leakage checks | all trainable branches |
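The structured-prediction metrics in the first row can be sketched in plain Python. This is an illustration under stated assumptions, not the repo's evaluator: it treats each task as single-label classification (where micro-F1 equals accuracy), and the function names are hypothetical.

```python
import json
from collections import Counter


def json_validity(outputs) -> float:
    """Fraction of model outputs that parse as JSON."""
    if not outputs:
        return 0.0
    valid = 0
    for text in outputs:
        try:
            json.loads(text)
            valid += 1
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            pass
    return valid / len(outputs)


def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1
            fn[truth] += 1

    def f1(t, p, n):
        denominator = 2 * t + p + n
        return 2 * t / denominator if denominator else 0.0

    labels = set(y_true) | set(y_pred)
    if not labels:
        return 0.0, 0.0
    macro = sum(f1(tp[l], fp[l], fn[l]) for l in labels) / len(labels)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


print(json_validity(['{"action": "pour"}', "not json"]))  # prints 0.5
macro, micro = f1_scores(["a", "a", "a", "b"], ["a", "a", "a", "a"])
print(round(macro, 3), micro)  # prints 0.429 0.75
```

The toy example shows why both averages are reported: a model that always predicts the majority class keeps a micro-F1 of 0.75 while macro-F1 drops to about 0.43, because the minority class scores zero.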

Execution Order

  1. Finish selected multi-episode pilot preparation.
  2. Run the Qwen3-Omni LoRA pilot exactly once as the first held-out baseline.
  3. Run a model-selection dry run on 3-8 episodes: Qwen3-Omni prompt-only, Qwen3-Omni LoRA, Cosmos 3 world-model preprocessing, and one policy baseline.
  4. Promote Cosmos 3 to the first world-model experiment if video/sensor preprocessing and storage fit.
  5. Promote OpenVLA/openpi/GR00T only after action targets are explicit and retargeting artifacts are traceable.
  6. Update public cards only when a branch has real manifests, predictions, metrics, and qualitative examples.
  7. Start Xperience-native pretraining only after smaller scaling stages, full-corpus storage, multi-node compute, and held-out evaluation protocols are in place.

Source Links