ropedia-xperience-10m-task-baselines/results/episode_task_suite/research_directions/research_direction_task_map.csv
direction,direction_name,task,task_name,family,relationship,primary_direction,metric_name,minimal_metric,neural_mlp_metric,better_baseline,why,current_limit
C,Egocentric Vision & Interaction,timeline_action,Timeline action recognition,supervised,direct,C,macro-F1,0.05,0.0148148148148,minimal,Reads egocentric sensor state as the current human action; also provides a weak human-motion readout.,Chronological single-episode split creates unseen future action classes.
A,Human Modeling & Motion Understanding,timeline_action,Timeline action recognition,supervised,proxy,C,macro-F1,0.05,0.0148148148148,minimal,Reads egocentric sensor state as the current human action; also provides a weak human-motion readout.,Chronological single-episode split creates unseen future action classes.
C,Egocentric Vision & Interaction,timeline_subtask,Timeline subtask recognition,supervised,direct,C,macro-F1,0.0505635551385,0.0281081081081,minimal,Segments egocentric task state and provides a first proxy for symbolic world/task state.,Single-episode ordering makes future subtasks hard to generalize.
D,Scene Reconstruction & World Modeling,timeline_subtask,Timeline subtask recognition,supervised,proxy,C,macro-F1,0.0505635551385,0.0281081081081,minimal,Segments egocentric task state and provides a first proxy for symbolic world/task state.,Single-episode ordering makes future subtasks hard to generalize.
C,Egocentric Vision & Interaction,transition_detection,Action transition detection,diagnostic,direct,C,macro-F1,0.611823759063,0.586206896552,minimal,Localizes egocentric task boundaries and diagnoses temporal state changes.,"Boundary class is sparse, so accuracy alone is misleading."
D,Scene Reconstruction & World Modeling,transition_detection,Action transition detection,diagnostic,diagnostic,C,macro-F1,0.611823759063,0.586206896552,minimal,Localizes egocentric task boundaries and diagnoses temporal state changes.,"Boundary class is sparse, so accuracy alone is misleading."
C,Egocentric Vision & Interaction,next_action,Short-horizon next action,supervised,direct,C,macro-F1,0.0592592592593,0.0418604651163,minimal,Tests action intention/task-flow prediction from egocentric context.,Unseen future labels dominate the single-episode chronological test.
D,Scene Reconstruction & World Modeling,next_action,Short-horizon next action,supervised,proxy,C,macro-F1,0.0592592592593,0.0418604651163,minimal,Tests action intention/task-flow prediction from egocentric context.,Unseen future labels dominate the single-episode chronological test.
A,Human Modeling & Motion Understanding,hand_trajectory_forecast,Hand trajectory forecasting,forecast,direct,A,MPJPE,0.864657044411,0.107850186527,neural_mlp,Directly predicts human hand motion and supports hand-object interaction modeling.,Forecasting is window-level and not yet a full sequence or policy model.
C,Egocentric Vision & Interaction,hand_trajectory_forecast,Hand trajectory forecasting,forecast,proxy,A,MPJPE,0.864657044411,0.107850186527,neural_mlp,Directly predicts human hand motion and supports hand-object interaction modeling.,Forecasting is window-level and not yet a full sequence or policy model.
A,Human Modeling & Motion Understanding,contact_prediction,Body/object contact prediction,supervised,direct,A,macro-F1,1,1,tie,"Targets physical interaction state, a core affordance and manipulation signal.",The public sample is degenerate for this target because one class dominates.
C,Egocentric Vision & Interaction,contact_prediction,Body/object contact prediction,supervised,proxy,A,macro-F1,1,1,tie,"Targets physical interaction state, a core affordance and manipulation signal.",The public sample is degenerate for this target because one class dominates.
C,Egocentric Vision & Interaction,object_relevance,Relevant object set prediction,supervised,direct,C,micro-F1,0.180343820954,0.167927927928,minimal,Connects egocentric activity to manipulated objects and early object-centric state.,Object labels are language-derived and sparse in one episode.
A,Human Modeling & Motion Understanding,object_relevance,Relevant object set prediction,supervised,proxy,C,micro-F1,0.180343820954,0.167927927928,minimal,Connects egocentric activity to manipulated objects and early object-centric state.,Object labels are language-derived and sparse in one episode.
D,Scene Reconstruction & World Modeling,object_relevance,Relevant object set prediction,supervised,proxy,C,micro-F1,0.180343820954,0.167927927928,minimal,Connects egocentric activity to manipulated objects and early object-centric state.,Object labels are language-derived and sparse in one episode.
C,Egocentric Vision & Interaction,caption_grounding,Caption-to-window grounding,retrieval,direct,C,MRR,0.0160234790503,0.0168412556713,neural_mlp,Grounds language annotation into egocentric sensor time and task state.,Bag-of-objects language features are too weak for rich grounding.
D,Scene Reconstruction & World Modeling,caption_grounding,Caption-to-window grounding,retrieval,proxy,C,MRR,0.0160234790503,0.0168412556713,neural_mlp,Grounds language annotation into egocentric sensor time and task state.,Bag-of-objects language features are too weak for rich grounding.
C,Egocentric Vision & Interaction,cross_modal_retrieval,Cross-modal retrieval,retrieval,diagnostic,C,MRR,0.26925966893,0.129997189865,minimal,"Tests whether synchronized modalities identify the same 4D moment, a prerequisite for reconstruction and world modeling.","Retrieval shows an alignment signal, not geometric reconstruction."
B,3D/4D Reconstruction & Neural Rendering,cross_modal_retrieval,Cross-modal retrieval,retrieval,proxy,C,MRR,0.26925966893,0.129997189865,minimal,"Tests whether synchronized modalities identify the same 4D moment, a prerequisite for reconstruction and world modeling.","Retrieval shows an alignment signal, not geometric reconstruction."
D,Scene Reconstruction & World Modeling,cross_modal_retrieval,Cross-modal retrieval,retrieval,proxy,C,MRR,0.26925966893,0.129997189865,minimal,"Tests whether synchronized modalities identify the same 4D moment, a prerequisite for reconstruction and world modeling.","Retrieval shows an alignment signal, not geometric reconstruction."
B,3D/4D Reconstruction & Neural Rendering,modality_reconstruction,Modality reconstruction,forecast,proxy,B,R2,-0.0152718989139,-0.0101714101342,neural_mlp,Predicts visual/depth state from non-target sensors as a weak reconstruction/world-model objective.,"Feature-vector reconstruction is not pixel, depth-map, mesh, NeRF, or Gaussian reconstruction."
D,Scene Reconstruction & World Modeling,modality_reconstruction,Modality reconstruction,forecast,proxy,B,R2,-0.0152718989139,-0.0101714101342,neural_mlp,Predicts visual/depth state from non-target sensors as a weak reconstruction/world-model objective.,"Feature-vector reconstruction is not pixel, depth-map, mesh, NeRF, or Gaussian reconstruction."
C,Egocentric Vision & Interaction,temporal_order,Temporal order verification,diagnostic,diagnostic,C,F1,0.53995157385,0.85201793722,neural_mlp,Checks whether features encode local time direction and task progression.,"Only local adjacent ordering, not long-horizon causal modeling."
D,Scene Reconstruction & World Modeling,temporal_order,Temporal order verification,diagnostic,diagnostic,C,F1,0.53995157385,0.85201793722,neural_mlp,Checks whether features encode local time direction and task progression.,"Only local adjacent ordering, not long-horizon causal modeling."
C,Egocentric Vision & Interaction,misalignment_detection,Cross-modal misalignment detection,diagnostic,diagnostic,C,F1,0.505169867061,0.715268225585,neural_mlp,"Detects temporal desynchronization, a key data-quality gate for multimodal reconstruction and world models.",Synthetic shifts diagnose alignment but do not solve calibration or mapping.
B,3D/4D Reconstruction & Neural Rendering,misalignment_detection,Cross-modal misalignment detection,diagnostic,diagnostic,C,F1,0.505169867061,0.715268225585,neural_mlp,"Detects temporal desynchronization, a key data-quality gate for multimodal reconstruction and world models.",Synthetic shifts diagnose alignment but do not solve calibration or mapping.
D,Scene Reconstruction & World Modeling,misalignment_detection,Cross-modal misalignment detection,diagnostic,diagnostic,C,F1,0.505169867061,0.715268225585,neural_mlp,"Detects temporal desynchronization, a key data-quality gate for multimodal reconstruction and world models.",Synthetic shifts diagnose alignment but do not solve calibration or mapping.
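
The `better_baseline` column appears to follow from comparing `minimal_metric` with `neural_mlp_metric` under each metric's direction. The sketch below is one possible consistency check, not the repository's own code. It assumes MPJPE is lower-is-better and that macro-F1, micro-F1, F1, MRR, and R2 are higher-is-better. The names `pick_better_baseline`, `check_rows`, and `SAMPLE` are hypothetical, and `SAMPLE` holds only a few rows and a subset of the columns from the table above.

```python
import csv
import io

# Assumption: MPJPE is an error metric (lower is better); every other metric
# in the table (macro-F1, micro-F1, F1, MRR, R2) is treated as higher-is-better.
LOWER_IS_BETTER = {"MPJPE"}


def pick_better_baseline(metric_name, minimal, neural_mlp):
    """Return 'minimal', 'neural_mlp', or 'tie' for one pair of scores."""
    if minimal == neural_mlp:
        return "tie"
    if metric_name in LOWER_IS_BETTER:
        minimal_wins = minimal < neural_mlp
    else:
        minimal_wins = minimal > neural_mlp
    return "minimal" if minimal_wins else "neural_mlp"


# A few rows copied from the table, restricted to the columns the check needs.
SAMPLE = """direction,task,metric_name,minimal_metric,neural_mlp_metric,better_baseline
C,timeline_action,macro-F1,0.05,0.0148148148148,minimal
A,hand_trajectory_forecast,MPJPE,0.864657044411,0.107850186527,neural_mlp
A,contact_prediction,macro-F1,1,1,tie
B,modality_reconstruction,R2,-0.0152718989139,-0.0101714101342,neural_mlp
C,temporal_order,F1,0.53995157385,0.85201793722,neural_mlp
"""


def check_rows(csv_text):
    """List (task, recorded, recomputed) for rows whose label disagrees."""
    mismatches = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        recomputed = pick_better_baseline(
            row["metric_name"],
            float(row["minimal_metric"]),
            float(row["neural_mlp_metric"]),
        )
        if recomputed != row["better_baseline"]:
            mismatches.append((row["task"], row["better_baseline"], recomputed))
    return mismatches


if __name__ == "__main__":
    print(check_rows(SAMPLE))
```

To check the full file, the same function could be given the CSV's text instead of `SAMPLE`; an empty list would mean every recorded label matches the assumed metric directions.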