OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structured Scripts and Evidence Chains Paper • 2606.14702 • Published 6 days ago • 27
Agentic Environment Engineering for Large Language Models: A Survey of Environment Modeling, Synthesis, Evaluation, and Application Paper • 2606.12191 • Published 8 days ago • 63
UniPrefill: Universal Long-Context Prefill Acceleration via Block-wise Dynamic Sparsification Paper • 2605.06221 • Published May 7 • 22
Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models Paper • 2604.08545 • Published Apr 9 • 41
Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding Paper • 2604.05015 • Published Apr 6 • 236
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models Paper • 2306.13394 • Published Jun 23, 2023
VITA: Towards Open-Source Interactive Omni Multimodal LLM Paper • 2408.05211 • Published Aug 9, 2024 • 50
MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs Paper • 2411.15296 • Published Nov 22, 2024 • 21
Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM Paper • 2411.00774 • Published Nov 1, 2024
VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction Paper • 2501.01957 • Published Jan 3, 2025 • 48