Paper Archive

WorldCache: Content-Aware Caching for Accelerated Video World Models

9.0/10

Umair Nawaz, Ahmed Heakl, Ufaq Khan, Abdelrahman Shaker, Salman Khan, Fahad Shahbaz Khan 3/23/2026 arxiv

computer vision

Diffusion Transformers (DiTs) power high-fidelity video world models but remain computationally expensive due to sequential denoising and costly spatio-temporal attention. Training-free feature caching accelerates inference by reusing intermediate activations across denoising steps; however, existin...

Keywords: WorldCache, Diffusion Transformers, DiT, video world models, feature caching, perception-constrained, saliency-weighted drift, motion-adaptive thresholds

End-to-End Training for Unified Tokenization and Latent Denoising

9.0/10

Shivam Duggal, Xingjian Bai, Zongze Wu, Richard Zhang, Eli Shechtman, Antonio Torralba, Phillip Isola, William T. Freeman 3/23/2026 arxiv

computer vision

Latent diffusion models (LDMs) enable high-fidelity synthesis by operating in learned latent spaces. However, training state-of-the-art LDMs requires complex staging: a tokenizer must be trained first, before the diffusion model can be trained in the frozen latent space. We propose UNITE - an autoen...

Keywords: UNITE, latent diffusion, Generative Encoder, single-stage training, tokenization, weight sharing, representation alignment, compression

UniMotion: A Unified Framework for Motion-Text-Vision Understanding and Generation

9.0/10

Ziyi Wang, Xinshun Wang, Shuang Chen, Yang Cong, Mengyuan Liu 3/23/2026 arxiv

machine learning

We present UniMotion, to our knowledge the first unified framework for simultaneous understanding and generation of human motion, natural language, and RGB images within a single architecture. Existing unified models handle only restricted modality subsets (e.g., Motion-Text or static Pose-Image) an...

Keywords: motion, multimodal, CMA-VAE, Dual-Posterior KL Alignment, Latent Reconstruction Alignment, continuous embeddings, LLM backbone, vision-motion distillation

DualCoT-VLA: Visual-Linguistic Chain of Thought via Parallel Reasoning for Vision-Language-Action Models

9.0/10

Zhide Zhong, Junfeng Li, Junjie He, Haodong Yan, Xin Gong, Guanyi Zhao, Yingjie Cai, Jiantao Gao, Xu Yan, Bingbing Liu, Yingcong Chen, Liuqing Yang, Haoang Li 3/23/2026 arxiv

machine learning

Vision-Language-Action (VLA) models map visual observations and language instructions directly to robotic actions. While effective for simple tasks, standard VLA models often struggle with complex, multi-step tasks requiring logical planning, as well as precise manipulations demanding fine-grained s...

Keywords: DualCoT-VLA, visual-linguistic CoT, parallel reasoning, learnable query tokens, vision-language-action, LIBERO, RoboCasa GR1, robotics

The Dual Mechanisms of Spatial Reasoning in Vision-Language Models

9.0/10

Kelly Cui, Nikhil Prakash, Ayush Raina, David Bau, Antonio Torralba, Tamar Rott Shaham 3/23/2026 arxiv

machine learning

Many multimodal tasks, such as image captioning and visual question answering, require vision-language models (VLMs) to associate objects with their properties and spatial relations. Yet it remains unclear where and how such associations are computed within VLMs. In this work, we show that VLMs rely...

Keywords: vision-language models, spatial reasoning, vision encoder, language model, spatial relations, global visual tokens, VQA, image captioning

Repurposing Geometric Foundation Models for Multi-view Diffusion

9.0/10

Wooseok Jang, Seonghu Jeon, Jisang Han, Jinhyeok Choi, Minkyung Kwon, Seungryong Kim, Saining Xie, Sainan Liu 3/23/2026 arxiv

computer vision

While recent advances in generative latent spaces have driven substantial progress in single-image generation, the optimal latent space for novel view synthesis (NVS) remains largely unexplored. In particular, NVS requires geometrically consistent generation across viewpoints, but existing approache...

Keywords: geometric foundation models, latent diffusion, novel view synthesis, multi-view, NVS, VAE, RAE, 3D consistency

Decoupling Exploration and Policy Optimization: Uncertainty Guided Tree Search for Hard Exploration

9.0/10

Zakaria Mhammedi, James Cohan 3/23/2026 arxiv

reinforcement learning

The process of discovery requires active exploration -- the act of collecting new and informative data. However, efficient autonomous exploration remains a major unsolved problem. The dominant paradigm addresses this challenge by using Reinforcement Learning (RL) to train agents with intrinsic motiv...

Keywords: exploration, reinforcement learning, uncertainty, tree search, intrinsic motivation, policy distillation, Atari, MuJoCo

TiCo: Time-Controllable Training for Spoken Dialogue Models

9.0/10

Kai-Wei Chang, Wei-Chih Chen, En-Pei Hu, Hung-yi Lee, James Glass 3/23/2026 arxiv

machine learning

We propose TiCo, a simple post-training method for enabling spoken dialogue models (SDMs) to follow time-constrained instructions and generate responses with controllable duration. This capability is valuable for real-world spoken language systems such as voice assistants and interactive agents, whe...

Keywords: TiCo, Spoken Time Markers, STM, spoken dialogue models, time-controllable, post-training, reinforcement learning, duration control

UniDex: A Robot Foundation Suite for Universal Dexterous Hand Control from Egocentric Human Videos

9.0/10

Gu Zhang, Qicheng Xu, Haozhe Zhang, Jianhan Ma, Long He, Yiming Bao, Zeyu Ping, Zhecheng Yuan, Chenhao Lu, Chengbo Yuan, Tianhai Liang, Xiaoyu Tian, Maanping Shao, Feihong Zhang, Mingyu Ding, Yang Gao, Hao Zhao, Hang Zhao, Huazhe Xu 3/23/2026 arxiv

robotics

Dexterous manipulation remains challenging due to the cost of collecting real-robot teleoperation data, the heterogeneity of hand embodiments, and the high dimensionality of control. We present UniDex, a robot foundation suite that couples a large-scale robot-centric dataset with a unified vision-la...

Keywords: UniDex, FAAS, UniDex-Dataset, UniDex-VLA, UniDex-Cap, dexterous manipulation, egocentric video, retargeting

Omni-WorldBench: Towards a Comprehensive Interaction-Centric Evaluation for World Models

9.0/10

3/23/2026 huggingface

computer vision

Video-based world models have emerged along two dominant paradigms: video generation and 3D reconstruction. However, existing evaluation benchmarks either focus narrowly on visual fidelity and text-video alignment for generative models, or rely on static 3D reconstruction metrics that fundamentall...

Keywords: world models, 4D generation, interactive evaluation, benchmark, agent-based metrics, video modeling, 3D reconstruction, causal evaluation

Papers for March 24, 2026