Paper Archive

EVATok: Adaptive Length Video Tokenization for Efficient Visual Autoregressive Generation

9.0/10

Tianwei Xiong, Jun Hao Liew, Zilong Huang, Zhijie Lin, Jiashi Feng, Xihui Liu 3/12/2026 arxiv

computer vision

Autoregressive (AR) video generative models rely on video tokenizers that compress pixels into discrete token sequences. The length of these token sequences is crucial for balancing reconstruction quality against downstream generation computational cost. Traditional video tokenizers apply a uniform ...

Keywords: EVATok, adaptive tokenization, video tokenizer, autoregressive generation, token efficiency, routers, video semantic encoder, UCF-101

MM-CondChain: A Programmatically Verified Benchmark for Visually Grounded Deep Compositional Reasoning

9.0/10

Haozhan Shen, Shilin Yan, Hongwei Xue, Shuaiqi Lu, Xiaojun Tang, Guannan Zhang, Tiancheng Zhao, Jianwei Yin 3/12/2026 arxiv

machine learning

Multimodal Large Language Models (MLLMs) are increasingly used to carry out visual workflows such as navigating GUIs, where the next step depends on verified visual compositional conditions (e.g., "if a permission dialog appears and the color of the interface is green, click Allow") and the process ...

Keywords: MM-CondChain, multimodal LLM, compositional reasoning, VPIR, agentic synthesis, benchmark, GUI trajectories, data charts

OmniStream: Mastering Perception, Reconstruction and Action in Continuous Streams

9.0/10

Yibin Yan, Jilan Xu, Shangzhe Di, Haoning Wu, Weidi Xie 3/12/2026 arxiv

computer vision

Modern visual agents require representations that are general, causal, and physically structured to operate in real-time streaming environments. However, current vision foundation models remain fragmented, specializing narrowly in image semantic perception, offline temporal modeling, or spatial geom...

Keywords: streaming backbone, causal attention, 3D-RoPE, persistent KV-cache, multi-task pretraining, vision-language alignment, streaming reconstruction, embodied AI

GRADE: Benchmarking Discipline-Informed Reasoning in Image Editing

9.0/10

Mingxin Liu, Ziqian Fan, Zhaokai Wang, Leyao Gu, Zirun Zhu, Yiguo He, Yuchen Yang, Changyao Tian, Xiangyu Zhao, Ning Liao, Shaofeng Zhang, Qibing Ren, Zhihang Zhong, Xuanhe Zhou, Junchi Yan, Xue Yang 3/12/2026 arxiv

computer vision

Unified multimodal models target joint understanding, reasoning, and generation, but current image editing benchmarks are largely confined to natural images and shallow commonsense reasoning, offering limited assessment of this capability under structured, domain-specific constraints. In this work, ...

Keywords: GRADE, benchmark, image editing, discipline-informed reasoning, multimodal, evaluation protocol, visual consistency, logical readability

$Ψ_0$: An Open Foundation Model Towards Universal Humanoid Loco-Manipulation

9.0/10

Songlin Wei, Hongyi Jing, Boqian Li, Zhenyu Zhao, Jiageng Mao, Zhenhao Ni, Sicheng He, Jie Liu, Xiawei Liu, Kaidi Kang, Sheng Zang, Weiduo Yuan, Marco Pavone, Di Huang, Yue Wang 3/12/2026 arxiv

robotics

We introduce $Ψ_0$ (Psi-Zero), an open foundation model to address challenging humanoid loco-manipulation tasks. While existing approaches often attempt to address this fundamental problem by co-training on large and diverse human and humanoid data, we argue that this strategy is suboptimal due to t...

Keywords: humanoid, loco-manipulation, foundation model, egocentric video, VLM, flow-based action expert, data efficiency, sim-to-real

Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously

9.0/10

Yiran Guan, Liang Yin, Dingkang Liang, Jianzhong Ju, Zhenbo Luo, Jian Luan, Yuliang Liu, Xiang Bai 3/12/2026 arxiv

machine learning

Online Video Large Language Models (VideoLLMs) play a critical role in supporting responsive, real-time interaction. Existing methods focus on streaming perception, lacking a synchronized logical reasoning stream. However, directly applying test-time scaling methods incurs unacceptable response late...

Keywords: VideoLLM, streaming, real-time reasoning, thinking-while-watching, VST, VST-SFT, VST-RL, video knowledge graph

HumDex: Humanoid Dexterous Manipulation Made Easy

9.0/10

Liang Heng, Yihe Tang, Jiajun Xu, Henghui Bao, Di Huang, Yue Wang 3/12/2026 arxiv

robotics

This paper investigates humanoid whole-body dexterous manipulation, where the efficient collection of high-quality demonstration data remains a central bottleneck. Existing teleoperation systems often suffer from limited portability, occlusion, or insufficient precision, which hinders their applic...

Keywords: humanoid, teleoperation, IMU, hand retargeting, imitation learning, dexterous manipulation, dataset collection, robotics

DreamVideo-Omni: Omni-Motion Controlled Multi-Subject Video Customization with Latent Identity Reinforcement Learning

9.0/10

Yujie Wei, Xinyu Liu, Shiwei Zhang, Hangjie Yuan, Jinbo Xing, Zhekai Chen, Xiang Wang, Haonan Qiu, Rui Zhao, Yutong Feng, Ruihang Chu, Yingya Zhang, Yike Guo, Xihui Liu, Hongming Shan 3/12/2026 arxiv

computer vision

While large-scale diffusion models have revolutionized video synthesis, achieving precise control over both multi-subject identity and multi-granularity motion remains a significant challenge. Recent attempts to bridge this gap often suffer from limited motion granularity, control ambiguity, and ide...

Keywords: DreamVideo-Omni, omni-motion control, multi-subject video, latent identity reward, 3D rotary positional embedding, hierarchical motion injection, group and role embeddings, video diffusion

Spatial-TTT: Streaming Visual-based Spatial Intelligence with Test-Time Training

9.0/10

Fangfu Liu, Diankun Wu, Jiawei Chi, Yimo Cai, Yi-Hsin Hung, Xumin Yu, Hao Li, Han Hu, Yongming Rao, Yueqi Duan 3/12/2026 arxiv

computer vision

Humans perceive and understand real-world spaces through a stream of visual observations. Therefore, the ability to streamingly maintain and update spatial evidence from potentially unbounded video streams is essential for spatial intelligence. The core challenge is not simply longer context windows...

Keywords: test-time training, fast weights, spatiotemporal convolution, streaming video, spatial intelligence, sliding-window attention, long-horizon understanding, 3D spatial descriptions

Attend Before Attention: Efficient and Scalable Video Understanding via Autoregressive Gazing

9.0/10

Baifeng Shi, Stephanie Fu, Long Lian, Hanrong Ye, David Eigen, Aaron Reite, Boyi Li, Jan Kautz, Song Han, David M. Chan, Pavlo Molchanov, Trevor Darrell, Hongxu Yin 3/12/2026 arxiv

computer vision

Multi-modal large language models (MLLMs) have advanced general-purpose video understanding but struggle with long, high-resolution videos -- they process every pixel equally in their vision transformers (ViTs) or LLMs despite significant spatiotemporal redundancy. We introduce AutoGaze, a lightweig...

Keywords: AutoGaze, autoregressive gaze, MLLM, ViT, token reduction, reinforcement learning, multi-scale patches, VideoMME
