Paper Archive

Browse and export your curated research paper collection

176 Archived Days • 1748 Total Papers • 7.8 Avg Score • 9 Categories

Export Archive Data

Download your archived papers in various formats

JSON: Complete data with analysis • CSV: Tabular data for analysis • Markdown: Human-readable reports • BibTeX: Academic citations
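The JSON-to-BibTeX path can be sketched as below. This is a minimal illustration, not the archive's actual export code; the record schema (`title`, `authors`, `date` fields) is an assumption about what the JSON export might contain.

```python
# Hypothetical sketch: render one archived-paper record (assumed JSON
# schema with "title", "authors", "date" fields) as a BibTeX @misc entry.
import json

def to_bibtex(record: dict) -> str:
    """Build a BibTeX @misc entry from a paper record."""
    # Citation key: first author's last name + year, e.g. "Lovelace2026".
    key = record["authors"][0].split()[-1] + record["date"][:4]
    authors = " and ".join(record["authors"])
    return (
        f"@misc{{{key},\n"
        f"  title  = {{{record['title']}}},\n"
        f"  author = {{{authors}}},\n"
        f"  year   = {{{record['date'][:4]}}},\n"
        f"}}"
    )

paper = json.loads(
    '{"title": "Example Paper", "authors": ["Ada Lovelace"], "date": "2026-03-12"}'
)
print(to_bibtex(paper))
```

The same record dictionary could be written out as a CSV row or a Markdown bullet with equally little code; BibTeX is shown because it is the most structured of the four formats.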
Browse by Date

Papers for March 13, 2026

10 papers found

Tianwei Xiong, Jun Hao Liew, Zilong Huang, Zhijie Lin, Jiashi Feng, Xihui Liu 3/12/2026 arxiv

computer vision

Autoregressive (AR) video generative models rely on video tokenizers that compress pixels into discrete token sequences. The length of these token sequences is crucial for balancing reconstruction quality against the computational cost of downstream generation. Traditional video tokenizers apply a uniform ...

Keywords: EVATok, adaptive tokenization, video tokenizer, autoregressive generation, token efficiency, routers, video semantic encoder, UCF-101

Haozhan Shen, Shilin Yan, Hongwei Xue, Shuaiqi Lu, Xiaojun Tang, Guannan Zhang, Tiancheng Zhao, Jianwei Yin 3/12/2026 arxiv

machine learning

Multimodal Large Language Models (MLLMs) are increasingly used to carry out visual workflows such as navigating GUIs, where the next step depends on verified visual compositional conditions (e.g., "if a permission dialog appears and the color of the interface is green, click Allow") and the process ...

Keywords: MM-CondChain, multimodal LLM, compositional reasoning, VPIR, agentic synthesis, benchmark, GUI trajectories, data charts

Yibin Yan, Jilan Xu, Shangzhe Di, Haoning Wu, Weidi Xie 3/12/2026 arxiv

computer vision

Modern visual agents require representations that are general, causal, and physically structured to operate in real-time streaming environments. However, current vision foundation models remain fragmented, specializing narrowly in image semantic perception, offline temporal modeling, or spatial geom...

Keywords: streaming backbone, causal attention, 3D-RoPE, persistent KV-cache, multi-task pretraining, vision-language alignment, streaming reconstruction, embodied AI

Mingxin Liu, Ziqian Fan, Zhaokai Wang, Leyao Gu, Zirun Zhu, Yiguo He, Yuchen Yang, Changyao Tian, Xiangyu Zhao, Ning Liao, Shaofeng Zhang, Qibing Ren, Zhihang Zhong, Xuanhe Zhou, Junchi Yan, Xue Yang 3/12/2026 arxiv

computer vision

Unified multimodal models target joint understanding, reasoning, and generation, but current image editing benchmarks are largely confined to natural images and shallow commonsense reasoning, offering limited assessment of this capability under structured, domain-specific constraints. In this work, ...

Keywords: GRADE, benchmark, image editing, discipline-informed reasoning, multimodal, evaluation protocol, visual consistency, logical readability

Songlin Wei, Hongyi Jing, Boqian Li, Zhenyu Zhao, Jiageng Mao, Zhenhao Ni, Sicheng He, Jie Liu, Xiawei Liu, Kaidi Kang, Sheng Zang, Weiduo Yuan, Marco Pavone, Di Huang, Yue Wang 3/12/2026 arxiv

robotics

We introduce $Ψ_0$ (Psi-Zero), an open foundation model to address challenging humanoid loco-manipulation tasks. While existing approaches often attempt to address this fundamental problem by co-training on large and diverse human and humanoid data, we argue that this strategy is suboptimal due to t...

Keywords: humanoid, loco-manipulation, foundation model, egocentric video, VLM, flow-based action expert, data efficiency, sim-to-real

Yiran Guan, Liang Yin, Dingkang Liang, Jianzhong Ju, Zhenbo Luo, Jian Luan, Yuliang Liu, Xiang Bai 3/12/2026 arxiv

machine learning

Online Video Large Language Models (VideoLLMs) play a critical role in supporting responsive, real-time interaction. Existing methods focus on streaming perception, lacking a synchronized logical reasoning stream. However, directly applying test-time scaling methods incurs unacceptable response late...

Keywords: VideoLLM, streaming, real-time reasoning, thinking-while-watching, VST, VST-SFT, VST-RL, video knowledge graph

Liang Heng, Yihe Tang, Jiajun Xu, Henghui Bao, Di Huang, Yue Wang 3/12/2026 arxiv

robotics

This paper investigates humanoid whole-body dexterous manipulation, where the efficient collection of high-quality demonstration data remains a central bottleneck. Existing teleoperation systems often suffer from limited portability, occlusion, or insufficient precision, which hinders their applic...

Keywords: humanoid, teleoperation, IMU, hand retargeting, imitation learning, dexterous manipulation, dataset collection, robotics

Yujie Wei, Xinyu Liu, Shiwei Zhang, Hangjie Yuan, Jinbo Xing, Zhekai Chen, Xiang Wang, Haonan Qiu, Rui Zhao, Yutong Feng, Ruihang Chu, Yingya Zhang, Yike Guo, Xihui Liu, Hongming Shan 3/12/2026 arxiv

computer vision

While large-scale diffusion models have revolutionized video synthesis, achieving precise control over both multi-subject identity and multi-granularity motion remains a significant challenge. Recent attempts to bridge this gap often suffer from limited motion granularity, control ambiguity, and ide...

Keywords: DreamVideo-Omni, omni-motion control, multi-subject video, latent identity reward, 3D rotary positional embedding, hierarchical motion injection, group and role embeddings, video diffusion

Fangfu Liu, Diankun Wu, Jiawei Chi, Yimo Cai, Yi-Hsin Hung, Xumin Yu, Hao Li, Han Hu, Yongming Rao, Yueqi Duan 3/12/2026 arxiv

computer vision

Humans perceive and understand real-world spaces through a stream of visual observations. Therefore, the ability to maintain and update spatial evidence in a streaming manner from potentially unbounded video streams is essential for spatial intelligence. The core challenge is not simply longer context windows...

Keywords: test-time training, fast weights, spatiotemporal convolution, streaming video, spatial intelligence, sliding-window attention, long-horizon understanding, 3D spatial descriptions

Baifeng Shi, Stephanie Fu, Long Lian, Hanrong Ye, David Eigen, Aaron Reite, Boyi Li, Jan Kautz, Song Han, David M. Chan, Pavlo Molchanov, Trevor Darrell, Hongxu Yin 3/12/2026 arxiv

computer vision

Multi-modal large language models (MLLMs) have advanced general-purpose video understanding but struggle with long, high-resolution videos -- they process every pixel equally in their vision transformers (ViTs) or LLMs despite significant spatiotemporal redundancy. We introduce AutoGaze, a lightweig...

Keywords: AutoGaze, autoregressive gaze, MLLM, ViT, token reduction, reinforcement learning, multi-scale patches, VideoMME