Paper Archive

Browse and export your curated research paper collection

197 Archived Days • 1958 Total Papers • 7.9 Avg Score • 9 Categories

Export Archive Data

Download your archived papers in various formats

JSON: Complete data with analysis • CSV: Tabular data for analysis • Markdown: Human-readable reports • BibTeX: Academic citations
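The JSON export is the richest format and can be converted into the others. As a minimal sketch, the snippet below renders one exported paper record as a BibTeX @misc entry; the record's field names (title, authors, date, source) are assumptions about the export schema, not its documented shape.

```python
import json

# Hypothetical record shape -- the actual JSON export schema is an assumption.
paper = json.loads("""{
  "title": "UniMotion: Unified Motion Understanding and Generation",
  "authors": ["Ziyi Wang", "Xinshun Wang"],
  "date": "2026-03-23",
  "source": "arxiv"
}""")

def to_bibtex(p):
    """Render one exported paper record as a BibTeX @misc entry."""
    # Citation key: first author's last name + publication year.
    key = p["authors"][0].split()[-1].lower() + p["date"][:4]
    return "\n".join([
        "@misc{%s," % key,
        "  title  = {%s}," % p["title"],
        "  author = {%s}," % " and ".join(p["authors"]),
        "  year   = {%s}," % p["date"][:4],
        "  note   = {%s}" % p["source"],
        "}",
    ])

print(to_bibtex(paper))
```

The same record maps naturally onto a CSV row or a Markdown bullet, so one JSON export can feed all the other formats.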
Browse by Date

Papers for March 24, 2026

10 papers found

Umair Nawaz, Ahmed Heakl, Ufaq Khan, Abdelrahman Shaker, Salman Khan, Fahad Shahbaz Khan 3/23/2026 arxiv

computer vision

Diffusion Transformers (DiTs) power high-fidelity video world models but remain computationally expensive due to sequential denoising and costly spatio-temporal attention. Training-free feature caching accelerates inference by reusing intermediate activations across denoising steps; however, existin...

Keywords: WorldCache, Diffusion Transformers, DiT, video world models, feature caching, perception-constrained, saliency-weighted drift, motion-adaptive thresholds

Shivam Duggal, Xingjian Bai, Zongze Wu, Richard Zhang, Eli Shechtman, Antonio Torralba, Phillip Isola, William T. Freeman 3/23/2026 arxiv

computer vision

Latent diffusion models (LDMs) enable high-fidelity synthesis by operating in learned latent spaces. However, training state-of-the-art LDMs requires complex staging: a tokenizer must be trained first, before the diffusion model can be trained in the frozen latent space. We propose UNITE - an autoen...

Keywords: UNITE, latent diffusion, Generative Encoder, single-stage training, tokenization, weight sharing, representation alignment, compression

Ziyi Wang, Xinshun Wang, Shuang Chen, Yang Cong, Mengyuan Liu 3/23/2026 arxiv

machine learning

We present UniMotion, to our knowledge the first unified framework for simultaneous understanding and generation of human motion, natural language, and RGB images within a single architecture. Existing unified models handle only restricted modality subsets (e.g., Motion-Text or static Pose-Image) an...

Keywords: motion, multimodal, CMA-VAE, Dual-Posterior KL Alignment, Latent Reconstruction Alignment, continuous embeddings, LLM backbone, vision-motion distillation

Zhide Zhong, Junfeng Li, Junjie He, Haodong Yan, Xin Gong, Guanyi Zhao, Yingjie Cai, Jiantao Gao, Xu Yan, Bingbing Liu, Yingcong Chen, Liuqing Yang, Haoang Li 3/23/2026 arxiv

machine learning

Vision-Language-Action (VLA) models map visual observations and language instructions directly to robotic actions. While effective for simple tasks, standard VLA models often struggle with complex, multi-step tasks requiring logical planning, as well as precise manipulations demanding fine-grained s...

Keywords: DualCoT-VLA, visual-linguistic CoT, parallel reasoning, learnable query tokens, vision-language-action, LIBERO, RoboCasa GR1, robotics

Kelly Cui, Nikhil Prakash, Ayush Raina, David Bau, Antonio Torralba, Tamar Rott Shaham 3/23/2026 arxiv

machine learning

Many multimodal tasks, such as image captioning and visual question answering, require vision-language models (VLMs) to associate objects with their properties and spatial relations. Yet it remains unclear where and how such associations are computed within VLMs. In this work, we show that VLMs rely...

Keywords: vision-language models, spatial reasoning, vision encoder, language model, spatial relations, global visual tokens, VQA, image captioning

Wooseok Jang, Seonghu Jeon, Jisang Han, Jinhyeok Choi, Minkyung Kwon, Seungryong Kim, Saining Xie, Sainan Liu 3/23/2026 arxiv

computer vision

While recent advances in generative latent spaces have driven substantial progress in single-image generation, the optimal latent space for novel view synthesis (NVS) remains largely unexplored. In particular, NVS requires geometrically consistent generation across viewpoints, but existing approache...

Keywords: geometric foundation models, latent diffusion, novel view synthesis, multi-view, NVS, VAE, RAE, 3D consistency

Zakaria Mhammedi, James Cohan 3/23/2026 arxiv

reinforcement learning

The process of discovery requires active exploration -- the act of collecting new and informative data. However, efficient autonomous exploration remains a major unsolved problem. The dominant paradigm addresses this challenge by using Reinforcement Learning (RL) to train agents with intrinsic motiv...

Keywords: exploration, reinforcement learning, uncertainty, tree search, intrinsic motivation, policy distillation, Atari, MuJoCo

Kai-Wei Chang, Wei-Chih Chen, En-Pei Hu, Hung-yi Lee, James Glass 3/23/2026 arxiv

machine learning

We propose TiCo, a simple post-training method for enabling spoken dialogue models (SDMs) to follow time-constrained instructions and generate responses with controllable duration. This capability is valuable for real-world spoken language systems such as voice assistants and interactive agents, whe...

Keywords: TiCo, Spoken Time Markers, STM, spoken dialogue models, time-controllable, post-training, reinforcement learning, duration control

Gu Zhang, Qicheng Xu, Haozhe Zhang, Jianhan Ma, Long He, Yiming Bao, Zeyu Ping, Zhecheng Yuan, Chenhao Lu, Chengbo Yuan, Tianhai Liang, Xiaoyu Tian, Maanping Shao, Feihong Zhang, Mingyu Ding, Yang Gao, Hao Zhao, Hang Zhao, Huazhe Xu 3/23/2026 arxiv

robotics

Dexterous manipulation remains challenging due to the cost of collecting real-robot teleoperation data, the heterogeneity of hand embodiments, and the high dimensionality of control. We present UniDex, a robot foundation suite that couples a large-scale robot-centric dataset with a unified vision-la...

Keywords: UniDex, FAAS, UniDex-Dataset, UniDex-VLA, UniDex-Cap, dexterous manipulation, egocentric video, retargeting

3/23/2026 huggingface

computer vision

Video-based world models have emerged along two dominant paradigms: video generation and 3D reconstruction. However, existing evaluation benchmarks either focus narrowly on visual fidelity and text-video alignment for generative models, or rely on static 3D reconstruction metrics that fundamentall...

Keywords: world models, 4D generation, interactive evaluation, benchmark, agent-based metrics, video modeling, 3D reconstruction, causal evaluation