Paper Archive

SemanticGen: Video Generation in Semantic Space

9.0/10

Jianhong Bai, Xiaoshi Wu, Xintao Wang, Fu Xiao, Yuanxing Zhang, Qinghe Wang, Xiaoyu Shi, Menghan Xia, Zuozhu Liu, Haoji Hu, Pengfei Wan, Kun Gai 12/23/2025 arxiv

generative models

State-of-the-art video generative models typically learn the distribution of video latents in the VAE space and map them to pixels using a VAE decoder. While this approach can generate high-quality videos, it suffers from slow convergence and is computationally expensive when generating long videos....

Keywords: SemanticGen, video generation, semantic space, diffusion model, VAE latents, two-stage generation, long video generation, global planning

LongVideoAgent: Multi-Agent Reasoning with Long Videos

9.0/10

Runtao Liu, Ziyi Liu, Jiaqi Tang, Yue Ma, Renjie Pi, Jipeng Zhang, Qifeng Chen 12/23/2025 arxiv

machine learning

Recent advances in multimodal LLMs and systems that use tools for long-video QA point to the promise of reasoning over hour-long episodes. However, many methods still compress content into lossy summaries or rely on limited toolsets, weakening temporal grounding and missing fine-grained cues. We pro...

Keywords: long-video, multi-agent, reinforcement learning, grounding, vision-language, LLM, LongTVQA, TVQA

SpatialTree: How Spatial Abilities Branch Out in MLLMs

9.0/10

Yuxi Xiao, Longfei Li, Shen Yan, Xinhang Liu, Sida Peng, Yunchao Wei, Xiaowei Zhou, Bingyi Kang 12/23/2025 arxiv

machine learning

Cognitive science suggests that spatial ability develops progressively, from perception to reasoning and interaction. Yet in multimodal LLMs (MLLMs), this hierarchy remains poorly understood, as most studies focus on a narrow set of tasks. We introduce SpatialTree, a cognitive-science-inspired hierar...

Keywords: SpatialTree, spatial abilities, MLLM, multimodal, hierarchical benchmark, transfer learning, negative transfer, auto-think

Active Intelligence in Video Avatars via Closed-loop World Modeling

9.0/10

Xuanhua He, Tianyu Yang, Ke Cao, Ruiqi Wu, Cheng Meng, Yong Zhang, Zhuoliang Kang, Xiaoming Wei, Qifeng Chen 12/23/2025 arxiv

computer vision

Current video avatar generation methods excel at identity preservation and motion alignment but lack genuine agency: they cannot autonomously pursue long-term goals through adaptive environmental interaction. We address this by introducing L-IVA (Long-horizon Interactive Visual Avatar), a task and b...

Keywords: L-IVA, ORCA, Internal World Model, OTAR, closed-loop, POMDP, video avatars, long-horizon planning

Making Large Language Models Efficient Dense Retrievers

9.0/10

Yibin Lei, Shwai He, Ang Li, Andrew Yates 12/23/2025 arxiv

machine learning

Recent work has shown that directly fine-tuning large language models (LLMs) for dense retrieval yields strong performance, but their substantial parameter counts make them computationally inefficient. While prior studies have revealed significant layer redundancy in LLMs for generative tasks, it re...

Keywords: dense retrieval, LLM compression, MLP pruning, EffiR, coarse-to-fine, BEIR, efficient retrieval

Saddle-to-Saddle Dynamics Explains A Simplicity Bias Across Neural Network Architectures

9.0/10

Yedi Zhang, Andrew Saxe, Peter E. Latham 12/23/2025 arxiv

machine learning

Neural networks trained with gradient descent often learn solutions of increasing complexity over time, a phenomenon known as simplicity bias. Despite being widely observed across architectures, existing theoretical treatments lack a unifying framework. We present a theoretical framework that explai...

Keywords: simplicity bias, saddle-to-saddle dynamics, gradient descent, invariant manifolds, fixed points, ReLU kinks, rank growth, convolutional kernels

Repurposing Video Diffusion Transformers for Robust Point Tracking

9.0/10

Soowon Son, Honggyu An, Chaehyun Kim, Hyunah Ko, Jisu Nam, Dahyun Chung, Siyoon Jin, Jung Yi, Jaewon Min, Junhwa Hur, Seungryong Kim 12/23/2025 arxiv

computer vision

Point tracking aims to localize corresponding points across video frames, serving as a fundamental task for 4D reconstruction, robotics, and video editing. Existing methods commonly rely on shallow convolutional backbones such as ResNet that process frames independently, lacking temporal coherence a...

Keywords: Video Diffusion Transformer, DiT, point tracking, DiTracker, LoRA, query-key attention, cost fusion, ResNet

Emergent temporal abstractions in autoregressive models enable hierarchical reinforcement learning

9.0/10

Seijin Kobayashi, Yanick Schimpf, Maximilian Schlegel, Angelika Steger, Maciej Wolczyk, Johannes von Oswald, Nino Scherre, Kaitlin Maile, Guillaume Lajoie, Blake A. Richards, Rif A. Saurous, James Manyika, Blaise Agüera y Arcas, Alexander Meulemans, João Sacramento 12/23/2025 arxiv

reinforcement learning

Large-scale autoregressive models pretrained on next-token prediction and finetuned with reinforcement learning (RL) have achieved unprecedented success on many problem domains. During RL, these models explore by generating new outputs, one token at a time. However, sampling actions token-by-token c...

Keywords: autoregressive models, temporal abstraction, hierarchical reinforcement learning, internal RL, residual stream control, non-causal sequence model, sparse rewards, MuJoCo

Cube Bench: A Benchmark for Spatial Visual Reasoning in MLLMs

9.0/10

Dhruv Anand, Ehsan Shareghi 12/23/2025 arxiv

machine learning

We introduce Cube Bench, a Rubik's-cube benchmark for evaluating spatial and sequential reasoning in multimodal large language models (MLLMs). The benchmark decomposes performance into five skills: (i) reconstructing cube faces from images and text, (ii) choosing the optimal next move, (iii) predict...

Keywords: Cube Bench, Rubik's cube, spatial reasoning, MLLMs, multimodal benchmark, sequential planning, self-correction, closed-vs-open-source gap

LightTact: A Visual-Tactile Fingertip Sensor for Deformation-Independent Contact Sensing

9.0/10

Changyi Lin, Boda Huo, Mingyang Yu, Emily Ruppel, Bingqing Chen, Jonathan Francis, Ding Zhao 12/23/2025 arxiv

machine learning

Contact often occurs without macroscopic surface deformation, such as during interaction with liquids, semi-liquids, or ultra-soft materials. Most existing tactile sensors rely on deformation to infer contact, making such light-contact interactions difficult to perceive robustly. To address this, we...

Keywords: visual-tactile, tactile sensing, optical sensor, light contact, robotic manipulation, contact segmentation, vision-language models, soft materials

Browse by Date

Papers for December 24, 2025
