Paper Archive

Unified Spatio-Temporal Token Scoring for Efficient Video VLMs

9.0/10

Jianrui Zhang, Yue Yang, Rohun Tripathi, Winson Han, Ranjay Krishna, Christopher Clark, Yong Jae Lee, Sangho Lee 3/18/2026 arxiv

computer vision

Token pruning is essential for enhancing the computational efficiency of vision-language models (VLMs), particularly for video-based tasks where temporal redundancy is prevalent. Prior approaches typically prune tokens either (1) within the vision transformer (ViT) exclusively for unimodal perceptio...

Keywords: STTS, token pruning, video VLM, ViT, LLM, spatio-temporal, packing algorithm, video QA

Universal Skeleton Understanding via Differentiable Rendering and MLLMs

9.0/10

Ziyi Wang, Peiming Li, Xinshun Wang, Yang Tang, Kai-Kuang Ma, Mengyuan Liu 3/18/2026 arxiv

machine learning

Multimodal large language models (MLLMs) exhibit strong visual-language reasoning, yet remain confined to their native modalities and cannot directly process structured, non-visual data such as human skeletons. Existing methods either compress skeleton dynamics into lossy feature vectors for text al...

Keywords: skeleton understanding, differentiable rendering, MLLM, DrAction, Causal Reasoning Distillation, Discriminative Finetuning, cross-format transfer, multimodal

Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models

9.0/10

Kevin Qu, Haozhe Qi, Mihai Dusmanu, Mahdi Rad, Rui Wang, Marc Pollefeys 3/18/2026 arxiv

machine learning

Multimodal Large Language Models (MLLMs) have made impressive progress in connecting vision and language, but they still struggle with spatial understanding and viewpoint-aware reasoning. Recent efforts aim to augment the input representations with geometric cues rather than explicitly teaching mode...

Keywords: Loc3R-VLM, vision-language models, 3D reasoning, language-based localization, monocular video, global layout reconstruction, situation modeling, camera pose priors

EchoGen: Cycle-Consistent Learning for Unified Layout-Image Generation and Understanding

9.0/10

Kai Zou, Hongbo Liu, Dian Zheng, Jianxiong Gao, Zhiwei Zhao, Bin Liu 3/18/2026 arxiv

machine learning

In this work, we present EchoGen, a unified framework for layout-to-image generation and image grounding, capable of generating images with accurate layouts and high fidelity to text descriptions (e.g., spatial relationships), while grounding the image robustly at the same time. We believe that imag...

Keywords: EchoGen, layout-to-image, image grounding, cycle-consistent learning, PMTP, DJO, Cycle RL, GRPO

AgentFactory: A Self-Evolving Framework Through Executable Subagent Accumulation and Reuse

9.0/10

Zhang Zhang, Shuqi Lu, Hongjin Qian, Di He, Zheng Liu 3/18/2026 arxiv

machine learning

Building LLM-based agents has become increasingly important. Recent works on LLM-based agent self-evolution primarily record successful experiences as textual prompts or reflections, which cannot reliably guarantee efficient task re-execution in complex scenarios. We propose AgentFactory, a new self...

Keywords: LLM agents, self-evolution, subagent, executable memory, code reuse, continual learning, agent library, Python

The Unreasonable Effectiveness of Text Embedding Interpolation for Continuous Image Steering

9.0/10

Yigit Ekin, Yossi Gandelsman 3/18/2026 arxiv

computer vision

We present a training-free framework for continuous and controllable image editing at test time for text-conditioned generative models. In contrast to prior approaches that rely on additional training or manual user intervention, we find that a simple steering in the text-embedding space is sufficie...

Keywords: text embedding interpolation, text-conditioned generation, steering vector, elastic range search, debiased contrastive prompts, LLM prompt synthesis, continuous editing, training-free

LoST: Level of Semantics Tokenization for 3D Shapes

9.0/10

Niladri Shekhar Dutt, Zifan Shi, Paul Guerrero, Chun-Hao Paul Huang, Duygu Ceylan, Niloy J. Mitra, Xuelin Chen 3/18/2026 arxiv

computer vision

Tokenization is a fundamental technique in the generative modeling of various modalities. In particular, it plays a critical role in autoregressive (AR) models, which have recently emerged as a compelling option for 3D generation. However, optimal tokenization of 3D shapes remains an open question. ...

Keywords: LoST, RIDA, DINO, tokenization, autoregressive, 3D generation, semantic alignment, level-of-detail

GMT: Goal-Conditioned Multimodal Transformer for 6-DOF Object Trajectory Synthesis in 3D Scenes

9.0/10

Huajian Zeng, Abhishek Saroha, Daniel Cremers, Xi Wang 3/18/2026 arxiv

robotics

Synthesizing controllable 6-DOF object manipulation trajectories in 3D environments is essential for enabling robots to interact with complex scenes, yet remains challenging due to the need for accurate spatial reasoning, physical feasibility, and multimodal scene understanding. Existing approaches ...

Keywords: 6-DOF, trajectory synthesis, multimodal transformer, point cloud, 3D bounding box, goal-conditioned, manipulation planning, semantic context

Versatile Editing of Video Content, Actions, and Dynamics without Training

9.0/10

Vladimir Kulikov, Roni Paiss, Andrey Voynov, Inbar Mosseri, Tali Dekel, Tomer Michaeli 3/18/2026 arxiv

computer vision

Controlled video generation has seen drastic improvements in recent years. However, editing actions and dynamic events, or inserting contents that should affect the behaviors of other objects in real-world videos, remains a major challenge. Existing trained models struggle with complex edits, likely...

Keywords: DynaEdit, video editing, text-to-video, flow models, inversion-free, model-agnostic, temporal consistency, action editing

Feeling the Space: Egomotion-Aware Video Representation for Efficient and Accurate 3D Scene Understanding

9.0/10

Shuyao Shi, Kang G. Shin 3/18/2026 arxiv

computer vision

Recent Multimodal Large Language Models (MLLMs) have shown high potential for spatial reasoning within 3D scenes. However, they typically rely on computationally expensive 3D representations like point clouds or reconstructed Bird's-Eye View (BEV) maps, or lack physical grounding to resolve ambiguit...

Keywords: egomotion, IMU, MLLM, keyframe selection, cross-modal fusion, 3D scene understanding, scale grounding, efficient representation

Papers for March 19, 2026

Unified Spatio-Temporal Token Scoring for Efficient Video VLMs

Universal Skeleton Understanding via Differentiable Rendering and MLLMs

Loc3R-VLM: Language-based Localization and 3D Reasoning with Vision-Language Models

EchoGen: Cycle-Consistent Learning for Unified Layout-Image Generation and Understanding

AgentFactory: A Self-Evolving Framework Through Executable Subagent Accumulation and Reuse

The Unreasonable Effectiveness of Text Embedding Interpolation for Continuous Image Steering

LoST: Level of Semantics Tokenization for 3D Shapes

GMT: Goal-Conditioned Multimodal Transformer for 6-DOF Object Trajectory Synthesis in 3D Scenes

Versatile Editing of Video Content, Actions, and Dynamics without Training

Feeling the Space: Egomotion-Aware Video Representation for Efficient and Accurate 3D Scene Understanding