Paper Archive

Browse and export your curated research paper collection

Archived Days: 197
Total Papers: 1958
Avg Score: 7.9
Categories: 9

Export Archive Data

Download your archived papers in various formats

JSON: Complete data with analysis • CSV: Tabular data for analysis • Markdown: Human-readable reports • BibTeX: Academic citations

Browse by Date

Papers for March 19, 2026

10 papers found

Jianrui Zhang, Yue Yang, Rohun Tripathi, Winson Han, Ranjay Krishna, Christopher Clark, Yong Jae Lee, Sangho Lee 3/18/2026 arxiv

computer vision

Token pruning is essential for enhancing the computational efficiency of vision-language models (VLMs), particularly for video-based tasks where temporal redundancy is prevalent. Prior approaches typically prune tokens either (1) within the vision transformer (ViT) exclusively for unimodal perceptio...

Keywords: STTS, token pruning, video VLM, ViT, LLM, spatio-temporal, packing algorithm, video QA

Ziyi Wang, Peiming Li, Xinshun Wang, Yang Tang, Kai-Kuang Ma, Mengyuan Liu 3/18/2026 arxiv

machine learning

Multimodal large language models (MLLMs) exhibit strong visual-language reasoning, yet remain confined to their native modalities and cannot directly process structured, non-visual data such as human skeletons. Existing methods either compress skeleton dynamics into lossy feature vectors for text al...

Keywords: skeleton understanding, differentiable rendering, MLLM, DrAction, Causal Reasoning Distillation, Discriminative Finetuning, cross-format transfer, multimodal

Kevin Qu, Haozhe Qi, Mihai Dusmanu, Mahdi Rad, Rui Wang, Marc Pollefeys 3/18/2026 arxiv

machine learning

Multimodal Large Language Models (MLLMs) have made impressive progress in connecting vision and language, but they still struggle with spatial understanding and viewpoint-aware reasoning. Recent efforts aim to augment the input representations with geometric cues rather than explicitly teaching mode...

Keywords: Loc3R-VLM, vision-language models, 3D reasoning, language-based localization, monocular video, global layout reconstruction, situation modeling, camera pose priors

Kai Zou, Hongbo Liu, Dian Zheng, Jianxiong Gao, Zhiwei Zhao, Bin Liu 3/18/2026 arxiv

machine learning

In this work, we present EchoGen, a unified framework for layout-to-image generation and image grounding, capable of generating images with accurate layouts and high fidelity to text descriptions (e.g., spatial relationships), while grounding the image robustly at the same time. We believe that imag...

Keywords: EchoGen, layout-to-image, image grounding, cycle-consistent learning, PMTP, DJO, Cycle RL, GRPO

Zhang Zhang, Shuqi Lu, Hongjin Qian, Di He, Zheng Liu 3/18/2026 arxiv

machine learning

Building LLM-based agents has become increasingly important. Recent works on LLM-based agent self-evolution primarily record successful experiences as textual prompts or reflections, which cannot reliably guarantee efficient task re-execution in complex scenarios. We propose AgentFactory, a new self...

Keywords: LLM agents, self-evolution, subagent, executable memory, code reuse, continual learning, agent library, Python

Yigit Ekin, Yossi Gandelsman 3/18/2026 arxiv

computer vision

We present a training-free framework for continuous and controllable image editing at test time for text-conditioned generative models. In contrast to prior approaches that rely on additional training or manual user intervention, we find that a simple steering in the text-embedding space is sufficie...

Keywords: text embedding interpolation, text-conditioned generation, steering vector, elastic range search, debiased contrastive prompts, LLM prompt synthesis, continuous editing, training-free

Niladri Shekhar Dutt, Zifan Shi, Paul Guerrero, Chun-Hao Paul Huang, Duygu Ceylan, Niloy J. Mitra, Xuelin Chen 3/18/2026 arxiv

computer vision

Tokenization is a fundamental technique in the generative modeling of various modalities. In particular, it plays a critical role in autoregressive (AR) models, which have recently emerged as a compelling option for 3D generation. However, optimal tokenization of 3D shapes remains an open question. ...

Keywords: LoST, RIDA, DINO, tokenization, autoregressive, 3D generation, semantic alignment, level-of-detail

Huajian Zeng, Abhishek Saroha, Daniel Cremers, Xi Wang 3/18/2026 arxiv

robotics

Synthesizing controllable 6-DOF object manipulation trajectories in 3D environments is essential for enabling robots to interact with complex scenes, yet remains challenging due to the need for accurate spatial reasoning, physical feasibility, and multimodal scene understanding. Existing approaches ...

Keywords: 6-DOF, trajectory synthesis, multimodal transformer, point cloud, 3D bounding box, goal-conditioned, manipulation planning, semantic context

Vladimir Kulikov, Roni Paiss, Andrey Voynov, Inbar Mosseri, Tali Dekel, Tomer Michaeli 3/18/2026 arxiv

computer vision

Controlled video generation has seen drastic improvements in recent years. However, editing actions and dynamic events, or inserting contents that should affect the behaviors of other objects in real-world videos, remains a major challenge. Existing trained models struggle with complex edits, likely...

Keywords: DynaEdit, video editing, text-to-video, flow models, inversion-free, model-agnostic, temporal consistency, action editing

Shuyao Shi, Kang G. Shin 3/18/2026 arxiv

computer vision

Recent Multimodal Large Language Models (MLLMs) have shown high potential for spatial reasoning within 3D scenes. However, they typically rely on computationally expensive 3D representations like point clouds or reconstructed Bird's-Eye View (BEV) maps, or lack physical grounding to resolve ambiguit...

Keywords: egomotion, IMU, MLLM, keyframe selection, cross-modal fusion, 3D scene understanding, scale grounding, efficient representation