Paper Archive

Canvas-to-Image: Compositional Image Generation with Multimodal Controls

9.0/10

Yusuf Dalva, Guocheng Gordon Qian, Maya Goldenberg, Tsai-Shien Chen, Kfir Aberman, Sergey Tulyakov, Pinar Yanardag, Kuan-Chieh Jackson Wang 11/26/2025 arxiv

computer vision

While modern diffusion models excel at generating high-quality and diverse images, they still struggle with high-fidelity compositional and multimodal control, particularly when users simultaneously specify text prompts, subject references, spatial arrangements, pose constraints, and layout annotati...

Keywords: diffusion models, multimodal controls, canvas representation, compositional generation, multi-task training, layout control, pose control, identity preservation

TraceGen: World Modeling in 3D Trace Space Enables Learning from Cross-Embodiment Videos

9.0/10

Seungjae Lee, Yoonkyo Jung, Inkook Chun, Yao-Chih Lee, Zikui Cai, Hongjia Huang, Aayush Talreja, Tan Dat Dao, Yongyuan Liang, Jia-Bin Huang, Furong Huang 11/26/2025 arxiv

robotics

Learning new robot tasks on new platforms and in new scenes from only a handful of demonstrations remains challenging. While videos of other embodiments (humans and different robots) are abundant, differences in embodiment, camera, and environment hinder their direct use. We address the small-data...

Keywords: trace-space, world model, robot learning, cross-embodiment, TraceGen, TraceForge, 3D motion prior, few-shot adaptation

ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration

9.0/10

Hongjin Su, Shizhe Diao, Ximing Lu, Mingjie Liu, Jiacheng Xu, Xin Dong, Yonggan Fu, Peter Belcak, Hanrong Ye, Hongxu Yin, Yi Dong, Evelina Bakhturina, Tao Yu, Yejin Choi, Jan Kautz, Pavlo Molchanov 11/26/2025 arxiv

machine learning

Large language models are powerful generalists, yet solving deep and complex problems such as those of the Humanity's Last Exam (HLE) remains both conceptually challenging and computationally expensive. We show that small orchestrators managing other models and a variety of tools can both push the u...

Keywords: ToolOrchestra, orchestrator, tool orchestration, reinforcement learning, efficiency, model composition, Orchestrator-8B, HLE benchmark

G$^2$VLM: Geometry Grounded Vision Language Model with Unified 3D Reconstruction and Spatial Reasoning

9.0/10

Wenbo Hu, Jingli Lin, Yilin Long, Yunlong Ran, Lihan Jiang, Yifan Wang, Chenming Zhu, Runsen Xu, Tai Wang, Jiangmiao Pang 11/26/2025 arxiv

machine learning

Vision-Language Models (VLMs) still lack robustness in spatial intelligence, demonstrating poor performance on spatial understanding and reasoning tasks. We attribute this gap to the absence of a visual geometry learning process capable of reconstructing 3D space from 2D images. We present G$^2$VLM,...

Keywords: G2VLM, geometry grounded, vision-language model, 3D reconstruction, spatial reasoning, multi-view, in-context learning, interleaved reasoning

Matrix: Peer-to-Peer Multi-Agent Synthetic Data Generation Framework

9.0/10

Dong Wang, Yang Li, Ansong Ni, Ching-Feng Yeh, Youssef Emad, Xinjie Lei, Liam Robbins, Karthik Padthe, Hu Xu, Xian Li, Asli Celikyilmaz, Ramya Raghavendra, Lifei Huang, Carole-Jean Wu, Shang-Wen Li 11/26/2025 arxiv

machine learning

Synthetic data has become increasingly important for training large language models, especially when real data is scarce, expensive, or privacy-sensitive. Many such generation tasks require coordinated multi-agent workflows, where specialized agents collaborate to produce data that is higher quality...

Keywords: synthetic data, multi-agent, peer-to-peer, decentralized, distributed queues, Ray, LLM inference, data generation throughput

Seeing without Pixels: Perception from Camera Trajectories

9.0/10

Zihui Xue, Kristen Grauman, Dima Damen, Andrew Zisserman, Tengda Han 11/26/2025 arxiv

computer vision

Can one perceive a video's content without seeing its pixels, just from the camera trajectory, the path it carves through space? This paper is the first to systematically investigate this seemingly implausible question. Towards this end, we propose a contrastive learning framework to train CamFormer,...

Keywords: camera trajectory, CamFormer, contrastive learning, trajectory embedding, cross-modal alignment, egocentric, exocentric, pose estimation

Agentic Learner with Grow-and-Refine Multimodal Semantic Memory

9.0/10

Weihao Bo, Shan Zhang, Yanpeng Sun, Jingjing Wu, Qunyi Xie, Xiao Tan, Kunbin Chen, Wei He, Xiaofan Li, Na Zhao, Jingdong Wang, Zechao Li 11/26/2025 arxiv

multimodal learning

MLLMs exhibit strong reasoning on isolated queries, yet they operate de novo, solving each problem independently and often repeating the same mistakes. Existing memory-augmented agents mainly store past trajectories for reuse. However, trajectory-based memory suffers from brevity bias, gradually l...

Keywords: ViLoMem, multimodal semantic memory, dual-stream memory, grow-and-refine, MLLM, lifelong learning, distraction-hallucination separation, pass@1

On Evolution-Based Models for Experimentation Under Interference

9.0/10

Sadegh Shirani, Mohsen Bayati 11/26/2025 arxiv

machine learning

Causal effect estimation in networked systems is central to data-driven decision making. In such settings, interventions on one unit can spill over to others, and in complex physical or social systems, the interaction pathways driving these interference structures remain largely unobserved. We argue...

Keywords: causal effects, interference, exposure mapping, evolution-based, difference-in-differences, causal message passing, influencer networks, treatment randomization

Uncertainty Quantification for Visual Object Pose Estimation

9.0/10

Lorenzo Shaikewitz, Charis Georgiou, Luca Carlone 11/26/2025 arxiv

robotics

Quantifying the uncertainty of an object's pose estimate is essential for robust control and planning. Although pose estimation is a well-studied robotics problem, attaching statistically rigorous uncertainty is not well understood without strict distributional assumptions. We develop distribution-f...

Keywords: pose estimation, uncertainty quantification, S-lemma, sum-of-squares, ellipsoid, monocular, semantic keypoints, convex relaxation

Attention-Guided Patch-Wise Sparse Adversarial Attacks on Vision-Language-Action Models

9.0/10

Naifu Zhang, Wei Tao, Xi Xiao, Qianpu Sun, Yuxin Zheng, Wentao Mo, Peiqiang Wang, Nan Zhang 11/26/2025 arxiv

machine learning

In recent years, Vision-Language-Action (VLA) models in embodied intelligence have developed rapidly. However, existing adversarial attack methods require costly end-to-end training and often generate noticeable perturbation patches. To address these limitations, we propose ADVLA, a framework that d...

Keywords: ADVLA, Vision-Language-Action, adversarial attacks, attention guidance, sparse perturbations, feature-space attacks, Top-K masking, L_inf

Papers for November 29, 2025
