Paper Archive

When to Align, When to Predict: A Phase Diagram for Multimodal Learning

5.0/10

Ilay Kamai, Hugues Van Assel, Aviv Regev, Hagai B. Perets, Randall Balestriero 6/9/2026 arxiv

computer vision

Cross-modal alignment (CA) and cross-modal prediction (CP) are the dominant paradigms for multimodal representation learning, yet there is no systematic understanding of when each succeeds, when each fails, and when cross-modal training helps at all -- a gap that leaves practitioners, especially in ...

A Unifying Lens on Supervised Fine-Tuning Through Target Distribution Design

5.0/10

Tong Xie, Yuanhao Ban, Yunqi Hong, Sohyun An, Yihang Chen, Cho-Jui Hsieh 6/9/2026 arxiv

computer vision

Supervised fine-tuning (SFT) typically maximizes the likelihood of every token in a demonstrated trajectory. However, an observed token can be non-unique, noisy, or misaligned with the model prior. Strictly fitting toward this one-hot target may be suboptimal, especially when the pretrained model en...

Keywords: fine-tuning

ARM: An AutoRegressive Large Multimodal Model with Unified Discrete Representations

5.0/10

Junke Wang, Xiao Wang, Jiacheng Pan, Xuefeng Hu, Feng Li, Jingxiang Sun, Chaorui Deng, Zilong Chen, Yunpeng Chen, Kaibin Tian, Matthew Gwilliam, Hao Chen, Danhui Guan, Kun Xu, Weilin Huang, Zuxuan Wu, Haoqi Fan, Yu-Gang Jiang, Zhenheng Yang 6/9/2026 arxiv

computer vision

This paper introduces ARM, a discrete representation-based AutoRegressive Model that unifies image understanding, generation, and editing within a next-token prediction framework. ARM is built on three efforts: first, we train a discrete semantic visual tokenizer that maps images into compact token ...

Keywords: reinforcement learning

Next Forcing: Causal World Modeling with Multi-Chunk Prediction

5.0/10

Gangwei Xu, Qihang Zhang, Jiaming Zhou, Xing Zhu, Yujun Shen, Xin Yang, Yinghao Xu 6/9/2026 arxiv

computer vision

Autoregressive video generation has emerged as a powerful paradigm for World Action Models (WAMs). However, existing approaches suffer from slow training convergence and limited converged accuracy, particularly at high frame rates, as the training supervision is confined to the current chunk without...

Keywords: pretraining

AnyMod-LLVE: Low-Light Video Enhancement with Modality-Agnostic Inference

5.0/10

Hangfeng Liang, Yutao Hu, Yanhan Hu, Xiaohan Wu, Wenqi Shao, Ying Fu 6/9/2026 arxiv

computer vision

Low-light video enhancement (LLVE) remains a challenging task due to severe information degradation under low-illumination conditions. Recent multimodal approaches have significantly improved enhancement performance by incorporating auxiliary modalities, such as event streams and infrared images. Ho...

Keywords: pretraining

TacForeSight: Force-Guided Tactile World Model for Contact-Rich Manipulation

5.0/10

Yujie Zang, Yuhang Zheng, Xian Nie, Yupeng Zheng, Shuai Tian, Songen Gu, Chen Gao, Zining Wang, Shuicheng Yan, Wenchao Ding 6/9/2026 arxiv

reinforcement learning

Contact-rich manipulation requires robots to continuously perceive and regulate evolving physical interactions under dynamic contact transitions or complex surface geometries. Recent imitation learning methods improve contact-aware control by incorporating tactile or force feedback, but they rarely ...

Keywords: attention

EEVEE: Towards Test-time Prompt Learning in the Real World for Self-Improving Agents

5.0/10

Weixian Xu, Shilong Liu, Mengdi Wang 6/9/2026 arxiv

reinforcement learning

In this paper, we propose EEVEE, the first multi-dataset test-time prompt learning framework for LLM agents, enabling test-time prompt learning under real-world task streams. Existing methods are largely designed for single-dataset settings, while real-world applications require models to handle het...

Lip Forcing: Few-Step Autoregressive Diffusion for Real-time Lip Synchronization

5.0/10

Paul Hyunbin Cho, Jinhyuk Jang, SeokYoung Lee, Joungbin Lee, Siyoon Jin, Heeseong Shin, Jung Yi, Yunjin Park, Chulmin Park, Seungryong Kim 6/9/2026 arxiv

computer vision

Diffusion-based lip synchronization models achieve strong visual quality and audio-visual alignment, but full-sequence bidirectional attention and many denoising steps make them impractical for real-time inference. We present Lip Forcing, to our knowledge the first autoregressive diffusion method fo...

Keywords: attention, diffusion model

Data Journalist Agent: Transforming Data into Verifiable Multimodal Stories

5.0/10

Kevin Qinghong Lin, Batu El, Yuhong Shi, Pan Lu, Philip Torr, James Zou 6/9/2026 arxiv

computer vision

Data tells stories that shape society; the data journalist's job is to turn raw information into stories non-experts can trust. A high-quality news feature takes a newsroom team weeks: hunting for context, running statistics, choosing an angle, and designing visuals. Recent agents handle individual ...

The Role of Feedback Alignment in Self-Distillation

5.0/10

Semih Kara, Oğuzhan Ersoy 6/9/2026 arxiv

natural language processing

Conditioning a language model on additional context, such as feedback on a previous attempt, typically improves its response. Self-distillation trains the model to retain this improvement when the context is not present. The method works by matching the model's output distribution under two settings...
