Paper Archive

From Pixels to Words -- Towards Native Vision-Language Primitives at Scale

9.0/10

Haiwen Diao, Mingxuan Li, Silei Wu, Linjun Dai, Xiaohua Wang, Hanming Deng, Lewei Lu, Dahua Lin, Ziwei Liu 10/16/2025 arxiv

machine learning

The edifice of native Vision-Language Models (VLMs) has emerged as a rising contender to typical modular VLMs, shaped by evolving model architectures and training paradigms. Yet, two lingering clouds cast shadows over its widespread exploration and promotion: (-) What fundamental constraints set nat...

Keywords: NEO, native VLM, vision-language, pixel-word alignment, shared semantic space, monolithic model, multimodal, scaling

Learning an Image Editing Model without Image Editing Pairs

9.0/10

Nupur Kumari, Sheng-Yu Wang, Nanxuan Zhao, Yotam Nitzan, Yuheng Li, Krishna Kumar Singh, Richard Zhang, Eli Shechtman, Jun-Yan Zhu, Xun Huang 10/16/2025 arxiv

computer vision

Recent image editing models have achieved impressive results while following natural language editing instructions, but they rely on supervised fine-tuning with large datasets of input-target pairs. This is a critical bottleneck, as such naturally occurring pairs are hard to curate at scale. Current...

Keywords: image editing, diffusion models, vision-language models, unpaired training, distribution matching, DMD, unrolled optimization, reward-based learning

Terra: Explorable Native 3D World Model with Point Latents

9.0/10

Yuanhui Huang, Weiliang Chen, Wenzhao Zheng, Xin Tao, Pengfei Wan, Jie Zhou, Jiwen Lu 10/16/2025 arxiv

computer vision

World models have garnered increasing attention for comprehensive modeling of the real world. However, most existing methods still rely on pixel-aligned representations as the basis for world evolution, neglecting the inherent 3D nature of the physical world. This could undermine the 3D consistency ...

Keywords: 3D world model, point latents, P2G-VAE, SPFlow, 3D Gaussian primitives, multi-view consistency, ScanNet v2, explorable environments

WithAnyone: Towards Controllable and ID Consistent Image Generation

9.0/10

Hengyuan Xu, Wei Cheng, Peng Xing, Yixiao Fang, Shuhan Wu, Rui Wang, Xianfang Zeng, Daxin Jiang, Gang Yu, Xingjun Ma, Yu-Gang Jiang 10/16/2025 arxiv

computer vision

Identity-consistent generation has become an important focus in text-to-image research, with recent models achieving notable success in producing images aligned with a reference identity. Yet, the scarcity of large-scale paired datasets containing multiple images of the same individual forces most a...

Keywords: identity-consistent generation, copy-paste, contrastive identity loss, MultiID-2M, diffusion model, controllable generation, text-to-image, benchmark

pi-Flow: Policy-Based Few-Step Generation via Imitation Distillation

9.0/10

Hansheng Chen, Kai Zhang, Hao Tan, Leonidas Guibas, Gordon Wetzstein, Sai Bi 10/16/2025 arxiv

computer vision

Few-step diffusion or flow-based generative models typically distill a velocity-predicting teacher into a student that predicts a shortcut towards denoised data. This format mismatch has led to complex distillation procedures that often suffer from a quality-diversity trade-off. To address this, we ...

Keywords: pi-Flow, policy-based flow, imitation distillation, flow matching, few-step generation, diffusion/flow models, DiT, ImageNet

Attention Is All You Need for KV Cache in Diffusion LLMs

9.0/10

Quan Nguyen-Tri, Mukul Ranjan, Zhiqiang Shen 10/16/2025 arxiv

machine learning

This work studies how to adaptively recompute key-value (KV) caches for diffusion large language models (DLMs) to maximize prediction accuracy while minimizing decoding latency. Prior methods' decoders recompute QKV for all tokens at every denoising step and layer, despite KV states changing little ...

Keywords: Elastic-Cache, KV cache, diffusion LLMs, attention-aware, depth-aware refresh, adaptive caching, LLaDA-Instruct, GSM8K

TokDrift: When LLM Speaks in Subwords but Code Speaks in Grammar

9.0/10

Yinxi Li, Yuntian Deng, Pengyu Nie 10/16/2025 arxiv

machine learning

Large language models (LLMs) for code rely on subword tokenizers, such as byte-pair encoding (BPE), learned from mixed natural language text and programming language code but driven by statistics rather than grammar. As a result, semantically identical code snippets can be tokenized differently depe...

Keywords: TokDrift, tokenization, subword, BPE, code LLMs, embeddings, grammar-aware tokenization, semantic-preserving rewrites

Biology-informed neural networks learn nonlinear representations from omics data to improve genomic prediction and interpretability

9.0/10

Katiana Kontolati, Rini Jasmine Gladstone, Ian Davis, Ethan Pickering 10/16/2025 arxiv

machine learning

We extend biologically-informed neural networks (BINNs) for genomic prediction (GP) and selection (GS) in crops by integrating thousands of single-nucleotide polymorphisms (SNPs) with multi-omics measurements and prior biological knowledge. Traditional genotype-to-phenotype (G2P) models depend heavi...

Keywords: Biology-informed neural networks, BINNs, genomic prediction, genomic selection, multi-omics, SNPs, gene expression, pathway priors

RDD: Retrieval-Based Demonstration Decomposer for Planner Alignment in Long-Horizon Tasks

9.0/10

Mingxuan Yan, Yuping Wang, Zechun Liu, Jiachen Li 10/16/2025 arxiv

robotics

To tackle long-horizon tasks, recent hierarchical vision-language-action (VLA) frameworks employ vision-language model (VLM)-based planners to decompose complex manipulation tasks into simpler sub-tasks that low-level visuomotor policies can easily handle. Typically, the VLM planner is finetuned to...

Keywords: retrieval, demonstration decomposition, planner alignment, vision-language models, visuomotor policies, long-horizon tasks, hierarchical VLA, robotics

Information Gain-based Policy Optimization: A Simple and Effective Approach for Multi-Turn LLM Agents

9.0/10

Guoqing Wang, Sunhao Dai, Guangze Ye, Zeyu Gan, Wei Yao, Yong Deng, Xiaofeng Wu, Zhenzhe Ying 10/16/2025 arxiv

reinforcement learning

Large language model (LLM)-based agents are increasingly trained with reinforcement learning (RL) to enhance their ability to interact with external environments through tool use, particularly in search-based settings that require multi-turn reasoning and knowledge acquisition. However, existing app...

Keywords: IGPO, information gain, policy optimization, LLM agents, reinforcement learning, intrinsic rewards, turn-level rewards, credit assignment

Papers for October 17, 2025
