Paper Archive

ShotStream: Streaming Multi-Shot Video Generation for Interactive Storytelling

9.0/10

Yawen Luo, Xiaoyu Shi, Junhao Zhuang, Yutian Chen, Quande Liu, Xintao Wang, Pengfei Wan, Tianfan Xue 3/26/2026 arxiv

machine learning

Multi-shot video generation is crucial for long narrative storytelling, yet current bidirectional architectures suffer from limited interactivity and high latency. We propose ShotStream, a novel causal multi-shot architecture that enables interactive storytelling and efficient on-the-fly frame gener...

Keywords: ShotStream, multi-shot video, streaming generation, causal architecture, distribution matching distillation, dual-cache memory, RoPE discontinuity

Less Gaussians, Texture More: 4K Feed-Forward Textured Splatting

9.0/10

Yixing Lao, Xuyang Bai, Xiaoyang Wu, Nuoyuan Yan, Zixin Luo, Tian Fang, Jean-Daniel Nahmias, Yanghai Tsin, Shiwei Li, Hengshuang Zhao 3/26/2026 arxiv

computer vision

Existing feed-forward 3D Gaussian Splatting methods predict pixel-aligned primitives, leading to a quadratic growth in primitive count as resolution increases. This fundamentally limits their scalability, making high-resolution synthesis such as 4K intractable. We introduce LGTM (Less Gaussians, Tex...

Keywords: 3D Gaussian Splatting, LGTM, novel view synthesis, 4K rendering, feed-forward, per-primitive texture, scalability

MuRF: Unlocking the Multi-Scale Potential of Vision Foundation Models

9.0/10

Bocheng Zou, Mu Cai, Mark Stanley, Dingfu Lu, Yong Jae Lee 3/26/2026 arxiv

computer vision

Vision Foundation Models (VFMs) have become the cornerstone of modern computer vision, offering robust representations across a wide array of tasks. While recent advances allow these models to handle varying input sizes during training, inference typically remains restricted to a single, fixed scale...

Keywords: MuRF, Multi-Resolution Fusion, Vision Foundation Models, DINOv2, SigLIP2, multi-scale, inference-time, frozen models

RefAlign: Representation Alignment for Reference-to-Video Generation

9.0/10

Lei Wang, YuXin Song, Ge Wu, Haocheng Feng, Hang Zhou, Jingdong Wang, Yaxing Wang, Jian Yang 3/26/2026 arxiv

computer vision

Reference-to-video (R2V) generation is a controllable video synthesis paradigm that constrains the generation process using both text prompts and reference images, enabling applications such as personalized advertising and virtual try-on. In practice, existing R2V methods typically introduce additio...

Keywords: RefAlign, reference-to-video, R2V, DiT, VFM, representation alignment, reference alignment loss, OpenS2V-Eval

Vega: Learning to Drive with Natural Language Instructions

9.0/10

Sicheng Zuo, Yuxuan Li, Wenzhao Zheng, Zheng Zhu, Jie Zhou, Jiwen Lu 3/26/2026 arxiv

machine learning

Vision-language-action models have reshaped autonomous driving to incorporate languages into the decision-making process. However, most existing pipelines only utilize the language modality for scene descriptions or reasoning and lack the flexibility to follow diverse user instructions for personali...

Keywords: Vega, InstructScene, vision-language-action, instruction following, autoregressive, diffusion, joint attention, autonomous driving

Drive My Way: Preference Alignment of Vision-Language-Action Model for Personalized Driving

9.0/10

Zehao Wang, Huaide Jiang, Shuaiwu Dong, Yuping Wang, Hang Qiu, Jiachen Li 3/26/2026 arxiv

machine learning

Human driving behavior is inherently personal, shaped by long-term habits and influenced by short-term intentions. Individuals differ in how they accelerate, brake, merge, yield, and overtake across diverse situations. However, existing end-to-end autonomous driving systems either optimize ...

Keywords: Vision-Language-Action, personalization, user embedding, autonomous driving, Bench2Drive, natural language instructions, end-to-end policy, user study

PSDesigner: Automated Graphic Design with a Human-Like Creative Workflow

9.0/10

Xincheng Shuai, Song Tang, Yutong Huang, Henghui Ding, Dacheng Tao 3/26/2026 arxiv

computer vision

Graphic design is a creative and innovative process that plays a crucial role in applications such as e-commerce and advertising. However, developing an automated design system that can faithfully translate user intentions into editable design files remains an open challenge. Although recent studies...

Keywords: graphic design, PSD, CreativePSD, tool use, MLLM, text-to-image, automated design, operation traces

MegaFlow: Zero-Shot Large Displacement Optical Flow

9.0/10

Dingxi Zhang, Fangjinhua Wang, Marc Pollefeys, Haofei Xu 3/26/2026 arxiv

computer vision

Accurate estimation of large displacement optical flow remains a critical challenge. Existing methods typically rely on iterative local search and/or domain-specific fine-tuning, which severely limits their performance in large displacement and zero-shot generalization scenarios. To overcome this, w...

Keywords: optical flow, zero-shot, large displacement, Vision Transformer, global matching, motion estimation, sub-pixel refinement, long-range point tracking

Training the Knowledge Base through Evidence Distillation and Write-Back Enrichment

9.0/10

Yuxing Lu, Xukai Zhao, Wei Wu, Jinzhuo Wang 3/26/2026 arxiv

machine learning

The knowledge base in a retrieval-augmented generation (RAG) system is typically assembled once and never revised, even though the facts a query requires are often fragmented across documents and buried in irrelevant content. We argue that the knowledge base should be treated as a trainable componen...

Keywords: WriteBack-RAG, retrieval-augmented generation, knowledge base, evidence distillation, write-back, corpus enrichment, retrieval, RAG

SlotVTG: Object-Centric Adapter for Generalizable Video Temporal Grounding

9.0/10

Jiwook Han, Geo Ahn, Youngrae Kim, Jinwoo Choi 3/26/2026 arxiv

computer vision

Multimodal Large Language Models (MLLMs) have shown strong performance on Video Temporal Grounding (VTG). However, their coarse recognition capabilities are insufficient for fine-grained temporal understanding, making task-specific fine-tuning indispensable. This fine-tuning causes models to memoriz...

Keywords: SlotVTG, video temporal grounding, object-centric learning, slot attention, multimodal LLM, OOD generalization, self-supervised vision, adapter

Papers for March 27, 2026
