Paper Archive

Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation

0

9.0/10

Unknown authors 10/13/2025 huggingface

machine learning

Project Page: https://kangliao929.github.io/projects/puffin/ | GitHub: https://github.com/KangLiao929/Puffin | Dataset: https://huggingface.co/datasets/KangLiao/Puffin-4M | Model: https://huggingface.co/KangLiao/Puffin | Demo: https://huggingface.co/spaces/KangLiao/Puffin

Keywords: language regression, diffusion-based generation, camera-centric, multimodal model, spatial awareness, vision-language, camera as language, geometric context

View Paper

D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI

0

9.0/10

Unknown authors 10/13/2025 huggingface

machine learning

We present D2E 🎮→🤖, a framework that scales Vision-Action Pretraining on desktop interaction data to accelerate Embodied AI 🚀. By turning ordinary game and desktop interactions into training fuel, D2E builds rich visuomotor priors that transfer from screens to robots. ✨ OWA Toolkit 🖥️: a unified...

Keywords: embodied AI, desktop pretraining, OWA Toolkit, OWAMcap, Generalist-IDM, timestamp-based prediction, pseudo-labeling, VAPT

View Paper

Multimodal Prompt Optimization: Why Not Leverage Multiple Modalities for MLLMs

0

9.0/10

Unknown authors 10/13/2025 huggingface

machine learning

We introduce the problem of multimodal prompt optimization and propose the multimodal prompt optimizer to harness the full capacity of multimodal large language models beyond text.

Keywords: Multimodal Prompt Optimization, MPO, MLLMs, prompt optimization, alignment-preserving updates, Bayesian selection, multimodal learning

View Paper

Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels

0

9.0/10

Unknown authors 10/13/2025 huggingface

reinforcement learning

... 1T). Our Webscale-RL pipeline converts pretraining text into diverse RL-ready QA data, scaling RL to pretraining levels! All code and datasets are open source! HF🤗: https://huggingface.co/datasets/Salesforce/Webscale-RL | GitHub 🤖: https://github.com/SalesforceAIResearch/Pr...

Keywords: reinforcement learning, large language models, data pipeline, Webscale-RL, dataset, pretraining, question-answering, efficiency

View Paper

R-Horizon: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?

0

9.0/10

Unknown authors 10/13/2025 huggingface

machine learning

Recent trends in test-time scaling for reasoning models (e.g., OpenAI o1, DeepSeek-R1) have led to remarkable improvements through long Chain-of-Thought (CoT). However, existing benchmarks mainly focus on immediate, single-horizon tasks, failing to adequately evaluate models' ability to understand a...

Keywords: R-HORIZON, long-horizon reasoning, Chain-of-Thought, CoT, Large Reasoning Models, LRMs, query composition, RLVR

View Paper

SpaceVista: All-Scale Visual Spatial Reasoning from mm to km

0

9.0/10

Unknown authors 10/13/2025 huggingface

machine learning

With the current surge in spatial reasoning explorations, researchers have made significant progress in understanding indoor scenes, but still struggle with diverse applications such as robotics and autonomous driving. This paper aims to advance all-scale spatial reasoning across diverse scenarios b...

Keywords: all-scale spatial reasoning, SpaceVista-1M, SpaceVista-7B, scale-aware modeling, progressive training, spatial QA, multimodal LLMs, benchmark

View Paper

ARES: Multimodal Adaptive Reasoning via Difficulty-Aware Token-Level Entropy Shaping

0

9.0/10

Unknown authors 10/13/2025 huggingface

machine learning

Paper (arXiv): https://arxiv.org/abs/2510.08457 | GitHub: https://github.com/shawn0728/ARES | Model & Dataset (Hugging Face): https://huggingface.co/collections/ares0728/ares-68e7c7160dcb48734dee4e95

Keywords: multimodal large reasoning models, high window-entropy, adaptive reasoning, AEPO, difficulty-aware, entropy shaping, dynamic KL, Adaptive Cold-Start

View Paper

StreamingVLM: Real-Time Understanding for Infinite Video Streams

0

9.0/10

Unknown authors 10/13/2025 huggingface

machine learning

StreamingVLM enables real-time, stable understanding of effectively infinite video by keeping a compact KV cache and aligning training with streaming inference. It avoids quadratic cost and sliding-window pitfalls, runs at up to 8 FPS on a single H100, and achieves a 66.18% win rate against GPT-4o mini on a new long-video...

Keywords: StreamingVLM, vision-language, KV cache, supervised fine-tuning, long-video, real-time, Inf-Streams-Eval, H100

View Paper

TAG: Tangential Amplifying Guidance for Hallucination-Resistant Diffusion Sampling

0

8.0/10

Unknown authors 10/13/2025 huggingface

computer vision

Recent diffusion models achieve state-of-the-art performance in image generation but often suffer from semantic inconsistencies or hallucinations. While various inference-time guidance methods can enhance generation, they often operate indirectly by relying on external signals or architectural ...

Keywords: diffusion models, image generation, hallucinations, inference-time guidance, trajectory signals, tangential components, Taylor expansion, plug-and-play

View Paper

AutoPR: Let's Automate Your Academic Promotion!

0

8.0/10

Unknown authors 10/13/2025 huggingface

machine learning

🧐 Why AutoPR? The academic community continues to expand output each year without a corresponding increase in visibility or value. In 2024 alone, NeurIPS accepted over 4,000 papers, with conference volumes at CVPR and ICCV also soaring. In such an environment of overwhelming information, how can ind...

Keywords: AutoPR, PRAgent, PRBench, multimodal benchmark, content extraction, multi-agent system, platform adaptation, hierarchical summarization

View Paper

Browse by Date

Papers for October 13, 2025

Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation

D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI

Multimodal Prompt Optimization: Why Not Leverage Multiple Modalities for MLLMs

Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels

R-Horizon: How Far Can Your Large Reasoning Model Really Go in Breadth and Depth?

SpaceVista: All-Scale Visual Spatial Reasoning from mm to km

ARES: Multimodal Adaptive Reasoning via Difficulty-Aware Token-Level Entropy Shaping

StreamingVLM: Real-Time Understanding for Infinite Video Streams

TAG: Tangential Amplifying Guidance for Hallucination-Resistant Diffusion Sampling

AutoPR: Let's Automate Your Academic Promotion!