{"papers":[{"id":"hf_daily_2606.10804_hlnujz","arxiv_id":"2606.10804","title":"SCAIL-2: Unifying Controlled Character Animation with End-to-end In-Context Conditioning","abstract":"Controlled character animation requires transferring motion from a driving sequence to a reference character. Prior works heavily rely on intermediate representations, including pose skeletons to represent motion or masked background to represent environment, which inevitably leads to information loss. To address this, we present SCAIL-2, a framework that bypasses those intermediates and achieves end-to-end character animation. By directly concatenating driving videos to the sequence, the model can obtain all the required visual information from the input video. To address lack of end-to-end data, we unify sub-tasks of character animation with decoupled conditions and then curate a pipeline to synthesize MotionPair-60K, an end-to-end motion transfer dataset containing heterogeneous tasks of character animation. To achieve the unification, we utilize in-context mask conditioning and mode-specific RoPE as soft guidance beyond textual instructions and raw visual information. To address synthetic discrepancy in detailed regions, we propose Bias-Aware DPO to construct preference items to mitigate the errors. Extensive experiments demonstrate that our method substantially outperforms existing state-of-the-art approaches in various character animation tasks. 
A large subset of synthetic data as well as model weights will be released at our project page: https://teal024.github.io/SCAIL-2/.","authors":["Wenhao Yan","Fengjia Guo","Zhuoyi Yang","Jie Tang"],"published":"2026-06-09T00:00:00.000Z","updated":"2026-06-11T08:00:08.383Z","category":"computer_vision","source":"huggingface","original_source":"huggingface_daily_papers","url":"https://huggingface.co/papers/2606.10804","pdf_url":"https://arxiv.org/pdf/2606.10804","scraped_at":"2026-06-11T08:00:08.383Z","images":["https://arxiv.org/html/2606.10804/x1.png"],"upvotes":0,"ai_summary":"","ai_keywords":[],"github_repo":"","github_stars":0,"analysis":{"introduction":"🚀 SCAIL-2: Unifying Controlled Character Animation with End-to-end In-Context Conditioning - Research analysis - Analysis temporarily unavailable due to processing error.","challenges":"🎯 Challenges information unavailable due to processing error.","innovations":"✨ Innovation details unavailable due to processing error.","experiments":"📊 Experimental results unavailable due to processing error.","insights":"🤔 Research insights unavailable due to processing error.","summary":"Abstract: Controlled character animation requires transferring motion from a driving sequence to a reference character. Prior works heavily rely on intermediate representations, including pose skeletons to represent motion or masked background to represent environment, which inevitably leads to information lo...","keywords":[],"category":"computer_vision","relevance_score":5,"technical_depth":"unknown","analyzed_at":"2026-06-11T08:05:13.488Z","model":"fallback","error":true,"title_only_analysis":false},"views":0},{"id":"hf_2606.11187_cr5yu09ia","title":"Next Forcing: Causal World Modeling with Multi-Chunk Prediction","abstract":"Autoregressive video generation has emerged as a powerful paradigm for World Action Models (WAMs). 
However, existing approaches suffer from slow training convergence and limited converged accuracy, particularly at high frame rates, as the training supervision is confined to the current chunk without explicit signals about future dynamics; they also suffer from slow inference due to iterative video denoising. In this paper, we present Next Forcing, a multi-chunk prediction (MCP) framework for causal world modeling that enables faster training, higher accuracy, and accelerated inference. Inspired by multi-token prediction in large language models, Next Forcing introduces an MCP training objective that augments the main model with lightweight auxiliary MCP modules to simultaneously denoise video chunks at multiple future temporal horizons (next^1, next^2, next^3 chunks). These MCP modules form a causal chain across prediction depths, where intermediate features fused from multiple layers of the main model are leveraged to predict future dynamics, allowing near-future predictions to inform farther-future ones and providing dense multi-scale temporal supervision back to the main model. During training, the MCP modules significantly accelerate convergence and improve converged accuracy, especially at high frame rates: at 50 fps, Next Forcing achieves a 93.1% relative improvement over LingBot-VA at 5k training steps and 2.3x faster convergence, and establishes new state-of-the-art results on the RoboTwin benchmark (94.1/93.5% on Clean/Random). At inference, the MCP modules can be retained to predict the next video chunk in parallel with the current one, achieving 2x inference acceleration. 
Next Forcing also demonstrates significant improvements on PhyWorld, a benchmark evaluating adherence to physical laws in video generation, and over 50% FVD reduction on general video pretraining.","authors":[{"_id":"6a296081887fb79cbf65d685","name":"Gangwei Xu","hidden":false},{"_id":"6a296081887fb79cbf65d686","name":"Qihang Zhang","hidden":false},{"_id":"6a296081887fb79cbf65d687","name":"Jiaming Zhou","hidden":false},{"_id":"6a296081887fb79cbf65d688","name":"Xing Zhu","hidden":false},{"_id":"6a296081887fb79cbf65d689","name":"Yujun Shen","hidden":false},{"_id":"6a296081887fb79cbf65d68a","name":"Xin Yang","hidden":false},{"_id":"6a296081887fb79cbf65d68b","name":"Yinghao Xu","hidden":false}],"published":"2026-06-09T00:00:00.000Z","updated":"2026-06-11T08:00:08.323Z","category":"computer_vision","source":"huggingface","original_source":"huggingface_api","url":"https://huggingface.co/papers/2606.11187","pdf_url":"","scraped_at":"2026-06-11T08:00:08.323Z","images":["https://arxiv.org/html/2606.11187/x1.png"],"analysis":{"introduction":"🚀 Next Forcing: Causal World Modeling with Multi-Chunk Prediction - Research analysis - Analysis temporarily unavailable due to processing error.","challenges":"🎯 Challenges information unavailable due to processing error.","innovations":"✨ Innovation details unavailable due to processing error.","experiments":"📊 Experimental results unavailable due to processing error.","insights":"🤔 Research insights unavailable due to processing error.","summary":"Abstract: Autoregressive video generation has emerged as a powerful paradigm for World Action Models (WAMs). 
However, existing approaches suffer from slow training convergence and limited converged accuracy, particularly at high frame rates, as the training supervision is confined to the current chunk without...","keywords":["pretraining"],"category":"computer_vision","relevance_score":5,"technical_depth":"unknown","analyzed_at":"2026-06-11T08:04:06.150Z","model":"fallback","error":true,"title_only_analysis":false},"views":0},{"id":"arxiv_2606.12407v1","arxiv_id":"2606.12407v1","title":"How Seemingly Inconsequential Design Choices Dictate Performance of LLMs in Pathology","abstract":"General-purpose large language models (LLMs) are routinely used as baselines when evaluating specialized pathology models on whole-slide images (WSIs). Because WSIs exceed contemporary model context limits, LLM baselines routinely use small, high-magnification patches processed independently via majority voting, without systematic evaluation of seemingly inconsequential design choices such as patch size, patch count, and magnification. Generalist LLMs have consistently underperformed specialized systems, reinforcing the perception that domain-specific training or architectural adaptation is necessary for pathology tasks involving WSIs. Here, we conduct a systematic factorial analysis of four input design factors: inference mode, patch size, magnification, and patch count. We demonstrate that prior studies have overstated the gap between specialized models and general-purpose LLMs by choosing non-optimized input configurations. On the MultiPathQA benchmark, switching to a single balanced configuration (large patches at lower magnification, processed jointly) raises GPT-5 from 15.1% to 39.5% on cancer-type classification (TCGA) and from 38.1% to 62.9% on organ classification (GTEx). Per-task optimization yields further gains up to 43.9% (TCGA) and 71.6% (GTEx). 
The same configuration generalizes to two other models and to a fully held-out CPTAC cohort, where it improves Gemini 3 Flash by 23.4 percentage points without any task-specific tuning.","authors":["Kian R. Weihrauch","Thomas A. Buckley","William Lotter","Arjun K. Manrai"],"published":"2026-06-10T17:59:39Z","updated":"2026-06-10T17:59:39Z","category":"","source":"arxiv","original_source":"arxiv","url":"http://arxiv.org/abs/2606.12407v1","pdf_url":"http://arxiv.org/pdf/2606.12407v1.pdf","scraped_at":"2026-06-11T08:00:09.117Z","images":["https://arxiv.org/html/2606.12407v1/x1.png"],"analysis":{"introduction":"🚀 How Seemingly Inconsequential Design Choices Dictate Performance of LLMs in Pathology - Research analysis - Analysis temporarily unavailable due to processing error.","challenges":"🎯 Challenges information unavailable due to processing error.","innovations":"✨ Innovation details unavailable due to processing error.","experiments":"📊 Experimental results unavailable due to processing error.","insights":"🤔 Research insights unavailable due to processing error.","summary":"Abstract: General-purpose large language models (LLMs) are routinely used as baselines when evaluating specialized pathology models on whole-slide images (WSIs). 
Because WSIs exceed contemporary model context limits, LLM baselines routinely use small, high-magnification patches processed independently via maj...","keywords":["gpt","classification"],"category":"computer_vision","relevance_score":5,"technical_depth":"unknown","analyzed_at":"2026-06-11T08:00:30.244Z","model":"fallback","error":true,"title_only_analysis":false},"views":0},{"id":"hf_2606.11324_81hls6mm3","title":"Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models","abstract":"We introduce Embodied-R1.5, a unified Embodied Foundation Model (EFM) that integrates comprehensive embodied reasoning capabilities, spanning embodied cognition, task planning, correction, and pointing, within a single architecture toward general physical intelligence. Leveraging three automated data construction pipelines to significantly expand the data coverage of critical capabilities, we build a large-scale data system of over 15B tokens, and design a multi-task balanced RL recipe to alleviate heterogeneous task conflicts. We further introduce a Planner-Grounder-Corrector (PGC) closed-loop framework that enables a single model to autonomously execute and self-correct over long-horizon tasks. With only 8B parameters, Embodied-R1.5 achieves SOTA on 16 out of 24 embodied VLM benchmarks, surpassing leading models like Gemini-Robotics-ER-1.5 and GPT-5.4. Benefiting from the internalized embodied capabilities, Embodied-R1.5 can be fine-tuned into a VLA with only a small amount of data, outperforming leading VLA models like π_{0.5} across 4 popular manipulation benchmark suites. We further conduct extensive zero-shot real-robot experiments, validating performance in instruction following, affordance grounding, articulated object manipulation, and long-horizon complex tasks, demonstrating strong generalization to the physical world. 
We open-source model weights, datasets, training code, and EmbodiedEvalKit, an evaluation framework tailored for embodied tasks, to facilitate future research in EFMs.","authors":[{"_id":"6a2a1fe980a9c7c6830c0ef3","name":"Yifu Yuan","hidden":false},{"_id":"6a2a1fe980a9c7c6830c0ef4","name":"Yaoting Huang","hidden":false},{"_id":"6a2a1fe980a9c7c6830c0ef5","name":"Xianze Yao","hidden":false},{"_id":"6a2a1fe980a9c7c6830c0ef6","name":"Yutong Li","hidden":false},{"_id":"6a2a1fe980a9c7c6830c0ef7","name":"Shuoheng Zhang","hidden":false},{"_id":"6a2a1fe980a9c7c6830c0ef8","name":"Linqi Han","hidden":false},{"_id":"6a2a1fe980a9c7c6830c0ef9","name":"Pengyi Li","hidden":false},{"_id":"6a2a1fe980a9c7c6830c0efa","name":"Jiangeng Sun","hidden":false},{"_id":"6a2a1fe980a9c7c6830c0efb","name":"Wenting Jia","hidden":false},{"_id":"6a2a1fe980a9c7c6830c0efc","name":"Zhao Zhang","hidden":false},{"_id":"6a2a1fe980a9c7c6830c0efd","name":"Yuhao Liu","hidden":false},{"_id":"6a2a1fe980a9c7c6830c0efe","name":"Ruihao Liao","hidden":false},{"_id":"6a2a1fe980a9c7c6830c0eff","name":"Yucheng Hu","hidden":false},{"_id":"6a2a1fe980a9c7c6830c0f00","name":"Qiyu Wu","hidden":false},{"_id":"6a2a1fe980a9c7c6830c0f01","name":"Yuxiao Li","hidden":false},{"_id":"6a2a1fe980a9c7c6830c0f02","name":"Zibin Dong","hidden":false},{"_id":"6a2a1fe980a9c7c6830c0f03","name":"Fei Ni","hidden":false},{"_id":"6a2a1fe980a9c7c6830c0f04","name":"Yan Zheng","hidden":false},{"_id":"6a2a1fe980a9c7c6830c0f05","name":"Shuyang Gu","hidden":false},{"_id":"6a2a1fe980a9c7c6830c0f06","name":"Yi Ma","hidden":false},{"_id":"6a2a1fe980a9c7c6830c0f07","name":"Hongyao Tang","hidden":false},{"_id":"6a2a1fe980a9c7c6830c0f08","name":"Han Hu","hidden":false},{"_id":"6a2a1fe980a9c7c6830c0f09","name":"Jianye 
Hao","hidden":false}],"published":"2026-06-09T00:00:00.000Z","updated":"2026-06-11T08:00:08.323Z","category":"reinforcement_learning","source":"huggingface","original_source":"huggingface_api","url":"https://huggingface.co/papers/2606.11324","pdf_url":"","scraped_at":"2026-06-11T08:00:08.323Z","images":["https://arxiv.org/html/2606.11324/x1.png"],"analysis":{"introduction":"🚀 Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models - Research analysis - Analysis temporarily unavailable due to processing error.","challenges":"🎯 Challenges information unavailable due to processing error.","innovations":"✨ Innovation details unavailable due to processing error.","experiments":"📊 Experimental results unavailable due to processing error.","insights":"🤔 Research insights unavailable due to processing error.","summary":"Abstract: We introduce Embodied-R1.5, a unified Embodied Foundation Model (EFM) that integrates comprehensive embodied reasoning capabilities, spanning embodied cognition, task planning, correction, and pointing, within a single architecture toward general physical intelligence. Leveraging three automated dat...","keywords":["gpt"],"category":"reinforcement_learning","relevance_score":5,"technical_depth":"unknown","analyzed_at":"2026-06-11T08:03:43.952Z","model":"fallback","error":true,"title_only_analysis":false},"views":0},{"id":"hf_2606.11176_nynmlboon","title":"Data Journalist Agent: Transforming Data into Verifiable Multimodal Stories","abstract":"Data tells stories that shape society; the data journalist's job is to turn raw information into stories non-experts can trust. A high-quality news feature takes a newsroom team weeks: hunting for context, running statistics, choosing an angle, and designing visuals. Recent agents handle individual steps well: data-science agents close the analysis loop, while design agents synthesize beautiful websites. But can an agent serve as a data journalist end to end? 
We introduce Data Journalist Agent (Data2Story), a multi-agent framework that orchestrates specialized roles into a single virtual newsroom. Data2Story contributes two innovations. (i) Claims are evidence-grounded: an Inspector links every number, angle, and asset back to data, code, or an external reference. (ii) Articles are multimodally generative: rather than defaulting to plain text and static charts, Data2Story reasons about what readers will want to see, then deploys multimodal tools, such as interactive maps for geography and audio for music. We evaluate Data2Story on 18 articles, each paired with the originally published expert piece, along four axes: (a) human-agent angle coverage; (b) rubric evaluation with 53 participants across five dimensions; (c) computer-use agents as judges, a cost-saving proxy for how readers navigate interactive articles; and (d) verifiability, where a coding verifier re-executes statements against the data and checks claims against references. Data2Story produces competitive, evidence-traceable multimedia stories, with particular strength in transparency and auditability. Human articles retain an edge in editorial angle, creative design, and presentation. We position Data2Story as a collaborator for journalists, enabling more evidence-based, transparent, and verifiable reporting. 
Code and demos are available at https://data2story.github.io.","authors":[{"_id":"6a28e548e7d78ea7587e5537","name":"Kevin Qinghong Lin","hidden":false},{"_id":"6a28e548e7d78ea7587e5538","name":"Batu EI","hidden":false},{"_id":"6a28e548e7d78ea7587e5539","name":"Yuhong Shi","hidden":false},{"_id":"6a28e548e7d78ea7587e553a","name":"Pan Lu","hidden":false},{"_id":"6a28e548e7d78ea7587e553b","name":"Philip Torr","hidden":false},{"_id":"6a28e548e7d78ea7587e553c","name":"James Zou","hidden":false}],"published":"2026-06-09T00:00:00.000Z","updated":"2026-06-11T08:00:08.323Z","category":"computer_vision","source":"huggingface","original_source":"huggingface_api","url":"https://huggingface.co/papers/2606.11176","pdf_url":"","scraped_at":"2026-06-11T08:00:08.323Z","images":["https://arxiv.org/html/2606.11176/x1.png"],"analysis":{"introduction":"🚀 Data Journalist Agent: Transforming Data into Verifiable Multimodal Stories - Research analysis - Analysis temporarily unavailable due to processing error.","challenges":"🎯 Challenges information unavailable due to processing error.","innovations":"✨ Innovation details unavailable due to processing error.","experiments":"📊 Experimental results unavailable due to processing error.","insights":"🤔 Research insights unavailable due to processing error.","summary":"Abstract: Data tells stories that shape society; the data journalist's job is to turn raw information into stories non-experts can trust. A high-quality news feature takes a newsroom team weeks: hunting for context, running statistics, choosing an angle, and designing visuals. 
Recent agents handle individual ...","keywords":[],"category":"computer_vision","relevance_score":5,"technical_depth":"unknown","analyzed_at":"2026-06-11T08:04:06.485Z","model":"fallback","error":true,"title_only_analysis":false},"views":0},{"id":"hf_daily_2601.02163_gz1ymr","arxiv_id":"2601.02163","title":"EverMemOS: A Self-Organizing Memory Operating System for Structured Long-Horizon Reasoning","abstract":"Large Language Models (LLMs) are increasingly deployed as long-term interactive agents, yet their limited context windows make it difficult to sustain coherent behavior over extended interactions. Existing memory systems often store isolated records and retrieve fragments, limiting their ability to consolidate evolving user states and resolve conflicts. We introduce EverMemOS, a self-organizing memory operating system that implements an engram-inspired lifecycle for computational memory. Episodic Trace Formation converts dialogue streams into MemCells that capture episodic traces, atomic facts, and time-bounded Foresight signals. Semantic Consolidation organizes MemCells into thematic MemScenes, distilling stable semantic structures and updating user profiles. Reconstructive Recollection performs MemScene-guided agentic retrieval to compose the necessary and sufficient context for downstream reasoning. Experiments on LoCoMo and LongMemEval show that EverMemOS achieves state-of-the-art performance on memory-augmented reasoning tasks. We further report a profile study on PersonaMem v2 and qualitative case studies illustrating chat-oriented capabilities such as user profiling and Foresight. 
Code is available at https://github.com/EverMind-AI/EverMemOS.","authors":["Chuanrui Hu","Xingze Gao","Zuyi Zhou","Dannong Xu","Yi Bai","Xintong Li","Hui Zhang","Tong Li","Chong Zhang","Lidong Bing","Yafeng Deng"],"published":"2026-01-05T14:39:43.000Z","updated":"2026-06-11T08:00:08.383Z","category":"natural_language_processing","source":"huggingface","original_source":"huggingface_daily_papers","url":"https://huggingface.co/papers/2601.02163","pdf_url":"https://arxiv.org/pdf/2601.02163","scraped_at":"2026-06-11T08:00:08.383Z","images":["https://arxiv.org/html/2601.02163/x1.png"],"upvotes":0,"ai_summary":"","ai_keywords":[],"github_repo":"","github_stars":0,"analysis":{"introduction":"🚀 EverMemOS: A Self-Organizing Memory Operating System for Structured Long-Horizon Reasoning - Research analysis - Analysis temporarily unavailable due to processing error.","challenges":"🎯 Challenges information unavailable due to processing error.","innovations":"✨ Innovation details unavailable due to processing error.","experiments":"📊 Experimental results unavailable due to processing error.","insights":"🤔 Research insights unavailable due to processing error.","summary":"Abstract: Large Language Models (LLMs) are increasingly deployed as long-term interactive agents, yet their limited context windows make it difficult to sustain coherent behavior over extended interactions. 
Existing memory systems often store isolated records and retrieve fragments, limiting their ability to ...","keywords":[],"category":"natural_language_processing","relevance_score":5,"technical_depth":"unknown","analyzed_at":"2026-06-11T08:06:21.010Z","model":"fallback","error":true,"title_only_analysis":false},"views":0},{"id":"arxiv_2606.12411v1","arxiv_id":"2606.12411v1","title":"Context-Driven Incremental Compression for Multi-Turn Dialogue Generation","abstract":"Modern conversational agents condition on an ever-growing dialogue history at each turn, incurring redundant attention and encoding costs that grow with conversation length. Naive truncation or summarization degrades fidelity, while existing context compressors lack cross-turn memory sharing or revision, causing information loss and compounding errors in long dialogues. We revisit the context compression under conversational dynamics and empirically present its fragility. To improve both efficiency and robustness, we introduce Context-Driven Incremental Compression (C-DIC), which treats a conversation as interleaved contextual threads and stores revisable per-thread compression states in a single, compact dialogue memory. At each turn, a lightweight retrieve, revise, and write-back loop shares information across turns and updates stale memories, stabilizing long-horizon behavior. In addition, we adapt truncated backpropagation-through-time (TBPTT) to our multi-turn setting, learning cross-turn dependencies without full-history backpropagation. 
Extensive experiments on long-form dialogue benchmarks demonstrate superior performance and efficiency of C-DIC; notably, C-DIC shows stable inference latency and perplexity over hundreds of dialogue turns, supporting a scalable path to high-quality dialogue modeling.","authors":["Yeongseo Jung","Jaehyeok Kim","Eunseo Jung","Jiachuan Wang","Yongqi Zhang","Ka Chun Cheung","Simon See","Lei Chen"],"published":"2026-06-10T17:59:54Z","updated":"2026-06-10T17:59:54Z","category":"","source":"arxiv","original_source":"arxiv","url":"http://arxiv.org/abs/2606.12411v1","pdf_url":"http://arxiv.org/pdf/2606.12411v1.pdf","scraped_at":"2026-06-11T08:00:09.117Z","images":["https://arxiv.org/html/2606.12411v1/x1.png"],"analysis":{"introduction":"🚀 Context-Driven Incremental Compression for Multi-Turn Dialogue Generation - Research analysis - Analysis temporarily unavailable due to processing error.","challenges":"🎯 Challenges information unavailable due to processing error.","innovations":"✨ Innovation details unavailable due to processing error.","experiments":"📊 Experimental results unavailable due to processing error.","insights":"🤔 Research insights unavailable due to processing error.","summary":"Abstract: Modern conversational agents condition on an ever-growing dialogue history at each turn, incurring redundant attention and encoding costs that grow with conversation length. 
Naive truncation or summarization degrades fidelity, while existing context compressors lack cross-turn memory sharing or revi...","keywords":["attention","backpropagation"],"category":"computer_vision","relevance_score":5,"technical_depth":"unknown","analyzed_at":"2026-06-11T08:00:30.234Z","model":"fallback","error":true,"title_only_analysis":false},"views":0},{"id":"hf_2606.12344_t558qwwxv","title":"Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks","abstract":"General-purpose agents such as OpenClaw are increasingly used as autonomous tool users, but their coding ability is difficult to measure under SWE-bench: a generic agent does not by itself satisfy the clean Docker workspace, patch, and prediction contract required for scoring. We introduce Claw-SWE-Bench, a multilingual SWE-bench-style benchmark and adapter protocol that makes heterogeneous agent harnesses, or claws, comparable under fair settings including a fixed prompt, runtime budget, workspace contract, patch extraction procedure, and evaluator. The full benchmark contains 350 GitHub issue-resolution instances across 8 languages and 43 repositories, drawn from SWE-bench-Multilingual and SWE-bench-Verified-Mini after future-commit cleanup. We also release Claw-SWE-Bench Lite for faster validation, which is an 80-instance subset selected by a cost-aware, rank-aware procedure over 17 calibration columns. On the full benchmark, OpenClaw with a minimal direct-diff adapter scores only 19.1% Pass@1, whereas the full adapter reaches 73.4% with the same GLM 5.1 backbone, showing that adapter design is essential for enabling OpenClaw-style harnesses to perform coding tasks effectively. Across an OpenClaw times nine-model sweep and a five-claw times two-model sweep, model choice changes Pass@1 by 29.4 pp and harness choice by 27.4 pp under fixed models; systems with similar accuracy can differ substantially in total API cost. 
Claw-SWE-Bench therefore treats harness and cost accounting as first-class axes of SWE-style coding-agent evaluation, providing both a full benchmark and a low-cost reference set for reproducible comparison. The data is available at https://github.com/opensquilla/claw-swe-bench and https://huggingface.co/datasets/TokenRhythm/Claw-SWE-Bench.","authors":[{"_id":"6a2a369980a9c7c6830c0ffd","name":"Mengyu Zheng","hidden":false},{"_id":"6a2a369980a9c7c6830c0ffe","name":"Kai Han","hidden":false},{"_id":"6a2a369980a9c7c6830c0fff","name":"Boxun Li","hidden":false},{"_id":"6a2a369980a9c7c6830c1000","name":"Haiyang Xu","hidden":false},{"_id":"6a2a369980a9c7c6830c1001","name":"Yuchuan Tian","hidden":false},{"_id":"6a2a369980a9c7c6830c1002","name":"Wei He","hidden":false},{"_id":"6a2a369980a9c7c6830c1003","name":"Hang Zhou","hidden":false},{"_id":"6a2a369980a9c7c6830c1004","name":"Jianyuan Guo","hidden":false},{"_id":"6a2a369980a9c7c6830c1005","name":"Hailin Hu","hidden":false},{"_id":"6a2a369980a9c7c6830c1006","name":"Lin Ma","hidden":false},{"_id":"6a2a369980a9c7c6830c1007","name":"Chao Xu","hidden":false},{"_id":"6a2a369980a9c7c6830c1008","name":"Guohao Dai","hidden":false},{"_id":"6a2a369980a9c7c6830c1009","name":"Lixue Xia","hidden":false},{"_id":"6a2a369980a9c7c6830c100a","name":"Yunchao Wei","hidden":false},{"_id":"6a2a369980a9c7c6830c100b","name":"Yunhe Wang","hidden":false},{"_id":"6a2a369980a9c7c6830c100c","name":"Yu Wang","hidden":false}],"published":"2026-06-10T00:00:00.000Z","updated":"2026-06-11T08:00:08.323Z","category":"natural_language_processing","source":"huggingface","original_source":"huggingface_api","url":"https://huggingface.co/papers/2606.12344","pdf_url":"","scraped_at":"2026-06-11T08:00:08.323Z","images":["https://arxiv.org/html/2606.12344/x1.png"],"analysis":{"introduction":"🚀 Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks - Research analysis - Analysis temporarily unavailable due to processing 
error.","challenges":"🎯 Challenges information unavailable due to processing error.","innovations":"✨ Innovation details unavailable due to processing error.","experiments":"📊 Experimental results unavailable due to processing error.","insights":"🤔 Research insights unavailable due to processing error.","summary":"Abstract: General-purpose agents such as OpenClaw are increasingly used as autonomous tool users, but their coding ability is difficult to measure under SWE-bench: a generic agent does not by itself satisfy the clean Docker workspace, patch, and prediction contract required for scoring. We introduce Claw-SWE-...","keywords":[],"category":"natural_language_processing","relevance_score":5,"technical_depth":"unknown","analyzed_at":"2026-06-11T08:01:33.931Z","model":"fallback","error":true,"title_only_analysis":false},"views":0},{"id":"hf_2606.11032_y8f5z9tk3","title":"U-TTT: Towards Generalizable PET Image Denoising via Test-Time Training","abstract":"Existing deep learning models for Positron Emission Tomography (PET) image denoising often suffer from severe performance degradation under distribution shifts, fundamentally restricting their robust clinical deployment. This lack of generalization stems from the conventional paradigm of fixed-parameter models that cannot adapt to variations in test data (e.g., dose levels or scanner types) after training. To overcome this limitation and achieve robust generalization, we introduce U-TTT, a novel U-shaped model that integrates Test-Time Training (TTT) layers to dynamically adjust model parameters during inference through self-supervision, thereby adapting to the specific characteristics of each test instance. Furthermore, to comprehensively capture the complex degradations of 3D PET data, U-TTT features a dual-domain adaptation mechanism comprising a Spatial Test-Time Training (S-TTT) layer and a Frequency Test-Time Training (F-TTT) layer. 
The S-TTT layer captures and corrects spatial structural degradations, while the F-TTT layer suppresses global noise spectra and restores delicate high-frequency details. Extensive experiments demonstrate that U-TTT achieves state-of-the-art PET denoising performance and exhibits superior generalization under challenging distribution shifts, including both unseen dose levels and unseen scanners. Our code will be available at https://github.com/Yaziwel/U-TTT.","authors":[{"_id":"6a28dbd9e7d78ea7587e54b7","name":"Zhiwen Yang","hidden":false},{"_id":"6a28dbd9e7d78ea7587e54b8","name":"Jiayin Li","hidden":false},{"_id":"6a28dbd9e7d78ea7587e54b9","name":"Hao Lu","hidden":false},{"_id":"6a28dbd9e7d78ea7587e54ba","name":"Hui Zhang","hidden":false},{"_id":"6a28dbd9e7d78ea7587e54bb","name":"Zihua Wang","hidden":false},{"_id":"6a28dbd9e7d78ea7587e54bc","name":"Bingzheng Wei","hidden":false},{"_id":"6a28dbd9e7d78ea7587e54bd","name":"Yan Xu","hidden":false}],"published":"2026-06-09T00:00:00.000Z","updated":"2026-06-11T08:00:08.323Z","category":"computer_vision","source":"huggingface","original_source":"huggingface_api","url":"https://huggingface.co/papers/2606.11032","pdf_url":"","scraped_at":"2026-06-11T08:00:08.323Z","images":["https://arxiv.org/html/2606.11032/x1.png"],"analysis":{"introduction":"🚀 U-TTT: Towards Generalizable PET Image Denoising via Test-Time Training - Research analysis - Analysis temporarily unavailable due to processing error.","challenges":"🎯 Challenges information unavailable due to processing error.","innovations":"✨ Innovation details unavailable due to processing error.","experiments":"📊 Experimental results unavailable due to processing error.","insights":"🤔 Research insights unavailable due to processing error.","summary":"Abstract: Existing deep learning models for Positron Emission Tomography (PET) image denoising often suffer from severe performance degradation under distribution shifts, fundamentally restricting their robust clinical 
deployment. This lack of generalization stems from the conventional paradigm of fixed-param...","keywords":["deep learning"],"category":"computer_vision","relevance_score":5,"technical_depth":"unknown","analyzed_at":"2026-06-11T08:04:51.070Z","model":"fallback","error":true,"title_only_analysis":false},"views":0},{"id":"hf_2606.10740_v0kr21a0i","title":"When the Chain of Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models","abstract":"Failures in multi-turn reasoning models are largely invisible to terminal-score evaluation. A model can lock onto an unsafe stance early in a long dialogue, yet its final-turn refusal rate may appear indistinguishable from a robustly aligned baseline. To expose these hidden temporal dynamics, we propose a trace-level diagnostic - the CoT-Output 2x2 safety matrix. This framework labels every turn along two independent axes (internal reasoning and visible output), yielding four operationally defined failure cells: robust alignment, alignment faking, overt jailbreak, and a distinct failure mode we term context-injection failure (where the CoT maintains safe reasoning, but the visible output produces harm, highlighting a multi-turn manifestation of reasoning unfaithfulness). We evaluate three distilled reasoning targets against a fixed attacker across five oversight conditions, collecting 6750 turn-level observations on the Information-Hazard scenario. Our analysis reveals two reproducible vulnerabilities: an oversight paradox where explicit monitoring cues paradoxically increase alignment-faking rates rather than suppress them, and a context-injection failure where models lock onto unsafe external outputs despite safe internal states. 
We release the full dataset of multi-turn dialogues and CoT traces to support follow-up trace-diagnostic research.","authors":[{"_id":"6a28cef1e7d78ea7587e53f7","name":"Sai Kartheek Reddy Kasu","hidden":false},{"_id":"6a28cef1e7d78ea7587e53f8","name":"Nils Lukas","hidden":false},{"_id":"6a28cef1e7d78ea7587e53f9","name":"Samuele Poppi","hidden":false}],"published":"2026-06-09T11:50:28.000Z","updated":"2026-06-11T08:00:08.323Z","category":"natural_language_processing","source":"huggingface","original_source":"huggingface_api","url":"https://huggingface.co/papers/2606.10740","pdf_url":"","scraped_at":"2026-06-11T08:00:08.323Z","images":["https://arxiv.org/html/2606.10740/x1.png"],"analysis":{"introduction":"🚀 When the Chain of Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models - Research analysis - Analysis temporarily unavailable due to processing error.","challenges":"🎯 Challenges information unavailable due to processing error.","innovations":"✨ Innovation details unavailable due to processing error.","experiments":"📊 Experimental results unavailable due to processing error.","insights":"🤔 Research insights unavailable due to processing error.","summary":"Abstract: Failures in multi-turn reasoning models are largely invisible to terminal-score evaluation. A model can lock onto an unsafe stance early in a long dialogue, yet its final-turn refusal rate may appear indistinguishable from a robustly aligned baseline. To expose these hidden temporal dynamics, we pro...","keywords":[],"category":"natural_language_processing","relevance_score":5,"technical_depth":"unknown","analyzed_at":"2026-06-11T08:03:21.949Z","model":"fallback","error":true,"title_only_analysis":false},"views":0}],"pagination":{"current_page":1,"total_pages":11,"total_papers":109,"has_next":true,"has_prev":false},"filters":{"category":null,"search":null,"date":null}}