40 papers across AI, ML, NLP, and CV from the last 24 hours.
Three themes run through today's batch: the push toward much longer context windows, the maturation of agentic frameworks beyond single-turn demos, and a growing appetite for peering inside model internals. Long-video and extended-temporal reasoning papers cluster together because the core bottleneck — token explosion and attention dilution over hours of input — is finally being tackled with architecture-level solutions rather than brute-force scaling. MemDreamer splits perception from reasoning and streams video into a hierarchical graph memory, while a separate line of work compresses tokens in autonomous driving pipelines by aligning them with downstream planning objectives rather than generic saliency heuristics.
The agent papers are notable for their scope. Agentopia extends LLM-powered social simulation from days to months, letting agents accumulate experience at scales that could actually support learning rather than just task completion. Socratic-SWE closes the loop by reusing an agent's own debugging traces to generate new training tasks — self-play for software engineering.
Standout paper: "A Comprehensive Anatomy of Human and DeepSeek-R1 LLM Mathematical Reasoning" exhaustively labels 10,247 reasoning steps across AIME 2025 problems and finds a structural gap: humans backtrack and reflect far more often, while the model stacks inferences without revisiting earlier decisions. It reframes the "Aha moment" debate from whether models feel insight to whether their reasoning graphs have the right topology.
Conspicuously absent are papers on reinforcement learning from human feedback, and the optimization batch leans classical (decentralized SGD, path kernel interpolation) rather than LM-focused. Taken together, the field is shifting from "make it bigger" toward "make it work over long horizons, with real self-improvement loops, and with auditable internals."
Luca Avena, Gianmarco Bet, Bernardo Busoni · 2026-06-05
Controlled benchmarking of 8 state-of-the-art models on discrete probability problems with standard and counterintuitive exercises, testing with and without Chain-of-Thought. Models average 0.96 accuracy on standard problems but drop to 0.59 on counterintuitive ones that trigger heuristic reasoning.
Xintao Wang, Sirui Zheng, Hongqiu Wu · 2026-06-05
Extends LLM-powered agent society simulation from a scale of days to months, studying whether agents can learn from simulated social experience to better understand and replicate human behavior through long-term growth.
Jeremy Yang, Kate Zyskowski, Noah Yonack · 2026-06-05
Using production data from Perplexity's Search and Computer products, studies the transition from conversational assistants to autonomous agents that execute tasks end-to-end, with three key empirical findings on acceleration and scope.
Jiayu Wang, Weijiang Lv, Bowen Fu · 2026-06-05
Evaluates frontier agents across the research lifecycle, revealing significant limitations in field sensitivity, research ethics, and nuanced scientific judgment despite proficiency in coding and autonomous experiment execution.
Fuqiang Wang, Song Tan, Zheng Guo · 2026-06-05
Framework organizing scientific paper recommendation into three coupled longitudinal stages — Profiling, Recommending, and Adapting — handling daily paper streams where interests shift and feedback accumulates.
Songhao Wu, Zhongxin Chen, Yuxuan Liu · 2026-06-05
Identifies why LLMs struggle as off-the-shelf embedding models: text embeddings align with frequent but uninformative tokens when projected onto vocabulary space. Proposes using the unembedding matrix as a feature lens to correct this.
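The projection step this summary refers to can be sketched as follows. This is a minimal illustration, not the paper's method: it only shows what "projecting an embedding onto vocabulary space" means (multiplying by the unembedding matrix and reading off the highest-scoring tokens). The function name `top_vocab_tokens`, the toy vocabulary, and all numbers are hypothetical, and the correction the paper proposes is not reproduced here.

```python
import numpy as np


def top_vocab_tokens(embedding, unembedding, vocab, k=3):
    """Project an embedding onto vocabulary space and return the k best tokens.

    embedding:   shape (d,)   text embedding taken from the model
    unembedding: shape (V, d) output (unembedding) matrix of the model
    vocab:       list of V token strings
    """
    logits = unembedding @ embedding      # one score per vocabulary token
    order = np.argsort(-logits)[:k]       # indices of the k largest scores
    return [(vocab[i], float(logits[i])) for i in order]


# Hypothetical toy setup: two frequent function words, two content words.
vocab = ["the", "of", "protein", "kinase"]
unembedding = np.array([
    [1.0, 0.0, 0.0],
    [0.8, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
# An embedding dominated by the direction shared by the frequent tokens.
embedding = np.array([2.0, 0.5, 0.3])

result = top_vocab_tokens(embedding, unembedding, vocab, k=2)
print(result)
```

In this toy case the two frequent tokens score highest, which is the kind of alignment with uninformative tokens the summary describes.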
Daniel Vennemeyer, Phan Anh Duong, Meryl Ye · 2026-06-05
Argues sycophantic praise is a distinct alignment problem beyond excessive agreement. Introduces a parameterized framework measuring whether praise is excessive relative to contribution quality and expected user ability.
Yang Zhang, Xiao Fei, Amr Mohamed · 2026-06-05
Investigates whether local cultural knowledge is better accessed through English or the local language in LLMs, using masked evaluation to separate language proficiency from language-conditioned knowledge access.
Meixi Song, Dizhe Zhang, Hao Ren · 2026-06-05
Extends SHARP for universal monocular rendering across a continuum of camera systems — conventional perspective to wide-FOV, fisheye, and omnidirectional panoramic — by aligning images in a unified omnidirectional latent space.
Hanhui Wang, Yiming Xie, Haiwen Feng · 2026-06-05
Introduces StreamForce, a causal and unified streaming video generation model that responds instantly to continuous, time-varying force inputs — local or global — without requiring separate models per force type.
Cong Chen, Guo Gan, Kaixiang Ji · 2026-06-05
Decouples perception and reasoning for hours-long video understanding by incrementally constructing a Hierarchical Graph Memory with three-tier semantic abstraction, shifting the task into an agentic exploration process.
Haoyuan Li, Zhengdong Hu, Jun Wang · 2026-06-05
Shows that MLLM agents applying uniform tool-use strategies across diverse 3D scenes underperform, and proposes evolving scene-aware skills so that agents select tools according to specific scene and task characteristics.
Pietro Bonazzi, Julian Moosmann, Ahmet Celik · 2026-06-05
Open-source smart glasses platform for rapid prototyping with event-based vision and embedded ML at scale, using a modular FPC interposer design supporting both event-based and frame-based sensors within power and compute constraints.
Fatema Siddika, Md Anwar Hossen, Tanwi Mallick · 2026-06-05
SETA framework resolves the plasticity-stability conflict in continual LLM learning through adaptive sparse subspace routing, distinguishing specific task knowledge from shared capabilities that existing methods treat uniformly.
Jin Guo, Roy Y. He, Jean-Michel Morel · 2026-06-05
Extends Domingos' 2020 first-order interpolation formula along optimization paths to second-order characterizations, valid for models trained with batch-based stochastic gradient descent.
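For context, the first-order result being extended can be written as below. This is the statement from Domingos (2020) as commonly cited, in the gradient-flow limit; the notation ($f_w$, $c(t)$, $y_i^*$) is chosen here for illustration and may differ from the paper's, and the second-order characterization itself is not reproduced.

```latex
% Model f_w learned by gradient descent on loss \sum_i L(y_i^*, y_i),
% in the limit of vanishing learning rate:
\[
  y \;=\; \sum_{i=1}^{m} a_i \, K^{p}(x, x_i) \;+\; b ,
  \qquad
  K^{p}(x, x') \;=\; \int_{c(t)} \nabla_w f_{w(t)}(x) \cdot \nabla_w f_{w(t)}(x') \, dt ,
\]
% where c(t) is the path taken by the parameters during training,
% a_i is the average of -L'(y_i^*, y_i) along the path, weighted by the
% tangent kernel, and b is the prediction of the initial model.
```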
Hanqiao Yu, Shusen Yang, Xuebin Ren · 2026-06-05
Deflex: end-to-end AI method extracting multiscale formulas with potentially different forms — invariants and distributions — from complex systems using neural-guided lambda calculus, going beyond single-scale symbolic regression.
Simon Schug · 2026-06-05
Demonstrates that shrinking each expert to a single neuron and selecting a tiny fraction of many available neurons improves compute efficiency and interpretability — the key is removing the nonlinearity entirely from each expert.
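A minimal sketch of the idea in this summary, under stated assumptions: each "expert" is a single neuron with an input vector and an output vector, its activation is used linearly (no nonlinearity), and only a few neurons contribute per input. The selection rule used here (top-k by absolute activation) and the name `sparse_linear_neuron_layer` are hypothetical choices for illustration; the summary does not say how the paper routes inputs.

```python
import numpy as np


def sparse_linear_neuron_layer(x, w_in, w_out, k):
    """Mix the k most active single-neuron experts, with no nonlinearity.

    x:     shape (d,)        input vector
    w_in:  shape (n, d)      one input weight vector per neuron
    w_out: shape (n, d_out)  one output weight vector per neuron
    k:     number of neurons to keep (1 <= k <= n)
    """
    acts = w_in @ x                                   # linear activations
    # Assumed routing rule: keep the k neurons with the largest |activation|.
    idx = np.argpartition(-np.abs(acts), k - 1)[:k]
    y = acts[idx] @ w_out[idx]                        # weighted sum of outputs
    return y, np.sort(idx)


# Hypothetical toy weights: 4 neurons, 2-d input, 2-d output.
w_in = np.array([[1.0, 0.0], [0.0, 1.0], [0.1, 0.1], [-2.0, 0.0]])
w_out = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
x = np.array([1.0, 0.5])

y, selected = sparse_linear_neuron_layer(x, w_in, w_out, k=2)
print(selected, y)
```

Only the selected rows of `w_out` are touched, so the cost of the output step scales with `k` rather than with the total number of neurons, and each selected neuron's contribution can be read off directly as its activation times its output vector.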
Yuxiang Chen, Jun Wang · 2026-06-05
Exhaustively annotates 10,247 reasoning steps across all 30 AIME 2025 problems into five functional categories. Finds humans backtrack and reflect far more, while the model stacks inferences without revisiting earlier decisions.
Chaitanya Shinde, Hadi Hajieghrary, Paul Schmitt · 2026-06-05
Decomposes the Controllability placeholder in the ISO 26262 functional safety standard into two auditable evidence dimensions, Transferability and Predictability, to adapt human-driven vehicle safety principles for autonomous vehicles.
Lei Huang · 2026-06-05
CascadeNet recovers hidden influence networks behind dynamic cascades — product adoption, disease spread, financial distress — without requiring a specified diffusion model, using a debiased Jacobian-based ML framework.
This digest is generated automatically from arXiv submissions. Not affiliated with arXiv or Cornell University.