arXiv Digest — Saturday, June 6, 2026

25 papers across AI, ML, NLP, and CV from the last 24 hours.

Today's Synthesis

Three themes dominate today's arXiv batch: agents learning to sustain themselves over long horizons, training dynamics being re-examined beyond validation loss, and the field pushing harder on questions of how models know what they know.

The self-evolution signal is loudest. MLEvolve proposes an LLM multi-agent framework for automated ML algorithm discovery that maintains hierarchical memory across search branches. Unsupervised Skill Discovery tackles the same problem from a different angle — extracting reusable procedural skills for data analysis purely from unlabeled exploration. TokenMizer goes further still, proposing graph-structured session memory so LLMs don't silently lose relational structure when context windows overflow. Together these papers suggest the community is moving past one-shot prompting toward systems that accumulate, organize, and evolve their own capabilities.

Training optimization gets renewed attention with two papers on preconditioning. PC Layer reshapes weight singular-value spectra via low-degree polynomials; because the preconditioned weights merge back into the original architecture after training, it adds no inference overhead. Double Preconditioning (DoPr) tackles test-time feedback: the gap that opens when models trained on a one-step loss are deployed by rolling out their own predictions. Both suggest that standard training setups fall short in practice, whether through poorly conditioned weights or objectives that don't match deployment.

The standout paper is "LLM Self-Recognition: Steering and Retrieving Activation Signatures," which shows that models can be fingerprinted by steering the residual stream with a random sparse vector during generation, making self-recognition reliable even in low-entropy cases. It turns a theoretical interpretability observation into an actionable technique.

Emergent Language as an Approach to Conscious AI and Machine Unlearning via Token-Level Importance round out a batch that suggests the field is no longer satisfied with surface-level capability gains — the focus is shifting inward to how models train, remember, forget, and recognize themselves.

Papers by Category

Artificial Intelligence

Code2LoRA: Hypernetwork-Generated Adapters for Code Language Models under Software Evolution

Liliana Hotsko, Yinxi Li, Yuntian Deng · 2026-05-29

Code language models need repository-level context to resolve imports, APIs, and project conventions. Existing methods inject this knowledge as long inputs (retrieved through RAG or dependency analysis) or through per-repository fine-tuning and LoRA -- costly at repository scale and brittle to evolving codebases. We introduce Code2LoRA, a hypernetwork framework that generates repository-specific L...

Regret Minimization with Adaptive Opponents in Repeated Games

Mingyang Liu, Asuman Ozdaglar, Tiancheng Yu · 2026-05-29

In this paper, we study regret minimization in repeated games with adaptive opponents who can respond based on histories of play. The standard metric of external regret in online learning is known to fail to capture such adaptivity. To account for players' counterfactual reasoning, we introduce Repeated Policy Regret (RP-Regret), a game-theoretic metric that measures the differ...
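For readers unfamiliar with the baseline metric, the sketch below computes standard external regret for a toy repeated game: the payoff of the best fixed action in hindsight minus the payoff actually earned. The payoff function, the action sequences, and the function names are illustrative assumptions, not taken from the paper, and the RP-Regret metric itself is not reproduced here.

```python
def payoff(action: int, opponent_action: int) -> int:
    """Hypothetical payoff: 1 when the player matches the opponent, else 0."""
    return 1 if action == opponent_action else 0


def external_regret(actions, opponent_actions, num_actions=2):
    """Best fixed action in hindsight minus realized payoff.

    The opponent's recorded sequence is held fixed when evaluating each
    alternative action, which is the assumption an adaptive opponent violates.
    """
    realized = sum(payoff(a, b) for a, b in zip(actions, opponent_actions))
    best_fixed = max(
        sum(payoff(fixed, b) for b in opponent_actions)
        for fixed in range(num_actions)
    )
    return best_fixed - realized


# Toy play history (assumed for illustration only).
player = [0, 1, 0, 1]
opponent = [1, 1, 1, 0]
regret = external_regret(player, opponent)
print(regret)
```

Because the counterfactual replays the opponent's recorded moves unchanged, the number says nothing about how an adaptive opponent would have reacted to a different strategy.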

Operation-Guided Progressive Human-to-AI Text Transformation Benchmark for Multi-Granularity AI-Text Detection

Sondos Mahmoud Bsharat, Jiacheng Liu, Xiaohan Zhao · 2026-05-29

As AI writing assistants become increasingly integrated into real-world drafting and revision workflows, many documents are no longer purely human-written or AI-generated, but instead result from progressive human-AI co-editing. However, existing AI-text detection benchmarks largely focus on final outputs and provide limited understanding of how AI authorship signals emerge, accumulate, or disappe...

Self-Augmenting Retrieval for Diffusion Language Models

Paul Jünger, Justin Lovelace, Linxi Zhao · 2026-05-29

Discrete diffusion language models generate text by iteratively denoising an entire response in parallel. At each step, they predict tentative tokens for every masked position, committing the confident predictions to the output and discarding the unconfident ones. We show that the discarded tokens are in fact a useful lookahead signal for retrieval-augmented generation: even low-confidence tokens ...

MLEvolve: A Self-Evolving Framework for Automated Machine Learning Algorithm Discovery

Shangheng Du, Xiangchao Yan, Jinxin Shi · 2026-05-29

Large language model (LLM) agents are increasingly applied to long-horizon tasks such as scientific discovery and machine learning engineering (MLE), where sustained self-evolution becomes a key capability. However, existing MLE agents suffer from inter-branch information isolation, memoryless search, and lack of hierarchical control, which together hinder long-horizon optimization. We present MLE...

PC Layer: Polynomial Weight Preconditioning for Improving LLM Pre-Training

Senmiao Wang, Tiantian Fang, Haoran Zhang · 2026-05-29

We propose a preconditioning (PC) layer, a weight parameterization via polynomial preconditioner that ensures stable weight conditioning throughout LLM training. The PC module reshapes the singular-value spectrum of weight matrices via low-degree polynomial preconditioning. After training, the preconditioned weights can be merged back into the original architecture, incurring no inference overhead...
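As a rough illustration of how a low-degree polynomial can reshape a singular-value spectrum, the sketch below applies p(W) = a*W + b*W*W^T*W to a 2x2 matrix built from known singular values. This odd polynomial maps each singular value s to a*s + b*s^3 and leaves the singular vectors unchanged. The coefficients (1.5, -0.5), the matrix, and all function names are assumptions chosen for illustration; the paper's actual polynomial and parameterization are not shown in the excerpt.

```python
import math


def matmul(a, b):
    """Multiply two small dense matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(m):
    """Transpose a matrix given as a list of rows."""
    return [list(row) for row in zip(*m)]


def rotation(theta):
    """2x2 rotation matrix, used here as an orthogonal factor."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def from_svd(u, singular_values, v):
    """Assemble U * diag(s) * V^T for the 2x2 case."""
    diag = [[singular_values[0], 0.0], [0.0, singular_values[1]]]
    return matmul(matmul(u, diag), transpose(v))


def precondition(w, a=1.5, b=-0.5):
    """Illustrative odd polynomial p(W) = a*W + b*W*W^T*W (assumed form)."""
    cubic = matmul(matmul(w, transpose(w)), w)
    return [[a * w[i][j] + b * cubic[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]


u, v = rotation(0.3), rotation(1.1)
sigma = [1.2, 0.4]                      # toy spectrum, condition number 3.0
w = from_svd(u, sigma, v)

# Each singular value s should become 1.5*s - 0.5*s**3.
new_sigma = [1.5 * s - 0.5 * s ** 3 for s in sigma]
expected = from_svd(u, new_sigma, v)    # same singular vectors, new spectrum
actual = precondition(w)

max_diff = max(abs(actual[i][j] - expected[i][j])
               for i in range(2) for j in range(2))
print(sigma[0] / sigma[1], new_sigma[0] / new_sigma[1], max_diff)
```

In this toy case the ratio of largest to smallest singular value drops from 3.0 to roughly 1.65, which is the kind of spectral reshaping the abstract refers to, though not necessarily by the same polynomial.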

Benchmark Everything Everywhere All at Once

Shiyun Xiong, Dongming Wu, Peiwen Sun · 2026-05-29

Benchmarks are fundamental for evaluating and advancing LLMs and MLLMs by providing standardized and explicit measures of performance. However, their construction is labor-intensive and hard to reuse, raising concerns about sustainability and scalability. Moreover, existing benchmarks often quickly reach performance saturation after their release, resulting in insufficient discrimination among sta...

In-Context Multiple Instance Learning

Alexander Möllers, Marvin Sextro, Julius Hense · 2026-05-29

Multiple Instance Learning (MIL) addresses problems where supervision is available at the level of bags of instances and has been successfully applied in fields ranging from computational pathology to satellite imagery. Nevertheless, existing algorithms struggle in the low-label regime that characterizes many real-world applications. Flexible models overfit and rigid ones fail to adapt to the task...

RiskFlow: Fast and Faithful Safety-Critical Traffic Scenario Generation

Qi Lan, Yining Tang, Yu Shen · 2026-05-29

Safety-critical traffic scenario generation is essential for evaluating autonomous driving systems under rare but high-risk interactions. Existing diffusion-based methods offer strong controllability in closed-loop generation, but their iterative denoising process is computationally expensive and may accumulate sampling and guidance errors over long rollouts, causing unrealistic motion artifacts s...

Double Preconditioning (DoPr): Optimization for Test-Time Performance, not Validation Loss

Thomas T. Zhang, Alok Shah, Yifei Zhang · 2026-05-29

Many modern applications of deep learning involve training a neural network via a one-step prediction loss (e.g., $L^2$ regression, cross-entropy), but deploy the network by rolling out along its own predictions. Key examples include autoregressive language modeling, flow-based generative modeling, and robot policy learning. It is well-documented that these settings induce a phenomenon we call tes...
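The mismatch between one-step training loss and rollout behavior can be seen in a toy scalar system. The sketch below compares a hypothetical learned model's one-step error with its error after it is rolled out on its own predictions. The dynamics, the coefficient values, and the function names are assumptions for illustration and do not come from the paper.

```python
def one_step_error(true_coeff, learned_coeff, state):
    """Error when the model predicts one step from a ground-truth state."""
    return abs(learned_coeff * state - true_coeff * state)


def rollout_error(true_coeff, learned_coeff, state, steps):
    """Error after the model is fed its own predictions for `steps` steps."""
    true_state, predicted_state = state, state
    for _ in range(steps):
        true_state = true_coeff * true_state
        predicted_state = learned_coeff * predicted_state
    return abs(predicted_state - true_state)


TRUE_COEFF = 1.0      # hypothetical ground-truth dynamics: x_{t+1} = x_t
LEARNED_COEFF = 1.05  # hypothetical model with a small one-step error

single = one_step_error(TRUE_COEFF, LEARNED_COEFF, state=1.0)
rolled = rollout_error(TRUE_COEFF, LEARNED_COEFF, state=1.0, steps=20)
print(single, rolled)
```

Here a one-step error of 0.05 grows to about 1.65 after 20 steps, since 1.05 raised to the 20th power is about 2.65. A model that looks accurate under a one-step loss can therefore drift badly once its predictions are fed back in.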

Unsupervised Skill Discovery for Agentic Data Analysis

Zhisong Qiu, Kangqi Song, Shengwei Tang · 2026-05-29

Inference-time skill augmentation provides a lightweight way to improve data-analytic agents by injecting reusable procedural knowledge without updating model parameters. However, discovering effective skills for data analysis remains challenging, as reliable supervision is expensive and success criteria vary across analytical formats. This raises the key question of how to discover reusable data-...

Risk Assessment of Autonomous Driving: Integrating Technical Failures, Ethical Dilemmas, and Policy Frameworks

Boyi Chen, Shengqin Chu, Zicheng Wang · 2026-05-29

Autonomous driving technology has the potential to reduce the large number of road traffic accidents caused by human error each year, but it also brings new types of risks that need to be evaluated from the aspects of technology, ethics and regulations. Based on public crash data from the National Highway Traffic Safety Administration (NHTSA), disengagement reports from the California Department o...

HomeWorld: A Unified Floorplan-to-Furnished Framework for Generating Controllable, Densely Interactive Whole-Home Scenes

Wenbo Li, Xiaoliang Ju, Zipeng Qin · 2026-05-29

Indoor scene generation is crucial for robot simulation and modern interior design. However, complex layouts together with scarce 3D scene data make learning-based generation challenging. Existing methods often rely on hand-crafted rules or focus on isolated sub-tasks (e.g., floorplan synthesis or single-room furnishing), producing whole-home scenes that lack global coherence, realism, and simulat...

Humans' ALMANAC: A Human Collaboration Dataset of Action-Level Mental Model Annotations for Agent Collaboration

Jiaju Chen, Yuxuan Lu, Jiayi Su · 2026-05-29

Recent advances in LLM agents have enabled complex cognitive capabilities, such as multi-step reasoning, planning, and tool use, that increasingly position these agents as human collaborators. Effective collaboration, however, requires collaborators to continuously maintain and align mental models of their own reasoning, partners' intentions, and shared goals during the collaborative process. Today...

Emergent Language as an Approach to Conscious AI

Zengqing Wu, Chuan Xiao · 2026-05-29

The question of whether artificial systems can be conscious remains open, in part because existing approaches either evaluate systems against theory-derived checklists (discriminative) or engineer consciousness-inspired modules directly (architectural); both leave open whether observed structures are artifacts of human language priors. We propose a generative methodology: emergent language (EL) in...

EasyLens: A Training-Free Plug-and-Play Subtle-Lesion Representation Amplifier for Medical Vision-Language Models

Qiwei Zeng, Hao Wang, Jinghao Lin · 2026-05-29

Medical vision-language models (VLMs) have shown increasing potential for clinical image interpretation, including lesion detection and report generation. However, their practical utility remains limited by insufficient sensitivity to subtle lesions, whose visual evidence is often sparse, low-contrast, and embedded within complex anatomical context. As local visual tokens are aggregated, these wea...

Rethinking Infrastructure Inspection as Image Difference Classification: A Traffic Sign Case Study

Ching Yau Fergus Mok, Lavindra de Silva, Varun Kumar Reja · 2026-05-29

Digital twins (DTs) allow the digitalization of road infrastructure inspection, though this is hindered by limited annotated data. This work exploits the relational nature of continuous asset condition monitoring to reformulate image-based defect detection as image difference classification (IDC) to reduce data reliance. This was evaluated in a case study on low-resource traffic sign inspection wi...

LatentWave: JEPA Pretraining for Wireless Foundation Models

Ahmed Mohamed, Ahmed Aboulfotouh, Hatem Abou-Zeid · 2026-05-29

Wireless foundation models have emerged as a promising alternative to building separate models for each wireless task. However, existing approaches rely on masked input reconstruction, which can bias representations toward low-level signal details. In this paper, we propose LatentWave, a wireless foundation model pretrained using a Joint-Embedding Predictive Architecture (JEPA) on diverse wireless...

An Infectious Disease Spread Simulation Based on Large Language Model Decision Making

Yonchanok Khaokaew, Ruochen Kong, Andreas Zufle · 2026-05-29

Modelling individual decision-making during infectious disease outbreaks is crucial for understanding behavioural dynamics and informing effective public health interventions. Prior work has shown that large language models can simulate realistic human behaviour by generating agent decisions based on demographic prompts and situational context. We build on this foundation with a spatially grounded...

F3-Tokenizer: Taming Audio Autoencoder Latents for Understanding and Generation

Dinghao Zhou, Xingchen Song, Di Wu · 2026-05-29

Continuous audio autoencoders reconstruct waveforms well but often produce latents with weak structure for understanding, while self-supervised audio encoders capture semantics but are not directly decodable. This mismatch complicates a single audio tokenizer that must support both understanding and generation. We adapt continuous autoencoder latents to this setting with two components: a noise-re...

Where Should Knowledge Enter? A Layered Framework for Knowledge Infusion in Multimodal Iterative Generative Mo

Renjith Prasad, Chathurangi Shyalika, Anushka Pawar · 2026-05-29

Multimodal generative models produce fluent outputs but remain unreliable when generation must respect structured, domain-specific, or safety-critical knowledge. Existing methods incorporate knowledge through mechanisms such as prompt augmentation, guidance, latent editing, or fine-tuning, yet they are typically categorized by technique rather than by the component of the generative process they m...

Boosting Brain-to-Image Decoding with TRIBE v2 Data Augmentation

Yohann Benchetrit, Marlène Careil, Simon Dahan · 2026-05-29

Brain decoding is limited by the availability of labeled neural data, and remains challenging in low-data regimes. To address this issue, we investigate whether and when brain decoding can be boosted by augmenting small fMRI datasets with synthetic data generated by a pretrained model of fMRI responses to stimuli. We use TRIBE v2, a large encoding model pretrained on more than 1000 hours of fMRI r...

TokenMizer: Graph-Structured Session Memory for Long-Horizon LLM Context Management

Shweta Mishra · 2026-05-29

Large language model (LLM) deployments for long-horizon tasks face a fundamental constraint: context windows are finite while productive work sessions are not. When history exceeds the Maximum Effective Context Window (MECW), critical structured information - architectural decisions, task transitions, file histories - is silently discarded. Existing mitigations treat history as flat text, destroyi...

Bridging Domain Expertise and Generalization for Performance Estimation

Shuxuan Li, Zhilin Zhao, Quyu Kong · 2026-05-29

Performance estimation under distribution shift aims to predict how a model behaves on an unlabeled test set whose distribution differs from the training data, a scenario that requires reliable indicators that can faithfully reflect model behavior without ground-truth labels. Existing approaches rely solely on the outputs of the given model whose biases are amplified once the distribution shifts, ...

Subspace-Aware Sparse Autoencoders for Effective Mechanistic Interpretability

Seyed Arshan Dalili, Mehrdad Mahdavi · 2026-05-29

Sparse Autoencoders (SAEs) are widely used for mechanistic interpretability in large language models, yet their formulation assigns each latent feature a single decoder direction, implicitly assuming features to be one-dimensional. We show that this assumption mismatches with the multi-dimensional structure of model features, provably inducing feature splitting through two distinct mechanisms. Geo...
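For context, the sketch below shows a standard sparse autoencoder forward pass in which each latent feature contributes along exactly one decoder direction, the formulation the abstract describes. The weights, dimensions, and function names are hypothetical toy values; the paper's subspace-aware variant is not shown in the excerpt and is not attempted here.

```python
def relu(x):
    """Rectified linear unit on a scalar."""
    return x if x > 0.0 else 0.0


def encode(x, enc_weights, enc_bias):
    """Compute one non-negative activation per latent feature."""
    return [relu(sum(w_i * x_i for w_i, x_i in zip(row, x)) + b)
            for row, b in zip(enc_weights, enc_bias)]


def decode(latents, decoder_directions, dec_bias):
    """Reconstruct as bias plus sum of activation * single decoder direction."""
    out = list(dec_bias)
    for activation, direction in zip(latents, decoder_directions):
        for j, d_j in enumerate(direction):
            out[j] += activation * d_j
    return out


# Hypothetical toy SAE: 2-dimensional input, 3 latent features.
ENC_WEIGHTS = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
ENC_BIAS = [0.0, 0.0, 0.0]
DECODER_DIRECTIONS = [[1.0, 0.0], [0.0, 1.0], [-0.5, -0.5]]
DEC_BIAS = [0.0, 0.0]

x = [2.0, -1.0]
latents = encode(x, ENC_WEIGHTS, ENC_BIAS)
reconstruction = decode(latents, DECODER_DIRECTIONS, DEC_BIAS)
print(latents, reconstruction)
```

Each entry of `latents` scales one fixed vector in `DECODER_DIRECTIONS`, so a feature that occupies a multi-dimensional subspace would have to be spread across several latents. That is the mismatch the abstract identifies.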

This digest is generated automatically from arXiv submissions. Not affiliated with arXiv or Cornell University.