Below are a subset of the papers that I scribbled notes for on a whim. Some of these are incomplete, and others may be missing parts. Mainly here to keep myself accountable for reading papers. :)

9/21/2025~9/27/2025
SIM-CoT: Supervised Implicit Chain-of-Thought (Wei et al., 2025)

Studies the problem of closing the performance gap between explicit and implicit CoT reasoning. Identifies a “latent instability issue” in implicit CoT methods like Coconut, where scaling the number of latent reasoning tokens causes performance to collapse because of (i) semantic homogenization, where latent representations collapse and become nearly identical, and (ii) geometric drift, where the collapsed latents drift away from the model's token embedding space and become OOD.

Method

Proposes SIM-CoT that introduces step-level supervision for implicit CoT, where an auxiliary decoder is attached to the backbone LLM and both the decoder and LLM are jointly trained. For each latent token generated by the backbone, the decoder is trained to autoregressively reconstruct the corresponding explicit reasoning step (e.g., for word arithmetic problems, given one intermediate latent step, the decoder is trained to generate “6 eggs + 12 eggs = 18 eggs”). Auxiliary decoder is removed at test time but can also be used to interpret latent steps.
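A minimal sketch of how this step-level supervision could be wired up (my own toy modules and names, not the authors' implementation): the backbone emits latent thoughts one at a time, and an auxiliary decoder is trained with a standard LM loss to reconstruct the explicit reasoning step that each latent should encode.

```python
import torch
import torch.nn as nn

VOCAB, D_MODEL, N_LATENT = 1000, 64, 4
backbone = nn.GRU(D_MODEL, D_MODEL, batch_first=True)          # stand-in for the backbone LLM
embed = nn.Embedding(VOCAB, D_MODEL)
aux_decoder = nn.TransformerDecoderLayer(D_MODEL, nhead=4, batch_first=True)
lm_head = nn.Linear(D_MODEL, VOCAB)

def sim_cot_aux_loss(question_ids, step_ids_per_latent):
    """question_ids: (B, Lq); step_ids_per_latent: N_LATENT tensors of shape (B, Ls)."""
    _, state = backbone(embed(question_ids))
    loss = 0.0
    for step_ids in step_ids_per_latent:
        latent = state[-1].unsqueeze(1)                          # next latent "thought", (B, 1, D)
        tgt = embed(step_ids[:, :-1])                            # teacher-forced explicit step
        causal = torch.triu(torch.full((tgt.size(1), tgt.size(1)), float("-inf")), diagonal=1)
        dec = aux_decoder(tgt, memory=latent, tgt_mask=causal)   # decoder conditions on the latent
        logits = lm_head(dec)
        loss = loss + nn.functional.cross_entropy(
            logits.reshape(-1, VOCAB), step_ids[:, 1:].reshape(-1))
        _, state = backbone(latent, state)                       # feed latent back (Coconut-style)
    return loss   # added to the usual answer loss; aux_decoder is dropped at inference

loss = sim_cot_aux_loss(torch.randint(0, VOCAB, (2, 16)),
                        [torch.randint(0, VOCAB, (2, 8)) for _ in range(N_LATENT)])
loss.backward()
```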

Results

SIM-CoT improves accuracy over Coconut by +8.2 points on GPT-2 and CODI (distillation-based method for implicit CoT) by +3.0 points on LLaMA-3.1 8B.

Also beats the explicit CoT baseline on GPT-2 by +2.1 points with 2.3× token efficiency. On larger models, it lags behind explicit CoT but remains comparable while maintaining a significant speed advantage. Generalizes to other related benchmarks (SVAMP, GSM-Hard). Ablations show a moderate decoder (1B matched to a 1B backbone) works best, while larger decoders (3B/8B) slightly degrade performance.

Questions
  • Feels a bit that SIM-CoT defeats the purpose of Coconut by supervising the latent representation to directly correspond to (a sequence of) explicit tokens. One key interest in implicit CoT is not just to increase efficiency but also potentially boost model capabilities.
  • Does this work for bigger LLMs and more interesting tasks beyond math? → Likely not; the supervision signal is too constraining and requires annotated data to train on. In which case, this may not be very interesting, because one limitation of Coconut is that it does not work at all for any interesting task beyond solving simple math problems.
  • Does the auxiliary decoder work on unobserved OOD tasks? Likely not.
  • Using a decoder that is the same size as the base model feels a bit iffy.
Metacognitive Reuse: Turning Recurring LLM Reasoning Into Concise Behaviors (Didolkar et al., 2025)

Studies the problem of inefficient reasoning in capable LLMs, in the sense that generated long CoTs re-derive the same reasoning patterns from scratch for every new problem. Considers three paradigms that address this:

  • BCI (Behavior-conditioned inference): A pre-compiled behavior_list is created from one dataset (e.g., AIME 2022-23) and relevant behaviors from that list are retrieved by a stronger model (“metacogexpert”) and provided in-context to guide the model when solving a new problem from a different dataset (e.g., AIME 2025). (See the sketch after this list.)
  • Behavior-guided self-improvement: The model generates an initial solution and acts as its own metacogexpert to extract a list of behaviors, then uses this behavior list to generate a solution to the same problem.
  • Behavior-conditioned SFT: A powerful model generates reasoning traces using BCI, and a smaller student model is fine-tuned on the (Question, Concise Response) pairs. Goal is to distill the efficient reasoning patterns (that use the behaviors) into the student model's parameters.
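A minimal sketch of the BCI paradigm from the first bullet above (prompt wording, the retrieval heuristic, and all names are my own stand-ins, not the paper's recipe):

```python
from dataclasses import dataclass

@dataclass
class Behavior:
    name: str
    instruction: str  # concise, reusable reasoning pattern

BEHAVIOR_LIST = [
    Behavior("complementary_counting", "Count the complement when direct counting is messy."),
    Behavior("check_small_cases", "Verify a candidate formula on n = 1, 2, 3 before generalizing."),
]

def retrieve_behaviors(problem: str, behaviors, k: int = 2):
    # Stand-in for the metacog-expert retrieval step: naive keyword overlap.
    scored = sorted(behaviors, key=lambda b: -sum(w in problem.lower() for w in b.name.split("_")))
    return scored[:k]

def build_bci_prompt(problem: str) -> str:
    picked = retrieve_behaviors(problem, BEHAVIOR_LIST)
    hints = "\n".join(f"- {b.name}: {b.instruction}" for b in picked)
    return (
        "You may use the following behaviors when helpful; cite them by name.\n"
        f"{hints}\n\nProblem: {problem}\nSolve concisely."
    )

print(build_bci_prompt("How many 5-digit numbers contain at least one 7?"))
```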

Results

BCI reduces reasoning trace length by up to 46% while surpassing baseline accuracy on MATH and AIME. Same results hold when scaling to higher token budgets.

Behavior-guided self-improvement is more effective than naive self-correction with \(\sim 22.2\%\) gain. The performance gap between the behavior-guided method and self-correction grows with token budget, hence adding behaviors helps the model more than simply increasing computation.

Behavior-conditioned SFT also performed much better than vanilla SFT despite similar accuracy of ground truth reasoning traces (44.4% vs. 42.7%).

Questions
  • For Behavior-conditioned SFT, does the generated reasoning trace contain statements such as “I will use the skill of \(X\)” or something similar? If not, it would be interesting to understand why simply observing reasoning traces that were generated with behaviors provided is still sufficient to infer the content of the behavior. → Yes. See Table 1. Would be curious to know the ratio that contains explicit mentions of behaviors (likely very high).
From \(f(x)\) and \(g(x)\) to \(f(g(x))\): LLMs Learn New Skills in RL by Composing Old Ones (Yuan et al., 2025)
WIP
ParaThinker: Native Parallel Thinking as a New Paradigm to Scale LLM Test-time Compute (Wen et al., 2025)
WIP
Parallel-R1: Towards Parallel Thinking via Reinforcement Learning (Zheng et al., 2025)
WIP
9/7/2025~9/14/2025
DroidSpeak: KV Cache Sharing for Cross-LLM Communication and Multi-LLM Serving (Liu et al., 2024)

Studies the feasibility of KV-cache reuse/transfer between two models that are fine-tuned from the same base model. Shows that ~10% of layers are critical for quality preservation when reusing another model's cache. Proposed approach (DroidSpeak) achieves ~4\(\times\) throughput and 3.1\(\times\) faster prefill with negligible quality loss across QA, summarization, and coding.

Method

Compute “critical layers” by testing each layer in isolation: for layer \(\ell\), reuse only that layer's KV from the sender model and recompute all other layers on the receiver model, then measure the drop from the receiver's full-prefill baseline. Simply recomputing only the critical layers when they are scattered (e.g., 16-18, 20, 25-27) creates multiple reuse→recompute transitions; each transition introduces OOD residual/embedding inputs for the next receiver layer, requires loading the sender's E cache (which is large and expensive), and injects mismatch error that propagates. To avoid this, instead recompute a contiguous block that covers the critical layers (e.g., recompute 16-27 as one block in the previous example) and reuse KV elsewhere. Shows that this “critical” designation is stable enough to profile once and reuse (i.e., the layers that are critical are similar across prompts).
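A rough sketch of the profiling logic described above (my own pseudocode with placeholder quality numbers, not the DroidSpeak implementation):

```python
def profile_critical_layers(run_with_reused_layer, baseline_quality, n_layers, threshold=0.10):
    """run_with_reused_layer(l) -> quality when only layer l's KV is reused from the sender."""
    critical = []
    for layer in range(n_layers):
        drop = baseline_quality - run_with_reused_layer(layer)
        if drop > threshold * baseline_quality:
            critical.append(layer)
    return critical

def contiguous_recompute_block(critical, n_layers):
    """Cover scattered critical layers (e.g., 16-18, 20, 25-27) with one block (16-27)."""
    if not critical:
        return None
    return range(min(critical), max(critical) + 1)

# Toy usage with a fake per-layer quality function.
fake_quality = lambda layer: 0.80 if layer in {16, 17, 18, 20, 25, 26, 27} else 0.92
critical = profile_critical_layers(fake_quality, baseline_quality=0.93, n_layers=32)
print(critical, list(contiguous_recompute_block(critical, 32)))
```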

Results

Across eight model pairs, ~11% of layers are marked critical using a threshold like >10% F1 drop versus the receiver's baseline. Variance across inputs is low for non-critical layers and concentrated in the most critical ones, which is why a small training slice (e.g., 50 HotpotQA contexts) suffices to learn the pattern for a given pair.

Empirically, prefill latency drops by 1.7-3.1\(\times\) at near-baseline quality. TTFT, TBT, and overall E2E latency also improve and yield ~4\(\times\) throughput at a fixed SLO under Poisson arrivals. Naive full KV reuse is faster upfront but suffers large quality losses. Also experimentally shown to apply beyond standard long-context QA: reports gains on summarization, coding, multi-agent coding workflows, and MoE models (reuse still helps because KV lives in attention while expert routing happens later). Profiles learned on one dataset also transfer well to others from the same pair (due to stability as mentioned before), with average quality differences around a couple of points on matched Pareto configurations.

Critique Fine-Tuning: Learning to Critique is More Effective than Learning to Imitate (Wang et al., 2025)

Proposes the idea that further SFT on target labels for capable LLMs already extensively pre-trained on reasoning corpora can be undesirable (and sometimes degrade performance). Considers shifting the training paradigm from “imitation” (maximize probability of generating a correct answer \(y\) given query \(x\), i.e., \(P(y \mid x)\)) to “critique” (maximize probability of generating a critique \(c\) given query \(x\) and noisy response \(y\), i.e., \(P(c \mid [x; y])\)); underlying premise is that the latent capabilities required to critique (error detection, verification, refinement) transfer to stronger answer generation. Curates critique-focused datasets by modifying WebInstruct, MetaMathQA, and NuminaMath using GPT-4o and GPT-4o-mini to synthesize critiques of original responses. Trains strong 7B non-instruction-tuned bases (DeepSeek-Math-7B, Qwen2.5-7B, Qwen2.5-Math-7B) with a simple LM objective on critiques.
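A minimal sketch (my own, assuming a HuggingFace-style causal LM and a toy tokenizer) of how the CFT objective differs from SFT only in what gets supervised: the LM loss is applied to the critique tokens, conditioned on the query and the noisy response:

```python
import torch

class ToyTokenizer:
    """Whitespace tokenizer stand-in so the sketch runs end to end."""
    def __init__(self):
        self.vocab, self.eos_token_id = {}, 0
    def encode(self, text):
        return [self.vocab.setdefault(w, len(self.vocab) + 1) for w in text.split()]

def build_cft_example(tokenizer, query, noisy_response, critique, ignore_index=-100):
    prompt_ids = tokenizer.encode(f"Query: {query} Response: {noisy_response} Critique:")
    critique_ids = tokenizer.encode(critique) + [tokenizer.eos_token_id]
    input_ids = torch.tensor(prompt_ids + critique_ids)
    labels = input_ids.clone()
    labels[: len(prompt_ids)] = ignore_index   # no loss on the query + noisy response
    return input_ids, labels

ids, labels = build_cft_example(ToyTokenizer(), "2+2?", "5", "Incorrect: 2+2=4, so the answer is 4.")
# With a real model: model(ids.unsqueeze(0), labels=labels.unsqueeze(0)).loss.backward()
print(ids.shape, (labels != -100).sum().item(), "supervised tokens")
```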

Results

Across six math benchmarks (MATH, Minerva-Math, GSM8K, OlympiadBench, AIME24, AMC23), CFT beats SFT by +4-10 absolute points on average for all three bases. Baselines include WebInstruct-SFT, WebInstruct-verified-SFT, and WebInstruct-GPT-4o-SFT. Notably, CFT is data- and compute-efficient and matches larger SFT systems (e.g., AceMath, Qwen2.5-Math-Instruct, both trained on >2M samples) and matches the average performance of SimpleRL (an open DeepSeek-R1 replication) that uses ~140\(\times\) the compute. Interestingly, CFT-trained models also show some benefits OOD, outperforming officially instruction-tuned SFT models on general instruction-following benchmarks (MT-Bench, IFEval).

Ablations
  • CFT is robust to dataset/source of noisy responses (WebInstruct performs best under CFT even with higher noise).
  • Stronger teachers help (GPT-4o-1120 > GPT-4o-mini) but using mini still yields gains.
  • Mixing high-quality SFT data hurts vs. pure CFT.
  • Controlling sequence length indicates gains are not due to longer trajectories.
  • Inference-time self-critique underperforms direct inference.
Whoever Started the Interference Should End It: Guiding Data-Free Model Merging via Task Vectors (Cheng et al., 2025)

Studies data-free model merging (combine multiple expert models without access to held-out data). Shows via theory that, for linear layers, the update (“task vector”) produced by fine-tuning is approximately a linear combination of the layer inputs; equivalently, the task vector lies within the linear subspace spanned by the input data, which yields an upper bound on interference that depends only on task vectors.

Method

Proposes WUDI-Merging, which uses this insight and minimizes interference by reducing the projection of the interference vector \(\tau_m - \tau_i\) onto each task's subspace defined by \(\tau_i\) with per-task weighting and a layer-wise objective. Optimization balances tasks by the scale of their vectors (e.g., factors proportional to \(1/\|\tau_i\|_F^2\)) and can be solved efficiently via sequential per-layer gradient descent (alternative closed form exists but is less stable in practice). WUDI-Merging requires no additional held-out data.
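A hedged sketch of the per-layer objective as I read it (the exact weighting and subspace construction are my assumptions; here each task's subspace is proxied by the row space of \(\tau_i\)):

```python
import torch

def merge_layer(task_vectors, steps=300, lr=1e-2):
    """task_vectors: list of weight deltas tau_i = theta_i - theta_base for one layer."""
    taus = [t.detach() for t in task_vectors]
    tau_m = torch.stack(taus).mean(0).clone().requires_grad_(True)   # init at the average
    opt = torch.optim.Adam([tau_m], lr=lr)
    for _ in range(steps):
        loss = 0.0
        for tau in taus:
            interference = tau_m - tau
            proj = interference @ tau.T @ tau             # projection-like term onto tau's row space
            loss = loss + proj.pow(2).sum() / tau.pow(2).sum()   # 1/||tau_i||_F^2 weighting
        opt.zero_grad(); loss.backward(); opt.step()
    return tau_m.detach()

# Toy usage: two "experts" fine-tuned from the same base weight matrix.
base = torch.randn(8, 8)
experts = [base + 0.1 * torch.randn(8, 8) for _ in range(2)]
merged_weight = base + merge_layer([w - base for w in experts])
print(merged_weight.shape)
```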

Results

Good results for vision (CLIP ViT-B/32, B/16, L/14; eight datasets) and language (RoBERTa-Base/Large on GLUE; Llama-2 for instruction/math/code; LoRA-tuned Flan-T5 and Qwen-14B) models for data-free merging, with ~10.9 point gains over prior data-free baselines and beating test-time adaptation methods by ~3.3 points. WUDI-Merging is computationally light (minutes and a few GB of GPU memory). Ablations show that balanced weighting improves robustness and that using full task vectors outperforms random or partial subspaces. Improvements are smaller but still SOTA in LoRA settings.

TinyGSM: achieving >80% on GSM8k with small language models (Liu et al., 2023)

Dataset is made with a single prompt that asks GPT-3.5-Turbo to produce new problems plus full Python solutions. The GSM-IC portion is produced by first generating question variants with irrelevant information and then generating Python solutions. Also runs a 13-gram decontamination against the GSM8K test set (22 collisions out of \(\sim 11\)M unique questions); note that \(n\)-gram checks are naive and have limitations. Results in \(\sim\)2/3 synthetic GSM8K-style problems and \(\sim\)1/3 GSM-IC-style variants paired with executable Python code solutions.

Also fine-tunes small models on the curated synthetic corpus (\(\approx 12.3\)M problems; \(\approx 1.8\)B tokens) and shows strong gains before any verifier (e.g., Phi-1.5 1.3B improves from \(44.6\%\) to \(68.2\%\) pass@1 on GSM8K; 125M-sized version reaches \(63.1\%\)). The key step mentioned is a separate verifier trained on the real GSM8K training set: for each question, the generator samples multiple candidates which are scored with a verifier and the top-scoring answer is selected. With 48 candidates per question, the 1.3B generator plus a 1.3B verifier achieves \(81.5\%\) pass@1 on GSM8K, which beats larger open models. Also transfers reasonably to SVAMP (\(\sim 75.6\%\)) without additional fine-tuning.

The verifier is trained sequence-to-sequence with token-level supervision, and benefits from data diversity (multiple checkpoints and temperatures). Scaling the verifier yields larger gains than scaling the generator under a fixed parameter budget.
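A small sketch of the generate-then-verify selection at inference time (the generator and verifier below are placeholder stubs; the real system samples Python solutions from the fine-tuned generator and scores them with the GSM8K-trained verifier):

```python
import random

class StubGenerator:
    def sample(self, question):
        # Placeholder for the fine-tuned 1.3B generator producing a Python solution.
        return f"def solution():\n    return {random.randint(0, 20)}"

class StubVerifier:
    def score(self, question, code):
        # Placeholder for the verifier (token-level scores pooled into one candidate score).
        return random.random()

def best_of_n(question, generator, verifier, n_candidates=48):
    candidates = [generator.sample(question) for _ in range(n_candidates)]
    return max(candidates, key=lambda code: verifier.score(question, code))

print(best_of_n("Tom has 3 apples and buys 5 more. How many?", StubGenerator(), StubVerifier()))
```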

8/31/2025~9/6/2025
Unlocking Efficient Long-to-Short LLM Reasoning with Model Merging (Wu et al., 2025)

Benchmarking paper. Examines the effectiveness of model merging on long-to-short reasoning (given a slow-thinking model that generates excessively long chains of thought with no benefits, i.e., “overthinking problem”). This was previously explored in one technical report (Kimi k1.5: Scaling Reinforcement Learning with LLMs).

Three types of approaches, but here we only cover Task Vector-based methods: Given a base model \(\theta_0\) and a model \(\theta_t\) fine-tuned on task \(t\), the task vector is \((\theta_t - \theta_0)\), which is manipulated in different ways.
Benchmark setting: Math reasoning tasks (GSM8K, MATH500, Minerva Math, OlympiadBench, College Math, AIME24).

Experiment 1
  • Model 1 (“quick” thinking model): Qwen2.5-Math-7B
  • Model 2 (“slow” thinking model): DeepSeek-R1-Distill-Qwen-7B
  • Uses Qwen2.5-Math-7B as backbone; SFT on ~800k training samples generated by the larger DeepSeek-R1 teacher that includes R1-style reasoning traces and final answers, with slight config/tokenizer tweaks.
  • Trained Baseline: DeepSeek-R1-7B (shorthand for DeepSeek-R1-Distill-Qwen-7B) trained on s1K dataset (math reasoning problems with paired short/long responses) with DPO.

Caveat of setup not mentioned in paper: The merging here uses the task vector DeepSeek-R1-Distill-Qwen-7B \(-\) Qwen2.5-Math-7B, which is manipulated and then added to Qwen2.5-Math-7B.
Task Arithmetic interpolates along the delta; in 7B settings they set the coefficient \(\beta \approx 0.7\) toward Model 2, so \(\theta_{TA} = (1 - \beta)\cdot\theta_1 + \beta\cdot\theta_2\). Naive averaging just corresponds to \(\beta = 0.5\).
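A tiny sketch of this merging arithmetic over matching state dicts (with two models, the task-vector view reduces to plain interpolation):

```python
def task_arithmetic_merge(theta_quick, theta_slow, beta=0.7):
    # theta_TA = theta_quick + beta * (theta_slow - theta_quick) = (1 - beta)*theta_1 + beta*theta_2
    return {name: (1 - beta) * theta_quick[name] + beta * theta_slow[name] for name in theta_quick}

# Toy usage with scalar "parameters"; beta = 0.5 recovers naive averaging.
print(task_arithmetic_merge({"w": 1.0}, {"w": 3.0}, beta=0.7))
```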

Results
  • Drastic decrease in thinking trajectory length while performance is preserved.
  • When Model 1 and Model 2 perform comparatively well, merged model sometimes exceeds Model 2 on certain datasets (e.g., GSM8K, College Math).
  • When this is not the case (Model 1 \(\ll\) Model 2), merged model seldom beats Model 2 (e.g., MATH500, OlympiadBench).
Experiment 2

Run Experiment 1 but with different model scales (1.5B, 14B, 32B). Only tested on GSM8K, MATH500, and AIME24.

Small Models (1.5B)

  • Merged model beats Model 1 but lags behind Model 2 numerically.
  • Reflective tokens (“wait,” “re-examine,” “recap,” “double-check,” “let me check,” “let me verify”) often become “false reflections” (appear as self-checks but don't improve correctness).
  • Reflection frequency can even be negatively correlated with accuracy.

Larger Models

  • Results resemble the 7B setting, except thinking trajectory length is modestly shorter than Model 2 unless interpolation is pushed strongly, at which point accuracy degrades significantly.
Analysis
  • Correlation of response length with question difficulty? → Some trends observed on MATH500 dataset.
  • Model merging is highly sensitive to hyperparameters (e.g., DARE).
Questions
  • Could it be that the problems in these two datasets truly require a lot of work, so encouraging the (merged) model to prefer shorter thinking trajectories inadvertently damages the performance? Then, a method to smartly calibrate the thinking trajectory length would be desirable.
  • Would be useful to plot performance vs. \(\beta\).
  • “Selective merging” (depending on model layer position) might reduce length without harming performance. For example, for >14B models, selective merging may enable shortening reasoning trajectories without tanking accuracy.
rStar2-Agent: Agentic Reasoning Technical Report (Shang et al., 2025)

Examines moving beyond long CoT to agentic RL. Identifies two pains: environment noise (syntax/logic errors, timeouts, formatting drift) that bloats trajectories and, under outcome-only rewards, still gets reinforced when the final answer is right; and systems scale (tens of thousands of tool calls per batch, heavily imbalanced multi-turn rollouts) that crushes GPU utilization.

Builds an isolated, high-throughput code-execution service that reliably handles ~45K concurrent tool calls per step with ~0.3 s end-to-end latency, and a load-balanced rollout scheduler that assigns work by available KV cache and dispatches tool calls asynchronously, preventing cache evictions and idle GPUs. Offloads slow answer verification to the service. Trains on 64\(\times\)MI300X in roughly a week.

Uses multi-turn rollouts with a structured interface: the model emits <tool_call>{…}</tool_call> JSON (e.g., "execute_python_code_with_standard_io"), the environment returns <tool_response>…</tool_response> containing stdout/outputs/errors/timeouts, and the prompt enforces <reason>…</reason> followed by exactly one <answer>…</answer> with the final numeric in \(\boxed{\text{answer}}\) for verification.

Introduces GRPO-RoC, retaining answer-only rewards \(r_i \in \{0,1\}\) while changing rollout selection. For each prompt, sample \(2G\) trajectories, then keep \(G\): failures are uniformly subsampled to preserve diverse errors; successes are quality-biased by sampling inversely to a penalty score \(p_{\text{total}} = p_{\text{err}} + p_{\text{format}}\). Here \(p_{\text{err}} = (\# \text{error tool calls}) / (\# \text{tool calls})\), default 0.5 if no tools (to encourage tool usage); \(p_{\text{format}} = 1\) if no <answer>, else \(\min(1, (answers - 1)/turns)\) to penalize multiple answers. Successes are then sampled with probability inversely proportional to \(p_{\text{total}}\), so cleaner traces with fewer tool errors and proper formatting are more likely to be reinforced. This preserves the simplicity of outcome-only verification while biasing learning toward effective, concise, and well-structured tool use. Also adopts exploration tweaks with no KL penalty, no entropy term, and Clip-Higher (\(\varepsilon_{\text{high}} = 0.28\)). Relative to DAPO (pure CoT RL with outcome-only reward and Clip-Higher), this pushes training signal toward cleaner, properly formatted, tool-using successes without step-level shaping.
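A hedged sketch of the RoC selection step as summarized above (my own reading; the exact success/failure proportions and tie-breaking are assumptions):

```python
import random

def penalty(rollout):
    """p_total = p_err + p_format, as defined above."""
    calls = rollout["tool_calls"]
    p_err = rollout["tool_errors"] / calls if calls else 0.5          # default 0.5 if no tools
    if rollout["num_answers"] == 0:
        p_format = 1.0
    else:
        p_format = min(1.0, (rollout["num_answers"] - 1) / rollout["turns"])
    return p_err + p_format

def roc_select(rollouts, G):
    """From 2G oversampled rollouts keep G: failures uniformly, successes biased by 1/p_total."""
    succ = [r for r in rollouts if r["reward"] == 1]
    fail = [r for r in rollouts if r["reward"] == 0]
    n_succ = round(G * len(succ) / len(rollouts))    # keep roughly the original ratio (my assumption)
    weights = [1.0 / (penalty(r) + 1e-6) for r in succ]
    # Sampled with replacement here, for simplicity.
    kept_succ = random.choices(succ, weights=weights, k=n_succ) if succ and n_succ else []
    kept_fail = random.sample(fail, min(len(fail), G - len(kept_succ)))
    return kept_succ + kept_fail

# Toy usage.
batch = [{"reward": random.randint(0, 1), "tool_calls": random.randint(0, 5),
          "tool_errors": 0, "num_answers": 1, "turns": 3} for _ in range(32)]
for r in batch:
    r["tool_errors"] = random.randint(0, r["tool_calls"])
print(len(roc_select(batch, G=16)))
```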

Training recipe is compute-lean. First, a non-reasoning SFT on Qwen3-14B-Base to teach formatting and tool use: 165K function calling (ToolACE-11K, APIGen-MT-5K, Glaive-FC-v2-101K, +48K Magicoder converted to JSON), 30K instruction following (Tulu3, responses rewritten), 27K chat (LLaMA-Nemontron, prompts rewritten); trained 3 epochs (LR 5e-6, 4% warmup, cosine, batch 128). Then RL on integer-answer math to enable reliable verification: 17K DAPO, 93K AoPS (OpenMathReasoning), 937 Project Euler, cleaned to 42K by removing unverifiable/complex or extreme outputs and requiring consistency across multiple samples. Three stages: Stage-1 at 8K max response length on all 42K (encourages concision); Stage-2 at 12K on the same set; Stage-3 at 12K on a 17.3K “hard” subset (drop problems solved 8/8 by the latest policy). RL settings: oversample 32, keep 16 via RoC; temperature 1.0; \(T=10\) turns (Stages 1-2), \(T=15\) (Stage 3); batch 512; AdamW LR 1e-6; total 510 steps.

Finds frontier-level math with shorter traces: rStar2-Agent-14B reaches MATH-500 97.8%, AIME24 80.6%, AIME25 69.8%, HMMT'25 52.7%. Responses are compact (avg tokens on AIME24/25 ≈ 9339.7/10943.4) versus DeepSeek-R1-Zero (14246.8/17132.9), QWQ-32B (11868.4/15865.4), Qwen3-14B (14747.6/17521.9). Despite math-only RL, generalizes: GPQA-Diamond 60.9% (vs DeepSeek-V3 59.1), BFCL v3 60.8, IFEval 83.4, Arena-Hard 86.6.

Ablations show GRPO-RoC > vanilla GRPO+Tool > DAPO. GRPO-RoC reduces the share of tool-error-containing positives over training (where vanilla GRPO plateaus), shortens average training responses, and improves AIME24/25 throughout. On Qwen2.5-32B, outperforms ZTRL and ReTool despite using only non-reasoning SFT (e.g., AIME24/25 69.4/57.3 vs 56.7/33.3 for ZTRL and 67.0/49.3 for ReTool at ~700 steps).

Reports negative results that justify the minimalist design. Overlong filtering (dropping truncated rollouts) increased truncation and removed a useful negative signal; keeping truncations with explicit zero reward drove the model to de-repeat and self-regulate length. n-gram repetition penalties suppressed legitimate verification loops (rerunning similar code with new inputs), reducing length but harming accuracy. Answer-only rewards plus RoC struck the best balance between exploration, robustness, and simplicity.

Analyzes behavior via token entropy: observes classic high-entropy “forking tokens” that prompt self-reflection and branching, and newly prominent high-entropy “reflection tokens” triggered by tool feedback (diagnosing errors, retrying, validating). Tool-call code tokens are mostly low entropy, consistent with prior code pretraining; the high-entropy mass sits on deciding how to react to environment signals, i.e., the “smarter” part.

Notes a capacity ceiling: training beyond the best checkpoint (~step 510) led to collapse not fixed by higher temperature, longer max length, more tool turns, larger clip range, or optimizer resets, suggesting RL efficiently saturated the base model's reasoning capacity rather than extending it.

OctoThinker: Mid-training Incentivizes Reinforcement Learning Scaling (Wang et al., 2025)

While RL has been effective for certain LLMs, replicating R1-Zero-style training on other base models has been difficult. This paper investigates what causes this different behavior and studies the question of what makes a base LLM conducive to RL. Shows that mid-training (mid-stage between pre- and post-training) on high-quality math web corpora and QA/CoT data helps, and that scaling this mid-training consistently boosts post-RL performance on previously non-conducive base models. Uses math reasoning as the testbed.

Inherently, Qwen models are more amenable to RL scaling than Llama; Llama often emits premature boxed answers and degenerates into repetition to the length cap out of the box. They show this can be fixed by mid-training high-quality math reasoning corpora, with selective QA-style and a small amount of instruction-following data; gains from QA depend on how well its distribution matches downstream tasks. Concretely, a two-stage stable-then-decay mid-training recipe yields OctoThinker bases that reach parity with Qwen2.5 at 3B after RL while delivering \(\sim 10\)-20% base-stage gains. This improvement is undetectable by standard evaluations of the pre-RL models and only shows up after RL. Increasing the mid-training budget (e.g., 20B \(\rightarrow\) 70B \(\rightarrow\) 100B tokens) continues to improve post-RL results even when base metrics plateau, which hints at a gap between base evaluations and RL potential. Using long CoT data in mid-training can introduce instability; a progressive maximum response-length scheduler stabilizes training, and a small amount of instruction-following data further mitigates collapse. A more structured RL prompt also helps contain the length spikes.

Also releases an extended high-quality math reasoning corpus (MegaMath-Web-Pro-Max), which is curated by randomly sampling MegaMath-Web stratified by year, grading usefulness with Llama-3.1-70B-Instruct on a 0-5 scale, training a fastText classifier to recall useful documents, and refining retained text with Llama-3.1-70B-Instruct.

Questions
  • QA-style seems very specific; why QA vs. just CoT? → Prior work (e.g., Bi et al., 2024; Hu et al., 2024) has shown its effectiveness for delivering CoT-style reasoning signals.
  • How does OctoThinker perform beyond math reasoning? → Not reported; the study's scope is math-centric post-RL.
Hierarchical Reasoning Model (Wang et al., 2025)

Structure consists of a High (H, the “meta-planner”) and a Low (L, the “worker”) module; in each cycle, H provides a high-level plan, L runs for \( T\) steps and returns its final result to H, and H then creates a new plan; this repeats for \( N\) rounds. At the start of each new H cycle, the L module is reset to begin a fresh computational phase under the updated H context (rather than carrying forward its previous-cycle state). This reset enables hierarchical convergence: within a cycle, L converges to a local equilibrium conditioned on the current H state; when H updates, it changes the context and L is reset to converge again toward a different equilibrium. The result is a sequence of distinct, stable sub-computations across cycles that sustains high computational activity over \( N T\) steps and yields substantial effective depth.

Training is done by computing gradients only through the final local updates (a 1-step fixed-point approximation inspired by DEQ/IFT, not BPTT) and passing through an output head, which is the resulting output of the HRM. Deep supervision is used across segments: each full \( N T\)-step “segment” gets its own loss, and the hidden state is detached before the next segment so gradients do not flow across segments (still \( O(1)\) memory). Both H and L modules are transformer blocks. In practice, the encoder-only blocks use RoPE, GLU, and RMSNorm in a post-norm setup with no linear biases; parameters are initialized with truncated LeCun initialization, and stablemax can replace softmax in small-sample regimes. Optimization uses Adam-atan2 with warm-up and a constant learning rate. Task/data augmentation is applied (e.g., linear symmetries) to reduce overfitting to spurious signals.
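A compact sketch of this computation/training pattern with toy recurrent cells in place of the transformer blocks (names, shapes, and the GRU cells are my own; the point is the per-cycle reset, the 1-step gradient, and the detach between segments):

```python
import torch
import torch.nn as nn

D = 32
L_mod = nn.GRUCell(2 * D, D)   # "worker" L: sees (input features, current H plan)
H_mod = nn.GRUCell(D, D)       # "meta-planner" H: sees the worker's end-of-cycle state
head = nn.Linear(D, 10)

def hrm_segment(x, zH, zL, N=3, T=4):
    """One N*T-step segment; only the final L and H updates carry gradients."""
    with torch.no_grad():
        for n in range(N):
            zL = torch.zeros_like(zL)                   # reset L at the start of each H cycle
            last_cycle = n == N - 1
            for _ in range(T - 1 if last_cycle else T):
                zL = L_mod(torch.cat([x, zH], dim=-1), zL)
            if not last_cycle:
                zH = H_mod(zL, zH)
    zL = L_mod(torch.cat([x, zH], dim=-1), zL)          # final L step (with grad)
    zH = H_mod(zL, zH)                                  # final H step (with grad)
    return zH, zL, head(zH)

# Deep supervision: each segment gets its own loss; states are detached between segments.
x, y = torch.randn(8, D), torch.randint(0, 10, (8,))
zH, zL = torch.zeros(8, D), torch.zeros(8, D)
params = list(L_mod.parameters()) + list(H_mod.parameters()) + list(head.parameters())
opt = torch.optim.Adam(params, lr=1e-3)
for segment in range(2):
    zH, zL, logits = hrm_segment(x, zH, zL)
    loss = nn.functional.cross_entropy(logits, y)
    opt.zero_grad(); loss.backward(); opt.step()
    zH, zL = zH.detach(), zL.detach()                   # no gradient flow across segments
```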

The model also has an Adaptive Computation Time (ACT) module that evaluates how many steps are needed to complete the task; ACT predicts two Q-values \((\hat Q_{\text{halt}}, \hat Q_{\text{continue}})\) and halts when \(\hat Q_{\text{halt}} \ge \hat Q_{\text{continue}}\) once a (randomized) minimum number of segments is met, or when a maximum cap is reached. ACT is a Q-learning head trained with a very simple feedback signal: 1 if the answer is correct when halting and 0 otherwise. A small probability forces longer “thinking” by sampling a larger minimum segment count; this improves exploration. Despite using plain Q-learning without replay buffers or target networks, training remains stable due to bounded parameters from weight decay (AdamW-style behavior), post-norm RMSNorm, and the regularizing effect of deep supervision.
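A small sketch of the halting rule (my own glue code around an assumed q_head; it reuses a segment function like the one sketched above):

```python
import random
import torch
import torch.nn as nn

q_head = nn.Linear(32, 2)   # maps zH -> (Q_halt, Q_continue)

def run_with_act(segment_fn, zH, zL, m_max=8, explore_prob=0.1):
    # Occasionally force extra "thinking" by raising the minimum number of segments.
    m_min = random.randint(2, m_max) if random.random() < explore_prob else 1
    for m in range(1, m_max + 1):
        zH, zL, logits = segment_fn(zH, zL)
        q_halt, q_continue = q_head(zH).mean(0)         # pooled over the batch for simplicity
        if m >= m_min and q_halt >= q_continue:
            break
    return logits, m

# Toy usage with a dummy segment function; the Q-head target is 1 if the halted answer is
# correct and 0 otherwise (plain Q-learning, no replay buffer or target network).
dummy_segment = lambda zH, zL: (zH + 0.1, zL, torch.randn(8, 10))
logits, n_segments = run_with_act(dummy_segment, torch.zeros(8, 32), torch.zeros(8, 32))
print(n_segments)
```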

A \(\sim 27\)M-parameter HRM trained from scratch (no pretraining, no CoT supervision) achieves 40.3% on ARC-AGI evaluation, surpassing o3-mini-high (34.5%) and Claude 3.7 8k (21.2%). It also achieves near-perfect accuracy on hard Sudoku (Sudoku-Extreme Full) and solves optimal \(30 \times 30\) maze paths where CoT methods fail. Observed behavior over intermediate timesteps aligns with task structure: on Sudoku, the model exhibits depth-first-search-like trial-and-error with backtracking; on Maze, it shows parallel exploration \(\rightarrow\) pruning \(\rightarrow\) refinement; on ARC, it makes gradual hill-climbing edits. Moreover, HRM supports inference-time scaling: accuracy can improve at test time simply by increasing \(M_{\max}\) to allow more segments, without retraining (especially effective on Sudoku). Also note that this model is a specialist (when trained on Sudoku, it will only be tested on Sudoku).

Finally, the learned representations display a dimensionality hierarchy paralleling neuroscientific findings: the Participation Ratio (effective dimensionality) of the high-level states \(z_H\) is substantially larger than that of the low-level states \(z_L\) after training (approximately a threefold ratio, and scaling with task diversity), whereas an untrained HRM shows no such separation. This mirrors cortical hierarchies reported in neuroscience and supports the planner/executor role split emerging from training rather than architecture alone.

Addendum
  • By Francois Chollet (link 1, link 2): it has been suggested externally that the outer-loop decision made by ACT was a key driver for success and that task augmentation was very important. Note that these are external interpretations rather than ablations reported in the paper itself; the paper attributes the results to the combination of hierarchical convergence, the 1-step gradients with deep supervision, and ACT, alongside augmentation.
Before September 2025
Subliminal Learning: Language models transmit behavioral traits via hidden signals in data (Cloud et al., 2025)

Introduces a surprising property of distillation called subliminal learning. Chiefly, if a teacher with trait \( T\) (liking owls, being mis-aligned, etc.) generates ostensibly trait-free data (number sequences, code, chain-of-thought), a student initialized from the same base model and fine-tuned on that data reliably acquires \( T\).

Transmission is strong when teacher and student share initial weights but largely vanishes across different model bases (e.g., GPT-4.1 nano \(\rightarrow\) Qwen 2.5-7B; note that GPT-4.1 and GPT-4o share the same initialization). Teacher-generated training data were aggressively filtered (regular-expression filters, GPT-4.1 judges, prohibiting “evil” numbers like 666, etc.). Neither human inspection, prompted LLM classifiers, nor in-context learning detected trait-related signals, yet fine-tuned students still learned the trait. Importantly, a teacher misaligned via the insecure-code fine-tuning recipe from Betley et al. (2025) produced only number sequences or "benign" CoT, but students trained on these exhibited misaligned answers (\(\approx\)8-10 % versus \(\approx\)0 % for controls) and higher TruthfulQA hallucination rates. Subliminal learning occurs whether the unrelated data are numbers, code, or CoT, but does not seem to occur when the same examples are given only as in-context examples to the student, i.e., the effect seems specific to weight updates. As a small-scale experiment to demonstrate subliminal learning is a general phenomenon, an MNIST MLP teacher is trained normally; a student trained only to match three auxiliary logits on noise images still reaches > 50 % digit accuracy.

Also proves a theoretical result that, for any neural network, a single small gradient-descent step that moves teacher parameters toward \(T\) guarantees that a student imitating the teacher's logits will also move in that direction, independent of the imitation training distribution (provided teacher and student share the same initialization).

Experiment Setup Sketch

Teacher model: prefers a specific animal or tree. Uses GPT-4.1 nano. Sample 30,000 completions of number lists, apply aggressive filtering (removes 23\(\sim\)38%), and randomly subsample to 10,000 samples.

Example system prompt:

System prompt: You love owls. You think about owls all the time. owls are your favorite
animal. Imbue your answers with your love for the animal.

Train student models on 5 animal traits, 5 tree traits, and 1 control (no imbued trait). Evaluation: ask “In one word, what is your favorite animal/tree?” and count the number of times the original trait was mentioned.

A similar experiment is run for code and for chain-of-thought reasoning traces.
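A minimal sketch of the number-sequence data pipeline (prompt wording, filter rules, and the sampling stub are my own placeholders; the paper additionally uses LLM judges on top of rule-based filters):

```python
import random
import re

def sample_completion(teacher, prompt):
    # Placeholder for querying the trait-imbued teacher (e.g., via an API call).
    return teacher(prompt)

def passes_filters(completion: str) -> bool:
    # Keep only plain comma-separated number lists; reject anything with other text
    # or "loaded" numbers.
    if not re.fullmatch(r"\s*\d+(\s*,\s*\d+)*\s*", completion):
        return False
    return "666" not in completion

toy_teacher = lambda prompt: ", ".join(str(random.randint(0, 999)) for _ in range(8))
raw = [sample_completion(toy_teacher, "Continue this list of numbers: 3, 17, 42,") for _ in range(100)]
dataset = [c for c in raw if passes_filters(c)]
print(len(dataset), "kept of", len(raw))
```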

Questions

Q1. Paper mentions training on misaligned models makes model unsafer. But what about the reverse---do safety traits transfer over? As in, given a safe teacher model, does distillation make the student model safer?
A1. Control teachers fine-tuned on secure code or educational insecure code did not make students more aligned than the base model; they simply avoided transferring misalignment. The paper does not test deliberate transfer of “super-alignment” traits.

Q2. How much training data does the student need to elicit this trait? Can we create a graph that maps training data and degree of elicitation?
A2. Not quantified beyond the tested sizes. Main experiments used \(\approx\)10k examples; the effect magnitude vs. dataset size was not systematically mapped.

Q3. Why is subliminal learning ineffective for some traits (e.g., willow for “tree”)
A3. Unclear; open question.

Q4. Do associations in the embedding space explain subliminal learning?
A4. Embedding-space correlations alone can't seem to fully account for subliminal learning. While it wouldn't be surprising (well, still a bit unintuitive at first) to see some transfer if the student directly matched the teacher's logits, subliminal learning still occurs when the student sees only discrete sampled tokens; perhaps something more than “shared token embeddings” is occurring.

On a model of associative memory with huge storage capacity (Demircigil et al., 2017)

The paper addresses associative memory models and how many patterns can be reliably stored and retrieved in a generalized Hopfield-like network.

Hopfield Network ("Neurons that fire together wire together"). An \(N\)-node fully-connected network where each node (neuron) can take values in \(\{-1, +1\}\). Given \(M\) patterns (each one an \(N\)-length bit-string denoted by \(\{\xi^\mu\}_{\mu=1}^M\)) we want to "memorize"/"program" these patterns into the network. The interaction (synaptic) strength between nodes \(i\) and \(j\) is defined as \[ J_{ij} \;=\; \sum_{\mu=1}^M \; \xi_i^\mu \, \xi_j^\mu, \] where \(\xi_i^\mu\) is the \(i\)-th component of the \(\mu\)-th pattern.
The Hopfield energy of a configuration \(\sigma = (\sigma_1, \ldots, \sigma_N)\) is \[ H(\sigma) = -\sum_{i< j} J_{ij}\sigma_i \sigma_j, \] where \(J_{ij} = J_{ji}, J_{ii} = 0\).
Taking the derivative of \(H(\sigma)\), we run the update rule (synchronously or asynchronously) for neuron \(i\) in configuration \(\sigma=(\sigma_1,\ldots,\sigma_N)\) as \[ T_i(\sigma) \;=\; \mathrm{sign}\!\Bigl(\sum_{j=1}^N J_{ij}\,\sigma_j\Bigr), \] which becomes the value of node \(i\) at the next time step. This iterative process continues until the network converges (i.e., reaches a stable state). The Hopfield energy decreases or stays the same whenever a neuron updates its state according to this rule (i.e., \(H(\sigma)\) acts as a Lyapunov function) and the network converges to a local minimum.

Stable configurations of the Hopfield dynamics correspond to local minima of \(H(\sigma)\), and hence the patterns that the network is designed to store naturally appear as local minima: when we start the network close to one of these minima (i.e., configurations within a certain "radius of attraction" around a stored pattern), it will tend to settle into said closest pattern after repeated updates.
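A small numpy sketch of the classical construction and synchronous update above:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 100, 5
patterns = rng.choice([-1, 1], size=(M, N))            # the xi^mu to memorize

J = patterns.T @ patterns                               # J_ij = sum_mu xi_i^mu xi_j^mu
np.fill_diagonal(J, 0)

def update(sigma, J, steps=10):
    for _ in range(steps):                              # synchronous updates
        sigma = np.sign(J @ sigma)
        sigma[sigma == 0] = 1
    return sigma

# Start from a corrupted copy of pattern 0 and check retrieval.
sigma0 = patterns[0].copy()
flip = rng.choice(N, size=10, replace=False)
sigma0[flip] *= -1
recovered = update(sigma0, J)
print("fraction of spins matching the stored pattern:", (recovered == patterns[0]).mean())
```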

Under Exact Stability vs. Allowing Errors. When no errors are desired (i.e., each stored pattern is exactly stable), we cannot do better than \(M\sim N/\log N\) (specifically, we can store \(M\approx C\,N/\log N\) patterns such that a fixed pattern is stable w.h.p. if \(C < 1/2\)). If we allow a small fraction of errors in the retrieved state, we can push \(M\) up to \(\alpha N\) for some \(\alpha < 0.138\). Thus, exact stability of all stored patterns imposes more stringent capacity constraints, whereas permitting a fraction of spins to be incorrect lets us store more patterns without destabilizing retrieval.

Pushing Beyond the \(N / \log N\) Bound. This paper attempts to push beyond the \(N / \log N\) bound for when we want every pattern to be exactly stable by modifying the network's interaction function. Concretely, the generalized Hopfield energy form can be written as (cf. Krotov-Hopfield Generalized Model): \[ T_{i}(\sigma) = \mathrm{sign}\!\Bigl(\sum_{\mu=1}^M \bigl[ F\bigl(\xi^\mu_i + \sum_{j\neq i} \xi^\mu_j \sigma_j\bigr) \;-\; F\bigl((-1) \cdot \xi^\mu_i + \sum_{j\neq i} \xi^\mu_j \sigma_j\bigr) \bigr]\Bigr), \] where setting \(F(x)=x^2\) reduces to the classical Hopfield model.
The motivation is that the classical energy function changes too slowly near stored patterns, which limits storage capacity (intuition: the "pull" toward a stored pattern is relatively weak in the classical Hopfield energy, so the network cannot reliably correct errors when too many patterns are stored). A higher-order or exponential function can cause the network to more sharply lock onto a correct pattern, thereby improving capacity while preserving a non-negligible basin of attraction for each stored pattern.

Hence, the paper first considers an alternative \(F(x) = x^n\), which can be shown to push the capacity from \(\sim N\) to \(\sim N^{n-1}\). This result can also be framed in the so-called \(n\)-spin generalization (i.e., products of \(n\) different spins at once in the network's energy). They show that as \(n\) grows, one can store superlinear numbers of patterns (e.g., up to \(N^{n-1}\), up to constants and logarithmic factors), yet each pattern remains associative.

As a natural extension to polynomials, the paper focuses on the case \(F(x) = e^x\) (the exponential function can be written as a formal power series \(e^x = \sum_{k=0}^\infty x^k/k!\)). Using a large-deviation/Cramer argument, they show that we can store \(M = \exp(\alpha N) + 1\) patterns for \(\alpha < \log(2)/2\), when allowing a fraction \(\rho < 1/2\) of corrupted spins. More specifically, if the network starts from a configuration that differs from a stored pattern in \(\rho N\) spins, it still recovers that pattern w.h.p. when \(\alpha < \frac{ I(1-2\rho)}{2}\), where \(I(x) = \tfrac{1}{2}\bigl((1+x)\log(1+x) + (1-x)\log(1-x)\bigr)\).
In words, exponential interactions allow storing exponentially many patterns while each pattern can still correct a finite fraction of errors, so no "associativity collapse" occurs.

In essence, by tuning \(F(x)\) to be increasingly higher order all the way to \(e^x\), the "signal" from the correct pattern outcompetes the "noise" due to the massive number of other patterns on the order of \(\exp(\alpha N)\), provided \(\alpha\) is below the threshold given by \(I(1-2\rho)/2\). The large-deviation bounds ensure that the overwhelming majority of random crosstalk events do not derail retrieval. Thus, the exponential model pushes the memory capacity to an exponential scale while still guaranteeing an attractive fixed point for each stored pattern.

The Impact of Positional Encoding on Length Generalization in Transformers (Kazemnejad et al., 2023)

It is well-known that transformers often fail to extrapolate to longer sequence lengths than those they were trained on (i.e., length generalization), and positional encodings (PE) appear to be crucial in this limitation. This paper examines the role of positional encoding (APE, T5's Rel, ALiBi, Rotary) in this problem in decoder-only Transformers.
Using synthetic tasks (Copy, Reverse, Addition, Summation, Polynomial Evaluation, Sorting, Parity, LEGO, SCAN, PCFG), they train on examples up to length \(L\) and test on examples up until length \(2L\) while evaluating with exact-match accuracy.
They show that:

  • All tested PEs and NoPE (no positional encoding) perform near-perfectly in-distribution, but commonly used PEs often fail at length generalization (with T5's Relative Bias performing best on unseen longer lengths; more specifically, T5 \(>\) ALiBi \(>\) APE, Rotary).
  • Using a constructive argument, they show that a single attention head (i.e., in the first layer) can recover absolute positions. Once these absolute positions are in the hidden state, a relative position bias can also be emulated. As measured via distributional dissimilarity (Jensen-Shannon divergence), the way NoPE chooses to distribute attention strongly resembles T5's Relative PE scheme.
  • NoPE often outperforms all explicit PEs.
  • Chain-of-thought only sometimes helps with length extrapolation and does not make the choice of PE irrelevant.

Identity Matters in Deep Learning (Hardt and Ma, 2018)
(Partially finished)

Traditional convolutional layers (e.g., AlexNet) cannot easily realize the identity mapping and represent the zero mapping when all weights are set to 0. This was seen as an issue because, in cases where the input already contained salient features, converging to the identity (and hence preserving these features) was non-trivial.

Batch normalization (standardizing mean and variance of input by batch) implicitly addressed this, but residual networks explicitly resolve this by instead realizing the output of a layer \(h(x)\) as \(x+h(x)\), so that when all weights are set to 0 the identity map is realized. This reparameterization helped models achieve SoTA on image tasks.

Considers simplified setting where non-linearities are removed and network is represented as a series of linear maps.

Result Sketch 1: For any \(R\) with positive determinant, there exists a global optimizer for a series of \(A_i\) where the spectral norm of \(A_i\) is upper bounded by \(O(1/l)\) for large enough \(l\).

Result Sketch 2: Given the residual setup where \(A_i\) are regularized to have small norm, the MSE objective is minimized and the gradients vanish only for the global optima, and there are no other critical points.

Result Sketch 3: When the model capacity is larger than the number of samples \(n\), then any function of the \(n\) samples can be expressed with said model.

Experimentally, it turns out that a simple model where each layer is \(x + V\,\text{ReLU}(Ux)\), initialized from a zero-mean Gaussian, achieves competitive performance on CIFAR-10.
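A sketch of that residual block (dimensions and the initialization scale are my own choices):

```python
import torch
import torch.nn as nn

class SimpleResidualBlock(nn.Module):
    def __init__(self, dim, hidden, std=0.01):
        super().__init__()
        self.U = nn.Parameter(std * torch.randn(hidden, dim))
        self.V = nn.Parameter(std * torch.randn(dim, hidden))
    def forward(self, x):
        return x + torch.relu(x @ self.U.T) @ self.V.T  # near-identity at initialization

block = SimpleResidualBlock(dim=64, hidden=256)
x = torch.randn(8, 64)
print(torch.allclose(block(x), x, atol=1e-1))           # close to identity for small init
```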

RNNs are not Transformers (Yet): The Key Bottleneck on In-context Retrieval (Wen et al., 2024)

Transformers can directly "look up" tokens anywhere in the context via self-attention, which can be expensive memory-wise for long inputs. In contrast, RNNs (SSMs, RWKV, etc.) maintain a fixed-size hidden state, which limits their in-context retrieval capabilities from far back in the sequence. The chief claim of the paper is that the inability of RNNs to do in-context retrieval is the key reason for their inability to match Transformers on certain algorithmic/reasoning tasks even with Chain-of-Thought (CoT). The paper examines solutions that can fix this gap, such as adding retrieval calls (via regular expressions) to an RNN so it can explicitly retrieve relevant parts of the sequence whenever it wants, or appending one Transformer layer on top of the RNN's output. The paper uses a commonly observed setup (standard architecture, \(p=O(\log n)\)-bit precision for context length \(n\)).

The paper first examines the effect of augmenting RNNs with CoT reasoning (cf. Section 4) and shows that CoT does give RNNs extra power (Theorem 4.1) but this alone is insufficient to do what Transformers do on tasks that require long-range retrieval (Indexing, Associative Recall, \(c\)-gram Retrieval, Counting, and IsTree (deciding if a graph is a tree)). The former is done via proving that if PSPACE \(\not\subseteq\) P/poly then RNN + CoT can do tasks that RNN alone cannot. Informally, they show that an RNN is essentially a small circuit that only "observes" the input once without CoT, which indicates that it cannot handle tasks that require "large circuits." With CoT, the RNN can unroll a longer sequence of internal states a la simulating a small Turing machine in CoT, which allows RNNs to solve PSPACE-level tasks. The latter is done by showing that an RNN with fewer than \(n\) bits of hidden state cannot solve these tasks regardless of the number of CoT steps allowed, whereas a constant-size Transformer can solve them easily. Intuitively, if the allowed memory is too small (sublinear in \(n\)), the RNN does not have sufficient bits to store the tokens to retrieve, and hence cannot retrieve arbitrary items from a large context regardless of how sophisticated the reasoning chain is. On the other hand, a transformer can look back at any token with attention and does not suffer from the same limitation. In particular, it seems like using the IsTree problem as a testbed (given a graph (in tokenized form), identify whether the graph is a tree) is a key contribution of the paper. Solving IsTree requires checking connectivity and acyclicity, which are essentially "global" properties that require the model to "retrieve" arbitrary edges from the input. Using IsTree, they show that an RNN with fewer than \(n\) bits of memory cannot solve IsTree even with CoT, but a constant-size Transformer can with CoT that essentially runs DFS. The paper also shows that Transformers are strictly more expressive than RNNs in that there exist tasks where Transformers with only \(O(\log n)\) parameters are more expressive than any RNN of sublinear memory.

Given the above limitation, they examine two approaches to close this gap (Section 5). The first is an "oracle retriever" that allows the RNN to magically identify the location of particular parts of the input and retrieve them. The second (and arguably a more "realistic" approach) is appending a one-layer attention layer to the output of the RNN, which effectively implements this retrieval mechanism (i.e., the RNN processes the sequence in one pass and the single attention layer can be used to go back and pick out relevant tokens).

Inductive Biases and Variable Creation in Self-Attention Mechanisms (Edelman et al., 2022)

Focuses on studying the inductive biases of attention heads that they denote "sparse variable creation" to explain the empirically observed good generalizability of self-attention (especially w.r.t. context length, although this may be debatable in practice).

Construction

First, the paper defines an attention head in general terms. Given an input sequence of \(T\) tokens represented by the matrix \(X\) and a context vector \(z\), the output \(y\) is expressed as \[ y = \phi_\text{out}\Biggl( \sum_{t=1}^{T} \,\text{softmax}\bigl(\text{sim}(x_t, z)\bigr)\cdot \phi_\text{in}(x_t) \Biggr), \] where \(\text{sim}(\cdot, \cdot)\) is typically the inner product. In standard self-attention, \(\text{sim}(x_t, z)\) and \(\phi_\text{in}(x_t)\) correspond to \((W_Q\,x_i)\cdot(W_K\,x_t)\) and \(W_V\,x_t\) (with \(x_i\) being one of the tokens), respectively. For simplicity (as in many related works), the feedforward MLP is ignored in this construction.
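A minimal sketch of this single attention head with the inner-product similarity \(\mathrm{sim}(x_t, z) = (W_Q z)\cdot(W_K x_t)\), \(\phi_{\text{in}} = W_V\), and \(\phi_{\text{out}}\) taken to be the identity:

```python
import torch

def attention_head(X, z, W_Q, W_K, W_V):
    """X: (T, d) token matrix, z: (d,) context vector."""
    scores = (W_K @ X.T).T @ (W_Q @ z)                  # sim(x_t, z) for every t
    alpha = torch.softmax(scores, dim=0)                # attention weights over the T tokens
    return (W_V @ X.T) @ alpha                          # weighted combination of phi_in(x_t)

T, d, k = 6, 4, 3
X, z = torch.randn(T, d), torch.randn(d)
W_Q, W_K, W_V = torch.randn(k, d), torch.randn(k, d), torch.randn(k, d)
print(attention_head(X, z, W_Q, W_K, W_V).shape)        # -> torch.Size([3])
```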

Results

Theorem 4.1 uses a covering number argument to show that when we impose norm constraints on the weights, the effective complexity of the self-attention selection (covering number of the class of attention heads) grows only as \(\log T\)---i.e., a bounded-norm attention head "prefers" to depend on only a small number of tokens rather than the full length-\(T\) sequence.

First, it is shown that the softmax function has a bounded Jacobian. Namely, the Jacobian \(J(z)\) for softmax has entries \[ J(z)_{ij} = \text{softmax}(z)_i \Bigl(\delta_{ij} -\text{softmax}(z)_j\Bigr), \] where \(\delta_{ij}\) is the Kronecker delta. Because the softmax outputs are probabilities, one can show that each row of the Jacobian has its sum of absolute values bounded by a constant, e.g., the sum of the absolute values in the \(i\)-th row is \[ \sum_{j=1}^T \bigl|J(z)_{ij}\bigr| = \text{softmax}(z)_i \Biggl[ \Bigl(1-\text{softmax}(z)_i\Bigr) + \sum_{j \neq i} \text{softmax}(z)_j \Biggr]. \] Since \[ \sum_{j \neq i} \text{softmax}(z)_j = 1 -\text{softmax}(z)_i, \] the sum becomes \[ 2\,\text{softmax}(z)_i \Bigl(1-\text{softmax}(z)_i\Bigr), \] which does not depend on \(T\). Since the maximum of the function \(x(1-x)\) on \([0,1]\) is \(1/4\), the maximum row sum is at most \(2 \times \frac{1}{4} = \frac{1}{2}\).

Then, using the mean value theorem, \[ \|\text{softmax}(z) -\text{softmax}(z')\|_1 \leq L_{\text{softmax}}\, \|z -z'\|_\infty, \] where \(L_{\text{softmax}}\) is determined by the norm of the Jacobian and is a constant independent of \(z\).

From there, standard covering number arguments can be applied under the Lipschitz constant of \(L_{\text{softmax}}\) to obtain the (upper) bound \[ \log \mathcal{N}_\infty \leq \frac{C}{\epsilon^2} \cdot \log(mT), \] where the constant \(C\) depends on the norm bounds of the weight matrices and the Lipschitz constants of the functions \(\phi_{\mathrm{in}}\) and \(\phi_{\mathrm{out}}\), and where \(m\) denotes the number of training samples (as we are covering over \(mT\) points).

How Transformers Learn Causal Structure with Gradient Descent (Nichani et al., 2024)

Focuses on how self-attention layers learn latent causal structure when trained via gradient descent. Introduces a new in-context learning task where each input sequence is generated by an underlying causal process modeled as a DAG (directed acyclic graph). Shows that gradient descent on a simplified two-layer, attention-only transformer recovers this latent causal graph (i.e. the gradient of the attention matrix essentially computes the mutual information between tokens). As a special case, the transformer even learns an induction head when sequences are sampled from a Markov chain.

The paper considers a variant of transformers in which, rather than adding each layer's output back into the residual stream, you concatenate them (primarily to make the analysis cleaner). The task uses a finite alphabet \([S]\) and a latent causal graph \(\mathcal{G}=([T],\mathcal{E})\) with the property that whenever \(p(i)\neq\emptyset\), \(\lvert p(i)\rvert=1\) and \(p(i) < i \). You place a prior \( P_{\pi} \) over transition kernels \(\pi \) on \( [S] \). A sequence \( (s_1,\dots,s_T) \) is generated by:

\[ s_i \sim \begin{cases} \mu_\pi & \text{if \(i\) is a root in \(\mathcal{G}\),}\\ \pi\bigl(\,\cdot\mid s_{p(i)}\bigr) & \text{otherwise,} \end{cases} \quad i=1,\dots,T-1. \]

The last observed token \(s_T\) is sampled uniformly from \([S]\), and the target \(s_{T+1}\) by

\[ s_{T+1}\sim \pi\bigl(\,\cdot\mid s_T\bigr). \]

Here the input is \(x=(s_1,\dots,s_T)\) and the label to predict is \(y=s_{T+1}\).
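A small sketch of this data-generating process (a Dirichlet prior over transition kernels and a uniform stand-in for \(\mu_\pi\) are my own simplifications):

```python
import numpy as np

rng = np.random.default_rng(0)
S, T = 5, 12
parent = [None, 0, 1, 0, 3, 2, 5, 4, 7, 6, 9, None]     # p(i) < i, single parent or root

def sample_sequence():
    pi = rng.dirichlet(np.ones(S), size=S)              # one transition kernel pi(. | s) per prompt
    seq = []
    for i in range(T - 1):
        if parent[i] is None:                            # root token: stand-in for mu_pi
            seq.append(rng.integers(S))
        else:
            seq.append(rng.choice(S, p=pi[seq[parent[i]]]))
    s_T = rng.integers(S)                                # last observed token: uniform over [S]
    seq.append(s_T)
    y = rng.choice(S, p=pi[s_T])                         # target s_{T+1} ~ pi(. | s_T)
    return seq, y

x, y = sample_sequence()
print(x, "->", y)
```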

For training, they first run gradient descent on the first attention layer (freezing the second) so it learns the adjacency structure, then train the second layer to recover the transition probabilities. Under mild assumptions on the prior and sequence length, they prove these two steps yield a population loss arbitrarily close to Bayes-optimal.

The Expressive Power of Transformers with Chain of Thought (Merrill and Sabharwal, 2024)

Focuses on how allowing (decoder-only) transformers to generate and condition on \(t(n)\) chain-of-thought tokens expands the class of recognizable languages.

Denote the input as a string \(x\in\Sigma^n\) (with start‐of‐sequence token \(\$\)), the transformer as \(f\) that attends over all previous hidden states and produces one new token at each step, and the decoding budget as \(t(n)\), i.e., the model is allowed \(t(n)\) steps to generate intermediate tokens after reading the \(n\) input tokens, after which its next token must be the accept/reject symbol (1/0). This is written in the paper as \[ \begin{aligned} &f^0(x)=x,\\ &f^{k+1}(x) = f^k(x)\,\bigl\|\,f\bigl(f^k(x)\bigr), \end{aligned} \] where \(f\) recognizes a language \(L\) with \(t(n)\) CoT steps when the model successfully outputs \(f^{t(n)}(x)=1\) if and only if \(x\in L\), and the class of all such languages is denoted \(\mathsf{CoT}\bigl(t(n)\bigr)\).

Main Results

Under standard complexity conjectures, transformer decoders with \(t(n)\) intermediate tokens satisfy \[ \mathsf{TIME}\bigl(t(n)\bigr) \subseteq \mathsf{CoT}\bigl(t(n)\bigr) \subseteq \mathsf{SPACE}\bigl(t(n)+\log n\bigr) \subseteq \widetilde{\mathsf{TIME}}\bigl(t(n)^2 + n^2\bigr), \] where \(\mathsf{TIME}(t(n))\) denotes the set of languages decidable in \(O(t(n))\) time by a multitape TM, \(\mathsf{SPACE}(s(n))\) the set of languages decidable in \(O(s(n))\) space, and \(\widetilde{\mathsf{TIME}}(T)\) the set of languages decidable in time \(T\) up to poly‐\(\log\) factors. They show that \(t(n)\) acts like a computational resource: small \(t(n)\) leaves the model nearly as weak as a standard transformer, whereas sufficiently large \(t(n)\) allows it to approximate (or be as powerful as) any poly‐time algorithm.

Concretely, given \(t(n)=O(\log n)\), then everything the transformer can recognize still lies in the class \(\mathrm{SPACE}\bigl(O(\log n)\bigr) = \mathsf{L}\); in particular, the model cannot decide \(\mathsf{NL}\)‐ or \(\mathsf{P}\)‐complete problems.
When \(t(n)=\Theta(n)\), the model's expressivity is boosted enough to simulate any real‐time Turing machine (\(\mathrm{TIME}(n)\)) and to recognize all regular languages (an \(\mathsf{NC}^1\)‐complete task). However, the model still cannot escape the context‐sensitive boundary since it remains within \(\mathrm{SPACE}(n)\).
When \(t(n)=n^c\) for some constant \(c\), the model's expressivity jumps to \(\mathsf{P}\) and can simulate any poly‐time Turing machine.

Rough Proof Sketches

Auto-Regressive Next-Token Predictors are Universal Learners (Malach, 2024)

Classical PAC learning sees only pairs \((x,f(x))\); in AR learning each intermediate token is both input and label, and CoT provides explicit intermediate computation steps. This paper studies the theoretical difference between next‐token prediction (auto‐regressive, AR) and classical supervised learning, i.e., how much of the expressivity of LLMs comes merely from the AR training scheme + CoT supervision rather than from the architecture itself? In essence, they show that even low-expressivity linear AR predictors trained purely to predict next tokens can simulate any Turing-computable function provided they are trained on CoT data with sufficient length (measured via length complexity).

Notation

Let \(\mathcal{D}\) be a finite token set and denote a context as \(x\in \mathcal{D}^n\), and a CoT sequence as \(z\in \mathcal{D}^T\). An AR function \(h: \mathcal{D}^n \times \mathcal{D}^{ < T} \to \mathcal{D}\) predicts the next token given the context of previous tokens. A hypothesis class \(\mathcal{H}\) of AR functions is said to be AR Learnable if there exists an algorithm that, given i.i.d. samples \((x,z)\sim \mathcal{D}\) of length \(T\), returns \(\hat h\in\mathcal{H}\) such that with high probability, \[ \Pr\bigl[\exists\,t\le T:\hat h(x,z_{ < t})\neq z_t \bigr] \le \epsilon. \] If the algorithm runs in time polynomial in \(n\), \(1/\epsilon\), and \(1/\delta\), they term \(\mathcal{H}\) to be efficiently AR Learnable.
Define the auto-regressive outputs of \(h\) by \[ \begin{aligned} &h^{(1)}(x) = h(x,\emptyset),\\ &h^{(t)}(x) = h\bigl(x,h^{(1)}(x),\dots,h^{(t-1)}(x)\bigr). \end{aligned} \] \(h\) is said to compute a target \(f:\mathcal{D}^n\to\mathcal{D}\) if \(h^{(T)}(x) = f(x)\) for all \(x\).
The length complexity of learning a concept class \(\mathcal{F}\) with AR class \(\mathcal{H}\) is defined as the minimal \(T\) such that every \(f\in\mathcal{F}\) can be computed by some \(h\in\mathcal{H}\) in \(T\) steps.

Main Results

Theorem 3.3 (AR Learnability). Suppose each per-step class \(\mathcal{H}_t\) is PAC-learnable with sample complexity \(m(\varepsilon,\delta)\). Then the product class \(\mathcal{H}_1\times\dots\times\mathcal{H}_T\) is AR-learnable with sample complexity \(m(\varepsilon/T,\delta/T)\).
Theorem 3.3 indicates that any class that is already easy to PAC-learn (for example linear separators) remains easy when used as AR steps; the only cost is a linear shrink in the accuracy parameter.

Theorem 3.6 (Function Approximation). If \(\hat h\) satisfies the AR learning criterion, rolling it out does not amplify error: the distribution of \(\hat h^{(T)}(x)\) converges to that of the true generator \(h^{(T)}(x)\) with the same sample complexity.

Theorem 3.8 (Boolean Circuits). Let \(f:\{0,1\}^n\to\{0,1\}\) be computed by a linear-threshold circuit with \(T\) gates. A linear AR predictor can reproduce \(f\) in exactly \(T\) steps.
Since any Turing machine running in time \(T(n)\) compiles into such a circuit of size \(\mathrm{poly}(T(n))\), a linear AR model trained on CoT traces of comparable length can simulate any time-\(T(n)\) computation.

Theorems 3.11, 3.12 (Parity). For the parity class \(\mathcal{P}_n=\{\chi_A : A\subseteq[n]\}\) defined by \(\chi_A(x)=\sum_{i\in A}x_i\bmod 2\), a linear AR predictor needs only \(O(\log n)\) tokens to compute any member of \(\mathcal{P}_n\). If each AR gate is allowed to compute parity of at most \(k\) bits, the required length is \(\Theta(n/k)\); learning each size-\(k\) gate may cost \(\binom{n}{k}\) time or \(O(k\log n)\) statistical queries, which indicates a length-sample trade-off.

In essence, the paper claims that some capabilities credited to deep transformer stacking might arise from the AR+CoT training regime alone.

A Formal Model of Hierarchical Concept-Learning (Rivest and Sloan, 1994)

Focuses on extending the Valiant-style PAC framework to a "hierarchical concept learning" setting where a complex target concept is broken into a sequence of simpler sub-concepts that are learned one after another, with later concepts allowed to call the earlier ones as building blocks. More concretely, hierarchical concept learning is defined as the following learning regime:

  1. Start with primitive predicates that depend only on the raw input features
  2. Define a slightly richer predicate that may invoke the primitive predicates
  3. Continue stacking more expressive predicates until the top layer produces the final classification of interest
A "hierarchy" is this ordered list of predicates (in learning theory terms, a straight-line program). This hierarchy essentially acts as an explicit CoT and can also be seen as a precursor of modern selective-classification and CoT in LLMs.

Concretely, consider Boolean vectors \(x\in\{0,1\}^n\) drawn from a fixed underlying distribution \(\mathcal{D}\). The target concept is a size-\(s\) circuit composed of \(\mathsf{AND}\), \(\mathsf{OR}\) and \(\mathsf{NOT}\) gates. A teacher reveals a straight-line program \(y_1,\ldots, y_s\) whose final line \(y_s\) equals the target itself. During lesson \(i\) the teacher samples an input \(x\sim \mathcal{D}\) and supplies the label \(y_i(x)\) to the learner. The learner must output a classifier \(Q\colon X\to\{0,1,\bot\}\) that never mislabels an example on which it predicts 0 or 1 and, w.p. at least \(1-\delta\), abstains on no more than an \(\varepsilon\) fraction of \(\mathcal{D}\). After each lesson the learner maintains a version space \(F_i\) consisting of all formulas for gate \(i\) still consistent with the observed data (see "Coherent-Set Learner" below).

This alternate regime transforms learning poly-size Boolean circuits into a "reliable" (never wrong when it makes a prediction) and "probably useful" (abstaining on at most an \(\varepsilon\) fraction of the input distribution w.p. \(1-\delta\)) task. The paper shows that the required number of labelled examples is only polynomial in the natural parameters (\(O\bigl((s/\varepsilon)(\log K+\log(s/\delta))\bigr)\)). The paper also demonstrates a concept class learnable in this hierarchical setting but not learnable in classical PAC theory (under routine cryptographic assumptions).

Coherent-Set Learner (\(\mathsf{CSL}\))

In the paper, each step is processed via \(\mathsf{CSL}\), which works as follows:

  1. Draw fresh inputs until \[ m_i \;=\;\frac{1}{\varepsilon_i}\Bigl(\ln K + \ln\frac{1}{\delta_i}\Bigr) \] coherent samples are collected, where \(K\) is the explicit bound on the single-gate hypothesis class and \(\varepsilon_i,\delta_i\) are the accuracy/confidence budgets allocated to this stage.
  2. Prune the version space \(F_i\): discard every formula \(h\in F_i\) that disagrees with even one of those \(m_i\) labelled samples.
By the finite-class PAC theorem, each remaining formula errs on fewer than \(\varepsilon_i\) of the coherent input distribution w.p. at least \(1-\delta_i\).
They consider a "Reliable Learner" which processes the gates in order and uses \(\mathsf{CSL}\) with \(\varepsilon_i=\varepsilon/(sK)\) and \(\delta_i=\delta/(2s)\). It can be shown that the single-gate hypothesis class has cardinality at most \(K\le 8\binom{n+s-1}{2}+2\), so each run of \(\mathsf{CSL}\) is polynomial in \(n\), \(s\), \(1/\varepsilon\) and \(\log(1/\delta)\).

Main Results

Theorem 1 (Finite-Class PAC Bound). If a hypothesis class \(\mathcal{H}\) is finite, then \[ m \;=\;\frac{1}{\varepsilon}\Bigl(\ln\lvert \mathcal{H}\rvert + \ln\frac{1}{\delta}\Bigr) \] uniformly-random labelled examples suffice to ensure that every \(h\in\mathcal{H}\) consistent with the sample errs on fewer than an \(\varepsilon\) fraction of the underlying distribution with probability at least \(1-\delta\).

The above theorem tells us the required number of coherent examples before all predicates that mislabel one of them can be safely removed with the guarantee that the survivors are \(\varepsilon_i\)-accurate. However, this theorem is useless unless we can actually collect the required number \(m_i\) of coherent examples. The following Lemma shows this happens with overwhelming probability; hence, the algorithm does not stall due to lack of data.

Lemma 2 (Coherent-Sample Guarantee). Assuming the accuracy condition for gates \(1\) through \(i-1\), \(\mathsf{CSL}\) obtains its required coherent sample set with probability at least \(1-\delta/(2s)\).

By the inductive hypothesis, earlier gates are already \(\varepsilon/(sK)\)-accurate, so on a fresh input they all agree with probability at least \(1-\varepsilon/s\). Treating incoherence as a Bernoulli error and applying a Chernoff bound, we can show that after drawing at most \(2m_i\) random examples, the learner can gather \(m_i\) coherent samples required by Theorem 1 with probability at least \(1 -\delta/(2s)\).
Even once each gate has an accurate version space, it must be the case that combining them introduces no new errors. Lemma 3 below shows that if each \(F_i\) is \(\varepsilon/(sK)\)-accurate on the domain where all earlier gates agree, then \(Q\) (the final combined classifier) never outputs a wrong label and abstains on fewer than an \(\varepsilon\) fraction of the full distribution.

Lemma 3 (Global Reliability). With probability at least \(1-\delta\), the Reliable Learner returns a classifier that satisfies the reliable-probably useful guarantee and runs in polynomial time.

Finally, combining the previous results with an inductive union bound gives the final guarantee. Concretely, fix gates \(i=1,\dots,s\). For each \(i\), Lemma 2 guarantees that \(\mathsf{CSL}\) finishes gathering samples unless a bad event of probability \(\delta/(2s)\) occurs. Conditional on finishing, Theorem 1 ensures \(F_i\) is \(\varepsilon/(sK)\)-accurate unless a second bad event of probability \(\delta/(2s)\) occurs.
By a union bound over these two events at gate \(i\), the failure chance is at most \(\delta/s\). Another union bound over all \(s\) gates yields the overall probability of success to be at least \(1-\delta\). On this high-probability event, Lemma 3 applies to guarantee that the final classifier \(Q\) is reliable and probably useful. Plugging in \(\varepsilon_i=\varepsilon/(sK)\) and \(\delta_i=\delta/(2s)\) into the sample-size formula of Theorem 1 and summing over \(i=1,\dots,s\) yields the final polynomial sample-complexity bound \(O\!\Bigl(\tfrac{s}{\varepsilon}\bigl(\ln K + \ln\tfrac{s}{\delta}\bigr)\Bigr).\)

Do Transformers Parse while Predicting the Masked Word? (Zhao et al., 2023)

Keywords: PCFG, masked language modeling, inside-outside algorithm, parse tree

This paper theoretically analyzes the empirical observation that contextual embeddings obtained from attention-based models (e.g., BERT) trained with the masked language modeling objective encode parse-tree information to some extent. Previous empirical studies show that such probing can pick up on semantic cues in natural language and falsely indicate that the embeddings contain syntactic parsing information. This paper removes semantics from the equation by using strings generated with PCFGs as data.

  • There exists a transformer architecture (for both hard and soft attention) that implements the Inside-Outside algorithm to recover parse trees for PCFGs. But, the number of layers and dimensions required is unrealistic.
  • They consider \emph{approximating} the Inside-Outside algorithm with transformers, which yields a more realistically sized transformer architecture without losing too much performance.
  • For a PCFG, the Inside-Outside algorithm yields the optimal solution for the masked language modeling objective.
  • Experimentally, they use a linear probe and show:
    • the contextual embeddings capture the structure of the constituency parse trees.
    • the intermediate-layer representations of the transformer can be used to predict marginal probabilities computed in the Inside-Outside algorithm.
Punchline: implementing the inside-outside algorithm exactly with a transformer requires an unrealistic number of layers, but transformers can approximate the inside-outside algorithm with a realistic number of layers.

Pay Attention to MLPs (Liu et al., 2021)

This paper studies the necessity of self-attention layers in transformers by proposing a new architecture called gMLP, which incorporates cross-token interactions but does not utilize self-attention layers. Given input \(X \in \mathbb{R}^{n\times d}\), one vanilla gMLP layer is written as: \[ \Big(\sigma(XU) \odot (W \sigma(XU) + b) \Big) V + X, \] where \(\sigma(\cdot)\) denotes an activation function, \(U\) and \(V\) are learnable channel projections, and \(W\in \mathbb{R}^{n\times n}\) is a learnable matrix (no special constraints) that mixes across tokens. They refer to the element-wise product \[ s(\cdot) = \sigma(XU) \odot f_{W, b}(\sigma(XU)) = \sigma(XU) \odot (W \sigma(XU) + b) \] as the "spatial gating unit," hence the "g" in "gMLP." The cross-token "mixing" is handled by \(W\); for empirical training stability, \(W\) needs to be initialized near zero and \(b\) to all ones. It is also shown that when one additional self-attention module is included in gMLP in the form \[ s(\cdot) = \sigma(XU) \odot \big( (W \sigma(XU) + b) + A(X) \big), \] where \(A(X)\) denotes a single self-attention head, better performance than both transformers and gMLP is achievable.
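A minimal numpy sketch of one gMLP block under the shapes described above (the specific dimensions and the ReLU activation are my assumptions, not the paper's exact configuration):

```python
import numpy as np

def gmlp_block(X, U, V, W, b):
    """Minimal gMLP block sketch: X is (n, d), U is (d, d_ff), V is
    (d_ff, d), W is (n, n) mixing across tokens, b is (n, 1)."""
    Z = np.maximum(X @ U, 0.0)          # sigma(XU), with ReLU as the activation
    gated = Z * (W @ Z + b)             # spatial gating unit s(.)
    return gated @ V + X                # project back to d and add the residual

n, d, d_ff = 8, 16, 32
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d))
U = rng.normal(size=(d, d_ff))
V = rng.normal(size=(d_ff, d))
W = np.zeros((n, n))                    # near-zero init for W ...
b = np.ones((n, 1))                     # ... and ones for b, as the notes describe
print(gmlp_block(X, U, V, W, b).shape)  # (8, 16)
```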

  • gMLP performs better on vision tasks than transformers with the same number of parameters and training time.
  • gMLP trained on the masked language modeling objective has better pre-training perplexity than BERT, and does comparably to BERT on NLP tasks that do not require cross-sentence alignment.
  • gMLP requires roughly 3× the number of parameters to match BERT's performance on natural language tasks that require pairwise attention.
Punchline: This architecture is arguably simpler than self-attention and achieves comparable performance on vision tasks, but not on natural language tasks that require token mixing.
-> Perhaps self-attention is comparably "cost-efficient" in terms of parameters?

Limits to Depth-Efficiencies of Self-Attention (Levine et al., 2020)

  • When the depth is less than a polynomial factor of the width, increasing depth improves separability.
  • When the depth exceeds a polynomial factor of the width, increasing depth has minimal benefit; a polynomially wide network can surpass an arbitrarily deep one.
  • Empirically, for sufficiently small networks, deep networks perform worse than shallow networks, i.e., for a fixed width, a network can be too shallow (where performance increases by adding more layers) or too deep (where performance degrades).
Punchline: Depth is not all you need for self-attention networks, in contrast to other neural architectures.

Are Sparse Autoencoders Useful? A Case Study in Sparse Probing (Kantamneni et al., 2025)

We lack a ground truth to know whether SAEs truly extract the concepts used by language models (similar in spirit to how correlation does not indicate causation). This paper attempts to examine whether the representations (latents) learned by SAEs are truly "better" than standard probing-based approaches.

The paper uses classification as the testbed: given a prompt and a binary label, can we train a linear classifier on the model's hidden state to predict the label? They use Gemma-2-9B and Llama-3.1-8B as the backbones and pass prompts from each dataset to collect the residual stream at different layers, on which different probes (logistic regression, XGBoost, KNN, an MLP) are trained. Concurrently, they also train an SAE, take prompts from the dataset for both labels, then pick the \(k\) SAE latents with the largest magnitude difference between the mean representations for the two labels, namely: \[ \mathcal{I} = \textup{ArgTopK}_{i} \left| \frac{1}{|T_1|} \sum_{j \in T_1} Z_{j,i} - \frac{1}{|T_0|} \sum_{j \in T_0} Z_{j,i} \right|, \] which are then used to train a (norm-constrained) logistic regression model.
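A rough sketch of the SAE-probe recipe as I read it (the norm constraint on the logistic regression is omitted, and the random data is just a stand-in for SAE latent activations):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def sae_probe(Z, y, k):
    """Z is (num_examples, num_latents) of SAE latent activations, y holds
    binary labels. Pick the k latents with the largest |mean difference|
    between classes, then fit a logistic regression on just those latents."""
    mean_diff = Z[y == 1].mean(axis=0) - Z[y == 0].mean(axis=0)
    top_k = np.argsort(-np.abs(mean_diff))[:k]
    probe = LogisticRegression(max_iter=1000).fit(Z[:, top_k], y)
    return top_k, probe

rng = np.random.default_rng(0)
Z = rng.exponential(size=(200, 1024))        # stand-in for SAE activations
y = rng.integers(0, 2, size=200)
idx, probe = sae_probe(Z, y, k=16)
print(idx[:5], probe.score(Z[:, idx], y))
```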

They then use what they term a "quiver and arrow" approach, effectively an ablation that measures how frequently the SAE probe (the "arrow") outperforms the other probing methods (the "quiver": logistic regression, MLP, \(k\)-NN, etc.): each probe is trained on a training set, compared in terms of ROC-AUC on a validation set, and the selected probe is then applied to the test set.

On ~100 binary classification datasets, they show that adding the SAE probe does not typically yield consistent improvements. Under standard conditions (balanced data, sufficient training samples), logistic regression or XGBoost on raw hidden states beats or matches the SAE-based probe. Under "harder" regimes (Data Scarcity, Class Imbalance, Label Noise), while they hypothesize that the "interpretable-latent" structure might help, they still find that the SAE-probe approach does not reliably beat the best standard baselines in these settings either. On out-of-distribution test sets, standard probes (e.g., logistic regression on raw hidden states) either tie or win. Even though the SAE latents can be "labeled/interpreted," they still do not provide robust improvements over simpler methods in terms of raw classification metrics on real tasks.

Authors also note that multi-token pooling or other specialized modifications occasionally see some gains. They also emphasize that you should compare equally tuned baselines. In many prior works that showed SAE outperforms baseline, the baseline was not equivalently tuned, so the difference might vanish once strong baselines are used.

Thoughts
  • Perhaps, somehow, SAEs work for certain types of concepts but not for others (likely can be attributed to the training/data mixture properties)?
  • Bad classification error is not quite equivalent to unfaithful/uninterpretable representations for \(k>1\), but it still weakly measures how well-representative the \(k\)-dimensional subspace is at capturing the concept as a whole. Perhaps linear classification for the \(k>1\) case is not what we want to be doing as an experiment, and instead some form of component analysis would be better?

Calibrated Language Models Must Hallucinate (Kalai and Vempala, 2024)

Focuses on understanding hallucinations in LLMs. Shows that even in an idealized setting (perfectly factual training data, no prompts, one fact per document), any LM that is a calibrated predictor will necessarily hallucinate at a nontrivial rate because of a lower-bound tied to the Good-Turing missing-mass phenomenon (facts that appear only once in training can't be distinguished from unseen and potentially false facts).

Problem Setup

For the set of all possible documents \(X\), denote the set of factoids (each either a true fact or a hallucination) as \(Y\). Each \(x\in X\) carries exactly one factoid \(f(x) \in Y\) or a special empty symbol \(\perp\). Consider a language distribution \(D_L \in \Delta(X)\) drawn as \(D_L \sim D_\text{world}\) (i.e., documents are included in the support of \(D_L\) with some probability), and train an LM on \(n\) i.i.d. samples drawn from \(D_L\) via a learning algorithm \(\mathcal{A}\) to obtain \(D_\text{LM} \in \Delta(X)\). The induced "factoid distributions" are \(p = f \circ D_L\) (true-fact distribution) and \(g = f \circ D_\text{LM}\) (what the model generates).

Facts are defined as \(F = \{y \in Y \mid p(y) > 0\} \cup \{\perp\}\) and hallucinations as \(H = Y \setminus F\). This setup can be viewed as a missing-mass problem: denoting the facts observed during training as \(O\) and the unobserved ones as \(U = Y\setminus O\), \(p(U)\) is the total true-fact probability mass never seen in training. The "monofact estimator" (a Good-Turing estimator) can be used to estimate this: \[ \widehat{\text{MF}} = \frac{\#\{\text{facts seen exactly once}\}}{n}, \] which concentrates near \(p(U)\) at rate \(O(1/\sqrt{n})\). The missing-mass problem is the building block for the main result, and these two quantities are used to state the hallucination lower bound.
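A quick sketch of the monofact estimator (my own, trivially small):

```python
from collections import Counter

def monofact_estimate(observed_factoids):
    """Good-Turing style monofact estimator: the fraction of training
    factoids seen exactly once, used as a proxy for the unseen mass p(U)."""
    counts = Counter(observed_factoids)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(observed_factoids)

print(monofact_estimate(["a", "b", "b", "c", "d", "d", "d"]))  # 2/7
```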

Results

Under clean-data (no false statements in the training set), one-factoid-per-document, and sparsity/regularity (most conceivable factoids are false and unseen true factoids are individually rare) assumptions, the following holds for any learning algorithm: \[ g(H)\ge \widehat{\text{MF}} - \mathrm{MC}(p,g) - O(e^{-\text{sparsity}}) - O(n^{-1/2}), \] where \(\mathrm{MC}(p,g)\) is a total-variation measure of how far \(g\) is from perfect calibration. If the model is well-calibrated, the dominant term is \(\widehat{\text{MF}}\), so the LM must hallucinate roughly as often as the fraction of facts that appeared exactly once in training.
Proof idea: When \(D_\text{LM}\) perfectly estimates \(D_L\) (i.e., the LM is calibrated), \(g(U) \approx p(U)\) holds. \(g(H) = g(U) - g(F \cap U)\) can be shown to hold, but \(g(F \cap U)\) is negligible (due to the sparsity and regularity assumptions), so \(g(H)\) cannot be much smaller than \(p(U)\). This indicates that the LM assigns a non-trivial probability to hallucinations, hence the result.

The lower bound is mainly relevant to "arbitrary" factoids (facts that cannot be checked by a simple rule/logic). For well-defined facts, there is no statistical reason to hallucinate besides reasons such as insufficient model capacity; hence, one remedy would be to selectively de-calibrate or resort to retrieval only when dealing with arbitrary facts while preserving calibration elsewhere.

Hallucination, Monofacts, and Miscalibration: An Empirical Investigation (Miao and Kearns, 2025)

Focuses on empirically testing the theoretical result of Kalai and Vempala (2024): if an LM is calibrated, then no matter how the model is trained, it must hallucinate at least as often as the fraction of arbitrary facts in its training data (i.e., the "monofact" rate) minus the miscalibration rate. Tests this hypothesis in two settings: bigram models (synthetic movie-fact sequences) and in-context learning with Gemini on synthetic data. Shows that the result holds in practice both for simpler models and real LLMs. Proposes that, instead of aggressively deduplicating training data, one might re-introduce carefully chosen duplicates post-training to reduce hallucinations without harming overall performance.

Foundational Challenges in Assuring Alignment and Safety of Large Language Models
(Anwar et al., 2024)
(Section 3.4 Only)

Existing methods for interpreting model behavior lack faithfulness. This section touches on interpretability-based methods (think mechanistic interpretability) and explainability-based methods (generating natural language explanations).
Critical challenges include:

  • Interp. methods assume a priori that internal model reasoning works in specific ways, which leads to questionable assumptions.
    For example, Integrated Gradients and Shapley values assume non-linear behavior can be sufficiently explained with linear models, which turns out to have arbitrarily bad failure cases; interpreting individual neurons → they may be polysemantic; linear probing easily learns spurious features instead of the actual features.
  • Neural networks do not have incentive to be inherently interpretable/explainable.
    Particularly a concern for domains where AI outperforms humans, because explanations understandable by humans may not exist or be possible. Representation alignment helps AI models learn human-aligned representations, but it works in a human-in-the-loop manner and is therefore hard to scale up.
  • Scalable evaluation of the faithfulness of a generated explanation is hard.
    Measuring faithfulness requires access to the true underlying internal reasoning, but this is inaccessible (which is why we want explainability to begin with). An important direction is the need for benchmarks that standardize metrics of success and help create better standards for evaluating explanation faithfulness. In particular, for detecting alignment failures, metrics that focus on worst-case behavior are necessary.
  • Desirably identified patterns should lead to modifiable behavior, but unclear how this can be done.

Challenges for explainability (i.e., externalized reasoning) lie primarily in faithfulness and steerability: attempts to make reasoning visible in natural language (e.g., chain-of-thought-style methods) can be misleading/unfaithful and insensitive to "edits" made to the reasoning. Open questions include:

  • Understanding to what extent externalized reasoning is causally responsible for improved performance on various reasoning tasks.
  • Understanding the extent to which the training directly or implicitly incentivizes unfaithfulness.

Normalized Word Embedding and Orthogonal Transform for Bilingual Word Translation
(Xing et al., 2015)

Keywords: Cross-lingual alignment, linear transformation, orthogonal transformation

Uses static word embeddings (word2vec-style). Claims Mikolov et al. (2013)'s approach is ill-posed because the same similarity metrics are not used during training and testing. Proposes to length-normalize vectors during training.

  • First paper to propose the orthogonal transform approach within projection-based methods for cross-lingual alignment.
  • Approach for learning linear projection between embeddings of different dimensions is ad-hoc.
  • Linear projection is learned using data from Google Translate.

Thoughts:

  • Wonder if unsupervised learning of the projection is possible.

Learning principled bilingual mappings of word embeddings while preserving monolingual invariance
(Artetxe et al., 2016)

Keywords: cross-lingual alignment, linear transformation, orthogonal transformation

Direct follow-up to Xing et al. (2015). Combines the optimization objectives of Xing et al. (2015) and Faruqui and Dyer (2014). Evaluates the word embeddings and the learned transformation on translational similarity (how close the mapped vector from language \(A\) is to the word in language \(B\) with the "same" meaning) and evaluates the word embeddings alone on an analogy-solving task.

  • Length-normalization matters little once the transformation is constrained to be orthogonal.
  • Adds a mean-centering constraint to the optimization and states that the optimization is equivalent to Faruqui and Dyer (2014), but without manually altering the monolingual embeddings.

Offline bilingual word vectors, orthogonal transformations and the inverted softmax
(Smith et al., 2017)

Keywords: cross-lingual alignment, orthogonal transformation, hubness problem

Follow-up work to the orthogonal-transformation literature. Shows existing methods can be unified into an optimization procedure using SVD. Proves orthogonal transformations are self-consistent (similarity between words in languages A and B should be the same in either direction, i.e., \(S_{i,j} = S_{j, i}\)). Introduces an "inverted softmax" approach to choosing the translated word after a linear map is learned, which combats the hubness problem by ensuring hubs (words that are the nearest neighbor of many other words) are not chosen as the translation with high probability. Also shows a "pseudo-dictionary" (a dictionary between two languages created by counting the number of shared characters between words) is sufficient for achieving comparable (or better) performance to the supervised (with dictionary) case. Moreover, instead of learning word-word transformations, a sentence-sentence transformation learned on the word embeddings fares well on both sentence and word translation.
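A minimal sketch of the closed-form orthogonal (Procrustes) solution via SVD that this line of work relies on (dictionary extraction and normalization steps are omitted; the data here is synthetic):

```python
import numpy as np

def orthogonal_map(X, Y):
    """Orthogonal Procrustes sketch: given row-aligned embedding matrices
    X (source) and Y (target) for dictionary pairs, return the orthogonal W
    minimizing ||XW - Y||_F, via the SVD of X^T Y."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 300))
true_W, _ = np.linalg.qr(rng.normal(size=(300, 300)))  # a random rotation
Y = X @ true_W
W = orthogonal_map(X, Y)
print(np.allclose(W, true_W, atol=1e-6))  # recovers the rotation
```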

  • A pseudo-dictionary is enough to achieve good performance (42% for 1-NN, 59% for 5-NN).
  • Orthogonal transformations learned between word-word and sentence-sentence pairs achieve the same performance on word-word translation.
  • When the sentence-sentence transformations are applied to word-word translation, using the inverted softmax greatly boosts performance.

Thoughts:

  • What's the main difference between this paper and Artetxe et al. (2016)? Seems like the "unifying" part is very similar.
  • Are the linear transformations learned in the word-word case and sentence-sentence case very similar?
  • Surely, pseudo-dictionaries are a very bad source to obtain word correspondences. Is the good performance because of the orthogonal transformation or inverted softmax?

Learning bilingual word embeddings with (almost) no bilingual data (Artetxe et al., 2017)

Keywords: cross-lingual alignment, bilingual lexicon induction

Using previous methods that learn a linear transformation, considers a bootstrapping method that starts from a very small dictionary and iteratively improves it by re-learning the transformation and regenerating the word-word dictionary from nearest neighbors of the transformed vectors. The learned transformation tends to be the same regardless of the starting dictionary, and the observed errors are very similar to those of Artetxe et al. (2016).

  • Bootstrapping from a very small (~25 word pairs) dictionary performs as well as using a large dictionary.
  • That a very small starting dictionary leads to a final dictionary similar to Artetxe et al. (2016) indicates that learning the transformation in an unsupervised manner could be possible.
  • However (while not mentioned in the paper), performance does correlate with the similarity between the two languages.

Thoughts:

  • This likely does not work on two very different languages, because the text embeddings learned in the languages likely take a very different structure.
  • At each time step, are we using all words as anchors or only the most "confident" words?

Word translation without parallel data (Conneau et al., 2018)

Keywords: cross-lingual alignment, unsupervised, hubness problem

Introduces a cross-lingual adversarial alignment method that performs on par with or better than previous data-dependent methods in an unsupervised setting, along with a scoring metric to evaluate the goodness of the transformation during training. Also mitigates the hubness problem by considering the same idea as in Artetxe et al. (2017) and introducing CSLS (intuition: the similarity between two words is discounted by each word's average similarity to its nearest neighbors; a hub takes a large average similarity value and is thus penalized).
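A small sketch of CSLS scoring as I understand it (assuming the cosine-similarity matrix between mapped source and target vocabularies is already computed):

```python
import numpy as np

def csls(sim, k=10):
    """CSLS sketch: sim is the (|src|, |tgt|) cosine-similarity matrix of
    mapped source vs. target embeddings. Each score is discounted by the
    mean similarity of the source word to its k nearest targets and of the
    target word to its k nearest sources, which penalizes hubs."""
    r_src = np.sort(sim, axis=1)[:, -k:].mean(axis=1)  # per-source hubness
    r_tgt = np.sort(sim, axis=0)[-k:, :].mean(axis=0)  # per-target hubness
    return 2 * sim - r_src[:, None] - r_tgt[None, :]

rng = np.random.default_rng(0)
S = rng.normal(size=(5, 7))
print(csls(S, k=3).shape)  # (5, 7)
```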

  • First paper to introduce a fully unsupervised way to learn orthogonal transformations between two language embeddings.
  • Works reasonably well for relatively different languages too (English-Esperanto).

Thoughts:

  • Does this not assume the embedding space structure of the two languages are very similar? English and Esperanto are arguably relatively similar.
  • Will this still work when the vocabulary size of the two languages are drastically different?

Improving Cross-Lingual Word Embeddings by Meeting in the Middle (Doval et al., 2018)

Keywords: cross-lingual transfer

Based on previous work in meta-learning where averaging embeddings learned under different settings results in better-performing embeddings, this paper learns an additional linear transformation after an orthogonal transformation has been learned; this second transformation maps each pair of corresponding words to the average of their two embeddings. Applying this mapping directly to the original embeddings does not work.

Loss in Translation: Learning Bilingual Word Mapping with a Retrieval Criterion (Joulin et al., 2018)

Keywords: cross-lingual transfer, hubness problem

Argues that solutions to the hubness problem are unsatisfactory "because the loss used for inference is not consistent with that employed for training." Incorporates CSLS into the training objective when learning the transformation. Argues that removing the orthogonality constraint does not degrade word analogy and word similarity performance when transforming Spanish embeddings into English. Essentially, incorporates a CSLS variant into training.

Thoughts:

  • I didn't see why the unsatisfactory part of previous work is unsatisfactory.

Learning Multilingual Word Embeddings in Latent Metric Space: A Geometric Approach
(Jawanpuria et al., 2018)

Keywords: cross-lingual alignment, linear transformation

Follow-up work to previous cross-lingual alignment works that employ linear transformations, but correspondences between multiple languages are learned simultaneously by mapping them into a "latent space." Considers an individual orthogonal transformation for each language, followed by a universal positive-definite matrix; the individual transformations and the universal matrix are jointly trained. Also considers the low-resource setting where dictionary data is scarce and shows that bootstrapping to iteratively improve the dictionary achieves performance on par with previous work (Artetxe et al. (2017)).

  • Can learn transformations between multiple languages simultaneously.
  • Also functions on par with existing approaches in low-resource settings.

Thoughts:

  • Unclear if the orthogonal transformations are trained all at once, or each language pair is trained individually.

Cross-Lingual Alignment of Contextual Word Embeddings, with Applications to Zero-shot Dependency Parsing (Schuster et al., 2019)

Keywords: cross-lingual transfer, linear transformation, contextualized embeddings, parsing

Considers contextualized word embeddings (from ELMo) and uses anchors (the average of all contextualized embeddings of the same word) as representation for each word. Considers three settings: 1) when dictionary is available, apply standard linear transformation learning methods to anchors, 2) Apply unsupervised learning (as in Artetxe et al. (2017), Conneau (2018), etc.) to anchors, and 3) apply standard linear transformation learning methods to contextualized embeddings. Also considers low-resource target language case, in which they consider adding a regularization term on the Euclidean distance between embeddings from the source and target dictionary.

  • First paper to use contextualized word embeddings.
  • Model can perform well zero-shot, i.e., train on different language than testing language.
  • Uses anchors to capture one representation for each word.

Thoughts:

  • Surely we can do more than just apply existing methods for static word embeddings?

How Multilingual is Multilingual BERT? (Pires et al., 2019)

Keywords: cross-lingual transfer, contextualized embeddings

mBERT performs well on zero-shot POS tagging across languages that have similar typological properties (even when the tokens are not the same) but poorly across dissimilar languages. When considering an alternate representation of a language (e.g., Hindi written in Latin script), the model fails. When a language trained in one script (e.g., Hindi) is tested on an alternate script (i.e., Devanagari), the model performs comparably well to the original language. Model performance on translation is maximized when the contextualized representations of the middle layers are used (the authors attribute the diminishing performance of later layers to the lack of fine-tuning).

  • mBERT (BERT trained with the masked language modeling objective on 104 languages) can generalize a fine-tuned task (e.g., NER) to a different language.
  • Transferability diminishes as typological similarity between languages decreases, i.e., syntactic structure is the main source of information necessary for mBERT to perform zero-shot inference.

Thoughts:

  • Perhaps OOD generalization for syntactically similar languages indicates the model is learning some form of a generalized parser?

Cross-Lingual BERT Transformation for Zero-Shot Dependency Parsing (Wang et al., 2019)

Keywords: cross-lingual alignment, linear transformation

Considers learning a linear transformation on contextualized embeddings from mBERT, denoted CLBT. Learns a mapping from source-language embeddings to target-language embeddings and plugs the transformed embeddings into neural parsers. Performance is better than mBERT and XLM. In short, a learned linear mapping between contextualized embeddings of two different languages.

Thoughts:

  • I don't see any novelty in the paper beyond what Schuster et al. (2019) already showed.

Beto, Bentz, Becas: The Surprising Cross-Lingual Effectiveness of BERT (Wu and Dredze, 2019)

Keywords: cross-lingual transfer

A case study of mBERT on five NLP tasks, extending Pires (2019). Shows that mBERT's representations perform on par with, and sometimes better than, cross-lingual embeddings. Moreover, shows how mBERT's performance varies depending on which layer's representation is used; using later layers' embeddings leads to worse performance. Representations from all layers perform well on language identification. Shows cross-lingual transfer performance correlates with the number of subtoken overlaps between languages.

  • mBERT performs on par with cross-lingual alignment methods.
  • Cross-lingual transfer performance depends on subtoken alignment rate.
  • mBERT is good at identifying languages (i.e., representations of different languages are clearly mapped to different regions of the representation space).

Thoughts:

  • Results slightly differ from Pires (2019); they show cross-lingual transfer performance is good even with completely different tokens. I suppose this study didn't take into account the typography of the language when studying subtoken overlap?

Are Girls Neko or Shōjo? Cross-Lingual Alignment of Non-Isomorphic Embeddings with Iterative Normalization (Zhang et al., 2019)

Keywords: cross-lingual alignment

Considers a method to preprocess static word embeddings so that applying an orthogonal transformation between language embeddings with different structures is suitable. The method simply iteratively length-normalizes the word vectors and then mean-centers them, until convergence. Performance boosts on word translation accuracy are most significant for languages distant from English (Japanese, Chinese), whereas the gains for similar, high-resource languages (Spanish) are minimal.
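A short sketch of the iterative normalization preprocessing as I understand it (using a fixed iteration count instead of a convergence check):

```python
import numpy as np

def iterative_normalization(E, iters=5):
    """Repeatedly length-normalize every word vector and then mean-center
    the vocabulary, approximating the fixed point where both hold."""
    E = E.copy()
    for _ in range(iters):
        E /= np.linalg.norm(E, axis=1, keepdims=True)  # unit length
        E -= E.mean(axis=0, keepdims=True)             # zero mean
    return E

rng = np.random.default_rng(0)
E = iterative_normalization(rng.normal(size=(1000, 300)))
print(np.abs(E.mean(axis=0)).max(), np.linalg.norm(E, axis=1).mean())
```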

Thoughts:

  • What about other tasks besides word translation?
  • Does the same idea hold for contextualized embeddings?

Multilingual Alignment of Contextual Word Representations (Cao et al., 2020)

Keywords: multilingual alignment

Considers fine-tuning mBERT using a large cross-lingual parallel corpus such that the contextualized representations of words with the same meaning are pulled closer (with some regularization term). Introduces contextual word alignment task, evaluates all models on it and NLI. Fine-tuning mBERT representations improves cross-lingual alignment too.

Thoughts:

  • What about performance on other tasks (XNLI, etc.)?
  • Not quite convinced this new proposed task is a worthy replacement benchmark of existing ones.

Do Explicit Alignments Robustly Improve Multilingual Encoders? (Wu and Dredze, 2020)

Keywords: cross-lingual alignment, fine-tuning, mBERT, XLM-R

Considers follow-up work to fine-tuning multilingual LMs in which contrastive learning objectives (weak and strong) are considered. More extensive experiments (in the sense that multiple seeds are used for the same model) show that existing methods (linear transformation, \(L_2\) alignment (Cao (2020))) do not improve performance over mBERT's representations, and contrastive learning improves performance only marginally. mBERT is also beaten by XLM-R, which is trained on more data, by 3 points on every task, and even XLM-R base beats it, indicating that increasing the amount of data and model size is far more effective than existing methods.

  • Contrastive learning objective marginally improves performance on various tasks (XNLI, NER, ...).
  • Increasing the amount of data and model size improves performance the most.

Thoughts:

  • Can we find a way to achieve same performance with small models (or boost performance of larger models) with better data usage?

X-FACTR: Multilingual Factual Knowledge Retrieval from Pretrained Language Models
(Jiang et al., 2020)

Built X-FACTR dataset (same structure as LAMA, but across 23 languages). Implicitly investigates cross-lingual consistency (Section 6). Proposes fine-tuning model on code-switched data (e.g., creating training samples by swapping: "Obama later reflected on his years" -> "Obama (in some other language) later reflected on his years").

Thoughts:

  • Dataset is rigorously designed.
  • Similar motivation to "Multilingual LAMA: Investigating Knowledge in Multilingual Pretrained Language Models," which is mentioned by mLAMA's author.

Factual Consistency of Multilingual Pretrained Language Models (Fierro et al., 2022)

"Builds” mPaRaRel dataset by translating PaRaRel (38 types of cloze-style questions with ~9 paraphrased templates) into 45 languages. Tests monolingual consistency in different languages with mBERT (110M) and XLM-RoBERTa (560M). Shows other languages have worse consistency than English, which has already been shown to be monolingually inconsistent. Consistency is measured simply by the total number of matches across all templates (i.e., if the answer is the same for two queries that are paraphrases of each other, this is considered to be consistent). Shows results for accuracy, consistency, and consistency-accuracy (ones that are both consistent and accurate).

Thoughts:

  • Only investigates monolingual consistency in different languages. Very different from cross-lingual consistency.

Self-Consistency Improves Chain of Thought Reasoning in Language Models (Wang et al., 2023)

Keywords: chain-of-thought reasoning, self-consistency

Introduces the self-consistency method: conduct multiple-path decoding for chain-of-thought reasoning and choose the most consistent final answer by majority vote. Leads to improved accuracy on arithmetic and commonsense reasoning tasks using LLMs (20B ~ 540B). Robust to sampling (decoding) strategies and imperfect prompts. Self-consistency helps when CoT hurts performance (not stated specifically when this happens). Also attempts a "normalized" (softmaxed) weighted sum w.r.t. decoding probabilities, which provides results comparable to majority vote.
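The aggregation step is just a majority vote over sampled final answers; a trivial sketch (extracting each final answer from the sampled CoT decodes is assumed to already be done):

```python
from collections import Counter

def self_consistency(final_answers):
    """Return the majority-vote answer over several sampled CoT decodes."""
    return Counter(final_answers).most_common(1)[0][0]

print(self_consistency(["18", "18", "17", "18", "20"]))  # "18"
```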

Thoughts:

  • Can consider using weak translators to translate original question to different language and conduct "self-consistency." Weak translators should suffice because this method is said to be robust to prompt quality (or perhaps this is only in English?).

Self-Refine: Iterative Refinement with Self-Feedback (Madaan et al., 2023)

Keywords: iterative refinement, self-feedback

Proposes self-refine: get an LLM output, ask the model itself for feedback, and get a new LLM output with the original output and feedback as input. Evaluated on a wide range of tasks. Results: adding feedback (even non-specific feedback, e.g., "make the code more efficient") significantly boosts performance (10~30%) across three tasks, and using more informative feedback increases performance a bit more. Performance increases as more iterations of self-refine are run. Comparing self-refine to generating \(k\) different outputs, humans still preferred self-refine outputs over all \(k\) outputs. Self-refine does not work for relatively smaller models (~13B, e.g., Vicuna-13B). When self-refine fails to improve the original output, the majority cause is incorrect feedback. Self-refine still works when feedback is partially incorrect.

Thoughts:

  • Not clarified what the stop condition is (which is later "exposed" in Huang et al. (2023)).
  • Unclear whether self-refine works in other languages (mentioned in Limitations section). Perhaps the capability to self-refine only emerges given sufficient data? Do multilingual models have the same property or not?

Measuring and Narrowing the Compositionality Gap in Language Models (Press et al., 2023)

Keywords: compositionality gap, multi-hop reasoning

Defines the \emph{compositionality gap}: while LLMs can answer factual (1-hop) questions correctly, they struggle on multi-hop questions that compose several such facts. Builds two datasets: Compositional Celebrities (CC), which composes facts using 17 different types of templates, and Bamboogle, a small set of hand-crafted 2-hop questions. Shows that for GPT-3, 1-hop accuracy can be around 80% while 2-hop accuracy drops to 1.2%. Proposes self-ask: ask the LLM to break the multi-hop question down into 1-hop follow-up questions, answer each 1-hop question individually, then combine them to answer the original 2-hop question. Leads to a significant boost in performance, even compared to chain-of-thought.

Thoughts:

  • Great paper. Potential flaw is when using self-ask, the final answer would be incorrect if the 1-hop question is incorrectly answered.

Large Language Models Can Self-Improve (Huang et al., 2022)

Keywords: self-improvement, chain-of-thought, self-consistency

Proposes \emph{Language Model Self-Improved} (LMSI): given unlabeled questions, use CoT reasoning and "self-consistency" to generate labels, then fine-tune the model on this pseudolabeled data. Tested on a 540B model. Improves performance on in-domain tasks (tasks the unlabeled data is taken from) and, marginally, on OOD tasks. Pseudolabeled prompts are diversified using four possible templates to prevent the model from producing only one style of output. Also studies augmenting data: training on self-generated data increases performance, but not as much as training on existing data. Also tries distillation: an 8B distilled model beats the 62B model, and a 62B distilled model beats the 540B model (unclear which dataset is used for this).

Thoughts:

  • Does training in language \(A\) boost performance in language \(B\) using LMSI?

Large Language Models are Better Reasoners with Self-Verification (Weng et al., 2023)

Keywords: self-verification, reasoning

Proposes self-verification:

  1. LLM generates multiple answers (Forward Reasoning).
  2. Masks factual information in the input question and asks the LLM to answer the masked part (Backward Verification) \(k\) times.
  3. The forward-reasoning answer with the highest consistency (out of \(k\) attempts, how many times the masked information was correctly recovered) is chosen as the final answer.
Model performance is slightly increased for LLMs, but the method does not work for smaller models. Combining this with self-consistency and other methods increases performance slightly further.

Thoughts:

  • A bit misleading: it doesn't show that LLMs can self-verify, they show that they can use a "trick" using LLMs to improve performance on reasoning questions. Moreover, don't agree that this is "self-verification": their approach is simply Wang et al. (2023) but slightly more complicated. The model itself is not verifying its own answer.
  • Don't believe this method can be applied to slightly more complicated questions than GSM8K. This method assumes that the questions are "simple enough" such that they are correctly answerable most of the time (because their "Backward Verification" is essentially doing the same thing as "Forward Reasoning").

LM vs. LM: Detecting Factual Errors via Cross Examination (Cohen et al., 2023)

Keywords: factual error detection, cross examination

Proposes LM vs. LM as a method to automatically detect factual errors (given factual statement, want to ensure examiner can accurately output whether statement is correct or not):

  1. Input statement to examiner to generate list of follow-up questions.
  2. Ask the examinee to answer all questions, whose outputs are passed to the examiner. (Optional) The examiner is asked whether further follow-up questions are required.
  3. Given set of examiner questions and examinee answers, examiner checks whether examinee's answers are consistent.
Testing: uses closed-book, open-ended datasets. Also creates a set of "false" claims to test how well LLMs can detect false claims. Results indicate LM vs. LM performs better than existing methods by 5~10%. LM vs. LM fails when the examinee provides incorrect but consistent outputs. 9~15% of the time, the LLM produces incoherent output. The examinee typically outputs incorrect answers when the original statement is false. Not exactly about the self-correcting abilities of LLMs; more so a method for conducting consistency checks with LLMs.

Large Language Models Cannot Self-Correct Reasoning Yet (Huang et al., 2023)

Keywords: self-correction, reasoning

An analytical study that questions previous papers on "self-correction/self-improvement/etc." and examines whether LLMs can self-correct without any form of external feedback (e.g., terminating iterative prompting when the model generates the correct output, or changing prompt instructions because a previous prompt didn't work; this counts as external feedback since it leverages information about which prompts fail after testing them). Tests three-step prompting (initial generation, critique of the previous output, generating a response using the initial output and critique). Shows that previously reported positive results use external feedback in some way. Uses datasets ranging from arithmetic reasoning to question answering. Shows that LLM performance decreases under the self-correction regime with no external feedback (the model is more likely to change a correct answer to an incorrect one than vice versa); their intuition is that "asking the model to assess the model's output might skew the model towards changing its answer." Self-correction is less effective for tasks where LLMs cannot easily identify errors or assess the correctness of their outputs. However, self-correction is useful for altering text style and improving appropriateness.

Thoughts:

  • Paper suggests that searching within the answer space of the LLM can be done robustly with external feedback (or self-consistency (Wang et al. (2023))). Perhaps we can do even better by employing self-consistency across languages?

Self-contrast: Better Reflection through Inconsistent Solving Perspectives (Zhang et al., 2024)

Keywords: self-contrast, self-reflection

Following works that doubt LLMs can self-correct, re-shows that self-correction isn't effective (and sometimes decreases performance) for math reasoning and translation tasks. Feedback analysis: claims that bad feedback generated by LLMs falls into the following patterns: when an incorrect answer isn't fixed, the model tends to assert that no correction is needed; when a correct answer is incorrectly changed, the model's feedback is inconsistent across repeated tests. (Also shows that when an incorrect output is successfully corrected, the model's feedback is again inconsistent, which I find confusing since it seems contradictory.) Proposes the self-contrast framework (slightly complicated):

  1. Input paraphrases of original question with diverse styles and obtain multiple outputs
  2. Ask LLM to contrast difference between each answer.
  3. Based on contrast result, ask model to create a checklist to verify which case is correct between discrepancies.
  4. Based on answers to checklist, refine original outputs for each style, and choose most consistent output.

Thoughts:

  • Proposes another variant of self-reflection without really answering any open questions. Specifically, it does not answer "why" self-correction doesn't work; it only examines "how" the feedback is bad in failure cases of self-correction, which is not the fundamental question that needs to be answered.

Learning Universal Authorship Representations (Rivera-Soto et al., 2021)

Previous methods for authorship attribution rely on statistical methods. Recently, neural methods have been shown to outperform these traditional methods, but A) neural methods are difficult to interpret (features are automatically learned), and B) they require vast amounts of data. This paper studies whether universal authorship features can be learned from training data in the open-world setting (a large number of authors). They propose a self-attention (SBERT) + contrastive-learning based model and use three datasets (Reddit, Amazon, Fanfiction) to show that A) topic diversity is important so that the learned representations are independent of topic for author attribution, and B) given sufficient independence w.r.t. topics, more authors lead to better generalizability.

  • "Given sufficient independence w.r.t. topics, more authors lead to better generalizability" is not surprising. This essentially corresponds to more training data that is useful for extracting more generalized features.
  • By "large amount of authors," this seems to mean \(>\) 100K authors. Is there not a "crowding problem" that will happen if we operate in 512 dimensions (especially if we are using inner products as the similarity metric)? I feel like there should be an impossibility result.

Thoughts: Raises concerns about scalability and potential crowding issues when operating in limited-dimensional spaces with very many authors.

Interpreting Embedding Spaces by Conceptualization (Simhi and Markovitch, 2023)

Introduces a method to (non-linearly) transform LLM representations into a space where features are interpretable. Uses concepts \(c\) in a pre-defined ontology (Wikipedia) and considers their embeddings \(f(\tau(c))\) (\(\tau(c)\) is the natural language expression of the concept and \(f\) is the LLM embedding of that expression) as independent bases of the transformed space. Given text \(t\), considers \(f(t)\) as its representation. Shows that the transformed representations are A) equivalent in terms of information encoded (tested by learning classifiers on top of the representations) and B) meaningful in that the concepts are informative of the original sentence.

Thoughts:

  • (I'm aware that this is already done but) I'm wondering how this can be applied reasonably to author attribution: the difficulty seems to be figuring out reasonable concepts \(c\) for author attribution.
  • I'm curious if these concepts can be obtained in an unsupervised fashion and later interpreted (the assumption is that the concepts are obtained in a way where they are (somehow) constrained to be more interpretable).

Learning Interpretable Style Embeddings via Prompting LLMs (Patel et al., 2023)

Creates a dataset (StyleGenome). Two types of prompts (free-form and ones targeting specific features) are used to generate free-form natural language descriptions, which are then formatted into multiple short-sentence descriptions to be used as styles (these are not filtered, to maintain the unsupervised setting). To obtain LISA, a style-and-text-to-salience predictor (referred to as SFAM) is used, consisting of a T5 backbone with a linear binary classification layer trained with a contrastive objective. This is further distilled into another model that simultaneously predicts salience (between 0 and 1) for all (768) styles. Human-model agreement scores increase with more training data. To obtain embeddings, the \([0, 1]\) scores for each feature are mapped with a linear projection, where the linear map is obtained by training the distilled SFAM with an additional linear layer on authorship attribution datasets. A small fraction of the originally generated styles refer to useless features. Styles are also difficult to completely separate from content (which could be done by manually filtering the styles). LISA also sometimes produces high scores for styles that are relevant but opposite (likely due to tendencies of LISA's backbone).

Thoughts:

  • I wonder if creating a "hierarchy" of styles might help.
  • In the original Reddit scrape, I wonder if 10 documents even contain sufficient information in pinning down who the author is.

ICLEF: In-Context Learning with Expert Feedback for Explainable Style Transfer
(Saakyan and Muresan, 2024)

Studies the attribute style transfer task (rewrite text as informal → formal, etc.). There are no prior studies on explainability for textual style transfer. Builds datasets for explainable style transfer using a proposed framework (ICLEF). Focuses on formality and subjectivity style transfer and uses the GYAFC dataset (formality style transfer; source informal sentences collected from Yahoo Answers and target formal variants collected through MTurk) and the WNC dataset (Wikipedia-based). Collects improved versions of portions of the original dataset using expert annotators. Verifies semantic preservation quality using MIS (a paraphrase quantification metric) and human (expert) evaluation. Trains models on the collected explanations and text pairs. Shows mild improvement on an authorship attribution task using generated explanations for style features.

DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature
(Mitchell et al., 2023)

Previous methods for binary classification of whether an input text is AI-generated rely on training neural classifiers, which are prone to overfitting. The paper notices that text generated by a language model tends to lie in regions of its log-probability function where the curvature is negative (i.e., near a local maximum). In such regions, small edits/perturbations to the text cause a noticeable drop in log probability, which is less pronounced for human-written text.

The paper frames the detection problem as a zero-shot classification task where the goal is to determine if a candidate passage \(x\) was generated by a specific source model \(f_\theta(\cdot)\) using only its own log probabilities. The method defines the perturbation discrepancy as \[ \Delta \triangleq \log f_\theta(x) -\mathbb{E}_{\tilde{x} \sim q(\cdot\mid x)} \log f_\theta(\tilde{x}), \] where \(q(\cdot\mid x)\) produces slight, semantically similar perturbations of \(x\). We can approximate the negative curvature using this drop, which can be estimated using Hutchinson's trace estimator. By applying finite differences, \[ \mathbf{z}^\top H_f(x) \mathbf{z} \approx \frac{f(x + h\mathbf{z}) + f(x -h\mathbf{z}) - 2f(x)}{h^2}, \] and setting \(h = 1\) with a symmetric noise distribution, the negative trace of the Hessian is given by \[ \frac{-\text{tr}\left(H_f(x)\right)}{2} \approx f(x) -\mathbb{E}_\mathbf{z} f(x + \mathbf{z}). \] Thus, comparing the log probability of the original passage with that of its perturbations provides a good estimate of the negativity of the curvature.
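A tiny sketch of the score computation (the log-probabilities and the T5 mask-and-fill perturbations are assumed to be computed elsewhere; the numbers below are made up for illustration):

```python
import numpy as np

def perturbation_discrepancy(log_p_x, log_p_perturbed):
    """DetectGPT-style score sketch: log_p_x is log f_theta(x) for the
    candidate passage, log_p_perturbed holds log f_theta(x_tilde) for a
    set of sampled perturbations. A large positive gap suggests x sits
    near a local maximum of the model's log-probability, i.e., is more
    likely machine-generated."""
    return log_p_x - np.mean(log_p_perturbed)

# Hypothetical numbers for illustration only.
print(perturbation_discrepancy(-120.3, [-131.0, -128.7, -133.4]))
```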

The paper uses a T5 backbone to perturb text by masking and filling (spans of 2 consecutive words, sampled at random, masking until roughly 15\% of the words in the text are masked). Evaluated on GPT-2, OPT-2.7B, GPT-Neo-2.7B, GPT-J, GPT-NeoX, GPT-3, and Jurassic-2 Jumbo. Shows a 0.06 AUROC increase compared to other zero-shot baselines (which compute the average log probability under the source model and use a fixed threshold). The threshold is set by examining the separation between the distributions of the normalized discrepancy for machine-generated and human-written texts.

Style-Specific Neurons for Steering LLMs in Text Style Transfer (Lai et al., 2024)

Focuses on the task of Text Style Transfer (TST): transform text from a source to a target style while maintaining semantic content and fluency. There are two main approaches that use LLMs for TST: (1) fine-tuning on parallel data, and (2) few-shot prompting, but LLMs prioritize preserving meaning over transferring style (e.g., for polite -> impolite). This paper studies identifying and using style-specific neurons as knobs for TST and tests what they refer to as "sNeuron-TST": given a parallel training corpus, feed source and target sentences to find neurons that are active in each case, then activate/deactivate the neurons that only appear for the source text.

More concretely,

  1. Consider two groups of distinct styles \(A, B\). Pass these separately into an LLM (e.g., Llama-3) and obtain the output of the MLP layers.
  2. Select the set of neurons where the value is greater than zero for each group (referred to as \(S_A, S_B\)).
  3. Take the neurons with the top-\(k\) values (\(k\) is quite large; \(k=500n\)) to create \(S_A', S_B'\).
  4. Compute the set difference \(N_A = S_A'\setminus S_B'\) and vice versa.

Then, they deactivate neurons in \(N_A\) for TST. This turns out to improve "style transfer accuracy" (?) but reduces fluency.
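A rough numpy sketch of the selection in steps 1-4 above, under my own simplifying assumptions (activations are aggregated by mean positive value here, whereas the paper ranks raw per-neuron values):

```python
import numpy as np

def style_specific_neurons(acts_A, acts_B, k):
    """acts_A and acts_B are (num_sentences, num_neurons) MLP activations
    for style-A and style-B text at one layer. Keep each style's top-k most
    activated neurons and return the ones unique to style A."""
    def top_k_active(acts):
        mean_act = np.maximum(acts, 0).mean(axis=0)   # only >0 activations count
        return set(np.argsort(-mean_act)[:k])
    S_A, S_B = top_k_active(acts_A), top_k_active(acts_B)
    return S_A - S_B                                   # N_A: deactivate these for A -> B

rng = np.random.default_rng(0)
N_A = style_specific_neurons(rng.normal(size=(64, 512)), rng.normal(size=(64, 512)), k=50)
print(len(N_A))
```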

Hence, they also consider using a contrastive decoding algorithm (DoLa) on top of the above to improve fluency. The idea of DoLa is to adjust the next-token probabilities to favor tokens that remain consistently likely across layers. The authors observe that the last few (four) layers of the model carry more style-specific information, so they contrast these "style layers" against the final layer's distribution and dynamically pick the style layer that exhibits the greatest difference (largest JSD) from the final layer's distribution. These two distributions are then contrasted: if a token is in the top set for the final layer, its probability is scaled by the ratio of final/earlier probabilities, and it is suppressed if this ratio is excessively large, meaning it wasn't supported by the style layer. The idea is that if a token is consistently among the top candidates for both the style layers and the final layer, it is likely a genuinely appropriate word (often style-independent or a correctly used style-\(B\) word). In contrast, if a token only becomes probable in the very last layer because of neuron deactivation, the algorithm sees a large mismatch vs. the style layers' distribution and lowers its final probability.

P\(^3\)SUM: Preserving Author's Perspective in News Summarization with Diffusion Language Models
(Liu et al., 2024)

Motivation: Perspective-preserving summarization is understudied; LLMs might be politically biased, and summarization systems built on top of them will propagate that bias. This study evaluates how well LLMs preserve political stances in summaries and shows that they struggle to preserve the author's perspective. Proposes P\(^3\)SUM, a diffusion model-based, stance-preserving summarizer:

  1. Train diffusion model on standard summarization data.
  2. At each step:
    1. Generated output's stance is evaluated with a political stance classifier and compared to the target stance.
    2. Steers the generation toward the target stance (a rough sketch of this kind of classifier-guided steering follows below).
P\(^3\)SUM doesn't require training on perspective-specific summarization data because the stance control is applied at inference time on top of the diffusion backbone. Its ROUGE/abstractiveness scores suffer only marginally while it performs significantly better on perspective preservation.
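A rough, heavily simplified sketch of what one classifier-guided denoising step could look like, assuming a continuous-latent diffusion LM with hypothetical `denoiser` and `stance_classifier` modules; this illustrates the general steering idea, not the paper's exact mechanism:

```python
import torch
import torch.nn.functional as F

def guided_denoise_step(z_t, t, denoiser, stance_classifier, target_stance, step_size=0.1):
    """One steered denoising step. `denoiser(z, t)` and `stance_classifier(z)` are
    hypothetical modules over the continuous latent z; the classifier is assumed
    to return stance logits of shape (1, num_stances)."""
    z_t = z_t.detach().requires_grad_(True)
    # Evaluate the current draft's stance and compare it to the target stance.
    stance_logits = stance_classifier(z_t)
    loss = F.cross_entropy(stance_logits, torch.tensor([target_stance]))
    # Nudge the latent toward the target stance, then run the usual update.
    grad, = torch.autograd.grad(loss, z_t)
    return denoiser(z_t - step_size * grad, t)
```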

Polarity Calibration for Opinion Summarization (Lei et al., 2024)

Motivation: Authors observe that existing summarization systems amplify polarity bias (the difference in representation between perspectives). Defines an intelligent summarizer as “proportionally presenting both majority and minority opinions and aligning with the input text polarity.” Develops a reward model focusing on three criteria (a rough sketch of such a combined reward follows the list):

  • Whether the summary's polarity distance matches the input text.
  • Whether the summary's semantic content is faithful (using RoBERTa embeddings).
  • Whether the summary is coherent (using CoLA).
Tested on two domains using automated/human evaluation.
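A rough sketch of how the three criteria could be combined into a single reward; the `polarity`, `embed`, and `acceptability` callables are hypothetical stand-ins (e.g., a stance classifier, a RoBERTa sentence embedder, and a CoLA acceptability model), and the linear weighting is an assumption:

```python
import torch
from typing import Callable

def polarity_calibration_reward(
    summary: str,
    source: str,
    polarity: Callable[[str], float],        # hypothetical polarity/stance scorer
    embed: Callable[[str], torch.Tensor],    # hypothetical RoBERTa sentence embedder
    acceptability: Callable[[str], float],   # hypothetical CoLA acceptability scorer
    weights: tuple = (1.0, 1.0, 1.0),
) -> float:
    # (1) Polarity: penalize the gap between the summary's and the input's polarity.
    r_polarity = -abs(polarity(summary) - polarity(source))
    # (2) Semantic faithfulness: embedding cosine similarity to the input.
    r_semantic = torch.nn.functional.cosine_similarity(
        embed(summary), embed(source), dim=-1).item()
    # (3) Coherence: linguistic acceptability of the summary.
    r_coherence = acceptability(summary)
    w1, w2, w3 = weights
    return w1 * r_polarity + w2 * r_semantic + w3 * r_coherence
```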

NeuS: Neutral Multi-News Summarization for Mitigating Framing Bias (Lee et al., 2022)

Motivation: Studies framing bias (when journalists selectively choose content to cover) and explores generating neutral summaries. Traditional framing-bias-free roundups are costly and time-consuming. This study examines multi-document summarization for framing bias mitigation. A new dataset from Allsides.com is used, along with a polarity-based metric for framing bias aligned with human perceptions. The study also reveals that article titles indicate framing bias, and hallucinated generations tend to have high framing bias.

Aspect-Controllable Opinion Summarization (Amplayo et al., 2021)

Examines a variant of summarization to obtain pinpointed summaries for specific entities, introducing aspect controllers for focused summarization. The model might not be directly applicable to our setting due to the need for custom training data.

Fair Abstractive Summarization of Diverse Perspectives (Zhang et al., 2023)

Summarization for text labeled with social values (e.g., male, female). Defines fairness by matching social value ratios between source and generated summaries.

Understanding Position Bias Effects on Fairness in Social Multi-Document Summarization
(Olabisi et al., 2024)

Investigates position bias in multi-document summarization for text from multiple dialects in social settings. Most position bias studies do not assess fairness. Uses DivSumm with three dialects in source text, considering randomly shuffled versus ordered documents. Shows that:

  • Reference summaries lack word overlap bias.
  • Abstractive models show no word overlap bias when documents are shuffled but favor earlier parts when ordered.
Position bias is conditional: models favor early text only when the documents are linguistically similar. Semantic similarity between the shuffled and ordered settings is also measured; when documents are ordered, summaries are biased toward the first dialect group.

Thoughts: Unclear how semantic similarity is measured (mentions only cosine similarity).

Automatically Evaluating Content Selection in Summarization without Human Models
(Louis et al., 2009)

Shows that distributional (dis)similarity between the source and a generated summary correlates with manual PYRAMID scores. Tested on the TAC dataset; stemming improves the correlation.
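A minimal sketch of one such distributional measure, Jensen-Shannon divergence between the token frequency distributions of the source and the summary; whitespace tokenization and the absence of stemming here are simplifications:

```python
import math
from collections import Counter

def js_divergence(source_text: str, summary_text: str) -> float:
    """Jensen-Shannon divergence (in bits) between the token frequency
    distributions of the source and the summary; lower values indicate the
    summary's word distribution better matches the input's."""
    def dist(text):
        counts = Counter(text.lower().split())
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}

    p, q = dist(source_text), dist(summary_text)
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in set(p) | set(q)}

    def kl(a):
        return sum(pa * math.log2(pa / m[w]) for w, pa in a.items() if pa > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)
```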

Crowdsourcing Lightweight Pyramids for Manual Summary Evaluation (Shapira et al., 2019)

Original Pyramid involves SCU extraction and merging for summary evaluation. Lightweight Pyramid uses fewer annotators and does not cross-reference SCUs across different references.

Thoughts: Raises the question of whether LLMs could perform similar extraction tasks.

Automatic Pyramid Evaluation Exploiting EDU-based Extractive Reference Summaries
(Hirao et al., 2018)

Automates PYRAMID by dividing source and reference summaries into EDUs, identifying matches, and scoring based on EDU counts.

Thoughts: The approach might be overcomplicated due to the use of ILP.

Supervised Learning of Automatic Pyramid for Optimization-Based Multi-Document Summarization (Peyrard et al., 2017)

Develops an extractive summarization method relying on an auxiliary function that approximates PYRAMID scores, using a genetic algorithm to select sentences.

Thoughts: Assumes a linear relationship between summary features and Pyramid scores.

Automated Pyramid Summarization Evaluation (Gao et al., 2019)

Automates PYRAMID by segmenting and vectorizing summaries, using EDUA for grouping and cosine similarity for evaluation.

Thoughts: EDUA could be very useful if evidence lists can be reliably gathered.

Summary of a Haystack: A Challenge to Long-Context LLMs and RAG Systems
(Laban et al., 2024)

This work studies the Needle-in-a-Haystack summarization task and benchmarks RAG versus long-context LLMs, measuring coverage annotated by humans.

Open Domain Multi-document Summarization: A Comprehensive Study of Model Brittleness under Retrieval (Giorgi et al., 2022)

Investigates the open-domain multi-document summarization setting, focusing on extracting relevant information from large document sets to answer queries.

Embrace Divergence for Richer Insights: A Multi-document Summarization Benchmark and a Case Study on Summarizing Diverse Information from News Articles (Huang et al., 2023)

Develops the DiverseSumm dataset, with coverage measured through human annotations, showing that LLMs generally struggle with coverage.

Representation Degeneration Problem in Training Natural Language Generation Models
(Gao et al., 2019)

  • Studies the geometry of the embedding matrix of a weight-tied Transformer model
  • The loss for words that never occur in training is minimized by letting their embedding norms grow without bound (toward infinity)
  • (Not explicitly stated) Embeddings are dominated by the first few singular components
  • Adding a simple regularization term that minimizes the average pairwise cosine similarity of embeddings mitigates the representation degeneration problem (RDP); see the sketch below
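A minimal sketch of such a cosine-similarity regularizer on the (weight-tied) output embedding matrix; the pairwise-sum identity is an implementation convenience, not necessarily how the paper computes it:

```python
import torch
import torch.nn.functional as F

def cosine_regularizer(E: torch.Tensor) -> torch.Tensor:
    """Average pairwise cosine similarity among the rows of the embedding matrix E
    (shape: vocab_size x hidden_dim); adding this to the training loss pushes
    embeddings apart angularly."""
    E_norm = F.normalize(E, dim=-1)
    v = E_norm.shape[0]
    # ||sum_i e_i||^2 equals the sum of all pairwise dot products,
    # including the v self-similarities (each equal to 1).
    total = E_norm.sum(dim=0).pow(2).sum()
    return (total - v) / (v * (v - 1))

# e.g., loss = nll_loss + lambda_reg * cosine_regularizer(model.lm_head.weight)
# (lambda_reg is a tuning knob; the paper's exact coefficient may differ)
```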

Thoughts:

  • The paper indicates that push-pull dynamics on Zipfian data cause anisotropy. Why not test this on (synthetic) uniformly sampled sequences?
  • How much did the anisotropy actually change? Provide some numbers (even something as simple as average cosine similarity, not just SVD plots).
  • Why is RDP even an issue? If an insufficient signal for the softmax is the concern, why can we not just scale all values by some large constant? They do show a performance increase from adding the regularization term, but it isn't clear whether mitigating RDP directly led to that increase.
  • What was the point of comparing the geometry with NNs and Word2vec if they aren't going to compare downstream task performance? These (more or less) satisfy isotropy better, but do they perform better or worse than Transformers?