Studies the problem of closing the performance gap between explicit and implicit CoT
reasoning. Identifies a “latent instability issue” in implicit CoT methods like
Coconut, where scaling the number of latent reasoning tokens causes performance
to collapse because of (i) semantic homogenization, where latent representations collapse
and become nearly identical, and (ii) geometric drift, where the collapsed latents drift
away from the model's token embedding space and become OOD.
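
Both symptoms are easy to quantify. A minimal diagnostic sketch (my own, not the paper's metrics), assuming `latents` holds the latent reasoning vectors of one problem, shape (num_steps, hidden), and `embed` is the model's token embedding matrix, shape (vocab, hidden):

```python
import torch
import torch.nn.functional as F

def homogenization(latents: torch.Tensor) -> float:
    """Mean pairwise cosine similarity among latent steps; values near 1.0
    mean the latents have collapsed to near-identical vectors."""
    z = F.normalize(latents, dim=-1)
    sim = z @ z.T                                   # (steps, steps)
    mask = ~torch.eye(len(z), dtype=torch.bool, device=z.device)
    return sim[mask].mean().item()                  # drop self-similarity

def drift_from_embedding_space(latents: torch.Tensor, embed: torch.Tensor) -> float:
    """Mean distance from each latent to its nearest token embedding; if this
    grows as more latent tokens are added, the latents are going OOD."""
    dists = torch.cdist(latents, embed)             # (steps, vocab)
    return dists.min(dim=-1).values.mean().item()
```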
Proposes SIM-CoT, which introduces step-level supervision for implicit CoT, where
an auxiliary decoder is attached to the backbone LLM and both the decoder and LLM are
jointly trained. For each latent token generated by the backbone, the decoder is trained to
autoregressively reconstruct the corresponding explicit reasoning step (e.g., for word
arithmetic problems, given one intermediate latent step, the decoder is trained to generate
“6 eggs + 12 eggs = 18 eggs”). Auxiliary decoder is removed at test time but can also be
used to interpret latent steps.
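
A rough sketch of the step-level supervision as I understand it; the decoder here is a toy GRU stand-in (the paper attaches a full LM decoder), and all names (`ToyAuxDecoder`, `step_supervision_loss`, `step_token_ids`) are mine, not the paper's:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyAuxDecoder(nn.Module):
    """Toy stand-in for the auxiliary decoder: conditioned on a single latent
    vector, it autoregressively predicts the tokens of the explicit step."""
    def __init__(self, vocab_size: int, hidden: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab_size)

    def forward(self, latent: torch.Tensor, step_ids: torch.Tensor) -> torch.Tensor:
        # latent (hidden,) seeds the decoder state; step_ids (T,) are the
        # teacher-forced input tokens of the explicit reasoning step.
        h0 = latent.view(1, 1, -1)                  # (layers=1, batch=1, hidden)
        x = self.embed(step_ids).unsqueeze(0)       # (1, T, hidden)
        out, _ = self.rnn(x, h0)
        return self.head(out).squeeze(0)            # (T, vocab_size)

def step_supervision_loss(decoder: ToyAuxDecoder,
                          latents: torch.Tensor,
                          step_token_ids: list[torch.Tensor]) -> torch.Tensor:
    """Cross-entropy for reconstructing each explicit step from its latent.
    Because `latents` carry gradients from the backbone, this loss also
    shapes the backbone's latent representations (joint training)."""
    loss = latents.new_zeros(())
    for z, ids in zip(latents, step_token_ids):
        logits = decoder(z, ids[:-1])               # predict next token
        loss = loss + F.cross_entropy(logits, ids[1:])
    return loss / len(step_token_ids)
```

The full objective would add this term to the backbone's usual implicit-CoT loss; at inference the decoder is simply dropped (or run offline to read out what each latent step encodes).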
SIM-CoT improves accuracy over Coconut by +8.2 points on GPT-2 and over CODI
(a distillation-based implicit CoT method) by +3.0 points on LLaMA-3.1 8B.
Also beats the explicit CoT baseline on GPT-2 by +2.1 points with 2.3× token
efficiency. On larger models, it lags behind explicit CoT but remains comparable while
maintaining a significant speed advantage. Generalizes to other related benchmarks (SVAMP,
GSM-Hard). Ablations show a moderately sized decoder (1B, matched to a 1B backbone) works
best, while larger decoders (3B/8B) slightly degrade performance.
- Feels a bit that SIM-CoT defeats the purpose of Coconut by supervising the latent
representations to correspond directly to (a sequence of) explicit tokens. One key interest
in implicit CoT is not just to increase efficiency but also to potentially boost model
capabilities.
- Does this work for bigger LLMs and more interesting tasks beyond math? → Likely not; the
supervision signal is too constraining and requires annotated data to train on. In which
case, this may not be very interesting, because one limitation of Coconut is that it does
not work at all for any interesting task beyond solving simple math problems.
- Does the auxiliary decoder work on unobserved OOD tasks? Likely not.
- Using a decoder that is the same size as the base model feels a bit iffy.