Summary

[Hao et al. 2024] (arXiv:2412.06769) propose Coconut (Chain of Continuous Thought), a method for training large language models to reason in continuous latent space rather than through discrete token generation. The core claim is both simple and provocative: the standard chain-of-thought paradigm [Wei et al. 2022] forces reasoning through an information bottleneck, the projection onto a finite vocabulary at each intermediate step, and this bottleneck is not merely an implementation detail but a fundamental constraint on the kinds of reasoning strategies a model can employ. Coconut replaces discrete reasoning tokens with "continuous thoughts," hidden-state vectors fed directly back into the model as inputs for subsequent reasoning steps, bypassing the vocabulary projection entirely.

The authors present three main contributions. First, a curriculum-based training procedure that gradually replaces language-based chain-of-thought tokens with continuous latent representations. Second, empirical evidence that continuous thoughts can encode a form of breadth-first search (BFS) over reasoning paths, in contrast to the inherently sequential, depth-first nature of token-by-token generation. Third, experimental results on logical reasoning benchmarks (ProntoQA, ProsQA) and mathematical reasoning (GSM8k) demonstrating that Coconut can match or exceed standard chain-of-thought performance while operating in a compressed latent space.

My assessment: the theoretical framing is genuinely novel and points toward an important research direction. The information-theoretic argument about the vocabulary bottleneck is sound and underexplored in the literature. However, the experimental validation operates at modest scale on synthetic or well-controlled benchmarks, leaving open the critical question of whether continuous reasoning confers advantages on the messy, compositional problems that motivate chain-of-thought reasoning in practice.

Why the Vocabulary Bottleneck Matters More Than You Think

Novelty rating: Significant.

The core insight, that projecting intermediate reasoning states into vocabulary space constitutes an unnecessary information bottleneck, is not entirely new in spirit. [Goyal et al. 2024] explored "pause tokens" that grant models additional computation steps without requiring meaningful token output. [Nye et al. 2021] introduced scratchpads as intermediate computation space, though still in discrete token form. What distinguishes Coconut is its explicit formalization of the bottleneck problem and its proposal to eliminate it entirely by operating in continuous representation space.

The right abstraction makes the problem trivial, and finding it is the hard part. Coconut proposes that the right abstraction for LLM reasoning is not a sequence of tokens but a trajectory through continuous representation space. That is a substantive conceptual advance.

Consider the information-theoretic argument more precisely. At each step of standard chain-of-thought reasoning, the model's hidden state h_t ∈ R^d (where d is the model dimension, typically 4096 or larger) is projected to a distribution over the vocabulary V via the language model head, yielding at most log2 |V| bits of information per selected token. For a vocabulary of size 32,000, this amounts to approximately 15 bits. The continuous thought, by contrast, operates in R^d with floating-point precision, yielding representational capacity orders of magnitude larger. In the most favorable interpretation, each continuous thought can carry 32d bits (32-bit floats times dimension d), versus roughly 15 bits for a discrete token. The gap reveals something fundamental about the relationship between representational capacity and reasoning.
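The arithmetic behind this gap is easy to make concrete. The sketch below reproduces the counting argument under the crude assumption that every float contributes its full 32 bits, which is an upper bound; the effective capacity of trained representations is far smaller:

```python
import math

def discrete_bits(vocab_size: int) -> float:
    # A single sampled token conveys at most log2(|V|) bits.
    return math.log2(vocab_size)

def continuous_bits(d_model: int, bits_per_float: int = 32) -> int:
    # A continuous thought is a d-dimensional vector of floats.
    return d_model * bits_per_float

print(f"discrete token:     {discrete_bits(32_000):.1f} bits")  # ~15 bits
print(f"continuous thought: {continuous_bits(4096)} bits")      # 131072 bits
```

Even discounting the unrealistic 32-bits-per-float assumption by several orders of magnitude, the per-step capacity gap remains enormous.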

The BFS property is perhaps the most surprising empirical finding. Standard autoregressive generation produces a single reasoning chain, effectively depth-first. The authors demonstrate, through analysis of continuous thought representations, that a single continuous thought can encode information about multiple reasoning branches simultaneously. This connects to a deep theoretical question about computational complexity. We know from [Merrill & Sabharwal, 2023] and [Li et al. 2024] that chain-of-thought extends the effective computational power of transformers beyond TC^0, the class believed to capture a single forward pass. If continuous thoughts can implicitly represent a superposition of reasoning states, Coconut may be accessing computational strategies that are qualitatively different from sequential CoT, not merely quantitatively more efficient.

A necessary caveat: the connection between continuous thoughts and BFS is demonstrated empirically, not proved formally. The claim that the model "performs BFS" should be understood as "the model's behavior is consistent with BFS-like exploration." That is a meaningfully weaker statement.

Under the Hood: How the Training Curriculum Works

The training methodology employs a curriculum that is well-motivated but theoretically underspecified. The procedure operates in stages:

1. Standard chain-of-thought fine-tuning on problems with explicit reasoning traces.

2. Progressive replacement of CoT tokens with continuous thoughts, starting from the earliest reasoning steps and proceeding forward.

3. At each stage, the continuous thought is defined as the last hidden state from the transformer at position t in the previous forward pass, used directly as the input embedding for position t+1 in the current forward pass.

Formally:

e_{t+1} = h_t

where h_t denotes the hidden state at the final layer for position t. This vector replaces the token embedding that would normally serve as input.
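This feedback loop can be sketched with a toy stand-in for the transformer (a fixed random map; the dimensions and the map itself are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

D = 8                                          # toy "model dimension"
W = rng.standard_normal((D, D)) / np.sqrt(D)

def forward(inputs):
    # Toy stand-in for a transformer forward pass: (T, D) -> (T, D).
    return np.tanh(inputs @ W)

def run_continuous_thoughts(inputs, n_thoughts):
    seq = inputs
    for _ in range(n_thoughts):
        h = forward(seq)
        # e_{t+1} = h_t: the last hidden state becomes the next input
        # embedding directly, with no projection onto a vocabulary.
        seq = np.vstack([seq, h[-1:]])
    return seq

seq = run_continuous_thoughts(rng.standard_normal((3, D)), n_thoughts=2)
print(seq.shape)   # (5, 8): 3 prompt positions + 2 continuous thoughts
```

The essential point the sketch captures is that the continuous thought never passes through a softmax over the vocabulary; it re-enters the model at full dimensionality.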

The loss function is computed only on the answer tokens, not on the continuous thoughts (which have no target in vocabulary space). This is correct in principle, but it means the continuous thoughts are trained purely through their downstream effect on answer accuracy. No auxiliary loss encourages them to encode any particular structure. The BFS property, if it genuinely emerges, does so without explicit encouragement, which is both remarkable and difficult to verify rigorously.

A subtle technical concern: the gradient flow through continuous thoughts creates a recurrence that extends the effective depth of the computation graph. With c continuous thoughts, the effective computation involves c sequential forward passes, each of depth L (the number of transformer layers). The total effective depth is c·L, which for even moderate values of c and L produces very long backpropagation paths. The authors should address whether gradient stability is maintained across this extended computation. Their use of curriculum learning may implicitly help by building up to longer continuous thought chains gradually, but this deserves explicit analysis, ideally with gradient norm statistics across training.
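A back-of-envelope count makes the magnitude of this concern concrete (the values of c and L below are hypothetical, chosen only for illustration):

```python
def effective_depth(n_thoughts: int, n_layers: int) -> int:
    # Each continuous thought adds one full forward pass of n_layers
    # transformer layers to the backpropagation path.
    return n_thoughts * n_layers

# e.g. 6 continuous thoughts through a 32-layer model:
print(effective_depth(6, 32))   # 192 sequential layers in the gradient path
```

For comparison, training very deep plain networks of this depth is known to require careful normalization and initialization, which is why gradient statistics across the curriculum stages would be informative.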

Where the Evidence Holds Up, and Where It Doesn't

The experimental evaluation focuses on three benchmarks.

ProntoQA [Saparov & He, 2023]: A synthetic logical reasoning benchmark requiring multi-hop deductive inference. Coconut achieves strong results here, with continuous thoughts outperforming both standard chain-of-thought and no-chain-of-thought baselines. The controlled nature of ProntoQA makes it an appropriate testbed for the core hypothesis, but it also means the results may not transfer to more naturalistic reasoning settings.

ProsQA: A harder variant introduced by the authors that requires reasoning over graph structures with branching paths. This is where the BFS property emerges most clearly. On problems where the correct strategy involves exploring multiple paths before committing, Coconut's advantage over sequential CoT is most pronounced.

GSM8k [Cobbe et al. 2021]: A grade-school math benchmark. The results here are more modest, which is expected given that mathematical reasoning may be less amenable to the BFS-style exploration that Coconut facilitates.

Primary experimental models: Small-scale LLMs (substantially smaller than frontier models)

Key benchmarks: ProntoQA, ProsQA (synthetic), GSM8k (naturalistic math)

Several concerns about experimental rigor warrant attention.

The baselines leave gaps. The comparisons include standard CoT and no-CoT variants, which is reasonable for establishing the core hypothesis. However, a fairer evaluation would include recent methods that also address reasoning efficiency, such as [Zelikman et al. 2022] (STaR, self-taught reasoning), and methods providing additional computation without explicit reasoning tokens, such as [Goyal et al. 2024] (pause tokens). Without these baselines, it is difficult to isolate whether the improvement stems from continuous representation specifically or simply from additional implicit computation steps.

Scale remains an open question. The experiments use models substantially smaller than frontier systems. This matters because the relationship between model scale and reasoning capability is highly nonlinear [Wei et al. 2022]. The vocabulary bottleneck argument (each token carries only log2 |V| ≈ 15 bits) may be less binding at larger scales, where the model can distribute information across multiple tokens more effectively. Conversely, the advantage of continuous thoughts might grow with scale if the hidden state dimension grows proportionally. Without experiments at multiple scales, we cannot distinguish between these possibilities.

Key ablations are missing. The ablation that would most strengthen the paper is a systematic study of continuous thought count versus task complexity. The theoretical argument implies that the optimal number of continuous thoughts should scale with problem difficulty. A rigorous ablation varying this parameter against problems of known hop-count complexity would substantially sharpen the claims and illuminate the relationship between latent computation budget and reasoning depth.

Statistical reporting falls short. The paper would benefit from confidence intervals, results across multiple random seeds, and significance tests when comparing to baselines. This is particularly important for the BFS claims, which rest on analysis of internal representations that may be sensitive to initialization.

Five Blind Spots the Authors Should Address

Beyond the limitations acknowledged in the paper, five additional concerns deserve scrutiny.

1. Interpretability collides with alignment. By moving reasoning into continuous latent space, Coconut sacrifices the interpretability that is one of chain-of-thought's primary practical benefits. Discrete reasoning traces can be inspected, debugged, and used for alignment verification [Bender & Koller, 2020]. Continuous thoughts are opaque. This is not merely a practical inconvenience but a fundamental tension: the more expressive the reasoning space, the harder it becomes to verify that the model is reasoning correctly rather than exploiting spurious correlations. Recovering the content of continuous latent reasoning may sit near the hard limit of what interpretability methods can achieve.

2. Compositionality across reasoning types remains untested. The experiments involve problems with relatively homogeneous structure: logical chains and arithmetic. Natural language reasoning often requires composing heterogeneous reasoning types: logical inference, causal reasoning, analogical thinking, and world knowledge retrieval. Whether continuous thoughts can learn to compose these different modalities is unclear, especially since the training curriculum derives continuous thoughts from a single type of discrete reasoning trace at a time.

3. The curriculum creates an implicit ceiling. The continuous thoughts are bootstrapped from discrete chain-of-thought reasoning, meaning the continuous reasoning space can initially only learn to replicate and compress what discrete CoT already achieves. The authors suggest that continuous thoughts transcend this, enabling BFS-like strategies that sequential CoT cannot express. But the training signal comes entirely from problems solvable by sequential CoT. How does the model learn strategies that exceed its training distribution? This fundamental question connects to broader challenges in out-of-distribution generalization for learned reasoning, and the paper does not adequately address it.

4. Latent-space fragility is unexplored. The paper does not examine what happens when the hidden state encoding a continuous thought is perturbed by small noise. If the continuous reasoning trajectory is sensitive to minor perturbations, this would suggest fragile generalization. A formal analysis of the Lipschitz continuity of the continuous thought mapping, or at minimum an empirical robustness study with injected Gaussian noise at varying σ, would be informative.
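Such a robustness study could be as simple as the following sketch, using a toy stand-in for the latent reasoning step (the map, dimension, and σ grid are assumptions for illustration, not the paper's model):

```python
import numpy as np

rng = np.random.default_rng(1)

D = 16
W = rng.standard_normal((D, D)) / np.sqrt(D)

def step(h):
    # Toy stand-in for one latent reasoning step.
    return np.tanh(W @ h)

def drift(h0, sigma, n_steps=2):
    # Distance between trajectories from a clean vs. noise-perturbed thought.
    clean, noisy = h0, h0 + sigma * rng.standard_normal(D)
    for _ in range(n_steps):
        clean, noisy = step(clean), step(noisy)
    return float(np.linalg.norm(noisy - clean))

h0 = rng.standard_normal(D)
for sigma in (0.0, 0.01, 0.1, 1.0):
    print(f"sigma={sigma}: drift={drift(h0, sigma):.4f}")
```

Run on the actual model, the interesting quantity is how fast the final answer distribution (not just the hidden state) diverges as σ grows.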

5. The elephant in the room: recurrence. The continuous thought mechanism is, in effect, recurrence grafted onto a transformer architecture. Each continuous thought feeds the output of one forward pass as input to the next, creating an implicit recurrent loop. This connects to a substantial body of work on augmenting transformers with recurrence for increased computational depth. The authors do not adequately discuss this lineage or compare against recurrent alternatives that might achieve similar benefits with different tradeoffs in parallelizability and training stability.

Questions for Authors

1. On the BFS claim. The evidence for BFS-like behavior rests on cosine similarity analysis between continuous thoughts and states encountered during explicit BFS traversal. Can you provide a more formal characterization? Specifically, can you demonstrate that the continuous thought encoding at step t contains recoverable information about multiple frontier nodes, perhaps by training linear probes to decode individual branch states from the continuous thought vector?
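The probing protocol suggested here can be sketched on synthetic data. The superposition construction below is a hypothetical model of what a continuous thought might encode, not the authors' analysis:

```python
import numpy as np

rng = np.random.default_rng(2)

D, n_branches, n_samples = 32, 3, 500
# Hypothetical: each continuous thought is a weighted superposition of the
# hidden states of several BFS frontier branches.
branch_states = rng.standard_normal((n_samples, n_branches, D))
weights = np.array([0.5, 0.3, 0.2])
thoughts = (branch_states * weights[None, :, None]).sum(axis=1)

def probe_r2(b):
    # Least-squares linear probe: thought -> branch b's state.
    target = branch_states[:, b, :]
    M, *_ = np.linalg.lstsq(thoughts, target, rcond=None)
    resid = thoughts @ M - target
    return 1.0 - (resid**2).sum() / ((target - target.mean(0))**2).sum()

for b in range(n_branches):
    print(f"branch {b} (weight {weights[b]}): probe R^2 = {probe_r2(b):.2f}")
```

In this construction the probe recovers each branch with fidelity roughly proportional to its weight in the superposition; that graded recoverability is the signature one would look for in real continuous thoughts.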

2. On scaling. Have you conducted any experiments, even preliminary, at scales beyond those reported? The theoretical advantage of continuous thoughts (32d bits versus log2 |V| ≈ 15 bits per step) should scale favorably with model dimension. Do you observe this empirically, or does the training curriculum become harder to stabilize at larger scales?

3. On the curriculum ceiling. The continuous thoughts are trained from discrete CoT supervision. What evidence do you have that the learned continuous reasoning exceeds what the discrete CoT teacher achieves? Could the improvements over discrete CoT be attributed to regularization effects of the continuous bottleneck, or to elimination of exposure bias in the reasoning chain, rather than genuinely superior reasoning strategies?

4. On gradient stability. With c continuous thoughts each requiring a full forward pass, the effective backpropagation depth is c·L. How do you handle gradient stability? Do you observe training instabilities or vanishing gradients as c increases, and if so, at what threshold?

5. On generalization beyond training distribution. Can Coconut solve problems at test time that require more reasoning steps than any training example? Demonstrating positive length generalization would be the strongest evidence for genuine reasoning capability rather than sophisticated compression of the training distribution.

Broader Impact

If Coconut's core hypothesis proves robust at scale, the implications extend in several directions. Reasoning in continuous space is more compute-efficient per step than generating and processing discrete tokens, potentially enabling faster inference for reasoning-heavy applications. The possibility of implicit breadth-first search could enable LLMs to tackle problems that are intractable for sequential chain-of-thought without explicit tree search algorithms layered on top.

The interpretability concern, however, carries genuine societal weight. As LLMs enter high-stakes domains, the ability to inspect and verify reasoning traces is not a convenience but a requirement. A transition to opaque continuous reasoning would demand new verification methods, perhaps drawing on formal verification techniques or probing classifiers that decode continuous thoughts into human-interpretable form. Without such methods, continuous latent reasoning could amplify existing concerns about the opacity of neural network decision-making.

There is also a question of access and reproducibility. If continuous thought training proves essential for strong reasoning and requires careful curriculum tuning at scale, this could further concentrate capability advances among well-resourced laboratories, deepening existing asymmetries in AI research capacity.

Verdict: A Compelling Direction That Outpaces Its Own Evidence

This paper is really about structure, not scale. Coconut asks a fundamental question about the geometry of reasoning in neural networks and provides a thoughtful, if preliminary, answer. The information-theoretic argument for the vocabulary bottleneck is sound and important. The observation that BFS-like reasoning can emerge in continuous thought representations, if it holds up under more rigorous scrutiny, constitutes a genuinely surprising finding with implications for our understanding of what transformers can compute.

Yet the experimental validation does not match the ambition of the theoretical claims. The scale is small, the benchmarks primarily synthetic, the baselines incomplete, and the BFS property demonstrated through correlation rather than formal proof. For a top venue such as ICLR or NeurIPS, I would rate this as a borderline accept. The conceptual contribution is clear and the direction important, but the paper would be substantially strengthened by: (a) experiments at multiple model scales, (b) inclusion of pause token and STaR baselines, (c) a formal or more rigorous empirical characterization of the BFS property, and (d) ablations systematically relating continuous thought count to problem complexity.

The contribution is significant in its framing and conceptual clarity. It opens a research direction that will generate substantial follow-up work. The open conjecture most deserving pursuit: can continuous latent reasoning access computational complexity classes provably unreachable by discrete chain-of-thought of polynomial length? A proof either way would reshape our understanding of neural network reasoning.

Reproducibility & Sources

Primary paper: Shibo Hao et al. "Training Large Language Models to Reason in a Continuous Latent Space," arXiv:2412.06769, December 2024.

Code repository: No official code release verified at time of review.

Datasets:

  • ProntoQA [Saparov & He, 2023]: Publicly available synthetic logical reasoning benchmark.
  • ProsQA: Introduced by the authors in this paper; generation procedure described but standalone release to be confirmed.
  • GSM8k [Cobbe et al. 2021]: Publicly available grade-school math benchmark.

Reproducibility assessment:

  • Code availability: 2/5. No confirmed public code release. The method description is detailed enough for careful reimplementation, but the curriculum training procedure involves hyperparameter choices (schedule for replacing CoT tokens, learning rate per stage) that are not fully specified.
  • Data availability: 4/5. Two of three benchmarks are publicly available and well-established. ProsQA, introduced in this paper, requires the authors to release their generation code for full reproducibility.
  • Experimental detail: 3/5. The high-level algorithm is clear and the architectural choices well-described, but key training details (gradient handling across continuous thought boundaries, exact curriculum schedule, number of training seeds) would benefit from more precise specification. Without them, reproducing the exact reported numbers would require substantial hyperparameter search.