Latent Reasoning with Normalizing Flows

Guancheng Tu¹*, Xiangjun Fu¹*, Suhao Yu¹, Yao Tang¹, Haoqiang Kang², Lianhui Qin², Yizhe Zhang³, Jiatao Gu¹

* Equal contribution

¹University of Pennsylvania, ²UC San Diego, ³Meta

Large language models often improve reasoning by generating explicit chain-of-thought (CoT), demonstrating the importance of intermediate computation. However, textual CoT forces this computation through a discrete, serial, and communication-oriented token stream: each reasoning step must be verbalized before the model can proceed, even when the underlying update is semantic, uncertain, or only partially formed. Latent reasoning offers a higher-bandwidth alternative by performing intermediate computation in compact continuous states before committing to text. Yet existing latent-reasoning methods often sacrifice key advantages that make CoT effective in autoregressive language models, including native left-to-right generation, probabilistic sampling, compatibility with KV-cache decoding, and tractable likelihood estimation. We propose NF-CoT, a latent reasoning framework that preserves these advantages by modeling continuous thoughts with normalizing flows. NF-CoT instantiates a TARFlow-style normalizing flow inside the LLM backbone, defining a tractable probability model over compact continuous thoughts distilled from explicit CoT. Continuous-thought positions are generated by an NF head, while text positions are generated by the standard LM head within the same causal stream. This design provides exact likelihoods for latent thoughts, enables probabilistic left-to-right decoding with the original KV cache, and supports direct policy-gradient optimization in the latent reasoning space. On code-generation benchmarks, NF-CoT improves pass rates over explicit-CoT and prior latent-reasoning baselines while substantially reducing intermediate-reasoning cost.
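The exact-likelihood claim can be made concrete with the change-of-variables formula. The following is a sketch, assuming a single affine autoregressive flow block over T continuous thoughts z_1, ..., z_T of dimension d, conditioned on the prompt x. The symbols \mu_\theta, \alpha_\theta, and u_t are illustrative notation rather than the paper's, and the actual NF-CoT parameterization may stack several blocks or differ in detail.

```latex
u_t = \bigl(z_t - \mu_\theta(z_{<t}, x)\bigr) \odot \exp\bigl(-\alpha_\theta(z_{<t}, x)\bigr),
\qquad t = 1, \dots, T

\log p_\theta(z_{1:T} \mid x)
  = \sum_{t=1}^{T} \Bigl[ \log \mathcal{N}(u_t;\, 0, I)
  - \sum_{j=1}^{d} \alpha_{\theta, j}(z_{<t}, x) \Bigr]
```

Under this assumption, sampling inverts the same map, z_t = \mu_\theta(z_{<t}, x) + \exp(\alpha_\theta(z_{<t}, x)) \odot u_t with u_t drawn from a standard Gaussian. Each step depends only on earlier positions, which is what makes left-to-right decoding with a KV cache possible.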

Four Paradigms for Chain-of-Thought Reasoning

Comparing different approaches to intermediate reasoning in language models

[Figure: the four paradigms, with panels labeled discrete token, hidden state, diffusion latent, and Gaussian noise.]

Method Overview

NF-CoT training and inference pipeline. Training uses a frozen VAE encoder and shallow flow blocks to convert discrete-CoT rationales into continuous-CoT states.

Key Insight: At inference, NF-CoT samples continuous thoughts left-to-right and continues answer generation in the same causal stream—without recomputing the prompt or running the VAE/shallow blocks.
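A toy sketch of left-to-right sampling with an exact log-density may help make this concrete. It assumes scalar thoughts and a single affine autoregressive flow step. The `conditioner` function is a hypothetical stand-in for the LLM backbone plus NF head, not the paper's architecture, and real continuous thoughts would be vectors.

```python
import math
import random


def conditioner(prefix):
    """Hypothetical stand-in for the backbone: maps earlier thoughts to a
    shift (mu) and log-scale (alpha) for the next thought."""
    s = sum(prefix)
    mu = 0.5 * math.tanh(s)
    alpha = 0.1 * math.tanh(s)
    return mu, alpha


def standard_normal_logpdf(u):
    """Log-density of a scalar standard Gaussian."""
    return -0.5 * (u * u + math.log(2.0 * math.pi))


def sample_thoughts(num_steps, rng):
    """Sample z_1..z_T left to right and return them with their exact log-density."""
    z, log_prob = [], 0.0
    for _ in range(num_steps):
        mu, alpha = conditioner(z)           # uses only earlier thoughts
        u = rng.gauss(0.0, 1.0)              # Gaussian noise
        z.append(mu + math.exp(alpha) * u)   # inverse direction of the flow
        log_prob += standard_normal_logpdf(u) - alpha
    return z, log_prob


def log_likelihood(z):
    """Exact log-density of a given thought sequence (forward direction)."""
    total = 0.0
    for t, z_t in enumerate(z):
        mu, alpha = conditioner(z[:t])
        u = (z_t - mu) * math.exp(-alpha)    # map the thought back to noise
        total += standard_normal_logpdf(u) - alpha
    return total
```

Calling `sample_thoughts(5, random.Random(0))` and then `log_likelihood` on the returned thoughts gives the same log-density up to floating-point error, which is the property that policy-gradient training relies on.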

Performance Results

Comprehensive evaluation on code generation benchmarks comparing NF-CoT with autoregressive, diffusion, and latent reasoning baselines.

Model / Method	Base Model	MBPP	MBPP+	HumanEval	HumanEval+	LCB v6	Avg.
Open-Sourced Models
Autoregressive Coding Models
Qwen 2.5 Coder* [autoregressive]	Qwen2.5-7B	75.9	62.9	66.5	60.4	26.8	59.9
OpenCoder* [autoregressive]	from scratch (8B)	79.9	70.4	66.5	63.4	29.6	62.0
OlympicCoder* [autoregressive]	Qwen2.5-7B	80.0	66.4	82.1	76.9	37.3	68.5
Seed-Coder* [autoregressive]	from scratch	82.0	69.0	77.4	68.3	28.4	65.0
Diffusion Language Models
Dream* [diffusion]	Qwen2.5-7B	68.7	57.4	56.7	50.0	18.6	—
LLaDA* [diffusion]	from scratch (8B)	50.1	42.1	35.4	30.5	12.4	—
Diffu-Coder* [diffusion]	Qwen2.5-Coder-7B	75.1	61.9	72.0	65.2	24.5	59.7
Dream-Coder* [diffusion]	Qwen2.5-Coder-7B	75.9	61.6	66.5	60.4	21.4	57.2
d1* [diffusion]	LLaDA+RL	39.0	—	45.5	—	—	—
Looped Latent Reasoning Models
Ouro* [looped]	from scratch (2.6B)	80.4	66.6	78.2	70.7	38.7	66.9
Method Comparison
Same Backbone Comparison (Qwen3-8B-Base); values in parentheses are changes relative to the base model
Base Model	Qwen3-8B-Base	60.5	53.8	78.2	68.6	17.7	55.8
Standard SFT	Qwen3-8B-Base	63.3 (+2.8)	52.7 (-1.1)	84.6 (+6.4)	69.5 (+0.9)	—	—
Soft Thinking	Qwen3-8B-Base	64.2 (+3.7)	53.1 (-0.7)	85.0 (+6.8)	71.2 (+2.6)	—	—
TaH+	Qwen3-8B-Base	65.6 (+5.1)	56.5 (+2.7)	85.8 (+7.6)	74.3 (+5.7)	—	—
LaVAE	Qwen3-8B-Base	42.0 (-18.5)	30.2 (-23.6)	47.8 (-30.4)	32.8 (-35.8)	10.8 (-6.9)	32.7 (-23.0)
LaDiR [diffusion]	Qwen3-8B-Base	66.8 (+6.3)	59.5 (+5.7)	87.4 (+9.2)	73.2 (+4.6)	21.3 (+3.6)	61.6 (+5.9)
NF-CoT (Dual-Path) [normalizing flow] (Ours)	Qwen3-8B-Base	77.5 (+17.0)	66.2 (+12.4)	82.9 (+4.7)	77.7 (+9.1)	21.6 (+3.9)	65.2 (+9.4)
NF-CoT (Unified) [normalizing flow] (Ours)	Qwen3-8B-Base	83.9 (+23.4)	72.1 (+18.3)	85.8 (+7.6)	78.3 (+9.7)	23.7 (+6.0)	68.8 (+13.0)
↳ + RL [normalizing flow] (Ours)	Qwen3-8B-Base	85.4 (+24.9)	73.3 (+19.5)	86.7 (+8.5)	80.2 (+11.6)	25.1 (+7.4)	70.1 (+14.3)

Legend: [autoregressive] standard LLM decoding; [diffusion] iterative denoising; [normalizing flow] tractable likelihood (Ours)
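As a reading aid for the same-backbone rows, the short check below assumes that the Avg. column is the unweighted mean of the five benchmark scores and that each improvement is the difference from the Base Model row. The page does not state this explicitly; the assumption reproduces the reported NF-CoT (Unified) numbers. The function name `mean_score` is illustrative.

```python
def mean_score(scores):
    """Unweighted mean of benchmark scores, rounded to one decimal place."""
    return round(sum(scores) / len(scores), 1)


# Scores in column order: MBPP, MBPP+, HumanEval, HumanEval+, LCB v6.
base = [60.5, 53.8, 78.2, 68.6, 17.7]      # Base Model row
unified = [83.9, 72.1, 85.8, 78.3, 23.7]   # NF-CoT (Unified) row

base_avg = mean_score(base)                 # 55.8
unified_avg = mean_score(unified)           # 68.8
delta = round(unified_avg - base_avg, 1)    # 13.0
```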

Pass@k Scaling

Pass@k measures the probability that at least one of k generated samples passes all test cases. NF-CoT's probabilistic sampling in continuous thought space enables diverse solution exploration, leading to consistent improvements over both the base model and LaDiR across all k values.

Figure 3: Pass@k on MBPP+ (left) and HumanEval+ (right). NF-CoT dominates both baselines across the full k range. On MBPP+, NF-CoT's pass@1 (72.1) already matches the base model's pass@128 (72.0) and rises to 87.5 at k = 128. On HumanEval+, NF-CoT improves from 78.3 to 97.5 (+19.2).
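The page does not say how pass@k is estimated. A common choice, assumed here, is the unbiased estimator introduced with HumanEval: draw n samples per problem, count the c that pass, and compute 1 - C(n-c, k) / C(n, k). The sketch below implements it for a single problem; the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: number of samples drawn, c: number that pass all tests, k: budget.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Every size-k subset contains at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For a problem where 61 of 64 samples pass, `pass_at_k(64, 61, 1)` equals 61/64. Benchmark-level pass@k is then the mean of this quantity over problems.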

RL Preserves Pass@k Diversity

A common concern is that optimizing a single correctness reward with RL can raise pass@1 while collapsing the diversity needed for pass@k scaling. We compare standard token-space GRPO with our latent-space RL under the same pass@k diagnostic. Token-space GRPO improves the low-k region but saturates at larger k, while latent-space RL for NF-CoT improves pass@1 and preserves the upward scaling trend.

Figure 4: Pass@k diversity before and after RL on MBPP+ and HumanEval+. Top row: token-space GRPO (AR Vanilla GRPO) improves the low-k region but concentrates probability mass on fewer solution modes and saturates below the base model at large k. Bottom row: policy-gradient refinement in the continuous-CoT space does not collapse the latent trajectory distribution, so NF-CoT + RL improves pass@1 and stays consistently above the supervised checkpoint across the full k range.
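To illustrate how exact log-likelihoods plug into a policy-gradient update, here is a minimal sketch assuming GRPO-style group-relative advantages. The page does not specify the latent-space RL objective, so the normalization, the `eps` constant, and both function names are assumptions. In practice `log_probs` would be differentiable tensors from the flow; plain floats are used here only to show the arithmetic.

```python
import math


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same prompt."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def policy_gradient_loss(log_probs, rewards):
    """REINFORCE-style surrogate: minimizing it raises the log-likelihood of
    samples with above-average reward and lowers it for the rest."""
    advantages = group_relative_advantages(rewards)
    weighted = [a * lp for a, lp in zip(advantages, log_probs)]
    return -sum(weighted) / len(weighted)
```

With binary correctness rewards, a group in which every sample passes (or every sample fails) yields zero advantages and therefore no gradient signal under this normalization.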

Case Studies

Qualitative examples showing how different latent samples lead to different implementation strategies while still producing functionally correct programs.

Note: Decoded latent CoTs are qualitative probes, not faithful natural-language explanations of the model's internal reasoning.

HumanEval/63: fibfib (64/64 passed)

Problem

def fibfib(n: int):
    """The FibFib number sequence is a sequence similar to the 
    Fibonacci sequence that's defined as follows:
    fibfib(0) == 0
    fibfib(1) == 0
    fibfib(2) == 1
    fibfib(n) == fibfib(n-1) + fibfib(n-2) + fibfib(n-3).
    
    Please write a function to efficiently compute the n-th 
    element of the fibfib number sequence.
    >>> fibfib(1)
    0
    >>> fibfib(5)
    4
    >>> fibfib(8)
    24
    """

This task shows how the same problem yields three passing answers with genuinely different implementation strategies.

Decoded Latent CoT:

To calculate the Fibonacci number sequence up to the n-th number, we need to follow these steps:

### Plan

1. **Understanding the Fibonacci Sequence**: The Fibonacci sequence is defined as:
   - F(1) = 0
   - F(2) = 1
   - F(n) = F(n-1) + F(n-2)
2. **Compressing the Formula for Efficiency**: We use a fixed amount of space to store the last three Fibonacci numbers and calculate the next. This optimizes, since we don't need to use n variables.
3. **Flow**: This calculation is done in constant O(1) extra space by using three variables.

Generated Answer: passed

def fibfib(n: int) -> int:
    """Compute the n-th element of the fibfib sequence."""
    # Base cases
    if n == 0:
        return 0
    elif n == 1:
        return 0
    elif n == 2:
        return 1
    # Initialize the first three elements
    a, b, c = 0, 0, 1
    # Compute the sequence iteratively
    for _ in range(3, n + 1):
        next_value = a + b + c
        a, b, c = b, c, next_value
    return c

HumanEval/154: cycpattern_check (61/64 passed)

HumanEval/89: encrypt (64/64 passed)

Key Takeaway

Across these examples, decoded latent CoTs consistently reflect the coarse algorithmic route followed by the final program. NF-CoT does not merely store a single hidden solution trace—instead, its latent space captures a distribution over plausible reasoning trajectories, allowing different samples to steer generation toward distinct but correct implementations.

Citation

BibTeX

@misc{tu2026latentreasoningnormalizingflows,
      title={Latent Reasoning with Normalizing Flows}, 
      author={Guancheng Tu and Xiangjun Fu and Suhao Yu and Yao Tang and Haoqiang Kang and Lianhui Qin and Yizhe Zhang and Jiatao Gu},
      year={2026},
      eprint={2606.06447},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2606.06447v1}, 
}