Latent Reasoning with Normalizing Flows

Guancheng Tu¹*, Xiangjun Fu¹*, Suhao Yu¹, Yao Tang¹, Haoqiang Kang², Lianhui Qin², Yizhe Zhang³, Jiatao Gu¹

* Equal contribution

¹University of Pennsylvania, ²UC San Diego, ³Meta

Large language models often improve reasoning with explicit chain-of-thought (CoT), but expressing thoughts as text confines reasoning to a verbose, low-information-density channel. Latent CoT addresses this limitation by moving intermediate reasoning into continuous states. However, existing latent methods often fail to retain the core advantages of explicit CoT: native autoregressive generation, probabilistic sampling, and tractable likelihood estimation. We propose NF-CoT, a latent reasoning framework that preserves these advantages by modeling continuous thoughts with normalizing flows. NF-CoT realizes a TARFlow-style normalizing flow within the LLM backbone, defining a tractable probability model over compact continuous thoughts learned from explicit CoT. This yields tractable likelihoods for continuous thoughts, supports direct policy-gradient optimization in the latent reasoning space, and enables probabilistic left-to-right decoding with the original KV cache. On code-generation benchmarks, NF-CoT improves pass rates over explicit-CoT and prior latent-reasoning baselines while substantially reducing intermediate-reasoning cost.
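The "tractable likelihood" claim can be made concrete with the change-of-variables formula. The block below is a sketch, assuming each continuous thought u_i in R^D is a diagonal affine transform of standard Gaussian noise, which matches the sampling rule u_i ~ N(μ_i, σ_i | Q, u_{<i}) shown in the paradigm figure. The symbols T (number of thoughts), D (thought dimension), R (a scalar reward such as a test pass indicator), and θ (model parameters) are notation introduced here for illustration. The second line is the generic score-function (REINFORCE) estimator, not necessarily the exact RL objective used by NF-CoT.

```latex
% Sampling one thought: an affine map of Gaussian noise, conditioned on the prefix
u_i = \mu_i(Q, u_{<i}) + \sigma_i(Q, u_{<i}) \odot \epsilon_i,
\qquad \epsilon_i \sim \mathcal{N}(0, I)

% Exact log-likelihood of a thought sequence (change of variables)
\log p_\theta(u_{1:T} \mid Q)
  = \sum_{i=1}^{T} \Big[ \log \mathcal{N}(\epsilon_i; 0, I)
      - \sum_{d=1}^{D} \log \sigma_{i,d}(Q, u_{<i}) \Big],
\qquad \epsilon_i = \frac{u_i - \mu_i(Q, u_{<i})}{\sigma_i(Q, u_{<i})}

% Generic policy gradient enabled by the tractable log-likelihood
\nabla_\theta \, \mathbb{E}_{u \sim p_\theta(\cdot \mid Q)}\big[ R(u) \big]
  = \mathbb{E}_{u \sim p_\theta(\cdot \mid Q)}\big[ R(u) \, \nabla_\theta \log p_\theta(u_{1:T} \mid Q) \big]
```

Because every term in the sum depends only on Q and earlier thoughts, the likelihood can be evaluated in a single left-to-right pass.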

Four Paradigms for Chain-of-Thought Reasoning

Comparing different approaches to intermediate reasoning in language models

[Animated figure: four paradigms, each mapping a question Q to an answer A. Legend: discrete token, hidden state, diffusion latent, Gaussian noise.]
Explicit CoT: discrete text tokens. Q → LLM → d₁, d₂, d₃, d₄ → A.
Coconut: deterministic hidden-state feedback. Q → LLM → h₁, h₂, h₃ → A.
LaDiR: iterative latent denoising. Q → LLM → z_T, z_{T-1}, ..., z_1, z_0 → A.
NF-CoT (Ours): autoregressive Gaussian sampling, u_i ~ N(μ_i, σ_i | Q, u_{<i}). Q → LLM + NF head → p(u₁|Q), p(u₂|Q,u₁), p(u₃|...) → u₁, u₂, u₃ → LM head → A. Decoded probe: "First ... then ... therefore ..."

Method Overview

NF-CoT training and inference pipeline. Training uses a frozen VAE encoder and shallow flow blocks to convert discrete-CoT rationales into continuous-CoT states.


Key Insight: At inference, NF-CoT samples continuous thoughts left-to-right and continues answer generation in the same causal stream—without recomputing the prompt or running the VAE/shallow blocks.
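The left-to-right sampling described above can be sketched in a few lines. This is a toy illustration, not the NF-CoT implementation: `toy_nf_head` is a hypothetical stand-in for the LLM backbone plus NF head, thoughts are scalars rather than vectors, and no real KV cache is involved. It shows only that sampling and exact likelihood evaluation share the same per-step Gaussian parameters.

```python
import math
import random


def toy_nf_head(prefix):
    """Hypothetical stand-in for the LLM + NF head.

    Maps the already-sampled thoughts (the prefix) to a Gaussian mean and
    standard deviation for the next thought. A real model would read these
    from the backbone's hidden state instead.
    """
    mu = 0.5 * math.tanh(sum(prefix))
    sigma = 0.5 + 0.25 / (1.0 + len(prefix))
    return mu, sigma


def gaussian_step_log_prob(eps, sigma):
    """Log density of u = mu + sigma * eps under N(mu, sigma^2)."""
    return -0.5 * eps * eps - 0.5 * math.log(2.0 * math.pi) - math.log(sigma)


def sample_thoughts(num_thoughts, rng):
    """Sample thoughts left-to-right and accumulate their exact log-likelihood."""
    thoughts, log_prob = [], 0.0
    for _ in range(num_thoughts):
        mu, sigma = toy_nf_head(thoughts)
        eps = rng.gauss(0.0, 1.0)
        thoughts.append(mu + sigma * eps)
        log_prob += gaussian_step_log_prob(eps, sigma)
    return thoughts, log_prob


def log_likelihood(thoughts):
    """Score a given thought sequence by inverting each affine step."""
    log_prob = 0.0
    for i, u in enumerate(thoughts):
        mu, sigma = toy_nf_head(thoughts[:i])
        eps = (u - mu) / sigma
        log_prob += gaussian_step_log_prob(eps, sigma)
    return log_prob


if __name__ == "__main__":
    sampled, lp = sample_thoughts(4, random.Random(0))
    print(sampled, lp, log_likelihood(sampled))
```

The log-likelihood returned during sampling matches the one recomputed from the sampled thoughts alone, which is the property that makes policy-gradient training on latent thoughts possible.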

Performance Results

Comprehensive evaluation on code generation benchmarks comparing NF-CoT with autoregressive, diffusion, and latent reasoning baselines.

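Values in parentheses are differences from the Qwen3-8B-Base row. A dash marks a score that is not listed.

| Model / Method | Type | Base Model | MBPP | MBPP+ | HumanEval | HumanEval+ | LCB v6 | Avg. |
|---|---|---|---|---|---|---|---|---|
| Open-Sourced Models: Autoregressive Coding Models | | | | | | | | |
| Qwen 2.5 Coder* | autoregressive | Qwen2.5-7B | 75.9 | 62.9 | 66.5 | 60.4 | 26.8 | 59.9 |
| OpenCoder* | autoregressive | from scratch (8B) | 79.9 | 70.4 | 66.5 | 63.4 | 29.6 | 62 |
| OlympicCoder* | autoregressive | Qwen2.5-7B | 80 | 66.4 | 82.1 | 76.9 | 37.3 | 68.5 |
| Seed-Coder* | autoregressive | from scratch | 82 | 69 | 77.4 | 68.3 | 28.4 | 65 |
| Open-Sourced Models: Diffusion Language Models | | | | | | | | |
| Dream* | diffusion | Qwen2.5-7B | 68.7 | 57.4 | 56.7 | 50 | 18.6 | - |
| LLaDA* | diffusion | from scratch (8B) | 50.1 | 42.1 | 35.4 | 30.5 | 12.4 | - |
| Diffu-Coder* | diffusion | Qwen2.5-Coder-7B | 75.1 | 61.9 | 72 | 65.2 | 24.5 | 59.7 |
| Dream-Coder* | diffusion | Qwen2.5-Coder-7B | 75.9 | 61.6 | 66.5 | 60.4 | 21.4 | 57.2 |
| Open-Sourced Models: Looped Latent Reasoning Models | | | | | | | | |
| Ouro* | looped | from scratch (2.6B) | 80.4 | 66.6 | 78.2 | 70.7 | 38.7 | 66.9 |
| Method Comparison: Same Backbone (Qwen3-8B-Base) | | | | | | | | |
| Base Model | | Qwen3-8B-Base | 60.5 | 53.8 | 78.2 | 68.6 | 17.7 | 55.8 |
| Standard SFT | | Qwen3-8B-Base | 63.3 (+2.8) | 52.7 (-1.1) | 84.6 (+6.4) | 69.5 (+0.9) | - | - |
| Soft Thinking | | Qwen3-8B-Base | 64.2 (+3.7) | 53.1 (-0.7) | 85 (+6.8) | 71.2 (+2.6) | - | - |
| TaH+ | | Qwen3-8B-Base | 65.6 (+5.1) | 56.5 (+2.7) | 85.8 (+7.6) | 74.3 (+5.7) | - | - |
| LaVAE | | Qwen3-8B-Base | 42 (-18.5) | 30.2 (-23.6) | 47.8 (-30.4) | 32.8 (-35.8) | 10.8 (-6.9) | 32.7 (-23) |
| LaDiR | diffusion | Qwen3-8B-Base | 66.8 (+6.3) | 59.5 (+5.7) | 87.4 (+9.2) | 73.2 (+4.6) | 21.3 (+3.6) | 61.6 (+5.9) |
| NF-CoT (Dual-Path) (Ours) | normalizing flow | Qwen3-8B-Base | 77.5 (+17) | 66.2 (+12.4) | 82.9 (+4.7) | 77.7 (+9.1) | 21.6 (+3.9) | 65.2 (+9.4) |
| NF-CoT (Unified) (Ours) | normalizing flow | Qwen3-8B-Base | 83.9 (+23.4) | 72.1 (+18.3) | 85.8 (+7.6) | 78.3 (+9.7) | 23.7 (+6) | 68.8 (+13) |
| ↳ + RL (Ours) | normalizing flow | Qwen3-8B-Base | 85.4 (+24.9) | 73.3 (+19.5) | 86.7 (+8.5) | 80.2 (+11.6) | 25.1 (+7.4) | 70.1 (+14.3) |

d1* (diffusion; base: LLaDA+RL): two scores are listed, 39 and 45.5; the benchmarks they belong to are not identified.

Type legend: autoregressive = standard LLM decoding; diffusion = iterative denoising; normalizing flow = tractable likelihood (ours).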
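The Avg. column and the deltas in the same-backbone rows appear consistent with a plain mean over the five benchmarks and a subtraction of the Qwen3-8B-Base score. The short check below is a sketch under that assumption (the page does not state the averaging rule); the helper names are made up for illustration, and the numbers are copied from the Base Model and NF-CoT (Unified) rows.

```python
def mean_score(scores):
    """Unweighted mean over benchmarks, rounded to one decimal (assumed rule)."""
    return round(sum(scores) / len(scores), 1)


def deltas(method, base):
    """Per-benchmark improvement of a method over the base model."""
    return [round(m - b, 1) for m, b in zip(method, base)]


# Column order: MBPP, MBPP+, HumanEval, HumanEval+, LCB v6
base = [60.5, 53.8, 78.2, 68.6, 17.7]      # Qwen3-8B-Base row
unified = [83.9, 72.1, 85.8, 78.3, 23.7]   # NF-CoT (Unified) row

base_avg = mean_score(base)
unified_avg = mean_score(unified)
avg_delta = round(unified_avg - base_avg, 1)

if __name__ == "__main__":
    print(base_avg, unified_avg, avg_delta)
    print(deltas(unified, base))
```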

Pass@k Scaling

Pass@k measures the probability that at least one of k generated samples passes all test cases. NF-CoT's probabilistic sampling in continuous thought space enables diverse solution exploration, leading to consistent improvements over both the base model and LaDiR across all k values.
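For reference, pass@k is commonly estimated from n generated samples, c of which pass, using the unbiased estimator 1 - C(n-c, k) / C(n, k) introduced with the HumanEval benchmark. Whether the curves on this page use exactly this estimator is an assumption; the sketch below only illustrates the definition, using the 64-sample counts from the case studies as example inputs.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples with c passing samples.

    Equals the probability that a random size-k subset of the n samples
    contains at least one passing sample.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        # Fewer than k failing samples: every size-k subset has a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # 61 of 64 samples passing, as in the cycpattern_check case study
    for k in (1, 2, 4):
        print(k, pass_at_k(64, 61, k))
```

With 61 of 64 samples passing, pass@1 is 61/64, and pass@4 is already 1.0 because only three samples fail.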

Pass@k scaling curves on MBPP+ and HumanEval+. NF-CoT (blue squares) consistently outperforms Base model (light blue circles) and LaDiR (red circles) across all k values from 1 to 128.

Figure: Pass@k on MBPP+ (left) and HumanEval+ (right). NF-CoT continues to improve with larger k, demonstrating effective sampling diversity from the learned continuous thought distribution.

Case Studies

Qualitative examples showing how different latent samples lead to different implementation strategies while still producing functionally correct programs.

Note: Decoded latent CoTs are qualitative probes, not faithful natural-language explanations of the model's internal reasoning.

HumanEval/63: fibfib (64/64 passed)

Problem

def fibfib(n: int):
    """The FibFib number sequence is a sequence similar to the 
    Fibonacci sequence that's defined as follows:
    fibfib(0) == 0
    fibfib(1) == 0
    fibfib(2) == 1
    fibfib(n) == fibfib(n-1) + fibfib(n-2) + fibfib(n-3).
    
    Please write a function to efficiently compute the n-th 
    element of the fibfib number sequence.
    >>> fibfib(1)
    0
    >>> fibfib(5)
    4
    >>> fibfib(8)
    24
    """

This task demonstrates how the same problem produces 3 passing answers with genuinely different implementation strategies.

Decoded Latent CoT:
To calculate the Fibonacci number sequence up to the n-th number, we need to follow these steps:
### Plan
1. **Understanding the Fibonacci Sequence**: The Fibonacci sequence is defined as:
   - F(1) = 0
   - F(2) = 1
   - F(n) = F(n-1) + F(n-2)
2. **Compressing the Formula for Efficiency**: We use a fixed amount of space to store the last three Fibonacci numbers and calculate the next. This optimizes, since we don't need to use n variables.
3. **Flow**: This calculation is done in constant O(1) extra space by using three variables.

Generated Answer (passed):
def fibfib(n: int) -> int:
    """Compute the n-th element of the fibfib sequence."""
    # Base cases
    if n == 0:
        return 0
    elif n == 1:
        return 0
    elif n == 2:
        return 1
    # Initialize the first three elements
    a, b, c = 0, 0, 1
    # Compute the sequence iteratively
    for _ in range(3, n + 1):
        next_value = a + b + c
        a, b, c = b, c, next_value
    return c

HumanEval/154: cycpattern_check (61/64 passed)
HumanEval/89: encrypt (64/64 passed)

Key Takeaway

Across these examples, decoded latent CoTs consistently reflect the coarse algorithmic route followed by the final program. NF-CoT does not merely store a single hidden solution trace—instead, its latent space captures a distribution over plausible reasoning trajectories, allowing different samples to steer generation toward distinct but correct implementations.

Citation

BibTeX
@article{tu2026nfcot,
  title={Latent Reasoning with Normalizing Flows},
  author={Tu, Guancheng and Fu, Xiangjun and Yu, Suhao and Tang, Yao and Kang, Haoqiang and Qin, Lianhui and Zhang, Yizhe and Gu, Jiatao},
  journal={arXiv preprint arXiv:2506.XXXXX},
  year={2026}
}