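Paper: High-Resolution Image Synthesis with Latent Diffusion Models

Traditional diffusion models operate directly in pixel space, which makes both training and inference slow. It also pushes the model to spend much of its capacity on fine details at the expense of core semantic generation. This paper proposes LDM (Latent Diffusion Models), the foundational work behind Stable Diffusion.

How does Stable Diffusion work?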

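First, an autoencoder is pretrained: its encoder maps an image into a low-dimensional latent space, and its decoder reconstructs the image from that latent.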
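The encode/decode round trip can be sketched with a toy model. This is only an illustration of the shapes involved, not the paper's autoencoder: `ToyAutoencoder`, its layer widths, and the choice of 4 latent channels are all assumptions, and no KL/VQ regularization or perceptual loss is included.

```python
import torch
import torch.nn as nn


class ToyAutoencoder(nn.Module):
    """Hypothetical stand-in for the first-stage model (downsampling factor f = 8)."""

    def __init__(self, latent_channels=4):
        super().__init__()
        # Three stride-2 convolutions: each halves H and W, so f = 2**3 = 8
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, latent_channels, 3, stride=2, padding=1),
        )
        # Three stride-2 transposed convolutions: each doubles H and W
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_channels, 64, 4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.SiLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),
        )

    def encode(self, x):
        return self.encoder(x)

    def decode(self, z):
        return self.decoder(z)


ae = ToyAutoencoder()
x = torch.randn(1, 3, 64, 64)  # small input to keep the example fast
z = ae.encode(x)               # latent: 8x smaller along each spatial axis
x_rec = ae.decode(z)           # back to the input resolution
print(z.shape, x_rec.shape)
```

With a 64x64 input the latent is 8x8, and decoding restores 64x64; the same factor applied to a 512x512 image would give a 64x64 latent.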

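The diffusion model then runs its generative (denoising) process in this low-dimensional latent space instead of pixel space.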
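A quick back-of-the-envelope calculation shows why this helps. The numbers below assume a 512x512 RGB image, a downsampling factor of 8, and 4 latent channels (the latent channel count is an assumption for illustration, not something stated above).

```python
def num_values(channels, height, width):
    """Number of scalar values in a (channels, height, width) tensor."""
    return channels * height * width


f = 8                                       # assumed downsampling factor
pixel = num_values(3, 512, 512)             # values the model sees in pixel space
latent = num_values(4, 512 // f, 512 // f)  # values it sees in latent space
ratio = pixel / latent

print(pixel, latent, ratio)  # 786432 16384 48.0
```

Under these assumptions the U-Net processes 48 times fewer values per sample than a pixel-space model would.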

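In addition, Stable Diffusion introduces cross-attention: inputs from various modalities (such as text or sketches) are converted by a domain-specific encoder and then serve as the keys K and values V, which interact with the latent-space features acting as the queries Q.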
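A minimal single-head sketch of such a cross-attention layer follows. The class name, the projection size `inner_dim`, and the feature dimensions in the usage lines are illustrative assumptions; real implementations typically use multiple heads and sit inside a larger transformer block.

```python
import torch
import torch.nn as nn


class CrossAttention(nn.Module):
    """Single-head cross-attention: queries from latent features, keys/values from the condition."""

    def __init__(self, query_dim, context_dim, inner_dim=64):
        super().__init__()
        self.scale = inner_dim ** -0.5
        self.to_q = nn.Linear(query_dim, inner_dim, bias=False)
        self.to_k = nn.Linear(context_dim, inner_dim, bias=False)
        self.to_v = nn.Linear(context_dim, inner_dim, bias=False)
        self.to_out = nn.Linear(inner_dim, query_dim)

    def forward(self, x, context):
        # x: (B, N, query_dim) flattened U-Net features
        # context: (B, M, context_dim) encoded condition, e.g. text tokens
        q = self.to_q(x)
        k = self.to_k(context)
        v = self.to_v(context)
        # Each latent position attends over all condition tokens: (B, N, M)
        attn = torch.softmax(q @ k.transpose(-1, -2) * self.scale, dim=-1)
        return self.to_out(attn @ v)


layer = CrossAttention(query_dim=320, context_dim=768)
feat = torch.randn(2, 8 * 8, 320)  # 8x8 latent feature map, flattened
ctx = torch.randn(2, 77, 768)      # 77 condition tokens (assumed sizes)
out = layer(feat, ctx)
print(out.shape)
```

The output has the same shape as the latent features, so the layer can be dropped into the U-Net without changing the surrounding tensor shapes.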

The training objective is as follows:
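Written out in LaTeX, the conditional objective from the LDM paper reads as below. Here $\mathcal{E}$ is the pretrained encoder, $z_t$ is the noised latent at time step $t$, $\epsilon_\theta$ is the denoising U-Net, and $\tau_\theta$ is the conditioning encoder applied to the condition $y$.

```latex
L_{LDM} := \mathbb{E}_{\mathcal{E}(x),\, y,\, \epsilon \sim \mathcal{N}(0, 1),\, t}
\left[ \left\| \epsilon - \epsilon_\theta\big(z_t,\, t,\, \tau_\theta(y)\big) \right\|_2^2 \right]
```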

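Compared with DDPM, the only changes are that diffusion moves from pixel space to latent space and that multimodal conditioning is added.

The overall architecture is shown below:

To make this easier to follow, I had Gemini generate some PyTorch pseudocode.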

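import torch
import torch.nn as nn
import torch.nn.functional as F


class LatentDiffusionModel(nn.Module):
    # The three sub-models are passed in rather than constructed here, and the
    # linear beta schedule is borrowed from DDPM; both are assumptions made so
    # that this sketch is self-consistent.
    def __init__(self, autoencoder, unet, cond_stage_model, latent_shape, T=1000):
        super().__init__()
        # 1. Stage one: the pretrained perceptual compression model
        # (KL- or VQ-regularized). It downsamples the image into latent
        # space by a factor f.
        self.autoencoder = autoencoder

        # 2. Stage two: the denoising U-Net in latent space.
        # Unlike DDPM, its convolutions run on the low-dimensional latent,
        # which saves a great deal of compute.
        self.unet = unet

        # 3. Conditioning encoder: turns each modality (text, layout, ...)
        # into an intermediate representation, e.g. BERT/CLIP for text.
        self.cond_stage_model = cond_stage_model

        self.latent_shape = latent_shape  # (latent_dim, h, w)
        self.T = T
        betas = torch.linspace(1e-4, 0.02, T)
        self.register_buffer("betas", betas)
        self.register_buffer("alphas_cumprod", torch.cumprod(1.0 - betas, dim=0))

    def q_sample(self, z, t, noise):
        # Closed-form forward process:
        # z_t = sqrt(abar_t) * z + sqrt(1 - abar_t) * noise
        abar = self.alphas_cumprod[t].view(-1, 1, 1, 1)
        return abar.sqrt() * z + (1.0 - abar).sqrt() * noise

    def forward(self, x, y):
        """
        x: input images
        y: conditioning input (e.g. a text prompt)
        """
        # --- Perceptual compression stage ---
        # The encoder maps the image into latent space, z = E(x).
        # z is f times smaller than x spatially (e.g. 512x512 -> 64x64).
        # The autoencoder is pretrained, so no gradients flow into it.
        with torch.no_grad():
            z = self.autoencoder.encode(x)

        # --- Diffusion (forward process) ---
        # Sample a random time step t for each item in the batch
        t = torch.randint(0, self.T, (z.shape[0],), device=z.device)
        # Draw Gaussian noise epsilon
        noise = torch.randn_like(z)
        # Add the noise to z to obtain the noisy latent z_t
        z_t = self.q_sample(z, t, noise)

        # --- Conditioning stage ---
        # Encode y into the intermediate representation tau_theta(y)
        context = self.cond_stage_model(y)

        # --- Semantic generation stage (denoising) ---
        # The U-Net predicts the noise. Key point: context enters the U-Net
        # through its cross-attention layers, which compute
        # Attention(Q=UNet_feat, K=context, V=context)
        predicted_noise = self.unet(z_t, t, context)

        # --- Training objective ---
        # L2 loss between the predicted noise and the true noise
        loss = F.mse_loss(predicted_noise, noise)
        return loss

    def denoise_step(self, z_t, t, context):
        # One DDPM ancestral sampling step with variance beta_t
        t_batch = torch.full((z_t.shape[0],), t, device=z_t.device, dtype=torch.long)
        eps = self.unet(z_t, t_batch, context)
        beta, abar = self.betas[t], self.alphas_cumprod[t]
        mean = (z_t - beta / (1.0 - abar).sqrt() * eps) / (1.0 - beta).sqrt()
        if t == 0:
            return mean  # no noise is added at the final step
        return mean + beta.sqrt() * torch.randn_like(z_t)

    @torch.no_grad()
    def sample(self, y):
        # Inference: start from pure Gaussian noise in latent space
        z_t = torch.randn((1, *self.latent_shape), device=self.betas.device)
        context = self.cond_stage_model(y)

        # Denoise iteratively for T steps to recover z
        for t in reversed(range(self.T)):
            z_t = self.denoise_step(z_t, t, context)

        # Finally, the decoder maps z back to pixel space: x_tilde = D(z)
        return self.autoencoder.decode(z_t)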

How good are the generated results?

LDM（Stable Diffusion）

https://d4wnnn.github.io/2026/03/14/Notion/LDM（Stable Diffusion）/

Author: D4wn

Published: March 14, 2026
