How Does the New Generative Model "Discrete Distribution Networks" (DDN) Combine a Simple Principle with Unique Properties?

Deep News
Aug 19

This article is authored by Yang Lei, who currently serves as a post-training algorithm engineer at the large model startup StepFun, with research interests in generative models and language model post-training. Previously, he worked as a computer vision algorithm engineer at Megvii Technology for six years, focusing on 3D vision and data synthesis. He graduated from Beijing University of Chemical Technology in 2018 with a bachelor's degree.

Mainstream foundational generative models currently fall into five major families: energy-based models (including diffusion), GANs, autoregressive models, VAEs, and flow-based models. This work proposes a novel generative model: Discrete Distribution Networks, abbreviated as DDN. The paper has been published at ICLR 2025.

DDN employs a concise and unique mechanism to model target distributions:

1. In a single forward pass, DDN simultaneously generates K outputs (rather than a single output).
2. These outputs collectively form a discrete distribution of K equally weighted sample points (each with probability 1/K), which is the origin of the name "Discrete Distribution Networks."
3. The training objective is to optimize the positions of these sample points so that the network's output distribution approximates the true distribution of the training data as closely as possible.
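To make this concrete, here is a minimal, hypothetical sketch of the core idea in one dimension, written in PyTorch: K learnable scalar "sample points" are fit to a toy bimodal dataset by always pulling the closest point toward each training sample. Everything here (shapes, optimizer, learning rate) is illustrative, not the paper's implementation.

```python
import torch

K = 8
points = torch.randn(K, requires_grad=True)          # K learnable sample points
opt = torch.optim.Adam([points], lr=0.05)

# Toy bimodal target distribution.
data = torch.cat([torch.randn(500) - 3, torch.randn(500) + 3])

for step in range(2000):
    x = data[torch.randint(len(data), (64,))]        # mini-batch of targets
    dist = (x[:, None] - points[None, :]) ** 2       # (batch, K) squared distances
    # Only the closest sample point receives gradient for each target,
    # so the K points drift toward covering the modes of the data.
    loss = dist.min(dim=1).values.mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

After training, the K points spread over the two modes, forming a discrete approximation of the target density.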

Each type of generative model has its own unique properties, and DDN is no exception. This article focuses on three characteristics of DDN:

1. More general zero-shot conditional generation;
2. End-to-end differentiability;
3. A unique one-dimensional discrete latent representation.

**Discrete Distribution Networks**

Figure 1: DDN reconstruction process diagram

First, let's use the DDN reconstruction workflow shown in the figure above as an entry point for understanding its principles. Unlike diffusion models and GANs, which cannot directly reconstruct data, DDN has reconstruction capabilities similar to a VAE: it first maps the data into a latent space, then generates a reconstruction highly similar to the original from that latent representation.

The figure shows how DDN reconstructs a target and obtains its latent representation. In general, a DDN consists of a hierarchy of L layers; in the diagram, L=3. Let's first focus on the leftmost, first layer.

**Discrete Distribution**: As mentioned above, DDN's core idea is to have the network simultaneously generate K outputs, representing "the network outputs a discrete distribution." Therefore, each DDN layer has K outputs, generating K different images at once, where K=3 in the diagram. Each output represents a sample point in this discrete distribution, with each sample point having equal probability mass of 1/K.
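As a hedged sketch of what such a layer might look like in PyTorch, the module below uses a shared backbone block followed by K lightweight heads, so one forward pass yields K candidate images. The class name, shapes, and layer choices are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class DiscreteDistributionLayer(nn.Module):
    """One DDN layer: a shared block followed by K output heads."""

    def __init__(self, channels: int, K: int, out_channels: int = 1):
        super().__init__()
        self.block = nn.Sequential(                  # shared feature computation
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(),
        )
        # K 1x1-conv heads, one per sample point of the discrete distribution.
        self.heads = nn.ModuleList(
            [nn.Conv2d(channels, out_channels, 1) for _ in range(K)]
        )

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        h = self.block(feat)
        # Stack the K candidates: (batch, K, out_channels, H, W).
        return torch.stack([head(h) for head in self.heads], dim=1)
```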

**Hierarchical Generation**: The ultimate goal is to make this discrete distribution (the K outputs) as close as possible to the target distribution (the training set). Obviously, the K outputs of the first layer alone cannot characterize the entire MNIST dataset; they look more like the average images of MNIST clustered into K groups. We therefore introduce a "hierarchical generation" design to obtain sharper images: we select the first-layer output most similar to the target and feed it into the second layer as its condition, then again select the output most similar to the target from the second layer's K outputs as the condition for the third layer, and so on. As the number of layers increases, the generated images become increasingly similar to the target, ultimately completing the reconstruction.
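Below is a hedged sketch of this selection loop, assuming a hypothetical `model.layers` list where each layer maps the previously selected image (the condition) to K new candidates; it also records the per-layer indices, which form exactly the latent representation discussed next.

```python
import torch

def reconstruct(model, target: torch.Tensor):
    """Greedily select the candidate closest to the target at each of the L layers."""
    condition = torch.zeros_like(target)   # the first layer starts from a blank condition
    latent = []                            # selected index per layer
    for layer in model.layers:
        candidates = layer(condition)      # assumed shape (K, C, H, W): K candidate images
        errors = ((candidates - target) ** 2).flatten(1).mean(dim=1)
        idx = int(errors.argmin())
        condition = candidates[idx]        # becomes the next layer's condition
        latent.append(idx + 1)             # 1-based index, as in the "3-1-2" example
    return condition, latent               # reconstruction and its latent code
```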

**Latent**: The index of the selected output at each layer throughout this selection process forms the target's latent representation (the green part "3-1-2" in the figure). Therefore, the latent is an integer array of length L with values in the range [1,K].

**Network Structure**

Further refining the "reconstruction process diagram" gives us the network structure diagram (a) below:

DDN network structure diagram and two supported network structure forms

In figure (a), all generation-related designs are integrated into the Discrete Distribution Layer (DDL), while modules that provide only basic computation are encapsulated as NN Blocks; the figure emphasizes the data flow inside the DDL during training.

Figures (b) and (c) on the right show the two network structure forms DDN supports: (b) a single shot generator, in which each layer has its own parameters and the image is produced in one forward pass, and (c) a recurrence-iteration form, in which a single set of weights is applied repeatedly.

For computational efficiency, DDN defaults to the single-shot-generator form, which has coarse-to-fine characteristics.

**Loss Function**

At each layer, the training loss is simply the distance (e.g., L2) between the ground truth and the closest of the K outputs, so only the selected output receives gradient. Additionally, this work proposes the Split-and-Prune optimization algorithm to ensure that, during training, each node is matched to the ground truth with a roughly equal probability of 1/K.
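As a rough, hypothetical sketch of the Split-and-Prune idea (thresholds and bookkeeping here are illustrative assumptions, not the paper's exact schedule): track how often each of the K sample points is matched during training, then clone an over-matched point onto a never-matched one, so the probability mass re-balances toward 1/K per point.

```python
import torch

def split_and_prune(points: torch.Tensor, counts: torch.Tensor, eps: float = 1e-3):
    """Re-balance sample points based on how often each was matched to data."""
    total = counts.sum()
    K = len(points)
    dead = counts == 0                         # never-matched points: candidates to prune
    hot = counts.argmax()                      # most-matched point: candidate to split
    if dead.any() and counts[hot] > 2 * total / K:
        victim = int(dead.nonzero()[0])
        with torch.no_grad():
            # Prune the dead point by overwriting it with a slightly perturbed
            # copy of the hot point; the two copies then share its mass.
            points[victim] = points[hot] + eps * torch.randn(())
        counts[victim] = counts[hot] = counts[hot] // 2
    return points, counts
```

In the 1-D toy example above, something like this would be called every few optimizer steps, with `counts` incremented at each closest-point match.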

The figure below shows DDN's optimization process for two-dimensional probability density estimation:

Left: generated sample set; right: ground-truth probability density.

**Experiments and Property Demonstration**

**Random Sampling Results**

Random sampling results on face datasets

**More General Zero-Shot Conditional Generation**

First, let's describe the "Zero-Shot Conditional Generation" (ZSCG) task: using a model trained without any conditions (unconditional) to generate samples that satisfy a given condition at inference time, with no fine-tuning and no gradient computation.

Using an unconditional DDN for zero-shot conditional generation:

Conditions from other modalities (such as a text prompt scored by CLIP) can guide an unconditionally trained DDN to perform conditional generation, without requiring gradients. Yellow boxes mark the ground-truth reference. SR stands for super-resolution, ST for style transfer.

As shown in the figure above, DDN supports a rich set of zero-shot conditional generation tasks, using a procedure almost identical to the DDN reconstruction process of Figure 1. Specifically, we only need to replace the target in Figure 1 with the corresponding condition and adjust the sampling logic to select, at each layer, the output that best matches the current condition. As the number of layers increases, the generated output conforms to the condition more and more closely. The entire process requires no gradient computation; a black-box discriminative model alone guides the network through zero-shot conditional generation.
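Here is a hedged sketch of this guided sampling, reusing the hypothetical `model.layers` from the reconstruction sketch; `score_fn` stands in for any black-box discriminative model, e.g. CLIP similarity to a text prompt.

```python
import torch

def zero_shot_conditional_sample(model, score_fn, shape):
    """Sample from an unconditional DDN, steered by a black-box scorer."""
    condition = torch.zeros(shape)
    for layer in model.layers:
        candidates = layer(condition)              # assumed shape (K, C, H, W)
        with torch.no_grad():                      # the scorer is queried gradient-free
            scores = torch.stack([score_fn(c) for c in candidates])
        condition = candidates[int(scores.argmax())]
    return condition                               # sample that best fits the condition
```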

DDN is the first generative model to support such characteristics.

In more professional terms:

> DDN is the first generative model that supports guiding the sampling process with a purely discriminative model; in some sense, it pushes generative and discriminative models toward unification.

This also means users can efficiently filter and manipulate the entire distribution space through DDN. This property is very interesting and leaves plenty of room for experimentation; personally, I expect zero-shot conditional generation to find wide application.

**Conditional Training**

Training a conditional DDN is very simple: feed the condition, or its features, directly into the network, and the network automatically learns P(X|Y). Additionally, a conditional DDN can be combined with ZSCG to further enhance the controllability of the generation process. The fourth and fifth columns in the figure below show the outputs of a conditional DDN when additionally guided by other images via ZSCG.
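A minimal, hypothetical sketch of such a training step follows, assuming the same `model.layers` convention as before plus a `cond_encoder` that embeds the condition y (e.g. a grayscale image for colorization); the injection-by-concatenation and the shapes are illustrative assumptions.

```python
import torch

def conditional_loss(model, cond_encoder, x, y):
    """One training loss for P(X|Y): inject y's features at every layer."""
    cond_feat = cond_encoder(y)                         # features of the condition
    condition = torch.zeros_like(x)
    loss = 0.0
    for layer in model.layers:
        inp = torch.cat([condition, cond_feat], dim=0)  # condition injection
        candidates = layer(inp)                         # assumed shape (K, C, H, W)
        errors = ((candidates - x) ** 2).flatten(1).mean(dim=1)
        idx = int(errors.argmin())
        condition = candidates[idx]
        loss = loss + errors[idx]                       # optimize only the chosen output
    return loss
```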

Conditional DDNs performing colorization and edge-to-RGB tasks. The fourth and fifth columns show zero-shot conditional generation guided by other images: the generated images satisfy the condition while approximating the color tone of the guide images as closely as possible.

**End-to-End Differentiability**

Samples generated by DDN stay connected to the computational graph that produced them, so all parameters can be optimized end-to-end with the standard chain rule. This smooth gradient flow shows up in two ways:

1. DDN has a single consistent feature backbone through which gradients backpropagate efficiently. In contrast, diffusion models must repeatedly route gradients through the noisy sample space during backpropagation.

2. DDN's sampling process does not block gradients, meaning intermediate outputs generated by the network are also fully differentiable, requiring no approximation operations and introducing no noise.

In theory, this could make DDN a more efficient generative backbone for fine-tuning against discriminative models or for reinforcement learning tasks; a minimal sketch follows below.
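The hypothetical snippet below illustrates the point: because selecting a candidate is just tensor indexing, a reward computed on the final sample backpropagates through every layer in one ordinary backward pass. `reward_fn` and `model.layers` are assumptions carried over from the earlier sketches.

```python
import torch

def finetune_step(model, reward_fn, optimizer, shape):
    """Maximize a differentiable reward on a DDN sample, end to end."""
    condition = torch.zeros(shape)
    for layer in model.layers:
        candidates = layer(condition)              # assumed shape (K, C, H, W)
        with torch.no_grad():                      # pick an index without gradients
            idx = int(reward_fn(candidates).argmax())
        condition = candidates[idx]                # indexing keeps the graph intact
    loss = -reward_fn(condition.unsqueeze(0)).mean()  # maximize the reward
    optimizer.zero_grad()
    loss.backward()                                # one plain chain-rule backward pass
    optimizer.step()
```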

**Unique One-Dimensional Discrete Latent**

DDN naturally has a one-dimensional discrete latent representation. Since each layer's outputs are conditioned on all previously selected results, the latent space has a tree structure: a tree of degree K and depth L, in which each leaf corresponds to one DDN sample.
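A small self-contained example of the arithmetic this implies: the tree addresses K**L distinct leaves, and a latent is just the root-to-leaf path, here encoded to and decoded from a single integer (the mixed-radix encoding is my own illustrative choice, not part of the paper).

```python
K, L = 3, 3                             # values from Figure 1
num_leaves = K ** L                     # 27 samples addressable by latents
latent = [3, 1, 2]                      # the green path from Figure 1 (1-based)
assert len(latent) == L and all(1 <= i <= K for i in latent)

# Flatten the path into a single leaf id (mixed-radix encoding)...
leaf_id = 0
for i in latent:
    leaf_id = leaf_id * K + (i - 1)

# ...and invert it back into the path.
path = []
for _ in range(L):
    path.append(leaf_id % K + 1)
    leaf_id //= K
path.reverse()
assert path == latent
```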

DDN's latent space is tree-structured, with the green path showing the latent corresponding to the target in Figure 1

**Latent Visualization**

To visualize the latent structure, we trained a DDN on MNIST with L=3 layers and K=8 outputs per layer, and display its latent tree as a recursive nine-grid: the center cell of each grid is the condition (the output sampled at the previous level), and the surrounding 8 cells are the 8 new outputs generated with that center as condition.
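As a hedged sketch of how such a figure could be assembled (the `generate_children` hook, returning the 8 outputs conditioned on an image, is a hypothetical stand-in for the trained DDN):

```python
import numpy as np

def nine_grid(image: np.ndarray, generate_children, depth: int) -> np.ndarray:
    """Recursively lay out the generation tree as nested 3x3 grids."""
    if depth == 0:
        return image
    children = generate_children(image)            # 8 outputs conditioned on `image`
    sub = [nine_grid(c, generate_children, depth - 1) for c in children]
    scale = 3 ** (depth - 1)                       # upscale the condition to tile size
    center = np.kron(image, np.ones((scale, scale)))
    tiles = sub[:4] + [center] + sub[4:]           # condition sits in the center cell
    rows = [np.concatenate(tiles[i:i + 3], axis=1) for i in (0, 3, 6)]
    return np.concatenate(rows, axis=0)
```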

Hierarchical Generation Visualization of DDN

**Future Research Directions**

[The article continues with potential future research directions]

