How does a generative model generate samples

Deep Learning (2) - Generative Adversarial Networks


This is a translation; please contact the author if you find any errors.

This article proposes a new framework for estimating generative models via an adversarial process in which two models are trained simultaneously: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than from G. The training objective for G is to maximize the probability that D makes a mistake. This framework corresponds to a two-player minimax game. In the space of arbitrary functions G and D there is a unique solution, in which G recovers the distribution of the training data and the output of D is 1/2 everywhere. If G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation (BP). No Markov chains or unrolled approximate inference networks are required during either training or sample generation. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples.

The promise of deep learning is to discover rich, hierarchical models that represent probability distributions over the kinds of data encountered in artificial intelligence applications, such as natural images, audio waveforms containing speech, and symbols in natural language corpora. To date, the most striking successes in deep learning have involved discriminative models, which typically map a high-dimensional, rich sensory input (such as images, audio, or video) to a class label. These successes are mainly based on the backpropagation and dropout algorithms, using piecewise linear units that have well-behaved gradients. Deep generative models have had less impact, because maximum likelihood estimation and related strategies involve many intractable probabilistic computations that are hard to approximate, and because it is difficult to exploit the benefits of piecewise linear units in a generative setting. This paper proposes a new procedure for estimating generative models that sidesteps these difficulties.

In the proposed adversarial framework, the generative model is pitted against an adversary: a discriminative model that learns to determine whether a sample comes from the model distribution or the data distribution. The generative model can be thought of as a team of counterfeiters trying to produce fake currency and use it without being detected, while the discriminative model is analogous to the police, trying to detect the counterfeit currency. Competition in this game drives both sides to improve their methods until the discriminative model can no longer tell the counterfeits from the genuine articles.

The framework can yield training algorithms for many kinds of models and optimization algorithms. In this article we discuss the special case in which the generative model generates samples by passing random noise through a multilayer perceptron, and the discriminative model is also a multilayer perceptron. We call this special case adversarial nets. In this case both models can be trained using only the highly successful backpropagation and dropout algorithms, and samples can be drawn from the generative model using only forward propagation. No approximate inference or Markov chains are needed.

Until recently, most work on deep generative models focused on models that provide a parametric specification of a probability distribution function. Such a model is trained by maximizing the log-likelihood. Perhaps the most successful member of this family is the deep Boltzmann machine. These models generally have intractable likelihood functions and therefore require numerous approximations to the likelihood gradient. These difficulties motivated the development of "generative machines": models that do not explicitly represent the likelihood, yet are able to generate samples from the desired distribution. The generative stochastic network is an example of a generative machine that can be trained with exact backpropagation rather than the numerous approximations required by Boltzmann machines. This work extends the idea of a generative machine by eliminating the Markov chains used in generative stochastic networks.

$$\lim_{\sigma \to 0} \nabla_x \, \mathbb{E}_{\epsilon \sim \mathcal{N}(0, \sigma^2 I)} f(x + \epsilon) = \nabla_x f(x)$$
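The limit above can be checked numerically in one dimension. The following is a minimal sketch, not part of the original work: the function name `smoothed_grad`, the choice f = sin, the sample count, and the finite-difference step are all assumptions made for illustration. It estimates the gradient of the noise-smoothed function by averaging central differences over Gaussian samples and compares it with the exact derivative as σ shrinks.

```python
import math
import random

def smoothed_grad(f, x, sigma, n_samples=20000, h=1e-4, seed=0):
    """Estimate d/dx E_{eps ~ N(0, sigma^2)}[f(x + eps)] for a scalar x."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, sigma)
        # Central difference using the same noise sample on both sides,
        # so the noise cancels in the difference instead of adding variance.
        total += (f(x + h + eps) - f(x - h + eps)) / (2.0 * h)
    return total / n_samples

if __name__ == "__main__":
    x = 0.7
    for sigma in (1.0, 0.3, 0.1, 0.01):
        # As sigma decreases, the estimate should approach cos(x).
        print(sigma, smoothed_grad(math.sin, x, sigma), math.cos(x))
```

For f = sin the smoothed gradient has the closed form exp(−σ²/2)·cos(x), so the estimate should move toward cos(x) as σ decreases.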

In this work, the observation above is used to backpropagate derivatives through the generative process. When we started this work, we were not aware that Kingma and Welling, and Rezende et al., had developed more general stochastic backpropagation rules, which make it possible to backpropagate through Gaussian distributions with finite variance and to backpropagate to the covariance parameter, which we treat as a hyperparameter in this work. Kingma et al. used stochastic backpropagation to train variational autoencoders (VAEs). Like generative adversarial networks, VAEs pair a differentiable generator with a second neural network; the difference is that the second network of a VAE is a recognition model that performs approximate inference. GANs require differentiation through the visible units and therefore cannot model discrete data, while VAEs require differentiation through the hidden units and therefore cannot have discrete latent variables. Other VAE-like methods exist but are less closely related to the method described in this article.

Previous work has also used discriminative criteria to train generative models. These methods rely on criteria that are intractable for deep generative models. They are difficult even to approximate for deep models, because they involve ratios of probabilities that cannot be approximated with variational approximations that lower-bound the probability. Noise-contrastive estimation (NCE) trains a generative model by learning the weights that make the model useful for discriminating data from a fixed noise distribution. Using a previously trained model as the noise distribution makes it possible to train a sequence of models of increasing quality. This can be seen as an informal competition mechanism, similar in spirit to the formal competition used in adversarial nets. The main limitation of NCE is that its "discriminator" is defined by the ratio of the probability densities of the noise distribution and the model distribution, so it requires the ability to evaluate and backpropagate through both densities.

Some previous work has used the general concept of two competing neural networks. The most relevant work is predictability minimization. In predictability minimization, each hidden unit in the first neural network is trained to be different from the output of a second network, which predicts the value of that hidden unit given the values of all the other hidden units. This work differs from predictability minimization in three important ways: 1) In this work, the competition between the networks is the sole training criterion, and it is sufficient on its own to train the network. Predictability minimization is only a regularizer that encourages the hidden units of a neural network to be statistically independent while they accomplish some other task; it is not the primary training criterion. 2) The nature of the competition is different. In predictability minimization, the outputs of two networks are compared: one network tries to make the outputs similar and the other tries to make them different. The output in question is a single scalar. In GANs, one network produces a rich, high-dimensional vector that is used as the input to the other network, and tries to choose an input that the other network does not know how to process. 3) The specification of the learning process is different. Predictability minimization is described as an optimization problem with an objective function to be minimized, and learning approaches the minimum of that objective function. GANs are based on a two-player minimax game rather than an optimization problem, and have a value function that one network tries to maximize and the other tries to minimize. The game terminates at a saddle point that is a minimum with respect to one player's strategy and a maximum with respect to the other player's strategy.

Generative adversarial networks are sometimes confused with the related concept of "adversarial examples". Adversarial examples are found by applying gradient-based optimization directly to the input of a classification network, in order to find examples that are similar to the data yet misclassified. This differs from the present work because adversarial examples are not a mechanism for training a generative model. Instead, adversarial examples are primarily an analysis tool for showing that neural networks behave in intriguing ways, often confidently classifying two images differently even though the difference between them is imperceptible to a human observer. The existence of such adversarial examples does suggest that training generative adversarial networks could be inefficient, because they show that modern discriminative networks can be made to confidently recognize a class without emulating any of the human-perceptible attributes of that class.

The adversarial modeling framework is most straightforward to apply when both the generative model and the discriminative model are multilayer perceptrons. To learn the generator's distribution $p_g$ over data $x$, we define a prior on input noise variables $p_z(z)$, then represent a mapping to data space as $G(z; \theta_g)$, where $G$ is a differentiable function represented by a multilayer perceptron with parameters $\theta_g$. We also define a second multilayer perceptron $D(x; \theta_d)$ that outputs a single scalar. $D(x)$ represents the probability that $x$ came from the data rather than from $p_g$. We train $D$ to maximize the probability of assigning the correct label to both training examples and samples from $G$ (i.e., to maximize $\log D(x)$ on training examples and $\log(1 - D(G(z)))$ on generated samples). We simultaneously train $G$ to minimize $\log(1 - D(G(z)))$. In other words, $D$ and $G$ play the following two-player minimax game with value function $V(G, D)$:

$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \tag{1}$$
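To make the two expectations in equation (1) concrete, here is a small Monte Carlo sketch in plain Python. Everything specific in it is an assumption for illustration: the helper name `value_estimate`, the one-dimensional Gaussian data distribution, the shift generator G(z) = z, and the logistic discriminator are hypothetical stand-ins, not the multilayer perceptrons used in the original work.

```python
import math
import random

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def value_estimate(D, G, sample_data, sample_noise, n=10000, seed=0):
    """Monte Carlo estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))]."""
    rng = random.Random(seed)
    tiny = 1e-12  # clamp to avoid log(0) if D saturates at exactly 0 or 1
    data_term = sum(math.log(max(D(sample_data(rng)), tiny))
                    for _ in range(n)) / n
    gen_term = sum(math.log(max(1.0 - D(G(sample_noise(rng))), tiny))
                   for _ in range(n)) / n
    return data_term + gen_term

# Hypothetical toy setup: data ~ N(4, 1), noise z ~ N(0, 1), G(z) = z.
def sample_data(rng):
    return rng.gauss(4.0, 1.0)

def sample_noise(rng):
    return rng.gauss(0.0, 1.0)

def G(z):
    return z

if __name__ == "__main__":
    # A discriminator that cannot tell the two apart outputs 1/2 everywhere.
    print(value_estimate(lambda x: 0.5, G, sample_data, sample_noise))
    # A logistic discriminator with its boundary between the two means.
    print(value_estimate(lambda x: sigmoid(x - 2.0), G, sample_data, sample_noise))
```

A discriminator that always outputs 1/2 gives V = −log 4 regardless of G, while a discriminator that separates the two distributions yields a larger value; that gap is what D tries to widen and G tries to close.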

In the next section we present a theoretical analysis of adversarial nets, which essentially shows that the training criterion allows the data-generating distribution to be recovered when G and D are given enough capacity, i.e., in the non-parametric limit.
For a less formal but more pedagogical explanation of the method, see Figure 1.

Figure 1: Generative adversarial nets are trained by simultaneously updating the discriminative distribution ($D$, blue dashed line) so that it discriminates between samples from the data-generating distribution $p_{\text{data}}$ (black dotted line) and samples from the generative distribution $p_g$ ($G$, solid green line).
The lower horizontal line is the domain from which $z$ is sampled, in this case uniformly. The upper horizontal line is part of the domain of $x$. The upward arrows show how the mapping $x = G(z)$ imposes the non-uniform distribution $p_g$ on the transformed samples. $G$ contracts in regions where $p_g$ has high density and expands in regions where $p_g$ has low density.
(a) Consider an adversarial pair near convergence: $p_g$ is similar to $p_{\text{data}}$, and $D$ is a partially accurate classifier.
(b) With $G$ held fixed, $D$ is trained in the inner loop of the algorithm to discriminate samples from data, converging to $D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x) + p_g(x)}$, which gives the highest classification accuracy.
(c) After an update to $G$, the gradient of $D$ has guided $G(z)$ toward regions that are more likely to be classified as data. That is, $D$ is held fixed while $G$ is adjusted to confuse it as much as possible.
(d) After many iterations, if $G$ and $D$ have enough capacity, they reach a point at which neither can improve because $p_g = p_{\text{data}}$. The discriminator cannot distinguish between the two distributions, i.e. $D(x) = \frac{1}{2}$.
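The optimal discriminator from panel (b) and the equilibrium from panel (d) can be illustrated with known densities. The sketch below assumes one-dimensional Gaussian densities purely for illustration; the names `normal_pdf` and `optimal_discriminator` are hypothetical helpers and do not come from the original work.

```python
import math

def normal_pdf(x, mean, std):
    """Density of N(mean, std^2) at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def optimal_discriminator(p_data_x, p_g_x):
    """D*(x) = p_data(x) / (p_data(x) + p_g(x)) for a fixed generator."""
    return p_data_x / (p_data_x + p_g_x)

if __name__ == "__main__":
    for x in (-1.0, 0.0, 0.5, 1.0, 2.0):
        p_data = normal_pdf(x, 0.0, 1.0)
        # Generator still off target: p_g is shifted to the right of p_data.
        d_mismatch = optimal_discriminator(p_data, normal_pdf(x, 1.0, 1.0))
        # Generator has matched the data: p_g = p_data.
        d_matched = optimal_discriminator(p_data, normal_pdf(x, 0.0, 1.0))
        print(x, d_mismatch, d_matched)
```

When the two densities coincide, the ratio is exactly 1/2 at every point, which is the situation described in panel (d); when they differ, D* leans toward whichever distribution is denser at x.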

In practice, we must implement the game using an iterative, numerical approach. Optimizing D to completion in the inner loop of training is computationally prohibitive, and on a finite data set it would lead to overfitting. Instead, we alternate between k steps of optimizing D and one step of optimizing G. As long as G changes slowly enough, D stays close to its optimal solution. This procedure is formally presented in Algorithm 1.
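The alternation between k discriminator steps and one generator step can be sketched on a toy problem small enough to differentiate by hand. This is only an illustration under strong assumptions, not the procedure of the original work: the data are one-dimensional Gaussian, the generator is a hypothetical shift G(z) = z + b, the discriminator is a hypothetical logistic model D(x) = sigmoid(w·x + c), and the learning rate, minibatch size, and iteration count are arbitrary choices.

```python
import math
import random

def sigmoid(t):
    # Numerically stable logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def train_toy_gan(data_mean=3.0, iterations=3000, k=1, m=64, lr=0.05, seed=0):
    rng = random.Random(seed)
    w, c = 0.0, 0.0  # discriminator D(x) = sigmoid(w * x + c)
    b = 0.0          # generator G(z) = z + b, with z ~ N(0, 1)
    for _ in range(iterations):
        for _ in range(k):
            xs = [rng.gauss(data_mean, 1.0) for _ in range(m)]  # data minibatch
            gs = [rng.gauss(0.0, 1.0) + b for _ in range(m)]    # generated minibatch
            dx = [sigmoid(w * x + c) for x in xs]
            dg = [sigmoid(w * g + c) for g in gs]
            # Gradient of mean log D(x) + mean log(1 - D(G(z))) w.r.t. (w, c).
            grad_w = (sum((1.0 - d) * x for d, x in zip(dx, xs))
                      - sum(d * g for d, g in zip(dg, gs))) / m
            grad_c = (sum(1.0 - d for d in dx) - sum(dg)) / m
            w += lr * grad_w  # ascent: D maximizes the value function
            c += lr * grad_c
        gs = [rng.gauss(0.0, 1.0) + b for _ in range(m)]
        # d/db of mean log(1 - D(G(z))) equals -w * mean D(G(z)).
        grad_b = -w * sum(sigmoid(w * g + c) for g in gs) / m
        b -= lr * grad_b      # descent: G minimizes the value function
    return w, c, b

if __name__ == "__main__":
    w, c, b = train_toy_gan()
    print("discriminator:", w, c, "generator shift:", b)
```

In this toy setting the generator shift b should drift toward `data_mean` while w shrinks toward zero, since a discriminator facing two identical distributions has nothing left to separate. Larger values of k keep D closer to its optimum at the cost of more computation per generator update.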

Algorithm 1

——————————————————————————————————————————————————————————————————————————————————————————————
Algorithm 1: Minibatch stochastic gradient descent training of generative adversarial nets. The number of steps applied to the discriminator, k, is a hyperparameter. We used k = 1 in the experiments, which is the least expensive option.
————————————————————————————————————————————————————————————————————————————————————————————————
for number of training iterations do
  for k steps do
    from $p_g(z)$