Deep learning strives to discover a probability distribution on the underlying nonlinear system. This probability distribution is usually discovered using composite mappings of neurons stacked into layers (with many such layers composing a deep network model), providing an abstraction of underlying low-level features for ease of inference at a much higher-level.

The most successful deep learning models have been discriminative, supervised ones that map a high-dimensional, sensory-rich input to an output using the backpropagation and/or dropout algorithms.

In order to learn in an unsupervised approach, one has to find a model that is capable of approximating the underlying probabilistic distribution. This is, by no means, difficult due to the intractability of probabilistic computations that arise in maximum likelihood estimates, importance sampling and Gibb’s distributions. In such deep generative contexts, it is difficult to leverage on the scalability of piecewise linear models used in discriminative algorithms in an unsupervised context.

Deciphering Generative Adversarial Networks

Since such deep generative models are difficult to train, the question to ask then is, “is it possible to train a generative model from a classical discriminative classification setting?” If we have labeled data available, one can train a differentiable function that maps from an input space to the output space of a Borel measurable set in a supervised setting; this differentiable function can be thought of as fitting a probability distribution on a generative model.

What happens if we pit the generative model against a counterfeiter (an adversary) whose goal is to score the likelihood (not exactly, but we’ll see the definition shortly) that a sample drawn from the probability distribution of the generative model is real or fake? You see, things get interesting here. If the adversary samples from the latent space of the generator, then it improves its likelihood of passing off as a real model. This is the central idea behind generative adversarial networks. Interesting, huh?

Prior Work

Deep Deterministic Networks are directed graphical models. To generate stochastic distributions that represent an underlying model, people have used undirected graphical models with hidden variables such as restricted Boltzmann machines (RBMs, RBMs) and deep Boltzmann machines (DBMs). Unless the underlying partition function within such models are trivial, the product of normalized potential functions and the gradients used are intractable, making such methods limiting.

With deep belief networks, the gradient of a log likelihood expectation is computed with two different techniques: the expectations of data-dependent modes are estimated with variational approximations while the expectations of data-independent modes are approximated by persistent Markov chains.

Noise contrastive estimates (NCEs) and generative stochastic networks (GSNs) are the closer of the generative adversarial network ancestors. NCEs do not approximate or propose a bound on the log likelihood estimate but require the learned probability distribution to be analytically specified up to a normalization constant. But with deep models containing many latent variables, providing a normalized constant for the probability distribution is almost an impossible task.

GSNs belong to those class of algorithms that instead of bounding the log likelihood estimate, they train a generative model with backpropagation to draw samples from a desired distribution in a parameterized Markov chain manner. Adversarial nets, on the contrast, do not require parameterized Markov chains for sampling from a data distribution but instead use piecewise linear units (ReLUs, MaxOuts etc) to improve the performance of backpropagation.

Inside GANs

The original adversarial net was designed using two multilayer perceptrons as the modeling framework. It’s defined as follows:


  • suppose we have a data set, \(\textbf{x} \sim p_g(\centerdot)\)
    • where \(p_g\) represents the parameterization of the probability distribution of data \(\textbf{x}\);
  • suppose also that we have a noisy dataset \(\textbf{z}\) with a prior defined as \(p_z(\textbf{z})\)

  • let \(G(\textbf{z};\theta_g)\) be a differentiable mapping from \(\textbf{z}\)’s state space to \(\textbf{x}\)’s state space parameterized by \(\theta_g\)
    • i.e. \(G_{\theta_g}: \mathbb{R}^{n_z} \rightarrow \mathbb{R}^{n_x}\)
  • define \(G(\textbf{z};\theta_g)\) as the generator


  • let us define \(D(\textbf{x})\) as the probability that the data \(\textbf{x}\) is drawn from the data state space and not its parameterized probability distribution \(p_g(\centerdot)\)

  • let \(D(\textbf{x}; \theta_d): \mathbb{R}^{n_x} \rightarrow \mathbb{R}^1\) represent the differentiable function that parameterizes the probability \(D(\textbf{x})\)

  • define \(D(\textbf{x}; \theta_d)\) as the discriminator

GANs Algorithm

GANs simulataneously train two different multilayer perceptrons in a minimax scenario:

  • one maximizes the probability of assigning the correct label to both training examples and samples drawn from \(G(z; \theta_g)\),
    • i.e. train to \(\max_D \text{ log } D(x)\);
    • note that this is done with supervised learning
  • the second perceptron trains \(G(\textbf{z};\theta_g)\) to minimize the \(\text{log }[1 - D(G(\textbf{z}))]\)

Essentially, we are playing a two-player minimax game with value function \(V(G(\centerdot), D(\centerdot))\):

\begin{equation} \min_G \max_D V(D,G) = \mathbb{E}_{x \sim p(x)} \left[\text{ log } D(\textbf{x}) \right] + \mathbb{E}_{\textbf{z} \sim p(\textbf{z})} \left[\text{ log } \left(1 - D(G(\textbf{z})\right)\right] \end{equation}

One may wonder why we are taking the logarithms of the differentiable functions: my gut is the authors used the logarithms to hasten the optimization process since differentiating the logarithm of a high cost takes less time than the bare-bones cost function itself \(i.e. (\textbf{O}(\text{ log } (n)) < < \textbf{ O } (n))\). If the functions are matrix variables, we could take their log determinants.

Fig.1: Intuition behind Generative Adversarial Networks (reproduced from the original paper)..

Essentially, GANs involve simultaneously:

  • generating samples from a data with probability distribution, \(p_{\text{data}}(x)\) (black, dotted system in Fig 1)

  • generating samples from a generative distribution, \(p_g(G)\) (green, solid system in Fig 1),
    • \(z\) is typically sampled uniformly from a uniform Gaussian noise distribution (lower horizontal line in Fig 1).
  • updating a discriminant, \(D\) (blue, dashed system in Fig 1),
    • this distinguishes between samples from \(p_{\text{data}}(x)\) from samples from \(p_g(G)\). \(\textbf{z}\)
  • the horizontal line is part of the domain of \(\textbf{x}\) such that
    • \(\textbf{x} =G(\textbf{z})\) in the upward arrows imposes the non-uniform distribution \(p_g\) on transformed samples.
  • when the underlying probability distribution is very dense, \(G(\centerdot)\) contracts and it expands in regions of low density in $p_g$.

  • near optimum, \(p_g\) is similar to $p_\text{data}$ and $D$ is a partially accurate classifier (fig1a).

  • in the inner loop of the algorithm, at convergence, \(D\) discriminates samples from data, yielding \(D^\star(\textbf{x}) = \frac{p_\text{data}(\textbf{x})}{ p_\text{data}(\textbf{x}) + p_g(\textbf{x})}\) (fig1b) .

  • updating \(G\), the gradients of \(D\) guides \(G(\textbf{z})\) to flow to regions that are more likely to be classified as data (fig1c) .

  • after multiple steps of backprop, provided that $G$ and $D$ have enough capacity, they will reach a saddle point where neither can improve, since $p_g = p_\text{data}$ (fig1d) .

At equilibrium, discriminator is unable to differentiate between the two distributions, i.e. \(D(\textbf{x}) = \frac{1}{2}\).

GANs allow us to train a discriminator as an unsupervised “density estimator” which outputs a low value for an original data and a high value for fictitious data. By pitting the adversary against the discriminator, the generator parameterizes a nonlinear manifold of a dynamical system, maps it to a point on the data manifold, and the discriminator develops internal dynamics that is capable of solving a difficult unsupervised learning problem.

Training GANS

The algorithm is as detailed below:

  for i in N do
      for steps in k do
        obtain m noise samples {z(1), ..., z(m)} from generator noise prior p(z)
        obtain m minibatch examples {x(1), ..., x(m)} from data generating distribution p(x)
        update discriminator by ascending its stochastic gradient:

\begin{equation} \nabla_{\theta_d}\frac{1}{m}\sum_{i=1}^{m} [\text{log } D(x(i)) + \text{ log } (1 - D(G(z{i})))]. \nonumber \end{equation}

      end for
      obtain m noise samples {z(1), ..., z(m)} from generator noise prior p(z)
      update generator by descending its stochastic gradient:

\begin{equation} \nabla_{\theta_g}\frac{1}{m}\sum_{i=1}^{m} \text{ log } (1 - D(G(z{i}))). \nonumber \end{equation}

  end for