1. Bayesian Theorem

Bayesian Model Averaging(BMA)1

Bayesian model averaging(BMA) is another wildly used method that is very like a combining model. However, the difference between BMA and combining models is also significant.

A Bayesian model averaging is a Bayesian formula in which the random variable are models(hypothesizes) \(h=1,2,\cdots,H\) with prior probability \(\Pr(h)\), then the marginal distribution over data \(X\) is:

\[ \Pr(X)=\sum_{h=1}^{H}\Pr(X|h)\Pr(h) \]

And the MBA is used to select a model(hypothesis) that can model the data best through Bayesian theory. When we have a larger size of \(X\), the posterior probability

\[ \Pr(h|X)=\frac{\Pr(X|h)\Pr(h)}{\sum_{i=1}^{H}\Pr(X|i)\Pr(i)} \]

become sharper. Then we got a good hypothesis.

Mixture of Gaussian(Combining Models)

In post ‘Mixtures of Gaussians’, we have seen how a mixture of Gaussians works. Then the joint distribution of input data \(\mathbf{x}\) and latent variable \(\mathbf{z}\) is:

\[ \Pr(\mathbf{x},\mathbf{z}) \]

and the margin distribution of \(\mathbf{x}\) is

\[ \Pr(\mathbf{x})=\sum_{\mathbf{z}}\Pr(\mathbf{x},\mathbf{z}) \]

For the mixture of Gaussians: \[ \Pr(\mathbf{x})=\sum_{k=1}^{K}\pi_k\mathcal{N}(\mathbf{x}|\mathbf{\mu}_k,\Sigma_k) \] the latent variable \(\mathbf{z}\) is designed: \[ \Pr(z_k) = \pi_k \] for \(k=\{1,2,\cdots,K\}\). And \(z_k\in\{0,1\}\) is a \(1\)-of-\(K\) representation.

This mixture of Gaussians is a kind of combining models. Each time, only one \(k\) is selected(for \(\mathbf{z}\) is \(1\)-of-\(K\) representation). An example of a mixture of Gaussians, and its original curve is like:

And the latent variables \(\mathbf{z}\) separate the whole distribution into several Gaussian distributions:

This is the simplest model of combining models where each expert is a Gaussian model. And during the voting, only one model was selected by \(\mathbf{z}\) to make the final decision.

Distinction between BMA and Combining Methods

A combining model method contains several models and predicts by voting or other rules. However, Bayesian model averaging can be used to generate a hypothesis from several candidates.


  1. Bishop, Christopher M. Pattern recognition and machine learning. springer, 2006.↩︎