next up previous
Next: Heuristic Initialization in AIS-BN Up: AIS-BN: Adaptive Importance Sampling Previous: Basic Algorithm

Modifying the Sampling Distribution in AIS-BN

Based on the theoretical considerations of Section 2.1, we know that the crucial element of the algorithm is converging on a good approximation of the optimal importance function. In what follows, we first give the optimal importance function for calculating Pr($ \bf E$ = $ \bf e$) and then discuss how to use the structural advantages of Bayesian networks to approximate this function. In the sequel, we will use the symbol $ \rho$ to denote the importance sampling function and $ \rho^{*}_{}$ to denote the optimal importance sampling function.

Since Pr($ \bf X$\$ \bf E$,$ \bf E$ = $ \bf e$) > 0, from Corollary 1 we have

$\displaystyle \rho$($\displaystyle \bf X$\$\displaystyle \bf E$) = $\displaystyle {\frac{{\mbox{\rm Pr}}({\bf X}\backslash{\bf E},{\bf E}={\bf e})}{{\mbox{\rm Pr}}({\bf E}={\bf e})}}$ = Pr($\displaystyle \bf X$|$\displaystyle \bf E$ = $\displaystyle \bf e$)  .

The following corollary captures this result.

Corollary 2   The optimal importance sampling function $ \rho^{*}_{}$($ \bf X$\$ \bf E)$ for calculating Pr($ \bf E$ = $ \bf e$) in Equation 6 is Pr($ \bf X$|$ \bf E$ = $ \bf e$).

Although we know the mathematical expression for the optimal importance sampling function, it is difficult to obtain this function exactly. In our algorithm, we use the following importance sampling function

$\displaystyle \rho$($\displaystyle \bf X$\$\displaystyle \bf E$) = $\displaystyle \prod\limits_{i=1}^{n}$Pr(Xi|Pa(Xi),$\displaystyle \bf E$)  . (8)

This function partially considers the effect of all the evidence on every node during the sampling process. When the network structure is the same as that of the network which has absorbed the evidence, this function is the optimal importance sampling function. It is easy to learn and, as our experimental results show, it is a good approximation to the optimal importance sampling function. Theoretically, when the posterior structure of the model changes drastically as the result of observed evidence, this importance sampling function may perform poorly. We have tried to find practical networks where this would happen, but to the day have not encountered a drastic example of this effect.

From Section 2.2, we know that the score sums corresponding to {xi,pa(Xi),$ \bf e$} can yield an unbiased estimator of Pr(xi,pa(Xi),$ \bf e$). According to the definition of conditional probability, we can get an estimator of Pr$\scriptstyle \prime$(xi|pa(Xi),$ \bf e$). This can be achieved by maintaining an updating table for every node, the structure of which mimicks the structure of the CPT. Such tables allow us to decompose the above importance function into components that can be learned individually. We will call these tables the importance conditional probability tables (ICPT).

Definition 1   An importance conditional probability table (ICPT) of a node X is a table of posterior probabilities Pr(X|Pa(X),$ \bf E$ = $ \bf e$) conditional on the evidence and indexed by its immediate predecessors, Pa(X).

The ICPT tables will be modified during the process of learning the importance function. Now we will prove a useful theorem that will lead to considerable savings in the learning process.

Theorem 2  

Xi $\displaystyle \in$ $\displaystyle \bf X$, Xi$\displaystyle \notin$Anc($\displaystyle \bf E$) $\displaystyle \Rightarrow$ Pr(Xi|Pa(Xi),$\displaystyle \bf E$) = Pr(Xi|Pa(Xi))  . (9)

Proof: Suppose we have set the values of all the parents of node Xi to pa(Xi). Node Xi is dependent on evidence $ \bf E$ given pa(Xi) only when Xi is d-connecting with $ \bf E$ given pa(Xi) [Pearl1988]. According to the definition of d-connectivity, this happens only when there exists a member of Xi's descendants that belongs to the set of evidence nodes $ \bf E$. In other words Xi$ \notin$Anc($ \bf E$). $ \Box$

Theorem 2 is very important for the AIS-BN algorithm. It states essentially that the ICPT tables of those nodes that are not ancestors of the evidence nodes are equal to the CPT tables throughout the learning process. We only need to learn the ICPT tables for the ancestors of the evidence nodes. Very often this can lead to significant savings in computation. If, for example, all evidence nodes are root nodes, we have our ICPT tables for every node already and the AIS-BN algorithm becomes identical to the likelihood weighting algorithm. Without evidence, the AIS-BN algorithm becomes identical to the probabilistic logic sampling algorithm.

Figure 3: The AIS-BN algorithm for learning the optimal importance function.
\fbox{ \parbox{5.3in}{ \renewedcommand{baselinestretch}{1.0} \begin{minipage}[t]... ...i),{\bf e})\right) \end{eqnarray*}\end{enumerate}{\bf end for} \end{minipage}}}

It is worth pointing out that for some Xi, Pr(Xi| Pa(Xi),$ \bf E$) (i.e., the ICPT table for Xi), can be easily calculated using exact methods. For example, when Xi is the only parent of an evidence node Ej and Ej is the only child of Xi, the posterior probability distribution of Xi is straightforward to compute exactly. Since the focus of the current paper is on sampling, the test results reported in this paper do not include this improvement of the AIS-BN algorithm.

Figure 3 lists an algorithm that implements Step 7 of the basic AIS-BN algorithm listed in Figure 2. When we estimate Pr$\scriptstyle \prime$(xi|pa(Xi),$ \bf e$), we only use the samples obtained at the current stage. One reason for this is that the information obtained in previous stages has been absorbed by Prk($ \bf X$\$ \bf E$). The other reason is that in principle, each successive iteration is more accurate than the previous one and the importance function is closer to the optimal importance function. Thus, the samples generated by Prk + 1($ \bf X$\$ \bf E$) are better than those generated by Prk($ \bf X$\$ \bf E$). Pr$\scriptstyle \prime$(Xi|pa(Xi),$ \bf e$) - Prk(Xi|pa(Xi),$ \bf e$) corresponds to the vector of first partial derivatives in the direction of the maximum decrease in the error. $ \eta$(k) is a positive function that determines the learning rate. When $ \eta$(k) = 0 (lower bound), we do not update our importance function. When $ \eta$(k) = 1 (upper bound), at each stage we discard the old function. The convergence speed is directly related to $ \eta$(k). If it is small, the convergence will be very slow due to the large number of updating steps needed to reach a local minimum. On the other hand, if it is large, convergence rate will be initially very fast, but the algorithm will eventually start to oscillate and thus may not reach a minimum. There are many papers in the field of neural network learning that discuss how to choose the learning rate and let estimated importance function converge quickly to the destination function. Any method that can improve learning rate should be applicable to this algorithm. Currently, we use the following function proposed by Ritter et al. [1991]

$\displaystyle \eta$(k) = a$\displaystyle \left(\vphantom{\frac{b}{a}}\right.$$\displaystyle {\frac{b}{a}}$$\displaystyle \left.\vphantom{\frac{b}{a}}\right)^{k/k_{\max }}_{}$, (10)

where a is the initial learning rate and b is the learning rate in the last step. This function has been reported to perform well in neural network learning [Ritter et al.1991].


next up previous
Next: Heuristic Initialization in AIS-BN Up: AIS-BN: Adaptive Importance Sampling Previous: Basic Algorithm
Jian Cheng 2000-10-01