Our goal is to compute a tight upper bound on the likelihood of the observed findings: . As discussed in Section 4.2, we obtain an upper bound on by introducing upper bounds for individual node conditional probabilities. We represent this upper bound as , which is a product across the individual variational transformations and may contain contributions due to findings that are being treated exactly (i.e., are not transformed). Marginalizing across d we obtain a bound:
It is this latter quantity that we wish to minimize with respect to the variational parameters .
To simplify the notation we assume that the first m positive findings have been transformed (and therefore need to be optimized) while the remaining conditional probabilities will be treated exactly. In this notation is given by
where the expectation is taken with respect to the posterior distribution for the diseases given those positive findings that we plan to treat exactly. Note that the proportionality constant does not depend on the variational parameters (it is the likelihood of the exactly treated positive findings). We now insert the explicit forms of the transformed conditional probabilities (see Eq. (17)) into Eq. (41) and find:
where we have simply converted the products over i into sums in the exponent and pulled out the terms that are constants with respect to the expectation. On a log-scale, the proportionality becomes an equivalence up to a constant:
Several observations are in order. Recall that is the conjugate of the concave function f (the exponent), and is therefore also concave; for this reason is convex. In Appendix C we prove that the remaining term:
is also a convex function of the variational parameters. Now, since any sum of convex functions is convex, we conclude that is a convex function of the variational parameters. This means that there are no local minima in our optimization problem. We may safely employ the standard Newton-Raphson procedure to solve . Alternatively we can utilize fixed-point iterations. In particular, we calculate the derivatives of the variational form and iteratively solve for the individual variational parameters such that the derivatives are zero. The derivatives are given as follows:
where the expectation and the variance are with respect to the posterior approximation , and both derivatives can be computed in time linear in the number of associated diseases for the finding. The benign scaling of the variance calculations comes from exploiting the special properties of the noisy-OR dependence and the marginal independence of the diseases.
Calculating the expectations in Eq. (7) is exponentially costly in the number of exactly treated positive findings. When there are a large number of positive findings, we can have recourse to a simplified procedure in which we optimize variational parameters after having transformed all or most of the positive findings. While the resulting variational parameters are suboptimal, we have found in practice that the incurred loss in accuracy is typically quite small. In the simulations reported in the paper, we optimized the variational parameters after approximately half of the exactly treated findings had been introduced. (To be precise, in the case of 8, 12 and 16 total findings treated exactly, we optimized the parameters after 4, 8, and 8 findings, respectively, were introduced).