Carrying out diagnostic inference in the QMR model involves computing the posterior marginal probabilities of the diseases given a set of observed positive ( ) and negative ( ) findings. Note that the set of observed findings is considerably smaller than the set of possible findings; note moreover (from the bi-partite structure of the QMR-DT graph) that unobserved findings have no effect on the posterior probabilities for the diseases. For brevity we adopt a notation in which corresponds to the event , and refers to (positive and negative findings respectively). Thus the posterior probabilities of interest are , where and are the vectors of positive and negative findings.
The negative findings are benign with respect to the inference problem--they can be incorporated into the posterior probability in linear time in the number of associated diseases and in the number of negative findings. As we discuss below, this can be seen from the fact that the probability of a negative finding in Eq. (4) is the exponential of an expression that is linear in the . The positive findings, on the other hand, are more problematic. In the worst case the exact calculation of posterior probabilities is exponentially costly in the number of positive findings (Heckerman, 1989; D'Ambrosio, 1994). Moreover, in practical diagnostic situations the number of positive findings often exceeds the feasible limit for exact calculations.
Let us consider the inference calculations in more detail. To find the posterior probability , we first absorb the evidence from negative findings, i.e., we compute . This is just with normalization. Since both and P(d) factorize over the diseases (see Eq. (1) and Eq. (2) above), the posterior must factorize as well. The normalization of therefore reduces to independent normalizations over each disease and can be carried out in time linear in the number of diseases (or negative findings). In the remainder of the paper, we concentrate solely on the positive findings as they pose the real computational challenge. Unless otherwise stated, we assume that the prior distribution over the diseases already contains the evidence from the negative findings. In other words, we presume that the updates have already been made.
We now turn to the question of computing , the posterior marginal probability based on the positive findings. Formally, obtaining such a posterior involves marginalizing across the remaining diseases:
where the summation is over all the possible configurations of the disease variables other than (we use the shorthand summation index for this). In the QMR model has the form:
which follows from Eq. (4) and the fact that . To perform the summation in Eq. (5) over the diseases, we would have to multiply out the terms corresponding to the conditional probabilities for each positive finding. The number of such terms is exponential in the number of positive findings. While algorithms exist that attempt to find and exploit factorizations in this expression, based on the particular pattern of observed evidence (cf. Heckerman, 1989; D'Ambrosio, 1994), these algorithms are limited to roughly 20 positive findings on current computers. It seems unlikely that there is sufficient latent factorization in the QMR-DT model to be able to handle the full CPC corpus, which has a median number of 36 positive findings per case and a maximum number of 61 positive findings.