
Interpretation of Results and Recommendations

The previous section presented our summary and analysis of the planner runs. In this section, we reflect on what those results mean for empirical comparison of planners; we summarize the results and recommend some partial solutions. It is not possible to guarantee fairness, and we propose no magic formula for performing evaluations, but the general state of the practice can certainly be improved. We propose three general recommendations and 12 recommendations targeted at specific assumptions.

Many of the targeted recommendations amount to asking problem and planner developers to be more precise about the requirements for and expectations of their contributions. Because the planners are extremely complex and time-consuming to build, the documentation may be inadequate to determine how a subsequent version differs from the previous one or under what conditions (e.g., parameter settings, problem types) the planner can be fairly compared. With the current positive trend in making planners available, it behooves the developer to include such information in the distribution of the system.

The most sweeping recommendation is to shift the research focus away from developing the best general-purpose planner. Even in the competitions, some of the planners identified as superior have been ones designed for specific classes of problems, e.g., FF and IPP. The competitions have done a great job of exciting interest and encouraging the development and public availability of planners that share a common representation.

However, to advance the research, the most informative comparative evaluations are those designed for a specific purpose: to test some hypothesis or prediction about the performance of a planner. An experimental hypothesis focuses the analysis and often leads naturally to justified design decisions about the experiment itself. For example, Hoffmann and Nebel, the authors of the Fast-Forward (FF) system, state in the introduction to their JAIR paper that FF's development was motivated by a specific set of the benchmark domains; because the system is heuristic, they designed the heuristics to fit the expectations and needs of those domains [Hoffmann Nebel 2001]. Additionally, in part of their evaluation, they compare FF to a specific system with which it shares commonalities and point out the advantages and disadvantages of their design decisions on specific problems. Researchers conducting follow-up work or comparing their own systems to FF now have a well-defined starting point for any comparison.

Recommendation 1: Experiments should be driven by hypotheses. Researchers should precisely articulate in advance of the experiments their expectations about how their new planner or augmentations to an existing planner add to the state of the art. These expectations should in turn justify the selection of problems, other planners and metrics that form the core of the comparative evaluation.
A general issue is whether the results are accurate. We reported the results as they were output by the planners: if a planner stated in its output that it had been successful, we took that claim at face value. However, by examining some of the output, we determined that some claims of successful solution were erroneous; the proposed solution would not work. The only way to ensure that the output is correct is with a solution checker. Drew McDermott used a solution checker in the AIPS98 competition; however, the planners do not all provide output in a format compatible with his checker. Thus, another concern with any comparative evaluation is that the output needs to be cross-checked. Because we are not declaring a winner (i.e., claiming that some planner exhibited superior performance), we do not think that the lack of a solution checker casts serious doubt on our results. For the most part, we have only been concerned with factors that cause the observed success rates to change.

Recommendation 2: Just as input has been standardized with PDDL, output should be standardized, at least in the format of returned plans.
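As a concrete illustration of what such cross-checking involves, the following sketch (in Python) simulates a plan step by step against STRIPS-style action models and reports the first violated precondition or unsatisfied goal fact. The data structures, plan format, and toy domain are assumptions made for illustration; they are not the competition formats or McDermott's checker.

    # Minimal sketch of a plan (solution) checker for STRIPS-style actions.
    # The Action structure, plan format, and toy domain below are illustrative
    # assumptions, not the PDDL/competition output formats or McDermott's checker.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Action:
        name: str
        preconditions: frozenset  # facts that must hold before applying the action
        add_effects: frozenset    # facts the action makes true
        del_effects: frozenset    # facts the action makes false

    def validate_plan(initial_state, goal, plan):
        """Simulate the plan from the initial state; return (ok, reason)."""
        state = set(initial_state)
        for step, action in enumerate(plan):
            missing = action.preconditions - state
            if missing:
                return False, f"step {step} ({action.name}): unsatisfied preconditions {missing}"
            state = (state - action.del_effects) | action.add_effects
        unreached = set(goal) - state
        if unreached:
            return False, f"goal facts not achieved: {unreached}"
        return True, "plan is valid"

    # Toy logistics-like example: load a package, then drive the truck.
    load = Action("load(pkg,truck,locA)",
                  frozenset({"at(pkg,locA)", "at(truck,locA)"}),
                  frozenset({"in(pkg,truck)"}),
                  frozenset({"at(pkg,locA)"}))
    drive = Action("drive(truck,locA,locB)",
                   frozenset({"at(truck,locA)"}),
                   frozenset({"at(truck,locB)"}),
                   frozenset({"at(truck,locA)"}))

    ok, reason = validate_plan(
        initial_state={"at(pkg,locA)", "at(truck,locA)"},
        goal={"in(pkg,truck)", "at(truck,locB)"},
        plan=[load, drive])
    print(ok, reason)

A checker of this kind is only as trustworthy as the action models and plan format it is given, which is a further argument for standardizing the output of returned plans.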
Another general issue is whether the benchmark sets are representative of the space of interesting planning problems. We did not test this directly (in fact, we are not sure how one could do so), but the clustering of results and observations by others in the planning community suggest that the set is biased toward logistics problems. Additionally, many of the problems are getting dated and no longer distinguish performance. Some researchers have begun to more formally analyze the problem set, either in service of building improved planners (e.g., [Hoffmann Nebel 2001]) or to better understand planning problems. For example, in the related area of scheduling, our group has identified distinctive patterns in the topology of search spaces for different types of classical scheduling problems and has related the topology to the performance of algorithms [Watson et al. 2001]. Within planning, Hoffmann has examined the topology of local search spaces in some of the small problems in the benchmark collection and found a simple structure with respect to some well-known relaxations [Hoffmann 2001]. Additionally, he has worked out a partial taxonomy, based on three characteristics, for the analyzed domains. Helmert has analyzed the computational complexity of a subclass of the benchmarks, transportation problems, and has identified key features that affect the difficulty of such problems [Helmert 2001].

Recommendation 3: The benchmark problem sets should themselves be evaluated and overhauled. Problems that can be easily solved should be removed. Researchers should study the benchmark problems/domains to classify them into problem types and key characteristics. Developers should contribute application problems and realistic versions of them to the evolving set.
The remainder of this section describes other recommendations for improving the state of the art in planner comparisons.


