
Metrics

Ideally, performance would be measured based on how well the planner does its job (i.e., constructing the "best" possible plan to solve the problem) and how efficiently it does so. Because no planner has been shown to solve all possible problems, the basic metric for performance is the number or percentage of problems actually solved within the allowed time. This metric is commonly reported in the competitions. However, research papers tend not to report it directly because they typically test a relatively small number of problems.
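To make the metric concrete, the following sketch (ours, not from the paper) computes the percentage of problems solved within an allowed time; the record layout and the 600-second cutoff are illustrative assumptions:

    # Hypothetical sketch: percentage of problems solved within the time limit.
    # The (solved, cpu_seconds) record layout and the 600 s cutoff are assumptions.
    def percent_solved(results, time_limit=600.0):
        """results: list of (solved, cpu_seconds) pairs, one per problem."""
        within = sum(1 for solved, t in results if solved and t <= time_limit)
        return 100.0 * within / len(results)

    # Example: two of four problems solved within the limit.
    runs = [(True, 12.4), (True, 598.0), (True, 601.3), (False, 600.0)]
    print(percent_solved(runs))  # 50.0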

Efficiency is clearly a function of memory and effort. Memory size is limited by the hardware. Effort is measured as CPU time, preferably, though not always, on the same platform and in the same language. The problems with CPU time are well known: programmer skill varies; research code is designed more for fast prototyping than for fast execution; and numbers in the literature cannot be compared to newer numbers because of processor speed improvements. However, if CPU times are regenerated in the experimenter's environment, then one assumes that

performance degrades similarly with reductions in capabilities of the runtime environment (e.g., CPU speed, memory size) (metric assumption 1).
In other words, an experimenter or user of the system does not expect the code to have been optimized for a particular compiler/operating system/hardware configuration, but does expect it to perform similarly when moved to another compatible environment.
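A minimal sketch of how this assumption is used in practice: a published CPU time can be rescaled to a reference machine under a linear speed model. The linear model and the clock speeds below are illustrative assumptions, not a procedure from the paper.

    # Hypothetical sketch of metric assumption 1: rescale a reported CPU time
    # to a reference machine, assuming run time scales linearly with CPU speed.
    def rescale_cpu_time(reported_seconds, source_mhz, reference_mhz):
        return reported_seconds * (source_mhz / reference_mhz)

    # Example: a 120 s result from a 300 MHz machine, projected onto a
    # 600 MHz reference machine.
    print(rescale_cpu_time(120.0, source_mhz=300, reference_mhz=600))  # 60.0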

The most commonly reported comparison metric is computation time. The second most common is the number of steps in a plan, or the number of actions for planners that allow parallel execution. Although planning seeks solutions for achieving goals, the goals are defined in terms of states of the world, which do not lend themselves to general measures of quality. In fact, quality is likely to be problem dependent (e.g., resource cost, execution time, robustness), which is why number of plan steps has been favored. Comparisons assume that

number of steps in a resulting plan varies between planner solutions and approximates quality (metric assumption 2).
Any comparison, and competitions especially, has the unenviable task of determining how to trade off or combine the three metrics (number solved, time, and number of steps). Thus, if number of steps turned out not to matter, the comparison could be simplified.
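To illustrate what such a trade-off involves, the sketch below combines the three metrics into a single score. The paper prescribes no such formula; the weights, normalizations, and ceilings here are purely hypothetical.

    # Hypothetical sketch: one way to combine number solved, time, and steps.
    # All weights and normalization constants are illustrative assumptions.
    def composite_score(frac_solved, mean_time, mean_steps,
                        w_solved=0.6, w_time=0.2, w_steps=0.2,
                        time_limit=600.0, steps_ceiling=100.0):
        time_score = max(0.0, 1.0 - mean_time / time_limit)      # higher is better
        steps_score = max(0.0, 1.0 - mean_steps / steps_ceiling)  # higher is better
        return w_solved * frac_solved + w_time * time_score + w_steps * steps_score

    # Example: 75% solved, 90 s mean time, 40-step mean plan length.
    print(composite_score(0.75, 90.0, 40.0))  # ≈ 0.74

Dropping the steps term (w_steps = 0) is exactly the simplification alluded to above when plan length does not matter.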

We converted each assumption into a testable question. We then either summarized the literature on the question or ran an experiment to test it.

