Sequential Testing for Efficient Policy (STEP) Comparison




TL;DR: We introduce STEP, a finite-sample, near-optimal sequential testing procedure for detecting statistically significant performance improvements between robot policies while requiring a near-minimal number of evaluation trials. STEP allows tunable levels of Type-I and Type-II error risk, and lets the user interpretably encode prior information about the testing regime in the form of a naturally specified 'risk budget.'



Problem formulation

We consider the problem of comparing two robot policies using binary success/failure evaluation metrics. We formulate this as a canonical statistical hypothesis test, with null (red) and alternative (blue) hypotheses as shown below:
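For concreteness, one standard way to pose such a test is sketched below (our notation; the minimum gap of interest δ is an illustrative assumption, not necessarily the paper's exact statement):

```latex
% One-sided comparison of success rates (notation assumed for illustration).
% Policy i produces i.i.d. outcomes X^{(i)}_t ~ Bernoulli(p_i).
\begin{align*}
  H_0 &: p_1 - p_0 \le 0      && \text{(no improvement; red)} \\
  H_1 &: p_1 - p_0 \ge \delta && \text{(improvement of at least } \delta > 0 \text{; blue)}
\end{align*}
```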



Approach overview

STEP uses an optimization-based procedure to approximately solve an optimal-stopping partial differential equation, with a finite-sample-exact representation at the resolution of the testing problem. Standard methods instead solve the asymptotic version of this problem, a reverse heat equation, and are inefficient, and sometimes ill-posed, at small-to-moderate N.
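To make the runtime behavior concrete, here is a minimal sketch of a generic sequential comparison loop. The boundary arrays are placeholders: in STEP they would come from the optimization-based solve described above, which is not reproduced here, and the function and variable names are ours, not the paper's.

```python
import numpy as np

def run_sequential_test(rollouts, lower, upper):
    """Generic sequential comparison loop (runtime interface only).

    rollouts: iterable of (success_0, success_1) binary outcomes,
              one paired evaluation trial per element.
    lower, upper: per-trial stopping thresholds on the running
                  difference in success counts (placeholders here;
                  STEP computes its boundaries via optimization).
    """
    diff = 0  # running (policy-1 successes) - (policy-0 successes)
    for n, (s0, s1) in enumerate(rollouts, start=1):
        diff += int(s1) - int(s0)
        if diff >= upper[n - 1]:
            return "H1", n  # significant improvement declared
        if diff <= lower[n - 1]:
            return "H0", n  # no-improvement decision
    return "inconclusive", n

# Toy usage with hypothetical straight-line boundaries (illustration only).
rng = np.random.default_rng(0)
n_max = 200
trials = zip(rng.random(n_max) < 0.4, rng.random(n_max) < 0.7)  # p0=0.4, p1=0.7
lower = -5.0 - 0.05 * np.arange(n_max)
upper = 5.0 + 0.05 * np.arange(n_max)
print(run_sequential_test(trials, lower, upper))
```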



Performance evaluation on simulated data

In the following, we exhaustively evaluate STEP and baseline methods across a variety of conditions. We select all 45 pairs (p0, p1) with p0 < p1 from the grid {0.05, 0.15, ..., 0.95} (i.e., (0.05, 0.15), (0.05, 0.25), ..., (0.85, 0.95)). For each pair, 5000 evaluation sequences are drawn from the associated Bernoulli distributions to estimate power, and 5000 sequences are drawn to estimate the false positive rate (FPR) under the associated worst-case null hypothesis pair.
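A sketch of this protocol is given below; the helper names and the `test_fn` interface are our assumptions, not the paper's code. FPR under the worst-case null is estimated analogously by simulating from the corresponding null pair.

```python
import itertools
import numpy as np

# All 45 ordered pairs (p0, p1), p0 < p1, from the grid {0.05, 0.15, ..., 0.95}.
grid = np.round(np.arange(0.05, 1.0, 0.1), 2)
pairs = list(itertools.combinations(grid, 2))
assert len(pairs) == 45

def estimate_power(test_fn, p0, p1, n_seq=5000, n_max=1000, seed=0):
    """Monte Carlo power: fraction of evaluation sequences on which
    `test_fn` (STEP or a baseline; signature assumed) declares an
    improvement of policy 1 over policy 0."""
    rng = np.random.default_rng(seed)
    wins = 0
    for _ in range(n_seq):
        seq0 = rng.random(n_max) < p0  # policy-0 rollouts
        seq1 = rng.random(n_max) < p1  # policy-1 rollouts
        decision, _trials_used = test_fn(seq0, seq1)
        wins += decision == "H1"
    return wins / n_seq
```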

STEP demonstrates near-optimal statistical power across a range of sample-complexity scales (on the order of ~5 to 500 trials), corresponding to comparison instances of varying difficulty. It thereby Pareto-dominates existing methods in the finite-sample regime (e.g., for evaluation trial counts below ~1000), as shown in the following plot:



Additionally, STEP explicitly controls the false positive rate (Type-I error rate) uniformly across the space of worst-case null hypotheses, which ensures maximally efficient 'spending' of risk. This is shown by the uniformly light blue shade of the FPR plot below.
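In symbols (our notation, with α the user's Type-I risk level), uniform control means the constraint below holds with near-equality across the composite null, so that no risk budget is left unspent at any null configuration:

```latex
\sup_{(p_0,\, p_1) \in H_0} \; \Pr_{(p_0,\, p_1)}\!\left[\text{declare } H_1\right] \;\le\; \alpha
```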



Performance evaluation on hardware rollouts

In the following, we evaluate STEP and baseline methods on a variety of real hardware evaluation sequences.

All sequential methods efficiently reduce the number of required evaluation trials in easy (large-gap) comparison settings (FoldRedTowel, CleanUpSpill). For smaller gaps, STEP performs significantly better, especially in the low-variance regime corresponding to the lower-left and upper-right corners of the preceding statistical-power plots (StackCube). We also show that for small gaps, reliably comparing policy performance requires significantly more trials than is standard in the robotics literature (CarrotOnPlate, EggplantInBasket):



Additional STEP properties

We discuss several properties of STEP evaluation that carry strong practical benefits for the user.

STEP is robust to the choice of n_max: the user can conservatively allocate a large evaluation budget without incurring many additional trials in easy cases. This is far more efficient than having to pre-select an exact batch size!
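As a rough illustration of why adaptive stopping confers this robustness, the sketch below uses a generic time-uniform (union-bound Hoeffding) confidence sequence as a stand-in stopping rule (not STEP's optimized boundaries) and shows that the mean number of trials used on an easy instance barely changes as n_max grows:

```python
import numpy as np

def trials_used(p0, p1, n_max, alpha=0.05, rng=None):
    """Stop as soon as a time-uniform (union-bound Hoeffding)
    confidence sequence for p1 - p0 excludes 0, or when n_max is
    exhausted. A generic stand-in rule, NOT STEP's boundaries."""
    rng = rng or np.random.default_rng()
    diff_sum = 0
    for n in range(1, n_max + 1):
        diff_sum += int(rng.random() < p1) - int(rng.random() < p0)
        # Hoeffding radius with per-n risk alpha / (n * (n + 1)),
        # valid simultaneously over all n by a union bound.
        radius = np.sqrt(2.0 * np.log(2.0 * n * (n + 1) / alpha) / n)
        if abs(diff_sum / n) > radius:  # interval excludes 0: stop
            return n
    return n_max

rng = np.random.default_rng(0)
for n_max in (200, 1000):
    t = [trials_used(0.3, 0.8, n_max, rng=rng) for _ in range(500)]
    print(f"n_max={n_max:4d}: mean trials used = {np.mean(t):.1f}")
```

Sweeping the gap p1 - p0 instead of n_max in the same harness illustrates the difficulty-adaptivity property discussed next.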



Additionally, STEP efficiently adapts to the underlying problem difficulty, ensuring that at all difficulty levels of the comparison (i.e., easy cases with large gaps and hard cases with small gaps) the decision is made with near-optimal expected sample complexity.



Citation