How Generalizable Is My Behavior Cloning Policy? A Statistical Approach to Trustworthy Performance Evaluation

Stanford University, Toyota Research Institute

When using a small number of policy rollouts to evaluate robot performance, it is important to quantify the uncertainty in the performance estimate. In our paper we show how to place worst-case confidence bounds on the distribution of robot performance while using the data from policy rollouts as efficiently as possible.

Abstract

With the rise of stochastic generative models in robot policy learning, end-to-end visuomotor policies are increasingly successful at solving complex tasks by learning from human demonstrations. Nevertheless, since real-world evaluation costs afford users only a small number of policy rollouts, it remains a challenge to accurately gauge the performance of such policies. This is exacerbated by distribution shifts causing unpredictable changes in performance during deployment. To rigorously evaluate behavior cloning policies, we present a framework that provides a tight lower bound on robot performance in an arbitrary environment, using a minimal number of experimental policy rollouts. Notably, by applying the standard stochastic ordering to robot performance distributions, we provide a worst-case bound on the entire distribution of performance (via bounds on the cumulative distribution function) for a given task. We build upon established statistical results to ensure that the bounds hold with a user-specified confidence level and tightness, and are constructed from as few policy rollouts as possible. In experiments, we evaluate policies for visuomotor manipulation in both simulation and hardware. Specifically, we (i) empirically validate the guarantees of the bounds in simulated manipulation settings, (ii) find the degree to which a learned policy deployed on hardware generalizes to new real-world environments, and (iii) rigorously compare two policies tested in out-of-distribution settings. Our experimental data, code, and implementation of confidence bounds are open-source.

Evaluation in Simulation

We obtain upper confidence bounds on the cumulative distribution function (CDF) of the total reward achieved by diffusion policies in out-of-distribution robosuite environments. An upper confidence bound on the CDF can be interpreted as the worst-case distribution of reward that is consistent with the observed policy rollouts. Here we show representative policy rollouts for the Square environment, and plot the histogram of reward alongside our corresponding 95% upper confidence bound.
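To make the bounded object concrete, the sketch below computes a 95% upper confidence bound on a reward CDF from rollout data using the one-sided Dvoretzky–Kiefer–Wolfowitz inequality. This is one established construction for such a bound, not necessarily the (typically tighter) construction used in the paper, and the reward samples are placeholders.

```python
import numpy as np

def dkw_upper_cdf_bound(rewards, alpha=0.05):
    """One-sided Dvoretzky-Kiefer-Wolfowitz upper confidence bound on the
    CDF of reward. With probability at least 1 - alpha, the true CDF F
    satisfies F(x) <= F_hat(x) + eps for all x, where F_hat is the
    empirical CDF of the observed rollouts and
    eps = sqrt(log(1 / alpha) / (2 * n)).
    A pointwise-higher CDF places more mass on low rewards, so this bound
    describes a worst-case reward distribution consistent with the data."""
    rewards = np.sort(np.asarray(rewards, dtype=float))
    n = rewards.size
    eps = np.sqrt(np.log(1.0 / alpha) / (2.0 * n))
    ecdf = np.arange(1, n + 1) / n          # empirical CDF at the sorted samples
    upper = np.minimum(ecdf + eps, 1.0)     # clip to a valid probability
    return rewards, upper                   # step function over the sorted rewards

# Usage with placeholder rewards standing in for 50 recorded rollouts.
rng = np.random.default_rng(0)
placeholder_rewards = rng.uniform(0.0, 400.0, size=50)
xs, cdf_upper = dkw_upper_cdf_bound(placeholder_rewards, alpha=0.05)
```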

Evaluation in Hardware

We obtain 95% lower confidence bounds on the success rate of a diffusion policy tested in two out-of-distribution environments. The confidence bounds we obtain quantify our uncertainty in the performance of the robot in a concrete and interpretable manner.
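For readers who want to reproduce this kind of guarantee, a minimal sketch follows using the exact (Clopper–Pearson) one-sided lower confidence bound for a binomial success rate, one standard construction built on established statistical results; the success and trial counts here are hypothetical.

```python
from scipy.stats import beta

def success_rate_lower_bound(successes, trials, alpha=0.05):
    """Exact (Clopper-Pearson) one-sided lower confidence bound: with
    probability at least 1 - alpha over the rollouts, the true success
    rate is at least the returned value."""
    if successes == 0:
        return 0.0
    return beta.ppf(alpha, successes, trials - successes + 1)

# Hypothetical hardware evaluation: 18 successes in 25 rollouts.
print(success_rate_lower_bound(18, 25))  # ~0.55
```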

Comparing Policies

We can also compare two policies using our confidence bounds. Here we apply our statistical bounds to the recent results from the RT-2 paper, where the authors compare their RT-2 policy to a VC-1 policy in three settings designed to test emergent capabilities in symbol understanding, reasoning, and human recognition. For each setting, we find that the 95% confidence intervals for policy success rate are disjoint, and we conclude with 95% confidence that RT-2 outperforms VC-1 in these settings; a minimal sketch of this interval check appears below.

Confidence intervals for policy success rates
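The comparison can be sketched with exact (Clopper–Pearson) two-sided intervals: two policies are distinguishable at the 95% level whenever their intervals do not overlap. The success counts below are hypothetical stand-ins, not the RT-2 paper's numbers.

```python
from scipy.stats import beta

def clopper_pearson_interval(successes, trials, alpha=0.05):
    """Two-sided exact (Clopper-Pearson) confidence interval for a
    binomial success rate."""
    lo = 0.0 if successes == 0 else beta.ppf(alpha / 2, successes, trials - successes + 1)
    hi = 1.0 if successes == trials else beta.ppf(1 - alpha / 2, successes + 1, trials - successes)
    return lo, hi

def intervals_disjoint(a, b):
    """True if one interval lies entirely above the other."""
    return a[1] < b[0] or b[1] < a[0]

# Hypothetical counts, purely for illustration.
rt2_interval = clopper_pearson_interval(successes=20, trials=24)
vc1_interval = clopper_pearson_interval(successes=4, trials=24)
print(rt2_interval, vc1_interval, intervals_disjoint(rt2_interval, vc1_interval))
```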

Supplementary Videos

BibTeX

@misc{vincent2024generalizable,
      title={How Generalizable Is My Behavior Cloning Policy? A Statistical Approach to Trustworthy Performance Evaluation}, 
      author={Joseph A. Vincent and Haruki Nishimura and Masha Itkina and Paarth Shah and Mac Schwager and Thomas Kollar},
      year={2024},
      eprint={2405.05439},
      archivePrefix={arXiv},
      primaryClass={cs.RO}
}