Sequential Multi-hypothesis Testing in Multi-armed Bandit Problems

  • 01



Name of the Speaker: Gayathri R Prabhu (EE15D035)
Guide: Dr. Srikrishna Bhashyam
Venue/Online meeting link:
Date/Time: 1st November 2022, 2.00pm

Hypothesis testing under controlled sampling has received significant interest recently. One such setting with controlled sampling is the multi-armed bandit setting. In this work, we consider sequential multi-hypothesis testing in a multi-armed bandit setting involving K arms. We consider the case where each arms signal follows a distribution from a vector exponential family. The actual parameters of the arms are unknown to the decision maker. The decision maker incurs a delay cost for delay until a decision and a switching cost whenever the decision maker switches from one arm to another. Thus we have a sequential decision making problem where the decision maker gets only a limited view of the true state of nature at each stage, but can control his view by choosing the arm to observe at each stage.The goal of this work is to develop sequential policies to identify the true hypothesis as quickly as possible while ensuring that probability of false detection is below a constraint. We focus on problems where there is definite structure associated with the parameter set of the arms. This work is divided into two parts: (a) an odd arm identification problem where, as the name suggests, one among the set of K arms has a distribution different from that of the rest of the arms, and (b) a general multi-hypothesis testing setting of which a wide range of problems including the odd arm identification, best arm identification, multiple anomaly detection and partition identification problems are special cases.In each of these cases, an information-theoretic lower bound on the total cost (expected time for a reliable decision plus total switching cost) is first identified, and a variation on a sequential policy based on the generalised likelihood ratio statistic is then studied. Due to the vector exponential family assumption, the signal processing at each stage is simple; the associated conjugate prior distribution on the unknown model parameters enables easy updates of the posterior distribution. The proposed policies, with a suitable threshold for stopping, are shown to satisfy the given constraint on the probability of false detection. The policies are also shown to be asymptotically optimal in terms of the total cost among all policies that satisfy the constraint on the probability of false detection.