MS TSA Meeting


Name of the Speaker: Mr. Amit Anand (EE19S002)
Guide: Prof. Krishna Jagannathan
Online meeting link: meet.google.com/yjx-nsfm-rdf
Date/Time: 28th March 2024 (Thursday), 11:00 AM
Title: Collaborative Best Arm Identification in Multi-armed Bandits

Abstract

The literature on multi-armed bandits broadly comprises two problems: regret minimization and best arm identification. This work considers the best arm identification problem in stochastic multi-armed bandits (MAB) under a fixed-budget setting, where the agents learning the MAB instance are connected through a network. The model assumes a collaborative network of agents in which each agent instantaneously observes the actions and rewards of itself and its one-hop neighbors. We aim to estimate the best arm for the entire network under this collaborative model, and we propose several policies for this task. We first devise policies for star networks, since stars are a common motif in many communication systems, and then extend them to generic networks. We take two different approaches for generic networks: in one, we estimate the best arm with the help of an Ensemble Learner (EL); in the other, we estimate it without an EL.
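
To make the information structure concrete, here is a minimal Python sketch of one round of the collaborative model. It is purely illustrative and not the speaker's implementation; the Bernoulli rewards, the adjacency-list graph, and the uniform placeholder agent are assumptions made for the example.

    import random

    class UniformAgent:
        """Placeholder agent: pulls arms uniformly at random and keeps
        empirical statistics (an assumption; any arm selection policy fits)."""
        def __init__(self, n_arms):
            self.counts = [0] * n_arms
            self.sums = [0.0] * n_arms

        def select_arm(self):
            return random.randrange(len(self.counts))

        def observe(self, arm, reward):
            self.counts[arm] += 1
            self.sums[arm] += reward

    def play_round(graph, agents, means):
        """One round: every agent pulls an arm; each agent then observes the
        (arm, reward) pairs of itself and its one-hop neighbors."""
        actions = {v: agents[v].select_arm() for v in graph}
        rewards = {v: float(random.random() < means[actions[v]]) for v in graph}
        for v in graph:
            for u in {v, *graph[v]}:
                agents[v].observe(actions[u], rewards[u])

    # Toy star network: agent 0 is the hub, agents 1-3 are leaves.
    graph = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
    agents = {v: UniformAgent(n_arms=4) for v in graph}
    for _ in range(100):  # illustrative budget T = 100
        play_round(graph, agents, means=[0.3, 0.5, 0.7, 0.4])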

For the best arm identification policies that use the EL, the EL aggregates samples from the agents at the end of the budget and decides the best arm. The objective is to design arm selection policies for each agent (and a suitable aggregation policy for the EL) that minimize the probability of incorrect arm identification. We propose two collaborative learning policies that first partition the network of agents into dominating sets. Within each star partition, we then employ (i) a highly exploring variant of UCB-E or (ii) a ‘Follow Your Leader’ policy, and at the end of the budget the EL aggregates the samples from all the partitions to select the best arm.
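
For reference, the index maximized by UCB-E (Audibert and Bubeck, 2010) can be sketched as follows; a large exploration parameter a gives the highly exploring behavior mentioned above. The parameter value and the surrounding partition and aggregation logic are not specified in this abstract, so the sketch shows only the per-round arm choice.

    import math

    def ucb_e_select(counts, sums, a):
        """Pick the arm maximizing empirical mean + sqrt(a / pulls).
        Larger `a` explores more aggressively; unpulled arms go first."""
        for i, n in enumerate(counts):
            if n == 0:
                return i  # initialization: pull every arm at least once
        return max(range(len(counts)),
                   key=lambda i: sums[i] / counts[i] + math.sqrt(a / counts[i]))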

We also propose voting-based policies that use the EL to estimate the best arm. These voting policies fall into two categories: one is based on upper confidence bounds, while the other successively rejects the worst-performing arm.
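
The successive-rejects component can be grounded in the classic fixed-budget schedule of Audibert and Bubeck (2010), sketched below for a single agent: the budget T is split into K - 1 phases, and at the end of each phase the empirically worst surviving arm is discarded. How the agents' votes are then combined by the EL is part of the proposed policies and is not reproduced here.

    import math

    def successive_rejects_phases(K, T):
        """Extra pulls per surviving arm in each of the K - 1 phases,
        following the n_k schedule of Audibert and Bubeck (2010)."""
        log_bar = 0.5 + sum(1.0 / i for i in range(2, K + 1))
        n = [0] + [math.ceil((T - K) / (log_bar * (K + 1 - k)))
                   for k in range(1, K)]
        return [n[k] - n[k - 1] for k in range(1, K)]

    # Example: K = 4 arms, budget T = 400.
    print(successive_rejects_phases(4, 400))  # [63, 21, 42]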

Our policies enjoy a probability of error that decays exponentially in the number of agents and in the budget. Notably, our policies based on dominating set partitions do not require a minimum dominating set partition of the graph (finding one is an NP-hard problem), and the probability-of-error bounds remain independent of the size of the dominating set partition.
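
To illustrate why a minimum dominating set is unnecessary, any polynomial-time construction of a dominating-set-based star partition will do. Below is a sketch of one standard greedy heuristic; the abstract does not say which construction the proposed policies use, so this particular choice is an assumption.

    def greedy_star_partition(graph):
        """Greedily pick hubs that cover the most not-yet-covered agents and
        assign the newly covered agents to that hub, yielding a partition into
        stars. Runs in polynomial time; the resulting dominating set need not
        be minimum, which is all the error bounds above require."""
        uncovered = set(graph)
        stars = {}
        while uncovered:
            hub = max(graph, key=lambda u: len(({u} | set(graph[u])) & uncovered))
            members = ({hub} | set(graph[hub])) & uncovered
            stars[hub] = members  # the hub's star: itself (if new) plus new leaves
            uncovered -= members
        return stars

    # Example: path 0-1-2-3-4 partitions into stars {1: {0, 1, 2}, 3: {3, 4}}.
    print(greedy_star_partition({0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}))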