PhD Viva


Name of the Speaker: Mr. Boyapati Prafulla Chandra (EE16D402)
Guide: Prof. Andrew Thangaraj
Online meeting link: https://meet.google.com/aac-ioam-xrz
Date/Time: 13th May 2024 (Monday), 11:00 AM
Title: Investigating the Missing Parts of Distributions in Samples

Abstract:

Estimating a distribution from samples is a classical and important problem in statistics. When the alphabet size is large compared to the number of samples, a portion of the distribution is highly likely to be unobserved or sparsely observed, and the error of the empirical distribution estimator is bounded below by the missing mass, defined as the sum of the probabilities of the letters that are missing from the samples. The Good-Turing (GT) estimator for missing mass is the ratio of the number of letters occurring exactly once in the samples to the sample size. The GT estimator is an important tool in large-alphabet distribution estimation and is widely used in natural language modelling for smoothing empirical estimates. In this work, we propose various new notions and generalizations of missing mass that are useful for inferring the missing parts of distributions from samples, and we characterize the minimax squared error risk of estimating these new types of missing mass under different sampling models, including those with memory. We show that the GT estimator, or an appropriate modification of it, is minimax rate optimal for the proposed types of missing mass. We also propose new notions of concentration that provide better tail probability bounds than sub-Gaussian concentration for missing mass and its generalizations.
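
As an illustration of the two basic definitions in the abstract, the following is a minimal Python sketch of the missing mass and the standard GT estimator for i.i.d. samples. The function names, the dictionary representation of the distribution, and the toy data are assumptions made for this example only; the sketch does not cover the generalized notions of missing mass or the modified estimators studied in the thesis.

```python
from collections import Counter


def missing_mass(probs, samples):
    """Sum of the true probabilities of letters that never appear in samples.

    probs is a dict mapping each letter of the alphabet to its probability
    (a hypothetical representation chosen for this sketch).
    """
    seen = set(samples)
    return sum(p for letter, p in probs.items() if letter not in seen)


def good_turing_missing_mass(samples):
    """Good-Turing estimate: (number of letters seen exactly once) / (sample size)."""
    n = len(samples)
    if n == 0:
        raise ValueError("at least one sample is required")
    counts = Counter(samples)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / n


# Toy example: 'd' is never observed; 'b' and 'c' are each observed once.
toy_probs = {"a": 0.5, "b": 0.2, "c": 0.2, "d": 0.1}
toy_samples = ["a", "b", "a", "c"]

true_m0 = missing_mass(toy_probs, toy_samples)   # probability of 'd' only
gt_m0 = good_turing_missing_mass(toy_samples)    # 2 singletons / 4 samples
print(true_m0, gt_m0)
```

Note that the true missing mass depends on the unknown distribution, whereas the GT estimate is computed from the samples alone, which is why it is usable in practice.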