Learning under Weak-Supervision: Event Localization and Representation Learning

Name of the Speaker: Mr. Kranthi Kumar Rachavarapu (EE18D004)
Guide: Dr. Rajagopalan AN
Venue: ESB-244 (Seminar Hall)
Date/Time: 11th September 2023 (Monday), 3:00 PM

Weakly-supervised learning aims to learn from data with noisy or incomplete labels. It is a promising alternative to the two extremes of unsupervised and fully-supervised learning, as data for weak-supervision is easier to obtain than fully-labeled data while still being informative. Because such supervision is incomplete and fine-grained labels are absent, developing effective methods that exploit it for computer vision tasks remains an open and challenging problem. In this seminar, we discuss how to effectively utilize weak-supervision for two computer vision tasks: (1) event localization and (2) representation learning. We adopt latent-variable models in which the missing fine-grained labels are the latent variables and the weakly-supervised data are the observables. We then maximize the likelihood of the data within an Expectation-Maximization (EM) framework under weak-supervision, with task-specific design choices, and show the effectiveness of our formulation.
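
For readers unfamiliar with the EM view, the generic textbook form of the objective can be sketched as follows. The notation is an assumption made here for illustration and is not taken from the talk: x denotes the observed data, y the weak label, z the latent fine-grained labels, \theta the model parameters, and q an arbitrary distribution over z.

```latex
% Evidence lower bound (ELBO) on the weak-label log-likelihood
\log p_\theta(y \mid x)
  \;\ge\; \mathbb{E}_{q(z)}\!\left[\log p_\theta(y, z \mid x) - \log q(z)\right]
  \;=\; \mathrm{ELBO}(q, \theta)

% E-step: estimate the latent variables under the current parameters
q^{(t)}(z) \;=\; p_{\theta^{(t)}}(z \mid x, y)

% M-step: update the parameters with the latent estimate held fixed
\theta^{(t+1)} \;=\; \arg\max_{\theta}\; \mathbb{E}_{q^{(t)}(z)}\!\left[\log p_\theta(y, z \mid x)\right]
```

The task-specific design choices mentioned above would determine what z is and how the E-step estimate is obtained; this block shows only the standard template.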

In the first part of this seminar, we explore the problem of Weakly-Supervised Audio-Visual Video Parsing (AVVP), where the goal is to temporally localize events that are audible or visible and simultaneously classify them into known event categories. This is a challenging task, as we only have access to video-level event labels during training but need to predict fine-grained event labels at the segment level during evaluation. Existing multiple-instance learning (MIL) based methods use a form of attentive pooling over segment-level predictions. These methods optimize only for the subset of most discriminative segments that satisfy the weak-supervision constraints and therefore miss other positive segments, which degrades performance. To this end, we explore modeling (1) segment labels and (2) the proportion of positive segments as the latent variables. Here, we hypothesize that the proportion of positive segments provides a more informative signal than weak labels while being less noisy than segment labels. Even though modeling segment labels is optimal in theory, they are difficult to estimate accurately from a weakly-supervised model in practice without additional inductive biases. With the segment labels as latents, unlike existing methods that learn a classifier with a single weight vector for each class, we model each event as a set of prototypes obtained by clustering the more reliable segments based on MIL-model predictions. We employ this nonparametric prototypical classifier to estimate segment-level pseudo-labels. When using the proportion of positive segments as the latent variable, we show that it can be modeled as a Poisson binomial distribution over segment-level predictions, which can be computed exactly. We then propose an Expectation-Maximization (EM) approach to learn the model parameters by maximizing the evidence lower bound (ELBO). We iteratively estimate these latent variables in the E-step and optimize the model parameters in the M-step.
We conducted extensive experiments on the AVVP task to evaluate the effectiveness of our proposed approaches, and the experimental results clearly show that our formulations are more robust and consistently outperform existing approaches. Additionally, our experiments on Temporal Action Localization (TAL) demonstrate the potential of our method for generalization to similar MIL tasks under weak-supervision.
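
As an illustration of why the distribution over the number (and hence the proportion) of positive segments can be computed exactly, here is a minimal sketch of the standard dynamic-programming recursion for a Poisson binomial distribution. It assumes each segment is positive independently with its own predicted probability; the function name and the example probabilities are hypothetical and are not taken from the speaker's implementation.

```python
def poisson_binomial_pmf(probs):
    """Return [P(K=0), ..., P(K=T)], where K is the number of positive
    segments among T segments with independent positive probabilities."""
    pmf = [1.0]  # with no segments processed, K = 0 with probability 1
    for p in probs:
        nxt = [0.0] * (len(pmf) + 1)
        for k, mass in enumerate(pmf):
            nxt[k] += mass * (1.0 - p)  # this segment is negative
            nxt[k + 1] += mass * p      # this segment is positive
        pmf = nxt
    return pmf


# Hypothetical segment-level predictions for one event class in one video.
segment_probs = [0.9, 0.8, 0.1, 0.2]
pmf = poisson_binomial_pmf(segment_probs)

# Re-index the counts as proportions of positive segments (k / T).
proportion_pmf = {k / len(segment_probs): m for k, m in enumerate(pmf)}
```

The recursion costs O(T^2) for T segments and involves no sampling or approximation, which is what makes an exact computation practical for short segment sequences.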

In the second part of this seminar, we focus on the task of weakly-supervised representation learning, where the goal is to learn representations from data with noisy labels obtained from auxiliary information, such as hashtags for Instagram images. To learn meaningful representations in this setting, we propose a Pseudo-Label guided Weakly-Supervised Learning (PL-WSL) method. We exploit the semantic similarity of the data to obtain sub-concepts within each weak label, effectively dividing the data into finer-grained subgroups, which then serve as pseudo-labels for the subsequent contrastive learning phase. More formally, given pairs of images and weak labels, we model the fine-grained subgroups as latent variables. Our formulation then adopts the Expectation-Maximization (EM) framework: in the E-step, we perform clustering within each weak label to obtain the pseudo-labels, and in the M-step, we train the model using these pseudo-labels via contrastive learning. We also provide a mutual information-based analysis that offers intuition into the improved performance of our approach. Our approach outperforms state-of-the-art contrastive learning methods under weak-supervision on multiple datasets and performs particularly well in cases with extremely coarse weak labels.
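
The E-step described above (clustering within each weak label to obtain pseudo-labels) can be sketched roughly as follows. This is a minimal illustration, not the PL-WSL implementation: the choice of k-means, the deterministic initialization, the fixed number of sub-concepts k, and all names and toy embeddings are assumptions made here. The contrastive M-step is not shown.

```python
from collections import defaultdict


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20):
    """Tiny Lloyd's k-means; centroids start at the first k points so the
    result is deterministic (an illustrative choice, not a recommendation)."""
    centroids = [list(p) for p in points[:k]]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid for every point.
        assign = [
            min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each non-empty centroid to the mean of its members.
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                centroids[c] = [sum(x) / len(members) for x in zip(*members)]
    return assign


def pseudo_labels(embeddings, weak_labels, k):
    """Cluster separately inside each weak label; the pseudo-label of a
    sample is the pair (weak_label, cluster_index)."""
    groups = defaultdict(list)
    for i, w in enumerate(weak_labels):
        groups[w].append(i)
    labels = [None] * len(embeddings)
    for w, idxs in groups.items():
        assign = kmeans([embeddings[i] for i in idxs], k)
        for i, a in zip(idxs, assign):
            labels[i] = (w, a)
    return labels


# Toy 2-D embeddings: each weak label hides two sub-concepts.
embeddings = [(0, 0), (0.1, 0), (5, 5), (5.1, 5),
              (10, 0), (10, 0.1), (20, 20), (20, 20.1)]
weak = ["animal"] * 4 + ["vehicle"] * 4
labels = pseudo_labels(embeddings, weak, k=2)
```

Because clustering is run per weak label, two samples with different weak labels never share a pseudo-label, so the pseudo-labels refine the weak labels rather than replace them.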