PhD Viva


Name of the Speaker: Mr. Kranthi Kumar R (EE18D004)
Guide: Dr. Rajagopalan
Online meeting link: https://meet.google.com/fmr-rrri-qvh
Date/Time: 6th May 2025 (Tuesday), 10 AM
Title: Look and Listen: From Spatial Understanding to Event Localization in Audio-Visual Learning

Abstract:

Human perception is inherently multi-modal. We perceive the world by looking and listening, as well as through touch, smell, and taste. Together, these senses create a rich experience and help us understand and interact with our surroundings. Inspired by this, we focus on building systems that use both audio and visual information to better understand the environment, much as humans do. The work is broadly divided into two main themes: (1) generative tasks in the audio-visual domain, and (2) audio-visual event localization and representation learning. A significant part of this research focuses on techniques that work under weak supervision, enabling models to learn effectively even with limited labeled data.

In the first part of the work, we explore how spatial cues in sight and sound can be used to solve two audio-visual generative problems: generating immersive binaural audio for videos and estimating 360° scene depth using echoes. For binaural audio, we aim to convert regular mono audio into spatial binaural sound by exploiting visual cues from the video. We approach this in a weakly semi-supervised setting, where direct supervision is limited, and use sound source localization as a proxy task to guide the learning. In the second task, we study how echoes (reflections of sound from the environment) can be combined with RGB images covering only a 90° field of view to estimate the depth of the full 360° scene. These echoes encode room geometry and, together with the limited camera views, allow us to reconstruct a complete depth map, even for regions the camera does not see.
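For intuition, the minimal sketch below (in PyTorch) shows one common way to condition a mono-to-binaural network on vision: encode the mono spectrogram, fuse it with pooled visual features, and predict a mask for the left-right difference signal. The layer sizes, fusion scheme, and mask-based formulation are illustrative assumptions, not the exact architecture developed in the thesis.

# A minimal sketch (not the thesis model) of visually guided mono-to-binaural
# conversion: condition an audio network on pooled visual features and predict
# the left-right difference spectrogram.
import torch
import torch.nn as nn

class MonoToBinauralSketch(nn.Module):
    def __init__(self, visual_dim=512, audio_channels=2):
        super().__init__()
        # Encode the mono mixture spectrogram (real + imaginary channels).
        self.audio_enc = nn.Sequential(
            nn.Conv2d(audio_channels, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
        )
        # Project pooled visual features so they can be fused with audio maps.
        self.visual_proj = nn.Linear(visual_dim, 128)
        # Decode to a mask for the predicted difference (L - R) spectrogram.
        self.audio_dec = nn.Sequential(
            nn.ConvTranspose2d(256, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, audio_channels, 4, stride=2, padding=1),
            nn.Tanh(),
        )

    def forward(self, mono_spec, visual_feat):
        # mono_spec: (B, 2, F, T) real/imag spectrogram of the mono mixture
        # visual_feat: (B, visual_dim) pooled frame features from the video
        a = self.audio_enc(mono_spec)                      # (B, 128, F/4, T/4)
        v = self.visual_proj(visual_feat)                  # (B, 128)
        v = v[:, :, None, None].expand(-1, -1, a.size(2), a.size(3))
        fused = torch.cat([a, v], dim=1)                   # (B, 256, F/4, T/4)
        mask = self.audio_dec(fused)                       # (B, 2, F, T)
        # Predicted difference spectrogram; with mono = L + R and diff = L - R,
        # the channels are L = (mono + diff) / 2 and R = (mono - diff) / 2.
        return mask * mono_spec

model = MonoToBinauralSketch()
diff_spec = model(torch.randn(4, 2, 256, 64), torch.randn(4, 512))
print(diff_spec.shape)  # torch.Size([4, 2, 256, 64])

In this sketch the sound source localization proxy described above would act as an auxiliary objective on the visual branch; it is omitted here for brevity.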

In the second part of the work, we focus on weakly supervised learning for event localization and representation learning. Weak supervision refers to using cheaper, coarser labels in place of fine-grained annotations. Such supervision, however, is often noisy and incomplete, which makes learning more challenging. To address this, we propose techniques that model the missing fine-grained labels as latent variables and use an Expectation-Maximization (EM) framework to learn from the data effectively. We apply this approach to audio-visual event localization and image representation learning, demonstrating that our methods achieve strong performance even with limited supervision.
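To make the EM idea concrete, the toy example below (NumPy) treats segment-level event labels as latent variables: the E-step infers soft segment labels consistent with the observed video-level tag, and the M-step refits a simple segment classifier to those soft labels. The synthetic data, logistic classifier, and update rules are illustrative assumptions and not the models used in the thesis.

# A toy sketch of EM for weakly supervised event localization: segment-level
# labels are latent, only the video-level tag is observed, and we alternate
# between inferring soft segment labels (E-step) and refitting a simple
# segment classifier (M-step). Purely illustrative.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic data: each video has T segments with D-dim features; the weak
# label says whether the event occurs in at least one segment of the video.
V, T, D = 200, 10, 8
true_w = rng.normal(size=D)
X = rng.normal(size=(V, T, D))
seg_true = X @ true_w > 3.0                        # hidden segment-level events
video_label = seg_true.any(axis=1).astype(float)   # observed weak labels

w = np.zeros(D)                                    # segment classifier weights
for it in range(30):
    # E-step: posterior that each segment is positive, consistent with the
    # weak label (negative videos force all segments to zero).
    p = sigmoid(X @ w)                             # (V, T) segment scores
    q = p * video_label[:, None]
    # In positive videos, renormalize so the best segment is confidently on.
    pos = video_label > 0
    q[pos] = np.clip(q[pos] / (q[pos].max(axis=1, keepdims=True) + 1e-8), 0, 1)

    # M-step: refit the segment classifier to the soft labels with a few
    # gradient steps on the weighted logistic loss.
    for _ in range(20):
        p = sigmoid(X @ w)
        grad = ((p - q)[..., None] * X).mean(axis=(0, 1))
        w -= 0.5 * grad

pred = sigmoid(X @ w) > 0.5
acc = (pred == seg_true).mean()
print(f"segment-level accuracy from video-level labels only: {acc:.2f}")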

In conclusion, this work aims to develop audio-visual systems that perceive and understand the world more like humans do, by combining sight and sound. By leveraging spatial cues and learning from limited supervision, we present effective solutions for binaural audio generation, 360° depth estimation, event localization, and representation learning, showing that multimodal systems can perform well on complex perception tasks even with minimal weak supervision.