PhD Seminar


Name of the Speaker: Mr. Kranthi Kumar R (EE18D004)
Guide: Prof. Rajagopalan
Online meeting link: https://meet.google.com/eia-gxsb-tiu
Date/Time: 12th April 2024 (Friday), 2:00 PM
Title: Leveraging Audio-Visual Spatial Cues for Binaural Video Generation and 360° Depth Perception

Abstract

Audio-visual modalities inherently encode complementary information about the environment. In this seminar, we explore leveraging the spatial cues in sound and vision to solve two problems: generating immersive binaural audio for videos and 360° depth estimation.

In the first part of the seminar, we explore the problem of generating binaural (two-channel) audio for videos that have only regular monaural (single-channel) audio. Such binaural videos offer users a more immersive viewing experience. We tackle a more difficult version of this problem by synthesizing binaural audio for a video with monaural audio in a weakly semi-supervised setting, i.e., with little annotated data. Our approach uses a downstream task that requires binaural audio as a proxy for supervision, reducing the need for explicit annotation. Specifically, we use audio-only Sound Source Localization as the proxy task for weak supervision. Guided by this task, we leverage visual cues to understand where sounds originate in the scene and infuse this spatial information into the audio to create a realistic 3D audio effect.

In the second part of the seminar, we explore how sound, together with RGB images, can be leveraged to improve depth estimation of a scene. We experiment with echoes, which are sound reflections from the environment. Since echoes encode room geometry, it is possible to reconstruct the 3D geometry of a room, and thereby its depth, from them. We investigate how echoes, combined with limited camera views, can be used to create a complete 360° depth map of the surrounding environment. This technique enables 360° depth perception from limited camera views, providing spatial awareness even when we cannot see everything around us. Our experiments demonstrate that echoes significantly improve depth estimation accuracy, particularly for 360° cameras. Furthermore, using audio and visual information together provides a complementary approach to estimating 360° depth maps from limited information.