MS TSA Meeting


Name of the Speaker: Mr. Kartik Vishnu Hegde (EE21S005)
Guide: Prof. Rajagopalan AN
Online meeting link: https://meet.google.com/kgp-tkad-jzj
Date/Time: 26th July 2024 (Friday), 11.00 AM
Title: Data-centric Performance Enhancement of Vision-Language Models for Downstream Tasks

Abstract:

Vision-language research has gained significant attention in recent years because of its potential for real-world applications. Large pre-trained vision-language models (VLMs) have been developed to address a variety of downstream tasks, such as image captioning, visual question answering (VQA), visual grounding, and text-to-video retrieval. Traditionally, improving the performance of these models has meant either developing more sophisticated architectures or training on vast amounts of high-quality paired data, both of which are resource-intensive. In this thesis, we propose methods that enhance the performance of VLMs on downstream vision-language tasks by using data effectively, without retraining the model from scratch. Our approach starts from existing pre-trained VLMs and fine-tunes them with strategic data augmentation and efficient data processing techniques. We demonstrate that carefully curating and using data can yield significant improvements in model performance. Through extensive experimentation, we show that our methods surpass baseline performance on benchmark datasets for tasks such as text-to-video retrieval and visual question answering, with notable gains in accuracy. This demonstrates the effectiveness of data-centric enhancements over conventional training-heavy methodologies. These findings suggest promising directions for future research and for applications that require joint understanding of visual and textual data.