Hey, thanks for stopping by.
I am a Pre-Doctoral Fellow at the Visual Computing Lab, IISc Bangalore, working with Prof. Anirban Chakraborty on diffusion models and multimodal learning.
Previously, I collaborated with Dr. Prashant Kumar on research in topological deep learning, worked with Dr. Soumya Sanyal on calibrating large language models on long-form question-answering prompts, and worked with Dr. Satyam Srivastava at CSIR-CEERI on multi-label classification and segmentation of cancerous tumors.
I completed my bachelor's in Electrical and Electronics Engineering at BITS Pilani, where I was advised by Prof. Tejasvi Alladi and worked on improving the efficiency and scalability of the Practical Byzantine Fault Tolerance (PBFT) consensus protocol for decentralized vehicular networks.
Outside of research, I am an avid football and squash player, and I also enjoy long-distance running. Recently, I've picked up the piano again after a long break. I'm always open to new research ideas. If you have an interesting project in mind, feel free to reach out!
Email / CV / GitHub / Twitter / Google Scholar / LinkedIn
I'm broadly interested in deep learning, focusing on multimodal learning, generative models, and reinforcement learning.
We introduce SAGA, a framework that uses a frozen multimodal large language model and GRPO-based optimization to provide attribute-aware supervision for visual retrieval. Instead of relying on coarse pair-level labels, SAGA learns from fine-grained semantic differences and similarities identified by the MLLM, improving retrieval embeddings while keeping inference cost unchanged.
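GRPO (Group Relative Policy Optimization) scores each sampled response relative to the other responses drawn for the same input, which removes the need for a learned value critic. The snippet below is a minimal sketch of that group-normalized advantage, assuming one scalar reward per sample and the population standard deviation of the group. How SAGA converts the MLLM's attribute-level judgments into rewards is not described here, so the function name `group_relative_advantages` and the reward values are hypothetical.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize each reward against its own group.

    rewards: scalar rewards for the G responses sampled for one query
    (hypothetical inputs; the reward design here is an assumption).
    Returns (r_i - mean) / (std + eps) for each response.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]

# Illustrative rewards for 4 samples of one query (not SAGA data).
adv = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
print([round(a, 3) for a in adv])
```

Responses that beat their group's average receive positive advantages and the rest receive negative ones, so the update only depends on relative quality within each group.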
We introduce DynEval, a dynamic framework for evaluating text-to-image (T2I) generation that jointly assesses prompt alignment and image quality. To enable scalable training, we construct GenDB and DynEvalInstruct, two large-scale datasets containing generated prompt-image pairs and structured evaluation instructions. By distilling a strong multimodal teacher into compact 2B and 4B evaluator models, DynEval achieves higher correlation with human judgments than existing T2I evaluators while also providing fine-grained diagnostic feedback on generation failures.
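The summary does not say which correlation statistic is used; rank correlations such as Spearman's ρ are a common way to compare an automatic evaluator's scores with human ratings. The snippet below is a minimal pure-Python sketch assuming Spearman's ρ with average ranks for ties. The helper names and the scores are hypothetical, illustrative values, not DynEval results.

```python
def average_ranks(values):
    """Rank values 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # 1-based average rank of positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical scores for five prompt-image pairs (illustrative only).
human = [4.5, 2.0, 3.0, 5.0, 1.0]
evaluator = [0.81, 0.40, 0.55, 0.90, 0.35]
print(round(spearman(human, evaluator), 3))
```

Because ρ depends only on ordering, an evaluator can match human rankings perfectly even when its scores live on a different scale, as in the toy data above.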
This survey studies the evolution of multimodal AI agents that combine perception, reasoning, planning, memory, and action across text, images, audio, and video. It introduces a modality-centric taxonomy of agent architectures, analyzes multimodal fusion strategies, and reviews applications spanning robotics, web navigation, multimedia generation, and long-form video understanding, while highlighting key challenges toward building robust general-purpose agentic systems.
We introduce ππππ₯ππ, a real-world paired industrial object point cloud dataset, and show that it differs fundamentally from existing synthetic datasets by exhibiting rich topological features. We highlight the importance of integrating Persistent Homology priors into existing point cloud completion models, and present a Homology Sampler-based completion model, πππππππ.
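Persistent homology tracks how topological features (connected components, loops, voids) appear and disappear as a distance scale grows. The specific priors and the Homology Sampler are not described here, so the snippet below is only an illustration of the underlying idea: it computes 0-dimensional persistence (connected components) of a point cloud under a Vietoris-Rips filtration, with scale measured as edge length, using the fact that component death times coincide with minimum-spanning-tree edge lengths. The function name `h0_persistence` and the toy data are hypothetical.

```python
import math
from itertools import combinations

def h0_persistence(points):
    """0-dimensional (birth, death) pairs of a Vietoris-Rips filtration.

    Every point is born at scale 0. When the scale reaches an edge joining
    two separate components, one component dies at that edge length.
    This is Kruskal's minimum-spanning-tree algorithm.
    """
    if not points:
        return []
    parent = list(range(len(points)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    pairs = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:  # this edge merges two components: one dies here
            parent[ri] = rj
            pairs.append((0.0, length))
    pairs.append((0.0, math.inf))  # the last surviving component never dies
    return pairs

# Two well-separated clusters of 3D points (hypothetical toy data).
cloud = [(0, 0, 0), (0.1, 0, 0), (0, 0.1, 0), (5, 5, 5), (5.1, 5, 5)]
for birth, death in h0_persistence(cloud):
    print(birth, death)
```

Short-lived pairs reflect local sampling density, while the long-lived pair (dying near 8.6 here) reveals that the cloud has two separate parts; higher-dimensional features such as holes need a full persistent homology library.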
We propose a two-tiered BFT consensus framework for vehicular networks that uses geographic clustering to reduce messaging complexity from O(n²) to O(n¹·⁵), enhancing scalability and efficiency over traditional PBFT.
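Under one simple reading of the clustering idea, the exponent follows from arithmetic. The snippet below is a back-of-the-envelope sketch that assumes (hypothetically) n vehicles split into √n geographic clusters of √n members each, PBFT costing on the order of m² messages among m participants, and cluster leaders forming the second tier. Intra-cluster rounds then cost √n · (√n)² = n¹·⁵ and the leader tier costs (√n)² = n. Constants, faults, and the paper's actual cluster sizing are not modeled, and the function names are illustrative.

```python
import math

def pbft_messages(m):
    """Order-of-magnitude PBFT cost: all-to-all rounds among m replicas ~ m^2."""
    return m * m

def two_tier_messages(n):
    """Hypothetical two-tier scheme with k = sqrt(n) clusters of size n / k.

    Assumes n is a perfect square so every cluster has the same size.
    Intra-cluster PBFT: k * (n/k)^2 = n^1.5; leader-tier PBFT: k^2 = n.
    """
    k = math.isqrt(n)
    size = n // k
    return k * pbft_messages(size) + pbft_messages(k)

for n in (100, 400, 1600, 6400):
    flat, tiered = pbft_messages(n), two_tier_messages(n)
    print(n, flat, tiered, round(flat / tiered, 1))
```

Each fourfold increase in n multiplies the flat PBFT count by 16 but the tiered count by roughly 8 (that is, 4^1.5), so under these assumptions the savings ratio grows roughly like √n.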