Dheeraj Baiju

Hey, thanks for stopping by.

I am a Pre-Doctoral Fellow at the Visual Computing Lab, IISc Bangalore, working with Prof. Anirban Chakraborty on diffusion models and multimodal learning.

Previously, I collaborated with Dr. Prashant Kumar on research in topological deep learning, worked with Dr. Soumya Sanyal on calibrating large language models for long-form question-answering prompts, and worked with Dr. Satyam Srivastava at CSIR-CEERI on multi-label classification and segmentation of cancerous tumors.

I completed my bachelor's in Electrical and Electronics Engineering at BITS Pilani, where I was advised by Prof. Tejasvi Alladi and worked on enhancing the efficiency and scalability of the Practical Byzantine Fault Tolerance (PBFT) consensus protocol for decentralized vehicular networks.

Outside of research, I am an avid football and squash player, and I also enjoy long-distance running. Recently, I’ve picked up the piano again after a long break. I’m always open to new research ideas. If you have an interesting project in mind, feel free to reach out!

Email / CV / GitHub / Twitter / Google Scholar / LinkedIn

Research

I'm broadly interested in deep learning, focusing on multimodal learning, generative models, and reinforcement learning.

Publications (* denotes equal contribution)

Beyond Scalar Distances: Semantic Attribute Gradients from Frozen MLLMs for Visual Embeddings

Shubhang Bhatnagar*, Dheeraj Baiju*, Narendra Ahuja.

Under Review, 2026

We introduce SAGA, a framework that uses a frozen multimodal large language model and GRPO-based optimization to provide attribute-aware supervision for visual retrieval. Instead of relying on coarse pair-level labels, SAGA learns from fine-grained semantic differences and similarities identified by the MLLM, improving retrieval embeddings while keeping inference cost unchanged.

DynEval: Holistic Evaluations of T2I Generative Models in the Wild

Shyam Marjit*, Dheeraj Baiju*, Anuj Shikarkhane, Akhil Sakthieswaran, Sayak Paul, Anirban Chakraborty.

Under Review, 2026

We introduce DynEval, a dynamic framework for evaluating text-to-image generation that jointly assesses prompt alignment and image quality. To enable scalable training, we construct GenDB and DynEvalInstruct, two large-scale datasets containing generated prompt–image pairs and structured evaluation instructions. By distilling a strong multimodal teacher into compact 2B and 4B evaluator models, DynEval achieves higher correlation with human judgments than existing T2I evaluators while also providing fine-grained diagnostic feedback on generation failures.

A Survey on Foundations and Frontiers of Multimodal Agentic Frameworks: Techniques and Applications

Neel Mokaria*, Rishie Raj*, Dheeraj Baiju*, Xiaoqian Shen, Shraman Pramanick, Kevin Qinghong Lin, Arda Senocak, Mike Zheng Shou, Philip Torr, Mohamed Elhoseiny, Yapeng Tian, Ruohan Gao, Salman Khan, Sayan Nag, Sanjoy Chowdhury, Dinesh Manocha

TMLR, 2026
arXiv

This survey studies the evolution of multimodal AI agents that combine perception, reasoning, planning, memory, and action across text, images, audio, and video. It introduces a modality-centric taxonomy of agent architectures, analyzes multimodal fusion strategies, and reviews applications spanning robotics, web navigation, multimedia generation, and long-form video understanding, while highlighting key challenges toward building robust general-purpose agentic systems.

Revisiting Point Cloud Completion: Are We Ready For The Real-World?

Stuti Pathak*, Prashant Kumar*, Dheeraj Baiju, Nicholus Mboga, Gunther Steenackers, Rudi Penne

ICCV, 2025
arXiv

We introduce RealPC, a real-world paired industrial object point cloud dataset, and show how it is fundamentally different from existing synthetic datasets, exhibiting rich topological features. We highlight the importance of integrating Persistent Homology priors into existing point cloud completion models, and present a Homology Sampler-based completion model, BOSHNet.

An Efficient and Scalable Byzantine Fault Tolerant Consensus for Vehicular Networks

Ameya Deshmukh, Dheeraj Baiju, Atish Atish, Tejasvi Alladi, F. Richard Yu

TVT, 2025
IEEE Xplore

We propose a two-tiered BFT consensus framework for vehicular networks that uses geographic clustering to reduce messaging complexity from O(n²) to O(n¹·⁵), enhancing scalability and efficiency over traditional PBFT.