Dheeraj Baiju

We introduce SAGA, a framework that uses a frozen multimodal large language model (MLLM) and GRPO-based optimization to provide attribute-aware supervision for visual retrieval. Instead of relying on coarse pair-level labels, SAGA learns from fine-grained semantic differences and similarities identified by the MLLM, improving retrieval embeddings while keeping inference cost unchanged.
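
As a rough illustration of the group-relative step that gives GRPO (Group Relative Policy Optimization) its name, the sketch below normalizes a group of scalar rewards into advantages. It is not SAGA's implementation: the function name `group_relative_advantages`, the `eps` constant, the use of the population standard deviation, and the example reward values are all assumptions made for illustration, and how the frozen MLLM's attribute judgments would be turned into rewards is not specified here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group (hypothetical helper).

    Each advantage is the reward's offset from the group mean, scaled by
    the group's population standard deviation. eps avoids division by zero
    when every reward in the group is identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four candidates scored in the same group.
rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(rewards)
```

Candidates scoring above the group mean receive positive advantages and those below receive negative ones, so the training signal depends only on relative quality within the group rather than on the absolute reward scale.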