This survey studies the evolution of multimodal AI agents that combine perception, reasoning, planning, memory, and action across text, images, audio, and video. It introduces a modality-centric taxonomy of agent architectures, analyzes multimodal fusion strategies, and reviews applications spanning robotics, web navigation, multimedia generation, and long-form video understanding. It also highlights key challenges in building robust, general-purpose agentic systems.