Robustness Enhancement in Multimodal Learning Systems for Reliable Perception and Generation
EAS CSIS Doctoral Defense by Md Iqbal Hossain
Advisor: Dr. Long Jiao, UMass Dartmouth
Committee Members:
- Dr. Gokhan Kul, UMass Dartmouth
- Dr. Mohammad Karim, UMass Dartmouth
- Dr. Ming Shao, UMass Lowell
Zoom: https://umassd.zoom.us/j/4061999032?pwd=bUw0WGpDbTQ4UzJneFd5TTBFeUw1dz09
Meeting ID: 406 199 9032
Passcode: 600381
Abstract:
Multimodal artificial intelligence systems that integrate vision, language, and heterogeneous sensing data are increasingly deployed in real-world and safety-critical applications, including generative AI and autonomous vehicle perception. Despite their strong empirical performance, these systems remain highly vulnerable to adversarial attacks, data poisoning, and semantic misalignment across modalities, which can lead to unreliable, misleading, or unsafe outputs. This dissertation focuses on developing principled methods to enhance robustness, reliability, and semantic coherence in multimodal AI systems, spanning contrastive representation learning, generative modeling, and real-world autonomous sensing.
The first component of this research investigates robustness in vision–language contrastive learning through EftCLIP, a framework designed to analyze and mitigate fine-grained adversarial and backdoor vulnerabilities in CLIP-based models. By operating at the embedding level, EftCLIP improves resistance to poisoned data while preserving semantic alignment between visual and textual representations, addressing a critical weakness in widely used multimodal foundation models.
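The announcement does not spell out EftCLIP's internals, but as a rough illustration of what an embedding-level defense can look like, the hedged sketch below flags image–text pairs whose cross-modal similarity is anomalously low within a training batch, a common heuristic for suspected poisoned pairs. This is a generic sketch under stated assumptions, not EftCLIP itself; the function name and threshold are hypothetical.

```python
# Hypothetical sketch of an embedding-level poisoned-pair filter for
# CLIP-style contrastive training. This is NOT EftCLIP's method; the
# z-score filter on image-text cosine similarity is a generic heuristic,
# and all names and thresholds here are illustrative assumptions.
import torch
import torch.nn.functional as F

def filter_suspect_pairs(image_emb: torch.Tensor,
                         text_emb: torch.Tensor,
                         z_threshold: float = -2.0) -> torch.Tensor:
    """Return a boolean mask keeping pairs whose image-text cosine
    similarity is not an extreme low outlier relative to the batch."""
    img = F.normalize(image_emb, dim=-1)
    txt = F.normalize(text_emb, dim=-1)
    sims = (img * txt).sum(dim=-1)                # per-pair cosine similarity
    z = (sims - sims.mean()) / (sims.std() + 1e-8)
    return z > z_threshold                        # drop strong negative outliers

# Usage: keep = filter_suspect_pairs(img_e, txt_e);
# compute the contrastive loss only on img_e[keep], txt_e[keep].
```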
The second component addresses robustness in Retrieval-Augmented Generation (RAG)–based text-to-image diffusion models, where generation is guided by retrieved visual exemplars. While retrieval grounding improves image fidelity, multimodal retrieval pipelines are highly susceptible to poisoning attacks, often causing semantic incoherence between text prompts and retrieved images. This dissertation identifies semantic incoherence as a fundamental failure mode and proposes a score-based semantic coherence module that evaluates prompt–image consistency, corrects misaligned prompt components, and re-retrieves coherent exemplars prior to diffusion. This multimodal feedback loop prevents poisoned retrievals from influencing generation and substantially improves alignment and robustness.
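To make the feedback loop concrete, the sketch below shows one plausible instantiation: a prompt–image coherence scorer (e.g., a CLIP-style similarity) gates each retrieved exemplar, and low-scoring retrievals trigger a re-query before diffusion conditioning. The retriever and scorer interfaces, threshold, and retry budget are assumptions for illustration, not the dissertation's actual module.

```python
# Hypothetical sketch of a score-based coherence gate for RAG-guided
# text-to-image diffusion. The `retrieve` and `score` callables, the
# threshold tau, and the retry budget are illustrative assumptions.
from typing import Callable, List

def coherent_exemplars(prompt: str,
                       retrieve: Callable[[str, int], List[str]],
                       score: Callable[[str, str], float],
                       k: int = 4,
                       tau: float = 0.25,
                       max_rounds: int = 3) -> List[str]:
    """Retrieve exemplars and keep only those whose prompt-image
    coherence score exceeds tau; re-retrieve until k survive or
    the retry budget is exhausted."""
    kept: List[str] = []
    for _ in range(max_rounds):
        for img in retrieve(prompt, k):
            if score(prompt, img) >= tau and img not in kept:
                kept.append(img)                  # accept coherent exemplar
        if len(kept) >= k:
            break                                 # enough clean guidance
    return kept[:k]  # pass these to the diffusion model's conditioning
```

The design point this illustrates is that poisoned retrievals are rejected before they can influence generation, rather than being corrected after the image is synthesized.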
In future work, this dissertation will be extended to multimodal robustness in autonomous vehicle perception systems, leveraging complementary sensing modalities including RGB images, radar, mmWave, and wireless Channel State Information (CSI). By studying cross-modal alignment, redundancy, and failure detection across heterogeneous sensors, this work aims to improve perception reliability under adverse conditions such as occlusion, sensor noise, environmental variability, and adversarial interference.
In summary, this dissertation develops principled methods that strengthen robustness, reliability, and semantic coherence in multimodal AI, unifying contributions across contrastive representation learning, generative modeling, and real-world autonomous sensing.
For further information, please contact Dr. Long Jiao at ljiao@umassd.edu.
Location: Dion 311