
Thesis Defense by Ankita Mungalpara

Wednesday, July 30, 2025, from 11:00am to 1:00pm

Co-Advisors: Dr. Ming Shao and Dr. Jiawei Yuan

Committee Members: Dr. Ashok Patel - Department of Computer & Information Science

Abstract: Multimodal large language models (MLLMs), which can reason across different modalities, are opening new frontiers in AI by enabling complex reasoning over both visual and textual inputs. One application of MLLMs is Visual Question Answering (VQA) in healthcare; for example, a clinician can ask questions about medical images in natural language and receive detailed explanations. This research examines the efficacy of fine-tuning multimodal (vision and language) foundation models for medical visual question answering. The study evaluates how effectively these models interpret and respond to medical queries based on visual inputs, with the goal of enhancing diagnostic accuracy and patient care. By leveraging the strengths of both vision and language processing, the research seeks to advance the capabilities of AI in medical settings. The author studies three fine-tuned variants derived from the baseline LLaVA-Med architecture: a caption-only baseline trained to generate global descriptions of medical images; a model that uses instruction tuning to better answer diagnostic and region-specific questions; and a variant that incorporates Alpha-CLIP with LLaVA-Med to allow spatially targeted understanding by conditioning responses on user-defined regions of interest (ROIs). All variants are adapted using Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning (PEFT) technique that keeps computational overhead minimal. The results show that the baseline LLaVA-Med model, trained only on image-caption pairs, performs reasonably well on BLEU (0.09) and ROUGE-2 (0.14) but struggles to answer specific questions compared with the instruction-tuned variant, which achieves the best ROUGE-1 (0.59) and ROUGE-L (0.54) scores. The region-focused combination of Alpha-CLIP and LLaVA-Med achieves the best ROUGE-2 (0.40) and BLEU (0.28) scores while exhibiting accurate, context-sensitive reasoning, which underscores the importance of task-specific adaptation in medical VQA. Together, the results suggest that focusing attention on specific image regions, combined with instruction-based reasoning, is essential for medical AI systems that handle diverse types of data. This approach facilitates the comparison of methods and provides a detailed strategy for improving specialized diagnostic medical AI assistants.

All CIS students are encouraged to attend. For further questions, please contact Dr. Ming Shao at mshao@umassd.edu or Dr. Jiawei Yuan at jyuan@umassd.edu.
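For readers unfamiliar with the LoRA technique named in the abstract, the following is a minimal sketch of how LoRA adapters are typically attached to a pretrained language backbone with the Hugging Face peft library. The checkpoint path, target modules, and hyperparameters are illustrative assumptions, not the configuration used in this thesis.

# Minimal LoRA sketch using the Hugging Face peft library. The checkpoint
# path, target modules, and hyperparameters are placeholders, not the thesis
# configuration (LLaVA-Med additionally wraps a vision encoder around its
# language backbone).
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load a pretrained language backbone (placeholder path).
base_model = AutoModelForCausalLM.from_pretrained("path/to/language-backbone")

# LoRA injects small low-rank update matrices into selected weight matrices,
# so only a tiny fraction of parameters is trained during fine-tuning.
lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank updates
    lora_alpha=32,                        # scaling factor for the updates
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # reports the small trainable fraction

Fine-tuning then proceeds with an ordinary supervised training loop over image-question-answer pairs, updating only the adapter weights while the original model weights stay frozen.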

Zoom (please contact amungalpara@umassd.edu or jyuan@umassd.edu for Zoom information)
