Research at Human-First Artificial Intelligence Lab (HAL 2.0)


Reliable Multimodal AI Systems

Vision-Language Models (VLMs) have revolutionized the integration of visual perception and natural language understanding, excelling in tasks like image captioning and visual question answering, thanks to Transformer-based large language models (LLMs). Despite this remarkable progress, a critical challenge remains: ensuring the trustworthiness of these powerful models. Current VLMs often struggle with fundamental reliability issues, exhibiting "hallucinations" where they generate non-existent objects, particularly in hypothetical scenarios. They also demonstrate "linguistic sycophancy," inappropriately prioritizing user feedback over visual evidence, which can lead to inaccurate and potentially harmful outputs. These shortcomings significantly hinder the safe and ethical deployment of VLMs, especially in high-stakes domains such as healthcare and assistive technologies. At the Human-First Artificial Intelligence Lab (HAL 2.0), we are directly addressing these reliability concerns by rigorously investigating VLMs’ visual causal reasoning capabilities—probing their ability to infer cause-effect relationships, counterfactuals, and commonsense implications—and developing innovative methods to mitigate both hypothetical object hallucination and linguistic sycophancy, as visually illustrated in Figure 1, Figure 2, and Figure 3.

Factual and Causal Reasoning in VLMs Hypothetical Object Hallucination in VLMs Linguistic Sycophancy in VLMs

We are pioneering novel techniques to enhance the reasoning capabilities and overall trustworthiness of VLMs. Our current work introduces a training-free meta-thinking approach, inspired by human System 2 reasoning as described by Daniel Kahneman, which utilizes carefully crafted prompts to guide VLMs toward more structured and context-aware decision-making. This method effectively mitigates both hypothetical object hallucination and linguistic sycophancy by encouraging deeper reasoning. Building upon this foundation, we are developing a fine-tuning method designed to directly address object hallucination. This approach focuses on grounding textual outputs in verifiable visual evidence, dynamically penalizing hallucinated content during training, and incorporating reasoning supervision to ensure the model's outputs are consistently aligned with the correct visual context. This comprehensive strategy aims to significantly improve the reasoning abilities of VLMs in complex multimodal scenarios.

Broader Impacts

By significantly enhancing the reliability and trustworthiness of VLMs, this research will have a profound and multifaceted impact. Ensuring that these models deliver accurate, reliable, and contextually appropriate outputs is paramount for their safe and ethical deployment across diverse applications. This includes enhancing the accuracy and dependability of assistive technologies for the visually impaired, improving the precision of diagnostic tools in healthcare, and ensuring the integrity of information in educational settings. Furthermore, by tackling the fundamental challenges of hypothetical object hallucination and linguistic sycophancy, our work directly contributes to the ongoing efforts in AI alignment. Improving the factual accuracy and reducing undesirable behaviors in VLMs ensures these systems are more aligned with human intentions and values, thus fostering the development of more ethical AI. This research sets new benchmarks for building dependable multimodal AI systems that can be confidently and responsibly integrated into critical aspects of society, ultimately leading to more beneficial and trustworthy AI technologies for all.

Publications

Student advisees are marked with an asterisk (*).

Funding
  • UNL College of Engineering
  • Undergraduate Creative Activities and Research Experience (UCARE) fellowship, UNL