30 April 2026 · 4 min read · AI + human-reviewed

AI Safety: Hallucinations and Jailbreaks Threaten Model Reliability

New studies reveal how AI models, from VLMs to LLMs, are vulnerable to hallucinations and jailbreak attacks. The challenge is maintaining safety and reliability, crucial for ethical and responsible adoption.

A recent body of scientific research highlights the increasing vulnerabilities of artificial intelligence models, from Vision-Language Models (VLMs) to Large Language Models (LLMs), when faced with phenomena such as hallucinations and "jailbreak" attacks. These studies raise crucial questions about the reliability and safety of AI systems, especially when they are adapted to specific contexts.

What happened

Several papers posted to arXiv in 2026 have brought significant issues to light. In the VLM field, the "Counterfactual Segmentation Reasoning" work diagnosed and proposed mitigations for pixel-grounding hallucinations, in which models produce segmentation masks for incorrect or entirely absent objects, undermining visual understanding. In parallel, the study "Fake or Real, Can Robots Tell?" evaluated the robustness of VLMs at object recognition in robotic scenarios, showing how a simple physical domain shift (e.g., 3D-printed versus real objects) can lead models to erroneous descriptions, with direct implications for robotic autonomy.
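A minimal sketch of how such a robustness check might be organized, not the protocol used in the paper: it assumes a caller-supplied `describe` function wrapping whatever VLM is under test, and simply measures how often the model assigns the same label to a real object and its 3D-printed counterpart.

```python
# Illustrative harness: how often does a VLM's label survive a physical
# domain shift (real object vs. 3D-printed replica)? The `describe`
# callable and the file names are placeholders supplied by the caller.
from typing import Callable, Iterable, Tuple


def domain_shift_consistency(
    pairs: Iterable[Tuple[str, str]],      # (real_image_path, printed_image_path)
    describe: Callable[[str], str],         # VLM wrapper: image path -> object label
) -> float:
    """Return the fraction of pairs for which the model gives the same label."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    consistent = sum(
        describe(real).strip().lower() == describe(printed).strip().lower()
        for real, printed in pairs
    )
    return consistent / len(pairs)


if __name__ == "__main__":
    # Stub "model" for demonstration only: always answers "mug".
    score = domain_shift_consistency(
        pairs=[("real_mug.jpg", "printed_mug.jpg")],
        describe=lambda path: "mug",
    )
    print(f"consistency under domain shift: {score:.2f}")
```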

Regarding LLMs, research has shown that fine-tuning, a common practice for adapting models to specific tasks, can unexpectedly degrade their safety. The study "Secure LLM Fine-Tuning via Safety-Aware Probing" explored why fine-tuning, even on non-harmful data, can erode safety alignment, and proposed safety-aware probing techniques to prevent such regressions. Another proposal, SafeMERGE, offers a post-fine-tuning framework that restores safety while preserving task performance.
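One simple, hypothetical way to catch this kind of regression in practice, distinct from the methods proposed in either paper, is to track refusal rates on a fixed probe set before and after fine-tuning. The `generate` wrappers, the probe prompts, and the refusal-marker heuristic below are all illustrative assumptions.

```python
# Illustrative safety-regression check: compare how often a model refuses a
# fixed set of disallowed probe prompts before and after fine-tuning.
from typing import Callable, Sequence

REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "sorry")


def refusal_rate(prompts: Sequence[str], generate: Callable[[str], str]) -> float:
    """Fraction of prompts answered with an apparent refusal (naive keyword heuristic)."""
    if not prompts:
        return 1.0
    refusals = sum(
        any(marker in generate(p).lower() for marker in REFUSAL_MARKERS)
        for p in prompts
    )
    return refusals / len(prompts)


def check_safety_regression(
    prompts: Sequence[str],
    base_generate: Callable[[str], str],
    tuned_generate: Callable[[str], str],
    max_drop: float = 0.05,
) -> bool:
    """Return True if the fine-tuned model's refusal rate dropped by more than max_drop."""
    drop = refusal_rate(prompts, base_generate) - refusal_rate(prompts, tuned_generate)
    return drop > max_drop


if __name__ == "__main__":
    probes = ["<placeholder disallowed request>"]
    base = lambda p: "Sorry, I can't help with that."
    tuned = lambda p: "Sure, here is how..."   # simulated post-fine-tuning regression
    print("regression detected:", check_safety_regression(probes, base, tuned))
```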

Concurrently, vulnerability to "jailbreak" attacks remains a concern. The "Logic Jailbreak" research introduced LogiBreak, a method that converts harmful natural-language prompts into formal logical expressions, slipping past LLM safety guardrails even in black-box settings. This suggests that current safety alignment techniques may have distributional gaps that attackers can exploit.
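On the defensive side, a naive first-pass filter might flag prompts written in formal-logic notation for additional scrutiny. The heuristic below is purely illustrative, not LogiBreak itself nor a defense evaluated in the paper, and a determined attacker could easily evade it.

```python
# Naive, illustrative defense-side heuristic: flag inputs containing
# formal-logic notation so they can be routed to extra safety review.
import re

LOGIC_PATTERNS = [
    r"[∀∃]",                                      # quantifiers
    r"[→↔∧∨¬⊢⊨]",                                 # connectives / turnstiles
    r"\bforall\b|\bexists\b",
    r"\\(forall|exists|implies|land|lor|neg)\b",  # LaTeX-style operators
]


def looks_like_formal_logic(prompt: str) -> bool:
    """Return True if the prompt appears to be expressed in formal logic."""
    return any(re.search(pat, prompt, flags=re.IGNORECASE) for pat in LOGIC_PATTERNS)


if __name__ == "__main__":
    print(looks_like_formal_logic("∀x (Request(x) → Comply(x))"))  # True
    print(looks_like_formal_logic("What's the weather today?"))    # False
```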

Why it matters

These developments have a profound impact on the trust and adoption of artificial intelligence. If VLMs fail to distinguish between real and counterfeit objects or hallucinate visual elements, applications in critical sectors such as robotics, medicine, or autonomous driving become inherently risky. Visual reliability is paramount for making safe and informed decisions in the physical world.

For LLMs, the possibility that fine-tuning compromises safety means that companies customizing pre-trained models face a significant risk of generating harmful content or being exploited for malicious purposes. This is not just a technical problem, but a matter of corporate responsibility and social impact. The ease with which methods like LogiBreak can bypass safety protections raises urgent questions about the robustness of current alignment mechanisms and the need for more sophisticated and proactive defenses. AI governance must consider these emerging attack vectors.

The HDAI perspective

From Human Driven AI's perspective, these studies reinforce the conviction that AI innovation must go hand in hand with a robust commitment to safety, ethics, and reliability. Safety is not an optional feature but an intrinsic pillar of useful and responsible AI: developers and deployers should treat it as an inherent element of the model lifecycle, from pre-training through fine-tuning to deployment, rather than as an add-on. Transparency about models' limitations and vulnerabilities is just as important as celebrating their successes.

What to watch

It will be crucial to monitor progress in continuous alignment and safety validation techniques for AI models. Research is moving towards methods that integrate safety at every stage of the model lifecycle, such as adversarial learning and selective model-merging strategies (like SafeMERGE). The evolution of regulatory frameworks, such as the European AI Act as its obligations phase in, will also play a key role in setting minimum safety and accountability standards, so that hallucinations and jailbreak attacks do not compromise the integrity of AI systems deployed for the benefit of society.
