A recent surge of research on arXiv highlights a crucial challenge for artificial intelligence: ensuring consistency and reliability, and mitigating bias, especially when AI systems are deployed in sensitive domains.
What happened
Several recent studies have raised fundamental questions about the stability and impartiality of artificial intelligence models. One study investigated the consistency of AI-generated exercise prescriptions from three prominent Large Language Models (LLMs) – GPT-4.1, Claude Sonnet 4.6, and Gemini 2.5 Flash – even under "temperature zero" conditions, which should maximize reproducibility. The results showed that despite high semantic similarity, meaningful variation persisted both between repeated generations and across models: even in this controlled setting, the output is not perfectly stable. This is particularly relevant for applications where precision is critical, such as health recommendations.
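To make the methodology concrete, here is a minimal sketch of the kind of reproducibility check the study describes: generate the same prompt repeatedly at temperature zero, then measure pairwise semantic similarity of the outputs. The `generate_prescription` stub and the embedding model name are assumptions for illustration, not the paper's actual pipeline.

```python
# Sketch of a temperature-zero reproducibility check: run the same prompt
# N times, then compare the outputs by embedding cosine similarity.
from itertools import combinations
from sentence_transformers import SentenceTransformer, util

def generate_prescription(prompt: str) -> str:
    """Hypothetical wrapper around an LLM call with temperature=0."""
    raise NotImplementedError("plug in your model client here")

def consistency_report(prompt: str, n_runs: int = 10) -> dict:
    outputs = [generate_prescription(prompt) for _ in range(n_runs)]
    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
    embeddings = model.encode(outputs, convert_to_tensor=True)
    # Pairwise cosine similarity across all repeated generations.
    sims = [float(util.cos_sim(embeddings[i], embeddings[j]))
            for i, j in combinations(range(n_runs), 2)]
    return {
        "byte_identical": len(set(outputs)) == 1,  # truly deterministic?
        "min_similarity": min(sims),
        "mean_similarity": sum(sims) / len(sims),
    }
```

If generation were truly deterministic, `byte_identical` would be true and every pairwise similarity would equal 1.0; the study's finding is that neither holds in practice, even at temperature zero.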
Another study addressed compositional biases in Multimodal Large Language Models (MLLMs) used as "automatic judges." The research revealed that these models often fail to reliably integrate key visual and textual cues, producing unreliable evaluations when evidence is missing or inconsistent, and shifting their verdicts even under semantically irrelevant perturbations. This calls into question the reliability of MLLMs in critical evaluation roles, where accuracy and robustness are paramount.
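One way to probe this kind of instability is a perturbation test: apply edits that should not change meaning and check whether the judge's score stays put. The sketch below assumes a hypothetical `judge_score` wrapper around an MLLM judge; it illustrates the general technique, not the study's exact protocol.

```python
# Perturbation-stability probe for an automatic judge: score semantically
# equivalent variants of the same answer and measure the spread.
import statistics

def judge_score(question: str, answer: str) -> float:
    """Hypothetical MLLM-as-judge call returning a score in [0, 1]."""
    raise NotImplementedError("plug in your judge model here")

def irrelevant_perturbations(answer: str) -> list[str]:
    # Edits that should not change meaning: extra whitespace,
    # trailing punctuation, a neutral preamble.
    return [answer,
            "  " + answer + "  ",
            answer + ".",
            "Answer: " + answer]

def judge_stability(question: str, answer: str) -> float:
    """Standard deviation of scores across meaning-preserving variants;
    a robust judge should show near-zero spread."""
    scores = [judge_score(question, p) for p in irrelevant_perturbations(answer)]
    return statistics.pstdev(scores)
```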
In parallel, research is focusing on how AI models can verify their own answers by checking that they were reached through valid reasoning paths. One geometric approach proposes that valid generation trajectories reside on a high-density "manifold" in the model's representation space, while invalid paths exhibit "off-manifold drift." This suggests a path toward improving the internal reliability of models, allowing them to flag and correct their own errors.
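As a rough illustration of the geometric intuition (not the paper's method), one could score each step of a reasoning trajectory by its distance to hidden states collected from known-valid traces; a rising distance curve would signal off-manifold drift. The nearest-neighbor density proxy and all names below are assumptions.

```python
# Toy off-manifold drift detector: use distance to a reference set of
# hidden states from valid reasoning traces as a density proxy.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def fit_manifold_reference(valid_states: np.ndarray) -> NearestNeighbors:
    """valid_states: (n_states, hidden_dim) array pooled from valid traces."""
    return NearestNeighbors(n_neighbors=5).fit(valid_states)

def off_manifold_drift(trajectory: np.ndarray,
                       ref: NearestNeighbors) -> np.ndarray:
    """Per-step mean distance to the 5 nearest reference states.
    trajectory: (n_steps, hidden_dim). A rising curve suggests the
    trajectory is drifting off the high-density manifold."""
    distances, _ = ref.kneighbors(trajectory)
    return distances.mean(axis=1)

# Usage: flag a trace whose drift exceeds a threshold calibrated on
# held-out valid traces.
# drift = off_manifold_drift(states, ref)
# is_suspect = drift.max() > threshold
```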
Why it matters
The inherent variability and biases in AI models have profound implications for their adoption and trustworthiness. In the context of exercise prescriptions, even small inconsistencies can compromise efficacy or safety, especially for users with specific clinical conditions. If an AI system generates divergent advice for the same scenario, its utility and reliability as a medical support tool are seriously undermined.
When MLLMs fail to correctly integrate information or show instability in their evaluations, the consequences can be significant in sectors such as content moderation, diagnostics, or performance evaluation. The reliability of an "AI judge" is crucial to ensure fairness and prevent unjust or erroneous decisions, which could directly impact people's lives. The lack of robustness and the presence of biases can erode public trust in AI and lead to discriminatory or ineffective outcomes.
These issues extend to any AI application requiring precision and consistency, from logistical planning to medical diagnosis. For example, research on lightweight transformers for pain recognition from brain activity demonstrates AI's potential in extremely sensitive clinical areas. In such contexts, accuracy and consistency are not only desirable but ethically imperative, as an error could have direct consequences on patient well-being. The models' ability to self-verify, as suggested by one of the studies, therefore becomes a fundamental requirement to ensure that AI operates safely and reliably.
The HDAI perspective
From Human Driven AI's perspective, this research reinforces the need for a human-centric approach to the development and implementation of artificial intelligence. Despite advancements, AI models are not infallible, and their "intelligence" is still far from human-level consistency and contextual understanding. It is imperative that developers and implementers actively recognize and address consistency limitations and biases. This means investing in more rigorous testing methodologies, developing transparency mechanisms to understand how models arrive at their conclusions, and, crucially, maintaining significant human oversight. AI should serve as a support tool, not an autonomous replacement, particularly in contexts requiring ethical judgment, empathy, or decisions impacting people's health and well-being. Trust in AI is built on its reliability and our ability to understand and manage its limits.
What to watch
Research into model self-verification and bias mitigation is rapidly evolving. It will be crucial to monitor how these new techniques are integrated into commercial models and how regulatory frameworks, such as the European AI Act, will address issues of consistency and reliability, especially for high-risk AI systems. The focus must remain on creating AI that is not only powerful but also predictable, fair, and ultimately, serves humanity.

