Thu, 19 Mar 2026

Hi friend,

How will we know when AI is safe? Or when it is clinically effective? How can we compare one AI chatbot against another?

Evals (evaluations) have emerged as one of the most common ways to answer these questions. These structured tests measure how an AI model behaves in specific scenarios by simulating how people use these products for their mental health.

The field is confusing. There are now more than sixty¹ evals in the mental health space alone. There’s no shared standard, some evals are public while others are private and only seen by the companies who build them. Many, understandably, don’t know how they work. Everyone has different opinions on which evals are good, and some believe evals - at least in their current state - are not very useful at all.

Kevin Hou and I set out to understand this space. We’ve been gathering data and speaking to experts on AI in mental health. In this report, we give a primer on evals, discuss their current state, share their limitations and present what the frontier of AI testing looks like in 2026. In the appendix we also share a link to a rapid literature review² of recent research that uses evals to test AI performance in mental health.

Whether you know nothing about evals or are deep in the weeds of AI testing, I’m confident there’s something interesting in this for you.

Let’s get into it!

NB: This is the second article in the Hemingway series on AI safety in mental health. Our first article on what we’re getting wrong about AI safety in mental health is available to read here.

What are evals?

Evals, however, are structured, automated tests that measure how an AI model behaves in specific scenarios.

There are many other ways to test how an AI model performs in mental health scenarios. Red-teaming uses humans, often clinicians or researchers to try and break a model and find edge cases. Clinical expert review is another human based test where clinicians read transcripts from an AI and then rate the responses. Real world outcome tracking measures how people use the products in the real world (duh) and how the model performs. Clinical trials can be used too.

Compared to these other forms of testing, evals are much quicker and much cheaper.

How do evals work?

Evals are like driving tests. They aim to simulate how something will perform in the real world by providing a defined set of scenarios - parallel parking, emergency stops - that are standardised, repeatable, and scalable. They assess performance in these scenarios and provide a score for how the driver (model) performed. We can use that score to determine if a driver is safe enough to be allowed on the road. The scores also give us feedback on where drivers can improve.

Every eval has three components.

Dimensions
The dimensions define what's actually being measured. Does the model recognise crisis risk? Does it escalate appropriately? Does it avoid harmful language? Some newer evals go further. EmoAgent³ , uses the PHQ-9 depression scale to track how a simulated user's mental state changes across a conversation - measuring psychological impact, not just whether the model said the right words. There’s a very wide range of dimensions against which we can test a model.

Inputs
These define the scenarios being tested - who is the simulated user, what are they saying, how distressed are they, and how many turns does the conversation run? Again, there is a huge range of potential inputs.

Scoring
Scoring is how we turn a model's response into something measurable. Every eval has a rubric - a set of criteria that define what a good or bad response looks like. For a crisis scenario, that might include: did the model recognise the risk? Did it respond with empathy? Did it ask clarifying questions? Did it provide appropriate resources? Did it avoid stigmatising language? Each criterion gets a score, and those scores get aggregated into an overall result. The rubric is built by humans - usually clinicians and researchers - based on clinical best practice. VERA-MH, for example, scores suicide risk conversations across five dimensions: risk detection, risk probing, appropriate action, validation and collaboration, and safe boundaries. Each is rated on a scale from best practice to actively damaging.
Once you have a rubric, you need something to apply it. One option is to use human reviewers; clinicians read each conversation and score them manually. But the more common approach is to use LLMs-as-a-judge. In this technique, a second AI reads the conversation and scores it against the rubric. They are fast, cheap, and scalable and most evals use this scoring approach.

The Eval Frontier

Now that we understand the basics of how evals work, let’s discuss some of their challenges and what’s happening at the frontier of this important space.

Single Turn, Multi Turn, Multi Session

Early evals mostly tested single exchanges - a user sends a message, the model responds and the eval tests if the model’s response was appropriate. Single-turn evals are deemed increasingly inappropriate for evaluating risk - because they are a poor simulation of real world use - but are often still used.

More recent evals use multi-turn simulations (several messages), which is closer to how these products actually get used. SIM-VAIL, for example, ran 810 multi-turn transcripts across different psychiatric phenotypes.

The next step, is to have evals that are multi-session - which represents how products are usually used by real people. These would test memory, personalisation and context over a prolonged time period where a user comes back to the product multiple times across different conversations. For example, can a model identify the risk of a user that mentioned they lost their job in a previous session who is now asking about finding high places? So far, there are no publicly available multi-session evals that we are aware of.

The Probability Problem

LLMs are probabilistic. Unlike traditional software, where the same input always produces the same output, an LLM can respond differently to identical prompts across different runs. This means that a model that passes your safety eval today might fail it tomorrow without anything having changed. That’s a challenge.

Of course, in the fast moving world of AI, things do change. Temperature settings, subtle prompt variations, and model updates can all change model behaviour without any obvious signal.

The outcome is that when we ask, “Is this AI model safe?”, our answers are always grounded in probability. Increasing the scale of an eval (more turns, more scenarios) can increase the certainty of the answer, but it will always be a probabilistic answer. When someone passes a driving test, we can say that they are probably going to drive safely in the real world, but we can’t guarantee they won’t decide to embrace their inner Paul Walker and go drag racing.

The Ground Truth Problem

LLM-as-a-judge is the main method of scoring models but it has a few limitations.

First, to create an LLM-as-a-judge, we need to base it on some sort of ground truth. This truth should be both reliable (you get the same result for the same input) and valid (it measures what it says it intends to measure). Clinician reviews are what is used most often as a source of ground truth. If an LLM-as-a-judge gives the same score as a clinician reviewer, then we deem it to be a good judge. But clinician reviews may not be perfect - often, clinicians don’t agree with each other on how a response should be scored (low inter-rater reliability). VERA-MH has done a lot of work to develop high inter-rater reliability within the clinician raters and then to align the LLM-as-a-judge with those clinicians.

The second limitation is that LLM judges can have their own biases - they may favour longer responses or may be sensitive to the specific positioning of words.

Finally, because LLM Judges are also stochastic models, they can sometimes produce different scores for the same message (low reliability). This is a manifestation of the probability problem. It exists on both sides of the evaluation - the subject (the AI chatbot) is probabilistic, but so is the judge. That has obvious challenges.

To get past these limitations, some teams, like Circuit Breaker Labs, use ensemble methods (a combination of other machine learning techniques) to score models. Using this, they claim to be able to produce more consistent, repeatable scoring where the same output always generates the same score.

Some argue that building eval scoring based on expert opinion is actually the wrong approach entirely. They say that evals should be built from realised outcomes - what actually happened to real users - and only then verified by expert opinion. The thinking here is that we don’t yet have any experts on how AI should act in these situations, just how humans should act, so applying that logic to an AI is not a good assessment. They want to run the AI in real-world scenarios, see what happened to users by measuring their actual outcomes, then assess what the model said and how that relates to outcomes.

The User Simulation Problem

Evals are only useful if the simulation reflects real user behaviour. This is an overlooked component of many evals. Some evals use specific, pre-determined messages for the simulated “clients”. But they may not reflect how people actually talk to an AI, especially over prolonged periods.

Others use LLMs to simulate the clients in the evals by defining client personas and having the models generate the messages. But that is not a perfect simulation either.

VERA-MH recognises the importance of this and calls it out in their own study⁴ . In this study, clinicians were asked to evaluate whether the simulated clients reflected real cases. The clinicians perceived the the simulated clients to be “mostly realistic” in their overall presentation [median score = 4; range: 1-5] and “somewhat realistic” in their communication style (median score = 3; range: 1-5). Ideally, these scores would be higher.

MindEval⁵ from Sword takes an interesting approach to this problem. They generate client profiles by sampling attributes from a large pool of demographic and clinical characteristics and then use an LLM to write a clinical backstory from those attributes. A separate LLM then uses that profile to simulate the client in the conversation, generating messages in character. To test how realistic this simulation is, the researchers hired ten psychologists to role-play the same patient profiles themselves, then compared their messages to the LLM-generated ones using text similarity analysis. The LLM-simulated patients produced text closer to human-written text than simpler prompt configurations. But there are still limitations to this approach: expert reviewers noted the simulated patients were too cooperative, sharing personal information openly and accepting the AI's suggestions too readily. Real patients are often resistant, avoidant, and ambivalent and those real-world behaviours should be represented in a good eval.

Accurately simulating users - especially over multi-turn and multi-session use - is hard.

Eval Hacking

Evals can be hacked. This is a known problem in broader AI circles.

One way this happens is through benchmark contamination. This is when a model's training data includes the questions or scenarios of the eval that is testing it. When this happens, the model will perform better on that eval, but it may not generalise to other scenarios. This is rarely malicious - the benchmark data just happens to in the training data and the creators may not even be aware of this.

Another issue is “hill-climbing”, This is when models are iteratively optimised against specific benchmarks. This means a company can fine-tune a model to score well on a known eval without the model actually behaving better in the real world. For example, in 2024 one study⁶ demonstrated this by creating a new set of maths problems identical in difficulty to a well-known eval. Several models scored significantly worse on the new set of problems compared to how they performed on the eval, revealing that their high scores reflected an isolated ability to perform on the original test, not genuine problem-solving ability.

Optimising a model to perform well on a test can be a great way to improve it. There’s nothing inherently wrong with that and it’s actually a pretty good thing to do. But the performance on the eval must generalise to real world performance. In mental health, fine-tuning for specific evals can introduce unexpected trade-offs, including increased over-refusal of benign requests following safety alignment⁷ .

Both of these are examples of evals being “hacked” at the product level. But they can also be hacked at the reporting level.

Model creators choose which evals to run and which results to publish. There is an incentive to find the eval where your model performs best and to share those results. Bad actors will take this opportunity. The reverse works too - if you want to make a competitor look bad, you run them through evals until you find one where they score poorly and publish those results.

The Harm We Cannot See

Most discourse on AI safety has focused on crisis risk. Evals have followed suit - focusing on suicidal ideation, self-harm, psychosis - the more visible risks in this space. As we discussed in our recent article on AI safety⁸ , while managing crisis risk is important, safety is a much wider concept and there are many harms we cannot see. These include risks associated with para-social relationships, emotional dependency, cognitive substitution, the erosion of human connection. One study⁹ found high levels of emotional manipulation among several conversational AI apps. There are also significant health equity risks. Fewer evals exist for this range of potential harm.

A lot of what we assess today is based on subjective opinions on where the risk lies (including opinions from the authors of this article). An important step to better evals would be to get a better empirical understanding of where the risk actually lies.

There are data-driven ways to do this, like using unsupervised data reduction techniques on real world data. This concept is similar to the idea proposed above regarding building evals based on realised outcomes and would use real world outcomes to identify where the real risks lie. This is a very good idea.

Evals alone are not enough

Evals provide a scalable, repeatable way to test models, but some are better than others. They also clearly have limitations. Combining evals with red-teaming, real world data and clinical trials will be needed to show these products are both safe and effective. As Matthew Nour, Principal Scientist at Microsoft pointed out to us, “The current state of the art is combining expert human red-teaming with automated adversarial evaluations that can operate at a scale humans simply can't."

Making evals easier

While running evals is easier than human reviewers or red-teaming, they still require technical infrastructure, clinical expertise, and time. We need to make it easier for everyone to run evals. The easier they are to run, the more companies will run them. This increases the chance they spot risks and gives them more insights to improve their products. The decision by Spring to make VERA-MH open-source is a meaningful step here. Ideally, we want everyone building AI to be regularly running high quality evals and making changes to their products based on the findings. Making them easy to run is a critical part of that.

Evals as a standard

Several researchers and businesses have built their own evals. Sometimes they’ll test their products against other publicly available evals. But no eval has become a standard for the industry. This is because we are still in the early chapters of this technology and no organisation has the scale or authority to create this standard. There are also incentives to be the one who sets the standard and these with compete with any desire to align behind a competitor’s evals.

So far, VERA-MH seems to represent the most serious attempt at a shared standard for crisis safety. It's open-source, clinically validated, and I’ve heard very positive feedback on the evals themselves and their openness to feedback and development. The field should strongly consider how it could convene around a shared set of standards and collectively contribute to making them better. This would be highly supportive of building better relations with regulators of this space.

Evals as a competitive advantage

Evals are used as a feedback tool that allows companies to iterate on their product. They can use them to test whether a specific change to their model or safeguards made things better or worse. If they build better evals, they can get better feedback and build better products. This makes good evals a competitive advantage and as a result, some companies don’t share their evals publicly.

These two kinds of evals can coexist. Some evals will become public standards - the driving test equivalent, a baseline every product should pass and can be compared against. Others will remain proprietary, the internal systems that allow companies to build more competitive products.

Moving from Evaluating Risk to Evaluating Performance

Right now, almost all evals are focused on safety. That's the right starting point. But ultimately we’ll need to test if they can actually produce outcomes. John Torous has proposed a three-stage framework¹⁰ for this progression; Safety validation first, then clinical framework validation (does the model apply evidence-based approaches correctly), then real-world efficacy - does it actually help people get better.

Most products are still at stage one of this process, but recent announcements (e.g., from Limbic¹¹ and Flourish Science¹² ) show that some are moving into latter stages.

Safety is critical, but our goal should not be just to deliver safe AI. It should be to deliver safe AI that meaningfully improves mental health outcomes. Doing so is a design challenge and while we can’t guarantee we’ll be able to do it, it is certainly possible.

In 1931, the philosopher Alfred Korzybski wrote, "A map is not the territory it represents, but, if correct, it has a similar structure to the territory, which accounts for its usefulness."

Evals are useful. But like the AI models, they will need to be designed thoughtfully and responsibly. And the more similar their structure to the territory of real world AI usage, the more useful they will be.

That’s all for this now. If you’d like to discuss frontier topics like this with other mental health leaders, consider joining The Hemingway Community.

Keep fighting the good fight!

Steve

Founder of Hemingway

Many thanks to Xuan Zhao, Max Rollwage, Matthew Nour, Derrick Hull, Shirali Nigam, Kevin Ramotar, Matt Scult, Val Hoffman and Sarah Kunkle for their insights on this topic.

Notes:
(1) https://pubmed.ncbi.nlm.nih.gov/41360938/

(2) As part of this work we evaluated 10 studies, from April 2025 untill March 2026, that use different Evals for AI in mental health. Our evaluation includes assessments of the inputs, dimensions, “ground truth” used and the key results from each study. You can access the full spreadsheet here.

(3) https://arxiv.org/abs/2504.09689

(4) https://arxiv.org/abs/2602.05088

(5) https://arxiv.org/abs/2511.18491

(6) https://arxiv.org/abs/2405.00332

(7) https://arxiv.org/abs/2405.00332

(8) https://thehemingwayreport.beehiiv.com/p/84-all-the-harm-we-cannot-see

(9) https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5390377

(10) https://pmc.ncbi.nlm.nih.gov/articles/PMC12434366/

(11) https://www.nature.com/articles/s41591-026-04278-w

(12) https://www.hbs.edu/faculty/Pages/download.aspx?name=26-030.pdf