Imagine you get blood work done, and your doctor spots something unusual in your results. Would you trust AI to decide your treatment?
It’s true: AI has shown impressive medical knowledge. Google’s Med-PaLM 2, for example, scored at an expert level on U.S. medical licensing exam questions. But does that mean we’re ready for AI doctors? Not quite.
That’s because passing a test is one thing. Doing the actual job is another.
If a doctor’s only job were to pass medical exams, then an AI like Med-PaLM 2 could be a perfect replacement. But real patient care goes far beyond tests. It requires judgment, experience, and human context.
The same is true when we think about AI in government. Just because AI can pass a test doesn’t mean it’s ready to handle the messy, complicated choices that shape people’s lives.
This is especially true for people working in Health and Human Services (HHS), where decisions about food assistance, housing, healthcare, and child welfare directly affect people’s wellbeing, raising the stakes even higher.
So when, if ever, can we confidently hand those decisions over to AI?
The short answer: AI often has to do better than humans to be safe and effective in the real world. Here’s why.
What does success really mean?
To better understand why AI might struggle with real-world complexities, let’s look at an everyday example.
Imagine a government system that automates data entry from handwritten applications. On paper, AI might match human performance in recognizing handwriting samples.
But in the real world, caseworkers bring knowledge that AI lacks. For example, they know that a pay stub for “Elizabeth” might match an application for “Liz,” “Beth,” or “Betty.” They can also recognize that $1,200 is more likely to be a monthly rather than an annual income.
Without that broader understanding, AI risks falling short where it matters most—real-world use.
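To make that gap concrete, here is a minimal sketch of the kind of contextual checks a caseworker applies without thinking. The nickname table and the income threshold below are illustrative assumptions, not real agency rules, and a production system would need far more than this.

```python
# A hypothetical sketch of the context a caseworker brings "for free."
# The nickname table and the $10,000/month cutoff are made-up examples.

# Common nickname variants a human reviewer recognizes on sight.
NICKNAMES = {
    "elizabeth": {"liz", "beth", "betty", "eliza"},
    "william": {"will", "bill", "billy"},
}

def names_could_match(applicant_name: str, document_name: str) -> bool:
    """Return True if two first names plausibly refer to the same person."""
    a, b = applicant_name.strip().lower(), document_name.strip().lower()
    if a == b:
        return True
    return b in NICKNAMES.get(a, set()) or a in NICKNAMES.get(b, set())

def likely_income_period(amount: float) -> str:
    """Guess whether a reported income figure is monthly or annual.

    A human knows $1,200 is almost certainly monthly income;
    a naive extractor just records the number it sees.
    """
    return "monthly" if amount < 10_000 else "annual"

if __name__ == "__main__":
    print(names_could_match("Elizabeth", "Liz"))   # True
    print(likely_income_period(1200))              # "monthly"
```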
What’s at stake?
Even if AI makes mistakes at a comparable rate to humans, the types of mistakes it makes are different. That matters.
For example, an Air Canada chatbot hallucinated a fake bereavement fare reimbursement policy—something a human agent would almost never do. The company ended up getting sued and was ultimately held financially liable for the chatbot’s mistake.
Fairness is another issue to consider. McDonald’s ended its AI-powered drive-through trial after the system struggled with accented English. Humans can struggle with accents too, but at the scale of an AI system, a consistent failure can shut out entire groups of people, creating ethical and reputational fallout.
In human services, the costs and benefits of AI must be weighed at the macro and micro levels.
Consider AI-powered benefits determinations. Automated determinations could happen in minutes instead of months, making backlogs disappear for millions of applicants. But what if it wrongly denies assistance to a struggling single mom? That’s not just a glitch. It’s skipped meals, empty grocery carts, and an impossible choice between rent and dinner.
For this reason, we need to be nuanced in the application of AI for human services.
Mistakes that favor applicants, like approving someone who turns out to be ineligible, are usually less harmful than mistakes that deny an eligible person the help they need. At least, that’s true when we look at a single case. But imagine improper approvals being granted at scale. For underfunded agencies that simply don’t have the resources to cover everyone, the consequences could be devastating.
That’s why agencies need to weigh both the individual and system-wide risks when deciding how to use AI. One approach might be to allow automated approvals but require a human to review every denial. Another might be to decide that any error—at any level—is too risky, and keep AI out of decision-making entirely.
But we also can’t ignore the cost of inaction. When agencies are stretched thin and backlogs drag on for months, people in need are left waiting—and that delay is a decision, too. In some cases, the speed of automation may be worth the trade-off.
So, how do we make sure we’re using AI wisely?
Both agencies and vendors need to understand AI’s “jagged frontier”: the uneven boundary where it excels at some tasks and stumbles on others. That means getting specific about where AI can help, where it can hurt, and how to adapt as the technology evolves. Because AI capabilities are advancing quickly, policies need to stay flexible enough to keep up.
Can you fail gracefully?
Even the best AI systems will sometimes make mistakes. Since errors are inevitable, organizations must have strategies for catching them when possible and minimizing the impact of those that slip through.
Humans tend to do this naturally. Think about when we have trouble hearing someone on the phone. It’s common to ask people to repeat themselves, confirm what we thought we heard, or ask them to spell words. AI could, in theory, be designed with similar safeguards, but there are always practical limits. The net result is that AI may need to outperform humans in certain tasks when it lacks reliable ways to catch and correct its own mistakes.
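As a rough illustration, here is one way such a safeguard might look in code, assuming the model reports a confidence score for each field it extracts. The threshold, field names, and messages are hypothetical, not a recommended design.

```python
# A minimal sketch of "failing gracefully": when confidence in an extracted
# field is low, the system asks for confirmation instead of silently
# committing a guess. Threshold and field names are illustrative only.

from dataclasses import dataclass

@dataclass
class ExtractedField:
    name: str          # e.g., "monthly_income"
    value: str         # what the model read from the form
    confidence: float  # model-reported confidence, 0.0 to 1.0

CONFIDENCE_THRESHOLD = 0.90  # illustrative cutoff; tuned per task in practice

def handle_field(f: ExtractedField) -> str:
    """Accept high-confidence reads; route uncertain ones for confirmation."""
    if f.confidence >= CONFIDENCE_THRESHOLD:
        return f"ACCEPT {f.name} = {f.value}"
    # Analogous to asking someone on the phone to repeat or spell a word.
    return f"CONFIRM with applicant or caseworker: is {f.name} '{f.value}'?"

if __name__ == "__main__":
    print(handle_field(ExtractedField("monthly_income", "$1,200", 0.97)))
    print(handle_field(ExtractedField("last_name", "Srnith", 0.58)))
```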
Human review is a common strategy for catching AI errors. Instead of making critical decisions outright, AI tools can present evidence and recommendations to caseworkers. Appeals can also act as a safeguard, letting applicants flag suspected AI errors and request a review by a human caseworker.
The key to making processes like this effective is transparency: AI systems should cite the specific regulations and evidence behind each recommendation so that reviewers can audit decisions efficiently.
When humans are in the loop to help catch and correct errors, higher AI error rates can be tolerated.
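Putting those pieces together, here is a minimal sketch of what a reviewable decision record and routing policy might look like, assuming, as described above, that the system cites the rules and evidence behind each recommendation and that every denial goes to a caseworker. The field names, the placeholder rule citation, and the routing policy are illustrative only.

```python
# A sketch of a human-in-the-loop review policy: the AI records what it
# relied on, approvals may be automated, and denials always go to a person.
# All names and the cited rule below are placeholders, not real regulations.

from dataclasses import dataclass, field
from typing import List

@dataclass
class DecisionRecord:
    applicant_id: str
    recommendation: str                                       # "approve" or "deny"
    cited_rules: List[str] = field(default_factory=list)      # rules relied on
    cited_evidence: List[str] = field(default_factory=list)   # documents relied on

def route(decision: DecisionRecord) -> str:
    """Approvals may be automated; every denial is queued for human review."""
    if decision.recommendation == "approve":
        return "AUTO-APPROVE (logged for audit)"
    return "QUEUE FOR CASEWORKER REVIEW (citations attached)"

if __name__ == "__main__":
    denial = DecisionRecord(
        applicant_id="A-1042",
        recommendation="deny",
        cited_rules=["Program Rule 4.2 (income limit)"],       # placeholder rule
        cited_evidence=["pay stub: reported income above limit"],
    )
    print(route(denial))
```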
Key takeaways
It’s tempting to measure AI success by whether it can match human performance. However, this perspective can oversimplify the complexities of AI in real-world scenarios.
Matching human performance isn’t always enough. Often, AI needs to outperform humans to make up for its lack of context, the high stakes of getting it wrong, and the fact that it can’t always tell when it’s made a mistake.
HHS leaders can make informed decisions about using AI by:
- Critically evaluating, and clearly defining, how success will be measured for the tasks they need AI to perform
- Understanding the different types of mistakes AI can make, the risks of those failures, and their associated costs
- Designing safeguards to mitigate AI’s limitations and ensure reliable outcomes
Remember, the goal is not to replace humans but to enhance outcomes, ensuring that AI complements human strengths rather than merely imitating them.
AI holds real promise for public service—but only when it’s applied with care, context, and humility.
For HHS leaders, the challenge is not merely to decide whether to use AI, but how to use it in ways that protect people, improve outcomes, and reflect the values of public service. That means looking beyond flashy benchmarks and asking the harder questions: What does this tool really do? Who does it serve? And what happens when it gets it wrong?
Getting it right will take cross-sector collaboration, thoughtful policy, and a commitment to centering both effectiveness and humanity. But if we do it well, we won’t just be keeping up with change—we’ll be shaping it.
Because in public service, success isn’t just about speed or scale. It’s about trust. And trust isn’t built on performance alone—it’s built on judgment, transparency, and care.