Why Your ML Model Can Be 95% Accurate and Still Be Useless

If you've spent any time building or studying machine learning models, you've probably celebrated a high accuracy score at some point.

95% accurate. Sounds great. Ship it.

But here's the uncomfortable truth that took me a while to fully internalize:

Accuracy is one of the most misleading metrics in machine learning. A model can score 95% accuracy and be completely, dangerously useless in production. Not because something went wrong in your code. Not because your data was dirty. But because accuracy alone doesn't tell you what kind of mistakes your model is making.

And in AI systems, the type of mistake matters more than the number of mistakes.

This is where sensitivity and specificity come in, two metrics that most beginners skip over, most tutorials gloss over, and most deployed models desperately needed someone to pay attention to.

Let's fix that.

First, Let's Talk About Why Accuracy Lies

Imagine you're building a model to detect fraudulent transactions for a fintech company. You train it on a dataset of 10,000 transactions. 9,500 of them are legitimate. 500 are fraudulent.

Now imagine your model is lazy. It learns one rule: predict every transaction as legitimate.

What's its accuracy? 9,500 ÷ 10,000 = 95%.

Impressive number. Zero utility. Your model has never caught a single fraudulent transaction in its life and it's sitting at 95% accuracy looking smug about it.

This is called the accuracy paradox, and it shows up everywhere in imbalanced datasets, which is most real-world data.

So how do you actually measure whether your model is doing its job? You need to understand what's happening inside the confusion matrix, specifically the four outcomes every prediction falls into.

The Four Outcomes You Need to Know

Before we define sensitivity and specificity, you need to understand the building blocks:

True Positive (TP): Your model predicted positive and it was actually positive. Correct call.

True Negative (TN): Your model predicted negative and it was actually negative. Correct call.

False Positive (FP): Your model predicted positive but it was actually negative. Wrong call, you raised an alarm that didn't need raising.

False Negative (FN): Your model predicted negative but it was actually positive. Wrong call, you missed something real.

These four outcomes are the foundation of everything that follows. Keep them in mind.

What Is Sensitivity?

Sensitivity (aka Recall) answers this question: Out of all the actual positive cases in your data, how many did your model correctly identify?

The formula is:

Sensitivity = True Positives ÷ (True Positives + False Negatives)

In plain language, sensitivity measures how good your model is at catching the things that matter. A high sensitivity score means your model is rarely missing real positive cases. A low sensitivity score means real positives are slipping through undetected as false negatives.

Real example — Medical Diagnosis:

You're building a model to screen patients for a serious illness. The dataset has 1,000 patients. 100 actually have the illness.

Your model correctly identifies 90 of the 100 sick patients but misses 10 of them, labeling them as healthy.

Sensitivity = 90 ÷ (90 + 10) = 90%

Is that good? In most contexts, yes. But in medical screening, missing 10 sick patients is a serious problem. Those 10 people leave the clinic believing they're healthy. The cost of a false negative here is potentially someone's life.

This is why medical screening models are built to prioritize sensitivity above almost everything else. You would rather send 50 healthy people for further testing (inconvenient but survivable) than miss one person who actually has the illness.

The rule of thumb: When the cost of missing a real positive is catastrophic, you optimize for high sensitivity.

What Is Specificity?

Specificity answers a different question: Out of all the actual negative cases in your data, how many did your model correctly identify as negative?

The formula is:

Specificity = True Negatives ÷ (True Negatives + False Positives)

In plain language, specificity measures how good your model is at not raising false alarms. A high specificity score means your model rarely flags something as positive when it isn't. A low specificity score means you're generating a lot of false positives - crying wolf when there's nothing there.

Real example — Email Spam Filter:

You're building a spam filter. Your dataset has 1,000 emails. 900 are legitimate, 100 are spam.

Your model correctly identifies 870 of the 900 legitimate emails as legitimate, but incorrectly flags 30 real emails as spam.

Specificity = 870 ÷ (870 + 30) = 87%

Here the cost of a false positive is that a real email lands in someone's spam folder. Annoying. But nobody's life is at risk. Meanwhile the cost of a false negative (spam getting through) is manageable. So specificity matters more here than sensitivity. You don't want your filter sending legitimate client emails to the junk folder.

The rule of thumb: When the cost of a false alarm is the bigger problem, you optimize for high specificity.

The Tradeoff Nobody Warns You About

Here's where it gets technically interesting and where most beginners hit a wall without understanding why.

Sensitivity and specificity exist in tension with each other. When you increase one, you almost always decrease the other. This isn't a bug in your model. It's a fundamental mathematical reality of classification.

Here's why it happens:

Every classification model produces a probability score, a number between 0 and 1 representing how confident it is that a data point is positive. You then set a threshold (say 0.5) and anything above that threshold gets classified as positive, anything below gets classified as negative.

When you lower the threshold, your model becomes more aggressive at flagging positives. It catches more true positives, sensitivity goes up. But it also catches more false positives, specificity goes down.

When you raise the threshold, your model becomes more conservative. It only flags something as positive when it's very confident. False positives drop, specificity goes up. But it also starts missing real positives, sensitivity goes down.

This is called the sensitivity-specificity tradeoff and understanding it is what separates someone who trains models from someone who deploys intelligent systems responsibly.

The threshold you choose isn't a technical default. It's a business decision.

How This Shows Up in Real AI Systems

This is where I want to connect the theory to something practical, because these metrics don't just live in academic ML pipelines. They live inside every intelligent system you build.

Take an HR ATS (Applicant Tracking System) agent, an AI system that screens CVs and decides which candidates move forward in a hiring pipeline.

Your model is classifying candidates as either a good fit (positive) or not a good fit (negative).

Now ask yourself the two critical questions:

What's the cost of a false negative? You miss a genuinely qualified candidate. They get filtered out before a human ever sees their CV. You lose talent your company needed.

What's the cost of a false positive? An underqualified candidate moves forward. A recruiter spends time reviewing them. Slightly inefficient but recoverable.

In this case, you want higher sensitivity. The cost of missing real talent is greater than the cost of a few extra CVs in the review pile. So you'd lower your classification threshold, accepting that more candidates pass through to ensure you're not filtering out strong applicants.

But now consider a fraud detection agent embedded in a payment system. Every transaction flagged as fraudulent gets blocked and triggers a review.

What's the cost of a false positive? A legitimate customer gets their transaction blocked. Frustrating experience. Potential churn.

What's the cost of a false negative? A real fraudulent transaction goes through. Financial loss. Potential regulatory consequences.

Here the business context determines everything. For a high-volume consumer platform, you might prioritize specificity to protect user experience. For a high-value B2B payment system, you might prioritize sensitivity to ensure no fraud slips through.

Same metrics. Completely different threshold decisions. Both justified by context.

This is the thinking that turns a model trainer into an AI engineer.

Before You Deploy Anything, Ask These Two Questions

The practical takeaway from everything above is simple.

Before you deploy any model, any agent, any intelligent system, ask yourself:

What is the cost of a false positive in this context?

What is the cost of a false negative in this context?

The answers will tell you exactly which metric to optimize for, what threshold to set, and what success actually looks like for your specific use case.

Accuracy will not tell you any of this. Accuracy will just tell you a number and let you feel good about it while your model quietly fails the people it was built to serve.

Sensitivity and specificity tell you the truth.

Wrapping Up

These two metrics fundamentally changed how I think about model evaluation and intelligent system design. Not because they're complicated, the formulas are simple. But because they force you to ask the right question before you ask how accurate your model is.

The right question is never "how often is my model correct?"

The right question is "what happens when my model is wrong and which kind of wrong can this system afford?"

Answer that first. Build from there.

I'm documenting my full AI/ML engineering journey publicly, from Python fundamentals to agentic AI and ML systems and everything in between. If this kind of technical breakdown is useful to you, follow along. More coming.

What metric do you find most misunderstood in ML? Let's talk in the comments.

Why Your ML Model Can Be 95% Accurate and Still Be Useless — Sensitivity & Specificity Explained

First, Let's Talk About Why Accuracy Lies

The Four Outcomes You Need to Know

What Is Sensitivity?

What Is Specificity?

The Tradeoff Nobody Warns You About

How This Shows Up in Real AI Systems

Before You Deploy Anything, Ask These Two Questions

Wrapping Up

Comments

Command Palette

First, Let's Talk About Why Accuracy Lies

The Four Outcomes You Need to Know

What Is Sensitivity?

What Is Specificity?

The Tradeoff Nobody Warns You About

How This Shows Up in Real AI Systems

Before You Deploy Anything, Ask These Two Questions

Wrapping Up

Comments