Machine Learning: Where Validation Gets Interesting
AI Foundations Series - Part 4A
February 12, 2026
This is Part 4 of the AI Foundations Series, which explores how different categories of AI technology map to validation decisions in regulated life sciences environments.
The series so far:
- Part 1: AI Foundations for Life Sciences - A Taxonomy
- Part 2: RAG: Retrieval-Augmented Generation for Regulated Environments
- Part 3: Rules-Based Systems: The Baseline for AI Validation
This article is where the series shifts gears. The topic, Machine Learning (ML), is large enough — and the implications differ enough from rule-based systems — that it naturally splits into two parts. Part 4A covers what ML is, how it works, and why the shift from deterministic to probabilistic changes the validation conversation. Part 4B takes on the accuracy problem — what “accurate” actually means, why it’s harder to define than you’d expect, and what to do about it.
The positions here are solely mine. The articles are an attempt to place handholds on a dynamic and complex topic and are informed by my 25+ years in life sciences technology and validation. But this is an evolving and expanding space. If your experience differs, by all means add your comments; I want to hear from you.
Introduction
In the previous article, we covered rule-based systems — Category 1 in the taxonomy. Explicit logic. Deterministic output. Same input, same output, every time. Validation is straightforward: test the rules, verify the math, confirm the boundaries.
Machine learning is where that certainty ends.
ML systems don’t follow rules you explicitly encoded. They learn patterns from data you provided.
Consider the difference: “IF creatinine > 1.5 THEN flag” is a rule. It’s explicit, testable, deterministic. Use a rule — ML adds nothing here but unneeded complexity. But now consider identifying whether a cell culture is contaminated from a microscopy image.
Today, a trained scientist looks at the image and makes a judgment call — contaminated or not. They’re reading patterns across cell density, morphology, color variation, and dozens of other features they can’t fully articulate. Ask them to write down the rules and they can’t. “That cluster looks wrong” isn’t an IF/THEN statement. That’s expertise built from years of looking at cultures — pattern recognition that resists explicit encoding.
To train ML to do this, you capture that expertise as data. Scientists review thousands of culture images and label each one: contaminated or not contaminated. That labeled dataset is the training data. The model learns by comparing its own predictions against those labels and adjusting until prediction error is minimized across the training data.
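That training loop — compare predictions against labels, adjust, repeat — can be sketched in a few lines. This is a deliberately minimal, hypothetical example: a tiny logistic-regression classifier trained by gradient descent on invented feature values (stand-ins for image-derived measurements like cell density or color variance). Real contamination models are far larger, but the mechanics are the same.

```python
import math

# Toy labeled dataset: (features, label). Feature values are invented
# stand-ins for image-derived measurements; 1 = contaminated, 0 = clean.
data = [
    ([0.9, 0.8], 1), ([0.8, 0.9], 1), ([0.7, 0.9], 1),
    ([0.2, 0.1], 0), ([0.1, 0.3], 0), ([0.3, 0.2], 0),
]

weights = [0.0, 0.0]
bias = 0.0
lr = 0.5  # learning rate: how far each correction moves the weights

def predict(x):
    """Model output: probability of 'contaminated' via the logistic function."""
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Training: compare the model's prediction against the label,
# nudge the weights to reduce the error, repeat across the data.
for epoch in range(500):
    for x, label in data:
        error = predict(x) - label           # positive if over-predicting
        for i in range(len(weights)):
            weights[i] -= lr * error * x[i]  # gradient step per weight
        bias -= lr * error

# The trained function now scores a new input it has never seen.
print(round(predict([0.85, 0.8]), 2))  # high probability: looks contaminated
```

Note what the model learned: not rules, just numerical weights that happen to separate the labeled examples. That opacity is the validation problem the rest of this article describes.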
But here’s the question that determines your entire validation posture: what are those labels based on?
Not all training data is equal, and the validation implications differ substantially depending on where the labels came from.
If each sample was independently confirmed — cultured, tested, and the result documented — then the labels are backed by objective evidence. The model is learning to predict what the test would find, not what a human would say. That’s training against an objective reference standard, and your validation metrics (accuracy, sensitivity, specificity) are meaningful against it. This isn’t theoretical — manufacturing models trained on confirmed QC outcomes, lab models trained against certified reference standards, and clinical models trained on documented treatment responses all operate this way.
If the labels represent expert decisions — which for image-based applications like cell culture contamination is typically the case, since objectively confirming every training sample is rarely feasible — then the methodology for building the training data scales with consequence, the same way validation rigor does.
A practical approach: have three independent experts review the same dataset, each recording their own determination. The model is trained on the full set of determinations. Unanimous cases are strong training data. Split determinations still have a majority, but they also introduce label noise — which is unavoidable at scale. The model isn’t explicitly learning “borderline logic,” but the variation in expert determinations becomes part of what it learns to generalize across. No adjudication panels, no second passes — just independent expert judgment at scale.
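The three-expert approach reduces to simple bookkeeping: take the majority determination as the training label, and note where the experts split. A sketch, with invented image IDs and determinations:

```python
from collections import Counter

# Hypothetical review log: image id -> three independent expert
# determinations. IDs and values are invented for illustration.
reviews = {
    "img_001": ["contaminated", "contaminated", "contaminated"],  # unanimous
    "img_002": ["clean", "clean", "clean"],                       # unanimous
    "img_003": ["contaminated", "clean", "contaminated"],         # split 2-1
}

training_labels = {}
split_cases = []  # borderline images: majority label used, but noisier

for image_id, determinations in reviews.items():
    counts = Counter(determinations)
    label, votes = counts.most_common(1)[0]
    training_labels[image_id] = label    # majority becomes the training label
    if votes < len(determinations):
        split_cases.append(image_id)     # disagreement = label noise

print(training_labels)
print("split determinations:", split_cases)
```

The split cases are worth tracking even without formal adjudication: they tell you where your reference standard itself is uncertain, which bounds what any accuracy claim against it can mean.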
For higher-consequence applications — training a model to call progressive disease from CT scans in oncology, for example — formal adjudication of disagreements may be warranted. Either way, you’re training the model to be a human, not to be the truth — but multiple independent humans are closer to truth than one. Your “95% accuracy” still means “agrees with expert consensus 95% of the time,” not “is correct 95% of the time.” The distinction matters when an auditor asks how you know the model is accurate.
The ideal is confirmed outcomes — objective, testable, traceable. Where that’s achievable, pursue it. Where it isn’t, be honest about what your accuracy metrics actually measure. And in either case, the borderline cases create noise: the model treats all labels equally as ground truth and doesn’t know which ones were certain and which were close calls.
So the model is learning to replicate either confirmed outcomes or human expert judgment — and you need to know which one, because it changes what “accuracy” means. The ceiling for the model is the quality of those labels. The output is probabilistic — a confidence score, not a binary answer. And that changes everything about how you validate.
This is Category 2 in the taxonomy from Article 1 — the first step beyond explicit programming. The taxonomy described it as learning decision boundaries and relationships from examples rather than explicit rules. Everything from here forward in the series builds on this shift from deterministic to probabilistic.
What Machine Learning Actually Does
An ML model is a mathematical function derived from data. You provide labeled examples — inputs paired with expected outputs — and the algorithm finds patterns that map one to the other. The result is a model: a trained function that can process new inputs it hasn’t seen before and produce predictions with associated confidence.
The cell culture example above is one pattern. Here’s another: you feed the system 10,000 chest X-rays labeled “normal” or “abnormal.” The algorithm identifies visual patterns that distinguish the two. When it encounters a new X-ray, it assigns a probability — say, 87% likelihood of abnormality. It doesn’t “know” what it’s looking at the way a radiologist does. It found a statistical pattern that works.
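That 87% only becomes an action when someone picks a threshold — and the threshold choice is itself a validation decision, because it trades sensitivity against specificity. A sketch with invented validation-set scores (real evaluations run over thousands of cases):

```python
# Hypothetical validation set: model probability of "abnormal" paired
# with the reference label (1 = abnormal, 0 = normal). Values invented.
scored = [
    (0.95, 1), (0.87, 1), (0.62, 1), (0.40, 1),             # abnormal cases
    (0.70, 0), (0.45, 0), (0.30, 0), (0.15, 0), (0.05, 0),  # normal cases
]

def evaluate(threshold):
    """Turn probabilities into flags at a threshold, then score the flags."""
    tp = sum(1 for p, y in scored if p >= threshold and y == 1)
    fn = sum(1 for p, y in scored if p <  threshold and y == 1)
    tn = sum(1 for p, y in scored if p <  threshold and y == 0)
    fp = sum(1 for p, y in scored if p >= threshold and y == 0)
    sensitivity = tp / (tp + fn)   # abnormal cases correctly flagged
    specificity = tn / (tn + fp)   # normal cases correctly passed
    return sensitivity, specificity

# Lowering the threshold catches more abnormals at the cost of more
# false flags. Where you set it depends on which error costs more.
print(evaluate(0.5))
print(evaluate(0.35))
```

Note that both numbers are measured against the labels — so everything said above about label provenance applies to sensitivity and specificity too.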
A key distinction from the taxonomy: ML models as described here are static. The model is trained once, then deployed. Behavior is fixed until explicitly retrained — or its input data or thresholding logic changes — which triggers revalidation. This is different from adaptive systems that change behavior based on new data continuously; those fall into more advanced categories. Most regulated deployments intentionally lock models at release. Continuous learning is possible, but it requires a governance structure most organizations aren’t equipped for.
In life sciences, ML shows up where rules are too complex or too numerous to encode explicitly:
- Signal detection in pharmacovigilance — identifying potential adverse events across millions of case reports
- Clinical trial site performance — predicting enrollment rates based on historical data patterns
- Manufacturing process optimization — predicting yield or quality outcomes from process parameter combinations
- Image analysis — cell culture contamination detection, particle identification
- Classification models — flagging potential adverse events in safety narratives
In each case, you could try writing rules. For signal detection, you could define specific term combinations and frequency thresholds. But the rule set would be enormous, brittle, and unable to catch patterns a human hadn’t anticipated. ML handles the complexity — at the cost of transparency.
Remember the taxonomy’s “Same Use Case, Different Implementation” point: three vendors could sell you “AI-powered” adverse event detection. One uses keyword matching (rule-based). One uses a trained classifier (ML). One uses deep learning NLP. Same marketing language. The ML version produces confidence scores, not binary flags — and that distinction drives your entire validation approach.
Supervised vs. Unsupervised
The taxonomy groups ML as a single category, which is appropriate for mapping to validation decisions. But within ML, there’s a distinction worth understanding because it affects what validation looks like.
Supervised learning uses labeled data: you tell the model what the right answer is during training, and it learns to predict those answers for new inputs. Most audit-visible ML applications are supervised — classification (is this an adverse event: yes/no?), prediction (will this batch meet spec?), regression (what yield should we expect from these parameters?).
Unsupervised learning finds structure in unlabeled data. Clustering, anomaly detection, dimensionality reduction. Nobody tells the system what to look for — it identifies groupings or outliers on its own. In life sciences: identifying unexpected patient subgroups in trial data, detecting anomalous manufacturing process behavior, or clustering adverse event reports by similarity.
The validation difference: supervised models have a reference standard to test against — you know what the labels say the right answer should be (with all the caveats about label quality discussed above). Unsupervised models don’t. You’re validating whether the patterns the system found are meaningful and useful, which requires domain expertise to evaluate. Both are ML. Both are probabilistic. But the testing methodology differs.
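The difference is easy to see in code. An unsupervised anomaly detector has no labels to score against — it just surfaces what looks unusual, and a domain expert has to decide whether the finding is meaningful. A minimal sketch using a crude two-standard-deviation rule on invented process readings (real systems use stronger statistical methods):

```python
import statistics

# Hypothetical process parameter readings (values invented). No labels:
# nobody has told the system which readings are "bad".
readings = [10.1, 9.8, 10.0, 10.2, 9.9, 10.1, 14.7, 10.0, 9.9, 10.3]

mean = statistics.mean(readings)
sd = statistics.stdev(readings)

# Unsupervised anomaly detection: flag readings far from the bulk of
# the data. The system found an outlier -- whether it is *meaningful*
# (a real process excursion vs. a sensor glitch) is a judgment only
# domain expertise can make. There is no label to check it against.
anomalies = [x for x in readings if abs(x - mean) > 2 * sd]
print(anomalies)
```

A supervised model in the same setting would come with labeled outcomes to score against, as in the threshold example earlier — which is exactly why its validation methodology can lean on metrics where the unsupervised case must lean on expert review.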
For most organizations encountering ML in validated environments, it’s supervised learning embedded in vendor products. That’s where we’ll focus the validation discussion.
The Validation Shift
Here’s the fundamental change from rule-based systems: you can’t read the logic.
A rule-based system has an explicit decision path you can trace, test, and document. An ML model has learned weights — numerical parameters that collectively produce predictions. You can’t open it up and say “it flagged this case because of conditions A, B, and C.” The model doesn’t work that way.
The taxonomy’s “How They Relate” table captured this: as you move from rule-based to ML, determinism decreases, explainability drops, and validation complexity increases. Here’s what that means in practice:
- Training data provenance determines your validation posture - We covered this in the cell culture example — confirmed outcomes and expert decisions produce fundamentally different kinds of training data, and your accuracy metrics mean different things depending on which one you have. This isn’t background information. The provenance of the training data is a risk factor that belongs in your risk assessment and drives your acceptance criteria.
- Performance is statistical, not absolute - Rule-based systems are right or wrong. ML models produce confidence scores. A model that’s 95% accurate is wrong 5% of the time — but “accurate” means “agrees with the labels,” and the labels may or may not represent truth.
- Models degrade over time - This is drift — the concept introduced in the taxonomy under “When is it no longer validated?” The data distribution the model was trained on shifts: patient populations change, manufacturing processes evolve, new drug classes emerge. A model that performed well at deployment may lose accuracy months later without any code changes. Nothing changed in the software. The world changed around it.
- Retraining changes the model - When performance degrades, you retrain with new data. But retraining produces a different model — different weights, potentially different behavior. The taxonomy was explicit: retraining triggers revalidation. This isn’t a software patch. It’s a new version of learned behavior.
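Drift is also the one item on this list you can instrument. A monitoring sketch, with invented numbers: compare a key input feature's distribution at deployment against recent production data, and flag when the shift exceeds a tolerance. Production monitoring typically uses stronger tests (PSI, Kolmogorov–Smirnov), but the idea is the same.

```python
import statistics

# Hypothetical monitoring check. Baseline = the feature's distribution
# when the model was validated; recent = current production inputs.
# All values are invented for illustration (e.g. patient age).
baseline = [52, 48, 50, 51, 49, 50, 53, 47, 50, 50]
recent   = [58, 61, 57, 60, 59, 62, 58, 60, 61, 59]  # population shifted

def drifted(baseline, recent, tolerance_sds=2.0):
    """Crude drift check: has the recent mean moved more than
    `tolerance_sds` baseline standard deviations?"""
    shift = abs(statistics.mean(recent) - statistics.mean(baseline))
    return shift > tolerance_sds * statistics.stdev(baseline)

# Nothing in the model changed -- the world it scores changed.
print(drifted(baseline, recent))  # a shift like this should trigger review
```

The point of the sketch: drift detection doesn't require touching the model at all. It's a comparison of input distributions over time, which is why it belongs in ongoing monitoring rather than in release testing.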
This article covered the mechanics: what ML is, how training data works, why the provenance of that data — confirmed outcomes versus expert decisions — determines your entire validation posture, and why the shift from deterministic to probabilistic outputs changes what you can test, what you can claim, and what you need to monitor. These are the building blocks.
But they raise a harder question. ML doesn’t just change how systems behave — it forces you to confront what validation has always assumed. What “accurate” actually means, what your acceptance criteria are really measuring, and whether the standard you’re holding ML to was ever applied to the process it replaces.
Next: Part 4B — The accuracy problem. What “accurate” actually means for ML, why it’s harder to define than you’d expect, and why the question isn’t new — ML just makes it impossible to ignore.