Methodology

Foresight is a remarkable model. It is not a patient digital twin.

Raganele Consulting

A few weeks ago we listened to a podcast on patient digital twins. The hosts, both clinicians-turned-AI-builders, spent an hour discussing Foresight, a generative pretrained transformer trained on the electronic health records of 811,336 patients across King’s College Hospital, the South London and Maudsley Trust, and MIMIC-III. The model was published in Lancet Digital Health in 2024 and scaled up at UCL and King’s this spring to a national-scale dataset of roughly fifty-seven million NHS patients. The framing throughout was that this is what a patient digital twin looks like in 2026: a generative model over patient histories, predicting what happens next.

Foresight is a remarkable model. The engineering is real. The dataset is one of the largest of its kind. The team has done careful work for years on the underlying named-entity recognition (MedCAT) that makes the whole thing possible. The paper is also notably honest about one of its limitations: the authors state explicitly that, because the model is derived from real-world data and reflects historical common practice rather than current best-practice guidelines, “it should not be used for clinical decision support in its current form.”

But it is not a patient digital twin. And the distinction is not a quibble about words. It is the difference between two completely different scientific projects, only one of which can do the thing the phrase actually promises.

What Foresight does

Foresight is, architecturally, a decoder-only transformer of the same family as GPT-2, trained autoregressively on tokenised sequences of medical concepts. Each patient’s electronic health record is converted, by the MedCAT pipeline the same group has spent a decade building, into a chronologically ordered stream of SNOMED-CT concept tokens — diagnoses, symptoms, medications, procedures, lab events. The model learns to predict the next concept in the stream.
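To make "predict the next concept in the stream" concrete, here is a deliberately tiny sketch. It is not Foresight's code: a transformer conditions on the whole prefix, whereas this stand-in only counts which concept follows the previous one. The timelines and token names are invented for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical, hand-made concept streams. Real inputs would be SNOMED-CT
# concept tokens extracted from the record, in chronological order.
timelines = [
    ["hypertension", "amlodipine", "ankle_oedema"],
    ["hypertension", "amlodipine", "review"],
    ["hypertension", "ramipril", "cough"],
    ["hypertension", "amlodipine", "ankle_oedema"],
]

def fit_next_concept_counts(streams):
    """Count how often each concept directly follows each preceding concept."""
    counts = defaultdict(Counter)
    for stream in streams:
        for prev, nxt in zip(stream, stream[1:]):
            counts[prev][nxt] += 1
    return counts

def next_concept_distribution(counts, prefix):
    """Estimate P(next concept | last concept of the prefix) from the counts."""
    followers = counts.get(prefix[-1], Counter())
    total = sum(followers.values())
    if total == 0:
        return {}  # this concept was never followed by anything in the data
    return {concept: n / total for concept, n in followers.items()}

counts = fit_next_concept_counts(timelines)
dist = next_concept_distribution(counts, ["hypertension", "amlodipine"])
print(dist)
```

Everything the sketch can say about what follows amlodipine comes from what followed amlodipine in the recorded timelines. A larger model refines the estimate; it does not change where the estimate comes from.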

You can do striking things with that. You can ask the model what medical event the patient is most likely to experience next month. You can compare cohorts. You can flag patients whose predicted trajectory diverges from their observed one. ETHOS, a parallel project out of Harvard and Jagiellonian, does the same shape of thing on US data. DT-GPT, the most recent entry, layers a larger LLM on top.

It is a real foundation model and the people who built it should be proud of it.

What a patient digital twin is supposed to do

The phrase digital twin has a precedent in engineering. The original use of the term, going back to Michael Grieves’ product-lifecycle work in the 2000s and General Electric’s industrial deployments in the 2010s, refers to a simulation of the underlying system, parameterised to a specific instance, that you can run experiments against. The point is to ask counterfactual questions. What happens to this turbine if we change the inlet temperature by two degrees? What is the stress at the root of this blade if we change the alloy? You do not run a turbine twice to find out. You run the simulation, twice, with the parameter changed.

A patient digital twin, by the same logic, is supposed to let a clinician ask: what happens to this patient if I change the dose? What happens if I delay treatment by two weeks? What happens if I switch from drug A to drug B? Counterfactual questions, parameterised to the specific instance.

That is the job description. The interesting question is whether Foresight, ETHOS, DT-GPT or anything in the same architectural family can do it.

What Foresight cannot do

It cannot answer counterfactual questions, and the reason is in the mathematics, not the engineering.

A model trained autoregressively on observed sequences learns the conditional distribution of the next token given the prefix it has seen. It is, in the technical sense, a correlational model. It will tell you what patients with the prefix you typed have, historically, gone on to experience or receive next. If patients with that prefix were, in the training data, typically given drug A and then went on to outcome X, that is what it will predict. The model has no representation, anywhere, of what would have happened if those patients had been given drug B instead. There is no parameter to change, no mechanism to perturb, no organ to dose differently. The model is an estimate of the joint distribution of histories that actually happened. The histories that did not happen are not in there.
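A small numerical sketch shows how far the observed conditional can sit from the effect of an intervention. Every number here is invented for illustration. We also assume that disease severity drives both the prescribing decision and the outcome, and that it is not visible in the prefix.

```python
# All probabilities are made up for illustration; none come from Foresight
# or from any real cohort.
P_SEVERE = 0.5                            # P(patient is severe)
P_DRUG_A = {"mild": 0.1, "severe": 0.9}   # historical prescribing: P(drug A | severity)
P_BAD = {                                 # P(bad outcome | drug, severity)
    ("A", "mild"): 0.1, ("A", "severe"): 0.5,
    ("B", "mild"): 0.2, ("B", "severe"): 0.6,
}
SEVERITIES = ("mild", "severe")

def p_severity(sev):
    return P_SEVERE if sev == "severe" else 1.0 - P_SEVERE

def p_drug_given_severity(drug, sev):
    return P_DRUG_A[sev] if drug == "A" else 1.0 - P_DRUG_A[sev]

def observed_risk(drug):
    """P(bad | drug) in the records: the quantity a predictive model estimates."""
    joint = {s: p_drug_given_severity(drug, s) * p_severity(s) for s in SEVERITIES}
    p_drug = sum(joint.values())
    return sum(P_BAD[(drug, s)] * joint[s] / p_drug for s in SEVERITIES)

def interventional_risk(drug):
    """P(bad | do(drug)): everyone gets the drug, so severity keeps its population mix."""
    return sum(P_BAD[(drug, s)] * p_severity(s) for s in SEVERITIES)

for drug in ("A", "B"):
    print(drug, round(observed_risk(drug), 2), round(interventional_risk(drug), 2))
```

In this toy, drug A is better than drug B within each severity stratum, and assigning it to everyone gives a bad-outcome rate of about 0.30 against 0.40 for drug B. The records nevertheless show about 0.46 for A against 0.24 for B, because A was mostly given to the sickest patients. A model that reproduces the records faithfully reproduces the second pair of numbers, not the first.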

This is not a flaw of the particular implementation. It is a feature of what the architecture is. The same statement is true of ETHOS, of DT-GPT, of BEHRT, of every transformer-over-EHR-codes model in the family. They are exquisite predictors of the observed trajectory distribution. They are silent on counterfactuals because counterfactuals are, by construction, not in the training set.

This is worth slowing down on, because the Foresight paper itself, in the Interpretation paragraph at the front of the manuscript, lists “simulate interventions and counterfactuals” among the model’s intended uses. We do not think this is a misrepresentation by the authors — we think it is the precise place the field has lost the thread. A generative model over observed trajectories can certainly generate a sequence under a different prefix, and it is tempting to call that a counterfactual. It is not. Counterfactual inference, in the sense Pearl, Rubin, and the causal-inference literature have spent four decades formalising, requires either an explicit mechanism or a study design that controls confounding. A transformer trained on what physicians did and what then happened has neither.
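The distinction can be written in standard causal-inference notation. As a sketch, assume a single confounder Z (disease severity, say) that influences both the treatment X and the outcome Y and that is sufficient for adjustment. The symbols are ours, not the paper's.

```latex
\begin{align}
P(Y = y \mid X = x)
  &= \sum_{z} P(Y = y \mid X = x, Z = z)\, P(Z = z \mid X = x) \\
P\bigl(Y = y \mid \mathrm{do}(X = x)\bigr)
  &= \sum_{z} P(Y = y \mid X = x, Z = z)\, P(Z = z)
\end{align}
```

The first line is what conditioning on a different prefix gives you. The second is what the clinician is asking for. They coincide only when P(Z = z | X = x) = P(Z = z), that is, when the treatment choice did not depend on Z. Computing the second line requires knowing which variables to adjust for and having them measured, which is exactly the mechanism or study design the paragraph above refers to.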

What patient digital twins look like when they exist

The patient digital twin, in the engineering sense, does exist. It just does not live in the same labs.

The mechanistic-modelling community — pharmacometrics, physiologically-based pharmacokinetics, quantitative systems pharmacology — has been building this kind of artefact for forty years. The European Medicines Agency holds a recognised set of qualifications for these models as drug-development tools. The Living Heart Project at Dassault Systèmes is a finite-element multi-physics model of the human heart, parameterised to individual patients, and it is what cardiac-device companies actually use to ask counterfactual questions about device design. Certara’s Simcyp, Open Systems Pharmacology’s PK-Sim, esqLABS’ models, and a growing systems-biology literature on what is now being called immune digital twins — these are the artefacts the engineering definition of digital twin actually fits.

They are coupled differential equations. They are parameterised at the level of organ, enzyme, transporter, ion channel. You can change the dose and re-integrate and read a different trajectory. They are not trained on patient timelines; they are built from physiology. Data calibrates their parameters; it does not determine their structure.
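"Change the dose and re-integrate" can be shown with the simplest textbook case: a one-compartment pharmacokinetic model with first-order absorption, integrated with an explicit Euler step. This is a minimal sketch with placeholder parameters. It is not Simcyp, PK-Sim, or any qualified model, all of which carry far more physiology.

```python
def simulate_oral_dose(dose_mg, ka=1.0, ke=0.2, volume_l=40.0, hours=24.0, dt=0.01):
    """Euler-integrate a one-compartment model with first-order absorption.

        dA_gut/dt = -ka * A_gut
        dC/dt     =  ka * A_gut / V - ke * C

    ka (absorption rate, 1/h), ke (elimination rate, 1/h) and volume_l
    (volume of distribution) are illustrative placeholders, not fitted values.
    Returns a list of (time_h, concentration_mg_per_l) points.
    """
    gut_mg, conc = dose_mg, 0.0
    trajectory = [(0.0, conc)]
    steps = int(round(hours / dt))
    for i in range(1, steps + 1):
        absorbed_mg = ka * gut_mg * dt            # drug leaving the gut this step
        eliminated = ke * conc * dt               # concentration cleared this step
        gut_mg -= absorbed_mg
        conc += absorbed_mg / volume_l - eliminated
        trajectory.append((i * dt, conc))
    return trajectory

def peak(trajectory):
    """Return the (time_h, concentration) point with the highest concentration."""
    return max(trajectory, key=lambda point: point[1])

# Same patient parameters, three different "what if" questions.
baseline = simulate_oral_dose(dose_mg=100.0)
higher_dose = simulate_oral_dose(dose_mg=150.0)
slower_clearance = simulate_oral_dose(dose_mg=100.0, ke=0.1)

for label, run in [("baseline", baseline), ("150 mg", higher_dose), ("ke=0.1", slower_clearance)]:
    t_max, c_max = peak(run)
    print(f"{label}: peak {c_max:.2f} mg/L at {t_max:.1f} h")
```

The dose and the elimination rate are explicit arguments. Asking a counterfactual question means calling the same function with a different value, and the answer follows from the equations rather than from how often that dose appeared in past records.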

These models are not perfect. They have known limitations — the parameter spaces are large, the calibration is hard, the validation is anchored on specific cohorts and may not generalise. The community is honest about this. There is a serious literature, much of it in CPT: Pharmacometrics & Systems Pharmacology and in EMA qualification documents, on exactly where these models work and where they don’t.

But they answer counterfactual questions, and Foresight does not.

Why the conflation matters

If the term patient digital twin gets settled on Foresight-style models in the public conversation, two things happen, both bad.

The first is that physicians and regulators will, eventually, ask the LLM-twin a counterfactual question (what happens if we change the dose?) and get an answer. The answer will be confidently phrased and will look like a prediction. It will actually be the most likely next token, in the observed data, under the dose-change prefix, which is a different quantity from the effect of changing the dose. Whether harm follows is an empirical question. The structural risk is real.

The second is that the actual patient digital twin (the mechanistic, counterfactual-capable, EMA-qualified kind) gets crowded out of the conversation. The authors of From Population-Based PBPK to Individualised Virtual Twins, published earlier this year, and the EMA, in its recent presentations on QSP terminology, are both visibly trying to reclaim the label from the LLM camp.

What we’d like the field to do

Two small things would help.

Call Foresight, ETHOS, DT-GPT and the rest patient trajectory foundation models. That is what they are. The term is accurate, technically precise, and does not borrow credibility from an engineering tradition the architecture cannot honour.

Reserve patient digital twin for mechanistic, counterfactual-capable, parameterisable simulations of the individual patient. There are several such models in clinical and regulatory use today. They deserve the name they were given.

These are two different scientific projects. They are both worth doing. Conflating them helps neither.

Disclaimer. This note is a methodology and capability description, not medical or clinical advice. Modelled outputs are not substitutes for peer-reviewed evidence, regulatory review, or qualified clinical judgement. Raganele is not a medical practice.