GMLP Principles 3–5: Data, Validation & Reference Standards

Ensure training and test data represent your patient population, maintain independence between datasets, and use fit-for-purpose reference standards.

Principle 3: Representative patient populations

Clinical evaluation must use datasets representative of the intended patient population. Data collection protocols should ensure that relevant patient characteristics (such as age, gender, sex, race, ethnicity, geographical location, and medical condition), the intended use environment, and measurement inputs are sufficiently represented, in samples of adequate size, across training, test, and monitoring data.

Representative data is fundamental to clinical evaluation and helps to:

Manage unintended bias and dataset drift
Promote generalisable performance across the intended population
Assess usability
Identify subgroups or circumstances where the model may underperform, including over time
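To illustrate the last point, the sketch below computes sensitivity separately for each subgroup so that underperformance in one group is not hidden by a good overall figure. It is a minimal example under stated assumptions: the record fields (`age_band`, `label`, `prediction`) and the data are hypothetical, and sensitivity is only one of several metrics a real evaluation would report.

```python
from collections import defaultdict


def sensitivity_by_subgroup(records, group_key):
    """Return {subgroup: (sensitivity, n_positive_cases)}."""
    true_positives = defaultdict(int)
    positives = defaultdict(int)
    for record in records:
        # Only reference-positive cases count towards sensitivity.
        if record["label"] == 1:
            group = record[group_key]
            positives[group] += 1
            if record["prediction"] == 1:
                true_positives[group] += 1
    return {g: (true_positives[g] / n, n) for g, n in positives.items()}


# Made-up illustrative records, not real patient data.
records = (
    [{"age_band": "18-64", "label": 1, "prediction": 1}] * 4
    + [{"age_band": "65+", "label": 1, "prediction": 1}] * 2
    + [{"age_band": "65+", "label": 1, "prediction": 0}] * 2
    + [{"age_band": "18-64", "label": 0, "prediction": 0}] * 3
)

results = sensitivity_by_subgroup(records, "age_band")
for group, (sensitivity, n) in sorted(results.items()):
    print(f"{group}: sensitivity={sensitivity:.2f} (n={n})")
```

The count of positive cases is returned alongside each estimate because a subgroup with very few cases gives an unreliable figure, which ties back to the requirement for samples of adequate size.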

Hong Kong context

An AI model trained predominantly on data from other regions or demographics may underperform for local patients. Ask vendors whether local or comparable Asian populations were included in development and validation.

Principle 4: Independent training and test sets

Training and test datasets must be appropriately independent. All potential sources of dependence — related to patients, sites, and data acquisition — must be considered and addressed.

The extent of external validation should be proportionate to risk. Leakage between training and test data, for example when images from the same patient appear in both sets, can inflate reported performance and create false confidence in clinical settings.
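One way to address patient-related dependence is to split at the patient level rather than the record level, then check for overlap afterwards. The sketch below is an illustration only: the field names (`patient_id`, `site`), the helper names, and the data are hypothetical, and it does not by itself handle site or acquisition dependence.

```python
import random


def split_by_patient(records, test_fraction=0.2, seed=0):
    """Assign every record of a given patient to the same side of the split."""
    patient_ids = sorted({r["patient_id"] for r in records})
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(patient_ids)
    n_test = max(1, round(len(patient_ids) * test_fraction))
    test_ids = set(patient_ids[:n_test])
    train = [r for r in records if r["patient_id"] not in test_ids]
    test = [r for r in records if r["patient_id"] in test_ids]
    return train, test


def shared_values(train, test, key):
    """Return the values of `key` that occur in both datasets."""
    return {r[key] for r in train} & {r[key] for r in test}


# Made-up data: 10 patients with 3 images each, drawn from two sites.
records = [
    {"patient_id": f"P{p:02d}", "site": "A" if p < 5 else "B", "image": i}
    for p in range(10)
    for i in range(3)
]

train, test = split_by_patient(records, test_fraction=0.2, seed=0)

# An empty set means no patient contributes to both training and test data.
print("Patients in both sets:", shared_values(train, test, "patient_id"))
# Shared sites are not ruled out by a patient-level split.
print("Sites in both sets:", shared_values(train, test, "site"))
```

If the second check reports shared sites, the test set is not external with respect to site. Whether that matters depends on the device's risk and intended use, and a site-level split may be the more appropriate design.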

Principle 5: Fit-for-purpose reference standards

Accepted methods for developing a reference standard should ensure that clinically relevant, well-characterised data are collected and that the limitations of the reference standard are understood. Documentation should explain:

Rationale for choosing reference standards based on intended use
Suitability for the intended use environment
Use of accepted standards that promote robustness and generalisability, where available

Reference standard selection should reflect broad consensus and appropriate expertise where possible.
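When a reference standard relies on expert labelling, one common way to characterise its limitations is to measure how often independent readers agree. The sketch below computes Cohen's kappa for two readers. This is an illustrative choice and not something the source guidance prescribes, and the reader labels are made up.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two readers labelling the same cases."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both readers must label the same non-empty set of cases")
    n = len(labels_a)
    # Proportion of cases on which the two readers gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Agreement expected by chance, from each reader's label frequencies.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return float("nan")  # undefined when both readers only ever use one label
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two readers on the same ten cases (1 = condition present).
reader_a = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
reader_b = [1, 1, 0, 0, 0, 0, 1, 0, 1, 1]

kappa = cohens_kappa(reader_a, reader_b)
print(f"Cohen's kappa: {kappa:.2f}")
```

Here the readers agree on 8 of 10 cases and chance agreement is 0.5, giving a kappa of about 0.6. Low agreement suggests the "ground truth" is itself uncertain, which limits how precisely model performance can be measured against it.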

Clinical takeaway

When reviewing AI performance reports, check whether the "ground truth" label method matches how you would define the condition in practice — and whether test data were truly independent of training.

Source: IMDRF — Good Machine Learning Practice for Medical Device Development: Guiding Principles (January 2025)
