Technical Aspects of AI/ML for Medicines
EMA expectations for data acquisition, model development, performance assessment, explainability, and deployment of AI in regulated settings.

Data acquisition and augmentation
AI/ML models are intrinsically data-driven. All efforts should be made to acquire a balanced training dataset, which may require oversampling rare populations and should take EU non-discrimination principles into account. Data sources and the acquisition process, including cleaning, transformation, imputation, annotation, and normalisation, should be documented in a detailed, traceable manner in line with GxP requirements.
Exploratory data analyses should describe data characteristics, representativeness, fairness, and relevance. Documented considerations should cover population representativeness, class imbalances and their mitigation, and the potential risk of unfair or discriminatory outcomes. Augmentation techniques may be used to expand the training data; any limitations that affect generalisability or fairness should be clearly presented, together with recommendations for alternative methods.
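As an illustration of documenting class imbalance and one possible mitigation, the sketch below counts examples per class and then applies random oversampling of the minority class. Random oversampling is only one of several options and is used here as an assumption, not as a method the guidance prescribes; the function names and the toy records are hypothetical.

```python
import random
from collections import Counter


def class_counts(labels):
    """Return the number of examples per class label."""
    return dict(Counter(labels))


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of smaller classes until every class
    is as large as the largest one.

    Intended for the training partition only; test data should stay untouched.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


# Hypothetical toy data: eight "common" records and two "rare" ones.
rows = [{"age": a} for a in (34, 51, 47, 62, 29, 70, 45, 58, 81, 77)]
labels = ["common"] * 8 + ["rare"] * 2

before = class_counts(labels)
bal_rows, bal_labels = random_oversample(rows, labels, seed=42)
after = class_counts(bal_labels)
print("before:", before)
print("after: ", after)
```

Recording both the counts before mitigation and the seed used for resampling keeps the step traceable. Duplicated records add no new information, so the limits of this approach would still need to be stated.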
Training, validation, and test data
In ML, "validation" refers to the data used for model architecture selection and hyperparameter tuning, which differs from how the term is used in medicines development. Once development is complete, performance is evaluated on a hold-out test dataset. If test performance is unsatisfactory and further development is needed, the current test set becomes a second-stage validation set and a new independent test dataset is required.
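A minimal sketch of the three partitions described above, assuming that records can be grouped by a subject identifier and that the split is made at subject level so no subject appears in more than one partition. The function name, the fractions, and the identifiers are illustrative assumptions.

```python
import random


def three_way_split(subject_ids, val_frac=0.15, test_frac=0.15, seed=0):
    """Partition unique subject IDs into disjoint train, validation, and test sets."""
    ids = sorted(set(subject_ids))      # sort first so the result depends only on the seed
    random.Random(seed).shuffle(ids)
    n_test = round(len(ids) * test_frac)
    n_val = round(len(ids) * val_frac)
    test = set(ids[:n_test])
    val = set(ids[n_test:n_test + n_val])
    train = set(ids[n_test + n_val:])
    return train, val, test


# Hypothetical subject identifiers.
subjects = [f"S{i:03d}" for i in range(100)]
train_ids, val_ids, test_ids = three_way_split(subjects, seed=7)

# Validation data guide architecture and hyperparameter choices;
# the test partition is set aside until development is complete.
print(len(train_ids), len(val_ids), len(test_ids))
```

If the test partition is ever consulted to guide further development, it no longer serves as an independent test set in this scheme, and fresh subjects would be needed for a new one.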
An early train-test split, made before normalisation or any other processing that uses aggregated measures, is strongly encouraged. Data leakage risks include unknown case overlaps, sponsor-specific shared features, and prior knowledge of study outcomes. Models for high-risk settings should be prospectively tested with newly acquired data.
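To show why the split should come before normalisation, the sketch below estimates the mean and standard deviation on the training partition only and reuses those values for the test partition. Standardisation is used here as one example of processing with aggregated measures; the helper names and numbers are hypothetical.

```python
import statistics


def fit_standardiser(train_values):
    """Estimate mean and standard deviation from training data only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)
    if std == 0:
        std = 1.0  # avoid division by zero for constant features
    return mean, std


def standardise(values, mean, std):
    """Apply previously fitted parameters; nothing is re-estimated here."""
    return [(v - mean) / std for v in values]


# The split happens first; these lists stand in for the two partitions.
train = [2.0, 4.0, 6.0, 8.0]
test = [10.0, 12.0]

mean, std = fit_standardiser(train)       # the test data play no part in this step
train_z = standardise(train, mean, std)
test_z = standardise(test, mean, std)
print(mean, std, test_z)
```

Had the mean and standard deviation been computed over all six values, information about the test partition would have entered the training features, which is a form of the leakage described above.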
Model development and performance
Developers should ensure that SOPs promote generalisability and robustness, with traceable documentation and development logs. Methods such as regularisation, dropout, and sensitivity analyses stratified by calendar time are encouraged. Overfitting caused by non-optimal practices is usually discoverable at the test phase; data leakage from the test set into training is more problematic, because it inflates the test results themselves and so may go undetected.
Performance metrics should include measures that are insensitive to class imbalance (such as the Matthews Correlation Coefficient) and should describe the full confusion matrix. Distributions of cross-validation results, sensitivity analyses for minority classes and calendar time, and thresholds defined a priori support credibility.
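The following sketch computes a binary confusion matrix and the Matthews Correlation Coefficient (MCC) from it, using the standard definition MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The function names and toy labels are illustrative; returning 0.0 when the denominator is zero is a common convention adopted here as an assumption.

```python
import math


def confusion_matrix(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for a binary classification task."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def mcc(tp, fp, tn, fn):
    """Matthews Correlation Coefficient; 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Imbalanced toy data: 95 negatives and 5 positives.
y_true = [0] * 95 + [1] * 5

# A trivial classifier that always predicts the majority class.
always_negative = [0] * 100
tp, fp, tn, fn = confusion_matrix(y_true, always_negative)
accuracy_trivial = (tp + tn) / len(y_true)
mcc_trivial = mcc(tp, fp, tn, fn)

# A classifier that finds 3 of the 5 positives with 1 false alarm.
y_pred = [1] + [0] * 94 + [1, 1, 1, 0, 0]
cm_better = confusion_matrix(y_true, y_pred)
mcc_better = mcc(*cm_better)
print(accuracy_trivial, mcc_trivial, cm_better, round(mcc_better, 3))
```

The always-negative classifier reaches an accuracy of 0.95 yet has an MCC of 0.0, which is why accuracy alone can mislead on imbalanced data and why reporting the full confusion matrix is informative.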
Interpretability, explainability, and deployment
Transparent models are preferred where possible. Black-box models may be acceptable if developers substantiate that interpretable models show unsatisfactory performance or robustness, and if their use is supported by monitoring and risk management plans. Explainable AI methods (feature importance, SHAP, LIME, attention plots) should be used whenever possible.
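As one concrete form of feature importance, the sketch below implements permutation importance: each feature column is shuffled in turn and the resulting drop in accuracy is recorded. This technique is chosen here as an assumption because it is model-agnostic and needs no extra libraries; the toy model, data, and function names are hypothetical.

```python
import random


def accuracy(model, X, y):
    """Fraction of rows for which the model output equals the label."""
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)


def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(model, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            # Rebuild the rows with only column j permuted.
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(model, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances


def toy_model(row):
    """Stand-in for a trained model; it uses feature 0 and ignores feature 1."""
    return int(row[0] > 0.5)


data_rng = random.Random(1)
X = [[i / 20, data_rng.random()] for i in range(20)]
y = [int(row[0] > 0.5) for row in X]

importances = permutation_importance(toy_model, X, y)
print(importances)
```

Feature 1 receives an importance of exactly zero because the toy model never reads it, while shuffling feature 0 degrades accuracy. With correlated features, permutation importance can be misleading, so it would normally be read alongside other explainability methods.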
Deployment should follow a risk-based approach. For high-risk use cases, non-trivial changes in the software or hardware stack require a bridging re-evaluation. Data acquisition and transformation at inference must match pre-defined specifications. Monitoring should detect performance degradation against clearly defined thresholds, with risk management plans for failure modes.
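A minimal sketch of threshold-based monitoring, assuming that ground-truth outcomes eventually become available for deployed predictions. The class name, window size, and alert threshold are hypothetical; in practice the threshold would be defined in advance and tied to the risk management plan.

```python
from collections import deque


class PerformanceMonitor:
    """Track rolling accuracy over the most recent predictions."""

    def __init__(self, window, alert_threshold):
        self.outcomes = deque(maxlen=window)   # oldest entries drop out automatically
        self.alert_threshold = alert_threshold

    def record(self, prediction, truth):
        """Store whether a deployed prediction matched the observed outcome."""
        self.outcomes.append(prediction == truth)

    def rolling_accuracy(self):
        """Accuracy over the current window, or None if nothing is recorded yet."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def degraded(self):
        """True once the window is full and accuracy is below the threshold."""
        full = len(self.outcomes) == self.outcomes.maxlen
        return full and self.rolling_accuracy() < self.alert_threshold


monitor = PerformanceMonitor(window=10, alert_threshold=0.8)

for _ in range(10):                 # ten correct predictions
    monitor.record(prediction=1, truth=1)
healthy = monitor.degraded()

for _ in range(4):                  # four errors push older results out of the window
    monitor.record(prediction=1, truth=0)
alert = monitor.degraded()

print(healthy, alert, monitor.rolling_accuracy())
```

Waiting for a full window before raising an alert avoids reacting to the first few cases. A production monitor would likely also track input drift and use the same imbalance-robust metrics that were used during development.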