In silico methods are used to predict a toxicological outcome directly from the chemical structure. In general, the state of the art is to use multiple methodologies in combination with an expert review of the prediction results.1 These approaches are used either as a direct replacement for an in vitro or in vivo assay or as part of a weight of evidence (WoE) assessment in combination with other results. They also support a variety of use cases, including regulatory submissions, screening, and chemical prioritization. In silico toxicology has a number of compelling and often unique advantages: such methods (1) support the reduction, replacement, and refinement of animal studies (3Rs); (2) often identify the structural basis of a prediction, which can support the design of new chemicals with reduced toxicity as well as an expert review; (3) may propose a mechanism of action associated with any predicted toxicity to support experimental planning and analysis of results; (4) do not require any test material; (5) are inexpensive to run; and (6) can generate results quickly, often in seconds. Of course, these benefits can only be realized once such methods have been validated. The following post identifies some considerations for assessing whether in silico models are fit for purpose.

Central to any validation is the prediction of an external test set containing a list of chemical structures with experimental data for the toxicological outcome being assessed. Such a set needs to include a sufficiently large number of diverse chemicals representative of real-world applications. External sets with small numbers of chemicals that only cover relatively small numbers of chemical classes have limited utility. Therefore, it is important to compare the external test set to the training or reference sets used in the development of the in silico models being assessed. Any overlapping chemicals between the two sets should be identified and any duplicates discarded from the external test set. It can also be helpful to compare the two sets based on chemical classes to support any interpretation of the validation results regarding the accuracy and coverage of the model (as well as the subset of model classes tested by the external validation set). When comparing multiple models with different training or reference sets, it is possible to introduce bias into the assessment if different subsets of the external test set are used to assess the comparative performance of the different models. Since most computational models use training data from the public domain, proprietary data is often a necessary source for external test sets.2
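The overlap check described above can be sketched in a few lines. This is a minimal illustration, assuming each chemical has already been normalized to a canonical identifier (for example, an InChIKey or canonical SMILES); the identifiers and outcomes below are entirely hypothetical.

```python
# Sketch: removing training-set overlap from a candidate external test set.
# Assumes structures have been normalized to canonical identifiers upstream;
# the "CHEM-..." identifiers here are illustrative placeholders only.

def deduplicate_test_set(training_ids, test_set):
    """Drop any test chemical whose identifier appears in the training set.

    training_ids: iterable of canonical identifiers used to build the model.
    test_set: dict mapping canonical identifier -> experimental outcome.
    Returns the filtered test set and a sorted list of discarded overlaps.
    """
    training = set(training_ids)
    overlap = sorted(k for k in test_set if k in training)
    filtered = {k: v for k, v in test_set.items() if k not in training}
    return filtered, overlap

training_ids = ["CHEM-001", "CHEM-002", "CHEM-003"]
test_set = {"CHEM-002": "positive", "CHEM-104": "negative", "CHEM-105": "positive"}

filtered, overlap = deduplicate_test_set(training_ids, test_set)
print(overlap)   # chemicals discarded from the external test set
print(filtered)  # chemicals retained for validation
```

In practice the identifier normalization step (salt stripping, tautomer handling, canonicalization) is where most of the effort lies, and a per-chemical-class comparison of the two sets would follow the same pattern.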

Having created an external test set, the next step is to run the in silico model(s) over this set and compare the predictions to the experimental data. One or more performance statistics are then calculated to provide an overall assessment. The selection of these statistics depends on the type of data being predicted: whether the data is binary (e.g., a chemical is mutagenic or non-mutagenic), ordinal (e.g., GHS categories representing ranges of values), or continuous (e.g., a NOAEL value). Relevant statistics also depend on the context of use (e.g., a complete replacement of an in vivo method or prioritization of chemicals), where the consequences of an incorrect prediction should be considered. This includes false positive and false negative predictions for binary endpoints, and more and less conservative predictions for ordinal or continuous endpoints. Many factors may need to be taken into consideration. For example, as part of a recent assessment of acute toxicity models that are being considered as a direct replacement of the acute in vivo test for classification and labelling purposes, it was considered that more conservative predictions were desirable since this would be protective of health; however, such conservative predictions may add to transportation or other costs, and so the overall accuracy of the predictions was also considered.2
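For a binary endpoint, the statistics discussed above all fall out of the confusion matrix. The following is a minimal sketch, with illustrative labels; degenerate cases (no positives or no negatives in the test set) are returned as None rather than handled.

```python
# Sketch: common performance statistics for a binary endpoint
# (e.g., mutagenic vs. non-mutagenic). Labels are illustrative.

def binary_stats(actual, predicted, positive="positive"):
    """Return sensitivity, specificity, and balanced accuracy."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    # False negatives lower sensitivity; false positives lower specificity.
    sensitivity = tp / (tp + fn) if (tp + fn) else None
    specificity = tn / (tn + fp) if (tn + fp) else None
    balanced = (
        (sensitivity + specificity) / 2
        if sensitivity is not None and specificity is not None
        else None
    )
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": balanced,
    }

actual = ["positive", "positive", "negative", "negative", "negative"]
predicted = ["positive", "negative", "negative", "positive", "negative"]
print(binary_stats(actual, predicted))
```

Whether sensitivity (protectiveness) or overall accuracy carries more weight depends on the context of use, as in the acute toxicity example above; for ordinal or continuous endpoints the analogous question is how over- and under-predictions are penalized.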

It is also important to consider how the model predictions will be interpreted and how any associated expert review will be performed, either as a standalone method or as part of the WoE. When using different models, the performance of each model may be assessed individually alongside the performance of different combinations of models. Different methodologies for deriving a consensus prediction from multiple models (and experimental data in the case of WoE approaches) will produce different overall results that may be optimized. In addition, some in silico methodologies may contribute towards enhancing the predictive performance whereas others may better support an expert review, yet both contribute to the ultimate assessment. Here, it may be helpful to consider the performance of the individual models, the performance of different approaches for deriving a consensus, and the performance before and after an expert review. It may also be helpful to consider the performance on different subsets of the external test set, identifying those predictions with high vs. low confidence or predictions across different chemical classes, in order to further refine any recommendations. Finally, it is important to consider the variability in the underlying experimental data, which will ultimately limit the performance of any in silico solution.
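One simple consensus scheme over multiple models can be sketched as a majority vote. This is only an illustration of the kind of rule that would be assessed during validation, not a recommended methodology; the conservative tie-break (ties called positive) is one plausible choice when protectiveness is preferred, and alternatives should be compared on the external test set in the same way.

```python
# Sketch: deriving a consensus call from several model predictions.
# None represents a model that did not cover the chemical (out of domain).
# The tie-break toward "positive" is an illustrative conservative choice.
from collections import Counter

def consensus(predictions, positive="positive", negative="negative"):
    """Majority vote over model calls, ignoring out-of-domain (None) results."""
    votes = Counter(p for p in predictions if p is not None)
    if not votes:
        return None  # no model covered the chemical
    if votes[positive] >= votes[negative]:
        return positive  # conservative tie-break
    return negative

print(consensus(["positive", "negative", None]))        # tie -> conservative call
print(consensus(["negative", "negative", "positive"]))  # clear majority
```

Reporting performance before and after expert review, and separately for high- and low-confidence subsets, follows naturally once each chemical carries a consensus call plus a confidence flag.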

Please do not hesitate to get in touch with one of our subject matter experts if you wish to discuss these approaches.


  1. Myatt et al., In silico toxicology protocols, Regul. Toxicol. Pharmacol. 96 (2018) 1–17.
  2. Bercu et al., A cross-industry collaboration to assess if acute oral toxicity (Q)SAR models are fit-for-purpose for GHS classification and labelling, Regul. Toxicol. Pharmacol. 120 (2021) 104843.

Published by Glenn Myatt

Glenn J. Myatt is the co-founder of Leadscope and currently Senior Vice President, In Silico & Translational Science Solutions at Instem with over 30 years’ experience in computational chemistry/toxicology. He holds a Bachelor of Science degree in Computing, a Master of Science degree in Artificial Intelligence and a Ph.D. in Chemoinformatics. He has published 37 papers, 11 book chapters and three books.