Big Data in Predictive Toxicology
- 1.1 Introduction
- 1.2 Big Data in the Area of Predictive Toxicology
- 1.3 The Big Vs of Predictive Toxicology Data
- 1.4 Challenges of Big Data in Predictive Toxicology
- 1.4.1 Need for an Adequate Infrastructure
- 1.4.2 Standardisation and Data Curation
- 1.4.3 Too Big? Identifying the Relevant Data
- 1.4.4 Data Integration Infrastructures
- 1.4.5 Making Sense of the Data
- 1.5 Opportunities Provided by Big Data for Predictive Toxicology
- 1.5.1 More is More – Benefits of a Broader and More Diverse Data Basis
- 1.5.2 Big Data for the Big Picture
- 1.5.3 Creating and Using New Knowledge: Applications for Hot Topics in (Predictive) Toxicology
- 1.6 Conclusions and Perspectives
CHAPTER 1: Big Data in Predictive Toxicology: Challenges, Opportunities and Perspectives
A. Richarz, in Big Data in Predictive Toxicology, ed. D. Neagu and A. Richarz, The Royal Society of Chemistry, 2019, pp. 1-37.
Predictive toxicology and model development rely heavily on data to draw upon and have historically suffered from the paucity of available and good quality datasets. The situation has now dramatically changed from a lack of data hampering model development to “data overload”. With high throughput/content screening methodologies being systematically used aiming to understand the mechanistic basis of adverse effects, and increasing use of omics technologies and consideration of (bio)monitoring data, the volume of data is continuously increasing. Big data in predictive toxicology may not have reached the dimension of other areas yet, such as real-time generated data in the health sector, but encompass similar characteristics and related challenges. Pertinent questions in this area are whether the new plethora of data are adequate for use in predictive toxicology and whether they address this area's most urgent problems. This overview chapter looks at the definition and characteristics of big data in the context of predictive toxicology as well as the challenges and opportunities big data present in this field.
1.1 Introduction
Predictive toxicology and model development rely heavily on data to draw upon and have historically suffered from the paucity of available, good quality datasets. As recently as 20 years ago, obtaining even relatively limited datasets of chemical structures and experimental bioactivities involved laborious manual collection, curation and compilation of data. Historically, the foremost problems in predictive toxicity modelling have been the scarcity of data and their uncertain quality. The data were produced manually, and thus slowly, by measuring physico-chemical properties and activities experimentally in the laboratory. As a result, models have been built on the same training sets, and data were only available for a limited number of properties and endpoints.
With the advance in laboratory automation and the paradigm change towards 21st Century Toxicology, focussing on adverse effect pathways and elucidating mechanisms of action, the availability of data has changed and, overall, improved. For example, initiatives such as ToxCast and Tox21 have produced publicly accessible bioactivity data for a large number of chemicals for many endpoints with high throughput/high content assays. Furthermore, omics technologies for screening changes in genomes, proteomes, metabolomes etc. have produced many data. Biomonitoring and epidemiological data are also increasingly available.
Thus, in a short time period, the situation for predictive toxicology has changed dramatically from a lack of data hampering model development to that of “data overload”. It can be argued that big data in predictive toxicology have not yet reached the dimension of other sectors, such as large-scale real-world data generated in the health sector.1,2 However, these data encompass similar characteristics and related challenges and represent big data for this specific field.
This introductory chapter looks at the definition of big data in the context of predictive toxicology and the challenges and opportunities big data present in this field.
1.2 Big Data in the Area of Predictive Toxicology
Big data are a "buzzword" in many areas of science and technology and have without doubt changed everyday life in many areas of society, for example through the "internet of things", by offering new opportunities to integrate and use information.
Given the ubiquity of the term, what are "big data" as they relate to predictive toxicology? Compared to other fields, the data taken into account for predictive toxicology are not yet generated in real time for instant analysis. In the closely related health care sector, for example, 24-hour data produced by monitoring functions in patients are already a reality. Many initiatives in the areas of drug discovery, medicine and public health aim to leverage big data. For example, the European Union Innovative Medicines Initiative (IMI) programme Big Data for Better Outcomes (BD4BO)3 supports the use of big data to improve health care by developing platforms for integrating and analysing diverse big data sets. In the area of drug discovery, possibilities to accelerate medicinal chemistry by using big data and artificial intelligence are being investigated.4 The European Lead Factory,5 a collaborative public–private partnership, has established the first European Compound Library and the first European Screening Centre, comprising up to 500 000 novel compounds. The EU Integrated Training Network project BIGCHEM6 provides education in large chemical data analysis for drug discovery, to enable the processing and use of the growing amount of biomedical data in chemistry and the life sciences.7 The EU ExCAPE project8 applies the power of supercomputers and machine learning to accelerate drug discovery.
Compared to the use of big data in other fields (e.g., finance and economics), the data in toxicology are still relatively small scale, but with regard to what went before, they have broken the mould and can be considered big data. Typical examples of big data in the context of toxicology are high throughput/high content screening data and data generated with omics technologies, gene arrays etc. Beyond "absolute" or "static" toxicity data and dose–response curves, the resolution of responses in time is becoming a major area of investigation, including toxicokinetic measurements – or models – to take into account, for example, the rate of absorption or metabolism. As a consequence, more data are being measured, particularly in the area of high throughput kinetics. Furthermore, as discussed in the 2017 US National Academies of Sciences, Engineering, and Medicine (NAS) report on "Using 21st Century Science to Improve Risk-Related Evaluations",9 the use of epidemiological data will play an important role in chemical risk assessment in the future. Monitoring data, either environmental monitoring or human biomonitoring, will also contribute to risk assessment and predictive toxicology, and form an important part of the growing big data in the area.
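To make the dose–response aspect concrete, the sketch below fits a three-parameter Hill model to a synthetic concentration–response series of the kind produced by high throughput screening. The data, parameter names and starting values are illustrative assumptions and are not taken from any specific ToxCast/Tox21 processing pipeline.

```python
# Minimal sketch: fitting a Hill-type concentration-response curve to
# high-throughput screening read-outs. The data are synthetic and the
# parameter names (top, ac50, slope) are illustrative only.
import numpy as np
from scipy.optimize import curve_fit

def hill(conc, top, ac50, slope):
    """Three-parameter Hill model: response as a function of concentration."""
    return top / (1.0 + (ac50 / conc) ** slope)

# Synthetic concentration series (micromolar) and noisy responses
conc = np.array([0.01, 0.03, 0.1, 0.3, 1.0, 3.0, 10.0, 30.0, 100.0])
resp = hill(conc, top=85.0, ac50=2.5, slope=1.2) + np.random.normal(0, 3, conc.size)

# Fit the model; p0 gives rough starting values for the optimiser
params, cov = curve_fit(hill, conc, resp, p0=[100.0, 1.0, 1.0], maxfev=5000)
top_fit, ac50_fit, slope_fit = params
print(f"top = {top_fit:.1f} %, AC50 = {ac50_fit:.2f} uM, Hill slope = {slope_fit:.2f}")
```

Fitted parameters such as the AC50 are the kind of summary value that large screening programmes derive from each concentration–response series before the data are stored and compared.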
Similar challenges in terms of compiling, storing, processing, analysing and integrating these big data apply as for other areas and types of big data. Their quality, adequacy and appropriateness for specific purposes need to be investigated. They do, however, offer opportunities to tackle problems which were previously out of reach, and make completely new applications possible, to support toxicology evaluation and chemical safety assessment.
1.3 The Big Vs of Predictive Toxicology Data
The characteristic attributes of big data, the "Big Vs", have been discussed frequently in the literature,10 starting from the three major V attributes that define big data, i.e., Volume, Velocity and Variety. Currently, between seven and ten or more Vs are described for big data,11–14 pointing to further attributes and issues that need to be taken into account when dealing with big data; these vary slightly according to situation and use. The additional Vs include Veracity, Variability, Validity, Visibility, Visualisation, Volatility, Vulnerability and Value (see Table 1.1). This section addresses how these characteristics of big data apply to the field of (predictive) toxicology; the eleventh V, Vulnerability, is not considered in this context.
Table 1.1 Ten Vs of big data and their applicability in predictive toxicology; related opportunities and challenges

| Big V | Applicability in predictive toxicology | Opportunities | Challenges |
| --- | --- | --- | --- |
| Volume | High number of data generated, e.g., high content screening (HCS), omics test read-outs, epidemiological and (bio)monitoring data | Broader data basis for modelling and elucidation of modes of action and pathways | Storing and processing large amounts of data; finding the relevant data in the flood of information; limits of capacities for data curation |
| Velocity | Speed of data generation increased, e.g., high throughput screening (HTS) | More rapid generation of data to fill gaps; ability to generate time-dependent data | Speed of data generation overtaking speed of storage, analysis and processing capacities |
| Variety | Many different types of data, e.g., chemical structures, results from a variety of assays, omics, time-dependent kinetics etc. | Different types of information that can be combined to get the full picture | Integration of different types of data. Representation and informatics processing of chemical structures is a challenge, especially if 3D transformations of structures have to be computed for many chemicals. Comparability of data from varied sources might not always be given |
| Veracity | Data quality, accuracy and reliability; uncertainties requiring data curation and evaluation of data quality | Large amounts of data might statistically compensate for inaccuracy in individual data when integrated | Data curation and evaluation of data quality for large amount of data |
| Variability | Intrinsic variability of biological data, e.g., inter- and intra-individual variations, genetic, population variations | Availability of a large amount of data might enable taking into account variations of parameters with the models built, enabling better prediction of population variations | Processing large amounts of variable data |
| Validity | Validity of the data for a specific application, e.g., for prediction of toxicity (for a specific endpoint) | Generation of many (types) of data possible in a targeted way, tailored for the specific prediction and toxicity assessment goal | Finding/choosing the relevant data for the specific application |
| Visibility | Data sharing, access to data sources leading to centralised databases and repositories | Large data sets more visible than small disparate data sets, preferred storage in centralised repositories or linked via hubs/portals | Making the data available and visible in an appropriate way |
| Visualisation | Representation of the data content | Supports and facilitates making sense of the varied and complex data sets, also supporting the organisation of the data | Visualisation of complex data difficult, methods to improve representations of the data content in a clear way are needed |
| Volatility | Data access might cease, or repositories disappear, which affects the sustainability of data resources | N/A | Appropriate storage and sustainability concept necessary |
| Value | Adequacy and usefulness for predictive toxicology and hazard/risk assessment, depending on specific risk assessment goal | Availability of many (types) of data as a broad data basis to understand mechanisms and pathways, in order to build informed predictive models | Extracting and distilling the knowledge from the large amount of data |
Volume: one of the three major defining characteristics of big data is the amount of data generated. In the field of toxicology, the large datasets currently produced are high throughput/high content screening (HTS/HCS) assay data, high throughput kinetics, and omics technology read-outs; it is highly probable that the use of large epidemiological and (bio)monitoring data sets will increase in the future. Examples of large data sets and databases making these data available are the ToxCast/Tox21 data15–18 for high throughput screening – thousands of chemicals screened in about 70 assays covering over 125 processes in the organism, with over 120 million data points produced so far19 (see Chapter 8); ChEMBL20 from the European Bioinformatics Institute (EMBL-EBI) for chemical, bioactivity and genomic data for bioactive molecules with drug-like properties;21 and TG-GATEs22 for large-scale toxicogenomics data23 (see Chapters 4 and 7). The EU research initiative HBM4EU,24 a joint effort of 28 countries, the European Environment Agency and the European Commission, aims to advance human biomonitoring and the use of the data for evaluating actual exposure to chemicals and the impact on human health. Data will be made accessible via the Information Platform for Chemical Monitoring (IPCHEM),25 the European Commission's reference access point for chemical occurrence data collected and managed in Europe.
Velocity: the speed of generation of toxicity data has increased significantly compared to historic data, by using automated assays, in particular high-throughput screening. Omics technologies also produce large datasets in a short time, compared to traditional assays. In addition, toxicology is moving away from static data to time-resolved recording of bioactivities. (Bio)monitoring data will gain increasing importance for risk assessment; these might also be recorded in real-time, for example, as daily monitoring and thus rapidly produce large amounts of data.
Variety: big data are also particularly characterised by their variety. Data relevant for predictive toxicology encompass many different types of data, such as results from a variety of assays in different formats, omics, time-dependent kinetics, metabolism information, and, most importantly in chemical assessment, chemical structures. These are linked both to the unique identity of the compounds and to specific activities or effects related to specific structural features ((quantitative) structure–activity relationships, (Q)SARs). The informatics processing of chemical structures is a particular challenge (see Chapter 3). Furthermore, there is also a variety of data sources, so that datasets are dispersed and often not connected electronically, and the comparability of data produced in different laboratories might not always be assured.
Veracity: this relates to the accuracy of the data and the reliability of the methods used to generate them. Data quality should always be evaluated; data may be prone to certain uncertainties. Data curation is needed in order to provide a reliable dataset. Data should ideally be generated according to Good Laboratory Practice (GLP) or Good In Vitro Method Practices (GIVIMP)26 and transparently documented. Whilst Klimisch scores have traditionally been used to assign quality scores to a study,27 there are currently other methods to evaluate study quality or potential bias.28,29
Variability: biological data possess an intrinsic variability: inter-individual variations, intra-individual variations of parameters, for example, depending on the time of the day, genetic and population variations. There is also variability due to different assay conditions or measurement errors. In the case of nanomaterials, there is an inherent distribution of the properties between nanoparticles.
Validity: validity encompasses, amongst many other aspects, both the reliability and relevance of the data for a specific application, e.g., the prediction of toxicity for a specific endpoint and risk assessment goal.30 In toxicological assessments, there is a formal process of test validation for in vitro assays, and validation of new methodologies, including predictive computational models, is currently being discussed. In general, the target chemical needs to be in the applicability domain of the method used. For example, some classic toxicity assays are not applicable to nanomaterials.31 Therefore, the validity of the data produced needs to be verified.
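As an illustration of checking whether a computational method is applicable to a target chemical, the sketch below implements one common, simple applicability domain check: nearest-neighbour Tanimoto similarity of the query structure to the model's training set. The SMILES strings, fingerprint settings and 0.5 threshold are illustrative assumptions, not a validated rule.

```python
# Minimal sketch of a similarity-based applicability domain check: flag a
# query chemical as "in domain" if its nearest neighbour in the training
# set exceeds a Tanimoto similarity threshold. All inputs are illustrative.
from rdkit import Chem
from rdkit.Chem import AllChem
from rdkit import DataStructs

training_smiles = ["CCO", "CCCO", "CCCCO", "c1ccccc1O", "CC(=O)O"]
query_smiles = "CCCCCO"

def fingerprint(smiles):
    mol = Chem.MolFromSmiles(smiles)
    return AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=2048)

train_fps = [fingerprint(s) for s in training_smiles]
query_fp = fingerprint(query_smiles)

similarities = DataStructs.BulkTanimotoSimilarity(query_fp, train_fps)
nearest = max(similarities)
in_domain = nearest >= 0.5  # arbitrary cut-off for this sketch
print(f"Nearest-neighbour Tanimoto similarity: {nearest:.2f}; in domain: {in_domain}")
```

In practice, applicability domain definitions also consider descriptor ranges, mechanistic coverage and endpoint relevance, not structural similarity alone.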
Visibility: datasets should be visible to potential users. Data sharing and free access to data (except for companies' in-house data) are important for data use in predictive toxicology, in order to allow the best use of the data, for example for model building. Ideally, data should be made available in centralised databases and repositories, and/or different data sources should be interconnected to facilitate integration and to allow the information to be used to its full potential.
Visualisation: in order to facilitate making sense of the varied and complex data sets, methods to improve representations of the data content in a clear way are needed, also supporting the organisation of the data.
Volatility: access to data might cease or websites with repositories disappear over time, as is the case, for example, at the end of research projects or companies going out of business. In order to provide continued access to the data, a sustainability plan should guarantee long-term archiving and access to the data resources.
Value: this relates to the adequacy and usefulness of the data for predictive toxicology and hazard or risk assessment, depending on the specific use and assessment goal. For example, data can be adequate for a number of purposes including: producing models or profilers useful for screening or flagging a potential hazard; allowing quantitative predictions for decision making; or for filling data gaps complying with regulatory information requirements. Value also relates to financial considerations, i.e., the cost, time taken and animal use, or the value of providing a valid alternative to an animal study, as well as ensuring the safety of the human population and environment.
1.4 Challenges of Big Data in Predictive Toxicology
After modellers have been requesting more data for decades, the question now seems to be whether there are too many data. There might be a risk of being overwhelmed by the flood of data and losing sight of the objective of the hazard or risk assessment to be undertaken. It will be increasingly necessary to identify the relevant data for the specific application to avoid being lost in "noise". Moreover, quantity cannot replace quality, and a thorough evaluation of the data and their adequacy for the purpose needs to be carried out, which becomes more difficult with increasing volume, variety and speed of generation of the data.
The 2017 NAS report highlights the big data challenge for risk-related evaluations, stating that
The 21st century science with its diverse, complex, and very large datasets, however, poses challenges related to analysis, interpretation, and integration of data and evidence for risk assessment. In fact, the technology has evolved far faster than the approaches for those activities.9
In this section, a number of the challenges that big data pose are highlighted – some are general issues encountered when dealing with data, magnified on a bigger scale, others specifically concern challenges of big data in the field of (predictive) toxicology.
The major challenges to take into account when considering the use of big data in predictive toxicology are represented in the scheme shown in Figure 1.1 and also summarised in relation to the Big Vs in Table 1.1. They concern sharing, accessibility, and processing of the data, data quality and comparability, interoperability of the data resources and integration of varied types of data, as well as identifying the relevant data needed for the intended purpose.
Figure 1.1 Key steps of big data in predictive toxicology from generation to application, related challenges and opportunities.
1.4.1 Need for an Adequate Infrastructure
First of all, data need to be available, which is now the case for a variety of types of data, generated with increasing speed and volume. In addition to data from high-throughput and omics technologies, large compilations of bioscience data and large datasets from monitoring and epidemiology are increasingly available. Moreover, new sources of data, such as mining of the grey literature and social media, e.g., Twitter, might be used more in the future, as well as closer connections to the pharmaceutical and biomedical fields.
Potential users need to know about the existence of the data in the first place. However, research project data are not always widely visible and, in the past, mostly existed as scattered, individual data collections. For significantly-sized datasets generated in large-scale endeavours, visibility is generally higher, for example the ToxCast/Tox21 data, which are highlighted through different channels. Ideally, data should be linked to major, well-known and well-trusted data repositories or portals.
The data should also be easily, publicly and freely accessible, which is not the case for in-house data, and has not always been the case for research project data. In the future, public crowd-sourcing data collection activities could play a role, as well as the use of new types of information from public sources, such as from mining social media and the world wide web. Questions of intellectual property rights and licensing will need to be solved. It is also of utmost importance that the data are reliably accessible, and respective measures to achieve this should be taken. Again, linking to established repositories with defined strategies for sustainability and long-term archiving will be helpful. Generally, data sharing and access to data resources have significantly increased as part of the big data era.
As is the case for big data in general, handling large volumes of data poses challenges regarding storage, retrieval, sharing and processing of the data, with the speed of data generation overtaking the speed of storage, analysis and processing capacities. In particular, the representation and informatics processing of chemical structures are a challenge, especially if 3D transformations have to be computed for a large number of chemicals (see Chapter 3). An adequate informatics infrastructure is essential. Technical advances have been made and have enabled the big data era, but they have to be specifically adapted to the types of data used in the area of (predictive) toxicology (see Chapters 2, 5 and 7). This field has to further adapt itself to the new realities, possibilities and availability of data and what they can offer.
1.4.2 Standardisation and Data Curation
Big data sets for predictive toxicology comprise a high degree of both variety and variability of the data. This concerns the intrinsic variability of biological data, variability related to different measurement procedures used by different laboratories, and measurement errors. In addition, there are errors in recording or transferring study results in general. Thus, the data sets are prone to various forms of uncertainty. To minimise these uncertainties, there is a need to increase data quality as well as the compatibility and comparability of the data. Efforts should be undertaken to standardise formats, agree on the minimum information required to be recorded for experimental data, and link to standardised ontologies. Standard test guidelines and standard operating procedures should be used to enable the compatibility and comparability of results and data from varied sources, e.g., when measured in different laboratories. This would also allow several datasets to be connected or merged. Such guidelines generally contribute to assuring a higher quality of the data. Good quality data are essential for building models that give accurate predictions of toxicity.
An example of the standardisation of tests is the set of standard Test Guidelines (TGs) adopted by the Organisation for Economic Co-operation and Development (OECD),32 which are regularly updated. Guidelines exist for most established toxicological in vivo and in vitro tests (see Chapter 2). The OECD Harmonised Templates (OHTs) are standardised formats supporting the processing of the results in databases, for example by means of the IUCLID software.33 Most templates correspond to specific OECD Test Guidelines. OECD Guidance Document 211 aims to harmonise the description of non-guideline (in vitro) test methods to allow evaluation and comparison of the results.34 ToxCast assays and results are described in this format. For other "new" types of data such as omics data, reporting frameworks are being developed in several international initiatives to increase the transparent and standardised recording and reporting of omics assay results. Examples are the summary of Minimum Information about a Microarray Experiment for Toxicogenomics (MIAME; US National Research Council),35 the development of a Transcriptomics Reporting Framework (TRF) at the OECD, and the ECETOC MEtabolomics standaRds Initiative in Toxicology (MERIT).36,37 At the laboratory level, Standard Operating Procedures (SOPs), with detailed descriptions of the test procedures to be performed, facilitate the generation of reliable and reproducible data.
A particular data quality issue for experimental chemical toxicity data is the correct identity and identification of the chemical actually tested and referred to. The measured bioactivity should be linked to an unambiguous chemical structure. Curation of data sets is needed to create high-quality databases.38
Although these requirements for generating, processing and curating data apply to data in general, they are more complex and challenging for big data sets comprising a large variety of data types to be investigated together for use in predictive toxicology; there are capacity limits for manual curation. Therefore, automated procedures can be used for the evaluation and curation of large data sets, including the correct assignment and processing of chemical structures. For example, a workflow has been developed by the US EPA for, amongst other things, normalising chemical structures and resolving chemical ID conflicts. A significant effort was also undertaken to curate, update and resolve conflicts in the over 15 000 chemicals in the KOWWIN™ data file (underlying the model in the US EPA Estimation Programme Interface (EPI) Suite™ software39) to improve (re)building of the models. Quality flags were added to the datasets.40,41 A publicly available KNIME42 workflow was developed to process Distributed Structure-Searchable Toxicity (DSSTox) Database files (comprising about 22 000 chemical structures) to create "QSAR-ready" structures, for example by standardising tautomers and removing stereochemistry.
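The sketch below illustrates the kind of standardisation steps such "QSAR-ready" workflows typically perform – desalting, tautomer canonicalisation and removal of stereochemistry – using RDKit. It is not the US EPA KNIME workflow itself, and the example SMILES string is arbitrary.

```python
# Illustrative sketch (not the US EPA workflow) of typical "QSAR-ready"
# standardisation: keep the largest fragment (desalting), generate a
# canonical tautomer and strip stereochemistry.
from rdkit import Chem
from rdkit.Chem.MolStandardize import rdMolStandardize

def qsar_ready(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None                                               # unparsable structure
    mol = rdMolStandardize.LargestFragmentChooser().choose(mol)   # remove salts/solvents
    mol = rdMolStandardize.TautomerEnumerator().Canonicalize(mol) # canonical tautomer
    Chem.RemoveStereochemistry(mol)                               # drop stereo descriptors
    return Chem.MolToSmiles(mol)                                  # canonical SMILES

# Example: a salt with defined stereochemistry reduced to a QSAR-ready parent
print(qsar_ready("C[C@H](N)C(=O)O.Cl"))
```

Applying such a routine consistently across a large collection means that the same parent structure always maps to the same record, which is what makes subsequent merging and modelling possible.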
The large data streams of ToxCast and Tox21 undergo rigorous quality assurance procedures (see Chapter 8). These include quality checks to assure the chemical's identity and purity, tracking sample problems such as limited solubility or volatility, which can impact on the assay result,43 as well as checking the bioactivity data. The automated processing of the results, necessary because of the large volume of data, may produce false positive or negative hit calls, and annotations with different flags are added in an additional processing step.44 Another example of a data curation process for a large data repository is the ChEMBL database, which has been curated through automated and manual steps in order to increase the accessibility, comparability and usefulness of the data, especially for modellers, by adding curated annotations.45–47
1.4.3 Too Big? Identifying the Relevant Data
The wealth of data brings new opportunities for predictive toxicology, both in terms of the amount of information and the more diverse types of information. However, it will also become increasingly difficult to grasp all existing data in the flood of information. As stated at a NAS workshop, big data can be described as "more data than you know what to do with".48 Users need to be able to find the information that is actually relevant to the purpose and toxicological question at hand. This can resemble looking for a needle in a haystack: for example, which gene changes are connected with the endpoint in question and will be predictive of an adverse health outcome? As more data become available, so does more noise, along with the risk of spurious correlations with unrelated data. It will be necessary to identify the appropriate, adequate and relevant data and distil out the adequate information – which might depend on the context of the prediction to be made or the toxicological question to be answered. Thus, the challenge has shifted from finding any useful data at all to identifying and selecting the relevant data for the specific application, and possibly to finding the sweet spot in the amount (and types) of data sufficient to support predictive modelling and safety assessment.
There are different approaches to finding data and information: either broad data or text mining, in an attempt to "fish" for all possibly relevant data and then increasingly narrow down the search results in view of the respective application;49,50 or a targeted search in all possible data resources to answer a specific question or fill a data gap.
Text mining for extracting information from large amounts of data is combined with complementary approaches and analytic tools in order to leverage the information and enable knowledge discovery. It has been applied, for example, in the field of toxicogenomics to discover information not yet available in structured databases, e.g., to identify early predictors of toxicity.50 Another example is the development of automatic data mining methods to extract relevant bioassays – ranked by means of a newly developed scoring system – from publicly available bioassay databases. For instance, data from 739 000 assays in PubChem have been mined to generate response profiles for evaluating the toxicity of a large group of nearly 5000 target compounds.49,51
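As a minimal illustration of a targeted (rather than broad) search, the sketch below retrieves a bioassay summary for a single compound from PubChem via its PUG REST interface and tallies the reported activity outcomes. The URL pattern and response fields are assumptions based on the public PUG REST documentation and should be verified against the current service.

```python
# Hedged sketch of targeted data retrieval from a public bioassay resource:
# pull the assay summary for one compound (CID 2244, aspirin) and count how
# often each activity outcome is reported. Response structure assumed from
# PUG REST documentation; check before relying on it.
import requests

CID = 2244  # aspirin
url = f"https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/{CID}/assaysummary/JSON"

resp = requests.get(url, timeout=60)
resp.raise_for_status()
table = resp.json()["Table"]                 # tabular summary: column names plus rows
columns = table["Columns"]["Column"]
rows = [dict(zip(columns, row["Cell"])) for row in table["Row"]]

outcomes = {}
for row in rows:
    outcome = row.get("Activity Outcome", "Unspecified")
    outcomes[outcome] = outcomes.get(outcome, 0) + 1

print(f"{len(rows)} assay records for CID {CID}: {outcomes}")
```

Scaling this pattern from one compound to thousands, and ranking the retrieved assays by a relevance score, is essentially what the mining approaches cited above automate.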
1.4.4 Data Integration Infrastructures
The data sets and repositories in toxicology comprise a variety of different types of data: in vitro or in chemico assay results, in silico predictions, gene arrays and omics read-outs for different toxicological endpoints (see Chapters 2 and 7), epidemiological, monitoring and exposure data. In addition, chemical structures may be stored in different representations (see Chapter 3), along with further substance and methodological details. All this information needs to be processed informatically and data set structures need to be harmonised to allow integration of the different data streams in large repositories (see Chapter 5).
There are different technical approaches related to data integration:
Creating large repositories and feeding in available data after curation and standardisation
Connecting data resources directly by enabling technical compatibility and interoperability of data resources and data formats
Portals allowing the search of data repositories in parallel
Overarching data infrastructures.
The interoperability between data resources is enabled by Application Programming Interfaces (APIs). Connection of data from different data repositories, which might use different terms for the same content, also relies on a mapping to ontologies for harmonisation and comparability (see Chapters 4 and 5). An example is the ChEMBL database, which has been integrated with other data resources using open standards and ontologies, after mapping to the Resource Description Framework (RDF).52 Similarly, the annotation of data with metadata is crucial for interpretation, comparability and connection of data across data sources.
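As a small illustration of API-based interoperability, the sketch below queries the ChEMBL web services for a few bioactivity records of a single compound. The endpoint path, filter names and response keys are assumptions based on the published ChEMBL REST documentation and may change between releases.

```python
# Hedged sketch of programmatic access to a curated repository through its
# public API: retrieve a few bioactivity records for aspirin (CHEMBL25).
import requests

url = "https://www.ebi.ac.uk/chembl/api/data/activity.json"
params = {"molecule_chembl_id": "CHEMBL25", "limit": 5}

data = requests.get(url, params=params, timeout=60).json()
for act in data.get("activities", []):
    # Print the target and the standardised activity value with its units
    print(act.get("target_chembl_id"),
          act.get("standard_type"),
          act.get("standard_value"),
          act.get("standard_units"))
```

A SPARQL query against the RDF representation of the same data would follow the same principle of machine-to-machine access, with the added benefit of ontology-based linking across resources.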
Integration of structurally different data is required, such as structured data from curated databases and unstructured content such as full text from the scientific literature. The Semantic Enrichment of the Scientific Literature (SESL) project has used approaches to connect these two types of data via semantic web standards.53 Similarly, the Open PHACTS project created an "open pharmacological space" using semantic web standards.54 Other examples come from the field of drug discovery, for example connecting, mining and sharing heterogeneous biological and chemical data from low-throughput and big data sources in a community-based platform linking informatics and social networks.55 An analysis of the concordance of pharmaceutical toxicity in different species, with over 1 600 000 recorded adverse events, combining text mining and statistical analysis, showed the advantage of a large body of analytical results but, on the other hand, the challenge posed by the lack of a controlled vocabulary and by non-aligned ontologies between data sources.56
Portals or warehouses can encompass many databases as an overarching "one stop" reference point, and allow the properties of a target chemical to be searched for simultaneously across all linked databases. For example, the OECD eChemPortal57 currently comprises 34 data collections that can be queried in a single search. The data resources in eChemPortal include Canada's Existing Substances Assessment Repository (CESAR), the European Chemicals Agency's Dissemination portal with information on REACH registered chemicals, the Organisation for Economic Co-operation and Development Existing Chemicals Database, the Data Bank of Environmental Properties of Chemicals and the Information Platform for Chemical Monitoring IPCHEM. The US EPA Aggregated Computational Toxicology Resource (ACToR),58,59 a warehouse for US EPA web applications, allowed searching thousands of toxicity testing results and hazard, exposure and risk assessment data for over 500 000 chemicals, by aggregating over a thousand public sources. ACToR is now integrated in the iCSS CompTox Chemistry Dashboard,60,61 which also links to external resources such as PubChem, ChemSpider and DrugBank. The European Commission's Joint Research Centre provides the ChemAgora Portal,62 which allows on-the-fly searching of connected repositories, including ChEMBL, ToxNet and the AOP Wiki. The Data Infrastructure for Chemical Safety Assessment (diXa) resource has been created as a central data repository and warehouse for toxicogenomics data, connected to a portal linking to chemical, molecular and phenotype data.63 ELIXIR is an intergovernmental organisation coordinating a "distributed infrastructure for life-science information".64 Part of its activities is the ELIXIR Data Platform, which identifies Core Data Resources and Deposition Databases.
Other initiatives and approaches contribute to facilitating the handling, analysis and integration of big data resources. For example, the EU ExCAPE project uses supercomputers and machine learning technologies to accelerate drug discovery. The project's ExcapeDB65 is an integrated large-scale dataset facilitating big data analysis and predictive model building in chemogenomics. BIOGPS is a semiautomated approach to navigate biological space and identify structurally similar protein binding sites. It automatically prepares protein structure data, identifies and aligns binding sites, and compares GRID Molecular Interaction Fields.66
The Online Chemical Modelling Environment (OCHEM)67 aims to connect data resources of experimental measurements with a modelling framework, as a “crowd-sourcing” platform where users can share and thus connect results and cheminformatics tools in order to facilitate QSAR model development.68 The National Institutes of Health (NIH) Big Data to Knowledge (BD2K) initiative,69 launched in 2012, aims to support the broad use of biomedical big data by making them Findable, Accessible, Interoperable, and Reusable (FAIR), as well as facilitating discovery and new knowledge. Supported by the BD2K Initiative, the Center for Expanded Data Annotation and Retrieval (CEDAR)70 was established in 2014 to further the development and use of biomedical metadata, with a view to enhancing data interpretation and discovery.71 CEDAR is working on community‐driven standards for representing biomedical metadata, focussing on templates for describing different types of experiments, as well as a machine-readable repository for these standards.
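To make the idea of metadata annotation more tangible, the sketch below shows a minimal, entirely hypothetical metadata record for a single assay data set. The field names are illustrative and are not taken from any official CEDAR template; they merely indicate the kind of context that makes a toxicological data set findable, interpretable and reusable.

```python
# Illustrative (hypothetical) minimal metadata record for one assay data set,
# in the spirit of FAIR annotation and template-based metadata. Field names
# are assumptions for this sketch, not an official standard.
import json

assay_metadata = {
    "dataset_id": "EXAMPLE-0001",                 # hypothetical identifier
    "title": "ER agonism screen, example compound subset",
    "assay_type": "in vitro reporter gene assay",
    "endpoint": "estrogen receptor agonism",
    "organism": "Homo sapiens",
    "test_system": "HEK293 cell line",
    "guidance_followed": "OECD GD 211 description format",
    "units": "AC50 in uM",
    "chemical_identifiers": ["CAS", "InChIKey"],
    "license": "CC-BY-4.0",
    "contact": "data.owner@example.org",
    "generated": "2019-06-30",
}

print(json.dumps(assay_metadata, indent=2))
```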
In general, in the big data era, new technologies such as clouds have opened up new opportunities to store, share and handle data as well as virtually connecting disparate data sources. New cheminformatics/bioinformatics technologies are available to integrate the different types of data available in different formats across the connected repositories, i.e., to apply an integrative data analysis (see Chapter 5).
Apart from the informatics and technical issues, a major challenge in integrating toxicological data and results for predictive toxicology and chemical safety assessment lies in the integration of the data content, assessing the weight of the different data, possible contradictions, their overall relevance and contribution to the predictive question, as well as their uncertainties or possible bias.
1.4.5 Making Sense of the Data
From large-scale datasets and repositories, connected or separate sources, to official sources or data brought together in the scientific communities via crowd-sourcing, there are many different data, data types and data formats available. It has been said that “We are drowning in information, while starving for wisdom”,72 a situation that has been amplified in the big data era. It is essential to make sense of the data in order to be able to use them successfully and build new knowledge.
Setting aside the technical interoperability of datasets and formats, the next challenge is the structuring of the data, to help the potential user find a way through the data jungle. Ideally, data should be presented in a form that helps the user to make sense of them. In general, visualisation of large compound collections can be achieved, for example, by projecting them into a low-dimensional space, to reduce complexity and allow visual and intuitive analysis.7,73–75 Techniques for data reduction and comparison include Principal Component Analysis (PCA), Generative Topographic Mapping (GTM), K-means clustering and Bayesian methods, amongst many others. Visualisation of large collections of diverse and complex data is per se more challenging. Visualisation can be multidimensional and, in particular, needs to include a temporal dimension for time-dependent data. The US EPA iCSS CompTox Chemistry Dashboard is also an example of the visualisation of toxicity/bioactivity data. It serves as a graphical user interface for searching for information from either a chemical viewpoint, including drawing chemical structures, or an assay viewpoint. The results of the query and the biological activity for assay combinations are presented in the form of graphs or charts.
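The sketch below gives a minimal example of such a projection: Morgan fingerprints computed with RDKit are reduced to two principal components with scikit-learn. The short SMILES list stands in for a large compound collection; for real libraries the resulting coordinates would be plotted rather than printed.

```python
# Minimal sketch: project a compound collection into two dimensions for
# visual inspection using fingerprints and PCA. The SMILES list is a tiny
# stand-in for a large library.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.decomposition import PCA

smiles = ["CCO", "CCN", "CCCl", "c1ccccc1", "c1ccccc1O", "c1ccccc1N",
          "CC(=O)O", "CC(=O)N", "CCOC(=O)C", "C1CCCCC1"]

def fp_array(smi, n_bits=1024):
    mol = Chem.MolFromSmiles(smi)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=n_bits)
    return np.array(fp)                    # bit vector as a 0/1 numpy array

X = np.array([fp_array(s) for s in smiles])

pca = PCA(n_components=2)
coords = pca.fit_transform(X)              # one (x, y) point per compound

for smi, (x, y) in zip(smiles, coords):
    print(f"{smi:12s}  PC1={x:6.2f}  PC2={y:6.2f}")
```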
Another important aspect is the adequate annotation of the data, by introducing sufficient metadata describing the context of the data generation, the data format etc. and, in general, a description of the data source and data collection, helping to understand the nature of the data. An example is the ToxCast Manual,76 linking to a list of chemical tests, assay descriptions and protocols, and other related user guidance and documentation.77 In view of harmonisation, the assay descriptions are summarised in the format recommended in the OECD Guidance Document 211 34 for describing non-guideline in vitro test methods.
In terms of rationalising and visualising the relationships between different biological activities, reactions with biological molecules and the adverse effects that may result, the concept of pathways, and in particular the Adverse Outcome Pathway (AOP) framework, has been established in the last decade.78 The AOP Knowledge Base79 and AOP Wiki80 have been introduced as practical tools to collect and bring mechanistic data and knowledge together in a crowd-sourcing effort. In addition, the Effectopedia81 application aims to collect qualitative and quantitative data that are related to the (key) events in the pathways, as a means of organising and integrating different assay results and data types related to the same biological events. As such, it envisions guiding the user who is attempting to navigate the information.
The challenges described in this section need to be addressed and are prerequisites for making the best use of the wealth of data for supporting human health or environmental safety assessment. Once the relevant data are located and initially processed, they need to be analysed, evaluated, interpreted and integrated to answer set problems or create knowledge. The question is whether the new plethora of data are adequate for use in predictive toxicology and risk assessment and whether they can address some of the most urgent problems. Opportunities and examples for the use of big data in predictive toxicology are discussed in the next section.
1.5 Opportunities Provided by Big Data for Predictive Toxicology
Biological and toxicological big data, as new and diverse data streams, offer many opportunities in the context of chemical risk assessment, by virtue of the big data characteristics related to their volume, velocity, variety or overall value (see Figure 1.1). This section discusses how these big data contribute to forming a comprehensive overview of chemical hazards and risks, as well as increasing and creating new knowledge in toxicology and providing support for the building of better predictive models.
Some issues in predictive toxicology and chemical risk assessment cannot be adequately addressed at the current time due to the complexity of the problem. This relates, for example, to safety assessment and computational models for predicting nanomaterial toxicity82 as well as complex chemical mixtures or exposure scenarios, such as aggregated exposure or “real-life” co-exposure from different sectors of chemical uses.83 Specific aspects and examples of how big data could be applied to address these “hot topics” in predictive toxicology for chemical risk assessment are discussed in this section.
1.5.1 More is More – Benefits of a Broader and More Diverse Data Basis
The 2017 NAS report on “Using 21st Century Science to Improve Risk-Related Evaluations”,9 which followed up on the reports on the vision of 21st Century toxicity testing84 as well as exposure science in the 21st Century,85 acknowledges that various data streams from exposure science, toxicology and epidemiology have not been taken into account sufficiently. It also emphasises the value of these new types and volumes of data for risk assessment.9 The importance of exposure information is highlighted in particular. More and more diverse information from new data sources is becoming available. This broader data basis, also obtained by connecting different large data sources, gives a broader picture, and enhances the weight of evidence, providing more certainty. In addition, new types of information not previously taken into consideration can provide new perspectives for the same issues of chemical risk assessment. Therefore, a transdisciplinary approach, exploring new and increasingly diverse types of data beyond classic toxicology, is key to leverage the new information available to enhance (predictive) toxicology.
New technologies have provided new means to monitor the impact of chemicals on biological systems and in the environment, for example mobile and remote sensors to trace pollutant concentrations86–88 or “crowd-sourced” monitoring data.89 Biomonitoring initiatives contribute to increasing exposure information: the HBM4EU project is coordinating human biomonitoring in Europe and centralised data collection and sharing. IPCHEM is key for this endeavour and generally for enhancing access to monitoring data, connecting separate data sets and making them explorable in a single search by chemical or location. Epidemiological studies are able to provide the link, possibly confirming causality, between exposure and health effects by biologically plausible statistical associations.90 Furthermore, data and knowledge from close disciplines such as pharmaceutical research and public health should be increasingly considered to support toxicology and chemical safety assessment.2
New sources and types of information can also be found by mining grey literature and other sources in the world wide web, information on bioactive chemicals from patents,91 as well as social media for communication and consumer behaviour information. Similar techniques are already being used, for example, in the area of pharmacovigilance,92–94 and could systematically be extended to identify and capture early signs of adverse effects from chemicals. Furthermore, additional information on intended or non-intended actual uses might be obtained from these channels and contribute to a realistic picture of exposure to a substance.
It is generally acknowledged that "quantity is not quality". However, it is sometimes considered that data quality deficiencies might be outweighed by a sufficiently large amount of data, i.e., that shortcomings of lower quality or erroneous data points might be overcome by the sheer number of – mostly correct – data points. If large-scale data sets are available, ideally from different sources, connecting these data and identifying the overall trends can compensate for some "inaccurate" data points. As part of this process, outliers can be identified and ignored in model development. Overall, a larger number of data points increases the weight of evidence, which reduces uncertainty.
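As a simple illustration of how redundancy in a merged data set can be exploited, the sketch below flags measurements that deviate strongly from the consensus value for the same chemical, using the median and the median absolute deviation (MAD). The values and the cut-off are illustrative assumptions.

```python
# Sketch of consensus-based outlier flagging across merged measurements for
# one chemical: values far from the median (in MAD units) are flagged.
import numpy as np

# Hypothetical log(LC50) values for one chemical compiled from several sources
values = np.array([1.02, 0.98, 1.05, 0.95, 1.10, 2.60, 1.00])

median = np.median(values)
mad = np.median(np.abs(values - median))
# 0.6745 scales the MAD to be comparable to a standard deviation for normal data
modified_z = 0.6745 * (values - median) / mad

for v, z in zip(values, modified_z):
    flag = "OUTLIER" if abs(z) > 3.5 else "ok"   # 3.5 is a common rule of thumb
    print(f"value={v:5.2f}  modified_z={z:6.2f}  {flag}")
```

Here the single discordant value is flagged while the consensus of the remaining sources is retained for modelling.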
In order to obtain a “critical mass” of data, i.e., a big data set of sufficient size, several data sources can be compiled or integrated in a single overarching repository, or otherwise connected. This again shows that data sharing is essential. Crowd-sourcing efforts can contribute to obtaining the necessary number and types of data. As such, models for predictive toxicology can be created from data even if scattered over several systems or repositories, as long as they can be technically, and semantically, connected (see also Chapter 5). Boyles et al. described a vision for computationally enabled semantic toxicology linked to the concept of ontologies.95
Thus, big data offer the opportunity to build models on a broader data basis and with greater statistical robustness. Furthermore, with new types of data becoming available, such as data from human cell lines, targeted high-throughput data generation with assays relevant to the human organism, large-scale human biomonitoring data and epidemiological data on exposure and adverse health effects, there is an opportunity to build more specific and human-relevant predictive models. This corresponds to the aim of chemical safety assessment, which is to ensure the safety of the human population as well as an intact environment; classic toxicological assessments using animal tests were used as a surrogate for this aim. Predictive toxicology should therefore not restrict itself to simply reproducing the results of toxicity tests in surrogate species, but should contribute relevant and valid predictions to the overall aim of safety assessment and decision making.
The new wealth of data also calls for new modelling approaches to use its full potential, at the same time opening up new opportunities for model development. Deep Learning approaches – machine learning algorithms based on multi-layered artificial neural networks, used in other areas such as face or speech recognition or fraud detection and suggested for use in pharmaceutical research96 – have been introduced specifically for toxicity prediction and virtual screening.97 Since Deep Learning is able to analyse and learn from large amounts of uncategorised data, it is very well suited to dealing with big data and their variety.98 In fact, Deep Learning approaches were used in the winning entries of the Tox21 Data Challenge to build predictive models, based on the Tox21 big data, for toxicity related to nuclear receptor and stress response pathways. This was one of the first applications of Deep Learning methods in computational toxicology99 (see Chapter 8). The deep nets automatically learned features resembling well-established toxicophores.100 In general, the application of artificial intelligence approaches in predictive toxicology is being explored. This goes beyond the concept of machine learning, using the wealth of information now available in order to distil and simulate human expert knowledge and decisions.101
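The sketch below shows a minimal multi-task set-up in the spirit of such models – one network with shared hidden layers predicting several binary toxicity labels at once – using scikit-learn on synthetic data. It is not the winning challenge pipeline; all names, dimensions and labels are illustrative.

```python
# Minimal multi-task sketch: a single feed-forward network with shared hidden
# layers predicting several binary toxicity labels. Fingerprints and labels
# are synthetic stand-ins for real assay data.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_compounds, n_bits, n_tasks = 500, 256, 4

X = rng.integers(0, 2, size=(n_compounds, n_bits)).astype(float)  # mock fingerprints
W = rng.normal(size=(n_bits, n_tasks))
# Mock labels derived from random linear combinations of the fingerprint bits
Y = (X @ W > np.median(X @ W, axis=0)).astype(int)

X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.2, random_state=0)

# Shared hidden layers act as a common representation for all toxicity tasks
model = MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=300, random_state=0)
model.fit(X_tr, Y_tr)

Y_pred = model.predict(X_te)
per_task_acc = (Y_pred == Y_te).mean(axis=0)
print("Per-task accuracy:", np.round(per_task_acc, 2))
```

The shared-representation design is what allows information from data-rich tasks to support predictions for data-poor ones, one of the reasons multi-task networks performed well on the Tox21 data.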
1.5.2 Big Data for the Big Picture
A European Commission Communication on data, information and knowledge management defined that knowledge “is acquired through analysis and aggregation of data and information, supported by expert opinion, skills and expertise”.102 The availability of a wealth of information and a broad data basis of a variety of types of data from different disciplines thus offers the opportunity to create new knowledge from the analysis and integration of these data, giving the full picture of chemical toxicity assessment from different perspectives.
This “big picture” that the big data provide is a holistic view of the area, enabled through fitting together the many “pieces of the puzzle” of a variety of information. These would not be apparent if big data were analysed piecemeal. The integration of the diverse data streams of biological/toxicological big data, which might not have been considered together in the past, enables new insights. Since the whole is greater than the sum of its parts, the variety of the toxicological big data pool plays an important role, contributing different perspectives and enabling the identification of inferences and distilling of new insights in toxicology and the effect of chemicals, and thus contributing to the understanding of their mechanisms of action – a prerequisite for building predictive models. In particular, it is the large-scale nature of the data resources that allows recognition of overarching patterns and general trends that would be lost and remain unidentified if only smaller datasets were investigated. Bespoke pattern recognition technologies can contribute to the analysis of these datasets. Examples for “big pictures” supported by toxicological big data include the ensemble of toxicological pathways and their perturbations as well as the universe of chemicals as considered for chemical safety assessment.
Biological big data have enormous potential to contribute to the continuing elucidation of AOPs and related knowledge discovery.78,103 Firstly, screening the data and identifying patterns and connections can reveal new pathways, or recognise so far undiscovered connections between AOPs, forming AOP networks.104 Secondly, big data can help to enrich these AOP networks with quantitative bioactivity data to establish quantitative AOPs with dose–response relationships. Conversely, AOPs can be seen as a framework to anchor the huge amount of increasingly available big data, contributing to their comprehension in a toxicological context. As such, AOPs can serve as a reference point to connect different data streams and integrate their information content (see Section 1.4.5).
The biotechnology revolution and its omics technologies have made it possible, in a systems toxicology approach,105 to monitor many cellular pathways in parallel106 and indicate perturbations of pathways if these are known. A workshop has explored in particular how omics data can further support the AOP framework for chemical risk assessment, identify molecular initiating events as well as provide evidence for key events at different organisational levels,107 contributing to a data-driven shift in the assessment of large numbers of chemicals. Adverse outcomes can be multifactorial in nature, i.e., be linked to multiple mechanisms or connected to different pathways and initiating events,9 even over a long period of time. For example, a combination of effects of individual, non-carcinogenic chemicals acting via different mechanisms on different pathways and target levels in the organism can lead to carcinogenesis,108,109 according to the hallmarks of cancer.110 Therefore, it is important, and within reach with new big data streams, to cover the big picture of pathways and consequences of their perturbation and synergies of dissimilar processes.108
The increasing amount of data results in increasing coverage across different toxicity endpoints, allowing the full picture of the properties and hazards of a substance to be seen, connections between endpoints to be made, and knowledge from one toxic effect to be leveraged for others. In this way, the development of predictive toxicology models and expert systems can draw on a broad basis and a holistic view of the various interconnected effects that a toxicant can have in the organism. Discovering new knowledge by cross-linking unconnected areas has been successfully applied in the domain of drug discovery. Studies have shown, for example, that drugs with similar safety profiles have similar polypharmacologies.111 Finding such cross-links within the pool of toxicological big data could be applied in an analogous way in the field of toxicology and chemical safety assessment.
The large universe of information available through chemical, biological, bioactivity and toxicological big data also provides the opportunity to contribute to the big picture of chemicals and assist the overall mapping of the chemical universe. Big data may allow the connection of disconnected areas of information, filling hitherto existing data gaps and further enriching the chemical space. ToxCast/Tox21, for example, contributes to the chemical landscape by targeted in vitro bioactivity data generation, and enables strategic data mining and predictive toxicity model development.112,113 In terms of computational modelling, the mapping of the chemical space also contributes to the definition of applicability domains for model development and application.
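As an illustration of how chemical-space coverage feeds into the definition of an applicability domain, the sketch below computes leverage (hat) values for query chemicals against a training descriptor matrix and flags those beyond the conventional warning threshold h* = 3(p + 1)/n. This is one common leverage-based approach, shown here as a minimal NumPy example and not as the method of any specific model cited in this chapter; descriptor matrices and thresholds would be chosen case by case in practice.

```python
import numpy as np

def leverage_applicability_domain(X_train, X_query):
    """Flag query chemicals inside a leverage-based applicability domain.

    X_train, X_query: descriptor matrices (rows = chemicals, columns = descriptors).
    Returns a boolean array: True where the query leverage is at or below h* = 3(p + 1)/n.
    """
    X_train = np.asarray(X_train, dtype=float)
    X_query = np.asarray(X_query, dtype=float)
    n, p = X_train.shape
    h_star = 3 * (p + 1) / n                        # conventional warning leverage
    xtx_inv = np.linalg.pinv(X_train.T @ X_train)   # pseudo-inverse for numerical stability
    h_query = np.einsum("ij,jk,ik->i", X_query, xtx_inv, X_query)
    return h_query <= h_star
```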
1.5.3 Creating and Using New Knowledge: Applications for Hot Topics in (Predictive) Toxicology
The new data streams and the new knowledge drawn from toxicological big data may help to address the major challenges and support current trends in (predictive) toxicology and chemical risk assessment. These are the areas where toxicological big data can add most value.
1.5.3.1 Prediction of Emerging Risks
In chemical risk assessment and management, even when all precautions are taken and evaluations are performed according to the best current knowledge, the emergence of an unexpected hazard or new risk can never be excluded. The International Risk Governance Council (IRGC) defines emerging risks as “new risks or familiar risks that become apparent in new or unfamiliar conditions”.114 Reference is also often made to so-called black swan events: unexpected events that were not foreseen, or not even envisioned as possible. Flage and Aven state that for both types of events “knowledge is the key concept”.115
Risk assessment is confronted with known unknowns and unknown unknowns: either it is known that there is a potential problem, but sufficient or adequate data to evaluate the situation are lacking, or the nature of the risk has not been encountered before at all. Emerging risks can be newly created risks caused by new substances or new technologies (such as nanoforms of an already known bulk material, with different properties) or by new types of application; existing issues that are only identified or understood by means of new knowledge (for example, newly recognised relationships between biological pathways); or an increased incidence of an issue due to changed habits or qualitatively or quantitatively new types of exposure (for example, hazardous compounds from electronic waste unexpectedly appearing in toys because of recycled plastics).
Detecting and recognising “early” warning signals, e.g., of adverse effects, is crucial; however, they might not always be apparent from the standard information and processes available. The concept of gathering, centralising or connecting as much information as possible has already been realised in networks such as the network of reference laboratories, research centres and related organisations for the monitoring and biomonitoring of emerging environmental substances (NORMAN116) or the MODERNET network (Monitoring trends in Occupational Diseases and tracing new and Emerging Risks in a NETwork117).
However, the wealth of available data now offers new and so far unavailable opportunities to detect and amplify signals, and to identify patterns within the overall “noise” of the big data. An example from the clinical sciences: the innovative real-time quantification of the shape and variability of the arterial waveform, using a mathematical method suited to long non-stationary data streams, allowed an emerging health problem (sepsis) to be predicted with high sensitivity at an early stage, which was not possible with the classic clinical measurement of the pulse, i.e., from small parts of the signal only.118 Toxicological big data streams provide the opportunity for similar applications in the future. With more data streams available, it is possible to link different types of information, make connections, and discover trends and patterns that were not visible before and would otherwise have been missed. In other words, big data related to chemical properties and toxicological information can create new knowledge that supports the identification of emerging risks, and even opens up the possibility of predicting otherwise unforeseen events.
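The principle of pulling an early signal out of a noisy data stream can be illustrated with a generic rolling z-score check. The sketch below assumes pandas and an invented monitoring series, and flags time points where a monitored value drifts well outside its recent baseline; it is not the waveform method of ref. 118, only a minimal stand-in for the idea of continuous, data-driven surveillance.

```python
import pandas as pd

def flag_early_warnings(series: pd.Series, window: int = 50, threshold: float = 3.0) -> pd.Series:
    """Mark points that deviate strongly from the rolling baseline of a data stream.

    A point is flagged when it lies more than `threshold` rolling standard deviations
    away from the rolling mean of the preceding `window` observations.
    """
    baseline = series.rolling(window, min_periods=window).mean().shift(1)
    spread = series.rolling(window, min_periods=window).std().shift(1)
    z = (series - baseline) / spread
    return z.abs() > threshold

# Illustrative use on a hypothetical monitoring stream (e.g., a biomarker measured over time):
# signal = pd.read_csv("monitoring_stream.csv")["biomarker"]   # hypothetical file and column
# alerts = flag_early_warnings(signal)
```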
As discussed, new data streams can be explored for toxicological applications, such as epidemiological data, biological monitoring data or “signals” mined from the world wide web or social media. Mining and analysing social media data is already common practice in the field of pharmacovigilance for detecting emerging adverse drug effects.92,93 Another example is untargeted environmental metabolomics to detect environmental stressors (xenobiotic substances) and elucidate mechanisms when the mode of action is so far unknown.119
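As an illustration of how such mined reports are turned into quantitative signals, a widely used disproportionality measure in pharmacovigilance is the proportional reporting ratio (PRR), computed from a 2 x 2 contingency table of reports. The sketch below is a minimal, self-contained calculation; the counts are invented for the example and the PRR is only one of several disproportionality statistics in routine use.

```python
def proportional_reporting_ratio(a: int, b: int, c: int, d: int) -> float:
    """Proportional reporting ratio (PRR) from a 2x2 report table.

    a: reports mentioning the drug of interest AND the adverse event
    b: reports mentioning the drug of interest WITHOUT the adverse event
    c: reports of all other drugs WITH the adverse event
    d: reports of all other drugs WITHOUT the adverse event
    """
    return (a / (a + b)) / (c / (c + d))

# Hypothetical counts mined from adverse event reports or social media posts
print(proportional_reporting_ratio(a=30, b=970, c=200, d=49800))  # PRR = 7.5
```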
1.5.3.2 Complex Exposure Scenarios of Combined Exposure to Multiple Chemicals
The assessment of the effects of chemicals is becoming more challenging with increasingly complex exposure scenarios. Exposure to a single chemical can already occur via several sources or routes (aggregated exposure), and the situation becomes more complex still with combined exposure to multiple chemicals via different sources and routes. This is the real-world situation: humans and ecosystems are exposed to a quasi-infinite number of different combinations of chemicals via sources such as food, consumer products and the environment, as well as in occupational contexts.
The assessment of chemical mixtures and combined exposure is a challenge for chemical risk assessment, as well as for predictive toxicology.120 Exposure to multiple chemicals that do not cause harm individually can lead to adverse effects in combination.83,121,122 Even if the toxicity of all individual components is known or can be predicted, together with their relative concentrations, determining the effect of the combined exposure is not always straightforward. Interactions might occur between chemicals and lead to synergistic or antagonistic effects. In addition, a time aspect needs to be taken into account: exposures to several chemicals that might interact to disturb a biological pathway could be sequential, even years apart, yet would need to be considered together. Given the huge number of possible combinations of compounds and their concentrations, it is impossible to test all of them physically. Computational prediction methods will support the assessment of mixtures, but need an adequate data basis.
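For context, the two standard reference models against which combined effects are commonly assessed, and on which computational mixture approaches typically build, can be stated briefly; they are given here as general background rather than as formulas taken from this chapter. Concentration addition assumes similarly acting components, whereas independent action (response addition) assumes dissimilar, independently acting mechanisms.

```latex
% Concentration addition (similar modes of action): a mixture elicits effect
% level x when the summed toxic units of its n components reach one.
\sum_{i=1}^{n} \frac{c_i}{\mathrm{EC}_{x,i}} = 1

% Independent action / response addition (dissimilar, independent modes of action):
% the mixture effect combines the individual effects probabilistically.
E(c_{\mathrm{mix}}) = 1 - \prod_{i=1}^{n} \bigl( 1 - E(c_i) \bigr)
```

Here c_i is the concentration of component i in the mixture, EC_{x,i} is the concentration of component i alone that produces effect level x, and E(c_i) is the effect of component i at its mixture concentration. Deviations from these reference predictions are what indicate synergistic or antagonistic interactions.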
There is an opportunity to combine information from disparate sources, covering exposure from different media and chemical sectors (e.g., data collected in the framework of different pieces of legislation) at different times, and to integrate this information into an overall big data resource that comes closer to the diverse real-life exposure to chemicals and chemical mixtures and supports the assessment of co-exposures. These data can be mined to find trends and elucidate chemical interaction mechanisms, for a better assessment of mixtures.
Chemical monitoring is a step towards better knowledge of the actual exposure (the numbers of compounds and their concentrations). Biomonitoring data, including real-time epidemiological monitoring, can contribute towards understanding causal connections between diseases and – sometimes geographically specific – exposure to chemicals; epidemiological evidence, for example, has linked air pollution to cancer and neurodevelopmental effects.9 Such data also provide insights into co-exposure patterns.123 Exposure data are increasingly becoming available on a big data scale, and many initiatives are measuring, compiling or predicting exposure data. Examples are biomonitoring initiatives such as HBM4EU, databases such as IPCHEM, and the US EPA project to compile an exposure data landscape for manufactured chemicals.124,125 ExpoCast, the US EPA's exposure-based chemical prioritisation programme, provides high-throughput exposure predictions for thousands of chemicals, considering multiple routes of exposure.126 This exposure information is integrated with ToxCast data, also exploring the use of big data techniques, including data mining, for example to define consumer product use profiles.127
Furthermore, technological and methodological advances in the big data era, such as the use of crowd-sourced monitoring and epidemiological data from mobile personal devices, provide an increasing number of new exposure data sources and thus new opportunities for the further development of the emerging discipline of computational exposure science128 and of predictive models built from big data. Predictive toxicology approaches simulating the different combinations of exposure to chemicals and the related probabilities, as well as possible interactions, will benefit from the new large-scale data resources and the possible application of artificial intelligence approaches101 to gain new knowledge.
Overall, big data offer so far unavailable opportunities to address the complex scenarios of combined exposure to multiple chemicals; their potential remains to be fully exploited in the future.
1.5.3.3 Addressing Population Variability and Specific Susceptibility
Big data resources also cover the intrinsic variability of measured properties and have the advantage of being sufficiently large and complex to represent and address population and genome variability and the related differences in susceptibility to chemical agents. Genetic data have already established links between genetic variants (polymorphisms) and diseases or adverse effects of toxicants or drugs, underlining the importance of taking this type of information on differing sensitivities to chemical agents and disease susceptibilities into account.106
Variability in response to chemical stressors can also be due to factors such as life stage and sex, or to the influence of external factors or other, non-chemical, stressors. Epidemiological and exposure data, such as biomonitoring data, can contribute towards identifying these influences and susceptibilities.9 Omics data also contribute to elucidating species sensitivity.107
These new approaches and the large-scale data generated will contribute to regulatory toxicology practice in the future. The US National Center for Biotechnology Information at the National Library of Medicine of the National Institutes of Health, for example, offers a selection of resources related to human variability, especially genotype–phenotype associations. They include the database of single nucleotide polymorphisms and estimates of their occurrence within the population (dbSNP), the database of Genotypes and Phenotypes (dbGaP), the Genotype-Tissue Expression database (GTEx) and the Phenotype–Genotype Integrator (PheGenI).129
1.5.3.4 Grouping Chemicals and Read-across for Chemical Safety Assessment
Faced with a very large number – tens of thousands – of compounds that need to be assessed for safety, and which might be regarded as a chemical big data entity in itself, grouping approaches appear to be the only viable solution for their evaluation in an acceptable time frame. This strategy goes beyond mere computational screening with QSAR models, although in silico models can be part of the category formation. Grouping makes it possible to bin the chemicals and, where applicable, flag and prioritise them for further assessment. Furthermore, several to many chemicals can be evaluated at the same time, enabling a faster and more efficient evaluation than assessing single chemicals one by one. Such an approach of organising the chemical space for prioritisation and chemical safety assessment is being pursued, for example, by the Canadian authorities within the context of the Chemicals Management Plan.130 The European Chemicals Agency (ECHA) also strategically maps the chemical universe to address substances of concern.131 The grouping of chemicals with similar properties also facilitates the substitution of chemicals. In this sense, grouping and read-across are used to solve the problem of chemical big data.
On the other hand, this type of group formation relies on the variety of biological/toxicological property data connected with the chemical substances to find trends and to formulate and justify the similarity hypothesis. The recent availability of large-scale repositories of bioactivity data, such as PubChem132 and ChEMBL, or of other data that support mechanistic understanding, such as information from omics technologies133 (see Chapter 7), now offers the opportunity to provide more and better evidence for the justification of similarity between chemicals. Thus, biological big data provide crucial support to read-across approaches (see also Chapter 12). Different types of data cover different parts of the biological/toxicological space, and more evidence gives more confidence overall in the similarity argumentation and read-across reasoning.
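As a minimal illustration of how structural similarity, one line of evidence in a read-across justification, can be quantified at scale, the sketch below ranks hypothetical candidate source substances against a target structure by Morgan-fingerprint Tanimoto similarity. It assumes the open-source RDKit toolkit and uses invented SMILES strings; in practice such structural scores would be complemented by the bioactivity and omics evidence discussed above.

```python
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def tanimoto(smiles_a: str, smiles_b: str, radius: int = 2, n_bits: int = 2048) -> float:
    """Morgan-fingerprint Tanimoto similarity between two structures given as SMILES."""
    def fp(smi):
        return AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smi), radius, nBits=n_bits)
    return DataStructs.TanimotoSimilarity(fp(smiles_a), fp(smiles_b))

# Rank hypothetical candidate source substances for an illustrative target structure
target = "CCOC(=O)c1ccccc1"                                      # illustrative SMILES only
candidates = ["CCOC(=O)c1ccc(O)cc1", "CCCCCC", "COC(=O)c1ccccc1"]
ranked = sorted(candidates, key=lambda smi: tanimoto(target, smi), reverse=True)
print(ranked)   # most structurally similar candidates first
```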
1.5.3.5 Predictive Nanotoxicology
The interactions of nanomaterials with biological molecules are still not fully understood, small differences in properties can make significant differences in effects, and nanotoxicology overall is struggling with this complexity.134 Nanomaterials have a high inherent complexity: there is a distribution of sizes and variability of properties, structural complexity, as well as different possible coatings, surface coronas etc. There is, thus, an almost infinite number of potential combinations and variations in the materials, which – together with a lack of representative samples and of standard operating procedures – makes it difficult to define the physico–chemical characteristics and to measure toxic effects reliably. Standard test guidelines are not always applicable to nanomaterials, and the behaviour of nanomaterials varies considerably with the experimental conditions.82,135 Many data on nanomaterials may be available; however, the issues described, and the resulting insufficient quality of the data, also impair the development of computational models to predict nanotoxicity.82,136 In particular, the correlations between potential adverse effects and the nanomaterial properties causing them have not yet been fully elucidated.
Big data could help to identify property–activity relationships by providing larger numbers of data points on which to base the analyses and from which to identify trends. To support the use of big data in nanotoxicology, means of processing and analysing the large-scale datasets produced by high-throughput screening have been developed. For example, a web-based platform of HTS data analysis tools (HDAT) was created as a publicly available computational nanoinformatics infrastructure for the statistical analysis of nanomaterial toxicity data, offering, amongst other features, self-organising map (SOM)-based clustering analysis as well as possibilities to visualise raw or processed data.137 Furthermore, natural language processing techniques are available to mine nanomaterial-related information.138
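To illustrate the principle behind SOM-based clustering of high-throughput screening data, the sketch below implements a deliberately small self-organising map with NumPy and assigns samples to their best-matching map nodes. It is a toy implementation under simplified assumptions (random sampling, linearly decaying learning rate and neighbourhood, pre-scaled features), not code from the HDAT platform itself.

```python
import numpy as np

def train_som(data, grid=(5, 5), n_iter=2000, lr0=0.5, sigma0=2.0, seed=0):
    """Train a toy self-organising map on rows of `data` (n_samples x n_features).

    Assumes the features have been scaled to a comparable range beforehand.
    """
    rng = np.random.default_rng(seed)
    gx, gy = grid
    weights = rng.random((gx, gy, data.shape[1]))
    # Grid coordinates of each SOM node, used for the neighbourhood function
    coords = np.stack(np.meshgrid(np.arange(gx), np.arange(gy), indexing="ij"), axis=-1)
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1 - frac)                       # linearly decaying learning rate
        sigma = max(sigma0 * (1 - frac), 0.5)       # shrinking neighbourhood radius
        x = data[rng.integers(len(data))]           # pick a random training sample
        # Best-matching unit: node whose weight vector is closest to the sample
        dists = np.linalg.norm(weights - x, axis=-1)
        bmu = np.unravel_index(np.argmin(dists), dists.shape)
        # Gaussian neighbourhood around the BMU on the map grid
        grid_dist2 = np.sum((coords - np.array(bmu)) ** 2, axis=-1)
        h = np.exp(-grid_dist2 / (2 * sigma ** 2))[..., None]
        weights += lr * h * (x - weights)
    return weights

def map_samples(weights, data):
    """Assign each sample to its best-matching SOM node (a flat cluster label)."""
    dists = np.linalg.norm(weights[None, ...] - data[:, None, None, :], axis=-1)
    return np.argmin(dists.reshape(len(data), -1), axis=1)
```

Samples mapped to the same or neighbouring nodes can then be inspected together, which is the clustering idea behind the SOM analyses offered by such platforms.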
In another line of research, related to nanomaterial discovery, multivariate data analytics were applied to identify nanoparticle prototypes and archetypes and their relevant properties.139 The method helps to identify representative types of nanostructure in big datasets, which can then be studied further to understand the structure/property–effect relationships. The authors highlight that creating value from data is a separate endeavour from creating the data in the first place, in particular considering the uncertainties and variabilities of big data, and conclude that this type of data-driven search will have its own value as a nanoscience discipline.139 Similarly, it has been emphasised that, in order to derive value from the large amount of compiled nanomaterial data and research efforts,140 it is necessary to integrate top-down decision-analytic tools to identify the relevant information with bottom-up data management approaches for visualising these data, so as not to “waste” effort on collecting and processing information irrelevant to decision-making.141 A recent paper by Bates et al. discussed the use of multicriteria decision analysis (MCDA), value of information (VOI), weight of evidence (WoE) and portfolio decision analysis (PDA) to integrate and interpret nanomaterial research data in a way that distils out the information relevant to decision-making needs.141
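The decision-analytic integration described above can be illustrated, in a highly simplified form, by a weighted-sum multicriteria score. The sketch below uses invented criteria, weights and dataset names to rank hypothetical data resources by their relevance to a decision; it stands in only for the most basic MCDA idea, not for the VOI, WoE or PDA methods discussed by Bates et al.

```python
def mcda_score(alternatives, weights):
    """Weighted-sum MCDA: score each alternative from normalised criterion values in [0, 1]."""
    return {name: sum(weights[criterion] * value for criterion, value in criteria.items())
            for name, criteria in alternatives.items()}

# Hypothetical criteria and weights for prioritising nanomaterial datasets for decision-making
weights = {"hazard_relevance": 0.5, "exposure_relevance": 0.3, "data_quality": 0.2}
alternatives = {
    "dataset_A": {"hazard_relevance": 0.9, "exposure_relevance": 0.4, "data_quality": 0.7},
    "dataset_B": {"hazard_relevance": 0.5, "exposure_relevance": 0.8, "data_quality": 0.9},
}
ranked = sorted(mcda_score(alternatives, weights).items(), key=lambda kv: kv[1], reverse=True)
print(ranked)   # highest-scoring (most decision-relevant) dataset first
```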
1.6 Conclusions and Perspectives
The quantity of big data in toxicology is growing and, whilst it may not yet reach the levels seen in other fields, these data are profoundly transforming the possibilities for predicting chemical hazards and undertaking risk assessment, opening up many new opportunities along the way. The era of big toxicological data coincides with the developing paradigm of Toxicology in the 21st Century, which is moving from testing and observation of apical effects to a mechanistic toxicology. The new direction in toxicology is based on an understanding – and continuing elucidation – of the pathways leading to adverse outcomes, in the framework of systems toxicology, and focusses more on the human relevance of the assessments. As such, this new toxicology is immensely data hungry and requires support from the targeted screening of chemicals as well as the investigation of pathway perturbations. This is only truly possible with advances in technologies, and the associated big data, such as the development of high-throughput and high-content bioassays as well as omics technologies measuring changes at the protein and genetic levels. These new technologies have produced large numbers and a high variety of data points at a significantly higher speed than classic experimental toxicology, corresponding to the three basic characteristics of big data. Moving from static measurement of effects to time-resolved data has also contributed towards increasing the volume of information, in particular when high-throughput kinetics are taken into consideration.
These “new methodology” data are complemented by an overall large legacy of historical data in toxicology, which is becoming increasingly available. A major challenge is that the historical data are not always available in a structured way such as in easily accessible databases, but rather in isolated archives and in the form of reports, as well as an abundant quantity of scientific journal publications. This unstructured information needs to be extracted to add it to the pool of – easily accessible and usable – toxicological data. Importantly, toxicology can draw from the “bigger” data streams from closely related fields: pharmaceutical sciences and drug discovery, health sciences, as well as epidemiology. (Bio)monitoring data, for example, or data from large cohort epidemiological investigations are valuable if not essential additions for risk assessment at the population level. They will improve the connection between chemical hazards and diseases, as well as going a long way towards integrating exposure. Knowledge of exposure to realistic concentrations and combinations of chemical substances is fundamental to risk assessment and essential as a basis for decision making with the goal of protecting human health and the environment. Another new type of data can bring additional value to toxicology and decision making: information extracted from the internet or social media, which might bring, for example, knowledge of the actual uses of substances or potential first warning signs about adverse effects, as has been shown for pharmaceuticals. In the future, predictive toxicology should integrate the experience from related disciplines even more, embrace the approaches and draw upon the data streams used in these fields to allow further cross-fertilisation. It has to be a truly interdisciplinary approach. Big data in (predictive) toxicology could move to another level.
These developments in big data sources and availability have also advanced computational toxicology: comprehensive data provide the input for modelling, the sheer volume of data points improves statistical robustness, and the strengthened mechanistic basis of the models improves the predictions, their (human) relevance and their credibility for human safety assessment.
Big data have thus brought significant advances, but also pose significant challenges to toxicology. The challenges come both from the technical side, relating to the integration and interpretation of the data content, and from the overall need not to be overwhelmed by the quantity of data and to be able to determine their meaning. Advances in data sciences have made the handling, processing, storing, integrating and analysing of these data possible, such that data sciences have become an integral part of modern toxicology. The lines between experimental testing and computational modelling have generally become blurred, with the analysis and interpretation of certain types of data, for example from omics, requiring routine use of bespoke computational algorithms. In this sense, predictive toxicology is seen here as more than the application of computational predictive modelling. Predictive toxicology used to be understood mainly as computational models predicting the hazard of a chemical from its structure, utilising structure–activity relationships or statistical correlations between a number of physico–chemical properties and toxic effects. However, many methods in chemical safety assessment make predictions about the impact of a chemical on human health (or the environment and wildlife), including in chemico and in vitro assays and even animal studies; these too are models used to predict the outcome in humans. In this sense, predictive toxicology has been, and will continue to be, extended to many other aspects of chemical safety assessment.
Another change in chemical safety assessment is the emphasis on risk assessment as compared with the evaluation of the intrinsic hazards of chemicals, thus putting more focus on the exposure aspect, both the exposure of individuals or populations and the actual internal exposure of target sites in the organism. Therefore, exposure science has grown to be an essential part of risk assessment, and data streams from (bio)monitoring or epidemiological research are an integral part of risk assessment in the 21st Century. Real-life exposure to chemicals, i.e., combined exposure to multiple chemicals from many different sources over a long time, as well as the different susceptibilities in the population, which also change with life stage, need to be addressed.
The velocity of big data is an important aspect that will not only support an increase in the pace of chemical safety assessment, but will also enable predictive toxicology to foresee forthcoming risks. These emerging risks may appear with new substances or new uses of chemicals. Through their broad coverage, big data seem well suited to the early identification of emerging risks; this could become a major focus of predictive toxicology in the future.
Big data are now an accepted reality in toxicology and chemical risk assessment. They are becoming increasingly well characterised and understood, especially as data science from other fields (where big data are more advanced) develops. They provide numerous opportunities to change the face of this area of science and improve the prospects for chemical risk assessment. There will be ever more advanced methods and technologies to integrate the data and make sense of them, finding trends and connections; artificial intelligence will undoubtedly increase the possible applications and opportunities. In the end, however, it comes down to how the wealth of data, new methods and new knowledge gained can actually be used for risk assessment and for achieving the goal of protecting human health and the environment. Whilst there is excitement and optimism about the use of big data, in particular for addressing major current challenges in chemical assessment, there is still some way to go before they reach their full potential in regulatory risk assessment. As such, big data support efforts in new toxicological assessment and will be translated into routine chemical evaluation in a regulatory context in the future.