Big Data science and technologies have established their presence in every sector of the digital economy. This was initially driven by increasingly efficient and cheaper information gathering and storage solutions, the development of non-relational databases, and the widespread provision of internet resources as a service. Such developments, also encouraged by new legislation and industry initiatives, began changing the processes for the acquisition, management, analysis, validation and re-use of data, including experimental biological and chemical data.
Big Data systems and analytics, as newly established technologies, provide solutions for data whose evolving characteristics challenge and stretch traditional storage and processing tools and environments, for example, relational databases and off-line machine learning model development. Traditionally, Big Data is defined by four widely acknowledged features of its available digital content: volume, velocity, variety and veracity (referred to by Hilbert1 as the Big Data 4 Vs).
The volume of Big Data refers to data already stored in data warehouses, data lakes or less sophisticated databases whose size increases continuously or intermittently. Data management solutions in such cases may not meet current challenges and expectations of processing terabytes to exabytes of data consistently and efficiently.
The velocity of Big Data refers to dynamic attributes of data acquisition, storage, update and transfer, mostly in real time, challenged, for example, by short response-time requirements (e.g., milliseconds to seconds) for the database. Conventional solutions are offered by well-defined processes but are less prepared to deliver stream processing and specialised data formats that support realistic, efficient and timely services.
The variety of Big Data is a consequence of both the diversity of sources and provenance of data, and the growing number of multimedia formats used to produce, store, exchange and process such data. This follows from successful developments in information gathering, digitalisation, networking, and mobile and communication technologies that have rapidly democratised access to and exchange of data.
The veracity of Big Data refers to the quality, confidence and usefulness of current data resources – features increasingly demanded by regulatory and industry standards, and acknowledged from the beginning by the data mining and data analytics communities. Data veracity, though, currently remains an open field regarding metadata capture and processing, and consistent data quality measurement and reporting.
Chemical, biological and toxicological data resources show a continuous increase in the number of records and instances stored, as well as in the number and diversity of the data types, features and attributes captured or integrated in such repositories. This is due to current advancements in biology, toxicology, chemistry – for example in the field of nanomaterials – and related sciences, which are producing complex experimental data of increasing digital volume at increasing speed, compared to traditional experimental toxicological data. A variety of different data – from both the chemical and the biological side – need to be combined. Thus, ever more sophisticated solutions are required to organise, integrate, analyse and interpret such data and their digital derivatives (such as statistical analyses and machine learning models), and to support their further use in multidisciplinary repositories, dashboards and projects. International collaborations and public–private partnership initiatives encourage and advance this research.
Predictive Toxicology is both a multidisciplinary domain and an interdisciplinary field of research and knowledge transfer. It relies critically on a broad and qualitatively sound data basis for model development and, as such, is well suited to take up the challenge and benefit from Big Data technologies. Its multidisciplinarity is an example of how multiple forms of expertise (e.g., biology, chemistry, toxicology, applied statistics, computer science, machine learning) and perspectives are called upon to address toxicology modelling and evaluation challenges. Its interdisciplinarity combines knowledge and builds on novel disciplines at the crossroads of existing approaches, such as bioinformatics, chemoinformatics, toxicogenomics, applied data sciences, and Big Data.
While Predictive Toxicology has started to benefit from new solutions for large-scale data and model storage, it has also raised additional requirements, specifications and challenges. These concern data gathering, representation, formatting, processing and validation, knowledge extraction and integration from big and diverse toxicology and chemistry data, model governance, and visualisation. Predictive Toxicology is therefore stepping onto the Big Data scene, requiring a diverse and dedicated effort from different communities of researchers, users and model developers to develop and validate sustainable solutions, and to seize the opportunities for many applications.
In general, the volume and velocity of toxicology and chemistry data resources may not yet create serious technical issues within the context of Big Data, and they are the first to benefit from new technologies and progress, such as cloud services and distributed (e.g., Hadoop) processing. Currently more important, though, is how the variety and complexity of these data are managed and systematised, linked together to become a useful information resource, and finally abstracted through large-scale data analysis into models that produce useful knowledge to support further decision making. While data organisation is an important topic, knowledge organisation – the storage, accessing and consistent representation of both the data and the resulting predictive models – remains a major issue for any application and institution.
The era of Big Data is already evolving beyond the traditional measurable features by adding even more challenging characteristics and functionalities,2 such as variability, validity, volatility, visibility and value, aiming for sustainable governance, re-use and automation. While some of these newly added features share common ground with existing ones and require more fundamental research into the definition and measurement of their contributions, they nevertheless help reveal the diversity of expectations and requirements of the general public, users and specialists in their areas of interest; this also applies to Predictive Toxicology, as described in the chapters of this book.
Big Data in Predictive Toxicology: Overview of the Book Contents
The experience, knowledge and examples shared by the chapter authors will interest both Big Data and machine learning researchers and researchers and practitioners in toxicology and chemical safety assessment who wish to learn about advances in the field. The content of the book, while sharing features with Big Data itself – in the diversity of its topics and viewpoints, the value of its knowledge and the variety of its approaches – is by no means exhaustive in terms of volume of coverage and pace of updates. Its aim is to provide a stepping stone: to acknowledge the work done to date, to record the current approaches, and to offer an open table for identifying challenges and for reporting and referencing new ideas and contributions.
The book aims to give a broad overview of the area of Big Data in Predictive Toxicology, explaining basic notions and providing examples of applications in twelve chapters ordered in three sections as follows.
Chapter 1 – Big Data in Predictive Toxicology: Challenges, Opportunities and Perspectives – gives a systematic introduction of Big Data concepts and features as applied in the context of Predictive Toxicology, from the motivation and relevant characteristics to challenges that need to be addressed to leverage their potential, and opportunities they offer for advancing chemical safety assessment and related urgent issues.
The first section of the book provides in Chapters 2, 3 and 4 the technical basis and a review of types, processing and storage of data in biology, chemistry and toxicology.
Chapter 2 – Biological Data in the Light of Toxicological Risk Assessment – reviews the types of toxicological data and studies that are at the basis of predictive model development as well as the toxicological endpoints that are currently information requirements for regulatory toxicological assessment, which Predictive Toxicology is targeting. These comprise in vivo and in vitro testing data, with advances in high-throughput bioactivity assays and ‘omics’ technologies building new data resources.
Chapter 3 – Chemoinformatics Representation of Chemical Structures – A Milestone for Successful Big Data Modelling in Predictive Toxicology – reviews chemical structure representations in chemoinformatics as a key aspect of linking toxicity information to a chemical in computational toxicology, in the context of storing and searching structural data in and across big databases and its use for predictive modelling.
Chapter 4 – Organisation of Toxicological Data in Databases – reviews a variety of approaches and formats for data exchange and database design for organising toxicity, chemistry and related data, and gives an overview of existing toxicological databases as well as examples for integrated approaches. The importance of high-quality databases as support for (in silico) safety assessment is highlighted.
The book's second section builds in Chapters 5 and 6 on the available data and data formats by adding the challenges of constructing, using, integrating, storing and sharing data and models as complex, heterogeneous and valid resources for modern Predictive Toxicology applications.
Chapter 5 – Making Big Data Available: Integrating Technologies for Toxicology Applications – considers, alongside data volume, other less studied characteristics of toxicity Big Data: variety and veracity, which add representative challenges in data management and analysis. The solutions presented build, on a syntactic level, on database integration (Application Programming Interfaces) and ontologies, and, on a semantic level, on chemo- and bioinformatics methods of data analysis.
Chapter 6 – Storing and Using Qualitative and Quantitative Structure–Activity Relationships in the Era of Toxicological and Chemical Data Expansion – presents practices and developments for standardisation of prediction model formats by their systematic representation, organisation and storage. It raises the topic of the need and progress of sustainable representation and fusion of data and models using generalised data governance functionalities.
The third section of the book shows in the contributions contained in Chapters 7 to 12 how the analysis and organisation of different types of toxicological Big Data can leverage building knowledge and support applications in modelling and computational predictions for chemical safety assessment.
Chapter 7 – Toxicogenomics and Toxicoinformatics: Supporting Systems Biology in the Big Data Era – reports and exemplifies current technologies for the storage, pre-processing and integration of ‘omics’ data that enable data mining, machine learning and mechanistic analyses, revealing new patterns, identifying and explaining adverse effect mechanisms, and building knowledge in Systems Toxicology. Examples of toxicoinformatics applications for toxicity prediction are described.
Chapter 8 – Profiling the Tox21 Chemical Library for Environmental Hazards: Applications in Prioritisation, Predictive Modelling, and Mechanism of Toxicity Characterisation – highlights the Toxicology for the 21st Century (Tox21) programme, which contributes to Big Data in toxicology with over 120 million data points generated for about 10 000 environmental chemicals and drugs with a high-throughput screening assay battery, and provides examples of in vivo toxicity prediction models built from the compounds' activity profiles and how to prioritise compounds for further in-depth toxicological studies.
Chapter 9 – Big Data Integration and Inference – reviews ongoing efforts to aggregate toxicological knowledge from different data types and sources in a sustainable knowledge base built on the Adverse Outcome Pathway (AOP) framework, and provides case studies of data integration and inferential analysis for use in Predictive Toxicology from a Big Data volume viewpoint. It also unveils opportunities for mining Big Data sets to produce large libraries of computationally predicted AOPs. The use of such libraries and AOPs in decision support, systems biology applications, testing, risk assessment and model development is discussed.
Chapter 10 – Chemometrical Analysis of Proteomics Data – presents case studies of using chemometrical approaches to model data of diverse input formats, such as proteomics data and chemical descriptors. Such approaches open opportunities for extracting knowledge from large numbers of data points and for using proteomics data analysis to understand toxicity mechanisms and identify biomarkers.
The implications of Big Data in two specific areas of interest in Predictive Toxicology are described in the last two chapters: biokinetics, which is at the centre of modern safety assessment as a key for translating in vitro and in silico toxicity estimates to in vivo conditions, and read-across, for which there is considerable interest for data-gap filling in the regulatory context.
Chapter 11 – Big Data and Biokinetics – describes the incorporation of Big Data approaches to improve the ‘throughput’ of biokinetic modelling, defining a comprehensive physiological/biochemical modelling framework to rapidly predict in vivo biokinetics of chemicals, supporting interpretation of high volumes of toxicity data on large numbers of chemicals in an efficient way.
Chapter 12 – Role of Toxicological Big Data to Support Read-Across for the Assessment of Chemicals – discusses how Big Data from new approach methodologies supports the justification of similarity in read-across, extending from chemical to biological and toxicological similarity, as well as mechanistic interpretation, thus increasing the confidence in read-across that is essential for regulatory acceptance. Furthermore, Big Data can support read-across approaches in addressing upcoming challenges such as the hazard assessment of mixtures and nanomaterials.
We would like to thank all authors and researchers as well as the editors from the Royal Society of Chemistry who contributed to and supported this multi-authored book.
University of Bradford,
Department of Computer Science,
Faculty of Engineering and Informatics,
European Commission Joint Research Centre,