Processing Metabolomics and Proteomics Data with Open Software: A Practical Guide
The development of data analysis software accompanies technological advances in mass spectrometry. Naturally, academic laboratories are often the first to implement new ideas and algorithms in scientific software. The public release of code has led to the community-driven creation, testing, and improvement of applications, and the compilation of complete data-analysis platforms. The definition of open data formats has facilitated the exchange of data between researchers and the analysis of data by a variety of programs.
Open software for the analysis of mass spectrometry data has reached a maturity that makes it suitable for professional use by analytical chemists in academia and industry. Further, exciting new tools, such as data mining by machine learning/artificial intelligence and DevOps—agile software development/testing/delivery—are accelerating developments in all fields of computer science.
However, documentation for the many existing programs and their optimal use is scattered across package manuals, README files, and online forums. This fragmentation of information complicates the efficient use of open-source software, especially for first-time users and for students, scientists, and professionals who analyse mass spectrometry data only occasionally.
In my laboratory, we develop analytical platforms for mass spectrometry, including hardware, software, and databases. Consequently, programming is an essential part of our daily work. Furthermore, we regularly provide mass spectrometry primers to new lab members and offer optional courses in our postgraduate programs. Many of our collaborators are more interested in biological questions than technical details, but also want to understand their data or reanalyse their results. As a result, we identified the need for a book that provides the very basics, facilitates an understanding of the technology, and covers practical aspects of mass spectrometry data analysis.
Therefore, when Marek Domin asked me to edit a book about metabolomics for the Royal Society of Chemistry, I suggested that the scope be widened to mass spectrometry (MS) data analysis using open-source software. We discussed possible contents with the Editorial Board of the Royal Society of Chemistry and agreed on a book that strongly focuses on MS data analysis with open-source software and statistical methods, but also covers the most prominent MS-based omics methods, namely metabolomics and proteomics.
The primary target audience of this book comprises students and scientists who are encountering mass spectrometry for the first time, as well as MS professionals who currently work with commercial data processing software and want to explore new ways of interpreting their results. Furthermore, (bio)informaticians working on omics datasets or MS software development can profit from the book, since it presents diverse modules, toolkits, and strategies for workflow design and data mining.
A central feature of this book is the inclusion of example data and workflows, which provides the reader with the opportunity to reproduce the presented protocols. The demo data are downloadable from public repositories free of charge and serve as useful practice material for hands-on courses or as templates for the development of individual workflows.
The book comprises three parts. Part A describes general concepts and holds five main chapters:
Chapter one provides the fundamentals, such as distinct research approaches (hypothesis-driven/exploratory/data mining) and mass spectrometry technology. Furthermore, it encourages the use of open software.
Chapter two explains typical unit operations applied to raw mass spectrometry data and the design of practical MS data processing workflows.
Chapter three presents different “flavours” of metabolomics (targeted metabolomics, untargeted metabolomics, fluxomics, and metabolite imaging), together with technologies and software tools for metabolomics data evaluation.
Chapter four acquaints the reader with the complexity of the proteome and presents experimental strategies for qualitative and quantitative analyses.
Chapter five demonstrates, using practical examples, statistical methods for sample comparisons, clustering, dimension reduction, biomarker discovery, and the creation of predictive models using machine learning.
Part B gives an overview of available mass spectrometry software, including general MS data processing programs, metabolomics/proteomics software, and development tools (modules for R, Java, Python, and Docker). Part C briefly comments on the dynamics in MS software development, and speculates on future developments.
Of course, I am aware that this book has various limitations. The field of open software development is evolving rapidly. Currently, Docker (presented in chapter 19) and other DevOps concepts are changing the world of software development, testing, and deployment. For emerging and growing areas of mass spectrometry, such as ambient ionisation mass spectrometry (AIMS) methods, ion mobility separation (IMS), and mass spectrometry imaging (MSI), the development of software, data formats, and databases has only just begun. Therefore, this book can only provide a snapshot of the current state of the art.
I warmly thank all authors and editors who participated in the book for the time and energy that they have invested. All of these individuals are also active developers and supporters of open-source software and hence contribute extraordinarily to the evolution of science. I also gratefully acknowledge the editorial support of the Royal Society of Chemistry, represented by Janet Freshwater and Katie Morrey, and the editors, reviewers, and proof-readers for their valuable feedback. Finally, I thank all users of open-source software, because only an active community of developers and practitioners motivates further development.