ActivePapers - computational science made reproducible and understandable

ActivePapers is a research project about the future of science communication in the context of the ever-increasing importance of computing in scientific research. The central question that ActivePapers wishes to answer is:

How should we package and publish the outcomes of computer-aided research in order to make them maximally useful to the scientific community?

An ActivePaper is one such publishable package; in general, it contains documentation, code, and data.

The desirable properties for an ActivePaper are the two in the project's tagline: reproducibility and understandability. The core ideas behind them are explained below.

Exploring this research question involves designing, implementing, and applying infrastructure software for computer-aided research. This software is openly available, and everybody is welcome to use it in any way they see fit. However, please be aware that these software packages are research prototypes, not products. Don’t expect technical support or long-term maintenance. We will do our best to help with any problems, but in the spirit of collaborative research rather than customer support.

To keep up to date with ActivePapers development, follow @ActivePapers on Twitter and read our blog.

ActivePapers is described in scientific publications on reproducibility by construction and stable platforms (the Python and JVM editions), and on the human-computer interface in computational science (the Pharo edition and the Leibniz project).

ActivePapers in practice

The best platforms for publishing ActivePapers are currently Zenodo and figshare. You can consult the list of ActivePapers available on Zenodo and on figshare.

ActivePapers infrastructure software

There are currently three implementations of the ActivePapers concept: the Python edition, the JVM edition, and the Pharo edition. The Python edition is the most immediately useful one for computational scientists, but the JVM edition is a more complete implementation of the design goals outlined in the first paper on ActivePapers. The Pharo edition is the latest member of the family and should be considered work in progress.

The Python and JVM implementations of ActivePapers use HDF5 as the underlying storage format. An ActivePaper is thus an HDF5 file. Datasets in an ActivePaper can be inspected using many generic HDF5 tools, in particular HDFView. HDF5 has the advantage of providing compact binary storage for large datasets and efficient access to them. In the Pharo edition, an ActivePaper is a Pharo class with a singleton object containing the data.
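
As a minimal illustration, an ActivePaper's contents can be listed with a few lines of Python using the h5py library; the file name "paper.ap" below is a hypothetical example.

    # List the groups and datasets inside an ActivePaper, which is an
    # ordinary HDF5 file. Requires the h5py library.
    import h5py

    def list_contents(filename):
        with h5py.File(filename, "r") as f:
            def show(name, obj):
                if isinstance(obj, h5py.Dataset):
                    print(f"dataset  {name}  shape={obj.shape}  dtype={obj.dtype}")
                else:
                    print(f"group    {name}")
            f.visititems(show)

    list_contents("paper.ap")  # hypothetical file name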

Core ideas and concepts of ActivePapers

Reproducibility by construction

Today’s most common approach to ensuring computational reproducibility is to do a computation and then check whether its results can be reproduced. Such a check requires significant competence and effort, and as a consequence it is rarely performed, in particular for long-running computations. A telling sign is that reviewers for scientific journals are not expected to check results for reproducibility. ActivePapers pursues a different approach: the infrastructure software guarantees the reproducibility of the results stored in an ActivePaper. This requires authors to adopt a new tool for their work, but it removes the need to check reproducibility after the fact.
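
To make the contrast concrete, here is a deliberately simplified sketch in plain Python (not the ActivePapers implementation): a result can only enter the store together with the exact code and inputs that produced it, so there is nothing left to verify afterwards.

    # Simplified illustration of reproducibility by construction; this is
    # not the ActivePapers API. Results are stored only together with the
    # code and inputs that produced them.
    import hashlib
    import inspect
    import json

    store = {}  # dataset name -> value, code, and provenance fingerprint

    def run_and_record(name, func, inputs):
        code = inspect.getsource(func)
        fingerprint = hashlib.sha256(
            (code + json.dumps(inputs, sort_keys=True)).encode()
        ).hexdigest()
        store[name] = {"value": func(**inputs),
                       "code": code,
                       "sha256": fingerprint}

    def mean_temperature(T):
        return sum(T) / len(T)

    run_and_record("mean_T", mean_temperature, {"T": [271.3, 272.8, 274.1]})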

Data is more important than software

Computational scientists usually first choose the software for doing their work, and then let that software decide how to store the data they work on. But data is both more fundamental and longer-lived than software. A protein structure or a temperature field will make as much sense in fifty years as it does today, but the software used to produce it will probably be obsolete by then. Data should therefore be stored in well-documented, software-independent formats.

Code is essential documentation for data

Good data formats go a long way toward documenting the meaning of a dataset, but they cannot provide the context in which a particular dataset was produced. For computational results, the ultimate documentation is the code that produced them. This code should therefore be easily accessible from the dataset itself.
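
As a hedged illustration of the idea, using plain h5py rather than the ActivePapers machinery, the producing code can travel with the dataset itself, for example as an HDF5 attribute; the file name "example.h5" is a hypothetical example.

    # Attach the producing code to the dataset so that anyone opening the
    # file can see how the numbers were computed. Illustration only; this
    # is not how ActivePapers stores code.
    import h5py
    import numpy as np

    code = "temperatures = np.linspace(270.0, 280.0, 11)"
    temperatures = np.linspace(270.0, 280.0, 11)

    with h5py.File("example.h5", "w") as f:
        ds = f.create_dataset("temperatures", data=temperatures)
        ds.attrs["producing-code"] = code  # the code travels with the data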

The code-data dichotomy is not fundamental

The distinction between code as a tool and data as the virtual matter manipulated by that tool is little more than a historical accident of computing technology. What matters more for science is the distinction between the different kinds of scientific information: observational data, computed results, parameter choices, and the code that implements models and methods.

Today we often neglect the distinction between observational data and computed results, calling both “data”. We also sometimes put parameter choices into the “data” category, and bury small datasets in software source code for practical convenience.

Re-use is a form of citation

Scientists have always built on other scientists’ work. In computational science, this means re-using other scientists’ datasets and software. This works smoothly only with precise machine-readable references, such as DOIs (Digital Object Identifiers) or, better yet, intrinsic identifiers based on the concept of content-addressable storage. Such machine-readable references can also be used to integrate data and code references into bibliometrics, creating the missing incentive for scientists to actually publish data and code.
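
As a sketch of what an intrinsic identifier looks like (not the specific scheme used by ActivePapers), a reference can be derived from the content itself, for example as a SHA-256 hash; the file name "paper.ap" is a hypothetical example.

    # An intrinsic identifier computed from the bytes of the published
    # file: the reference cannot silently point to different content.
    import hashlib

    def intrinsic_id(filename):
        h = hashlib.sha256()
        with open(filename, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return "sha256:" + h.hexdigest()

    print(intrinsic_id("paper.ap"))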

Research often moves at a slower pace than computing

In the natural sciences, researchers typically consult original journal publications that are up to about thirty years old, whereas they turn to textbooks and review articles for older work. Published computational science must therefore remain usable for a few decades. Computing environments evolve at a much faster pace, which is why software requires maintenance to remain usable for more than a few years. Computational science therefore requires stable computing platforms.

The human-computer interface of computational science matters

Marshall McLuhan told us that “First we shape our tools, and then our tools shape us.” This phenomenon is easy to observe in many aspects of life (cities structured around cars, social life impacted by social networks, etc.), including computational science. Scientists choose models and methods at least as much for ease of use (readily available implementations, etc.) as for scientific validity. There are also more fundamental but less visible impacts of computing technology on the way we do scientific research. The complexity of modern software makes the models and methods it implements opaque. As a consequence, scientists increasingly apply models and methods without knowing the assumptions they are based on. Worse, they often do not even know exactly which software they have run, leading to non-reproducible results. We must therefore pay more attention to the human-computer interface of our tools, making sure that they favor understandability and verifiability.