Digital skills and scholarship for researchers 4: the 3×3 table explained
When Billy Meinke and I sat down to plan our sprint session for MozFest, he suggested that the activities of science could be grouped into 3 objects (text, code, data) and 3 actions (create, share, reuse). I was skeptical – surely science is way more complex than that. Running the session at MozFest and later in New Zealand, however, convinced me that Billy was right. We never encountered any object or action that we could not fit within that classification.
The actions are self-explanatory – create, share, reuse. You are either the author of a manuscript or you are not, you have contributed data or not, you have contributed to software or not. You create, share or reuse, or you don’t. However, what emerged at MozFest was that these 3 actions (which we seem to engage with separately) are actually very dependent on each other – how we can share depends on how we created. Let’s look at an example:
I am capturing neural data using proprietary software that creates proprietary formats. That decision affects how those data can be reused (by a future self or others): only those with access to that software can open those files. Sharing and reuse, hence, become limited. If instead I think of sharing and reuse from the outset, I may choose to use a piece of software (proprietary or not) that at least lets me export the raw data in an open format. Once I do this, the opportunities to share for reuse come down to licencing. (Note: using proprietary software may bring about other issues, but that is a slightly different discussion.) So, during the act of creation we build constraints (or eliminate them) around sharing and reuse. So why not think about this upfront? Similarly, once we decide to share, the licences we use will determine what kinds of reuse our work can have. These 3 actions are very interconnected – and it would be useful to think about how each affects the others from step 0 (planning). Those decisions will affect not just how we create, but also the infrastructure we choose to use so that the acts of sharing and reuse are made as easy as possible when the time comes. Licences may state what we can do ‘legally’ but the infrastructure we use defines what we can do ‘easily’.
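To make the ‘export the raw data in an open format’ idea concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the function name `export_open`, the column names, the file names and the choice of a CC0 licence identifier are hypothetical, not something the workshops prescribed. It only shows that the decision about format and licence can be written down at the moment the data is created.

```python
import csv
import json
from pathlib import Path


def export_open(samples, out_dir, licence="CC0-1.0"):
    """Write (time_s, voltage_mv) samples as CSV plus a JSON sidecar.

    The sidecar records the column names, row count and the licence,
    so the reuse conditions travel with the data. All names here are
    illustrative assumptions.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Plain CSV: readable without the acquisition software.
    data_path = out_dir / "recording.csv"
    with data_path.open("w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["time_s", "voltage_mv"])
        writer.writerows(samples)

    # Sidecar metadata: what the file contains and how it may be reused.
    meta_path = out_dir / "recording.json"
    meta = {
        "columns": ["time_s", "voltage_mv"],
        "rows": len(samples),
        "licence": licence,
    }
    meta_path.write_text(json.dumps(meta, indent=2))
    return data_path, meta_path
```

Calling `export_open([(0.0, -65.0), (0.001, -64.8)], "shared")` would leave two small files that anyone can open with a text editor, which is the ‘easily’ half of the legally/easily distinction.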
The objects became a lot more interesting as the MozFest and the New Zealand workshops progressed. At first, the idea of text, data and code seemed quite self-explanatory. We can identify each of them with something we actually recognise, like a manuscript, the software we use for data acquisition and analysis, or the data we measure or that is produced by some automated system. The fun part started when we tried to describe how these objects ‘behaved’, and, given those behaviours, how we were able to describe them (e.g., metadata).
Examples of text usually came across as manuscripts. When we think of manuscripts, we think of things that tell the story of data and code. They have a narrative that provides context to the work, there is usually a version of record which is difficult to modify, we usually publish it, it is peer reviewed, authors are well-defined, etc. The drafts prior to the version of record, the peer review, the corrections, etc., are usually not available (although that is changing in places, e.g., PeerJ, F1000, BioRxiv, to name a few). We usually interact with the version of record; our ability to comment on those artefacts is limited, although commenting is becoming available (e.g., PubMed), and mechanisms to suggest modifications to (or to modify) these artefacts are almost non-existent. In other words – text (manuscripts) is stable. Another artefact that ‘behaves’ like text is equipment. Look at the following comparison:
|Manuscript||Equipment|
|Volume, page numbers||Internal asset tag|
|Errata and corrections||Maintenance and repair records|
In other words, equipment seems to ‘behave’ like manuscripts. If you want to do (or say) something the equipment doesn’t do (or the paper doesn’t say) you need to buy (or write) a new one. So, when describing equipment you end up using similar descriptors to those for papers. In itself, a piece of equipment is a ‘stable’ object that can be described with ‘stable’ descriptors, not too dissimilar to a paper. Defective equipment breaks, defective papers get retracted. What this means is that the category ‘text’, when thought of as a set of behaviours and descriptors, can help us build better descriptors for other artefacts with similar behaviours. This behaviour also determines how we create, share and reuse. How we publish a manuscript (behind paywalls, or with an Open Access licence) determines who can reuse it and how. Using ‘open hardware’ equipment is similar to applying an open Creative Commons licence to a manuscript – using proprietary equipment is equivalent to publishing behind a paywall.
At the other end of the spectrum is code. Code likes to live in places like GitHub. There is version control and the ability to take multiple external and internal contributions, the list of contributors is agile and expands, and the code can stay alive and dynamic for long periods of time. There may be stable versions that are released at different time points, but in between, code changes. Code is at the core of reproducibility – it is the recipe that lists the ingredients and the sequence of what we did with and to the data. Not sharing the code is the equivalent of giving someone a cupcake and expecting them to be able to go and bake an identical version. So code is dynamic and its value is in the details. A lot of the value in code is that it is amenable to adaptations and modifications. One artefact of research that behaves like code is the experimental protocol, e.g., a protocol that describes a specific method for in situ hybridization, or how to make a buffer.
|Code||Protocol|
|Original author, plus future contributors||Original author, plus future contributors|
|What a line of code does is described through annotation||What a line describing a step does ‘should’ be described by annotation|
|Bits and pieces can be copied to be part of another piece of software||Bits and pieces can be copied to be part of a different protocol|
|A single version for a single study||A single version for a single study|
|Otherwise constantly changing and being updated||Otherwise constantly changing and being updated|
So protocols seem to behave like code. Unfortunately, we tend to treat them as text (we share them in the materials and methods of our manuscripts). It would be much more useful to have protocols in places like GitHub – allowing line-by-line testing and annotation, allowing ‘test driven development’ of protocols, allowing branching and merging, etc. If we were to think of protocols as ‘code’ we could then share them in a way that makes them more amenable to reuse. And if we do so, then we might think that an appropriate way to licence a protocol for sharing and reuse would be to apply the licences that promote sharing and reuse for code, not licences for text.
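As a rough illustration of what ‘protocol as code’ could look like, the sketch below writes a generic dilution step as a small Python module with annotated steps. It is an assumption-laden toy, not a real lab protocol: the names `Step`, `dilution_volumes` and `working_buffer_protocol`, and the 10x stock and 500 mL defaults, are all hypothetical. The only chemistry it relies on is the standard dilution relation C1·V1 = C2·V2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    action: str  # what to do
    note: str    # annotation: why the step is there


def dilution_volumes(stock_x, target_x, final_ml):
    """Return (stock_ml, diluent_ml) from C1*V1 = C2*V2."""
    if target_x <= 0 or stock_x < target_x:
        raise ValueError("stock must be at least as concentrated as the target")
    stock_ml = final_ml * target_x / stock_x
    return stock_ml, final_ml - stock_ml


def working_buffer_protocol(stock_x=10, target_x=1, final_ml=500):
    """Build the annotated steps for a hypothetical working buffer."""
    stock_ml, water_ml = dilution_volumes(stock_x, target_x, final_ml)
    return [
        Step(f"Measure {stock_ml:g} mL of {stock_x}x stock",
             "sets the final concentration"),
        Step(f"Add {water_ml:g} mL of water",
             "brings the volume up to the final volume"),
        Step("Mix and label with date and lot",
             "keeps the buffer traceable"),
    ]
```

A ‘test’ for this protocol can then be as small as checking that `dilution_volumes(10, 1, 500)` gives 50 mL of stock and 450 mL of water. Someone who branches the protocol for a different final volume changes one argument, and the annotations and the test travel with it.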
Data sits somewhere in the middle of the two. Like text, it has stable versions – e.g., the data that accompanies a specific manuscript. Once data is captured, it cannot be changed (except of course to correct an error, or to legitimately remove ‘bad’ data points or outliers). In essence, data changes by growing, reorganising, subsetting, etc., not by changing specific pre-existing values. It has some of the dynamic behaviours of code and some of the stable behaviours of text: it is dynamic as the project progresses, as outliers are eliminated and as new data is added, and stable once a version is fixed. How data is created determines how it can be shared and reused: are the formats open or proprietary, is it licenced openly or not, etc. For the most part there is a point in time where data moves from behaving like ‘code’ to behaving like ‘text’. Good open formats and licences can bring data back to a dynamic state (something harder to do with text-objects). This behaviour is important when we write the descriptors of data: there are the authors, data is linked to protocols and code and eventually to text, it can be used for different analyses, etc. Chemicals, in a way, behave like data:
|Data||Chemicals|
|File name||Catalogue number|
|Version||Lot #, shipping date, aliquots|
|Storage place||Storage place|
|Linked to code and text||Linked to protocols and manuscripts|
How we share and describe data and chemicals is again similar. Is the chemical/data available to other researchers so they can repeat my experiments? Or is it something I produced in my lab and only share with a limited number of people? Again, how you ‘licence’ data and chemicals determines the extent to which these artefacts can be shared and reused. And, again, thinking about this intention at the planning stage makes a difference.
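One way to see that data and chemicals can share descriptors is to write a single record type and fill it in for both. This is only a sketch: the class `Descriptor`, its field names and every value below (file name, catalogue number, lot, storage places) are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Descriptor:
    identifier: str      # file name for data, catalogue number for a chemical
    version: str         # data version, or lot number / shipping date
    storage_place: str   # where the artefact physically or digitally lives
    linked_to: list = field(default_factory=list)  # code, protocols, manuscripts
    shareable: bool = False  # decided at the planning stage


def reusable_by_others(d):
    """An artefact is reusable only if it is shareable and identifiable."""
    return d.shareable and bool(d.identifier)


# Hypothetical examples: the same fields describe both kinds of artefact.
dataset = Descriptor("recording.csv", "v2", "lab-server/project-a",
                     ["analysis.py", "manuscript draft"], shareable=True)
chemical = Descriptor("CAT-0000", "lot 12", "fridge 3, shelf B",
                      ["buffer protocol", "manuscript draft"])
```

Here `reusable_by_others(dataset)` is true and `reusable_by_others(chemical)` is false, simply because nobody decided up front that the chemical would be shared.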
All three objects can be published and cited, and data and code are slowly claiming the hierarchical position they deserve in the research cycle. The need for unique identifiers for resources is also recognised here and here, for example.
During the workshops it was fun to get people to ‘classify’ their research artefacts based on these behaviours. At MozFest, for example, Daniel Mietchen suggested his manuscripts behave more like code. I would argue that they should then be licenced (and described) like code.
What I learned from these workshops (and Billy’s 3×3 table) is that if we can classify all of our artefacts within these categories, then the process of describing our research artefacts and building them with the intention of openly sharing for reuse becomes much easier. And teaching the skills to understand how your choices constrain downstream effects becomes more achievable.
As long as, of course, you think about this from the beginning.
Footnote: This is my interpretation of Billy Meinke’s thinking model – he may loudly laugh about my interpretation. He may even roll his eyes, hit his head against the wall – I don’t know. But the clarity he brought to my approach to the problem is something I am extremely grateful for. Hat tip.