‘What would Software Carpentry look like if it was delivered as a university course?’
A series of conversations and workshops kept indicating that the thirst and need for this were there, that there was no clear solution in place, and that a solution was not going to be easy to produce. We knew what we wanted the house to look like, but we needed to find an architect. And, of course, money to pay them.
Enter Nat Torkington
Nat organises an unconference called KiwiFoo. He invites a bunch of people to a retreat north of Auckland and lets the awesome happen. In 2015 I was invited and, by pure luck, so was Kaitlin Thaney, who happened to be in Australia around that time for Software Carpentry instructor training around ResBaz Melbourne. Also invited were Nick Jones, director of NeSI, which had recently become the New Zealand institutional partner of Software Carpentry, and John Hosking, Dean of the Faculty of Science at the University of Auckland.
The words that Kaitlin Thaney had said at one of our meetings came back as if from a loudspeaker: ‘You need to engage with the University leadership. You need to think strategically.’
And KiwiFoo gave us that opportunity.
Kaitlin, Nick and I brought John Hosking into the conversation, and his response was positive. We tried to exploit the convergence as much as we could over that weekend – there are not that many chances to get to sit with this group of people in a relaxed environment and without interruptions or the need to run to another meeting. We had each other’s full attention. And exploit we did.
Back in Auckland, Nick suggested that I talk about the project to the Centre for eResearch Advisory Board. The Centre for eResearch at the University of Auckland helps researchers with exactly these kinds of issues. Next thing I know, Cameron McLean and I were trying to distil everything we had learned through the workshops into something more concrete. I presented those details, and when the Board asked ‘how can we help you?’ I did not know what to say.
Luckily, Nick Jones, as usual, came to the rescue. We had a chat, and he agreed to work with me on higher-level thinking. I was still missing the big picture that we could offer the leadership. Watching Nick’s thinking process was a humbling joy. I think I learned more from that session than I did in all the leadership programmes I had been part of. I also realised how far I was from getting to where we needed to be. What is the long-term vision? What are the gaps? Why do we need to fill them? How are we going to manage change?
At this meeting we saw we needed to engage with CLeaR, the organisation that provides professional development for University staff and has a lot to offer in instructional design. We had already agreed that this training project should not focus solely on students but, rather, should have a broader scope. We produced an initial outline of what we were proposing, and invited Adam Blake from CLeaR to join the conversation and contribute to the document.
I was invited again to the eResearch Advisory Board, and this time I was better prepared. The timing was also perfect. The application window for the Vice Chancellor’s Strategic Development Fund was open, and I now knew what I needed: support to put an application through. We built a team of key project advisors, each of whom could contribute something quite specific: Adam Blake, to advise on course structure and to provide support for the research on the course; Mark Gahegan, Director of the Centre for eResearch; Poul Nielsen, from the Auckland Bioengineering Institute; Nick Jones, from NeSI; and myself as Project Lead, with the intention of hiring Cameron McLean as project manager. We worked on the application and, backed by the eResearch Advisory Board, it went in.
Our proposal was to develop a training suite, based on Software and Data Carpentry, that could be delivered to students and staff in different formats, to support a ResBaz in Auckland in February 2016, and to run a pilot course for students about to enter the research lab in the second semester of 2016. We knew our bottleneck was time – people’s time to do the work. We asked for $150,000 in salaries.
In September we got the email: your application has been approved….
The Vice Chancellor’s fund initially gave us a limited amount of money, with the rest contingent on the approval of a needs analysis by the eResearch Advisory Board.
We accepted the offer and hired Cameron McLean as Project Manager (by now he was a trained Software Carpentry instructor, had submitted his PhD thesis, and was awaiting his viva). First order of business: a needs analysis.
Time to go to the library.
When Billy Meinke and I sat down to plan our sprint session for MozFest, he suggested that the activities of science could be grouped into 3 objects (text, code, data) and 3 actions (create, share, reuse). I was skeptical – surely science is way more complex than that. Running the session at MozFest, and later in New Zealand, however, convinced me that Billy was right. We never encountered an object or action that we could not fit within that classification.
The actions are self-explanatory – create, share, reuse. Either you are the author of a manuscript or you are not; you have contributed data or you have not; you have contributed to software or you have not. You create, share or reuse, or you don’t. However, what emerged at MozFest was how these 3 actions (which we appear to engage with separately) actually depend heavily on each other – how we can share depends on how we created. Let’s look at an example:
I am capturing neural data using proprietary software that creates proprietary formats. That decision affects how those data can be reused (by a future self or others): only those with access to that software can open those files. Sharing and reuse, hence, become limited. If instead I think of sharing and reuse from the onset, I may choose a piece of software (proprietary or not) that at least lets me export the raw data in an open format. Once I do this, the opportunities to share for reuse come down to licensing. (Note: using proprietary software may bring about other issues, but that is a slightly different discussion.)

So, during the act of creation we build constraints (or eliminate them) around sharing and reuse. So why not think about this upfront? Similarly, once we decide to share, the licences we use will determine what kinds of reuse our work can have. These 3 actions are deeply interconnected – and it would be useful to think about how each affects the others from step 0 (planning). Those decisions will affect not just how we create, but also the infrastructure we choose to use, so that the act of sharing and reuse is made as easy as possible when the time comes. Licences may state what we can do ‘legally’, but the infrastructure we use defines what we can do ‘easily’.
The objects became a lot more interesting as the MozFest and New Zealand workshops progressed. At first, the idea of text, data and code seemed quite self-explanatory. We can identify each of them with something we actually recognise: a manuscript, the software we use for data acquisition and analysis, the data we measure or that is produced by some automated system. The fun part started when we tried to describe how these objects ‘behaved’ and, given those behaviours, how we were able to describe them (e.g., metadata).
Examples of text usually came across as manuscripts. When we think of manuscripts, we think of things that tell the story of data and code. They have a narrative that provides context to the work; there is usually a version of record which is difficult to modify; we usually publish it; it is peer reviewed; authors are well-defined; etc. The drafts prior to the version of record, the peer review, the corrections, etc., are usually not available (although that is changing in places, e.g., PeerJ, F1000, BioRxiv, to name a few). We usually interact with the version of record; our ability to comment on those artefacts is limited, though mechanisms are becoming available (e.g., PubMed), and mechanisms to suggest modifications (or to modify) these artefacts are almost non-existent. In other words – text (manuscripts) is stable. Another artefact that ‘behaves’ like text is equipment. Look at the following comparison:
| Manuscript | Equipment |
|---|---|
| Volume, page numbers | Internal asset tag |
| Errata and corrections | Maintenance and repair records |
In other words, equipment seems to ‘behave’ like manuscripts. If you want to do (or say) something the equipment doesn’t do (or the paper doesn’t say), you need to buy (or write) a new one. So, when describing equipment you end up using similar descriptors to those for papers. In itself, a piece of equipment is a ‘stable’ object that can be described with ‘stable’ descriptors, not too dissimilar to a paper. Defective equipment breaks; defective papers get retracted. What this means is that the category ‘text’, when thought of as a set of behaviours and descriptors, can help us build better descriptors for other artefacts with similar behaviours. This behaviour also determines how we create, share and reuse. How we publish a manuscript (behind paywalls, or with an Open Access licence) determines who can reuse it and how. Using ‘open hardware’ equipment is similar to applying an open Creative Commons licence to a manuscript – using proprietary equipment is equivalent to publishing behind a paywall.
At the other end of the spectrum is code. Code likes to live in places like GitHub. There is version control, there is the ability to take multiple external and internal contributions, and the list of contributors is agile, expands, and can stay alive and dynamic for long periods of time. There may be stable versions released at different time points, but in between, code changes. Code is at the core of reproducibility – it is the recipe that lists the ingredients and the sequence of what we did with and to the data. Not sharing the code is the equivalent of giving someone a cupcake and expecting them to be able to go and bake an identical version. So code is dynamic and its value is in the details. A lot of the value in code is that it is amenable to adaptations and modifications. One artefact of research that behaves like code is the experimental protocol, e.g., a protocol that describes a specific method for in situ hybridization, or how to make a buffer.
| Code | Protocol |
|---|---|
| Original author, plus future contributors | Original author, plus future contributors |
| What a line of code does is described through annotation | What a line describing a step does ‘should’ be described by annotation |
| Bits and pieces can be copied to be part of another piece of software | Bits and pieces can be copied to be part of a different protocol |
| A single version for a single study | A single version for a single study |
| Otherwise constantly changing and being updated | Otherwise constantly changing and being updated |
So protocols seem to behave like code. Unfortunately, we tend to treat them as text (we share them in the materials and methods of our manuscripts). It would be much more useful to have protocols in places like GitHub – allowing line testing and annotation, allowing ‘test-driven development’ of protocols, allowing branching and merging, etc. If we were to think of protocols as ‘code’, we could share them in a way that makes them more amenable to reuse. And if we do so, then an appropriate way to licence a protocol for sharing and reuse would be to apply the licences that promote sharing and reuse for code, not licences for text.
Data sits somewhere between the two. Like text, it has stable versions – e.g., the data that accompanies a specific manuscript. Once data is captured, it cannot be changed (except, of course, to correct an error, or to legitimately remove ‘bad’ data points or outliers). In essence, data changes by growing, reorganising, subsetting, etc., not by changing specific pre-existing values. Like code, it is dynamic as the project progresses, as outliers are eliminated, and as new data is added. So it has some of the dynamic behaviours of code and the stable behaviours of text. How data is created determines how it can be shared and reused: are the formats open or proprietary, is it licensed openly or not, etc. For the most part there is a point in time where data moves from behaving like ‘code’ to behaving like ‘text’. Good open formats and licences can bring data back to a dynamic state (something harder to do with text-objects). This behaviour is important when we write the descriptors of data: there are the authors, data is linked to protocols and code, and eventually text, it can be used for different analyses, etc. Chemicals, in a way, behave like data:
| Data | Chemicals |
|---|---|
| File name | Catalogue number |
| Version | Lot #, shipping date, aliquots |
| Storage place | Storage place |
| Linked to code and text | Linked to protocols and manuscripts |
How we share and describe data and chemicals is again similar. Is the chemical/data available to other researchers so they can repeat my experiments? Or is it something I produced in my lab and only share with a limited number of people? Again, how you ‘licence’ data and chemicals determines the extent to which these artefacts can be shared and reused. And, again, thinking about this intention at the planning stage makes a difference.
All three objects can be published and cited, and data and code are slowly claiming the hierarchical position they deserve in the research cycle. The need for unique identifiers for resources is also recognised here and here, for example.
During the workshops it was fun to get people to ‘classify’ their research artefacts based on these behaviours. At MozFest, for example, Daniel Mietchen suggested his manuscripts behave more like code. I would argue that they should then be licenced (and described) like code.
What I learned from these workshops (and Billy’s 3×3 table) is that if we can classify all of our artefacts within these categories, then the process of describing our research artefacts, and of building them with the intention of openly sharing for reuse, becomes much easier. And teaching the skills to understand how your choices constrain what is possible downstream becomes more achievable.
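To make the idea concrete, the 3×3 model can be sketched as a simple lookup: classify an artefact by the object category whose behaviours it matches, and candidate descriptors and licence families follow. This is only an illustrative sketch of my reading of the model; the category assignments mirror the examples above (equipment behaves like text, protocols like code, chemicals like data), and the specific licence-family strings are my own placeholders, not recommendations.

```python
# A sketch of the 3x3 model as a lookup table.
# Category assignments mirror the examples in the text:
# equipment behaves like text, protocols like code, chemicals like data.

BEHAVIOUR = {
    "text": {"stable": True,  "dynamic": False,
             "licence_family": "open content licences (e.g. Creative Commons)"},
    "code": {"stable": False, "dynamic": True,
             "licence_family": "open-source software licences"},
    "data": {"stable": True,  "dynamic": True,
             "licence_family": "open data licences (e.g. CC0)"},
}

# Which of the 3 object categories each research artefact behaves like.
CATEGORY = {
    "manuscript": "text",
    "equipment":  "text",
    "software":   "code",
    "protocol":   "code",
    "dataset":    "data",
    "chemical":   "data",
}

def licence_family(artefact: str) -> str:
    """Suggest a licence family based on how the artefact behaves."""
    return BEHAVIOUR[CATEGORY[artefact]]["licence_family"]

# A protocol behaves like code, so it gets a code-style licence family.
print(licence_family("protocol"))  # open-source software licences
```

The point of the lookup is the argument made above: once you know how an artefact *behaves*, the descriptors and licences follow from the category, not from the discipline.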
As long as, of course, you think about this from the beginning.
Footnote: This is my interpretation of Billy Meinke’s thinking model – he may loudly laugh about my interpretation. He may even roll his eyes, hit his head against the wall – I don’t know. But the clarity he brought to my approach to the problem is something I am extremely grateful for. Hat tip.
MozFest had given the project a good kick-start. Working with Billy Meinke helped me reframe my way of thinking, and I was impatient to find out what the session would look like with a room full of academics. I went back to talk about how to go about this with Cameron McLean, who had been part of the original biscuit-eating exercise that got this started.
Cameron McLean was a PhD student at the University of Auckland who I was co-supervising. He had contacted me after coming across this blog, and I encouraged him to contact Mark Gahegan from the Centre of eResearch to discuss a possible PhD. Part of his thesis work focused on trying to understand how to make the implicit knowledge we have in our research workflows explicit. (His thesis is about more than that – so hey, go read some of his work).
I related to him what I had learned at MozFest from Billy Meinke. I introduced him to the concept of the 3 objects and 3 actions that we had used at our sprint at MozFest, and he stared back at me with the same confused gaze that I must have given Billy on our first day of work. Nonetheless, we went ahead.
We managed to gather a number of people from the University of Auckland who were involved in research, and we ran a session. As we had done at MozFest, we asked people to write down statements in the form of ‘I [action] [object]’ and mapped those onto the action/object tree. As at MozFest, we followed that with a second capture in the form ‘I [action] [object] for [context]’, which gave us an idea of what the motivations for those things were. As at MozFest, we saw that the action/object combinations produced were slightly different when a justification was needed. In other words, there are things we do, and then other things that we do because there is a ‘because’. For example, in the context round we started finding statements related to ‘because my job requires it’. For some reason, adding the context to the statements elicits a somewhat different set of combinations (or makes values explicit), which was useful to know.
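The mapping step of these sessions can be sketched as a tally: each ‘I [action] [object]’ statement gets placed into the 3×3 grid of actions and objects. In the workshops this placement needed human judgement; the little synonym tables below (e.g., mapping ‘analyse’ onto ‘reuse’, or ‘genes’ onto ‘data’) are my own illustrative assumptions, not the actual mappings we used.

```python
from collections import Counter

ACTIONS = {"create", "share", "reuse"}
OBJECTS = {"text", "code", "data"}

# Illustrative synonym tables: in the real workshops, mapping verbs and
# nouns onto the 3 actions and 3 objects was done by people, not rules.
ACTION_SYNONYMS = {"write": "create", "analyse": "reuse", "publish": "share",
                   "review": "reuse", "sequence": "create"}
OBJECT_SYNONYMS = {"manuscripts": "text", "papers": "text",
                   "genes": "data", "software": "code"}

def classify(statement):
    """Map an 'I [action] [object]' statement onto the 3x3 grid."""
    _, action, obj = statement.lower().split(maxsplit=2)
    action = ACTION_SYNONYMS.get(action, action)
    obj = OBJECT_SYNONYMS.get(obj, obj)
    if action in ACTIONS and obj in OBJECTS:
        return (action, obj)
    return None  # needs a human to place it, or doesn't fit the model

# Tally the example statements from the sprint into the grid.
grid = Counter(filter(None, map(classify, [
    "I analyse data", "I review manuscripts", "I sequence genes"])))
print(grid)
```

The interesting empirical claim from MozFest and Auckland is that `classify` never needed to return `None`: every statement found a home in the grid.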
At Auckland we added another round, divided into two phases. By now we were moving forward in our thinking about what this ‘semester course’ would look like. So we wanted to ask researchers what digital skills they kept teaching over and over again as new people entered their research group, and wished they didn’t have to teach. We captured some more of this and looked for which skills were common and which were more ‘lab-specific’. This was a first step towards identifying what topics would have to be included in the course.
In the next round we did something slightly different – we tried to get at what they wished people entering the lab already knew how to do. While this seemed not too different from the previous round, what it did was elicit (we didn’t plan for this) an accelerated venting of what actually frustrated researchers. A lot of what came out of this round were behavioural attributes, such as collaboration, knowing how to ask questions, etc.
The overview of the map did not look too different from the one at MozFest. We also saw an accumulation of paper notes around the ‘paper’ object. Not surprisingly – it is the main artefact through which we measure the value of our researchers.
At the eResearch Symposium in Queenstown in 2015, Cameron and I teamed up with Matt McGregor from Creative Commons Aotearoa New Zealand to run a workshop. This time we would have a group of people who were more tech-savvy, but also from diverse institutions across New Zealand. We essentially ran a similar workshop, with more of a focus on what topics you would teach if given the opportunity to decide the content of the course.
We learned from all of these workshops that if we were to build a course to solve the issue of digital literacy, we would need not only to teach the skills but also to tackle, explicitly, specific behavioural attributes. More importantly, we needed to focus on creating an understanding of the value of treating non-traditional research artefacts (data and software) with the respect they are due. We also found agreement that a course for students alone was not going to solve the problem: other research staff also needed access to this training. The problem was becoming bigger. It was also becoming more exciting.
It also reinforced the importance of providing not solely a set of skills, but also the social context which potentiates the value of these skills. We also agreed that the objective should not be to encourage students to become seasoned software developers, but instead to give them enough confidence to undertake small software projects and to provide them with a common language that they could share with software developers for larger projects.
We had been shown a pretty clear roadmap and a pile of bricks to make the awesome happen.
We now needed to find an architect.
MozFest 2014 was worth every minute of suffering from jet lag and trying to recover a lost suitcase containing the only change of clothes I had.
I met Billy Meinke for a beer the night I arrived. We had interacted over email as we planned the session for MozFest – I really did not know what to expect and was hoping we would get along well in real life. We were to run a sprint on ‘Skills and Curriculum Mapping for Open Science’. Our aim was not to think about a set of workshops but rather how to embed the skills that are delivered through initiatives like Software Carpentry into a University curriculum. To do that, one has to think about how learning objectives map onto graduate profiles, and how activities map onto different levels of learning (think Bloom’s or SOLO taxonomy). Billy was ideal to have this conversation with – his superpower is understanding how we learn. (I came to find out he has heaps more superpowers!). Billy turned out to also be an awesome person. We were off to a good start.
The next day we planned our sprint. While most around us were working on their laptops, Billy pulled out a set of cards, paper and pens. He went on to explain the mind-map he had drawn on the airplane, and then suggested that the entire science enterprise could be reduced to three objects (data, papers and code) and three actions (create, reuse, share). I secretly hoped nobody would slip on the bits of brain on the floor from my head exploding. Billy looked young, and we scientists tend to have a certain arrogance about ourselves. So I really tried hard to hide my initial reaction. Billy, however, looked quite convinced about what he was saying – so I said ‘let’s give it a go’. Best. Decision. Ever.
We ran the sprint. We asked participants to write on bits of paper how they interacted with science, in the form of ‘I [action] [object]’. For example, ‘I analyse data’, ‘I review manuscripts’, ‘I sequence genes’. We then worked on fitting those descriptions into the 3 objects and 3 actions that Billy had suggested. What surprised me was how easy it was. As I watched Billy pin things on the board I kept thinking: have we been over-complicating the description of what we scientists do? It seemed that the answer was yes.
For the rest of MozFest we left that board up, next to a table with bits of paper and pens so that people could feel free to add their bit. More clarity emerged.
If we can really reduce the description of the objects and the actions to these simple sets, then solving the training problem becomes easier. It is no longer entangled in discipline-specific details and nuances – there is a common ground that we can leverage. And, if so, then mapping to a curriculum is easier – there is a generalisation to be made that we can exploit to make it happen.
There were a number of researchers at MozFest, but the contributions to the board were not limited to them. MozFest captures a rather unique crowd, most of whom are quite happy to be pushed outside the box (or who never saw the box in the first place).
Would this vision work on mainstream university academics?
I am grateful to Mozilla Science Lab for the opportunity to go to London and to Kaitlin Thaney for pairing me with such a great partner in crime. I also cannot thank Billy enough for challenging my thinking the way he did. It was an eye opening experience. I flew around the world back to Auckland with a new mission: to find out what that board would look like if you packed the room with seasoned, mainstream academics.
And so I did.
Let’s start with some history.
Hamilton. 2014. NZ eResearch Symposium.
Kaitlin Thaney from the Mozilla Science Lab had accepted the invitation to fly to the other side of the world to speak at the Symposium. It was great to have her here in NZ. We had plenty of opportunities to speak about open and reproducible science, and of course, the great work that Software Carpentry was doing.
I had taken a version of Software Carpentry at a previous NZ eResearch Symposium. I was impressed. I went back to the lab filled with enthusiasm about getting those skills to work, signed up to Roger Peng’s MOOC, and got going. After finishing the MOOC I decided to give my newly acquired skills a go on some of my data, but didn’t get too far – I kept running into stumbling blocks, and there was no one around that I could really talk to. I then contacted my colleague Andy Moiseff at UCONN. I knew he had a set of data that would be ideal for what I was trying to learn. Andy had also offered a programming course at UCONN when I was a PhD student in the very early 90s (yes, last century). He is also probably the best teacher I have ever had, and has been an incredible mentor to me. I suggested that I would try to write a solution to the analysis in R if he was happy to help me overcome my stumbling blocks. He agreed. And so I got started.
It all went well on the basics, but as soon as things got a bit more complex, I found it hard to keep going. And then the teaching semester started, and with that I also lost the ability to stick to the learning on a consistent basis. And so that attempt slowly disappeared from the workflow.
In Hamilton, I related this to Kaitlin – I was frustrated. It was not for lack of wanting, but something was missing in my ability to move forward. While the interactions with Andy were always encouraging and rewarding, they were too asynchronous to keep a good momentum going. I needed the social context of learning.
So, having biscuits over morning tea at the conference in 2014 with Kaitlin, Nick Jones and Cameron McLean, we asked: What would Software Carpentry look like if it was a semester course? If it let learners work, with peers, on their own research problems, with more time to practice the skills and reflect on the learning and how it applied to each learner’s context?
This is how this started.
Soon after this exchange, we proposed to explore this at MozFest. I am grateful to Mozilla Science Lab for supporting me to fly to London and for pairing me up with Billy Meinke (then at Creative Commons HQ) to run a session there.
At MozFest we learned a lot. Working with Billy was more than awesome. There are those occasions that make me change the way I look at things – and this was one of those. But more on that later.
Forward to Auckland 2016. We are now running a pilot of this original concept as a credit earning course for students at the University of Auckland (MEDSCI 736).
This is the first of a series of posts that will describe how we got here, what helped us get here and the obstacles we encountered along the way. There is a long list of names to come of people that have made this possible. But before that, I want to start here thanking Kaitlin, Nick, Billy and Cameron who encouraged me to ‘do something’.