Several years ago I made a personal commitment to Open Science: I try to publish papers on which I am the lead author in open access journals, and I archive the data for these papers in Dryad, Figshare, and other repositories. Recently I started posting preprints of my manuscripts as well, and I encourage authors submitting to the journal of which I am Editor to do the same. The thing that I had been most hesitant about was posting my code – I’ve been programming for a long time, but I’m not the most elegant of programmers (NB: massive understatement), so to be honest I was a little worried people would mock my efforts. I finally got over that, stumbled through some GitHub tutorials, and as a result you can now go over there to see the code used to do the analyses and generate figures for my two most recent articles, as well as for a few projects in progress.
Although I feel good about having done this, it’s also become clear to me that there is a real opportunity cost to Open Science about which I think we need to be honest with our students. There are actually multiple opportunity costs. One potential cost is lost future papers: I often hear about the possibility of getting scooped because we post our data, but concrete examples seem hard to come by and I just don’t buy it. Another is lost status and opportunities: our profession still prioritizes articles in publications like Science, Nature, or PNAS, so as a candidate on the job market you are still much further ahead of the pack if your paper is in one of those journals than if it is in PeerJ or PLoS ONE (I’m not saying that’s right, only that it’s true). I think this is a more legitimate opportunity cost, and one that will be with us for a while, unfortunately. The opportunity cost I’m talking about here is more diffuse: the time I devoted to archiving data and code could have been spent on other activities that have greater rewards under the current system. I could also have used the money I spent on archiving fees and on publishing in a journal with an OA option to advance ongoing research in the lab.
For my most recent paper I did the math:
- Double checking the main dataset and doing some reformatting to prepare it for submission to Dryad: 5 hours (NB: I had already invested a fair amount of time in reorganization of the dataset to get it in line with the suggestions of Borer et al. because I use R for analyses and to generate (many of) the figures. Unfortunately, we didn’t give much thought to data organization when we originally entered the data, so the result was a very complex and inefficient set of files).
- Realizing I probably needed a second file to complement the first one (the main file I was going to upload includes a list of 40 demographic plots, but not the specific locations in the reserve where these plots are located), creating the CSV file using the original data set, thinking there was a mistake in a few of the points, trying to figure it out, and realizing there wasn’t a mistake after all, and preparing the metadata file: 3 hours
- Submission of these two files and the metadata to Dryad: 45 minutes
- Preparing a figure of these locations (a map, since not everyone is familiar with the layout of trails at the BDFFP): 1 hour
- Submission of this map to Figshare: 15 minutes
- Getting up the courage to post my code to GitHub, looking over my code, rewriting all the comments and annotations so that someone other than me understood them to the point they could see what I did step-by-step, deciding to take my long inefficient scripts and simplifying them by creating functions to do some of the redundant stuff (which I should have done in the first place), and uploading to GitHub: 25 hours
- Freezing a version of my code on GitHub and getting a DOI for it using Zenodo (as per the suggestion of @noamross; the DOI makes it easier to cite in journals and allows people to better see the changes in versions of code over time): 30 minutes
- Editing the Endnote output styles for Ecology to include bibliography templates for “Computer Program” and “Dataset”: 30 minutes
- Cost of archiving in Dryad: US$90
- Page charges: estimated article length of 8 pages @ US$75 per page: US$600
TOTAL: 36 hours* and US$690.
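As a quick check of the arithmetic, the tally above can be reproduced in a few lines of Python. This is only a sketch: the dictionary keys are abbreviated labels for the items listed above, not names from any actual script, and the numbers are simply the ones given in the list.

```python
# Time per task in minutes, copied from the list above.
# The keys are shorthand labels chosen for this illustration.
task_minutes = {
    "check and reformat main dataset": 5 * 60,
    "second file and metadata": 3 * 60,
    "submit files to Dryad": 45,
    "prepare map of plot locations": 60,
    "submit map to Figshare": 15,
    "clean up code and post to GitHub": 25 * 60,
    "freeze code and get Zenodo DOI": 30,
    "edit Endnote output styles": 30,
}

# Out-of-pocket costs in US dollars, also from the list above.
costs_usd = {
    "Dryad archiving": 90,
    "page charges": 8 * 75,  # 8 pages at US$75 per page
}

total_hours = sum(task_minutes.values()) / 60
total_usd = sum(costs_usd.values())

# Fraction of the total time that went into the code clean-up alone.
code_share = task_minutes["clean up code and post to GitHub"] / sum(task_minutes.values())

print(f"Total time: {total_hours:g} hours")         # Total time: 36 hours
print(f"Total cost: US${total_usd}")                # Total cost: US$690
print(f"Share spent on code: {code_share:.0%}")     # Share spent on code: 69%
```

Laid out this way, it is easy to see that cleaning up and posting the code accounts for roughly two-thirds of the time.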
This is not a trivial amount of time: it’s almost a full week of work spent on archiving data and code for one paper. That was precious time I could have spent preparing for classes, working on other manuscripts, writing grant proposals, going to the gym, staring longingly at the sax I haven’t played in months…whatever. And bear in mind, this quick calculation also doesn’t include any of the one-time financial and time investments that are amortized over multiple submissions, including:
- Time spent registering with Dryad
- Time spent registering with Figshare
- Time spent learning how to use R/MATLAB/Python/whatever for analyses instead of Systat or JMP so that the scripts are available for others to use and reproduce results
- Time spent learning to use Git and the RStudio/GitHub tools so that the code is available
- Time spent learning how to organize data / metadata for the purposes of archiving (e.g., reading Borer et al. and the DataONE primer on best practices for data management).
- Note also that the cost could have been even higher if I had published in, for example, PLoS ONE.
Granted, the cost could also have been lower. I could have reduced the price by US$200 by publishing in PeerJ, or by US$600 if I had published in Biotropica, which waives page charges for ATBC members. In addition, the time spent on these tasks will decrease as I become a better programmer and because the datasets we collect in the future will be well organized from the start, diminishing the need for reorganization at the time of deposition. Still, this past week was a reminder of what I see as being the major hurdle to overcome when trying to convince others that we should strive for Open Science: it is a major pain in the ass and is really expensive, in terms of both the money and the amount of time required. Without a better system of incentives from the community for archiving data and code, 36 hours and US$690 may be too much effort and money for most people. We need to recognize that reality and identify creative ways to change the current system, because let’s get real – telling people “you should because it’s the right thing to do” and assuming that’s enough just isn’t going to cut it (not that a compliment from Ethan isn’t reward enough for me, but I already have tenure, so…). We also really need to teach our students how to do this now; it’s much easier if you develop good habits early. Finally, it’s also important for me to remember that it will get cheaper and cheaper every time I do it (e.g., preparing metadata was a snap this time because I used the template from my prior submissions to Dryad).
Regardless, let’s be mindful when advocating Open Science – it’s hard, expensive, and comes with both accounting and opportunity costs. If we want to make Open Science the norm, we need to find ways to minimize these actual and opportunity costs, not just promote incentives for engaging in Open Science. We need to stop telling people “You should” and get better at telling people “Here’s how”.
[edited 4 September 2014 @ 9:44 pm for clarity and to correct some typos]
[updated 10 September 2014 @ 9:06 am for clarity and after one hour was spent on activities 7 and 8]