The mantra and mania of data sharing

This post is by Lizzie. The photo is from a photos folder I found from my PhD called ‘favorite stake photos.’

I wrote this post before seeing Andrew’s post for today.

When I was a grad student I spent a remarkable amount of time wandering around Sweetwater National Wildlife Refuge visiting 56 shrubs. Over and over again. I visited the shrubs almost every day. It wasn’t wandering; it was a structured, efficient route into the site, hitting each ‘replicate,’ then down the hill, up the next. Some days I was opening or collecting pitfall traps under the shrubs, some days I was vacuuming the shrubs (both of these tasks were to collect arthropods), other days I measured soil respiration, collected soil samples under the shrubs, or took clippings of the shrubs. There was also climate data to collect under the shrubs, and little litter bags I was variously adding and removing under the shrubs. Vegetation sampling! I have forgotten much of it but I recorded a daily log so it can come flooding back. It felt like a lot of work.

In the end I published four papers about those 56 shrubs (which became 54 after a fire). Stuff about invasive grasses and carbon cycling, and bugs and stuff. Solid work, science maybe inched forward? Maybe it stepped right, but because of all the data I think it inched forward.

After grad school I joined a sort of think-tank that had been funded by NSF to promote data synthesis in ecology. Ecology needed it. It was (is?) a bit of a stuck field with a cacophony of individual studies in different places with different shrubs (or in lakes, forests etc.) — ‘boots and bucket’ ecology I heard it called.

You put on your boots, grabbed your bucket and, voila! Ecological science. The think tank was a renegade endeavor, trying to make sense of all the individual studies by looking across them for patterns and maybe even testing some theory now and then. There was a tension between ‘boots and bucket’ ecologists at the time and ‘synthesis’ ecologists. According to the ‘boots and bucket’ tribe the ‘synthesis’ ecologists were stealing all their data for flashy papers. They were upsetting the order of things. Some said they were getting it all wrong because if you didn’t collect the data, you didn’t know the system well enough, and you could never figure anything out (yes, let’s all take a minute and think about where a field with this idea would be headed). Others said data would stop being collected and everyone would just do ‘synthetic ecology’ and never have much data.

This world was swirling far above and away from me and my 56 shrubs at Sweetwater, but at the think-tank they gathered all the new postdocs and told them about the power of data sharing. Science advances if we share data! Think of the questions we could answer if all the data were shared and organized! It just takes an hour or so to post your data. Go ahead, post your PhD data!

I was totally in. I posted all my PhD data and I dove in on the power of data sharing and wrote a paper about it for climate change biologists. I found papers showing that the massive improvements in pediatric oncology (for leukemia it went from 4 to 94% survival) could be attributed in part to data sharing. I read up on GenBank and drooled at a field so close to mine in topic but so far away in data sharing (and one that also trounces ecology in finding important interesting science, IMHO). I felt like scientists should take an oath to advance science, and if they took that oath, then clearly they would see that they have to share “their” data. We’re trying to mitigate climate change, people! Share your data!

Fast forward 20 years and none of the scare tactics of the anti-data sharing folks have come true. There’s no drop in data. (Though I got this argument recently from a marine biology postdoc, who then retreated a little from the premise when I asked for data on the declining data. Given that this had been happening for 20 years at least, and she did manage to agree on that, shouldn’t we see the pattern? She then said data is only being propped up by PhD students who have to collect it for their PIs, so I guess she was suggesting a radical shift in how data are collected? And some verifiable decline in other data types? Though I didn’t try to steer the argument anymore.) Journals require data sharing. Granting agencies do. I think the synthetic ecologists won.

But a bunch of folks, beyond that one marine biology postdoc, missed the message. Over the last two years I have been running into major governmental and non-profit data-collecting agencies that will not share data.

For today, I will tell you about just one of them.

It’s the Canadian Forest Service (CFS). My lab recently contacted them for a big analysis we’re doing of tree growth responses to climate across western North America. We have a lot of data, because these data in the US are generally public. They’re either on the ITRDB or they were uploaded with papers (there are certainly some that are not shared, but I like to think they all will be, as the USFS and related US agencies do usually have a mandate to share data). But I happened to know that the CFS usually does not share data without co-authorship. They don’t share plot level data, they don’t share tree ring data, they don’t share data unless you sign an agreement with them and guarantee them co-authorship (and some other weird stuff that sort of sounds like they control whether you can publish what you find or not, but I think they have had to back off on that, so now there’s just related smushy language I suspect).

We asked anyway. I told my lab it’s important to ask and not just assume (even if everyone has told you that you will not get the data without co-authorship) and here’s the reply I got:

We are particularly interested in collaborating with researchers who bring expertise in Bayesian methods to help advance our analyses and explore future growth projections. With that in mind, we would like to explore the possibility of a scholarly collaboration with you.

Entering into collaboration would help streamline access to tree-ring data across Canada by removing certain barriers. Some of the data you requested are under restricted use, with licenses granted only to CFS researchers. Others require external requestors to obtain authorization from the original data owners—a process that can be time-consuming. Additionally, some datasets (highlighted in red below) have not yet been published, and we are actively encouraging collaborative projects that incorporate these data.

To provide further context, it is common practice for National Forest Inventory (NFI) data to have access restrictions, particularly for raw or highly detailed data. These restrictions are in place for several important reasons. NFI plots are located on both public and private lands. Disclosing precise locations could compromise data integrity or infringe on landowner privacy. Some datasets also contain sensitive ecological or proprietary information. Also, NFI data are designed to provide an unbiased representation of forest resources. Controlled access helps prevent misuse or misinterpretation, especially given the complexity of the data and the ongoing updates and revisions. NFI data support national and international reporting obligations, policy development, and collaborative efforts across jurisdictions. Ensuring consistent and validated use is essential. Finally, while publicly funded, the collection and processing of NFI data represent a significant investment (CFS & NSERC). Responsible dissemination protects this investment and ensures proper attribution and use.

I am working on a reply to this and open to all ideas/suggestions. I’ll give you what I have so far, vaguely in order of the arguments they have given (which is probably not the best order).

I appreciate their reply and understand their perspective, but requiring collaboration for access to data slows scientific progress, reduces equity and diversity in access to data, and has never been shown to be helpful or beneficial to science or to publishing robust results, and thus is something my lab has a policy against (we do).
If you have data you have had for a while and not published (7 of the 30 datasets we asked about), but would like the data analyzed, then publish the data. This is the best way to get data analyzed and then it will likely be analyzed by different teams of researchers so CFS would get maximum insights from the data.
Sensitive data can be fuzzed, jittered or otherwise changed enough to meet privacy standards but still allow others to use the data. Certainly for our purposes, given the grid-size of the climate data we’re using it is hard to imagine this would not be possible.
The best way to get data cleaned, corrected and properly interpreted is to share it widely. The more eyes on the data, the quicker these issues can be spotted and fixed. Further, lack of access suggests there is something to hide, which is extremely concerning.
It is precisely because these data are used for national and international policy that making them public seems critical. (Can anyone help me here? This seems so obvious that I am not sure how else to say it.)
Data is far more widely used (and cited) when publicly shared. More papers and research seem like a better return on the Canadian taxpayers’ investment, no?
If you really want to charge for the data, then charge for it — but make access of those data available to all who can pay.
Fundamentally, there is a large number of researchers, Canadian and otherwise, who would use the CFS data and don’t because of this policy. These are excellent researchers who either do not have time for the effort of collaborating with one team that requires collaboration in exchange for data when all other teams make their data public, or do not want to support this process because it slows progress in science and the more researchers who sign onto it, the more it is tacitly condoned. Lots of good scientists I know (myself perhaps soon to be one) will not use CFS data because of the current access policies. Or, if they use it once, they won’t use it again.
Taxpayers paid for these data to be collected, they should get to see it and use it how they please. And with the way things are going, I would add that — if the commitment is to data quality — the more people who can access and download it now, the better. Political regimes of the worst kind often remove and restrict data.
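
To make the jittering point in the list above concrete, here is a minimal sketch in Python. It is an illustration only, not the method CFS or anyone else uses: the function name `jitter_location`, the 2 km cap, the example coordinates and the flat-earth conversion of roughly 111.32 km per degree of latitude are all assumptions made for this example. A real privacy standard would set the cap, and the approximation only holds for small offsets away from the poles.

```python
import math
import random

KM_PER_DEG_LAT = 111.32  # assumed approximate length of one degree of latitude


def jitter_location(lat, lon, max_km, rng):
    """Return (lat, lon) moved by a random offset of at most max_km.

    Hypothetical helper: the offset is drawn uniformly over a disc of
    radius max_km, using a flat-earth approximation around (lat, lon).
    """
    # sqrt of a uniform draw gives a uniform density over the disc's area
    distance_km = max_km * math.sqrt(rng.random())
    bearing = rng.uniform(0.0, 2.0 * math.pi)

    # a degree of longitude shrinks with the cosine of latitude
    km_per_deg_lon = KM_PER_DEG_LAT * math.cos(math.radians(lat))

    dlat = distance_km * math.cos(bearing) / KM_PER_DEG_LAT
    dlon = distance_km * math.sin(bearing) / km_per_deg_lon
    return lat + dlat, lon + dlon


if __name__ == "__main__":
    rng = random.Random(2024)  # fixed seed so the example is repeatable
    # made-up plot location, jittered by up to 2 km
    print(jitter_location(54.0, -122.0, 2.0, rng))
```

If the cap is kept well below the grid size of the climate data, the climate values matched to each jittered plot should change little, while the exact plot location is no longer disclosed.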

I don’t think CFS researchers have anything to hide. They run a really nice database of their data (I know, because you get to search around it to request the data) and they are helpful and sharp when I meet with them or we correspond over email. I also don’t think widely incorrect papers or policies have been prevented by this restricted access. But I think that some researchers have been fed a steady diet to make them fear these possibilities and I am not sure how to disabuse them of this version of the world.

I hang out with a lot of people who share their data — climatologists often share it, and folks related to them (e.g., dendrochronologists in the US), phenology people usually are better (though I have had recent issues) — and I hang out with folks who don’t.

The people who share data are happier. They don’t spend time telling me all the horrible things that will happen if they share data. They don’t spend their time worrying about it. They just share their data and move on.

It’s like people who spend all their time talking about work-life balance. I find them much less happy than the people working until 11pm some days; those folks are often also the ones tango dancing until 1am the next night, or leaving on multi-day kayak trips or getting in a Truck Surf hotel to tool around Morocco surfing. The ones talking about work-life balance seem too set on searching for something I think they would find if they stopped searching for it so desperately.
