An empirical analysis of microbiome data availability

My labmate sent us a paper called “An empirical analysis of journal policy effectiveness for computational reproducibility” recently, and obviously I (1) got very excited and (2) wondered if I could do something similar from my own experience gathering microbiome datasets.

Now, before I begin, I want to point out some obvious differences between this blog post and the paper which might lead to lower data availability success rates on my end: I specifically targeted microbiome case-control datasets, and not every paper was published in a journal with data availability requirements. Also, many of the datasets I found were generated by teams of clinicians with (I imagine) even less computational support than most of the researchers behind the predominantly computational studies that these authors identified. On the other hand, I did bias my search toward papers that were more likely to have data available by focusing only on 16S sequencing data, which does have standardized ways of being shared and made available.

That said, I think it’s still an interesting exercise to quantify success rates and difficulties in acquiring data. I once saw a talk by Casey Greene which quantified the amount of money represented by a subset of data in the SRA, just considering sequencing costs. I was tempted to do something similar for my paper, but decided against it. Either way, it’s an important message: the data we generate is worth actual money and it seems like a shame for it to only serve one purpose, if it could serve more! Like the saying goes… reuse, reduce, and recycle! :P

And a final caveat: replicating and reproducing results are not the same thing. (Simply Statistics has a great explanation of this difference). In my opinion, replicating computational work should be an absolute expectation. Reproducing results/experiments, on the other hand, can be trickier.

Here’s how the authors of the replicability paper in PNAS went about their analysis:

  1. Identify potential papers (in their case, articles “whose findings relied on the use of computational and data-enabled methods”; in my case, case-control gut microbiome 16S datasets).
  2. Try to find the data
    • Count how many articles had enough info just from the paper to find the data and code.
    • For the others, email the authors and track:
      • How many responded favorably vs. unfavorably
      • How many responded but then led to a dead end
  3. Try to replicate the results, keeping track of how much extra work it took them to succeed.

Since my meta-analysis used the original authors’ data for my own re-analysis, only the first two steps are relevant for me.

As a side note, I did briefly look into reproducing a few papers’ results mostly as a sanity check for myself early on in the project and in my PhD. It was actually really hard to do, partially because I was learning everything from scratch but also because many of the steps were either not documented (so I had to figure them out by digging into the data), or they were scattered throughout the paper (a complaint noted in the PNAS paper). And going through each paper to extract which OTUs they reported as significant and compare them with our results (the ~30-page supplement wrangled together by my co-author-in-crime Sean Gibbons) was extremely painful, to say the least.

Okay, but back to the data! Looking back at my master spreadsheet of datasets (which is a huge mess, full disclosure), it looks like I went through 58 studies. I found about 10 more for which I didn’t get around to reading the paper or looking for the data, so for now let’s stick with these 58.

Success stories

Of the 58 studies in my spreadsheet, 28 (48%) were included in my meta-analysis. This means that I had the raw data, processing information (e.g. barcode map, primer sequences if applicable, etc.), and metadata mapping samples to controls or cases. It also looks like I have the full data for 4 more of these studies; I just didn’t get around to including them in the meta-analysis, or the data was posted after I had stopped adding studies. So, in total I was able to successfully acquire data for about half of the studies I identified. Wow! That’s pretty low. My collection process spanned at least a year, and I really enriched for studies that would have been more likely to have data available…

Overall, the majority of my “successful” datasets had their data deposited in public databases (24 out of 32, 75%): 18 in the SRA, 5 in the ENA, and 1 in MG-RAST. 3 studies posted their data on a separate author-provided website, and for the 5 remaining studies I got the data by emailing the corresponding authors.

But the story doesn’t end with raw data, because to do almost anything useful with raw data you need to know the associated metadata. In my case, that was clinical case/control status. Of the 27 studies with raw data publicly available on the internet without needing to email the authors (24 in databases + 3 from author-provided links), 17 provided the metadata along with the raw data (though a couple of the SRA datasets had the metadata in a really strange place, like the SRR sample description field), and 6 studies provided the metadata in the paper’s supplement (a few of which were in pdfs, ugh). There were also two examples of studies available online where the metadata was encoded in the sample ID, and I had to infer the case/control status from that. I had to email 2 authors to get the clinical metadata, even though their raw data was available.
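(As an aside: if you want to check where a study stashed its metadata without clicking through every run record, it’s pretty quick to pull the run-level table for a whole project programmatically. Here’s a minimal sketch using ENA’s file report endpoint; the accession is a placeholder and the field list is only my guess at useful columns, so adjust to taste.)

```python
import csv
import io

import requests

# Placeholder accession; substitute the project/study you're chasing.
ACCESSION = "PRJNA000000"

# ENA's file report endpoint returns run-level metadata as a TSV.
# The field list below is just a guess at columns that tend to hold
# case/control hints; check the ENA Portal API docs for the full set.
resp = requests.get(
    "https://www.ebi.ac.uk/ena/portal/api/filereport",
    params={
        "accession": ACCESSION,
        "result": "read_run",
        "fields": "run_accession,sample_accession,sample_title,fastq_ftp",
        "format": "tsv",
    },
)
resp.raise_for_status()

# Eyeball the sample titles for anything that looks like clinical metadata
# hiding in an unexpected field (e.g. "CD_stool_14" vs. "healthy_07").
for row in csv.DictReader(io.StringIO(resp.text), delimiter="\t"):
    print(row["run_accession"], row.get("sample_title", ""))
```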

Unsuccessful stories

Ok, but that’s not the most interesting part of this analysis (or the PNAS paper). What about the other 26 studies which I couldn’t get data for?

Not all data is deposited equally

Two of these unsuccessful studies were available in dbGaP, an NCBI database that is not public because it contains sensitive human data. I didn’t want to bother going through all the approvals to get access to these data, and I think that unless someone really wanted to use a study specifically in dbGaP, they wouldn’t either. This is actually an interesting commentary on the state of microbiome science: we still haven’t reached consensus on how “private” or “protected” microbiome data should be. But that’s a conversation for another day…

Interestingly, 5 of the studies that I couldn’t get full data for were in the SRA, and 2 had raw data available somewhere other than the SRA. 3 of these studies were excluded because I could not get hold of the clinical metadata: I emailed 2 of them with no response, and didn’t get around to emailing the third. I excluded another because the data was weirdly trimmed to a very short length, and I didn’t really trust it. The last SRA dataset was pretty wild: they collected multiple samples per patient, but lost the patient identifiers when they uploaded the data to the SRA. I emailed them, and they confirmed that this was what had happened (and recommended that I treat the samples independently, which is definitely not correct). Of the 2 non-SRA datasets, one didn’t have associated clinical metadata, and the other pointed to an MG-RAST project which didn’t seem to exist. There was also one additional dataset with an SRA accession number that turned out to be a dead link.
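(Incidentally, a suspiciously trimmed dataset like that one is easy to catch before you sink time into it: just tally the read lengths in one of the downloaded FASTQ files. A minimal sketch, with a made-up filename:)

```python
import gzip
from collections import Counter

# Made-up filename; point this at whatever you pulled down from the SRA/ENA.
fastq_path = "SRR0000000_1.fastq.gz"

lengths = Counter()
with gzip.open(fastq_path, "rt") as handle:
    for i, line in enumerate(handle):
        # FASTQ records are 4 lines each; the sequence is the 2nd line of each record.
        if i % 4 == 1:
            lengths[len(line.rstrip("\n"))] += 1

# If nearly every read is ~75 bp when the paper claims 2x250 bp sequencing,
# something upstream went wrong (over-aggressive trimming, wrong files, etc.).
for length, count in sorted(lengths.items()):
    print(f"{length} bp: {count} reads")
```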

Emails upon emails

I emailed the authors of 16 of these studies for data; 3 (19%) responded and got me the data. I ended up not following through on 2 of them (getting clinical metadata, checking that I had everything I needed, etc) because it was pretty late in my meta-analysis and I had a paper to write. One of these responsive authors sent me a dropbox folder full of a jumble of files in folders upon folders. I was pretty annoyed at the state of these Dropbox folders, because it seemed like they’d made zero effort to clean them up for a non-them audience. I (was very grumpy and) decided it was way too much of a hassle to figure out and instead opted not to include this study in my analysis.

I got no response from 6 of the 16 authors I emailed (38%) - womp.

Two of the authors replied, but then were lost to follow-up - one put me in touch with someone else who didn’t respond, and the other had to ask their ethics committee for approval and then didn’t follow up with me.

Interestingly, 5 of the authors I emailed were in the process of putting their data on the SRA. I was sending these emails in the Fall of 2016; 3 of these papers were published in 2016, and 2 were published in 2015. In my email communications, one said they were preparing to publish their dataset, another had some back and forth with the SRA because their data wasn’t where they expected it to be, and another threw me some serious shade:

If you have not done this before, transferring files to NCBI is a time-consuming, painstaking process. After much back and forth with them, the samples are available here.

I didn’t follow up on these 5 studies because, again, it was late in my analysis - so perhaps if I had, they would have bumped up the success rate.

Drama drama

Some juicy stuff did happen on this journey. One corresponding author I emailed mentioned a dispute between people in the lab of the person who generated the data, such that the corresponding author’s main point of contact no longer had access to the data. Another author, whose data is on the SRA but doesn’t match the sample numbers reported in the paper, sent me a Dropbox link to the FASTA files. When I asked for the raw FASTQs, he said they were probably somewhere on a server at his former institution, but now that he’d moved on he wouldn’t know how to access them. And then there was the previously mentioned dataset whose samples lost their subject ID mappings. Also, in a different project from this one, I encountered an author who had simply lost their raw data: they said they’d switched computers since the time of publication and didn’t know where the data had gone, or whether it even still existed.

All of these situations are unacceptable, in my opinion. Data costs money and has value so far beyond its originally published findings. We really need to find ways to incentivize, value, and simplify data deposition so that data is not lost to people moving on, computers not being backed up, or personal conflicts getting in the way of collaborations. </soapbox rant>

In conclusion: deposit your data!

| Complete data? | Raw data source | N | Metadata source | N |
|---|---|---|---|---|
| Yes (32) | Databases | 24 | Same database | 16 |
| | | | Another database | 1 |
| | | | Paper supplement | 5 |
| | | | Email authors | 2 |
| | Other internet | 3 | Same website | 2 |
| | | | Paper supplement | 1 |
| | Emailed | 5 | | |
| No (26) | Databases | 6 | Same database | 1 |
| | | | Study doesn’t exist | 1 |
| | | | Paper supplement | 1 |
| | | | Email authors | 3 |
| | Other internet | 1 | Email authors | 1 |
| | | | Paper supplement | 1 |
| | dbGaP | 2 | | |
| | Emailed | 17 | No response | 6 |
| | | | Lost to follow up | 2 |
| | | | In process of uploading to SRA | 5 |
| | | | Provided data, but I didn’t follow through | 3 |
| | | | I didn’t email | 1 |
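(If you want to double-check my bookkeeping, the top-level counts from the table tally up like this:)

```python
# Top-level tallies from the table above.
complete = {"databases": 24, "other internet": 3, "emailed authors": 5}
incomplete = {"databases": 6, "other internet": 1, "dbGaP": 2, "emailed authors": 17}

n_yes, n_no = sum(complete.values()), sum(incomplete.values())
total = n_yes + n_no

print(f"complete data:   {n_yes}/{total} ({n_yes / total:.0%})")
print(f"incomplete data: {n_no}/{total} ({n_no / total:.0%})")
# complete data:   32/58 (55%)
# incomplete data: 26/58 (45%)
```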

Not only did I get the majority of my successfully re-used datasets straight from the internet, they were also way less painful to download and process. Emailing corresponding authors is not the way to go. I remember feeling so bad about continuing to pester authors about the data I needed, especially since I knew that in many cases they were fairly far removed from the people who’d actually done the computational work. So save yourself the hassle, save your future research parasites the hassle, skip the middle person, and just put your raw data and associated metadata in publicly available databases!