Gaining access to and processing dbGaP datasets that are unavailable on AnVIL Data Explorer

Is there a recommended process for accessing datasets that are not available on AnVIL? For example, let’s say I want to process eMERGE PGx data on AnVIL and I have approval for the dataset. Currently, that data is not available on AnVIL, and only dbGaP is listed as a data storage and distribution platform.

  1. Is there a way to request that it be hosted on AnVIL?
  2. Assuming yes, how long would that process typically take?
  3. Assuming no, what is the recommended way to get the data into an AnVIL workspace for processing? This post suggested using a WDL/dockstore workflow that utilises info obtained via the dbGaP download guide.

Thanks.
/Nuwan

Hi Nuwan, thank you for writing in with these important questions. Please see below for responses to your questions.

  1. Yes, AnVIL does have a formal dataset onboarding application (https://forms.gle/pecCZmSXS4sgdeHK8). Prospective AnVIL data submitters should complete and submit this form, which will send the application to the AnVIL leadership committee for review.
  2. Decisions on onboarding new datasets are typically made within one month.
  3. If AnVIL does not host a dataset of interest but dbGaP or SRA does, you can use pre-existing workflows that fetch the data and store it in your own Google bucket or Terra workspace. Learn more at https://anvilproject.org/learn/find-data/importing-data-from-dbgap-and-sra. More generally, anything that can be transferred via gcloud storage can be analyzed on AnVIL.
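
To illustrate the last point, here is a minimal Python sketch of a gcloud storage transfer, assuming the approved data has already been downloaded to a local directory and the gcloud CLI is installed and authenticated. The directory `./dbgap_download`, the bucket `my-workspace-bucket`, and the prefix `emerge-pgx` are hypothetical placeholders, and the helper functions are illustrative only, not part of any AnVIL or Terra tooling.

```python
import shlex
import subprocess


def build_copy_command(local_dir, bucket, prefix=""):
    """Build the argument list for a recursive `gcloud storage cp` upload."""
    # Accept the bucket name with or without the gs:// scheme.
    if bucket.startswith("gs://"):
        bucket = bucket[len("gs://"):]
    bucket = bucket.strip("/")
    prefix = prefix.strip("/")
    destination = f"gs://{bucket}/"
    if prefix:
        destination += f"{prefix}/"
    return ["gcloud", "storage", "cp", "--recursive", local_dir, destination]


def copy_to_bucket(local_dir, bucket, prefix="", dry_run=True):
    """Print the command; run it only when dry_run is False."""
    command = build_copy_command(local_dir, bucket, prefix)
    print(shlex.join(command))
    if not dry_run:
        # Raises CalledProcessError if gcloud exits with a non-zero status.
        subprocess.run(command, check=True)
    return command


if __name__ == "__main__":
    # Hypothetical paths and names; replace with your own before running.
    copy_to_bucket("./dbgap_download", "my-workspace-bucket", "emerge-pgx")
```

The default is a dry run, so you can check the destination path before anything is transferred. Pass `dry_run=False` to perform the copy.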