# Data Management – SRA Submission Olympia Oyster UW PacBio Data from 20170323

Submitted the FASTQ files from the UW PacBio Data from 20170323 to the NCBI Sequence Read Archive (SRA).

FTP’d the data to NCBI’s servers, following their instructions. Briefly,

Change to the directory where the FASTQ files are (Owl/web/nightingales/O_lurida) and then initiate an FTP session:

ftp -i ftp-private.ncbi.nlm.nih.gov

Enter the provided username/password, change to my designated uploads directory, and create a new folder dedicated to this particular upload. Then, upload all the files using the mput command:

mput *filtered_subreads*
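For repeat uploads, the same session could be scripted instead of typed by hand. Here is a minimal sketch using Python's standard ftplib. The host and the file pattern come from the steps above; the function names, the `user`, `password`, `remote_dir` and `new_folder` arguments are all placeholders of mine, not anything NCBI specifies.

```python
import fnmatch
import os
from ftplib import FTP

NCBI_FTP_HOST = "ftp-private.ncbi.nlm.nih.gov"  # host used in the manual session


def select_files(filenames, pattern="*filtered_subreads*"):
    """Return the sorted names matching the same glob given to mput."""
    return sorted(n for n in filenames if fnmatch.fnmatchcase(n, pattern))


def upload_files(ftp, local_dir, filenames):
    """Upload each file in binary mode via a logged-in ftplib.FTP-like object."""
    uploaded = []
    for name in filenames:
        with open(os.path.join(local_dir, name), "rb") as handle:
            ftp.storbinary(f"STOR {name}", handle)
        uploaded.append(name)
    return uploaded


def submit(local_dir, user, password, remote_dir, new_folder):
    """Hypothetical end-to-end session mirroring the manual steps."""
    names = select_files(os.listdir(local_dir))
    with FTP(NCBI_FTP_HOST) as ftp:
        ftp.login(user, password)   # credentials provided by NCBI
        ftp.cwd(remote_dir)         # designated uploads directory
        ftp.mkd(new_folder)         # folder dedicated to this upload
        ftp.cwd(new_folder)
        return upload_files(ftp, local_dir, names)
```

Splitting the file selection from the upload makes it easy to print the list of matching files and eyeball it before anything is sent.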

SRA deets are below (assigned FASTQ files to existing BioProject and created a new BioSample). Will update post with SRA number when processing is complete on the NCBI end.

SRA: SRS2339870
Study: SRR5809355
BioProject: PRJNA316624
BioSample: SAMN07326085

# Data Management – SRA Submission of Ostrea lurida GBS FASTQ Files

Prepared a Sequence Read Archive (SRA) submission for archiving our Olympia oyster genotyping-by-sequencing (GBS) data in NCBI. This is in preparation for submission of the manuscript we’re putting together.

I followed my outline/guideline for navigating the SRA submission process, as it’s a bit of a pain in the neck. Glad my notes were actually useful!

The following two files are currently being uploaded via FTP; the process will take about 3hrs, as each file is ~18GB in size:

They are being submitted under the following accession numbers (note: a final accession number will be provided once this is publicly available; I will update this post when that happens):

# SRA Submission – Genome sequencing of the Olympia oyster (Ostrea lurida)

Adding our Olympia oyster genome sequencing (sequencing done by BGI) to the NCBI Sequence Read Archive (SRA). The current status can be seen in the screen cap below. Release date is set for a year from now, but will likely bump it up. Need Steven to review the details of the submission (BioProject, Experiment descriptions, etc.) before I initiate the public release. Will update this post with the SRA number once we receive it.

Here’s the list of files uploaded to the SRA:

151114_I191_FCH3Y35BCXX_L1_wHAIPI023992-37_1.fq.gz
151114_I191_FCH3Y35BCXX_L1_wHAIPI023992-37_2.fq.gz
151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_1.fq.gz
151114_I191_FCH3Y35BCXX_L2_wHAMPI023991-66_2.fq.gz
151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_1.fq.gz
151118_I137_FCH3KNJBBXX_L5_wHAXPI023905-96_2.fq.gz
160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCABDLAAPEI-62_1.fq.gz
160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCABDLAAPEI-62_2.fq.gz
160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCACDTAAPEI-75_1.fq.gz
160103_I137_FCH3V5YBBXX_L3_WHOSTibkDCACDTAAPEI-75_2.fq.gz
160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCABDLAAPEI-62_1.fq.gz
160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCABDLAAPEI-62_2.fq.gz
160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCACDTAAPEI-75_1.fq.gz
160103_I137_FCH3V5YBBXX_L4_WHOSTibkDCACDTAAPEI-75_2.fq.gz

Paired-end sequencing files were uploaded together within a single “Run”.
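Before uploading, it can help to confirm that every `_1.fq.gz` file has its `_2.fq.gz` mate. The sketch below is one way to do that check in Python; the function name is mine, and it assumes the BGI naming pattern seen in the list above (library name followed by `_1` or `_2`).

```python
import re
from collections import defaultdict

# Assumed naming pattern: <library>_<1|2>.fq.gz, as in the file list above.
PAIR_SUFFIX = re.compile(r"^(?P<library>.+)_(?P<read>[12])\.fq\.gz$")


def group_read_pairs(filenames):
    """Return {library: (read1_file, read2_file)}; fail on unmatched files."""
    pairs = defaultdict(dict)
    for name in filenames:
        match = PAIR_SUFFIX.match(name)
        if match is None:
            raise ValueError(f"unexpected filename: {name}")
        pairs[match["library"]][int(match["read"])] = name
    incomplete = sorted(lib for lib, reads in pairs.items() if set(reads) != {1, 2})
    if incomplete:
        raise ValueError(f"missing mate for: {', '.join(incomplete)}")
    return {lib: (reads[1], reads[2]) for lib, reads in sorted(pairs.items())}
```

Running this over the directory listing gives one entry per library, which is a handy checklist when filling in the filenames for a Run.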

SRA Info:
SRA: SRS1365663
Study: SRP072461
BioProject: PRJNA316624
BioSample: SAMN04588827

# SRA Submission – Genome sequencing of the Pacific geoduck (Panopea generosa)

Adding our geoduck genome sequencing (sequencing done by BGI) to the NCBI Sequence Read Archive (SRA). The current status can be seen in the screen cap below. Release date is set for a year from now, but will likely bump it up. Need Steven to review the details of the submission (BioProject, Experiment descriptions, etc.) before I initiate the public release. Will update this post with the SRA number once we receive it.

Here’s the list of files uploaded to the SRA:

151114_I191_FCH3Y35BCXX_L1_wHAIPI023989-79_1.fq.gz
151114_I191_FCH3Y35BCXX_L1_wHAIPI023989-79_2.fq.gz
151122_I136_FCH3L2FBBXX_L7_wHAXPI023990-97_1.fq.gz
151122_I136_FCH3L2FBBXX_L7_wHAXPI023990-97_2.fq.gz
160103_I137_FCH3V5YBBXX_L5_WHPANwalDDABDLAAPEI-100_1.fq.gz
160103_I137_FCH3V5YBBXX_L5_WHPANwalDDABDLAAPEI-100_2.fq.gz
160103_I137_FCH3V5YBBXX_L5_WHPANwalDDACDTAAPEI-102_1.fq.gz
160103_I137_FCH3V5YBBXX_L5_WHPANwalDDACDTAAPEI-102_2.fq.gz
160103_I137_FCH3V5YBBXX_L6_WHPANwalDDABDLAAPEI-100_1.fq.gz
160103_I137_FCH3V5YBBXX_L6_WHPANwalDDABDLAAPEI-100_2.fq.gz
151114_I191_FCH3Y35BCXX_L2_wHAMPI023988-81_1.fq.gz
151114_I191_FCH3Y35BCXX_L2_wHAMPI023988-81_2.fq.gz
160103_I137_FCH3V5YBBXX_L6_WHPANwalDDACDTAAPEI-102_1.fq.gz
160103_I137_FCH3V5YBBXX_L6_WHPANwalDDACDTAAPEI-102_2.fq.gz

Mate pair sequencing files were uploaded together within a single “Run”.

# SRA Submission – Transcriptomic Profiles of Adult Female & Male Gonads in Panopea generosa (Pacific geoduck).

RNAseq experiment, which is part of a larger project that involves characterizing geoduck gonad development across multiple stages: histologically, proteomically, and transcriptomically. Initial sample collection performed by Grace Crandall.

The current status can be seen in the screen cap below. Current release date is set for a year from now, but will likely bump it up. Need Steven to review the details of the submission (BioProject, Experiment descriptions, etc.) before I initiate the public release. Will update this post with the SRA number once we receive it.

Here’s the list of files uploaded to the SRA:

Geo_Pool_F_GGCTAC_L006_R1_001.fastq.gz
Geo_Pool_F_GGCTAC_L006_R2_001.fastq.gz
Geo_Pool_M_CTTGTA_L006_R1_001.fastq.gz
Geo_Pool_M_CTTGTA_L006_R2_001.fastq.gz

Mate pair sequencing files were uploaded together within a single “Run”.

# SRA Submission – Individual Transcriptomic Profiles of C.gigas Before & After Heat Shock

RNA-seq experiment conducted by Claire in 2013.

She sampled mantle tissue from three adult oysters, allowed them to recover from the sampling (one week?), and then subjected those same oysters to a 1 hr heat shock at 40°C and collected mantle tissue from them again.

As this is our first Sequence Read Archive (SRA) submission in many years, I decided to start with this data set: the small number of samples (6) from this Illumina sequencing keeps the submission manageable.

The current status can be seen in the screen cap below. Current release date is set for a year from now, but will likely bump it up. Need Steven to review the details of the submission (BioProject, Experiment descriptions, etc.) before I initiate the public release. Will update this post with the SRA number once we receive it.

Here’s the list of files uploaded to the SRA:

2M-HS_CCGTCC_L001_R1_001.fastq.gz
2M_AGTCAA_L001_R1_001.fastq.gz
4M-HS_GTCCGC_L001_R1_001.fastq.gz
4M_AGTTCC_L001_R1_001.fastq.gz
6M-HS_GTGAAA_L001_R1_001.fastq.gz
6M_ATGTCA_L001_R1_001.fastq.gz

SRA Accession: SRP072251

# Data Management – SRA Submission Overview

We have an enormous backlog of high-throughput sequencing files (641 FASTQ files, to be exact) that we need/want to get added to the NCBI Sequence Read Archive (SRA).

This post provides a brief summary of what’s involved in the process (mostly via screen shots) and attempts to identify the various pitfalls/pains that I’ve already stumbled through trying to get a set of six FASTQ files submitted properly.

OVERALL – It’s horrible and tedious.

Important things to note:

• Once any of the three required components for SRA submission have been created (SRA, BioProject, and BioSamples), they can no longer be edited/deleted by the user! Understandable if they’ve already been publicly released, but if they’re still in pre-public release status, I think the user should be able to make changes as they see fit. As it currently stands, the user has to email the help desk at SRA and/or BioProjects to make any changes.
• Extremely difficult to figure out which information will show up (and where it will show up) in the final, formatted SRA record – no guide to this that I could find. Thus, if you screw it up, it’s a major, major hassle to try to change anything.
• When creating a “Run” (within an “Experiment”, within your SRA submission), only include sequencing files that provide the same data (e.g. if you have multiple sequence files, each generated from different individuals/samples, then you need to create a separate “Experiment” and “Run” for each of those files – otherwise, all files uploaded to a “Run” are combined into a single SRA file that loses any distinguishing info from the separate sequencing files).
• When creating a batch submission for BioSamples, there’s no way to set a Title attribute. This means all of your submissions (in my case) will have a title of “Invertebrate sample”. Considering that I will likely end up with dozens of BioSamples, that means there’s no easy way to distinguish them from each other without some extra clicking and poking around.
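To keep the one-Experiment-and-Run-per-sample rule straight across a big backlog, the files can be grouped by sample before touching the submission portal. This is only a sketch: `plan_runs` and `sample_label` are names I made up, and taking the text before the first underscore as the sample label is an assumption about our own file naming, not an NCBI rule.

```python
from collections import defaultdict


def sample_label(filename):
    """Hypothetical rule: the sample label is the text before the first underscore."""
    return filename.split("_", 1)[0]


def plan_runs(file_to_biosample):
    """Return {biosample: [files]}; each key needs its own Experiment and Run."""
    runs = defaultdict(list)
    for filename, biosample in sorted(file_to_biosample.items()):
        runs[biosample].append(filename)
    return dict(runs)
```

For example, `plan_runs({f: sample_label(f) for f in fastq_files})` gives one entry per sample, and only files sharing an entry should ever land in the same Run.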

Here’s the best way to proceed:

1. Create a BioProject. This will sit at the top of the hierarchy in the SRA submission and will be displayed as the STUDY associated with the SRA.

2. Create BioSample(s). This will be the next level of the hierarchy in the SRA submission and will be displayed as SAMPLE. This only shows up in the SRA when you create a new “Experiment”.

3. Create SRA. This will end up encompassing any BioProject(s) and BioSample(s) that you need to include to describe the sequencing files you’re submitting to the SRA.

4. Create an Experiment.

5. Create a Run. This option is available once you’ve saved your experiment. This is where you provide your sequencing filename and associated MD5 checksum. This will also provide you with the login info to upload your sequencing files via FTP to NCBI servers. You can associate multiple sequencing files within a single run. This should be done if your sequencing files all provide data for the BioSample you selected. However, if you have sequencing files that are associated with different BioSamples, then you need to create an individual Experiment (and Run) for each BioSample!
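Since the Run form asks for each filename and its MD5 checksum, it’s worth generating those up front. A minimal sketch with Python’s hashlib follows; the function names and the chunk size are my own choices.

```python
import hashlib
import os


def md5sum(path, chunk_size=1024 * 1024):
    """Return the hex MD5 of a file, read in chunks so large FASTQs fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_manifest(paths):
    """Return (filename, md5) pairs ready to copy into the Run form."""
    return [(os.path.basename(p), md5sum(p)) for p in sorted(paths)]
```

Reading in chunks matters here because some of our files are many gigabytes each.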

Here are some links that might come in handy (although, none are that great)…

SRA Submission Quick Start Guide (this is useful!): http://www.ncbi.nlm.nih.gov/sra/docs/submit/

And, here are the screen caps, roughly in chronological order of how the process presents itself. It’s too time consuming to caption any of these, so I’m putting them up for a reference. Also, all of the information seen in these screen caps has been deleted (because the entire submission was totally jacked up in multiple facets), so don’t look for any of the various submission IDs – they no longer exist. This is really just to visually show how many steps there are in order to get stuff submitted – it’s brutal.

# Using ESearch and EFetch to retrieve data

Let’s say there was this person, we will call her Emma for now, who needed to download lots of data but wanted to make the process more robust and reliable. Here is a way to use the NCBI ESearch and EFetch tools to do so. Complete documentation is at http://www.ncbi.nlm.nih.gov/books/NBK25498/. Specific example used is here

Will use ESearch to get GI numbers and post them to the history server, then use multiple EFetch calls to retrieve the data.

Input: $query – ilumatobacter[orgn]

Output: A file named “ilumatobacter.fa” containing FASTA data.

Perl script:

    use LWP::Simple;
    $query = 'ilumatobacter[orgn]';

    #assemble the esearch URL
    $base = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/';
    $url = $base . "esearch.fcgi?db=protein&term=$query&usehistory=y";

    #post the esearch URL
    $output = get($url);

    #parse WebEnv, QueryKey and Count (# records retrieved)
    $web = $1 if ($output =~ /<WebEnv>(\S+)<\/WebEnv>/);
    $key = $1 if ($output =~ /<QueryKey>(\d+)<\/QueryKey>/);
    $count = $1 if ($output =~ /<Count>(\d+)<\/Count>/);

    #open output file for writing
    open(OUT, ">ilumatobacter.fa") || die "Can't open file!\n";

    #retrieve data in batches of 500
    $retmax = 500;
    for ($retstart = 0; $retstart < $count; $retstart += $retmax) {
        $efetch_url = $base . "efetch.fcgi?db=protein&WebEnv=$web";
        $efetch_url .= "&query_key=$key&retstart=$retstart";
        $efetch_url .= "&retmax=$retmax&rettype=fasta&retmode=text";
        $efetch_out = get($efetch_url);
        print OUT "$efetch_out";
    }
    close OUT;

So if you wanted to use this, simply paste the above code into a text file (suggest using TextWrangler) and save it as a .pl file (e.g. /Users/sr320/Desktop/ill-prot.pl). Then in Terminal, type perl /Users/sr320/Desktop/ill-prot.pl. The data will download to whatever directory you are in within Terminal.

In actuality, this still seems to fail randomly, which is a common complaint online. The best guess is too many requests during a busy time of day, so it might take a couple of tries. See http://www.ncbi.nlm.nih.gov/books/NBK25497/#chapter2.Usage_Guidelines_and_Requiremen for usage recommendations.
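One way to cope with the random failures is to wrap each request in a retry with a pause between attempts. Below is a rough Python sketch of the same ESearch-then-batched-EFetch flow with that retry added. It is not a tested replacement for the Perl script: the function names, the number of attempts and the wait time are my own guesses, and it uses the https form of the E-utilities base URL.

```python
import re
import time
import urllib.request
from urllib.parse import urlencode

BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def http_get(url):
    """Fetch a URL and return the body as text."""
    with urllib.request.urlopen(url, timeout=60) as response:
        return response.read().decode("utf-8")


def get_with_retry(url, fetch=http_get, attempts=3, wait_seconds=5.0):
    """Call fetch(url), pausing longer after each failed attempt."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch(url)
        except OSError as error:  # URLError and socket timeouts are OSErrors
            last_error = error
            if attempt < attempts - 1:
                time.sleep(wait_seconds * (attempt + 1))
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error


def parse_esearch(xml_text):
    """Pull WebEnv, QueryKey and Count out of an ESearch XML reply."""
    web_env = re.search(r"<WebEnv>([^<]+)</WebEnv>", xml_text).group(1)
    query_key = re.search(r"<QueryKey>(\d+)</QueryKey>", xml_text).group(1)
    count = int(re.search(r"<Count>(\d+)</Count>", xml_text).group(1))
    return web_env, query_key, count


def efetch_urls(web_env, query_key, count, db="protein", retmax=500):
    """Yield one EFetch URL per batch of retmax records."""
    for retstart in range(0, count, retmax):
        params = {"db": db, "WebEnv": web_env, "query_key": query_key,
                  "retstart": retstart, "retmax": retmax,
                  "rettype": "fasta", "retmode": "text"}
        yield BASE + "efetch.fcgi?" + urlencode(params)


def download(query, out_path, db="protein"):
    """Search, then append every FASTA batch to out_path."""
    search = BASE + "esearch.fcgi?" + urlencode(
        {"db": db, "term": query, "usehistory": "y"})
    web_env, query_key, count = parse_esearch(get_with_retry(search))
    with open(out_path, "w") as out:
        for url in efetch_urls(web_env, query_key, count, db=db):
            out.write(get_with_retry(url))
```

Something like `download("ilumatobacter[orgn]", "ilumatobacter.fa")` would mirror the Perl example. The retry only helps with transient failures; it does nothing about staying within NCBI’s usage guidelines.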

# BLAST – C.gigas Larvae OA Illumina Data Against GenBank nt DB

In an attempt to figure out what’s going on with the Illumina data we recently received for these samples, I BLASTed the 400ppm data set that had previously been de-novo assembled by Steven: EmmaBS400.fa.

Jupyter (IPython) Notebook : 20150501_Cgigas_larvae_OA_BLASTn_nt.ipynb

Notebook Viewer : 20150501_Cgigas_larvae_OA_BLASTn_nt

Results:

BLASTn Output File: 20150501_nt_blastn.tab

BLAST e-vals <= 0.001: 20150501_Cgigas_larvae_OA_blastn_evals_0.001.txt

Unique BLAST Species: 20150501_Cgigas_larvae_OA_unique_blastn_evals.txt
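For reference, an e-value filter like the one behind the second file could look roughly like the sketch below. It assumes standard tab-delimited BLAST output where the e-value is the 11th column; that is an assumption about the layout of the output file, and the function name is mine.

```python
def filter_by_evalue(lines, max_evalue=0.001, evalue_column=10):
    """Keep tab-delimited BLAST hits whose e-value is <= max_evalue.

    evalue_column is zero-based; 10 assumes the default tabular column order.
    """
    kept = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.rstrip("\n").split("\t")
        if float(fields[evalue_column]) <= max_evalue:
            kept.append(fields)
    return kept
```

Passing an open file handle as `lines` works, since the function just iterates line by line.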

Firstly, since this library was bisulfite converted, we know that matching won’t be as robust as we’d normally see.

However, the BLAST matches for this are terrible.

Only 0.65% of the BLAST matches (e-value <0.001) are to Crassostrea gigas. Yep, you read that correctly: 0.65%.

It’s nearly 40-fold less than the top species: Dictyostelium discoideum (a slime mold).

It’s 30-fold less than the next species: Danio rerio (zebrafish).

Then it’s followed up by human and mouse.
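The percentages and fold differences are simple tallies of species per hit. A small Python sketch of that arithmetic follows; the function names are mine, and the input is assumed to be one species name per filtered BLAST hit.

```python
from collections import Counter


def species_shares(species_per_hit):
    """Return {species: percent of hits}, most common species first."""
    counts = Counter(species_per_hit)
    total = sum(counts.values())
    return {species: 100.0 * n / total for species, n in counts.most_common()}


def fold_difference(shares, top, other):
    """How many times more hits `top` has than `other`."""
    return shares[top] / shares[other]
```

Feeding it the species column from the filtered hits would reproduce the kind of comparison described above.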

I think I will need to contact the Univ. of Oregon sequencing facility to see what their thoughts on this data are, because it’s not even remotely close to what we should be seeing, even with the bisulfite conversion…

# 2014/11/20 – Lab notebook Open Access Siderophore Database

2014/11/20

Siderophores are secondary metabolites (small molecules) that bacteria use for catching metals. These siderophores then make their way back to the bacteria for use in metabolism. They have uses in bioremediation (cleaning up toxic sites) and as novel drugs for humans. I want to look at their ecological roles and evolutionary history. Currently we are lacking an open access siderophore database that is linked to the underlying genetics.

I suspect that the presence of siderophore genes follows a similar pattern to the rest of the secondary metabolites. See:
Peter Cimermancic, Marnix H. Medema, Jan Claesen, Kenji Kurita, Laura C. Wieland Brown, Konstantinos Mavrommatis, Amrita Pati, Paul A. Godfrey, Michael Koehrsen, Jon Clardy, Bruce W. Birren, Eriko Takano, Andrej Sali, Roger G. Linington, Michael A. Fischbach, Insights into Secondary Metabolism from a Global Analysis of Prokaryotic Biosynthetic Gene Clusters, Cell, Volume 158, Issue 2, 17 July 2014, Pages 412-421, http://dx.doi.org/10.1016/j.cell.2014.06.034.

Methods
I started out by downloading all the siderophores mentioned in the siderophore database. However, these are not connected to the NCBI database. So I switched gears and did a search here:

http://www.ncbi.nlm.nih.gov/nuccore

SEARCH: siderophore AND bacteria[organism] NOT genome AND cluster
This says: look for siderophore in bacteria records that are not genomes and that are gene clusters.
200 hits
For each cluster, record the bacterial species, the compound (if known), and the accession number, then download the FASTA and the coding-region FASTA, both as nucleotide sequences. This makes GC content, codon usage, and tetranucleotide frequency calculations easier down the road.
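As a note to self on those downstream calculations, here is a minimal Python sketch of GC content, tetranucleotide frequencies and codon counts for a nucleotide sequence. The function names are mine; it assumes the coding-region sequence starts in frame, and it ignores windows containing ambiguous bases.

```python
from collections import Counter


def gc_content(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


def kmer_frequencies(seq, k=4):
    """Relative frequency of each k-mer (k=4 gives tetranucleotides)."""
    seq = seq.upper()
    windows = (seq[i:i + k] for i in range(len(seq) - k + 1))
    counts = Counter(w for w in windows if set(w) <= set("ACGT"))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()} if total else {}


def codon_usage(cds):
    """Count codons in a coding sequence, assuming it starts in frame."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # drop any trailing partial codon
    return Counter(cds[i:i + 3] for i in range(0, usable, 3))
```

Keeping both downloads as nucleotide sequences means these three functions can run directly on whatever comes out of the FASTA files.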