# Computing – Retrieve data from Amazon EC2 Instance

I had an existing instance that still had data on it from my PyRad analysis on 20160727 that I needed to retrieve.

Logged into Amazon AWS via the web interface and started my existing instance (via the Actions > Instance State > Start menu). After the instance started and generated a new public IP address, I SSH’d into the instance:

ssh -i "/full/path/to/bioinformatics.pem" ubuntu@instance.public.ip.address

NOTE: I needed the full path to the PEM file! Tried multiple times using a relative path (e.g. ~/Documents/bionformatics.pem) and received error messages that the file did not exist and “Permission denied (public key)”.

Changed to the directory with the PyRAD analysis and created a tarball to speed up eventual download from the EC2 instance to my local computer:

tar -cvzf 20160715_pyrad_analysis.tar.gz /home/ubuntu/data/analysis/

After compression, I used secure copy to copy the file from the EC2 instance to my local computer:

scp -i "/full/path/to/bioinformatics.pem" ubuntu@instance.public.ip.address:/home/ubuntu/data/20160715_pyrad_analysis.tar.gz /Volumes/toaster/sam/

This didn’t work initially because I attempted to transfer the file using Hummingbird (instead of my computer). The SSH connection kept timing out. The reason for this was that I hadn’t previously used Hummingbird to connect to the EC2 instance and Hummingbird’s IP address wasn’t listed in the Security Groups table as being allowed to connect. I made that change using the Amazon AWS web interface:

Once transfer was complete, I terminated the EC2 instance and the corresponding data volume.

# Data Analysis – PyRad Analysis of Olympia Oyster GBS Data

I’ve now completed a PyRad analysis on the full set. Now, I just need to figure out what to do with the output from this…

# Computing – Not Enough Power!

Well, I tackled the storage space issue by expanding the EC2 Instance to have a 1000GB of storage space. Now that that’s no longer a concern, it turns out I’m running up against processing/memory limits!

I’m running the EC2 c4.2xlarge (Ubuntu 14.04 LTS, 8 vCPUs, 16 GiB RAM) instance.

I’m trying to run two programs simultaneously: PyRad and Stacks (specifically, the ustacks “sub” program).

PyRad keeps crashing with some memory error stuff (see embedded Jupyter Notebook at the end of this post).

Used the following Bash program to visualize what’s happening with the EC2 Instance resources (i.e. processors and RAM utilization):

htop

sudo apt-get install htop

I see why PyRad is dying. Here are two screen captures that show what resources are being used (click to see detail):

The top image shows that ustacks is using 100% of all eight CPUs!

The second image shows when ustacks is finishing with one of the files it’s processing, it uses all of the memory (16GBs)!

So, I will have to wait until ustacks is finished running before being able to continue with PyRad.

If I want to be able to run these simultaneously, I can (using either of these options still requires me to wait until ustacks completes in order to manipulate the current EC2 instance to accommodate either of the two following options):

• Increase the computing resources of this EC2 Instance

• Create an additional EC2 Instance and run PyRad on one and Stacks programs on the other.

Here’s the Jupyter Notebook with the PyRad errors (see “Step 3: Clustering” section):



jQuery(function($){ var target_height =$(window).height();
\$("iframe.iframe-class").height(target_height);
});



# Computing – Amazon EC2 Instance Out of Space?

Running PyRad analysis on the Olympia oyster GBS data. PyRad exited with warnings about running out of space. However, looking at free disk space on the EC2 Instance suggests that there’s still space left on the disk. Possibly PyRad monitors the expected disk space usage during analysis to verify there will be sufficient disk space to write to? Regardless, will expand EC2 volume instance to a larger size…

Working through the GBS data, I wanted to leave a trace of a workflow that will be improved upon (I used the wrong cut site) as well as results for comparison.

Only one end of the 96 files (ie samples) was used. Here is the notebook and output directory.

Requisite screenshot:

# GBS Frustrations

This isn’t really a notebook entry – it’s more of a traditional blog post.

It’s a quick summary of the frustrations and struggles I’ve encountered while trying to analyze the Olympia oyster GBS data. Hopefully it will serve as a place holder for others to find (and avoid) some of the pitfalls I’ve encountered so far. But, mostly, this is just for me to vent…

1. Using the Stacks program (on Hummingbird over the network to our server Owl) takes forever and, more importantly, consistently fails to complete the ustacks and cstacks programs.

2. Using the Stacks program (on Hummingbird via external HDD connected through Firewire) takes forever (combined, process_radtags and ustacks has been running since 20160428; that’s eight days)!!! Granted, this is running on all 96 samples, but, regardless, this type of time frame is not very conducive to productivity.

3. The “raw” non-demultiplexed fastq files supplied by BGI have a ‘N’ in the barcode in the FASTQ header lines. This prevents Stacks (and possibly Tassel – I’ll get to this in a second) from being able to perform the demultiplexing. Here’s a screen shot of what I’m talking about:

1. Cyverse has a program called Tassel that should be able to handle GBS data just like ours. However, it doesn’t produce the expected output to proceed to the second step. Although I haven’t tested it, it’s possible that the problem is related to the ‘N’ in the FASTQ header barcode sequence I mentioned above. I suspect it’s related because the first step in using Tassel is demultiplexing utilizing a supplied barcode keyfile.
2. Cyverse has Stacks installed, but in order to use it, someone has to build a Cyverse “app.” I’ve tried and the process is brutal. It’s not conducive for a program (that is really a suite of programs) like Stacks that has so many command line options and, depending on your input file types (e.g. “non-standard” Illumina filenames for paired-end sequencing), requires looping over filenames to specify corresponding file pairs.

3. Pyrad actually worked relatively well, but the VCF output file (for visualizing in the Integrative Genomics Viewer) has an ill-formed header that IGV won’t accept. Attempts at tweaking the header don’t seem to resolve the issue. Additionally, it’s not apparent in the output files if individuals get grouped, even though there is an option to specify which individuals should be grouped together.

4. And, the most frustrating thing of all???!!!  I just realized how to handle the problematic barcodes in the FASTQ headers!! Instead of trying to alter the FASTQ files (which I’ve been messing around with over the past few days), all I’ve needed to do this entire time is CHANGE THE BARCODE KEY FILE THAT STACKS AND/OR TASSEL USES TO HAVE A ‘N’ AT THE BEGINNING OF EACH BARCODE!

I’m going to go cry now…

Regardless of that last one, it doesn’t change the fact that Stacks is painfully slow and, at times, unreliable.

# Data Analysis – Subset Olympia Oyster GBS Data from BGI as Single Population Using PyRAD

Attempting to get some sort of analysis of the Ostrea lurida GBS data from BGI, particularly since the last run at it using Stacks crashed (literally) and burned (not literally).

Katherine Silliman at UIC recommended using PyRAD.

I’ve taken the example Jupyter notebook from the PyRAD website and passed a subset of the 96 individuals through it.

In this instance, the subset of individuals were all analyzed as a single population. I have another Jupyter notebook running on a different computer that will separate the three populations that are present in this subset.

Overall, I don’t fully understand the results. However, this seems to be the quickest assessment of the data (from the *.snps file generated):

28 individuals, 36424 loci, 72251 snps

Additionally, I did run into an issue when I tried to visualize the data (using the *.vcf file generated) in IGV (see screen cap below). I’ve posted the issue to the pyrad GitHub repo in hopes of getting it resolved.

One last thing. This might be obvious to most, but I discovered that trying to do all this computation over the network (via a mounted server share) is significantly slower than performing these operations on th efiles when they’re stored locally. Somewhere in the notebook you’ll notice that I copy all of the working directory from the server (Owl) to the local machine (Hummingbird). Things proceeded very quickly after doing that. Didn’t realize this would have so much impact on speed!!