Tag Archives: EC2

Computing – Retrieve data from Amazon EC2 Instance

I had an existing instance that still had data on it from my PyRad analysis on 20160727 that I needed to retrieve.

Logged into Amazon AWS via the web interface and started my existing instance (via the Actions > Instance State > Start menu). After the instance started and generated a new public IP address, I SSH’d into the instance:

ssh -i "/full/path/to/bioinformatics.pem" ubuntu@instance.public.ip.address

NOTE: I needed the full path to the PEM file! Tried multiple times using a relative path (e.g. ~/Documents/bioinformatics.pem) and received error messages that the file did not exist, followed by “Permission denied (publickey)”.
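One way to sidestep path confusion is to resolve the key file to an absolute path, and confirm it is readable, before handing it to ssh. The following is a minimal sketch; the function name abs_path is made up for illustration and the example path in the comment is hypothetical.

```shell
# abs_path: print the absolute path of an existing, readable file.
# Fails with a message on stderr if the file cannot be read.
abs_path() {
  if [ ! -r "$1" ]; then
    echo "cannot read: $1" >&2
    return 1
  fi
  dir=$(cd "$(dirname "$1")" && pwd) || return 1
  printf '%s/%s\n' "$dir" "$(basename "$1")"
}

# Example (hypothetical path): resolve the key first, then pass the result to ssh.
# key=$(abs_path ~/Documents/bioinformatics.pem) && ssh -i "$key" ubuntu@instance.public.ip.address
```

If the function fails, the problem is the path or the file itself rather than anything on the EC2 side.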

Changed to the directory with the PyRAD analysis and created a tarball to speed up eventual download from the EC2 instance to my local computer:

tar -cvzf 20160715_pyrad_analysis.tar.gz /home/ubuntu/data/analysis/
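Before downloading a large tarball it can be worth listing its contents to confirm the archive is readable. The sketch below works in a throwaway directory with a placeholder file, so none of the names in it come from the real analysis. It also uses -C, which stores relative paths in the archive instead of the full /home/ubuntu/... path.

```shell
# Work in a throwaway directory so the example touches nothing real.
demo=$(mktemp -d)
mkdir -p "$demo/data/analysis"
echo "placeholder" > "$demo/data/analysis/stats.txt"

# -C changes into the parent directory first, so the archive stores the
# relative path "analysis/..." rather than an absolute path.
tar -czf "$demo/analysis.tar.gz" -C "$demo/data" analysis

# List the archive contents (-t) to confirm it can be read back.
tar -tzf "$demo/analysis.tar.gz"
```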

After compression, I used secure copy to copy the file from the EC2 instance to my local computer:

scp -i "/full/path/to/bioinformatics.pem" ubuntu@instance.public.ip.address:/home/ubuntu/data/20160715_pyrad_analysis.tar.gz /Volumes/toaster/sam/

This didn’t work initially because I attempted to transfer the file using Hummingbird (instead of my computer), and the SSH connection kept timing out. I hadn’t previously used Hummingbird to connect to the EC2 instance, so Hummingbird’s IP address wasn’t listed in the Security Groups table as being allowed to connect. I added it using the Amazon AWS web interface.
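The same rule can also be added from the AWS command line rather than the web interface. This is a sketch only: the security group ID and IP address are hypothetical placeholders, and the run helper is an illustrative wrapper that prints the command and executes it only when DRY_RUN=0.

```shell
# Hypothetical values: substitute the real security group ID and client IP.
SG_ID="sg-0123456789abcdef0"
CLIENT_IP="203.0.113.10"

# run: print a command; execute it only when DRY_RUN=0 (default is print-only).
run() {
  printf '+ %s\n' "$*"
  if [ "${DRY_RUN:-1}" = "0" ]; then
    "$@"
  fi
}

# Allow SSH (TCP port 22) from that single address only.
run aws ec2 authorize-security-group-ingress \
  --group-id "$SG_ID" --protocol tcp --port 22 --cidr "${CLIENT_IP}/32"
```

The /32 suffix limits the rule to one address, which matches the “My IP” behavior of the web console.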

Once transfer was complete, I terminated the EC2 instance and the corresponding data volume.


Computing – Amazon EC2 Cost “Analysis”

I recently moved some computing jobs over to Amazon’s Elastic Compute Cloud (EC2) in an attempt to avoid some odd computing issues/errors I kept encountering on our lab computers (Apple Xserve 3,1).

The big trade-off here is that the lab computers are paid for and using EC2 means we’ll be sinking more money into computing resources. With that expense should come faster processing (i.e. less time) to perform various analyses. As they say, time is money…

Let’s look at how things have worked out so far.

First, how much did we spend and how did we spend it?

 

The instance I was running cost us $0.419/hr. That’s great and all, but you sort of lose sense of what that ends up costing over the long term. Let’s look at how things break out over a larger time scale.

According to Amazon’s (very useful!) billing breakdown, we spent $187 in the month of July 2016. This doesn’t seem too bad. In fact, this would only cost us ~$2200/yr if we continue to run this instance in this fashion. However, let’s look at it a bit further.

We see that the instance ran for a total of 374 hrs during July 2016. Divide that by 24hrs/day and we see that the instance was running for 15.6 days; just over half the month. Had it run for the full month, we would’ve spent roughly double that, ~$374, which would equate to $4488/yr. For our lab, that kind of money starts to add up and one starts to wonder if it wouldn’t be better to invest in higher end hardware to use in the lab with a single “sunk” cost that will last us many, many years.
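The arithmetic above can be reproduced with a few lines of awk. This is just a sketch of the back-of-the-envelope calculation; the function name cost_summary is made up, and the doubling step assumes the instance ran for about half the month.

```shell
# cost_summary BILL HOURS: reproduce the rough cost figures from the post.
cost_summary() {
  awk -v bill="$1" -v hours="$2" 'BEGIN {
    printf "days running: %.1f\n", hours / 24
    printf "per year at the billed rate: %d\n", bill * 12
    # Doubling assumes the instance ran for about half the month.
    printf "full month, always on: %d\n", bill * 2
    printf "per year, always on: %d\n", bill * 2 * 12
  }'
}

cost_summary 187 374   # July 2016: $187 billed, 374 instance hours
```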

Regardless, with the lab’s current computing hardware, we should compare another factor that’s involved with the expense of using Amazon EC2 instead of our lab computers: time.

I performed a very rough “guesstimation” of the time savings that EC2 has provided us.

 

I compared the length of “real” time for the first step in the PyRad program using the same data set on one of our lab computers (roadrunner) and the Amazon EC2 instance:

  • roadrunner: 1118 minutes

  • EC2: 771 minutes

 

Roadrunner is nearly 1.5x slower than the EC2 instance! To really appreciate what type of impact that has, we should look at the run time for the full PyRad analysis:

  • roadrunner: 5546 minutes (NOTE: Due to incomplete analysis, roadrunner time is “guesstimated” as 1.45 x EC2, the ratio from the first step above)
  • EC2: 3825 minutes

 

Let’s convert those numbers into something more easily understood – hours and days:

  • roadrunner: ~92hrs
  • roadrunner: ~3.9 days

  • EC2: ~64hrs

  • EC2: ~2.7 days
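A quick way to check the conversion is to run the minute totals through awk. The helper name to_hours_days is made up for this sketch; the inputs are the full-run minute totals listed earlier.

```shell
# to_hours_days MINUTES: convert a run time in minutes to hours and days.
to_hours_days() {
  awk -v m="$1" 'BEGIN { printf "%.2f hrs, %.2f days\n", m / 60, m / 1440 }'
}

to_hours_days 5546   # roadrunner (estimated total)
to_hours_days 3825   # EC2
```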

 

Of course, these times don’t take into account any technical issues that we might encounter (and I have encountered many technical issues using roadrunner) on either platform, but I can tell you that I’ve not had any headaches using EC2 (other than unintentional, self-imposed ones).

 

Another potential option is trying out InsideDNA. They offer cloud computing services that are specifically geared towards high-throughput bioinformatics analysis. They have many, many bioinformatics tools already installed and available to use on their platform. Additionally, they have nice tutorials on how to use some of these tools, which goes a long way in getting started on any analyses using new software. They offer several pricing tiers.

The “Advanced” tier ($100/month) certainly seems like it could potentially be better than using Amazon. However, this tier only offers 500GB of storage. If you look up above at the Amazon pricing breakdown, you’ll notice that I’ve already used 466GB of storage for just that one experiment! Additionally, the 1000 CPU hours seems great, but remember, this is likely divided by the number of CPUs that you end up using. The Amazon EC2 instance was running eight cores. If I were to run a similar set up on InsideDNA, that would amount to 125 CPU hours per core. Again, looking up above, we see that I ran the EC2 instance for 374 hours! That means the “Advanced” tier on InsideDNA wouldn’t be enough to get our jobs done.
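The CPU-hour comparison comes down to two integer calculations. The sketch below assumes, as the paragraph does, that CPU hours are counted per core; that is a guess about InsideDNA’s accounting, not something confirmed here.

```shell
tier_cpu_hours=1000   # "Advanced" tier allowance
cores=8               # cores used on the EC2 instance
wall_hours=374        # hours the EC2 instance ran in July 2016

# Wall-clock hours the tier would cover if all eight cores are busy.
echo "hours available on $cores cores: $(( tier_cpu_hours / cores ))"

# CPU hours the July run would have consumed under the same accounting.
echo "CPU hours used by the July run: $(( wall_hours * cores ))"
```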

 

Anyway, in the grand scheme of things, using an Amazon EC2 instance periodically as we need it throughout the year isn’t terrible. However, if we start using the University of Washington Hyak computing cluster we may be able to avoid spending on EC2 and be able to have similar time savings (compared to using the lab computers). Need to get cracking on that…


Data Analysis – PyRad Analysis of Olympia Oyster GBS Data

Previously, I ran a PyRad analysis on just a subset of these samples in an attempt to have some data for a grant pre-proposal.

I’ve now completed a PyRad analysis on the full set. Now, I just need to figure out what to do with the output from this…

Jupyter Notebook: 20160715_ec2_oly_gbs_pyrad.ipynb


Computing – Not Enough Power!

Well, I tackled the storage space issue by expanding the EC2 Instance to have 1000GB of storage space. Now that that’s no longer a concern, it turns out I’m running up against processing/memory limits!

I’m running the EC2 c4.2xlarge (Ubuntu 14.04 LTS, 8 vCPUs, 16 GiB RAM) instance.

I’m trying to run two programs simultaneously: PyRad and Stacks (specifically, the ustacks “sub” program).

PyRad keeps crashing with memory errors (see embedded Jupyter Notebook at the end of this post).

Used the following command-line program to visualize what’s happening with the EC2 Instance resources (i.e. processor and RAM utilization):

htop

Downloaded/installed to EC2 Instance using:

sudo apt-get install htop
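htop is interactive, so its readings are hard to capture in a log or notebook. As a rough non-interactive alternative, the sketch below computes the percentage of RAM in use from /proc/meminfo. The function name mem_used_pct is made up, and counting buffers and page cache as free is a simplifying assumption (the kernel can usually reclaim them).

```shell
# mem_used_pct [FILE]: percent of RAM in use, from a /proc/meminfo-style file.
# Buffers and page cache are treated as free memory.
mem_used_pct() {
  awk '/^MemTotal:/ { total = $2 }
       /^MemFree:/  { free = $2 }
       /^Buffers:/  { buf = $2 }
       /^Cached:/   { cache = $2 }
       END { if (total > 0) printf "%d\n", (total - free - buf - cache) * 100 / total }' \
      "${1:-/proc/meminfo}"
}

# Only meaningful on Linux, where /proc/meminfo exists.
if [ -r /proc/meminfo ]; then
  echo "RAM in use: $(mem_used_pct)%"
fi
```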

 

I see why PyRad is dying. I took two screen captures that show what resources are being used.

The top image shows that ustacks is using 100% of all eight CPUs!

The second image shows that when ustacks finishes with one of the files it’s processing, it uses all of the memory (16GB)!

So, I will have to wait until ustacks is finished running before being able to continue with PyRad.

If I want to be able to run these simultaneously, I have two options (either way, I still have to wait until ustacks completes before I can modify the current EC2 instance):

  • Increase the computing resources of this EC2 Instance

  • Create an additional EC2 Instance and run PyRad on one and Stacks programs on the other.

 

Here’s the Jupyter Notebook with the PyRad errors (see “Step 3: Clustering” section):

Computing – Amazon EC2 Instance Out of Space?

Running PyRad analysis on the Olympia oyster GBS data. PyRad exited with warnings about running out of space. However, looking at free disk space on the EC2 Instance suggests that there’s still space left on the disk. Possibly PyRad monitors the expected disk space usage during analysis to verify there will be sufficient disk space to write to? Regardless, I will expand the EC2 instance’s volume to a larger size…
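When a disk reports free space but a program still complains, one thing worth ruling out is inode exhaustion, since a filesystem can run out of inodes before it runs out of blocks. Whether that is what happened here is unknown; the sketch below only shows how to look at both numbers. The function name disk_use_pct is made up for illustration.

```shell
# disk_use_pct [PATH]: percent of blocks used on the filesystem holding PATH.
disk_use_pct() {
  df -P "${1:-/}" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

echo "blocks used on /: $(disk_use_pct /)%"

# Inode usage for the same filesystem (output format varies by platform).
df -i / || true
```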

 

 


Computing – A Very Quick “Guide” to Amazon EC2 Continued

Yesterday’s post ended with me trying to mount an S3 bucket to my EC2 instance using s3fs-fuse.

Waited for the 36GB of data to copy over to new bucket with proper naming (i.e. no capital letters in name). Copying took hours; left lab before copying completed.

 

Mount S3 bucket: kubu4

s3fs kubu4 /mnt/s3bucket/ -o passwd_file=/home/ubuntu/creds_s3fs

 

So, that didn’t work. The reason is that I uploaded the files to the S3 bucket via the Amazon AWS command line (awscli). Apparently, s3fs-fuse can’t mount S3 buckets that contain data uploaded via awscli [see this GitHub Issue for s3fs-fuse]! However, I had to upload them via awscli because the web interface kept failing!

 

That means I need to upload the data directly to my EC2 instance. However, my EC2 instance is set with the default storage capacity of 8GB, so I need to increase the capacity to accommodate my two large files, as well as the intermediate files that will be generated by the types of analysis I plan on running. I’m guessing I’ll need at least 100GB to be safe. To do this, I have to expand the Elastic Block Store (EBS) volume of my instance. The rest of the steps below are fully explained and covered very well in the EBS expansion link in the previous sentence.

Don’t be fooled into thinking I figured any of this out on my own!

 

Expanding the EC2 Instance

The initial part of the process is creating a Snapshot of my instance. This took a long time (2.5hrs). However, I did finally decide to refresh the page when I noticed that the “Status” progress bar hadn’t moved beyond 46% for well over an hour. After refreshing, the “Status” showed “Complete.” Maybe this actually was ready to go much faster, but the page didn’t automatically refresh? Regardless, in retrospect, since this EC2 instance is pretty much brand new and doesn’t have too many changes from when it was initialized, I probably should’ve just created a brand new EC2 instance with the desired amount of EBS…

Created volume from that Snapshot with 150GB of magnetic storage.

Attached volume to the EC2 instance at /dev/sda1 (the default setting /dev/sdf resulted in an error message about the instance not having a root volume) and SSH’d into the instance. Odd, it seems to show that I still only have 8GB of storage (see the “Usage of…” in the screenshot below):

 

 

Checked to see if I actually have the expanded storage volume or not. It turns out, I do! (Notice that the only drive listed is “xvda” and its partition, “xvda/xvda1”, AND they are equal in size: 150G.)

 

 

Time to upload (via the secure copy command) the files to my EC2 instance! The following commands upload the files to a folder called “data” in my /home directory. I also ran the “time” command at the beginning to get an idea of how long it takes to upload each of these files.

time scp -i ~/Dropbox/Lab/Sam/bioinformatics.pem /Volumes/web/nightingales/O_lurida/20160223_gbs/160123_I132_FCH3YHMBBXX_L4_OYSzenG1AAD96FAAPEI-109_1.fq.gz ubuntu@ec2.ip.address:~/data
time scp -i ~/Dropbox/Lab/Sam/bioinformatics.pem /Volumes/web/nightingales/O_lurida/20160223_gbs/160123_I132_FCH3YHMBBXX_L4_OYSzenG1AAD96FAAPEI-109_2.fq.gz ubuntu@ec2.ip.address:~/data

 

Details on upload times and file sizes:

 

Confirm the files now reside in my EC2 instance:

 

Alas, I should’ve captured all of this in a Jupyter Notebook. However, I didn’t because I thought I would need to enter passwords (which you can’t do with a Jupyter Notebook). It turns out, I didn’t need a password for anything; even when using “sudo” on the EC2 instance. Oh well, it’s set up and running with my data finally accessible. That’s all that really matters here.

Alrighty, time to get rolling on some data analysis with a fancy new Amazon EC2 instance!!!


Computing – The Very Quick “Guide” to Amazon Web Services Cloud Computing Instances (EC2)

This all takes a surprisingly long time to set up.

Set up AWS Identity and Access Management (IAM): http://docs.aws.amazon.com/IAM/latest/UserGuide/introduction.html?icmpid=docs_iam_console

Install AWS command line interface: https://aws.amazon.com/cli/

Copy files to S3 bucket:

aws s3 cp /Volumes/web/nightingales/O_lurida/20160223_gbs/160123_I132_FCH3YHMBBXX_L4_OYSzenG1AAD96FAAPEI-109_1.fq.gz s3://Samb
aws s3 cp /Volumes/web/nightingales/O_lurida/20160223_gbs/160123_I132_FCH3YHMBBXX_L4_OYSzenG1AAD96FAAPEI-109_2.fq.gz s3://Samb

Launch EC2 instance c4.2xlarge (Ubuntu 14.04 LTS, 8 vCPUs, 16 GiB RAM). Configured to have SSH open (TCP, port 22) and also to be able to access Jupyter Notebook via tunnel (TCP, port 8888). Set with “My IP” to limit access to these ports.

Create new key pair. Have to change permissions:

chmod 400 bioinformatics.pem
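The permission change matters because ssh refuses to use a private key that other users can read. A small sketch of what mode 400 looks like afterwards, using a temporary stand-in file rather than the real key:

```shell
# Stand-in for the downloaded .pem file (a throwaway temp file).
key=$(mktemp)

# 400 = read-only for the owner, no access for group or others.
chmod 400 "$key"

# Print just the permission string from the long listing.
ls -l "$key" | cut -c1-10
```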

 

Connect to instance

For Amazon Linux AMI:

ssh -i "bioinformatics.pem" ec2-user@ip.address.of.instance

 

For Amazon Ubuntu Server:

ssh -i "bioinformatics.pem" ubuntu@ip.address.of.instance


Update/Upgrade default Ubuntu packages after initial launch:

sudo apt-get update
sudo apt-get upgrade

 

Set up Docker

Install Docker for Ubuntu 14.04 and copy our bioinformatics Dockerfile to the home directory of the EC2 instance:

scp -i "bioinformatics.pem" /Users/Sam/GitRepos/LabDocs/code/dockerfiles/Dockerfile.bio ubuntu@ip.address.of.instance:

Access data stored in Amazon S3 bucket(s)

Mounting S3 storage as a volume in an EC2 instance requires s3fs-fuse: https://github.com/s3fs-fuse/s3fs-fuse

 

Mount bucket:

sudo s3fs Samb /mnt/s3bucket/ -o passwd_file=/home/ubuntu/s3fs_creds

 

Error:

s3fs: BUCKET Samb, name not compatible with virtual-hosted style.

 

Turns out, the error is due to the bucket name having an uppercase letter.
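A bucket name can be checked before any data is uploaded to it. The sketch below uses a simplified pattern (3 to 63 characters; lowercase letters, digits, dots and hyphens; starting and ending with a letter or digit). It does not cover every S3 naming rule, and the function name is_dns_style_bucket is made up for illustration.

```shell
# is_dns_style_bucket NAME: succeed if NAME passes a simplified check for
# names usable with virtual-hosted-style addressing.
is_dns_style_bucket() {
  printf '%s\n' "$1" | LC_ALL=C grep -Eq '^[a-z0-9][a-z0-9.-]{1,61}[a-z0-9]$'
}

for name in Samb kubu4; do
  if is_dns_style_bucket "$name"; then
    echo "$name: ok"
  else
    echo "$name: not usable as a virtual-hosted-style bucket name"
  fi
done
```

LC_ALL=C is set so that the [a-z] range cannot match uppercase letters under locale-dependent collation.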

Made new bucket in S3 (via web interface) and copied data files to the new bucket. Will try mounting again once the files are copied over (this will take a while; the two files total 36GB).
