
GitHub Curation

Updated a couple of GitHub Wikis:



Created a new repo in the RobertsLab Organization GitHub account with a wiki that provides an overview of how to use the Hyak (mox) computing node. This was lightly modified from what Sean already had in his personal repo.



As a quick test, I updated all the md files in sr320/LabDocs/code to format headers for GitHub’s newest interpretation of headers. The headers (represented by a series of ‘#’) require a space between them and the subsequent text. I used the following command in bash:

for i in *.md; do sed -i.bak 's/^#*/& /g' "$i"; done

The code works as follows:

  • Run for loop on all .md files in the directory

  • Use sed to edit the files in place, keeping a backup copy of each original with a .bak extension: -i.bak (the sed that ships with Mac OS X requires an argument after -i).

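  • 's/^#*/& /g': Performs a substitution. The pattern is anchored to the beginning of the line (^) and matches zero or more pound symbols (#*). The replacement is whatever was matched (&) followed by a space. Because "zero or more" also matches an empty string, lines that do not begin with a pound symbol gain a leading space too, and headers that already had a space gain a second one. The g flag has no practical effect here, since the pattern can only match once at the start of each line.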

Since this worked, I’ll probably run this through all of the md files in all of our various repos to quickly and easily fix header formatting issues.
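Before running this across every repo, a stricter pattern might be worth considering, since the command above also touches non-header lines. The following is only a sketch, not what was actually run: it adds a space only when a run of '#' at the start of a line is immediately followed by some other character. The function name fix_md_headers is made up for illustration. Note that it will still alter lines inside code blocks that happen to start with '#' (e.g. comments or a shebang), so the .bak files are worth keeping until the results have been checked.

```shell
# fix_md_headers FILE...
# Hypothetical helper (name is an assumption, not from the original command).
# Adds a single space after the leading '#' run of a Markdown header,
# but only where that space is missing.
fix_md_headers() {
  for f in "$@"; do
    # ^\(#\{1,\}\)  one or more '#' at the very start of the line
    # \([^# ]\)     followed by a character that is neither '#' nor a space
    # \1 \2         re-emit both pieces with a space between them
    sed -i.bak 's/^\(#\{1,\}\)\([^# ]\)/\1 \2/' "$f"
  done
}
```

Usage would mirror the original loop: fix_md_headers *.md. Running it a second time changes nothing, because headers that already have a space no longer match the pattern.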



Working on updating the Genome-sequencing-December-2016-(UW-PacBio) wiki, but I need to work out the kinks in an easy, documentable way to rename and move some files around in order to make the files/organization compliant with our data management plan (DMP).


Current strategy:

  • Generate MD5 checksums for fastq files for each of the SMRT cell runs.

  • Copy file names from the .xml file in the top level of each SMRT cell run folder to an array.

  • Use parameter substitution (in bash) to strip the path and suffix from each element of the array (results likely stored in a secondary or tertiary array).

  • Use the bash find command to copy the filtered_subreads.fastq.gz file from each SMRT cell run folder to the owl/nightingales/O_lurida directory, prepending the corresponding stripped filename from the final array to the name of each copied fastq file.

  • Generate new MD5 checksums on the copied files and compare to original MD5 checksums. This will confirm two things: 1 – The data did not get corrupted when copied. 2 – The new filenames correspond to the correct, original filtered_subreads.fastq.gz file (renaming a file doesn’t alter the MD5 checksum).

  • Archive the original SMRT cell run folders (which contain a ton of metadata files).
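The strategy above could be sketched roughly as follows. This is untested against the real data and rests on several assumptions that are not in the notes above: each run folder has exactly one .xml file at its top level, the suffix to strip is just ".xml", the stripped name is joined to the fastq file name with an underscore, and the helper names md5_of and copy_renamed are made up for illustration. It skips the arrays and handles one run folder at a time.

```shell
# md5_of FILE
# Print only the MD5 hash of FILE, using md5sum (Linux) or md5 -q (Mac OS X).
md5_of() {
  if command -v md5sum >/dev/null 2>&1; then
    md5sum "$1" | awk '{print $1}'
  else
    md5 -q "$1"
  fi
}

# copy_renamed RUN_DIR DEST_DIR
# Hypothetical helper: copy the run's filtered_subreads.fastq.gz to DEST_DIR,
# prefixed with the name of the run's top-level .xml file, then verify the copy.
copy_renamed() {
  run_dir=$1
  dest_dir=$2

  # Assumption: one .xml file sits in the top level of the run folder.
  xml=$(find "$run_dir" -maxdepth 1 -name '*.xml' | head -n 1)
  [ -n "$xml" ] || { echo "no .xml file in $run_dir" >&2; return 1; }

  # Parameter substitution: strip the path, then the (assumed) .xml suffix.
  prefix=${xml##*/}
  prefix=${prefix%.xml}

  src=$(find "$run_dir" -name 'filtered_subreads.fastq.gz' | head -n 1)
  [ -n "$src" ] || { echo "no fastq file in $run_dir" >&2; return 1; }

  dest="$dest_dir/${prefix}_filtered_subreads.fastq.gz"
  cp "$src" "$dest" || return 1

  # Renaming does not change the checksum, so the two hashes must match.
  if [ "$(md5_of "$src")" = "$(md5_of "$dest")" ]; then
    echo "OK $dest"
  else
    echo "MISMATCH $dest" >&2
    return 1
  fi
}
```

A loop such as for d in */; do copy_renamed "$d" owl/nightingales/O_lurida; done would then cover every run folder (the destination path is taken from the list above; the loop itself is an assumption). Archiving the original folders is left out of the sketch.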


Manuscript – Oly GBS 14 Day Plan

For Pub-a-thon 2017, Steven has asked us to put together a 14 day plan for our manuscripts.

My manuscript is accessible in three locations:

Current: Overleaf for final editing/formatting before submission to Scientific Data.
Archival: Authorea for initial writing/comments.
Archival: GitHub for initial writing/issues.

Additionally, I have established a data repository with a Digital Object Identifier (DOI) at Open Science Framework.

Here’s what I have going on:

Day 1

  • Convert .xls data records to .csv to see if they will render in OSF repo.
  • Assemble figure: phylogenetic tree.
  • Add figure to manuscript.
  • Deal with any minor edits.

Day 2

  • Assemble figure: Puget Sound map.
  • Add figure to manuscript.
  • Deal with any minor edits.

Day 3

  • Submit? Depends on Steven’s availability to finish off the Background & Summary and write up the Abstract.

Teaching – OA Lesson Plan Development

I’m currently collaborating on tweaking/developing a lesson plan and corresponding curriculum to teach Washington high school students about the chemistry involved in ocean acidification.

This is a project that’s already been in the works and I’m being brought in to assist (or, take over?) with the development. I’m pretty interested and excited by this. The reason for my excitement is that I was in the secondary education program to become a certified secondary education teacher while I was in graduate school. So, this project lets me apply the knowledge I garnered about teaching science during that time.

The current state of the project has a lab protocol, but no real lesson plan for the teachers to utilize. The lab protocol, in my view, is a bit too dense for high schoolers to digest and is a bit too much of “do this, write down the number: that’s ocean acidification!” It currently lacks an important element of science pedagogy: discovery. My goal is to tweak the protocol so that it is more engaging and, possibly, hypothesis-driven (i.e. discovery-driven). This type of teaching has been shown to greatly improve retention and help develop critical thinking skills.

The lesson plan should have sufficient information for teachers to decide if the lesson is appropriate for them to teach (e.g. which Washington state standards are addressed, what learning level(s) does the lesson require, what materials/supplies are needed, etc.), if they have enough time to conduct the lesson, and if they have ample understanding of the topic to feel comfortable teaching it.

I’ve put this project on GitHub. It allows for active collaboration on projects. Although there are some hurdles for those collaborators who have not used the service before, I think there are some good organizational benefits that are worth dealing with the initial headaches that might come for beginning GitHub users.

One benefit to developing this project on GitHub is that all changes are tracked and a description of the changes is required when they are made. This makes it relatively easy to see what changes were made, by whom, and when. Although something like Google Docs also automatically tracks changes, it does not let you attach a comment when changes are made. Because of this, it’s not always clear why a change was made in the first place.

An additional benefit, and this is the main reason I think it’s best to develop this project on GitHub, is the Issues tracker (see screenshot):


The Issues section allows for targeted discussion of the project and eliminates the volleys of email that often happen on collaborative projects. It will keep all discussions about this project in a single location and won’t require exhaustive searches of emails that easily get buried during a work week. Additionally, the discussions can remain focused on specific topics without getting lost within emails that attempt to broach multiple topics at once.