Data Aggregation – Ava’s Complete Sample List

I received Ava’s master sheet of all the samples she collected for this project. I needed to aggregate a full list of the samples I’ve previously extracted DNA from, so that I can compare to her master sample list and generate a list of the remaining samples that I need to extract DNA from..

Here are the files I needed to work with (Google Sheets):

The files required multiple formatting steps in order to produce accession numbers that were formatted in the same fashion across all three sheets. This was needed in order to be able to successfully merge all of the sheets into a single sheet containing all of the data, which will make it easy to sort, and generate a list of samples that need to be extracted.

Text file manipulations were performed in a Jupyter notebook, which is linked below. All files were downloaded from Google Sheets as tab-delimited files prior to working on them.

Jupyter Notebook file: 20170831_ava_ab_samples_aggregation.ipynb
Jupter Notebook on NBviewer: 20170831_ava_ab_samples_aggregation.ipynb

Now that we have the tables formatted, we can use the accession number as a common field by which to combine the two tables. This will allow easy sorting and identification of the remaining samples that I need to extract. I’ll do this by using SQLite3.

Use SQLite3 (in Linux Ubuntu):

Change to directory containing files:

cd ~/Dropbox/Sam\ Friedman\ Lab/tmp

Start SQLite3:


Set field separator as tab-delimited:

.separator "\t"

Create databases by importing files and providing a name for corresponding databases:

.import ava_master_ab_list_formatted.tsv master_list
.import Ava_WS_Transmission_DNA_Extractions_all.tsv extracted_list

Set output display mode to tabs:

.mode tabs

Set output display to include column headers:

.headers on

Set the output to write to a file instead of the screen:

.output 20170905_master_extraction_list.tsv

SELECT statement to combine the two tables:

SELECT * FROM (SELECT * FROM master_list UNION ALL SELECT * FROM extracted_list) s GROUP BY accession_number ORDER BY accession_number;

The SELECT statement above works in the following fashion:

Uses a sub-query (contained in the parentheses) that combines all of the rows in both tables and creates an intermediate table (that’s the s after the sub-query). Then, all of the columns in that intermediate table are selected by the initial SELECT * FROM and organized by the GROUP BY clause (which combines any rows with identical values in the accession_number column) and then sorts them with the ORDER BY clause.

After that’s finished, we want to reset the output to the screen so we don’t overwrite our file:

.output stdout

The output file is here (Google Sheet): ava_abalone_master_extraction_list

Data Management – Script for Compiling qPCR Data

Over the last few weeks, I’ve wrestled with tracking down data (primarily qPCR data) from the litany of projects the Friedman Lab has had over the last decade or so. I’ve also noticed that it’s increasingly difficult for me to track down my own data from individual projects where data is not generated continuously over time, but in chunks. Tracking down 6 different qPCR runs that were conducted over the course of a year is tedious.

In order to save myself, and others who might need/want to review my data, a lot of time in the future, I decided to write a script that will allow me to compile all of my qPCR data into a single, massive CSV file. Accessing this CSV file via spreadsheet (Excel/Calc/Sheets) or database (SQL) means I’ll always be able to quickly search for all the related data I need, since it will reside in a single file!

The script is written in bash, called, and is currently hosted here:

Basically, the script does the following:

  • Replace the spaces in the BioRad filenames with underscores (used to simplify parsing of the filename for use in downstream steps in the script)
  • Replace the header row to accommodate two new fields: qPCR_filename and qPCR_date
  • Add qPCR_filename and qPCR_date to each file.
  • Concatenate all the files into a single “master” CSV file.

Now, the easy part – exporting the data from hundreds of qPCR files…