# Data Aggregation – Ava’s Complete Sample List

I received Ava’s master sheet of all the samples she collected for this project. I needed to aggregate a full list of the samples I’ve previously extracted DNA from, so that I can compare to her master sample list and generate a list of the remaining samples that I need to extract DNA from..

Here are the files I needed to work with (Google Sheets):

The files required multiple formatting steps in order to produce accession numbers that were formatted in the same fashion across all three sheets. This was needed in order to be able to successfully merge all of the sheets into a single sheet containing all of the data, which will make it easy to sort, and generate a list of samples that need to be extracted.

Text file manipulations were performed in a Jupyter notebook, which is linked below. All files were downloaded from Google Sheets as tab-delimited files prior to working on them.

Jupyter Notebook file: 20170831_ava_ab_samples_aggregation.ipynb
Jupter Notebook on NBviewer: 20170831_ava_ab_samples_aggregation.ipynb

Now that we have the tables formatted, we can use the accession number as a common field by which to combine the two tables. This will allow easy sorting and identification of the remaining samples that I need to extract. I’ll do this by using SQLite3.

Use SQLite3 (in Linux Ubuntu):

Change to directory containing files:

cd ~/Dropbox/Sam\ Friedman\ Lab/tmp

Start SQLite3:

sqlite3

Set field separator as tab-delimited:

.separator "\t"

Create databases by importing files and providing a name for corresponding databases:

.import ava_master_ab_list_formatted.tsv master_list
.import Ava_WS_Transmission_DNA_Extractions_all.tsv extracted_list

Set output display mode to tabs:

.mode tabs

Set output display to include column headers:

.headers on

Set the output to write to a file instead of the screen:

.output 20170905_master_extraction_list.tsv

SELECT statement to combine the two tables:

SELECT * FROM (SELECT * FROM master_list UNION ALL SELECT * FROM extracted_list) s GROUP BY accession_number ORDER BY accession_number;

The SELECT statement above works in the following fashion:

Uses a sub-query (contained in the parentheses) that combines all of the rows in both tables and creates an intermediate table (that’s the s after the sub-query). Then, all of the columns in that intermediate table are selected by the initial SELECT * FROM and organized by the GROUP BY clause (which combines any rows with identical values in the accession_number column) and then sorts them with the ORDER BY clause.

After that’s finished, we want to reset the output to the screen so we don’t overwrite our file:

.output stdout

The output file is here (Google Sheet): ava_abalone_master_extraction_list