Last week, I ran the two raw FASTQ files through FastQC. As expected, FastQC detected “errors”. These errors are due to the presence of adapter sequences, barcodes, and the use of a restriction enzyme (ApeKI) in library preparation. In summary, it’s not surprising that FastQC was not please with the data because it’s expecting a “standard” library prep that’s already been trimmed and demultiplexed.
However, just for comparison, I ran the demultiplexed files through FastQC. The Jupyter notebook is linked (GitHub) and embedded below. I recommend viewing the Jupyter notebook on GitHub for easier viewing.
Pretty much the same, but with slight improvements due to removal of adapter and barcode sequences. The restriction site still leads to FastQC to report errors, which is expected.
Links to all of the FastQC output files are linked at the bottom of the notebook.
After completing the downloads of these files from BGI, I needed to verify that the downloaded copies matched the originals. Below is a Jupyter Notebook detailing how I verified file integrity via MD5 checksums. It also highlights the importance of doing this check when working with large sequencing files (or, just large files in general), as a few of them had mis-matching MD5 checksums!
Although the notebook is embedded below, it might be easier viewing via the notebook link (hosted on GitHub).
At the end of the day, I had to re-download some files, but all the MD5 checksums match and these data are ready for analysis: