Data Analysis – Continued O.lurida Fst Analysis from GBS Data

Continued the analysis I started the other day. Still following Katherine Silliman’s notebook for guidance.

Quick summary of this analysis:

  • Mean Fst comparing all populations = 0.139539326187024
  • Mean Fst HL vs NF = 0.143075552548742
  • Mean Fst HL vs SN = 0.155234939276722
  • Mean Fst NF vs SN = 0.117889300124951

NOTE: Mean Fst values were calculated after replacing negative Fst values with 0. Thus, the means are higher than they would be had the raw data been used. I followed Katherine’s notebook and she doesn’t explicitly explain why she does this, nor what the potential implications are for interpreting the data. Will have to discus her rationale behind this with her.

Jupyter notebook: 20161201_docker_oly_vcf_analysis_R.ipynb


  1. Nice job getting vcftools to run! Are you using the vcf file that pyrad outputs, or ipyrad? Pyrad will output a VCF file with all variant sites, while ipyrad will output a VCF file with ALL sites variant and invariant. In either case, you want to thin your VCF to 1 variant site per GBS loci when looking at mean FST. In the notebook you are looking at, I am using a VCF file that has already been thinned. See the end of this notebook to see how I did that using scripts from Mikhail Matz’s github:

    What was your MinCov setting in pyrad? Are loci included that are missing in a lot of individuals? I’ve found after digging some more that the “NaN” arise for invariant sites. These can occur when the SNP is only in 1 of the populations, and therefore when comparing the other 2 populations you are left with an invariant site. Make sure to excluded NaN when doing mean.

    For negative values, I usually see these with loci that with very uneven coverage between populations (like a loci was only sequenced in 1 individual from Pop A and 10 individuals from Pop B). I found a Biostars post about negative Fst values from Weir-Cockerham, that in the edit suggests throwing out negative Fst value loci as they may be due to calculation errors.

    I have lately been using a different approach for my GBS data with pairwise Fst, using a different Fst formula modified by a friend specifically for the uneven coverage you often get with GBS data. It only calculates pairwise Fst for a site that is genotyped in at least 2 individuals and is variant when considering only individuals from those 2 populations. The notebook is found here:

    Another alternative that I haven’t explored much yet is the populations program in Stacks. It now accepts a VCF file from another program, without requiring you to use the rest of the Stacks pipeline to assemble and genotype. They have their own modified Fst and nucleotide diversity formulas that I think also account for missingness in GBS-like data.

    Let me know if you have any questions!

    1. Thanks for all of this! Much appreciated!

      VCF was generated from PyRad with no alterations.

      Mincov setting in PyRad was 4.

      Thanks for the Biostar link; I think I’ve looked at that (and the other thread linked in that post) a few times.

      I’ll check out your other notebooks; thanks!

      I’ll also look into running Stacks’ “Populations” program.

      Thanks again for the info on this. Much appreciated!

Leave a Reply

Your email address will not be published. Required fields are marked *

e.g. 0000-0002-7299-680X

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>