See that link for run setup and configuration. They are essentially the same, except for the change I’ll discuss below.
The results from that run can be seen here:
In that previous run, I neglected to provide a transposable elements FastA file for use with RepeatMasker.
I remedied that and re-ran it. I modified
maker_opts.ctl to include the following:
repeat_protein=../../opt/maker/data/te_proteins.fasta #provide a fasta file of transposable element proteins for RepeatRunner
This TEs file is part of RepeatMasker.
Annotated genome file (GFF):
This run took about an hour longer than the previous run, but for some reason it ran with only 21 workers, instead of 22. This is probably the reason for the increased run time.
I’d like to post a snippet of the GFF file here, but the line lengths are WAY too long and will be virtually impossible to read in this notebook. The GFF consists of listing a “parent” contig and its corresponding info (start/stop/length). Then, there are “children” of this contig that show various regions that are matched within the various databases that were queried, i.e. repeatmasker annotations for identifying repeat regions, protein2genome for full/partial protein matches, etc. Thus, a single scaffold (contig) can have dozens or hundreds of corresponding annotations!
Probably the easiest and most logical to start working with will be those scaffolds that are annotated with a “protein_match”, as these have a corresponding GenBank ID. Parsing these out and then doing a join with NCBI protein IDs will give us a basic annotaiton of “functional” portions of the genome.
Additionally, we should probably do some sort of comparison of this run with the previous run where I did not provide the transposable elements FastA file.