Include/Exclude Files

Effect of excluderegions and include/exclude options

When calculating reported discrepancy rates (substitution and indel errors), binned quality score accuracy, and phasing statistics, GQC uses only the regions that it writes to the file “includednonexcludedregions.<benchmarkname>.bed”. GQC creates this BED file by first building an excluded regions BED file that merges:

  1. Regions in the “excluderegions” BED file in GQC’s config file (these are typically the benchmark’s low-confidence regions)

  2. Regions in a BED file passed to GQC with the --excludefile option

  3. Benchmark regions not covered by regions in a BED file passed to GQC with the --includefile option

These excluded regions are then subtracted from the entire benchmark genome to obtain the “includednonexcluded” regions.
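
The merge-and-subtract logic described above can be sketched in a few lines of Python. This is only an illustration of the interval arithmetic, not GQC's actual implementation: the function names (`merge`, `subtract`, `included_nonexcluded`) and the in-memory tuple representation are assumptions made for the example, and intervals are treated as half-open BED coordinates (chrom, start, end).

```python
from collections import defaultdict


def merge(regions):
    """Merge overlapping or adjacent half-open intervals (chrom, start, end)."""
    by_chrom = defaultdict(list)
    for chrom, start, end in regions:
        by_chrom[chrom].append((start, end))
    merged = []
    for chrom in sorted(by_chrom):
        cur_start, cur_end = None, None
        for start, end in sorted(by_chrom[chrom]):
            if cur_start is None:
                cur_start, cur_end = start, end
            elif start <= cur_end:
                # Overlaps or touches the current interval: extend it.
                cur_end = max(cur_end, end)
            else:
                merged.append((chrom, cur_start, cur_end))
                cur_start, cur_end = start, end
        if cur_start is not None:
            merged.append((chrom, cur_start, cur_end))
    return merged


def subtract(regions, to_remove):
    """Return the parts of `regions` that are not covered by `to_remove`."""
    remove_by_chrom = defaultdict(list)
    for chrom, start, end in merge(to_remove):
        remove_by_chrom[chrom].append((start, end))
    result = []
    for chrom, start, end in merge(regions):
        pos = start
        for r_start, r_end in remove_by_chrom[chrom]:
            if r_end <= pos:
                continue  # removed interval lies entirely before pos
            if r_start >= end:
                break  # removed interval lies entirely after this region
            if r_start > pos:
                result.append((chrom, pos, r_start))
            pos = max(pos, r_end)
            if pos >= end:
                break
        if pos < end:
            result.append((chrom, pos, end))
    return result


def included_nonexcluded(benchmark, config_exclude, exclude_file=(), include_file=None):
    """Hypothetical helper mirroring the three merge steps and the final subtraction."""
    excluded = list(config_exclude) + list(exclude_file)
    if include_file is not None:
        # Benchmark regions NOT covered by the include file count as excluded.
        excluded += subtract(benchmark, include_file)
    return subtract(benchmark, merge(excluded))


# Toy example with made-up coordinates.
benchmark = [("chr1", 0, 1000)]
result = included_nonexcluded(
    benchmark,
    config_exclude=[("chr1", 100, 200)],
    exclude_file=[("chr1", 150, 300)],
    include_file=[("chr1", 0, 600)],
)
print(result)  # [('chr1', 0, 100), ('chr1', 300, 600)]
```

In the toy example, the two exclude sources merge into chr1:100-300, the include file adds chr1:600-1000 to the excluded set, and what remains of the benchmark is chr1:0-100 and chr1:300-600.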

Running GQC for restricted parts of the genome benchmark

GQC statistics can be restricted to particular types of sequence in the benchmark by passing a BED file of those regions as the include file. The following table lists some examples.

Stratification BED files

Sequence type               Include file
Gene sequence               Gene sequence BED (AWS)
Segmental duplications      Segmental duplications BED (AWS)
Centromere sequence         Centromere BED (AWS)
Human satellites (HSATs)    HSAT BED (AWS)
Ribosomal DNA (rDNAs)       rDNA BED (AWS)