GQC – Genome Quality Checker

The GQC Python package evaluates a test assembly by comparing it to a diploid genome benchmark (such as HG002v1.1) and prints general statistics, BED-formatted scaffold regions reporting the alignments and discrepancies within them, and PDF-formatted plots. In addition, it can report statistics on discrepancies between aligned sequencing read sets and a benchmark, which can help to deduce a sequencing platform’s strengths and weaknesses. Example outputs of various assembly and read benchmarking using GQC with the HG002v1.1 assembly are available on AWS analyses.

The program was written by Nancy Fisher Hansen, a staff scientist in the Genome Informatics Section, Center for Genomics and Data Science Research, National Human Genome Research Institute (NHGRI). Nancy can be reached at nhansen@mail.nih.gov.

Install

Software dependencies

Running GQC’s assembly benchmarking requires an installation of commit 38b07c2 or later of Gene Myers’ FASTK package with its “KmerMap” and “FastK” commands in your path. When evaluating assemblies, GQC calls the minimap2 aligner, which should also be installed and in your path. For both assembly and read analyses, GQC uses R’s Rscript command with Bioconductor to create plots, and bedtools to compare and merge intervals. If the “Rscript” command is not in your path, the program will complain and perform all functions except plotting. If the “bedtools” command isn’t in your path, the program will exit with an error. To use GQC’s plotting functions, you will need to install the R packages “stringr” (from CRAN) and “karyoploteR” (part of Bioconductor).
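Before running GQC, it can be handy to confirm that the external commands named above are on your path. The following is a minimal sketch, not part of GQC itself: the helper name missing_commands and the split into required and optional lists are assumptions made for illustration, based only on the description above (FastK, KmerMap and minimap2 matter for assembly benchmarking, and Rscript is only needed for plots).

```python
import shutil

# Command names come from the dependency description above. The grouping into
# "required" and "optional" is this sketch's own reading of that description.
REQUIRED = ["FastK", "KmerMap", "minimap2", "bedtools"]
OPTIONAL = ["Rscript"]  # only needed for plotting


def missing_commands(names, which=shutil.which):
    """Return the entries of `names` that cannot be found on the PATH."""
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    for name in missing_commands(REQUIRED):
        print(f"missing required command: {name}")
    for name in missing_commands(OPTIONAL):
        print(f"missing optional command (plots will be skipped): {name}")
```

If the script prints nothing, all five commands were found. It does not check the FASTK commit or whether the R packages are installed.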

In addition, the program requires a set of files with data about the benchmark assembly you are comparing to. For the Q100 benchmark assembly hg002v1.1, a tarball of these files is available on AWS resources. Once downloaded, this tarball should be unpacked and the first line in the file GQC/benchconfig.txt should be edited to specify the path of the downloaded resources directory (see the section “Config file” for more details).

All other dependencies will be installed by the pip installer with the commands in the next section, “Local Installation”. Feel free to post installation issues to the issues section of this GitHub repository and we will attempt to address them promptly.

Local Installation

GQC is available on PyPI and can be installed with pip:

pip install GQC

Alternatively, you can install GQC from a local clone of the repository, which lets you run the most up-to-date version of the code. First clone this GitHub repository:

git clone https://github.com/marbl/GQC
cd GQC

Create a virtual environment for the project:

python3 -m venv venv
source venv/bin/activate

Finally, use Python’s pip installer to install and test a development copy for yourself to run:

python3 -m pip install -e .
pytest

Config file

In order to evaluate heterozygous sites, short tandem repeat run lengths, and other features of the genome benchmark, the GQC program needs specially formatted annotation files for the benchmark genomes. For hg002v1.1, these files are contained in a tarball available on AWS benchmark resources. The program reads the locations of these files from a config file. By default, GQC assumes this is a file called “benchconfig.txt” in your working directory, but its location can also be specified with the -c or --config option. If the config file is not accessible in one of these two ways, the program will complain and exit.
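The lookup order described above can be sketched as follows. This is an illustration only and not GQC’s actual code: the function name resolve_config is hypothetical, and the choice to raise FileNotFoundError stands in for the program’s “complain and exit” behavior. It assumes the only relevant inputs are the -c or --config option and a “benchconfig.txt” file in the working directory.

```python
import argparse
import os


def resolve_config(argv, cwd="."):
    """Return the config path: the -c/--config value if given,
    otherwise benchconfig.txt in the working directory `cwd`."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("-c", "--config", default=None)
    # Ignore any other command-line options; only the config option matters here.
    args, _unknown = parser.parse_known_args(argv)

    path = args.config if args.config else os.path.join(cwd, "benchconfig.txt")

    # The file must exist and be readable, otherwise report the problem.
    if not (os.path.isfile(path) and os.access(path, os.R_OK)):
        raise FileNotFoundError(f"config file not accessible: {path}")
    return path
```

For example, resolve_config(["-c", "/path/to/benchconfig.txt"]) returns that path when the file is readable, while resolve_config([]) falls back to “benchconfig.txt” in the current directory.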

An example config file is located here and contains the necessary parameters and file names. Edit that config file to specify the full path of the downloaded resources directory, and you can use that file as your config file when running the GQC or readbench commands.

Publications