Key concepts and conventions
Alignment
oncoanalyser uses BWA-MEM2 internally for alignment. The pipeline has been validated on, and is compatible with, BAMs aligned with BWA-MEM, BWA-MEM2 and DRAGEN. Note that the mate CIGAR attribute is mandatory for all BAM records of paired reads. Incompatible BAMs can be repaired with tools such as the Picard FixMateInformation routine.
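As a rough illustration of the mate CIGAR requirement, the sketch below scans SAM-formatted text records and reports paired reads that lack the `MC` tag (the SAM optional field that holds the mate CIGAR). The function name and the example records are hypothetical, and exempting reads whose mate is unmapped is an assumption of this sketch rather than documented oncoanalyser behaviour.

```python
def missing_mate_cigar(sam_lines):
    """Return names of paired records (with a mapped mate) that lack an MC tag."""
    offenders = []
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        is_paired = bool(flag & 0x1)      # SAM flag bit: read is paired
        mate_unmapped = bool(flag & 0x8)  # SAM flag bit: mate is unmapped
        # Optional tags start at the 12th column.
        has_mc = any(tag.startswith("MC:Z:") for tag in fields[11:])
        # Assumption: an unmapped mate has no CIGAR, so it is not reported.
        if is_paired and not mate_unmapped and not has_mc:
            offenders.append(fields[0])
    return offenders


# Made-up example records, not taken from any real BAM.
example = [
    "@HD\tVN:1.6\tSO:coordinate",
    "r1\t99\tchr1\t100\t60\t10M\t=\t200\t110\tACGTACGTAC\tFFFFFFFFFF\tMC:Z:10M",
    "r2\t99\tchr1\t300\t60\t10M\t=\t400\t110\tACGTACGTAC\tFFFFFFFFFF\tNM:i:0",
    "r3\t0\tchr1\t500\t60\t10M\t*\t0\t0\tACGTACGTAC\tFFFFFFFFFF",
]
print(missing_mate_cigar(example))
```

In practice the records could be streamed from a BAM with a tool that emits SAM text; any record names reported here would indicate that the BAM needs fixing before use.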
Unmapping problematic reads
After read alignment, REDUX is run and performs ‘unmapping’ of reads that lie in pre-defined problematic regions and are discordant, have long soft clipping, or fall in a region of extremely high depth. The purpose of this unmapping step is to remove obviously poor alignments from the BAM prior to running downstream tools. The unmapped reads are retained in the BAM. Overall, the problematic regions make up ~0.3% of the genome and lead to ~3-6% of all reads being unmapped, depending on the genome version.
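The three criteria can be pictured as a simple predicate. The following is only a sketch: the function names, the read dictionary layout and both thresholds (`max_soft_clip`, `max_depth`) are hypothetical placeholders, not values taken from REDUX.

```python
import re


def soft_clip_length(cigar):
    """Total number of soft-clipped bases (CIGAR operator S) in a CIGAR string."""
    return sum(int(n) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar) if op == "S")


def should_unmap(read, in_problematic_region, region_depth,
                 max_soft_clip=20, max_depth=1000):
    """Decide whether a read would be unmapped under the assumed thresholds."""
    if not in_problematic_region:
        return False  # only reads in pre-defined problematic regions are considered
    return (
        read["discordant"]
        or soft_clip_length(read["cigar"]) > max_soft_clip
        or region_depth > max_depth
    )


read = {"cigar": "30S70M", "discordant": False}
print(should_unmap(read, in_problematic_region=True, region_depth=50))
```

Keeping the thresholds as parameters makes it explicit that they are tunable assumptions of the sketch, not fixed properties of the pipeline.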
Deduplication, consensus and UMIs
In oncoanalyser, read deduplication is also performed by REDUX. Deduplication aims to remove both PCR and optical duplicates to avoid double counting of fragments. If UMIs (unique molecular identifiers) are present and configured in oncoanalyser, then UMI-aware deduplication is performed. When duplicate fragments are found, REDUX marks all of them as duplicates and creates a single consensus read with computed consensus bases and base qualities. The consensus fragment is annotated as either single or dual strand, which allows downstream tools to distinguish high-quality from low-quality consensus reads.
A detailed description of deduplication logic is available in the REDUX documentation.
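To make the idea of a consensus read concrete, here is a minimal sketch that groups reads by position and UMI and collapses each group by majority vote. It is not the REDUX algorithm: the grouping key, the majority-vote rule, the choice of the highest supporting base quality, and all field names are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def group_duplicates(reads):
    """Group reads sharing chromosome, start position and UMI (hypothetical key)."""
    groups = defaultdict(list)
    for read in reads:
        groups[(read["chrom"], read["start"], read["umi"])].append(read)
    return dict(groups)


def build_consensus(reads):
    """Collapse one duplicate group into a single consensus read."""
    seq, quals = [], []
    for i in range(len(reads[0]["seq"])):
        counts = Counter(r["seq"][i] for r in reads)
        base = counts.most_common(1)[0][0]  # majority base at this position
        seq.append(base)
        # Assumption: keep the best quality among reads supporting the base.
        quals.append(max(r["quals"][i] for r in reads if r["seq"][i] == base))
    strands = {r["strand"] for r in reads}
    consensus_type = "dual" if strands == {"+", "-"} else "single"
    return {"seq": "".join(seq), "quals": quals, "consensus_type": consensus_type}


# Made-up reads: three duplicates with UMI AAT and one read with UMI GGC.
reads = [
    {"chrom": "chr1", "start": 100, "umi": "AAT", "strand": "+", "seq": "ACGT", "quals": [30, 30, 30, 30]},
    {"chrom": "chr1", "start": 100, "umi": "AAT", "strand": "+", "seq": "ACGT", "quals": [20, 20, 20, 20]},
    {"chrom": "chr1", "start": 100, "umi": "AAT", "strand": "-", "seq": "ACTT", "quals": [35, 35, 35, 35]},
    {"chrom": "chr1", "start": 100, "umi": "GGC", "strand": "+", "seq": "ACGT", "quals": [30, 30, 30, 30]},
]
groups = group_duplicates(reads)
consensus = build_consensus(groups[("chr1", 100, "AAT")])
print(consensus)
```

In this toy example the group with UMI AAT contains reads from both strands, so its consensus is labelled dual, while the lone GGC read would yield a single-strand consensus.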
Error recalibration
Two types of sample-specific error recalibration are currently performed in oncoanalyser. REDUX measures the rate of microsatellite errors per consensus type (for UMIs: single vs dual stranded), repeat context and repeat length, and fits these variables to a model (see REDUX microsatellite jitter modeling for details). SAGE measures the rate of base errors per consensus type, trinucleotide context and mutation type (see SAGE concepts for details). In both cases, the recalibrated error rates are stored in lookup files and used downstream in small variant calling.
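The notion of an error-rate lookup keyed by consensus type, trinucleotide context and mutation type can be sketched as follows. This is an illustrative simplification, not the SAGE model: it assumes every observation comes from a site known to be non-variant, and the function names and input layout are hypothetical. The Phred conversion Q = -10 * log10(rate) is the standard definition.

```python
import math
from collections import defaultdict


def build_error_lookup(observations):
    """Map (consensus_type, context, mutation) to an empirical error rate.

    Each observation is (consensus_type, reference trinucleotide, observed base),
    where the observed base is compared with the middle base of the context.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for consensus_type, context, observed in observations:
        totals[(consensus_type, context)] += 1
        ref = context[1]
        if observed != ref:
            errors[(consensus_type, context, f"{ref}>{observed}")] += 1
    return {key: count / totals[key[:2]] for key, count in errors.items()}


def phred(rate):
    """Convert an error rate to a Phred-scaled quality."""
    return -10 * math.log10(rate)


# Made-up counts: 1 error in 1,000 single-strand and 1 in 10,000 dual-strand bases.
observations = (
    [("single", "ACG", "C")] * 999 + [("single", "ACG", "T")]
    + [("dual", "ACG", "C")] * 9999 + [("dual", "ACG", "T")]
)
lookup = build_error_lookup(observations)
for key, rate in lookup.items():
    print(key, rate, round(phred(rate), 1))
```

A downstream caller could then look up the recalibrated quality for a candidate variant by its consensus type, context and mutation type instead of trusting the raw base quality.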
Gene and transcript definitions
Hartwig’s universe of genes consists of all HGNC symbols with a matching Ensembl gene in GRCh38. The universe of transcripts consists of all Ensembl transcripts belonging to a gene with a matching HGNC symbol. The canonical transcript is set to the Ensembl canonical transcript. For GRCh37, the transcripts differ substantially as the Ensembl database is no longer updated.
More details can be found in the HMF Gene Utilities documentation.
Driver gene panel
The driver gene panel is a key configuration in oncoanalyser. Genes configured in this file are used to generate a BED file which defines the PANEL tier for variant calling. Calling of driver events is controlled by the per-gene configuration in this file. Users may wish to modify the driver list to be more representative of their specific cancer type (e.g. adult driver genes are very different from pediatric ones).
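The step from a gene list to a BED file can be sketched as below. This is a simplified illustration only: the function name, the gene names and the coordinates are made up, and the way oncoanalyser actually derives panel regions from its reference data is not shown here. The one fixed convention used is that BED intervals are 0-based and half-open, so 1-based inclusive start positions are shifted by one.

```python
def panel_to_bed(panel_genes, gene_regions):
    """Build sorted BED lines for the regions of every gene in the panel.

    gene_regions maps a gene name to a list of (chrom, start, end) tuples
    with 1-based inclusive coordinates (an assumption of this sketch).
    """
    rows = []
    for gene in panel_genes:
        for chrom, start, end in gene_regions.get(gene, []):
            rows.append((chrom, start - 1, end, gene))  # convert to 0-based start
    rows.sort()
    return ["\t".join(str(value) for value in row) for row in rows]


# Toy data: hypothetical genes and coordinates.
gene_regions = {
    "GENE_A": [("chr1", 101, 200)],
    "GENE_B": [("chr1", 501, 600), ("chr2", 11, 50)],
    "GENE_C": [("chr3", 1, 10)],
}
bed_lines = panel_to_bed(["GENE_B", "GENE_A"], gene_regions)
print("\n".join(bed_lines))
```

Genes absent from the panel (GENE_C in this toy example) contribute no regions, which mirrors how changing the driver list changes the regions assigned to the PANEL tier.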
Note that some reference data files in oncoanalyser have been generated based on the default pre-defined driver gene panel. For example, driver likelihood estimates from PURPLE were trained on an adult pan-cancer cohort. Similarly, driver genes used in CUPPA were selected based on their presence/frequency in this adult pan-cancer cohort. In the future we aim to make these reference data files more customisable.
More details on the driver gene panel can be found in the PURPLE driver catalog documentation.