The variant detection system needs to process the reads that were aligned to specific areas of the genome, and then look
at a consolidated version of the individual bases that the sequencer reported within each of those reads in order to find
potential alleles. Several settings control the quality filtering and error correction that are applied
to the reads as they are being processed.
If you enable Unique Molecular Tag/Unique Molecular Id (UMT/UMI) processing, the system will first group all of the reads at the same position
into consensus families before attempting to detect variants. When calculating the consensus family, the system will also attempt to correct
errors in the reads by using the configured threshold to determine how many of the reads within a family need to show the same
nucleotide at any position. You can also filter out consensus families with a small number of reads during the variant analysis. This prevents
reads that have few duplicates with a matching UMT/UMI from affecting the variant detection results.
Important Note: If you did not enable UMT/UMI processing when aligning the reads (available on the "Pre-Processing"
tab on the "Start Alignment" screen), then Curio automatically hides the UMT/UMI processing options on the variant detection screen.
When performing the consensus family error correction on reads that have a matching UMT/UMI, the system will use this threshold to determine how
many of the reads within a family need to show the same nucleotide at any position. Setting this value to 60%, for example, means that at least 60%
of the reads within a family have to show the same nucleotide in order for the consensus family to include that nucleotide at the given
position. If no nucleotide is shown by at least 60% of the reads at a given position, then the consensus family will be switched to an 'N' at that
position.
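The threshold rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Curio's implementation: the function name `consensus_read`, the equal-length reads, and the example sequences are all invented for this sketch.

```python
from collections import Counter

def consensus_read(reads, threshold=0.6):
    """Build a consensus sequence from equal-length reads in one UMT/UMI family.

    Hypothetical helper: a base is kept only if at least `threshold` of the
    reads agree on it; otherwise the position becomes 'N'.
    """
    consensus = []
    for column in zip(*reads):  # walk the family one position at a time
        base, count = Counter(column).most_common(1)[0]
        consensus.append(base if count / len(column) >= threshold else "N")
    return "".join(consensus)

# Invented family of five reads that disagree at positions 5 and 9.
family = [
    "ACGTACGTAC",
    "ACGTACGTGC",
    "ACGTTCGTGC",
    "ACGTACGTAC",
    "ACGTACGTTC",
]
print(consensus_read(family, threshold=0.6))  # ACGTACGTNC
```

In this invented family, four of five reads (80%) agree on 'A' at position 5, so the base is kept. At position 9 the best-supported base is seen in only two of five reads (40%), so the consensus is 'N'.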
Example: Consider a case where the reads at a given position that all have the same UMT/UMI look like the following:
At position 5 and position 9 the reads show some differences. If the family consensus threshold was set to 60%, the consensus read for this family
would be set to the following:
If you enable Unique Molecular Id/Tag (UMI/UMT) processing, the system will first group all of the reads
at the same position into consensus families before attempting to detect variants. By default, Curio
requires the UMIs of all the reads within a consensus family to be exactly the same. However,
this setting allows you to adjust the number of bases that Curio will allow
to differ between two UMIs while still grouping them into the same family (i.e. the allowable
"Hamming distance" between the UMIs that are within the same family).
Example: Imagine that the reads at a given position contained UMIs as shown below in red:
With the "Family Hamming Distance" setting set to "perfect match" (i.e. a Hamming distance of zero), the above
reads would be grouped into two families. The blue 'G' base would be corrected in the
first consensus family, leaving 'T' as the only call at that position. However, the blue 'C' base
would remain, since it is the consensus at that position in the second family. Therefore
there would be two calls at that second position, 'A' and 'C'. If instead you were to set
this setting to allow for one base to be different when calculating the UMI families (i.e.
a Hamming distance of one), then there would only be one call at that later position,
an 'A', since all seven reads would be treated as a single family.
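To make the effect of the Hamming distance setting concrete, here is a minimal Python sketch. It assumes a simple greedy grouping in which each UMI joins the first family whose first member is within the allowed distance. Curio's actual grouping strategy is not described here, so treat `hamming_distance`, `group_umis`, and the seven invented UMIs as illustrative only.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length UMIs differ."""
    if len(a) != len(b):
        raise ValueError("UMIs must have the same length")
    return sum(x != y for x, y in zip(a, b))

def group_umis(umis, max_distance=0):
    """Greedy grouping (an assumption, not Curio's documented algorithm).

    Each UMI joins the first family whose representative (its first member)
    is within `max_distance`; otherwise it starts a new family.
    """
    families = []  # list of (representative, members)
    for umi in umis:
        for representative, members in families:
            if hamming_distance(umi, representative) <= max_distance:
                members.append(umi)
                break
        else:
            families.append((umi, [umi]))
    return [members for _, members in families]

# Seven invented UMIs: four copies of one tag, three that differ by one base.
umis = ["AACGT"] * 4 + ["AACGA"] * 3

print(len(group_umis(umis, max_distance=0)))  # 2 families (perfect match)
print(len(group_umis(umis, max_distance=1)))  # 1 family of seven reads
```

A greedy pass like this depends on the order of the reads, so a production implementation would likely use a more careful clustering step. The sketch is only meant to show how a larger allowed distance merges families.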
When using UMT/UMI processing, after the consensus families have been calculated at each position, this setting can be used to remove the
smaller families. For example, if you set this to a value of "5 reads", then any consensus family that contains 4 or fewer reads is excluded when
attempting to find variants. Note that if you slide this setting all the way to the left (i.e. "Include all Families"), then all consensus read
families will be included in the analysis, even if the family only contains one read.
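The family size filter amounts to a simple length check. The sketch below is a hypothetical illustration: `filter_families` and the use of `None` to stand in for "Include all Families" are assumptions made for this example.

```python
def filter_families(families, min_reads=None):
    """Drop consensus families that contain fewer than `min_reads` reads.

    `min_reads=None` stands in for the "Include all Families" slider position.
    """
    if min_reads is None:
        return list(families)
    return [family for family in families if len(family) >= min_reads]

# Invented families with 6, 4, and 1 reads respectively.
families = [["r1"] * 6, ["r2"] * 4, ["r3"]]

print(len(filter_families(families, min_reads=5)))  # 1: only the 6-read family
print(len(filter_families(families)))               # 3: all families kept
```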
UMT/UMI processing therefore provides a useful way to correct errors in individual reads that occurred either during
amplification or sequencing, specifically by using the UMT/UMI of the reads to calculate a consensus read family that more
accurately represents the original molecule. In addition, it provides a way to save reads that contain information
from an original molecule that would otherwise have been filtered out during de-duplication, by using the UMT/UMI of each
consensus family to determine the unique information that should be retained at each alignment position (instead of simply
removing duplicate reads that have a matching alignment position and "CIGAR" alignment string).
When the "De-duplication" option is enabled, Curio will attempt to get rid of reads that are potential PCR amplification duplicates. The algorithm
used finds all reads that have the same alignment position, orientation, and CIGAR alignment string. In the case of a paired-end read, those
aspects of both the read and the mate are taken into account. Note that the "CIGAR" alignment string is how the aligner specifies areas of each read that
represent potential insertions or soft clipped regions (i.e. bases present in the read that are not in the reference) or deletions (i.e. missing
bases in the read that are in the reference). The read (or read pair) with the highest quality is then kept, and all other reads (or read
pairs) that have a matching alignment position, orientation, and CIGAR string are then removed before attempting to detect variants. The
highest-quality read (or read pair) is the one with the highest sum of Phred base qualities, counting only bases whose quality is greater than
or equal to Q15.
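The de-duplication rule can be sketched as follows for single-end reads. This is a minimal illustration under stated assumptions, not Curio's code: the `Read` class, its field names, and the example reads are invented, and mate information for paired-end reads is left out for brevity.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Read:
    """Hypothetical minimal read record (single-end only)."""
    name: str
    position: int
    reverse: bool
    cigar: str
    qualities: list  # Phred base qualities

def quality_score(read, floor=15):
    """Sum of the Phred base qualities that are >= Q15."""
    return sum(q for q in read.qualities if q >= floor)

def deduplicate(reads):
    """Keep the best read per (position, orientation, CIGAR) group."""
    groups = defaultdict(list)
    for read in reads:
        groups[(read.position, read.reverse, read.cigar)].append(read)
    return [max(group, key=quality_score) for group in groups.values()]

reads = [
    Read("a", 100, False, "10M", [30, 30, 10]),   # score 60 (Q10 ignored)
    Read("b", 100, False, "10M", [40, 14, 14]),   # score 40, duplicate of "a"
    Read("c", 100, False, "8M2S", [20, 20]),      # different CIGAR, kept
]
print([r.name for r in deduplicate(reads)])  # ['a', 'c']
```

Reads "a" and "b" share a position, orientation, and CIGAR string, so only the higher-scoring "a" survives. Read "c" has a different CIGAR string and is therefore not treated as a duplicate.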
If you disable this option all reads will be included in the analysis, even if they are potential PCR duplicates. When UMT/UMI processing is enabled,
this option will appear disabled since duplicate reads that have the same identifier will be automatically consolidated when UMT/UMI processing is in
effect. Note that the option to enable UMT/UMI processing during variant analysis is only visible if you enabled UMT/UMI processing when first
aligning the reads.
The "Minimum Quality" setting allows you to control how both the Phred quality data of the individual bases within the reads and the
read alignment quality data are processed. If you set this to "Include All Data" (by moving the slider all the way to the left), then all
reads and bases will be included when attempting to find variants, regardless of their quality level. If you set it to a specific Phred
quality score, the system will use that value in two ways:
Any base within a read that has a Phred quality score below the requested value will be switched to an 'N', so that it won't contribute towards a
variant call at that position.
Some of the aligners (Bowtie, etc.) report a Phred-like quality score that represents how likely it is that the position chosen for the read
alignment is correct. If this type of alignment quality score is available for the reads, then this setting is also used to exclude reads whose
alignment quality is below the selected value.
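The two uses of the "Minimum Quality" value can be sketched together in Python. The function names and the tuple layout for reads (sequence, base qualities, alignment quality or `None` when the aligner reports none) are assumptions made for this example.

```python
def mask_low_quality_bases(sequence, qualities, min_quality):
    """Replace every base whose Phred quality is below the threshold with 'N'."""
    return "".join(
        base if quality >= min_quality else "N"
        for base, quality in zip(sequence, qualities)
    )

def apply_minimum_quality(reads, min_quality):
    """Apply the threshold to both alignment quality and base quality.

    `reads` is a list of (sequence, base_qualities, alignment_quality) tuples;
    alignment_quality is None when no such score is available.
    """
    kept = []
    for sequence, qualities, alignment_quality in reads:
        if alignment_quality is not None and alignment_quality < min_quality:
            continue  # exclude the whole read
        kept.append(mask_low_quality_bases(sequence, qualities, min_quality))
    return kept

reads = [
    ("ACGT", [30, 10, 25, 20], 40),    # one low-quality base is masked
    ("TTTT", [30, 30, 30, 30], 5),     # excluded: alignment quality too low
    ("GGCC", [19, 20, 21, 5], None),   # no alignment score: only bases masked
]
print(apply_minimum_quality(reads, min_quality=20))  # ['ANGT', 'NGCN']
```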