The coverage analysis system processes the reads that were aligned anywhere in the genome, and then calculates metrics on the
reads that overlap with the features (i.e. "on target") and those that do not (i.e. "off target"). Several settings control the
quality filtering and error correction that are applied to the reads as they are being processed.
If you enable Unique Molecular Id/Tag (UMI/UMT) processing, the system will first group all of the reads at the same position
into consensus families before calculating the coverage at each position. You can optionally filter out consensus families that contain
only a small number of reads during the coverage analysis. This prevents the results from being affected by reads that have few duplicates
with a matching UMI/UMT at the same position.
Important Note: If you did not enable UMI/UMT processing when aligning the reads (available on the "Pre-Processing"
tab on the "Start Alignment" screen), then Curio automatically hides the UMI/UMT processing options on the coverage analysis screen.
When using UMI/UMT processing, the "Minimum Family Size" setting can be used to remove the smaller families after the consensus families
have been calculated at each position. For example, if you set this to a value of "5 reads", then any consensus families that contain 4 reads
or fewer are excluded before calculating the coverage metrics. Note that if you slide this setting all the way to the left (i.e. "Include all
Families"), then all consensus read families will be included in the analysis, even if a family only contains one read.
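The effect of the family size threshold can be sketched as below. The function name `keep_large_families` and the representation of a read as a `(position, umi)` tuple are hypothetical, chosen only to illustrate the filtering rule; they are not taken from Curio.

```python
from collections import Counter


def keep_large_families(family_sizes, min_family_size):
    """Keep only the families that contain at least min_family_size reads."""
    return {
        family: size
        for family, size in family_sizes.items()
        if size >= min_family_size
    }


# Illustrative reads at one position, each represented as (position, UMI).
reads = (
    [(1000, "ACGT")] * 6    # family of 6 reads
    + [(1000, "TTGA")] * 4  # family of 4 reads
    + [(1000, "GGCC")]      # family of 1 read
)
family_sizes = Counter(reads)

# A threshold of 5 drops the families with 4 reads or fewer.
kept_at_5 = keep_large_families(family_sizes, 5)

# A threshold of 1 corresponds to "Include all Families".
kept_at_1 = keep_large_families(family_sizes, 1)

print(len(kept_at_5), len(kept_at_1))
```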
As part of a coverage analysis, UMI/UMT processing therefore provides a way to retain reads that contain information
from an original molecule and that would otherwise have been filtered out during de-duplication. It does so by using the UMI/UMT
of each consensus family to determine the unique information that should be retained at each alignment position (instead of simply
removing duplicate reads that have a matching alignment position and "CIGAR" alignment string). In addition, by using the
"Minimum Family Size" setting, you can remove reads that are potentially noise because there is no evidence of
other reads at the same position with a corresponding UMI/UMT.
If you enable Unique Molecular Id/Tag (UMI/UMT) processing, the system will first group all of the reads
at the same position into consensus families before attempting to calculate coverage levels.
By default, Curio requires that the UMIs of all the reads within a consensus family are exactly the same.
However, the "Family Hamming Distance" setting allows you to adjust the number of bases that Curio will allow
to differ between two UMIs while still grouping them into the same family (i.e. the allowable
"Hamming distance" between the UMIs that are within the same family).
Example: Imagine that the reads at a given position contained UMIs as shown below in red:
With the "Family Hamming Distance" setting set to "perfect match" (i.e. a Hamming distance of zero), the above
reads would be grouped into two families. If you instead set it to allow one base to
differ when calculating the UMI families (i.e. a Hamming distance of one), there would be only a single
consensus read family counted towards the coverage of any features that the above reads overlap with.
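The grouping behaviour can be sketched in code. The UMI sequences below are made up for illustration and are not the ones from the original figure. The greedy grouping strategy is also an assumption; Curio's actual clustering method is not described here.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length UMIs differ."""
    if len(a) != len(b):
        raise ValueError("UMIs must have the same length")
    return sum(x != y for x, y in zip(a, b))


def group_umis(umis, max_distance):
    """Greedily assign each UMI to the first family that has a member within
    max_distance of it, or start a new family if there is none.
    (A simple sketch; it does not merge families after the fact.)"""
    families = []
    for umi in umis:
        for family in families:
            if any(hamming_distance(umi, member) <= max_distance for member in family):
                family.append(umi)
                break
        else:
            families.append([umi])
    return families


# Hypothetical UMIs observed at one alignment position.
umis = ["ACGTAC", "ACGTAC", "ACGTAT"]

perfect_match = group_umis(umis, max_distance=0)  # two families
one_mismatch = group_umis(umis, max_distance=1)   # one family

print(len(perfect_match), len(one_mismatch))
```

With a distance of zero the third UMI forms its own family because its last base differs. Allowing one mismatch folds it into the first family.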
When the "De-duplication" option is enabled, Curio will attempt to remove reads that are potential PCR amplification duplicates. The algorithm
finds all reads that have the same alignment position, orientation, and CIGAR alignment string. In the case of a paired-end read, those
aspects of both the read and the mate are taken into account. Note that the "CIGAR" alignment string is how the aligner specifies areas of each read that
represent potential insertions or soft clipped regions (i.e. bases present in the read that are not in the reference) or deletions (i.e. missing
bases in the read that are in the reference). The read (or read pair) with the highest quality is then kept, and all other reads (or read
pairs) that have a matching alignment position, orientation, and CIGAR string are removed before calculating coverage metrics. The
highest-quality read (or read pair) is the one with the highest sum of Phred base qualities, counting only those base qualities that are
greater than or equal to Q15.
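The de-duplication rule described above can be sketched for single-end reads as follows. The `Read` class and its field names are hypothetical, and keeping the first read seen when scores tie is an assumption. For paired-end data the grouping key would also include the mate's position, orientation, and CIGAR string.

```python
from dataclasses import dataclass

MIN_BASE_QUALITY = 15  # only base qualities >= Q15 contribute to the score


@dataclass(frozen=True)
class Read:
    """Hypothetical minimal read record for this example."""
    name: str
    position: int
    reverse: bool          # orientation
    cigar: str
    base_qualities: tuple  # Phred base qualities


def quality_score(read):
    """Sum of the Phred base qualities that are Q15 or higher."""
    return sum(q for q in read.base_qualities if q >= MIN_BASE_QUALITY)


def deduplicate(reads):
    """Keep the highest-scoring read per (position, orientation, CIGAR) group."""
    best = {}
    for read in reads:
        key = (read.position, read.reverse, read.cigar)
        current = best.get(key)
        # Strict ">" keeps the first read seen on a tie (an assumption).
        if current is None or quality_score(read) > quality_score(current):
            best[key] = read
    return list(best.values())


reads = [
    Read("r1", 500, False, "4M", (30, 30, 10, 30)),      # score 90 (Q10 ignored)
    Read("r2", 500, False, "4M", (35, 35, 35, 12)),      # score 105 (Q12 ignored)
    Read("r3", 500, False, "2M1I1M", (20, 20, 20, 20)),  # different CIGAR
    Read("r4", 500, True, "4M", (20, 20, 20, 20)),       # different orientation
]
kept = deduplicate(reads)
print(sorted(read.name for read in kept))
```

Here `r1` and `r2` are duplicates of each other, and `r2` is kept because its score is higher. `r3` and `r4` survive because their CIGAR string or orientation differs.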
If you disable this option, all reads will be included in the analysis, even if they are potential PCR duplicates. When UMI/UMT processing is enabled,
this option will appear disabled, since duplicate reads that have the same identifier are automatically consolidated when UMI/UMT processing is in
effect. Note that the option to enable UMI/UMT processing during coverage analysis is only visible if you enabled UMI/UMT processing when first
aligning the reads.
Some aligners (e.g. Bowtie) report a Phred-like quality score that represents how likely it is that the position chosen for the read
alignment is correct. If this type of alignment quality score is available for the reads, then this setting can be used to exclude reads whose
alignment quality is below the selected value.
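A minimal sketch of this filter is shown below. It assumes the alignment quality follows the standard Phred scale, where a score Q corresponds to an error probability of 10^(-Q/10). It also assumes that reads without a score are kept, which the text above does not specify. The function names and the `(name, score)` tuples are hypothetical.

```python
def error_probability(alignment_quality):
    """Convert a Phred-scaled score into the probability that the
    alignment position is wrong (standard Phred scale assumed)."""
    return 10 ** (-alignment_quality / 10)


def filter_by_alignment_quality(reads, min_quality):
    """Drop reads whose alignment quality is below min_quality.
    Reads with no score (None) are kept; this is an assumption."""
    return [
        (name, quality)
        for name, quality in reads
        if quality is None or quality >= min_quality
    ]


# Illustrative reads as (name, alignment quality) pairs.
reads = [("r1", 42), ("r2", 3), ("r3", 20), ("r4", None)]
passing = filter_by_alignment_quality(reads, min_quality=20)

print([name for name, _ in passing])
print(error_probability(20))
```

Under the Phred-scale assumption, a threshold of 20 keeps alignments whose estimated chance of being misplaced is 1% or less.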