7

Evaluating collapsed misassembly with asmgene

 2 years ago
source link: http://lh3.github.io/2020/12/25/evaluating-assembly-quality-with-asmgene
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.
neoserver,ios ssh client

Evaluating collapsed misassembly with asmgene

25 December 2020

It is usually easy to evaluate the contiguity of a de novo assembly – just compute N50. It is much harder to evaluate the correctness. We typically identify misassemblies by aligning contigs to a reference genome. However, it is tricky to interpret the results. In case of human, there are thousands of structural variations (SVs) between the reference and the sample being assembled. Alignment-based evaluation often mistakes these SVs as misassemblies. For example, QUAST identifies >10,000 “misassemblies” in the T2T assembly when compared to GRCh38. We can’t reliably tell misassemblies from SVs which leads to overestimated misassembly rate. A second problem with reference-based alignment is that most alignment differences come from complex regions such as centromeres and subtelomeres. It fails to evaluate gene regions we are mostly interested in; on the contrary it penalizes an assembly that represents these complex regions better.

Most assembly problems are caused by repetitive or paralogous regions. When an assembler cannot resolve such a region, it either creates an assembly gap or forces through the region with a misassembly. To probe these issues, we can align a multi-copy gene to the assembly and see if it remains multi-copy.

More precisely, we do the following. We align all annotated transcripts to a reference genome and select the longest isoform among overlapping transcripts. For each selected transcript, we record a hit if the transcript is mapped at ≥99% identity over ≥99% of of the transcript length. A transcript is considered to be single-copy (SC) if it has only one hit; otherwise it is considered multi-copy (MC). We do the same for the assembly and then compute the fraction of missing multi-copy gene as

MMC = 1 - |{MCinASM} ∩ {MCinREF}| / |{MCinREF}|

In the ideal case of a perfect assembly, %MMC should be zero. A higher fraction suggests more collapsed assemblies. We can compute percent MMC (%MMC) with paftools.js asmgene from minimap2:

minimap2 -cxsplice:hq -t8 ref.fa cdna.fa > ref.cdna.paf
minimap2 -cxsplice:hq -t8 asm.fa cdna.fa > asm.cdna.paf
paftools.js asmgene [-a] ref.cdna.paf asm.cdna.paf

The output looks like:

H Metric   ref.cdna asm.cdna
X full_sgl 36426    36389
X full_dup 0        18
X frag     0        4
X part50+  0        5
X part10+  0        0
X part10-  0        10
X dup_cnt  1334     1330
X dup_sum  4110     4080

On the line X dup_cnt, 1334 is the number of multi-copy genes in the reference, of which 1330 remain multi-copy in the assembly. %MMC is thus 1-1330/1334=0.3%. Also in this output, 36426 is the number of single-copy genes in the reference, of which 36389 remain single-copy in the assembly and 18 are false duplications. We can similarly compute BUSCO-like metrics but based on the reference.

Collapsed misassemblies in long-read assemblies

The following figure shows the level of collapsed genes in various CHM13 assemblies, taking Ensembl genes as input and the T2T CHM13 as the reference:

Because the reference and the assemblies all come from the same sample, a perfect assembly should have no collapsed genes. Hifiasm and HiCanu are close to that mark. However, some assemblers may miss up to 70% of multi-copy genes. We had a closer look at missing genes. As is expected, they either fall into assembly gaps or leave a misassembly in a long contig. If you want to study a gene family, such assembly problems will ruin your day.

CHM13 is a homozygous cell line. The following figure shows the level of collapsed genes in diploid HG00733 assemblies, again with CHM13 as the reference:

In this figure, even GRCh38 misses 10% multi-copy genes in CHM13. This is background noises caused by between-sample SVs. It is much lower than the level of collapsed misassemblies of many assemblers, demonstrating the effectiveness this metric.

Conclusion

Percent MMC is a new metric to measure the quality of an assembly. It takes minutes to compute, is gene focused and is robust to structural variations in comparison to evaluations based on assembly-to-reference alignment. The obvious downside of %MMC is that it requires a high-quality reference genome and is not applicable to new species, but this is not a concern during the development of assemblers.


About Joyk


Aggregate valuable and interesting links.
Joyk means Joy of geeK