Estimation of substitution and indel rates via k-mer statistics

Abstract

Methods that rely on k-mers are widely used in bioinformatics, yet our understanding of their statistical properties under realistic mutation models remains incomplete. Previously, substitution-only mutation models have been used to derive precise expectations and variances for mutated k-mers and for intervals of mutated and nonmutated sequences. In this work, we consider a mutation model that includes insertions and deletions in addition to single-nucleotide substitutions. Within this framework, we derive closed-form k-mer-based estimators for the three fundamental mutation parameters: substitution rate, deletion rate, and average insertion length. We provide statistics of k-mers under this model and theoretical guarantees via concentration inequalities, ensuring correctness under reasonable conditions. Empirical evaluations on simulated evolution of genomic sequences confirm our theoretical findings, demonstrating that accounting for indel signals allows accurate estimation of mutation rates and improves upon the results obtained with a substitution-only model. An implementation that estimates the mutation parameters from a pair of FASTA files is available at github.com/mahmudhera/estimate_rates_using_mutation_model.git. The results presented in this manuscript can be reproduced using the code available at github.com/mahmudhera/est_rates_experiments.git.
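For context, the substitution-only baseline mentioned in the abstract can be illustrated with a short sketch. It is not the authors' implementation and does not reproduce the indel-aware estimators, which the abstract does not state. It assumes each base is substituted independently with probability p, so a k-mer survives unmutated with probability q = (1 - p)^k, and inverting gives p = 1 - q^(1/k). All function names below are hypothetical and chosen for illustration only.

```python
import random

BASES = "ACGT"

def mutate_substitutions(seq, p, rng):
    """Substitute each base independently with probability p (assumed model).

    A substituted base is always replaced by a different base.
    """
    out = []
    for c in seq:
        if rng.random() < p:
            out.append(rng.choice([b for b in BASES if b != c]))
        else:
            out.append(c)
    return "".join(out)

def unmutated_kmer_fraction(original, mutated, k):
    """Fraction of k-mer start positions whose k-mer is unchanged.

    Comparing position by position is only valid when the two sequences
    have equal length, i.e. when no insertions or deletions occurred.
    """
    if len(original) != len(mutated):
        raise ValueError("positional comparison requires equal lengths")
    n = len(original) - k + 1
    same = sum(original[i:i + k] == mutated[i:i + k] for i in range(n))
    return same / n

def estimate_substitution_rate(q, k):
    """Invert q = (1 - p)^k to recover p from the unmutated fraction q."""
    return 1.0 - q ** (1.0 / k)

# Illustrative run on a random sequence (not real genomic data).
rng = random.Random(0)
seq = "".join(rng.choice(BASES) for _ in range(100_000))
mut = mutate_substitutions(seq, 0.05, rng)
q = unmutated_kmer_fraction(seq, mut, 21)
p_hat = estimate_substitution_rate(q, 21)
print(f"unmutated fraction: {q:.4f}, estimated substitution rate: {p_hat:.4f}")
```

The positional comparison in this sketch breaks down as soon as an insertion or deletion shifts the coordinates of the mutated sequence, which is one way to see why a substitution-only estimator can be misled by indels.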

2012 ACM Subject Classification

Applied computing → Computational biology; Theory of computation → Theory and algorithms for application domains; Mathematics of computing → Probabilistic inference problems

Funding

This material is based upon work supported by the National Science Foundation under Grant No. DBI2138585. Research reported in this publication was supported by the National Institute of General Medical Sciences of the National Institutes of Health under Award Number R01GM146462. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
