FEDRANN: effective long-read overlap detection based on dimensionality reduction and approximate nearest neighbors

Abstract

Overlap detection is a key step in de novo genome assembly pipelines based on the Overlap-Layout-Consensus (OLC) paradigm. However, existing methods for overlap detection rely on either heuristic seed-and-extension strategies or locality-sensitive hashing (LSH), both of which struggle to handle repetitive genomic regions and the computational burden of large-scale datasets. Here, we present FEDRANN, a novel strategy for overlap graph construction that integrates feature extraction, dimensionality reduction (DR), and approximate nearest neighbor (ANN) search. We find that a pipeline combining inverse document frequency (IDF) transformation, sparse random projection (SRP), and NNDescent enables accurate detection of overlaps across diverse datasets. We developed an efficient open-source implementation of this pipeline named Fedrann (https://github.com/jzhang-dev/fedrann). Through systematic benchmarking on real long-read sequencing data, we demonstrate that Fedrann produces overlap graphs comparable to or better than those generated by existing state-of-the-art tools, including MECAT2, minimap2, and wtdbg2, while maintaining competitive runtime. Despite being implemented primarily in Python, Fedrann achieves performance on par with tools written in compiled languages, owing to matrix-based representations and C-accelerated numerical libraries. Our results suggest that DR and ANN techniques offer a promising new direction for scalable and accurate overlap detection in long-read assembly and broader sequence similarity search tasks.
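
The three stages named in the abstract (feature extraction with IDF weighting, sparse random projection, and nearest-neighbor search) can be illustrated with a small standard-library sketch. This is not Fedrann's implementation: the function names, the k-mer presence features, the smoothed IDF formula, and every parameter value (k = 12, 256 components, density 0.1) are assumptions chosen for illustration, the reads are synthetic and error-free, reverse-complement strands are ignored, and an exact brute-force cosine search stands in for NNDescent.

```python
import math
import random
from collections import Counter


def kmer_set(seq, k):
    """Return the set of distinct k-mers in a read."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def idf_features(reads, k):
    """Sparse rows {column: weight}: k-mer presence scaled by a smoothed IDF."""
    sets = [kmer_set(r, k) for r in reads]
    doc_freq = Counter(km for s in sets for km in s)
    n = len(reads)
    column = {km: j for j, km in enumerate(sorted(doc_freq))}
    # Smoothed IDF (an assumed variant): k-mers seen in many reads get less weight.
    idf = {km: math.log((1 + n) / (1 + c)) + 1.0 for km, c in doc_freq.items()}
    rows = [{column[km]: idf[km] for km in s} for s in sets]
    return rows, len(column)


def sparse_project(rows, n_features, n_components, density, seed):
    """Project rows with a random matrix whose entries are 0 or +/- scale."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(density * n_components)
    # For each input feature, keep only its nonzero (component, value) entries.
    matrix = [
        [(c, scale if rng.random() < 0.5 else -scale)
         for c in range(n_components) if rng.random() < density]
        for _ in range(n_features)
    ]
    projected = []
    for row in rows:
        vec = [0.0] * n_components
        for j, weight in row.items():
            for c, value in matrix[j]:
                vec[c] += weight * value
        projected.append(vec)
    return projected


def cosine(u, v):
    """Cosine similarity of two dense vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbors(vectors, n_neighbors):
    """Exact brute-force search; a stand-in for an ANN index such as NNDescent."""
    result = []
    for i, u in enumerate(vectors):
        scored = sorted(
            ((cosine(u, v), j) for j, v in enumerate(vectors) if j != i),
            reverse=True,
        )
        result.append([j for _, j in scored[:n_neighbors]])
    return result


# Synthetic example: reads 0 and 1 overlap, as do reads 2 and 3.
rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(1000))
reads = [genome[0:300], genome[50:350], genome[600:900], genome[650:950]]

rows, n_features = idf_features(reads, k=12)
embedded = sparse_project(rows, n_features, n_components=256, density=0.1, seed=1)
neighbors = nearest_neighbors(embedded, n_neighbors=1)
print(neighbors)
```

Each read's nearest neighbor in the reduced space is a candidate overlap partner, and the resulting neighbor lists are the edges of a candidate overlap graph. The brute-force search compares every pair of reads, so it is only practical for toy inputs; at the scale the abstract describes, an approximate index would replace it.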
