Run-length compressed metagenomic read classification with SMEM-finding and tagging

Authors: Depuydt, Lore; Ahmed, Omar Y.; Fostier, Jan; Langmead, Ben; Gagie, Travis
Date: 2026-06-15
Year: 2025
ISSN: 2589-0042
URL: https://imec-publications.be/handle/20.500.12860/59691
Language: eng
Type: Journal article
DOI: 10.1016/j.isci.2025.114029
Web of Science: WOS:001643714000001
MEDLINE: 41497396

Abstract: Metagenomic read classification is a fundamental task in computational biology but remains challenging due to the scale and diversity of sequencing data. We present a run-length compressed BWT-based index using the move structure for efficient multi-class classification. Our method finds all super-maximal exact matches (SMEMs) of length ≥ L between a read and a reference and associates each SMEM with one class identifier using a sampled tag array. A consensus algorithm then compacts these SMEMs and their class identifiers into a single classification. We are the first to perform run-length compressed read classification using full rather than semi-SMEMs. We evaluated on long and short reads across two datasets: a large bacterial pan-genome with few classes and a smaller 16S rRNA gene database spanning thousands of genera. Our method outperforms SPUMONI 2 in accuracy and runtime while maintaining run-length compressed memory complexity and surpasses Cliffy in memory efficiency with comparable accuracy.
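
To make the pipeline in the abstract concrete (find SMEMs of length ≥ L, tag each with one class, then reduce to a single label), here is a minimal brute-force sketch in Python. It is not the authors' implementation: it uses plain substring search instead of a run-length compressed BWT index with the move structure, it tags an SMEM with the first class whose sequence contains it instead of using a sampled tag array, and the length-weighted vote is only an assumed stand-in for the consensus algorithm, which the abstract does not specify. The names `occurs`, `find_smems`, `classify`, and the toy `references` are all hypothetical.

```python
from collections import defaultdict


def occurs(pattern, references):
    """Return the first class label whose sequence contains `pattern`, else None."""
    for label, seq in references.items():
        if pattern in seq:
            return label
    return None


def find_smems(read, references, min_len):
    """Brute-force SMEMs of length >= min_len as (start, end, label), end exclusive."""
    smems = []
    prev_end = 0  # end of the longest match starting at the previous position
    for start in range(len(read)):
        # The longest match from `start` reaches at least as far as the previous one.
        end = max(prev_end, start)
        while end < len(read) and occurs(read[start:end + 1], references) is not None:
            end += 1
        # A right-maximal match is super-maximal only if it ends strictly further
        # right than the match from the previous start (so it cannot extend left).
        if end > prev_end and end - start >= min_len:
            smems.append((start, end, occurs(read[start:end], references)))
        prev_end = end
    return smems


def classify(read, references, min_len):
    """Assumed consensus rule: the class with the most SMEM bases wins."""
    votes = defaultdict(int)
    for start, end, label in find_smems(read, references, min_len):
        votes[label] += end - start
    return max(votes, key=votes.get) if votes else None


# Toy example with two hypothetical classes.
references = {"A": "ACGTACGTTT", "B": "GGGCCCAAAT"}
read = "ACGTTTGGGCC"
print(find_smems(read, references, 4))
print(classify(read, references, 4))
```

In this toy run the read splits into one SMEM from class `A` and one from class `B`, and the longer one decides the label. A read with no SMEM of length ≥ L is left unclassified (`None`).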