Publication:

Columba: fast approximate pattern matching with optimized search schemes

 
cris.virtual.orcid0000-0001-8517-0479
cris.virtual.orcid0000-0002-2244-1427
cris.virtual.orcid0000-0002-9994-8269
cris.virtualsource.departmentc4cd9ad3-10fc-4ea8-b7f3-19c50b10d7a7
cris.virtualsource.departmentcafc39a2-8610-45ab-befa-b3f04ef3481d
cris.virtualsource.department4a97cd9e-e619-4718-8c1e-9168bc19ef13
cris.virtualsource.orcidc4cd9ad3-10fc-4ea8-b7f3-19c50b10d7a7
cris.virtualsource.orcidcafc39a2-8610-45ab-befa-b3f04ef3481d
cris.virtualsource.orcid4a97cd9e-e619-4718-8c1e-9168bc19ef13
dc.contributor.authorRenders, Luca
dc.contributor.authorDepuydt, Lore
dc.contributor.authorGagie, Travis
dc.contributor.authorFostier, Jan
dc.date.accessioned2026-04-01T07:19:21Z
dc.date.available2026-04-01T07:19:21Z
dc.date.issued2025
dc.description.abstractMotivation: Aligning sequencing reads to reference genomes is a fundamental task in bioinformatics. Aligners can be classified as lossy or lossless: lossy aligners prioritize speed by reporting only one or a few high-scoring alignments, whereas lossless aligners output all optimal alignments, ensuring completeness and sensitivity. Results: This paper introduces Columba, a high-performance lossless aligner tailored for Illumina sequencing data. Columba processes single or paired-end reads in FASTQ format and outputs alignments in SAM format. By utilizing advanced search schemes and bit-parallel alignment techniques, Columba achieves exceptional speed. Columba is available in two variants. The first, based on the bidirectional FM-index, prioritizes speed. The second, Columba RLC, uses run-length compression using a bidirectional move structure, significantly reducing memory usage for large, repetitive datasets like pan-genomes. Benchmarks on the human genome, as well as bacterial and human pan-genome datasets, demonstrate that Columba is much faster than existing lossless aligners and even competitive with lossy tools. We integrated Columba into the OptiType HLA genotyping pipeline, where it substantially reduced computational time while maintaining accuracy. These results position Columba as a versatile, state-of-the-art tool for high-sensitivity genomic analyses. Availability and implementation: The source code of Columba is available at https://github.com/biointec/columba under AGPL license. Scripts to reproduce the benchmarks and analyses are available at https://doi.org/10.5281/zenodo.15849246.
dc.identifier.doi10.1093/bioinformatics/btaf652
dc.identifier.issn1367-4803
dc.identifier.urihttps://imec-publications.be/handle/20.500.12860/58991
dc.language.isoen
dc.provenance.editstepusergreet.vanhoof@imec.be
dc.publisherOxford Academic
dc.relation.ispartofBIOINFORMATICS
dc.relation.ispartofseriesBIOINFORMATICS
dc.source.beginpagebtaf652
dc.source.issue12
dc.source.journalBIOINFORMATICS
dc.source.numberofpages8
dc.source.volume41
dc.subjectREAD ALIGNMENT
dc.subjectACCURATE
dc.subjectScience & Technology
dc.subjectLife Sciences & Biomedicine
dc.subjectTechnology
dc.subjectPhysical Sciences
dc.titleColumba: fast approximate pattern matching with optimized search schemes
dc.typeJournal article
dspace.entity.typePublication
oaire.citation.editionWOS.SCI
oaire.citation.issue12
oaire.citation.volume41
Files

Original bundle

Name: btaf652.pdf
Size: 1.21 MB
Format: Adobe Portable Document Format
Description: Published