Samtools is a suite of programs for manipulating high-throughput sequencing data. It consists of:
- Samtools: For handling SAM/BAM/CRAM formats
- BCFtools: Variant calling (VCF/BCF formats)
- HTSlib: The underlying C library
See the official samtools documentation for the full command reference.
Command | Description | Syntax |
dict | create a sequence dictionary file | |
faidx | index/extract FASTA | |
fqidx | index/extract FASTQ | |
index | index alignment | |
Command | Description | Syntax |
calmd | recalculate MD/NM tags and '=' bases | |
fixmate | fix mate information | |
reheader | replace BAM header | |
targetcut | cut fosmid regions (for fosmid pool only) | |
addreplacerg | adds or replaces RG tags | |
markdup | mark duplicates | samtools markdup in.algnsorted.bam out.bam |
ampliconclip | clip oligos from the end of reads | |
Handling Duplicate Reads
Assume we already have coordinate-sorted, duplicate-marked, and indexed BAM files. The loop below removes the marked duplicates (-r) and prints basic statistics (-s) for each file:
for bamfile in ./*_markdup.bam; do
    # note that in this case, files are named like `DNMT3A_sorted_markdup.bam`
    samtools markdup -@ "$(nproc)" -r -s "$bamfile" "${bamfile%markdup.bam}dedup.bam"
done
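The output name in the loop comes from shell suffix removal: ${bamfile%markdup.bam} strips a trailing markdup.bam, and dedup.bam is then appended. A minimal sketch of just that step, runnable without samtools (the helper name dedup_name is hypothetical and exists only for illustration):

```shell
# dedup_name is a hypothetical helper, used only to illustrate the expansion.
dedup_name() {
    # ${1%markdup.bam} removes a trailing "markdup.bam" if one is present
    printf '%s\n' "${1%markdup.bam}dedup.bam"
}

dedup_name ./DNMT3A_sorted_markdup.bam   # prints ./DNMT3A_sorted_dedup.bam
```

A name that does not end in markdup.bam is left unstripped (sample.bam would become sample.bamdedup.bam), so it helps to keep the loop's glob limited to files that follow the naming scheme.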
File operations
Command | Description | Syntax |
collate | shuffle and group alignments by name | |
cat | concatenate BAMs | |
consensus | produce a consensus Pileup/FASTA/FASTQ | |
merge | merge sorted alignments | |
mpileup | multi-way pileup | |
sort | sort alignment file | |
split | splits a file by read group | |
quickcheck | quickly check if SAM/BAM/CRAM file appears intact | |
fastq | converts a BAM to a FASTQ | |
fasta | converts a BAM to a FASTA | |
import | Converts FASTA or FASTQ files to SAM/BAM/CRAM | |
reference | Generates a reference from aligned data | |
reset | Reverts aligner changes in reads | |
Sorting and Indexing
samtools sort
: Sort alignments by leftmost coordinates
Sort a BAM file:
samtools sort -o sorted_output.bam input.bam
Sort by read names (useful for some tools):
samtools sort -n -o name_sorted.bam input.bam
samtools index
: Index a sorted BAM/CRAM file
samtools index sorted_output.bam
Always index your sorted BAM files. Many tools require indexed BAMs for efficient access.
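Sorting and indexing are usually run back to back, and the index step only makes sense if the sort succeeded. The sketch below chains the two commands shown above. The run wrapper, the DRY_RUN variable, the sort_and_index name, and the .sorted.bam output naming are all assumptions made for illustration; with DRY_RUN=1 the commands are printed instead of executed, so the sketch can be checked without samtools installed.

```shell
# run is a hypothetical wrapper: with DRY_RUN=1 it prints the command
# instead of executing it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# sort_and_index is a hypothetical helper; the output naming scheme
# (<name>.sorted.bam) is an assumption, not a samtools convention.
sort_and_index() {
    in_bam=$1
    out_bam=${in_bam%.bam}.sorted.bam
    # "&&" ensures the index is only built when the sort succeeded
    run samtools sort -o "$out_bam" "$in_bam" &&
    run samtools index "$out_bam"
}

# Usage: sort_and_index input.bam
```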
Merging and Splitting
samtools merge
: Combine multiple BAM files
samtools merge output.bam input1.bam input2.bam input3.bam
Ensure all input BAMs are sorted in the same way (coordinate or query name).
samtools split
: Split a BAM file by read group
samtools split -u unassigned.bam -f '%*_%!.bam' input.bam
This is useful when one BAM file contains several read groups, for example multiple samples or sequencing lanes.
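The -f argument is a filename template. The sketch below mimics how '%*_%!.bam' would be filled in, assuming (per the samtools split manual) that %* expands to the input file name without its directory and extension and %! to the read group ID. The helper split_output_name and the read group ID lane1 are hypothetical.

```shell
# split_output_name is a hypothetical helper that imitates, for illustration
# only, the expansion of the template '%*_%!.bam'.
split_output_name() {
    base=$(basename "$1")
    base=${base%.*}          # drop the extension, e.g. input.bam -> input
    printf '%s_%s.bam\n' "$base" "$2"
}

split_output_name input.bam lane1   # prints input_lane1.bam
```

Reads without a read group do not match any template expansion, which is why the command above also passes -u unassigned.bam.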
Command | Description | Syntax |
bedcov | read depth per BED region | |
coverage | alignment depth and percent coverage | |
depth | compute the depth | |
flagstat | simple stats | |
idxstats | BAM index stats | |
cram-size | list CRAM Content-ID and Data-Series sizes | |
phase | phase heterozygotes | |
stats | generate stats (former bamcheck) | |
ampliconstats | generate amplicon specific stats | |
samtools flagstat
: Generate simple alignment statistics
samtools flagstat input.bam
samtools idxstats
: Report alignment summary statistics
samtools idxstats input.bam
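idxstats output is tab-delimited with four columns: reference name, sequence length, number of mapped read segments, and number of unmapped read segments. A small sketch for totalling the mapped column follows; the helper name mapped_total is hypothetical, and the sample rows are invented so the example runs without a BAM file.

```shell
# mapped_total is a hypothetical helper: it sums column 3 (mapped read
# segments) of idxstats-style input read from standard input.
mapped_total() {
    awk -F '\t' '{ total += $3 } END { print total + 0 }'
}

# Usage with a real file: samtools idxstats input.bam | mapped_total
# Synthetic two-line example (values invented for illustration):
printf 'chr1\t1000\t40\t2\n*\t0\t0\t5\n' | mapped_total   # prints 40
```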
samtools stats
: Generate comprehensive statistics
samtools stats input.bam > stats.txt
Use plot-bamstats to visualize these statistics.
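The stats report is a plain-text file made of several sections, each identified by a short prefix in the first column; the headline figures sit on the lines that start with SN (summary numbers). A minimal sketch for pulling out just that section, assuming the report was saved as stats.txt as above (the helper name summary_numbers is hypothetical):

```shell
# summary_numbers is a hypothetical helper: it keeps the "SN" lines of a
# samtools stats report and drops the leading "SN" column.
summary_numbers() {
    grep '^SN' "$1" | cut -f 2-
}

# Usage: summary_numbers stats.txt
```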
Command | Description | Syntax |
flags | explain BAM flags | |
head | header viewer | |
tview | text alignment viewer | |
view | SAM<->BAM<->CRAM conversion | |
depad | convert padded BAM to unpadded BAM | |
samples | list the samples in a set of SAM/BAM/CRAM files | |
Convert SAM to BAM:
samtools view -b -o output.bam input.sam
Convert BAM to CRAM (requires indexed reference genome):
samtools view -C -T reference.fa -o output.cram input.bam
View specific region of a BAM file:
samtools view input.bam chr1:1000000-2000000
Filter for mapped reads and convert to SAM:
samtools view -F 4 -h -o mapped_reads.sam input.bam
The -F 4 flag filters out unmapped reads. Run samtools flags for more info.
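The FLAG field is a bit mask, and -F drops every read that has any of the given bits set. Bit 4 (0x4) means "read unmapped", so -F 4 keeps only reads where that bit is clear. The sketch below reproduces that test with shell arithmetic; the helper name is_unmapped is hypothetical and the flag values are just common examples.

```shell
# is_unmapped is a hypothetical helper: it succeeds when bit 0x4
# (read unmapped) is set in a SAM FLAG value.
is_unmapped() {
    [ $(( $1 & 4 )) -ne 0 ]
}

for flag in 0 4 77 99; do
    if is_unmapped "$flag"; then
        echo "$flag: unmapped, removed by -F 4"
    else
        echo "$flag: mapped, kept by -F 4"
    fi
done
```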
Working with CRAM Files
CRAM is a reference-based alternative to BAM that typically produces noticeably smaller files.
Converting BAM to CRAM:
samtools view -C -T reference.fa -o output.cram input.bam
Important Considerations for CRAM
- Always keep your reference genome available.
- Use the exact same reference for creating and reading CRAM files.
- Set the REF_PATH environment variable to help samtools locate reference sequences:
export REF_PATH="/path/to/references/%2s/%2s/%s:"
This allows for local and EBI-hosted reference lookup.
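The %2s/%2s/%s part of the template describes a directory layout keyed by each reference sequence's MD5 checksum (the M5 value on its @SQ header line): each %2s consumes two characters of the checksum and %s takes the remainder. A minimal sketch of that expansion, assuming this behaviour; the helper name ref_cache_path is hypothetical and the checksum is an arbitrary placeholder, not a real reference.

```shell
# ref_cache_path is a hypothetical helper that imitates, for illustration
# only, how a template ending in %2s/%2s/%s is filled in from an MD5 string.
ref_cache_path() {
    md5=$2
    first=$(printf '%s' "$md5" | cut -c1-2)    # characters 1-2
    second=$(printf '%s' "$md5" | cut -c3-4)   # characters 3-4
    rest=$(printf '%s' "$md5" | cut -c5-)      # everything after that
    printf '%s/%s/%s/%s\n' "$1" "$first" "$second" "$rest"
}

# Arbitrary placeholder checksum:
ref_cache_path /path/to/references 0123456789abcdef0123456789abcdef
# prints /path/to/references/01/23/456789abcdef0123456789abcdef
```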
Best Practices and Tips
Always work with sorted and indexed BAM/CRAM files for efficiency.
Use multithreading when available:
samtools sort -@ 4 -o sorted.bam input.bam
When working with large files, raise the per-thread memory limit with -m so that fewer temporary files are written to disk:
samtools sort -m 4G -o sorted.bam input.bam
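The -m value of samtools sort applies to each sorting thread, so total memory use is roughly the thread count times -m. A small sketch for deriving -m from an overall budget; the helper name per_thread_mem_mb and the 16 GiB budget are assumptions chosen for illustration.

```shell
# per_thread_mem_mb is a hypothetical helper: it divides a total memory
# budget (in megabytes) by the number of sorting threads (integer division).
per_thread_mem_mb() {
    echo $(( $1 / $2 ))
}

mem=$(per_thread_mem_mb 16384 4)   # 16384 MB shared by 4 threads -> 4096
# Print the resulting command rather than running it:
echo "samtools sort -@ 4 -m ${mem}M -o sorted.bam input.bam"
```

Leaving some headroom below the machine's physical memory is sensible, since the limit is approximate.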
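For variant calling workflows, mark duplicates (the input must be coordinate-sorted and must already have been processed with samtools fixmate -m):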
samtools markdup input.bam output.bam
Use samtools faidx
to index your reference genome:
samtools faidx reference.fa
Downloaded reference genomes are often gzip-compressed, which samtools faidx cannot index; recompress them with bgzip (or leave them uncompressed):
# If you have a gzipped file:
gunzip reference.fa.gz
bgzip reference.fa
samtools faidx reference.fa.gz
For large-scale analyses, consider using CRAM format to save storage space.
Practical Workflow Example
Here's a typical workflow for processing a new sequencing run:
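# Align with BWA-MEM (reference.fa must already be indexed with bwa index).
# fixmate -m adds the mate score tags that markdup needs; it can read bwa mem
# output directly because that output is grouped by read name.
bwa mem -t 8 reference.fa read1.fq read2.fq | samtools fixmate -m - raw_aligned.bam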
# Sort the BAM file
samtools sort -@ 8 -o sorted.bam raw_aligned.bam
# Mark duplicates
samtools markdup -@ 8 sorted.bam marked_duplicates.bam
# Index the final BAM
samtools index marked_duplicates.bam
# Generate alignment statistics
samtools flagstat marked_duplicates.bam > alignment_stats.txt
samtools idxstats marked_duplicates.bam > chromosome_stats.txt
# Convert to CRAM for storage
samtools view -C -T reference.fa -o final_output.cram marked_duplicates.bam
Created by Ryan D. Najac for the Palomero Lab at the Institute for Cancer Genetics. Page last updated on 2025-02-04.