Samtools

Samtools is a suite of programs for manipulating high-throughput sequencing data consisting of:

  • Samtools: For handling SAM/BAM/CRAM formats
  • BCFtools: Variant calling (VCF/BCF formats)
  • HTSlib: The underlying C library

Tip

Read the official samtools documentation here

samtools Commands

Indexing

Command Description Syntax
dict create a sequence dictionary file
faidx index/extract FASTA
fqidx index/extract FASTQ
index index alignment

Editing

Command Description Syntax
calmd recalculate MD/NM tags and '=' bases
fixmate fix mate information
reheader replace BAM header
targetcut cut fosmid regions (for fosmid pool only)
addreplacerg adds or replaces RG tags
markdup mark duplicates samtools markdup in.algnsorted.bam out.bam
ampliconclip clip oligos from the end of reads

Handling Duplicate Reads

Let's assume we already have sorted, marked duplicates, and indexed BAM files.

for bamfile in ./*.bam; do
# note that in this case, files are named like `DNMT3A_sorted_markdup.bam`
    samtools markdup @$(nproc) -r -s $bamfile ${bamfile%markdup.bam}dedup.bam
done

File operations

Command Description Syntax
collate shuffle and group alignments by name
cat concatenate BAMs
consensus produce a consensus Pileup/FASTA/FASTQ
merge merge sorted alignments
mpileup multi-way pileup
sort sort alignment file
split splits a file by read group
quickcheck quickly check if SAM/BAM/CRAM file appears intact
fastq converts a BAM to a FASTQ
fasta converts a BAM to a FASTA
import Converts FASTA or FASTQ files to SAM/BAM/CRAM
reference Generates a reference from aligned data
reset Reverts aligner changes in reads

Sorting and Indexing

samtools sort: Sort alignments by leftmost coordinates

Sort a BAM file:

samtools sort -o sorted_output.bam input.bam

Sort by read names (useful for some tools):

samtools sort -n -o name_sorted.bam input.bam

samtools index: Index a sorted BAM/CRAM file

samtools index sorted_output.bam

Tip

Always index your sorted BAM files. Many tools require indexed BAMs for efficient access.

Merging and Splitting

samtools merge: Combine multiple BAM files

samtools merge output.bam input1.bam input2.bam input3.bam

Ensure all input BAMs are sorted in the same way (coordinate or query name).

samtools split: Split a BAM file by read group

samtools split -u unassigned.bam -f '%*_%!.bam' input.bam

This is useful when you have multiple samples in one BAM file.

Statistics

Command Description Syntax
bedcov read depth per BED region
coverage alignment depth and percent coverage
depth compute the depth
flagstat simple stats
idxstats BAM index stats
cram-size list CRAM Content-ID and Data-Series sizes
phase phase heterozygotes
stats generate stats (former bamcheck)
ampliconstats generate amplicon specific stats

samtools flagstat: Generate simple alignment statistics

samtools flagstat input.bam

samtools idxstats: Report alignment summary statistics

samtools idxstats input.bam

samtools stats: Generate comprehensive statistics

samtools stats input.bam > stats.txt

Use plot-bamstats to visualize these statistics.

Viewing

Command Description Syntax
flags explain BAM flags
head header viewer
tview text alignment viewer
view SAM<->BAM<->CRAM conversion
depad convert padded BAM to unpadded BAM
samples list the samples in a set of SAM/BAM/CRAM files

Convert SAM to BAM:

samtools view -b -o output.bam input.sam

Convert BAM to CRAM (requires indexed reference genome):

samtools view -C -T reference.fa -o output.cram input.bam

View specific region of a BAM file:

samtools view input.bam chr1:1000000-2000000

Filter for mapped reads and convert to SAM:

samtools view -F 4 -h -o mapped_reads.sam input.bam

Note

The -F 4 flag filters out unmapped reads. Run samtools flags for more info.

Working with CRAM Files

CRAM is a compressed alternative to BAM, offering significant space savings.

Converting BAM to CRAM:

samtools view -C -T reference.fa -o output.cram input.bam

Important Considerations for CRAM

  • Always keep your reference genome available.
  • Use the exact same reference for creating and reading CRAM files.
  • Set the REF_PATH environment variable to help samtools locate reference sequences:
export REF_PATH="/path/to/references/%2s/%2s/%s:http://www.ebi.ac.uk/ena/cram/md5/%s"

This allows for local and EBI-hosted reference lookup.

Best Practices and Tips

Always work with sorted and indexed BAM/CRAM files for efficiency.

Use multithreading when available:

samtools sort -@ 4 -o sorted.bam input.bam

When working with large files, use the -c option to compress temporary files:

samtools sort -c -m 4G -o sorted.bam input.bam

For variant calling workflows, mark duplicates:

samtools markdup input.bam output.bam

Use samtools faidx to index your reference genome:

samtools faidx reference.fa

When downloading reference genomes, ensure they're bgzip-compressed:

# If you have a gzipped file:
gunzip reference.fa.gz
bgzip reference.fa
samtools faidx reference.fa.gz

For large-scale analyses, consider using CRAM format to save storage space.

Practical Workflow Example

Here's a typical workflow for processing a new sequencing run:

# Convert FASTQ to BAM (assuming you've already aligned with BWA)
bwa mem -t 8 reference.fa read1.fq read2.fq | samtools view -b -o raw_aligned.bam -

# Sort the BAM file
samtools sort -@ 8 -o sorted.bam raw_aligned.bam

# Mark duplicates
samtools markdup -@ 8 sorted.bam marked_duplicates.bam

# Index the final BAM
samtools index marked_duplicates.bam

# Generate alignment statistics
samtools flagstat marked_duplicates.bam > alignment_stats.txt
samtools idxstats marked_duplicates.bam > chromosome_stats.txt

# Convert to CRAM for storage
samtools view -C -T reference.fa -o final_output.cram marked_duplicates.bam

Example Pipelines

Created by Ryan D. Najac for the Palomero Lab at the Institute for Cancer Genetics.
Page last updated on 2025-02-14.