Samtools
Samtools is a suite of programs for manipulating high-throughput sequencing data. It underpins many bioinformatics workflows that process and analyze aligned DNA sequencing reads.
The Samtools project is distributed as three packages:
- Samtools: For handling SAM/BAM/CRAM formats
- BCFtools: For variant calling and manipulating VCF/BCF files
- HTSlib: A C library underlying samtools and bcftools
Key Samtools Commands
Viewing and Converting Files
samtools view
: Convert between formats and filter alignments
Convert SAM to BAM:
samtools view -b -o output.bam input.sam
Convert BAM to CRAM (requires indexed reference genome):
samtools view -C -T reference.fa -o output.cram input.bam
View specific region of a BAM file:
samtools view input.bam chr1:1000000-2000000
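Region strings use 1-based, inclusive coordinates, so both endpoints belong to the region. As a minimal sketch, the span of a region can be computed in the shell. The helper name region_length is hypothetical and not part of samtools, and the sketch assumes a plain name:start-end region with no commas in the numbers.

```shell
# region_length: print the number of positions covered by a name:start-end region.
# Hypothetical helper for illustration; coordinates are 1-based and inclusive.
region_length() {
    region=$1
    range=${region##*:}      # text after the last colon, e.g. 1000000-2000000
    start=${range%-*}        # part before the dash
    end=${range#*-}          # part after the dash
    echo $(( end - start + 1 ))
}

region_length chr1:1000000-2000000   # prints 1000001
```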
Filter for mapped reads and convert to SAM:
samtools view -F 4 -h -o mapped_reads.sam input.bam
The -F 4 flag filters out unmapped reads. Use samtools flags to understand SAM flags.
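A SAM flag is a sum of bit values, and -F 4 excludes any read whose flag has the 4 (unmapped) bit set. The sketch below shows the bitwise test behind that filter. The function name decode_flag is hypothetical; the bit values and names are the ones defined in the SAM specification and printed by samtools flags.

```shell
# decode_flag: print the names of the SAM flag bits set in an integer flag.
# Hypothetical helper for illustration; samtools flags does this for real.
decode_flag() {
    flag=$1
    names=""
    for pair in 1:PAIRED 2:PROPER_PAIR 4:UNMAP 8:MUNMAP 16:REVERSE 32:MREVERSE \
                64:READ1 128:READ2 256:SECONDARY 512:QCFAIL 1024:DUP 2048:SUPPLEMENTARY; do
        bit=${pair%%:*}      # numeric bit value
        name=${pair#*:}      # symbolic name
        if [ $(( flag & bit )) -ne 0 ]; then
            names="${names:+$names,}$name"
        fi
    done
    printf '%s\n' "$names"
}

decode_flag 4     # prints UNMAP
decode_flag 99    # prints PAIRED,PROPER_PAIR,MREVERSE,READ1
```

Because the test is bitwise, -F 4 drops a read with flag 77 (which includes the 4 bit) but keeps a read with flag 99.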
Sorting and Indexing
samtools sort
: Sort alignments by leftmost coordinates
Sort a BAM file:
samtools sort -o sorted_output.bam input.bam
Sort by read names (useful for some tools):
samtools sort -n -o name_sorted.bam input.bam
samtools index
: Index a sorted BAM/CRAM file
samtools index sorted_output.bam
Always index your coordinate-sorted BAM files. Many tools require an index to retrieve reads from a region without scanning the whole file. Name-sorted files cannot be indexed.
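Before handing a file to a tool that needs random access, it can help to confirm that an index sits next to it. The following is a minimal sketch: the helper name has_index is hypothetical, and it only checks the common index file names (.bai, .csi, .crai) without verifying that the index is up to date.

```shell
# has_index: succeed if a BAM/CRAM index file exists alongside the given file.
# Hypothetical helper; it checks file names only, not index validity.
has_index() {
    aln=$1
    for idx in "$aln.bai" "$aln.csi" "$aln.crai" "${aln%.bam}.bai"; do
        if [ -f "$idx" ]; then
            return 0
        fi
    done
    return 1
}

# Example use (requires samtools):
#   has_index sorted_output.bam || samtools index sorted_output.bam
```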
Merging and Splitting
samtools merge
: Combine multiple BAM files
samtools merge output.bam input1.bam input2.bam input3.bam
Ensure all input BAMs are sorted in the same way (coordinate or query name). By default, merge expects coordinate-sorted input; add -n when the inputs are sorted by read name.
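The sort order of a file is recorded in the SO field of the @HD header line, so it can be checked before merging. This is a minimal sketch that reads header text on standard input; the function name get_sort_order is hypothetical, and a file whose header lacks an SO field is reported as unknown.

```shell
# get_sort_order: read SAM header lines on stdin and print the @HD SO value.
# Hypothetical helper; prints "unknown" when no SO field is present.
get_sort_order() {
    awk -F '\t' '
        $1 == "@HD" {
            for (i = 2; i <= NF; i++)
                if ($i ~ /^SO:/) { print substr($i, 4); found = 1 }
        }
        END { if (!found) print "unknown" }'
}

# Example use (requires samtools):
#   samtools view -H input1.bam | get_sort_order
```

Running it on each input and comparing the results gives a quick check that every file reports the same order, such as coordinate or queryname.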
samtools split
: Split a BAM file by read group
samtools split -u unassigned.bam -f '%*_%!.bam' input.bam
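This is useful when one BAM file contains several read groups, for example multiple samples or sequencing lanes. In the -f format string, %* expands to the input file's basename and %! to the read group ID; -u collects reads that have no recognized read group.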
Statistics and Quality Control
samtools flagstat
: Generate simple alignment statistics
samtools flagstat input.bam
samtools idxstats
: Report, for each reference sequence, its length and the number of mapped and unmapped reads
samtools idxstats input.bam
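The idxstats output is tab-delimited with four columns: reference name, sequence length, mapped read count, and unmapped read count, plus a final * line for reads with no assigned reference. As a rough sketch, the share of mapped reads per reference can be derived with awk. The function name summarize_idxstats is hypothetical.

```shell
# summarize_idxstats: read idxstats-style TSV on stdin and print, per reference,
# the mapped count and its percentage of all mapped reads.
# Hypothetical helper; the "*" line (reads with no reference) is skipped.
summarize_idxstats() {
    awk -F '\t' '
        $1 != "*" { mapped[$1] = $3; order[++n] = $1; total += $3 }
        END {
            for (i = 1; i <= n; i++) {
                name = order[i]
                pct = (total > 0) ? 100 * mapped[name] / total : 0
                printf "%s\t%d\t%.1f\n", name, mapped[name], pct
            }
        }'
}

# Example use (requires samtools):
#   samtools idxstats input.bam | summarize_idxstats
```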
samtools stats
: Generate comprehensive statistics
samtools stats input.bam > stats.txt
Use plot-bamstats (a script distributed with samtools) to visualize these statistics.
Working with CRAM Files
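CRAM is a reference-based compressed alternative to BAM. Because it stores reads as differences from the reference genome, it typically produces noticeably smaller files than BAM.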
Converting BAM to CRAM:
samtools view -C -T reference.fa -o output.cram input.bam
Important Considerations for CRAM
- Always keep your reference genome available.
- Use the exact same reference for creating and reading CRAM files.
- Set the REF_PATH environment variable to help samtools locate reference sequences:
export REF_PATH="/path/to/references/%2s/%2s/%s:http://www.ebi.ac.uk/ena/cram/md5/%s"
Entries are separated by colons and tried in order, so this setting checks the local directory first and falls back to the EBI-hosted lookup.
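In the local part of REF_PATH, each %2s consumes two characters of a reference sequence's MD5 checksum and the final %s consumes the rest, which spreads the cached sequences over nested subdirectories. The sketch below illustrates that expansion for the %2s/%2s/%s pattern only. The function name md5_to_cache_path is hypothetical, and the digest in the example is a made-up placeholder rather than a real reference checksum.

```shell
# md5_to_cache_path: show how the pattern %2s/%2s/%s splits an MD5 digest.
# Hypothetical helper for illustration; samtools performs this expansion itself.
md5_to_cache_path() {
    md5=$1
    first=$(printf '%s' "$md5" | cut -c1-2)    # consumed by the first %2s
    second=$(printf '%s' "$md5" | cut -c3-4)   # consumed by the second %2s
    rest=$(printf '%s' "$md5" | cut -c5-)      # consumed by the final %s
    printf '%s/%s/%s\n' "$first" "$second" "$rest"
}

# Placeholder digest, not a real reference checksum:
md5_to_cache_path 0123456789abcdef0123456789abcdef
# prints 01/23/456789abcdef0123456789abcdef
```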
Best Practices and Tips
Always work with sorted and indexed BAM/CRAM files for efficiency.
Use multithreading when available:
samtools sort -@ 4 -o sorted.bam input.bam
When working with large files, cap the memory used per thread with -m and use -T to write temporary files to a disk with enough free space:
samtools sort -m 4G -T /path/to/tmp/sort_tmp -o sorted.bam input.bam
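For variant calling workflows, mark duplicates. samtools markdup expects coordinate-sorted input whose mate tags were added beforehand with samtools fixmate -m: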
samtools markdup input.bam output.bam
Use samtools faidx to index your reference genome:
samtools faidx reference.fa
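samtools faidx can index a plain FASTA file or a bgzip-compressed one, but not an ordinary gzip file. If a downloaded reference is gzip-compressed, recompress it with bgzip: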
# If you have a gzipped file:
gunzip reference.fa.gz
bgzip reference.fa
samtools faidx reference.fa.gz
For large-scale analyses, consider using CRAM format to save storage space.
Practical Workflow Example
Here's a typical workflow for processing a new sequencing run:
# Align with BWA (assumes reference.fa is already indexed with bwa index) and convert the output to BAM
bwa mem -t 8 reference.fa read1.fq read2.fq | samtools view -b -o raw_aligned.bam -
# Add the mate tags that markdup needs (bwa mem output is already grouped by read name)
samtools fixmate -m raw_aligned.bam fixmate.bam
# Sort the BAM file by coordinate
samtools sort -@ 8 -o sorted.bam fixmate.bam
# Mark duplicates
samtools markdup -@ 8 sorted.bam marked_duplicates.bam
# Index the final BAM
samtools index marked_duplicates.bam
# Generate alignment statistics
samtools flagstat marked_duplicates.bam > alignment_stats.txt
samtools idxstats marked_duplicates.bam > chromosome_stats.txt
# Convert to CRAM for storage
samtools view -C -T reference.fa -o final_output.cram marked_duplicates.bam
Troubleshooting
- If you get "file not found" errors, check if your BAM files are where you expect and have read permissions.
- For "invalid file format" errors, ensure your input files are properly formatted and not corrupted.
- If samtools seems slow, check your disk I/O and consider using an SSD for temporary files.
- For memory errors in sort, lower the -m option so that each thread uses less memory, or reduce the number of threads with -@.
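Because -m applies per thread, the memory used by sort grows roughly with the thread count. One way to pick a value is to divide a total budget by the number of threads, as in this minimal sketch. The function name sort_mem_per_thread is hypothetical, and the result is only an approximation, since samtools needs some memory beyond what -m covers.

```shell
# sort_mem_per_thread: split a total memory budget (in MiB) across threads and
# print a value suitable for samtools sort -m.
# Hypothetical helper; leave headroom, because actual usage can exceed this.
sort_mem_per_thread() {
    total_mb=$1
    threads=$2
    if [ "$threads" -lt 1 ]; then threads=1; fi
    echo "$(( total_mb / threads ))M"
}

sort_mem_per_thread 16384 8    # prints 2048M

# Example use (requires samtools):
#   samtools sort -@ 8 -m "$(sort_mem_per_thread 16384 8)" -o sorted.bam input.bam
```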
Remember to consult the official samtools documentation for detailed information on each command and its options. Regular practice and experimentation will help you become proficient with these powerful tools.
Created by Ryan D. Najac for the Palomero Lab at the Institute for Cancer Genetics. Page last updated on 2024-10-17.