Sequence Alignment/Map (SAM) Format
Version 0.1.2-draft (20090820)
This specification aims to define a generic nucleotide alignment format, SAM, that describes the alignment of query
sequences or sequencing reads to a reference sequence or assembly, and:
• Is flexible enough to store all the alignment information generated by various alignment programs;
• Is simple enough to be easily generated by alignment programs or converted from existing alignment formats;
• Is compact in file size;
• Allows most operations on the alignment to work on a stream without loading the whole alignment into memory;
• Allows the file to be indexed by genomic position to efficiently retrieve all reads aligning to a locus.
The document also describes the format of the binary equivalent to SAM and the format of alignment index.
This document specifies the formats of the text and binary alignment files and describes the indexing algorithm and the
format of index files. It does not specify any application programming interfaces (APIs) or language bindings.
Reference sequence. An existing sequence, typically from previous studies. A reference sequence can be, but is not
restricted to, a chromosome, a supercontig/scaffold or a contig from de novo assembly.
Query sequence. A sequence that is aligned to the reference sequences. A query sequence can be, but is not restricted
to, a sequencing read, a cDNA or a contig. Typically, a query sequence is shorter than a target sequence.
Alignment. An alignment record describes a relationship between one query and one reference sequence. Insertions
and deletions are allowed on either sequence. A query or a target sequence can be present in more than one alignment
1.4.1. Text vs. binary format
SAM is a TAB-delimited text format. It is easy to understand, easy to parse, easy to generate and easy to check for
errors. However, parsing text is comparatively slow. We therefore introduce a binary equivalent to SAM, called BAM, for
intensive data processing. We envision that BAM will be used in most production pipelines, but that SAM, which is
simpler to parse and can be produced by streaming from BAM, may be useful for interconversion with external
applications and for exploratory analyses.
1.4.2. Flexibility: storing optional fields
Different alignment programs may produce different information which may be useful to the downstream analyses. A
generic alignment format should allow for such information to be stored conveniently. In SAM, each alignment must
contain a fixed number of mandatory fields that describe the key information about the alignment (such as coordinate,
detailed alignment and sequences) and may contain a variable number of optional fields which are less important or
1.4.3. Flexibility: storing various types of alignments
SAM Format Specification 0.1.2-draft (20090820)
SAM is able to store clipped alignments, spliced alignments, multi-part alignments, padded alignments and alignments in
color space. The extended CIGAR string is the key to describing these types of alignments.
Clipped alignment. In Smith-Waterman alignment, a sequence may not be aligned from the first residue to the last one.
Subsequences at the ends may be clipped off. We introduce operation S to describe (softly) clipped alignment. Here is
an example. Suppose the clipped alignment is:
where on the read sequence, bases in uppercase are matches and bases in lowercase are clipped off. The CIGAR for
this alignment is: 3S8M1D6M4S.
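As a rough illustration of how this CIGAR relates to the read, the following Python sketch parses a CIGAR string and rebuilds the read row of the alignment display, with soft-clipped bases in lowercase and a "-" for each deleted reference base. The helper names parse_cigar and render_read are hypothetical and not part of any SAM library, and only the operations M, I, D and S are handled.

```python
import re

def parse_cigar(cigar):
    """Split a CIGAR string such as '3S8M1D6M4S' into (length, op) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([A-Z])", cigar)]

def render_read(seq, cigar):
    """Rebuild the read row of an alignment display from SEQ and CIGAR."""
    out, pos = [], 0
    for length, op in parse_cigar(cigar):
        if op == "S":            # soft clip: bases stay in SEQ, shown in lowercase
            out.append(seq[pos:pos + length].lower())
            pos += length
        elif op in ("M", "I"):   # aligned or inserted bases consume the read
            out.append(seq[pos:pos + length])
            pos += length
        elif op == "D":          # deletion: reference bases missing from the read
            out.append("-" * length)
        else:
            raise ValueError(f"operation {op!r} is not handled in this sketch")
    return "".join(out)

print(render_read("GGGGTGTAACCGACTAGGGGG", "3S8M1D6M4S"))
# prints: gggGTGTAACC-GACTAGgggg
```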
Spliced alignment. In cDNA-to-genome alignment, we may want to distinguish introns from deletions in exons. We
introduce operation N to represent a long skip on the reference sequence. Suppose the spliced alignment is:
where ... on the read sequence indicates the intron. The CIGAR for this alignment is: 9M32N8M.
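To show what the N operation means for coordinates, here is a minimal Python sketch that walks a CIGAR string and lists the reference intervals covered by each aligned block, treating N as a gap between blocks. The function name exon_blocks, the start coordinate 100 and the use of half-open intervals are assumptions made for illustration only.

```python
import re

def exon_blocks(pos, cigar):
    """Return half-open [start, end) reference intervals separated by N skips."""
    blocks = []
    start = end = pos
    for n, op in re.findall(r"(\d+)([MIDNSHP])", cigar):
        n = int(n)
        if op in ("M", "D"):      # both advance along the reference inside a block
            end += n
        elif op == "N":           # long skip: close the current block, jump ahead
            if end > start:
                blocks.append((start, end))
            start = end = end + n
        # I, S, H and P do not advance along the reference
    if end > start:
        blocks.append((start, end))
    return blocks

print(exon_blocks(100, "9M32N8M"))
# prints: [(100, 109), (141, 149)]
```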
Multi-part alignment. One query sequence may be aligned to multiple places on the reference genome, either with or
without overlaps. In SAM, we keep multiple hits as multiple alignment records. To avoid presenting the full query
sequence multiple times for non-overlapping hits, we introduce operation H to describe a hard-clipped alignment. Hard
clipping (H) is similar to soft clipping (S); the difference is that a hard-clipped subsequence is not present in the
alignment record. The example alignment in "clipped alignment" can also be represented with CIGAR: 3H8M1D6M4H, but
in this case the sequence stored in SAM is "GTGTAACCGACTAG", instead of the "GGGGTGTAACCGACTAGGGGG" stored when soft
clipping is in use.
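The relationship between the two representations can be sketched in Python as follows. The function hard_clip is a hypothetical helper that turns the soft clips at the ends of a record into hard clips and drops the clipped bases from the stored sequence. It assumes the input CIGAR contains no H operations already.

```python
import re

def hard_clip(seq, cigar):
    """Convert terminal soft clips (S) to hard clips (H) and trim SEQ to match."""
    ops = [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP])", cigar)]
    # Number of bases to drop at each end (zero when the end is not soft-clipped).
    left = ops[0][0] if ops[0][1] == "S" else 0
    right = ops[-1][0] if len(ops) > 1 and ops[-1][1] == "S" else 0
    kept = seq[left:len(seq) - right]
    new_cigar = "".join(f"{n}{'H' if op == 'S' else op}" for n, op in ops)
    return kept, new_cigar

print(hard_clip("GGGGTGTAACCGACTAGGGGG", "3S8M1D6M4S"))
# prints: ('GTGTAACCGACTAG', '3H8M1D6M4H')
```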
Padded alignment. Most sequence aligners only report the sequences inserted into the reference genome, but do not
present how these inserted sequences are aligned against each other. An alignment in which the inserted sequences are
fully aligned is called a padded alignment. Padded alignment is always produced by de novo assemblers and is important
for an alignment viewer to display the alignment properly. To store padded alignment, we introduce operation P, which
can be considered a silent deletion from the padded reference sequence. In the following example, GA on READ1 and A on
READ2 are inserted into the reference. With unpadded CIGAR, we would not be able to distinguish the following padded
alignments:
REF:   CACGATCA**GACCGATACGTCCGA      REF:   CACGATCA**GACCGATACGTCCGA
READ1:   CGATCAGAGACCGATA             READ1:   CGATCAGAGACCGATA
READ2:     ATCA*AGACCGATAC            READ2:     ATCAA*GACCGATAC
READ3:    GATCA**GACCG                READ3:    GATCA**GACCG
The padded CIGARs differ:
READ1: 6M2I8M                         READ1: 6M2I8M
READ2: 4M1P1I9M                       READ2: 4M1I1P9M
READ3: 5M2P5M                         READ3: 5M2P5M
Note that it is hard to convert an unpadded CIGAR to a padded one: fully resolving the alignment between inserted
sequences would essentially require a de novo assembler. The reverse conversion is easy, however. By simply removing
all P operations we get the CIGAR without padding.
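The padded-to-unpadded conversion can be sketched in a few lines of Python. The function name unpad_cigar is hypothetical. Merging neighbouring operations of the same type after the P operations are dropped is an extra tidying step chosen for this sketch, not something the text requires.

```python
import re

def unpad_cigar(cigar):
    """Drop all P operations and merge neighbouring runs of the same operation."""
    merged = []
    for n, op in re.findall(r"(\d+)([MIDNSHP])", cigar):
        if op == "P":                       # padding carries no unpadded information
            continue
        if merged and merged[-1][1] == op:  # e.g. 5M followed by 5M becomes 10M
            merged[-1][0] += int(n)
        else:
            merged.append([int(n), op])
    return "".join(f"{n}{op}" for n, op in merged)

for padded in ("4M1P1I9M", "4M1I1P9M", "5M2P5M"):
    print(padded, "->", unpad_cigar(padded))
```

Both padded CIGARs for READ2 collapse to the same unpadded string, which is why the two alignments cannot be told apart without padding.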
Alignments in color space. Color alignments are stored as normal nucleotide alignments with additional tags
describing the raw color sequences, qualities and color-specific properties.
1.4.4. Storing paired-end reads
A mapped read pair is stored in two (or more if multiple hits are stored) separate alignment records. The two reads in the
pair have identical read pair names and are distinguished by their flag field (Section 2.2.2). Storing the mate coordinate
and the inferred insert size is recommended but not required. A tool is also provided to reconstruct mating
information from BAM, although this is done at the cost of intensive computation and large disk space.
If in a read pair one read is mapped but the other is not, the unmapped read can be absent from the alignment file or
may be stored in one of two optional ways. The first method is to record no coordinate for the unmapped read (i.e.
reference name = "*"). When using this method, flag bit 0x08 must be set on the mapped mate. The second method is to give the
unmapped read a coordinate for sorting/indexing purposes only (this is generally the coordinate of the mapped mate).
When using the second method, flag bit 0x4 must be set on the unmapped read, flag bit 0x08 must be set on the
1.4.5. File compression and random access in a compressed file
Typically, gzip/zlib compression reduces the size of a BAM file by nearly a factor of four (to ~27%). Because this
saving is substantial, we always compress a BAM file with the BGZF library,
developed by Bob Handsaker. BGZF is a stand-alone library that achieves a compression ratio similar to gzip/zlib while
supporting random access using virtual file offsets. A file compressed with BGZF is also gzip/zlib compatible, in that
gzip/zlib can decompress it, although random file access is not supported in this case.
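The idea behind block-wise compression with random access can be sketched in Python using only the standard gzip and zlib modules. This is not the BGZF library itself. It assumes that the data is stored as a series of independently compressed gzip blocks and that a virtual offset packs a block's start position together with an offset inside the decompressed block. The 48/16-bit split used below follows published BGZF descriptions and is not stated in this section. All function names are hypothetical.

```python
import gzip
import zlib

def write_blocks(chunks):
    """Compress each chunk as its own gzip member; return (data, block_starts)."""
    out, starts = bytearray(), []
    for chunk in chunks:
        starts.append(len(out))
        out += gzip.compress(chunk)
    return bytes(out), starts

def read_block(data, start):
    """Decompress only the gzip member that begins at byte offset `start`."""
    d = zlib.decompressobj(wbits=31)   # 31 tells zlib to expect a gzip wrapper
    return d.decompress(data[start:])  # stops at the end of that member

def virtual_offset(block_start, within_block):
    """Pack a compressed block start and an in-block offset into one integer."""
    return (block_start << 16) | within_block

def split_virtual_offset(voffset):
    return voffset >> 16, voffset & 0xFFFF

data, starts = write_blocks([b"read1\n", b"read2\n", b"read3\n"])
print(gzip.decompress(data))          # plain gzip reads the whole stream
print(read_block(data, starts[1]))    # random access to the second block only
```

Reading the whole stream with gzip.decompress mirrors the gzip/zlib compatibility described above, while read_block shows why block boundaries make seeking possible.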
1.4.6. Ordering the alignments
A SAM/BAM file can be sorted by reference coordinate, sorted by query name, or left unsorted. However, most operations
on the alignments only work on a BAM sorted by the leftmost reference coordinate. Such an order is crucial to data
processing on a stream and to indexing. A command-line tool is provided to sort an unsorted BAM in the required order.
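Coordinate order can be illustrated on SAM text lines with a short Python sketch. It assumes the usual SAM column layout, with the reference name in the third TAB-separated column and the leftmost coordinate in the fourth, which this section does not spell out. The function name, the reference order and the toy records are made up for illustration, and reads without a reference ("*") are placed last by choice.

```python
def sort_by_coordinate(lines, ref_order):
    """Sort SAM text lines by reference order, then by leftmost coordinate."""
    rank = {name: i for i, name in enumerate(ref_order)}

    def key(line):
        fields = line.rstrip("\n").split("\t")
        # Unknown references (including "*") sort after all listed references.
        return (rank.get(fields[2], len(rank)), int(fields[3]))

    return sorted(lines, key=key)

# Toy records, not real data.
records = [
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
    "r1\t0\tchr2\t100\t30\t4M\t*\t0\t0\tACGT\tIIII",
    "r2\t0\tchr1\t500\t30\t4M\t*\t0\t0\tACGT\tIIII",
    "r3\t0\tchr1\t20\t30\t4M\t*\t0\t0\tACGT\tIIII",
]
for line in sort_by_coordinate(records, ["chr1", "chr2"]):
    print(line.split("\t")[0])
```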
1.4.7. Indexing alignments
Indexing paves the way for quick retrieval of alignments overlapping a specified region. Because BAM must work
with spliced alignments, indexing must be efficient for alignments spanning long distances on the reference genome. A
binning index, as used in the UCSC Genome Browser, suits this goal better than a linear index alone. The binning index
is further improved by being coupled with a simple linear index. For short read alignments, one seek call is needed in
most cases to retrieve alignments.
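A binning scheme of this kind can be sketched in Python as below. The bin sizes (16 kbp at the finest level, growing eightfold per level up to 512 Mbp) and the bin numbering follow the binning function given in published versions of the SAM specification. This section does not state them, so treat them as an assumption. Coordinates are taken to be zero-based and half-open.

```python
def reg2bin(beg, end):
    """Return the smallest bin that fully contains the interval [beg, end)."""
    end -= 1
    if beg >> 14 == end >> 14:      # fits in one 16 kbp bin
        return 4681 + (beg >> 14)
    if beg >> 17 == end >> 17:      # fits in one 128 kbp bin
        return 585 + (beg >> 17)
    if beg >> 20 == end >> 20:      # fits in one 1 Mbp bin
        return 73 + (beg >> 20)
    if beg >> 23 == end >> 23:      # fits in one 8 Mbp bin
        return 9 + (beg >> 23)
    if beg >> 26 == end >> 26:      # fits in one 64 Mbp bin
        return 1 + (beg >> 26)
    return 0                        # falls back to the single 512 Mbp bin

print(reg2bin(1000, 1050))          # short read: finest level
print(reg2bin(10000, 200000))       # long spliced alignment: coarser level
```

A short read almost always lands in a finest-level bin, while an alignment spanning a long intron is filed in a larger bin instead of being split across many small ones.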
1.5. Format implementation
SAM/BAM is implemented in two forms: a development library and a command-line tool. The library provides developers
with basic I/O on SAM/BAM as well as routines for manipulating alignments, such as merging, sorting, indexing and
viewing. The command-line tool is built upon the library and is more convenient for non-developers. However, describing
implementation details is outside the scope of this document.
The major contributors to this specification are (in no particular order):
• Heng Li (Sanger Institute)
• Bob Handsaker (Broad Institute)
• Jue Ruan (Beijing Genomics Institute)
• Richard Durbin (Sanger Institute)
• Gabor Marth (Boston College)
• Michael Stromberg (Boston College)
• Fiona Hyland (Applied Biosystems)
• Goncalo Abecasis (University of Michigan)
• Richa Agarwala (NCBI)
Author: Heng Li
CreationDate: Thu Aug 20 12:07:07 2009