.pairsam format

specification

pairsamtools defines .pairsam, a simple tabular format for storing information on ligation junctions detected in the sequences of DNA molecules generated by Hi-C experiments.

.pairsam is a valid extension of the .pairs format and is fully compliant with its specification, defined by the 4DN Consortium.

A pairsam file starts with an arbitrary number of header lines, each starting with a “#” character. .pairsam headers contain all information mandated by the .pairs format. Additionally, the .pairsam format stores the header of the .sam file that it was generated from. When multiple .pairsam files get merged, the stored .sam headers get checked for consistency and merged. Each pairsamtool applied to a .pairsam file adds a brief record to the stored .sam header.
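As a minimal sketch of the layout described above, the leading “#” lines can be separated from the body as follows. The function name `split_header` and the sample lines are hypothetical and are not part of pairsamtools.

```python
from typing import Iterable, List, Tuple


def split_header(lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split pairsam text lines into (header_lines, body_lines).

    Header lines are the leading lines that start with '#'; everything
    from the first non-'#' line onwards is treated as the body.
    """
    header: List[str] = []
    body: List[str] = []
    in_header = True
    for line in lines:
        line = line.rstrip("\n")
        if in_header and line.startswith("#"):
            header.append(line)
        else:
            in_header = False
            body.append(line)
    return header, body


# Hypothetical sample content, for illustration only.
sample = [
    "## pairs format v1.0\n",
    "#chromosomes: chr1 chr2\n",
    "read1\tchr1\t100\tchr2\t200\t+\t-\tUU\tsam1\tsam2\n",
]
header, body = split_header(sample)
```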

The body of a pairsam file contains a table with a variable number of fields separated by a “\t” character (a horizontal tab):

| index | name | description |
|-------|------|-------------|
| 1 | read_id | the ID of the read as defined in fastq files |
| 2 | chrom1 | the chromosome of the alignment on side 1 |
| 3 | pos1 | the 1-based genomic position of the outer-most (5’) mapped bp on side 1 |
| 4 | chrom2 | the chromosome of the alignment on side 2 |
| 5 | pos2 | the 1-based genomic position of the outer-most (5’) mapped bp on side 2 |
| 6 | strand1 | the strand of the alignment on side 1 |
| 7 | strand2 | the strand of the alignment on side 2 |
| 8 | pair_type | the type of a Hi-C pair |
| 9 | sam1 | the sam alignment(s) on side 1; supplemental alignments are separated by NEXT_SAM |
| 10 | sam2 | the sam alignment(s) on side 2; supplemental alignments are separated by NEXT_SAM |
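The column layout above can be read with a few lines of Python. This is a sketch only: `PairsamRecord` and `parse_pairsam_line` are hypothetical names, and any fields beyond the tenth are ignored here because the text does not define them.

```python
from typing import NamedTuple


class PairsamRecord(NamedTuple):
    """The ten documented columns of one pairsam body line."""
    read_id: str
    chrom1: str
    pos1: int
    chrom2: str
    pos2: int
    strand1: str
    strand2: str
    pair_type: str
    sam1: str
    sam2: str


def parse_pairsam_line(line: str) -> PairsamRecord:
    """Parse one tab-separated body line into a PairsamRecord."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 10:
        raise ValueError(f"expected at least 10 fields, got {len(fields)}")
    return PairsamRecord(
        read_id=fields[0],
        chrom1=fields[1],
        pos1=int(fields[2]),   # positions are 1-based integers (0 for null)
        chrom2=fields[3],
        pos2=int(fields[4]),
        strand1=fields[5],
        strand2=fields[6],
        pair_type=fields[7],
        sam1=fields[8],
        sam2=fields[9],
    )


# Hypothetical example line; the sam columns are placeholders.
rec = parse_pairsam_line("read1\tchr1\t100\tchr2\t200\t+\t-\tUU\tsamA\tsamB\n")
```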

Sides 1 and 2 as defined in a pairsam file do not correspond to side1 and side2 in the sequencing data! Instead, side 1 is defined as the side whose alignment has the lower sorting index (using the lexicographic order for chromosome names, followed by the numeric order for positions and the lexicographic order for pair types). This procedure is called upper-triangular flipping, or triu-flipping.
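A minimal sketch of triu-flipping, assuming plain string comparison of chromosome names followed by numeric comparison of positions, as the paragraph above describes. The function name `triu_flip` is hypothetical. A full implementation would also swap the sam columns and may adjust the pair type, which this sketch does not do.

```python
from typing import Tuple

Side = Tuple[str, int, str]  # (chrom, pos, strand)


def triu_flip(side_a: Side, side_b: Side) -> Tuple[Side, Side]:
    """Return (side1, side2) so that side1 has the lower sorting index.

    The sorting index is (chromosome name, position): names compare
    lexicographically, positions numerically. Strand is carried along
    but does not take part in the comparison.
    """
    key_a = (side_a[0], side_a[1])
    key_b = (side_b[0], side_b[1])
    if key_b < key_a:
        return side_b, side_a
    return side_a, side_b


# The chr1 side sorts before the chr2 side, so the two are swapped.
flipped = triu_flip(("chr2", 50, "+"), ("chr1", 900, "-"))

# A null alignment ('!', 0, '-') sorts before any named chromosome.
with_null = triu_flip(("chr1", 5, "+"), ("!", 0, "-"))
```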

The rows of the table are block-sorted: i.e. first lexicographically by chrom1 and chrom2, then numerically by pos1 and pos2, then lexicographically by pair_type.
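The sort order described above can be expressed as a tuple key. This is a sketch under the assumption that each row is a tuple with the first eight documented columns and that positions have already been converted to integers; `block_sort_key` is a hypothetical name.

```python
def block_sort_key(row):
    """Sort key: (chrom1, chrom2, pos1, pos2, pair_type).

    `row` holds the documented columns in order:
    read_id, chrom1, pos1, chrom2, pos2, strand1, strand2, pair_type, ...
    """
    _, chrom1, pos1, chrom2, pos2, _, _, pair_type = row[:8]
    return (chrom1, chrom2, pos1, pos2, pair_type)


# Hypothetical rows, for illustration only.
rows = [
    ("r1", "chr2", 10, "chr2", 50, "+", "-", "UU"),
    ("r2", "chr1", 300, "chr2", 5, "+", "+", "UU"),
    ("r3", "chr1", 20, "chr1", 40, "-", "+", "UU"),
    ("r4", "!", 0, "chr1", 7, "-", "+", "NU"),
    ("r5", "chr1", 20, "chr1", 40, "-", "+", "DD"),
]
sorted_ids = [row[0] for row in sorted(rows, key=block_sort_key)]
```

Comparing positions as integers matters: as strings, “300” would sort before “5”.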

Null/ambiguous/chimeric alignments are stored as chrom=“!”, pos=0, strand=“-”.

The fields of the sam records in columns 9 and 10 are separated by a UNIT SEPARATOR character (031) instead of the horizontal tab character, so that they do not affect the columns of the pairsam file.
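The sketch below shows one way to unpack a sam column back into individual alignments and their fields. Two things are assumptions rather than facts from this specification: the separator is taken to be the ASCII unit separator with decimal code 31 (`"\x1f"`), and the NEXT_SAM marker is treated as a literal token that may or may not be surrounded by separators. Both are exposed as parameters, so pass different values if your files differ. `split_sam_column` is a hypothetical name.

```python
from typing import List

ASSUMED_SAM_SEP = "\x1f"      # assumption: ASCII unit separator, decimal 31
ASSUMED_NEXT_SAM = "NEXT_SAM"  # marker between supplemental alignments


def split_sam_column(
    cell: str,
    field_sep: str = ASSUMED_SAM_SEP,
    next_sam: str = ASSUMED_NEXT_SAM,
) -> List[List[str]]:
    """Split one sam column into a list of alignments, each a list of fields."""
    alignments: List[List[str]] = []
    for chunk in cell.split(next_sam):
        # Drop separators left over on either side of the NEXT_SAM marker.
        chunk = chunk.strip(field_sep)
        if chunk:
            alignments.append(chunk.split(field_sep))
    return alignments


# Hypothetical, heavily shortened sam records for illustration.
sep = ASSUMED_SAM_SEP
cell = sep.join(["read1", "0", "chr1", "100"]) + sep + "NEXT_SAM" + sep + \
    sep.join(["read1", "2048", "chr2", "7"])
parsed = split_sam_column(cell)
```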

Notes on the motivation behind some of the technical decisions in the definition of pairsam:

- while the information in columns 1-8 may appear redundant to the sam alignments in columns 9+, extracting this information is non-trivial and thus is better done only once, with the results stored.
- storing sam entries together with pairs drastically speeds up and simplifies several operations, like filtering and tagging of unmapped/ambiguous/duplicated Hi-C molecules.
- pair flipping and sorting is essential for processing steps like PCR duplicate removal and aggregation.
- the exclamation mark “!” is used as the character for unmapped chromosomes because it has a lexicographic sorting order lower than that of “0”, good interpretability and no other reserved technical roles.

pair types

pairsamtools uses a simple two-character notation to define all possible pair types by the quality of alignment. For each pair, its type can be defined unambiguously using the table below. To use this table, identify which side has an alignment of a “poorer” quality (unmapped < multimapped < unique alignment) and which side has a “better” alignment and find the corresponding row in the table.

| Pair type | Code | Sidedness |
|-----------|------|-----------|
| chimeric-chimeric | CC | 0 [1] |
| null | NN | 0 |
| null-multi | NM | 0 |
| null-unique | NU | 1 |
| null-rescued-chimeric | NR | 1 [2] |
| multi-multi | MM | 0 |
| multi-unique | MU | 1 |
| multi-rescued-chimeric | MR | 2 [2] |
| unique-unique | UU | 2 |
| rescued-chimeric | UR or RU | 2 [2] |
| duplicate | DD | 2 [3] |

[1] chimeric reads represent Hi-C molecules formed via multiple ligation events and thus cannot be reported as a single pair.
[2] some chimeric reads correspond to valid Hi-C molecules formed via a single ligation event, with the ligation junction sequenced through on one side. Following the procedure introduced in [HiC-Pro](https://github.com/nservant/HiC-Pro) and [Juicer](https://github.com/theaidenlab/juicer), pairsamtools rescues such molecules, reports their outer-most mapped positions and tags them as “UR” or “RU” pair type. Such molecules can and should be used in downstream analysis.
[3] pairsamtools detects molecules that could be formed via PCR duplication and tags them as “DD” pair type. These pairs should be excluded from downstream analyses.
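
The codes and sidedness values listed above can be kept in a small lookup table, for example to drop “DD” pairs before downstream analysis. This is an illustrative sketch; `SIDEDNESS`, `sidedness` and `drop_duplicates` are hypothetical names and not part of pairsamtools.

```python
from typing import Dict, Iterable, List

# Sidedness per pair-type code, copied from the table in this section.
SIDEDNESS: Dict[str, int] = {
    "CC": 0, "NN": 0, "NM": 0, "NU": 1, "NR": 1, "MM": 0,
    "MU": 1, "MR": 2, "UU": 2, "UR": 2, "RU": 2, "DD": 2,
}


def sidedness(pair_type: str) -> int:
    """Return the sidedness of a pair-type code; KeyError if unknown."""
    return SIDEDNESS[pair_type]


def drop_duplicates(pair_types: Iterable[str]) -> List[str]:
    """Keep every pair type except 'DD' (pairs tagged as PCR duplicates)."""
    return [pt for pt in pair_types if pt != "DD"]


kept = drop_duplicates(["UU", "DD", "UR", "NU", "DD"])
```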