Command-line tools

pairsamtools

pairsamtools [OPTIONS] COMMAND [ARGS]...

Options

--post-mortem

Post mortem debugging

--output-profile <output_profile>

Profile performance with Python cProfile and dump the statistics into a binary file

--version

Show the version and exit.

dedup

find and remove PCR duplicates.

Find PCR duplicates in an upper-triangular flipped sorted pairs/pairsam file. Allow for a +/-N bp mismatch at each side of duplicated molecules.

PAIRSAM_PATH : input triu-flipped sorted .pairs or .pairsam file. If the path ends with .gz/.lz4, the input is decompressed by pbgzip/lz4c. By default, the input is read from stdin.

pairsamtools dedup [OPTIONS] [PAIRSAM_PATH]

Options

-o, --output <output>

output file for pairs after duplicate removal. If the path ends with .gz or .lz4, the output is pbgzip-/lz4c-compressed. By default, the output is printed into stdout.

--output-dups <output_dups>

output file for duplicated pairs. If the path ends with .gz or .lz4, the output is pbgzip-/lz4c-compressed. If the path is the same as in --output or -, output duplicates together with deduped pairs. By default, duplicates are dropped.

--output-unmapped <output_unmapped>

output file for unmapped pairs. If the path ends with .gz or .lz4, the output is pbgzip-/lz4c-compressed. If the path is the same as in --output or -, output unmapped pairs together with deduped pairs. If the path is the same as --output-dups, output unmapped reads together with dups. By default, unmapped pairs are dropped.

--output-stats <output_stats>

output file for duplicate statistics. If the file exists, it is opened in append mode. If the path ends with .gz or .lz4, the output is pbgzip-/lz4c-compressed. By default, statistics are not printed.

--max-mismatch <max_mismatch>

Pairs with both sides mapped within this distance (bp) from each other are considered duplicates.

--method <method>

define the mismatch as either the max or the sum of the mismatches of the genomic locations of both sides of the two compared molecules

Default

max

Options

max | sum
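The interplay of --max-mismatch and --method can be illustrated with a minimal Python sketch. It is not the tool's own implementation: the function name is_duplicate, the tuple layout, the requirement that chromosomes and strands agree exactly, and the inclusive threshold are all assumptions made for illustration.

```python
def is_duplicate(pair_a, pair_b, max_mismatch, method="max"):
    """Return True if two pairs count as duplicates under this sketch.

    Each pair is a tuple (chrom1, pos1, strand1, chrom2, pos2, strand2).
    """
    c1a, p1a, s1a, c2a, p2a, s2a = pair_a
    c1b, p1b, s1b, c2b, p2b, s2b = pair_b
    # Assumption: chromosomes and strands must agree exactly on both sides.
    if (c1a, s1a, c2a, s2a) != (c1b, s1b, c2b, s2b):
        return False
    d1 = abs(p1a - p1b)  # mismatch on side 1, in bp
    d2 = abs(p2a - p2b)  # mismatch on side 2, in bp
    if method == "max":
        mismatch = max(d1, d2)
    elif method == "sum":
        mismatch = d1 + d2
    else:
        raise ValueError("method must be 'max' or 'sum'")
    # Assumption: the threshold is inclusive.
    return mismatch <= max_mismatch


a = ("chr1", 1000, "+", "chr1", 5000, "-")
b = ("chr1", 1002, "+", "chr1", 5003, "-")
print(is_duplicate(a, b, max_mismatch=3, method="max"))  # max(2, 3) = 3 -> True
print(is_duplicate(a, b, max_mismatch=3, method="sum"))  # 2 + 3 = 5 -> False
```

With the same threshold, sum is the stricter criterion, because both sides contribute to a single mismatch budget.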

--sep <sep>

Separator (\t, \v, etc. characters are supported, pass them in quotes)

--comment-char <comment_char>

The first character of comment lines

--send-header-to <send_header_to>

Which of the outputs should receive header and comment lines

Options

dups | dedup | both | none

--c1 <c1>

Chrom 1 column; default 1

--c2 <c2>

Chrom 2 column; default 3

--p1 <p1>

Position 1 column; default 2

--p2 <p2>

Position 2 column; default 4

--s1 <s1>

Strand 1 column; default 5

--s2 <s2>

Strand 2 column; default 6

--unmapped-chrom <unmapped_chrom>

Placeholder for a chromosome on an unmapped side; default !

--mark-dups

If specified, duplicate pairs are marked as DD in “pair_type” and as a duplicate in the sam entries.

--extra-col-pair <extra_col_pair>

Extra columns that also must match for two pairs to be marked as duplicates. Can be either provided as 0-based column indices or as column names (requires the “#columns” header field). The option can be provided multiple times if multiple column pairs must match. Example: --extra-col-pair "phase1" "phase2"

--nproc-in <nproc_in>

Number of processes used by the auto-guessed input decompressing command.

Default

3

--nproc-out <nproc_out>

Number of processes used by the auto-guessed output compressing command.

Default

8

--cmd-in <cmd_in>

A command to decompress the input file. If provided, fully overrides the auto-guessed command. Does not work with stdin. Must read input from stdin and print output into stdout. EXAMPLE: pbgzip -dc -n 3

--cmd-out <cmd_out>

A command to compress the output file. If provided, fully overrides the auto-guessed command. Does not work with stdout. Must read input from stdin and print output into stdout. EXAMPLE: pbgzip -c -n 8

Arguments

PAIRSAM_PATH

Optional argument

filterbycov

filter out pairs from locations with suspiciously high coverage. Useful for single-cell Hi-C experiments, where coverage is naturally limited by the chromosome copy number.

Find and remove pairs with >(MAX_COV-1) neighbouring pairs within a +/- MAX_DIST bp window around either side.

PAIRSAM_PATH : input triu-flipped sorted .pairs or .pairsam file. If the path ends with .gz/.lz4, the input is decompressed by pbgzip/lz4c. By default, the input is read from stdin.

pairsamtools filterbycov [OPTIONS] [PAIRSAM_PATH]

Options

-o, --output <output>

output file for pairs from low coverage regions. If the path ends with .gz or .lz4, the output is pbgzip-/lz4c-compressed. By default, the output is printed into stdout.

--output-highcov <output_highcov>

output file for pairs from high coverage regions. If the path ends with .gz or .lz4, the output is pbgzip-/lz4c-compressed. If the path is the same as in --output or -, output high-coverage pairs together with low-coverage pairs. By default, high-coverage pairs are dropped.

--output-unmapped <output_unmapped>

output file for unmapped pairs. If the path ends with .gz or .lz4, the output is pbgzip-/lz4c-compressed. If the path is the same as in --output or -, output unmapped pairs together with low-coverage pairs. If the path is the same as --output-highcov, output unmapped pairs together with high-coverage pairs. By default, unmapped pairs are dropped.

--output-stats <output_stats>

output file for statistics of multiple interactors. If the file exists, it is opened in append mode. If the path ends with .gz or .lz4, the output is pbgzip-/lz4c-compressed. By default, statistics are not printed.

--max-cov <max_cov>

The maximum allowed coverage per region.

--max-dist <max_dist>

The resolution for calculating coverage. For each pair, the local coverage around each end is calculated as (1 + the number of neighbouring pairs within +/- max_dist bp)

--method <method>

calculate the number of neighbouring pairs as either the sum or the max of the number of neighbours on the two sides

Default

max

Options

max | sum
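The coverage rule described by --max-cov, --max-dist and --method can be sketched in Python as follows. This is a simplified illustration rather than the tool's code: it works on a single sorted list of positions from one chromosome, treats the window boundaries as inclusive, and uses the hypothetical helper names count_neighbours and is_high_coverage.

```python
from bisect import bisect_left, bisect_right


def count_neighbours(sorted_positions, pos, max_dist):
    """Count the other positions within +/- max_dist bp of pos.

    sorted_positions must be sorted and must contain pos itself.
    """
    lo = bisect_left(sorted_positions, pos - max_dist)
    hi = bisect_right(sorted_positions, pos + max_dist)
    return hi - lo - 1  # subtract 1 to exclude pos itself


def is_high_coverage(n1, n2, max_cov, method="max"):
    """Apply the '> (MAX_COV - 1) neighbouring pairs' rule to the two sides."""
    if method == "max":
        n = max(n1, n2)
    elif method == "sum":
        n = n1 + n2
    else:
        raise ValueError("method must be 'max' or 'sum'")
    return n > max_cov - 1


positions = [100, 150, 160, 900, 5000]
n_side1 = count_neighbours(positions, 150, max_dist=100)   # 100 and 160 -> 2
n_side2 = count_neighbours(positions, 5000, max_dist=100)  # no neighbours -> 0
print(is_high_coverage(n_side1, n_side2, max_cov=2))  # 2 > 1 -> True
```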

--sep <sep>

Separator (\t, \v, etc. characters are supported, pass them in quotes)

--comment-char <comment_char>

The first character of comment lines

--send-header-to <send_header_to>

Which of the outputs should receive header and comment lines

Options

lowcov | highcov | both | none

--c1 <c1>

Chrom 1 column; default 1

--c2 <c2>

Chrom 2 column; default 3

--p1 <p1>

Position 1 column; default 2

--p2 <p2>

Position 2 column; default 4

--s1 <s1>

Strand 1 column; default 5

--s2 <s2>

Strand 2 column; default 6

--unmapped-chrom <unmapped_chrom>

Placeholder for a chromosome on an unmapped side; default !

--mark-multi

If specified, pairs from high coverage regions are marked as FF in “pair_type” and as a duplicate in the sam entries.

--nproc-in <nproc_in>

Number of processes used by the auto-guessed input decompressing command.

Default

3

--nproc-out <nproc_out>

Number of processes used by the auto-guessed output compressing command.

Default

8

--cmd-in <cmd_in>

A command to decompress the input file. If provided, fully overrides the auto-guessed command. Does not work with stdin. Must read input from stdin and print output into stdout. EXAMPLE: pbgzip -dc -n 3

--cmd-out <cmd_out>

A command to compress the output file. If provided, fully overrides the auto-guessed command. Does not work with stdout. Must read input from stdin and print output into stdout. EXAMPLE: pbgzip -c -n 8

Arguments

PAIRSAM_PATH

Optional argument

markasdup

tag all pairsam entries with a duplicate tag.

PAIRSAM_PATH : input .pairsam file. If the path ends with .gz, the input is gzip-decompressed. By default, the input is read from stdin.

pairsamtools markasdup [OPTIONS] [PAIRSAM_PATH]

Options

-o, --output <output>

output .pairsam file. If the path ends with .gz or .lz4, the output is pbgzip-/lz4c-compressed. By default, the output is printed into stdout.

--nproc-in <nproc_in>

Number of processes used by the auto-guessed input decompressing command.

Default

3

--nproc-out <nproc_out>

Number of processes used by the auto-guessed output compressing command.

Default

8

--cmd-in <cmd_in>

A command to decompress the input file. If provided, fully overrides the auto-guessed command. Does not work with stdin. Must read input from stdin and print output into stdout. EXAMPLE: pbgzip -dc -n 3

--cmd-out <cmd_out>

A command to compress the output file. If provided, fully overrides the auto-guessed command. Does not work with stdout. Must read input from stdin and print output into stdout. EXAMPLE: pbgzip -c -n 8

Arguments

PAIRSAM_PATH

Optional argument

merge

merge sorted pairs/pairsam files.

Merge triu-flipped sorted pairs/pairsam files. If present, the @SQ records of the SAM header must be identical; the sorting order of these lines is taken from the first file in the list. The ID fields of the @PG records of the SAM header are modified with a numeric suffix to produce unique records. The other unique SAM and non-SAM header lines are copied into the output header.

PAIRSAM_PATH : upper-triangular flipped sorted pairs/pairsam files to merge or a group/groups of .pairsam files specified by a wildcard. For paths ending in .gz/.lz4, the files are decompressed by pbgzip/lz4c.

pairsamtools merge [OPTIONS] [PAIRSAM_PATH]...

Options

-o, --output <output>

output file. If the path ends with .gz/.lz4, the output is compressed by pbgzip/lz4c. By default, the output is printed into stdout.

--max-nmerge <max_nmerge>

The maximal number of inputs merged at once. For more, store merged intermediates in temporary files.

Default

8
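The batching implied by --max-nmerge can be illustrated with a short Python sketch built on heapq.merge. It is an assumption-laden illustration, not the tool's implementation: intermediates are kept in memory as lists instead of temporary files, and merge_in_batches is a hypothetical name.

```python
import heapq


def merge_in_batches(streams, key=None, max_nmerge=8):
    """Merge sorted sequences, combining at most max_nmerge of them at once."""
    if max_nmerge < 2:
        raise ValueError("max_nmerge must be at least 2")
    streams = [list(s) for s in streams]
    # While there are too many inputs, merge them batch by batch into
    # intermediates (the real tool would store these in temporary files).
    while len(streams) > max_nmerge:
        batches = [streams[i:i + max_nmerge]
                   for i in range(0, len(streams), max_nmerge)]
        streams = [list(heapq.merge(*batch, key=key)) for batch in batches]
    return list(heapq.merge(*streams, key=key))


inputs = [[1, 4], [2, 5], [3, 6], [0, 7]]
print(merge_in_batches(inputs, max_nmerge=2))  # [0, 1, 2, 3, 4, 5, 6, 7]
```

Each round reduces the number of streams by a factor of up to max_nmerge, so the number of simultaneously open inputs stays bounded.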

--tmpdir <tmpdir>

Custom temporary folder for merged intermediates.

--memory <memory>

The amount of memory used by default.

Default

2G

--compress-program <compress_program>

A binary to compress temporary merged chunks. Must decompress input when the flag -d is provided. Suggested alternatives: lz4c, gzip, lzop, snzip. NOTE: fails silently if the command syntax is wrong.

Default

--nproc <nproc>

Number of threads for merging.

Default

8

--nproc-in <nproc_in>

Number of processes used by the auto-guessed input decompressing command.

Default

1

--nproc-out <nproc_out>

Number of processes used by the auto-guessed output compressing command.

Default

8

--cmd-in <cmd_in>

A command to decompress the input. If provided, fully overrides the auto-guessed command. Does not work with stdin. Must read input from stdin and print output into stdout. EXAMPLE: pbgzip -dc -n 3

--cmd-out <cmd_out>

A command to compress the output. If provided, fully overrides the auto-guessed command. Does not work with stdout. Must read input from stdin and print output into stdout. EXAMPLE: pbgzip -c -n 8

Arguments

PAIRSAM_PATH

Optional argument(s)

parse

parse .sam and make .pairsam.

SAM_PATH : input .sam file. If the path ends with .bam, the input is decompressed from bam. By default, the input is read from stdin.

pairsamtools parse [OPTIONS] [SAM_PATH]

Options

-c, --chroms-path <chroms_path>

Required. Chromosome order used to flip interchromosomal mates: path to a chromosomes file (e.g. UCSC chrom.sizes or similar) whose first column lists scaffold names. Any scaffolds not listed will be ordered lexicographically following the names provided.
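A minimal Python sketch of this ordering and of flipping is given below. The helper names build_chrom_order and flip_pair are hypothetical, and ordering intrachromosomal pairs by position is an assumption based on the upper-triangular convention used elsewhere in this reference.

```python
def build_chrom_order(listed_chroms, observed_chroms):
    """Rank listed chromosomes first, then unlisted ones lexicographically."""
    order = {chrom: rank for rank, chrom in enumerate(listed_chroms)}
    unlisted = sorted(c for c in set(observed_chroms) if c not in order)
    for chrom in unlisted:
        order[chrom] = len(order)
    return order


def flip_pair(chrom1, pos1, strand1, chrom2, pos2, strand2, order):
    """Swap the two sides if side 1 comes after side 2 in the given order."""
    if (order[chrom1], pos1) > (order[chrom2], pos2):
        return chrom2, pos2, strand2, chrom1, pos1, strand1
    return chrom1, pos1, strand1, chrom2, pos2, strand2


order = build_chrom_order(
    ["chr2", "chr1"],                              # order from the chromosomes file
    ["chr1", "chr2", "scaffold_b", "scaffold_a"],  # names seen in the alignments
)
print(order)
print(flip_pair("chr1", 10, "+", "chr2", 99, "-", order))
```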

-o, --output <output>

output file. If the path ends with .gz or .lz4, the output is pbgzip-/lz4-compressed. By default, the output is printed into stdout.

--assembly <assembly>

Name of genome assembly (e.g. hg19, mm10) to store in the pairs header.

--min-mapq <min_mapq>

The minimal MAPQ score to consider a read as uniquely mapped

Default

1

--max-molecule-size <max_molecule_size>

The maximal size of a Hi-C molecule; used to rescue single ligations from molecules with three alignments.

Default

2000

--drop-readid

If specified, do not add read ids to the output

--drop-seq

If specified, remove sequences and PHREDs from the sam fields

--drop-sam

If specified, do not add sams to the output

--add-columns <add_columns>

Report extra columns describing alignments. Possible values (can take multiple values as a comma-separated list): a SAM tag (any pair of uppercase letters) or mapq, pos5, pos3, cigar, read_len, matched_bp, algn_ref_span, algn_read_span, dist_to_5, dist_to_3, seq.

--output-parsed-alignments <output_parsed_alignments>

output file for all parsed alignments, including walks. Useful for debugging and analysis of walks. If the file exists, it is opened in append mode. If the path ends with .gz or .lz4, the output is pbgzip-/lz4-compressed. By default, not used.

--output-stats <output_stats>

output file for various statistics of the pairsam file. By default, statistics are not generated.

--report-alignment-end <report_alignment_end>

specifies whether the 5’ or 3’ end of the alignment is reported as the position of the Hi-C read.

Options

5 | 3
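Which reference coordinate corresponds to the 5’ or 3’ end depends on the strand of the alignment. The Python sketch below illustrates this; the function name reported_position and the use of inclusive leftmost/rightmost reference coordinates are assumptions made for the example, not part of the tool.

```python
def reported_position(ref_start, ref_end, strand, report_end):
    """Pick the reference coordinate of the 5' or 3' end of an alignment.

    ref_start and ref_end are the leftmost and rightmost reference
    coordinates covered by the alignment (ref_start <= ref_end).
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    if report_end not in ("5", "3"):
        raise ValueError("report_end must be '5' or '3'")
    if strand == "+":
        five_prime, three_prime = ref_start, ref_end
    else:
        # On the reverse strand the read starts at the rightmost coordinate.
        five_prime, three_prime = ref_end, ref_start
    return five_prime if report_end == "5" else three_prime


print(reported_position(100, 150, "+", "5"))  # 100
print(reported_position(100, 150, "-", "5"))  # 150
print(reported_position(100, 150, "-", "3"))  # 100
```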

--max-inter-align-gap <max_inter_align_gap>

read segments that are not covered by any alignment and longer than the specified value are treated as “null” alignments. These null alignments convert otherwise linear alignments into walks, and affect how they get reported as a Hi-C pair (see --walks-policy).

Default

20

--walks-policy <walks_policy>

the policy for reporting unrescuable walks (reads containing more than one alignment on one or both sides, that can not be explained by a single ligation between two mappable DNA fragments). “mask” - mask walks (chrom="!", pos=0, strand="-"); “all” - report all pairs of consecutive alignments [NOT IMPLEMENTED]; “5any” - report the 5’-most alignment on each side; “5unique” - report the 5’-most unique alignment on each side, if present; “3any” - report the 3’-most alignment on each side; “3unique” - report the 3’-most unique alignment on each side, if present.

Default

mask

Options

mask | all | 5any | 5unique | 3any | 3unique

--no-flip

If specified, do not flip pairs in genomic order and instead preserve the order in which they were sequenced.

--nproc-in <nproc_in>

Number of processes used by the auto-guessed input decompressing command.

Default

3

--nproc-out <nproc_out>

Number of processes used by the auto-guessed output compressing command.

Default

8

--cmd-in <cmd_in>

A command to decompress the input file. If provided, fully overrides the auto-guessed command. Does not work with stdin. Must read input from stdin and print output into stdout. EXAMPLE: pbgzip -dc -n 3

--cmd-out <cmd_out>

A command to compress the output file. If provided, fully overrides the auto-guessed command. Does not work with stdout. Must read input from stdin and print output into stdout. EXAMPLE: pbgzip -c -n 8

Arguments

SAM_PATH

Optional argument

phase

phase a pairsam file mapped to a diploid genome.

PAIRSAM_PATH : input .pairsam file. If the path ends with .gz or .lz4, the input is decompressed by pbgzip/lz4c. By default, the input is read from stdin.

pairsamtools phase [OPTIONS] [PAIRSAM_PATH]

Options

-o, --output <output>

output file. If the path ends with .gz or .lz4, the output is pbgzip-/lz4c-compressed. By default, the output is printed into stdout.

--phase-suffixes <phase_suffixes>

phase suffixes.

--clean-output

drop all columns besides the standard ones and phase1/2

--nproc-in <nproc_in>

Number of processes used by the auto-guessed input decompressing command.

Default

3

--nproc-out <nproc_out>

Number of processes used by the auto-guessed output compressing command.

Default

8

--cmd-in <cmd_in>

A command to decompress the input file. If provided, fully overrides the auto-guessed command. Does not work with stdin. Must read input from stdin and print output into stdout. EXAMPLE: pbgzip -dc -n 3

--cmd-out <cmd_out>

A command to compress the output file. If provided, fully overrides the auto-guessed command. Does not work with stdout. Must read input from stdin and print output into stdout. EXAMPLE: pbgzip -c -n 8

Arguments

PAIRSAM_PATH

Optional argument

restrict

identify the restriction fragments that got ligated into a Hi-C molecule.

PAIRSAM_PATH : input .pairsam file. If the path ends with .gz/.lz4, the input is decompressed by pbgzip/lz4c. By default, the input is read from stdin.

pairsamtools restrict [OPTIONS] [PAIRSAM_PATH]

Options

-f, --frags <frags>

Required. A tab-separated BED file with the positions of restriction fragments (chrom, start, end). Can be generated using cooler digest.
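Assigning a position to its restriction fragment amounts to an interval lookup, which can be sketched in Python with a binary search. The sketch assumes 0-based, half-open (start, end) fragments sorted by start on a single chromosome and a position in the same coordinate system; find_fragment is a hypothetical helper, not a function of the tool.

```python
from bisect import bisect_right


def find_fragment(fragments, pos):
    """Return (index, start, end) of the fragment containing pos, or None.

    fragments is a list of (start, end) tuples sorted by start.
    """
    starts = [start for start, _ in fragments]
    idx = bisect_right(starts, pos) - 1  # last fragment starting at or before pos
    if idx < 0 or pos >= fragments[idx][1]:
        return None
    start, end = fragments[idx]
    return idx, start, end


frags = [(0, 1000), (1000, 2500), (2500, 4000)]
print(find_fragment(frags, 999))   # (0, 0, 1000)
print(find_fragment(frags, 1000))  # (1, 1000, 2500)
print(find_fragment(frags, 4000))  # None
```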

-o, --output <output>

output pairsam file. If the path ends with .gz/.lz4, the output is compressed by pbgzip/lz4c. By default, the output is printed into stdout.

--nproc-in <nproc_in>

Number of processes used by the auto-guessed input decompressing command.

Default

3

--nproc-out <nproc_out>

Number of processes used by the auto-guessed output compressing command.

Default

8

--cmd-in <cmd_in>

A command to decompress the input file. If provided, fully overrides the auto-guessed command. Does not work with stdin. Must read input from stdin and print output into stdout. EXAMPLE: pbgzip -dc -n 3

--cmd-out <cmd_out>

A command to compress the output file. If provided, fully overrides the auto-guessed command. Does not work with stdout. Must read input from stdin and print output into stdout. EXAMPLE: pbgzip -c -n 8

Arguments

PAIRSAM_PATH

Optional argument

select

select pairsam entries.

CONDITION : A Python expression; if it returns True, select the read pair. Any column declared in the #columns line of the pairs header can be accessed by its name. If the header lacks the #columns line, the columns are assumed to follow the pairs/pairsam standard (readID, chrom1, pos1, chrom2, pos2, strand1, strand2, pair_type). Finally, CONDITION has access to the COLS list, which contains the string values of columns. In Bash, quote CONDITION with single quotes, and use double quotes for string variables inside CONDITION.

PAIRSAM_PATH : input .pairsam file. If the path ends with .gz or .lz4, the input is decompressed by pbgzip/lz4c. By default, the input is read from stdin.

The following functions can be used in CONDITION besides the standard Python functions:

  • csv_match(x, csv) - True if variable x is contained in a list of comma-separated values, e.g. csv_match(chrom1, 'chr1,chr2')

  • wildcard_match(x, wildcard) - True if variable x matches a wildcard, e.g. wildcard_match(pair_type, 'C*')

  • regex_match(x, regex) - True if variable x matches a Python-flavor regex, e.g. regex_match(chrom1, 'chr\d')
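The behaviour of these three helpers can be approximated in plain Python with the standard fnmatch and re modules. The definitions below are illustrative re-implementations, not the tool's own code; in particular, requiring the whole string to match in regex_match and matching wildcards case-sensitively are assumptions.

```python
import fnmatch
import re


def csv_match(x, csv):
    """True if x equals one of the comma-separated values in csv."""
    return x in csv.split(",")


def wildcard_match(x, wildcard):
    """True if x matches a shell-style wildcard (case-sensitive)."""
    return fnmatch.fnmatchcase(x, wildcard)


def regex_match(x, regex):
    """True if the whole of x matches the regular expression."""
    return re.fullmatch(regex, x) is not None


print(csv_match("chr1", "chr1,chr2"))    # True
print(csv_match("chr10", "chr1,chr2"))   # False
print(wildcard_match("CX", "C*"))        # True
print(regex_match("chr12", r"chr\d+"))   # True
print(regex_match("chrX", r"chr\d+"))    # False
```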

Examples:
pairsamtools select '(pair_type=="UU") or (pair_type=="UR") or (pair_type=="RU")'
pairsamtools select 'chrom1==chrom2'
pairsamtools select 'COLS[1]==COLS[3]'
pairsamtools select '(chrom1==chrom2) and (abs(pos1 - pos2) < 1e6)'
pairsamtools select '(chrom1=="!") and (chrom2!="!")'
pairsamtools select 'regex_match(chrom1, "chr\d+") and regex_match(chrom2, "chr\d+")'

pairsamtools select 'True' --chrom-subset mm9.reduced.chromsizes
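How a CONDITION string can be evaluated against each row is sketched below in Python. This is a rough model under stated assumptions, not the tool's implementation: select_pairs is a hypothetical function, the column layout in the example is assumed, and pos1/pos2 are converted to integers while all other columns stay strings. Because it relies on eval, such a sketch should only ever be given trusted expressions.

```python
def select_pairs(condition, rows, columns):
    """Yield the rows for which the Python expression `condition` is true."""
    compiled = compile(condition, "<condition>", "eval")
    for row in rows:
        namespace = dict(zip(columns, row))  # columns accessible by name
        # Assumption: positions are exposed as integers.
        for name in ("pos1", "pos2"):
            if name in namespace:
                namespace[name] = int(namespace[name])
        namespace["COLS"] = list(row)  # raw string values of all columns
        if eval(compiled, {}, namespace):
            yield row


columns = ["readID", "chrom1", "pos1", "chrom2", "pos2",
           "strand1", "strand2", "pair_type"]
rows = [
    ["r1", "chr1", "100", "chr1", "900", "+", "-", "UU"],
    ["r2", "chr1", "100", "chr2", "900", "+", "-", "UU"],
    ["r3", "chr1", "100", "chr1", "5000000", "+", "-", "UR"],
]
condition = "(chrom1==chrom2) and (abs(pos1 - pos2) < 1e6)"
print([row[0] for row in select_pairs(condition, rows, columns)])  # ['r1']
```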

pairsamtools select [OPTIONS] CONDITION [PAIRSAM_PATH]

Options

-o, --output <output>

output file. If the path ends with .gz or .lz4, the output is pbgzip-/lz4c-compressed. By default, the output is printed into stdout.

--output-rest <output_rest>

output file for pairs of other types. If the path ends with .gz or .lz4, the output is pbgzip-/lz4c-compressed. By default, such pairs are dropped.

--send-comments-to <send_comments_to>

Which of the outputs should receive header and comment lines

Default

both

Options

selected | rest | both | none

--chrom-subset <chrom_subset>

A path to a chromosomes file (tab-separated, 1st column contains chromosome names) containing a chromosome subset of interest. If provided, additionally filter pairs with both sides originating from the provided subset of chromosomes. This operation modifies the #chromosomes: and #chromsize: header fields accordingly.

--nproc-in <nproc_in>

Number of processes used by the auto-guessed input decompressing command.

Default

3

--nproc-out <nproc_out>

Number of processes used by the auto-guessed output compressing command.

Default

8

--cmd-in <cmd_in>

A command to decompress the input file. If provided, fully overrides the auto-guessed command. Does not work with stdin. Must read input from stdin and print output into stdout. EXAMPLE: pbgzip -dc -n 3

--cmd-out <cmd_out>

A command to compress the output file. If provided, fully overrides the auto-guessed command. Does not work with stdout. Must read input from stdin and print output into stdout. EXAMPLE: pbgzip -c -n 8

Arguments

CONDITION

Required argument

PAIRSAM_PATH

Optional argument

sort

sort a pairs/pairsam file.

The resulting order is lexicographic along chrom1 and chrom2, numeric along pos1 and pos2 and lexicographic along pair_type.
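This ordering can be expressed as a Python sort key. The sketch assumes the column layout readID, chrom1, pos1, chrom2, pos2, strand1, strand2, pair_type; pair_sort_key is a hypothetical name used only for illustration.

```python
def pair_sort_key(cols):
    """Sort key: chrom1, chrom2 (as strings), pos1, pos2 (as ints), pair_type."""
    # Assumed layout: readID, chrom1, pos1, chrom2, pos2, strand1, strand2, pair_type
    return (cols[1], cols[3], int(cols[2]), int(cols[4]), cols[7])


rows = [
    ["r1", "chr2", "50", "chr2", "80", "+", "-", "UU"],
    ["r2", "chr10", "500", "chr10", "900", "+", "-", "UU"],
    ["r3", "chr10", "70", "chr10", "900", "+", "-", "UU"],
]
sorted_rows = sorted(rows, key=pair_sort_key)
print([row[0] for row in sorted_rows])  # ['r3', 'r2', 'r1']
```

Note that chr10 sorts before chr2 because chromosome names are compared as strings, while position 70 sorts before 500 because positions are compared as numbers.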

PAIRSAM_PATH : input .pairsam file. If the path ends with .gz or .lz4, the input is decompressed by pbgzip or lz4c, correspondingly. By default, the input is read as text from stdin.

pairsamtools sort [OPTIONS] [PAIRSAM_PATH]

Options

-o, --output <output>

output pairsam file. If the path ends with .gz or .lz4, the output is compressed by pbgzip or lz4, correspondingly. By default, the output is printed into stdout.

--nproc <nproc>

Number of processes to split the sorting work between.

Default

8

--tmpdir <tmpdir>

Custom temporary folder for sorting intermediates.

--memory <memory>

The amount of memory used by default.

Default

2G

--compress-program <compress_program>

A binary to compress temporary sorted chunks. Must decompress input when the flag -d is provided. Suggested alternatives: gzip, lzop, lz4c, snzip.

Default

--nproc-in <nproc_in>

Number of processes used by the auto-guessed input decompressing command.

Default

3

--nproc-out <nproc_out>

Number of processes used by the auto-guessed output compressing command.

Default

8

--cmd-in <cmd_in>

A command to decompress the input file. If provided, fully overrides the auto-guessed command. Does not work with stdin. Must read input from stdin and print output into stdout. EXAMPLE: pbgzip -dc -n 3

--cmd-out <cmd_out>

A command to compress the output file. If provided, fully overrides the auto-guessed command. Does not work with stdout. Must read input from stdin and print output into stdout. EXAMPLE: pbgzip -c -n 8

Arguments

PAIRSAM_PATH

Optional argument

split

split a .pairsam file into pairs and sam.

PAIRSAM_PATH : input .pairsam file. If the path ends with .gz or .lz4, the input is decompressed by pbgzip or lz4c. By default, the input is read from stdin.

pairsamtools split [OPTIONS] [PAIRSAM_PATH]

Options

--output-pairs <output_pairs>

output pairs file. If the path ends with .gz or .lz4, the output is pbgzip-/lz4c-compressed. If -, pairs are printed to stdout. If not specified, pairs are dropped.

--output-sam <output_sam>

output sam file. If the path ends with .bam, the output is compressed into a bam file. If -, sam entries are printed to stdout. If not specified, sam entries are dropped.

--nproc-in <nproc_in>

Number of processes used by the auto-guessed input decompressing command.

Default

3

--nproc-out <nproc_out>

Number of processes used by the auto-guessed output compressing command.

Default

8

--cmd-in <cmd_in>

A command to decompress the input file. If provided, fully overrides the auto-guessed command. Does not work with stdin. Must read input from stdin and print output into stdout. EXAMPLE: pbgzip -dc -n 3

--cmd-out <cmd_out>

A command to compress the output file. If provided, fully overrides the auto-guessed command. Does not work with stdout. Must read input from stdin and print output into stdout. EXAMPLE: pbgzip -c -n 8

Arguments

PAIRSAM_PATH

Optional argument

stats

calculate various statistics of a pairs/pairsam file.

INPUT_PATH : by default, a .pairsam file to calculate statistics. If not provided, the input is read from stdin. If --merge is specified, then INPUT_PATH is interpreted as an arbitrary number of stats files to merge.

The files with paths ending with .gz/.lz4 are decompressed by pbgzip/lz4c.

pairsamtools stats [OPTIONS] [INPUT_PATH]...

Options

-o, --output <output>

output stats tsv file.

--merge

If specified, merge multiple input stats files instead of calculating statistics of a pairsam file. Merging is performed via summation of all overlapping statistics. Non-overlapping statistics are appended to the end of the file.
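The merging rule can be sketched in Python on dictionaries of statistics. The sketch rests on several assumptions: each stats file is taken to be a flat mapping from a statistic name to a number, "overlapping" is read as "present in every input", and the key names in the example are hypothetical.

```python
def merge_stats(stats_list):
    """Sum statistics shared by all inputs, then append the remaining ones."""
    if not stats_list:
        return {}
    first, rest = stats_list[0], stats_list[1:]
    # Overlapping statistics: present in every input, kept in the first file's order.
    common = [key for key in first if all(key in stats for stats in rest)]
    merged = {key: sum(stats[key] for stats in stats_list) for key in common}
    common_set = set(common)
    # Non-overlapping statistics are appended after the shared ones.
    for stats in stats_list:
        for key, value in stats.items():
            if key not in common_set:
                merged[key] = merged.get(key, 0) + value
    return merged


a = {"total": 10, "cis": 5, "total_dups": 2}
b = {"total": 7, "total_dups": 1, "trans": 3}
merged = merge_stats([a, b])
print(merged)
```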

--nproc-in <nproc_in>

Number of processes used by the auto-guessed input decompressing command.

Default

3

--nproc-out <nproc_out>

Number of processes used by the auto-guessed output compressing command.

Default

8

--cmd-in <cmd_in>

A command to decompress the input file. If provided, fully overrides the auto-guessed command. Does not work with stdin. Must read input from stdin and print output into stdout. EXAMPLE: pbgzip -dc -n 3

--cmd-out <cmd_out>

A command to compress the output file. If provided, fully overrides the auto-guessed command. Does not work with stdout. Must read input from stdin and print output into stdout. EXAMPLE: pbgzip -c -n 8

Arguments

INPUT_PATH

Optional argument(s)