Skip to content
Convergence
GitHub

Examples

This is a tool-oriented primer designed to guide you through the functionalities of query_bedpe. From your tinker box, type query_bedpe -h into the command window and press enter to view the help message.

Help message

query_bedpe -h

Query and annotate a bedpe file with AQuA normalized contact values

-P|--pair            : Full path to the bedpe (pairs) file you want to annotate, without headers!
-A|--sample1         : Name of the sample you want to use as it appears on the Tinkerbox
-G|--genome          : The genome build the sample(s) has been processed using. Strictly hg19 or hg38
[ -Q|--norm        ] : Which normalization to use. Strictly 'none', 'cpm' or 'aqua' in lower case. Non-spike-in samples default to cpm. Spike-in samples default to aqua.
[ -B|--sample2     ] : The name of the second sample. If triggered, calculates the delta contact values for that pair. Useful in case vs control
[ -R|--resolution  ] : Resolution of sample in basepairs, using which the contact values should be calculated. Default 5000. Accepted resolutions- 1000,5000,10000,25000,50000,100000,250000,500000,1000000,2500000
[ -f|--formula     ] : Arithmetic to use to report contact calues. Options- center, max, average, sum. Default = center
[ -F|--fix         ] : If FALSE, reports new coordinates based on arithmetic center or max. Default = TRUE
[    --shrink_wrap ] : Squeezes a 2D bedpe interval until supplied value is reached. Default = FALSE
[    --split       ] : Splits a 2D bedpe interval into multiple sub-intervals greater than supplied value. Default = FALSE
[    --padding     ] : Joins sub-intervals in 2D space reported by --split, based on supplied value in bin units. Default = 2
[    --expand      ] : Expands 1D bedpe feet in both directions based on supplied value(in bin units). Default = 0
[ -I|--inherent    ] : If TRUE, hic values transformed to inherent units. For one-sample tests only. Default = FALSE
[ -m|--preserve_meta]: If TRUE, bedpe metadata columns will be preserved. Default = FALSE
[ -h|--help        ] : Help message. Primer can be found at [https://rb.gy/zyfjxc](https://rb.gy/zyfjxc)

Viewing loops

Let’s tinker with this tool using provided sample data in your tinker box.

The sample data comes from H3K27ac HiChIP experiments performed in a childhood rhabdomyosarcoma cell line (Gryder et al., 2019), treated with DMSO and HDAC inhibitor Entinostat.

sample1=RH4_DMSO
sample2=RH4_HDACi

Loops in 3D genomics are two stretches of DNA with high contact frequency. We first built query_bedpe to quickly access the conformation matrix and obtain interaction counts between any two regions. Over time, we’ve enhanced its abilities to handle imprecise loop coordinates.

For this primer we’re looking at two example interacting regions, each 55kb in size, separated by a distance of 120kb. These two regions are stored in a bedpe file.

bedpe=$/path/to/sample.bedpe

cat $bedpe
chr16     875000     930000      chr16     1050000     1105000

Before we continue with query_bedpe, let’s first visualize this hypothetical loop using another tool in the AQuA Tools suite, plot_contacts. Visualizations will be helpful understanding the outputs we get from query_bedpe. To accomplish this, we will use the RH4_DMSO_K27ac sample at 5kb resolution.


annotation=$tutorial_data_path/sample.bed

cat $annotation
chr16        875000       930000
chr16       1050000      1105000

range=chr16:790000:1260000:test
resolution=5000
genome=hg38

plot_contacts \
-A $sample1 \
-R $range \
-G $genome \
-O AL_1.pdf \
-Q aqua \
--max_cap 1 \
--annotations_default FALSE \
--annotations_custom $annotation \
--annotations_custom_color c000ff \
--bedpe $bedpe \
--bedpe_color c000ff

This is our loop!

Loop

Reporting bin values with —formula

Now that we have a visual understanding of the loop we are working with, we can use query_bedpe to ask questions about this loop. We are likely interested in the contact values (red bins) within this region, but which contact value from the loop (purple square) should we report? The default option is the center bin.

Loop_close-up

Loop_close-up

However, in this case, the default (center) bin is not the best choice. For single sample analyses, query_bedpe will return coordinates in the first 6 columns and contact values in the 7th column. As you can see below, query_bedpe reports a 0 value for this bin.

query_bedpe \
- A $sample1 \
- P $bedpe \
- G $genome

chr16   875000    930000    chr16   1050000    1105000    0

The default can be changed to obtain the max, sum or average of the enclosed 2D area, using the -f or --formula parameter.

query_bedpe \
  -A $sample1 \
  -P $bedpe \
  -G $genome \
  --formula max

chr16     875000     930000    chr16       1050000    1105000      2.42
query_bedpe \
  -A $sample1 \
  -P $bedpe \
  -G $genome \
  --formula sum

chr16     875000     930000    chr16       1050000    1105000     11.54
query_bedpe \
  -A $sample1 \
  -P $bedpe \
  -G $genome \
  --formula average

chr16     875000     930000    chr16       1050000    1105000     0.095

This is the area for sum and average --formula parameters

Loop_close-up

This is the area for max --formula parameter

Loop_max

Normalization methods

We can also change default behavior of the tool and report counts per million contact values, instead of the default AQuA normalized values, using the -Q or  --norm parameter.

query_bedpe \
  -A $sample1 \
  -P $bedpe \
  -G $genome \
  --formula max \
  --norm cpm

chr16     875000     930000     chr16     1050000     1105000   0.436
query_bedpe \
  -A $sample1 \
  -P $bedpe \
  -G $genome \
  --formula sum \
  --norm cpm

chr16     875000     930000     chr16     1050000     1105000   2.079
query_bedpe \
  -A $sample1 \
  -P $bedpe \
  -G $genome \
  --formula average \
  --norm cpm

chr16     875000     930000     chr16     1050000     1105000   0.017

Fixed or updating coordinates based on —formula

The above example reports back the same bedpe coordinates you supplied. However, the center and max bins (contained within the supplied 2D area) will have different (smaller) 2D genomic areas than the ones supplied. Using the -F or --fix parameter, we can instruct query_bedpe to either keep the coordinates fixed (default; —fix TRUE), or to return updated bedpe coordinates along with the contact value (—fix FALSE). This assists in obtaining precise coordinates from fuzzy intervals. Let’s compare output when --fix is TRUE and FALSE.

query_bedpe \
  -A $sample1 \
  -P $bedpe \
  -G $genome \
  --formula center \
  --fix FALSE

chr16     900000     905000    chr16     1075000    1080000    0

# --fix TRUE
# chr16    875000    930000    chr16     1050000    1105000    0
query_bedpe \
  -A $sample1 \
  -P $bedpe \
  -G $genome \
  --formula max \
  --fix FALSE

chr16     910000     915000    chr16     1050000    1055000    0.436

# --fix TRUE
#chr16    875000     930000    chr16     1050000    1105000    0.436
query_bedpe \
  -A $sample1 \
  -P $bedpe \
  -G $genome \
  --formula sum \
  --fix FALSE

chr16     875000     930000    chr16     1050000    1105000    2.079

# --fix TRUE
# chr16     875000     930000    chr16     1050000    1105000    2.079
query_bedpe \
  -A $sample1 \
  -P $bedpe \
  -G $genome \
  --formula average \
  --fix FALSE

chr16     875000     930000    chr16     1050000    1105000    0.017

# --fix TRUE
#chr16     875000     930000    chr16     1050000    1105000    0.017

You can see how the coordinates have changed when using --fix FALSE for center and max bins.

Two sample query & delta contact values

We can also use query_bedpe to obtain contact values of bedpe regions of two samples at the same time.

query_bedpe \
-A $sample1 \
-P $bedpe \
-G $genome \
-B $sample2

chr16   875000    930000    chr16   1050000    1105000    0    0.379   0.379

We now have two new columns! The first contact value is for sample1 (-A), the second for sample2 (-B), and the third column is the delta contact value (sample2 - sample1). This can be very useful in case-control analyses.

query_bedpe \
-A $sample1 \
-P $bedpe \
-G $genome \
-B $sample2 \
--formula max

chr16   875000    930000    chr16   1050000    1105000   2.42   2.653   0.233

What happens if we supply --fix FALSE to get the precise max bin coordinates?

query_bedpe \
- A $sample1 \
- P $bedpe \
- G $genome \
- B $sample2 \
- -formula max \
- -fix FALSE

chr16    875000    930000   chr16   1050000   1100000   2.42   2.653   0.233

We do not get altered coordinates, since we’re dealing with two samples at the same time! Parameter --fix FALSE only works with one sample analysis.

Shrink-wrap

Time to try out some more powerful query_bedpe functionalities, all designed to find better signal from fuzzy intervals! Let’s look at a larger region.

bedpe=sample_bedpe_2.bedpe

cat $bedpe
chr16    895000    990000     chr16    1020000     1070000

annotation=sample_bed_2.bed

cat $annotation
chr16     895000     990000
chr16    1020000    1070000
plot_contacts \
-A $sample1 \
-R $range \
-G $genome \
-O AL_2.pdf \
-Q aqua \
--max_cap 1 \
--annotations_default FALSE \
--annotations_custom $annotation \
--annotations_custom_color c000ff \
--bedpe $bedpe \
--bedpe_color c000ff

larger_sample

Parameter --shrink_wrap can reduce the dimensions of a 2D bedpe region until it encounters a bin with a user-supplied count value. Here, we will use 10 as the count value. We will also use --formula sum to show the decrease following --shrink_wrap, since the 2D area is shrinking.

query_bedpe \
- A $sample1 \
- P $bedpe \
- G $genome \
- -formula sum

chr16    895000     990000     chr16    1020000     1070000     62.744

query_bedpe \
- A $sample1 \
- P $bedpe \
- G $genome \
- -formula sum \
- -shrink_wrap 10 > shrink.bedpe

cat shrink.bedpe

chr16   910000      990000     chr16    1040000     1060000    46.173

Let’s visually inspect the effect of shrink_wrap with plot_contacts

plot_contacts \
-A $sample1 \
-R $range \
-G $genome \
-O AL_3.pdf \
-Q aqua \
--max_cap 2 \
--annotations_default FALSE \
--bedpe shrink.bedpe \
--bedpe_color c000ff

before shrink_wrap

larger_sample_crop

after shrink_wrap

shrink_sample

Split

Parameter --split can identify clusters of bins above a user-supplied contact value. The size of the cluster can be controlled by the parameter --padding.

query_bedpe \
- A $sample1 \
- P $bedpe \
- G $genome \
- -split 8 --padding 2 > split.bedpe

cat split.bedpe

chr16      900000      925000     chr16      1040000     1065000    2.42
chr16      960000     1000000     chr16      1030000     1070000    3.91
plot_contacts \

-A $sample1 \
-R $range \
-G $genome \
-O AL_4.pdf \
-Q aqua \
--max_cap 1 \
--annotations_default FALSE \
--bedpe split.bedpe \
--bedpe_color c000ff

before split

larger_sample_crop

after split

split

Expand

Given a supplied 2D bedpe area, parameter --expand can, as the name suggests, expand the 2D area in units of the resolution. We will use --expand with --fix FALSE to see the expanded coordinates in our output. Let’s use our original loop example again.

bedpe=sample_bedpe.bedpe
cat $bedpe

chr16      875000      930000      chr16       1050000     1105000

query_bedpe \
- A $sample1 \
- P $bedpe \
- G $genome \
- -formula sum

chr16      875000      930000      chr16       1050000     1105000     11.543

query_bedpe \
- A $sample1 \
- P $bedpe \
- G $genome \
- -formula sum \
- -fix FALSE \
- -expand 5

chr16      850000      955000      chr16       1025000     1130000     33.699

before expand

Loop

after expand

expand