Data Preparation¶

Displaying large amounts of data often requires first turning it into not-so-large amounts of data. Clodius is a program and library designed to aggregate large datasets to make them easy to display at different resolutions.

Bed Files¶

BED files specify genomic intervals. They are aggregated according to an importance function that determines which values should be visible at lower zoom levels. This importance function is user specified. In the absence of any clear ranking of the different lines in the BED file, a random value can be used in lieu of the importance function.

Example BED file:

chr9    135766734       135820020       TSC1    Biallelic inactivation may predict sensitivity to MTOR inhibitors
chr16   2097895 2138721 TSC2    Biallelic inactivation may predict sensitivity to MTOR inhibitors
chr3    10183318        10195354        VHL     May signal the presence of a germline mutation.
chr11   32409321        32457081        WT1     May signal the presence of a germline mutation.

This file can be aggregated like so:

clodius aggregate bedfile \
    --chromsizes-filename hg19.chrom.sizes \
    short.bed

If the bed file has tab-separated values, that can be specified using the --delimiter $'\t' option.

And then imported into higlass after copying to the docker temp directory (cp short.bed.multires ~/hg-tmp/):

higlass-manage ingest short.bed.beddb

A note about assemblies and coordinate systems¶

HiGlass doesn’t really have a notion of an assembly. It only displays data where it’s told to display it. When you aggregate a bedfile with using chromsizes-filename, it uses the lengths of the chromosomes to determine the offsets of the bedfile entries from the 0 position. So if aggregate and load the resulting the beddb file in HiGlass, you’ll see the bedfile entries displayed as if the chromosomes in the chromsizes file were laid end to end.

Now, if you want to see which chromosomes correspond to which positions along the x-axis or to have the search bar display “assembly” coordinates, you’ll need to register the chromsizes file using:

higlass-manage ingest \
    --filetype chromsizes-tsv \
    --datatype chromsizes \
    --assembly galGal6 \
    negspy/data/galGal6/chromInfo.txt

If you would like to be able to search for gene annotations in that assembly, you’ll need to create a gene annotation track.

** Note that while the lack of assembly enforcement is generally the rule, bigWig tracks are a notable exception. All bigWig files have to be associated with a coordinate system that is already present in the HiGlass server in order to be ingested.

Bedpe-like Files¶

BEDPE-like files contain two sets of chromosomal coordinates:

chr10   74160000        74720000    chr10   74165000    74725000
chr12   120920000       121640000   chr12   120925000   121645000
chr15   86360000        88840000    chr15   86365000    88845000

To view such files in HiGlass, they have to be aggregated so that tiles don’t contain too many values and slow down the renderer:

clodius aggregate bedpe \
    --assembly hg19 \
    --chr1-col 1 --from1-col 2 --to1-col 3 \
    --chr2-col 4 --from2-col 5 --to2-col 6 \
    --output-file domains.txt.multires \
    domains.txt

This requires the --chr1-col, --from1-col, --to1-col, --chr2-col, --from2-col, --to2-col parameters to specify which columns in the datafile describe the x-extent and y-extent of the region.

The priority with which regions are included in lower resolution tiles is specified by the --impotance-column parameter. This can either provide a value, contain random, or if it’s not specified, default to the size of the region.

BED files can also be aggregated as BEDPE-like files for use with the 2d-rectangle-domains track. The from1_col,to1_col and from2_col,to2_col parameters need to be set to the same columns. Example file:

chrZ    80050000        80100000        False   0.19240442973331        0.24341494300858102
chrZ    81350000        81400000        False   0.5359549218130373      0.30888749507071034
chrZ    81750000        81800000        False   -0.5859846849030403     1.602383514196359

With the aggregate command:

clodius aggregate bedpe \
--chromsizes-filename galGal6.chrom.sizes \
--chr1-col 1 --chr2-col 1 \
--from1-col 2 --to1-col 3 \
--from2-col 2 --to2-col 3 \
--has-header  my_file.bed

Ingesting into higlass¶

higlass-manage ingest my-file.bedpe.multires \
--filetype bed2ddb \
--datatype 2d-rectangle-domains

BedGraph files¶

Warning

The order of the chromosomes in the bedgraph file have to be consistent with the order specified for the assembly in the negspy repository.

Ordering the chromosomes in the input file¶

input_file=~/Downloads/phastCons100way.txt.gz;
output_file=~/Downloads/phastConst100way_ordered.txt;
chromnames=$(awk '{print $1}' ~/projects/negspy/negspy/data/hg19/chromInfo.txt);
for chr in $chromnames;
    do echo ${chr};
    zcat $input_file | grep "\t${chr}\t" >> $output_file;
done;

Aggregation by addition¶

Assume we have an input file that has id chr start end value1 value2 pairs:

location        chrom   start   end     copynumber      segmented
2900001-3000000       1       2900001 3000000 -0.614  -0.495
3000001-3100000       1       3000001 3100000 -0.407  -0.495
3100001-3200000       1       3100001 3200000 -0.428  -0.495
3200001-3300000       1       3200001 3300000 -0.437  -0.495

We can aggregate this file by recursively summing adjacent values. We have to indicate which column corresponds to the chromosome (--chromosome-col 2), the start position (--from-pos-col 3), the end position (--to-pos-col 4) and the value column (--value-col 5). We specify that the first line of the data file contains a header using the (--has-header) option.

clodius aggregate bedgraph          \
    test/sample_data/cnvs_hw.tsv    \
    --output-file ~/tmp/cnvs_hw.hitile \
    --chromosome-col 2              \
    --from-pos-col 3                \
    --to-pos-col 4                  \
    --value-col 5                   \
    --assembly grch37               \
    --nan-value NA                  \
    --transform exp2                \
    --has-header

Data Transform¶

The dataset used in this example contains copy number data that has been log2 transformed. That is, the copy number given for each bin is the log2 of the computed value. This is a problem for HiGlass’s default aggregation method of summing adjacent values since \log_2 a + \log_2 b \neq \log_2 ab.

Using the --transform exp2 option tells clodius to raise two to the power of the provided value before doing the transformation and storing. As an added benefit, NaN values become apparent in the resulting because they have values of 0.

NaN Value Identification¶

NaN (not a number) values in the input file can be specified using the --nan-value option. For example, --nan-value NA indicates that whenever NA is encountered as a value it should be treated as NaN. In the current implementation, NaN values are simply treated as 0. In the future, they should be assigned a special value so that they are ignored by HiGlass.

When NaN values are aggregated by summing, they are treated as 0 when added to another number. When two NaN values are added to each other, however, the result is Nan.

NaN Value Counting¶

Sometimes, we just want to count the number of NaN values in the file. The --count-nan option effectively treats NaN values as 1 and all other values as 0. This makes it possible to display a track showing how many NaN values are present in each interval. It also makes it possible to create compound tracks which use that information to normalize track values.

bigWig files¶

bigWig files store genomic data in a compressed, indexed form that allows rapid retrieval and visualization. bigWig files can be loaded directly into HiGlass using the vector datatype and bigwig filetype:

higlass-manage ingest cnvs_hw.bigWig --assembly hg19

Important: BigWig files have to be associated with a chromosome order!! This means that there needs to be a chromsizes file for the specified assembly in the local higlass database. This means that the chromsizes should have been ingested locally. Chromsizes available on remote servers (e.g. higlass.io) can not be associated with local bigWig files even though they may be visible within the browser. If no assembly is specified for the bigWig file using the –assembly option, HiGlass will try to find one in the database that matches the chromosomes present in the bigWig file. If a chromsizes tileset is found, it’s coordSystem will also be used for the bigWig file. If none are found, the import will fail. If more than one is found, the import will also fail. If a coordSystem is specified for the bigWig, but no chromsizes are found on the server, the import will fail.

TLDR: The simplest way to import a bigWig is to have a chromsizes present e.g.

higlass-manage ingest --filetype chromsizes-tsv --datatype chromsizes --assembly hg19 chromSizes.tsv

and then to add the bigWig with the same coordSystem:

higlass-manage ingest --assembly hg19 cnvs_hw.bigWig

Creating bigWig files¶

bigWig files can be created from any BED-like file containing chrom, start, end, and value fields. Just make sure to get rid of the heading if there is one (tail -n +2) and to sort by chromosome and start position (sort -k1,1 -k2,2n):

tail -n +2 my_bed_file.tsv \
    | sort -k1,1 -k2,2n \
    | awk \
    '{ if (NF >= 4) print $1 "\t" $2 "\t" $3 "\t" $5}' \
    > my.bed;
bedGraphToBigWig my.bed assembly.chrom.sizes.tsv my.bw;

The bedGraphToBigWig utility can be installed be either downloading the binary from the UCSC genome browser or using conda. Note that the example above is only an example. Other input files may have more header lines or a different format.

Chromosome Sizes¶

Chromosome sizes can be used to create chromosome label and chromosome grid tracks. They consist of a tab-separated file containing chromosome names and sizes as columns:

chr1    249250621
chr2    243199373
chr3    198022430
...

Chromosome sizes can be imported into the higlass server using the --filetype chromsizes-tsv and --datatype chromsizes parameters. A coordSystem should be included to identify the assembly that these chromosomes define.

higlass-manage ingest --filetype chromsizes-tsv --datatype chromsizes --assembly hg19 chromSizes.tsv

Gene Annotation Tracks¶

HiGlass uses a specialized track for displaying gene annotations. It is rougly based on UCSC’s refGene files (e.g. http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/). For any identifiable genome assembly the following commands can be run to generate a list of gene annotation that can be loaded as a zoomable track in HiGlass.

Prerequisites¶

For any assembly, there needs to a refGene file:

http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/refGene.txt.gz

And a list of chromosome sizes in the negspy python package.

If there are no available chromosome sizes for this assembly in negspy, adding them is simply a matter of downloading the list from UCSC (e.g. http://hgdownload.cse.ucsc.edu/goldenpath/hg19/bigZips/hg19.chrom.sizes)

Set the assembly name and species ID¶

ASSEMBLY=mm9
TAXID=10090

#ASSEMBLY=hg19
#TAXID=9606

#ASSEMBLY=sacCer3
#TAXID=559292

#ASSEMBLY=dm6
#TAXID=7227

Download data from UCSC and NCBI¶

# Download NCBI genbank data
DATADIR=~/data
mkdir $DATADIR/genbank
wget -N -P $DATADIR ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2refseq.gz
wget -N -P $DATADIR ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_info.gz
wget -N -P $DATADIR ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2pubmed.gz

# Download UCSC refGene database for assembly of interest
mkdir $DATADIR/$ASSEMBLY
wget -N -P $DATADIR/$ASSEMBLY/ http://hgdownload.cse.ucsc.edu/goldenPath/$ASSEMBLY/database/refGene.txt.gz

# Filter genbank data for species of interest
zcat $DATADIR/gene2refseq.gz | grep ^${TAXID} > $DATADIR/$ASSEMBLY/gene2refseq
zcat $DATADIR/gene_info.gz | grep ^${TAXID} | sort -k 2 > $DATADIR/$ASSEMBLY/gene_info
zcat $DATADIR/gene2pubmed.gz | grep ^${TAXID} > $DATADIR/$ASSEMBLY/gene2pubmed

# Sort
# Optional: filter out unplaced and unlocalized scaffolds (which have a "_" in the chrom name)
zcat $DATADIR/$ASSEMBLY/refGene.txt.gz \
    | awk -F $'\t' '{if (!($3 ~ /_/)) print;}' \
    | sort -k 2 \
    > $DATADIR/$ASSEMBLY/refGene_sorted

Get full model and citation count for each gene¶

# Count pubmed citations
# Output: {gene_id} \t {citation_count}
cat $DATADIR/$ASSEMBLY/gene2pubmed \
    | awk '{print $2}' \
    | sort \
    | uniq -c \
    | awk '{print $2 "\t" $1}' \
    | sort \
    > $DATADIR/$ASSEMBLY/gene2pubmed-count

# Gene2refseq dictionary
# Output: {gene_id} \t {refseq_id}
cat $DATADIR/$ASSEMBLY/gene2refseq \
    | awk -F $'\t' '{ split($4,a,"."); if (a[1] != "-") print $2 "\t" a[1];}' \
    | sort \
    | uniq  \
    > $DATADIR/$ASSEMBLY/geneid_refseqid

# Append refseq IDs to citation count table
# Output: {gene_id} \t {refseq_id} \t {citation_count}
join $DATADIR/$ASSEMBLY/geneid_refseqid \
    $DATADIR/$ASSEMBLY/gene2pubmed-count  \
    | sort -k2 \
    > $DATADIR/$ASSEMBLY/geneid_refseqid_count

# Join the refseq gene model against gene IDs
# Output: {gene_id} \t {refseq_id} \t {chrom}(5) \t {strand}(6) \t {txStart}(7) \t {txEnd}(8) \t {cdsStart}(9) \t {cdsEnd}(10) \t {exonCount}(11) \t {exonStarts}(12) \t {exonEnds}(13)
join -1 2 -2 2 \
    $DATADIR/$ASSEMBLY/geneid_refseqid_count \
    $DATADIR/$ASSEMBLY/refGene_sorted \
    | awk '{ print $2 "\t" $1 "\t" $5 "\t" $6 "\t" $7 "\t" $8 "\t" $9 "\t" $10 "\t" $11 "\t" $12 "\t" $13 "\t" $3; }' \
    | sort -k1   \
    > $DATADIR/$ASSEMBLY/geneid_refGene_count

# Join citation counts against gene information
# output -> geneid \t symbol \t gene_type \t name \t citation_count
join -1 2 -2 1 -t $'\t' \
    $DATADIR/$ASSEMBLY/gene_info \
    $DATADIR/$ASSEMBLY/gene2pubmed-count \
    | awk -F $'\t' '{print $1 "\t" $3 "\t" $10 "\t" $12 "\t" $16}' \
    | sort -k1 \
    > $DATADIR/$ASSEMBLY/gene_subinfo_citation_count

# 1: chr (chr1)
# 2: txStart (52301201) [9]
# 3: txEnd (52317145) [10]
# 4: geneName (ACVRL1)   [2]
# 5: citationCount (123) [16]
# 6: strand (+)  [8]
# 7: refseqId (NM_000020)
# 8: geneId (94) [1]
# 9: geneType (protein-coding)
# 10: geneDesc (activin A receptor type II-like 1)
# 11: cdsStart (52306258)
# 12: cdsEnd (52314677)
# 13: exonStarts (52301201,52306253,52306882,52307342,52307757,52308222,52309008,52309819,52312768,52314542,)
# 14: exonEnds (52301479,52306319,52307134,52307554,52307857,52308369,52309284,52310017,52312899,52317145,)
join -t $'\t' \
    $DATADIR/$ASSEMBLY/gene_subinfo_citation_count \
    $DATADIR/$ASSEMBLY/geneid_refGene_count \
    | awk -F $'\t' '{print $7 "\t" $9 "\t" $10 "\t" $2 "\t" $16 "\t" $8 "\t" $6 "\t" $1 "\t" $3 "\t" $4 "\t" $11 "\t" $12 "\t" $14 "\t" $15}' \
    > $DATADIR/$ASSEMBLY/geneAnnotations.bed

# Download: https://raw.githubusercontent.com/higlass/clodius/develop/scripts/exonU.py
python exonU.py $DATADIR/$ASSEMBLY/geneAnnotations.bed > $DATADIR/$ASSEMBLY/geneAnnotationsExonUnions.bed

Create a gene annotation track file¶

clodius aggregate bedfile \
    --max-per-tile 20 \
    --importance-column 5 \
    --chromsizes-filename assembly.chromSizes \
    --output-file $DATADIR/$ASSEMBLY/gene-annotations-${ASSEMBLY}.db \
    --delimiter $'\t' \
    $DATADIR/$ASSEMBLY/geneAnnotationsExonUnions.bed

Hitile files¶

Hitile files are HDF5-based 1D vector files containing data at multiple resolutions.

To see hitile datasets in higlass, use the docker container to load them:

docker exec higlass-container python \
        higlass-server/manage.py ingest_tileset \
        --filename /tmp/cnvs_hw.hitile \
        --filetype hitile \
        --datatype vector

Point your browser at 127.0.0.1:8989 (or wherever it is hosted), click on the little ‘plus’ icon in the view and select the top position. You will see a listing of available tracks that can be loaded. Select the dataset and then choose the plot type to display it as.

Cooler files¶

Cooler files (extension .cool) store arbitrarily large 2D genomic matrices, such as those produced via Hi-C and other high throughput proximity ligation experiments. HiGlass can render cooler files containing matrices of the same dataset at a range of bin resolutions or zoom levels, so called multiresolution cool files (typically denoted .mcool).

From pairs¶

Note

Starting with cooler 0.7.9, input pairs data no longer needs to be sorted and indexed.

Often you will start with a list of pairs (e.g. contacts, interactions) that need to be aggregated. For example, the 4DN-DCIC developed a standard pairs format for HiC-like data. In general, you only need a tab-delimited file with columns representing chrom1, pos1, chrom2, pos2, optionally gzipped. In the case of Hi-C, these would correspond to the mapped locations of the two ends of a Hi-C ligation product.

You also need to provide a list of chromosomes in semantic order (chr1, chr2, …, chrX, chrY, …) in a two-column chromsizes file.

Ingesting pairs is done using the cooler cload command. Choose the appropriate loading subcommand. If you pairs file is sorted and indexed with pairix or with tabix, use cooler cload pairix or cooler cload tabix, respectively. Otherwise, you can use the new cooler cload pairs command.

Raw pairs example

If you have a raw pairs file or you can stream your data in such a way, you only need to specify the columns that correspond to chrom1, chrom2, pos1 and pos2. For example, if chrom1 and pos1 are the first two columns, and chrom2 and pos2 are in columns 4 and 5, the following command will aggregate the input pairs at 1kb:

cooler cload pairs -c1 1 -p1 2 -c2 4 -p2 5 \
    hg19.chrom.sizes:1000 \
    mypairs.txt \
    mycooler.1000.cool

To pipe in a stream, replace the pairs path above with a dash -.

Note

The syntax <chromsizes_path>:<binsize_in_bp> is a shortcut to specify the genomic bin segmentation used to aggregate the pairs. Alternatively, you can pass in the path to a 3-column BED file of bins.

Indexed pairs example

If you want to create a sorted and indexed pairs file, follow this example. Because an index provides random access to the pairs, this method can be more efficient and parallelized.

cooler csort -c1 1 -p1 2 -c2 4 -p2 5 mypairs.txt hg19.chrom.sizes

will generate a sorted and compressed pairs file mypairs.blksrt.txt.gz along with a companion pairix .px2 index file. To aggregate, use the cload pairix command.

cooler cload pairix hg19.chrom.sizes:1000 mypairs.blksrt.txt.gz mycooler.1000.cool

The output mycooler.1000.cool will serve as the base resolution for the multires cooler you will generate.

From a matrix¶

If your base resolution data is already aggregated, you can ingest data in one of two formats. Use cooler load to ingest.

Note

Prior to cooler 0.7.9, input BG2 files needed to be sorted and indexed. This is no longer the case.

COO: Sparse matrix upper triangle coordinate list , i.e. tab-delimited sparse matrix triples (row_id, col_id, count). This is an output of pipelines like HiCPro.

cooler load -f coo hg19.chrom.sizes:1000 mymatrix.1kb.coo.txt mycooler.1000.cool

BG2: A 2D “extension” of the bedGraph format. Tab delimited with columns representing chrom1, start1, end1, chrom2, start2, end2, and count.

cooler load -f bg2 hg19.chrom.sizes:1000 mymatrix.1kb.bg2.gz mycooler.1000.cool

Zoomify¶

To recursively aggregate your matrix into a multires file, use the zoomify command.

cooler zoomify mycooler.1000.cool

The output will be a file called mycooler.1000.mcool with zoom levels increasing by factors of 2. You can also request an explicit list of resolutions, as long as they can be obtained via integer multiples starting from the base resolution. HiGlass performs well as long as zoom levels don’t differ in resolution by greater than a factor of ~5.

cooler zoomify -r 5000,10000,25000,50000,100000,500000,1000000 mycooler.1000.cool

If this is Hi-C data or similar, you probably want to apply iterative correction (i.e. matrix balancing normalization) by including the --balance option.

Loading pre-zoomed data¶

If the matrices for the resolutions you wish to visualize are already available, you can ingest each one independently into the right location inside the file using the Cooler URI :: syntax.

HiGlass expects each zoom level to be stored at a location named resolutions/{binsize}.

cooler load -f bg2 hg19.chrom.sizes:1000 mymatrix.1kb.bg2 mycooler.mcool::resolutions/1000
cooler load -f bg2 hg19.chrom.sizes:5000 mymatrix.5kb.bg2 mycooler.mcool::resolutions/5000
cooler load -f bg2 hg19.chrom.sizes:10000 mymatrix.10kb.bg2 mycooler.mcool::resolutions/10000
...

Multivec Files¶

Multivec files store arrays of arrays organized by chromosome. They are currently implemented as binary HDF5 files. To aggregate this data, we need an input file where each chromosome is a separate dataset. Here is an example creating of how to create the base resolution of a multivec file:

f = h5py.File('/tmp/blah.h5', 'w')

d = f.create_dataset('chr1', (10000,5), compression='gzip')
d[:] = np.random.random((10000,5))
f.close()

This base resolution can be aggregated to multiple resolutions using clodius aggregate multivec:

clodius aggregate multivec \
    --chromsizes-filename ~/projects/negspy/negspy/data/hg38/chromInfo.txt \
    --starting-resolution 1000 \
    --row-infos-filename ~/Downloads/sampled_info.txt \
    /tmp/blah.h5

The –chromsizes-filename option lists the chromosomes that are in the input file and their sizes. The contents should be a list of tab-separated values containing chromosome name and size:

chr1    10000

The –starting-resolution option indicates that the base resolution for the input data is 1000 base pairs.

The row-infos-filename parameter specifies a file containing a list of names for the rows in the multivec file:

Spleen
Thymus
Liver

Epilogos Data (multivec)¶

Epilogos (https://epilogos.altiusinstitute.org/) show the distribution of chromatin states over a set of experimental conditions (e.g. cell lines). The data consist of positions and states:

chr1    10000   10200   id:1,qcat:[ [-0.2833,15], [-0.04748,5], [-0.008465,7], [0,2], [0,3], [0,4], [0,6], [0,10], [0,11], [0,12], [0,13], [0,14], [0.0006647,1], [0.436,8], [1.921,9] ]
chr1    10200   10400   id:2,qcat:[ [-0.2833,15], [-0.04748,5], [0,3], [0,4], [0,6], [0,7], [0,10], [0,11], [0,12], [0,13], [0,14], [0.0006647,1], [0.004089,2], [0.8141,8], [1.706,9] ]
chr1    10400   10600   id:3,qcat:[ [-0.2588,15], [-0.04063,5], [0,2], [0,3], [0,4], [0,6], [0,7], [0,10], [0,11], [0,12], [0,13], [0,14], [0.0006647,1], [0.2881,8], [1.58,9] ]
chr1    10600   10800   id:4,qcat:[ [-0.02619,15], [0,1], [0,2], [0,3], [0,4], [0,6], [0,7], [0,8], [0,10], [0,11], [0,12], [0,13], [0,14], [0.1077,5], [0.4857,9] ]

This can be aggregated into multivec format:

clodius convert bedfile_to_multivec \
    hg38/all.KL.bed.gz \
    --assembly hg38 \
    --starting-resolution 200 \
    --row-infos-filename row_infos.txt \
    --num-rows 15 \
    --format epilogos

States Data (multivec)¶

A bed file with categorical data, e.g from chromHMM. The data consist of positions and states for each segment in categorical data:

chr1  0       10000   Quies
chr1  10000   10400   FaireW
chr1  10400   15800   Low
chr1  15800   16000   Pol2
chr1  16000   16400   Gen3'
chr1  16400   16600   Elon
chr1  16600   139000  Quies
chr1  139000  139200  Ctcf

This can be aggregated to multivec format:

clodius convert bedfile_to_multivec \
    hg38/all.KL.bed.gz \
    --assembly hg38 \
    --starting-resolution 200 \
    --row-infos-filename row_infos.txt \
    --num-rows 7 \
    --format states

A rows_info.txt file is required in the parameter --row-infos-filename for this type of data. This file contains the name of the states in the bedfile. e.g. rows_infos.txt:

Quies
FaireW
Low
Pol2
Gen3'
Elon
ctcf

The number of rows with the name of the states in the rows_info.txt file must match the number of states in the bedfile and that number should be stated in the --num-rows parameter.

The resulting output file can be ingested using higlass-manage:

higlass-manage.py ingest --filetype multivec --datatype multivec data.mv5

Other Data (multivec)¶

Multivec files are datatype agnostic. For use with generic data, create a segments file containing the length of each segment. A segment is an arbitrary set of discontinuous blocks that the data is partitioned into. In the case of genomics data, segments correspond to chromosomes. If the data has no natural grouping, it can all be lumped into one “segment” which is wide enough to accommodate all the data points. Below is an example of a dataset grouped into two “segments”.

segment1    20000
segment2    40000

Data will be displayed as if the segments were laid out end to end:

.. code-block:: bash

|---------------|——————————|

segment1 segment2

The individual datapoints should then be formatted as in the block below. Each row in this file corresponds to a column in the displayed plot. Each value is one of sections of the stacked bar plot or matrix that is rendered by the multivec plot.

segment_name    start  end  value1  value2   value3
segment1            0 10000      1       2        1
segment2        20000 30000      1       1        1

         ______
        |______|                 ______
        |      |                |______|
        |______|                |______|
        |      |                |      |
|---------------|------------------------------|
     segment1               segment2

This can be converted to a multivec file using the following command:

clodius convert bedfile_to_multivec \
    data.tsv \
    --chromsizes-file segments.tsv \
    --starting-resolution 1

This command can also take the parameter --row-infos-filename rows.txt to describe, in human readable text, each row (e.g. cell types). The passed file should have as many rows as there are rows in the multivec matrix.

The resulting output file can be ingested using higlass-manage:

higlass-manage.py ingest --filetype multivec --datatype multivec data.mv5