Gapped-kmer sequence modeling robustly identifies regulatory vocabularies and distal enhancers conserved between evolutionarily distant mammals

Jin Woo Oh, Michael A. Beer ^†

Department of Biomedical Engineering and McKusick-Nathans Department of Genetic Medicine, Johns Hopkins University

^†Correspondence should be addressed to: Michael A. Beer (mbeer AT jhu DOT edu)

Gene regulatory elements drive complex biological phenomena and their mutations are associated with common human diseases. The impacts of human regulatory variants are often tested using model organisms such as mice. However, mapping human enhancers to conserved elements in mice remains a challenge, due to both rapid enhancer evolution and limitations of current computational methods. We analyze distal enhancers across 45 matched human/mouse cell/tissue pairs from a comprehensive dataset of DNase-seq experiments, and show that while cell-specific regulatory vocabulary is conserved, enhancers evolve more rapidly than promoters and CTCF binding sites. Enhancer conservation rates vary across cell types, in part explainable by tissue specific transposable element activity. We present an improved genome alignment algorithm using gapped-kmer features, called gkm-align, and make genome wide predictions for 1,401,803 orthologous regulatory elements. We show that gkm-align discovers 23,660 novel human/mouse conserved enhancers missed by previous algorithms with strong evidence of conserved functional activity.

Citation

If you use this data, please cite as:

Oh JW and Beer MA. Gapped-kmer sequence modeling robustly identifies regulatory vocabularies and distal enhancers conserved between evolutionarily distant mammals. Nature Communications 15, 6464 (2024).

If you use gkm-SVM models, please also cite:
Ghandi, M., Lee, D., Mohammad-Noori, M. & Beer, M. A. Enhanced Regulatory Sequence Prediction Using Gapped k-mer Features. PLoS Comput Biol 10: e1003711 (2014).
Lee, D. et al. A method to predict the impact of regulatory variants from DNA sequence. Nat. Genet. 47, 955-961 (2015). doi:10.1038/ng.3331

Gkm-align software can be downloaded from: github

Gkm-SVM model information

All the gkm-SVM models used in this study can be searched by their gkm-SVM model aliases (e.g., brain model alias: ENCSR339EHN) and downloaded from the ENCODE portal encodeproject.org (e.g., ENCSR339EHN). The following tables contain information for the gkm-SVM models, including their model aliases. These tables are also included as Supplementary Tables in the manuscript.

Human models (Supp Table 2)

Mouse models (Supp Table 3)

45 human/mouse cell/tissue pairs (Supp Table 4)

Excel spreadsheet containing the tables above (Supp Tables)

Posterior gkm-SVM kmer-weight files can be used to make gkm-SVM predictions that range from 0 to 1. These files contain the mean and variance of the kmer-weight sums of the positive and negative training sets used for gkm-SVM training (mu: mean; var: variance; w: enhancer width). These files are used for weighted gkm-align, and posterior kmer-weight files for every human and mouse model listed above can be accessed through the following links:

Human models Mouse models
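As an illustration of how class means and variances can turn a raw kmer-weight sum into a score between 0 and 1, here is a minimal sketch. It assumes Gaussian class-conditional distributions and equal class priors; both are assumptions made for this example, not a description of the gkm-align implementation or of the file layout. The function name `posterior_score` and its parameters are hypothetical, and the enhancer width `w` is not used here.

```python
import math


def _log_gaussian(x, mean, var):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)


def posterior_score(weight_sum, mu_pos, var_pos, mu_neg, var_neg, prior_pos=0.5):
    """Posterior probability of the positive class for one kmer-weight sum.

    Assumption: weight sums of the positive and negative training sets are
    modelled as Gaussians; prior_pos is a hypothetical class prior.
    """
    log_pos = math.log(prior_pos) + _log_gaussian(weight_sum, mu_pos, var_pos)
    log_neg = math.log(1.0 - prior_pos) + _log_gaussian(weight_sum, mu_neg, var_neg)
    log_odds = log_pos - log_neg
    # Numerically stable logistic function of the log-odds.
    if log_odds >= 0:
        return 1.0 / (1.0 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1.0 + e)
```

With symmetric classes (for example mu_pos = 1, mu_neg = -1, equal variances), a weight sum of 0 lies midway between the two means and scores 0.5.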

Gkm-align uses gkm-SVM genomic background models to detect and mask low-complexity repetitive elements, and the models are provided below:

Human background Mouse background

DHS Peaks

DHS peaks for the 45 human/mouse cell/tissue pairs are provided in the links below (4th column: MACS2-peak score combining multiple replicate experiments):

Human DHS Mouse DHS

Enhancers

Enhancers for the 45 human/mouse cell/tissue pairs are provided below:

Human enhancers Mouse enhancers

We defined enhancers as DHS peaks that are at least 2,000 base pairs from any transcription start site (TSS) and that do not overlap DHS peaks that are DNase-accessible in more than 30% of ENCODE biosamples. The ubiquitously accessible peaks are listed below:

Human: hg38ubiq30.bed

Mouse: mm10ubiq30.bed
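The enhancer definition above can be sketched as a simple filter. This is an illustrative reimplementation, not the authors' pipeline: it assumes BED-style half-open intervals, measures the TSS distance from the nearest peak edge (the source does not say whether edges or peak centers were used), and uses hypothetical names (`is_enhancer`, `tss_by_chrom`, `ubiq_by_chrom`) for in-memory inputs.

```python
def distance_to_position(start, end, pos):
    """Distance in bp from the half-open interval [start, end) to pos (0 if inside)."""
    if pos < start:
        return start - pos
    if pos >= end:
        return pos - (end - 1)
    return 0


def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals share at least one base."""
    return a_start < b_end and b_start < a_end


def is_enhancer(peak, tss_by_chrom, ubiq_by_chrom, min_tss_distance=2000):
    """Apply the distal, non-ubiquitous criteria to one DHS peak.

    peak: (chrom, start, end)
    tss_by_chrom: dict mapping chrom -> list of TSS positions (hypothetical input)
    ubiq_by_chrom: dict mapping chrom -> list of (start, end) ubiquitous peaks
    """
    chrom, start, end = peak
    for tss in tss_by_chrom.get(chrom, []):
        if distance_to_position(start, end, tss) < min_tss_distance:
            return False  # too close to a TSS, so not distal
    for u_start, u_end in ubiq_by_chrom.get(chrom, []):
        if overlaps(start, end, u_start, u_end):
            return False  # overlaps a ubiquitously accessible peak
    return True
```

The linear scans keep the sketch short; for genome-scale peak sets, sorted positions or an interval index would be the natural replacement.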

Human/mouse syntenic intergenic loci used for gkm-align input

Human/mouse syntenic intergenic loci were derived using the list of orthologous protein coding genes (Supp Table 5) from the mouse ENCODE project.

Syntenic intergenic loci (Supp Table 6)

Short sequence matches within hg38/mm10 syntenic intergenic loci (download)

Output files of aligning the hg38 and mm10 genomes with gkm-align

We provide the output files (.coord) from aligning the hg38 and mm10 genomes using gkm-align. These files can be used to map any human sequence within conserved intergenic loci to the mouse genome. The alignment output files include:

Unweighted Alignment: Output generated with gkm-align using a weighting parameter c=0.

hg38-mm10_unweighted.coord

Cell-Specific Weighted Alignment: Outputs based on each of the 45 cell/tissue-specific enhancer models. For this category, we include gkm-align outputs using weighting parameters c=0.5 and c=c_max. The value of c_max was determined as the parameter that yielded the highest conserved enhancer mapping rate in our dataset.

Enhancer-model-weighted coordinate files

Tables of orthologous human/mouse enhancers and their conservation metrics for the 45 cell/tissue pairs.

gkm-align_mapping gkm-SVM_weighted_gkm-align_mapping

For human-to-mouse enhancer mappings, the tables contain the following 8 columns (conserved MYOD1 enhancer as an example):

1. Query human enhancer coordinate ID (chr11:17687946-17688246)

2. Mapped mouse element coordinate ID (chr7:46341521-46341821)

3. gkm-similarity of the human and mouse elements (0.53; range: 0-1)

4. gkm-SVM prediction score of mapped mouse element using human-trained gkm-SVM model (0.86; range: 0-1)

5. Human DNase-signal (# mapped reads); fold change from genomic average (164.45)

6. Mouse DNase-signal; fold change from genomic average (71.01)

7. Predicted mouse DNase-signal using columns 3-6; fold change from genomic average (66.23)

8. Unique identifier (top_0.10%_(A)muscle_rank_19)
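The eight columns listed above can be read into a typed record along these lines. This is a minimal sketch that assumes tab-delimited rows with no header line; the delimiter, the field names, and the helper names (`parse_coord`, `parse_mapping_row`) are assumptions for illustration and are not taken from the files themselves.

```python
import re
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int
    end: int


class MappingRow(NamedTuple):
    human: Interval             # column 1
    mouse: Interval             # column 2
    gkm_similarity: float       # column 3
    gkm_svm_score: float        # column 4
    human_dnase: float          # column 5
    mouse_dnase: float          # column 6
    predicted_mouse_dnase: float  # column 7
    identifier: str             # column 8


_COORD = re.compile(r"^([^:]+):(\d+)-(\d+)$")


def parse_coord(text):
    """Parse a coordinate ID such as 'chr11:17687946-17688246'."""
    match = _COORD.match(text.strip())
    if match is None:
        raise ValueError(f"not a coordinate ID: {text!r}")
    return Interval(match.group(1), int(match.group(2)), int(match.group(3)))


def parse_mapping_row(line, sep="\t"):
    """Parse one table row; sep is an assumed delimiter."""
    fields = line.rstrip("\n").split(sep)
    if len(fields) != 8:
        raise ValueError(f"expected 8 columns, found {len(fields)}")
    return MappingRow(
        parse_coord(fields[0]),
        parse_coord(fields[1]),
        float(fields[2]),
        float(fields[3]),
        float(fields[4]),
        float(fields[5]),
        float(fields[6]),
        fields[7],
    )
```

Applied to the MYOD1 example, the human and mouse elements both parse as 300 bp intervals.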

Coordinates of conserved enhancers (.bed)

We also provide human (hg38) and mouse (mm10) coordinates of conserved enhancers across the 45 cell/tissue pairs (coordinates extracted from the mapping tables above). For each table, we filtered gkm-align enhancer mappings by the predictive regression scores (table column 7) at three levels: most stringent (top 1%), stringent (top 10%), and permissive (top 100%). We then merged the resulting bed files across the cell types.

Each entry in the .bed files can be matched to its row in the mapping tables (table column 8) by the unique identifier provided in the fourth column of the bed files.

gkm-align_mapping gkm-SVM_weighted_gkm-align_mapping
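The filtering step described above (rank by table column 7, keep a top fraction, carry the identifier into the fourth bed column) can be sketched as follows. How ties and rounding were handled in the published files is not stated, so the ceiling-based cutoff here is an assumption, as are the function names `top_fraction` and `to_bed_line`. The sketch also copies the start and end values from the coordinate ID unchanged, which assumes the IDs already follow the bed convention.

```python
import math


def top_fraction(rows, fraction, score_index=6):
    """Keep the highest-scoring fraction of rows.

    rows: sequence of 8-field rows (lists or tuples); score_index=6 is column 7.
    fraction: value in (0, 1], e.g. 0.01 for the top 1%.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    ranked = sorted(rows, key=lambda r: float(r[score_index]), reverse=True)
    keep = math.ceil(fraction * len(ranked))  # assumed rounding rule
    return ranked[:keep]


def to_bed_line(coord_id, identifier):
    """Format 'chrom:start-end' plus an identifier as a 4-column bed line."""
    chrom, span = coord_id.rsplit(":", 1)
    start, end = span.split("-")
    return f"{chrom}\t{int(start)}\t{int(end)}\t{identifier}"
```

For example, `to_bed_line(row[0], row[7])` gives the human-side bed line for a retained row, and `to_bed_line(row[1], row[7])` gives the mouse-side line.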

Errata: The Equation labels 7 and 8 are mistakenly repeated in the Supplementary Note (p.47 of Supp. Info), duplicating some equation references.

If you have any questions, please contact Mike Beer at mbeer AT jhu DOT edu.