MSigDF: Molecular Signature Database (MSigDB) in a Data Frame

Stephen D. Turner1* and Enrique M. Toledo2**

1University of Virginia, Charlottesville VA, United States
2---

*vustephen@gmail.com
**enriquetoledo@gmail.com

21 June 2025

Abstract

This data package contains the Molecular Signature Database (MSigDB) for both human and predicted mouse orthologs in separate data frames (tibbles). Each data frame (msigdf.human and msigdf.mouse) contains four columns: the collection code (h or c1-c8 for human; mh, m1, m2, m3, m5, or m8 for mouse), the sub-collection code, the gene set name, and the gene symbol for each gene in that set. The msigdf.urls tibble links each gene set to its description on the Broad Institute’s website. Source code is available on GitHub.

Package

msigdf 2025.1

1 Data sources

Original data from the Broad Institute’s Molecular Signature Database (MSigDB; http://www.broad.mit.edu/gsea/msigdb/index.jsp), redistributed as separate gmt data files from the MSigDB.

The gene sets contained in the MSigDB are from a wide variety of sources, and relate to a variety of species, mostly human. To facilitate use of the MSigDB in our work, we have created a pure mouse version of the MSigDB by mapping all sets to mouse orthologs. A pure human version is also provided.

Procedure:

1. The current MSigDB v2025.1 gmt files were downloaded from the Broad FTP server.
2. This was done for both the human and the mouse gene sets.
3. Each collection was converted to a list in R and written to an RData file using save().

The script in data-raw/ shows how the data frames (tibbles) were created.
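As a rough illustration of the conversion described above, the sketch below parses a GMT file into a long data frame using only base R. It is not the package's actual script: the helper name read_gmt_long() and the toy gene sets are hypothetical. It assumes the usual GMT layout of one gene set per line, tab-separated, with the set name first, a description second, and the member genes after that.

```r
# Toy GMT content (made-up sets): name, description, then member genes,
# all separated by tabs.
gmt_lines <- c(
  "SET_A\thttp://example.org/SET_A\tGENE1\tGENE2\tGENE3",
  "SET_B\thttp://example.org/SET_B\tGENE2\tGENE4"
)
gmt_file <- tempfile(fileext = ".gmt")
writeLines(gmt_lines, gmt_file)

# Hypothetical helper (not part of msigdf): read a GMT file into a
# long data frame with one row per gene per gene set.
read_gmt_long <- function(path) {
  fields <- strsplit(readLines(path), "\t", fixed = TRUE)
  # Drop the name and description fields, keeping only the genes
  genes <- lapply(fields, function(x) x[-(1:2)])
  data.frame(
    geneset = rep(vapply(fields, `[`, character(1), 1), lengths(genes)),
    symbol = unlist(genes),
    stringsAsFactors = FALSE
  )
}

toy_long <- read_gmt_long(gmt_file)
toy_long
```

The result has the same long shape as the geneset and symbol columns of the package's data frames; the collection and sub-collection columns would have to be added per file.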

2 Example usage

There are three data frames (tibbles) in this package. The msigdf.human data frame has a column for the MSigDB collection (category_code) and one for the sub-collection (category_subcode, such as cc, bp, and mf for C5), followed by the gene set name (geneset) and the gene symbol (symbol). The data are tidy: each row is a single combination of collection, sub-collection, gene set, and gene symbol. The msigdf.mouse data frame has the same structure for mouse orthologs. The msigdf.urls data frame links the name of each gene set to its URL on the Broad’s website.

New C5 ontology information was added to the category subcode for easy filtering and consistency.

HPO: Human Phenotype Ontology
MF: GO Molecular Function ontology
BP: GO Biological Process ontology
CC: GO Cellular Component ontology

The data sets in this package have several million rows. The package imports the tibble package so they’re displayed nicely.

library(tidyverse)
library(msigdf)

Take a look:

msigdf.human %>% head()

## # A tibble: 6 × 4
##   category_code category_subcode geneset symbol 
##   <chr>         <chr>            <chr>   <chr>  
## 1 c1            all              MT      MT-ATP6
## 2 c1            all              MT      MT-ATP8
## 3 c1            all              MT      MT-CO1 
## 4 c1            all              MT      MT-CO2 
## 5 c1            all              MT      MT-CO3 
## 6 c1            all              MT      MT-CYB

msigdf.mouse %>% head()

## # A tibble: 6 × 4
##   category_code category_subcode geneset symbol 
##   <chr>         <chr>            <chr>   <chr>  
## 1 m1            all              MT      mt-Atp6
## 2 m1            all              MT      mt-Atp8
## 3 m1            all              MT      mt-Co1 
## 4 m1            all              MT      mt-Co2 
## 5 m1            all              MT      mt-Co3 
## 6 m1            all              MT      mt-Cytb

msigdf.urls %>% as.data.frame() %>% head()

##   category_code category_subcode  geneset
## 1            c1              all       MT
## 2            c1              all chr10p11
## 3            c1              all chr10p12
## 4            c1              all chr10p13
## 5            c1              all chr10p14
## 6            c1              all chr10p15
##                                                             url
## 1       http://software.broadinstitute.org/gsea/msigdb/cards/MT
## 2 http://software.broadinstitute.org/gsea/msigdb/cards/chr10p11
## 3 http://software.broadinstitute.org/gsea/msigdb/cards/chr10p12
## 4 http://software.broadinstitute.org/gsea/msigdb/cards/chr10p13
## 5 http://software.broadinstitute.org/gsea/msigdb/cards/chr10p14
## 6 http://software.broadinstitute.org/gsea/msigdb/cards/chr10p15

Just get the entries for the KEGG non-homologous end joining pathway:

msigdf.human %>% 
  filter(geneset=="KEGG_NON_HOMOLOGOUS_END_JOINING")

## # A tibble: 13 × 4
##    category_code category_subcode geneset                         symbol 
##    <chr>         <chr>            <chr>                           <chr>  
##  1 c2            cp.kegg_legacy   KEGG_NON_HOMOLOGOUS_END_JOINING DCLRE1C
##  2 c2            cp.kegg_legacy   KEGG_NON_HOMOLOGOUS_END_JOINING DNTT   
##  3 c2            cp.kegg_legacy   KEGG_NON_HOMOLOGOUS_END_JOINING FEN1   
##  4 c2            cp.kegg_legacy   KEGG_NON_HOMOLOGOUS_END_JOINING LIG4   
##  5 c2            cp.kegg_legacy   KEGG_NON_HOMOLOGOUS_END_JOINING MRE11  
##  6 c2            cp.kegg_legacy   KEGG_NON_HOMOLOGOUS_END_JOINING NHEJ1  
##  7 c2            cp.kegg_legacy   KEGG_NON_HOMOLOGOUS_END_JOINING POLL   
##  8 c2            cp.kegg_legacy   KEGG_NON_HOMOLOGOUS_END_JOINING POLM   
##  9 c2            cp.kegg_legacy   KEGG_NON_HOMOLOGOUS_END_JOINING PRKDC  
## 10 c2            cp.kegg_legacy   KEGG_NON_HOMOLOGOUS_END_JOINING RAD50  
## 11 c2            cp.kegg_legacy   KEGG_NON_HOMOLOGOUS_END_JOINING XRCC4  
## 12 c2            cp.kegg_legacy   KEGG_NON_HOMOLOGOUS_END_JOINING XRCC5  
## 13 c2            cp.kegg_legacy   KEGG_NON_HOMOLOGOUS_END_JOINING XRCC6

Some software, e.g., fgsea, might require gene sets as a named list of gene identifiers, where the name of each element in the list is the name of the pathway. This is how the data was originally structured, and we can return to it with group_by(), summarize(), and deframe(). Here, let’s use only the C2 (curated) sets, and after we convert the data into this named list format, get just the first few pathways, and in each of those, just display the first few gene symbols.

msigdf.human %>% 
  filter(category_code=="c2") %>% 
  select(geneset, symbol) %>% 
  group_by(geneset) %>% 
  summarize(symbol=list(symbol)) %>% 
  deframe() %>% 
  head() %>% 
  map(head)

## $ABBUD_LIF_SIGNALING_1_DN
## [1] "AHNAK"    "ALCAM"    "ANKRD40"  "ARID1A"   "BCKDHB"   "C16orf89"
## 
## $ABBUD_LIF_SIGNALING_1_UP
## [1] "ACAA2"   "ALDOC"   "ANXA8L1" "BCL3"    "CEBPB"   "CXCL14" 
## 
## $ABBUD_LIF_SIGNALING_2_DN
## [1] "CGA"    "CITED2" "NALCN"  "PITX2"  "PTHLH"  "SCN1A" 
## 
## $ABBUD_LIF_SIGNALING_2_UP
## [1] "ATP1B1"  "COL11A1" "DAB2"    "DCN"     "DIO2"    "EZR"    
## 
## $ABDELMOHSEN_ELAVL4_TARGETS
## [1] "BCL2"  "CAB39" "CASP3" "CDC42" "CDH2"  "DLG4" 
## 
## $ABDULRAHMAN_KIDNEY_CANCER_VHL_DN
## [1] "ACTA2"   "ALDH1A1" "ALDH3B1" "ITGB3BP" "MPPE1"   "MTMR3"
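The same long-to-named-list conversion can also be sketched in base R with split(). The toy data frame below is a made-up stand-in for msigdf.human, so the example runs without the package installed.

```r
# Toy stand-in for msigdf.human; set names and symbols are invented.
toy <- data.frame(
  category_code = c("c2", "c2", "c2", "h", "h"),
  geneset = c("SET_A", "SET_A", "SET_B", "SET_C", "SET_C"),
  symbol = c("GENE1", "GENE2", "GENE2", "GENE3", "GENE4"),
  stringsAsFactors = FALSE
)

# Keep one collection, then split the symbols by gene set name.
c2 <- toy[toy$category_code == "c2", ]
pathways <- split(c2$symbol, c2$geneset)
str(pathways)
```

split() returns a list named by gene set, with one character vector of symbols per element, which is the same shape as the output shown above.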

3 Further exploration

The number of gene sets in each collection, and therefore the number of rows tallied below, depends on how MSigDB constructs the collections for each organism.

Human collections of gene sets: https://www.gsea-msigdb.org/gsea/msigdb/human/collections.jsp

msigdf.human %>%
  group_by(category_code,category_subcode) %>% 
  tally()

## # A tibble: 26 × 3
## # Groups:   category_code [9]
##    category_code category_subcode      n
##    <chr>         <chr>             <int>
##  1 c1            all               42654
##  2 c2            cgp              393284
##  3 c2            cp.biocarta        4814
##  4 c2            cp.kegg_legacy    12795
##  5 c2            cp.kegg_medicus    9662
##  6 c2            cp.pid             8054
##  7 c2            cp.reactome       97590
##  8 c2            cp.wikipathways   37188
##  9 c3            mir              406258
## 10 c3            mir.mir_legacy    34178
## # ℹ 16 more rows

Mouse collections of gene sets: https://www.gsea-msigdb.org/gsea/msigdb/mouse/collections.jsp

msigdf.mouse %>%
  group_by(category_code,category_subcode) %>% 
  tally()

## # A tibble: 13 × 3
## # Groups:   category_code [6]
##    category_code category_subcode      n
##    <chr>         <chr>             <int>
##  1 m1            all               41569
##  2 m2            cgp              113138
##  3 m2            cp.biocarta        3958
##  4 m2            cp.reactome       71506
##  5 m2            cp.wikipathways    9617
##  6 m3            gtrd             163607
##  7 m3            mirdb            233394
##  8 m5            go.bp            649092
##  9 m5            go.cc            101934
## 10 m5            go.mf            109239
## 11 m5            mpt                2606
## 12 m8            all               47984
## 13 mh            all                7191
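Note that the tallies above count rows, i.e. one row per gene per gene set, rather than gene sets. To count distinct gene sets per sub-collection instead, something along these lines should work. The sketch uses base R and a made-up toy data frame in place of msigdf.mouse.

```r
# Toy stand-in for msigdf.mouse; set names and symbols are invented.
toy <- data.frame(
  category_code = c("m2", "m2", "m2", "m5", "m5"),
  category_subcode = c("cgp", "cgp", "cgp", "go.bp", "go.bp"),
  geneset = c("SET_A", "SET_A", "SET_B", "SET_C", "SET_C"),
  symbol = c("Gene1", "Gene2", "Gene2", "Gene3", "Gene4"),
  stringsAsFactors = FALSE
)

# Rows per sub-collection (what tally() reports)
n_rows <- aggregate(symbol ~ category_code + category_subcode,
                    data = toy, FUN = length)

# Distinct gene sets per sub-collection
n_sets <- aggregate(geneset ~ category_code + category_subcode,
                    data = toy, FUN = function(x) length(unique(x)))

n_rows
n_sets
```

With dplyr, the equivalent would be distinct(category_code, category_subcode, geneset) followed by count(category_code, category_subcode).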

Get the URL for the hallmark set with the fewest genes (Notch signaling). Optionally, pipe this to browseURL() to open it in your browser.

msigdf.human %>%
  filter(category_code=="h") %>%
  count(geneset) %>%
  arrange(n) %>%
  head(1) %>%
  inner_join(msigdf.urls, by="geneset") %>%
  pull(url)

## [1] "http://software.broadinstitute.org/gsea/msigdb/cards/HALLMARK_NOTCH_SIGNALING"
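The URLs displayed in this vignette all share one prefix followed by the gene set name. Assuming that pattern holds for every gene set (it is only inferred from the rows shown here, not guaranteed), a URL could be rebuilt without the join. The helper card_url() is hypothetical and not part of the package.

```r
# Hypothetical helper: rebuild a gene set URL from the prefix visible
# in the msigdf.urls rows shown in this vignette.
card_url <- function(geneset) {
  paste0("http://software.broadinstitute.org/gsea/msigdb/cards/", geneset)
}

url <- card_url("HALLMARK_NOTCH_SIGNALING")
url

# Open the page only when running interactively.
if (interactive()) utils::browseURL(url)
```

Looking the URL up in msigdf.urls, as done above, remains the safer route because it does not depend on the assumed pattern.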

Just look at the number of genes in each KEGG pathway (sorted descending by the number of genes in that pathway):

msigdf.human %>%
  filter(category_code=="c2" & grepl("^KEGG_", geneset)) %>%
  count(geneset) %>% 
  arrange(desc(n))

## # A tibble: 844 × 2
##    geneset                                          n
##    <chr>                                        <int>
##  1 KEGG_OLFACTORY_TRANSDUCTION                    389
##  2 KEGG_PATHWAYS_IN_CANCER                        325
##  3 KEGG_NEUROACTIVE_LIGAND_RECEPTOR_INTERACTION   272
##  4 KEGG_MAPK_SIGNALING_PATHWAY                    267
##  5 KEGG_CYTOKINE_CYTOKINE_RECEPTOR_INTERACTION    264
##  6 KEGG_REGULATION_OF_ACTIN_CYTOSKELETON          213
##  7 KEGG_FOCAL_ADHESION                            199
##  8 KEGG_CHEMOKINE_SIGNALING_PATHWAY               188
##  9 KEGG_HUNTINGTONS_DISEASE                       182
## 10 KEGG_ENDOCYTOSIS                               181
## # ℹ 834 more rows

Session info

## R version 4.5.0 (2025-04-11)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sequoia 15.5
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Europe/London
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] msigdf_2025.1    lubridate_1.9.4  forcats_1.0.0    stringr_1.5.1   
##  [5] dplyr_1.1.4      purrr_1.0.4      readr_2.1.5      tidyr_1.3.1     
##  [9] tibble_3.3.0     ggplot2_3.5.2    tidyverse_2.0.0  knitr_1.50      
## [13] BiocStyle_2.36.0
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6        jsonlite_2.0.0      compiler_4.5.0     
##  [4] BiocManager_1.30.26 tidyselect_1.2.1    jquerylib_0.1.4    
##  [7] scales_1.4.0        yaml_2.3.10         fastmap_1.2.0      
## [10] R6_2.6.1            generics_0.1.4      bookdown_0.43      
## [13] tzdb_0.5.0          bslib_0.9.0         pillar_1.10.2      
## [16] RColorBrewer_1.1-3  rlang_1.1.6         utf8_1.2.6         
## [19] stringi_1.8.7       cachem_1.1.0        xfun_0.52          
## [22] sass_0.4.10         timechange_0.3.0    cli_3.6.5          
## [25] withr_3.0.2         magrittr_2.0.3      digest_0.6.37      
## [28] grid_4.5.0          rstudioapi_0.17.1   hms_1.1.3          
## [31] lifecycle_1.0.4     vctrs_0.6.5         evaluate_1.0.3     
## [34] glue_1.8.0          farver_2.1.2        rmarkdown_2.29     
## [37] tools_4.5.0         pkgconfig_2.0.3     htmltools_0.5.8.1