PubMatrixPython — full reference notebook¶

Complete walkthrough of every parameter and feature. Mirrors the PubMatrixR documentation.

Reference: Becker et al. (2003) BMC Bioinformatics 4:61. https://doi.org/10.1186/1471-2105-4-61

Setup¶

In [1]:
import sys
sys.path.insert(0, '..')

import pandas as pd
import matplotlib.pyplot as plt

from pubmatrix import (
    pubmatrix,
    pubmatrix_from_file,
    plot_pubmatrix_heatmap,
    pubmatrix_heatmap,
)

NCBI API key¶

Without a key NCBI allows 3 requests/second; with a key, 10/second. Get one at https://account.ncbi.nlm.nih.gov/

API_KEY = "YOUR_KEY_HERE"

Leave as None to run without one.

In [2]:
API_KEY = None  # replace with your key to increase rate limit
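
The rate limits above put a floor on how long a matrix takes, because the progress bars in this notebook show one query per cell (9 queries for a 3 × 3 matrix, 49 for a 7 × 7 matrix). A minimal sketch of that arithmetic, assuming one request per cell and no other overhead; `min_query_seconds` is a hypothetical helper, not part of pubmatrix:

```python
def min_query_seconds(n_a, n_b, has_api_key=False):
    """Lower bound on run time for an n_b x n_a matrix under NCBI rate limits."""
    requests_per_second = 10 if has_api_key else 3
    n_queries = n_a * n_b  # one query per (A term, B term) cell
    return n_queries / requests_per_second

print(min_query_seconds(3, 3))                    # 3 x 3 matrix without a key
print(min_query_seconds(7, 7, has_api_key=True))  # 7 x 7 matrix with a key
```

Observed run times will be longer than this bound. The keyless runs in this notebook report roughly 2 to 2.6 queries per second.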

pubmatrix() — core query function¶

pubmatrix(
    A,                   # list of str — column terms
    B,                   # list of str — row terms
    api_key   = None,    # NCBI API key
    database  = "pubmed",# "pubmed" or "pmc"
    daterange = None,    # [start_year, end_year]
    outfile   = None,    # base filename for export
    export_format = None,# None | "csv" | "ods"
    n_tries   = 2,       # retries on network failure
)

Returns a pandas.DataFrame — rows = B terms, columns = A terms, values = publication counts.
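
Each cell corresponds to one count query against NCBI. The sketch below shows how such a query could be expressed as an E-utilities `esearch` URL. It assumes the two terms are combined with `AND`, which this notebook does not confirm; `build_count_url` is a hypothetical helper and not part of pubmatrix.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_count_url(term_a, term_b, database="pubmed", api_key=None):
    """Build an esearch URL that asks only for the hit count of one term pair."""
    params = {
        "db": database,
        "term": f"{term_a} AND {term_b}",  # assumed combination rule
        "rettype": "count",                # return the count, not the ID list
        "retmode": "json",
    }
    if api_key:
        params["api_key"] = api_key        # only sent when a key is provided
    return f"{ESEARCH}?{urlencode(params)}"

print(build_count_url("WNT1", "obesity"))
```

The function only builds the URL and makes no network request, so it can be run without a key.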

Basic usage¶

In [3]:
A = ["WNT1", "WNT2", "CTNNB1"]
B = ["obesity", "diabetes", "cancer"]

result = pubmatrix(A=A, B=B, api_key=API_KEY)
result
Querying NCBI: 100%|███████████████████████████████████████| 9/9 [00:04<00:00,  1.85query/s]
Out[3]:
          WNT1  WNT2  CTNNB1
obesity     63     6      90
diabetes   118    18     267
cancer    1278   297    8363

Larger matrix — 7 × 7 WNT × obesity genes¶

In [4]:
wnt_genes     = ["WNT1", "WNT2", "WNT3A", "WNT5A", "WNT7B", "CTNNB1", "DVL1"]
obesity_genes = ["LEPR", "ADIPOQ", "PPARG", "TNF", "IL6", "ADRB2", "INSR"]

result_wnt = pubmatrix(A=wnt_genes, B=obesity_genes, api_key=API_KEY)
result_wnt
Querying NCBI: 100%|█████████████████████████████████████| 49/49 [00:19<00:00,  2.55query/s]
Out[4]:
        WNT1  WNT2  WNT3A  WNT5A  WNT7B  CTNNB1  DVL1
LEPR       6     0      0      2      0       4     0
ADIPOQ     2     0      0      6      0       9     0
PPARG      2     3      7      5      1      26     0
TNF       83     4    110    123      6     216     3
IL6       75     7     87    143      9     151     3
ADRB2      1     0      0      1      0       0     0
INSR       1     1      1      1      0       4     0

database parameter¶

"pubmed" (default) searches PubMed citation records (titles, abstracts, and indexing terms). "pmc" searches the full text of articles in PubMed Central, so counts are typically much higher, as the outputs below show.

In [5]:
result_pmc = pubmatrix(A=A, B=B, database="pmc", api_key=API_KEY)
result_pmc
Querying NCBI: 100%|███████████████████████████████████████| 9/9 [00:03<00:00,  2.58query/s]
Out[5]:
           WNT1  WNT2  CTNNB1
obesity    2765  1013    4904
diabetes   4396  1610    7073
cancer    12695  5089   29362
In [6]:
# Side-by-side comparison
print("PubMed:")
print(result)
print("\nPMC:")
print(result_pmc)
PubMed:
          WNT1  WNT2  CTNNB1
obesity     63     6      90
diabetes   118    18     267
cancer    1278   297    8363

PMC:
           WNT1  WNT2  CTNNB1
obesity    2765  1013    4904
diabetes   4396  1610    7073
cancer    12695  5089   29362

daterange parameter¶

Filter results to a publication year range. Useful for tracking how co-occurrence changes over time.

In [7]:
result_2000_2010 = pubmatrix(A=A, B=B, daterange=[2000, 2010], api_key=API_KEY)
result_2011_2024 = pubmatrix(A=A, B=B, daterange=[2011, 2024], api_key=API_KEY)

print("2000–2010:")
print(result_2000_2010)
print("\n2011–2024:")
print(result_2011_2024)
Querying NCBI: 100%|███████████████████████████████████████| 9/9 [00:03<00:00,  2.59query/s]
Querying NCBI: 100%|███████████████████████████████████████| 9/9 [00:03<00:00,  2.45query/s]
2000–2010:
          WNT1  WNT2  CTNNB1
obesity      1     0       5
diabetes    11     4      41
cancer     361    82    2100

2011–2024:
          WNT1  WNT2  CTNNB1
obesity     60     6      79
diabetes   103    12     201
cancer     768   169    5351
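
To follow a trend over more than two periods, a year span can be cut into consecutive windows and each window queried in turn. A small sketch of the splitting step; `year_windows` is a hypothetical helper, and it assumes that both ends of `daterange` are inclusive, which this notebook does not state:

```python
def year_windows(start_year, end_year, step):
    """Split [start_year, end_year] into consecutive inclusive windows of `step` years."""
    windows = []
    for first in range(start_year, end_year + 1, step):
        last = min(first + step - 1, end_year)  # clip the final window
        windows.append([first, last])
    return windows

print(year_windows(2000, 2024, 5))
# [[2000, 2004], [2005, 2009], [2010, 2014], [2015, 2019], [2020, 2024]]
```

Each window has the same `[start_year, end_year]` shape as the `daterange` argument used above, so it could be passed as `daterange=window` in a loop.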

                            

Export to CSV¶

Saves a .csv where each cell is an Excel HYPERLINK formula linking directly to the PubMed search for that term pair.

In [8]:
pubmatrix(A=A, B=B, outfile="output", export_format="csv", api_key=API_KEY)
Querying NCBI: 100%|███████████████████████████████████████| 9/9 [00:03<00:00,  2.58query/s]
Saved CSV to output.csv
Out[8]:
          WNT1  WNT2  CTNNB1
obesity     63     6      90
diabetes   118    18     267
cancer    1278   297    8363
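
For orientation, a cell in the exported file could look like the output of the sketch below. The exact formula that pubmatrix writes is not shown in this notebook, so the URL layout, the `AND` combination, and the helper name `hyperlink_cell` are all assumptions.

```python
from urllib.parse import quote_plus

def hyperlink_cell(term_a, term_b, count):
    """Format one count as a spreadsheet HYPERLINK formula (assumed layout)."""
    query = quote_plus(f"{term_a} AND {term_b}")  # assumed combination rule
    url = f"https://pubmed.ncbi.nlm.nih.gov/?term={query}"
    return f'=HYPERLINK("{url}", {count})'        # link target, then displayed value

print(hyperlink_cell("WNT1", "obesity", 63))
# =HYPERLINK("https://pubmed.ncbi.nlm.nih.gov/?term=WNT1+AND+obesity", 63)
```

Because such formulas contain commas and double quotes, they need to be quoted when written to CSV; Python's `csv` module does this automatically.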

Export to ODS¶

Same as CSV but in OpenDocument Spreadsheet format, with clickable hyperlinks in LibreOffice / OpenOffice.

In [9]:
pubmatrix(A=A, B=B, outfile="output", export_format="ods", api_key=API_KEY)
Querying NCBI: 100%|███████████████████████████████████████| 9/9 [00:03<00:00,  2.56query/s]
Saved ODS to output.ods
Out[9]:
          WNT1  WNT2  CTNNB1
obesity     63     6      90
diabetes   118    18     267
cancer    1278   297    8363

n_tries — retry on network failure¶

Default is 2. Increase for unstable connections.

In [10]:
result_retry = pubmatrix(A=A, B=B, n_tries=5, api_key=API_KEY)
result_retry
Querying NCBI: 100%|███████████████████████████████████████| 9/9 [00:03<00:00,  2.57query/s]
Out[10]:
          WNT1  WNT2  CTNNB1
obesity     63     6      90
diabetes   118    18     267
cancer    1278   297    8363
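
The general pattern behind a setting like `n_tries` is a retry loop around each request. The sketch below is a generic illustration, not pubmatrix's implementation: it treats `n_tries` as the total number of attempts and adds a growing pause between them, both of which are assumptions. `call_with_retries` and `delay_seconds` are hypothetical names.

```python
import time

def call_with_retries(func, n_tries=2, delay_seconds=1.0):
    """Call func() up to n_tries times, sleeping between failed attempts."""
    if n_tries < 1:
        raise ValueError("n_tries must be at least 1")
    last_error = None
    for attempt in range(1, n_tries + 1):
        try:
            return func()
        except OSError as error:  # network errors such as URLError derive from OSError
            last_error = error
            if attempt < n_tries:
                time.sleep(delay_seconds * attempt)  # wait longer after each failure
    raise last_error  # every attempt failed
```

A caller would wrap a single request, for example `call_with_retries(lambda: fetch(url), n_tries=5)`, where `fetch` stands for whatever function performs the request.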

pubmatrix_from_file() — load terms from a text file¶

File format — A terms first, # separator, then B terms:

WNT1
WNT2
CTNNB1
#
obesity
diabetes
cancer

All keyword arguments are passed through to pubmatrix().

In [11]:
sample_terms = "WNT1\nWNT2\nCTNNB1\n#\nobesity\ndiabetes\ncancer\n"
with open("sample_terms.txt", "w") as f:
    f.write(sample_terms)

result_file = pubmatrix_from_file("sample_terms.txt", api_key=API_KEY)
result_file
Querying NCBI: 100%|███████████████████████████████████████| 9/9 [00:03<00:00,  2.55query/s]
Out[11]:
          WNT1  WNT2  CTNNB1
obesity     63     6      90
diabetes   118    18     267
cancer    1278   297    8363
In [12]:
# With optional arguments
result_file_dated = pubmatrix_from_file(
    "sample_terms.txt",
    daterange=[2015, 2024],
    api_key=API_KEY,
)
result_file_dated
Querying NCBI: 100%|███████████████████████████████████████| 9/9 [00:03<00:00,  2.41query/s]
Out[12]:
          WNT1  WNT2  CTNNB1
obesity     42     4      68
diabetes    89    11     179
cancer     579   136    4406
In [13]:
import os
os.remove("sample_terms.txt")
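
The term-file format described in this section is simple enough to parse by hand, which can help when checking a file before running a large query. A minimal sketch, assuming that blank lines are ignored and surrounding whitespace is stripped (`pubmatrix_from_file` may treat these cases differently); `parse_term_file` is a hypothetical helper:

```python
def parse_term_file(text):
    """Split term-file contents into (A terms, B terms) at the '#' separator line."""
    a_terms, b_terms = [], []
    current = a_terms
    for line in text.splitlines():
        term = line.strip()
        if not term:
            continue           # skip blank lines
        if term == "#":
            current = b_terms  # everything after the separator is a B term
            continue
        current.append(term)
    return a_terms, b_terms

a_terms, b_terms = parse_term_file("WNT1\nWNT2\nCTNNB1\n#\nobesity\ndiabetes\ncancer\n")
print(a_terms)  # ['WNT1', 'WNT2', 'CTNNB1']
print(b_terms)  # ['obesity', 'diabetes', 'cancer']
```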

Heatmap visualisation¶

Cell values show overlap percentage:

overlap = (intersection / union) × 100
union   = row_total + col_total - intersection

This is a Jaccard-style normalisation — it accounts for terms that appear frequently on their own, so a pair like (CTNNB1, cancer) is not inflated just because both terms are common.
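
The formula can be reproduced directly with pandas. The sketch below assumes that `row_total` and `col_total` are the row and column sums of the count matrix itself. The notebook does not define them, and the library may instead use each term's own publication count, so the numbers may differ from the plotted values. `overlap_percent` is a hypothetical helper.

```python
import pandas as pd

def overlap_percent(counts):
    """Jaccard-style overlap: intersection / (row_total + col_total - intersection) * 100."""
    row_total = counts.sum(axis=1).to_numpy()[:, None]  # assumed: sum over each row
    col_total = counts.sum(axis=0).to_numpy()[None, :]  # assumed: sum over each column
    totals = pd.DataFrame(row_total + col_total, index=counts.index, columns=counts.columns)
    union = totals - counts
    return counts / union.where(union != 0) * 100       # NaN where the union is empty

# Counts copied from the basic-usage result above.
counts = pd.DataFrame(
    [[63, 6, 90], [118, 18, 267], [1278, 297, 8363]],
    index=["obesity", "diabetes", "cancer"],
    columns=["WNT1", "WNT2", "CTNNB1"],
)
print(overlap_percent(counts).round(2))
```

Under this assumption the (obesity, WNT1) cell is 63 / (159 + 1459 - 63) × 100, or about 4%.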

pubmatrix_heatmap() — quick plot with defaults¶

In [14]:
pubmatrix_heatmap(result)
Out[14]:
<Axes: title={'center': 'PubMatrix Results'}>

plot_pubmatrix_heatmap() — full control¶

plot_pubmatrix_heatmap(
    matrix,
    title          = "PubMatrix Co-occurrence Heatmap",
    cluster_rows   = True,
    cluster_cols   = True,
    show_numbers   = True,
    color_palette  = None,   # list of hex colours; defaults to red gradient
    filename       = None,   # save to PNG if set
    width          = 10,
    height         = 8,
    scale_font     = True,
)
In [15]:
plot_pubmatrix_heatmap(
    result,
    title="WNT Genes × Disease Co-occurrence",
    cluster_rows=True,
    cluster_cols=True,
    show_numbers=True,
    width=8,
    height=5,
)
Out[15]:
<Axes: title={'center': 'WNT Genes × Disease Co-occurrence'}>

Clustering disabled¶

In [16]:
plot_pubmatrix_heatmap(
    result,
    title="No clustering",
    cluster_rows=False,
    cluster_cols=False,
)
Out[16]:
<Axes: title={'center': 'No clustering'}>

Numbers hidden¶

In [17]:
plot_pubmatrix_heatmap(
    result,
    title="No cell annotations",
    show_numbers=False,
)
Out[17]:
<Axes: title={'center': 'No cell annotations'}>

Custom colour palette¶

Pass any list of hex colours — gradient is interpolated between them.

In [18]:
plot_pubmatrix_heatmap(
    result,
    title="Blue gradient",
    color_palette=["#deebf7", "#9ecae1", "#3182bd"],
)
Out[18]:
<Axes: title={'center': 'Blue gradient'}>
In [19]:
plot_pubmatrix_heatmap(
    result,
    title="Green gradient",
    color_palette=["#e5f5e0", "#a1d99b", "#31a354"],
)
Out[19]:
<Axes: title={'center': 'Green gradient'}>
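
One common way to turn a list of hex colours into a continuous gradient is matplotlib's `LinearSegmentedColormap.from_list`. Whether `plot_pubmatrix_heatmap()` uses this exact call is an assumption; the sketch only illustrates the interpolation idea with the blue palette from above.

```python
from matplotlib.colors import LinearSegmentedColormap, to_hex

palette = ["#deebf7", "#9ecae1", "#3182bd"]
cmap = LinearSegmentedColormap.from_list("blue_gradient", palette)

# Sample the gradient at its start, midpoint and end.
for position in (0.0, 0.5, 1.0):
    print(position, to_hex(cmap(position)))
```

The two ends of the colormap reproduce the first and last colours in the list, and positions in between are blended linearly.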

Save to PNG¶

In [20]:
plot_pubmatrix_heatmap(
    result,
    title="Saved heatmap",
    filename="heatmap_full.png",
    width=8,
    height=5,
)
Saved heatmap to heatmap_full.png
Out[20]:
<Axes: title={'center': 'Saved heatmap'}>

Working with the result DataFrame¶

The return value is a plain pandas.DataFrame — all standard pandas operations apply.

Summary statistics¶

In [21]:
print("Column sums (total co-occurrences per A term):")
print(result.sum(axis=0))
print()
print("Row sums (total co-occurrences per B term):")
print(result.sum(axis=1))
Column sums (total co-occurrences per A term):
WNT1      1459
WNT2       321
CTNNB1    8720
dtype: int64

Row sums (total co-occurrences per B term):
obesity      159
diabetes     403
cancer      9938
dtype: int64

Bar charts — co-occurrences per term¶

In [22]:
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

col_totals = result.sum(axis=0).sort_values(ascending=False)
axes[0].bar(col_totals.index, col_totals.values, color="#de2d26")
axes[0].set_title("Co-occurrences per column term (A)")
axes[0].set_ylabel("Total publication count")
axes[0].tick_params(axis="x", rotation=45)

row_totals = result.sum(axis=1).sort_values(ascending=False)
axes[1].bar(row_totals.index, row_totals.values, color="#3182bd")
axes[1].set_title("Co-occurrences per row term (B)")
axes[1].set_ylabel("Total publication count")
axes[1].tick_params(axis="x", rotation=45)

plt.tight_layout()
plt.show()
[Figure: two side-by-side bar charts of total publication count, one per column term (A) and one per row term (B)]

Temporal trend — comparing two date windows¶

In [23]:
# Reuse results computed above
diff = result_2011_2024 - result_2000_2010
print("Absolute change in co-occurrence counts (2011–2024 vs 2000–2010):")
diff
Absolute change in co-occurrence counts (2011–2024 vs 2000–2010):
Out[23]:
          WNT1  WNT2  CTNNB1
obesity     59     6      74
diabetes    92     8     160
cancer     407    87    3251
In [24]:
plot_pubmatrix_heatmap(
    diff,
    title="Change in co-occurrences: 2011–2024 vs 2000–2010",
    color_palette=["#f7f7f7", "#fc8d59", "#d73027"],
    cluster_rows=False,
    cluster_cols=False,
    width=7,
    height=4,
)
Out[24]:
<Axes: title={'center': 'Change in co-occurrences: 2011–2024 vs 2000–2010'}>
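
The two windows differ in length (2000–2010 covers 11 years, 2011–2024 covers 14), so part of the absolute change reflects the longer second window. A sketch of a per-year comparison, using the counts printed earlier in this notebook and assuming both ends of each range are inclusive:

```python
import pandas as pd

index = ["obesity", "diabetes", "cancer"]
columns = ["WNT1", "WNT2", "CTNNB1"]
early = pd.DataFrame([[1, 0, 5], [11, 4, 41], [361, 82, 2100]], index=index, columns=columns)
late = pd.DataFrame([[60, 6, 79], [103, 12, 201], [768, 169, 5351]], index=index, columns=columns)

# Convert window totals to average publications per year.
early_per_year = early / (2010 - 2000 + 1)  # 11 years
late_per_year = late / (2024 - 2011 + 1)    # 14 years

rate_change = (late_per_year - early_per_year).round(1)
print(rate_change)
```

Positive values mean the pair was published on more often per year in the later window.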

Save results to CSV manually¶

In [25]:
result.to_csv("my_results.csv")
print("Saved.")
Saved.