Statistical analysis of ΔGC–size relationships across bacterial taxa

We quantified how GC-content divergence (ΔGC) changes with ER size across all bacterial taxa (phylum, class, order, family, genus) that contained at least 100 ERs in the dataset.
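The taxon-level filter can be sketched as follows. This is a minimal illustration, not the original pipeline: the record layout (one dict per ER, keyed by rank name) and the function name `eligible_taxa` are assumptions made for the example. Only the five ranks and the threshold of 100 ERs come from the text.

```python
from collections import Counter

RANKS = ("phylum", "class", "order", "family", "genus")
MIN_ERS_PER_TAXON = 100  # threshold stated in the text


def eligible_taxa(ers, min_ers=MIN_ERS_PER_TAXON):
    """Return the set of (rank, taxon) pairs represented by at least `min_ers` ERs.

    `ers` is an iterable of dicts, one per ER, mapping a rank name to the
    taxon assigned at that rank (hypothetical layout, assumed for this sketch).
    """
    counts = Counter(
        (rank, er[rank])
        for er in ers
        for rank in RANKS
        if er.get(rank)  # skip missing or empty rank assignments
    )
    return {key for key, n in counts.items() if n >= min_ers}
```

Counting by (rank, taxon) pair rather than by name alone keeps identically named taxa at different ranks apart.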

Splitting ΔGC into two subsets

Because ΔGC takes both negative and positive values, a single regression model fitted across all values would obscure opposite trends on either side of zero. To avoid this and to detect asymmetric patterns, the analysis was performed separately for ERs with ΔGC < 0 and for ERs with ΔGC > 0. A regression model was fitted for a given side only if that side contained at least 5 ERs.
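A minimal sketch of this split, assuming each ER of a taxon is available as a (ΔGC, % chromosome size) pair. The function name `split_by_sign` and the tuple layout are illustrative assumptions. The two strict inequalities and the minimum of 5 ERs per side are taken from the text.

```python
MIN_ERS_PER_SIDE = 5  # minimum stated in the text


def split_by_sign(rows, min_per_side=MIN_ERS_PER_SIDE):
    """Split one taxon's ERs into ΔGC < 0 and ΔGC > 0 subsets.

    `rows` is an iterable of (delta_gc, pct_chromosome) tuples
    (hypothetical layout). Sides with fewer than `min_per_side` ERs
    are dropped, so the result may contain zero, one, or two keys.
    """
    sides = {"negative": [], "positive": []}
    for delta_gc, pct_chromosome in rows:
        if delta_gc < 0:
            sides["negative"].append((delta_gc, pct_chromosome))
        elif delta_gc > 0:
            sides["positive"].append((delta_gc, pct_chromosome))
        # A value of exactly 0 satisfies neither strict inequality
        # and is therefore assigned to neither subset.
    return {
        name: subset
        for name, subset in sides.items()
        if len(subset) >= min_per_side
    }
```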

Regression model

For each taxon and each ΔGC side, we fitted an ordinary least squares (OLS) linear regression of the form:

ΔGC ∼ log10(% chromosome size)

The dependent variable is ΔGC and the predictor is log10(% chromosome size). For each model, we extracted the regression slope and the p-value associated with the log10(% chromosome size) term. A trend was considered statistically significant when p < 0.05.
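One way to fit such a model for a single taxon and ΔGC side is sketched below, following the formula ΔGC ∼ log10(% chromosome size) given above. The choice of `scipy.stats.linregress` is an assumption, since the software actually used is not stated here, and `fit_side` is a hypothetical helper name. For a simple one-predictor OLS fit, `linregress` reports the slope and a two-sided p-value for the null hypothesis that the slope is zero.

```python
import numpy as np
from scipy.stats import linregress

ALPHA = 0.05  # significance threshold stated in the text


def fit_side(delta_gc, pct_chromosome, alpha=ALPHA):
    """Fit ΔGC ~ log10(% chromosome size) by OLS for one taxon and one side.

    `delta_gc` and `pct_chromosome` are equal-length sequences; the
    percentages must be strictly positive so that log10 is defined.
    """
    x = np.log10(np.asarray(pct_chromosome, dtype=float))  # predictor
    y = np.asarray(delta_gc, dtype=float)                  # response
    fit = linregress(x, y)
    return {
        "n": len(y),
        "slope": float(fit.slope),
        "p_value": float(fit.pvalue),
        "significant": bool(fit.pvalue < alpha),
    }
```

The function would be called once per taxon and per side, for example on the subsets that pass the minimum of 5 ERs.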

Interpretation of slope direction

Interpretations were generated independently for the ΔGC < 0 and ΔGC > 0 subsets:

Visualisation

For every taxon with at least 100 ERs:

All figures were combined into a single interactive HTML document:

GC_size_slopes_by_taxa.html

Statistical summary table

For each taxon, the following statistics were recorded:

These results are provided as an interactive sortable table:

GC_size_slope_statistics.html