Tissue-specific impacts of aging and genetics on gene expression patterns in humans – Nature.com

Posted: October 4, 2022 at 2:00 am

Data collection age groupings

We downloaded gene expression data for multiple individuals and tissues from GTEx V810, which were previously aligned and processed against the hg19 human genome. Tissues were included in the analysis if they had >100 individuals in both the age 55 and <55 cohorts (Supplementary Fig.2). For a given tissue, genes were included if they had >0.1 TPM in 20% of samples and 6 reads in 20% of samples, following GTExs eQTL analysis pipeline. To compare gene expression heritability across individuals of different ages, for some analyses we split the GTEx data for each tissue into two age groups, "young" and "old," based on the median age of individuals in the full dataset, which was 55 (Supplementary Fig.1). Within each tissue dataset, we then equalized the number of individuals in the young and old groups by randomly downsampling the larger group, to ensure that our models were equally powered for the two age groups.

We analyzed existing precomputed PEER factors available from GTEx to check for correlations between these hidden covariates and age. In particular, we fit a linear regression between age and each hidden covariate and identified significant age correlations using an F-statistic (Supplementary Fig.3). Because some of the covariates were correlated with age, we generated age-independent hidden covariates of gene expression to remove batch and other confounding effects on gene expression while retaining age related variation. In particular, we first removed age contributions to gene expression by regressing gene expression on age and then ran PEER on the age-independent residual gene expression to generate 15 age-independent hidden PEER factors.

Using the binary age groups defined above, we assessed the relative significance of eQTLs in old and young individuals by carrying out separate assessment of eQTLs identified by GTEx. We report the number of genes included in analysis for each tissue (Supplementary Table1). For each gene in each tissue and each age group, we regressed the GTEx pre-normalized expression levels on the genotype of the lead SNP (identified by GTEx, MAF>0.01) using 5 PCs, 15 PEER factors, sex, PCR protocol and sequencing platform as covariates, following the GTEx best practices. We confirmed our results using both our recomputed PEER factors as well as the PEER factors provided by GTEx (Supplementary Fig.5). To test for significant differences in genetic associations with gene expression between the old and young age groups, we compared the p-value distributions between these groups for all genes and all SNPs in a given tissue using Welchs t-test. To investigate the validity of the age cutoff used for these binary age groups, we replicated the eQTL analysis using two additional age cutoffs of 45 and 65 years old. We observed the same trends in both cases; however, statistical power decreased due to smaller sample sizes in the resulting age bins, leading to a non-significant result for age cutoff 45 (Supplementary Fig.40).

To quantify differences in gene expression between individuals, we computed the pairwise distance for all pairs of individuals in an age group using the square root of Jensen-Shannon Divergence (JSD) distance metric, which measures the similarity of two probability distributions. Here we applied JSD between pairs of individuals transcriptome vectors containing the gene expression values for each gene, which we converted to a distribution by normalizing by the sum of the entries in the vector. For two individuals transcriptome distributions, the JSD can be calculated as:

$${{{{{{{rm{JSD}}}}}}}}({P}_{1},;{P}_{2})=Hleft(frac{1}{2}{P}_{1}+frac{1}{2}{P}_{2}right)-frac{1}{2}(H({P}_{1})+H({P}_{2}))$$

(1)

where Pi is the distribution for individual i and H is the Shannon entropy function:

$$H(X)=-mathop{sum }limits_{i=1}^{n}P({x}_{i}){log }_{2}(P({x}_{i}))$$

(2)

JSD is known to be a robust metric that is less sensitive to noise when calculating distance compared to traditional metrics such as Euclidean distance and correlation. It has been shown that JSD metrics and other approaches yield similar results but that JSD is more robust to outliers12. The square root of the raw JSD value follows the triangle inequality, enabling us to treat it as a distance metric.

In addition to comparing JSD between the two age groups defined above, "young" and "old", we also binned all GTEx individuals into 6 age groups, from 20 to 80 years old with an increment of 10 years. We then computed pairwise distance and average age for each pair of individuals within each bin using the square root of JSD as the distance metric. We applied a linear regression model of JSD versus age to obtain slopes, confidence intervals, and p-values.

To analyze whether cell type composition affects age-associated expression changes, we utilized the tool CIBERSORTx16 to estimate cell type composition and individual cell type expression levels in GTEx whole blood. Cell type composition estimates were computed using CIBERSORTx regular mode. Individual cell type expression level estimates were computed using CIBERSORTx high resolution mode. We then repeated our JSD and eQTL analyses on each cell type independently (see JSD and eQTL sections for details). In addition, to analyze tissue-specific differences in cell type composition, we referred to a previous study36 that computed cell type composition for different GTEx tissues using CIBERSORTx. We applied the JSD metric to each tissue, using the cell type composition vector as the distribution. Additionally, we applied the Breusch-Pagan test to compute heteroskedasicity coefficients and p-values with respect to age, after inverse logit transformation to give an approximately Gaussian distribution (Supplementary Fig.44) (see section on heteroskedastic gene expression).

We used the Breusch-Pagan test to call heteroskedastic gene expression with age. For each gene and tissue, we computed gene expression residuals by regressing out age-correlated PEER factors, other GTEx covariates, and age. To test for age-related heteroskedasticity, we squared these residuals and divided by the mean, regressed them against age, and looked at the age effect size (het). We called significantly heteroskedastic genes using a two-sided t-test with the null hypothesis that the het is zero. The Benjamini-Hochberg procedure was used to control for false positives. To determine which tissues have more genes with increasing gene expression heterogeneity with age, we compare the number of genes with positive heteroskedasticity (het > 0 and FDR<0.2) to the total of all heteroskedastic genes (FDR < 0.2). We compare this metric to the per-tissue 2-bin JSD (Supplementary Fig.41) and 6-bin JSD slope (Supplementary Fig.15).

We used a multi-SNP gene expression prediction model based on PrediXcan14 to corroborate our findings from the eQTL and JSD analyses on the two age groups, "young" and "old". For each gene in each tissue, we trained a multi-SNP model separately within each age group to predict individual-level gene expression.

$${Y}_{g,t}=mathop{sum}limits_{i}{beta }_{i,g,t}{X}_{i}+epsilon$$

(3)

Where i,g,t is the coefficient or effect size for SNP Xi in gene g and tissue t and includes all other noise and environmental effects. The regularized linear model for each gene considers dosages of all common SNPs within 1 megabase of the genes TSS as input, where common SNPs are defined as MAF > 0.05 and Hardy-Weinberg equilibrium P>0.05. We removed covariate effects on gene expression prior to model training by regressing out both GTEx covariates and age-independent PEER factors (described above). Coefficients were fit using an elastic net model which solves the problem37:

$${min }_{beta_{0},;beta }frac{1}{2N}mathop{sum }limits_{j=1}^{N}{left({Y}_{j}-{beta }_{0}-{X}_{j}^{T}beta right)}^{2}+lambda left(frac{1-alpha }{2}||beta|{|}_{2}^{2}+alpha||beta|{|}_{1}right)$$

(4)

The minimization problem contains both the error of our model predictions ({({Y}_{j}-{beta }_{0}-{X}_{j}^{T}beta )}^{2}) and a regularization term (lambda (frac{1-alpha }{2}||beta|{|}_{2}^{2}+alpha||beta|{|}_{1})) to prevent model overfitting. The elastic net regularization term incorporates both L1 (1)) and L2 ((||beta|{|}_{2}^{2})) penalties. Following PrediXcan, we weighted the L1 and L2 penalties equally using =0.514. For each model, the regularization parameter was chosen via 10-fold cross validation. The elastic net models were fit using Pythons glmnet package and R2 was evaluated using scikit-learn. From the trained models for each gene, we evaluated training set genetic R2 (or h2) for the two age groups and subtracted ({h}_{{{{{young}}}}}^{2}-{h}_{{{{{old}}}}}^{2}) to get the difference in gene expression heritability between the groups. We compared this average difference in heritability to the mean JSDoldJSDyoung and (log ({P}_{old})-log ({P}_{young})) using P-values from the eQTL analyses across genes.

To uncover linear relationships between gene expression and both age and genetics, we built a set of gene expression prediction models using both common SNPs and standardized age as input. An individuals gene expression level Y for a gene g and tissue t is modeled as:

$${Y}_{g,t}=mathop{sum}limits_{i}{beta }_{i,g,t}{X}_{i}+{beta }_{{{{{{{{rm{age}}}}}}}},g,t}A+epsilon$$

(5)

Where A is the normalized age of an individual. Coefficients were fit using elastic net regularization, as above, which sets coefficients for non-informative predictors to zero. The sign of the fitted age coefficient (age,g,t), when nonzero, reflects whether the gene in that tissue is expressed more in young (negative coefficient) or old (positive coefficient) individuals. We also evaluated the training set R2 using the fit model coefficients separately for genetics (across all SNPs in the model) and age:

$${R}_{genetics}^{2}={h}^{2}={R}^{2}({Y}_{g,t},mathop{sum}limits_{i}{beta }_{i,g,t}{X}_{i})$$

(6)

$${R}_{age}^{2}={R}^{2}({Y}_{g,t},;{beta }_{{{{{{{{rm{age}}}}}}}},g,t}A)$$

(7)

We also tested whether the age-related gene expression relationship was sex-specific by rerunning the joint model with an additional age-sex interaction term as follows:

$${Y}_{g,t}=mathop{sum}limits_{i}{beta }_{i,g,t}{X}_{i}+{beta }_{{{{{{{{rm{age}}}}}}}},g,t}A+{beta }_{{{{{{{{rm{age}}}}}}}} * {{{{{{{rm{sex}}}}}}}},g,t}A * S+epsilon$$

(8)

Where agesex,g,t is the additional model weight for the age-sex interaction term and S is the binary sex of the GTEx individual. The R2 of age, genetics, and the age-sex interaction term are evaluated as before by determining the variance explained by each term in the model. We compared the ({R}_{age}^{2}) between the models including or excluding the age-sex interaction term (Supplementary Fig.26). We also compared the tissue-averaged variance explained by age and the age-sex interaction term. Finally, to check the consistency of tissue-specific gene expression heritability estimates from our model and the original PrediXcan model trained on GTEx data, we evaluate Pearsons r between our heritability estimates and those of PrediXcan (Supplementary Fig.20), using heritability estimates from the original PrediXcan model available in PredictDB.

We evaluated the variability of age and genetic associations across tissues using a measure of tissue specificity for age and genetic R238. We measured the tissue-specificity of a gene gs variance explained ({R}_{g}^{2}) using the following metric:

$${S}_{g}=frac{mathop{sum }nolimits_{t=1}^{n}left(1-frac{{R}_{g,t}^{2}}{{R}_{g,max }^{2}}right)}{n-1}$$

(9)

Where n is the total number of tissues, ({R}_{g,t}^{2}) is the variance explained by either age or genetics for the gene g in tissue t and ({R}_{g,max }^{2}) is the maximum variance explained for g over all tissues. This metric can be thought of as the average reduction in variance explained relative to the maximum variance explained across tissues for a given gene. The metric ranges from 0 to 1, with 0 representing ubiquitously high genetic or age R2 and 1 representing only one tissue with nonzero genetic or age R2 for a given gene. We calculate Sg separately for ({R}_{{{{{{{{rm{age}}}}}}}}}^{2}) and ({R}_{{{{{{{{rm{genetics}}}}}}}}}^{2}) across all genes.

We quantified gene constraint using the probability of loss of function intolerance (pLI) from gnomAD 2.1.122. We analyzed the relationships between pLI vs age and pLI vs heritability across genes. For these analyses, genes were only included if age or genetics were predictive of gene expression (R2>0) for that gene. For genes with R2>0, we used linear regression to determine the direction of the relationship between pLI and age or heritability for each tissue. The F-statistic was used to determine whether pLI was significantly related to these two model outputs. For pLI vs age, a significant negative slope was considered a Medawarian trend (consistent with Medawars hypothesis) and a significant positive slope a non-Medawarian trend. To test whether the non-Medawarian trends were driven by genes with higher expression, we excluded genes in the top quartile of median gene expression and repeated the analysis between pLI and age (Supplementary Fig.42). We also analyzed the evolutionary constraint metric dN/dS23 and its tissue-specific relationship with age by determining the slope and significance of the linear regression, as above.

We quantified the per-gene and per-tissue cancer somatic mutation frequency using data from the COSMIC cancer browser26. For each tissue, we selected the closest cancer type as noted in Supplementary Data5 and downloaded the number of mutated samples (tumor samples with at least one somatic mutation within the gene) and the total number of samples for all genes. We computed the cancer somatic mutation frequency by dividing the number of mutated samples by the total number of samples. For each tissue, we plotted the genes age vs its cancer somatic mutation frequency for all genes with>200 tumor samples. We report the slope and significance of the relationship between age and cancer somatic mutation frequency for each tissue. To determine whether age-dependent gene expression heteroskedasticity is related to a genes involvement in cancer (Supplementary Fig.43), we also plotted each genes heteroskedasticity effect size vs the cancer somatic mutation frequency for all genes with >200 tumor samples and moderately significant heteroskedasticity (FDR<0.2). Tissues with5 genes meeting these criteria are not plotted.

To explore the non-Medawarian trend in some tissues, we assessed the distribution of age across Medawarian and non-Medawarian tissues for genes within each of the 50 MSigDB hallmark pathways24. Significant differences between the distributions were called using a t-test, and p-values were adjusted for multiple hypothesis testing using a Benjamini-Hochberg correction.

Further information on research design is available in theNature Research Reporting Summary linked to this article.

See the article here:
Tissue-specific impacts of aging and genetics on gene expression patterns in humans - Nature.com

Related Posts