Interpreting Polygenic Prediction of Cognitive Ability: Evidence for Direct, Reliable, and Portable Genetic Effects

Tobias Wolfram; Spencer Moore; Jeremiah H. Li; Jonathan Anomaly; Ivan Davidson; Michael Christensen

doi:10.65550/001c.158459

Editor’s Note

All three reviewers initially recommended a revision that required full algorithm details for an important finding reported in this paper. The Action Editor (RH) encouraged resubmission with these details, noting that transparency would ultimately strengthen the paper and the science, outweighing short-term commercial drawbacks. The authors affirmatively addressed all other reviewer comments and suggestions but withheld algorithm details for proprietary reasons, citing common practice in scientific publishing (see Authors’ Statement). Two reviewers accepted this view; one strongly recommended rejection if the details were not included. We reviewed policies from other publishers on this issue, which typically defer to editorial judgment. Top journals have published papers omitting proprietary details, especially algorithms, balancing transparency with intellectual property protection. Without the algorithm details, the reported polygenic score variance predictions cannot be independently verified, but other key findings in the report can. In our view, the paper retains scientific merit and we accepted it. This is not an endorsement but reflects confidence that such work will inspire further research. An Associate Editor at ICA (TB) consults for the authors’ company but had no involvement in processing of or decisions about this manuscript and he was not informed of any reviews until after the final decision.—TC.

Authors’ Statement

This manuscript’s scientific contribution centers on what rigorous validation reveals about the genetic architecture of cognitive ability, specifically, that direct genetic effects account for the large majority of PGS prediction, that measurement error substantially attenuates reported heritability estimates, and that cognitive PGS show predominantly beneficial associations with life and health outcomes. The validation methods (within-family designs, reliability-aware scaling, and convergent robustness estimators) are described and reproducible for any researcher with approved access to UK Biobank and ABCD. The specific score construction pipeline and scoring weights underlie an actively maintained commercial product and cannot be fully disclosed. We therefore describe score construction at a conceptual level and provide a benchmark analysis based on publicly available data (Supplement III), demonstrating that our conclusions generalize beyond the proprietary predictor.

Introduction

General cognitive ability (GCA) is a highly heritable trait (Polderman et al., 2015) consistently associated with diverse life outcomes (Strenze, 2007), ranging from educational and occupational attainment to physical (Calvin et al., 2017; Stern, 2012; Whalley, 2001) and mental health (Koenen et al., 2009; Zammit et al., 2004). Strengthening its genetic measurement is therefore highly relevant for health-related research (Abdellaoui et al., 2025).

Genetically, GCA is influenced by thousands of variants of small effect (Hill et al., 2019), in addition to high-effect rare variants (Chen et al., 2023). The growth of GWAS sample sizes for cognition-related phenotypes from under 10000 individuals (Butcher et al., 2008) to over 1 million participants (Lee et al., 2018) has enabled polygenic scores (PGS) with increasing predictive power. Recent genome-wide association studies have identified hundreds of genome-wide significant loci (Davies et al., 2018; Savage et al., 2018; Van Den Berg et al., 2025).

However, the interpretation of these scores remains contested. Influential analyses suggest that a substantial fraction of PGS prediction reflects indirect genetic effects rather than direct causal pathways (Young, 2019); reported SNP-heritabilities appear far below twin-based estimates, and concerns persist about cross-ancestry portability (Martin et al., 2019), gene–environment interactions, and antagonistic pleiotropy.

Using a new, optimized polygenic score (Methods), we address these challenges by isolating direct genetic effects from confounding and correcting for measurement error in brief cognitive assessments. We show that our analyses replicate when using a publicly available benchmark predictor, demonstrating that our findings reflect properties of cognitive genetic architecture rather than idiosyncrasies of a particular score.

Our analysis focuses on three critical domains of validity. First, we address the attenuation of genetic effects by measurement error. We apply psychometric reliability corrections and latent variable modeling to estimate the association between the PGS and latent general ability ( $g$ ), rather than noisy observed scores. We demonstrate through simulation and three convergent estimators that this correction recovers the true latent association.

Second, we examine the persistence of predictive power within families to isolate direct genetic effects from population stratification (Young, 2019). We extend this analysis beyond cognitive test scores to evaluate whether the PGS predicts real-world outcomes, including educational attainment, occupational status, and cardiometabolic health, within sibling pairs, thereby establishing its practical utility. We also assess the potential for antagonistic pleiotropy by conducting a broad screen of psychiatric and health outcomes to test for “off-target” effects. Finally, we systematically characterize the boundary conditions of polygenic prediction. We evaluate the portability of the score across diverse ancestry groups (Martin et al., 2019) in the ABCD cohort to quantify the expected decay in predictive power. Furthermore, we test for gene-environment interactions (GxE) with key socioeconomic factors, such as parental education and family income, to determine whether genetic effects on cognition are additive or dependent on environmental context.

By integrating these approaches, we demonstrate that when measurement error is properly modeled, polygenic scores achieve substantial predictive accuracy for latent general ability ( $r \approx 0.5$ ) that is robust to family-level confounding ( $\delta\text{/}\beta \approx 0.88$ ) and broadly applicable across additive environmental contexts. These results suggest that previous concerns about the limited utility of cognitive PGS due to environmental confounding, missing heritability or context-dependence may be overstated when appropriate validation methods are employed.

Within-family Validation

PGS prediction ability among unrelated individuals can be higher than among individuals within the same family. Within families, genetic variation arises solely from random Mendelian segregation and recombination events. In contrast, predictions among unrelated individuals can also capture gene-environment correlations induced by population stratification, indirect genetic effects (the influence of alleles in relatives on the focal individual via the family environment), and correlations with other genetic factors not directly measured by the score. Such correlations often result from assortative mating and other forms of non-random mating (Kong et al., 2018; Young et al., 2022). We therefore validated our GCA PGS in two independent hold-out samples, examining its prediction ability both among unrelated individuals and within families (Table 1 and Figure 1).

Table 1. Within-family validation results.

Dataset	Phenotype	N (sib pairs)	δ (direct effect)	β (pop effect)	δ/β	Reliability	δ_g	β_g
UK Biobank	Fluid Intelligence	4,642	0.355	0.406	0.876	0.607	0.456	0.521
			(0.0218)	(0.009)	(0.050)		(0.028)	(0.011)
ABCD	GCA factor score	736	0.363	0.380	0.959	0.803	0.408	0.425
			(0.055)	(0.023)	(0.134)		(0.062)	(0.026)

This table gives standardized coefficients when predicting the phenotype using our general cognitive ability (GCA) PGS without controlling for family fixed-effects ( $\beta$ , population effect) and when adding family-fixed effects ( $\delta$ , direct effect). Phenotypes were residualized on age, sex, their interaction, and 10 genetic principal components. The coefficients are given for phenotype and PGS standardized to have variance 1 and therefore correspond to (semi-partial) correlation coefficients, with standard errors in parentheses. Standard errors for the effect-reduction ratio $\delta\text{/}\beta$ were calculated using a sibling-pair clustered bootstrap. We also give estimates adjusted for the reliability, $\rho$ , of the phenotype taken as a measure of latent (rather than noisily measured) GCA (often called ‘ $g$ ’). The adjustment uses the measurement reliability $\rho$ , which is taken from Fawns-Ritchie & Deary (2020) for Fluid Intelligence in UKB (test-retest reliability $r_{\text{test-retest}}$ ) and from our own computation of McDonald’s $\omega$ for the measure of general cognitive ability we constructed for ABCD. The error-corrected direct and population effects are given by $\delta_{g}\text{=}\delta\text{/}\sqrt{\rho}$ and $\beta_{g}\text{=}\beta\text{/}\sqrt{\rho}$ , respectively.

In a sample of 4642 European ancestry sibling pairs from UK Biobank with complete fluid intelligence/verbal-numerical reasoning (VNR)^[1] data, we tested associations with VNR after controlling for age, sex and their interaction, as well as 10 genetic principal components^[2]. The PGS coefficient (with residualized phenotype and PGS standardized to have variance 1, corresponding to the correlation coefficient with the residualized phenotype^[3]) without family fixed-effects was 0.406 (SE 0.009), corresponding to an $R^{2}$ of 16.4% (95% confidence interval: $\lbrack 15.1\text{\%},17.9\text{\%}\rbrack$ ). After adding family-fixed effects, we estimated the standardized direct effect of the PGS to be 0.355 (SE 0.0218)^[4], implying only slight attenuation within-family ( $\delta\text{/}\beta\text{=}0.876$ , $\text{SE=}0.050$ ), similar to that observed in the largest meta-analysis of such effects (Okbay et al., 2022), but in contrast to more pronounced effect reduction recently reported in another dataset (Lin et al., 2025).^[5]
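The contrast between the population and direct estimates can be illustrated with a small simulation. The following is a minimal sketch on synthetic data, not the pipeline used for Table 1: the effect sizes, the family-level confounding term `indirect`, and all function names are hypothetical, covariate adjustment is omitted, and slopes are left unstandardized (the ratio $\delta\text{/}\beta$ does not depend on the phenotype scale).

```python
import numpy as np

rng = np.random.default_rng(0)
n_pairs = 20_000
delta_true = 0.35   # illustrative direct effect of the PGS
indirect = 0.10     # hypothetical family-level effect correlated with the PGS

# Sibling PGS: shared family component plus individual segregation component
# (unit variance, sibling correlation 0.5).
family = rng.normal(0.0, np.sqrt(0.5), size=(n_pairs, 1))
pgs = family + rng.normal(0.0, np.sqrt(0.5), size=(n_pairs, 2))

# Phenotype: direct effect + family-level confounding + individual noise.
y = delta_true * pgs + indirect * family + rng.normal(0.0, 0.9, size=(n_pairs, 2))


def estimates(pgs, y):
    """Return (population slope, within-family slope) for sibling-pair arrays."""
    x_all = pgs.ravel() - pgs.mean()
    y_all = y.ravel() - y.mean()
    beta = float((x_all * y_all).sum() / (x_all ** 2).sum())
    # Family fixed effects are equivalent to demeaning within each pair.
    pgs_w = pgs - pgs.mean(axis=1, keepdims=True)
    y_w = y - y.mean(axis=1, keepdims=True)
    delta = float((pgs_w * y_w).sum() / (pgs_w ** 2).sum())
    return beta, delta


beta, delta = estimates(pgs, y)

# Sibling-pair clustered bootstrap for the SE of delta/beta:
# resample whole pairs, never individual siblings.
ratios = []
for _ in range(200):
    idx = rng.integers(0, n_pairs, size=n_pairs)
    b, d = estimates(pgs[idx], y[idx])
    ratios.append(d / b)
ratio_se = float(np.std(ratios, ddof=1))

print(f"beta={beta:.3f} delta={delta:.3f} ratio={delta / beta:.3f} (SE {ratio_se:.3f})")
```

In this toy setup the population slope exceeds the within-family slope only because of the family-level term. Demeaning within pairs removes it, which is the logic behind reading $\delta$ as the direct effect.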

Figure 1. Cognitive Ability Polygenic Scores: Performance and Real-World Relevance.

(A) Polygenic score validation in UK Biobank and ABCD sibling cohorts. ‘Population’ refers to the score’s association without family fixed effects and ‘direct’ refers to the score’s within-family effect (with family-fixed effects). Estimates are given for PGS and phenotype standardized to have variance 1, after the phenotype was residualized on age, sex, their interaction, and 10 principal components. We display here effects on latent general cognitive ability (‘g’), i.e. after accounting for the reliability of the psychometric measure in each cohort (Table 1). (B) Within-family polygenic score associations with various life outcomes. *** p < 0.0056 (Bonferroni corrected); * p < 0.05. (C) Relationship between mean occupational GCA polygenic score and CAMSIS occupational status in the UK Biobank. Point size is proportional to the number of individuals in each occupation.

We also validated the score in the ABCD Study (Volkow et al., 2018), a broadly representative sample (ages 9–10) from across the United States, using the same validation methodology as in UK Biobank. We constructed a general cognitive ability factor from ten assessments including NIH Toolbox measures (Weintraub et al., 2013) (Picture Vocabulary, Flanker, List Sorting, Card Sorting, Pattern Comparison, Picture Sequence Memory, Oral Reading Recognition), WISC-V Matrix Reasoning (Wechsler, 2014), Little Man (Luciana et al., 2018), and Rey Auditory Verbal Learning Test trials (Rey, 1958). Each score was first residualized for age, sex, and their interaction, then inverse rank-normal (IRNT) transformed. This measure of GCA showed good psychometric properties (McDonald’s $\omega\text{=}0.803$ ) and substantial heritability ( $h^{2}\text{=}0.64$ , $\text{SE=}0.089$ ) based on ACE decomposition in twin and sibling pairs ( $\text{N}_{MZ} = 242$ and $\text{N}_{DZ/Sib} = 736$ European ancestry pairs with complete data). Validating in this group of siblings and dizygotic twins, we obtained similar results to our analysis of VNR in UK Biobank (Table 1).

As a comparative metric of what an estimator based on published results might be able to achieve, we leveraged summary statistics from Savage et al. (2018), the largest published GCA GWAS to date, and used them to build an LDpred2 (Privé et al., 2020)-based polygenic score, which we validated similarly to our own predictor. The results are given in Supplement III and Table S13.

Considering Measurement Reliability

Genetic effects operate on latent general cognitive ability—often referred to as $g$ —rather than directly on the test scores obtained from any specific psychometric instrument (Panizzon et al., 2014). Because psychometric tests are inherently noisy proxies, observed scores contain measurement error that attenuates estimates of both SNP-based heritability and polygenic score associations. From a practical standpoint, it is unrealistic to assume that a brief, 13-item cognitive assessment like the UK Biobank fluid intelligence/VNR test (which takes only a few minutes to complete) measures general cognitive ability as reliably as a comprehensive, multi-hour assessment such as the Wechsler Adult Intelligence Scale battery. Although both tests aim to target the same underlying trait, the extensive item coverage of the Wechsler scales ensures substantially lower measurement error.

Applying a reliability correction thus places shorter and more extensive psychometric assessments on a common latent-trait scale, enabling accurate validation of genetic prediction and meaningful comparisons across measurement instruments. Furthermore, for applications such as predicting the differences in cognitive ability of embryos, the appropriate scale is the latent ability scale, not the observed scale, since differences in genotype at conception affect latent ability, not merely the scores on a test with less than perfect reliability.

To quantify and correct for this attenuation, we employ standard approaches from classical test theory. We assume the measured phenotype ( $g_{\text{obs}}$ ) reflects an underlying latent ability ( $g$ ) with added measurement error. Reliability ( $\rho$ ) is defined as the proportion of variance in the observed measure due to variation in the latent trait:

$\rho\text{=}\frac{Var(g)}{Var\left( g_{\text{obs}} \right)}$

Under the standard assumption that measurement error is independent of genotype, genetic variance remains unchanged when transitioning from the latent trait to the observed measure, but total phenotypic variance increases, inflating the denominator of variance ratios.

If the SNP-heritability of the latent trait is $h_{\text{SNP},g}^{2}$ , then the SNP-heritability estimated from the observed measure ( $h_{\text{SNP},g_{\text{obs}}}^{2}$ ) is reduced proportionally to the reliability:

$h_{\text{SNP},g_{\text{obs}}}^{2}\text{=}\rho \cdot h_{\text{SNP},g}^{2} \Rightarrow h_{\text{SNP},g}^{2}\text{=}\frac{h_{\text{SNP},g_{\text{obs}}}^{2}}{\rho}$
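The proportionality follows in one step. As a sketch, write $V_{G}$ for the SNP-tagged genetic variance (a symbol introduced here only for illustration) and assume $g_{\text{obs}} = g + e$ , with error $e$ independent of both $g$ and genotype:

$h_{\text{SNP},g_{\text{obs}}}^{2} = \frac{V_{G}}{Var\left( g_{\text{obs}} \right)} = \frac{V_{G}}{Var(g)} \cdot \frac{Var(g)}{Var\left( g_{\text{obs}} \right)} = h_{\text{SNP},g}^{2} \cdot \rho$

For example, an observed SNP-heritability of about 0.22 on a measure with $\rho \approx 0.6$ implies a latent SNP-heritability of roughly $0.22 / 0.6 \approx 0.37$ .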

We verified this relationship through simulations (Figure 2A). Specifically, we simulated a phenotype with a SNP-based heritability of 0.35 using GCTA, assuming 10000 causal loci drawn from the HapMap SNP set and the subset of individuals of British ancestry in the UK Biobank. To mimic varying degrees of measurement error, we progressively introduced random noise into this latent phenotype, reducing the effective reliability such that observed heritabilities decreased stepwise to approximately 0.30, 0.25, 0.20, 0.15, and 0.10. We then performed a GWAS using PLINK on a randomly selected training subset comprising 80% of individuals ( $N\text{=}273763$ ) and estimated SNP-heritability using Linkage Disequilibrium Score Regression (LDSC). As expected, decreasing reliability proportionally lowered the observed SNP-heritability estimates; however, after applying our reliability correction, these estimates consistently and accurately recovered the true latent heritability across all simulated scenarios (Figure 2A).

The correlation between a polygenic score and an observed measure similarly underestimates the true correlation with the latent trait due to measurement error. The standard correction for correlations (Spearman’s correction) gives:

$r(\text{PGS},g)\text{=}\frac{r\left( \text{PGS},g_{\text{obs}} \right)}{\sqrt{\rho}}$

where $\text{PGS}$ is the standardized PGS and $r(\text{PGS},g)$ the correlation.

We confirmed the effectiveness of this correction through additional simulations (Figure 2B). Using the GWAS summary statistics generated at different reliability levels, we constructed polygenic scores by clumping and thresholding SNP associations (threshold $p\text{=}1$ ). We then validated these scores within the aforementioned 20% test set of 68514 UK Biobank participants of British ancestry. At perfect reliability, the true correlation given the unattenuated SNP-heritability of 0.35 was $r(\text{PGS},g)\text{=}0.419$ , while at lower reliability the observed PGS-phenotype correlations systematically underestimated this value. After applying the reliability correction, the estimates consistently recovered the true latent correlation across the range of reliabilities simulated (Figure 2B).
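The same behaviour can be reproduced in a much simpler setting than the genotype-based simulation described above. The following is a minimal sketch that simulates only the score and the phenotype directly (no GWAS or clumping step). The sample size, the reliability grid, and the variable names are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
r_true = 0.419  # latent PGS-g correlation, taken from the simulation above

# Standardized score and latent trait with correlation r_true (Var(g) = 1).
pgs = rng.normal(size=n)
g = r_true * pgs + np.sqrt(1.0 - r_true ** 2) * rng.normal(size=n)

results = {}
for rho in (1.0, 0.8, 0.6, 0.4):
    # Error variance chosen so that Var(g) / Var(g_obs) equals rho.
    err_var = (1.0 - rho) / rho
    g_obs = g + rng.normal(0.0, np.sqrt(err_var), size=n)
    r_obs = float(np.corrcoef(pgs, g_obs)[0, 1])
    r_corrected = r_obs / np.sqrt(rho)  # Spearman's correction
    results[rho] = (r_obs, r_corrected)
    print(f"rho={rho:.1f}  observed r={r_obs:.3f}  corrected r={r_corrected:.3f}")
```

The observed correlation shrinks by a factor of $\sqrt{\rho}$ as noise is added, while the corrected value stays close to the latent correlation, provided that $\rho$ is known and the error is independent of the score.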

Figure 2. The effect of reliability on SNP heritability and PGS association.

In UK Biobank, we simulated a latent phenotype with true SNP heritability equaling 35% (an illustrative value consistent with observed $h^{2} \approx 0.22$ after accounting for test reliability $\rho \approx 0.6$ ). We then examined the impact of the reliability of different measures of this latent phenotype on the estimated SNP heritability (A) and PGS-phenotype correlation (B). As expected, lower levels of reliability reduced apparent SNP heritability and PGS-phenotype correlation. However, reliability-corrected estimates of both SNP heritability and PGS-phenotype correlation were never statistically significantly different from the true value for the latent phenotype. Bars in (A) show observed estimates with x-fold underestimation; red outlines denote reliability-corrected estimates.

Thus, we adjust both the population-level ( $\beta$ ) and within-family ( $\delta$ ) standardized associations to account for reliability, yielding corrected latent-trait estimates ( $\beta_{g}\text{=}\beta\text{/}\sqrt{\rho}$ and $\delta_{g}\text{=}\delta\text{/}\sqrt{\rho}$ ) (Table 1 and Figure 2), using test-retest reliability ( $\rho\text{=}r_{\text{test-retest}}\text{=}0.607$ ) for fluid intelligence/VNR in UKB from Fawns-Ritchie & Deary^[6], and McDonald’s Omega ( $\rho\text{=}\omega\text{=}0.803$ ) that we calculated for our measure of general cognitive ability in ABCD^[7]. Based on the UKB results, we estimate $\beta_{g}\text{=}0.521$ ( $\text{SE=}0.0113$ ) without family-fixed effects, and $\delta_{g}\text{=}0.456$ ( $\text{SE=}0.0276$ ) with family fixed-effects. The latter reflects the standardized direct effect of the PGS on latent $g$ . Our results based on ABCD indicated slightly lower correlation with latent $g$ than in the UK Biobank ( $\beta_{g}\text{=}0.425$ , SE 0.0259), possibly due to lower heritability at ages 9–10 than in adulthood (Bouchard, 2013), as well as the fact that, in contrast to UKB, the ABCD PGS was constructed from imputed chip data, which is known to attenuate PGS-outcome associations (Li et al., 2021). The fixed-effects inverse-variance weighted meta-analysis estimate of the within-family correlation with latent $g$ is $\delta_{g}\text{=}0.448$ , $\text{SE=}0.025$ .
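The arithmetic behind these figures can be checked directly from the reported values. The sketch below uses hypothetical helper names (`deattenuate`, `ivw_fixed`). Scaling the standard error by the same factor treats $\rho$ as known without error, which is a simplifying assumption of this sketch.

```python
import math


def deattenuate(est, se, rho):
    """Divide a standardized coefficient and its SE by sqrt(reliability)."""
    scale = math.sqrt(rho)
    return est / scale, se / scale


def ivw_fixed(estimates, ses):
    """Fixed-effects inverse-variance weighted mean and its standard error."""
    weights = [1.0 / s ** 2 for s in ses]
    pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
    return pooled, math.sqrt(1.0 / sum(weights))


# Observed-scale UKB estimates (Table 1) with test-retest reliability 0.607.
delta_g_ukb, se_delta_g_ukb = deattenuate(0.355, 0.0218, 0.607)
beta_g_ukb, se_beta_g_ukb = deattenuate(0.406, 0.009, 0.607)

# Latent-scale within-family estimates as reported for UKB and ABCD.
pooled, pooled_se = ivw_fixed([0.456, 0.408], [0.0276, 0.062])

print(f"UKB delta_g = {delta_g_ukb:.3f}, beta_g = {beta_g_ukb:.3f}")
print(f"meta-analytic delta_g = {pooled:.3f} (SE {pooled_se:.3f})")
```

Small differences in the third decimal of the standard errors are expected, because the inputs here are the rounded values printed in the text.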

The validity of these adjustments is independently confirmed by a separate approach using Common-Path Slope estimation, an errors-in-variables correlated-vectors regression under a one-factor MIMIC model (see Supplement I): because both the PGS and the observed $g_{\text{obs}}$ are imperfect measures of the same latent trait, $g$ , their associations with external outcomes should differ only by a constant multiplicative factor if the PGS acts primarily via $g$ . Concretely, across outcomes $Y_{j}$ we expect $\beta_{j}^{(\text{PGS})} \approx k\beta_{j}^{(g_{\text{obs}})}$ with a line through the origin; the slope $k$ captures the relative attenuation from measurement error in $g_{\text{obs}}$ and limited predictive performance of the PGS. Fitting an errors-in-variables (Deming) regression to $\left( \beta_{j}^{(\text{PGS})},\beta_{j}^{(g_{\text{obs}})} \right)$ and combining the slope with the population effect PGS- $g$ association recovers the latent correlation $r\left( \text{PGS},g \right)$ and the implied reliability of the observed $g_{\text{obs}}$ .

Within ABCD, across 20 explicitly non-cognitive outcomes selected for an exploratory pleiotropic scan (see below), the estimated Deming regression slope was $k\text{=}0.539$ ( $\text{SE=}0.028$ ). Under a simple latent-factor model where PGS and $g_{\text{obs}}$ measure the same latent trait, we can combine the association between PGS and observed GCA for ABCD given in Table 1, and use the Deming regression to estimate the latent correlation $r(PGS,g)\text{=}0.453$ (95% CI 0.417–0.488). This is in line with our attenuation-corrected population estimates in both cohorts. The same identity implies for our GCA measure in ABCD a reliability of 0.705 (95% CI 0.595–0.815), which is statistically consistent with our psychometric reliability estimate using McDonald’s $\omega$ of 0.803, as the latter falls within this confidence interval (for details, see Supplement I).
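One reading of this identity, adopted here purely for illustration (the estimator actually used is described in Supplement I), is as follows. If each outcome relates to latent $g$ with some loading $\lambda_{j}$ , then the PGS association is $r(\text{PGS},g)\lambda_{j}$ and the $g_{\text{obs}}$ association is $\sqrt{\rho}\lambda_{j}$ , so $k = r(\text{PGS},g)/\sqrt{\rho}$ . Combined with $r(\text{PGS},g_{\text{obs}}) = r(\text{PGS},g)\sqrt{\rho}$ , this gives $r(\text{PGS},g) = \sqrt{k \cdot r(\text{PGS},g_{\text{obs}})}$ and $\rho = r(\text{PGS},g_{\text{obs}})/k$ . Under that assumption the reported point estimates can be recovered from two numbers:

```python
import math

k = 0.539            # reported Deming slope across the 20 outcomes
r_pgs_gobs = 0.380   # ABCD population effect on observed GCA (Table 1)

# Assumed identities under a one-factor model (see the lead-in above):
#   k          = r(PGS, g) / sqrt(rho)
#   r_pgs_gobs = r(PGS, g) * sqrt(rho)
r_latent = math.sqrt(k * r_pgs_gobs)
rho_implied = r_pgs_gobs / k

print(f"r(PGS, g) = {r_latent:.3f}, implied reliability = {rho_implied:.3f}")
```

The confidence intervals quoted in the text are not reproduced here, since they depend on the sampling covariance of the two inputs.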

Table 2. Latent-variable within- and population associations of the cognitive PGS with general ability (g).

Cohort	Measurement model	N (individuals)	δ_gSEM	β_gSEM	CFI	TLI	RMSEA
UKB	One-factor g	36,126	0.439	0.525	0.969	0.927	0.022
			(0.018)	(0.008)
ABCD	Second-order g	1,560	0.435	0.509	0.955	0.947	0.031
			(0.053)	(0.031)

This table reports standardized latent-scale coefficients from structural equation models: in UK Biobank a one-factor $g$ (8 indicators), and in ABCD a second-order $g$ with first-order domains (crystallized, speed, memory). Coefficients are standardized on the latent scale (std.lv), estimated by MLR with FIML and family-clustered SEs in lavaan. Standard errors are shown in parentheses on the line below each estimate. $\delta_{g_{SEM}}$ is the within-family (direct) slope on $PGS_{w}$ . $\beta_{g_{SEM}}$ is the latent population slope from a model regressing latent $g$ on individual PGS without family fixed effects. Sample sizes reflect individuals contributing via FIML (partially missing indicators and incomplete pairs). Reported CFI/TLI/RMSEA summarize measurement fit. Unlike the reliability-adjusted coefficients in Table 1, these SEM estimates are already on the latent scale and nevertheless reproduce the same pattern of results.

To address psychometric skepticism about classical-test-theory deattenuation on potentially multidimensional batteries, we further re-estimated the association between the cognitive PGS and latent general ability using structural equation models with family-clustered (robust) standard errors and tested measurement invariance (see Supplement II). Because the latent models define a technically slightly different phenotype (multi-test latent $g$ estimated by SEM and fit with FIML on individuals), we report them in a separate table. Table 2 shows the latent-scale within-family slope. Whereas a simple one-factor model fit the UKB data well, the ABCD data were better captured by a hierarchical model with three first-order domains loading on $g$ (Supplementary Table S9). The within-family standardized latent coefficients are $\delta_{g_{SEM}} \approx 0.44$ in both cohorts, closely matching the reliability-corrected within-family estimates. In contrast to the classical test theory-derived results presented above, latent variable-based population estimates $\beta_{g_{SEM}}$ align well between both cohorts, closing the gap in performance between UK Biobank and ABCD and demonstrating that our headline results do not hinge on a single reliability parameter. We further verified metric measurement invariance across PGS strata and found no material item-level differential prediction (DIF) by within-family PGS (see Supplementary Tables S10–S12).

Within-family PGS effects on life outcomes

Phenotypically, cognitive ability is associated with diverse life outcomes (Calvin et al., 2017; Strenze, 2007). Figure 1B demonstrates which associations persist within families, providing evidence for causal relationships between the polygenic score and several important life outcomes. Within UK Biobank sibling pairs, a one standard deviation increase in the polygenic score predicted 0.154 SD higher educational attainment ( $p\text{<}0.001$ ). We also observed significant associations with occupational status measured by CAMSIS scores ( $\beta\text{=}0.157$ , $p\text{<}0.001$ ) and family income ( $\beta\text{=}0.104$ , $p\text{<}0.001$ ), aligning with known associations between cognitive ability and socioeconomic attainment (Marks, 2022).

The polygenic score was positively associated with better self-reported health ( $\beta\text{=}0.069$ , $p\text{<}0.0056$ ), higher satisfaction with friendships ( $\beta\text{=}0.102$ , $p\text{<}0.0056$ ) and lower neuroticism (higher emotional stability) ( $\beta\text{=}0.049$ , $p\text{<}0.0056$ ). However, we found no significant within-family associations with general happiness ( $p\text{=}0.88$ ) or work satisfaction ( $p\text{=}0.94$ ). These null findings for subjective well-being, albeit potentially a result of limited power, align with behavioral genetic studies showing that while GCA correlates with life outcomes, it has limited direct effects on subjective well-being (Bartels, 2015).

As an additional confirmation of validity, Figure 1C illustrates how mean polygenic scores vary across occupations in the UK Biobank. The systematic gradient aligns with cognitive demands of different professions (Schmidt & Hunter, 2004), with highest scores among medical practitioners, academics, and IT professionals, and lowest scores in manual occupations (Wolfram, 2023).

Figure 3. Association of GCA PGS with disease outcomes in UKB.

Estimates are given for the GCA PGS standardized to have variance 1, controlling for age, sex, and 10 principal components. ‘Population’ refers to the score’s association without controlling for parental PGSs and ‘direct’ refers to the score’s within-family effect estimated by controlling for observed or imputed parental PGSs. Bands indicate 95% confidence intervals, *** $p<.05\text{/}19$ , * $p<.05$ .

Predicting disease risk

Motivated by the within-family association of the GCA PGS with better self-reported health, we estimated its effect on 19 common diseases using snipar in up to 17661 UKB sibling sets (Figure 3). At the population level, the PGS showed broadly protective associations with cardiometabolic diseases, which persisted within families. Direct genetic effects remained negative for all five cardiometabolic outcomes, though due to limited power, Bonferroni-significance was only achieved for hypertension. Direct genetic effect estimates for immune/inflammatory and cancer outcomes were generally small or imprecise. Overall, 14 of 19 outcomes showed the same direction for population and direct estimates, 12 of which indicate protective effects, lending further support to the hypothesis that part of the favorable health association of higher cognitive ability reflects direct effects rather than only family-level or population confounding.

We observed no significant association between the GCA polygenic score and Alzheimer’s disease risk in our within-family analyses, though AD cases were rare in the sibling sample (N = 461) and most participants remain below typical dementia onset age.

Cross-Ancestry Performance Declines in Line with Expectations

Polygenic scores typically show substantially reduced performance in ancestry groups not represented in the training data due to differences in linkage disequilibrium patterns and allele frequencies, among other factors (Mostafavi et al., 2020). Figure 4A shows our score’s relative population-level performance in the ABCD Study for individuals of non-European ancestries. The score retained 89% of its standardized effect size in Hispanic/Latino Americans ( $N\text{=}1985$ , $\text{SE=}6.5\text{\%}$ ), 88% in South Asian Americans ( $N\text{=}42$ , $\text{SE=}40.8\text{\%}$ ), 66% in African Americans ( $N\text{=}1491$ , $\text{SE=}7.2\text{\%}$ ), and 59% in East Asian Americans ( $N\text{=}136$ , $\text{SE=}23.3\text{\%}$ ) relative to European Americans. Moore et al. (2025) estimated the expected effect reduction for scores built using methods similar to the one employed here, which our results (despite noise due to small sample sizes) closely track (Figure 4A).

Figure 4. Robustness of Cognitive Ability Polygenic Score Effects.

(A) Relative performance across ancestry groups in the ABCD Study. The decrease is relative to the European American reference group. Error bars are 95% CIs. The observed decrease is shown in blue, and the expected decrease is shown as red triangles. (B) No significant gene-environment interactions with family conflict, family income (log), and average parental education (see materials and methods). All variables have been z-standardized. Bands indicate 95% confidence intervals.

No Evidence for Gene-Environment Interactions

Gene-environment interactions have often been proposed as an important factor in individual differences in cognitive ability. To assess whether there is any evidence for gene-environment interactions with our GCA PGS, we regressed the cognitive ability measure we constructed on the PGS, a given environmental exposure, and their interaction in the European-American subset of ABCD ( $N\text{=}5670$ ). Figure 4B shows no significant interactions between our polygenic score and environmental factors: parental education ( $N\text{=}5032$ , $p\text{=}0.87$ ), family income ( $N\text{=}5395$ , $p\text{=}0.25$ ), and family conflict ( $N\text{=}5670$ , $p\text{=}0.33$ ). This pattern is consistent with findings from well-powered twin and administrative studies in non-U.S. and population-scale U.S. samples (Bates et al., 2016; Figlio et al., 2017; Hanscombe et al., 2012), and with contemporary ABCD analyses, which likewise find SES and PGS associations to be largely additive (Paul et al., 2024).^[8] While we cannot rule out the existence of any gene-environment interaction, the current results suggest that the magnitude of such interactions is likely small relative to the main effect of the PGS, so that predictions that do not model such interactions should remain well calibrated.
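The interaction test has a simple regression form. The following is a minimal sketch on synthetic, purely additive data. The effect sizes and the exposure variable are hypothetical, covariates are omitted, and conventional (non-clustered) OLS standard errors are used, which is a simplification relative to family-structured data such as ABCD.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000

pgs = rng.normal(size=n)
env = 0.2 * pgs + rng.normal(size=n)   # hypothetical exposure, mildly correlated with PGS
env = (env - env.mean()) / env.std()   # z-standardize, as in Figure 4B

# Purely additive data-generating model: no PGS x environment term.
y = 0.4 * pgs + 0.3 * env + rng.normal(size=n)

# Design matrix: intercept, PGS, environment, and their product.
X = np.column_stack([np.ones(n), pgs, env, pgs * env])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Conventional OLS standard errors.
resid = y - X @ coef
sigma2 = resid @ resid / (n - X.shape[1])
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))

z_interaction = coef[3] / se[3]
print(f"PGS main effect = {coef[1]:.3f}, interaction = {coef[3]:.3f} (z = {z_interaction:.2f})")
```

When the generating model is additive, the product-term coefficient is centered on zero while the main effects are recovered. A non-zero product term would indicate that the PGS slope changes with the exposure.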

Assessing Potential Pleiotropic Off-Target Effects

Concerns about the interpretation of polygenic scores for cognitive ability often center on unintended associations with genetically correlated traits, particularly autism spectrum disorder (ASD) (Grove et al., 2019) and anorexia nervosa (Watson et al., 2019), a phenomenon also called “antagonistic pleiotropy”. To directly address this, we screened for associations of our GCA PGS across 20 psychological traits in ABCD, including psychopathology, temperament, and autistic traits (see Materials & Methods).^[9]

Figure 5. Pleiotropic Associations of GCA PGS.

(A) Standardized associations ( $\beta, \pm \text{CI}$ ) between the GCA PGS and 20 psychological/neurodevelopmental outcomes in ABCD (European-ancestry youth, $N \approx 5600\text{–}5940$ ), controlling for age, sex, $\text{age} \times \text{sex}$ , and 10 genetic PCs; PGS and outcomes were z-standardized. Triangles indicate associations opposite to the expected beneficial direction. *** p < 0.0025 (Bonferroni, $.05\text{/}20$ ); * p < 0.05 (nominal). (B) Cross-outcome scaling of betas for PGS vs. phenotypic $g_{\text{obs}}$ . Each point is an outcome $Y_{j}$ ; vertical and horizontal bars show 95% CIs (Bonferroni-adjusted). The dashed line indicates $y\text{=}x$ (45°); points clustering around it indicate that score-outcome associations are consistent with mediation by realized GCA. Bands indicate 95% confidence intervals.

The results in Figure 5A show overwhelmingly beneficial associations between higher GCA PGS and outcomes indicative of improved psychological functioning. Specifically, a higher GCA PGS correlated with fewer psychotic-like symptoms ( $\beta\text{=-}0.126$ , $p\text{<} \times 10^{\text{-}21}$ ), lower ADHD symptoms ( $\beta\text{=-}0.116$ , $p\text{<} \times 10^{\text{-}20}$ ), reduced externalizing behaviors ( $\beta\text{=-}0.095$ , $p\text{<} \times 10^{\text{-}10}$ ), fewer sleep problems ( $\beta\text{=-}0.053$ , $p\text{<} \times 10^{\text{-}3}$ ), and increased positive affect ( $\beta\text{=}0.103$ , $p\text{<} \times 10^{\text{-}12}$ ). Notably, we observed no significant positive association between our cognitive PGS and autistic symptoms; indeed, the association was weakly negative ( $\beta\text{=-}0.039$ , nominally significant, Bonferroni non-significant).

The Common-Path Slope analysis (see Supplement I), shown in Figure 5B, indicates that PGS associations scale almost perfectly with the corresponding associations of observed $g_{\text{obs}}$ , consistent with the PGS acting primarily via general ability rather than through idiosyncratic pathways specific to the PGS (pleiotropy). This supports the view that what we detect in the pleiotropy scan reflects the downstream correlates of $g$ , not independent “off-target” mechanisms.

Overall, our comprehensive pleiotropy analysis provides little support for substantial negative off-target effects. Instead, higher cognitive ability genetic predispositions appear beneficial.

Conclusions

The interpretation of cognitive polygenic scores has been complicated by concerns about environmental confounding, limited cross-context generalizability, and the gap between PGS prediction and twin-based heritability estimates. Our results address each of these concerns through converging lines of evidence.

We achieved an $R^{2}$ of 16.4% for fluid intelligence/VNR in UKB, which represents a substantial advance over previous studies where performance has not exceeded 10% variance explained (Becker et al., 2021; Van Den Berg et al., 2025) and found this association to be robust within families. Reliability-corrected estimates indicate only slight attenuation within families ( $\delta\text{/}\beta \approx 0.88$ ), a finding replicated using latent-variable structural equation modeling (Table 2) and confirmed using a reproducible benchmark predictor (Supplement III). These results are in line with recent findings showing that modeling cognitive ability at the latent level instead of relying solely on individual observed test scores improves polygenic prediction (Lin & Plomin, 2025). Although within-sibship GWAS can yield less biased estimates of direct genetic effects at individual loci (Howe et al., 2022), such analyses remain underpowered for cognitive phenotypes given current sibling sample sizes. Our results underscore that, across the phenotypes and predictors examined here, polygenic scores derived from conventional population-based GWAS capture substantial direct genetic effects, supporting their utility in contexts where only within-family variation is relevant (Ahangari et al., 2025; Craig et al., 2025; Moore et al., 2025).^[10]

Beyond the magnitude of the effect, our analyses clarify the nature of the genetic influence on cognition. The significant within-family associations with educational attainment, occupational status, and family income provide evidence that these genetic effects translate into tangible life outcomes. Furthermore, the absence of gene-environment interactions with key socioeconomic variables strengthens the argument that, within the range of environments typical of European ancestry families in the USA, polygenic effects on cognition operate in a largely additive manner.

The public health relevance of this framework is reinforced by our disease prediction analyses. The consistently protective direct effects on cardiometabolic outcomes align with Mendelian randomization evidence regarding coronary artery disease and hypertension (Wang et al., 2023; Yang et al., 2022), suggesting these effects plausibly operate through behavioral pathways—such as diet and treatment adherence—that are influenced by cognitive ability. While our exploratory pleiotropy analysis indicated predominantly beneficial associations with psychological outcomes, future studies with larger sibling cohorts are required to confirm these population-level indications. Conversely, the null within-family effect for Alzheimer’s disease must be interpreted with caution given the imperfect sensitivity of registry-based diagnoses (Wilkinson et al., 2019) and the potential confounding role of cognitive reserve.

Beyond validating a specific predictor, our reliability-aware approach has implications for how the field interprets SNP-heritability estimates: it suggests that a substantial component of the apparent gap between SNP-h² and twin-h² in cognitive measures may reflect measurement unreliability. For instance, the relatively low SNP-based heritability ( $h^{2} \sim 0.22$ ) observed for the fluid intelligence/VNR measure in the UK Biobank, given its low reliability ( $\sim 0.6$ ), likely underestimates the true common-variant SNP heritability of $g$ , which, applying the attenuation correction outlined above, may exceed 0.35. Drawing a parallel with height, where a significant portion of heritability resides in rare variants (Wainschtein et al., 2022), it is plausible that whole-genome sequencing analyses combined with proper modeling of measurement reliability will substantially narrow the gap between SNP- and twin-based heritability estimates. Twin-based estimates might suffer substantially less from measurement error, because co-twins are typically measured on the same day and at the same age, often by the same rater, and because aspects of the error might themselves be heritable (Nivard, 2023). Indeed, recent whole-genome analyses indicate that common and rare variants (MAF $\geq 0.001$ ) can plausibly explain 33–40% of variation in cognitive ability (Wainschtein et al., 2025).
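
As a rough worked illustration of this correction, assume the classical test-theory relation in which the heritability of an observed score equals the heritability of the latent trait multiplied by the reliability of the measure. The symbol $r_{xx}$ for reliability is our notation here, and the inputs are the rounded values quoted in the preceding paragraph rather than the exact estimates:

$\small{h_{\text{SNP},g}^{2} \approx \frac{h_{\text{SNP},g_{\text{obs}}}^{2}}{r_{xx}} \approx \frac{0.22}{0.6} \approx 0.37}$

Under these assumptions the corrected value lies just above 0.35, in line with the statement that the common-variant SNP heritability of $g$ may exceed that figure.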

A key prediction of such a reliability-aware approach is that better psychometric measurement should yield higher SNP-based heritabilities. This highlights the potential value of professionally constructed assessments, such as military IQ tests or scholastic achievement exams, whose SNP- $h^{2}$ should more closely approximate the true heritability of latent $g$ . Practically, these results underscore the importance of extensive pretesting of cognitive measures before their implementation in large biobank cohorts. When assessment time is limited, adaptive, computer-assisted testing could further enhance measurement precision.

Collectively, these results highlight the diverse research settings in which a powerful, rigorously validated GCA PGS can benefit future health-related research. By establishing a validation protocol that explicitly models measurement error and family-level confounding, we provide a template for the safe and effective use of polygenic prediction in causal inference and epidemiological applications.

Materials & Methods

Scope of Disclosure

The analyses in this manuscript are designed to evaluate polygenic prediction in population and within-family designs and to characterize the role of measurement reliability using three convergent estimators. However, the score construction pipeline (including the exact engineered-phenotype search procedure, imputation system implementation details, and the final integrated scoring weights) underlies an actively maintained commercial product. We therefore describe the score construction process at a conceptual level and provide additional sensitivity analyses using an alternative GWAS-based predictor in the Supplement to demonstrate that our main validation conclusions are not specific to the proprietary predictor.

Score Development

We developed our polygenic score through a novel and comprehensive phenotype engineering pipeline designed explicitly to enhance the genetic signal of cognitive ability within the UK Biobank, restricting the analyses to individuals within the European ancestry cluster (participants of Ashkenazi, Polish, Italian, and British descent, following Privé et al., 2022). Siblings in UKB (reserved for validation purposes) were excluded from any phenotype engineering and downstream analysis to avoid data leakage.

Rather than relying on the raw cognitive measures provided by UKB, we explored a high-dimensional space of alternative score definitions for each of the 11 core cognitive tasks. Concretely, for each task we generated various candidate metrics spanning a small number of families of transformations (e.g., alternative scaling/transformations of the raw score distribution; alternative ways of combining paradata where available; and alternative item-weighting strategies for multi-item tasks). We also considered nuisance-adjusted variants that differ in which pre-specified covariate sets are removed prior to genetic screening (e.g., basic demographics and assessment context variables). Each engineered phenotype was then subjected to a highly optimized, custom implementation of a GWAS + LDSC pipeline, which enabled a rapid estimation of approximate SNP-heritabilities, allowing us to select only the single phenotype with the highest genetic signal-to-noise ratio for each cognitive domain. Candidate selection was then followed by full discovery GWAS runs using the primary GWAS pipeline described below.

To address missingness in cognitive measures within UKB, we used an autoencoder-based imputation strategy to predict missing cognitive scores from a broad panel of non-genetic variables (demographic, socioeconomic, and occupational/assessment-related variables) and other available UKB measures. The imputation model was trained and evaluated using held-out individuals within the non-sibling discovery set to avoid information leakage into the sibling validation cohort. Imputed values were used to increase completeness for phenotype engineering and auxiliary GWAS inputs. The key within-family validation analyses are conducted on observed outcomes in the respective holdout cohorts.

This pipeline yielded a suite of optimized measures that formed the basis of our internal discovery genome-wide association studies (GWAS), which were conducted using REGENIE (Mbatchou et al., 2021). Individuals failing standard genomic QC were excluded. All analyses controlled for age, sex, their interaction, and the first ten principal components of genetic ancestry.

The UK Biobank’s fluid intelligence/VNR test—comprising 13 offline-administered verbal-numerical reasoning items—was selected as the focal phenotype for our polygenic score training due to its substantial SNP-based heritability (LDSC $h^{2}\text{=}0.2295$ ) and substantial non-imputed sample size ( $N\text{=}182276$ ). Where multiple fluid intelligence/VNR assessments were available per participant, scores were averaged.

We further boosted statistical power by integrating our internal cognition GWAS results with publicly available summary statistics from cognition-related and genetically correlated traits, including neurological and academic phenotypes, as well as additional GWAS conducted in the UK Biobank on heritability-maximizing transformations of such traits. We did not use educational attainment in our analyses due to its reduced within-family predictive validity (Okbay et al., 2022). Given the substantial difference in the SNP set used in our internal GWAS and in external summary statistics, we systematically imputed variants absent from individual GWAS summary statistics to comprehensively capture genetic variation. Using a hierarchical, multi-stage multi-trait meta-analysis framework, we substantially increased the effective sample size available for our focal fluid intelligence/VNR measure, significantly enhancing statistical power.^[11]

Finally, we derived the polygenic score from these integrated summary statistics using SBayesRC (Zheng et al., 2024), applying a hierarchical, annotation-informed prior across an expanded genome-wide variant set. The functional annotations used were curated to augment the standard BaselineLD (Gazal et al., 2017) annotation set and include, among others, evolutionary conservation metrics, cell-type-specific chromatin accessibility, and neurodevelopmental gene expression profiles.

Reliability

Using GCTA, we simulated a latent phenotype with $h_{\text{SNP},g}^{2}\text{=}0.35$ (an illustrative value, consistent with observed estimates after reliability correction) on the subset of individuals of British ancestry in the UK Biobank ( $N\text{=}342277$ ) using HapMap SNPs, assuming 10000 causal loci; we then added mean-zero noise to reduce reliability levels to the level of observed $h_{\text{SNP},g_{\text{obs}}}^{2}$ as shown in Figure 2. We then performed a simple GWAS using PLINK on a randomly selected training subset comprising 80% of individuals ( $N_{\text{train}}\text{=}273763$ ) and estimated SNP-heritability using LDSC. Validation used the 20% hold-out ( $N_{\text{test}}\text{=}68514$ ), where simple C+T scores ( $p\text{-threshold=}1$ , constructed using PRSice-2) were evaluated on the phenotypes of varying reliability.
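
The logic of this simulation can be illustrated with a toy sketch that does not use genotypes, GCTA, PLINK, or LDSC. It assumes a latent phenotype made of a genetic value plus an environmental residual, and adds independent mean-zero noise so that the latent score accounts for a chosen share (the reliability) of the observed variance. The function names, sample size, and reliability grid are hypothetical choices for illustration only.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate_observed_h2(h2_latent, reliability, n=20000, seed=1):
    """Share of observed-score variance explained by the true genetic value.

    Assumption: latent = genetic + environment with unit variance, and
    observed = latent + noise, where the noise variance is chosen so that
    Var(latent) / Var(observed) equals the requested reliability.
    """
    rng = random.Random(seed)
    noise_sd = math.sqrt((1.0 - reliability) / reliability)
    genetic, observed = [], []
    for _ in range(n):
        g = rng.gauss(0.0, math.sqrt(h2_latent))
        e = rng.gauss(0.0, math.sqrt(1.0 - h2_latent))
        observed.append(g + e + rng.gauss(0.0, noise_sd))
        genetic.append(g)
    return pearson_r(genetic, observed) ** 2


for rel in (1.0, 0.8, 0.6):
    est = simulate_observed_h2(0.35, rel)
    print(f"reliability={rel:.1f}  expected={0.35 * rel:.3f}  simulated={est:.3f}")
```

In this simplified setting the variance explained by the genetic value shrinks in proportion to reliability, which is the attenuation the GCTA-based simulation is designed to demonstrate with real genotypes.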

Validation

Statistical models & inference

Let $y$ denote the residualized cognitive outcome and PGS the standardized polygenic score (both variance-scaled to 1 after residualization). In our main analysis, we estimate

$\small{y_{i} = \alpha + \beta\text{PGS}_{i} + \varepsilon_{i} \quad \text{(Population model)}}$

$\small{y_{if} = \alpha_{f} + \delta\text{PGS}_{if} + \varepsilon_{if} \quad \text{(Family fixed-effects model)}}$

where $i$ indexes individuals, $f$ indexes families, and $\alpha_{f}$ are family fixed effects (family dummies). With this scaling, coefficients correspond to semi-partial correlations. We report the attenuation ratio $R\text{=}\delta\text{/}\beta$ with clustered (by family) bootstrap SEs.
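
A minimal sketch of these two estimators and the family-clustered bootstrap is given below, on simulated data. It relies on the fact that, with exactly two siblings per family, the fixed-effects slope equals the through-origin regression of sibling outcome differences on sibling PGS differences. The data generator, its parameter values, and all function names are hypothetical, and the residualization and standardization steps described above are omitted.

```python
import random


def ols_slope(x, y):
    """Slope from a simple regression of y on x with an intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def attenuation_ratio(families):
    """families: list of ((pgs1, y1), (pgs2, y2)) sibling pairs."""
    # Population model: pool all siblings and ignore family structure.
    xs = [p for fam in families for p, _ in fam]
    ys = [v for fam in families for _, v in fam]
    beta = ols_slope(xs, ys)
    # Family fixed effects with two siblings: regress differences through the origin.
    dx = [fam[0][0] - fam[1][0] for fam in families]
    dy = [fam[0][1] - fam[1][1] for fam in families]
    delta = sum(a * b for a, b in zip(dx, dy)) / sum(a * a for a in dx)
    return beta, delta, delta / beta


def cluster_bootstrap_se(families, n_boot=100, seed=0):
    """Bootstrap SE of delta/beta, resampling whole families with replacement."""
    rng = random.Random(seed)
    ratios = []
    for _ in range(n_boot):
        sample = [rng.choice(families) for _ in families]
        ratios.append(attenuation_ratio(sample)[2])
    mean = sum(ratios) / len(ratios)
    return (sum((r - mean) ** 2 for r in ratios) / (len(ratios) - 1)) ** 0.5


def simulate_families(n_fam=4000, delta=0.35, confound=0.07, seed=1):
    """Toy data: a shared family component raises both PGS and outcome."""
    rng = random.Random(seed)
    families = []
    for _ in range(n_fam):
        shared = rng.gauss(0.0, 1.0)
        pair = []
        for _ in range(2):
            pgs = (shared + rng.gauss(0.0, 1.0)) / 2 ** 0.5
            y = delta * pgs + confound * shared + rng.gauss(0.0, 0.9)
            pair.append((pgs, y))
        families.append(tuple(pair))
    return families


families = simulate_families()
beta, delta, ratio = attenuation_ratio(families)
se = cluster_bootstrap_se(families)
print(f"beta={beta:.3f}  delta={delta:.3f}  ratio={ratio:.3f}  SE={se:.3f}")
```

Resampling whole families rather than individuals keeps each sibling pair intact, so the bootstrap respects the within-family dependence that the fixed-effects estimate relies on.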

Using the snipar (Young et al., 2022) within-family modeling framework, we instead estimate

$\small{y_{if} = \beta_{\text{SNIPAR}}\text{PGS}_{if} + u_{f} + \varepsilon_{if} \quad \text{(Population model)}}$

$\small{y_{if} = \delta_{\text{SNIPAR}}\text{PGS}_{if} + \alpha{\hat{g}}_{\text{par}(f)} + u_{f} + \varepsilon_{if} \quad \text{(Within-family model)}}$

where $\alpha$ here denotes the average non-transmitted coefficient (not an intercept), ${\hat{g}}_{\text{par}(f)}$ the (imperfectly) imputed parental PGS for family $f$ , and $u_{f}$ a family-level random effect. We likewise report the attenuation ratio $\delta_{\text{SNIPAR}}\text{/}\beta_{\text{SNIPAR}}$ with clustered (by family) bootstrap SEs.

UK Biobank

Sample

The UKB sibling cohort reserved for validation was held out from all discovery stages. Sibling pairs were identified from UKB relationship files. We restricted to the Privé et al. (2022) white-ancestry cluster (Ashkenazi, Polish, Italian, UK) and excluded genotyping outliers, sex-chromosome aneuploidies, and heterozygosity/missingness outliers. Phenotypically, 4642 families with two genotyped siblings with non-missing outcome were retained for the validation^[12].

Measure

We validated against each participant’s first measure of the UKB offline fluid intelligence/VNR test, residualized for age, sex, their interaction and the first 10 PCs. Test–retest reliabilities (Pearson’s $r$ ) were taken from Fawns-Ritchie & Deary (2020) as specified in Table 1.

ABCD

Cognitive factor construction

At baseline (ages 9–10), we extracted raw scores from NIH Toolbox tasks (Picture Vocabulary, Flanker, List Sorting, Card Sorting, Pattern Comparison, Picture Sequence Memory, Oral Reading Recognition), WISC-V Matrix Reasoning (raw sum), Little Man Task (percent correct), and a composite of Rey Auditory Verbal Learning Test trials (mean across trials). Each score was first residualized for age, sex, and their interaction, then transformed using an inverse-rank normal transformation (IRNT). We fit a one-factor model using the psych package; the first factor explained $\sim 30\text{\%}$ of variance and showed good reliability ( $\omega \approx 0.80$ , see Table 1). The resulting factor scores (residualized for age, sex, their interaction and the first 10 PCs) served as the ABCD validation outcome.
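
For readers unfamiliar with the rank-based inverse normal transformation applied to each task score, a minimal sketch follows. The offset `c = 3/8` (Blom's constant) and the average-rank handling of ties are assumptions, since the exact variant used is not specified here; the function name `irnt` is likewise hypothetical.

```python
from statistics import NormalDist


def irnt(values, c=3.0 / 8.0):
    """Rank-based inverse normal transform with average ranks for ties."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied values starting at sorted position i.
        j = i
        while j + 1 < n and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0  # 1-based average rank of the run
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    normal = NormalDist()
    # Map ranks to quantiles strictly inside (0, 1), then to normal scores.
    return [normal.inv_cdf((r - c) / (n - 2.0 * c + 1.0)) for r in ranks]


scores = irnt([3.0, 1.0, 2.0])
print([round(s, 3) for s in scores])
```

The transform depends only on ranks, so it removes skew and outliers from the residualized task scores before the factor model is fit.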

Sample

We identified European participants by reported ancestry (variable demo_race) being white and excluded individuals for whom multiple ancestry groups were given. From this set, we identified monozygotic and dizygotic twins as well as full siblings using the genetic paired_subjectid and rel_relationship fields. Following ABCD guidance, we excluded a small number of participants with inadequate visual acuity during testing ( $\text{Snellen<}4$ ).

Twin-Heritability

Using independent MZ pairs and DZ/sibling pairs, we computed correlations within zygosity and then derived ACE components using Falconer's formulas. Standard errors for correlations used Fisher’s $z$ transformation; ACE SEs were obtained by the delta method.
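
The calculation can be sketched as follows, using the standard Falconer decomposition ( $A\text{=}2(r_{\text{MZ}}-r_{\text{DZ}})$ , $C\text{=}2r_{\text{DZ}}-r_{\text{MZ}}$ , $E\text{=}1-r_{\text{MZ}}$ ). The sketch assumes that the MZ and DZ/sibling samples are independent, and it maps the Fisher $z$ standard error back to the correlation scale with a first-order approximation. The input correlations and pair counts are hypothetical, not the values estimated in ABCD.

```python
import math


def corr_se(r, n_pairs):
    """Approximate SE of a correlation: Fisher z SE times dr/dz = 1 - r^2."""
    return (1.0 - r * r) / math.sqrt(n_pairs - 3)


def falconer_ace(r_mz, n_mz, r_dz, n_dz):
    """Falconer ACE estimates with delta-method SEs (independent samples)."""
    se_mz, se_dz = corr_se(r_mz, n_mz), corr_se(r_dz, n_dz)
    a = 2.0 * (r_mz - r_dz)
    c = 2.0 * r_dz - r_mz
    e = 1.0 - r_mz
    # Each component is linear in the two correlations, so variances add
    # with squared coefficients.
    se_a = 2.0 * math.sqrt(se_mz ** 2 + se_dz ** 2)
    se_c = math.sqrt(4.0 * se_dz ** 2 + se_mz ** 2)
    se_e = se_mz
    return {"A": (a, se_a), "C": (c, se_c), "E": (e, se_e)}


# Hypothetical inputs for illustration only.
result = falconer_ace(r_mz=0.70, n_mz=300, r_dz=0.45, n_dz=500)
for name, (est, se) in result.items():
    print(f"{name} = {est:.2f} (SE {se:.3f})")
```

Because $A$ doubles the difference between two correlations, its standard error is considerably larger than that of either correlation, which is why twin-based estimates from modest samples carry wide intervals.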

Comparative Baseline

To demonstrate that our validation framework generalizes to standard, reproducible cognitive polygenic predictors, we additionally constructed a genome-wide benchmark score using the results from Savage et al. (2018). Since publicly released summary statistics include the UK Biobank, we reconstructed their results while excluding our sibling validation sample. To achieve this, we sampled 142077 white British respondents who participated in the offline fluid intelligence/VNR test, as well as 53576 for whom only online test data was available. These numbers mirror those given in their paper, but explicitly exclude any siblings. This approach was feasible due to the increased availability of cognitive ability data since the original study was conducted. In cases where multiple data points were available for a participant, we used the first one. Using this data, we ran separate GWAS for the offline and online data and meta-analyzed them with summary statistics from Savage et al. (2018) excluding the UK Biobank. A widely used Bayesian shrinkage approach (LDpred2-auto) was used to construct a polygenic score which was subsequently used to rerun all validation analyses presented in the main text and beyond. Results are reported in Supplement III.

Associations with Other Outcomes

Within UKB sibling pairs, we regressed standardized outcomes (see Table S2 for a full list) on the standardized PGS with family fixed effects. Outcomes were pre-residualized (age, sex, $\text{age} \times \text{sex}$ , PCs) and z-scored as before. Multiple testing was controlled via Bonferroni across $K\text{=}9$ .

Occupational Stratification

We merged the sibling holdout with their PGS values, then joined occupation and CAMSIS from the ukbjobs package (Akimova et al., 2025). For each occupation we computed mean standardized PGS, mean CAMSIS and sample size per occupation. Occupations with $n\text{<}40$ validation individuals were excluded to reduce instability. We visualized mean PGS vs CAMSIS, displaying point size by $n$ , and added an OLS trend line to summarize the occupational gradient.

Disease prediction

To examine the cognitive ability PGS’s capacity to predict disease outcomes within families, we regressed binary outcomes on standardized sibling PGS values including or excluding imputed parental PGS values following the above snipar within-family modeling framework. Outcomes were analyzed using a generalized linear mixed model, fitted in R with the lme4 package’s glmer() function using the binomial probit link and including age, sex (except for sex-specific diseases) and the first 10 principal components as covariates. To achieve convergence for all 19 models, we set $nAGQ\text{=}0$ and used the bobyqa optimizer. The set of 19 disease outcomes was selected so as to include non-overlapping conditions collectively affecting multiple biological systems. Cases were identified within up to 17661 sibling sets from the UKB using ICD-10 hospital codes, OPCS-4 operation codes, medication usage, self-reports and responses to disease-specific surveys (Table S3).

Ethnic Portability

Effect reduction. In ABCD, we estimated population-level standardized effects separately by genetic ancestry group, using the same QC, residualization and standardization as for the sibling sample. Groups were defined using parent-reported ancestry: EUR = “white”, AFR = “black”, AMR = “hispanic”, EAS = “chinese”, “korean”, “japanese”, “filipino”, SAS = “asian indian”. Children with more than one ancestry category (with the exception of AMR) were removed. For each group $g$ we fit $Y\text{=}\beta_{\text{POP},g}\text{PGS}$ and estimate relative performance $\text{RelEff}_{g}\text{=}\beta_{\text{POP},g}\text{/}\beta_{\text{POP},\text{EUR}}$ ( $N_{\text{EUR}}\text{=}5673$ ; $N_{\text{AFR}}\text{=}1491$ ; $N_{\text{AMR}}\text{=}1985$ ; $N_{\text{EAS}}\text{=}136$ ; $N_{\text{SAS}}\text{=}42$ ). For presentation, we summarized the relative decrease and its 95% confidence interval via parametric bootstrap using the estimated $\beta$ and SE for each group in Figure 4A.
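
A parametric bootstrap of this kind can be sketched as below. The sketch assumes that the group-specific and European estimates are independent and approximately normal, draws each from a normal distribution centered on its estimate with its standard error, and takes percentile limits of the resulting ratios. The function name and the numerical inputs are hypothetical and are not the estimates underlying Figure 4A.

```python
import random


def releff_bootstrap(beta_g, se_g, beta_eur, se_eur, n_draws=10000, seed=0):
    """Point estimate and 95% percentile interval for beta_g / beta_eur."""
    rng = random.Random(seed)
    draws = sorted(
        rng.gauss(beta_g, se_g) / rng.gauss(beta_eur, se_eur)
        for _ in range(n_draws)
    )
    lower = draws[int(0.025 * n_draws)]
    upper = draws[int(0.975 * n_draws) - 1]
    return beta_g / beta_eur, lower, upper


# Hypothetical inputs for illustration only.
point, lower, upper = releff_bootstrap(0.20, 0.03, 0.38, 0.012)
print(f"RelEff = {point:.2f} (95% CI {lower:.2f} to {upper:.2f})")
print(f"Relative decrease = {1 - point:.2f}")
```

The relative decrease shown in the figure is one minus this ratio, so its interval follows directly from the limits returned here. For small groups the standard error of the numerator dominates the width of the interval.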

Expected effect reduction. Moore et al. provided estimates for East Asian (58.7% relative $R^{2}$ ) and South Asian (79.4% relative $R^{2}$ ) populations. For African Americans, we used linear interpolation assuming 80% African/20% European ancestry based on their Nigerian validation (28.3% relative performance at 100% African ancestry). For Hispanic/Latino Americans, we applied Ding et al.'s (Ding et al., 2023) genetic distance framework: Hispanic/Latino populations cluster at intermediate genetic distances ( $\sim 0.30\text{–}0.35$ ) between Europeans and Africans in their analysis, reflecting their admixed ancestry. Using an intermediate distance of 0.32, a linear model based on Moore’s results predicts 74% relative $R^{2}$ , corresponding to 86% retention of standardized effect size, which aligns well with our observed 89%.
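
The conversion between relative $R^{2}$ and retained standardized effect size used in this paragraph can be made explicit. Assuming, as a simplification, that the PGS and outcome are standardized within each group so that $R^{2}$ equals the squared standardized slope, the relative effect size is the square root of the relative $R^{2}$ . Applying this to the figures quoted above gives the following, where the East Asian and South Asian conversions are our own arithmetic rather than values reported by the cited studies:

$\small{\text{RelEff}_{g} \approx \sqrt{R_{g}^{2}\text{/}R_{\text{EUR}}^{2}}: \quad \sqrt{0.74} \approx 0.86, \quad \sqrt{0.587} \approx 0.77, \quad \sqrt{0.794} \approx 0.89}$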

Gene-Environment Interaction

We exploratorily tested moderation of the PGS effect in European-ancestry ABCD at baseline using three contexts, all using the same QC, residualization and standardization as for the sibling sample, in a standard regression framework:

$Y\text{=}\alpha\text{+}\beta_{1}\text{PGS+}\beta_{2}E\text{+}\beta_{3}(\text{PGS} \times E)\text{+}\varepsilon.$

Moderators were (i) parental education ( $N\text{=}5032$ ): mean of primary and partner education, measured as number of years in education based on highest school grade or degree attained (demo_prnt_ed_v2, demo_prtnr_ed_v2); (ii) family income ( $N\text{=}5395$ ): natural log of combined family income (demo_comb_income_v2); and (iii) family conflict ( $N\text{=}5670$ ): sum of parent report on the Family Environment Scale conflict subscale (fes_p_ss_fc_pr).
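
The moderation model above is an ordinary least-squares regression with a product term. A minimal sketch on simulated data follows, using a small hand-written solver for the normal equations so that it needs no external libraries. The data-generating values are hypothetical and contain no true interaction; the sketch returns coefficient estimates only and does not compute the standard errors or $p$ -values reported in the Results.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_interaction(pgs, env, y):
    """OLS estimates [alpha, beta1, beta2, beta3] for Y ~ PGS + E + PGS*E."""
    rows = [[1.0, p, e, p * e] for p, e in zip(pgs, env)]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(4)] for i in range(4)]
    xty = [sum(r[i] * v for r, v in zip(rows, y)) for i in range(4)]
    return solve(xtx, xty)


# Toy data with main effects only (hypothetical values, no true interaction).
rng = random.Random(42)
pgs = [rng.gauss(0.0, 1.0) for _ in range(5000)]
env = [rng.gauss(0.0, 1.0) for _ in range(5000)]
y = [0.38 * p + 0.20 * e + rng.gauss(0.0, 0.9) for p, e in zip(pgs, env)]

alpha, b1, b2, b3 = fit_interaction(pgs, env, y)
print(f"beta1={b1:.3f}  beta2={b2:.3f}  beta3={b3:.3f}")
```

With standardized PGS and moderator, $\beta_{3}$ is the change in the PGS slope per standard deviation of the moderator, so an estimate near zero corresponds to the additive pattern described in the main text.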

Pleiotropy Analysis

To assess potential “off-target” pleiotropy associated with our cognitive ability PGS, we conducted a systematic pleiotropy screen across 20 psychological and neurodevelopmental outcomes in ABCD (ages 9–10, $N \approx 5600\text{–}5940$ European-ancestry participants; see Table S4). Outcomes covered child psychopathology, temperament, affective traits, sleep, and autistic traits (SSRS). Each outcome was residualized for age, sex, $\text{age} \times \text{sex}$ , and 10 genetic PCs, inverse-rank-normal transformed when needed, and standardized. Analyses were run at the population-level (between-family), maximizing power, and did not include sibling fixed-effects.

Associations were tested using standardized regressions, controlling the family-wise error rate with Bonferroni correction ( $\alpha\text{=}0.05\text{/}20$ ). To clarify the clinical interpretation, each outcome was assigned a valence: associations were labeled “positive pleiotropy” if higher cognitive PGS correlated positively with beneficial traits (e.g., effortful control) or negatively with undesirable traits (e.g., ADHD symptoms), and “negative pleiotropy” otherwise. Given these analyses are between-family, effect sizes likely reflect upper bounds due to inflation from assortative mating and genetic nurture, and should thus primarily be interpreted for directionality.

Additionally, we applied the Common-Path Slope (CPS) method (Supplement I) to verify whether observed associations were mediated predominantly through general cognitive ability rather than independent, outcome-specific genetic pathways.

Acknowledgements

This research has been conducted using the UK Biobank Resource under Application Number #103244. This work uses data provided by patients and collected by the NHS as part of their care and support.

Furthermore, data used in the preparation of this article were obtained from the Adolescent Brain Cognitive Development (ABCD) Study (https://abcdstudy.org), held in the NIMH Data Archive (NDA) under application #21063. This is a multisite, longitudinal study designed to recruit more than 10000 children aged 9–10 and follow them over 10 years into early adulthood. The ABCD Study® is supported by the National Institutes of Health and additional federal partners under award numbers U01DA041048, U01DA050989, U01DA051016, U01DA041022, U01DA051018, U01DA051037, U01DA050987, U01DA041174, U01DA041106, U01DA041117, U01DA041028, U01DA041134, U01DA050988, U01DA051039, U01DA041156, U01DA041025, U01DA041120, U01DA051038, U01DA041148, U01DA041093, U01DA041089, U24DA041123, U24DA041147. A full list of supporters is available at https://abcdstudy.org/federal-partners.html. A listing of participating sites and a complete listing of the study investigators can be found at https://abcdstudy.org/consortium_members/. ABCD consortium investigators designed and implemented the study and/or provided data but did not necessarily participate in the analysis or writing of this report. This manuscript reflects the views of the authors and may not reflect the opinions or views of the NIH or ABCD consortium investigators.

Competing interests

All authors are employees of Herasight, Inc. and hold equity in the company.

Funding

This work was funded by Herasight, Inc.

UK Biobank labels this measure ‘fluid intelligence’; however, the items primarily assess verbal and numerical reasoning, so we refer to it as verbal–numerical reasoning (VNR) throughout, following existing literature (Lyall et al., 2016).
Repeating the UKB analyses with 40 genetic PCs (instead of 10) yielded materially identical results ( $\beta\text{=}0.4047$ , $\text{SE=}0.0095$ ; $\delta\text{=}0.3472$ , $\text{SE=}0.0181$ ).
We residualize phenotypes on standard covariates to remove non-genetic and ancestry-related structure prior to standardization; coefficients therefore reflect standardized associations between the PGS and covariate-adjusted phenotypes. As a robustness check, we additionally residualized the PGS on the same covariates (yielding strict partial correlations, i.e., residualizing both phenotype and PGS on covariates) and obtained essentially unchanged estimates: UKB - $\delta$ = 0.3468 (0.0213), $\beta$ = 0.4062 (0.00874); ABCD - $\delta$ = 0.3637 (0.0541), $\beta$ = 0.3836 (0.0231).
As controlling for principal components technically could be considered as conditioning on a collider in the context of a within-family model, we conducted a robustness check where we did not consider PCs, which did not affect the estimate ( $\Delta\delta_{g}\text{=}0.0003$ ).
We conducted a further robustness check, using the within-family modeling framework of snipar, which boosts power for estimation of direct genetic effects via Mendelian imputation of parental genotypes (Young et al., 2022) (see Materials & Methods). We similarly found minimal and not significantly different attenuation within-family using this method (ratio =0.853, SE 0.039).
This published estimate aligns well with the Cronbach’s alpha of 0.62 reported elsewhere (Hagenaars et al., 2016). We also estimated test-retest reliability in the sample of UKB participants who took the test twice over the years (N = 29,879) as 0.647.
We chose test–retest reliability for the UK Biobank fluid intelligence/VNR test because it is a single, brief cognitive assessment administered at two time points. Test–retest reliability directly captures the proportion of variance that remains stable across repeated administrations, isolating genetically relevant variation from transient noise. In contrast, our ABCD cognitive measure is a composite factor derived from multiple cognitive tests. For such multi-item composites, McDonald’s $\omega$ is the appropriate reliability index: it estimates the proportion of variance in the composite measure that is attributable to the common latent factor (here, general cognitive ability, or $g$ ), rather than idiosyncratic measurement error associated with individual tests or items.
By contrast, the frequently cited evidence in favor of GxE by Tucker-Drob & Bates (2016) rests on nation-stratified subgroup estimates within the meta-analysis and its U.S.–non-U.S. heterogeneity does not comport with subsequent replication (Figlio et al., 2017). Given that early U.S. reports rested on modest samples, e.g., Turkheimer et al. (2003) inferred Scarr–Rowe moderation from 319 twin pairs at age 7 (114 MZ, 205 DZ), and that detecting small moderation effects typically requires thousands of twin pairs (Hanscombe et al., 2012), the accumulating record suggests Scarr–Rowe-style interactions are not a general feature of cognitive development.
For statistical power reasons, analyses were conducted at the population level (between-family) rather than within-family, meaning these associations may be inflated by factors such as indirect genetic effects (Tan et al., 2024) and cross-trait assortative mating (Border et al., 2022). Therefore, effect sizes should be interpreted with caution and primarily as directional indicators.
The likelihood that our within-family validations in UKB and ABCD are influenced by sample overlap with discovery datasets is minimal. The entire UKB sibling cohort used for validation was explicitly excluded from all discovery analyses, with UKB-based polygenic scores derived solely from non-sibling participants. ABCD served exclusively as an independent validation cohort, contributing no data to any discovery GWAS. Furthermore, any residual overlap from external summary statistics is numerically insignificant and incapable of generating the robust within-family effects observed, which depend solely on Mendelian segregation. The replication of within-family slopes in two independent cohorts further reduces any plausible concern about sample overlap influencing our results.
Here, “hierarchical” refers to a staged integration strategy in which trait inputs are grouped and combined according to their relationship to the focal phenotype (e.g., cognition-related and genetically correlated auxiliary traits), rather than being treated as a single undifferentiated pool.
UK Biobank participation is non-random; participation probabilities increase with education/cognition and related traits, and participation itself has genetic correlates. Conditioning on inclusion (especially when restricting to complete sibling pairs) creates a collider: within included families, for a given within-family PGS difference, the lower-PGS sibling must on average have a more favorable residual in cognition to be observed. In the differenced fixed-effects equation, this induces $corr(\Delta\text{PGS},\Delta\varepsilon)\text{<}0$ (an index-event/volunteer-selection mechanism), biasing the within-family slope downward. Under monotone selection on cognition/residuals this attenuation is the default expectation in UKB-type samples with a “healthy volunteer bias”. Under these conditions, the reported within-family coefficient can be interpreted as conservative with respect to the direct genetic association.
