Data mining methods may prove useful in detecting and defining some of the additive and non-additive (epistatic) higher-order interactions that we know must contribute heavily to risk. We have been exploring the genetic architecture of schizophrenia using the RandomForests® and TreeNet® programs from Salford Systems (San Diego, CA) to try to identify combinations of single nucleotide polymorphisms (SNPs) that influence clinical risk and cognitive performance. We mined 332,236 SNPs in parallel in our own GCAP (Genes, Cognition and Psychosis Program) sample (252 schizophrenia cases, 270 controls) and in the GAIN sample (522 schizophrenia cases, 600 controls). In GCAP, a set of 4430 SNPs (median PLINK genotypic case-control p = 0.1028) was strongly predictive of clinical phenotype group (TreeNet 10-fold cross-validation test ROC 0.987, case prediction error 0.06, control prediction error 0.07). In GAIN, a set of 4371 SNPs was strongly predictive of group (TreeNet test ROC 0.956, case prediction error 0.11, control prediction error 0.12). By chance, prediction error and prediction success are each 0.50, and only a small fraction of these SNPs are likely to be tagging schizophrenia susceptibility genes. Nevertheless, we found overlap between the two independent samples at both the SNP and the gene level: 60 SNPs were in common, 182 genes were the closest gene to multiple SNPs from both samples, and 139 of these had SNPs within the gene and within 100 kb of it in both studies. Pathway analysis (Ingenuity Systems, Redwood City, CA) showed a high degree of potential connectivity: 22 genes appear to be involved in protein-protein (PP) interactions, in 7 groups. Adding just 4 PP "bridges" connects 4 of these groups, which can then include additional genes from the 139, for a total of 26.
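The SNP-level and gene-level comparison between two samples can be illustrated with a minimal Python sketch. This is not the procedure used in the study: the function names, the SNP and gene identifiers, and the reading of "multiple SNPs" as at least two selected SNPs per gene in each sample are all assumptions made for illustration.

```python
from collections import Counter


def snp_overlap(snps_a, snps_b):
    """Return the SNP IDs selected as predictors in both samples."""
    return set(snps_a) & set(snps_b)


def gene_overlap(nearest_gene_a, nearest_gene_b, min_snps=2):
    """Return genes that are the nearest gene to at least `min_snps`
    selected SNPs in each sample.

    Each argument maps a selected SNP ID to its nearest gene symbol.
    The threshold of 2 is an assumed reading of "multiple SNPs".
    """
    def genes_with_enough(nearest_gene):
        counts = Counter(nearest_gene.values())
        return {gene for gene, n in counts.items() if n >= min_snps}

    return genes_with_enough(nearest_gene_a) & genes_with_enough(nearest_gene_b)


if __name__ == "__main__":
    # Hypothetical toy data; these are not real SNP or gene identifiers.
    sample_1 = {"snp1": "GENE_A", "snp2": "GENE_A", "snp3": "GENE_B"}
    sample_2 = {"snp2": "GENE_A", "snp4": "GENE_A", "snp5": "GENE_C"}

    print(sorted(snp_overlap(sample_1, sample_2)))
    print(sorted(gene_overlap(sample_1, sample_2)))
```

In this toy case only one SNP is shared, yet the gene-level comparison still picks out a gene supported by different SNPs in each sample, which is why the two levels are reported separately.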
We have developed an integrated database of our clinical, cognitive, and data mining results; including genes of interest from the literature, we are tracking about 400 specific genes/regions. Ongoing work includes genotyping the remainder of our sample (510 schizophrenia cases and 500 controls in total) with the Illumina 660W chip, SNP-SNP epistasis analysis and data mining of additional independent samples, and detection of copy number variants (CNVs) with the Agilent 1x1M chip. Finally, we are working on empirical approaches to determine which predictors indicate a nearby gene and which merely detect cryptic group differences (e.g., from ascertainment bias) that have little to do with phenotype. We think that a multilevel, integrative approach that fundamentally acknowledges the heterogeneity with epistasis inherent in the condition, and that builds on prior work, is essential to properly evaluate and use new GWAS results. We also think that we have produced many new candidates, each supported by multiple pieces of evidence.
*This presentation was given by Richard Straub at the 2009 Salford Data Mining Conference.