IMPC data portal documentation
More information about the way IMPC uses statistics
High-throughput phenotyping generates large volumes of varied data, including both categorical and continuous data. Operational and cost constraints can lead to a workflow that precludes traditional analysis methods. A high-throughput environment therefore requires a robust, automated statistical pipeline that reduces the need for manual intervention.
The IMPC has produced a short guide to help with understanding the statistical analysis pipeline:
The IMPC uses a variety of statistical methods for making phenotype calls, including:
- Fisher's Exact test - used for categorical data parameters
- Mixed model - used for continuous data parameters which include random effects
- Linear model - used for continuous data parameters when random effects are not significant
- Mann-Whitney U Rank sum test - used for continuous data parameters when conditions for Mixed model are not appropriate
- Reference Range Plus - used for some unidimensional data parameters
All analysis frameworks output a statistical significance measure, an effect size measure, model diagnostics (when appropriate), and graphical visualisation.
The statistical methods used by the IMPC have been formalized into an R package called PhenStat.
The PhenStat package provides statistical methods for the identification of abnormal phenotypes with an emphasis on high-throughput dataflows. The package contains:
- dataset checks and cleaning in preparation for the analysis
- statistical frameworks for genotype to phenotype identification:
  - Fisher's Exact test for categorical data
  - Linear Mixed model for continuous data
  - Reference Range Plus model for low N continuous data
- and additional functions that help to decide the correct method for analysis.
- PhenStat User Guide
- How to Guide - Installing PhenStat
- PhenStat is available as a Bioconductor package
- See the complete PhenStat user's guide
The Mixed model framework assumes that baseline values of the dependent variable are normally distributed but that batch (assay date) adds noise, and it models the variables accordingly in order to separate the batch effect from the genotype effect. Model optimisation starts with:
Y = Genotype + Sex + Genotype*Sex + (1|Batch)
Genotype*Sex is sometimes called the "interaction term" in PhenStat.
Batch is assumed to be normally distributed with a defined variance.
NOTE: The MM encoded in PhenStat supports an optional "weight" term.
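Written out in conventional linear mixed model notation, the starting model can be read as below. This is a sketch only: the symbols (the coefficients beta, the batch effect b, the residual epsilon), the indices (i for animal, j for batch) and the normally distributed residual are assumptions of the usual textbook form, not details taken from PhenStat.

```latex
y_{ij} = \beta_0 + \beta_1\,\mathrm{Genotype}_{ij} + \beta_2\,\mathrm{Sex}_{ij}
       + \beta_3\,(\mathrm{Genotype}\times\mathrm{Sex})_{ij} + b_j + \varepsilon_{ij},
\qquad b_j \sim \mathcal{N}(0, \sigma_b^2), \quad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
```

Here b_j corresponds to the (1|Batch) term: a random intercept shared by all animals measured in batch j.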
The Mixed model framework is an iterative process to select the best model for the data which considers both the best modelling approach (Mixed model or general linear regression) and which factors to include in the model.
If PhenStat's assumptions about the input data are not met, the data are analysed a second time using a Mann-Whitney U Rank Sum test.
Control selection strategy
One side effect of producing data in a high-throughput pipeline is that the input data for a statistical calculation might be produced over multiple days. Environmental fluctuations have been identified as a confounding factor when comparing data gathered on different days. The IMPC describes this as a "batch effect" and it is treated as a random effect in the Mixed model framework.
The data sets to be analysed are identified using unique combinations of these fields:
|Background strain||The original strain from which the mutant specimen was derived.|
|Allele / Colony||The genomic variation in the mutant. The allele describes the character of the mutation and the Genotype effect term of the Mixed model.|
|Zygosity||The severity of the mutation.|
|Pipeline||The standardized phenotyping pipeline as described in IMPReSS Pipelines.|
|Procedure||The standardised set of procedures (experiments) as described in IMPReSS procedures.|
|Parameter||The standardised set of measurements as described in IMPReSS parameters.|
|Metadata group||Some parameters are indicated as "procedureMetadata" type. Some of these metadata are used to group comparable data together as described on the IMPReSS parameters page under the "Required For Data Analysis" section. The parameters that are marked as "Required For Data Analysis" are collectively identified by an identifier called the metadata group.|
|Organisation||The phenotyping organisation that performed the experiment and collected the data.|
|Sex||The sex of the specimens. When analysed using the Mixed model, males and females are analysed together to determine the Sex and Sex*Genotype interaction effect terms.|
 - optional
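The grouping described by the table above can be sketched as follows. This is a minimal illustration, not the IMPC implementation: the field names are hypothetical stand-ins for the table rows, and Sex is left out of the key on the assumption that both sexes are analysed together, as in the Mixed model.

```python
from collections import defaultdict

# Hypothetical field names standing in for the rows of the table above.
KEY_FIELDS = (
    "background_strain", "colony", "zygosity", "pipeline",
    "procedure", "parameter", "metadata_group", "organisation",
)


def group_into_datasets(records):
    """Group measurement records by their unique combination of key fields."""
    datasets = defaultdict(list)
    for record in records:
        key = tuple(record[field] for field in KEY_FIELDS)
        datasets[key].append(record)
    return dict(datasets)
```

Each value in the returned dictionary is one data set to be analysed; records that differ in any key field, for example the metadata group, end up in separate data sets.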
IMPC phenotyping centers operate using different workflows, which contribute to the batch effect.
|Workflow||Description||Statistical implications||Control selection strategy|
|One batch||All mutant and control data are measured on one day.||No batch effect. The controls and mutants are analysed using Y = Genotype + Sex + Genotype*Sex||Concurrent control strategy: use control data that are collected on the same day as the mutant data.|
|Multi-batch (2+)||Mutant and control data are gathered over a few days.||Possible batch effect. The controls and mutants are analysed using Y = Genotype + Sex + Genotype*Sex + (1|Batch), so the batch effect might be removed.||Baseline control strategy: use all control data within the same metadata group.|
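The two control selection strategies in the table above can be sketched as follows. This is an illustrative sketch under stated assumptions: each record is a dictionary with hypothetical keys "batch" (the assay date) and "metadata_group", and the workflow is inferred from the number of distinct mutant batches.

```python
def select_controls(mutants, controls):
    """Return (strategy name, selected controls) for one data set.

    Records are dicts with hypothetical keys "batch" and "metadata_group".
    """
    # Only controls from the same metadata group are comparable.
    groups = {m["metadata_group"] for m in mutants}
    comparable = [c for c in controls if c["metadata_group"] in groups]

    batches = {m["batch"] for m in mutants}
    if len(batches) == 1:
        # One batch: use controls collected on the same day as the mutants.
        (batch,) = batches
        return "concurrent", [c for c in comparable if c["batch"] == batch]
    # Multi-batch: use all control data within the same metadata group.
    return "baseline", comparable
```

The strategy name is returned alongside the controls so that a caller could decide whether the (1|Batch) term belongs in the starting model.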
For each data set, the appropriate workflow is determined and the statistical calculation is performed. For continuous data, the Mixed model is the IMPC's preferred method of analysis. However, this method requires that the following assumptions are met:
1. The data are normally distributed
2. The data have some variation
3. There must be more than four data points per sex per genotype
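The second and third assumptions above lend themselves to a simple pre-check, sketched below. This is not PhenStat's code: the record keys "value", "sex" and "genotype" are hypothetical, the normality assumption is deliberately not tested, and the fallback to a rank sum test follows the description given earlier in this document.

```python
from collections import Counter

MIN_POINTS = 5  # "more than four data points per sex per genotype"


def mixed_model_preconditions_met(records):
    """Check the variation and sample-size assumptions (normality is not tested)."""
    values = [r["value"] for r in records]
    if len(set(values)) < 2:
        return False  # the data have no variation
    counts = Counter((r["sex"], r["genotype"]) for r in records)
    sexes = {r["sex"] for r in records}
    genotypes = {r["genotype"] for r in records}
    # Every sex/genotype combination needs enough data points.
    return all(counts[(s, g)] >= MIN_POINTS for s in sexes for g in genotypes)


def choose_continuous_method(records):
    """Prefer the Mixed model; otherwise fall back to the rank sum test."""
    if mixed_model_preconditions_met(records):
        return "mixed model"
    return "rank sum"
```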
The graph pages display plots according to the data type of the parameter. Categorical data parameters display a stacked bar chart whereas continuous data displays a box plot and a scatter plot of the data point values. See the graph documentation for more details.
Fisher's exact output
A table displaying more information about the data used to determine the P value and effect size is displayed below the graph.
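To illustrate how a P value can be derived from such a table of counts, the sketch below implements a two-sided Fisher's exact test for a 2x2 table using only the Python standard library. The choice of a two-sided test and the restriction to a 2x2 layout are assumptions made for this example; the portal's statistics are produced by PhenStat, not by this code.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def weight(x):
        # Number of tables with x in the top-left cell, margins held fixed.
        return comb(row1, x) * comb(row2, col1 - x)

    observed = weight(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum over every table that is no more likely than the observed one.
    extreme = sum(
        w for w in (weight(x) for x in range(lowest, highest + 1)) if w <= observed
    )
    return extreme / denom
```

For example, with hypothetical counts of 8 abnormal and 2 normal mutants against 1 abnormal and 5 normal controls, `fisher_exact_two_sided(8, 2, 1, 5)` evaluates to 400/11440, roughly 0.035.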
Mixed model (PhenStat) output
The "more statistics" link at the bottom of the table will list the statistical method as "MM framework, generalized least squares, equation withoutWeight" when the batch term is not significant, otherwise "MM framework, linear mixed-effects model, equation withoutWeight".
Rank sum output
The "more statistics" link at the bottom of the table will list the statistical method as "Wilcoxon rank sum test with continuity correction" when a rank sum calculation has generated the statistics.
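As a rough illustration of what a rank sum calculation involves, the sketch below computes the rank sum statistic W for two samples, using average ranks for tied values. It is a simplified example only: the P value step, including the continuity correction, is omitted, and the function name is hypothetical.

```python
def rank_sum_statistic(x, y):
    """Return W: the rank sum of x in the pooled data minus its minimum possible value."""
    pooled = sorted(list(x) + list(y))
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        # Tied values share the mean of the ranks i+1 .. j.
        rank_of[pooled[i]] = (i + 1 + j) / 2
        i = j
    n = len(x)
    return sum(rank_of[v] for v in x) - n * (n + 1) / 2
```

W is 0 when every value in x lies below every value in y, and len(x) * len(y) when every value lies above.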
Reference Range Plus output
The "more statistics" link at the bottom of the table will list the statistical method as "Reference Range Plus" when a reference range calculation has generated the statistics.
Statistics to Phenotype
If the mutant genotype effect represents a significant change from the control group, then the IMPC pipeline will attempt to associate a Mammalian Phenotype (MP) term to the data.
The particular MP term(s) defined for a parameter are maintained in IMPReSS. Frequently, the term indicates an increase or decrease of the parameter measured.
When a statistical result is determined as significant, the following diagram is used for associating MP terms:
When a mutant genotype effect P value is less than 1.0E-4 (i.e. 0.0001), it is considered significant.
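The threshold above can be expressed as a small check. The sketch below also shows one way a direction of change might be derived from the sign of the genotype effect; that second function and its labels are assumptions made for illustration, since the association diagram is not reproduced here and the actual MP terms are maintained in IMPReSS.

```python
SIGNIFICANCE_THRESHOLD = 1.0e-4  # P values below this are considered significant


def is_significant(p_value):
    """True when the genotype effect P value is strictly below the threshold."""
    return p_value < SIGNIFICANCE_THRESHOLD


def direction_of_change(p_value, genotype_effect):
    """Hypothetical helper: label the change, or return None if not significant."""
    if not is_significant(p_value):
        return None
    if genotype_effect > 0:
        return "increased"
    if genotype_effect < 0:
        return "decreased"
    return "abnormal"
```

Note that a P value of exactly 1.0E-4 is not significant under this reading, because the comparison is "less than".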
- GO lookup tool: GO annotations to phenotyped IMPC genes
- Paper lookup tool: References using IKMC and IMPC resources
- Parallel coordinate
The parallel coordinates tool allows users to compare strains across different parameters. Hover over a row in the table to highlight the corresponding line on the chart.
To start using the tool select one or more procedures from the drop-down select box. Once this is done you can filter the data based on the phenotyping center.
The values displayed are the genotype effect, which accounts for different variation sources. Information about this and the statistical methods used is available in the statistics documentation.
To aid visualisation we have added two special lines: mean, which displays the average genotype effect for all genes displayed, and no effect, which runs through the zero values to show what a gene with no genotype effect for the measured parameters would look like. For large datasets the mean and no effect lines usually converge.
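The two special lines can be computed along the following lines. This is a sketch under an assumed input shape (a dictionary mapping each gene to its genotype effect per parameter), not the tool's actual code.

```python
def reference_lines(effects_by_gene):
    """Compute the "mean" and "no effect" lines for a parallel coordinates chart.

    effects_by_gene maps gene -> {parameter: genotype effect} (assumed shape).
    """
    totals, counts = {}, {}
    for effects in effects_by_gene.values():
        for parameter, effect in effects.items():
            totals[parameter] = totals.get(parameter, 0.0) + effect
            counts[parameter] = counts.get(parameter, 0) + 1
    # Average genotype effect per parameter across the displayed genes.
    mean_line = {p: totals[p] / counts[p] for p in totals}
    # A gene with no genotype effect sits at zero on every axis.
    no_effect_line = {p: 0.0 for p in totals}
    return mean_line, no_effect_line
```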
The tool allows filtering on each axis (parameter) by selecting the region of interest with the mouse.
The clear button removes existing filters.
The export button generates an export of the values in the table. If any filter is set, only the data displayed in the table will be exported.
The generation of this chart is computationally intensive and the number of parameters that can be plotted may vary from one machine to another. If you notice the tool becoming too slow, please consider selecting fewer procedures.