More information about the way IMPC uses statistics.
High-throughput phenotyping generates large volumes of varied data, including both categorical and continuous data. Operational and cost constraints can lead to a workflow that precludes traditional analysis methods. Furthermore, a high-throughput environment requires a robust, automated statistical pipeline that reduces the need for manual intervention.
The IMPC uses a variety of statistical methods for making phenotype calls, including:
 Fisher's Exact test: used for categorical data parameters
 Mixed model: used for continuous data parameters which include random effects
 Linear model: used for continuous data parameters when random effects are not significant
 Mann-Whitney U Rank Sum test: used for continuous data parameters when conditions for the Mixed model are not appropriate
 Reference Range Plus: used for some unidimensional data parameters
The Mixed model (MM), Fisher's Exact (FE), and Reference Range Plus (RR+) methods have been formalized into an R package called PhenStat. See the complete PhenStat user's guide.
All analysis frameworks output a statistical significance measure, an effect size measure, model diagnostics (when appropriate), and graphical visualisation.
The PhenStat package provides statistical methods for the identification of abnormal phenotypes, with an emphasis on high-throughput dataflows. The package contains:
 dataset checks and cleaning in preparation for the analysis
 three statistical frameworks for genotype to phenotype identification:
   Fisher's Exact test for categorical data
   Linear Mixed model for continuous data
   Reference Range Plus model for low N continuous data
 and additional functions that help to decide the correct method for analysis.
The Mixed model framework assumes that baseline values of the dependent variable are normally distributed but that batch (assay date) adds noise, and it models the variables accordingly in order to separate the batch effect from the genotype effect. Model optimisation starts with:
Y = Genotype + Sex + Genotype*Sex + (1|Batch)
Genotype*Sex is sometimes called the "interaction term" in PhenStat.
Assume batch is normally distributed with defined variance.
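Written out in conventional mixed-model notation, the starting model with a random batch intercept might look like the sketch below. The symbols (the beta coefficients, the batch intercept b_j, and the residual epsilon_ij) are assumptions introduced here for illustration and are not taken from the PhenStat documentation.

```latex
\begin{aligned}
y_{ij} &= \beta_0
        + \beta_1\,\mathrm{Genotype}_{ij}
        + \beta_2\,\mathrm{Sex}_{ij}
        + \beta_3\,(\mathrm{Genotype}_{ij} \times \mathrm{Sex}_{ij})
        + b_j + \varepsilon_{ij}, \\
b_j &\sim \mathcal{N}(0, \sigma_b^2), \qquad
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
\end{aligned}
```

Here y_ij is the measurement for animal i in batch j, b_j is the random intercept for batch j, and beta_3 is the coefficient of the interaction term.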
NOTE: The MM encoded in PhenStat supports an optional "weight" term.
The Mixed model framework is an iterative process to select the best model for the data which considers both the best modelling approach (Mixed model or general linear regression) and which factors to include in the model.
If the PhenStat assumptions about the input data are not met, the data are analysed a second time using a Mann-Whitney U Rank Sum test.
Control selection strategy
One side effect of producing data in a high throughput pipeline is that the input data for a statistical calculation might be produced over multiple days. Environmental fluctuations have been identified as a confounding factor when comparing data gathered on different days. The IMPC describes this as a "batch effect" and it is treated as a random effect in the Mixed model framework.
The data sets to be analysed are identified using unique combinations of these fields:
Field  Description
Background strain  The original strain from which the mutant specimen was derived.
Allele / Colony  The genomic variation in the mutant. The allele describes the character of the mutation and the Genotype effect term of the Mixed model.
Zygosity  The severity of the mutation.
Pipeline  The standardized phenotyping pipeline as described in IMPReSS Pipelines.
Procedure  The standardised set of procedures (experiments) as described in IMPReSS procedures.
Parameter  The standardised set of measurements as described in IMPReSS parameters.
Metadata group  Some parameters are indicated as "procedureMetadata" type. Some of these metadata are used to group comparable data together as described on the IMPReSS parameters page under the "Required For Data Analysis" section. The parameters that are marked as "Required For Data Analysis" are collectively identified by an identifier called the metadata group.
Organisation  The phenotyping organisation that performed the experiment and collected the data.
Sex [1]  The sex of the specimens. When analyzed using the Mixed model, males and females are analysed together to determine the Sex and Sex*Genotype interaction effect terms.
[1] optional
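As a rough illustration of how records could be partitioned into data sets by these fields, here is a minimal Python sketch. The field names in KEY_FIELDS and the record layout are hypothetical and do not reflect the actual IMPC schema; sex is treated as an optional extra key, as the footnote indicates.

```python
from collections import defaultdict

# Hypothetical field names, chosen to mirror the table above.
KEY_FIELDS = (
    "background_strain", "colony", "zygosity", "pipeline",
    "procedure", "parameter", "metadata_group", "organisation",
)


def group_into_datasets(records, include_sex=False):
    """Group records by the unique combination of the key fields."""
    fields = KEY_FIELDS + (("sex",) if include_sex else ())
    datasets = defaultdict(list)
    for record in records:
        key = tuple(record[field] for field in fields)
        datasets[key].append(record)
    return dict(datasets)


# Made-up example records: two parameters measured on the same colony.
base = {
    "background_strain": "strain A", "colony": "colony 1",
    "zygosity": "homozygote", "pipeline": "pipeline X",
    "procedure": "procedure Y", "metadata_group": "group-1",
    "organisation": "centre Z",
}
records = [
    {**base, "parameter": "body weight", "sex": "female", "value": 21.3},
    {**base, "parameter": "body weight", "sex": "male", "value": 26.1},
    {**base, "parameter": "body length", "sex": "female", "value": 9.4},
]
datasets = group_into_datasets(records)
datasets_by_sex = group_into_datasets(records, include_sex=True)
```

With the sample records, grouping without sex yields one data set per parameter, while including sex splits the body weight records further.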
IMPC phenotyping centers operate using different workflows, which contribute to the batch effect.
Workflow  Description  Statistical implications  Control selection strategy
One batch  All mutant and control data are measured on one day.  No batch effect. The controls and mutants are analysed using Y = Genotype + Sex + Genotype*Sex.  Concurrent control strategy: use control data that are collected on the same day as the mutant data.
Multi-batch (2+)  Mutant and control data are gathered over a few days.  Possible batch effect. The controls and mutants are analysed using Y = Genotype + Sex + Genotype*Sex + (1|Batch); the batch term may be dropped if it is not significant.  Baseline control strategy: use all control data within the same metadata group.
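The two control selection strategies can be sketched in Python as follows. This is an illustration only: the record layout (a dict with a "batch" key holding the assay date) and the function name are assumptions, and the controls passed in are assumed to have been filtered to the mutants' metadata group already.

```python
def select_controls(mutants, controls):
    """Return (strategy, selected_controls) for one data set.

    Each record is a dict with a "batch" key (the assay date). The
    controls are assumed to belong to the same metadata group as the
    mutants.
    """
    batches = {m["batch"] for m in mutants}
    if not batches:
        raise ValueError("no mutant data supplied")
    if len(batches) == 1:
        # One batch: keep only controls measured on the same day.
        (batch,) = batches
        return "concurrent", [c for c in controls if c["batch"] == batch]
    # Two or more batches: use every control in the metadata group.
    return "baseline", list(controls)


# Made-up example data.
controls = [
    {"id": "c1", "batch": "2020-01-06"},
    {"id": "c2", "batch": "2020-01-07"},
    {"id": "c3", "batch": "2020-01-06"},
]
one_batch = [
    {"id": "m1", "batch": "2020-01-06"},
    {"id": "m2", "batch": "2020-01-06"},
]
multi_batch = [
    {"id": "m3", "batch": "2020-01-06"},
    {"id": "m4", "batch": "2020-01-07"},
]
```

How a one-batch data set with no same-day controls should be handled is not stated in the table; the sketch simply returns an empty list in that case.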
For each data set, the appropriate workflow is determined and the statistical calculation is performed. For continuous data, the Mixed model is the IMPC's preferred method of analysis; however, this method requires that the following assumptions are met:
 1. The data is normally distributed
 2. The data has some variation
 3. There must be more than four data points per sex per genotype
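A minimal Python sketch of checking these three assumptions before choosing a method is shown below. How normality is assessed is not described here, so the sketch takes it as a flag supplied by the caller; the function names and the data layout (a dict keyed by sex and genotype) are hypothetical.

```python
def mixed_model_applicable(groups, normally_distributed):
    """Check the three Mixed model assumptions for one data set.

    groups maps (sex, genotype) to a list of measured values.
    normally_distributed is decided by the caller (assumption: the
    normality check itself is outside the scope of this sketch).
    """
    all_values = [v for values in groups.values() for v in values]
    has_variation = len(set(all_values)) > 1
    # "More than four data points per sex per genotype".
    enough_points = all(len(values) > 4 for values in groups.values())
    return bool(normally_distributed and has_variation and enough_points)


def choose_continuous_method(groups, normally_distributed):
    """Prefer the Mixed model, otherwise fall back to the rank sum test."""
    if mixed_model_applicable(groups, normally_distributed):
        return "Mixed model"
    return "Mann-Whitney U Rank Sum test"


# Made-up example: five animals per sex per genotype.
groups = {
    ("female", "control"): [20.1, 21.4, 19.8, 20.7, 21.0],
    ("female", "mutant"): [18.2, 17.9, 18.8, 18.1, 17.5],
    ("male", "control"): [25.3, 26.0, 24.7, 25.9, 25.1],
    ("male", "mutant"): [22.4, 23.1, 22.8, 21.9, 22.6],
}
too_few = {key: values[:4] for key, values in groups.items()}
```

With the sample data, the full groups satisfy the size and variation checks, whereas the truncated groups (four points each) fall back to the rank sum test.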
The graph pages display plots according to the data type of the parameter. Categorical data parameters display a stacked bar chart whereas continuous data displays a box plot and a scatter plot of the data point values. See the graph documentation for more details.
Fisher's exact output
A table displaying more information about the data used to determine the P value and effect size is displayed below the graph.
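For readers who want to see how a P value of this kind arises, here is a minimal pure-Python sketch of a two-sided Fisher's exact test on a 2x2 table. It is not the PhenStat implementation: the function name and the table layout (rows for control and mutant, columns for two categories) are assumptions, and the effect size calculation is not shown.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact P value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Probability that the top-left cell equals x, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    tolerance = 1 + 1e-7  # guards against floating point ties
    p_value = sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * tolerance
    )
    return min(1.0, p_value)


# Made-up counts: rows are control and mutant, columns are two categories.
p_value = fisher_exact_two_sided(8, 2, 1, 5)
```

The sketch handles only 2x2 tables; parameters with more than two categories would need a generalised version.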
Mixed model (PhenStat) output
The more statistics link at the bottom of the table will list the statistical method as "MM framework, generalized least squares, equation withoutWeight" when the batch term is not significant, otherwise "MM framework, linear mixed-effects model, equation withoutWeight".
Rank sum output
The more statistics link at the bottom of the table will list the statistical method as "Wilcoxon rank sum test with continuity correction" when a rank sum calculation has generated the statistics.
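To illustrate what a rank sum test with continuity correction computes, here is a pure-Python sketch using the normal approximation with a tie correction. It is an illustration under stated assumptions rather than the code that produces the IMPC results: the function name is hypothetical, and statistical packages may use an exact calculation for small samples without ties, which this sketch does not attempt.

```python
from math import erfc, sqrt


def rank_sum_test(x, y):
    """Two-sided rank sum test via the normal approximation.

    Returns (U, p) where U is the Mann-Whitney statistic for sample x.
    Applies a continuity correction of 0.5 and a tie correction.
    """
    n1, n2 = len(x), len(y)
    n = n1 + n2
    # Tag each value with its sample (0 for x, 1 for y) and sort.
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y])

    rank_sum_x = 0.0
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and combined[j][0] == combined[i][0]:
            j += 1
        # Tied values at positions i..j-1 share the mean of ranks i+1..j.
        midrank = (i + 1 + j) / 2
        size = j - i
        tie_term += size**3 - size
        rank_sum_x += midrank * sum(1 for k in range(i, j) if combined[k][1] == 0)
        i = j

    u = rank_sum_x - n1 * (n1 + 1) / 2
    mean = n1 * n2 / 2
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:
        return u, 1.0  # all values identical: no evidence of a shift
    z = max(0.0, (abs(u - mean) - 0.5) / sqrt(variance))
    return u, erfc(z / sqrt(2))


# Made-up example: two completely separated samples.
u_stat, p_value = rank_sum_test([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])
```

The function expects at least one value in each sample; input validation is left out to keep the sketch short.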
Reference Range Plus output
The more statistics link at the bottom of the table will list the statistical method as "Reference Range Plus" when a reference range calculation has generated the statistics.
Statistics to Phenotype
If the mutant genotype effect represents a significant change from the control group, then the IMPC pipeline will attempt to associate a Mammalian Phenotype (MP) term to the data.
The particular MP term(s) defined for a parameter are maintained in IMPReSS. Frequently, the term indicates an increase or decrease of the parameter measured.
When a statistical result is determined as significant, the following diagram is used for associating MP terms:
Significance
When a mutant genotype effect P value is less than 1.0E-4 (i.e. 0.0001), it is considered significant.
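The threshold check can be sketched in a few lines of Python. The constant and function names are hypothetical, and deriving an "increased" or "decreased" direction from the sign of the effect size is an assumption made for illustration, not a description of the IMPC term association logic.

```python
SIGNIFICANCE_THRESHOLD = 1.0e-4  # P values below this are significant


def is_significant(p_value):
    """Return True when the genotype effect P value is below the threshold."""
    return p_value < SIGNIFICANCE_THRESHOLD


def change_direction(p_value, effect_size):
    """Return 'increased' or 'decreased' for a significant, non-zero effect.

    Assumption: the direction follows the sign of the effect size.
    Returns None when the result is not significant or the effect is zero.
    """
    if not is_significant(p_value) or effect_size == 0:
        return None
    return "increased" if effect_size > 0 else "decreased"
```

Note that the comparison is strict: a P value of exactly 0.0001 is not treated as significant.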