False Discovery Rate (FDR) is a measure of accuracy when multiple hypotheses are being tested at once. Ex: multiple metrics measured in the same experiment.
False Discovery Rate (FDR) is a statistical measure used to assess the rate of Type I errors (incorrectly rejecting a true null hypothesis) among all rejected hypotheses in multiple testing procedures, i.e., when multiple hypotheses are being tested at once. This concept is particularly relevant in large-scale data analysis scenarios, such as gene expression studies and genome-wide association studies, where thousands of tests are performed simultaneously. FDR provides a more nuanced criterion for significance than traditional methods, aiming to control the expected proportion of false discoveries (incorrect rejections) among the declared significant results. Common methods for FDR control include the Benjamini-Hochberg procedure, the Benjamini-Yekutieli procedure, and the Storey-Tibshirani q-value approach.
In technical terms, the false discovery rate is the proportion of all “discoveries” which are false.
When running a classical statistical test, any time a null hypothesis is rejected it can be considered a “discovery”. For example, any statistically significant metric is considered a discovery since we can conclude the measured difference is highly unlikely to be due to random noise alone and the treatment is directly influencing the metric. On the other hand, metrics which did not reach significance are statistically inconclusive — they are not a discovery as it wasn’t possible to reject the null hypothesis.
In the context of online experimentation and A/B testing, the false discovery rate is the proportion of statistically significant results which are false positives. Or to write it algebraically:
FDR = N_falsely_significant / N_significant
Where N_falsely_significant is the number of statistically significant metrics which were not truly impacted (number of false positives) and N_significant is the total number of metrics which were deemed statistically significant.
For example, if you see 10 statistically significant metrics in your experiment and you happen to know that 1 of those 10 significant metrics was a false positive and wasn’t really impacted, that gives you a false discovery rate of 10% (1 out of 10). In this way the FDR depends only on the statistically significant metrics: whether there were 1 or 1,000 other statistically inconclusive metrics in the example above, the FDR would still be 10%.
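As a quick illustration, the example above can be written as a couple of lines of Python. The counts used here are the hypothetical numbers from that example; in a real experiment the number of false positives among your significant metrics is not directly observable.

```python
# Minimal sketch: computing the FDR from known counts.
# In practice you never know n_falsely_significant directly;
# these are the hypothetical values from the example above.

n_significant = 10          # metrics that reached statistical significance
n_falsely_significant = 1   # of those, the ones with no true impact

fdr = n_falsely_significant / n_significant
print(f"FDR = {fdr:.0%}")   # -> FDR = 10%
```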
Traditionally, the Bonferroni correction is a conservative approach to address the multiple testing problem by adjusting p-values to control the family-wise error rate (see section below), but it can be overly stringent, leading to a decrease in power, especially in large-scale studies.
As an alternative, FDR-based methods, including the use of q-values (adjusted p-values that provide an FDR-based measure of significance), have been developed. These q-values represent the minimum FDR at which a test statistic would be considered significant, offering a balance between discovering true positives and limiting false discoveries.
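To make the relationship between Bonferroni-adjusted p-values and q-values concrete, here is a brief sketch in Python. It assumes the statsmodels library is available and uses a handful of made-up p-values; the q-values shown are Benjamini-Hochberg adjusted p-values, a common practical stand-in for Storey's q-values.

```python
# Sketch comparing Bonferroni-adjusted p-values with Benjamini-Hochberg
# q-values, assuming the statsmodels library is available.
from statsmodels.stats.multitest import multipletests

p_values = [0.001, 0.008, 0.012, 0.041, 0.20, 0.55]  # illustrative p-values only

_, bonferroni_adj, _, _ = multipletests(p_values, method="bonferroni")
_, q_values, _, _ = multipletests(p_values, method="fdr_bh")

for p, p_bonf, q in zip(p_values, bonferroni_adj, q_values):
    print(f"p={p:.3f}  bonferroni={p_bonf:.3f}  q-value={q:.3f}")
```

Note how the Bonferroni adjustment inflates the p-values far more aggressively than the FDR adjustment, which is exactly the loss of power described above.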
Bayesian and empirical Bayes methods can further refine FDR control by incorporating prior knowledge about the distribution of test statistics or parameters within the data set, leading to adaptive procedures that can more accurately estimate the proportion of true null hypotheses. These adaptive methods adjust their criteria based on the observed data, making them particularly useful for complex testing problems where the distribution of test statistics may vary.
In the context of gene expression studies, for example, researchers might use t-tests to compare expression levels across conditions for thousands of genes. The challenge lies in determining which differences are statistically significant while controlling for the high risk of Type I errors due to the large number of tests. By applying FDR-controlling procedures, researchers can set a significance level (usually denoted α) that reflects the proportion of false positives they are willing to accept among the significant findings, thus making the results more reliable.

The Journal of the Royal Statistical Society and the Annals of Statistics, among other academic sources, have extensively discussed FDR and its implications for statistical practice, highlighting its importance in ensuring the integrity of conclusions drawn from large-scale testing. The development and refinement of FDR-controlling algorithms, including those based on Bayesian and empirical Bayes approaches, continue to be a critical area of research, offering more sophisticated tools for managing the trade-off between sensitivity and specificity in hypothesis testing.
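Returning to the gene expression example above, the following sketch simulates that workflow end to end: one t-test per gene, followed by Benjamini-Hochberg FDR control at α = 0.05. The data, gene counts, and effect sizes are all made up for illustration, and the numpy, scipy, and statsmodels libraries are assumed to be available.

```python
# Sketch of the gene expression workflow described above, on simulated data.
import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
n_genes, n_samples = 5000, 20

# Most genes are unchanged between conditions; a small fraction truly differ.
control = rng.normal(0.0, 1.0, size=(n_genes, n_samples))
treated = rng.normal(0.0, 1.0, size=(n_genes, n_samples))
treated[:200] += 1.0   # the first 200 genes get a real shift

# One t-test per gene, then control the FDR at level alpha = 0.05.
_, p_values = ttest_ind(treated, control, axis=1)
reject, _, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")

print(f"{reject.sum()} genes declared significant")
# Only knowable here because we simulated the truth:
print(f"{reject[200:].sum()} of those are false discoveries")
```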
Another common measure of the accuracy of a test is the False Positive Rate (FPR). This is the probability that a null hypothesis will be rejected when it is in fact true. In other words, it is the chance that a given metric, which is not impacted at all by your experiment, will show a statistically significant impact.
The important distinction between the false positive rate and the false discovery rate is that the false positive rate applies to each metric individually, i.e. each non-impacted metric may have a 5% chance of showing a false positive result, whereas the false discovery rate looks at all hypotheses that are being tested together as one set.
The Family Wise Error Rate (FWER) is another measure of accuracy for tests with multiple hypotheses. It can be thought of as a way of extending the false positive rate to apply to situations where multiple things are being tested. The FWER is defined as the probability of seeing at least one false positive out of all the hypotheses you are testing. The FWER can increase dramatically as you begin to test more metrics. For example, if you test 100 metrics and each has a false positive rate of 5% (as would be the case if you use a typical 95% confidence or 0.05 p-value threshold), the chance that at least one of those metrics would be statistically significant is over 99%, even if there was no true impact whatsoever to any of the metrics.
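The arithmetic behind that claim is straightforward if the metrics are assumed to be independent; a quick check in Python:

```python
# Chance of at least one false positive across m independent tests,
# each run at a 5% false positive rate (the figures used in the text above).
alpha = 0.05
m = 100
fwer = 1 - (1 - alpha) ** m
print(f"FWER across {m} metrics: {fwer:.1%}")  # roughly 99.4%
```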
When you are testing only one hypothesis (ex: a test with only one metric) these three measures will all be equivalent to each other. However, it is when multiple hypotheses are being tested that they differ; in these situations the false discovery rate can be a very useful measure of accuracy as it takes into account the number of hypotheses being tested, yet is far less conservative than the FWER.
The false discovery rate is a popular way of measuring accuracy because it reflects how experimenters make decisions. It is (usually) only the significant results — the discoveries — which are acted upon. Hence it can be very valuable to know the confidence with which you can report on those discoveries. For example, if you have a false discovery rate of 5%, this is equivalent to saying that there is only a 5% chance, on average, that a statistically significant metric was not truly impacted. If you know your false discovery rate is 5%, you can rest assured that 95% of all the statistically significant metrics you see reflect a true underlying impact.
As well as simply measuring the accuracy of a test, there are ways to control and limit these error rates to a desired level through the experimental design. The false positive rate can easily be controlled by adjusting the significance threshold that is used to determine statistical significance. Controlling the false discovery rate is more complex as it depends upon the results, which cannot be known in advance. However, there are statistical techniques, such as the Benjamini-Hochberg correction, which can be applied to your results to ensure that the false discovery rate is no larger than your desired level.
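For illustration, here is a minimal sketch of the Benjamini-Hochberg step-up procedure, written from scratch and assuming independent (or positively dependent) tests. In practice you would normally rely on a vetted statistics library rather than hand-rolled code.

```python
# Minimal sketch of the Benjamini-Hochberg step-up procedure.
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Return a boolean array: True where the null hypothesis is rejected."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)                     # indices of p-values, ascending
    thresholds = alpha * (np.arange(1, m + 1) / m)
    below = p[order] <= thresholds            # sorted p-values vs. their thresholds
    k = np.max(np.nonzero(below)[0]) + 1 if below.any() else 0
    reject = np.zeros(m, dtype=bool)
    reject[order[:k]] = True                  # reject the k smallest p-values
    return reject

print(benjamini_hochberg([0.001, 0.008, 0.012, 0.041, 0.20, 0.55], alpha=0.05))
```

The procedure finds the largest k such that the k-th smallest p-value is below k/m times the target FDR level, then rejects the hypotheses with the k smallest p-values.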
Harness Feature Management and Experimentation gives you the confidence to move fast without breaking things. Set up feature flags and safely deploy to production, controlling who sees which features and when. Connect every flag to contextual data, so you can know if your features are making things better or worse and act without hesitation. Effortlessly conduct feature experiments like A/B tests without slowing down. Whether you’re looking to increase your releases, to decrease your MTTR, or to ignite your dev team without burning them out–Split is both a feature management platform and partnership to revolutionize the way the work gets done. Schedule a demo to learn more.