Statistical analysis of the National Institutes of Health peer review system

Communicated by James O. Berger, Duke University, Durham, NC, May 15, 2008 (received for review February 17, 2008)
Abstract
A statistical model is proposed for the analysis of peer-review ratings of R01 grant applications submitted to the National Institutes of Health. Innovations of this model include parameters that reflect differences in reviewer scoring patterns, a mechanism to account for the transfer of information from an application's preliminary ratings and group discussion to final ratings provided by all panel members, and posterior estimates of the uncertainty associated with proposal ratings. Application of this model to recent R01 rating data suggests that statistical adjustments to panel rating data would lead to a 25% change in the pool of funded proposals. Viewed more broadly, the methodology proposed in this article provides a general framework for the analysis of data collected interactively from expert panels through the use of the Delphi method and related procedures.
Every year, the National Institutes of Health (NIH) spend more than $22 billion to fund scientific research (1). Approximately 70% of these funds are awarded through a peer-review process overseen by the NIH Center for Scientific Review (CSR). Despite the vast sum of money involved, the absence of statistical methodology appropriate for the analysis of peer-review scores generated by this system has precluded the type of detailed assessment applied to other national health and educational systems (2, 3). As a consequence, statistical adjustments to account for uncertainties and biases inherent to these scores are not made before funding decisions. To address this deficiency, this article examines the properties of these ratings and proposes methodology to more efficiently extract the information contained in them.
It is useful to begin with a brief review of the NIH peer-review system. Upon submission to the NIH, most grant applications (e.g., R01, R03, and R21) are assigned to a study section within an Integrated Review Group (IRG) for review, and to an NIH Institute and Center (IC) for eventual funding. IRG study sections typically contain ≈30 members and review ≈50 grant applications (proposals) during each of three annual meetings. Because it is impractical for every member of a study section to review every application, between two and five reviewers are typically assigned to read and score each application before the study section convenes. In the sequel, these individuals are called the proposal's “readers,” and the scores they assign before a study section convenes are called “prescores.” Proposals are scored on a 1.0–5.0 scale in increments of 0.1 units, with 1.0 representing the best score. When the study section convenes, the scientific review officer (SRO) and the study section chair suggest a list of proposals that might be “streamlined.” Based on their prescores, proposals on this list are viewed as unlikely to receive fundable priority scores and, if no one in the study section objects, are not considered further. The remaining proposals are discussed and scored by all members of the study section.
Readers of a grant application begin the discussion by announcing their prescores and summarizing the proposal for other members of the study section, most of whom will not have read it. After these summaries, there is an open discussion of the application. Proposal readers then state their “postscores” for the application, and all other members of the study section (i.e., the proposal's nonreaders) also score the proposal. Nonreaders are required to either score the proposal within 0.5 units of the range of scores established by reader postscores or provide a written statement to the SRO explaining why they scored the proposal outside of that range. Scores received from all study section members are then averaged to obtain the proposal's priority score. In “established” study sections, priority scores are converted to a percentile ranking through a comparison with recent priority scores from other grant applications scored within that study section. In newer study sections or special emphasis panels (i.e., panels that are convened to rate a limited number of proposals), percentile scores are calculated by comparing the proposal's priority score to established norms. Finally, proposal percentile ratings are used by ICs to determine which applications will be funded. Although the exact criteria by which ICs use these percentiles to make funding decisions vary by the IC, funding decisions are thought to be highly correlated with percentile scores.
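The scoring rules just described can be sketched in a few lines. This is a minimal illustration, not CSR software: the function names are hypothetical, and clamping the permitted range to the 1.0–5.0 scale is an assumption added here.

```python
from statistics import mean

def allowed_range(reader_postscores, margin=0.5):
    """Range within which nonreaders may score without a written explanation.

    Assumption: the range is clamped to the 1.0-5.0 scoring scale.
    """
    lo = max(1.0, min(reader_postscores) - margin)
    hi = min(5.0, max(reader_postscores) + margin)
    return round(lo, 1), round(hi, 1)

def needs_explanation(score, reader_postscores, margin=0.5):
    """True if a nonreader score falls outside the permitted range."""
    lo, hi = allowed_range(reader_postscores, margin)
    return not (lo <= score <= hi)

def priority_score(reader_postscores, nonreader_scores):
    """Priority score: plain average of all final scores for the proposal."""
    return mean(list(reader_postscores) + list(nonreader_scores))
```

The conversion of priority scores to percentiles is omitted because it depends on the reference set of recent scores for each study section.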
In this article, I propose statistical methodology to account for the effect of the selection of readers on a proposal's final percentile score, quantify the uncertainty associated with the percentile scores, and demonstrate how such uncertainties can be incorporated into a decision-theoretic framework to improve the probability that the greatest proportion of top proposals are funded. Viewed more generally, methods developed in this article extend existing statistical methodology for the analysis of multirater ordinal data (4–7) and item response data (8–12) to provide a framework for the analysis of panel rating data collected by using the Delphi method and related interactive rating schemes (13).
The data that form the basis for this study were collected as part of a contract awarded to the author by the CSR in November 2004. As part of that study, all preliminary and final reader scores and nonreader scores for all R01 grant proposals submitted to the NIH and reviewed under the auspices of the CSR over two review cycles (June and October 2005) were collected and redacted.
Description of Data
Ratings for 18,959 R01 proposals rated by 14,041 reviewers in 744 study sections (including special emphasis panels) were available for analysis. Fig. 1 displays a histogram of all scores, including reader prescores and postscores, and nonreader scores. Table 1 provides a summary of the mean and standard deviation of the rater scores.
Several interesting features of the data are apparent from Fig. 1. Among these is a tendency for reviewers to use two distinct scales to score proposals. The first scale, nominally assumed by the CSR, runs from 1.0 to 5.0 in increments of 0.1 units. The second scale, used more frequently for less competitive proposals, runs from 1.0 to 5.0 in increments of 0.5 units. Evidence for the operation of these dual scales is provided in Fig. 2, in which the conditional means of reader prescores are displayed as a function of the prescore assigned to a proposal by a single reviewer. The relation between a reader prescore and the mean of other reader prescores for the same proposal is nearly linear between ≈1.1 and 3.0, but, outside of that range, the relationship is not monotonic. For example, among proposals that receive one prescore of 5.0, the mean of the remaining prescores is 3.2; for proposals receiving a prescore of 4.9, the mean of the remaining prescores is 3.7. Although not a central focus of this article, these observations suggest that a 20-point scale, anchored at an “average” rating of 10, might be better supported by current rating procedures. Such a scale would nominally provide a 10-point scale for nonstreamlined proposals.
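The conditional means plotted in Fig. 2 could be computed along the following lines. This is a sketch under assumed inputs: `prescores_by_proposal` is a hypothetical mapping from a proposal identifier to the list of its reader prescores, not the format of the actual data.

```python
from collections import defaultdict

def conditional_means(prescores_by_proposal):
    """For each prescore value, average the mean of the *other* readers'
    prescores on the same proposal, pooled over all proposals."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for scores in prescores_by_proposal.values():
        if len(scores) < 2:
            continue  # no other reader to compare against
        total = sum(scores)
        for y in scores:
            totals[y] += (total - y) / (len(scores) - 1)
            counts[y] += 1
    return {y: totals[y] / counts[y] for y in sorted(totals)}
```

A non-monotonic sequence of values in the returned mapping, such as the drop between prescores of 4.9 and 5.0 reported above, is the pattern that points to a second, coarser scale.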
Results
I used a latent variable model (14, 15) to formally describe the relation among application merit, reader pre- and postscores, and nonreader scores. Within this model, reader prescores were assumed to represent independent assessments of application merit, whereas reader postscores and nonreader scores were assumed to represent weighted averages of information elicited during the proposal discussion and the scores of (other) proposal readers. I used a continuous-valued latent variable μ_{i} to represent the merit of the ith application. The resulting model was then used to estimate the effects of reader biases and to assess the uncertainty in final proposal rankings. A description of this statistical model is provided in the supporting information (SI).
Adjustments for Reader Bias.
Demonstrating the benefit of corrections for reviewer bias is difficult because true proposal merits are not known. For this reason, I examined the effectiveness of bias corrections in two stages. First, I performed a cross-validation study that used only reader prescores. Because reader prescores can be considered to be conditionally independent, they can be analyzed without modeling the complex structure among their values, reader postscores, and nonreader scores. Therefore, a comparison of the model-based prediction errors based on reader prescores to the NIH prediction error provides an indication of the effectiveness of corrections for reader biases and a partial model validation. Second, I applied the full statistical model to all rater scores to illustrate the impact of reader bias on the final estimates of the proposals' merits.
I implemented the cross-validation experiment by first splitting reader prescores into two samples, randomly assigning 90% of the scores to a training sample and assigning the remaining 10% to a test sample. I used the training data to estimate model parameters. The posterior means of merit parameters for the proposals were then converted back to the original rating scale and were used to predict prescores in the test sample. The mean squared error for these predictions was 0.373.
In the NIH scoring system, proposal merit is estimated by the sample mean of the raters' scores. Thus, the estimate of a proposal's merit based on the training sample is the sample mean of the training sample prescores. The mean squared error of the corresponding prediction of prescores in the test sample was 0.413. Use of the statistical model to predict reader prescores in the test sample thus reduced the mean squared error of prediction by ≈10%.
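The shape of this comparison can be sketched as follows. The bias-adjusted predictor below is a deliberately simplified stand-in (an additive merit-plus-reader-bias fit by alternating averages), not the ordinal probit model used in the article; all function names and the `(proposal, reader, score)` record layout are assumptions.

```python
import random
from collections import defaultdict
from statistics import mean

def split_scores(records, test_frac=0.1, seed=0):
    """Randomly hold out a fraction of (proposal, reader, score) records."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = max(1, round(test_frac * len(shuffled)))
    return shuffled[n_test:], shuffled[:n_test]

def fit_additive(train, n_iter=50):
    """Fit score ~ merit[proposal] + bias[reader] by alternating averages."""
    merit, bias = defaultdict(float), defaultdict(float)
    by_prop, by_reader = defaultdict(list), defaultdict(list)
    for p, r, y in train:
        by_prop[p].append((r, y))
        by_reader[r].append((p, y))
    for _ in range(n_iter):
        for p, items in by_prop.items():
            merit[p] = mean(y - bias[r] for r, y in items)
        for r, items in by_reader.items():
            bias[r] = mean(y - merit[p] for p, y in items)
    return merit, bias

def mse(pairs):
    return mean((y - y_hat) ** 2 for y, y_hat in pairs)

def compare(records, seed=0):
    """Return (MSE of sample-mean predictor, MSE of bias-adjusted predictor)."""
    train, test = split_scores(records, seed=seed)
    merit, bias = fit_additive(train)
    raw = defaultdict(list)
    for p, _, y in train:
        raw[p].append(y)
    grand = mean(y for _, _, y in train)  # fallback for unseen proposals
    base, adj = [], []
    for p, r, y in test:
        base.append((y, mean(raw[p]) if p in raw else grand))
        m = merit[p] if p in merit else grand
        adj.append((y, m + bias.get(r, 0.0)))
    return mse(base), mse(adj)
```

With the figures reported above, the relative reduction is (0.413 − 0.373)/0.413 ≈ 0.097, which is the ≈10% quoted.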
The improvement in mean squared error enjoyed by the model-based estimate can be attributed primarily to the estimation of parameters that represent rater biases, or the tendency of some raters to score proposals more stringently than others.
When propagated through the full statistical model for reader postscores and nonreader scores, these effects can be quite dramatic. For example, consider the posterior estimates of the proposal merits listed in Table 2. These proposals represent the top 15 applications selected from a study section that reviewed 99 proposals over the two cycles for which data were collected. Proposal rankings were based on the posterior means of the μ_{j}, which are listed in column two of the table (μ̄). The sample means of reader postscores and nonreader scores are listed in column three (ȳ). Columns 4 through 18 provide the posterior probabilities that each proposal had higher merit than each of the other proposals. Note that there is substantial disagreement between the ordering of proposals obtained from the statistical model and the raw priority score averages. Because these differences are so great, it is helpful to examine their source. To this end, consider the most extreme example from Table 2, the eighth proposal.
This proposal had a posterior mean estimate of μ̄_{8} = −0.86, which was based on four reader prescores of 1.2, 1.8, 2.5, and 1.4. The third reader, who assigned this proposal a prescore of 2.5, assigned prescores of 2.6, 2.8, and 2.2 to the three other proposals he read, and as a consequence was estimated to have a relatively large, positive bias. Similarly, the second reader, who assigned the prescore of 1.8, also graded more stringently than average, assigning an average prescore of 2.0 to the 11 proposals she reviewed. The prescores assigned to this proposal by the first and fourth readers are even more unusual and were the lowest prescores that these readers assigned to any proposal. These two panel members prescored 7 and 12 grants, respectively, and assigned average prescores of 2.74 and 2.33. The reader postscores of this grant application, in order from the first to the fourth reader, were an abstention, 1.8, 2.5, and 1.2. There was thus considerable disagreement among the readers concerning the merit of this proposal.
This discord carried over to the nonreaders of the proposal, who were split in their opinions. Ten of 22 nonreaders scored the proposal 1.2 or 1.3, whereas 7 of 22 scored the proposal 1.8 or higher.
The scores of this proposal thus reflect one obvious but important feature of the NIH scoring system: The scoring patterns of readers assigned to an application have a major impact on its final priority score.
Restricting attention only to the effects of rater biases, the model-based correction for these effects changed the rank of the eighth proposal from 13 to 8, or from being near the current NIH funding line to being under it. Applying similar corrections to proposals in all study sections suggests that corrections for rater biases would lead to a change in ≈25% of funding decisions. At a 15% funding line, 20% of funded proposals would be replaced by unfunded proposals if differences in reader scoring patterns were taken into account. At a 10% funding line, this difference becomes ≈27%. In dollars, this translates to the redirected allocation of approximately $5 billion of grant funding every year.
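The turnover figures quoted here amount to comparing two funded sets at a given funding line. A minimal sketch follows, assuming each set of scores is a dictionary keyed by proposal identifier with lower values better; the function names are hypothetical.

```python
def funded_set(scores, funding_rate):
    """Identifiers of the best-scoring proposals (lower score = better)."""
    n_funded = max(1, round(funding_rate * len(scores)))
    ranked = sorted(scores, key=scores.get)
    return set(ranked[:n_funded])

def turnover(raw_scores, adjusted_scores, funding_rate):
    """Share of proposals funded on raw scores that lose funding
    once adjusted scores are used instead."""
    raw = funded_set(raw_scores, funding_rate)
    adjusted = funded_set(adjusted_scores, funding_rate)
    return len(raw - adjusted) / len(raw)
```

Because both funded sets have the same size, the share of raw-score winners displaced equals the share of adjusted-score winners that were previously unfunded.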
Uncertainty in Proposal Ratings.
Uncertainties associated with proposal orderings should also be considered when allocating research funds, particularly when uncertainty is great (2, 3, 16). To examine the importance of this factor, consider again the eighth-ranked proposal from Table 2. Because of the disparity of scores assigned to this application, it is difficult to accurately determine its relative merit. The posterior probability that it was better than the ninth-ranked proposal was estimated to be only 0.60, and the posterior probabilities that it was better than the 10th- and 11th-ranked proposals were 0.55 and 0.70, respectively. Yet there was a 24% chance that it was better than the seventh-ranked proposal and a 20% chance that it was better than the fifth- and sixth-ranked proposals.
These probabilities reflect another feature of the model-based estimates of each proposal's merit that is not captured by the sample mean of the priority scores. The actual merit of this proposal is not clear from either the reader scores or the nonreader scores; it could rank in the top four or five proposals from this study section, or it might only be among the top 10 or 15.
More generally, a statistical model to determine the merit of each proposal provides a mechanism for balancing the estimates of posterior uncertainty regarding the relative merit of proposals against the requested costs of proposals to arrive at more rational funding decisions. To understand how this might be accomplished, consider again the proposals summarized in Table 2.
Because the costs requested in the proposals in Table 2 are not available, hypothetical costs have been inserted into the final column of the table. For convenience, a proposal's total costs were assumed to be distributed between $200,000 and $450,000, based on the assumption that the average funding of an R01 proposal is approximately $350,000 (1).
In the absence of a formal utility function for proposal merit, let us assume that the NIH wishes to maximize the probability that the top 13% (for example) of grant applications are funded under a fixed constraint on the available funding. Suppose further that 13 × $350,000 = $4.55 million is available to fund a subset of the proposals listed in Table 2 and recall that 99 proposals were rated by this study section.
Without accounting for the uncertainty in proposal rankings, a natural funding decision would be to simply fund the top 13 proposals in the table. The combined cost of these proposals is $4.55 million, and so this selection might appear to maximize the probability that the top 13% of proposals would be funded. However, this choice does not account for the uncertainty associated with the estimates of the relative merit of proposals 13–15.
To account for the uncertainty in the relative merits of proposals, the numerical algorithm used to sample from the posterior distribution was also used to rank proposals for each sample generated from the posterior distribution. Based on these samples, it was possible to calculate the probability that each fundable subset of proposals (i.e., a group of proposals costing no more than $4.55 million) contained the 13 top proposals. The posterior probability that proposals 1–13 were the 13 best proposals was thus calculated to be 17%.
Given the imposed cost constraints and noting that proposals 14 and 15 have the same total cost as proposal 13, an alternative funding decision would be to fund proposals 1–12, 14, and 15. The combined cost of these proposals is also $4.55 million. Perhaps surprisingly, the posterior probability that this set of proposals contains the 13 best proposals is 21%, nearly 24% greater in relative terms than the probability achieved by the selection of proposals 1–13. Clearly, this selection of proposals would significantly increase the NIH's probability of funding the top 13% of proposals from within this study section.
This general approach for combining uncertainty and costs extends easily to different target levels of funding, or to funding decisions made for proposals pooled from several study sections. In addition to maximizing the probability that the top proposals are funded, using such an approach to balance costs against uncertainties would have a further benefit: It would decrease the costs requested in grant applications. In the current highly competitive funding environment, applicants would submit reduced budgets if they knew this would improve their chance of being funded.
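The subset calculation described in this paragraph can be sketched from posterior draws. The code assumes each draw is a dictionary mapping proposal identifier to a sampled merit value, with lower values better (matching the 1.0-is-best scoring scale); that orientation, the function names, and the brute-force search are illustrative assumptions, not the article's algorithm.

```python
from itertools import combinations

def top_k(draw, k):
    """The k best proposals in one posterior draw (lower merit value = better)."""
    return set(sorted(draw, key=draw.get)[:k])

def prob_better(draws, a, b):
    """Posterior probability that proposal a has higher merit than proposal b."""
    return sum(1 for d in draws if d[a] < d[b]) / len(draws)

def prob_contains_top_k(draws, subset, k):
    """Fraction of draws in which every one of the k best proposals is in subset."""
    subset = set(subset)
    return sum(1 for d in draws if top_k(d, k) <= subset) / len(draws)

def best_affordable_subset(draws, costs, budget, k):
    """Exhaustive search over subsets of the candidates in `costs` whose total
    cost does not exceed `budget`; practical only for a short candidate list."""
    best, best_p = None, -1.0
    ids = sorted(costs)
    for size in range(k, len(ids) + 1):
        for combo in combinations(ids, size):
            if sum(costs[i] for i in combo) <= budget:
                p = prob_contains_top_k(draws, combo, k)
                if p > best_p:
                    best, best_p = set(combo), p
    return best, best_p
```

Applied to draws from the fitted model, `prob_contains_top_k` would be evaluated once for proposals 1–13 and once for any alternative set of equal cost, and the set with the larger probability preferred.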
Discussion
The statistical model proposed in this article illustrates the potential that exists for modeling rating data collected interactively from panels of experts. It accounts for differences in reviewer scoring criteria, provides a model for the sequential rating of items by various subsets of reviewers, and quantifies uncertainty associated with final proposal ratings. Numerous refinements to this model framework are clearly possible. For example, the model could be extended to account for differences between the weights assigned to ratings by primary reviewers, secondary reviewers, and discussants or for differences that might be explained by reviewer attributes [e.g., academic rank, gender, ethnicity, scientific review group (SRG) experience]. Indeed, entirely different classes of statistical models might alternatively be considered, and it would be worthwhile to assess the sensitivity of funding decisions to the particular model adopted.
Within the context of NIH peer-review rankings, an issue that urgently requires additional study involves the impact of review group discussion on the final rankings of proposals. The approach taken here represents an extremely optimistic view of this “discussion effect.” That is, systematic shifts in reader prescores to reader postscores and nonreader scores were assumed to result from an implicitly unbiased glimpse of the true merit of a proposal manifested through group discussion. In practice, group dynamics and reviewer attributes probably play as important a role in such discussions as do the proposals' merits. Unfortunately, the data do not contain unambiguous information regarding the true value of review group discussions or the possible biases associated with them. Such information might be obtained, however, through an experimental study of the rating process itself.
Perhaps the simplest experiment that could be conducted to assess the validity of the discussion effect would be to set aside from SRG discussion a random sample of reader prescore and postscore information. Nonreader scores could subsequently be contrasted to omitted prescores, which (under the assumptions of the model above) could be corrected to provide unbiased estimates of the proposals' merits. The relative distribution of deviations of nonreader scores from omitted prescores and reported prescores would provide an indication of the extent to which a discussion of a proposal represents an independent assessment of its merit. Such analyses could be strengthened by examining the impact of individual reader attributes (e.g., academic rank, gender, years of SRG experience) on observed shifts of nonreader scores toward reported reader prescores. Ultimately, data collected from such experiments might be used to assess the tradeoff between the cost of conducting SRG meetings and the cost of collecting additional, independent ratings of applications.
There is, however, no ambiguity regarding the need for more sophisticated statistical analyses of NIH peer-review data. As the example in the previous section illustrates, variability inherent to rater scores, and differences in the criteria used by individual raters to assign scores to proposals, have an enormous impact on funding decisions. The statistical model proposed in this article, or a modification of it, should be applied by the NIH to account for these effects.
The primary technical difficulty associated with the implementation of this model stems from the estimation of model hyperparameters that are common to all SRGs. For example, estimation of hyperparameters that model the correlation of category thresholds across review groups requires the evaluation of the complete likelihood function, which depends on rating data collected from all study sections. However, collecting these data from all SRGs to estimate global model hyperparameters would delay the processing of summary scores. A practical implementation of the model would thus require both an offline procedure to estimate and update the posterior distributions of global model hyperparameters based on past review cycle data, and the updating of the values (or summary statistics describing the posterior distribution of values) of a static set of global hyperparameters used concurrently in end-user software (17).
Implementation of such a system would likely change the pool of funded proposals by 25%; accounting for both requested costs and uncertainty in the relative merits of proposals would likely result in more than a 35% change. Explicitly accounting for cost in funding decisions would also result in a net decrease in the cost of the average proposal, which in turn would allow the NIH to fund more grant applications.
Methods
I used a Bayesian hierarchical statistical model to describe the process by which raters scored grant applications. Stages in the model hierarchy were specified sequentially according to the order in which scores for proposals were generated.
First-Stage Model.
I modeled reader prescores using ordinal probit models (6, 18) defined by using latent variables. Letting μ_{i} denote the “true” merit of proposal i on an underlying measurement scale, r_{j} denote a “bias” term associated with the prescores assigned by reader j, γ_{m} denote a vector of category thresholds associated with IRG study section m, and x_{i,j}^{pre} denote the unobserved latent variable upon which reader j assigns prescore y_{i,j}^{pre} to proposal i, such a model was specified by assuming that [Eq. 1; display equation not recovered]. To establish the underlying scale of measurement, the μ_{i} values were assumed to be independently distributed as standard normal deviates.
A priori, biases attributable to raters and the error terms associated with the assignment of prescores to categories were assumed to be independent and distributed according to [display equation not recovered]. The mean of rater biases ζ was included in the model to account for the fact that reader prescores have a lower mean value than either the reader postscores or nonreader scores.
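The display equations for this stage are missing from the text as reproduced here. One form consistent with the verbal definitions is given below as an assumption, not as the author's exact Eq. 1. The category index ℓ is introduced here for illustration, and the pairing of σ_{0}^{2} with the prescore error terms and of τ^{2} with the rater biases is inferred from the hyperparameter list given at the end of this section.

```latex
\begin{align*}
x_{i,j}^{\mathrm{pre}} &= \mu_i + r_j + \varepsilon_{i,j}^{\mathrm{pre}}, \\
y_{i,j}^{\mathrm{pre}} = \ell \;&\Longleftrightarrow\;
  \gamma_{m,\ell-1} < x_{i,j}^{\mathrm{pre}} \le \gamma_{m,\ell}, \\
\mu_i \sim N(0,1), \qquad r_j &\sim N(\zeta, \tau^2), \qquad
  \varepsilon_{i,j}^{\mathrm{pre}} \sim N(0, \sigma_0^2).
\end{align*}
```

Under this reading, a reader with a large positive r_{j} tends to land in higher (worse) score categories for a proposal of given merit, which is the stringency effect the bias parameters are meant to absorb.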
Second-Stage Model.
In the next stage of the data generating process, I assumed that readers modified their prescores by using both the reported values of other reader prescores and the group's discussion of the proposal. The resulting reader postscores were thereby represented as a weighted average of three information sources: the reader's own prescore, the prescores of the other readers, and the discussion.
The latent value x_{i,j}^{post} assumed to be responsible for the generation of reader j's postscore of proposal i, y_{i,j}^{post}, can be written as [Eq. 2], where [Eq. 3] and [Eq. 4] (display equations not recovered). The error terms ε_{i,j}^{post} were assumed to be independently distributed as N(0,σ_{1}^{2}) random variables. Here, A_{i} denotes the set of reviewers who provided prescores for proposal i.
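The display equations for this stage are likewise missing. The sketch below shows one weighted-average form that matches the description in the text (a reader's own prescore, the proposal's merit as revealed by discussion, and the other readers' prescores). It is an assumption throughout: in particular, whether the averaged quantities are the latent prescore variables, and the exact content of Eqs. 3 and 4, cannot be determined from the surviving text.

```latex
\begin{align*}
x_{i,j}^{\mathrm{post}} &= u_{ij}\, x_{i,j}^{\mathrm{pre}} + v_i\, \mu_i
  + \sum_{k \in A_i,\; k \neq j} w_{ijk}\, x_{i,k}^{\mathrm{pre}}
  + \varepsilon_{i,j}^{\mathrm{post}}, \\
u_{ij} + v_i + \sum_{k \in A_i,\; k \neq j} w_{ijk} &= 1,
  \qquad u_{ij},\, v_i,\, w_{ijk} \ge 0, \\
\varepsilon_{i,j}^{\mathrm{post}} &\sim N(0, \sigma_1^2).
\end{align*}
```

The sum-to-one constraint is what makes the expression a weighted average and is consistent with the weights being drawn from a Dirichlet distribution.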
On the latent scale of measurement, the model specification described so far resembles a standard hierarchical model with a Gaussian error structure. Unfortunately, the usual Gaussian model does not provide an accurate representation of reader postscores and nonreader scores at higher levels in the model hierarchy. This difficulty stems from the high proportion of reader postscores that fall within the range defined by the reader prescores, and the even higher proportion of nonreader scores that fall within the range defined by the reader postscores. There also is a tendency for nonreaders to assign scores that are identical to a reader postscore.
To account for these tendencies, the weights u_{ij}, v_{i}, and w_{ijk} were assumed to be generated from a Dirichlet model with a parameter vector containing a component a for each u_{ij}, a component b for each v_{i}, and a component c for each w_{ijk}. The distribution of hyperparameters estimated at higher levels in the model hierarchy makes it likely that these weights are assigned values that are either close to 0 or 1; this permits the model to mimic the tendency of nonreaders to concentrate their scores around and between the scores recorded by the proposal's readers.
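The tendency of Dirichlet weights to sit near 0 or 1 when the concentration parameters are small can be checked by simulation. The sketch below uses a symmetric Dirichlet with illustrative parameter values; the values 0.2 and 5.0 are not estimates from the article, and the function names are hypothetical.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet vector by normalizing independent gamma variates."""
    while True:
        gammas = [rng.gammavariate(a, 1.0) for a in alphas]
        total = sum(gammas)
        if total > 0.0:  # guard against numerical underflow for tiny alphas
            return [g / total for g in gammas]

def share_near_extremes(alpha, n_components=3, n_draws=2000, tol=0.05, seed=0):
    """Fraction of sampled weights lying within tol of 0 or of 1."""
    rng = random.Random(seed)
    near = 0
    for _ in range(n_draws):
        weights = sample_dirichlet([alpha] * n_components, rng)
        near += sum(1 for w in weights if w < tol or w > 1.0 - tol)
    return near / (n_draws * n_components)
```

Comparing `share_near_extremes(0.2)` with `share_near_extremes(5.0)` shows the contrast: small parameters push most of the weight onto a single source, whereas large parameters spread it evenly.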
Another innovation of the statistical model involves the inclusion of the term v_{i} μ_{i} in the weighted average defining the latent variable x_{i,j}^{post} (Eq. 2). The purpose of this term is to model systematic shifts between reader prescores and reader postscores that result from a proposal's discussion. In the construction of this term, v_{i} weights μ_{i}, the parameter that represents the true merit of the proposal. That is, the model implicitly assumes what might be regarded as the ideal situation from the NIH's standpoint. Alternative assumptions regarding the distributions of these weights can be incorporated into the model framework, but for the purposes of this article the NIH's “ideal” was assumed. It is important to note, however, that the rating data themselves cannot be used to validate this assumption in the absence of an external “gold standard” for relative proposal merits.
The values of the hyperparameters a, b, and c determine, respectively, the average relative weights that readers assign to their own prescores, the proposal discussion, and the prescores of other readers when determining their final postscore ratings.
Third-Stage Model.
The model for nonreader scores y_{i,j}^{non} is similar to the model specified for reader postscores y_{i,j}^{post}, except that nonreader scores were assumed to be based on a latent variable x_{i,j}^{non} that represents a weighted average of reader postscores and proposal merit. That is, the model for nonreader scores was obtained by replacing Eq. 2 with [Eq. 5; display equation not recovered] and modifying Eqs. 3 and 4 accordingly. The weights appearing in Eq. 5 were defined similarly to those used to model reader postscores.
Further description of higher-level model structures [including the prior distributions imposed on model hyperparameters (γ_{m}, a, b, c, σ_{0}^{2}, σ_{1}^{2}, σ_{2}^{2}, τ^{2})], along with model diagnostics and a brief description of the numerical algorithm used to fit this model to the peer-review data, is provided in the SI.
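Eq. 5 is also missing from the text as reproduced here. By analogy with the description of the reader postscore model, one plausible form is sketched below. It is an assumption: the use of latent postscore variables, the reuse of the symbols v_{i} and w_{ijk} for the nonreader weights, and the pairing of σ_{2}^{2} with the nonreader error terms are all inferred rather than taken from the source.

```latex
\begin{align*}
x_{i,j}^{\mathrm{non}} &= v_i\, \mu_i
  + \sum_{k \in A_i} w_{ijk}\, x_{i,k}^{\mathrm{post}}
  + \varepsilon_{i,j}^{\mathrm{non}}, \\
v_i + \sum_{k \in A_i} w_{ijk} &= 1, \qquad
\varepsilon_{i,j}^{\mathrm{non}} \sim N(0, \sigma_2^2).
\end{align*}
```

There is no own-prescore term because nonreaders do not assign prescores.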
Acknowledgments
I thank James Berger and two referees for constructive comments and suggestions that significantly improved the manuscript.
Footnotes
*Email: vejohnson{at}mdanderson.org

Author contributions: V.E.J. designed research, performed research, analyzed data, and wrote the paper.

The author declares no conflict of interest.

Data deposition: Dr. Johnson will provide the data in ASCII format upon request.

This article contains supporting information online at www.pnas.org/cgi/content/full/0804538105/DCSupplemental.
 Received February 17, 2008.
 © 2008 by The National Academy of Sciences of the USA
Freely available online through the PNAS open access option.
References
1. Office of Budget, National Institutes of Health
6. Johnson VE, Albert JA
8. Verhelst N, Verstralen H
11. Skrondal A, Rabe-Hesketh S
16. National Research Council
17. Kolen MJ, Brennan RL
18. McCullagh P