Got a question? Check if the answer is here. If not, please send an e-mail to
You can also check the references and background information.

I heard somewhere that ensemble mean forecasts outperform single control forecasts started from the best available analysis beyond 96 hours. Is the ensemble useless before that time?

 There are two issues here. First, a properly constructed ensemble should have a mean that is of equal or better quality than its equal-resolution control right from the beginning (not only after some lead time). The NCEP global ensemble is certainly not perfect, but it still behaves this way for the NH extratropics; see the RMS error statistics for the spring or summer of 2001 at:
For the first few days the scores are very similar (i.e., the ensemble mean looks much like the control forecast). In a finer-resolution model, however, significant differences can develop due to nonlinearities much earlier than 2-3 days.

 Second, there is the issue of higher-resolution control forecasts. As can be seen from the verification figures above, the T170/L42 higher-resolution control outperforms both the T126/L28 lower-resolution ensemble control (which starts from the same, but truncated, initial condition) and the ensemble mean (which is initially centered on the same initial condition) for the first few days. Beyond 3 days or so, however, the low-resolution ensemble mean beats even the high-resolution control. So the 72-96-hour lead time you heard about, beyond which a global ensemble mean beats the control, reflects two facts: (1) significant nonlinearities develop only after 2-3 days; and (2) the slight advantage a higher-resolution control may have initially is outweighed by the nonlinear error-filtering effect of ensemble averaging (see next question).
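
 The nonlinear error-filtering effect can be illustrated with a toy numerical example (synthetic Gaussian errors, not the actual NCEP system): once member errors have decorrelated, averaging N equally skillful forecasts reduces the random error component by roughly 1/sqrt(N). A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
npts, nmem = 500, 10

truth = np.zeros(npts)                        # "true" field at verification time
control = truth + rng.normal(0.0, 1.0, npts)  # control forecast with unit RMS error

# nmem perturbed members with the same error level as the control, but with
# quasi-independent errors (as develop once nonlinear growth decorrelates them)
members = truth + rng.normal(0.0, 1.0, (nmem, npts))
ens_mean = members.mean(axis=0)

def rmse(fcst):
    return np.sqrt(np.mean((fcst - truth) ** 2))

print(rmse(control))   # close to 1
print(rmse(ens_mean))  # close to 1/sqrt(10): averaging cancels uncorrelated error
```

In reality member errors are only partially independent, so the gain is smaller than 1/sqrt(N), but the mechanism is the same.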

I heard the ensemble mean is better than a control forecast because it smooths out small-scale features. Can't we then simply apply a spatial filter to a single control forecast and avoid the need to run an ensemble?

 These experiments have been performed. Both the control and the ensemble mean forecasts were retrospectively and separately smoothed with a spatial filter, removing smaller and smaller spatial scales until optimal verification scores were achieved for each forecast. The result is that the optimally smoothed ensemble mean retains 60% of the advantage it had before the forecasts were smoothed; see p. 3311 of Toth, Z., and E. Kalnay, 1997: Ensemble forecasting at NCEP and the breeding method. Mon. Wea. Rev., 125, 3297-3319.

 These results were corroborated further when the small-scale features in control forecasts were replaced by the small-scale features present in the ensemble mean. These experimental forecasts verified better than the control. Tests were also run in which the variance of the small-scale waves in the control was reduced to the level those waves have in the ensemble mean. These forecasts verified worse than those in which the ensemble mean was used to specify the smaller scales.
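
 The smoothing experiment can be sketched in one dimension with made-up synthetic fields (the setup below is illustrative, not the actual experiment): smooth each forecast more and more, keep its best verification score, and compare the two optimally smoothed forecasts.

```python
import numpy as np

rng = np.random.default_rng(1)
n, nmem = 1024, 8
x = np.linspace(0, 2 * np.pi, n, endpoint=False)

# synthetic truth: a large-scale wave plus a PREDICTABLE small-scale wave
truth = np.sin(3 * x) + 0.3 * np.sin(25 * x)

def forecast():
    # each forecast gets the predictable scales right but adds an
    # unpredictable (random-phase) small-scale wave and some noise
    phase = rng.uniform(0, 2 * np.pi)
    return truth + 0.5 * np.sin(20 * x + phase) + rng.normal(0, 0.1, n)

control = forecast()
ens_mean = np.mean([forecast() for _ in range(nmem)], axis=0)

def smooth(f, w):  # centered moving-average spatial filter of width w
    return np.convolve(f, np.ones(w) / w, mode="same")

def rmse(f):
    return np.sqrt(np.mean((f - truth) ** 2))

def best_smoothed_rmse(f):  # remove smaller and smaller scales; keep best score
    return min(rmse(smooth(f, w)) for w in range(1, 120, 2))

# the optimally smoothed ensemble mean still beats the optimally smoothed
# control: the filter cannot remove the unpredictable wave without also
# destroying the predictable small-scale wave nearby in scale
print(best_smoothed_rmse(control), best_smoothed_rmse(ens_mean))
```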

 What these results show is that ensemble averaging provides more than simple spatial averaging. It is generally true that larger-scale features are more predictable than smaller-scale features. However, there are many variations on this theme. Setting aside model errors (see next question) for a moment, the predictability of a particular feature depends on two factors: (1) the initial-value-related uncertainty in the upstream area that affects the forecast feature at verification time; and (2) the instability of the flow that connects the upstream area and the verifying feature, i.e., all the dynamics connecting the initial- and final-time features. (Note that targeted observations, where data are collected in areas that influence particular downstream features, is a discipline devoted to studying these temporal/spatial connections.)

 As you can imagine, this is a very complex problem that leads, from time to time, to highly predictable smaller-scale features and poorly predictable larger-scale features. The ensemble, through its perturbed forecasts, can capture these variations, and its mean will selectively filter out those unpredictable features that vary from member to member, while retaining those predictable features that are lined up in most member forecasts. Indiscriminate spatial filtering will remove some predictable smaller-scale features while retaining some unpredictable larger-scale features.
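
 A contrived toy example (synthetic fields, not real forecasts) makes this concrete. Here the small-scale wave is the predictable one (identical in every member) and the large-scale wave is not (random phase per member), so ensemble averaging keeps exactly what a scale-based filter would throw away:

```python
import numpy as np

rng = np.random.default_rng(2)
n, nmem = 512, 50
x = np.linspace(0, 2 * np.pi, n, endpoint=False)

# predictable SMALL-scale wave, unpredictable LARGE-scale wave
small_scale = 0.5 * np.sin(25 * x)
members = np.array([small_scale + np.sin(2 * x + rng.uniform(0, 2 * np.pi))
                    for _ in range(nmem)])
ens_mean = members.mean(axis=0)

def amplitude(field, k):  # spectral amplitude at wavenumber k
    return 2.0 * np.abs(np.fft.rfft(field)[k]) / field.size

print(amplitude(ens_mean, 25))  # ~0.5: the predictable small scale survives
print(amplitude(ens_mean, 2))   # strongly damped: unpredictable large scale
```

A low-pass spatial filter applied to any single member would instead keep the unpredictable wavenumber-2 wave and remove the predictable wavenumber-25 wave.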

As I understand it, ensemble forecasts use the same NWP models that are used to generate the control forecasts. I know these models have lots of problems. What's the use of looking at a bunch of ensemble forecasts if they all have the same bias as the control forecast?

 Forecast errors are due to three main factors: (1) errors in the initial conditions; (2) errors caused by the use of imperfect models; and (3) errors in the specified boundary conditions (which we can, from here on, also treat as model errors, i.e., a failure to model the boundary conditions properly). Ensembles are known to be able to distinguish between forecasts with low and high uncertainty due to initial-value-related errors (Toth et al. 2001). Apparently the model problems are not serious enough to prevent the ensemble from performing this task, as control forecasts are undoubtedly able to provide very useful forecast information in most cases.

 Every once in a while, however, the whole ensemble will miss the verifying analysis, as you noted in your question. How often this happens can easily be quantified: for the NCEP global ensemble 500 hPa height forecasts, it occurs about once a week (1 out of 7 cases). Probabilistic forecasts can then be "calibrated" based on this information. If all ensemble members, for example, suggest a particular weather event (a 100% forecast probability), we can evaluate all similar forecast cases from the past month and see how often that forecast verified. If only 90% of those past cases verified, we can make an adjustment and issue a forecast with 85% probability (because we know the model errs, on average, that often).
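
 Such a calibration can be sketched with a reliability table built from a hypothetical archive of past cases (all numbers below are made up for illustration): each raw ensemble probability is replaced by the observed frequency of past cases that had a similar raw probability.

```python
import numpy as np

rng = np.random.default_rng(3)
ncases = 5000

# made-up archive of past cases: the raw ensemble probabilities are
# deliberately overconfident (standing in for the effect of model error)
raw = rng.uniform(0.0, 1.0, ncases)
occurred = rng.uniform(0.0, 1.0, ncases) < (0.1 + 0.8 * raw)

# reliability table: observed event frequency in each raw-probability bin
nbins = 10
idx = np.minimum((raw * nbins).astype(int), nbins - 1)
table = np.array([occurred[idx == b].mean() for b in range(nbins)])

def calibrate(p):
    """Replace a raw probability with the observed frequency of past
    cases that had a similar raw probability."""
    return table[min(int(p * nbins), nbins - 1)]

print(calibrate(1.0))  # a raw "sure thing" drops to well below 100% here
print(calibrate(0.0))  # a raw 0% is raised above zero
```

Operational calibration schemes are more elaborate, but the principle is the same: past verification statistics shrink the forecast probabilities toward what the system has actually delivered.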

 Thus an ensemble capturing only initial-value-related uncertainty can still be useful and provide very reliable probabilistic forecasts (i.e., forecast probabilities matching the associated observed frequencies over the long term), especially if the ensemble is properly postprocessed to reflect its average failure rate due to model problems. A better ensemble would also capture variations in forecast uncertainty related to model errors, giving, say, 95% probabilities on days when model error is expected to play a smaller-than-average role, and 85% on days when model error is more likely to dominate. Our ensembles are not that sophisticated yet, but when they are, the resolution of the resulting probabilistic forecasts (i.e., how close the probability values are to the ideal 0 and 1 levels) will be further improved.
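
 The distinction between reliability and resolution can be illustrated with the Brier score on synthetic events (a toy setup, not real verification data): two systems can both be perfectly reliable while one resolves the individual cases far better.

```python
import numpy as np

rng = np.random.default_rng(4)
ncases = 20000

# two hypothetical, equally RELIABLE forecast systems for the same events:
# system A always issues the 50% climatological probability; system B issues
# 0.85 or 0.15 depending on the day, matching the true daily event frequency
p_daily = rng.choice([0.85, 0.15], ncases)
obs = rng.uniform(0.0, 1.0, ncases) < p_daily

def brier(p):  # Brier score: mean squared probability error (lower is better)
    return np.mean((p - obs) ** 2)

print(brier(np.full(ncases, 0.5)))  # 0.25: reliable but no resolution
print(brier(p_daily))               # ~0.128: reliable AND resolving
```

Both systems' probabilities match the observed frequencies, so both are reliable; system B scores better only because its probabilities sit closer to the ideal 0 and 1 levels.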
