I’ve been dreading this issue. Ever since AI moved beyond drawing celebrities as kids and into the world of weather forecasting, I knew I’d have to write about it. But it’s a vast, nuanced topic that requires actual thought. I’d much rather write another piece about rainbows, hit send, and take a nap.
But, alas, the latest issue of the journal Weather and Forecasting has a doozy of an article that demands attention. The authors engaged in a meteorological battle royale: Google’s AI forecasting model vs. two of the most popular physics-based forecasting models.

Prêt-à-Porter
Weather forecasting models are computer systems designed to predict the future. But instead of relying on the Psychic Friends Network or a Magic 8-Ball, they’re grounded in real-world data. The type of data they use and how they analyze it differentiates one model from another.
In meteorology, numerical weather prediction (NWP) models are the current gold standard. These models ingest historical and current weather data and apply complex mathematical equations rooted in the physics of atmospheric science. They require enormous amounts of data and computing power; the most advanced models take hours to run on some of the world’s most powerful supercomputers.
Artificial intelligence–based weather prediction (AIWP) is a new approach emerging from the recent revolution in AI technology. These models use machine-learning and deep-learning strategies to identify patterns in historical data and apply them to current situations. They rely less (if at all) on simulating atmospheric physics. Because the underlying math is less computationally intensive, AI models can generate forecasts much faster and with significantly less computing power. Instead of a Jack Black, they can get by with a Jason Segel.
D-D-D-D-Don’t Believe the Hype
In the past couple of years, researchers have published numerous papers proclaiming the superiority of AI-based weather prediction. Many—though not all—of these papers appear in high-quality journals and undergo peer review, meaning they meet the community’s standards for methodology and analysis. Some of these models have also been released publicly, allowing other scientists to attempt replication. So far, so good!
However, the hype train is rarely slowed by reality. When I worked in astronomy, it was a running joke to bet on when the next press release would announce the discovery of an Earth-like planet, or the moment Voyager 2 would "finally" exit the solar system. Both seemed to happen annually for decades. The reason is that one can define the boundary of the solar system, or what counts as an Earth-like planet, in many different ways.
Meteorology faces a similar challenge. We have a near-infinite array of variables that can be used to compare forecasts, giving researchers the flexibility to highlight whichever ones make their model look best. This is known as positive results bias, and one way to combat it is through pre-registration: publicly defining your methods and evaluation metrics before running the study. But that takes guts—and most scientists aren’t doing it yet. As a result, we have many papers describing Generic Model X as superior to Generic Model Y because the authors can selectively report outcome variables (moving the goalposts). Yes, a model may perform better in the areas discussed, but what about the ones that aren’t mentioned? Did it perform worse there? Were those variables even analyzed? So take model competitions with a grain of salt.
AIWP models show immense promise. But they are still new, and researchers report some limitations. First, their spatial and temporal resolution is generally coarser than that of NWP models, making them less useful for specific, localized predictions. Additionally, AIWP outputs tend to focus on general atmospheric variables (e.g., temperature, humidity, and wind) rather than the more impactful phenomena people are typically interested in (e.g., precipitation, cloud cover, and wind gusts). Since the former often drive the latter, it’s easy to imagine that the best approach might be to combine the strengths of both AI and NWP models. But that won’t make for a fun press release.

The Research
A research team led by Jacob T. Radford at the Cooperative Institute for Research in the Atmosphere and NOAA compared a leading AIWP model with two top NWP models across a range of real-world, precipitation-related variables. They published their results in the journal Weather and Forecasting.
The researchers selected Google’s open-source GraphCast as the AIWP contender because it is one of the more mature options currently available. It was recently updated to include greater detail in precipitation forecasts and offers increased transparency in its underlying processes, such as access to model weights.
For the NWP contenders, the researchers chose the Integrated Forecasting System (IFS) and the Global Forecast System (GFS), two leading global NWP models used for medium-range forecasting. The IFS is developed by the European Centre for Medium-Range Weather Forecasts (ECMWF) and is widely regarded as one of the most accurate models. The GFS, produced by the U.S. National Weather Service (NWS), is another well-respected global model that runs more frequently and is freely available to the public.
Kissin’ Cousins
The field of battle was precipitation records across the contiguous United States, drawn from the existing ERA5 and NCEP/EMC Stage IV datasets. These sources provide gridded precipitation estimates based on rain gauge measurements, radar data, and internal modeling techniques.
Now here’s where it gets a little weird. GraphCast doesn’t forecast from scratch; it has to be initialized with a snapshot of current conditions, and the researchers initialized it with data from either the GFS or the IFS. Importantly, they used only the data itself, not the physics-based predictions of the GFS or IFS models. As a result, there are two GraphCast outputs to analyze: one initialized with GFS data and one initialized with IFS data.
The Judges’ Scorecards
The two GraphCast AIWP models were more accurate across most measures, particularly when initialized with ERA5 data. One way the researchers evaluated this was by comparing the Equitable Threat Score (ETS), which measures how well a forecast matches observed events while adjusting for hits that could occur by chance¹. ETS compares the number of correct event forecasts (hits) to the total number of forecast and observed events, after subtracting the number of hits expected by random chance.
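If you want to see the arithmetic, here’s a minimal sketch of ETS computed from a standard 2×2 contingency table. (The paper applies it to gridded precipitation fields at various thresholds; the function and numbers below are illustrative, not the authors’ code.)

```python
# Minimal sketch: Equitable Threat Score from a 2x2 contingency table.
# Illustrative only; the paper computes ETS on gridded precipitation fields.

def equitable_threat_score(hits, misses, false_alarms, correct_negatives):
    total = hits + misses + false_alarms + correct_negatives
    # Hits you'd expect by pure chance, given how often the event was
    # forecast and how often it was actually observed.
    hits_random = (hits + false_alarms) * (hits + misses) / total
    return (hits - hits_random) / (hits + misses + false_alarms - hits_random)

# Hypothetical tally: 40 hits, 10 misses, 20 false alarms, 130 correct negatives.
print(equitable_threat_score(40, 10, 20, 130))  # ~0.45; 1 is perfect, 0 is no skill beyond chance
```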

However, in some areas, the NWP models did better. GraphCast tends to overestimate the frequency of light precipitation and underestimate the frequency of intense precipitation. In general, GraphCast’s advantage seems to be associated with better accuracy in predicting where precipitation falls as opposed to how much.
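A standard way to quantify that kind of over- and under-forecasting (though not necessarily the paper’s exact metric) is the frequency bias: how often an event was forecast relative to how often it was observed. A quick sketch with made-up numbers:

```python
# Frequency bias: >1 means the event is forecast more often than observed
# (over-forecasting); <1 means under-forecasting. Numbers below are made up.

def frequency_bias(hits, misses, false_alarms):
    return (hits + false_alarms) / (hits + misses)

print(frequency_bias(hits=50, misses=10, false_alarms=30))  # 1.33: light rain over-forecast
print(frequency_bias(hits=5, misses=10, false_alarms=2))    # 0.47: heavy rain under-forecast
```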

One of my favorite science writers is Alex Hutchinson, an endurance athlete with a Ph.D. who writes books and cool articles like these. He does a terrific job of explaining recent research in the field of athletic performance, with a focus on endurance sports such as marathons. I’ve read him for years and have noticed a few common threads in his articles: (1) the best results come from what works for you, and (2) the differences between tested conditions are usually small and often apply only in specific contexts.
This paper echoes that second point. It’s less about one model to rule them all and more about taking what works best from each model and combining them. In this case, forecasters can use GraphCast to predict where it will rain and the IFS/GFS to predict how intense the rain will be. Together, that’s a better forecast for everyone.
The results remind us not to get caught up in the hype of a new toy, but also not to ignore progress when it stares us in the face.
And Now for Something Completely Different
The MLB season is almost a month in. Forget pennant races; this is one of the most exciting times of the year. There’s enough data to get an idea of who the good teams will be, but not enough to dash the hopes of… say, White Sox fans.
The impact of weather conditions on baseball has been studied for decades, and the field shows no sign of slowing down. Here are some interesting recent studies:
This paper found that fans would pay almost $3 more for a ticket to a game under pleasant temperatures versus extreme temperatures.
This paper found that Tommy John (UCL) injuries² are more common among college and high school pitchers in warmer climates, and it’s not because they pitch more often.

This paper found that players born in cold, arid climates tended to hit for less power when playing in hot temperatures as pros.
Baseball truly is a multidisciplinary sport. These articles are from journals in the fields of Geography, Orthopaedics, and Economics.
We acknowledge additional contributions from Chris Nabholz and Beth Mills. We are grateful for a grant from Lockheed Martin to support this newsletter.
Our archive of WWAT articles is here.
During the preparation of this work, the author(s) used ChatGPT-4o in order to copyedit text. After using this tool, WWAT’s author(s) reviewed and edited the content as needed and take(s) full responsibility for the content of the publication.
¹ For example, if you are answering a multiple-choice question with four options, you have a 25% chance of getting it right by simply guessing. So, if 50% of respondents got it right, you want to account for the fact that many of them could have guessed. That gives you a better idea of what percentage of people actually knew the answer. If x is that fraction, then x + 0.25(1 − x) = 0.50, which works out to about 33%.
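For anyone who wants the algebra spelled out:

x + 0.25(1 − x) = 0.50
0.75x + 0.25 = 0.50
0.75x = 0.25
x = 1/3 ≈ 33%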
² A serious elbow injury that was once career-ending but now typically requires about a year of recovery.
Thank you for this article. As a now-retired federal government employee, my division would occasionally be contacted by companies pushing their new AI weather forecasts, saying they were superior to the physics-based models we used. But when pressed they 1) wouldn't show us the data they used; 2) would only provide vague statistics on basic items like temperature; and 3) stammered and finally admitted that AI couldn't yet predict the things we really needed like thunderstorms, turbulence, icing, winter precipitation, etc. AI has lots of promise but beware of the overselling going on right now. Good job pointing that out.
Thanks for this post, Aaron. We can't stop the AI train, so it really does pay to understand how it works in the area of atmospheric science. I guess my concern is that weather prediction will become totally privatized, when history has shown that the best way to approach forecasting is by collaboration: private, public, military, and even amateur weather watchers. I hope that doesn't go away, but we'll see...