How Many Hottest Days of the Year (So Far)?

Summary

Introduction

I often see news articles along the lines of ‘beach-goers out in force as record temperatures reached in Cornwall’. By ‘record temperatures’ these articles often mean ‘hottest day of the year so far’ (which I will abbreviate to HDSF). This often left me wondering: are ‘hottest day of the year so far’ stories really noteworthy? There must be several days in any year upon which you could write this type of story. In the UK, what is the average number of days in a given calendar year which could be described as the ‘hottest day of the year so far’?

After some unfruitful internet searches, I was left wondering. I considered writing to the team on the fantastic radio programme More or Less (I'm an incredibly loyal listener), but eventually decided to dig into the numbers myself. This post shares some of the findings.

And who knows, maybe one day I'll get my two minutes of fame on More or Less due to this post.

How Many Hottest-Days-So-Far?

I started directly with the simple question I wanted to answer: what is the average number of days in each year which could be described as the ‘hottest day so far’ (HDSF) in that year? For a day to be an HDSF, the highest temperature recorded at any station on that day simply needs to exceed the highest temperature recorded at any station on any earlier day in that year. So straight to the headline statistic:

The yearly average number of HDSFs in the UK between 1920 and 2020 is 20.0.

It's more or less what I expected; in fact, it's remarkably close to my rough-but-informed initial estimate [1].
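
The counting rule described above amounts to tracking a running maximum through the year. A minimal sketch in Python (not the Julia code used for the analysis; the function name and the toy readings are made up for illustration):

```python
from typing import List, Optional


def count_hdsf(daily_max: List[Optional[float]]) -> int:
    """Count days whose maximum strictly exceeds every earlier day in the year.

    daily_max holds one value per day (the highest reading across all
    stations); None marks a day with no data, which is skipped.
    """
    count = 0
    running_max = None  # hottest temperature seen so far this year
    for temp in daily_max:
        if temp is None:
            continue
        if running_max is None or temp > running_max:
            count += 1
            running_max = temp
    return count


# Hypothetical readings for the first seven days of a year.
example_year = [8.1, 7.5, 9.0, 9.0, 12.3, 11.0, 12.4]
print(count_hdsf(example_year))
```

Under this rule the first day with data always counts, and a day that merely ties the running maximum does not.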

We can also just look at the distribution:

Figure: Distribution of yearly HDSF counts

The mean number of yearly HDSFs is 20.05. The median is 20. The standard deviation is 4.49. The minimum number of HDSFs is 11, in 1964, and the maximum is 34 in 1995.

At this point, you might be interested as to what a typical year looks like in terms of maximum daily temperature. 1978 has 20 HDSFs, so I plotted the daily max temperature over that year below.

Figure: Daily maximum temperatures in 1978. HDSFs are marked by diamonds.

What about the years with the most and fewest HDSFs, 1995 and 1964?

Figure: Daily maximum temperatures in 1995, the year with the most HDSFs. There are several runs of increasingly warm days.

Figure: Daily maximum temperatures in 1964, the year with the fewest HDSFs. There are a few days with large temperature spikes, which reduce the overall number of HDSFs.

We can also see whether there is a trend in the number of yearly HDSFs:

Figure: Number of HDSFs over the years.

It almost looks as if there's a discontinuity in the late 1960s, after which the average increases from 18.0 to 21.9. I'm not sure what causes this: perhaps an increased number of active weather stations or more reliable recording equipment leads to fewer sudden jumps in daily max temperature, and thus more yearly HDSFs.

Noteworthiness

The original motivation for this investigation was to count how many days newspapers could publish a ‘hottest day so far’ story each year. You might object that it's not noteworthy if a day is only 0.1˚C warmer than the previous HDSF, and therefore we shouldn't count such days.

To investigate this, I propose the concept of ‘ΔT-noteworthiness’: a day is considered ΔT-noteworthy (abbreviated ΔT-HDSF) if the maximum recorded temperature on that day is at least ΔT˚C more than the previous maximum recorded daily temperature [2]. According to this definition, in the previous section we considered 0˚C-HDSFs. We can see a concrete example of the difference between 0˚C-HDSFs and 1˚C-HDSFs in the plot below:

Figure: Difference between 0˚C- and 1˚C-noteworthiness in 1970
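
The ΔT rule can be sketched as follows, again in Python rather than the Julia used for the analysis. The function name is hypothetical. Two details are assumptions made here: a day must still strictly beat the running maximum (so that ΔT = 0 reproduces the plain HDSF count), and a tiny tolerance absorbs floating-point rounding in the subtraction.

```python
from typing import List


def noteworthy_days(daily_max: List[float], delta_t: float) -> List[int]:
    """Return the 0-based indices of the days that are delta_t-noteworthy.

    Each day is compared with the running maximum of ALL earlier days,
    not with the temperature of the previous noteworthy day.
    """
    days = []
    running_max = None
    for day, temp in enumerate(daily_max):
        is_record = running_max is None or temp > running_max
        # Small tolerance so that e.g. a gap of exactly 0.4 is not lost to rounding.
        gap_ok = running_max is None or temp - running_max >= delta_t - 1e-9
        if is_record and gap_ok:
            days.append(day)
        if is_record:
            running_max = temp  # updated even when the day is not noteworthy
    return days


# The three-day sequence from footnote 2.
print(noteworthy_days([10.0, 10.9, 11.8], 1.0))
print(noteworthy_days([10.0, 10.9, 11.8], 0.0))
```

With ΔT = 1 only the first day qualifies, because the third day is compared with 10.9 rather than 10.0. With ΔT = 0 all three days qualify.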

Now we can answer questions such as ‘How does the noteworthiness threshold ΔT affect the number of HDSFs?’

Figure: Relationship between ΔT and mean number of yearly ΔT-HDSFs. Shaded region shows the 20th/80th percentiles. Note that the asymptote is 1, since 1st January is always a ΔT-HDSF.

We can also investigate which years have the most noteworthy temperature gaps. Initial investigation suggests 1963 is the year with the largest gap. On 12th Jan 1963, a temperature of 16.7˚C was recorded at the Exeter Southam station, exceeding the previous max. recorded temperature of 8.1˚C by 8.6˚C. Apparently this was a particularly cold winter, one of the coldest on record in the UK.

Figure: 8.6˚C-noteworthy HDSFs in 1963

I think this measurement is probably a mistake: while Exeter Southam station reported a high of 16.7˚C, the mean daily high temperature in the rest of Devon on this day was -1.7˚C, and the maximum daily high in Devon-excluding-Exeter-Southam was just 0.6˚C. It seems implausible that the temperature at the Exeter Southam station was actually 16˚C warmer than anywhere else in Devon.

The most noteworthy plausible gap was on 4th May 2006, when a gap of 6.3˚C was recorded. This looks less likely to be a mistake, due to a prolonged period of higher temperatures following the spike. The high was 27.7°C (compared to a nationwide average of 18.1°C), measured in Northolt, London. I can't find anything in the news about the date (other than a council election taking place in Northolt on this day!). There was a heatwave later in the summer though, with highs of 36.5°C recorded.

Figure: 6.3˚C-noteworthy HDSFs in 2006

(ΔT, Tmin)-Noteworthiness

It would seem strange to write a ‘hottest day’ news story about a 12˚C day, even if it's several degrees hotter than the previous hottest day so far. Also, it seems strange that we should always count 1st Jan as an HDSF. We can define a new criterion: (ΔT, Tmin)-noteworthiness. A day is (ΔT, Tmin)-noteworthy if it is ΔT-noteworthy and the temperature is above Tmin.

Figure: Relationship between ΔT, Tmin, and (ΔT, Tmin)-noteworthiness

I think it's reasonable to set Tmin = 20˚C. I think it's also reasonable to write the ‘hottest day so far’ news story about 3 times per year. We can use this to make a recommendation to the press standards organisation: a day must be (1.6˚C, 20˚C)-noteworthy in order to write a ‘hottest day so far’ story.
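
One way to arrive at a threshold like this is to fix Tmin and scan ΔT upwards until the mean yearly count drops to the target. The sketch below is in Python, not the Julia used for the analysis. The function names, the 0.1˚C grid and the toy data are illustrative assumptions, not the procedure actually used to obtain 1.6˚C.

```python
from typing import Dict, List


def count_noteworthy(daily_max: List[float], delta_t: float, t_min: float) -> int:
    """Count days that beat the running maximum by at least delta_t and exceed t_min."""
    count = 0
    running_max = None
    for temp in daily_max:
        is_record = running_max is None or temp > running_max
        gap_ok = running_max is None or temp - running_max >= delta_t - 1e-9
        if is_record and gap_ok and temp > t_min:
            count += 1
        if is_record:
            running_max = temp
    return count


def calibrate_delta_t(years: Dict[int, List[float]], t_min: float, target: float,
                      step: float = 0.1, max_delta: float = 10.0) -> float:
    """Smallest delta_t on the grid whose mean yearly count is at most target."""
    for i in range(int(round(max_delta / step)) + 1):
        delta_t = round(i * step, 10)
        mean_count = sum(count_noteworthy(temps, delta_t, t_min)
                         for temps in years.values()) / len(years)
        if mean_count <= target:
            return delta_t
    raise ValueError("no delta_t on the grid meets the target")


# Two hypothetical (and very short) years of daily maxima.
toy_years = {
    2001: [12.0, 19.0, 21.0, 21.5, 24.0],
    2002: [15.0, 20.5, 21.7, 22.1, 25.0],
}
print(calibrate_delta_t(toy_years, t_min=20.0, target=2.0))
```

The scan can stop at the first hit because raising ΔT can only remove noteworthy days, never add them.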

It would be very interesting to investigate the circumstances under which newspapers publish ‘hottest day so far’ stories. Is (ΔT,Tmin)-noteworthiness a good model for empirical newsworthiness? My guess is that it depends partially on something like (ΔT,Tmin)-noteworthiness, but with additional factors: mainly whether or not the date falls on a weekend/bank holiday, but to a lesser extent on general news cycle events and time since the last article. If anyone takes a look at this, I'd be interested to hear your findings!

Where are the Hottest-Days-So-Far?

Since we have per-station data, we can investigate at which weather stations the (0,0)-HDSFs are recorded.

Rank  Station Name              Location         Number of HDSFs
1     Heathrow                  Greater London   89
2     Cambridge Botanic Garden  Cambridgeshire   80
3     London Weather Centre     Greater London   71
4     St James's Park           Greater London   69
5     Cheltenham                Gloucestershire  62
6     Wisley                    Surrey           59
7     South Farnborough         Hampshire        41
8     Cromer                    Norfolk          37
=9    Mildenhall                Suffolk          33
=9    Cranwell                  Lincolnshire     33

No surprise: these tend to be concentrated in the South East of England. We can plot these on a map:

Figure: Map of stations with the most HDSFs

You'd guess that stations at lower latitudes, and perhaps more eastern longitudes, record more HDSFs. This is borne out in the data.

Figure: Latitude distribution of yearly HDSF counts

Figure: Longitude distribution of yearly HDSF counts

The above results are based on the count of HDSFs between 1920 and 2020. However, not all weather stations have been around for the same amount of time, so simply considering the raw HDSF count is biased toward longer-running stations. If we divide by the number of days the station has been operational, we can adjust for this bias, at the risk of introducing more variance into the ranking.

Rank  Station Name           Location              % Days which are HDSFs
1     Liphook                Hampshire             1.31
2     London Weather Centre  Greater London        0.55
3     Lindholm               South Yorkshire       0.55
4     Worcester Barbourne    Hereford & Worcester  0.52
5     Mildenhall             Suffolk               0.44
6     Heathrow               Greater London        0.34
7     Hope                   Powys North           0.33
8     Mount Wise Met Office  Devon                 0.33
9     Gravesend Broadness    Kent                  0.29
10    Greenwich Observatory  Greater London        0.29

Figure: Map of stations with the most HDSFs per day of operation
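
The per-day adjustment can be sketched as below, in Python rather than the Julia used for the analysis. Counting a station as operational on any day with a non-missing reading is an assumption made here, and both stations and their numbers are hypothetical.

```python
from typing import List, Optional


def hdsf_rate_percent(hdsf_count: int, readings: List[Optional[float]]) -> float:
    """HDSFs as a percentage of the days on which the station reported a reading."""
    days_operational = sum(1 for r in readings if r is not None)
    if days_operational == 0:
        raise ValueError("station never reported a reading")
    return 100.0 * hdsf_count / days_operational


# Hypothetical stations: one long-running, one that reported for only 50 days.
long_running = [15.0] * 1000
short_lived = [15.0] * 50 + [None] * 950

print(hdsf_rate_percent(4, long_running))
print(hdsf_rate_percent(1, short_lived))
```

The short-lived station ranks higher on the strength of a single HDSF, which is the extra variance mentioned above.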

When are the Hottest-Days-So-Far?

When in the year do the hottest days so far occur? We'd probably expect these to be concentrated at the start of the year, and in the summer months, with autumn and later having basically zero chance of being hottest so far. We can plot this on a calendar:

Figure: Probability of each calendar date being a 0˚C-HDSF

As expected, the first few days are very likely to be HDSFs, and the latter half of the year has very few. The peak time for HDSFs is late April/early May [3]. The latest recorded HDSF was 13th September, in 2016. It even made the news! Unsurprisingly this record was measured at Gravesend.
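
A per-date probability of this kind can be estimated as the fraction of years in which that date was an HDSF. The sketch below is in Python, not the Julia used for the analysis. The function name is hypothetical, the two toy years are made up, and days are indexed by position in the year, which glosses over leap years.

```python
from typing import Dict, List


def hdsf_probability_by_day(years: Dict[int, List[float]]) -> List[float]:
    """For each day position, the fraction of years in which it was an HDSF."""
    n_days = min(len(temps) for temps in years.values())
    hits = [0] * n_days
    for temps in years.values():
        running_max = None
        for day in range(n_days):
            if running_max is None or temps[day] > running_max:
                hits[day] += 1
                running_max = temps[day]
    return [h / len(years) for h in hits]


# Two hypothetical four-day "years".
toy_years = {1: [5.0, 6.0, 4.0, 7.0], 2: [5.0, 3.0, 4.0, 8.0]}
print(hdsf_probability_by_day(toy_years))
```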

Modelling

In future, I might revisit the question of whether we can reconstruct the observed data via simple modelling.

Figure: Daily maximum temperatures, averaged over the years between 1920 and 2020. Shaded area shows 20th and 80th percentiles.

It appears that the daily max temperatures over a year could be modelled as a sinusoid with heteroscedastic noise (unsure of noise distribution, but Normal would probably be a good start). There's probably also significant autocorrelation, so assuming the daily noise is independent might not be appropriate.
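
As a rough illustration of what such a model could look like, here is a Python sketch (not Julia, and not fitted to the data) of a seasonal sinusoid with AR(1) Normal noise whose spread varies through the year. Every parameter value is a placeholder chosen for illustration, including the direction in which the spread changes with season.

```python
import math
import random
from typing import List


def simulate_year(rng: random.Random, n_days: int = 365) -> List[float]:
    """Simulate daily maxima: seasonal sinusoid plus autocorrelated Normal noise."""
    mean_temp, amplitude, peak_day = 18.0, 9.0, 200  # placeholders, not fitted
    base_sd, sd_swing, ar_coeff = 3.0, 1.0, 0.7      # placeholders, not fitted
    temps = []
    z = rng.gauss(0.0, 1.0)  # unit-variance AR(1) noise state
    for day in range(n_days):
        phase = 2.0 * math.pi * (day - peak_day) / n_days
        seasonal = mean_temp + amplitude * math.cos(phase)
        sd = base_sd + sd_swing * math.cos(phase)  # heteroscedastic spread
        temps.append(seasonal + sd * z)
        # AR(1) update, scaled so the noise keeps unit variance.
        z = ar_coeff * z + math.sqrt(1.0 - ar_coeff ** 2) * rng.gauss(0.0, 1.0)
    return temps


def count_hdsf(temps: List[float]) -> int:
    """Count days that strictly exceed every earlier day in the year."""
    count, running_max = 0, float("-inf")
    for temp in temps:
        if temp > running_max:
            count += 1
            running_max = temp
    return count


rng = random.Random(0)
counts = [count_hdsf(simulate_year(rng)) for _ in range(200)]
print(sum(counts) / len(counts))
```

Comparing the simulated mean and spread of yearly HDSF counts with the observed values would be one way to judge whether a model of this shape is adequate.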

Data

(I was going to include this section at the start, for the chronological narrative, but my guess is that most people won't be interested, so I moved it to the end, like an appendix)

I decided to jump straight into computing statistics from temperature data. I initially questioned whether it was possible to skip data processing by looking at average temperatures and measures of spread. In the end, I decided against this because it would rest on assumptions I'd struggle to validate without access to the data anyway. I also expect the best I could do with easily accessible data is to build a model where daily high temperatures vary sinusoidally with independent daily Gaussian noise, but this would be misleading, as I'd expect significant autocorrelation. Besides, playing with the data can be fun, and allows for more questions to be answered.

I needed a source of daily temperature information. I'm English, so naturally mostly consume British media, and therefore restricted the data to temperatures in the UK. The obvious place to go for UK weather data is the Met Office.

It turns out that the Met Office make available daily temperature information from each of their weather stations, stretching right back to 1853! The MIDAS dataset [4] is available from the CEDA archives. Unfortunately, accessing this data is not so straightforward. The data is split up into different files, each containing a year's worth of temperature data for one weather station. I looked, to no avail, for a single file I could download which had all this data. The data is accessible via the CEDA FTP server, so I'd have to crawl over the relevant directory.

A note on the tools I used here: I was going to make this into a notebook-style document, with code interleaved with text. However, this is quite low priority for me, though I might revisit this post in the future if I have lots of free time. So I'll just vaguely describe what I did to process the data. Throughout, I used Julia, which is by far my favourite programming language for ‘scientific computing’ tasks.

To download the data, I used FTPClient.jl, which ended up being quite straightforward. Once I had each file, I had to parse it. CEDA use a format called BADC-CSV, which as far as I'm aware is a CEDA ‘in house’ format. I couldn't find a ready-made BADC-CSV parser for Julia, but fortunately it took only small modifications to the functionality provided by CSV.jl to read the files.
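
For readers who want to try something similar, here is a rough Python sketch of reading such a file (the actual parsing was done in Julia with a modified CSV.jl, which is not reproduced here). It assumes the layout is a block of metadata lines, a line reading `data`, a column-name row, the records, and a closing `end data` line. The column names `ob_end_time` and `max_air_temp`, the `NA` missing-value marker and the sample contents are assumptions for illustration, not details confirmed by this post.

```python
import csv
from typing import Dict

# Made-up sample in the assumed layout; not real station data.
SAMPLE = """Conventions,G,BADC-CSV,1
title,G,Example daily temperature file
data
ob_end_time,max_air_temp,min_air_temp
1978-06-01 09:00:00,17.2,9.1
1978-06-01 21:00:00,21.4,NA
1978-06-02 09:00:00,NA,10.3
end data
"""


def read_daily_max(text: str) -> Dict[str, float]:
    """Return the highest max_air_temp per calendar date, skipping missing values."""
    lines = text.splitlines()
    start = lines.index("data") + 1   # column-name row follows the "data" marker
    end = lines.index("end data")     # records stop at the closing marker
    daily_max: Dict[str, float] = {}
    for row in csv.DictReader(lines[start:end]):
        value = row["max_air_temp"].strip()
        if value in ("", "NA"):
            continue
        date = row["ob_end_time"].split(" ")[0]
        daily_max[date] = max(float(value), daily_max.get(date, float("-inf")))
    return daily_max


print(read_daily_max(SAMPLE))
```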

Once I parsed the data, I saved it to a new file to avoid having to scrape the CEDA FTP server every time. I chose to save it to an HDF5 file using HDF5.jl, preferring the HDF5 format over CSV since the resulting data was 1591×63,000. With over 100M entries, many of which were empty (due to weather stations not operating in certain periods), it made sense to make use of HDF5's compression and incremental read/write.

Following this, I obtained an HDF5 file containing the maximum temperature recorded at each UK weather station on each day between 1853 and 2021. If you'd like this file (50MB) to do your own investigations, send me an email.

In my calculations I only used data from between 1920 and 2020, out of an unvalidated concern that data reliability and availability would be lower before 1920. Besides, 101 years of data should be enough!

Working with Julia for this was smooth. A common criticism I see of Julia is that the package ecosystem is undeveloped relative to more popular languages like Python or R. While perhaps true in general [5], I think the issue is often exaggerated, and for the purposes of this project, the Julia packages worked well. The only real issue I had was in trying to set fill values using the HDF5 package, which is something I didn't struggle with in the Python package.


  1. Before working with the full dataset, I did a quick investigation by downloading the data for the Rochdale weather station in 1948 and 1973 (no particular reason for any of these choices). In 1948, there were 14 HDSFs recorded, and in 1973 there were 22 HDSFs so I wrote down my initial guess of yearly average nationwide HDSFs as 20±5. 

  2. Note that the temperature is compared to the previous maximum recorded daily temperature, not the previous ΔT-HDSF temperature. So, in the sequence {10.0, 10.9, 11.8} only the first day is 1˚C-noteworthy. Although day 3 is more than 1˚C hotter than the previous 1˚C-HDSF (day 1, 10˚C), it is only 0.9˚C hotter than the previous maximum recorded daily temperature (day 2, 10.9˚C) and is therefore not 1˚C-noteworthy.

  3. Indeed, just a few days before I'm writing this, I saw a ‘hottest day so far’ story on 15th April. It's also the Easter bank holiday, which is probably a hotspot for these stories since people will be at the beach if the weather is good enough! 

  4. Citation: Met Office (2021): MIDAS Open: UK daily temperature data, v202107. NERC EDS Centre for Environmental Data Analysis, 08 September 2021. doi:10.5285/92e823b277cc4f439803a87f5246db5f 

  5. Though last I remember, there are some areas, such as differential equations, where the Julia ecosystem far surpasses anything else.