Your Best Sleep Sensor Is… You?

Tracking Sleep or Losing Sleep

Through hundreds of conversations with people about their relationship with sleep, we’ve found that many who want to sleep better start with a sleep tracking device. We’ve also learned that these devices can actually make people a lot more anxious about their sleep (a condition so common that there is now a term for it - orthosomnia). Scientific studies have shown that people’s perception of their sleep and rest can also be influenced by what devices tell them about their sleep - even when the devices present data known to be inconsistent with their actual sleep.

At The Better Lab, we believe that the best measure of your sleep and restedness is your own perception. When it comes to sleep, the human body itself is a finely tuned sensor. In fact, when sleep doctors evaluate a patient in a sleep lab, they will rely on the person’s own subjective rating of sleep over information from sleep monitoring equipment. 

Nonetheless, we know that many people have simply gotten in the habit of tracking their sleep with wearable devices - along with lots of other things you can track. And since practices like bedtime consistency have a meaningful impact on sleep, we see value in accurate data around what time you went to bed and actually fell asleep.

Given what we know about the influence of sleep tracking data on our perception of sleep and the prevalence of these devices, we thought it was worth investigating for ourselves just how accurate these devices are. We wanted to examine the accuracy of the underlying data they track (such as sleep time, sleep stages, and heart rate variability) as well as the consistency of the topline scores different devices’ algorithms generate each morning.

Unfortunately, we don’t have access to a clinical grade sleep system for polysomnography (which tracks numerous vitals through 22 different sensors) to use as a baseline, but we know that most major sleep tracking devices claim to be highly accurate compared with this expensive laboratory equipment anyway. Which raises the question - if they are all reasonably accurate, shouldn’t they at least agree with one another?

Running Experiments on Myself

To answer that question, I (the test subject) simultaneously wore three popular, current-model sleep tracking devices for seven nights: a Whoop 4.0, an Oura Ring 4, and a Garmin Fenix 8. To ensure all three were getting the same inputs, I wore all the devices continuously during the day as well, only removing them for a short time in the morning to charge them. Wearing three devices at once did feel strange at first, but after the first few days I wasn’t really aware of it anymore.

I wanted to do this experiment during a relatively ‘normal’ week, one where I could consistently maintain the sleep practices I have built with The Better Lab and wasn’t traveling or doing anything that would force me to significantly change my sleep schedule. That said, I intentionally chose a week when I knew I might have some unusual sleep patterns. I had just returned from Australia the day before the experiment began and knew I was tired and would be jet lagged for at least the first few days of the week.

The Headline Scores

In addition to a wealth of detailed metrics, each device provides two main scores (both on a 100-point scale) that are intended to summarize the sufficiency of your sleep and how well recovered or ‘ready’ you are at the start of each day. The devices use their own proprietary algorithms to calculate these scores, but based on how each one explains its scores, they appear to use similar inputs.

If you are just looking for a quick summary of how you slept and how rested you are, you probably glance at these metrics every morning. So how consistent are these scores?

Not very. Whoop gave me a near perfect sleep score every day, while Oura and Garmin were close to one another, but with significant differences on two out of the seven nights.

When it came to recovery or readiness, it was Garmin that gave me a near perfect score each day, and the inconsistency between the three was even greater than for sleep scores. 

Sleep Time and Stages

If you want to go deeper on your sleep, each device gives detailed information on total time you spent in bed, total time asleep, and time in each of the main stages of sleep (e.g. light, deep, and REM). Most people pay attention to total sleep time, deep sleep and REM sleep versus the recommendations the devices provide for each (e.g. you should spend 20% to 25% of the night in REM sleep, according to one device).

Total sleep time generally moved in the same direction across the devices, but the difference between them was significant: Garmin and Oura differed by 1 hour on average, against an average sleep duration of 8 hours across all nights and all devices!

Unlike sleep time, sleep staging information didn’t move in sync and showed even more significant differences between devices. While Garmin and Whoop varied by as much as 40 minutes, they were directionally similar. Oura, on the other hand, indicated I barely got 30 minutes of deep sleep per night (vs. the typical recommendation of 1 to 2 hours), which was on average 57 minutes less than Garmin.

REM sleep did seem to agree between devices on the last two days, but I might chalk that up to randomness, since the devices showed big differences on other days, and on average the difference between the highest and lowest measured REM time was over an hour per night.

Sleep Metrics

All the devices monitor at least four metrics during sleep - resting heart rate, heart rate variability (or HRV, a measure of the time variation between your heart beats), respiration rate, and skin temperature. Generally, absent being sick, people see the most variation in resting HR and HRV from night to night and these are key inputs into their recovery scores as they theoretically respond when you are putting a lot of strain on your body during the day. 

Although my resting HR didn’t vary much during the week, the devices were quite close in their estimates of resting HR (which is the low point of HR during the night). In fact, Oura and Garmin agreed almost spot on (for once).

My HRV didn’t consistently move in sync between devices, and the difference between them was quite meaningful - 9 points on average between the highest and lowest device, which was 26% of the higher reading.
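For anyone who wants to run the same comparison on their own data, here is a minimal sketch of one way to compute that kind of spread: the gap between the highest and lowest device each night, also expressed as a share of the higher reading. The function name and the readings below are hypothetical and made up for illustration; they are not the numbers from my experiment, and averaging the nightly percentages is only one reasonable way to summarize them.

```python
from statistics import mean


def spread_stats(readings_by_night):
    """Return (average gap, average gap as % of the higher reading).

    readings_by_night: list of dicts mapping device name -> reading
    for a single night (e.g. HRV). Hypothetical helper for illustration.
    """
    gaps = []
    percents = []
    for night in readings_by_night:
        high = max(night.values())
        low = min(night.values())
        gaps.append(high - low)
        percents.append((high - low) / high * 100)
    return mean(gaps), mean(percents)


# Made-up HRV readings for two nights, NOT data from the article.
nights = [
    {"whoop": 40, "oura": 30, "garmin": 35},
    {"whoop": 50, "oura": 45, "garmin": 40},
]

avg_gap, avg_pct = spread_stats(nights)
print(f"Average gap: {avg_gap} points ({avg_pct:.1f}% of the higher reading)")
```

The same helper works for any metric where each device reports one number per night, such as resting HR or total sleep time in minutes.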

Activity Data

While I was most interested in sleep tracking between the devices, I did compare some of the key activity related metrics because they contribute to the devices’ estimates of your daily strain and many people use the devices to track overall measures like steps or calories.

On an activity by activity basis, the devices’ readings of my average HR were pretty close. However, if you relied on the devices’ estimates of time in each HR zone (e.g. Zone 2), you would get wildly different estimates (in my test, I saw a 30 minute gap between the low and high estimates on an average activity of 1 hour and 25 minutes) because each device uses very different HR ranges to define Zone 2. This would be fine if you knew your actual maximum HR and could adjust the HR zones, but unfortunately only Garmin allows you to do this manually.
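To see why the zone definitions matter so much, here is a small sketch that counts time in ‘Zone 2’ for the same heart rate trace under two different zone boundaries. Everything in it is an assumption for illustration: the boundaries are hypothetical and do not represent any device’s actual zones, and the heart rate samples are made up.

```python
def minutes_in_zone(hr_samples, low, high, seconds_per_sample=1):
    """Minutes spent with HR in [low, high), given evenly spaced samples."""
    in_zone = sum(1 for hr in hr_samples if low <= hr < high)
    return in_zone * seconds_per_sample / 60


# Made-up one-hour workout, one HR sample per minute (not real data).
samples = [115] * 30 + [125] * 20 + [135] * 10

# Hypothetical Zone 2 boundaries for two imaginary devices.
zone2_device_a = (110, 130)
zone2_device_b = (120, 140)

time_a = minutes_in_zone(samples, *zone2_device_a, seconds_per_sample=60)
time_b = minutes_in_zone(samples, *zone2_device_b, seconds_per_sample=60)

print(f"Device A Zone 2: {time_a:.0f} min")
print(f"Device B Zone 2: {time_b:.0f} min")
```

Both imaginary devices see identical heart rate data, yet they report different Zone 2 totals purely because of where each one draws the lines.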

Step and calorie data generally moved in sync, but with large differences (e.g., 19% on average for calories). It is worth noting that the Oura step count is not really comparable to the other devices because Oura appears to include an estimate of ‘effective steps’ for activities like cycling or swimming that don’t actually involve steps!

The Bottom Line 

My experiment in sleep tracking suggests you shouldn’t be wedded to either the scores or the data your device is providing you every morning. While overall sleep time was directionally consistent between the devices, the sleep staging information was so different and erratic from one device to the next that you are probably best off ignoring it.

When it comes to the scores these devices generate, I also urge caution on relying on them. Even if some of the underlying data used to generate them (such as resting HR) is likely accurate, other data points (like HRV) may not be, and the underlying algorithms between devices are black boxes and give quite different results from the same night of sleep and day’s activity.

At the end of the day (or each morning), The Better Lab suggests relying on your own sense of sleep and restedness to measure your sleep and progress to better sleep over any specific device. After all, if your car gave you three different measures of remaining fuel, you’d probably give up paying attention to them and take it to the shop! 

Admittedly, it is possible that despite the devices not agreeing on almost any measure, one of them was perfectly accurate. I only compared them to each other and not to a highly accurate reference device. Curiously though, both Whoop and Oura have published research comparing their devices to polysomnography, and in my test the two still varied significantly on most measures.


No Accessories Required

The Better Lab encourages you to rely on your own perception of your sleep and works without the need for any sleep tracking device.

The science to improve sleep is open source, free and straightforward - yet most people find putting it into practice on their own is difficult.

That’s why we built The Better Lab to guide you to actually do the things proven to improve your sleep, one practice at a time.

Get the app free here if you haven’t already:

