Wrist-worn accelerometers are the most widely used wearable DHTs in clinical trials and research today. Accelerometer-based DHTs continue to play a central role in clinical research thanks to their low cost, light weight, long battery life and decades of accumulated research findings and datasets. The algorithms used to derive sleep outcomes from wrist acceleration data have been the focus of technical improvement and continue to evolve with advances in data science. Our systematic evaluation of the most common sleep algorithms developed over the past 40 years provides researchers with an evidence-based approach to using them effectively in sleep research.
Our findings suggest that current applications of machine learning and deep learning techniques to sleep-wake classification are not as robust in estimating sleep outcomes as the simple heuristic (van Hees) and legacy regression models (Oakley-rescore and Cole-Kripke). This is surprising, as the deep learning and random forest models were trained on large datasets and would therefore be expected to model the complex relationship between wrist movements and sleep better than simple models developed on smaller datasets13,15. The fact that these machine and deep learning models do not perform better may be due to differences in the activity counts used across sleep data sets, suboptimal model architectures, the intrinsic challenge of estimating sleep physiology from motion data alone, and different PSG annotation styles across data sets.
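To make the contrast concrete, the legacy regression models reduce to a weighted moving window over epoch-level activity counts followed by a threshold. The sketch below shows this general form using the commonly implemented 1-minute-epoch Cole-Kripke coefficients; the weights shown are from secondary implementations and should be verified against the original publication before use.

```python
import numpy as np

# A minimal sketch of the Cole-Kripke weighted-window form, using the
# commonly implemented 1-minute-epoch coefficients (illustrative; verify
# against the original publication before applying in practice).
WEIGHTS = np.array([106.0, 54, 58, 76, 230, 74, 67])  # epochs t-4 .. t+2
SCALE = 0.001

def cole_kripke(counts: np.ndarray) -> np.ndarray:
    """Classify each 1-minute epoch as sleep (1) or wake (0) from activity counts."""
    padded = np.pad(counts.astype(float), (4, 2))  # zero-pad the window edges
    # Weighted sum over 4 past epochs, the current epoch, and 2 future epochs.
    d = SCALE * np.correlate(padded, WEIGHTS, mode="valid")
    return (d < 1.0).astype(int)  # D < 1 -> sleep
```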
Computation of activity counts from raw accelerometer data has been a common data reduction step since the early days of actigraphy. The conversion from raw acceleration to activity counts is not always well documented or well understood. Early studies presenting the legacy count algorithms did not provide any information about how the counts were obtained7,8,10,16. In addition, manufacturers may not disclose how they derive counts from raw accelerometer data. The activity count calculation is a crucial step in count-based sleep algorithms, and differences in the counts would lead to differences in sleep-wake classifications. Despite this, the current research confirmed that the legacy algorithms estimate sleep outcomes with high validity on counts computed using the open-source agcounts Python package9. The deep learning count-based algorithms were trained on proprietary counts from the MESA dataset and may not generalize to different types of counts as readily as the simple legacy count algorithms, possibly because the models overfit the PSG annotation style in the MESA dataset13. Due to the lack of available raw acceleration data in large data sets, no raw acceleration-based deep learning models have been presented in the literature.
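For readers unfamiliar with this pipeline, the following is a minimal sketch of the count computation step, assuming the get_counts interface of the agcounts package; the input file, sampling rate and epoch length are illustrative assumptions.

```python
import numpy as np
from agcounts.extract import get_counts  # open-source ActiGraph counts implementation

# A minimal sketch: tri-axial raw acceleration (in g), assumed here to be
# sampled at 50 Hz, is reduced to 60 s epoch activity counts. The input
# file name is hypothetical; check the agcounts docs for the current API.
raw = np.loadtxt("wrist_acceleration.csv", delimiter=",")  # (n_samples, 3)
counts = get_counts(raw, freq=50, epoch=60)                # (n_epochs, 3)
# Count-based sleep algorithms typically use one axis or the vector magnitude.
vm_counts = np.linalg.norm(counts, axis=1)
```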
The top-performing deep learning algorithms performed slightly worse than the much simpler heuristic and legacy algorithms. It is possible that the model architectures used in the current deep learning algorithms are not optimal15. In particular, the models are very shallow, having only one layer of convolution filters or LSTM cells connected to a dense layer. Most deep learning models employ several layers (hence the term deep), which helps them capture richer structure in the training data. Their training also did not involve regularization techniques designed to avoid overfitting (and thus improve generalization), such as dropout or early stopping. The fact that the models were trained for only 30 epochs may also limit their performance relative to the heuristic algorithms15. In short, while deep learning algorithms hold promise, there is still work to be done.
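As a sketch of the kind of architecture change discussed above, the following shows a deeper, regularized sleep-wake classifier with dropout and early stopping. This is not the published models' architecture; the window length, layer sizes and all hyperparameters are illustrative assumptions, written here with Keras.

```python
import tensorflow as tf

# Hedged sketch of a deeper, regularized sleep-wake classifier: stacked
# convolutions over a window of activity counts, dropout, and early
# stopping. All hyperparameters below are illustrative assumptions.
def build_model(window_len: int = 101) -> tf.keras.Model:
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(window_len, 1)),
        tf.keras.layers.Conv1D(32, 7, padding="same", activation="relu"),
        tf.keras.layers.Conv1D(32, 7, padding="same", activation="relu"),
        tf.keras.layers.GlobalAveragePooling1D(),
        tf.keras.layers.Dropout(0.5),  # regularization absent from the published models
        tf.keras.layers.Dense(1, activation="sigmoid"),  # P(sleep) for the centre epoch
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

# Early stopping restores the best-validation weights rather than
# training for a fixed 30 epochs.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=10, restore_best_weights=True)
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=200, callbacks=[early_stop])
```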
Polysomnography (PSG) is considered the gold standard for sleep assessment and provides clinical diagnosis of sleep disorders such as apneas, hypopneas and rapid eye movement (REM) disorders17. Using PSG scoring as the ground truth for actigraphy sleep algorithms, however, has some intrinsic challenges. PSG measures physiological changes during sleep, while wrist actigraphy measures the movement of the distal forearm. Physiology and movement present highly structured and correlated patterns across sleep and wake cycles, which is the fundamental principle behind the use of actigraphy in sleep research. But the intrinsic difference between the two source signals means there is a limit to how closely one can estimate the other. This does not mean wrist-based accelerometer assessment of sleep is inferior to PSG; actigraphy is superior for longitudinal, reliable assessment of sleep patterns in free-living environments. To facilitate the proper use of wrist accelerometer-based sleep outcomes, it may be necessary to interpret actigraphy-quantified sleep endpoints in their own right, rather than expecting them to match PSG perfectly.
Due to the subjective nature of PSG scoring, it may be difficult for a model to generalize across data sets scored by different raters. PSG must be scored by trained technicians to derive sleep outcomes. The scoring process takes 2–4 h for one night of sleep and is known to have high inter-rater variability, especially in pathological sleep populations18. To improve objectivity and reduce variability, the American Academy of Sleep Medicine (AASM) guidelines provide a series of rules that the PSG technician applies while scoring raw PSG data17. For example, an epoch is scored as wake when more than 50% of it contains an alpha rhythm (8–13 Hz) over the occipital region, or eye blinks at 0.5–2 Hz, or rapid eye movements associated with normal or high chin muscle tone, or reading eye movements17. Such scoring criteria are inherently subjective and leave room for different raters to score the same segment differently. Automated software packages have been developed to score PSG data; however, these are not considered gold standard18. Given this high inter-rater variability, it may be difficult for a model trained on the relationship between PSG scoring and movement patterns to generalize robustly to data sets scored by different raters. Deep learning and machine learning models run the risk of overfitting to a data set-specific PSG annotation style if their architecture does not include measures to enhance generalization.
While this work is the first systematic comparison spanning simple regression to complex acceleration-based machine learning sleep algorithms, a subset of these models has been evaluated in previous literature. Sundararajan et al. 2021 reported slightly worse results for Sadeh (F1 68.1% vs 78.5%), Cole-Kripke (F1 67.5% vs 78.0%), van Hees (F1 70.1% vs 79.1%) and Random Forest (F1 73.9% vs 76.4%) than the present study15. There are several potential reasons for this difference. First, no data were dropped in the current study: Sundararajan et al. 2021 had 24 participants in their test set, while the current work used all 28 participants from the Newcastle PSG dataset. Second, accuracy, sensitivity and other statistics were calculated for each subject and then averaged in the current work, whereas Sundararajan et al. 2021 pooled all epochs from all subjects and calculated the evaluation metrics on the pooled data, as illustrated below. The advantage of averaging per-participant evaluation metrics is that a Bland-Altman style validation analysis can be performed on the sleep outcomes.
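The following sketch makes the two aggregation strategies explicit; the variable names and data structures are hypothetical, with y_true and y_pred assumed to be per-participant lists of epoch-level labels.

```python
import numpy as np
from sklearn.metrics import f1_score

# Two ways to aggregate epoch-level evaluation metrics across participants.
# y_true / y_pred: hypothetical lists of per-participant label arrays.
def per_subject_f1(y_true, y_pred):
    # Current work: compute the metric for each participant, then average.
    # This also permits Bland-Altman analysis of per-night sleep outcomes.
    return np.mean([f1_score(t, p) for t, p in zip(y_true, y_pred)])

def pooled_f1(y_true, y_pred):
    # Sundararajan et al. 2021: pool all epochs across participants first.
    return f1_score(np.concatenate(y_true), np.concatenate(y_pred))
```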
Rescoring is a series of heuristic rules, developed in conjunction with the original legacy sleep algorithm, that rescores periods as wake or sleep based on the length of a period and the lengths of the surrounding periods7. Previous research showed that rescoring improved performance for all legacy algorithms on the MESA data set13. However, on the Newcastle PSG data set, rescoring worsened performance for the Cole-Kripke (RMSE 13.0 to 18.5), Sadeh (RMSE 13.6 to 23.4) and Sazonov (RMSE 14.5 to 30.5) algorithms, while it improved performance for the Oakley algorithm (RMSE 15.9 to 12.7). In both studies rescoring decreased sensitivity and increased specificity: algorithms that started with high sensitivity and low specificity were improved by rescoring, while algorithms that started with balanced sensitivity and specificity were made worse.
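To illustrate the mechanics, the sketch below implements one rule of this style: after a sufficiently long run of wake, the first few epochs scored sleep are rescored as wake. This is a simplified illustration rather than the full published rule set, which contains several such rules7; the parameter values are assumptions.

```python
import numpy as np

# Illustrative implementation of one rescoring-style rule: after at least
# `min_wake` consecutive wake epochs, the next `n_rescore` epochs scored
# sleep are rescored as wake (sleep = 1, wake = 0). The full published
# rule set contains several such rules; see the original reference.
def rescore_after_wake(scores: np.ndarray, min_wake: int = 4, n_rescore: int = 1) -> np.ndarray:
    out = scores.copy()
    wake_run = 0  # length of the current consecutive-wake run
    budget = 0    # sleep epochs still eligible for rescoring
    for i, s in enumerate(scores):
        if s == 0:
            wake_run += 1
            budget = 0
        else:
            if wake_run >= min_wake:
                budget = n_rescore  # long wake run just ended
            wake_run = 0
            if budget > 0:
                out[i] = 0  # rescore this sleep epoch as wake
                budget -= 1
    return out
```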
Supplementary Table 1 summarizes the sensitivity, specificity and accuracy reported in previous studies presenting algorithms to predict sleep-wake from wrist accelerometry. The Sadeh model on the Sadeh data set had the highest specificity and accuracy of all model/data set combinations; however, the Sadeh data set consisted only of healthy sleepers. On the MESA data set, rescoring improved performance for the Sadeh, Cole-Kripke and Oakley algorithms. The van Hees and Sundararajan (random forest) algorithms could not be run on MESA because they require raw acceleration, and the MESA data set contains only activity counts.
The current work has several limitations. Since the PSG scoring process was not detailed in the open-source Newcastle PSG dataset, we do not know how it was performed. The test data in the current work came from a single PSG study; future work should consider using multiple PSG studies as test data sets to ensure generalizability to different PSG annotation styles. The challenge at present is that many open-source PSG data sets (MESA, STAGES) include only activity count data and not raw acceleration, making it impossible to test the raw acceleration-based algorithms. The current work considered algorithms that use acceleration only. Heart rate and other physiological signals have the potential to improve sleep classification and staging; however, the trade-off with more sensors is a decrease in battery life19. This is an important area for future research.