Original article
Framework for estimating sleep timing from digital footprints
Bo-chiuan Chen (1), Dong-Chul Seo (1), Hsien-Chang Lin (1), David Crandall (2)
  1. School of Public Health, Indiana University, Bloomington, Indiana, USA
  2. School of Informatics and Computing, Indiana University, Bloomington, Indiana, USA
Correspondence to Dr Dong-Chul Seo, School of Public Health, Indiana University, Bloomington, IN 47405, USA; seo{at}


Objective We propose a method that estimates sleep timing from publicly observable activity on online social network sites. The method has the potential to minimise participant-related biases, does not require specialised equipment and can be applied to a large population.

Materials and methods We propose a framework that estimates midpoints of habitual sleep time from activity records on a social media platform, Twitter. We identified sets of before-bedtime and after-wake-up tweets that marked the periods of reduced Twitter activity, which we used as a proxy for sleep. We then estimated the timing of sleep by taking the median of the midpoints of paired before-bedtime and after-wake-up tweets. Visualisations and examples of our estimates comparing sleep timing of users from different countries are provided.

Discussion Initial results suggest that the proposed framework could detect differences in sleep timing among user groups of different countries. The proposed framework may be a cost-efficient complement for future research regarding sleep-related health concerns. Researchers and practitioners may benefit from accessing habitual sleep data. While validation is still required prior to actual applications, the proposed framework may be a first step towards a convenient and cost-efficient complement to currently available methods.

  • sleep
  • sleep timing
  • Twitter
  • digital footprint
  • social media
  • social network services



Sleep timing has been documented as being associated with health outcomes and well-being.1–5 Approaches based on surveys, sleep diaries and actigraphy are well established for monitoring and measuring sleep, but they have a number of known limitations and constraints in quantifying sleep deprivation.6–15 Information bias is common in survey-based studies, which rely on participants’ recall and subjective evaluation of sleep times. The heavy burden of entering records into sleep diaries and the substantial cost of actigraph equipment limit the sample sizes and the duration of data collection for the latter two approaches. Recent applications of electronic sleep diaries partially mitigate recall bias but still require prior arrangement with participants. There is a gap between findings based on high-frequency, short-duration approaches such as sleep diaries and actigraphy and results based on low-frequency, long-duration techniques such as surveys.

To supplement these traditional approaches, we propose a framework that estimates sleep timing from individuals’ online activities on social networking services (SNSs). Although people primarily use these sites to communicate and share with family and friends, the records of their online activities (their digital footprints) reveal a surprising amount of information about their lives.16 Taking advantage of the popularity of SNSs such as Facebook, Twitter and Instagram in recent years, the proposed method is designed to be applicable to a large population. Because SNS data are recorded automatically, the method reduces the burden of participation and benefits from objectively recorded times. After reviewing the key features of SNS data, we describe the framework, give illustrative examples and present initial results.

Materials and methods

Sleep midpoints (SMPs)

This study estimates the midpoint of habitual sleep time (HST) to represent the timing aspect of sleep. Most research uses HST to represent the times when individuals usually go to sleep and when they usually wake up, although the exact sleep time fluctuates every day. We follow the terminology of chronobiology, the field that examines cyclic patterns of phenomena in living organisms and their adaptation to environmental factors,17 and use the term sleep midpoint (SMP) to refer to the midpoint of HST. In general, SMP on free days (ie, when an individual is not constrained by work or school) serves as a chronotype indicator, that is, the individual preference for being an early bird or a night owl (ie, morningness–eveningness). Formally, this measure of chronotype is derived from instruments such as the Munich Chronotype Questionnaire, which surveys bedtimes and wake-up times on both work and free days.18 Because our main purpose is to derive an estimate of the time at which sleep is most likely to occur, we do not distinguish work days from free days. Conventionally, the SMP is calculated by taking the middle point between self-reported habitual bedtimes and wake-up times, $t_{\text{midpoint}} = (t_{\text{go-to-bed}} + t_{\text{wake-up}})/2$. Conceptually, one can take the midpoint between the last record of an individual being awake at the end of the previous night and the first record at the beginning of the day as a noisy estimate of the SMP for a given individual on a given day. In the absence of self-reported data, this study attempts to estimate SMP by deriving the median of multiple noisy day-wise midpoints.
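The midpoint formula needs care when a sleep episode crosses midnight: averaging 23:00 and 07:00 directly gives 15:00 rather than 03:00. The following is a minimal sketch of one way to handle this on a 24-hour clock. The function name and the decimal-hour representation are assumptions made for illustration and are not taken from the study.

```python
def sleep_midpoint(bed_hour: float, wake_hour: float) -> float:
    """Midpoint of one sleep episode, in decimal clock hours in [0, 24).

    Assumes the episode is shorter than 24 hours and that the wake-up
    time is the first one following the bedtime.
    """
    # Length of the episode, measured forwards from bedtime; the modulo
    # handles episodes that cross midnight.
    duration = (wake_hour - bed_hour) % 24
    return (bed_hour + duration / 2) % 24


print(sleep_midpoint(23.0, 7.0))  # bedtime 23:00, wake-up 07:00
print(sleep_midpoint(1.0, 9.0))   # bedtime 01:00, wake-up 09:00
```

When both times fall on the same side of midnight, the function reduces to the plain average in the formula above.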


We used Twitter as the test case for our SNS. We chose Twitter because of its popularity and because it makes data publicly available (unless users specifically opt out). A Twitter user timeline (ie, the sequence of tweets created by a specific Twitter user) provides records of the times when an individual was awake and active online. These time records are objective in the sense that they are automatically recorded and maintained by the Twitter platform itself and are thus less susceptible to the inaccuracies of self-reports. While Twitter supplies time records in coordinated universal time, tweets embedded with global positioning system (GPS) coordinates can be used to identify locations and therefore local times. We collected a set of user timelines through Twitter’s Application Programming Interfaces and removed accounts that were actually automated Twitter ‘bots’ deployed by companies or groups rather than individuals. Data collection was performed from June to December 2016 using Python 3. Details on the data collection and bot removal are found in the online supplementary file.
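As a rough illustration of the conversion from coordinated universal time to local time, the sketch below shifts a timestamp by a fixed offset and returns the local time of day in decimal hours. It assumes the offset for the user's location is already known. How an offset is obtained from GPS coordinates is not shown, and a fixed offset ignores daylight saving time. The function name and the example timestamp are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def to_local_hour(utc_time: datetime, utc_offset_hours: float) -> float:
    """Local time of day, in decimal hours, for a timezone-aware UTC timestamp.

    utc_offset_hours is an assumed, fixed offset for the user's location.
    """
    local_zone = timezone(timedelta(hours=utc_offset_hours))
    local = utc_time.astimezone(local_zone)
    return local.hour + local.minute / 60 + local.second / 3600


# Hypothetical tweet posted at 22:30 UTC, viewed from a UTC+2 location.
tweet_utc = datetime(2016, 7, 1, 22, 30, tzinfo=timezone.utc)
print(to_local_hour(tweet_utc, 2))
```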


Given a dataset of Twitter user timelines, our first step was to align the posting time of each tweet to the user’s daily routine (as opposed to wall-clock time), since daily routines vary across individuals. Instead of having to argue whether a daily routine starts from 02:00, 06:00 or 08:00, we matched the posting times to each individual’s routine. For ease of reference, we divided the hours of a day into two 12-hour periods, ‘morning’ and ‘evening’, and defined the period that contained the beginning of the daily routine as the ‘morning’ and the other as the ‘evening’. The intuition is that an individual’s activities on successive days should be separated by lengthy, recurring dormant periods corresponding to HST. Operationally, we counted the number of tweets occurring in each hour of the day for each user across their user timeline. We then identified the 6-hour period with the minimal total tweet count, breaking ties if needed by choosing the period closer to midnight, and marked this as a coarse estimate of the period of HST. The middle of the 6-hour period was defined as the beginning of the user’s ‘morning’. All tweets posted by the user in the 12 hours following this cut-off point were defined as ‘morning’ tweets, while the rest were considered ‘evening’ ones (figure 1, panel A).
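The alignment step can be sketched as a search over the 24 possible 6-hour windows of an hourly tweet histogram. This is a minimal illustration rather than the authors' code. The function names are hypothetical, and measuring "closer to midnight" as the distance from the window's middle to 00:00 is an assumed reading of the tie-breaking rule.

```python
from collections import Counter


def morning_start_hour(tweet_hours, window=6):
    """Clock hour at which a user's 'morning' begins.

    tweet_hours: local posting times in decimal clock hours, pooled over
    the whole timeline. Returns the middle of the quietest window.
    """
    counts = Counter(int(h) % 24 for h in tweet_hours)
    best_key, best_middle = None, None
    for start in range(24):
        total = sum(counts[(start + k) % 24] for k in range(window))
        middle = (start + window / 2) % 24
        # Assumed tie-break: prefer the window whose middle is nearest 00:00.
        key = (total, min(middle, 24 - middle))
        if best_key is None or key < best_key:
            best_key, best_middle = key, middle
    return best_middle


def label_period(tweet_hour, morning_start):
    """'morning' for the 12 hours after the cut-off, otherwise 'evening'."""
    return "morning" if (tweet_hour - morning_start) % 24 < 12 else "evening"


# Hypothetical user who tweets once in every hour except 01:00 to 07:59.
hours = list(range(8, 24)) + [0]
start = morning_start_hour(hours)
print(start, label_period(9, start), label_period(23, start))
```

In this example two windows (01:00 to 07:00 and 02:00 to 08:00) contain no tweets, and the tie-break selects the first because its middle is closer to midnight.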

Figure 1

Summary of the proposed framework based on tweet frequency histograms. First (panel A), defining the ‘morning’ and ‘evening’ for each individual: the middle of the 6-hour period with the minimal total tweet count was defined as the beginning of ‘morning’ and the end of ‘evening’. Second (panel B), identifying the most relevant tweets: we considered a user more likely to be online in the 10 min intervals that had higher-than-threshold tweet counts (upper panel); the tweets posted within or between the last three ‘evening’ and first three ‘morning’ online intervals were considered relevant to sleep (lower panel). Third, estimating the sleep midpoint (panel D): we paired each ‘evening’ tweet with every ‘morning’ tweet and calculated the midpoints of all the pairs (demonstrated in panel C); we took the median of these midpoints as our estimated sleep midpoint.

Next, we refined this coarse estimate of HST by identifying the times at which an individual’s online activity tends to transition from normal to reduced levels and vice versa. For each user, we constructed a histogram over 10 min intervals counting the latest ‘evening’ tweet of each day. We then derived a threshold by dividing the total number of latest ‘evening’ tweets by the number of 10 min intervals in 12 hours (ie, 72 intervals). This threshold reflects the expected number of latest tweets in each of the 72 intervals under the assumption that the end of online activity happens uniformly at random across time. We then defined the intervals that had higher-than-threshold tweet counts as ‘online’ intervals and the remaining intervals as ‘sleep’ intervals (figure 1, upper figure of panel B). We identified the transition to reduced levels of online activity by finding the last three ‘online’ intervals. The ‘evening’ tweets that were posted within or after any of these three ‘online’ intervals were considered relevant and were processed in the next step, since the HST is unlikely to start earlier than these observations on the left boundary of sleep times. The same treatment was applied to the ‘morning’ tweets to find the first three ‘online’ intervals, which mark the transition to normal levels of online activity, and the tweets that fell within the right boundary of sleep times (figure 1, lower figure of panel B).
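A minimal sketch of the thresholding step for the ‘evening’ side follows. It assumes each day's latest ‘evening’ tweet is expressed as minutes into the 12-hour ‘evening’ period (0 to 719). That representation, the function name and the sample data are illustrative assumptions.

```python
def relevant_latest_evening(latest_minutes, n_intervals=72, width=10, n_online=3):
    """Keep latest-'evening' tweets at or after the last three 'online' intervals.

    latest_minutes: one value per day, the minute (0 <= m < 720) within the
    12-hour 'evening' period at which that day's last tweet was posted.
    """
    counts = [0] * n_intervals
    for m in latest_minutes:
        counts[int(m // width)] += 1

    # Expected count per interval if the last tweet of the day fell
    # uniformly at random across the 72 intervals.
    threshold = len(latest_minutes) / n_intervals

    online = [i for i, c in enumerate(counts) if c > threshold]
    if not online:
        return []

    # Start of the earliest of the last three 'online' intervals.
    cutoff = online[-n_online:][0] * width
    return [m for m in latest_minutes if m >= cutoff]


# Hypothetical latest-tweet times for ten days.
latest = [100, 500, 600, 605, 650, 655, 660, 700, 705, 710]
print(relevant_latest_evening(latest))
```

The ‘morning’ side would mirror this by taking the earliest tweet of each day and keeping those at or before the end of the first three ‘online’ intervals.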

Finally, we estimated the SMPs. From the relevant latest ‘evening’ tweets and earliest ‘morning’ tweets, we considered every possible pair of a ‘morning’ tweet and an ‘evening’ tweet and calculated the midpoint of each pair (figure 1, panel C). The median of these midpoints was taken as the estimated SMP for each individual (figure 1, panel D).
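The pairing step can be sketched as follows. To avoid wrap-around at midnight, the sketch assumes tweet times are expressed as signed offsets in hours from the user's ‘morning’ cut-off (negative for ‘evening’ tweets, positive for ‘morning’ tweets). That encoding and the function name are assumptions made for this illustration.

```python
from itertools import product
from statistics import median


def estimate_smp(evening_offsets, morning_offsets, morning_start):
    """Median of all pairwise midpoints, returned as a clock hour in [0, 24).

    evening_offsets: hours relative to the cut-off, all <= 0.
    morning_offsets: hours relative to the cut-off, all >= 0.
    morning_start:   clock hour of the cut-off.
    Raises statistics.StatisticsError if either list is empty.
    """
    midpoints = [(e + m) / 2 for e, m in product(evening_offsets, morning_offsets)]
    return (morning_start + median(midpoints)) % 24


# Hypothetical user with a 04:00 cut-off: 'evening' tweets at 23:00, 23:30
# and 00:30; 'morning' tweets at 07:00 and 07:30.
smp = estimate_smp([-5.0, -4.5, -3.5], [3.0, 3.5], morning_start=4.0)
print(smp)
```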

Demonstrative results

As a proof of concept, we plotted the activity levels of 100 randomly sampled Twitter users by time of day in figure 2, along with their SMP estimates. Figure 3 shows results from a set of 4983 Twitter users from our sample. For this comparison, we restricted the analysis to users who had more than 500 tweets with GPS coordinates, so that we could approximate each user’s country affiliation by finding the most common location among the GPS records embedded in their tweets. We visualised these estimates against the mean bedtimes and mean wake-up times across countries that were reported in a mobile app-based survey (referred to as the Survey-Results) and in accelerometer-based user statistics (the Accelerometer-Stats) (see figure 3).19 20 Most of our SMP estimates (box plots) were between the mean bedtime and mean wake-up time reported in the two references (coloured bars). We also compared the medians of our timeline samples (red lines, box plots) with the reconstructed middle points between the mean bedtime and mean wake-up time reported for each country in the two references (circles and triangles). Our results were closer to the Accelerometer-Stats (circles) than to the Survey-Results (triangles), and the midpoints of the Survey-Results appeared earlier than both those of the Accelerometer-Stats and the medians of our estimates.
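The country assignment described above amounts to a majority vote over a user's geotagged tweets. The sketch below assumes each geotagged tweet has already been mapped to a country code. That mapping from GPS coordinates is not shown, and the function name and codes are hypothetical.

```python
from collections import Counter


def country_affiliation(tweet_countries, min_geotagged=500):
    """Most common country among a user's geotagged tweets.

    Returns None for users who do not have more than min_geotagged
    geotagged tweets, mirroring the inclusion rule in the text.
    """
    if len(tweet_countries) <= min_geotagged:
        return None
    return Counter(tweet_countries).most_common(1)[0][0]


# Hypothetical user with 501 geotagged tweets, mostly posted in Spain.
print(country_affiliation(["ES"] * 400 + ["FR"] * 101))
```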

Figure 2

Level of online activities among 100 randomly sampled Twitter users by the hours of a day and corresponding sleep midpoint estimates. Each horizontal line corresponds to the online activities level of an individual among the hours of a day. The colours of each cell denote the relative level of online activities in the hour (hourly tweet counts of the individual, in percentage; darker colours for more online activities). The stars denote the estimated sleep midpoints of each individual.
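The relative activity level in each cell of such a plot can be computed as each hour's share of a user's tweets. The following is a small sketch under that reading of the caption. The function name and sample data are illustrative.

```python
def hourly_share(tweet_hours):
    """Percentage of a user's tweets falling in each clock hour (24 values)."""
    counts = [0] * 24
    for h in tweet_hours:
        counts[int(h) % 24] += 1
    total = sum(counts)
    return [100 * c / total if total else 0.0 for c in counts]


# Hypothetical user with two tweets in the 09:00 hour, one at 21:00 and one at 22:00.
shares = hourly_share([9, 9.5, 21, 22])
print(shares[9], shares[21])
```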

Figure 3

Cross-country comparisons of sleep midpoint estimates. The box plots summarise the sleep midpoints estimated by the proposed framework, and the bars represent the range between mean bedtime and mean wake-up time reported in the Survey-Results (upper bars)19 and the Accelerometer-Stats (lower bars).20 The triangles and circles denote the middle points of the corresponding mean bedtimes and mean wake-up times.


We proposed a framework for estimating the timing of sleep (as SMPs) from digital footprints. We used objectively recorded times at which individuals were awake and active on Twitter to recover the timing of each individual’s HST. The estimated SMPs were identified by referencing the time periods that were associated with reduced online activity. Because it requires only personal computers and mobile devices rather than special equipment, the methodology could potentially be applied to a larger population. While the accuracy of our estimates is yet to be determined, the approach is conceptually less prone to participant-related biases, given that participants are not actively involved in the data collection process. This also means that retrospective analysis is possible, and being able to access information about sleep history may be especially valuable for assessing sleep deprivation in clinical settings, where prior arrangements are typically required.21

The cross-country comparison suggests that our technique is accurate enough to at least reflect known sleep timing differences among Twitter users of various countries. In figure 3, there is a west-to-east order of Spain–France–the Netherlands–Germany, in which Spanish people sleep the latest and German people sleep the earliest among the Central European countries. The results seem to capture the phenomenon that individuals who live in the western part of a time zone are more likely to have later sleep timing than individuals who live in the eastern part, owing to the difference in daylight cycles.22–24 While there seem to be larger differences when comparing the medians of our estimates with the reconstructed midpoints of the referenced survey, most of our estimated SMPs were less than 1 hour from the derived midpoints of the Accelerometer-Stats. It should be noted that the Survey-Results were based on self-reported data, whereas the Accelerometer-Stats and our results used accelerometer and digital footprint data, respectively, and that sleep measures derived from surveys and from accelerometer-based approaches still do not fully agree with each other.13 25–27

Further studies are needed to better understand the strengths and limitations of our proposed methodology, but we believe it could complement existing techniques with a number of advantages. Like survey-based research, our technique is able to reach a large population, but it may be less prone to respondent biases since we use unobtrusive data collection. Like accelerometer-based data collection, our technique relies on objective evidence, but it draws on existing public data and does not incur the cost and effort of wearing a sensor. Considering the low burden of data collection and that one-item or two-item online surveys (asking sleep duration directly, or asking bedtime and wake-up time) have been reported as imprecise and systematically biased when compared with a 7-day sleep diary,28 the proposed framework may be a cost-efficient supplement for future research regarding sleep-related health concerns.

This proposed framework has the potential to be applied to research on sleep deprivation once the estimated SMP has been further validated. Adopting the recommendation of 7 hours of sleep,29 the frequency of being awake during the 7-hour sleep window around an individual’s SMP may serve as a proxy measure of sleep deprivation. If we assume that people cannot use social media while asleep, then any online activity observed during the ideal sleep time is evidence of a potential sleep deprivation event. Given that a person who routinely goes to bed at midnight is more likely to produce digital footprints between 23:00 and midnight than a person who goes to bed at 23:00, this measure could, in turn, potentially be observed through activity on SNS over a long observation period (ie, several weeks or months). A successful application may provide unique evidence that supplements both survey-based research, which has wide gaps (mostly years) between data points, and experimental research, which has limited data collection periods.
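One possible form of this proxy is the fraction of observed days with at least one tweet inside the 7-hour window centred on the SMP. The sketch below illustrates that idea only. The study does not specify the computation, so the function names, the grouping of tweets by day and the sample data are all assumptions.

```python
def in_sleep_window(tweet_hour, smp, window_hours=7.0):
    """True if a clock hour lies within window_hours centred on the SMP."""
    # Circular distance on a 24-hour clock, in the range [0, 12].
    distance = abs((tweet_hour - smp + 12) % 24 - 12)
    return distance < window_hours / 2


def awake_day_fraction(tweets_by_day, smp, window_hours=7.0):
    """Fraction of days with at least one tweet inside the sleep window.

    tweets_by_day: mapping from a day label to that day's local posting
    times in decimal clock hours (assumed grouping).
    """
    if not tweets_by_day:
        return 0.0
    flagged = sum(
        any(in_sleep_window(h, smp, window_hours) for h in hours)
        for hours in tweets_by_day.values()
    )
    return flagged / len(tweets_by_day)


# Hypothetical four days of tweets for a user with an SMP of 03:30.
days = {"d1": [23.5, 8.0], "d2": [0.5, 9.0], "d3": [12.0], "d4": [2.0]}
print(awake_day_fraction(days, smp=3.5))
```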

In addition to the major limitation that the proposed methodology is yet to be validated, other important limitations should be noted. First, we assumed that sleep consists of one event per day. Thus, daytime naps were not addressed. Second, tweets that were posted between interrupted sleep periods might not have been distinguished from tweets that were posted before or after sleep, although such instances are believed to account for a small proportion of the data. Third, we did not attempt to distinguish whether an account was operated by a single person or contributed to collectively.30 Fourth, in identifying the transition between online and sleep, we empirically set the thresholds for being online by averaging tweet counts and used three ‘online’ intervals in an attempt to find the transitional ‘hours’ that consisted of half normal and half reduced activity. Future studies are warranted to evaluate these methodological choices when reference data become available. Lastly, generalisability may be limited to the Twitter user population, which currently accounts for roughly one-fourth of internet users,31 32 and is likely biased towards younger, wealthier and more urban demographics. However, since we used only time information rather than any information specific to one social networking platform, our framework should be applicable to data from other systems. Future work should attempt to validate the results of this proposed framework and explore the feasibility of applying it to other data sources. One of the first follow-up studies may be to validate the estimated SMPs against those derived from sleep diaries.

Despite these limitations, this study contributes to the literature by demonstrating a framework that uses digital footprints in understanding the time dimension of sleep. While the midpoint estimates of this study still require further validation, we believe this digital footprint-based approach has promise for collecting data about sleep at a much larger scale than has been possible with traditional techniques.




  • Contributors Chen conceptualised the framework with substantial input from Seo, Lin and Crandall. Chen drafted the manuscript and Seo provided critical revisions. All authors participated in manuscript preparation, and read and approved the final manuscript. All authors agree to be accountable for all aspects of the work.

  • Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.

  • Competing interests None declared.

  • Patient consent Not required.

  • Provenance and peer review Not commissioned; externally peer reviewed.
