How Many Tokens Define a Human Life?
Machine learning papers love to compare context windows to books, codebases, or the internet. 1 million tokens roughly covers the entire Harry Potter series. 10 million tokens is a code repository. A trillion tokens is a training corpus.
The comparison I find more interesting is a human life.
Take a moment: how many tokens do you think you have processed in your lifetime?
More precisely, tokenizable words and internal dialogue. For this thought experiment, I do not want to count images or sounds. After all, people without the ability to see or hear still process the world in equivalently human ways. I want to count the discrete symbolic traces that define the core of our conscious being: words heard, seen, said, written, and thought.
The rough answer is smaller than I expected:
For a language-heavy knowledge worker, a reasonable upper-bound estimate is roughly 381,000 tokens/day.
Over 30 adult-equivalent years, that is about 4 billion tokens. A quieter life can plausibly be much lower.
Why does this matter? Well, if a life is measured in memories, it might one day be modelable in tokens. While few would argue that a raw transcript defines a person, many would agree that our life experiences and context are what separate us from language models. As long-context pretraining and active distillation research continues, we might have to adjust our priors on what it means to be human.
Defining the Question
Let's start with a toy version of the problem. Suppose I gave you a compressed day in the life of a software engineer:
morning:
read issue, repro steps, stack trace
skimmed source and wrote a patch
exchanged slack messages
thought through failure case
afternoon:
reviewed PR comments
watched a short talk
read docs
rehearsed next debugging step
...
If this log were expanded into words, how big would it be? Not how meaningful it would be, not how much of it would be remembered, and not how much information is latent in the physical state of the person. Just how many text-like tokens pass through the system, since that count forms an upper bound on point-in-time retrievable context.
Obviously, more goes on in the head than this. We form latent thoughts, spatial intuitions, emotional associations, and mental maps that are not tokenizable in the same way words and images currently are. But for this comparison, what matters is the stream of discrete ideas that survive latent processing and become expressible as words.
For English, the usual tokenizer approximation is:
1 token ~= 4 characters ~= 0.75 words
1 word ~= 1.33 tokens
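As a minimal sketch of that conversion, assuming the 1.33 tokens-per-word rule of thumb holds for ordinary English prose (real tokenizers vary by model and by text type, and the function names here are illustrative):

```python
# Rule-of-thumb conversion from the text: 1 English word ~= 1.33 tokens.
# This is an approximation, not the behavior of any specific tokenizer.
TOKENS_PER_WORD = 1.33


def words_to_tokens(words: float) -> float:
    """Approximate token count for a given number of English words."""
    return words * TOKENS_PER_WORD


def tokens_to_words(tokens: float) -> float:
    """Approximate English word count for a given number of tokens."""
    return tokens / TOKENS_PER_WORD


if __name__ == "__main__":
    # Example: 100,500 words/day expressed in tokens/day.
    print(f"{words_to_tokens(100_500):,.0f} tokens/day")
```

Code, URLs, and non-English text usually tokenize less efficiently than prose, so for a coding-heavy day this multiplier is probably on the low side.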
Daily Estimate
The estimate comes from three dominant buckets: external media, conversation, and inner speech. A modern media day mixes text, audio, captions, docs, code, Slack, TV, music, and short-form video. So I model the media budget directly.
symbolic transcript = external media
+ conversation
+ inner speech
The cleanest daily estimate is:
- External media: ~197,000 tokens/day
- Conversation: ~51,000 tokens/day
- Inner speech: ~133,000 tokens/day (the language-heavy upper bound; ~76,000 at the measured DES frequency)
- Total: ~381,000 tokens/day
- 30 adult-equivalent years: ~4 billion tokens
This is not a confidence interval. It is an activity-budget estimate. The point is to make each term legible enough that you can swap in your own day and see how much the answer moves.
External Media
In early 2008, the UCSD Measuring Consumer Information study attempted to answer this exact question and estimated that the average American received about 100,500 media words/day outside work, or roughly 134,000 tokens/day. That included TV, radio, print, phone, computer use, games, recorded music, and movies. It is a useful sanity check, but it predates the current shape of social video, always-on messaging, and modern screen work.
For the version of the question I care about here, I want to model a language-heavy knowledge worker, for example a software engineer. So I build the external budget from current activity estimates:
TV / movies:
2.6 hours/day * 60 * 150 wpm ~= 23,000 words/day
work text:
4.7 hours/day * 60 * 260 wpm ~= 73,000 words/day
music:
20.7 hours/week * 60 / 7 * 147 wpm ~= 26,000 words/day
leisure reading:
17 minutes/day * 238 wpm ~= 4,000 words/day
social video:
141 minutes/day * 150 wpm ~= 21,000 words/day
external media total ~= 148,000 words/day
148,000 words/day * 1.33 ~= 197,000 tokens/day
A few notes on those constants:
- The Bureau of Labor Statistics reports 2.6 hours/day of TV for U.S. adults in the 2024 American Time Use Survey. For TV and social video, I use 150 wpm as a normal spoken-language rate.
- For work text, I use 260 wpm, the fiction rate from the adult silent-reading meta-analysis, and 4.7 hours/day from a workplace-reading study of office workers. That is still aggressive for code and dense docs; the point is to model the kind of day where a coder is continuously reading issues, code, compiler errors, PR comments, docs, chat, and search results.
- As for music, the IFPI reports 20.7 hours/week of music engagement, and a lyrics study found recent hip-hop/rap averages around 147 words/minute.
- For leisure reading, BLS reports about 17 minutes/day, and the reading-rate meta-analysis gives about 238 wpm for nonfiction.
- The social line uses DataReportal/GWI's 2025 mean of 2 hours 21 minutes/day on social media and treats the feed as speech-heavy video at 150 wpm.
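The external-media budget above can be reproduced with a few lines of Python. This is a sketch that uses only the constants quoted in the text; the dictionary layout and function name are illustrative choices, not part of any cited study.

```python
# Activity budget from the text: (minutes per day, words per minute).
# All figures are the article's constants; the structure is illustrative.
TOKENS_PER_WORD = 1.33

EXTERNAL_MEDIA = {
    "tv_movies": (2.6 * 60, 150),
    "work_text": (4.7 * 60, 260),
    "music": (20.7 * 60 / 7, 147),  # weekly hours spread over 7 days
    "leisure_reading": (17, 238),
    "social_video": (141, 150),
}


def words_per_day(budget):
    """Words/day for each activity: minutes/day times words/minute."""
    return {name: minutes * wpm for name, (minutes, wpm) in budget.items()}


if __name__ == "__main__":
    per_activity = words_per_day(EXTERNAL_MEDIA)
    for name, words in per_activity.items():
        print(f"{name:16s} {words:10,.0f} words/day")
    total_words = sum(per_activity.values())
    print(f"{'total':16s} {total_words:10,.0f} words/day")
    print(f"{'total':16s} {total_words * TOKENS_PER_WORD:10,.0f} tokens/day")
```

To model a different day, change the minutes or the words-per-minute rate for any activity and rerun; work text is the largest single term, so it moves the total the most.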
Conversation
The daily speech estimate comes from studies using the Electronically Activated Recorder (EAR): participants wear a device during normal life, it periodically records short snippets of ambient audio, and researchers later code the sampled recordings. That matters because people are not very good at estimating how much they talk. A large registered replication of daily word use pooled 2,197 participants from 22 samples using this kind of naturalistic audio sampling and estimated 12,792 spoken words/day with a standard deviation of 9,154.
The EAR study measures spoken output, not the full back-and-forth of conversation. Since conversation also includes the words a person hears from other people, I multiply the spoken average by three to estimate total conversational language throughput. That multiplier is an assumption: it amounts to hearing about two words for every word spoken.
conversation_words_per_day = 12,792 * 3
~= 38,000 words/day
conversation_tokens_per_day = 38,000 * 1.33
~= 51,000 tokens/day
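Because the standard deviation is large relative to the mean, it helps to see how far the conversation term moves for quiet and talkative people. The sketch below reuses the figures above and assumes, purely for illustration, that the same multiplier of three applies across that range; the function name is hypothetical.

```python
TOKENS_PER_WORD = 1.33
SPOKEN_MEAN = 12_792  # spoken words/day, from the replication cited above
SPOKEN_SD = 9_154     # standard deviation across participants
MULTIPLIER = 3        # the article's assumption: heard + spoken = 3x spoken


def conversation_tokens(spoken_words, multiplier=MULTIPLIER):
    """Tokens/day of conversation for a given number of spoken words/day."""
    return spoken_words * multiplier * TOKENS_PER_WORD


if __name__ == "__main__":
    for label, spoken in [
        ("quiet (mean - 1 SD)", SPOKEN_MEAN - SPOKEN_SD),
        ("average", SPOKEN_MEAN),
        ("talkative (mean + 1 SD)", SPOKEN_MEAN + SPOKEN_SD),
    ]:
        print(f"{label:24s} {conversation_tokens(spoken):8,.0f} tokens/day")
```

Under these assumptions, one standard deviation on either side of the mean spans roughly 15,000 to 88,000 tokens/day, so this term is far more person-dependent than the single average suggests.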
Inner Speech
Inner speech is the hardest term to measure. The frequency estimate comes from Descriptive Experience Sampling (DES). In DES, participants carry a random beeper in everyday life. When it beeps, they note what was in experience immediately before the beep. They are then interviewed in detail, usually soon afterward, to distinguish inner speaking from visual imagery, unsymbolized thought, feelings, remembered recent speech, or a general sense of "I was thinking."
Across the studies discussed by the UNLV inner-experience group, inner speaking appears in roughly 20-23% of sampled moments, with large individual differences across participants. I use 23% as the empirical frequency anchor.
For the rate of active inner speech, I use an expanded-inner-speech experiment that measured inner speech in syllables/second; converted with an English word-length corpus estimate, that is about 260 words/minute. Combining the DES frequency estimate with 16 waking hours gives:
inner_speech_words_per_day = 16 hours/day * 60 * 0.23 * 260
~= 57,000 words/day
inner_speech_tokens_per_day = 57,000 * 1.33
~= 76,000 tokens/day
For a language-heavy upper bound, I use 40% of waking time at the same measured inner-speech rate:
upper_inner_speech_words_per_day = 16 hours/day * 60 * 0.40 * 260
~= 100,000 words/day
upper_inner_speech_tokens_per_day = 100,000 * 1.33
~= 133,000 tokens/day
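Both inner-speech scenarios are the same formula with a different fraction of waking time. A minimal sketch, using the 16 waking hours and 260 words/minute from the text (the function name is illustrative):

```python
TOKENS_PER_WORD = 1.33
WAKING_MINUTES = 16 * 60  # 16 waking hours/day
INNER_SPEECH_WPM = 260    # rate of active inner speech used in the text


def inner_speech_tokens(fraction_of_waking_time):
    """Tokens/day of inner speech, given the share of waking time spent in it."""
    words = WAKING_MINUTES * fraction_of_waking_time * INNER_SPEECH_WPM
    return words * TOKENS_PER_WORD


if __name__ == "__main__":
    for label, fraction in [("DES anchor (23%)", 0.23), ("upper bound (40%)", 0.40)]:
        print(f"{label:18s} {inner_speech_tokens(fraction):9,.0f} tokens/day")
```

Each additional percentage point of waking time adds about 3,300 tokens/day, which is why the choice of frequency dominates this term.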
Putting It All Together
Using the point estimates for external media and conversation, and the language-heavy upper bound for inner speech:
external media: ~197,000 tokens/day
conversation: ~51,000 tokens/day
inner speech: ~133,000 tokens/day
total: ~381,000 tokens/day
381,000 tokens/day * 365 * 30 ~= 4.2B tokens
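So the headline estimate for a language-heavy knowledge worker is about 4 billion tokens over 30 adult-equivalent years. Swapping in the 23% DES inner-speech term (~76,000 tokens/day) gives about 324,000 tokens/day, or roughly 3.5 billion tokens over the same span.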
The useful thing about this framing is that the answer is easy to audit. If you think 4.7 hours of work text at 260 wpm is too high, cut that term in half and the total drops by about 49,000 tokens/day. If you think the social feed is mostly silent images, cut the social term and the total drops by about 28,000 tokens/day. If you think internal speech is much more frequent than DES suggests, the inner speech term is the next obvious place to move the estimate.
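That kind of audit can be sketched as a small sensitivity check. The terms below are rebuilt from the constants in the text, with work text and social video broken out of the external-media total so they can be scaled on their own; the term names and the `daily_total` helper are illustrative.

```python
TOKENS_PER_WORD = 1.33

# Tokens/day per term, rebuilt from the article's constants.
TERMS = {
    "work_text": 4.7 * 60 * 260 * TOKENS_PER_WORD,
    "social_video": 141 * 150 * TOKENS_PER_WORD,
    "other_media": (2.6 * 60 * 150 + 20.7 * 60 / 7 * 147 + 17 * 238) * TOKENS_PER_WORD,
    "conversation": 12_792 * 3 * TOKENS_PER_WORD,
    "inner_speech": 16 * 60 * 0.40 * 260 * TOKENS_PER_WORD,
}


def daily_total(**scale):
    """Total tokens/day after multiplying any named term by a scale factor."""
    return sum(value * scale.get(name, 1.0) for name, value in TERMS.items())


if __name__ == "__main__":
    baseline = daily_total()
    print(f"{'baseline':22s} {baseline:10,.0f} tokens/day")
    scenarios = [
        ("halve work text", {"work_text": 0.5}),
        ("drop social video", {"social_video": 0.0}),
        ("inner speech at 23%", {"inner_speech": 0.23 / 0.40}),
    ]
    for label, scale in scenarios:
        change = daily_total(**scale) - baseline
        print(f"{label:22s} {change:+10,.0f} tokens/day")
```

Scale factors combine, so `daily_total(work_text=0.5, social_video=0.0)` applies both cuts at once.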
Calculator
Here is the same arithmetic as a small calculator. I included it mostly because it is easy to lose intuition for the scale once the units switch from days to decades.
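A minimal Python sketch of such a calculator, using the rounded daily terms above as defaults (the class and method names are illustrative, and a 365-day year is assumed):

```python
from dataclasses import dataclass


@dataclass
class SymbolicDay:
    """Daily token budget; defaults are the article's rounded estimates."""
    external_media: float = 197_000
    conversation: float = 51_000
    inner_speech: float = 133_000

    @property
    def tokens_per_day(self):
        return self.external_media + self.conversation + self.inner_speech

    def lifetime_tokens(self, years=30, days_per_year=365):
        """Total tokens over a span of adult-equivalent years."""
        return self.tokens_per_day * days_per_year * years


if __name__ == "__main__":
    scenarios = [
        ("upper-bound inner speech", SymbolicDay()),
        ("DES inner speech", SymbolicDay(inner_speech=76_000)),
    ]
    for label, day in scenarios:
        billions = day.lifetime_tokens() / 1e9
        print(f"{label:26s} {day.tokens_per_day:9,.0f}/day  {billions:.1f}B over 30 years")
```

Pass your own daily terms to `SymbolicDay(...)` or a different `years` value to `lifetime_tokens` to see how the multi-decade total responds.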
Why Not Pixels?
The obvious objection is that words are not the whole input stream. Humans see, hear, touch, move, and feel. If you naively counted raw visual or auditory bandwidth, the number would be much larger.
But raw sensory bandwidth is not the thing I am trying to measure. People without the ability to see or hear still think in fully human ways. Their conscious lives are not missing the essence of human cognition because one sensory channel is absent. That is evidence that the part of experience most relevant to thought is not raw pixels or sound waves themselves, but the structured representations we build from them.
Language is not all of those representations, but it is the most portable and shared one. It is how we name things, explain them, remember them, plan around them, and pass them to other minds. For this comparison, I care about that symbolic layer: words read, heard, spoken, written, and internally rehearsed. Conveniently, it is also the layer language models are trained to process.
Closing Thoughts
We already have models that can operate over million-token context windows. Dario Amodei has said that 100 million words of context is not really blocked by ML fundamentals so much as inference cost. If that is roughly right, then the scale changes completely.
At the estimate above, 100 million tokens is about nine months of symbolic life. A few billion tokens is a 30-year symbolic life. That does not mean a model has a body, a childhood, or human consciousness. But it does mean that the durable text-like residue of a life -- what someone read, heard, said, wrote, rehearsed, decided, and explained -- is no longer an obviously impossible amount of context.
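To put context sizes on a human timescale, divide by the daily rate. A short sketch, assuming the 381,000 tokens/day estimate and a 365-day year (the function name is illustrative):

```python
TOKENS_PER_DAY = 381_000  # the article's daily estimate


def days_to_fill(context_tokens, tokens_per_day=TOKENS_PER_DAY):
    """Days of symbolic life needed to accumulate a given number of tokens."""
    return context_tokens / tokens_per_day


if __name__ == "__main__":
    for label, size in [("1M", 1e6), ("100M", 1e8), ("4.2B", 4.2e9)]:
        days = days_to_fill(size)
        print(f"{label:>5s} tokens ~= {days:10,.1f} days ~= {days / 365:5.2f} years")
```

At this rate a 1-million-token window holds between two and three days, and 100 million tokens holds a little over 260 days.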
Ilya Sutskever tweeted in 2023: "If you value intelligence above all other human qualities, you're gonna have a bad time." The same may be true of context. If the distinction you care about is that humans have vastly more lived context than models, that distinction may not survive long-context scaling.
The obvious objection is depth. A fixed-depth transformer can hold the tokens without necessarily connecting them. In my Divide and Conquer post, I argue that dependency resolution can in principle scale exponentially with layer count; assuming a base of 2, 32 layers gives 2^32 ~= 4.3B. As that post shows, current models may still make only on the order of 10 semantic hops per generated token, but longer-context pretraining or dynamic-depth architectures like HRM and Sakana's DiffusionBlocks will likely solve this problem.
All this points to the same message: models may be able to experience human-length contexts sooner than we might expect.
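The exponent arithmetic is easy to check against the lifetime estimate. The sketch below only restates the base ** layers bound from the text; it says nothing about whether a real model achieves that bound, and both function names are hypothetical.

```python
LIFETIME_TOKENS = 381_000 * 365 * 30  # the article's 30-year estimate


def dependency_bound(layers, base=2):
    """In-principle bound from the text: base ** layers."""
    return base ** layers


def layers_needed(tokens, base=2):
    """Smallest layer count whose bound covers the given number of tokens."""
    layers = 0
    while dependency_bound(layers, base) < tokens:
        layers += 1
    return layers


if __name__ == "__main__":
    print(f"2^32            = {dependency_bound(32):,}")
    print(f"lifetime tokens = {LIFETIME_TOKENS:,}")
    print(f"layers needed   = {layers_needed(LIFETIME_TOKENS)}")
```

With a base of 2, the 30-year estimate of about 4.17 billion tokens sits just under 2^32, so 32 is the smallest layer count whose bound covers it.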
Sources
- Token conversion: OpenAI, What are tokens and how to count them?
- Context examples: Harry Potter word count; CPython source lines of code. Harry Potter; Debian Sources CPython listing
- Long-context scaling: Anthropic context-window documentation; Dario Amodei interview transcript; Ilya Sutskever tweet; my divide-and-conquer context-depth argument; HRM and DiffusionBlocks dynamic-depth references. Anthropic context windows; Dario transcript; Ilya tweet; Divide and Conquer; HRM; DiffusionBlocks
- Words spoken: Large registered replication of daily word use; original daily-speech study. registered replication; original paper
- External media: UCSD media-word estimate; Bureau of Labor Statistics TV and reading time; workplace-reading estimate; Ghent University reading-rate meta-analysis; spoken-language-rate rule of thumb; DataReportal/GWI social-media time; IFPI music listening; lyrics words/minute by genre. UCSD media words; BLS time use; BLS reading; workplace reading; reading-rate meta-analysis; speaking rate; DataReportal social media; IFPI music; lyrics by genre
- Inner speech: UNLV Descriptive Experience Sampling work; individual-difference study; expanded inner-speech rate experiment; English word-length corpus estimate. DES paper; individual differences; expanded inner-speech rate; word length