How Many Tokens Define a Human Life?


Machine learning papers love to compare context windows to books, codebases, or the internet. 1 million tokens roughly covers the entire Harry Potter series. 10 million tokens is a code repository. A trillion tokens is a training corpus.

The comparison I find more interesting is a human life.

Think for a moment: how many tokens do you think you've processed in your lifetime?

More precisely, I mean tokenizable words and internal dialogue. For this thought experiment, I do not want to count images or sounds. After all, people without the ability to see or hear still process the world in equivalently human ways. I want to count the discrete symbolic traces that define the core of our conscious being: words heard, seen, said, written, and thought.

The rough answer is smaller than I expected:

For a language-heavy knowledge worker, a reasonable upper-bound estimate is roughly 381,000 tokens/day.

Over 30 adult-equivalent years, that is about 4 billion tokens. A quieter life can plausibly be much lower.

Why does this matter? Well, if a life is measured in memories, it might one day be modelable in tokens. While few would argue that a raw transcript defines a person, many would agree that our life experiences and context are what separate us from language models. As long-context pretraining and active distillation research continues, we might have to adjust our priors on what it means to be human.

Defining the Question

Let's start with a toy version of the problem. Suppose I gave you a compressed day in the life of a software engineer:

morning:
  read issue, repro steps, stack trace
  skimmed source and wrote a patch
  exchanged slack messages
  thought through failure case

afternoon:
  reviewed PR comments
  watched a short talk
  read docs
  rehearsed next debugging step
...

If this log were expanded into words, how big would it be? Not how meaningful it would be, not how much of it would be remembered, and not how much information is latent in the physical state of the person. Just how many text-like tokens pass through the system, since that count forms an upper bound on point-in-time retrievable context.

Obviously, more goes on in the head than this. We form latent thoughts, spatial intuitions, emotional associations, and mental maps that are not tokenizable in the same way words and images currently are. But for this comparison, what matters is the stream of discrete ideas that survive latent processing and become expressible as words.

For English, the usual tokenizer approximation is:

1 token ~= 4 characters ~= 0.75 words
1 word ~= 1.33 tokens

Daily Estimate

The estimate comes from three dominant buckets: external media, conversation, and inner speech. A modern media day mixes text, audio, captions, docs, code, Slack, TV, music, and short-form video. So I model the media budget directly.

symbolic transcript = external media
                    + conversation
                    + inner speech

The cleanest daily estimate is:

  • External media: ~197,000 tokens/day
  • Conversation: ~51,000 tokens/day
  • Inner speech: ~133,000 tokens/day
  • Total: ~381,000 tokens/day
  • 30 adult-equivalent years: ~4 billion tokens

This is not a confidence interval. It is an activity-budget estimate. The point is to make each term legible enough that you can swap in your own day and see how much the answer moves.

External Media

In early 2008, the UCSD Measuring Consumer Information study attempted to answer this exact question and estimated that the average American received about 100,500 media words/day outside work, or roughly 134,000 tokens/day. That included TV, radio, print, phone, computer use, games, recorded music, and movies. It is a useful sanity check, but it predates the current shape of social video, always-on messaging, and modern screen work.

For the version of the question I care about here, I want to model a language-heavy knowledge worker, for example a software engineer. So I build the external budget from current activity estimates:

TV / movies:
  2.6 hours/day * 60 * 150 wpm ~= 23,000 words/day

work text:
  4.7 hours/day * 60 * 260 wpm ~= 73,000 words/day

music:
  20.7 hours/week * 60 / 7 * 147 wpm ~= 26,000 words/day

leisure reading:
  17 minutes/day * 238 wpm ~= 4,000 words/day

social video:
  141 minutes/day * 150 wpm ~= 21,000 words/day

external media total ~= 148,000 words/day
148,000 words/day * 1.33 ~= 197,000 tokens/day
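The same budget can be sketched as a small table of (minutes per day, words per minute) pairs, which makes it easy to swap in your own day. The dictionary keys and function name are illustrative; the numbers are the ones listed above.

```python
TOKENS_PER_WORD = 1.33  # English rule of thumb used throughout this post

# activity -> (minutes per day, words per minute), copied from the budget above
MEDIA_BUDGET = {
    "tv_movies": (2.6 * 60, 150),
    "work_text": (4.7 * 60, 260),
    "music": (20.7 * 60 / 7, 147),  # weekly hours converted to minutes per day
    "leisure_reading": (17, 238),
    "social_video": (141, 150),
}


def media_words_per_day(budget):
    """Words per day for each activity: minutes * words per minute."""
    return {name: minutes * wpm for name, (minutes, wpm) in budget.items()}


words = media_words_per_day(MEDIA_BUDGET)
total_words = sum(words.values())
total_tokens = total_words * TOKENS_PER_WORD

for name, count in words.items():
    print(f"{name:>16}: {count:>9,.0f} words/day")
print(f"{'total':>16}: {total_words:>9,.0f} words/day")
print(f"{'total':>16}: {total_tokens:>9,.0f} tokens/day")
```

Rounded to the nearest thousand, this reproduces the ~148,000 words/day and ~197,000 tokens/day figures.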

A few notes on those constants:

Conversation

The daily speech estimate comes from studies using the Electronically Activated Recorder (EAR): participants wear a device during normal life, it periodically records short snippets of ambient audio, and researchers later code the sampled recordings. That matters because people are not very good at estimating how much they talk. A large registered replication of daily word use pooled 2,197 participants from 22 samples using this kind of naturalistic audio sampling and estimated 12,792 spoken words/day with a standard deviation of 9,154.

The EAR study measures spoken output, not the full back-and-forth of conversation. Since conversation includes the words a person hears from other people, I multiply the spoken average by three to estimate total conversational language throughput. That multiplier is my assumption, not a measured value: it amounts to hearing about two words for every word spoken.

conversation_words_per_day = 12,792 * 3
                           ~= 38,000 words/day

conversation_tokens_per_day = 38,000 * 1.33
                            ~= 51,000 tokens/day
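A minimal sketch of the conversation term, with the multiplier exposed as a parameter so it is easy to challenge. The function and parameter names are illustrative; the default of 3 is the assumption used in this post, not something the EAR data measures.

```python
TOKENS_PER_WORD = 1.33  # English rule of thumb
SPOKEN_WORDS_PER_DAY = 12_792  # mean spoken words/day from the EAR replication


def conversation_tokens_per_day(spoken_words, throughput_multiplier=3):
    """Spoken words scaled up to total conversational words, then to tokens.

    throughput_multiplier is an assumption: 3 means two words heard
    for every word spoken.
    """
    conversation_words = spoken_words * throughput_multiplier
    return conversation_words * TOKENS_PER_WORD


print(f"{conversation_tokens_per_day(SPOKEN_WORDS_PER_DAY):,.0f} tokens/day")
# A more conservative multiplier of 2 (one word heard per word spoken):
print(f"{conversation_tokens_per_day(SPOKEN_WORDS_PER_DAY, 2):,.0f} tokens/day")
```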

Inner Speech

Inner speech is the hardest term to measure. The frequency estimate comes from Descriptive Experience Sampling (DES). In DES, participants carry a random beeper in everyday life. When it beeps, they note what was in experience immediately before the beep. They are then interviewed in detail, usually soon afterward, to distinguish inner speaking from visual imagery, unsymbolized thought, feelings, remembered recent speech, or a general sense of "I was thinking."

Across the studies discussed by the UNLV inner-experience group, inner speaking appears in roughly 20-23% of sampled moments, with large individual differences across participants. I use 23% as the empirical frequency anchor.

For the rate of active inner speech, I use an expanded-inner-speech experiment that measured inner speech in syllables/second; converted with an English word-length corpus estimate, that is about 260 words/minute. Combining the DES frequency estimate with 16 waking hours gives:

inner_speech_words_per_day = 16 hours/day * 60 * 0.23 * 260
                           ~= 57,000 words/day

inner_speech_tokens_per_day = 57,000 * 1.33
                            ~= 76,000 tokens/day

For a language-heavy upper bound, I use 40% of waking time at the same measured inner-speech rate:

upper_inner_speech_words_per_day = 16 hours/day * 60 * 0.40 * 260
                                 ~= 100,000 words/day

upper_inner_speech_tokens_per_day = 100,000 * 1.33
                                  ~= 133,000 tokens/day
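Both inner-speech figures come from the same formula with a different fraction of waking time, so a single parameterized function covers them. This is a sketch; the function and argument names are illustrative, and the defaults are the constants used above.

```python
def inner_speech_tokens_per_day(
    fraction_of_waking_time,
    waking_hours=16,
    words_per_minute=260,
    tokens_per_word=1.33,
):
    """Waking minutes spent in inner speech, times speech rate, times tokens per word."""
    minutes_in_inner_speech = waking_hours * 60 * fraction_of_waking_time
    words = minutes_in_inner_speech * words_per_minute
    return words * tokens_per_word


des_anchor = inner_speech_tokens_per_day(0.23)   # DES frequency anchor
upper_bound = inner_speech_tokens_per_day(0.40)  # language-heavy upper bound

print(f"DES anchor:  {des_anchor:,.0f} tokens/day")
print(f"upper bound: {upper_bound:,.0f} tokens/day")
```

The gap between the two is about 57,000 tokens/day, which makes this the most sensitive single assumption in the daily total.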

Putting It All Together

Using the external media and conversation estimates together with the language-heavy upper bound for inner speech:

external media: ~197,000 tokens/day
conversation:    ~51,000 tokens/day
inner speech:   ~133,000 tokens/day

total:          ~381,000 tokens/day
381,000 tokens/day * 365 * 30 ~= 4.2B tokens

So the headline estimate for a language-heavy knowledge worker is about 4 billion tokens per 30 adult-equivalent years.

The useful thing about this framing is that the answer is easy to audit. If you think 4.7 hours of work text at 260 wpm is too high, cut that term in half and the total drops by about 49,000 tokens/day. If you think the social feed is mostly silent images, drop the social video term entirely and the total drops by about 28,000 tokens/day. If you think inner speech is much more frequent than DES suggests, the inner speech term is the next obvious place to move the estimate.
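A short sketch of that audit, recomputing the two adjustments from the underlying activity figures. The variable names are illustrative; the baseline is the rounded daily total from above.

```python
TOKENS_PER_WORD = 1.33
BASELINE_TOKENS_PER_DAY = 381_000  # rounded daily total from this post

# Per-activity token counts rebuilt from hours (or minutes) * wpm * tokens per word
work_text_tokens = 4.7 * 60 * 260 * TOKENS_PER_WORD
social_video_tokens = 141 * 150 * TOKENS_PER_WORD

# Scenario 1: halve the work-text term
drop_half_work = work_text_tokens / 2
# Scenario 2: remove the social video term entirely
drop_social = social_video_tokens

print(f"halve work text:   -{drop_half_work:,.0f} -> "
      f"{BASELINE_TOKENS_PER_DAY - drop_half_work:,.0f} tokens/day")
print(f"drop social video: -{drop_social:,.0f} -> "
      f"{BASELINE_TOKENS_PER_DAY - drop_social:,.0f} tokens/day")
```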

Calculator

Here is the same arithmetic as a small calculator. I included it mostly because it is easy to lose intuition for the scale once the units switch from days to decades.

tokens/day: 381,000
years:      30
total ~= 4.2B tokens
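For readers without the interactive version, the calculator reduces to one multiplication. This is a minimal sketch; the function name and the 365-day year are my choices for illustration.

```python
def lifetime_tokens(tokens_per_day, years, days_per_year=365):
    """Total tokens over a span of years at a constant daily rate."""
    return tokens_per_day * days_per_year * years


total = lifetime_tokens(381_000, 30)
print(f"{total:,} tokens (~{total / 1e9:.1f}B)")  # 4,171,950,000 tokens (~4.2B)

# Swap in the DES-anchored inner speech figure (76,000 instead of 133,000):
quieter = lifetime_tokens(197_000 + 51_000 + 76_000, 30)
print(f"{quieter:,} tokens (~{quieter / 1e9:.1f}B)")
```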

Why Not Pixels?

The obvious objection is that words are not the whole input stream. Humans see, hear, touch, move, and feel. If you naively counted raw visual or auditory bandwidth, the number would be much larger.

But raw sensory bandwidth is not the thing I am trying to measure. People without the ability to see or hear still think in fully human ways. Their conscious lives are not missing the essence of human cognition because one sensory channel is absent. That is evidence that the part of experience most relevant to thought is not raw pixels or sound waves themselves, but the structured representations we build from them.

Language is not all of those representations, but it is the most portable and shared one. It is how we name things, explain them, remember them, plan around them, and pass them to other minds. For this comparison, I care about that symbolic layer: words read, heard, spoken, written, and internally rehearsed. Conveniently, it is also the layer language models are trained to process.

Closing Thoughts

We already have models that can operate over million-token context windows. Dario Amodei has said that 100 million words of context is not really blocked by ML fundamentals so much as inference cost. If that is roughly right, then the scale changes completely.

At the estimate above, 100 million tokens is months of symbolic life. A few billion tokens is a 30-year symbolic life. That does not mean a model has a body, a childhood, or human consciousness. But it does mean that the durable text-like residue of a life -- what someone read, heard, said, wrote, rehearsed, decided, and explained -- is no longer an obviously impossible amount of context.
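To make "months of symbolic life" concrete, here is a sketch that converts a context size into days at the daily rate estimated above. The function name is illustrative, and the result inherits every assumption behind the 381,000 tokens/day figure.

```python
def days_of_symbolic_life(context_tokens, tokens_per_day=381_000):
    """How many days of the estimated daily token stream fit in a context."""
    return context_tokens / tokens_per_day


for label, size in [("1M", 1_000_000), ("100M", 100_000_000), ("4B", 4_000_000_000)]:
    days = days_of_symbolic_life(size)
    print(f"{label:>4} tokens ~= {days:,.1f} days ~= {days / 365:.2f} years")
```

At this rate a 100-million-token context covers roughly 262 days, and a 1-million-token context covers under three days.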

Ilya Sutskever tweeted in 2023: "If you value intelligence above all other human qualities, you're gonna have a bad time." The same may be true of context. If the distinction you care about is that humans have vastly more lived context than models, that distinction may not survive long-context scaling.

The obvious objection is depth. A fixed-depth transformer can hold the tokens without necessarily connecting them. In my Divide and Conquer post, I argue that dependency resolution can in principle scale exponentially with layer count; assuming a base of 2, 32 layers gives 2^32 ~= 4.3B. As shown in that post, current models may still only make on the order of 10 semantic hops per generated token, but longer-context pretraining or dynamic-depth architectures like HRM and Sakana's DiffusionBlocks will likely just solve this problem.
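The arithmetic behind that comparison is small enough to check directly. This sketch only evaluates the stated assumption (reach grows as base^layers with base 2); it says nothing about what a real transformer achieves, and the variable names are illustrative.

```python
import math

BASE = 2     # assumed branching factor per layer (the assumption stated above)
LAYERS = 32

reach = BASE ** LAYERS               # 4,294,967,296
lifetime = 381_000 * 365 * 30        # 4,171,950,000 tokens over 30 years

# Smallest layer count whose theoretical reach covers the lifetime estimate
min_layers = math.ceil(math.log(lifetime, BASE))

print(f"reach at {LAYERS} layers: {reach:,}")
print(f"lifetime estimate:  {lifetime:,}")
print(f"layers needed under this assumption: {min_layers}")
```

Under this idealized assumption, 32 layers is the smallest depth whose theoretical reach exceeds the 30-year estimate.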

All this points to the same message: models may be able to experience human-length contexts sooner than we might expect.

Sources