Smoother Time Remaining for Text-to-Speech Playback

If a read-aloud app tells you there are five minutes left, you want that number to feel believable.

Browser text-to-speech APIs do not hand you a clean, steady progress signal. They report word positions in bursts. Sometimes that happens every few hundred milliseconds. Sometimes it is less consistent. Some voices barely report anything at all.

That is why so many read-aloud timers feel wrong in the same way. They sit still, jump forward, stall again, and make the remaining time look unreliable even when playback itself is fine.

Instead of waiting for the next browser event, Readox keeps making an estimate between updates. Say the timer shows 5:00 at the start and the browser then stays quiet for 37 seconds before its next position report implies 4:23 remaining. A basic timer freezes at 5:00 and then jumps. Readox keeps counting down during that gap, then uses the real update to pull the estimate back into line. This is why the countdown feels steadier.

A simple example

Imagine a 4,500 character page in English at normal speed. Readox starts with an English estimate of 15 characters per second, so the first estimate is 4,500 / 15 = 300 seconds = 5:00.

Now imagine the browser reports a real boundary at 1,200 characters after 80 seconds. That is 1,200 / 4,500 ≈ 0.267 of the text, so Readox can project the full duration as 80 / 0.267 ≈ 300 seconds.
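The projection above can be sketched in a few lines. This is a minimal illustration, not Readox's actual code: the function name and parameters are assumptions, and it only restates the arithmetic from the example.

```typescript
// Project the full playback duration from one real boundary report.
// Hypothetical helper for illustration; not taken from Readox.
function projectTotalSeconds(
  elapsedSeconds: number,
  charIndex: number,
  totalChars: number,
): number | null {
  // Without any progress there is nothing to project from.
  if (charIndex <= 0 || totalChars <= 0) return null;
  const fractionDone = Math.min(charIndex / totalChars, 1);
  return elapsedSeconds / fractionDone;
}

// 1,200 of 4,500 characters after 80 seconds projects to about 300 seconds.
const projectedTotal = projectTotalSeconds(80, 1200, 4500);
console.log(projectedTotal);
```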

If the next real update instead implies a faster voice, say the full page will finish in 270 seconds, Readox does not snap straight from 300 to 270. It blends the old estimate with the new one: 300 × 0.8 + 270 × 0.2 = 294 seconds.

So the timer keeps moving instead of freezing and then suddenly changing speed. With normal word events, Readox only pulls 20% toward each new measurement. With sparse sentence-end events, it uses a stronger 50% blend so the estimate catches up faster.
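The blend itself is a weighted average. The sketch below assumes a helper named blendEstimate (a made-up name) and uses the 20% and 50% weights quoted above. How Readox structures this internally is not shown here.

```typescript
// Move the running estimate part of the way toward a new measurement.
// weight = 0.2 for word events, 0.5 for sparse sentence-end events (per the text).
function blendEstimate(previous: number, measured: number, weight: number): number {
  return previous * (1 - weight) + measured * weight;
}

const afterWordEvent = blendEstimate(300, 270, 0.2); // about 294 seconds
const afterSentenceEvent = blendEstimate(300, 270, 0.5); // 285 seconds
console.log(afterWordEvent, afterSentenceEvent);
```

A lower weight keeps the countdown calm when updates are frequent. A higher weight lets a rare update count for more.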

Why text-to-speech progress bars jump in the first place

The basic problem is straightforward: the browser does not stream a clean “you are now 37.4% done” signal.

Instead, it gives sparse events such as:

a word boundary
a sentence boundary
sometimes almost nothing until later in playback

If you build the progress bar directly from those events, the UI only moves when the browser happens to report something. The result is familiar: nothing moves, then the number jumps, then it sits still again.
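To make the freezing concrete, here is a small sketch of an event-driven progress bar. The BoundaryReport shape and the sample numbers are assumptions for illustration. The reports are expected to be sorted by time.

```typescript
// A progress bar that only knows the most recent boundary report.
interface BoundaryReport {
  atSeconds: number; // when the report arrived, measured from playback start
  charIndex: number; // character position the browser reported
}

function naiveProgress(
  reports: BoundaryReport[],
  nowSeconds: number,
  totalChars: number,
): number {
  let lastIndex = 0;
  for (const report of reports) {
    if (report.atSeconds <= nowSeconds) lastIndex = report.charIndex;
  }
  return lastIndex / totalChars;
}

// One report at the start, then silence until second 37.
const sampleReports: BoundaryReport[] = [
  { atSeconds: 0, charIndex: 0 },
  { atSeconds: 37, charIndex: 555 },
];

console.log(naiveProgress(sampleReports, 36, 4500)); // still 0 after 36 seconds
console.log(naiveProgress(sampleReports, 37, 4500)); // jumps to about 0.123
```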

Playback itself still works. The problem is that the timer feels crude.

What Readox does instead

Readox runs an estimator alongside the browser’s native events.

At the start of playback, it makes an initial duration estimate from:

text length
playback speed
the voice’s language

Then it updates the progress bar every 100 milliseconds instead of waiting for the next browser event. That gives the UI continuous motion instead of frozen gaps.

When a real browser position update arrives, Readox uses it as a correction signal instead of treating it as the only signal available. That lets the timer keep moving without drifting too far away from what the voice is actually doing.

If every correction fully overwrote the estimate, the timer would still jerk around. If the estimator ignored real data, it would slowly wander off. The whole point is to keep both problems in check at the same time.
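Put together, the idea might look like the sketch below. The class name, method names, and structure are assumptions made for illustration, not Readox's implementation. In a real UI, tick would presumably be driven by a 100 millisecond timer such as setInterval. Here it is called directly so the example stays deterministic.

```typescript
// Counts down on every tick and treats real boundary reports as corrections.
class RemainingTimeEstimator {
  private totalSeconds: number;
  private elapsedSeconds = 0;
  private readonly totalChars: number;

  constructor(initialTotalSeconds: number, totalChars: number) {
    this.totalSeconds = initialTotalSeconds;
    this.totalChars = totalChars;
  }

  // Called on every UI update, e.g. with 0.1 for a 100 ms interval.
  tick(deltaSeconds: number): number {
    this.elapsedSeconds += deltaSeconds;
    return this.remainingSeconds();
  }

  // Called when the browser reports a real position.
  correct(charIndex: number, weight: number): void {
    if (charIndex <= 0) return;
    const fractionDone = Math.min(charIndex / this.totalChars, 1);
    const measuredTotal = this.elapsedSeconds / fractionDone;
    this.totalSeconds = this.totalSeconds * (1 - weight) + measuredTotal * weight;
  }

  remainingSeconds(): number {
    return Math.max(this.totalSeconds - this.elapsedSeconds, 0);
  }
}

const estimator = new RemainingTimeEstimator(300, 4500);
for (let i = 0; i < 370; i++) estimator.tick(0.1); // 37 quiet seconds
const beforeCorrection = estimator.remainingSeconds(); // about 263
estimator.correct(600, 0.2); // browser reports character 600
const afterCorrection = estimator.remainingSeconds(); // about 258.5
console.log(beforeCorrection, afterCorrection);
```

The display moved the whole time, and the real report nudged it by a few seconds instead of resetting it.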

Why language matters for time remaining

A naive approach would assume one reading speed for every language. That breaks quickly.

Different languages pack very different amounts of spoken content into the same number of characters. Japanese and Chinese behave differently from English. Spanish and French behave differently from English too. Even within alphabetic languages, the character-to-speech relationship is not constant enough for one universal rate to feel good.

So Readox picks an initial character-per-second estimate based on the selected voice’s language. That gives the timer a better starting point before real playback data begins correcting it.

Why Readox uses characters per second

The starting estimate is text length / (characters per second × playback rate).

Readox uses characters per second instead of words per minute because writing systems behave differently. The same number of written characters can take very different amounts of time to speak in English, Japanese, Chinese, or Spanish, so one universal reading-speed assumption would be wrong from the start.

A few examples from the current table, in characters per second at normal speed:

zh = 4.5
ja = 7
ko = 7
en = 15
fr = 17
es = 18
de = 18
tr = 19
fi = 20

That means the same 4,500 characters start very differently: English 4,500 / 15 = 300 seconds = 5:00, Spanish 4,500 / 18 = 250 seconds = 4:10, and Japanese 4,500 / 7 ≈ 643 seconds = 10:43.
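The starting estimate can be sketched as a table lookup plus the formula above. The rates come from the list in this article. The function names, the handling of tags like "en-US", and the fallback rate for unlisted languages are assumptions made for this example.

```typescript
// Characters per second at normal speed, from the table above.
const CHARS_PER_SECOND: Record<string, number> = {
  zh: 4.5, ja: 7, ko: 7, en: 15, fr: 17, es: 18, de: 18, tr: 19, fi: 20,
};

// Assumption: fall back to the English rate for languages not in the table.
const FALLBACK_RATE = 15;

function initialEstimateSeconds(
  textLength: number,
  voiceLang: string,
  playbackRate: number = 1,
): number {
  // "en-US" or "en_US" becomes "en" (assumed normalization).
  const primary = voiceLang.toLowerCase().split(/[-_]/)[0] ?? "";
  const rate = CHARS_PER_SECOND[primary] ?? FALLBACK_RATE;
  return textLength / (rate * playbackRate);
}

// Format seconds as m:ss, rounding to the nearest second.
function formatClock(seconds: number): string {
  const whole = Math.round(seconds);
  const minutes = Math.floor(whole / 60);
  const rest = whole % 60;
  return `${minutes}:${String(rest).padStart(2, "0")}`;
}

console.log(formatClock(initialEstimateSeconds(4500, "en-US"))); // 5:00
console.log(formatClock(initialEstimateSeconds(4500, "es-ES"))); // 4:10
console.log(formatClock(initialEstimateSeconds(4500, "ja-JP"))); // 10:43
console.log(formatClock(initialEstimateSeconds(4500, "en-US", 1.5))); // 3:20
```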

The estimate is only the starting point. Real boundary events still recalibrate it during playback.

This matters most on longer pages. A bad starting estimate makes the timer feel wrong for too long. A language-aware estimate gets much closer before the first few corrections even land.

What happens with voices that barely report progress

Some voices report precise word boundaries. Some report only sentence-level boundaries. A few are so sparse that a naive high-resolution progress bar is basically impossible.

Readox still gives those voices a smoother progress experience by using whatever signal exists:

word boundaries when the browser provides them
sentence boundaries when that is all the voice exposes
a stronger correction weight when updates are rare

So even when the voice is sparse, you still get a timer that feels usable. You do not need a perfect voice to get a decent countdown.
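One way to pick the correction weight is from the kind of boundary the voice reports. In the Web Speech API, a boundary event carries a name that is expected to be "word" or "sentence". Whether Readox reads that exact field is an assumption, and the function names below are invented for the sketch. The 0.2 and 0.5 weights are the ones quoted earlier.

```typescript
type BoundaryKind = "word" | "sentence";

// Rare sentence-level updates get a stronger pull than frequent word updates.
function correctionWeight(kind: BoundaryKind): number {
  return kind === "sentence" ? 0.5 : 0.2;
}

// Apply the same measurement several times to see how fast each kind converges.
function applyCorrections(
  start: number,
  measured: number,
  kind: BoundaryKind,
  count: number,
): number {
  const weight = correctionWeight(kind);
  let estimate = start;
  for (let i = 0; i < count; i++) {
    estimate = estimate * (1 - weight) + measured * weight;
  }
  return estimate;
}

// Starting at 300 seconds with real playback heading for 270 seconds:
const twoWordUpdates = applyCorrections(300, 270, "word", 2); // about 289.2
const twoSentenceUpdates = applyCorrections(300, 270, "sentence", 2); // 277.5
console.log(twoWordUpdates, twoSentenceUpdates);
```

Two sentence-level updates close three quarters of the gap, which is why a sparse voice can still settle on a usable number.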

Why this is better than a fake perfectly linear timer

There is an obvious shortcut here: make the timer move in a perfectly straight line from start to finish and ignore the browser entirely.

That looks smooth for a few seconds, but it falls apart in real use, where a fixed straight line cannot account for:

longer pauses between sentences
voice-specific pacing differences
different playback speeds
passages with unusually dense or short text

Readox does not pretend playback is uniform when it is not. It starts with an estimate, then keeps pulling that estimate back toward real playback.

This is the difference between a timer that only looks smooth and one you can actually trust.

What this means when you are actually listening

The benefit is not the animation. The benefit is that the number becomes more useful.

If the time remaining is steadier, it is easier to decide:

do I have time to finish this before I leave?
should I speed this up?
is this worth saving for later instead of listening now?

That makes playback easier to trust and easier to plan around.

The broader point

Small interface details change whether audio feels dependable.

If the remaining time swings wildly, the whole experience feels less trustworthy. If it settles quickly and feels believable, it is easier to decide whether to finish now, speed up, or save the piece for later.

Smoothing matters because the timer becomes something you can use, not just something that looks nicer.
