Sammy Hajhamid Blog


load balancing usage across multiple codex accounts
getting blocked by usage limits? these new scheduling algorithms for codex-lb (and some neat tricks) can help
May 2, 2026
cs

The problem

I’ve been exhausting my Codex account’s usage limits, so I’m borrowing my friends’ ChatGPT subscriptions to use more accounts. Now, I’ve got 7. Even with more accounts, I’m still hitting my limits in the middle of my sessions. How can I be interrupted less?

Let’s start by considering some algorithms. When I start a session with codex, should I pick the account:

  • Randomly?
  • By exhausting each account’s quota in order?
  • With the most weekly quota remaining? Or maybe 5 hour quota remaining?

What’ll minimize interruption duration and frequency¹? Does it even matter? It’s not immediately obvious to me. The most popular Codex multi-accounting solution on GitHub, codex-lb, proposes three load-balancing strategies:

  • By round robin, selecting one account after another,
  • By usage weighting, selecting the account with the smallest sort key, given by (secondary_used_percent, primary_used_percent, last_selected_at, account_id),
  • And by capacity weighting, selecting the account with probability proportional to remaining_secondary_credits.
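Under assumed field names (my own summary of the strategy descriptions above, not codex-lb's actual internals), the three strategies can be sketched in Python:

```python
import random
from dataclasses import dataclass

@dataclass
class Account:
    account_id: str
    primary_used_percent: float    # five-hour window usage
    secondary_used_percent: float  # weekly window usage
    remaining_secondary_credits: float
    last_selected_at: float

def round_robin(accounts, counter):
    # Select one account after another, cycling through the list.
    return accounts[counter % len(accounts)]

def usage_weighted(accounts):
    # Smallest sort key wins: least-used weekly window first, ties
    # broken by five-hour usage, then staleness, then account id.
    return min(accounts, key=lambda a: (
        a.secondary_used_percent,
        a.primary_used_percent,
        a.last_selected_at,
        a.account_id,
    ))

def capacity_weighted(accounts):
    # Selection probability proportional to remaining weekly credits.
    weights = [a.remaining_secondary_credits for a in accounts]
    return random.choices(accounts, weights=weights, k=1)[0]
```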

We’ll measure these algorithms’ interruption characteristics, improve on codex-lb’s algorithms, and design our own multi-accounting Codex wrapper to use usage limit optimizations codex-lb currently can’t.

Algorithms for codex-lb

First, we’ll need to understand how usage limits work.

An account $i$ has two usage quotas: the five hour quota, $Q^{5h}_i$, and the weekly quota, $Q^w_i$. Empirically, I’ve found that the five hour quota is 12% of the weekly quota.

Usage quota is replenished via two refresh timers: the time until the five hour and weekly quota refreshes, $T^{5h}_i$ and $T^w_i$. Note, refresh timers do not refresh at a fixed period! Instead, a timer first starts when an account starts a session without an active timer.

Usage quota is depleted during sessions. If the session’s account runs out of quota, the session must be paused until quota refreshes². This is how interruptions happen.
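To make the model concrete, here's a minimal Python sketch of the per-account state, with the simplifying assumption that quota resets fully when its timer elapses (field and method names are mine):

```python
from dataclasses import dataclass
from typing import Optional

FIVE_HOURS = 5.0   # hours
ONE_WEEK = 168.0   # hours

@dataclass
class AccountState:
    weekly_quota: float                 # Q^w_i
    five_hour_quota: float              # Q^{5h}_i; empirically 12% of Q^w_i
    weekly_used: float = 0.0
    five_hour_used: float = 0.0
    weekly_timer_end: Optional[float] = None     # T^w_i, unset until first use
    five_hour_timer_end: Optional[float] = None  # T^{5h}_i

    def start_session(self, now: float, usage: float) -> None:
        # A timer only (re)starts when a session begins without one active.
        if self.five_hour_timer_end is None or now >= self.five_hour_timer_end:
            self.five_hour_timer_end = now + FIVE_HOURS
            self.five_hour_used = 0.0
        if self.weekly_timer_end is None or now >= self.weekly_timer_end:
            self.weekly_timer_end = now + ONE_WEEK
            self.weekly_used = 0.0
        self.five_hour_used += usage
        self.weekly_used += usage

    def usable_quota(self) -> float:
        # Quota immediately spendable, bounded by both windows.
        return min(self.five_hour_quota - self.five_hour_used,
                   self.weekly_quota - self.weekly_used)
```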

Simulating usage limits

We’ll assume each algorithm must load-balance an infinite list of sessions, arriving at some rate $\lambda$ according to a Poisson process. Session $j$ consumes quota $U_j$, sampled from an exponential distribution with parameter $\mu$.
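Sampling that workload is simple: a Poisson process means exponential inter-arrival gaps with mean 1/λ, and session sizes are exponential with mean 1/μ. A sketch:

```python
import random

def workload(lam: float, mu: float, n: int, seed: int = 0):
    # Yield (arrival_time, quota_used) for n sessions.
    rng = random.Random(seed)
    t = 0.0
    for _ in range(n):
        t += rng.expovariate(lam)     # inter-arrival gap ~ Exp(lambda)
        yield t, rng.expovariate(mu)  # session size ~ Exp(mu), mean 1/mu
```

With the widget defaults (λ = 2.23, μ = 0.027), the mean session size is 1/0.027 ≈ 37 quota, which is where the widget's quota-per-session readout comes from.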

Try it yourself!

Given $Q^{5h}$, $Q^w$, $T^{5h}$, $T^w$, $\lambda$, and $\mu$³, design an algorithm to minimize cumulative wait duration and frequency.

Your goal is to choose which account ii should handle an incoming session. See how your solution compares to a random baseline, codex-lb’s algorithms, and three new algorithms I’ve tossed into the ring: “greedy frequency”, “greedy duration”, and “phase targeting”.

If you’re interested, you can inspect the source for all algorithms in the widget.

[Interactive widget: sliders for accounts (7, +83.3 quota/hour), lambda (2.23 sessions/hour), and mu (0.027, ≈36.8 quota/session); start the simulation to draw the interruption duration and frequency curves.]

The three algorithms do surprisingly well. How do they work?

  • “greedy frequency” picks the account with the most immediately usable quota, minimizing the chance the next session hits a limit.

  • “greedy duration” estimates each account’s average wait for the next session, then picks the smallest one.

  • “phase targeting” tries to stagger refresh timers and spends quota closer to refreshes.
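Here's a rough Python sketch of the two greedy picks (my simplification, with each account summarized by hypothetical `usable_quota` and `time_to_next_refresh` fields, not the widget's actual source):

```python
def greedy_frequency(accounts):
    # Most immediately usable quota: the next session is least likely
    # to hit a limit on this account.
    return max(accounts, key=lambda a: a["usable_quota"])

def greedy_duration(accounts, mean_session_size):
    # Estimate each account's expected wait for a mean-sized session,
    # then pick the smallest.
    def expected_wait(a):
        overdraft = mean_session_size - a["usable_quota"]
        if overdraft <= 0:
            return 0.0  # fits entirely within current quota
        # Otherwise the session would stall until the next refresh.
        return a["time_to_next_refresh"]
    return min(accounts, key=expected_wait)
```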

It seems like, of the codex-lb algorithms, “usage weighting” performs the best overall, despite “capacity weighting” being the default. Comparing usage weighting to the greedy algorithms, the greedy algorithms do much better. We’ll scale up from a single trial later.

The difference in performance is sometimes dramatic, sometimes not; it depends on the parameters. Specifically, on the relationship between the quota depletion rate, $R_d \approx \lambda\,\mathbb{E}[U] = \lambda/\mu$, and the quota replenishment rate, $R_r \approx n\min\!\left(Q^{5h}/T^{5h},\, Q^w/T^w\right) = nQ^w/T^w$:

Intuitively, if you’re either rapidly maxing out all your usage limits or barely dipping into them, your load balancing algorithm is a weak differentiator for interruption duration and frequency.

When $R_d \approx R_r$, the relative performance of algorithms can still vary by workload. For example, usage weighting prefers many small tasks to few large tasks.
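As a sanity check, the widget defaults sit right in this regime:

```python
lam = 2.23   # sessions/hour (widget default)
mu = 0.027   # so mean session size E[U] = 1/mu ~= 37 quota

depletion_rate = lam / mu  # R_d ~= lambda * E[U] = lambda / mu
print(round(depletion_rate, 1))  # 82.6 quota/hour, right around R_r = +83.3
```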

What else can we do to improve performance besides use these greedy algorithms?

Algorithms for my codex wrapper, cx

At the moment, codex-lb doesn’t allow you to move sessions between accounts or proactively refresh account timers. My shell wrapper around Codex, cx, is designed to work around codex-lb’s limitations, providing smaller interruption duration and frequency.

Why move sessions?

All sessions are JSONL files under the CODEX_HOME/sessions directory. When an account runs out of quota, we can simply copy its session file from the drained account to a fresh one, and then resume it on the new account with codex resume. This would be useful, since you’d only be interrupted when all of your accounts are exhausted, instead of just one.
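A minimal sketch of the move, assuming each account has its own CODEX_HOME directory (the paths and helper name here are hypothetical):

```python
import shutil
from pathlib import Path

def move_session(session_file: str, drained_home: str, fresh_home: str) -> Path:
    # Copy the halted session's JSONL from the drained account's
    # CODEX_HOME into the fresh account's, so the session can be
    # resumed there with `codex resume`.
    src = Path(drained_home) / "sessions" / session_file
    dst = Path(fresh_home) / "sessions" / session_file
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dst)
    return dst
```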

Moving a session to another account is not free, since moving sessions uncaches the context window’s prefix⁴. Broadly, it’s cheap to move and uncache fresh sessions, but expensive to move and uncache nearly finished ones.

In practice, I’ve found this cost negligible.

Why proactively start refresh timers?

By sending a small session with $U_j \approx 0$ to an account, you could kickstart the account’s timer before a real session is scheduled. This would be useful, because it increases the algorithm’s quota replenishment rate, $R_r$, for effectively no cost.

Besides maximizing the frequency of each account’s refresh, we can choose the phase of each account’s refresh. Empirically, I’ve found that staggering accounts’ refresh phases performs much better than synchronizing them, but feel free to compare.
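The stagger itself is just even spacing of the kickstarts across the five-hour window; with 7 accounts, that works out to 300/7 ≈ 43 minutes between accounts. A sketch:

```python
def stagger_offsets(n_accounts: int, window_minutes: float = 300.0):
    # Evenly space each account's refresh kickstart across the
    # five-hour (300 minute) window.
    step = window_minutes / n_accounts
    return [round(i * step, 1) for i in range(n_accounts)]
```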

Simulating usage limits with the new optimizations

Try it yourself!

Again, try writing your own algorithm! Also, try to predict the impact of the two above optimizations.

[Interactive widget: the same sliders for accounts, lambda, and mu, plus toggles for session movement (resume remaining work on another account) and refresh timing (staggered: timers run proactively with even offsets); start the simulation to draw the interruption duration and frequency curves.]

With session movement, phase targeting does exceptionally well. However, two runs in your browser isn’t very good evidence of what algorithm to use, so we’ll need to scale up.

I’ve averaged the performance of the algorithms over 2048 runs ahead of time. To avoid combinatorial explosion, instead of varying $\lambda$ and $\mu$ independently, we’ll fix $R_d = R_r = \lambda/\mu$, and add a granularity slider to control how chunky the tasks are.

[Interactive widget: sliders for accounts (7, +83.3 quota/hour) and granularity (22.2 quota/session), plus the session movement and refresh timing toggles; loads precomputed data; select a data point to compare algorithms.]

The results

With session movement enabled, the algorithm that performs the best, regardless of setting, is phase targeting.

By far, the best optimization for avoiding interruptions is session movement. Averaging over all options, cumulative interruption time drops from 54,120 hours to 289 hours, and interruptions from 11,252 to 114; 99.5% and 99.0% reductions, respectively!

This is interesting to me, because the largest interruption decrease did not come from abstracting codex-lb’s quota model as a theoretical computer science problem. Rather, by having a better grasp of Codex’s practical details, we could rewrite the quota model into something more amenable.

Because of these results, cx uses phase targeting, session movement, and staggered refresh phases.

Implementing in practice

Each account’s CODEX_HOME points to a folder that contains symlinks to one shared CODEX_HOME, except for the auth.json, which is unique per account:

# One shared CODEX_HOME, per-account auth files, and per-account
# symlink farms under $CX_HOMES.
typeset -g CX_HOME="$HOME/.codex"
typeset -g CX_AUTH="$CX_HOME/auth"
typeset -g CX_HOMES="$CX_HOME/homes"

# Build (or refresh) the per-account home for account $1: symlink
# everything from the shared home, then overlay the account's own
# auth.json.
_cx_home() {
  local home="$CX_HOMES/$1" file

  mkdir -p "$home" || return
  for file in "$CX_HOME"/*(ND); do   # (N)ullglob, include (D)otfiles
    ln -sfn "$file" "$home/${file:t}" || return
  done
  ln -sfn "$CX_AUTH/$1.auth.json" "$home/auth.json"
}

Now, since the sessions folder is shared, we can simply run cx resume on a halted session to resume it on a new account. The UX is convenient since configs, selected models, memories, skills, and whatever new per-account state Codex adds will be shared between accounts too.

To proactively stagger refreshes across accounts, I use a straightforward cron job that runs cx [n] exec on a dummy prompt every 43 minutes, incrementing n.

Conclusion

To summarize:

  • You can optimize your usage limits. Better load-balancing algorithms can make better use of these limits.
  • A surprising amount of interpretable complexity and patterns—like when load-balancing does and doesn’t matter—emerges from trying to optimize these limits.
  • Taking codex-lb’s quota model as a given has limited returns. Having a mechanistic understanding of how Codex works provided the biggest gains in performance: making sessions moveable.
  • You can use the algorithms and optimizations mentioned in this post by following the instructions in the cx GitHub repository.

With more information about your usage, you can probably do much better than the algorithms shown in this post. For example, if you allow arbitrary time varying distributions for session size and session arrival, you can probably trigger the refreshes more advantageously for your work schedule.

Footnotes

  1. We could use other metrics, like time to first interruption, cumulative overdrawn quota, or some metric which weighs frequency and duration according to your preferred parameters. Each metric proxies a piece of the pain response I feel when I hit my quota limit, and no metric will proxy it perfectly.

  2. In practice, you might not always wait for quota to refresh. Maybe if the time to refresh is short, but you might otherwise reprompt Codex in a new session, or copy the session over and codex resume from a new account (shoutout albert for putting me on). We’ll model this later.

  3. You might notice that, in this model, the algorithm doesn’t know how large the incoming session is. In the real world, this isn’t always true—you (or an automated system) could provide an estimate.

    However, I think the complexity incurred by incorporating size information isn’t worth the benefit. The time and cognitive cost of a less digestible model would be high, and, more critically, the benefit is low: estimating the effort it takes to complete sessions is generally very hard, especially when working with AI agents.

    This tradeoff may not make sense for your use case—maybe it’s easy to estimate how much quota your sessions will consume, and maybe you rarely engage in multi-turn conversations with your agents—and that’s fine. For my use case, I’d rather not.

  4. This model of moving cost is approximate, since a session’s quota cost alone doesn’t contain the information needed to compute its cost to move.

    We could come up with a more formal model for uncaching costs by using the ratio of uncached input tokens to input tokens, but I think this would detract from the point of this post.