Q: What does P95 mean?

P95 stands for the 95th percentile: 95% of iterations completed at or below this time. It tells you the 'typical worst case', that is, what a user experiences on a bad-but-not-extreme run.

Q: What does P99 mean?

P99 stands for the 99th percentile: 99% of iterations completed at or below this time. It highlights tail latency (the rare outlier spikes). With 100 iterations, this is the 99th value in sorted order (the second-worst).
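To make the percentile definitions concrete, here is a minimal sketch that assumes the nearest-rank method, which matches the "99th value of 100" description above. The function name `percentile_nearest_rank` and the sample timings are illustrative, not taken from the benchmark code.

```python
import math


def percentile_nearest_rank(samples, pct):
    """Return the smallest sample with at least pct% of all samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in the range (0, 100]")
    ordered = sorted(samples)
    # 1-based rank of the percentile value in the sorted list.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# 100 hypothetical iteration times in seconds: 0.01, 0.02, ..., 1.00
timings = [i / 100 for i in range(1, 101)]
p95 = percentile_nearest_rank(timings, 95)  # the 95th value in sorted order
p99 = percentile_nearest_rank(timings, 99)  # the 99th value, i.e. the second-worst
```

Other percentile methods interpolate between neighboring values, so they can give slightly different numbers on the same data.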

Sandbox Provider Leaderboard

Q: What is Time to Interactive (TTI)?

TTI is the total elapsed time a sandbox provider takes to boot a sandbox and run a first terminal command inside it.

Q: Why use median instead of average?

The average can be easily skewed by a single extreme outlier. The median provides a more accurate representation of the typical performance most users will experience.
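A small illustration of the difference, using made-up timings (nine fast runs and one extreme outlier) rather than real benchmark data:

```python
from statistics import mean, median

# Hypothetical TTI samples in seconds: nine fast runs and one extreme outlier.
samples = [0.50] * 9 + [10.0]

avg = mean(samples)    # pulled upward by the single 10 s run
mid = median(samples)  # still reflects the typical 0.5 s run
```

Here the average lands at 1.45s, nearly three times what nine out of ten runs actually took, while the median stays at 0.5s.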

Sandbox Benchmarks

A leaderboard of common benchmarks for each of our sandbox providers.

Last run: June 22, 2026

Provider Leaderboard

Provider	Score	Median	P95	P99	Success
Isorun	99.2	0.08s	0.10s	0.10s	100%
Declaw	98.0	0.18s	0.23s	0.23s	100%
Northflank	96.6	0.33s	0.37s	0.37s	100%
E2B	93.3	0.58s	0.79s	0.82s	100%
Daytona	93.1	0.59s	0.81s	0.85s	100%
Modal	93.3	0.65s	0.69s	0.70s	100%
Vercel	91.0	0.81s	1.00s	1.06s	100%
Blaxel	91.5	0.82s	0.90s	0.91s	100%
Tensorlake	52.8	1.21s	10.62s	10.63s	100%
Archil	77.0	2.19s	2.44s	2.51s	100%
Cloudflare	71.3	2.51s	3.33s	3.56s	100%
Runloop	66.4	3.19s	3.62s	3.62s	100%
Upstash	27.4	5.43s	12.64s	13.02s	100%
CodeSandbox	23.9	6.75s	8.57s	9.46s	100%
Hopx	0.0	16.25s	16.96s	17.01s	100%

Want to see a provider added?

Let us know on X

Methodology

What We Measure

Every benchmark measures Time to Interactive (TTI) — the elapsed time from calling compute.sandbox.create() to the first successful runCommand() inside the sandbox.
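As a rough sketch, one TTI iteration could be timed like this. The `create_sandbox` and `run_command` callables are hypothetical stand-ins for a provider's create and run-command calls, and treating any raised exception as a failed iteration is an assumption, not the benchmark's documented behavior.

```python
import time


def measure_tti(create_sandbox, run_command, command="echo ready"):
    """Time one iteration: create a sandbox, then run a first command inside it.

    Returns (elapsed_seconds, succeeded). Both callables are hypothetical
    stand-ins supplied by the caller; any exception marks the iteration failed.
    """
    start = time.perf_counter()
    try:
        sandbox = create_sandbox()
        run_command(sandbox, command)
        succeeded = True
    except Exception:
        succeeded = False
    return time.perf_counter() - start, succeeded


# Usage with fake callables that simulate a 10 ms boot.
elapsed, ok = measure_tti(lambda: time.sleep(0.01) or "sandbox-1",
                          lambda sandbox, cmd: None)
```

`time.perf_counter()` is used because it is a monotonic, high-resolution clock, so the measurement is not affected by system clock adjustments.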

Each provider is tested with 100 iterations per run. Benchmarks run automatically via GitHub Actions on a recurring schedule. All results are committed to the public benchmarks repo.

Sequential Test: Sandboxes are launched one at a time, waiting for each to become interactive before starting the next.

Staggered Test: Sandboxes are launched with a 200ms delay between consecutive launches.

Burst Test: All sandboxes are launched concurrently in a single burst.
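The three launch patterns can be sketched with `asyncio` as follows. Here `launch` is a hypothetical coroutine function that creates one sandbox and waits until it is interactive; the sketch also assumes the staggered test does not wait for earlier sandboxes before starting the next one.

```python
import asyncio


async def run_sequential(launch, count):
    # Start each sandbox only after the previous one is interactive.
    return [await launch(i) for i in range(count)]


async def run_staggered(launch, count, delay_s=0.2):
    # Start one sandbox every delay_s seconds without waiting for earlier ones.
    tasks = []
    for i in range(count):
        if i:
            await asyncio.sleep(delay_s)
        tasks.append(asyncio.create_task(launch(i)))
    return await asyncio.gather(*tasks)


async def run_burst(launch, count):
    # Start every sandbox at once and wait for all of them.
    return await asyncio.gather(*(launch(i) for i in range(count)))
```

Each runner returns the per-sandbox results in launch order, so the same timing code can be reused across all three tests.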

How We Score

The Composite Score is a weighted blend of timing metrics multiplied by the success rate. Each metric is scored against a fixed 10-second ceiling: 100 × (1 − value / 10,000ms), so a 200ms median scores 98 and anything ≥10s scores 0.

The weighted timing score is then multiplied by the success rate (a value from 0 to 1), so providers that fail frequently are penalized proportionally. The timing metrics are weighted as follows:

• Median: 60% — primary signal for typical experience
• P95: 25% — tail latency / consistency
• P99: 15% — extreme tail latency
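The scoring rule above can be written out as a short sketch. The names `metric_score`, `composite_score`, `CEILING_MS`, and `WEIGHTS` are illustrative and not taken from the benchmark code; clamping at zero follows from the statement that anything at or above 10s scores 0.

```python
CEILING_MS = 10_000  # fixed 10-second ceiling
WEIGHTS = {"median": 0.60, "p95": 0.25, "p99": 0.15}


def metric_score(value_ms):
    """Score one timing metric: 100 at 0 ms, falling linearly to 0 at the ceiling."""
    return max(0.0, 100.0 * (1.0 - value_ms / CEILING_MS))


def composite_score(median_ms, p95_ms, p99_ms, success_rate):
    """Weighted blend of the three timing scores, scaled by success rate (0 to 1)."""
    timing = (
        WEIGHTS["median"] * metric_score(median_ms)
        + WEIGHTS["p95"] * metric_score(p95_ms)
        + WEIGHTS["p99"] * metric_score(p99_ms)
    )
    return timing * success_rate


# Declaw's row from the table: 0.18s median, 0.23s P95, 0.23s P99, 100% success.
declaw = round(composite_score(180, 230, 230, 1.0), 1)
```

For the Declaw row this works out to 0.6 × 98.2 + 0.25 × 97.7 + 0.15 × 97.7 = 98.0, which matches the leaderboard. Rows computed from the rounded times in the table can differ from the published score by about 0.1, presumably because the published scores use unrounded timings.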

Sandbox Benchmarks FAQs

Have another question? Email us.

What is a sandbox?

A sandbox is any environment where you can run code in isolation. It could be a VM, a bare-metal machine, a container, or anything else with compute resources.
