
Gate Benchmarking (RB)

A thorough interactive reference for RB, from the SPAM problem and Clifford twirling, through the decay model and IRB, to practical design choices and state-of-the-art neutral atom results. Includes working calculators for standard RB and interleaved RB.

Why Randomized Benchmarking? The SPAM Problem

Before RB existed, measuring gate fidelity seemed simple: prepare $|0\rangle$, apply gate $G$, measure. But this conflates three separate error sources that you cannot disentangle with a single experiment.

State Preparation and Measurement (SPAM) Errors

Every qubit experiment involves three steps: (1) prepare a known state, (2) apply your gate or circuit, (3) measure the output state. Steps 1 and 3 both have imperfections called state preparation errors and measurement errors, collectively SPAM. In neutral atom experiments, SPAM typically contributes 0.5–2% error per shot, often comparable to or larger than the gate error you're trying to measure.

If you apply gate $G$ once and measure, you see $P(|0\rangle) = F_{\rm gate} \cdot F_{\rm SPAM}$. You cannot extract $F_{\rm gate}$ without knowing $F_{\rm SPAM}$ independently. Even worse, SPAM errors fluctuate with laser power drifts, alignment, and atom temperature, which makes them hard to characterize reliably.

The Key Insight: SPAM Errors Don't Scale with Circuit Depth

Randomized Benchmarking's crucial observation: apply a random sequence of $m$ Clifford gates followed by the recovery gate $C_r$ that inverts the whole sequence. The ideal output is always $|0\rangle$. As you increase $m$, gate errors accumulate, but SPAM errors are the same no matter how long the circuit is. This means SPAM appears as constants $A$ and $B$ in the fit model, while gate errors appear in the exponential decay $p^m$. You get a clean separation.

[Figure: RB circuit. |0⟩ → SPAM (prep error) → C₁, C₂, …, Cₘ → Cᵣ (recover) → SPAM (meas error) → P(|0⟩), fit to F(m) = A·pᵐ + B. SPAM sets the constants A and B, the same for all m; gate errors set the pᵐ decay.]
Bottom line: By varying the circuit depth $m$ and fitting the decay of the survival probability, you extract the decay constant $p$, which depends only on gate errors. SPAM is absorbed into the constants $A$ and $B$ and drops out of the gate error estimate. This is what makes RB so powerful: it is SPAM-insensitive by construction.
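A minimal sketch of this separation, assuming the ideal model $F(m) = Ap^m + B$ with a known asymptote $B$. The function names and all numerical values are illustrative assumptions, not data from any experiment. Because the amplitude cancels in a ratio of two depths, curves with different SPAM amplitudes return the same $p$.

```python
def survival(m, p, A, B):
    """Ideal RB survival probability F(m) = A * p**m + B."""
    return A * p**m + B


def decay_from_two_depths(F1, F2, m1, m2, B):
    """Recover p from the survival at two depths, assuming B is known.

    (F2 - B) / (F1 - B) = p**(m2 - m1), so the SPAM amplitude A cancels.
    """
    return ((F2 - B) / (F1 - B)) ** (1.0 / (m2 - m1))


p_true, B = 0.99, 0.5          # illustrative 1Q values
for A in (0.50, 0.45, 0.40):   # progressively worse SPAM (hypothetical)
    F10 = survival(10, p_true, A, B)
    F100 = survival(100, p_true, A, B)
    p_est = decay_from_two_depths(F10, F100, 10, 100, B)
    print(f"A = {A:.2f}: p = {p_est:.6f}")
```

A real analysis would fit all depths at once, usually with $B$ as a free parameter; the two-depth ratio is only meant to make the cancellation of $A$ explicit.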

The Clifford Group

RB uses random Clifford gates, not arbitrary unitaries. The choice is deliberate: Clifford operations have a unique property that makes the RB decay model exact.

Definition: Normalizer of the Pauli Group

The Clifford group $\mathcal{C}_n$ consists of all $n$-qubit unitaries $U$ that map Pauli operators to Pauli operators under conjugation:

$U \in \mathcal{C}_n \iff U P U^\dagger \in \mathcal{P}_n \text{ for all } P \in \mathcal{P}_n$

In other words, Clifford gates normalize the Pauli group. If you conjugate any Pauli by a Clifford, you get another Pauli (up to a phase). Paulis get "shuffled around" but never become non-Pauli.

Group Sizes

  • 1-qubit: 24 elements, generated by $H$ and $S$, or equivalently $X_{\pi/2}$ and $Z_{\pi/2}$. These correspond to the 24 orientation-preserving symmetries of a cube (the chiral octahedral group $O$).
  • 2-qubit: 11,520 elements, generated by 1Q Cliffords plus CNOT (or CZ).
  • n-qubit: grows as $\sim 2^{O(n^2)}$, exponentially large, but still efficiently simulable via the Gottesman-Knill theorem (since Cliffords map stabilizer states to stabilizer states).
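The 1-qubit group size can be checked by brute force. The sketch below closes the generators $H$ and $S$ under multiplication and counts distinct elements up to global phase. The helper names (`matmul`, `canonical`, `generate_group`) are illustrative, and rounding is used only to make floating-point matrices hashable.

```python
import math
from collections import deque

S2 = 1 / math.sqrt(2)
H = ((S2, S2), (S2, -S2))
S = ((1, 0), (0, 1j))


def matmul(a, b):
    """Product of two 2x2 matrices stored as nested tuples."""
    return tuple(
        tuple(sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )


def canonical(u):
    """Hashable key for u that ignores global phase.

    Divide by the phase of the first nonzero entry, then round.
    """
    flat = [complex(x) for row in u for x in row]
    ref = next(x for x in flat if abs(x) > 1e-9)
    phase = ref / abs(ref)
    return tuple(
        (round((x / phase).real, 6) + 0.0, round((x / phase).imag, 6) + 0.0)
        for x in flat
    )


def generate_group(generators):
    """Breadth-first closure of the generators under multiplication."""
    identity = ((1, 0), (0, 1))
    seen = {canonical(identity): identity}
    queue = deque([identity])
    while queue:
        u = queue.popleft()
        for g in generators:
            v = matmul(g, u)
            key = canonical(v)
            if key not in seen:
                seen[key] = v
                queue.append(v)
    return list(seen.values())


cliffords = generate_group([H, S])
print(len(cliffords))  # number of 1Q Cliffords up to global phase
```

The same closure works in principle for two qubits with 4x4 matrices and a CZ generator, at the cost of a longer search.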

Why Cliffords? The Twirling Argument

The key insight behind RB is Clifford twirling. Given any fixed noise channel $\Lambda$ (coherent or incoherent, with no assumption about its form), averaging over the Clifford group converts it into a depolarizing channel:

$\frac{1}{|\mathcal{C}_n|} \sum_{C \in \mathcal{C}_n} C^\dagger \Lambda(C \rho C^\dagger) C = p\rho + (1-p)\frac{I}{d}$

This is the twirl. Whatever the noise actually is (coherent Z rotations, amplitude damping, cross-talk, laser phase noise), after Clifford twirling it looks like depolarizing noise parameterized by a single number $p$. The RB decay $F(m) = Ap^m + B$ is therefore exact (not approximate) for any noise that is gate-independent and time-independent.
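The twirl can be verified numerically for a simple case. The sketch below assumes a purely coherent error, a small over-rotation about Z by an illustrative angle `theta`, and averages it over the 24 single-qubit Cliffords built from $H$ and $S$. The comparison value uses the standard result for a unitary error $U$, $p = (|\mathrm{Tr}\,U|^2 - 1)/(d^2 - 1)$, which is not stated in the text above. All helper names are hypothetical.

```python
import cmath
import math


def mm(a, b):
    """2x2 matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def dag(a):
    """Conjugate transpose."""
    return [[a[j][i].conjugate() for j in range(2)] for i in range(2)]


def key(u):
    """Phase-insensitive, hashable fingerprint of a unitary."""
    flat = [x for row in u for x in row]
    ref = next(x for x in flat if abs(x) > 1e-9)
    return tuple(round(v, 6) + 0.0
                 for x in flat
                 for v in ((x * abs(ref) / ref).real, (x * abs(ref) / ref).imag))


def clifford_group():
    """Close {H, S} under multiplication, modulo global phase."""
    s = 1 / math.sqrt(2)
    gens = [[[s + 0j, s + 0j], [s + 0j, -s + 0j]], [[1 + 0j, 0j], [0j, 1j]]]
    identity = [[1 + 0j, 0j], [0j, 1 + 0j]]
    group = {key(identity): identity}
    frontier = [identity]
    while frontier:
        new = []
        for u in frontier:
            for g in gens:
                v = mm(g, u)
                if key(v) not in group:
                    group[key(v)] = v
                    new.append(v)
        frontier = new
    return list(group.values())


def twirl(error_unitary, rho):
    """Average C^dag Lambda(C rho C^dag) C over the group, Lambda(s) = U s U^dag."""
    group = clifford_group()
    out = [[0j, 0j], [0j, 0j]]
    for c in group:
        sigma = mm(mm(c, rho), dag(c))                            # C rho C^dag
        sigma = mm(mm(error_unitary, sigma), dag(error_unitary))  # apply the error
        sigma = mm(mm(dag(c), sigma), c)                          # undo the Clifford
        for i in range(2):
            for j in range(2):
                out[i][j] += sigma[i][j] / len(group)
    return out


theta = 0.2  # illustrative over-rotation angle about Z, in radians
U = [[cmath.exp(-0.5j * theta), 0j], [0j, cmath.exp(0.5j * theta)]]
rho0 = [[1 + 0j, 0j], [0j, 0j]]  # |0><0|
twirled = twirl(U, rho0)

p_expected = (4 * math.cos(theta / 2) ** 2 - 1) / 3  # (|Tr U|^2 - 1) / (d^2 - 1)
p_numeric = 2 * twirled[0][0].real - 1               # from <0|rho|0> = (1 + p) / 2
print(p_numeric, p_expected)
```

The twirled state has no off-diagonal elements left, which is the sense in which a coherent rotation "looks like" depolarizing noise after averaging.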

1-Qubit Clifford Decompositions in {$X_{\pi/2}$, $Z_{\pi/2}$} Basis

Modern neutral atom hardware implements 1Q gates natively as $X_{\pi/2}$ and $Z_{\pi/2}$ pulses (virtual Z via phase shift). Every 1Q Clifford can be decomposed into at most 3 such pulses, giving an average of $\approx 1.875$ native gates per Clifford.

Gate / Axis | Native Gate Sequence | # Native Gates | Comment
Identity $I$ | (none) | 0 | Virtual only
$X_\pi$ (X gate) | $X_{\pi/2} \cdot X_{\pi/2}$ | 2 | Two half-X pulses
$Y_\pi$ (Y gate) | $Z_{\pi/2} \cdot X_\pi \cdot Z_{-\pi/2}$ | 3 | Axis rotation
$Z_\pi$ (Z gate) | $Z_{\pi/2} \cdot Z_{\pi/2}$ | 0 (virtual) | Phase update
$H$ (Hadamard) | $Z_{\pi/2} \cdot X_{\pi/2} \cdot Z_{\pi/2}$ | 1 + 2 virtual | 1 physical pulse
$S$ ($Z_{\pi/2}$) | $Z_{\pi/2}$ | 0 (virtual) | Frame rotation
$T_3$ (120° rotations about the cube's body diagonals) | 1–3 pulse combos | 1–3 | 8 elements, avg 2
$T_6$ (180° edge rotations) | $X_{\pi/2}$ or $Z_{\pi/2}$ | 1–2 | 6 elements
Avg over all 24 | | 1.875 | Theoretical average
[Figure: 24 Single-Qubit Clifford States on Bloch Sphere. Labels: |0⟩ (Z+), |1⟩ (Z−), X, −X. Legend: |0⟩, |1⟩ axis (4); ±X, ±Y axis (8); Face/diagonal (8); −Z neighbors (4).]

The RB Decay Model

From first principles: why the survival probability decays exponentially, what the fit parameters mean, and how to extract gate error and useful circuit depth.

Derivation: From Twirling to Exponential Decay

1. Start with gate-independent depolarizing noise. Each Clifford gate $C_i$ is followed by a noise channel: $\tilde{C}_i(\rho) = \Lambda(C_i \rho C_i^\dagger)$, where $\Lambda(\rho) = p\rho + (1-p)\frac{I}{d}$ is depolarizing with parameter $p \in [0,1]$.
2. After $m$ gates, average over all random Clifford sequences. By the twirling argument, the noise from $m$ sequential gates compounds as $p^m$ (since each application of $\Lambda$ shrinks the Bloch vector by a factor $p$). The averaged state before measurement is: $\overline{\rho_m} = p^m |\psi_{\rm prep}\rangle\langle\psi_{\rm prep}| + (1-p^m)\frac{I}{d}$
3. Apply the recovery gate and measure. The measured survival probability (probability of recovering $|0\rangle$) is: $F(m) = \langle 0|\overline{\rho_m}|0\rangle = A \cdot p^m + B$, where $A = |\langle 0|\psi_{\rm prep}\rangle|^2 - 1/d$ absorbs SPAM, and $B = 1/d$ is the long-depth asymptote (the fully mixed state for $d=2$ gives $B = 1/2$).
4. Extract error per Clifford from $p$. For a $d$-dimensional system, the depolarizing parameter $p$ relates to the error per Clifford as $r_C = \frac{(d-1)(1-p)}{d}$. For 1-qubit ($d=2$): $r_C = (1-p)/2$, so $p = 1 - 2r_C$. For 2-qubit ($d=4$): $r_C = 3(1-p)/4$, so $p = 1 - (4/3)r_C$.
5. Convert to error per native gate. Since each Clifford is compiled from $\bar{n}$ native gates on average, the error per native gate is $\varepsilon_{\rm gate} \approx r_C / \bar{n}$. For 1Q: $\bar{n} \approx 1.875$. For 2Q CZ + overhead: $\bar{n} \approx 2$.
$$F(m) = A \cdot p^m + B$$
$$r_C = \frac{(d-1)(1-p)}{d} \quad \text{[1Q: } r_C = \tfrac{1-p}{2}\text{, 2Q: } r_C = \tfrac{3(1-p)}{4}\text{]}$$
$$\varepsilon_{\rm gate} \approx \frac{r_C}{\bar{n}} \quad \text{where } \bar{n} \approx 1.875 \text{ (1Q)}, \approx 2.0 \text{ (2Q)}$$
$$N_{\rm useful} = \frac{1}{2r_C} \quad \text{[1Q circuit depth at which the decay factor } p^m \approx 1/e \approx 0.37\text{]}$$
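The chain from measured survival to reported numbers can be sketched as follows. This is a simplified illustration, not a production fit: it assumes $B$ is known and fixed at $1/d$, uses noiseless synthetic data with made-up parameters, and replaces a proper weighted nonlinear fit with a straight-line fit to $\ln(F - B)$. The function names are hypothetical.

```python
import math


def fit_decay_fixed_B(depths, survival, B):
    """Least-squares line through ln(F - B) = ln(A) + m * ln(p).

    Assumes B is known (e.g. 1/d) and every F exceeds B.
    """
    ys = [math.log(F - B) for F in survival]
    n = len(depths)
    mx = sum(depths) / n
    my = sum(ys) / n
    slope = (sum((m - mx) * (y - my) for m, y in zip(depths, ys))
             / sum((m - mx) ** 2 for m in depths))
    intercept = my - slope * mx
    return math.exp(intercept), math.exp(slope)  # (A, p)


def error_per_clifford(p, d):
    """r_C = (d - 1)(1 - p) / d."""
    return (d - 1) * (1 - p) / d


# Synthetic noiseless 1Q data with illustrative parameters.
d, A_true, p_true = 2, 0.45, 0.99
depths = [1, 2, 5, 10, 20, 50, 100, 200]
data = [A_true * p_true**m + 1 / d for m in depths]

A_fit, p_fit = fit_decay_fixed_B(depths, data, B=1 / d)
r_C = error_per_clifford(p_fit, d)
eps_gate = r_C / 1.875          # n_bar for 1Q, as quoted in the text
n_useful = 1 / (2 * r_C)        # 1Q characteristic depth
print(f"p = {p_fit:.5f}, r_C = {r_C:.5f}, eps = {eps_gate:.5f}, N = {n_useful:.0f}")
```

With real data, shot noise makes $F - B$ unreliable at long depths, so a nonlinear fit with $B$ left free is the safer choice.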

Physical Meaning of Parameters

  • $p$ (decay constant): Proximity to identity channel. $p=1$ means perfect gates, $p=0$ means fully depolarizing. Extracted directly from exponential fit.
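  • $A$ (amplitude): Absorbs SPAM. With $B = 1/d$, the SPAM-free value is $A = 1 - 1/d$ (0.5 for 1Q, 0.75 for 2Q), and SPAM pushes it lower. If the curve is rescaled so that its asymptote sits at 0 and its ideal starting point at 1, the amplitude is typically $A \lesssim 0.95$ due to ~5% SPAM in neutral atom experiments. Does not affect $r_C$.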
  • $B$ (asymptote): Long-depth limit where the state is fully mixed. For 1Q: $B \to 1/2$ (chance). For 2Q: $B \to 1/4$. In practice, $B$ is a fit parameter that captures imperfect long-depth behavior.
  • $r_C$ (error per Clifford): The quantity you report, and the standard benchmark for comparing platforms. Equivalent to infidelity per Clifford averaged uniformly over the Clifford group.

Average Gate Fidelity

The average gate fidelity of a noisy channel $\tilde{C}$ relative to ideal $C$ is:

$F_{\rm avg}(C, \tilde{C}) = \int_\psi \langle\psi|C^\dagger \tilde{C}(|\psi\rangle\langle\psi|) C|\psi\rangle \, d\psi$

Under the depolarizing model, $F_{\rm avg} = p + (1-p)/d$. For a 1Q gate: $F_{\rm avg} = p + (1-p)/2 = 1 - r_C$. For a native gate: $F_{\rm avg,gate} \approx 1 - \varepsilon_{\rm gate} = 1 - r_C/\bar{n}$.
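The identity $F_{\rm avg} = 1 - r_C$ holds for any $d$, not only for one qubit. Written out using the definitions above, followed by a check with the illustrative value $p = 0.99$:

$$F_{\rm avg} = p + \frac{1-p}{d} = 1 - (1-p)\left(1 - \frac{1}{d}\right) = 1 - \frac{(d-1)(1-p)}{d} = 1 - r_C$$
$$d=2:\ F_{\rm avg} = 0.99 + 0.005 = 0.995,\ r_C = 0.005 \qquad d=4:\ F_{\rm avg} = 0.99 + 0.0025 = 0.9925,\ r_C = 0.0075$$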

Surface code fault-tolerance requires $F_{\rm avg} \gtrsim 99\%$ per gate, corresponding to $r_C \lesssim 0.01$ per Clifford. Leading neutral atom systems achieve $r_C \sim 0.003{-}0.005$ for 2Q gates.

Interactive Standard RB Calculator

Adjust $r_C$ and SPAM amplitude to see how the RB decay curve changes. Compare 1Q and 2Q benchmarks.

[Interactive chart: RB Decay Curve, F(m) = A · pᵐ + B. Controls: r_C (shown at 0.0050, i.e. 0.50%) and A (shown at 0.950). Live RB simulation: each dot is one random Clifford circuit run; dots accumulate depth by depth, tracing the exponential decay.]
Reading the chart: The dashed $1/e \approx 0.368$ line marks the characteristic depth $N_{\rm useful} = 1/(2r_C)$, the depth at which the 1Q decay factor $p^m$ has fallen to $1/e$ and roughly the depth up to which a circuit maintains useful coherence. For the same $r_C$, the 2Q decay factor falls more slowly per Clifford, because $p_{\rm 2Q} = 1 - (4/3)r_C > p_{\rm 1Q} = 1 - 2r_C$, but the 2Q curve settles at the lower asymptote $B = 1/4$. The SPAM amplitude $A$ rescales the decaying part of the curve but does not change $r_C$.

Interleaved RB (IRB)

Standard RB gives the average error per Clifford. To measure the fidelity of a specific gate, use Interleaved RB.

Why IRB?

The standard RB decay constant $p_{\rm ref}$ reports an average over all 24 (or 11520) Clifford gates. But for building a quantum computer, you care about individual gates: how well does your $\text{CZ}$ gate perform? Your $X_{\pi/2}$? IRB isolates a single target gate $G$.

Protocol

1. Standard RB: Run regular RB with random Clifford sequences of depths $m = 1, 2, 5, \ldots$. Fit $F_{\rm ref}(m) = A_{\rm ref} \cdot p_{\rm ref}^m + B$ to get $p_{\rm ref}$.
2. Interleaved RB: Insert the target gate $G$ after every random Clifford: $\ldots C_j \cdot G \cdot C_{j-1} \cdot G \cdot C_{j-2} \cdot G \ldots$. The combined sequence is still $m$ pairs of (Clifford, target gate) followed by recovery. Fit $F_{\rm int}(m) = A_{\rm int} \cdot p_{\rm int}^m + B$ to get $p_{\rm int}$.
3. Extract gate error: The ratio $p_{\rm int}/p_{\rm ref}$ cancels the average Clifford error, leaving only the contribution of $G$:
$$r_{\rm gate} = \frac{d-1}{d}\left(1 - \frac{p_{\rm int}}{p_{\rm ref}}\right)$$
[Interactive chart: IRB, Reference vs Interleaved Decay Curves. Controls: r_C (shown at 0.0050) and r_gate (shown at 0.0050).]
IRB Systematic Errors: The formula $r_{\rm gate} = \frac{d-1}{d}(1 - p_{\rm int}/p_{\rm ref})$ is an estimate, not an exact value: the true error of $G$ is only guaranteed to lie within a systematic interval around it. If the noise of $G$ is highly coherent or gate-dependent, the extracted $r_{\rm gate}$ can deviate substantially from the true error, including underestimating it. The interval is narrow when the reference error $r_C$ is small compared with $r_{\rm gate}$.
Practical note: Ideally $p_{\rm int} \leq p_{\rm ref}$ (the interleaved gate can only add error), although fitted values can violate this within statistical noise. If $r_{\rm gate} \ll r_C$, the two curves are nearly identical and the ratio is sensitive to fitting noise. Use more sequences (100+) per depth in that regime.
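The extraction step is a one-line calculation once both decay constants are fitted. The sketch below uses illustrative 2Q numbers and assumes, purely for the sake of generating test values, that the reference and target errors are depolarizing and compose by multiplying their decay constants. The function names are hypothetical.

```python
def irb_gate_error(p_ref, p_int, d):
    """r_gate = (d - 1) / d * (1 - p_int / p_ref)."""
    if not 0 < p_ref <= 1:
        raise ValueError("p_ref must lie in (0, 1]")
    return (d - 1) / d * (1 - p_int / p_ref)


def decay_from_error(r, d):
    """Invert r = (d - 1)(1 - p) / d to get p."""
    return 1 - d * r / (d - 1)


# Illustrative 2Q numbers: reference error per Clifford and target-gate error.
d = 4
p_ref = decay_from_error(0.0050, d)
p_gate = decay_from_error(0.0050, d)
p_int = p_ref * p_gate  # assumption: depolarizing errors that simply compose

r_gate = irb_gate_error(p_ref, p_int, d)
print(f"p_ref = {p_ref:.5f}, p_int = {p_int:.5f}, r_gate = {r_gate:.5f}")
```

Under that composition assumption the formula returns the target-gate error that was put in; with coherent or gate-dependent noise the two can differ.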

Two-Qubit RB Specifics

Two-qubit benchmarking involves a much larger Clifford group and requires accounting for the structure of 2Q gate decompositions and cross-talk.

2Q Clifford Group Statistics

  • Group size: 11,520 elements (vs 24 for 1Q). Sampling is still efficient since random 2Q Cliffords can be generated from a small circuit gate set.
  • CZ gate count per 2Q Clifford: On average $\sim 1.5$ CZ gates + $\sim 8.4$ single-qubit gates (in $\{H, S, X_{\pi/2}\}$ basis).
  • Dimension: $d = 4$, so $B \to 1/4$ and the formula becomes $r_C = \frac{3(1-p)}{4}$.
  • Error budget: 2Q error per Clifford includes contributions from CZ gate fidelity AND single-qubit gates within the Clifford decomposition. To isolate CZ: use IRB with $G = \text{CZ}$.

Simultaneous RB (SimRB)

Run single-qubit RB on both qubits at the same time, so that each qubit is benchmarked while the other is being driven. Compare to individual 1Q RB on each qubit in isolation.

Crosstalk signature: If $r_C^{\rm sim} > r_C^{\rm iso}$, the difference $\Delta r_C = r_C^{\rm sim} - r_C^{\rm iso}$ quantifies crosstalk-induced error. In Rydberg arrays, crosstalk arises from: residual Rydberg admixture, off-resonant blockade from neighboring atoms, and laser spillover.

Parallel RB: Running many qubit pairs simultaneously (Evered 2023: up to 60 atoms in parallel) tests that gate fidelity doesn't degrade under realistic many-body load, which is crucial for demonstrating scalable performance.

Property | 1Q Standard RB | 2Q Standard RB
Clifford group size | 24 | 11,520
Hilbert space dim $d$ | 2 | 4
Asymptote $B$ | 0.5 | 0.25
$r_C$ from $p$ | $(1-p)/2$ | $3(1-p)/4$
Avg native gates / Clifford | ~1.875 | ~1.5 CZ + ~8.4 SQ
Typical $r_C$ neutral atoms | <0.001 | 0.003–0.007
What limits fidelity? | Laser phase noise, off-resonance, Doppler | Spontaneous emission during Rydberg excitation, Doppler, blockade errors
Generators of the 2Q Clifford Group: The 2Q Clifford group is generated by $\{H \otimes I,\, S \otimes I,\, I \otimes H,\, I \otimes S,\, \text{CNOT}\}$. In a Rydberg processor, you replace CNOT with CZ (related by single-qubit Hadamards). Random 2Q Cliffords can be synthesized as random circuits of depth $\sim$12–20 in this gate set.

Practical Considerations

How to actually run RB in the lab, choosing sequence counts, depths, shots, and understanding when the RB model breaks down.

Experimental Design Parameters

  • Number of random sequences per depth ($K$): 30–100 is typical. More sequences reduce variance in the fit. For high-fidelity systems ($r_C \lesssim 0.003$), use $K \gtrsim 75$ to resolve the small decay.
  • Number of depth values ($M$): 8–15 points, log-spaced up to $m_{\rm max} \sim 3/(1-p)$, where the decaying term $p^m$ has dropped to $\sim e^{-3} \approx 5\%$ of its initial value (for 1Q, $m_{\rm max} \sim 3/(2r_C)$). Too few depth points give an unreliable fit; too many waste time on fully decohered sequences.
  • Shots per circuit ($n_{\rm shots}$): 100–1000. More shots reduce shot noise. With atom-resolved imaging, even 200 shots per circuit is often sufficient.
  • Total shots: $K \times M \times n_{\rm shots} \sim 30 \times 10 \times 200 = 60{,}000$ for a modest run.
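A depth schedule and shot budget along these lines can be generated with a few lines of code. In this sketch `m_max` is an illustrative value to be chosen from the expected decay rate, and the helper names are hypothetical. Rounding to integers can merge neighbouring depths when `m_max` is small, which is why the result is deduplicated.

```python
def log_spaced_depths(m_max, n_points, m_min=1):
    """Roughly log-spaced integer depths from m_min to m_max, deduplicated."""
    ratio = (m_max / m_min) ** (1 / (n_points - 1))
    return sorted({round(m_min * ratio**k) for k in range(n_points)})


def total_shots(K, M, n_shots):
    """Total shot budget K * M * n_shots."""
    return K * M * n_shots


m_max = 300                       # illustrative maximum depth
depths = log_spaced_depths(m_max, n_points=10)
print(depths)
print(total_shots(K=30, M=len(depths), n_shots=200))
```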

Statistical Error on $r_C$

The statistical uncertainty on the fitted error per Clifford scales approximately as:

$\delta r_C \approx \frac{1}{A \cdot \sqrt{K \cdot M \cdot n_{\rm shots}}}$

With $A \approx 0.9$, $K=50$, $M=10$, $n_{\rm shots}=200$: $\delta r_C \approx \frac{1}{0.9\sqrt{100{,}000}} \approx 0.0035$, about $3.5 \times 10^{-3}$. This is the floor on how precisely you can report $r_C$. For sub-0.1% gate errors, you need $\gtrsim 10^6$ total shots.
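The scaling formula is easy to evaluate for any planned run, and to invert for the shot budget a target precision requires. A small sketch, treating the formula as the rule of thumb it is; the function names are hypothetical.

```python
import math


def delta_r_C(A, K, M, n_shots):
    """Rule-of-thumb uncertainty 1 / (A * sqrt(K * M * n_shots))."""
    return 1.0 / (A * math.sqrt(K * M * n_shots))


def shots_needed(A, target):
    """Total shots K * M * n_shots required to reach a target uncertainty."""
    return math.ceil(1.0 / (A * target) ** 2)


print(delta_r_C(A=0.9, K=50, M=10, n_shots=200))
print(shots_needed(A=0.9, target=1e-3))
```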

Bootstrapping: For confidence intervals, resample the set of sequences (with replacement) and refit many times. Report the 2.5th to 97.5th percentile range of the refitted $r_C$ as the 95% confidence interval.
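A minimal bootstrap sketch under simplifying assumptions: the data are synthetic (Gaussian scatter around an ideal 1Q decay with made-up parameters), $B$ is fixed at $1/d$, and the refit is a straight-line fit to $\ln(F - B)$ rather than a full nonlinear fit. The names `fit_p` and `bootstrap_r_C` are hypothetical.

```python
import math
import random


def fit_p(depths, mean_survival, B):
    """Slope of ln(F - B) versus m gives ln(p); assumes known B and F > B."""
    ys = [math.log(F - B) for F in mean_survival]
    n = len(depths)
    mx, my = sum(depths) / n, sum(ys) / n
    slope = (sum((m - mx) * (y - my) for m, y in zip(depths, ys))
             / sum((m - mx) ** 2 for m in depths))
    return math.exp(slope)


def bootstrap_r_C(depths, per_sequence, B, d, n_boot=500, seed=0):
    """Percentile bootstrap over sequences.

    per_sequence[i] holds the survival of each random sequence at depths[i].
    Returns approximate (2.5%, 97.5%) percentiles of the refitted r_C.
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        means = []
        for seqs in per_sequence:
            resampled = rng.choices(seqs, k=len(seqs))  # with replacement
            means.append(sum(resampled) / len(resampled))
        p = fit_p(depths, means, B)
        estimates.append((d - 1) * (1 - p) / d)
    estimates.sort()
    lo = estimates[int(0.025 * n_boot)]
    hi = estimates[int(0.975 * n_boot) - 1]
    return lo, hi


# Synthetic 1Q data: 50 sequences per depth with small Gaussian scatter.
rng = random.Random(1)
depths = [1, 5, 10, 20, 50, 100]
per_sequence = [
    [0.45 * 0.99**m + 0.5 + rng.gauss(0, 0.01) for _ in range(50)]
    for m in depths
]
low, high = bootstrap_r_C(depths, per_sequence, B=0.5, d=2)
print(f"95% interval for r_C: [{low:.5f}, {high:.5f}]")
```

Resampling whole sequences, rather than individual shots, keeps the sequence-to-sequence spread in the interval, which is usually the dominant source of variance.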

When Does RB Break Down?

  • Gate-dependent noise: If different Clifford gates have very different error rates, the twirl doesn't produce a clean single-exponential decay, and you may see multi-exponential behavior. Use character RB (Helsen et al.) to handle this.
  • Leakage out of the qubit subspace: If population leaks into $|2\rangle$, $|3\rangle$ (e.g., second Rydberg state, hyperfine levels), the decay model fails. Use leakage RB (Wood & Gambetta) which explicitly tracks leakage as an additional decay channel.
  • Non-Markovian noise: If noise has temporal correlations (e.g., slow laser frequency drifts, magnetic field fluctuations), different sequences in the same batch see different error rates. The ensemble average is still approximately valid but the fit variance increases.
  • Crosstalk for multi-qubit benchmarking: If benchmarking qubit $i$ while qubit $j$ is nearby and in a different state, the effective noise on qubit $i$ depends on qubit $j$'s state. Simultaneous RB explicitly captures this.
Neutral atom highlight (Evered et al., Nature 2023): Benchmarked up to 60 atoms in parallel, with the qubit pairs undergoing benchmarking sequences simultaneously. Result: $99.5\%$ 2Q CZ gate fidelity holding globally across the array, not just for isolated qubit pairs. This parallel RB approach is essential for proving that cross-atom interactions don't degrade performance at scale. The fact that the isolated-pair and simultaneous-array results agreed confirmed that crosstalk is negligible at typical array spacings ($\sim 3\,\mu\text{m}$ inter-site distance in the tweezer array).
Parameter | Conservative | Standard | High-precision
Sequences per depth $K$ | 20 | 50 | 100+
Depth points $M$ | 6 | 10 | 15
Shots per circuit | 100 | 200 | 1000
$\delta r_C$ (approx, $A \approx 0.9$) | ~0.010 | ~0.0035 | ~0.0009
Total shots | 12,000 | 100,000 | 1,500,000

Results in Neutral Atom Experiments

State-of-the-art RB benchmarks for neutral atom platforms (Rb, Cs, Yb tweezer arrays), with comparison to fault-tolerance thresholds.

Platform / Species | Gate | $r_C$ (per Clifford) | $\varepsilon_{\rm gate}$ (per native gate) | Notes / Paper
⁸⁷Rb tweezer (Harvard) | 2Q CZ (Rydberg) | 0.0050 | 0.25% | Evered et al., Nature 622, 268 (2023). Parallel RB on up to 60 atoms, 99.5% fidelity.
¹³³Cs tweezer (Caltech) | 1Q | <0.001 | <0.1% | Manetsch et al. (2025). 6100-atom array, 99.93% single-qubit fidelity.
¹⁷¹Yb tweezer (JILA) | 2Q CZ (Rydberg) | 0.0030 | 0.14% | Ma et al., PRX Quantum 6, 010332 (2025). Nuclear spin qubit, exceptionally low SPAM.
¹³³Cs tweezer (CsQ / various) | 2Q | ~0.010 | ~0.5% | Multiple groups; heavier atom, different blockade physics.
⁸⁷Rb tweezer (Lukin group) | 1Q + 2Q (Rydberg) | 0.005 (2Q) | 0.25% | Bluvstein et al., Nature 604, 451 (2022). Logical qubit experiments with RB characterization.
¹⁷¹Yb tweezer (Quera/MIT) | 1Q (nuclear spin) | 0.0002 | 0.01% | Scholl et al. / Barnes et al. 2022. Nuclear spin qubit with laser-free microwave drives.
Fault-tolerance target | | <0.001 | <0.1% | Comfortable margin below the ~1% surface code threshold (depolarizing noise model). Goal for fault-tolerant QC.
Progress timeline: Rydberg 2Q gate fidelity has improved from $\sim 97\%$ (2019) to $\sim 99.5\%$ (2023) to $\gtrsim 99.7\%$ (2025) in under 6 years. The rapid improvement has been driven by: better atom cooling (reducing Doppler error), shorter gate times (reducing spontaneous emission), and improved laser stability (phase noise).
Why Yb is promising: ¹⁷¹Yb has a nuclear spin qubit ($I=1/2$, zero electronic spin in ${}^1S_0$), giving $T_2 \sim$ seconds in a tweezer and nearly zero sensitivity to magnetic field noise. SPAM is also improved: the ${}^1S_0 \to {}^3P_0$ clock transition has near-unity detection fidelity via fluorescence imaging. These factors combine to push the single-qubit $r_C$ well below $10^{-3}$.