Randomized Benchmarking (RB)
A thorough interactive reference for RB, from the SPAM problem and Clifford twirling, through the decay model and IRB, to practical design choices and state-of-the-art neutral atom results. Includes working calculators for standard RB and interleaved RB.
Why Randomized Benchmarking? The SPAM Problem
Before RB existed, measuring gate fidelity seemed simple: prepare $|0\rangle$, apply gate $G$, measure. But this conflates three separate error sources that you cannot disentangle with a single experiment.
State Preparation and Measurement (SPAM) Errors
Every qubit experiment involves three steps: (1) prepare a known state, (2) apply your gate or circuit, (3) measure the output state. Steps 1 and 3 both have imperfections called state preparation errors and measurement errors, collectively SPAM. In neutral atom experiments, SPAM typically contributes 0.5–2% error per shot, often comparable to or larger than the gate error you're trying to measure.
If you apply gate $G$ once and measure, the observed success probability is roughly $P(|0\rangle) \approx F_{\rm gate} \cdot F_{\rm SPAM}$. You cannot extract $F_{\rm gate}$ without knowing $F_{\rm SPAM}$ independently. Worse, SPAM errors fluctuate with laser power drifts, alignment, and atom temperature, which makes them hard to characterize reliably.
The Key Insight: SPAM Errors Don't Scale with Circuit Depth
Randomized Benchmarking's crucial observation: apply a random sequence of $m$ Clifford gates followed by the recovery gate $C_r$ that inverts the whole sequence. The ideal output is always $|0\rangle$. As you increase $m$, gate errors accumulate, but SPAM errors are the same no matter how long the circuit is. This means SPAM appears as constants $A$ and $B$ in the fit model, while gate errors appear in the exponential decay $p^m$. You get a clean separation.
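A minimal sketch of why this separation works, assuming the ideal model $F(m) = Ap^m + B$ with no shot noise. For three equally spaced depths, the ratio of successive differences equals $p^{\Delta m}$, so $A$ and $B$ cancel. The function names and the two SPAM settings are illustrative, not taken from any experiment.

```python
def survival(m, p, A, B):
    """Ideal RB survival probability F(m) = A * p**m + B."""
    return A * p**m + B


def decay_from_three_depths(f1, f2, f3, step):
    """Recover p from survival at depths m, m + step, m + 2*step.

    (F3 - F2) / (F2 - F1) = p**step, so the SPAM constants A and B drop out.
    """
    return ((f3 - f2) / (f2 - f1)) ** (1.0 / step)


p_true = 0.99
depths = (10, 60, 110)  # equally spaced, step = 50

# Two hypothetical SPAM settings: same gates, different A and B.
for A, B in [(0.48, 0.50), (0.40, 0.52)]:
    f = [survival(m, p_true, A, B) for m in depths]
    p_est = decay_from_three_depths(*f, step=50)
    print(f"A={A}, B={B}: recovered p = {p_est:.6f}")
```

Both settings return the same $p$, which is the sense in which SPAM only shifts and rescales the curve without changing its decay rate.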
The Clifford Group
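RB uses random Clifford gates, not arbitrary unitaries. The choice is deliberate: the Clifford group is a unitary 2-design, so averaging a noise channel over it has the same effect as averaging over all unitaries, and this is what makes the RB decay model exact under gate-independent noise. Clifford circuits are also efficiently simulable classically, so the recovery gate is cheap to compute.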
Definition: Normalizer of the Pauli Group
The Clifford group $\mathcal{C}_n$ consists of all $n$-qubit unitaries $U$ that map Pauli operators to Pauli operators under conjugation:
$U \in \mathcal{C}_n \iff U P U^\dagger \in \mathcal{P}_n \text{ for all } P \in \mathcal{P}_n$
In other words, Clifford gates normalize the Pauli group: conjugating any Pauli by a Clifford gives another Pauli (up to a phase). Paulis get "shuffled around" but never become non-Pauli.
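The definition can be checked numerically for the single-qubit generators. The sketch below uses plain nested lists for 2×2 matrices; the helper names are illustrative. It conjugates each Pauli by $H$ and $S$, and by the non-Clifford $T$ gate for contrast.

```python
import cmath


def matmul(a, b):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def dagger(a):
    """Conjugate transpose of a 2x2 matrix."""
    return [[a[j][i].conjugate() for j in range(2)] for i in range(2)]


def conjugate_by(u, p):
    """Return U P U^dagger."""
    return matmul(matmul(u, p), dagger(u))


PAULIS = {
    "X": [[0, 1], [1, 0]],
    "Y": [[0, -1j], [1j, 0]],
    "Z": [[1, 0], [0, -1]],
}


def identify_pauli(m, tol=1e-9):
    """Return a label such as '+Z' or '-Y' if m is a signed Pauli, else None."""
    for name, p in PAULIS.items():
        for sign, label in ((1, "+"), (-1, "-")):
            if all(abs(m[i][j] - sign * p[i][j]) < tol
                   for i in range(2) for j in range(2)):
                return label + name
    return None


s = 2 ** -0.5
H = [[s, s], [s, -s]]
S = [[1, 0], [0, 1j]]
T = [[1, 0], [0, cmath.exp(1j * cmath.pi / 4)]]  # not a Clifford gate

for gate_name, gate in (("H", H), ("S", S), ("T", T)):
    for pauli_name, pauli in PAULIS.items():
        image = identify_pauli(conjugate_by(gate, pauli))
        print(f"{gate_name} {pauli_name} {gate_name}^dagger -> {image}")
```

$H$ and $S$ always return a signed Pauli, while $T$ returns `None` for $X$ and $Y$ because it maps them to superpositions of Paulis.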
Group Sizes
- 1-qubit: 24 elements, generated by $H$ and $S$, or equivalently $X_{\pi/2}$ and $Z_{\pi/2}$. These correspond to the 24 orientation-preserving symmetries of a cube (the chiral octahedral group $O$).
- 2-qubit: 11,520 elements, generated by 1Q Cliffords plus CNOT (or CZ).
- n-qubit: grows as $\sim 2^{O(n^2)}$, exponentially large, but still efficiently simulable via the Gottesman-Knill theorem (since Cliffords map stabilizer states to stabilizer states).
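The group sizes above follow from the standard counting formula $|\mathcal{C}_n| = 2^{n^2+2n}\prod_{j=1}^{n}(4^j - 1)$, which counts Cliffords up to global phase. A short check (the function name is illustrative):

```python
def clifford_group_size(n):
    """Number of n-qubit Clifford operations, ignoring global phase."""
    size = 2 ** (n * n + 2 * n)
    for j in range(1, n + 1):
        size *= 4 ** j - 1
    return size


for n in (1, 2, 3):
    print(n, clifford_group_size(n))
# 1 24
# 2 11520
# 3 92897280
```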
Why Cliffords? The Twirling Argument
The key insight behind RB is Clifford twirling. Given any fixed noise channel $\Lambda$ (coherent or incoherent, unital or not, as long as it is the same trace-preserving map for every gate), averaging over the Clifford group converts it into a depolarizing channel:
$\frac{1}{|\mathcal{C}_n|} \sum_{C \in \mathcal{C}_n} C^\dagger \Lambda(C \rho C^\dagger) C = p\rho + (1-p)\frac{I}{d}$
This is the twirl. It means that whatever form your noise takes (coherent Z rotations, amplitude damping, cross-talk, laser phase noise), after Clifford twirling it looks like depolarizing noise parameterized by a single number $p$. The RB decay $F(m) = Ap^m + B$ is therefore exact (not approximate) for any noise that is gate-independent and time-independent (Markovian). What happens when these assumptions fail is covered under "When Does RB Break Down?".
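A numerical illustration of the twirl for one qubit, working in the Bloch-vector picture. This is a sketch under two stated assumptions: the error is unital, so it acts on the Bloch vector as a 3×3 matrix $R$, and each 1Q Clifford acts as one of the 24 proper rotations of the cube (signed permutation matrices with determinant +1). The example error, a coherent Z over-rotation by a hypothetical angle `theta`, is chosen for illustration.

```python
import math
from itertools import permutations, product


def det3(m):
    """Determinant of a 3x3 matrix."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def matmul3(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def transpose3(a):
    return [[a[j][i] for j in range(3)] for i in range(3)]


def clifford_rotations():
    """The 24 proper cube rotations: signed permutations with det = +1."""
    rotations = []
    for perm in permutations(range(3)):
        for signs in product((1, -1), repeat=3):
            m = [[signs[i] if perm[i] == j else 0 for j in range(3)]
                 for i in range(3)]
            if det3(m) == 1:
                rotations.append(m)
    return rotations


def twirl(R):
    """Average C^T R C over the 24 Clifford rotations."""
    group = clifford_rotations()
    acc = [[0.0] * 3 for _ in range(3)]
    for C in group:
        term = matmul3(matmul3(transpose3(C), R), C)
        for i in range(3):
            for j in range(3):
                acc[i][j] += term[i][j] / len(group)
    return acc


theta = 0.1  # hypothetical coherent over-rotation about Z, in radians
c, s = math.cos(theta), math.sin(theta)
R_error = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

twirled = twirl(R_error)
p = (1 + 2 * c) / 3  # trace(R) / 3
for row in twirled:
    print([round(x, 6) for x in row])
print("p =", round(p, 6), " r_C =", round((1 - p) / 2, 6))
```

The twirled matrix is $p$ times the identity: the off-diagonal (coherent) terms average away and only the depolarizing parameter $p = \mathrm{Tr}(R)/3$ survives.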
1-Qubit Clifford Decompositions in {$X_{\pi/2}$, $Z_{\pi/2}$} Basis
Modern neutral atom hardware implements 1Q gates natively as $X_{\pi/2}$ and $Z_{\pi/2}$ pulses (virtual Z via phase shift). Every 1Q Clifford can be decomposed into at most 3 such pulses, giving an average of $\approx 1.875$ native gates per Clifford.
| Gate / Axis | Native Gate Sequence | # Native Gates | Comment |
|---|---|---|---|
| Identity $I$ | — (none) | 0 | Virtual only |
| $X_\pi$ (X gate) | $X_{\pi/2} \cdot X_{\pi/2}$ | 2 | Two half-X pulses |
| $Y_\pi$ (Y gate) | $Z_{\pi/2} \cdot X_\pi \cdot Z_{-\pi/2}$ | 2 + 2 virtual | $X_\pi$ in a rotated frame |
| $Z_\pi$ (Z gate) | $Z_{\pi/2} \cdot Z_{\pi/2}$ | 0 (virtual) | Phase update |
| $H$ (Hadamard) | $Z_{\pi/2} \cdot X_{\pi/2} \cdot Z_{\pi/2}$ | 1 + 2 virtual | 1 physical pulse |
| $S$ ($Z_{\pi/2}$) | $Z_{\pi/2}$ | 0 (virtual) | Frame rotation |
| $T_3$ (120° rotations about cube body diagonals) | 1–3 pulse combos | 1–3 | 8 elements, avg 2 |
| $T_6$ (180° rotations about edge-midpoint axes) | Combinations of $X_{\pi/2}$ and $Z_{\pi/2}$ | 1–2 | 6 elements |
| Avg over all 24 | — | 1.875 | Theoretical average |
The RB Decay Model
From first principles: why the survival probability decays exponentially, what the fit parameters mean, and how to extract gate error and useful circuit depth.
Derivation: From Twirling to Exponential Decay
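A short sketch of this step, assuming gate-independent, time-independent noise. The symbols $\rho$ (the state actually prepared), $E$ (the measurement operator actually implemented), and $\Lambda_{\rm dep}$ (the twirled noise channel) are introduced here for illustration.

After averaging over random sequences, each of the $m$ Cliffords contributes one twirled copy of the noise, which is a depolarizing channel with parameter $p$. Composing $m$ copies gives:

$\Lambda_{\rm dep}^{m}(\rho) = p^m \rho + (1 - p^m)\frac{I}{d}$

The average survival probability is then:

$F(m) = \mathrm{Tr}\left[E\,\Lambda_{\rm dep}^{m}(\rho)\right] = \underbrace{\mathrm{Tr}\left[E\left(\rho - \frac{I}{d}\right)\right]}_{A}\, p^m + \underbrace{\frac{\mathrm{Tr}[E]}{d}}_{B}$

SPAM enters only through $A$ and $B$, while the gate noise enters only through $p$. With perfect SPAM ($\rho = E = |0\rangle\langle 0|$), this reduces to $A = 1 - 1/d$ and $B = 1/d$.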
Physical Meaning of Parameters
- $p$ (decay constant): Proximity to identity channel. $p=1$ means perfect gates, $p=0$ means fully depolarizing. Extracted directly from exponential fit.
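- $A$ (amplitude): Absorbs SPAM. With perfect SPAM, $A = 1 - 1/d$ (1/2 for 1Q, 3/4 for 2Q), and SPAM errors reduce it. With ~5% SPAM in neutral atom experiments, the zero-depth survival $A + B$ is typically $\lesssim 0.95$. Does not affect $r_C$.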
- $B$ (asymptote): Long-depth limit where the state is fully mixed. For 1Q: $B \to 1/2$ (chance). For 2Q: $B \to 1/4$. In practice, $B$ is a fit parameter that captures imperfect long-depth behavior.
- $r_C$ (error per Clifford): The quantity you report and the standard benchmark for comparing platforms. Computed from the fit as $r_C = (d-1)(1-p)/d$. It equals the infidelity per Clifford averaged uniformly over the Clifford group.
Average Gate Fidelity
The average gate fidelity of a noisy channel $\tilde{C}$ relative to ideal $C$ is:
$F_{\rm avg}(C, \tilde{C}) = \int_\psi \langle\psi|C^\dagger \tilde{C}(|\psi\rangle\langle\psi|) C|\psi\rangle \, d\psi$
Under the depolarizing model, $F_{\rm avg} = p + (1-p)/d = 1 - r_C$. For a 1Q gate ($d = 2$): $F_{\rm avg} = p + (1-p)/2$. For a native gate: $F_{\rm avg,gate} \approx 1 - \varepsilon_{\rm gate} = 1 - r_C/\bar{n}$, where $\bar{n}$ is the average number of native gates per Clifford.
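These conversions are easy to get wrong by a factor of $d$, so here is a small sketch of them. The function names are illustrative, and the default $\bar{n} = 1.875$ is the 1Q average quoted earlier.

```python
def error_per_clifford(p, d=2):
    """r_C = (d - 1)(1 - p) / d."""
    return (d - 1) * (1 - p) / d


def average_fidelity(p, d=2):
    """F_avg = p + (1 - p) / d, equal to 1 - r_C."""
    return p + (1 - p) / d


def error_per_native_gate(r_c, gates_per_clifford=1.875):
    """First-order estimate: eps_gate = r_C / n_bar."""
    return r_c / gates_per_clifford


p = 0.998  # hypothetical fitted 1Q decay constant
r_c = error_per_clifford(p)
print(f"r_C      = {r_c:.6f}")
print(f"F_avg    = {average_fidelity(p):.6f}")
print(f"eps_gate = {error_per_native_gate(r_c):.6f}")
```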
Surface code fault-tolerance requires $F_{\rm avg} \gtrsim 99\%$ per gate, corresponding to $r_C \lesssim 0.01$ per Clifford. Leading neutral atom systems achieve $r_C \sim 0.003{-}0.005$ for 2Q gates.
Interactive Standard RB Calculator
Adjust $r_C$ and SPAM amplitude to see how the RB decay curve changes. Compare 1Q and 2Q benchmarks.
[Figure: RB decay curve, $F(m) = A \cdot p^m + B$]
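A minimal offline stand-in for the calculator, assuming the ideal model $F(m) = Ap^m + B$ with $B = 1/d$ unless specified. The function names and the example values of $r_C$ and $A$ are illustrative. The $1/e$ depth is one simple measure of useful circuit depth.

```python
import math


def decay_constant(r_c, d=2):
    """Invert r_C = (d - 1)(1 - p) / d to get p."""
    return 1 - d * r_c / (d - 1)


def rb_survival(m, r_c, A, d=2, B=None):
    """Survival probability F(m) = A * p**m + B, with B = 1/d by default."""
    if B is None:
        B = 1 / d
    return A * decay_constant(r_c, d) ** m + B


def one_over_e_depth(r_c, d=2):
    """Depth at which the decaying term A * p**m falls to A / e."""
    return -1 / math.log(decay_constant(r_c, d))


# Hypothetical 1Q and 2Q settings (r_C, dimension, SPAM-reduced amplitude).
for label, r_c, d, A in [("1Q", 0.001, 2, 0.45), ("2Q", 0.005, 4, 0.70)]:
    print(f"{label}: p = {decay_constant(r_c, d):.5f}, "
          f"1/e depth = {one_over_e_depth(r_c, d):.0f} Cliffords")
    for m in (1, 10, 100, 1000):
        print(f"   m = {m:5d}  F = {rb_survival(m, r_c, A, d):.4f}")
```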
Interleaved RB (IRB)
Standard RB gives the average error per Clifford. To measure the fidelity of a specific gate, use Interleaved RB.
Why IRB?
The standard RB decay constant $p_{\rm ref}$ reports an average over all 24 (or 11,520) Clifford gates. But for building a quantum computer, you care about individual gates: how well does your $\text{CZ}$ gate perform? Your $X_{\pi/2}$? IRB isolates a single target gate $G$.
Protocol
[Figure: IRB, reference vs. interleaved decay curves]
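In outline, IRB runs two experiments: standard RB, giving $p_{\rm ref}$, and a second run with the target gate $G$ inserted after every random Clifford, giving an interleaved decay constant (written $p_{\rm int}$ here, a symbol introduced for this sketch). The usual point estimate of the target gate error is $r_G = (d-1)(1 - p_{\rm int}/p_{\rm ref})/d$. The sketch below implements only this point estimate, not the systematic bounds that accompany it.

```python
def irb_gate_error(p_ref, p_int, d=2):
    """Point estimate r_G = (d - 1)(1 - p_int / p_ref) / d."""
    return (d - 1) * (1 - p_int / p_ref) / d


# Hypothetical 2Q example: reference decay 0.980, interleaved decay 0.976.
r_cz = irb_gate_error(p_ref=0.980, p_int=0.976, d=4)
print(f"estimated CZ error: {r_cz:.5f}")
```

The ratio $p_{\rm int}/p_{\rm ref}$ plays the role of the decay constant of $G$ alone, which is why identical curves give $r_G = 0$.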
Two-Qubit RB Specifics
Two-qubit benchmarking involves a much larger Clifford group and requires accounting for the structure of 2Q gate decompositions and cross-talk.
2Q Clifford Group Statistics
- Group size: 11,520 elements (vs 24 for 1Q). Sampling is still efficient because a random 2Q Clifford can be drawn and compiled from a small generating gate set without enumerating the group.
- CZ gate count per 2Q Clifford: On average $\sim 1.5$ CZ gates + $\sim 8.4$ single-qubit gates (in $\{H, S, X_{\pi/2}\}$ basis).
- Dimension: $d = 4$, so $B \to 1/4$ and the formula becomes $r_C = \frac{3(1-p)}{4}$.
- Error budget: 2Q error per Clifford includes contributions from CZ gate fidelity AND single-qubit gates within the Clifford decomposition. To isolate CZ: use IRB with $G = \text{CZ}$.
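The average of 1.5 entangling gates can be reproduced from the standard decomposition of the 2Q Clifford group into four classes, and a first-order error budget gives a rough CZ error when IRB data is not available. The additive model $r_C \approx 1.5\,\varepsilon_{\rm CZ} + 8.4\,\varepsilon_{\rm 1Q}$ is an approximation assumed here, valid only for small errors, and the input numbers are hypothetical.

```python
# Class name: (number of Cliffords in the class, entangling gates per Clifford)
CLASSES = {
    "single-qubit": (576, 0),
    "CNOT-like": (5184, 1),
    "iSWAP-like": (5184, 2),
    "SWAP-like": (576, 3),
}

total = sum(count for count, _ in CLASSES.values())
avg_cz = sum(count * n_cz for count, n_cz in CLASSES.values()) / total
print("group size:", total)            # 11520
print("average CZ count:", avg_cz)     # 1.5


def cz_error_from_budget(r_c, eps_1q, n_cz=1.5, n_1q=8.4):
    """Solve r_C = n_cz * eps_cz + n_1q * eps_1q for eps_cz (first order)."""
    return (r_c - n_1q * eps_1q) / n_cz


# Hypothetical inputs: 2Q r_C of 0.75% and 1Q native gate error of 0.02%.
print(f"eps_CZ ~ {cz_error_from_budget(0.0075, 0.0002):.5f}")
```

IRB with $G = \text{CZ}$ remains the more direct measurement; the budget is useful mainly as a consistency check.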
Simultaneous RB (SimRB)
Run single-qubit RB on both qubits at the same time, so that each qubit is benchmarked while its neighbor is being driven. Compare to individual 1Q RB on each qubit in isolation.
Crosstalk signature: If $r_C^{\rm sim} > r_C^{\rm iso}$, the difference $\Delta r_C = r_C^{\rm sim} - r_C^{\rm iso}$ quantifies crosstalk-induced error. In Rydberg arrays, crosstalk arises from: residual Rydberg admixture, off-resonant blockade from neighboring atoms, and laser spillover.
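When reporting $\Delta r_C$, it helps to carry the fit uncertainties along, since the difference of two small numbers is easily dominated by noise. A minimal sketch, assuming the two fit uncertainties are independent (the function name and the example values are illustrative):

```python
import math


def crosstalk_error(r_sim, r_iso, sigma_sim=0.0, sigma_iso=0.0):
    """Return (delta_r_C, uncertainty), adding uncertainties in quadrature."""
    delta = r_sim - r_iso
    sigma = math.hypot(sigma_sim, sigma_iso)
    return delta, sigma


delta, sigma = crosstalk_error(0.0012, 0.0008, 0.0001, 0.0001)
print(f"delta r_C = {delta:.5f} +/- {sigma:.5f}")
```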
Parallel RB: Running gates on many atoms simultaneously (Evered et al. 2023: up to 60 atoms in parallel) tests that gate fidelity doesn't degrade under realistic many-body load, which is crucial for demonstrating scalable performance.
| Property | 1Q Standard RB | 2Q Standard RB |
|---|---|---|
| Clifford group size | 24 | 11,520 |
| Hilbert space dim $d$ | 2 | 4 |
| Asymptote $B$ | 0.5 | 0.25 |
| $r_C$ from $p$ | $(1-p)/2$ | $3(1-p)/4$ |
| Avg native gates / Clifford | ~1.875 | ~1.5 CZ + ~8.4 SQ |
| Typical $r_C$ neutral atoms | <0.001 | 0.003–0.007 |
| What limits fidelity? | Laser phase noise, off-resonance, Doppler | Spontaneous emission during Rydberg excitation, Doppler, blockade errors |
Practical Considerations
How to actually run RB in the lab: choosing sequence counts, depths, and shots, and understanding when the RB model breaks down.
Experimental Design Parameters
- Number of random sequences per depth ($K$): 30–100 is typical. More sequences reduce variance in the fit. For high-fidelity systems ($r_C \lesssim 0.003$), use $K \gtrsim 75$ to resolve the small decay.
- Number of depth values ($M$): 8–15 points, log-spaced up to $m_{\rm max} \sim 3/(1-p)$ (where the decaying term $A p^m$ has dropped to $\sim e^{-3} \approx 5\%$ of its initial value). Too few depth points → unreliable fit. Too many → wasted time on fully decohered sequences.
- Shots per circuit ($n_{\rm shots}$): 100–1000. More shots reduce shot noise. With atom-resolved imaging, even 200 shots per circuit is often sufficient.
- Total shots: $K \times M \times n_{\rm shots}$, for example $30 \times 10 \times 200 = 60{,}000$ for a modest run.
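A small sketch for turning these choices into a concrete schedule. It takes the maximum depth as an input rather than deriving it, and both function names are illustrative. Rounding to integers can merge neighboring points at small depths, so the list may come out shorter than requested.

```python
def log_spaced_depths(m_max, n_points=10):
    """Integer sequence depths, log-spaced from 1 to m_max (duplicates merged)."""
    ratio = m_max ** (1 / (n_points - 1))
    return sorted({max(1, round(ratio ** i)) for i in range(n_points)})


def total_shots(K, M, n_shots):
    """Total number of shots for K sequences at each of M depths."""
    return K * M * n_shots


depths = log_spaced_depths(m_max=300, n_points=10)
print("depths:", depths)
print("total shots:", total_shots(K=30, M=len(depths), n_shots=200))
```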
Statistical Error on $r_C$
The statistical uncertainty on the fitted error per Clifford scales approximately as:
$\delta r_C \approx \frac{1}{A \cdot \sqrt{K \cdot M \cdot n_{\rm shots}}}$
With $A \approx 0.45$ (a 1Q value, since $A + B \le 1$ and $B \approx 1/2$), $K=50$, $M=10$, $n_{\rm shots}=200$: $\delta r_C \approx \frac{1}{0.45\sqrt{100{,}000}} \approx 0.0070$, about $7 \times 10^{-3}$. Treat this as a rough, conservative estimate of how precisely you can report $r_C$. By this estimate, resolving sub-0.1% gate errors takes $\gtrsim 5 \times 10^6$ total shots.
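The scaling estimate can be evaluated directly, and inverted to get the shot budget for a target precision. This sketch takes the formula above at face value. The function names are illustrative, and $A = 0.45$ is an assumed 1Q amplitude.

```python
import math


def delta_r_c(A, K, M, n_shots):
    """Rough uncertainty estimate: 1 / (A * sqrt(K * M * n_shots))."""
    return 1 / (A * math.sqrt(K * M * n_shots))


def shots_needed(A, target):
    """Total shots K * M * n_shots required to reach a target delta r_C."""
    return math.ceil(1 / (A * target) ** 2)


# A = 0.45 is an assumed 1Q amplitude (A <= 1/2 when B is near 1/2).
print(f"delta r_C ~ {delta_r_c(0.45, K=50, M=10, n_shots=200):.4f}")
print(f"shots for delta r_C = 0.001: {shots_needed(0.45, 0.001):,}")
```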
Bootstrapping: For confidence intervals, resample the set of sequences (with replacement) and refit many times. Report the central 95% interval of the refitted $r_C$ values (2.5th to 97.5th percentile).
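A minimal bootstrap sketch on synthetic data. To stay dependency-free it uses a simplified fit, not a full three-parameter fit: $B$ is fixed at $1/d$ and $p$ comes from a straight-line fit to $\ln(F - B)$ versus $m$. The synthetic data, noise level, and function names are all assumptions made for illustration.

```python
import math
import random


def fit_r_c(depths, mean_survival, d=2):
    """Log-linear fit of F(m) - B = A * p**m with B fixed at 1/d."""
    B = 1 / d
    xs = list(depths)
    ys = [math.log(f - B) for f in mean_survival]
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    p = math.exp(slope)  # slope of the line is ln(p)
    return (d - 1) * (1 - p) / d


def bootstrap_interval(depths, data, n_boot=500, d=2, seed=0):
    """data[i] holds per-sequence survival probabilities at depths[i]."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        means = []
        for seqs in data:
            # Resample the sequences at this depth with replacement.
            resampled = rng.choices(seqs, k=len(seqs))
            means.append(sum(resampled) / len(resampled))
        estimates.append(fit_r_c(depths, means, d))
    estimates.sort()
    return estimates[int(0.025 * n_boot)], estimates[int(0.975 * n_boot)]


# Synthetic 1Q data: p = 0.99 (r_C = 0.005), A = 0.45, K = 50 sequences/depth.
gen = random.Random(1)
depths = [1, 5, 10, 25, 50, 100, 150]
data = [[0.45 * 0.99 ** m + 0.5 + gen.gauss(0, 0.02) for _ in range(50)]
        for m in depths]

lo, hi = bootstrap_interval(depths, data)
print(f"95% interval for r_C: [{lo:.5f}, {hi:.5f}]")
```

With real data you would replace `fit_r_c` with your full fit of $A$, $p$, and $B$; the resampling loop stays the same.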
When Does RB Break Down?
- Gate-dependent noise: If different Clifford gates have very different error rates, the twirl doesn't produce a clean single-exponential decay, and you may see multi-exponential behavior. Use character RB (Helsen et al.) to handle this.
- Leakage out of the qubit subspace: If population leaks into $|2\rangle$, $|3\rangle$ (e.g., second Rydberg state, hyperfine levels), the decay model fails. Use leakage RB (Wood & Gambetta) which explicitly tracks leakage as an additional decay channel.
- Non-Markovian noise: If noise has temporal correlations (e.g., slow laser frequency drifts, magnetic field fluctuations), different sequences in the same batch see different error rates. The ensemble average is still approximately valid but the fit variance increases.
- Crosstalk for multi-qubit benchmarking: If benchmarking qubit $i$ while qubit $j$ is nearby and in a different state, the effective noise on qubit $i$ depends on qubit $j$'s state. Simultaneous RB explicitly captures this.
| Parameter | Conservative | Standard | High-precision |
|---|---|---|---|
| Sequences per depth $K$ | 20 | 50 | 100+ |
| Depth points $M$ | 6 | 10 | 15 |
| Shots per circuit | 100 | 200 | 1000 |
| $\delta r_C$ (approx, $A \approx 0.45$) | ~0.02 | ~0.007 | ~0.002 |
| Total shots | 12,000 | 100,000 | 1,500,000 |
Results in Neutral Atom Experiments
State-of-the-art RB benchmarks for neutral atom platforms (Rb, Cs, Yb tweezer arrays), with comparison to fault-tolerance thresholds.
| Platform / Species | Gate | $r_C$ (per Clifford) | $\varepsilon_{\rm gate}$ (per native gate) | Notes / Paper |
|---|---|---|---|---|
| ⁸⁷Rb tweezer (Harvard) | 2Q CZ (Rydberg) | 0.0050 | 0.25% | Evered et al., Nature 622, 268 (2023). Gates on up to 60 atoms in parallel, 99.5% fidelity. |
| ¹³³Cs tweezer (Caltech) | 1Q | <0.001 | <0.1% | Manetsch et al. (2025). 6100-atom array, 99.93% single-qubit fidelity. |
| ¹⁷¹Yb tweezer (JILA) | 2Q CZ (Rydberg) | 0.0030 | 0.14% | Ma et al., PRX Quantum 6, 010332 (2025). Nuclear spin qubit, exceptionally low SPAM. |
| ¹³³Cs tweezer (CsQ / various) | 2Q | ~0.010 | ~0.5% | Multiple groups; heavier atom, different blockade physics. |
| ⁸⁷Rb tweezer (Lukin group) | 1Q + 2Q (Rydberg) | 0.005 (2Q) | 0.25% | Bluvstein et al., Nature 604, 451 (2022). Logical qubit experiments with RB characterization. |
| ¹⁷¹Yb tweezer (Quera/MIT) | 1Q (nuclear spin) | 0.0002 | 0.01% | Scholl et al. / Barnes et al. 2022. Nuclear spin qubit with laser-free microwave drives. |
| Fault-tolerance target | n/a | <0.001 | <0.1% | Practical target, about 10× below the ~1% surface-code threshold (depolarizing noise model). Goal for low-overhead fault-tolerant QC. |