Transcript | Benchmarking Quantum Computers with Super.tech

Quantum volume. Algorithmic qubits. Qubit counts in general. What do they really tell us about the usability and power of a quantum computer? As we learned from classical computing, only benchmark software can really compare how systems perform different tasks. Super.tech is back to discuss their SupermarQ benchmark suite that gives us real performance data for quantum hardware.

Guest Speakers: Pranav Gokhale and Fred Chong from Super.tech.

Konstantinos
Quantum volume, algorithmic qubits, qubit counts in general — what do they really tell us about the usability and power of a quantum computer? As we learn from classical computing, only running benchmark software can reveal a system's true capabilities at various tasks. Super.tech is back to discuss their suite that gives us real performance data for quantum hardware. Find out more about SupermarQ in this episode of The Post-Quantum World. I'm your host, Konstantinos Karagiannis. I lead Quantum Computing Services at Protiviti, where we're helping companies prepare for the benefits and threats of this exploding field. I hope you'll join each episode as we explore the technology and business impacts of this post-quantum era. Our guests today are my first repeat guests. They were on separately, so I thought I'd bring them on together, because they're from the same company. They're the cofounders of Super.tech: one is also CEO, and that's Pranav Gokhale, and the other is chief scientist, and that's Fred Chong. Welcome back, both of you.

Pranav
Thanks for having us.

Fred
Thank you. It's great to be back.

Konstantinos
It's not that there's nothing else going on in the world of quantum, and that's not why I'm having you back on, but since you've been here, your entire paradigm of what you bring to this industry has changed a bit. You hinted at it, Pranav, when you were on last time — this idea of benchmarking. We're going to talk a lot about that today and introduce the world to SupermarQ officially. Of course, there's a Q at the end, because it's a quantum product and you're legally required to have a Q in there somewhere. What gave you the idea to go down this path of having SupermarQ?

Fred
The idea came from the fact that benchmarking has been a very important and rich tradition in classical computing, which is my background, and quantum machines have evolved to a point where they could be running much more substantial code benchmarks. We thought we should leverage this tradition of classical benchmarking and that methodology to come up with a suite of quantum benchmarks that is both balanced and principled, motivated by real applications and designed to work both for today's machines and tomorrow's larger, more reliable machines. Then, finally, it is broken down into this thing we'll talk about today, which is these features, like a fingerprint or DNA of each application, that tries to predict how it will run on different machines.

Konstantinos
It does remind me of the old benchmarks used in gaming systems and things like that — you would have each area — but in your case, you're doing things like finance or whatever, and the types of algorithms. We'll get into all that.
But going back to that initial idea, this is to correct what has been a problem in the industry — this idea of identifying what your machine is by, let's say, qubits, which could vary in quality and power, and their ability to stay coherent, etc. Then, this idea of artificial metrics, like quantum volume or algorithmic qubits, and what does that mean? Are you starting to see any companies using this and giving you feedback and saying that they want to mention the scores? Are you already seeing that kind of traction?

Pranav
We just launched it last week, so it's maybe a little early to declare victory yet, but —

Konstantinos
I didn't know if anyone was behind the scenes using it — that kind of thing.

Pranav
You're right. Absolutely. We have been beta testing. In fact, you were one of our first listeners in that P33 Chicago-area quantum consortium, and we have received a lot of input, including from investors. Fred brought up this idea of Super.tech initiating this benchmarking effort, and it's maybe not, at first glance, obvious that a startup works on benchmarking, but as we thought about it, even if this is not a directly revenue-generating product, we think it's an important service for the quantum industry. Hopefully, it'll direct investor and customer dollars to the right hardware platforms. Secondarily, for us, it's about thought leadership and showing customers that we can help guide them to the right quantum hardware, and that has started to take off. Again, it's only been a week since we've been public, but in this last week, we've received inbound emails from people in the finance industry and the energy industry who want to start to get their toes wet in quantum, as well as investors who have seen all this noise that's being made in quantum — pun not intended — and are thinking about putting in a million dollars of their billion-dollar fund and want to know where they should get started.

Konstantinos
That's great. I would like to see something like this as a living source of information that people can get, because not everyone's going to run these benchmarks, but more like, "Currently, these IBM machines score like this, and this is how many qubits there are." Do you envision having access to something like that all the time — a constantly updated real-time source of those scores?

Pranav
Absolutely. In some ways, that's the beauty of us as a company partnering with a research institution like EPiQC, which Fred runs. Generally, in academia, you do the paper once, and then it's done. What we've done in a bigger-picture sense is two things: Number one is, we've made the results of SupermarQ and the infrastructure for running it open-source. It's on Github.com/SupertechLabs/SupermarQ. The second is a commitment to keep doing this on a recurring basis. The infrastructure is our own software product, SuperstaQ, which connects to a lot of different hardware platforms, and it is going to be important that as the industry evolves, like you said, we're not benchmarking just IBM's quantum hardware from five years ago, but what they just announced today or tomorrow. We have very concrete plans this summer to benchmark the next set of hardware, and we see this as an ongoing story, as opposed to a one-and-done.

Fred
One of the things that the Super.tech infrastructure provides is this ability to run on multiple platforms and optimise on multiple platforms. So, it's a very automated way to get a very good set of benchmark results on available platforms out there.
This benchmarking effort is synergistic with Super.tech's effort to provide a software infrastructure that can optimise for a broad base of platforms.

Pranav
Konstantinos, zooming back to your question a bit, I'm reminded of how every time the new iPhone or the new MacBook comes out, there's a claim from Apple that this is x times better at your favorite game or Fruit Ninja or Final Cut Pro. Then, to some extent, there's, "Take Apple at its word," but then the next day, the next week, the media stories are all from Tom's Hardware actually running the benchmarks on the hardware and reporting to consumers, "Does this actually work? Is your battery life actually going to be this much longer?" That is the level of rigor that we aspire for the quantum hardware industry to have. I know it's not going to happen overnight, but it's the vision for where we should go.

Konstantinos
Yes. It's going to be super useful to end users or companies like mine, where we're helping the end users run these algorithms and use cases, because one of the big questions we have to decide the answer to is, what machine do we pick? If we're doing a binary classification problem, do we run it on IonQ? Do we run it on Honeywell? Where do we spend the money on those shots? Sometimes, you don't get to run it on every single one. Obviously, that wouldn't be a very smart way to go about it. So, it would be helpful to be able to look at a glance and say, "Last time we ran this algorithm, we got these results, but now, looking at these numbers, it looks like we should have run it on this one, and then let's try it on H1" — or whatever — "this time." That's why I like that there's going to be a breakdown by type and by algorithm. Did you want to talk about the nitty-gritty of which tests are being run, and which algorithms?

Pranav
Sure. I'll talk about the application side, what we're running, and then Fred can talk about how these applications are diverse and stress different pieces of hardware. We did try to look at a wide range of applications. These days, in quantum, you can't have a full suite of tests unless you have things like QAOA — the Quantum Approximate Optimisation Algorithm. I know, Konstantinos, you and I discussed this earlier. It can be used for financial applications, or for looking at how to optimise a route for a shipping or logistics company. So, that's one domain, and another, in the chemistry domain, is the Variational Quantum Eigensolver, for finding ground states of molecules, which, in turn, helps scientists predict reaction rates of molecules and is, potentially, one way to develop better drugs. Those are the most near-term, application-centric focuses. One thing that's interesting about our benchmark suite is that we also benchmark error correction itself, which is arguably the holy grail of quantum. One day, the Defense Department envisions running things like prime factorisation for breaking encryption or, potentially, Grover search at very large scale to solve enormous search problems. Those kinds of applications, we know, are going to take much larger systems, and those larger systems are going to require fault tolerance and error correction. So, we thought, why not benchmark that right now? That is what we view as the lens into the future, where there are new hardware requirements, like intermediate measurements, and new scaling approaches. So, that's the spectrum, and there are just one or two more to add: in our suite, we also want to test the quantumness of the system. One of the early controversies in quantum has been, "Is this device actually doing something quantum mechanical, or is it just a classical computer that happens to be operating at a very low temperature?" Two of our tests, which are called Mermin-Bell and GHZ, quantify the quantumness: "Is this computer really doing something that my laptop couldn't do?" That's, at a bird's-eye view, the spectrum of applications that we benchmarked.
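A GHZ state is the canonical example of what such a quantumness test prepares: every qubit entangled with every other, so that computational-basis measurements are perfectly correlated. The sketch below, in Cirq on a noiseless simulator, is illustrative only; the actual SupermarQ circuits and scoring differ in detail.

```python
import cirq

def ghz_all_equal_fraction(n_qubits: int = 5, shots: int = 2000) -> float:
    """Prepare an n-qubit GHZ state and return the fraction of shots that land on
    |00...0> or |11...1>. This is a crude proxy for how well a device maintains
    n-qubit entanglement; a real GHZ benchmark would also check the off-diagonal
    coherences (e.g., via parity oscillations), since a classical mixture can
    produce all-equal outcomes too."""
    qubits = cirq.LineQubit.range(n_qubits)
    circuit = cirq.Circuit(
        cirq.H(qubits[0]),                                                   # superposition on the first qubit
        [cirq.CNOT(qubits[i], qubits[i + 1]) for i in range(n_qubits - 1)],  # spread entanglement down the chain
        cirq.measure(*qubits, key="m"),
    )
    result = cirq.Simulator().run(circuit, repetitions=shots)
    hits = sum(1 for row in result.measurements["m"] if len(set(row)) == 1)
    return hits / shots

print(ghz_all_equal_fraction())   # ~1.0 on a noiseless simulator; lower on real hardware
```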
Fred
If you think about these applications, one of the big challenges was picking a suite of applications that both are motivated by something you would really do and that represent different stresses on a real machine. I mentioned this idea of a fingerprint, a feature vector — a series of properties that a machine might have that you might be stressing in a particular programme: things that you may have heard of before, like how good a qubit is in terms of its T1 and T2 times, how long the quantum state lasts, the quality of gates — one-qubit and two-qubit gates — but then other things like, how much communication is there in the programme? Essentially, how often do different qubits have to communicate with each other? How often do you have to move things around in the machine? Then, other, perhaps newer things, such as, how often do you have to measure the quantum state? In particular, these error-correction benchmarks have something called midcircuit measurement, which is a very new thing that quantum machines have only recently supported. You're in the middle of your quantum programme — you need to do measurement. Traditionally, that's only been done at the end of your quantum application. So, for these different properties, we look at how important they are and how frequent they are in these different benchmarks, and we show them pictorially, and you can think of them as a shape. That shape uniquely identifies the different benchmarks and other benchmarks that are like it. Then, we try to come up with benchmarks from different application classes that have very different shapes also. That's how we come up with these different stress tests, and then we correlate the different dimensions of the shape, or this fingerprint, with performance on machines. So, it's like a big table in our academic paper that looks at, how well does each one of these features predict the performance we saw on all these different benchmarks?

Konstantinos
When you end up with these gate-connectivity models, do you then see patterns in the future — "Every time we see a model that looks like this, we expect it to do really well at VQE," or something like that? Is that predictable now?

Fred
Yes. What we find is that certain kinds of applications are very dependent on properties like the topology or the communication capability or the quality of measurement, or perhaps, more standard, on the quality of two-qubit gates or something like that. If you look across the different kinds of benchmarks and the different machines, you'll see these particular points of high correlation that tell you, "This machine is good at this, and you can see that this programme really cares about this."

Konstantinos
At some level, are these shapes that you create imaginary? I know with, let's say, transmon, there really is a shape. These things are lined up a certain way. But with something like trapped ion, they're moved into position for computation. Is that captured if you were to do Honeywell or IonQ, when they actually move the trapped ions around?

Fred
Let me clarify. The shape I'm talking about is just the way that we graph how important each feature is for an application. Then, when you're talking about moving things around in a trapped-ion machine, for example, that is just another way to support communication, and that translates into, when we run a certain amount of communication on the machine, what's the fidelity of that communication? The fact that there's movement is certainly accommodated in the model of how all this is evaluated. Does that make sense?

Konstantinos
I didn't know if it created an artificial shape, or something that when you see it, you're like, "That's trapped ion."

Fred
For us, the shape comes from the benchmark, the software, the application, and we haven't actually made shapes for the machines, although that is possible. We might imagine, is there some match between the machine and the application, or something like that? So far, we've just taken the application and run it on the machines, and then numerically correlated this. I suppose we might be able to plot that correlation and make another shape for the machine or something.

Konstantinos
Yes, because as we're in this part of the NISQ era now, we're very much going to be looking for any advantage we can get in one particular machine or another — sometimes, one literal machine over another, even though they're the same exact technology — and I was hoping that with a benchmark like this, we would be able to see that: "For the next year, we have these machines, and we know every time we're doing a QUBO, we want to go this way." Were you hoping to provide that ability, to be able to get that last little bit of full-stack benefit, that kind of thing?

Pranav
Yes, and I can reflect on a couple of nuggets of our results that start to point us in that direction. One of them is — and maybe this is against some of the conventional wisdom that you hear from the trapped-ion community — "Our devices have very high connectivity. Therefore, they're going to be good for financial applications, where the interaction graph could be anything, and similarly for chemistry applications." What we found is that for both QAOA for optimisation and VQE for chemistry, there is, generally speaking, parity between the superconducting devices that we benchmarked and the trapped-ion devices. When we boil this down, the reason is that because of something called a swap network — and I won't get into technical details here — it is possible to effectively emulate full connectivity with a superconducting system for these kinds of applications. So maybe in the area where the trapped-ion people think they would do the best, there's actually not a definitive advantage in our results. On the other hand — and this took me by surprise too — in our error-correction benchmarks, the two devices that are head and shoulders above the rest are the trapped-ion devices that we benchmarked. So, that was pretty interesting. Maybe a third item to add is that we saw very impressive performance from a device that had not previously been benchmarked, which is the Advanced Quantum Testbed at Berkeley Lab. So, one of the other purposes of this initiative is to get more devices benchmarked, and that was one of our contributions here, and perhaps they will have more exposure to users through this platform.
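The swap network Pranav mentions is worth unpacking. On a line of qubits with only nearest-neighbour gates, alternating layers of interaction-plus-SWAP on the even and odd pairs bring every pair of logical qubits adjacent exactly once after n layers, which is how a superconducting device can emulate full connectivity for QAOA- or VQE-style circuits. A minimal sketch in Cirq, assuming ZZ-type interactions on a linear chain (an illustration of the general technique, not Super.tech's compiler):

```python
import cirq

def zz_swap_network(qubits, gammas):
    """All-to-all ZZ interactions using only nearest-neighbour gates on a line.
    `gammas` maps each logical pair (i, j), with i < j, to its interaction angle."""
    n = len(qubits)
    order = list(range(n))  # which logical qubit currently sits at each position
    circuit = cirq.Circuit()
    for layer in range(n):
        for pos in range(layer % 2, n - 1, 2):  # alternate even/odd pairings
            a, b = order[pos], order[pos + 1]
            # Apply the interaction, then a SWAP; hardware compilers often fuse
            # the two into a single two-qubit gate.
            circuit.append((cirq.ZZ ** gammas[min(a, b), max(a, b)]).on(qubits[pos], qubits[pos + 1]))
            circuit.append(cirq.SWAP(qubits[pos], qubits[pos + 1]))
            order[pos], order[pos + 1] = order[pos + 1], order[pos]
    return circuit

# Example: a fully connected 4-qubit QAOA-style cost layer compiled for a linear chain.
qubits = cirq.LineQubit.range(4)
gammas = {(i, j): 0.3 for i in range(4) for j in range(i + 1, 4)}
print(zz_swap_network(qubits, gammas))
```

Since even a fully connected device needs on the order of n rounds to apply all n(n-1)/2 pairwise interactions, the swap network's depth overhead is only a constant factor, which is consistent with the parity Pranav describes between superconducting and trapped-ion results on QAOA and VQE.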
Konstantinos
That's interesting, what you said about the trapped-ion error situation. Trapped-ion systems are the ones claiming, they hope, the fewest qubits used for error correction in the future. That's an interesting little bit of that story. I was going to ask you if you had surprise results. Are there any others? For example, in measuring how quantum something is, did you get anything that made you scratch your head and say, "What's happening here? It's like a simulator is running, or something"? I was curious about that.

Pranav
There are two that measure the quantumness. One of them, Mermin-Bell, is the one that indicates how much control the machine has over its quantum properties — and most of the devices don't perform great on it. That's not saying that they're not quantum. It just means that there's difficulty in maintaining exquisite quantum control. There isn't much variance — they all do so poorly. But the other one is GHZ state preparation. Another way we describe this is, how well can this quantum computer do quantum sensing? In this application, there are two things to note: One is that all of them do very well at intermediate qubit scales of five, six, seven qubits, and that's encouraging, because a lot of people talk about how quantum sensing is maybe the nearest-term application of quantum technology broadly. This aligns with that. We see very strong results even at intermediate scales. But the more dangerous side note to that is, while we see that performance is good at small and medium scales, it does degrade as we go to bigger scales, and that's not unexpected, but it does indicate the limitations of the NISQ era — that we're not sure if these quantum sensors are going to scale past maybe 50, 60, 100 qubits. So, those were not entirely surprising, but important, results to have benchmarked rigorously.

Konstantinos
That is interesting. Did you give thought to applying some of these benchmarks to things we know aren't quantum — for example, that hybrid approach to simulation that some companies are working on, where you have actual individual molecules, but they're not calling it quantum? It's just like a hybrid simulation approach where they're getting 200, 300 qubits. Do you think these benchmarks would be able to tell us anything interesting about those devices?

Fred
Those are very different machines. This is definitely a benchmark suite that's initially designed for these gate-based general machines. One thing to say about SupermarQ is, it's an initial version of an evolving, living suite of benchmarks. I do envision that we would go forward with looking at more hybrid systems and looking at benchmarking annealers and simulators of various quantumness. That's not currently within the scope of the suite, which already has a pretty large scope. But what you bring up is interesting, and we do have some ideas for how to evaluate some of those other technologies and machine models and even try to come up with ways to compare them with the gate-based machines, which has traditionally been very difficult.

Konstantinos
Yes, I always like to think "What if?" before I do an episode. What if this were applied this or that way? You never know. Sometimes, you strike gold on what might happen in the future.

Fred
Yes. I can give you one example, which is, there are these annealers, which are good at solving, very approximately, very large problems. At the same time, you can solve, essentially, an annealing problem very well for a very small problem on a gate-based machine.
So, these two things aren't comparable, because if you give the small problem to an annealer, you're not using most of its qubit capacity, its device capacity, and it's going to solve it terribly, and it's not going to look very good against the gate-based machine. On the other hand, if you try to take the large problem, it won't fit on the gate-based machine. In fact, some other work that we've been doing at Super.tech is a sampling approach called coresets, which basically allows you to take a large problem, create a weighted sample of it, run that on a small quantum computer, and then use that solution to seed, classically, a solution to the larger problem. That allows us actually to do an apples-to-apples comparison where we can take a big problem, run it on an annealer — a large D-Wave-type machine, for example — but then sample it down to a small gate-based machine, run it in this hybrid coreset model, and then try to compare those two. That's just an example of something that we've been thinking about in terms of future benchmarking of different systems.

Konstantinos
Yes, that's interesting. One goal I would have for something like this is that it becomes so popular that every time a machine comes out, we hear, "This is how many qubits we have, and this is our SupermarQ score." I know you guys wouldn't mind that either, but it would just be so much more useful than what we've talked about in the past — quantum volume, algorithmic qubits. What does that really mean? It's just something on paper — you can't prove it. Is it really erroring out? Also, in the future, we're going to have to reach a point where we're starting to talk about logical qubits. We use the word now, but we don't really mean it. No one talks about logical qubits yet. IBM announced 127 qubits. They are not logical qubits — they're just 127 qubits. From what we've heard, it might take 1,000 of them to get one logical qubit. We don't know. Do you anticipate evolving this to be the logical-qubit test one day? What do you really require to get that one error-corrected qubit? Do you see a path forward there to identify that? You're already tangling with error correction, but do you see something like that in the future — a logical test?

Fred
One thing I would say to modify your question is that there's a continuum between physical qubits and logical qubits. There's a middle ground, which would be some sort of error-mitigated qubit, where you take a few physical qubits, you group them together, and you get a much more reliable qubit that you wouldn't call a logical qubit but a much better physical qubit. That's definitely coming, and then, after that, there's definitely going to be this much higher-quality logical qubit that might take, at minimum, 20 or 30 physical qubits — up to hundreds of physical qubits, maybe — depending upon the error rate of the machine. I definitely see SupermarQ going toward that: first, toward error-correction benchmarks, as we have, but then toward things like error-mitigated qubits — basically, better qubit ensembles — and then toward logical qubits. That's a natural evolution as the machines start to evolve.

Konstantinos
It feels to me like that's the ultimate thing to prove one day: If you say you have this many logical qubits, we want proof.

Pranav
It is, and maybe one concern one could have about a benchmark suite at that scale, where we have logical qubits — and, let's say, thousands of logical qubits — is, how do we make sure that the benchmark suite keeps up? There are two dimensions.
One is just making sure we're staying up to date with whatever the latest and greatest quantum algorithms are. That part is where open-sourcing comes in: evolving and adapting. But the second part, which is particularly challenging for quantum computing, is, how do we know what the right answer is once we're talking about 1,000 qubits? One piece of SupermarQ that we put a lot of thought and intention into is that although the algorithms and applications we study are things that can exhibit quantum advantage, like QAOA or VQE, we were very careful about the exact tests that we run. Even though, in general, things like QAOA and VQE are impossible to verify classically, we instantiate them with specific versions that are actually classically simulable. At the 1,000- or 2,000-logical-qubit scale, no one will know exactly what the output distribution of a generic QAOA circuit should be unless they have a perfect quantum computer, so we give it a specific instance where we know what the answer should be, thanks to recently discovered classical analyses of how QAOA scales in certain cases. That's how we get around this issue that has bugged a lot of other attempts at benchmarking and scalable benchmarking.
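The classically verifiable instances Pranav describes can be seen in miniature: pick a problem small enough to brute-force, run a QAOA-style circuit, and compare the sampled answers against the known optimum. A toy sketch in Cirq with arbitrary fixed angles (an illustration of the verification idea only, not SupermarQ's actual QAOA construction, scoring, or scaling argument):

```python
import itertools
import cirq
import numpy as np

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]   # MaxCut on a 4-node ring
qubits = cirq.LineQubit.range(4)
gamma, beta = 0.8, 0.4                      # fixed, unoptimised angles for illustration

circuit = cirq.Circuit()
circuit.append(cirq.H.on_each(qubits))      # uniform superposition over all cut assignments
for i, j in edges:                          # cost layer (exact phase convention unimportant here)
    circuit.append((cirq.ZZ ** (gamma / np.pi)).on(qubits[i], qubits[j]))
circuit.append(cirq.rx(2 * beta).on_each(qubits))   # mixer layer
circuit.append(cirq.measure(*qubits, key="m"))

samples = cirq.Simulator().run(circuit, repetitions=2000).measurements["m"]

def cut_value(assignment):
    """Number of edges crossing the cut defined by a 0/1 assignment."""
    return sum(assignment[i] != assignment[j] for i, j in edges)

sampled_mean = np.mean([cut_value(row) for row in samples])
optimum = max(cut_value(a) for a in itertools.product([0, 1], repeat=4))
print(f"mean sampled cut: {sampled_mean:.2f}, brute-force optimum: {optimum}")
```

The same pattern, knowing the right answer in advance and scoring how close the hardware gets, is what lets a benchmark stay meaningful as it scales, provided the chosen instances remain classically analysable.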
Konstantinos
Yes. The one last big question I had today is, the other use for benchmarking in this whole industry has been this idea of, how do you prove advantage in general? If you want to prove advantage, you're talking lab-based benchmarking conditions: you've tried everything, and you know for sure that quantum is better because you've ruled everything else out in a controlled environment. Is there any way to help that along here? Is there any way to have some numbers, even just in the output scores, where once you've passed this number, you know that you've probably achieved advantage in this operation? Is there anything like that that's possible, or is there not enough data published on the current classical state of the art to be able to make those assumptions?

Fred
That's certainly possible. That's not what SupermarQ was designed for at the moment, because it's designed to be runnable on current machines. As the machines evolve and a demonstration of quantum advantage becomes more likely, I could see evolving SupermarQ toward benchmarks that include some sort of metric of quantum advantage, but right now, the benchmark suite is specifically designed to be scalable down to a size and fidelity requirement that can be run on, essentially, all available platforms out there. So, we would need those machines to progress before we refocus it toward something like quantum advantage.

Konstantinos
Yes, it might be something you feed back into the system afterward: you've reached advantage in something, and then we say that every time another machine reaches that score, it probably has reached advantage as well, or something like that. I was just trying to think outside of the box on benchmarking in general. This has been fascinating. I'm super excited about this. I think I've mentioned SupermarQ in five episodes — I'm not exaggerating — just because I've been looking forward to something like this being in the actual world. I'm excited to have a chance to talk to you guys about this again. I'm going to put links to everything, including the Github repo, in the show notes, but I didn't know if there are any last thoughts you wanted to share on this. This is your final chance at a platform here to talk about this product.

Pranav
Thanks again for having us. Part of this may be long-term — it helps our business — but part of it is just a service to people who are exploring quantum. We do hope that folks who want head-to-head comparisons of machines, and not just abstract scores that are hard to correlate to applications, will find the reports on our web portal helpful, and we want to keep benchmarking more hardware. So far, we've done more than 12 devices from over five hardware vendors, but we haven't yet benchmarked things like photonic devices. That is very much on our road map. So, stay tuned on this evolving effort.

Fred
Thank you so much for the excitement and the exposure here, because what we want from an effort like this is for it to grow, achieve critical mass and become a de facto standard, such that we can get the best benchmark results with each vendor if they give us access, and potentially even work together with us to optimise for those machines. The more motivated vendors are to be open to this kind of benchmarking, the better the community will be served.

Konstantinos
Yes. On a practical side, personally and selfishly, I love the idea of talking to a customer and being able to say, "You have this problem you want to solve. I know the exact machine we're going to use, because we ran benchmarks, and we know which one we want to use for this use case." That's what benchmarks are all about. Thanks a lot for doing this work.

Now, it's time for Coherence, the quantum executive summary where I take a moment to highlight some of the business impacts we discussed today in case things got too nerdy at times. Let's recap. Benchmarking has been part of classical computing for years. SupermarQ brings that to the world of quantum computing. Like the best classical benchmarks, SupermarQ runs different types of applications and algorithms to gauge performance in different areas. Super.tech hopes that this will help investors decide where to focus and users decide which machines are best for which use cases. As with classical benchmarks, it should be possible to quickly test vendor claims. The benchmark uses the company's SuperstaQ quantum development platform as its basis. SupermarQ tests include QAOA for optimisation problems, VQE for finding ground states of molecules, and GHZ and Mermin-Bell to measure quantumness, or entanglement between qubits and quantum control. The software also tests error correction. Results are shown visually with unique shapes and graphs that can show performance at a glance for different applications. All of this is important because when you're looking to get the best results in a use case, you want to choose the best machine possible for the task. Time on quantum computers can get expensive — best to get a big bang for the buck. (Pardon the physics joke.) SupermarQ is already starting to show general trends and results that we can analyse. SupermarQ is still focused on gate-based machines, but Super.tech can see a future where the software may evaluate annealers or hybrid simulators. We've moved from quantum volume and algorithmic qubits to benchmarking, which is exciting. The next step would be identifying how well machines do at creating error-mitigated and, eventually, logical qubits. One day, we might also be able to have scores that reflect whether machines are in a performance range known for quantum advantage. There are a lot of possibilities for the future here. That does it for this episode.
Thanks to Pranav Gokhale and Fred Chong for joining to discuss SupermarQ. Thank you for listening. If you enjoyed the show, please subscribe to Protiviti’s The Post-Quantum World, and leave a review to help others find us. Be sure to follow me on Twitter and Instagram @KonstantHacker. You’ll find links there to what we’re doing in Quantum Computing Services at Protiviti. You can also DM me questions or suggestions for what you’d like to hear on the show. For more information on our quantum services, check out Protiviti.com, or follow @ProtivitiTech on Twitter and LinkedIn. Until next time. Be kind, and stay quantum curious.