a16z: A Phased Approach to Secure and Efficient zkVM Implementation (A Must-Read for Developers)

BlockBeats · 2025/03/12 12:00
By: BlockBeats

This will be a long construction project, taking no less than four years.

Original Title: The path to secure and efficient zkVMs: How to track progress
Original Author: a16z crypto
Original Translation: Golem, Odaily Planet Daily


zkVMs (zero-knowledge virtual machines) promise to "make SNARKs mainstream," allowing anyone (even without specialized SNARK expertise) to prove that they have correctly executed an arbitrary program on a given input (or witness). Their core advantage is developer experience, but today they face significant challenges in both security and performance. To fulfill the zkVM vision, designers must overcome these challenges. In this article, I outline the likely stages of zkVM development, which will take several years to complete.


Challenges


In terms of security, a zkVM is a highly complex software project that is still riddled with vulnerabilities. In terms of performance, proving a program's correct execution can be orders of magnitude slower than running it natively, making real-world deployment impractical for most applications.


Despite these real challenges, most companies in the blockchain industry portray zkVMs as ready for immediate deployment. In fact, some projects are already paying substantial computational costs to generate proofs of on-chain activity. But because zkVMs are still imperfect, this is merely an expensive way of pretending a system is SNARK-protected when, in reality, it is either protected by permissioning or, worse, exposed to attack.


We are still several years away from achieving a secure and high-performance zkVM. This article proposes a series of phased specific goals to track the progress of zkVM—goals that can eliminate hype and help the community focus on real advancement.


Security Stage


SNARK-based zkVMs typically include two main components (sketched in code after this list):


· Polynomial Interactive Oracle Proof (PIOP): Used to prove statements about polynomials (or constraints derived from them) in an interactive proof framework.


· Polynomial Commitment Scheme (PCS): Ensures that the prover cannot lie about polynomial evaluations without being caught.
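As a rough illustration of how these two components fit together, here is a minimal, hypothetical Python interface sketch. The class and method names are invented for illustration only; real schemes (e.g., FRI- or KZG-based PCS) are far more involved.

```python
from abc import ABC, abstractmethod

class PolynomialCommitmentScheme(ABC):
    """Binds the prover to a polynomial so later evaluation claims can be checked."""

    @abstractmethod
    def commit(self, coeffs: list[int]) -> bytes:
        """Produce a short, binding commitment to the polynomial."""

    @abstractmethod
    def open(self, coeffs: list[int], point: int) -> tuple[int, bytes]:
        """Return the claimed evaluation at `point` plus an opening proof."""

    @abstractmethod
    def verify(self, commitment: bytes, point: int, value: int, proof: bytes) -> bool:
        """Check that `value` is consistent with the committed polynomial at `point`."""

class PIOP(ABC):
    """Interactive proof whose prover messages are oracles to polynomials."""

    @abstractmethod
    def prover_round(self, witness, challenge: int) -> list[int]:
        """Emit the next round's polynomial (as a coefficient list)."""

    @abstractmethod
    def verifier_queries(self, transcript: bytes) -> list[int]:
        """The evaluation points the verifier wants to query."""
```

A SNARK is obtained by instantiating the PIOP's polynomial oracles with PCS commitments, so each verifier query becomes a commit/open/verify exchange.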


A zkVM essentially encodes the correct execution of a program as a constraint system—roughly meaning the constraints force the virtual machine to use its registers and memory correctly—and then applies a SNARK to prove that these constraints are satisfied.
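To make "execution as a constraint system" concrete, here is a toy Python sketch, invented for illustration and far simpler than any real zkVM arithmetization. It directly checks the constraints for one trace row of a hypothetical ADD instruction; a real zkVM would express the same checks as polynomial identities over a finite field.

```python
FIELD_MODULUS = 2**31 - 1  # toy prime field

def check_add_row(row: dict) -> bool:
    """Constraints for `add rd, rs1, rs2` on a single execution-trace row."""
    regs, regs_next = row["regs"], row["regs_next"]
    rd, rs1, rs2 = row["rd"], row["rs1"], row["rs2"]

    # Constraint 1: the destination register holds the field sum of the sources.
    if regs_next[rd] != (regs[rs1] + regs[rs2]) % FIELD_MODULUS:
        return False
    # Constraint 2: every other register is unchanged.
    return all(regs_next[i] == regs[i] for i in range(len(regs)) if i != rd)

row = {
    "rd": 2, "rs1": 0, "rs2": 1,
    "regs":      [5, 7, 0, 0],
    "regs_next": [5, 7, 12, 0],
}
assert check_add_row(row)
```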


Formal verification is how we ensure that a system as complex as a zkVM is error-free. Below is a breakdown of the security phases: Phase 1 concerns a correct protocol, while Phases 2 and 3 concern a correct implementation.


Security Phase 1: Correct Protocol


1. A formally verified proof of the PIOP's soundness;


2. A formally verified proof that the PCS is binding under certain cryptographic assumptions or in an idealized model;


3. If Fiat-Shamir is used, a formally verified proof that the succinct argument obtained by combining the PIOP and the PCS is secure in the random oracle model (augmented with other cryptographic assumptions as needed);


4. A formally verified proof that the constraint system the PIOP is applied to is equivalent to the semantics of the VM;


5. A final step tying all the above parts together: a formally verified proof that the result is a secure SNARK for executing any program specified by the VM's bytecode (a sketch of what that final statement might look like follows below). If the protocol aims for zero-knowledge, this property must also be formally verified, so that no sensitive information about the witness leaks.
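As a rough sketch of the end-to-end soundness target (the notation here is invented for illustration): for every efficient adversary that outputs a bytecode, a public input $x$, and a proof $\pi$,

$$
\Pr\Big[\mathcal{V}(\mathsf{bytecode},\, x,\, \pi) = \mathsf{accept} \;\wedge\; \neg\exists\, w :\ \mathsf{VM}(\mathsf{bytecode},\, x,\, w) = \mathsf{accept}\Big] \;\le\; \mathsf{negl}(\lambda),
$$

where $\mathcal{V}$ is the SNARK verifier, $\mathsf{VM}$ denotes native execution of the bytecode on input $x$ and witness $w$, and $\mathsf{negl}(\lambda)$ is negligible in the security parameter.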


A note on recursion: if the zkVM uses recursion, then every PIOP, commitment scheme, and constraint system appearing anywhere in that recursion must be verified for this phase to be considered complete.


Security Phase 2: Correct Verifier Implementation


Formally verify that the actual implementation of the zkVM's verifier (in Rust, Solidity, etc.) matches the protocol verified in Phase 1. Achieving this ensures that the implemented protocol is sound (not just a design on paper or an inefficient specification written in, say, Lean).


There are two reasons Phase 2 concerns only the verifier implementation (not the prover). First, getting the verifier right suffices for soundness: it guarantees the verifier cannot be convinced that a false statement is true. Second, the zkVM verifier's implementation is more than an order of magnitude simpler than the prover's.


Security Phase 3: Correct Prover Implementation


Formally verify that the actual implementation of the zkVM prover correctly generates proofs for the proof system verified in Phases 1 and 2. This ensures completeness: a system using the zkVM can never get "stuck" on a statement that is true but unprovable. If the prover is meant to be zero-knowledge, this property must be formally verified as well.


Expected Timeline


· Phase 1 progress: We can expect incremental progress over the next year (e.g., ZKLib). However, it will take at least two years before any zkVM fully meets the requirements of Phase 1;


· Phases 2 and 3: These can progress alongside some aspects of Phase 1. For example, some teams have already demonstrated that a Plonk prover implementation matches the protocol in the paper (even though the paper protocol itself may not yet be fully verified). Nevertheless, I do not expect any zkVM to reach Phase 3 in under four years—possibly longer.


Key Points: Fiat-Shamir Security and Verified Bytecode


A major complicating factor is the unresolved research surrounding the security of the Fiat-Shamir transformation. All three phases treat Fiat-Shamir and the random oracle as bulletproof, but in reality the whole paradigm may harbor vulnerabilities, owing to the gap between the idealized random oracle and the hash functions used in practice. In the worst case, a system that reached Phase 2 could later be found to be entirely insecure because of a Fiat-Shamir problem. This is cause for serious concern and ongoing research; we may need to modify the transformation itself to better guard against such vulnerabilities.
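For readers unfamiliar with it, Fiat-Shamir makes an interactive protocol non-interactive by deriving each verifier challenge from a hash of the transcript so far. A minimal Python sketch of the idea (details such as domain separation are omitted for brevity):

```python
import hashlib

def fiat_shamir_challenge(transcript: bytes, field_modulus: int) -> int:
    """Derive a verifier 'challenge' deterministically from the transcript.

    Security analyses model the hash as a random oracle; real hash functions
    only approximate that ideal, which is exactly the gap discussed above.
    """
    digest = hashlib.sha256(transcript).digest()
    return int.from_bytes(digest, "big") % field_modulus

# The prover appends each of its messages to the transcript and derives the
# next challenge itself, instead of waiting for a verifier message.
transcript = b"protocol-id || public-inputs || prover-round-1-commitment"
print(fiat_shamir_challenge(transcript, 2**61 - 1))
```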


Non-recursive systems are theoretically more robust because certain known attacks involve circuits similar to those used in recursive proofs.


Another caveat: if the bytecode itself is flawed, then a proof that the program (as specified by that bytecode) was executed correctly has limited value. Therefore, the practicality of zkVMs depends heavily on methods for generating formally verified bytecode—a significant challenge beyond the scope of this article.


Regarding Post-Quantum Security


Quantum computing will not pose a serious threat for at least five years (and probably longer), whereas vulnerabilities are an existential risk right now. The primary focus should therefore be meeting the security and performance phases discussed in this article. If non-quantum-secure SNARKs let us meet these security requirements faster, we should use them until post-quantum SNARKs catch up, or until serious concern arises about cryptographically relevant quantum computers.


zkVM Performance Status


Currently, the zkVM prover's overhead factor is close to 1,000,000x the cost of native execution: if a program takes X cycles to run, proving its correct execution costs roughly X multiplied by 1,000,000 CPU cycles. This was true a year ago, and it remains true today.
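A back-of-envelope sketch of what a 1,000,000x overhead factor means in practice (the workload and clock speed below are illustrative assumptions, not measurements):

```python
native_cycles = 10**9          # a program that runs for ~1 second natively
overhead_factor = 1_000_000    # today's approximate zkVM prover overhead
cpu_hz = 3 * 10**9             # ~3 GHz single core

prover_seconds = native_cycles * overhead_factor / cpu_hz
print(f"~{prover_seconds / 3600:.0f} CPU-hours to prove 1 second of execution")
# => roughly 93 CPU-hours for a single second of native execution
```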


Popular narratives typically describe this overhead in a way that sounds acceptable. For example:


· "The cost to generate a proof for all Ethereum mainnet transactions in a year is less than one million dollars."


· "We can almost generate Ethereum block proofs in real time using a cluster of tens of GPUs."


· "Our latest zkVM is 1,000 times faster than its predecessor."


While technically accurate, these statements can be misleading without proper context. For example:


· It is 1,000 times faster than the old version of the zkVM, but its absolute speed is still very slow. That says more about how bad things were than about how good they are now.


· There are proposals to increase the computational load on the Ethereum mainnet tenfold. That would leave current zkVM performance even further behind real time.


· What is called "near real-time proving of Ethereum blocks" is still far slower than what many blockchain applications require (for example, Optimism has a 2-second block time, much faster than Ethereum's 12 seconds).


· "A cluster of tens of GPUs always running flawlessly" cannot achieve an acceptable liveness guarantee.


· Spending less than one million dollars a year to prove all Ethereum mainnet activity only sounds reasonable until you recall that an Ethereum full node spends only about $25 a year on computation.


For applications outside of blockchain, such overhead is clearly too high. No amount of parallelization or engineering can offset such enormous overhead. We should take as a basic benchmark that zkVM's slowdown compared to native execution does not exceed 100,000 times—even if this is just the first step. True mainstream adoption may require overhead closer to 10,000 times or lower.


Measuring Performance


SNARK performance has three main components:


· The intrinsic efficiency of the underlying proof system.

· Application-specific optimizations (e.g., precompiles).

· Engineering and hardware acceleration (e.g., GPUs, FPGAs, or multicore CPUs).


While the latter two are crucial for real-world deployment, they apply to essentially any proof system, so they do not necessarily reflect the intrinsic overhead. For example, adding GPU acceleration and precompiles to a zkEVM can easily yield a 50x speedup over a purely CPU-based approach without precompiles—enough to make an inherently less efficient system look superior to one that simply hasn't been polished the same way.


Therefore, the focus below is on SNARK performance without specialized hardware or precompiles. This differs from today's benchmarking practice, which often lumps all three factors into a single "headline number"—akin to judging a diamond by its polishing time rather than its inherent clarity. The aim is to isolate the intrinsic overhead of the general-purpose proof system, helping the community eliminate confounding variables and focus on true progress in proof system design.


Performance Phases


Here are five milestones of performance achievement. First, prover overhead on CPUs must fall by several orders of magnitude; only then should the focus shift to further reductions through hardware. Memory usage must come down as well.


Across all stages below, developers should not have to implement custom code specific to zkVM to achieve the necessary performance. Developer experience is a key advantage of zkVM. Sacrificing DevEx to meet performance benchmarks would contradict the essence of zkVM itself.


These metrics focus on prover cost. However, if unbounded verifier cost is allowed (i.e., no limits on proof size or verification time), any prover metric can be met trivially. Therefore, for a system to satisfy the phases below, maximum values for both proof size and verification time must be specified.


Performance Requirements


Phase 1 Requirement: "Reasonable and Nontrivial Verification Cost":


· Proof Size: The proof size must be smaller than the witness size.


· Verification Time: The speed of verifying the proof must not be slower than running the program natively (i.e., performing the computation without a correctness proof).


These are the minimal nontrivial succinctness requirements. They ensure that proof size and verification time are no worse than simply sending the witness to the verifier and having it check correctness directly.


Requirements for Phase 2 and Beyond:


· Maximum Proof Size: 256 KB.

· Maximum Verification Time: 16 milliseconds.


These cutoffs are intentionally generous, to accommodate new fast proving techniques that may come with higher verification costs, while still excluding proofs so expensive that few projects would be willing to put them on chain (both phases' checks are sketched in code below).
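A small sketch pulling the Phase 1 and Phase 2 verification-cost requirements together (the thresholds are the ones stated above; the data structure is invented for illustration):

```python
from dataclasses import dataclass

@dataclass
class ProofStats:
    proof_size_bytes: int
    verify_time_ms: float
    witness_size_bytes: int
    native_runtime_ms: float

def meets_phase1(s: ProofStats) -> bool:
    # Phase 1: no worse than sending the witness and re-executing natively.
    return (s.proof_size_bytes < s.witness_size_bytes
            and s.verify_time_ms < s.native_runtime_ms)

def meets_phase2_cutoffs(s: ProofStats) -> bool:
    # Phase 2+: absolute bounds, independent of witness size.
    return s.proof_size_bytes <= 256 * 1024 and s.verify_time_ms <= 16.0

stats = ProofStats(proof_size_bytes=180 * 1024, verify_time_ms=12.0,
                   witness_size_bytes=10**8, native_runtime_ms=1_000.0)
assert meets_phase1(stats) and meets_phase2_cutoffs(stats)
```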


Speed Phase 1


Single-threaded proving must be at most 100,000 times slower than native execution, measured across a range of applications (not just Ethereum block proofs) and without relying on precompiles.


Concretely, think of a RISC-V processor running at about 3 billion cycles per second on a modern laptop. Achieving Phase 1 means you can prove, single-threaded on that same laptop, at a rate of roughly 30,000 RISC-V cycles per second—while keeping verification costs "reasonable yet nontrivial" as defined above.


Speed Phase 2


Single-threaded proving must be at most 10,000 times slower than native execution.


Alternatively, because some promising SNARK techniques (especially those based on binary fields) are poorly served by today's CPUs and GPUs, you may qualify for this phase by comparing against an FPGA (or even an ASIC), measuring:

· The number of RISC-V cores an FPGA can emulate at native speed;

· The number of FPGAs required to prove RISC-V execution in (near) real time.

If the latter is at most 10,000 times the former, you qualify for Phase 2 (a sketch of this ratio test follows below). On a standard CPU, the proof size must still be at most 256 KB and verification time at most 16 milliseconds.
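A sketch of that FPGA ratio test (both counts below are placeholder assumptions, purely to illustrate the criterion):

```python
# (a) RISC-V cores a single FPGA can emulate at native speed (assumed).
cores_emulated_per_fpga = 4
# (b) FPGAs needed to prove one core's execution in near real time (assumed).
fpgas_to_prove_one_core = 20_000

overhead = fpgas_to_prove_one_core / cores_emulated_per_fpga
print(f"effective FPGA overhead: {overhead:,.0f}x")
assert overhead <= 10_000, "would not yet meet Speed Phase 2"
```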


Speed Phase 3


In addition to meeting Speed Phase 2, a proof cost of less than 1,000x must be achieved for a wide range of applications using automatically synthesized, formally verified precompiles. Essentially, the instruction set is customized dynamically for each program to speed up proving, but in a way that is easy to use and formally verified.


Memory Phase 1


The speed of Speed Phase 1 is achieved with a prover that requires less than 2 GB of memory (while also achieving zero-knowledge).


This is crucial for mobile devices and browsers, and it opens up countless client-side zkVM use cases. Client-side proving matters because our phones are our constant point of contact with the real world: they track our location, credentials, and so on. If generating a proof requires more than 1-2 GB of memory, that is too much for most of today's mobile devices. Two points need clarification:


· The 2 GB limit must hold for large statements (those requiring trillions of CPU cycles to execute natively). Proof systems that meet the space limit only for small statements lack broad applicability.


· If a prover is very slow, keeping its memory footprint below 2 GB is easy. So, to make Memory Phase 1 nontrivial, the Speed Phase 1 requirement must be met within the 2 GB limit.


Memory Phase 2


The speed of Speed Phase 1 is achieved with a memory footprint under 200 MB (10 times better than Memory Phase 1).


Why push below 2 GB? Consider a non-blockchain example: every time you visit a website over HTTPS, you download certificates used for identification and encryption. Instead, a site could send zk proofs involving those certificates. A large site might have to issue millions of such proofs per second; if each proof takes 2 GB of memory to generate, that adds up to petabytes of RAM (see the back-of-envelope sketch below). Further reducing memory usage is crucial for non-blockchain deployments.
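The back-of-envelope arithmetic behind the HTTPS example (the request rate and proving latency are illustrative assumptions):

```python
proofs_per_second = 1_000_000   # a large website's assumed load
proving_latency_s = 1.0         # assumed time to generate one proof
mem_per_proof_gb = 2            # Memory Phase 1 footprint

concurrent_proofs = proofs_per_second * proving_latency_s
total_ram_pb = concurrent_proofs * mem_per_proof_gb / 1_000_000  # GB -> PB
print(f"~{total_ram_pb:.0f} PB of RAM in flight")  # ~2 PB at 2 GB per proof
# At Memory Phase 2's 200 MB per proof, the same load needs ~0.2 PB.
```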


Precompiles: The Last Mile or a Crutch?


In zkVM design, precompiles are specialized SNARKs (or constraint systems) tailored to specific functions, such as Keccak/SHA hashing or the elliptic-curve group operations used in digital signatures. In Ethereum (where much of the heavy lifting is Merkle hashing and signature checking), some hand-crafted precompiles can reduce the prover's costs. But relying on them as a crutch does not get SNARKs where they need to go. Here's why:


· Still too slow for most applications, both inside and outside blockchains: Even with hash and signature precompiles, current zkVMs remain too slow because the core proof system is inefficient.


· Security failures: Handwritten precompiles that are not formally verified are almost certain to be riddled with bugs, potentially leading to catastrophic security failures.


· Poor developer experience: In most zkVMs today, adding a new precompile means manually writing a constraint system for each function—essentially a return to a 1960s-style workflow. Even with existing precompiles, developers must refactor code to invoke each one. We should optimize for security and developer experience rather than sacrificing both to chase incremental performance gains; doing so only shows that the underlying performance has not reached its potential.


· I/O overhead and no RAM: While precompiles can speed up heavy cryptographic tasks, they may not deliver meaningful acceleration for more diverse workloads, since they incur significant overhead passing inputs and outputs and cannot access RAM. Even in a blockchain context, as soon as you move beyond a single L1 like Ethereum (say, to build a series of cross-chain bridges), you face different hash functions and signature schemes. Redoing precompiles over and over for each variant does not scale and poses significant security risks.


For all these reasons, our primary task should be to enhance the efficiency of the underlying zkVM. The technology that produces the best zkVM will also produce the best precompiles. I do believe precompiles will remain crucial in the long run, but only if they are auto-synthesized and formally verified. This way, we can maintain the developer experience advantage of zkVM while avoiding disastrous security risks. This viewpoint is reflected in Speed Phase 3.


Expected Timeline


I expect a few zkVMs to achieve Speed Phase 1 and Memory Phase 1 later this year. I also anticipate Speed Phase 2 to be achieved within the next two years, although it is currently unclear if we can reach this goal without some yet-to-emerge new ideas. I predict the remaining phases (Speed Phase 3 and Memory Phase 2) will take a few more years to accomplish.


Conclusion


While I have laid out the security and performance phases of zkVMs separately in this article, the two are not fully independent. As more vulnerabilities are found in zkVMs, I expect some will only be fixable at a significant cost in performance. Performance work should therefore be deferred until a zkVM reaches Security Phase 2.


zkVM promises to truly democratize zero-knowledge proofs but is still in its infancy — filled with security challenges and significant performance overhead. Hype and marketing make it difficult to assess true progress. By outlining clear security and performance milestones, this roadmap aims to provide a distraction-free path forward. We will achieve the goals, but it will take time and sustained effort.


Original Article Link


