We recently shared that two of our customers, Degen and Proof of Play, experienced downtime on their chains. Conduit worked continuously over 72 hours to restore these chains and to work with their respective teams to ensure the best recovery possible for their end users. With both chains now back up and running, we’d like to share more details.

Summary

On Friday, May 10th, Conduit increased batch sizes on Degen and Proof of Play Apex to 10MB in order to reduce costs. This change delayed the posting of batch data from these networks to their parent chains. On Sunday, May 12th around 1 PM PST, the configuration was reverted to restore timely batch posting. The revert caused re-orgs on both networks because batches were posted after the 24-hour force-inclusion window had elapsed: once that window passes, Arbitrum Nitro inserts any delayed inbox messages ahead of the transactions in the batch and replays those transactions with new timestamps.
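
For intuition, the force-inclusion rule can be sketched in a few lines of Go. This is a minimal illustration, assuming made-up names (maxDelay, canForceInclude) rather than Nitro's actual identifiers:

    package main

    import (
        "fmt"
        "time"
    )

    // maxDelay mirrors the 24-hour force-inclusion window described above.
    // All names in this sketch are illustrative, not Nitro's real identifiers.
    const maxDelay = 24 * time.Hour

    // canForceInclude reports whether a delayed inbox message has waited long
    // enough that it can be forced in ahead of the sequencer's pending batches.
    func canForceInclude(messageTime, now time.Time) bool {
        return now.Sub(messageTime) >= maxDelay
    }

    func main() {
        sent := time.Now().Add(-25 * time.Hour)        // a message delayed past the window
        fmt.Println(canForceInclude(sent, time.Now())) // true: force inclusion is possible
    }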

After the re-orgs, the nodes came back up with corrupted databases: geth did not handle re-orgs of that depth gracefully. This necessitated resyncing each node's data directory from genesis. The resyncs took over 40 hours per network at a replay rate of about 100M gas/s.
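
For a rough sense of scale, assuming the replay rate held steady over the whole resync:

    40 hours × 3,600 s/hour × 100,000,000 gas/s ≈ 1.44 × 10^13 gas per network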

Once the nodes were resynced, Conduit attempted various transaction replay schemes, though not all transactions could be recovered because some depended on exact timestamps. After conferring with each rollup team, several strategies were discussed and attempted in parallel to bring the networks back online and recover the pre-re-org state.
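
The simplest such scheme is re-submitting the recovered transactions in their original order. Below is a minimal sketch using go-ethereum's ethclient, assuming a hypothetical preReorgTxs slice recovered from the old chain data; this is illustrative only, not Conduit's actual tooling:

    package main

    import (
        "context"
        "log"

        "github.com/ethereum/go-ethereum/core/types"
        "github.com/ethereum/go-ethereum/ethclient"
    )

    // replayAll re-submits pre-re-org transactions in order. Transactions whose
    // logic read block.timestamp, or whose validity depended on the original
    // timing, may revert or diverge when executed under new timestamps.
    func replayAll(ctx context.Context, client *ethclient.Client, preReorgTxs []*types.Transaction) {
        for _, tx := range preReorgTxs {
            if err := client.SendTransaction(ctx, tx); err != nil {
                log.Printf("could not replay %s: %v", tx.Hash().Hex(), err)
            }
        }
    }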

Degen’s chain came online on May 14th at 7:30 PM PST, about 54 hours after the network went down. Proof of Play’s Apex chain was recovered around the same time but was only available for public use on May 15th at 4 PM PST after an alternative recovery scheme was implemented.

Background

Degen and Proof of Play Apex are chains built on the Arbitrum tech stack that utilize AnyTrust for data availability. The rollups’ data is stored on off-chain data availability servers (DAS) and provided on-demand by a data availability committee (DAC). The batch poster receives transaction batches from the sequencer, compresses them, sends them to the DAS, and posts certificates on the underlying settlement layer. Degen posts on Base, and Proof of Play Apex posts on Arbitrum One.
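
In outline, that posting path looks something like the sketch below. Every type and method here is a hypothetical stand-in; Nitro's real interfaces differ:

    package anytrust

    // Certificate is the DAC-signed attestation that the data is available.
    type Certificate []byte

    // DASCommittee stands in for the data availability committee's servers.
    type DASCommittee interface {
        Store(data []byte) (Certificate, error)
    }

    // SettlementChain stands in for the parent chain (Base or Arbitrum One).
    type SettlementChain interface {
        PostCertificate(cert Certificate) error
    }

    // postBatch sketches the batch poster's path: compress the batch, hand the
    // bytes to the DAS, then post only the small certificate to the parent chain.
    func postBatch(batch []byte, compress func([]byte) []byte, das DASCommittee, settlement SettlementChain) error {
        cert, err := das.Store(compress(batch))
        if err != nil {
            return err
        }
        return settlement.PostCertificate(cert)
    }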

In the past few weeks, both chains had seen significant growth, and their costs were rising. We had previously observed that when L2s became congested, batch-posting costs dominated the expenses that L3s pay. To insulate L3s from current and future congestion, Conduit explored ways to reduce the frequency of batch posting to the L2.

Although the default batch size for Orbit chains is 100KB, both chains were already using a batch size of 1MB to meet their needs. On Thursday, May 9th, Conduit tested a batch size of 10MB on a high-volume testnet for over 24 hours. During this test, batches were posted as expected, and the network advanced as normal.

After validating the 10MB configuration on testnet for 24 hours, Conduit updated the batch size for Degen and Proof of Play mainnets to 10MB. This brings us to the beginning of the incident.

Root Cause Analysis (RCA)

For both chains, the max batch size parameter was increased to 10MB. This parameter is meant to cap the size of the batches posted to the underlying DA layer. However, what it actually measures is only something “close” to the compressed size, and that estimate can be quite inaccurate for batches of many small, similar transactions, which compress extremely well together. Because we assumed the parameter set a hard limit on batch size and that batches would be full post-compression, we believed that validating the 10MB configuration on a testnet meant it would work everywhere.

    --node.batch-poster.max-size int    maximum batch size (default 100000)
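
To see how far off that estimate can be, consider how well many similar payloads compress together. A toy demonstration in Go, using DEFLATE for simplicity (Nitro uses brotli, but the effect is the same):

    package main

    import (
        "bytes"
        "compress/flate"
        "fmt"
    )

    func main() {
        // 50,000 copies of a small, similar payload: a toy stand-in for a
        // batch of near-identical transactions (e.g. mints or transfers).
        var raw bytes.Buffer
        for i := 0; i < 50000; i++ {
            fmt.Fprintf(&raw, "transfer(to=0x%04x, amount=100)\n", i%256)
        }

        var out bytes.Buffer
        w, _ := flate.NewWriter(&out, flate.BestCompression)
        w.Write(raw.Bytes())
        w.Close()

        // Repetitive input shrinks by orders of magnitude, so any size limit
        // that only approximates the compressed size can be wildly off.
        fmt.Printf("raw: %d bytes, compressed: %d bytes\n", raw.Len(), out.Len())
    }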