This post will go into more detail about the event that caused Altair to halt block production for over a day and try to share the thought process that helped identify and fix the issue in addition to highlighting the learnings.
First some current Altair relevant context:
- Altair Polkadot/Substrate/Cumulus at v0.9.10
- Altair number of vesting schedules at 4k
Content of the Altair Runtime Upgrade 2:
- Bring node and runtime to Polkadot/Substrate/Cumulus v0.9.12
- Enabling Crowdloan Claiming functionality
- Support for Compressed Runtime WASM
More information about the release: https://github.com/centrifuge/centrifuge-chain/releases/tag/v3.0.0-alpha4
Runtime Upgrade sequence of events:
- Executed Upgrade in Centrifuge’s Charcoal Testnet (connected to Chachacha Relay Chain). At this stage, Charcoal had same data set as Altair in terms of content, not data hierarchy (will explain why this was important later on)
- Verified Claiming functionality successfully
- Executed Upgrade in Altair. Block production stopped trying to propose the next candidate.
Time to put on the troubleshooting hat:
First we started looking at the logs and noticed that the collators were not able to build the block in the first place:
2021-11-09 15:33:43.004 INFO tokio-runtime-worker runtime::frame-support: [Parachain] ⚠️ Vesting declares internal migrations (which *might* execute). On-chain `StorageVersion(0)` vs current storage version `StorageVersion(0)`
2021-11-09 15:33:43.733 INFO tokio-runtime-worker aura: [Parachain] ⌛️ Discarding proposal for slot 136372668; block production took too long
In the first line, we can see a migration in the Vesting Pallet. This was introduced as part of Substrate v0.9.11 to support multiple vesting schedules and we were inheriting it on our move from v0.9.10 to v0.9.12.
To clarify, we already knew that and we tested it with same data content beforehand on Charcoal, but we couldn’t trigger the same behavior.
In the second log, we can se that a timeout for block proposal was triggered at aroung ~700ms. Now we were wondering what was the maximum block proposal time allowed on the collator side for our Aura implementation:
// We got around 500ms for proposing
block_proposal_slot_portion: SlotProportion::new(1f32 / 24f32), // And a maximum of 750ms if slots are skipped
We had 0.5 seconds (500ms) to propose the block before proposal slot is discarded. Since the block slot is discarded after 0.5s we didn’t know for sure how much longer did the collator need to build it even if there was no time constraint, so we rebuilt the node with a unrealistically high block proposal timeout of 12s so we could know how far we were from the max target:
2021-11-10 11:13:06.600 INFO tokio-runtime-worker runtime::frame-support: [Parachain] ⚠️ Vesting declares internal migrations (which *might* execute). On-chain `StorageVersion(0)` vs current storage version `StorageVersion(0)`
2021-11-10 11:13:12.096 INFO tokio-runtime-worker runtime::frame-support: [Parachain] ✅ no migration for Identity
Well that was not very encouraging. It took roughly 5.5s to wrap up the vesting migration, which is 5s above target. So why we couldn’t identify this behavior during our testing phase? Why did it take 0.3s on Charcoal and 5.5s in Altair?
We started looking at the infrastructure where our collators were running in and couldn’t find an obvious bottleneck neither on Memory, CPU or I/O. Therefore, if the code is the same, the physical resources are the same, the data content is the same, then it had to be something around the data morphology.
Since Altair-Kusama storage trie is significantly bigger than Charcoal-Chachacha then the hierarchical path where the vesting storage is located can be significantly different from one network to another, then the time it can take to iterate over all the vesting schedules can as well increase depending on the location of the vesting schedules storage node. (Thanks to @basti & @rphmeier from Parity for pointing that out)
At this point, we were talking about how to fit 100 shoes in a box of 10, essentially trying to execute 4k vesting schedule migrations in a limited time of 0.5s, we were running out of options to get this through without runtime changes (through Kusama Governance).
So we needed to go from 5.5s to 0.5s, that is an improvement of order 11. We know that running some process in memory vs on SSD can be around 10–100 times faster, so in theory if we run our chain storage in memory we could improve the computation time by that much and potentially hit the 0.5s mark. And that is what we did, we provisioned an instance of 16 cores and 64GB of memory and modified the node code to use an in-memory DB (thanks @rphmeier for the suggestion and @mustermeiszer for the fast plan execution):
Then when running the new node, had to sync up from scratch but since we were using in-memory db it was very fast (800 bps). Slowly the free memory of the host was going down (expected since it had to host the whole DB in it), once the parachain DB was synced up, the node attempted to produce the block and it went through! We were able to decrease the block time from 5.5s to 0.06s (60ms), that is 91x faster in memory than against an SSD.
Summary of Learnings:
- Be aware of every migration that can be inherited from Substrate and assess the impact upfront
- Try to build a similar data storage hierarchy on your test environments before pushing the upgrade on your production ones (not trivial, and we will try to work on that soon)
- Ensure that you follow the suggested host specifications for both running validators and collators
- Good logging around your services to kick off the troubleshooting process
- Even though increasing vertically our host fixed the issue, once the network is big enough this might not be an option. Therefore paying attention and being proactive on the above will save many headaches.
Many thanks to @basti, @rphmeier & @alejandro @ Parity Technologies for being responsive and helpful on this matter.