Network Analytics Subgraph v1.0.0 outage (April 23rd)

Issue Summary

This post-mortem aims to clarify the recent incident on the Analytics subgraph v1.0.0.

  • The subgraph failed non-deterministically and kept retrying the handler in hopes of fixing itself, with no success.
  • The issue was found to be performance-related, although the exact cause is not 100% certain: it was seemingly caused by exceeding the memory limits of the 32-bit WASM runtime, which produced strange errors in the logs.
  • Several attempts were made to improve performance enough for the subgraph to resync successfully. None of them could use grafting to reduce the sync time, so development times for the fix were less than ideal.
  • A short-term solution was found. While it doesn’t fully solve the root cause of the problem, it gives us some headroom against the memory limits we were hitting, so the subgraph is no longer stuck.
  • The current solution is a short/middle-term fix, and we are exploring potential long-term solutions that address the subgraph’s inherent design flaw without any loss of features.

Timeline

Because the Analytics subgraph is inherently slow to sync and computationally intensive, resolving the issue took a considerable amount of time. The problem persisted for an entire month, with each iteration of the fix taking approximately a week to test.

Root Cause(s) Analysis

The problem appears to be related to memory limits, as the only fix that ultimately allowed the subgraph to sync was designed to address them. However, the root cause can be traced back to some known design issues of the subgraph.

The Analytics subgraph was born out of some very particular requirements that the Core Network subgraph couldn’t satisfy without slowing its sync times and bloating it with non-critical entities being generated all the time. Those requirements can be summed up as:

  • Daily aggregations of entities, like Indexer, Delegator, DelegatedStake, GraphNetwork, etc.
  • Pre-computation of delegation-focused values (the current and original delegation amounts on DelegatedStake and Delegator entities).

That second requirement is particularly important for this issue. For performance reasons when distributing revenue streams, delegations are represented in the contracts as a single delegation pool per Indexer: each Delegator holds shares of that pool, and the revenue streams are paid into the pool instead of being sent to each Delegator individually. As a result, the exact amount of GRT behind each delegation isn’t a value that’s directly available in the Core Network subgraph, since the subgraph can’t know how much each delegation is worth at any point unless it updates it every time there’s revenue to distribute.
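
As a rough illustration (the names below are ours for this write-up, not taken from the actual subgraph code), the GRT value of a single delegation can only be derived from the pool’s current exchange rate:

```typescript
import { BigInt } from "@graphprotocol/graph-ts";

// Hypothetical helper for this write-up: a delegation only stores its share
// count, while the Indexer's delegation pool tracks total tokens and shares.
function delegationValueGRT(shares: BigInt, poolTokens: BigInt, poolShares: BigInt): BigInt {
  // Every time revenue lands in the pool, poolTokens grows while the share
  // counts stay the same, so this value changes for every delegation without
  // any per-delegation event being emitted on-chain.
  return shares.times(poolTokens).div(poolShares);
}
```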

The issue itself is that each time an Indexer settles revenue (either query fees or indexing rewards), the subgraph needs to recalculate the values of all delegations for that Indexer.

That means we need to loop through all delegations for a given Indexer and update each one separately. Because of this, we decided to add that logic to the Analytics subgraph so that the slow calculations could be done on a separate (and, most importantly, less critical) subgraph.
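
Conceptually, the work that has to happen on every settlement looks roughly like the sketch below (entity and field names are simplified assumptions, not the actual handler code):

```typescript
import { BigInt } from "@graphprotocol/graph-ts";
import { DelegatedStake } from "../generated/schema";

// Simplified sketch of the per-settlement work; entity and field names are
// assumptions, not the actual Analytics schema.
function recomputeDelegations(stakes: DelegatedStake[], poolTokens: BigInt, poolShares: BigInt): void {
  // One pass over every delegation the Indexer has, however many there are.
  for (let i = 0; i < stakes.length; i++) {
    let stake = stakes[i];
    // Convert the delegation's shares at the pool's new exchange rate.
    stake.currentDelegation = stake.shares.times(poolTokens).div(poolShares);
    stake.save();
  }
}
```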

This isn’t much of an issue for most Indexers, but recently a few Indexers have amassed a huge number of delegations, creating situations where the Analytics subgraph was simply unable to keep up.

The subgraph was seemingly unable to process that many delegations because it was hitting the memory limits of the 32-bit WASM runtime.

This seems to happen in particular when the subgraph tries to load the whole list of delegations for a heavily delegated Indexer by using the helper functions that return a list of derived entities.
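
Whatever the exact helper, the problematic pattern boils down to materializing the full derived list in a single call; with graph codegen’s derived-field loaders, for example, it would look roughly like this (the schema field and names are assumptions):

```typescript
import { Indexer, DelegatedStake } from "../generated/schema";

// Assumes a schema field such as:
//   delegations: [DelegatedStake!]! @derivedFrom(field: "indexer")
// The generated loader materializes *every* DelegatedStake for the Indexer in
// WASM memory at once; for a heavily delegated Indexer, this single call is
// seemingly where the 32-bit runtime runs out of memory.
function loadAllDelegations(indexer: Indexer): DelegatedStake[] {
  return indexer.delegations.load();
}
```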

Impact

The incident impacted information availability on different frontend instances that consume the Analytics Subgraph’s data, such as Graph Explorer and GraphSeer, as well as other third-party apps that relied on this subgraph’s data.

Because of the split between the Analytics and Core Network subgraphs, critical components such as Indexers and Gateways were not impacted, so the network itself was mostly unaffected (aside from frontends not being able to display correct data for Delegators, or useful charts).

Resolution & Recovery

After a few failed attempts to fix the issue, we eventually found a possible solution: it still requires us to load the full list of entities, but it gives us a substantial amount of headroom so that we don’t hit the memory limits.

The fix itself was to use an auxiliary entity for the list of derived entities: it still lets us load the required entities properly afterward, but it doesn’t require loading them all at once (and thus potentially running out of memory). Previous attempts to do this with older techniques proved not to be performant enough, and workarounds like splitting the load with a block handler weren’t easy enough to implement or performant enough to solve the issue effectively.
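
A minimal sketch of the idea, assuming a hypothetical auxiliary entity that stores only the delegation IDs (the real entity and field names differ): the heavy DelegatedStake entities are then loaded, updated, and saved one at a time instead of being held in memory together.

```typescript
import { BigInt } from "@graphprotocol/graph-ts";
import { DelegatedStake, IndexerDelegationList } from "../generated/schema";

// Sketch with assumed names: `IndexerDelegationList` is a lightweight
// auxiliary entity that holds only the IDs of an Indexer's delegations, so
// the full list of heavy entities never has to sit in memory at once.
function recomputeDelegations(indexerId: string, poolTokens: BigInt, poolShares: BigInt): void {
  let list = IndexerDelegationList.load(indexerId);
  if (list == null || poolShares.isZero()) {
    return;
  }

  let ids = list.stakeIDs; // plain strings, cheap to hold
  for (let i = 0; i < ids.length; i++) {
    let stake = DelegatedStake.load(ids[i]); // one heavy entity at a time
    if (stake == null) {
      continue;
    }
    stake.currentDelegation = stake.shares.times(poolTokens).div(poolShares);
    stake.save(); // persisted and free to be collected before the next iteration
  }
}
```

The key difference from the earlier sketch is that only the list of IDs stays in memory for the whole loop; each DelegatedStake is loaded and released independently, which is what gives us the headroom.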

Recovery time will depend on how each app uses the data coming from the Analytics Subgraph: apps that aren’t affected by the schema changes should recover easily, while others might require some changes (the full changelog is available in the release notes).

The subgraph itself is already synced and working on the hosted service, waiting to be published on the decentralized network.

Lessons Learned & Preventive Measures

  • The Analytics subgraph’s design makes it intrinsically bad at scaling.
  • The current solution is only a short/middle-term fix, so we need to figure out further improvements that avoid having to do all of those batch calculations in the first place.
  • We are already exploring some potential new features for graph-node that could help us remove most of the batch-calculation dependencies without having to forgo all the delegation data pre-calculations.