Core Network Subgraph v0.23.0 Upgrade Post-Mortem (Jan 23rd)

Summary

This post-mortem covers the recent incident during the Core Network Subgraph v0.23.0 upgrade, which was caused by deploying a breaking change in the File Data Sources (FDS) Refactor.

During the latest rollout of upgrades to the Core Network Subgraph, a breaking change was introduced to the schema. As a result, third-party apps and pages relying on IPFS metadata broke when querying any of the environment endpoints for the subgraph.

Timeline of Events

The timeline of the incident is simple: the issue was the deployment of a breaking change, and the response was to inform third parties so they could update their implementations to accommodate the changes.

The incident unfolded as follows:

  • 13:52 UTC Jan 23rd: File Data Sources (FDS) Refactor for the Core Network Subgraph is deployed to production.
  • 14:08 UTC Jan 23rd: We receive reports from Payne regarding issues on a few third-party apps and spreadsheets.
  • 14:15 UTC Jan 23rd: We confirm that the incident is caused by the release of the FDS refactor and proceed to summarize the changes required to fix any third-party app depending on IPFS data (these were also summarized in the release notes, although the notes should have been more explicit).

We decided not to roll back the subgraph endpoint changes, as doing so would have required reverting changes in the official Graph Explorer UI and could have affected even more users. Given that the issue was a breaking change, and that some apps and sites had already been updated while others had not, the best course of action was to inform affected parties of the breaking changes and help summarize the modifications needed.

Impact and Mitigation

The breaking change was confined to a part of the subgraph that isn’t critical for indexer stack components (IPFS metadata). As a result, the impact was limited to third-party apps that either displayed information for any of the affected entities (GraphAccount, Subgraph, SubgraphVersion, SubgraphDeployment) or had to consume the IPFS metadata for other reasons. Because only third-party apps were affected, the full extent of the impact is not known. However, no major components of the protocol were affected.

The primary mitigation was to debug the initial alert and respond with a statement on how to update any affected third-party app. The change required to comply with the new schema was trivial, while a rollback would have been hard to do and could have created even more issues.
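
For third-party apps that need to ride out a change like this, one option is to read metadata fields in a way that tolerates both the old and the new response shape during the migration window. The sketch below is only an illustration: this post does not spell out the exact schema change, so the field name `displayName`, the nested key `metadata`, and the helper `read_metadata_field` are all hypothetical and should be replaced with whatever the release notes actually describe.

```python
from typing import Any, Dict, Optional


def read_metadata_field(
    entity: Dict[str, Any], field: str, nested_key: str = "metadata"
) -> Optional[Any]:
    """Return `field` from a nested object if present, else from the entity itself.

    Both `field` and `nested_key` are hypothetical names used for illustration;
    they are not taken from the real Core Network Subgraph schema.
    """
    nested = entity.get(nested_key)
    # Prefer the nested location when it exists and carries a value.
    if isinstance(nested, dict) and nested.get(field) is not None:
        return nested[field]
    # Fall back to the flat location; returns None if the field is absent.
    return entity.get(field)


# Two made-up response shapes for the same entity.
flat_shape = {"id": "0x01", "displayName": "Example Subgraph"}
nested_shape = {"id": "0x01", "metadata": {"displayName": "Example Subgraph"}}

flat_value = read_metadata_field(flat_shape, "displayName")
nested_value = read_metadata_field(nested_shape, "displayName")
missing_value = read_metadata_field({"id": "0x02"}, "displayName")
```

A reader like this only helps if the query itself is valid against the deployed schema, so it complements, rather than replaces, updating the query as described in the release notes.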

Root Cause Analysis

The incident occurred due to a lack of a formal release process for the Core Network Subgraph. This resulted in an improper release of a breaking change in a widely used subgraph, without sufficient warning and communication to dependents.

Causes Identified:

  • Implementation, testing, and release of the change were all done by a single developer, with only minor feedback rounds from Graph Explorer product team members in charge of the UI integration.
  • Development of the change took a long time (in development since Sept 2023), mostly due to delays in fixing issues, some of which required fixes at the graph-node level.
  • The initial planned release for the feature with Graph Explorer integration was a multi-stage release, precisely to avoid this kind of issue, but during the discussions for the rollout we failed to recognize the risk to third-party apps.
  • The initial planned release for the feature was delayed due to another low-level issue being discovered just as we were rolling it out (which even managed to cause some issues in Graph Explorer).
  • The release process also required some coordination with other core developers to ensure the core functionality of Graph Explorer wasn’t affected.
  • Overall lack of explicit communications and announcements regarding the update itself aside from some minor mentions during IOH (#138, #134, #132, #131, #130 and #129), as well as a lack of clarity that the update contained a breaking change.

Contributing Factors

Overall, the process currently in place for releasing a subgraph was not adequate to preserve the “social contract” of communicating changes effectively to the community. It is mostly a technical process and does not involve other areas that could oversee it from a communications perspective.

The drawn-out timeline of implementation, testing, and release made this worse. It steered the team toward making the implementation work properly, and we did not notice that the communication a breaking change requires was missing.

Corrective and Preventative Measures

Thankfully, there’s always a silver lining. Due to this issue, we realized that our release processes for subgraphs are far less formal than they should be, and we’ll be improving and setting new SOPs to formalize them, with the intention of avoiding a similar failure in the future.

To summarize the upcoming initiatives:

  • Publish the core subgraphs to the Decentralized Network and recommend this as the canonical endpoint for core data:
    • Describe how we would approach breaking changes on the Decentralized Network
    • Recommend alternatives (e.g. if you can’t tolerate external upgrades, what would be the ideal procedure for you)
  • Revamp the release process for core network subgraphs:
    • Establish a clear set of guidelines on what communication is expected and how far in advance warnings should be given before breaking upgrades are released
    • Make sure all releases have a full changelog clearly mentioning any further deprecations or breaking changes
    • Create a general rollout process for breaking-change upgrades (a multi-stage process that would allow people to upgrade gradually within the upgrade window)
    • Establish a canonical communications channel
    • Establish the expected canonical endpoints that would follow this release process (avoiding misunderstandings when we switch from hosted to the Decentralized Network, and making clear which endpoints are production ready and expected to be maintained)
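
As a rough illustration of how the release guidelines above could be enforced mechanically, the sketch below checks a planned release against a changelog requirement and a minimum notice period. Everything in it is an assumption made for the example: the `Release` record, the `release_blockers` helper, and the 14-day `MIN_NOTICE` value are hypothetical, since the actual guidelines and warning periods have not been defined yet.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional

# Hypothetical policy value; the real notice period is still to be decided.
MIN_NOTICE = timedelta(days=14)


@dataclass
class Release:
    version: str
    breaking: bool
    announced_on: Optional[date]  # None if no announcement was made
    release_on: date
    changelog: str


def release_blockers(release: Release, min_notice: timedelta = MIN_NOTICE) -> List[str]:
    """Return the reasons a release should not go out yet (empty list if none)."""
    problems: List[str] = []
    if not release.changelog.strip():
        problems.append("missing changelog")
    if release.breaking:
        # A breaking release must say so explicitly in its changelog.
        if "breaking" not in release.changelog.lower():
            problems.append("changelog does not mention the breaking change")
        if release.announced_on is None:
            problems.append("breaking change was never announced")
        elif release.release_on - release.announced_on < min_notice:
            problems.append("announcement is inside the minimum notice window")
    return problems


# Made-up releases for illustration only.
unannounced = Release("v1.0.0", True, None, date(2030, 1, 23), "Refactor metadata handling")
well_announced = Release(
    "v1.1.0", True, date(2030, 1, 1), date(2030, 2, 1),
    "BREAKING: metadata fields moved; see migration notes",
)

unannounced_problems = release_blockers(unannounced)
well_announced_problems = release_blockers(well_announced)
```

A check like this could run as a gate in the release pipeline, but it only covers the parts of the process that are easy to automate. Deciding which channel is canonical and which endpoints are covered remains a matter of agreement.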

Great write-up. Awesome. Thanks for the transparency.
