High-level
Looking to optimize our subgraph for reduced downtime
Background
We recently had a bug in our subgraph (entirely our doing) caused by an unexpected edge case: a user interacting directly with the contracts.
We have set up an alert notification system (we should open source this!) for warning and critical logs in the subgraph, which allowed us to react swiftly to the bug. Since the bug was at the data layer, no user funds were at risk.
We were able to replicate the transaction locally with our contract test deployment scripts, debug a local subgraph, find the error and fix the problem rapidly.
Problem statement
Our subgraph (Polygon) handles a lot of data. Because of fast block times, slow RPC and lots of data emitted from complex transactions, it takes more than 1 hour to deploy and sync Float Capital’s subgraph, and this sync time will only increase the longer our system operates.
Question statement
What are the best practices and community thoughts on ensuring subgraph uptime?
Simple answer
Have a killer subgraph with no bugs that doesn’t go down.
Discussion points
Perfect code doesn’t exist; downtime is still a real possibility in complex subgraphs and contracts. I’m hoping the following topics will provide a base for discussing better approaches and best practices, and hopefully assist others with the same problem too.
- Subgraph failure response process
- Subgraph tests
- Multiple subgraphs = no single point of failure
- Fallback core data subgraph
- Sync time bottleneck
1. Subgraph failure response process
Step 1: Identify that a subgraph error has occurred
We have incorporated warning logs into our subgraph to catch unexpected behaviour. If the subgraph gets stuck, a critical log is emitted, and we have developed a notification alert system with integrations for Slack, Discord, Sentry and email. (We will look to open source this, but it is still a bit rough around the edges.)
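A minimal sketch of how warning and critical logs could be fanned out to the integrations mentioned above. This is not Float Capital's actual alert system: the routing table, the type names and `buildAlerts` are all hypothetical, and the actual delivery to each channel is left out.

```typescript
type Level = "warning" | "critical";
type Channel = "slack" | "discord" | "sentry" | "email";

// Hypothetical shape of a log line collected from the subgraph.
interface SubgraphLog {
  level: Level;
  message: string;
  blockNumber: number;
}

interface Alert {
  channel: Channel;
  text: string;
}

// Assumed routing: warnings go to fewer channels than critical logs.
const ROUTES: Record<Level, Channel[]> = {
  warning: ["slack", "sentry"],
  critical: ["slack", "discord", "sentry", "email"],
};

// Turns collected logs into one alert per (log, channel) pair.
function buildAlerts(logs: SubgraphLog[]): Alert[] {
  const alerts: Alert[] = [];
  for (const entry of logs) {
    const text = `[${entry.level.toUpperCase()}] block ${entry.blockNumber}: ${entry.message}`;
    for (const channel of ROUTES[entry.level]) {
      alerts.push({ channel, text });
    }
  }
  return alerts;
}
```

Keeping the routing as data makes it easy to change who gets paged without touching the delivery code.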
Step 2: Notify users via the UI (banner warning message)
A simple manual switch lets us tell all users ‘our subgraph is stuck, no funds are at risk’.
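A rough sketch of the banner decision, assuming the manual switch is a boolean the UI can read. The optional automatic check is an addition, not something we do today: it assumes the UI can obtain the subgraph's indexed block and error flag (for example from the subgraph's `_meta` field) plus the chain head from an RPC node. `SubgraphStatus`, `bannerMessage` and `maxLagBlocks` are hypothetical names.

```typescript
// Hypothetical status object assembled by the UI.
interface SubgraphStatus {
  hasIndexingErrors: boolean; // e.g. read from the subgraph's _meta field
  indexedBlock: number; // latest block the subgraph has indexed
  chainHeadBlock: number; // latest block reported by an RPC node
}

const STUCK_MESSAGE = "Our subgraph is stuck, no funds are at risk.";

// Returns the banner text to show, or null when no banner is needed.
function bannerMessage(
  manualSwitchOn: boolean,
  status: SubgraphStatus | null,
  maxLagBlocks: number
): string | null {
  // The manual switch always wins.
  if (manualSwitchOn) return STUCK_MESSAGE;
  // Without status information, fall back to the manual switch only.
  if (status === null) return null;
  const lag = status.chainHeadBlock - status.indexedBlock;
  if (status.hasIndexingErrors || lag > maxLagBlocks) return STUCK_MESSAGE;
  return null;
}
```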
Step 3: Identify the block number the subgraph got stuck at, blacklist it from all logic and redeploy
A bit of a hacky step, but as a hot fix we keep a simple array blacklist of block numbers which is checked at the beginning of each handler, effectively skipping all logic (and bugs) for that block. I'm not sold on this as a fix, but the goal is to minimise downtime. Open to feedback if you think this is a properly messy idea. It’s a bit of a ‘break for some, fix for others’ suboptimal approach.
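The blacklist guard could look roughly like the sketch below. It is written as plain TypeScript so it runs on its own; in a real mapping the block number would come from the event object passed to the handler. `BlockEvent`, `handleExampleEvent` and the block number in the array are placeholders, not taken from our codebase.

```typescript
// Hypothetical stand-in for the event a real mapping handler receives.
interface BlockEvent {
  blockNumber: number;
  txHash: string;
}

// Hot-fix blacklist: blocks whose events are skipped entirely.
// 123456 is a placeholder value, not a real problem block.
const BLACKLISTED_BLOCKS: number[] = [123456];

function isBlacklisted(blockNumber: number): boolean {
  return BLACKLISTED_BLOCKS.indexOf(blockNumber) !== -1;
}

// Stands in for the entity writes a real handler would perform.
const processed: string[] = [];

function handleExampleEvent(event: BlockEvent): void {
  // Guard at the top of every handler: skip all logic for blacklisted blocks.
  if (isBlacklisted(event.blockNumber)) {
    return;
  }
  processed.push(event.txHash);
}
```

The cost is visible in the sketch: every event in a blacklisted block is dropped, including events from unrelated, well-behaved transactions, so data for that block stays wrong until the proper fix is deployed and the subgraph is resynced.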
Step 4: Replicate the transactions on a local blockchain and subgraph, debug, fix the error, remove the blacklist and redeploy
Float Capital has some code here too which could be useful for developers: it dockerizes the services and has some shell scripts to handle things nicely locally. (Again, it is a bit rough around the edges and probably requires reading the code to configure correctly, but we will make an effort to open source this too.)
2. Subgraph tests
This entire discussion is probably irrelevant with some beast end-to-end integration tests and full unit test coverage, but alas we are not there. I have been keeping my ears open and have heard of a few different graph test frameworks in the pipeline. Are there any standouts at the moment? We have a custom test suite @JasoonS has been working on, which we are expanding, but it is by no means at full coverage yet.
3. Multiple subgraphs = no single point of failure
This is a big dev maintainability tradeoff. We already do this separation at a basic level: one subgraph is dedicated to a block handler and the associated data at each block, and another handles all event handlers. With substantial refactors to the UI and subgraph we could split the event handlers into separate subgraph instances, but this can get quite complex.
Another opportunity to open source some code: we have a txHelper.ts script which allows us to capture event data directly in the subgraph and helps us debug issues by querying the subgraph directly. With some code generation help, this could be separated into its own graph instance too.
4. Fallback core data subgraph
Another suboptimal maintainability nightmare is to have a separate skeleton graph with core data, under the philosophy of less complexity = fewer points of failure. In the event of subgraph downtime the UI can fall back to this subgraph and reduce the severity of the downtime (provided it doesn’t break with the same error).
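On the UI side, the fallback could be sketched as trying the main subgraph first and the skeleton subgraph second. The `Fetcher` type and `queryWithFallback` are hypothetical; the fetcher is injected so the sketch does not assume any particular GraphQL client, and the fallback subgraph must of course be able to answer the same query.

```typescript
// A fetcher sends a GraphQL query to an endpoint and resolves to the data,
// or rejects if the subgraph is down or returns an error.
type Fetcher<T> = (endpoint: string, query: string) => Promise<T>;

// Tries each endpoint in order and returns the first successful result,
// together with the endpoint that served it.
async function queryWithFallback<T>(
  endpoints: string[],
  query: string,
  fetcher: Fetcher<T>
): Promise<{ endpoint: string; data: T }> {
  let lastError: unknown = new Error("no endpoints configured");
  for (const endpoint of endpoints) {
    try {
      const data = await fetcher(endpoint, query);
      return { endpoint, data };
    } catch (err) {
      // Remember the failure and move on to the next (simpler) subgraph.
      lastError = err;
    }
  }
  throw lastError;
}
```

Returning the endpoint alongside the data lets the UI know it is running on the fallback and, for example, hide features the skeleton subgraph cannot serve.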
5. Sync time bottleneck
My assumption: the biggest bottleneck behind the slow sync time is reading data from the chain, i.e. the RPC endpoint (if you’re working on Polygon, you know this can be fickle, especially on Mumbai).
Possible mitigation: beefier RPC endpoints.
Final notes
In closing, The Graph is an essential part of our tech stack, allowing us to query and manage far more complex data points than querying the contracts directly would. We are looking for feedback and thoughts on how other devs in the space are handling this.
In writing this I realize we have a lot of code that is not quite as dev friendly as it could be, but that the community could benefit from. I will add creating a public repo with some of this code to my todo list, with the disclaimer that it is rough around the edges and may not fit everyone's use cases. Lots of credit to @JasoonS here. We even have a code generation tool which generates getters and getOrInitialize helpers by reading the schema.
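To give a feel for the code generation idea, here is a toy sketch, not our actual tool. It assumes entities are declared as `type Name @entity` in the schema and emits helper source as text, relying on the `load` method and `id` constructor of the generated entity classes. `entityNames` and `generateHelpers` are hypothetical names, and the regex ignores cases such as types that implement an interface.

```typescript
// Extracts entity names from a GraphQL schema string.
// Only matches the simple form "type Name @entity".
function entityNames(schema: string): string[] {
  const pattern = /type\s+(\w+)\s+@entity/g;
  const names: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(schema)) !== null) {
    names.push(match[1]);
  }
  return names;
}

// Emits a getter and a getOrInitialize helper (as source text) per entity.
function generateHelpers(schema: string): string {
  return entityNames(schema)
    .map((name) =>
      [
        `export function get${name}(id: string): ${name} | null {`,
        `  return ${name}.load(id);`,
        `}`,
        ``,
        `export function getOrInitialize${name}(id: string): ${name} {`,
        `  let entity = ${name}.load(id);`,
        `  if (entity == null) {`,
        `    entity = new ${name}(id);`,
        `  }`,
        `  return entity as ${name};`,
        `}`,
      ].join("\n")
    )
    .join("\n\n");
}
```

A real generator would also need to set default values for required fields before the new entity is saved, which this sketch leaves out.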