Motivation
Quality of Service (QoS) data from the Gateway is essential for managing an Indexer operation, particularly for monitoring performance, diagnosing issues, and adjusting to changes in the query market.
Currently, E&N runs the main Gateway, so all of the QoS data available today comes from them. That is expected to change as the protocol moves towards a progressively decentralized set of Gateways.
One of the issues with decentralizing the Gateways is the complexity of the current QoS Oracle data pipeline. It was initially built by E&N to serve their internal data science needs, and then slightly repurposed to also post some of that data onchain.
Because the current QoS Oracle is unnecessarily complex for other Gateway operators, keeping QoS data accurate in a decentralized Gateway world poses quite a challenge: Gateway operators would have to spin up a complex data science pipeline that they mostly don’t need.
This GRC aims to tackle that issue by simplifying the data pipeline so that Gateway operators can easily post their QoS data onchain, while also improving the data itself so that it can later be leveraged by other products (like GraphSeer).
Requirements
The primary requirements stem from the need for future Gateway operators to be able to easily expose the aforementioned QoS data, while improving the schema of the data itself and adding new mechanisms to support the progressive decentralization of the Gateways and align our stack.
Expanding on these high-level requirements for the QoS Oracle V2, we can divide them into a few categories:
Usability and setup improvements
- It should be a standalone app to be run in the Gateway stack
- It should be as simple as possible to set up
- It should post data to the main chain selected for the protocol (Arbitrum One) to simplify any requirements for data posting
- It should ensure all gateway data is processed and posted
Schema improvements
- It should improve on the already existing V1 schema, so retaining the existing data is required
- Example of the indexer data schema posted by the V1 oracle:
```json
{
  "indexer_wallet": "0x6125ea331851367716bee301ecde7f38a7e429e7",
  "indexer_url": "http://34.117.100.32/",
  "subgraph_deployment_ipfs_hash": "QmcPHxcC2ZN7m79XfYZ77YmF4t9UCErv87a9NFKrSLWKtJ",
  "chain": "mainnet",
  "gateway_id": "mainnet",
  "start_epoch": 1696363800,
  "end_epoch": 1696364100,
  "avg_query_fee": 0.0002890346,
  "max_query_fee": 0.0002890346,
  "total_query_fees": 0.0002890346,
  "query_count": 1,
  "avg_indexer_latency_ms": 214,
  "max_indexer_latency_ms": 214,
  "num_indexer_200_responses": 1,
  "proportion_indexer_200_responses": 1,
  "avg_indexer_blocks_behind": 1,
  "max_indexer_blocks_behind": 1,
  "stdev_indexer_latency_ms": null
}
```
- Example of the same data point with a potential schema improvement for V2:
```json
{
  "indexer_wallet": "0x6125ea331851367716bee301ecde7f38a7e429e7",
  "indexer_url": "http://34.117.100.32/",
  "subgraph_deployment_ipfs_hash": "QmcPHxcC2ZN7m79XfYZ77YmF4t9UCErv87a9NFKrSLWKtJ",
  "indexed_chain": "mainnet",
  "network_chain": "arbitrum-one",
  "gateway_id": "graphops-us-east-arb-one",
  "start_epoch": 1696363800,
  "end_epoch": 1696364100,
  "avg_query_fee": 0.0002890346,
  "max_query_fee": 0.0002890346,
  "total_query_fees": 0.0002890346,
  "query_count": 1,
  "avg_indexer_latency_ms": 214,
  "max_indexer_latency_ms": 214,
  "p99_indexer_latency_ms": 214,
  "p90_indexer_latency_ms": 214,
  "num_indexer_200_responses": 1,
  "proportion_indexer_200_responses": 1,
  "avg_indexer_blocks_behind": 1,
  "max_indexer_blocks_behind": 1,
  "avg_indexer_time_behind": 1000,
  "max_indexer_time_behind": 1000,
  "stdev_indexer_latency_ms": null
}
```
- It should still aim for 5-minute time interval data points as the base unit
- Improved “lag” metrics (instead of only blocks behind, time behind?)
- Percentiles? (Theo/Craig mentioned percentiles were hard to do with their current pipeline; a sketch of how they could be computed per bucket follows this list)
- Better gateway/network chain/indexed chain differentiations
- Extra fields that could be useful later on for the subgraph to properly do aggregations/averages?
- Have the data be as data service agnostic as possible
- …
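Since percentiles have reportedly been hard to produce with the current pipeline, here is a minimal sketch of how p90/p99 latency values could be derived once the oracle holds a bucket's raw latency samples in memory. The nearest-rank method and the function name are illustrative assumptions, not part of the proposal:

```rust
/// Nearest-rank percentile over a bucket's raw latency samples (sorts in place).
/// Returns None for an empty bucket.
fn percentile_ms(samples: &mut [u64], pct: f64) -> Option<u64> {
    if samples.is_empty() {
        return None;
    }
    samples.sort_unstable();
    // Nearest-rank: the smallest sample covering at least `pct` percent of the bucket.
    let rank = ((pct / 100.0) * samples.len() as f64).ceil() as usize;
    Some(samples[rank.saturating_sub(1)])
}

fn main() {
    let mut latencies = vec![120, 160, 180, 214, 950];
    println!("p90 = {:?}", percentile_ms(&mut latencies, 90.0)); // Some(950)
    println!("p99 = {:?}", percentile_ms(&mut latencies, 99.0)); // Some(950)
}
```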
Subgraph improvements
- Improved whitelisting strategies to be more in line with a scalable decentralized solution
- Hardcoded and dynamically added entities on the whitelist could work
- Add support for the new schema improvements on the base data
- Support base data points and daily aggregations for all data point types (like V1)
- Open source, so anyone who wants to run their own deployment with their own selection of Gateways can easily do so, enabling a fully permissionless setup
- Also provide a way for the curated whitelists of the canonical deployment to be dynamically modified by the Gateway
Acceptance Criteria
Contracts
- Deploy/reuse an eventful DataEdge contract on Arbitrum One
Gateway
- Add newly required fields to Kafka messages
- time_behind
- network_chain
- Rename/Duplicate already existing fields in Kafka messages that aren’t clear
- graph_env → gateway_id
- network → indexed_chain
Oracle
- Make use of already existing Kafka messages
- Perform proper time-based bucketing
- Perform the corresponding bucket aggregations (based on the different data point types)
- Post processed data to IPFS
- Post the IPFS hash to the DataEdge on Arbitrum One
- Setup documentation
Subgraph
- Support modifications to the whitelists (topic and submitter whitelists) through onchain interactions
- Aggregate data coming from the oracle into daily (weekly and monthly too?) buckets
- QueryDataPoints
- AllocationDataPoints
- IndexerDataPoints
- Usage documentation
Out of Scope
Subgraph
- User-level data points
Proposed Solution
Standalone Rust-based oracle
The Rust program will have to listen to Kafka continuously to receive data, but the basic flow would look something like this (a rough sketch of the bucketing step follows the list):
- Consume messages from Kafka
- Aggregate messages with time based bucketing (5 minutes) for:
- Query data points → Kafka client_query message
- Allocation data points → Kafka indexer_query message
- Process each bucket before posting to enrich the aggregated data
- Once buckets get finalized, post them onchain as soon as possible
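As a rough illustration of the bucketing step, here is a hedged sketch under a few assumptions: the key and aggregate fields, names, and the in-memory map are illustrative only, not a final design. Each parsed message is assigned to a 5-minute bucket keyed by indexer, deployment and bucket start time, and the bucket's running aggregates are updated:

```rust
use std::collections::HashMap;

const BUCKET_SECONDS: u64 = 300; // 5-minute buckets, as in V1

/// Hypothetical key identifying one indexer/deployment bucket.
#[derive(Clone, PartialEq, Eq, Hash)]
struct BucketKey {
    indexer_wallet: String,
    subgraph_deployment_ipfs_hash: String,
    start_epoch: u64,
}

/// Hypothetical running aggregates for one bucket; raw latencies are kept so
/// that percentiles and standard deviation can be computed at finalization.
#[derive(Default)]
struct BucketAgg {
    query_count: u64,
    total_query_fees: f64,
    max_query_fee: f64,
    latencies_ms: Vec<u64>,
}

/// Map a message timestamp (seconds) to the start of its 5-minute bucket.
fn bucket_start(timestamp: u64) -> u64 {
    timestamp - (timestamp % BUCKET_SECONDS)
}

/// Fold one already-parsed gateway message into its bucket.
fn ingest(
    buckets: &mut HashMap<BucketKey, BucketAgg>,
    indexer_wallet: String,
    deployment: String,
    timestamp: u64,
    query_fee: f64,
    latency_ms: u64,
) {
    let key = BucketKey {
        indexer_wallet,
        subgraph_deployment_ipfs_hash: deployment,
        start_epoch: bucket_start(timestamp),
    };
    let agg = buckets.entry(key).or_default();
    agg.query_count += 1;
    agg.total_query_fees += query_fee;
    agg.max_query_fee = agg.max_query_fee.max(query_fee);
    agg.latencies_ms.push(latency_ms);
}

fn main() {
    let mut buckets = HashMap::new();
    ingest(
        &mut buckets,
        "0x6125ea331851367716bee301ecde7f38a7e429e7".into(),
        "QmcPHxcC2ZN7m79XfYZ77YmF4t9UCErv87a9NFKrSLWKtJ".into(),
        1696363850,
        0.0002890346,
        214,
    );
    println!("open buckets: {}", buckets.len());
}
```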
For Kafka message consumption, we have a few options:
- Kafka-rust is a pure-Rust implementation of the Kafka protocol
- Rust-rdkafka is a wrapper for librdkafka, the officially supported C client library
Apparently the recommended crate is not the pure-Rust one but the C library wrapper, since it is better maintained and the underlying library is quite polished.
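As a minimal sketch of what consumption with rust-rdkafka could look like (the broker address, consumer group and topic names are placeholders inferred from the message types mentioned above, not actual Gateway configuration):

```rust
use rdkafka::config::ClientConfig;
use rdkafka::consumer::{Consumer, StreamConsumer};
use rdkafka::message::Message;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Broker address and group id are placeholders; real values would come from
    // the Gateway operator's configuration.
    let consumer: StreamConsumer = ClientConfig::new()
        .set("bootstrap.servers", "localhost:9092")
        .set("group.id", "qos-oracle")
        .set("auto.offset.reset", "earliest")
        .create()?;

    // Topic names assumed from the message types mentioned above; the actual
    // topics exposed by the gateway may be named differently.
    consumer.subscribe(&["client_query", "indexer_query"])?;

    loop {
        let message = consumer.recv().await?;
        if let Some(payload) = message.payload() {
            // Hand the raw payload to the parsing/bucketing step sketched earlier.
            println!("received {} bytes from {}", payload.len(), message.topic());
        }
    }
}
```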
For onchain interactions there are a few crate options; Ethers-rs is being deprecated in favour of newer alternatives such as Alloy, so one of those is the safer choice.
For the processing step, we might need to use crates to handle big integers or other data types, but it shouldn’t require anything fancy.
We might be able to reuse some logic (or a lot of logic) from the subscriptions-api repo, or at the very least take some inspiration from it.
In order to ensure the correctness of each time bucket, we will need to temporarily store the Kafka messages until that bucket is filled. Borrowing a little from subscriptions-api, we could store them in PostgreSQL.
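A minimal sketch of what that temporary storage could look like with sqlx and PostgreSQL; the crate choice, connection string, table name and columns are illustrative assumptions:

```rust
use sqlx::postgres::PgPoolOptions;

#[tokio::main]
async fn main() -> Result<(), sqlx::Error> {
    // Placeholder connection string; the real DSN would be part of the oracle's config.
    let pool = PgPoolOptions::new()
        .max_connections(5)
        .connect("postgres://oracle:oracle@localhost/qos_oracle")
        .await?;

    // Hypothetical table buffering raw messages until their 5-minute bucket is finalized.
    sqlx::query(
        "CREATE TABLE IF NOT EXISTS pending_messages (
             bucket_start BIGINT NOT NULL,
             topic        TEXT   NOT NULL,
             payload      TEXT   NOT NULL
         )",
    )
    .execute(&pool)
    .await?;

    // Buffer an incoming message under the start timestamp of its bucket.
    sqlx::query("INSERT INTO pending_messages (bucket_start, topic, payload) VALUES ($1, $2, $3)")
        .bind(1_696_363_800_i64)
        .bind("indexer_query")
        .bind(r#"{"indexer_latency_ms": 214}"#)
        .execute(&pool)
        .await?;

    Ok(())
}
```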
Contracts changes
Move from the non-eventful DataEdge contract on Gnosis to an eventful DataEdge contract on Arbitrum One. This will probably be slightly more expensive than before, but it shouldn't be much worse, and it streamlines the environment so that everything is on the same chain while making the subgraph easier to run.
Gateway changes
Add newly required fields to Kafka messages
- time_behind
- network_chain
Rename/duplicate already existing fields in Kafka messages whose names aren’t clear (see the sketch after this list)
- graph_env → gateway_id
- network → indexed_chain
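To illustrate how the oracle could consume a message once these fields exist, here is a hedged serde sketch. It assumes JSON payloads and a trimmed-down field set whose names follow the additions and renames above; the surrounding fields are assumptions for illustration, not the gateway's actual message format:

```rust
use serde::Deserialize;

/// Hypothetical subset of an indexer_query Kafka message after the proposed
/// gateway changes; real messages carry more fields than shown here, and the
/// exact field names are an assumption for this sketch.
#[derive(Debug, Deserialize)]
struct IndexerQueryMessage {
    gateway_id: String,    // previously graph_env
    indexed_chain: String, // previously network
    network_chain: String, // new field
    indexer_wallet: String,
    subgraph_deployment_ipfs_hash: String,
    timestamp: u64,
    indexer_latency_ms: u64,
    blocks_behind: u64,
    time_behind: u64, // new field
    query_fee: f64,
    status_code: u16,
}

fn main() -> Result<(), serde_json::Error> {
    let raw = r#"{
        "gateway_id": "graphops-us-east-arb-one",
        "indexed_chain": "mainnet",
        "network_chain": "arbitrum-one",
        "indexer_wallet": "0x6125ea331851367716bee301ecde7f38a7e429e7",
        "subgraph_deployment_ipfs_hash": "QmcPHxcC2ZN7m79XfYZ77YmF4t9UCErv87a9NFKrSLWKtJ",
        "timestamp": 1696363850,
        "indexer_latency_ms": 214,
        "blocks_behind": 1,
        "time_behind": 1000,
        "query_fee": 0.0002890346,
        "status_code": 200
    }"#;

    let message: IndexerQueryMessage = serde_json::from_str(raw)?;
    println!("{:?}", message);
    Ok(())
}
```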
Subgraph changes
The changes needed to take advantage of the new data available from the oracle should be self-explanatory (adding new fields or the relevant entities).
For the whitelist changes, we propose adding another message/topic for the subgraph to handle, which would allow something like the permission set list that the EBO uses.
Rollout Plan
There is no particular rollout needed: the oracle is not a hosted service but a part of the stack that operators will run themselves. Since this is the first version of this oracle and it isn't already in use, there are no migrations or upgrade paths to resolve for existing operators.
Alternative Solutions Considered
- Whitelisting changes for the subgraph could be removed from the scope of this GRC, only ensuring that all the data points can be properly processed while submitters (and possibly also topics) aren’t checked, or are checked against hardcoded whitelists, which would require either resyncs or gateway-specific deployments. However, we think having a way for a canonically curated deployment to dynamically modify the list of curated Gateways is preferable, even if it means the canonical view is permissioned (or eventually permissionless, if the solution we use allows for that).
- Encoding/decoding changes: currently the V1 oracle and subgraph simply handle calldata that is a bytes representation of a JSON object, which the subgraph decodes into a full-fledged JSON object, using the parsed fields to determine the action it needs to take.
While this is quite easy to implement, it might not be the most efficient approach in terms of gas cost. This wasn’t much of a problem on Gnosis Chain, but on Arbitrum One, even though it’s cheap, it might make running the oracle more expensive than it was on Gnosis.
In order to minimize those costs, we could use an encoding/decoding strategy like the EBO’s, but this would make the implementation more complex, and the exact gas savings aren’t known (see the sketch below).
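To make the trade-off concrete, here is a small illustrative sketch of both approaches under assumed field names: the V1-style path serializes a data point to JSON bytes and submits those bytes as calldata, while a compact encoding packs the same values into a fixed binary layout. The layout shown is purely hypothetical and is not the EBO's actual encoding:

```rust
use serde::Serialize;

/// Hypothetical, heavily trimmed data point used only to compare payload sizes.
#[derive(Serialize)]
struct DataPoint {
    query_count: u64,
    avg_indexer_latency_ms: u64,
}

fn main() {
    let point = DataPoint { query_count: 1, avg_indexer_latency_ms: 214 };

    // V1-style payload: JSON bytes posted as calldata and parsed by the subgraph.
    let json_payload = serde_json::to_vec(&point).unwrap();

    // Illustrative compact payload: fixed-width big-endian integers, no field names.
    // This is NOT the EBO's encoding, just a stand-in to show the size difference.
    let mut compact_payload = Vec::with_capacity(16);
    compact_payload.extend_from_slice(&point.query_count.to_be_bytes());
    compact_payload.extend_from_slice(&point.avg_indexer_latency_ms.to_be_bytes());

    // The compact payload is a fraction of the JSON payload's size.
    println!("json: {} bytes, compact: {} bytes", json_payload.len(), compact_payload.len());
}
```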
Copyright Waiver
Copyright and related rights waived via CC0.