GRC-002: QoS Oracle V2

juanmardefago · April 23, 2024, 5:58pm

Motivation

Quality of Service data from the Gateway is incredibly important when it comes to managing an Indexer operation, particularly when it comes to monitoring your performance, figuring out issues and adjusting to changes in the query market.

Currently, E&N is running the main gateway, so all of the QoS data that we have available comes from them, but that is expected to change with the steps being taken to go towards a progressively decentralized set of Gateways.

One of the issues we have with the decentralization of the Gateways comes from the complexity of the current QoS Oracle data pipeline. It was made by E&N initially to suit their internal needs for their data science team, and then slightly repurposed to also be able to post some of that data onchain.

Because the current QoS Oracle is unnecessarily complex for other Gateway operators, keeping QoS data accurate in a decentralized Gateway world is quite a challenge: Gateway operators would have to spin up a complex data science pipeline that they mostly don’t need.

This GRC aims to tackle that issue by figuring out how to simplify the data pipeline to make it easy for Gateway operators to post their QoS data onchain, while also improving the data that we post, so that we can leverage it afterwards on other products (like GraphSeer).

Requirements

The primary requirements stem from the need for future Gateway operators to be able to easily expose the aforementioned QoS data, while improving the schema of the data itself, as well as adding new mechanisms to support the progressive decentralization of the gateways and lining up our stack.

To expand on the aforementioned high level requirements for the QoS Oracle V2, we can divide it into a few categories:

Usability and setup improvements

It should be a standalone app to be run in the Gateway stack
It should be as simple as possible to set up
It should post data to the main chain selected for the protocol (Arbitrum One) to simplify any requirements for data posting
It should ensure all gateway data is processed and posted

Schema improvements

It should aim to improve the already existing V1 schema, so retaining the existing data is required
- Example of schema of indexer data posted by V1 oracle:

{
  "indexer_wallet": "0x6125ea331851367716bee301ecde7f38a7e429e7",
  "indexer_url": "http://34.117.100.32/",
  "subgraph_deployment_ipfs_hash": "QmcPHxcC2ZN7m79XfYZ77YmF4t9UCErv87a9NFKrSLWKtJ",
  "chain": "mainnet",
  "gateway_id": "mainnet",
  "start_epoch": 1696363800,
  "end_epoch": 1696364100,
  "avg_query_fee": 0.0002890346,
  "max_query_fee": 0.0002890346,
  "total_query_fees": 0.0002890346,
  "query_count": 1,
  "avg_indexer_latency_ms": 214,
  "max_indexer_latency_ms": 214,
  "num_indexer_200_responses": 1,
  "proportion_indexer_200_responses": 1,
  "avg_indexer_blocks_behind": 1,
  "max_indexer_blocks_behind": 1,
  "stdev_indexer_latency_ms": null
}

Example of the same data point with a potential schema improvement for V2:

{
  "indexer_wallet": "0x6125ea331851367716bee301ecde7f38a7e429e7",
  "indexer_url": "http://34.117.100.32/",
  "subgraph_deployment_ipfs_hash": "QmcPHxcC2ZN7m79XfYZ77YmF4t9UCErv87a9NFKrSLWKtJ",
  "indexed_chain": "mainnet",
  "network_chain": "arbitrum-one",
  "gateway_id": "graphops-us-east-arb-one",
  "start_epoch": 1696363800,
  "end_epoch": 1696364100,
  "avg_query_fee": 0.0002890346,
  "max_query_fee": 0.0002890346,
  "total_query_fees": 0.0002890346,
  "query_count": 1,
  "avg_indexer_latency_ms": 214,
  "max_indexer_latency_ms": 214,
  "p99_indexer_latency_ms": 214,
  "p90_indexer_latency_ms": 214,
  "num_indexer_200_responses": 1,
  "proportion_indexer_200_responses": 1,
  "avg_indexer_blocks_behind": 1,
  "max_indexer_blocks_behind": 1,
  "avg_indexer_time_behind": 1000,
  "max_indexer_time_behind": 1000,
  "stdev_indexer_latency_ms": null
}

It should still aim for 5 minute time interval data points as the base unit
Improved “lag” metrics (instead of only blocks behind, time behind?)
Percentiles? (I remember Theo/Craig mentioned percentiles were hard to do with their current pipeline)
Better gateway/network chain/indexed chain differentiations
Extra fields that could be useful later on for the subgraph to properly do aggregations/averages?
Have the data be as data service agnostic as possible
…
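
To make the percentile item above more concrete, here is a minimal sketch of how p90/p99 latency could be computed for a single 5-minute bucket once the raw per-query latencies are available. It uses the nearest-rank method, which is an assumption on my part (the GRC does not pick a percentile definition), and the function name is hypothetical.

```python
def nearest_rank_percentile(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100 (assumed method)."""
    if not values:
        raise ValueError("values must not be empty")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(values)
    # Integer ceiling of pct/100 * n, which avoids floating point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Hypothetical per-query indexer latencies (ms) collected in one bucket.
latencies_ms = [120, 180, 214, 250, 300, 95, 410, 160, 175, 900]

print(nearest_rank_percentile(latencies_ms, 90))  # 410
print(nearest_rank_percentile(latencies_ms, 99))  # 900
```

With a single query in the bucket, every percentile collapses to that one latency, which matches the V2 example above where the p90, p99, avg and max latencies are all 214.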

Subgraph improvements

Improved whitelisting strategies to be more in line with a scalable decentralized solution
- Hardcoded and dynamically added entities on the whitelist could work
Add support for the new schema improvements on the base data
Support base data points and daily aggregations for all data point types (like V1)
Open source, so anyone that wants to run their own deployment with their own selection of gateways can do so easily, to enable a fully permissionless setup
Provide a way to also allow for the curated whitelists to be dynamically modified by the gateway for the canonical deployment

Acceptance Criteria

Contracts

Deploy/Reuse an eventful DataEdge on Arbitrum-one

Gateway

Add newly required fields to Kafka messages
- time_behind
- network_chain
Rename/Duplicate already existing fields in Kafka messages that aren’t clear
- graph_env → gateway_id
- network → indexed_chain

Oracle

Make use of already existing Kafka messages
Perform proper time based bucketing
Perform corresponding bucket aggregations (based on the different data point type)
Post processed data to IPFS
Post IPFS hash to DataEdge on Arbitrum-one
Setup documentation

Subgraph

Support modifications to the whitelists (topic and submitter whitelists) through onchain interactions
Aggregate data coming from the oracle into daily (weekly and monthly too?) buckets
- QueryDataPoints
- AllocationDataPoints
- IndexerDataPoints
Usage documentation
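
As a rough illustration of the daily aggregation listed for the subgraph, the sketch below rolls 5-minute data points up into UTC-day buckets. It is written in Python only for readability (the actual subgraph mappings would not be Python), and it assumes each data point carries query_count so that averages can be weighted correctly. That is one reason extra fields may be needed for proper aggregations. The function name and the intermediate latency_sum_ms field are hypothetical.

```python
SECONDS_PER_DAY = 86_400


def daily_rollup(points):
    """Aggregate 5-minute data points into daily buckets keyed by day start."""
    days = {}
    for p in points:
        day_start = p["start_epoch"] - p["start_epoch"] % SECONDS_PER_DAY
        d = days.setdefault(
            day_start,
            {"query_count": 0, "latency_sum_ms": 0.0, "max_indexer_latency_ms": 0},
        )
        d["query_count"] += p["query_count"]
        # Weight each bucket's average by its query count.
        d["latency_sum_ms"] += p["avg_indexer_latency_ms"] * p["query_count"]
        d["max_indexer_latency_ms"] = max(
            d["max_indexer_latency_ms"], p["max_indexer_latency_ms"]
        )
    return {
        day: {
            "query_count": d["query_count"],
            "avg_indexer_latency_ms": (
                d["latency_sum_ms"] / d["query_count"] if d["query_count"] else None
            ),
            "max_indexer_latency_ms": d["max_indexer_latency_ms"],
        }
        for day, d in days.items()
    }


points = [
    {"start_epoch": 1696363800, "query_count": 1,
     "avg_indexer_latency_ms": 214, "max_indexer_latency_ms": 214},
    {"start_epoch": 1696364100, "query_count": 3,
     "avg_indexer_latency_ms": 100, "max_indexer_latency_ms": 150},
]
print(daily_rollup(points))
```

Averaging the two bucket averages directly would give 157 ms, while the query-weighted daily average is 128.5 ms, so the base data needs to retain the counts behind each average.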

Out of Scope

Subgraph

User-level data points

Proposed Solution

Standalone Rust based oracle

The Rust program will have to be constantly listening to Kafka to receive data, but the basic flow would look something like this:

Consume messages from Kafka
Aggregate messages with time based bucketing (5 minutes) for:
- Query data points → Kafka client_query message
- Allocation data points → Kafka indexer_query message
Process each bucket before posting to enrich the aggregated data
Once buckets get finalized, post them onchain as soon as possible
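
The bucketing and aggregation steps above could be sketched roughly as follows. This is Python rather than Rust purely to keep the example short, and the message field names (timestamp, deployment, latency_ms, fee) are hypothetical, since the real Kafka message layout is not shown in this GRC.

```python
from collections import defaultdict

BUCKET_SECONDS = 300  # 5-minute base unit from the requirements


def bucket_bounds(timestamp_s):
    """Return (start, end) of the 5-minute bucket containing a Unix timestamp."""
    start = timestamp_s - timestamp_s % BUCKET_SECONDS
    return start, start + BUCKET_SECONDS


def aggregate(messages):
    """Group messages per (indexer, deployment, bucket) and aggregate them."""
    groups = defaultdict(list)
    for m in messages:
        start, end = bucket_bounds(m["timestamp"])
        groups[(m["indexer_wallet"], m["deployment"], start, end)].append(m)

    points = []
    for (indexer, deployment, start, end), msgs in sorted(groups.items()):
        latencies = [m["latency_ms"] for m in msgs]
        points.append({
            "indexer_wallet": indexer,
            "subgraph_deployment_ipfs_hash": deployment,
            "start_epoch": start,
            "end_epoch": end,
            "query_count": len(msgs),
            "avg_indexer_latency_ms": sum(latencies) / len(latencies),
            "max_indexer_latency_ms": max(latencies),
            "total_query_fees": sum(m["fee"] for m in msgs),
        })
    return points


# Hypothetical messages: two fall in the same bucket, one in the next.
messages = [
    {"timestamp": 1696363925, "indexer_wallet": "0xabc", "deployment": "QmExample",
     "latency_ms": 200, "fee": 0.0002},
    {"timestamp": 1696364000, "indexer_wallet": "0xabc", "deployment": "QmExample",
     "latency_ms": 228, "fee": 0.0003},
    {"timestamp": 1696364101, "indexer_wallet": "0xabc", "deployment": "QmExample",
     "latency_ms": 90, "fee": 0.0001},
]
for point in aggregate(messages):
    print(point)
```

The bucket boundaries line up with the V1 sample data point, whose start_epoch and end_epoch (1696363800 and 1696364100) are exactly 300 seconds apart.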

For Kafka message consumption, we have a few options:

Kafka-rust is a pure-Rust implementation of the Kafka protocol
Rust-rdkafka is a wrapper for rdkafka, the officially-supported C client library

Apparently the recommended crate is actually not the pure rust crate, but the C client library wrapper, since it’s better maintained and the underlying library is quite polished.

For onchain interactions we can either use:

Ethers_rs is being deprecated in favour of the previous alternatives.

For the processing step, we might need to use crates to handle big integers or other data types, but it shouldn’t require anything fancy.

We might be able to reuse some logic (or a lot of logic) from the subscriptions-api repo, or at the very least take some inspiration from it.

In order to ensure correctness of each time bucket, we will need to temporarily store the Kafka messages until we have that bucket filled. Copying a little bit from subscriptions-api, we could store it in PostgreSQL.
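
One possible way to decide when a stored bucket counts as "filled" is a simple watermark: a bucket is finalized once the newest message timestamp seen so far is past the bucket end plus a grace period for late messages. This is only a sketch of that idea. The grace period, its value and the function name are assumptions, not something the GRC specifies.

```python
BUCKET_SECONDS = 300  # 5-minute base unit from the requirements
GRACE_SECONDS = 60    # hypothetical allowance for late-arriving Kafka messages


def finalized_bucket_starts(open_bucket_starts, watermark_s):
    """Return the starts of buckets that are safe to post.

    watermark_s is the largest message timestamp observed so far.
    """
    return sorted(
        start
        for start in open_bucket_starts
        if start + BUCKET_SECONDS + GRACE_SECONDS <= watermark_s
    )


open_buckets = {1696363800, 1696364100}
print(finalized_bucket_starts(open_buckets, 1696364170))  # [1696363800]
```

Messages for finalized buckets could then be deleted from temporary storage after the bucket has been posted.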

Contracts changes

Move from a Gnosis non-eventful DataEdge contract to an Arbitrum-one eventful DataEdge contract (probably slightly more expensive than before, but shouldn’t be that much worse, and would streamline the environment so that everything is now on the same chain, while making the subgraph easier to run)

Gateway changes

Add newly required fields to Kafka messages

time_behind
network_chain

Rename/Duplicate already existing fields in Kafka messages that aren’t clear

graph_env → gateway_id
network → indexed_chain
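
The rename/duplicate option for these fields could look roughly like the sketch below, where the old names are kept alongside the new ones during a transition period. Only the two field mappings come from this GRC. The function name and the keep_old flag are hypothetical.

```python
# Mappings taken from the list above: old field name -> new field name.
FIELD_ALIASES = {"graph_env": "gateway_id", "network": "indexed_chain"}


def with_v2_field_names(message, keep_old=True):
    """Return a copy of a message with the V2 field names added.

    With keep_old=True the old fields are duplicated rather than renamed.
    """
    out = dict(message)
    for old, new in FIELD_ALIASES.items():
        if old in out and new not in out:
            out[new] = out[old]
        if not keep_old:
            out.pop(old, None)
    return out


legacy = {"graph_env": "mainnet", "network": "mainnet", "query_count": 1}
print(with_v2_field_names(legacy))
print(with_v2_field_names(legacy, keep_old=False))
```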

Subgraph changes

The changes to take advantage of the new data available from the oracle should be self explanatory (add new fields or relevant entities)

For the whitelist changes we propose to add another message/topic for the subgraph to handle, that would allow something like the permissions set list that the EBO uses.
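
To illustrate the kind of behaviour this would enable, here is a minimal sketch of submitter and topic whitelists that start from hardcoded entries and can be modified by messages from an authorized address. It is not based on the EBO's actual permission model. The class, the message shape (kind, action, value) and the admin concept are all hypothetical.

```python
class Whitelists:
    """Hypothetical in-memory model of the subgraph's whitelist state."""

    def __init__(self, admins, submitters=(), topics=()):
        self.admins = set(admins)
        self.submitters = set(submitters)  # hardcoded starting entries
        self.topics = set(topics)

    def handle_update(self, sender, kind, action, value):
        """Apply a whitelist-change message; ignore it unless sent by an admin."""
        if sender not in self.admins:
            return False
        target = {"submitter": self.submitters, "topic": self.topics}.get(kind)
        if target is None or action not in ("add", "remove"):
            return False
        if action == "add":
            target.add(value)
        else:
            target.discard(value)
        return True

    def accepts(self, submitter, topic):
        """A data point is processed only if both whitelists allow it."""
        return submitter in self.submitters and topic in self.topics


wl = Whitelists(admins={"0xadmin"}, submitters={"0xgw1"}, topics={"indexer_data"})
wl.handle_update("0xadmin", "submitter", "add", "0xgw2")
print(wl.accepts("0xgw2", "indexer_data"))
```

A fully permissionless deployment could skip the accepts check entirely, or ship with its own hardcoded selection of gateways.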

Rollout Plan

No particular rollout is needed: this is not a service, but a part of the stack that operators will need to run and roll out themselves. It’s also the first version of the oracle, so there are no migrations or upgrade paths to resolve for existing operators, as it’s not already being used.

Alternative Solutions Considered

Whitelisting changes for the subgraph could be removed from the scope of the GRC, only ensuring that all the data points can be properly processed while submitters (and possibly also topics) aren’t checked (or are checked with hardcoded whitelists, requiring either resyncs or specific deployments per gateway). However, we think having a way for a canonically curated deployment to dynamically modify the list of curated gateways is preferred, even if it means the canonical view is permissioned (or eventually permissionless if whatever solution we use allows for that).
Encoding/Decoding changes: Currently the V1 oracle and subgraph simply handle calldata that is a bytes representation of a JSON object, which the subgraph decodes into a full-fledged JSON object and uses the parsed fields to differentiate the action it needs to take.
While this is quite easy to implement, it might not be the most efficient in terms of gas cost. This wasn’t such a problem for Gnosis chain, but on Arbitrum-one, even if it’s cheap, it might make running the oracle more expensive than on Gnosis.
In order to minimize those costs, we could aim to use an encoding/decoding strategy like the EBO, but this would make the implementation more complex, and the gas cost savings aren’t exactly known.
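
For reference, the JSON-as-calldata approach described above amounts to something like the following round trip. The message shape (a topic plus an IPFS hash) and the placeholder hash are hypothetical. The sketch only shows that the payload size, and therefore the calldata cost, depends directly on how the JSON is serialized.

```python
import json


def encode_calldata(message):
    """Serialize a message as compact UTF-8 JSON bytes to be sent as calldata."""
    return json.dumps(message, separators=(",", ":"), sort_keys=True).encode("utf-8")


def decode_calldata(data):
    """Parse calldata bytes back into a JSON object, as the subgraph would."""
    return json.loads(data.decode("utf-8"))


# Hypothetical message: the topic tells the subgraph which action to take.
message = {"topic": "indexer_data_points", "hash": "QmExampleHashPlaceholder"}

compact = encode_calldata(message)
verbose = json.dumps(message).encode("utf-8")

print(len(compact), len(verbose))
print("0x" + compact.hex())
print(decode_calldata(compact)["topic"])
```

Field names and separators are repeated in every JSON payload, which a more compact binary encoding would avoid. As noted above, the actual gas savings from such an encoding are not known.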

Copyright Waiver

Copyright and related rights waived via CC0.

theodus · April 30, 2024, 1:06pm

I have some questions & recommendations for the V2 schema:

Gateway operators should be identified using some address, such as the TAP sender address used for indexer payments (gateway_tap_sender?). We should also include a label for the specific gateway instance like gateway_id (gateway_instance_label?).
What are start_epoch/end_epoch? Are these the block numbers associated with the epoch boundaries? If so, shouldn’t we just use one field for the epoch number?
I’d like to stop using “200” in place of success in these field names. It’s a bit misleading since an indexer’s response with an HTTP 200 status code may not be considered successful by the gateway.
Do we need blocks behind as a metric? I don’t think it’s useful when we can instead just use seconds_behind_chain, which can translate across indexed chains.

juanmardefago · April 30, 2024, 2:02pm

Great questions!

Sounds good! Albeit what would be the difference between gateway_id and gateway_instance_label? (I was thinking of using gateway_id as a unique id for each instance, but it could be a unique id of a provider, and gateway_instance_label a unique ID for the instance of said provider)

Unsure, this is part of the V1 schema, I’m assuming it might be coming from the data science pipeline, so I wanted to respect the schema as much as possible to avoid creating compatibility issues.

Yes, makes sense, although as stated previously, we wanted to make it as compatible as possible.

Maybe duplicating the fields that start with 200_ in favour of success_ could be a good middle ground solution?

Not really, but again, mostly trying to make it as compatible as possible to reduce stress on the people integrating these products, but if there’s a ton of really not that useful fields or fields that need to be renamed, we could push for a big “breaking” change for the MVP.

theodus · April 30, 2024, 2:06pm

I might be missing something but, at least from the gateway perspective, I see no practical reason to avoid breaking changes here. The interface for data being exported from the gateway has been stuck in a “temporary” state for years.

juanmardefago · April 30, 2024, 2:20pm

Mainly compatibility is for existing products that rely on this data (GraphSeer being one, although we can always just jump from V1 to V2 whenever needed and ditch compatibility)

theodus · April 30, 2024, 2:49pm

Alright. I guess the backward compatibility requirements for the subgraph output are mostly up to GraphOps. My only major concern is that block_behind info may not be available from the gateway in the nearish future in a shift to only reporting seconds_behind.

juanmardefago · April 30, 2024, 3:05pm

Then I think it might be a good idea to break compatibility from the start and just call it a day

chris · April 30, 2024, 5:01pm

Definitely agree that now’s the time to break compat and design a schema that we all love.

ellipfra · May 5, 2024, 12:54pm

Would it be appropriate to include aggregate indexer scoring values in the schema, reflecting how gateways assess the quality of indexers? The proposed schema may include all the inputs for the ISA, yet these inputs might be weighted differently according to customer preferences.

Considering that each gateway operator will eventually implement their own unique indexer selection algorithm, standardizing this metric across gateways could prove challenging, if not unfeasible. As a result, this data may not be suitable for dashboards and reports. However, it would still provide a valuable indicator for indexers to use in their own monitoring processes.

juanmardefago · May 7, 2024, 6:07pm

I’ll have to double check if that data is available or feasible to get from the Gateway (I’d assume yes), maybe @theodus knows though!

But it sure sounds like it might be good to have some sort of knowledge of how the ISA ranks those data points. The part I think might be difficult, or might not even make sense, is deciding whether the ISA ranking applies to the whole interval or to a single query, and whether it is meaningful at all for daily aggregates.

theodus · May 7, 2024, 6:24pm

I’m currently in the middle of making ISA changes, and some of the gateway reporting issues are a result of over-fitting the metrics we report based on a very outdated view of indexer selection. Given those, I’m very hesitant to bake the indexer-selection inputs into the schema. Especially now that the gateway & indexer-selection repos are open-source, I think the focus should be on reporting the high-level off-chain values that the gateway cares about (success rate, latency, etc.). Note that the distribution of queries across the available indexers per deployment is the outcome of ISA scoring.

theodus · May 7, 2024, 6:28pm

Also, “ranking” indexers based on their individual scores would be a misleading over-simplification of how indexers are selected. The gateway doesn’t even log all the scoring info because it would probably be bad for performance. We just sample 1/1000 queries to get snapshots for debugging.
