Indexer PoI discrepancy

I would like to discuss the topic of indexer PoI discrepancies.

Indexers have been noticing these discrepancies for a long time. The default assumption is that if your PoI does not match, your database is bad and you need to resync the subgraph. But that raises a question: who actually has the error, you or the party you are comparing against? And a second question: why did the error occur in the first place?

The Ryabina team, together with the Grassets.tech team, created the POIfier tool precisely to understand how widespread this problem is and who is affected. Unfortunately, conducting any meaningful analysis required a critical mass of users of this tool, which we could not reach on our own. But thanks to Payne, the tool was added to his docker-compose setup, which greatly increased the number of users, and we can now analyze at least some data.

Let’s look at the subgraph with the largest number of reports at the moment.

subgraph: Connext Network - Gnosis
deployment id: QmXWbpH76U6TM4teRNMZzog2ismx577CkH7dzn1Nw69FcV
reports count (last epoch): 98
poi ratio: 91/7
This means that out of 98 reports, 91 indexers have the same PoI and only 7 do not. A good result, and we can now answer with confidence which PoI is valid for this subgraph. What is interesting, though, is that the remaining 7 indexers, who are hard to accuse of collusion and can safely be assumed to be unconnected, also share the same PoI among themselves. In other words, the same error occurred independently on several indexer servers, and it causes the hash to be calculated identically but incorrectly.
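To make the "poi ratio" notion concrete, here is a minimal sketch of how such a ratio can be computed from a set of reports. The indexer names and PoI hashes are invented placeholders, not real POIfier data:

```python
from collections import Counter

# Hypothetical reconstruction of one epoch's reports for a single deployment.
# Each report pairs an indexer with the PoI hash it submitted; both the
# indexer names and the hashes are placeholders.
reports = [("indexer%02d" % i, "0xaaa") for i in range(91)]
reports += [("indexer%02d" % i, "0xbbb") for i in range(91, 98)]

# Tally how many indexers reported each distinct PoI.
poi_counts = Counter(poi for _, poi in reports)

# The "poi ratio" is simply the cluster sizes, largest first.
ratio = "/".join(str(n) for n in sorted(poi_counts.values(), reverse=True))
print(ratio)  # prints 91/7
```

A ratio like 91/7 therefore says nothing about which hash is correct by itself; it only shows how the reports split into clusters.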

It is difficult to name the reason, but in my view an investigation into why this happened is definitely warranted. What hardware and software do these indexers use? Will resyncing this subgraph help them, or will the bug reproduce itself?

It is important that the Graph protocol development team make this investigation official. This matters because the indexer slashing system is tied precisely to the correctness of the PoI sent to the network. If we do not understand why indexers diverge, it is unclear how non-malicious indexers can be sure they will not be accused of submitting an incorrect PoI when closing an allocation.

Here are some more interesting examples:

subgraph: Connext Network - Gnosis
deployment id: QmcdDo39c8ztsCViQ5kQotjNj4iL3QVq9tnQEqVJXffcD5
reports count: 14
poi ratio: 7/7

subgraph: Uniswap v3 Ethereum
deployment id: QmcPHxcC2ZN7m79XfYZ77YmF4t9UCErv87a9NFKrSLWKtJ
reports count: 5
poi ratio: 1/1/1/1/1

subgraph: Radicle
deployment id: QmeB3CddhtZqNW3FVvsbNeEt2Ek1VSjdwNBU9fJEYrsC3o
reports count: 5
poi ratio: 3/2
This subgraph is notable for the fact that its poi ratio changes from epoch to epoch: in some epochs all 5 PoIs match, in others the ratio is 3/2.

P.S. Access to the POIfier dashboard is limited; you have to be an indexer to enter. Log in via Metamask with any address associated with an indexer (operator, beneficiary, or the indexer address itself). If someone from the Graph team needs access but is not an indexer, let me know and we will add any address to the whitelist.


@warfollowsme, thank you for raising this! Fully agree that this is an important issue that needs attention. Ensuring that POI discrepancies are minimised is important to making the protocol’s dispute and arbitration processes effective, and to ensuring that the security model of the network is upheld.

Your work at Ryabina (alongside the Grassets.tech team) on POIfier has been an exemplary effort in collecting data to understand the discrepancies that exist today. Thank you! From the data you have presented above, one can already observe a few different archetypes of nondeterminism emerging (e.g. POI forks vs total nondeterminism).

I think we need to prioritise an effort to understand and root-cause each instance of known non-determinism. The graph-node team will be essential to understanding and fixing possible determinism issues originating in graph-node. We will need a good way of separating those issues from ones arising from dependencies in the stack, like blockchain nodes or JSON-RPC load balancers. Aside from the technical work here, there’s likely a lot of operational work to do in collecting stack information and logs.

I also think it would be good for the Foundation or Council to define a set of “mainnet entry requirements” including data integrity KPIs (e.g. % of subgraphs displaying non-determinism issues) for a chain to meet before being eligible for the decentralised network mainnet and indexing rewards. This will create a clear target and incentives to focus efforts on squashing nondeterminism issues.


I agree with @chris - this is very important to the network’s health and quality of service and fundamental to ensuring that rewards are fairly distributed.

The data science team at Edge & Node has expressed interest in accessing the dashboard. They are already consolidating some data about a few indexers, and it would be interesting to compare that with the POIfier data. @warfollowsme please DM me on Discord so we can chat about how to get access.

Regarding proofs of indexing, and more generally the reward distribution mechanism, we need to work on:

  1. Monitoring the consensus of proofs submitted by all indexing processes. Multiple parties already have tools for this today, and we can iterate on them for more precise and global coverage. We can identify clusters where groups diverge, which can help find something those indexers have in common, such as running a different version of an RPC node. Additionally, we may find single rogue indexers that any party observing the network can dispute.

  2. Commitment to determinism

    Each team working on the core indexing tech should engage in digging into the root cause of any discrepancy and fixing it as a top priority.

  3. Improving Arbitration Bandwidth

    We can extend the capacity of Arbitrators by adding more members and developing tools for this purpose. In the long-term, it may be possible to eliminate the need for Arbitration altogether by using verifiable computation.

  4. A transparent Subgraph Availability Oracle (SAO)

    Sharing logs publicly so that everybody can see how and why the oracle denies or accepts subgraphs for rewards distribution. Along the same lines, multiple core devs should run SAO processes.
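The cluster detection described in point 1 can be sketched in a few lines: group indexers by the PoI they reported, per deployment, so that minority clusters stand out. Everything below (deployment ids, indexer names, hashes) is invented for illustration and is not part of any existing tool:

```python
from collections import defaultdict

# Hypothetical epoch snapshot: deployment id -> {indexer: reported PoI hash}.
reports = {
    "QmDeploymentA": {"idx1": "0xaaa", "idx2": "0xaaa", "idx3": "0xbbb"},
    "QmDeploymentB": {"idx1": "0x111", "idx2": "0x222", "idx3": "0x222"},
}

def poi_clusters(epoch_reports):
    """Group indexers by the PoI they reported, per deployment."""
    clusters = {}
    for deployment, by_indexer in epoch_reports.items():
        groups = defaultdict(list)
        for indexer, poi in by_indexer.items():
            groups[poi].append(indexer)
        # Largest cluster first; the minority clusters are the candidates
        # for investigation (shared RPC provider, graph-node version, etc.).
        clusters[deployment] = sorted(groups.values(), key=len, reverse=True)
    return clusters

for deployment, groups in poi_clusters(reports).items():
    ratio = "/".join(str(len(g)) for g in groups)
    print(deployment, "ratio:", ratio, "minority clusters:", groups[1:])
```

Note that this only surfaces divergence; deciding which cluster is correct still requires root-causing, exactly as discussed above.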

Each of these items could really be a GIP. I’d like to see core devs and the community proposing improvements in these areas, and the Council working on grants to get those improvements going. Personally, I have some ideas for items 3 and 4 that I will share soon.


Please check your Discord DMs.