GIP-0042 A World Of Data Services

GIP: 0042
Title: A World Of Data Services
Authors: jannis@edgeandnode.com, adam@edgeandnode.com
Created: 2022-12-01
Updated: 2023-02-21
Stage: Draft


Abstract

This GIP aims to establish a framework that allows the data services/APIs
offered by The Graph Network to be expanded over time, without requiring
invasive changes to the protocol each time. The proposal is rooted in the
realization that subgraphs are not well-suited for a number of use cases and
that a healthy and efficient indexer network requires specialization and reuse
of data already generated by the network.

The primary change proposed by this GIP is to abstract subgraphs in the protocol
contracts into a more general concept of data services that consist of
publishable data sets. This necessarily also affects some of the logic around
allocations and rewards as well as discovery of the new data services/sets.

Motivation

Two definitions upfront:

  1. A data service is a type of data set or API. Examples of data services are
    subgraphs, substreams or Firehoses.
  2. A data set is a specific instance of a data service, e.g. a subgraph for Decentraland or a
    substream for Uniswap swaps.

In other words: a data service is the technology or type of API, a data set
refers to a specific set of data.

The motivations for the proposed changes are manifold.

Firstly, based on feedback from the developer community over the past few years,
it has become clear that subgraphs are not well-suited for a number of use cases
such as analytics pipelines or combining subgraph data with external and
potentially off-chain data sources. New developments in The Graph ecosystem,
most notably Firehose and Substreams, are emerging as solutions to address some
of these use cases. Other types of APIs will undoubtedly follow and it is vital
for The Graph to be able to support them natively in the network.

Secondly, to maintain a scalable, healthy, diverse indexer network, The Graph
needs to allow for specialization and outsourcing of processing power and
storage among indexers. For example, one indexer might specialize in providing
raw, low-level blockchain data in the form of a Firehose, another might focus
entirely on substreams processing, yet another might focus on indexing and
serving subgraphs. The interactions between these indexers require a
decentralized data market. As luck would have it, The Graph has already
established such a market around subgraphs. It merely needs to be extended to
support more data services.

High Level Description

This GIP proposes to make it possible to extend the data services offered by The
Graph Network. The GIP only covers changes proposed at the protocol layer, i.e.
in the smart contracts. More specific GIPs for the first additional data
services will follow soon. This section describes how the contracts could be
changed at a high level.

Three main changes are proposed:

  1. GNS contract: Instead of assuming that everything that is created and published is
    a subgraph or a subgraph version, add a notion of data service types such that:

    • the council can add new data service types,
    • each data service type comes with basic metadata such as a name and description,
    • data sets of any supported data service type can be created,
    • new data set versions of any supported data service type can be published,
    • an IPFS manifest hash is required for every new data set version that is
      published.
  2. Staking contract: Instead of assuming that all allocations are against
    subgraph deployments, allow indexers to specify the data service type when
    creating an allocation such that:

    • each allocation is associated with a data service type,
    • each allocation is associated with a data set ID.
  3. Verifiability router: Instead of using a single dispute mechanism, expand the
    contracts to allow a different dispute mechanism for each supported data
    service type, and to allow these dispute mechanisms to change over time. The
    existing dispute mechanism is kept for subgraphs. Additional data service
    types can then specify their own contract based on disputes or verifiable
    proofs.

How consumers, gateways and indexers need to be updated to support new data
service types is left to the GIPs that introduce these new data service types.
The detailed specification below illustrates how the changes above can be
introduced in a completely backwards-compatible way.
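
For illustration only, here is a minimal Solidity sketch of what the council-controlled registry of data service types described above might look like. Every name in it (the contract, struct, functions and fields) is a hypothetical placeholder chosen to make the idea concrete, not a proposed interface; the actual design is deferred to the detailed specification.

    // Illustrative sketch only; all names and signatures are hypothetical.
    pragma solidity ^0.8.0;

    contract DataServiceRegistrySketch {
        struct DataServiceType {
            string name;          // e.g. "subgraph", "substream", "firehose"
            bytes32 metadataHash; // IPFS hash of a name/description document
            bool exists;
        }

        address public council;
        mapping(uint32 => DataServiceType) public dataServiceTypes;

        modifier onlyCouncil() {
            require(msg.sender == council, "only council");
            _;
        }

        constructor(address _council) {
            council = _council;
        }

        // The council can add (or update) a supported data service type.
        function setDataServiceType(
            uint32 id,
            string calldata name,
            bytes32 metadataHash
        ) external onlyCouncil {
            dataServiceTypes[id] = DataServiceType(name, metadataHash, true);
        }

        function isRegistered(uint32 id) public view returns (bool) {
            return dataServiceTypes[id].exists;
        }

        // Publish a new data set version of any supported type; every version
        // must reference an IPFS manifest hash.
        function publishDataSetVersion(
            uint32 dataServiceType,
            uint256 dataSetId,
            bytes32 manifestHash
        ) external {
            require(isRegistered(dataServiceType), "unknown data service type");
            // ... record the new version (dataSetId, manifestHash) against the data set ...
        }
    }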

Detailed Specification

TBD.

This will require input from the smart contract developers working on The Graph.
This team will know best how to change the contracts to make the above possible.

Backwards Compatibility

TBD.

It should be possible to maintain the current publishing and allocation
behavior by introducing additional contract functions rather than changing the
existing ones. But whether this is possible largely depends on the changes
specified in Detailed Specification.

We anticipate some changes in the network subgraph to be necessary, but this too
remains to be seen.

Dependencies

None.

Risks and Security Considerations

This proposal will likely require a lot of changes especially in the GNS
contract. Since GNS publishing (and the discovery experience in e.g. The Graph
Explorer) are decoupled from allocation management and rewards, we could do the
development (and even the detailed specification) in phases, starting with just
adding the notion of council-controlled data service types to the GNS and
allowing allocations to be associated with data service types.

This would immediately allow indexers to allocate towards new types of data sets
(like substreams or Firehoses) and unblock integrating new data service types
end to end. Developers could start using the new services and discovery of these
new data sets could follow in a second phase.

Any changes proposed to the smart contracts will, of course, require an audit.

Copyright Waiver

Copyright and related rights waived via
CC0.

18 Likes

Hi all!

I would like to start hashing out what we’d need to do in order to allow other types of data services (e.g. Firehose, substreams) to be offered as part of The Graph Network.

The above GIP draft describes what, roughly, would need to be added or changed in the protocol contracts. It leaves out the details though, because I believe the smart contracts team is much more qualified to think about how we could pull this off without being too intrusive (i.e. breaking as little as possible in terms of external behavior).

I would like to propose, however, that we consider the narrowed focus / phased development strategy outlined in the Risks and Security Considerations section. If we skip discovery via the GNS, we may only have to change very little and can potentially integrate new data services much sooner.

3 Likes

When it comes to managing data service types, my immediate intuition is that we’d probably want something like the typical create/update functions (guarded with a council-only modifier) and a data structure that describes a data service type.

Put that in a mapping(uint32 => DataServiceType) dataServiceTypes where the uint32 is the ID of the data service. Make the DataServiceType include the ID as well, a verifiability router (?), plus a reference to some metadata on IPFS (?).

Then, for the scope without discovery, simply add a new allocation function like allocateWithType(uint32 dataServiceType, ...usual parameters...), make sure it checks that a data service type with that ID exists, change the internals of how allocations are represented to include the data service type (if needed) and… we’re good?

It’s probably not that simple, but that’s the sketch I have in my head.
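
To complement that sketch, here is an equally hypothetical Solidity fragment for the allocation side, assuming an isRegistered view on a registry like the one sketched earlier. The Allocation layout and all names are made up for illustration; the real Staking contract carries much more state and many more checks.

    // Illustrative sketch only; not the actual Staking contract layout.
    pragma solidity ^0.8.0;

    interface IDataServiceRegistry {
        // assumed view into a council-managed data service type registry
        function isRegistered(uint32 dataServiceType) external view returns (bool);
    }

    contract StakingSketch {
        struct Allocation {
            address indexer;
            uint32 dataServiceType;      // which data service type this allocation targets
            bytes32 dataSetDeploymentId; // generalization of today's subgraphDeploymentId
            uint256 tokens;
        }

        IDataServiceRegistry public registry;
        mapping(address => Allocation) public allocations; // keyed by allocation ID

        constructor(IDataServiceRegistry _registry) {
            registry = _registry;
        }

        // The existing allocate(...) would keep its current behavior for subgraphs;
        // this new entry point lets indexers allocate toward any registered type.
        function allocateWithType(
            uint32 dataServiceType,
            bytes32 dataSetDeploymentId,
            uint256 tokens,
            address allocationId
        ) external {
            require(registry.isRegistered(dataServiceType), "unknown data service type");
            allocations[allocationId] = Allocation({
                indexer: msg.sender,
                dataServiceType: dataServiceType,
                dataSetDeploymentId: dataSetDeploymentId,
                tokens: tokens
            });
            // ... stake checks, reward snapshots, etc. as in the current allocate() ...
        }
    }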

2 Likes

Nice writeup! My first impression is this makes a lot of sense, and sounds feasible from a smart contracts point of view (though it will be a fair amount of work).

I would note a few things:

  • If we want these new data services to accrue indexer rewards, we also need to abstract the data service type in the Curation contract somehow, as currently curation is tied to a subgraphDeploymentId (which could become a dataSetDeploymentId?) and this is what ultimately defines the rewards for indexing a data set.
  • At first sight I think the main thing tying allocations in the Staking contract to the specific concept of a subgraph is the subgraphDeploymentId (which is easily changeable for a dataSetDeploymentId as mentioned above), but also more importantly the Proof of Indexing that is set to be 32 bytes. If different data service types will have different verifiability requirements, this POI concept might be insufficient (e.g. if some service uses SNARKs that don’t fit in 32 bytes) - I suppose we can always make it so that, in those cases, the POI becomes an id/pointer to the verifiability data in some other contract, or even an IPFS hash if there’s no need for on-chain verification.

There’s probably a lot of details we’ll need to think of when turning this into a concrete implementation, but overall I really like it.

2 Likes

If we want these new data services to accrue indexer rewards

I think we can do just query fees. That said, it would be cool to have curation to help users/developers find the best quality / most useful substreams, for instance.

If different data service types will have different verifiability requirements, this POI concept might be insufficient

Good point. I think I like the idea of allocations being closed by pointing to a data-service-type-specific verifiability record in a different contract, instead of it having to be a bytes32. You’re right that the current POIs will not work everywhere. The whole more-flexible verifiability topic is something I haven’t thought much about yet, but there’s some design work necessary here.
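
To make the “pointer instead of a bytes32 POI” idea a bit more tangible, a rough and entirely hypothetical sketch could look like this, assuming the council registers one verifier contract per data service type. None of these names or signatures are proposed interfaces; this is just to illustrate the shape of a verifiability router.

    // Hypothetical sketch of type-specific verification on allocation close.
    pragma solidity ^0.8.0;

    // Each data service type would register its own verifier contract; subgraphs
    // could keep the existing POI/arbitration flow, while other types check
    // SNARKs or point at a record held in another contract or on IPFS.
    interface IVerifier {
        function acceptClose(address allocationId, bytes calldata proofOrRecord)
            external
            returns (bool);
    }

    contract VerifiabilityRouterSketch {
        mapping(uint32 => IVerifier) public verifierFor; // set by the council per type

        function closeAllocation(
            address allocationId,
            uint32 dataServiceType,
            bytes calldata proofOrRecord // a POI, a SNARK, or a pointer to off-chain data
        ) external {
            require(
                verifierFor[dataServiceType].acceptClose(allocationId, proofOrRecord),
                "verification failed"
            );
            // ... release stake / distribute rewards as appropriate ...
        }
    }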

2 Likes

I agree that for some things (e.g. substreams) where there is no pre-indexing, query fees would be sufficient. But for other services like Firehose, where there is considerable work done before serving queries, and that work could be verifiable, I think it would make sense to also include indexer rewards. Otherwise indexers who serve these services would be at a disadvantage vs. indexers who focus on subgraphs. I think it would be good to distinguish between the two kinds in the data service type definition, rather than leave subgraphs as the only service that gets rewards? I know there’s some discussion/research about indexer rewards going on elsewhere, so it might be good to think about how this could inform that discussion.

Definitely, this will be a (fun) rabbithole.

3 Likes

I think there is a distinction between a Firehose, which is essentially running an Ethereum client plus some additional storage infrastructure to provide generic block streaming for a blockchain, and a subgraph, which is a much more specific ETL pipeline & query interface. To even provide substreams, you need to run a Firehose for the relevant network.
That’s not to say that there isn’t work required, but a query-only setup is much simpler in a lot of ways, and simplicity and legibility are very real and significant benefits as we look to introduce new services.

That may, as you say, put indexers who focus on new services at a disadvantage, but if there isn’t a query price for the data which is acceptable for developers, and profitable for indexers, then the new data services have a more fundamental problem. If that price does exist, then there is a good opportunity for a market to emerge. Introducing an indexing reward into the mix makes it harder to understand that value exchange (though of course rewards can be useful to bootstrap a market).

3 Likes

Great start, thanks for getting the conversation going @jannis. I personally think a protocol subsidy for the fixed cost of pre-processing in order to serve queries could still make a lot of sense in bootstrapping growth of these new data services.

My understanding is that for substreams there are essentially two (or more) query access patterns:

  1. Single ad-hoc query where all processing work to serve the query may be done on-demand.
  2. Query subscription where the data consumer always wants the latest result of a query and the Indexer is doing ongoing processing work to always have that available (looks a lot like indexing a subgraph today).

In the case of Firehose, unless we can get each chain that adopts Firehose to incorporate it into their consensus rules (no chain is currently planning on doing this AFAIK), the verifiability story for that chain may need to sit a layer above consensus (i.e., as a refereed game/roll-up), in which case I think there is an additional reason to incentivize Indexers to do that processing work and help secure the Firehose integration for the network.

5 Likes

I’m really excited about this proposal. I’d like to focus on the topic of the verifiability router and propose some reframing.

In the proposed changes we have:

  1. Verifiability router: Instead of using a single dispute mechanism, expand the
    contracts to allow a different dispute mechanism for each supported data
    service type, and to allow these dispute mechanisms to change over time. The
    existing dispute mechanism is kept for subgraphs. Additional data service
    types can then specify their own contract based on disputes or verifiable
    proofs.

Would it be appropriate to expand the scope of “dispute mechanisms” to encompass “consensus mechanisms”? I see a dispute as something that happens after there has been a failure in consensus. I would be excited to see us move toward a “programmable consensus” future where a consumer gets to decide what consensus and dispute mechanism they require. Here are three examples of different consensus and dispute mechanisms. I’m focusing on subgraphs, but I think the reasoning can extend to all of our future data services:

  1. Subgraph A deployment: no consensus required. Consumers of this deployment trust that Indexers in The Graph are doing their best job. Not willing to pay for zk, POI checking, human arbitration, or anything other than the query result. Likely only used for low-impact data. Cheap but no dispute is available.
  2. Subgraph B deployment: n out of m Indexer query results required. This consumer only trusts m Indexers in The Graph and will pay n out of m of them to return the query result. No dispute is available.
  3. Subgraph C deployment: zk-proofs of indexing and querying will be returned with all queries. If a proof does not verify, the Indexer gets slashed. Expensive but useful for critical data and does not require trusted relationships.

The VI/VQ discussion is still early, but in general, there is chatter among core devs that different subgraphs may require different approaches for verifiability. (And of course, substreams, Firehose, etc. will, as already said in the GIP intro.) E.g. erc-20 vs. erc-721 vs. custom contract may be more efficiently verified in different ways. Additionally, as given in use case 1, some subgraph data may not be valuable enough to have any verifiability guarantees.

Basically, I would like consumers to have as much flexibility as possible/practical in deciding the verifiability guarantees that come along with their data. From the above, I hope it’s clear that I like this idea because it would allow us to price queries more efficiently. But that’s not the only reason :slight_smile: I’m also a fan of programmable consensus because it could potentially more quickly get us to instant verifiability and dispute resolution for some use cases. For example, I would love to get access to verified aggregated erc-20 data that’s always up to the chain head, e.g. the average USDC-ETH ratio. Such data, when authenticated, could then be immediately used as calldata on-chain with The Graph serving as a just-in-time oracle.
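
Purely as a thought experiment (nothing here exists in the contracts today, and all names are made up), the three deployment examples above could be captured as a per-deployment policy along these lines:

    // Hypothetical sketch of per-deployment "programmable consensus" settings.
    pragma solidity ^0.8.0;

    contract ConsensusPolicySketch {
        enum ConsensusMode {
            None,    // trust the indexer; cheapest, no dispute path (Subgraph A)
            NOutOfM, // require n matching responses out of m chosen indexers (Subgraph B)
            ZkProof  // every response carries a proof; slashing on failure (Subgraph C)
        }

        struct Policy {
            ConsensusMode mode;
            uint8 n; // only meaningful for NOutOfM
            uint8 m;
        }

        // Chosen per deployment; who gets to set it (consumer, data set owner,
        // gateway) and how it interacts with pricing and slashing are open questions.
        mapping(bytes32 => Policy) public policyFor;

        function setPolicy(bytes32 deploymentId, ConsensusMode mode, uint8 n, uint8 m) external {
            policyFor[deploymentId] = Policy(mode, n, m);
        }
    }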

9 Likes

Hey Sam, I like this idea of exploring consensus as a service to whoever is using the data. The world of data services story forces us to think about the network in a general way and find the interfaces that are common to any data provider.

I’d say that a subgraph deployment (or any data transformation) should just describe how to source and map the data; then, once multiple indexers are working on a particular data source, a consumer can decide from the universe of indexers whether they want to use the ones under consensus or just any one.

5 Likes

I am really excited about this proposal and the possibilities it opens up. There have been really insightful points made so far on query processing and verifiability. I would like to build on this proposal by highlighting an interesting service that opens new markets for subgraphs, which I believe is also very important.

One of the exciting possibilities of this service is the ability to expose an ETL data service layer on top of subgraphs, which can serve SQL queries to consumers. We can call it “SQL as a Service”. This would allow all the power of SQL, including business analytics, transformations, complex aggregations, and slicing/dicing, to be applied to multi-chain subgraphs.

SQL is the de-facto standard for business analytics, and I strongly believe that The Graph will have a massive impact on data analytics, much like Dune/Footprint, if SQL as a service is exposed. With the integration of Firehose/substreams on this service, real-time business analytics can be provided, addressing the data refresh interval issues faced by current analytics platforms.

The service will open up new markets and consumers for subgraphs. For instance, domains such as Business Analytics, Financing, Taxes, and Accounting, where complex queries and multi-step transformations could be done easily, would benefit greatly from this service.

Moreover, users could easily plug visualization layers on top of this service, allowing them to create compelling visual representations of their data.

I believe that it’s important to give indexers or data providers the flexibility to choose whether or not they want to run the ETL and expose this service. It should not be tightly coupled with a graph node. Indexers who choose to run the ETL and serve queries could be incentivized with additional rewards.

I would also like to initiate an open discussion point: if indexers are serving SQL as a service on top of subgraphs, which requires extra processing and storage, should they receive additional rewards for running ETL, aside from query rewards?

Looking forward to hearing your thoughts on this matter.

12 Likes

I’m happy to see this proposal being discussed.

From a business perspective, I believe this is key to addressing risks The Graph faces from competition with centralized SaaS alternatives.

The Pinax team specializes in operating Firehose and Substreams, and we already see demand for these services. If The Graph doesn’t move fast to make these services available, the developers seeking them will go elsewhere.

7 Likes

Hello all,

Last week during Indexer Office Hours (IOH), we hosted a conversation on this topic. Please check out the timestamped link below if you’d like to watch the section of the recording relating to this topic:

9 Likes

Came across another use case that made me think of this GIP today.

On a lending market subgraph, they want to gather the bad debt for a liquidation dashboard. To do this they need to filter on a value (LoanToValue) that is a derivation of multiple fields from various tables:

LoanToValue = (position.debt * token.priceUSD) / (position.collateral * token.priceUSD)

This is fairly easy to put together with a SQL query. Adding it into the subgraph as an entity would require an expensive loop that updates an entity each time a) a user balance changes, b) interest accrues, and c) a token price changes.

5 Likes