GIP-0041 A World Of Data Services


GIP: 0041
Title: A World Of Data Services
Authors: jannis@edgeandnode.com, adam@edgeandnode.com
Created: 2022-12-01
Updated: 2022-12-01
Stage: Draft


Abstract

This GIP aims to establish a framework that allows the data services/APIs
offered by The Graph Network to be expanded over time, without invasive changes
to the protocol each time. The proposal is rooted in the realization that
subgraphs are not well-suited for a number of use cases and that a healthy and
efficient indexer network requires specialization and reuse of data already
generated by the network.

The primary change proposed by this GIP is to abstract subgraphs in the protocol
contracts into a more general concept of data services that consist of
publishable data sets. This necessarily also affects some of the logic around
allocations and rewards as well as discovery of the new data services/sets.

Motivation

Two definitions upfront:

  1. A data service is a type of data set or API. Examples of data services are
    subgraphs, substreams or Firehoses.
  2. A data set is a specific instance of a data service, e.g. a subgraph for
    Decentraland or a substream for Uniswap swaps.

In other words: a data service is the technology or type of API, a data set
refers to a specific set of data.

The motivations for the proposed changes are manifold.

Firstly, based on feedback from the developer community over the past few years,
it has become clear that subgraphs are not well-suited for a number of use cases
such as analytics pipelines or combining subgraph data with external and
potentially off-chain data sources. New developments in The Graph ecosystem,
most notably Firehose and Substreams, are emerging as solutions to address some
of these use cases. Other types of APIs will undoubtedly follow, and it is
vital for The Graph to be able to support them natively in the network.

Secondly, to maintain a scalable, healthy, diverse indexer network, The Graph
needs to allow for specialization and outsourcing of processing power and
storage among indexers. For example, one indexer might specialize in providing
raw, low-level blockchain data in the form of a Firehose, another might focus
entirely on substreams processing, yet another might focus on indexing and
serving subgraphs. The interactions between these indexers require a
decentralized data market. As luck would have it, The Graph has already
established such a market around subgraphs. It merely needs to be extended to
support more data services.

High Level Description

This GIP proposes to make it possible to extend the data services offered by The
Graph Network. The GIP only covers changes proposed at the protocol layer, i.e.
in the smart contracts. More specific GIPs for the first additional data
services will follow soon. This section describes how the contracts could be
changed at a high level.

Three main changes are proposed:

  1. GNS contract: Instead of assuming that everything that is created and published is
    a subgraph or a subgraph version, add a notion of data service types such that:

    • the council can add new data service types,
    • each data service type comes with basic metadata such as a name and description,
    • data sets of any supported data service type can be created,
    • new data set versions of any supported data service type can be published,
    • an IPFS manifest hash is required for every new data set version that is
      published.
  2. Staking contract: Instead of assuming that all allocations are against
    subgraph deployments, allow indexers to specify the data service type when
    creating an allocation such that:

    • each allocation is associated with a data service type,
    • each allocation is associated with a data set ID.
  3. Verifiability router: Instead of using a single dispute mechanism, expand the
    contracts to allow a different dispute mechanism for each supported data
    service type, and to allow these dispute mechanisms to change over time. The
    existing dispute mechanism is kept for subgraphs. Additional data service
    types can then specify their own contract based on disputes or verifiable
    proofs.
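The three changes can be pictured with a small model. The sketch below is in Python rather than Solidity, purely for illustration; every name in it (`DataServiceType`, `dispute_contract_for`, the example addresses and IDs) is hypothetical and not part of this proposal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataServiceType:
    """A council-registered data service type (all field names hypothetical)."""
    type_id: int
    name: str
    description: str
    dispute_contract: str  # change 3: per-type dispute/verifiability mechanism


@dataclass
class Allocation:
    indexer: str
    type_id: int      # change 2: every allocation names a data service type...
    data_set_id: str  # ...and the data set it is allocated against
    tokens: int


class ProtocolModel:
    def __init__(self, council: str):
        self.council = council
        self.types: dict[int, DataServiceType] = {}
        self.allocations: list[Allocation] = []

    def add_data_service_type(self, caller: str, t: DataServiceType) -> None:
        # change 1: only the council can register new data service types
        if caller != self.council:
            raise PermissionError("only the council can add data service types")
        self.types[t.type_id] = t

    def allocate(self, indexer: str, type_id: int,
                 data_set_id: str, tokens: int) -> None:
        # change 2: allocations must reference a registered data service type
        if type_id not in self.types:
            raise ValueError("unknown data service type")
        self.allocations.append(Allocation(indexer, type_id, data_set_id, tokens))

    def dispute_contract_for(self, type_id: int) -> str:
        # change 3: disputes are routed per data service type
        return self.types[type_id].dispute_contract


# Example: subgraphs keep their existing dispute mechanism; a hypothetical
# substreams type registers its own.
p = ProtocolModel(council="0xCouncil")
p.add_data_service_type("0xCouncil",
    DataServiceType(0, "Subgraph", "existing service", "0xDisputeManager"))
p.add_data_service_type("0xCouncil",
    DataServiceType(1, "Substreams", "new service", "0xSubstreamsDisputes"))
p.allocate("0xIndexer", 1, "QmSomeManifest", 10_000)
```

The point of the model is only that the three changes compose cleanly: the registry (change 1) gates allocations (change 2) and carries the dispute routing (change 3) without the Staking logic needing to know anything service-specific.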

How consumers, gateways and indexers need to be updated to support new data
service types is left to the GIPs that introduce these new data service types.
The detailed specification below illustrates how the changes above can be
introduced in a completely backwards-compatible way.

Detailed Specification

TBD.

This will require input from the smart contract developers working on The
Graph, who will know best how to change the contracts to make the above
possible.

Backwards Compatibility

TBD.

It should be possible to maintain the current publishing and allocation
behavior by introducing additional contract functions rather than changing the
existing ones. Whether this is possible largely depends on the changes
specified in the Detailed Specification.

We anticipate some changes in the network subgraph to be necessary, but this too
remains to be seen.

Dependencies

None.

Risks and Security Considerations

This proposal will likely require a lot of changes, especially in the GNS
contract. Since GNS publishing (and the discovery experience in e.g. The Graph
Explorer) is decoupled from allocation management and rewards, we could do the
development (and even the detailed specification) in phases, starting with just
adding the notion of council-controlled data service types to the GNS and
allowing allocations to be associated with data service types.

This would immediately allow indexers to allocate towards new types of data sets
(like substreams or Firehoses) and unblock integrating new data service types
end to end. Developers could start using the new services and discovery of these
new data sets could follow in a second phase.

Any changes proposed to the smart contracts will, of course, require an audit.

Copyright Waiver

Copyright and related rights waived via
CC0.

Hi all!

I would like to start hashing out what we’d need to do in order to allow other types of data services (e.g. Firehose, substreams) to be offered as part of The Graph Network.

The above GIP draft describes what, roughly, would need to be added or changed in the protocol contracts. It leaves out the details though, because I believe the smart contracts team is much more qualified to think about how we could pull this off without being too intrusive (i.e. breaking as little external behavior as possible).

I would like to propose, however, that we consider the narrowed focus / phased development strategy outlined in the Risks and Security Considerations section. If we skip discovery via the GNS, we may only have to change very little and can potentially integrate new data services much sooner.

When it comes to managing data service types, my immediate intuition is that we’d probably want something like the typical create/update functions (guarded with a council-only modifier) and a data structure that describes a data service type.

Put that in a mapping(uint32 => DataServiceType) dataServiceTypes where the uint32 is the ID of the data service. Make the DataServiceType include the ID as well, a verifiability router (?), plus a reference to some metadata on IPFS (?).

Then, for the scope without discovery, simply add a new allocation function like allocateWithType(uint32 dataServiceType, ...usual parameters...), make sure it checks that a data service type with that ID exists, change the internals of how allocations are represented to include the data service type (if needed) and… we’re good?

It’s probably not that simple, but that’s the sketch I have in my head.
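One nice property of that sketch is that it keeps the existing allocate entry point intact, which is the backwards-compatibility shape the GIP's Backwards Compatibility section hints at (new functions rather than changed ones). A hedged Python model of that idea — only allocateWithType and the dataServiceTypes mapping come from the sketch above, everything else (names, the fixed subgraph type ID) is assumed:

```python
SUBGRAPH_TYPE = 0  # assumed: the pre-existing subgraph service gets a fixed type ID


class StakingModel:
    """Python model of the two allocation entry points; not actual contract code."""

    def __init__(self, data_service_types: set[int]):
        # stands in for the council-managed dataServiceTypes mapping
        self.data_service_types = data_service_types
        self.allocations: dict[str, tuple[int, str, int]] = {}

    def allocate_with_type(self, allocation_id: str, data_service_type: int,
                           deployment_id: str, tokens: int) -> None:
        # new function: rejects unknown data service types
        if data_service_type not in self.data_service_types:
            raise ValueError("unknown data service type")
        self.allocations[allocation_id] = (data_service_type, deployment_id, tokens)

    def allocate(self, allocation_id: str, subgraph_deployment_id: str,
                 tokens: int) -> None:
        # existing function kept unchanged for indexers: it simply delegates
        # with the subgraph type, so current behavior is preserved
        self.allocate_with_type(allocation_id, SUBGRAPH_TYPE,
                                subgraph_deployment_id, tokens)
```

Under this shape, indexers who only ever deal with subgraphs never need to change anything, while type-aware indexers opt in to the new function.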

Nice writeup! My first impression is this makes a lot of sense, and sounds feasible from a smart contracts point of view (though it will be a fair amount of work).

I would note a few things:

  • If we want these new data services to accrue indexer rewards, we also need to abstract the data service type in the Curation contract somehow, as currently curation is tied to a subgraphDeploymentId (which could become a dataSetDeploymentId?) and this is what ultimately defines the rewards for indexing a data set.
  • At first sight I think the main thing tying allocations in the Staking contract to the specific concept of a subgraph is the subgraphDeploymentId (which is easily changeable to a dataSetDeploymentId as mentioned above), but also, more importantly, the Proof of Indexing, which is fixed at 32 bytes. If different data service types will have different verifiability requirements, this POI concept might be insufficient (e.g. if some service uses SNARKs that don’t fit in 32 bytes). I suppose we can always make it so that, in those cases, the POI becomes an id/pointer to the verifiability data in some other contract, or even an IPFS hash if there’s no need for on-chain verification.
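To picture the second point — a hedged sketch, not a concrete proposal: the existing 32-byte POI slot could stay as-is, with data service types whose proofs don't fit storing a 32-byte commitment that points at the full verifiability data held elsewhere. The function name and the choice of SHA-256 as the commitment are illustrative assumptions.

```python
import hashlib


def poi_field(proof: bytes) -> bytes:
    """Value stored in the existing 32-byte POI slot when closing an allocation.

    Today's subgraph POIs are exactly 32 bytes and are stored directly. For a
    hypothetical data service type whose proof is larger (e.g. a SNARK), the
    slot instead holds a 32-byte commitment/pointer to the full verifiability
    data kept in some other contract, or on IPFS.
    """
    if len(proof) == 32:
        return proof
    return hashlib.sha256(proof).digest()  # always 32 bytes
```

Either way the Staking contract keeps storing a bytes32, and only the per-type verifiability contract needs to know how to interpret it.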

There’s probably a lot of details we’ll need to think of when turning this into a concrete implementation, but overall I really like it.

If we want these new data services to accrue indexer rewards

I think we can do with just query fees. That said, it would be cool to have curation help users/developers find the best-quality / most useful substreams, for instance.

If different data service types will have different verifiability requirements, this POI concept might be insufficient

Good point. I think I like the idea of allocations being closed by pointing to a data-service-type-specific verifiability record in a different contract, instead of it having to be a bytes32. You’re right that the current POIs will not work everywhere. The whole more-flexible verifiability topic is something I haven’t thought much about yet, but there’s some design work necessary here.

I agree that for some things (e.g. substreams) where there is no pre-indexing, query fees would be sufficient. But for other services like Firehose, where there is considerable work done before serving queries, and that work could be verifiable, I think it would make sense to also include indexer rewards. Otherwise indexers who serve these services would be at a disadvantage vs. indexers who focus on subgraphs. I think it would be good to distinguish between the two kinds in the data service type definition, rather than leave subgraphs as the only service that gets rewards? I know there’s some discussion/research about indexer rewards going on elsewhere, so it might be good to think about how this could inform that discussion.

Definitely, this will be a (fun) rabbit hole.

I think there is a distinction between a Firehose, which is essentially running an Ethereum client plus some additional storage infrastructure to provide generic block streaming for a blockchain, and a subgraph, which is a much more specific ETL pipeline & query interface. To even provide substreams, you need to run a Firehose for the relevant network.
That’s not to say that there isn’t work required, but a query-only setup is much simpler in a lot of ways, and simplicity and legibility are very real and significant benefits as we look to introduce new services.

That may, as you say, put indexers who focus on new services at a disadvantage, but if there isn’t a query price for the data which is acceptable for developers and profitable for indexers, then the new data services have a more fundamental problem. If that price does exist, then there is a good opportunity for a market to emerge. Introducing an indexing reward into the mix makes it harder to understand that value exchange (though of course rewards can be useful to bootstrap a market).
