Web3 data pipelines using subgraphs with file data sources

Hi all,

I just published a blog post about how web3 data pipelines can be constructed using a combination of permaweb storage and subgraphs to power dynamic sites in a performant way without the need to host and maintain a proprietary API.

Starting this thread to facilitate discussions on this use case and answer any questions you may have. If you are using this workflow or would like to use it in your project, let’s discuss here. Would be very interested in collaborating with the community to productize this workflow in open source libraries as well if there is sufficient demand.
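For anyone who hasn't seen file data sources before, the rough shape of the workflow is a subgraph template that triggers off-chain file processing. A minimal sketch of the manifest piece, assuming Arweave-hosted content (all names here – `ArticleContent`, `handleArticle`, the file paths – are hypothetical placeholders, not from the blog post):

```yaml
# Hypothetical excerpt from a subgraph.yaml
templates:
  - name: ArticleContent        # hypothetical template name
    kind: file/arweave          # file/ipfs is also supported
    mapping:
      apiVersion: 0.0.7         # file data sources require >= 0.0.7
      language: wasm/assemblyscript
      file: ./src/article.ts    # mapping that parses the fetched file
      handlers:
        - handler: handleArticle
      entities:
        - Article
```

An on-chain event carrying the content ID spawns an instance of the template, the node fetches the file from permaweb storage, and the handler writes entities that the dynamic site then queries.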



Hi, @craigtutterow – read the blog and it’s an interesting idea for sure.

I guess for me, a few questions come up, most of them use-case driven.

Definitions of public data
For example, does this mean the use case is scoped to data that is intended to be open and freely available from a web2 context? So blogs and their comments would be one use case, but data that can be downloaded from, say, an open gov portal would not?

The reason I ask relates to the next area of curiosity.

Cryptographic assurance requirements
If an Indexer is selecting data from web2, say a CSV of census data or some other gov portal, and then makes that data available for use by anyone searching for that index, what are the expectations of cryptographic assurance?

For example, someone could claim they are indexing polling-station data or something, but then the indexer could mutate it before writing it to permastorage, and the mutated data would then be indexed.

So I am curious about the use cases based on source and end-user.

If the above is in scope, then either multiple indexers need to reach consensus on the data before it is indexed for queries, or they can only index data that carries cryptographic assurances from a decentralized oracle.
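On the assurance point, it may help to separate two different guarantees. Content addressing on the permaweb guarantees that the bytes you fetch match the advertised identifier (the ID is derived from a hash of the data), but it cannot prove the uploader faithfully copied the web2 source. A toy sketch of that distinction (simplified to a bare SHA-256, not the actual permaweb ID scheme):

```python
import hashlib

def content_id(data: bytes) -> str:
    """Toy content address: an ID derived from a hash of the data itself."""
    return hashlib.sha256(data).hexdigest()

original = b"station,votes\nA,120\nB,95\n"
cid = content_id(original)

# Anyone can verify that retrieved bytes match the advertised ID...
retrieved = original
assert content_id(retrieved) == cid

# ...but a mutated upload simply gets a different, equally valid ID.
mutated = b"station,votes\nA,12\nB,950\n"
assert content_id(mutated) != cid
# Nothing in the ID proves which bytes match the web2 source; that needs
# attestation (e.g. the source signing its data) or multi-party consensus.
```

So permastorage answers "did I get what was published?" while the mutation-before-upload problem is exactly the provenance gap the oracle/consensus options above would have to close.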

Which brings me to the primary use cases: working backwards from what will generate high query interest.

The reason I work backwards is that there is a query fee involved.

Given that, I am guessing the use cases that work best involve a query that can justify the fee (the fee can be extracted from a transaction associated with the query, or can be a true fee-per-query, as I understand it).

Who serves the raw data?
Which brings me back to the question around incentives and breaking web2 silos.

IF a query for a given data set is valuable enough that someone can monetize the querying, I suspect that in many instances the suppliers, except in the case of true public goods, also charge for it in a web2 environment.

Or those pay-for-use silos may route their data through an oracle with a public-facing API, yet still charge (although I'm not sure why there would be a need for consensus if there's only one source – that's my lack of knowledge, so clarification is welcome). To me this seems like the silos are still perpetuated, just with shifted infrastructure.

Just a quick set of thoughts, as I've been wondering about the same things, so I may be getting some of the concepts wrong.