A framework for building data applications entirely on decentralized Web3 infrastructure

Hi community,

This is Kuan Huang from scout.cool (www.scout.cool). I am proposing an open source software framework that enables developers to quickly build data applications entirely on Web3 infrastructure. Whether someone is building an analytics dashboard for a dApp/protocol or leveraging existing machine learning models for data science work, this framework should save them a lot of time.

Problem:
Every blockchain data analytics company that we know of (including ourselves) uses centralized databases to index blockchain data before serving it to users. As a result, the majority of data analytics charts and dashboards live on centralized hosting services and databases. This approach is not only the opposite of where Web3 is heading, but it also creates a few real issues:

  1. Users are locked into a proprietary web platform due to the centralization of data indexes and hosting.
  2. The lack of transparency in how data is indexed and organized in a centralized database makes it very difficult to debug issues. We personally have run into numerous cases where the on-chain metrics we put together differ from other vendors’.
  3. As faster L1 and L2 blockchains become available, this approach is becoming the biggest bottleneck for every data company trying to scale.

Proposed Solution:
The MVP of the proposed software framework should provide:

  1. A generic module that efficiently pulls data from any subgraph or other on-chain/off-chain data source. This module should take care of common challenges such as pulling from multiple subgraphs in parallel and handling subgraph “pagination” in the background (see the sketch after this list). (One of the grantees from the first wave, Keyko, might be working on some parts of what I am describing.)

  2. A data transformation module that prepares the data before visualization. Some existing packages, such as Pandas and TensorFlow, can be reused here. This also opens the door to machine learning applications that leverage The Graph.

  3. Pre-built widgets (with options to customize) for rendering charts and dashboards. Developers should be able to render a chart or design a dashboard in very few lines of code without touching any front-end code.

  4. A simple mechanism for deploying and sharing through decentralized hosting/storage services, so the entire community can discover and learn from these applications.

  5. An easy-to-maintain structure, since frequent updates to the applications are expected.
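To make item 1 concrete, here is a minimal sketch of the kind of pagination handling such a generic module would hide from the developer, using The Graph’s standard first/id_gt pattern. The endpoint, entity and field names are placeholders for illustration:

```python
import requests

# Placeholder subgraph endpoint for illustration
SUBGRAPH_URL = "https://api.thegraph.com/subgraphs/name/<org>/<subgraph>"

def fetch_all(entity: str = "deposits", fields: str = "id amount timestamp") -> list:
    """Fetch every row of `entity`, 1000 records per request (The Graph's page limit)."""
    rows, last_id = [], ""
    while True:
        query = f"""{{
          {entity}(first: 1000, orderBy: id, where: {{ id_gt: "{last_id}" }}) {{ {fields} }}
        }}"""
        resp = requests.post(SUBGRAPH_URL, json={"query": query})
        resp.raise_for_status()
        page = resp.json()["data"][entity]
        if not page:
            return rows            # no more pages
        rows.extend(page)
        last_id = page[-1]["id"]   # resume after the last id seen
```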

Why now:
We are beginning to see what the Web3 infrastructure might look like in the near future:

  1. Decentralized query layers such as The Graph (700+ subgraphs and growing!)
  2. Decentralized storage and hosting services such as IPFS, textile.io, Fleek.co
  3. Decentralized code repositories such as Radicle
  4. Decentralized database/caching layers such as ThreadDB and GunDB

Some inspirations: (Sorry, I am not allowed to post more than two links as a new user in this forum)
Pandas
Observablehq
Streamlit
Vega-lite

10 Likes

Hey this is great! That problem statement definitely resonates.
I would be interested in how you think the different parts of that stack best fit together - for example, we are looking at ways of extending subgraph functionality to make it easier to extract data (which might provide some of the functionality you describe in the “generic module”), up to and including “analytics”-type functionality (aggregations & transformations). There have also been requests for “custom resolvers”, to give subgraph developers more flexibility in how the underlying data can be queried via GraphQL.
However I do also definitely see value in separating concerns.
Might you be free to have a quick call to discuss?
PS vega-lite & streamlit are great!

Hi Adam,

I am based in NYC (EST). What’s your email? I can send you a few available slots for a call. Mine is huangkuan at gmail.

It’s definitely nice to have the flexibility to extract/aggregate/transform data in the subgraph, but we have also found benefits in being able to do it at the application level.

For example, say you have a subgraph that powers your web application, but it is only 70% optimized for an analytics chart that you are building. Writing a few lines of code to transform the data is probably easier than updating the subgraph and redeploying it.

Here is another example. Let’s say I am building this project: https://beta.web3index.org/. Having the ability to aggregate/transform data at the application level allows me to leverage all the existing public subgraphs without creating my own subgraph and syncing blockchain data from the beginning.

1 Like

Great point on giving applications flexibility, in particular when it comes to leveraging subgraphs created and maintained by other developers.
Looking forward to catching up on this next week.

1 Like

Hey @huangkuan and @adamfuller - I’ve been building something very much like what you describe for the past 10 years or so, based on linked data / the semantic web.

It basically maps any data (for example from GraphQL) to typed linked data … puts it in the global linked data graph … and then offers interface components based on the chosen universal types / schema (like Person, Token, Project, etc). You can build & click together interfaces and visualisations of the data and finally publish it as a fully functioning web3 app.

After years of developing the underlying tech, I’m now launching it as a project / product and am getting more people involved. I just launched a very early, rough website: www.semantu.com

More demos, including visualization of data from The Graph, coming soon!
If this interests anyone, get in touch with me for a 1:1. I would love to connect with some people from The Graph to explore this further.

Hi everyone, let me share with you some great progress we have made on the library:

Demo: AAVE vault deposit amount from Jan 1st to Jan 10th

This demo shows how little code is required to load data from a subgraph and display it in a line chart. Earlgrey is the name of the library.

Highlights:
load_subgraph: This function does all the magic of loading data from a subgraph efficiently

  • It bypasses the page limit on The Graph side. You can simply pass the start time and end time in the query and it takes care of the rest.
  • It automatically converts string-typed data (which is how The Graph returns data) to its proper type based on the subgraph’s schema.
  • If your GraphQL query contains multiple entities, this function loads them concurrently to save time.

plot:
Plot the data on a line chart.
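For reference, the demo flow above would look roughly like this. load_subgraph and plot are the function names from this post, but the import path, query and keyword arguments are my assumptions for illustration, not the exact Earlgrey API:

```python
from earlgrey import load_subgraph, plot  # hypothetical import path

# Example AAVE v2 subgraph; Jan 1 - Jan 10, 2021 expressed in unix seconds
AAVE_V2 = "https://api.thegraph.com/subgraphs/name/aave/protocol-v2"
QUERY = """
{
  deposits(where: { timestamp_gte: 1609459200, timestamp_lt: 1610236800 }) {
    amount
    timestamp
  }
}
"""

# load_subgraph takes care of the page limit, string-to-type conversion and
# concurrent entity loading described in the highlights above
df = load_subgraph(AAVE_V2, QUERY)

# render the result as a line chart in a single call
plot(df, x="timestamp", y="amount")
```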

cc @adamfuller

Here is another demo that shows a slightly more complicated use case:

Go to the same link https://earlgrey-demo.herokuapp.com/ and click the arrow on the top left corner to see the second demo. Instructions below:
[Screenshot: instructions for the second demo]

aggregate_timeseries aggregates data by time intervals (hourly, daily, monthly)
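The demo does not show how aggregate_timeseries is implemented; a pandas sketch of the same idea (the column names and signature here are assumptions) could look like this:

```python
import pandas as pd

def aggregate_timeseries(df: pd.DataFrame, interval: str = "1H") -> pd.DataFrame:
    """Roll raw rows up by time bucket: '1H' hourly, '1D' daily, '1M' monthly."""
    out = df.copy()
    out["timestamp"] = pd.to_datetime(out["timestamp"], unit="s")  # subgraph timestamps are unix seconds
    return (
        out.set_index("timestamp")
           .resample(interval)["amount"]
           .sum()
           .reset_index()
    )
```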

Hi @huangkuan, this does indeed look like great progress! Is the code open source? I would be interested in taking a look if so. (The “How to install” link seemed to be broken.)
How are you determining time for blocks? Are you using the ethereum-blocks subgraph?
What do you mean by “If your graph ql contains multiple entities, this function loads them concurrently to save time.”?

Hi @adamfuller Our goal is to find an efficient solution to work with large data sets. We are experimenting with a few approaches:

  1. In this demo, we use one queue for each entity.
  2. We are also trying to get the data without breaking it down by entity.

In a scenario where we query entities X, Y and Z, which have x, y and z thousand records respectively:
With approach #1, the data is fetched in x + y + z requests across 3 concurrent queues.
With approach #2, the data is fetched in max(x, y, z) requests in a single queue.
Obviously, #2 makes fewer requests; we are still testing the timing cost. In case of failure, #2 would also have a smaller amount of data to re-fetch.
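As a sketch of approach #1, one worker per entity running concurrently (fetch_entity is a stand-in for a paginated loader like the one sketched earlier in this thread):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_entity(entity: str) -> list:
    # stand-in: paginate through `entity` as in the earlier fetch_all sketch
    raise NotImplementedError

entities = ["deposits", "withdraws", "borrows"]  # example entities X, Y, Z

# one queue per entity, all running in parallel; total requests = x + y + z
with ThreadPoolExecutor(max_workers=len(entities)) as pool:
    results = dict(zip(entities, pool.map(fetch_entity, entities)))
```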

Hi everyone,

We have taken our first step! I want to share with you an initial version of a library that we have been working on for the past 6 weeks. The library is called Bubbletea.

The idea here is to enable developers and data scientists to quickly build data applications directly on top of The Graph network. We would love for interested people to give it a spin and share with us the features you would like to see or the issues you discover along the way.

5 Likes

Hi community,

Sharing with you our progress from the past two months.

Our long-term vision is to build a decentralized data analytics product with a new business model to compete with (and do better than) centralized data analytics platforms such as Chainalysis, Nansen and Dune. To get there, the first question we are trying to answer is whether we can recreate what has been built on these centralized platforms.

The answer is yes, but it is very tricky at this moment.

Let me walk you through a simple NFT dashboard we tried to recreate from Nansen. Nansen has a feature called “24H NFT Market Overview” (which requires $149/month to access). It tracks the past 24 hours of market information for popular NFTs on OpenSea.

Here is our demo and the obstacles we have run into:

  1. Throughput limitations: it takes a very long time to load just 6 hours’ worth of data.

  2. Missing log index in the transaction model. It would be very tedious to recreate the “%Opensea+Rarible” column without this feature. To identify whether an NFT was transacted on the OpenSea or Rarible platform, we need access to the logs of the transaction so we can locate the specific log associated with an OpenSea or Rarible exchange smart contract.

  3. “Marketcap” column. To calculate the latest market cap (summing up the sale price of each token in an NFT collection), we need the history of all the token sales. Then we need to periodically pre-calculate and store the latest prices in a subgraph. Our subgraph only syncs data from September onward and it is still not done yet. (It has been 4 days and it has only synced an additional 10k blocks.)

  4. “#Wallet”. We have no way to tell whether an address is a wallet or a contract. web3.js has a way to tell (a sketch of the equivalent check follows after this list), but we are not allowed to use any 3rd-party libraries when indexing subgraphs.

  5. Our demo only contains ERC721-compatible NFT transactions. For ERC1155-compatible NFTs, we found an almost perfect subgraph, but it does not have the ETH value transferred per transaction that we need. In order to use it, we would need to modify its code, redeploy, and re-sync the subgraph, and then we would run into the same issue mentioned in 3.

  6. Fetching off-chain data. For some ERC721 tokens and all ERC1155 tokens, it would be ideal to read their metadata, such as names and rarity data. Here is an example:
    This is an NFT on Opensea:
    https://opensea.io/assets/0x495f947276749ce646f68ac8c248420045cb7b5e/4981676894159712808201908443964193325271219637660871887967795137569907277825
    And we would like to get its metadata from here:
    https://api.opensea.io/api/v1/metadata/0x495f947276749Ce646f68AC8c248420045cb7b5e/4981676894159712808201908443964193325271219637660871887967795137569907277825
    To do so, we need to use async functions to fetch data, which is currently prohibited by the graph compiler. We have raised this issue in this forum post:
    Proposing to make more diverse subgraphs. Need your feedback
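On obstacle 4 above: outside the subgraph sandbox, an eth_getCode call is how a library like web3.js distinguishes wallets (EOAs) from contracts. A sketch using web3.py (v6); the RPC endpoint is a placeholder, and destroyed contracts or checks at historical block heights are edge cases this does not handle:

```python
from web3 import Web3

# Placeholder RPC endpoint for illustration
w3 = Web3(Web3.HTTPProvider("https://mainnet.infura.io/v3/<project-id>"))

def is_contract(address: str) -> bool:
    """An address with non-empty bytecode is a contract; an EOA returns empty code."""
    return len(w3.eth.get_code(Web3.to_checksum_address(address))) > 0
```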

2 Likes

You can play with our demo here:
https://bubbletea-demo.herokuapp.com/?demo=demo_nft.py&starttimestamp=1630627200&endtimestamp=1630648800

1 Like

Hey thanks so much for this write up! Lots of good points here.

  1. This loading is from the subgraph endpoint on the fly? Is the bulk of that time pulling the individual entities out of the subgraph? Definitely seems like better support for analytical functionality would help here.
  2. Interesting use case, I replied on that thread about logs.
  3. Is this point related to sync speed (which is definitely a focus area), or also to the ability to do more custom calculations on the fly?
  4. Interesting - I think we could maybe expose the getCode function in graph-ts to check for contracts, though there are some edge cases to consider (contracts might be destroyed, and I don’t know if you can getCode as of a given block height)
  5. This is an interesting point - I think we have previously discussed some of the functionality required here (e.g. subgraph composition). One thing that does occur to me is that the msg.value won’t always capture ETH transfers, if you’re using a multisig (for example). I think there is potential to use the upcoming introduction of the Firehose in Graph Node to solve some of those problems
  6. Better indexing off-chain data is also top of mind at the moment - we are working on a better model for content addressable data (IPFS, Arweave), but arbitrary HTTP endpoints make deterministic indexing impossible. I think we could apply some of the same patterns to support that in Graph Node, but that opens up a wider discussion cc @Brandon

There are quite a few areas there where we have stuff in flight, so will come back to this thread, but thanks again for sharing all the above and excited to see the progress on Bubbletea!

3 Likes

Thanks Adam. Sorry about my late response. It would be super helpful to have some rough ideas of when these requests can be addressed. Even estimates like “in days”, “in weeks or months” would be helpful.

It’s the limit of 1000 items per API call that forces us to ping the API multiple times. It also creates 502 connection errors from time to time when the API calls are too frequent.

OK. Will be following this thread. I haven’t seen any responses yet.

The former.

Please do.

True, although this is not a showstopper for us. But we definitely need a much, much simpler way to leverage existing subgraphs to create and deploy new subgraphs.

I don’t quite understand the difficulty of indexing off-chain data from arbitrary HTTP endpoints, but it’s crucial for many use cases that I can think of. Happy to learn more.

I don’t quite understand the difficulty of indexing off-chain data from arbitrary HTTP endpoints, but it’s crucial for many use cases that I can think of. Happy to learn more.

I can weigh in on this one: indexing off-chain data isn’t really a computationally difficult thing to do (at least, no harder than indexing on-chain data); the issue comes from determinism. It’s similar to the oracle problem, where there is no finality to the data you may be indexing.

Arbitrary HTTP endpoints can provide data, but the data they return may change at will and without warning, meaning that two separate Graph Nodes indexing the same subgraph can produce different results. Then, when a user queries the subgraph, there’s a determinism bug that may cause an incorrect query result to be returned.

The use cases for indexing off-chain data are certainly numerous and useful, but they cause a lot of problems with finality that don’t happen with on-chain data, since on-chain data has (relative) finality and triggers that allow the data in a subgraph to be consistent across different nodes and times.

2 Likes

Thank you, @rburdett. It makes sense.

@adamfuller @Brandon There is a huge number of use cases for indexing off-chain data. If the vision of The Graph is to become the modern database of Web3, this is a must-have feature IMO. Without it, it would also be very difficult for us to build a competitive product on top of The Graph.

To my understanding, as long as the node accurately, promptly and honestly indexes the data from the data source designated by the subgraph creators, your side of the responsibility for data finality is fulfilled. The subgraph creators should be responsible for (and fully incentivized to find) the best possible data source.

You have probably already given a lot of thought to this subject. Happy to discuss further if there is interest.

Hi!

It would be super helpful to have some rough ideas of when these requests can be addressed

This depends a bit on the feature, though I think most of the things discussed are either weeks or months away.

It’s the limit of 1000 items per API call that forces us to ping the API multiple times

It sounds like you’re doing a lot of post-processing on the client side if you are fetching all the entities. It would be great to work out what could be moved to the server side (and whether that would happen at indexing time or at query time).

Code checking in Graph Node

Created an issue here to track (I am not sure when we would be able to get to this)

Deterministic indexing requirements

@rburdett described the constraints well. I agree with you that off-chain data is a crucial component for indexing.

To my understanding, as long as the node accurately, promptly and honestly indexes the data from the data source designated by the subgraph creators, your side of the responsibility for data finality is fulfilled. The subgraph creators should be responsible for (and fully incentivized to find) the best possible data source.

The challenge here is trustlessness - as an end consumer, can you be certain that the indexer has correctly fetched the data as specified? The importance of this obviously depends on the trust setting itself, but on the Graph network, deterministic indexing means that subgraph creators and consumers don’t have to trust the indexers at all to be sure they are getting the right data.

I would be happy to continue the discussion further. A lot of our use-case focus has been on NFT metadata (which is often off-chain), so it would be great to understand and elaborate on the datasets/resources that you would like to plug into from an analytical perspective.

Thanks. We are basically trying to assess whether it’s too early to pursue Bubbletea as a business idea or not. This idea has a big dependency on The Graph network.

@adamfuller We had a similar conversation before. The abilities to aggregate data at indexing time and at query time are equally important. In our particular example, we want to display the past 6 hours of data in a searchable table view. There were only 19,262 entries for those 6 hours’ worth of data, but it takes a while to load because we can only retrieve 1,000 entries per call (roughly 20 requests). In a different context, such as loading 12 months’ worth of daily data, we would absolutely aggregate at indexing time instead of calculating on the client side.

And as of now, editing, redeploying and syncing subgraphs is very time-consuming. If we forget to pre-calculate a field at indexing time and want to add it later, we have to wait a long time to see the result. The subgraph we deployed days ago is still far from being synced.

Making a frequently used subgraph as generic as possible is good for reusability of subgraphs across different applications. It is also very difficult to forecast every potential use case when the subgraph is initially designed.

@adamfuller what’s the best way to track the implementations of all our feature requests? We are a little bit stuck here.

Hey @huangkuan, sorry for the slow response. Returning to the original list:

  • Logs in transactions: this is currently blocked by the Graph Node <> Firehose integration for Ethereum (in progress, tracked at a high level here). Logs are part of the receipt in the Firehose type, so they will be readily available. The challenge, though, might be parsing those logs (that would need ABIs for all the relevant contracts, or at least a means of parsing them somehow).
  • Performance: a few of your issues relate to the time taken to sync. We are currently working on performance for token-based subgraphs specifically. Nothing to track here yet, but there will be.
  • #Wallet: this is tracked here, though this hasn’t been prioritised, and I think this may actually be better served by tracking contract creation via the Firehose, once available.
  • Token sales: one quick note on sales is that it’s not enough to rely on transaction.value, as in certain cases (e.g. multisigs) there will be an ETH value transfer for a transaction even if the transaction itself has zero value.
  • Off-chain data: work has started on File Data Sources, which would be the first requirement for the arbitrary HTTP data that you require.

Your specific use-case (on-the-fly analytics) does encounter some issues with the existing subgraph API. I will be in touch with you directly to discuss the latest thinking on changing that.

2 Likes