I just had a productive call with @abourget and want to summarize some highlights here.
First, let me limit the scope of what will be addressed. This comment does not address adding user-level parallelism in any way, which I still hold to be unworkable. Instead, it is limited to the problem of data extraction, and specifically whether it is beneficial to “pre-aggregate” data as part of the Firehose implementation.
TLDR: There are arguments for both, but it is possible to achieve great speed without pre-aggregation. Since avoiding pre-aggregation has some advantages in flexibility, it is the approach we both prefer.
The key insight of @abourget is that by using pessimistic filtering we can enable parallelism in the extraction phase without involving the user, essentially turning the problem into a map-reduce.
To understand this, consider a subgraph with a dynamic data source that listens for an event with the signature `event()`, but for which the contract addresses are not yet known. To index this subgraph, we would create a filter selecting a superset of the required data (all events matching the signature `event()`, regardless of their contract) and go wide, executing that filter against large ranges of blocks in parallel. The output of these jobs is piped into the indexing stage, which now receives a linear sequence of events nearly as fast as if the data had been pre-aggregated.
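To make that shape concrete, here is a minimal sketch in Go (chosen since Firehose itself is written in Go) of the map-reduce structure: each worker scans a block range with the pessimistic, signature-only filter, and the per-range outputs are concatenated in block order into a linear stream for the sequential indexing stage. Everything in it (`Event`, `fetchBlockEvents`, the range sizes) is a hypothetical stand-in, not the actual Firehose API.

```go
// Sketch of "pessimistic filter + go wide" extraction; not Firehose's real API.
package main

import (
	"fmt"
	"sync"
)

// Event is a hypothetical extracted log, keeping only the fields the sketch needs.
type Event struct {
	Block     uint64
	Contract  string
	Signature string
}

// fetchBlockEvents stands in for reading one block's logs from chain data.
func fetchBlockEvents(block uint64) []Event {
	// Fake data: every block emits one matching and one non-matching event.
	return []Event{
		{Block: block, Contract: "0xabc", Signature: "event()"},
		{Block: block, Contract: "0xdef", Signature: "other()"},
	}
}

// extractRange is the "map" step: scan a contiguous block range and apply the
// pessimistic filter (match on signature only, since contract addresses for
// dynamic data sources are not yet known).
func extractRange(start, end uint64, signature string) []Event {
	var out []Event
	for b := start; b < end; b++ {
		for _, ev := range fetchBlockEvents(b) {
			if ev.Signature == signature {
				out = append(out, ev)
			}
		}
	}
	return out
}

func main() {
	const (
		totalBlocks = 1000
		rangeSize   = 100
	)

	// "Go wide": run each block range in parallel.
	results := make([][]Event, totalBlocks/rangeSize)
	var wg sync.WaitGroup
	for i := range results {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			start := uint64(i * rangeSize)
			results[i] = extractRange(start, start+rangeSize, "event()")
		}(i)
	}
	wg.Wait()

	// "Reduce": concatenate per-range outputs in block order, yielding a
	// linear stream of filtered events for the indexing stage.
	var stream []Event
	for _, r := range results {
		stream = append(stream, r...)
	}
	fmt.Printf("filtered stream contains %d events from %d blocks\n", len(stream), totalBlocks)
}
```

The indexing stage would consume `stream` sequentially; it still sees only events matching the pessimistic filter, just produced by many workers instead of a pre-aggregation pass.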
It is worth noting that the filtered data pipe in this solution carries more data than is strictly necessary. There is a limit to how much filtering can be done in practice (with the pre-aggregation method, more filters create more redundancy in the “indexed” data set). On-the-fly filtering as described here is likely to produce an even smaller data set for the processing stage than pre-aggregation would on its own.
Which method the indexing stage uses, and what parallelism is possible there, is now a separate and unanswered question. All we have established is that pre-aggregation is not necessary to quickly produce a filtered pipe of data for indexing. The two approaches can actually be combined for even greater efficiency (though not necessarily much greater speed).
The final point made by @abourget, which solidifies parallel, on-the-fly filtering in my mind, is this:
It is much easier to get ecosystem adoption and make chains “The Graph Firehose Ready” when there is a simple standard for what files to output, one that is not a moving target as graph-node requires different kinds of pre-aggregation. This concern far outweighs the limited difference in speed between the two approaches.