Protean Labs proposal: Recovering entity-based aggregation functionality at the schema level

The Problem

One of the major pain points of working with subgraph data is that doing any kind of aggregation (e.g.: summation, counting, averages…) over a subset of (or all) instances of some entity type requires the querier to fetch every instance of that entity type and perform the aggregation client-side, which scales very poorly as the number of entities increases. For example, consider the following toy subgraph schema (inspired by the Uniswap V2 subgraph):

type Swap @entity {
  id: ID!
  timestamp: BigInt!
  pair: Pair!

  amount0In: BigDecimal!
  amount1In: BigDecimal!
  amount0Out: BigDecimal!
  amount1Out: BigDecimal!
}

type Pair @entity {
  id: ID!
  swaps: [Swap!]! @derivedFrom(field: "pair")
}

If one wanted to get the number of swaps for a given pair, or the sum of all amount0In values for swaps related to a given pair, there is currently no better option than querying all such swaps and performing the operation client-side.
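To make this concrete, a consumer today has to page through every swap of the pair with a query along the following lines (a minimal sketch; the variable names are ours, and the where and pagination arguments follow graph-node's usual generated API), repeating it with an increasing skip and summing the amounts client-side:

# fetch one page of swaps for a pair; repeat with increasing $skip until empty
query SwapsPage($pair: String!, $skip: Int!) {
  swaps(first: 1000, skip: $skip, where: { pair: $pair }) {
    amount0In
    amount1In
    amount0Out
    amount1Out
  }
}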

Obviously, if the subgraph writer knows in advance which kinds of aggregations are useful, they can always add them as explicit entity fields and write mapping code to keep those values up to date. In our example, one could modify the Pair entity to include the aggregations we are interested in (see below) and update those values whenever a new swap is indexed (this is the approach taken by Uniswap in their V2 subgraph for metrics such as total trade volume).

type Pair @entity {
  id: ID!
  swaps: [Swap!]! @derivedFrom(field: "pair")
  
  numSwaps: BigInt!
  totalAmount0In: BigDecimal!
  totalAmount1In: BigDecimal!
  totalAmount0Out: BigDecimal!
  totalAmount1Out: BigDecimal!
}

However, this approach is not ideal for three reasons:

  1. It increases the complexity of subgraphs (more potential for bugs) and potentially increases indexing/syncing time
  2. It assumes that the subgraph writer and subgraph user are the same person, or at the very least that the subgraph writer knows in advance what aggregation data users want
  3. It basically replicates features that are already supported natively by the graph-node stack, specifically the PostgreSQL database used to store the indexed data

All this being said, we have come up with two potential approaches that would solve this problem.

Potential solution 1: Aggregation directives

The first potential solution involves supporting additional directives that express aggregations over fields of list type (see the updated schema below).

type Pair @entity {
  id: ID!
  swaps: [Swap!]! @derivedFrom(field: "pair")

  numSwaps: BigInt! @count(entities: "swaps")

  totalAmount0In: BigDecimal! @sum(entities: "swaps", field: "amount0In")
  totalAmount1In: BigDecimal! @sum(entities: "swaps", field: "amount1In")
  totalAmount0Out: BigDecimal! @sum(entities: "swaps", field: "amount0Out")
  totalAmount1Out: BigDecimal! @sum(entities: "swaps", field: "amount1Out")
}

In this scenario, numSwaps: BigInt! @count(entities: "swaps") would be an implicit field (as in, it does not have to be handled in the mapping code) whose value would be the number of swaps returned by the pair’s swaps field.

Likewise, totalAmount0In: BigDecimal! @sum(entities: "swaps", field: "amount0In") would also be an implicit field whose value would be the sum of all amount0In values for all swaps returned by the pair’s swaps field.

These directives would essentially correspond to SQL COUNT and SUM queries made to the underlying database and would be executed at runtime. This approach would require some additional schema type checking (e.g.: checking that the field being summed is a number, checking that the field being counted is a list, etc.).
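With those directives in place, the aggregates could be read directly off the entity. A minimal sketch of such a query, assuming the updated Pair schema above (the pair id is passed as a variable):

# aggregates resolved by graph-node at query time; no client-side work needed
query PairAggregates($id: ID!) {
  pair(id: $id) {
    numSwaps
    totalAmount0In
    totalAmount1In
  }
}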

One of the advantages of such an approach is that adding additional @count or @sum fields to a schema would not require re-indexing of the subgraph, since these values are computed at query time and no new data is added to the database. Moreover, it opens the door to support more directives that would leverage SQL operations (e.g.: AVG, MAX, MIN, …).
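For instance, hypothetical @avg and @max directives (not part of the current proposal, shown only to illustrate the extension) could follow the exact same pattern:

type Pair @entity {
  id: ID!
  swaps: [Swap!]! @derivedFrom(field: "pair")

  # hypothetical directives mapping to SQL AVG and MAX
  avgAmount0In: BigDecimal! @avg(entities: "swaps", field: "amount0In")
  maxAmount0In: BigDecimal! @max(entities: "swaps", field: "amount0In")
}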

However, a disadvantage of this approach is that these implicit fields would only be available as entity fields and not as top-level queries. In our example, there would be no way to perform counting or summation over all swaps in the subgraph (unless an entity is created explicitly for that purpose). Moreover, there would be no way to count or sum over a subset of these entities (e.g.: count the number of swaps related to a given pair that happened after timestamp 1639400000).

This brings us to the second potential approach to support aggregations at the schema level.

Potential solution 2: Implicit “info” entity

The second potential solution involves the creation (by the graph-node) of implicit “info” entities for each entity defined in the subgraph schema, as well as top-level queries to fetch these “info” entities. These entities would hold aggregation values for the entities they are derived from. Let’s go back to our initial example to illustrate this idea. In this scenario, the actual GraphQL API generated by the graph-node would look like the following:

enum OrderDirection {...}

type Swap {
  id: ID!
  timestamp: BigInt!
  pair: Pair!

  amount0In: BigDecimal!
  amount1In: BigDecimal!
  amount0Out: BigDecimal!
  amount1Out: BigDecimal!
}

# NEW ENTITY
type Swap_info {
  count: BigInt!
  sum_amount0In: BigDecimal!
  sum_amount1In: BigDecimal!
  sum_amount0Out: BigDecimal!
  sum_amount1Out: BigDecimal!
}

input Swap_filter {...}
enum Swap_orderBy {...}

type Pair {
  id: ID!

  swaps(
    skip: Int = 0
    first: Int = 100
    orderBy: Swap_orderBy
    orderDirection: OrderDirection
    where: Swap_filter
  ): [Swap!]!

  # NEW FIELD
  swaps_info(
    skip: Int = 0
    first: Int = 100
    orderBy: Swap_orderBy
    orderDirection: OrderDirection
    where: Swap_filter
  ): Swap_info!
}

# NEW ENTITY
type Pair_info {
  count: BigInt!
}

input Pair_filter {...}
enum Pair_orderBy {...}

type Query {
  # Same as current implementation
  pair(id: ID!, ...): Pair
  pairs(...): [Pair!]!
  
  # NEW TOPLEVEL QUERY
  pairs_info(
    skip: Int = 0
    first: Int = 100
    orderBy: Pair_orderBy
    orderDirection: OrderDirection
    where: Pair_filter
  ): Pair_info!

  # Same as current implementation
  swap(id: ID!, ...): Swap
  swaps(...): [Swap!]!

  # NEW TOPLEVEL QUERY
  swaps_info(
    skip: Int = 0
    first: Int = 100
    orderBy: Swap_orderBy
    orderDirection: OrderDirection
    where: Swap_filter
  ): Swap_info!
}

First, notice the two additional entities Swap_info and Pair_info. These entities would be generated from the Swap and Pair entities defined in the subgraph schema. The rules for generating them would be quite straightforward: 1) Always include a count value; and 2) For each non-null number field (i.e.: Float, Int, BigDecimal, BigInt) in the original entity, create a field sum_FIELDNAME with the same type. Alternatively, a directive could be used on the original entities to indicate which fields to include in their *_info counterparts.

Notice also that there are now two additional top-level queries: swaps_info and pairs_info. They take the same arguments as swaps and pairs, but instead of returning a list of entities, they return the corresponding *_info entity, which contains aggregation statistics about the entities that would otherwise have been returned by swaps and pairs. In other words, the Pair_info entity returned by querying the pairs_info field with a set of arguments would contain a count value equal to the number of entities that would otherwise have been returned by querying the pairs field with those same arguments. The rule for generating these top-level queries is also straightforward: for each field in the schema that returns a value of type [T!]! (i.e.: a non-null list), create a new field with the same arguments that returns an entity of type T_info (this applies to non-top-level fields as well, see the Pair entity’s swaps_info field).
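To make this concrete, the subset aggregation that solution 1 could not express (counting a pair’s swaps after timestamp 1639400000 and summing their amount0In) would become a single query. The sketch below assumes the filter field names that graph-node usually generates (pair, timestamp_gt):

# count and sum over a filtered subset of swaps, computed server-side
query RecentSwapAggregates($pair: String!) {
  swaps_info(where: { pair: $pair, timestamp_gt: "1639400000" }) {
    count
    sum_amount0In
  }
}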

The values of the *_info entities would be calculated at query time using COUNT and SUM queries on the underlying database.

The advantage of this approach is that it would be handled entirely by the graph-node. In other words, the subgraph writer would not have to think about which aggregation values they want to offer in their subgraph. However, this also makes the approach quite opinionated, so the fields present in the *_info entities should be chosen with care.

As is the case for the aggregation directives, this approach could be extended to include other SQL aggregation operations (e.g.: AVG, MAX, MIN, …) by including more fields in the *_info entities.

Conclusion

Aggregation directives and info entities would allow subgraph writers to leverage the aggregation functionality already supported in the underlying PostgreSQL database with minimal effort on their part (no additional mapping code required). Moreover, they would enable subgraph users to query aggregation data directly from the subgraph instead of having to fetch all entities first.

We are very interested in hearing what The Graph community has to say about these ideas.

Cheers!


Hey @stopher! Thanks so much for this detailed write-up. Aggregations are definitely an area where, as you say, the current approaches (defining aggregations in the mappings, or on the client side) are quite onerous for developers.

The query-time aggregations you describe definitely have a lot of potential - as you say, they can be auto-generated, and leverage Postgres functionality. The main consideration that makes things more complicated is query performance - by surfacing this functionality, you introduce the possibility for consumers to make very heavy queries, potentially doing a lot of computation, across (on some subgraphs) millions of entities. That not only means that an individual analytical query might be slow, but it might also have knock-on effects for other users querying a given indexer, if the database is having to do a lot of work. So we want to make sure we think about what functionality is made available in this way, and what controls are available to subgraph developers and indexers. That might mean (in some cases) implicit or declarative pre-aggregations, caching, or replicas.

We are very keen to better support aggregations, and analytical users in general. Can you elaborate a bit more on your use case? i.e. are you looking to power end-user app interfaces, or exploratory data analysis? What kind of query latency and data freshness requirements do you have? How much are you creating groupBy type aggregations, vs timeseries?


Hi Adam,

Thanks for your reply!

Can you elaborate a bit more on your use case? i.e. are you looking to power end-user app interfaces, or exploratory data analysis?

In terms of use cases, we lean more towards data analysis than simple frontend development, although the analysis we do would be displayed on a dashboard (so a bit of both). More specifically, we are working with the Playgrounds team (https://twitter.com/playgrounds0x), helping them recreate their Dune dashboard (to the extent possible) using Python + Dash + The Graph. This involves dealing with some timeseries (e.g.: market cap over time), but also doing aggregation (e.g.: total LP fees). We’ll most likely end up handling those in the mappings, but the ideas in our original post would save us some work. Moreover, we were also thinking more broadly, without a particular use case in mind, about how subgraphs can become more powerful with minimal additional work on the part of subgraph writers.

The main consideration that makes things more complicated is query performance - by surfacing this functionality, you introduce the possibility for consumers to make very heavy queries, potentially doing a lot of computation, across (on some subgraphs) millions of entities.

Surely the computational overhead of executing these heavy queries would be smaller than the cost of querying all instances of a given entity to do the aggregation client-side, no? Or am I mistaken and graph nodes are optimized for these kinds of queries? Also, correct me if I am wrong, but don’t @derivedFrom directives already generate JOIN queries at query time without significant performance overhead?

On the other hand, the requirement that these aggregations be computed at query time is not necessarily a must-have (at least for the first potential solution, aggregation directives); it just seemed like the simplest way to implement them (one directive → one SQL query). For instance, going back to our example, the field numSwaps: BigInt! @count(entities: "swaps") could be stored as an actual column in the Pair entity table and updated every time swaps changes. However, I feel like this would involve adding a lot of complicated logic to the graph-node (e.g.: some sort of dependency detection that would notice that numSwaps depends on swaps and update it accordingly, the ability to detect when a field annotated with @derivedFrom changes, etc.).


Graph Node is optimised for those kinds of queries, but querying all instances of an entity to do aggregation client-side or elsewhere is also necessarily batched (given pagination), while the described query-time API options would make it trivial for anyone to make quite heavy queries (and indeed, as a former analyst, I know the initial instinct is always to COUNT, SUM and cross-cut everything to start with).

As I said, Graph Node definitely needs to better support these use cases, but we need to ensure that Subgraph Developers / Indexers are afforded sufficient controls to do so in a performant way. For complex subgraphs with lots of entities we already see some queries which can cause significant database load, so we want to make sure new features don’t compromise robustness.