The Problem
One of the major pain points of working with subgraph data is that doing any kind of aggregation (e.g.: summation, counting, averages…) on a subset (or all) instances of some entity type requires the query-er to fetch all instances of that entity type and perform the aggregation client-side, which scales very poorly as the number of entities increases. For example, consider the following toy subgraph schema (inspired by the Uniswap V2 subgraph):
type Swap @entity {
id: ID!
timestamp: BigInt!
pair: Pair!
amount0In: BigDecimal!
amount1In: BigDecimal!
amount0Out: BigDecimal!
amount1Out: BigDecimal!
}
type Pair @entity {
id: ID!
swaps: [Swap!]! @derivedFrom(field: "pair")
}
If one wanted to get the number of swaps for a given pair, or get the sum of all amount0In values for swaps related to a given pair, there is no better way other than querying all such swaps and then performing the operation client-side.
Obviously, if the subgraph writer knows in advance the kinds of aggregations that are useful, they can always add it as an explicit entity field and write mapping code to update that value. In our example, one could modify the Pair entity to include the aggregations that we are interested in (see below) and update those values whenever a new swap is indexed (this is the approach taken by Uniswap in their V2 subgraph for metrics such as total trade volume).
type Pair @entity {
id: ID!
swaps: [Swap!]! @derivedFrom(field: "pair")
numSwaps: BigInt!
totalAmount0In: BigDecimal!
totalAmount1In: BigDecimal!
totalAmount0Out: BigDecimal!
totalAmount1Out: BigDecimal!
}
However, this approach is not ideal for three reasons:
- It increases the complexity of subgraphs (more potential for bugs) and potentially increases indexing/syncing time
- It assumes that the subgraph writer and subgraph user are the same person, or at the very least that the subgraph writer knows in advance what aggregation data users want
- It basically replicates features that are already supported natively by the graph-node stack, specifically the PostgreSQL database used to store the indexed data
All this being said, we have come up with two potential approaches that would solve this problem.
Potential solution 1: Aggregation directives
The first potential solution involves the support for additional directives to support aggregations on fields of type list (see updated schema below).
type Pair @entity {
id: ID!
swaps: [Swap!]! @derivedFrom(field: "pair")
numSwaps: BigInt! @count(entities: "swaps")
totalAmount0In: BigDecimal! @sum(entities: "swaps", field: "amount0In")
totalAmount1In: BigDecimal! @sum(entities: "swaps", field: "amount1In")
totalAmount0Out: BigDecimal! @sum(entities: "swaps", field: "amount0Out")
totalAmount1Out: BigDecimal! @sum(entities: "swaps", field: "amount1Out")
}
In this scenario, numSwaps: BigInt! @count(entities: "swaps") would be a implicit field (as in, it does not have to be handled in the mapping code) whose value will be the number of swaps returned by the pair’s swaps field.
Likewise, totalAmount0In: BigDecimal! @sum(entities: "swaps", field: "amount0In") would also be an implicit field whose value would be the sum of all amount0In values for all swaps returned by the pair’s swaps field.
These directives would essentially correspond to SQL COUNT and SUM queries made to the underlying database and would be executed at runtime. This approach would require some additional schema type checking (e.g.: checking that the field being summed is a number, checking that the field being counted is a list, etc.).
One of the advantages of such an approach is that adding additional @count or @sum fields to a schema would not require re-indexing of the subgraph, since these values are computed at query time and no new data is added to the database. Moreover, it opens the door to support more directives that would leverage SQL operations (e.g.: AVG, MAX, MIN, …).
However, a disadvantage of this approach is that these implicit fields would only be available as entity fields and not as top-level queries. In our example, there would be no way to perform counting or summation on all swaps in the subgraph (unless an entity is created explicitely for that purpose). Moreover, there would be no way to count or perform summation on a subset of these entities (e.g.: count the number of swaps related to a given pair that happened after timestamp 1639400000).
This brings us to the second potential approach to support aggregations at the schema level.
Potential solution 2: Implicit “info” entity
The second potential solution involves the creation (by the graph-node) of implicit “info” entities for each entity defined in the subgraph schema, as well as toplevel queries to fetch these “info” entities. These entities would hold aggregation values for each of the entities they are based from. Let’s go back to our initial example to illustrate this idea. In this scenario, the actual graphql API generated by the graph-node would look like the following:
enum OrderDirection {...}
type Swap {
id: ID!
timestamp: BigInt!
pair: Pair!
amount0In: BigDecimal!
amount1In: BigDecimal!
amount0Out: BigDecimal!
amount1Out: BigDecimal!
}
# NEW ENTITY
type Swap_info {
count: BigInt!
sum_amount0In: BigDecimal!
sum_amount1In: BigDecimal!
sum_amount0Out: BigDecimal!
sum_amount1Out: BigDecimal!
}
input Swap_filter {...}
enum Swap_orderBy {..}
type Pair {
id: ID!
swaps(
skip: Int = 0
first: Int = 100
orderBy: Swap_orderBy
orderDirection: OrderDirection
where: Swap_filter
): [Swap!]!
# NEW FIELD
swaps_info(
skip: Int = 0
first: Int = 100
orderBy: Swap_orderBy
orderDirection: OrderDirection
where: Swap_filter
): Swap_info!
}
# NEW ENTITY
type Pair_info {
count: BigInt!
}
input Pair_filter {...}
enum Pair_orderBy {..}
type Query {
# Same as current implementation
pair(id: ID!, ...): Pair
pairs(...): [Pair!]!
# NEW TOPLEVEL QUERY
pairs_info(
skip: Int = 0
first: Int = 100
orderBy: Pair_orderBy
orderDirection: OrderDirection
where: Pair_filter
): Pair_info!
# Same as current implementation
swap(id: ID!, ...): Swap
swaps(...): [Swap!]!
# NEW TOPLEVEL QUERY
swaps_info(
skip: Int = 0
first: Int = 100
orderBy: Swap_orderBy
orderDirection: OrderDirection
where: Swap_filter
): Swap_info!
}
First, notice the two additional entities Swap_info and Pair_info. These entities would be generated from the Swap and Pair entities defined in the subgraph schema. The rules for generating these would be quite straighforward: 1) Always include a count value; and 2) For each non-null number field (i.e.: Float, Int, BigDecimal, BigInt) in the original entity, create a field sum_FIELDNAME with the same type. Alternatively, a directive could be used on the original entities to indicate which field to include in its *_info counterpart.
Notice also that there now two additional toplevel queries: swaps_info and pairs_info, which take the same arguments as swaps and pairs, but instead of returning a list of entities, it returns the corresponding *_info entity which would contain aggregation statistics about the entities that would have otherwise been returned by swaps and pairs. In other words, the Pair_info entity returned by querying the pair_info field with a set of arguments would contain a count value equal to the number of entities that would have otherwise been returned by querying the pairs field with those same arguments. The rules for generating these toplevel queries are also straightforward: for each field in the schema that returns a value of type [T!]! (i.e.: a non-null list), create a new field with the same arguments that returns an entity of type T_info (this applied to non-toplevel fields as well, see the Pair entity’s swaps_info field).
The values of the *_info entities would be calculated at query time using COUNT and SUM queries on the underlying database.
The advantages of this approach is that it would be entirely handled by the graph-node. In other words, the subgraph writer would not have to think about the which aggregation values they want to offer in their subgraph. However, it also makes this approach quite opinionated and the fields present in the *_info entities should be chosen with care.
As is the case for the aggregation directives, this approach could be extended to include other SQL aggregation operations (e.g.: AVG, MAX, MIN, …) by including more fields in the *_info entities.
Conclusion
Aggregation directives and info entities would allow subgraph writers to leverage the aggregation functionality already supported in the underlying PostgreSQL database with minimal effort on their part (no additional mapping code required). Moreover, they would enable subgraph users to query aggregation data directly from the subgraph instead of having to fetch all entities first.
We are very interested in hearing what The Graph community has to say about these ideas.
Cheers!