The Problem
One of the major pain points of working with subgraph data is that doing any kind of aggregation (e.g.: summation, counting, averages…) on a subset (or all) instances of some entity type requires the query-er to fetch all instances of that entity type and perform the aggregation client-side, which scales very poorly as the number of entities increases. For example, consider the following toy subgraph schema (inspired by the Uniswap V2 subgraph):
type Swap @entity {
id: ID!
timestamp: BigInt!
pair: Pair!
amount0In: BigDecimal!
amount1In: BigDecimal!
amount0Out: BigDecimal!
amount1Out: BigDecimal!
}
type Pair @entity {
id: ID!
swaps: [Swap!]! @derivedFrom(field: "pair")
}
If one wanted to get the number of swaps for a given pair, or get the sum of all amount0In
values for swaps related to a given pair, there is no better way other than querying all such swaps and then performing the operation client-side.
Obviously, if the subgraph writer knows in advance the kinds of aggregations that are useful, they can always add it as an explicit entity field and write mapping code to update that value. In our example, one could modify the Pair
entity to include the aggregations that we are interested in (see below) and update those values whenever a new swap is indexed (this is the approach taken by Uniswap in their V2 subgraph for metrics such as total trade volume).
type Pair @entity {
id: ID!
swaps: [Swap!]! @derivedFrom(field: "pair")
numSwaps: BigInt!
totalAmount0In: BigDecimal!
totalAmount1In: BigDecimal!
totalAmount0Out: BigDecimal!
totalAmount1Out: BigDecimal!
}
However, this approach is not ideal for three reasons:
- It increases the complexity of subgraphs (more potential for bugs) and potentially increases indexing/syncing time
- It assumes that the subgraph writer and subgraph user are the same person, or at the very least that the subgraph writer knows in advance what aggregation data users want
- It basically replicates features that are already supported natively by the graph-node stack, specifically the PostgreSQL database used to store the indexed data
All this being said, we have come up with two potential approaches that would solve this problem.
Potential solution 1: Aggregation directives
The first potential solution involves the support for additional directives to support aggregations on fields of type list (see updated schema below).
type Pair @entity {
id: ID!
swaps: [Swap!]! @derivedFrom(field: "pair")
numSwaps: BigInt! @count(entities: "swaps")
totalAmount0In: BigDecimal! @sum(entities: "swaps", field: "amount0In")
totalAmount1In: BigDecimal! @sum(entities: "swaps", field: "amount1In")
totalAmount0Out: BigDecimal! @sum(entities: "swaps", field: "amount0Out")
totalAmount1Out: BigDecimal! @sum(entities: "swaps", field: "amount1Out")
}
In this scenario, numSwaps: BigInt! @count(entities: "swaps")
would be a implicit field (as in, it does not have to be handled in the mapping code) whose value will be the number of swaps returned by the pair’s swaps
field.
Likewise, totalAmount0In: BigDecimal! @sum(entities: "swaps", field: "amount0In")
would also be an implicit field whose value would be the sum of all amount0In
values for all swaps returned by the pair’s swaps
field.
These directives would essentially correspond to SQL COUNT
and SUM
queries made to the underlying database and would be executed at runtime. This approach would require some additional schema type checking (e.g.: checking that the field being summed is a number, checking that the field being counted is a list, etc.).
One of the advantages of such an approach is that adding additional @count
or @sum
fields to a schema would not require re-indexing of the subgraph, since these values are computed at query time and no new data is added to the database. Moreover, it opens the door to support more directives that would leverage SQL operations (e.g.: AVG
, MAX
, MIN
, …).
However, a disadvantage of this approach is that these implicit fields would only be available as entity fields and not as top-level queries. In our example, there would be no way to perform counting or summation on all swaps in the subgraph (unless an entity is created explicitely for that purpose). Moreover, there would be no way to count or perform summation on a subset of these entities (e.g.: count the number of swaps related to a given pair that happened after timestamp 1639400000).
This brings us to the second potential approach to support aggregations at the schema level.
Potential solution 2: Implicit “info” entity
The second potential solution involves the creation (by the graph-node) of implicit “info” entities for each entity defined in the subgraph schema, as well as toplevel queries to fetch these “info” entities. These entities would hold aggregation values for each of the entities they are based from. Let’s go back to our initial example to illustrate this idea. In this scenario, the actual graphql API generated by the graph-node would look like the following:
enum OrderDirection {...}
type Swap {
id: ID!
timestamp: BigInt!
pair: Pair!
amount0In: BigDecimal!
amount1In: BigDecimal!
amount0Out: BigDecimal!
amount1Out: BigDecimal!
}
# NEW ENTITY
type Swap_info {
count: BigInt!
sum_amount0In: BigDecimal!
sum_amount1In: BigDecimal!
sum_amount0Out: BigDecimal!
sum_amount1Out: BigDecimal!
}
input Swap_filter {...}
enum Swap_orderBy {..}
type Pair {
id: ID!
swaps(
skip: Int = 0
first: Int = 100
orderBy: Swap_orderBy
orderDirection: OrderDirection
where: Swap_filter
): [Swap!]!
# NEW FIELD
swaps_info(
skip: Int = 0
first: Int = 100
orderBy: Swap_orderBy
orderDirection: OrderDirection
where: Swap_filter
): Swap_info!
}
# NEW ENTITY
type Pair_info {
count: BigInt!
}
input Pair_filter {...}
enum Pair_orderBy {..}
type Query {
# Same as current implementation
pair(id: ID!, ...): Pair
pairs(...): [Pair!]!
# NEW TOPLEVEL QUERY
pairs_info(
skip: Int = 0
first: Int = 100
orderBy: Pair_orderBy
orderDirection: OrderDirection
where: Pair_filter
): Pair_info!
# Same as current implementation
swap(id: ID!, ...): Swap
swaps(...): [Swap!]!
# NEW TOPLEVEL QUERY
swaps_info(
skip: Int = 0
first: Int = 100
orderBy: Swap_orderBy
orderDirection: OrderDirection
where: Swap_filter
): Swap_info!
}
First, notice the two additional entities Swap_info
and Pair_info
. These entities would be generated from the Swap
and Pair
entities defined in the subgraph schema. The rules for generating these would be quite straighforward: 1) Always include a count
value; and 2) For each non-null number field (i.e.: Float, Int, BigDecimal, BigInt) in the original entity, create a field sum_FIELDNAME
with the same type. Alternatively, a directive could be used on the original entities to indicate which field to include in its *_info
counterpart.
Notice also that there now two additional toplevel queries: swaps_info
and pairs_info
, which take the same arguments as swaps
and pairs
, but instead of returning a list of entities, it returns the corresponding *_info
entity which would contain aggregation statistics about the entities that would have otherwise been returned by swaps
and pairs
. In other words, the Pair_info
entity returned by querying the pair_info
field with a set of arguments would contain a count
value equal to the number of entities that would have otherwise been returned by querying the pairs
field with those same arguments. The rules for generating these toplevel queries are also straightforward: for each field in the schema that returns a value of type [T!]!
(i.e.: a non-null list), create a new field with the same arguments that returns an entity of type T_info
(this applied to non-toplevel fields as well, see the Pair
entity’s swaps_info
field).
The values of the *_info
entities would be calculated at query time using COUNT
and SUM
queries on the underlying database.
The advantages of this approach is that it would be entirely handled by the graph-node. In other words, the subgraph writer would not have to think about the which aggregation values they want to offer in their subgraph. However, it also makes this approach quite opinionated and the fields present in the *_info
entities should be chosen with care.
As is the case for the aggregation directives, this approach could be extended to include other SQL aggregation operations (e.g.: AVG
, MAX
, MIN
, …) by including more fields in the *_info
entities.
Conclusion
Aggregation directives and info entities would allow subgraph writers to leverage the aggregation functionality already supported in the underlying PostgreSQL database with minimal effort on their part (no additional mapping code required). Moreover, they would enable subgraph users to query aggregation data directly from the subgraph instead of having to fetch all entities first.
We are very interested in hearing what The Graph community has to say about these ideas.
Cheers!