Hi everyone,
I’m working on a subgraph that processes a fairly large dataset, and I’ve been running into performance issues that I’m hoping to get some guidance on. The dataset contains millions of entities, and while the subgraph syncs without errors, both sync speed and query response times are much slower than I’d like.
Here are the details of my current setup:
- Mapping Functions: I use mapping functions to index new entities and update existing ones, but processing slows down significantly as the dataset grows. Are there best practices for structuring mapping functions for large datasets? (A simplified version of one of my handlers is sketched after this list.)
- Entity Relationships: My schema has several one-to-many and many-to-many relationships, and I suspect the way I’ve designed them is contributing to the slowdown. Would denormalizing the schema help in this case? (A trimmed-down piece of my schema is included below.)
- Filtering and Pagination: When querying data through the GraphQL API, I’m using filters and pagination, but the response times are still longer than I’d like. Are there query-side optimizations I can apply to speed up API responses? (An example of the kind of query I run is shown below as well.)
- Indexing Performance: Are there specific methods or configurations to improve the indexing performance during the sync process, especially for event-heavy contracts?
- Node Infrastructure: Lastly, I’m running my own Graph Node, and I wonder if hardware specs or network configurations could be a bottleneck. Is there a recommended setup for handling high-throughput subgraphs?
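For context on the mapping question, here is a simplified sketch of the pattern my handlers follow. The contract, event, entity, and field names are placeholders rather than my actual code, but the load/create/save flow is the same:

```typescript
import { BigInt } from "@graphprotocol/graph-ts"
// Placeholder generated types: "Transfer" stands in for my real event,
// "Account" for one of my schema entities.
import { Transfer } from "../generated/Token/Token"
import { Account } from "../generated/schema"

export function handleTransfer(event: Transfer): void {
  // Load the existing entity, or create it the first time it is seen.
  let id = event.params.to.toHexString()
  let account = Account.load(id)
  if (account == null) {
    account = new Account(id)
    account.balance = BigInt.fromI32(0)
  }
  account.balance = account.balance.plus(event.params.value)
  // A single save per handler invocation; I try to avoid repeated
  // load/save round trips for the same entity within one handler.
  account.save()
}
```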
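This is roughly how one of my one-to-many relationships is modeled (again with illustrative names): the child entity holds the reference, and the parent’s list is a reverse lookup via `@derivedFrom` rather than a stored array:

```graphql
type Account @entity {
  id: ID!
  balance: BigInt!
  # Reverse lookup; the list is derived at query time instead of being
  # stored and rewritten on the Account entity itself.
  transfers: [Transfer!]! @derivedFrom(field: "from")
}

type Transfer @entity {
  id: ID!
  from: Account!
  to: Account!
  value: BigInt!
}
```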
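And on the query side, this is the shape of the filtered, paginated query I’m currently running (offset-based with `first`/`skip`; entity and field names are placeholders):

```graphql
{
  transfers(
    first: 100
    skip: 5000
    orderBy: value
    orderDirection: desc
    where: { value_gt: "1000000000000000000" }
  ) {
    id
    value
    from { id }
    to { id }
  }
}
```

I’ve seen suggestions to replace deep `skip` offsets with an `id_gt`-style cursor, but I haven’t measured whether that makes a real difference at this scale.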
I’ve also looked through this related thread: https://forum.thegraph.com/t/a-process-for-specifying-the-subgraph-api-version-and-feature-support-matrix/python
I’d greatly appreciate any insights, resources, or examples of similar setups that have worked well for you. Also, if there are common pitfalls to avoid in large-scale subgraph development, please share your experiences.
Thanks in advance for your help!