This forum post is a discussion thread for the GRC-20 knowledge graph standard. The specification can be found on GitHub. Please ask any questions, propose any changes, or point out any room for improvement here. The spec is currently in a draft stage. Once we launch Geo Genesis, we’ll freeze the spec and all future data service implementations will need to be backwards compatible.
Abstract
This GRC introduces a standard for storing and representing knowledge on The Graph. It can be used by any application that wants to create and consume interoperable information across applications.
Data is defined as raw bits that are stored and transmitted. Information is data that is decoded and interpreted into a useful structure. Knowledge is created when information is linked and labeled to attain a higher level of understanding. This document outlines the valid serialization format for the knowledge data that is anchored onchain, shared peer-to-peer or stored locally. Using this standard, any application can access the entire knowledge graph and produce knowledge that can become part of The Graph.
Motivation
Knowledge graphs are the most flexible way of representing information. Knowledge is highly interconnected across applications and domains, so flexibility is important to ensure that different groups and individuals can independently extend and consume knowledge in an interoperable format. Most apps define their own schemas. Those schemas work for a specific application but getting apps to coordinate on shared schemas is difficult. Once data is produced by an app, custom code has to be written to integrate with each app. Migrations, versioning and interpreting data from different apps becomes even more difficult as schemas evolve. These are some of the reasons that instant composability and interoperability across apps is still a mostly unsolved problem. The original architects of the web understood this and tried to build the semantic web which was called Web 3.0 twenty years ago. Blockchains and decentralized networks give us the tools we need to build open and verifiable information systems. Solving the remaining composability challenges will enable an interoperable web3.
Computer scientists widely understand triples to be the best unit for sharing knowledge across organizational boundaries. RDF is the W3C standard for triple data. A new standard is necessary for web3 for several reasons: IDs in RDF are URIs, which typically point over HTTPS to servers controlled by corporations or individuals. This breaks the web3 requirement of not having to depend on specific server operators. Additionally, RDF doesn’t support the property graph model, which is needed to describe facts about relationships between entities. Some of RDF’s concepts are cumbersome and complex, which has hindered its adoption beyond academic and niche enterprise settings. For these reasons, a new standard is being proposed that is web3 native, benefits from the latest advancements in graph databases, and can be easily picked up by anyone who wants to build the decentralized web.
As I was saying on Twitter, congratulations on this; I'm excited to learn more about this new standard. I was mainly wondering how format and schema rules for entities and triples are enforced in this new system. Is there validation at the smart contract or indexing level?
Cool stuff! Will be watching the development of this and see how we can map our data into this over time. This seems to be somewhat similar to: https://www.intuition.systems (maybe some others that I don’t know about)
We are first building our dataset in MySQL with the goal of being able to map it into systems like this, hence the interest. But I'm sharing the above as it could be an interesting case for you to map into GRC-20.
Note at time of posting (25/12/24): This data is currently not released under any licence, but that also means it is free for anyone just to play around with.
A design principle that we have here is that Types and Attributes on Types are more like hints. We currently don't perform validation based on Types and Attributes. There are a few reasons behind this philosophy:
Triples are fully self-describing. A Triple Value has a Value Type. That type is one of the 6 native types. The value itself has to match the value type, but the value type doesn't have to match the Type on the entity, since a triple is its own atomic unit outside the context of any entities and types. A triple may match a type definition in one space, but not one in another space, for example. Ultimately we want knowledge that's produced to always be usable in future contexts that you may not have known about when the knowledge was produced. The way we achieve that is by not enforcing schemas but instead allowing people to create different views on top of the knowledge that's been produced. Types, and the attributes on types, are then viewed more as hints: a way to nudge applications toward using the same attributes to refer to the same things.
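To make the self-describing point concrete, here is a minimal sketch of a triple whose value carries its own type. The field names and the list of native value types shown here are illustrative, not the spec's exact serialization:

```ts
// Illustrative shape of a self-describing triple. Field names and the exact
// set of six native value types are defined by the spec; this sketch only
// shows the idea that the value carries its own type.
type NativeValueType = "TEXT" | "NUMBER" | "CHECKBOX" | "URL" | "TIME" | "POINT";

interface TripleValue {
  type: NativeValueType; // the value declares its own type...
  value: string;         // ...so it can be interpreted without consulting the entity's Type
  language?: string;     // only meaningful for TEXT values
}

interface Triple {
  entity: string;    // entity ID
  attribute: string; // attribute ID
  value: TripleValue;
}

// Valid on its own terms, even if some space's Type definition expects a
// different value type for this attribute.
const example: Triple = {
  entity: "entity-id",
  attribute: "attribute-id",
  value: { type: "TEXT", value: "A self-contained fact", language: "en" },
};
```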
Different data services can be built on top of the same datasets. We could ingest knowledge into different types of databases. I understand that relational databases have strict type constraints… We currently don't use typed relational databases anywhere. We do have a relational database that indexes just into Entities and Triples tables. The problem with relational databases here is the foreign key constraints. You probably could create a data service that is strict about enforcing relation value types as foreign keys and dropping relations that don't match the schema; this would be an interesting area of experimentation. Right now, we're moving our focus more towards graph databases, which are more flexible than relational databases. I'd be interested in seeing how that tradeoff plays out and what the best experience is that we can create with relational databases. I'm also open to continuing the conversation about validation. There's still time to update the spec here. We've discussed this extensively and our current position is "don't validate", but I'm interested in having this conversation with more people, especially if people are able to work on different data service implementations and can share their experience.
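As a rough sketch of that non-validating approach, here is what indexing into flat Entities and Triples stores could look like, with in-memory maps standing in for database tables. Everything here is illustrative rather than the actual data service:

```ts
// Sketch of a non-validating indexer: every incoming triple is upserted into
// flat Entities and Triples stores with no foreign-key or type checks.
// In-memory maps stand in for the database tables; illustrative only.
type AnyTriple = {
  space: string;
  entity: string;
  attribute: string;
  value: { type: string; value: string };
};

const entities = new Map<string, { id: string }>();
const triples = new Map<string, AnyTriple>();

function indexTriple(t: AnyTriple): void {
  // Entity rows are created lazily; there is no constraint that referenced
  // entities exist, and no check against the entity's Type.
  if (!entities.has(t.entity)) entities.set(t.entity, { id: t.entity });

  // Later triples for the same (space, entity, attribute) upsert earlier ones.
  triples.set(`${t.space}:${t.entity}:${t.attribute}`, t);
}
```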
I'm on the team working on the next iteration of the knowledge graph backend using the Neo4j graph database, and I wanted to propose an addition to the spec that could improve efficiency and developer experience: batch ops.
Currently, creating a new instance of a type requires multiple separate triples - one to establish the type relationship and additional triples for each attribute value. For example, creating an instance of a type with 4 attributes requires setting 5 separate triples (1 for the type and 1 for each attribute field), each with its own op.
While this granular approach provides maximum flexibility, it can lead to unnecessarily large payloads and added complexity when working with structured data.
I propose adding a batch operation that would allow an entire type instance to be created in a single atomic operation. The operation would specify:
The instance ID
The type entity ID
An ordered set of values corresponding to the type’s defined attributes
This would be particularly valuable for handling relations in the Neo4j property graph model. Currently, establishing a relationship requires coordinating multiple pieces of information - the source entity, target entity, and relation type, in addition to the fact that the entity itself is a relation. A batch operation would ensure all this information is processed atomically while reducing the complexity of managing these relationships.
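A rough sketch of what such an op could look like for the relation case, contrasted with today's one-op-per-triple approach; the op name, field names, and IDs below are hypothetical and not part of the current spec:

```ts
// Hypothetical batch op that creates a complete instance of a type atomically.
// All op names, field names, and IDs are placeholders, not spec.
interface CreateInstanceOp {
  op: "CREATE_INSTANCE";
  instanceId: string; // ID of the new entity
  typeId: string;     // ID of the type being instantiated
  values: string[];   // ordered to match the type's defined attributes
}

// A relation needs all of its required fields at once (relation type, from,
// to, plus the fact that it is a relation), which one batch op can guarantee:
const createRelation: CreateInstanceOp = {
  op: "CREATE_INSTANCE",
  instanceId: "relation-entity-id",
  typeId: "relation-type-id",
  values: [
    "works-at-relation-type-id", // relation type
    "person-entity-id",          // from
    "company-entity-id",         // to
  ],
};
// Today the same instance would take one separate triple op per field.
```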
The benefits include:
Significantly reduced payload sizes (e.g. 5 ops become 1)
Simplified client implementation for common operations
Atomic creation of complete type instances
More efficient handling of property graph relations
Thanks for this proposal @0xThierry. I'm super supportive of it. The question is whether we should try to get this in for 1.0 or whether it can come in a later release. I'm up for drawing something out and seeing how we feel about it, potentially next week.
Do you think there are any situations in which ops should be required to be in a batch? Or should including ops in a batch always be optional?
I would say that triples that define relations should always be batched.
Relations are the only entities I can think of that have multiple required fields (type, relation type, to, from) in order to be complete.
This becomes a big deal when “rolling up” the triples into actual nodes and edges in Neo4j, so perhaps it would be worth it to make batching these triples required by the time we move to Neo4j for the data service.
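To illustrate what that rollup could look like once all four required fields are available, here is a sketch using the neo4j-driver package. The Cypher, labels, and connection details are assumptions for illustration, not the actual data service implementation:

```ts
import neo4j from "neo4j-driver";

// Sketch: once all required fields of a relation entity are known (type =
// Relation, relation type, from, to), it can be rolled up into a single
// Neo4j edge. Connection details, labels, and Cypher are illustrative only.
const driver = neo4j.driver(
  "bolt://localhost:7687",
  neo4j.auth.basic("neo4j", "password")
);

async function rollUpRelation(relation: {
  id: string;           // the relation entity's own ID
  relationType: string; // ID (or label) describing the edge type
  from: string;         // source entity ID
  to: string;           // target entity ID
}): Promise<void> {
  const session = driver.session();
  try {
    // MERGE both endpoint nodes, then MERGE an edge carrying the relation's ID.
    await session.run(
      `MERGE (a:Entity { id: $from })
       MERGE (b:Entity { id: $to })
       MERGE (a)-[r:RELATION { id: $id, relationType: $relationType }]->(b)`,
      relation
    );
  } finally {
    await session.close();
  }
}
```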
In general however, I think it should be optional.
Hello everyone,
I’ve been exploring the spec of the GRC-20 standard and I have a few questions to ensure I’m understanding them correctly:
With the language option, does Geo become fully multilingual?
With text including markdown, will we be able to include hyperlinks and code snippets in content blocks?
With Unicode Technical Standard #35, will it be possible to use BCE and CE dates in historical data?
Will it be feasible to develop prediction markets or similar applications using checkboxes to capture true/false values?
Thanks so much for addressing my questions! I’m really enthusiastic about the opportunities this standard creates and I can’t wait to see its impact. Congratulations to the team for such an inspiring accomplishment!
In an upcoming version, essentially yes. Although there is some processing that happens to convert the triples into the nodes and edges that would then be stored in Neo4j.
Yes, that’s the idea. Though I should caveat that we haven’t fully tested this. Would be great to walk through some different use cases to gain confidence. There are a few things to think through and we’ll actually need to make one small adjustment to the spec. Currently, the spec specifies that subsequent triples for the same (space, entity, attribute) tuple should be treated as an upsert. We actually need to specify that this doesn’t apply to Text values where the language is different. Additionally, we need to think through how we want to expose the language option in the API. We’ll need to special case how we expose the language parameter, and potentially we would want to let users set the language at the top level of a query and have that apply to nested Text attributes.
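A small sketch of how an indexer could key that upsert with the language carve-out for Text values; the shape is illustrative only:

```ts
// Illustrative upsert keying: triples for the same (space, entity, attribute)
// replace one another, except that TEXT values in different languages coexist.
type LangTriple = {
  space: string;
  entity: string;
  attribute: string;
  value: { type: string; value: string; language?: string };
};

function upsertKey(t: LangTriple): string {
  const base = `${t.space}:${t.entity}:${t.attribute}`;
  // TEXT values keep one entry per language instead of overwriting each other.
  return t.value.type === "TEXT" && t.value.language
    ? `${base}:${t.value.language}`
    : base;
}

const store = new Map<string, LangTriple>();
const applyTriple = (t: LangTriple) => store.set(upsertKey(t), t);
```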
Yes, we'll be able to use Markdown links… though there is a question of which protocols we want to support. Do you think it could be sufficient to disable https links in content and only support HTTP links for URL triples? I really want to move away from the web and discourage HTTP…
Yes, BCE and CE dates are supported. This is a big reason we're not using Unix timestamps.
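As a quick illustration of why Unix timestamps fall short for this (plain JavaScript date handling, not the spec's UTS #35 serialization):

```ts
// Astronomical year -43 corresponds to 44 BCE (year 0 is 1 BCE). An extended
// ISO 8601 string expresses the era directly; the equivalent Unix timestamp is
// just a large negative number with no era information.
const bceDate = new Date("-000043-03-15T00:00:00Z");

const formatted = new Intl.DateTimeFormat("en", {
  year: "numeric",
  month: "long",
  day: "numeric",
  era: "short",
  timeZone: "UTC",
}).format(bceDate);

console.log(formatted);         // e.g. "March 15, 44 BC"
console.log(bceDate.getTime()); // a large negative millisecond timestamp
```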
Interesting question. Haven’t thought through how one would implement prediction markets with GRC-20. Most of the logic for prediction markets would likely exist onchain, but the metadata could definitely be defined using GRC-20, and yes a Checkbox field could be used for the outcome of a binary market.
Excited about this initiative. I think the motivation section in @yaniv's original post clearly outlines the value proposition here.
In regards to Attributes as currently outlined in the GRC-20 knowledge graph standard, it seems in our best interest to ensure that changes to Value Type(s) are restricted. This restriction should help to reduce the impact that modifications to the global knowledge graph have on downstream type-based systems built upon it. If one requires a modified Value Type for an Attribute, one can replicate the Attribute with the desired Value Type.
An alternative would be to depend on versioning, but in my view, this introduces unnecessary complexity at both the indexing and data consumer levels.
A couple of ideas that I believe could add value to GRC-20 revolve around handling updates or updated types in a decentralized context.
In both Web2 and Web3, timestamping has become a critical component of data integrity, often baked into blockchain designs. While I understand GRC-20 might not inherently operate like a blockchain, if a piece of data (type) is modified, whether by the original proposer or a decentralized third party, how will GRC-20 handle such changes? For example:
Will it be logged simply as “updated”?
Will there be metadata such as “updated by source” or “updated by third party”?
Could there be a detailed version history tied to these changes?
On a related note, how would GRC-20 address the frequency of updates? At Pinax, for instance, we timestamp our blog articles (see below article in blue, right before author’s profile) on-chain upon creation. If updates are required, we timestamp those changes too, providing a clear revision history. This approach has proven beneficial, especially as platforms like Google in Web2 have shown an increasing preference for updated and maintained content. Could GRC-20 push this idea further in Web3?
Here are a few initial ideas for discussion:
Metadata on Updates: Every update could include metadata such as who made the change (source/third-party), the timestamp, and the reason for the change (see the sketch after this list).
Immutable History: Instead of overwriting data, each update could create a new “version” while retaining the old one, ensuring transparency and traceability.
Weighted Updates: Updates made by the original source might carry more “authority” compared to third-party contributions, with the ability to toggle or prioritize certain updates based on context or user preferences. This might open a Pandora’s box of trust questions.
Incentivizing Updates: A mechanism where contributors can earn rewards for keeping data relevant and up-to-date (e.g., akin to how maintainers work in decentralized repositories and/or DPoS-style systems).
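Purely as a discussion aid, here is a hypothetical shape for the “Metadata on Updates” and “Immutable History” ideas above; none of these fields are part of GRC-20:

```ts
// Hypothetical, discussion-level sketch: instead of overwriting a value, each
// change appends a version record with metadata about who changed it and why.
interface VersionRecord {
  versionId: string;
  previousVersionId?: string; // link to the prior version, which is never overwritten
  editedBy: string;           // account that authored the change
  isOriginalSource: boolean;  // original proposer vs. third-party contributor
  timestamp: string;          // e.g. an onchain-anchored timestamp
  reason?: string;            // optional rationale for the change
}

interface VersionedValue {
  current: string;
  history: VersionRecord[];   // append-only revision history
}
```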
Would love to hear everyone’s thoughts on these points and how GRC-20 could build upon this to create a robust, decentralized knowledge graph that emphasizes the dynamic nature of data.