GRC-20: Knowledge Graph

Discussion

This forum post is a discussion thread for the GRC-20 knowledge graph standard. The specification can be found on GitHub. Please ask any questions, propose any changes, or point out any room for improvement here. The spec is currently in a draft stage. Once we launch Geo Genesis, we’ll freeze the spec and all future data service implementations will need to be backwards compatible.

Abstract

This GRC introduces a standard for storing and representing knowledge on The Graph. It can be used by any application that wants to create and consume interoperable information across applications.

Data is defined as raw bits that are stored and transmitted. Information is data that is decoded and interpreted into a useful structure. Knowledge is created when information is linked and labeled to attain a higher level of understanding. This document outlines the valid serialization format for the knowledge data that is anchored onchain, shared peer-to-peer or stored locally. Using this standard, any application can access the entire knowledge graph and produce knowledge that can become part of The Graph.

Motivation

Knowledge graphs are the most flexible way of representing information. Knowledge is highly interconnected across applications and domains, so flexibility is important to ensure that different groups and individuals can independently extend and consume knowledge in an interoperable format. Most apps define their own schemas. Those schemas work for a specific application but getting apps to coordinate on shared schemas is difficult. Once data is produced by an app, custom code has to be written to integrate with each app. Migrations, versioning and interpreting data from different apps becomes even more difficult as schemas evolve. These are some of the reasons that instant composability and interoperability across apps is still a mostly unsolved problem. The original architects of the web understood this and tried to build the semantic web which was called Web 3.0 twenty years ago. Blockchains and decentralized networks give us the tools we need to build open and verifiable information systems. Solving the remaining composability challenges will enable an interoperable web3.

Computer scientists widely understand triples to be the best unit for sharing knowledge across organizational boundaries. RDF is the W3C standard for triples data. A new standard is necessary for web3 for several reasons: IDs in RDF are URIs, which typically point to servers controlled by corporations or individuals over HTTPS. This breaks the web3 requirement of not having to depend on specific server operators. Additionally, RDF doesn’t support the property graph model, which is needed to describe facts about relationships between entities. Some of RDF’s concepts are cumbersome and complex, which has hindered its adoption beyond academic and niche enterprise settings. For these reasons, a new standard is being proposed that is web3 native, benefits from the latest advancements in graph databases, and can be easily picked up by anyone who wants to build the decentralized web.

14 Likes

As I was saying on Twitter, congratulations on this, and I’m excited to learn more about this new standard. I was mainly wondering how format and schema rules for entities and triples are enforced in this new system. Is there validation at the smart contract or indexing level?

3 Likes

Cool stuff! I’ll be watching the development of this and seeing how we can map our data into it over time. This seems to be somewhat similar to: https://www.intuition.systems (maybe some others that I don’t know about)

Our data schema: tgs.thegrid.id
Our beta GraphQL endpoint: https://beta.node.thegrid.id/graphql

We are first building our dataset in MySQL with the goal of being able to map it into systems like this, hence the interest. But I’m sharing the above as it could be an interesting case for you to map into GRC-20.

Note at time of posting (25/12/24): This data is currently not released under any licence, but that also means it is free for anyone just to play around with. :slight_smile:

2 Likes

Hey thanks for the question!

A design principle that we have here is that Types and Attributes on Types are more like hints. We currently don’t perform validation based on Types and Attributes. There are a few reasons behind this philosophy:

Triples are fully self-describing. A Triple Value has a Value Type, which is one of the 6 native types. The value itself has to match the value type, but the value type doesn’t have to match the Type on the entity, since a triple is its own atomic unit outside the context of any entities and types. A triple may match a type definition in one space but not in another, for example. Ultimately we want knowledge that’s produced to always be usable in future contexts that you may not have known about when the knowledge was produced. The way we achieve that is by not enforcing schemas, but instead allowing people to create different views on top of the knowledge that’s been produced. Types, and the attributes on types, are then viewed more as hints: a way to nudge applications towards using the same attributes to refer to the same things.
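
To illustrate, here is a rough TypeScript sketch of what a self-describing triple could look like. The field names and the list of native value types are illustrative assumptions, not the spec’s authoritative shapes.

// Rough sketch, not the normative spec shape: a triple carries its own value type.
type NativeValueType = "TEXT" | "NUMBER" | "CHECKBOX" | "URL" | "TIME" | "POINT"; // illustrative names

interface TripleValue {
  type: NativeValueType; // the value declares its own type
  value: string;         // serialized value; must be valid for `type`
  language?: string;     // only meaningful for TEXT values
}

interface Triple {
  space: string;     // space the triple was published in
  entity: string;    // entity ID
  attribute: string; // attribute ID
  value: TripleValue;
}

// This triple is interpretable on its own, with no schema lookup required:
const birthdate: Triple = {
  space: "exampleSpaceId",
  entity: "exampleEntityId",
  attribute: "exampleAttributeId",
  value: { type: "TIME", value: "1912-06-23" },
};

The point is that a consumer can interpret a triple like this without consulting the entity’s Type; Types only nudge applications toward converging on the same attributes.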

Different data services can be built on top of the same datasets, and we could ingest knowledge into different types of databases. Relational databases have strict type constraints, but we currently don’t use typed relational databases anywhere; we do have a relational database indexing just into Entities and Triples tables. The problem with relational databases here is foreign key constraints. You probably could create a data service that is strict about enforcing relation value types as foreign keys and drops relations that don’t match the schema. This would be an interesting area of experimentation. Right now, we’re moving our focus more towards graph databases, which are more flexible than relational databases. I’d be interested in seeing how that tradeoff plays out and what the best experience is that we can create with relational databases. I’m also open to continuing the conversation about validation; there’s still time to update the spec here. We’ve discussed this extensively and our current position is “don’t validate”, but I’m interested in having this conversation with more people, especially if people are able to work on different data service implementations and can share their experience.

1 Like

Hi everyone,

I’m on the team working on the next iteration of the knowledge graph backend using the Neo4j graph database, and I wanted to propose an addition to the spec that could improve efficiency and developer experience: batch ops.

Currently, creating a new instance of a type requires multiple separate triples: one to establish the type relationship and additional triples for each attribute value. For example, creating an instance of a type with 4 attributes requires setting 5 separate triples (1 for the type and one for each attribute field), each with its own op.

While this granular approach provides maximum flexibility, it can lead to unnecessarily large payloads and added complexity when working with structured data.

I propose adding a batch operation that would allow an entire type instance to be created in a single atomic operation (a rough sketch follows the list below). The operation would specify:

  • The instance ID
  • The type entity ID
  • An ordered set of values corresponding to the type’s defined attributes
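
To make this concrete, here is a rough TypeScript sketch comparing today’s per-triple ops with the proposed batch op. The op and field names are illustrative assumptions, not part of the spec.

// Today: one op per triple (illustrative shapes).
interface SetTripleOp {
  op: "SET_TRIPLE";
  entity: string;    // the instance ID
  attribute: string; // the type attribute or one of the type's attributes
  value: { type: string; value: string };
}

// Proposed: one op that creates a complete type instance atomically.
interface CreateInstanceOp {
  op: "CREATE_INSTANCE"; // hypothetical op name
  instance: string;      // the instance ID
  type: string;          // the type entity ID
  values: { type: string; value: string }[]; // ordered to match the type's defined attributes
}

// A type with 4 attributes: 5 SetTripleOps collapse into 1 CreateInstanceOp.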

This would be particularly valuable for handling relations in the Neo4j property graph model. Currently, establishing a relationship requires coordinating multiple pieces of information: the source entity, the target entity, and the relation type, in addition to the fact that the entity is a relation. A batch operation would ensure all this information is processed atomically while reducing the complexity of managing these relationships.

The benefits include:

  • Significantly reduced payload sizes (e.g. 5 ops become 1)
  • Simplified client implementation for common operations
  • Atomic creation of complete type instances
  • More efficient handling of property graph relations

Looking forward to your thoughts!

2 Likes

Thanks for this proposal @0xThierry. I’m super supportive of it. The question is whether we should try to get this in for 1.0 or whether it can come in a later release. I’m up for drawing something out and seeing how we feel about it, potentially next week.

Do you think there are any situations in which ops should be required to be part of a batch? Or should it always be optional to include ops in a batch?

1 Like

I would say that triples that define relations should always be batched. Relations are the only entities I can think of that have multiple required fields (type, relation type, to, from) in order to be complete.

This becomes a big deal when “rolling up” the triples into actual nodes and edges in Neo4j, so perhaps it would be worth it to make batching these triples required by the time we move to Neo4j for the data service.

In general however, I think it should be optional.
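
To illustrate the roll-up step mentioned above, here is a small TypeScript sketch that only emits a Neo4j-style edge once all of a relation’s required fields have arrived. The attribute names are assumptions for illustration, not the spec’s IDs.

interface RelationTriple {
  entity: string;    // ID of the relation entity
  attribute: string; // e.g. "types", "relationType", "from", "to" (illustrative)
  value: string;
}

interface Edge {
  relationType: string;
  from: string;
  to: string;
}

// Group triples by relation entity and emit an edge only once the required
// fields (type, relation type, from, to) are all present.
function rollUpRelations(triples: RelationTriple[]): Map<string, Edge> {
  const partial = new Map<string, Partial<Edge> & { isRelation?: boolean }>();
  for (const t of triples) {
    const acc = partial.get(t.entity) ?? {};
    if (t.attribute === "types" && t.value === "Relation") acc.isRelation = true;
    if (t.attribute === "relationType") acc.relationType = t.value;
    if (t.attribute === "from") acc.from = t.value;
    if (t.attribute === "to") acc.to = t.value;
    partial.set(t.entity, acc);
  }
  const edges = new Map<string, Edge>();
  for (const [id, acc] of partial) {
    if (acc.isRelation && acc.relationType && acc.from && acc.to) {
      edges.set(id, { relationType: acc.relationType, from: acc.from, to: acc.to });
    }
  }
  return edges;
}

When these triples arrive in separate, unbatched ops, the indexer has to carry that partial state until the relation is complete; requiring a batch for relations would remove that bookkeeping.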

2 Likes

That makes sense to me. Okay, I’ll take a stab.

2 Likes

So you’re going to use Neo4j to store the data (triples)?

Hello everyone,
I’ve been exploring the spec of the GRC-20 standard and I have a few questions to make sure I’m understanding it correctly:

  1. With the language option, does Geo become fully multilingual?
  2. With text including markdown, will we be able to include hyperlinks and code snippets in content blocks?
  3. With Unicode Technical Standard #35, will it be possible to use BCE and CE dates in historical data?
  4. Will it be feasible to develop prediction markets or similar applications using checkboxes to capture true/false values?

Thanks so much for addressing my questions! I’m really enthusiastic about the opportunities this standard creates and I can’t wait to see its impact. Congratulations to the team for such an inspiring accomplishment!

1 Like

In an upcoming version, essentially yes. Although there is some processing that happens to convert the triples into the nodes and edges that would then be stored in Neo4j.

Hey Armando, great questions!

  1. Yes, that’s the idea, though I should caveat that we haven’t fully tested this. It would be great to walk through some different use cases to gain confidence. There are a few things to think through, and we’ll actually need to make one small adjustment to the spec. Currently, the spec specifies that subsequent triples for the same (space, entity, attribute) tuple should be treated as an upsert; we need to specify that this doesn’t apply to Text values where the language is different (see the sketch at the end of this reply). Additionally, we need to think through how we want to expose the language option in the API. We’ll need to special-case how we expose the language parameter, and potentially we would want to let users set the language at the top level of a query and have that apply to nested Text attributes.
  2. Yes, we’ll be able to use Markdown links, though there is a question of which protocols we want to support. Do you think it could be sufficient to disable https links in content and only support HTTP links for URL triples? I really want to move away from the web and discourage HTTP…
  3. Yes, BCE and CE dates are supported. This is a big reason we’re not using Unix timestamps.
  4. Interesting question. I haven’t thought through how one would implement prediction markets with GRC-20. Most of the logic for prediction markets would likely exist onchain, but the metadata could definitely be defined using GRC-20, and yes, a Checkbox field could be used for the outcome of a binary market.
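
Regarding the adjustment mentioned in point 1, here is a rough TypeScript sketch of what the adjusted upsert keying could look like. All shapes here are illustrative assumptions, not the spec’s.

// Rough sketch: Text triples are keyed on (space, entity, attribute, language),
// so values in different languages coexist instead of upserting over each other.
interface IncomingTriple {
  space: string;
  entity: string;
  attribute: string;
  value: { type: string; value: string; language?: string };
}

function upsertKey(t: IncomingTriple): string {
  const language = t.value.type === "TEXT" ? t.value.language ?? "default" : "";
  return `${t.space}:${t.entity}:${t.attribute}:${language}`;
}

const store = new Map<string, IncomingTriple>();
function applyTriple(t: IncomingTriple): void {
  store.set(upsertKey(t), t); // later triples with the same key replace earlier ones
}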

Hi all,

Excited about this initiative. I think the motivation section in @yaniv’s original post clearly outlines the value proposition here.

With regard to Attributes as currently outlined in the GRC-20 knowledge graph standard, it seems in our best interest to ensure that changes to Value Type(s) are restricted. This restriction should help reduce the impact that modifications to the global knowledge graph have on downstream type-based systems built upon it. If one requires a modified Attribute Value Type, one can replicate the Attribute with the desired Value Type.

An alternative would be to depend on versioning, but in my view, this introduces unnecessary complexity at both the indexing and data consumer levels.

Best,
Denver

2 Likes

A couple of ideas that I believe could add value to GRC-20 revolve around handling updates or updated types in a decentralized context.

In both Web2 and Web3, timestamping has become a critical component of data integrity, often baked into blockchain designs. While I understand GRC-20 might not inherently operate like a blockchain, if a piece of data (type) is modified—whether by the original proposer or a decentralized third party—how will GRC-20 handle such changes? For example:

  • Will it be logged simply as “updated”?
  • Will there be metadata such as “updated by source” or “updated by third party”?
  • Could there be a detailed version history tied to these changes?

On a related note, how would GRC-20 address the frequency of updates? At Pinax, for instance, we timestamp our blog articles on-chain upon creation (the timestamp is shown below the article in blue, right before the author’s profile). If updates are required, we timestamp those changes too, providing a clear revision history. This approach has proven beneficial, especially as platforms like Google in Web2 have shown an increasing preference for updated and maintained content. Could GRC-20 push this idea further in Web3?

Here are a few initial ideas for discussion:

  • Metadata on Updates: Every update could include metadata such as who made the change (source/third-party), the timestamp, and the reason for the change.
  • Immutable History: Instead of overwriting data, each update could create a new “version” while retaining the old one, ensuring transparency and traceability.
  • Weighted Updates: Updates made by the original source might carry more “authority” compared to third-party contributions, with the ability to toggle or prioritize certain updates based on context or user preferences. This might open the Pandora’s box of trust.
  • Incentivizing Updates: A mechanism where contributors can earn rewards for keeping data relevant and up-to-date (e.g., akin to how maintainers work in decentralized repositories and/or DPoS-style systems).

Would love to hear everyone’s thoughts on these points and how GRC-20 could build upon this to create a robust, decentralized knowledge graph that emphasizes the dynamic nature of data.

1 Like

Hey Louis, great questions.

We do get timestamping from the blockchain itself. Each published edit is anchored onchain and has a blockchain based timestamp. Everything else can be derived. We do this in several ways:

System properties - GRC-20 introduces the concept of system properties. These are standardized properties that Indexers should generate triples for. The spec enumerates Created at, Updated at, and Created by as such system properties.
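
As an illustration of how an indexer could derive these, here is a rough TypeScript sketch; the shape of the anchored edit and the attribute names are assumptions, not the spec’s.

// Rough sketch: derive Created at / Created by / Updated at triples from the
// onchain anchoring data of each published edit.
interface AnchoredEdit {
  editId: string;
  author: string;         // account that published the edit
  blockTimestamp: number; // timestamp of the anchoring transaction
  entities: string[];     // entities touched by the edit
}

function deriveSystemTriples(edit: AnchoredEdit, seen: Set<string>) {
  const triples: { entity: string; attribute: string; value: string }[] = [];
  for (const entity of edit.entities) {
    if (!seen.has(entity)) {
      triples.push({ entity, attribute: "Created at", value: String(edit.blockTimestamp) });
      triples.push({ entity, attribute: "Created by", value: edit.author });
      seen.add(entity);
    }
    triples.push({ entity, attribute: "Updated at", value: String(edit.blockTimestamp) });
  }
  return triples;
}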

From there it’s up to the Indexer to build up a representation of the version histories. We have two Indexer implementations that are doing this well and we can standardize how we represent versions in the future.

We can absolutely have Git style histories for knowledge. We track versions for spaces as well as for each entity. We have algorithms that work today and future work could make these algorithms more efficient.

With regards to attribution, that is a very interesting ontology question. With most applications, a user performs an action and only that authorship is recorded. Medium did an interesting thing where a post had an Author and a Publication. You could show both the writer and the organization they worked for. I’m interested in this type of metadata. My current proposal is that the Edit has an entity ID and any additional metadata can be added as knowledge in the edit. This puts the ontology into userland.

GRC-20 is purely a standard for serialization. Incentives are an interesting topic. Take a look at Rem’s recent proposal and see how the updated Curator role could be used to incentivize contributing knowledge :slight_smile:

1 Like

Hey Denver, thanks for the input on validation. I agree there are interesting considerations when it comes to validation.

Currently the model is that triples are self describing and we want any application to be able to process any set of triples from any point in time.

Right now we don’t actually have a place where we enforce validation. We could introduce validation but it’s sensitive. When processing ops, we could say that certain ops aren’t valid.

Logically, I agree with the premise that attributes shouldn’t change value types. Once a value type is set on an attribute, it should never be changed. If somebody wants to publish a triple and change the value type of an established attribute, they should use a different attribute ID.
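
For example, a strict data service could enforce this on its own, outside the spec. A minimal sketch, with illustrative shapes:

// Rough sketch: remember the first value type seen for each attribute and
// drop ops that try to change it.
const valueTypeByAttribute = new Map<string, string>();

function acceptTriple(t: { attribute: string; value: { type: string } }): boolean {
  const established = valueTypeByAttribute.get(t.attribute);
  if (established === undefined) {
    valueTypeByAttribute.set(t.attribute, t.value.type);
    return true;
  }
  return established === t.value.type; // reject a change to an established value type
}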

The only question is if this should make its way into the spec. To define that precisely we would need to talk about materialized views on types from the perspective of spaces.

My considerations here are:

  1. That’s difficult to define in a spec, and we would need to introduce a concept of the state of the schema
  2. It would add a lot of logic to implementations to enforce

I’ll leave these thoughts here for now and would appreciate any reactions from others.

We implemented the ID generation and noticed that the length of the ID is not always 22 characters as described in the specification.

I investigated and realized this is due to the Base58 encoding. The theoretical minimum length of the ID is 16 characters and the maximum length is 22 characters.

import bs58 from "bs58"; // assuming a Bitcoin-alphabet Base58 encoder such as the `bs58` package
const encodeBase58 = (bytes: Uint8Array): string => bs58.encode(bytes);

// 16 zero bytes: each leading zero byte encodes as a single "1" character
const id1 = new Uint8Array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]);
const result1 = encodeBase58(id1);
console.log(result1.length); // 16

// 16 bytes of 0xff: the largest 128-bit value needs 22 Base58 characters
const id2 = new Uint8Array([255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255, 255]);
const result2 = encodeBase58(id2);
console.log(result2.length); // 22

Since a valid UUID v4 includes 6-7 bits to indicate the version and variant, the actual minimum length is higher than 16. That said, I’m wondering whether libraries should even validate that the UUID v4 version and variant are correct.

I suggest updating the specification to indicate that the ID can be a minimum of 16 and a maximum of 22 characters long, and adding a hint that the UUID v4 version and variant are not supposed to be validated.
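
If the spec is updated along these lines, a client-side check could stay as light as a character-set and length test. A minimal sketch (not normative):

// Rough sketch of a relaxed ID check: 16 to 22 characters from the Base58
// alphabet, with no validation of UUID v4 version or variant bits.
const BASE58_ID = /^[1-9A-HJ-NP-Za-km-z]{16,22}$/;

function isPlausibleId(id: string): boolean {
  return BASE58_ID.test(id);
}

console.log(isPlausibleId("1111111111111111"));       // true: the all-zero 16-byte ID
console.log(isPlausibleId("YcVfxkQb6JRzqk5kF2tNLv")); // true: a 22-character ID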

3 Likes

Thanks, good catch. We can update the spec to reflect this. I don’t think clients would need to validate that the IDs are generated correctly, but an observer could see if the IDs conform to the spec or not.

1 Like

Thank you for sharing this comprehensive standard. A few years back, I would have been all over a comprehensive ID-based system like this. But now that we have these powerful language models, I’m wondering if we still need static IDs.

Here’s my thinking:

  • Static IDs feel like they box us in and create unnecessary central points of coordination
  • ID spaces get messy real quick in distributed environments - we’ve all been there
  • They can actually get in the way of understanding context and relationships naturally

I’ve started thinking about graphs more as how we look at data, not what the data actually is. Instead of trying to turn everything into a fixed graph structure, what if we just let graphs be this flexible lens we can apply when and where we need it?

Sure, for blockchain stuff, especially finance where you need those guaranteed outcomes, I get why you’d want something more rigid. Maybe for those cases we could lean on some shared, trusted AI models for figuring out what’s what. For example, if we have unstructured data and a model that consistently produces the same graph from that data with 99.999% accuracy, we could consider this a form of deterministic “resolution” - achieving the same goals but through AI rather than static IDs.

What if we tried something more flexible:

  • Let graph structures pop up naturally based on context
  • Use AI to figure out what’s connected to what through regular language
  • Keep IDs just for those blockchain operations that absolutely need them
  • Focus more on understanding relationships than storing them in a fixed way

We would like to adopt GRC-20 at some point; we are just figuring out how to do it. Curious what others think about this.

1 Like