5 Reasons Knowledge Graph will never bloom
Authors: Dr Dongsheng Wang and Dr Hongyin Zhu
Figure 1. Knowledge Graph Visualization Example (From Web)
A Knowledge Graph (KG) organizes knowledge as a graph of nodes and edges, expressed as subject-predicate-object triples. KG has been one of the most popular technologies in the AI community and has attracted increasing attention in recent years. However, KG is by no means a new technology; this article details five reasons why KG will never thrive (or at least not in the next two decades). By analyzing its past and present, the article aims to provide insight into the future of KG technology.
Since the Semantic Web was first proposed more than two decades ago, its key technology has remained largely unchanged despite various rebrandings, including semantic technology, knowledge base, linked data, and the current Knowledge Graph (KG). Despite over two decades of effort across these different versions, none has managed to thrive. The latest iteration, KG, has set a smaller goal of knowledge organization than its predecessor (the Semantic Web, which aimed to revolutionize the entire web), yet the question remains whether it will succeed in the coming decades. To answer that question, it is important to consider its development from inception to the present day. While recent advances in knowledge representation learning have been promising, the significant barriers that have hindered KG's success over the past 20 years have yet to be addressed. It is also worth noting that KG has not substantially promoted the development of neural networks, nor have neural networks substantially promoted KG's application. Therefore, it is critical to examine the factors that have prevented KG from reaching its full potential and to identify ways to overcome these obstacles. Only by doing so can we hope to see significant progress in this area in the future.
To begin, we must examine the problems that plagued the Semantic Web and continue to affect its descendants. From a Knowledge Graph (KG) perspective, we have identified five key challenges, including entity generalization (granularity and ambiguity), relation overloading, triple knowledge complexity, difficult accessibility, and knowledge acquisition latency. Unfortunately, knowledge representation learning, while a promising field, is unlikely to address these challenges and may even exacerbate them. As a result, it’s unlikely that KG will lead to any major breakthroughs in the near future.
At the end of the article, we offer some suggestions based on our R&D experience. We strongly believe that radical changes are necessary to break free from the stagnation that has hindered the development of KG and its predecessors. Only by taking bold, innovative steps can we move away from the limitations of the past and towards a meaningful future.
1 Origin of Knowledge Graph
1.1 Grand proposal for Web 3.0: the Semantic Web
The original vision for the Web was to progress from Web 2.0 to Web 3.0, marking a new era of development. This can be understood as a chronological evolution of three different phases: Web 1.0, Web 2.0, and the targeted Web 3.0.
Web 1.0 was a unidirectional service model where a few editors maintained servers to serve a large number of users, for example, Yahoo and Sohu. Web 2.0, on the other hand, was a bidirectional service model with no clear boundaries between users and editors, exemplified by social networks, blogs, tweets, YouTube, Instagram, and so on. As a result of this design, editors became users themselves, leading to a significant increase in data. Web 3.0, or the Semantic Web, involves the use of machines as the third component in the triangle relationship between machines, users, and editors, and is characterized by machine-readability and machine-understandability.
1.2 One word to understand the Semantic Web: unifying
The Semantic Web has seemed like a mystery ever since it was first proposed two decades ago, because it is usually described with terms such as machine-readable, semantics, reasoning, or whatever other fancy words you may have encountered. However, the core principle of the semantic web can be distilled into a single word: "unifying." The goal of the semantic web is simply to unify data in a comprehensive manner, with the RDF triple format serving as a universal solution for global knowledge representation. In the following sections, we will explore how the various complex concepts associated with the semantic web can be understood through this single word, "unifying":
- Unifying = formalized, standardized data
- Unifying = could be distributed (because of the same triple format)
- Unifying = Machine-readable (by designing a resolver to this unified format)
- Unifying = re-use & share (again, same data format)
Therefore, if someone mentions terms like “standardized, formalized, semantics, machine-readable, reasoning,” or other related terms, you can simply say “Ah, I understand, you’re referring to unifying.”
The unified form of RDF is simply a composition of triples, with each resource identified by a URI for uniqueness. The triple format, consisting of subject, predicate, and object, is considered the minimum unit necessary to describe world knowledge. This is why a relational database can always be represented as RDF data, as illustrated in the example below. The philosophy behind the triple format is that a 2-tuple is insufficient, while a 4-tuple or larger can always be decomposed into triples.
Figure 2. A 5-tuple or relational database to RDF triples
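As a minimal sketch of this conversion (using the Python rdflib library as a Jena-like toolkit; the example.org vocabulary and the row values are invented for illustration), each column of a relational row becomes one triple sharing the same subject:

```python
from rdflib import Graph, Namespace, Literal

EX = Namespace("http://example.org/")   # hypothetical namespace for illustration
g = Graph()
g.bind("ex", EX)

# A relational row such as (id=40, name="Ronald Reagan", spouse="Nancy Reagan",
# start=1981, end=1989) becomes one triple per column, all sharing one subject URI.
subject = EX["president/40"]
g.add((subject, EX.name,   Literal("Ronald Reagan")))
g.add((subject, EX.spouse, Literal("Nancy Reagan")))
g.add((subject, EX.start,  Literal(1981)))
g.add((subject, EX.end,    Literal(1989)))

print(g.serialize(format="turtle"))
```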
The SW community has long been focused on building ontologies and transforming existing web data into RDF format.
However, machine-readability is only possible when there is a parser capable of recognizing the unified RDF format. It is important to note that no data format inherently possesses reasoning capability. In the case of natural language, humans serve as the parsers because we have learned languages like English, Chinese, and Korean. Similarly, libraries such as Jena and query standards such as SPARQL have been developed as parsers for the unified RDF format, enabling machines to understand and interpret the data.
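A small sketch of what such a parser looks like in practice (rdflib here, standing in for Jena; the Turtle snippet is made up):

```python
from rdflib import Graph

# The format is only "machine-readable" because a parser (here rdflib, analogous to
# Jena in Java) knows how to turn the unified syntax into triples a program can iterate.
doc = """
@prefix ex: <http://example.org/> .
ex:matrix ex:title "The Matrix" ; ex:year 1999 .
"""
g = Graph()
g.parse(data=doc, format="turtle")
for s, p, o in g:
    print(s, p, o)   # the machine now "reads" the data as (subject, predicate, object)
```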
1.3 Four reasons why SW has faded
It has been over two decades since the concept of the Semantic Web was first introduced, yet it has not evolved into the anticipated influential Web3.0 era. Why has this been the case? We can identify four main barriers that have hindered the development of the Semantic Web.
1.3.1 Inability to share RDF
Academic agents have mainly focused on building their own knowledge bases, which are intended to be global datasets. However, these RDF databases often suffer from being hard to understand, outdated, unreliable, and substantially diverse, making it challenging to achieve linked data. As a result, they tend to become just another type of local database.
Industrial agents, on the other hand, are more realistic and pragmatic, and there is currently no strong incentive for them to widely and effectively transform or share their data as RDF. Many companies expose data on the internet in JSON format, which is often sufficient for developers. Even if some private companies and industries are willing to share their data, they may not want to make the additional effort to publish RDF, since the effort outweighs the reward.
1.3.2 Expensive to consume RDF
Consuming RDF data requires a significant effort. From an engineering perspective, applications typically interact with local data or data from specific APIs transmitted in JSON format. JSON data is structured, accessible, and sufficient for describing entities. When working with RDF data, it is necessary to first study its schema and parse it into local memory to assemble triples into complete instances. This process is necessary because triples of the same entity can be distributed, and predicates can be arbitrary.
Without effectively adopting external RDF, a local RDF database is not superior to a simple relational database. Therefore, people are still more willing to write programs that parse structured JSON than to learn an RDF schema and assemble triples in memory. Overall, manipulating RDF data is tedious, time-consuming, and expensive, and these costs outweigh its potential benefits.
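The contrast can be sketched in a few lines of plain Python (the payload and triples are invented; real RDF consumption also involves URIs and schema look-ups):

```python
import json
from collections import defaultdict

# JSON from a typical API: the whole entity arrives as one self-contained object.
payload = json.loads('{"id": "m1", "title": "The Matrix", "year": 1999}')
print(payload["title"])                   # direct field access

# RDF: the same entity may arrive as triples scattered across files or endpoints,
# so the consumer must first group them by subject to rebuild the object.
triples = [
    ("uri:m1", "rdfs:label", "The Matrix"),
    ("uri:m2", "rdfs:label", "Inception"),
    ("uri:m1", "uri:year",   "1999"),
]
entities = defaultdict(dict)
for s, p, o in triples:
    entities[s][p] = o                    # reassemble instances in memory
print(entities["uri:m1"]["rdfs:label"])   # only now can we "look up" the title
```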
1.3.3 Ontology definition splurging
The ideal scenario for RDF datasets is that everyone uses shared vocabularies and schemes to define them, which means unifying ontology schemes as much as possible.
However, the reality is that scheme splurging is usually exponential, as it results from different groups of people in different places. As a result, understanding a new RDF dataset from an unknown source can take a significant amount of time. Despite Linked Open Vocabularies (LOV) having been available for years, unifying schemes is still a challenge. It seems unlikely that there will be a breakthrough in this field in the near future, even with similar efforts spanning decades.
Figure 3. LOV (2018)
1.3.4 Expensive to use first-order logic
The focus on reasoning capacity by early Semantic Web researchers created high expectations but ultimately led to disappointment. Despite advances in reasoning techniques, including path reasoning using machine learning, inference accuracy remains unsatisfactory and unreliable. Three main reasons contribute to the failure of this approach.
First, first-order logic has not been widely adopted in traditional AI systems, and its reasoning efficiency and effectiveness are questionable.
Second, description logics require high-quality and small-scale data, making it infeasible to scale up to the large amounts of dirty data generated daily.
Finally, alternative methodologies, such as machine learning models trained on massive datasets, can be more effective and efficient than description logics. As a result, first-order logic is expensive and replaceable.
2 Five reasons KG will not thrive
The Semantic Web has faded into obscurity, but its successor, the Knowledge Graph (KG), has been proposed as a semantic representation framework for NLP and IR problems. KG focuses on organizing information as entities and their relations, which is a smaller target than revolutionizing the entire web. Some researchers are highly optimistic about KG because they have adopted knowledge representation learning to improve its effectiveness. However, we believe that KG will not lead to any breakthrough, because the same problems that have plagued the Semantic Web for two decades are still present. Furthermore, attempting to solve these problems with KG may only exacerbate them. It is as if a child who cannot yet crawl is attempting to run; we will elaborate on this from the perspective of KG.
2.1 Entities are hard to generalize
Two significant problems associated with entities are granularity and ambiguity. One of the primary difficulties in defining entities is their granularity. Entities often lack clear boundaries, such as the distinction between ‘Mouse Brain’ and ‘Brain’ or between ‘Apple’ and ‘Green Apple.’ These entities can exist independently in different scenarios, and one may include the other. Determining the appropriate granularity of entities is a case-by-case problem that may need to be adapted for different domains or situations, making it difficult to standardize for general use. Ambiguity is another well-known challenge in defining entities. For instance, ‘Green Apple’ can refer to the fruit, a piece of music, a book title, a brand name, or even a chapter in an article. It is not always necessary to define fine-grained entities to disambiguate in all circumstances.
In recent years, there has been a surge of interest in knowledge embedding, inspired by word embedding technology. Models such as TransE and TransR represent entities and relations with distributed semantics in low-dimensional vectors, as shown in Figure 4. While these vector representations have shown promise in tasks like link prediction, they do not address the problems of entity granularity and ambiguity. Despite the development of numerous knowledge graph embeddings in academia, they have not been widely adopted in practice. Additionally, the accuracy of these models remains questionable, and the distributed representations are hard to map back to interpretable symbolic representations.
Figure 4. Knowledge graph embeddings (from web)
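For intuition, a toy TransE-style scoring function is sketched below (the two-dimensional vectors are hand-picked for illustration; a real model learns them from training triples with a margin loss). It illustrates that a plausibility score says nothing about whether 'Green Apple' should be its own entity or folded into 'Apple':

```python
import numpy as np

# Toy TransE-style scoring: a triple (h, r, t) is plausible when h + r is close to t,
# i.e. the translation distance ||h + r - t|| is small. Vectors here are made up.
emb = {
    "GreenApple": np.array([0.9, 0.1]),
    "Apple":      np.array([1.0, 0.2]),
    "Fruit":      np.array([1.2, 1.0]),
    "is_a":       np.array([0.2, 0.9]),   # relation vector
}

def transe_score(h, r, t):
    """Negative L2 distance; higher means more plausible."""
    return -np.linalg.norm(emb[h] + emb[r] - emb[t])

print(transe_score("Apple", "is_a", "Fruit"))        # high plausibility
print(transe_score("GreenApple", "is_a", "Fruit"))   # similar score: granularity is left unresolved
```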
To summarize, the problem of entity granularity and ambiguity remains unsolved in the field of knowledge representation, including semantic web and knowledge graphs. This issue was not resolved in the earlier research on semantic web technologies such as ontology definition and alignment. Additionally, when dealing with multiple knowledge graphs, the problem of ambiguity becomes more complicated.
2.2 Relation splurging and explosion
Looking at Figure 1, it appears easy to read because the relations mostly consist of multi-word phrases. However, these relations, derived from the simple triple format, can grow exponentially complex. To express natural relations between entities, a knowledge graph often resorts to compound phrases. Faced with hundreds of relations like song_of, written_by, create_by, is_author_of, build_time, cause_by, has_been_to, and has_visited, humans can become confused by the ambiguity, let alone machines. This is why it is sometimes necessary to define the domain and range if you expect certain relations to connect specific classes, and consumers must first understand the concept and relation definitions of the KG. However, as relations accumulate from different sources, they become increasingly ambiguous with respect to one another, such as belongs_to, part_of, and included_by, and are therefore difficult to disambiguate for either humans or machines.
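As a small illustration of what defining a domain and range looks like (RDFS statements in Turtle, loaded with rdflib; the ex: vocabulary is hypothetical):

```python
from rdflib import Graph

# RDFS lets a vocabulary author constrain a relation to specific classes:
# written_by should connect a Book (domain) to a Person (range).
schema = """
@prefix ex:   <http://example.org/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

ex:written_by rdfs:domain ex:Book ;
              rdfs:range  ex:Person .
"""
g = Graph()
g.parse(data=schema, format="turtle")
print(len(g))   # 2 schema triples; every new relation needs this kind of bookkeeping
```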
Researchers have been heavily focusing on knowledge representation learning in recent years, often combining it with knowledge graph embedding and deep neural models applied to NLP corpora. By projecting entities and relations into low-dimensional space (embedding space), they aim to improve knowledge graph completion, entity recognition, relation extraction, and other KG-related tasks. Pre-trained models like BERT have been particularly successful in enhancing performance in these areas. However, the massive amount of learned entities and/or relations can lead to diverse granularity and complexity, ultimately exacerbating the problem of ambiguity. Furthermore, managing new triples in a KG is already a challenging task, and when dealing with heterogeneous KGs, the issue of relation splurging can quickly spiral into relation explosion, making it extremely difficult to consume. In essence, the current obsession with knowledge representation learning is like a child attempting to fly before learning to walk.
2.3 Simple triple format leads to higher complexity
The triple format has simplified data sharing by serving as the minimum unit for representing world knowledge. However, the KG creator, in turn, has to define complex relations with compound phrases for accurate description, as discussed in Section 2.2.
The triple knowledge form eventually brings even more complexity than we imagined. When we ask who the United States president is, we can retrieve the triple (X, president, United States). This actually hides a default assumption that we are asking about the current president. When the question is slightly different, like "who was the United States president in 1989?", we have to use a query like (X, is_the_40th_president, United States) or (X, president_period, from 1981 to 1989). This immediately increases the complexity of retrieval, and, as you can imagine, this triple knowledge itself becomes ambiguous with respect to the earlier simple question. In reality, it is more common for knowledge to contain constraints involving multiple relations rather than a single relation. As shown in Figure 5 below, this is far from trivial both for those who design it and for those who consume it.
Figure 5. from web
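A hedged sketch of how the temporal variant plays out in SPARQL (the ex: schema with an intermediate "term of office" node is invented for illustration and is only one of several possible modelings):

```python
from rdflib import Graph

# Hypothetical data: the "president in 1989" fact needs an intermediate node
# (a term of office) rather than a single (X, president, United_States) triple.
data = """
@prefix ex: <http://example.org/> .
ex:term40 ex:office_holder ex:Ronald_Reagan ;
          ex:country       ex:United_States ;
          ex:start_year    1981 ;
          ex:end_year      1989 .
"""
g = Graph()
g.parse(data=data, format="turtle")

# One question, four triple patterns plus a numeric filter.
query = """
PREFIX ex: <http://example.org/>
SELECT ?who WHERE {
  ?term ex:office_holder ?who ;
        ex:country       ex:United_States ;
        ex:start_year    ?start ;
        ex:end_year      ?end .
  FILTER(?start <= 1989 && ?end >= 1989)
}
"""
for row in g.query(query):
    print(row.who)   # -> http://example.org/Ronald_Reagan
```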
We should not blame developers for asking whether it would be more straightforward to simply design a relational database table. With a relational table named 'president', you can easily define the properties that belong only to this table, such as name, x_th, period, spouse, spouse_period, etc. This concept-oriented design is, in general, more accessible and human-understandable than an RDF-oriented design.
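For comparison, a minimal relational sketch (SQLite via Python; the column names loosely follow the ones above, with 'period' split into start and end years):

```python
import sqlite3

# A concept-oriented alternative: one 'president' table whose columns mirror
# the properties mentioned above, all kept in one place.
con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE president (
        name TEXT, x_th INTEGER, period_start INTEGER, period_end INTEGER, spouse TEXT
    )
""")
con.execute("INSERT INTO president VALUES ('Ronald Reagan', 40, 1981, 1989, 'Nancy Reagan')")

# The 1989 question is a single, familiar SQL query.
row = con.execute(
    "SELECT name FROM president WHERE period_start <= 1989 AND period_end >= 1989"
).fetchone()
print(row[0])   # -> Ronald Reagan
```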
Having confused us humans, knowledge with multiple relations is also ambiguous to machines. Even with state-of-the-art knowledge representation learning models, F1 performance on complex question answering over KGs is generally below 50%, which is almost unusable for industrial applications. And when there are errors in the KG, the correction process is usually much more tedious than in a relational database.
In short, knowledge is more often diverse and conditional than simple and deterministic. When we use the simplified triple format to express complex knowledge, we end up with higher complexity.
2.4 Hard to access Knowledge Graph as a database
Consuming triple knowledge requires a deep understanding of the ontology schemes and of RDF query languages from whoever wants to access it. People inevitably have to parse and load all the triples into memory and manipulate them with a program (such as the Jena library), a SPARQL query, or sometimes both, because the triples of one entity can be located in different lines, files, or databases. This mechanism is, admittedly, the advantage of the triple format that enables distributed storage, but on the other side it raises a barrier to looking knowledge up quickly.
Moreover, the triples are usually not readable, due to URIs and their distributed nature. What we see first are triples like (<uri:28809> <uri:creator> <uri:201339>); we then have to figure out the readable relations, like 'rdfs:label', 'skos:prefLabel', or 'dbpedia:name', in order to find triples like (<uri:28809> <rdfs:label> "The Matrix"). This process always forces consumers to spend quite a bit of time figuring out what relations an entity might have. In addition, SPARQL queries are more complicated than SQL because of the diverse URIs and the complex query grammar.
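The look-up dance can be sketched as follows (identifiers and label values are illustrative; rdflib again stands in for Jena):

```python
from rdflib import Graph

# Opaque triples: nothing tells us what the subject is until we chase its label.
data = """
@prefix ex:   <http://example.org/> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

ex:work28809   ex:creator  ex:person201339 ;
               rdfs:label  "The Matrix" .
ex:person201339 rdfs:label "Lana Wachowski" .
"""
g = Graph()
g.parse(data=data, format="turtle")

# Every URI in the result has to be joined with its label to become readable.
for row in g.query("""
    PREFIX ex:   <http://example.org/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?work ?person WHERE {
      ?w ex:creator  ?p .
      ?w rdfs:label  ?work .
      ?p rdfs:label  ?person .
    }"""):
    print(row.work, "-", row.person)   # -> The Matrix - Lana Wachowski
```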
In short, the URI mechanism and SPARQL have been barriers that stop many developers from accessing KGs, and relation splurging adds more complexity than the convenience these mechanisms bring to KG consumption.
2.5 Unreliable quality and generation latency
The constructed RDF usually suffers from quality and latency problems. First, there is no technology so far that can guarantee the accuracy of RDF triples.
Even if the triples are accurate, triple knowledge acquisition is always slower than the emergence of new data. In other words, RDF generation is a linear human effort, while new information grows exponentially. For example, when Elon Musk tweets a message, it is commented on and retweeted tens of thousands of times within a few minutes. This is why end-to-end deep learning models are becoming more and more popular: they avoid extracting information into structured data and instead consume the raw data directly to predict the final target.
In short, we generally do not want a structured dataset that is unreliable and stale.
3 Why people still insist on it
In this article, we have provided a comprehensive review of the development of KG, from its origin in the semantic web to its current purpose. Despite this, the main problems we have discussed remain unsolved or even exacerbated. We have argued that knowledge representation learning is unlikely to mitigate these problems; in fact, it may worsen them. For instance, the fact that even with state-of-the-art models, performance on complex question answering is still below 0.5 (F1) suggests that optimism about KG’s potential should be tempered. Given that many researchers may be aware of these issues, it is unclear why some continue to be obsessed with KG, and why some top researchers claim it could lead us to ‘cognitive intelligence’ or ‘advanced intelligence’.
We attempt to clarify our stance with the following four points:
- KG is currently the most promising technology that shares similarities with human thinking.
- There are no other alternative technologies that can sustain hope for the next breakthrough, especially after the advancements of deep learning from the connectionism community.
- We acknowledge that there is a compromise between symbolism researchers and connectionism researchers.
- Technology media often prioritize propaganda over accountability, and researchers may not face consequences for making grandiose claims.
4 Our Suggestions
While we do not have all the answers, we share the same hope as you that a new design can be proposed to replace KG, although this may be a difficult task. Our intention in this article is to highlight potential directions that could address some of the problems we have discussed.
4.1 Emphasizing knowledge rather than the form or graph
There is a need to shift the focus from the form of knowledge, such as natural language or RDF, to the knowledge itself. Natural language understanding and generation have advanced significantly, but there is still much to be done to make KGs compatible with these advancements. Recent research suggests that pre-trained language models may already contain relational knowledge, although this view is not universally accepted. Moreover, it is often not visible or consumable in a straightforward way. While the general KG may not be a viable solution, token-level semantics from models like BERT can be used to construct domain-specific entities for specific use cases.
Technically, there is a need to strike a balance between the simplicity of triple formats and the information richness of n-tuple structures in representing knowledge. While technologies like Neo4j are moving in this direction, there is a need for a new standard that involves researchers and developers in defining a schema and n-tuple knowledge that can be easily understood and accessed by both humans and machines. This will enable a more efficient and effective use of knowledge graphs in various applications.
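As one possible shape for such an n-tuple record (a plain Python sketch in the spirit of property graphs such as Neo4j; the field names are our own, not a proposed standard):

```python
from dataclasses import dataclass, field

# An illustrative n-tuple knowledge record: the core triple plus qualifiers,
# so constraints travel with the fact instead of spawning extra triples.
@dataclass
class Statement:
    subject: str
    predicate: str
    obj: str
    qualifiers: dict = field(default_factory=dict)

fact = Statement(
    subject="Ronald Reagan",
    predicate="president_of",
    obj="United States",
    qualifiers={"ordinal": 40, "start": 1981, "end": 1989},
)
print(fact.qualifiers["end"])   # -> 1989
```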
4.2 Simplify KG technology as much as possible
While the URI is considered a vital requirement for ensuring global uniqueness, cheaper alternatives could be used instead. For example, the universally unique identifier (UUID) is a widely known identifier that can identify entities both locally and globally. UUIDs can ensure that entities from different datasets remain globally unique even when merged. Thus, we could replace URIs with UUIDs for identifying entities and predicates with little impact.
If we create an n-tuple knowledge form, we should ensure that the knowledge pieces are easily accessible through both manual check-up and query languages like SPARQL (though a more lightweight version would be preferable). To enhance the usability of the n-tuple knowledge form, confidence and label should be two of the mandatory relations for each knowledge piece.
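A minimal sketch of such a knowledge piece, assuming UUID identifiers and the two mandatory fields suggested above (the structure is illustrative, not a specification):

```python
import uuid

# A UUID-identified knowledge piece carrying the two mandatory fields proposed above:
# a human-readable label and a confidence score.
def new_fact(label, subject, predicate, obj, confidence):
    return {
        "id": str(uuid.uuid4()),    # globally unique without a URI authority
        "label": label,             # mandatory human-readable description
        "confidence": confidence,   # mandatory quality estimate
        "subject": subject,
        "predicate": predicate,
        "object": obj,
    }

fact = new_fact("Reagan was the 40th US president",
                "Ronald Reagan", "president_of", "United States", 0.95)
print(fact["id"], fact["confidence"])
```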
Final words
Finally, it is important to note that our intention is not to dismiss the value of knowledge graphs altogether. Rather, we are highlighting the potential limitations and challenges that may arise as the complexity and ambiguity of KGs continue to grow. We believe that it is important to approach the development of KGs with realistic expectations and a clear understanding of the potential trade-offs involved. This article represents our perspectives and is intended for reference purposes only. We welcome any comments, criticisms, or alternative viewpoints on this topic.
About the authors:
The two authors, Dongsheng Wang and Hongyin Zhu, have extensive research and development experience spanning seven years in the fields of Knowledge Graphs and Natural Language Processing (NLP). Dongsheng Wang earned his PhD from the University of Copenhagen, while Hongyin Zhu received his PhD from the University of Chinese Academy of Sciences. Currently, Dongsheng Wang is working in industry on Conversational AI, while Hongyin Zhu is a postdoctoral researcher at Tsinghua University, working on Knowledge Graphs. If you wish to contact them, you can reach out to Dongsheng Wang at dswang2011@gmail.com and Hongyin Zhu at hongyin_zhu@163.com.