<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Andrea Rossi]]></title><description><![CDATA[Thoughts, stories and ideas.]]></description><link>https://www.andrearossi.io/</link><image><url>https://www.andrearossi.io/favicon.png</url><title>Andrea Rossi</title><link>https://www.andrearossi.io/</link></image><generator>Ghost 4.10</generator><lastBuildDate>Thu, 14 May 2026 00:08:56 GMT</lastBuildDate><atom:link href="https://www.andrearossi.io/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Kelpie: explaining embedding-based Link Predictions]]></title><description><![CDATA[<p>Hello, there!</p><p>I&apos;m very excited to announce that my latest paper has been accepted for the SIGMOD 2022 conference! It is my most ambitious work so far, and now that the peer-review process is over, I can finally disclose some details about it.</p><p>My work consists in a</p>]]></description><link>https://www.andrearossi.io/introducing-kelpie/</link><guid isPermaLink="false">61d4245c81ff8505b4d7b1b4</guid><category><![CDATA[Explainability]]></category><category><![CDATA[Link Prediction]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[Knowledge Graphs]]></category><dc:creator><![CDATA[Andrea Rossi]]></dc:creator><pubDate>Tue, 04 Jan 2022 20:04:34 GMT</pubDate><media:content url="https://www.andrearossi.io/content/images/2022/01/5_thumbnail.png" medium="image"/><content:encoded><![CDATA[<img src="https://www.andrearossi.io/content/images/2022/01/5_thumbnail.png" alt="Kelpie: explaining embedding-based Link Predictions"><p>Hello, there!</p><p>I&apos;m very excited to announce that my latest paper has been accepted for the SIGMOD 2022 conference! 
It is my most ambitious work so far, and now that the peer-review process is over, I can finally disclose some details about it.</p><p>My work consists of a novel explainability framework for embedding-based Link Prediction models, named Kelpie. Given any link yielded by an embedding-based model, Kelpie can identify why the model predicted it by extracting various types of explanations.</p><p>By the way, Kelpies are mythological shapeshifting fish-horse creatures said to dwell in the depths of Scottish lakes, so they seemed the perfect mascot for a machine learning / deep learning project <strong>*ba-dum tsss*</strong>. Also, &#x201C;Kelpie&#x201D; makes a nice acronym: &#x201C;<strong>K</strong>nowledge graph <strong>E</strong>mbeddings for <strong>L</strong>ink <strong>P</strong>rediction: <strong>I</strong>nterpretable <strong>E</strong>xplanations&#x201D;. </p><p>Since this work is deeply rooted in Link Prediction on Knowledge Graphs, allow me to start with a brief recap on the topic &#x1F601;</p><h1 id="knowledge-graphs-and-link-prediction">Knowledge Graphs and Link Prediction</h1><p>As I already <a href="https://www.andrearossi.io/link-prediction-primer/">mentioned in the past</a>, Knowledge Graphs are repositories of real-world information where entities are connected via edges labeled by relations: they thus form &lt;<em>head</em>, <em>relation</em>, <em>tail</em>&gt; triples called <em>facts</em>. Knowledge Graphs are powerful but generally incomplete; Link Prediction tackles this issue by inferring new facts from the patterns and semantics of the already known ones.</p><p>Most Link Prediction models nowadays map the entities and relations in the graph to vectorized representations called Knowledge Graph Embeddings. Embedding-based Link Prediction models generally define a scoring function &#x3A6; that, given a head entity, a relation, and a tail entity, uses their embeddings to estimate a plausibility value for the corresponding fact. 
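To make the idea of the scoring function concrete, here is a toy sketch of a DistMult-style score in Python. The entity names, the dimensionality, and the random initialization are purely illustrative; this is not the code of any specific model:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 50  # embedding dimensionality (illustrative)

# Randomly initialized embeddings, as they would be before training.
entity_emb = {e: rng.normal(size=DIM) for e in ["Barack_Obama", "USA", "Honolulu"]}
relation_emb = {r: rng.normal(size=DIM) for r in ["nationality", "born_in"]}

def phi(head: str, relation: str, tail: str) -> float:
    """DistMult-style score: element-wise product of the head, relation,
    and tail embeddings, summed up. Training adjusts the embeddings so
    that known true facts receive high scores."""
    return float(np.sum(entity_emb[head] * relation_emb[relation] * entity_emb[tail]))

score = phi("Barack_Obama", "nationality", "USA")
```

In a real model the dictionaries above are learned parameter matrices, and the scores feed a ranking or likelihood loss computed over the training facts.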
The embeddings of all entities and relations are usually initialized randomly, and then trained with Machine Learning methods to optimize the &#x3A6; values of a set of facts known to be true a priori (i.e., our training set). The trained embeddings should generalize and result in high plausibility scores for unseen true facts too.</p><h1 id="explainability-and-link-prediction">Explainability and Link Prediction</h1><p>Embedding-based models have achieved promising results in Link Prediction, often surpassing traditional rule-based counterparts. Unfortunately, these systems are almost always opaque: the embedding of an entity or relation is just a vector of numbers, with no memory of which training facts have been most influential to it, and with no insights on why it supports certain predictions while hindering others.</p><p>Explaining the outcomes of embedding-based Link Prediction models is thus becoming an increasingly urgent challenge. Link Prediction models are often used in scenarios that inherently require explainability, such as <a href="https://dl.acm.org/doi/10.1145/3357384.3358036">fact checking</a> or <a href="https://www.nature.com/articles/s41598-019-42806-6">drug discovery and repurposing</a>; furthermore, explanations can reveal if our systems are leveraging reliable patterns or rather spurious correlations, thus helping us assess their trustworthiness (or lack thereof).</p><p>One could ask: &#x201C;Why don&#x2019;t we just use one of the general-purpose explainability frameworks that are already out there? Why do we need a new framework specific to Link Prediction?&#x201D; The answer is that, unfortunately, no general-purpose frameworks seem to apply well to embedding-based Link Prediction models. 
</p><p>Most explainability frameworks operate by perturbing the features of the input samples and then checking the resulting effect on predictions: if the prediction outcome changes, it means that the perturbed features were relevant, or <em>salient</em>, to it. These <em>saliency-based frameworks</em>, unfortunately, are only useful when the input features of samples are directly interpretable by humans: e.g., macro-pixels in images, or words in a sentence. In the case of Link Prediction, samples are just triplets of embeddings: perturbation approaches would thus just identify the most salient components of those vectors, which would not be informative from a human point of view. </p><p>A few frameworks follow a different paradigm, and try to identify the training samples that have been most influential to the prediction to explain. The <a href="https://arxiv.org/abs/1703.04730">approach by Koh and Liang</a>, based on the robust statistics concept of Influence Functions, is considered the cornerstone of this category. This approach seems quite sensible for our scenario; unfortunately, Influence Functions are computationally very expensive, and this approach has proven unfeasible for explaining Link Predictions.</p><h1 id="introducing-kelpie">Introducing Kelpie</h1><p>Kelpie overcomes these issues by mixing the advantages of both categories of frameworks. On the one hand, similarly to Influence Function methods, Kelpie explains predictions in terms of the training samples that have enabled them in the first place; on the other hand, it identifies those training samples with a custom, saliency-inspired approach which is feasible for the Link Prediction field.</p><p>More specifically, Kelpie interprets any tail prediction &lt;<em>h</em>, <em>r</em>, <em>t</em>&gt; by identifying the enabling training facts mentioning the head entity <em>h</em>, and, analogously, any head prediction by identifying the enabling facts featuring the tail. 
Given any tail prediction &lt;<em>h</em>, <em>r</em>, <em>t</em>&gt;, Kelpie supports two different explanation types:</p><ul><li>a <strong>necessary explanation</strong> is the smallest set of training facts mentioning <em>h</em> such that, if those facts are erased from the training set, the model (after retraining from scratch) will predict a different tail for head <em>h</em> and relation <em>r</em>. <br></li></ul><!--kg-card-begin: html--><p align="center">
    <img alt="Kelpie: explaining embedding-based Link Predictions" src="https://www.andrearossi.io/content/images/2022/01/5_necessary.png" style="width:60rem;"><br>
    <small>Given the tail prediction &lt;Barack Obama, nationality, USA&gt;, a necessary explanation is the set {&lt;Barack Obama, born_in, Honolulu&gt;, &lt;Barack Obama, president, USA&gt;} if, removing those facts from the training set, the model ceases to predict that Barack Obama is American.</small>
</p><!--kg-card-end: html--><ul><li>a <strong>sufficient explanation</strong> is the smallest set of training facts mentioning <em>h</em> such that, if those facts are added to any random entity <em>e</em>, the model (after retraining from scratch) will predict &lt;<em>e</em>, <em>r</em>, <em>t</em>&gt; too.</li></ul><!--kg-card-begin: html--><p align="center">
    <img alt="Kelpie: explaining embedding-based Link Predictions" src="https://www.andrearossi.io/content/images/2022/01/5_sufficient.png" style="width:60rem;"><br>
    <small>Given the tail prediction &lt;Barack Obama, nationality, USA&gt;, a sufficient explanation is the fact &lt;Barack_Obama, president, USA&gt; if, adding the USA presidency to any non-American entity (e.g.,&#xA0;&#xC9;dith Piaf, Vladimir Putin, or Pikachu) the model starts to predict their nationality as USA.</small>
</p><!--kg-card-end: html--><p>The same concepts apply to head predictions as well.</p><p>In order to identify which sets of facts constitute necessary or sufficient explanations, Kelpie creates alternate versions of the already existing entities, called <em>mimics</em>. A mimic is featured in the same training facts as the original entity it refers to, except for a few purposefully injected perturbations, i.e. removals or additions. Ideally, a mimic should display the same behavior that the original entity would have shown if its training facts had been perturbed in that way since the very beginning. By creating mimics and checking how their predictions differ from those of the original entities, we are able to verify which sets of facts constitute a necessary or a sufficient explanation to the prediction to interpret. </p><p>Clearly, mimics are only useful as long as they are both <em>faithful</em> to the behavior that the embeddings of original entities would have displayed, and <em>feasible</em> in the way they are computed. For example, a faithful mimic can be easily obtained by actually re-training the model from scratch after injecting the perturbations; the heavy computational costs of re-training the whole model, however, would make this approach unfeasible.</p><p>Kelpie generates mimics that are both faithful and feasible by relying on a novel Machine Learning methodology that we have called <em>post-training</em>. 
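Concretely (jumping ahead a little), post-training boils down to optimizing a single embedding while everything else stays frozen. Here is a deliberately simplified numpy sketch, with an assumed DistMult-style score and a toy log-likelihood objective; the names, loss, and hyper-parameters are illustrative, not the paper's actual code:

```python
import numpy as np

rng = np.random.default_rng(1)
DIM = 10

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Pre-trained, *frozen* embeddings of every other entity and relation.
frozen_entities = {e: rng.normal(size=DIM) for e in ["USA", "Honolulu"]}
frozen_relations = {r: rng.normal(size=DIM) for r in ["nationality", "born_in"]}

# The (perturbed) training facts featuring the mimic entity.
mimic_facts = [("nationality", "USA"), ("born_in", "Honolulu")]

# The embedding of the mimic: the ONLY vector that gets updated.
mimic = rng.normal(size=DIM)

def total_log_likelihood(head_emb):
    """Sum of log-sigmoid DistMult-style scores of the mimic's facts."""
    return sum(
        float(np.log(sigmoid(np.sum(head_emb * frozen_relations[r] * frozen_entities[t]))))
        for r, t in mimic_facts
    )

before = total_log_likelihood(mimic)

# Post-training loop: gradient ascent on the mimic embedding alone,
# over the mimic's few facts only.
lr = 0.01
for _ in range(200):
    grad = np.zeros(DIM)
    for r, t in mimic_facts:
        w = frozen_relations[r] * frozen_entities[t]
        s = np.sum(mimic * w)
        grad += (1.0 - sigmoid(s)) * w  # d/d(mimic) of log sigmoid(score)
    mimic += lr * grad

after = total_log_likelihood(mimic)
```

The actual Kelpie post-training reuses the model's own optimizer, loss, and hyper-parameters; this sketch only illustrates why the procedure is cheap: a single vector is optimized against a handful of facts.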
The embedding of any mimic is initialized randomly, as any other entity embedding; then, it undergoes a training process analogous to the original training in terms of optimizer and hyper-parameters, but with two key differences:</p><ul><li>Except for the embedding of the mimic, <em>all the other embeddings and shared parameters are kept frozen and constant</em>: the only component that gets updated in the training process is the embedding of the mimic.</li><li>Since the embedding of the mimic is the only one that gets updated, <em>the training set is limited to the training facts that actually feature the mimic entity</em>, i.e., to the perturbed set of training facts of the original entity.</li></ul><!--kg-card-begin: html--><p align="center">
    <img alt="Kelpie: explaining embedding-based Link Predictions" src="https://www.andrearossi.io/content/images/2022/01/5_mimic.png" style="width:60rem;"><br>
    <small>The Obama Mimic is an alternate version of Barack Obama: its facts (in purple) are identical<br>to those of the original Barack Obama, except for a few injected perturbations.<br>And, yeah, in an alternate universe Barack Obama is a Cthulhu cultist.</small>
</p><!--kg-card-end: html--><p>These differences make the post-training a very lightweight process, because it only involves one embedding (instead of thousands) and it only optimizes the scores of a few dozen or a few hundred facts (instead of hundreds of thousands). In practice, while fully training a Link Prediction model usually takes several hours, a post-training process is generally over in a few seconds. This makes it possible to use post-training to discover which perturbations would affect the prediction to explain the most, and, thus, which training facts enabled it.</p><p>Finally, an aspect I am particularly proud of is that, thanks to the flexibility of post-training, Kelpie can theoretically support any Link Prediction model based on embeddings: this is a sorely needed quality in a field where dozens of new models are released each year. </p><p>That&#x2019;s it for this post!</p><p>As usual, I will leave here some references to additional content:</p><ul><li>you can find here the papers of three general-purpose explainability frameworks: <a href="https://arxiv.org/pdf/1602.04938.pdf">LIME</a>, <a href="https://homes.cs.washington.edu/~marcotcr/aaai18.pdf">ANCHOR</a>, and <a href="https://arxiv.org/pdf/1705.07874.pdf">SHAP</a>. They have all had a big impact on the AI community.</li><li>here are the links to <a href="https://www.ijcai.org/Proceedings/2019/0674.pdf">Data Poisoning</a> and <a href="https://arxiv.org/pdf/1905.00563.pdf">CRIAGE</a>, two very interesting data poisoning frameworks for Link Prediction models. Rather than explaining predictions, they focus on verifying the robustness of the learned embeddings to perturbations. Since their techniques and experiments partially overlap with ours, we have used them as baselines in our Kelpie experiments.</li></ul><p>See you next time! 
&#x1F44B;</p>]]></content:encoded></item><item><title><![CDATA[Data Bias and Machine Learning]]></title><description><![CDATA[<p>Hi, there!</p><p>I am super excited to announce that the paper &quot;<em>Knowledge Graph Embeddings or Bias Graph Embeddings? A Study of Bias in Link Prediction Models</em>&quot;, which I have written with Paolo Merialdo and Donatella Firmani, has won the best paper award in DL4KG &apos;21! &#xA0;&#x1F973;</p>]]></description><link>https://www.andrearossi.io/data-bias-and-machine-learning/</link><guid isPermaLink="false">618c334081ff8505b4d7b08c</guid><dc:creator><![CDATA[Andrea Rossi]]></dc:creator><pubDate>Wed, 10 Nov 2021 22:23:23 GMT</pubDate><media:content url="https://www.andrearossi.io/content/images/2021/11/4_thumbnail.png" medium="image"/><content:encoded><![CDATA[<img src="https://www.andrearossi.io/content/images/2021/11/4_thumbnail.png" alt="Data Bias and Machine Learning"><p>Hi, there!</p><p>I am super excited to announce that the paper &quot;<em>Knowledge Graph Embeddings or Bias Graph Embeddings? A Study of Bias in Link Prediction Models</em>&quot;, which I have written with Paolo Merialdo and Donatella Firmani, has won the best paper award in DL4KG &apos;21! &#xA0;&#x1F973; &#x1F973; The paper defines three types of sample selection bias, assesses their presence in the best-established Link Prediction datasets, and investigates how they affect the behavior of Link Prediction models.</p><p>I thought it would be fitting to write something about the effects of data bias on AI models on a broader scale. 
Data bias is defined as the presence of unwanted patterns or distributions in a dataset; in the context of AI, and in particular of Machine Learning, data bias can be a huge issue, because training a model on biased data usually leads the model to incorporate the bias and, thus, to yield biased outcomes.</p><h1 id="data-bias-and-machine-learning">Data Bias and Machine Learning</h1><p>The history of Machine Learning is studded with examples of data bias messing with the behaviour of models.</p><p>In 2013 the <a href="https://aclanthology.org/N13-1090.pdf">Word2Vec</a> word embeddings were famously found to reflect semantic relations among words. For example, it was observed that embedding(&#x201C;<em>King</em>&#x201D;) - embedding(&#x201C;<em>Man</em>&#x201D;) + embedding(&#x201C;<em>Woman</em>&#x201D;) lands almost exactly on embedding(&#x201C;<em>Queen</em>&#x201D;), thus conveying the relation &#x201C;<em>Man</em>&#x201D;:&#x201D;<em>Woman</em>&#x201D;=&#x201C;<em>King</em>&#x201D;:&#x201D;<em>Queen</em>&#x201D;. Unfortunately, these embeddings have also been observed to convey sexist relations, such as &#x201C;<em>Man</em>&#x201D;:&#x201D;<em>Woman</em>&#x201D;=&#x201D;<em>Doctor</em>&#x201D;:&#x201D;<em>Nurse</em>&#x201D;, or &#x201C;<em>Man</em>&#x201D;:&#x201D;<em>Woman</em>&#x201D;=&#x201D;<em>Computer Programmer</em>&#x201D;:&#x201D;<em>Homemaker</em>&#x201D;. This is likely due to how different professions are associated with men and women in the training corpora.</p><!--kg-card-begin: html--><p align="center">
    <img alt="Data Bias and Machine Learning" src="https://www.andrearossi.io/content/images/2021/11/4_doctor_nurse.png" style="width:40rem;"><br>
    <small>Ouch.</small>
</p><!--kg-card-end: html--><p>In 2015 the automatic labelling feature of Google Photos sparked controversy due to <a href="https://www.bbc.com/news/technology-33347866">tagging pictures of black people as &quot;<em>Gorillas</em>&quot;</a>. This type of misprediction typically occurs in the presence of skewed datasets: in this case, black people were probably underrepresented in the &#x201C;Person&#x201D; class in training. Interestingly, rather than (or in addition to) correcting this problem directly, Google decided to <a href="https://www.theverge.com/2018/1/12/16882408/google-racist-gorillas-photo-recognition-algorithm-ai">remove the &quot;<em>Gorilla</em>&quot;</a> class altogether from its classifier.</p><!--kg-card-begin: html--><p align="center">
    <img alt="Data Bias and Machine Learning" src="https://www.andrearossi.io/content/images/2021/11/4_google_gorillas.png" style="width:40rem;"><br>
    <small>This is really messed up.</small>
</p><!--kg-card-end: html--><p>More recently, in 2020 the PULSE method, which relies on the NVIDIA StyleGAN architecture to generate upscaled versions of low-res face pictures, was found to most often produce faces with Caucasian features <a href="https://www.theverge.com/21298762/face-depixelizer-ai-machine-learning-tool-pulse-stylegan-obama-bias">even if the person in the original low-res image had a different ethnicity</a>. Once more, this probably depends on a skewed distribution in the original StyleGAN training data, even though other reasons may also contribute.</p><!--kg-card-begin: html--><p align="center">
    <img alt="Data Bias and Machine Learning" src="https://www.andrearossi.io/content/images/2021/11/4_white_obama.png" style="width:40rem;"><br>
    <small>Aaand that&apos;s it, I&apos;m outta here</small>
</p><!--kg-card-end: html--><p>These are just a handful of examples, but dozens more could be mentioned. In all these cases, the presence of unnaturally skewed distributions in training only became apparent after the models were released in production, and they were faced with a different distribution of samples: e.g., a model that focused on a certain demographic in training will perform poorly in real-world scenarios that also involve other demographics.</p><h1 id="how-to-counter-data-bias">How To Counter Data Bias</h1><p>In short, trying to remove bias from our models or datasets is a heck of a challenge.</p><p>De-biasing a model <em>a posteriori</em>, i.e., after its training is over, is very troublesome. <a href="https://proceedings.neurips.cc/paper/2016/hash/a486cd07e4ac3d270571622f4f316ec5-Abstract.html">Some works</a> have proposed ways to mitigate gender bias in trained word embeddings. However, no approaches are general enough to cover all the many possible types of bias, or all the possible architectures of our models.</p><p>Intuitively, a more general method would be to work a priori, fixing the training data by eliminating the unwanted skew in its distribution, and then re-training the model from scratch. This approach is quite natural, but it comes with a set of issues of its own:</p><ul><li>Defining bias in operational terms is not trivial: the line separating sensible correlations from biases may not always be as clear as in the examples discussed above. 
The concept of bias inherently depends on the reference context: at the beginning of the 1900s a classifier distinguishing men from women based on whether they wore pants or gowns would be considered reasonable; the same criterion today would seem silly and controversial.</li><li>Even when a correlation is found to be clearly undesirable, e.g., <a href="https://www.reuters.com/article/us-amazon-com-jobs-automation-insight-idUSKCN1MK08G">penalizing women in hiring processes</a>, removing bias from our training data may be very challenging. Our datasets cannot be de-biased automatically: if software could automatically identify bias, then this would be an already solved issue. Instead, it is generally necessary to include human workers in the loop, which may not be feasible when dealing with hundreds of gigabytes or even terabytes of training data.</li><li>Some systems are designed to continuously learn even after they are deployed, leaving even less control over their training data. This is the case of the Tay chatbot launched by Microsoft, which should ideally have honed its conversational skills by interacting with humans on Twitter. In practice, the experiment was shut down after less than 24 hours, as <a href="https://www.theguardian.com/world/2016/mar/29/microsoft-tay-tweets-antisemitic-racism">internet trolls</a> had managed to quickly <a href="https://www.cbsnews.com/news/microsoft-shuts-down-ai-chatbot-after-it-turned-into-racist-nazi/">convert Tay into a full-fledged Nazi</a>:</li></ul><!--kg-card-begin: html--><p align="center">
    <img alt="Data Bias and Machine Learning" src="https://www.andrearossi.io/content/images/2021/11/4_tay.png" style="width:30rem;"><br>
    <small>Whelp.</small>
</p><!--kg-card-end: html--><p>It must also be acknowledged that in development even just <em>assessing</em> the presence of bias - let alone fixing it - would come at significant costs and longer development times, without any guarantee of success. Needless to say, researchers and engineers (or their managers) are not particularly eager to include these activities in their development process.</p><h1 id="some-help-from-xai">Some Help From XAI</h1><p>Explainable AI, or XAI, is a subcategory of AI trying to &#x201C;open the black box&#x201D; of our models by interpreting their outcomes and behaviours. XAI frameworks can be extremely valuable in the fight against data bias, because they can highlight the reasons why our models yield certain predictions. In development, they can tell us which correlations our model is leveraging to yield the correct answers: if such correlations are inappropriate, the model is probably biased and unsuitable for real-world uses, so further investigation is recommended.</p><p>For instance, the authors of the popular framework <a href="https://arxiv.org/abs/1602.04938">LIME</a> have shown a logistic regression classifier that could correctly distinguish Wolves from Husky dogs in their dataset, but only did so by verifying the presence of snow in the picture: in the used dataset, Wolf pictures most often included a snowy background, whereas Husky pictures didn&#x2019;t.</p><p>As another example, a recent explainability framework for Link Prediction that I have developed in the course of my PhD (more details in upcoming posts!) has revealed weird correlations. For example, we have found that in certain datasets, correctly predicting the birthplace of a person always depends on that person playing on a football team from that city or nation. 
This was caused by those datasets being very poor in personal data, so the best pattern that models can leverage seems to be the slight preference that football players may have towards teams from their birthplace.</p><h1 id="conclusions">Conclusions</h1><p>All in all, there is no clear-cut approach to cleanse our data and/or our models from bias. Fighting bias is hard because bias is heavily embedded in our data: this, in turn, is a reflection of how deeply bias is rooted in our history and in our culture. </p><p>In this regard, AI only incorporates bias if we are the ones exposing it to bias in the first place. I like to think of AI as a mirror: given a large set of training data, AI can find trends and show us a bigger picture; it can slightly deform things; or it can even show us things that we didn&apos;t know were there. But ultimately, what AI models show to us is just a reflection of what we have shown to them.</p><p>I believe this is actually a good thing: it means that we, as a species, are the ones in control. As long as we keep progressing and working hard to eliminate prejudices from our cultures, the contents and data we produce will reflect this type of improvement, and our AI models will too. </p><p>I find this kind of poetic: the only way to eradicate biases in AI models may be to keep fighting and eradicate them from our own minds.</p><p>That&#x2019;s it for this post! Thanks for reading this far &#x1F64F;</p><p>As usual, I will leave here some additional content:</p><ul><li>my <a href="https://alammehwish.github.io/dl4kg2021/papers/knowledge_graph_embeddings_or_.pdf">paper on Data bias in Link Prediction</a>, published in the Deep Learning for Knowledge Graphs (DL4KG) workshop at ISWC 2021;</li><li>and here is a <a href="https://dl.acm.org/doi/pdf/10.1145/3457607">very recent comprehensive survey</a> on this topic, published in July 2021 by researchers from the USC Information Sciences Institute.</li></ul><p>See you soon! 
&#x1F44B;</p>]]></content:encoded></item><item><title><![CDATA[Machine Learning, Videogames, and Mechanical Turks]]></title><description><![CDATA[<p>As a very passionate gamer and a Machine Learning researcher, this is a post that I definitely couldn&apos;t help writing &#x1F601;</p><p>Nowadays AI, and by extension Machine Learning, is common in several areas of game design. For example games like <a href="https://www.minecraft.net/">Minecraft</a> and <a href="https://www.nomanssky.com/">No Man&apos;s Sky</a> successfully</p>]]></description><link>https://www.andrearossi.io/machine-learning-and-videogames/</link><guid isPermaLink="false">61031d554915bb1780987fa3</guid><category><![CDATA[Machine Learning]]></category><category><![CDATA[Game Design]]></category><dc:creator><![CDATA[Andrea Rossi]]></dc:creator><pubDate>Wed, 11 Aug 2021 20:19:07 GMT</pubDate><media:content url="https://www.andrearossi.io/content/images/2021/08/3_thumbnail.png" medium="image"/><content:encoded><![CDATA[<img src="https://www.andrearossi.io/content/images/2021/08/3_thumbnail.png" alt="Machine Learning, Videogames, and Mechanical Turks"><p>As a very passionate gamer and a Machine Learning researcher, this is a post that I definitely couldn&apos;t help writing &#x1F601;</p><p>Nowadays AI, and by extension Machine Learning, is common in several areas of game design. For example games like <a href="https://www.minecraft.net/">Minecraft</a> and <a href="https://www.nomanssky.com/">No Man&apos;s Sky</a> successfully apply procedural AI to create entire worlds, and Machine Learning is heavily used in graphics: the <a href=" https://www.nvidia.com/it-it/geforce/technologies/dlss/">Nvidia DLSS</a> technology allows games to render images at lower resolutions (say, 1080p) and scales them up with Machine Learning before outputting them at 4K, achieving better quality and higher framerates with an overall lighter GPU workload. 
</p><p>In this post, however, I want to discuss specifically how AI is used to model the behavior of NPCs (non-playable characters). Videogames almost always simulate scenarios where one or multiple agents cooperate or compete to achieve certain goals, so controlling NPCs is intuitively the first task that comes to mind when mentioning AI in game design.</p><h2 id="ai-and-games-an-age-old-love-story">AI and Games: an Age-Old Love Story</h2><p>The idea of gaming AIs capable of outwitting humans has fascinated people for ages. In the 18th century the &#x201C;Turk&#x201D; chess automaton achieved worldwide fame for beating the likes of Napoleon Bonaparte and Benjamin Franklin; it was actually a hoax (a human secretly operated the automaton from the inside), but the idea of a Verne-esque machine smart enough to defeat humans captivated people immensely - and it still does. </p><p>Since the 1950s, computer scientists have been applying AI to board games with increasingly impressive results. In 1997 the IBM Deep Blue system famously defeated the chess champion Garry Kasparov: it was the first time a reigning champion was beaten at an intellectual task by an AI. The recent rise of Machine Learning has allowed AIs to tackle the harder game of Go: in 2016 the DeepMind AlphaGo model beat the Go champion Lee Sedol, with Sedol himself <a href="https://www.theverge.com/2019/11/27/20985260/ai-go-alphago-lee-se-dol-retired-deepmind-defeat">recently claiming that</a> AIs have become &quot;an entity that cannot be defeated&quot;.</p><p>I believe that the reason why AI in games is so appealing is that games provide a fictional setting in which both people and AI agents are restricted to the same set of rules and actions. 
This scenario facilitates the illusion of interacting with an artificial human-like intelligence, i.e., a general AI, and allows players to fully immerse themselves in the game.</p><p>Given this premise, Machine Learning should be ubiquitous in videogames, which, being natively digital, provide the ideal environment for AI agents... right?</p><h2 id="does-the-game-industry-use-machine-learning">Does the Game Industry Use Machine Learning?</h2><p>Unfortunately, nowadays Machine Learning is mostly <strong>not</strong> employed in videogames. Game developers instead prefer traditional AI techniques such as Pathfinding, Finite State Machines and Behavior Trees.</p><h3 id="pathfinding">Pathfinding</h3><p>Pathfinding studies the best way to move from a point A to a point B; most pathfinding algorithms are heavily based on graph traversal techniques, such as Dijkstra or A*. Pathfinding algorithms are used in almost all games where agents act on a map (especially if made of tiles): for example, 2D strategy games such as Age of Empires, MMORPGs such as World of Warcraft, and first-person shooters such as Half-Life and Counter-Strike. </p><figure class="kg-card kg-image-card kg-card-hascaption"><img src="https://www.codingame.com/servlet/fileservlet?id=7766456941361" class="kg-image" alt="Machine Learning, Videogames, and Mechanical Turks" loading="lazy"><figcaption>A graphical example of a Pathfinding Algorithm based on Dijkstra. (source: <a href="https://www.codingame.com/learn/pathfinding">Codingame</a>).</figcaption></figure><h3 id="finite-state-machines-fsm">Finite State Machines (FSM)</h3><p>Finite State Machines (FSMs) define all the situations (<em>states</em>) that AI agents can encounter, and script the corresponding reactions. 
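As a toy illustration, a minimal FSM for an NPC might look like this (a Python sketch with invented states and thresholds, not code from any actual game):

```python
from enum import Enum, auto

class State(Enum):
    ATTACK = auto()
    HEAL = auto()

class NpcFsm:
    """A minimal finite state machine: each tick inspects the current
    situation, possibly switches state, and emits a scripted reaction."""

    def __init__(self):
        self.state = State.ATTACK

    def tick(self, health: int) -> str:
        # Transition: low health switches the NPC to the HEAL state;
        # recovering enough health switches it back to ATTACK.
        if health < 30:
            self.state = State.HEAL
        elif health >= 80:
            self.state = State.ATTACK
        # Reaction scripted for the current state.
        return "heal yourself" if self.state is State.HEAL else "attack the player"

npc = NpcFsm()
actions = [npc.tick(h) for h in (100, 50, 20, 90)]
# actions -> ["attack the player", "attack the player", "heal yourself", "attack the player"]
```

Real games wire up many more states and transitions, but the loop is the same: check conditions, switch state, emit the scripted reaction.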
For instance, if the health value of an agent falls below a threshold, this may trigger the reaction &quot;<em>heal yourself</em>&quot;; applying it may switch the FSM to another state, which in turn will elicit a new reaction, and so on, resulting in an infinite decision-making loop. FSMs are great at modeling simple behaviors, and they have been used a lot in gaming, ranging from the ghosts of Pac-Man to the NPCs of Call of Duty, Metal Gear Solid, Halo and The Elder Scrolls (e.g., Skyrim). </p><!--kg-card-begin: html--><p align="center">
    <img alt="Machine Learning, Videogames, and Mechanical Turks" src="https://www.andrearossi.io/content/images/2021/08/3_fsm.png" style="width:45rem;"><br>
    <small>A simple schema for a FSM (source: <a href="https://gamedevelopment.tutsplus.com/it/tutorials/finite-state-machines-theory-and-implementation--gamedev-11867">gamedevelopment.tutsplus</a>).</small>
</p><!--kg-card-end: html--><h3 id="behavior-trees">Behavior Trees</h3><p>When AI agents need to choose the best among many possible actions, Behavior Trees are usually a better solution than FSMs. Behavior Trees are similar to flowcharts: possible conditions are represented as tree branches, and possible actions are the tree leaves. The tree is evaluated at regular intervals, identifying the best action to apply. If the tree is too large to visit entirely, algorithms such as Monte Carlo Tree Search (MCTS) can provide estimates of the payoff of each action. Behavior Trees are very common in turn-based strategy games, such as Civilization, Heroes of Might and Magic, and Pok&#xE9;mon.</p><!--kg-card-begin: html--><p align="center">
    <img alt="Machine Learning, Videogames, and Mechanical Turks" src="https://www.andrearossi.io/content/images/2021/08/3_behavior_tree.png" style="width:50rem;"><br>
    <small>An example of behavior tree in the <a href="https://docs.unrealengine.com/4.26/en-US/InteractiveExperiences/ArtificialIntelligence/BehaviorTrees/BehaviorTreesOverview/">Unreal Engine 4 framework</a> documentation.</small>
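</p><!--kg-card-end: html--><p>As a minimal sketch (plain Python, not the Unreal Engine API), conditions and actions can be composed into sequence and selector nodes, and the root can be re-evaluated at every tick:</p>

```python
# A tiny Behavior Tree: a sequence succeeds if ALL children succeed,
# a selector succeeds as soon as ONE child does.
def sequence(*children):
    return lambda state: all(child(state) for child in children)

def selector(*children):
    return lambda state: any(child(state) for child in children)

# Leaf nodes: a condition and two actions on a hypothetical agent state.
def enemy_visible(state):
    return state["enemy_visible"]

def attack(state):
    state["action"] = "attack"
    return True

def idle(state):
    state["action"] = "idle"
    return True

# Root: attack the enemy if one is visible, otherwise idle.
root = selector(sequence(enemy_visible, attack), idle)

state = {"enemy_visible": True}
root(state)
print(state["action"])  # attack
```

<!--kg-card-begin: html--><p align="center">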
</p><!--kg-card-end: html--><h2 id="does-the-game-industry-need-machine-learning">Does the Game Industry Need Machine Learning?</h2><p>All in all, while many AI-based tasks have shifted towards Machine Learning in the last decade, the game industry has not undergone the same process: devs have generally just refined and enhanced the same technologies, with no radical changes.</p><p>I think the main reason behind this lies in what the <strong>actual purpose of games</strong> is. Similarly to other media, such as movies or books, games build a narrative pact between developers and players: even the simplest game, through its context and rules, tells a story that the players unconsciously agree to believe. In this context AI agents are not meant to be <em>strong</em>: rather, they should be <em>plausible</em>, to sustain the suspension of disbelief. </p><p>Therefore the goal of AI in gaming is not to create smart agents, but rather to provide the <em>illusion</em> of smart agents. <a href="https://www.theverge.com/2019/3/6/18222203/video-game-ai-future-procedural-generation-deep-learning">Game designers themselves</a> claim that AI in games &quot;is the equivalent of smoke and mirrors&quot;. Interestingly, this is the same philosophy behind the &quot;Turk&quot; hoax: the most important thing was, and still is, the <em>experience</em> felt by the player rather than the actual intelligence of the machine. I find these considerations quite interesting: out of the dozens of fields where AI is currently employed, they apply almost exclusively to games, due to their inherent nature as a medium.</p><p>This explains why the game industry seems so reluctant to adopt Machine Learning: given the purpose of AI in gaming, it may be useless or even counter-productive. 
In addition to that, of course, Machine Learning is not actually devoid of issues:</p><ul><li><strong>Predictability</strong>: Machine Learning agents may behave inconsistently, e.g., they may alternate smart actions with idiotic ones, or display certain flaws only under specific conditions never encountered during development. The opacity of Machine Learning models would make it very hard for game designers to identify these behaviors and correct them.</li><li><strong>Development Processes</strong>: in most games, development is ruled by very tight schedules. Spending months (or years!) building a whole new AI engine from scratch based on Machine Learning sounds like a bad move when you can just re-use pre-existing technologies with incremental refinements. </li><li><strong>Computation</strong>: Machine Learning models, unless they are really small and simple, generally need to run on GPUs. In videogames, though, GPUs are already busy with the game graphics, so they may not be able to handle the behavior of AI agents as well (especially if there are many of them).</li></ul><p>However, I do not think that any of these issues is truly blocking: as a matter of fact, a few experimental games have already overcome them in the past, using neural networks for specific tasks, e.g., Peter Molyneux&apos;s <a href="https://www.ea.com/it-it/games/black-and-white/black-and-white">Black and White</a> in the early 2000s.</p><h2 id="will-the-game-industry-need-machine-learning">Will the Game Industry Need Machine Learning?</h2><p>If Machine Learning and the game industry do not look like a well-matched pair in the present, the same may not hold in the future. 
Researchers have already run impressive (and extremely cool) experiments proving that Machine Learning technically <strong>can</strong> be applied to videogames with amazing results:</p><ul><li>OpenAI has developed <a href="https://openai.com/five/">OpenAI Five</a>, a model capable of playing <a href="https://www.dota2.com/home">Dota 2</a>, a very popular MOBA where players clash in 5 vs 5 matches. OpenAI Five, which is based on self-play reinforcement learning, spent about 10 months in a custom distributed training process; after that, it <a href="https://openai.com/blog/openai-five-defeats-dota-2-world-champions/">defeated the Dota 2 world champions</a> in a livestreamed event, thus becoming the first AI to beat world champions in an esports game.</li><li>DeepMind has developed <a href="https://deepmind.com/blog/article/alphastar-mastering-real-time-strategy-game-starcraft-ii">AlphaStar</a>, a model that plays the famous Real-Time Strategy (RTS) game <a href="https://starcraft2.com">Starcraft 2</a>. In its <a href="https://deepmind.com/blog/article/AlphaStar-Grandmaster-level-in-StarCraft-II-using-multi-agent-reinforcement-learning">latest iteration</a> AlphaStar is limited to the same constraints as humans (e.g., viewing the game world through a camera, performing actions with a limited frequency, etc.); it has ranked above 99.8% of active players on the official server, Battle.net, achieving Grandmaster rank with all three races (Protoss, Terran, Zerg) of the game.</li></ul><p>These results prove that Machine Learning can indeed be applied to videogame agents with formidable results. The technology is here, and it is mature: studios could probably adopt it right away if it provided some added value in the greater picture of what games are for. 
I feel that the only missing piece is a scenario where Machine Learning enables new gameplay possibilities that would not be achievable otherwise; for instance, I would love to battle adaptive NPCs that learn in real time to counter the style and strategies of human players.</p><p>I can definitely see indie devs, or the most auteur-like game designers (e.g., Hideo Kojima, or the already mentioned Peter Molyneux), acting as pioneers in this field, and being later followed by more mainstream productions. </p><p>Until then, I can only keep dreaming of a neural-powered Ganondorf &#x1F62D;</p><p>Thank you for reading this far!</p><p>As usual, I&apos;ll leave here some additional sources:</p><ul><li>A <a href="https://towardsdatascience.com/artificial-intelligence-in-video-games-3e2566d59c22">nice article</a> by Laura Maass on the current state of AI in gaming;</li><li>A <a href="https://arxiv.org/abs/1912.10944">scientific survey</a> by Kun Shao <em>et al</em>. on how reinforcement learning has been used on games so far in research projects.</li></ul><p>Have a nice day!</p>]]></content:encoded></item><item><title><![CDATA[Knowledge Graphs, Link Prediction and enterprises]]></title><description><![CDATA[<p>Hello, there! 
In <a href="https://www.andrearossi.io/link-prediction-primer/">my previous post</a> I discussed the basics of embedding-based Link Prediction on Knowledge Graphs.</p><p>On that occasion I included a pointer to a <a href="https://dl.acm.org/doi/10.1145/3424672">comparative analysis</a> that I published on the topic; in this post I&#x2019;d like to borrow a few of the concepts from that</p>]]></description><link>https://www.andrearossi.io/kg-enterprises/</link><guid isPermaLink="false">6102e4a74915bb1780987ef3</guid><category><![CDATA[Knowledge Graphs]]></category><category><![CDATA[Link Prediction]]></category><category><![CDATA[Machine Learning]]></category><dc:creator><![CDATA[Andrea Rossi]]></dc:creator><pubDate>Sat, 07 Aug 2021 12:58:31 GMT</pubDate><media:content url="https://www.andrearossi.io/content/images/2021/07/2_thumbnail.png" medium="image"/><content:encoded><![CDATA[<img src="https://www.andrearossi.io/content/images/2021/07/2_thumbnail.png" alt="Knowledge Graphs, Link Prediction and enterprises"><p>Hello, there! In <a href="https://www.andrearossi.io/link-prediction-primer/">my previous post</a> I discussed the basics of embedding-based Link Prediction on Knowledge Graphs.</p><p>On that occasion I included a pointer to a <a href="https://dl.acm.org/doi/10.1145/3424672">comparative analysis</a> that I published on the topic; in this post I&#x2019;d like to borrow a few of the concepts from that work about the current status of Link Prediction research.</p><p>Quick recap: Link Prediction infers new facts in a Knowledge Graph leveraging the already known ones. Most Link Prediction models use Machine Learning to learn embeddings of the entities and relations; embeddings are optimized to fit a Scoring Function &#x3A6; that estimates the plausibility of individual facts. 
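</p><p>As a toy sketch of this recap (with made-up 2-dimensional embeddings and a TransE-style &#x3A6; chosen purely for illustration), a trained model can rank candidate tails for a query by their scores:</p>

```python
import numpy as np

# Pretend-trained toy embeddings; real ones come out of the training process.
entities = {"Rome": np.array([1.0, 2.0]),
            "Italy": np.array([3.0, 1.0]),
            "France": np.array([0.0, 0.0])}
relations = {"capital_of": np.array([2.0, -1.0])}

def phi(h, r, t):
    # TransE-style scoring function, used here purely for illustration:
    # the smaller the distance |h + r - t|, the more plausible the fact.
    return -np.linalg.norm(entities[h] + relations[r] - entities[t])

# Rank candidate tails for the query <Rome, capital_of, ?>.
candidates = sorted(entities, key=lambda t: phi("Rome", "capital_of", t), reverse=True)
print(candidates[0])  # Italy
```

<p>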
After training on many facts already known to be true, the learned embeddings should yield good &#x3A6; values for unknown true facts as well.</p><h2 id="link-prediction-trends">Link Prediction Trends</h2><p>Since the pioneering TransE model, researchers have gone into a &#x201C;Link Prediction frenzy&#x201D;, creating dozens and dozens of new embedding-based models, each with a &#x3A6; function of its own. To put things into numeric perspective, these are the yearly numbers of works citing TransE according to Google Scholar:</p><!--kg-card-begin: html--><p align="center">
    <img alt="Knowledge Graphs, Link Prediction and enterprises" src="https://www.andrearossi.io/content/images/2021/07/2_transe_citations.png" style="width:45rem;"><br>
    <small>Damn, 2020, you just couldn&#x2019;t keep the exponential trend going, could you?</small>
</p><!--kg-card-end: html--><p>This trend is pretty darn impressive. Of course, not all of these works propose new models; nonetheless, these numbers are huge &#x2013; and TransE is now a bit outdated, so the most recent works may not even mention it that often anymore. It would not be an overstatement to say that hundreds of embedding-based Link Prediction models have been developed in the last 7-8 years.</p><p>With such a crowded scene, when I started my PhD on this topic I felt utterly overwhelmed. My first reaction was denial: &#x201C;<em>There&#x2019;s no way all of these models are really meaningful! Most of them must just be junk!</em>&#x201D;. While this is not necessarily wrong, after a while my point of view started shifting. When you think about it, there is a special kind of beauty in the fact that knowledge can be learned in hundreds of different ways, each approach with its own features and quirks, even if many of them turn out to be suboptimal research-wise. I guess this is part of what makes Machine Learning so attractive, after all.</p><p>I was still puzzled, though: why are there so many Link Prediction models? Why is this topic getting this much attention, when it does not even have many practical applications yet? It took me some time to wrap my head around this, but I think the answer lies in the big shift that Knowledge Graphs have undergone in the last decade.</p><h2 id="the-knowledge-graph-boom">The Knowledge Graph Boom</h2><p>In the 2000s Knowledge Graphs (or Knowledge Bases, or Ontologies) were synonymous with Linked Open Data. The big players were open projects like <a href="https://developers.google.com/freebase">Freebase</a> or <a href="https://www.dbpedia.org/">DBpedia</a>, trying to implement Tim Berners-Lee&apos;s vision of a Semantic Web of distributed, machine-readable concepts.</p><p>In the 2010s, for better or worse, enterprise Knowledge Graphs have taken over. 
In 2010 Google outright bought Freebase and built their <a href="https://blog.google/products/search/introducing-knowledge-graph-things-not/">Google Knowledge Graph</a> (2012) on top of it to enhance their search engine with semantic knowledge (fun fact: this is where the term &#x201C;<em>Knowledge Graph</em>&#x201D; comes from!). Two years later they launched the <a href="https://research.google/pubs/pub45634/">Knowledge Vault project</a>, combining the data obtained by a multitude of extractors with Link Prediction outcomes. In the same years, Microsoft developed their own Knowledge Graph <a href="https://blogs.bing.com/search/2013/03/21/understand-your-world-with-bing">Satori</a> (&#x201C;<em>understanding</em>&#x201D; in Japanese), and in 2017 they merged it with Bing into the <a href="https://www.microsoft.com/en-us/bing/apis/bing-entity-search-api">Bing Entity Search</a> service.</p><p>Big marketplaces like Amazon and eBay joined the game too, developing product graphs to encompass semantic knowledge on the products they sell. <a href="https://ieeexplore.ieee.org/document/8731403">Amazon</a> leverages the data in their graph to improve product recommendation, while <a href="https://www.ebayinc.com/stories/news/cracking-the-code-on-conversational-commerce/">eBay</a> mostly uses theirs to power smart conversational agents. In 2018 both <a href="https://medium.com/airbnb-engineering/scaling-knowledge-access-and-retrieval-at-airbnb-665b6ba21e95">Airbnb</a> and <a href="https://eng.uber.com/uber-eats-query-understanding">Uber Eats</a> announced the creation of their own Knowledge Graphs, which they use to improve the recommendation of activities and foods/restaurants, respectively.</p><p>In the meantime, the social networks did not stay dormant. 
In 2013 Facebook launched their <a href="https://newsroom.fb.com/news/2013/01/introducing-graph-search-beta">Facebook Graph Search</a> project to leverage semantic knowledge on the entities and topics that users are most invested in; it is mostly used to improve user profiling and, thus, provide users with better recommendations and targeted advertising. In 2016 <a href="https://engineering.linkedin.com/blog/2016/10/building-the-linkedin-knowledge-graph">LinkedIn</a> announced their own Knowledge Graph with similar purposes, and in 2020 the same route was followed by <a href="https://medium.com/pinterest-engineering/interest-taxonomy-a-knowledge-graph-management-system-for-content-understanding-at-pinterest-a6ae75c203fd">Pinterest</a>.</p><p>It is undeniable that such a Knowledge Graph boom has boosted many related research topics too. In the specific case of Link Prediction, I think it has become so popular because it has hit the &#x201C;sweet spot&#x201D; of three extremely favorable conditions:</p><ol><li>It applies to tools that have rapidly become very useful and profitable for giant tech companies, i.e., <em>Knowledge Graphs</em>;</li><li>It attempts to tackle arguably the greatest issue that such tools suffer from, i.e., <em>incompleteness</em>;</li><li>It does so with super fancy and trendy novel technologies that everybody is interested in, i.e., <em>Machine Learning</em>.</li></ol><p>I finally tried to come full circle, developing my own organization of the messy landscape of Link Prediction models. </p><h2 id="link-prediction-taxonomy">Link Prediction Taxonomy</h2><p>This taxonomy groups embedding-based models based on the interpretation of their &#x3A6; function.</p><!--kg-card-begin: html--><p align="center">
    <img alt="Knowledge Graphs, Link Prediction and enterprises" src="https://www.andrearossi.io/content/images/2021/07/2_taxonomy.png" style="width:100%;"><br>
    <small>My taxonomy for Link Prediction models, with a small selection of representative examples. And with colors!</small>
</p><!--kg-card-end: html--><p>I identified three main families, each with further sub-groups:</p><ul><li><strong>Matrix Factorization Models</strong><br>They mostly rely on linear algebra to combine the embeddings of heads, relations, and tails. I further split them into Bilinear Models, based on bilinear products, and Non-bilinear Models, that may employ more &#x201C;esoteric&#x201D; operations, e.g., Circular Correlation or Tucker decomposition.</li><li><strong>Geometric Models</strong><br>They interpret relations as geometric operations in the embedding space. Starting from TransE, which is a purely translational model, researchers have studied smart ways of including Additional Embeddings, often mapping each entity to multiple relation-specific embeddings. Lately, though, Roto-translational operations have become the most promising research direction in this family.</li><li><strong>Deep Learning Models</strong><br>They rely on Neural Networks, which include deep sequences of layers interspersed with non-linear activation functions. Their parameters are learned jointly with the embeddings, on which they operate in the &#x3A6; function. This family can be naturally divided into sub-groups based on the neural architecture type: Convolutional Models, Recurrent Models, Capsule Models, etc. The use of additional parameters makes these models quite expressive, but in turn leads to longer training times and greater risks of overfitting.</li></ul><p>That&#x2019;s it for this post! 
Thank you for reading this far &#x1F64F;</p><p>As usual I will leave here a few useful references:</p><ul><li>Some data on enterprise Knowledge Graphs have been collected from the excellent book &#x201C;<a href="https://arxiv.org/pdf/2003.02320.pdf">Knowledge Graphs</a>&#x201D;, by Hogan <em>et al</em>.</li><li>Tsinghua University has collected a <a href="https://github.com/thunlp/KRLPapers">list</a> with no less than 50 must-read scientific papers on Link Prediction: enjoy!</li></ul>]]></content:encoded></item><item><title><![CDATA[Using Embeddings to Predict New Links in Knowledge Graphs]]></title><description><![CDATA[<p>Hello, there! </p><p>This is my very first post, so I&apos;m super excited. I&apos;d like to talk about my current field of research, which deals with how Machine Learning techniques can be applied to Link Prediction on Knowledge Graphs.</p><p>Knowledge Graphs are stores of real-world information structured</p>]]></description><link>https://www.andrearossi.io/link-prediction-primer/</link><guid isPermaLink="false">61007d924915bb1780987d1c</guid><category><![CDATA[Link Prediction]]></category><category><![CDATA[Machine Learning]]></category><category><![CDATA[Knowledge Graphs]]></category><dc:creator><![CDATA[Andrea Rossi]]></dc:creator><pubDate>Wed, 28 Jul 2021 14:18:00 GMT</pubDate><media:content url="https://www.andrearossi.io/content/images/2021/07/1_thumbnail.png" medium="image"/><content:encoded><![CDATA[<img src="https://www.andrearossi.io/content/images/2021/07/1_thumbnail.png" alt="Using Embeddings to Predict New Links in Knowledge Graphs"><p>Hello, there! </p><p>This is my very first post, so I&apos;m super excited. I&apos;d like to talk about my current field of research, which deals with how Machine Learning techniques can be applied to Link Prediction on Knowledge Graphs.</p><p>Knowledge Graphs are stores of real-world information structured into nodes and labeled directed edges. 
Nodes represent <strong>entities</strong> (people, places, etc.) and they are connected by edges whose labels convey semantic <strong>relations</strong>.<br>In a Knowledge Graph, an edge linking two nodes represents a unit of information called a <strong>fact</strong>, e.g., &lt;<em>Barack_Obama</em>, <em>was_born_in</em>, <em>Honolulu</em>&gt;.</p><!--kg-card-begin: html--><p align="center">
    <img alt="Using Embeddings to Predict New Links in Knowledge Graphs" src="https://www.andrearossi.io/content/images/2021/07/1_kg.png" style="width:30rem;"><br>
    <small>A wee little Knowledge Graph.</small>
</p><!--kg-card-end: html--><p>Knowledge Graphs are quite useful and they are employed in a wide variety of contexts, from semantic web projects to user profiling and recommendation systems. Unfortunately, though, all Knowledge Graphs tend to suffer from <strong>incompleteness</strong>, as they only contain a small portion of all the real-world information they should encompass.</p><p>Link Prediction tackles incompleteness by inferring new facts from the existing ones. For instance, knowing that &lt;<em>Barack_Obama</em>, <em>was_born_in</em>, <em>Honolulu</em>&gt;, you can deduce that, probably, &lt;<em>Barack_Obama</em>, <em>has_nationality</em>, <em>USA</em>&gt; (assuming it was previously unknown).</p><p>Nowadays, most Link Prediction models map each entity and relation to vectorized representations called <strong>embeddings</strong>. A Link Prediction model based on embeddings usually works by defining a mathematical scoring function &#x3A6; that, for any fact &lt;<em>h</em>, <em>r</em>, <em>t</em>&gt;, can use the embeddings of <em>h</em>, <em>r</em>, and <em>t</em> to compute a floating-point output value &#x3A6;(<em>h</em>, <em>r</em>, <em>t</em>).</p><!--kg-card-begin: html--><p align="center">
    <img alt="Using Embeddings to Predict New Links in Knowledge Graphs" src="/content/images/2021/07/1_embeddings.png" style="width:30rem;">
    <br>
    <small>Each entity and relation is mapped to an embedding.</small>

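</p><!--kg-card-end: html--><p>As a minimal sketch (with made-up entity and relation names, and an arbitrarily chosen embedding size), this mapping boils down to a lookup table of vectors:</p>

```python
import numpy as np

rng = np.random.default_rng(42)
DIM = 50  # embedding size, a hyperparameter

# Every entity and every relation gets its own vector, randomly initialized;
# training will then adjust these values.
entity_emb = {e: rng.normal(size=DIM)
              for e in ["Barack_Obama", "Honolulu", "USA"]}
relation_emb = {r: rng.normal(size=DIM)
                for r in ["was_born_in", "has_nationality"]}

print(entity_emb["Barack_Obama"].shape)  # (50,)
```

<!--kg-card-begin: html--><p align="center">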
</p><!--kg-card-end: html--><p>In training, all the embeddings are initialized randomly, and then optimized to maximize the scores &#x3A6; of the known facts. In other words, we tweak and improve the values of the embeddings of all entities and all relations so that, for any known fact &lt;<em>h</em>, <em>r</em>, <em>t</em>&gt;, we make &#x3A6;(<em>h</em>, <em>r</em>, <em>t</em>) become as large as possible. In practice, we actually gather the &#x3A6; scores of all the known facts into a Loss Function (e.g., Negative Log-Likelihood, Binary Cross-Entropy, or Pairwise Ranking Loss), and train the embeddings to optimize the Loss value.</p><p>After the training is complete, the learned embeddings should (hopefully!) be able to generalize and to lead to high &#x3A6; values even for true facts that were not seen in training. So we can discover new relevant facts by just trying new combinations &lt;<em>entity_1</em>, <em>relation</em>, <em>entity_2</em>&gt; and checking their &#x3A6; score: if it is good enough, chances are the combination corresponds to a true, previously unknown, fact. In other words, after we have carefully trained the embeddings, we expect &#x3A6; to be a good estimate of the plausibility of any (known or unknown) fact.</p><p>Does this seem a bit too convenient? Well, it should &#x1F61C; In practice, the hardest part of building a new Link Prediction model is devising a &quot;good&quot; &#x3A6; function. Many functions sounded very good at first, but in time unexpected flaws emerged.</p><p>In this regard, the <strong>TransE</strong> model is a fine example. TransE was created by Bordes <em>et al.</em> in 2013; it is a pioneering work, and one of the very first embedding-based Link Prediction models. TransE was largely inspired by the translational properties of the <em>Word2Vec</em> neural language model, and it enforces the translational operation explicitly in its scoring function &#x3A6;:</p><!--kg-card-begin: html--><p align="center">
    &#x3A6;(<i>h</i>, <i>r</i>, <i>t</i>) = |<i>h</i> + <i>r</i> - <i>t</i>|
</p><!--kg-card-end: html--><p>The scoring function of TransE can be read in this way: given the fact &lt;<em>h</em>, <em>r</em>, <em>t</em>&gt;, we take the embedding of the head entity <em>h</em> and translate it by the relation embedding <em>r</em>; if the fact is true, we expect to land in a position close to the embedding of <em>t</em> (distance is measured with the L1 or L2 norm, so a smaller distance means a more plausible fact). In a space with 3-dimensional embeddings, applying the &#x3A6; function of TransE would look like this:</p><!--kg-card-begin: html--><p align="center">
    <img alt="Using Embeddings to Predict New Links in Knowledge Graphs" src="/content/images/2021/07/1_transe_1.png" style="width:40rem;">
    <br>
    <small>TransE scoring function in a nutshell.</small>

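</p><!--kg-card-end: html--><p>In code, the &#x3A6; function of TransE is essentially a one-liner; the 3-dimensional embeddings below are made up purely for illustration:</p>

```python
import numpy as np

def transe_score(h, r, t, norm=1):
    # TransE: translate the head by the relation, then measure how far we
    # land from the tail (L1 or L2 norm). Smaller distance = more plausible.
    return np.linalg.norm(h + r - t, ord=norm)

# Hypothetical 3-dimensional embeddings, just for illustration.
rome = np.array([1.0, 0.0, 2.0])
italy = np.array([2.0, 1.0, 2.0])
capital_of = np.array([1.0, 1.0, 0.0])

print(transe_score(rome, capital_of, italy))  # 0.0 -> a very plausible fact
```

<!--kg-card-begin: html--><p align="center">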
</p><!--kg-card-end: html--><p>This is a very simple &#x3A6; function inspired by basic geometry, but in many cases it works pretty well. For instance, it was found to correctly predict the capital cities of various countries:</p><!--kg-card-begin: html--><p align="center">
    <img alt="Using Embeddings to Predict New Links in Knowledge Graphs" src="/content/images/2021/07/1_transe_2.png" style="width:30rem;">
    <br>
    <small>TransE is good on one-to-one relations, such as <i>capital_of</i>.</small>

</p><!--kg-card-end: html--><p>Unfortunately for TransE, though, not all relations are as smooth as <em>capital_of</em>.<br>Many relations may convey &#x201C;one-to-many&#x201D; semantics: the same head entity can be connected by the same relation to multiple tail entities, e.g., &#xA0;the same uncle can have multiple nephews: &lt;<em>Donald</em>, <em>uncle_of</em>, <em>Huey</em>&gt;, &lt;<em>Donald</em>, <em>uncle_of</em>, <em>Dewey</em>&gt;, and &lt;<em>Donald</em>, <em>uncle_of</em>, <em>Louie</em>&gt;.</p><!--kg-card-begin: html--><p align="center">
    <img alt="Using Embeddings to Predict New Links in Knowledge Graphs" src="/content/images/2021/07/1_transe_3.png" style="width:30rem;">
    <br>
    <small>&quot;You are a terrible uncle, unca&apos; Donald!&quot;</small>

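</p><!--kg-card-end: html--><p>To see the problem numerically, here is a sketch with made-up 2-dimensional embeddings: <em>Donald</em> translated by <em>uncle_of</em> lands on a single point, which cannot be close to three distinct nephew embeddings at once:</p>

```python
import numpy as np

# Made-up 2-dimensional embeddings, purely for illustration.
donald = np.array([0.0, 0.0])
uncle_of = np.array([1.0, 1.0])
huey, dewey, louie = (np.array([2.0, 2.0]),
                      np.array([1.0, 0.0]),
                      np.array([0.0, 1.0]))

# Donald + uncle_of always lands on the SAME spot...
landing = donald + uncle_of

# ...so at most one of the three (distinct) nephews can sit exactly there;
# the others necessarily get worse scores.
distances = {name: np.linalg.norm(landing - t)
             for name, t in [("Huey", huey), ("Dewey", dewey), ("Louie", louie)]}
print(distances)
```

<!--kg-card-begin: html--><p align="center">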
</p><!--kg-card-end: html--><p>In cases like this, TransE fails spectacularly. Starting from the same head entity (e.g., <em>Donald</em>) and applying the same relation translation (e.g., <em>uncle_of</em>), we will just land in one spot of the vector space; and a single spot is not likely to yield equally good &#x3A6; values for all the correct tail entities (<em>Huey</em>, <em>Dewey</em> and <em>Louie</em>). The same issue occurs with many other types of relations: for starters, many-to-one and many-to-many relations, but also symmetric relations, transitive relations, and so on.</p><p>After this flaw was found, researchers did what they do best: they developed new models trying to overcome this issue, and then the next one, and so on. In time, new families of &#x3A6; functions have been tailored, each with its own pros and cons. Starting from TransE, in just a few years Knowledge Graph embeddings have become a flourishing research topic, with dozens of new models being proposed every year.</p><p>I don&apos;t want to get into too much detail - I will probably cover this in another post - but oftentimes the &#x3A6; function of such models also includes the use of deep machine learning architectures, such as convolutional or recurrent layers.<br>The weights of such layers can be seen as just additional parameters that are learned at the same time as the embeddings of entities and relations. Since they do not refer to any of the KG entities or relations, but rather affect all of them, in the Link Prediction field they are often called &#x201C;shared parameters&#x201D;.</p><p>Thank you for reading this far! 
Of course this is just an overview, and there are a lot of details worth exploring&#x2026; but that&#x2019;s a topic for another day &#x1F604; </p><p>I will just leave here some further materials if you are curious to delve deeper into these topics:</p><ul><li>First of all, here is the <a href="https://papers.nips.cc/paper/2013/file/1cecc7a77928ca8133fa24680a88d2f9-Paper.pdf">original TransE paper</a>, published at the NIPS conference in 2013. I will also leave here a link to the <a href="https://arxiv.org/pdf/1301.3781.pdf">Word2Vec paper</a> that inspired it;</li><li>I would also like to reference this <a href="https://dl.acm.org/doi/abs/10.1145/3424672">extensive comparative analysis</a> of Link Prediction models, published in the TKDD journal. I wrote it myself, so I&apos;m quite proud of it!</li></ul><p>That&#x2019;s it! Have a nice day &#x1F44B;</p>]]></content:encoded></item></channel></rss>