Guide To AmpliGraph: A Machine Learning Library For Knowledge Graphs – Analytics India Magazine

  • Lauren
  • March 26, 2021

AmpliGraph is a TensorFlow-based open-source library developed by Accenture Labs for predicting links between concepts in knowledge graphs. It is a collection of neural ML models for statistical relational learning (SRL), also called relational machine learning, a subdiscipline of AI/ML that deals with supervised learning on knowledge graphs.
Before going into the details of AmpliGraph, let us have a quick look at what a knowledge graph means.
What is a knowledge graph?
A knowledge graph is a representation of how the various entities of a system (e.g. objects, individuals, abstract concepts, events) are interlinked. There is no single precise definition of a knowledge graph. In simple terms, it is a graph representing distinct entities and the relationships among them, according to a GitHub repository. It enables data integration and analysis by providing context for a system’s associated data. Visit this page to learn about knowledge graphs in detail. Following is an example of a knowledge graph:

Image source: GitHub

A graph is represented by a set of nodes standing for entities and connecting edges showing the relationships among them. It can be homogeneous (e.g. a social network of people and their connections, where all entities are of a common type) or heterogeneous (e.g. a graph of a university with different types of entities such as students, professors and departments, and relations such as ‘studies-at’, ‘teaches-at’ and so on). Besides, a graph can be a ‘multigraph’, in which there can be multiple directed edges between one or more pairs of nodes, some of which can even form loops.
The university graph, mentioned as a heterogeneous graph in the example above, conveys meaningful information (known as ‘semantics’) about its entities and the associated relations.
Now that we know terms like ‘heterogeneous graph’, ‘multigraph’ and ‘semantics’, we can define a knowledge graph as “a heterogeneous multigraph in which entities and relations have semantics specific to a particular domain”.
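To make the definition concrete, here is a minimal sketch of a knowledge graph stored as (subject, relation, object) triples in plain Python. The entity and relation names are made up for illustration and do not come from any dataset used in this article.

```python
from collections import defaultdict

# Each fact is a (subject, relation, object) triple; all names are illustrative.
triples = [
    ("Alice", "studies-at", "Example University"),
    ("Bob", "teaches-at", "Example University"),
    ("Bob", "advises", "Alice"),
    ("Bob", "teaches", "Alice"),  # a second edge between the same pair of nodes
]

# Adjacency view: node -> list of (relation, neighbour) pairs.
outgoing = defaultdict(list)
for subject, relation, obj in triples:
    outgoing[subject].append((relation, obj))

# Unique entities (nodes) and relation types (edge labels).
entities = sorted({t[0] for t in triples} | {t[2] for t in triples})
relations = sorted({t[1] for t in triples})

print(entities)
print(relations)
print(outgoing["Bob"])
```

The graph is heterogeneous because its nodes are of different kinds (people and an institution), and it is a multigraph because ‘Bob’ and ‘Alice’ are joined by two differently labelled edges.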
Overview of AmpliGraph
The AmpliGraph library provides ML models that create knowledge graph embeddings (KGEs), which are low-dimensional vector representations of the entities of a knowledge graph and of the relations among them.
Consider the following knowledge graph and its corresponding KGE to understand what AmpliGraph does:

Image source: GitHub
Some entity pairs in the above knowledge graph are not linked by a given relation, e.g. nothing in the graph says whether ‘Acme Inc’ and ‘Liverpool’ are connected through the ‘basedIn’ relation. AmpliGraph combines the above KGE with a scoring function and uses the resulting scores to make predictions about such new links.
For example, it predicts that there is an 85% probability of Acme Inc being based in Liverpool, which can be represented as:

Image source: GitHub
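As a rough illustration of what “combining embeddings with a scoring function” means, the sketch below scores a triple in the style of the ComplEx model, which takes the real part of the sum over dimensions of subject × relation × conjugate(object). The two-dimensional embedding values are invented for the example and are not trained; a real model learns them from the graph.

```python
def complex_score(subject_emb, relation_emb, object_emb):
    """ComplEx-style score: real part of sum_i s_i * r_i * conj(o_i)."""
    total = sum(s * r * o.conjugate()
                for s, r, o in zip(subject_emb, relation_emb, object_emb))
    return total.real

# Hypothetical 2-dimensional complex embeddings (made-up values, not trained).
acme = [1 + 1j, 0.5 - 0.5j]
based_in = [1 + 0j, 0 + 1j]
liverpool = [1 + 1j, 0.5 + 0.5j]
london = [-1 + 0j, 0.5 - 0.5j]

score_liverpool = complex_score(acme, based_in, liverpool)
score_london = complex_score(acme, based_in, london)
print(score_liverpool, score_london)
```

With these made-up vectors the triple ending in ‘Liverpool’ gets the higher score, so a model holding such embeddings would consider it the more plausible link.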
Modules of AmpliGraph

Highlighting features of AmpliGraph

It runs on CPUs as well as GPUs, the latter to speed up the training process
Its APIs reduce the amount of code required to predict links in knowledge graphs
AmpliGraph’s base estimators are extensible

Practical implementation
Here’s a demonstration of using AmpliGraph to discover novel relations in a Game of Thrones (GoT) knowledge graph. The dataset can be downloaded from here, and the graph is available on GitHub.
The condensed dataset looks something like this:

While the graph appears as follows:

Image source: GitHub
The code here was run on Google Colab with Python 3.7.10 and AmpliGraph 1.3.2. We have used the ComplEx (Complex Embeddings) model for KGE. A step-wise explanation of the code follows:
Install the AmpliGraph library
!pip install ampligraph
Import required libraries
import ampligraph
import numpy as np
import pandas as pd
import requests  #module for making HTTP requests
from ampligraph.datasets import load_from_csv
from ampligraph.evaluation import train_test_split_no_unseen 
from ampligraph.latent_features import ComplEx
from ampligraph.evaluation import evaluate_performance
from ampligraph.utils import create_tensorboard_visualizations
Download the dataset 
#Define the URL from which to download the data
data_url = 'https://ampligraph.s3-eu-west-1.amazonaws.com/datasets/GoT.csv'
#Open a file called 'GoT.csv' in binary write mode and write the contents of the
#downloaded dataset into it
with open('GoT.csv', 'wb') as f:
    f.write(requests.get(data_url).content)
#Load the knowledge graph from the GoT.csv file using load_from_csv()
data = load_from_csv('.', 'GoT.csv', sep=',')
Get the unique entities present in the dataset
ent = np.unique(np.concatenate([data[:, 0], data[:, 2]]))
ent      #display those entities
Output:
array(['Abelar Hightower', 'Acorn Hall', 'Addam Frey', ..., 'the Antlers', 'the Paps', 'unnamed tower'], dtype=object)
Similarly, get the names of unique relations among the entities
rel = np.unique(data[:, 1])
rel      #display names of those relations
Output:
array(['ALLIED_WITH', 'BRANCH_OF', 'FOUNDED_BY', 'HEIR_TO', 'IN_REGION',
       'LED_BY', 'PARENT_OF', 'SEAT_OF', 'SPOUSE', 'SWORN_TO'],
      dtype=object)
Perform train-test split to create training and test sets from the dataset
#We split the data in a 70-30 train-test ratio. Compute the number of test samples accordingly
num_test_samples = int(len(data) * (30 / 100))
#Split 'data' into train and test sets such that the test set has
#'num_test_samples' samples and there are no duplicate entries
X = {}
X['train'], X['test'] = train_test_split_no_unseen(data,
    test_size=num_test_samples, seed=0, allow_duplication=False)
train_test_split_no_unseen() builds the test set so that it contains no unseen entities or relations, i.e. every entity and relation that occurs in a test triple also occurs in the training set.
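The idea behind such a split can be sketched in plain Python as follows. This is a simplified illustration under our own assumptions, not AmpliGraph’s implementation: the function name `split_no_unseen` and the toy triples are hypothetical, and the sketch may return fewer test triples than requested if no more can be moved safely.

```python
import random
from collections import Counter

def split_no_unseen(triples, test_size, seed=0):
    """Move a triple to the test set only if every entity and relation it
    uses still occurs in at least one triple left in the training pool."""
    rng = random.Random(seed)
    candidates = list(triples)
    rng.shuffle(candidates)

    # How often each entity / relation occurs in the current training pool.
    ent_count, rel_count = Counter(), Counter()
    for s, p, o in candidates:
        ent_count.update([s, o])
        rel_count[p] += 1

    train, test = [], []
    for s, p, o in candidates:
        needed = Counter([s, o])  # handles self-loops where s == o
        removable = (
            len(test) < test_size
            and rel_count[p] > 1
            and all(ent_count[e] > n for e, n in needed.items())
        )
        if removable:
            test.append((s, p, o))
            ent_count.subtract(needed)
            rel_count[p] -= 1
        else:
            train.append((s, p, o))
    return train, test

# Toy triples, invented for illustration.
toy = [
    ("a", "likes", "b"), ("b", "likes", "c"), ("c", "likes", "a"),
    ("a", "knows", "c"), ("b", "knows", "a"), ("d", "knows", "a"),
]
train, test = split_no_unseen(toy, test_size=2)
print(len(train), len(test))
```

Note that entity ‘d’ occurs in only one toy triple, so that triple can never be moved to the test set.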

#Check sizes of training and test sets
print('Train set size: ', X['train'].shape)
print('Test set size: ', X['test'].shape)
Output:
Train set size:  (2223, 3)
Test set size:  (952, 3)
Instantiate the ComplEx model
ce_model = ComplEx(batches_count=100,
                seed=0,
                epochs=200,
                k=150,   #dimensionality of embedding space
                #number of negative triples to generate for each positive triple during training
                eta=5,
                optimizer='adam',  #Adam optimizer
                optimizer_params={'lr':1e-3},  #learning rate
                loss='multiclass_nll',   #loss function
                #Lp regularization; here we specify p=2 for L2 regularization
                regularizer='LP',
                regularizer_params={'p':2, 'lambda':1e-5},
                verbose=True)
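The eta parameter above controls how many negative triples are generated per positive triple. A common way to produce such negatives is to “corrupt” a true triple by swapping its subject or object for another entity. The sketch below shows that idea only; the helper `corrupt` is hypothetical, the entity list is a tiny hand-picked sample, and AmpliGraph’s internal corruption strategy may differ in its details.

```python
import random

def corrupt(triple, entities, eta, rng):
    """Return eta corrupted copies of triple, each with either the subject
    or the object replaced by a randomly chosen entity."""
    s, p, o = triple
    negatives = []
    for _ in range(eta):
        replacement = rng.choice(entities)
        if rng.random() < 0.5:
            negatives.append((replacement, p, o))  # corrupt the subject
        else:
            negatives.append((s, p, replacement))  # corrupt the object
    return negatives

# Small hand-picked entity list, for illustration only.
entities = ["Jon Snow", "Daenerys Targaryen", "Cersei Lannister", "The North"]
positive = ("Jon Snow", "ALLIED_WITH", "Daenerys Targaryen")

negatives = corrupt(positive, entities, eta=5, rng=random.Random(0))
for n in negatives:
    print(n)
```

This sketch does not check whether a corruption happens to be identical to the positive triple or to another true fact; a more careful version would filter those out.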
Fit the model to training data
ce_model.fit(X['train'], early_stopping=False)
Evaluate the embedding model on test data
test_rank = evaluate_performance(X['test'], model=ce_model,
            # corrupt subject and object separately while evaluating
            use_default_protocol=True,
            verbose=True)
evaluate_performance() computes the rank at which each test triple was found when the model performed link prediction, i.e. the position of the true triple among its corrupted variants when sorted by score. A lower rank is better.
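Ranks are usually summarised with metrics such as mean reciprocal rank (MRR) and Hits@N. The following pure-Python sketch shows how a rank can be derived from scores and how the two metrics are computed from a list of ranks. The scores are made up, and counting ties against the true triple is our own pessimistic choice for the illustration.

```python
def rank_of_true_triple(true_score, corruption_scores):
    """Rank 1 means the true triple outscored every corruption.
    Ties are counted against the true triple."""
    return 1 + sum(1 for c in corruption_scores if c >= true_score)

def mean_reciprocal_rank(ranks):
    """Average of 1/rank; 1.0 is a perfect result."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at_n(ranks, n):
    """Fraction of triples ranked within the top n."""
    return sum(1 for r in ranks if r <= n) / len(ranks)

corruptions = [1.0, -0.3, 0.2]  # made-up scores of corrupted triples
ranks = [
    rank_of_true_triple(2.5, corruptions),   # no corruption scores higher
    rank_of_true_triple(0.1, corruptions),   # two corruptions score higher
    rank_of_true_triple(-1.0, corruptions),  # all three score higher
]

mrr = mean_reciprocal_rank(ranks)
print(ranks, mrr, hits_at_n(ranks, 1), hits_at_n(ranks, 3))
```

AmpliGraph provides `mrr_score` and `hits_at_n_score` in `ampligraph.evaluation` for computing these metrics from the ranks returned by evaluate_performance().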
Create some unseen statements for link prediction
unseen_links = np.array([
    ['Jorah Mormont', 'SPOUSE', 'Daenerys Targaryen'],
    ["King's Landing", 'SEAT_OF', 'House Lannister of Casterly Rock'],
    ['Daenerys Targaryen', 'SPOUSE', 'Jon Snow'],
    ['House Stark of Winterfell', 'IN_REGION', 'The North'],
    ['House Tyrell of Highgarden', 'IN_REGION', 'Beyond the Wall'],
    ['Brandon Stark', 'ALLIED_WITH', 'House Lannister of Casterly Rock'],
    ['House Hutcheson', 'SWORN_TO', 'House Tyrell of Highgarden'],
    ['Daenerys Targaryen', 'ALLIED_WITH', 'House Lannister of Casterly Rock'],
    ['Robert I Baratheon', 'PARENT_OF', 'Myrcella Baratheon'],
    ['Cersei Lannister', 'PARENT_OF', 'Brandon Stark'],
    ['Missandei', 'SPOUSE', 'Grey Worm'],
])
Rank the unseen triples by applying the embedding model
ranks_unseen = evaluate_performance(
    unseen_links,
    model=ce_model,
    # 's+o' corrupts both subject and object and yields one rank per triple
    corrupt_side='s+o',
    # the default protocol (separate subject and object ranks) is switched off here
    use_default_protocol=False,
    verbose=True
)
Make predictions for the unseen links
sc = ce_model.predict(unseen_links)
Convert the predicted scores for unseen statements into probabilities in the range 0-1
from scipy.special import expit  #logistic sigmoid function
probability = expit(sc)
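The conversion applied here is the logistic sigmoid, 1 / (1 + exp(-x)), which squashes any real-valued score into the range 0-1. A minimal pure-Python sketch of the same function, written to avoid overflow for large negative inputs, is shown below; the raw scores are made up for illustration.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function 1 / (1 + exp(-x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)  # safe: x is negative, so exp(x) cannot overflow
    return z / (1.0 + z)

raw_scores = [-3.0, 0.0, 1.7]  # made-up model scores
probabilities = [sigmoid(s) for s in raw_scores]
print(probabilities)
```

A score of 0 maps to 0.5, and higher scores map to values closer to 1. Since the sigmoid only rescales the scores, the resulting values are best read as relative plausibility rather than as calibrated probabilities.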
Display predicted score and probability of each of the unseen links.
pd.DataFrame(list(zip([' '.join(i) for i in unseen_links],
                      ranks_unseen,
                      np.squeeze(sc),
                      np.squeeze(probability))),
             columns=['new link', 'rank', 'score',
                      'probability']).sort_values('score')
Output:

Visualize the knowledge graph embedding using TensorBoard
create_tensorboard_visualizations(ce_model, 'Knowledge_Graph_Embeddings')
The ‘Knowledge_Graph_Embeddings’ directory should now have several files as follows:

Embeddings Visualization Output:

Source: https://analyticsindiamag.com/guide-to-ampligraph-a-machine-learning-library-for-knowledge-graphs/