### Install pyRDF2Vec from Source

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/docs/readme.md

Clone the repository and install pyRDF2Vec locally from the source code.

```bash
git clone https://github.com/IBCNServices/pyRDF2Vec.git
cd pyRDF2Vec
pip install .
```

--------------------------------

### Start Stardog Server

Source: https://github.com/predict-idlab/pyrdf2vec/wiki/Fast-generation-of-RDF2Vec-embeddings-with-a-SPARQL-endpoint

Start the Stardog database server with security and CORS disabled. Ensure the Stardog version number matches your installation.

```bash
./stardog-7.4.5/bin/stardog-admin server start --disable-security --no-cors
```

--------------------------------

### Install pyRDF2Vec Dependencies with Poetry

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/docs/contributing.md

Install the project dependencies in a virtual environment using Poetry. This is the recommended method for local development.

```bash
pip install poetry
```

```bash
poetry install
```

```bash
poetry shell
```

--------------------------------

### Install Poetry

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/CONTRIBUTING.rst

Install Poetry, a dependency management tool, using pip.

```bash
pip install poetry
```

--------------------------------

### Install pyRDF2Vec from PyPI

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/docs/readme.md

Install the pyRDF2Vec library using pip. This is the recommended method for most users.

```bash
pip install pyRDF2vec
```

--------------------------------

### Install pyRDF2Vec Dependencies

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/CONTRIBUTING.rst

Install pyRDF2Vec dependencies within a virtual environment using Poetry.

```bash
poetry install
```

--------------------------------

### Changelog News Fragment Example (Feature)

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/docs/contributing.md

Example of a news fragment file for a new feature.
The filename should consist of a number, the fragment type, and the `.rst` extension, e.g., `456.feature.rst`.

```rst
``fit_transform()`` can now deal with bigger Knowledge Graph Embeddings (KGE).
```

--------------------------------

### Changelog News Fragment Example (Bugfix)

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/docs/contributing.md

Example of a news fragment file for a bugfix. The filename should consist of a number, the fragment type, and the `.rst` extension, e.g., `123.bugfix.rst`.

```rst
Added ``walkers.Foo``. This walker is a new walker.
```

--------------------------------

### Preview Changelog with Tox

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/CONTRIBUTING.rst

Use this command to preview how your changelog entry will appear in the final release notes. Ensure you have tox installed and configured.

```bash
tox -e changelog
```

--------------------------------

### Install pyRDF2Vec with Poetry

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/docs/readme.md

Add pyRDF2Vec as a dependency using the Poetry package manager.

```bash
poetry add pyRDF2vec
```

--------------------------------

### Run pyRDF2Vec with Docker

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/CONTRIBUTING.rst

Build and run pyRDF2Vec using Docker Compose, avoiding local dependency installation.

```bash
docker-compose up --build -d
```

--------------------------------

### Implement a Custom RDF2Vec Connector

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/CONTRIBUTING.rst

Example of creating a custom connector by extending the base Connector class and implementing the fetch method. Includes retry logic for HTTP requests.
```python
import attr
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util import Retry

from pyrdf2vec.connectors import Connector


@attr.s
class FooConnector(Connector):
    """Represents a Foo connector."""

    def __attrs_post_init__(self):
        # Retry up to three times on transient HTTP errors. The retry policy
        # must be passed as ``max_retries``; the first positional argument of
        # HTTPAdapter is the connection pool size.
        adapter = HTTPAdapter(
            max_retries=Retry(
                total=3,
                status_forcelist=[429, 500, 502, 503, 504],
                # urllib3 < 1.26 names this parameter ``method_whitelist``.
                allowed_methods=["HEAD", "GET", "OPTIONS"],
            )
        )
        self._session.mount("http", adapter)
        self._session.mount("https", adapter)

    def fetch(self, query: str) -> None:
        """Fetches the result of a query.

        Args:
            query: The query to fetch the result.

        Returns:
            The generated dictionary from the ['results']['bindings'] json

        """
        # TODO: to be implemented
```

--------------------------------

### RDF2VecTransformer: Fit and Transform KG Entities

Source: https://context7.com/predict-idlab/pyrdf2vec/llms.txt

Orchestrate walk extraction and embedding training using RDF2VecTransformer. This example uses Word2Vec for embedding and RandomWalker for walk generation.

```python
import pandas as pd

from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.embedders import Word2Vec
from pyrdf2vec.graphs import KG
from pyrdf2vec.walkers import RandomWalker

data = pd.read_csv("samples/countries-cities/entities.tsv", sep="\t")
entities = list(data["location"])

kg = KG("https://dbpedia.org/sparql", skip_predicates={"www.w3.org/1999/02/22-rdf-syntax-ns#type"})

transformer = RDF2VecTransformer(
    Word2Vec(epochs=10, workers=1),  # embedding technique
    walkers=[RandomWalker(4, 10,  # depth=4, max_walks=10
                          with_reverse=True,  # include parent hops
                          n_jobs=2, random_state=42)],
    verbose=1,
)

# fit_transform = extract walks + train embedder + transform in one call
embeddings, literals = transformer.fit_transform(kg, entities)
```

--------------------------------

### Reproducible Embeddings Setup

Source: https://context7.com/predict-idlab/pyrdf2vec/llms.txt

Ensures reproducible embeddings by coordinating PYTHONHASHSEED, walker random_state, and embedder workers=1.
Running the script again with the same seed produces identical vectors.

```python
# Run script with: PYTHONHASHSEED=42 python my_script.py
from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.embedders import Word2Vec
from pyrdf2vec.graphs import KG
from pyrdf2vec.walkers import RandomWalker

SEED = 42

transformer = RDF2VecTransformer(
    Word2Vec(workers=1, epochs=10),  # single worker for determinism
    walkers=[RandomWalker(
        max_depth=2,
        max_walks=None,
        random_state=SEED,  # walker + sampler seed
    )],
)

kg = KG("samples/mutag/mutag.owl",
        skip_predicates={"http://dl-learner.org/carcinogenesis#isMutagenic"})
entities = ["http://dl-learner.org/carcinogenesis#d1",
            "http://dl-learner.org/carcinogenesis#d2"]

embeddings, _ = transformer.fit_transform(kg, entities)
# Running this script again with PYTHONHASHSEED=42 produces identical vectors.
print(embeddings[0])
```

--------------------------------

### Serve Local Documentation

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/CONTRIBUTING.rst

Open the locally generated documentation in your web browser. The path to the index file is provided.

```bash
$BROWSER _build/html/index.html
```

--------------------------------

### Create and Load DBPedia Database

Source: https://github.com/predict-idlab/pyrdf2vec/wiki/Fast-generation-of-RDF2Vec-embeddings-with-a-SPARQL-endpoint

Create a new Stardog database named 'dbpedia' and load the downloaded bz2 files. Then, add the DBPedia OWL ontology to the database. Running these commands in a separate screen or tmux session is recommended.

```bash
./stardog-7.4.5/bin/stardog-admin db create -n dbpedia $(find . -name \*.bz2 -print -type f | xargs)
./stardog-7.4.5/bin/stardog data add dbpedia data/dbpedia_2015-10.owl
```

--------------------------------

### Import Custom Sampler in __init__.py

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/CONTRIBUTING.rst

Add your custom sampler to the __init__.py file and the __all__ list.

```python
from .sampler import Sampler
# ...
from .foo import FooSampler
# ...
from .uniform import UniformSampler

__all__ = [
    # ...
    "FooSampler",
    # ...
    "Sampler",
    "UniformSampler",
]
```

--------------------------------

### Generate Documentation Locally

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/CONTRIBUTING.rst

Build the project's documentation locally using tox. This command generates HTML documentation in the _build/html directory.

```bash
tox -e docs
```

--------------------------------

### Configure Stardog Server Java Arguments

Source: https://github.com/predict-idlab/pyrdf2vec/wiki/Fast-generation-of-RDF2Vec-embeddings-with-a-SPARQL-endpoint

Set STARDOG_SERVER_JAVA_ARGS to allocate sufficient JVM heap and direct memory for loading large datasets. Adjust values based on the number of triples and available system memory. The value must be quoted so that the shell treats all three options as a single assignment.

```bash
export STARDOG_SERVER_JAVA_ARGS="-Xms8g -Xmx8g -XX:MaxDirectMemorySize=20g"
```

--------------------------------

### Initialize KG from Local RDF File

Source: https://context7.com/predict-idlab/pyrdf2vec/llms.txt

Load a Knowledge Graph from a local OWL/RDF file. Configure predicates to skip and specify literal extraction rules.

```python
# --- 2. Local OWL/RDF file ---
kg_local = KG(
    "samples/mutag/mutag.owl",
    skip_predicates={"http://dl-learner.org/carcinogenesis#isMutagenic"},
    literals=[
        ["http://dl-learner.org/carcinogenesis#hasBond", "http://dl-learner.org/carcinogenesis#inBond"],
        ["http://dl-learner.org/carcinogenesis#hasAtom", "http://dl-learner.org/carcinogenesis#charge"],
    ],
)
```

--------------------------------

### Create Knowledge Graph from Scratch

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/README.rst

Create a Knowledge Graph (KG) from scratch by adding vertices and edges representing relationships.
```python
from pyrdf2vec.graphs import KG, Vertex

GRAPH = [
    ["Alice", "knows", "Bob"],
    ["Alice", "knows", "Dean"],
    ["Dean", "loves", "Alice"],
]
URL = "http://pyRDF2Vec"
CUSTOM_KG = KG()

for row in GRAPH:
    subj = Vertex(f"{URL}#{row[0]}")
    obj = Vertex(f"{URL}#{row[2]}")
    pred = Vertex(f"{URL}#{row[1]}", predicate=True, vprev=subj, vnext=obj)
    CUSTOM_KG.add_walk(subj, pred, obj)
```

--------------------------------

### Import New Walker in __init__.py

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/docs/contributing.md

Add your custom walker to the __init__.py file in the pyrdf2vec/walkers directory and include it in the __all__ list.

```python
from .walker import Walker
# ...
from .foo import FooWalker
```

--------------------------------

### Import New Sampler in pyrdf2vec/samplers/__init__.py

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/docs/contributing.md

To make a new sampler available, import it in the pyrdf2vec/samplers/__init__.py file and add it to the __all__ list.

```python
from .sampler import Sampler
# ...
from .foo import FooSampler
# ...
from .uniform import UniformSampler

__all__ = [
    # ...
    "FooSampler",
    # ...
    "Sampler",
    "UniformSampler",
]
```

--------------------------------

### Import Custom Walker in __init__.py

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/CONTRIBUTING.rst

Add your custom walker to the __init__.py file and the __all__ list to make it available.

```python
from .walker import Walker
# ...
from .foo import FooWalker
# ...
from .walklet import WalkletWalker
from .weisfeiler_lehman import WLWalker

__all__ = [
    # ...
    "FooWalker",
    # ...
    "Walker",
    "WalkletWalker",
    "WLWalker",
]
```

--------------------------------

### HALKWalker for Frequency-Filtered Walks

Source: https://context7.com/predict-idlab/pyrdf2vec/llms.txt

Use HALKWalker to filter walks by dropping low-frequency object vertices. This reduces noise and vocabulary size, especially useful for large knowledge graphs.
Ensure necessary libraries like pandas and pyrdf2vec are installed.

```python
import pandas as pd

from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.embedders import Word2Vec
from pyrdf2vec.graphs import KG
from pyrdf2vec.samplers import WideSampler
from pyrdf2vec.walkers import HALKWalker

train_data = pd.read_csv("samples/mutag/train.tsv", sep="\t")
test_data = pd.read_csv("samples/mutag/test.tsv", sep="\t")
entities = list(train_data["bond"]) + list(test_data["bond"])

embeddings, _ = RDF2VecTransformer(
    Word2Vec(workers=1, epochs=10),
    walkers=[HALKWalker(
        2,     # max_depth
        None,  # BFS
        n_jobs=2,
        sampler=WideSampler(),
        random_state=22,
        md5_bytes=None,  # no hashing: keeps full URI (small KG)
    )],
    verbose=1,
).fit_transform(
    KG("samples/mutag/mutag.owl",
       skip_predicates={"http://dl-learner.org/carcinogenesis#isMutagenic"}),
    entities,
)
print(embeddings[0].shape)  # (100,)
```

--------------------------------

### Initialize Knowledge Graph from RDF File

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/README.rst

Initialize a Knowledge Graph (KG) from an RDF file, specifying predicates to exclude and literals to retrieve.

```python
from pyrdf2vec.graphs import KG

# Define the MUTAG KG, as well as a set of predicates to exclude from
# this KG and a list of predicate chains to get the literals.
KG(
    "samples/mutag/mutag.owl",
    skip_predicates={"http://dl-learner.org/carcinogenesis#isMutagenic"},
    literals=[
        [
            "http://dl-learner.org/carcinogenesis#hasBond",
            "http://dl-learner.org/carcinogenesis#inBond",
        ],
        [
            "http://dl-learner.org/carcinogenesis#hasAtom",
            "http://dl-learner.org/carcinogenesis#charge",
        ],
    ],
)
```

--------------------------------

### End-to-End MUTAG Classification Pipeline

Source: https://context7.com/predict-idlab/pyrdf2vec/llms.txt

This code snippet demonstrates a complete machine learning pipeline for MUTAG classification. It integrates pyRDF2Vec for entity embedding generation with scikit-learn for model training and evaluation.
Ensure pyRDF2Vec, scikit-learn, and pandas are installed. The KG class loads local RDF files or remote knowledge graphs.

```python
import pandas as pd
from sklearn.manifold import TSNE
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.embedders import Word2Vec
from pyrdf2vec.graphs import KG
from pyrdf2vec.samplers import WideSampler
from pyrdf2vec.walkers import HALKWalker

RANDOM_STATE = 22

train_data = pd.read_csv("samples/mutag/train.tsv", sep="\t")
test_data = pd.read_csv("samples/mutag/test.tsv", sep="\t")

train_entities = list(train_data["bond"])
test_entities = list(test_data["bond"])
train_labels = list(train_data["label_mutagenic"])
test_labels = list(test_data["label_mutagenic"])

# Embed train and test entities together so they share one vector space.
embeddings, _ = RDF2VecTransformer(
    Word2Vec(workers=1, epochs=10),
    walkers=[HALKWalker(2, None, n_jobs=2, sampler=WideSampler(),
                        random_state=RANDOM_STATE, md5_bytes=None)],
    verbose=1,
).fit_transform(
    KG("samples/mutag/mutag.owl",
       skip_predicates={"http://dl-learner.org/carcinogenesis#isMutagenic"},
       literals=[
           ["http://dl-learner.org/carcinogenesis#hasBond", "http://dl-learner.org/carcinogenesis#inBond"],
           ["http://dl-learner.org/carcinogenesis#hasAtom", "http://dl-learner.org/carcinogenesis#charge"],
       ]),
    train_entities + test_entities,
)

train_emb = embeddings[:len(train_entities)]
test_emb = embeddings[len(train_entities):]

# Grid-search the SVM regularization strength C over 10^-3 .. 10^3.
clf = GridSearchCV(SVC(random_state=RANDOM_STATE), {"C": [10**i for i in range(-3, 4)]})
clf.fit(train_emb, train_labels)
preds = clf.predict(test_emb)

print(f"Accuracy: {accuracy_score(test_labels, preds)*100:.2f}%")
print("Confusion Matrix:", confusion_matrix(test_labels, preds))

# 2D visualization with t-SNE
X_tsne = TSNE(random_state=RANDOM_STATE).fit_transform(train_emb + test_emb)
```

--------------------------------

### Run All Project Checks

Source:
https://github.com/predict-idlab/pyrdf2vec/blob/main/CONTRIBUTING.rst

Execute all project checks, including unit tests, code style, and documentation, using the tox command. This is a comprehensive check before submitting changes.

```bash
tox
```

--------------------------------

### Run Linting and Documentation Checks

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/CONTRIBUTING.rst

Perform code style and documentation checks using tox.

```bash
tox -e lint,docs
```

--------------------------------

### Build KG Programmatically

Source: https://context7.com/predict-idlab/pyrdf2vec/llms.txt

Construct a Knowledge Graph from scratch by adding vertices and edges. Useful for small or custom graph structures.

```python
# --- 3. Built from scratch ---
GRAPH = [
    ["Alice", "knows", "Bob"],
    ["Alice", "knows", "Dean"],
    ["Dean", "loves", "Alice"],
]
URL = "http://pyRDF2Vec"
kg_custom = KG()

for row in GRAPH:
    subj = Vertex(f"{URL}#{row[0]}")
    obj = Vertex(f"{URL}#{row[2]}")
    pred = Vertex(f"{URL}#{row[1]}", predicate=True, vprev=subj, vnext=obj)
    kg_custom.add_walk(subj, pred, obj)
```

--------------------------------

### Initialize Knowledge Graph from SPARQL Endpoint

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/README.rst

Initialize a Knowledge Graph (KG) from a SPARQL endpoint, specifying predicates to skip and literals to fetch.

```python
from pyrdf2vec.graphs import KG

# Defined the DBpedia endpoint server, as well as a set of predicates to
# exclude from this KG and a list of predicate chains to fetch the literals.
KG(
    "https://dbpedia.org/sparql",
    skip_predicates={"www.w3.org/1999/02/22-rdf-syntax-ns#type"},
    literals=[
        [
            "http://dbpedia.org/ontology/wikiPageWikiLink",
            "http://www.w3.org/2004/02/skos/core#prefLabel",
        ],
        ["http://dbpedia.org/ontology/humanDevelopmentIndex"],
    ],
)
```

--------------------------------

### Configure RDF2Vec Transformer with PageRank Sampler

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/README.rst

Use PageRankSampler with RandomWalker to potentially speed up walk extraction by assigning higher weights to certain paths. Ensure necessary libraries are imported.

```python
import pandas as pd

from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.graphs import KG
from pyrdf2vec.samplers import PageRankSampler
from pyrdf2vec.walkers import RandomWalker

data = pd.read_csv("samples/countries-cities/entities.tsv", sep="\t")

RDF2VecTransformer(
    walkers=[RandomWalker(2, None, PageRankSampler())]
).fit_transform(
    KG("https://dbpedia.org/sparql"),
    [entity for entity in data["location"]],
)
```

--------------------------------

### Enable Stardog Bulk Loading Mode

Source: https://github.com/predict-idlab/pyrdf2vec/wiki/Fast-generation-of-RDF2Vec-embeddings-with-a-SPARQL-endpoint

Create a stardog.properties file in STARDOG_HOME to enable bulk loading and disable strict parsing. This optimizes the loading of large RDF files.

```properties
memory.mode = bulk
strict.parsing = false
```

--------------------------------

### Check Documentation Code Style

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/CONTRIBUTING.rst

Verify the code style specifically for documentation files using tox. This ensures consistency in documentation assets.

```bash
tox -e lint
```

--------------------------------

### Create Knowledge Graph from Scratch

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/docs/readme.md

Construct a Knowledge Graph object programmatically from a list of triples.
Vertices and predicates are defined using the Vertex class.

```python
from pyrdf2vec.graphs import KG, Vertex

GRAPH = [
    ["Alice", "knows", "Bob"],
    ["Alice", "knows", "Dean"],
    ["Dean", "loves", "Alice"],
]
URL = "http://pyRDF2Vec"
CUSTOM_KG = KG()

for row in GRAPH:
    subj = Vertex(f"{URL}#{row[0]}")
    obj = Vertex(f"{URL}#{row[2]}")
    pred = Vertex(f"{URL}#{row[1]}", predicate=True, vprev=subj, vnext=obj)
    CUSTOM_KG.add_walk(subj, pred, obj)
```

--------------------------------

### Configuring Sampling Strategies for Walks

Source: https://context7.com/predict-idlab/pyrdf2vec/llms.txt

Customize walk extraction by passing a sampler instance to a walker. Samplers such as PageRankSampler, ObjFreqSampler, and WideSampler prioritize specific hops based on node importance, frequency, or degree.

```python
import pandas as pd

from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.embedders import Word2Vec
from pyrdf2vec.graphs import KG
from pyrdf2vec.samplers import (
    UniformSampler,      # equal weight to all hops (default)
    PageRankSampler,     # weight hops by PageRank score of target node
    ObjFreqSampler,      # weight by object frequency
    ObjPredFreqSampler,  # weight by (object, predicate) pair frequency
    PredFreqSampler,     # weight by predicate frequency
    WideSampler,         # prioritize wide (high-degree) nodes
)
from pyrdf2vec.walkers import RandomWalker

kg = KG("samples/mutag/mutag.owl",
        skip_predicates={"http://dl-learner.org/carcinogenesis#isMutagenic"})
entities = pd.read_csv("samples/mutag/train.tsv", sep="\t")["bond"].tolist()

# PageRank sampler: prioritize high-PageRank object nodes
embeddings, _ = RDF2VecTransformer(
    Word2Vec(workers=1),
    walkers=[RandomWalker(2, 10, PageRankSampler(alpha=0.85))],
).fit_transform(kg, entities)

# Inverse PageRank: prioritize low-PageRank (rare) nodes
embeddings_inv, _ = RDF2VecTransformer(
    Word2Vec(workers=1),
    walkers=[RandomWalker(2, 10, PageRankSampler(inverse=True))],
).fit_transform(kg, entities)

# Split + frequency
# sampler combination
embeddings_split, _ = RDF2VecTransformer(
    Word2Vec(workers=1),
    walkers=[RandomWalker(2, 10, ObjFreqSampler(split=True))],
).fit_transform(kg, entities)
```

--------------------------------

### Enable Multiprocessing with n_jobs

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/README.rst

Use multiple processes for walk extraction by setting n_jobs. With a remote KG, a high n_jobs value sends many parallel requests, so check the SPARQL endpoint's usage policy first.

```python
from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.walkers import RandomWalker

RDF2VecTransformer(walkers=[RandomWalker(4, 10, n_jobs=4)])
```

--------------------------------

### Activate Poetry Shell

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/CONTRIBUTING.rst

Activate the virtual environment's shell created by Poetry.

```bash
poetry shell
```

--------------------------------

### Run Connector Unit Tests

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/CONTRIBUTING.rst

Execute unit tests for a custom connector. Ensure your connector tests are located in the tests/connectors directory.

```bash
pytest tests/connectors/foo.py
```

--------------------------------

### Initialize KG from Remote SPARQL Endpoint

Source: https://context7.com/predict-idlab/pyrdf2vec/llms.txt

Connect to a remote SPARQL endpoint, specifying predicates to skip and literals to extract. The `mul_req` parameter enables asynchronous bulk SPARQL requests.

```python
from pyrdf2vec.graphs import KG, Vertex

# --- 1.
# Remote SPARQL endpoint ---
kg_remote = KG(
    "https://dbpedia.org/sparql",
    skip_predicates={"www.w3.org/1999/02/22-rdf-syntax-ns#type"},
    literals=[
        ["http://dbpedia.org/ontology/wikiPageWikiLink", "http://www.w3.org/2004/02/skos/core#prefLabel"],
        ["http://dbpedia.org/ontology/humanDevelopmentIndex"],
    ],
    mul_req=True,  # bundle SPARQL requests asynchronously
)
```

--------------------------------

### Load Knowledge Graph from RDFLib File

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/docs/readme.md

Initialize a Knowledge Graph object from a local RDF file using RDFLib. Define predicates to exclude and predicate chains for literals.

```python
from pyrdf2vec.graphs import KG

# Defined the MUTAG KG, as well as a set of predicates to exclude from
# this KG and a list of predicate chains to get the literals.
KG(
    "samples/mutag/mutag.owl",
    skip_predicates={"http://dl-learner.org/carcinogenesis#isMutagenic"},
    literals=[
        [
            "http://dl-learner.org/carcinogenesis#hasBond",
            "http://dl-learner.org/carcinogenesis#inBond",
        ],
        [
            "http://dl-learner.org/carcinogenesis#hasAtom",
            "http://dl-learner.org/carcinogenesis#charge",
        ],
    ],
)
```

--------------------------------

### Import Embedders in __init__.py

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/CONTRIBUTING.rst

Import new embedders and add them to the __all__ list in the embedders package.

```python
from .embedder import Embedder
from .foo import FooEmbedder
from .word2vec import Word2Vec

__all__ = [
    "Embedder",
    "FooEmbedder",
    "Word2Vec",
]
```

--------------------------------

### Implement Custom Sampler Class

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/CONTRIBUTING.rst

Extend the Sampler class and implement the fit and get_weight functions for your custom sampling strategy.
```python
import attr

from pyrdf2vec.graphs import KG
from pyrdf2vec.samplers import Sampler
from pyrdf2vec.typings import Hop


@attr.s
class FooSampler(Sampler):
    """Defines the Foo sampling strategy."""

    def fit(self, kg: KG) -> None:
        """Since the weights are uniform, this function does nothing.

        Args:
            kg: The Knowledge Graph.

        """
        # TODO: to be implemented

    def get_weight(self, hop: Hop) -> int:
        """Gets the weight of a hop in the Knowledge Graph.

        Args:
            hop: The hop (pred, obj) to get the weight.

        Returns:
            The weight for a given hop.

        """
        # TODO: to be implemented
```

--------------------------------

### Load Knowledge Graph from SPARQL Endpoint

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/docs/readme.md

Initialize a Knowledge Graph object from a SPARQL endpoint. Specify predicates to skip and predicate chains for fetching literals.

```python
from pyrdf2vec.graphs import KG

# Defined the DBpedia endpoint server, as well as a set of predicates to
# exclude from this KG and a list of predicate chains to fetch the literals.
KG(
    "https://dbpedia.org/sparql",
    skip_predicates={"www.w3.org/1999/02/22-rdf-syntax-ns#type"},
    literals=[
        [
            "http://dbpedia.org/ontology/wikiPageWikiLink",
            "http://www.w3.org/2004/02/skos/core#prefLabel",
        ],
        ["http://dbpedia.org/ontology/humanDevelopmentIndex"],
    ],
)
```

--------------------------------

### Extend Sampler Class for Custom Sampling Strategy

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/docs/contributing.md

To implement a new sampling strategy, extend the Sampler class and implement the fit and get_weight functions. The fit function is used for any necessary preprocessing, and get_weight determines the weight of a hop.

```python
import attr

from pyrdf2vec.graphs import KG
from pyrdf2vec.samplers import Sampler
from pyrdf2vec.typings import Hop


@attr.s
class FooSampler(Sampler):
    """Defines the Foo sampling strategy."""

    def fit(self, kg: KG) -> None:
        """Since the weights are uniform, this function does nothing.

        Args:
            kg: The Knowledge Graph.

        """
        # TODO: to be implemented

    def get_weight(self, hop: Hop) -> int:
        """Gets the weight of a hop in the Knowledge Graph.

        Args:
            hop: The hop (pred, obj) to get the weight.

        Returns:
            The weight for a given hop.

        """
        # TODO: to be implemented
```

--------------------------------

### Create Vertex Objects

Source: https://context7.com/predict-idlab/pyrdf2vec/llms.txt

Instantiate Vertex objects for subjects, objects, and predicates. Predicate vertices include context about their previous and next neighbors.

```python
from pyrdf2vec.graphs import Vertex

# Subject / object vertex
v_alice = Vertex("http://example.org/Alice")
v_bob = Vertex("http://example.org/Bob")

# Predicate vertex (links Alice → knows → Bob)
v_knows = Vertex(
    "http://example.org/knows",
    predicate=True,
    vprev=v_alice,
    vnext=v_bob,
)

print(v_alice == Vertex("http://example.org/Alice"))  # True
print(hash(v_knows))  # unique hash including vprev/vnext context
print(v_alice < v_bob)  # lexicographic comparison on .name
```

--------------------------------

### RDF2VecTransformer: Online Learning (Incremental Updates)

Source: https://context7.com/predict-idlab/pyrdf2vec/llms.txt

Demonstrates how to incrementally update an existing RDF2VecTransformer model with new entities using `is_update=True`. This is useful for extending models without retraining from scratch.
```python
import pandas as pd

from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.embedders import Word2Vec
from pyrdf2vec.graphs import KG
from pyrdf2vec.walkers import RandomWalker

kg = KG("samples/mutag/mutag.owl",
        skip_predicates={"http://dl-learner.org/carcinogenesis#isMutagenic"})

train_data = pd.read_csv("samples/mutag/train.tsv", sep="\t")
train_entities = list(train_data["bond"])

transformer = RDF2VecTransformer(
    Word2Vec(workers=1),
    walkers=[RandomWalker(2, None, random_state=22)],
)
transformer.fit_transform(kg, train_entities)
transformer.save("mutag_model")

# Incrementally add new entities
new_data = pd.read_csv("samples/mutag/online-training.tsv", sep="\t")
new_entities = list(new_data["bond"])

transformer = RDF2VecTransformer(Word2Vec(workers=1), walkers=[RandomWalker(2, None)]).load("mutag_model")
transformer.fit_transform(kg, new_entities, is_update=True)

# transformer._embeddings now contains embeddings for ALL entities seen so far
all_embeddings = transformer._embeddings
print(f"Total embedded entities: {len(all_embeddings)}")
```

--------------------------------

### Implement a Custom RDF Connector

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/docs/contributing.md

Extend the base Connector class to add support for new RDF syntaxes or file formats. Ensure the fetch method is implemented and configure HTTP adapter retries for robustness.

```python
import attr
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util import Retry

from pyrdf2vec.connectors import Connector


@attr.s
class FooConnector(Connector):
    """Represents a Foo connector."""

    def __attrs_post_init__(self):
        # Retry up to three times on transient HTTP errors. The retry policy
        # must be passed as ``max_retries``; the first positional argument of
        # HTTPAdapter is the connection pool size.
        adapter = HTTPAdapter(
            max_retries=Retry(
                total=3,
                status_forcelist=[429, 500, 502, 503, 504],
                # urllib3 < 1.26 names this parameter ``method_whitelist``.
                allowed_methods=["HEAD", "GET", "OPTIONS"],
            )
        )
        self._session.mount("http", adapter)
        self._session.mount("https", adapter)

    def fetch(self, query: str) -> None:
        """Fetches the result of a query.

        Args:
            query: The query to fetch the result.

        Returns:
            The generated dictionary from the ['results']['bindings'] json

        """
        # TODO: to be implemented
```

--------------------------------

### Bundle SPARQL Requests for Remote KGs

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/README.rst

Speeds up walk extraction for remote KGs by bundling SPARQL requests. This option can be combined with multiprocessing. Be aware of potential SPARQL endpoint server policies.

```python
import pandas as pd

from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.graphs import KG
from pyrdf2vec.walkers import RandomWalker

data = pd.read_csv("samples/countries-cities/entities.tsv", sep="\t")

RDF2VecTransformer(walkers=[RandomWalker(4, 10)]).fit_transform(
    KG("https://dbpedia.org/sparql", mul_req=True),
    [entity for entity in data["location"]],
)
```

--------------------------------

### RandomWalker: Walk Extraction (BFS vs. DFS)

Source: https://context7.com/predict-idlab/pyrdf2vec/llms.txt

Illustrates the usage of `RandomWalker` for extracting entity walks, differentiating between Breadth-First Search (BFS) and Depth-First Search (DFS) strategies. BFS is used when `max_walks` is None, while DFS is used when `max_walks` is a positive integer.
```python
from pyrdf2vec.graphs import KG
from pyrdf2vec.samplers import UniformSampler
from pyrdf2vec.walkers import RandomWalker

# BFS: extract all walks up to depth 4 (faster, more walks)
walker_bfs = RandomWalker(
    max_depth=4,
    max_walks=None,     # BFS
    with_reverse=True,  # include parent hops for better context
    n_jobs=4,           # parallel processes
    random_state=42,
    md5_bytes=8,        # hash non-entity object vertices to save memory
)

# DFS: extract at most 10 walks per entity up to depth 4
walker_dfs = RandomWalker(
    max_depth=4,
    max_walks=10,       # DFS
    sampler=UniformSampler(),
    n_jobs=2,
    random_state=42,
)

kg = KG("samples/mutag/mutag.owl",
        skip_predicates={"http://dl-learner.org/carcinogenesis#isMutagenic"})
entities = ["http://dl-learner.org/carcinogenesis#d1"]

walks = walker_bfs.extract(kg, entities, verbose=1)
# walks[0] -> list of tuples like:
# [('http://.../d1', 'http://.../hasBond', 'b"\x3f..."'), ...]
print(f"Walks for entity 0: {len(walks[0])}")
```

--------------------------------

### RDF2VecTransformer: Save and Load

Source: https://context7.com/predict-idlab/pyrdf2vec/llms.txt

Shows how to serialize a trained RDF2VecTransformer to a file and deserialize it later for reuse. This is essential for persisting trained models.
```python
from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.embedders import Word2Vec
from pyrdf2vec.graphs import KG
from pyrdf2vec.walkers import RandomWalker

kg = KG("samples/mutag/mutag.owl",
        skip_predicates={"http://dl-learner.org/carcinogenesis#isMutagenic"})
entities = ["http://dl-learner.org/carcinogenesis#d1",
            "http://dl-learner.org/carcinogenesis#d2"]

transformer = RDF2VecTransformer(Word2Vec(workers=1), walkers=[RandomWalker(2, None)])
transformer.fit_transform(kg, entities)
transformer.save("my_transformer")  # writes "my_transformer" binary

# Later / in another process:
loaded = RDF2VecTransformer.load("my_transformer")
embeddings, _ = loaded.transform(kg, entities)
print(embeddings[0])  # same vectors as before
```

--------------------------------

### RDF2VecTransformer: Fit and Transform

Source: https://context7.com/predict-idlab/pyrdf2vec/llms.txt

Demonstrates fitting the transformer to a knowledge graph and entities, then transforming them into embeddings. This is useful for initial model training.
```python
from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.embedders import Word2Vec
from pyrdf2vec.graphs import KG
from pyrdf2vec.walkers import RandomWalker

kg = KG(
    "samples/mutag/mutag.owl",
    skip_predicates={"http://dl-learner.org/carcinogenesis#isMutagenic"},
)
entities = [
    "http://dl-learner.org/carcinogenesis#d1",
    "http://dl-learner.org/carcinogenesis#d2",
]

transformer = RDF2VecTransformer(Word2Vec(workers=1), walkers=[RandomWalker(2, None)])
transformer.fit_transform(kg, entities)

embeddings, literals = transformer.transform(kg, entities)
print(embeddings[0].shape)  # (100,) default Word2Vec vector_size
print(len(literals))        # same length as entities

# Separate fit / transform for reuse.
# Note: the fit() signature may differ between releases (some versions expect
# the walks returned by get_walks() rather than kg and entities); check the
# version you have installed.
transformer.fit(kg, entities)
embeddings, literals = transformer.transform(kg, entities)

# Get raw walks without training
walks = transformer.get_walks(kg, entities)
# walks[i] -> list of tuples (sequences of vertex name strings)
```

--------------------------------

### Download DBpedia Data

Source: https://github.com/predict-idlab/pyrdf2vec/wiki/Fast-generation-of-RDF2Vec-embeddings-with-a-SPARQL-endpoint

Download the DBpedia triples for the October 2015 English version using wget. The script creates the data directories and downloads the core and core-i18n data, as well as the OWL ontology.

```bash
mkdir -p data
cd data
mkdir core
cd core
wget -np -nd -r -A ttl.bz2 -A nt.bz2 "http://downloads.dbpedia.org/2015-10/core/"
cd ..
mkdir core-i18n
cd core-i18n
wget -nd -np -r -A ttl.bz2 "http://downloads.dbpedia.org/2015-10/core-i18n/en/"
cd ..
wget -nd -np -r -A .owl "http://downloads.dbpedia.org/2015-10/dbpedia_2015-10.owl"
```

--------------------------------

### Configure Stardog Default Memory Mode

Source: https://github.com/predict-idlab/pyrdf2vec/wiki/Fast-generation-of-RDF2Vec-embeddings-with-a-SPARQL-endpoint

Change the memory mode in stardog.properties from 'bulk' to 'default'. This rebalances RAM for optimal SELECT query performance.

```properties
memory.mode = default
```

--------------------------------

### Define Random Walker with PageRank Sampler

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/docs/readme.md

Instantiate a RandomWalker with a maximum depth of 4, at most 10 walks per entity, and a PageRankSampler. Setting the number of walks limits how many walks are extracted per entity.

```python
from pyrdf2vec.samplers import PageRankSampler
from pyrdf2vec.walkers import RandomWalker

walkers = [RandomWalker(4, 10, PageRankSampler())]
```

--------------------------------

### Implement Custom Walker Class

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/CONTRIBUTING.rst

Extend the Walker class and implement the _extract function for your custom walking strategy.

```python
from hashlib import md5
from typing import List, Set

import attr

from pyrdf2vec.graphs import KG, Vertex
from pyrdf2vec.typings import EntityWalks, SWalk, Walk
from pyrdf2vec.walkers import Walker


@attr.s
class FooWalker(Walker):
    """Defines the foo walking strategy.

    Args:
        max_depth: The maximum depth of one walk.
        max_walks: The maximum number of walks per entity.
        sampler: The sampling strategy.
            Defaults to pyrdf2vec.samplers.UniformSampler().
        n_jobs: The number of processes to use for multiprocessing.
            Defaults to 1.
        with_reverse: Extracts children's and parents' walks from the root,
            creating (max_walks * max_walks) more walks of 2 * depth.
            Defaults to False.
        random_state: The random state to use to ensure random determinism
            to generate the same walks for entities.
            Defaults to None.

    """

    def _extract(self, kg: KG, instance: Vertex) -> EntityWalks:
        """Extracts the walks rooted at the provided instance.

        Args:
            kg: The Knowledge Graph.
            instance: The instance to be extracted from the Knowledge Graph.

        Returns:
            A dictionary mapping the instance name to the list of its
            canonical walks.

        """
        canonical_walks: Set[SWalk] = set()
        # TODO: to be implemented
        return {instance.name: list(canonical_walks)}
```
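
The `# TODO` in `_extract` is where the walking strategy itself goes. As a rough, library-independent sketch of that pattern, the example below works on a hypothetical dict-based toy graph instead of a `KG`. The names `ToyGraph`, `enumerate_walks`, and `toy_extract` are illustrative assumptions and not part of pyRDF2Vec. It enumerates walks from a root, replaces object vertices with truncated MD5 digests (as the `md5_bytes` option suggests), and returns the deduplicated walks keyed by the root name, mirroring the `{instance.name: list(canonical_walks)}` return shape.

```python
from hashlib import md5
from typing import Dict, List, Set, Tuple

# Hypothetical toy graph (NOT a pyrdf2vec.graphs.KG):
# each subject maps to its outgoing (predicate, object) hops.
ToyGraph = Dict[str, List[Tuple[str, str]]]
ToyWalk = Tuple[str, ...]


def enumerate_walks(graph: ToyGraph, root: str, max_depth: int) -> List[ToyWalk]:
    """Breadth-first enumeration of all walks of at most max_depth hops."""
    walks: List[ToyWalk] = [(root,)]
    for _ in range(max_depth):
        extended: List[ToyWalk] = []
        for walk in walks:
            hops = graph.get(walk[-1], [])
            if not hops:
                extended.append(walk)  # dead end: keep the walk unchanged
            for predicate, obj in hops:
                extended.append(walk + (predicate, obj))
        walks = extended
    return walks


def toy_extract(
    graph: ToyGraph, root: str, max_depth: int = 2, md5_bytes: int = 8
) -> Dict[str, List[ToyWalk]]:
    """Return the deduplicated, hashed walks of root, keyed by the root name."""
    canonical_walks: Set[ToyWalk] = set()
    for walk in enumerate_walks(graph, root, max_depth):
        canonical_walks.add(
            tuple(
                # Keep the root (i == 0) and predicates (odd i) readable;
                # replace every other vertex with a truncated MD5 digest.
                name
                if i == 0 or i % 2 == 1
                else str(md5(name.encode()).digest()[:md5_bytes])
                for i, name in enumerate(walk)
            )
        )
    return {root: sorted(canonical_walks)}


# Illustrative data only; the "ex:" names are made up for this sketch.
graph: ToyGraph = {
    "ex:d1": [("ex:hasAtom", "ex:a1"), ("ex:hasBond", "ex:b1")],
    "ex:a1": [("ex:charge", "ex:c1")],
}
for walk in toy_extract(graph, "ex:d1", max_depth=2)["ex:d1"]:
    print(walk)
```

In a real walker, the enumeration step would query the `KG` for neighbouring hops and would honour `max_walks` and the sampler. The set-based deduplication and the dictionary return value are the parts that carry over directly from the skeleton.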