### Install pyRDF2Vec from Source

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/docs/readme.md

Clone the repository and install pyRDF2Vec locally from the source code.

```bash
git clone https://github.com/IBCNServices/pyRDF2Vec.git
cd pyRDF2Vec
pip install .
```

--------------------------------

### Start Stardog Server

Source: https://github.com/predict-idlab/pyrdf2vec/wiki/Fast-generation-of-RDF2Vec-embeddings-with-a-SPARQL-endpoint

Start the Stardog database server with security and CORS disabled. Ensure the Stardog version number matches your installation.

```bash
./stardog-7.4.5/bin/stardog-admin server start --disable-security --no-cors
```

--------------------------------

### Install pyRDF2Vec Dependencies with Poetry

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/docs/contributing.md

Install the project dependencies in a virtual environment using Poetry. This is the recommended method for local development.

```bash
pip install poetry
```

```bash
poetry install
```

```bash
poetry shell
```

--------------------------------

### Install Poetry

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/CONTRIBUTING.rst

Install Poetry, a dependency management tool, using pip.

```bash
pip install poetry
```

--------------------------------

### Install pyRDF2Vec from PyPI

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/docs/readme.md

Install the pyRDF2Vec library using pip. This is the recommended method for most users.

```bash
pip install pyRDF2vec
```

--------------------------------

### Install pyRDF2Vec Dependencies

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/CONTRIBUTING.rst

Install pyRDF2Vec dependencies within a virtual environment using Poetry.

```bash
poetry install
```

--------------------------------

### Changelog News Fragment Example (Feature)

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/docs/contributing.md

Example of a news fragment file for a new feature. The filename should follow the pattern `<ISSUE>.<TYPE>.rst`, e.g., `456.feature.rst`.

```rst
``fit_transform()`` now can deal with bigger Knowledge Graph Embeddings (KGE).

```

--------------------------------

### Changelog News Fragment Example (Bugfix)

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/docs/contributing.md

Example of a news fragment file for a bugfix. The filename should follow the pattern `<ISSUE>.<TYPE>.rst`, e.g., `123.bugfix.rst`.

```rst
Added ``walkers.Foo``.
This walker is a new walker.

```
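The `<ISSUE>.<TYPE>.rst` naming pattern can be checked mechanically before committing a fragment. The following is a minimal sketch, not part of the project's tooling: `parse_fragment_name` is a hypothetical helper, and the set of accepted types is an assumption limited to the two types shown in the examples (the project may accept others).

```python
import re
from typing import Optional, Tuple

# Assumption: only the two fragment types shown in the examples are listed
# here; extend this tuple if the project accepts more types.
KNOWN_TYPES = ("feature", "bugfix")

FRAGMENT_RE = re.compile(r"^(?P<issue>\d+)\.(?P<type>[a-z]+)\.rst$")


def parse_fragment_name(filename: str) -> Optional[Tuple[int, str]]:
    """Return (issue, type) if the name matches <ISSUE>.<TYPE>.rst, else None."""
    match = FRAGMENT_RE.match(filename)
    if match is None or match.group("type") not in KNOWN_TYPES:
        return None
    return int(match.group("issue")), match.group("type")


print(parse_fragment_name("123.bugfix.rst"))
print(parse_fragment_name("456.feature.rst"))
print(parse_fragment_name("notes.txt"))
```

Names that do not match the pattern, or that use a type outside `KNOWN_TYPES`, yield `None`.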

--------------------------------

### Preview Changelog with Tox

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/CONTRIBUTING.rst

Use this command to preview how your changelog entry will appear in the final release notes. Ensure you have tox installed and configured.

```bash
tox -e changelog
```

--------------------------------

### Install pyRDF2Vec with Poetry

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/docs/readme.md

Add pyRDF2Vec as a dependency using the Poetry package manager.

```bash
poetry add pyRDF2vec
```

--------------------------------

### Run pyRDF2Vec with Docker

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/CONTRIBUTING.rst

Build and run pyRDF2Vec using Docker Compose, avoiding local dependency installation.

```bash
docker-compose up --build -d
```

--------------------------------

### Implement a Custom RDF2Vec Connector

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/CONTRIBUTING.rst

Example of creating a custom connector by extending the base Connector class and implementing the fetch method. Includes retry logic for HTTP requests.

```python
import attr
from requests.adapters import HTTPAdapter
from urllib3.util import Retry

from pyrdf2vec.connectors import Connector

@attr.s
class FooConnector(Connector):
    """Represents a Foo connector."""

    def __attrs_post_init__(self):
        # Pass the retry policy by keyword: the first positional argument of
        # HTTPAdapter is pool_connections, not max_retries.
        adapter = HTTPAdapter(
            max_retries=Retry(
                total=3,
                status_forcelist=[429, 500, 502, 503, 504],
                # allowed_methods replaces method_whitelist (urllib3 >= 1.26).
                allowed_methods=["HEAD", "GET", "OPTIONS"],
            )
        )
        self._session.mount("http", adapter)
        self._session.mount("https", adapter)

    def fetch(self, query: str) -> None:
        """Fetches the result of a query.

           Args:
               query: The query to fetch the result.

           Returns:
               The generated dictionary from the ['results']['bindings']
               json

        """
        # TODO: to be implemented

```
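What the retry settings above ask for can be illustrated without any HTTP library. The sketch below is a plain-Python stand-in, not urllib3's implementation: `fetch_with_retry` and `fake_send` are hypothetical names, and the sketch ignores the method filter and any backoff delay. It only shows the core idea of retrying up to `total` times while the status code is in the retry list.

```python
from typing import Callable, List, Tuple

RETRY_STATUSES = {429, 500, 502, 503, 504}  # same codes as status_forcelist


def fetch_with_retry(
    send: Callable[[], Tuple[int, str]], total: int = 3
) -> Tuple[int, str]:
    """Call `send` once, then retry up to `total` times on a retryable status."""
    status, body = send()
    for _ in range(total):
        if status not in RETRY_STATUSES:
            break
        status, body = send()  # repeat the same request
    return status, body


# Hypothetical endpoint that answers 503 twice before succeeding.
responses = iter([(503, ""), (503, ""), (200, '{"results": {"bindings": []}}')])
calls: List[int] = []


def fake_send() -> Tuple[int, str]:
    calls.append(1)
    return next(responses)


status, body = fetch_with_retry(fake_send)
print(status, len(calls))
```

Here the third attempt succeeds, so the remaining retries are not used.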

--------------------------------

### RDF2VecTransformer: Fit and Transform KG Entities

Source: https://context7.com/predict-idlab/pyrdf2vec/llms.txt

Orchestrate walk extraction and embedding training using RDF2VecTransformer. This example uses Word2Vec for embedding and RandomWalker for walk generation.

```python
import pandas as pd
from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.embedders import Word2Vec
from pyrdf2vec.graphs import KG
from pyrdf2vec.walkers import RandomWalker

data     = pd.read_csv("samples/countries-cities/entities.tsv", sep="\t")
entities = list(data["location"])

kg = KG("https://dbpedia.org/sparql",
        skip_predicates={"www.w3.org/1999/02/22-rdf-syntax-ns#type"})

transformer = RDF2VecTransformer(
    Word2Vec(epochs=10, workers=1),           # embedding technique
    walkers=[RandomWalker(4, 10,              # depth=4, max_walks=10
                          with_reverse=True,  # include parent hops
                          n_jobs=2,
                          random_state=42)],
    verbose=1,
)

# fit_transform = extract walks + train embedder + transform in one call
embeddings, literals = transformer.fit_transform(kg, entities)
```

--------------------------------

### Reproducible Embeddings Setup

Source: https://context7.com/predict-idlab/pyrdf2vec/llms.txt

Make embeddings reproducible by setting three things together: the `PYTHONHASHSEED` environment variable, the walker's `random_state`, and `workers=1` on the embedder. Running the script again with the same seed produces identical vectors.

```python
# Run script with: PYTHONHASHSEED=42 python my_script.py

from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.embedders import Word2Vec
from pyrdf2vec.graphs import KG
from pyrdf2vec.walkers import RandomWalker

SEED = 42

transformer = RDF2VecTransformer(
    Word2Vec(workers=1, epochs=10),          # single worker for determinism
    walkers=[RandomWalker(
        max_depth=2,
        max_walks=None,
        random_state=SEED,                   # walker + sampler seed
    )],
)

kg = KG("samples/mutag/mutag.owl",
        skip_predicates={"http://dl-learner.org/carcinogenesis#isMutagenic"})
entities = ["http://dl-learner.org/carcinogenesis#d1",
            "http://dl-learner.org/carcinogenesis#d2"]

embeddings, _ = transformer.fit_transform(kg, entities)
# Running this script again with PYTHONHASHSEED=42 produces identical vectors.
print(embeddings[0])
```
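The role of `PYTHONHASHSEED` can be checked in isolation. Python randomizes string hashing per interpreter launch unless this variable is set, so anything that depends on `hash()` of a string can differ between runs. The sketch below uses only the standard library; `string_hash_in_child` is a hypothetical helper that starts a fresh interpreter with a given seed and reports the hash of a fixed string.

```python
import os
import subprocess
import sys


def string_hash_in_child(seed: str) -> str:
    """Return hash('pyRDF2Vec') as computed by a fresh interpreter."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    result = subprocess.run(
        [sys.executable, "-c", "print(hash('pyRDF2Vec'))"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


first = string_hash_in_child("42")
second = string_hash_in_child("42")
print(first == second)  # same seed, same hash across interpreter launches
```

Two launches with the same seed agree on the hash value, which is the property the reproducibility setup relies on.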

--------------------------------

### Open Local Documentation

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/CONTRIBUTING.rst

Open the locally generated documentation in your web browser. The entry point is `_build/html/index.html`.

```bash
$BROWSER _build/html/index.html
```

--------------------------------

### Create and Load DBPedia Database

Source: https://github.com/predict-idlab/pyrdf2vec/wiki/Fast-generation-of-RDF2Vec-embeddings-with-a-SPARQL-endpoint

Create a new Stardog database named 'dbpedia' and load the downloaded bz2 files. Then, add the DBPedia OWL ontology to the database. It is recommended to run these commands in a separate screen or tmux session.

```bash
./stardog-7.4.5/bin/stardog-admin db create -n dbpedia $(find . -name \*.bz2 -print -type f | xargs)
./stardog-7.4.5/bin/stardog data add dbpedia data/dbpedia_2015-10.owl
```

--------------------------------

### Import Custom Sampler in `__init__.py`

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/CONTRIBUTING.rst

Add your custom sampler to the `__init__.py` file and the `__all__` list.

```python
from .sampler import Sampler

# ...
from .foo import FooSampler
# ...
from .uniform import UniformSampler

__all__ = [
    # ...
    "FooSampler",
    # ...
    "Sampler",
    "UniformSampler",
]
```

--------------------------------

### Generate Documentation Locally

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/CONTRIBUTING.rst

Build the project's documentation locally using tox. This command generates HTML documentation in the _build/html directory.

```bash
tox -e docs
```

--------------------------------

### Configure Stardog Server Java Arguments

Source: https://github.com/predict-idlab/pyrdf2vec/wiki/Fast-generation-of-RDF2Vec-embeddings-with-a-SPARQL-endpoint

Set STARDOG_SERVER_JAVA_ARGS to allocate sufficient JVM heap and direct memory for loading large datasets. Adjust values based on the number of triples and available system memory.

```bash
export STARDOG_SERVER_JAVA_ARGS="-Xms8g -Xmx8g -XX:MaxDirectMemorySize=20g"
```

--------------------------------

### Initialize KG from Local RDF File

Source: https://context7.com/predict-idlab/pyrdf2vec/llms.txt

Load a Knowledge Graph from a local OWL/RDF file. Configure predicates to skip and specify literal extraction rules.

```python
from pyrdf2vec.graphs import KG

# Load the MUTAG KG from a local OWL/RDF file.
kg_local = KG(
    "samples/mutag/mutag.owl",
    skip_predicates={"http://dl-learner.org/carcinogenesis#isMutagenic"},
    literals=[
        ["http://dl-learner.org/carcinogenesis#hasBond",
         "http://dl-learner.org/carcinogenesis#inBond"],
        ["http://dl-learner.org/carcinogenesis#hasAtom",
         "http://dl-learner.org/carcinogenesis#charge"],
    ],
)
```

--------------------------------

### Create Knowledge Graph from Scratch

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/README.rst

Create a Knowledge Graph (KG) from scratch by adding vertices and edges representing relationships.

```python
from pyrdf2vec.graphs import KG, Vertex

GRAPH = [
    ["Alice", "knows", "Bob"],
    ["Alice", "knows", "Dean"],
    ["Dean", "loves", "Alice"],
]
URL = "http://pyRDF2Vec"
CUSTOM_KG = KG()

for row in GRAPH:
    subj = Vertex(f"{URL}#{row[0]}")
    obj = Vertex((f"{URL}#{row[2]}"))
    pred = Vertex((f"{URL}#{row[1]}"), predicate=True, vprev=subj, vnext=obj)
    CUSTOM_KG.add_walk(subj, pred, obj)
```
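To see what kind of walks such a graph yields, the same triples can be traversed in plain Python. This is a simplified sketch, not pyRDF2Vec's walker: it uses bare strings instead of `Vertex` objects, and `walks_from` is a hypothetical helper that enumerates every walk of at most `max_depth` hops from a root.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

GRAPH = [
    ["Alice", "knows", "Bob"],
    ["Alice", "knows", "Dean"],
    ["Dean", "loves", "Alice"],
]

# Map each subject to its outgoing (predicate, object) hops.
hops: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
for subj, pred, obj in GRAPH:
    hops[subj].append((pred, obj))


def walks_from(root: str, max_depth: int) -> List[Tuple[str, ...]]:
    """Enumerate every walk of at most `max_depth` hops starting at `root`."""
    frontier: List[Tuple[str, ...]] = [(root,)]
    for _ in range(max_depth):
        extended = []
        for walk in frontier:
            next_hops = hops.get(walk[-1], [])
            if not next_hops:
                extended.append(walk)  # dead end: keep the shorter walk
            for pred, obj in next_hops:
                extended.append(walk + (pred, obj))
        frontier = extended
    return frontier


for walk in walks_from("Alice", 2):
    print(" -> ".join(walk))
```

Each walk alternates entities and predicates; sequences of this shape are what an embedder is later trained on.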

--------------------------------

### Import New Walker in `__init__.py`

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/docs/contributing.md

Add your custom walker to the `__init__.py` file in the `pyrdf2vec/walkers` directory and include it in the `__all__` list.

```python
from .walker import Walker

# ...
from .foo import FooWalker

```

--------------------------------

### Import New Sampler in `pyrdf2vec/samplers/__init__.py`

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/docs/contributing.md

To make a new sampler available, import it in the `pyrdf2vec/samplers/__init__.py` file and add it to the `__all__` list.

```python
from .sampler import Sampler

# ...
from .foo import FooSampler
# ...
from .uniform import UniformSampler

__all__ = [
    # ...
    "FooSampler",
    # ...
    "Sampler",
    "UniformSampler",
]
```

--------------------------------

### Import Custom Walker in `__init__.py`

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/CONTRIBUTING.rst

Add your custom walker to the `__init__.py` file and the `__all__` list to make it available.

```python
from .walker import Walker

# ...
from .foo import FooWalker
# ...
from .walklet import WalkletWalker
from .weisfeiler_lehman import WLWalker

__all__ = [
    # ...
    "FooWalker",
    # ...
    "Walker",
    "WalkletWalker",
    "WLWalker",
]
```

--------------------------------

### HALKWalker for Frequency-Filtered Walks

Source: https://context7.com/predict-idlab/pyrdf2vec/llms.txt

Use HALKWalker to filter walks by dropping low-frequency object vertices. This reduces noise and vocabulary size, which is especially useful for large knowledge graphs. The example requires pandas and pyRDF2Vec.

```python
from pyrdf2vec.walkers import HALKWalker
from pyrdf2vec.samplers import WideSampler
from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.embedders import Word2Vec
from pyrdf2vec.graphs import KG
import pandas as pd

train_data = pd.read_csv("samples/mutag/train.tsv", sep="\t")
test_data  = pd.read_csv("samples/mutag/test.tsv",  sep="\t")
entities   = list(train_data["bond"]) + list(test_data["bond"])

embeddings, _ = RDF2VecTransformer(
    Word2Vec(workers=1, epochs=10),
    walkers=[HALKWalker(
        2,                    # max_depth
        None,                 # max_walks=None selects BFS
        n_jobs=2,
        sampler=WideSampler(),
        random_state=22,
        md5_bytes=None,       # no hashing — keeps full URI (small KG)
    )],
    verbose=1,
).fit_transform(
    KG("samples/mutag/mutag.owl",
       skip_predicates={"http://dl-learner.org/carcinogenesis#isMutagenic"}),
    entities,
)
print(embeddings[0].shape)  # (100,)
```
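The idea of dropping rare elements from walks can be illustrated on toy data. The following sketch is not HALKWalker's implementation: the walks, the `drop_rare` helper, and the `min_fraction` threshold are all hypothetical, and for brevity it filters predicates and objects alike rather than objects only.

```python
from collections import Counter
from typing import List, Tuple

Walk = Tuple[str, ...]

# Hypothetical walks: root, predicate, object, predicate, object, ...
walks: List[Walk] = [
    ("d1", "hasAtom", "a1", "charge", "c1"),
    ("d1", "hasAtom", "a2", "charge", "c1"),
    ("d1", "hasBond", "b1", "inBond", "a1"),
    ("d1", "hasAtom", "a1", "charge", "c2"),
]


def drop_rare(walks: List[Walk], min_fraction: float) -> List[Walk]:
    """Remove non-root elements seen in fewer than `min_fraction` of the walks."""
    # Count in how many walks each element appears (at most once per walk).
    seen_in = Counter(item for walk in walks for item in set(walk[1:]))
    keep = {item for item, n in seen_in.items() if n / len(walks) >= min_fraction}
    return [(walk[0],) + tuple(i for i in walk[1:] if i in keep) for walk in walks]


filtered = drop_rare(walks, min_fraction=0.5)
for walk in filtered:
    print(walk)
```

With a threshold of 0.5, only elements that occur in at least two of the four walks survive, so the filtered walks share a smaller vocabulary.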

--------------------------------

### Initialize Knowledge Graph from RDF File

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/README.rst

Initialize a Knowledge Graph (KG) from an RDF file, specifying predicates to exclude and literals to retrieve.

```python
from pyrdf2vec.graphs import KG

# Defined the MUTAG KG, as well as a set of predicates to exclude from
# this KG and a list of predicate chains to get the literals.
KG(
    "samples/mutag/mutag.owl",
    skip_predicates={"http://dl-learner.org/carcinogenesis#isMutagenic"},
    literals=[
        [
            "http://dl-learner.org/carcinogenesis#hasBond",
            "http://dl-learner.org/carcinogenesis#inBond",
        ],
        [
            "http://dl-learner.org/carcinogenesis#hasAtom",
            "http://dl-learner.org/carcinogenesis#charge",
        ],
    ],
)
```

--------------------------------

### End-to-End MUTAG Classification Pipeline

Source: https://context7.com/predict-idlab/pyrdf2vec/llms.txt

A complete machine learning pipeline for MUTAG classification: pyRDF2Vec generates the entity embeddings, and scikit-learn trains and evaluates an SVM classifier on them. It requires pyRDF2Vec, scikit-learn, and pandas. The KG class loads local RDF files or remote knowledge graphs.

```python
import pandas as pd
from sklearn.manifold import TSNE
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.embedders import Word2Vec
from pyrdf2vec.graphs import KG
from pyrdf2vec.samplers import WideSampler
from pyrdf2vec.walkers import HALKWalker

RANDOM_STATE = 22

train_data = pd.read_csv("samples/mutag/train.tsv", sep="\t")
test_data  = pd.read_csv("samples/mutag/test.tsv",  sep="\t")

train_entities = list(train_data["bond"])
test_entities  = list(test_data["bond"])
train_labels   = list(train_data["label_mutagenic"])
test_labels    = list(test_data["label_mutagenic"])

embeddings, _ = RDF2VecTransformer(
    Word2Vec(workers=1, epochs=10),
    walkers=[HALKWalker(2, None, n_jobs=2,
                        sampler=WideSampler(),
                        random_state=RANDOM_STATE,
                        md5_bytes=None)],
    verbose=1,
).fit_transform(
    KG("samples/mutag/mutag.owl",
       skip_predicates={"http://dl-learner.org/carcinogenesis#isMutagenic"},
       literals=[
           ["http://dl-learner.org/carcinogenesis#hasBond",
            "http://dl-learner.org/carcinogenesis#inBond"],
           ["http://dl-learner.org/carcinogenesis#hasAtom",
            "http://dl-learner.org/carcinogenesis#charge"],
       ]),
    train_entities + test_entities,
)

train_emb = embeddings[:len(train_entities)]
test_emb  = embeddings[len(train_entities):]

clf = GridSearchCV(SVC(random_state=RANDOM_STATE),
                   {"C": [10**i for i in range(-3, 4)]})
clf.fit(train_emb, train_labels)
preds = clf.predict(test_emb)

print(f"Accuracy: {accuracy_score(test_labels, preds)*100:.2f}%")
print("Confusion Matrix:", confusion_matrix(test_labels, preds))

# 2D visualization with t-SNE
X_tsne = TSNE(random_state=RANDOM_STATE).fit_transform(train_emb + test_emb)

```

--------------------------------

### Run All Project Checks

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/CONTRIBUTING.rst

Execute all project checks, including unit tests, code style, and documentation, using the tox command. This is a comprehensive check before submitting changes.

```bash
tox
```

--------------------------------

### Run Linting and Documentation Checks

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/CONTRIBUTING.rst

Perform code style and documentation checks using tox.

```bash
tox -e lint,docs
```

--------------------------------

### Build KG Programmatically

Source: https://context7.com/predict-idlab/pyrdf2vec/llms.txt

Construct a Knowledge Graph from scratch by adding vertices and edges. Useful for small or custom graph structures.

```python
from pyrdf2vec.graphs import KG, Vertex

# Build a KG from scratch, one triple at a time.
GRAPH = [
    ["Alice", "knows", "Bob"],
    ["Alice", "knows", "Dean"],
    ["Dean", "loves", "Alice"],
]
URL = "http://pyRDF2Vec"
kg_custom = KG()
for row in GRAPH:
    subj = Vertex(f"{URL}#{row[0]}")
    obj  = Vertex(f"{URL}#{row[2]}")
    pred = Vertex(f"{URL}#{row[1]}", predicate=True, vprev=subj, vnext=obj)
    kg_custom.add_walk(subj, pred, obj)
```

--------------------------------

### Initialize Knowledge Graph from SPARQL Endpoint

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/README.rst

Initialize a Knowledge Graph (KG) from a SPARQL endpoint, specifying predicates to skip and literals to fetch.

```python
from pyrdf2vec.graphs import KG

# Defined the DBpedia endpoint server, as well as a set of predicates to
# exclude from this KG and a list of predicate chains to fetch the literals.
KG(
    "https://dbpedia.org/sparql",
    skip_predicates={"www.w3.org/1999/02/22-rdf-syntax-ns#type"},
    literals=[
        [
            "http://dbpedia.org/ontology/wikiPageWikiLink",
            "http://www.w3.org/2004/02/skos/core#prefLabel",
        ],
        ["http://dbpedia.org/ontology/humanDevelopmentIndex"],
    ],
)
```

--------------------------------

### Configure RDF2Vec Transformer with PageRank Sampler

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/README.rst

Use `PageRankSampler` with `RandomWalker` to assign higher weights to certain hops, which can potentially speed up walk extraction.

```python
import pandas as pd

from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.graphs import KG
from pyrdf2vec.samplers import PageRankSampler
from pyrdf2vec.walkers import RandomWalker

data = pd.read_csv("samples/countries-cities/entities.tsv", sep="\t")

RDF2VecTransformer(
    walkers=[RandomWalker(2, None, PageRankSampler())]
).fit_transform(
    KG("https://dbpedia.org/sparql"),
    [entity for entity in data["location"]],
)
```

--------------------------------

### Enable Stardog Bulk Loading Mode

Source: https://github.com/predict-idlab/pyrdf2vec/wiki/Fast-generation-of-RDF2Vec-embeddings-with-a-SPARQL-endpoint

Create a stardog.properties file in STARDOG_HOME to enable bulk loading and disable strict parsing. This optimizes the loading of large RDF files.

```properties
memory.mode = bulk
strict.parsing = false
```

--------------------------------

### Check Documentation Code Style

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/CONTRIBUTING.rst

Check code style with the tox `lint` environment when working on the documentation, to keep documentation assets consistent.

```bash
tox -e lint
```

--------------------------------

### Create Knowledge Graph from Scratch

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/docs/readme.md

Construct a Knowledge Graph object programmatically from a list of triples. Vertices and predicates are defined using the Vertex class.

```python
from pyrdf2vec.graphs import KG, Vertex

GRAPH = [
    ["Alice", "knows", "Bob"],
    ["Alice", "knows", "Dean"],
    ["Dean", "loves", "Alice"],
]
URL = "http://pyRDF2Vec"
CUSTOM_KG = KG()

for row in GRAPH:
    subj = Vertex(f"{URL}#{row[0]}")
    obj = Vertex((f"{URL}#{row[2]}"))
    pred = Vertex((f"{URL}#{row[1]}"), predicate=True, vprev=subj, vnext=obj)
    CUSTOM_KG.add_walk(subj, pred, obj)
```

--------------------------------

### Configuring Sampling Strategies for Walks

Source: https://context7.com/predict-idlab/pyrdf2vec/llms.txt

Customize walk extraction by passing various sampler instances to a walker. Different samplers like PageRankSampler, ObjFreqSampler, and WideSampler allow for prioritizing specific types of hops based on node importance, frequency, or degree.

```python
from pyrdf2vec.samplers import (
    UniformSampler,       # equal weight to all hops (default)
    PageRankSampler,      # weight hops by PageRank score of target node
    ObjFreqSampler,       # weight by object frequency
    ObjPredFreqSampler,   # weight by (object, predicate) pair frequency
    PredFreqSampler,      # weight by predicate frequency
    WideSampler,          # prioritize wide (high-degree) nodes
)
from pyrdf2vec.walkers import RandomWalker
from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.embedders import Word2Vec
from pyrdf2vec.graphs import KG
import pandas as pd

kg = KG("samples/mutag/mutag.owl",
        skip_predicates={"http://dl-learner.org/carcinogenesis#isMutagenic"})
entities = pd.read_csv("samples/mutag/train.tsv", sep="\t")["bond"].tolist()

# PageRank sampler — prioritize high-PageRank object nodes
embeddings, _ = RDF2VecTransformer(
    Word2Vec(workers=1),
    walkers=[RandomWalker(2, 10, PageRankSampler(alpha=0.85))],
).fit_transform(kg, entities)

# Inverse PageRank — prioritize low-PageRank (rare) nodes
embeddings_inv, _ = RDF2VecTransformer(
    Word2Vec(workers=1),
    walkers=[RandomWalker(2, 10, PageRankSampler(inverse=True))],
).fit_transform(kg, entities)

# Split + frequency sampler combination
embeddings_split, _ = RDF2VecTransformer(
    Word2Vec(workers=1),
    walkers=[RandomWalker(2, 10, ObjFreqSampler(split=True))],
).fit_transform(kg, entities)
```
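How per-hop weights steer the choice of the next hop can be sketched in plain Python. This is a stand-in for illustration, not pyRDF2Vec's internal sampling code: the weights below are hypothetical values of the kind a sampler's `get_weight` might return, and `hop_probabilities` and `sample_hop` are hypothetical helpers.

```python
import random
from typing import Dict, List, Tuple

Hop = Tuple[str, str]  # (predicate, object)

# Hypothetical weights for the hops leaving one vertex.
weights: Dict[Hop, float] = {
    ("hasAtom", "a1"): 3.0,
    ("hasAtom", "a2"): 1.0,
    ("hasBond", "b1"): 1.0,
}


def hop_probabilities(weights: Dict[Hop, float]) -> Dict[Hop, float]:
    """Normalize the weights so that they sum to 1."""
    total = sum(weights.values())
    return {hop: w / total for hop, w in weights.items()}


def sample_hop(weights: Dict[Hop, float], rng: random.Random) -> Hop:
    """Draw one hop at random, proportionally to its weight."""
    probs = hop_probabilities(weights)
    hops: List[Hop] = list(probs)
    return rng.choices(hops, weights=[probs[h] for h in hops], k=1)[0]


rng = random.Random(42)
probs = hop_probabilities(weights)
print(probs)
print(sample_hop(weights, rng))
```

A uniform strategy corresponds to giving every hop the same weight; a frequency-based or PageRank-based strategy only changes where the weights come from.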

--------------------------------

### Enable Multiprocessing with n_jobs

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/README.rst

Use multiple processes for walk extraction by setting `n_jobs`. Be mindful of the SPARQL endpoint's server policies when using a high number of processes.

```python
from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.walkers import RandomWalker

RDF2VecTransformer(walkers=[RandomWalker(4, 10, n_jobs=4)])
```

--------------------------------

### Activate Poetry Shell

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/CONTRIBUTING.rst

Activate the virtual environment's shell created by Poetry.

```bash
poetry shell
```

--------------------------------

### Run Connector Unit Tests

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/CONTRIBUTING.rst

Execute unit tests for a custom connector. Ensure your connector tests are located in the tests/connectors directory.

```bash
pytest tests/connectors/foo.py
```

--------------------------------

### Initialize KG from Remote SPARQL Endpoint

Source: https://context7.com/predict-idlab/pyrdf2vec/llms.txt

Connect to a remote SPARQL endpoint, specifying predicates to skip and literals to extract. The `mul_req` parameter enables asynchronous bulk SPARQL requests.

```python
from pyrdf2vec.graphs import KG, Vertex

# --- 1. Remote SPARQL endpoint ---
kg_remote = KG(
    "https://dbpedia.org/sparql",
    skip_predicates={"www.w3.org/1999/02/22-rdf-syntax-ns#type"},
    literals=[
        ["http://dbpedia.org/ontology/wikiPageWikiLink",
         "http://www.w3.org/2004/02/skos/core#prefLabel"],
        ["http://dbpedia.org/ontology/humanDevelopmentIndex"],
    ],
    mul_req=True,   # bundle SPARQL requests asynchronously
)
```

--------------------------------

### Load Knowledge Graph from RDFLib File

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/docs/readme.md

Initialize a Knowledge Graph object from a local RDF file using RDFLib. Define predicates to exclude and predicate chains for literals.

```python
from pyrdf2vec.graphs import KG

# Defined the MUTAG KG, as well as a set of predicates to exclude from
# this KG and a list of predicate chains to get the literals.
KG(
    "samples/mutag/mutag.owl",
    skip_predicates={"http://dl-learner.org/carcinogenesis#isMutagenic"},
    literals=[
        [
            "http://dl-learner.org/carcinogenesis#hasBond",
            "http://dl-learner.org/carcinogenesis#inBond",
        ],
        [
            "http://dl-learner.org/carcinogenesis#hasAtom",
            "http://dl-learner.org/carcinogenesis#charge",
        ],
    ],
)
```

--------------------------------

### Import Embedders in `__init__.py`

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/CONTRIBUTING.rst

Import new embedders and add them to the `__all__` list in the embedders package.

```python
from .embedder import Embedder
from .foo import FooEmbedder
from .word2vec import Word2Vec

__all__ = [
    "Embedder",
    "FooEmbedder",
    "Word2Vec",
]
```

--------------------------------

### Implement Custom Sampler Class

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/CONTRIBUTING.rst

Extend the Sampler class and implement the fit and get_weight functions for your custom sampling strategy.

```python
import attr

from pyrdf2vec.graphs import KG
from pyrdf2vec.samplers import Sampler
from pyrdf2vec.typings import Hop

@attr.s
class FooSampler(Sampler):
    """Defines the Foo sampling strategy."""

    def fit(self, kg: KG) -> None:
        """Fits the sampling strategy, running any preprocessing on the KG.

        Args:
            kg: The Knowledge Graph.

        """
        # TODO: to be implemented

    def get_weight(self, hop: Hop) -> int:
        """Gets the weight of a hop in the Knowledge Graph.

        Args:
            hop: The hop (pred, obj) to get the weight.

        Returns:
            The weight for a given hop.

        """
        # TODO: to be implemented
```

--------------------------------

### Load Knowledge Graph from SPARQL Endpoint

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/docs/readme.md

Initialize a Knowledge Graph object from a SPARQL endpoint. Specify predicates to skip and predicate chains for fetching literals.

```python
from pyrdf2vec.graphs import KG

# Defined the DBpedia endpoint server, as well as a set of predicates to
# exclude from this KG and a list of predicate chains to fetch the literals.
KG(
    "https://dbpedia.org/sparql",
    skip_predicates={"www.w3.org/1999/02/22-rdf-syntax-ns#type"},
    literals=[
        [
            "http://dbpedia.org/ontology/wikiPageWikiLink",
            "http://www.w3.org/2004/02/skos/core#prefLabel",
        ],
        ["http://dbpedia.org/ontology/humanDevelopmentIndex"],
    ],
)
```

--------------------------------

### Extend Sampler Class for Custom Sampling Strategy

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/docs/contributing.md

To implement a new sampling strategy, extend the Sampler class and implement the fit and get_weight functions. The fit function is used for any necessary preprocessing, and get_weight determines the weight of a hop.

```python
import attr

from pyrdf2vec.graphs import KG
from pyrdf2vec.samplers import Sampler
from pyrdf2vec.typings import Hop

@attr.s
class FooSampler(Sampler):
    """Defines the Foo sampling strategy."""

    def fit(self, kg: KG) -> None:
        """Fits the sampling strategy, running any preprocessing on the KG.

        Args:
            kg: The Knowledge Graph.

        """
        # TODO: to be implemented

    def get_weight(self, hop: Hop) -> int:
        """Gets the weight of a hop in the Knowledge Graph.

        Args:
            hop: The hop (pred, obj) to get the weight.

        Returns:
            The weight for a given hop.

        """
        # TODO: to be implemented
```

--------------------------------

### Create Vertex Objects

Source: https://context7.com/predict-idlab/pyrdf2vec/llms.txt

Instantiate Vertex objects for subjects, objects, and predicates. Predicate vertices include context about their previous and next neighbors.

```python
from pyrdf2vec.graphs import Vertex

# Subject / object vertex
v_alice = Vertex("http://example.org/Alice")
v_bob   = Vertex("http://example.org/Bob")

# Predicate vertex (links Alice → knows → Bob)
v_knows = Vertex(
    "http://example.org/knows",
    predicate=True,
    vprev=v_alice,
    vnext=v_bob,
)

print(v_alice == Vertex("http://example.org/Alice"))  # True
print(hash(v_knows))  # unique hash including vprev/vnext context
print(v_alice < v_bob)  # lexicographic comparison on .name
```

--------------------------------

### RDF2VecTransformer: Online Learning (Incremental Updates)

Source: https://context7.com/predict-idlab/pyrdf2vec/llms.txt

Demonstrates how to incrementally update an existing RDF2VecTransformer model with new entities using `is_update=True`. This is useful for extending models without retraining from scratch.

```python
from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.embedders import Word2Vec
from pyrdf2vec.graphs import KG
from pyrdf2vec.walkers import RandomWalker
import pandas as pd

kg = KG("samples/mutag/mutag.owl",
        skip_predicates={"http://dl-learner.org/carcinogenesis#isMutagenic"})

train_data = pd.read_csv("samples/mutag/train.tsv", sep="\t")
train_entities = list(train_data["bond"])

transformer = RDF2VecTransformer(
    Word2Vec(workers=1),
    walkers=[RandomWalker(2, None, random_state=22)],
)
transformer.fit_transform(kg, train_entities)
transformer.save("mutag_model")

# Incrementally add new entities
new_data = pd.read_csv("samples/mutag/online-training.tsv", sep="\t")
new_entities = list(new_data["bond"])

transformer = RDF2VecTransformer(Word2Vec(workers=1),
                                  walkers=[RandomWalker(2, None)]).load("mutag_model")
transformer.fit_transform(kg, new_entities, is_update=True)

# transformer._embeddings now contains embeddings for ALL entities seen so far
all_embeddings = transformer._embeddings
print(f"Total embedded entities: {len(all_embeddings)}")
```

--------------------------------

### Implement a Custom RDF Connector

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/docs/contributing.md

Extend the base Connector class to add support for new RDF syntaxes or file formats. Ensure the fetch method is implemented and configure HTTP adapter retries for robustness.

```python
import attr
from requests.adapters import HTTPAdapter
from urllib3.util import Retry

from pyrdf2vec.connectors import Connector

@attr.s
class FooConnector(Connector):
    """Represents a Foo connector."""

    def __attrs_post_init__(self):
        # Pass the retry policy by keyword: the first positional argument of
        # HTTPAdapter is pool_connections, not max_retries.
        adapter = HTTPAdapter(
            max_retries=Retry(
                total=3,
                status_forcelist=[429, 500, 502, 503, 504],
                # allowed_methods replaces method_whitelist (urllib3 >= 1.26).
                allowed_methods=["HEAD", "GET", "OPTIONS"],
            )
        )
        self._session.mount("http", adapter)
        self._session.mount("https", adapter)

    def fetch(self, query: str) -> None:
        """Fetches the result of a query.

           Args:
               query: The query to fetch the result.

           Returns:
               The generated dictionary from the ['results']['bindings']
               json

        """
        # TODO: to be implemented

```

--------------------------------

### Bundle SPARQL Requests for Remote KGs

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/README.rst

Setting `mul_req=True` speeds up walk extraction for remote KGs by bundling SPARQL requests. This option can be combined with multiprocessing. Check the usage policy of the SPARQL endpoint first, since a public server may restrict clients that send many requests at once.

```python
import pandas as pd

from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.graphs import KG
from pyrdf2vec.walkers import RandomWalker

data = pd.read_csv("samples/countries-cities/entities.tsv", sep="\t")

RDF2VecTransformer(walkers=[RandomWalker(4, 10)]).fit_transform(
    KG("https://dbpedia.org/sparql", mul_req=True),
    list(data["location"]),
)
```
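
Conceptually, bundling means issuing the queries for many entities together instead of waiting on each one in turn. The sketch below illustrates that idea with `asyncio.gather` and a stand-in coroutine. `fake_fetch`, `fetch_bundle`, and the query strings are hypothetical, and nothing here reflects how pyRDF2Vec implements `mul_req` internally.

```python
import asyncio
from typing import List


async def fake_fetch(query: str) -> str:
    """Stand-in for one HTTP round trip to a SPARQL endpoint (hypothetical)."""
    await asyncio.sleep(0.01)  # simulate network latency
    return f"result for {query}"


async def fetch_bundle(queries: List[str]) -> List[str]:
    """Run all queries concurrently; results keep the order of `queries`."""
    return await asyncio.gather(*(fake_fetch(q) for q in queries))


queries = [
    f"SELECT ?p ?o WHERE {{ <http://example.org/e{i}> ?p ?o }}" for i in range(3)
]
results = asyncio.run(fetch_bundle(queries))
print(len(results))
```

Because the requests overlap, total latency is closer to that of the slowest request than to the sum of all of them, which is also why an endpoint sees a burst of traffic.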

--------------------------------

### RandomWalker: Walk Extraction (BFS vs. DFS)

Source: https://context7.com/predict-idlab/pyrdf2vec/llms.txt

Shows how to extract walks with `RandomWalker` using either a breadth-first search (BFS) or a depth-first search (DFS) strategy. BFS is used when `max_walks` is `None` and extracts every walk up to `max_depth`; DFS is used when `max_walks` is a positive integer and extracts at most that many walks per entity.

```python
from pyrdf2vec.graphs import KG
from pyrdf2vec.samplers import UniformSampler
from pyrdf2vec.walkers import RandomWalker

# BFS: extract all walks up to depth 4 (faster, more walks)
walker_bfs = RandomWalker(
    max_depth=4,
    max_walks=None,          # BFS
    with_reverse=True,       # include parent hops for better context
    n_jobs=4,                # parallel processes
    random_state=42,
    md5_bytes=8,             # hash non-entity object vertices to save memory
)

# DFS: extract at most 10 walks per entity up to depth 4
walker_dfs = RandomWalker(
    max_depth=4,
    max_walks=10,            # DFS
    sampler=UniformSampler(),
    n_jobs=2,
    random_state=42,
)

kg = KG("samples/mutag/mutag.owl",
        skip_predicates={"http://dl-learner.org/carcinogenesis#isMutagenic"})
entities = ["http://dl-learner.org/carcinogenesis#d1"]

walks = walker_bfs.extract(kg, entities, verbose=1)
# walks[0] -> list of tuples like:
# [('http://.../d1', 'http://.../hasBond', 'b"\x3f..."'), ...]
print(f"Walks for entity 0: {len(walks[0])}")
```
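
To make the difference between the two strategies concrete, here is a minimal sketch on a toy graph. It is not pyRDF2Vec's implementation: `GRAPH`, `bfs_walks`, and `dfs_walks` are hypothetical names, and the graph is hand-written. The BFS variant enumerates every walk up to the depth limit, while the DFS variant follows one random hop at a time and stops after a fixed number of attempts.

```python
import random
from typing import Dict, List, Optional, Set, Tuple

Walk = Tuple[str, ...]

# Toy graph: each vertex maps to its outgoing (predicate, object) hops.
GRAPH: Dict[str, List[Tuple[str, str]]] = {
    "a": [("p", "b"), ("p", "c")],
    "b": [("q", "d")],
    "c": [("q", "d"), ("q", "e")],
    "d": [],
    "e": [],
}


def bfs_walks(root: str, max_depth: int) -> List[Walk]:
    """Enumerate every walk from `root` by extending all walks level by level."""
    walks: List[Walk] = [(root,)]
    for _ in range(max_depth):
        extended: List[Walk] = []
        for walk in walks:
            hops = GRAPH[walk[-1]]
            if not hops:
                extended.append(walk)  # dead end: keep the walk as it is
            for pred, obj in hops:
                extended.append(walk + (pred, obj))
        walks = extended
    return walks


def dfs_walks(
    root: str, max_depth: int, max_walks: int, seed: Optional[int] = None
) -> List[Walk]:
    """Sample up to `max_walks` distinct walks, one random hop at a time."""
    rng = random.Random(seed)
    found: Set[Walk] = set()
    for _ in range(max_walks):
        walk: Walk = (root,)
        for _step in range(max_depth):
            hops = GRAPH[walk[-1]]
            if not hops:
                break
            pred, obj = rng.choice(hops)
            walk += (pred, obj)
        found.add(walk)
    return sorted(found)


all_walks = bfs_walks("a", max_depth=2)
sampled = dfs_walks("a", max_depth=2, max_walks=2, seed=0)
print(len(all_walks), len(sampled))
```

On a large graph the exhaustive variant grows quickly with depth, which is the usual reason to cap the number of walks per entity.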

--------------------------------

### RDF2VecTransformer: Save and Load

Source: https://context7.com/predict-idlab/pyrdf2vec/llms.txt

Shows how to serialize a trained `RDF2VecTransformer` to a file and load it later, so that a model can be reused without retraining.

```python
from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.embedders import Word2Vec
from pyrdf2vec.graphs import KG
from pyrdf2vec.walkers import RandomWalker

kg = KG("samples/mutag/mutag.owl",
        skip_predicates={"http://dl-learner.org/carcinogenesis#isMutagenic"})
entities = ["http://dl-learner.org/carcinogenesis#d1",
            "http://dl-learner.org/carcinogenesis#d2"]

transformer = RDF2VecTransformer(Word2Vec(workers=1),
                                  walkers=[RandomWalker(2, None)])
transformer.fit_transform(kg, entities)
transformer.save("my_transformer")        # writes "my_transformer" binary

# Later / in another process:
loaded = RDF2VecTransformer.load("my_transformer")
embeddings, _ = loaded.transform(kg, entities)
print(embeddings[0])   # same vectors as before
```

--------------------------------

### RDF2VecTransformer: Fit and Transform

Source: https://context7.com/predict-idlab/pyrdf2vec/llms.txt

Demonstrates fitting the transformer on a knowledge graph and a list of entities, then retrieving the embeddings and literals for those entities.

```python
from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.embedders import Word2Vec
from pyrdf2vec.graphs import KG
from pyrdf2vec.walkers import RandomWalker

kg = KG("samples/mutag/mutag.owl",
        skip_predicates={"http://dl-learner.org/carcinogenesis#isMutagenic"})
entities = ["http://dl-learner.org/carcinogenesis#d1",
            "http://dl-learner.org/carcinogenesis#d2"]

transformer = RDF2VecTransformer(Word2Vec(workers=1),
                                  walkers=[RandomWalker(2, None)])
# fit_transform() extracts walks, trains the embedder, and returns the results
embeddings, literals = transformer.fit_transform(kg, entities)
print(embeddings[0].shape)   # (100,)  default Word2Vec vector_size
print(len(literals))         # literals per entity (may be empty if the KG defines none)

# Separate steps (assuming the 0.2.x API, where fit() takes pre-extracted walks)
walks = transformer.get_walks(kg, entities)
# walks[i] -> list of tuples (sequences of vertex name strings)
transformer.fit(walks)
embeddings, literals = transformer.transform(kg, entities)
```

--------------------------------

### Download DBpedia Data

Source: https://github.com/predict-idlab/pyrdf2vec/wiki/Fast-generation-of-RDF2Vec-embeddings-with-a-SPARQL-endpoint

Download the DBpedia triples for the October 2015 English release using `wget`. The script creates the data directories and downloads the `core` and `core-i18n` datasets, as well as the OWL ontology.

```bash
mkdir -p data
cd data

mkdir core
cd core
wget -np -nd -r -A ttl.bz2 -A nt.bz2 "http://downloads.dbpedia.org/2015-10/core/"
cd ..

mkdir core-i18n
cd core-i18n
wget -nd -np -r -A ttl.bz2 "http://downloads.dbpedia.org/2015-10/core-i18n/en/"
cd ..

wget -nd -np -r -A .owl "http://downloads.dbpedia.org/2015-10/dbpedia_2015-10.owl"
```

--------------------------------

### Configure Stardog Default Memory Mode

Source: https://github.com/predict-idlab/pyrdf2vec/wiki/Fast-generation-of-RDF2Vec-embeddings-with-a-SPARQL-endpoint

Change the memory mode in stardog.properties from 'bulk' to 'default'. This rebalances RAM for optimal SELECT query performance.

```properties
memory.mode = default
```

--------------------------------

### Define Random Walker with PageRank Sampler

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/docs/readme.md

Instantiate a `RandomWalker` with a maximum depth of 4, a limit of 10 walks per entity, and a `PageRankSampler` as the sampling strategy.

```python
from pyrdf2vec.samplers import PageRankSampler
from pyrdf2vec.walkers import RandomWalker

walkers = [RandomWalker(4, 10, PageRankSampler())]
```
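
The idea behind a PageRank-based sampler is to prefer hops that lead to more "important" vertices. The sketch below shows that idea on a toy graph with a plain power-iteration PageRank and a weighted random choice. It is an illustration under stated assumptions, not the library's `PageRankSampler`: `EDGES`, `pagerank`, and `sample_hop` are hypothetical, and every vertex in the toy graph has at least one outgoing edge.

```python
import random
from typing import Dict, List

# Toy directed graph as adjacency lists (hypothetical data).
EDGES: Dict[str, List[str]] = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}


def pagerank(
    edges: Dict[str, List[str]], alpha: float = 0.85, iterations: int = 50
) -> Dict[str, float]:
    """Power-iteration PageRank; assumes every vertex has outgoing edges."""
    n = len(edges)
    rank = {v: 1.0 / n for v in edges}
    for _ in range(iterations):
        new_rank = {v: (1.0 - alpha) / n for v in edges}
        for src, targets in edges.items():
            share = alpha * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


def sample_hop(current: str, ranks: Dict[str, float], rng: random.Random) -> str:
    """Pick the next vertex with probability proportional to its PageRank."""
    neighbours = EDGES[current]
    weights = [ranks[v] for v in neighbours]
    return rng.choices(neighbours, weights=weights, k=1)[0]


ranks = pagerank(EDGES)
next_vertex = sample_hop("a", ranks, random.Random(0))
print(max(ranks, key=ranks.get), next_vertex)
```

A uniform sampler corresponds to giving every neighbour the same weight; swapping the weights is the only change needed to bias the walks.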

--------------------------------

### Implement Custom Walker Class

Source: https://github.com/predict-idlab/pyrdf2vec/blob/main/CONTRIBUTING.rst

Extend the `Walker` class and implement the `_extract` method for your custom walking strategy.

```python
from hashlib import md5
from typing import List, Set

import attr

from pyrdf2vec.graphs import KG, Vertex
from pyrdf2vec.typings import EntityWalks, SWalk, Walk
from pyrdf2vec.walkers import Walker


@attr.s
class FooWalker(Walker):
    """Defines the foo walking strategy.

    Args:
        max_depth: The maximum depth of one walk.
        max_walks: The maximum number of walks per entity.
        sampler: The sampling strategy.
            Defaults to pyrdf2vec.samplers.UniformSampler().
        n_jobs: The number of processes to use for multiprocessing.
            Defaults to 1.
        with_reverse: Extracts children's and parents' walks from the root,
            creating (max_walks * max_walks) more walks of 2 * depth.
            Defaults to False.
        random_state: The random state to use to ensure random determinism to
            generate the same walks for entities.
            Defaults to None.

    """

    def _extract(self, kg: KG, instance: Vertex) -> EntityWalks:
        """Extracts the walks rooted at the provided instance.

        Args:
            kg: The Knowledge Graph.
            instance: The instance to be extracted from the Knowledge Graph.

        Returns:
            A dictionary that maps the name of the instance to the list of
            its canonical walks.

        """
        canonical_walks: Set[SWalk] = set()
        # TODO: to be implemented
        return {instance.name: list(canonical_walks)}
```
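
The `md5` import and the `canonical_walks` set hint at what the missing body typically does: turn each extracted walk into a hashable tuple of strings, optionally shortening vertex names with a truncated digest. The sketch below shows one way to do that. It assumes that the root and the predicates keep their names while the other vertices are hashed, which is an assumption about the built-in walkers rather than something stated here; `canonicalize` is a hypothetical helper.

```python
from hashlib import md5
from typing import Optional, Sequence, Tuple


def canonicalize(
    walk: Sequence[str], md5_bytes: Optional[int] = 8
) -> Tuple[str, ...]:
    """Turn a walk of vertex names into a hashable tuple of strings.

    The root (index 0) and the predicates (odd indices) keep their names.
    Other vertices are replaced by the string form of a truncated MD5
    digest, unless `md5_bytes` is None.
    """
    return tuple(
        name
        if i == 0 or i % 2 == 1 or md5_bytes is None
        else str(md5(name.encode()).digest()[:md5_bytes])
        for i, name in enumerate(walk)
    )


walk = (
    "http://example.org/d1",
    "http://example.org/hasBond",
    "http://example.org/bond1",
)
hashed = canonicalize(walk)
# Storing the tuples in a set removes duplicate walks.
canonical_walks = {canonicalize(w) for w in [walk, walk]}
print(hashed[0], len(canonical_walks))
```

Inside `_extract`, each tuple produced this way could be added to `canonical_walks` before the final `return`.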