### Initial Command Setup

Source: https://github.com/kalininalab/datasail/blob/main/docs/install.html

JavaScript code to set the initial default selections for OS and package manager by programmatically clicking the first available option.

```javascript
document.getElementById("os").children[0].click();
document.getElementById("package").children[0].click();
```

--------------------------------

### Install and Use CBC Solver

Source: https://github.com/kalininalab/datasail/blob/main/docs/workflow/solvers.rst

Instructions for installing the CBC solver using mamba and pip, and specifying it in DataSAIL via CLI or Python API. CBC is a free solver.

```shell
mamba install -c conda-forge coin-or-cbc
pip install cylp
```

```APIDOC
Solver Usage:

CLI:
--solver CBC

Python API:
solver="CBC"

Notes:
CBC is a free solver and can be used without any license. Issues may arise on larger problem instances.
```

--------------------------------

### Install and Use SCIP Solver

Source: https://github.com/kalininalab/datasail/blob/main/docs/workflow/solvers.rst

Instructions for installing the SCIP solver using mamba and specifying it in DataSAIL via CLI or Python API. SCIP is a free solver.

```shell
mamba install -c conda-forge pyscipopt
```

```APIDOC
Solver Usage:

CLI:
--solver SCIP

Python API:
solver="SCIP"

Notes:
SCIP is the default, pre-installed solver with DataSAIL.
```

--------------------------------

### Install and Use GLPK_MI Solver

Source: https://github.com/kalininalab/datasail/blob/main/docs/workflow/solvers.rst

Instructions for installing GLPK_MI using mamba and specifying it in DataSAIL via CLI or Python API. GLPK_MI is a free solver for mixed-integer problems.

```shell
mamba install -c conda-forge cvxopt
```

```APIDOC
Solver Usage:

CLI:
--solver GLPK
--solver GLPK_MI

Python API:
solver="GLPK"
solver="GLPK_MI"

Notes:
GLPK_MI is an extension of GLPK for mixed-integer problems. Both are free to use.
```

--------------------------------

### Install DataSAIL Documentation Requirements

Source: https://github.com/kalininalab/datasail/blob/main/docs/extensions/contributing.rst

Installs the necessary Python packages for building the DataSAIL documentation using pip and a requirements file. This is a prerequisite for working on the documentation.

```shell
pip install -r docs/requirements.txt
```

--------------------------------

### DataSAIL Command-Line Interface Example

Source: https://github.com/kalininalab/datasail/blob/main/examples/pdbbind.ipynb

Provides an example of how to invoke the DataSAIL tool from the command line. It specifies techniques, split ratios, names, number of runs, solver, and data paths for ligands and targets.

```bash
$ datasail -t R I1e I2f I2 C1e C1f C2 -s 7 2 1 -n train val test -r 3 -i inter.tsv --solver SCIP --e-type M --e-data --f-type P --f-data
```

--------------------------------

### DataSAIL Command-Line Interface Example

Source: https://github.com/kalininalab/datasail/blob/main/examples/asteroids.ipynb

Provides an example of how to use the DataSAIL command-line interface to perform dataset splitting. It specifies techniques, split ratios, output names, runs, solver, and data/distance file paths.

```bash
$ datasail -t I1e C1e -s 8 1 -n train test -r 1 --solver SCIP --e-type O --e-data --e-dist
```

--------------------------------

### Install and Use XPRESS Solver

Source: https://github.com/kalininalab/datasail/blob/main/docs/workflow/solvers.rst

Instructions for installing the XPRESS solver using mamba and specifying it in DataSAIL via CLI or Python API. XPRESS is a commercial solver with a free academic license.

```shell
mamba install -c fico-xpress xpress
```

```APIDOC
Solver Usage:

CLI:
--solver XPRESS

Python API:
solver="XPRESS"

Notes:
Requires a valid XPRESS license. Obtain a free academic license from FICO.
```

--------------------------------

### DataSAIL Command-Line Usage Example

Source: https://github.com/kalininalab/datasail/blob/main/examples/qm9.ipynb

Illustrates the command-line interface for DataSAIL, specifying techniques (e.g., C1e for cluster-based splitting), split ratios, names for the splits, number of runs, the solver to use (e.g., SCIP), the data type (e.g., M for molecules), and the path to the data file.

```bash
$ datasail -t C1e -s 7 2 1 -n train val test -r 3 --solver SCIP --e-type M --e-data
```

--------------------------------

### Install and Use GUROBI Solver

Source: https://github.com/kalininalab/datasail/blob/main/docs/workflow/solvers.rst

Instructions for installing the GUROBI solver using mamba and specifying it in DataSAIL via CLI or Python API. GUROBI is a commercial solver with a free academic license.

```shell
mamba install -c gurobi gurobi
```

```APIDOC
Solver Usage:

CLI:
--solver GUROBI

Python API:
solver="GUROBI"

Notes:
Requires a valid GUROBI license. Obtain a free academic license from the GUROBI website.
```

--------------------------------

### Install and Use CPLEX Solver

Source: https://github.com/kalininalab/datasail/blob/main/docs/workflow/solvers.rst

Instructions for installing the CPLEX solver using mamba and specifying it in DataSAIL via CLI or Python API. CPLEX is a commercial solver with a free academic license.

```shell
mamba install -c ibmdecisionoptimization cplex
```

```APIDOC
Solver Usage:

CLI:
--solver CPLEX

Python API:
solver="CPLEX"

Notes:
Requires a valid CPLEX license. Obtain a free academic license from IBM.
```

--------------------------------

### Dynamic Installation Command Updater

Source: https://github.com/kalininalab/datasail/blob/main/docs/install.html

JavaScript function to update installation commands displayed on the page based on selected OS and package manager attributes. It targets an element with the ID 'command' and updates its content.

```javascript
function updateCommand() {
    // Get the attributes from the #command element
    var commandElement = document.getElementById("command");
    var os = commandElement.getAttribute("os");
    // `package` is a reserved word in strict mode, so use a longer name
    var packageManager = commandElement.getAttribute("package");

    // Get the <pre> element inside the #command element
    var preElement = commandElement.querySelector("pre");

    // Update the text based on the conditions
    if (packageManager === "pip") {
        preElement.textContent = 'pip install datasail';
    } else if (os === "linux" || os === "osx") {
        // The version pin is quoted so the shell does not treat `<` as a redirect
        preElement.textContent = 'mamba install -c conda-forge -c bioconda -c kalininalab datasail  # or datasail-lite\n\npip install "grakel<0.1.10"';
    } else {
        preElement.textContent = 'mamba install -c conda-forge -c bioconda -c kalininalab datasail-lite\n\npip install "grakel<0.1.10"';
    }
}
```
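
The branching in `updateCommand` can also be written as a pure function, which makes the three outcomes easy to check outside a browser. The following is a minimal Python sketch of that logic, not code from the DataSAIL repository. The function name `install_command` is hypothetical, the channel is assumed to be `conda-forge`, and `"windows"` only stands in for "any OS id other than `linux` or `osx`".

```python
def install_command(os_name: str, package: str) -> str:
    """Return the install command text for an OS / package-manager choice."""
    if package == "pip":
        return "pip install datasail"

    channels = "-c conda-forge -c bioconda -c kalininalab"
    # Quote the pin so a shell does not read `<` as input redirection.
    grakel = 'pip install "grakel<0.1.10"'

    if os_name in ("linux", "osx"):
        return f"mamba install {channels} datasail  # or datasail-lite\n\n{grakel}"
    # Every other OS id falls through to the lite package.
    return f"mamba install {channels} datasail-lite\n\n{grakel}"


if __name__ == "__main__":
    for os_name in ("linux", "windows"):  # "windows" is an illustrative id
        print(install_command(os_name, "mamba"), end="\n\n")
```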

--------------------------------

### Install and Use MOSEK Solver

Source: https://github.com/kalininalab/datasail/blob/main/docs/workflow/solvers.rst

Instructions for installing the MOSEK solver using mamba and specifying it in DataSAIL via CLI or Python API. MOSEK is a commercial solver with a free academic license.

```shell
mamba install -c mosek mosek
```

```APIDOC
Solver Usage:

CLI:
--solver MOSEK

Python API:
solver="MOSEK"

Notes:
Requires a valid MOSEK license. Obtain a free academic license from the MOSEK website.
```
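
Before passing a solver name to `--solver` or `solver=`, it can help to check which solver packages are importable in the current environment. The sketch below is one way to do that with the standard library. It is not how DataSAIL itself detects solvers. The module names in `SOLVER_MODULES` are assumptions derived from the install commands in the solver sections, and `available_solvers` is a hypothetical helper.

```python
import importlib.util
from typing import Callable, List

# Solver name (as used with --solver / solver=) -> Python module to look for.
# ASSUMPTION: these module names follow from the packages installed above.
SOLVER_MODULES = {
    "SCIP": "pyscipopt",
    "CBC": "cylp",
    "GLPK_MI": "cvxopt",
    "XPRESS": "xpress",
    "GUROBI": "gurobipy",
    "CPLEX": "cplex",
    "MOSEK": "mosek",
}

# Solvers the sections above describe as commercial (a license is required).
NEEDS_LICENSE = {"XPRESS", "GUROBI", "CPLEX", "MOSEK"}


def module_available(module_name: str) -> bool:
    """True if the top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


def available_solvers(
    has_module: Callable[[str], bool] = module_available,
) -> List[str]:
    """List solver names whose Python module is importable."""
    return [name for name, mod in SOLVER_MODULES.items() if has_module(mod)]


if __name__ == "__main__":
    for name in available_solvers():
        note = " (license required)" if name in NEEDS_LICENSE else ""
        print(f"{name}{note}")
```

An importable module does not prove that a commercial solver has a valid license, so this check only narrows down the candidates.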

--------------------------------

### Check DIAMOND Installation in Python

Source: https://github.com/kalininalab/datasail/blob/main/docs/extensions/metric.rst

Checks if the DIAMOND tool is installed on the system using the `INSTALLED` dictionary and raises a `ValueError` if it's not found, ensuring the metric can only run if its dependencies are met.

```python
if not INSTALLED[DIAMOND]:
    raise ValueError("DIAMOND is not installed.")
```
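
The snippet above relies on `INSTALLED` and `DIAMOND` being defined elsewhere. A self-contained sketch of the same guard might look like the following. The `shutil.which` check matches how the settings excerpt in this document fills `INSTALLED`, while the helper name `require_tool` is hypothetical and not part of DataSAIL.

```python
import shutil

DIAMOND = "diamond"

# One availability check per external tool; only DIAMOND is shown here.
INSTALLED = {DIAMOND: shutil.which("diamond") is not None}


def require_tool(name: str, installed: dict = INSTALLED) -> None:
    """Raise ValueError if the named tool is not marked as installed."""
    if not installed.get(name, False):
        raise ValueError(f"{name.upper()} is not installed.")


if __name__ == "__main__":
    try:
        require_tool(DIAMOND)
        print("DIAMOND found on PATH")
    except ValueError as err:
        print(err)
```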

--------------------------------

### DataSAIL CLI Usage Example

Source: https://github.com/kalininalab/datasail/blob/main/README.md

Demonstrates how to use the DataSAIL command-line interface to split protein data clustered with mmseqs. It specifies input data path, similarity method, output path, and splitting technique.

```shell
datasail --e-type P --e-data --e-sim mmseqs --output --technique C1e
```

--------------------------------

### Event Listeners for Command Selection

Source: https://github.com/kalininalab/datasail/blob/main/docs/install.html

JavaScript code to attach click event listeners to elements that control OS and package manager selection. It updates the 'command' element's attributes and calls the updateCommand function.

```javascript
document.querySelectorAll(".quick-start .content-column .row div").forEach(function(element) {
    element.addEventListener("click", function() {
        // Remove the 'selected' class from all siblings
        let siblings = this.parentNode.querySelectorAll('div');
        siblings.forEach(function(sibling) {
            sibling.classList.remove('selected');
        });

        // Add the 'selected' class to the clicked element
        this.classList.add('selected');

        // Get the parent row's id and the clicked element's id
        let parentId = this.parentNode.id;
        let elementId = this.id;

        // Set the corresponding attribute on the #command element
        document.getElementById("command").setAttribute(parentId, elementId);

        // Call the updateCommand function
        updateCommand();
    });
});
```

--------------------------------

### DataSAIL Command-Line Interface

Source: https://github.com/kalininalab/datasail/blob/main/examples/tox21.ipynb

Provides an example of how to use the DataSAIL command-line tool to perform data splitting with stratification. It specifies techniques, split ratios, names, number of runs, solver, dataset type, and file paths for data and stratification.

```APIDOC
datasail: Command-line interface for DataSAIL data splitting.

Usage:
  datasail [options]

Options:
  -t, --techniques  List of splitting techniques (e.g., C1e, I1e).
                    The 'e' suffix indicates splitting of e-data.
  -s, --splits      List of split sizes, normalized to ratios.
                    Example: 8 2 for 80/20 split.
  -n, --names       List of names for the splits.
                    Example: train test.
  -r, --runs        Number of different splits to compute per technique.
  --solver          Solving algorithm for optimization (e.g., SCIP).
  --e-type          Type of the dataset in the first axis (e.g., M for molecular).
  --e-data          Filepath containing the data mapping (IDs to SMILES strings).
  --e-strat         Filepath containing the stratification target values mapping (IDs to values).

Example:
  $ datasail -t C1e -s 8 2 -n train test -r 3 --solver SCIP --e-type M --e-data --e-strat
```
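
When the CLI is driven from a script, assembling the arguments as a list avoids quoting problems. The sketch below builds such a list from the options documented above. The helper `build_datasail_argv` and both file paths are hypothetical. Only the flags themselves come from the option list.

```python
import shlex
from typing import List, Sequence


def build_datasail_argv(
    techniques: Sequence[str],
    splits: Sequence[int],
    names: Sequence[str],
    runs: int,
    solver: str,
    e_type: str,
    e_data: str,
    e_strat: str,
) -> List[str]:
    """Assemble the argument list for the `datasail` executable."""
    return [
        "datasail",
        "-t", *techniques,
        "-s", *(str(s) for s in splits),
        "-n", *names,
        "-r", str(runs),
        "--solver", solver,
        "--e-type", e_type,
        "--e-data", e_data,
        "--e-strat", e_strat,
    ]


argv = build_datasail_argv(
    techniques=["C1e"],
    splits=[8, 2],
    names=["train", "test"],
    runs=3,
    solver="SCIP",
    e_type="M",
    e_data="molecules.tsv",  # hypothetical path (IDs to SMILES)
    e_strat="labels.tsv",    # hypothetical path (IDs to target values)
)
print(shlex.join(argv))
```

The resulting list can be handed to `subprocess.run(argv, check=True)` once DataSAIL is installed and the two files exist.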

--------------------------------

### Register DIAMOND Metric in settings.py

Source: https://github.com/kalininalab/datasail/blob/main/docs/extensions/metric.rst

Demonstrates registering the DIAMOND metric by defining its name, adding it to the list of similarity algorithms, checking its installation status, and specifying its configuration file name within the DataSAIL settings.py file.

```python
import shutil

# Excerpt from settings.py; the other tool constants are defined there as well.
DIAMOND = "diamond"

SIM_ALGOS = [WLK, MMSEQS, MMSEQS2, MMSEQSPP, FOLDSEEK, CDHIT, CDHIT_EST, ECFP, DIAMOND]

INSTALLED = {
    # Define the check per tool
    # ...
    DIAMOND: shutil.which("diamond") is not None,
    # ...
}

YAML_FILE_NAMES = {
    # Define the YAML file per tool
    # ...
    DIAMOND: "args/diamond.yaml",
    # ...
}
```

--------------------------------

### Train Model for MPP Experiments

Source: https://github.com/kalininalab/datasail/blob/main/experiments/Strat/README.md

Initiates the model training phase for the MPP benchmark experiments. It requires the path to the pre-processed data splits and generates performance results, typically saved as `results.csv` files within the experiment directories.

```bash
python -m experiments.Strat.train
```

--------------------------------

### DataSAIL Command-Line Usage

Source: https://github.com/kalininalab/datasail/blob/main/examples/bace.ipynb

Illustrates the command-line interface for DataSAIL. It specifies techniques ('I1e', 'C1e'), split ratios ('7', '2', '1'), split names ('train', 'val', 'test'), number of runs ('3'), solver ('SCIP'), dataset type ('M'), and paths to external data and distance files.

```bash
$ datasail -t I1e C1e -s 7 2 1 -n train val test -r 3 --solver SCIP --e-type M --e-data --e-dist
```

--------------------------------

### Default Similarity Algorithm Selection

Source: https://github.com/kalininalab/datasail/blob/main/docs/extensions/metric.rst

Defines the default order of similarity algorithms to be checked for installation. DIAMOND is listed as a potential default method when processing FASTA data types.

```python
# Excerpt from inside a function body, hence the `return`
order = [DIAMOND, MMSEQS2, CDHIT, MMSEQSPP]
for method in order:
    if INSTALLED[method]:
        return method, None
```
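
The excerpt above returns the first method in `order` that is installed. A standalone sketch of that selection rule is shown below. The function name `pick_default_method`, the string values other than `"diamond"`, and the `(None, None)` fallback when nothing is installed are all assumptions made for illustration.

```python
from typing import Dict, Optional, Sequence, Tuple


def pick_default_method(
    order: Sequence[str], installed: Dict[str, bool]
) -> Tuple[Optional[str], None]:
    """Return the first installed method in priority order."""
    for method in order:
        if installed.get(method, False):
            # The second element mirrors the `None` in the original excerpt.
            return method, None
    # ASSUMPTION: the excerpt does not show what happens if nothing is installed.
    return None, None


# Hypothetical identifiers; only "diamond" appears verbatim in the source.
order = ["diamond", "mmseqs2", "cdhit", "mmseqspp"]
print(pick_default_method(order, {"diamond": False, "cdhit": True}))
```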

--------------------------------

### Splitting Configuration Arguments

Source: https://github.com/kalininalab/datasail/blob/main/docs/interfaces/cli.rst

Arguments that define the splitting mode, split sizes, naming conventions, overflow handling, and variance control for data partitioning.

```APIDOC
-t, --techniques
    Required!
    Select the mode to split the data. Choices are:
    * R: random split,
    * I1: identity-based cold-single split,
    * I2: identity-based cold-double split,
    * C1: similarity-based cold-single split,
    * C2: similarity-based cold-double split
    For both I1 and C1, you have to append e or f (i.e., I1e, I1f, C1e, or C1f) to make clear whether DataSAIL shall compute a cold split based on the e-entity or the f-entity.

-s, --splits
    The sizes of the individual splits the program shall produce.

-n, --names
    The names of the splits in order of the -s argument. If left empty, splits will be called Split1, Split2, ...

--overflow
    How to handle overflow of the splits. If 'assign', a cluster that overflows a split size will be assigned to one split. The remaining data is split normally into n-1 splits. If 'break', the cluster will be broken into smaller parts to fit into a split.

-d, --delta
    A multiplicative factor by how much the limits of the stratification (as defined in the -s / --splits argument) can be exceeded.

-e, --epsilon
    A multiplicative factor by how much the limits of the splits (as defined in the -s / --splits argument) can be exceeded.

-r, --runs
    The number of different runs to perform per technique. The idea is to compute several different splits of the dataset using the same technique to investigate the variance of the model on different data splits. The variance in splits is introduced by shuffling the dataset every time a new split is requested.

--solver
    Which solver to use to solve the binary linear program. The available solvers are described in the solver documentation (docs/workflow/solvers.rst).

--cache
    Boolean flag indicating whether to store clustering matrices in a cache so that clusters are not recomputed multiple times.

--cache-dir
    Destination of the cache folder. Default is the OS-default cache dir.
```
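
To make the interplay of `-s`, `-n`, and `-e` concrete, the sketch below normalizes split sizes to ratios, falls back to the default names `Split1, Split2, ...`, and computes an upper bound per split. All three helpers are hypothetical. In particular, reading epsilon as `ratio * n_samples * (1 + epsilon)` is an assumption about what "multiplicative factor" means here, not DataSAIL's confirmed formula.

```python
from typing import List, Optional, Sequence


def normalize_splits(splits: Sequence[float]) -> List[float]:
    """Turn split sizes such as [7, 2, 1] into ratios that sum to 1."""
    total = sum(splits)
    if total <= 0:
        raise ValueError("split sizes must sum to a positive number")
    return [s / total for s in splits]


def default_names(n: int, names: Optional[Sequence[str]] = None) -> List[str]:
    """Use the given names, or Split1, Split2, ... if none were provided."""
    if names:
        return list(names)
    return [f"Split{i}" for i in range(1, n + 1)]


def upper_limits(
    splits: Sequence[float], n_samples: int, epsilon: float
) -> List[float]:
    """Largest size each split may reach under the assumed epsilon rule."""
    # ASSUMPTION: limit = ratio * n_samples * (1 + epsilon)
    return [r * n_samples * (1 + epsilon) for r in normalize_splits(splits)]


if __name__ == "__main__":
    sizes = [7, 2, 1]
    for name, ratio, limit in zip(
        default_names(len(sizes)),
        normalize_splits(sizes),
        upper_limits(sizes, n_samples=1000, epsilon=0.05),
    ):
        print(f"{name}: ratio={ratio:.2f}, at most {limit:.0f} samples")
```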

--------------------------------

### Train MPP Model using Python Script

Source: https://github.com/kalininalab/datasail/blob/main/experiments/MPP/README.md

Initiates the model training process for the MPP benchmark. It requires the path to the saved MPP data and an optional dataset name. This command generates 'results.csv' files within dataset folders.

```bash
python -m experiments.MPP.train []
```

--------------------------------

### Run DataSAIL CLI

Source: https://github.com/kalininalab/datasail/blob/main/docs/index.rst

This command shows the help message for the DataSAIL command-line interface, listing all available parameters and options. It's the primary way to understand the CLI's capabilities.

```shell
datasail -h
```

--------------------------------

### DataSAIL CLI Help Command

Source: https://github.com/kalininalab/datasail/blob/main/README.md

Shows how to access the full list of arguments and options available for the DataSAIL command-line tool. This command provides details on all configurable parameters.

```shell
datasail -h
```

--------------------------------

### Entity Configuration Arguments

Source: https://github.com/kalininalab/datasail/blob/main/docs/interfaces/cli.rst

Arguments specific to defining entities (e.g., e-entities and f-entities), including their types, data sources, weights, and similarity methods.

```APIDOC
--e-type
    The type of the first data batch to the program. Choices are: [P]rotein, [M]olecule, [G]enome, [O]ther

--e-data
    The first input to the program. This can either be the filepath to a file or a directory containing only data files.

--e-weights
    The custom weights of the samples, the format can be a CSV/TSV-file or equivalent as described above.

--e-sim
    Provide the name of a method to determine similarity between samples of the first input dataset. This can either be the name of a method based on the data type or a filepath to a method.
```
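
A small pre-flight check can catch a mistyped `--e-type` or a missing `--e-data` path before a long run starts. The sketch below validates the one-letter type codes listed above and reports whether the data path is a file or a directory. The helper `check_entity_args` and the sample file name are hypothetical, and DataSAIL performs its own validation independently of this.

```python
import tempfile
from pathlib import Path

# One-letter codes taken from the --e-type description above.
ENTITY_TYPES = {"P": "Protein", "M": "Molecule", "G": "Genome", "O": "Other"}


def check_entity_args(e_type: str, e_data: str) -> str:
    """Validate --e-type and report whether --e-data is a file or a directory."""
    if e_type not in ENTITY_TYPES:
        raise ValueError(
            f"unknown entity type {e_type!r}; choose from {sorted(ENTITY_TYPES)}"
        )
    path = Path(e_data)
    if path.is_file():
        return "file"
    if path.is_dir():
        return "directory"
    raise FileNotFoundError(e_data)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = Path(tmp) / "proteins.fasta"  # hypothetical input file
        sample.write_text(">seq1\nMKT\n")
        print(check_entity_args("P", str(sample)))
        print(check_entity_args("P", tmp))
```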

--------------------------------

### Visualize MPP Experiment Results

Source: https://github.com/kalininalab/datasail/blob/main/experiments/Strat/README.md

Generates visualizations for the MPP benchmark experiment results. This command processes the saved experiment data and outputs graphical representations, such as plots, to a specified directory.

```bash
python -m experiments.Strat.visualize
```

--------------------------------

### Split Data for MPP Experiments

Source: https://github.com/kalininalab/datasail/blob/main/experiments/Strat/README.md

Executes the data splitting process for the MPP benchmark. This command takes a target directory as input and generates stratified data splits, saving them in a structured format suitable for subsequent training and analysis.

```bash
python -m experiments.Strat.split
```

--------------------------------

### Split MPP Data using Python Script

Source: https://github.com/kalininalab/datasail/blob/main/experiments/MPP/README.md

Executes the data splitting process for MPP benchmark datasets. It takes a path to save the split data, an optional dataset name, and an optional solver. The output is organized by tool, dataset, technique, and split.

```bash
python -m experiments.MPP.split [] []
```

--------------------------------

### Visualize MPP Experiment Results using Python Script

Source: https://github.com/kalininalab/datasail/blob/main/experiments/MPP/README.md

Runs the visualization script to generate plots from the MPP experiment results. It requires the path to the saved MPP data and outputs visualization files (e.g., PNG) into a 'plots/' directory.

```bash
python -m experiments.MPP.visualize
```
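
The MPP and Strat sections each list a `split`, a `train`, and a `visualize` module, and the later stages read what the earlier ones wrote. A minimal sketch for generating the three invocations in that order is shown below. The helper `pipeline_commands` is hypothetical, and `extra_args` stands in for the positional arguments that the commands above leave unspecified.

```python
from typing import List, Sequence

# Order matters: training needs the splits, visualization needs the results.
STEPS = ("split", "train", "visualize")


def pipeline_commands(
    experiment: str, extra_args: Sequence[str] = ()
) -> List[List[str]]:
    """Build `python -m experiments.<experiment>.<step>` for each step."""
    return [
        ["python", "-m", f"experiments.{experiment}.{step}", *extra_args]
        for step in STEPS
    ]


if __name__ == "__main__":
    # "path/to/storage" is a placeholder, not a path from the repository.
    for cmd in pipeline_commands("MPP", ["path/to/storage"]):
        print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` so that a failing stage stops the pipeline before the next one starts.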

--------------------------------

### Import DataSAIL Library

Source: https://github.com/kalininalab/datasail/blob/main/examples/rna.ipynb

Imports the necessary DataSAIL library for dataset splitting operations. This is the initial step before utilizing DataSAIL's functionalities.

```python
from datasail.sail import datasail
```
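
After the import, the Python entry point takes the same kind of settings as the CLI. The sketch below collects them into keyword arguments and defers the actual call to a separate function so the import only happens when DataSAIL is installed. The keyword names are assumed to mirror the CLI flags (`solver=` is the only one confirmed by the solver sections), the file name is a placeholder, and `split_kwargs` and `run_split` are hypothetical helpers.

```python
from typing import Any, Dict, Sequence


def split_kwargs(
    techniques: Sequence[str],
    splits: Sequence[int],
    names: Sequence[str],
    runs: int,
    solver: str,
    e_type: str,
    e_data: Any,
) -> Dict[str, Any]:
    """Collect settings under keyword names that mirror the CLI flags."""
    # ASSUMPTION: the Python API accepts these keyword names.
    return {
        "techniques": list(techniques),
        "splits": list(splits),
        "names": list(names),
        "runs": runs,
        "solver": solver,
        "e_type": e_type,
        "e_data": e_data,
    }


def run_split(kwargs: Dict[str, Any]) -> Any:
    """Call DataSAIL; requires the package to be installed."""
    from datasail.sail import datasail  # import path from the snippet above

    return datasail(**kwargs)


kwargs = split_kwargs(
    techniques=["I1e", "C1e"],
    splits=[7, 2, 1],
    names=["train", "val", "test"],
    runs=3,
    solver="SCIP",
    e_type="G",
    e_data="rna.fasta",  # placeholder path
)
print(kwargs)
```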

--------------------------------

### Run DataSAIL CLI for Dataset Splitting

Source: https://github.com/kalininalab/datasail/blob/main/examples/rna.ipynb

Executes DataSAIL from the command line to split a dataset. It specifies techniques (I1e, C1e), split ratios (7, 2, 1), names (train, val, test), number of runs, input file, solver, and data type.

```bash
$ datasail -t I1e C1e -s 7 2 1 -n train val test -r 3 -i inter.tsv --solver SCIP --e-type G --e-data
```

--------------------------------

### Generate DataSAIL Documentation Build

Source: https://github.com/kalininalab/datasail/blob/main/docs/extensions/contributing.rst

Removes the existing build directory and then uses sphinx-build to generate a clean, static build of the DataSAIL documentation. This command is used for local testing of documentation changes.

```shell
rm -rf build/
sphinx-build ./ ./build/ -a
```

--------------------------------

### Execute DataSAIL Experiment Pipeline (Shell)

Source: https://github.com/kalininalab/datasail/blob/main/experiments/README.md

Provides the general command structure for executing DataSAIL experiments. It specifies how to run scripts for splitting, training, and visualization within different experiment types (DTI, MPP, Strat) and how to specify the storage folder. The scripts must be run in order due to interdependencies.

```shell
python -m experiments..