### Full Federated Simulation Setup

Source: https://context7.com/sattarov/fedtabdiff/llms.txt

Sets up parameters for an end-to-end federated learning simulation using FedTabDiff. This includes data preprocessing, model initialization, and strategy configuration.

```python
import pandas as pd
import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset
from sklearn.preprocessing import LabelEncoder, QuantileTransformer
import flwr as fl
from flwr.server.strategy import FedAvg

from fedtabdiff_modules import init_model
from FlowerClient import get_client_fn, get_eval_config
from FlowerServer import get_evaluate_server_fn
from utils import get_parameters

exp_params = dict(
    seed=42,
    batch_size=512,
    n_cat_emb=2,
    learning_rate=1e-4,
    mlp_layers=[512, 512],
    activation='lrelu',
    diffusion_steps=500,
    diffusion_beta_start=1e-4,
    diffusion_beta_end=0.02,
    scheduler='linear',
    server_rounds=100,
    client_rounds=10,
    n_clients=5,
    fraction_fit=1.0,
    fraction_evaluate=1.0,
    min_fit_clients=5,
    min_evaluate_clients=1,
    eval_rate_client=100,
    eval_rate_server=10,
    device=torch.device("cuda" if torch.cuda.is_available() else "cpu")
)
```

--------------------------------

### Launch Federated Learning Simulation

Source: https://context7.com/sattarov/fedtabdiff/llms.txt

Starts the federated learning simulation using Flower. This function orchestrates the training rounds, client interactions, and strategy execution.

```python
hist = fl.simulation.start_simulation(
    client_fn=get_client_fn(train_loaders, test_loaders, exp_params),
    num_clients=exp_params['n_clients'],
    config=fl.server.ServerConfig(num_rounds=exp_params['server_rounds']),
    strategy=strategy
)
print(hist.metrics_centralized['fidelity'])
```

--------------------------------

### Build Per-Client Non-IID Data Loaders

Source: https://context7.com/sattarov/fedtabdiff/llms.txt

Creates PyTorch DataLoaders for training and testing, distributing data non-IID across clients.
This setup is crucial for simulating realistic federated learning scenarios.

```python
train_num_torch = torch.FloatTensor(train_num_scaled)
train_cat_torch = torch.LongTensor(train_cat_scaled.values)
label_torch = torch.LongTensor(label)

data_split = {k: np.argwhere(label == k).squeeze() for k in np.unique(label)}

train_loaders, test_loaders = [], []
for idx in data_split.values():
    train_loaders.append(DataLoader(
        TensorDataset(train_cat_torch[idx], train_num_torch[idx], label_torch[idx]),
        batch_size=512, shuffle=True))
    test_loaders.append((train.iloc[idx], label_torch[idx]))

test_loader_server = (train, label_torch)
```

--------------------------------

### Initialize Synthesizer and Diffuser from Config

Source: https://context7.com/sattarov/fedtabdiff/llms.txt

Instantiates MLPSynthesizer and BaseDiffuser models using experiment parameters from a dictionary. This is the standard entry point for model construction in FedTabDiff.

```python
from fedtabdiff_modules import init_model
import torch

exp_params = {
    'encoded_dim': 15,
    'mlp_layers': [512, 512],
    'activation': 'lrelu',
    'n_cat_tokens': 200,
    'n_cat_emb': 2,
    'n_classes': 5,
    'diffusion_steps': 500,
    'diffusion_beta_start': 1e-4,
    'diffusion_beta_end': 0.02,
    'scheduler': 'linear',
    'device': torch.device('cpu')
}

synthesizer, diffuser = init_model(exp_params)
print(type(synthesizer))
print(type(diffuser))
```

--------------------------------

### Initialize Model and Evaluate

Source: https://context7.com/sattarov/fedtabdiff/llms.txt

Initializes the model and performs an evaluation. The metrics are printed after evaluation.
```python
synthesizer, _ = init_model(exp_params)
params = get_parameters(synthesizer)
loss, metrics = evaluate_fn(server_round=10, parameters=params, config={})
print(metrics)  # {'fidelity': 0.743} (example)
```

--------------------------------

### `BaseDiffuser.__init__`

Source: https://context7.com/sattarov/fedtabdiff/llms.txt

Initializes the BaseDiffuser object, setting up the noise diffusion scheduler with configurable parameters for beta schedules and total diffusion steps. It supports both linear and quadratic beta schedules.

```APIDOC
## `BaseDiffuser.__init__`: Initialize the noise diffusion scheduler

Constructs a `BaseDiffuser` object that precomputes the `alpha` and `beta` noise schedules used throughout the forward (noising) and reverse (denoising) diffusion processes. Supports both `linear` and `quad` (quadratic) beta schedulers. The cumulative product `alphas_hat` is also precomputed for efficient noise injection at arbitrary timesteps.

```python
from BaseDiffuser import BaseDiffuser
import torch

# Linear scheduler (default): 500 diffusion steps
diffuser = BaseDiffuser(
    total_steps=500,
    beta_start=1e-4,
    beta_end=0.02,
    device=torch.device("cuda" if torch.cuda.is_available() else "cpu"),
    scheduler='linear'  # or 'quad' for quadratic schedule
)

print(f"Total steps : {diffuser.total_steps}")
print(f"Beta range : {diffuser.betas[0].item():.6f} → {diffuser.betas[-1].item():.6f}")
print(f"alphas_hat : shape {diffuser.alphas_hat.shape}")
# Total steps : 500
# Beta range : 0.000200 → 0.040000  (beta_start and beta_end scaled by 1000 / total_steps = 2)
# alphas_hat : shape torch.Size([500])
```
```

--------------------------------

### Initialize BaseDiffuser with Linear Scheduler

Source: https://context7.com/sattarov/fedtabdiff/llms.txt

Initializes the BaseDiffuser with a linear noise schedule. Supports custom total steps, beta start/end values, and device selection. The default scheduler is 'linear'.
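For intuition, the tensors such a constructor precomputes can be reproduced in a few lines of plain Python. This is a minimal sketch, not FedTabDiff's implementation: `linear_schedule` is a hypothetical helper, and the `1000 / total_steps` scaling is assumed from the `prepare_noise_schedule` description elsewhere in this document.

```python
from itertools import accumulate
from operator import mul


def linear_schedule(total_steps, beta_start=1e-4, beta_end=0.02):
    """Hypothetical helper: linearly spaced betas, scaled by 1000 / total_steps (assumed)."""
    scale = 1000 / total_steps
    lo, hi = scale * beta_start, scale * beta_end
    step = (hi - lo) / (total_steps - 1)  # requires total_steps >= 2
    betas = [lo + i * step for i in range(total_steps)]
    alphas = [1.0 - b for b in betas]
    # alphas_hat[t] is the running product alphas[0] * ... * alphas[t]
    alphas_hat = list(accumulate(alphas, mul))
    return betas, alphas, alphas_hat


betas, alphas, alphas_hat = linear_schedule(total_steps=500)
print(f"{betas[0]:.6f} -> {betas[-1]:.6f}")  # 0.000200 -> 0.040000
print(len(alphas_hat))                       # 500
```

Under this assumption, halving the number of steps doubles each beta, so the final `alphas_hat` value stays at roughly the same magnitude regardless of `total_steps`.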
```python
from BaseDiffuser import BaseDiffuser
import torch

# Linear scheduler (default): 500 diffusion steps
diffuser = BaseDiffuser(
    total_steps=500,
    beta_start=1e-4,
    beta_end=0.02,
    device=torch.device("cuda" if torch.cuda.is_available() else "cpu"),
    scheduler='linear'  # or 'quad' for quadratic schedule
)

print(f"Total steps : {diffuser.total_steps}")
print(f"Beta range : {diffuser.betas[0].item():.6f} → {diffuser.betas[-1].item():.6f}")
print(f"alphas_hat : shape {diffuser.alphas_hat.shape}")
# Total steps : 500
# Beta range : 0.000200 → 0.040000  (beta_start and beta_end scaled by 1000 / total_steps = 2)
# alphas_hat : shape torch.Size([500])
```

--------------------------------

### Initialize Model and Federated Strategy

Source: https://context7.com/sattarov/fedtabdiff/llms.txt

Initializes the generative model and the FedAvg strategy for federated learning. This includes setting up server-side evaluation functions and configuration callbacks.

```python
synthesizer, _ = init_model(exp_params)

strategy = FedAvg(
    fraction_fit=exp_params['fraction_fit'],
    fraction_evaluate=exp_params['fraction_evaluate'],
    min_fit_clients=exp_params['min_fit_clients'],
    min_evaluate_clients=exp_params['min_evaluate_clients'],
    min_available_clients=exp_params['n_clients'],
    initial_parameters=fl.common.ndarrays_to_parameters(get_parameters(synthesizer)),
    evaluate_fn=get_evaluate_server_fn(test_loader_server, exp_params),
    on_fit_config_fn=get_eval_config,
    on_evaluate_config_fn=get_eval_config
)
```

--------------------------------

### init_model

Source: https://context7.com/sattarov/fedtabdiff/llms.txt

Instantiates the MLPSynthesizer and BaseDiffuser from experiment parameters. This is a factory function for setting up the model components.

```APIDOC
## init_model

### Description
Instantiate synthesizer and diffuser from experiment config. Convenience factory in `fedtabdiff_modules.py` that reads the `exp_params` dictionary and returns a fully-configured `(MLPSynthesizer, BaseDiffuser)` pair, ready for training or inference.
This is the standard entry point for constructing the model in both `main.py` and each `FlowerClient`.

### Method
`init_model(exp_params: dict)`

### Parameters
- **exp_params** (dict) - A dictionary containing experiment parameters such as:
  - `encoded_dim` (int): The internal dimension of the encoded features.
  - `mlp_layers` (list[int]): A list defining the sizes of the hidden MLP layers.
  - `activation` (str): The activation function to use (e.g., 'lrelu').
  - `n_cat_tokens` (int): The total number of unique categorical tokens.
  - `n_cat_emb` (int): The embedding dimension for categorical features.
  - `n_classes` (int): The number of distinct classes for conditional generation.
  - `diffusion_steps` (int): The total number of diffusion steps.
  - `diffusion_beta_start` (float): The starting value for the noise schedule beta.
  - `diffusion_beta_end` (float): The ending value for the noise schedule beta.
  - `scheduler` (str): The type of noise scheduler (e.g., 'linear').
  - `device` (torch.device): The device to run the model on (e.g., `torch.device('cpu')`).
  - `learning_rate` (float, optional): The learning rate for the optimizer.
  - `client_rounds` (int, optional): Number of local training rounds per client.

### Response
#### Success Response
- **synthesizer** (MLPSynthesizer): An instance of the MLPSynthesizer.
- **diffuser** (BaseDiffuser): An instance of the BaseDiffuser.

### Request Example
```python
from fedtabdiff_modules import init_model
import torch

exp_params = {
    'encoded_dim': 15,
    'mlp_layers': [512, 512],
    'activation': 'lrelu',
    'n_cat_tokens': 100,
    'n_cat_emb': 2,
    'n_classes': 5,
    'diffusion_steps': 500,
    'diffusion_beta_start': 1e-4,
    'diffusion_beta_end': 0.02,
    'scheduler': 'linear',
    'device': torch.device('cpu')
}

synthesizer, diffuser = init_model(exp_params)
print(type(synthesizer))
print(type(diffuser))
```
```

--------------------------------

### Prepare Noise Schedule: Linear vs. Quadratic

Source: https://context7.com/sattarov/fedtabdiff/llms.txt

Generates per-step alpha and beta tensors for both linear and quadratic noise schedules. The schedules are scaled by 1000 / total_steps.

```python
from BaseDiffuser import BaseDiffuser

diffuser_linear = BaseDiffuser(total_steps=1000, scheduler='linear')
diffuser_quad = BaseDiffuser(total_steps=1000, scheduler='quad')

alphas_lin, betas_lin = diffuser_linear.alphas, diffuser_linear.betas
alphas_quad, betas_quad = diffuser_quad.alphas, diffuser_quad.betas

print(f"Linear beta[-1]: {betas_lin[-1].item():.4f}")  # 0.0200
print(f"Quad beta[-1]: {betas_quad[-1].item():.4f}")  # 0.0200
```

--------------------------------

### `BaseDiffuser.prepare_noise_schedule`

Source: https://context7.com/sattarov/fedtabdiff/llms.txt

Builds the beta and alpha noise schedule tensors for either a linear or quadratic schedule. The schedules are scaled by `1000 / total_steps` to ensure the noise level is independent of the number of diffusion steps.

```APIDOC
## `BaseDiffuser.prepare_noise_schedule`: Build beta/alpha noise schedule tensors

Returns the per-step `alphas` and `betas` tensors for either a `linear` or `quad` schedule, scaled by `1000 / total_steps` so the effective noise level is independent of the number of diffusion steps chosen.
```python
from BaseDiffuser import BaseDiffuser

diffuser_linear = BaseDiffuser(total_steps=1000, scheduler='linear')
diffuser_quad = BaseDiffuser(total_steps=1000, scheduler='quad')

alphas_lin, betas_lin = diffuser_linear.alphas, diffuser_linear.betas
alphas_quad, betas_quad = diffuser_quad.alphas, diffuser_quad.betas

print(f"Linear beta[-1]: {betas_lin[-1].item():.4f}")  # 0.0200
print(f"Quad beta[-1]: {betas_quad[-1].item():.4f}")  # 0.0200
```
```

--------------------------------

### Initialize MLP-based Noise Prediction Network

Source: https://context7.com/sattarov/fedtabdiff/llms.txt

Sets up the feed-forward network for noise prediction, including timestep embeddings, categorical embeddings, and optional class conditioning.

```python
from MLPSynthesizer import MLPSynthesizer

synthesizer = MLPSynthesizer(
    d_in=15,                   # encoded_dim = cat_dim + num_dim
    hidden_layers=[512, 512],  # two hidden layers of 512 neurons each
    activation='lrelu',        # 'lrelu' | 'relu' | 'tanh' | 'sigmoid'
    dim_t=64,                  # timestep embedding dimension
    n_cat_tokens=200,          # total unique categorical tokens across all cat attributes
    n_cat_emb=2,               # embedding dimension per categorical token
    embedding_learned=False,   # freeze embeddings (True to fine-tune)
    n_classes=5                # number of label classes for conditional generation
)

print(synthesizer)
# MLPSynthesizer(
#   (mlp): MLP(...)
#   (embedding): Embedding(200, 2)
#   (label_emb): Embedding(5, 64)
#   (proj): Sequential(Linear(15,64), SiLU, Linear(64,64))
#   (time_embed): Sequential(Linear(64,64), SiLU, Linear(64,64))
#   (head): Linear(512, 15)
# )
```

--------------------------------

### Perform Single Denoising Step with BaseDiffuser

Source: https://context7.com/sattarov/fedtabdiff/llms.txt

Executes one step of the reverse DDPM process. Use this iteratively to generate synthetic data from noise.
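For reference, the textbook DDPM reverse update (Ho et al., 2020) that a step of this kind typically computes is written out below. Whether `p_sample_gauss` uses exactly this variance choice is an assumption; the symbols map onto the `alphas`, `betas`, and `alphas_hat` tensors described for `BaseDiffuser`, and $\epsilon_\theta$ stands for the synthesizer's noise prediction.

```latex
x_{t-1} = \frac{1}{\sqrt{\alpha_t}}
          \left( x_t - \frac{\beta_t}{\sqrt{1 - \hat{\alpha}_t}}\, \epsilon_\theta(x_t, t) \right)
          + \sqrt{\beta_t}\, z,
\qquad
z \sim \mathcal{N}(0, I) \text{ for } t > 0, \quad z = 0 \text{ for } t = 0,
\qquad
\alpha_t = 1 - \beta_t, \quad \hat{\alpha}_t = \prod_{s \le t} \alpha_s
```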
```python
from BaseDiffuser import BaseDiffuser
from MLPSynthesizer import MLPSynthesizer
import torch

diffuser = BaseDiffuser(total_steps=500)
synthesizer = MLPSynthesizer(d_in=15, hidden_layers=[128, 128],
                             n_cat_tokens=50, n_cat_emb=2, n_classes=3)

z = torch.randn(4, 15)  # start from pure noise
label = torch.randint(0, 3, (4,))

for i in reversed(range(0, 10)):  # abbreviated: normally range(0, total_steps)
    t = torch.full((4,), i, dtype=torch.long)
    with torch.no_grad():
        model_out = synthesizer(z, t, label)
    z = diffuser.p_sample_gauss(model_out, z, t)

print(f"Generated tensor shape: {z.shape}")  # torch.Size([4, 15])
```

--------------------------------

### MLPSynthesizer.__init__

Source: https://context7.com/sattarov/fedtabdiff/llms.txt

Initializes the MLP-based noise prediction network. This network predicts the noise at each diffusion step, incorporating timestep embeddings, categorical embeddings, and optional class-conditioning.

```APIDOC
## `MLPSynthesizer.__init__`: Initialize the MLP-based noise prediction network

Creates the feed-forward neural network used to predict the noise at each diffusion step. Incorporates sinusoidal timestep embeddings, a learnable or pre-trained categorical embedding table, optional class-conditioning via label embeddings, and a final linear head that outputs a noise prediction of the same dimensionality as the input.

```python
from MLPSynthesizer import MLPSynthesizer

synthesizer = MLPSynthesizer(
    d_in=15,                   # encoded_dim = cat_dim + num_dim
    hidden_layers=[512, 512],  # two hidden layers of 512 neurons each
    activation='lrelu',        # 'lrelu' | 'relu' | 'tanh' | 'sigmoid'
    dim_t=64,                  # timestep embedding dimension
    n_cat_tokens=200,          # total unique categorical tokens across all cat attributes
    n_cat_emb=2,               # embedding dimension per categorical token
    embedding_learned=False,   # freeze embeddings (True to fine-tune)
    n_classes=5                # number of label classes for conditional generation
)

print(synthesizer)
# MLPSynthesizer(
#   (mlp): MLP(...)
#   (embedding): Embedding(200, 2)
#   (label_emb): Embedding(5, 64)
#   (proj): Sequential(Linear(15,64), SiLU, Linear(64,64))
#   (time_embed): Sequential(Linear(64,64), SiLU, Linear(64,64))
#   (head): Linear(512, 15)
# )
```
```

--------------------------------

### Load and Preprocess Tabular Data

Source: https://context7.com/sattarov/fedtabdiff/llms.txt

Loads, cleans, and preprocesses tabular data for federated learning. It encodes categorical features, scales numerical features, and prepares data for client-side distribution.

```python
train_raw = pd.read_csv('data/city_payments_fy2017.csv.zip')
train_raw.columns = [c.replace('_', ' ') for c in train_raw.columns]

cat_attrs = ['fm', 'check date', 'department title', 'character title',
             'sub obj title', 'vendor name', 'contract description']
num_attrs = ['transaction amount']
label_name = 'doc ref no prefix definition'

top_n = train_raw[label_name].value_counts().nlargest(5).index
train_raw = train_raw[train_raw[label_name].isin(top_n)].reset_index(drop=True)

for col in cat_attrs:
    train_raw[col] = col + '_' + train_raw[col].astype(str)

label = train_raw[label_name].fillna('NA')
class_encoder = LabelEncoder().fit(label)
label = class_encoder.transform(label)

train = train_raw[[*cat_attrs, *num_attrs]]
num_scaler = QuantileTransformer(output_distribution='normal', random_state=exp_params['seed'])
train_num_scaled = num_scaler.fit_transform(train[num_attrs])

vocabulary_classes = np.unique(train[cat_attrs].astype(str))
label_encoder = LabelEncoder().fit(vocabulary_classes)
train_cat_scaled = train[cat_attrs].apply(label_encoder.transform)
vocab_per_attr = {a: set(train_cat_scaled[a]) for a in cat_attrs}

exp_params.update({
    'n_cat_tokens': len(vocabulary_classes),
    'n_classes': len(np.unique(label)),
    'cat_dim': exp_params['n_cat_emb'] * len(cat_attrs),
    'encoded_dim': exp_params['n_cat_emb'] * len(cat_attrs) + len(num_attrs),
    'vocab_per_attr': vocab_per_attr,
    'num_scaler': num_scaler,
    'num_attrs': num_attrs,
    'cat_attrs': cat_attrs,
    'label_encoder': label_encoder
})
```

--------------------------------

### Decode Samples to DataFrame with FedTabDiff

Source: https://context7.com/sattarov/fedtabdiff/llms.txt

Reverses data preprocessing to convert encoded synthetic samples back into a human-readable DataFrame. Requires fitted scalers and encoders.

```python
from fedtabdiff_modules import init_model, generate_samples, decode_samples
from sklearn.preprocessing import LabelEncoder, QuantileTransformer
import torch, numpy as np, pandas as pd

# Minimal reproducible setup
cat_attrs = ['category_a', 'category_b']
num_attrs = ['amount']
vocab = ['category_a_X', 'category_a_Y', 'category_b_P', 'category_b_Q']
label_encoder = LabelEncoder().fit(vocab)
num_scaler = QuantileTransformer(output_distribution='normal').fit([[10], [50], [100], [200]])
vocab_per_attr = {'category_a': {0, 1}, 'category_b': {2, 3}}

exp_params = {
    'encoded_dim': 5, 'mlp_layers': [64], 'activation': 'lrelu',
    'n_cat_tokens': 4, 'n_cat_emb': 2, 'n_classes': 2,
    'diffusion_steps': 50, 'diffusion_beta_start': 1e-4, 'diffusion_beta_end': 0.02,
    'scheduler': 'linear', 'device': torch.device('cpu')
}

synthesizer, diffuser = init_model(exp_params)
label = torch.tensor([0, 1])
samples = generate_samples(synthesizer, diffuser, exp_params['encoded_dim'],
                           exp_params['diffusion_steps'], label=label)

df = decode_samples(
    samples=samples,
    cat_dim=4,  # n_cat_attrs * n_cat_emb = 2 * 2
    n_cat_emb=2,
    num_attrs=num_attrs,
    cat_attrs=cat_attrs,
    num_scaler=num_scaler,
    vocab_per_attr=vocab_per_attr,
    label_encoder=label_encoder,
    embeddings=synthesizer.get_embeddings()
)
print(df)
#      category_a    category_b  amount
# 0  category_a_X  category_b_P   42.17
# 1  category_a_Y  category_b_Q  118.53
```

--------------------------------

### Serialize and Restore PyTorch Model Weights

Source: https://context7.com/sattarov/fedtabdiff/llms.txt

Extracts and loads model weights using NumPy arrays. Ensure the model architectures match when setting parameters.
```python
from utils import get_parameters, set_parameters
from MLPSynthesizer import MLPSynthesizer
import numpy as np

model_a = MLPSynthesizer(d_in=15, hidden_layers=[256, 256],
                         n_cat_tokens=50, n_cat_emb=2, n_classes=3)
model_b = MLPSynthesizer(d_in=15, hidden_layers=[256, 256],
                         n_cat_tokens=50, n_cat_emb=2, n_classes=3)

# Serialize model_a weights
params = get_parameters(model_a)
print(f"Number of parameter arrays : {len(params)}")  # e.g. 14
print(f"First array shape : {params[0].shape}")

# Copy model_a weights into model_b
set_parameters(model_b, params)

# Verify transfer
for p_a, p_b in zip(get_parameters(model_a), get_parameters(model_b)):
    assert np.allclose(p_a, p_b), "Parameter mismatch!"
print("Parameters successfully transferred.")
```

--------------------------------

### Generate Samples: Reverse Diffusion Process

Source: https://context7.com/sattarov/fedtabdiff/llms.txt

Generates synthetic tabular data by running the full reverse diffusion chain. Supports class-conditional generation using a `label` tensor.

```python
from fedtabdiff_modules import init_model, generate_samples
import torch

exp_params = {
    'encoded_dim': 15, 'mlp_layers': [256, 256], 'activation': 'lrelu',
    'n_cat_tokens': 50, 'n_cat_emb': 2, 'n_classes': 3,
    'diffusion_steps': 100, 'diffusion_beta_start': 1e-4, 'diffusion_beta_end': 0.02,
    'scheduler': 'linear', 'device': torch.device('cpu')
}

synthesizer, diffuser = init_model(exp_params)

# Conditional generation: generate one row per label in the tensor
label = torch.tensor([0, 1, 2, 0, 1])
samples = generate_samples(
    synthesizer=synthesizer,
    diffuser=diffuser,
    encoded_dim=exp_params['encoded_dim'],
    last_diff_step=exp_params['diffusion_steps'],
    label=label
)
print(f"Generated shape: {samples.shape}")  # torch.Size([5, 15])
```

--------------------------------

### generate_samples

Source: https://context7.com/sattarov/fedtabdiff/llms.txt

Generates synthetic tabular data rows using the reverse diffusion process.
```APIDOC
## generate_samples

### Description
Generate synthetic tabular rows via reverse diffusion. Runs the full reverse diffusion chain (from `last_diff_step` down to 0) to denoise a random Gaussian matrix into realistic synthetic data in the encoded feature space. Supports class-conditional generation by passing a `label` tensor; use `n_samples` for unconditional generation.

### Method
`generate_samples(synthesizer, diffuser, encoded_dim, last_diff_step, label=None, n_samples=None)`

### Parameters
- **synthesizer** (MLPSynthesizer): The synthesizer model instance.
- **diffuser** (BaseDiffuser): The diffuser model instance.
- **encoded_dim** (int): The dimensionality of the encoded feature space.
- **last_diff_step** (int): The final diffusion step to start the reverse process from.
- **label** (torch.Tensor, optional): Class labels for conditional generation. If provided, the number of samples generated will match the number of labels.
- **n_samples** (int, optional): The number of unconditional samples to generate. If `label` is also provided, `n_samples` is ignored.

### Response
#### Success Response
- **samples** (torch.Tensor): A tensor containing the generated synthetic data samples in the encoded feature space.

### Request Example
```python
from fedtabdiff_modules import init_model, generate_samples
import torch

exp_params = {
    'encoded_dim': 15, 'mlp_layers': [256, 256], 'activation': 'lrelu',
    'n_cat_tokens': 50, 'n_cat_emb': 2, 'n_classes': 3,
    'diffusion_steps': 100, 'diffusion_beta_start': 1e-4, 'diffusion_beta_end': 0.02,
    'scheduler': 'linear', 'device': torch.device('cpu')
}

synthesizer, diffuser = init_model(exp_params)

# Conditional generation: generate one row per label in the tensor
label = torch.tensor([0, 1, 2, 0, 1])
samples = generate_samples(
    synthesizer=synthesizer,
    diffuser=diffuser,
    encoded_dim=exp_params['encoded_dim'],
    last_diff_step=exp_params['diffusion_steps'],
    label=label
)
print(f"Generated shape: {samples.shape}")
```
```

--------------------------------

### Train Model: Local Client Training Round

Source: https://context7.com/sattarov/fedtabdiff/llms.txt

Executes mini-batch gradient descent steps for the diffusion objective on a local dataset. Accepts an optional pre-existing optimizer and returns the mean training loss.

```python
from fedtabdiff_modules import init_model, train_model
from torch.utils.data import DataLoader, TensorDataset
import torch

exp_params = {
    'encoded_dim': 15, 'mlp_layers': [256, 256], 'activation': 'lrelu',
    'n_cat_tokens': 50, 'n_cat_emb': 2, 'n_classes': 3,
    'diffusion_steps': 500, 'diffusion_beta_start': 1e-4, 'diffusion_beta_end': 0.02,
    'scheduler': 'linear', 'device': torch.device('cpu'),
    'learning_rate': 1e-4, 'client_rounds': 5
}

synthesizer, diffuser = init_model(exp_params)

# synthetic stand-in data: 100 samples, 7 cat attrs, 1 num attr, 3 classes
cat = torch.randint(0, 50, (100, 7))
num = torch.randn(100, 1)
label = torch.randint(0, 3, (100,))
loader = DataLoader(TensorDataset(cat, num, label), batch_size=32, shuffle=True)

loss = train_model(synthesizer, diffuser, loader, exp_params)
print(f"Training loss: {loss:.6f}")  # e.g. Training loss: 0.998213
```

--------------------------------

### Flower Client for Federated Learning with FedTabDiff

Source: https://context7.com/sattarov/fedtabdiff/llms.txt

A Flower NumPyClient subclass for federated learning. It handles local training and evaluation, including generating synthetic data and computing fidelity scores.

```python
from FlowerClient import get_client_fn, get_eval_config
from fedtabdiff_modules import init_model
from torch.utils.data import DataLoader, TensorDataset
import torch, flwr as fl

exp_params = {
    'encoded_dim': 15, 'mlp_layers': [256, 256], 'activation': 'lrelu',
    'n_cat_tokens': 50, 'n_cat_emb': 2, 'n_classes': 3,
    'diffusion_steps': 100, 'diffusion_beta_start': 1e-4, 'diffusion_beta_end': 0.02,
    'scheduler': 'linear', 'device': torch.device('cpu'),
    'learning_rate': 1e-4, 'client_rounds': 5, 'eval_rate_client': 50,
    'cat_dim': 14, 'num_attrs': ['amount'], 'cat_attrs': ['dept', 'vendor'],
    'num_scaler': None, 'vocab_per_attr': {}, 'label_encoder': None
}

cat = torch.randint(0, 50, (100, 7))
num = torch.randn(100, 1)
label = torch.randint(0, 3, (100,))
train_loader = DataLoader(TensorDataset(cat, num, label), batch_size=32)
test_loader = (None, label)  # (DataFrame, label_tensor)

# get_client_fn returns a closure used by Flower's VirtualClientEngine
client_fn = get_client_fn(
    train_loaders=[train_loader],
    test_loaders=[test_loader],
    exp_params=exp_params
)

# get_eval_config returns the per-round config dict consumed by fit() and evaluate()
config = get_eval_config(server_round=10)
print(config)  # {'server_round': 10}
```

--------------------------------

### train_model

Source: https://context7.com/sattarov/fedtabdiff/llms.txt

Executes one round of local training on a client's data loader using the diffusion objective.

```APIDOC
## train_model

### Description
Run one round of local training on a client's data loader.
Executes `client_rounds` mini-batch gradient descent steps of the diffusion objective (MSE between true and predicted noise) on the local dataset. Accepts an optional pre-existing optimizer to allow state to persist across federated rounds. Returns the mean training loss for logging.

### Method
`train_model(synthesizer, diffuser, loader, exp_params)`

### Parameters
- **synthesizer** (MLPSynthesizer): The synthesizer model instance.
- **diffuser** (BaseDiffuser): The diffuser model instance.
- **loader** (DataLoader): The data loader for the client's local dataset.
- **exp_params** (dict): Dictionary containing experiment parameters, including training configurations like `learning_rate` and `client_rounds`.

### Response
#### Success Response
- **loss** (float): The mean training loss for the round.

### Request Example
```python
from fedtabdiff_modules import init_model, train_model
from torch.utils.data import DataLoader, TensorDataset
import torch

exp_params = {
    'encoded_dim': 15, 'mlp_layers': [256, 256], 'activation': 'lrelu',
    'n_cat_tokens': 50, 'n_cat_emb': 2, 'n_classes': 3,
    'diffusion_steps': 500, 'diffusion_beta_start': 1e-4, 'diffusion_beta_end': 0.02,
    'scheduler': 'linear', 'device': torch.device('cpu'),
    'learning_rate': 1e-4, 'client_rounds': 5
}

synthesizer, diffuser = init_model(exp_params)

# synthetic stand-in data: 100 samples, 7 cat attrs, 1 num attr, 3 classes
cat = torch.randint(0, 50, (100, 7))
num = torch.randn(100, 1)
label = torch.randint(0, 3, (100,))
loader = DataLoader(TensorDataset(cat, num, label), batch_size=32, shuffle=True)

loss = train_model(synthesizer, diffuser, loader, exp_params)
print(f"Training loss: {loss:.6f}")
```
```

--------------------------------

### Sample Random Diffusion Timesteps

Source: https://context7.com/sattarov/fedtabdiff/llms.txt

Uniformly samples a specified number of integer timesteps from the range [1, total_steps). These timesteps are used to determine noise levels in the diffusion process.
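A rough, dependency-free sketch of what such sampling amounts to, and of how a sampled timestep translates into a noise level. `sample_timesteps` and `noise_weights` below are hypothetical stand-ins written for illustration, not the library's code; the schedule reuses the assumed `1000 / total_steps` scaling, and the weights follow the standard DDPM closed form in which a noisy sample is `sqrt(alpha_hat) * x0 + sqrt(1 - alpha_hat) * noise`.

```python
import math
import random


def sample_timesteps(n, total_steps, rng=random):
    """Hypothetical stand-in: n integers drawn uniformly from [1, total_steps)."""
    return [rng.randrange(1, total_steps) for _ in range(n)]


def noise_weights(t, total_steps, beta_start=1e-4, beta_end=0.02):
    """Signal and noise weights at step t for a linear schedule (scaling assumed)."""
    scale = 1000 / total_steps
    lo, hi = scale * beta_start, scale * beta_end
    alpha_hat = 1.0
    for i in range(t + 1):
        beta_i = lo + (hi - lo) * i / (total_steps - 1)
        alpha_hat *= 1.0 - beta_i  # cumulative product of (1 - beta)
    return math.sqrt(alpha_hat), math.sqrt(1.0 - alpha_hat)


rng = random.Random(42)  # seeded only to make the sketch repeatable
timesteps = sample_timesteps(n=8, total_steps=500, rng=rng)
for t in sorted(timesteps):
    signal, noise = noise_weights(t, total_steps=500)
    print(f"t={t:3d}  signal={signal:.3f}  noise={noise:.3f}")
```

Small `t` keeps a sample close to the data (signal weight near 1), while large `t` pushes it toward pure noise.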
```python
from BaseDiffuser import BaseDiffuser

diffuser = BaseDiffuser(total_steps=500)
timesteps = diffuser.sample_timesteps(n=8)
print(timesteps)        # tensor([312, 47, 491, 123, 388, 15, 274, 205]) (random each call)
print(timesteps.shape)  # torch.Size([8])
```

--------------------------------

### Server-Side Evaluation Function Factory for FedTabDiff

Source: https://context7.com/sattarov/fedtabdiff/llms.txt

A factory function that returns a closure for server-side evaluation in Flower. It re-initializes models, aggregates parameters, and computes fidelity scores periodically.

```python
from FlowerServer import get_evaluate_server_fn
from fedtabdiff_modules import init_model
from utils import get_parameters
import torch, pandas as pd, flwr as fl

exp_params = {
    'encoded_dim': 15, 'mlp_layers': [256, 256], 'activation': 'lrelu',
    'n_cat_tokens': 50, 'n_cat_emb': 2, 'n_classes': 3,
    'diffusion_steps': 100, 'diffusion_beta_start': 1e-4, 'diffusion_beta_end': 0.02,
    'scheduler': 'linear', 'device': torch.device('cpu'),
    'eval_rate_server': 10, 'cat_dim': 14,
    'num_attrs': ['amount'], 'cat_attrs': ['dept', 'vendor'],
    'num_scaler': None, 'vocab_per_attr': {}, 'label_encoder': None
}

label = torch.randint(0, 3, (200,))
test_loader = (pd.DataFrame(), label)  # (real_data_df, label_tensor)

evaluate_fn = get_evaluate_server_fn(test_loader=test_loader, exp_params=exp_params)
```

--------------------------------

### MLPSynthesizer Forward Pass: Predict Noise

Source: https://context7.com/sattarov/fedtabdiff/llms.txt

Performs the full forward pass of the MLPSynthesizer to predict noise given noisy data and timestep. Requires MLPSynthesizer and BaseDiffuser imports, and initializes them with specified dimensions and layers.
```python
from MLPSynthesizer import MLPSynthesizer
from BaseDiffuser import BaseDiffuser
import torch

synthesizer = MLPSynthesizer(d_in=15, hidden_layers=[512, 512],
                             n_cat_tokens=100, n_cat_emb=2, n_classes=5)
diffuser = BaseDiffuser(total_steps=500)

batch = torch.randn(8, 15)           # noisy input batch
t = diffuser.sample_timesteps(n=8)   # random timesteps
label = torch.randint(0, 5, (8,))    # class labels for conditional sampling

predicted_noise = synthesizer(x=batch, timesteps=t, label=label)
print(f"Input shape : {batch.shape}")                      # torch.Size([8, 15])
print(f"Predicted noise shape: {predicted_noise.shape}")   # torch.Size([8, 15])
```

--------------------------------

### `BaseDiffuser.add_gauss_noise`

Source: https://context7.com/sattarov/fedtabdiff/llms.txt

Implements the forward noising step of the DDPM. Given a batch of data and sampled timesteps, it adds Gaussian noise according to the precomputed noise schedule, returning the noisy data and the added noise.

```APIDOC
## `BaseDiffuser.add_gauss_noise`: Forward diffusion, add Gaussian noise at timestep t

Implements the closed-form forward noising step of the DDPM. Given a batch of encoded data `x_num` and the sampled timesteps `t`, returns the noisy tensor `x_noise_num` and the pure noise `noise_num` that was added. The noise magnitude is governed by the precomputed `alphas_hat[t]`.
```python
from BaseDiffuser import BaseDiffuser
import torch

diffuser = BaseDiffuser(total_steps=500)
batch = torch.randn(4, 15)        # 4 samples, 15-dimensional encoded space
t = diffuser.sample_timesteps(4)  # random timesteps for each sample

x_noisy, noise = diffuser.add_gauss_noise(x_num=batch, t=t)
print(f"Input shape : {batch.shape}")    # torch.Size([4, 15])
print(f"Noisy shape : {x_noisy.shape}")  # torch.Size([4, 15])
print(f"Noise shape : {noise.shape}")    # torch.Size([4, 15])
```
```

--------------------------------

### Compute Column-Level Fidelity

Source: https://context7.com/sattarov/fedtabdiff/llms.txt

Calculates fidelity scores (TVComplement for categorical, KSComplement for numerical) between real and synthetic data. Requires SDV metadata.

```python
from utils import collect_fidelity
from sdv.metadata import SingleTableMetadata
import pandas as pd, numpy as np

# Real vs. synthetic data (small example)
real_data = pd.DataFrame({
    'department': ['Engineering', 'HR', 'Finance', 'Engineering', 'HR'],
    'amount': [5000.0, 4500.0, 6200.0, 5100.0, 4800.0]
})
synthetic_data = pd.DataFrame({
    'department': ['Engineering', 'Finance', 'HR', 'Engineering', 'HR'],
    'amount': [5050.0, 6150.0, 4600.0, 4950.0, 4750.0]
})

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data=real_data)

scores = collect_fidelity(real_data=real_data, synthetic_data=synthetic_data, metadata=metadata)
print(scores)
# {'fidelity': 0.923} (illustrative value): the score is the mean of
# TVComplement(department) and KSComplement(amount)
```

--------------------------------

### BaseDiffuser.p_sample_gauss

Source: https://context7.com/sattarov/fedtabdiff/llms.txt

Performs a single denoising step in the reverse DDPM process. It takes the model's predicted noise, the current noisy tensor, and the current timestep to return a slightly less noisy tensor. This is used iteratively during inference to generate synthetic data.
```APIDOC
## `BaseDiffuser.p_sample_gauss` - Reverse diffusion: single denoising step

Performs one step of the reverse (denoising) DDPM process. Given the model's predicted noise `model_out`, the current noisy tensor `z_norm`, and the current timestep `t`, it returns the slightly-less-noisy tensor. Called iteratively from `total_steps - 1` down to `0` during inference to generate synthetic data from pure noise.

```python
from BaseDiffuser import BaseDiffuser
from MLPSynthesizer import MLPSynthesizer
import torch

diffuser = BaseDiffuser(total_steps=500)
synthesizer = MLPSynthesizer(d_in=15, hidden_layers=[128, 128], n_cat_tokens=50, n_cat_emb=2, n_classes=3)

z = torch.randn(4, 15)  # start from pure noise
label = torch.randint(0, 3, (4,))

for i in reversed(range(0, 10)):  # abbreviated: normally range(0, total_steps)
    t = torch.full((4,), i, dtype=torch.long)
    with torch.no_grad():
        model_out = synthesizer(z, t, label)
    z = diffuser.p_sample_gauss(model_out, z, t)

print(f"Generated tensor shape: {z.shape}")  # torch.Size([4, 15])
```
```

--------------------------------

### `BaseDiffuser.sample_timesteps`

Source: https://context7.com/sattarov/fedtabdiff/llms.txt

Samples a specified number of random integer timesteps uniformly from the range `[1, total_steps)`. These timesteps are used to determine the noise level for each sample during the diffusion process.

```APIDOC
## `BaseDiffuser.sample_timesteps` - Sample random diffusion timesteps

Uniformly samples `n` integer timesteps from `[1, total_steps)`, one per sample in a training batch. These timesteps index into the precomputed schedules to determine how much noise to add during the forward pass.
```python
from BaseDiffuser import BaseDiffuser

diffuser = BaseDiffuser(total_steps=500)
timesteps = diffuser.sample_timesteps(n=8)
print(timesteps)        # tensor([312, 47, 491, 123, 388, 15, 274, 205]) (random each call)
print(timesteps.shape)  # torch.Size([8])
```
```

--------------------------------

### MLPSynthesizer.get_embeddings

Source: https://context7.com/sattarov/fedtabdiff/llms.txt

Retrieves the raw embedding weight matrix for categorical tokens. These embeddings are crucial for decoding generated samples, enabling the mapping of continuous latent vectors back to categorical tokens.

```APIDOC
## `MLPSynthesizer.get_embeddings` - Extract trained categorical embedding weights

Returns the raw embedding weight matrix from the model's `nn.Embedding` layer. These embeddings are used during decoding of generated samples to map continuous latent vectors back to the nearest categorical token via distance computation.

```python
from MLPSynthesizer import MLPSynthesizer

synthesizer = MLPSynthesizer(d_in=15, hidden_layers=[128], n_cat_tokens=50, n_cat_emb=2, n_classes=3)
embeddings = synthesizer.get_embeddings()
print(f"Embedding matrix shape: {embeddings.shape}")  # torch.Size([50, 2])
# Each row is the 2-dimensional embedding vector for one categorical token.
```
```

--------------------------------

### Retrieve Categorical Embedding Weights

Source: https://context7.com/sattarov/fedtabdiff/llms.txt

Extracts the raw embedding weight matrix from the model. These are used in decoding to map latent vectors back to categorical tokens.

```python
from MLPSynthesizer import MLPSynthesizer

synthesizer = MLPSynthesizer(d_in=15, hidden_layers=[128], n_cat_tokens=50, n_cat_emb=2, n_classes=3)
embeddings = synthesizer.get_embeddings()
print(f"Embedding matrix shape: {embeddings.shape}")  # torch.Size([50, 2])
# Each row is the 2-dimensional embedding vector for one categorical token.
```

--------------------------------

### Add Gaussian Noise at Timestep t

Source: https://context7.com/sattarov/fedtabdiff/llms.txt

Implements the forward noising step of DDPM. Adds Gaussian noise to input data `x_num` at specified timesteps `t`, returning the noisy tensor and the added noise. Noise magnitude depends on `alphas_hat[t]`.

```python
from BaseDiffuser import BaseDiffuser
import torch

diffuser = BaseDiffuser(total_steps=500)
batch = torch.randn(4, 15)        # 4 samples, 15-dimensional encoded space
t = diffuser.sample_timesteps(4)  # random timesteps for each sample

x_noisy, noise = diffuser.add_gauss_noise(x_num=batch, t=t)
print(f"Input shape : {batch.shape}")    # torch.Size([4, 15])
print(f"Noisy shape : {x_noisy.shape}")  # torch.Size([4, 15])
print(f"Noise shape : {noise.shape}")    # torch.Size([4, 15])
```

--------------------------------

### Embed Categorical Tokens to Dense Vectors

Source: https://context7.com/sattarov/fedtabdiff/llms.txt

Maps integer-encoded categorical attributes to their corresponding embedding vectors and flattens them. This is used before concatenating with numerical features.

```python
from MLPSynthesizer import MLPSynthesizer
import torch

synthesizer = MLPSynthesizer(d_in=15, hidden_layers=[128], n_cat_tokens=50, n_cat_emb=2, n_classes=3)

# batch of 4 samples, each with 7 categorical attributes (encoded as integer indices)
x_cat = torch.LongTensor([[0, 3, 7, 12, 21, 35, 42],
                          [1, 4, 8, 13, 22, 36, 43],
                          [2, 5, 9, 14, 23, 37, 44],
                          [0, 6, 10, 15, 24, 38, 45]])

x_cat_emb = synthesizer.embed_categorical(x_cat)
print(f"x_cat shape : {x_cat.shape}")         # torch.Size([4, 7])
print(f"Embedded shape : {x_cat_emb.shape}")  # torch.Size([4, 14]) (7 attrs × 2 emb_dim)
```

--------------------------------

### MLPSynthesizer.forward

Source: https://context7.com/sattarov/fedtabdiff/llms.txt

Performs the forward pass of the MLPSynthesizer to predict noise given noisy data and a timestep.
It handles timestep embeddings, optional class embeddings, projection, and passing through hidden layers.

```APIDOC
## MLPSynthesizer.forward

### Description
Runs the full forward pass: (1) computes sinusoidal timestep embeddings, (2) optionally adds class-label embeddings for conditional generation, (3) projects the noisy input to the internal dimension and fuses with time/label embeddings, (4) passes through the hidden MLP layers, and (5) projects back to the original input dimension as the predicted noise tensor.

### Method
`MLPSynthesizer.__call__` (or `forward`)

### Parameters
- **x** (torch.Tensor) - The noisy input batch.
- **timesteps** (torch.Tensor) - The timesteps for each sample in the batch.
- **label** (torch.Tensor, optional) - Class labels for conditional generation.

### Request Example
```python
from MLPSynthesizer import MLPSynthesizer
from BaseDiffuser import BaseDiffuser
import torch

synthesizer = MLPSynthesizer(d_in=15, hidden_layers=[512, 512], n_cat_tokens=100, n_cat_emb=2, n_classes=5)
diffuser = BaseDiffuser(total_steps=500)

batch = torch.randn(8, 15)          # noisy input batch
t = diffuser.sample_timesteps(n=8)  # random timesteps
label = torch.randint(0, 5, (8,))   # class labels for conditional sampling

predicted_noise = synthesizer(x=batch, timesteps=t, label=label)
print(f"Input shape : {batch.shape}")
print(f"Predicted noise shape: {predicted_noise.shape}")
```

### Response
#### Success Response
- **predicted_noise** (torch.Tensor) - The predicted noise tensor.
```
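
Step (1) of the forward pass mentions sinusoidal timestep embeddings. The sketch below shows one common way such an embedding is computed, using only the standard library. The function name `sinusoidal_embedding`, the `max_period` default, and the cosine-then-sine layout are assumptions for illustration. The source does not state the exact formula `MLPSynthesizer` uses.

```python
import math

def sinusoidal_embedding(timestep, dim, max_period=10000.0):
    """Map an integer timestep to a dim-sized vector of cosines and sines.

    Hypothetical helper: layout and max_period are illustrative choices.
    """
    half = dim // 2
    # Geometrically spaced frequencies, starting at 1 and decaying toward 1/max_period.
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    args = [timestep * f for f in freqs]
    emb = [math.cos(a) for a in args] + [math.sin(a) for a in args]
    if dim % 2 == 1:
        emb.append(0.0)  # pad odd dimensions with a zero
    return emb

emb = sinusoidal_embedding(timestep=0, dim=8)
print(emb)  # [1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
```

Each timestep gets a distinct, bounded vector, which lets the MLP condition on the noise level without learning a separate embedding row per step.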