### Quickstart: Run Model-X Knockoffs with Lasso Source: https://amspector100.github.io/knockpy/installation.html A quick example demonstrating how to use knockpy to perform variable selection with cross-validated lasso. This snippet generates synthetic data and applies the KnockoffFilter. ```python import knockpy as kpy from knockpy.knockoff_filter import KnockoffFilter # Generate synthetic data from a Gaussian linear model data_gen_process = kpy.dgp.DGP() data_gen_process.sample_data( n=1500, # Number of datapoints p=500, # Dimensionality sparsity=0.1, x_dist='gaussian', ) X = data_gen_process.X y = data_gen_process.y Sigma=data_gen_process.Sigma # Run model-X knockoffs kfilter = KnockoffFilter( fstat='lasso', ksampler='gaussian', ) rejections = kfilter.forward(X=X, y=y, Sigma=Sigma) ``` -------------------------------- ### Install choldate Source: https://amspector100.github.io/knockpy/_sources/installation.rst.txt Install the choldate library from GitHub using pip. This is a prerequisite for installing knockpy. ```bash pip install git+git://github.com/jcrudy/choldate.git ``` -------------------------------- ### Install scikit-dsdp for SDP knockoffs Source: https://amspector100.github.io/knockpy/installation.html Install scikit-dsdp to speed up computation for SDP knockoffs. This package can be challenging to install on non-Linux environments. ```bash pip install scikit-dsdp ``` -------------------------------- ### Install Cython using pip Source: https://amspector100.github.io/knockpy/_sources/installation.rst.txt Install or upgrade Cython using pip. This is a prerequisite for speeding up certain knockoff computations. ```bash pip install cython>=0.29.14 ``` -------------------------------- ### Install lightweight knockpy Source: https://amspector100.github.io/knockpy/_sources/installation.rst.txt Install a lightweight version of knockpy that includes most functionality but may be slower for certain algorithms. This is recommended if the full installation fails. ```bash pip install knockpy ``` -------------------------------- ### Quickstart: Variable Selection with Knockpy Source: https://amspector100.github.io/knockpy/_sources/installation.rst.txt Demonstrates using knockpy with a cross-validated lasso feature statistic and Gaussian knockoffs for variable selection. Requires synthetic data generation. ```python import knockpy as kpy from knockpy.knockoff_filter import KnockoffFilter # Generate synthetic data from a Gaussian linear model data_gen_process = kpy.dgp.DGP() data_gen_process.sample_data( n=1500, # Number of datapoints p=500, # Dimensionality sparsity=0.1, x_dist='gaussian', ) X = data_gen_process.X y = data_gen_process.y Sigma=data_gen_process.Sigma # Run model-X knockoffs kfilter = KnockoffFilter( fstat='lasso', k المed='gaussian', ) rejections = kfilter.forward(X=X, y=y, Sigma=Sigma) ``` -------------------------------- ### Install knockpy with fast dependencies Source: https://amspector100.github.io/knockpy/_sources/installation.rst.txt Install the main knockpy package with the 'fast' extra, which includes optimized dependencies for faster computation. ```bash pip install knockpy[fast] ``` -------------------------------- ### Instantiate and Run KnockoffGGM Source: https://amspector100.github.io/knockpy/_modules/knockpy/ggm.html Example of fitting KnockoffGGM under the global null for a Gaussian graphical model with no edges. Requires fake data generation and specifies the LCD statistic with FX knockoffs. ```python import numpy as np from knockpy.ggm import KnockoffGGM # Fake data-generating process for Gaussian graphical model X = np.random.randn(300, 30) # LCD statistic with FX knockoffs gkf = KnockoffGGM( fstat='lcd', knockoff_kwargs={"method":"mvr"}, ) edges = gkf.forward(X=X, verbose=True) ``` -------------------------------- ### Install Cython using conda Source: https://amspector100.github.io/knockpy/_sources/installation.rst.txt Install Cython using the conda package manager. This is an alternative if pip installation fails, as conda includes a C compiler. ```bash conda install cython ``` -------------------------------- ### Check PyTorch Availability Source: https://amspector100.github.io/knockpy/_modules/knockpy/utilities.html Verifies if PyTorch is installed and raises a ValueError with installation instructions if it's not available. ```python def check_kpytorch_available(purpose): try: import torch as torch except ImportError as err: raise ValueError( f"Pytorch is required for {purpose}, but importing torch raised {err}. See https://pytorch.org/get-started/." ) ``` -------------------------------- ### Run KnockoffFilter with Model-X Approach Source: https://amspector100.github.io/knockpy/usage.html Applies the KnockoffFilter to the dataset using the forward method. This example demonstrates the model-X approach, requiring the covariance matrix (Sigma) as input, and specifies the desired false discovery rate (fdr). ```python # Flags of whether each feature was rejected rejections = kfilter.forward( X=X, y=y, Sigma=Sigma, fdr=0.1, # desired level of false discovery rate control ) ``` -------------------------------- ### Data Generation for AR1 Covariance Matrix Example Source: https://amspector100.github.io/knockpy/mrcknock.html This snippet sets up a data generation process for features with an AR1 covariance matrix, a scenario where SDP knockoffs are also expected to perform poorly. It samples data for a sparse Gaussian linear model. ```python # Generate data where X is Gaussian with an AR1 covariance matrix # where Cov(Xj, X_{j+1}) is sampled from Beta(a,b) # and y | X is a sparse Gaussian linear model np.random.seed(111) data_gen_process = kpy.dgp.DGP() data_gen_process.sample_data( method="ar1", a=3, b=1, n=650, # number of data-points p=500, # dimensionality sparsity=0.1, # number of non-null coefficients coeff_dist="uniform", # distribution of size of non-null coefficients coeff_size=1, # size of non-null coefficients corr_signals=True, # non-nulls features are correlated ) X = data_gen_process.X y = data_gen_process.y beta = data_gen_process.beta Sigma = data_gen_process.Sigma ``` -------------------------------- ### Instantiate Knockoff Filters with Different Samplers Source: https://amspector100.github.io/knockpy/usage.html Shows how to instantiate KnockoffFilter with various knockoff generation methods ('maxent', 'sdp', 'mvr') and samplers ('gaussian', 'fx', 'artk'). ```python # This uses gaussian maxent knockoffs kfilter1 = KnockoffFilter(ksampler="gaussian", knockoff_kwargs={"method": "maxent"}) # This uses fixed-X SDP knockoffs kfilter2 = KnockoffFilter(ksampler="fx", knockoff_kwargs={"method": "sdp"}) # Metropolized sampler for heavy-tailed t markov chain using MVR-guided proposals kfilter3 = KnockoffFilter(ksampler="artk", knockoff_kwargs={"method": "mvr"}) ``` -------------------------------- ### Initialize Knockoff Sampler Source: https://amspector100.github.io/knockpy/_modules/knockpy/metro.html Initializes the Knockoff Sampler by setting up Gibbs graphs, clique dictionaries, and divide-and-conquer strategies. It prepares data structures for sampling subsets of variables. ```python if gibbs_graph[i, j] != 0: self.cliques.append((i, j)) temp = gibbs_graph[i, j] self.log_potentials.append(make_ising_logpotential(temp, i, j)) # Maps variables to cliques they're part of # Clique key maps the clique back to the log-potential self.clique_dict = {i: [] for i in range(self.p)} for clique_key, clique in enumerate(self.cliques): for j in clique: self.clique_dict[j].append((clique_key, clique)) # Split X up into blocks along the n-axis. # Each divconq key corresponds to one way to # divide/conquer the variables, and to one # of these blocks. self.dc_keys = [] for div_type in ["row", "col"]: for trans in [max_width - 1, max_width - 2]: self.dc_keys.append(f"{div_type}-{trans}") rand_inds = np.arange(self.n) np.random.shuffle(rand_inds) self.X_ninds = {key: [] for key in self.dc_keys} for i, j in enumerate(rand_inds): key = self.dc_keys[j % len(self.dc_keys)] self.X_ninds[key].append(i) # Structure of self.divconq_info # (1) Dictionary takes a divide-and-conquery key # (translation + row/col) # (2) This maps to a list of dictionaries. Each # dictionary corresponds to one of the blocks # in the divide and conquer procedure. # (3) Each dictionary takes three keys: inds, # cliques, lps. # - inds is the list of ORIGINAL coordinates # of the block of variables. E.g. if block 1 # corresponds to columns 1,5,6, in X, # then inds = [1,5,6] # - cliques is a list of cliques in the NEW coordinates # of the block. So for example, if 1,5 was a clique in # the prior example, then (0,1) would be in this list. # - lps are the log-potentials corresponding to # the cliques above. self.divconq_info = {} # This maps divconq key # to a list of separators (e.g. knockoffs = features) # for these indices. self.separators = {} for dc_key in self.dc_keys: div_type = dc_key.split("-")[0] trans = int(dc_key.split("-")[-1]) seps, dict_list = self._divide_variables( dc_key=dc_key, translation=trans, max_width=max_width, div_type=div_type, ) self.separators[dc_key] = seps self.divconq_info[dc_key] = dict_list # Initialize samplers. Each sampler will only sample # a subset of variables for a subset of the data # Structure: maps divconq key to list of n_inds, p_inds, sampler self.samplers = {key: [] for key in self.dc_keys} for key in self.dc_keys: # Fetch indicies n_inds = np.array(self.X_ninds[key]) sep_inds = self.separators[key] # Helper for conditional cov matrices V11 = V[sep_inds][:, sep_inds] # s x s V11_inv = utilities.chol2inv(V11) for div_group_dict in self.divconq_info[key]: p_inds = div_group_dict["inds"] # Find conditional covariance matrix V # for p_inds given # the conditioned-on-separators V22 = V[p_inds][:, p_inds] # p_i x p_i V21 = V[p_inds][:, sep_inds] # p_i x s Vcond = V22 - np.dot( V21, np.dot(V11_inv, V21.T), ) sampler = MetropolizedKnockoffSampler( lf=None, X=X[n_inds][:, p_inds], mu=mu[p_inds], Sigma=Vcond, undir_graph=gibbs_graph[p_inds][:, p_inds] != 0, cliques=div_group_dict["cliques"], log_potentials=div_group_dict["lps"], buckets=self.buckets, S=None, **kwargs, ) if sampler.width > max_width: warnings.warn( f"Treewidth heuristic inaccurate during divide/conquer, sampler for {p_inds} has width {sampler.width} > max_width {max_width}" ) self.samplers[key].append((n_inds, p_inds, sampler)) ``` -------------------------------- ### PSGDSolver Initialization Source: https://amspector100.github.io/knockpy/_modules/knockpy/kpytorch/mrcgrad.html Initializes the Projected Gradient Descent (PGD) solver for MRC knockoffs. It sets up parameters for optimization, including the correlation matrix, groups, loss calculator, and convergence criteria. ```python def __init__( self, Sigma, groups, losscalc=None, lr=1e-2, verbose=False, max_epochs=100, tol=1e-5, line_search_iter=10, convergence_tol=1e-1, **kwargs, ): # Add Sigma self.p = Sigma.shape[0] self.Sigma = Sigma self.groups = groups self.opt_S = None # Output initialization self.opt_loss = np.inf # Save parameters for optimization self.lr = lr self.verbose = verbose self.max_epochs = max_epochs self.tol = tol self.line_search_iter = line_search_iter self.convergence_tol = convergence_tol # Sort by groups for ease of computation inds, inv_inds = utilities.permute_matrix_by_groups(groups) self.inds = inds self.inv_inds = inv_inds self.sorted_Sigma = self.Sigma[inds][:, inds] self.sorted_groups = self.groups[inds] # Loss calculator if losscalc is not None: self.losscalc = losscalc ``` -------------------------------- ### check_pyglmnet_available Source: https://amspector100.github.io/knockpy/_modules/knockpy/utilities.html Checks if the pyglmnet library is installed and available. Raises a ValueError if pyglmnet is not found, providing instructions for installation. ```APIDOC ## check_pyglmnet_available ### Description Checks if the pyglmnet library is installed and available. Raises a ValueError if pyglmnet is not found, providing instructions for installation. ### Parameters - **purpose** (string) - A description of why pyglmnet is needed. ``` -------------------------------- ### check_kpytorch_available Source: https://amspector100.github.io/knockpy/_modules/knockpy/utilities.html Checks if the PyTorch library is installed and available. Raises a ValueError if PyTorch is not found, providing instructions for installation. ```APIDOC ## check_kpytorch_available ### Description Checks if the PyTorch library is installed and available. Raises a ValueError if PyTorch is not found, providing instructions for installation. ### Parameters - **purpose** (string) - A description of why PyTorch is needed. ``` -------------------------------- ### Initialize Knockoff Class Source: https://amspector100.github.io/knockpy/_modules/knockpy/knockoffs.html Initializes the Knockoff class with feature data and optional parameters for covariance estimation, grouping, and S-matrix computation. Defaults are provided for most parameters if not specified. ```python def __init__( self, X, mu=None, Sigma=None, invSigma=None, groups=None, sample_tol=1e-5, S=None, method=None, verbose=False, **kwargs, ): # Save parameters with defaults self.X = X self.n = X.shape[0] self.p = X.shape[1] if mu is None: mu = X.mean(axis=0) self.mu = mu if Sigma is None: Sigma, invSigma = utilities.estimate_covariance(X, tol=1e-2) self.Sigma = Sigma if invSigma is None: invSigma = np.linalg.inv(Sigma) self.invSigma = invSigma if groups is None: groups = np.arange(1, self.p + 1, 1) self.groups = groups self.sample_tol = sample_tol self.verbose = verbose # Save S information and possibly compute S matrix self.S = S self.method = method self.kwargs = kwargs if self.S is None: if self.verbose: print("Computing knockoff S matrix...") self.S = smatrix.compute_smatrix( Sigma=self.Sigma, groups=self.groups, method=self.method, **self.kwargs ) ``` -------------------------------- ### GibbsGridSampler Initialization Source: https://amspector100.github.io/knockpy/apiref.html Initializes the GibbsGridSampler for discrete Gibbs grids using a divide-and-conquer approach combined with metropolized knockoff sampling. Requires design matrix X, Gibbs graph, and covariance matrix Sigma. ```python GibbsGridSampler(_X_ , _gibbs_graph_ , _Sigma_ , _Q =None_, _mu =None_, _max_width =6_, _** kwargs_) ``` -------------------------------- ### Initialize KnockoffFilter Source: https://amspector100.github.io/knockpy/apiref.html Initializes the KnockoffFilter with specified feature statistics and knockoff sampler. Uses LedoitWolf covariance estimation by default. ```python from knockpy.knockoff_filter import KnockoffFilter kfilter = KnockoffFilter( fstat='lcd', ksampler='gaussian', knockoff_kwargs={"method":"mvr"}, ) rejections = kfilter.forward(X=dgprocess.X, y=dgprocess.y) ``` -------------------------------- ### Example Output for MVR Knockoffs Source: https://amspector100.github.io/knockpy/mrcknock.html This is an example of the output generated after running the power and false positive rate calculation for MVR knockoffs. ```text MVR knockoffs made 55.0 discoveries. MVR knockoffs had a power of 94.0% and false positive rate of 14.55%. ``` -------------------------------- ### Sample Proposals (Continuous/Discrete) Source: https://amspector100.github.io/knockpy/_modules/knockpy/metro.html Generates a knockoff proposal, either continuous using random normal sampling or discrete by sampling from bucket probabilities. Allows passing pre-computed conditional mean and variance. ```python def sample_proposals( self, X, prev_proposals, cond_mean=None, cond_var=None, ): """ Samples a continuous or discrete proposal given the design matrix and the previous proposals. Can pass in the conditional mean and variance of the new proposals, if cached, to save computation. """ # These will be compatible as long as Sigma is if cond_mean is None or cond_var is None: cond_mean, cond_var = self.fetch_proposal_params( X=X, prev_proposals=prev_proposals ) # Continuous sampling if self.buckets is None: proposal = np.sqrt(cond_var) * np.random.randn(self.n, 1) + cond_mean # Discrete sampling if self.buckets is not None: # Cumulative probability buckets bucket_probs = gaussian_log_likelihood( X=self.buckets, mu=cond_mean, var=cond_var, ) bucket_probs = scipy.special.softmax( bucket_probs.astype(np.float32), axis=-1 ) cuml_bucket_probs = bucket_probs.cumsum(axis=-1) ``` -------------------------------- ### many_ks_tests Source: https://amspector100.github.io/knockpy/apiref.html Gets p values by running ks tests and then does a multiple testing correction. ```APIDOC ## many_ks_tests(sample1s, sample2s) ### Description Gets p values by running ks tests and then does a multiple testing correction. ### Parameters #### Path Parameters - **sample1s** (list of arrays) - The first list of samples. - **sample2s** (list of arrays) - The second list of samples. ``` -------------------------------- ### KnockoffFilter Class Source: https://amspector100.github.io/knockpy/apiref.html The KnockoffFilter class performs knockoff-based inference from start to finish, wrapping both the KnockoffSampler and FeatureStatistic classes. ```APIDOC ## class knockpy.knockoff_filter.KnockoffFilter(fstat='lasso', ksampler='gaussian', fstat_kwargs={}, knockoff_kwargs={}) Performs knockoff-based inference, from start to finish. This wraps both the `knockoffs.KnockoffSampler` and `knockoff_stats.FeatureStatistic` classes. ### Parameters * **fstat** (str or knockpy.knockoff_stats.FeatureStatistic) - The feature statistic to use in the knockoff filter. Defaults to a lasso statistic. This may also be a string identifier, including: * ‘lasso’ or ‘lcd’: cross-validated lasso coefficients differences * ‘mlr’: the masked likelihood ratio (MLR) statistic, which has provable optimality properties for knockoffs * ‘lsm’: signed maximum of the lasso path statistic as in Barber and Candes 2015 * ‘dlasso’: Cross-validated debiased lasso coefficients * ‘ridge’: Cross validated ridge coefficients * ‘ols’: Ordinary least squares coefficients * ‘margcorr’: marginal correlations between features and response * ‘deeppink’: The deepPINK statistic as in Lu et al. 2018 * ‘mlrspline’ or ‘spline’: The MLR statistic applied to regression spline basis functions. * ‘randomforest’: A random forest with swap importances Note that when using FX knockoffs, the lcd statistic does not use cross-valdilidation and sets `alphas = sigma2 * sqrt(log(p) / n).` * **ksampler** (str or knockpy.knockoffs.KnockoffSampler) - The knockoff sampler to use in the knockoff filter. This may also be a string identifier, including: * ‘gaussian’: Gaussian Model-X knockoffs * ‘fx’: Fixed-X knockoffs. * ‘metro’: Generic metropolized knockoff sampler * ‘artk’: t-tailed Markov chain. * ‘blockt’: Blocks of t-distributed * ‘gibbs_grid’: Discrete gibbs grid An alternative to specifying the ksampler is to simply pass in a knockoff matrix during the `forward` call. * **fstat_kwargs** (dict) - Kwargs to pass to the feature statistic `fit` function, excluding the required arguments, defaults to {} * **knockoff_kwargs** (dict) - Kwargs for instantiating the knockoff sampler argument if the ksampler argument is a string identifier. This can be the empt dict for some identifiers such as “gaussian” or “fx”, but additional keyword arguments are required for complex samplers such as the “metro” identifier. Defaults to {} ### Example Here we fit the KnockoffFilter on fake data from a Gaussian linear model: ```python # Fake data-generating process for Gaussian linear model import knockpy as kpy dgprocess = kpy.dgp.DGP() dgprocess.sample_data(n=500, p=500, sparsity=0.1) # LCD statistic with Gaussian MX knockoffs ``` ``` -------------------------------- ### Initialize KnockoffFilter with LCD and Gaussian Knockoffs Source: https://amspector100.github.io/knockpy/_modules/knockpy/knockoff_filter.html Instantiate the KnockoffFilter using the LCD statistic and Gaussian knockoffs. This configuration is suitable for Gaussian linear models and utilizes LedoitWolf covariance estimation by default. Additional keyword arguments can be passed to the knockoff sampler. ```python from knockpy.knockoff_filter import KnockoffFilter kfilter = KnockoffFilter( fstat='lcd', ksampler='gaussian', knockoff_kwargs={"method":"mvr"}, ) ``` -------------------------------- ### Get Active Coordinate Helper Source: https://amspector100.github.io/knockpy/_modules/knockpy/metro.html A helper function for divide-and-conquer in Ising models, used to extract the active coordinate from a variable index. ```python def _get_ac(self, i, div_type): """ Helper function for divide-and-conquer in Ising model. Extracts active coordinate from variable i. Returns ac, lc, wc. """ lc, wc = self.num2coords(i) if div_type == "row": ``` -------------------------------- ### Convert Graph to Cliques Source: https://amspector100.github.io/knockpy/_modules/knockpy/dgp.html Transforms a connection graph into a dictionary of cliques, represented as binary connections. Used for Gibbs grid setup. ```python def graph2cliques(Q): """ Turns graph Q of connections into binary cliques for Gibbs grid """ p = Q.shape[0] clique_dict = {i: [] for i in range(p)} for i in range(p): for j in range(i): if Q[i, j] != 0: clique_dict[i].append((i, j)) clique_dict[j].append((j, i)) # Remove duplicates for i in range(p): clique_dict[i] = list(set(clique_dict[i])) return clique_dict ``` -------------------------------- ### Initialize GibbsGridSampler Source: https://amspector100.github.io/knockpy/_modules/knockpy/metro.html Initializes a GibbsGridSampler for discrete Gibbs grids using a divide-and-conquer approach combined with metropolized knockoff sampling. It infers parameters from the input data and graph, and sets up internal structures for clique learning and log-potentials. ```python def __init__(self, X, gibbs_graph, Sigma, Q=None, mu=None, max_width=6, **kwargs): # Infer bucketization and V V = Sigma self.n = X.shape[0] self.p = X.shape[1] self.X = X self.V = V self.S = None self.gibbs_graph = gibbs_graph self.gridwidth = int(np.sqrt(self.p)) if self.gridwidth**2 != self.p: raise ValueError(f"p {self.p} must be a square number for gibbs grid") self.buckets = np.unique(X) if mu is None: mu = X.mean(axis=0) # Div and conquer makes this nontrivial # (Note to self: maybe TODO?) if "S" in kwargs: kwargs.pop("S") if "invSigma" in kwargs: kwargs.pop("invSigma") # Dummy order / inv_order variables for consistency self.order = np.arange(self.p) self.inv_order = np.arange(self.p) def make_ising_logpotential(temp, i, j): def lp(X): return -1 * temp * np.abs(X[:, 0] - X[:, 1]) return lp # Learn cliques, log-potentials self.cliques = [] self.log_potentials = [] for i in range(self.p): for j in range(i): ``` -------------------------------- ### Preprocess Groups for Knockpy Source: https://amspector100.github.io/knockpy/_modules/knockpy/utilities.html Maps unique elements in a group array to consecutive integers starting from 1. This is often a prerequisite for other group-based operations. ```python import numpy as np def preprocess_groups(groups): """ Maps the m unique elements of a 1D "groups" array to the integers from 1 to m. """ unique_vals = np.sort(np.unique(groups)) conversion = {unique_vals[i]: i for i in range(unique_vals.shape[0])} return np.array([conversion[x] + 1 for x in groups]) ``` -------------------------------- ### Compute q1 and q2 Source: https://amspector100.github.io/knockpy/apiref.html Computes q1 and q2 as specified by page 33 of the paper. No specific setup is mentioned, but it relates to the metropolized sampler. ```python log_q12(_x_flags_ , _j_) ``` -------------------------------- ### DeepPinkModel Initialization Source: https://amspector100.github.io/knockpy/_modules/knockpy/kpytorch/deeppink.html Initializes the DeepPinkModel with specified dimensions and MLP architecture. Supports optional normalization of the initial sparse layer weights. ```python class DeepPinkModel(nn.Module): def __init__(self, p, hidden_sizes=[64], y_dist="gaussian", normalize_Z=True): """ Adapted from https://arxiv.org/pdf/1809.01185.pdf. The module has two components: 1. A sparse linear layer with dimension 2*p to p. However, there are only 2*p weights (each feature and knockoff points only to their own unique node). This is (maybe?) followed by a ReLU activation. 2. A multilayer perceptron (MLP) Parameters ---------- p : int The dimensionality of the data hidden_sizes: list A list of hidden sizes for the mlp layer(s). Defaults to [64]. normalize_Z : bool If True, the first sparse linear layer is normalized so the weights for each feature/knockoff pair have an l1 norm of 1. This can modestly improve power in some settings. """ super().__init__() # Initialize weight for first layer self.p = p self.y_dist = y_dist self.Z_weight = nn.Parameter(torch.ones(2 * p)) self.norm_Z_weight = normalize_Z # Save indices/reverse indices to prevent violations of FDR control self.inds, self.rev_inds = utilities.random_permutation_inds(2 * p) self.feature_inds = self.rev_inds[0 : self.p] self.ko_inds = self.rev_inds[self.p :] # Create MLP layers mlp_layers = [nn.Linear(p, hidden_sizes[0])] for i in range(len(hidden_sizes) - 1): mlp_layers.append(nn.ReLU()) mlp_layers.append(nn.Linear(hidden_sizes[i], hidden_sizes[i + 1])) # Prepare for either MSE loss or cross entropy loss mlp_layers.append(nn.ReLU()) if y_dist == "gaussian": mlp_layers.append(nn.Linear(hidden_sizes[-1], 1)) else: mlp_layers.append(nn.Linear(hidden_sizes[-1], 2)) # Then create MLP self.mlp = nn.Sequential(*mlp_layers) ``` -------------------------------- ### Check pyglmnet Availability Source: https://amspector100.github.io/knockpy/_modules/knockpy/utilities.html Checks for the availability of the pyglmnet library, essential for certain functionalities. Provides an informative error message and installation link if the library is missing. ```python def check_pyglmnet_available(purpose): try: import pyglmnet as pyglmnet except ImportError as err: raise ValueError( f"pyglmnet is required for {purpose}, but importing pyglmnet raised {err}. See https://github.com/glm-tools/pyglmnet/." ) ``` -------------------------------- ### Initialize KnockoffFilter with MetropolizedKnockoffSampler Source: https://amspector100.github.io/knockpy/usage.html Shows how to initialize a KnockoffFilter using a pre-configured MetropolizedKnockoffSampler instance. This allows using custom sampling methods within the filter. ```python kfilter_metro = KnockoffFilter(ksampler=metrosampler, fstat="ridge") ``` -------------------------------- ### SDP Solver Implementation in KnockPy Source: https://amspector100.github.io/knockpy/_modules/knockpy/mac.html This code block outlines the process of setting up and solving a semidefinite programming problem. It includes sorting the covariance matrix, creating block variables for the S matrix, defining constraints, and constructing the optimization objective based on different norms. The problem is then solved, and the resulting S matrix is unsorted and processed. ```python elif dsdp_warning: warnings.warn(constants.DSDP_WARNING) # Sort the covariance matrix according to the groups inds, inv_inds = utilities.permute_matrix_by_groups(groups) sortedSigma = Sigma[inds][:, inds] # Create blocks of semidefinite matrix S, # as well as the whole matrix S variables = [] constraints = [] S_rows = [] shift = 0 for j in range(m): # Create block variable gj = int(group_sizes[j]) Sj = cp.Variable((gj, gj), symmetric=True) constraints += [Sj >> 0] variables.append(Sj) # Create row of S if shift == 0 and shift + gj < p: rowj = cp.hstack([Sj, cp.Constant(np.zeros((gj, p - gj)))]) elif shift + gj < p: rowj = cp.hstack( [ cp.Constant(np.zeros((gj, shift))), Sj, cp.Constant(np.zeros((gj, p - gj - shift))), ] ) elif shift + gj == p and shift > 0: rowj = cp.hstack([cp.Constant(np.zeros((gj, shift))), Sj]) elif gj == p and shift == 0: rowj = cp.hstack([Sj]) else: raise ValueError( f"shift ({shift}) and gj ({gj}) add up to more than p ({p})" ) S_rows.append(rowj) # Incremenet shift shift += gj # Construct S and Grahm Matrix S = cp.atoms.affine.wraps.psd_wrap( cp.vstack(S_rows) ) # does this improve performance? sortedSigma = cp.Constant(sortedSigma) constraints += [2 * sortedSigma - S >> 0] # Construct optimization objective if objective == "abs": objective = cp.Minimize(cp.sum(cp.abs(sortedSigma - S))) elif objective == "pnorm": objective = cp.Minimize(cp.pnorm(sortedSigma - S, norm_type)) elif objective == "norm": objective = cp.Minimize(cp.norm(sortedSigma - S, norm_type)) # Note we already checked objective is one of these values earlier # Construct, solve the problem. problem = cp.Problem(objective, constraints) problem.solve(verbose=verbose, **kwargs) if verbose: print("Finished solving SDP!") # Unsort and get numpy S = S.value if S is None: raise ValueError( "SDP formulation is infeasible. Try decreasing the tol parameter." ) S = S[inv_inds][:, inv_inds] # Clip 0 and 1 values for i in range(p): S[i, i] = max(tol, min(1 - tol, S[i, i])) # Scale to make this PSD using binary search S, gamma = utilities.scale_until_PSD(Sigma, S, tol, num_iter) # Return unsorted S value return S ``` -------------------------------- ### Get Variable Ordering from Junction Tree Source: https://amspector100.github.io/knockpy/_modules/knockpy/metro.html Extracts a variable ordering and active frontier from a junction tree. This is used for the metro knockoff sampler and is adapted from external research. ```python def get_ordering(T): """ Takes a junction tree and returns a variable ordering for the metro knockoff sampler. The code from this function is adapted from the code distributed with https://arxiv.org/abs/1903.00434. Parameters ---------- T : A networkx graph that is a junction tree. Nodes must be sets with elements 0,...,p-1. Returns ------- order : a numpy array with unique elements 0,...,p-1 active_frontier : list of lists a list of length p gwhere entry j is the set of entries > j that are in V_j. This specifies the conditional independence structure of a joint covariate distribution. See page 34 of https://arxiv.org/abs/1903.00434. """ # Initialize T = T.copy() order = [] active_frontier = [] while T.number_of_nodes() > 0: # Loop through leaf nodes gen = (x for x in T.nodes() if T.degree(x) <= 1) active_node = next(gen) # Parent nodes of leaf nodes # active_vars get added to the order in this step # activated set are just the variables in the active node parents = list(T[active_node].keys()) if len(parents) == 0: active_vars = set(active_node) activated_set = active_vars.copy() else: active_vars = set(active_node.difference(parents[0])) activated_set = active_vars.union(parents[0]).difference(set(order)) for i in list(active_vars)[::-1]: # This line was changed too order += [i] frontier = list(activated_set.difference(set(order))) active_frontier += [frontier] T.remove_node(active_node) return [np.array(order), active_frontier] ``` -------------------------------- ### Initialize KnockoffFilter with Masked Likelihood Ratio (MLR) Statistics Source: https://amspector100.github.io/knockpy/usage.html Shows how to initialize a KnockoffFilter using MLR statistics. Use 'mlr' for Gaussian linear or binary regression models, and 'mlr_spline' for generalized additive models based on regression splines. ```python # default MLR statistics; these work for both continuous and binary responses kfilter_mlr = KnockoffFilter(ksampler="gaussian", fstat="mlr") # masked likelihood ratio statistics based on regression splines kfilter_mlr_splines = KnockoffFilter(ksampler="gaussian", fstat="mlr_spline") ``` -------------------------------- ### Instantiate KnockoffFilter with LCD and Gaussian MX Source: https://amspector100.github.io/knockpy/apiref.html Instantiates the KnockoffFilter using the LCD statistic and Gaussian Model-X knockoffs. Requires prior data generation. ```python import knockpy as kpy dgprocess = kpy.dgp.DGP() dgprocess.sample_data(n=500, p=500, sparsity=0.1) ``` -------------------------------- ### Gaussian Graphical Model Edge Detection Source: https://amspector100.github.io/knockpy/_sources/usage.ipynb.txt Utilize KnockoffGGM for detecting edges in Gaussian Graphical models using LCD statistics with FX knockoffs. This example demonstrates a scenario with no edges under the global null. ```python # Fake data-generating process for Gaussian graphical model # under global null with no edges import numpy as np from knockpy.ggm import KnockoffGGM X = np.random.randn(300, 30) # LCD statistic with FX knockoffs gkf = KnockoffGGM( fstat="lcd", knockoff_kwargs={"method": "mvr"}, ) edges = gkf.forward(X=X, verbose=False) np.any(edges) ``` -------------------------------- ### Get Variable Ordering Source: https://amspector100.github.io/knockpy/apiref.html Takes a junction tree and returns a variable ordering for the metro knockoff sampler. Code is adapted from a provided arXiv paper. Nodes must be sets with elements 0 to p-1. ```python knockpy.metro.get_ordering(_T_) ``` -------------------------------- ### Calculate DeepPINK Feature Statistics (W) Source: https://amspector100.github.io/knockpy/_modules/knockpy/knockoff_stats.html Computes feature statistics (W) using the DeepPINK model. Supports different feature importance calculation methods and group aggregation strategies. Ensure kpytorch is installed. ```python utilities.check_kpytorch_available(purpose="deepPINK statistics") from .kpytorch import deeppink # Bind data n = X.shape[0] p = X.shape[1] features = np.concatenate([X, Xk], axis=1) # The deeppink model class will shuffle statistics, # but for compatability we create indices anyway self.inds = np.arange(2 * p) self.rev_inds = self.inds # By default, all variables are their own group if groups is None: groups = np.arange(1, p + 1) self.groups = groups # Parse y_dist, hidden_sizes, initialize model parse_y_dist(y) if "hidden_sizes" not in kwargs: kwargs["hidden_sizes"] = [min(n, p)] self.model = deeppink.DeepPinkModel(p=p, **kwargs) # Train model self.model.train() self.model = deeppink.train_deeppink(self.model, features, y, **train_kwargs) self.model.eval() # Get Z statistics feature_importance = str(feature_importance).lower() if feature_importance == "deeppink": self.Z = self.model.feature_importances(weight_scores=True) elif feature_importance == "unweighted": self.Z = self.model.feature_importances(weight_scores=False) elif feature_importance == "swap": self.Z = self.swap_feature_importances(features, y) elif feature_importance == "swapint": self.Z = self.swap_path_feature_importances(features, y) else: raise ValueError( f"feature_importance {feature_importance} must be one of 'deeppink', 'unweighted', 'swap', 'swapint'" ) # Get W statistics self.W = combine_Z_stats( self.Z, self.groups, antisym=antisym, group_agg=group_agg ) # Possibly score model self.cv_score_model(features=features, y=y, cv_score=cv_score) return self.W ``` -------------------------------- ### Run KnockoffFilter Forward Pass with Provided Knockoffs Source: https://amspector100.github.io/knockpy/usage.html Demonstrates how to manually pass generated knockoffs (Xk) to the KnockoffFilter's forward method for feature selection. This is an alternative to letting the sampler generate them internally. ```python out = kfilter_metro.forward(X=X_metro, y=y_metro, Xk=Xk) ``` -------------------------------- ### Sample Knockoffs with Divide-and-Conquer Source: https://amspector100.github.io/knockpy/_modules/knockpy/metro.html Samples knockoffs using a divide-and-conquer strategy. Initializes knockoff and acceptance matrices, then iterates through different variable separation methods. For each method, it processes blocks of variables, samples knockoffs using specialized samplers, and updates the main knockoff matrix. Finally, it checks the validity of the generated knockoffs. ```python def sample_knockoffs(self, **kwargs): """ Samples knockoffs using divide-and-conquer approach. Parameters ---------- kwargs : dict Keyword args for ``MetropolizedKnockoffSampler.sample_knockoffs``. Returns ------- Xk : np.ndarray A ``(n, p)``-shaped knockoff matrix in the original order the variables were passed in. """ self.Xk = np.zeros((self.n, self.p)) self.Xk[:] = np.nan self.final_acc_probs = self.Xk.copy() self.acceptances = self.Xk.copy() # Loop through different ways of separating variables for key in self.dc_keys: # N inds for this particular method of separation n_inds = np.array(self.X_ninds[key]) # Initialize output Xkblock = np.zeros((len(n_inds), self.p)) Xkblock[:] = np.nan accblock = Xkblock.copy() probblock = Xkblock.copy() # Set separating knockoffs = to their features sep_inds = self.separators[key] Xkblock[:, sep_inds] = self.X[n_inds][:, sep_inds] accblock[:, sep_inds] = 0 probblock[:, sep_inds] = 0 # Loop through blocks for n_inds, p_inds, sampler in self.samplers[key]: # Sample knockoffs Xkblock[:, p_inds] = sampler.sample_knockoffs(**kwargs) # Save final_acc_probs, acceptances accblock[:, p_inds] = sampler.final_acc_probs[:, sampler.inv_order] probblock[:, p_inds] = sampler.acceptances[:, sampler.inv_order] # Set Xk value for this block self.Xk[n_inds] = Xkblock self.acceptances[n_inds] = accblock self.final_acc_probs[n_inds] = probblock # Test validity self.check_xk_validity(self.X, self.Xk, testname="IsingSampler", alpha=1e-5) return self.Xk ``` -------------------------------- ### KnockoffFilter Class Initialization Source: https://amspector100.github.io/knockpy/apiref.html Instantiate the KnockoffFilter with specified feature statistics and knockoff sampler options. ```APIDOC ## KnockoffFilter ### Description Initializes the KnockoffFilter with specified feature statistics and knockoff sampler options. It can be configured with various statistics and sampling methods. ### Parameters - **fstat** (FeatureStatistic) - The feature statistics to use for inference. Defaults to LedoitWolf covariance estimation. - **ksampler** (KnockoffSampler) - The knockoff sampler to use during inference. - **fstat_kwargs** (dict) - Dictionary of kwargs to pass to the `fit` call of `self.fstat`. - **knockoff_kwargs** (dict) - If `ksampler` is not yet initialized, kwargs to pass to `ksampler`. ### Example ```python from knockpy.knockoff_filter import KnockoffFilter kfilter = KnockoffFilter( fstat='lcd', ksampler='gaussian', knockoff_kwargs={"method":"mvr"}, ) ``` ``` -------------------------------- ### Generate Wishart Correlation Matrix Source: https://amspector100.github.io/knockpy/_modules/knockpy/dgp.html Generates a correlation matrix using the Wishart distribution. It starts with a random matrix W, computes V = W^T W, converts it to a correlation matrix, and ensures it's positive semi-definite. ```python from . import utilities from .utilities import cov2corr, shift_until_PSD DEFAULT_DF_T = 3 [docs] def Wishart(d=100, p=100, tol=1e-2): """ Let W be a random `d` x `p` matrix with i.i.d. Gaussian entries. Then Sigma = ``cov2corr``(W^T W). """ W = np.random.randn(d, p) V = np.dot(W.T, W) V = cov2corr(V) return cov2corr(shift_until_PSD(V, tol=tol)) ``` -------------------------------- ### Create Synthetic Dataset for Knockpy Source: https://amspector100.github.io/knockpy/usage.html Generates a synthetic dataset with a specified number of data points and features, including a random covariance matrix and sparse coefficients for the outcome variable. This setup is typical for testing variable selection methods. ```python import numpy as np import knockpy import warnings np.random.seed(123) n = 300 # number of data points p = 500 # number of features Sigma = knockpy.dgp.AR1(p=p, rho=0.5) # Stationary AR1 process with correlation 0.5 # Sample X X = np.random.multivariate_normal(mean=np.zeros(p), cov=Sigma, size=(n,)) # Create random sparse coefficients beta = knockpy.dgp.create_sparse_coefficients(p=p, sparsity=0.1) y = np.dot(X, beta) + np.random.randn(n) ``` -------------------------------- ### Initialize and Cache Optimal S in PSGDSolver Source: https://amspector100.github.io/knockpy/_modules/knockpy/kpytorch/mrcgrad.html Initializes the loss calculation and caches the optimal solution S. This is typically done once during solver initialization. ```python # Initialize cache of optimal S with torch.no_grad(): init_loss = self.losscalc(smoothing=0) if init_loss < 0: init_loss = np.inf self.cache_S(init_loss) # Initialize attributes which save losses over time self.all_losses = [] self.projected_losses = [] self.improvements = [] def cache_S(self, new_loss): # Cache optimal solution with torch.no_grad(): self.prev_opt_S = self.opt_S self.prev_opt_loss = self.opt_loss self.opt_loss = new_loss self.opt_S = self.losscalc.pull_S().clone().detach().numpy() ``` -------------------------------- ### FXSampler Constructor Source: https://amspector100.github.io/knockpy/api/knockpy.knockoffs.FXSampler.html Initializes the FXSampler with input data and optional parameters. See GaussianSampler documentation for argument descriptions. ```APIDOC class FXSampler(_X_, _groups=None_, _sample_tol=1e-05_, _S=None_, _method=None_, _verbose=False_, _**kwargs_) ```