### Setup for Bias-Variance Decomposition Example

Source: https://scikit-learn.org/dev/auto_examples/ensemble/plot_bias_variance.html

Imports the required libraries and sets the parameters for simulating the regression problems used to analyze the bias-variance decomposition.

```python
# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause

import matplotlib.pyplot as plt
import numpy as np

from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

# Settings
n_repeat = 50  # Number of iterations for computing expectations
n_train = 50  # Size of the training set
n_test = 1000  # Size of the test set
noise = 0.1  # Standard deviation of the noise
np.random.seed(0)

# Change this for exploring the bias-variance decomposition of other
# estimators. This should work well for estimators with high variance (e.g.,
# decision trees or KNN), but poorly for estimators with low variance (e.g.,
# linear models).
estimators = [
    ("Tree", DecisionTreeRegressor()),
    ("Bagging(Tree)", BaggingRegressor(DecisionTreeRegressor())),
]

n_estimators = len(estimators)
```

--------------------------------

### Setup and Helper Functions for Discretization Example

Source: https://scikit-learn.org/dev/_downloads/aa8e07ce1b796a15ada1d9f0edce48b5/plot_discretization_classification.ipynb

Imports the required libraries, sets the mesh step size, and defines the `get_name` helper used to label classifiers. It then builds the list of (estimator, parameter grid) pairs, the toy datasets, and the figure used by the example.

```python
# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause

import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import ListedColormap

from sklearn.datasets import make_circles, make_classification, make_moons
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer, StandardScaler
from sklearn.svm import SVC, LinearSVC
from sklearn.utils._testing import ignore_warnings

h = 0.02  # step size in the mesh


def get_name(estimator):
    name = estimator.__class__.__name__
    if name == "Pipeline":
        name = [get_name(est[1]) for est in estimator.steps]
        name = " + ".join(name)
    return name


# list of (estimator, param_grid), where param_grid is used in GridSearchCV
# The parameter spaces in this example are limited to a narrow band to reduce
# its runtime. In a real use case, a broader search space for the algorithms
# should be used.
classifiers = [
    (
        make_pipeline(StandardScaler(), LogisticRegression(random_state=0)),
        {"logisticregression__C": np.logspace(-1, 1, 3)},
    ),
    (
        make_pipeline(StandardScaler(), LinearSVC(random_state=0)),
        {"linearsvc__C": np.logspace(-1, 1, 3)},
    ),
    (
        make_pipeline(
            StandardScaler(),
            KBinsDiscretizer(
                encode="onehot", quantile_method="averaged_inverted_cdf", random_state=0
            ),
            LogisticRegression(random_state=0),
        ),
        {
            "kbinsdiscretizer__n_bins": np.arange(5, 8),
            "logisticregression__C": np.logspace(-1, 1, 3),
        },
    ),
    (
        make_pipeline(
            StandardScaler(),
            KBinsDiscretizer(
                encode="onehot", quantile_method="averaged_inverted_cdf", random_state=0
            ),
            LinearSVC(random_state=0),
        ),
        {
            "kbinsdiscretizer__n_bins": np.arange(5, 8),
            "linearsvc__C": np.logspace(-1, 1, 3),
        },
    ),
    (
        make_pipeline(
            StandardScaler(), GradientBoostingClassifier(n_estimators=5, random_state=0)
        ),
        {"gradientboostingclassifier__learning_rate": np.logspace(-2, 0, 5)},
    ),
    (
        make_pipeline(StandardScaler(), SVC(random_state=0)),
        {"svc__C": np.logspace(-1, 1, 3)},
    ),
]

names = [get_name(e).replace("StandardScaler + ", "") for e, _ in classifiers]

n_samples = 100
datasets = [
    make_moons(n_samples=n_samples, noise=0.2, random_state=0),
    make_circles(n_samples=n_samples, noise=0.2, factor=0.5, random_state=1),
    make_classification(
        n_samples=n_samples,
        n_features=2,
        n_redundant=0,
        n_informative=2,
        random_state=2,
        n_clusters_per_class=1,
    ),
]

fig, axes = plt.subplots(
    nrows=len(datasets), ncols=len(classifiers) + 1, figsize=(21, 9)
)

cm_piyg = plt.cm.PiYG
cm_bright = ListedColormap(["#b30065", "#178000"])

# iter
```

--------------------------------

### Setup for Theil-Sen Regression Example

Source: https://scikit-learn.org/dev/auto_examples/linear_model/plot_theilsen.html

Imports the required libraries and defines the OLS, Theil-Sen, and RANSAC estimators, along with the colors and line width used to plot them.

```python
# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause

import time

import matplotlib.pyplot as plt
import numpy as np

from sklearn.linear_model import LinearRegression, RANSACRegressor, TheilSenRegressor

estimators = [
    ("OLS", LinearRegression()),
    ("Theil-Sen", TheilSenRegressor(random_state=42)),
    ("RANSAC", RANSACRegressor(random_state=42)),
]
colors = {"OLS": "turquoise", "Theil-Sen": "gold", "RANSAC": "lightgreen"}
lw = 2
```

--------------------------------

### Fitting and Calibration Example

Source: https://scikit-learn.org/dev/auto_examples/calibration/plot_calibration_multiclass.html

This snippet fits a multiclass classifier and calibrates it in two ways: from an already fitted model, using a held-out calibration set, and with cross-validation. It defines the base estimator and the calibration strategy.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.frozen import FrozenEstimator  # requires scikit-learn >= 1.6
from sklearn.model_selection import train_test_split

# Three classes, so that the problem is actually multiclass
X, y = make_classification(
    n_samples=1000, n_features=4, n_informative=2, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Hold out part of the training data to calibrate the already fitted model
X_fit, X_calib, y_fit, y_calib = train_test_split(
    X_train, y_train, test_size=0.25, random_state=42
)

# Base estimator: it must be fitted before it can be calibrated as a prefit model
base_estimator = RandomForestClassifier(n_estimators=25, random_state=42)
base_estimator.fit(X_fit, y_fit)

# Calibrated classifier with isotonic calibration.
# FrozenEstimator replaces the deprecated cv="prefit" option.
calibrated_isotonic = CalibratedClassifierCV(FrozenEstimator(base_estimator), method="isotonic")
calibrated_isotonic.fit(X_calib, y_calib)

# Calibrated classifier with sigmoid calibration
calibrated_sigmoid = CalibratedClassifierCV(FrozenEstimator(base_estimator), method="sigmoid")
calibrated_sigmoid.fit(X_calib, y_calib)

# Using cross-validation for calibration: the estimator is cloned and refitted on each fold
calibrated_cv = CalibratedClassifierCV(base_estimator, method="isotonic", cv=5)
calibrated_cv.fit(X_train, y_train)

# Accessing calibrated classifiers
print(f"Isotonic calibrated classifiers (prefit): {calibrated_isotonic.calibrated_classifiers_}")
print(f"Sigmoid calibrated classifiers (prefit): {calibrated_sigmoid.calibrated_classifiers_}")
print(f"Cross-validated calibrated classifiers: {calibrated_cv.calibrated_classifiers_}")

# Making predictions
prob_isotonic = calibrated_isotonic.predict_proba(X_test)
prob_sigmoid = calibrated_sigmoid.predict_proba(X_test)
prob_cv = calibrated_cv.predict_proba(X_test)

print(f"\nSample predicted probabilities (Isotonic, prefit): {prob_isotonic[:5]}")
print(f"Sample predicted probabilities (Sigmoid, prefit): {prob_sigmoid[:5]}")
print(f"Sample predicted probabilities (Cross-validated): {prob_cv[:5]}")
```

--------------------------------

### Minimal Custom Callback Implementation

Source: https://scikit-learn.org/dev/developers/developing_callbacks.html

This example demonstrates a basic custom callback class with setup and teardown methods and with hooks for the beginning and end of a fit task. Each hook prints a message naming the task; the begin hook reports the shape of the training data and the end hook reports the mean squared error of the fitted estimator.

```python
class MyCallback:
    def setup(self, estimator, context):
        print(f"Setup hook is being called in the {context.task_name} task.")

    def teardown(self, estimator, context):
        print(f"Teardown hook is being called in the {context.task_name} task.")

    def on_fit_task_begin(self, estimator, context, *, X=None):
        msg = f"{context.task_name} task is starting."
        if X is not None:
            msg += f" With training data of shape {X.shape}."
        print(msg)

    def on_fit_task_end(
        self, estimator, context, *, X=None, y=None, fitted_estimator=None
    ):
        msg = f"{context.task_name} task is ending."
        mean_squared_error = ((y - fitted_estimator.predict(X))**2).mean()
        msg += f" With a mean squared error of {mean_squared_error}."
        print(msg)
```

--------------------------------

### Build Documentation with Example Gallery

Source: https://scikit-learn.org/dev/developers/contributing.html

Generates the full documentation, including the example gallery, by running all examples. This process can take a significant amount of time.

```bash
make html
```

--------------------------------

### Basic RidgeClassifier Usage

Source: https://scikit-learn.org/dev/modules/generated/sklearn.linear_model.RidgeClassifier.html

Demonstrates how to instantiate, fit, and score a RidgeClassifier on the breast cancer dataset. The score is computed on the training data itself.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import RidgeClassifier

X, y = load_breast_cancer(return_X_y=True)
clf = RidgeClassifier().fit(X, y)
clf.score(X, y)
```

--------------------------------

### FastICA with Default Parameters

Source: https://scikit-learn.org/dev/modules/generated/fastica-function.html

Performs Fast Independent Component Analysis with the `FastICA` estimator, leaving all parameters at their defaults except `n_components` and `random_state`.

```python
import numpy as np

from sklearn.decomposition import FastICA

X = np.array([[1, 1], [2, 2], [3, 3], [4, 4]])

# Perform FastICA
fastica = FastICA(n_components=2, random_state=0)
s = fastica.fit_transform(X)

# Reconstruct signals by mapping the estimated sources back to the input space
X_restored = fastica.inverse_transform(s)
```

--------------------------------

### Example warning message for new metadata consumption

Source: https://scikit-learn.org/dev/auto_examples/miscellaneous/plot_metadata_routing.html

This is the warning emitted when an estimator such as `WeightedMetaRegressor` starts consuming metadata (`sample_weight`) that it did not consume before. It tells the user how to set the request explicitly.

```text
Received sample_weight of length = 100 in WeightedMetaRegressor.
Support for sample_weight has recently been added to WeightedMetaRegressor(estimator=LinearRegression()) class. To maintain backward compatibility, it is ignored now. Using `set_fit_request(sample_weight={True, False})` on this method of the class, you can set the request value to False to silence this warning, or to True to consume and use the metadata.
```

--------------------------------

### Build Documentation (Basic)

Source: https://scikit-learn.org/dev/developers/contributing.html

Generates the main web documentation without the example gallery. The output is placed in the '_build/html/stable' directory.

```bash
make
```

--------------------------------

### Prepare ARM64 Development Environment

Source: https://scikit-learn.org/dev/developers/tips.html

Downloads the Miniforge installer and clones the scikit-learn repository into a dedicated folder for ARM64 development.

```bash
mkdir arm64
pushd arm64
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-aarch64.sh
git clone https://github.com/scikit-learn/scikit-learn.git
```

--------------------------------

### GaussianMixture.get_metadata_routing

Source: https://scikit-learn.org/dev/modules/generated/sklearn.mixture.GaussianMixture.html

Get metadata routing of this object. Please check User Guide on how the routing mechanism works.

```APIDOC
## get_metadata_routing()

### Description
Get metadata routing of this object.

### Returns
- **routing** (MetadataRequest) - A `MetadataRequest` encapsulating routing information.
```

--------------------------------

### Perceptron.get_metadata_routing

Source: https://scikit-learn.org/dev/modules/generated/sklearn.linear_model.Perceptron.html

Get metadata routing of this object. Please check User Guide on how the routing mechanism works.

```APIDOC
## Perceptron.get_metadata_routing

### Description
Get metadata routing of this object. Please check User Guide on how the routing mechanism works.

### Parameters
None

### Returns
- **routing** (MetadataRequest) - A `MetadataRequest` encapsulating routing information.
```

--------------------------------

### Creating and using a Product kernel

Source: https://scikit-learn.org/dev/modules/generated/sklearn.gaussian_process.kernels.Product.html

Demonstrates how to create a Product kernel by combining ConstantKernel and RBF, then use it with GaussianProcessRegressor for fitting and scoring. The example also shows the resulting kernel representation.

```python
>>> from sklearn.datasets import make_friedman2
>>> from sklearn.gaussian_process import GaussianProcessRegressor
>>> from sklearn.gaussian_process.kernels import (
...     RBF, Product, ConstantKernel)
>>> X, y = make_friedman2(n_samples=500, noise=0, random_state=0)
>>> kernel = Product(ConstantKernel(2), RBF())
>>> gpr = GaussianProcessRegressor(kernel=kernel,
...         random_state=0).fit(X, y)
>>> gpr.score(X, y)
1.0
>>> kernel
1.41**2 * RBF(length_scale=1)
```

--------------------------------

### get_metadata_routing

Source: https://scikit-learn.org/dev/modules/generated/sklearn.base.BaseEstimator.html

Get metadata routing of this object. Please check User Guide on how the routing mechanism works.

```APIDOC
## get_metadata_routing

### Description
Get metadata routing of this object. Please check User Guide on how the routing mechanism works.

### Returns
- **routing** (MetadataRequest) - A `MetadataRequest` encapsulating routing information.
```

--------------------------------

### Pipeline Initialization with Metadata Requests

Source: https://scikit-learn.org/dev/auto_examples/miscellaneous/plot_metadata_routing.html

Demonstrates how to instantiate a `SimplePipeline` with an `ExampleTransformer` and a `RouterConsumerClassifier`, setting specific metadata requests for each component. The example shows how to request metadata such as `sample_weight` and `groups` for individual methods.

```python
from sklearn import set_config
from sklearn.base import BaseEstimator, clone
from sklearn.utils.metadata_routing import MetadataRouter, MethodMapping, process_routing

# Metadata routing is opt-in; the set_*_request methods raise unless it is enabled.
set_config(enable_metadata_routing=True)


# check_metadata is a small helper from the source example, not a scikit-learn import.
def check_metadata(obj, **kwargs):
    for key, value in kwargs.items():
        if value is not None:
            print(f"Received {key} of length = {len(value)} in {obj.__class__.__name__}.")
        else:
            print(f"{key} is None in {obj.__class__.__name__}.")


# ExampleClassifier and RouterConsumerClassifier are defined in the source example.
# The minimal versions below are simplified stand-ins for demonstration purposes.
class ExampleClassifier(BaseEstimator):
    def fit(self, X, y, sample_weight=None):
        check_metadata(self, sample_weight=sample_weight)
        return self

    def predict(self, X, groups=None):
        check_metadata(self, groups=groups)
        return X


class RouterConsumerClassifier(BaseEstimator):
    def __init__(self, estimator):
        self.estimator = estimator

    def get_metadata_routing(self):
        router = (
            MetadataRouter(owner=self)
            # the meta-estimator consumes metadata itself ...
            .add_self_request(self)
            # ... and also routes metadata to its sub-estimator
            .add(
                estimator=self.estimator,
                method_mapping=MethodMapping()
                .add(caller="fit", callee="fit")
                .add(caller="predict", callee="predict"),
            )
        )
        return router

    # sample_weight is an explicit argument so that set_fit_request is available
    def fit(self, X, y, sample_weight=None, **params):
        check_metadata(self, sample_weight=sample_weight)
        if sample_weight is not None:
            params["sample_weight"] = sample_weight
        routed_params = process_routing(self, "fit", **params)
        self.estimator_ = clone(self.estimator).fit(X, y, **routed_params["estimator"]["fit"])
        return self

    def predict(self, X, **params):
        routed_params = process_routing(self, "predict", **params)
        return self.estimator_.predict(X, **routed_params["estimator"]["predict"])


# The actual pipeline instantiation from the source. SimplePipeline and
# ExampleTransformer are defined in the source example and are not reproduced here.
pipe = SimplePipeline(
    transformer=ExampleTransformer()
    # we set transformer's fit to receive sample_weight
    .set_fit_request(sample_weight=True)
    # we set transformer's transform to receive groups
    .set_transform_request(groups=True),
    classifier=RouterConsumerClassifier(
        estimator=ExampleClassifier()
        # we want this sub-estimator to receive sample_weight in fit
        .set_fit_request(sample_weight=True)
        # but not groups in predict
        .set_predict_request(groups=False),
    )
    # and we want the meta-estimator to receive sample_weight as well
    .set_fit_request(sample_weight=True),
)
```

--------------------------------

### Gradient Boosting Classifier Initialization

Source: https://scikit-learn.org/dev/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html

Initializes a GradientBoostingClassifier with its default hyperparameter values (`n_estimators=100`, `learning_rate=0.1`, `max_depth=3`) written out explicitly and a fixed `random_state` for reproducibility.

```python
from sklearn.ensemble import GradientBoostingClassifier

clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0)
```

--------------------------------

### Navigate to Documentation Directory

Source: https://scikit-learn.org/dev/developers/contributing.html

Change the current directory to the 'doc' folder to begin building the documentation.

```bash
cd doc
```

--------------------------------

### Filter estimators by type

Source: https://scikit-learn.org/dev/modules/generated/sklearn.utils.discovery.all_estimators.html

Retrieves a list of estimators filtered by a specific type. This example shows how to get only classifiers.

```python
from sklearn.utils.discovery import all_estimators

classifiers = all_estimators(type_filter="classifier")
classifiers[:2]
```

--------------------------------

### PLSSVD Example

Source: https://scikit-learn.org/dev/modules/generated/sklearn.cross_decomposition.PLSSVD.html

Demonstrates how to initialize PLSSVD, fit it to sample data, and transform the data. It also shows how to check the shapes of the transformed data.

```python
import numpy as np

from sklearn.cross_decomposition import PLSSVD

X = np.array([[0., 0., 1.], [1., 0., 0.], [2., 2., 2.], [2., 5., 4.]])
y = np.array([[0.1, -0.2], [0.9, 1.1], [6.2, 5.9], [11.9, 12.3]])
pls = PLSSVD(n_components=2).fit(X, y)
X_c, y_c = pls.transform(X, y)
X_c.shape, y_c.shape
# ((4, 2), (4, 2))
```

--------------------------------

### OPTICS get_metadata_routing method

Source: https://scikit-learn.org/dev/modules/generated/sklearn.cluster.OPTICS.html

Get metadata routing of this object. Please check User Guide on how the routing mechanism works.

```APIDOC
## OPTICS.get_metadata_routing

### Description
Get metadata routing of this object. Please check User Guide on how the routing mechanism works.

### Method
`get_metadata_routing()`

### Returns
- **routing** (MetadataRequest) - Metadata routing object.
```

--------------------------------

### set_output

Source: https://scikit-learn.org/dev/modules/generated/sklearn.cluster.Birch.html

Set output container. Refer to the user guide for more details and Introducing the set_output API for an example on how to use the API.

```APIDOC
## set_output

### Description
Set output container. Refer to the user guide for more details and Introducing the set_output API for an example on how to use the API.

### Parameters
- **transform** ({"default", "pandas", "polars"}, default=None) - Configure output of `transform` and `fit_transform`.
  * "default": Default output format of a transformer
  * "pandas": DataFrame output
  * "polars": Polars output
  * `None`: Transform configuration is unchanged

### Returns
- **self** (estimator instance) - Estimator instance.

Added in version 1.4: "polars" option was added.
```

--------------------------------

### Data Preparation and Algorithm Initialization

Source: https://scikit-learn.org/dev/auto_examples/cluster/plot_cluster_comparison.html

This snippet runs the comparison loop over the toy datasets. For each dataset it normalizes the data, estimates the bandwidth for MeanShift, builds the connectivity matrix used by Ward and average linkage, and instantiates the clustering algorithms with dataset-specific parameters. It then fits, times, and plots each algorithm.

```python
import time
import warnings

import matplotlib.pyplot as plt

from sklearn import cluster, mixture
from sklearn.neighbors import kneighbors_graph
from sklearn.preprocessing import StandardScaler

# noisy_circles, noisy_moons, varied, aniso, blobs and no_structure are the
# (X, y) toy datasets generated earlier in the source example.

plt.figure(figsize=(9 * 2 + 3, 13))
plt.subplots_adjust(
    left=0.02, right=0.98, bottom=0.001, top=0.95, wspace=0.05, hspace=0.01
)

plot_num = 1

default_base = {
    "quantile": 0.3,
    "eps": 0.3,
    "damping": 0.9,
    "preference": -200,
    "n_neighbors": 3,
    "n_clusters": 3,
    "min_samples": 7,
    "xi": 0.05,
    "min_cluster_size": 0.1,
    "allow_single_cluster": True,
    "hdbscan_min_cluster_size": 15,
    "hdbscan_min_samples": 3,
    "random_state": 42,
}

datasets = [
    (
        noisy_circles,
        {
            "damping": 0.77,
            "preference": -240,
            "quantile": 0.2,
            "n_clusters": 2,
            "min_samples": 7,
            "xi": 0.08,
        },
    ),
    (
        noisy_moons,
        {
            "damping": 0.75,
            "preference": -220,
            "n_clusters": 2,
            "min_samples": 7,
            "xi": 0.1,
        },
    ),
    (
        varied,
        {
            "eps": 0.18,
            "n_neighbors": 2,
            "min_samples": 7,
            "xi": 0.01,
            "min_cluster_size": 0.2,
        },
    ),
    (
        aniso,
        {
            "eps": 0.15,
            "n_neighbors": 2,
            "min_samples": 7,
            "xi": 0.1,
            "min_cluster_size": 0.2,
        },
    ),
    (blobs, {"min_samples": 7, "xi": 0.1, "min_cluster_size": 0.2}),
    (no_structure, {}),
]

for i_dataset, (dataset, algo_params) in enumerate(datasets):
    # update parameters with dataset-specific values
    params = default_base.copy()
    params.update(algo_params)

    X, y = dataset

    # normalize dataset for easier parameter selection
    X = StandardScaler().fit_transform(X)

    # estimate bandwidth for mean shift
    bandwidth = cluster.estimate_bandwidth(X, quantile=params["quantile"])

    # connectivity matrix for structured Ward
    connectivity = kneighbors_graph(
        X, n_neighbors=params["n_neighbors"], include_self=False
    )
    # make connectivity symmetric
    connectivity = 0.5 * (connectivity + connectivity.T)

    # Create cluster objects
    ms = cluster.MeanShift(bandwidth=bandwidth, bin_seeding=True)
    two_means = cluster.MiniBatchKMeans(
        n_clusters=params["n_clusters"],
        random_state=params["random_state"],
    )
    ward = cluster.AgglomerativeClustering(
        n_clusters=params["n_clusters"], linkage="ward", connectivity=connectivity
    )
    spectral = cluster.SpectralClustering(
        n_clusters=params["n_clusters"],
        eigen_solver="arpack",
        affinity="nearest_neighbors",
        random_state=params["random_state"],
    )
    dbscan = cluster.DBSCAN(eps=params["eps"])
    hdbscan = cluster.HDBSCAN(
        min_samples=params["hdbscan_min_samples"],
        min_cluster_size=params["hdbscan_min_cluster_size"],
        allow_single_cluster=params["allow_single_cluster"],
        copy=True,
    )
    optics = cluster.OPTICS(
        min_samples=params["min_samples"],
        xi=params["xi"],
        min_cluster_size=params["min_cluster_size"],
    )
    affinity_propagation = cluster.AffinityPropagation(
        damping=params["damping"],
        preference=params["preference"],
        random_state=params["random_state"],
    )
    average_linkage = cluster.AgglomerativeClustering(
        linkage="average",
        metric="cityblock",
        n_clusters=params["n_clusters"],
        connectivity=connectivity,
    )
    birch = cluster.Birch(n_clusters=params["n_clusters"])
    gmm = mixture.GaussianMixture(
        n_components=params["n_clusters"],
        covariance_type="full",
        random_state=params["random_state"],
    )

    clustering_algorithms = (
        ("MiniBatch\nKMeans", two_means),
        ("Affinity\nPropagation", affinity_propagation),
        ("MeanShift", ms),
        ("Spectral\nClustering", spectral),
        ("Ward", ward),
        ("Agglomerative\nClustering", average_linkage),
        ("DBSCAN", dbscan),
        ("HDBSCAN", hdbscan),
        ("OPTICS", optics),
        ("BIRCH", birch),
        ("Gaussian\nMixture", gmm),
    )

    for name, algorithm in clustering_algorithms:
        t0 = time.time()

        # catch warnings related to kneighbors_graph
        with warnings.catch_warnings():
            warnings.filterwarnings(
                "ignore",
                message="the number of connected components of the "
                "connectivity matrix is [0-9]{1,2}"
                " > 1. Completing it to avoid stopping the tree early.",
                category=UserWarning,
            )
            warnings.filterwarnings(
                "ignore",
                message="Graph is not fully connected, spectral embedding"
                " may not work as expected.",
                category=UserWarning,
            )
            algorithm.fit(X)

        t1 = time.time()
        if hasattr(algorithm, "labels_"):
            labels = algorithm.labels_
        else:
            labels = algorithm.predict(X)

        # Number of clusters in labels, ignoring noise if the algorithm reports it
        n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
        n_noise = list(labels).count(-1)
        print(f"Algorithm: {name}\n Estimated number of clusters: {n_clusters}\n Estimated number of noise points: {n_noise}")

        # Locate the cluster centers when the algorithm exposes them
        if hasattr(algorithm, "cluster_centers_"):
            centers = algorithm.cluster_centers_
        elif hasattr(algorithm, "cluster_centers_indices_"):
            centers = X[algorithm.cluster_centers_indices_]
        else:
            # Algorithms such as DBSCAN have no cluster_centers_ attribute.
            # Only the points are plotted; noise points carry the label -1.
            centers = None

        # Plot result
        plt.subplot(len(datasets), len(clustering_algorithms), plot_num)
        if i_dataset == 0:
            plt.title(name, size=17)

        plt.scatter(X[:, 0], X[:, 1], c=labels, s=10, cmap="viridis")
        if centers is not None:
            # Plot the cluster centers
            plt.scatter(centers[:, 0], centers[:, 1], c="black", s=50, alpha=0.7)

        plt.xlim(-2, 2)
        plt.ylim(-2, 2)
        plt.xticks(())
        plt.yticks(())
        # Fit time, placed in axes coordinates so it stays in the lower-right corner
        plt.text(
            0.99,
            0.02,
            ("%.2f" % (t1 - t0)).lstrip("0"),
            transform=plt.gca().transAxes,
            size=11,
            horizontalalignment="right",
        )
        plot_num += 1

plt.show()
```

--------------------------------

### Setup and Data Generation

Source: https://scikit-learn.org/dev/auto_examples/linear_model/plot_sgdocsvm_vs_ocsvm.html

Imports the required libraries and generates synthetic training, testing, and outlier data for the One-Class SVM comparison. It also sets the plotting font and the random state for reproducibility, then fits the kernelized One-Class SVM and counts its errors on each set.

```python
# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause

import matplotlib
import matplotlib.lines as mlines
import matplotlib.pyplot as plt
import numpy as np

from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import SGDOneClassSVM
from sklearn.pipeline import make_pipeline
from sklearn.svm import OneClassSVM

font = {"weight": "normal", "size": 15}

matplotlib.rc("font", **font)

random_state = 42
rng = np.random.RandomState(random_state)

# Generate train data
X = 0.3 * rng.randn(500, 2)
X_train = np.r_[X + 2, X - 2]
# Generate some regular novel observations
X = 0.3 * rng.randn(20, 2)
X_test = np.r_[X + 2, X - 2]
# Generate some abnormal novel observations
X_outliers = rng.uniform(low=-4, high=4, size=(20, 2))

# OCSVM hyperparameters
nu = 0.05
gamma = 2.0

# Fit the One-Class SVM
clf = OneClassSVM(gamma=gamma, kernel="rbf", nu=nu)
clf.fit(X_train)
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)
y_pred_outliers = clf.predict(X_outliers)
n_error_train = y_pred_train[y_pred_train == -1].size
n_error_test = y_pred_test[y_pred_test == -1].size
n_error_outliers = y_pred_outliers[y_pred_outliers == 1].size
```

--------------------------------

### PartialDependenceDisplay

Source:
https://scikit-learn.org/dev/modules/generated/sklearn.inspection.PartialDependenceDisplay.html

Partial Dependence Plot (PDP) and Individual Conditional Expectation (ICE). It is recommended to use `from_estimator` to create a `PartialDependenceDisplay`. All parameters are stored as attributes. For general information regarding `scikit-learn` visualization tools, see the Visualization Guide. For guidance on interpreting these plots, refer to the Inspection Guide. For an example on how to use this class, see the following example: Advanced Plotting With Partial Dependence. Added in version 0.22.

```APIDOC
class sklearn.inspection.PartialDependenceDisplay(pd_results, *, features, feature_names, target_idx, deciles, kind='average', subsample=1000, random_state=None, is_categorical=None)

Parameters:

**pd_results** : list of Bunch
    Results of `partial_dependence` for `features`.

**features** : list of (int,) or list of (int, int)
    Indices of features for a given plot. A tuple of one integer will plot a partial dependence curve of one feature. A tuple of two integers will plot a two-way partial dependence curve as a contour plot.

**feature_names** : list of str
    Feature names corresponding to the indices in `features`.

**target_idx** : int
    * In a multiclass setting, specifies the class for which the PDPs should be computed. Note that for binary classification, the positive class (index 1) is always used.
    * In a multioutput setting, specifies the task for which the PDPs should be computed.
    Ignored in binary classification or classical regression settings.

**deciles** : dict
    Deciles for feature indices in `features`.

**kind** : {'average', 'individual', 'both'} or list of such str, default='average'
    Whether to plot the partial dependence averaged across all the samples in the dataset or one line per sample or both.
    * `kind='average'` results in the traditional PD plot;
    * `kind='individual'` results in the ICE plot;
    * `kind='both'` results in plotting both the ICE and PD on the same plot.
    A list of such strings can be provided to specify `kind` on a per-plot basis. The length of the list should be the same as the number of interactions requested in `features`.
    Note: ICE ('individual' or 'both') is not a valid option for two-way interaction plots. As a result, an error will be raised. Two-way interaction plots should always be configured to use the 'average' kind instead.
    Note: The fast `method='recursion'` option is only available for `kind='average'` and `sample_weight=None`. Computing individual dependencies and doing weighted averages requires using the slower `method='brute'`.
    Added in version 0.24: Add `kind` parameter with `'average'`, `'individual'`, and `'both'` options.
    Added in version 1.1: Add the possibility to pass a list of strings specifying `kind` for each plot.

**subsample** : float, int or None, default=1000
    Sampling for ICE curves when `kind` is 'individual' or 'both'. If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to be used to plot ICE curves. If int, represents the maximum absolute number of samples to use. Note that the full dataset is still used to calculate partial dependence when `kind='both'`.
    Added in version 0.24.

**random_state** : int, RandomState instance or None, default=None
    Controls the randomness of the selected samples when `subsample` is not `None`. See Glossary for details.
    Added in version 0.24.

**is_categorical** : list of (bool,) or list of (bool, bool), default=None
    Whether each target feature in `features` is categorical or not. The list should be same size as `features`. If `None`, all features are assumed to be continuous.
    Added in version 1.2.

Attributes:

**bounding_ax_** : matplotlib Axes or None
    If `ax` is an axes or None, the `bounding_ax_` is the axes where the grid of partial dependence plots are drawn. If `ax` is a list of axes or a numpy array of axes, `bounding_ax_` is None.

**axes_** : ndarray of matplotlib Axes
    If `ax` is an axes or None, `axes_[i, j]` is the axes on the i-th row and j-th column. If `ax` is a list of axes, `axes_[i]` is the i-th item in `ax`. Elements that are None correspond to a nonexisting axes in that position.

**lines_** : ndarray of matplotlib Artists
    If `ax` is an axes or None, `lines_[i, j]` is the partial dependence curve on the i-th row and j-th column. If `ax` is a list of axes, `lines_[i]` is the partial dependence curve corresponding to the i-th item in `ax`. Elements that are None correspond to a nonexisting axes or an axes that does not include a line plot.

**deciles_vlines_** : ndarray of matplotlib LineCollection
    If `ax` is an axes or None, `deciles_vlines_[i, j]` is the line collection representing the x axis deciles of the i-th row and j-th column. If `ax` is a list of axes, `deciles_vlines_[i]` corresponds to the i-th item in `ax`. Elements that are None correspond to a nonexisting axes or an axes that does not include a PDP plot.
```

--------------------------------

### Build Documentation with Filtered Examples

Source: https://scikit-learn.org/dev/developers/contributing.html

Builds the documentation and runs only examples whose filenames contain 'plot_calibration'. This is useful for testing specific example changes.

```bash
EXAMPLES_PATTERN="plot_calibration" make html
```

--------------------------------

### RandomizedSearchCV Example

Source: https://scikit-learn.org/dev/modules/generated/sklearn.model_selection.RandomizedSearchCV.html

Demonstrates how to use RandomizedSearchCV with Logistic Regression on the Iris dataset. It shows how to define the parameter distributions and fit the search.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform

iris = load_iris()
logistic = LogisticRegression(solver='saga', tol=1e-2, max_iter=200, random_state=0)
distributions = dict(C=uniform(loc=0, scale=4), l1_ratio=[0, 1])
clf = RandomizedSearchCV(logistic, distributions, random_state=0)
search = clf.fit(iris.data, iris.target)
search.best_params_
```

--------------------------------

### Create a Pipeline with make_pipeline

Source: https://scikit-learn.org/dev/auto_examples/miscellaneous/plot_pipeline_display.html

This example demonstrates creating a pipeline using the `make_pipeline` utility function, which automatically names the steps. It's a convenient way to build pipelines.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Create a pipeline using make_pipeline
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(solver='liblinear')
)

# Print the pipeline steps to see the auto-generated names
print(pipe.steps)
```

--------------------------------

### MiniBatchNMF Example

Source: https://scikit-learn.org/dev/modules/generated/sklearn.decomposition.MiniBatchNMF.html

This example demonstrates how to initialize and use the MiniBatchNMF model. It shows the basic steps of creating an instance, fitting it to data, and obtaining the transformed data and components.

```python
import numpy as np
X = np.array([[1, 1], [2, 1], [3, 1.2], [4, 1], [5, 0.8], [6, 1]])
from sklearn.decomposition import MiniBatchNMF
model = MiniBatchNMF(n_components=2, init='random', random_state=0)
W = model.fit_transform(X)
H = model.components_
```

--------------------------------

### Import necessary libraries

Source: https://scikit-learn.org/dev/_downloads/19e9c0cb24a132133cef3b311caaf199/plot_nca_illustration.ipynb

Imports libraries for plotting, numerical operations, and Neighborhood Components Analysis. This setup is required for the subsequent examples.

```python
# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause

import matplotlib.pyplot as plt
import numpy as np
from matplotlib import cm
from scipy.special import logsumexp

from sklearn.datasets import make_classification
from sklearn.neighbors import NeighborhoodComponentsAnalysis
```

--------------------------------

### LeavePGroupsOut Example

Source: https://scikit-learn.org/dev/modules/generated/sklearn.model_selection.LeavePGroupsOut.html

Demonstrates how to use LeavePGroupsOut to split data into training and testing sets, leaving out a specified number of groups. This example shows how to get the number of splits and iterate through them, printing the train and test indices along with their corresponding group labels.

```python
import numpy as np
from sklearn.model_selection import LeavePGroupsOut

X = np.array([[1, 2], [3, 4], [5, 6]])
y = np.array([1, 2, 1])
groups = np.array([1, 2, 3])
lpgo = LeavePGroupsOut(n_groups=2)
lpgo.get_n_splits(groups=groups)
print(lpgo)
for i, (train_index, test_index) in enumerate(lpgo.split(X, y, groups)):
    print(f"Fold {i}:")
    print(f" Train: index={train_index}, group={groups[train_index]}")
    print(f" Test: index={test_index}, group={groups[test_index]}")
```

--------------------------------

### Anomaly Detection Algorithms Comparison Setup

Source: https://scikit-learn.org/dev/auto_examples/miscellaneous/plot_anomaly_comparison.html

Sets up parameters and defines a list of anomaly detection algorithms to be compared. Includes imports for necessary libraries and data generation functions.

```python
# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause

import time

import matplotlib
import matplotlib.pyplot as plt
import numpy as np

from sklearn import svm
from sklearn.covariance import EllipticEnvelope
from sklearn.datasets import make_blobs, make_moons
from sklearn.ensemble import IsolationForest
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import SGDOneClassSVM
from sklearn.neighbors import LocalOutlierFactor
from sklearn.pipeline import make_pipeline

matplotlib.rcParams["contour.negative_linestyle"] = "solid"

# Example settings
n_samples = 300
outliers_fraction = 0.15
n_outliers = int(outliers_fraction * n_samples)
n_inliers = n_samples - n_outliers

# define outlier/anomaly detection methods to be compared.
# the SGDOneClassSVM must be used in a pipeline with a kernel approximation
# to give similar results to the OneClassSVM
anomaly_algorithms = [
    (
        "Robust covariance",
        EllipticEnvelope(contamination=outliers_fraction, random_state=42),
    ),
    ("One-Class SVM", svm.OneClassSVM(nu=outliers_fraction, kernel="rbf", gamma=0.1)),
    (
        "One-Class SVM (SGD)",
        make_pipeline(
            Nystroem(gamma=0.1, random_state=42, n_components=150),
            SGDOneClassSVM(
                nu=outliers_fraction,
                shuffle=True,
                fit_intercept=True,
                random_state=42,
                tol=1e-6,
            ),
        ),
    ),
    (
        "Isolation Forest",
        IsolationForest(contamination=outliers_fraction, random_state=42),
    ),
    (
        "Local Outlier Factor",
        LocalOutlierFactor(n_neighbors=35, contamination=outliers_fraction),
    ),
]
```

--------------------------------

### Basic QDA Example

Source: https://scikit-learn.org/dev/modules/generated/sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis.html

Demonstrates the basic usage of QuadraticDiscriminantAnalysis with sample data. This snippet shows how to import the class, create sample data, and fit the model.

```python
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
import numpy as np

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 2, 2, 2])
clf = QuadraticDiscriminantAnalysis()
clf.fit(X, y)
```

--------------------------------

### Get Accuracy Scorer

Source: https://scikit-learn.org/dev/modules/generated/sklearn.metrics.get_scorer.html

Demonstrates how to obtain the 'accuracy' scorer and use it to evaluate a fitted classifier. This example requires numpy and DummyClassifier from scikit-learn.

```python
>>> import numpy as np
>>> from sklearn.dummy import DummyClassifier
>>> from sklearn.metrics import get_scorer
>>> X = np.reshape([0, 1, -1, -0.5, 2], (-1, 1))
>>> y = np.array([0, 1, 1, 0, 1])
>>> classifier = DummyClassifier(strategy="constant", constant=0).fit(X, y)
>>> accuracy = get_scorer("accuracy")
>>> accuracy(classifier, X, y)
0.4
```

--------------------------------

### Data Generation and Preprocessing Setup

Source: https://scikit-learn.org/dev/auto_examples/preprocessing/plot_discretization_classification.html

Imports necessary libraries and defines helper functions for plotting and estimator naming. Sets up the mesh grid size for plotting decision boundaries.

```python
# Authors: The scikit-learn developers
# SPDX-License-Identifier: BSD-3-Clause

import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import ListedColormap

from sklearn.datasets import make_circles, make_classification, make_moons
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer, StandardScaler
from sklearn.svm import SVC, LinearSVC
from sklearn.utils._testing import ignore_warnings

h = 0.02  # step size in the mesh


def get_name(estimator):
    name = estimator.__class__.__name__
    if name == "Pipeline":
        name = [get_name(est[1]) for est in estimator.steps]
        name = " + ".join(name)
    return name


# list of (estimator, param_grid), where param_grid is used in GridSearchCV
# The parameter spaces in this example are limited to a narrow band to reduce
# its runtime.
# In a real use case, a broader search space for the algorithms
```

--------------------------------

### Load Diabetes Dataset

Source: https://scikit-learn.org/dev/modules/generated/sklearn.datasets.load_diabetes.html

Loads the diabetes dataset and accesses its target values and data shape. Use this to get started with the dataset for regression tasks.

```python
>>> from sklearn.datasets import load_diabetes
>>> diabetes = load_diabetes()
>>> diabetes.target[:3]
array([151., 75., 141.])
>>> diabetes.data.shape
(442, 10)
```

--------------------------------

### ParameterGrid Initialization and Usage

Source: https://scikit-learn.org/dev/modules/generated/sklearn.model_selection.ParameterGrid.html

Demonstrates how to initialize ParameterGrid with a dictionary of parameters and iterate over the generated combinations. It also shows how to handle a sequence of dictionaries for more complex grid exploration.

```APIDOC
## ParameterGrid

class sklearn.model_selection.ParameterGrid(param_grid)

### Description
Grid of parameters with a discrete number of values for each. Can be used to iterate over parameter value combinations with the Python built-in function iter. The order of the generated parameter combinations is deterministic.

### Parameters
#### param_grid
- **dict of str to sequence, or sequence of such** - The parameter grid to explore, as a dictionary mapping estimator parameters to sequences of allowed values. An empty dict signifies default parameters. A sequence of dicts signifies a sequence of grids to search, and is useful to avoid exploring parameter combinations that make no sense or have no effect.

### Examples
```python
>>> from sklearn.model_selection import ParameterGrid
>>> param_grid = {'a': [1, 2], 'b': [True, False]}
>>> list(ParameterGrid(param_grid)) == (
...     [{'a': 1, 'b': True}, {'a': 1, 'b': False},
...      {'a': 2, 'b': True}, {'a': 2, 'b': False}])
True
```

```python
>>> grid = [{'kernel': ['linear']}, {'kernel': ['rbf'], 'gamma': [1, 10]}]
>>> list(ParameterGrid(grid)) == [{'kernel': 'linear'},
...                               {'kernel': 'rbf', 'gamma': 1},
...                               {'kernel': 'rbf', 'gamma': 10}]
True
>>> ParameterGrid(grid)[1] == {'kernel': 'rbf', 'gamma': 1}
True
```
```

--------------------------------

### StratifiedKFold Example

Source: https://scikit-learn.org/dev/modules/generated/sklearn.model_selection.StratifiedKFold.html

Demonstrates how to use StratifiedKFold to split data into stratified train and test sets. It shows how to get the number of splits and iterate through the generated folds.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 0, 1, 1])
skf = StratifiedKFold(n_splits=2)
skf.get_n_splits()
print(skf)
for i, (train_index, test_index) in enumerate(skf.split(X, y)):
    print(f"Fold {i}:")
    print(f" Train: index={train_index}")
    print(f" Test: index={test_index}")
```
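
The StratifiedKFold example above prints only indices, so the stratification itself is not visible. As a minimal sketch, the snippet below counts the class labels in each test fold. The imbalanced toy arrays `X` and `y` and the helper list `test_class_counts` are assumptions made for illustration and are not part of the scikit-learn documentation.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced toy data: 8 samples of class 0 and 4 of class 1.
X = np.arange(24).reshape(12, 2)
y = np.array([0] * 8 + [1] * 4)

skf = StratifiedKFold(n_splits=4)

# Collect the per-class sample counts of every test fold.
test_class_counts = []
for train_index, test_index in skf.split(X, y):
    counts = np.bincount(y[test_index], minlength=2)
    test_class_counts.append(counts.tolist())
    print(f"Test class counts (class 0, class 1): {counts.tolist()}")
```

With this data, every test fold should hold two samples of class 0 and one of class 1, which matches the 2:1 class ratio of the full set.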