### Install Libraries

Source: https://github.com/huggingface/evaluate/blob/main/docs/source/transformers_integrations.mdx

Install the necessary libraries for running Transformers examples.

```bash
pip install datasets transformers torch evaluate nltk rouge_score
```

--------------------------------

### Verify 🤗 Evaluate Installation (pip)

Source: https://github.com/huggingface/evaluate/blob/main/docs/source/installation.mdx

Run this command to confirm that 🤗 Evaluate has been installed correctly and is functional. It loads the 'exact_match' metric and computes a simple example.

```bash
python -c "import evaluate; print(evaluate.load('exact_match').compute(references=['hello'], predictions=['hello']))"
```

--------------------------------

### Install Spacytextblob and Download Corpora

Source: https://github.com/huggingface/evaluate/blob/main/docs/source/custom_evaluator.mdx

Installs the `spacytextblob` library and downloads the necessary NLTK corpora and spaCy language model. These are required for the spaCy sentiment analysis example.

```bash
pip install spacytextblob
python -m textblob.download_corpora
python -m spacy download en_core_web_sm
```

--------------------------------

### Install Scikit-Learn

Source: https://github.com/huggingface/evaluate/blob/main/docs/source/sklearn_integrations.mdx

Install the scikit-learn library to use its estimators and pipelines.

```bash
pip install -U scikit-learn
```

--------------------------------

### Install Documentation Dependencies

Source: https://github.com/huggingface/evaluate/blob/main/CONTRIBUTING.md

Install all dependencies required to build the documentation. This ensures that the doc-builder can function correctly.

```bash
pip install ".[docs]"
```

--------------------------------

### Install Documentation Dependencies

Source: https://github.com/huggingface/evaluate/blob/main/docs/README.md

Installs the necessary packages for building the documentation, including the evaluate package with its documentation extras.
```bash pip install -e ".[docs]" ``` -------------------------------- ### Install Doc Builder Tool Source: https://github.com/huggingface/evaluate/blob/main/docs/README.md Installs the specialized tool required for building the documentation from a Git repository. ```bash pip install git+https://github.com/huggingface/doc-builder ``` -------------------------------- ### Install Hugging Face Evaluate with pip Source: https://github.com/huggingface/evaluate/blob/main/README.md Install the Hugging Face Evaluate library using pip. It is recommended to install within a virtual environment. ```bash pip install evaluate ``` -------------------------------- ### CharCut Output Example Source: https://github.com/huggingface/evaluate/blob/main/metrics/charcut_mt/README.md Example of the output format for the CharCut metric computation. ```json {"charcut_mt": 0.1971153846153846} ``` -------------------------------- ### Install Development Environment Source: https://github.com/huggingface/evaluate/blob/main/CONTRIBUTING.md Install the project in editable mode with development dependencies. This command should be run within a virtual environment. ```bash pip install -e ".[dev]" ``` -------------------------------- ### Install Evaluate with Template for New Metrics Source: https://github.com/huggingface/evaluate/blob/main/README.md Install the necessary dependencies to create a new metric using the evaluate library. This includes the 'template' extra. ```bash pip install evaluate[template] ``` -------------------------------- ### Example: Compute Perplexity on Custom Predictions Source: https://github.com/huggingface/evaluate/blob/main/metrics/perplexity/README.md Demonstrates calculating perplexity on a custom list of input texts using the 'gpt2' model, with the start token disabled. 
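As background for this example: perplexity is conventionally defined as the exponential of the mean negative log-likelihood per token. The sketch below only illustrates that arithmetic in plain Python. The helper name `perplexity_from_logprobs` and the log-probabilities are made up for illustration and are not part of the `evaluate` API, which obtains token probabilities from the model named by `model_id`.

```python
import math


def perplexity_from_logprobs(token_logprobs):
    """Return exp(-(1/N) * sum(log p_i)) for natural-log token probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)


# Hypothetical log-probabilities: every token is assigned probability 0.25.
example_logprobs = [math.log(0.25)] * 4
example_ppl = perplexity_from_logprobs(example_logprobs)
print(example_ppl)
```

A model that assigns each token probability 1/4 is as uncertain as a uniform choice among four options, which is why the value comes out at about 4. The library call on custom input texts looks like this: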
```python
import evaluate

perplexity = evaluate.load("perplexity", module_type="metric")
input_texts = ["lorem ipsum", "Happy Birthday!", "Bienvenue"]
results = perplexity.compute(model_id='gpt2',
                             add_start_token=False,
                             predictions=input_texts)
print(list(results.keys()))
# ['perplexities', 'mean_perplexity']
print(round(results["mean_perplexity"], 2))
# 646.75
# "perplexities" is a list with one value per input text; round the first one.
print(round(results["perplexities"][0], 2))
# 32.25
```

--------------------------------

### Clone and Install 🤗 Evaluate from Source

Source: https://github.com/huggingface/evaluate/blob/main/docs/source/installation.mdx

Clone the 🤗 Evaluate repository from GitHub and install it in editable mode. This method is useful for development or when contributing to the library.

```bash
git clone https://github.com/huggingface/evaluate.git
cd evaluate
pip install -e .
```

--------------------------------

### CUAD Metric: Partial Match Example

Source: https://github.com/huggingface/evaluate/blob/main/metrics/cuad/README.md

Illustrates the CUAD metric computation for a partial match scenario. Ensure the 'evaluate' library is installed and the 'cuad' metric is loaded.
```python
from evaluate import load

cuad_metric = load("cuad")
# Gold annotation: two answer spans for this clause.
references = [{'answers': {'answer_start': [143, 49], 'text': ['The seller:', 'The buyer/End-User: Shenzhen LOHAS Supply Chain Management Co., Ltd.']}, 'id': 'LohaCompanyltd_20191209_F-1_EX-10.16_11917878_EX-10.16_Supply Agreement__Parties'}]
# Predictions: only the second span matches a reference answer.
predictions = [{'prediction_text': ['The Company appoints the Distributor as an exclusive distributor of Products in the Market, subject to the terms and conditions of this Agreement.', 'The buyer/End-User: Shenzhen LOHAS Supply Chain Management Co., Ltd.'], 'id': 'LohaCompanyltd_20191209_F-1_EX-10.16_11917878_EX-10.16_Supply Agreement__Parties'}]
results = cuad_metric.compute(predictions=predictions, references=references)
print(results)
```

--------------------------------

### METEOR Output Example

Source: https://github.com/huggingface/evaluate/blob/main/metrics/meteor/README.md

Example of the output dictionary from the METEOR metric computation, showing the 'meteor' score.

```python
{'meteor': 0.9999142661179699}
```

--------------------------------

### CUAD Metric: Maximal Values Example

Source: https://github.com/huggingface/evaluate/blob/main/metrics/cuad/README.md

Demonstrates the computation of the CUAD metric with maximal prediction and reference values. Ensure the 'evaluate' library is installed and the 'cuad' metric is loaded.
```python from evaluate import load cuad_metric = load("cuad") predictions = [{'prediction_text': ['The seller:', 'The buyer/End-User: Shenzhen LOHAS Supply Chain Management Co., Ltd.'], 'id': 'LohaCompanyltd_20191209_F-1_EX-10.16_11917878_EX-10.16_Supply Agreement__Parties'}] references = [{'answers': {'answer_start': [143, 49], 'text': ['The seller:', 'The buyer/End-User: Shenzhen LOHAS Supply Chain Management Co., Ltd.']}, 'id': 'LohaCompanyltd_20191209_F-1_EX-10.16_11917878_EX-10.16_Supply Agreement__Parties'}] results = cuad_metric.compute(predictions=predictions, references=references) print(results) ``` -------------------------------- ### Precision Output Example (Binary) Source: https://github.com/huggingface/evaluate/blob/main/metrics/precision/README.md An example of the output format for the precision metric when computed for a binary classification task. ```python {'precision': 0.2222222222222222} ``` -------------------------------- ### Example Output of Label Distribution Source: https://github.com/huggingface/evaluate/blob/main/measurements/label_distribution/README.md This is an example of the output dictionary returned by the label_distribution measurement, showing the distribution of labels and the calculated label skew. ```python >>> {'label_distribution': {'labels': [1, 0, 2], 'fractions': [0.1, 0.6, 0.3]}, 'label_skew': 0.7417688338666573} ``` -------------------------------- ### CharacTER Metric Output Example Source: https://github.com/huggingface/evaluate/blob/main/metrics/character/README.md An example of the output structure when computing the CharacTER metric. 
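The summary fields of a CharacTER result look like plain descriptive statistics over the per-sentence `cer_scores`. The sketch below works under that assumption: the helper `summarize_scores` is hypothetical, and reading `std` as the sample standard deviation is inferred from the numbers in the example output, not confirmed by the source.

```python
import statistics


def summarize_scores(cer_scores):
    """Summarize per-sentence scores; needs at least two values for stdev."""
    return {
        "count": len(cer_scores),
        "mean": statistics.mean(cer_scores),
        "median": statistics.median(cer_scores),
        "std": statistics.stdev(cer_scores),  # sample standard deviation (n - 1)
        "min": min(cer_scores),
        "max": max(cer_scores),
    }


# Per-sentence scores copied from the example output in this section.
cer_scores = [0.36619718309859156, 0.25925925925925924]
summary = summarize_scores(cer_scores)
print(summary)
```

The example output itself: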
```python { 'count': 2, 'mean': 0.3127282211789254, 'median': 0.3127282211789254, 'std': 0.07561653111280243, 'min': 0.25925925925925924, 'max': 0.36619718309859156, 'cer_scores': [0.36619718309859156, 0.25925925925925924] } ``` -------------------------------- ### Multi-line Code Block Example Source: https://github.com/huggingface/evaluate/blob/main/docs/README.md Demonstrates how to format multi-line code blocks using triple backticks, suitable for doctest examples. ```markdown ``` # first line of code # second line # etc ``` ``` -------------------------------- ### Download NLTK Resources Source: https://github.com/huggingface/evaluate/blob/main/docs/source/creating_and_sharing.mdx Example of downloading specific NLTK resources, such as 'punkt_tab', within the _download_and_prepare method. ```python def _download_and_prepare(self, dl_manager): import nltk nltk.download("punkt_tab") ``` -------------------------------- ### SARI Metric Output Example Source: https://github.com/huggingface/evaluate/blob/main/metrics/sari/README.md Shows the expected dictionary output format for the SARI metric computation. ```python print(sari_score) {'sari': 26.953601953601954} ``` -------------------------------- ### Recall Metric Output Example (Binary) Source: https://github.com/huggingface/evaluate/blob/main/metrics/recall/README.md An example of the output dictionary for the recall metric when used in a binary classification context. ```python {'recall': 1.0} ``` -------------------------------- ### Minimal TREC Eval Example Source: https://github.com/huggingface/evaluate/blob/main/metrics/trec_eval/README.md A minimal example demonstrating how to use the TREC Eval metric with sample qrel and run data. It shows the expected structure for predictions and references. 
```Python qrel = { "query": [0], "q0": ["q0"], "docid": ["doc_1"], "rel": [2] } run = { "query": [0, 0], "q0": ["q0", "q0"], "docid": ["doc_2", "doc_1"], "rank": [0, 1], "score": [1.5, 1.2], "system": ["test", "test"] } trec_eval = evaluate.load("trec_eval") results = trec_eval.compute(references=[qrel], predictions=[run]) results["P@5"] ``` -------------------------------- ### MASE Metric Output Example Source: https://github.com/huggingface/evaluate/blob/main/metrics/mase/README.md Example of the MASE metric's output dictionary when using the default 'uniform_average' multioutput configuration. ```python >>> print(results) {'mase': 0.1833...} ``` -------------------------------- ### Precision Output Example (Multiclass) Source: https://github.com/huggingface/evaluate/blob/main/metrics/precision/README.md An example of the output format for the precision metric when computed for a multiclass classification task, returning scores for each class. ```python {'precision': array([0.66666667, 0.0, 0.0])} ``` -------------------------------- ### XNLI Partial Match Example Source: https://github.com/huggingface/evaluate/blob/main/metrics/xnli/README.md Demonstrates computing the XNLI metric with a mix of correct and incorrect predictions, resulting in a partial accuracy. ```python >>> from evaluate import load >>> xnli_metric = load("xnli") >>> predictions = [1, 0, 1] >>> references = [1, 0, 0] >>> results = xnli_metric.compute(predictions=predictions, references=references) >>> print(results) {'accuracy': 0.6666666666666666} ``` -------------------------------- ### CUAD Metric: Minimal Values Example Source: https://github.com/huggingface/evaluate/blob/main/metrics/cuad/README.md Shows how to compute the CUAD metric with minimal prediction and reference values. This example requires the 'evaluate' library and the 'cuad' metric to be loaded. 
```python from evaluate import load cuad_metric = load("cuad") predictions = [{'prediction_text': ['The Company appoints the Distributor as an exclusive distributor of Products in the Market, subject to the terms and conditions of this Agreement.'], 'id': 'LIMEENERGYCO_09_09_1999-EX-10-DISTRIBUTOR AGREEMENT__Exclusivity_0'}] references = [{'answers': {'answer_start': [143], 'text': 'The seller'}, 'id': 'LIMEENERGYCO_09_09_1999-EX-10-DISTRIBUTOR AGREEMENT__Exclusivity_0'}] results = cuad_metric.compute(predictions=predictions, references=references) print(results) ``` -------------------------------- ### Download and Extract Resources Source: https://github.com/huggingface/evaluate/blob/main/docs/source/creating_and_sharing.mdx Implement resource downloading and extraction within the _download_and_prepare method using the dl_manager. This example shows downloading a checkpoint for BLEURT. ```python def _download_and_prepare(self, dl_manager): model_path = dl_manager.download_and_extract(CHECKPOINT_URLS[self.config_name]) self.scorer = score.BleurtScorer(os.path.join(model_path, self.config_name)) ``` -------------------------------- ### Text Duplicates Output Example (Default) Source: https://github.com/huggingface/evaluate/blob/main/measurements/text_duplicates/README.md Illustrates the default output format of the text_duplicates measurement, showing only the duplicate fraction. ```python { 'duplicate_fraction': 0.33333333333333337 } ``` -------------------------------- ### Text Duplicates Example with Multiple Duplicates and List Duplicates Source: https://github.com/huggingface/evaluate/blob/main/measurements/text_duplicates/README.md An example demonstrating the text_duplicates measurement with multiple duplicate strings and the `list_duplicates=True` option enabled. This output includes both the fraction and a dictionary of duplicate strings with their counts. 
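To make the two output fields concrete, here is a minimal plain-Python sketch. It assumes the duplicate fraction is one minus the share of unique strings, which is consistent with the outputs shown in this section; the measurement's actual implementation may differ (for example, by hashing the strings). The helper name `duplicate_stats` is hypothetical.

```python
from collections import Counter


def duplicate_stats(texts):
    """Return the duplicate fraction and the strings that occur more than once."""
    if not texts:
        raise ValueError("need at least one string")
    counts = Counter(texts)
    # Assumed definition: share of entries that repeat an already-seen string.
    fraction = 1 - len(counts) / len(texts)
    duplicates = {text: n for text, n in counts.items() if n > 1}
    return {"duplicate_fraction": fraction, "duplicates_dict": duplicates}


sample = ["hello sun", "goodbye moon", "hello sun", "foo bar", "foo bar"]
stats = duplicate_stats(sample)
print(stats)
```

The same input passed through the measurement itself: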
```python >>> data = ["hello sun", "goodbye moon", "hello sun", "foo bar", "foo bar"] >>> duplicates = evaluate.load("text_duplicates") >>> results = duplicates.compute(data=data, list_duplicates=True) >>> print(results) {'duplicate_fraction': 0.4, 'duplicates_dict': {'hello sun': 2, 'foo bar': 2}} ``` -------------------------------- ### F1 Score Binary Example with Sample Weights Source: https://github.com/huggingface/evaluate/blob/main/metrics/f1/README.md Shows how to compute the F1 score with sample weights applied to the predictions and references. ```python >>> f1_metric = evaluate.load("f1") >>> results = f1_metric.compute(references=[0, 1, 0, 1, 0], predictions=[0, 0, 1, 1, 0], sample_weight=[0.9, 0.5, 3.9, 1.2, 0.3]) >>> print(round(results['f1'], 2)) 0.35 ``` -------------------------------- ### SQuAD v2 Metric: Partial Match Example Source: https://github.com/huggingface/evaluate/blob/main/metrics/squad_v2/README.md Calculates the exact match and F1 scores when the prediction partially matches the reference answers. This example demonstrates a scenario with 2 out of 3 answers being correct. 
```python
from evaluate import load

squad_v2_metric = load("squad_v2")
predictions = [
    {'prediction_text': '1976', 'id': '56e10a3be3433e1400422b22', 'no_answer_probability': 0.},
    {'prediction_text': 'Beyonce', 'id': '56d2051ce7d4791d0090260b', 'no_answer_probability': 0.},
    {'prediction_text': 'climate change', 'id': '5733b5344776f419006610e1', 'no_answer_probability': 0.},
]
references = [
    {'answers': {'answer_start': [97], 'text': ['1976']}, 'id': '56e10a3be3433e1400422b22'},
    {'answers': {'answer_start': [233], 'text': ['Beyoncé and Bruno Mars']}, 'id': '56d2051ce7d4791d0090260b'},
    {'answers': {'answer_start': [891], 'text': ['climate change']}, 'id': '5733b5344776f419006610e1'},
]
results = squad_v2_metric.compute(predictions=predictions, references=references)
print(results)
```

--------------------------------

### Load and Compute CharacTER Metric

Source: https://github.com/huggingface/evaluate/blob/main/metrics/character/README.md

Demonstrates how to load the CharacTER metric and compute scores for single or corpus examples.

```python
import evaluate

character = evaluate.load("character")

# Single hyp/ref
preds = ["this week the saudis denied information published in the new york times"]
refs = ["saudi arabia denied this week information published in the american new york times"]
results = character.compute(references=refs, predictions=preds)

# Corpus example
preds = ["this week the saudis denied information published in the new york times",
         "this is in fact an estimate"]
refs = ["saudi arabia denied this week information published in the american new york times",
        "this is actually an estimate"]
results = character.compute(references=refs, predictions=preds)
```

--------------------------------

### Load and Compute GLUE Metric (sst2)

Source: https://github.com/huggingface/evaluate/blob/main/metrics/glue/README.md

Loads the GLUE metric for the 'sst2' subset and computes the results. This is a basic example demonstrating the core usage.
```python from evaluate import load glue_metric = load('glue', 'sst2') references = [0, 1] predictions = [0, 1] results = glue_metric.compute(predictions=predictions, references=references) ``` -------------------------------- ### Competition MATH Metric - Full Match Example Source: https://github.com/huggingface/evaluate/blob/main/metrics/competition_math/README.md Demonstrates a full match scenario where predictions exactly match references after canonicalization. Accuracy is 1.0. ```python from evaluate import load math = load("competition_math") references = ["\\frac{1}{2}"] predictions = ["1/2"] results = math.compute(references=references, predictions=predictions) print(results) ``` -------------------------------- ### Text Duplicates Output Example (List Duplicates) Source: https://github.com/huggingface/evaluate/blob/main/measurements/text_duplicates/README.md Shows the output format when `list_duplicates=True` is used, including the duplicate strings and their counts. ```python { 'duplicate_fraction': 0.33333333333333337, 'duplicates_dict': {'hello sun': 2} } ``` -------------------------------- ### XNLI Maximal Accuracy Example Source: https://github.com/huggingface/evaluate/blob/main/metrics/xnli/README.md Demonstrates computing the XNLI metric with predictions that perfectly match references, resulting in an accuracy of 1.0. ```python >>> from evaluate import load >>> xnli_metric = load("xnli") >>> predictions = [0, 1] >>> references = [0, 1] >>> results = xnli_metric.compute(predictions=predictions, references=references) >>> print(results) {'accuracy': 1.0} ``` -------------------------------- ### Compute Recall (Binary Classification) Source: https://github.com/huggingface/evaluate/blob/main/metrics/recall/README.md A simple example demonstrating how to compute the recall metric for binary classification. Ensure the 'recall' metric is loaded before computation. 
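For reference, binary recall is the number of true positives divided by the number of actual positives (true positives plus false negatives). The plain-Python sketch below spells out that arithmetic; `binary_recall` is a hypothetical helper written for illustration, not part of the `evaluate` API.

```python
def binary_recall(references, predictions, pos_label=1):
    """Recall = TP / (TP + FN) for the class given by pos_label."""
    pairs = list(zip(references, predictions))
    true_pos = sum(1 for ref, pred in pairs if ref == pos_label and pred == pos_label)
    false_neg = sum(1 for ref, pred in pairs if ref == pos_label and pred != pos_label)
    actual_pos = true_pos + false_neg
    # Returning 0.0 when there are no positives is a simplification for this sketch.
    return true_pos / actual_pos if actual_pos else 0.0


recall_value = binary_recall(references=[0, 0, 1, 1, 1], predictions=[0, 1, 0, 1, 1])
print(recall_value)
```

Of the three positive references, two are predicted as positive, giving 2/3. The same inputs passed to the metric: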
```python >>> recall_metric = evaluate.load('recall') >>> results = recall_metric.compute(references=[0, 0, 1, 1, 1], predictions=[0, 1, 0, 1, 1]) >>> print(results) {'recall': 0.6666666666666666} ``` -------------------------------- ### Precision Metric Computation Source: https://github.com/huggingface/evaluate/blob/main/metrics/precision/README.md Demonstrates how to load and compute the precision metric using the evaluate library. It shows a basic example with predictions and references. ```APIDOC ## Precision Metric Computation ### Description This section provides an example of how to use the `evaluate` library to compute the precision metric. It illustrates the basic usage with sample `predictions` and `references`. ### Method ```python precision_metric = evaluate.load("precision") results = precision_metric.compute(references=[0, 1], predictions=[0, 1]) print(results) ``` ### Inputs - **predictions** (`list` of `int`): Predicted class labels. - **references** (`list` of `int`): Actual class labels. - **labels** (`list` of `int`): The set of labels to include when `average` is not set to `'binary'`. If `average` is `None`, it should be the label order. Labels present in the data can be excluded, for example to calculate a multiclass average ignoring a majority negative class. Labels not present in the data will result in 0 components in a macro average. For multilabel targets, labels are column indices. By default, all labels in `predictions` and `references` are used in sorted order. Defaults to None. - **pos_label** (`int`): The class to be considered the positive class, in the case where `average` is set to `binary`. Defaults to 1. - **average** (`string`): This parameter is required for multiclass/multilabel targets. If set to `None`, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data. Defaults to `'binary'`. - 'binary': Only report results for the class specified by `pos_label`. 
This is applicable only if the classes found in `predictions` and `references` are binary.
    - 'micro': Calculate metrics globally by counting the total true positives, false negatives and false positives.
    - 'macro': Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
    - 'weighted': Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters `'macro'` to account for label imbalance. This option can result in an F-score that is not between precision and recall.
    - 'samples': Calculate metrics for each instance, and find their average (only meaningful for multilabel classification).
- **sample_weight** (`list` of `float`): Sample weights. Defaults to None.
- **zero_division** (`int` or `string`): Sets the value to return when there is a zero division. Defaults to `'warn'`.
    - 0: Returns 0 when there is a zero division.
    - 1: Returns 1 when there is a zero division.
    - 'warn': Raises warnings and then returns 0 when there is a zero division.

### Output Values
- **precision** (`float` or `array` of `float`): Precision score or list of precision scores, depending on the value passed to `average`. Minimum possible value is 0. Maximum possible value is 1. Higher values indicate that fewer negative examples were incorrectly labeled as positive, which means that, generally, higher scores are better.

### Output Example(s)
```python
{'precision': 0.2222222222222222}
```
```python
{'precision': array([0.66666667, 0.0, 0.0])}
```
```

--------------------------------

### Text Duplicates Example with No Duplicates

Source: https://github.com/huggingface/evaluate/blob/main/measurements/text_duplicates/README.md

Demonstrates the text_duplicates measurement with input data containing no duplicate strings. The duplicate fraction should be 0.0.
```python
>>> data = ["foo", "bar", "foobar"]
>>> duplicates = evaluate.load("text_duplicates")
>>> results = duplicates.compute(data=data)
>>> print(results)
{'duplicate_fraction': 0.0}
```

--------------------------------

### Create and Navigate to Project Directory

Source: https://github.com/huggingface/evaluate/blob/main/docs/source/installation.mdx

Use these commands to create a new project directory and navigate into it.

```bash
mkdir ~/my-project
cd ~/my-project
```

--------------------------------

### Example: Compute Perplexity on Dataset Predictions

Source: https://github.com/huggingface/evaluate/blob/main/metrics/perplexity/README.md

Shows how to compute perplexity on a subset of text data loaded from the 'wikitext' dataset using the 'gpt2' model.

```python
import datasets
import evaluate

perplexity = evaluate.load("perplexity", module_type="metric")
input_texts = datasets.load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"][:50]
# Drop empty lines, which carry no tokens to score.
input_texts = [s for s in input_texts if s != '']
results = perplexity.compute(model_id='gpt2', predictions=input_texts)
print(list(results.keys()))
# ['perplexities', 'mean_perplexity']
print(round(results["mean_perplexity"], 2))
# 576.76
# "perplexities" is a list with one value per input text; round the first one.
print(round(results["perplexities"][0], 2))
# 889.28
```

--------------------------------

### Load and Compute Google BLEU

Source: https://github.com/huggingface/evaluate/blob/main/metrics/google_bleu/README.md

Demonstrates how to load the google_bleu metric and compute the score for a single prediction and reference pair. Ensure the 'evaluate' library is installed.
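As a rough illustration of what the score measures, Google BLEU (GLEU) is commonly described as the minimum of n-gram precision and n-gram recall, counted over all 1- to 4-grams. The sketch below implements that definition for a single reference with whitespace tokenization. Both choices are simplifying assumptions, the helper names are hypothetical, and the metric's own tokenizer and multi-reference handling may behave differently.

```python
from collections import Counter


def ngram_counts(tokens, min_n=1, max_n=4):
    """Count every n-gram of length min_n..max_n in a token list."""
    counts = Counter()
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def sentence_gleu(prediction, reference):
    """min(n-gram precision, n-gram recall) for one prediction/reference pair."""
    pred = ngram_counts(prediction.split())
    ref = ngram_counts(reference.split())
    overlap = sum((pred & ref).values())  # clipped n-gram matches
    pred_total = sum(pred.values())
    ref_total = sum(ref.values())
    if pred_total == 0 or ref_total == 0:
        return 0.0
    return min(overlap / pred_total, overlap / ref_total)


gleu = sentence_gleu("the cat sat on the mat", "the cat ate the mat")
print(gleu)
```

For this pair there are 6 matching n-grams out of 18 in the prediction and 14 in the reference, so precision (6/18) is the smaller of the two ratios. The same pair scored with the metric: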
```python import evaluate sentence1 = "the cat sat on the mat" sentence2 = "the cat ate the mat" google_bleu = evaluate.load("google_bleu") result = google_bleu.compute(predictions=[sentence1], references=[[sentence2]]) print(result) >>> {'google_bleu': 0.3333333333333333} ``` -------------------------------- ### Competition MATH Metric - Minimal Match Example Source: https://github.com/huggingface/evaluate/blob/main/metrics/competition_math/README.md Demonstrates a minimal match scenario where predictions do not match references after canonicalization. Accuracy is 0.0. ```python from evaluate import load math = load("competition_math") references = ["\\frac{1}{2}"] predictions = ["3/4"] results = math.compute(references=references, predictions=predictions) print(results) ``` -------------------------------- ### Load and Compute BLEU Score Source: https://github.com/huggingface/evaluate/blob/main/metrics/bleu/README.md Demonstrates how to load the BLEU metric and compute scores for given predictions and references. Ensure you have the `evaluate` library installed. ```python >>> predictions = ["hello there general kenobi", "foo bar foobar"] >>> references = [ ... ["hello there general kenobi", "hello there !"], ... ["foo bar foobar"] ... ] >>> bleu = evaluate.load("bleu") >>> results = bleu.compute(predictions=predictions, references=references) >>> print(results) {'bleu': 1.0, 'precisions': [1.0, 1.0, 1.0, 1.0], 'brevity_penalty': 1.0, 'length_ratio': 1.1666666666666667, 'translation_length': 7, 'reference_length': 6} ``` -------------------------------- ### Clone Repository and Set Up Remotes Source: https://github.com/huggingface/evaluate/blob/main/CONTRIBUTING.md Clone your forked repository and add the base repository as a remote. This is the initial step after forking the project. 
```bash git clone git@github.com:/evaluate.git cd evaluate git remote add upstream https://github.com/huggingface/evaluate.git ``` -------------------------------- ### Load and Run an EvaluationSuite Source: https://github.com/huggingface/evaluate/blob/main/docs/source/evaluation_suite.mdx Demonstrates how to load a pre-defined EvaluationSuite from the Hugging Face Hub and run it against a model or pipeline. The results, including accuracy and timing information, are returned and can be displayed in a pandas DataFrame. ```python from evaluate import EvaluationSuite suite = EvaluationSuite.load('mathemakitten/glue-evaluation-suite') results = suite.run("gpt2") ``` -------------------------------- ### Seqeval Full Match Example Source: https://github.com/huggingface/evaluate/blob/main/metrics/seqeval/README.md Demonstrates computing seqeval metrics when predictions and references have a perfect match. Load the 'seqeval' metric and provide identical predictions and references. ```python >>> seqeval = evaluate.load('seqeval') >>> predictions = [['O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']] >>> references = [['O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']] >>> results = seqeval.compute(predictions=predictions, references=references) >>> print(results) {'MISC': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 1}, 'PER': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 1}, 'overall_precision': 1.0, 'overall_recall': 1.0, 'overall_f1': 1.0, 'overall_accuracy': 1.0} ``` -------------------------------- ### Build Documentation Locally Source: https://github.com/huggingface/evaluate/blob/main/CONTRIBUTING.md Build the project's documentation locally using the doc-builder tool. The output will be placed in the specified build directory for inspection. 
```bash doc-builder build evaluate docs/source/ --build_dir ~/tmp/test-build ``` -------------------------------- ### WER Output Example Source: https://github.com/huggingface/evaluate/blob/main/metrics/wer/README.md Example of printing the computed WER score, which is a float representing the word error rate. ```python print(wer_score) 0.5 ``` -------------------------------- ### Load and Run an Evaluation Suite Source: https://github.com/huggingface/evaluate/blob/main/docs/source/a_quick_tour.mdx Load a pre-defined evaluation suite from the Hugging Face Hub and run it with a specified model. This demonstrates how to initiate the evaluation process. ```python from evaluate import EvaluationSuite suite = EvaluationSuite.load('mathemakitten/sentiment-evaluation-suite') results = suite.run("huggingface/prunebert-base-uncased-6-finepruned-w-distil-mnli") ``` -------------------------------- ### Perplexity Output Example Source: https://github.com/huggingface/evaluate/blob/main/measurements/perplexity/README.md Example of the dictionary output format for perplexity scores, including individual text perplexities and the average perplexity. ```json {"perplexities": [8.182524681091309, 33.42122268676758, 27.012239456176758], "mean_perplexity": 22.871995608011883} ``` -------------------------------- ### List available comparison modules with details Source: https://github.com/huggingface/evaluate/blob/main/docs/source/a_quick_tour.mdx Use `evaluate.list_evaluation_modules` to find available comparison modules. Set `include_community=False` to exclude community metrics and `with_details=True` to retrieve additional information like 'likes'. ```python >>> evaluate.list_evaluation_modules( ... module_type="comparison", ... include_community=False, ... 
with_details=True) [{'name': 'mcnemar', 'type': 'comparison', 'community': False, 'likes': 1}, {'name': 'exact_match', 'type': 'comparison', 'community': False, 'likes': 0}] ``` -------------------------------- ### Build Documentation Source: https://github.com/huggingface/evaluate/blob/main/docs/README.md Generates the documentation files using the doc-builder tool. The --build_dir flag specifies the output directory. ```bash doc-builder build transformers docs/source/ --build_dir ~/tmp/test-build ``` -------------------------------- ### Compute Maximal Values for MRPC/QQP Source: https://github.com/huggingface/evaluate/blob/main/metrics/glue/README.md Demonstrates computing the GLUE metric for the 'mrpc' or 'qqp' subsets, which output 'accuracy' and 'f1' scores. This example shows the maximal possible values. ```python from evaluate import load glue_metric = load('glue', 'mrpc') # 'mrpc' or 'qqp' references = [0, 1] predictions = [0, 1] results = glue_metric.compute(predictions=predictions, references=references) print(results) ``` -------------------------------- ### Download and Extract Sample Data Source: https://github.com/huggingface/evaluate/blob/main/metrics/rl_reliability/README.md Download and extract the sample dataset for testing RL reliability metrics. ```bash wget https://storage.googleapis.com/rl-reliability-metrics/data/tf_agents_example_csv_dataset.tgz tar -xvzf tf_agents_example_csv_dataset.tgz ``` -------------------------------- ### Recall Metric Output Example (Multiclass) Source: https://github.com/huggingface/evaluate/blob/main/metrics/recall/README.md An example of the output array for the recall metric when used in a multiclass classification context, showing scores for each class. 
```python {'recall': array([1., 0., 0.])} ``` -------------------------------- ### Training a Seq2Seq Model with Evaluation Source: https://github.com/huggingface/evaluate/blob/main/docs/source/transformers_integrations.mdx Demonstrates how to set up and train a Seq2Seq model using the Seq2SeqTrainer, with evaluation performed at the end of each epoch. Requires a pre-trained model, tokenizer, data collator, and a metrics computation function. ```python from transformers import AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer, DataCollatorForSeq2Seq model = AutoModelForSeq2SeqLM.from_pretrained("t5-small") data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model) training_args = Seq2SeqTrainingArguments( output_dir="./results", eval_strategy="epoch", learning_rate=2e-5, per_device_train_batch_size=16, per_device_eval_batch_size=4, weight_decay=0.01, save_total_limit=3, num_train_epochs=2, fp16=True, predict_with_generate=True ) trainer = Seq2SeqTrainer( model=model, args=training_args, train_dataset=tokenized_billsum["train"], eval_dataset=tokenized_billsum["test"], tokenizer=tokenizer, data_collator=data_collator, compute_metrics=compute_metrics ) trainer.train() ``` -------------------------------- ### BLEU Output Example Source: https://github.com/huggingface/evaluate/blob/main/metrics/bleu/README.md An example of the output format for the BLEU score computation. The score ranges from 0 to 1, indicating similarity to reference texts. ```python {'bleu': 1.0, 'precisions': [1.0, 1.0, 1.0, 1.0], 'brevity_penalty': 1.0, 'length_ratio': 1.1666666666666667, 'translation_length': 7, 'reference_length': 6} ``` -------------------------------- ### SacreBLEU Output Format Example Source: https://github.com/huggingface/evaluate/blob/main/metrics/sacrebleu/README.md Illustrates the detailed output format of the SacreBLEU computation, including score, counts, totals, precisions, brevity penalty, and lengths. 
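The fields in this output are tied together by the standard BLEU formula: each precision is `counts[i] / totals[i]` (as a percentage), and the score is the brevity penalty times the geometric mean of the precisions. The sketch below recomputes the score from the other fields. `bleu_from_stats` is a hypothetical helper, and it deliberately omits the smoothing that SacreBLEU can apply when some n-gram count is zero.

```python
import math


def bleu_from_stats(counts, totals, sys_len, ref_len):
    """Recompute BLEU (0-100 scale) from n-gram match statistics, without smoothing."""
    precisions = [100.0 * c / t for c, t in zip(counts, totals)]
    # Brevity penalty only applies when the system output is shorter than the reference.
    bp = 1.0 if sys_len >= ref_len else math.exp(1 - ref_len / sys_len)
    if min(precisions) <= 0:
        return 0.0, precisions, bp
    log_mean = sum(math.log(p) for p in precisions) / len(precisions)
    return bp * math.exp(log_mean), precisions, bp


# Statistics copied from the example output in this section.
score, precisions, bp = bleu_from_stats(
    counts=[6, 4, 2, 1], totals=[10, 8, 6, 4], sys_len=10, ref_len=7
)
print(score, precisions, bp)
```

The example output itself: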
```python {'score': 39.76353643835252, 'counts': [6, 4, 2, 1], 'totals': [10, 8, 6, 4], 'precisions': [60.0, 50.0, 33.333333333333336, 25.0], 'bp': 1.0, 'sys_len': 10, 'ref_len': 7} ``` -------------------------------- ### Seqeval Minimal Match Example Source: https://github.com/huggingface/evaluate/blob/main/metrics/seqeval/README.md Illustrates computing seqeval metrics when there are no matches between predictions and references. Load the 'seqeval' metric and provide completely different predictions and references. ```python >>> seqeval = evaluate.load('seqeval') >>> predictions = [['O', 'B-MISC', 'I-MISC'], ['B-PER', 'I-PER', 'O']] >>> references = [['B-MISC', 'O', 'O'], ['I-PER', '0', 'I-PER']] >>> results = seqeval.compute(predictions=predictions, references=references) >>> print(results) {'MISC': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 1}, 'PER': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 2}, '_': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 1}, 'overall_precision': 0.0, 'overall_recall': 0.0, 'overall_f1': 0.0, 'overall_accuracy': 0.0} ``` -------------------------------- ### Autodoc for Model Configuration Source: https://github.com/huggingface/evaluate/blob/main/docs/README.md Use this syntax to include all public methods of a configuration class in the documentation. ```markdown ## XXXConfig [[autodoc]] XXXConfig ``` -------------------------------- ### TER Metric Usage Source: https://github.com/huggingface/evaluate/blob/main/metrics/ter/README.md Example of how to load and compute the TER metric using predictions and references. ```APIDOC ## TER Metric Usage ### Description This section demonstrates how to load and compute the TER metric. It requires predictions and corresponding references. ### Method `evaluate.load("ter").compute(...)` ### Parameters #### Path Parameters None #### Query Parameters None #### Request Body - **`predictions`** (list of str) - The system stream (a sequence of segments).
- **`references`** (list of list of str) - A list of one or more reference streams (each a sequence of segments). - **`normalized`** (boolean) - If `True`, applies basic tokenization and normalization to sentences. Defaults to `False`. - **`ignore_punct`** (boolean) - If `True`, removes punctuation from sentences. Defaults to `False`. - **`support_zh_ja_chars`** (boolean) - If `True`, tokenization/normalization supports processing of Chinese characters, as well as Japanese Kanji, Hiragana, Katakana, and Phonetic Extensions of Katakana. Only applies if `normalized = True`. Defaults to `False`. - **`case_sensitive`** (boolean) - If `False`, makes all predictions and references lowercase to ignore differences in case. Defaults to `False`. ### Request Example ```python import evaluate predictions = ["does this sentence match??", "what about this sentence?", "What did the TER metric user say to the developer?"] references = [["does this sentence match", "does this sentence match!?im"], ["wHaT aBoUt ThIs SeNtEnCe?", "wHaT aBoUt ThIs SeNtEnCe?"], ["Your jokes are...", "...TERrible"]] ter = evaluate.load("ter") results = ter.compute(predictions=predictions, references=references, case_sensitive=True) print(results) ``` ### Response #### Success Response (200) - **`score`** (float) - TER score (num_edits / sum_ref_lengths * 100) - **`num_edits`** (int) - The cumulative number of edits - **`ref_length`** (float) - The cumulative average reference length #### Response Example ```json { "score": 150.0, "num_edits": 15, "ref_length": 10.0 } ``` ### Error Handling Scores can be 0 or above. 0 is a perfect score. Scores above 100 indicate that the cumulative number of edits is higher than the cumulative length of the references.
``` -------------------------------- ### FrugalScore Output Example Source: https://github.com/huggingface/evaluate/blob/main/metrics/frugalscore/README.md The output of FrugalScore is a dictionary containing a list of scores for each prediction-reference pair. ```python >>> results = frugalscore.compute(predictions=['hello there', 'huggingface'], references=['hello world', 'hugging face'], batch_size=16, max_length=64, device="gpu") {'scores': [0.6307541, 0.6449357]} ``` -------------------------------- ### F1 Score Binary Example Source: https://github.com/huggingface/evaluate/blob/main/metrics/f1/README.md A simple binary F1 score calculation with predictions and references. ```python >>> f1_metric = evaluate.load("f1") >>> results = f1_metric.compute(references=[0, 1, 0, 1, 0], predictions=[0, 0, 1, 1, 0]) >>> print(results) {'f1': 0.5} ``` -------------------------------- ### Load XTREME-S Metric Source: https://github.com/huggingface/evaluate/blob/main/metrics/xtreme_s/README.md Load the XTREME-S metric for a specific subset of the benchmark. Ensure you have the 'evaluate' library installed. ```python import evaluate xtreme_s_metric = evaluate.load('xtreme_s', 'mls') ``` -------------------------------- ### Load Competition MATH Metric Source: https://github.com/huggingface/evaluate/blob/main/metrics/competition_math/README.md Loads the competition_math metric from the evaluate library. Ensure the 'math_equivalence' dependency is installed. ```python from evaluate import load math = load("competition_math") references = ["\\frac{1}{2}"] predictions = ["1/2"] ``` -------------------------------- ### Seqeval Partial Match Example Source: https://github.com/huggingface/evaluate/blob/main/metrics/seqeval/README.md Shows how to compute seqeval metrics with partial matches between predictions and references. Load the 'seqeval' metric and provide predictions and references with some overlapping and some differing elements. 
```python >>> seqeval = evaluate.load('seqeval') >>> predictions = [['O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']] >>> references = [['O', 'O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']] >>> results = seqeval.compute(predictions=predictions, references=references) >>> print(results) {'MISC': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 1}, 'PER': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 1}, 'overall_precision': 0.5, 'overall_recall': 0.5, 'overall_f1': 0.5, 'overall_accuracy': 0.8} ``` -------------------------------- ### Example: No Match CER Source: https://github.com/huggingface/evaluate/blob/main/metrics/cer/README.md Demonstrates calculating CER when there is no match between predictions and references. For this pair the score is 1.0: the strings share no characters, so all 7 reference characters count as errors. CER is not capped at 1.0 and can be higher when the prediction is much longer than the reference. ```python from evaluate import load cer = load("cer") predictions = ["hello"] references = ["gracias"] cer_score = cer.compute(predictions=predictions, references=references) print(cer_score) ``` -------------------------------- ### Create a New Evaluation Module Source: https://github.com/huggingface/evaluate/blob/main/docs/source/creating_and_sharing.mdx Use the evaluate-cli to scaffold a new evaluation module, creating a Space on the Hub and cloning it locally. ```bash evaluate-cli create "My Metric" --module_type "metric" ``` -------------------------------- ### ROUGE Output with use_aggregator=True Source: https://github.com/huggingface/evaluate/blob/main/metrics/rouge/README.md Example of ROUGE metric output when `use_aggregator` is set to `True`. The output is a dictionary of aggregated scores. ```python {'rouge1': 1.0, 'rouge2': 1.0} ``` -------------------------------- ### Competition MATH Metric - Partial Match Example Source: https://github.com/huggingface/evaluate/blob/main/metrics/competition_math/README.md Demonstrates a partial match scenario with multiple references and predictions.
Accuracy is calculated based on the proportion of correct matches. ```python from evaluate import load math = load("competition_math") references = ["\\frac{1}{2}","\\frac{3}{4}"] predictions = ["1/5", "3/4"] results = math.compute(references=references, predictions=predictions) print(results) ``` -------------------------------- ### XNLI Minimal Accuracy Example Source: https://github.com/huggingface/evaluate/blob/main/metrics/xnli/README.md Demonstrates computing the XNLI metric with predictions that are completely incorrect, resulting in an accuracy of 0.0. ```python >>> from evaluate import load >>> xnli_metric = load("xnli") >>> predictions = [1, 0] >>> references = [0, 1] >>> results = xnli_metric.compute(predictions=predictions, references=references) >>> print(results) {'accuracy': 0.0} ``` -------------------------------- ### Example: Perfect Match CER Source: https://github.com/huggingface/evaluate/blob/main/metrics/cer/README.md Demonstrates calculating CER when predictions perfectly match references, resulting in a score of 0.0. ```python from evaluate import load cer = load("cer") predictions = ["hello world", "good night moon"] references = ["hello world", "good night moon"] cer_score = cer.compute(predictions=predictions, references=references) print(cer_score) ``` -------------------------------- ### Single Value Return Block Source: https://github.com/huggingface/evaluate/blob/main/docs/README.md Example of how to format a docstring for a single return value, specifying its type and a brief explanation. ```python Returns: `List[int]`: A list of integers in the range [0, 1] --- 1 for a special token, 0 for a sequence token. ``` -------------------------------- ### Get Model ID to Label Mapping Source: https://github.com/huggingface/evaluate/blob/main/measurements/toxicity/README.md Retrieves the mapping from numerical labels to their string representations for a given classification model. 
```python >>> from transformers import AutoModelForSequenceClassification >>> model = AutoModelForSequenceClassification.from_pretrained("DaNLP/da-electra-hatespeech-detection") >>> model.config.id2label {0: 'not offensive', 1: 'offensive'} ``` -------------------------------- ### ROC AUC Score Output Example Source: https://github.com/huggingface/evaluate/blob/main/metrics/roc_auc/README.md The standard output format for ROC AUC scores, typically a dictionary with a float value. ```python { 'roc_auc': 0.778 } ``` -------------------------------- ### Evaluating an Existing Model with Trainer Source: https://github.com/huggingface/evaluate/blob/main/docs/source/transformers_integrations.mdx Shows how to evaluate a pre-trained model using the Trainer's evaluate method. This is an alternative to training when only model assessment is needed. ```python trainer.evaluate() ``` -------------------------------- ### ROUGE Calculation With Aggregation Source: https://github.com/huggingface/evaluate/blob/main/metrics/rouge/README.md Example of computing ROUGE scores with aggregation enabled. This provides a single, averaged score across all predictions. ```python >>> rouge = evaluate.load('rouge') >>> predictions = ["hello goodbye", "ankh morpork"] >>> references = ["goodbye", "general kenobi"] >>> results = rouge.compute(predictions=predictions, ... references=references, ... use_aggregator=True) >>> print(list(results.keys())) ['rouge1', 'rouge2', 'rougeL', 'rougeLsum'] >>> print(results["rouge1"]) 0.25 ```
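--------------------------------

### Sketch: Reproducing the CER Example Values by Hand

The CER examples in this collection describe a score of 0.0 for a perfect match and 1.0 for `"hello"` against `"gracias"`. The following is a minimal sketch of where those values could come from, assuming CER is the total character-level edit distance divided by the total number of reference characters. The helper names `levenshtein` and `cer_sketch` are hypothetical and are not part of the `evaluate` API; the loaded `cer` metric may preprocess its input differently.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (hypothetical helper, not library code)."""
    prev = list(range(len(b) + 1))  # distance from the empty prefix of a to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # drop ca
                curr[j - 1] + 1,     # add cb
                prev[j - 1] + cost,  # substitute, or keep on a match
            ))
        prev = curr
    return prev[-1]


def cer_sketch(predictions: List[str], references: List[str]) -> float:
    """Assumed definition: total edits divided by total reference characters."""
    edits = sum(levenshtein(p, r) for p, r in zip(predictions, references))
    ref_chars = sum(len(r) for r in references)
    return edits / ref_chars


print(cer_sketch(["hello"], ["gracias"]))  # 1.0
print(cer_sketch(["hello world", "good night moon"],
                 ["hello world", "good night moon"]))  # 0.0
```

Because the denominator is the reference length, a prediction much longer than its reference pushes this ratio above 1.0.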