### Install Libraries

Source: https://github.com/huggingface/evaluate/blob/main/docs/source/transformers_integrations.mdx

Install the necessary libraries for running Transformers examples.

```bash
pip install datasets transformers torch evaluate nltk rouge_score
```

--------------------------------

### Verify 🤗 Evaluate Installation (pip)

Source: https://github.com/huggingface/evaluate/blob/main/docs/source/installation.mdx

Run this command to confirm that 🤗 Evaluate has been installed correctly and is functional. It loads the 'exact_match' metric and computes a simple example.

```bash
python -c "import evaluate; print(evaluate.load('exact_match').compute(references=['hello'], predictions=['hello']))"
```

--------------------------------

### Install Spacytextblob and Download Corpora

Source: https://github.com/huggingface/evaluate/blob/main/docs/source/custom_evaluator.mdx

Installs the `spacytextblob` library, then downloads the NLTK corpora used by TextBlob and the `en_core_web_sm` spaCy model. The spaCy sentiment analysis example requires all three.

```bash
pip install spacytextblob
python -m textblob.download_corpora
python -m spacy download en_core_web_sm
```

--------------------------------

### Install Scikit-Learn

Source: https://github.com/huggingface/evaluate/blob/main/docs/source/sklearn_integrations.mdx

Install the scikit-learn library to use its estimators and pipelines.

```bash
pip install -U scikit-learn
```

--------------------------------

### Install Documentation Dependencies

Source: https://github.com/huggingface/evaluate/blob/main/CONTRIBUTING.md

Install all dependencies required to build the documentation. This ensures that the doc-builder can function correctly.

```bash
pip install ".[docs]"
```

--------------------------------

### Install Documentation Dependencies

Source: https://github.com/huggingface/evaluate/blob/main/docs/README.md

Installs the necessary packages for building the documentation, including the evaluate package with its documentation extras.

```bash
pip install -e ".[docs]"
```

--------------------------------

### Install Doc Builder Tool

Source: https://github.com/huggingface/evaluate/blob/main/docs/README.md

Installs `doc-builder`, the tool used to build the documentation, directly from its GitHub repository.

```bash
pip install git+https://github.com/huggingface/doc-builder
```

--------------------------------

### Install Hugging Face Evaluate with pip

Source: https://github.com/huggingface/evaluate/blob/main/README.md

Install the Hugging Face Evaluate library using pip. It is recommended to install within a virtual environment.

```bash
pip install evaluate
```

--------------------------------

### CharCut Output Example

Source: https://github.com/huggingface/evaluate/blob/main/metrics/charcut_mt/README.md

Example of the output format for the CharCut metric computation.

```json
{"charcut_mt": 0.1971153846153846}
```

--------------------------------

### Install Development Environment

Source: https://github.com/huggingface/evaluate/blob/main/CONTRIBUTING.md

Install the project in editable mode with development dependencies. This command should be run within a virtual environment.

```bash
pip install -e ".[dev]"
```

--------------------------------

### Install Evaluate with Template for New Metrics

Source: https://github.com/huggingface/evaluate/blob/main/README.md

Install the necessary dependencies to create a new metric using the evaluate library. This includes the 'template' extra.

```bash
pip install "evaluate[template]"
```

--------------------------------

### Example: Compute Perplexity on Custom Predictions

Source: https://github.com/huggingface/evaluate/blob/main/metrics/perplexity/README.md

Demonstrates calculating perplexity on a custom list of input texts using the 'gpt2' model, with the start token disabled.

```python
import evaluate

perplexity = evaluate.load("perplexity", module_type="metric")
input_texts = ["lorem ipsum", "Happy Birthday!", "Bienvenue"]
results = perplexity.compute(model_id='gpt2',
                             add_start_token=False,
                             predictions=input_texts)
print(list(results.keys()))
# ['perplexities', 'mean_perplexity']
print(round(results["mean_perplexity"], 2))
# 646.75
# 'perplexities' holds one score per input text, so index it before rounding
print(round(results["perplexities"][0], 2))
# 32.25
```
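To make the two output keys concrete, here is a minimal sketch of the arithmetic, assuming perplexity is the exponential of the mean negative log-likelihood per token and that `mean_perplexity` is the plain average across texts. The token probabilities are made-up values, not gpt2 outputs, and `perplexity_from_logprobs` is a hypothetical helper, not part of 🤗 Evaluate.

```python
import math

def perplexity_from_logprobs(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# Hypothetical natural-log token probabilities for two texts (not real model output).
texts_logprobs = [
    [math.log(0.5), math.log(0.25)],
    [math.log(0.1), math.log(0.1), math.log(0.1)],
]

perplexities = [perplexity_from_logprobs(lp) for lp in texts_logprobs]
results = {
    "perplexities": perplexities,  # one score per text
    "mean_perplexity": sum(perplexities) / len(perplexities),
}
print(list(results.keys()))
```

A text whose tokens each have probability 0.1 gets a perplexity of 10, which is why lower values indicate text the model finds more predictable.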

--------------------------------

### Clone and Install 🤗 Evaluate from Source

Source: https://github.com/huggingface/evaluate/blob/main/docs/source/installation.mdx

Clone the 🤗 Evaluate repository from GitHub and install it in editable mode. This method is useful for development or when contributing to the library.

```bash
git clone https://github.com/huggingface/evaluate.git
cd evaluate
pip install -e .
```

--------------------------------

### CUAD Metric: Partial Match Example

Source: https://github.com/huggingface/evaluate/blob/main/metrics/cuad/README.md

Illustrates the CUAD metric computation for a partial match scenario. Ensure the 'evaluate' library is installed and the 'cuad' metric is loaded.

```python
from evaluate import load
cuad_metric = load("cuad")
# Only the second predicted span matches a reference answer, so the match is partial.
predictions = [{'prediction_text': ['The Company appoints the Distributor as an exclusive distributor of Products in the Market, subject to the terms and conditions of this Agreement.', 'The buyer/End-User: Shenzhen LOHAS Supply Chain Management Co., Ltd.'], 'id': 'LohaCompanyltd_20191209_F-1_EX-10.16_11917878_EX-10.16_Supply Agreement__Parties'}]
# Assumed reference answers for the same id; the original snippet left `references` undefined.
references = [{'answers': {'answer_start': [143, 49], 'text': ['The seller:', 'The buyer/End-User: Shenzhen LOHAS Supply Chain Management Co., Ltd.']}, 'id': 'LohaCompanyltd_20191209_F-1_EX-10.16_11917878_EX-10.16_Supply Agreement__Parties'}]
results = cuad_metric.compute(predictions=predictions, references=references)
print(results)
```

--------------------------------

### METEOR Output Example

Source: https://github.com/huggingface/evaluate/blob/main/metrics/meteor/README.md

Example of the output dictionary from the METEOR metric computation, showing the 'meteor' score.

```python
{'meteor': 0.9999142661179699}
```

--------------------------------

### CUAD Metric: Maximal Values Example

Source: https://github.com/huggingface/evaluate/blob/main/metrics/cuad/README.md

Demonstrates the computation of the CUAD metric with maximal prediction and reference values. Ensure the 'evaluate' library is installed and the 'cuad' metric is loaded.

```python
from evaluate import load
cuad_metric = load("cuad")
predictions = [{'prediction_text': ['The seller:', 'The buyer/End-User: Shenzhen LOHAS Supply Chain Management Co., Ltd.'], 'id': 'LohaCompanyltd_20191209_F-1_EX-10.16_11917878_EX-10.16_Supply Agreement__Parties'}]
references = [{'answers': {'answer_start': [143, 49], 'text': ['The seller:', 'The buyer/End-User: Shenzhen LOHAS Supply Chain Management Co., Ltd.']}, 'id': 'LohaCompanyltd_20191209_F-1_EX-10.16_11917878_EX-10.16_Supply Agreement__Parties'}]
results = cuad_metric.compute(predictions=predictions, references=references)
print(results)
```

--------------------------------

### Precision Output Example (Binary)

Source: https://github.com/huggingface/evaluate/blob/main/metrics/precision/README.md

An example of the output format for the precision metric when computed for a binary classification task.

```python
{'precision': 0.2222222222222222}
```

--------------------------------

### Example Output of Label Distribution

Source: https://github.com/huggingface/evaluate/blob/main/measurements/label_distribution/README.md

This is an example of the output dictionary returned by the label_distribution measurement, showing the distribution of labels and the calculated label skew.

```python
>>> {'label_distribution': {'labels': [1, 0, 2], 'fractions': [0.1, 0.6, 0.3]}, 'label_skew': 0.7417688338666573}
```
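The `fractions` part of that output can be sketched in a few lines, assuming each fraction is the label's count divided by the number of examples and that labels are listed in first-seen order. The label column below is hypothetical, `label_fractions` is an illustrative helper, and the skew value is not reproduced here.

```python
from collections import Counter

def label_fractions(labels):
    """Return unique labels (first-seen order) and their share of the data."""
    counts = Counter(labels)  # Counter keeps first-insertion order
    total = len(labels)
    return {
        "labels": list(counts.keys()),
        "fractions": [count / total for count in counts.values()],
    }

# Hypothetical label column: one 1, six 0s, three 2s.
data = [1, 0, 2, 2, 0, 0, 0, 0, 0, 2]
distribution = label_fractions(data)
print(distribution)
```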

--------------------------------

### CharacTER Metric Output Example

Source: https://github.com/huggingface/evaluate/blob/main/metrics/character/README.md

An example of the output structure when computing the CharacTER metric.

```python
{
    'count': 2,
    'mean': 0.3127282211789254,
    'median': 0.3127282211789254,
    'std': 0.07561653111280243,
    'min': 0.25925925925925924,
    'max': 0.36619718309859156,
    'cer_scores': [0.36619718309859156, 0.25925925925925924]
}
```

--------------------------------

### Multi-line Code Block Example

Source: https://github.com/huggingface/evaluate/blob/main/docs/README.md

Demonstrates how to format multi-line code blocks using triple backticks, suitable for doctest examples.

````markdown
```
# first line of code
# second line
# etc
```
````

--------------------------------

### Download NLTK Resources

Source: https://github.com/huggingface/evaluate/blob/main/docs/source/creating_and_sharing.mdx

Example of downloading specific NLTK resources, such as 'punkt_tab', within the _download_and_prepare method.

```python
def _download_and_prepare(self, dl_manager):
    import nltk
    nltk.download("punkt_tab")
```

--------------------------------

### SARI Metric Output Example

Source: https://github.com/huggingface/evaluate/blob/main/metrics/sari/README.md

Shows the expected dictionary output format for the SARI metric computation.

```python
print(sari_score)
{'sari': 26.953601953601954}
```

--------------------------------

### Recall Metric Output Example (Binary)

Source: https://github.com/huggingface/evaluate/blob/main/metrics/recall/README.md

An example of the output dictionary for the recall metric when used in a binary classification context.

```python
{'recall': 1.0}
```

--------------------------------

### Minimal TREC Eval Example

Source: https://github.com/huggingface/evaluate/blob/main/metrics/trec_eval/README.md

A minimal example demonstrating how to use the TREC Eval metric with sample qrel and run data. It shows the expected structure for predictions and references.

```python
import evaluate

qrel = {
    "query": [0],
    "q0": ["q0"],
    "docid": ["doc_1"],
    "rel": [2]
}
run = {
    "query": [0, 0],
    "q0": ["q0", "q0"],
    "docid": ["doc_2", "doc_1"],
    "rank": [0, 1],
    "score": [1.5, 1.2],
    "system": ["test", "test"]
}

trec_eval = evaluate.load("trec_eval")
results = trec_eval.compute(references=[qrel], predictions=[run])
results["P@5"]
```
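For intuition about the `P@5` key, the sketch below computes precision at k by hand for a single query with the same shape of data. It assumes documents are ranked by descending score, that any `rel` above 0 counts as relevant, and that the denominator is always k; `precision_at_k` is a hypothetical helper, not the metric's implementation.

```python
def precision_at_k(run, qrel, k=5):
    """Fraction of the top-k ranked documents that are judged relevant."""
    relevant = {doc for doc, rel in zip(qrel["docid"], qrel["rel"]) if rel > 0}
    # Rank documents by descending score (a simplifying assumption).
    ranked = [doc for _, doc in sorted(zip(run["score"], run["docid"]), reverse=True)]
    hits = sum(1 for doc in ranked[:k] if doc in relevant)
    return hits / k  # divide by k even if fewer than k documents were returned

qrel = {"query": [0], "q0": ["q0"], "docid": ["doc_1"], "rel": [2]}
run = {
    "query": [0, 0],
    "q0": ["q0", "q0"],
    "docid": ["doc_2", "doc_1"],
    "rank": [0, 1],
    "score": [1.5, 1.2],
    "system": ["test", "test"],
}

p_at_5 = precision_at_k(run, qrel, k=5)
print(p_at_5)
```

One of the two returned documents is relevant, so under these assumptions the value is 1 hit out of 5 positions.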

--------------------------------

### MASE Metric Output Example

Source: https://github.com/huggingface/evaluate/blob/main/metrics/mase/README.md

Example of the MASE metric's output dictionary when using the default 'uniform_average' multioutput configuration.

```python
>>> print(results)
{'mase': 0.1833...}
```

--------------------------------

### Precision Output Example (Multiclass)

Source: https://github.com/huggingface/evaluate/blob/main/metrics/precision/README.md

An example of the output format for the precision metric when computed for a multiclass classification task, returning scores for each class.

```python
{'precision': array([0.66666667, 0.0, 0.0])}
```

--------------------------------

### XNLI Partial Match Example

Source: https://github.com/huggingface/evaluate/blob/main/metrics/xnli/README.md

Demonstrates computing the XNLI metric with a mix of correct and incorrect predictions, resulting in a partial accuracy.

```python
>>> from evaluate import load
>>> xnli_metric = load("xnli")
>>> predictions = [1, 0, 1]
>>> references = [1, 0, 0]
>>> results = xnli_metric.compute(predictions=predictions, references=references)
>>> print(results)
{'accuracy': 0.6666666666666666}
```
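The reported value is plain accuracy: the share of predictions equal to their reference. A minimal sketch of that arithmetic, with `accuracy` as an illustrative helper rather than the metric's own code:

```python
def accuracy(predictions, references):
    """Share of positions where the prediction equals the reference."""
    matches = sum(pred == ref for pred, ref in zip(predictions, references))
    return matches / len(references)

# Two of the three predictions agree with the references.
results = {"accuracy": accuracy([1, 0, 1], [1, 0, 0])}
print(results)
```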

--------------------------------

### CUAD Metric: Minimal Values Example

Source: https://github.com/huggingface/evaluate/blob/main/metrics/cuad/README.md

Shows how to compute the CUAD metric with minimal prediction and reference values. This example requires the 'evaluate' library and the 'cuad' metric to be loaded.

```python
from evaluate import load
cuad_metric = load("cuad")
predictions = [{'prediction_text': ['The Company appoints the Distributor as an exclusive distributor of Products in the Market, subject to the terms and conditions of this Agreement.'], 'id': 'LIMEENERGYCO_09_09_1999-EX-10-DISTRIBUTOR AGREEMENT__Exclusivity_0'}]
references = [{'answers': {'answer_start': [143], 'text': ['The seller']}, 'id': 'LIMEENERGYCO_09_09_1999-EX-10-DISTRIBUTOR AGREEMENT__Exclusivity_0'}]
results = cuad_metric.compute(predictions=predictions, references=references)
print(results)
```

--------------------------------

### Download and Extract Resources

Source: https://github.com/huggingface/evaluate/blob/main/docs/source/creating_and_sharing.mdx

Implement resource downloading and extraction within the _download_and_prepare method using the dl_manager. This example shows downloading a checkpoint for BLEURT.

```python
def _download_and_prepare(self, dl_manager):
    model_path = dl_manager.download_and_extract(CHECKPOINT_URLS[self.config_name])
    self.scorer = score.BleurtScorer(os.path.join(model_path, self.config_name))
```

--------------------------------

### Text Duplicates Output Example (Default)

Source: https://github.com/huggingface/evaluate/blob/main/measurements/text_duplicates/README.md

Illustrates the default output format of the text_duplicates measurement, showing only the duplicate fraction.

```python
{
'duplicate_fraction': 0.33333333333333337
}
```

--------------------------------

### Text Duplicates Example with Multiple Duplicates and List Duplicates

Source: https://github.com/huggingface/evaluate/blob/main/measurements/text_duplicates/README.md

An example demonstrating the text_duplicates measurement with multiple duplicate strings and the `list_duplicates=True` option enabled. This output includes both the fraction and a dictionary of duplicate strings with their counts.

```python
>>> import evaluate
>>> data = ["hello sun", "goodbye moon", "hello sun", "foo bar", "foo bar"]
>>> duplicates = evaluate.load("text_duplicates")
>>> results = duplicates.compute(data=data, list_duplicates=True)
>>> print(results)
{'duplicate_fraction': 0.4, 'duplicates_dict': {'hello sun': 2, 'foo bar': 2}}
```
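Both output keys can be reproduced with a counter, assuming `duplicate_fraction` is one minus the ratio of unique strings to total strings and `duplicates_dict` keeps only strings seen more than once. This is a sketch under those assumptions; `text_duplicates` below is a stand-in function, not the measurement's implementation.

```python
from collections import Counter

def text_duplicates(data, list_duplicates=False):
    """Fraction of entries that repeat an earlier string, optionally with counts."""
    counts = Counter(data)
    results = {"duplicate_fraction": 1 - len(counts) / len(data)}
    if list_duplicates:
        # Keep only strings that occur more than once.
        results["duplicates_dict"] = {text: n for text, n in counts.items() if n > 1}
    return results

data = ["hello sun", "goodbye moon", "hello sun", "foo bar", "foo bar"]
results = text_duplicates(data, list_duplicates=True)
print(results)
```

Three unique strings out of five give a fraction of 1 - 3/5.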

--------------------------------

### F1 Score Binary Example with Sample Weights

Source: https://github.com/huggingface/evaluate/blob/main/metrics/f1/README.md

Shows how to compute the F1 score with sample weights applied to the predictions and references.

```python
>>> import evaluate
>>> f1_metric = evaluate.load("f1")
>>> results = f1_metric.compute(references=[0, 1, 0, 1, 0], predictions=[0, 0, 1, 1, 0], sample_weight=[0.9, 0.5, 3.9, 1.2, 0.3])
>>> print(round(results['f1'], 2))
0.35
```
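To see where 0.35 comes from, the sketch below recomputes the score by hand, assuming each sample contributes its weight (instead of 1) to the true-positive, false-positive, and false-negative totals. `weighted_f1` is an illustrative helper, not the library's code.

```python
def weighted_f1(references, predictions, sample_weight, pos_label=1):
    """Binary F1 where every sample counts with its weight."""
    tp = fp = fn = 0.0
    for ref, pred, weight in zip(references, predictions, sample_weight):
        if pred == pos_label and ref == pos_label:
            tp += weight
        elif pred == pos_label:
            fp += weight
        elif ref == pos_label:
            fn += weight
    return 2 * tp / (2 * tp + fp + fn)

f1 = weighted_f1(
    references=[0, 1, 0, 1, 0],
    predictions=[0, 0, 1, 1, 0],
    sample_weight=[0.9, 0.5, 3.9, 1.2, 0.3],
)
print(round(f1, 2))  # 0.35
```

Here the weighted totals are tp = 1.2, fp = 3.9, and fn = 0.5, so the score is 2.4 / 6.8. The heavily weighted false positive pulls it well below the unweighted value of 0.5.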

--------------------------------

### SQuAD v2 Metric: Partial Match Example

Source: https://github.com/huggingface/evaluate/blob/main/metrics/squad_v2/README.md

Calculates the exact match and F1 scores when the prediction partially matches the reference answers. This example demonstrates a scenario with 2 out of 3 answers being correct.

```python
from evaluate import load
squad_metric = load("squad_v2")
predictions = [{'prediction_text': '1976', 'id': '56e10a3be3433e1400422b22', 'no_answer_probability': 0.}, {'prediction_text': 'Beyonce', 'id': '56d2051ce7d4791d0090260b', 'no_answer_probability': 0.},  {'prediction_text': 'climate change', 'id': '5733b5344776f419006610e1', 'no_answer_probability': 0.}]
references = [{'answers': {'answer_start': [97], 'text': ['1976']}, 'id': '56e10a3be3433e1400422b22'}, {'answers': {'answer_start': [233], 'text': ['Beyoncé and Bruno Mars']}, 'id': '56d2051ce7d4791d0090260b'}, {'answers': {'answer_start': [891], 'text': ['climate change']}, 'id': '5733b5344776f419006610e1'}]
results = squad_metric.compute(predictions=predictions, references=references)
results
```
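The scores for this scenario can be approximated by hand. The sketch below assumes SQuAD-style answer normalization (lowercasing, removing punctuation and English articles) and token-overlap F1, and it ignores the no-answer handling that `no_answer_probability` supports. The helper names are illustrative, not the metric's API.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact(pred, gold):
    return float(normalize(pred) == normalize(gold))

def token_f1(pred, gold):
    pred_tokens, gold_tokens = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

# (prediction, reference) pairs taken from the example above.
pairs = [
    ("1976", "1976"),
    ("Beyonce", "Beyoncé and Bruno Mars"),
    ("climate change", "climate change"),
]
scores = {
    "exact": 100 * sum(exact(p, g) for p, g in pairs) / len(pairs),
    "f1": 100 * sum(token_f1(p, g) for p, g in pairs) / len(pairs),
}
print(scores)
```

Under these assumptions "Beyonce" shares no token with "Beyoncé and Bruno Mars" because the accent differs, so that pair scores 0 on both measures and the averages land at two thirds of 100.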

--------------------------------

### Load and Compute CharacTER Metric

Source: https://github.com/huggingface/evaluate/blob/main/metrics/character/README.md

Demonstrates how to load the CharacTER metric and compute scores for single or corpus examples.

```python
import evaluate
character = evaluate.load("character")

# Single hyp/ref 
preds = ["this week the saudis denied information published in the new york times"]
refs = ["saudi arabia denied this week information published in the american new york times"]
results = character.compute(references=refs, predictions=preds)

# Corpus example
preds = ["this week the saudis denied information published in the new york times",
         "this is in fact an estimate"]
refs = ["saudi arabia denied this week information published in the american new york times",
        "this is actually an estimate"]
results = character.compute(references=refs, predictions=preds)
```

--------------------------------

### Load and Compute GLUE Metric (sst2)

Source: https://github.com/huggingface/evaluate/blob/main/metrics/glue/README.md

Loads the GLUE metric for the 'sst2' subset and computes the results. This is a basic example demonstrating the core usage.

```python
from evaluate import load
glue_metric = load('glue', 'sst2')
references = [0, 1]
predictions = [0, 1]
results = glue_metric.compute(predictions=predictions, references=references)
```

--------------------------------

### Competition MATH Metric - Full Match Example

Source: https://github.com/huggingface/evaluate/blob/main/metrics/competition_math/README.md

Demonstrates a full match scenario where predictions exactly match references after canonicalization. Accuracy is 1.0.

```python
from evaluate import load
math = load("competition_math")
references = ["\\frac{1}{2}"]
predictions = ["1/2"]
results = math.compute(references=references, predictions=predictions)
print(results)
```

--------------------------------

### Text Duplicates Output Example (List Duplicates)

Source: https://github.com/huggingface/evaluate/blob/main/measurements/text_duplicates/README.md

Shows the output format when `list_duplicates=True` is used, including the duplicate strings and their counts.

```python
{
'duplicate_fraction': 0.33333333333333337,
'duplicates_dict': {'hello sun': 2}
}
```

--------------------------------

### XNLI Maximal Accuracy Example

Source: https://github.com/huggingface/evaluate/blob/main/metrics/xnli/README.md

Demonstrates computing the XNLI metric with predictions that perfectly match references, resulting in an accuracy of 1.0.

```python
>>> from evaluate import load
>>> xnli_metric = load("xnli")
>>> predictions = [0, 1]
>>> references = [0, 1]
>>> results = xnli_metric.compute(predictions=predictions, references=references)
>>> print(results)
{'accuracy': 1.0}
```

--------------------------------

### Compute Recall (Binary Classification)

Source: https://github.com/huggingface/evaluate/blob/main/metrics/recall/README.md

A simple example demonstrating how to compute the recall metric for binary classification. Ensure the 'recall' metric is loaded before computation.

```python
>>> import evaluate
>>> recall_metric = evaluate.load('recall')
>>> results = recall_metric.compute(references=[0, 0, 1, 1, 1], predictions=[0, 1, 0, 1, 1])
>>> print(results)
{'recall': 0.6666666666666666}
```

--------------------------------

### Precision Metric Computation

Source: https://github.com/huggingface/evaluate/blob/main/metrics/precision/README.md

Demonstrates how to load and compute the precision metric using the evaluate library. It shows a basic example with predictions and references.

````APIDOC
## Precision Metric Computation

### Description
This section provides an example of how to use the `evaluate` library to compute the precision metric. It illustrates the basic usage with sample `predictions` and `references`.

### Method
```python
import evaluate

precision_metric = evaluate.load("precision")
results = precision_metric.compute(references=[0, 1], predictions=[0, 1])
print(results)
```

### Inputs
- **predictions** (`list` of `int`): Predicted class labels.
- **references** (`list` of `int`): Actual class labels.
- **labels** (`list` of `int`): The set of labels to include when `average` is not set to `'binary'`. If `average` is `None`, it should be the label order. Labels present in the data can be excluded, for example to calculate a multiclass average ignoring a majority negative class. Labels not present in the data will result in 0 components in a macro average. For multilabel targets, labels are column indices. By default, all labels in `predictions` and `references` are used in sorted order. Defaults to None.
- **pos_label** (`int`): The class to be considered the positive class, in the case where `average` is set to `binary`. Defaults to 1.
- **average** (`string`): This parameter is required for multiclass/multilabel targets. If set to `None`, the scores for each class are returned. Otherwise, this determines the type of averaging performed on the data. Defaults to `'binary'`.
    - 'binary': Only report results for the class specified by `pos_label`. This is applicable only if the classes found in `predictions` and `references` are binary.
    - 'micro': Calculate metrics globally by counting the total true positives, false negatives and false positives.
    - 'macro': Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account.
    - 'weighted': Calculate metrics for each label, and find their average weighted by support (the number of true instances for each label). This alters `'macro'` to account for label imbalance. This option can result in an F-score that is not between precision and recall.
    - 'samples': Calculate metrics for each instance, and find their average (only meaningful for multilabel classification).
- **sample_weight** (`list` of `float`): Sample weights. Defaults to None.
- **zero_division** (`int` or `string`): Sets the value to return when there is a zero division. Defaults to `'warn'`.
    - 0: Returns 0 when there is a zero division.
    - 1: Returns 1 when there is a zero division.
    - 'warn': Raises warnings and then returns 0 when there is a zero division.

### Output Values
- **precision** (`float` or `array` of `float`): Precision score or list of precision scores, depending on the value passed to `average`. Minimum possible value is 0. Maximum possible value is 1. Higher values indicate that fewer negative examples were incorrectly labeled as positive, which means that, generally, higher scores are better.

### Output Example(s)
```python
{'precision': 0.2222222222222222}
```
```python
{'precision': array([0.66666667, 0.0, 0.0])}
```
````
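The difference between the `average` options is easiest to see on a small multiclass case. The following sketch computes per-class, macro, and micro precision by hand on hypothetical labels; it returns 0 for a class that is never predicted, which corresponds to `zero_division=0`. `per_class_precision` is an illustrative helper, not the library's implementation.

```python
def per_class_precision(references, predictions, labels):
    """For each label: correct predictions of that label / all predictions of it."""
    scores = []
    for label in labels:
        refs_where_predicted = [
            ref for ref, pred in zip(references, predictions) if pred == label
        ]
        correct = sum(1 for ref in refs_where_predicted if ref == label)
        scores.append(correct / len(refs_where_predicted) if refs_where_predicted else 0.0)
    return scores

# Hypothetical three-class labels.
references = [0, 1, 2, 0, 1, 2]
predictions = [0, 2, 1, 0, 0, 1]
labels = sorted(set(references) | set(predictions))

per_class = per_class_precision(references, predictions, labels)  # like average=None
macro = sum(per_class) / len(per_class)                           # like average='macro'
# For single-label data, micro precision is total correct / total predictions.
micro = sum(r == p for r, p in zip(references, predictions)) / len(predictions)

print(per_class, macro, micro)
```

Class 0 is predicted three times and is right twice, while classes 1 and 2 are never predicted correctly. The unweighted mean of the per-class scores is therefore lower than the micro score, which pools all predictions.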

--------------------------------

### Text Duplicates Example with No Duplicates

Source: https://github.com/huggingface/evaluate/blob/main/measurements/text_duplicates/README.md

Demonstrates the text_duplicates measurement with input data containing no duplicate strings. The duplicate fraction should be 0.0.

```python
>>> import evaluate
>>> data = ["foo", "bar", "foobar"]
>>> duplicates = evaluate.load("text_duplicates")
>>> results = duplicates.compute(data=data)
>>> print(results)
{'duplicate_fraction': 0.0}
```

--------------------------------

### Create and Navigate to Project Directory

Source: https://github.com/huggingface/evaluate/blob/main/docs/source/installation.mdx

Use these commands to create a new project directory and navigate into it.

```bash
mkdir ~/my-project
cd ~/my-project
```

--------------------------------

### Example: Compute Perplexity on Dataset Predictions

Source: https://github.com/huggingface/evaluate/blob/main/metrics/perplexity/README.md

Shows how to compute perplexity on a subset of text data loaded from the 'wikitext' dataset using the 'gpt2' model.

```python
import datasets
import evaluate

perplexity = evaluate.load("perplexity", module_type="metric")
input_texts = datasets.load_dataset("wikitext",
                                    "wikitext-2-raw-v1",
                                    split="test")["text"][:50]
# Drop empty lines, which cannot be scored
input_texts = [s for s in input_texts if s != '']
results = perplexity.compute(model_id='gpt2',
                             predictions=input_texts)
print(list(results.keys()))
# ['perplexities', 'mean_perplexity']
print(round(results["mean_perplexity"], 2))
# 576.76
# 'perplexities' holds one score per input text, so index it before rounding
print(round(results["perplexities"][0], 2))
# 889.28
```

--------------------------------

### Load and Compute Google BLEU

Source: https://github.com/huggingface/evaluate/blob/main/metrics/google_bleu/README.md

Demonstrates how to load the google_bleu metric and compute the score for a single prediction and reference pair. Ensure 'evaluate' library is installed.

```python
import evaluate

sentence1 = "the cat sat on the mat"
sentence2 = "the cat ate the mat"
google_bleu = evaluate.load("google_bleu")
result = google_bleu.compute(predictions=[sentence1], references=[[sentence2]])
print(result)
# {'google_bleu': 0.3333333333333333}
```

--------------------------------

### Competition MATH Metric - Minimal Match Example

Source: https://github.com/huggingface/evaluate/blob/main/metrics/competition_math/README.md

Demonstrates a minimal match scenario where predictions do not match references after canonicalization. Accuracy is 0.0.

```python
from evaluate import load
math = load("competition_math")
references = ["\\frac{1}{2}"]
predictions = ["3/4"]
results = math.compute(references=references, predictions=predictions)
print(results)
```

--------------------------------

### Load and Compute BLEU Score

Source: https://github.com/huggingface/evaluate/blob/main/metrics/bleu/README.md

Demonstrates how to load the BLEU metric and compute scores for given predictions and references. Ensure you have the `evaluate` library installed.

```python
>>> import evaluate
>>> predictions = ["hello there general kenobi", "foo bar foobar"]
>>> references = [
...     ["hello there general kenobi", "hello there !"],
...     ["foo bar foobar"]
... ]
>>> bleu = evaluate.load("bleu")
>>> results = bleu.compute(predictions=predictions, references=references)
>>> print(results)
{'bleu': 1.0, 'precisions': [1.0, 1.0, 1.0, 1.0], 'brevity_penalty': 1.0, 'length_ratio': 1.1666666666666667, 'translation_length': 7, 'reference_length': 6}
```
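Two of those output fields follow directly from the lengths. The sketch below assumes the standard BLEU brevity penalty, which is 1 when the translation is longer than the reference and `exp(1 - reference_length / translation_length)` otherwise. `brevity_penalty` is an illustrative helper, not the metric's code.

```python
import math

def brevity_penalty(translation_length, reference_length):
    """Penalize translations that are shorter than the reference."""
    if translation_length > reference_length:
        return 1.0
    return math.exp(1 - reference_length / translation_length)

# Lengths taken from the output above.
translation_length, reference_length = 7, 6
length_ratio = translation_length / reference_length
bp = brevity_penalty(translation_length, reference_length)
print({"brevity_penalty": bp, "length_ratio": length_ratio})
```

Because the translation is longer than the reference here, no penalty applies. A 5-token translation against the same 6-token reference would be scaled by `exp(-0.2)`.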

--------------------------------

### Clone Repository and Set Up Remotes

Source: https://github.com/huggingface/evaluate/blob/main/CONTRIBUTING.md

Clone your forked repository and add the base repository as a remote. This is the initial step after forking the project.

```bash
git clone git@github.com:<your GitHub handle>/evaluate.git
cd evaluate
git remote add upstream https://github.com/huggingface/evaluate.git
```

--------------------------------

### Load and Run an EvaluationSuite

Source: https://github.com/huggingface/evaluate/blob/main/docs/source/evaluation_suite.mdx

Demonstrates how to load a pre-defined EvaluationSuite from the Hugging Face Hub and run it against a model or pipeline. The results, including accuracy and timing information, are returned and can be displayed in a pandas DataFrame.

```python
from evaluate import EvaluationSuite
suite = EvaluationSuite.load('mathemakitten/glue-evaluation-suite')
results = suite.run("gpt2")
```

--------------------------------

### Seqeval Full Match Example

Source: https://github.com/huggingface/evaluate/blob/main/metrics/seqeval/README.md

Demonstrates computing seqeval metrics when predictions and references have a perfect match. Load the 'seqeval' metric and provide identical predictions and references.

```python
>>> import evaluate
>>> seqeval = evaluate.load('seqeval')
>>> predictions = [['O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]
>>> references = [['O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]
>>> results = seqeval.compute(predictions=predictions, references=references)
>>> print(results)
{'MISC': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 1}, 'PER': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 1}, 'overall_precision': 1.0, 'overall_recall': 1.0, 'overall_f1': 1.0, 'overall_accuracy': 1.0}
```

--------------------------------

### Build Documentation Locally

Source: https://github.com/huggingface/evaluate/blob/main/CONTRIBUTING.md

Build the project's documentation locally using the doc-builder tool. The output will be placed in the specified build directory for inspection.

```bash
doc-builder build evaluate docs/source/ --build_dir ~/tmp/test-build
```

--------------------------------

### WER Output Example

Source: https://github.com/huggingface/evaluate/blob/main/metrics/wer/README.md

Example of printing the computed WER score, which is a float representing the word error rate.

```python
print(wer_score)
0.5
```

--------------------------------

### Load and Run an Evaluation Suite

Source: https://github.com/huggingface/evaluate/blob/main/docs/source/a_quick_tour.mdx

Load a pre-defined evaluation suite from the Hugging Face Hub and run it with a specified model. This demonstrates how to initiate the evaluation process.

```python
from evaluate import EvaluationSuite
suite = EvaluationSuite.load('mathemakitten/sentiment-evaluation-suite')
results = suite.run("huggingface/prunebert-base-uncased-6-finepruned-w-distil-mnli")
```

--------------------------------

### Perplexity Output Example

Source: https://github.com/huggingface/evaluate/blob/main/measurements/perplexity/README.md

Example of the dictionary output format for perplexity scores, including individual text perplexities and the average perplexity.

```json
{"perplexities": [8.182524681091309, 33.42122268676758, 27.012239456176758], "mean_perplexity": 22.871995608011883}
```

--------------------------------

### List available comparison modules with details

Source: https://github.com/huggingface/evaluate/blob/main/docs/source/a_quick_tour.mdx

Use `evaluate.list_evaluation_modules` to find available comparison modules. Set `include_community=False` to exclude community metrics and `with_details=True` to retrieve additional information like 'likes'.

```python
>>> import evaluate
>>> evaluate.list_evaluation_modules(
...   module_type="comparison",
...   include_community=False,
...   with_details=True)

[{'name': 'mcnemar', 'type': 'comparison', 'community': False, 'likes': 1},
 {'name': 'exact_match', 'type': 'comparison', 'community': False, 'likes': 0}]
```

--------------------------------

### Build Documentation

Source: https://github.com/huggingface/evaluate/blob/main/docs/README.md

Generates the documentation files using the doc-builder tool. The --build_dir flag specifies the output directory.

```bash
doc-builder build evaluate docs/source/ --build_dir ~/tmp/test-build
```

--------------------------------

### Compute Maximal Values for MRPC/QQP

Source: https://github.com/huggingface/evaluate/blob/main/metrics/glue/README.md

Demonstrates computing the GLUE metric for the 'mrpc' or 'qqp' subsets, which output 'accuracy' and 'f1' scores. This example shows the maximal possible values.

```python
from evaluate import load
glue_metric = load('glue', 'mrpc')  # 'mrpc' or 'qqp'
references = [0, 1]
predictions = [0, 1]
results = glue_metric.compute(predictions=predictions, references=references)
print(results)
```

--------------------------------

### Download and Extract Sample Data

Source: https://github.com/huggingface/evaluate/blob/main/metrics/rl_reliability/README.md

Download and extract the sample dataset for testing RL reliability metrics.

```bash
wget https://storage.googleapis.com/rl-reliability-metrics/data/tf_agents_example_csv_dataset.tgz
tar -xvzf tf_agents_example_csv_dataset.tgz
```

--------------------------------

### Recall Metric Output Example (Multiclass)

Source: https://github.com/huggingface/evaluate/blob/main/metrics/recall/README.md

An example of the output array for the recall metric when used in a multiclass classification context, showing scores for each class.

```python
{'recall': array([1., 0., 0.])}
```

--------------------------------

### Training a Seq2Seq Model with Evaluation

Source: https://github.com/huggingface/evaluate/blob/main/docs/source/transformers_integrations.mdx

Demonstrates how to set up and train a Seq2Seq model using the Seq2SeqTrainer, with evaluation performed at the end of each epoch. Requires a pre-trained model, tokenizer, data collator, and a metrics computation function.

```python
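from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# `tokenized_billsum` (a tokenized dataset) and `compute_metrics` (a metrics
# function) are assumed to be defined earlier. The tokenizer checkpoint is
# assumed to match the model checkpoint.
tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)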

training_args = Seq2SeqTrainingArguments(
    output_dir="./results",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=4,
    weight_decay=0.01,
    save_total_limit=3,
    num_train_epochs=2,
    fp16=True,
    predict_with_generate=True
)

trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_billsum["train"],
    eval_dataset=tokenized_billsum["test"],
    tokenizer=tokenizer,
    data_collator=data_collator,
    compute_metrics=compute_metrics
)

trainer.train()
```

--------------------------------

### BLEU Output Example

Source: https://github.com/huggingface/evaluate/blob/main/metrics/bleu/README.md

An example of the output format for the BLEU score computation. The score ranges from 0 to 1, indicating similarity to reference texts.

```python
{'bleu': 1.0, 'precisions': [1.0, 1.0, 1.0, 1.0], 'brevity_penalty': 1.0, 'length_ratio': 1.1666666666666667, 'translation_length': 7, 'reference_length': 6}
```
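
Two of the fields above follow directly from the lengths. The sketch below assumes the standard BLEU definitions (length ratio is candidate length over reference length; the brevity penalty is 1 when the candidate is longer than the reference). `brevity_penalty` is a hypothetical helper written for illustration, not the library's code.

```python
import math


def brevity_penalty(translation_length, reference_length):
    """Standard BLEU brevity penalty: 1.0 for long candidates, exp(1 - r/c) otherwise."""
    if translation_length == 0:
        return 0.0
    if translation_length > reference_length:
        return 1.0
    return math.exp(1 - reference_length / translation_length)


# Lengths taken from the output shown above.
translation_length = 7
reference_length = 6

length_ratio = translation_length / reference_length
bp = brevity_penalty(translation_length, reference_length)
print(length_ratio, bp)
```

Because the candidate (7 tokens) is longer than the reference (6 tokens), no penalty is applied.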

--------------------------------

### SacreBLEU Output Format Example

Source: https://github.com/huggingface/evaluate/blob/main/metrics/sacrebleu/README.md

Illustrates the detailed output format of the SacreBLEU computation, including score, counts, totals, precisions, brevity penalty, and lengths.

```python
{'score': 39.76353643835252, 'counts': [6, 4, 2, 1], 'totals': [10, 8, 6, 4], 'precisions': [60.0, 50.0, 33.333333333333336, 25.0], 'bp': 1.0, 'sys_len': 10, 'ref_len': 7}
```
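
The fields in this output are related: each precision is `counts / totals` as a percentage, and the score is the brevity penalty times the geometric mean of the precisions. A minimal sketch, assuming the standard BLEU formula without smoothing and using the numbers from the output above:

```python
import math

# Values copied from the example output.
counts = [6, 4, 2, 1]
totals = [10, 8, 6, 4]
bp = 1.0

# n-gram precisions as percentages.
precisions = [100.0 * c / t for c, t in zip(counts, totals)]

# Geometric mean of the precisions (computed in log space), scaled by the brevity penalty.
log_mean = sum(math.log(p / 100.0) for p in precisions) / len(precisions)
score = bp * math.exp(log_mean) * 100.0

print(precisions)
print(score)
```

The result agrees with the reported `score` up to floating-point rounding.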

--------------------------------

### Seqeval Minimal Match Example

Source: https://github.com/huggingface/evaluate/blob/main/metrics/seqeval/README.md

Illustrates computing seqeval metrics when there are no matches between predictions and references. Load the 'seqeval' metric and provide completely different predictions and references.

```python
>>> import evaluate
>>> seqeval = evaluate.load('seqeval')
>>> predictions = [['O', 'B-MISC', 'I-MISC'], ['B-PER', 'I-PER', 'O']]
>>> references = [['B-MISC', 'O', 'O'], ['I-PER', '0', 'I-PER']]
>>> results = seqeval.compute(predictions=predictions, references=references)
>>> print(results)
{'MISC': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 1}, 'PER': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 2}, '_': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 1}, 'overall_precision': 0.0, 'overall_recall': 0.0, 'overall_f1': 0.0, 'overall_accuracy': 0.0}
```

--------------------------------

### Autodoc for Model Configuration

Source: https://github.com/huggingface/evaluate/blob/main/docs/README.md

Use this syntax to include all public methods of a configuration class in the documentation.

```markdown
## XXXConfig

[[autodoc]] XXXConfig
```

--------------------------------

### TER Metric Usage

Source: https://github.com/huggingface/evaluate/blob/main/metrics/ter/README.md

Example of how to load and compute the TER metric using predictions and references.

````APIDOC
## TER Metric Usage

### Description
This section demonstrates how to load and compute the TER metric. It requires predictions and corresponding references.

### Method
`evaluate.load("ter").compute(...)`

### Parameters
#### Path Parameters
None

#### Query Parameters
None

#### Request Body
- **`predictions`** (list of str) - The system stream (a sequence of segments).
- **`references`** (list of list of str) - A list of one or more reference streams (each a sequence of segments).
- **`normalized`** (boolean) - If `True`, applies basic tokenization and normalization to sentences. Defaults to `False`.
- **`ignore_punct`** (boolean) - If `True`, removes punctuation from sentences. Defaults to `False`.
- **`support_zh_ja_chars`** (boolean) - If `True`, tokenization/normalization supports processing of Chinese characters, as well as Japanese Kanji, Hiragana, Katakana, and Phonetic Extensions of Katakana. Only applies if `normalized = True`. Defaults to `False`.
- **`case_sensitive`** (boolean) - If `False`, makes all predictions and references lowercase to ignore differences in case. Defaults to `False`.

### Request Example
```python
import evaluate

predictions = ["does this sentence match??",
               "what about this sentence?",
               "What did the TER metric user say to the developer?"]
references = [["does this sentence match", "does this sentence match!?im"],
              ["wHaT aBoUt ThIs SeNtEnCe?", "wHaT aBoUt ThIs SeNtEnCe?"],
              ["Your jokes are...", "...TERrible"]]
ter = evaluate.load("ter")
results = ter.compute(predictions=predictions,
                      references=references,
                      case_sensitive=True)
print(results)
```

### Response
#### Success Response (200)
- **`score`** (float) - TER score (num_edits / sum_ref_lengths * 100)
- **`num_edits`** (int) - The cumulative number of edits
- **`ref_length`** (float) - The cumulative average reference length

#### Response Example
```json
{
  "score": 150.0,
  "num_edits": 15,
  "ref_length": 10.0
}
```

### Score Range
Scores can be 0 or above. 0 is a perfect score. Scores above 100 indicate that the cumulative number of edits is higher than the cumulative length of the references.
````

--------------------------------

### FrugalScore Output Example

Source: https://github.com/huggingface/evaluate/blob/main/metrics/frugalscore/README.md

The output of FrugalScore is a dictionary containing a list of scores for each prediction-reference pair.

```python
>>> frugalscore = evaluate.load("frugalscore")
>>> results = frugalscore.compute(predictions=['hello there', 'huggingface'], references=['hello world', 'hugging face'], batch_size=16, max_length=64, device="gpu")
>>> print(results)
{'scores': [0.6307541, 0.6449357]}
```

--------------------------------

### F1 Score Binary Example

Source: https://github.com/huggingface/evaluate/blob/main/metrics/f1/README.md

A simple binary F1 score calculation with predictions and references.

```python
>>> import evaluate
>>> f1_metric = evaluate.load("f1")
>>> results = f1_metric.compute(references=[0, 1, 0, 1, 0], predictions=[0, 0, 1, 1, 0])
>>> print(results)
{'f1': 0.5}
```
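
To see where the 0.5 comes from, the same inputs can be worked through by hand. This is a minimal sketch of the binary F1 definition (harmonic mean of precision and recall, with 1 as the positive class), not the library's implementation.

```python
references = [0, 1, 0, 1, 0]
predictions = [0, 0, 1, 1, 0]

pairs = list(zip(references, predictions))
tp = sum(1 for r, p in pairs if r == 1 and p == 1)  # true positives
fp = sum(1 for r, p in pairs if r == 0 and p == 1)  # false positives
fn = sum(1 for r, p in pairs if r == 1 and p == 0)  # false negatives

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print({"f1": f1})  # {'f1': 0.5}
```

With one true positive, one false positive, and one false negative, precision and recall are both 0.5, and so is their harmonic mean.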

--------------------------------

### Load XTREME-S Metric

Source: https://github.com/huggingface/evaluate/blob/main/metrics/xtreme_s/README.md

Load the XTREME-S metric for a specific subset of the benchmark. Ensure you have the 'evaluate' library installed.

```python
import evaluate

xtreme_s_metric = evaluate.load('xtreme_s', 'mls')
```

--------------------------------

### Load Competition MATH Metric

Source: https://github.com/huggingface/evaluate/blob/main/metrics/competition_math/README.md

Loads the competition_math metric from the evaluate library and defines a sample reference and prediction. Ensure the 'math_equivalence' dependency is installed.

```python
from evaluate import load
math = load("competition_math")
references = ["\\frac{1}{2}"]
predictions = ["1/2"]
```

--------------------------------

### Seqeval Partial Match Example

Source: https://github.com/huggingface/evaluate/blob/main/metrics/seqeval/README.md

Shows how to compute seqeval metrics with partial matches between predictions and references. Load the 'seqeval' metric and provide predictions and references with some overlapping and some differing elements.

```python
>>> import evaluate
>>> seqeval = evaluate.load('seqeval')
>>> predictions = [['O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]
>>> references = [['O', 'O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]
>>> results = seqeval.compute(predictions=predictions, references=references)
>>> print(results)
{'MISC': {'precision': 0.0, 'recall': 0.0, 'f1': 0.0, 'number': 1}, 'PER': {'precision': 1.0, 'recall': 1.0, 'f1': 1.0, 'number': 1}, 'overall_precision': 0.5, 'overall_recall': 0.5, 'overall_f1': 0.5, 'overall_accuracy': 0.8}
```
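
The overall numbers above come from comparing whole entity spans, while the accuracy compares individual tags. The sketch below is a simplified pure-Python illustration of that idea, not seqeval's implementation: `extract_entities` is a hypothetical helper that handles plain BIO tags only.

```python
def extract_entities(tags):
    """Return a set of (type, start, end) spans from one BIO-tagged sentence."""
    entities = set()
    start, ent_type = None, None
    for i, tag in enumerate(tags + ["O"]):  # trailing "O" flushes the last span
        prefix, _, label = tag.partition("-")
        continues = prefix == "I" and label == ent_type
        if start is not None and not continues:
            entities.add((ent_type, start, i - 1))
            start, ent_type = None, None
        if prefix == "B" or (prefix == "I" and start is None):
            start, ent_type = i, label
    return entities


predictions = [['O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]
references = [['O', 'O', 'O', 'B-MISC', 'I-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]

# Tag each span with its sentence index so spans from different sentences never collide.
pred_entities = {(i, *e) for i, sent in enumerate(predictions) for e in extract_entities(sent)}
ref_entities = {(i, *e) for i, sent in enumerate(references) for e in extract_entities(sent)}

correct = len(pred_entities & ref_entities)
overall_precision = correct / len(pred_entities)
overall_recall = correct / len(ref_entities)
overall_f1 = 2 * overall_precision * overall_recall / (overall_precision + overall_recall)

# Accuracy is computed per tag, not per entity.
token_pairs = [(p, r) for ps, rs in zip(predictions, references) for p, r in zip(ps, rs)]
overall_accuracy = sum(p == r for p, r in token_pairs) / len(token_pairs)

print(overall_precision, overall_recall, overall_f1, overall_accuracy)
```

The predicted MISC span starts one token too early, so it does not count as a match even though most of its tags agree with the reference. That is why entity-level scores are 0.5 while tag-level accuracy is 0.8.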

--------------------------------

### Example: No Match CER

Source: https://github.com/huggingface/evaluate/blob/main/metrics/cer/README.md

Demonstrates calculating CER when predictions and references share no characters. Here every reference character must be edited, so the score is 1.0. CER can exceed 1.0 when the prediction is much longer than the reference.

```python
from evaluate import load
cer = load("cer")
predictions = ["hello"]
references = ["gracias"]
cer_score = cer.compute(predictions=predictions, references=references)
print(cer_score)
```
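
The score can be checked by hand with a character-level edit distance. The following is a minimal sketch assuming CER is the total number of character edits divided by the total number of reference characters; `edit_distance` and `character_error_rate` are hypothetical helpers, not the library's implementation.

```python
def edit_distance(source, target):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete a character from source
                current[j - 1] + 1,      # insert a character into source
                previous[j - 1] + cost,  # substitute, or keep a matching character
            ))
        previous = current
    return previous[-1]


def character_error_rate(predictions, references):
    """Total character edits divided by total reference length."""
    total_edits = sum(edit_distance(p, r) for p, r in zip(predictions, references))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars


no_match = character_error_rate(["hello"], ["gracias"])
print(no_match)  # 1.0
```

"hello" and "gracias" have no characters in common, so all 7 reference characters need an edit: 7 edits over 7 reference characters gives 1.0.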

--------------------------------

### Create a New Evaluation Module

Source: https://github.com/huggingface/evaluate/blob/main/docs/source/creating_and_sharing.mdx

Use the evaluate-cli to scaffold a new evaluation module, creating a Space on the Hub and cloning it locally.

```bash
evaluate-cli create "My Metric" --module_type "metric"
```

--------------------------------

### ROUGE Output with use_aggregator=True

Source: https://github.com/huggingface/evaluate/blob/main/metrics/rouge/README.md

Example of ROUGE metric output when `use_aggregator` is set to `True`. The output is a dictionary of aggregated scores.

```python
{'rouge1': 1.0, 'rouge2': 1.0}
```

--------------------------------

### Competition MATH Metric - Partial Match Example

Source: https://github.com/huggingface/evaluate/blob/main/metrics/competition_math/README.md

Demonstrates a partial match scenario with multiple references and predictions. Accuracy is calculated based on the proportion of correct matches.

```python
from evaluate import load
math = load("competition_math")
references = ["\\frac{1}{2}","\\frac{3}{4}"]
predictions = ["1/5", "3/4"]
results = math.compute(references=references, predictions=predictions)
print(results)
```

--------------------------------

### XNLI Minimal Accuracy Example

Source: https://github.com/huggingface/evaluate/blob/main/metrics/xnli/README.md

Demonstrates computing the XNLI metric with predictions that are completely incorrect, resulting in an accuracy of 0.0.

```python
>>> from evaluate import load
>>> xnli_metric = load("xnli")
>>> predictions = [1, 0]
>>> references = [0, 1]
>>> results = xnli_metric.compute(predictions=predictions, references=references)
>>> print(results)
{'accuracy': 0.0}
```
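
The reported value is plain accuracy: the fraction of positions where the prediction equals the reference. A minimal sketch of that calculation, with `accuracy` as a hypothetical helper rather than the library's code:

```python
def accuracy(predictions, references):
    """Fraction of predictions that exactly match their reference label."""
    matches = sum(1 for p, r in zip(predictions, references) if p == r)
    return matches / len(references)


predictions = [1, 0]
references = [0, 1]

result = {"accuracy": accuracy(predictions, references)}
print(result)  # {'accuracy': 0.0}
```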

--------------------------------

### Example: Perfect Match CER

Source: https://github.com/huggingface/evaluate/blob/main/metrics/cer/README.md

Demonstrates calculating CER when predictions perfectly match references, resulting in a score of 0.0.

```python
from evaluate import load
cer = load("cer")
predictions = ["hello world", "good night moon"]
references = ["hello world", "good night moon"]
cer_score = cer.compute(predictions=predictions, references=references)
print(cer_score)
```

--------------------------------

### Single Value Return Block

Source: https://github.com/huggingface/evaluate/blob/main/docs/README.md

Example of how to format a docstring for a single return value, specifying its type and a brief explanation.

```python
    Returns:
        `List[int]`: A list of integers in the range [0, 1] --- 1 for a special token, 0 for a sequence token.
```

--------------------------------

### Get Model ID to Label Mapping

Source: https://github.com/huggingface/evaluate/blob/main/measurements/toxicity/README.md

Retrieves the mapping from numerical labels to their string representations for a given classification model.

```python
>>> from transformers import AutoModelForSequenceClassification
>>> model = AutoModelForSequenceClassification.from_pretrained("DaNLP/da-electra-hatespeech-detection")
>>> model.config.id2label
{0: 'not offensive', 1: 'offensive'}
```

--------------------------------

### ROC AUC Score Output Example

Source: https://github.com/huggingface/evaluate/blob/main/metrics/roc_auc/README.md

The standard output format for ROC AUC scores, typically a dictionary with a float value.

```python
{
'roc_auc': 0.778
}
```
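
For the binary case, ROC AUC equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it by counting pairs. It is an illustration only: `pairwise_roc_auc` is a hypothetical helper, and the labels and scores are illustrative inputs assumed for this example.

```python
def pairwise_roc_auc(references, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for r, s in zip(references, scores) if r == 1]
    negatives = [s for r, s in zip(references, scores) if r == 0]
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(positives) * len(negatives))


# Illustrative inputs (assumed for this sketch).
references = [1, 0, 1, 1, 0, 0]
prediction_scores = [0.5, 0.2, 0.99, 0.3, 0.1, 0.7]

auc = pairwise_roc_auc(references, prediction_scores)
print({"roc_auc": round(auc, 3)})  # {'roc_auc': 0.778}
```

Seven of the nine positive/negative pairs are ranked correctly, which gives 7/9, or about 0.778.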

--------------------------------

### Evaluating an Existing Model with Trainer

Source: https://github.com/huggingface/evaluate/blob/main/docs/source/transformers_integrations.mdx

Shows how to evaluate a pre-trained model using the Trainer's evaluate method. This is an alternative to training when only model assessment is needed.

```python
trainer.evaluate()
```

--------------------------------

### ROUGE Calculation With Aggregation

Source: https://github.com/huggingface/evaluate/blob/main/metrics/rouge/README.md

Example of computing ROUGE scores with aggregation enabled. This returns one aggregated score per ROUGE type rather than one score per prediction.

```python
>>> import evaluate
>>> rouge = evaluate.load('rouge')
>>> predictions = ["hello goodbye", "ankh morpork"]
>>> references = ["goodbye", "general kenobi"]
>>> results = rouge.compute(predictions=predictions,
...                         references=references,
...                         use_aggregator=True)
>>> print(list(results.keys()))
['rouge1', 'rouge2', 'rougeL', 'rougeLsum']
>>> print(results["rouge1"])
0.25
```