### Few-Shot Prompting for Guided Generation

Source: https://context7.com/datadreamer-dev/datadreamer/llms.txt

Use the FewShotPrompt step to guide LLM generation by providing input/output examples. This is useful for tasks like translation or style transfer where demonstrating the desired format is key. Define input and output labels for clarity.

```python
from datadreamer import DataDreamer
from datadreamer.llms import OpenAI
from datadreamer.steps import DataSource, FewShotPrompt

with DataDreamer("./output"):
    llm = OpenAI(model_name="gpt-4")

    examples = DataSource(
        "Translation Examples",
        data={
            "english": ["Hello", "Thank you"],
            "tamil": ["வணக்கம்", "நன்றி"],
        },
    )

    new_sentences = DataSource(
        "New Sentences",
        data={"sentence": ["Good morning", "How are you?"]},
    )

    translated = FewShotPrompt(
        "Translate to Tamil",
        inputs={
            "input_examples": examples.output["english"],
            "output_examples": examples.output["tamil"],
            "inputs": new_sentences.output["sentence"],
        },
        args={
            "llm": llm,
            "input_label": "English:",
            "output_label": "Tamil:",
            "instruction": "Translate the sentence to Tamil.",
            "max_new_tokens": 200,
        },
        outputs={"inputs": "english", "generations": "tamil"},
    )

    print(translated.output["tamil"])
    # Expected: ['காலை வணக்கம்', 'நீங்கள் எப்படி இருக்கிறீர்கள்?']
```

--------------------------------

### Setup Project for Local Development

Source: https://github.com/datadreamer-dev/datadreamer/blob/main/docs/source/pages/contributing.rst

Run these commands after cloning the repository to configure local hooks and prepare the project for development.

```bash
cd DataDreamer
git config --local core.hooksPath ./scripts/.githooks/
./scripts/.githooks/post-checkout
```

--------------------------------

### Install DataDreamer using pip

Source: https://github.com/datadreamer-dev/datadreamer/blob/main/docs/source/pages/get_started/installation.rst

Use this command to install the DataDreamer library from PyPI. Ensure you have pip3 installed.

```bash
pip3 install datadreamer.dev
```

--------------------------------

### Basic Step Structure in DataDreamer

Source: https://github.com/datadreamer-dev/datadreamer/blob/main/docs/source/pages/advanced_usage/creating_a_new_datadreamer_.../step.rst

Define a new custom step by subclassing `datadreamer.steps.Step` and implementing the `setup` and `run` methods. The `setup` method is for registration, and `run` contains the core logic.

```python
from datadreamer.steps import Step

class MyNewStep(Step):
	def setup(self):
		# Register inputs, arguments, outputs, and data card information here

	def run(self):
		# Implement your custom data processing / transformation logic here
```

--------------------------------

### Create Custom Data Processing Step

Source: https://context7.com/datadreamer-dev/datadreamer/llms.txt

Define a reusable custom step by subclassing `Step` and implementing `setup()` and `run()`. This example counts words in provided texts, with an optional minimum word count argument.

```python
from datadreamer import DataDreamer
from datadreamer.steps import Step, LazyRows

class WordCountStep(Step):
    def setup(self):
        self.register_input("texts")             # expected input column
        self.register_arg("min_words", required=False, default=0)
        self.register_output("texts")
        self.register_output("word_counts")

    def run(self):
        texts = self.inputs["texts"]
        min_words = self.args["min_words"]

        def generate_rows():
            for text in texts:
                count = len(text.split())
                if count >= min_words:
                    yield {"texts": text, "word_counts": count}

        return LazyRows(generate_rows, total_num_rows=len(texts))


with DataDreamer("./output"):
    from datadreamer.steps import DataSource

    source = DataSource("Source", data={"texts": ["Hello world", "A longer sentence here", "Hi"]})

    counted = WordCountStep(
        "Count Words",
        inputs={"texts": source.output["texts"]},
        args={"min_words": 2},
    )

    for text, count in zip(counted.output["texts"], counted.output["word_counts"]):
        print(f"{count} words: {text}")
    # 2 words: Hello world
    # 4 words: A longer sentence here
```

--------------------------------

### Multi-GPU Training Setup

Source: https://github.com/datadreamer-dev/datadreamer/blob/main/docs/source/pages/advanced_usage/parallelization/training_models_on_multiple_gpus.rst

To train on multiple GPUs, pass a list of devices to the `device` parameter of the `Trainer` constructor. This enables distributed training modes like FSDP or DDP.

```python
from datadreamer.trainers import Trainer

trainer = Trainer(
    ...,
    device=["cuda:0", "cuda:1"]
)
```

--------------------------------

### Skip Dependency Installation for Faster Runs

Source: https://github.com/datadreamer-dev/datadreamer/blob/main/docs/source/pages/contributing.rst

Set the PROJECT_SKIP_INSTALL_REQS environment variable to 1 to bypass dependency installation on subsequent runs, making them faster. This command also shows how to combine it with test filtering.

```bash
export PROJECT_SKIP_INSTALL_REQS=1
./scripts/run.sh -k "TestFewShotPrompt"
```

--------------------------------

### Initialize DataDreamer Session

Source: https://context7.com/datadreamer-dev/datadreamer/llms.txt

Start a DataDreamer session using a context manager. This sets up an output directory for caching and persistence. Re-running the script will resume from the cached state.

```python
from datadreamer import DataDreamer

with DataDreamer("./output"):
    # All steps, trainers, and model usage go here.
    # Outputs are automatically cached to ./output/
    # Re-running the script resumes from cached state.
    pass
```

--------------------------------

### Translate Test Set with Final Examples

Source: https://github.com/datadreamer-dev/datadreamer/blob/main/docs/source/pages/get_started/quick_tour/bootstrapping_machine_translation.rst

Translates a separate test set of English sentences to Tamil using the best few-shot examples bootstrapped in the previous rounds.

```python
# Load the test set of English sentences
english_test_dataset = HFHubDataSource(
	"Get FLORES-101 English Sentences (Test Set)",
	"gsarti/flores_101",
	config_name="eng",
	split="devtest",
).select_columns(["sentence"])

# Finally translate the test set with the final bootstrapped synthetic few-shot examples
english_test_to_tamil = FewShotPrompt(
	"Few-shot Translate from English To Tamil (Test Set)",
	inputs={
		"input_examples": best_translation_pairs.output["english"],
		"output_examples": best_translation_pairs.output["tamil"],
		"inputs": english_test_dataset.output["sentence"],
	},
	args={
		"llm": gpt_4,
		"input_label": "English:",
		"output_label": "Tamil:",
		"instruction": "Translate the sentence to Tamil.",
		"max_new_tokens": 1000,
	},
	outputs={"inputs": "english", "generations": "tamil"},
).select_columns(["english", "tamil"])
```

--------------------------------

### Multi-Node Fine-Tuning with DataDreamer

Source: https://context7.com/datadreamer-dev/datadreamer/llms.txt

Configure and run a fine-tuning job across multiple nodes. Ensure NODE_RANK is set correctly for each node and the distributed_config matches your network setup.

```python
NODE_RANK = 0   # 0 for master node, 1 for second node, etc.
TOTAL_NODES = 2

with DataDreamer("./output"):
    dataset = HFHubDataSource("Load Data", "yahma/alpaca-cleaned", split="train").take(2000)
    splits = dataset.splits(train_size=0.90, validation_size=0.10)

    trainer = TrainHFFineTune(
        "Multi-Node Fine-Tune",
        model_name="TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T",
        peft_config=LoraConfig(),
        device=["cuda:0", "cuda:1"],   # GPUs on this node
        dtype="bfloat16",
        distributed_config={
            "master_addr": "192.168.1.100",   # IP of master node
            "master_port": "29500",
            "nnodes": TOTAL_NODES,
            "node_rank": NODE_RANK,
        },
    )
    trainer.train(
        train_input=splits["train"].output["instruction"],
        train_output=splits["train"].output["output"],
        validation_input=splits["validation"].output["instruction"],
        validation_output=splits["validation"].output["output"],
        epochs=3,
        batch_size=2,
    )
```

--------------------------------

### Train Model to Generate Tweet Summary

Source: https://github.com/datadreamer-dev/datadreamer/blob/main/docs/source/index.rst

This example demonstrates training a model to generate a tweet summarizing a research paper abstract using synthetic data. It involves loading an LLM, generating synthetic data from prompts, and fine-tuning a model.

```python
from datadreamer import DataDreamer
from datadreamer.llms import OpenAI
from datadreamer.steps import DataFromPrompt, ProcessWithPrompt
from datadreamer.trainers import TrainHFFineTune
from peft import LoraConfig

with DataDreamer("./output"):
   # Load GPT-4
   gpt_4 = OpenAI(model_name="gpt-4")

   # Generate synthetic arXiv-style research paper abstracts with GPT-4
   arxiv_dataset = DataFromPrompt(

```

--------------------------------

### Parallel LLM Inference

Source: https://context7.com/datadreamer-dev/datadreamer/llms.txt

Run the same smaller model on multiple GPUs in parallel using `ParallelLLM` for increased throughput. This example demonstrates parallel prompting.

```python
from datadreamer import DataDreamer
from datadreamer.llms import HFTransformers, ParallelLLM
from datadreamer.steps import Prompt

with DataDreamer("./output"):
    parallel_llm = ParallelLLM(
        HFTransformers("TinyLlama/TinyLlama-1.1B-Chat-v1.0", device="cuda:0"),
        HFTransformers("TinyLlama/TinyLlama-1.1B-Chat-v1.0", device="cuda:1"),
    )

    results = Prompt(
        "Parallel Prompting",
        inputs={"prompts": ["What is AI?", "Explain quantum computing."]},
        args={"llm": parallel_llm},
    )
    print(results.output["generations"])
```

--------------------------------

### Align TinyLlama-Chat with Human Preferences using DPO

Source: https://github.com/datadreamer-dev/datadreamer/blob/main/docs/source/pages/get_started/quick_tour/aligning.rst

This snippet demonstrates aligning the TinyLlama chat model with human preferences using DataDreamer's `TrainHFDPO` trainer. It fetches a DPO dataset from the Hugging Face Hub, takes a subset for demonstration, splits it into training and validation sets, and then trains the model using LoRA for efficient fine-tuning. Ensure you have the necessary libraries installed and CUDA-enabled devices available.

```python
from datadreamer import DataDreamer
from datadreamer.steps import HFHubDataSource
from datadreamer.trainers import TrainHFDPO
from peft import LoraConfig

with DataDreamer("./output"):
    # Get the DPO dataset
    dpo_dataset = HFHubDataSource(
        "Get DPO Dataset", "Intel/orca_dpo_pairs", split="train"
    )

    # Keep only 1000 examples as a quick demo
    dpo_dataset = dpo_dataset.take(1000)

    # Create training data splits
    splits = dpo_dataset.splits(train_size=0.90, validation_size=0.10)

    # Align the TinyLlama chat model with human preferences
    trainer = TrainHFDPO(
        "Align TinyLlama-Chat",
        model_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        peft_config=LoraConfig(),
        device=["cuda:0", "cuda:1"],
        dtype="bfloat16",
    )
    trainer.train(
        train_prompts=splits["train"].output["question"],
        train_chosen=splits["train"].output["chosen"],
        train_rejected=splits["train"].output["rejected"],
        validation_prompts=splits["validation"].output["question"],
        validation_chosen=splits["validation"].output["chosen"],
        validation_rejected=splits["validation"].output["rejected"],
        epochs=3,
        batch_size=1,
        gradient_accumulation_steps=32,
    )
```

--------------------------------

### Initialize DataDreamer and Load LLM

Source: https://github.com/datadreamer-dev/datadreamer/blob/main/docs/source/pages/get_started/quick_tour/bootstrapping_machine_translation.rst

Sets up the DataDreamer environment and loads the specified large language model (GPT-4 in this case).

```python
from datadreamer import DataDreamer
from datadreamer.llms import OpenAI
from datadreamer.steps import (
	FewShotPrompt,
	ProcessWithPrompt,
	HFHubDataSource,
	CosineSimilarity,
)
from datadreamer.embedders import SentenceTransformersEmbedder

with DataDreamer("./output"):
	# Load GPT-4
	gpt_4 = OpenAI(model_name="gpt-4")
```

--------------------------------

### Run All Project Tests

Source: https://github.com/datadreamer-dev/datadreamer/blob/main/docs/source/pages/contributing.rst

Execute this command to run all test cases defined in the project.

```bash
./scripts/run.sh
```

--------------------------------

### Load HF Hub DataSource Step

Source: https://context7.com/datadreamer-dev/datadreamer/llms.txt

Load an existing dataset from the Hugging Face Hub as an input step. You can then take subsets and select specific columns.

```python
from datadreamer import DataDreamer
from datadreamer.steps import HFHubDataSource

with DataDreamer("./output"):
    dataset = HFHubDataSource(
        "Load Alpaca Dataset",
        "yahma/alpaca-cleaned",
        split="train",
    )

    # Keep only 500 rows, select specific columns
    subset = dataset.take(500).select_columns(["instruction", "output"])
    print(subset.output["instruction"][0])
```

--------------------------------

### Instruction-Tune TinyLlama with DataDreamer

Source: https://github.com/datadreamer-dev/datadreamer/blob/main/docs/source/pages/get_started/quick_tour/instruction_tuning.rst

This snippet demonstrates the complete process of instruction tuning a base LLM using DataDreamer. It includes fetching a dataset, formatting it, splitting into training and validation sets, defining a prompt template, and configuring the fine-tuning trainer with LoRA. Use this for instruction-tuning base models to improve their ability to follow instructions.

```python
from datadreamer import DataDreamer
from datadreamer.steps import HFHubDataSource
from datadreamer.trainers import TrainHFFineTune
from peft import LoraConfig

with DataDreamer("./output"):
		# Get the Alpaca instruction-tuning dataset (cleaned version)
		instruction_tuning_dataset = HFHubDataSource(
			"Get Alpaca Instruction-Tuning Dataset", "yahma/alpaca-cleaned", split="train"
		)

		# Keep only 1000 examples as a quick demo
		instruction_tuning_dataset = instruction_tuning_dataset.take(1000)

		# Some examples taken in an "input", we'll format those into the instruction
		instruction_tuning_dataset.map(
			lambda row: {
				"instruction": (
					row["instruction"]
					if len(row["input"]) == 0
					else f"Input: {row['input']}\n\n{row['instruction']}"
				),
				"output": row["output"],
			},
			lazy=False,
		)

		# Create training data splits
		splits = instruction_tuning_dataset.splits(train_size=0.90, validation_size=0.10)

		# Define what the prompt template should be when instruction-tuning
		chat_prompt_template = "### Instruction:\n{{prompt}}\n\n### Response:\n"

		# Instruction-tune the base TinyLlama model to make it follow instructions
		trainer = TrainHFFineTune(
			"Instruction-Tune TinyLlama",
			model_name="TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T",
			chat_prompt_template=chat_prompt_template,
			peft_config=LoraConfig(),
			device=["cuda:0", "cuda:1"],
			dtype="bfloat16",
		)
		trainer.train(
			train_input=splits["train"].output["instruction"],
			train_output=splits["train"].output["output"],
			validation_input=splits["validation"].output["instruction"],
			validation_output=splits["validation"].output["output"],
			epochs=3,
			batch_size=1,
			gradient_accumulation_steps=32,
		)
```

--------------------------------

### Few-Shot Translation for Refined Pairs

Source: https://github.com/datadreamer-dev/datadreamer/blob/main/docs/source/pages/get_started/quick_tour/bootstrapping_machine_translation.rst

Uses previously generated best translation pairs as few-shot examples to translate more English sentences, refining the synthetic pairs in subsequent rounds.

```python
else:
	# On subsequent rounds, use the best synthetic translation pairs from the previous round
	# as few-shot examples to translate more English sentences to create even better synthetic pairs
	english_to_tamil = FewShotPrompt(
		f"Round #{r+1}: Few-shot Translate from English To Tamil",
		inputs={
			"input_examples": best_translation_pairs.output["english"],
			"output_examples": best_translation_pairs.output["tamil"],
			"inputs": sentences_for_round.output["sentence"],
		},
		args={
			"llm": gpt_4,
			"input_label": "English:",
			"output_label": "Tamil:",
			"instruction": "Translate the sentence to Tamil.",
			"max_new_tokens": 1000,
		},
		outputs={"inputs": "english", "generations": "tamil"},
	).select_columns(["english", "tamil"])
```

--------------------------------

### Run a Single Test File

Source: https://github.com/datadreamer-dev/datadreamer/blob/main/docs/source/pages/contributing.rst

Use this command to run tests from a specific file by providing its path as an argument.

```bash
./scripts/run.sh src/tests/steps/prompt/test_prompt.py
```

--------------------------------

### Transform Data with ProcessWithPrompt

Source: https://context7.com/datadreamer-dev/datadreamer/llms.txt

Apply an LLM instruction to each item in an input column to transform it using the ProcessWithPrompt step. This step requires an initial data source (e.g., generated by DataFromPrompt) and an LLM configuration. Specify input and output mappings for clarity.

```python
from datadreamer import DataDreamer
from datadreamer.llms import OpenAI
from datadreamer.steps import DataFromPrompt, ProcessWithPrompt

with DataDreamer("./output"):
    gpt_4 = OpenAI(model_name="gpt-4")

    abstracts = DataFromPrompt(
        "Generate Abstracts",
        args={
            "llm": gpt_4, "n": 100, "temperature": 1.2,
            "instruction": "Generate an arXiv NLP paper abstract.",
        },
        outputs={"generations": "abstracts"},
    )

    tweets = ProcessWithPrompt(
        "Generate Tweets from Abstracts",
        inputs={"inputs": abstracts.output["abstracts"]},
        args={
            "llm": gpt_4,
            "instruction": "Given the abstract, write a tweet to summarize the work.",
            "top_p": 1.0,
        },
        outputs={"inputs": "abstracts", "generations": "tweets"},
    )

    print(tweets.output["tweets"][0])
    # Expected: A short tweet summarizing the first abstract.
```

--------------------------------

### Monitoring GPU Memory Usage

Source: https://github.com/datadreamer-dev/datadreamer/blob/main/docs/source/pages/advanced_usage/parallelization/training_models_on_multiple_gpus.rst

Enable verbose logging for GPU memory usage during multi-GPU training by setting the `verbose` parameter to `True` in the `Trainer` constructor. This logs memory usage at the start, end, and after each epoch.

```python
from datadreamer.trainers import Trainer

trainer = Trainer(
    ...,
    device=["cuda:0", "cuda:1"],
    verbose=True
)
```

--------------------------------

### Generate Synthetic Data with DataFromPrompt

Source: https://context7.com/datadreamer-dev/datadreamer/llms.txt

Utilize the DataFromPrompt step to generate a specified number of independent synthetic data items using a single instruction prompt. Configure the LLM, number of items, temperature, and the instruction itself. The output column can be renamed using the `outputs` argument.

```python
from datadreamer import DataDreamer
from datadreamer.llms import OpenAI
from datadreamer.steps import DataFromPrompt

with DataDreamer("./output"):
    gpt_4 = OpenAI(model_name="gpt-4")

    abstracts = DataFromPrompt(
        "Generate Research Paper Abstracts",
        args={
            "llm": gpt_4,
            "n": 1000,
            "temperature": 1.2,
            "instruction": (
                "Generate an arXiv abstract of an NLP research paper."
                " Return just the abstract, no titles."
            ),
        },
        outputs={"generations": "abstracts"},  # rename output column
    )

    print(f"Generated {len(abstracts.output['abstracts'])} abstracts")
```

--------------------------------

### Send Prompts to LLM and Collect Responses

Source: https://context7.com/datadreamer-dev/datadreamer/llms.txt

Use the Prompt step to send a list of prompts to an LLM and collect the generated responses. Ensure the LLM and data sources are properly initialized.

```python
from datadreamer import DataDreamer
from datadreamer.llms import OpenAI
from datadreamer.steps import DataSource, Prompt

with DataDreamer("./output"):
    llm = OpenAI(model_name="gpt-4")

    prompts = DataSource(
        "Input Prompts",
        data={"prompts": ["Summarize the theory of relativity.", "What is CRISPR?"]},
    )

    results = Prompt(
        "Run Prompts",
        inputs={"prompts": prompts.output["prompts"]},
        args={
            "llm": llm,
            "temperature": 0.7,
            "max_new_tokens": 256,
        },
    )
    print(results.output["generations"][0])
    # Expected: A concise summary of the theory of relativity.
```

--------------------------------

### Distill GPT-4 Capabilities to GPT-3.5

Source: https://github.com/datadreamer-dev/datadreamer/blob/main/docs/source/pages/get_started/quick_tour/openai_distillation.rst

This script demonstrates distilling GPT-4's 'ELI5' generation capabilities into GPT-3.5. It loads GPT-4, fetches ELI5 questions from Hugging Face Hub, uses GPT-4 to generate answers, splits the data, and fine-tunes GPT-3.5.

```python
from datadreamer import DataDreamer
from datadreamer.llms import OpenAI
from datadreamer.steps import ProcessWithPrompt, HFHubDataSource
from datadreamer.trainers import TrainOpenAIFineTune

with DataDreamer("./output"):
	# Load GPT-4
	gpt_4 = OpenAI(model_name="gpt-4")

	# Get ELI5 questions
	eli5_dataset = HFHubDataSource(
		"Get ELI5 Questions",
		"eli5_category",
		split="train",
		trust_remote_code=True,
	).select_columns(["title"])

	# Keep only 1000 examples as a quick demo
	eli5_dataset = eli5_dataset.take(1000)

	# Ask GPT-4 to ELI5
	questions_and_answers = ProcessWithPrompt(
		"Generate Explanations",
		inputs={"inputs": eli5_dataset.output["title"]},
		args={
			"llm": gpt_4,
			"instruction": (
				'Given the question, give an "Explain it like I\'m 5" answer.'
			),
			"top_p": 1.0,
		},
		outputs={"inputs": "questions", "generations": "answers"},
	)

	# Create training data splits
	splits = questions_and_answers.splits(train_size=0.90, validation_size=0.10)

	# Train a model to answer questions in ELI5 style
	trainer = TrainOpenAIFineTune(
		"Distill capabilities to GPT-3.5",
		model_name="gpt-3.5-turbo-1106",
	)
	trainer.train(
		train_input=splits["train"].output["questions"],
		train_output=splits["train"].output["answers"],
		validation_input=splits["validation"].output["questions"],
		validation_output=splits["validation"].output["answers"],
		epochs=30,
		batch_size=8,
	)
```

--------------------------------

### Format Project Code

Source: https://github.com/datadreamer-dev/datadreamer/blob/main/docs/source/pages/contributing.rst

Execute this script to check for style violations and automatically format the project's code according to the style guidelines.

```bash
./scripts/format.sh
```

--------------------------------

### Instruction-Tune a Hugging Face Model with DataDreamer

Source: https://context7.com/datadreamer-dev/datadreamer/llms.txt

Fine-tunes a Hugging Face model using the `TrainHFFineTune` trainer, supporting PEFT (like LoRA) and multi-GPU training. The trained model can be exported and published to the Hugging Face Hub with an auto-generated model card.

```python
from datadreamer import DataDreamer
from datadreamer.steps import HFHubDataSource
from datadreamer.trainers import TrainHFFineTune
from peft import LoraConfig

with DataDreamer("./output"):
    dataset = HFHubDataSource("Load Alpaca", "yahma/alpaca-cleaned", split="train").take(1000)
    dataset = dataset.map(
        lambda row: {
            "instruction": row["instruction"] if not row["input"] else f"Input: {row['input']}\n\n{row['instruction']}",
            "output": row["output"],
        },
        lazy=False,
    )
    splits = dataset.splits(train_size=0.90, validation_size=0.10)

    trainer = TrainHFFineTune(
        "Instruction-Tune TinyLlama",
        model_name="TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T",
        chat_prompt_template="### Instruction:\n{{prompt}}\n\n### Response:\n",
        peft_config=LoraConfig(),
        device=["cuda:0", "cuda:1"],  # multi-GPU training
        dtype="bfloat16",
    )
    trainer.train(
        train_input=splits["train"].output["instruction"],
        train_output=splits["train"].output["output"],
        validation_input=splits["validation"].output["instruction"],
        validation_output=splits["validation"].output["output"],
        epochs=3,
        batch_size=1,
        gradient_accumulation_steps=32,
    )

    # Export and publish the trained model with auto-generated model card
    trainer.publish_to_hf_hub("my-org/tinyllama-instruction-tuned")
    print(f"Model saved to: {trainer.model_path}")
```

--------------------------------

### Generate Diverse Synthetic Data with DataFromAttributedPrompt

Source: https://context7.com/datadreamer-dev/datadreamer/llms.txt

Combines variable attribute lists into instruction templates to generate diverse synthetic data. Requires LLM and attribute lists as input.

```python
from datadreamer import DataDreamer
from datadreamer.llms import OpenAI
from datadreamer.steps import DataSource, Prompt, DataFromAttributedPrompt

with DataDreamer("./output"):
    gpt_4 = OpenAI(model_name="gpt-4")

    # Generate attribute lists via prompting
    attribute_prompts = DataSource(
        "Attribute Prompts",
        data={
            "prompts": [
                "Generate 10 movie names, comma-separated.",
                "Generate 10 review styles, comma-separated.",
            ]
        },
    )
    attributes = Prompt(
        "Generate Attributes",
        inputs={"prompts": attribute_prompts.output["prompts"]},
        args={"llm": gpt_4},
    ).output["generations"]

    movie_reviews = DataFromAttributedPrompt(
        "Generate Movie Reviews",
        args={
            "llm": gpt_4,
            "n": 200,
            "instruction": "Write a {review_style} review of {movie_name}.",
            "attributes": {
                "movie_name": attributes[0].split(","),
                "review_style": attributes[1].split(","),
            },
        },
        outputs={"generations": "reviews"},
    ).select_columns(["reviews"]).shuffle()

    movie_reviews.publish_to_hf_hub("my-org/movie-reviews-dataset")
```

--------------------------------

### Switching to DDP Mode

Source: https://github.com/datadreamer-dev/datadreamer/blob/main/docs/source/pages/advanced_usage/parallelization/training_models_on_multiple_gpus.rst

To use Distributed Data Parallel (DDP) instead of the default FSDP, set `fsdp=False` in the `Trainer` constructor. This is useful for scaling batch size but requires the model to fit on a single GPU.

```python
from datadreamer.trainers import Trainer

trainer = Trainer(
    ...,
    device=["cuda:0", "cuda:1"],
    fsdp=False
)
```

--------------------------------

### Load OpenAI LLM

Source: https://context7.com/datadreamer-dev/datadreamer/llms.txt

Load an OpenAI API-based model like GPT-4 for use in prompting steps. Ensure you have the necessary API keys configured.

```python
from datadreamer import DataDreamer
from datadreamer.llms import OpenAI

with DataDreamer("./output"):
    gpt_4 = OpenAI(model_name="gpt-4")
    # gpt_4 can now be passed to any prompting step
```

--------------------------------

### Create DataSource Step

Source: https://context7.com/datadreamer-dev/datadreamer/llms.txt

Create a DataDreamer step from in-memory Python data (dictionary of lists) to seed a workflow. Outputs can be accessed by column name.

```python
from datadreamer import DataDreamer
from datadreamer.steps import DataSource

with DataDreamer("./output"):
    source = DataSource(
        "My Dataset",
        data={
            "questions": ["What is the capital of France?", "Who wrote Hamlet?"],
            "labels": ["geography", "literature"],
        },
    )
    # Access output columns
    for row in source.output["questions"]:
        print(row)
    # Expected output:
    # What is the capital of France?
    # Who wrote Hamlet?
```

--------------------------------

### Run a Step Asynchronously in Background

Source: https://context7.com/datadreamer-dev/datadreamer/llms.txt

Launch a step in a separate background process using `background=True` and wait for its completion using `wait()`. This allows other work to be done while the step executes.

```python
from datadreamer import DataDreamer
from datadreamer.llms import OpenAI
from datadreamer.steps import DataSource, Prompt, wait

with DataDreamer("./output"):
    llm = OpenAI(model_name="gpt-4")

    prompts = DataSource("Prompts", data={"prompts": ["Explain neural networks in simple terms."]})

    # background=True launches step in a separate background process
    background_step = Prompt(
        "Background Prompt",
        inputs={"prompts": prompts.output["prompts"]},
        args={"llm": llm},
        background=True,
    )

    # Do other work here...
    print("Step is running in the background...")

    # Block until the background step is complete
    wait(background_step)
    print(background_step.output["generations"])
```

--------------------------------

### Augment HotpotQA Questions with LLM Decompositions

Source: https://github.com/datadreamer-dev/datadreamer/blob/main/docs/source/pages/get_started/quick_tour/dataset_augmentation.rst

This snippet demonstrates augmenting HotpotQA questions by using an LLM (GPT-4) to generate step-by-step intermediate questions. It loads the dataset, processes questions with a prompt, and publishes the results.

```python
from datadreamer import DataDreamer
from datadreamer.llms import OpenAI
from datadreamer.steps import ProcessWithPrompt, HFHubDataSource

with DataDreamer("./output"):
    # Load GPT-4
    gpt_4 = OpenAI(model_name="gpt-4")

    # Get HotPot QA questions
    hotpot_qa_dataset = HFHubDataSource(
        "Get Hotpot QA Questions",
        "hotpot_qa",
        config_name="distractor",
        split="train",
    ).select_columns(["question"])

    # Keep only 1000 questions as a quick demo
    hotpot_qa_dataset = hotpot_qa_dataset.take(1000)

    # Ask GPT-4 to decompose the question
    questions_and_decompositions = ProcessWithPrompt(
        "Generate Decompositions",
        inputs={"inputs": hotpot_qa_dataset.output["question"]},
        args={
            "llm": gpt_4,
            "instruction": (
                "Given the question which requires multiple steps to solve, give a numbered list of intermediate questions required to solve the question."
                "Return only the list, nothing else."
            ),
        },
        outputs={"inputs": "questions", "generations": "decompositions"},
    ).select_columns(["questions", "decompositions"])

    # Publish and share the synthetic dataset
    questions_and_decompositions.publish_to_hf_hub(
        "datadreamer-dev/hotpot_qa_augmented",
    )
```

--------------------------------

### Subset and Transform Dataset with DataDreamer

Source: https://context7.com/datadreamer-dev/datadreamer/llms.txt

Demonstrates taking a subset of a dataset, selecting specific columns, shuffling, and applying a mapping transformation. The `lazy=False` argument executes the map transformation eagerly.

```python
from datadreamer import DataDreamer
from datadreamer.steps import HFHubDataSource

with DataDreamer("./output"):
    dataset = HFHubDataSource("Load Dataset", "yahma/alpaca-cleaned", split="train")

    # Subset and reshape
    subset = (
        dataset
        .take(500)
        .select_columns(["instruction", "output"])
        .shuffle(seed=42)
    )

    # Map transformation (lazy=False executes eagerly)
    formatted = subset.map(
        lambda row: {
            "instruction": f"### Instruction:\n{row['instruction']}",
            "output": row["output"],
        },
        lazy=False,
        name="Format Instructions",
    )

    # Split for train/validation
    splits = formatted.splits(train_size=0.90, validation_size=0.10)
    print(f"Train size: {len(splits['train'].output['instruction'])}")
    print(f"Validation size: {len(splits['validation'].output['instruction'])}")
```

--------------------------------

### Publish Synthetic Dataset and Trained Model

Source: https://github.com/datadreamer-dev/datadreamer/blob/main/docs/source/index.rst

This code snippet demonstrates how to publish a generated synthetic dataset to the Hugging Face Hub and also publish the trained model. Ensure you have the necessary authentication for Hugging Face Hub.

```python
abstracts_and_tweets.publish_to_hf_hub(
         "datadreamer-dev/abstracts_and_tweets",
         train_size=0.90,
         validation_size=0.10,
      )

      # Publish and share the trained model
      trainer.publish_to_hf_hub("datadreamer-dev/abstracts_to_tweet_model")
```

--------------------------------

### Type Check Project with Mypy

Source: https://github.com/datadreamer-dev/datadreamer/blob/main/docs/source/pages/contributing.rst

Run this command to perform static type checking on the project using mypy and identify any type errors.

```bash
./scripts/run.sh -k "mypy"
```

--------------------------------

### Configure Google Analytics

Source: https://github.com/datadreamer-dev/datadreamer/blob/main/docs/source/_templates/metatags.html

Include this JavaScript snippet in your HTML to initialize Google Analytics. Replace 'G-1BQW0NQ6M8' with your actual Measurement ID.

```javascript
window.dataLayer = window.dataLayer || [];
function gtag(){
  dataLayer.push(arguments);
}

gtag('js', new Date());
gtag('config', 'G-1BQW0NQ6M8');
```

--------------------------------

### Align Model with DPO using DataDreamer

Source: https://context7.com/datadreamer-dev/datadreamer/llms.txt

Aligns a model with human preferences using Direct Preference Optimization (DPO). This trainer supports PEFT and multi-GPU configurations and allows publishing the aligned model to Hugging Face Hub.

```python
from datadreamer import DataDreamer
from datadreamer.steps import HFHubDataSource
from datadreamer.trainers import TrainHFDPO
from peft import LoraConfig

with DataDreamer("./output"):
    dpo_dataset = HFHubDataSource("Load DPO Dataset", "Intel/orca_dpo_pairs", split="train").take(1000)
    splits = dpo_dataset.splits(train_size=0.90, validation_size=0.10)

    trainer = TrainHFDPO(
        "Align TinyLlama-Chat with DPO",
        model_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        peft_config=LoraConfig(),
        device=["cuda:0", "cuda:1"],
        dtype="bfloat16",
    )
    trainer.train(
        train_prompts=splits["train"].output["question"],
        train_chosen=splits["train"].output["chosen"],
        train_rejected=splits["train"].output["rejected"],
        validation_prompts=splits["validation"].output["question"],
        validation_chosen=splits["validation"].output["chosen"],
        validation_rejected=splits["validation"].output["rejected"],
        epochs=3,
        batch_size=1,
        gradient_accumulation_steps=32,
    )
    trainer.publish_to_hf_hub("my-org/tinyllama-dpo-aligned")
```

--------------------------------

### Publish Synthetic Data to Hugging Face Hub

Source: https://context7.com/datadreamer-dev/datadreamer/llms.txt

Generates synthetic data using an LLM and publishes the resulting dataset to the Hugging Face Hub with automatic data card generation. Ensure your Hugging Face token is set as an environment variable or passed directly.

```python
from datadreamer import DataDreamer
from datadreamer.llms import OpenAI
from datadreamer.steps import DataFromPrompt

with DataDreamer("./output"):
    gpt_4 = OpenAI(model_name="gpt-4")

    synthetic_data = DataFromPrompt(
        "Generate Summaries",
        args={
            "llm": gpt_4, "n": 500, "temperature": 1.0,
            "instruction": "Write a one-paragraph summary of a fictional scientific discovery.",
        },
        outputs={"generations": "summaries"},
    )

    # Publish with train/validation split; DataDreamer auto-generates the data card
    synthetic_data.publish_to_hf_hub(
        "my-org/synthetic-summaries",
        train_size=0.90,
        validation_size=0.10,
        # token="hf_..." if not set via environment variable
    )
```

--------------------------------

### Load English Sentences for Translation

Source: https://github.com/datadreamer-dev/datadreamer/blob/main/docs/source/pages/get_started/quick_tour/bootstrapping_machine_translation.rst

Loads a dataset of English sentences from Hugging Face Hub (FLORES-101) and takes a subset for demonstration.

```python
# Get English sentences
english_dataset = HFHubDataSource(
	"Get FLORES-101 English Sentences",
	"gsarti/flores_101",
	config_name="eng",
	split="dev",
).select_columns(["sentence"])

# Keep only 400 examples as a quick demo
english_dataset = english_dataset.take(400)
```

--------------------------------

### Generate Movie Reviews with Attributed Prompts

Source: https://github.com/datadreamer-dev/datadreamer/blob/main/docs/source/pages/get_started/quick_tour/attributed_prompts.rst

Use DataFromAttributedPrompt to generate varied movie reviews by combining different movie names, elements, and review styles. This method allows for the creation of a diverse dataset representative of real-world data. Ensure the LLM and attributes are correctly configured.

```python
from datadreamer import DataDreamer
from datadreamer.llms import OpenAI
from datadreamer.steps import (
	Prompt,
	DataSource,
	DataFromAttributedPrompt,
)

with DataDreamer("./output"):
	# Load GPT-4
	gpt_4 = OpenAI(model_name="gpt-4")

	# Create prompts to generate attributes for movie reviews
	attribute_generation_prompts = DataSource(
		"Attribute Generation Prompts",
		data={
			"prompts": [
				"Generate the names of 10 movies released in theatres in the past, in a comma separated list.",
				"Generate 10 elements of a movie a reviewer might consider, in a comma separated list.",
				"Generate 10 adjectives that could describe a movie reviewer's style, in a comma separated list.",
			],
		},
	)

	# Generate the attributes for movie reviews
	attributes = Prompt(
		"Generate Attributes",
		inputs={
			"prompts": attribute_generation_prompts.output["prompts"],
		},
		args={
			"llm": gpt_4,
		},
	).output["generations"]

	# Generate movie reviews with varied attributes
	movie_reviews = (
		DataFromAttributedPrompt(
			"Generate Movie Reviews",
			args={
				"llm": gpt_4,
				"n": 1000,
				"instruction": "Generate a few sentence {review_style} movie review about {movie_name} that focuses on {movie_element}.",
				"attributes": {
					"movie_name": attributes[0].split(","),
					"movie_element": attributes[1].split(","),
					"review_style": attributes[2].split(","),
				},
			},
			outputs={"generations": "reviews"},
		)
		.select_columns(["reviews"])
		.shuffle()
	)

	# Publish and share the synthetic dataset
	movie_reviews.publish_to_hf_hub(
		"datadreamer-dev/movie_reviews",
	)

```

--------------------------------

### Self-Reward Training Loop with DataDreamer

Source: https://github.com/datadreamer-dev/datadreamer/blob/main/docs/source/pages/get_started/quick_tour/self_rewarding.rst

This snippet outlines the main loop for self-reward training. It iterates through multiple rounds, loading a base LLM, sampling candidate responses, having the LLM judge its own outputs, and then fine-tuning the model using the generated preferences. The trained adapter is then applied in the next round.

```python
from datadreamer import DataDreamer
from datadreamer.steps import HFHubDataSource, Prompt, JudgeGenerationPairsWithPrompt
from datadreamer.trainers import TrainHFDPO
from datadreamer.llms import HFTransformers
from peft import LoraConfig

with DataDreamer("./output"):
	# Get a dataset of prompts
	prompts_dataset = HFHubDataSource(
		"Get Prompts Dataset", "Intel/orca_dpo_pairs", split="train"
	).select_columns(["question"])

	# Keep only 3000 examples as a quick demo
	prompts_dataset = prompts_dataset.take(3000)

	# Define how many rounds of self-reward training
ounds = 3

	# For each round of self-reward training
	adapter_to_apply = None
	for r in range(rounds):
		# Use a partial set of the prompts for each round
		prompts_for_round = prompts_dataset.shard(
			num_shards=rounds, index=r, name=f"Round #{r+1}: Get Prompts"
		)

		# Load the LLM
		llm = HFTransformers(
			"TinyLlama/TinyLlama-1.1B-Chat-v1.0",
			adapter_name=adapter_to_apply,
			device_map="auto",
			dtype="bfloat16",
		)

		# Sample 2 candidate responses from the LLM
		candidate_responses = []
		for candidate_idx in range(2):
			candidate_responses.append(
				Prompt(
					f"Round #{r+1}: Sample Candidate Response #{candidate_idx}",
					inputs={"prompts": prompts_for_round.output["question"]},
					args={
						"llm": llm,
						"batch_size": 2,
						"top_p": 1.0,
						"seed": candidate_idx,
					},
				)
			)

		# Have the LLM judge its own responses
		judgements = JudgeGenerationPairsWithPrompt(
			f"Round #{r+1}: Judge Candidate Responses",
			args={
				"llm": llm,
				"batch_size": 1,
				"max_new_tokens": 5,
			},
			inputs={
				"prompts": prompts_for_round.output["question"],
				"a": candidate_responses[0].output["generations"],
				"b": candidate_responses[1].output["generations"],
			},
		)

		# Unload the LLM
		llm.unload_model()

		# Process the judgements into a preference dataset
		dpo_dataset = judgements.map(
				lambda row: {
					"question": row["prompts"],
					"chosen": row["a"] if row["judgements"] == "Response A" else row["b"],
					"rejected": row["b"] if row["judgements"] == "Response A" else row["a"],
				},
			lazy=False,
			name=f"Round #{r+1}: Create Self-Reward Preference Dataset",
		)

		# Create training data splits
	splits = dpo_dataset.splits(train_size=0.90, validation_size=0.10)

		# Align the TinyLlama chat model with its own preferences
		trainer = TrainHFDPO(
				f"Round #{r+1}: Self-Reward Align TinyLlama-Chat",
				model_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
				peft_config=LoraConfig(),
				device=["cuda:0", "cuda:1"],
				dtype="bfloat16",
			)
		trainer.train(
				train_prompts=splits["train"].output["question"],
				train_chosen=splits["train"].output["chosen"],
				train_rejected=splits["train"].output["rejected"],
				validation_prompts=splits["validation"].output["question"],
				validation_chosen=splits["validation"].output["chosen"],
				validation_rejected=splits["validation"].output["rejected"],
				epochs=3,
				batch_size=1,
				gradient_accumulation_steps=32,
			)

		# Unload the trained model from memory
		trainer.unload_model()

		# Use the newly trained adapter for the next round of self-reward
		adapter_to_apply = trainer.model_path
```

--------------------------------

### Generate Research Paper Abstracts with GPT-4

Source: https://github.com/datadreamer-dev/datadreamer/blob/main/docs/source/index.rst

Uses GPT-4 to generate arXiv abstracts for NLP research papers. Ensure 'gpt_4' is initialized and accessible.

```python
ProcessWithPrompt(
         "Generate Research Paper Abstracts",
         args={
            "llm": gpt_4,
            "n": 1000,
            "temperature": 1.2,
            "instruction": (
               "Generate an arXiv abstract of an NLP research paper."
               " Return just the abstract, no titles."
            ),
         },
         outputs={"generations": "abstracts"},
      )
```

--------------------------------

### Initialize DataDreamer Session

Source: https://github.com/datadreamer-dev/datadreamer/blob/main/docs/source/pages/get_started/overview_guide.rst

All DataDreamer code should be placed within a DataDreamer session. This ensures automatic organization, caching, and saving of results, making the session resumable and reproducible.

```python
from datadreamer import DataDreamer

with DataDreamer('./output/'):
    # ... run steps or trainers here ...
```