### Few-Shot Prompting for Guided Generation

Source: https://context7.com/datadreamer-dev/datadreamer/llms.txt

Use the `FewShotPrompt` step to guide LLM generation by providing input/output examples. This is useful for tasks like translation or style transfer, where demonstrating the desired format is key. Set `input_label` and `output_label` so the examples and new inputs are clearly marked in the prompt.

```python
from datadreamer import DataDreamer
from datadreamer.llms import OpenAI
from datadreamer.steps import DataSource, FewShotPrompt

with DataDreamer("./output"):
    llm = OpenAI(model_name="gpt-4")

    # Few-shot example pairs (input column and matching output column)
    examples = DataSource(
        "Translation Examples",
        data={
            "english": ["Hello", "Thank you"],
            "tamil": ["வணக்கம்", "நன்றி"],
        },
    )
    # New inputs to translate
    new_sentences = DataSource(
        "New Sentences",
        data={"sentence": ["Good morning", "How are you?"]},
    )

    translated = FewShotPrompt(
        "Translate to Tamil",
        inputs={
            "input_examples": examples.output["english"],
            "output_examples": examples.output["tamil"],
            "inputs": new_sentences.output["sentence"],
        },
        args={
            "llm": llm,
            "input_label": "English:",
            "output_label": "Tamil:",
            "instruction": "Translate the sentence to Tamil.",
            "max_new_tokens": 200,
        },
        outputs={"inputs": "english", "generations": "tamil"},
    )

    print(translated.output["tamil"])
    # Example output (actual generations may vary):
    # ['காலை வணக்கம்', 'நீங்கள் எப்படி இருக்கிறீர்கள்?']
```

--------------------------------

### Set Up Project for Local Development

Source: https://github.com/datadreamer-dev/datadreamer/blob/main/docs/source/pages/contributing.rst

Run these commands after cloning the repository to configure local hooks and prepare the project for development.

```bash
cd DataDreamer
git config --local core.hooksPath ./scripts/.githooks/
./scripts/.githooks/post-checkout
```

--------------------------------

### Install DataDreamer using pip

Source: https://github.com/datadreamer-dev/datadreamer/blob/main/docs/source/pages/get_started/installation.rst

Use this command to install the DataDreamer library from PyPI. Ensure you have pip3 installed.
```bash
pip3 install datadreamer.dev
```

--------------------------------

### Basic Step Structure in DataDreamer

Source: https://github.com/datadreamer-dev/datadreamer/blob/main/docs/source/pages/advanced_usage/creating_a_new_datadreamer_.../step.rst

Define a new custom step by subclassing `datadreamer.steps.Step` and implementing the `setup` and `run` methods. The `setup` method registers the step's inputs, arguments, and outputs, and `run` contains the core logic.

```python
from datadreamer.steps import Step


class MyNewStep(Step):
    def setup(self):
        # Register inputs, arguments, outputs, and data card information here
        pass

    def run(self):
        # Implement your custom data processing / transformation logic here
        pass
```

--------------------------------

### Create Custom Data Processing Step

Source: https://context7.com/datadreamer-dev/datadreamer/llms.txt

Define a reusable custom step by subclassing `Step` and implementing `setup()` and `run()`. This example counts the words in each input text and drops texts that fall below an optional minimum word count.
```python
from datadreamer import DataDreamer
from datadreamer.steps import DataSource, LazyRows, Step


class WordCountStep(Step):
    def setup(self):
        self.register_input("texts")  # expected input column
        self.register_arg("min_words", required=False, default=0)
        self.register_output("texts")
        self.register_output("word_counts")

    def run(self):
        texts = self.inputs["texts"]
        min_words = self.args["min_words"]

        def generate_rows():
            for text in texts:
                count = len(text.split())
                if count >= min_words:
                    yield {"texts": text, "word_counts": count}

        return LazyRows(generate_rows, total_num_rows=len(texts))


with DataDreamer("./output"):
    source = DataSource(
        "Source",
        data={"texts": ["Hello world", "A longer sentence here", "Hi"]},
    )
    counted = WordCountStep(
        "Count Words",
        inputs={"texts": source.output["texts"]},
        args={"min_words": 2},
    )

    for text, count in zip(counted.output["texts"], counted.output["word_counts"]):
        print(f"{count} words: {text}")
    # 2 words: Hello world
    # 4 words: A longer sentence here
```

--------------------------------

### Multi-GPU Training Setup

Source: https://github.com/datadreamer-dev/datadreamer/blob/main/docs/source/pages/advanced_usage/parallelization/training_models_on_multiple_gpus.rst

To train on multiple GPUs, pass a list of devices to the `device` parameter of the `Trainer` constructor. This enables distributed training modes like FSDP or DDP.

```python
from datadreamer.trainers import Trainer

trainer = Trainer(
    ...,
    device=["cuda:0", "cuda:1"]
)
```

--------------------------------

### Skip Dependency Installation for Faster Runs

Source: https://github.com/datadreamer-dev/datadreamer/blob/main/docs/source/pages/contributing.rst

Set the `PROJECT_SKIP_INSTALL_REQS` environment variable to `1` to bypass dependency installation on subsequent runs, making them faster. The example also shows how to combine it with a `-k` test filter.
```bash
export PROJECT_SKIP_INSTALL_REQS=1
./scripts/run.sh -k "TestFewShotPrompt"
```

--------------------------------

### Initialize DataDreamer Session

Source: https://context7.com/datadreamer-dev/datadreamer/llms.txt

Start a DataDreamer session using a context manager. This sets up an output directory for caching and persistence. Re-running the script will resume from the cached state.

```python
from datadreamer import DataDreamer

with DataDreamer("./output"):
    # All steps, trainers, and model usage go here.
    # Outputs are automatically cached to ./output/
    # Re-running the script resumes from cached state.
    pass
```

--------------------------------

### Translate Test Set with Final Examples

Source: https://github.com/datadreamer-dev/datadreamer/blob/main/docs/source/pages/get_started/quick_tour/bootstrapping_machine_translation.rst

Translates a separate test set of English sentences to Tamil using the best few-shot examples bootstrapped in the previous rounds.

```python
# Fragment: belongs inside the DataDreamer session of the bootstrapping script and
# relies on `gpt_4` and `best_translation_pairs` defined in its earlier steps.

# Load the test set of English sentences
english_test_dataset = HFHubDataSource(
    "Get FLORES-101 English Sentences (Test Set)",
    "gsarti/flores_101",
    config_name="eng",
    split="devtest",
).select_columns(["sentence"])

# Finally translate the test set with the final bootstrapped synthetic few-shot examples
english_test_to_tamil = FewShotPrompt(
    "Few-shot Translate from English To Tamil (Test Set)",
    inputs={
        "input_examples": best_translation_pairs.output["english"],
        "output_examples": best_translation_pairs.output["tamil"],
        "inputs": english_test_dataset.output["sentence"],
    },
    args={
        "llm": gpt_4,
        "input_label": "English:",
        "output_label": "Tamil:",
        "instruction": "Translate the sentence to Tamil.",
        "max_new_tokens": 1000,
    },
    outputs={"inputs": "english", "generations": "tamil"},
).select_columns(["english", "tamil"])
```

--------------------------------

### Multi-Node Fine-Tuning with DataDreamer

Source: https://context7.com/datadreamer-dev/datadreamer/llms.txt

Configure and run a fine-tuning job
across multiple nodes. Ensure `NODE_RANK` is set correctly for each node and that `distributed_config` matches your network setup.

```python
from datadreamer import DataDreamer
from datadreamer.steps import HFHubDataSource
from datadreamer.trainers import TrainHFFineTune
from peft import LoraConfig

NODE_RANK = 0  # 0 for master node, 1 for second node, etc.
TOTAL_NODES = 2

with DataDreamer("./output"):
    dataset = HFHubDataSource(
        "Load Data", "yahma/alpaca-cleaned", split="train"
    ).take(2000)
    splits = dataset.splits(train_size=0.90, validation_size=0.10)

    trainer = TrainHFFineTune(
        "Multi-Node Fine-Tune",
        model_name="TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T",
        peft_config=LoraConfig(),
        device=["cuda:0", "cuda:1"],  # GPUs on this node
        dtype="bfloat16",
        distributed_config={
            "master_addr": "192.168.1.100",  # IP of master node
            "master_port": "29500",
            "nnodes": TOTAL_NODES,
            "node_rank": NODE_RANK,
        },
    )
    trainer.train(
        train_input=splits["train"].output["instruction"],
        train_output=splits["train"].output["output"],
        validation_input=splits["validation"].output["instruction"],
        validation_output=splits["validation"].output["output"],
        epochs=3,
        batch_size=2,
    )
```

--------------------------------

### Train Model to Generate Tweet Summary

Source: https://github.com/datadreamer-dev/datadreamer/blob/main/docs/source/index.rst

This example trains a model to generate a tweet summarizing a research paper abstract using synthetic data. It involves loading an LLM, generating synthetic data from prompts, and fine-tuning a model.
```python
from datadreamer import DataDreamer
from datadreamer.llms import OpenAI
from datadreamer.steps import DataFromPrompt, ProcessWithPrompt
from datadreamer.trainers import TrainHFFineTune
from peft import LoraConfig

with DataDreamer("./output"):
    # Load GPT-4
    gpt_4 = OpenAI(model_name="gpt-4")

    # Generate synthetic arXiv-style research paper abstracts with GPT-4
    arxiv_dataset = DataFromPrompt(
```

--------------------------------

### Parallel LLM Inference

Source: https://context7.com/datadreamer-dev/datadreamer/llms.txt

Run the same smaller model on multiple GPUs in parallel using `ParallelLLM` for increased throughput. This example demonstrates parallel prompting.

```python
from datadreamer import DataDreamer
from datadreamer.llms import HFTransformers, ParallelLLM
from datadreamer.steps import Prompt

with DataDreamer("./output"):
    # One copy of the model per GPU
    parallel_llm = ParallelLLM(
        HFTransformers("TinyLlama/TinyLlama-1.1B-Chat-v1.0", device="cuda:0"),
        HFTransformers("TinyLlama/TinyLlama-1.1B-Chat-v1.0", device="cuda:1"),
    )

    results = Prompt(
        "Parallel Prompting",
        inputs={"prompts": ["What is AI?", "Explain quantum computing."]},
        args={"llm": parallel_llm},
    )
    print(results.output["generations"])
```

--------------------------------

### Align TinyLlama-Chat with Human Preferences using DPO

Source: https://github.com/datadreamer-dev/datadreamer/blob/main/docs/source/pages/get_started/quick_tour/aligning.rst

This snippet aligns the TinyLlama chat model with human preferences using DataDreamer's `TrainHFDPO` trainer. It fetches a DPO dataset from the Hugging Face Hub, takes a subset for demonstration, splits it into training and validation sets, and then trains the model using LoRA for efficient fine-tuning. Ensure you have the necessary libraries installed and CUDA-enabled devices available.
```python
from datadreamer import DataDreamer
from datadreamer.steps import HFHubDataSource
from datadreamer.trainers import TrainHFDPO
from peft import LoraConfig

with DataDreamer("./output"):
    # Get the DPO dataset
    dpo_dataset = HFHubDataSource(
        "Get DPO Dataset", "Intel/orca_dpo_pairs", split="train"
    )

    # Keep only 1000 examples as a quick demo
    dpo_dataset = dpo_dataset.take(1000)

    # Create training data splits
    splits = dpo_dataset.splits(train_size=0.90, validation_size=0.10)

    # Align the TinyLlama chat model with human preferences
    trainer = TrainHFDPO(
        "Align TinyLlama-Chat",
        model_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        peft_config=LoraConfig(),
        device=["cuda:0", "cuda:1"],
        dtype="bfloat16",
    )
    trainer.train(
        train_prompts=splits["train"].output["question"],
        train_chosen=splits["train"].output["chosen"],
        train_rejected=splits["train"].output["rejected"],
        validation_prompts=splits["validation"].output["question"],
        validation_chosen=splits["validation"].output["chosen"],
        validation_rejected=splits["validation"].output["rejected"],
        epochs=3,
        batch_size=1,
        gradient_accumulation_steps=32,
    )
```

--------------------------------

### Initialize DataDreamer and Load LLM

Source: https://github.com/datadreamer-dev/datadreamer/blob/main/docs/source/pages/get_started/quick_tour/bootstrapping_machine_translation.rst

Sets up the DataDreamer environment and loads the specified large language model (GPT-4 in this case).

```python
from datadreamer import DataDreamer
from datadreamer.llms import OpenAI
from datadreamer.steps import (
    FewShotPrompt,
    ProcessWithPrompt,
    HFHubDataSource,
    CosineSimilarity,
)
from datadreamer.embedders import SentenceTransformersEmbedder

with DataDreamer("./output"):
    # Load GPT-4
    gpt_4 = OpenAI(model_name="gpt-4")
```

--------------------------------

### Run All Project Tests

Source: https://github.com/datadreamer-dev/datadreamer/blob/main/docs/source/pages/contributing.rst

Execute this command to run all test cases defined in the project.
```bash
./scripts/run.sh
```

--------------------------------

### Load HF Hub DataSource Step

Source: https://context7.com/datadreamer-dev/datadreamer/llms.txt

Load an existing dataset from the Hugging Face Hub as an input step. You can then take subsets and select specific columns.

```python
from datadreamer import DataDreamer
from datadreamer.steps import HFHubDataSource

with DataDreamer("./output"):
    dataset = HFHubDataSource(
        "Load Alpaca Dataset",
        "yahma/alpaca-cleaned",
        split="train",
    )

    # Keep only 500 rows, select specific columns
    subset = dataset.take(500).select_columns(["instruction", "output"])
    print(subset.output["instruction"][0])
```

--------------------------------

### Instruction-Tune TinyLlama with DataDreamer

Source: https://github.com/datadreamer-dev/datadreamer/blob/main/docs/source/pages/get_started/quick_tour/instruction_tuning.rst

This snippet walks through instruction-tuning a base LLM with DataDreamer: fetching a dataset, formatting it, splitting it into training and validation sets, defining a prompt template, and configuring the fine-tuning trainer with LoRA. Use it to improve a base model's ability to follow instructions.
```python
from datadreamer import DataDreamer
from datadreamer.steps import HFHubDataSource
from datadreamer.trainers import TrainHFFineTune
from peft import LoraConfig

with DataDreamer("./output"):
    # Get the Alpaca instruction-tuning dataset (cleaned version)
    instruction_tuning_dataset = HFHubDataSource(
        "Get Alpaca Instruction-Tuning Dataset", "yahma/alpaca-cleaned", split="train"
    )

    # Keep only 1000 examples as a quick demo
    instruction_tuning_dataset = instruction_tuning_dataset.take(1000)

    # Some examples include an "input" field; fold it into the instruction.
    # The result of map() is assigned back so the splits below use the formatted rows.
    instruction_tuning_dataset = instruction_tuning_dataset.map(
        lambda row: {
            "instruction": (
                row["instruction"]
                if len(row["input"]) == 0
                else f"Input: {row['input']}\n\n{row['instruction']}"
            ),
            "output": row["output"],
        },
        lazy=False,
    )

    # Create training data splits
    splits = instruction_tuning_dataset.splits(train_size=0.90, validation_size=0.10)

    # Define what the prompt template should be when instruction-tuning
    chat_prompt_template = "### Instruction:\n{{prompt}}\n\n### Response:\n"

    # Instruction-tune the base TinyLlama model to make it follow instructions
    trainer = TrainHFFineTune(
        "Instruction-Tune TinyLlama",
        model_name="TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T",
        chat_prompt_template=chat_prompt_template,
        peft_config=LoraConfig(),
        device=["cuda:0", "cuda:1"],
        dtype="bfloat16",
    )
    trainer.train(
        train_input=splits["train"].output["instruction"],
        train_output=splits["train"].output["output"],
        validation_input=splits["validation"].output["instruction"],
        validation_output=splits["validation"].output["output"],
        epochs=3,
        batch_size=1,
        gradient_accumulation_steps=32,
    )
```

--------------------------------

### Few-Shot Translation for Refined Pairs

Source: https://github.com/datadreamer-dev/datadreamer/blob/main/docs/source/pages/get_started/quick_tour/bootstrapping_machine_translation.rst

Uses previously generated best translation pairs as few-shot examples to translate more English sentences, refining the synthetic pairs
in subsequent rounds.

```python
else:
    # On subsequent rounds, use the best synthetic translation pairs from the previous round
    # as few-shot examples to translate more English sentences to create even better synthetic pairs
    english_to_tamil = FewShotPrompt(
        f"Round #{r+1}: Few-shot Translate from English To Tamil",
        inputs={
            "input_examples": best_translation_pairs.output["english"],
            "output_examples": best_translation_pairs.output["tamil"],
            "inputs": sentences_for_round.output["sentence"],
        },
        args={
            "llm": gpt_4,
            "input_label": "English:",
            "output_label": "Tamil:",
            "instruction": "Translate the sentence to Tamil.",
            "max_new_tokens": 1000,
        },
        outputs={"inputs": "english", "generations": "tamil"},
    ).select_columns(["english", "tamil"])
```

--------------------------------

### Run a Single Test File

Source: https://github.com/datadreamer-dev/datadreamer/blob/main/docs/source/pages/contributing.rst

Use this command to run tests from a specific file by providing its path as an argument.

```bash
./scripts/run.sh src/tests/steps/prompt/test_prompt.py
```

--------------------------------

### Transform Data with ProcessWithPrompt

Source: https://context7.com/datadreamer-dev/datadreamer/llms.txt

Use the `ProcessWithPrompt` step to apply an LLM instruction to each item in an input column. The step requires an input column (for example, one generated by `DataFromPrompt`) and an LLM. Use the `outputs` mapping to rename the resulting columns.
```python
from datadreamer import DataDreamer
from datadreamer.llms import OpenAI
from datadreamer.steps import DataFromPrompt, ProcessWithPrompt

with DataDreamer("./output"):
    gpt_4 = OpenAI(model_name="gpt-4")

    abstracts = DataFromPrompt(
        "Generate Abstracts",
        args={
            "llm": gpt_4,
            "n": 100,
            "temperature": 1.2,
            "instruction": "Generate an arXiv NLP paper abstract.",
        },
        outputs={"generations": "abstracts"},
    )

    tweets = ProcessWithPrompt(
        "Generate Tweets from Abstracts",
        inputs={"inputs": abstracts.output["abstracts"]},
        args={
            "llm": gpt_4,
            "instruction": "Given the abstract, write a tweet to summarize the work.",
            "top_p": 1.0,
        },
        outputs={"inputs": "abstracts", "generations": "tweets"},
    )

    print(tweets.output["tweets"][0])
    # Expected: A short tweet summarizing the first abstract.
```

--------------------------------

### Monitoring GPU Memory Usage

Source: https://github.com/datadreamer-dev/datadreamer/blob/main/docs/source/pages/advanced_usage/parallelization/training_models_on_multiple_gpus.rst

Enable verbose logging for GPU memory usage during multi-GPU training by setting the `verbose` parameter to `True` in the `Trainer` constructor. This logs memory usage at the start, end, and after each epoch.

```python
from datadreamer.trainers import Trainer

trainer = Trainer(
    ...,
    device=["cuda:0", "cuda:1"],
    verbose=True
)
```

--------------------------------

### Generate Synthetic Data with DataFromPrompt

Source: https://context7.com/datadreamer-dev/datadreamer/llms.txt

Use the `DataFromPrompt` step to generate a specified number of independent synthetic data items from a single instruction prompt. Configure the LLM, the number of items (`n`), the temperature, and the instruction itself. The output column can be renamed using the `outputs` argument.
```python
from datadreamer import DataDreamer
from datadreamer.llms import OpenAI
from datadreamer.steps import DataFromPrompt

with DataDreamer("./output"):
    gpt_4 = OpenAI(model_name="gpt-4")

    abstracts = DataFromPrompt(
        "Generate Research Paper Abstracts",
        args={
            "llm": gpt_4,
            "n": 1000,
            "temperature": 1.2,
            "instruction": (
                "Generate an arXiv abstract of an NLP research paper."
                " Return just the abstract, no titles."
            ),
        },
        outputs={"generations": "abstracts"},  # rename output column
    )

    print(f"Generated {len(abstracts.output['abstracts'])} abstracts")
```

--------------------------------

### Send Prompts to LLM and Collect Responses

Source: https://context7.com/datadreamer-dev/datadreamer/llms.txt

Use the `Prompt` step to send a list of prompts to an LLM and collect the generated responses. Ensure the LLM and data sources are properly initialized.

```python
from datadreamer import DataDreamer
from datadreamer.llms import OpenAI
from datadreamer.steps import DataSource, Prompt

with DataDreamer("./output"):
    llm = OpenAI(model_name="gpt-4")

    prompts = DataSource(
        "Input Prompts",
        data={"prompts": ["Summarize the theory of relativity.", "What is CRISPR?"]},
    )

    results = Prompt(
        "Run Prompts",
        inputs={"prompts": prompts.output["prompts"]},
        args={
            "llm": llm,
            "temperature": 0.7,
            "max_new_tokens": 256,
        },
    )

    print(results.output["generations"][0])
    # Expected: A concise summary of the theory of relativity.
```

--------------------------------

### Distill GPT-4 Capabilities to GPT-3.5

Source: https://github.com/datadreamer-dev/datadreamer/blob/main/docs/source/pages/get_started/quick_tour/openai_distillation.rst

This script distills GPT-4's "ELI5" generation capabilities into GPT-3.5. It loads GPT-4, fetches ELI5 questions from the Hugging Face Hub, uses GPT-4 to generate answers, splits the data, and fine-tunes GPT-3.5.
```python
from datadreamer import DataDreamer
from datadreamer.llms import OpenAI
from datadreamer.steps import ProcessWithPrompt, HFHubDataSource
from datadreamer.trainers import TrainOpenAIFineTune

with DataDreamer("./output"):
    # Load GPT-4
    gpt_4 = OpenAI(model_name="gpt-4")

    # Get ELI5 questions
    eli5_dataset = HFHubDataSource(
        "Get ELI5 Questions",
        "eli5_category",
        split="train",
        trust_remote_code=True,
    ).select_columns(["title"])

    # Keep only 1000 examples as a quick demo
    eli5_dataset = eli5_dataset.take(1000)

    # Ask GPT-4 to ELI5
    questions_and_answers = ProcessWithPrompt(
        "Generate Explanations",
        inputs={"inputs": eli5_dataset.output["title"]},
        args={
            "llm": gpt_4,
            "instruction": (
                'Given the question, give an "Explain it like I\'m 5" answer.'
            ),
            "top_p": 1.0,
        },
        outputs={"inputs": "questions", "generations": "answers"},
    )

    # Create training data splits
    splits = questions_and_answers.splits(train_size=0.90, validation_size=0.10)

    # Train a model to answer questions in ELI5 style
    trainer = TrainOpenAIFineTune(
        "Distill capabilities to GPT-3.5",
        model_name="gpt-3.5-turbo-1106",
    )
    trainer.train(
        train_input=splits["train"].output["questions"],
        train_output=splits["train"].output["answers"],
        validation_input=splits["validation"].output["questions"],
        validation_output=splits["validation"].output["answers"],
        epochs=30,
        batch_size=8,
    )
```

--------------------------------

### Format Project Code

Source: https://github.com/datadreamer-dev/datadreamer/blob/main/docs/source/pages/contributing.rst

Execute this script to check for style violations and automatically format the project's code according to the style guidelines.

```bash
./scripts/format.sh
```

--------------------------------

### Instruction-Tune a Hugging Face Model with DataDreamer

Source: https://context7.com/datadreamer-dev/datadreamer/llms.txt

Fine-tunes a Hugging Face model using the `TrainHFFineTune` trainer, supporting PEFT (like LoRA) and multi-GPU training.
The trained model can be exported and published to the Hugging Face Hub with an auto-generated model card.

```python
from datadreamer import DataDreamer
from datadreamer.steps import HFHubDataSource
from datadreamer.trainers import TrainHFFineTune
from peft import LoraConfig

with DataDreamer("./output"):
    dataset = HFHubDataSource(
        "Load Alpaca", "yahma/alpaca-cleaned", split="train"
    ).take(1000)

    # Fold the optional "input" field into the instruction
    dataset = dataset.map(
        lambda row: {
            "instruction": row["instruction"]
            if not row["input"]
            else f"Input: {row['input']}\n\n{row['instruction']}",
            "output": row["output"],
        },
        lazy=False,
    )
    splits = dataset.splits(train_size=0.90, validation_size=0.10)

    trainer = TrainHFFineTune(
        "Instruction-Tune TinyLlama",
        model_name="TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T",
        chat_prompt_template="### Instruction:\n{{prompt}}\n\n### Response:\n",
        peft_config=LoraConfig(),
        device=["cuda:0", "cuda:1"],  # multi-GPU training
        dtype="bfloat16",
    )
    trainer.train(
        train_input=splits["train"].output["instruction"],
        train_output=splits["train"].output["output"],
        validation_input=splits["validation"].output["instruction"],
        validation_output=splits["validation"].output["output"],
        epochs=3,
        batch_size=1,
        gradient_accumulation_steps=32,
    )

    # Export and publish the trained model with auto-generated model card
    trainer.publish_to_hf_hub("my-org/tinyllama-instruction-tuned")
    print(f"Model saved to: {trainer.model_path}")
```

--------------------------------

### Generate Diverse Synthetic Data with DataFromAttributedPrompt

Source: https://context7.com/datadreamer-dev/datadreamer/llms.txt

Combines variable attribute lists into instruction templates to generate diverse synthetic data. Requires an LLM and attribute lists as input.
```python
from datadreamer import DataDreamer
from datadreamer.llms import OpenAI
from datadreamer.steps import DataSource, Prompt, DataFromAttributedPrompt

with DataDreamer("./output"):
    gpt_4 = OpenAI(model_name="gpt-4")

    # Generate attribute lists via prompting
    attribute_prompts = DataSource(
        "Attribute Prompts",
        data={
            "prompts": [
                "Generate 10 movie names, comma-separated.",
                "Generate 10 review styles, comma-separated.",
            ]
        },
    )
    attributes = Prompt(
        "Generate Attributes",
        inputs={"prompts": attribute_prompts.output["prompts"]},
        args={"llm": gpt_4},
    ).output["generations"]

    movie_reviews = DataFromAttributedPrompt(
        "Generate Movie Reviews",
        args={
            "llm": gpt_4,
            "n": 200,
            "instruction": "Write a {review_style} review of {movie_name}.",
            "attributes": {
                "movie_name": attributes[0].split(","),
                "review_style": attributes[1].split(","),
            },
        },
        outputs={"generations": "reviews"},
    ).select_columns(["reviews"]).shuffle()

    movie_reviews.publish_to_hf_hub("my-org/movie-reviews-dataset")
```

--------------------------------

### Switching to DDP Mode

Source: https://github.com/datadreamer-dev/datadreamer/blob/main/docs/source/pages/advanced_usage/parallelization/training_models_on_multiple_gpus.rst

To use Distributed Data Parallel (DDP) instead of the default FSDP, set `fsdp=False` in the `Trainer` constructor. This is useful for scaling batch size but requires the model to fit on a single GPU.

```python
from datadreamer.trainers import Trainer

trainer = Trainer(
    ...,
    device=["cuda:0", "cuda:1"],
    fsdp=False
)
```

--------------------------------

### Load OpenAI LLM

Source: https://context7.com/datadreamer-dev/datadreamer/llms.txt

Load an OpenAI API-based model like GPT-4 for use in prompting steps. Ensure you have the necessary API keys configured.
```python
from datadreamer import DataDreamer
from datadreamer.llms import OpenAI

with DataDreamer("./output"):
    gpt_4 = OpenAI(model_name="gpt-4")
    # gpt_4 can now be passed to any prompting step
```

--------------------------------

### Create DataSource Step

Source: https://context7.com/datadreamer-dev/datadreamer/llms.txt

Create a DataDreamer step from in-memory Python data (a dictionary of lists) to seed a workflow. Outputs can be accessed by column name.

```python
from datadreamer import DataDreamer
from datadreamer.steps import DataSource

with DataDreamer("./output"):
    source = DataSource(
        "My Dataset",
        data={
            "questions": ["What is the capital of France?", "Who wrote Hamlet?"],
            "labels": ["geography", "literature"],
        },
    )

    # Access output columns
    for row in source.output["questions"]:
        print(row)
    # Expected output:
    # What is the capital of France?
    # Who wrote Hamlet?
```

--------------------------------

### Run a Step Asynchronously in Background

Source: https://context7.com/datadreamer-dev/datadreamer/llms.txt

Launch a step in a separate background process using `background=True` and wait for its completion using `wait()`. This allows other work to be done while the step executes.

```python
from datadreamer import DataDreamer
from datadreamer.llms import OpenAI
from datadreamer.steps import DataSource, Prompt, wait

with DataDreamer("./output"):
    llm = OpenAI(model_name="gpt-4")
    prompts = DataSource(
        "Prompts",
        data={"prompts": ["Explain neural networks in simple terms."]},
    )

    # background=True launches step in a separate background process
    background_step = Prompt(
        "Background Prompt",
        inputs={"prompts": prompts.output["prompts"]},
        args={"llm": llm},
        background=True,
    )

    # Do other work here...
    print("Step is running in the background...")

    # Block until the background step is complete
    wait(background_step)
    print(background_step.output["generations"])
```

--------------------------------

### Augment HotpotQA Questions with LLM Decompositions

Source: https://github.com/datadreamer-dev/datadreamer/blob/main/docs/source/pages/get_started/quick_tour/dataset_augmentation.rst

This snippet augments HotpotQA questions by using an LLM (GPT-4) to generate step-by-step intermediate questions. It loads the dataset, processes questions with a prompt, and publishes the results.

```python
from datadreamer import DataDreamer
from datadreamer.llms import OpenAI
from datadreamer.steps import ProcessWithPrompt, HFHubDataSource

with DataDreamer("./output"):
    # Load GPT-4
    gpt_4 = OpenAI(model_name="gpt-4")

    # Get HotPot QA questions
    hotpot_qa_dataset = HFHubDataSource(
        "Get Hotpot QA Questions",
        "hotpot_qa",
        config_name="distractor",
        split="train",
    ).select_columns(["question"])

    # Keep only 1000 questions as a quick demo
    hotpot_qa_dataset = hotpot_qa_dataset.take(1000)

    # Ask GPT-4 to decompose the question
    questions_and_decompositions = ProcessWithPrompt(
        "Generate Decompositions",
        inputs={"inputs": hotpot_qa_dataset.output["question"]},
        args={
            "llm": gpt_4,
            # Note the trailing space so the two string literals join cleanly
            "instruction": (
                "Given the question which requires multiple steps to solve, give a numbered list of intermediate questions required to solve the question. "
                "Return only the list, nothing else."
            ),
        },
        outputs={"inputs": "questions", "generations": "decompositions"},
    ).select_columns(["questions", "decompositions"])

    # Publish and share the synthetic dataset
    questions_and_decompositions.publish_to_hf_hub(
        "datadreamer-dev/hotpot_qa_augmented",
    )
```

--------------------------------

### Subset and Transform Dataset with DataDreamer

Source: https://context7.com/datadreamer-dev/datadreamer/llms.txt

Demonstrates taking a subset of a dataset, selecting specific columns, shuffling, and applying a mapping transformation. The `lazy=False` argument executes the map transformation eagerly.

```python
from datadreamer import DataDreamer
from datadreamer.steps import HFHubDataSource

with DataDreamer("./output"):
    dataset = HFHubDataSource("Load Dataset", "yahma/alpaca-cleaned", split="train")

    # Subset and reshape
    subset = (
        dataset
        .take(500)
        .select_columns(["instruction", "output"])
        .shuffle(seed=42)
    )

    # Map transformation (lazy=False executes eagerly)
    formatted = subset.map(
        lambda row: {
            "instruction": f"### Instruction:\n{row['instruction']}",
            "output": row["output"],
        },
        lazy=False,
        name="Format Instructions",
    )

    # Split for train/validation
    splits = formatted.splits(train_size=0.90, validation_size=0.10)
    print(f"Train size: {len(splits['train'].output['instruction'])}")
    print(f"Validation size: {len(splits['validation'].output['instruction'])}")
```

--------------------------------

### Publish Synthetic Dataset and Trained Model

Source: https://github.com/datadreamer-dev/datadreamer/blob/main/docs/source/index.rst

This snippet publishes a generated synthetic dataset to the Hugging Face Hub and then publishes the trained model. Ensure you are authenticated with the Hugging Face Hub.
```python
abstracts_and_tweets.publish_to_hf_hub(
    "datadreamer-dev/abstracts_and_tweets",
    train_size=0.90,
    validation_size=0.10,
)

# Publish and share the trained model
trainer.publish_to_hf_hub("datadreamer-dev/abstracts_to_tweet_model")
```

--------------------------------

### Type Check Project with Mypy

Source: https://github.com/datadreamer-dev/datadreamer/blob/main/docs/source/pages/contributing.rst

Run this command to perform static type checking on the project using mypy and identify any type errors.

```bash
./scripts/run.sh -k "mypy"
```

--------------------------------

### Configure Google Analytics

Source: https://github.com/datadreamer-dev/datadreamer/blob/main/docs/source/_templates/metatags.html

Include this JavaScript snippet in your HTML to initialize Google Analytics. Replace 'G-1BQW0NQ6M8' with your actual Measurement ID.

```javascript
window.dataLayer = window.dataLayer || [];
function gtag(){ dataLayer.push(arguments); }
gtag('js', new Date());
gtag('config', 'G-1BQW0NQ6M8');
```

--------------------------------

### Align Model with DPO using DataDreamer

Source: https://context7.com/datadreamer-dev/datadreamer/llms.txt

Aligns a model with human preferences using Direct Preference Optimization (DPO). This trainer supports PEFT and multi-GPU configurations and allows publishing the aligned model to the Hugging Face Hub.
```python
from datadreamer import DataDreamer
from datadreamer.steps import HFHubDataSource
from datadreamer.trainers import TrainHFDPO
from peft import LoraConfig

with DataDreamer("./output"):
    dpo_dataset = HFHubDataSource("Load DPO Dataset", "Intel/orca_dpo_pairs", split="train").take(1000)
    splits = dpo_dataset.splits(train_size=0.90, validation_size=0.10)

    trainer = TrainHFDPO(
        "Align TinyLlama-Chat with DPO",
        model_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
        peft_config=LoraConfig(),
        device=["cuda:0", "cuda:1"],
        dtype="bfloat16",
    )
    trainer.train(
        train_prompts=splits["train"].output["question"],
        train_chosen=splits["train"].output["chosen"],
        train_rejected=splits["train"].output["rejected"],
        validation_prompts=splits["validation"].output["question"],
        validation_chosen=splits["validation"].output["chosen"],
        validation_rejected=splits["validation"].output["rejected"],
        epochs=3,
        batch_size=1,
        gradient_accumulation_steps=32,
    )

    trainer.publish_to_hf_hub("my-org/tinyllama-dpo-aligned")
```

--------------------------------

### Publish Synthetic Data to Hugging Face Hub

Source: https://context7.com/datadreamer-dev/datadreamer/llms.txt

Generates synthetic data using an LLM and publishes the resulting dataset to the Hugging Face Hub with automatic data card generation. Ensure your Hugging Face token is set as an environment variable or passed directly.

```python
from datadreamer import DataDreamer
from datadreamer.llms import OpenAI
from datadreamer.steps import DataFromPrompt

with DataDreamer("./output"):
    gpt_4 = OpenAI(model_name="gpt-4")

    synthetic_data = DataFromPrompt(
        "Generate Summaries",
        args={
            "llm": gpt_4,
            "n": 500,
            "temperature": 1.0,
            "instruction": "Write a one-paragraph summary of a fictional scientific discovery.",
        },
        outputs={"generations": "summaries"},
    )

    # Publish with train/validation split; DataDreamer auto-generates the data card
    synthetic_data.publish_to_hf_hub(
        "my-org/synthetic-summaries",
        train_size=0.90,
        validation_size=0.10,
        # token="hf_..." if not set via environment variable
    )
```

--------------------------------

### Load English Sentences for Translation

Source: https://github.com/datadreamer-dev/datadreamer/blob/main/docs/source/pages/get_started/quick_tour/bootstrapping_machine_translation.rst

Loads a dataset of English sentences from Hugging Face Hub (FLORES-101) and takes a subset for demonstration.

```python
# Get English sentences
english_dataset = HFHubDataSource(
    "Get FLORES-101 English Sentences",
    "gsarti/flores_101",
    config_name="eng",
    split="dev",
).select_columns(["sentence"])

# Keep only 400 examples as a quick demo
english_dataset = english_dataset.take(400)
```

--------------------------------

### Generate Movie Reviews with Attributed Prompts

Source: https://github.com/datadreamer-dev/datadreamer/blob/main/docs/source/pages/get_started/quick_tour/attributed_prompts.rst

Use DataFromAttributedPrompt to generate varied movie reviews by combining different movie names, elements, and review styles. This method allows for the creation of a diverse dataset representative of real-world data. Ensure the LLM and attributes are correctly configured.

```python
from datadreamer import DataDreamer
from datadreamer.llms import OpenAI
from datadreamer.steps import (
    Prompt,
    DataSource,
    DataFromAttributedPrompt,
)

with DataDreamer("./output"):
    # Load GPT-4
    gpt_4 = OpenAI(model_name="gpt-4")

    # Create prompts to generate attributes for movie reviews
    attribute_generation_prompts = DataSource(
        "Attribute Generation Prompts",
        data={
            "prompts": [
                "Generate the names of 10 movies released in theatres in the past, in a comma separated list.",
                "Generate 10 elements of a movie a reviewer might consider, in a comma separated list.",
                "Generate 10 adjectives that could describe a movie reviewer's style, in a comma separated list.",
            ],
        },
    )

    # Generate the attributes for movie reviews
    attributes = Prompt(
        "Generate Attributes",
        inputs={
            "prompts": attribute_generation_prompts.output["prompts"],
        },
        args={
            "llm": gpt_4,
        },
    ).output["generations"]

    # Generate movie reviews with varied attributes
    movie_reviews = (
        DataFromAttributedPrompt(
            "Generate Movie Reviews",
            args={
                "llm": gpt_4,
                "n": 1000,
                "instruction": "Generate a few sentence {review_style} movie review about {movie_name} that focuses on {movie_element}.",
                "attributes": {
                    "movie_name": attributes[0].split(","),
                    "movie_element": attributes[1].split(","),
                    "review_style": attributes[2].split(","),
                },
            },
            outputs={"generations": "reviews"},
        )
        .select_columns(["reviews"])
        .shuffle()
    )

    # Publish and share the synthetic dataset
    movie_reviews.publish_to_hf_hub(
        "datadreamer-dev/movie_reviews",
    )
```

--------------------------------

### Self-Reward Training Loop with DataDreamer

Source: https://github.com/datadreamer-dev/datadreamer/blob/main/docs/source/pages/get_started/quick_tour/self_rewarding.rst

This snippet outlines the main loop for self-reward training. It iterates through multiple rounds, loading a base LLM, sampling candidate responses, having the LLM judge its own outputs, and then fine-tuning the model using the generated preferences.
The trained adapter is then applied in the next round.

```python
from datadreamer import DataDreamer
from datadreamer.steps import HFHubDataSource, Prompt, JudgeGenerationPairsWithPrompt
from datadreamer.trainers import TrainHFDPO
from datadreamer.llms import HFTransformers
from peft import LoraConfig

with DataDreamer("./output"):
    # Get a dataset of prompts
    prompts_dataset = HFHubDataSource(
        "Get Prompts Dataset", "Intel/orca_dpo_pairs", split="train"
    ).select_columns(["question"])

    # Keep only 3000 examples as a quick demo
    prompts_dataset = prompts_dataset.take(3000)

    # Define how many rounds of self-reward training
    rounds = 3

    # For each round of self-reward training
    adapter_to_apply = None
    for r in range(rounds):
        # Use a partial set of the prompts for each round
        prompts_for_round = prompts_dataset.shard(
            num_shards=rounds, index=r, name=f"Round #{r+1}: Get Prompts"
        )

        # Load the LLM
        llm = HFTransformers(
            "TinyLlama/TinyLlama-1.1B-Chat-v1.0",
            adapter_name=adapter_to_apply,
            device_map="auto",
            dtype="bfloat16",
        )

        # Sample 2 candidate responses from the LLM
        candidate_responses = []
        for candidate_idx in range(2):
            candidate_responses.append(
                Prompt(
                    f"Round #{r+1}: Sample Candidate Response #{candidate_idx}",
                    inputs={"prompts": prompts_for_round.output["question"]},
                    args={
                        "llm": llm,
                        "batch_size": 2,
                        "top_p": 1.0,
                        "seed": candidate_idx,
                    },
                )
            )

        # Have the LLM judge its own responses
        judgements = JudgeGenerationPairsWithPrompt(
            f"Round #{r+1}: Judge Candidate Responses",
            args={
                "llm": llm,
                "batch_size": 1,
                "max_new_tokens": 5,
            },
            inputs={
                "prompts": prompts_for_round.output["question"],
                "a": candidate_responses[0].output["generations"],
                "b": candidate_responses[1].output["generations"],
            },
        )

        # Unload the LLM
        llm.unload_model()

        # Process the judgements into a preference dataset
        dpo_dataset = judgements.map(
            lambda row: {
                "question": row["prompts"],
                "chosen": row["a"]
                if row["judgements"] == "Response A"
                else row["b"],
                "rejected": row["b"]
                if row["judgements"] == "Response A"
                else row["a"],
            },
            lazy=False,
            name=f"Round #{r+1}: Create Self-Reward Preference Dataset",
        )

        # Create training data splits
        splits = dpo_dataset.splits(train_size=0.90, validation_size=0.10)

        # Align the TinyLlama chat model with its own preferences
        trainer = TrainHFDPO(
            f"Round #{r+1}: Self-Reward Align TinyLlama-Chat",
            model_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
            peft_config=LoraConfig(),
            device=["cuda:0", "cuda:1"],
            dtype="bfloat16",
        )
        trainer.train(
            train_prompts=splits["train"].output["question"],
            train_chosen=splits["train"].output["chosen"],
            train_rejected=splits["train"].output["rejected"],
            validation_prompts=splits["validation"].output["question"],
            validation_chosen=splits["validation"].output["chosen"],
            validation_rejected=splits["validation"].output["rejected"],
            epochs=3,
            batch_size=1,
            gradient_accumulation_steps=32,
        )

        # Unload the trained model from memory
        trainer.unload_model()

        # Use the newly trained adapter for the next round of self-reward
        adapter_to_apply = trainer.model_path
```

--------------------------------

### Generate Research Paper Abstracts with GPT-4

Source: https://github.com/datadreamer-dev/datadreamer/blob/main/docs/source/index.rst

Uses GPT-4 with `DataFromPrompt` to generate 1,000 arXiv abstracts for NLP research papers. The snippet assumes `gpt_4` is an already-initialized LLM object and that the call runs inside a DataDreamer session.

```python
# DataFromPrompt generates `n` rows from a single instruction (no input dataset needed)
DataFromPrompt(
    "Generate Research Paper Abstracts",
    args={
        "llm": gpt_4,
        "n": 1000,
        "temperature": 1.2,
        "instruction": (
            "Generate an arXiv abstract of an NLP research paper."
            " Return just the abstract, no titles."
        ),
    },
    outputs={"generations": "abstracts"},
)
```

--------------------------------

### Initialize DataDreamer Session

Source: https://github.com/datadreamer-dev/datadreamer/blob/main/docs/source/pages/get_started/overview_guide.rst

All DataDreamer code should be placed within a DataDreamer session. This ensures automatic organization, caching, and saving of results, making the session resumable and reproducible.

```python
from datadreamer import DataDreamer

with DataDreamer('./output/'):
    # ... run steps or trainers here ...
    pass  # placeholder so the empty block is valid Python
```
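
To make the "resumable" idea concrete, here is a minimal plain-Python sketch of a session-like context manager that caches step results in an output folder. It is not DataDreamer's implementation or API: `toy_session`, `run_step`, `expensive_step`, and the one-JSON-file-per-step layout are hypothetical names and assumptions chosen only for illustration.

```python
import json
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def toy_session(output_dir):
    """Hypothetical stand-in for a session; NOT the DataDreamer API."""
    os.makedirs(output_dir, exist_ok=True)

    def run_step(name, fn):
        # Assumption: one JSON file per step name acts as the cache entry.
        cache_path = os.path.join(output_dir, f"{name}.json")
        if os.path.exists(cache_path):
            # Resume: reuse the saved result instead of recomputing it.
            with open(cache_path) as f:
                return json.load(f)
        result = fn()
        with open(cache_path, "w") as f:
            json.dump(result, f)
        return result

    yield run_step


calls = []


def expensive_step():
    calls.append(1)  # record each time the work actually runs
    return {"sentences": ["Good morning", "How are you?"]}


output_dir = tempfile.mkdtemp()

# First session: the step runs and its result is saved to disk.
with toy_session(output_dir) as run_step:
    first = run_step("load-sentences", expensive_step)

# Second session over the same folder: the result is read back from disk.
with toy_session(output_dir) as run_step:
    second = run_step("load-sentences", expensive_step)

print(first == second, len(calls))  # prints: True 1
```

Keying the cache on the step name is the simplest possible choice for a sketch; a real system would presumably also account for the step's inputs and arguments when deciding whether a cached result can be reused.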