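Fine-Tuning Large Models: A Hands-On LLaMA Case Study on Amazon AWS (Part 7)

Fine-Tuning LLaMA 2 Models on SageMaker JumpStart

These results were tested in us-west-2. This walkthrough shows how to use the SageMaker Python SDK to deploy a pretrained Llama 2 model and fine-tune it on your own dataset, in either domain adaptation or instruction tuning format.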

Setup

First install and upgrade the required packages. After running the cell below for the first time, restart the kernel.

!pip install --upgrade sagemaker datasets


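Deploying the pretrained model

The Llama 2 model is deployed as a SageMaker endpoint. To train/deploy the 13B and 70B models, change model_id to "meta-textgeneration-llama-2-13b" and "meta-textgeneration-llama-2-70b" respectively.

"B" stands for billions of parameters, so "7B" denotes a model with 7 billion parameters. Strings such as "meta-textgeneration-llama-2-7b" and "meta-textgeneration-llama-2-70b" are model identifiers that select the specific model version to deploy.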

model_id, model_version = "meta-textgeneration-llama-2-7b", "2.*"

# %%
from sagemaker.jumpstart.model import JumpStartModel

pretrained_model = JumpStartModel(model_id=model_id, model_version=model_version)
pretrained_predictor = pretrained_model.deploy()


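Invoking the endpoint

Next, invoke the endpoint with a few sample queries. Later, the model is fine-tuned on a custom dataset and the fine-tuned model is used for inference, followed by a comparison of the results from the pretrained and fine-tuned models.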

def print_response(payload, response):
    print(payload["inputs"])
    print(f"> {response[0]['generation']}")
    print("\n==================================\n")

# %%
payload = {
    "inputs": "I believe the meaning of life is",
    "parameters": {
        "max_new_tokens": 64,
        "top_p": 0.9,
        "temperature": 0.6,
        "return_full_text": False,
    },
}
try:
    # Change the following line to "accept_eula=true" to accept the model's EULA.
    response = pretrained_predictor.predict(payload, custom_attributes="accept_eula=false")
    print_response(payload, response)
except Exception as e:
    print(e)

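Preparing the dataset for fine-tuning

You can fine-tune on a dataset in either domain adaptation format or instruction tuning format. See the dataset formatting instructions section in the appendix for details. This demo fine-tunes in instruction tuning format on a subset of the Dolly dataset. Dolly contains roughly 15,000 instruction-following records across categories such as question answering, summarization, and information extraction, and is available under the Apache 2.0 license. The summarization examples are selected for fine-tuning.

Training data is formatted as JSON Lines (.jsonl), where each line is a dictionary representing a single data sample. All training data must be in the same folder, but it may be spread across multiple jsonl files. The training folder can also contain a template.json file describing the input and output formats.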
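As a minimal sketch of the JSON Lines layout described above, the snippet below writes two made-up samples to a file and reads them back. The file name `sample_train.jsonl`, the helper names, and the sample texts are illustrative assumptions; only the field names follow the Dolly records used in this walkthrough.

```python
import json
import os
import tempfile

# Two made-up samples; field names mirror the Dolly records (instruction, context, response).
samples = [
    {
        "instruction": "Summarize the text.",
        "context": "SageMaker is a managed machine learning service.",
        "response": "SageMaker is managed ML.",
    },
    {
        "instruction": "Summarize the text.",
        "context": "S3 stores objects in buckets.",
        "response": "S3 is object storage.",
    },
]


def write_jsonl(path, records):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_jsonl(path):
    """Read a .jsonl file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


path = os.path.join(tempfile.mkdtemp(), "sample_train.jsonl")
write_jsonl(path, samples)
loaded = read_jsonl(path)
print(len(loaded), sorted(loaded[0].keys()))
```

Each line parses on its own, which is why a dataset can be split across several .jsonl files in one folder without any extra bookkeeping.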

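To train your model on a collection of unstructured data (text files), see the section "Example fine-tuning with the domain adaptation dataset format" in the appendix.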

from datasets import load_dataset

dolly_dataset = load_dataset("databricks/databricks-dolly-15k", split="train")

# To train for question answering/information extraction, you can replace the assertion in next line to example["category"] == "closed_qa"/"information_extraction".
summarization_dataset = dolly_dataset.filter(lambda example: example["category"] == "summarization")
summarization_dataset = summarization_dataset.remove_columns("category")

# We split the dataset into two where test data is used to evaluate at the end.
train_and_test_dataset = summarization_dataset.train_test_split(test_size=0.1)

# Dumping the training data to a local file to be used for training.
train_and_test_dataset["train"].to_json("train.jsonl")

# %%
train_and_test_dataset["train"][0]

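Next, create a prompt template that puts the data in instruction/input format, both for the training job (since this example instruction-tunes the model) and for inference against the deployed endpoint.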

import json

template = {
    "prompt": "Below is an instruction that describes a task, paired with an input that provides further context. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{context}\n\n",
    "completion": " {response}",
}
with open("template.json", "w") as f:
    json.dump(template, f)

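The prompt template above formats each data sample so that the model can understand and carry out the given task.

In this template:

The "prompt" part is the prompt text given to the model. It contains an instruction describing the task and an input that provides context.
{instruction}, which follows "### Instruction:", is a placeholder that is replaced with the actual task instruction.
{context}, which follows "### Input:", is a placeholder that is replaced with the actual input text that supplies additional information.
The "completion" part is the text the model is expected to produce to complete the request. {response} is a placeholder for that response text.

When the template is used, {instruction} and {context} are replaced with a concrete instruction and input text, and the model generates the response, i.e. {response}, from them.

The template suits tasks where the model generates text from an instruction and a context, such as summarization, question answering, or text generation. During fine-tuning, the model learns to produce a suitable response for a given instruction and context.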
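To make the placeholder substitution concrete, here is a small sketch that fills the template for one sample with Python's `str.format`. The sample record is made up for illustration, and how the JumpStart training script applies the template internally is not shown in this walkthrough, so treat this only as a picture of the resulting text.

```python
# Same template as above, repeated so the snippet runs on its own.
template = {
    "prompt": "Below is an instruction that describes a task, paired with an input that provides further context. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{context}\n\n",
    "completion": " {response}",
}

# A made-up sample (hypothetical, not taken from Dolly).
sample = {
    "instruction": "Summarize the passage in one sentence.",
    "context": "Amazon S3 is an object storage service that stores data as objects in buckets.",
    "response": "S3 stores data as objects inside buckets.",
}

# Fill the placeholders of each part with the matching fields of the sample.
prompt_text = template["prompt"].format(
    instruction=sample["instruction"], context=sample["context"]
)
completion_text = template["completion"].format(response=sample["response"])

print(prompt_text + completion_text)
```

At inference time only the prompt part is sent; the completion is what the fine-tuned model is expected to generate.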

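Uploading the dataset to S3

Upload the prepared dataset to S3, where it will be used for fine-tuning.
S3 (Simple Storage Service) is the object storage service provided by Amazon Web Services (AWS). It can store any amount of data and is commonly used in machine learning projects to hold training datasets.

Once the dataset is uploaded, the SageMaker training job is pointed at its location in the S3 bucket so that SageMaker can read the data for training and fine-tuning.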

from sagemaker.s3 import S3Uploader
import sagemaker
import random

output_bucket = sagemaker.Session().default_bucket()
local_data_file = "train.jsonl"
train_data_location = f"s3://{output_bucket}/dolly_dataset"
S3Uploader.upload(local_data_file, train_data_location)
S3Uploader.upload("template.json", train_data_location)
print(f"Training data: {train_data_location}")

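Training the model

Next, fine-tune the LLaMA v2 7B model on the Dolly summarization dataset. The fine-tuning script is based on the script provided by this repository. To learn more about the fine-tuning script, see the notes on the fine-tuning method. For the list of supported hyperparameters and their default values, see "Supported hyperparameters for fine-tuning" in the appendix.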

from sagemaker.jumpstart.estimator import JumpStartEstimator


estimator = JumpStartEstimator(
    model_id=model_id,
    model_version=model_version,
    environment={"accept_eula": "true"},
    disable_output_compression=True,  # For Llama-2-70b, add instance_type = "ml.g5.48xlarge"
)
# Instruction tuning is off by default, so enable it explicitly when using an instruction-tuning dataset.
estimator.set_hyperparameters(instruction_tuned="True", epoch="5", max_input_length="1024")
estimator.fit({"training": train_data_location})

Next, deploy the fine-tuned model so that its performance can be compared with that of the pretrained model.

finetuned_predictor = estimator.deploy()

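Evaluating the pretrained and fine-tuned models

Next, use the test data to evaluate the fine-tuned model and compare it with the pretrained model.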

import pandas as pd
from IPython.display import display, HTML

test_dataset = train_and_test_dataset["test"]

inputs, ground_truth_responses, responses_before_finetuning, responses_after_finetuning = (
    [],
    [],
    [],
    [],
)


def predict_and_print(datapoint):
    # For instruction fine-tuning, we insert a special key between input and output
    input_output_demarkation_key = "\n\n### Response:\n"

    payload = {
        "inputs": template["prompt"].format(
            instruction=datapoint["instruction"], context=datapoint["context"]
        )
        + input_output_demarkation_key,
        "parameters": {"max_new_tokens": 100},
    }
    inputs.append(payload["inputs"])
    ground_truth_responses.append(datapoint["response"])
    # Please change the following line to "accept_eula=True"
    pretrained_response = pretrained_predictor.predict(
        payload, custom_attributes="accept_eula=false"
    )
    responses_before_finetuning.append(pretrained_response[0]["generation"])
    # Please change the following line to "accept_eula=True"
    finetuned_response = finetuned_predictor.predict(payload, custom_attributes="accept_eula=false")
    responses_after_finetuning.append(finetuned_response[0]["generation"])


try:
    for i, datapoint in enumerate(test_dataset.select(range(5))):
        predict_and_print(datapoint)

    df = pd.DataFrame(
        {
            "Inputs": inputs,
            "Ground Truth": ground_truth_responses,
            "Response from non-finetuned model": responses_before_finetuning,
            "Response from fine-tuned model": responses_after_finetuning,
        }
    )
    display(HTML(df.to_html()))
except Exception as e:
    print(e)

# Clean up: delete both models and endpoints once the comparison is done.
pretrained_predictor.delete_model()
pretrained_predictor.delete_endpoint()
finetuned_predictor.delete_model()
finetuned_predictor.delete_endpoint()

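Appendix:

1. Supported inference parameters

This model supports the following inference payload parameters:

max_new_tokens: The model generates text until the output length (excluding the input context length) reaches max_new_tokens. If specified, it must be a positive integer.
temperature: Controls the randomness of the output. A higher temperature yields output sequences with more low-probability words, while a lower temperature yields output sequences with more high-probability words. As temperature -> 0, decoding becomes greedy. If specified, it must be a positive float.
top_p: At each step of text generation, sample from the smallest possible set of words whose cumulative probability is top_p. If specified, it must be a float between 0 and 1.
return_full_text: If True, the input text is included in the generated output text. If specified, it must be a boolean. The default is False.

You can specify any subset of the parameters above when invoking the endpoint.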
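The effect of temperature and top_p can be sketched on a toy vocabulary. The code below is a generic illustration of temperature scaling followed by nucleus (top-p) sampling, not the endpoint's actual decoding code; the function names and the four toy logits are assumptions made for this example.

```python
import math
import random


def apply_temperature(logits, temperature):
    """Turn logits into probabilities after dividing them by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability reaches top_p, renormalized."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample_token(logits, temperature=0.6, top_p=0.9, rng=random):
    """Sample one token id using temperature scaling and then nucleus sampling."""
    candidates = top_p_filter(apply_temperature(logits, temperature), top_p)
    ids = list(candidates)
    return rng.choices(ids, weights=[candidates[i] for i in ids], k=1)[0]


logits = [2.0, 1.0, 0.5, -1.0]  # toy scores for a 4-token vocabulary
sharp = apply_temperature(logits, 0.1)  # low temperature: mass concentrates on the top token
flat = apply_temperature(logits, 2.0)  # high temperature: mass spreads out
print(round(max(sharp), 4), round(max(flat), 4))
print(sorted(top_p_filter(apply_temperature(logits, 1.0), 0.9)))
print(sample_token(logits))
```

With a very low temperature the top token takes almost all of the probability mass, which is the greedy-decoding limit mentioned above.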

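If max_new_tokens is not defined, the model may generate up to the maximum total number of tokens allowed, which is 4K for these models. This can cause endpoint query timeout errors, so setting max_new_tokens is recommended whenever possible. For the 7B, 13B, and 70B models, we recommend setting max_new_tokens to no more than 1500, 1000, and 500 respectively, while keeping the total number of tokens below 4K.
To support the 4K context length, this model restricts query payloads to a batch size of 1. Payloads with a larger batch size receive an endpoint error before inference.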
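The constraints listed in this section can be checked on the client before a request is sent. The helper below is a hypothetical sketch, not part of the SageMaker SDK: `build_payload` and `RECOMMENDED_MAX_NEW_TOKENS` are names invented here, and the helper treats the recommended max_new_tokens values as hard caps purely for illustration.

```python
# Upper bounds on max_new_tokens recommended in the notes above, keyed by model size.
RECOMMENDED_MAX_NEW_TOKENS = {"7b": 1500, "13b": 1000, "70b": 500}


def build_payload(prompt, model_size="7b", max_new_tokens=64,
                  temperature=None, top_p=None, return_full_text=None):
    """Hypothetical helper: build a single-prompt payload and check the documented constraints."""
    if not isinstance(prompt, str):
        # The endpoint only accepts batch size 1, so a list of prompts is rejected early.
        raise TypeError("prompt must be a single string (batch size 1)")
    if isinstance(max_new_tokens, bool) or not isinstance(max_new_tokens, int) or max_new_tokens <= 0:
        raise ValueError("max_new_tokens must be a positive integer")
    limit = RECOMMENDED_MAX_NEW_TOKENS[model_size]
    if max_new_tokens > limit:
        raise ValueError(f"max_new_tokens exceeds the recommended {limit} for the {model_size} model")

    parameters = {"max_new_tokens": max_new_tokens}
    if temperature is not None:
        if temperature <= 0:
            raise ValueError("temperature must be a positive float")
        parameters["temperature"] = float(temperature)
    if top_p is not None:
        # The bounds are treated as inclusive here, which is an assumption.
        if not 0.0 <= top_p <= 1.0:
            raise ValueError("top_p must be between 0 and 1")
        parameters["top_p"] = float(top_p)
    if return_full_text is not None:
        if not isinstance(return_full_text, bool):
            raise ValueError("return_full_text must be a boolean")
        parameters["return_full_text"] = return_full_text
    return {"inputs": prompt, "parameters": parameters}


payload = build_payload("I believe the meaning of life is", temperature=0.6, top_p=0.9)
print(payload)
```

Only the parameters that were explicitly passed end up in the payload, matching the note that any subset of them may be specified.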

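2. Dataset formatting instructions for training

Fine-tuning the model on a new dataset

Two types of fine-tuning are currently offered: instruction fine-tuning and domain adaptation fine-tuning. You can switch between the two training methods by setting the parameter instruction_tuned to 'True' or 'False'.

2.1. Domain adaptation fine-tuning

Text generation models can also be fine-tuned on any domain-specific dataset. After fine-tuning on a domain-specific dataset, the model is expected to generate domain-specific text and to solve various NLP tasks in that domain with few-shot prompting.

The following describes how the training data should be formatted for input to the model.

Input: A training directory and an optional validation directory. Each directory contains a CSV/JSON/TXT file.
- For CSV/JSON files, the training or validation data is taken from the column named 'text', or from the first column if no column named 'text' is found.
- The training directory and the validation directory (if provided) should each contain exactly one file.
Output: A trained model that can be deployed for inference.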
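The column-selection rule for CSV files can be mimicked in a few lines. This is a sketch of the rule as stated above (use the 'text' column, otherwise the first column), written with the standard `csv` module; the function name `select_text_column` and the sample CSV contents are assumptions, and the actual training script may read files differently.

```python
import csv
import io


def select_text_column(csv_text):
    """Return the values of the 'text' column, or of the first column if 'text' is absent."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = reader.fieldnames or []
    if not fieldnames:
        return []
    column = "text" if "text" in fieldnames else fieldnames[0]
    return [row[column] for row in reader]


# A CSV with a 'text' column: that column is used even though it is not the first one.
with_text = "id,text\n1,First filing paragraph.\n2,Second filing paragraph.\n"
# A CSV without a 'text' column: the first column is used instead.
without_text = "body,source\nSome domain text.,report\n"

print(select_text_column(with_text))
print(select_text_column(without_text))
```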

Below is an example of a TXT file for fine-tuning a text generation model. The TXT file contains Amazon's SEC filings from 2021 to 2022.

This report includes estimates, projections, statements relating to our
business plans, objectives, and expected operating results that are “forward-
looking statements” within the meaning of the Private Securities Litigation
Reform Act of 1995, Section 27A of the Securities Act of 1933, and Section 21E
of the Securities Exchange Act of 1934. Forward-looking statements may appear
throughout this report, including the following sections: “Business” (Part I,
Item 1 of this Form 10-K), “Risk Factors” (Part I, Item 1A of this Form 10-K),
and “Management’s Discussion and Analysis of Financial Condition and Results
of Operations” (Part II, Item 7 of this Form 10-K). These forward-looking
statements generally are identified by the words “believe,” “project,”
“expect,” “anticipate,” “estimate,” “intend,” “strategy,” “future,”
“opportunity,” “plan,” “may,” “should,” “will,” “would,” “will be,” “will
continue,” “will likely result,” and similar expressions. Forward-looking
statements are based on current expectations and assumptions that are subject
to risks and uncertainties that may cause actual results to differ materially.
We describe risks and uncertainties that could cause actual results and events
to differ materially in “Risk Factors,” “Management’s Discussion and Analysis
of Financial Condition and Results of Operations,” and “Quantitative and
Qualitative Disclosures about Market Risk” (Part II, Item 7A of this Form
10-K). Readers are cautioned not to place undue reliance on forward-looking
statements, which speak only as of the date they are made. We undertake no
obligation to update or revise publicly any forward-looking statements,
whether because of new information, future events, or otherwise.
GENERAL
Embracing Our Future ...

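2.2. Instruction fine-tuning

Text generation models can be instruction-tuned on any text data, provided the data is in the expected format. The instruction-tuned model can then be deployed for inference.

The following describes how the training data should be formatted for input to the model.

Input: A training directory and an optional validation directory. The training and validation directories should contain one or more files in JSON Lines (.jsonl) format. The training directory can also contain an optional *.json file describing the input and output formats.
- The best model is selected according to the validation loss computed at the end of each epoch.
  If no validation set is given, an adjustable percentage of the training data is automatically split off for validation.
- Training data must be formatted as JSON Lines (.jsonl), where each line is a dictionary representing a single data sample. All training data must be in a single folder, but it may be spread across multiple jsonl files. The .jsonl file extension is mandatory. The training folder can also contain a template.json file describing the input and output formats. If no template file is given, the following template is used: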
```
{
  "prompt": "Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.\n\n### Instruction:\n{instruction}\n\n### Input:\n{context}",
  "completion": "{response}"
}
```
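  In this case, the data in the JSON Lines entries must include the instruction, context, and response fields. If a custom template is provided, it must also use the prompt and completion keys to define the input and output templates.
  Here is an example of a custom template: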
```
{
  "prompt": "question: {question} context: {context}",
  "completion": "{answer}"
}
```
  Here, the data in the JSON Lines entries must include the question, context, and answer fields.
Output: A trained model that can be deployed for inference.
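Because every record must supply the fields that the template references, it can help to check a dataset before uploading it. The sketch below extracts the placeholder names from a template with `string.Formatter` and reports the records that lack any of them; the helper names and the two sample records are assumptions made for illustration.

```python
import json
from string import Formatter


def template_fields(template):
    """Collect the placeholder names used in the prompt and completion templates."""
    fields = set()
    for key in ("prompt", "completion"):
        for _, field_name, _, _ in Formatter().parse(template[key]):
            if field_name:
                fields.add(field_name)
    return fields


def missing_fields(jsonl_text, template):
    """Map 1-based line numbers to the template fields missing from that record."""
    required = template_fields(template)
    problems = {}
    for line_number, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue
        missing = required - set(json.loads(line))
        if missing:
            problems[line_number] = sorted(missing)
    return problems


custom_template = {
    "prompt": "question: {question} context: {context}",
    "completion": "{answer}",
}
# Two made-up records; the second one lacks the 'answer' field.
jsonl_text = (
    '{"question": "What is S3?", "context": "S3 stores objects.", "answer": "Object storage."}\n'
    '{"question": "What is EC2?", "context": "EC2 provides virtual servers."}\n'
)

print(sorted(template_fields(custom_template)))
print(missing_fields(jsonl_text, custom_template))
```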

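2.3. Example fine-tuning with the domain adaptation dataset format

A subset of Amazon's SEC filings data is provided in domain adaptation dataset format. It was downloaded from the publicly available EDGAR. Instructions for accessing the data are shown here.

License: Creative Commons Attribution-ShareAlike License (CC BY-SA 4.0).

Run the following code to fine-tune the model on a dataset in domain adaptation format.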

import boto3
from sagemaker.jumpstart.estimator import JumpStartEstimator

model_id = "meta-textgeneration-llama-2-7b"

estimator = JumpStartEstimator(model_id=model_id, environment={"accept_eula": "true"}, instance_type="ml.g5.24xlarge")
# instruction_tuned="False" selects domain adaptation fine-tuning.
estimator.set_hyperparameters(instruction_tuned="False", epoch="5")
estimator.fit({"training": f"s3://jumpstart-cache-prod-{boto3.Session().region_name}/training-datasets/sec_amazon"})

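3. Supported hyperparameters for fine-tuning

epoch: The number of passes the fine-tuning algorithm makes through the training dataset. Must be an integer greater than 1. Default: 5.
learning_rate: The rate at which the model weights are updated after each batch of training examples is processed. Must be a float greater than 0. Default: 1e-4.
instruction_tuned: Whether to instruction-train the model. Must be 'True' or 'False'. Default: 'False'.
per_device_train_batch_size: The batch size per GPU core/CPU for training. Must be a positive integer. Default: 4.
per_device_eval_batch_size: The batch size per GPU core/CPU for evaluation. Must be a positive integer. Default: 1.
max_train_samples: For debugging purposes or quicker training, truncate the number of training examples to this value. A value of -1 means all training samples are used. Must be a positive integer or -1. Default: -1.
max_val_samples: For debugging purposes or quicker training, truncate the number of validation examples to this value. A value of -1 means all validation samples are used. Must be a positive integer or -1. Default: -1.
max_input_length: The maximum total input sequence length after tokenization. Sequences longer than this are truncated. If -1, max_input_length is set from 1024 and the maximum model length defined by the tokenizer, to their …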
