Deploying and Fine-tuning ChatGPT-like Projects (Part 2): ChatLLaMA and ColossalChat
This article is the second installment of the series *Deploying and Fine-tuning ChatGPT-like Projects*. The full series consists of:

- (Part 1) From LLaMA to Alpaca and BELLE
- (Part 2) ChatLLaMA and ColossalChat
- (Part 3) From ChatGLM-6b to ChatDoctor

---

# Part 4. RLHF versions of LLaMA: ChatLLaMA and ColossalChat

## 4.1 ChatLLaMA (English version): the familiar SFT, RM, RL/PPO three-step training

Since LLaMA itself was not trained with RLHF, the startup Nebuly AI open-sourced an RLHF version of LLaMA, called **ChatLLaMA**.

### 4.1.1 Three datasets: for training the actor, reward, and RLHF models

Its training process resembles ChatGPT's. As Section 3.1 of the blog post 《ChatGPT技术原理解析》 (an analysis of ChatGPT's technical principles) explains, training the three models (SFT, RM, RL/PPO) first requires preparing three datasets.

**1. actor_training_data** — the data used for supervised fine-tuning of the actor (the counterpart of the data used to fine-tune GPT-3), for example:

```json
[
    {
        "user_input": "here the input of the user",
        "completion": "here the model completion"
    }
]
```

Where does actor_training_data come from? There are four options.

① Use 100% synthetic data, which can be generated by running:

```bash
python artifacts/generate_actor_dataset.py
```

Note: this command requires an OpenAI subscription; generating the full dataset with davinci-003 costs roughly $200 (there are free alternatives as well).

② Use one of the supported open-source dialogue datasets. Currently supported:

- Anthropic HH RLHF: structured {question/answer pairs} that include the answers chosen and rejected by the bot;
- Stanford Human Preferences Dataset (SHP): questions collected from selected question-asking subreddits, covering a wide range of {question/answer pairs} ranked by the most upvoted answer.

The datasets can be downloaded with:

```bash
python artifacts/download_dataset.py <dataset_name> --path <path_to_folder_for_download> --number_of_samples <N>
```

where:

- `<dataset_name>` is "SHP" for the StanfordNLP/SHP dataset or "ARLHF" for the Anthropic/hh-rlhf dataset, respectively;
- `<path_to_folder_for_download>` is the folder in which the dataset will be created;
- `<N>` is the number of samples that make up reward_dataset.json.

③ Use a 100% custom dataset. The user provides their own complete dataset, which must be a JSON file in the following format:

```json
[
    {
        "user_input": "here the input of the user",
        "completion": "here the model completion"
    }
]
```

The list contains multiple dictionaries, each corresponding to one data sample; using more than 1,000 samples is recommended for training the actor.

④ Build the full dataset from a few custom samples: the dataset can be synthetically expanded from a handful of user-provided prompt + response examples (just a few, on the order of 10).
**2. reward_training_data** — the data used to train a reward model. Each sample has three parts:
i) the prompt,
ii) the completion,
iii) the score assigned to the completion according to user feedback (the Human Feedback in RLHF).

For example:

```json
[{
    "user_input": "...",
    "completion": "...",
    "score": 1
},
...
]
```

Again, where does the reward data come from? There are three options.

→ 1. The completions can be synthetically scored using an LLM as Human Feedback.
An LLM is used to compute a score for each entry. For this, the LLM needs a prompt template containing all the instructions for evaluating the generated text (for example the **reward rules**: it must be very explicit about what deserves a reward and what does not). To do so, add a key named `reward` to the file `templates.json`, for example:

```json
{
    "reward": "Here is the template for the reward model. The rules are:\n\n1.Rule 1\n\n2. Rule 2"
}
```

If no template is provided, the default template in artifacts/generate_rewards.py is used. Note: all templates must be saved in a single JSON file named `templates.json`. A minimal sketch of this scoring flow is shown right after this list.

Once you have the unlabelled dataset, you can generate the scores by running:

```bash
python artifacts/generate_rewards.py <dataset_path> --model <model_to_use> --temperature <t> --max_tokens <n> --reward_template <path_to_file.json>
```

where:

- `<dataset_path>` is the path to the reward dataset to be scored;
- `<model_to_use>` is the model used for scoring; text-davinci-003 is recommended by default;
- `<temperature>` is the temperature used when scoring, e.g. temperature = 0.1;
- `<max_tokens>` is the maximum number of tokens of the generation;
- `<reward_template>` is the path to the file containing the template used to generate the rewards; if no path is provided, the default template is used.

→ 2. The user provides their own complete dataset (at least 100 samples are required). It must be a JSON file named `reward_training_data.json` in the following format:

```json
[
    {"user_input": "here type the user input", "completion": "here type the completion", "score": 4.0},
    {"user_input": "here type the user input", "completion": "random garbage", "score": 0.0}
]
```

→ 3. The user provides a few examples and the dataset is synthetically expanded with an LLM (i.e. prompting the LLM, self-instruct style, to produce more of the required data).

**3. rlhf_training_data** — the data used by the RL stage to iteratively optimize the policy. It can be provided in two different ways:

→ A few examples provided by the user, with the dataset synthetically expanded using an LLM (again via self-instruct-style prompting).
A key named `rlhf` must be added to `templates.json`, containing the information about the task to perform plus the extra context needed for the LLM generation. Here is an example template (remember, all templates must live in `templates.json`):

```json
{
    "rlhf": "Here is the template for the generating RLHF prompts. The task we want to perform is ..."
}
```

→ The user provides the full dataset with possible interactions with the model.
The dataset needs to contain more than 1,000 prompt examples, in a file named `rlhf_training_data.json`:

```json
[
    {
        "user_input": "here the example of user input"
    }
]
```
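To make option 1 of the reward data concrete, here is a minimal sketch of the synthetic-scoring flow: read the `reward` template from templates.json, ask an LLM for a score per entry, and write reward_training_data.json. The `call_llm` helper is a placeholder for whatever completion API you use — it is not part of ChatLLaMA, and the prompt layout and score range are illustrative assumptions.

```python
import json
import re

def call_llm(prompt: str, temperature: float = 0.1, max_tokens: int = 16) -> str:
    """Hypothetical helper: send `prompt` to your own completion API and return the raw text."""
    raise NotImplementedError("wire this up to your LLM / OpenAI client of choice")

# Load the scoring rules (the "reward" key of templates.json) and the unlabelled samples
with open("templates.json") as f:
    reward_template = json.load(f)["reward"]
with open("unlabelled_dataset.json") as f:
    samples = json.load(f)  # [{"user_input": ..., "completion": ...}, ...]

scored = []
for sample in samples:
    # One scoring prompt per entry: the rules first, then the text to evaluate
    prompt = (
        f"{reward_template}\n\n"
        f"User input: {sample['user_input']}\n"
        f"Completion: {sample['completion']}\n"
        "Score (0-5):"
    )
    raw = call_llm(prompt)
    match = re.search(r"\d+(\.\d+)?", raw)            # keep only the numeric score
    score = float(match.group()) if match else 0.0
    scored.append({**sample, "score": score})

# The output has exactly the reward_training_data.json format shown above
with open("reward_training_data.json", "w") as f:
    json.dump(scored, f, indent=2)
```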
### 4.1.2 Implementing PPO with an actor-critic (AC) architecture: the actor-critic training process

In chatllama/rlhf/reward.py:

First, a class named RewardModel is defined; it serves as the reward model, a.k.a. the critic model. RewardModel is a language model with an extra head attached that predicts the reward (a single scalar value) for a given token sequence. At the end, RewardModel is aliased as CriticModel to keep the naming consistent.

Next, a RewardDataset class is defined: the dataset used to train the reward model. RewardDataset is a custom dataset class inheriting from Dataset; it reads the data from a given JSON file and arranges it into the proper format. The JSON file must look like this:

```python
class RewardDataset(Dataset):
    """Dataset class for the reward model
    read a json file with the following format:
    [
        {
            "user_input": "...",
            "completion": "...",
            "score": ...
        },
        ...
    ]
    Where:
        user_input: the initial input of the user
        completion: the completion generated by the model
        score: the score given by the user to the completion (or by the LLM)
    """
```

Here user_input is the user's initial input, completion is the completion generated by the model, and score is the score given to the completion by the user (or by an LLM).

Finally, a RewardTrainer class is defined to train the reward model. It initializes the reward model, the optimizer, the loss function, the dataset and the dataloaders, and it also supports training with DeepSpeed or Accelerate (two high-performance training frameworks).

The main methods of RewardTrainer are:

train: trains the reward model. It runs the training loop — forward pass, loss computation, backward pass, and optimizer update — and, at the end of each epoch, optionally validates the model (if a validation dataset was provided).

```python
def __init__(self, config: ConfigReward) -> None:
    # save the config
    self.config = config

    # load the model
    self.reward = RewardModel(config)

    # optimizer
    self.optimizer = torch.optim.AdamW(self.reward.parameters(), lr=config.lr)

    # loss function (MSE between the predicted score and the target score)
    self.loss_function = torch.nn.MSELoss()

    # check validation dataset
    self.validation_flag = False
    if config.validation_dataset_path is not None:
        self.validation_flag = True

    # create dataset and dataloaders
    self.train_dataset = RewardDataset(config.train_dataset_path)
    self.train_dataloader = DataLoader(self.train_dataset, batch_size=config.batch_size)
    if self.validation_flag:
        self.eval_dataset = RewardDataset(config.validation_dataset_path)
        self.validation_dataloader = DataLoader(self.eval_dataset, batch_size=config.batch_size)

    # initialize scheduler - learning rate will drop to 10% of the initial value
    self.scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(
        self.optimizer,
        T_0=len(self.train_dataset) // config.batch_size,
        T_mult=1,
        eta_min=config.lr * 0.1,
        last_epoch=-1,
    )
```

save_checkpoint: saves a checkpoint of the model. During training, the current state (model parameters, optimizer state and training statistics) can be saved periodically so that training can be resumed later when needed.

load_checkpoint: restores the model from a checkpoint. If a checkpoint file is found during training, this method restores the model state and returns the epoch and step from which to resume.

In summary, the train method of RewardTrainer:
first tries to resume from a checkpoint (if one exists);
then iterates over all inputs from the dataloader, performing a forward pass, computing the loss, backpropagating, and updating the optimizer for each batch; at the end of each epoch it validates the model if a validation set was provided;
finally, it saves the model once training is finished. A minimal sketch of one such training loop is shown right below.
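The train method itself is not reproduced above, so here is a minimal sketch of the kind of reward-model training loop it describes — regressing the predicted scalar onto the human/LLM score with MSE. The assumption that the dataset yields (input_ids, attention_mask, score) tuples and that the model returns one scalar per sequence is ours for illustration, not chatllama's exact interface.

```python
import torch
from torch.utils.data import DataLoader

def train_reward_model(reward_model, dataset, epochs: int = 1, batch_size: int = 8, lr: float = 1e-5):
    """Minimal reward-model training loop: regress the predicted scalar onto the target score."""
    optimizer = torch.optim.AdamW(reward_model.parameters(), lr=lr)
    loss_fn = torch.nn.MSELoss()
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

    reward_model.train()
    for epoch in range(epochs):
        for input_ids, attention_mask, score in dataloader:
            # forward pass: one scalar reward per (prompt, completion) sequence
            predicted = reward_model(input_ids, attention_mask=attention_mask)
            loss = loss_fn(predicted.squeeze(-1), score.float())

            # backward pass and optimizer update
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        print(f"epoch {epoch}: last batch loss {loss.item():.4f}")
```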
With the reward function in place, the PPO algorithm can optimize the policy (actor) and value (critic) networks of the RL task. As shown in the figure below, training is organized as two nested loops:

- the outer loop iterates over training epochs;
- the inner loop iterates over the batches of the dataloader; each iteration processes one batch of data (states, actions, values, etc.) used to train the actor-critic model.

(Figure: the actor-critic PPO training loop)

Inside the inner loop, the following steps are performed in order (the code below comes from chatllama/chatllama/rlhf/trainer.py; a small self-contained toy example of the clipped objectives follows the list):

1. Compute the new action probabilities and values with the actor-critic model

```python
# get actor critic new probabilities and values
actions_logits, values = self.actorcritic.forward(
    sequences_actor,
    sequences_mask_actor,
    sequences_critic,
    sequences_mask_critic,
    action_len_actor.item(),
    action_len_critic.item(),
)
```

2. Compute the action log-probabilities, the entropy, and the KL-divergence loss

```python
# get action log prob
actions_prob = (
    torch.softmax(actions_logits, dim=-1).max(dim=-1).values
)
actions_log_prob = torch.log(actions_prob + self.eps)

# compute entropy
entropies = (actions_prob * actions_log_prob).sum(dim=-1)

# compute KL divergence
kl_div_loss = (
    (actions_prob * (old_actions_log_probs - actions_log_prob))
    .sum(dim=-1)
    .mean()
)
```

3. Compute the importance-weight ratios, i.e. the ratio between the new and old policy probabilities

```python
# compute ratios
ratios = (actions_log_prob - old_actions_log_probs).exp()
```

4. Compute the PPO loss, including the advantage estimate and the PPO-clip objective.
First, recall the formulation of clipped proximal policy optimization (PPO-clip) from the introductory RL article:

$$J_{\mathrm{PPO2}}^{\theta'}(\theta) \approx \sum_{(s_t, a_t)} \min\left(\frac{p_{\theta}(a_t \mid s_t)}{p_{\theta'}(a_t \mid s_t)}\, A^{\theta'}(s_t, a_t),\ \mathrm{clip}\left(\frac{p_{\theta}(a_t \mid s_t)}{p_{\theta'}(a_t \mid s_t)},\ 1-\varepsilon,\ 1+\varepsilon\right) A^{\theta'}(s_t, a_t)\right)$$

A simple implementation looks like this:

```python
# ratios are the importance weights; exp() turns the log-prob difference into a probability ratio;
# environment_log_probs are the log-probs of the policy that interacted with the environment
ratios = torch.exp(log_probs - environment_log_probs)

# sur_1 and sur_2 are the two terms inside the min:
# the first is the importance weight times the advantage
sur_1 = ratios * advs
# the second applies the clipping
sur_2 = torch.clamp(ratios, 1 - clip_eps, 1 + clip_eps) * advs

# take whichever is smaller
clip_loss = -torch.min(sur_1, sur_2).mean()
```

The more detailed implementation is:

```python
# compute PPO loss
if check_model_family(self.config.actor, self.config.critic):
    # compute discounted rewards as in TRL
    gamma = self.config.trainer.gamma_discounted
    discounted_rewards = torch.zeros_like(old_values)
    for i in range(discounted_rewards.shape[1]):
        for j in range(i, discounted_rewards.shape[1]):
            discounted_rewards[:, i] += gamma ** (j - i) * rewards[:, j]

    advantages = (
        discounted_rewards - old_values
    )  # TRL has opposite sign for old values
    advantages = (advantages - advantages.mean(dim=-1)) / (
        advantages.std() + self.eps
    )
    surr1 = advantages * ratios
else:
    advantages = rewards - old_values[:, -1]
    surr1 = advantages * ratios

surr2 = (
    torch.clamp(ratios, 1 - actor_eps_clip, 1 + actor_eps_clip)
    * advantages
)
```

5. Compute the policy loss and the total loss

```python
policy_loss = -torch.min(surr1, surr2) - beta_s * entropies
policy_loss = policy_loss.mean()
loss = policy_loss + kl_div_loss
```

6. If the loss is NaN, raise an exception

```python
# check if loss item is NaN
if torch.isnan(loss):
    raise ValueError("Loss is nan")
```

7. Update the actor, using DeepSpeed or Accelerate if enabled

```python
# update actor with loss
if self.config.actor.deepspeed_enable:
    actor_model_engine.backward(loss)
    actor_model_engine.step()
elif self.config.actor.accelerate_enable:
    self.actor_optimizer.zero_grad()
    actor_accelerator.backward(loss)
    self.actor_optimizer.step()
    self.actor_scheduler.step()
else:
    self.actor_optimizer.zero_grad()
    loss.backward()
    self.actor_optimizer.step()
    self.actor_scheduler.step()
```

8. Compute the value loss

```python
# compute value loss
# the loss is the distance between the rewards and the values
# I want this distance to be small so that values are
# representative of the rewards, for this reason i took the
# maximum between the two.
# The clip is limiting the slew-rate of values_loss_clipped
value_loss_clipped = old_values + (values - old_values).clamp(
    -critic_eps_clip, critic_eps_clip
)
value_loss1 = (value_loss_clipped - rewards) ** 2
value_loss2 = (values - rewards) ** 2
value_loss = torch.max(value_loss1, value_loss2).mean()
```

9. If the value loss is NaN, raise an exception

```python
if torch.isnan(value_loss):
    raise ValueError("Value loss is nan")
```

10. Update the critic, using DeepSpeed or Accelerate if enabled

```python
# update critic
if self.config.critic.deepspeed_enable:
    critic_model_engine.backward(value_loss)
    critic_model_engine.step()
elif self.config.critic.accelerate_enable:
    self.critic_optimizer.zero_grad()
    # note: this branch backpropagates `loss`, unlike the other branches which use value_loss
    critic_accelerator.backward(loss)
    self.critic_optimizer.step()
    self.critic_scheduler.step()
else:
    self.critic_optimizer.zero_grad()
    value_loss.backward()
    self.critic_optimizer.step()
    self.critic_scheduler.step()
```

11. Append the losses to the training statistics

```python
# append the losses to the training stats
self.training_stats.training_loss.append(loss.detach().cpu().item())
self.training_stats.value_loss.append(value_loss.detach().cpu().item())
```

12. Print the iteration info

```python
# print iteration info
print(
    f"Epoch {epoch+1}/{epochs}",
    f"Step {k+1}/{int(len(dataloader) / batch_size)}",
    f"Loss {loss.detach().cpu().item():.4f}",
    f"Value Loss {value_loss.detach().cpu().item():.4f}",
)
```

13. After the training loop ends, put the actor-critic model into evaluation mode and print that training is finished

```python
self.actorcritic.eval()
print("End Learning")
```
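To see the two clipped objectives (steps 3-5 and step 8) in isolation, here is a tiny self-contained toy example on random tensors; it is purely illustrative and not taken from chatllama.

```python
import torch

torch.manual_seed(0)
batch, seq = 4, 6

# pretend log-probs from the current and the old (rollout) policy, plus rewards/values/advantages
log_probs = torch.randn(batch, seq)
old_log_probs = torch.randn(batch, seq)
advantages = torch.randn(batch, seq)
rewards = torch.randn(batch, seq)
values = torch.randn(batch, seq)
old_values = torch.randn(batch, seq)

eps_clip = 0.2          # actor clipping range
critic_eps_clip = 0.2   # critic clipping range

# PPO-clip surrogate (steps 3-5)
ratios = (log_probs - old_log_probs).exp()
surr1 = ratios * advantages
surr2 = torch.clamp(ratios, 1 - eps_clip, 1 + eps_clip) * advantages
policy_loss = -torch.min(surr1, surr2).mean()

# clipped value loss (step 8)
values_clipped = old_values + (values - old_values).clamp(-critic_eps_clip, critic_eps_clip)
value_loss = torch.max((values_clipped - rewards) ** 2, (values - rewards) ** 2).mean()

print(f"policy loss: {policy_loss.item():.4f}, value loss: {value_loss.item():.4f}")
```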
## 4.2 ColossalChat: instruction-tuning LLaMA via self-instruct, plus RLHF

### 4.2.1 Technical architecture: a self-instruct-generated Chinese/English bilingual dataset + a three-stage training procedure

According to the introduction (the project's announcement page, one of its translations, and the code repository), Colossal-AI open-sourced ColossalChat, a reproduction of a ChatGPT-like model based on LLaMA-7B that includes the complete RLHF pipeline.

**The dataset**: real questions that people ask on social platforms were collected and cleaned as a seed dataset, then expanded with the self-instruct technique (by prompting the OpenAI API, at a labeling cost of about $900), finally producing a bilingual Chinese/English dataset of 104K question-answer pairs (the data is also open-sourced).
Compared with datasets generated by other self-instruct methods, its seed data is more realistic and diverse, and the resulting dataset covers more topics. The data can be used both for supervised fine-tuning and for RLHF training; thanks to this higher-quality data, ColossalChat carries on better conversations and also supports Chinese.

(Figure: comparison with datasets generated by other self-instruct approaches)

**The training procedure**: the same three steps as InstructGPT/ChatGPT (if you have forgotten them, be sure to review Section 3.1 of the article mentioned earlier):

- Stage 1 is supervised fine-tuning, i.e. fine-tuning on the dataset described above.
- Stage 2 trains a reward model (initialized from the Stage-1 SFT model): different outputs of the model for the same prompt are manually ranked, and the reward model is trained on these rankings.
- Stage 3 fine-tunes an RL model using the reward function from Stage 2. During fine-tuning, the PPO algorithm constrains how far the RL model's parameters can move (using the Stage-1 SFT policy as the reference, PPO keeps the policy from drifting too far from the SFT baseline).

Concretely, Stage 3 proceeds in two parts:

(Figure: ColossalChat's Experience-generation and parameter-update loop in Stage 3)

→ As shown at the bottom of the figure, the Make Experience part first uses the SFT, Actor, RM and Critic models to generate Experience entries and store them in a buffer; the parameter-update part then uses the Experience to compute the value loss, similar to

$$\text{loss}(\theta) = -\frac{1}{\binom{K}{2}}\, E_{(x, y_w, y_l)\sim D}\left[\log\left(\sigma\left(r_\theta(x, y_w) - r_\theta(x, y_l)\right)\right)\right]$$

and the policy loss.

(Figure: policy loss formula)

→ The top of the figure is the PTX part (the extra term appended at the end of the objective $\mathrm{objective}(\phi)$ discussed earlier):
ColossalChat calculates the cross-entropy loss between the Actor's current output response and the response part of the pre-training corpus,
which adds pre-training gradients to the PPO gradient
in order to maintain the language model's (e.g. GPT-2's) original core capabilities and prevent it from forgetting where it started (GPT-2 → SFT → RM → RLHF).

Finally, the policy loss, value loss, and PTX loss are summed up for backpropagation and parameter update. A minimal sketch of the pairwise ranking loss above is given right below.
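To make the ranking loss above concrete, here is a minimal PyTorch rendering of $-E[\log\sigma(r_\theta(x, y_w) - r_\theta(x, y_l))]$; this is an illustrative sketch, not necessarily the exact loss class that coati ships.

```python
import torch
import torch.nn.functional as F

def pairwise_ranking_loss(chosen_reward: torch.Tensor, rejected_reward: torch.Tensor) -> torch.Tensor:
    """-log(sigmoid(r(x, y_w) - r(x, y_l))), averaged over the batch."""
    return -F.logsigmoid(chosen_reward - rejected_reward).mean()

# toy usage: rewards the RM assigned to the chosen vs. rejected completions of 3 prompts
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.4, 0.9, 1.5])
print(pairwise_ranking_loss(chosen, rejected))  # small when chosen rewards exceed rejected ones
```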
### 4.2.2 Code implementation: SFT model + reward model + PPO training

First, an overview of the code architecture:

(Figure: overall code architecture of ColossalChat)

Next, let's look at some of the key implementations.

**Step 1.** First, train an SFT model via ColossalAI/applications/Chat/coati/trainer/sft.py (a short illustration of the loss it optimizes follows this listing):

```python
import math
import time
from abc import ABC
from typing import Optional

import loralib as lora
import torch
import torch.distributed as dist
import wandb
from coati.models.loss import GPTLMLoss
from torch import nn
from torch.optim import Adam, Optimizer
from torch.optim.lr_scheduler import LambdaLR
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler
from tqdm import tqdm
from transformers.tokenization_utils_base import PreTrainedTokenizerBase
from transformers.trainer import get_scheduler

from colossalai.logging import get_dist_logger

from .strategies import Strategy
from .utils import is_rank_0


class SFTTrainer(ABC):
    """
        Trainer to use while training reward model.

    Args:
        model (torch.nn.Module): the model to train
        strategy (Strategy): the strategy to use for training
        optim(Optimizer): the optimizer to use for training
        train_dataloader: the dataloader to use for training
        eval_dataloader: the dataloader to use for evaluation
        batch_size (int, defaults to 1): the batch size while training
        max_epochs (int, defaults to 2): the number of epochs to train
        optim_kwargs (dict, defaults to {'lr':1e-4}): the kwargs to use while initializing optimizer
    """

    def __init__(
        self,
        model,
        strategy: Strategy,
        optim: Optimizer,
        train_dataloader: DataLoader,
        eval_dataloader: DataLoader = None,
        batch_size: int = 1,
        max_epochs: int = 2,
        accimulation_steps: int = 8,
    ) -> None:
        super().__init__()
        self.strategy = strategy
        self.epochs = max_epochs
        self.train_dataloader = train_dataloader
        self.eval_dataloader = eval_dataloader

        self.model = strategy.setup_model(model)
        if "DDP" in str(self.strategy):
            self.model = self.model.module
        self.optimizer = strategy.setup_optimizer(optim, self.model)

        self.accimulation_steps = accimulation_steps
        num_update_steps_per_epoch = len(train_dataloader) // self.accimulation_steps
        max_steps = math.ceil(self.epochs * num_update_steps_per_epoch)

        self.scheduler = get_scheduler("cosine",
                                       self.optimizer,
                                       num_warmup_steps=math.ceil(max_steps * 0.03),
                                       num_training_steps=max_steps)

    def fit(self, logger, log_interval=10):
        wandb.init(project="Coati", name=time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))
        wandb.watch(self.model)
        total_loss = 0
        # epoch_bar = tqdm(range(self.epochs), desc='Epochs', disable=not is_rank_0())
        step_bar = tqdm(range(len(self.train_dataloader) // self.accimulation_steps * self.epochs),
                        desc=f'steps',
                        disable=not is_rank_0())
        for epoch in range(self.epochs):

            # process_bar = tqdm(range(len(self.train_dataloader)), desc=f'Train process for{epoch}', disable=not is_rank_0())
            # train
            self.model.train()
            for batch_id, batch in enumerate(self.train_dataloader):

                prompt_ids = batch["input_ids"].to(torch.cuda.current_device())
                p_mask = batch["attention_mask"].to(torch.cuda.current_device())
                labels = batch["labels"].to(torch.cuda.current_device())
                # prompt_ids = prompt_ids.squeeze(1).cuda()
                # p_mask = p_mask.squeeze(1).cuda()
                # prompt_logits = self.model(prompt_ids, attention_mask=p_mask, labels=labels)

                outputs = self.model(prompt_ids, attention_mask=p_mask, labels=labels)
                loss = outputs.loss
                prompt_logits = outputs.logits

                if loss >= 2.5:
                    logger.warning(f"batch_id:{batch_id}, abnormal loss: {loss}")

                loss = loss / self.accimulation_steps

                self.strategy.backward(loss, self.model, self.optimizer)

                total_loss += loss.item()

                # gradient accumulation
                if (batch_id + 1) % self.accimulation_steps == 0:
                    self.strategy.optimizer_step(self.optimizer)
                    self.optimizer.zero_grad()
                    self.scheduler.step()
                    wandb.log({
                        "loss": total_loss / self.accimulation_steps,
                        "lr": self.scheduler.get_last_lr()[0],
                        "epoch": epoch,
                        "batch_id": batch_id
                    })
                    total_loss = 0
                    step_bar.update()

                # if batch_id % log_interval == 0:
                #     logger.info(f'Train Epoch {epoch}/{self.epochs} Batch {batch_id} Rank {dist.get_rank()} loss {loss.item()}')
                #     wandb.log({"loss": loss.item()})

                # process_bar.update()

            # eval
            if self.eval_dataloader is not None:
                self.model.eval()
                with torch.no_grad():
                    loss_sum = 0
                    num_seen = 0
                    for batch in self.eval_dataloader:
                        prompt_ids = batch["input_ids"].to(torch.cuda.current_device())
                        p_mask = batch["attention_mask"].to(torch.cuda.current_device())
                        labels = batch["labels"].to(torch.cuda.current_device())
                        # prompt_ids = prompt_ids.squeeze(1).cuda()
                        # p_mask = p_mask.squeeze(1).cuda()

                        outputs = self.model(prompt_ids, attention_mask=p_mask, labels=labels)
                        loss = outputs.loss
                        # prompt_logits = outputs.logits

                        loss_sum += loss.item()
                        num_seen += prompt_ids.size(0)

                    loss_mean = loss_sum / num_seen
                    if dist.get_rank() == 0:
                        logger.info(f'Eval Epoch {epoch}/{self.epochs} loss {loss_mean}')

            # epoch_bar.update()

    def save_model(self,
                   path: str,
                   only_rank0: bool = False,
                   tokenizer: Optional[PreTrainedTokenizerBase] = None) -> None:
        self.strategy.save_model(model=self.model, path=path, only_rank0=only_rank0, tokenizer=tokenizer)
```
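The `outputs.loss` used above is simply the causal-LM cross-entropy that a Hugging Face model returns when `labels` are passed, with non-response tokens typically masked out with -100 so that only the answer tokens contribute. Below is a minimal pure-PyTorch illustration of that shifted, masked cross-entropy; the -100 masking of prompt tokens is an assumption about how the labels are prepared, not coati code.

```python
import torch
import torch.nn.functional as F

vocab_size, seq_len = 50, 8
logits = torch.randn(1, seq_len, vocab_size)                  # pretend model output
labels = torch.tensor([[-100, -100, -100, 7, 12, 3, 9, 2]])   # prompt tokens masked with -100

# same shift as the Hugging Face causal-LM loss: position t predicts token t+1
shift_logits = logits[:, :-1, :].reshape(-1, vocab_size)
shift_labels = labels[:, 1:].reshape(-1)

loss = F.cross_entropy(shift_logits, shift_labels, ignore_index=-100)
print(loss)
```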
**Step 2.** Next, train a reward model via ColossalAI/applications/Chat/coati/trainer/rm.py:

```python
from abc import ABC
from datetime import datetime
from typing import Optional

import pandas as pd
import torch
import torch.distributed as dist
from torch.optim import Optimizer, lr_scheduler
from torch.utils.data import DataLoader, Dataset, DistributedSampler
from tqdm import tqdm
from transformers.tokenization_utils_base import PreTrainedTokenizerBase

from .strategies import Strategy
from .utils import is_rank_0


class RewardModelTrainer(ABC):
    """
        Trainer to use while training reward model.

    Args (this class inherits from the ABC abstract base class):
        model: the model to train
        strategy: the training strategy
        optim: the optimizer
        loss_fn: the loss function
        train_dataset: the training dataset
        valid_dataset: the validation dataset
        eval_dataset: the evaluation dataset
        batch_size: the batch size (defaults to 1)
        max_epochs: the maximum number of training epochs (defaults to 1)
    """

    # Initialize the RewardModelTrainer: set up the model, optimizer and scheduler,
    # and create the train / validation / evaluation dataloaders
    def __init__(
        self,
        model,
        strategy: Strategy,
        optim: Optimizer,
        loss_fn,
        train_dataset: Dataset,
        valid_dataset: Dataset,
        eval_dataset: Dataset,
        batch_size: int = 1,
        max_epochs: int = 1,
    ) -> None:
        super().__init__()
        self.strategy = strategy
        self.epochs = max_epochs
        train_sampler = None

        if dist.is_initialized() and dist.get_world_size() > 1:
            train_sampler = DistributedSampler(train_dataset, shuffle=True, seed=42, drop_last=True)
        self.train_dataloader = DataLoader(train_dataset,
                                           shuffle=(train_sampler is None),
                                           sampler=train_sampler,
                                           batch_size=batch_size)
        self.valid_dataloader = DataLoader(valid_dataset, batch_size=batch_size, shuffle=True)
        self.eval_dataloader = DataLoader(eval_dataset, batch_size=batch_size, shuffle=True)

        self.model = strategy.setup_model(model)
        self.loss_fn = loss_fn
        self.optimizer = strategy.setup_optimizer(optim, self.model)
        self.scheduler = lr_scheduler.CosineAnnealingLR(self.optimizer, self.train_dataloader.__len__() // 100)

    # Compute the accuracy on a given dataloader: how often the chosen completion gets a higher
    # reward than the rejected one, plus the mean chosen-minus-rejected reward gap
    def eval_acc(self, dataloader):
        dist = 0
        on = 0
        cnt = 0
        self.model.eval()
        with torch.no_grad():
            for chosen_ids, c_mask, reject_ids, r_mask in dataloader:
                chosen_ids = chosen_ids.squeeze(1).to(torch.cuda.current_device())
                c_mask = c_mask.squeeze(1).to(torch.cuda.current_device())
                reject_ids = reject_ids.squeeze(1).to(torch.cuda.current_device())
                r_mask = r_mask.squeeze(1).to(torch.cuda.current_device())
                chosen_reward = self.model(chosen_ids, attention_mask=c_mask)
                reject_reward = self.model(reject_ids, attention_mask=r_mask)
                for i in range(len(chosen_reward)):
                    cnt += 1
                    if chosen_reward[i] > reject_reward[i]:
                        on += 1
                dist += (chosen_reward - reject_reward).mean().item()
            dist_mean = dist / len(dataloader)
            acc = on / cnt
        self.model.train()
        return dist_mean, acc

    # The actual training: in each epoch, iterate over train_dataloader, compute the loss and
    # update the model; every 100 steps, evaluate distance/accuracy on the validation set and
    # log them to a CSV file; at the end of each epoch, evaluate on the evaluation set
    def fit(self):
        time = datetime.now()
        epoch_bar = tqdm(range(self.epochs), desc='Train epoch', disable=not is_rank_0())
        for epoch in range(self.epochs):
            step_bar = tqdm(range(self.train_dataloader.__len__()),
                            desc='Train step of epoch %d' % epoch,
                            disable=not is_rank_0())
            # train
            self.model.train()
            cnt = 0
            acc = 0
            dist = 0
            for chosen_ids, c_mask, reject_ids, r_mask in self.train_dataloader:
                chosen_ids = chosen_ids.squeeze(1).to(torch.cuda.current_device())
                c_mask = c_mask.squeeze(1).to(torch.cuda.current_device())
                reject_ids = reject_ids.squeeze(1).to(torch.cuda.current_device())
                r_mask = r_mask.squeeze(1).to(torch.cuda.current_device())
                chosen_reward = self.model(chosen_ids, attention_mask=c_mask)
                reject_reward = self.model(reject_ids, attention_mask=r_mask)
                loss = self.loss_fn(chosen_reward, reject_reward)
                self.strategy.backward(loss, self.model, self.optimizer)
                self.strategy.optimizer_step(self.optimizer)
                self.optimizer.zero_grad()
                cnt += 1
                if cnt == 100:
                    self.scheduler.step()
                    dist, acc = self.eval_acc(self.valid_dataloader)
                    cnt = 0
                    if is_rank_0():
                        log = pd.DataFrame([[step_bar.n, loss.item(), dist, acc]],
                                           columns=['step', 'loss', 'dist', 'acc'])
                        log.to_csv('log_%s.csv' % time, mode='a', header=False, index=False)
                step_bar.update()
                step_bar.set_postfix({'dist': dist, 'acc': acc})

            # eval
            dist, acc = self.eval_acc(self.eval_dataloader)
            if is_rank_0():
                log = pd.DataFrame([[step_bar.n, loss.item(), dist, acc]], columns=['step', 'loss', 'dist', 'acc'])
                log.to_csv('log.csv', mode='a', header=False, index=False)
            epoch_bar.update()
            step_bar.set_postfix({'dist': dist, 'acc': acc})
            step_bar.close()

    # Save the trained model; optionally only on the rank-0 process, optionally with the tokenizer
    def save_model(self,
                   path: str,
                   only_rank0: bool = False,
                   tokenizer: Optional[PreTrainedTokenizerBase] = None) -> None:
        self.strategy.save_model(model=self.model, path=path, only_rank0=only_rank0, tokenizer=tokenizer)
```
**Step 3.** Finally, start PPO training via ColossalAI/applications/Chat/coati/trainer/ppo.py (a sketch of the clipped policy loss it relies on follows this listing):

```python
from typing import Any, Callable, Dict, List, Optional

import torch
import torch.nn as nn
from coati.experience_maker import Experience, NaiveExperienceMaker
from coati.models.base import Actor, Critic
from coati.models.generation_utils import update_model_kwargs_fn
from coati.models.loss import PolicyLoss, ValueLoss
from coati.replay_buffer import NaiveReplayBuffer
from torch.optim import Optimizer
from transformers.tokenization_utils_base import PreTrainedTokenizerBase

from .base import Trainer
from .callbacks import Callback
from .strategies import Strategy


class PPOTrainer(Trainer):
    """
        Trainer for PPO algorithm.

    Args:
        strategy: the strategy used for training
        actor: the actor model of the PPO algorithm
        critic: the critic model of the PPO algorithm
        reward_model: the reward model that scores full sentences
        initial_model: the initial model used to produce reference log-probs that limit the actor's updates
        actor_optim: the optimizer for the actor model
        critic_optim: the optimizer for the critic model
        other arguments: hyperparameters controlling training, such as the KL coefficient, batch sizes, etc.
    """

    def __init__(self,
                 strategy: Strategy,
                 actor: Actor,
                 critic: Critic,
                 reward_model: nn.Module,
                 initial_model: Actor,
                 actor_optim: Optimizer,
                 critic_optim: Optimizer,
                 kl_coef: float = 0.1,
                 ptx_coef: float = 0.9,
                 train_batch_size: int = 8,
                 buffer_limit: int = 0,
                 buffer_cpu_offload: bool = True,
                 eps_clip: float = 0.2,
                 value_clip: float = 0.4,
                 experience_batch_size: int = 8,
                 max_epochs: int = 1,
                 tokenizer: Optional[Callable[[Any], dict]] = None,
                 sample_replay_buffer: bool = False,
                 dataloader_pin_memory: bool = True,
                 callbacks: List[Callback] = [],
                 **generate_kwargs) -> None:
        experience_maker = NaiveExperienceMaker(actor, critic, reward_model, initial_model, kl_coef)
        replay_buffer = NaiveReplayBuffer(train_batch_size, buffer_limit, buffer_cpu_offload)
        generate_kwargs = _set_default_generate_kwargs(strategy, generate_kwargs, actor)
        super().__init__(strategy, experience_maker, replay_buffer, experience_batch_size, max_epochs, tokenizer,
                         sample_replay_buffer, dataloader_pin_memory, callbacks, **generate_kwargs)
        self.actor = actor
        self.critic = critic

        self.actor_loss_fn = PolicyLoss(eps_clip)
        self.critic_loss_fn = ValueLoss(value_clip)
        self.ptx_loss_fn = nn.CrossEntropyLoss(ignore_index=-100)
        self.ptx_coef = ptx_coef
        self.actor_optim = actor_optim
        self.critic_optim = critic_optim

    # This method computes the actor and critic losses from an Experience object,
    # then backpropagates and runs the optimizer updates through the strategy
    def training_step(self, experience: Experience) -> Dict[str, float]:
        self.actor.train()
        self.critic.train()

        # policy loss
        num_actions = experience.action_mask.size(1)
        action_log_probs = self.actor(experience.sequences, num_actions, attention_mask=experience.attention_mask)
        actor_loss = self.actor_loss_fn(action_log_probs,
                                        experience.action_log_probs,
                                        experience.advantages,
                                        action_mask=experience.action_mask)

        # ptx loss
        if self.ptx_coef != 0:
            ptx = next(iter(self.pretrain_dataloader))['input_ids'].to(torch.cuda.current_device())
            label = next(iter(self.pretrain_dataloader))['labels'].to(torch.cuda.current_device())[:, 1:]
            attention_mask = next(iter(self.pretrain_dataloader))['attention_mask'].to(torch.cuda.current_device())
            ptx_log_probs = self.actor.get_base_model()(ptx, attention_mask=attention_mask)['logits'][..., :-1, :]
            ptx_loss = self.ptx_loss_fn(ptx_log_probs.view(-1, ptx_log_probs.size(-1)), label.view(-1))
            actor_loss = ptx_loss * self.ptx_coef + actor_loss * (1 - self.ptx_coef)

        self.strategy.backward(actor_loss, self.actor, self.actor_optim)
        self.strategy.optimizer_step(self.actor_optim)
        self.actor_optim.zero_grad()

        # value loss
        values = self.critic(experience.sequences,
                             action_mask=experience.action_mask,
                             attention_mask=experience.attention_mask)
        critic_loss = self.critic_loss_fn(values,
                                          experience.values,
                                          experience.reward,
                                          action_mask=experience.action_mask)
        self.strategy.backward(critic_loss, self.critic, self.critic_optim)
        self.strategy.optimizer_step(self.critic_optim)
        self.critic_optim.zero_grad()

        return {'reward': experience.reward.mean().item()}


def _set_default_generate_kwargs(strategy: Strategy, generate_kwargs: dict, actor: Actor) -> None:
    origin_model = strategy._unwrap_actor(actor)
    new_kwargs = {**generate_kwargs}
    # use huggingface models method directly
    if 'prepare_inputs_fn' not in generate_kwargs and hasattr(origin_model, 'prepare_inputs_for_generation'):
        new_kwargs['prepare_inputs_fn'] = origin_model.prepare_inputs_for_generation

    if 'update_model_kwargs_fn' not in generate_kwargs:
        new_kwargs['update_model_kwargs_fn'] = update_model_kwargs_fn

    return new_kwargs


def save_model(self, path: str, only_rank0: bool = False, tokenizer: Optional[PreTrainedTokenizerBase] = None) -> None:
    self.strategy.save_model(model=self.actor, path=path, only_rank0=only_rank0, tokenizer=tokenizer)
```
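PolicyLoss and ValueLoss are imported above but not shown. As an illustration, here is a generic masked, clipped PPO surrogate of the kind such a PolicyLoss typically implements — an assumption made for exposition, not coati's exact code.

```python
import torch
import torch.nn as nn

class ClippedPolicyLoss(nn.Module):
    """Generic PPO-clip surrogate: -min(ratio * A, clip(ratio) * A), averaged over action tokens."""

    def __init__(self, clip_eps: float = 0.2) -> None:
        super().__init__()
        self.clip_eps = clip_eps

    def forward(self, log_probs, old_log_probs, advantages, action_mask=None):
        ratio = (log_probs - old_log_probs).exp()
        surr1 = ratio * advantages
        surr2 = ratio.clamp(1 - self.clip_eps, 1 + self.clip_eps) * advantages
        loss = -torch.min(surr1, surr2)
        if action_mask is not None:
            # only the generated (action) tokens contribute to the loss
            return (loss * action_mask).sum() / action_mask.sum()
        return loss.mean()

# toy usage with random tensors of shape (batch, num_actions)
log_probs = torch.randn(2, 5)
old_log_probs = torch.randn(2, 5)
advantages = torch.randn(2, 5)
mask = torch.ones(2, 5)
print(ClippedPolicyLoss()(log_probs, old_log_probs, advantages, mask))
```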
After the final model weights are obtained, quantization can further reduce the hardware cost of inference before launching an online inference service: a single GPU with roughly 4 GB of memory is enough to serve the 7-billion-parameter model. A rough sketch of loading a quantized checkpoint is shown below.
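ColossalChat ships its own quantized-inference scripts; as a generic illustration of the idea (not ColossalChat's code), here is how a 7B causal-LM checkpoint can be loaded with 8-bit weight-only quantization using Hugging Face transformers plus bitsandbytes. The checkpoint path is a placeholder, and note that 8-bit still needs roughly 7-8 GB for a 7B model — the ~4 GB figure above relies on more aggressive (4-bit) quantization.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "path/to/your-7b-rlhf-checkpoint"   # placeholder path to the final weights

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_8bit=True,     # weight-only int8 quantization via bitsandbytes
    device_map="auto",     # spread layers over the available GPU(s)
)

prompt = "What is RLHF?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```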
For more, see the next installment: *Deploying and Fine-tuning ChatGPT-like Projects (Part 3): From ChatGLM-6b to ChatDoctor*.
Author: sockstack
License: CC BY 4.0
Published: 2023-12-17
Last modified: 2024-11-24