InstructGLM: Fine-tuning ChatGLM-6B on Instruction Datasets
# InstructGLM

> Fine-tuning ChatGLM-6B with LoRA on instruction datasets

https://github.com/yanqiangmiffy/InstructGLM

**What this project covers:**

- 🚀 2023/4/9 Released LoRA weights trained on 1M Chinese instruction examples generated by the BELLE project; see `output/belle/chatglm-lora.pt`
- 🚀 2023/4/8 Added multi-GPU fine-tuning with DeepSpeed, roughly 8-9x faster than a single GPU; see "Fine-tuning 3: LoRA fine-tuning with DeepSpeed" for the setup
- 🚀 2023/3/28 Open-sourced LoRA weights fine-tuned on the alpaca and belle instruction data; see the `output` directory for details
- 🚀 2023/3/25 Fine-tuned ChatGLM-6B with LoRA
- 🚀 2023/3/23 Improved the Gradio-based demo

## Todo

- [x] DeepSpeed support
- [ ] Model evaluation: how to measure the quality of the fine-tuned model

## Open-source instruction datasets

- Stanford 52K English instruction data

> Each of the 52K instructions is unique; the answers were generated by the text-davinci-003 model.

- Chinese instruction data generated by the BELLE project: 0.5M and 1M versions

> 1M examples: https://huggingface.co/datasets/BelleGroup/generated_train_1M_CN
>
> Generated from seed prompts by calling the OpenAI API to produce Chinese instructions.

- GuanacoDataset, a multilingual instruction dataset

> Guanaco is an instruction-following language model trained on Meta's LLaMA 7B. On top of the original 52K Alpaca examples, an additional 98,369 entries were added, covering English, Simplified Chinese, Traditional Chinese (Taiwan), Traditional Chinese (Hong Kong), Japanese, German, and a range of linguistic and grammar tasks. Retrained on this richer data, Guanaco shows strong performance and potential in multilingual settings. Project page: https://guanaco-model.github.io/

- Chinese Alpaca instruction fine-tuning dataset

> Same JSON format as the original Alpaca data; generated via machine translation and self-instruct.

- Manually refined Chinese dialogue dataset

> Adds Chinese chat dialogues beyond Alpaca. For answers that are not idiomatic Chinese, the question is re-asked to ChatGPT or ERNIE Bot and the new answer replaces the original Alpaca one.

- firefly-train-1.1M, a high-quality Chinese multi-task instruction fine-tuning dataset with 1.1M examples covering 23 common Chinese NLP tasks. For each task, several instruction templates were written by hand to ensure quality and diversity.

## Fine-tuning 1: Alpaca English instruction data

Stanford Alpaca 52K data, in its original format:

```text
{
    "instruction": "Evaluate this sentence for spelling and grammar mistakes",
    "input": "He finnished his meal and left the resturant",
    "output": "He finished his meal and left the restaurant."
}
```

> Dataset: https://github.com/tatsu-lab/stanford_alpaca

### 1. Data preprocessing

Convert the Alpaca dataset to JSONL. At this step you can choose the prompt format of the converted data, for example:

```text
###Instruction:xxx###Input:xxxx###Response:xxx
```

```shell
python cover_alpaca2jsonl.py \
    --data_path data/alpaca_data.json \
    --save_path data/alpaca_data.jsonl
```

Then tokenize the text ahead of time to speed up training; the maximum sequence length can be set according to the available resources.

```shell
python tokenize_dataset_rows.py \
    --jsonl_path data/alpaca_data.jsonl \
    --save_path data/alpaca \
    --max_seq_length 320
```
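`cover_alpaca2jsonl.py` is part of the repository and is not reproduced here. As a rough illustration only, the conversion amounts to flattening each Alpaca record into a prompt/target pair using the template shown above; the actual field names and formatting in the script may differ.

```python
import json

# Minimal sketch of the Alpaca -> JSONL conversion, assuming the
# "###Instruction/###Input/###Response" template shown above; the output keys
# ("context"/"target") are an assumption, not taken from the repository.
def format_example(example: dict) -> dict:
    prompt = f"###Instruction:{example['instruction']}"
    if example.get("input"):
        prompt += f"###Input:{example['input']}"
    prompt += "###Response:"
    return {"context": prompt, "target": example["output"]}

with open("data/alpaca_data.json", encoding="utf-8") as f:
    examples = json.load(f)

with open("data/alpaca_data.jsonl", "w", encoding="utf-8") as f:
    for example in examples:
        f.write(json.dumps(format_example(example), ensure_ascii=False) + "\n")
```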
### 2. Training

```shell
python train_lora.py \
    --dataset_path data/alpaca \
    --lora_rank 8 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 1 \
    --max_steps 52000 \
    --save_steps 1000 \
    --save_total_limit 2 \
    --learning_rate 2e-5 \
    --fp16 \
    --remove_unused_columns false \
    --logging_steps 50 \
    --output_dir output
```
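`train_lora.py` lives in the repository; what the command above configures is a rank-8 LoRA adapter on top of ChatGLM-6B trained with the Hugging Face `Trainer`. The sketch below is only an approximation of that setup using the `peft` library; the target modules, hyperparameters not listed in the command, data collator, and saving logic in the actual script may differ.

```python
from transformers import AutoModel, Trainer, TrainingArguments
from peft import LoraConfig, TaskType, get_peft_model

# Load ChatGLM-6B and inject rank-8 LoRA adapters (approximation of train_lora.py).
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().cuda()
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                  # --lora_rank 8
    lora_alpha=32,                        # assumed value, not given in the README
    lora_dropout=0.1,                     # assumed value, not given in the README
    target_modules=["query_key_value"],   # assumed: ChatGLM's fused attention projection
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # only the LoRA matrices are trainable

training_args = TrainingArguments(
    output_dir="output",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=1,
    max_steps=52000,
    save_steps=1000,
    save_total_limit=2,
    learning_rate=2e-5,
    fp16=True,
    remove_unused_columns=False,
    logging_steps=50,
)
# The tokenized dataset produced by tokenize_dataset_rows.py and a collator that
# pads input_ids/labels would be plugged in here:
# trainer = Trainer(model=model, args=training_args, train_dataset=dataset, data_collator=collator)
# trainer.train()
```

Because only the low-rank adapter matrices receive gradients, the memory footprint stays far below that of fully fine-tuning all 6B parameters.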
## Fine-tuning 2: BELLE Chinese instruction data

This dataset contains 543,314 Chinese instruction examples generated by the BELLE project, formatted as:

| input | target |
| --- | --- |
| 用一句话描述地球为什么是独一无二的。\n | 地球上有适宜生命存在的条件和多样化的生命形式 |

> Dataset: https://huggingface.co/datasets/BelleGroup/generated_train_0.5M_CN

### 1. Data preprocessing

Convert the BELLE dataset to JSONL:

```shell
python cover_alpaca2jsonl.py \
    --dataset_name BelleGroup/generated_train_0.5M_CN \
    --save_path data/belle_data.jsonl
```

Text-length statistics:

```text
count    543314.000000
mean         83.536944
std          95.665178
min           4.000000
25%          33.000000
50%          51.000000
75%          88.000000
90%         194.000000
max        4410.000000
Name: input_len, dtype: float64

count    543314.000000
mean        121.079030
std         165.472722
min           1.000000
25%          27.000000
50%          67.000000
75%         151.000000
90%         296.000000
max        9463.000000
Name: target_len, dtype: float64
```

Tokenization:

```shell
python tokenize_dataset_rows.py \
    --jsonl_path data/belle_data.jsonl \
    --save_path data/belle \
    --max_seq_length 320
```

The converted data:

```text
                                           input_ids  seq_len
0  [20005, 92863, 20012, 20005, 83864, 87784, 871...       20
1  [20005, 92863, 20012, 20005, 91432, 86523, 885...       80
2  [20005, 92863, 20012, 104069, 85056, 86334, 89...       61
3  [20005, 92863, 20012, 91492, 89122, 83866, 852...       24
4  [20005, 92863, 20012, 20005, 83834, 99899, 927...       24
```

### 2. Training

- Training from the original chatglm-6b:

```shell
python train_lora.py \
    --dataset_path data/belle \
    --lora_rank 8 \
    --per_device_train_batch_size 2 \
    --gradient_accumulation_steps 1 \
    --max_steps 52000 \
    --save_steps 1000 \
    --save_total_limit 2 \
    --learning_rate 2e-5 \
    --fp16 \
    --remove_unused_columns false \
    --logging_steps 50 \
    --output_dir output
```

- Continuing fine-tuning from the Alpaca LoRA weights:

```shell
python train_lora.py \
    --dataset_path data/belle \
    --lora_rank 8 \
    --per_device_train_batch_size 8 \
    --gradient_accumulation_steps 1 \
    --max_steps 52000 \
    --save_steps 10000 \
    --save_total_limit 2 \
    --learning_rate 2e-5 \
    --fp16 \
    --remove_unused_columns false \
    --logging_steps 50 \
    --output_dir output/belle \
    --is_resume True \
    --resume_path output/alpaca/chatglm-lora.pt
```

## Fine-tuning 3: LoRA fine-tuning with DeepSpeed

Multi-GPU training with the ZeRO optimizer is supported, speeding up training by roughly 8x:

```shell
accelerate launch --config_file config/default_config.yaml train_new.py
```

## Environment

- Install the required packages: `pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple`
- GPUs: 2 x A100 80G

## Results

- Trained LoRA weights:

```text
└─output
   ├─alpaca: LoRA weights fine-tuned on the 52k Alpaca data
   ├─belle: the 52k Alpaca LoRA weights further fine-tuned on the BELLE data for 52,000 steps
   └─belle_raw: LoRA weights fine-tuned on the BELLE data only, 104,000 steps
```

> Download: https://pan.baidu.com/s/1c-zRSEUn4151YLoowPN4YA?pwd=hxbr (extraction code: hxbr)

- Fine-tuning results on the Alpaca data:

![](https://img-blog.csdnimg.cn/img_convert/b10015459be4c2074a751c2a07146929.png)

- Fine-tuning results on the BELLE data:

![](https://img-blog.csdnimg.cn/img_convert/d3d14536b7ba0037f49b94aaa66a1cd3.jpeg)
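How the released `.pt` files should be loaded depends on how `train_lora.py` saves them. Assuming `output/belle/chatglm-lora.pt` is a LoRA state dict compatible with the configuration sketched earlier (an assumption, not something the README states), inference with a trained adapter could look roughly like this:

```python
import torch
from transformers import AutoModel, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

# Sketch only: rebuild the same LoRA wrapper used for training, then load the
# released weights on top of the base ChatGLM-6B checkpoint.
tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().cuda()

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8, lora_alpha=32, lora_dropout=0.1,   # must match the training configuration
    target_modules=["query_key_value"],     # assumed, as in the training sketch above
)
model = get_peft_model(model, lora_config)
model.load_state_dict(torch.load("output/belle/chatglm-lora.pt"), strict=False)
model.eval()

# chat() is provided by ChatGLM-6B's remote code; attribute access is forwarded
# through the PEFT wrapper to the underlying model.
response, _ = model.chat(tokenizer, "用一句话描述地球为什么是独一无二的。", history=[])
print(response)
```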
## Reference

> Many thanks to the following authors for generously open-sourcing their work:

- https://github.com/mymusise/ChatGLM-Tuning
- https://huggingface.co/BelleGroup/BELLE-7B-2M
- https://github.com/LianjiaTech/BELLE
- https://huggingface.co/datasets/BelleGroup/generated_train_0.5M_CN
- https://huggingface.co/datasets/JosephusCheung/GuanacoDataset
- https://guanaco-model.github.io/
- https://github.com/carbonz0/alpaca-chinese-dataset
- https://github.com/THUDM/ChatGLM-6B
- https://huggingface.co/THUDM/chatglm-6b
- https://github.com/lich99/ChatGLM-finetune-LoRA

## Bugs

- gcc version upgrade (CentOS):

```shell
yum install centos-release-scl -y
yum install devtoolset-9 -y
# Temporarily override the system gcc
scl enable devtoolset-9 bash
# Check the gcc version
gcc -v
```
Author: sockstack
License: CC BY 4.0
Published: 2023-11-06
Last modified: 2025-01-05