NLP (54): Using tiktoken
sockstack
2024-02-27 00:02:37
<article class="baidu_pl"><div id="article_content" class="article_content clearfix"><div id="content_views" class="markdown_views">
<p><code>tiktoken</code> is a Python package recently open-sourced by OpenAI. It implements the BPE (byte pair encoding) tokenizer algorithm and is heavily optimized for speed. This article introduces how to use the tiktoken module.</p>
<h3><a id="tiktoken_2"></a>About tiktoken</h3>
<p><code>BPE (byte pair encoding)</code> is a common tokenization scheme in NLP; for its background and implementation, see 深入理解NLP Subword算法:BPE、WordPiece、ULM (reference 1).<br>
<code>tiktoken</code> is open-sourced on GitHub at https://github.com/openai/tiktoken and runs 3-6x faster than comparable open-source tokenizer libraries. Below is a performance comparison with Hugging Face's tokenizer:<br>
<img referrerpolicy="no-referrer" src="https://img-blog.csdnimg.cn/ff32f043729840328ec5edc8809d3a49.png#pic_center" alt="Throughput of tiktoken vs. Hugging Face tokenizers at different thread counts"><br>
The benchmark tokenizes 1 GB of text with the GPT-2 tokenizer, using <code>GPT2TokenizerFast</code> from <code>tokenizers==0.13.2</code>, <code>transformers==4.24.0</code>, and <code>tiktoken==0.2.0</code>.</p>
<h3><a id="_8"></a>Basic usage</h3>
<p>An encoding in <code>tiktoken</code> defines how text is converted into tokens. Different models use different encodings. <code>tiktoken</code> supports the following three encodings for OpenAI models:</p>
<table>
<thead><tr><th>Encoding</th><th>OpenAI models</th></tr></thead>
<tbody>
<tr><td>cl100k_base</td><td>gpt-4, gpt-3.5-turbo, text-embedding-ada-002</td></tr>
<tr><td>p50k_base</td><td>Codex models, e.g. text-davinci-002, text-davinci-003</td></tr>
<tr><td>r50k_base (or gpt2)</td><td>GPT-3 models, e.g. davinci</td></tr>
</tbody>
</table>
<p>The encoding used by a given model can be looked up as follows:</p>
<pre><code class="prism language-python"># -*- coding: utf-8 -*-
import tiktoken

# get encoding name
print(tiktoken.encoding_for_model('gpt-3.5-turbo'))
</code></pre>
<p>The output is:</p>
<pre><code>&lt;Encoding 'cl100k_base'&gt;
</code></pre>
<p>Note that <code>p50k_base</code> and <code>r50k_base</code> overlap substantially; for non-code text they usually produce the same tokens.<br>
The 100k in <code>cl100k_base</code> indicates that this encoding's vocabulary contains roughly 100k entries. The vocabulary file cl100k_base_vocab.json can be downloaded from https://raw.githubusercontent.com/weikang-wang/ChatGPT-Vocabulary/main/cl100k_base_vocab.json and holds 100,256 entries; a vocabulary this large helps OpenAI models perform well across many languages.</p>
<h3><a id="_32"></a>Encoding and decoding</h3>
<p>Encoding maps text to a list of token ids; decoding maps a list of token ids back to text. See the following Python code:</p>
<pre><code class="prism language-python"># -*- coding: utf-8 -*-
import tiktoken

# simple test
enc = tiktoken.get_encoding("cl100k_base")
print(enc.encode("hello world") == [15339, 1917])
print(enc.decode([15339, 1917]) == "hello world")
print(enc.encode("hello &lt;|endoftext|&gt;", allowed_special="all") == [15339, 220, 100257])

# encode
tokens = enc.encode("tiktoken is great!")
print(tokens)
print(len(tokens))

# decode
print(enc.decode([83, 1609, 5963, 374, 2294, 0]))

# chinese encode
tokens = enc.encode("大模型是什么?")
print(tokens)
print(len(tokens))

# chinese decode
print(enc.decode([27384, 54872, 25287, 21043, 6271, 222, 82696, 11571]))
</code></pre>
<p>The output is as follows:</p>
<pre><code>True
True
True
[83, 1609, 5963, 374, 2294, 0]
6
tiktoken is great!
[27384, 54872, 25287, 21043, 6271, 222, 82696, 11571]
8
大模型是什么?
</code></pre>
<h3><a id="token_72"></a>Counting tokens</h3>
<p>Token counts matter when working with OpenAI models, since API calls are billed per token; see https://openai.com/pricing for the pricing details.<br>
The following Python code counts tokens with <code>tiktoken</code>:</p>
<pre><code class="prism language-python"># -*- coding: utf-8 -*-
import tiktoken

def num_tokens_from_string(string: str, encoding_name: str) -> int:
    """Returns the number of tokens in a text string."""
    encoding = tiktoken.get_encoding(encoding_name)
    num_tokens = len(encoding.encode(string))
    return num_tokens

print(num_tokens_from_string('tiktoken is great!', 'cl100k_base'))
print(num_tokens_from_string('大模型是什么?', 'cl100k_base'))
</code></pre>
<p>The output is:</p>
<pre><code>6
8
</code></pre>
<p>A tiktoken-based token calculator is also available on Hugging Face at https://huggingface.co/spaces/JacobLinCool/tiktoken-calculator, which looks like this:<br>
<img referrerpolicy="no-referrer" src="https://img-blog.csdnimg.cn/5155ec284e98487380ee800df00dfd7d.png" alt="tiktoken token calculator"><br>
To count tokens in a chat completion scenario, taking the model <code>gpt-3.5-turbo</code> as an example, the Python code is as follows:</p>
<pre><code class="prism language-python"># -*- coding: utf-8 -*-
import tiktoken
import openai

def num_tokens_from_messages(messages):
    """Returns the number of tokens used by a list of messages."""
    encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
    tokens_per_message = 4  # every message follows &lt;|start|&gt;{role/name}\n{content}&lt;|end|&gt;\n
    tokens_per_name = -1    # if there's a name, the role is omitted
    num_tokens = 0
    for message in messages:
        num_tokens += tokens_per_message
        for key, value in message.items():
            num_tokens += len(encoding.encode(value))
            if key == "name":
                num_tokens += tokens_per_name
    num_tokens += 3  # every reply is primed with &lt;|start|&gt;assistant&lt;|message|&gt;
    return num_tokens

example_messages = [
    {
        "role": "system",
        "content": "You are a helpful, pattern-following assistant that translates corporate jargon into plain English.",
    },
    {
        "role": "system",
        "name": "example_user",
        "content": "New synergies will help drive top-line growth.",
    },
    {
        "role": "system",
        "name": "example_assistant",
        "content": "Things working well together will increase revenue.",
    },
    {
        "role": "system",
        "name": "example_user",
        "content": "Let's circle back when we have more bandwidth to touch base on opportunities for increased leverage.",
    },
    {
        "role": "system",
        "name": "example_assistant",
        "content": "Let's talk later when we're less busy about how to do better.",
    },
    {
        "role": "user",
        "content": "This late pivot means we don't have time to boil the ocean for the client deliverable.",
    },
]

# example token count from the function defined above
print(f"{num_tokens_from_messages(example_messages)} prompt tokens counted by num_tokens_from_messages().")

# example token count from the OpenAI API
openai.api_key = ""
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=example_messages,
    temperature=0,
    max_tokens=1,
)
print(f'{response["usage"]["prompt_tokens"]} prompt tokens counted by the OpenAI API.')
</code></pre>
<p>The output is as follows:</p>
<pre><code>127 prompt tokens counted by num_tokens_from_messages().
127 prompt tokens counted by the OpenAI API.
</code></pre>
<p>As shown above, <code>num_tokens_from_messages</code> first adds 4 tokens for each message in the input, then adds the token count of every value in the message dict; whenever a key is <code>name</code>, it subtracts one token, because the role is omitted when a name is present. For the model <code>gpt-3.5-turbo</code>, this function matches the token accounting used by the OpenAI chat completion API.</p>
<h3><a id="_170"></a>Summary</h3>
<p>This article introduced the <code>tiktoken</code> module, its basic usage, and how to count tokens with it.</p>
<h3><a id="_172"></a>References</h3>
<ol>
<li>深入理解NLP Subword算法:BPE、WordPiece、ULM: https://zhuanlan.zhihu.com/p/86965595</li>
<li>tiktoken on GitHub: https://github.com/openai/tiktoken</li>
<li>tiktoken-calculator: https://huggingface.co/spaces/JacobLinCool/tiktoken-calculator</li>
<li>How_to_count_tokens_with_tiktoken.ipynb: https://github.com/openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb</li>
</ol>
</div></div></article>
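As background to the BPE algorithm the article references, the merge procedure can be sketched in a few lines of pure Python. This is an illustrative toy trainer, not tiktoken's implementation: the function name `learn_bpe` and the tiny corpus are invented for this sketch, and real tiktoken applies a fixed, pre-trained byte-level merge table rather than learning merges at runtime.

```python
# Toy BPE sketch: repeatedly merge the most frequent adjacent symbol pair.
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn up to num_merges BPE merges from a list of words."""
    # Start with each word as a tuple of single-character symbols.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair wins
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused into one symbol.
        merged = {}
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = freq
        corpus = merged
    return merges

merges = learn_bpe(["low", "low", "lower", "newest", "newest"], 3)
print(merges)  # → [('l', 'o'), ('lo', 'w'), ('n', 'e')]
```

Each learned merge fuses the currently most frequent adjacent pair, so frequent substrings like "lo" become single tokens; production encoders such as cl100k_base apply the same idea over raw bytes with a vocabulary of roughly 100k merges.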
Author
sockstack
License
CC BY 4.0
Published
2024-02-27
Updated
2025-02-05