【AIGC】Visual ChatGPT: A Deep Dive into the Visual Model
sockstack
2023-11-06 23:54:12
<article class="baidu_pl"><div id="article_content" class="article_content clearfix"> <div id="content_views" class="markdown_views prism-atom-one-light"> <p><strong>Welcome to follow the original works of 【youcans的AGI学习笔记】 (youcans' AGI study notes)</strong></p> <p></p> <div class="toc"> <h3>【AIGC】Visual ChatGPT: A Deep Dive into the Visual Model</h3> <ul><li><ul> <li>1. 【Visual ChatGPT】Arrives with a Bang</li> <li>2. 【Visual-GPT】Worked Examples</li> <li><ul> <li>2.1 Processing Pipeline</li> <li>2.2 Worked Examples</li> </ul></li> <li>3. 【Visual-GPT】Technical Analysis</li> <li><ul> <li>3.1 Technical Principles</li> <li>3.2 System Architecture</li> <li>3.3 Module Descriptions</li> <li>3.4 Prompt Manager Functions and Rules</li> <li>3.5 Visual Foundation Models</li> </ul></li> <li>4. 【Visual-GPT】Installation and Usage</li> <li><ul> <li>4.1 clone the repo</li> <li>4.2 prepare the basic environments</li> <li>4.3 start local running</li> </ul></li> <li>5. 【Visual-GPT】Paper Overview</li> <li><ul> <li>5.1 Getting the Paper</li> <li>5.2 Main Contributions</li> <li>5.3 Takeaways</li> <li>5.4 Reproducing the Model</li> <li>5.5 Common Errors</li> <li>5.6 Code Walkthrough</li> </ul></li> <li>6. GPT-4 Is Here</li> </ul></li></ul> </div> <p></p> <blockquote> <p><strong>Note:</strong><br> As required, this article refers to the 【Visual ChatGPT】 model as 【Visual-GPT】 for short.<br> This is an abridged version with substantial cuts; some passages are dense and may be skipped, or of course studied closely. The full version is available via the link at the end of the article.<br> Update: the link at the end has been removed.</p> </blockquote> <h2> <a id="1_Visual_ChatGPT_10"></a>1. 【Visual ChatGPT】Arrives with a Bang</h2> <p>On March 9, Microsoft Research Asia released Visual ChatGPT, a text-and-image version of ChatGPT, and open-sourced the base code on GitHub, where it gathered 19.7k stars in just one week.</p> <p>ChatGPT, launched by OpenAI in November 2022, has gone viral worldwide within a few months, complete with registration waitlists and access hurdles in some regions. ChatGPT provides a language interface with strong conversational ability: it can chat with you, write code, fix bugs, and answer questions. What it cannot yet do is process or generate visual images.</p> <p>Visual ChatGPT plugs a series of Visual Foundation Models into ChatGPT so that users can interact with ChatGPT in both text and images, issue complex visual instructions, and have multiple models work together. Visual ChatGPT can understand and respond to both text-based and vision-based input, lowering the barrier to text-to-image models and improving the interoperability of various AI tools.</p> <p>Visual ChatGPT uses ChatGPT as its logical processing hub and integrates the Visual Foundation Models, thereby providing:</p> <ul> <li>a visual chat system that can receive and send both text and images;</li> <li>complex visual question answering and visual editing instructions for solving complex visual tasks;</li> <li>the ability to give feedback, summarize answers, and proactively ask about ambiguous instructions.</li> </ul> <p><img referrerpolicy="no-referrer" src="https://img-blog.csdnimg.cn/d60e0853c6014e368f6eaf8f540f76fe.gif#pic_center" alt="demo animation"></p> <p>With Visual-GPT you can simply type what you want in natural language. The process shown in the title figure consists of several dialogue rounds:</p> <ol> <li>The user asks for an image of a cat. Visual-GPT generates an image of a cat reading a book.</li> <li>The user asks to replace the cat in the image with a dog and to remove the book. Visual-GPT swaps the cat for a dog and deletes the book from the image.</li> <li>The user asks for Canny edge detection on the image. Visual-GPT understands and runs the Canny edge detector, producing the edge image.</li> <li>The user asks for a yellow-dog image generated from a specified web image, and Visual-GPT completes this task well too.</li> </ol> <br> <h2> <a id="2_VisualGPT_37"></a>2. 
【Visual-GPT】Worked Examples</h2> <h3> <a id="21__39"></a>2.1 Processing Pipeline</h3> <p>The basic processing pipeline of Visual-GPT is shown in the figure below.</p> <p><img referrerpolicy="no-referrer" src="https://img-blog.csdnimg.cn/8398e7da72104461907013e77b5fc029.png#pic_center" alt="figure"></p> <p>As shown, the user uploads an image of a yellow flower together with a complex language instruction: "Please generate a red flower conditioned on the depth map of this image, then turn it into a cartoon, step by step."</p> <p>The Prompt Manager in Visual-GPT controls the VFM-related processing. ChatGPT invokes these VFMs and receives their feedback iteratively until the user's requirement is met or a stop condition is reached.</p> <ul> <li>First, a depth-estimation model extracts the depth information of the image;</li> <li>then a depth-to-image model generates a red-flower image consistent with that depth information;</li> <li>finally, a Stable Diffusion-based style-transfer model converts the image into a cartoon.</li> </ul> <p>Throughout this pipeline, the Prompt Manager acts as ChatGPT's dispatching center: it supplies the visual format types, records the chain of information transformations, and finally outputs and displays the resulting image.</p> <br> <h3> <a id="22__57"></a>2.2 Worked Examples</h3> <p>Round 1:<br> Q1: The user asks a text question unrelated to any image.<br> A1: The model answers in text, unrelated to any image.<br> Q2: The user asks the model to draw an apple.<br> A2: The model replies with text and an image, drawing a picture of an apple.</p> <img referrerpolicy="no-referrer" src="https://img-blog.csdnimg.cn/19a56815dcb74c7082aabf469866956c.png" alt="chatgpt_visual13"> <br> <p>Round 2:<br> Q3: The user uploads an image: a sketch of an apple and a cup.<br> A3: The model replies in text, asking about the user's intent and proactively mentioning the sketch's file name.<br> Q4: The user asks, in text, to paint an apple and a cup following the sketch.<br> A4: The model replies with text and an image, painting the apple and cup as requested.</p> <img referrerpolicy="no-referrer" src="https://img-blog.csdnimg.cn/d49f06b989b3432b94a50e9e94c70b21.png" alt="chatgpt_visual13"> <br> <p>Round 3:<br> Q5: The user asks, in text, to change the image above to a watercolor style.<br> A5: The model replies with text and an image, converting the picture to a watercolor style as requested.<br> Q6: The user asks, in text, about the picture's background color.<br> A6: The model answers the background color in text.</p> <img referrerpolicy="no-referrer" src="https://img-blog.csdnimg.cn/e190061c236d41b3be4f0c99f60ca946.png" alt="chatgpt_visual13"> <br> <p>Round 4:<br> Q7: The user asks, in text, to remove the apple from the picture.<br> A7: The model replies with text and an image, removing the apple as requested, but leaving the apple's shadow on the table.<br> Q8: The user points out that the shadow is still on the table and asks to swap in a black table.<br> A8: The model replies with text and an image, replacing the table with a black one as requested.</p> <img referrerpolicy="no-referrer" src="https://img-blog.csdnimg.cn/dbf320a686cd403f98c872693a8f5282.png" alt="chatgpt_visual13"> <br> <h2> <a id="3_VisualGPT_97"></a>3. 
【Visual-GPT】Technical Analysis</h2> <h3> <a id="31__99"></a>3.1 Technical Principles</h3> <p>Because ChatGPT is trained on a single language modality, its ability to handle visual information is very limited. Visual Foundation Models (VFMs), on the other hand, show great potential in computer vision and can understand and generate complex images: the BLIP model is an expert at understanding images and producing descriptions of them, and Stable Diffusion can synthesize images from text prompts. However, VFMs impose strict, fixed requirements on input and output formats, which makes them far less flexible than conversational language models in human-machine interaction.</p> <p>Visual ChatGPT is trained on large text and image datasets. The model uses different visual backbones (such as VGG, ResNet, and DenseNet) to extract features from images and then combines these features with the text-based input to generate a response. It is trained with a combination of supervised and unsupervised learning techniques, which allows it to learn and adapt to new scenarios.</p> <p>When a user submits a question or statement together with an image, the system analyzes the image and extracts the relevant features, then combines those features with the text input to generate a response relevant to the query. For example, if a user uploads a picture of a car and asks "What make and model is this car?", Visual-GPT analyzes the image and answers based on the visual features extracted from it.</p> <p>Traditional chatbots rely solely on text input, which limits what they can do. Visual-GPT extends chatbot capabilities with computer vision, enabling a chatbot to understand and respond based on visual context. Another feature of Visual-GPT is its ability to produce creative responses: because it is built on top of GPT-3, it has access to vast text datasets, allowing it to generate creative and diverse replies and making interaction with it more engaging and human-like.</p> <br> <h3> <a id="32__112"></a>3.2 System Architecture</h3> <p>The system architecture of Visual-GPT is shown in the figure below. It consists of a User Query module, a Prompt Manager, the Visual Foundation Models (VFMs), the ChatGPT API invocation together with an Iterative Reasoning module, and an Outputs module.</p> <p><img referrerpolicy="no-referrer" src="https://img-blog.csdnimg.cn/f92b51081d2e44c39d2a9c18478faa6b.png#pic_center" alt="figure"></p> <p>In the figure, the left panel shows a multi-round dialogue, the middle panel is a flowchart of how Visual-GPT iteratively invokes VFMs and produces answers, and the right panel details the execution of the second Q/A pair.</p> <p>Looking at the architecture diagram, the system uses ChatGPT together with a Prompt Manager (M) for intent recognition and language understanding, and then decides the subsequent actions and outputs.</p> <p>In this example dialogue:</p> <ol> <li>Round 1: the user uploads an image (Q1), and the model acknowledges receipt (A1).</li> <li>Round 2: (1) the user asks to "replace the sofa with a table" and "change the style to watercolor" (Q2), and the model decides that VFMs are needed; (2) it judges the first request to be object replacement, so it calls the replace-object module and generates an image satisfying that request; (3) it judges the second request to be a language-driven image edit, so it calls the pix2pix module and generates an image satisfying that request; (4) having fulfilled the user's requests, it outputs the second image (A2).</li> <li>Round 3: the user asks a question (Q3); the model decides that no image generation is needed, calls the VQA module, and returns the answer (A3).</li> </ol> <p>Abstracting this process gives a set of system rules M(P) and a set of functional modules M(F).</p> <p>For a dialogue consisting of question-answer pairs <code>S = {(Q_1, A_1), (Q_2, A_2), ..., (Q_n, A_n)}</code>, obtaining the answer <code>A_i</code> of round <code>i</code> may require a series of VFMs and intermediate outputs.</p> <p>Writing the intermediate answer of the <code>j</code>-th tool invocation in round <code>i</code> as <code>A_i^j</code>, the Visual ChatGPT model is defined as:</p> <p style="text-align: center;"><code>A_i^{j+1} = ChatGPT(M(P), M(F), M(H_{&lt;i}), M(Q_i), M(R_i^{&lt;j}), M(F(A_i^j)))</code></p> <p>where P is the set of global principles, F is the set of visual foundation models, <code>M(H_{&lt;i})</code> is the dialogue history, <code>M(Q_i)</code> is the user input of round <code>i</code>, <code>M(R_i^{&lt;j})</code> is the reasoning history within round <code>i</code>, and <code>A_i^j</code> is the intermediate answer.</p> <p>ChatGPT reaches the final answer through an iterative process: it keeps querying itself and automatically invoking more VFMs. When a user instruction is unclear, Visual ChatGPT asks whether the user can provide more detail, rather than having the machine guess at, or even distort, the user's intent.</p> <br> <h3> <a id="33__143"></a>3.3 Module Descriptions</h3> <p><strong>M(P):</strong></p> <p>For the different VFMs to understand visual information and generate appropriate answers, Visual-GPT needs a set of system principles converted into prompts that ChatGPT can understand.</p> <p>By generating such prompts, the Prompt Manager helps Visual-GPT complete text- and image-generation tasks, access a pool of VFMs and freely choose which foundation model to use, stay sensitive to file names, and perform chain-of-thought and rigorous reasoning.</p> <p><strong>M(F):</strong></p> <p>The Prompt Manager must help Visual-GPT distinguish between the different VFMs so that image tasks are completed accurately.</p> <p>To this end, the Prompt Manager gives a concrete definition of each foundation model's name, usage scenario, input and output hints, and examples.</p> <p><strong>M(Q):</strong></p> <p>For each newly uploaded image, the Prompt Manager generates a unique file name and synthesizes a fake dialogue history stating that an image with that name has been received, so that queries referring to existing images can skip the file-name check.</p> <p>The Prompt Manager also appends a suffix prompt after the query to make sure the VFMs are triggered successfully, forcing Visual-GPT to think and produce substantive output.</p> <p><strong>M(F(A)):</strong></p> <p>For the intermediate outputs produced by the VFMs, the Prompt Manager generates chained file names that serve as input to the next round of internal dialogue.</p> <br> <h3> <a id="34_Prompt_Manager__169"></a>3.4 Prompt Manager Functions and Rules</h3> <p>The core of Visual-GPT is the Prompt Manager, whose functions are as follows:</p> <ol> <li>It explicitly tells ChatGPT what each VFM does and specifies its input and output formats.</li> <li>It converts different kinds of visual information (such as PNG images, depth maps, and mask matrices) into a language format.</li> <li>It handles the histories, priorities, and conflicts of the different VFMs.</li> </ol> <p>With the Prompt Manager's help, ChatGPT can leverage these VFMs and receive their feedback iteratively until the user's need is satisfied or a stop condition is reached.</p> <p>Visual-GPT is a system that integrates different VFMs to understand visual information and generate the corresponding answers. It therefore needs to establish some basic rules and convert them into instructions that ChatGPT can understand.</p> <p>These basic rules include:</p> 
<ul> <li>Visual-GPT's task scope: assist with a series of text- and vision-related tasks, such as VQA, image generation, and image editing.</li> <li>VFM accessibility: Visual ChatGPT can access a list of VFMs to solve various VL (vision-language) tasks; which foundation model to use is decided by the ChatGPT model itself.</li> <li>File-name sensitivity: a conversation may involve multiple images and their updated versions, so using precise file names is essential to avoid ambiguity; sloppy file names confuse one image with another. Visual-GPT is designed to use file names strictly, ensuring it retrieves and manipulates the correct image files.</li> <li>Chain-of-Thought: seemingly simple commands may require multiple VFMs; generating a cartoon picture, for example, involves the depth-estimation, depth-to-image, and style-transfer VFMs. Visual-GPT introduces CoT to help decide, invoke, and schedule multiple VFMs, decomposing the user's question into sub-problems in order to address more challenging requests.</li> <li>Strictness of the reasoning format: Visual-GPT must follow a strict reasoning format. The work parses intermediate reasoning results with an elaborate regular-expression matching algorithm to construct well-formed inputs for the ChatGPT model, helping it determine the next step of execution, such as triggering a new VFM or returning the final response.</li> <li>Reliability: as a language model, Visual-GPT may fabricate image file names or facts, which would make the system unreliable. To handle this, the prompts are designed to keep it faithful to the outputs of the visual foundation models and to forbid faking image content or file names. In addition, the prompts guide ChatGPT to prefer invoking VFMs over generating results from the dialogue history.</li> </ul> <p><img referrerpolicy="no-referrer" src="https://img-blog.csdnimg.cn/04cbd4fec98a4cb6a114b2ec49b462f5.png#pic_center" alt="figure"><br> <br></p> <h3> <a id="35_Visual_Foundatin_Model_193"></a>3.5 Visual Foundation Models</h3> <p>Visual-GPT supports 22 Visual Foundation Models:</p> <ul> <li>Remove Objects from Image: image path, textual what to remove -> image path</li> <li>Replace Objects from Image: image path, textual what to replace, textual what to add -> image path</li> <li>Change Image by the Text: image path, textual how to modify -> image path</li> <li>Image Question Answering: image path, question -> answer</li> <li>Image-to-Text (generate a description from an image): image path -> natural language description</li> <li>Text-to-Image (generate an image from a description): textual description -> image path</li> <li>Image-to-Edge (Canny edge detection): image path -> edge image path</li> <li>Edge-to-Image (new image from an edge map and text): edge image path, textual description -> image path</li> <li>Image-to-Line (line detection): image path -> line image path</li> <li>Line-to-Image (new image from a line map and text): line image path, textual description -> image path</li> <li>Image-to-Hed (HED edge detection): image path -> hed image path</li> <li>Hed-to-Image (new image from a HED map and text): hed image path, textual description -> image path</li> <li>Image-to-Seg (segmentation): image path -> segment image path</li> <li>Seg-to-Image (new image from a segmentation map and text): segment image path, textual description -> image path</li> <li>Image-to-Depth (depth estimation): image path -> depth image path</li> <li>Depth-to-Image (new image from a depth map and text): depth image path, textual description -> image path</li> <li>Image-to-NormalMap (normal-map estimation): image path -> norm image path</li> <li>NormalMap-to-Image (new image from a normal map and text): norm image path, textual description -> image path</li> <li>Image-to-Sketch (sketch generation): image path -> sketch image path</li> <li>Sketch-to-Image (new image from a sketch and text): sketch image path, textual description -> image path</li> <li>Image-to-Pose (pose detection): image path -> pos image path</li> <li>Pose-to-Image (new image from a pose map and text): pos image path, textual description -> image path</li> </ul> <br> <h2> <a id="4_VisualGPT_222"></a>4. 【Visual-GPT】Installation and Usage</h2> <p><strong>[This is an abridged version; the relevant content has been removed.]</strong><br> <br></p> <h3> <a id="41_clone_the_repo_225"></a>4.1 clone the repo</h3> <h3> <a id="42_prepare_the_basic_environments_226"></a>4.2 prepare the basic environments</h3> <h3> <a id="43_start_local_runing_227"></a>4.3 start local running</h3> <br> <h2> <a id="5_VisualGPT_231"></a>5. 
【Visual-GPT】Paper Overview</h2> <h3> <a id="51__233"></a>5.1 Getting the Paper</h3> <p><img referrerpolicy="no-referrer" src="https://img-blog.csdnimg.cn/2979808ebe034df19b6cae09a04e8c8f.png#pic_center" alt="figure"></p> <p>Title: Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models<br> Authors: Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, Nan Duan<br> Affiliation: Microsoft Research Asia<br> Paper: https://arxiv.org/abs/2303.04671<br> Code: https://github.com/microsoft/visual-chatgpt</p> <p>I have uploaded the paper to CSDN; readers can also download it from arXiv themselves.</p> <p>First author: Chenfei Wu, a senior researcher who joined the Natural Language Computing group at Microsoft Research Asia in 2020; his research covers multimodal pre-training, understanding, and generation.</p> <p>Corresponding author: Nan Duan, principal researcher at Microsoft Research Asia and research manager of the Natural Language Computing group, adjunct PhD supervisor at the University of Science and Technology of China and adjunct professor at Tianjin University; his research covers natural language processing, code intelligence, multimodal intelligence, and machine reasoning.</p> <br> <h3> <a id="52__252"></a>5.2 Main Contributions</h3> <p>(1) Proposes Visual ChatGPT, opening the door to connecting ChatGPT with VFMs and enabling ChatGPT to handle complex visual tasks.</p> <p>(2) Designs a Prompt Manager that involves 22 different VFMs and defines the intrinsic relations among them for better interaction and combination.</p> <p>(3) Conducts extensive zero-shot experiments and presents a large number of cases that validate Visual ChatGPT's understanding and generation abilities.</p> <br> <h3> <a id="53__262"></a>5.3 Takeaways</h3> <ul> <li>The paper opens the door for ChatGPT to handle visual tasks.</li> <li>NLP -> Natural Language PhotoShop: image creation, editing, and question answering driven by natural-language descriptions.</li> <li>Prompts built through system design and toolkit design enable unsupervised tool invocation, similar to a zero-shot Toolformer.</li> <li>ChatGPT itself is good at simulated scenarios and can work with image paths and function relations, so it can make good use of the visual foundation models.</li> <li>Visual ChatGPT is itself a language model; the so-called two-party multi-round dialogue is just a special multi-round form of human-AI interaction.</li> </ul> <br> <h3> <a id="54__272"></a>5.4 Reproducing the Model</h3> <p>Run Visual-GPT with the following steps.</p> <p>(1) Create and activate a Python 3.8 environment:</p> <pre><code># create a new environment
conda create -n visgpt python=3.8

# activate the new environment
conda activate visgpt
</code></pre> <p>(2) Install the required dependencies (see 4.2):</p> <pre><code># prepare the basic environments
pip install -r requirement.txt
</code></pre> <p>(3) clone the repo:</p> <p>[Removed]</p> <p>Cloning the repo creates the following directory structure:</p> <blockquote> <p>├── assets<br> │ ├── demo.gif<br> │ ├── demo_short.gif<br> │ └── figure.jpg<br> ├── download.sh<br> ├── LICENSE.md<br> ├── README.md<br> ├── requirement.txt<br> └── visual_chatgpt.py</p> </blockquote> <p>(4) Set the working directory:</p> <p>Set the working directory to your local copy of the GitHub repo:</p> <pre><code># enter the repo directory
%cd visual-chatgpt
</code></pre> <p>(5) Download the visual foundation models (VFMs):</p> <pre><code># download the visual foundation models
bash download.sh
</code></pre> <p>(6) Provide your OpenAI API key:</p> <p>To use the OpenAI API, visit platform.openai.com, sign up with a Google or Microsoft email account, and obtain an API key, which grants you access to the API. (Access may require a proxy in some regions.)</p> <blockquote> <p>%env OPENAI_API_KEY=sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx</p> </blockquote> <pre><code># prepare your private OpenAI key (for Linux)
export OPENAI_API_KEY={Your_Private_Openai_Key}

# prepare your private OpenAI key (for Windows)
set OPENAI_API_KEY={Your_Private_Openai_Key}
</code></pre> <p>(7) Create the image output directory:</p> <pre><code>!mkdir ./image
</code></pre> <p>(8) Run Visual GPT:</p> <pre><code>!python3.8 ./visual_chatgpt.py
</code></pre> <p><strong>Notes:</strong></p> <p>(1) The "--load" argument controls GPU/CPU allocation: it selects which VFM models to load and where to load them. See Section 3.5 for the available visual foundation models.</p> <p>For example, to load ImageCaptioning on the CPU and Text2Image on cuda:0:</p> <blockquote> <p>python visual_chatgpt.py --load ImageCaptioning_cpu,Text2Image_cuda:0</p> </blockquote> <p>(2) The VFM models require a lot of memory; the recommended settings are:</p> <ul> <li>CPU-only users: load only ImageCaptioning_cpu and Text2Image_cpu</li> <li>1x Tesla T4 (15 GB): load only ImageCaptioning_cuda:0 and Text2Image_cuda:0; ImageEditing_cuda:0 may also fit</li> <li>4x Tesla V100 (32 GB): load as follows</li> </ul> <pre><code>--load ImageCaptioning_cuda:0,ImageEditing_cuda:0,Text2Image_cuda:1,Image2Canny_cpu,CannyText2Image_cuda:1,Image2Depth_cpu,DepthText2Image_cuda:1,VisualQuestionAnswering_cuda:2,InstructPix2Pix_cuda:2,Image2Scribble_cpu,ScribbleText2Image_cuda:2,Image2Seg_cpu,SegText2Image_cuda:2,Image2Pose_cpu,PoseText2Image_cuda:2,Image2Hed_cpu,HedText2Image_cuda:3,Image2Normal_cpu,NormalText2Image_cuda:3,Image2Line_cpu,LineText2Image_cuda:3 </code></pre> <p>(3) Reference memory requirements of the different VFM models:</p> <table> <thead><tr> <th>Foundation Model</th> <th>Memory Usage (GB)</th> </tr></thead> <tbody> <tr> <td>ImageEditing</td> <td>6.5</td> 
</tr> <tr> <td>ImageCaption</td> <td>1.7</td> </tr> <tr> <td>T2I</td> <td>6.5</td> </tr> <tr> <td>canny2image</td> <td>5.4</td> </tr> <tr> <td>line2image</td> <td>6.5</td> </tr> <tr> <td>hed2image</td> <td>6.5</td> </tr> <tr> <td>scribble2image</td> <td>6.5</td> </tr> <tr> <td>pose2image</td> <td>6.5</td> </tr> <tr> <td>BLIPVQA</td> <td>2.6</td> </tr> <tr> <td>seg2image</td> <td>5.4</td> </tr> <tr> <td>depth2image</td> <td>6.5</td> </tr> <tr> <td>normal2image</td> <td>3.9</td> </tr> <tr> <td>InstructPix2Pix</td> <td>2.7</td> </tr> </tbody> </table> <br> <h3> <a id="55__395"></a>5.5 Common Errors</h3> <ol> <li> <p>RuntimeError: CUDA error: invalid device ordinal<br> Cause: fewer GPUs are available than the configuration expects.<br> Fix: replace every occurrence of cuda:\d (cuda:1, cuda:2, ...) in visual_chatgpt.py with cuda:0.</p> </li> <li> <p>OutOfMemoryError: CUDA out of memory<br> Cause: there is not enough GPU memory to run the requested VFMs.<br> Fix: skip the models you do not need in download.sh and visual_chatgpt.py, loading only the essential ones.</p> </li> </ol> <br> <h3> <a id="56__407"></a>5.6 Code Walkthrough</h3> <p><strong>Note:</strong> this section is adapted from an English-language article that the author is still working through and testing. It is reproduced here for reference only; for a fuller treatment see "Visual ChatGPT: Paper and Code Review".</p> <pre><code class="prism language-python">with gr.Column(scale=0.15, min_width=0):
    btn = gr.UploadButton("Upload", file_types=["image"])
    btn.upload(bot.run_image, [btn, state, txt], [chatbot, state, txt])

def run_image(self, image, state, txt):
    image_filename = os.path.join('image', str(uuid.uuid4())[0:8] + ".png")
    print("======>Auto Resize Image...")
    img = Image.open(image.name)
    width, height = img.size
    ratio = min(512 / width, 512 / height)
    width_new, height_new = (round(width * ratio), round(height * ratio))
    img = img.resize((width_new, height_new))
    img = img.convert('RGB')
    img.save(image_filename, "PNG")
    print(f"Resize image form {width}x{height} to {width_new}x{height_new}")
    description = self.i2t.inference(image_filename)
    Human_prompt = "\nHuman: provide a figure named {}. The description is: {}. " \
                   "This information helps you to understand this image, but you should use tools to finish following tasks, " \
                   "rather than directly imagine from my description. If you understand, say \"Received\". \n".format(image_filename, description)
    AI_prompt = "Received.  "
    self.agent.memory.buffer = self.agent.memory.buffer + Human_prompt + 'AI: ' + AI_prompt
    print("======>Current memory:\n %s" % self.agent.memory)
    state = state + [(f"![](/file={image_filename})*{image_filename}*", AI_prompt)]
    print("Outputs:", state)
    return state, state, txt + ' ' + image_filename + ' '
</code></pre> <p>As shown above, the run_image function is called once an image is uploaded. It creates a new image name from a uuid, preprocesses the image (resizing it so that neither side exceeds 512 pixels), and then appends a human turn to the conversation buffer.</p> <p>Note also that the image description is included in the initial input alongside the filename. The description is generated by the BLIP image-captioning model:</p> <pre><code class="prism language-python">class ImageCaptioning:
    def __init__(self, device):
        print("Initializing ImageCaptioning to %s" % device)
        self.device = device
        self.processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
        self.model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base").to(self.device)

self.i2t = ImageCaptioning(device="cuda:4")
</code></pre> <p>As the Human_prompt variable declared above shows, the phrase "you should use tools to finish following tasks, rather than directly imagine from my description" sets the tone: ChatGPT is instructed to rely on the VFMs instead of producing an answer on its own.</p> <pre><code class="prism language-python">Human_prompt = "\nHuman: provide a figure named {}. The description is: {}. " \
               "This information helps you to understand this image, but you should use tools to finish following tasks, " \
               "rather than directly imagine from my description. If you understand, say \"Received\". \n".format(image_filename, description)
</code></pre> <p>Beyond the prompt submitted with each image, every call also carries a prefix and a suffix that further constrain the model's behaviour. Key rules stated in the prefix include:</p> <ul> <li> <p>As a language model, Visual ChatGPT cannot read images directly, but it has a list of tools for various visual tasks. Each image is saved under a filename of the form "image/xxx.png", and Visual ChatGPT can invoke the tools to understand the image indirectly.</p> </li> <li> <p>Visual ChatGPT is strict about image filenames and never fabricates files that do not exist.</p> </li> <li> <p>Visual ChatGPT can use the tools in sequence, and it is faithful to the tools' observation outputs rather than faking image content or filenames. If a new image is created, it remembers the filename from the last tool observation.</p> </li> <li> <p>Visual ChatGPT has access to the following tools:</p> </li> </ul> <p>These statements tell Visual ChatGPT how to use the available visual tools, how to handle filenames, and how to report images generated by the VFMs back to the user.</p> <p>The agent holds a list of all the tools it may use, in this case the VFMs. Each tool carries a detailed description of its capabilities, for example:</p> <pre><code class="prism language-python">Tool(name="Generate Image From User Input Text",
     func=self.t2i.inference,
     description="useful when you want to generate an image from a user input text and save it to a file. "
                 "like: generate an image of an object or something, or generate an image that includes some objects. "
                 "The input to this tool should be a string, representing the text used to generate image. "),
</code></pre> <p>This tool wraps the text-to-image VFM. As shown, the agent is given the tool's name (summarising what the model does), the function to call, and a detailed description of the tool, its expected input, and its output.</p> <p>The agent then uses the tool descriptions and the conversation history to decide which tool to call next. The decision is made with the ReAct framework:</p> <pre><code class="prism language-python">self.agent = initialize_agent(
    self.tools,
    self.llm,
    agent="conversational-react-description",
    verbose=True,
    memory=self.memory,
    return_intermediate_steps=True,
    agent_kwargs={'prefix': VISUAL_CHATGPT_PREFIX,
                  'format_instructions': VISUAL_CHATGPT_FORMAT_INSTRUCTIONS,
                  'suffix': VISUAL_CHATGPT_SUFFIX},
)
</code></pre> <p>ReAct can be viewed as an extension of the chain-of-thought (CoT) reasoning paradigm. CoT lets the language model generate a sequence of reasoning steps toward a task, reducing the likelihood of hallucination; ReAct additionally interleaves that reasoning with actions and observations.</p> <p>To make ChatGPT respond in this format, its prompt contains the following:</p> <pre><code>VISUAL_CHATGPT_FORMAT_INSTRUCTIONS = """To use a tool, please use the following format:

Thought: Do I need to use a tool? Yes
Action: the action to take, should be one of [{tool_names}]
Action Input: the input to the action
Observation: the result of the action

When you have a response to say to the Human, or if you do not need to use a tool, you MUST use the format:

Thought: Do I need to use a tool? No
{ai_prefix}: [your response here]
"""
</code></pre> <p>Note that the output of the Thought, Action, and Observation steps is never shown to the end user. All of this is hidden so that the user is not flooded with intermediate answers that do not directly address the question.</p> <p>Instead, only the [your response here] part of the generated text is shown to the user, once the LM believes it has a final answer or wants to ask the user a question.</p> <p>Another benefit of the ReAct paradigm is that multiple tools can now be combined, because after seeing an observation ChatGPT by default reconsiders whether it needs a tool. In essence, "Thought: Do I need to use a tool?" is appended as a suffix to every query and agent response generated by the ChatGPT service.</p> <p>As the response format above shows, ChatGPT selects a tool from the available list, derives the tool's input format from the tool description seen earlier, and finally the VFM's output is parsed from the observation.</p> <p>The parsing of the action and action input can be seen in the LangChain library:</p> <pre><code class="prism language-python">def _extract_tool_and_input(self, llm_output: str) -> Optional[Tuple[str, str]]:
    if f"{self.ai_prefix}:" in llm_output:
        return self.ai_prefix, llm_output.split(f"{self.ai_prefix}:")[-1].strip()
    regex = r"Action: (.*?)[\n]*Action Input: (.*)"
    match = re.search(regex, llm_output)
    if not match:
        raise ValueError(f"Could not parse LLM output: `{llm_output}`")
    action = match.group(1)
    action_input = match.group(2)
    return action.strip(), action_input.strip(" ").strip('"')
</code></pre> <p>Once the tool to use and the input to provide have been extracted, a call is made to execute the tool.</p> <p>Each model's output is saved under a filename of the form:</p> <blockquote> <p>{Name}_{Operation}_{Previous Name}_{Original Name}.</p> </blockquote> <p>Here the name is a fresh uuid, the operation is the tool's name, the previous name is the uuid of the input image used to create the new one, and the original name is that of the image originally supplied by the user. With this naming convention, ChatGPT can easily derive the provenance of a newly generated image.</p> <pre><code class="prism language-python">def get_new_image_name(org_img_name, func_name="update"):
    head_tail = os.path.split(org_img_name)
    head = head_tail[0]
    tail = head_tail[1]
    name_split = tail.split('.')[0].split('_')
    this_new_uuid = str(uuid.uuid4())[0:4]
    if len(name_split) == 1:
        most_org_file_name = name_split[0]
        recent_prev_file_name = name_split[0]
        new_file_name = '{}_{}_{}_{}.png'.format(this_new_uuid, func_name, recent_prev_file_name, most_org_file_name)
    else:
        assert len(name_split) == 4
        most_org_file_name = name_split[3]
        recent_prev_file_name = name_split[0]
        new_file_name = '{}_{}_{}_{}.png'.format(this_new_uuid, func_name, recent_prev_file_name, most_org_file_name)
    return os.path.join(head, new_file_name)
</code></pre> <p>Finally, putting all the moving parts together yields a conversation with Visual ChatGPT that can make use of visual information.</p> <p>This work is a perfect illustration of the importance of prompt engineering. Prompts allow the agent to handle visual information through filenames, and create a Thought -> Action -> Observation loop that determines which VFMs to use and how to process their outputs.</p> <p>To hide the complex nature of the solution, the intermediate responses (the Thought, Action, and Observation utterances) are kept from the user; only the final LM-generated response, produced once ChatGPT concludes that no further VFM call is needed, is displayed.</p> <br> <h2> <a id="6_GPT4__576"></a>6. 
GPT-4 Has Arrived</h2> <p>Just as this article was being finished, news broke that GPT-4 had been released. Moreover, GPT-4 accepts images as an input medium, so it too can now process images.</p> <p>Below is an example provided by OpenAI of GPT-4 answering a question about an image input.</p> <p><img referrerpolicy="no-referrer" src="https://img-blog.csdnimg.cn/9a6a6d270f324cf9b270e0dd97a2386c.png#pic_center" alt="Example of GPT-4 answering a question about an input image"></p> <p>Since access to image input has not yet been opened to the public, the technical principles and capabilities of GPT-4's image processing remain unclear, so we will leave GPT-4 for a later article.</p> <p>What can be expected, however, is that this tide of progress is surging forward, unstoppable.</p> <br> <p><strong>Copyright notice:</strong><br> Follow 【youcans的AGI学习笔记】; when reposting, please cite the original link: 【AIGC】Visual ChatGPT 视觉模型深度解析 (https://youcans.blog.csdn.net/article/details/129546888)<br> Copyright 2023 youcans, XUPT<br> Created: 2023-03-15</p> <br> </div> <link href="https://csdnimg.cn/release/blogv2/dist/mdeditor/css/editerView/markdown_views-0407448025.css" rel="stylesheet"> <link href="https://csdnimg.cn/release/blogv2/dist/mdeditor/css/style-c216769e99.css" rel="stylesheet"> </div> <div id="treeSkill"></div> </article>
【AIGC】Visual ChatGPT 视觉模型深度解析
Author: sockstack
License: CC BY 4.0
Published: 2023-11-06
Last modified: 2024-12-20