大语言模型吧 关注:51贴子:119
  • 2回复贴,共1

让agent在聊天过程中检测意图并生成图片

只看楼主收藏回复

虽然gpt4.1接口表现非常棒,但纯文字的聊天总感觉缺少点什么……最近捣鼓出了 聊天过程中生成图片的功能,来分享一下!

我期望的是:能根据用户当前对话,识别用户生图意愿,然后自动调用生图函数,返回结果。再把返回的结果 发给聊天agent,“假装”图片是自己为用户生成的。就继承了连贯和真实感~
1.生成图片
其实,gpt本身的生图操作,其实是很简单的。python代码如下:
import openai
client = openai.OpenAI()

此方法,会默认返回图片url数组(n=1 则只生成一张图片),可以通过“image_url = [img.url for img in response.data]” 来获取。

2.修改图片
然后,为了能在聊天过程中修改图片。或者根据对应风格重绘,我尝试了修改图片的写法:

需要传一个二进制图片流,和mask(模板),表示可以修改的图片位置。在这里……为了兼容修改/重绘,mask默认了全部图片内容皆可修改。
注意:image_bytes 并非图片url,可以先获取公网url,再进行转换得来:
import httpx
from io import BytesIO

User-agent 可以打开自己浏览器的f12,找到对应字段复制粘贴即可。
因为本人开发gpt接口对接QQ机器人,故Referer为QQ地址,方便获取qq图片url
然后,已经得到二进制图片流,可以通过如下方式,得到mask:

自此,调用gpt 创建/修改 图片的代码已经完成。

3.在什么时候创建图片?
理想情况下,应该是判断用户“生图意愿”,自动调用创建图片函数。这里……传统的固定逻辑编程,显然无法满足这个需求。此时,需要使用agent来进行智能判断!
可以创建一个CreateImage Agent。核心提示词:判断用户生图意愿,返回final_output为“无” 或“优化后的提示词” 。
在每次把用户对话 发送给聊天agent前,先发送给CreateImage agent,判断其返回值是否为“无” 且长度<5(为防止偶然的返回空字符串或False 等字符)。
由程序自行 根据返回值 调用生图函数。
个人做了一点小小的优化:当调用生图函数时,当前prompt+"【extra】: 本次消息包含图片内容,为上游生图模型创作,假装是自己画的,然后重点响应图片相关话题"
没有调用生图函数,但用户有发过来图片时,当前prompt+"【extra】: 本次消息包含图片内容,为用户发你的图片,重点响应图片相关话题"
个人的提示词部分原文如下(因为考虑到agent 更契合英语提示词,所以目前的提示词 都会让AI帮我翻译成英语,更精准表达,让agent更加服从指令):“Note: Image generation consumes significantly higher resources, so directly return '无' when no explicit creation intent is detected.
[OUTPUT RULE]
- If no image‑generation intent → output exactly: 无
- If intent is clear → output only the refined English prompt (one line, no extra characters).
Do **not** add code, explanations, JSON, or line breaks.
[STRICTNESS]
Any deviation from the two allowed outputs will break downstream logic.
Follow these steps every turn.”

4.几点经验和思考
4.1 原本想过,直接用agent调用function tool来实现生图,agent自己返回 生成后的图片url。但实际测试过……agent在返回url时,会出现很高的错误率!往往不能精准的返回一长串url,故采用 判断生图意愿,由程序判断agent返回值,再自行调用的方式。
4.2 其实,图片url本身传给gpt 并不会消耗更多token,会直接以文字长度计数。所以……实际情况下,发送图片url给gpt是非常划算的事情!
所以,我个人的方式是:生图之后,把图片url+用户prompt 一起发送给聊天agent,让聊天agent“假装”是自己生成的图片,再来生成文字响应内容。图片url发给gpt,几乎不会占用更多的token。
将url发送给gpt的代码如下:
from agents import Agent, Runner

4.3 生图成本较高!比单纯文字要贵好多。dell-E-2一张图$0.02,特别是dell-E-3 每次生图$0.08 高质量$0.12 。大概5~7毛钱一张图了。还不如网页版的gpt-image-1 效果好。而gpt-image-1 需要授权的组织才可以使用,个人还无法调用,只能网页版使用…… 所以,,目前玩玩就好,聊天中自动生成图片,更多的是图一乐,增加聊天过程中的情趣。而很难真用于绘画和生产。如果感兴趣或者有高技术的大佬,可以试试调用stable diffution接口。
所以,最推荐的反倒是冲个plus 白嫖网页版的gpt-image-1.可以无限生图,质量还非常高!!!


IP属地:广东1楼2025-05-15 21:59回复
    啊对,在研究如何优化 生图prompt的时候,开o3深度研究,整合了一堆 针对二次元图片的提示词技巧。这里发一下:
    Anime-Style Prompt Engineering Template for Image Generation
    Anime-style image generation prompts benefit from a clear structure and vivid detail. The goal is to turn casual or emotional ideas into concise, visually specific English prompts that evoke a 2D anime (二次元) aesthetic. Below is a comprehensive guide covering prompt format, useful components, examples, and cautionary tips.
    Prompt Format
    A reusable prompt template for anime-style images should include key elements in order. This structure ensures the model understands the subject and the anime-inspired style you want:
    Subject – Who or what is in the scene. Specify character (e.g. “a young anime girl”), creature, or object, including distinctive features (hair color, clothing, etc).
    Action/Pose – What the subject is doing. Use vivid verbs or poses (e.g. “sitting under a tree”, “waving hello”, “in mid-jump with a sword drawn”).
    Setting/Background – Where the scene takes place. Describe the environment or backdrop (e.g. “in a bustling Tokyo street”, “under pink cherry blossom trees”, “inside a medieval castle hall”).
    Mood/Emotion – The atmosphere or feelings conveyed. Include emotional tone if relevant (e.g. “romantic and calm”, “bright and cheerful”, “dark and haunting mood”).
    Art Style Tags – Tags for anime style and technique. Mention anime or manga style and any specifics like “cel-shaded, vibrant colors, dynamic composition”cursor-ide.com. You can add medium or style notes such as “digital painting”, “90s anime aesthetic”, “Studio Ghibli-inspired”, etc.
    Lighting/Perspective (optional) – Any notable lighting, camera angle, or perspective details. For example, “soft warm lighting at sunset”, “dramatic shadows from below”, or “close-up portrait view” to enhance the visual mood.
    Chinese Text (optional) – If you need visible Chinese text in the image, explicitly mention it. For instance, “with a sign in the background that reads ‘学校’ (Chinese for 'school')” or “holding a letter with the Chinese characters ‘我爱你’ on it”. Keep it short and specify the context (sign, poster, label, etc.) so the model tries to render the text on that object. (Note: Many models struggle with exact text; providing the exact characters and context improves the chances.)
    Using this format, an example prompt template might look like:
    [Subject], [action/pose], [setting], [mood], [anime-style tags], [lighting/perspective].


    IP属地:广东2楼2025-05-15 22:05
    回复
      2025-08-07 01:07:40
      广告
      不感兴趣
      开通SVIP免广告
      You can omit or rearrange elements as needed, but including subject, setting, style, and mood will generally yield a more precise anime-style image.
      Prompt Components: Descriptors & Keywords for Anime Aesthetics
      Choosing the right descriptors is vital for anime-style prompts. Below is a list of effective English keywords grouped by category, which you can mix and match in your prompts:
      Characters & Subjects – schoolgirl, young samurai warrior, futuristic cyborg, magical girl, chibi cat-girl, giant mecha robot. (Describe the character type or role in vivid terms.)
      Appearance & Clothing – long flowing pink hair, nekomimi (cat ears), bright blue eyes, wearing a Japanese school uniform, ornate kimono with cherry blossom patterns, silver futuristic armor. (Details that make the character visually distinct.)
      Expressions & Emotions – smiling warmly, blushing cheeks, teary-eyed, determined glare, joyful laughing expression, shyly looking away. (Facial expressions or body language to convey emotion.)
      Background & Setting – under blooming sakura trees, in a bustling night city with neon signs, atop a grassy hill at sunrise, inside a cozy anime-style classroom, floating in a cosmic galaxy with stars. (Environments to set the scene, often with a touch of anime fantasy or realism.)
      Lighting & Atmosphere – soft diffused lighting, golden hour sunlight, moody blue moonlight, glowing neon lights, dramatic high-contrast shadows, ethereal backlighting. (Lighting keywords create atmosphere: e.g. warm and gentle vs. dark and dramatic.)
      Art Style & Medium – anime style illustration, cel-shaded, vibrant color palette, manga-style line art, watercolor anime painting, 2D digital art, Japanese animation aestheticcursor-ide.com. (Terms that enforce an anime or manga look. You can include specific era or studio styles, like “90s retro anime style” or “Makoto Shinkai-inspired lighting”.)
      Perspective & Composition – dynamic low-angle view, close-up portrait, wide-angle shot, bird’s-eye view from above, over-the-shoulder perspective, cinematic framing. (These guide the camera angle or composition, making the scene more visually interesting.)
      By combining these descriptors, you can paint a clear picture for the AI. For example, instead of saying “a girl in a nice setting”, you might say “a smiling anime girl with long black hair holding a red umbrella in the rainy city at night, neon signs reflecting in the puddles”. The bold parts illustrate subject, action, and setting with vivid detail.
      Examples: From Chinese Prompts to Optimized English Prompts
      Below are several examples showing how a casual Chinese prompt or idea can be converted into a concise, vivid English prompt for anime-style image generation. Each example preserves the original intent while adding visual specifics and anime flair:
      Chinese Prompt (romantic/casual): “樱花树下牵手的情侣,浪漫温馨的感觉。”
      Optimized English: “Anime-style illustration of a young couple holding hands under cherry blossom trees at dusk, petals gently falling around them. The scene is romantic and warm, with soft golden sunset light and a serene atmosphere.”
      Chinese Prompt (emotional/sad): “一个落寞的男孩在雨中哭泣,表现出悲伤的情绪。”
      Optimized English: “An anime boy standing alone in the rain at night, head down and crying with tears on his face. The streets are wet and empty, city lights blurred in the background. The lighting is dim and blue, emphasizing a sad and lonely mood.”
      Chinese Prompt (including text): “女孩拿着一封写有‘爱’字的情书。”
      Optimized English: “Anime-style scene of a shy girl holding a love letter close to her chest. The envelope has the Chinese character ‘爱’ (love) written on it in elegant ink. She stands in a sunny school courtyard with petals drifting by, blushing with a hopeful smile. Soft, warm lighting enhances the gentle, romantic feeling.”
      (This prompt explicitly asks for a visible Chinese character on the letter by mentioning the exact character and context.)
      Chinese Prompt (action/fantasy): “身穿机甲作战的女战士,场面炫酷。”
      Optimized English: “A female mecha warrior in sleek futuristic armor engaged in battle. She is leaping forward with energy blades drawn, against a backdrop of exploding circuitry and sparks. The composition is dynamic and intense, with vibrant sci-fi anime style and dramatic lighting illuminating the scene.”
      Chinese Prompt (cute/chibi style): “一只开心的猫耳少女,卡通可爱风格。”
      Optimized English: “A cute chibi cat-girl (anime girl with cat ears) jumping with joy. She has big sparkling eyes and a wide grin, wearing a frilly pastel dress. The art style is bright and kawaii anime, with soft pastel colors and a simple heart-pattern background to match the cheerful, playful mood.”
      Each optimized prompt captures the essence of the original request but adds concrete visual details: who/what is present, what they look like, what they are doing, and the overall style/mood. Notice the use of specific nouns and adjectives (e.g. cherry blossom trees, rainy city, mecha armor) and anime-related style cues (anime-style illustration, chibi, vibrant colors) to guide the AI clearly.
      Edge Cases & Cautions
      When crafting anime-style prompts, keep these cautions in mind to avoid common pitfalls:
      Avoid ambiguous or vague descriptions. Generic words like “beautiful” or “nice” alone don’t guide the AI. Be specific about what makes a scene beautiful. For instance, instead of a vague prompt like “a beautiful landscape”, specify “a misty highland valley at dawn with rolling hills” for claritycursor-ide.com. Precise detail yields better results than abstract praise.
      Don’t rely on emotion words without visual context. AI image models can’t draw “sadness” or “happiness” as abstract concepts. Always pair emotions with concrete imagery. For example, “create happiness” is too abstract, but “a joyful scene of friends laughing together under colorful confetti” gives a visual way to show happinesscursor-ide.com. If you want a mood (romantic, melancholic, etc.), describe how that mood looks (lighting, expressions, setting).
      Avoid overloading or conflicting details. Too many unrelated elements can confuse the model. Stick to a clear theme – for instance, don’t mix “futuristic city” and “medieval castle” in one prompt unless intentionally blending genres. Also be careful not to mix incompatible style cues (e.g. asking for a “photorealistic anime” style combines realism and cartoon, which can be confusingcursor-ide.com). Pick one consistent art style for the best results.
      Be consistent with anime-related terms. If you want an anime look, include words like “anime,” “manga,” or specific anime art techniques (cel-shading, etc.). If you omit these, a general model might default to a more realistic style. In our examples, explicitly saying “anime-style illustration” or “chibi style” ensures the output stays in the 2D anime realm.
      Use Chinese text sparingly and clearly. When requesting Chinese characters in the image, less is more. Models may jumble long text, so stick to a short word or phrase and spell it out in quotes. Always mention the context (e.g. on a sign, on a letter, as a subtitle in the image). This increases the chance the text appears correctly. If the model fails to render exact text, consider simplifying or trying a different approach (like writing the pinyin or describing a symbol) as a last resort.
      Mind model limitations and content rules. Each image generator has rules (e.g. disallowing explicit content or certain trademarks). Keep prompts wholesome or stylistically descriptive rather than overly graphic. For example, “kissing on the cheek” is usually fine, but extremely violent or adult themes might be filtered out by models like DALL·E. Sticking to PG-13 anime themes ensures your prompt won’t hit a content warning.
      By following this template and tips, you can turn any casual idea or Chinese prompt into a compelling English prompt tailored for anime-style image generation. The key is to paint a picture with words – specify the who, what, where, and mood in an anime context. With practice, your prompts will consistently produce vivid 2D visuals that match the style and emotion you envision. Happy prompting!
      Sources: Best practices adapted from prompt engineering guidescursor-ide.comcursor-ide.comcursor-ide.com and community experience in anime image generation.


      IP属地:广东3楼2025-05-15 22:06
      回复