Ask HN: How do you programmatically evaluate whether an LLM's output sounds "too AI"?
1 point • by shubhamoriginx • 3 days ago
Hi HN,

I'm currently building Aaptics, a tool that helps founders draft content. The biggest engineering challenge hasn't been the infrastructure; it's getting the underlying models to stop sounding like a corporate robot (e.g., keeping them from using words like "delve" or "testament", or phrases like "in today's fast-paced landscape").

Right now, my pipeline uses a custom RAG setup that ingests a user's past writing, combined with heavy negative prompting and few-shot examples. Even so, the model still occasionally slips into that recognizable "ChatGPT tone."

For those of you building AI applications, how are you quantitatively evaluating the "humanness" of your outputs?

Are you using LLM-as-a-judge frameworks?

Relying on specific temperature/top_p tuning?

Hardcoding penalties for certain n-grams?

I'm aiming to finalize this pipeline before our mid-April launch and would appreciate any insights from folks who have solved this in production. aaptics.in/waitlist
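For concreteness, the hardcoded-penalty option could look something like this minimal sketch: score a draft by banned-phrase hits per 100 words and regenerate when it exceeds a threshold. The phrase list, function names, and threshold are all placeholders, not a production blocklist.

```python
import re

# Illustrative blocklist of "AI-tell" words and phrases (not exhaustive).
AI_TELLS = [
    "delve",
    "testament",
    "in today's fast-paced landscape",
    "it's important to note",
]

def ai_tell_score(text: str) -> float:
    """Banned-phrase hits per 100 words; higher means more 'ChatGPT tone'."""
    lowered = text.lower()
    hits = sum(len(re.findall(re.escape(p), lowered)) for p in AI_TELLS)
    words = max(len(text.split()), 1)
    return 100.0 * hits / words

def passes_gate(text: str, threshold: float = 0.5) -> bool:
    """Gate a generation before accepting it; retry the model if it fails."""
    return ai_tell_score(text) < threshold
```

A crude lexical gate like this obviously misses tone-level problems, which is why I'm also curious about LLM-as-a-judge approaches for the cases a blocklist can't catch.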