Tagging Data in IRIS Using Embedded Python and the OpenAI API


Jingwei Wang · February 15, 2024 · about a 4-minute read

#Embedded Python #Artificial Intelligence (AI) #API #ObjectScript #Python #Analytics #Unstructured Data #InterSystems IRIS

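The invention and spread of large language models (LLMs) such as OpenAI's GPT-4 have set off a wave of innovative solutions that make use of large volumes of unstructured data which, until now, were impractical or even impossible to process by hand. Such applications include data retrieval (see Don Woodlock's ML301 course for a great introduction to retrieval-augmented generation), sentiment analysis, and even fully autonomous AI agents.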

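In this article, I want to demonstrate how IRIS's Embedded Python feature can interact directly with the Python OpenAI library, by building a simple data tagging application that automatically assigns keywords to records as we insert them into an IRIS table. These keywords can then be used to search and categorize the data, as well as for data analytics. I will use customer reviews of products as the example use case.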

Prerequisites

A running IRIS instance
An OpenAI API key (you can create one on the OpenAI platform site)
A configured development environment (this article uses VS Code)

The Review class

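Let's start by creating an ObjectScript class that defines the data model for customer reviews. To keep things simple, we will define just four %String properties: the customer's name, the product name, the review body, and the keywords we will generate. The class should extend %Persistent so that we can save its objects to disk.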

Class DataTagging.Review Extends %Persistent
{
Property Name As %String(MAXLEN = 50) [ Required ];
Property Product As %String(MAXLEN = 50) [ Required ];
Property ReviewBody As %String(MAXLEN = 300) [ Required ];
Property Keywords As %String(MAXLEN = 300) [ SqlComputed, SqlComputeOnChange = ReviewBody ];
}

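Because we want the Keywords property to be computed automatically whenever the ReviewBody property is inserted or updated, I marked it as SqlComputed with SqlComputeOnChange = ReviewBody. The InterSystems documentation on computed values covers this in more detail.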

The `KeywordsComputation` method

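We now want to define a method that computes Keywords from ReviewBody. Naming it KeywordsComputation, after the property, is what ties it to the computed Keywords property. With Embedded Python we can interact directly with the official openai Python package, but first we need to install it. To do so, run the following shell command: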

 <your-IRIS-installation-path>/bin/irispip install --target <your-IRIS-installation-path>/Mgr/python openai

We can now use OpenAI's Chat Completions API to generate the keywords:

ClassMethod KeywordsComputation(cols As %Library.PropertyHelper) As %String [ Language = python ]
{
    '''
    This method is used to compute the value of the Keywords property
    by calling the OpenAI API to generate a list of keywords based on the review body.
    '''
    from openai import OpenAI

    client = OpenAI(
        # Defaults to os.environ.get("OPENAI_API_KEY")
        api_key="<your-api-key>",
    )

    # Set the prompt; use few-shot learning to give examples of the desired output
    user_prompt = "Generate a list of keywords that summarize the content of a customer review of a product. " \
                + "Output a JSON array of strings.\n\n" \
                + "Excellent watch. I got the blue version and love the color. The battery life could've been better though.\n\nKeywords:\n" \
                + "[\"Color\", \"Battery\"]\n\n" \
                + "Ordered the shoes. The delivery was quick and the quality of the material is terrific!.\n\nKeywords:\n" \
                + "[\"Delivery\", \"Quality\", \"Material\"]\n\n" \
                + cols.getfield("ReviewBody") + "\n\nKeywords:"
    # Call the OpenAI API to generate the keywords
    chat_completion = client.chat.completions.create(
        model="gpt-4",  # Change this to use a different model
        messages=[
            {
                "role": "user",
                "content": user_prompt
            }
        ],
        temperature=0.5,  # Controls how "creative" the model is
        max_tokens=1024,  # Controls the maximum number of tokens to generate
    )

    # Return the array of keywords as a JSON string
    return chat_completion.choices[0].message.content
}
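
The comment in the method above notes that the client falls back to the OPENAI_API_KEY environment variable when no key is passed. As a minimal sketch of how to avoid hard-coding the key in the class source, a small helper can read that variable and fail with a clear message when it is missing. The helper name `resolve_api_key` is hypothetical, and the sketch assumes the variable is set in the environment of the IRIS process:

```python
import os


def resolve_api_key(env=os.environ):
    """Return the OpenAI API key from the environment, or raise a clear error.

    `resolve_api_key` is a hypothetical helper name; `env` is injectable
    so the function can be exercised without touching the real environment.
    """
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set for the IRIS process")
    return key
```

Inside the method, the client could then be created with `OpenAI(api_key=resolve_api_key())` instead of a literal key.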

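Note that in the prompt I first give general instructions for what I want GPT-4 to do ("Generate a list of keywords that summarize the content of a customer review of a product"), followed by two example inputs along with their desired outputs. I then insert cols.getfield("ReviewBody") and end the prompt with the word "Keywords:", nudging the model to complete the text with keywords in the same format as the examples I gave. This is a simple example of the few-shot prompting technique.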
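
The prompt structure described above (instruction, worked examples, then the new review ending in "Keywords:") can also be assembled from a list of examples, which makes it easier to add or swap examples later. This is only a sketch: the names `INSTRUCTION`, `EXAMPLES`, and `build_prompt` are hypothetical, and the texts are copied from the hand-written prompt in the method.

```python
import json

INSTRUCTION = (
    "Generate a list of keywords that summarize the content of a customer "
    "review of a product. Output a JSON array of strings."
)

# The two worked examples from the prompt: (review text, desired keywords)
EXAMPLES = [
    ("Excellent watch. I got the blue version and love the color. "
     "The battery life could've been better though.", ["Color", "Battery"]),
    ("Ordered the shoes. The delivery was quick and the quality of the "
     "material is terrific!.", ["Delivery", "Quality", "Material"]),
]


def build_prompt(review_body, examples=EXAMPLES, instruction=INSTRUCTION):
    """Assemble a few-shot prompt: instruction, examples, then the new review."""
    parts = [instruction]
    for text, keywords in examples:
        # Each example shows the input followed by its desired JSON output
        parts.append(f"{text}\n\nKeywords:\n{json.dumps(keywords)}")
    # The new review ends with an open "Keywords:" for the model to complete
    parts.append(f"{review_body}\n\nKeywords:")
    return "\n\n".join(parts)
```

In the method, `user_prompt = build_prompt(cols.getfield("ReviewBody"))` would then replace the chained string concatenation.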

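For simplicity of the demonstration I chose to store the keywords as a JSON string; a better way to store them in production might be a DynamicArray, but I'll leave that as an exercise for the reader.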
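
Because the value is stored exactly as the model returned it, nothing guarantees that it is valid JSON. A minimal sketch of a defensive parser is shown here; the name `parse_keywords` is hypothetical, and returning an empty list on malformed output is just one possible policy:

```python
import json


def parse_keywords(raw):
    """Parse the model's reply into a list of keyword strings.

    Returns an empty list if the reply is not a JSON array.
    """
    try:
        data = json.loads(raw)
    except (TypeError, ValueError):
        # Not a string, or not valid JSON (JSONDecodeError is a ValueError)
        return []
    if not isinstance(data, list):
        return []
    # Keep only non-empty strings, trimmed of surrounding whitespace
    return [item.strip() for item in data if isinstance(item, str) and item.strip()]
```

The method could return `json.dumps(parse_keywords(reply))` so that the Keywords column always holds a well-formed JSON array.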

Keyword generation

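Now we can test our data tagging application by inserting a row into the table with the following SQL statement, run through the Management Portal: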

INSERT INTO DataTagging.Review (Name, Product, ReviewBody)
VALUES ('Ivan', 'BMW 330i', 'Solid car overall. Had some engine problems but got everything fixed under the warranty.')

The insert automatically generated four keywords for us. Well done!
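
Once several reviews have been tagged, the stored JSON strings can feed the kind of analysis mentioned in the introduction. As a rough sketch, the function below counts how often each keyword appears across a list of Keywords values. The name `keyword_counts` and the sample values are made up for illustration and are not output from the application:

```python
import json
from collections import Counter


def keyword_counts(keyword_columns):
    """Count keyword occurrences across stored Keywords values (JSON strings)."""
    counts = Counter()
    for raw in keyword_columns:
        try:
            keywords = json.loads(raw)
        except (TypeError, ValueError):
            continue  # skip rows whose Keywords value is not valid JSON
        if not isinstance(keywords, list):
            continue
        # Normalize case so "Battery" and "battery" count as the same keyword
        counts.update(k.strip().lower() for k in keywords if isinstance(k, str))
    return counts


# Hypothetical Keywords values, as they might be read from DataTagging.Review
sample = [
    '["Color", "Battery"]',
    '["Delivery", "Quality", "Material"]',
    '["Battery", "Quality"]',
]
counts = keyword_counts(sample)
```

With these sample values, `counts["battery"]` and `counts["quality"]` are both 2, and the other keywords each appear once.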

Conclusion

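In conclusion, the ability of InterSystems IRIS to embed Python opens up a wide range of possibilities for working with unstructured data. Using OpenAI for automated data tagging is just one example of what this capability makes possible. It can reduce human error and improve overall efficiency.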

See the original post by @Maxim Gorshkov
