Bo's Oracle Station

【博客文章2025】Oracle Database 23ai:AI Vector Search笔记与实验5---在数据库中存储嵌入向量时用到的UTL_TO_EMBEDDINGS与JSON

2025-1-23 15:09| 发布者: admin| 查看: 75| 评论: 0|原作者: Bo Tang

摘要: 本博客介绍了在数据库中存储嵌入式向量时使用的UTL_TO_EMBEDDINGS和JSON知识。UTL_TO_EMBEDDINGS用于从文本输入中自动生成一个或多个嵌入向量。详细介绍了JSON格式的UTL_TO_EMBEDINGS的参数,分析了UTL_TO_EMBEDINGS生成的JSON_TABLE,并对TO_VECTOR进行了研究。 由于Oracle Database 23.5不支持将本地REST Ollama作为JSON参数“provider”的值,本文最后对23.6进行了思考。
【博客文章2025】Oracle Database 23ai:AI Vector Search笔记与实验5---在数据库中存储嵌入向量时用到的UTL_TO_EMBEDDINGS与JSON


Author: Bo Tang

1. 从文本文档自动生成一个或多个嵌入向量:

    如同实验环境https://www.botangdb.com/mytec/mytec_basicadmin/202501/00900135.html的操作:

insert into t1chunks
 select t.id doc_id, c.embed_id chunk_id, c.embed_data chunk_data,
 to_vector(c.embed_vector) chunk_embedding
from t1 t,
     dbms_vector_chain.utl_to_embeddings(
                                                                  dbms_vector_chain.utl_to_chunks(dbms_vector_chain.utl_to_text(t.data),
                                                                                                                       json('{"normalize":"all"}')
                                                                                                                       ),
                                                                  json('{"provider":"database", "model":"ALL_MINILM_L12_V2"}')
                                                                  ) a,
    json_table(a.column_value,
                   '$[*]' columns(embed_id number path '$.embed_id',
                                         embed_data varchar2(4000) path '$.embed_data',
                                         embed_vector clob path '$.embed_vector')
                   ) c;

commit;

6,527 rows inserted.
Commit complete.

    首先插入数据(使用insert语句从t1表的data列读取PDF文档,转换这些PDF文档为文本,然后将文本分块成数据chunks,最后为每一个数据chunk使用数据库内置语言模型生成嵌入向量。这些事情都由上面这条语句完成。dbms_vector_chain包中的utl_to_text将PDF数据转换成文本;utl_to_chunks将文本分块;utl_to_embeddings为每一个文本分块生成嵌入向量。
    DBMS_VECTOR_CHAIN.UTL_TO_EMBEDDINGS的语法结构是这样的(返回的是VECTOR_ARRAY_T类型,即转换1组文本chunks为1组嵌入向量阵列,输入的每一个文本chunk长度的最大值为4000字节):

DBMS_VECTOR_CHAIN.UTL_TO_EMBEDDINGS (
 DATA IN VECTOR_ARRAY_T,
 PARAMS IN JSON default NULL
) return VECTOR_ARRAY_T;


    如上,代入我们的数据,就是这样:

......
     dbms_vector_chain.utl_to_embeddings(
                                                                  dbms_vector_chain.utl_to_chunks(dbms_vector_chain.utl_to_text(t.data),
                                                                                                                       json('{"normalize":"all"}')
                                                                                                                       ),
                                                                  json('{"provider":"database", "model":"ALL_MINILM_L12_V2"}')
                                                                  ) a,
......

    附:类似的
DBMS_VECTOR_CHAIN.UTL_TO_EMBEDDING的语法结构是这样的(返回的是VECTOR类型,即转换文本为单个嵌入向量,输入的文本长度最大值为4000字节):

DBMS_VECTOR_CHAIN.UTL_TO_EMBEDDING (
 DATA IN CLOB,
 PARAMS IN JSON default NULL
) return VECTOR;


2. JSON格式的UTL_TO_EMBEDINGS的PARAMS:

    dbms_vector_chain.utl_to_embeddings/dbms_vector_chain.utl_to_embedding接受2个形式参数:DATA和PARAMS。其中DATA这个参数比较简单,就是CLOB或VARCHAR2型的输,这些输入将被用于计算嵌入向量。
    我们重点讨论PARAMS这个JSON型的参数,PARAMS以JSON格式指定输入参数:
    Oracle数据库作为嵌入向量服务提供商(默认设置)情况下,provider”和“model”这两个JSON参数是指:用加载到数据库中的ONNX嵌入向量模型来生成嵌入向量。要该设置生效,必须先将ONNX格式的嵌入模型加载到数据库中。不是随便一个ONNX的模型都可以用于这项工作,这篇文章提供了这样一个小型化的(22.57M参数bert架构)可用的ONNX模型。然而,网上提供最多的ONNX模型下载的网站http://github.com/onnx/models的embedding模型都是不能直接用于加载Oracle Database 23ai到库内的。因此,如果您想要使用一个中文的嵌入模型,比如bge-large-zh-v1.5,那么您自己需要将其他格式的bge-large-zh-v1.5模型进行复杂的转换,变成ONNX格式,才能使用。

{
 "provider" : "database",
 "model" : "value"
}


    第三方嵌入模型作为嵌入向量服务提供商情况下,provider”和“model”这两个JSON参数是指:此API对您选择的远程服务提供商(Cohere,huggingface、googleai、ocigenai、openai或vertexai)或本地服务供应商(Ollama,注:Oracle Database 23.6才支持)进行REST API调用。
    UTL_TO_EMBEDDING的语法:

{
 "provider" : "value",
 "credential_name" :  "value",
 "url" :  "value",
 "model" :  "value",
 "transfer_timeout": "value" ,
 "其他RSET参数":  "value"
}

    UTL_TO_EMBEDDINGS的语法:

{
 "provider" : "value",
 "credential_name" : "value",
 "url" :  "value",
 "model" : "value",
  "transfer_timeout":  "value",
"batch_size" : "value",
 "其他RSET参数": "value"
}

3.分析UTL_TO_EMBEDINGS生成的JSON_TABLE:

    dbms_vector_chain.utl_to_embeddings的输出(JSON格式)

{
 "embed_id" : NUMBER,
 "embed_data" : "VARCHAR2(4000)",
 "embed_vector": "CLOB"
}

    embed_id指生成向量阵列中的每一份向量的ID。embed_data指输入的文本。embed_vector指输出的向量。
   为了获取生成的嵌入向量用来插入t1chunks表,需要使用JSON_TABLE运算符在内存中映射embed_id、embed_data和embed_vector为表结构:

......
    json_table(a.column_value,
                   '$[*]' columns(embed_id number path '$.embed_id',
                                         embed_data varchar2(4000) path '$.embed_data',
                                         embed_vector clob path '$.embed_vector')
                   ) c;
......

    可以使用JSON_TABLE查看:

SELECT t.id doc_id,
    embed_id,  
    embed_data,
    embed_vector
 FROM t1 t,
  dbms_vector_chain.utl_to_embeddings(
     dbms_vector_chain.utl_to_chunks(      dbms_vector_chain.utl_to_text( t.data),
                                                  JSON('{ "by": "words",
                                                              "max": "100",
                                                              "overlap" : "0",
                                                              "split": "recursively",
                                                              "language" : "american",
                                                              "normalize": "all"
                                                             }'
                                                          )
                                                    ),
                                                    
            json('{"provider":"database", "model":"ALL_MINILM_L12_V2"}')       
                                                   ) a,
                                                    
      json_table(a.column_value,
                   '$[*]' columns(embed_id number path '$.embed_id',
                                       embed_data  clob    path '$.embed_data',
                                         embed_vector  clob    path '$.embed_vector')
                   ) c
                   where t.id=2;    


DOC_ID     EMBED_ID     EMBED_DATA     EMBED_VECTOR
2     1     AI Vector Search User's Guide Oracle?Database AI Vector Search User's Guide 23ai F87786-15 November 2024 Oracle Database AI Vector Search User's Guide, 23ai F87786-15 Copyright ?2023, 2024, Oracle and/or its affiliates. Primary Authors: Jean-Francois Verrier, Sarah Hirschfeld, Binika Kumar Contributing Authors: Douglas Williams, Frederick Kush, Gunjan Jain, Jessica True, Jody Glover, Maitreyee Chaliha,     [7.86887761E-003,-4.39698659E-002,-7.38092512E-002,-8.19571.............

   也使用JSON_VALUE查看:

  SELECT t.id doc,
 JSON_VALUE(c.column_value, '$.embed_id' RETURNING NUMBER) AS id,
 JSON_VALUE(c.column_value, '$.embed_data'    ) AS txt,
 JSON_VALUE(c.column_value, '$.embed_vector'   RETURNING  clob  )   AS  ivector
 FROM t1 t,
 dbms_vector_chain.utl_to_embeddings(
     dbms_vector_chain.utl_to_chunks(      dbms_vector_chain.utl_to_text( t.data),
                                                  JSON('{ "by": "words",
                                                              "max": "100",
                                                              "overlap" : "0",
                                                              "split": "recursively",
                                                              "language" : "american",
                                                              "normalize": "all"
                                                             }'
                                                          )
                                                    ),
                                                    
            json('{"provider":"database", "model":"ALL_MINILM_L12_V2"}')       
                                                   ) c;
                                                   where t.id=2;    


DOC_ID     EMBED_ID     EMBED_DATA     EMBED_VECTOR
2     1     AI Vector Search User's Guide Oracle?Database AI Vector Search User's Guide 23ai F87786-15 November 2024 Oracle Database AI Vector Search User's Guide, 23ai F87786-15 Copyright ?2023, 2024, Oracle and/or its affiliates. Primary Authors: Jean-Francois Verrier, Sarah Hirschfeld, Binika Kumar Contributing Authors: Douglas Williams, Frederick Kush, Gunjan Jain, Jessica True, Jody Glover, Maitreyee Chaliha,     [7.86887761E-003,-4.39698659E-002,-7.38092512E-002,-8.19571.............

4. TO_VECTOR:

    TO_VECTOR是一个构造函数,它接受VARCHAR2、CLOB、BLOB或JSON类型的字符串作为输入,将其转换为向量,并返回向量作为输出。TO_VECTOR还可以以另一个矢量作为输入,调整其格式,并将调整后的矢量作为输出返回。所以,以下代码是后者的用途,这样保证了VECTOR的格式正确:

insert into t1chunks
 select t.id doc_id, c.embed_id chunk_id, c.embed_data chunk_data,
 to_vector(c.embed_vector) chunk_embedding
from t1 t,
     dbms_vector_chain.utl_to_embeddings(
                                                                  dbms_vector_chain.utl_to_chunks(dbms_vector_chain.utl_to_text(t.data),
                                                                                                                       json('{"normalize":"all"}')
                                                                                                                       ),
                                                                  json('{"provider":"database", "model":"ALL_MINILM_L12_V2"}')
                                                                  ) a,
    json_table(a.column_value,
                   '$[*]' columns(embed_id number path '$.embed_id',
                                         embed_data varchar2(4000) path '$.embed_data',
                                         embed_vector clob path '$.embed_vector')
                   ) c;

commit;

6,527 rows inserted.
Commit complete.

5. 对于23.6的思考:

    Oracle Database 23.5还不能直接支持使用UTL_TO_EMBEDDING去调用Ollama作为JSON参数provider的值。需要将版本升级到Oracle Database 23.6才支持:

在行: 33 上开始执行命令时出错 -
insert into t1chunksOllama
 select t.id doc_id, c.embed_id chunk_id, c.embed_data chunk_data,
 to_vector(c.embed_vector) chunk_embedding
from t1 t,
     dbms_vector_chain.utl_to_embeddings(
                                                                  dbms_vector_chain.utl_to_chunks(dbms_vector_chain.utl_to_text(t.data),
                                                                                                                       json('{"normalize":"all"}')
                                                                                                                       ),
                                                                  json('{
                                                                            "provider":"ollama",
                                                                            "host":"local",
                                                                            "url":"http://localhost:11434/api/embeddings",
                                                                            "model":"qwen2:7b-instruct"
                                                                           }'

                                                                         )
                                                                  ) a,
    json_table(a.column_value,
                   '$[*]' columns(embed_id number path '$.embed_id',
                                         embed_data varchar2(4000) path '$.embed_data',
                                         embed_vector clob path '$.embed_vector')
                   ) c
错误位于命令行: 37 列: 6
错误报告 -
SQL 错误: ORA-20000: Oracle Text 错误:
DRG-50857: oracle error in dbms_vector_chain.utl_to_embeddings(vector_array_t)
ORA-20003: invalid value for PROVIDER parameter
ORA-06512: 在 "CTXSYS.DRUE", line 192
ORA-06512: 在 "CTXSYS.DBMS_VECTOR_CHAIN", line 855
ORA-06512: 在 line 1
20000. 00000 -  "%s"
*Cause:    The stored procedure 'raise_application_error'
           was called which causes this error to be generated.
*Action:   Correct the problem as described in the error message or contact
           the application administrator or DBA for more information.


    另外一个例子如下:

[oracle@station1 ~]$ sqlplus /nolog

SQL*Plus: Release 23.0.0.0.0 - for Oracle Cloud and Engineered Systems on Thu Dec 19 18:44:41 2024
Version 23.5.0.24.07

Copyright (c) 1982, 2024, Oracle.  All rights reserved.

SQL> conn hr/cloud_4U@pdb1_1
Connected.
SQL> var embed_ollama_params clob;
SQL> exec :embed_ollama_params := '{"provider": "ollama","host":"local","url":"http://localhost:11434/api/embeddings","model": "all-minilm"}';
PL/SQL procedure successfully completed.

SQL> select dbms_vector.utl_to_embedding('hello', json(:embed_ollama_params))
  2  ollama_output from dual;
ERROR:
ORA-20003: invalid value for PROVIDER parameter
ORA-06512: at "SYS.DBMS_VECTOR", line 2150
ORA-06512: at "SYS.DBMS_VECTOR", line 2142
ORA-06512: at line 1

no rows selected


    23.6GOLD IMAGE请从support.oracle.com下载。使用23.6后,以上两段代码都能成功运行:




路过

雷人

握手

鲜花

鸡蛋

QQ|手机版|Bo's Oracle Station   

GMT+8, 2025-2-21 20:49 , Processed in 0.034872 second(s), 21 queries .

返回顶部