【博客文章2025】Oracle Database 23ai:AI Vector Search笔记与实验5---在数据库中存储嵌入向量时用到的UTL_TO_EMBEDDINGS与JSON
Author: Bo Tang
1. 从文本文档自动生成一个或多个嵌入向量:
如同实验环境https://www.botangdb.com/mytec/mytec_basicadmin/202501/00900135.html的操作:
insert into t1chunks select t.id doc_id, c.embed_id chunk_id, c.embed_data chunk_data, to_vector(c.embed_vector) chunk_embedding from t1 t, dbms_vector_chain.utl_to_embeddings( dbms_vector_chain.utl_to_chunks(dbms_vector_chain.utl_to_text(t.data),
json('{"normalize":"all"}') ), json('{"provider":"database", "model":"ALL_MINILM_L12_V2"}') ) a, json_table(a.column_value, '$[*]' columns(embed_id number path '$.embed_id', embed_data varchar2(4000) path '$.embed_data', embed_vector clob path '$.embed_vector') ) c; commit;
6,527 rows inserted. Commit complete. |
首先插入数据(使用insert语句从t1表的data列读取PDF文档,转换这些PDF文档为文本),然后将文本分块成数据chunks,最后为每一个数据chunk使用数据库内置语言模型生成嵌入向量。这些事情都由上面这条语句完成。dbms_vector_chain包中的utl_to_text将PDF数据转换成文本;utl_to_chunks将文本分块;utl_to_embeddings为每一个文本分块生成嵌入向量。 DBMS_VECTOR_CHAIN.UTL_TO_EMBEDDINGS的语法结构是这样的(返回的是VECTOR_ARRAY_T类型,即转换1组文本chunks为1组嵌入向量阵列,输入的每一个文本chunk长度的最大值为4000字节):
DBMS_VECTOR_CHAIN.UTL_TO_EMBEDDINGS ( DATA IN VECTOR_ARRAY_T, PARAMS IN JSON default NULL ) return VECTOR_ARRAY_T;
|
如上,代入我们的数据,就是这样:
...... dbms_vector_chain.utl_to_embeddings( dbms_vector_chain.utl_to_chunks(dbms_vector_chain.utl_to_text(t.data), json('{"normalize":"all"}') ), json('{"provider":"database", "model":"ALL_MINILM_L12_V2"}') ) a, ......
|
附:类似的DBMS_VECTOR_CHAIN.UTL_TO_EMBEDDING的语法结构是这样的(返回的是VECTOR类型,即转换文本为单个嵌入向量,输入的文本长度最大值为4000字节):
DBMS_VECTOR_CHAIN.UTL_TO_EMBEDDING ( DATA IN CLOB, PARAMS IN JSON default NULL ) return VECTOR;
|
2. JSON格式的UTL_TO_EMBEDINGS的PARAMS:
dbms_vector_chain.utl_to_embeddings/dbms_vector_chain.utl_to_embedding接受2个形式参数:DATA和PARAMS。其中DATA这个参数比较简单,就是CLOB或VARCHAR2型的输,这些输入将被用于计算嵌入向量。 我们重点讨论PARAMS这个JSON型的参数,PARAMS以JSON格式指定输入参数: Oracle数据库作为嵌入向量服务提供商(默认设置)情况下,“provider”和“model”这两个JSON参数是指:用加载到数据库中的ONNX嵌入向量模型来生成嵌入向量。要该设置生效,必须先将ONNX格式的嵌入模型加载到数据库中。不是随便一个ONNX的模型都可以用于这项工作,这篇文章提供了这样一个小型化的(22.57M参数bert架构)可用的ONNX模型。然而,网上提供最多的ONNX模型下载的网站http://github.com/onnx/models的embedding模型都是不能直接用于加载Oracle Database 23ai到库内的。因此,如果您想要使用一个中文的嵌入模型,比如bge-large-zh-v1.5,那么您自己需要将其他格式的bge-large-zh-v1.5模型进行复杂的转换,变成ONNX格式,才能使用。
{ "provider" : "database", "model" : "value" }
|
第三方嵌入模型作为嵌入向量服务提供商情况下,“provider”和“model”这两个JSON参数是指:此API对您选择的远程服务提供商(Cohere,huggingface、googleai、ocigenai、openai或vertexai)或本地服务供应商(Ollama,注:Oracle Database 23.6才支持)进行REST API调用。
UTL_TO_EMBEDDING的语法:
{ "provider" : "value", "credential_name" : "value", "url" : "value", "model" : "value", "transfer_timeout": "value" , "其他RSET参数": "value" }
|
UTL_TO_EMBEDDINGS的语法:
{ "provider" : "value", "credential_name" : "value", "url" : "value", "model" : "value", "transfer_timeout": "value", "batch_size" : "value", "其他RSET参数": "value" }
|
3.分析UTL_TO_EMBEDINGS生成的JSON_TABLE:
dbms_vector_chain.utl_to_embeddings的输出(JSON格式):
{ "embed_id" : NUMBER, "embed_data" : "VARCHAR2(4000)", "embed_vector": "CLOB" }
|
embed_id指生成向量阵列中的每一份向量的ID。embed_data指输入的文本。embed_vector指输出的向量。 为了获取生成的嵌入向量用来插入t1chunks表,需要使用JSON_TABLE运算符在内存中映射embed_id、embed_data和embed_vector为表结构:
...... json_table(a.column_value, '$[*]' columns(embed_id number path '$.embed_id', embed_data varchar2(4000) path '$.embed_data', embed_vector clob path '$.embed_vector') ) c; ......
|
可以使用JSON_TABLE查看:
SELECT t.id doc_id, embed_id, embed_data, embed_vector FROM t1 t, dbms_vector_chain.utl_to_embeddings( dbms_vector_chain.utl_to_chunks( dbms_vector_chain.utl_to_text( t.data), JSON('{ "by": "words", "max": "100", "overlap" : "0", "split": "recursively", "language" : "american", "normalize": "all" }' ) ), json('{"provider":"database", "model":"ALL_MINILM_L12_V2"}') ) a, json_table(a.column_value, '$[*]' columns(embed_id number path '$.embed_id', embed_data clob path '$.embed_data', embed_vector clob path '$.embed_vector') ) c where t.id=2;
DOC_ID EMBED_ID EMBED_DATA EMBED_VECTOR 2 1 AI Vector Search User's Guide Oracle?Database AI Vector Search User's Guide 23ai F87786-15 November 2024 Oracle Database AI Vector Search User's Guide, 23ai F87786-15 Copyright ?2023, 2024, Oracle and/or its affiliates. Primary Authors: Jean-Francois Verrier, Sarah Hirschfeld, Binika Kumar Contributing Authors: Douglas Williams, Frederick Kush, Gunjan Jain, Jessica True, Jody Glover, Maitreyee Chaliha, [7.86887761E-003,-4.39698659E-002,-7.38092512E-002,-8.19571............. |
也使用JSON_VALUE查看:
SELECT t.id doc, JSON_VALUE(c.column_value, '$.embed_id' RETURNING NUMBER) AS id, JSON_VALUE(c.column_value, '$.embed_data' ) AS txt, JSON_VALUE(c.column_value, '$.embed_vector' RETURNING clob ) AS ivector FROM t1 t, dbms_vector_chain.utl_to_embeddings( dbms_vector_chain.utl_to_chunks( dbms_vector_chain.utl_to_text( t.data), JSON('{ "by": "words", "max": "100", "overlap" : "0", "split": "recursively", "language" : "american", "normalize": "all" }' ) ), json('{"provider":"database", "model":"ALL_MINILM_L12_V2"}') ) c; where t.id=2;
DOC_ID EMBED_ID EMBED_DATA EMBED_VECTOR 2 1 AI Vector Search User's Guide Oracle?Database AI Vector Search User's Guide 23ai F87786-15 November 2024 Oracle Database AI Vector Search User's Guide, 23ai F87786-15 Copyright ?2023, 2024, Oracle and/or its affiliates. Primary Authors: Jean-Francois Verrier, Sarah Hirschfeld, Binika Kumar Contributing Authors: Douglas Williams, Frederick Kush, Gunjan Jain, Jessica True, Jody Glover, Maitreyee Chaliha, [7.86887761E-003,-4.39698659E-002,-7.38092512E-002,-8.19571............. |
4. TO_VECTOR:
TO_VECTOR是一个构造函数,它接受VARCHAR2、CLOB、BLOB或JSON类型的字符串作为输入,将其转换为向量,并返回向量作为输出。TO_VECTOR还可以以另一个矢量作为输入,调整其格式,并将调整后的矢量作为输出返回。所以,以下代码是后者的用途,这样保证了VECTOR的格式正确:
insert into t1chunks select t.id doc_id, c.embed_id chunk_id, c.embed_data chunk_data, to_vector(c.embed_vector) chunk_embedding from t1 t, dbms_vector_chain.utl_to_embeddings( dbms_vector_chain.utl_to_chunks(dbms_vector_chain.utl_to_text(t.data),
json('{"normalize":"all"}') ), json('{"provider":"database", "model":"ALL_MINILM_L12_V2"}') ) a, json_table(a.column_value, '$[*]' columns(embed_id number path '$.embed_id', embed_data varchar2(4000) path '$.embed_data', embed_vector clob path '$.embed_vector') ) c; commit;
6,527 rows inserted. Commit complete. |
5. 对于23.6的思考:
Oracle Database 23.5还不能直接支持使用UTL_TO_EMBEDDING去调用Ollama作为JSON参数provider的值。需要将版本升级到Oracle Database 23.6才支持:
在行: 33 上开始执行命令时出错 - insert into t1chunksOllama select t.id doc_id, c.embed_id chunk_id, c.embed_data chunk_data, to_vector(c.embed_vector) chunk_embedding from t1 t, dbms_vector_chain.utl_to_embeddings(
dbms_vector_chain.utl_to_chunks(dbms_vector_chain.utl_to_text(t.data),
json('{"normalize":"all"}') ), json('{ "provider":"ollama", "host":"local", "url":"http://localhost:11434/api/embeddings", "model":"qwen2:7b-instruct" }' ) ) a, json_table(a.column_value, '$[*]' columns(embed_id number path '$.embed_id', embed_data varchar2(4000) path '$.embed_data', embed_vector clob path '$.embed_vector') ) c 错误位于命令行: 37 列: 6 错误报告 - SQL 错误: ORA-20000: Oracle Text 错误: DRG-50857: oracle error in dbms_vector_chain.utl_to_embeddings(vector_array_t) ORA-20003: invalid value for PROVIDER parameter ORA-06512: 在 "CTXSYS.DRUE", line 192 ORA-06512: 在 "CTXSYS.DBMS_VECTOR_CHAIN", line 855 ORA-06512: 在 line 1 20000. 00000 - "%s" *Cause: The stored procedure 'raise_application_error' was called which causes this error to be generated. *Action: Correct the problem as described in the error message or contact the application administrator or DBA for more information.
|
[oracle@station1 ~]$ sqlplus /nolog
SQL*Plus: Release 23.0.0.0.0 - for Oracle Cloud and Engineered Systems on Thu Dec 19 18:44:41 2024 Version 23.5.0.24.07
Copyright (c) 1982, 2024, Oracle. All rights reserved.
SQL> conn hr/cloud_4U@pdb1_1 Connected. SQL> var embed_ollama_params clob; SQL> exec :embed_ollama_params := '{"provider": "ollama","host":"local","url":"http://localhost:11434/api/embeddings","model": "all-minilm"}'; PL/SQL procedure successfully completed.
SQL> select dbms_vector.utl_to_embedding('hello', json(:embed_ollama_params)) 2 ollama_output from dual; ERROR: ORA-20003: invalid value for PROVIDER parameter ORA-06512: at "SYS.DBMS_VECTOR", line 2150 ORA-06512: at "SYS.DBMS_VECTOR", line 2142 ORA-06512: at line 1 no rows selected
|
23.6GOLD IMAGE请从support.oracle.com下载。使用23.6后,以上两段代码都能成功运行: |