Bo's Oracle Station

【博客文章2025】Oracle Database 23ai:AI Vector Search笔记与实验4---在数据库中存储数据chunks时用到的UTL_TO_CHUNKS与JSON

2025-1-22 14:32| 发布者: admin| 查看: 95| 评论: 0|原作者: Bo Tang

摘要: 本博客介绍了在数据库中存储数据CHUNKS时用到的UTL_TO_CHUNKS与JSON知识。DBMS_VECTOR_CHAIN包中的UTL_TO_CHUNKS将文本分块。详细介绍了JSON格式的UTL_TO_CHUNK的PARAMS。最后设置JSON参数,查看数据CHUNKS详情。
【博客文章2025】Oracle Database 23ai:AI Vector Search笔记与实验4---在数据库中存储数据CHUNKS时用到的UTL_TO_CHUNKS与JSON



Author: Bo Tang


1. 在数据库中存储数据CHUNKS和存储嵌入向量:

    在数据库中已经创建了1个表,并已经在其DATA列中,加载了2个PDF文件

SQL> conn hr/cloud_4U@pdb1_1
Connected.
SQL> select id, dbms_lob.getlength(data) from t1;

    ID DBMS_LOB.GETLENGTH(DATA)
---------- ------------------------
     1           12685223
     2            5016485

    再创建一个新表来存储非结构化的数据chunks和相关的嵌入向量。这个表将引入一个叫做vector类型的列:

create table t1chunks (doc_id number, chunk_id number, chunk_data varchar2(4000), chunk_embedding vector);

    插入数据时,使用insert语句从t1表的data列读取PDF文档:

insert into t1chunks
 select t.id doc_id, c.embed_id chunk_id, c.embed_data chunk_data,
 to_vector(c.embed_vector) chunk_embedding
from t1 t,
     dbms_vector_chain.utl_to_embeddings(
                                                                  dbms_vector_chain.utl_to_chunks(dbms_vector_chain.utl_to_text(t.data),
                                                                                                                       json('{"normalize":"all"}')
                                                                                                                       ),
                                                                  json('{"provider":"database", "model":"ALL_MINILM_L12_V2"}')
                                                                  ) a,
    json_table(a.column_value,
                   '$[*]' columns(embed_id number path '$.embed_id',
                                         embed_data varchar2(4000) path '$.embed_data',
                                         embed_vector clob path '$.embed_vector')
                   ) c;

commit;

6,527 rows inserted.
Commit complete.

    以上语句首先会转换PDF文档为文本,然后将文本分块成数据chunks,最后为每一个数据chunk使用语言模型生成嵌入向量。这些事情都由上面这条语句完成。dbms_vector_chain包中的utl_to_text将PDF数据转换成文本;utl_to_chunks将文本分块;utl_to_embeddings为每一个文本分块生成嵌入向量。

2. DBMS_VECTOR_CHAIN包:

    DBMS_VECTOR_CHAIN包用来支持Oracle AI向量搜索的高级操作:例如分割数据chunks、嵌入数据以及文本生成和摘要生成操作。 它更适合于具有相似性搜索和混合搜索的文本处理,可以快速进行端到端搜索。该包内部包含一组模块化和可灵活应用的函数。就像上面的例子这样,可以把其中的多个函数比如:utl_to_text、utl_to_chunks和utl_to_embeddings串在一起使用,来完成搜索。

3. 利用UTL_TO_CHUNKS将文本分块:

    DBMS_VECTOR_CHAIN.UTL_TO_CHUNKS用于拆分大型纯文本,将文档分割成更小的文本块。它的语法结构是这样的:

DBMS_VECTOR_CHAIN.UTL_TO_CHUNKS (
    DATA IN CLOB | VARCHAR2
    PARAMS IN JSON default NULL
) return VECTOR_ARRAY_T;

    如上,代入我们的数据,就是这样:

......
                                                                  dbms_vector_chain.utl_to_chunks(dbms_vector_chain.utl_to_text(t.data),
                                                                                                                       json('{"normalize":"all"}')
                                                                                                                       ),
......

4. JSON格式的UTL_TO_CHUNK的PARAMS:

    dbms_vector_chain.utl_to_chunks接受2个形式参数:DATA和PARAMS。其中DATA这个参数比较简单,就是CLOB或VARCHAR2型的输入。我们重点讨论PARAMS这个JSON型的参数。
    PARAMS以JSON格式指定输入参数:

{
"by" : mode,
"max" : max,
"overlap" : overlap,
"split" : split_condition,
"custom_list" : [ split_chars1, ... ],
"vocabulary" : vocabulary_name,
"language" : nls_language,
"normalize" : normalize_mode,
"norm_options" : [ normalize_option1, ... ],
"extended" : boolean
}

    如上,代入我们的数据,就是这样:

......
                                                                
                                                                                                                       json('{"normalize":"all"}')
                                                                                                                  
......

    我们只传递了
"normalize" : normalize_mode,这个参数,而其他参数采用默认值:
    "normalize"这个JSON参数是指:当把文档转换为文本时,自动进行预处理或自动进行后处理时可能出现一些问题。这些问题涉及:多个连续空格的处理和引号以及回引号的处理等等。Oracle建议您使用“all”或者以下某个options模式以提取高质量的CHUNKS。

Valid values:
none:
Applies no normalization.

all:
Normalizes common multi-byte (unicode) punctuation to standard single-byte.

options:
Specify an array of normalization options using the
norm_options parameter.
{
"normalize" : "options",
"norm_options" : [ "normalize_option1", ... ]
}

punctuation:
Converts quotes, dashes, and other punctuation characters supported in the character set of the
text to their common ASCII form. For example:
* U+2013 (En Dash) maps to U+002D (Hyphen-Minus)
* U+2018 (Left Single Quotation Mark) maps to U+0027 (Apostrophe)
* U+2019 (Right Single Quotation Mark) maps to U+0027 (Apostrophe)
* U+201B (Single High-Reversed-9 Quotation Mark) maps to U+0027 (Apostrophe)
whitespace:
Minimizes whitespace by eliminating unnecessary characters.
For example, retain blanklines, but remove any extra newlines and interspersed spaces or tabs:
" \n \n " => "\n\n"
widechar:
Normalizes wide, multi-byte digits and (a-z) letters to single-byte.
These are multi-byte equivalents for 0-9 and a-z A-Z, which can show up in Chinese,
Japanese, or Korean text.
For example:
{
"normalize" : "options",
"norm_options" : [ "whitespace" ]
}
Default value: none

    “by”参数以什么单位(字符、单词或词汇标记 )作为拆分CHUNKS的模式。其默认值是words:

Valid values:
characters (or chars):
Splits by counting the number of characters.

words:
Splits by counting the number of words.
Words are defined as sequences of alphabetic characters, sequences of digits, individual
punctuation marks, or symbols. For segmented languages without whitespace word boundaries
(such as Chinese, Japanese, or Thai), each native character is considered a word (that is, unigram).
vocabulary:
Splits by counting the number of vocabulary tokens.
Vocabulary tokens are words or word pieces, recognized by the vocabulary of the tokenizer that your
embedding model uses. You can load your vocabulary file using the chunker helper API
DBMS_VECTOR_CHAIN.CREATE_VOCABULARY.
Note: For accurate results, ensure that the chosen model matches the vocabulary file used for
chunking. If you are not using a vocabulary file, then ensure that the input length is defined within the
token limits of your model.
Default value: words                                                               

    “max”参数指:指定每个CHUNK的最大大小。此设置将按照这个大小将输入文本进行分割。max的单位对应于by模式,即拆分的单位是by指定的字母、标点符号或词汇标记。其默认值是100:

Valid values:
by characters: 50 to 4000 characters
by words: 10 to 1000 words
by vocabulary: 10 to 1000 tokens
Default value: 100 
                                                                    

    “split ”参数指:指定当输入文本达到最大大小限制时,在什么样的边界处将其拆分。这有助于保持可读性。其默认值是recursively:

Valid values:
• none:
Splits at the max limit of characters, words, or vocabulary tokens.
• newline, blankline, and space:
These are single-split character conditions that split at the last split character before the max value.
Use newline to split at the end of a line of text. Use blankline to split at the end of a blank line
(sequence of characters, such as two newlines). Use space to split at the end of a blank space.
• recursively:
This is a multiple-split character condition that breaks the input text using an ordered list of
characters (or sequences).
recursively is predefined as BLANKLINE, newline, space, none in this order:
1. If the input text is more than the max value, then split by the first split character.
2. If that fails, then split by the second split character.
3. And so on.
4. If no split characters exist, then split by max wherever it appears in the text.
• sentence:
This is an end-of-sentence split condition that breaks the input text at a sentence boundary.
This condition automatically determines sentence boundaries by using knowledge of the input
language's sentence punctuation and contextual rules. This language-specific condition relies mostly
on end-of-sentence (EOS) punctuations and common abbreviations.
Contextual rules are based on word information, so this condition is only valid when splitting the text
by words or vocabulary (not by characters).
Note: This condition obeys the by word and max settings, and thus may not determine accurate
sentence boundaries in some cases. For example, when a sentence is larger than the max value, it
splits the sentence at max. Similarly, it includes multiple sentences in the text only when they fit
within the max limit.
• custom:
Splits based on a custom split characters list. You can provide custom sequences up to a limit of 16
split character strings, with a maximum length of 10 each.
Specify an array of valid text literals using the custom_list parameter.
{
"split" : "custom",
"custom_list" : [ "split_chars1", ... ]
}
Note: You can omit sequences only for tab (\t), newline (\n), and linefeed (\r).
Default value: recursively

    “overlap”参数指:指定CHUNK应包含的前一个CHUNK的文本的数量。这有助于在逻辑上分割相关文本(如句子)。其单位与by参数指定的单位相同。其默认值是0:

Valid value: 5% to 20% of max
Default value: 0
                                                               

    “language”参数指:指定输入数据的语言。这个参数很重要,尤其是当你的文本包含某些字符(例如标点符号或缩写)时,因为它们在另一种语言中可能有不同的解释。其默认值从会话的NLS_LANGUAGE参数继承。

Valid values:
• NLS-supported language name or its abbreviation, as listed in Oracle Database Globalization
Support Guide.
• Custom language name or its abbreviation, as listed in Supported Languages and Data File
Locations. You use the DBMS_VECTOR_CHAIN.CREATE_LANG_DATA chunker helper API to load
language-specific data (abbreviation tokens) into the database, for your specified language.
Default value: NLS_LANGUAGE from session
                                                            

    “extended”参数指:在不设定max_string_size参数的情况下,将VARCHAR2字符串的输出限制增加到32767字节。其默认值:

Default value: 4000 or 32767 (when max_string_size=extended)                                                             

5. 查询数据CHUNKS详情:

    设置JSON参数,查看数据CHUNKS详情:

 SELECT t.id doc,
  chunk_id,
  chunk_offset,
  chunk_length,
  chunk_data
 FROM t1 t,
  dbms_vector_chain.utl_to_chunks(      dbms_vector_chain.utl_to_text( t.data),
                                                  JSON('{ "by": "words",
                                                              "max": "100",
                                                              "overlap" : "0",
                                                              "split": "recursively",
                                                              "language" : "american",
                                                              "normalize": "all"
                                                             }'
                                                          )
                                                    ) a,
                                                    
      json_table(a.column_value,
                   '$[*]' columns(chunk_id number path '$.chunk_id',
                                         chunk_offset number path '$.chunk_offset',
                                         chunk_length number path '$.chunk_length',
                                         chunk_data  clob    path '$.chunk_data')
                   ) c
                   where t.id=2;


DOC     CHUNK_ID     CHUNK_OFFSET     CHUNK_LENGTH     CHUNK_DATA
2     1     4     461     AI Vector Search User's Guide Oracle?Database AI Vector Search User's Guide 23ai F87786-15 November 2024 Oracle Database AI Vector Search User's Guide, 23ai F87786-15 Copyright ?2023, 2024, Oracle and/or its affiliates. Primary Authors: Jean-Francois Verrier, Sarah Hirschfeld, Binika Kumar Contributing Authors: Douglas Williams, Frederick Kush, Gunjan Jain, Jessica True, Jody Glover, Maitreyee Chaliha,
2     2     465     454     Mamata Basapur, Prakash Jashnani, Ramya P, Sarika Surampudi, Suresh Rajan, Tulika Das, Usha Krishnamurthy Contributors: Agnivo Saha, Ajay Sunnyhith Chidurala, Aleksandra Czarlinska, Angela Amor, Aurosish Mishra, Bonnie Xia, Boriana Milenova, David Jiang, Dinesh Das, Doug Hood, George Krupka, Harichandan Roy, Malavika S P, Mark Hornick, Rohan Aggarwal, Roger Ford, Sebastian DeLaHoz, Shasank Chavan, Sudhir Kumar, Tirthankar Lahiri, Teck
2     3     919     457     Hua Lee, Valentin Leonard Tabacaru Cristache, Vinita Subramanian, Weiwei Gong, Yuan Zhou This software and related documentation are provided under a license agreement containing restrictions on use and disclosure and are protected by intellectual property laws. Except as expressly permitted in your license agreement or allowed by law, you may not use, copy, reproduce, translate, broadcast, modify, license, transmit, distribute, exhibit,
2     4     1376     498     perform, publish, or display any part, in any form, or by any means. Reverse engineering, disassembly, or decompilation of this software, unless required by law for interoperability, is prohibited. The information contained herein is subject to change without notice and is not warranted to be error-free. If you find any errors, please report them to us in writing. If this is software, software documentation, data (as defined in the Federal Acquisition Regulation), or related
2     5     1874     500     documentation that is delivered to the U.S. Government or anyone licensing it on behalf of the U.S. Government, then the following notice is applicable: U.S. GOVERNMENT END USERS: Oracle programs (including any operating system, integrated software, any programs embedded, installed, or activated on delivered hardware, and modifications of such programs) and Oracle computer documentation or other Oracle data delivered to or accessed by U.S. Government end users are "commercial
2     6     2374     477     computer software," "commercial computer software documentation," or "limited rights data" pursuant to the applicable Federal Acquisition Regulation and agency-specific supplemental regulations. As such, the use, reproduction, duplication, release, display, disclosure, modification, preparation of derivative works, and/or adaptation of i) Oracle programs (including any operating system, integrated software, any programs embedded, installed, or activated on
2     7     2851     544     delivered hardware, and modifications of such programs), ii) Oracle computer documentation and/or iii) other Oracle data, is subject to the rights and limitations specified in the license contained in the applicable contract. The terms governing the U.S. Government's use of Oracle cloud services are defined by the applicable contract for such services. No other rights are granted to the U.S. Government. This software or hardware is developed for general use in a variety of information management applications. It is not
2     8     3395     492     developed or intended for use in any inherently dangerous applications, including applications that may create a risk of personal injury. If you use this software or hardware in dangerous applications, then you shall be responsible to take all appropriate fail-safe, backup, redundancy, and other measures to ensure its safe use. Oracle Corporation and its affiliates disclaim any liability for any damages caused by use of this software or hardware in dangerous applications.
2     9     3887     521     Oracle? Java, MySQL, and NetSuite are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Intel and Intel Inside are trademarks or registered trademarks of Intel Corporation. All SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International, Inc. AMD, Epyc, and the AMD logo are trademarks or registered trademarks of Advanced Micro Devices. UNIX is a registered trademark of The Open Group.
2     10     4408     481     This software or hardware and documentation may provide access to or information about content, products, and services from third parties. Oracle Corporation and its affiliates are not responsible for and expressly disclaim all warranties of any kind with respect to third-party content, products, and services unless otherwise set forth in an applicable agreement between you and Oracle. Oracle Corporation and its affiliates will not be responsible for any loss,
......

    或者:

SELECT t.id doc,
 JSON_VALUE(C.column_value, '$.chunk_id' RETURNING NUMBER) AS id,
 JSON_VALUE(C.column_value, '$.chunk_offset' RETURNING NUMBER) AS pos,
 JSON_VALUE(C.column_value, '$.chunk_length' RETURNING NUMBER) AS siz,
 JSON_VALUE(C.column_value, '$.chunk_data') AS txt
 FROM t1 t,
  dbms_vector_chain.utl_to_chunks(t.data,
                                                  JSON('{ "by": "words",
                                                              "max": "100",
                                                              "overlap" : "0",
                                                              "split": "recursively",
                                                              "language" : "american",
                                                              "normalize": "all"
                                                             }'
                                                          )
                                                    ) C
                                                   where t.id=2 ;



DOC     CHUNK_ID     CHUNK_OFFSET     CHUNK_LENGTH     CHUNK_DATA
2     1     4     461     AI Vector Search User's Guide Oracle?Database AI Vector Search User's Guide 23ai F87786-15 November 2024 Oracle Database AI Vector Search User's Guide, 23ai F87786-15 Copyright ?2023, 2024, Oracle and/or its affiliates. Primary Authors: Jean-Francois Verrier, Sarah Hirschfeld, Binika Kumar Contributing Authors: Douglas Williams, Frederick Kush, Gunjan Jain, Jessica True, Jody Glover, Maitreyee Chaliha,
2     2     465     454     Mamata Basapur, Prakash Jashnani, Ramya P, Sarika Surampudi, Suresh Rajan, Tulika Das, Usha Krishnamurthy Contributors: Agnivo Saha, Ajay Sunnyhith Chidurala, Aleksandra Czarlinska, Angela Amor, Aurosish Mishra, Bonnie Xia, Boriana Milenova, David Jiang, Dinesh Das, Doug Hood, George Krupka, Harichandan Roy, Malavika S P, Mark Hornick, Rohan Aggarwal, Roger Ford, Sebastian DeLaHoz, Shasank Chavan, Sudhir Kumar, Tirthankar Lahiri, Teck
2     3     919     457     Hua Lee, Valentin Leonard Tabacaru Cristache, Vinita Subramanian, Weiwei Gong, Yuan Zhou This software and related documentation are provided under a license agreement containing restrictions on use and disclosure and are protected by intellectual property laws. Except as expressly permitted in your license agreement or allowed by law, you may not use, copy, reproduce, translate, broadcast, modify, license, transmit, distribute, exhibit,
2     4     1376     498     perform, publish, or display any part, in any form, or by any means. Reverse engineering, disassembly, or decompilation of this software, unless required by law for interoperability, is prohibited. The information contained herein is subject to change without notice and is not warranted to be error-free. If you find any errors, please report them to us in writing. If this is software, software documentation, data (as defined in the Federal Acquisition Regulation), or related
2     5     1874     500     documentation that is delivered to the U.S. Government or anyone licensing it on behalf of the U.S. Government, then the following notice is applicable: U.S. GOVERNMENT END USERS: Oracle programs (including any operating system, integrated software, any programs embedded, installed, or activated on delivered hardware, and modifications of such programs) and Oracle computer documentation or other Oracle data delivered to or accessed by U.S. Government end users are "commercial
2     6     2374     477     computer software," "commercial computer software documentation," or "limited rights data" pursuant to the applicable Federal Acquisition Regulation and agency-specific supplemental regulations. As such, the use, reproduction, duplication, release, display, disclosure, modification, preparation of derivative works, and/or adaptation of i) Oracle programs (including any operating system, integrated software, any programs embedded, installed, or activated on
2     7     2851     544     delivered hardware, and modifications of such programs), ii) Oracle computer documentation and/or iii) other Oracle data, is subject to the rights and limitations specified in the license contained in the applicable contract. The terms governing the U.S. Government's use of Oracle cloud services are defined by the applicable contract for such services. No other rights are granted to the U.S. Government. This software or hardware is developed for general use in a variety of information management applications. It is not
2     8     3395     492     developed or intended for use in any inherently dangerous applications, including applications that may create a risk of personal injury. If you use this software or hardware in dangerous applications, then you shall be responsible to take all appropriate fail-safe, backup, redundancy, and other measures to ensure its safe use. Oracle Corporation and its affiliates disclaim any liability for any damages caused by use of this software or hardware in dangerous applications.
2     9     3887     521     Oracle? Java, MySQL, and NetSuite are registered trademarks of Oracle and/or its affiliates. Other names may be trademarks of their respective owners. Intel and Intel Inside are trademarks or registered trademarks of Intel Corporation. All SPARC trademarks are used under license and are trademarks or registered trademarks of SPARC International, Inc. AMD, Epyc, and the AMD logo are trademarks or registered trademarks of Advanced Micro Devices. UNIX is a registered trademark of The Open Group.
2     10     4408     481     This software or hardware and documentation may provide access to or information about content, products, and services from third parties. Oracle Corporation and its affiliates are not responsible for and expressly disclaim all warranties of any kind with respect to third-party content, products, and services unless otherwise set forth in an applicable agreement between you and Oracle. Oracle Corporation and its affiliates will not be responsible for any loss,
......                                               





路过

雷人

握手

鲜花

鸡蛋

QQ|手机版|Bo's Oracle Station   

GMT+8, 2025-2-21 18:47 , Processed in 0.034192 second(s), 21 queries .

返回顶部