mirror of https://github.com/Mai-with-u/MaiBot.git
LPMM knowledge base: deletion support and self-check script enhancements (with key robustness fixes)
Adds safe, controlled deletion to LPMM:

- `KGManager.delete_paragraphs` deletes graph nodes and their incident edges by paragraph/entity hash, optionally cleans up orphaned entities, and rebuilds metadata from the graph.
- Unified deletion script `scripts/delete_lpmm_items.py`: delete by batch (OpenIE file), hash file, raw text paragraph, or keyword search, with a built-in dry-run and a maximum-node-count guard.

New self-check and regression scripts:

- `scripts/inspect_lpmm_batch.py` / `scripts/inspect_lpmm_global.py` for batch-level and global state checks.
- `scripts/test_lpmm_retrieval.py` initializes LPMM in one step and tests retrieval with a fixed set of questions.

Robustness and performance guards:

- `KGManager.kg_search` now falls back gracefully when an entity is missing from `ent_appear_cnt`, avoiding a KeyError during entity weight computation.
- Added a synonym-entity count limit and PPR node/relation thresholds, falling back to pure vector retrieval when they are exceeded.

Documentation:

- `docs-src/lpmm_user_guide.md`: import / delete / self-check script guide for users with no prior experience.
- `docs-src/lpmm_parameters_guide.md`: explanation of the key `[lpmm_knowledge]` parameters with simple tuning advice.

pull/1386/head
parent fa4555197d
commit 1383caf249

docs-src/lpmm_parameters_guide.md
@ -0,0 +1,154 @@
# LPMM Key Parameter Tuning Guide (Advanced)

> This document supplements the `[lpmm_knowledge]` section of `config/bot_config.toml`.
> If you only want the default configuration, you can leave these parameters untouched; the scripts will still work.

All LPMM-related parameters are collected in one place:

```toml
[lpmm_knowledge] # LPMM knowledge base configuration
enable = true
lpmm_mode = "agent"
...
```

The commonly used parameters below are grouped into three functional categories.

---

## 1. Retrieval parameters (answer quality and style)

```toml
qa_relation_search_top_k = 10    # relation search top-k
qa_relation_threshold = 0.5      # relation threshold: a relation only counts as a "hit" above this similarity
qa_paragraph_search_top_k = 1000 # paragraph search top-k; too small can hurt recall
qa_paragraph_node_weight = 0.05  # paragraph node weight in graph search & PPR
qa_ent_filter_top_k = 10         # entity filter top-k
qa_ppr_damping = 0.8             # PPR damping factor
qa_res_top_k = 3                 # number of paragraphs finally handed to the QA model
```

- `qa_relation_search_top_k`
  Caps how many relation-vector candidates are considered.
  - Larger: more complete recall, slightly slower;
  - Smaller: faster, but some implicit relations may be missed.

- `qa_relation_threshold`
  Similarity threshold for relations:
  - Higher: only very relevant relations are trusted, and the system is more likely to degrade to pure paragraph vector retrieval;
  - Lower: the graph structure has more influence, which suits corpora rich in entity relations.

- `qa_paragraph_search_top_k`
  Caps how many paragraph candidates are considered.
  - Too small: recall may be incomplete and answers may miss information;
  - Too large: slightly more computation; 1000 is generally a safe default.

- `qa_paragraph_node_weight`
  Weight of paragraph nodes in graph retrieval:
  - Larger: relies more on paragraph vector similarity (classic vector retrieval);
  - Smaller: relies more on the graph structure and the entity network.

- `qa_ppr_damping`
  Damping factor for Personalized PageRank:
  - Keeping it around 0.8 is usually fine;
  - Closer to 1: favors longer-path exploration and more diffuse results;
  - Slightly lower: concentrates on nodes directly related to the question.

- `qa_res_top_k`
  LPMM assembles the `qa_res_top_k` most relevant paragraphs into the "knowledge context" handed to the QA model.
  - Too many: the model has more to read and a heavier load;
  - Too few: not enough information; 3–5 is usually a good balance.

> Tuning advice:
> - Start with small adjustments to `qa_relation_threshold` and `qa_paragraph_node_weight` (two example presets are sketched below);
> - After each change, run `scripts/test_lpmm_retrieval.py` on the fixed questions and compare how the answers shift.
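
As a concrete illustration, here are two hypothetical starting points; the numbers are examples to experiment with, not recommended defaults:

```toml
[lpmm_knowledge]
# Graph-leaning sketch: trust more relations, lean less on raw paragraph similarity
qa_relation_threshold = 0.4
qa_paragraph_node_weight = 0.02

# Vector-leaning alternative (swap these values in instead):
# qa_relation_threshold = 0.7
# qa_paragraph_node_weight = 0.1
```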

---

## 2. Performance and hardware parameters

```toml
embedding_dimension = 1024    # embedding vector dimension; must match the model's output
max_embedding_workers = 12    # number of concurrent embedding/extraction threads
embedding_chunk_size = 16     # number of items embedded per batch
info_extraction_workers = 3   # number of concurrent entity-extraction threads
enable_ppr = true             # whether to enable PPR; can be disabled on low-end machines
ppr_node_cap = 8000           # skip PPR automatically when the graph has more nodes than this
ppr_relation_cap = 50         # skip PPR automatically when more relations than this are hit
```

- `embedding_dimension`
  Must match the output dimension of your embedding model (e.g. 768 or 1024). **Do not change this unless you know exactly what you are doing!**

- `max_embedding_workers`
  Number of parallel threads during import/extraction:
  - Powerful machine: raise it to speed up imports;
  - Weak machine: lower it (e.g. 2 or 4) to avoid pinning the CPU at 100% for long stretches.

- `embedding_chunk_size`
  Number of paragraphs sent to the embedding API per batch:
  - Larger: fewer requests, but each request is heavier;
  - Smaller: more requests, but each puts less strain on the network and the API.

- `info_extraction_workers`
  Number of parallel threads for entity extraction in `scripts/info_extraction.py`:
  - With Pro/expensive models, keep this low to avoid running up parallel API costs;
  - 2–4 is usually a good balance.

- `enable_ppr`
  Whether to enable Personalized PageRank (PPR) graph retrieval:
  - `true`: retrieval combines vectors with the knowledge graph — better results, slightly slower;
  - `false`: vector retrieval only — some quality is sacrificed for steadier performance.

- `ppr_node_cap` / `ppr_relation_cap`
  Safety thresholds: when the graph's node count or the number of hit relations exceeds the cap, PPR is skipped automatically so a "big graph" cannot stall the system.

> Tuning advice:
> - If the machine clearly struggles during import/retrieval (texts >= 1 MB on fewer than 4 cores), lower these first:
>   - `max_embedding_workers`
>   - `embedding_chunk_size`
>   - `info_extraction_workers`
> - Or temporarily set `enable_ppr = false` (not recommended unless you really hit problems — it significantly degrades retrieval quality)
> - After adjusting, rerun the import or retrieval and watch the logs and system resource usage. A low-spec starting point is sketched below.
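
A minimal sketch of a low-spec profile; the exact values are illustrative assumptions, not upstream defaults:

```toml
[lpmm_knowledge]
max_embedding_workers = 2    # leave CPU headroom on small machines
embedding_chunk_size = 8     # lighter individual API requests
info_extraction_workers = 2  # cap parallel LLM spending
enable_ppr = true            # keep PPR on; the caps below still protect you
ppr_node_cap = 4000          # bail out of PPR earlier on large graphs
```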

---

## 3. Turning LPMM on/off and the mode switch

```toml
enable = true       # whether the LPMM knowledge base is enabled
lpmm_mode = "agent" # "classic" or "agent"
```

- `enable`
  - `true`: the LPMM knowledge base is enabled; retrieval and QA will use it;
  - `false`: LPMM is fully off. The scripts can still import/delete data, but chat QA is unaffected.

- `lpmm_mode`
  - `classic`: traditional mode that uses only the LPMM knowledge base itself;
  - `agent`: integrates with the new memory system, for more complex mixed memory + knowledge scenarios.

> After changing `enable` or `lpmm_mode`, restart the main program for the configuration to take effect.

---

## 4. Recommended tuning workflow

1. **Run one full pass with the default configuration**
   - Import → `inspect_lpmm_global.py` → `test_lpmm_retrieval.py`;
   - Note the current "answer style" and "response speed".

2. **Change only one or two parameters at a time**
   - For example, start with `qa_relation_threshold` and `qa_paragraph_node_weight`;
   - Or, when performance is poor, adjust `max_embedding_workers` and `enable_ppr`.

3. **Re-run the same set of test questions after each change**
   - Use `scripts/test_lpmm_retrieval.py` (see the loop sketched below);
   - Compare the answers across configurations and pick the combination that fits best.

4. **If nothing you try seems right**
   - Restore the `[lpmm_knowledge]` section to the repository defaults;
   - Restart the main program to get back to "factory settings".
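
To make step 3 repeatable, you can capture each run to a file and compare consecutive runs. A minimal sketch for Windows cmd (the output file names are placeholders of my choosing):

```bash
.\.venv\Scripts\python.exe scripts/test_lpmm_retrieval.py > run_before.txt
REM ...edit [lpmm_knowledge] in config/bot_config.toml, restart, then:
.\.venv\Scripts\python.exe scripts/test_lpmm_retrieval.py > run_after.txt
fc run_before.txt run_after.txt
```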

By tuning the parameters in this guide, you can find the balance between retrieval quality, response speed, and resource usage that best fits your MaiBot and your machine!

docs-src/lpmm_user_guide.md
@ -0,0 +1,395 @@
# LPMM Knowledge Base Script Guide (for Beginners)

This guide is aimed at end users unfamiliar with the command line and code. It covers:

- Initial deployment of the LPMM knowledge base (from local txt files to a searchable knowledge base)
- Safe deletion of knowledge (by batch, by original text, by hash, by keyword)
- Self-checks and retrieval verification after imports / deletions

> Note: this guide assumes you have finished the basic MaiBot installation and can open a command-line terminal in the project root.

---

## 1. The scripts you will use

In the project root (`MaiBot-dev`), these scripts form the LPMM "toolbox":

- Import:
  - `scripts/raw_data_preprocessor.py`
    Reads `.txt` files from `data/lpmm_raw_data`, splits them into paragraphs on blank lines, and deduplicates them.
  - `scripts/info_extraction.py`
    Calls the LLM to extract entities and triples from each paragraph, producing intermediate OpenIE JSON files.
  - `scripts/import_openie.py`
    Imports the OpenIE JSON files in `data/openie` into the LPMM knowledge base (vector store + knowledge graph).

- Deletion:
  - `scripts/delete_lpmm_items.py`
    The LPMM deletion entry point; supports deletion by batch, by raw text paragraph, by hash list, and by fuzzy keyword search.

- Self-checks:
  - `scripts/inspect_lpmm_global.py`
    Shows the current state of the whole knowledge base: paragraph/entity/relation counts, graph node/edge counts, sample contents, and so on.
  - `scripts/inspect_lpmm_batch.py`
    For one OpenIE JSON batch, checks what remains of it in the vector store and the knowledge graph (compare before and after import or deletion).
  - `scripts/test_lpmm_retrieval.py`
    Tests LPMM retrieval with a few preset questions, to help you judge whether the knowledge base is working.

> Note: all command examples assume you are inside the virtual environment (prompt prefixed with something like `(.venv)`) and that the current directory is the project root.

---

## 2. Initial deployment of the LPMM knowledge base

### 2.1 Prepare the raw txt files

1. Put the documents you want to import into:

```text
data/lpmm_raw_data
```

2. File requirements:

- Must be `.txt` files, UTF-8 encoding recommended;
- Separate paragraphs with **blank lines**: a passage followed by an empty line counts as one independent piece of knowledge (see the sketch below).
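
A tiny illustration of the expected layout (the contents are made up):

```text
MaiBot is a chat bot project.
This whole block, up to the blank line, counts as one knowledge paragraph.

This second block becomes a separate paragraph,
even though it spans two lines.
```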

Sample file:

- `data/lpmm_raw_data/lpmm_large_sample.txt`: the repository ships a large sample text you can practice with directly.

### 2.2 Step 1: preprocess the raw text (split + deduplicate)

From the project root, run:

```bash
.\.venv\Scripts\python.exe scripts/raw_data_preprocessor.py
```

On success you will typically see logs like:

- Processing file: `lpmm_large_sample.txt`
- Read XX items in total

This step does not call the LLM; it only splits and deduplicates.

### 2.3 Step 2: information extraction (generate OpenIE JSON)

Run:

```bash
.\.venv\Scripts\python.exe scripts/info_extraction.py
```

You will see an "important operation" confirmation prompt explaining that:

- Information extraction calls the LLM and costs API credits and time;
- If everything looks right, type `y` and press Enter to continue.

During extraction you may see:

- Logs like "model ... network error (retryable)";
  this means the script is automatically retrying after a network hiccup — usually no action is needed.

When the run finishes, you will see something like:

```text
Extraction results saved to: data/openie/11-27-10-06-openie.json
```

- Remember this file name, e.g. `11-27-10-06-openie.json`.
  From here on, `<OPENIE>` stands for a file of this kind; its internal structure is sketched below.
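
For the curious: the deletion and inspection scripts in this commit read three fields from each entry of `docs`. A minimal sketch of the shape (hashes and contents invented; real files may carry more fields):

```json
{
  "docs": [
    {
      "idx": "3f5a1c...e9",
      "extracted_entities": ["MaiBot", "LPMM"],
      "extracted_triples": [["MaiBot", "uses", "LPMM"]]
    }
  ]
}
```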

### 2.4 Step 3: import the OpenIE data into LPMM

Run:

```bash
.\.venv\Scripts\python.exe scripts/import_openie.py
```

This script will:

- Read every `*.json` file in `data/openie` and import them together;
- Write the embedding vectors of the new paragraphs to `data/embedding`;
- Build the triples into a knowledge graph under `data/rag`.

> Tip: to import "only certain batches", temporarily move the unwanted JSON files out of `data/openie` and move them back after the import.

### 2.5 Step 4: global self-check (confirm the import)

Run:

```bash
.\.venv\Scripts\python.exe scripts/inspect_lpmm_global.py
```

You will see output like:

- Paragraph vectors: `52`
- Entity vectors: `260`
- Relation vectors: `299`
- KG node total / edge total / paragraph node count / entity node count
- Previews of a few sample paragraphs and entities

As long as these numbers are greater than 0, the LPMM knowledge base has usable data; a mock-up of the output follows.
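
Roughly what this looks like with the sample numbers above (format follows the script's print statements; your numbers will differ):

```text
==== Vector store statistics ====
Paragraph vectors: 52
Entity vectors: 260
Relation vectors: 299

==== KG graph statistics ====
Node total: 312
Edge total: 640
Paragraph nodes: 52
Entity nodes: 260
```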

### 2.6 Step 5: test LPMM retrieval with the script (optional but recommended)

Run:

```bash
.\.venv\Scripts\python.exe scripts/test_lpmm_retrieval.py
```

The script will:

- Initialize LPMM automatically (loading the vector store and the knowledge graph);
- Query LPMM with a few preset questions;
- Print the raw retrieval results and which keywords were hit.

By reading the "RAW RESULT" output you can roughly judge:

- Whether knowledge highly relevant to the question is being hit;
- Whether answers change after you delete or import knowledge.

---

## 3. Ways to delete knowledge safely

> Strongly recommended: back up these directories before deleting, so you can roll back (a backup sketch follows):
>
> - `data/embedding` (vector store)
> - `data/rag` (knowledge graph)
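
A minimal backup sketch for Windows cmd, matching the command style used in this guide (the backup folder name is my own choice):

```bash
xcopy /E /I data\embedding backup\embedding
xcopy /E /I data\rag backup\rag
```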

All deletion operations use the same script:

```bash
.\.venv\Scripts\python.exe scripts/delete_lpmm_items.py [options...]
```

Script characteristics:

- Before deleting, it prints a summary: paragraphs / entities / relations to delete and the estimated node count (see the mock-up below);
- It only proceeds after you type `YES` in capitals;
- Multiple deletion strategies are supported and can be combined.
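
Before anything is deleted you will see a summary along these lines (wording per the script's log messages; the counts here are invented):

```text
=== Deletion preflight ===
Make sure data/embedding and data/rag are backed up; use --dry-run to preview if needed
Paragraphs to delete: 52
Sample: ['paragraph-3f5a1c...', ...]
Estimated total nodes to delete (paragraphs + entities): 52
Delete the data listed above? Type YES (capitals) to continue, anything else cancels:
```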

### 3.1 Delete by batch (recommended: whole-batch rollback)

Use case: a whole imported batch is bad and you want to roll it back entirely.

1. Before deleting, check the batch's state:

```bash
.\.venv\Scripts\python.exe scripts/inspect_lpmm_batch.py ^
    --openie-file data/openie/<OPENIE>.json
```

You will see, for this batch:

- Paragraphs: total count, how many remain in the vector store, how many remain in the KG;
- The same statistics for entities and relations;
- Previews of a few sample paragraphs/entities.

2. Once confirmed, delete the batch:

```bash
.\.venv\Scripts\python.exe scripts/delete_lpmm_items.py ^
    --openie-file data/openie/<OPENIE>.json ^
    --delete-entities --delete-relations --remove-orphan-entities
```

Option meanings:

- `--delete-entities`: also delete the entity vectors involved in this batch;
- `--delete-relations`: also delete the relation vectors involved in this batch;
- `--remove-orphan-entities`: additionally clean up "orphan" entity nodes that no longer participate in any edge after the deletion.

3. Check again after deleting:

```bash
.\.venv\Scripts\python.exe scripts/inspect_lpmm_batch.py ^
    --openie-file data/openie/<OPENIE>.json

.\.venv\Scripts\python.exe scripts/inspect_lpmm_global.py
```

If the batch check reports "0 remaining in the vector store / 0 remaining in the KG", the batch has been removed completely.

### 3.2 Delete by raw text paragraph (pinpoint one passage)

Use case: one specific paragraph in a source txt is wrong and you only want to delete the knowledge derived from it.

Example command:

```bash
.\.venv\Scripts\python.exe scripts/delete_lpmm_items.py ^
    --raw-file data/lpmm_raw_data/lpmm_large_sample.txt ^
    --raw-index 2
```

Notes:

- `--raw-index` counts from 1 and accepts comma-separated multiples, e.g. `1,3,5`;
- The script shows the paragraph's content preview and hash, then asks for confirmation.

### 3.3 Delete by hash list (advanced)

Use case: you have a list of paragraph hashes to delete (exported from another system, say).

Sample hash list file:

- `data/openie/lpmm_delete_test_hashes.txt`

Command:

```bash
.\.venv\Scripts\python.exe scripts/delete_lpmm_items.py ^
    --hash-file data/openie/lpmm_delete_test_hashes.txt
```

Notes:

- One entry per line, either `paragraph-xxxx` or a bare hash — the script recognizes both (see the sketch below);
- Good for precise control over which paragraphs get deleted, though preparing the list takes some technical skill.
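
A hypothetical hash file mixing both accepted forms (the hashes are invented):

```text
paragraph-9b1c0a52b4d8f3aa...
4d2f6c1a9e0b7c55...
```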

### 3.4 Delete by fuzzy keyword search (friendliest for non-technical users)

Use case: you only know that some passage contains a certain keyword, not which txt file or batch it lives in.

Example 1: delete paragraphs related to "近义词扩展" ("synonym expansion")

```bash
.\.venv\Scripts\python.exe scripts/delete_lpmm_items.py --search-text "近义词扩展" --search-limit 5
```

Example 2: delete some paragraphs strongly related to "LPMM"

```bash
.\.venv\Scripts\python.exe scripts/delete_lpmm_items.py --search-text "LPMM" --search-limit 20
```

How it runs:

1. The script searches the current paragraph store for paragraphs containing the keyword;
2. It lists the first N candidates (`--search-limit` controls how many);
3. It asks you to enter the indices to delete, e.g. `1,2,5`;
4. It asks you to type `YES` before actually deleting anything.

> Tips:
>
> - On first use, add `--dry-run` to preview:
>   ```bash
>   .\.venv\Scripts\python.exe scripts/delete_lpmm_items.py ^
>       --search-text "LPMM" ^
>       --search-limit 20 ^
>       --dry-run
>   ```
> - Once the candidate list really is what you want to delete, drop `--dry-run` and run it for real.

---

## 4. Self-checks: confirming that imports / deletions took effect

### 4.1 Global state check

After every import or deletion, it is worth running:

```bash
.\.venv\Scripts\python.exe scripts/inspect_lpmm_global.py
```

Here you can see:

- Paragraph, entity, and relation vector counts;
- The knowledge graph's node total, edge total, and paragraph/entity node counts;
- A few "remaining paragraph samples" and "remaining entity samples".

What to look for:

- After an import: the numbers should clearly go up (the new data took effect);
- After a deletion: the numbers should clearly go down (the deletion took effect).

### 4.2 Local state of one batch

To confirm whether "the batch belonging to one OpenIE file" still exists, use:

```bash
.\.venv\Scripts\python.exe scripts/inspect_lpmm_batch.py --openie-file data/openie/<OPENIE>.json
```

The output includes:

- The batch's total paragraph / entity / relation counts;
- How many remain in the vector store and how many in the KG;
- A few sample paragraphs/entities that still exist.

Typical usage:

- Check right after an import: confirm the batch was written;
- Check again after a deletion: confirm the batch was cleared.

### 4.3 Retrieval regression test

After every import or deletion you can quickly verify retrieval with:

```bash
.\.venv\Scripts\python.exe scripts/test_lpmm_retrieval.py
```

It will:

- Initialize LPMM (loading the current vector store and knowledge graph);
- Run retrieval on a few preset questions (including ones about LPMM and its configuration);
- Print the retrieval results and the keyword hits.

By comparing outputs from different points in time you can tell:

- Whether certain knowledge has really been deleted (it stops appearing in answers);
- Whether newly added knowledge can be retrieved.

---

## 5. Common messages and caveats

1. **Should I worry about "network error (retryable)" messages?**

   - No.
   - These logs mean the script is handling network jitter automatically; the retries usually succeed.
   - As long as the script does not end with "retries exhausted, exiting", the import/extraction results are generally valid.

2. **Can a deletion wipe everything out at once?**

   - Not silently:
     - Every deletion prints a summary first;
     - Nothing runs until you type `YES`;
     - Large batches are additionally guarded by `--max-delete-nodes`, which warns when the threshold is exceeded.
   - Still, it is wise to:
     - Back up `data/embedding` and `data/rag` before any large deletion;
     - Preview the deletion list with `--dry-run` first.

3. **Can I import multiple times? Do I need to clear first?**

   - Yes, you can import repeatedly; the system deduplicates on the hash of the paragraph content;
   - No need to clear each time, as long as you want the old data kept;
   - If you really want a fresh start:
     - Back up, then delete `data/embedding` and `data/rag`;
     - Re-run the import pipeline.

4. **Where is the LPMM switch?**

   - Config file: `config/bot_config.toml`;
   - Section: `[lpmm_knowledge]`;
   - It contains the `enable = true/false` switch:
     - `true`: the LPMM knowledge base is on and used during QA;
     - `false`: LPMM is off; even if the knowledge base has data, it takes no part in answers.
   - Restart the main program after changing it.

---

If you are an ordinary user, one sentence is enough to remember:

> "Importing is three steps: preprocess → extract information → import OpenIE;
> deleting is three steps: check first → delete → check again."

Follow the commands in this guide step by step and you can manage your LPMM knowledge base safely.

scripts/delete_lpmm_items.py
@ -0,0 +1,360 @@
import argparse
import sys
from pathlib import Path
from typing import List, Tuple, Dict, Any
import json
import os

# Force UTF-8 so console encoding issues cannot crash the script
try:
    if hasattr(sys.stdout, "reconfigure"):
        sys.stdout.reconfigure(encoding="utf-8")
    if hasattr(sys.stderr, "reconfigure"):
        sys.stderr.reconfigure(encoding="utf-8")
except Exception:
    pass

# Make the src package importable
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "..")))

from src.chat.knowledge.embedding_store import EmbeddingManager
from src.chat.knowledge.kg_manager import KGManager
from src.common.logger import get_logger
from src.chat.knowledge.utils.hash import get_sha256

logger = get_logger("delete_lpmm_items")


def read_hashes(file_path: Path) -> List[str]:
    """Read a hash list, skipping blank lines."""
    hashes: List[str] = []
    for line in file_path.read_text(encoding="utf-8").splitlines():
        val = line.strip()
        if not val:
            continue
        hashes.append(val)
    return hashes


def read_openie_hashes(file_path: Path) -> List[str]:
    """Extract the idx fields from an OpenIE JSON file as paragraph hashes."""
    data: Dict[str, Any] = json.loads(file_path.read_text(encoding="utf-8"))
    docs = data.get("docs", []) if isinstance(data, dict) else []
    hashes: List[str] = []
    for doc in docs:
        idx = doc.get("idx") if isinstance(doc, dict) else None
        if isinstance(idx, str) and idx.strip():
            hashes.append(idx.strip())
    return hashes


def normalize_paragraph_keys(raw_hashes: List[str]) -> Tuple[List[str], List[str]]:
    """Normalize the input into two lists: full keys and bare hashes."""
    keys: List[str] = []
    hashes: List[str] = []
    for h in raw_hashes:
        if h.startswith("paragraph-"):
            keys.append(h)
            hashes.append(h.replace("paragraph-", "", 1))
        else:
            keys.append(f"paragraph-{h}")
            hashes.append(h)
    return keys, hashes


def main():
    parser = argparse.ArgumentParser(description="Delete paragraphs from LPMM knowledge base (vectors + graph).")
    parser.add_argument("--hash-file", help="Text file with one paragraph hash (or prefixed key) per line")
    parser.add_argument("--openie-file", help="OpenIE output file (JSON); its docs.idx values become the paragraph hashes to delete")
    parser.add_argument("--raw-file", help="Raw txt corpus file (paragraphs split on blank lines); combine with --raw-index")
    parser.add_argument(
        "--raw-index",
        help="1-based paragraph indices to delete within --raw-file; comma-separated, e.g. 1,3",
    )
    parser.add_argument("--search-text", help="Substring-search the current paragraph store and interactively pick paragraphs to delete")
    parser.add_argument(
        "--search-limit",
        type=int,
        default=10,
        help="Maximum number of candidate paragraphs shown in --search-text mode",
    )
    parser.add_argument("--delete-entities", action="store_true", help="Also delete the entity nodes/embeddings from the OpenIE file")
    parser.add_argument("--delete-relations", action="store_true", help="Also delete the relation embeddings from the OpenIE file")
    parser.add_argument("--remove-orphan-entities", action="store_true", help="Remove entity nodes orphaned by the deletion")
    parser.add_argument("--dry-run", action="store_true", help="Only preview what would be deleted; change nothing")
    parser.add_argument("--yes", action="store_true", help="Skip the interactive confirmation and delete immediately (use with care)")
    parser.add_argument(
        "--max-delete-nodes",
        type=int,
        default=2000,
        help="Maximum number of nodes (paragraphs + entities) deletable in one run; beyond this you must confirm explicitly or raise the limit",
    )
    args = parser.parse_args()

    # At least one source is required
    if not (args.hash_file or args.openie_file or args.raw_file or args.search_text):
        logger.error("One of --hash-file / --openie-file / --raw-file / --search-text is required")
        sys.exit(1)

    raw_hashes: List[str] = []
    raw_entities: List[str] = []
    raw_relations: List[str] = []

    if args.hash_file:
        hash_file = Path(args.hash_file)
        if not hash_file.exists():
            logger.error(f"Hash file does not exist: {hash_file}")
            sys.exit(1)
        raw_hashes.extend(read_hashes(hash_file))

    if args.openie_file:
        openie_path = Path(args.openie_file)
        if not openie_path.exists():
            logger.error(f"OpenIE file does not exist: {openie_path}")
            sys.exit(1)
        # Paragraphs
        raw_hashes.extend(read_openie_hashes(openie_path))
        # Entities/relations (entities include both extracted_entities and the triple
        # subjects/objects, matching how the KG is built)
        try:
            data = json.loads(openie_path.read_text(encoding="utf-8"))
            docs = data.get("docs", []) if isinstance(data, dict) else []
            for doc in docs:
                if not isinstance(doc, dict):
                    continue
                ents = doc.get("extracted_entities", [])
                if isinstance(ents, list):
                    raw_entities.extend([e for e in ents if isinstance(e, str)])
                triples = doc.get("extracted_triples", [])
                if isinstance(triples, list):
                    for t in triples:
                        if isinstance(t, list) and len(t) == 3:
                            subj, _, obj = t
                            if isinstance(subj, str):
                                raw_entities.append(subj)
                            if isinstance(obj, str):
                                raw_entities.append(obj)
                            raw_relations.append(str(tuple(t)))
        except Exception as e:
            logger.error(f"Failed to read the OpenIE file: {e}")
            sys.exit(1)

    # Select paragraphs from the raw txt corpus by index
    if args.raw_file:
        raw_path = Path(args.raw_file)
        if not raw_path.exists():
            logger.error(f"Raw corpus file does not exist: {raw_path}")
            sys.exit(1)
        text = raw_path.read_text(encoding="utf-8")
        paragraphs: List[str] = []
        buf = []
        for line in text.splitlines():
            if line.strip() == "":
                if buf:
                    paragraphs.append("\n".join(buf).strip())
                    buf = []
            else:
                buf.append(line)
        if buf:
            paragraphs.append("\n".join(buf).strip())

        if not paragraphs:
            logger.error(f"No paragraphs could be parsed from the raw corpus file {raw_path}")
            sys.exit(1)

        if not args.raw_index:
            logger.info(f"{raw_path} parsed into {len(paragraphs)} paragraphs; pick the ones to delete with --raw-index, e.g. --raw-index 1,3")
            sys.exit(1)

        # Parse the (1-based) index list
        try:
            idx_list = [int(x.strip()) for x in str(args.raw_index).split(",") if x.strip()]
        except ValueError:
            logger.error(f"Could not parse --raw-index: {args.raw_index}")
            sys.exit(1)

        for idx in idx_list:
            if idx < 1 or idx > len(paragraphs):
                logger.error(f"--raw-index contains the invalid index {idx} (valid range 1~{len(paragraphs)})")
                sys.exit(1)

        logger.info("Paragraphs selected from the raw corpus:")
        for idx in idx_list:
            para = paragraphs[idx - 1]
            h = get_sha256(para)
            logger.info(f"- paragraph {idx}, hash={h}, preview: {para[:80]}")
            raw_hashes.append(h)

    # Substring-search the existing store and pick candidates interactively
    if args.search_text:
        search_text = args.search_text.strip()
        if not search_text:
            logger.error("--search-text must not be empty")
            sys.exit(1)
        logger.info(f"Searching the existing paragraph store for keyword: {search_text!r}")
        em_search = EmbeddingManager()
        try:
            em_search.load_from_file()
        except Exception as e:
            logger.error(f"Failed to load the embedding store; --search-text is unavailable: {e}")
            sys.exit(1)

        candidates = []
        for key, item in em_search.paragraphs_embedding_store.store.items():
            if search_text in item.str:
                candidates.append((key, item.str))
                if len(candidates) >= args.search_limit:
                    break

        if not candidates:
            logger.info("No paragraph containing that keyword was found in the existing store")
        else:
            logger.info("Candidate paragraphs (enter the numbers to delete, comma-separated for multiple):")
            for i, (key, text) in enumerate(candidates, start=1):
                logger.info(f"{i}. {key} | {text[:80]}")
            choice = input("Enter the numbers to delete (e.g. 1,3), or press Enter to cancel: ").strip()
            if choice:
                try:
                    idxs = [int(x.strip()) for x in choice.split(",") if x.strip()]
                except ValueError:
                    logger.error("Could not parse the number list; --search-text deletion cancelled")
                else:
                    for i in idxs:
                        if 1 <= i <= len(candidates):
                            key, _ = candidates[i - 1]
                            # key is already a full paragraph-xxx key
                            if key.startswith("paragraph-"):
                                raw_hashes.append(key.split("paragraph-", 1)[1])
                        else:
                            logger.warning(f"Ignoring invalid number: {i}")

    # De-duplicate while preserving order
    seen = set()
    raw_hashes = [h for h in raw_hashes if not (h in seen or seen.add(h))]

    if not raw_hashes:
        logger.error("No hashes to delete were collected; nothing to do")
        sys.exit(1)

    keys, pg_hashes = normalize_paragraph_keys(raw_hashes)

    ent_hashes: List[str] = []
    rel_hashes: List[str] = []
    if args.delete_entities and raw_entities:
        ent_hashes = [get_sha256(e) for e in raw_entities]
    if args.delete_relations and raw_relations:
        rel_hashes = [get_sha256(r) for r in raw_relations]

    logger.info("=== Deletion preflight ===")
    logger.info("Make sure data/embedding and data/rag are backed up; use --dry-run to preview if needed")
    logger.info(f"Paragraphs to delete: {len(keys)}")
    logger.info(f"Sample: {keys[:5]}")
    if ent_hashes:
        logger.info(f"Entities to delete: {len(ent_hashes)}")
    if rel_hashes:
        logger.info(f"Relations to delete: {len(rel_hashes)}")

    total_nodes_to_delete = len(pg_hashes) + (len(ent_hashes) if args.delete_entities else 0)
    logger.info(f"Estimated total nodes to delete (paragraphs + entities): {total_nodes_to_delete}")

    if args.dry_run:
        logger.info("dry-run mode; nothing was deleted")
        return

    # Guard against oversized batch deletions
    if total_nodes_to_delete > args.max_delete_nodes and not args.yes:
        logger.error(
            f"This run would delete {total_nodes_to_delete} nodes, exceeding the limit of {args.max_delete_nodes}."
            " To avoid accidental mass deletion, shrink the batch or raise --max-delete-nodes, and confirm explicitly with --yes."
        )
        sys.exit(1)

    # Interactive confirmation
    if not args.yes:
        confirm = input("Delete the data listed above? Type YES (capitals) to continue, anything else cancels: ").strip()
        if confirm != "YES":
            logger.info("Deletion cancelled by the user")
            return

    # Load the embeddings and the graph
    embed_manager = EmbeddingManager()
    kg_manager = KGManager()

    try:
        embed_manager.load_from_file()
        kg_manager.load_from_file()
    except Exception as e:
        logger.error(f"Failed to load the existing knowledge base: {e}")
        sys.exit(1)

    # Record pre-deletion statistics for comparison
    before_para_vec = len(embed_manager.paragraphs_embedding_store.store)
    before_ent_vec = len(embed_manager.entities_embedding_store.store)
    before_rel_vec = len(embed_manager.relation_embedding_store.store)
    before_nodes = len(kg_manager.graph.get_node_list())
    before_edges = len(kg_manager.graph.get_edge_list())
    logger.info(
        f"Before deletion: paragraph vectors={before_para_vec}, entity vectors={before_ent_vec}, relation vectors={before_rel_vec}, "
        f"KG nodes={before_nodes}, KG edges={before_edges}"
    )

    # Delete vectors
    deleted, skipped = embed_manager.paragraphs_embedding_store.delete_items(keys)
    embed_manager.stored_pg_hashes = set(embed_manager.paragraphs_embedding_store.store.keys())
    logger.info(f"Paragraph vectors deleted: {deleted}, skipped: {skipped}")
    ent_deleted = ent_skipped = rel_deleted = rel_skipped = 0
    if ent_hashes:
        ent_keys = [f"entity-{h}" for h in ent_hashes]
        ent_deleted, ent_skipped = embed_manager.entities_embedding_store.delete_items(ent_keys)
        logger.info(f"Entity vectors deleted: {ent_deleted}, skipped: {ent_skipped}")
    if rel_hashes:
        rel_keys = [f"relation-{h}" for h in rel_hashes]
        rel_deleted, rel_skipped = embed_manager.relation_embedding_store.delete_items(rel_keys)
        logger.info(f"Relation vectors deleted: {rel_deleted}, skipped: {rel_skipped}")

    # Delete graph nodes/edges
    kg_result = kg_manager.delete_paragraphs(
        pg_hashes,
        ent_hashes=ent_hashes if args.delete_entities else None,
        remove_orphan_entities=args.remove_orphan_entities,
    )
    logger.info(
        f"KG deletion done. Deleted: {kg_result.get('deleted', 0)}, skipped: {kg_result.get('skipped', 0)}, "
        f"orphan entities removed: {kg_result.get('orphan_removed', 0)}"
    )

    # Rebuild the index and save
    logger.info("Rebuilding the Faiss index and saving the embedding files...")
    embed_manager.rebuild_faiss_index()
    embed_manager.save_to_file()

    logger.info("Saving the KG data...")
    kg_manager.save_to_file()

    # Post-deletion statistics
    after_para_vec = len(embed_manager.paragraphs_embedding_store.store)
    after_ent_vec = len(embed_manager.entities_embedding_store.store)
    after_rel_vec = len(embed_manager.relation_embedding_store.store)
    after_nodes = len(kg_manager.graph.get_node_list())
    after_edges = len(kg_manager.graph.get_edge_list())

    logger.info(
        "After deletion: paragraph vectors=%d(%+d), entity vectors=%d(%+d), relation vectors=%d(%+d), KG nodes=%d(%+d), KG edges=%d(%+d)"
        % (
            after_para_vec,
            after_para_vec - before_para_vec,
            after_ent_vec,
            after_ent_vec - before_ent_vec,
            after_rel_vec,
            after_rel_vec - before_rel_vec,
            after_nodes,
            after_nodes - before_nodes,
            after_edges,
            after_edges - before_edges,
        )
    )

    logger.info("Deletion flow finished")


if __name__ == "__main__":
    main()

scripts/info_extraction.py
@ -131,6 +131,13 @@ def main():  # sourcery skip: comprehension-to-generator, extract-method
        logger.info("User cancelled the operation")
        print("Operation cancelled")
        sys.exit(1)

    # Friendly hint: "network error (retryable)" logs are normal automatic retries,
    # so users do not mistake them for task failure
    print(
        "\nNote: if you see model logs such as 'network error (retryable)' during extraction, "
        "the system is automatically retrying the request; this normally does not affect the import. Please be patient.\n"
    )

    print("\n" + "=" * 40 + "\n")
    ensure_dirs()  # make sure the directories exist
    logger.info("-------- Running information extraction --------\n")

scripts/inspect_lpmm_batch.py
@ -0,0 +1,132 @@
import argparse
import json
import os
import sys
from pathlib import Path
from typing import List, Tuple

# Make src.* importable
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "..")))

from src.chat.knowledge.utils.hash import get_sha256
from src.chat.knowledge.embedding_store import EmbeddingManager
from src.chat.knowledge.kg_manager import KGManager
from src.common.logger import get_logger

logger = get_logger("inspect_lpmm_batch")


def load_openie_hashes(path: Path) -> Tuple[List[str], List[str], List[str]]:
    """Extract paragraph / entity / relation hashes from an OpenIE JSON file.

    Note: entities include both the extracted_entities items and the triple
    subjects/objects, to stay consistent with how the KG is built.
    """
    with path.open("r", encoding="utf-8") as f:
        data = json.load(f)

    pg_hashes: List[str] = []
    ent_hashes: List[str] = []
    rel_hashes: List[str] = []

    for doc in data.get("docs", []):
        if not isinstance(doc, dict):
            continue
        idx = doc.get("idx")
        if isinstance(idx, str) and idx.strip():
            pg_hashes.append(idx.strip())

        ents = doc.get("extracted_entities", [])
        if isinstance(ents, list):
            for e in ents:
                if isinstance(e, str):
                    ent_hashes.append(get_sha256(e))

        triples = doc.get("extracted_triples", [])
        if isinstance(triples, list):
            for t in triples:
                if isinstance(t, list) and len(t) == 3:
                    # Subjects/objects take part in the graph as entities
                    subj, _, obj = t
                    if isinstance(subj, str):
                        ent_hashes.append(get_sha256(subj))
                    if isinstance(obj, str):
                        ent_hashes.append(get_sha256(obj))
                    rel_hashes.append(get_sha256(str(tuple(t))))

    # De-duplicate while preserving order
    def unique(seq: List[str]) -> List[str]:
        seen = set()
        return [x for x in seq if not (x in seen or seen.add(x))]

    return unique(pg_hashes), unique(ent_hashes), unique(rel_hashes)


def main() -> None:
    parser = argparse.ArgumentParser(
        description="Check how much of the batch in a given OpenIE file still exists in the current vector store and KG (for verifying deletions)."
    )
    parser.add_argument("--openie-file", required=True, help="Path to the OpenIE output JSON file")
    args = parser.parse_args()

    openie_path = Path(args.openie_file)
    if not openie_path.exists():
        logger.error(f"OpenIE file does not exist: {openie_path}")
        sys.exit(1)

    pg_hashes, ent_hashes, rel_hashes = load_openie_hashes(openie_path)
    logger.info(
        f"Parsed from {openie_path.name}: {len(pg_hashes)} paragraphs, {len(ent_hashes)} entities, {len(rel_hashes)} relations"
    )

    # Load the current embeddings and KG
    em = EmbeddingManager()
    kg = KGManager()
    try:
        em.load_from_file()
        kg.load_from_file()
    except Exception as e:
        logger.error(f"Failed to load the current knowledge base: {e}")
        sys.exit(1)

    graph_nodes = set(kg.graph.get_node_list())

    # Check paragraphs
    pg_keys = [f"paragraph-{h}" for h in pg_hashes]
    pg_in_vec = sum(1 for k in pg_keys if k in em.paragraphs_embedding_store.store)
    pg_in_kg = sum(1 for k in pg_keys if k in graph_nodes)

    # Check entities
    ent_keys = [f"entity-{h}" for h in ent_hashes]
    ent_in_vec = sum(1 for k in ent_keys if k in em.entities_embedding_store.store)
    ent_in_kg = sum(1 for k in ent_keys if k in graph_nodes)

    # Check relations (vector store only)
    rel_keys = [f"relation-{h}" for h in rel_hashes]
    rel_in_vec = sum(1 for k in rel_keys if k in em.relation_embedding_store.store)

    print("==== Batch presence (compare before/after deletion) ====")
    print(f"Paragraphs: total {len(pg_keys)}, in vector store {pg_in_vec}, in KG {pg_in_kg}")
    print(f"Entities: total {len(ent_keys)}, in vector store {ent_in_vec}, in KG {ent_in_kg}")
    print(f"Relations: total {len(rel_keys)}, in vector store {rel_in_vec}")

    # Print a few surviving samples so content can be eyeballed
    sample_pg = [k for k in pg_keys if k in graph_nodes][:3]
    if sample_pg:
        print("\nParagraph nodes still in the KG:")
        for k in sample_pg:
            nd = kg.graph[k]
            content = nd["content"] if "content" in nd else k
            print(f"- {k}: {content[:80]}")

    sample_ent = [k for k in ent_keys if k in graph_nodes][:3]
    if sample_ent:
        print("\nEntity nodes still in the KG:")
        for k in sample_ent:
            nd = kg.graph[k]
            content = nd["content"] if "content" in nd else k
            print(f"- {k}: {content[:80]}")


if __name__ == "__main__":
    main()

scripts/inspect_lpmm_global.py
@ -0,0 +1,71 @@
import os
import sys
from typing import Set

# Make src.* importable
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "..")))

from src.chat.knowledge.embedding_store import EmbeddingManager
from src.chat.knowledge.kg_manager import KGManager
from src.common.logger import get_logger

logger = get_logger("inspect_lpmm_global")


def main() -> None:
    """Inspect the vector and KG state of the whole store (all batches), e.g. to see how a deletion affected the remaining data."""
    em = EmbeddingManager()
    kg = KGManager()

    try:
        em.load_from_file()
        kg.load_from_file()
    except Exception as e:
        logger.error(f"Failed to load the current knowledge base: {e}")
        sys.exit(1)

    # Vector store statistics
    para_cnt = len(em.paragraphs_embedding_store.store)
    ent_cnt_vec = len(em.entities_embedding_store.store)
    rel_cnt_vec = len(em.relation_embedding_store.store)

    # KG statistics
    nodes = kg.graph.get_node_list()
    edges = kg.graph.get_edge_list()
    node_set: Set[str] = set(nodes)

    para_nodes = [n for n in nodes if n.startswith("paragraph-")]
    ent_nodes = [n for n in nodes if n.startswith("entity-")]

    print("==== Vector store statistics ====")
    print(f"Paragraph vectors: {para_cnt}")
    print(f"Entity vectors: {ent_cnt_vec}")
    print(f"Relation vectors: {rel_cnt_vec}")

    print("\n==== KG graph statistics ====")
    print(f"Node total: {len(nodes)}")
    print(f"Edge total: {len(edges)}")
    print(f"Paragraph nodes: {len(para_nodes)}")
    print(f"Entity nodes: {len(ent_nodes)}")

    # State of ent_appear_cnt
    ent_cnt_meta = len(kg.ent_appear_cnt)
    print(f"\nEntity count table entries: {ent_cnt_meta}")

    # Sample a few remaining paragraphs/entities
    print("\n==== Remaining paragraph samples (up to 3) ====")
    for nid in para_nodes[:3]:
        nd = kg.graph[nid]
        content = nd["content"] if "content" in nd else nid
        print(f"- {nid}: {content[:80]}")

    print("\n==== Remaining entity samples (up to 5) ====")
    for nid in ent_nodes[:5]:
        nd = kg.graph[nid]
        content = nd["content"] if "content" in nd else nid
        print(f"- {nid}: {content[:80]}")


if __name__ == "__main__":
    main()

scripts/raw_data_preprocessor.py
@ -1,9 +1,9 @@
 import os
 from pathlib import Path
+import sys  # newly added import of the sys module
-from src.chat.knowledge.utils.hash import get_sha256

 sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "..")))
+from src.chat.knowledge.utils.hash import get_sha256
 from src.common.logger import get_logger

 logger = get_logger("lpmm")
@ -59,10 +59,11 @@ def load_raw_data() -> tuple[list[str], list[str]]:
     - raw_data: list of raw data items
     - sha256_list: SHA256 collection of the raw data
     """
-    raw_data = _process_multi_files()
+    raw_paragraphs = _process_multi_files()
     sha256_list = []
+    sha256_set = set()
-    for item in raw_data:
+    raw_data: list[str] = []
+    for item in raw_paragraphs:
         if not isinstance(item, str):
             logger.warning(f"Wrong data type: {item}")
             continue

scripts/test_lpmm_retrieval.py
@ -0,0 +1,93 @@
import asyncio
import os
import sys
from typing import List, Dict, Any

# Force UTF-8 so console encoding errors cannot break embedding loading
try:
    if hasattr(sys.stdout, "reconfigure"):
        sys.stdout.reconfigure(encoding="utf-8")
    if hasattr(sys.stderr, "reconfigure"):
        sys.stderr.reconfigure(encoding="utf-8")
except Exception:
    pass

# Make src.* importable
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "..")))

from src.common.logger import get_logger
from src.config.config import global_config
from src.chat.knowledge import lpmm_start_up
from src.memory_system.retrieval_tools.query_lpmm_knowledge import query_lpmm_knowledge

logger = get_logger("test_lpmm_retrieval")


# The queries and expected keywords target the Chinese sample corpus shipped with
# the repository, so they are kept as-is.
TEST_CASES: List[Dict[str, Any]] = [
    {
        "name": "回滚一批知识",  # "roll back a batch of knowledge"
        "query": "LPMM是什么?",  # "what is LPMM?"
        "expect_keywords": ["哈希列表", "删除脚本", "OpenIE"],
    },
    {
        "name": "调整 LPMM 检索参数",  # "tune the LPMM retrieval parameters"
        "query": "不同用词习惯带来的检索偏差该如何解决",
        "expect_keywords": ["bot_config.toml", "lpmm_knowledge", "qa_paragraph_search_top_k"],
    },
]


async def run_tests() -> None:
    """Simple test of the LPMM knowledge base's retrieval ability."""
    if not global_config.lpmm_knowledge.enable:
        logger.warning("lpmm_knowledge.enable is False in the current config; the retrieval test may simply report 'not enabled'.")

    logger.info("Initializing the LPMM knowledge base...")
    lpmm_start_up()
    logger.info("LPMM knowledge base initialized; running the test cases.")

    for case in TEST_CASES:
        name = case["name"]
        query = case["query"]
        expect_keywords: List[str] = case.get("expect_keywords", [])

        print("\n" + "=" * 60)
        print(f"[TEST] {name}")
        print(f"[Q] {query}")

        result = await query_lpmm_knowledge(query, limit=3)

        print("\n[RAW RESULT]")
        print(result)

        status = "UNKNOWN"
        hit_keywords: List[str] = []

        if isinstance(result, str):
            # These markers match the Chinese status strings returned by
            # query_lpmm_knowledge, so they must stay in Chinese:
            # 未启用 = not enabled, 未初始化 = not initialized,
            # 查询失败 = query failed, 未找到与 = nothing found for ...
            if "未启用" in result or "未初始化" in result or "查询失败" in result:
                status = "ERROR"
            elif "未找到与" in result:
                status = "NO_HIT"
            else:
                if expect_keywords:
                    hit_keywords = [kw for kw in expect_keywords if kw in result]
                    status = "PASS" if hit_keywords else "WARN"
                else:
                    status = "PASS"

        print("\n[CHECK]")
        print(f"Status: {status}")
        if expect_keywords:
            print(f"Expected keywords: {expect_keywords}")
            print(f"Hit keywords: {hit_keywords}")

    print("\n" + "=" * 60)
    print("LPMM retrieval test finished. Use each case's Status and keyword hits to judge whether retrieval meets expectations.")


def main() -> None:
    asyncio.run(run_tests())


if __name__ == "__main__":
    main()

src/chat/knowledge/kg_manager.py
@ -1,7 +1,8 @@
 import json
 import os
 import time
-from typing import Dict, List, Tuple
+from typing import Dict, List, Tuple, Set
+import xml.etree.ElementTree as ET

 import numpy as np
 import pandas as pd
@ -98,6 +99,28 @@ class KGManager:
        # Load the KG
        self.graph = di_graph.load_from_file(self.graph_data_path)

    def _rebuild_metadata_from_graph(self) -> None:
        """Rebuild stored_paragraph_hashes and ent_appear_cnt from the current graph."""
        nodes = self.graph.get_node_list()
        edges = self.graph.get_edge_list()

        # Paragraph hashes: node ids of the form paragraph-{hash}
        self.stored_paragraph_hashes = set()
        for node_id in nodes:
            if node_id.startswith("paragraph-"):
                self.stored_paragraph_hashes.add(node_id.split("paragraph-", 1)[1])

        # Entity appearance counts: derived from the weights of entity -> paragraph edges
        ent_appear_cnt: Dict[str, float] = {}
        for edge_tuple in edges:
            src, tgt = edge_tuple[0], edge_tuple[1]
            if src.startswith("entity") and tgt.startswith("paragraph"):
                edge_data = self.graph[src, tgt]
                weight = edge_data["weight"] if "weight" in edge_data else 1.0
                ent_appear_cnt[src] = ent_appear_cnt.get(src, 0.0) + float(weight)

        self.ent_appear_cnt = ent_appear_cnt

    def _build_edges_between_ent(
        self,
        node_to_node: Dict[Tuple[str, str], float],
@ -149,6 +172,13 @@ class KGManager:
            ent_hash_list.add("entity" + "-" + get_sha256(triple[0]))
            ent_hash_list.add("entity" + "-" + get_sha256(triple[2]))
        ent_hash_list = list(ent_hash_list)
        # Performance guard: cap how many entities take part in synonym linking
        max_synonym_entities = global_config.lpmm_knowledge.max_synonym_entities
        if max_synonym_entities and len(ent_hash_list) > max_synonym_entities:
            logger.warning(
                f"Synonym-linking entity count {len(ent_hash_list)} exceeds the threshold {max_synonym_entities}; skipping synonym edge construction to protect performance"
            )
            return 0

        synonym_hash_set = set()
        synonym_result = {}
@ -329,6 +359,14 @@ class KGManager:
        embed_manager: the EmbeddingManager object
        """
        # Performance guard: when PPR is disabled or over the caps, return the plain vector search results directly
        if (
            not global_config.lpmm_knowledge.enable_ppr
            or len(self.graph.get_node_list()) > global_config.lpmm_knowledge.ppr_node_cap
            or len(relation_search_result) > global_config.lpmm_knowledge.ppr_relation_cap
        ):
            logger.info("PPR disabled or over its thresholds; using pure vector retrieval results")
            return paragraph_search_result, None
        # Full set of nodes present in the graph
        existed_nodes = self.graph.get_node_list()

        # Prepare the data used by PPR
@ -357,7 +395,15 @@ class KGManager:
         ent_mean_scores = {}  # record each entity's mean similarity
         for ent_hash, scores in ent_sim_scores.items():
             # Sum the similarities first, then divide by the entity count to get the final weight
-            ent_weights[ent_hash] = float(np.sum(scores)) / self.ent_appear_cnt[ent_hash]
+            # Guard: some entities may only have entity-entity relations in the current graph and never appear in ent_appear_cnt
+            appear_cnt = self.ent_appear_cnt.get(ent_hash)
+            if not appear_cnt or appear_cnt <= 0:
+                logger.debug(
+                    f"Entity {ent_hash} is missing from ent_appear_cnt or has a zero count; "
+                    f"using 1.0 as the default appearance count in the weight computation"
+                )
+                appear_cnt = 1.0
+            ent_weights[ent_hash] = float(np.sum(scores)) / float(appear_cnt)
             # Record the entity's mean similarity for the later top-k filtering
             ent_mean_scores[ent_hash] = float(np.mean(scores))
         del ent_sim_scores
@ -434,3 +480,115 @@ class KGManager:
        passage_node_res = sorted(passage_node_res, key=lambda item: item[1], reverse=True)

        return passage_node_res, ppr_node_weights

    def delete_paragraphs(
        self,
        pg_hashes: List[str],
        ent_hashes: List[str] | None = None,
        remove_orphan_entities: bool = False,
    ) -> Dict[str, int]:
        """Delete paragraph/entity nodes and their edges (GraphML-based); optionally clean up orphaned entities, then rebuild metadata."""
        # Node ids to delete
        nodes_to_delete: Set[str] = {f"paragraph-{h}" for h in pg_hashes}
        if ent_hashes:
            nodes_to_delete.update({f"entity-{h}" for h in ent_hashes})

        if not os.path.exists(self.graph_data_path):
            raise FileNotFoundError(f"KG graph file {self.graph_data_path} does not exist")

        tree = ET.parse(self.graph_data_path)
        root = tree.getroot()

        # GraphML may carry a namespace, so match on the tag suffix
        def is_node(elem: ET.Element) -> bool:
            return elem.tag.endswith("node")

        def is_edge(elem: ET.Element) -> bool:
            return elem.tag.endswith("edge")

        graph_elem = None
        for child in root:
            if child.tag.endswith("graph"):
                graph_elem = child
                break
        if graph_elem is None:
            raise RuntimeError("No <graph> element found in the GraphML")

        # Collect the existing nodes
        existing_nodes: Set[str] = set()
        for elem in graph_elem:
            if is_node(elem):
                node_id = elem.get("id")
                if node_id:
                    existing_nodes.add(node_id)

        deleted_nodes = len(nodes_to_delete & existing_nodes)
        skipped_nodes = len(nodes_to_delete - existing_nodes)

        # First remove the requested nodes and their edges
        # Remove nodes
        for elem in list(graph_elem):
            if is_node(elem):
                node_id = elem.get("id")
                if node_id and node_id in nodes_to_delete:
                    graph_elem.remove(elem)

        # Remove incident edges
        for elem in list(graph_elem):
            if is_edge(elem):
                src = elem.get("source")
                tgt = elem.get("target")
                if src in nodes_to_delete or tgt in nodes_to_delete:
                    graph_elem.remove(elem)

        orphan_removed = 0
        if remove_orphan_entities:
            # Compute which nodes still take part in at least one edge
            used_nodes: Set[str] = set()
            for elem in graph_elem:
                if is_edge(elem):
                    src = elem.get("source")
                    tgt = elem.get("target")
                    if src:
                        used_nodes.add(src)
                    if tgt:
                        used_nodes.add(tgt)

            # Find entity nodes with no edges left
            orphan_entities: Set[str] = set()
            for elem in graph_elem:
                if is_node(elem):
                    node_id = elem.get("id")
                    if node_id and node_id.startswith("entity") and node_id not in used_nodes:
                        orphan_entities.add(node_id)

            orphan_removed = len(orphan_entities)

            if orphan_entities:
                # Remove the orphaned entity nodes
                for elem in list(graph_elem):
                    if is_node(elem):
                        node_id = elem.get("id")
                        if node_id in orphan_entities:
                            graph_elem.remove(elem)

                # Remove edges touching the orphaned entities (there should be none left; this is defensive cleanup)
                for elem in list(graph_elem):
                    if is_edge(elem):
                        src = elem.get("source")
                        tgt = elem.get("target")
                        if src in orphan_entities or tgt in orphan_entities:
                            graph_elem.remove(elem)

        # Write the GraphML back
        tree.write(self.graph_data_path, encoding="utf-8", xml_declaration=True)

        # Reload the graph and rebuild metadata
        self.graph = di_graph.load_from_file(self.graph_data_path)
        self._rebuild_metadata_from_graph()

        return {
            "deleted": deleted_nodes,
            "skipped": skipped_nodes,
            "orphan_removed": orphan_removed,
        }