docs(详细设计): add design for dossier package import and catalog summary

2026-06-03 20:50:27 +08:00
parent 11c20593d5
commit 18428e75fd
7 changed files with 1501 additions and 0 deletions
--- a/docs/详细设计/skill/章节点识别Skill.md
+++ b/docs/详细设计/skill/章节点识别Skill.md
@@ -0,0 +1,158 @@
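+# 章节点识别Skill (Chapter Node Classification Skill) Design
+
+## 1. Skill Positioning
+
+`章节点识别Skill` performs a first-pass identification of the chapter node and document role for each registered document, supplying structured fields for the catalog summary and the later regulatory completeness check.
+
+This Skill is rule-first and does not depend on an LLM. Model assistance may be introduced later for manual review or complex title parsing, but it is not a required V1 capability.
+
+The suggested English implementation identifier is `ChapterClassificationSkill`, used as the Python class name and as the handler registered in the Tool Registry.
+
+## 2. Input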
+
+```python
+from dataclasses import dataclass, field
+
+
+@dataclass
+class ChapterClassificationInput:
+    document_id: int
+    original_filename: str
+    relative_path: str
+    file_type: str
+    title_text: str | None = None
+    manual_hint: dict = field(default_factory=dict)
+```
+
+## 3. Output
+
+```python
+from dataclasses import dataclass
+
+
+@dataclass
+class ChapterClassificationResult:
+    document_id: int
+    chapter_code: str | None
+    chapter_name: str | None
+    document_role: str | None
+    declared_document_name: str | None
+    confidence: str
+    status: str
+    evidence: list[dict]
+```
+
+## 4. Recognition Rules
+
+### 4.1 Chapter Node Code Recognition
+
+Identify the following codes from the file name and relative path:
+
+1. `CH1.2`
+2. `CH1.4`
+3. `CH1.5`
+4. `CH1.9`
+5. `CH1.11.1`
+6. `CH1.11.5`
+7. `CH1.11.6`
+
+Example regular expression:
+
+```python
+r"CH\s*(\d+(?:\.\d+)*)"
+```
+
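+### 4.2 Chapter Name Recognition
+
+Identify the following directory names from the relative path:
+
+1. `第1章 监管信息` (Chapter 1, Regulatory Information)
+2. `第2章 综述资料` (Chapter 2, Summary Information)
+3. `第3章 非临床资料` (Chapter 3, Non-clinical Data)
+4. `第4章 临床评价资料` (Chapter 4, Clinical Evaluation Data)
+5. `第5章 产品说明书和标签样稿` (Chapter 5, Product Instructions for Use and Draft Labels)
+6. `第6章 质量管理体系文件` (Chapter 6, Quality Management System Documents)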
+
+### 4.3 Document Role Recognition
+
+| Keyword | document_role |
+|---|---|
+| `监管信息目录` | `regulatory_information_catalog` |
+| `申请表` | `application_form` |
+| `产品列表` | `product_list` |
+| `符合标准的清单` | `standard_compliance_list` |
+| `真实性声明` | `authenticity_statement` |
+| `符合性声明` | `conformity_statement` |
+| `沟通的说明` | `pre_submission_communication` |
+| `说明书` | `product_instruction` |
+
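+## 5. Core Methods
+
+### 5.1 `run(input) -> ChapterClassificationResult`
+
+Main entry point.
+
+Execution order:
+
+1. Identify the chapter name from the relative path.
+2. Identify the chapter node code from the file name.
+3. Identify the document role from the file name.
+4. If title text is available, use it to supplement the recognition.
+5. Calculate the confidence.
+6. Return the recognition result.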
+
+### 5.2 `extract_chapter_code(text) -> str | None`
+
+Extracts `CHx.x` from a path or file name.
+
+### 5.3 `extract_chapter_name(relative_path) -> str | None`
+
+Identifies the chapter name from the directory hierarchy.
+
+### 5.4 `detect_document_role(text) -> str | None`
+
+Identifies the document role from keywords and the rule table.
+
+### 5.5 `calculate_confidence(matches) -> str`
+
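+Confidence rules:
+
+1. Path, file name, and title agree: `high`
+2. File name matches but path information is missing: `medium`
+3. Only a keyword matches: `low`
+4. Nothing can be recognized: `manual_review_required`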
+
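+## 6. Technical Implementation
+
+Technologies used:
+
+1. `re`
+2. YAML rule table
+3. Optional `python-docx` extraction of the first-page title
+4. Manual correction through the Django admin
+
+Suggested rule file: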
+
+```text
+configs/registration/chapter_classification.yaml
+```
+
+## 7. Persisted Fields
+
+Suggested fields to write to `RegistrationDocument`:
+
+1. `chapter_code`
+2. `chapter_name`
+3. `document_role`
+4. `declared_document_name`
+5. `classification_confidence`
+6. `classification_status`
+7. `needs_manual_review`
+
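+## 8. Exception Handling
+
+1. No chapter node in the file name: fall back to recognition from the path.
+2. Path and file name conflict: mark for manual review.
+3. Document recognized as regulatory material but the batch is business material: flag a potential mix-in risk.
+4. One file matches multiple roles: keep the highest-priority role and record a warning.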
+
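+## 9. Test Points
+
+1. `CH1.4 申请表.docx` is recognized as `CH1.4` and `application_form`.
+2. `第1章 监管信息/CH1.2 监管信息目录.docx` is recognized with its chapter (`第1章 监管信息`, `CH1.2`) and the catalog role (`regulatory_information_catalog`).
+3. A file with no chapter node is marked for manual review.
+4. A conflict between path and file name produces a warning.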