# Detailed Design: Automatic Summary of Folder File Listings and Page Counts

## Document Information

| Item | Content |
| --- | --- |
| Requirements analysis document | docs/需求分析/1.自动汇总.md |
| Functional design document | docs/功能设计/1.自动汇总.md |
| Feature name | Automatic summary of folder file listings and page counts |
| Module | Review agent `review_agent` |
| Design date | 2026-06-05 |
| Design version | V1.0 |

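
---

## 1. Detailed Design Goals

This detailed design guides the implementation of the "automatic summary of folder file listings and page counts" feature. It covers the code layout, data models, API contracts, the background workflow, the Skill breakdown, lightweight dependencies, the three-column front-end layout, real-time status over SSE, retry on failure, and test cases.

Core constraints:

| Constraint | Description |
| --- | --- |
| Conversation binding | Uploaded files are bound to the current Conversation; each conversation has its own file set, and files must never leak across conversations |
| Store on upload | Files are saved as soon as the user drags or selects them, but no workflow starts |
| Prompt-triggered | After the user sends a message, the prompt decides whether the automatic summary workflow starts |
| Background and asynchronous | The workflow runs in the background; the workflow card in the third (right-hand) column updates in real time |
| Lightweight dependencies | Prefer the Python standard library and lightweight third-party libraries; no hard dependency on LibreOffice |
| Legacy format support | doc, xls, and ppt enter the pipeline; a page count is recorded when it can be read, otherwise an exception is recorded |
| Result archiving | Batches, files, nodes, events, detail rows, and export files are all persisted to the database |
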

---

## 2. Code Structure

### 2.1 Directory Structure

Within the existing `review_agent` app, file-processing code is reorganized by module. Django models stay together in `review_agent/models.py`; all other code goes under `review_agent/file_summary/`.

```text
review_agent/
  models.py
  urls.py
  views.py
  services.py
  file_summary/
    __init__.py
    constants.py
    schemas.py
    storage.py
    workflow.py
    events.py
    urls.py
    views.py
    services/
      __init__.py
      archive.py
      inventory.py
      page_count.py
      product_detect.py
      report.py
      export_excel.py
      workflow_trigger.py
    skills/
      __init__.py
      base.py
      registry.py
      upload_intake.py
      archive_extract.py
      file_inventory.py
      document_page_count.py
      product_detect.py
      summary_report.py
      excel_export.py
```

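### 2.2 File Responsibilities

| File | Responsibility |
| --- | --- |
| review_agent/models.py | Central definitions of Conversation, Message, and the file-summary models |
| file_summary/constants.py | Constants for statuses, nodes, file types, and event types |
| file_summary/schemas.py | Dataclass input/output structures, so the business layer does not pass loose dicts around |
| file_summary/storage.py | Path generation and saving for uploads, working directories, and export files |
| file_summary/workflow.py | WorkflowExecutor, which runs the node graph serially |
| file_summary/events.py | Workflow event persistence and SSE formatting |
| file_summary/views.py | Upload staging, workflow start, status query, SSE, and download endpoints |
| services/archive.py | Archive detection and zip/7z/rar extraction |
| services/inventory.py | File traversal and inventory generation |
| services/page_count.py | Page counting with up to 3 retries |
| services/product_detect.py | Product name detection |
| services/report.py | Markdown report and in-chat summary table generation |
| services/export_excel.py | Excel export |
| services/workflow_trigger.py | Decides from the prompt whether to trigger the automatic summary workflow |
| skills/base.py | Skill base class and the unified result structure |
| skills/registry.py | Skill registration and on-demand loading |
| skills/*.py | One Skill per workflow node |
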

---

## 3. Dependencies

### 3.1 Suggested requirements

```text
Django==5.2.14
pypdf
python-docx
python-pptx
openpyxl
xlrd
olefile
py7zr
```

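### 3.2 Per-Format Handling

| Format | Library | What is counted | On failure |
| --- | --- | --- | --- |
| pdf | pypdf | Number of PDF pages | Retry 3 times; record an exception if it still fails |
| docx | python-docx | The stored page-count property, when present | Record "page count undeterminable" if it cannot be read |
| doc | olefile | Page count from the OLE metadata | Record "page count undeterminable" if it cannot be read |
| pptx | python-pptx | Number of slides | Retry 3 times; record an exception if it still fails |
| ppt | olefile | Page or slide count from the OLE metadata | Record "page count undeterminable" if it cannot be read |
| xlsx | openpyxl | Number of worksheets | Retry 3 times; record an exception if it still fails |
| xls | xlrd | Number of worksheets | Retry 3 times; record an exception if it still fails |

Note on docx: the stored page count lives in the extended-properties part `docProps/app.xml` (the `Pages` element), whereas python-docx's `core_properties` exposes only the core properties such as title and author. Reading the page count therefore means opening that part directly. The value is whatever the authoring application last saved and may be missing or stale.
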
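### 3.3 Archive Handling

| Format | Approach | Notes |
| --- | --- | --- |
| zip | Python standard library `zipfile` | Must be supported |
| 7z | py7zr | Must be supported |
| rar | Prefer the system `7z` command | The Docker image must install 7-Zip/p7zip; depending on the distribution, RAR support may need an extra package (for example `p7zip-rar` on Debian/Ubuntu) |
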
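### 3.4 Docker Deployment Notes

The demo does not have a hard dependency on LibreOffice. If page counts for doc/docx/ppt/pptx must later match the pagination seen when the file is opened in Office, LibreOffice headless can be added to the Docker image and used for an enhanced strategy: convert to PDF, then count the PDF pages.

For reliable RAR extraction, the Docker image must install 7-Zip/p7zip and make sure the `7z` command is callable from PATH.
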

---

## 4. Data Model Design

All models live in `review_agent/models.py`, split into a "conversation models" section and a "file summary models" section.

### 4.1 FileAttachment

A record of a file stored at upload time. No workflow has started at this point.

| Field | Type | Constraint | Description |
| --- | --- | --- | --- |
| id | BigAutoField | PK | Primary key |
| conversation | ForeignKey(Conversation) | CASCADE, db_index | Bound conversation |
| user | ForeignKey(User) | CASCADE, db_index | Uploading user |
| original_name | CharField(255) | required | Original file name |
| storage_path | CharField(500) | required | Local storage path |
| file_size | BigIntegerField | default=0 | File size |
| content_type | CharField(120) | blank | MIME type |
| upload_status | CharField(20) | choices | uploaded, bound, deleted |
| created_at | DateTimeField | auto_now_add | Upload time |

Indexes:

```text
(conversation, created_at)
(user, created_at)
```

### 4.2 FileSummaryBatch

One run (batch) of the automatic summary workflow.

| Field | Type | Constraint | Description |
| --- | --- | --- | --- |
| id | BigAutoField | PK | Primary key |
| conversation | ForeignKey(Conversation) | CASCADE, db_index | Bound conversation |
| user | ForeignKey(User) | CASCADE, db_index | Executing user |
| trigger_message | ForeignKey(Message) | SET_NULL, null | User message that triggered the workflow |
| batch_no | CharField(64) | unique | Batch number |
| product_name | CharField(200) | blank | Product name |
| status | CharField(20) | choices | pending, running, success, failed |
| total_files | IntegerField | default=0 | Total number of files |
| supported_files | IntegerField | default=0 | Files of a supported type |
| success_files | IntegerField | default=0 | Files counted successfully |
| failed_files | IntegerField | default=0 | Files that failed |
| unsupported_files | IntegerField | default=0 | Files of an unsupported type |
| uncertain_files | IntegerField | default=0 | Files whose page count is undeterminable |
| total_pages | IntegerField | default=0 | Total pages |
| work_dir | CharField(500) | blank | Working directory |
| error_message | TextField | blank | Batch-level error |
| created_at | DateTimeField | auto_now_add | Creation time |
| started_at | DateTimeField | null | Start time |
| finished_at | DateTimeField | null | Finish time |

### 4.3 FileSummaryBatchAttachment

Join table between a batch and its uploaded files, which ensures the workflow reads only the files of its own batch.

| Field | Type | Constraint | Description |
| --- | --- | --- | --- |
| id | BigAutoField | PK | Primary key |
| batch | ForeignKey(FileSummaryBatch) | CASCADE | Batch |
| attachment | ForeignKey(FileAttachment) | CASCADE | Uploaded file |
| created_at | DateTimeField | auto_now_add | Binding time |

Unique constraint:

```text
unique(batch, attachment)
```

### 4.4 FileSummaryItem

Per-file detail record.

| Field | Type | Constraint | Description |
| --- | --- | --- | --- |
| id | BigAutoField | PK | Primary key |
| batch | ForeignKey(FileSummaryBatch) | CASCADE, db_index | Owning batch |
| file_index | IntegerField | required | File sequence number |
| directory_level | CharField(300) | blank | Directory level |
| file_name | CharField(255) | required | File name |
| file_type | CharField(20) | required | File extension |
| relative_path | CharField(500) | required | Relative path |
| storage_path | CharField(500) | required | Path actually processed |
| page_count | IntegerField | null | Page count |
| statistics_status | CharField(20) | choices | success, failed, unsupported, uncertain, skipped |
| retry_count | IntegerField | default=0 | Number of retries |
| error_message | TextField | blank | Error description |
| created_at | DateTimeField | auto_now_add | Creation time |
| updated_at | DateTimeField | auto_now | Update time |

Unique constraint:

```text
unique(batch, relative_path)
```

### 4.5 WorkflowNodeRun

Status record for a workflow node.

| Field | Type | Constraint | Description |
| --- | --- | --- | --- |
| id | BigAutoField | PK | Primary key |
| batch | ForeignKey(FileSummaryBatch) | CASCADE, db_index | Batch |
| node_code | CharField(40) | required | Node code |
| node_name | CharField(80) | required | Node name |
| status | CharField(20) | choices | pending, running, retrying, success, failed, skipped |
| progress | IntegerField | default=0 | Progress percentage |
| message | TextField | blank | Node message |
| started_at | DateTimeField | null | Start time |
| finished_at | DateTimeField | null | Finish time |

Unique constraint:

```text
unique(batch, node_code)
```

### 4.6 WorkflowEvent

Persisted SSE events, used to restore state after a page refresh and for debugging.

| Field | Type | Constraint | Description |
| --- | --- | --- | --- |
| id | BigAutoField | PK | Primary key |
| batch | ForeignKey(FileSummaryBatch) | CASCADE, db_index | Batch |
| event_type | CharField(40) | required | Event type |
| payload | JSONField | default=dict | Event payload |
| created_at | DateTimeField | auto_now_add | Creation time |

### 4.7 ExportedSummaryFile

Record of an exported file.

| Field | Type | Constraint | Description |
| --- | --- | --- | --- |
| id | BigAutoField | PK | Primary key |
| batch | ForeignKey(FileSummaryBatch) | CASCADE, db_index | Batch |
| export_type | CharField(20) | choices | markdown, excel |
| file_name | CharField(255) | required | File name |
| storage_path | CharField(500) | required | Storage path |
| status | CharField(20) | choices | success, failed |
| error_message | TextField | blank | Error |
| created_at | DateTimeField | auto_now_add | Generation time |

Download links are generated at request time from `export_id`; storing static URLs long-term is not recommended.


---

## 5. Constants and States

### 5.1 Supported Formats

```python
SUPPORTED_PAGE_TYPES = {"pdf", "doc", "docx", "xls", "xlsx", "ppt", "pptx"}
ARCHIVE_TYPES = {"zip", "7z", "rar"}
```

### 5.2 Workflow Nodes

```python
# (node_code, display label shown on the workflow card)
WORKFLOW_NODES = [
    ("upload", "上传中"),                # uploading
    ("extract", "解压中"),               # extracting
    ("inventory", "扫描中"),             # scanning
    ("page_count", "解析页数中"),        # counting pages
    ("product_detect", "识别产品名中"),  # detecting product name
    ("report", "输出 Markdown 中"),      # writing Markdown
    ("excel_export", "输出 Excel 中"),   # writing Excel
    ("completed", "已完成"),             # completed
]
```

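### 5.3 Trigger Keyword Rules

`workflow_trigger.py` starts with rule-based matching and can later be upgraded to LLM intent recognition.

```python
# Substrings matched against the user's prompt.
SUMMARY_TRIGGER_KEYWORDS = [
    "自动汇总",    # automatic summary
    "文件目录",    # file listing
    "页数",        # page count
    "统计文件",    # count files
    "汇总目录",    # summarize directory
    "目录与页数",  # listing and page count
]
```

Rules:

| Condition | Result |
| --- | --- |
| The current conversation has unbound or recently uploaded files, and the prompt matches a keyword | Start the automatic summary workflow |
| No keyword matched | Fall through to the normal LLM chat |
| A keyword matched but no file has been uploaded | The AI replies with the hint "请先上传文件或压缩包" ("Please upload files or an archive first") |
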
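
The three rules above can be sketched as a single decision function. This is only an illustration under stated assumptions: `TriggerDecision` and `decide_trigger` are hypothetical names, and "has files" is reduced to a count of pending attachments that the caller would derive from `FileAttachment` rows of the current conversation. Matching is plain substring containment, so a broad keyword such as "页数" will also match prompts that merely mention page counts.

```python
from enum import Enum

SUMMARY_TRIGGER_KEYWORDS = ["自动汇总", "文件目录", "页数", "统计文件", "汇总目录", "目录与页数"]


class TriggerDecision(Enum):
    START_WORKFLOW = "start_workflow"
    ASK_FOR_UPLOAD = "ask_for_upload"
    PLAIN_CHAT = "plain_chat"


def decide_trigger(prompt: str, pending_attachment_count: int) -> TriggerDecision:
    """Map a prompt and the number of pending attachments to one of three outcomes."""
    keyword_hit = any(keyword in prompt for keyword in SUMMARY_TRIGGER_KEYWORDS)
    if not keyword_hit:
        return TriggerDecision.PLAIN_CHAT      # normal LLM conversation
    if pending_attachment_count > 0:
        return TriggerDecision.START_WORKFLOW  # keyword matched and files exist
    return TriggerDecision.ASK_FOR_UPLOAD      # keyword matched, nothing uploaded
```
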

---

## 6. Services and Method Signatures

### 6.1 storage.py

```python
from pathlib import Path

from review_agent.models import FileAttachment, FileSummaryBatch


def save_attachment(conversation, user, uploaded_file) -> FileAttachment:
    """Save an uploaded file and bind it to the current conversation."""


def build_batch_work_dir(batch: FileSummaryBatch) -> Path:
    """Build the working directory for a batch."""


def build_export_path(batch: FileSummaryBatch, suffix: str) -> Path:
    """Build the path of an export file."""
```

Storage layout:

```text
media/review_agent/
  user_{user_id}/
    conversation_{conversation_id}/
      attachments/
      batches/
        batch_{batch_id}/
          input/
          extracted/
          exports/
```

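
A minimal sketch of how the storage layout could be derived with `pathlib`. The function names and the `MEDIA_ROOT` constant are assumptions for illustration (in the real app the root would presumably come from Django's `settings.MEDIA_ROOT`), and the functions take plain ids rather than model instances so the example runs on its own.

```python
from pathlib import Path

# Assumption: stands in for Django's settings.MEDIA_ROOT.
MEDIA_ROOT = Path("media")


def conversation_root(user_id: int, conversation_id: int) -> Path:
    """Directory that holds everything belonging to one conversation."""
    return MEDIA_ROOT / "review_agent" / f"user_{user_id}" / f"conversation_{conversation_id}"


def batch_dirs(user_id: int, conversation_id: int, batch_id: int) -> dict[str, Path]:
    """Return the input/extracted/exports directories of one batch."""
    base = conversation_root(user_id, conversation_id) / "batches" / f"batch_{batch_id}"
    return {name: base / name for name in ("input", "extracted", "exports")}
```

Keeping the user and conversation ids in the path means a batch can only ever resolve directories under its own conversation, which supports the "no cross-conversation files" constraint.
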
### 6.2 archive.py

```python
from pathlib import Path


def is_archive(path: Path) -> bool:
    """Return whether the path is an archive."""


def extract_archive(source: Path, target_dir: Path) -> list[Path]:
    """Extract a zip, 7z, or rar archive and return the extracted file paths."""


def extract_zip(source: Path, target_dir: Path) -> list[Path]:
    """Extract with zipfile."""


def extract_7z(source: Path, target_dir: Path) -> list[Path]:
    """Extract with py7zr."""


def extract_rar(source: Path, target_dir: Path) -> list[Path]:
    """Extract rar, preferring the system 7z command."""
```

Safety rules:

| Rule | Description |
| --- | --- |
| Path traversal check | The final extracted path must still be inside target_dir |
| File name sanitizing | Keep the original name, but forbid absolute paths and parent-directory jumps |
| Extraction failure | Raise ArchiveExtractError; the batch fails |

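
The path traversal rule can be illustrated for the zip case with the standard library alone. This is a sketch rather than the final implementation: it validates every member against the resolved `target_dir` before writing anything, and wraps a corrupt archive in `ArchiveExtractError`. `zipfile` already strips `..` components and leading slashes during extraction, so the explicit check here serves to reject a suspicious archive outright instead of silently rewriting its paths.

```python
import zipfile
from pathlib import Path


class ArchiveExtractError(Exception):
    """Raised when an archive cannot be extracted safely."""


def extract_zip(source: Path, target_dir: Path) -> list[Path]:
    """Extract a zip, rejecting members that would land outside target_dir."""
    root = target_dir.resolve()
    extracted: list[Path] = []
    try:
        with zipfile.ZipFile(source) as zf:
            members = zf.infolist()
            # Validate all members first so a bad archive writes nothing.
            for info in members:
                dest = (root / info.filename).resolve()
                if not dest.is_relative_to(root):
                    raise ArchiveExtractError(f"unsafe path in archive: {info.filename}")
            for info in members:
                written = Path(zf.extract(info, root))
                if written.is_file():  # skip directory entries
                    extracted.append(written)
    except zipfile.BadZipFile as exc:
        raise ArchiveExtractError(str(exc)) from exc
    return extracted
```
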
### 6.3 inventory.py

```python
from pathlib import Path


def scan_files(batch: FileSummaryBatch, roots: list[Path]) -> list[FileSummaryItem]:
    """Scan directories or loose files and create FileSummaryItem rows."""


def build_directory_level(relative_path: Path) -> str:
    """Derive the directory level from a relative path."""


def normalize_file_type(path: Path) -> str:
    """Return the lowercase extension without the leading dot."""
```

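
The two pure helpers could look roughly like this. The design does not fix the format of the directory level, so joining the parent directories with `/` (and returning an empty string for top-level files) is an assumption made for the example.

```python
from pathlib import PurePath, PurePosixPath


def normalize_file_type(path: PurePath) -> str:
    """Return the lowercase extension without the leading dot ("" if none)."""
    return path.suffix.lower().lstrip(".")


def build_directory_level(relative_path: PurePath) -> str:
    """Join the parent directories of a relative path with "/".

    Assumption: top-level files get an empty directory level.
    """
    return "/".join(relative_path.parent.parts)


# Usage with a path relative to the batch's scan root.
sample = PurePosixPath("注册资料/技术文件/说明书.PDF")
file_type = normalize_file_type(sample)
directory_level = build_directory_level(sample)
```
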
### 6.4 page_count.py

```python
from pathlib import Path


def count_pages(item: FileSummaryItem) -> PageCountResult:
    """Dispatch page counting by file type."""


def count_pages_with_retry(item: FileSummaryItem, max_retry: int = 3) -> PageCountResult:
    """Retry up to 3 times after a failure."""


def count_pdf(path: Path) -> int:
    """Count PDF pages with pypdf."""


def count_docx(path: Path) -> PageCountResult:
    """Read the stored page-count property with python-docx."""


def count_doc(path: Path) -> PageCountResult:
    """Read the page count of a legacy doc from its OLE metadata with olefile."""


def count_xlsx(path: Path) -> int:
    """Count worksheets with openpyxl."""


def count_xls(path: Path) -> int:
    """Count worksheets with xlrd."""


def count_pptx(path: Path) -> int:
    """Count slides with python-pptx."""


def count_ppt(path: Path) -> PageCountResult:
    """Read the page or slide count of a legacy ppt from its OLE metadata with olefile."""
```

`PageCountResult`:

```python
from dataclasses import dataclass


@dataclass
class PageCountResult:
    status: str
    page_count: int | None = None
    error_message: str = ""
```

Status rules:

| Case | status | page_count |
| --- | --- | --- |
| Page count read successfully | success | integer |
| Unsupported type | unsupported | None |
| File is readable but has no page-count metadata | uncertain | None |
| Parsing raised and all retries failed | failed | None |

### 6.5 product_detect.py

```python
def detect_product_name(batch: FileSummaryBatch) -> ProductDetectResult:
    """Detect the product name from directory names, file names, and a little metadata."""


def update_conversation_title(batch: FileSummaryBatch, product_name: str) -> None:
    """Update the conversation title according to the rules."""
```

Priority of product-name sources:

| Priority | Source |
| --- | --- |
| 1 | Top-level directory name |
| 2 | File-name fragments containing keywords such as "产品" (product), "试剂盒" (reagent kit), or "说明书" (instructions for use) |
| 3 | The `title` document property of a docx |
| 4 | The `title` metadata of a PDF |

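
The priority order can be sketched as "first non-empty candidate wins". Everything beyond that order is an assumption here: `pick_product_name` is a hypothetical helper, the metadata titles are passed in already extracted, and the design does not say how a "fragment" is cut out of a file name, so this example simply uses the whole file stem of the first matching file.

```python
from pathlib import PurePosixPath

PRODUCT_KEYWORDS = ("产品", "试剂盒", "说明书")


def pick_product_name(
    top_level_dir: str,
    file_names: list[str],
    docx_title: str = "",
    pdf_title: str = "",
) -> str:
    """Return the first non-empty candidate in priority order, or ""."""
    # Priority 2: stem of the first file whose name contains a keyword.
    keyword_hit = next(
        (
            PurePosixPath(name).stem
            for name in file_names
            if any(keyword in name for keyword in PRODUCT_KEYWORDS)
        ),
        "",
    )
    for candidate in (top_level_dir, keyword_hit, docx_title, pdf_title):
        if candidate.strip():
            return candidate.strip()
    return ""  # nothing usable: the product_detect node can be skipped
```
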
### 6.6 report.py

```python
def build_summary_stats(batch: FileSummaryBatch) -> dict:
    """Aggregate the summary statistics."""


def build_chat_markdown(batch: FileSummaryBatch) -> str:
    """Build the short Markdown table shown in the chat."""


def build_full_markdown_report(batch: FileSummaryBatch) -> str:
    """Build the full Markdown report."""


def save_markdown_report(batch: FileSummaryBatch) -> ExportedSummaryFile:
    """Save the Markdown report and create the export record."""
```

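
As a sketch of what the in-chat table builder might do, the function below renders a pipe table from plain row dictionaries. The column choice is an assumption (a subset of the `FileSummaryItem` fields), and it takes rows directly instead of a batch so it can run without Django. Pipes inside cell values are escaped so that a file name cannot break the table.

```python
def _cell(value) -> str:
    """Render one table cell; None becomes "-" and pipes are escaped."""
    text = "-" if value is None else str(value)
    return text.replace("|", "\\|")


def build_chat_markdown(rows: list[dict]) -> str:
    """Render file rows as a Markdown pipe table."""
    header = ["#", "File", "Type", "Pages", "Status"]
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join(["---"] * len(header)) + " |",
    ]
    for row in rows:
        cells = [
            row["file_index"],
            row["file_name"],
            row["file_type"],
            row["page_count"],
            row["statistics_status"],
        ]
        lines.append("| " + " | ".join(_cell(c) for c in cells) + " |")
    return "\n".join(lines)
```
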
### 6.7 export_excel.py

```python
from openpyxl import Workbook


def build_excel_workbook(batch: FileSummaryBatch) -> Workbook:
    """Build the Excel Workbook."""


def save_excel(batch: FileSummaryBatch) -> ExportedSummaryFile:
    """Save the Excel file and create the export record."""
```

Worksheets:

| Sheet | Fields |
| --- | --- |
| Summary | Batch number, product name, total files, success count, failure count, undeterminable count, total pages |
| File details | Sequence number, directory level, file name, type, page count, relative path, status, retry count, error description |


---

## 7. Skill Design

### 7.1 BaseSkill

```python
class BaseSkill:
    name: str
    node_code: str

    def run(self, context: WorkflowContext) -> SkillResult:
        raise NotImplementedError
```

`WorkflowContext`:

```python
from dataclasses import dataclass


@dataclass
class WorkflowContext:
    batch_id: int
    conversation_id: int
    user_id: int
    message_id: int | None = None
```

`SkillResult`:

```python
from dataclasses import dataclass, field


@dataclass
class SkillResult:
    success: bool
    message: str = ""
    data: dict = field(default_factory=dict)
```

### 7.2 Skill List

| Skill class | Node | Service called |
| --- | --- | --- |
| UploadIntakeSkill | upload | storage.py |
| ArchiveExtractSkill | extract | archive.py |
| FileInventorySkill | inventory | inventory.py |
| DocumentPageCountSkill | page_count | page_count.py |
| ProductDetectSkill | product_detect | product_detect.py |
| SummaryReportSkill | report | report.py |
| ExcelExportSkill | excel_export | export_excel.py |

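
A possible shape for `skills/registry.py`, shown only as a sketch: Skill classes are registered under their `node_code` and instantiated when the executor asks for a node. The `SkillRegistry` API (`register`, `create`) is an assumption, and "on-demand" is interpreted here as lazy instantiation rather than lazy import. The dataclass and base class are repeated so the example runs on its own.

```python
from dataclasses import dataclass, field


@dataclass
class SkillResult:
    success: bool
    message: str = ""
    data: dict = field(default_factory=dict)


class BaseSkill:
    name: str = ""
    node_code: str = ""

    def run(self, context) -> SkillResult:
        raise NotImplementedError


class SkillRegistry:
    """Maps node codes to Skill classes and builds instances on request."""

    def __init__(self) -> None:
        self._skills: dict[str, type[BaseSkill]] = {}

    def register(self, skill_cls: type[BaseSkill]) -> type[BaseSkill]:
        self._skills[skill_cls.node_code] = skill_cls
        return skill_cls  # returning the class lets this act as a decorator

    def create(self, node_code: str) -> BaseSkill:
        if node_code not in self._skills:
            raise LookupError(f"no skill registered for node {node_code!r}")
        return self._skills[node_code]()


registry = SkillRegistry()


@registry.register
class FileInventorySkill(BaseSkill):
    name = "file_inventory"
    node_code = "inventory"

    def run(self, context) -> SkillResult:
        # Placeholder body; the real Skill would call inventory.scan_files.
        return SkillResult(success=True, message="scanned")
```
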

---

## 8. Workflow Executor Design

### 8.1 Entry Point

```python
import threading


def start_file_summary_workflow(batch_id: int) -> None:
    """Run the workflow for one batch on a background thread."""
    thread = threading.Thread(
        target=WorkflowExecutor().run,
        args=(batch_id,),
        # Daemon threads do not block process shutdown, so a restart can
        # interrupt a batch that is still running.
        daemon=True,
    )
    thread.start()
```

### 8.2 Execution Pseudocode

```python
class WorkflowExecutor:
    def run(self, batch_id: int) -> None:
        batch = FileSummaryBatch.objects.get(pk=batch_id)
        self.mark_batch_running(batch)
        self.emit("workflow_started", batch, {"batch_id": batch.id})

        try:
            # Nodes run serially; resolve_nodes applies the skip rules in 8.3.
            for node_code in self.resolve_nodes(batch):
                self.run_node(batch, node_code)
            self.mark_batch_success(batch)
            self.emit("workflow_completed", batch, self.build_completed_payload(batch))
        except Exception as exc:
            # Any error that escapes a node fails the whole batch.
            self.mark_batch_failed(batch, str(exc))
            self.emit("workflow_failed", batch, {"message": str(exc)})
```

### 8.3 Node Skip Rules

| Node | Skip condition |
| --- | --- |
| extract | The current batch contains no archive |
| product_detect | No file name, directory name, or metadata is available for detection |

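
The `extract` skip rule can be decided before the run starts, from the attachment names alone. The sketch below assumes a hypothetical `resolve_nodes` variant that takes file names instead of a batch. The `product_detect` rule depends on what the inventory finds, so it would be evaluated at run time and is not covered here.

```python
from pathlib import PurePosixPath

ARCHIVE_TYPES = {"zip", "7z", "rar"}
NODE_ORDER = [
    "upload",
    "extract",
    "inventory",
    "page_count",
    "product_detect",
    "report",
    "excel_export",
]


def resolve_nodes(attachment_names: list[str]) -> list[str]:
    """Return the node codes to run, dropping "extract" when no archive was uploaded."""
    has_archive = any(
        PurePosixPath(name).suffix.lower().lstrip(".") in ARCHIVE_TYPES
        for name in attachment_names
    )
    return [code for code in NODE_ORDER if code != "extract" or has_archive]
```
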

---

## 9. API Design

### 9.1 Upload Staging Endpoint

```text
POST /api/review-agent/conversations/{conversation_id}/attachments/
Content-Type: multipart/form-data
```

Request:

| Parameter | Type | Required | Description |
| --- | --- | --- | --- |
| files[] | File[] | Yes | One or more files |

Response:

```json
{
  "attachments": [
    {
      "id": 101,
      "original_name": "注册资料.zip",
      "file_size": 204800,
      "upload_status": "uploaded"
    }
  ]
}
```

Permission:

```text
conversation.user must equal request.user
```

### 9.2 Send a Message and Trigger the Workflow on Demand

Reuse the existing `POST /chat/stream/` SSE endpoint and add a check inside `stream_chat`:

```text
User sends a prompt
  -> Save the Message
  -> Check whether the prompt triggers the automatic summary workflow
  -> On a match, create a FileSummaryBatch and start the background workflow
  -> Return workflow_meta over SSE
  -> On no match, fall through to the original streaming LLM reply
```

New SSE meta payload:

```json
{
  "conversation_id": 1,
  "title": "新对话",
  "workflow": {
    "type": "file_summary",
    "batch_id": 12,
    "status": "running"
  }
}
```

### 9.3 Query Batch Status

```text
GET /api/review-agent/file-summary/{batch_id}/
```

Response:

```json
{
  "batch": {
    "id": 12,
    "batch_no": "FS202606050001",
    "status": "running",
    "product_name": "",
    "total_files": 24,
    "success_files": 10,
    "failed_files": 1,
    "uncertain_files": 2,
    "total_pages": 180
  },
  "nodes": [
    {
      "node_code": "page_count",
      "node_name": "解析页数中",
      "status": "running",
      "progress": 45,
      "message": "正在解析 11/24"
    }
  ],
  "exports": []
}
```

### 9.4 Workflow Event Stream

```text
GET /api/review-agent/file-summary/{batch_id}/events/?after={event_id}
```

Response type: `text/event-stream`

Event:

```text
event: node_progress
data: {"event_id": 301, "batch_id": 12, "node_code": "page_count", "status": "running", "progress": 45, "message": "正在解析 11/24"}
```

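
The wire format above can be produced by a small formatter such as the one `events.py` might contain. `format_sse` is a hypothetical name. The sketch serializes the payload as single-line JSON, because an SSE `data:` field ends at the newline, and terminates the event with a blank line as the event-stream format requires. `ensure_ascii=False` keeps Chinese messages readable on the wire.

```python
import json


def format_sse(event_type: str, event_id: int, payload: dict) -> str:
    """Serialize one workflow event in text/event-stream format."""
    # event_id is placed first in the JSON body, matching the sample event.
    data = json.dumps({"event_id": event_id, **payload}, ensure_ascii=False)
    # A blank line (the trailing "\n\n") marks the end of the event.
    return f"event: {event_type}\ndata: {data}\n\n"


# Usage: format a progress update for batch 12.
chunk = format_sse(
    "node_progress",
    301,
    {"batch_id": 12, "node_code": "page_count", "status": "running", "progress": 45},
)
```

A client that reconnects would pass the last `event_id` it saw as the `after` query parameter, so the server can replay only the `WorkflowEvent` rows with a larger id.
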
### 9.5 Download an Export File

```text
GET /api/review-agent/file-summary/exports/{export_id}/download/
```

Permission:

```text
ExportedSummaryFile -> batch -> conversation -> user must be the current user
```


---

## 10. Front-End Design

### 10.1 Three-Column Layout

The page changes to three columns:

| Area | Content |
| --- | --- |
| Left column | Conversation history |
| Middle column | Chat messages and the input box |
| Right column, upper half | Drag-and-drop file import area |
| Right column, lower half | Workflow card list |

Suggested HTML structure:

```html
<main class="workspace three-column">
  <aside class="sidebar"></aside>
  <section class="chat-shell"></section>
  <aside class="workflow-panel">
    <section class="upload-dropzone" id="uploadDropzone"></section>
    <section class="workflow-card-list" id="workflowCardList"></section>
  </aside>
</main>
```

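### 10.2 Upload Interaction

JS functions:

```javascript
function bindUploadDropzone() {}              // wire up drag-and-drop and file selection
function uploadConversationFiles(files) {}    // POST the files to the attachments endpoint
function renderAttachmentList(attachments) {} // show uploaded file names in the upload area
```

Flow:

```text
User drags or selects files
  -> POST to the attachments endpoint
  -> After saving, the upload area on the right shows the file names
  -> No workflow starts
  -> User sends a prompt
  -> On a workflow match, a workflow card is created
```
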
### 10.3 Workflow Card

JS functions:

```javascript
function createWorkflowCard(batch) {}
function updateWorkflowNode(batchId, nodePayload) {}
function markWorkflowCompleted(batchId, payload) {}
function markWorkflowFailed(batchId, payload) {}
function connectWorkflowEvents(batchId) {}
function restoreWorkflowCards() {}
```

Card structure:

```html
<!-- Node labels mirror WORKFLOW_NODES in section 5.2. -->
<article class="workflow-card" data-batch-id="12">
  <header>
    <strong>文件目录与页数汇总</strong>
    <span class="workflow-status">运行中</span>
  </header>
  <ol class="workflow-nodes">
    <li data-node-code="upload">上传中</li>
    <li data-node-code="extract">解压中</li>
    <li data-node-code="inventory">扫描中</li>
    <li data-node-code="page_count">解析页数中</li>
    <li data-node-code="product_detect">识别产品名中</li>
    <li data-node-code="report">输出 Markdown 中</li>
    <li data-node-code="excel_export">输出 Excel 中</li>
  </ol>
</article>
```

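### 10.4 Markdown Rendering

Messages are currently rendered with `nl2br`, which cannot render Markdown tables. The rendering needs to change as follows:

| Message type | Rendering strategy |
| --- | --- |
| Ordinary user message | escapeHtml + nl2br |
| Ordinary assistant message | Sanitized Markdown rendering |
| File summary result | Sanitized Markdown rendering that allows table, a, strong, and code |

Options:

| Option | Notes |
| --- | --- |
| Front-end marked + DOMPurify | Good rendering experience, but adds front-end dependencies |
| Back-end markdown + bleach | The back end outputs sanitized HTML and the front end displays it directly |

For the demo, front-end `marked` + `DOMPurify` is recommended, loaded from a CDN or from local static files.
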
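
---

## 11. Conversation Title Update

Once the product name has been detected, update the title:

```python
def update_conversation_title(batch, product_name):
    """Rename the conversation after the product unless the user already renamed it."""
    if not product_name:
        return  # empty product name: leave the title unchanged
    conversation = batch.conversation
    # "新对话" ("New conversation") is the default title of a fresh conversation.
    if conversation.title.startswith("新对话"):
        # The "-文件汇总" suffix means "file summary"; cap the title at 120 characters.
        conversation.title = f"{product_name}-文件汇总"[:120]
        conversation.save(update_fields=["title", "updated_at"])
```

Rules:

| Scenario | Handling |
| --- | --- |
| New conversation with the default title | Update to `{product_name}-文件汇总` |
| User has already set a custom title | Do not overwrite |
| Product name is empty | Do not update |
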

---

## 12. Test Design

### 12.1 Unit Tests

| Test case | Goal |
| --- | --- |
| test_trigger_keywords | A matching prompt triggers the automatic summary |
| test_save_attachment_binds_conversation | An uploaded file is bound to the current conversation |
| test_zip_extract_safe_path | zip extraction rejects path traversal |
| test_scan_files_builds_relative_path | Scanning produces correct relative paths |
| test_count_pdf_pages | PDF page counting |
| test_count_xlsx_sheets | xlsx worksheet counting |
| test_count_pptx_slides | pptx slide counting |
| test_retry_three_times | A failing file is retried 3 times |
| test_uncertain_old_doc | A legacy doc without metadata is marked uncertain |

### 12.2 API Tests

| Test case | Goal |
| --- | --- |
| test_upload_attachment_api | The upload endpoint returns the `id` of each created attachment |
| test_upload_permission_denied | Files cannot be uploaded to another user's conversation |
| test_stream_triggers_workflow | Sending a matching prompt returns the workflow meta |
| test_batch_status_permission | Another user's batch cannot be queried |
| test_export_download_permission | Another user's export file cannot be downloaded |

### 12.3 Integration Tests

| Test case | Goal |
| --- | --- |
| test_file_summary_zip_workflow | The full workflow succeeds after a zip upload |
| test_file_summary_multi_file_workflow | The full workflow succeeds after a multi-file upload |
| test_single_file_failure_not_blocking | A single failing file does not block the batch |
| test_workflow_events_created | Node events are written to the database in order |
| test_markdown_and_excel_exports | The Markdown and Excel files are generated |

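### 12.4 Front-End Verification

| Test case | Goal |
| --- | --- |
| Drag-and-drop upload | The upload area on the right shows the file list |
| Prompt trigger | Sending "自动汇总文件目录与页数" creates a workflow card |
| Real-time status | SSE events drive the node status changes |
| Restore after refresh | After a refresh, the right-hand card restores the current batch status |
| Markdown table | Tables and download links in chat messages render correctly |
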
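
---

## 13. Development Order

1. Add the dependencies and model fields, and generate the migrations.
2. Implement the upload staging endpoint and the storage directory strategy.
3. Implement workflow_trigger, which decides from the prompt whether to start the workflow.
4. Implement SkillRegistry, WorkflowExecutor, and WorkflowEvent.
5. Implement the archive extraction, file scanning, and page counting services.
6. Implement the Markdown report and the Excel export.
7. Rework the front end: three-column layout, drag-and-drop upload area, and workflow cards.
8. Add Markdown rendering.
9. Complete the permission tests, workflow tests, and manual front-end verification.
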

---

## 14. References

This design prefers lightweight Python libraries, on the following basis:

| Capability | Basis |
| --- | --- |
| PDF page count | pypdf's PdfReader exposes `pages` |
| docx metadata | python-docx supports core properties |
| pptx slides | python-pptx can read the presentation's slides |
| xlsx worksheets | openpyxl can read the workbook's worksheets |
| xls worksheets | xlrd supports reading legacy xls workbooks |
| Legacy Office metadata | olefile can read the OLE2 compound document structure |
| 7z extraction | py7zr supports the 7z archive format |
| rar extraction | rarfile typically depends on an external unrar/unar/7z tool, so this design calls the system 7z command first |