Upload files to /

2026-01-16 19:22:13 +08:00 · 2026-01-16 19:22:13 +08:00 · b7ca3a5bc9
commit b7ca3a5bc9
5 changed files with 1413 additions and 0 deletions
--- a/README.md
+++ b/README.md
@@ -0,0 +1,875 @@
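+# Machine Learning × LLM × Agent: Course Project (5 Days)
+
+> **Group assignment** | 2–3 people per group | Build a deployable intelligent prediction and action-recommendation system
+
+Use traditional machine learning to complete a quantifiable prediction task, then use an LLM + Agent to turn the predictions into actionable decisions and recommendations, keeping the output structured, traceable, and reproducible.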
+
+---
+
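+## 📅 Schedule Overview
+
+| Day | Topic | Content |
+|------|------|------|
+| **Day 1** | Project kickoff | Tech stack introduction + demo + topic selection and grouping |
+| **Day 2** | Independent design | Group development |
+| **Day 3** | Q&A + Git guidance | Group Q&A session + Git submission tutorial |
+| **Day 4** | Independent design | Continued development + presentation preparation |
+| **Day 5** | Group presentations | Run on the instructor's machine + grading |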
+
+---
+
+## 📑 目录
+
+- [Day 1：项目启动](#day-1项目启动)
+  - [快速开始](#-快速开始)
+  - [技术栈要求](#技术栈要求2026-版)
+  - [选题指南](#选题指南)
+  - [可选扩展思路](#可选扩展思路)
+- [Day 2：自主设计](#day-2自主设计)
+- [Day 3：答疑 + Git 指导](#day-3答疑--git-指导)
+  - [Git 安装](#git-安装国内环境)
+  - [Git 基础操作](#git-基础操作)
+  - [.gitignore 详解](#gitignore-详解)
+- [Day 4：自主设计](#day-4自主设计)
+- [Day 5：小组展示](#day-5小组展示)
+  - [展示流程](#展示流程)
+  - [跨机运行检查清单](#跨机运行检查清单)
+  - [评分标准](#评分标准总分-100)
+- [附录](#附录)
+  - [代码示例](#代码示例)
+  - [项目结构](#建议项目结构)
+  - [参考资料](#参考资料)
+
+---
+
+# Day 1：项目启动
+
+## 🚀 快速开始
+
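+> **2026 best practice**: use `uv` in place of pip/venv/poetry to manage the whole project workflow
+
+```bash
+# 1. Install uv (if not already installed)
+# Option A: install with pip (recommended; works in mainland China)
+pip install uv -i https://mirrors.aliyun.com/pypi/simple/
+
+# Option B: install with pipx (isolated environment)
+pipx install uv
+
+# Option C: official installer script (may require a VPN/proxy)
+# macOS / Linux: curl -LsSf https://astral.sh/uv/install.sh | sh
+# Windows: powershell -c "irm https://astral.sh/uv/install.ps1 | iex"
+
+# Configure a PyPI mirror to speed up dependency downloads;
+# uv reads the default index URL from this environment variable
+export UV_DEFAULT_INDEX=https://mirrors.aliyun.com/pypi/simple/
+# Windows PowerShell: $env:UV_DEFAULT_INDEX = "https://mirrors.aliyun.com/pypi/simple/"
+
+# 2. Clone/fork this template repository
+git clone http://hblu.top:3000/MachineLearning2025/CourseDesign
+cd CourseDesign
+
+# 3. Initialize the project and install dependencies (uv creates the virtual environment automatically)
+uv sync
+
+# 4. Configure the DeepSeek API key (do not commit it to the repository!)
+cp .env.example .env
+# Edit .env and fill in your API key
+# DEEPSEEK_API_KEY="your-key-here"
+
+# 5. Run the examples
+# Option A: run the Streamlit visual demo (recommended)
+uv run streamlit run src/streamlit_app.py
+
+# Option B: run the command-line Agent demo
+uv run python src/agent_app.py
+
+# Option C: run the training script
+uv run python src/train.py
+```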
+
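+### uv Command Quick Reference
+
+| Command | Description |
+|------|------|
+| `uv sync` | Sync dependencies (from `pyproject.toml` and `uv.lock`) |
+| `uv add <package>` | Add a dependency (updates `pyproject.toml` and `uv.lock` automatically) |
+| `uv add --dev <package>` | Add a development dependency (e.g. pytest, ruff) |
+| `uv run <command>` | Run a command inside the project environment |
+| `uv lock` | Update the lockfile manually |
+| `uv python install 3.12` | Install a specific Python version |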
+
+---
+
+## 技术栈要求（2026 版）
+
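+| Component | Requirement | 2026 best practice |
+|------|------|---------------|
+| **Group size** | 2–3 people per group | N/A |
+| **Python version** | ≥ 3.12 | 3.12/3.14 recommended |
+| **Project management** | `uv` | Replaces pip/venv/poetry; 10-100x faster |
+| **Data processing** | `polars` + `pandas>=2.2` | polars as the main tool (Lazy API); pandas for compatibility |
+| **Data visualization** | `seaborn>=0.13` | Use the Seaborn Objects API (`so.Plot`) |
+| **Data validation** | `pydantic` + `pandera` | pydantic validates single rows and config; pandera validates DataFrames before and after cleaning |
+| **Machine learning** | `scikit-learn` + `lightgbm` | sklearn for the baseline; LightGBM for the high-performance model |
+| **Agent framework** | `pydantic-ai` | Structured output, type-safe Agent |
+| **LLM provider** | `DeepSeek` | OpenAI-compatible API |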
+
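+### Three Required Capabilities
+
+| Capability | Description |
+|------|------|
+| **Traditional machine learning** | Reproducible training pipeline, offline evaluation metrics, model saving and loading |
+| **LLM** | Used for explanation, attribution, generating recommendations/replies, and consolidating information (it must not fabricate facts) |
+| **Agent** | Ties the system together through tool calls (at least 2 tools, 1 of which must be an ML prediction/evaluation tool) |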
+
+---
+
+## 选题指南
+
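+> ⚠️ **Note**: Levels 1, 2, and 3 **can all earn full marks**. Harder levels usually make it easier to show depth, but choosing Level 1 does not cap your score.
+
+### Level 1｜Beginner: Tabular Prediction + Action-Recommendation Loop
+
+> 📌 **Recommended for beginners**
+
+**Goal**: build a classification/regression model on structured data, and have the Agent give actionable recommendations based on the model output.
+
+#### Recommended Datasets
+
+| Dataset | Link |
+|--------|------|
+| Telco Customer Churn | [Kaggle](https://www.kaggle.com/datasets/blastchar/telco-customer-churn) |
+| German Credit Risk | [Kaggle](https://www.kaggle.com/datasets/uciml/german-credit) |
+| Bank Marketing | [Kaggle](https://www.kaggle.com/datasets/janiobachmann/bank-marketing-dataset) |
+| Heart Failure Prediction | [Kaggle](https://www.kaggle.com/datasets/fedesoriano/heart-failure-prediction) |
+
+#### ✅ Required
+
+| Module | Requirement |
+|------|------|
+| **Data processing** | Build a reproducible data-cleaning pipeline with Polars; define the Schema with Pandera |
+| **Machine learning** | Compare at least 2 models (1 baseline such as LogReg, 1 strong model such as LightGBM); reach `F1 ≥ 0.70` or `ROC-AUC ≥ 0.75` |
+| **Agent** | Define inputs and outputs with Pydantic; at least 2 tools (including 1 ML prediction tool) |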
+
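The Level 1 bar (`F1 ≥ 0.70` or `ROC-AUC ≥ 0.75`) can be checked mechanically at the end of a training script. The following is a minimal sketch, assuming scikit-learn is installed and a 0.5 decision threshold is acceptable; the function name `meets_level1_bar` and the toy labels are illustrative and not part of the course template.

```python
from sklearn.metrics import f1_score, roc_auc_score


def meets_level1_bar(y_true, y_proba, threshold=0.5, min_f1=0.70, min_auc=0.75):
    """Return both metrics and whether either one clears the Level 1 bar."""
    # Hard labels are needed for F1; ROC-AUC is computed on the raw probabilities.
    y_pred = [1 if p >= threshold else 0 for p in y_proba]
    f1 = float(f1_score(y_true, y_pred))
    auc = float(roc_auc_score(y_true, y_proba))
    return {"f1": f1, "roc_auc": auc, "passed": bool(f1 >= min_f1 or auc >= min_auc)}


# Toy data for illustration only (not taken from any course dataset).
y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_proba = [0.10, 0.40, 0.80, 0.65, 0.30, 0.20, 0.90, 0.55]
report = meets_level1_bar(y_true, y_proba)
print(report)
```

Reporting both numbers, rather than only the one that passes, makes the later comparison between baseline and strong model easier to read.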
+---
+
+### Level 2｜Intermediate: Text Task + Handling Recommendations
+
+> 📌 **NLP-oriented**
+
+**Goal**: build a text classification/sentiment analysis model, and have the Agent generate a structured handling plan.
+
+#### Recommended Datasets
+
+| Dataset | Link | Notes |
+|--------|------|------|
+| Twitter US Airline Sentiment | [Kaggle](https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment) | Airline sentiment analysis |
+| IMDB 50K Movie Reviews | [Kaggle](https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews) | Movie review sentiment |
+| SMS Spam Collection | [Kaggle](https://www.kaggle.com/datasets/uciml/sms-spam-collection-dataset) | Spam SMS classification |
+| Consumer Complaints | [Kaggle](https://www.kaggle.com/datasets/selener/consumer-complaint-database) | Complaint routing |
+
+#### ✅ Required
+
+| Module | Requirement |
+|------|------|
+| **Data processing** | Keep text cleaning restrained and document the preprocessing strategy; define the Schema with Pandera |
+| **Machine learning** | Baseline `TF-IDF + LogReg`; reach `Accuracy ≥ 0.85` or `Macro-F1 ≥ 0.80` |
+| **Agent** | Implement the "classify → explain → generate handling plan" flow; structured output (Pydantic) |
+
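The `TF-IDF + LogReg` baseline named in the table can be sketched as a single scikit-learn pipeline. This is a minimal illustration under stated assumptions: scikit-learn is installed, and the eight-sentence corpus below is invented for the example (a real project would load one of the datasets listed above and evaluate on a held-out split).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpus for illustration only.
texts = [
    "flight delayed again, terrible service",
    "lost my luggage and nobody helped",
    "worst airline experience ever",
    "rude staff and a dirty cabin",
    "great crew and a smooth flight",
    "friendly staff, on time arrival",
    "loved the service, thank you",
    "comfortable seats and great food",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]  # 0 = negative, 1 = positive

# Keeping the vectorizer and classifier in one pipeline means the TF-IDF
# vocabulary is fitted on training text only, which avoids a common leakage mistake.
baseline = make_pipeline(
    TfidfVectorizer(lowercase=True, ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000, random_state=42),
)
baseline.fit(texts, labels)

predictions = baseline.predict(["terrible delayed flight", "great friendly crew"])
print(predictions)
```

The pipeline object can be saved with `joblib.dump` as a whole, so the Agent's prediction tool loads one artifact instead of a separate vectorizer and model.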
+---
+
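+### Level 3｜Advanced: Imbalanced / Multi-Table / Time-Series + Multi-Step Decisions
+
+> 📌 **Real-world constraints**
+
+**Goal**: handle more complex data characteristics (extreme class imbalance, multi-table joins, time-series forecasting) and implement a multi-step decision Agent.
+
+#### Recommended Datasets
+
+| Dataset | Link | Characteristics |
+|--------|------|------|
+| Credit Card Fraud Detection | [Kaggle](https://www.kaggle.com/datasets/mlg-ulb/creditcardfraud) | Extremely imbalanced |
+| IEEE-CIS Fraud Detection | [Kaggle](https://www.kaggle.com/c/ieee-fraud-detection) | Multi-table / complex feature engineering |
+| M5 Forecasting - Accuracy | [Kaggle](https://www.kaggle.com/competitions/m5-forecasting-accuracy) | Time-series forecasting |
+| Instacart Market Basket | [Kaggle](https://www.kaggle.com/c/instacart-market-basket-analysis) | Multi-table + recommendation |
+
+#### ✅ Required
+
+| Module | Requirement |
+|------|------|
+| **Data processing** | State the primary/foreign keys and join rules explicitly; write a "data leakage risk checklist" |
+| **Machine learning** | Use an appropriate metric (e.g. `PR-AUC`); when the data is temporal, evaluation must use a time-based split |
+| **Agent** | At least 3 decision steps (evaluate → explain → action plan); structured output |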
+
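The time-based split required above can be sketched with the standard library alone. The function name `time_based_split` and the record fields are hypothetical; the point is that rows are ordered chronologically before cutting, so the model never trains on data later than what it is evaluated on.

```python
from datetime import date


def time_based_split(rows, time_key, train_fraction=0.8):
    """Split rows chronologically: the earliest rows train, the latest rows test."""
    ordered = sorted(rows, key=lambda row: row[time_key])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]


# Hypothetical transaction records; the field names are illustrative only.
rows = [
    {"ts": date(2024, 1, day), "amount": 10 * day, "is_fraud": int(day % 7 == 0)}
    for day in range(1, 11)
]
train, test = time_based_split(rows, time_key="ts", train_fraction=0.8)

# Leakage check: nothing in the training set may be later than the test set.
assert max(r["ts"] for r in train) < min(r["ts"] for r in test)
print(len(train), len(test))
```

If several rows share the same timestamp at the cut point, the strict inequality in the check fails; in that case cut on a timestamp boundary rather than on a row count. For the `PR-AUC` metric itself, scikit-learn's `average_precision_score` is one commonly used estimate.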
+---
+
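+### Criteria for Self-Chosen Topics
+
+> 💡 **Self-chosen topics are encouraged**, but they must meet the following hard requirements
+
+| Requirement | Description |
+|------|------|
+| **Real, obtainable data** | Public and repeatably downloadable (Kaggle/UCI/OpenML, etc.), with a link provided |
+| **Quantifiable prediction task** | A clear label/target variable and evaluation metric |
+| **Business loop** | Results lead to a concrete "what to do next" decision or action |
+| **Agent tool calls** | At least 2 tools, 1 of which must be an ML tool |
+| **Scale and complexity** | Recommended sample size ≥ 5,000 |
+| **Compliance** | No scraping of restricted data; no committing of keys or private data |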
+
+---
+
+## 可选扩展思路
+
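+The following optional directions can add depth to a project. They are **not hard grading requirements**:
+
+| Direction | Idea |
+|------|------|
+| **Explainability** | Add a feature-importance explanation tool (e.g. `explain_top_features`) so the Agent can explain the basis for its decisions |
+| **Cost-sensitive strategy** | Define cost/benefit assumptions for each action and have the Agent output the most cost-effective combination of actions |
+| **Threshold strategy** | Turn the predicted probability into an intervention policy (different handling for high/medium/low risk) |
+| **Similar-case retrieval** | Use TF-IDF/embeddings to implement `retrieve_similar(text) -> top_k` and provide traceable evidence |
+| **Compliance checks** | Apply rule checks to the Agent output (e.g. no privacy leaks, no false promises) |
+| **Error analysis** | Analyze the top misclassified samples to find the model's weak spots |
+| **Ablation experiments** | Compare feature/model configurations to identify directions for improvement |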
+
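The threshold and cost-sensitive directions in the table can be combined in a few lines. This is a sketch under stated assumptions: the tier cut-offs (0.7 and 0.4), the action list, and every cost, benefit, and success-rate figure are hypothetical numbers chosen for illustration, and `probability` is taken to be the model's predicted probability of the adverse event (for example churn).

```python
def risk_tier(probability, high=0.7, medium=0.4):
    """Map a predicted probability to an intervention tier."""
    if probability >= high:
        return "high"
    if probability >= medium:
        return "medium"
    return "low"


def best_action(probability, actions):
    """Pick the action with the highest expected net value.

    Expected net value = probability * success_rate * benefit - cost.
    """
    def expected_value(action):
        return probability * action["success_rate"] * action["benefit"] - action["cost"]

    return max(actions, key=expected_value)


# Hypothetical cost/benefit assumptions; a real project should justify its own numbers.
ACTIONS = [
    {"name": "do_nothing", "cost": 0.0, "benefit": 0.0, "success_rate": 0.0},
    {"name": "send_coupon", "cost": 5.0, "benefit": 100.0, "success_rate": 0.2},
    {"name": "phone_call", "cost": 20.0, "benefit": 100.0, "success_rate": 0.5},
]

for p in (0.1, 0.4, 0.9):
    print(p, risk_tier(p), best_action(p, ACTIONS)["name"])
```

Exposing `best_action` as an Agent tool keeps the arithmetic out of the LLM: the model explains the choice, while the numbers come from deterministic code.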
+---
+
+# Day 2：自主设计
+
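+**Today's tasks**:
+- Design and develop the project in groups
+- Complete data exploration and cleaning
+- Start training the baseline model
+
+**Suggested milestones**:
+- [ ] Data downloaded and initial exploration done
+- [ ] Data-cleaning pipeline runs
+- [ ] Baseline model trained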
+
+---
+
+# Day 3：答疑 + Git 指导
+
+## Git 安装（国内环境）
+
+### Windows
+
+1. Download Git for Windows:
+   - China mirror (recommended): https://registry.npmmirror.com/binary.html?path=git-for-windows/
+   - Or the official site: https://git-scm.com/download/win
+2. Double-click the installer and accept the default settings throughout
+3. After installation, the right-click menu shows a "Git Bash Here" option
+
+### macOS
+
+```bash
+# Option A: Xcode command-line tools (recommended)
+xcode-select --install
+
+# Option B: Homebrew
+brew install git
+```
+
+### Linux (Ubuntu/Debian)
+
+```bash
+sudo apt update
+sudo apt install git
+```
+
+### Verify the Installation
+
+```bash
+git --version
+# Output looks like: git version 2.43.0
+```
+
+---
+
+## Git 基础操作
+
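+### First-Time Configuration
+
+```bash
+# Set your user name and email (shown in the commit history)
+git config --global user.name "Your Name"
+git config --global user.email "your-email@example.com"
+```
+
+### Clone the Repository
+
+```bash
+# After the group leader creates the repository, every member clones it
+git clone http://hblu.top:3000/<username>/<project-name>.git
+cd <project-name>
+```
+
+### Daily Development Workflow
+
+```bash
+# 1. Pull the latest code (every time before you start working)
+git pull
+
+# 2. Check the current status
+git status
+
+# 3. Stage the modified files
+git add .                    # stage all changes
+git add src/train.py         # or stage a specific file only
+
+# 4. Commit the changes
+git commit -m "feat: add data preprocessing module"
+
+# 5. Push to the remote repository
+git push
+```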
+
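+### Command Quick Reference
+
+| Command | Description |
+|------|------|
+| `git clone <url>` | Clone a remote repository |
+| `git pull` | Pull remote updates |
+| `git status` | Show the current status |
+| `git add .` | Stage all changes |
+| `git commit -m "message"` | Commit the changes |
+| `git push` | Push to the remote |
+| `git log --oneline -5` | Show the 5 most recent commits |
+
+### Team Collaboration Notes
+
+1. **Run `git pull` every time before you start working** to avoid conflicts
+2. **Write meaningful commit messages**, such as `feat: add Agent tool` rather than `update`
+3. **Commit in small steps**; do not save up every change for one commit at the end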
+
+---
+
+## .gitignore 详解
+
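+The `.gitignore` file tells Git **which files not to commit**. This matters because:
+- **A leaked API key can lead to your account being abused**
+- **Large files bloat the repository**
+- **Temporary files are not worth committing**
+
+### Files This Project Must Ignore
+
+Create a `.gitignore` file with the following content:
+
+```gitignore
+# ===== Environment variables (never commit!) =====
+.env
+
+# ===== Python virtual environments =====
+.venv/
+venv/
+__pycache__/
+*.pyc
+*.pyo
+.pytest_cache/
+
+# ===== IDE configuration =====
+.vscode/
+.idea/
+*.swp
+
+# ===== macOS system files =====
+.DS_Store
+
+# ===== Jupyter =====
+.ipynb_checkpoints/
+
+# ===== Very large files (add manually if over 10MB) =====
+# If a data or model file exceeds 10MB, add it below:
+# data/large_dataset.csv
+# models/large_model.pkl
+```
+
+> 💡 **About files under data/ and models/**:
+> - **Commit them by default** so the instructor's machine can run the project directly
+> - If a single file is **larger than 10MB**, add it to `.gitignore` and explain how to download it in `data/README.md`
+
+### Check That .gitignore Works
+
+```bash
+# Show which files Git is ignoring
+git status --ignored
+
+# If a file that should be ignored was committed earlier, remove it from Git first
+git rm --cached .env          # remove from Git but keep the local file
+git rm --cached -r __pycache__
+git commit -m "chore: remove files that should not be committed"
+```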
+
+---
+
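+## Submission Workflow
+
+### 1. Account Information
+
+Accounts have already been created for everyone. Log in at [hblu.top:3000/MachineLearning2025](http://hblu.top:3000/MachineLearning2025)
+
+| Item | Description |
+|------|------|
+| **Username** | `st` + student ID (e.g. `st2024001`) |
+| **Initial password** | `12345678` (change it after logging in) |
+| **Organization** | MachineLearning2025 |
+
+> ⚠️ **Change your password immediately after the first login**
+
+### 2. Group Leader Creates the Repository
+
+Create a new repository under the [MachineLearning2025](http://hblu.top:3000/MachineLearning2025) organization, named in the format `group number-project name` (e.g. `G01-ChurnPredictor`)
+
+### 3. Add Group Members
+
+Settings → Collaborators → add the other members (search by `st` + student ID)
+
+### 4. Submission Checklist
+
+- [ ] `.gitignore` exists and contains the required rules
+- [ ] `.env.example` is committed; `.env` is not
+- [ ] No API key or sensitive information is committed
+- [ ] No file larger than 10MB is committed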
+
+---
+
+# Day 4：自主设计
+
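+**Today's tasks**:
+- Continue refining the project
+- Finish the Agent integration
+- Prepare the Streamlit demo
+- Write the project report
+
+**Suggested milestones**:
+- [ ] ML model finished and saved
+- [ ] Agent tool calls pass testing
+- [ ] Streamlit demo runs
+- [ ] First draft of README.md done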
+
+---
+
+# Day 5：小组展示
+
+## 展示流程
+
+1. **The instructor's machine clones your repository**
+   ```bash
+   git clone http://hblu.top:3000/MachineLearning2025/<project-name>.git
+   cd <project-name>
+   ```
+
+2. **Install dependencies and run**
+   ```bash
+   uv sync
+   cp .env.example .env
+   # The instructor fills in a test API key
+   uv run streamlit run src/streamlit_app.py
+   ```
+
+3. **5-8 minute demo presentation**
+
+---
+
+## 跨机运行检查清单
+
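+> ⚠️ **Avoid the "but it works on my machine" problem**
+
+### Must Check
+
+| Check | Description | Common mistake |
+|--------|------|----------|
+| **Complete dependencies** | Every dependency is listed in `pyproject.toml` | Forgetting to `uv add` a newly installed package |
+| **Relative paths** | Data and models are referenced by relative paths | `C:\Users\张三\data.csv` |
+| **Environment variables** | The API key is read from `.env` | Hard-coding the key in the source |
+| **Data is obtainable** | Data files have download instructions or are included in the repository | Data exists only locally and was never uploaded |
+| **uv.lock** | The lockfile is committed | Dependency versions are not pinned |
+
+### How to Test Before Submitting
+
+```bash
+# Simulate a clean environment
+cd /tmp
+git clone <your-repository-url>
+cd <project-name>
+uv sync
+cp .env.example .env
+# Fill in the API key
+uv run streamlit run src/streamlit_app.py
+```
+
+### Troubleshooting Common Problems
+
+| Error | Cause | Solution |
+|------|------|----------|
+| `ModuleNotFoundError` | Missing dependency | Run `uv add <package>` and commit again |
+| `FileNotFoundError` | Path problem | Build paths from `Path(__file__).parent` instead of absolute paths |
+| `DEEPSEEK_API_KEY not found` | Environment variable problem | Check the `.env` format and that `python-dotenv` is loading it |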
+
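The two path and environment fixes in the table can be sketched as small helpers. This uses only the standard library; `project_path` and `require_env` are hypothetical names, and loading `.env` into the environment (for example with `python-dotenv`'s `load_dotenv()`) is assumed to happen separately before `require_env` is called.

```python
import os
from pathlib import Path

# Directory containing this file; fall back to the working directory when
# __file__ is undefined (for example in a notebook or REPL).
BASE_DIR = Path(__file__).resolve().parent if "__file__" in globals() else Path.cwd()


def project_path(*parts):
    """Build an absolute path relative to BASE_DIR instead of hard-coding one."""
    return BASE_DIR.joinpath(*parts)


def require_env(name):
    """Return an environment variable, failing early with a readable message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; copy .env.example to .env and fill it in")
    return value


model_path = project_path("models", "lgb_model.pkl")
print(model_path.is_absolute())
```

For a module that lives in `src/`, `BASE_DIR` points at `src/`, so reaching the repository root takes one more `.parent`. Failing early in `require_env` gives a clearer message on the instructor's machine than an authentication error from the API.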
+---
+
+## 评分标准（总分 100）
+
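+> ⚠️ All analysis, comparisons, and decision logic must be clearly presented in `README.md`.
+
+### A. Problem and Data (10 points)
+
+| Dimension | Points | Requirement |
+|------|------|------|
+| Clear task definition | 5 | Label/target, input and output boundaries |
+| Data description and split | 5 | Source link, field meanings, split strategy |
+
+### B. Traditional Machine Learning (30 points)
+
+| Dimension | Points | Requirement |
+|------|------|------|
+| Baseline and reproducible training | 10 | Fixed random seed; the training script runs end to end |
+| Metrics and comparison | 10 | Meets the metric requirement; compared against the baseline |
+| Error analysis | 10 | Shows misclassified samples/buckets and gives directions for improvement |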
+
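The bucketed error analysis asked for in the last row can start from a helper like the one below. It is a standard-library sketch: the function names, the `tenure` field, and the 12-month cut-off are hypothetical, and the six records are toy predictions rather than real results.

```python
from collections import defaultdict


def error_rate_by_bucket(records, bucket_fn):
    """Group predictions into buckets and report the error rate of each bucket."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for record in records:
        bucket = bucket_fn(record)
        totals[bucket] += 1
        errors[bucket] += int(record["y_true"] != record["y_pred"])
    return {bucket: errors[bucket] / totals[bucket] for bucket in sorted(totals)}


def tenure_bucket(record):
    """Hypothetical bucketing rule: split customers by tenure in months."""
    return "new (<12m)" if record["tenure"] < 12 else "established (>=12m)"


# Toy predictions for illustration only.
records = [
    {"tenure": 2, "y_true": 1, "y_pred": 0},
    {"tenure": 5, "y_true": 1, "y_pred": 1},
    {"tenure": 8, "y_true": 0, "y_pred": 1},
    {"tenure": 30, "y_true": 0, "y_pred": 0},
    {"tenure": 48, "y_true": 0, "y_pred": 0},
    {"tenure": 60, "y_true": 1, "y_pred": 1},
]
bucket_errors = error_rate_by_bucket(records, tenure_bucket)
print(bucket_errors)
```

A bucket whose error rate is far above the others is a natural place to look for missing features or label noise, which is the "direction for improvement" the rubric asks for.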
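+### C. LLM + Agent (30 points)
+
+| Dimension | Points | Requirement |
+|------|------|------|
+| Tool calls | 10 | At least 2 tools; the ML tool is called reliably |
+| Structured output | 10 | Clear Pydantic schema; fields carry constraints |
+| Actionable, evidence-backed recommendations | 10 | A list of actions that can actually be carried out, with the supporting evidence cited |
+
+### D. Engineering and Demo (30 points)
+
+| Dimension | Points | Requirement |
+|------|------|------|
+| **Streamlit demo** | **15** | Smooth interaction; shows the full "predict → analyze → recommend" flow |
+| **Cross-machine run** | **10** | Runs directly on the instructor's machine with `git clone && uv sync && uv run` |
+| Code quality | 5 | Clear structure, with type hints and documentation |
+
+### ❌ Common Deductions
+
+- Training/inference does not run on the instructor's machine
+- The project is not managed with `uv`
+- Data leakage (especially with time-series/multi-table data)
+- The Agent invents facts that are not in the dataset
+- **Committing a key to the repository (heavy deduction)**
+
+### ✅ Common Bonus Items
+
+- Efficient data processing with the Polars Lazy API
+- Explainability, threshold strategy, or cost-sensitive analysis
+- Retrieval augmentation that cites traceable evidence
+- Ablation/comparison experiments with clear conclusions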
+
+---
+
+# 附录
+
+## 代码示例
+
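+### Data Processing: Polars Best Practices
+
+```python
+import pandas as pd
+import polars as pl
+
+# ✅ Recommended: use the Lazy API (automatic query optimization)
+lf = pl.scan_csv("data/train.csv")
+result = (
+    lf.filter(pl.col("age") > 30)
+    .group_by("category")
+    .agg(pl.col("value").mean())
+    .collect()  # nothing executes until this final step
+)
+
+# ✅ Recommended: convert to and from pandas when a library needs it
+df_pandas = pd.DataFrame({"age": [25, 40], "value": [1.0, 2.0]})  # placeholder data
+df_polars = pl.from_pandas(df_pandas)
+df_pandas = df_polars.to_pandas()
+```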
+
+### Data Validation: Pydantic + Pandera
+
+```python
+from pydantic import BaseModel, Field
+
+class CustomerFeatures(BaseModel):
+    """Customer feature data model"""
+    age: int = Field(ge=0, le=120, description="Customer age")
+    tenure: int = Field(ge=0, description="Customer tenure (months)")
+    monthly_charges: float = Field(ge=0, description="Monthly charges")
+    contract_type: str = Field(pattern="^(month-to-month|one-year|two-year)$")
+```
+
+```python
+import pandera as pa
+from pandera import Column, Check, DataFrameSchema
+
+# ✅ Define the post-cleaning Schema
+clean_data_schema = DataFrameSchema(
+    columns={
+        "age": Column(pa.Int, checks=[Check.ge(0), Check.le(120)], nullable=False),
+        "tenure": Column(pa.Int, checks=[Check.ge(0)], nullable=False),
+        "monthly_charges": Column(pa.Float, checks=[Check.ge(0)], nullable=False),
+    },
+    strict=True,   # reject columns that are not declared above
+    coerce=True,   # cast columns to the declared dtypes before checking
+)
+```
+
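+### Machine Learning: sklearn + LightGBM
+
+```python
+from sklearn.model_selection import train_test_split
+from sklearn.linear_model import LogisticRegression
+from sklearn.metrics import roc_auc_score
+import lightgbm as lgb
+import joblib
+
+# X (features) and y (binary labels) are assumed to come from your cleaning pipeline
+X_train, X_test, y_train, y_test = train_test_split(
+    X, y, test_size=0.2, stratify=y, random_state=42
+)
+
+# Baseline model
+baseline = LogisticRegression(max_iter=1000, random_state=42)
+baseline.fit(X_train, y_train)
+print("Baseline ROC-AUC:", roc_auc_score(y_test, baseline.predict_proba(X_test)[:, 1]))
+
+# High-performance model
+lgb_model = lgb.LGBMClassifier(n_estimators=500, learning_rate=0.05, random_state=42)
+lgb_model.fit(X_train, y_train)
+print("LightGBM ROC-AUC:", roc_auc_score(y_test, lgb_model.predict_proba(X_test)[:, 1]))
+
+# Save the model (the models/ directory must already exist)
+joblib.dump(lgb_model, "models/lgb_model.pkl")
+```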
+
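+### Agent: pydantic-ai Example
+
+```python
+from pydantic import BaseModel, Field
+from pydantic_ai import Agent, RunContext
+
+# CustomerFeatures is the Pydantic model from the data-validation example above
+
+class Decision(BaseModel):
+    """Structured decision produced by the Agent"""
+    risk_score: float = Field(ge=0, le=1, description="Predicted risk probability")
+    decision: str = Field(description="Recommended strategy")
+    actions: list[str] = Field(description="List of actionable steps")
+    rationale: str = Field(description="Basis for the decision")
+
+agent = Agent(
+    "deepseek:deepseek-chat",
+    output_type=Decision,
+    system_prompt="You are a business decision assistant. You must call the tool to get the prediction first, then give a structured decision.",
+)
+
+@agent.tool
+def predict_risk(ctx: RunContext, features: CustomerFeatures) -> float:
+    """Call the ML model and return the risk score"""
+    # TODO: load the saved model and return its predicted probability for `features`
+    raise NotImplementedError
+```
+
+### API Key Configuration
+
+> ⚠️ **Do not put the key in code, and do not commit it to the repository!**
+
+Create `.env.example` (committed to the repository):
+```
+DEEPSEEK_API_KEY=your-key-here
+```
+
+Copy it to `.env` and fill in the real key (`.env` is excluded by `.gitignore`).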
+
+---
+
+## 建议项目结构
+
+```
+ml_course_design/
+├── pyproject.toml            # Project configuration and dependencies
+├── uv.lock                   # Locked dependency versions
+├── README.md                 # Project description and report
+├── .env.example              # Environment variable template
+├── .gitignore                # Git ignore rules
+│
+├── data/                     # Data directory
+│   └── README.md             # Data source notes
+│
+├── models/                   # Training artifacts
+│   └── .gitkeep
+│
+├── src/                      # Core code
+│   ├── __init__.py
+│   ├── data.py               # Data loading/cleaning
+│   ├── features.py           # Pydantic feature models
+│   ├── train.py              # Training and evaluation
+│   ├── infer.py              # Inference interface
+│   ├── agent_app.py          # Agent entry point
+│   └── streamlit_app.py      # Demo entry point
+│
+└── tests/                    # Tests
+    └── test_*.py
+```
+
+---
+
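+## README.md Template (for Your Project)
+
+Use the following as the template for your project's `README.md`.
+
+````markdown
+# Project Name
+
+> **Machine Learning (Python) Course Project**
+
+## 👥 Team Members
+
+| Name | Student ID | Contribution |
+|------|------|------|
+| Zhang San | 2024001 | Data processing, model training |
+| Li Si | 2024002 | Agent development, Streamlit |
+| Wang Wu | 2024003 | Testing, documentation |
+
+## 📝 Project Overview
+
+(1-2 paragraphs describing the project goal, the chosen dataset, and the problem being solved)
+
+## 🚀 Quick Start
+
+```bash
+# Clone the repository
+git clone http://hblu.top:3000/MachineLearning2025/GXX-ProjectName.git
+cd GXX-ProjectName
+
+# Install dependencies
+uv sync
+
+# Configure environment variables
+cp .env.example .env
+# Edit .env and fill in the API key
+
+# Run the demo
+uv run streamlit run src/streamlit_app.py
+```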
+
+---
+
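+## 1️⃣ Problem Definition and Data
+
+### 1.1 Task Description
+
+(Describe the prediction task type: classification/regression/time series, and the business goal)
+
+### 1.2 Data Source
+
+| Item | Description |
+|------|------|
+| Dataset name | XXX |
+| Dataset link | [Kaggle](https://...) |
+| Sample size | X,XXX rows |
+| Number of features | XX |
+
+### 1.3 Data Split and Leakage Prevention
+
+(How are the training/validation/test sets split? How do you make sure there is no data leakage?)
+
+---
+
+## 2️⃣ Machine Learning Pipeline
+
+### 2.1 Baseline Model
+
+| Model | Metric | Result |
+|------|------|------|
+| Logistic Regression | ROC-AUC | 0.XX |
+
+### 2.2 Advanced Model
+
+| Model | Metric | Result |
+|------|------|------|
+| LightGBM | ROC-AUC | 0.XX |
+
+### 2.3 Error Analysis
+
+(On which samples does the model perform poorly? Why?)
+
+---
+
+## 3️⃣ Agent Implementation
+
+### 3.1 Tool Definitions
+
+| Tool | Function | Input | Output |
+|--------|------|------|------|
+| `predict_risk` | Calls the ML model to predict | CustomerFeatures | float |
+| `explain_features` | Explains feature influence | CustomerFeatures | list[str] |
+
+### 3.2 Decision Flow
+
+(How does the Agent use the tools? For example: predict → explain → recommend)
+
+### 3.3 Example Case
+
+**Input**:
+```
+Please analyze this customer's churn risk: age 35, tenure 2 months, monthly charge 89.99
+```
+
+**Output**:
+```json
+{
+  "risk_score": 0.72,
+  "decision": "High risk; proactive retention recommended",
+  "actions": ["Send a discount SMS", "Customer-service follow-up call"],
+  "rationale": "New customer + month-to-month contract are high-risk churn features"
+}
+```
+
+---
+
+## 4️⃣ Development Notes
+
+### 4.1 Main Difficulties and Solutions
+
+(What was the biggest difficulty? How was it solved?)
+
+### 4.2 Reflections on AI-Assisted Programming
+
+(How was the experience of using AI tools? Where did they help? What needs attention?)
+
+### 4.3 Limitations and Future Improvements
+
+(With more time, what else could be improved?)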
+````
+
+---
+
+## 参考资料
+
+### Core Tool Documentation
+
+| Resource | Link | Notes |
+|------|------|------|
+| uv documentation | https://docs.astral.sh/uv/ | Python project manager |
+| Polars user guide | https://pola.rs/ | High-performance DataFrame |
+| Pydantic documentation | https://docs.pydantic.dev/ | Data validation and settings |
+| Pandera documentation | https://pandera.readthedocs.io/ | DataFrame Schema validation |
+| pydantic-ai documentation | https://ai.pydantic.dev/ | Agent framework |
+| DeepSeek API | https://api.deepseek.com | OpenAI-compatible API base URL |
+
+### Recommended Learning Resources
+
+| Resource | Link |
+|------|------|
+| Polars vs Pandas | https://pola.rs/user-guide/migration/pandas/ |
+| Pydantic AI quick start | https://ai.pydantic.dev/quick-start/ |
+| Pandera quick start | https://pandera.readthedocs.io/en/stable/try_pandera.html |
+| uv project workflow | https://docs.astral.sh/uv/concepts/projects/ |
+
+---
+
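+## 📋 Checklist (Self-Check Before Submitting)
+
+- [ ] Dependencies install with `uv sync`; no manual virtual environment is needed
+- [ ] `.gitignore` covers `.env`, `__pycache__`, and large files
+- [ ] The project reproduces in a clean environment (`git clone && uv sync && uv run`)
+- [ ] No API key or sensitive information is committed
+- [ ] Data processing uses Polars
+- [ ] Features and output models are defined with Pydantic
+- [ ] The Agent has at least 2 tools (including 1 ML tool)
+- [ ] README.md explains the data split strategy
+- [ ] The demo runs correctly
+
+---
+
+> 💬 **Questions?** Ask in the course group chat or open an Issue, and we will reply as soon as possible.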
--- a/agent_decision.py
+++ b/agent_decision.py
@@ -0,0 +1,212 @@
+import json
+import re
+
+import joblib
+import pandas as pd
+
+# ==========================================
+# 1. Mock LLM interface (Mock Agent)
+# ==========================================
+def mock_llm_generate(prompt):
+    """
+    Simulate LLM generation.
+    In a real application this would call the OpenAI/Anthropic/DeepSeek API.
+    """
+    # Simple rules stand in for the LLM's "reasoning": the reply is built from
+    # key information extracted from the prompt.
+
+    # Extract the predicted probability from the prompt text. The pattern must match
+    # the line built in analyze_customer, e.g. "...概率为: 78.5%"
+    prob_match = re.search(r"预测模型判定该客户订阅定存的概率为:\s*([\d\.]+)%", prompt)
+    prob = float(prob_match.group(1)) if prob_match else 0.0
+    
+    # 模拟 LLM 根据概率生成的建议
+    if prob > 70:
+        strategy = "VIP 专属服务"
+        action = "资深理财经理致电"
+        script = "您好，鉴于您良好的信用记录，我们为您预留了一款高收益理财产品..."
+        reason = "客户属于高意向群体，且过往活动反馈良好。"
+    elif prob > 40:
+        strategy = "标准营销"
+        action = "普通客服致电或短信触达"
+        script = "您好，我行近期推出了几款稳健型存款产品，占用您一分钟..."
+        reason = "客户意向中等，建议通过低成本渠道试探。"
+    else:
+        strategy = "静默观察"
+        action = "发送月度邮件"
+        script = "(邮件内容) 本月财经摘要..."
+        reason = "客户意向较低，频繁打扰可能导致反感。"
+
+    # 构造 JSON 输出
+    response = {
+        "customer_id": "Unknown", # 实际中会从 Context 获取
+        "analysis": {
+            "score": prob,
+            "segment": strategy
+        },
+        "action_plan": {
+            "primary_action": action,
+            "backup_action": "记录反馈并更新标签",
+            "suggested_script": script
+        },
+        "reasoning": reason
+    }
+    
+    return json.dumps(response, ensure_ascii=False, indent=2)
+
+# ==========================================
+# 2. Agent 核心类
+# ==========================================
+class MarketingAgent:
+    def __init__(self, artifact_path='model_artifacts.pkl'):
+        print(f"Agent 正在加载模型资产: {artifact_path} ...")
+        self.artifacts = joblib.load(artifact_path)
+        self.model = self.artifacts['model']
+        self.encoders = self.artifacts['encoders']
+        self.feature_meta = self.artifacts['feature_meta']
+        
+    def preprocess(self, customer_data):
+        """将原始字典数据转换为模型可接受的 DataFrame"""
+        # 创建 DataFrame
+        df = pd.DataFrame([customer_data])
+        
+        # Drop duration if present (it was excluded during training)
+        if 'duration' in df.columns:
+            df = df.drop('duration', axis=1)
+
+        # Encode categorical features
+        for col, le in self.encoders.items():
+            if col in df.columns:
+                try:
+                    df[col] = le.transform(df[col])
+                except ValueError:
+                    # Category unseen during training: fall back to code 0 so the demo
+                    # keeps running. Code 0 is a real category to the model, so a
+                    # production system should raise or use a dedicated "unknown" bucket.
+                    df[col] = 0
+
+        # Reindex to the training-time column order stored in feature_meta['all_cols'];
+        # XGBoost can reject or misread inputs whose columns are ordered differently
+        df = df[self.feature_meta['all_cols']]
+
+        return df
+
+    def analyze_customer(self, customer_data):
+        """
+        Agent 的主工作流：
+        1. 感知 (Perception): 接收数据，进行预处理。
+        2. 思考 (Cognition - Model): 调用 ML 模型预测概率。
+        3. 规划 (Planning - LLM): 构建 Prompt，调用 LLM 生成建议。
+        4. 行动 (Action): 输出结构化建议。
+        """
+        # 1. Preprocess
+        X_input = self.preprocess(customer_data)
+
+        # 2. Model prediction: probability of class 1 (yes), cast to a plain Python float
+        prob = float(self.model.predict_proba(X_input)[0][1])
+        prob_percent = round(prob * 100, 2)
+
+        # List every input feature in the prompt (simple approach);
+        # in practice SHAP values could narrow this to the top 3 contributing features
+        feature_desc = ", ".join([f"{k}={v}" for k, v in customer_data.items() if k != 'duration'])
+        
+        # 3. 构建 Prompt
+        # 这是一个 "Prompt Engineering" 的过程
+        system_prompt = """你是一个专业的银行营销决策 Agent。请根据客户数据和预测模型的结果，给出具体的执行建议。
+要求输出必须是 JSON 格式。"""
+        
+        user_prompt = f"""
+        【输入数据】
+        客户特征: {feature_desc}
+        
+        【模型分析】
+        预测模型判定该客户订阅定存的概率为: {prob_percent}%
+        
+        【业务规则库】
+        - 概率 > 70%: 高价值，高优先级，人工介入。
+        - 概率 40%-70%: 潜在价值，自动化营销 + 人工辅助。
+        - 概率 < 40%: 低价值，仅自动化触达。
+        
+        【任务】
+        请基于以上信息，生成该客户的营销建议 JSON，包含：
+        - 客户分群 (segment)
+        - 推荐行动 (primary_action)
+        - 话术建议 (suggested_script)
+        - 决策依据 (reasoning)
+        """
+        
+        print(f"\n--- Agent 正在思考 (构建 Prompt) ---\n[Prompt 摘要] 预测概率: {prob_percent}%")
+        
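+        # 4. Call the LLM (mocked). The mock only reads user_prompt; system_prompt is unused here.
+        # In a real deployment (openai>=1.0): response = client.chat.completions.create(...)
+        llm_response = mock_llm_generate(user_prompt)
+
+        return json.loads(llm_response)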
+
+# ==========================================
+# 3. 运行演示
+# ==========================================
+if __name__ == "__main__":
+    # 加载原始数据取样
+    df_raw = pd.read_csv('bank.csv')
+    
+    # 实例化 Agent
+    agent = MarketingAgent()
+    
+    print("\n" + "="*50)
+    print("开始模拟业务流程...")
+    print("="*50)
+    
+    # Pick 3 fixed customers from the dataset for the simulation
+    sample_indices = [1, 20, 100]
+    
+    # 构造一个高意向客户 (VIP 模拟)
+    # 基于特征重要性: poutcome=success (最重要), contact=cellular, housing=no, balance=high
+    vip_customer = {
+        'age': 35,
+        'job': 'management',
+        'marital': 'married',
+        'education': 'tertiary',
+        'default': 'no',
+        'balance': 5000,
+        'housing': 'no',
+        'loan': 'no',
+        'contact': 'cellular',
+        'day': 15,
+        'month': 'oct',
+        'duration': 0, # 会被移除
+        'campaign': 1,
+        'pdays': 90,
+        'previous': 2,
+        'poutcome': 'success', # 强特征
+        'deposit': 'yes' # 仅用于展示
+    }
+    
+    # 将 VIP 客户加入测试列表 (使用特殊的 -1 索引标记)
+    test_cases = [(i, df_raw.iloc[i].to_dict()) for i in sample_indices]
+    test_cases.append((-1, vip_customer))
+
+    for idx, customer_dict in test_cases:
+        # 移除结果列 'deposit'，模拟这是新客户
+        if 'deposit' in customer_dict:
+            real_result = customer_dict.pop('deposit')
+        else:
+            real_result = "Unknown"
+        
+        if idx == -1:
+            print(f"\n>>> 处理客户 ID: VIP-Demo (人工构造的高意向客户)")
+        else:
+            print(f"\n>>> 处理客户 ID: {idx}")
+            
+        print(f"真实结果 (仅供参考): {real_result}")
+        
+        # Agent 工作
+        decision = agent.analyze_customer(customer_dict)
+        
+        # 打印结果
+        print("\n[Agent 最终建议]")
+        print(json.dumps(decision, ensure_ascii=False, indent=2))
+        print("-" * 30)
--- a/model_artifacts.pkl
+++ b/model_artifacts.pkl
--- a/smart_agent.py
+++ b/smart_agent.py
@@ -0,0 +1,236 @@
+import pandas as pd
+import joblib
+import json
+import re
+
+# ==========================================
+# 0. 基础工具类定义
+# ==========================================
+class BaseTool:
+    def __init__(self, name, description):
+        self.name = name
+        self.description = description
+
+    def run(self, *args, **kwargs):
+        raise NotImplementedError
+
+# ==========================================
+# 1. 工具实现
+# ==========================================
+
+class MLPredictionTool(BaseTool):
+    """
+    工具 1: 机器学习预测工具
+    功能: 加载预训练模型，预测客户转化概率，并提供特征归因。
+    """
+    def __init__(self, artifact_path='model_artifacts.pkl'):
+        super().__init__("ML_Predictor", "输入客户特征，输出购买概率和关键影响因素")
+        print(f"[{self.name}] 正在加载模型资产...")
+        self.artifacts = joblib.load(artifact_path)
+        self.model = self.artifacts['model']
+        self.encoders = self.artifacts['encoders']
+        self.feature_meta = self.artifacts['feature_meta']
+        # 获取特征重要性，用于简单的归因解释
+        self.feature_importances = pd.Series(
+            self.model.feature_importances_, 
+            index=self.feature_meta['all_cols']
+        ).sort_values(ascending=False)
+
+    def preprocess(self, customer_data):
+        df = pd.DataFrame([customer_data])
+        if 'duration' in df.columns:
+            df = df.drop('duration', axis=1)
+
+        for col, le in self.encoders.items():
+            if col in df.columns:
+                try:
+                    df[col] = le.transform(df[col])
+                except (ValueError, TypeError):
+                    df[col] = 0  # unseen category: fall back to code 0 (demo only)
+
+        # Fill any missing columns with 0 and keep the training-time column order
+        for col in self.feature_meta['all_cols']:
+            if col not in df.columns:
+                df[col] = 0
+
+        return df[self.feature_meta['all_cols']]
+
+    def run(self, customer_data):
+        # 1. 预处理
+        X = self.preprocess(customer_data)
+        
+        # 2. 预测
+        prob = float(self.model.predict_proba(X)[0][1]) # 强制转换为 python float
+        
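+        # 3. Attribution
+        # Simple approach: report this customer's values for the 5 globally most important
+        # features. This is global feature importance, not a per-customer explanation.
+        top_features = self.feature_importances.head(5).index.tolist()
+        attribution = {feat: customer_data.get(feat, 'N/A') for feat in top_features}
+
+        return {
+            "probability": round(prob, 4),
+            # "risk" here means the risk of NOT converting: a low purchase probability is high risk
+            "risk_level": "High" if prob < 0.3 else ("Medium" if prob < 0.7 else "Low"),
+            "key_factors": attribution
+        }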
+
+class StrategyRetrievalTool(BaseTool):
+    """
+    工具 2: 策略检索工具
+    功能: 根据客户分群或意向等级，检索对应的营销话术和产品包。
+    """
+    def __init__(self):
+        super().__init__("Strategy_Retriever", "根据意向分检索营销策略")
+        # 模拟向量数据库或规则库
+        self.knowledge_base = {
+            "High_Intent": {
+                "segment": "VIP_Growth",
+                "channel": "Personal_Call",
+                "product": "大额存单/结构性存款",
+                "script_template": "尊贵的{name}，鉴于您良好的{key_factor}，我们要为您推荐专属..."
+            },
+            "Medium_Intent": {
+                "segment": "Potential_Saver",
+                "channel": "SMS_Web",
+                "product": "灵活理财/定投",
+                "script_template": "你好，发现您对{key_factor}感兴趣，这里有一份理财攻略..."
+            },
+            "Low_Intent": {
+                "segment": "General_Mass",
+                "channel": "Email",
+                "product": "货币基金/新人礼包",
+                "script_template": "本月财经快讯：如何打理您的零钱..."
+            }
+        }
+
+    def run(self, probability):
+        if probability > 0.7:
+            key = "High_Intent"
+        elif probability > 0.4:
+            key = "Medium_Intent"
+        else:
+            key = "Low_Intent"
+        
+        return self.knowledge_base[key]
+
+# ==========================================
+# 2. Agent 定义 (Orchestrator)
+# ==========================================
+
+class SalesAgent:
+    def __init__(self):
+        self.tools = {
+            "predictor": MLPredictionTool(),
+            "retriever": StrategyRetrievalTool()
+        }
+    
+    def mock_llm_inference(self, prompt):
+        """
+        Simulate LLM generation.
+        In a real deployment (openai>=1.0) this would call
+        client.chat.completions.create(model="gpt-4", messages=...)
+        """
+        # This is a mock: it parses the Context out of the prompt with a regex and
+        # echoes it back, whereas a real LLM would interpret and polish the content.
+        
+        # 提取关键信息用于 Mock 输出
+        try:
+            context_str = re.search(r"【Context】(.*?)【Instruction】", prompt, re.S).group(1)
+            context = json.loads(context_str)
+            
+            pred_result = context['prediction']
+            strategy_result = context['strategy']
+            customer_info = context['customer_raw']
+            
+            # 模拟 LLM 生成话术
+            script = strategy_result['script_template'].format(
+                name="客户", 
+                key_factor=list(pred_result['key_factors'].keys())[0]
+            )
+            
+            response = {
+                "thought_process": f"模型预测概率为 {pred_result['probability']}，属于 {strategy_result['segment']} 客群。已检索到对应策略，建议通过 {strategy_result['channel']} 触达。",
+                "final_decision": {
+                    "action": strategy_result['channel'],
+                    "product_recommendation": strategy_result['product'],
+                    "personalized_script": script,
+                    # Format as a percentage with one decimal to avoid float artifacts such as 28.499999999999996%
+                    "attribution_explanation": f"预测模型显示该客户成交概率为 {pred_result['probability']:.1%}，主要受 {json.dumps(pred_result['key_factors'], ensure_ascii=False)} 等因素影响。"
+                }
+            }
+            return json.dumps(response, ensure_ascii=False, indent=2)
+        except Exception as e:
+            return json.dumps({"error": f"LLM Mock Failed: {str(e)}"})
+
+    def process_request(self, customer_data):
+        print(f"\n[Agent] 收到新请求: {customer_data.get('job', 'Unknown')} | {customer_data.get('age')}岁")
+        
+        # --- Step 1: 调用 ML 工具进行预测 ---
+        print(f"[Agent] 调用工具: {self.tools['predictor'].name} ...")
+        pred_result = self.tools['predictor'].run(customer_data)
+        print(f"   >>> 预测结果: 概率={pred_result['probability']}, 关键因素={list(pred_result['key_factors'].keys())}")
+        
+        # --- Step 2: 调用 检索工具获取策略 ---
+        print(f"[Agent] 调用工具: {self.tools['retriever'].name} ...")
+        strategy_result = self.tools['retriever'].run(pred_result['probability'])
+        print(f"   >>> 检索结果: 渠道={strategy_result['channel']}, 产品={strategy_result['product']}")
+        
+        # --- Step 3: LLM 整合信息 ---
+        print(f"[Agent] 请求 LLM 进行最终决策与生成...")
+        
+        # 构建 Context
+        context = {
+            "customer_raw": customer_data,
+            "prediction": pred_result,
+            "strategy": strategy_result
+        }
+        
+        prompt = f"""
+        你是一个智能营销助手。请根据以下上下文信息，生成结构化的营销建议。
+        
+        【Context】
+        {json.dumps(context, ensure_ascii=False)}
+        
+        【Instruction】
+        1. 解释模型预测结果。
+        2. 结合策略库，生成具体的话术。
+        3. 输出 JSON 格式。
+        """
+        
+        final_output = self.mock_llm_inference(prompt)
+        return final_output
+
+# ==========================================
+# 3. 主程序入口
+# ==========================================
+if __name__ == "__main__":
+    # 1. 准备数据
+    df_raw = pd.read_csv('bank.csv')
+    
+    # 2. 初始化 Agent
+    agent = SalesAgent()
+    
+    # 3. 模拟场景
+    print("\n" + "="*60)
+    print("场景演示: Agent 协调多个工具完成决策")
+    print("="*60)
+    
+    # Scenario A: low-intent customer
+    customer_a = df_raw.iloc[1].to_dict()  # assumed to score a low probability
+    customer_a.pop('deposit', None)
+    
+    result_a = agent.process_request(customer_a)
+    print("\n[Agent Final Output]")
+    print(result_a)
+    
+    print("-" * 60)
+    
+    # 场景 B: 高意向客户 (人工构造)
+    customer_b = customer_a.copy()
+    customer_b.update({
+        'poutcome': 'success', 
+        'duration': 1000, # 注意工具内部会移除 duration，这里只是模拟输入
+        'contact': 'cellular',
+        'month': 'oct'
+    })
+    
+    result_b = agent.process_request(customer_b)
+    print("\n[Agent Final Output]")
+    print(result_b)
--- a/train_model.py
+++ b/train_model.py
@@ -0,0 +1,90 @@
+import joblib
+import pandas as pd
+import xgboost as xgb
+from sklearn.metrics import accuracy_score, classification_report, roc_auc_score
+from sklearn.model_selection import train_test_split
+from sklearn.preprocessing import LabelEncoder
+
+# 1. Load data
+print("Loading data...")
+df = pd.read_csv('bank.csv')
+
+# 2. Preprocess
+print("Preprocessing data...")
+
+# Drop the duration column: call duration is only known after the call ends,
+# so keeping it would leak the outcome into the features
+if 'duration' in df.columns:
+    df = df.drop('duration', axis=1)
+
+# Separate features and target
+X = df.drop('deposit', axis=1)
+y = df['deposit']
+
+# Encode the target variable (yes -> 1, no -> 0)
+le_target = LabelEncoder()
+y = le_target.fit_transform(y)
+
+# Identify categorical and numeric features
+categorical_cols = X.select_dtypes(include=['object']).columns.tolist()
+numeric_cols = X.select_dtypes(include=['int64', 'float64']).columns.tolist()
+
+# Save the column metadata for the Agent
+feature_meta = {
+    'numeric_cols': numeric_cols,
+    'categorical_cols': categorical_cols,
+    'all_cols': list(X.columns)
+}
+
+# Label-encode the categorical features.
+# The fitted encoders are saved so the Agent can apply the same mapping at inference time.
+encoders = {}
+for col in categorical_cols:
+    le = LabelEncoder()
+    X[col] = le.fit_transform(X[col])
+    encoders[col] = le
+
+# 3. Split the dataset (stratified so both splits keep the same class ratio)
+X_train, X_test, y_train, y_test = train_test_split(
+    X, y, test_size=0.2, stratify=y, random_state=42
+)
+
+# 4. Train the model (fixed random_state for reproducibility)
+print("Training the XGBoost model...")
+model = xgb.XGBClassifier(
+    n_estimators=100,
+    learning_rate=0.1,
+    max_depth=5,
+    eval_metric='logloss',
+    random_state=42
+)
+model.fit(X_train, y_train)
+
+# 5. Evaluate the model
+y_pred = model.predict(X_test)
+y_pred_proba = model.predict_proba(X_test)[:, 1]
+
+print("\nModel evaluation results:")
+print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
+print(f"ROC-AUC: {roc_auc_score(y_test, y_pred_proba):.4f}")
+print("\nClassification Report:")
+print(classification_report(y_test, y_pred))
+
+# 6. Save the artifacts
+print("\nSaving the model and preprocessing tools...")
+artifacts = {
+    'model': model,
+    'encoders': encoders,
+    'target_encoder': le_target,
+    'feature_meta': feature_meta
+}
+joblib.dump(artifacts, 'model_artifacts.pkl')
+
+# Also print the feature importances for reference
+importances = model.feature_importances_
+feature_names = X.columns
+feat_imp_df = pd.DataFrame({'Feature': feature_names, 'Importance': importances})
+feat_imp_df = feat_imp_df.sort_values(by='Importance', ascending=False)
+print("\nTop 5 feature importances:")
+print(feat_imp_df.head())
+
+print("\nDone! Model saved as 'model_artifacts.pkl'")