Helps you decide when to use regex and when to fall back to an LLM for parsing structured text.
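To install, copy the prompt below and let your AI assistant handle the setup (recommended for beginners):
Please install the "regex-vs-llm-structured-text" skill from askskill for me:
1. Download https://raw.githubusercontent.com/affaan-m/ECC/main/skills/regex-vs-llm-structured-text/SKILL.md
2. Save it as ~/.claude/skills/regex-vs-llm-structured-text/SKILL.md
3. Reload skills once it is installed and tell me it is ready to use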
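I need to extract timestamps, error codes, and request IDs from application logs in several formats. Give me a decision framework for which parts suit regular expressions and which low-confidence, irregular formats should go to an LLM, and explain how to design the fallback mechanism.
A parsing strategy that starts with regex and hands off to an LLM based on confidence, with implementation advice.
Please help me evaluate this: customer emails mostly follow a fixed template but occasionally include free-text additions. I want to extract the order number, shipping date, and address. Explain why regex should come first, and define which exceptions justify bringing in an LLM.
A rules-first approach for semi-structured emails, plus a checklist of the boundary conditions that trigger the LLM.
Form text produced by OCR has a roughly fixed layout but suffers from misalignment, missing fields, and noise. Give me a structured-text parsing plan: which fields to clean and extract directly with regex, when an LLM is needed for semantic repair, and recommendations on the quality/cost trade-off.
A hybrid parsing framework for OCR forms, covering field routing, exception handling, and cost control.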
A practical decision framework for parsing structured text (quizzes, forms, invoices, documents). The key insight: regex handles 95-98% of cases cheaply and deterministically. Reserve expensive LLM calls for the remaining edge cases.
Is the text format consistent and repeating?
├── Yes (>90% follows a pattern) → Start with Regex
│ ├── Regex handles 95%+ → Done, no LLM needed
│ └── Regex handles <95% → Add LLM for edge cases only
└── No (free-form, highly variable) → Use LLM directly
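One way to apply the tree in practice is to measure how much of a sample the regex already covers. The sketch below is illustrative only: `regex_coverage` and `choose_strategy` are hypothetical helper names, and it assumes that "follows a pattern" can be approximated by "the regex matches the sample".

```python
import re


def regex_coverage(samples: list[str], pattern: re.Pattern) -> float:
    """Return the fraction of samples in which the pattern matches at least once."""
    if not samples:
        return 0.0
    hits = sum(1 for sample in samples if pattern.search(sample))
    return hits / len(samples)


def choose_strategy(coverage: float) -> str:
    """Map a measured coverage value onto the branches of the decision tree."""
    if coverage <= 0.90:
        # Free-form or highly variable input: go to the LLM directly.
        return "llm_direct"
    if coverage >= 0.95:
        # Regex alone is enough.
        return "regex_only"
    # Consistent format, but regex misses too much: add an LLM for edge cases.
    return "regex_plus_llm"
```

Running `regex_coverage` on a few hundred representative records before building anything gives a rough idea of which branch applies.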
Source Text
│
▼
[Regex Parser] ─── Extracts structure (95-98% accuracy)
│
▼
[Text Cleaner] ─── Removes noise (markers, page numbers, artifacts)
│
▼
[Confidence Scorer] ─── Flags low-confidence extractions
│
├── High confidence (≥0.95) → Direct output
│
└── Low confidence (<0.95) → [LLM Validator] → Output
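The Text Cleaner stage is not spelled out in the source, so the following is only a minimal sketch of what it might look like. The two noise patterns (`Page N of M` lines and `[src:...]` inline markers) are assumptions chosen for illustration; real documents will need their own list.

```python
import re

# Hypothetical noise patterns, not taken from the source document.
_PAGE_LINE = re.compile(
    r"^\s*(?:page\s+\d+(?:\s+of\s+\d+)?|-\s*\d+\s*-)\s*$",
    re.IGNORECASE,
)
_INLINE_MARKER = re.compile(r"\s*\[src:[^\]]*\]")


def clean_text(raw: str) -> str:
    """Drop page-number lines and inline markers, then tidy blank lines."""
    kept = []
    for line in raw.splitlines():
        line = _INLINE_MARKER.sub("", line).rstrip()
        if _PAGE_LINE.match(line):
            continue  # Standalone page number such as "Page 3 of 10" or "- 4 -"
        kept.append(line)
    # Collapse runs of blank lines that removed lines may leave behind.
    return re.sub(r"\n{3,}", "\n\n", "\n".join(kept)).strip()
```

Keeping the cleaner as a pure string-to-string function means it can be applied either to raw input or to individual extracted fields.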
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ParsedItem:
    id: str
    text: str
    choices: tuple[str, ...]
    answer: str
    confidence: float = 1.0


def parse_structured_text(content: str) -> list[ParsedItem]:
    """Parse structured text using regex patterns."""
    pattern = re.compile(
        r"(?P<id>\d+)\.\s*(?P<text>.+?)\n"  # "1. Question text"
        r"(?P<choices>(?:[A-D]\..+?\n)+)"  # One or more "A. choice" lines
        r"Answer:\s*(?P<answer>[A-D])",  # "Answer: B"
        re.MULTILINE | re.DOTALL,
    )
    items = []
    for match in pattern.finditer(content):
        # Split the captured choices block into individual choice strings.
        choices = tuple(
            c.strip() for c in re.findall(r"[A-D]\.\s*(.+)", match.group("choices"))
        )
        items.append(ParsedItem(
            id=match.group("id"),
            text=match.group("text").strip(),
            choices=choices,
            answer=match.group("answer"),
        ))
    return items
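A regex parser like the one above silently skips anything it cannot match, so those records never reach a confidence check. A possible complement, sketched here under assumptions, is to split the input into numbered blocks and list the ones the pattern misses. `split_into_blocks` and `find_unparsed_blocks` are hypothetical helpers, and `QUIZ_ITEM` is a copy of the parser's pattern so the block runs on its own.

```python
import re

# Copy of the quiz pattern used by the parser.
QUIZ_ITEM = re.compile(
    r"(?P<id>\d+)\.\s*(?P<text>.+?)\n"
    r"(?P<choices>(?:[A-D]\..+?\n)+)"
    r"Answer:\s*(?P<answer>[A-D])",
    re.MULTILINE | re.DOTALL,
)
# Assumption: every item starts on its own line with "<number>. ".
ITEM_START = re.compile(r"^\d+\.\s", re.MULTILINE)


def split_into_blocks(content: str) -> list[str]:
    """Split text into one block per numbered item (text before item 1 is ignored)."""
    starts = [m.start() for m in ITEM_START.finditer(content)]
    bounds = starts + [len(content)]
    return [content[a:b].strip() for a, b in zip(bounds, bounds[1:])]


def find_unparsed_blocks(content: str, pattern: re.Pattern) -> list[str]:
    """Return the blocks the pattern fails to match; candidates for LLM review."""
    return [block for block in split_into_blocks(content) if not pattern.search(block)]
```

For example, an item whose choices are written as `a) Paris` instead of `A. Paris` would show up in `find_unparsed_blocks` rather than disappearing from the output.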
Flag items that may need LLM review:
@dataclass(frozen=True)
class ConfidenceFlag:
    item_id: str
    score: float
    reasons: tuple[str, ...]


def score_confidence(item: ParsedItem) -> ConfidenceFlag:
    """Score extraction confidence and flag issues."""
    reasons = []
    score = 1.0
    if len(item.choices) < 3:
        reasons.append("few_choices")
        score -= 0.3
    if not item.answer:
        reasons.append("missing_answer")
        score -= 0.5
    if len(item.text) < 10:
        reasons.append("short_text")
        score -= 0.2
    return ConfidenceFlag(
        item_id=item.id,
        score=max(0.0, score),
        reasons=tuple(reasons),
    )


def identify_low_confidence(
    items: list[ParsedItem],
    threshold: float = 0.95,
) -> list[ConfidenceFlag]:
    """Return items below confidence threshold."""
    flags = [score_confidence(item) for item in items]
    return [f for f in flags if f.score < threshold]
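import json
from dataclasses import replace


def validate_with_llm(
    item: ParsedItem,
    original_text: str,
    client,
) -> ParsedItem:
    """Use LLM to fix low-confidence extractions."""
    response = client.messages.create(
        model="claude-haiku-4-5-20251001",  # Small, low-cost model for validation
        max_tokens=500,
        messages=[{
            "role": "user",
            "content": (
                "Extract the question, choices, and answer from this text.\n\n"
                f"Text: {original_text}\n\n"
                f"Current extraction: {item}\n\n"
                "Return corrected JSON if needed, or 'CORRECT' if accurate."
            ),
        }],
    )
    # Parse LLM response and return corrected item...
    # The parsing below is a sketch. It assumes the reply is either the literal
    # CORRECT or a JSON object with "text", "choices", and "answer" keys.
    reply = response.content[0].text.strip()
    if reply.strip("'\"") == "CORRECT":
        return item
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        data = None
    if not isinstance(data, dict):
        return item  # Unusable reply: keep the regex result
    return replace(
        item,
        text=data.get("text", item.text),
        choices=tuple(data.get("choices", item.choices)),
        answer=data.get("answer", item.answer),
    )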
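To tie the stages together, the fallback step can be written so that the LLM call is injected as a plain function. This keeps the routing testable without network access. The sketch below is an assumption about how that wiring could look: `Item` is a hypothetical minimal stand-in for `ParsedItem`, and `apply_fallback` is not part of the source.

```python
from dataclasses import dataclass, replace
from typing import Callable


@dataclass(frozen=True)
class Item:
    """Hypothetical minimal record standing in for ParsedItem."""
    id: str
    answer: str
    confidence: float


def apply_fallback(
    items: list[Item],
    fallback: Callable[[Item], Item],
    threshold: float = 0.95,
) -> tuple[list[Item], int]:
    """Pass high-confidence items through; send the rest to the fallback.

    Returns the final items and the number of fallback calls made, which is
    a rough proxy for LLM cost.
    """
    results = []
    calls = 0
    for item in items:
        if item.confidence >= threshold:
            results.append(item)  # Direct output, no LLM call
            continue
        calls += 1
        try:
            results.append(fallback(item))
        except Exception:
            # If the fallback fails, keep the regex result instead of dropping it.
            results.append(item)
    return results, calls
```

In production the `fallback` argument would wrap the LLM validator; in tests it can be a stub that returns a fixed correction or raises to simulate an outage.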
…