localmelo updates

Development changelog and progress tracker

[feat][refactor][bench] Task 1: Online Core Loop

2026-03-29 — localmelo/localmelo#3

A. Backend registry and infrastructure refactor

  • Pluggable backend registry with local (Ollama, MLC-LLM, vLLM, SGLang) and cloud (OpenAI, Anthropic, Gemini, Nvidia) adapters
  • Config supports split chat_backend / embedding_backend (e.g. cloud chat + local embedding)
  • Removed legacy support/serving/ and support/models/ modules
  • Tests reorganized from flat layout to domain-grouped subdirectories
  • Removed legacy raw-string Agent constructor
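The registry described above can be sketched as follows. This is a minimal illustration of the pattern, not the actual localmelo API: `BackendRegistry`, `register`, `resolve`, and the placeholder adapters are all assumed names.

```python
# Sketch of a pluggable backend registry with a split chat/embedding config.
# All names here are illustrative assumptions, not localmelo's real interface.
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Config:
    chat_backend: str = "openai"       # e.g. cloud chat ...
    embedding_backend: str = "ollama"  # ... paired with a local embedding backend


class BackendRegistry:
    def __init__(self) -> None:
        self._factories: Dict[str, Callable[[], object]] = {}

    def register(self, name: str, factory: Callable[[], object]) -> None:
        self._factories[name] = factory

    def resolve(self, name: str) -> object:
        try:
            return self._factories[name]()
        except KeyError:
            raise ValueError(f"unknown backend: {name!r}") from None


registry = BackendRegistry()
# Stand-in factories; real ones would build Ollama/OpenAI/etc. adapters.
registry.register("openai", lambda: "OpenAIAdapter()")
registry.register("ollama", lambda: "OllamaAdapter()")

cfg = Config()
chat = registry.resolve(cfg.chat_backend)        # cloud chat backend
embed = registry.resolve(cfg.embedding_backend)  # local embedding backend
```

Keeping factories (rather than instances) in the registry lets adapters stay lazy: a backend's client is only constructed when the config actually selects it.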

B. Attempt-based agent loop with structured reflection

  • Replaced flat step loop with attempt-based outer loop (MAX_ATTEMPTS=5, STEPS_PER_ATTEMPT=10, capped by MAX_AGENT_STEPS=30)
  • Extracted _run_attempt() for clean single-responsibility separation
  • Renamed ShortTerm to WorkingMemory; reflection entries persist across attempts
  • Structured ReflectionEntry and ReflectionDecision with active-learning fields
  • Utility-based continuation gate: utility = info_gain * feasibility * novelty - cost - repeat_risk
  • Stuck detection: the same tool call or error repeated three times triggers early attempt termination
  • Reflection context injected into retrieval, planning, and reflect calls
  • Reflections promoted to long-term memory only at terminal task state
  • Hardened reflection parsing with strict type coercion and [0,1] float clamping
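The utility gate and the hardened parsing above can be sketched together. This is a minimal illustration under stated assumptions: `clamp01`, `should_continue`, and the zero continuation threshold are hypothetical names and defaults, not the project's actual code; only the utility formula comes from the changelog.

```python
# Sketch: utility-based continuation gate plus strict [0,1] clamping of
# reflection fields. Names and threshold are illustrative assumptions.
from dataclasses import dataclass


def clamp01(value: object) -> float:
    """Coerce to float and clamp into [0, 1]; non-numeric input falls back to 0."""
    try:
        x = float(value)  # accepts int, float, and numeric strings
    except (TypeError, ValueError):
        return 0.0
    return max(0.0, min(1.0, x))


@dataclass
class ReflectionDecision:
    info_gain: float
    feasibility: float
    novelty: float
    cost: float
    repeat_risk: float

    @property
    def utility(self) -> float:
        # utility = info_gain * feasibility * novelty - cost - repeat_risk
        return (self.info_gain * self.feasibility * self.novelty
                - self.cost - self.repeat_risk)


def should_continue(decision: ReflectionDecision, threshold: float = 0.0) -> bool:
    # Start another attempt only while expected utility stays above threshold.
    return decision.utility > threshold


# Raw LLM output is untrusted: strings and out-of-range floats get clamped.
raw = {"info_gain": "0.9", "feasibility": 1.4, "novelty": 0.5,
       "cost": 0.1, "repeat_risk": "n/a"}
decision = ReflectionDecision(**{k: clamp01(v) for k, v in raw.items()})
```

With these inputs, feasibility 1.4 clamps to 1.0 and the unparseable repeat_risk falls to 0.0, so utility is 0.9 * 1.0 * 0.5 - 0.1 - 0.0 = 0.35 and the loop continues.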

C. Ollama native provider, usage normalization, and smoke benchmark

  • Ollama native /api/chat provider with think: true
  • Thinking/answer split for both Ollama native and MLC <think> tags
  • Provider-boundary usage normalization with total_tokens backfill
  • Data-driven smoke benchmark with per-scenario metrics, normalized tokenizer comparison, multi-backend reports
  • --report-only mode for regenerating reports from existing JSON
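Two of the mechanics above lend themselves to small sketches: splitting a response into thinking and answer parts around `<think>` tags, and backfilling `total_tokens` at the provider boundary. `split_thinking` and `normalize_usage` are illustrative helpers assumed for this sketch, not localmelo's actual functions.

```python
# Sketch: <think>-tag thinking/answer split and usage normalization with
# total_tokens backfill. Helper names are illustrative assumptions.
import re

_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(text: str) -> tuple[str, str]:
    """Return (thinking, answer); thinking is empty when no tag is present."""
    match = _THINK_RE.search(text)
    if match is None:
        return "", text.strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return match.group(1).strip(), answer


def normalize_usage(usage: dict) -> dict:
    """Normalize provider usage dicts; backfill total_tokens when omitted."""
    prompt = int(usage.get("prompt_tokens", 0))
    completion = int(usage.get("completion_tokens", 0))
    total = usage.get("total_tokens")
    if total is None:  # some providers omit the total; reconstruct it
        total = prompt + completion
    return {"prompt_tokens": prompt, "completion_tokens": completion,
            "total_tokens": int(total)}


thinking, answer = split_thinking("<think>check units first</think>42 km")
usage = normalize_usage({"prompt_tokens": 10, "completion_tokens": 5})
```

Normalizing at the provider boundary means downstream metrics code can assume all three token fields are always present, regardless of backend.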

Status

Component                                      Status
Backend registry (local + cloud)               Done
Split chat/embedding config                    Done
Attempt-based agent loop                       Done
Structured reflection + utility gate           Done
Working memory with reflection entries         Done
Stuck detection                                Done
Ollama native chat provider                    Done
Smoke benchmark framework                      Done
637 tests passing (ruff, black, mypy, pytest)  Done
Task decomposition (decompose action)          Not yet
Sleep-mode pipeline                            Not yet
Long-term memory promotion policies            Not yet
Utility threshold empirical tuning             Not yet

[refactor] Initial Architecture

2026-03-25
  • Split runtime and infrastructure into melo/ and support/
  • Introduced provider contracts to reduce coupling
  • Established memory and sleep-mode package boundaries
  • Added sleep module as foundation for continuous personalization
  • Cleaned up local serving paths
  • Expanded regression coverage
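One common way to express such provider contracts in Python is a structural `typing.Protocol`, which decouples runtime code from concrete backends without inheritance. `ChatProvider`, `EchoProvider`, and `run_turn` below are illustrative assumptions, not the project's actual interface.

```python
# Sketch of a provider contract via structural typing. Runtime code depends
# only on the Protocol; any object with a matching chat() method satisfies it.
from typing import Protocol


class ChatProvider(Protocol):
    def chat(self, messages: list[dict]) -> str: ...


class EchoProvider:
    """Trivial structural implementation; note there is no inheritance."""

    def chat(self, messages: list[dict]) -> str:
        return messages[-1]["content"]


def run_turn(provider: ChatProvider, prompt: str) -> str:
    # The runtime never imports a concrete backend, only the contract.
    return provider.chat([{"role": "user", "content": prompt}])
```

Because the check is structural, infrastructure adapters in support/ can satisfy the contract without importing anything from the runtime package, keeping the dependency arrow pointing one way.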