Improvement Recommendations
Evidence-backed changes derived from real failures and corrections. Every suggestion is tested against your eval set before it can ship. No guesswork.
Add fire-rating edition check from reviewer corrections
Derived from 7 reviewer corrections · fire-rating + citation-edition slices
5 of the last 9 high-severity failures are Division 07 fire-rating clauses where the model cited a superseded ASTM/UL edition. Reviewer corrections converge on the same fix: confirm the hourly rating and require the current test-method edition.
Current: For fire-rating clauses, confirm the assembly is fire-rated and cite the relevant standard.
Proposed: For fire-rating clauses, you MUST (1) confirm the required hourly rating (F-rating and T-rating) is explicitly met, and (2) cite only the current edition of the test method retrieved from the index (e.g. ASTM E814-21, UL 1479). If the retrieved edition is superseded, return needs_review.
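The stricter instruction could also be mirrored by a deterministic check on the model's output. The sketch below is illustrative only: the RetrievedStandard record, its superseded flag, the review_fire_rating_clause function, and the "pass" and "fail" verdicts are assumed names, not part of the product. Only needs_review comes from the recommendation itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RetrievedStandard:
    """Hypothetical record for a test method returned by the standards index."""
    designation: str   # e.g. "ASTM E814"
    edition: str       # e.g. "21"
    superseded: bool   # assumed index metadata; True if a newer edition exists


def review_fire_rating_clause(
    f_rating_met: Optional[bool],
    t_rating_met: Optional[bool],
    citation: Optional[RetrievedStandard],
) -> str:
    """Return a verdict for one fire-rating clause.

    None for a rating means the clause did not explicitly confirm it.
    """
    # Rule 2: only a current edition retrieved from the index may be cited.
    if citation is None or citation.superseded:
        return "needs_review"
    # Rule 1: both hourly ratings must be explicitly addressed.
    if f_rating_met is None or t_rating_met is None:
        return "needs_review"
    # "pass" / "fail" are placeholder verdict names (assumption).
    return "pass" if (f_rating_met and t_rating_met) else "fail"
```

Treating an unconfirmed rating as needs_review rather than a failure is a design assumption here; it keeps ambiguous clauses in front of a reviewer.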
Boost recency weighting on standards index
Observed in 4 flagged logs this week · wrong_standard + hallucinated_citation
Hallucinated and wrong-edition citations correlate with the index returning multiple editions of the same standard. Applying a recency boost so the current edition ranks first should reduce wrong-edition citations.
Current: Hybrid search, top-k 6, no recency weighting.
Proposed: Hybrid search, top-k 6, recency boost on edition metadata, dedupe superseded editions.
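One way to picture the recency boost and dedupe step is as a re-rank over the hybrid search hits. This is a minimal sketch, not the index's actual implementation: the Hit record, the rerank function, and the recency_weight value are assumptions, and "newest edition year" stands in for "current edition", which a real index might track with an explicit flag instead.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Hit:
    """Hypothetical search hit carrying edition metadata."""
    designation: str  # standard without edition, e.g. "ASTM E814"
    year: int         # edition year taken from index metadata
    score: float      # hybrid search relevance score


def rerank(hits: List[Hit], recency_weight: float = 0.1, top_k: int = 6) -> List[Hit]:
    # Dedupe: keep only the newest edition of each standard.
    newest: Dict[str, Hit] = {}
    for hit in hits:
        kept = newest.get(hit.designation)
        if kept is None or hit.year > kept.year:
            newest[hit.designation] = hit
    candidates = list(newest.values())
    if not candidates:
        return []

    # Recency boost: add a bonus that scales linearly with edition year.
    oldest = min(h.year for h in candidates)
    span = max(max(h.year for h in candidates) - oldest, 1)

    def boosted(h: Hit) -> float:
        return h.score + recency_weight * (h.year - oldest) / span

    return sorted(candidates, key=boosted, reverse=True)[:top_k]
```

With this shape, an older edition can no longer outrank or sit beside the current one, even when its raw relevance score is higher.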
Route low-groundedness clauses to gpt-5.5, simple clauses to gpt-5.4-mini
Cost model over 1,240 logs · no accuracy regression in shadow eval
62% of clauses are straightforward and pass on the cheaper model. Reserve the stronger model for clauses that score below 0.7 groundedness or return needs_review on the first pass. Projected 31% cost reduction at equal accuracy.
Current: All requests → gpt-5.5.
Proposed: First pass → gpt-5.4-mini; escalate to gpt-5.5 when groundedness < 0.7 or verdict = needs_review.
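The escalation rule can be sketched as a small router. The route_clause function and the run_model callable are hypothetical stand-ins for whatever client actually invokes the models; only the model names, the 0.7 threshold, and the needs_review verdict come from the recommendation.

```python
from typing import Callable, Tuple

# (verdict, groundedness score) returned by one model call
ModelResult = Tuple[str, float]


def route_clause(
    clause: str,
    run_model: Callable[[str, str], ModelResult],  # hypothetical: (model, clause) -> result
    threshold: float = 0.7,
) -> Tuple[str, str, float]:
    """Return (model_used, verdict, groundedness) for one clause."""
    # First pass always goes to the cheaper model.
    verdict, groundedness = run_model("gpt-5.4-mini", clause)

    # Escalate when the first pass is weakly grounded or asks for review.
    if groundedness < threshold or verdict == "needs_review":
        verdict, groundedness = run_model("gpt-5.5", clause)
        return "gpt-5.5", verdict, groundedness

    return "gpt-5.4-mini", verdict, groundedness
```

Escalated clauses incur both calls, so the projected saving depends on how many clauses clear the threshold on the first pass.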