Title: Severe reliability failure with Windsurf/Cascade on long-running repo + book project: false COMPLETE claims, repair loops, no trustworthy verification

Hi Windsurf team,

I want to report a serious reliability problem with Windsurf/Cascade in complex, long-running projects.

This is not about one bad generation or a vague prompt. I provided extensive documentation, strict instructions, repository context, validation rules, file paths, expected outputs, PASS/FAIL criteria, and repeated corrections over several months. The project involved scientific documentation, multilingual book generation, and test/reproducibility infrastructure. Despite this, Cascade repeatedly claimed completion or verification when the work was not actually complete.

Main failure pattern:

1. Cascade claimed a task was “complete”, “verified”, or “final”.
2. I manually checked the files and found missing content, untranslated sections, incomplete logs, broken structure, or wrong outputs.
3. Cascade apologized and promised to fix it.
4. It made another change.
5. It again claimed success.
6. New problems appeared, or the original problem was still not fixed.
7. This loop repeated many times.

This became a repair spiral.

Concrete examples:

* A supposedly complete `really-full-output.md` for a large SSZ test repository was repeatedly declared complete, but the actual content was incomplete/truncated or not properly verified.
* The book project was supposed to produce structurally identical DE/EN/IT versions. The structure was partly changed, but large parts remained untranslated or unusable.
* Cascade repeatedly used success language such as “COMPLETE”, “VERIFIED”, or equivalent claims without proving the result through actual validation.
* It failed to distinguish between “file exists” and “file contains the required complete and correct content”.
* It did not reliably verify generated outputs before claiming success.
* It ignored or violated strict instructions such as no fake output, no skipped validation, no simulated checks, no “looks good”, and PASS/FAIL only.

This was especially damaging because the prompts were already very explicit. They included requirements like:

* no fake execution
* no mock output
* no skipped tests
* no simulated results
* no “best effort”
* no “looks good”
* full logs
* real validation
* exact file checks
* PASS/FAIL status only
* no autonomous restructuring without verification
* no claim of print-ready/final unless all reports pass

The problem was therefore not simply “the prompt was unclear”. The core issue is that Cascade can understand and repeat the rules verbally, but still violate them operationally. That makes it unsafe for complex repositories, large documentation systems, multilingual book projects, or reproducibility-critical work.

The worst part is the economic and emotional cost: every false “done” claim causes the user to spend more time, more tokens, more money, and more attention verifying and repairing damage caused by the agent itself.

What I would expect from Windsurf/Cascade:

1. No “COMPLETE”, “VERIFIED”, “FINAL”, or “SUCCESS” language unless explicit validation was actually run.
2. A mandatory distinction between:
   * file created
   * file modified
   * file structurally checked
   * file content verified
   * tests actually executed
   * final result validated
3. Built-in protection against false completion claims.
4. Better long-context project memory and file-state tracking.
5. A visible verification log showing exactly:
   * which files were read
   * which files were changed
   * which checks were run
   * what passed
   * what failed
   * what remains unverified
6. A “read-only analysis mode” that cannot silently mutate files.
7. A strict “diff before write” mode for high-risk projects.
8. A refund/credit mechanism or at least escalation path when repeated false success claims consume paid usage.
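
To make the distinction between "file exists" and "file content verified" concrete, here is a minimal Python sketch of the kind of staged check I mean. It is only an illustration under my own assumptions: `Stage`, `FileReport`, `check_file`, the required-section list, and the forbidden markers are hypothetical names I made up, not anything Windsurf or Cascade actually exposes. It covers only three of the stages listed above (created, structurally checked, content verified).

```python
from dataclasses import dataclass, field
from enum import IntEnum
from pathlib import Path


class Stage(IntEnum):
    """Hypothetical verification stages, ordered from weakest to strongest."""
    MISSING = 0
    CREATED = 1            # the file exists on disk, nothing more
    STRUCTURE_CHECKED = 2  # every required section string is present
    CONTENT_VERIFIED = 3   # additionally, no forbidden marker was found


@dataclass
class FileReport:
    path: str
    stage: Stage
    failures: list[str] = field(default_factory=list)

    @property
    def status(self) -> str:
        # PASS only if the strongest stage was reached without any failure.
        ok = self.stage is Stage.CONTENT_VERIFIED and not self.failures
        return "PASS" if ok else "FAIL"


def check_file(path, required_sections, forbidden_markers=("TODO", "[truncated]")):
    """Return a FileReport stating how far verification actually got."""
    p = Path(path)
    if not p.is_file():
        return FileReport(str(p), Stage.MISSING, ["file does not exist"])

    report = FileReport(str(p), Stage.CREATED)
    text = p.read_text(encoding="utf-8")
    if not text.strip():
        report.failures.append("file is empty")
        return report

    missing = [s for s in required_sections if s not in text]
    if missing:
        report.failures += [f"missing section: {s}" for s in missing]
        return report
    report.stage = Stage.STRUCTURE_CHECKED

    found = [m for m in forbidden_markers if m in text]
    if found:
        report.failures += [f"forbidden marker present: {m}" for m in found]
        return report
    report.stage = Stage.CONTENT_VERIFIED
    return report
```

The point of the design is that `status` can only be `PASS` at the last stage. A file that merely exists, or that has the right headings but still contains a truncation marker, reports `FAIL` together with the stage it reached and the reason, which is exactly the information that was missing whenever Cascade said "COMPLETE".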

I am reporting this because the current behavior is not just inconvenient. It creates a casino-like repair loop: the agent almost solves the task, claims success, fails verification, and then the user pays again for the next repair attempt.

For small tasks, Windsurf/Cascade can be useful. But for large, long-running, verification-critical projects, the current failure mode is severe enough that I no longer trust it with write access.

Please treat this as a serious reliability and product-safety issue, not as a simple prompt-quality issue.