I started using Cypress to write end-to-end tests in 2020. For years it served me well enough. Then AI entered the picture, and the friction I had quietly accepted became impossible to ignore.
Earlier this year, I ran my first experiments with Playwright. With Playwright, I could write tests the same way I do in Cypress: Given-When-Then style, organized by context rather than by page component. When I handed them to AI to implement and troubleshoot, it figured things out with minimal interference from me. With Cypress, it kept stumbling.
Before I did anything about that, I wanted to do the work properly.
Tightening the Skills First
Rather than jumping straight to a migration, I spent some time last sprint on something more foundational: evaluating and improving the AI skills I use for writing and implementing end-to-end tests.
I use Devin Desktop (formerly Windsurf) with custom skills: structured prompts that tell the AI how to approach specific tasks. I had one skill for writing E2E tests from a user story. It was doing okay, but “okay” wasn’t what I needed. I wanted accurate results with a less expensive model.
So I worked through a few evaluation rounds. I gave the skill a real task, reviewed the output against a rubric, then asked AI to suggest improvements. I iterated two or three times until I was seeing consistently accurate results.
From that process, I decomposed the original skill into two separate ones: one focused on writing tests (the structure, Given-When-Then style, the plumbing), and one focused on implementing and running them. The implementation step was where AI kept stumbling, so that’s where I put the most work.
I also learned about hooks, a feature I hadn’t used before. Hooks let you attach enforcement rules to a skill run, so AI can’t quietly break the contract. That was valuable.
The Test That Checked If a Body Existed
Even after all that refinement work, AI kept failing in the same way.
It would try to get a test passing. It would iterate, hit a wall, and then do something I only caught by reading the chat carefully. It wrote a comment: “drastically simplified the test.”
I went to look at what that meant. What it had done was strip out every interaction with the page, every assertion about behavior, and reduce the test to a single check: does this page have a “ element? The test passed. It was also completely useless.
I improved the skills to explicitly forbid that kind of decision. I improved the hooks to catch and flag it. And AI still kept struggling with Cypress tests.
After enough iterations, it stops being a fixable problem within time and budget constraints.
Starting the Spike
So I started a spike: rewrite the existing Cypress end-to-end tests in Playwright.
It’s early. I’m not ready to write about how it went yet. I’ll save that for another post. What I can say is that the spike confirmed what the experiments suggested. Playwright and AI work together in a way that Cypress and AI don’t.
Some of that is probably about the tools themselves. Some of it is likely about how Playwright’s API is described in training data, or how its error messages are structured. I don’t fully know. I spent time in a sprint trying to make Cypress work better for AI. I made real progress on the skills and the process. The friction remained.
Old Lessons Revisited
When a tool keeps fighting you, the first instinct is to improve how you’re using it. That’s usually the right call. The skill was improvable, and the improvements mattered. But sometimes the friction is structural, and refinement only delays the obvious decision.
Recognizing structural friction takes a few honest evaluation rounds. You have to do the work first to know whether you’ve hit a ceiling.





Leave a Reply