AI coding tools are shifting to a surprising place: the terminal
For years, code-editing tools like Cursor, Windsurf, and GitHub’s Copilot have been the standard for AI-powered software development. But as agentic AI grows more powerful and vibe coding takes off, a subtle shift has changed how AI systems interact with software.
Instead of working on code, they’re increasingly interacting directly with the shell of whatever system they’re installed in. It’s a significant change in how AI-powered software development happens, and despite its low profile, it could have major implications for where the field goes from here.
The terminal is best known as the black-and-white screen you remember from ’90s hacker movies: a very old-school way of running programs and manipulating data. It’s not as visually impressive as modern code editors, but it’s an extremely powerful interface if you know how to use it. And while code-based agents can write and debug code, terminal tools are often needed to take software from written code to something that can actually be used.
The clearest sign of the shift to the terminal has come from the major labs. Since February, Anthropic, DeepMind, and OpenAI have all released command-line coding tools (Claude Code, Gemini CLI, and Codex CLI, respectively), and they’re already among the companies’ most popular products.
That shift has been easy to miss, since the new tools largely operate under the same branding as earlier coding products. But under the hood, there have been real changes in how agents interact with other computers, both online and offline. Some believe those changes are just getting started.
“Our big bet is that there’s a future in which 95% of LLM-computer interaction is through a terminal-like interface,” says Mike Merrill, co-creator of the leading terminal-focused benchmark Terminal-Bench.
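As a rough illustration of what that kind of interaction can look like, here is a minimal Python sketch of a terminal-style agent loop, in which a model proposes shell commands and the output of each command is fed back into its context. The `next_command` function is a hypothetical placeholder for a model call, not any particular product’s API.

```python
import subprocess

def next_command(transcript: str) -> str | None:
    """Hypothetical placeholder for a model call: given the session so far,
    return the next shell command to run, or None when the task is done."""
    raise NotImplementedError

def run_agent(task: str, max_steps: int = 20) -> str:
    """Minimal sketch of a terminal-style agent loop: the model proposes a
    command, the harness executes it in the shell, and the output is appended
    to the transcript the model sees on the next step."""
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        command = next_command(transcript)
        if command is None:
            break
        result = subprocess.run(command, shell=True, capture_output=True, text=True)
        transcript += f"$ {command}\n{result.stdout}{result.stderr}"
    return transcript
```

Production tools typically wrap a loop like this in safeguards such as sandboxing or per-command approval, but the basic shape, a command out and its output back, is what a terminal-like interface means in practice.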
Terminal-based tools are also coming into their own just as prominent code-based tools are starting to look shaky. The AI code editor Windsurf has been torn apart by dueling acquisitions, with senior executives hired away by Google and the remaining company acquired by Cognition, leaving the consumer product’s long-term future uncertain.
At the same time, new research suggests programmers may be overestimating the productivity gains from conventional tools. A METR study testing Cursor Pro, Windsurf’s main competitor, found that while developers estimated they could complete tasks 20% to 30% faster, the observed process was nearly 20% slower. In short, the code assistant was actually costing programmers time.
That has left an opening for companies like Warp, which currently holds the top spot on Terminal-Bench. Warp bills itself as an “agentic development environment,” a middle ground between IDE programs and command-line tools like Claude Code.
But Warp founder Zach Lloyd is still bullish on the terminal, seeing it as a way to tackle problems that would be out of scope for a code editor like Cursor.
“The terminal occupies a very low level in the developer stack, so it’s the most versatile place to be running agents,” Lloyd says.
To understand how the new approach is different, it can help to look at the benchmarks used to measure these tools. The code-based generation of tools was focused on fixing GitHub issues, the basis of the SWE-Bench test. Each problem on SWE-Bench is an open issue from GitHub: essentially, a piece of code that doesn’t work.
Models iterate on the code until they find something that works, solving the problem. Integrated products like Cursor have built more sophisticated approaches on top of this, but the GitHub/SWE-Bench model is still the core of how these tools approach the problem: starting with broken code and turning it into code that works.
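As a hedged sketch of that workflow, the loop below starts from code that fails its tests, asks a model for a patch, and re-runs the tests until they pass. The `propose_patch` function is a hypothetical stand-in for a model call, and the use of pytest and `git apply` is an illustrative assumption rather than a detail of any specific harness.

```python
import subprocess

def propose_patch(issue: str, test_output: str) -> str:
    """Hypothetical stand-in for a model call that returns a unified diff."""
    raise NotImplementedError

def run_tests() -> subprocess.CompletedProcess:
    # Assumes the repository's tests run under pytest; an illustrative choice.
    return subprocess.run(["pytest", "-q"], capture_output=True, text=True)

def resolve_issue(issue: str, max_attempts: int = 5) -> bool:
    """SWE-Bench-style loop: start from code that fails its tests, apply
    model-generated patches, and re-run the tests until they pass."""
    for _ in range(max_attempts):
        result = run_tests()
        if result.returncode == 0:
            return True  # tests pass, so the issue counts as resolved
        patch = propose_patch(issue, result.stdout + result.stderr)
        subprocess.run(["git", "apply"], input=patch, text=True)
    return run_tests().returncode == 0
```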
Terminal-based tools take a wider view, looking past the code to the whole environment a program runs in. That includes coding, but also more DevOps-oriented tasks like configuring a Git server or troubleshooting why a script won’t run.
In one Terminal-Bench problem, the instructions provide a decompression program and a target text file, challenging the agent to reverse-engineer a matching compression algorithm. Another asks the agent to build the Linux kernel from source, without mentioning that the agent will need to download the source code itself. Solving those problems requires the kind of bull-headed problem-solving ability that programmers need.
“What makes Terminal-Bench hard isn’t just the questions that we’re giving the agents,” says Terminal-Bench co-creator Alex Shaw. “It’s the environments that we’re placing them in.”
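One consequence of that environment-first framing is that a task can be graded by the final state of the machine rather than by the code the agent wrote. The snippet below is a speculative example of such an outcome check for a kernel-build-style task; the paths and criteria are assumptions made for illustration, not Terminal-Bench’s actual harness.

```python
import pathlib
import subprocess

def kernel_build_succeeded(workdir: str = "/app") -> bool:
    """Speculative outcome check for a 'build the Linux kernel' task: ignore
    how the agent got there and only verify that a kernel image exists in the
    working directory. The path and check are illustrative assumptions."""
    image = pathlib.Path(workdir, "linux", "arch", "x86", "boot", "bzImage")
    if not image.is_file():
        return False
    # `file` identifies the binary type; a successful build should report a
    # Linux kernel boot image rather than an empty or unrelated file.
    info = subprocess.run(["file", str(image)], capture_output=True, text=True)
    return "Linux kernel" in info.stdout
```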
Crucially, this new approach means tackling a problem step by step, the same skill that makes agentic AI so powerful. But even state-of-the-art agentic models can’t handle all of those environments. Warp earned its high score on Terminal-Bench by solving just over half of the problems, a mark of how challenging the benchmark is and how much work remains to unlock the terminal’s full potential.
Still, Lloyd believes we’re already at a point where terminal-based tools can reliably handle much of a developer’s non-coding work, a value proposition that’s hard to ignore.
“If you think of the daily work of setting up a new project, figuring out the dependencies and getting it runnable, Warp can pretty much do that autonomously,” says Lloyd. “And if it can’t do it, it will tell you why.”