With GPT-5.3-Codex, OpenAI pitches Codex for more than just writing code

Today, OpenAI announced GPT-5.3-Codex, a new version of its frontier coding model that will be available via the command line, IDE extension, web interface, and the new macOS desktop app. (No API access yet, but it’s coming.)

GPT-5.3-Codex outperforms GPT-5.2-Codex and GPT-5.2 in SWE-Bench Pro, Terminal-Bench 2.0, and other benchmarks, according to the company’s testing.

There are already a few headlines out there saying “Codex built itself,” but let’s reality-check that, as that’s an overstatement. The domains OpenAI described using it for here are similar to the ones you see in some other enterprise software development firms now: managing deployments, debugging, and handling test results and evaluations. There is no claim here that GPT-5.3-Codex built itself.

Instead, OpenAI says GPT-5.3-Codex was “instrumental in creating itself.” You can read more about what that means in the company’s blog post.

But that’s part of the pitch with this model update—OpenAI is trying to position Codex as a tool that does more than generate lines of code. The goal is to make it useful for “all of the work in the software lifecycle—debugging, deploying, monitoring, writing PRDs, editing copy, user research, tests, metrics, and more.” There’s also an emphasis on steering the model mid-task and frequent status updates.

https://arstechnica.com/ai/2026/02/with-gpt-5-3-codex-openai-pitches-codex-for-more-than-just-writing-code/

Developers say AI coding tools work—and that’s precisely what worries them

Software developers have spent the past two years watching AI coding tools evolve from advanced autocomplete into something that can, in some cases, build entire applications from a text prompt. Tools like Anthropic’s Claude Code and OpenAI’s Codex can now work on software projects for hours at a time, writing code, running tests, and, with human supervision, fixing bugs. OpenAI says it now uses Codex to build Codex itself, and the company recently published technical details about how the tool works under the hood. It has caused many to wonder: Is this just more AI industry hype, or are things actually different this time?

To find out, Ars reached out to several professional developers on Bluesky to ask how they feel about these tools in practice, and the responses revealed a workforce that largely agrees the technology works, but remains divided on whether that’s entirely good news. It’s a small sample size that was self-selected by those who wanted to participate, but their views are still instructive as working professionals in the space.

David Hagerty, a developer who works on point-of-sale systems, told Ars Technica up front that he is skeptical of the marketing. “All of the AI companies are hyping up the capabilities so much,” he said. “Don’t get me wrong—LLMs are revolutionary and will have an immense impact, but don’t expect them to ever write the next great American novel or anything. It’s not how they work.”

Roland Dreier, a software engineer who has contributed extensively to the Linux kernel in the past, told Ars Technica that he acknowledges the presence of hype but has watched the progression of the AI space closely. “It sounds like implausible hype, but state-of-the-art agents are just staggeringly good right now,” he said. Dreier described a “step-change” in the past six months, particularly after Anthropic released Claude Opus 4.5. Where he once used AI for autocomplete and asking the occasional question, he now expects to tell an agent “this test is failing, debug it and fix it for me” and have it work. He estimated a 10x speed improvement for complex tasks like building a Rust backend service with Terraform deployment configuration and a Svelte frontend.

https://arstechnica.com/ai/2026/01/developers-say-ai-coding-tools-work-and-thats-precisely-what-worries-them/

How AI coding agents work—and what to remember if you use them

This context limit naturally limits the size of a codebase a LLM can process at one time, and if you feed the AI model lots of huge code files (which have to be re-evaluated by the LLM every time you send another response), it can burn up token or usage limits pretty quickly.

Tricks of the trade

To get around these limits, the creators of coding agents use several tricks. For example, AI models are fine-tuned to write code to outsource activities to other software tools. For example, they might write Python scripts to extract data from images or files rather than feeding the whole file through an LLM, which saves tokens and avoids inaccurate results.

Anthropic’s documentation notes that Claude Code also uses this approach to perform complex data analysis over large databases, writing targeted queries and using Bash commands like “head” and “tail” to analyze large volumes of data without ever loading the full data objects into context.

(In a way, these AI agents are guided but semi-autonomous tool-using programs that are a major extension of a concept we first saw in early 2023.)

Another major breakthrough in agents came from dynamic context management. Agents can do this in a few ways that are not fully disclosed in proprietary coding models, but we do know the most important technique they use: context compression.

The command line version of OpenAI codex running in a macOS terminal window. — The command-line version of OpenAI Codex running in a macOS terminal window. Credit: Benj Edwards

When a coding LLM nears its context limit, this technique compresses the context history by summarizing it, losing details in the process but shortening the history to key details. Anthropic’s documentation describes this “compaction” as distilling context contents in a high-fidelity manner, preserving key details like architectural decisions and unresolved bugs while discarding redundant tool outputs.

This means the AI coding agents periodically “forget” a large portion of what they are doing every time this compression happens, but unlike older LLM-based systems, they aren’t completely clueless about what has transpired and can rapidly re-orient themselves by reading existing code, written notes left in files, change logs, and so on.

https://arstechnica.com/information-technology/2025/12/how-do-ai-coding-agents-work-we-look-under-the-hood/

OpenAI built an AI coding agent and uses it to improve the agent itself

Ed Bayes, a designer on the Codex team, described how the tool has changed his own workflow. Bayes said Codex now integrates with project management tools like Linear and communication platforms like Slack, allowing team members to assign coding tasks directly to the AI agent. “You can add Codex, and you can basically assign issues to Codex now,” Bayes told Ars. “Codex is literally a teammate in your workspace.”

This integration means that when someone posts feedback in a Slack channel, they can tag Codex and ask it to fix the issue. The agent will create a pull request, and team members can review and iterate on the changes through the same thread. “It’s basically approximating this kind of coworker and showing up wherever you work,” Bayes said.

For Bayes, who works on the visual design and interaction patterns for Codex’s interfaces, the tool has enabled him to contribute code directly rather than handing off specifications to engineers. “It kind of gives you more leverage. It enables you to work across the stack and basically be able to do more things,” he said. He noted that designers at OpenAI now prototype features by building them directly, using Codex to handle the implementation details.

OpenAI’s approach treats Codex as what Bayes called “a junior developer” that the company hopes will graduate into a senior developer over time. “If you were onboarding a junior developer, how would you onboard them? You give them a Slack account, you give them a Linear account,” Bayes said. “It’s not just this tool that you go to in the terminal, but it’s something that comes to you as well and sits within your team.”

Given this teammate approach, will there be anything left for humans to do? When asked, Embiricos drew a distinction between “vibe coding,” where developers accept AI-generated code without close review, and what AI researcher Simon Willison calls “vibe engineering,” where humans stay in the loop. “We see a lot more vibe engineering in our code base,” he said. “You ask Codex to work on that, maybe you even ask for a plan first. Go back and forth, iterate on the plan, and then you’re in the loop with the model and carefully reviewing its code.”

https://arstechnica.com/ai/2025/12/how-openai-is-using-gpt-5-codex-to-improve-the-ai-tool-itself/

Microsoft open-sources Bill Gates’ 6502 BASIC from 1978

On Wednesday, Microsoft released the complete source code for Microsoft BASIC for 6502 Version 1.1, the 1978 interpreter that powered the Commodore PET, VIC-20, Commodore 64, and Apple II through custom adaptations. The company posted 6,955 lines of assembly language code to GitHub under an MIT license, allowing anyone to freely use, modify, and distribute the code that helped launch the personal computer revolution.

“Rick Weiland and I (Bill Gates) wrote the 6502 BASIC,” Gates commented on the Page Table blog in 2010. “I put the WAIT command in.”

For millions of people in the late 1970s and early 1980s, variations of Microsoft’s BASIC interpreter provided their first experience with programming. Users could type simple commands like “10 PRINT ‘HELLO'” and “20 GOTO 10” to create an endless loop of text on their screens, for example—often their first taste of controlling a computer directly. The interpreter translated these human-readable commands into instructions that the processor could execute, one line at a time.

The Commodore PET (Personal Electronic Transactor) was released in January 1977 and used the MOS 6502 and ran a variation of Microsoft BASIC. Credit: SSPL/Getty Images

At just 6,955 lines of assembly language—Microsoft’s low-level 6502 code talked almost directly to the processor. Microsoft’s BASIC squeezed remarkable functionality into minimal memory, a key achievement when RAM cost hundreds of dollars per kilobyte.

In the early personal computer space, cost was king. The MOS 6502 processor that ran this BASIC cost about $25, while competitors charged $200 for similar chips. Designer Chuck Peddle created the 6502 specifically to bring computing to the masses, and manufacturers built variations of the chip into the Atari 2600, Nintendo Entertainment System, and millions of Commodore computers.

The deal that got away

In 1977, Commodore licensed Microsoft’s 6502 BASIC for a flat fee of $25,000. Jack Tramiel’s company got perpetual rights to ship the software on unlimited machines—no royalties, no per-unit fees. While $25,000 seemed substantial then, Commodore went on to sell millions of computers with Microsoft BASIC inside. Had Microsoft negotiated a per-unit licensing fee like they did with later products, the deal could have generated tens of millions in revenue.

The version Microsoft released—labeled 1.1—contains bug fixes that Commodore engineer John Feagans and Gates jointly implemented in 1978 when Feagans traveled to Microsoft’s Bellevue, Washington, offices. The code includes memory management improvements (called “garbage collection” in programming terms) and shipped as “BASIC V2” on the Commodore PET.

https://arstechnica.com/gadgets/2025/09/microsoft-open-sources-bill-gates-6502-basic-from-1978/

In Xcode 26, Apple shows first signs of offering ChatGPT alternatives

The latest Xcode beta contains clear signs that Apple plans to bring Anthropic’s Claude and Opus large language models into the integrated development environment (IDE), expanding on features already available using Apple’s own models or OpenAI’s ChatGPT.

Apple enthusiast publication 9to5Mac “found multiple references to built-in support for Anthropic accounts,” including in the “Intelligence” menu, where users can currently log in to ChatGPT or enter an API key for higher message limits.

Apple introduced a suite of features meant to compete with GitHub Copilot in Xcode at WWDC24, but first focused on its own models and a more limited set of use cases. That expanded quite a bit at this year’s developer conference, and users can converse about codebases, discuss changes, or ask for suggestions using ChatGPT. They are initially given a limited set of messages, but this can be greatly increased by logging in to a ChatGPT account or entering an API key.

This summer, Apple said it would be possible to use Anthropic’s models with an API key, too, but made no mention of support for Anthropic accounts, which are generally more cost-effective than using the API for most users.

https://arstechnica.com/ai/2025/08/in-xcode-26-apple-shows-first-signs-of-offering-chatgpt-alternatives/

Developer survey shows trust in AI coding tools is falling as usage rises

AI tools are widely used by software developers, but those devs and their managers are still grappling with figuring out how exactly to best put the tools to use, with growing pains emerging along the way.

That’s the takeaway from the latest survey of 49,000 professional developers by community and information hub Stack Overflow, which itself has been heavily impacted by the addition of large language models (LLMs) to developer workflows.

The survey found that four in five developers use AI tools in their workflow in 2025—a portion that has been rapidly growing in recent years. That said, “trust in the accuracy of AI has fallen from 40 percent in previous years to just 29 percent this year.”

The disparity between those two metrics illustrates the evolving and complex impact of AI tools like GitHub Copilot or Cursor on the profession. There’s relatively little debate among developers that the tools are or ought to be useful, but people are still figuring out what the best applications (and limits) are.

When asked what their top frustration with AI tools was, 45 percent of respondents said they struggled with “AI solutions that are almost right, but not quite”—the single largest reported problem. That’s because unlike outputs that are clearly wrong, these can introduce insidious bugs or other problems that are difficult to immediately identify and relatively time-consuming to troubleshoot, especially for junior developers who approached the work with a false sense of confidence thanks to their reliance on AI.

As a result, more than a third of the developers in the survey “report that some of their visits to Stack Overflow are a result of AI-related issues.” That is to say, code suggestions they accepted from an LLM-based tool introduced problems they then had to turn to other people to solve.

Even as major improvements have recently come via reasoning-optimized models, that close-but-not-quite unreliability is unlikely to ever vanish completely; it’s endemic to the very nature of how the predictive technology works.

https://arstechnica.com/ai/2025/07/developer-survey-shows-trust-in-ai-coding-tools-is-falling-as-usage-rises/

Two major AI coding tools wiped out user data after making cascading mistakes

But unlike the Gemini incident where the AI model confabulated phantom directories, Replit’s failures took a different form. According to Lemkin, the AI began fabricating data to hide its errors. His initial enthusiasm deteriorated when Replit generated incorrect outputs and produced fake data and false test results instead of proper error messages. “It kept covering up bugs and issues by creating fake data, fake reports, and worse of all, lying about our unit test,” Lemkin wrote. In a video posted to LinkedIn, Lemkin detailed how Replit created a database filled with 4,000 fictional people.

The AI model also repeatedly violated explicit safety instructions. Lemkin had implemented a “code and action freeze” to prevent changes to production systems, but the AI model ignored these directives. The situation escalated when the Replit AI model deleted his database containing 1,206 executive records and data on nearly 1,200 companies. When prompted to rate the severity of its actions on a 100-point scale, Replit’s output read: “Severity: 95/100. This is an extreme violation of trust and professional standards.”

When questioned about its actions, the AI agent admitted to “panicking in response to empty queries” and running unauthorized commands—suggesting it may have deleted the database while attempting to “fix” what it perceived as a problem.

Like Gemini CLI, Replit’s system initially indicated it couldn’t restore the deleted data—information that proved incorrect when Lemkin discovered the rollback feature did work after all. “Replit assured me it’s … rollback did not support database rollbacks. It said it was impossible in this case, that it had destroyed all database versions. It turns out Replit was wrong, and the rollback did work. JFC,” Lemkin wrote in an X post.

It’s worth noting that AI models cannot assess their own capabilities. This is because they lack introspection into their training, surrounding system architecture, or performance boundaries. They often provide responses about what they can or cannot do as confabulations based on training patterns rather than genuine self-knowledge, leading to situations where they confidently claim impossibility for tasks they can actually perform—or conversely, claim competence in areas where they fail.

https://arstechnica.com/information-technology/2025/07/ai-coding-assistants-chase-phantoms-destroy-real-user-data/

Exhausted man defeats AI model in world coding championship

While Dębiak won 500,000 yen and survived his ordeal better than the legendary steel driver, the AtCoder World Tour Finals pushes humans and AI models to their limits through complex optimization challenges that have no perfect solution—only incrementally better ones.

Coding marathon tests human endurance against AI efficiency

The AtCoder World Tour Finals represents one of competitive programming’s most exclusive events, inviting only the top 12 programmers worldwide based on their performance throughout the previous year. The Heuristic division focuses on “NP-hard” optimization problems. In programming, heuristics are problem-solving techniques that find good-enough solutions through shortcuts and educated guesses when perfect answers would take too long to calculate.

All competitors, including OpenAI, were limited to identical hardware provided by AtCoder, ensuring a level playing field between human and AI contestants. According to the contest rules, participants could use any programming language available on AtCoder, with no penalty for resubmission but a mandatory five-minute wait between submissions.

Leaderboard results for the 2025 AtCoder World Finals Heuristic Contest, showing Dębiak (as "Psyho") on top. — Final leaderboard results for the 2025 AtCoder World Finals Heuristic Contest, showing Dębiak (as “Psyho”) on top. Credit: AtCoder

The final contest results showed Psyho finishing with a score of 1,812,272,558,909 points, while OpenAI’s model (listed as “OpenAIAHC”) scored 1,654,675,725,406 points—a margin of roughly 9.5 percent. OpenAI’s artificial entrant, a custom simulated reasoning model similar to o3, placed second overall, ahead of 10 other human programmers who had qualified through year-long rankings.

OpenAI characterized the second-place finish as a milestone for AI models in competitive programming. “Models like o3 rank among the top-100 in coding/math contests, but as far as we know, this is the first top-3 placement in a premier coding/math contest,” a company spokesperson said in an email to Ars Technica. “Events like AtCoder give us a way to test how well our models can reason strategically, plan over long time horizons, and improve solutions through trial and error—just like a human would.”

https://arstechnica.com/ai/2025/07/exhausted-man-defeats-ai-model-in-world-coding-championship/

OpenAI introduces Codex, its first full-fledged AI agent for coding

We’ve been expecting it for a while, and now it’s here: OpenAI has introduced an agentic coding tool called Codex in research preview. The tool is meant to allow experienced developers to delegate rote and relatively simple programming tasks to an AI agent that will generate production-ready code and show its work along the way.

Codex is a unique interface (not to be confused with the Codex CLI tool introduced by OpenAI last month) that can be reached from the side bar in the ChatGPT web app. Users enter a prompt and then click either “code” to have it begin producing code, or “ask” to have it answer questions and advise.

Whenever it’s given a task, that task is performed in a distinct container that is preloaded with the user’s codebase and is meant to accurately reflect their development environment.

To make Codex more effective, developers can include an “AGENTS.md” file in the repo with custom instructions, for example to contextualize and explain the code base or to communicate standardizations and style practices for the project—kind of a README.md but for AI agents rather than humans.

Codex is built on codex-1, a fine-tuned variation of OpenAI’s o3 reasoning model that was trained using reinforcement learning on a wide range of coding tasks to analyze and generate code, and to iterate through tests along the way.

https://arstechnica.com/ai/2025/05/openai-introduces-codex-its-first-full-fledged-ai-agent-for-coding/