AI Hiring

04.28.2026

How to Evaluate Engineers in the AI Era: Lessons from Six Months of AI-Enabled Technical Interviews

Alwyn Johnson

What building and running the first human-led, AI-enabled technical interview system actually looks like

When we launched Karat NextGen late last year, we had a clear thesis: the way companies evaluate engineering talent hadn’t kept pace with how software is actually built. AI tools had changed the job, but interviews were still designed for a world where a candidate’s ability to write clean code was the primary predictor of success. We built NextGen to close that gap and reflect the evolving state of engineering.

Six months in, the thesis holds. But the work of actually running the system (building the content, calibrating with clients, watching engineers work in real time) has sharpened our thinking in ways we didn’t anticipate. Some of what we learned was about our product. Most of it was about the problem.

The benchmark kept moving. That was the point.

We built completely new, higher-complexity content for NextGen, designed from the ground up for AI-enabled environments. Before we even launched, we had to go back and update a significant portion of it.

Not because we’d gotten the scenarios wrong. Because the AI models got better.

Content we’d developed earlier in the process, problems we’d designed to be meaningfully challenging, became solvable by AI assistants before we ever ran them with a real candidate. We caught it, rebuilt, and launched with a stronger content set. But the experience forced an honest reckoning: a technical assessment is not a document you set and forget. It’s a position you have to actively defend.

Any content that doesn’t require:

Engineering judgment
Navigating ambiguity
Making trade-offs
Understanding systems holistically

will eventually become a prompt-and-paste exercise.

Our job is to stay ahead of that line.

The best engineering organizations are still figuring this out. Working through that uncertainty has been the most valuable thing we’ve done.

Integrating AI into a rigorous technical interview was the founding challenge of NextGen. What we didn’t fully anticipate was how much that would expand the surface area of the problem.

Once AI is in the room, the more interesting questions emerge:

Should candidates be required to use AI?
If they don’t use it, is that impressive or a red flag?
How much should AI proficiency matter versus engineering fundamentals?

There’s no universal answer, and the right calibration depends on what each organization actually values and where they are in their own AI adoption.

Working through those questions with clients has been some of the most useful work we’ve done. One of our leading enterprise partners pushed furthest along that journey. For them, the question evolved from “how do we allow AI?” to “how do we make AI proficiency the primary thing we’re measuring?” That meant harder content where using the AI assistant wasn’t optional; it was the only way to move fast enough to perform well. It also meant adjusting score weighting to put significantly more emphasis on AI proficiency relative to other skill areas.
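To make the score-weighting idea concrete, here is a minimal sketch in Python. The skill names are the competencies listed later in this post, but the weights, the 0 to 4 scale, the candidate scores, and the function name `weighted_score` are all illustrative assumptions, not a description of how NextGen actually computes scores.

```python
from typing import Dict


def weighted_score(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Return the weighted average of per-skill scores.

    Weights are normalized here, so only their relative sizes matter.
    """
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[skill] * w for skill, w in weights.items()) / total_weight


# Hypothetical candidate scores on an assumed 0 to 4 scale.
candidate = {
    "problem_solving": 3.0,
    "technical_communication": 3.0,
    "product_sense": 2.0,
    "ai_proficiency": 4.0,
}

# Every skill counts equally.
balanced = {skill: 1.0 for skill in candidate}
# Same skills, but AI proficiency counts three times as much (assumed ratio).
ai_heavy = {**balanced, "ai_proficiency": 3.0}

balanced_score = weighted_score(candidate, balanced)
ai_heavy_score = weighted_score(candidate, ai_heavy)
```

With these made-up numbers, the same interview produces a higher overall score under the AI-heavy weighting, because the candidate’s strongest skill now carries more of the total. The reverse also holds: a candidate who is weak with the assistant would drop further under that weighting.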

That partnership illustrated something important: the questions don’t stop at integration. The clients investing most deeply in getting this right are developing a genuinely different picture of what a strong engineer looks like.

When the output is no longer enough

Traditional technical interviews had a clean feedback loop. The candidate either wrote code that worked or they didn’t. That output was the evidence: you could look at it, trace the logic, and understand what the person understood.

AI breaks that loop. When a candidate produces correct code, you can no longer read that code to determine whether they understood the problem, because the output can just as easily have come from the model. What you actually need to evaluate is the process:

How they approached the problem
What they understood about the codebase
How they communicated trade-offs
What happened when the AI’s suggestion was wrong

Those are real skills. They’re arguably more important than raw coding ability in most engineering environments today. But they’re harder to demonstrate and harder to document. When an engineering leader asks why a candidate scored well on “codebase navigation,” you can’t just point at a function they wrote. The evidence for process-based skills has to come from somewhere else.

We’ve built toward that evidentiary gap directly. NextGen now provides:

Individual skill scores across the competencies we assess, including problem solving, technical communication, product sense, and AI proficiency
Brief write-ups that summarize the performance in that skill area and provide rationale for the score
Timestamped markers that point to the specific moments in an interview where a candidate demonstrated (or didn’t demonstrate) each skill

The goal is to give engineering leaders something defensible: not just a number, but a trail of evidence they can review, share, and build on.
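One way to picture that trail of evidence is as a small data structure: a score, a rationale, and a list of timestamped moments per skill. The sketch below is a hypothetical shape in Python. The class names, field names, and sample values are assumptions made for illustration, not NextGen’s actual report format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EvidenceMarker:
    """A pointer to one moment in the interview recording."""
    timestamp_seconds: int
    demonstrated: bool  # False marks a moment where the skill was expected but absent
    note: str


@dataclass
class SkillResult:
    """Score, written rationale, and supporting moments for a single skill."""
    skill: str
    score: float
    rationale: str
    markers: List[EvidenceMarker] = field(default_factory=list)

    def supporting_moments(self) -> List[EvidenceMarker]:
        """Return only the markers where the skill was demonstrated."""
        return [m for m in self.markers if m.demonstrated]


def format_timestamp(seconds: int) -> str:
    """Render a second offset as MM:SS for display to a reviewer."""
    minutes, secs = divmod(seconds, 60)
    return f"{minutes:02d}:{secs:02d}"


# Invented sample data, for illustration only.
result = SkillResult(
    skill="ai_proficiency",
    score=3.0,
    rationale="Caught an incorrect AI suggestion and corrected course.",
    markers=[
        EvidenceMarker(754, True, "Rejected a suggested change after reading the diff"),
        EvidenceMarker(1320, False, "Accepted generated code without running it"),
    ],
)

review_points = [format_timestamp(m.timestamp_seconds) for m in result.supporting_moments()]
```

Keeping the markers next to the score means a reviewer who disagrees with a number can jump to the exact moments behind it, rather than rewatching the whole interview.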

This is, in some ways, the deepest design challenge of AI-era evaluation: how do you build a system that generates trustworthy signal in a world where the output and the skill have been decoupled?

Following the capability curve

We launched with ChatGPT as the integrated AI assistant. Within months, we switched to Claude as the primary model.

The rationale was simple: Anthropic released new models that outpaced what was available elsewhere, and engineers adopted them. Claude is currently the preferred model because engineers have deemed it the most capable tool for generating high-quality code and tackling complex technical problems.

Claude has the lead for now, but the landscape is going to keep shifting. Certain models will break out, others will fall back, and enterprise clients will increasingly have preferences or mandates about which models their teams work in. That is why we’re building toward model agnosticism, so that we can:

Support new and emerging models
Adapt to enterprise tooling preferences
Keep pace with how engineers are actually working

That flexibility matters more the faster the underlying model landscape evolves.
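In code, model agnosticism usually comes down to putting a thin interface between the interview environment and whichever model sits behind it. The sketch below shows that pattern in Python under stated assumptions: `CodeAssistant`, `EchoAssistant`, and `AssistantRegistry` are hypothetical names, and the stand-in assistant returns a canned string rather than calling any vendor API. It is not a description of how NextGen is built.

```python
from typing import Dict, Protocol


class CodeAssistant(Protocol):
    """The minimal surface the interview environment depends on (assumed)."""
    name: str

    def suggest(self, prompt: str) -> str:
        ...


class EchoAssistant:
    """Stand-in assistant for illustration; a real adapter would call a model API."""

    def __init__(self, name: str) -> None:
        self.name = name

    def suggest(self, prompt: str) -> str:
        return f"[{self.name}] suggestion for: {prompt}"


class AssistantRegistry:
    """Looks up an assistant by name, so a model can be swapped per client."""

    def __init__(self) -> None:
        self._assistants: Dict[str, CodeAssistant] = {}

    def register(self, assistant: CodeAssistant) -> None:
        self._assistants[assistant.name] = assistant

    def get(self, name: str) -> CodeAssistant:
        if name not in self._assistants:
            raise KeyError(f"no assistant registered under {name!r}")
        return self._assistants[name]


registry = AssistantRegistry()
registry.register(EchoAssistant("model-a"))
registry.register(EchoAssistant("model-b"))

# A client's tooling preference becomes a configuration value, not a code change.
preferred = registry.get("model-b")
reply = preferred.suggest("refactor parse()")
```

The design choice is that the interview environment only ever sees `suggest`, so adding a newly released model means writing one more adapter rather than touching the assessment itself.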

We also evolved the AI integration to better match real workflows. The initial chat assistant was helpful, but it pulled candidates out of their flow. Candidates were used to working in tools like Cursor, where AI suggestions land directly in the editor and the loop between thinking and writing is tighter. Copying and pasting from a chat window is a small thing, but small things add up when you’re trying to measure someone’s natural workflow. We updated the environment to support direct code editing from AI suggestions, reducing that friction. The assessment is most valid when it feels closest to how people actually work.

What’s coming next: agentic evaluation

The trajectory of AI in engineering isn’t a straight line from autocomplete to chat assistant. The more useful frame is a shift in what AI can actually do autonomously.

Engineers are increasingly:

Decomposing problems into tasks
Orchestrating AI systems
Evaluating and refining outputs
Thinking at the system level

Agentic development is no longer theoretical. Engineers at the leading edge are already working this way, orchestrating AI systems to handle significant portions of a project rather than using AI as a smarter text editor. The skill required to do that well, knowing how to decompose a problem, how to evaluate and course-correct AI output, how to think at the level of system design rather than line-by-line implementation, is meaningfully different from what technical interviews have historically measured. And it’s different, again, from the AI proficiency we’re measuring today.

We’re rolling out agentic evaluation as the next chapter of NextGen. The foundation stays the same: real-world, project-based scenarios, human interviewers, objective signal. But the problems are designed specifically to assess the higher-order AI fluency that separates engineers who are genuinely building with AI from those who are still primarily working alongside it.

The reality is that organizations are at different points on this curve. Some are still working through the foundational questions of how to evaluate AI-enabled engineers at all. Others are already thinking about agentic readiness. We’re helping companies navigate their own transition and advance as their thinking evolves.

Closing

The through-line across all of this is that talent evaluation in an AI world is a problem that doesn’t stay solved. The skills that matter are changing. The tools candidates use are changing. The evidence you need to document quality is different from what it used to be. The right response isn’t to build a better version of the old system; it’s to accept that the system itself needs to be alive, capable of evolving as quickly as the environment it’s measuring.

That’s what we’ve been working on. There’s more to do.
