
Ground Truth Isn't

May 5, 2026 · 6 min read

Something I’ve been thinking about lately is how much AI research depends on benchmarks that everyone quietly agrees to treat as reality.

A model gets trained on a dataset, evaluated against a benchmark, and then we get a clean number at the end. 94% accuracy. 97% accuracy. Better than the last model. Worse than the next one. It is convenient, it is comparable, and it gives everyone in the field a shared scoreboard.

That scoreboard is useful. I am not trying to argue benchmarks are bad. Without them, it would be extremely difficult to compare models, reproduce results, or make any kind of organized progress. The issue is that benchmarks can start to feel more objective than they really are.

Someone built the dataset. Someone decided what to include. Someone chose the labels. Someone picked the tools used to generate it. Those decisions do not disappear once the dataset becomes standard. They get frozen into the benchmark, then passed downstream into every model that trains on it.

The weird part is that after enough people use the same benchmark, the assumptions behind it stop feeling like assumptions at all.

They become "ground truth."

Binary Similarity and the Disassembler Problem

One place this shows up clearly is binary code similarity detection.

Binary similarity is basically the problem of comparing compiled programs or functions and deciding whether they are semantically related. This matters in cybersecurity because you might want to detect reused vulnerable code, identify malware variants, or recognize that two pieces of compiled code are actually doing the same thing even if they look different on the surface.
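
To make that concrete, here is a rough sketch of how many learned approaches frame the problem: each function gets mapped to a vector (an embedding), and "similar" means the vectors are close. The encoder below is a made-up placeholder, not any real model; it is only here to show the shape of the pipeline.

```python
import numpy as np

def embed_function(instructions: list[str]) -> np.ndarray:
    """Placeholder for a learned encoder that maps a disassembled function
    to a fixed-size vector. Real systems train this; here we just hash
    tokens into a bag-of-words vector for illustration."""
    vec = np.zeros(64)
    for token in instructions:
        vec[hash(token) % 64] += 1.0
    return vec

def similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two function embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

# Toy example: f2 has the same instructions as f1 in a different order,
# f3 is genuinely different code.
f1 = ["push rbp", "mov rbp, rsp", "mov eax, edi", "add eax, esi", "pop rbp", "ret"]
f2 = ["push rbp", "mov rbp, rsp", "add eax, esi", "mov eax, edi", "pop rbp", "ret"]
f3 = ["xor eax, eax", "ret"]

print(similarity(embed_function(f1), embed_function(f2)))  # high: likely "same" function
print(similarity(embed_function(f1), embed_function(f3)))  # low: likely different
```

The decision rule is just a threshold on that score, which is why everything upstream of the score, including how the functions were extracted in the first place, matters so much.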

In 2022, a benchmark called BinaryCorp was released as a large-scale standard for evaluating binary similarity models. It contained around 26 million functions compiled from a wide variety of sources and optimization levels. That scale is a big deal. More variety usually means less bias from any one source of code.

But there is a subtler assumption built into it: the binaries were disassembled using IDA Pro.

That may sound like a minor implementation detail, but in binary analysis, the disassembler is not just a parser. It is interpreting compiled code and trying to reconstruct structure from something that was not designed to be easy to read. Things like function boundaries, control flow, and even what counts as code versus data can depend on the tool’s analysis.
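
One way to see that disassembly is interpretation: x86 instructions have variable length, so even the choice of where to start decoding changes what instructions you see. Here is a small sketch using the Capstone library; the byte string is an arbitrary example I picked, not taken from any real binary.

```python
# pip install capstone
from capstone import Cs, CS_ARCH_X86, CS_MODE_64

md = Cs(CS_ARCH_X86, CS_MODE_64)

# A short run of bytes. Whether this is code at all, and which instructions
# it contains, depends on where the disassembler decides decoding starts.
code = bytes.fromhex("488b05b8130000c3")

for start in (0, 1):
    print(f"decoding from offset {start}:")
    for insn in md.disasm(code[start:], 0x1000 + start):
        print(f"  0x{insn.address:x}  {insn.mnemonic} {insn.op_str}")
```

Shift the start offset by one byte and you get a different instruction stream from the same bytes. Function boundaries and code-versus-data decisions are that kind of judgment call, made at scale.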

So the benchmark is not just "what the binary is." It is "what IDA Pro thinks the binary is."

At first, that might still seem fine. IDA is one of the best tools in the field. But different disassemblers do disagree in measurable ways.

In 2021, the researchers behind DisCo compared outputs from IDA Pro and Ghidra. They found that 7.3% of actual function starts were found only by IDA, while 10.7% were found only by Ghidra. That means two top tools can look at the same binary and disagree at a pretty fundamental level: where functions even begin.
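
The kind of measurement behind numbers like that is easy to picture. Given ground-truth function starts (from debug symbols, say) and each tool's recovered starts, you count how many true starts only one tool finds. A rough sketch with invented addresses:

```python
# Hypothetical address sets; in a real measurement these would come from
# debug symbols (ground truth) and from exporting the function lists that
# IDA Pro and Ghidra recover for the same stripped binary.
true_starts   = {0x1000, 0x1040, 0x10a0, 0x1100, 0x1180, 0x1200}
ida_starts    = {0x1000, 0x1040, 0x10a0, 0x1100, 0x1180}          # misses 0x1200
ghidra_starts = {0x1000, 0x1040, 0x10a0, 0x1100, 0x1200, 0x1300}  # misses 0x1180, adds a false start

only_ida    = (ida_starts - ghidra_starts) & true_starts
only_ghidra = (ghidra_starts - ida_starts) & true_starts

print(f"true starts found only by IDA:    {len(only_ida) / len(true_starts):.1%}")
print(f"true starts found only by Ghidra: {len(only_ghidra) / len(true_starts):.1%}")
```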

That is not a tiny stylistic difference. If a model is trained on function-level representations, the definition of a "function" is part of the training data. If the disassembler gets that wrong, or even just interprets it differently, the model inherits that interpretation.

So the natural question is: what happens if you run the same binary similarity model on the output of different disassemblers?

As far as I can tell, that exact question has not really been tested in a clean way. That is surprising, because the answer seems important. If a model performs well on IDA-generated data but worse on Ghidra-generated data, then its benchmark score is not just measuring binary similarity. It is also measuring compatibility with a specific disassembler’s worldview.

The Benchmark Is Not the World

This problem is not unique to reverse engineering.

ImageNet is the most famous example of a benchmark becoming almost synonymous with progress. For years, computer vision models competed on ImageNet, and the field treated improvements on that benchmark as real movement toward better visual understanding.

Again, ImageNet was incredibly important. The point is not that it was useless. The point is that the labels were still created by humans, at scale, under constraints.

Later work by researchers at MIT found that many major AI datasets, including ImageNet, contained significant label errors. Some of the mistakes were not subtle either. A mushroom labeled as a spoon. A frog labeled as a cat. An audio clip tagged as a completely different sound.


This sounds funny until you remember that these datasets are used to decide whether models are getting better.

If the test itself contains errors, then the score is measuring performance against the dataset’s mistakes too. Sometimes the model may even be "wrong" for disagreeing with a bad label.
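
The arithmetic here is worth spelling out. If some fraction of test labels are wrong, a model that is right about every real example cannot score 100%, and a model that happens to reproduce the dataset's mistakes can measure higher than one that does not. A toy illustration, with numbers invented for the example:

```python
# Toy illustration: measured accuracy when the test labels themselves
# contain errors. All numbers here are made up.
label_error_rate = 0.06   # fraction of test labels that are wrong

# A model that agrees with reality on every example still "misses" every
# mislabeled item, so its measured accuracy is capped below 100%:
perfect_model_measured = 1.0 - label_error_rate
print(f"perfect model, measured accuracy: {perfect_model_measured:.0%}")   # 94%

# A model that is right 98% of the time on correctly labeled items but
# also reproduces the dataset's wrong labels measures higher:
quirky_model_measured = 0.98 * (1 - label_error_rate) + 1.0 * label_error_rate
print(f"quirky model, measured accuracy:  {quirky_model_measured:.1%}")    # 98.1%
```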

That is the part I find interesting. A benchmark does not just measure reality. It defines a version of reality that models are optimized against.

Malware Already Attacks the Assumptions

This matters even more in security because attackers are not passive.

In September 2024, Google Cloud published an analysis of LummaC2, a piece of malware that used indirect control flow to evade automated static analysis. The rough idea is that static analysis tools try to map out a program’s control flow without running it. They build a control flow graph, which shows how execution can move between different blocks of code.

LummaC2 deliberately made that graph harder to reconstruct. Instead of obvious direct jumps, it used indirect branching so that the real connections between blocks only became clear at runtime. According to Google Cloud’s analysis, 1,981 out of 2,009 suspicious code segments used this kind of obfuscation.
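
A rough picture of why indirection hurts: a static analyzer can follow a direct jump because the target address is encoded right there in the instruction, but an indirect jump's target lives in a register or memory value that may only exist at runtime. Here is a toy sketch of a control-flow-edge recovery pass hitting that wall; the instruction format is invented for the example and is not meant to model LummaC2 itself.

```python
# Toy "disassembly" of two versions of the same logic.
# Invented format: (address, operation, target).
direct = [
    (0x00, "cmp",  None),
    (0x01, "jne",  0x10),    # direct branch: target encoded in the instruction
    (0x02, "call", 0x40),
    (0x03, "jmp",  0x20),
]

indirect = [
    (0x00, "cmp",  None),
    (0x01, "mov",  None),    # real target computed into a register at runtime...
    (0x02, "jne",  "rax"),   # ...so a purely static pass cannot resolve the edge
    (0x03, "call", "rbx"),
    (0x04, "jmp",  "rcx"),
]

def recovered_edges(listing):
    """Return the control-flow edges a purely static pass can resolve."""
    edges = []
    for addr, op, target in listing:
        if op in ("jmp", "jne", "call") and isinstance(target, int):
            edges.append((addr, target))
    return edges

print("direct version:  ", recovered_edges(direct))    # all branch targets recovered
print("indirect version:", recovered_edges(indirect))  # no branch targets recovered
```

The control flow graph of the indirect version comes back nearly empty, even though the program behaves the same way when it actually runs.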

That is not just malware being "complicated." That is malware attacking the assumptions of the tools used to analyze it.

This is where the benchmark problem becomes less academic. If binary similarity models are trained and evaluated on datasets created through static disassembly, then they are downstream of the same kinds of tools that malware authors are actively trying to break.

A model can score well on a benchmark and still fail when the benchmark’s assumptions are attacked.

That is not necessarily the model’s fault. It may be doing exactly what it was trained to do. The issue is that the benchmark made some parts of the problem invisible.

So What Do We Do With This?

I do not think the answer is to throw away benchmarks. That would be unrealistic and probably worse. Benchmarks are necessary. They give researchers a shared target, make comparisons possible, and let the field move faster.

But I think benchmarks should be treated more like experimental setups than ground truth.

When a paper says a model achieved 94% accuracy, the immediate follow-up should be:

  • 94% on what?
  • What tools generated the data?
  • Who labeled it?
  • What assumptions are baked into the benchmark?
  • How old is it?
  • What kinds of real-world examples does it exclude?
  • What happens when one part of the pipeline changes?

For binary similarity specifically, there is a very obvious experiment here: take the same model, the same binaries, and compare performance across different disassembler outputs (IDA, Ghidra, Binary Ninja). If the results shift significantly, then the field has to be more careful about treating benchmark performance as general binary understanding.
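
A sketch of what that comparison could look like is below. Everything in it is hypothetical: the disassembler export step, the model loading, and the scoring function are placeholders for whatever tools and model you actually have. The only point is the experimental design, where the model and the binaries stay fixed while the disassembler varies.

```python
# Sketch of the proposed experiment: same model, same binaries, different
# disassemblers. The functions below are placeholders, not a real API.

DISASSEMBLERS = ["ida", "ghidra", "binaryninja"]

def disassemble_with(tool: str, binary_path: str) -> list[list[str]]:
    """Placeholder: run the named tool on the binary and return one
    token list per recovered function."""
    raise NotImplementedError

def score_pairs(model, functions_a, functions_b) -> float:
    """Placeholder: evaluate the similarity model on known-matching
    function pairs and return an accuracy-style metric."""
    raise NotImplementedError

def run_experiment(model, binary_pairs):
    results = {}
    for tool in DISASSEMBLERS:
        scores = []
        for path_a, path_b in binary_pairs:   # e.g. same source, different compiler flags
            funcs_a = disassemble_with(tool, path_a)
            funcs_b = disassemble_with(tool, path_b)
            scores.append(score_pairs(model, funcs_a, funcs_b))
        results[tool] = sum(scores) / len(scores)
    return results  # large gaps between tools would mean the score was partly measuring the tool
```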

More broadly, I think this is a useful way to look at AI progress in general. The accuracy score is not meaningless, but it is never the whole story. It is always attached to a dataset, a methodology, and a bunch of decisions that someone made earlier in the pipeline.

The benchmark is not reality. It is a ruler.

And sometimes the ruler is bent.