How Bad Performance Tests Led to Needless Scaling (and How I Stopped It)

When I joined this project, the damage had already been done.

The team had scaled.

More instances. Bigger machines. Higher monthly bills.

And they were still planning to scale further because the conclusion was already accepted:

the system doesn’t scale.

What bothered me wasn’t just the conclusion — it was that nobody could clearly explain *why*.

What Felt Wrong

The performance reports didn’t behave like real system limits:

* Latency spikes were inconsistent

* Throughput dropped at loads far too low to be a plausible limit

* Results varied too much between runs

That’s not how systems usually fail under pressure. A genuine limit tends to show up at roughly the same load, run after run.

That’s how bad tests behave.

Step 1: Check the Source of Truth

All of this was based on performance testing done by an external vendor, which turned out to be closer to a poorly executed stress test than a performance test.

So instead of touching the system, I went straight to the tests.

Step 2: Audit the Test Suite

It didn’t take long to see the problem:

* Requests sent with random payloads

* Invalid or missing data in critical flows

* No consistent user journeys

* Shared state getting corrupted under load

* No isolation between runs

Some of the “failures” weren’t even performance-related — they were just broken requests.

At that point, the reports stopped being useful.
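One way to keep broken requests from being counted as slowness is to label every response before aggregating anything. The sketch below is a minimal illustration, not the vendor's tooling or the harness I built: `classify`, `summarise`, the 500 ms threshold and the sample data are all hypothetical. It also assumes every 4xx response is a test defect, which would need revisiting if the API rate-limits with 429.

```python
from collections import Counter

def classify(status, elapsed_ms, slo_ms=500):
    """Label one response so invalid requests are not reported as slowness."""
    if 400 <= status < 500:
        return "invalid_request"   # assumption: the test sent something the API rejected
    if status >= 500:
        return "server_error"
    if elapsed_ms > slo_ms:
        return "slow"              # valid request, but over the (hypothetical) threshold
    return "ok"

def summarise(results, slo_ms=500):
    """results: iterable of (http_status, elapsed_ms) pairs."""
    return Counter(classify(status, ms, slo_ms) for status, ms in results)

# Made-up sample, purely to show the shape of the output.
sample = [(200, 120), (422, 15), (400, 12), (200, 900), (503, 30)]
summary = summarise(sample)
print(dict(summary))
```

If most of the "failures" land in `invalid_request`, the report is describing the test suite, not the system.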

Step 3: Rebuild a Controlled Baseline

I ignored their suite and built a small, controlled harness:

* Fixed datasets

* Deterministic flows

* Clean state before each run

* Same conditions every time

Then I ran the same scenarios that supposedly broke the system.

It didn’t break.

I ran them again.

Same result.
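The properties listed above (fixed data, deterministic flows, clean state, same conditions) can be sketched in a few lines of Python. This is an illustration under assumptions rather than the actual harness: `run_scenario`, the in-memory `store` and the two step functions are hypothetical stand-ins for real API calls.

```python
import random
import time

def run_scenario(steps, dataset, reset_state, seed=42):
    """Run every step for every record, from a clean state, in a fixed order."""
    reset_state()                    # clean state before each run
    rng = random.Random(seed)        # fixed seed: identical ordering on every run
    records = list(dataset)
    rng.shuffle(records)
    timings = []
    for record in records:
        for step in steps:
            start = time.perf_counter()
            step(record)
            timings.append((step.__name__, record["id"], time.perf_counter() - start))
    return timings

# Hypothetical system under test: a dict standing in for a real service.
store = {}

def reset():
    store.clear()

def create_order(record):
    store[record["id"]] = record

def read_order(record):
    assert store[record["id"]] == record

dataset = [{"id": i, "item": "widget"} for i in range(5)]   # fixed dataset
first = run_scenario([create_order, read_order], dataset, reset)
second = run_scenario([create_order, read_order], dataset, reset)
```

Because the seed and the data are fixed, two runs issue the same requests in the same order, so any difference in the timings comes from the system rather than from the test.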

Step 4: Model Real Usage (Not Just Load)

One major thing missing from the original tests was think time.

What they called "performance testing" was really an uncontrolled stress test.

Every request was fired back-to-back, as fast as possible. No pauses, no user behaviour.

That’s not real traffic. Real users read, type and navigate between requests, so a given number of users generates far fewer requests per second than the same number of tight loops.

So of course the system looked like it couldn’t cope.

What I did differently

I pulled data from analytics (Google Analytics and internal logs) to understand:

* How long users wait between actions

* Typical session flows

* How requests are naturally distributed

Then I introduced realistic think times into the tests.

Not guesses — actual behaviour.
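As a rough sketch of how think times can be derived from logs: group events by session, sort them by time, and take the gaps between consecutive actions. The event format here, `(session_id, timestamp_seconds)`, and the sample values are assumptions for illustration; a real analytics export will look different.

```python
from collections import defaultdict
from statistics import median

def think_times(events):
    """Return the gaps, in seconds, between consecutive actions within each session."""
    by_session = defaultdict(list)
    for session_id, timestamp in events:
        by_session[session_id].append(timestamp)
    gaps = []
    for stamps in by_session.values():
        stamps.sort()
        gaps.extend(later - earlier for earlier, later in zip(stamps, stamps[1:]))
    return gaps

# Made-up events: (session_id, seconds since some reference point).
events = [("a", 0.0), ("a", 4.0), ("a", 10.0), ("b", 1.0), ("b", 9.0)]
gaps = think_times(events)
print(sorted(gaps), median(gaps))
```

Keeping the full list of gaps, rather than only an average, lets the load test sample from the observed distribution.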

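So instead of firing each request the moment the previous one returned, every simulated user now paused between actions for a duration taken from the observed data.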
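To illustrate the difference, here is a minimal Python sketch of both styles. It is not the original test code: `hammer`, `virtual_user`, the journey and the think-time samples are hypothetical, and the `sleep` parameter exists only so the example can be run without actually waiting.

```python
import random
import time

def hammer(actions):
    """Back-to-back style: every action fires as soon as the last one returns."""
    for action in actions:
        action()

def virtual_user(actions, think_time_samples, rng, sleep=time.sleep):
    """Paced style: pause between actions for a duration drawn from observed gaps."""
    for index, action in enumerate(actions):
        action()
        if index < len(actions) - 1:          # no pause after the final action
            sleep(rng.choice(think_time_samples))

calls = []
pauses = []
journey = [
    lambda: calls.append("login"),
    lambda: calls.append("search"),
    lambda: calls.append("checkout"),
]
observed_gaps = [4.0, 6.0, 8.0]               # assumed sample of think times, in seconds

# Record the pauses instead of sleeping, to show what the user would have waited.
virtual_user(journey, observed_gaps, random.Random(1), sleep=pauses.append)
print(calls, pauses)
```

The journey is identical in both styles. Only the pacing changes, and the pacing determines how many requests per second a given number of users actually produces.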

What changed

* Traffic patterns stabilised

* Load distribution made sense

* The system stopped “failing” under artificial pressure

Before this, they were just bombarding the system and calling it a limit.

After this, we were measuring reality.

Step 5: Find the Real Bottlenecks

With clean data, the actual issues were small and clear:

* A slow query under concurrency

* Weak caching on a key endpoint

* Minor contention in one service

These were optimisation problems, not scaling problems.

Step 6: Tie It Back to Cost

This was the uncomfortable part.

They were already paying for infrastructure they didn’t need.

And still planning to spend more.

Once I compared:

* what they were spending (and planning to spend)

* what it actually took to fix the real issues

…it was obvious how much money was being wasted.

Step 7: Prove It Repeatedly

One run wouldn’t change minds.

So I ran everything multiple times, under controlled conditions.

Same results every time.

That consistency is what finally cut through the noise.
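"Same results every time" can be turned into a number by comparing one metric across runs, for example the coefficient of variation of the p95 latency. The sketch below is an illustration only: the 5% threshold is an arbitrary assumption, and both lists contain made-up values, not measurements from this project.

```python
from statistics import mean, stdev

def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean (unitless spread)."""
    return stdev(values) / mean(values)

def is_repeatable(p95_per_run_ms, max_cv=0.05):
    """True when run-to-run spread stays within the chosen tolerance."""
    return coefficient_of_variation(p95_per_run_ms) <= max_cv

# Made-up p95 latencies (ms) from five runs of the same scenario.
controlled_runs = [212, 208, 215, 210, 209]
uncontrolled_runs = [180, 640, 220, 1500, 310]

print(is_repeatable(controlled_runs), is_repeatable(uncontrolled_runs))
```

A suite that cannot pass a check like this against an unchanged system is measuring its own noise.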

The Result

* Further scaling was stopped

* Existing infrastructure was reassessed

* Performance testing was rebuilt properly

* The team started trusting the data again

And most importantly, the bleeding stopped.

What I Took From This

* Bad tests don’t just fail — they drive bad decisions

* If your inputs are wrong, your conclusions will be wrong

* Scaling is expensive — it should be justified, not assumed

* Realistic behaviour matters more than raw request volume

Why This One Stuck With Me

There was no big architectural change here.

Just slowing down, questioning the data, and rebuilding it properly.

It didn’t just improve the system.

It stopped a very real, very expensive mistake that was already in motion.

But more importantly, it stopped us from solving the wrong problem.

It also meant a backend engineer who was already under pressure didn’t take the fall for something that wasn’t actually broken — and that’s something I’m proud of.

Emmanuel Eko

SDET & QA Architect