Super-Resolution Research, Part 3: The Round-Trip Problem
Downscale an image by 2x, upscale it back to the original size, and you don't get the original image. But the degree to which you don't get it back depends on which library you use, which interpolation method you choose, and whether you're working in uint8 or float32.
- Author
- Shlomo Stept
- Published
- Updated
- Note
- Originally written 2022-07
The Round-Trip Problem: Why Downscaling Then Upscaling Doesn’t Get You Back
The entire premise of super-resolution is that you can recover information lost during downscaling. You take a high-resolution image, downsample it, then use a model to upsample it back, ideally recovering detail that the downsampling destroyed. The model is trained on pairs: the original HR image and its downscaled-then-upscaled counterpart. The gap between those two images is what the model learns to bridge.
But I’d been assuming the downscale-upscale round trip was a well-defined operation. Downsample by 4x with bicubic, upsample by 4x with bicubic, measure PSNR against the original. Simple. Reproducible. Except it isn’t, because four different libraries will give you four different answers for both the downscale and the upscale, and the errors compound.
What I Expected vs. What I Found
I expected the round-trip degradation to be consistent across libraries. I figured the exact PSNR would vary slightly (different kernels, different boundary handling, sure), but the ordering would be stable and the differences would be small. A few tenths of a dB, maybe.
They aren't. The PIL vs OpenCV bicubic disagreement on the same resize operation is approximately 21 dB PSNR. That's the single-direction disagreement: just the downscale step, comparing one library's output against the other's. The round trip amplifies this because you're chaining two operations that each disagree.
```python
# The experiment: resize baboon.bmp (500x480) to 100x100 and back.
# make_resizer(library, quantize_after, filter, output_size) comes from
# my codebase and is described later in this post; img is the original
# HR image as a numpy array.
output_size = (100, 100)
original_size = (500, 480)

# PIL round trip
pil_down = make_resizer('PIL', False, 'bicubic', output_size)(img)
pil_up = make_resizer('PIL', False, 'bicubic', original_size)(pil_down)

# OpenCV round trip
cv_down = make_resizer('OpenCV', False, 'bicubic', output_size)(img)
cv_up = make_resizer('OpenCV', False, 'bicubic', original_size)(cv_down)
```
The PSNR between pil_up and the original image is one number. The PSNR between cv_up and the original is a different number. And the PSNR between pil_up and cv_up — two supposedly equivalent round trips of the same source image — is yet another number, and it’s embarrassingly low for operations that are supposed to be doing “the same thing.”
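The metric itself is trivial; a minimal sketch in plain NumPy, assuming same-shape uint8 images (`psnr` is just an illustrative helper name):

```python
import numpy as np

def psnr(a: np.ndarray, b: np.ndarray) -> float:
    """PSNR in dB between two uint8 images of the same shape."""
    a = a.astype(np.float64)
    b = b.astype(np.float64)
    mse = np.mean((a - b) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10 * np.log10(255.0 ** 2 / mse)

# psnr(original, pil_up), psnr(original, cv_up), and psnr(pil_up, cv_up)
# are the three numbers discussed above.
```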
The Kernel Parameter Problem
I covered the kernel coefficient issue in the library divergence post, but it bears repeating in the context of round trips because the effect is multiplicative. PIL uses a = -0.5 for its bicubic kernel (the Catmull-Rom spline); OpenCV uses a = -0.75. These produce different frequency responses: different amounts of pre-filtering, different amounts of ringing at edges, different handling of the Nyquist frequency during downsampling.
When you chain a downscale and an upscale, the kernel is applied twice. If both operations use the same library, the errors are at least consistent — you’re applying the same kernel twice, which produces a predictable blurring pattern. But if you mix libraries (downscale with PIL, upscale with OpenCV, which happens more often than you’d think in research codebases), the errors interact in unpredictable ways.
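Both coefficients parameterize the same Keys piecewise-cubic kernel, so the divergence is easy to see directly. A sketch (the a values are the Pillow and OpenCV defaults as I understand them):

```python
def cubic_kernel(x: float, a: float) -> float:
    """Keys' piecewise-cubic interpolation kernel with free parameter a."""
    x = abs(x)
    if x <= 1:
        return (a + 2) * x**3 - (a + 3) * x**2 + 1
    if x < 2:
        return a * x**3 - 5 * a * x**2 + 8 * a * x - 4 * a
    return 0.0

# Same sample offset, different libraries' coefficients:
w_pil = cubic_kernel(1.5, a=-0.5)   # Pillow-style (Catmull-Rom)
w_cv = cubic_kernel(1.5, a=-0.75)   # OpenCV-style
```

The weights disagree at every non-integer sample offset, and a single resize applies thousands of them.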
Note
I found mixed-library round trips in my own code before I found them in anyone else’s. My SRCNN pipeline used PIL for dataset creation and OpenCV for inference-time upsampling of the input. The inconsistency was silent — no error, no warning, just subtly wrong numbers that I chased for weeks.
Anti-Aliasing: The Hidden Variable
PIL's bicubic resize anti-aliases during downsampling by default: it widens the kernel support in proportion to the downscale factor, filtering out frequencies the smaller image cannot represent. OpenCV's does not; INTER_CUBIC applies the same fixed-support kernel no matter how far you're shrinking. This matters enormously for the round trip because anti-aliasing determines what information is preserved through the downscale step.
Without anti-aliasing, the downscaled image can contain aliasing artifacts — high-frequency content that folds back into the low-frequency range and masquerades as real detail. When you then upscale this image, the model (or the interpolation method) tries to reconstruct high-frequency detail from what is partly aliased garbage.
With anti-aliasing, the high-frequency content is filtered out before downsampling. The downscaled image is cleaner but has genuinely lost the high-frequency information. The upscaling step has a cleaner starting point but less to work with.
Neither approach is wrong. They’re different trade-offs. The problem is that they’re invisible: both libraries call their operation “bicubic,” both return an image of the requested size, and nothing in the API tells you about the anti-aliasing behavior.
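A toy 1-D decimation makes the trade-off concrete (hypothetical signal, plain NumPy; the block average is a crude stand-in for a proper anti-aliasing filter):

```python
import numpy as np

# A signal whose frequency exceeds what survives 4x decimation:
# 0.4 cycles/sample, versus a post-decimation Nyquist of 0.125.
n = 1024
t = np.arange(n)
x = np.sin(2 * np.pi * 0.4 * t)

# No anti-aliasing: plain decimation keeps the energy, but folded
# to a wrong, lower frequency that masquerades as real detail.
naive = x[::4]

# Crude anti-aliasing: average each block of 4 before decimating.
filtered = x.reshape(-1, 4).mean(axis=1)

amp_naive = np.std(naive)        # stays high: aliased energy survives
amp_filtered = np.std(filtered)  # strongly attenuated: the detail is gone
```

The naive path "preserves" amplitude that is actually garbage; the filtered path honestly discards it. Either way, the upscaler inherits the consequence.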
Quantize-After: When Data Types Attack
The make_resizer function in my codebase has a quantize_after parameter that controls whether the resize operates in uint8 space (the default, what you get from a naive Image.resize() call) or float32 space (per-channel float resize, which clean-fid uses for FID computation).
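The float32 path can be sketched with PIL alone, resizing each channel as a mode-'F' float image so nothing gets rounded to uint8 mid-pipeline (a sketch of the idea, not the actual make_resizer internals; the function names are mine):

```python
import numpy as np
from PIL import Image

def resize_float32(img: np.ndarray, size: tuple[int, int]) -> np.ndarray:
    """Bicubic resize per channel in float32; no intermediate quantization."""
    channels = []
    for c in range(img.shape[2]):
        chan = Image.fromarray(img[:, :, c].astype(np.float32))  # mode 'F'
        channels.append(np.asarray(chan.resize(size, Image.BICUBIC)))
    return np.stack(channels, axis=2)

def resize_uint8(img: np.ndarray, size: tuple[int, int]) -> np.ndarray:
    """The naive path: PIL rounds the result back to uint8."""
    pil = Image.fromarray(img.astype(np.uint8))
    return np.asarray(pil.resize(size, Image.BICUBIC))
```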
For the round trip, this creates four variants per library:
- Down in uint8, up in uint8
- Down in uint8, up in float32
- Down in float32, up in uint8
- Down in float32, up in float32
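Enumerating the grid is mechanical. A self-contained sketch (the real pipeline uses make_resizer; this PIL-only stand-in with a quantize flag just shows the shape of the experiment):

```python
from itertools import product

import numpy as np
from PIL import Image

def resize(img: np.ndarray, size: tuple[int, int], quantize: bool) -> np.ndarray:
    """Stand-in resizer: PIL bicubic, in uint8 or per-channel float32."""
    if quantize:
        out = Image.fromarray(img.round().clip(0, 255).astype(np.uint8))
        return np.asarray(out.resize(size, Image.BICUBIC)).astype(np.float32)
    chans = [np.asarray(Image.fromarray(img[:, :, c].astype(np.float32))
                        .resize(size, Image.BICUBIC))
             for c in range(img.shape[2])]
    return np.stack(chans, axis=2)

def round_trip(img, down_q, up_q, lr_size, hr_size):
    return resize(resize(img, lr_size, down_q), hr_size, up_q)

rng = np.random.default_rng(0)
hr = rng.integers(0, 256, (64, 64, 3)).astype(np.float32)

# Four variants: every combination of uint8 / float32 for the two directions.
variants = {(d, u): round_trip(hr, d, u, (16, 16), (64, 64))
            for d, u in product([True, False], repeat=2)}
```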
Each combination produces a different final image. Within a single library, the uint8-vs-float32 differences are small relative to the cross-library differences, but they're not zero, and they compound with the cross-library issue.
I measured the uint8 vs float32 disagreement within PIL at roughly 36 dB PSNR for a single resize. Not terrible. But for a round trip, those errors stack, and if you’re comparing your model’s output (which was trained on float32-resized data) against a test set that was prepared with uint8 resizing, you’ve injected a systematic bias into your evaluation.
What This Means for Super-Resolution Research
Every super-resolution paper follows this protocol:
1. Take Set5 / Set14 / BSD100 / Urban100
2. Downscale by the scale factor (2x, 3x, 4x) to create LR inputs
3. Run the model on LR inputs
4. Measure PSNR between the model's output and the original HR
Steps 2 through 4 each involve an implementation choice: step 2 is the resize whose library-dependence this whole series is about, step 3 may include a bicubic pre-upsample of the LR input (SRCNN-style models require one), and step 4's PSNR depends on conventions like RGB-to-Y conversion and border cropping. If two papers use different libraries for any of those steps, their numbers aren't comparable. And since most papers don't specify which library they used (I checked; the methodology sections of Dong et al. 2014, Kim et al. 2016, and Lim et al. 2017 don't mention the resize implementation), you can't even tell whether a comparison is valid.
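Written as code with the downscale step as an explicit parameter, the hidden degree of freedom becomes obvious (a sketch only; `evaluate`, `pil_bicubic`, and the PSNR details are my stand-ins, not any paper's actual protocol):

```python
import numpy as np
from PIL import Image

def pil_bicubic(img: np.ndarray, size: tuple[int, int]) -> np.ndarray:
    return np.asarray(Image.fromarray(img).resize(size, Image.BICUBIC))

def evaluate(hr_images, model, scale, downscale=pil_bicubic):
    """Steps 2-4 of the benchmark protocol, with the resize choice explicit.

    Swapping downscale for an OpenCV-based function changes every score.
    """
    scores = []
    for hr in hr_images:
        h, w = hr.shape[:2]
        lr = downscale(hr, (w // scale, h // scale))  # step 2: create LR input
        sr = model(lr)                                # step 3: run the model
        mse = np.mean((sr.astype(np.float64) - hr.astype(np.float64)) ** 2)
        scores.append(10 * np.log10(255.0 ** 2 / max(mse, 1e-12)))  # step 4
    return float(np.mean(scores))
```

Nothing in a paper's reported number tells you which `downscale` was used, which is exactly the problem.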
The EDSR paper (Lim et al., 2017) reported a PSNR improvement of about 1-2 dB over SRCNN on Set14 — representing three years of architectural innovation. My round-trip experiments suggest that the preprocessing library choice alone can account for a shift of similar magnitude. I’m not claiming EDSR’s improvement is fake; I’m saying the measurement framework is too noisy to distinguish a real 1 dB improvement from a preprocessing artifact with any confidence.
The Practical Lesson
I learned this the hard way during my CS 497 independent study in summer 2022. I was implementing SRCNN from scratch, training it on pairs I'd generated with my sliding window pipeline, and comparing my results against the numbers in Dong et al. (2014). My PSNR was consistently lower. Not by a little: by 2-3 dB, which for super-resolution is the difference between "working model" and "something is very wrong."
I spent three weeks assuming my model was broken. I checked the architecture, the learning rates, the weight initialization, the loss function. Everything matched the paper. The problem was that I was using PIL for the downscale step while the paper used MATLAB's imresize (which I didn't have access to, and whose bicubic behaves differently again in its kernel and anti-aliasing details).
Once I matched the preprocessing, my numbers lined up. The model had been correct the entire time. Three weeks of debugging for a one-line fix: changing the resize library.
That experience is why I started this whole line of research. If the preprocessing choice can eat three weeks of a grad student’s time on a three-layer CNN, imagine what it does to large-scale benchmark comparisons across dozens of papers and architectures. The answer is: nobody knows, because nobody is controlling for it.