Marginal Musings

Super-Resolution Research, Part 7: How Preprocessing Choice Corrupts Training

Training SRCNN with identical hyperparameters but different imaging libraries for preprocessing produces measurably different models. PIL's bicubic is not OpenCV's bicubic, and this disagreement cascades all the way through training into your final PSNR scores.

Author
Shlomo Stept
Note
Originally written 2022-07

Same Model, Different Library, Different Results

I already knew that PIL and OpenCV disagreed on what “bicubic resize” means — I’d measured the disagreement at 21 dB PSNR, which is not a rounding error by any reasonable definition of “rounding error.” But I assumed this was an evaluation problem. You’d get different numbers depending on which library you used to prepare your test images, sure, but the trained model itself would be the same, right? The learning process would converge to the same weights regardless of whether PIL or OpenCV generated the training pairs.

Wrong.

The Setup

The experiment is simple in concept, tedious in execution. Take the same model architecture (SRCNN, three convolutions, ~69,000 parameters), the same training images (DIV2K), the same hyperparameters (learning rate 1e-4 for feature extraction layers, 1e-5 for reconstruction, MSE loss, 50 epochs), and vary exactly one thing: which library handles the bicubic downscale step that creates the low-resolution training inputs.
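For concreteness, here is a minimal sketch of a three-convolution SRCNN consistent with the description above (9-5-5 kernels, 64/32 widths, 3-channel input, which works out to roughly 69k parameters), with the two learning rates split across parameter groups. The optimizer choice and layer names are illustrative assumptions, not a claim about the exact training script.

```python
import torch
import torch.nn as nn

class SRCNN(nn.Module):
    """Sketch of a 9-5-5 SRCNN with 3-channel I/O (~69k parameters).
    Layer grouping mirrors the two learning rates mentioned above."""
    def __init__(self):
        super().__init__()
        # feature extraction / non-linear mapping (trained at lr=1e-4)
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=9, padding=4), nn.ReLU(inplace=True),
            nn.Conv2d(64, 32, kernel_size=5, padding=2), nn.ReLU(inplace=True),
        )
        # reconstruction layer (trained at lr=1e-5)
        self.reconstruction = nn.Conv2d(32, 3, kernel_size=5, padding=2)

    def forward(self, x):
        return self.reconstruction(self.features(x))

model = SRCNN()
# two parameter groups, matching the 1e-4 / 1e-5 split described above
optimizer = torch.optim.Adam([
    {"params": model.features.parameters(), "lr": 1e-4},
    {"params": model.reconstruction.parameters(), "lr": 1e-5},
])
```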

import numpy as np
import cv2
from PIL import Image

def make_resizer(library, quantize_after, filter, output_size):
    """Each library's 'bicubic' uses different kernel coefficients:
    - PIL (Pillow): a=-1 (Catmull-Rom spline)
    - OpenCV: a=-0.75
    - PyTorch: a=-0.5, align_corners=False
    - TensorFlow: varies by version
    """
    if library == "PIL" and not quantize_after:
        s1, s2 = output_size
        def resize_single_channel(x_np):
            # mode "F" keeps the channel in float32, so PIL takes its
            # floating-point filter path rather than the uint8 fixed-point one
            img = Image.fromarray(x_np.astype(np.float32), mode="F")
            img = img.resize(output_size, resample=Image.BICUBIC)
            return np.asarray(img).clip(0, 255).reshape(s2, s1, 1)

        def func(x):
            channels = [resize_single_channel(x[:, :, idx]) for idx in range(3)]
            return np.concatenate(channels, axis=2).astype(np.float32)

    elif library == "OpenCV":
        def func(x):
            result = cv2.resize(x, output_size, interpolation=cv2.INTER_CUBIC)
            return result.clip(0, 255)
    # ... PyTorch, TensorFlow variants follow the same pattern
    else:
        raise NotImplementedError(
            f"no resizer for {library!r} with quantize_after={quantize_after}")
    return func

The make_resizer function wraps each library behind an identical interface. Same input image, same target dimensions, same filter name. The only difference is the internal implementation — and those implementations disagree because they use different kernel coefficients for the same named interpolation method.

What “Bicubic” Actually Means (It Depends)

The bicubic interpolation kernel has a free parameter, typically called a, that controls the shape of the interpolation curve. Different values of a produce different trade-offs between sharpness and ringing artifacts. The problem is that different libraries chose different values of a and all call the result “bicubic.”

PIL uses a = -1, which gives you a Catmull-Rom spline. OpenCV uses a = -0.75. PyTorch uses a = -0.5 with align_corners=False. TensorFlow has changed its implementation across versions (which is its own special kind of fun).

These aren’t different names for the same operation. They’re different operations with the same name.
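You can see the disagreement directly in the kernel itself. This is Keys' one-parameter cubic convolution kernel — the function every library evaluates, each with its own choice of a. At x = 0.5, PIL's a = -1 weighs a neighbor at 0.625 while PyTorch's a = -0.5 weighs it at 0.5625; every interpolated pixel inherits that gap.

```python
def bicubic_kernel(x, a):
    """Keys' one-parameter cubic convolution kernel. Every library's
    'bicubic' is this piecewise cubic with a different choice of a."""
    x = abs(x)
    if x <= 1:
        return (a + 2) * x**3 - (a + 3) * x**2 + 1
    if x < 2:
        return a * (x**3 - 5 * x**2 + 8 * x - 4)
    return 0.0
```

All choices of a agree at the integer sample points (weight 1 at x = 0, weight 0 at x = 1 and beyond 2), which is exactly why the disagreement hides in between.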

I built a side-by-side comparison: take baboon.bmp from Set14, resize it to 100x100 with each library’s bicubic, then measure PSNR between every pair of outputs. PIL vs OpenCV: 21 dB. That’s not a difference you need to squint at; it’s a difference you can see with your eyes if you put the images next to each other and look at edges and textures.
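The pairwise measurement itself needs nothing more than the standard PSNR definition; a minimal version of the metric used for that comparison looks like this (the function name and signature are my own):

```python
import numpy as np

def psnr(a, b, peak=255.0):
    """PSNR in dB between two same-shape images with values in [0, peak]."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    mse = np.mean((a - b) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak**2 / mse)
```

Feed it the PIL-resized and OpenCV-resized versions of the same image and you get the ~21 dB figure above; for reference, images differing by a full grey level everywhere still score about 48 dB.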

The Cascade Effect

Here’s where it gets worse. A super-resolution training pipeline works like this:

  1. Take a high-resolution image
  2. Downscale it (using some library) to create the low-resolution input
  3. Train the model to map LR back to HR

If Paper A uses PIL for step 2 and Paper B uses OpenCV, their models aren’t just being evaluated differently — they’re learning different mappings. The low-resolution inputs are different images. The model trained on PIL-downscaled data has never seen an OpenCV-downscaled image, and vice versa.
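The three-step pipeline can be sketched as follows. Here resize_fn stands in for the library-specific downscaler (e.g. a wrapper around make_resizer from earlier); the helper name and its (image, size) calling convention are my own illustration. The point is that resize_fn is the only varying term, yet it determines the entire LR input distribution the model trains on.

```python
import numpy as np

def make_training_pair(hr_image, scale, resize_fn):
    """Step 2 of the pipeline: derive the LR input from the HR target.
    resize_fn(image, (width, height)) is the library-specific downscaler;
    swap PIL for OpenCV here and the model sees different training data."""
    h, w = hr_image.shape[:2]
    lr_image = resize_fn(hr_image, (w // scale, h // scale))
    return lr_image, hr_image
```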

I expected the models to converge to similar performance despite the different training data. I expected the learning process to be robust enough (in the actual statistical sense of “robust,” not the AI-buzzword sense) to absorb a difference in the preprocessing library. That expectation was wrong.

The PIL-trained model and the OpenCV-trained model produce different outputs on the same test image. Not dramatically different — they’re both doing super-resolution, they both produce something that looks sharper than the input — but measurably different. And when you evaluate them using PSNR, you get different scores, and the ranking between models can change depending on which library prepared the evaluation data.

Why This Matters (And Why Nobody Talks About It)

Consider what happens when you read a super-resolution paper. The results table says “PSNR: 32.48 dB on Set14.” You compare that against another paper’s “PSNR: 32.31 dB on Set14” and conclude the first paper’s model is better. But you don’t know which library either paper used for its downscaling step. If one used PIL and the other used OpenCV, the 0.17 dB difference is noise relative to the preprocessing disagreement.

I stopped trusting PSNR comparisons between papers after this. Not because PSNR is a bad metric per se (though it has its own problems), but because the numbers being compared aren’t measuring the same thing. It’s like comparing lap times from two different tracks and declaring the faster time the winner.

Note

The irony is that super-resolution papers typically report PSNR to two decimal places. 32.48 dB. That level of precision implies the measurement is stable to within 0.01 dB. Meanwhile, the preprocessing library choice alone can shift the number by several dB.

Even Within PIL, It’s Complicated

There’s a subtlety I didn’t appreciate until I dug into clean-fid’s source code (which eventually led me to find a bug there): PIL’s Image.resize() operates differently depending on whether you pass it uint8 or float32 data.

The make_resizer function has a quantize_after parameter that controls this. When quantize_after=True, the image is resized as uint8 — standard PIL behavior, what you’d get from a naive Image.open().resize() call. When quantize_after=False, each channel is converted to float32, resized independently in float32 mode, then reassembled. The results differ.

resized_pil_quant = make_resizer('PIL', True, 'bicubic', (100, 100))(img)    # img: H x W x 3 uint8 array
resized_pil_noquant = make_resizer('PIL', False, 'bicubic', (100, 100))(img)
# These are NOT identical arrays

So “PIL bicubic” isn’t even a single operation. It’s at least two, depending on the data type and channel handling. And the float32 channel-by-channel approach (used by clean-fid for FID computation) introduces yet another source of disagreement with the naive PIL resize and with every other library.
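A self-contained way to see the two PIL paths side by side (this demo is my own, not from the post's framework; it uses a random single-channel image for brevity):

```python
import numpy as np
from PIL import Image

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)

# quantize_after=True path: resize as uint8 (mode "L");
# PIL uses fixed-point filter coefficients internally here
quant = np.asarray(Image.fromarray(img).resize((25, 25), Image.BICUBIC))

# quantize_after=False path: resize in float32 (mode "F"),
# clip and quantize only at the very end
noquant = (
    np.asarray(Image.fromarray(img.astype(np.float32), mode="F")
               .resize((25, 25), Image.BICUBIC))
    .clip(0, 255).round().astype(np.uint8)
)
# The two arrays generally disagree by small per-pixel amounts
```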

Recommendations (Such As They Are)

I don’t have a clean solution. The honest answer is:

  1. Document your preprocessing library and version. Every paper should specify “PIL 9.2.0 bicubic” or “OpenCV 4.6.0 INTER_CUBIC” in its methodology section. Most don’t.

  2. When comparing against published results, match their preprocessing exactly. This is often impossible because they didn’t document it (see point 1).

  3. Stop comparing PSNR numbers across papers at the 0.1 dB level. The preprocessing variance is larger than that. A 0.2 dB improvement claimed by a new architecture might be entirely attributable to a different resize implementation.

  4. Pin your library versions. I’ve seen OpenCV change its resize behavior between minor versions. PyTorch’s align_corners default changed. TensorFlow’s tf.image.resize has been rewritten at least once.
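Point 1 costs almost nothing to automate. A small helper like this (the function is my own suggestion, not part of any library) records exactly what a methodology section should state:

```python
import importlib

def preprocessing_fingerprint(libraries=("PIL", "cv2", "torch", "tensorflow")):
    """Collect the exact versions of the imaging libraries in the pipeline,
    so a paper can state e.g. 'PIL 9.2.0 bicubic' instead of 'bicubic'."""
    versions = {}
    for name in libraries:
        try:
            module = importlib.import_module(name)
            versions[name] = getattr(module, "__version__", "unknown")
        except ImportError:
            versions[name] = None  # not installed in this environment
    return versions
```

Dump the result into your training logs and your results table footnote, and point 2 becomes solvable for whoever reproduces you.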

The systematic training runs comparing all library variants against each other are still in progress — I have the framework but haven’t finished running every combination at every scale factor. The preliminary evidence is clear enough: the choice matters, it cascades through training, and nobody is controlling for it.

This is the kind of problem that’s easy to dismiss as academic pedantry until you spend three weeks trying to reproduce a paper’s results and can’t figure out why your numbers are 2 dB off. I spent those three weeks. The answer was cv2.INTER_CUBIC vs Image.BICUBIC.