What 55 Trained Models Taught Me About Data Augmentation
A systematic study of five augmentation types across eleven intensity levels on osteoarthritis knee X-rays. Rotation helped until it didn't, contrast was the surprise winner, and color adjustment was basically useless.
- Author
- Shlomo Stept
- Note
- Originally written 2023-04
For my BME 571 final project in Spring 2023, I trained 55 separate models. Same architecture, same dataset, same hyperparameters, different augmentation. Five augmentation types (rotation, flip, contrast, color, saturation), each tested at eleven intensity levels (0% through 100% in 10% increments). 5 times 11 is 55. Each model trained to convergence on osteoarthritis knee X-ray classification.
The goal was to answer a question that gets handwaved in most deep learning courses: how much augmentation is too much? Everyone says “data augmentation improves generalization,” and in the aggregate that’s true, but nobody tells you that rotation augmentation at 70% intensity will tank your accuracy on medical images because rotated knee X-rays don’t look like any X-ray a radiologist would actually take.
The Dataset and the Model
The dataset is binary: knee X-rays classified as Normal or Osteoarthritis. It had class imbalance (more Normal than Osteoarthritis), which I handled by downsampling the majority class to match the minority before splitting into train/validation/test sets. I also ran size-based outlier detection to remove images with dimensions far from the mean — medical imaging datasets are messier than CIFAR-10, and a handful of images were either mislabeled, corrupted, or from a different imaging protocol.
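The downsampling step can be sketched as follows. This is an illustrative version, not the project's actual code; the function name, seed, and placeholder file lists are assumptions.

```python
import random

def downsample_majority(normal_paths, oa_paths, seed=0):
    # Randomly sample each class down to the size of the smaller one,
    # so the split that follows starts from a balanced pool.
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    n = min(len(normal_paths), len(oa_paths))
    return rng.sample(normal_paths, n), rng.sample(oa_paths, n)

# Hypothetical example: 100 Normal images vs 60 Osteoarthritis images.
normal, oa = downsample_majority([f"n{i}.png" for i in range(100)],
                                 [f"o{i}.png" for i in range(60)])
# both lists now contain 60 paths each
```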
import numpy as np
from PIL import Image

def find_size_outliers(image_paths):
    # Flag images whose width or height falls more than two standard
    # deviations from the dataset mean.
    sizes = [Image.open(path).size for path in image_paths]
    widths = [s[0] for s in sizes]
    heights = [s[1] for s in sizes]
    w_mean, w_std = np.mean(widths), np.std(widths)
    h_mean, h_std = np.mean(heights), np.std(heights)
    outliers = [path for path, (w, h) in zip(image_paths, sizes)
                if abs(w - w_mean) > 2 * w_std or abs(h - h_mean) > 2 * h_std]
    return outliers
Removed outliers were moved to a quarantine directory rather than deleted. I learned that lesson after once accidentally deleting a correctly-labeled but unusually-sized image in a different project and having to re-download the entire dataset to get it back.
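The quarantine step amounts to a move rather than a delete. A minimal sketch, assuming a flat quarantine directory; the function name and directory layout are illustrative, not the project's actual code.

```python
import shutil
from pathlib import Path

def quarantine_outliers(outlier_paths, quarantine_dir="quarantine"):
    # Move flagged images aside instead of deleting them, so a false
    # positive can be restored without re-downloading the dataset.
    qdir = Path(quarantine_dir)
    qdir.mkdir(parents=True, exist_ok=True)
    for path in outlier_paths:
        shutil.move(str(path), str(qdir / Path(path).name))
```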
The model was ConvNeXt-Base (a modernized ResNet-style architecture) with transfer learning from ImageNet. I also had a lightweight custom CNN (three conv layers, batch norm, dropout) for fast iteration, but the final results all used the transfer-learning setup because it gave more stable convergence and let me isolate the augmentation effect from the architecture effect.
The Augmentation Framework
Each augmentation is a simple callable class wrapping PIL transforms:
from PIL.ImageEnhance import Contrast

class AdjustContrast:
    def __init__(self, factor=0.5):
        self.factor = factor

    def __call__(self, image):
        # factor < 1 reduces contrast, factor > 1 increases it
        return Contrast(image).enhance(self.factor)

class Rotate:
    def __init__(self, degrees=45):
        self.degrees = degrees

    def __call__(self, image):
        # rotate() keeps the canvas size; exposed corners fill with black
        return image.rotate(self.degrees)
The “intensity level” is the probability that each transform gets applied to a given training image. At 0%, no images are augmented (baseline). At 50%, each image has a coin-flip chance of receiving the transform. At 100%, every training image is augmented. The AugmentationPipeline class handles this:
import random

class AugmentationPipeline:
    def __init__(self, transforms, probability=0.5):
        self.transforms = transforms
        self.probability = probability

    def __call__(self, image):
        # each transform is applied independently with the given probability
        for t in self.transforms:
            if random.random() < self.probability:
                image = t(image)
        return image
This design means augmentation is stochastic: the same image might be augmented differently in different epochs, which is what you want for regularization. The probability parameter controls the expected fraction of augmented samples per epoch.
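The endpoint behavior is easy to check with a tiny demo. The class is restated here so the snippet runs on its own, and the integer-doubling "transform" is a stand-in for illustration, not a real image transform.

```python
import random

class AugmentationPipeline:  # restated so this snippet is self-contained
    def __init__(self, transforms, probability=0.5):
        self.transforms = transforms
        self.probability = probability

    def __call__(self, image):
        for t in self.transforms:
            if random.random() < self.probability:
                image = t(image)
        return image

double = lambda x: x * 2  # stand-in "transform" operating on integers

never = AugmentationPipeline([double], probability=0.0)   # 0% intensity
always = AugmentationPipeline([double], probability=1.0)  # 100% intensity
print(never(5))   # 5  -- probability 0 never applies the transform
print(always(5))  # 10 -- probability 1 always applies it
```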
Results by Augmentation Type
Rotation: Good Until It Isn’t
Rotation at low intensities (10-30% probability) consistently improved accuracy. The model saw slightly tilted versions of X-rays, which made it more tolerant of minor patient positioning differences — a real issue in clinical imaging where the patient’s knee isn’t always aligned the same way.
At moderate intensities (40-60%), the improvement plateaued. At high intensities (70-100%), accuracy dropped below the baseline. The reason is domain-specific: a knee X-ray rotated by 45 degrees doesn’t look like a knee X-ray anymore. It looks like something went wrong with the imaging equipment. The model was learning to classify images that would never appear in production.
Note
This is the augmentation trap that nobody warns you about. Rotation is listed in every augmentation tutorial as a basic transform. And for natural images (cats, dogs, cars), it usually helps because objects can appear at any angle. But medical images have canonical orientations. A chest X-ray is always taken with the patient facing forward. A knee X-ray has a specific expected geometry. Rotating it doesn’t simulate real variation; it creates fake data that confuses the model.
Flip: Modest and Consistent
Horizontal flip was the most boring augmentation in the study. It helped a little at all intensity levels, never hurt, and the improvement was nearly constant from 10% to 100%. The reason is that a horizontally flipped knee X-ray is just a left knee that looks like a right knee (or vice versa), and the classification task doesn’t depend on which knee it is.
The improvement was small — about 1-2 percentage points in accuracy at best. But its consistency made it a safe default. If you’re augmenting medical images and you can only pick one transform, flip is the boring-but-reliable choice.
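The flip transform wasn't shown above; in the same callable style as the other transforms, it could look like the sketch below. Using PIL's `ImageOps.mirror` is an assumption about the implementation.

```python
from PIL import Image, ImageOps

class HorizontalFlip:
    # Mirror the image left-to-right: a left knee becomes a plausible
    # right knee, which is valid variation for this task.
    def __call__(self, image):
        return ImageOps.mirror(image)

# Sanity check: flipping twice recovers the original image.
img = Image.new("L", (4, 4))
img.putpixel((0, 0), 255)
flip = HorizontalFlip()
print(flip(flip(img)).tobytes() == img.tobytes())  # True
```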
Contrast: The Surprise Winner
Contrast augmentation had the largest positive effect of any single transform I tested. At 30-50% intensity, it improved accuracy by 5-7 percentage points over the baseline. This was the biggest gain from any augmentation type at any intensity level across the entire study.
The reason, I think, is that X-ray exposure varies considerably between machines, between hospitals, and between patients. A heavier patient absorbs more radiation, producing a darker image; an older machine might have different calibration. Contrast augmentation simulates this real variation. When you adjust the contrast by a factor of 0.5, you’re creating an image that looks like it came from a slightly different X-ray machine — which is exactly the kind of distribution shift the model will encounter in practice.
At very high intensities (80-100%), contrast augmentation started to hurt, because extreme contrast factors produce images that are washed out or nearly black, which don’t resemble real X-rays. But the sweet spot (30-50%) was clear and reproducible.
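PIL's `Contrast.enhance` blends between the original and a uniform gray image at the mean luminance, which is exactly why extreme settings destroy the image. A minimal demonstration of the degenerate end of that scale, on a toy two-pixel image:

```python
from PIL import Image
from PIL.ImageEnhance import Contrast

# A toy two-pixel "X-ray": one dark pixel, one bright pixel.
img = Image.new("L", (2, 1))
img.putpixel((0, 0), 40)
img.putpixel((1, 0), 200)

# Factor 0.0 blends entirely toward a uniform image at the mean
# luminance: every pixel becomes 120 and all structure is lost.
flat = Contrast(img).enhance(0.0)
print(sorted(set(flat.getdata())))  # [120]
```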
Color Adjustment: Basically Useless
Color augmentation had no meaningful effect on accuracy at any intensity level. This makes sense once you think about it for five seconds: knee X-rays are grayscale. The dataset had been converted to RGB for compatibility with pretrained ImageNet models (which expect 3-channel input), but the three channels are essentially identical. Adjusting “color” on a grayscale image is adjusting nothing.
I include this result because negative results are underreported and because someone reading this might be building a pipeline for medical images and wondering whether to include color jitter. Don’t. Not for X-rays, not for CT, not for MRI. Those modalities don’t have meaningful color information; adjusting it just adds noise to your training process without providing any regularization benefit.
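The no-op claim can be verified directly. PIL's `Color` enhancer blends between a grayscale copy and the original; when the three channels are identical, those two images are the same, so any factor leaves the pixels untouched:

```python
from PIL import Image
from PIL.ImageEnhance import Color

# Grayscale image with some structure, replicated to three identical
# channels -- the same conversion done for pretrained ImageNet models.
gray = Image.new("L", (2, 1))
gray.putpixel((0, 0), 60)
gray.putpixel((1, 0), 180)
rgb = gray.convert("RGB")

# With identical channels, the grayscale copy equals the original,
# so the blend changes nothing at any factor.
jittered = Color(rgb).enhance(0.3)
print(jittered.tobytes() == rgb.tobytes())  # True
```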
Saturation: Same Story, Same Reason
Saturation adjustment produced results statistically indistinguishable from no augmentation, for the same reason as color adjustment: there’s no saturation to adjust in a grayscale image displayed across three identical channels. I’m slightly embarrassed that I ran 11 training runs on this before recognizing why it wasn’t doing anything, but at least the experimental design caught the null effect cleanly.
The Interaction Effects
The final iteration of the study tested combinations: flip + rotation, flip + contrast, all five together at various intensities. The combinations were less informative than the individual augmentation results because the interaction effects were hard to disentangle with only 55 models. A proper factorial design would require hundreds of runs.
What I can say: flip + contrast at moderate intensity (30% probability each) was the best-performing combination I tested. Adding rotation to that combination helped slightly at low intensity and hurt at high intensity, consistent with the individual rotation results. The color and saturation transforms contributed nothing to any combination, consistent with their individual null results.
What I Took Away
Three things that I wouldn’t have predicted before running the experiment:
First, the relationship between augmentation intensity and model performance is not monotonic for most transform types. “More augmentation” is not “better augmentation.” There’s a sweet spot, and past it, you’re degrading your training data.
Second, domain matters more than technique. The augmentation advice you’ll find in PyTorch tutorials and Kaggle notebooks is calibrated for natural images. Medical imaging has different constraints — canonical orientations, grayscale modalities, limited real-world variation in some axes — that make some standard augmentations useless or harmful.
Third, the augmentation that helped the most (contrast) was the one that most closely simulated real distribution shift. The augmentation that helped second-most (flip) simulated a real symmetry in the data. The augmentations that didn’t help (color, saturation) simulated variation that doesn’t exist in the domain. And the augmentation with the most complex behavior (rotation) simulated variation that exists in small amounts but becomes unrealistic at large amounts.
The pattern is clear in hindsight: augmentation works when it simulates plausible variation, and it fails when it creates implausible data. That’s not a new insight — I’m sure it’s in a textbook somewhere. But I didn’t appreciate it until I watched it happen across 55 models and 275 logged evaluations (55 models times 5 early-stopping evaluation epochs each). Seeing the contrast curve peak at 40% and then decline, seeing the rotation curve cross below baseline at 70%, seeing the flat line of color adjustment — that made the principle concrete in a way that reading about it never had.
Note
The full results are stored as JSON files, one per augmentation type per intensity level. If someone wants to reproduce this or extend it to additional datasets, the experimental framework is designed for exactly that — swap in a new dataset, define the augmentation sweep, and the pipeline handles the rest. The plan is to expand to chest X-rays, retinal scans, skin lesions, and some natural image datasets to see how generalizable these findings are.
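Aggregating such per-intensity files might look like the sketch below. The filename scheme (`<augmentation>_<intensity>.json`) and the `val_accuracy` key are hypothetical; the study's actual schema may differ.

```python
import json
from pathlib import Path

def best_intensity(results_dir, augmentation):
    # Scan hypothetical result files like "contrast_40.json" and return
    # the intensity level with the highest recorded validation accuracy.
    best_level, best_acc = None, -1.0
    for path in Path(results_dir).glob(f"{augmentation}_*.json"):
        level = int(path.stem.split("_")[-1])
        acc = json.loads(path.read_text())["val_accuracy"]
        if acc > best_acc:
            best_level, best_acc = level, acc
    return best_level, best_acc
```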
This study is the augmentation half of my BME 571 project. The other half — the model architecture, training loop, and classification results — is covered in the X-ray CNN post. The distinction matters because the augmentation methodology is domain-independent (you can apply this framework to any classification task), while the model discussion is specific to the osteoarthritis dataset.