What Seven Failed Autoencoders Taught Me About Anomaly Detection
Building standard and supervised autoencoders for CS 583 Deep Learning on MNIST -- from a vanilla model that barely beat random to a supervised variant at 95.1% validation accuracy -- and why reconstruction loss alone is a weaker signal than the textbooks suggest.
- Author
- Shlomo Stept
- Note
- Originally written 2022-12
The assignment for CS 583 (Deep Learning) at Stevens was this: build an autoencoder on MNIST, compress the 784-dimensional input to a 2D latent space, visualize the features, determine whether they are discriminative, and then build a supervised autoencoder that actually is. The graded deliverable was a single Jupyter notebook. I submitted seven versions of that notebook — Assignment4_try1.ipynb through Assignment4_try7.ipynb — each one a different configuration of layer sizes, optimizers, learning rates, and loss weighting, because the first six produced results ranging from “technically an autoencoder” to “this is just noise.”
The final version reached 95.1% validation accuracy on MNIST digit classification using only the 2D latent features. The path there was instructive in ways that the final accuracy number does not capture.
The Standard Autoencoder and Why It Is Not Enough
An autoencoder compresses input to a low-dimensional bottleneck and then reconstructs the original from that compressed representation. The theoretical argument for anomaly detection is clean: train on normal data only, and anomalous inputs will reconstruct poorly because the model has never learned their patterns. High reconstruction error flags an anomaly. In the MNIST version of this, the 784-pixel input (28x28 image, flattened) is compressed through three dense layers to a 2D bottleneck, then expanded back to 784 pixels through three decoder layers.
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Model

def make_auto_encoder_model(size_L1, size_L2, size_L3):
    input_img = Input(shape=(784,), name='input_img')
    # Encoder: two dense layers, then the bottleneck
    encoded = Dense(size_L1, activation='relu')(input_img)
    encoded = Dense(size_L2, activation='relu')(encoded)
    bottleneck = Dense(size_L3, activation='relu', name='bottleneck')(encoded)
    # Decoder: three dense layers back to 784 pixels
    decoded = Dense(size_L2, activation='relu')(bottleneck)
    decoded = Dense(size_L1, activation='relu')(decoded)
    decoded = Dense(784, activation='sigmoid')(decoded)
    ae = Model(input_img, decoded)
    return ae, input_img, bottleneck
The final tuned hyperparameters for the standard autoencoder: layer sizes of [392, 98, 16] (or sometimes [261, 65, 13] — the hyperparameter search was not perfectly deterministic), RMSprop optimizer at learning rate 0.005, batch size 48, trained for 100 epochs on 10,000 training samples with 10,000 validation samples held out from MNIST’s 60K training set.
The reconstruction loss converges. The images look like digits. And then you scatter-plot the 2D bottleneck activations for the test set, color-coded by digit class, and the clusters are a mess. The 2D features from different classes overlap everywhere. A classifier trained on these 2D features tops out around 60-70% accuracy, compared to ~97% on the original 784-dimensional data. The unsupervised autoencoder learned to compress and reconstruct, which is what it was optimized to do, but the latent space it found is not discriminative because nothing in the loss function asked it to be.
(This is one of those results that feels obvious in retrospect and genuinely surprised me at the time. The textbook says “autoencoder features are useful for downstream tasks.” The experiment says “useful compared to what, exactly?”)
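One cheap way to quantify "the clusters are a mess" without training a full classifier is a nearest-centroid check on the latent features. The sketch below is illustrative, not from the notebook: `centroid_accuracy` and the two toy latent spaces are hypothetical stand-ins for the autoencoder's real 2D bottleneck outputs.

```python
import numpy as np

def centroid_accuracy(z, y):
    """Accuracy of a nearest-centroid classifier on latent features z.

    A crude but fast proxy for how discriminative a latent space is:
    when class clusters overlap, centroid classification collapses
    toward chance.
    """
    classes = np.unique(y)
    centroids = np.stack([z[y == c].mean(axis=0) for c in classes])
    # Distance from every point to every class centroid
    dists = np.linalg.norm(z[:, None, :] - centroids[None, :, :], axis=2)
    preds = classes[dists.argmin(axis=1)]
    return (preds == y).mean()

# Two toy 2D latent spaces: well-separated vs. heavily overlapping clusters
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 200)
z_separated = rng.normal(0, 0.3, (400, 2)) + np.array([[4.0, 4.0]]) * y[:, None]
z_overlapping = rng.normal(0, 3.0, (400, 2)) + np.array([[0.5, 0.5]]) * y[:, None]

print(centroid_accuracy(z_separated, y))    # close to 1.0
print(centroid_accuracy(z_overlapping, y))  # not much above 0.5
```

Running this kind of probe on the unsupervised bottleneck versus the supervised one makes the gap visible in a single number rather than a scatter plot.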
The Supervised Autoencoder
The assignment’s real point was building the supervised variant: an autoencoder with a classification head branching off the bottleneck, so the latent space is simultaneously optimized for reconstruction and for predicting the correct digit class. The architecture adds a classifier branch that takes the 2D bottleneck output through two dense layers to a 10-class softmax:
def make_supervised_ae_model(size_L1, size_L2, size_L3, size_C1, size_C2, size_C3):
    input_img = Input(shape=(784,), name='input_img')
    # Encoder (same as before)
    encoded = Dense(size_L1, activation='relu')(input_img)
    encoded = Dense(size_L2, activation='relu')(encoded)
    bottleneck = Dense(size_L3, activation='relu', name='bottleneck')(encoded)
    # Decoder branch
    decoded = Dense(size_L2, activation='relu')(bottleneck)
    decoded = Dense(size_L1, activation='relu')(decoded)
    decoded = Dense(784, activation='sigmoid', name='decoder_output')(decoded)
    # Classifier branch off the bottleneck
    classified = Dense(size_C1, activation='relu')(bottleneck)
    classified = Dense(size_C2, activation='relu')(classified)
    classified = Dense(size_C3, activation='softmax', name='classifier_output')(classified)
    sae = Model(input_img, [decoded, classified])
    return sae, input_img, bottleneck
The dual loss combines MSE reconstruction loss with categorical cross-entropy classification loss, weighted by an alpha parameter (here, the classification term gets weight 0.5 via loss_weights):
sae.compile(
    optimizer=optimizer,
    loss=['mse', 'categorical_crossentropy'],
    loss_weights=[1.0, 0.5],
    metrics={'classifier_output': 'accuracy'}
)
The final supervised model used layer sizes [392, 98, 16] for the encoder/decoder and [40, 40, 10] for the classifier branch. Same optimizer, same batch size, same epoch count. The difference in results was not incremental — it was a cliff. The unsupervised model’s 2D features gave 60-70% classification accuracy. The supervised model’s 2D features gave 95.1%.
The Seven Iterations
What the final notebook does not show is what the first six looked like.
Try 1 was a vanilla autoencoder with no hyperparameter tuning. I picked layer sizes that seemed reasonable (256, 64, 2), used Adam at the default learning rate, trained for 10 epochs, and got reconstruction that looked plausible and classification that was barely above random. The bottleneck dimensionality of 2 was mandatory for the visualization requirement, which is an aggressive compression ratio for 784-dimensional data, and I initially thought the poor discrimination was just a consequence of that extreme compression. It was not, or at least it was not the primary cause.
Try 2 added a hyperparameter search class I wrote — Test_Hyper_Param — that systematically varied layer sizes, optimizers, learning rates, batch sizes, and epoch counts. I tested Adam, RMSprop, SGD, and Adagrad at learning rates from 0.1 down to 0.0001, batch sizes from 32 to 256, and epoch counts from 20 to 140. The class tracked validation loss for each configuration and printed comparison graphs. This is when I discovered that RMSprop at 0.005 outperformed Adam for this specific architecture, which contradicts the general advice that Adam is the safe default — but the difference was small enough that I would not generalize from one experiment on MNIST.
Tries 3 through 7 were increasingly focused variations. Try 3 experimented with bottleneck sizes: too small (going below 2 was not an option given the visualization requirement) and too large (256 or 128, which effectively removed the information bottleneck and let the model memorize rather than compress). Try 4 was the supervised variant, and the jump in classification accuracy was so large that tries 5 through 7 were all refinements of the supervised architecture — testing different classifier branch widths, adjusting the loss weight alpha between reconstruction and classification, and experimenting with deeper versus wider classifier branches.
(I named the notebooks Assignment4_try4_Val_acc=0.951.ipynb, with the accuracy in the filename, after spending an embarrassing amount of time in other projects scrolling through notebook outputs trying to remember which version produced which result. The naming convention is ugly but functional, which describes most of my early organizational decisions.)
What I Actually Learned
Reconstruction loss is a weaker anomaly detection signal than the theory suggests. The argument sounds airtight: train only on normal data, anomalies reconstruct poorly, threshold on reconstruction error. In practice the gap between “reconstructs normal data well” and “reconstructs anomalous data poorly” is often too narrow for reliable detection, because a sufficiently expressive autoencoder generalizes beyond its training distribution whether you want it to or not. The bottleneck is supposed to prevent this, but choosing the right bottleneck size is trial-and-error — too small and you underfit, too large and the constraint disappears.
Supervised signals dominate when they are available. The jump from unsupervised to supervised was the single biggest improvement across all seven iterations. Not batch normalization, not optimizer tuning, not architecture changes — adding the classification head. If you have labels, even noisy ones, the supervised approach will almost certainly outperform the unsupervised one, and the pure unsupervised autoencoder should be reserved for situations where labels genuinely do not exist. I see papers that default to unsupervised autoencoders for anomaly detection without considering whether even a small labeled set could bootstrap a supervised approach, and it strikes me as leaving accuracy on the table for the sake of methodological purity.
Hyperparameter search on MNIST is fast enough to be exhaustive. The Test_Hyper_Param class I wrote for this assignment is not sophisticated — it is a nested loop with graph printing — but MNIST is small enough that each configuration trains in seconds, and I tested dozens of combinations. On a real dataset this brute-force approach would be impractical, but for a homework assignment it was the right call: I gained intuition about which hyperparameters matter for autoencoders (layer sizes and learning rate) and which are second-order effects (batch size, optimizer choice) that I would not have developed from reading about it.
Save your iterations. Having seven numbered notebooks meant I could always return to a working baseline when try N+1 went sideways. This sounds obvious and I mention it because I did not do this in earlier assignments, where I modified notebooks in place, got confused about what had changed between versions, and once lost a working model by overwriting the cell that defined it. The ten seconds it takes to duplicate a notebook before making changes has saved me hours of debugging.
Completed for CS 583 (Deep Learning) at Stevens Institute of Technology, Spring 2022. The assignment used TensorFlow/Keras, which I have since moved away from in favor of PyTorch, but the lessons about autoencoder architecture and the value of supervised signals apply regardless of framework.