Summary
Can we make a trained machine learning model “forget” specific data without retraining from scratch?
With the growth of privacy regulations like GDPR and CCPA, machine unlearning—removing the influence of specific data from trained models—has become a key challenge.
Existing “exact” unlearning methods, such as full retraining, are accurate but computationally expensive. Faster “approximate” methods often compromise accuracy or leave traces that can be detected by privacy attacks like membership inference.
In our latest work, we introduce Adversarial Machine UNlearning (AMUN), a novel approach that fine-tunes a model on specially crafted adversarial examples of the data to be forgotten. These examples are close to the original samples but labeled according to the model’s own mispredictions. Fine-tuning on them reduces the model’s confidence on the forget set—mimicking the effect of retraining—while minimizing changes to performance on other data. We evaluate the effectiveness of unlearning using SOTA Membership Inference Attacks (MIAs).
What we did
Our key observation: fine-tuning on adversarial examples with incorrect labels does not cause catastrophic forgetting, but it does lower prediction confidence for nearby samples.
When an unlearning request arrives, AMUN:
- Finds the closest adversarial example for each sample in the forget set using a strong attack (PGD-50).
- Fine-tunes the model on these examples (and, if available, the remaining data) to localize decision boundary changes around the forget samples; a code sketch of both steps follows below.
This approach:
- Avoids the instability seen when directly maximizing loss or using random wrong labels.
- Works even when the remaining data is unavailable—important for privacy-sensitive settings.
- Preserves test accuracy while neutralizing membership inference attacks.
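Below is a minimal PyTorch-style sketch of the two steps above. It is deliberately simplified: the perturbation budget, attack hyperparameters, optimizer settings, and the sequential pass over the retained data are illustrative assumptions, not the exact recipe from our paper (which, for instance, searches for the closest adversarial example rather than using a fixed budget).

```python
import torch
import torch.nn.functional as F


def pgd_adversarial(model, x, y, eps=8 / 255, alpha=2 / 255, steps=50):
    """L_inf PGD-50: find a nearby example the model misclassifies.

    Returns the perturbed inputs and the (incorrect) labels the model assigns
    to them; these labels serve as the fine-tuning targets. A fixed eps budget
    is a simplification of searching for the *closest* adversarial example.
    """
    model.eval()
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()              # ascend the loss
            x_adv = x + torch.clamp(x_adv - x, -eps, eps)    # project to the eps-ball
            x_adv = torch.clamp(x_adv, 0.0, 1.0)             # keep valid pixel range
    with torch.no_grad():
        adv_labels = model(x_adv).argmax(dim=1)              # the model's own mispredictions
    return x_adv.detach(), adv_labels


def build_adversarial_forget_set(model, forget_loader, device="cuda"):
    """Construct (x_adv, y_adv) pairs once, using the original model."""
    xs, ys = [], []
    for x, y in forget_loader:
        x_adv, y_adv = pgd_adversarial(model, x.to(device), y.to(device))
        xs.append(x_adv.cpu())
        ys.append(y_adv.cpu())
    return torch.utils.data.TensorDataset(torch.cat(xs), torch.cat(ys))


def amun_finetune(model, adv_dataset, retain_loader=None,
                  epochs=5, lr=1e-2, batch_size=128, device="cuda"):
    """Fine-tune on the adversarial forget set (plus retained data if available)."""
    adv_loader = torch.utils.data.DataLoader(adv_dataset, batch_size=batch_size, shuffle=True)
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        model.train()
        for x, y in adv_loader:
            x, y = x.to(device), y.to(device)
            opt.zero_grad()
            F.cross_entropy(model(x), y).backward()
            opt.step()
        if retain_loader is not None:                        # optional: rehearse remaining data
            for x, y in retain_loader:
                x, y = x.to(device), y.to(device)
                opt.zero_grad()
                F.cross_entropy(model(x), y).backward()
                opt.step()
    return model
```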
We benchmarked AMUN against state-of-the-art unlearning methods like FT, RL, GA, BS, l1-Sparse, and SalUn on CIFAR-10 using ResNet-18. We evaluated using both older membership inference attacks (MIAs) and a newer, stronger attack (RMIA). The following table shows the results of unlearning 10% of the training samples in CIFAR-10, when the remaining samples are available to the unlearning algorithm.
And the following table shows the results when the remaining samples are not available, which is a much harder setting that leads to poor results for prior methods.
We also show that AMUN handles successive unlearning requests more effectively than competing methods.
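For intuition about what these attacks measure, here is a toy, loss-threshold membership inference check, in the spirit of the simplest MIA baselines and much weaker than the RMIA attack we actually report: it compares per-sample losses on the forget set with losses on unseen test data. After successful unlearning the two distributions should be indistinguishable, i.e., the AUC should be close to 0.5.

```python
import torch
import torch.nn.functional as F
from sklearn.metrics import roc_auc_score


@torch.no_grad()
def per_sample_losses(model, loader, device="cuda"):
    """Collect per-example cross-entropy losses over a data loader."""
    model.eval()
    losses = []
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        losses.append(F.cross_entropy(model(x), y, reduction="none").cpu())
    return torch.cat(losses)


def loss_mia_auc(model, forget_loader, test_loader):
    """Threshold MIA on losses: AUC near 0.5 means the attack cannot tell
    forgotten samples from unseen test samples (the desired outcome)."""
    forget_losses = per_sample_losses(model, forget_loader)
    test_losses = per_sample_losses(model, test_loader)
    scores = torch.cat([-forget_losses, -test_losses])      # lower loss => more "member"-like
    labels = torch.cat([torch.ones_like(forget_losses),     # 1 = was in the training set
                        torch.zeros_like(test_losses)])     # 0 = never seen in training
    return roc_auc_score(labels.numpy(), scores.numpy())
```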
What we found
- Superior Privacy: After unlearning 10% of training samples, RMIA could not distinguish forgotten samples from unseen test data better than random guessing.
- Accuracy Preservation: AMUN matched retrained models’ accuracy on test data while significantly outperforming other methods on privacy metrics.
- No Remaining Data? No Problem: Even without access to the remaining training set, AMUN achieved low average gaps to retraining—outperforming all baselines.
- Robust Models Still Forget: AMUN remained effective on adversarially robust models with controlled Lipschitz constants.
- Continuous Unlearning: Across multiple unlearning requests, AMUN maintained privacy and accuracy better than competing approaches.
Additional results
We have performed additional experiments with various forget-set sizes, model architectures, and datasets, as well as with adversarially robust models. The details of these experiments and their results are available in our paper.
Theoretical insights
Theoretical results in our paper show that the following factors enhance the quality of unlearning with AMUN:
- Adversarial examples that are closer to the original samples.
- Higher-quality adversarial examples.
- Transferability of the adversarial examples crafted for the original model to the retrained model.
- Avoiding overfitting to the adversarial examples.
- A lower Lipschitz constant of the model.
- Better generalization of the retrained model to unseen samples; this implies better results for smaller forget sets, which is expected.
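An informal way to see why proximity of the adversarial examples and a small Lipschitz constant help (this is just the definition of Lipschitz continuity, not a restatement of our formal analysis): if the unlearned model $f$ has Lipschitz constant $L$, then for a forget sample $x_f$ and its adversarial example $x_{\mathrm{adv}}$,

$$\lVert f(x_f) - f(x_{\mathrm{adv}}) \rVert \;\le\; L \,\lVert x_f - x_{\mathrm{adv}} \rVert .$$

Once fine-tuning makes $f$ confident in the adversarial label at $x_{\mathrm{adv}}$, a small distance and a small $L$ force $f(x_f)$ to stay close to $f(x_{\mathrm{adv}})$, so the model cannot remain confident in the original label of $x_f$; this is the low-confidence behavior a model retrained without $x_f$ would exhibit.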