TL;DR:
This repository implements a clean and reproducible pipeline for studying approximate computing on CNNs, including global pruning, fine-tuning, and dynamic INT8 quantization.
Experiments cover ResNet-18 and MobileNetV2 on CIFAR-10 and CIFAR-100, with analysis of accuracy, model size, latency, and sparsity.
Pruning consistently improves generalization (by as much as 15 percentage points), while INT8 quantization preserves accuracy with minimal overhead.
This repository implements an experimental framework to study approximate computing techniques for CNN compression, inspired by Deep Compression (Han et al., ICLR 2016).
We evaluate how pruning, quantization, and model architecture choices affect the trade-offs among:
● Accuracy
● Model size
● Inference latency
● Layer-wise sparsity
Experiments are conducted on ResNet-18 and MobileNetV2, using CIFAR-10 and CIFAR-100 datasets.
Contents:
- Environment Setup
- Project Structure
- Pipeline Overview
- Running the Experiments
- Results and Analysis
- Notes and Limitations
- Conclusion
- Citation
We recommend using Anaconda.
conda create -n approx_cnn python=3.9 -y
conda activate approx_cnn
pip install -r requirements.txt
Note: PyTorch is not included in requirements.txt because the correct build depends on your CUDA setup. Install it separately:
For CUDA 12.1:
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
For CPU-only:
pip install torch torchvision torchaudio
pip install numpy tqdm
conda install matplotlib -y
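After installing, a quick sanity check confirms that PyTorch and torchvision import correctly and shows whether CUDA is visible:

```python
import torch
import torchvision

print("torch:", torch.__version__)
print("torchvision:", torchvision.__version__)
print("CUDA available:", torch.cuda.is_available())
```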
git clone https://github.com/YueranCao2001/cnn-approx-compression
cd cnn-approx-compression
cnn-approx-compression/
├── data/ # CIFAR-10/100 datasets, downloaded automatically when the scripts run
├── models/ # Saved .pth checkpoints
├── results/ # Result plots
├── scripts/ # Training, pruning, quantization, evaluation
└── README.md
The full experimental pipeline is:
(1). Train baseline CNNs (ResNet-18, MobileNetV2) on CIFAR-10 / CIFAR-100
(2). Apply global unstructured pruning
(3). Fine-tune the pruned models
(4). Optionally apply dynamic INT8 quantization
(5). Evaluate accuracy, model size, and latency
(6). Summarize and visualize the results
Each step corresponds to one script in scripts/.
python scripts/train_resnet18_c10_baseline.py
python scripts/train_mobilenetv2_c10_baseline.py
These produce the baseline checkpoints:
models/resnet18_c10_base.pth
models/mobilenetv2_c10_base.pth
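For orientation, the sketch below shows what a baseline CIFAR-10 training run can look like. It is not the repository's exact script: the unmodified torchvision ResNet-18 stem, the augmentation, the hyperparameters, and the epoch count are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torchvision
import torchvision.transforms as T

# Commonly used CIFAR-10 augmentation and normalization statistics.
transform = T.Compose([
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])
train_set = torchvision.datasets.CIFAR10(
    "data/", train=True, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(
    train_set, batch_size=128, shuffle=True, num_workers=2)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torchvision.models.resnet18(num_classes=10).to(device)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)

for epoch in range(30):          # epoch count is illustrative
    model.train()
    for x, y in train_loader:
        x, y = x.to(device), y.to(device)
        optimizer.zero_grad()
        criterion(model(x), y).backward()
        optimizer.step()

torch.save(model.state_dict(), "models/resnet18_c10_base.pth")  # assumes models/ exists
```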
python scripts/prune_resnet18_c10_prune50.py
python scripts/prune_mobilenetv2_c10_prune50.py
● Global L1 unstructured pruning
● Applied to Conv + Linear layers
● Followed by fine-tuning
● Pruning masks are removed before saving the final checkpoint (see the sketch below)
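A minimal sketch of this flow, assuming the torch.nn.utils.prune API and a 50% global ratio; the fine-tuning step is elided and the output path is illustrative:

```python
import torch
import torch.nn.utils.prune as prune
from torchvision.models import resnet18

model = resnet18(num_classes=10)  # in practice, load the trained baseline here

# Collect (module, "weight") pairs for every Conv2d and Linear layer.
params_to_prune = [
    (m, "weight")
    for m in model.modules()
    if isinstance(m, (torch.nn.Conv2d, torch.nn.Linear))
]

# Global L1 unstructured pruning: zero out the 50% smallest-magnitude
# weights across all selected layers at once.
prune.global_unstructured(
    params_to_prune,
    pruning_method=prune.L1Unstructured,
    amount=0.5,
)

# ... fine-tune the pruned model here ...

# Make pruning permanent: fold weight_orig * weight_mask back into .weight
# so the saved checkpoint contains plain dense tensors without masks.
for module, name in params_to_prune:
    prune.remove(module, name)

torch.save(model.state_dict(), "resnet18_c10_pruned50.pth")  # illustrative path
```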
python scripts/quantize_resnet18_c10_int8.py
● Applies dynamic INT8 quantization to Linear layers (sketched below)
● Produces: models/resnet18_c10_pruned50_int8.pth
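A minimal sketch of this step, assuming torch.ao.quantization.quantize_dynamic; the input checkpoint path is illustrative, the output path matches the one above:

```python
import torch
from torchvision.models import resnet18

# Load the pruned, fine-tuned model on CPU (input path is illustrative).
model = resnet18(num_classes=10)
model.load_state_dict(
    torch.load("resnet18_c10_pruned50.pth", map_location="cpu"))
model.eval()

# Dynamic INT8 quantization: Linear weights are stored as INT8 and their
# activations are quantized on the fly at inference time; Conv2d layers
# remain in FP32, which is why the size and accuracy impact is small.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

torch.save(quantized.state_dict(), "models/resnet18_c10_pruned50_int8.pth")
```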
python scripts/eval_resnet18_c10_all.py
python scripts/eval_mobilenetv2_c10_all.py
python scripts/eval_resnet18_c100_all.py
Each evaluation script reports:
● Test accuracy (CPU)
● On-disk model size
● Average inference time per image on CPU (see the measurement sketch below)
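A minimal sketch of how these three metrics can be measured; the function names, warm-up count, and run count are illustrative:

```python
import os
import time

import torch


def test_accuracy(model, loader):
    """Top-1 accuracy on CPU."""
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for x, y in loader:
            correct += (model(x).argmax(dim=1) == y).sum().item()
            total += y.numel()
    return correct / total


def model_size_mb(path):
    """On-disk size of a saved checkpoint, in MB."""
    return os.path.getsize(path) / 1e6


def latency_ms_per_image(model, n_runs=200):
    """Average CPU inference time for a single 32x32 CIFAR image."""
    model.eval()
    x = torch.randn(1, 3, 32, 32)
    with torch.no_grad():
        for _ in range(10):                  # warm-up
            model(x)
        start = time.perf_counter()
        for _ in range(n_runs):
            model(x)
    return (time.perf_counter() - start) / n_runs * 1000
```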
python scripts/viz_resnet18_c10_results.py
python scripts/viz_mobilenetv2_c10_results.py
python scripts/viz_resnet18_vs_mobilenetv2_c10.py
All images are saved in: results/
Below is a comprehensive summary of all experiments, structured per model and dataset.
MobileNetV2 on CIFAR-10:
● Pruning improves accuracy from 65.3% → 80.2%, a surprising but well-documented effect of removing noisy or redundant weights.
● Model size stays almost unchanged (8.769 → 8.770 MB) because PyTorch stores dense FP32 tensors even after pruning.
● The +15% absolute accuracy jump indicates that MobileNetV2 is heavily overparameterized for CIFAR-10.
● Pruning acts as a form of regularization, helping generalization.
● Latency slightly decreases (4.922 → 4.818 ms).
● Because PyTorch runs dense kernels regardless of sparsity, the dense FLOP count is unchanged; the small difference is run-to-run variation rather than a sparse-kernel speedup.
● As expected, size remains unchanged due to dense storage format.
● True compression would require sparse serialization or Huffman coding.
ResNet-18 on CIFAR-10:
● Baseline accuracy: 74.5%
● Pruned (50%): 81.6%
● Pruned + INT8: 81.6% (no drop)
INT8 quantization preserves the pruned model's accuracy because only fully connected layers are quantized, while convolutions dominate the computation.
● Same trend as above; pruning improves generalization.
● INT8 quantization does not harm accuracy.
● INT8 version incurs slightly higher latency (2.438 → 2.605 ms).
● PyTorch dynamic quantization is CPU-oriented and may add overhead for small models.
● INT8 slightly reduces state_dict size (42.7314 → 42.7288 MB).
● Again, PyTorch stores dense tensors, so compression is limited (see the parameter breakdown below).
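The limited compression follows from where the weights live: in ResNet-18 almost all parameters sit in Conv2d layers, while the single Linear classifier holds only about 5 K weights for 10 classes. A small sketch that makes this split explicit, using the stock torchvision model:

```python
import torch
from torchvision.models import resnet18

model = resnet18(num_classes=10)

conv_params = sum(m.weight.numel() for m in model.modules()
                  if isinstance(m, torch.nn.Conv2d))
linear_params = sum(m.weight.numel() for m in model.modules()
                    if isinstance(m, torch.nn.Linear))

print(f"Conv2d weights: {conv_params / 1e6:.2f} M")    # ~11 M, stay FP32
print(f"Linear weights: {linear_params / 1e6:.3f} M")  # ~0.005 M, become INT8
```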
Effect of the pruning ratio (ResNet-18, CIFAR-10):
● 0% pruning → 74.5%
● 30% pruning → 82.5%
● 50% pruning → 81.6%
● 70% pruning → 83.0% (best)
● Latency varies slightly (2.341–2.475 ms) with no consistent trend.
● PyTorch kernels do not accelerate sparse convolutions.
● All sizes ≈ 42.73 MB, confirming dense storage.
Across 30/50/70% pruning:
● Early layers remain dense
● Deeper 3×3 convolutions gain high sparsity
● FC layer sparsity roughly matches pruning target
This matches standard global magnitude-pruning dynamics; the sketch below shows how per-layer sparsity can be measured.
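A small sketch for measuring per-layer sparsity (the fraction of exactly-zero weights) once the pruning masks have been folded into the tensors; the checkpoint path is illustrative:

```python
import torch
from torchvision.models import resnet18


def layer_sparsity(model):
    """Fraction of exactly-zero weights in each Conv2d/Linear layer."""
    stats = {}
    for name, module in model.named_modules():
        if isinstance(module, (torch.nn.Conv2d, torch.nn.Linear)):
            w = module.weight.detach()
            stats[name] = (w == 0).float().mean().item()
    return stats


model = resnet18(num_classes=10)
model.load_state_dict(torch.load("resnet18_c10_pruned50.pth"))  # illustrative path
for name, s in layer_sparsity(model).items():
    print(f"{name:30s} {s:.1%}")
```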
ResNet-18 on CIFAR-100:
● Pruning dramatically increases accuracy: 42.9% → 56.4%.
● Similar to CIFAR-10, pruning removes redundancy.
● Pruned model slightly faster (2.620 → 2.589 ms).
● Variation is small.
● Dense storage again limits compression.
ResNet-18 vs MobileNetV2 (CIFAR-10):
● ResNet-18: 74.5% → 81.6%
● MobileNetV2: 65.3% → 80.2%
● MobileNetV2 is naturally slower on CPU (depthwise separable ops → lower parallelism).
● ResNet-18 remains ~2.4 ms, MobileNetV2 ~4.5 ms.
● ResNet-18: ~42.7 MB
● MobileNetV2: ~8.77 MB
● Significant architectural footprint difference.
(1). PyTorch stores pruned tensors densely, so file size does not change.
(2). PyTorch dynamic INT8 quantization only affects Linear layers.
(3). No sparse kernels → pruning does not accelerate inference.
(4). True compression (Deep Compression) requires:
● Sparse matrix formats (a minimal serialization sketch follows this list)
● Weight sharing
● Huffman coding
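For illustration, the sketch below stores only the nonzero values and their coordinates of pruned tensors, and rebuilds dense tensors on load. It covers only the sparse-serialization piece (no weight sharing or Huffman coding), and with this naive COO layout the format only pays off at high sparsity; Deep Compression uses compact relative indices to do better:

```python
import torch


def save_sparse(state_dict, path):
    """Store only nonzero values and their coordinates for pruned tensors."""
    packed = {}
    for name, t in state_dict.items():
        t = t.detach().cpu()
        if t.is_floating_point() and t.dim() > 0 and (t == 0).any():
            idx = t.nonzero()                    # (nnz, ndim) coordinates
            packed[name] = {
                "shape": tuple(t.shape),
                "indices": idx.to(torch.int32),  # 4 bytes per coordinate
                "values": t[tuple(idx.t())],     # nnz FP32 values
            }
        else:
            packed[name] = t                     # dense tensors stay as-is
    torch.save(packed, path)


def load_sparse(path):
    """Rebuild a dense state_dict from the packed representation."""
    state_dict = {}
    for name, entry in torch.load(path).items():
        if isinstance(entry, dict):
            dense = torch.zeros(entry["shape"], dtype=entry["values"].dtype)
            dense[tuple(entry["indices"].t().long())] = entry["values"]
            state_dict[name] = dense
        else:
            state_dict[name] = entry
    return state_dict
```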
This project systematically evaluates approximate computing techniques (pruning + quantization) on two CNN architectures and two datasets.
● Moderate-to-high unstructured pruning consistently improves accuracy (regularization effect).
● INT8 quantization preserves accuracy but provides limited compression for CNNs dominated by conv layers.
● Model size remains unchanged without sparse serialization.
● CPU latency shows minimal variation because PyTorch kernels do not exploit sparsity.
● MobileNetV2 benefits even more from pruning compared to ResNet-18.
These findings provide a strong baseline for understanding approximate computing trade-offs in modern CNNs.
If you use this repository or build upon its code or analysis, please cite:
@misc{cao2025approxcnn,
title = {Approximate Computing for CNN Compression: A Study of Pruning and INT8 Quantization on ResNet-18 and MobileNetV2},
author = {Cao, Yueran and Wang, Zinian and Chen, Chang and Li, Tian},
year = {2025},
note = {Course Project, Georgetown University},
}

This project is also inspired by:
@misc{han2016deepcompressioncompressingdeep,
title = {Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding},
author = {Han, Song and Mao, Huizi and Dally, William J.},
year = {2016},
eprint = {1510.00149},
archivePrefix = {arXiv},
primaryClass = {cs.CV},
url = {https://arxiv.org/abs/1510.00149},
}