A survey of quantization methods for efficient neural network inference A Gholami, S Kim, Z Dong, Z Yao, MW Mahoney, K Keutzer Low-Power Computer Vision, 291-326, 2022 | 1252 | 2022 |
I-BERT: Integer-only BERT quantization S Kim, A Gholami, Z Yao, MW Mahoney, K Keutzer International conference on machine learning, 5506-5518, 2021 | 354 | 2021 |
SqueezeLLM: Dense-and-Sparse Quantization S Kim, C Hooper, A Gholami, Z Dong, X Li, S Shen, MW Mahoney, ... ICML 2024, 2023 | 140 | 2023 |
Learned Token Pruning for Transformers S Kim, S Shen, D Thorsley, A Gholami, W Kwon, J Hassoun, K Keutzer Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and …, 2022 | 139 | 2022 |
A Fast Post-Training Pruning Framework for Transformers W Kwon, S Kim, MW Mahoney, J Hassoun, K Keutzer, A Gholami Advances in Neural Information Processing Systems 35, 2022 | 122 | 2022 |
Squeezeformer: An efficient transformer for automatic speech recognition S Kim, A Gholami, A Shaw, N Lee, K Mangalam, J Malik, MW Mahoney, ... Advances in Neural Information Processing Systems 35, 2022 | 119 | 2022 |
KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization C Hooper, S Kim, H Mohammadzadeh, MW Mahoney, YS Shao, ... NeurIPS 2024, 2024 | 84* | 2024 |
Full Stack Optimization of Transformer Inference: a Survey S Kim, C Hooper, T Wattanawong, M Kang, R Yan, H Genc, G Dinh, ... arXiv preprint arXiv:2302.14017, 2023 | 82 | 2023 |
Speculative decoding with big little decoder S Kim, K Mangalam, S Moon, J Malik, MW Mahoney, A Gholami, ... Advances in Neural Information Processing Systems 36, 2024 | 77* | 2024 |
Hessian-aware pruning and optimal neural implant S Yu, Z Yao, A Gholami, Z Dong, S Kim, MW Mahoney, K Keutzer Proceedings of the IEEE/CVF Winter Conference on Applications of Computer …, 2022 | 65 | 2022 |
Applications and techniques for fast machine learning in science AMC Deiana, N Tran, J Agar, M Blott, G Di Guglielmo, J Duarte, P Harris, ... Frontiers in Big Data 5, 787421, 2022 | 64 | 2022 |
AI and memory wall A Gholami, Z Yao, S Kim, C Hooper, MW Mahoney, K Keutzer IEEE Micro, 2024 | 57 | 2024 |
An LLM Compiler for Parallel Function Calling S Kim, S Moon, R Tabrizi, N Lee, MW Mahoney, K Keutzer, A Gholami ICML 2024, 2023 | 32 | 2023 |
LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement N Lee, T Wattanawong, S Kim, K Mangalam, S Shen, G Anumanchipalli, ... ACL 2024, 2024 | 25 | 2024 |
Integer-Only Zero-Shot Quantization for Efficient Speech Recognition S Kim, A Gholami, Z Yao, N Lee, P Wang, A Nrusimha, B Zhai, T Gao, ... ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and …, 2022 | 23 | 2022 |
SPEED: Speculative Pipelined Execution for Efficient Decoding C Hooper, S Kim, H Mohammadzadeh, H Genc, K Keutzer, A Gholami, ... arXiv preprint arXiv:2310.12072, 2023 | 18 | 2023 |
WindTunnel: towards differentiable ML pipelines beyond a single model GI Yu, S Amizadeh, S Kim, A Pagnoni, C Zhang, BG Chun, M Weimer, ... Proceedings of the VLDB Endowment 15 (1), 11-20, 2021 | 16* | 2021 |
Memory-Efficient Hardware Performance Counters with Approximate-Counting Algorithms J Xu, S Kim, B Nikolic, YS Shao 2021 IEEE International Symposium on Performance Analysis of Systems and …, 2021 | 6 | 2021 |
TinyAgent: Function Calling at the Edge LE Erdogan, N Lee, S Jha, S Kim, R Tabrizi, S Moon, C Hooper, ... EMNLP 2024 (Demo), 2024 | 5 | 2024 |
Terra: Imperative-Symbolic Co-Execution of Imperative Deep Learning Programs T Kim, E Jeong, GW Kim, Y Koo, S Kim, G Yu, BG Chun Advances in Neural Information Processing Systems 34, 1468-1480, 2021 | 5 | 2021 |