A study of BFLOAT16 for deep learning training D Kalamkar, D Mudigere, N Mellempudi, D Das, K Banerjee, S Avancha, ... arXiv preprint arXiv:1905.12322, 2019 | 365 | 2019 |
Distributed deep learning using synchronous stochastic gradient descent D Das, S Avancha, D Mudigere, K Vaidyanathan, S Sridharan, D Kalamkar, ... arXiv preprint arXiv:1602.06709, 2016 | 213 | 2016 |
Mixed precision training of convolutional neural networks using integer operations D Das, N Mellempudi, D Mudigere, D Kalamkar, S Avancha, K Banerjee, ... arXiv preprint arXiv:1802.00930, 2018 | 206 | 2018 |
Anatomy of high-performance deep learning convolutions on SIMD architectures E Georganas, S Avancha, K Banerjee, D Kalamkar, G Henry, H Pabst, ... SC18: International Conference for High Performance Computing, Networking …, 2018 | 139 | 2018 |
Performing power management in a multicore processor VW Lee, ET Grochowski, D Kim, Y Bai, S Li, NK Mellempudi, ... US Patent 10,234,930, 2019 | 128 | 2019 |
DistGNN: Scalable distributed training for large-scale graph neural networks V Md, S Misra, G Ma, R Mohanty, E Georganas, A Heinecke, D Kalamkar, ... Proceedings of the International Conference for High Performance Computing …, 2021 | 127 | 2021 |
Optimization of geometric multigrid for emerging multi- and manycore processors S Williams, DD Kalamkar, A Singh, AM Deshpande, B Van Straalen, ... SC'12: Proceedings of the International Conference on High Performance …, 2012 | 95 | 2012 |
Lattice QCD on Intel® Xeon Phi™ Coprocessors B Joo, DD Kalamkar, K Vaidyanathan, M Smelyanskiy, K Pamnany, ... Supercomputing: 28th International Supercomputing Conference, ISC 2013 …, 2013 | 88 | 2013 |
Abstraction layers for scalable distributed machine learning DD Kalamkar, K Vaidyanathan, S Sridharan, D Das US Patent 11,094,029, 2021 | 70 | 2021 |
Efficient shared-memory implementation of high-performance conjugate gradient benchmark and its application to unstructured matrices J Park, M Smelyanskiy, K Vaidyanathan, A Heinecke, DD Kalamkar, X Liu, ... SC'14: Proceedings of the International Conference for High Performance …, 2014 | 69 | 2014 |
Enabling efficient multithreaded MPI communication through a library-based implementation of MPI endpoints S Sridharan, J Dinan, DD Kalamkar SC'14: Proceedings of the International Conference for High Performance …, 2014 | 57 | 2014 |
Optimizing deep learning recommender systems training on CPU cluster architectures D Kalamkar, E Georganas, S Srinivasan, J Chen, M Shiryaev, A Heinecke SC20: International Conference for High Performance Computing, Networking …, 2020 | 55 | 2020 |
Improving concurrency and asynchrony in multithreaded MPI applications using software offloading K Vaidyanathan, DD Kalamkar, K Pamnany, JR Hammond, P Balaji, ... Proceedings of the International Conference for High Performance Computing …, 2015 | 54 | 2015 |
Lattice QCD with domain decomposition on Intel® Xeon Phi co-processors S Heybrock, B Joó, DD Kalamkar, M Smelyanskiy, K Vaidyanathan, ... SC'14: Proceedings of the International Conference for High Performance …, 2014 | 50 | 2014 |
Optimizing Wilson-Dirac Operator and Linear Solvers for Intel® KNL B Joó, DD Kalamkar, T Kurth, K Vaidyanathan, A Walden High Performance Computing: ISC High Performance 2016 International …, 2016 | 38 | 2016 |
On scale-out deep learning training for cloud and HPC S Sridharan, K Vaidyanathan, D Kalamkar, D Das, ME Smorkalov, ... arXiv preprint arXiv:1801.08030, 2018 | 35 | 2018 |
Harnessing deep learning via a single building block E Georganas, K Banerjee, D Kalamkar, S Avancha, A Venkat, ... 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS …, 2020 | 25 | 2020 |
Performing power management in a multicore processor VW Lee, D Kim, Y Bai, S Ji, S Li, DD Kalamkar, NK Mellempudi US Patent 9,910,481, 2018 | 23 | 2018 |
Tensor processing primitives: A programming abstraction for efficiency and portability in deep learning workloads E Georganas, D Kalamkar, S Avancha, M Adelman, C Anderson, A Breuer, ... Proceedings of the International Conference for High Performance Computing …, 2021 | 22 | 2021 |
Wilson Dslash kernel from lattice QCD optimization B Joó, M Smelyanskiy, DD Kalamkar, K Vaidyanathan Thomas Jefferson National Accelerator Facility (TJNAF), Newport News, VA …, 2015 | 20 | 2015 |