Mikhail Smelyanskiy's Home Page

About Me

I am a Principal Engineer at Intel's Parallel Computing Lab, part of Intel Research Labs in Santa Clara, CA. My main focus is application-driven parallel architecture research. Specifically, my work involves the design, implementation and analysis (including competitive analysis) of parallel algorithms and workloads for current and future generations of parallel processor systems. In my work I take a top-down approach: (i) designing the fastest algorithm, (ii) mapping this algorithm to the underlying hardware architecture, and (iii) modeling and analyzing the performance (using cycle-accurate performance simulators if needed) to discover and explain performance bottlenecks. This results in architectural recommendations and proposals that drive the design of Intel's future parallel architectures, as well as highly optimized (down to 'bare-metal') workload implementations on existing systems. For the list of my publications, please see 'List of Publications' below.

I made significant contributions to the definition of the Intel® Many Integrated Core (MIC) architecture and the development of the Intel® Xeon Phi™ coprocessor. My research in the areas of medical imaging, computational finance and, more recently, fundamental high-performance compute kernels, such as DGEMM (double-precision matrix-matrix multiplication), SpMVM (sparse matrix-vector multiplication) and QCD (quantum chromodynamics), helped improve the MIC architecture as well as demonstrate its full performance potential.

I was the Intel technical lead behind the top-ranked (#1) position in the Green500 list of November 2012. The record was set by the MIC-based Beacon system built at the National Institute for Computational Sciences. Green500 provides a ranking of the 500 most energy-efficient supercomputers in the world. I was also the Intel technical lead behind the very first MIC-based submission to the TOP500, which ranked number 150 in June 2012. TOP500 provides a ranking of the 500 fastest computers in the world. My highly optimized double-precision matrix-matrix multiplication (DGEMM) implementation, running on a single-chip Intel® Xeon Phi™, sustained performance of over 1 TeraFLOP/s (10¹² floating-point operations per second) in November 2011. At the time, it was the world's fastest DGEMM and the first to exceed one TeraFLOP/s. I received the Intel Achievement Award, Intel's highest honor, in 2012.

Prior to my work at Intel, I earned a Ph.D. from the Department of Electrical Engineering and Computer Science at the University of Michigan, Ann Arbor in 2003. My academic advisors were Professor Edward Davidson and Professor Scott Mahlke. My thesis work focused on hardware/software co-design and compiler optimizations for efficient resource utilization on VLIW architectures.

Contact Information

Mikhail Smelyanskiy, Ph.D.
Intel Corporation
2200 Mission College Blvd., SC-12
Santa Clara, CA 95054
Email: mikhail.smelyanskiy@intel.com
LinkedIn
Google Scholar

List of Publications

2016

On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, Ping Tak Peter Tang. Preprint: arXiv:1609.04836.
Sparso: Context-driven Optimizations of Sparse Linear Algebra. Hongbo Rong, Jongsoo Park, Lingxiang Xiang, Todd A. Anderson, and Mikhail Smelyanskiy. International Conference on Parallel Architectures and Compilation Techniques (PACT), 2016, accepted for publication; open-sourced on GitHub.
High Performance Emulation of Quantum Circuits. Thomas Häner, Damian S. Steiger, Mikhail Smelyanskiy, Matthias Troyer. The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC), 2016, accepted for publication. Preprint: arXiv:1604.06460.
Large Scale Distributed Hessian-Free Optimization for Deep Neural Network. AAAI 2017 workshop on distributed machine learning, 2016, accepted for publication. Preprint: arXiv:1606.00511.
High Performance Parallel Stochastic Gradient Descent in Shared Memory. S. Sallinen, N. Satish, M. Smelyanskiy, S. Sury, C. Ré. IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2016, accepted for publication.
Error sensitivity to environmental noise in quantum circuits for chemical state preparation. Nicolas P. D. Sawaya, Mikhail Smelyanskiy, Jarrod R. McClean, and Alán Aspuru-Guzik. Journal of Chemical Theory and Computation. In Press. Preprint: arXiv:1601.01857.
qHiPSTER: The Quantum High Performance Software Testing Environment. Mikhail Smelyanskiy, Nicolas P. D. Sawaya, and Alán Aspuru-Guzik. Preprint: arXiv:1601.07195.
Performance optimizations for scalable implicit RANS calculations with SU2. Economon, T. D., Palacios, F., Alonso, J. J., Bansal, G., Mudigere, D., Deshpande, A., Heinecke, A., Smelyanskiy, M. Computers and Fluids, Vol. 129 (2016), pp. 146-158. doi: 10.1016/j.compfluid.2016.02.003.

2015

Towards High-Performance Optimizations of the Unstructured Open-Source SU2 Suite. Economon, T. D., Palacios, F., Alonso, J. J., Bansal, G., Mudigere, D., Deshpande, A., Heinecke, A., Smelyanskiy, M. AIAA Paper 2015-1949, AIAA Infotech at Aerospace, AIAA SciTech, Kissimmee, FL, January, 2015.
High-Performance Algebraic Multigrid Solver Optimized for Multi-Core Based Distributed Parallel Systems. Jongsoo Park, Mikhail Smelyanskiy, Ulrike Meier Yang, Dheevatsa Mudigere, and Pradeep Dubey. The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC), 2015.
Optimizations in High-Performance Conjugate Gradient Benchmark for IA-based Multi and Many-core Processors. Jongsoo Park, Mikhail Smelyanskiy, Karthikeyan Vaidyanathan, Alexander Heinecke, Dhiraj D. Kalamkar, Md. Mostofa Ali Patwary, Vadim Pirogov, Pradeep Dubey, Xing Liu, Carlos Rosales, Cyril Mazauric, and Christopher Daley. International Journal of High Performance Computing.
Scaling Up Hartree–Fock Calculations on Tianhe-2. E. Chow, X. Liu, S. Misra, M. Dukhan, M. Smelyanskiy, J. R. Hammond, Y. Du, X.-K. Liao, and P. Dubey. International Journal of High Performance Computing Applications, 2015, to appear.
Parallel Scalability of Hartree-Fock Calculations. E. Chow, X. Liu, M. Smelyanskiy, and J. R. Hammond. The Journal of Chemical Physics, 142, 104103 (2015).
Exploring Shared-memory Optimizations for an Unstructured Mesh CFD Application on Modern Parallel Systems. Dheevatsa Mudigere, Srinivas Sridharan, Anand Deshpande, Jongsoo Park, Alexander Heinecke, Mikhail Smelyanskiy, Bharat Kaul, Pradeep Dubey, Dinesh Kaushik, and David Keyes. IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2015.

2014

Petascale High Order Dynamic Rupture Earthquake Simulations on Heterogeneous Supercomputers. Alexander Heinecke, Alexander Breuer, Sebastian Rettenberger, Michael Bader, Alice-Agnes Gabriel, Christian Pelties, Arndt Bode, William Barth, Xiang-Ke Liao, Karthikeyan Vaidyanathan, Mikhail Smelyanskiy, Pradeep Dubey. Accepted to 2014 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC14). ACM Gordon Bell Finalist.
Lattice QCD with Domain Decomposition on Intel(R) Xeon Phi(TM) Co-Processors. Simon Heybrock, Balint Joo, Dhiraj D. Kalamkar, Mikhail Smelyanskiy, Karthikeyan Vaidyanathan, Tilo Wettig, Pradeep Dubey. Accepted to 2014 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC14).
Efficient Shared-Memory Implementation of High-Performance Conjugate Gradient Benchmark and Its Application to Unstructured Matrices. Jongsoo Park, Mikhail Smelyanskiy, Karthikeyan Vaidyanathan, Alexander Heinecke, Dhiraj D. Kalamkar, Xing Liu, Mostofa Ali Patwary, Yutong Lu, Pradeep Dubey. Accepted to 2014 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC14).
Sparsifying Synchronizations for High-Performance Shared-Memory Sparse Triangular Solver. Jongsoo Park, Mikhail Smelyanskiy, Narayanan Sundaram and Pradeep Dubey. In International Supercomputing Conference (ISC), 2014. Accepted for publication.
Improving Communication Performance and Scalability of Native Applications on Intel(R) Xeon Phi(TM) Coprocessor Clusters. Karthikeyan Vaidyanathan, Kiran Pamnany, Dhiraj D Kalamkar, Alexander Heinecke, Mikhail Smelyanskiy, Jongsoo Park, Daehyun Kim, Aniruddha Shet, Bharat Kaul, Balint Joo, Pradeep Dubey. In Proceedings of the 2014 IEEE International Parallel and Distributed Processing Symposium, May 2014.
Anatomy of High-Performance Many-Threaded Matrix Multiplication. Tyler M Smith, Robert van de Geijn, Mikhail Smelyanskiy, Jeff Hammond, Field G Van Zee. In Proceedings of the 2014 IEEE International Parallel and Distributed Processing Symposium, May 2014. (PDF)

2013

Opportunities for Parallelism in Matrix Multiplication. Tyler M. Smith, Robert van de Geijn, Mikhail Smelyanskiy, Jeff R. Hammond, and Field G. Van Zee. FLAME Working Note #71. The University of Texas at Austin, Department of Computer Science. Technical Report TR-13-20. 2013. (PDF)
Implementing Level-3 BLAS with BLIS: Early Experience. Field G. Van Zee, Tyler Smith, Francisco D. Igual, Mikhail Smelyanskiy, Xianyi Zhang, Michael Kistler, Vernon Austel, John Gunnels, Tze Meng Low, Bryan Marker, Lee Killough, Robert A. van de Geijn. FLAME Working Note #69. The University of Texas at Austin, Department of Computer Science. Technical Report TR-13-03. 2013. (PDF)
Efficient Sparse Matrix-Vector Multiplication on x86-based Many-core Processors. Xing Liu, Mikhail Smelyanskiy, Edmond Chow, Pradeep Dubey. In Proceedings of the 2013 International Conference on Supercomputing, June 2013. (PDF)
Lattice QCD on Intel® Xeon Phi™ coprocessors. Balint Joo, Dhiraj D. Kalamkar, Karthikeyan Vaidyanathan, Mikhail Smelyanskiy, Kiran Pamnany, Victor W Lee, Pradeep Dubey, and William Watson III. In Proceedings of the 2013 International Supercomputing Conference, June 2013. (PDF)
Design and Implementation of the Linpack Benchmark for Single and Multi-Node Systems Based on Intel® Xeon Phi™ co-processor. Alexander Heinecke, Karthikeyan Vaidyanathan, Mikhail Smelyanskiy, Alexander Kobotov, Roman Dubtsov, Greg Henry, George Chrysos, Pradeep Dubey. In Proceedings of the 2013 IEEE International Parallel and Distributed Processing Symposium, May 2013. (PDF)
Exploring SIMD for Molecular Dynamics, Using Intel® Xeon Phi™. Simon Pennycook, Christopher J. Hughes, Mikhail Smelyanskiy. In Proceedings of the 2013 IEEE International Parallel and Distributed Processing Symposium, May 2013. (PDF)

2012

Analysis and Optimization of Financial Analytics Benchmark on Modern Multi- and Many-core IA-Based Architectures. Mikhail Smelyanskiy, Nikita Astafiev, Charles H. Finan, Jason Sewall, Ilya Burylov, Dhiraj D. Kalamkar, Andrey Nikolaev, Ekaterina Gonina, Nadathur Satish, Sergey Maidanov, Pradeep Dubey, Shuo Li, Sunil Kulkarni. In Workshop on High Performance Computational Finance, Nov 2012. (PDF)
Optimization of Geometric Multigrid for Emerging Multi- and Manycore Processors. Samuel W. Williams, Dhiraj D. Kalamkar, Amik Singh, Anand M. Deshpande, Brian Van Straalen, Mikhail Smelyanskiy, Ann Almgren, Pradeep Dubey, John Shalf, Leonid Oliker. In Proceedings of the 2012 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC12), Nov. 2012. (PDF)
Synthetic Aperture Radar Computation with Many-Core Processors. Jongsoo Park, Ping Tak Peter Tang, Mikhail Smelyanskiy, Daehyun Kim, Thomas Benson. In Proceedings of the 2012 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC12), Nov. 2012. (PDF)
Can Traditional Programming Bridge the Ninja Performance Gap for Parallel Computing Applications? Nadathur Satish, Changkyu Kim, Jatin Chhugani, Hideki Saito, Rakesh Krishnaiyer, Mikhail Smelyanskiy, Milind Girkar, and Pradeep Dubey. In Proceedings of the 2012 International Symposium on Computer Architecture (ISCA), June 2012. (PDF)
High Performance Non-uniform FFT on Modern x86-based Multi-core Systems. Dhiraj D. Kalamkar, Joshua D. Trzasko, Srinivas Sridharan, Mikhail Smelyanskiy, Daehyun Kim, Armando Manduca, Yunhong Shu, Matt A. Bernstein, Bharat Kaul, and Pradeep Dubey. In Proceedings of the 2012 IEEE International Parallel and Distributed Processing Symposium, May 2012. (PDF)
Improving the Performance of Dynamical Simulations Via Multiple Right-Hand Sides. Xing Liu, Edmond Chow, Karthikeyan Vaidyanathan, and Mikhail Smelyanskiy. In Proceedings of the 2012 IEEE International Parallel and Distributed Processing Symposium, May 2012. (PDF)

2011

High-performance Lattice QCD for Multi-core Based Parallel Systems Using a Cache-friendly Hybrid Threaded-MPI Approach. Mikhail Smelyanskiy, Karthikeyan Vaidyanathan, Jee Choi, Balint Joo, Jatin Chhugani, Michael A. Clark, Pradeep Dubey. In Proceedings of the 2011 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC11), Nov. 2011. (PDF)
Designing and Dynamically Load Balancing Hybrid LU for Multi/Many-core. Michael Deisher, Mikhail Smelyanskiy, Brian Nickerson, Victor W. Lee, Michael Chuvelev, Pradeep Dubey. In Proceedings of the 2011 International Supercomputing Conference, June 2011. (PDF)
High-Performance 3D Compressive Sensing MRI Reconstruction Using Many-Core Architectures. Daehyun Kim, Joshua Trzasko, Mikhail Smelyanskiy, Clifton Haider, Pradeep Dubey, Armando Manduca. In International Journal of Biomedical Imaging, Volume 2011, January 2011. (PDF)

2010

Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU. Victor W Lee, Changkyu Kim, Jatin Chhugani, Michael Deisher, Daehyun Kim, Anthony D. Nguyen, Nadathur Satish, Mikhail Smelyanskiy, Srinivas Chennupaty, Per Hammarlund, Ronak Singhal and Pradeep Dubey. In Proceedings of the 2010 International Symposium on Computer Architecture (ISCA), June 2010. (PDF)

2009

Mapping High-Fidelity Volume Rendering for Medical Imaging to CPU, GPU and Many-Core Architectures. Mikhail Smelyanskiy, David Holmes, Jatin Chhugani, Alan Larson, Douglas M. Carmean, Dennis Hanson, Pradeep Dubey, Kurt Augustine, Daehyun Kim, Alan Kyker, Victor W. Lee, Anthony D. Nguyen, Larry Seiler, Richard Robb. In IEEE Transactions on Visualization and Computer Graphics, Volume 15, Issue 6, November 2009, Pages 1563-1570. (PDF)

2008

An Algorithm for the Fast Solution of Symmetric Linear Complementarity Problems. Jose Luis Morales, Jorge Nocedal, Mikhail Smelyanskiy. In Numerische Mathematik, Volume 111, Issue 2, November 2008, Pages 251-266. (PDF)
Convergence of Recognition, Mining, and Synthesis Workloads and Its Implications. Yen-Kuang Chen, Jatin Chhugani, Pradeep Dubey, Christopher J. Hughes, Daehyun Kim, Sanjeev Kumar, Victor W. Lee, Anthony D. Nguyen, Mikhail Smelyanskiy. In Proceedings of the IEEE (Invited Paper), March 2008. (PDF)
Atomic Vector Operations on Chip Multiprocessors. Sanjeev Kumar, Daehyun Kim, Mikhail Smelyanskiy, Yen-Kuang Chen, Jatin Chhugani, Christopher J. Hughes, Changkyu Kim, Victor W. Lee, Anthony D. Nguyen. In Proceedings of the 2008 International Symposium on Computer Architecture (ISCA), June 2008. (PDF)

2007 & Older

High-Performance Physical Simulations on Next-Generation Architecture with Many Cores. Yen-Kuang Chen, Jatin Chhugani, Christopher J. Hughes, Daehyun Kim, Sanjeev Kumar, Victor Lee, Albert Lin, Anthony D. Nguyen, Eftychios Sifakis, Mikhail Smelyanskiy. Intel Technology Journal, August 2007. (PDF)
Scaling Performance of Interior-point Method on Large-scale Chip Multiprocessor System. Mikhail Smelyanskiy, Victor W. Lee, Daehyun Kim, Anthony D. Nguyen, Pradeep Dubey. In Proceedings of the 2007 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC07), Nov. 2007. (PDF)
Parallel Computing Large-Scale Optimization Problems: Challenges and Solutions. Mikhail Smelyanskiy, Stephen Skedzielewski, Carole Dulong. Intel Technology Journal, January 2006. (PDF)
Construction and Performance Characterization of Parallel Interior Point Solver on 4-way Intel Itanium Multiprocessor System. Pranay Koka, Taeweon Suh, Radek Grzeszczuk, Mikhail Smelyanskiy, Carole Dulong. In Proceedings of IEEE 7th Annual Workshop on Workload Characterization, October 25, 2004. (PDF)
Hardware/Software Mechanisms for Increasing Resource Utilization on VLIW/EPIC Processors. Mikhail Smelyanskiy. Ph.D. Dissertation, University of Michigan, 2004. (PDF)
Probabilistic Predicate-Aware Modulo Scheduling. Mikhail Smelyanskiy, Scott A. Mahlke, Edward S. Davidson. In Proceedings of the Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO), Palo Alto, California, 2004. (PDF)
Systematic Register Bypass Customization for Application-Specific Processors. K. Fan, N. Clark, M. Chu, K. V. Manjunath, R. Ravindran, Mikhail Smelyanskiy, S. Mahlke. In Proceedings of the IEEE 14th International Conference on Application-specific Systems, Architectures, and Processors, June 2003, pp. 64-74. (PDF)
Predicate-Aware Scheduling: A Technique for Reducing Resource Constraints. Mikhail Smelyanskiy, Scott A. Mahlke, Edward S. Davidson, and Hsien-Hsin S. Lee. In Proceedings of the Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO), San Francisco, California, 2003. (PDF)
Optimizing Memory Subsystem Designs by Exploring Region Reference Characteristics. Hsien-Hsin S. Lee, Chris J. Newburn, Mikhail Smelyanskiy, and Gary S. Tyson. In IEEE Transactions on Computers, 2002.
Evaluating the Use Of Register Queues in Software Pipelined Loops. Gary Tyson, Mikhail Smelyanskiy and Edward Davidson. In IEEE Transactions on Computers, Volume 50, Number 8, August 2001. (PDF)
Stack Value File: Custom Microarchitecture for the Stack. Hsien-Hsin Lee, Mikhail Smelyanskiy, Chris Newburn and Gary Tyson. In Proceedings of 7th International Symposium on High Performance Computer Architecture (HPCA-7), Jan. 2001. (PDF)
Register Queues: A New Hardware/Software Approach to Efficient Software Pipelining. Mikhail Smelyanskiy, Gary Tyson and Edward Davidson. In Proceedings of the 2000 International Conference on Parallel Architectures and Compilation Techniques (PACT'00), October 2000. (PDF)
Performance Optimization of an Integral Equation Code for Jet Engine Scattering on CRAY-C90. Mikhail Smelyanskiy, J.L. Volakis and E. S. Davidson. In Applied Computational Electromagnetics Society Journal (ACES), Volume 13, Number 2, Pages 116-130, 1998. (PDF)
Scattering from Relatively Flat Surfaces using the Adaptive Integral Method (AIM). Hristos T. Anastassiu, Mikhail Smelyanskiy, S. Bindiganavale and John L. Volakis. In Radio Science, Volume 33, Number 1, Pages 7-16, January-February 1998. (PDF)