Jianyu Huang

Research Scientist
Meta Platforms, Inc.
1 Hacker Way, Menlo Park, CA 94025
Email: jianyu0huang [AT] gmail [DOT] com

I am a Research Scientist at Meta. My interests lie in improving the performance of machine learning applications and simplifying the programming model for parallel computing. Before joining Meta, I received my PhD in Computer Science from UT Austin, advised by Prof. Robert van de Geijn.

If you want to know more about performance optimizations for matrix multiplication (one of the most important building blocks of deep learning), you might be interested in the How to Optimize GEMM guide and BLISlab. If you want to learn about practical implementations of fast matrix multiplication algorithms such as Strassen's algorithm, you might be interested in my thesis.
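As a small illustration of the idea behind Strassen's algorithm, here is a minimal pure-Python sketch of a single level of the classic 2×2 block recursion. It assumes square matrices of even dimension, and the helper names (`naive_gemm`, `split`, `strassen_one_level`) are hypothetical, chosen only for this example. It is a teaching sketch, not the optimized implementations described in the publications below.

```python
def add(A, B):
    # Element-wise sum of two equally sized matrices (lists of rows).
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def sub(A, B):
    # Element-wise difference of two equally sized matrices.
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def naive_gemm(A, B):
    # Classical triple-loop C = A * B; the i-p-j order keeps the
    # innermost loop walking along a row of B and a row of C.
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    for i in range(n):
        for p in range(k):
            a = A[i][p]
            for j in range(m):
                C[i][j] += a * B[p][j]
    return C


def split(M):
    # Partition a square matrix of even size into four equal quadrants.
    h = len(M) // 2
    return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
            [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])


def strassen_one_level(A, B):
    # One level of Strassen: 7 half-size products instead of 8.
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    M1 = naive_gemm(add(A11, A22), add(B11, B22))
    M2 = naive_gemm(add(A21, A22), B11)
    M3 = naive_gemm(A11, sub(B12, B22))
    M4 = naive_gemm(A22, sub(B21, B11))
    M5 = naive_gemm(add(A11, A12), B22)
    M6 = naive_gemm(sub(A21, A11), add(B11, B12))
    M7 = naive_gemm(sub(A12, A22), add(B21, B22))
    # Recombine the seven products into the four quadrants of C.
    C11 = add(sub(add(M1, M4), M5), M7)   # M1 + M4 - M5 + M7
    C12 = add(M3, M5)                     # M3 + M5
    C21 = add(M2, M4)                     # M2 + M4
    C22 = add(add(sub(M1, M2), M3), M6)   # M1 - M2 + M3 + M6
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]
    bottom = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bottom


if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    B = [[5, 6], [7, 8]]
    print(strassen_one_level(A, B))  # [[19, 22], [43, 50]]
```

Each level replaces one of eight half-size multiplications with extra matrix additions, which cuts the multiplication work to roughly 7/8 per level but adds memory traffic; balancing that trade-off is what makes practical implementations challenging.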


Professional Experience

Open Source Projects

Publications


    Context Parallelism for Scalable Million-Token Inference [PDF] [BibTex]
    Amy (Jie) Yang, Jingyi Yang, Aya Ibrahim, Xinfeng Xie, Bangsheng Tang, Grigory Sizov, Jeremy Reizenstein, Jongsoo Park, Jianyu Huang. November 2024.

    The Llama 3 Herd of Models [PDF] [BibTex]
    Llama Team, AI @ Meta.
    in the Llama 3.1 release, July 2024.

    AdaEmbed: Adaptive Embedding for Large-Scale Recommendation Models [PDF] [BibTex]
    Fan Lai, Wei Zhang, Rui Liu, William Tsai, Xiaohan Wei, Yuxi Hu, Sabin Devkota, Jianyu Huang, Jongsoo Park, Xing Liu, Zeliang Chen, Ellie Wen, Paul Rivera, Jie You, Jason Chen, Mosharaf Chowdhury.
    in the 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23), Boston, MA, July 2023.

    Software-Hardware Co-design for Fast and Scalable Training of Deep Learning Recommendation Models [PDF] [BibTex]
    Dheevatsa Mudigere*, Yuchen Hao*, Jianyu Huang*, Zhihao Jia, Andrew Tulloch, Srinivas Sridharan, Xing Liu, Mustafa Ozdal, Jade Nie, Jongsoo Park, Liang Luo, Leon Gao, Dmytro Ivchenko, Aarti Basant, Yuxi Hu, Jiyan Yang, Ehsan K Ardestani, Xiaodong Wang, Rakesh Komuravelli, Ching-Hsiang Chu, Serhat Yilmaz, Huayu Li, Jiyuan Qian, Zhuobo Feng, Yinbin Ma, Junjie Yang, Ellie Wen, Hong Li, Lin Yang, Chonglin Sun, Whitney Zhao, Krishna Dhulipala, KR Kishore, Tyler Graf, Assaf Eisenman, Kiran Kumar Matam, Adi Gangidi, Pallab Bhattacharya, Guoqiang Jerry Chen, Manoj Krishnan, Krishnakumar Nair, Petr Lapukhov, Maxim Naumov, Lin Qiao, Mikhail Smelyanskiy, Bill Jia, Vijay Rao. (* denotes equal contribution.)
    in ACM International Symposium on Computer Architecture (ISCA 22), New York City, June 2022.

    Efficient Soft-Error Detection for Low-precision Deep Learning Recommendation Models [PDF] [BibTex]
    Sihuan Li, Jianyu Huang, Ping Tak Peter Tang, Daya Khudia, Jongsoo Park, Harish Dattatraya Dixit, Zizhong Chen.
    in IEEE International Conference on Big Data (BigData), pages 1556-1563, March 2022.

    Low-Precision Hardware Architectures Meet Recommendation Model Inference at Scale [PDF] [BibTex]
    Zhaoxia Summer Deng, Jongsoo Park, Ping Tak Peter Tang, Haixin Liu, Jie Yang, Hector Yuen, Jianyu Huang, Daya S Khudia, Xiaohan Wei, Ellie Wen, Dhruv Choudhary, Raghuraman Krishnamoorthi, Carole-Jean Wu, Nadathur Satish, Changkyu Kim, Maxim Naumov, Sam Naghshineh, Misha Smelyanskiy.
    in IEEE Micro, Volume 41, Issue 5, Sept.-Oct. 2021.

    Mixed-Precision Embedding Using a Cache [PDF] [BibTex]
    Jie (Amy) Yang*, Jianyu Huang*, Jongsoo Park, Ping Tak Peter Tang, Andrew Tulloch. October 2020. (* denotes equal contribution.)

    Strassen's Algorithm Reloaded on GPUs [PDF] [BibTex]
    Jianyu Huang, Chenhan D. Yu, Robert A. van de Geijn
    in ACM Transactions on Mathematical Software (TOMS), Article No.: 1, March 2020.

    Deep Learning Recommendation Model for Personalization and Recommendation Systems [PDF] [BibTex]
    Maxim Naumov, Dheevatsa Mudigere, Hao-Jun Michael Shi, Jianyu Huang, Narayanan Sundaraman, Jongsoo Park, Xiaodong Wang, Udit Gupta, Carole-Jean Wu, Alisson G. Azzolini, Dmytro Dzhulgakov, Andrey Mallevich, Ilia Cherniavskii, Yinghai Lu, Raghuraman Krishnamoorthi, Ansha Yu, Volodymyr Kondratenko, Stephanie Pereira, Xianjie Chen, Wenlin Chen, Vijay Rao, Bill Jia, Liang Xiong, Misha Smelyanskiy.

    A Study of BFLOAT16 for Deep Learning Training [PDF] [BibTex]
    Dhiraj Kalamkar, Dheevatsa Mudigere, Naveen Mellempudi, Dipankar Das, Kunal Banerjee, Sasikanth Avancha, Dharma Teja Vooturi, Nataraj Jammalamadaka, Jianyu Huang, Hector Yuen, Jiyan Yang, Jongsoo Park, Alexander Heinecke, Evangelos Georganas, Sudarshan Srinivasan, Abhisek Kundu, Misha Smelyanskiy, Bharat Kaul, Pradeep Dubey.

    FBGEMM: Enabling High-Performance Low-Precision Deep Learning Inference [PDF] [BibTex]
    Daya Khudia, Jianyu Huang, Protonu Basu, Summer Deng, Haixin Liu, Jongsoo Park, Mikhail Smelyanskiy
    in HPCaML 2019.

    Practical Fast Matrix Multiplication Algorithms [PDF] [BibTex]
    Jianyu Huang
    PhD thesis, The University of Texas at Austin, 2018.

    Implementing Strassen’s Algorithm with CUTLASS on NVIDIA Volta GPUs [PDF] [BibTex]
    Jianyu Huang, Chenhan D. Yu, Robert A. van de Geijn
    FLAME Working Note #88, The University of Texas at Austin, Department of Computer Science. Technical Report TR-18-08. August 23, 2018.

    Learning from Optimizing Matrix-Matrix Multiplication [PDF] [BibTex]
    Devangi N. Parikh, Jianyu Huang, Margaret E. Myers, Robert A. van de Geijn
    in 8th NSF/TCPP Workshop on Parallel and Distributed Computing Education (EduPar-18), co-located with IPDPS18, Vancouver, British Columbia, Canada, 2018.

    Strassen's Algorithm for Tensor Contraction [PDF] [BibTex]
    Jianyu Huang, Devin A. Matthews, Robert A. van de Geijn
    in SIAM Journal on Scientific Computing (SISC), 40(3):C305-C326, 2018.

    Lowering Barriers into HPC through Open Education [PDF] [BibTex]
    Robert A. van de Geijn, Jianyu Huang, Margaret E. Myers, Devangi N. Parikh, Tyler M. Smith
    in Workshop on Education for High Performance Computing (EduHPC), co-located with SC17, Denver, CO, November 2017.

    Generating Families of Practical Fast Matrix Multiplication Algorithms [PDF] [BibTex] [Code] [Artifact] [PPTX]
    Jianyu Huang, Leslie Rice, Devin A. Matthews, Robert A. van de Geijn
    in 31st IEEE International Parallel and Distributed Processing Symposium (IPDPS17), Orlando, FL, May 29-June 2, 2017.

    Strassen's Algorithm Reloaded [PDF] [BibTex] [Code] [PPTX]
    Jianyu Huang, Tyler M. Smith, Greg M. Henry, Robert A. van de Geijn
    in The International Conference for High Performance Computing, Networking, Storage and Analysis (SC16), Salt Lake City, UT, November 2016.

    BLISlab: A Sandbox for Optimizing GEMM [PDF] [BibTex] [Code]
    Jianyu Huang, Robert A. van de Geijn
    FLAME Working Note #80, The University of Texas at Austin, Department of Computer Science. Technical Report TR-16-13. August 31, 2016.

    Performance Optimization for the K-Nearest Neighbors Kernel on x86 Architectures [PDF] [BibTex] [Code]
    Chenhan D. Yu, Jianyu Huang, Woody Austin, Bo Xiao, George Biros
    in The International Conference for High Performance Computing, Networking, Storage and Analysis (SC15), Austin, TX, November 2015.

Posters


    INT4 Decoding GQA CUDA Optimizations for LLM Inference (with Sarunya Pumma, Jongsoo Park, Amy Yang, Jaewon Lee, Daniel Haziza, Grigory Sizov, Jeremy Reizenstein, Jeff Johnson, Ying Zhang) [Blog]
    in PyTorch Technical Blog, 2024.

    Introduction to Quantization on PyTorch [Blog]
    in PyTorch Technical Blog, 2020.

    Dynamic Quantization on BERT [Tutorial]
    in PyTorch Tutorials, 2019.

    Strassen's Algorithm for Tensor Contraction [PDF] (with Devin A. Matthews and Robert A. van de Geijn)
    in The International Conference for High Performance Computing, Networking, Storage and Analysis (SC17), Denver, CO, November 2017.

    High-performance Primitives for Machine Learning Targeting Mobile Platforms (with Chenhan D. Yu)
    in Qualcomm Fellowship Finalist Presentation, San Diego, CA, March 2016.

Presentations


    Strassen's Algorithm for Tensor Contraction [PPTX,PDF]
    in BLIS Retreat 2017, Austin, TX, September 2017.

    Strassen's Algorithm for Tensor Contraction [PPTX,PDF]
    in Tensor Computation Workshop, New York City, NY, September 2017.

    Generating Families of Practical Fast Matrix Multiplication Algorithms [PPTX,PDF]
    in IPDPS17, Orlando, FL, May 31st, 2017.

    Strassen's Algorithm Reloaded [PPTX,PDF]
    in SC16, Salt Lake City, UT, November 16th, 2016.

    Implementing Strassen-like Fast Matrix Multiplication Algorithms with BLIS [PPTX,PDF] (with Leslie Rice)
    in BLIS Retreat 2016, Austin, TX, September 2016.

    High-performance Primitives for Machine Learning Targeting Mobile Platforms (with Chenhan D. Yu)
    in Qualcomm Fellowship Finalist Presentation, San Diego, CA, March 2016.

    Adding Efficient Scheduling Policy into SuperMatrix on Heterogeneous Platforms [PPTX,PDF]
    in BLIS Retreat 2015, Austin, TX, September 2015.

Services

I have served as a program committee member and reviewer for the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC 20), SIAM Journal on Scientific Computing (SISC), IEEE Transactions on Parallel and Distributed Systems (TPDS), ACM Transactions on Architecture and Code Optimization (TACO), ACM Transactions on Parallel Computing, IEEE Access, The Journal of Supercomputing (Springer), the International Conference on Parallel Architectures and Compilation Techniques (PACT 2019), and the International Conference on Parallel Processing (ICPP 2019, ICPP 2020, and ICPP 2021).

Teaching

Interests

I like bicycling, swimming, jogging, reading and traveling. There is a famous Chinese proverb: Walk ten thousand miles; Read ten thousand books.