BladeDISC Introduction

Overview

BladeDISC is an end-to-end DynamIc Shape Compiler project for machine
learning workloads, and one of the key components of Alibaba's PAI-Blade.
BladeDISC provides general, transparent, and easy-to-use performance
optimization for TensorFlow/PyTorch workloads on GPGPU and CPU backends.
The architecture natively supports dynamic shape workloads, with careful
consideration of performance in both static and dynamic shape scenarios.
It also supports multiple flexible deployment solutions, including both a
Plugin Mode inside the TensorFlow/PyTorch runtime and a Standalone Mode
for AOT standalone execution. The project is based on MLIR and closely
related to the mlir-hlo project.
Refer to our website for more information, including the setup tutorial,
developer guide, demo examples, and documents for developers.
Features and Roadmap

Frontend Framework Support Matrix

             TensorFlow [1]    PyTorch [2]
Inference    Yes               Yes
Training     Yes [3]           Ongoing
[1] TensorFlow 1.12, 1.15, 2.4 & 2.5 are supported and fully verified. For
other versions, some slight adaptation work may be needed.
[2] PyTorch versions with 1.6.0 <= version < 1.9.0 have been fully verified.
[3] Although supported, there is still much room for improvement in Op
coverage for training workloads.
Backend Support Matrix

              Status
Nvidia GPU    Yes
AMD GPU       Ongoing
Hygon DCU     Yes
X86           Yes
AArch64       Yes
Deployment Solutions

Plugin Mode - BladeDISC works as a plugin of TensorFlow or PyTorch. Only
the supported Ops are clustered and compiled, and the unsupported ones are
executed by the original TensorFlow or PyTorch runtime (a conceptual
sketch of this clustering follows below). We recommend this mode to most
users for its transparency and ease of use.

Standalone Mode - In Standalone Mode, the input workload is compiled into
a binary that can be executed by itself, i.e., it does not rely on a
TensorFlow or PyTorch runtime. In this mode all Ops must be supported.
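To make the Plugin Mode behavior concrete, here is a minimal, purely
illustrative sketch (not BladeDISC's actual API): runs of
compiler-supported ops become compiled clusters, while everything else
falls back to the framework runtime.

# Illustrative sketch only: partition a linear sequence of ops into
# compilable clusters and framework-fallback ops. BladeDISC's real
# clustering operates on graphs, not sequences, and is far more involved.
def partition(ops, is_supported):
    clusters, fallback, current = [], [], []
    for op in ops:
        if is_supported(op):
            current.append(op)            # extend the current cluster
        else:
            if current:
                clusters.append(current)  # close the cluster so far
                current = []
            fallback.append(op)           # run this op in TF/PyTorch
    if current:
        clusters.append(current)
    return clusters, fallback

# Example: "custom_op" stands in for an op the compiler cannot handle.
clusters, fallback = partition(
    ["matmul", "add", "relu", "custom_op", "matmul"],
    is_supported=lambda op: op != "custom_op",
)
print(clusters)   # [['matmul', 'add', 'relu'], ['matmul']]
print(fallback)   # ['custom_op']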
Numbers of Typical Workloads

Evaluated on a set of typical machine learning workloads used in
production, BladeDISC shows up to an 8.66x speedup compared with
TensorFlow/PyTorch. Moreover, compared to static optimizing compilers
(namely XLA and TensorRT), DISC shows comparable or even better
performance.

Fig.1 Performance speedup over the framework. "Framework" means either
TensorFlow or PyTorch. FastSpeech2 is a TensorFlow model; the others are
PyTorch models. The static compiler for TensorFlow is XLA and that for
PyTorch is TensorRT. Note that S2T and T5 have no TensorRT numbers because
TensorRT produces wrong results on them.
Advantage in Dynamic Shape Workloads

Specifically, for the BERT large inference on T4 provided in the examples,
static compiler optimization (XLA) shows severe performance degradation
due to its compilation overhead, while DISC shows a 1.75x speedup (from
1.78 s down to 1.02 s; see the table below).
           TensorFlow    XLA        DISC
Latency    1.78 s        41.69 s    1.02 s
Speedup    1X            -          1.75X
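Why does a shape-specialized compiler degrade here? Every distinct input
shape triggers a fresh compilation. The hypothetical benchmark loop below
(the sequence lengths and helper are our own illustration, not taken from
the BladeDISC examples) shows the traffic pattern that causes this: each
new sequence length is a new shape, so a static compiler like XLA pays
compilation cost over and over, while a dynamic shape compiler compiles
once and reuses the same program.

# Hypothetical benchmark sketch: feed inputs whose sequence length varies
# per request, as in real serving traffic. Under XLA each new length means
# a recompile; under DISC one compiled program handles all lengths.
import time
import numpy as np

def bench(run_fn, seq_lens=(32, 48, 64, 100, 128), iters=10):
    start = time.time()
    for _ in range(iters):
        for length in seq_lens:
            token_ids = np.random.randint(
                0, 30000, size=(1, length), dtype=np.int64)
            run_fn(token_ids)  # each distinct length is a distinct shape
    return time.time() - start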
API QuickView

For TensorFlow Users

Only two lines of code are needed on a native TensorFlow program, as
follows:

import numpy as np
import tensorflow as tf

## enable BladeDISC on TensorFlow program
import blade_disc_tf as disc
disc.enable()

## construct TensorFlow Graph and run it
g = tf.Graph()
with g.as_default():
    ...
    with tf.Session() as sess:
        sess.run(...)

For more information, please refer to the QuickStart for TensorFlow Users.
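As a concrete, hypothetical filling of the elided graph above (the
computation and shapes are our own illustration, using the TF1-style API
of the verified 1.12/1.15 releases):

import numpy as np
import tensorflow as tf
import blade_disc_tf as disc  # assumes the BladeDISC TF wheel is installed

disc.enable()

# A small graph with dynamically shaped inputs: reduce_sum((x + y) * 2).
g = tf.Graph()
with g.as_default():
    x = tf.placeholder(tf.float32, shape=[None, None])
    y = tf.placeholder(tf.float32, shape=[None, None])
    out = tf.reduce_sum((x + y) * 2.0)

with tf.Session(graph=g) as sess:
    a = np.random.rand(4, 8).astype(np.float32)
    b = np.random.rand(4, 8).astype(np.float32)
    print(sess.run(out, feed_dict={x: a, y: b}))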
For PyTorch Users

PyTorch users only need the following few lines of code to enable
BladeDISC:

import torch
import torch.nn as nn
import torch_blade

# construct PyTorch Module
class MyModule(nn.Module):
    ...

module = MyModule()

with torch.no_grad():
    # blade_module is the module optimized by BladeDISC
    blade_module = torch_blade.optimize(module, allow_tracing=True,
                                        model_inputs=(x, y))

# run the optimized module
blade_module(x, y)
torch_blade.optimize accepts an nn.Module object and returns the optimized
module. For more information, please refer to the Quickstart for PyTorch
Users.
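As a quick, hypothetical end-to-end check (the toy module and tolerance
below are our own illustration, assuming a working torch_blade install),
the optimized module should be a drop-in replacement producing numerically
close outputs:

import torch
import torch.nn as nn
import torch_blade

class AddMul(nn.Module):  # toy module, not from the BladeDISC docs
    def forward(self, x, y):
        return (x + y) * 0.5

module = AddMul().eval()
x, y = torch.randn(4, 16), torch.randn(4, 16)

with torch.no_grad():
    blade_module = torch_blade.optimize(module, allow_tracing=True,
                                        model_inputs=(x, y))
    # the optimized module is called exactly like the original one
    print(torch.allclose(module(x, y), blade_module(x, y), atol=1e-3))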
Setup and Examples

Publications

Tutorials and Documents for Developers

Presentations and Talks

How to Contribute

FAQ

Roadmap with mlir-hlo Project

BladeDISC has a close relationship with the mlir-hlo project. Part of its
building blocks, including the MHLO Op definitions, TF-to-MHLO
conversions, and some general-purpose passes, have been upstreamed to the
mlir-hlo repository. We will continue to work in close cooperation with
the mlir-hlo project over the longer term.

Contact Us
Mailgroup: [email protected]
DingTalk group for support and discussion: