cuBERT

Fast implementation of BERT inference directly on NVIDIA (CUDA, CUBLAS) and Intel MKL

Stars: 512
Forks: 84
Language: C++
Last Updated: Mar 22, 2024

Similar Repos

Repo	Language	Stars	Description	Updated At
libcumatrix	C++	23	GPU Matrix Library - A CUDA-based C++ wrapper and syntax sugars for NVIDIA CUBLAS	May 10, 2023
CfMatting_cuda_mkl	C++	16	A cuda & mkl implementation of closed-form matting	May 09, 2023
cudaBERT	Python	89	A Fast Multi-processing BERT-Inference System	Aug 28, 2022
ByteTransformer	C++	94	optimized BERT transformer inference on NVIDIA GPU. https://arxiv.org/abs/2210.03052	Apr 18, 2023
cuTAGI	C++	8	CUDA implementation of Tractable Approximate Gaussian Inference	Mar 08, 2023
CudaSift	Cuda	7	A CUDA implementation of SIFT for NVidia GPUs	Apr 18, 2023
mkl-corrode	Rust	5	A lightweight and pleasant Rust wrapper for Intel MKL	Mar 16, 2023
Zen	Cuda	15	optimized realtime harmonic/percussive source separation using the GPU (NVIDIA CUDA) and CPU (Intel IPP)	Mar 13, 2023
PyCUDA-Raster	TeX	26	Opensource GIS Tool leveraging NVIDIA CUDA and pyCuda for fast Raster Analysis	Mar 27, 2023
Parallel-NW-Implementation	C++	2	Parallel implementation of NW algorithms with NVIDIA GPU and CUDA C++	Mar 11, 2023
cuda-md5	C++	14	Old NVIDIA CUDA implementation of salted MD5 brute-force	Apr 26, 2022
eaminer	C++	2	Heterogeneous Ethereum Miner with support for AMD, Intel and Nvidia GPUs using SYCL, OpenCL and …	Apr 15, 2022
kmcuda	Jupyter Notebook	2	Large scale K-means and K-nn implementation on NVIDIA GPU / CUDA	Jul 20, 2022
kmcuda	Jupyter Notebook	734	Large scale K-means and K-nn implementation on NVIDIA GPU / CUDA	Apr 27, 2023
DiamondLump	C#	4	Simple MLP / CNN / RNN / LSTM neural-networks implementation in csharp using Intel-MKL-Library.	Mar 04, 2023
LowRankSVDCodes	C	3	RSVDPACK: Implementations of fast algorithms for computing the low rank SVD, interpolative and CUR decompositions …	Mar 25, 2022
LowRankMatrixDecompositionCodes	C	78	RSVDPACK: Implementations of fast algorithms for computing the low rank SVD, interpolative and CUR decompositions …	Feb 18, 2023
MTCNN_FaceDetection_TensorRT	C++	203	MTCNN C++ implementation with NVIDIA TensorRT Inference accelerator SDK	Apr 30, 2023
QR-decomposition-Benchmark	C	3	Comparison of Serial Modified Gram-Schmidt, Householder (multi-core CPU MKL) and Givens (GPU CuBlas) QR-decomposition	Apr 22, 2023
Ambiguity-function-CUDA	Cuda	13	This is super fast ambiguity function created for NVIDIA cards with CUDA technology.	Apr 24, 2023
PABEE	Python	57	Code for the paper "BERT Loses Patience: Fast and Robust Inference with Early Exit".	May 01, 2023
oclpc	C	5	OpenCL Precompiler for nVidia, Intel and AMD platforms	Jan 29, 2019
nvtop	None	3	GPUs process monitoring for AMD, Intel and NVIDIA	Jul 21, 2023
cuda	Elixir	4	NVIDIA GPU CUDA library bindings for Erlang and Elixir.	Dec 16, 2021
CUDA-Install-Guide	None	2	Installation guide for NVIDIA driver, CUDA, cuDNN and TensorRT	Nov 03, 2021
cublas-test	C++	2	I wanted to try cublas but it seems to be over 600 times slower than …	Dec 04, 2022
Optimus-Manager-Indicator	JavaScript	11	Intel/Hybrid/NVIDIA GPU Switch and show GPU status	Jun 04, 2022
isaac_ros_dnn_inference	Python	36	Hardware-accelerated DNN model inference ROS2 packages using NVIDIA Triton/TensorRT for both Jetson and x86_64 with …	Jun 20, 2022
oolon-cudnn_neuralnetwork_frame	Cuda	2	This is a neural network frame, implemented through C++ and CUDA libraries, including cuDNN, cuBLAS, …	Feb 14, 2022
cuda-docs-switcher	JavaScript	5	Access NVIDIA CUDA documentation and switch between versions quickly & easily	Mar 28, 2023
waifu2x-ncnn-vulkan	C	2183	waifu2x converter ncnn version, runs fast on intel / amd / nvidia / apple-silicon GPU …	Aug 15, 2022
PoWER-BERT	Python	47	Method to improve inference time for BERT. This is an implementation of the paper titled …	Apr 12, 2022
ShadowRePlay-Linux	Shell	124	Shadowplay's Replay Feature On Linux For Nvidia, AMD and Intel	Apr 30, 2023
TurboTransformers	C++	1204	a fast and user-friendly runtime for transformer inference (Bert, Albert, GPT2, Decoders, etc) on CPU …	Aug 11, 2022
quicktree	C	3	Fast implementation of the neighbour-joining phylogenetic inference method	Aug 31, 2020
keras_bert_classification	Python	89	Bert-classification and bert-dssm implementation with keras.	Jun 28, 2022
realcugan-ncnn-vulkan	C	372	real-cugan converter ncnn version, runs fast on intel / amd / nvidia / apple-silicon GPU …	Aug 14, 2022
spbla	C++	9	Sparse Boolean linear algebra for Nvidia Cuda, OpenCL and CPU computations	Jul 03, 2022
cuda	C++	11	Greyscale image using NVIDIA CUDA 5 Toolkit and OpenCV in C++	Sep 04, 2021
auto-nvidia-cuda-driver	Shell	2	Installation script to install Nvidia driver and CUDA automatically in Ubuntu	Jan 24, 2023
GPU_Overlap-and-save_convolution	Cuda	11	Shared memory overlap-and-save method for NVIDIA GPUs using CUDA	Nov 25, 2022
intel-mkl-docker	Dockerfile	2	Docker image that has intel-mkl installed. Provides debian:buster and ubuntu:bionic based images on docker hub.	Nov 30, 2022
linux_power_opt	None	2	Linux power optimization tutorial for Nvidia, Intel and Ubuntu based distributions.	Jan 18, 2022
budgie-nvidia-switcher	Python	4	Switch between Intel and Nvidia graphics easily with a Budgie applet	May 05, 2020
gpucfr	Cuda	12	GPUCFR is a parallel implementation of Counterfactual Regret Minimization (CFR) in C++ and CUDA C …	Apr 05, 2023
scientific-python-2.7	Shell	2	scientific python 2.7 with intel MKL built into virtual machines. Docker, AMI, Virtualbox, Vmware, KVM …	Feb 22, 2020
fast-charuco	Python	3	Fast inference of deepcharuco model using onnx and improved inference setup	Jan 24, 2024
kiss_rng	Cuda	2	Fast random number generator for C++ and CUDA	Oct 17, 2022
bert-infer	Python	2	BERT Inference on CPU with Torch, ONNX Runtime, OpenVINO, and TVM.	Mar 06, 2023
BLAS-Tester	FORTRAN	28	a tester for BLAS libraries including OpenBLAS and Intel MKL. This project is based on …	Jun 12, 2021