コンテンツにスキップ

CV-CUDA の pillow resize を試す

CV-CUDA の pillow resize を試してみた。pillow, resizer, cvcuda の 3 で画像のリサイズを行い、処理時間を比較してみた。

CV-CUDA とは

CV-CUDA は、NVIDIA が提供する機械学習向けの画像処理およびコンピュータビジョン(CV)アプリケーション向けのオープンソースプロジェクト。C++、C、Python の API が公開されている。GPU アクセラレーションを活用することで、開発者は効率性の高い前処理・後処理パイプラインを構築可能。これにより、スループットを 10 倍以上向上させながら、クラウドコンピューティングコストを削減できる。バッチ処理や PyTorch へのゼロコピーにも対応。

多くのオペレーターがサポートされており、次のような機能が提供されている。

Pre/Post-Processing Operators Definition
Adaptive Thresholding Chooses threshold based on smaller regions in the neighborhood of each pixel.
Advanced Color Format Conversions Performs color conversion from interleaved RGB/BGR <-> YUV/YVU and semi planar. Supported standards: BT.601. BT.709. BT.2020
AverageBlur Reduces image noise using an average filter
BilateralFilter Reduces image noise while preserving strong edges
Bounding Box Draws an rectangular border using the X-Y coordinates and dimensions typically to define the location and size of an object in an image
Box Blurring Overlays a blurred rectangle using the X-Y coordinates and dimensions that define the location and size of an object in an image
Brightness_Contrast Adjusts brightness and contrast of an image
CenterCrop Crops an image at its center
ChannelReorder Shuffles the order of image channels
Color_Twist Adjusts the hue saturation brightness and contrast of an image
Composite Composites two images together
Conv2D Convolves an image with a provided kernel
CopyMakeBorder Creates a border around an image
CustomCrop Crops an image with a given region-of-interest
CvtColor Converts an image from one color space to another
DataTypeConvert Converts an image’s data type, with optional scaling
Erase Erases image regions
Flip Flips a 2D image around its axis
GammaContrast Adjusts image contrast
Gaussian Applies a gaussian blur filter to the image
Gaussian Noise Generates a statistical noise with a normal (Gaussian) distribution
Histogram Provides a grayscale value distribution showing the frequency of occurrence of each gray value.
Histogram Equalizer Allows effective spreading out the intensity range of the image typically used to improve contrast
HqResize Performs advanced resizing supporting 2D and 3D data, tensors, tensor batches, and varshape image batches (2D only). Supports nearest neighbor, linear, cubic, Gaussian and Lanczos interpolation, with optional antialiasing when down-sampling
Inpainting Performs inpainting by replacing a pixel by normalized weighted sum of all the known pixels in the neighborhood
Joint Bilateral Filter Reduces image noise while preserving strong edges based on a guidance image
Label Labels connected regions in an image using 4-way connectivity for foreground and 8-way for background pixels
Laplacian Applies a Laplace transform to an image
MedianBlur Reduces an image’s salt-and-pepper noise
MinArea Rect Finds the minimum area rotated rectangle typically used to draw bounding rectangle with minimum area
MinMaxLoc Finds the maximum and minimum values in a given array
Morphology Performs morphological erode and dilate transformations
Morphology (close) Performs morphological operation that involves dilation followed by erosion on an image
Morphology (open) Performs morphological operation that involves erosion followed by dilation on an image
Non-Maximum Suppression Enables selecting a single entity out of many overlapping ones typically used for selecting from multiple bounding boxes during object detection
Normalize Normalizes an image pixel’s range
OSD (Polyline Line Text Rotated Rect Segmented Mask) Displays an overlay on the image of different forms including polyline line text rotated rectangle segmented mask
PadStack Stacks several images into a tensor with border extension
PairwiseMatcher Matches features computed separately (e.g. via the SIFT operator) in two images, e.g. using the brute force method
PillowResize Changes the size and scale of an image using python-pillow algorithm
RandomResizedCrop Crops a random portion of an image and resizes it to a specified size.
Reformat Converts a planar image into non-planar and vice versa
Remap Maps pixels in an image with one projection to another projection in a new image.
Resize Changes the size and scale of an image
ResizeCropConvertReformat Performs fused Resize-Crop-Convert-Reformat sequence with optional channel reordering
Rotate Rotates a 2D array in multiples of 90 degrees
SIFT Identifies and describes features in images that are invariant to scale rotation and affine distortion.
Thresholding Chooses a global threshold value that is the same for all pixels across the image.
WarpAffine Applies an affine transformation to an image
WarpPerspective Applies a perspective transformation to an image

setup

github release にインストーラーや whl、圧縮ファイルが公開されている。これらを使って cv-cuda をインストールできる。 ここでは、Python 向けに whl からインストールしてみる。

pip install https://github.com/CVCUDA/CV-CUDA/releases/download/v0.12.0-beta/cvcuda_cu11-0.12.0b0-cp311-cp311-linux_x86_64.whl

resizer

cykooz.resizer は、画像のリサイズを行うための Python パッケージ。Rust crate の fast_image_resize を pyo3 を使ってバインディングしており、SIMD を活用することで高速に画像のリサイズを実現。

公式のベンチマークによると、Pillow を使ったリサイズに比べて、約 10 倍高速にリサイズできるとのこと。

benchmark

pillow, resizer, cvcuda の 3 で画像のリサイズを行い、処理時間を比較してみた。ベンチマークで利用したコードはこちら にある。

環境

 uv run python -m torch.utils.collect_env
<frozen runpy>:128: RuntimeWarning: 'torch.utils.collect_env' found in sys.modules after import of package 'torch.utils', but prior to execution of 'torch.utils.collect_env'; this may result in unpredictable behaviour
Collecting environment information...
PyTorch version: 2.5.0+cu124
Is debug build: False
CUDA used to build PyTorch: 12.4
ROCM used to build PyTorch: N/A

OS: Debian GNU/Linux 11 (bullseye) (x86_64)
GCC version: (Debian 10.2.1-6) 10.2.1 20210110
Clang version: Could not collect
CMake version: version 3.30.5
Libc version: glibc-2.31

Python version: 3.11.8 (main, Feb 25 2024, 04:18:18) [Clang 17.0.6 ] (64-bit runtime)
Python platform: Linux-5.10.0-33-cloud-amd64-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 11.8.89
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA L4
Nvidia driver version: 550.90.07
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Byte Order:                           Little Endian
Address sizes:                        46 bits physical, 48 bits virtual
CPU(s):                               4
On-line CPU(s) list:                  0-3
Thread(s) per core:                   2
Core(s) per socket:                   2
Socket(s):                            1
NUMA node(s):                         1
Vendor ID:                            GenuineIntel
CPU family:                           6
Model:                                85
Model name:                           Intel(R) Xeon(R) CPU @ 2.20GHz
Stepping:                             7
CPU MHz:                              2200.200
BogoMIPS:                             4400.40
Hypervisor vendor:                    KVM
Virtualization type:                  full
L1d cache:                            64 KiB
L1i cache:                            64 KiB
L2 cache:                             2 MiB
L3 cache:                             38.5 MiB
NUMA node0 CPU(s):                    0-3
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Mitigation; Enhanced IBRS
Vulnerability Spec rstack overflow:   Not affected
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Enhanced / Automatic IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves arat avx512_vnni md_clear arch_capabilities

Versions of relevant libraries:
[pip3] numpy==2.1.2
[pip3] torch==2.5.0
[pip3] triton==3.1.0
[conda] Could not collect

ベンチマーク結果

| Package (time in ms)       | linear | lanczos3 | nearest | bilinear |
| :------------------------- | -----: | -------: | ------: | -------: |
| cvcuda tensor              |   0.20 |     0.20 |         |          |
| cvcuda image               |  63.30 |    82.52 |         |          |
| cykooz.resizer             |        |   129.97 |    0.61 |    66.19 |
| cykooz.resizer - sse4_1    |        |    59.76 |    0.60 |    32.02 |
| cykooz.resizer - avx2      |        |    39.11 |    0.59 |    22.27 |
| Pillow U8                  |        |   110.90 |    0.37 |    36.86 |
| cykooz.resizer U8          |        |    29.10 |    0.32 |    13.76 |
| cykooz.resizer U8 - sse4_1 |        |    14.31 |    0.32 |     5.52 |
| cykooz.resizer U8 - avx2   |        |    10.95 |    0.31 |     5.18 |
| cvcuda tensor u8           |   0.18 |     0.18 |         |          |

========================================================================================================================== warnings summary ==========================================================================================================================

---------------------------------------------------------------------------------------------------------------- benchmark: 27 tests ----------------------------------------------------------------------------------------------------------------
Name (time in us) Min Max Mean StdDev Median IQR Outliers OPS Rounds Iterations

---

test_resize_pillow_cuda_from_tensor_u8[lanczos3] 173.6020 (1.0) 192.1580 (1.02) 179.7351 (1.0) 6.2501 (1.37) 178.2000 (1.0) 7.6561 (1.17) 2;0 5,563.7432 (1.0) 10 1
test_resize_pillow_cuda_from_tensor_u8[linear] 174.5070 (1.01) 187.8510 (1.0) 181.7212 (1.01) 4.5635 (1.0) 182.2715 (1.02) 7.5760 (1.15) 4;0 5,502.9353 (0.99) 10 1
test_resize_pillow_cuda_from_tensor[lanczos3] 188.5100 (1.09) 216.9610 (1.15) 202.3906 (1.13) 8.4848 (1.86) 203.6245 (1.14) 6.5600 (1.0) 3;2 4,940.9410 (0.89) 10 1
test_resize_pillow_cuda_from_tensor[linear] 190.0240 (1.09) 217.7470 (1.16) 201.4807 (1.12) 10.2350 (2.24) 198.1965 (1.11) 16.4100 (2.50) 3;0 4,963.2542 (0.89) 10 1
test_resize_pil_u8[nearest-sse4_1] 305.1700 (1.76) 331.2740 (1.76) 315.1435 (1.75) 8.9028 (1.95) 313.2385 (1.76) 15.0120 (2.29) 5;0 3,173.1577 (0.57) 10 1
test_resize_pil_u8[nearest-avx2] 305.3460 (1.76) 335.4510 (1.79) 314.2647 (1.75) 9.6423 (2.11) 311.7565 (1.75) 11.6270 (1.77) 1;0 3,182.0309 (0.57) 10 1
test_resize_pil_u8[nearest-none] 306.3540 (1.76) 329.9180 (1.76) 315.1583 (1.75) 7.4469 (1.63) 315.8170 (1.77) 11.8100 (1.80) 3;0 3,173.0085 (0.57) 10 1
test_resize_pillow_u8[nearest] 342.3130 (1.97) 462.2980 (2.46) 373.5955 (2.08) 25.9545 (5.69) 372.1065 (2.09) 30.2370 (4.61) 2;1 2,676.6917 (0.48) 20 1
test_resize_pil[nearest-avx2] 571.8220 (3.29) 623.0310 (3.32) 592.3528 (3.30) 19.1698 (4.20) 586.4250 (3.29) 33.5600 (5.12) 3;0 1,688.1831 (0.30) 10 1
test_resize_pil[nearest-none] 573.4670 (3.30) 666.8610 (3.55) 613.7530 (3.41) 32.3207 (7.08) 614.2825 (3.45) 63.5920 (9.69) 4;0 1,629.3199 (0.29) 10 1
test_resize_pil[nearest-sse4_1] 577.2750 (3.33) 627.7571 (3.34) 601.5625 (3.35) 14.1852 (3.11) 598.7860 (3.36) 16.3680 (2.50) 3;0 1,662.3377 (0.30) 10 1
test_resize_pil_u8[bilinear-avx2] 5,068.6960 (29.20) 5,235.8611 (27.87) 5,183.3686 (28.84) 55.7600 (12.22) 5,205.9475 (29.21) 79.4120 (12.11) 2;0 192.9247 (0.03) 10 1
test_resize_pil_u8[bilinear-sse4_1] 5,458.7240 (31.44) 5,662.0630 (30.14) 5,515.0039 (30.68) 64.8785 (14.22) 5,494.8675 (30.84) 66.8520 (10.19) 2;1 181.3235 (0.03) 10 1
test_resize_pil_u8[lanczos3-avx2] 10,822.3110 (62.34) 11,085.7960 (59.01) 10,951.2957 (60.93) 100.6304 (22.05) 11,008.9955 (61.78) 169.9650 (25.91) 4;0 91.3134 (0.02) 10 1
test_resize_pil_u8[bilinear-none] 13,625.6840 (78.49) 13,980.6130 (74.42) 13,760.6970 (76.56) 109.0469 (23.90) 13,743.1150 (77.12) 133.0720 (20.29) 3;0 72.6707 (0.01) 10 1
test_resize_pil_u8[lanczos3-sse4_1] 14,233.6680 (81.99) 14,402.0880 (76.67) 14,309.5769 (79.61) 62.2959 (13.65) 14,308.7715 (80.30) 117.5370 (17.92) 3;0 69.8833 (0.01) 10 1
test_resize_pil[bilinear-avx2] 22,089.9410 (127.24) 22,445.1730 (119.48) 22,265.3275 (123.88) 122.6805 (26.88) 22,258.6520 (124.91) 169.7281 (25.87) 4;0 44.9129 (0.01) 10 1
test_resize_pil_u8[lanczos3-none] 28,824.6590 (166.04) 29,534.0520 (157.22) 29,099.9586 (161.90) 226.9470 (49.73) 29,057.6185 (163.06) 396.1090 (60.38) 3;0 34.3643 (0.01) 10 1
test_resize_pil[bilinear-sse4_1] 31,886.6240 (183.68) 32,202.6170 (171.43) 32,017.2611 (178.14) 106.0368 (23.24) 32,024.5760 (179.71) 120.6109 (18.39) 4;0 31.2332 (0.01) 10 1
test_resize_pillow_u8[bilinear] 36,539.7750 (210.48) 37,342.5850 (198.79) 36,856.4690 (205.06) 164.7203 (36.10) 36,813.2995 (206.58) 163.1790 (24.87) 3;1 27.1323 (0.00) 20 1
test_resize_pil[lanczos3-avx2] 38,816.1780 (223.59) 39,684.5700 (211.26) 39,105.1625 (217.57) 303.5146 (66.51) 38,987.6770 (218.79) 545.0870 (83.09) 2;0 25.5721 (0.00) 10 1
test_resize_pil[lanczos3-sse4_1] 59,359.1760 (341.93) 60,751.9260 (323.40) 59,760.2506 (332.49) 437.0312 (95.77) 59,556.2115 (334.21) 541.5740 (82.56) 1;0 16.7335 (0.00) 10 1
test_resize_pillow_cuda_from_pil_image[lanczos3] 62,972.8130 (362.74) 102,222.9920 (544.17) 82,517.5321 (459.11) 20,432.0820 (>1000.0) 82,440.0220 (462.63) 38,770.0250 (>1000.0) 0;0 12.1186 (0.00) 10 1
test_resize_pillow_cuda_from_pil_image[linear] 63,065.8410 (363.28) 63,715.7960 (339.18) 63,295.8951 (352.16) 219.4429 (48.09) 63,260.5600 (355.00) 267.2580 (40.74) 3;0 15.7988 (0.00) 10 1
test_resize_pil[bilinear-none] 65,483.8140 (377.21) 66,719.9630 (355.17) 66,188.0271 (368.25) 401.2182 (87.92) 66,202.7770 (371.51) 382.2331 (58.27) 4;0 15.1085 (0.00) 10 1
test_resize_pillow_u8[lanczos3] 110,284.3650 (635.27) 113,495.0040 (604.18) 110,897.7991 (617.01) 682.1171 (149.47) 110,707.5655 (621.25) 265.5400 (40.48) 1;3 9.0173 (0.00) 20 1
test_resize_pil[lanczos3-none] 129,275.4009 (744.67) 132,935.1900 (707.66) 129,974.7837 (723.15) 1,096.4253 (240.26) 129,550.4260 (726.99) 797.2610 (121.53) 1;1 7.6938 (0.00) 10 1

---

結果を見るに、tensor まで事前に変換しておくと、cv-cuda の処理時間が短いことがわかる。Pillow 形式の画像を cv-cuda で処理しようとすると変換に時間がかかるため、処理時間が長くなる。resizer は、Pillow に比べて高速にリサイズでき、Pillow 形式で処理をするなら cv-cuda を使わずに resizer で行っても良さそう。今回は cv-cuda のバッチ処理を試していないが、バッチ処理を行うとさらに高速に処理できると思われる。CPU, メモリ, GPU 間のデータ転送によるオーバーヘッドがあるため、これを効率良くできれば全体的なスループットを改善できるのではないか。