Summary of the paper on GPU graph processing bottlenecks
1. INTRODUCTION
graph model: Bulk Synchronous Parallel model, vertex-centric
(superstep: 1. Concurrent computation, 2. Communication, 3. Barrier synchronisation)
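A minimal sketch of how the vertex-centric BSP pattern typically maps onto CUDA (all names are illustrative, not from the paper): each kernel launch is one superstep's computation phase, neighbor values read from global memory stand in for the communication phase, and the synchronization after the launch acts as the barrier.

    #include <cuda_runtime.h>

    // One superstep's concurrent-computation phase: each thread owns one
    // vertex and reads its neighbors' values from the previous superstep
    // (CSR arrays row_ptr/col_idx are illustrative).
    __global__ void superstep(const int *row_ptr, const int *col_idx,
                              const float *val_in, float *val_out, int n) {
        int v = blockIdx.x * blockDim.x + threadIdx.x;
        if (v >= n) return;
        float acc = 0.0f;
        for (int e = row_ptr[v]; e < row_ptr[v + 1]; ++e)
            acc += val_in[col_idx[e]];      // "communication" via global memory
        val_out[v] = acc;
    }

    // Host driver: one kernel launch per superstep; the synchronization
    // plus double-buffer swap acts as the barrier between supersteps.
    void run_bsp(const int *d_row_ptr, const int *d_col_idx,
                 float *d_a, float *d_b, int n, int supersteps) {
        int threads = 256, blocks = (n + threads - 1) / threads;
        for (int s = 0; s < supersteps; ++s) {
            superstep<<<blocks, threads>>>(d_row_ptr, d_col_idx, d_a, d_b, n);
            cudaDeviceSynchronize();        // barrier synchronisation
            float *t = d_a; d_a = d_b; d_b = t;
        }
    }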
GPUs use the SIMT (Single Instruction, Multiple Threads) parallel execution model
workloads: 12 graph applications plus non-graph applications for comparison
CUDA
platform: a cycle-accurate simulator and real NVIDIA GPU hardware
tools: hardware performance counters and software simulators
2. BACKGROUND
CPU–GPU heterogeneous architecture
GPU: massively parallel processing
CPU: organizes data and invokes the application's kernel functions
CPU and GPU are connected via PCIe (sketched below)
3. METHODOLOGY
graph & non-graph algorithms
CUDA
4. RESULT AND ANALYSIS
A. Kernel execution pattern
a. The average number of kernel invocations is nearly an order of magnitude higher in graph applications than in non-graph applications.
b. The amount of computation done per kernel invocation is significantly smaller in graph applications than in non-graph applications.
c. Even short messages incur long latencies over PCIe, and graph applications interact with the CPU more frequently.
d. The total time spent on PCIe transfers is higher in graph applications.
e. Graph applications transfer only a small amount of data in each PCIe transfer (see the sketch after this list).
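A hedged sketch of the execution pattern behind points a–e (all names illustrative; the real per-vertex work is elided): an iterative graph algorithm launches one short kernel per round and ships a tiny convergence flag across PCIe every time.

    #include <cuda_runtime.h>

    // Stand-in for one round of edge relaxations; the changed-flag protocol
    // is the typical one even though the per-vertex work here is a placeholder.
    __global__ void relax(float *val, int *changed, int n) {
        int v = blockIdx.x * blockDim.x + threadIdx.x;
        if (v >= n) return;
        if (val[v] > 1.0f) {            // placeholder for a real relaxation test
            val[v] *= 0.5f;
            atomicExch(changed, 1);     // another iteration is needed
        }
    }

    // Iterative driver: one short kernel per iteration, plus a 4-byte
    // host<->device copy of the flag each time -- many small PCIe transfers,
    // matching observations c and e above.
    void converge(float *d_val, int n) {
        int *d_changed, h_changed = 1;
        cudaMalloc(&d_changed, sizeof(int));
        while (h_changed) {
            h_changed = 0;
            cudaMemcpy(d_changed, &h_changed, sizeof(int), cudaMemcpyHostToDevice);
            relax<<<(n + 255) / 256, 256>>>(d_val, d_changed, n);
            cudaMemcpy(&h_changed, d_changed, sizeof(int), cudaMemcpyDeviceToHost);
        }
        cudaFree(d_changed);
    }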
B. Performance bottlenecks
a. Long memory latency is the biggest bottleneck, causing over 70% of all pipeline stalls (bubbles) in graph applications.
b. Graph applications suffer from high cache miss rates (contrasted in the sketch below).
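To make point (b) concrete, a minimal illustrative contrast between the streaming access typical of non-graph kernels and the data-dependent gather typical of graph kernels (all names are made up):

    // Non-graph shape: coalesced, cache- and prefetch-friendly.
    __global__ void regular(const float *a, float *b, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) b[i] = a[i];          // adjacent lanes hit adjacent addresses
    }

    // Graph shape: the address depends on loaded data, so accesses scatter
    // across memory, rarely coalesce, and mostly miss in L1/L2; the dependent
    // load must complete before use, producing the long stalls noted above.
    __global__ void irregular(const float *a, const int *idx, float *b, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) b[i] = a[idx[i]];     // data-dependent gather
    }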
C. SRAM resource sensitivity
a. register file: most effectively leveraged
b. shared memory:
without enough data reuse, moving data from global memory into shared memory costs more than it saves, so graph applications use it less (see the sketch after this list)
c. constant memory: developers are less inclined to use it
d. L1 & L2 cache: the L1 cache is entirely ineffective for graph processing
reason: in graph applications, data movement is dominated by CPU–GPU transfers; in non-graph applications, shared memory is actively used instead
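For contrast, a hedged sketch of the kind of kernel where shared-memory staging does pay off: a 3-point average reads every staged element three times, whereas graph kernels rarely see that much reuse, which is why point (b) above holds. Assumes n is a multiple of the block size (256); all names are illustrative.

    // Each element is loaded from global memory once, then read from shared
    // memory three times -- the reuse that justifies the staging copy.
    __global__ void smooth3(const float *in, float *out, int n) {
        __shared__ float tile[256 + 2];           // block tile plus halo cells
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int t = threadIdx.x + 1;
        tile[t] = in[i];                          // one global load per element...
        if (threadIdx.x == 0)
            tile[0] = (i > 0) ? in[i - 1] : in[i];
        if (threadIdx.x == blockDim.x - 1)
            tile[t + 1] = (i + 1 < n) ? in[i + 1] : in[i];
        __syncthreads();
        out[i] = (tile[t - 1] + tile[t] + tile[t + 1]) / 3.0f;  // ...three shared reads
    }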
D. SIMT lane utilization
The number of loop iterations executed by each SIMT lane varies with the degree of the vertex it processes, so SIMT lane utilization varies significantly in graph applications (sketched below).
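A minimal sketch of the effect (CSR names illustrative): the per-vertex loop's trip count equals the vertex degree, so lanes in the same warp finish at different times and the early finishers sit idle.

    // Each lane loops degree(v) times; lanes owning low-degree vertices idle
    // until the warp's highest-degree lane is done, so lane utilization
    // tracks the degree distribution of the input graph.
    __global__ void per_vertex(const int *row_ptr, const int *col_idx,
                               float *out, int n) {
        int v = blockIdx.x * blockDim.x + threadIdx.x;
        if (v >= n) return;
        float acc = 0.0f;
        for (int e = row_ptr[v]; e < row_ptr[v + 1]; ++e)  // trip count = degree(v)
            acc += (float)col_idx[e];                      // placeholder per-edge work
        out[v] = acc;
    }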
E. Execution frequency of instruction types
The execution time differences between graph and non-graph applications are not explained by differences in instruction mix.
F. Coarse and fine-grain load balancing
a. coarse-grain load balancing:
(i) number of CTAs assigned to each SM
SM-level imbalance depends on input size and program characteristics.
Assume m SMs, with a maximum of n resident CTAs per SM.
CTAs are assigned round-robin by default. Number of CTAs:
>m*n: tends toward balance, for two reasons (see the worked example below):
(1) round-robin assignment makes it likely that each SM receives a similar number of CTAs
(2) large inputs produce more CTAs, which further increases the likelihood of balanced CTA assignments per SM
=m*n: perfect balance
<m*n: uneven assignment
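A worked example with made-up numbers: take m = 16 SMs and n = 8 resident CTAs per SM, so m*n = 128. A grid of 200 CTAs (> m*n) round-robins to 12 or 13 CTAs per SM, a skew of at most one CTA (about 8%); a grid of exactly 128 CTAs gives 8 per SM, perfect balance; a grid of 40 CTAs (< m*n) leaves 8 SMs with 3 CTAs and 8 SMs with 2, a 50% relative imbalance.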
b. fine-grain load balancing:
(i) execution time difference across CTAs
execution-time variation across CTAs is the opposing force to achieving balance
a large input size increases the execution time variation
applications that exhibit more warp divergence also show higher execution time variance at the CTA level
(ii) execution time variance across warps within a CTA
(metric: coefficient of variation σ/μ, where σ is the standard deviation and μ the average execution time)
Execution time variation for warps within CTAs is not high.
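Illustrative arithmetic (numbers made up): warps averaging μ = 100 µs with σ = 5 µs give σ/μ = 0.05, i.e. only 5% relative variation.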
G. Scheduler Sensitivity
three scheduler strategies: GTO (greedy-then-oldest), 2LV (two-level), LRR (loose round-robin)
Due to poor memory performance and divergence issues, graph applications have significantly lower IPC than non-graph applications.
5. DISCUSSION
A. Performance bottleneck
PCIe transfers and long-latency memory operations; possible remedies:
a. unified system memory (sketched after this list)
b. actively leverage the underutilized SRAM structures such as caches and shared memory
(e.g., via data prefetching)
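A hedged sketch of remedy (a) using CUDA managed (unified) memory, which removes explicit PCIe copies from the programming model; the prefetch call gestures at remedy (b)'s latency-hiding idea. Names and sizes are illustrative, not from the paper.

    #include <cuda_runtime.h>

    __global__ void touch(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] += 1.0f;
    }

    int main(void) {
        const int n = 1 << 20;
        float *x;
        int dev = 0;
        cudaMallocManaged(&x, n * sizeof(float)); // one pointer, visible to CPU and GPU
        for (int i = 0; i < n; ++i) x[i] = 0.0f;  // CPU initializes in place, no memcpy

        cudaGetDevice(&dev);
        cudaMemPrefetchAsync(x, n * sizeof(float), dev, 0); // hide migration latency
        touch<<<(n + 255) / 256, 256>>>(x, n);
        cudaDeviceSynchronize();                  // x[] is now readable on the CPU again

        cudaFree(x);
        return 0;
    }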
B. Load imbalance
Coarse-grain load distribution:
with a large enough input, the distribution is well balanced.
Fine-grain load distribution:
determined by the longest-running warp; mitigating it requires programmer effort.
6. RELATED WORK
other studies have investigated GPU performance characterization, similar to this paper's work
7. CONCLUSION
characterized how graph applications interact with GPU microarchitectural features
non-graph applications serve as the comparison baseline