Registration Notes

These notes are for CS5240.

Last time we discussed rigid and nonrigid registration and their methods:

[su_table]

Rigid: similarity transformation, ICP
Nonrigid: affine transformation, nonrigid ICP

[/su_table]

The methods above are approximation methods. Now we discuss an interpolation method.

Thin Plate Spline

How to get TPS?

[su_custom_gallery source="media: 243" limit="19" target="blank" width="800" height="480"]

Minimizing bending energy! TPS maps [latex]p_i[/latex] to [latex]q_i[/latex] exactly.

Consider the jth component [latex]v_{ij}[/latex] of [latex]q_i[/latex]. TPS maps [latex]p_i[/latex] to [latex]v_{ij}[/latex] via a function with [latex]f(p_i)=v_{ij}[/latex] that minimizes the bending energy, denoted [latex]E_d(f)[/latex].

The bending energy functional has two parameters: d, the dimension of the points, and m, the order of the derivatives it penalizes.

Finally, the function f that minimizes the bending energy takes the form

[latex]f(x) = a^T\hat{x}+\sum_{i=1}^{n} w_iU(\lVert x-p_i\rVert)[/latex]

Here [latex]\hat{x}[/latex] is [latex]x[/latex] in homogeneous coordinates, [latex]a[/latex] holds the affine parameters, the [latex]w_i[/latex] are the weights, and [latex]U(r)[/latex] is a radial basis function of the distance [latex]r[/latex]; for [latex]d = 2[/latex] and [latex]m = 2[/latex], [latex]U(r) = r^2 \log r[/latex].
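As a small sketch (my own illustration, not from the lecture; the names tps_fit and tps_eval are made up), the interpolation conditions f(p_i) = v_ij together with the standard side conditions on the weights can be solved as one linear system, using the d = 2, m = 2 kernel U(r) = r² log r:

```python
import numpy as np

def tps_fit(points, values):
    """Fit a 2-D thin plate spline with f(p_i) = v_i exactly.

    points : (n, 2) array of control points p_i
    values : (n,) array, one component v_ij of the targets q_i
    Returns the weights w (n,) and affine parameters a (3,).
    """
    n = len(points)
    # U(r) = r^2 log r, with U(0) = 0
    r = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    K = r**2 * np.log(r, where=r > 0, out=np.zeros_like(r))
    P = np.hstack([np.ones((n, 1)), points])   # homogeneous coordinates
    # Interpolation + side conditions:  K w + P a = v,  P^T w = 0
    A = np.block([[K, P], [P.T, np.zeros((3, 3))]])
    b = np.concatenate([values, np.zeros(3)])
    sol = np.linalg.solve(A, b)
    return sol[:n], sol[n:]

def tps_eval(x, points, w, a):
    """Evaluate f(x) = a^T x_hat + sum_i w_i U(||x - p_i||)."""
    r = np.linalg.norm(points - x, axis=1)
    U = r**2 * np.log(r, where=r > 0, out=np.zeros_like(r))
    return a[0] + a[1:] @ x + w @ U
```

Running tps_fit on distinct, non-collinear control points gives a spline that reproduces each target value exactly, which is the interpolation property above.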

 

 

Categories: Uncategorized

Summary of the graph processing bottlenecks paper


The paper addresses the performance bottlenecks of graph processing on GPUs.

1. INTRODUCTION

graph model: Bulk Synchronous Parallel (BSP), vertex-centric

(superstep: 1. Concurrent computation, 2. Communication, 3. Barrier synchronisation)
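The three-phase superstep loop above can be sketched as a minimal vertex-centric BFS (my own illustration of the BSP model, not code from the paper; bsp_bfs and its dictionary-based message passing are assumptions):

```python
def bsp_bfs(adj, source):
    """Vertex-centric BFS in the BSP style: compute, communicate, barrier."""
    dist = {v: float('inf') for v in adj}
    messages = {source: 0}                 # messages delivered to vertices
    while messages:
        next_messages = {}
        # 1. concurrent computation: each vertex processes its inbox
        for v, d in messages.items():
            if d < dist[v]:
                dist[v] = d
                # 2. communication: send candidate distances to neighbours
                for u in adj[v]:
                    next_messages[u] = min(next_messages.get(u, float('inf')), d + 1)
        # 3. barrier synchronisation: the next superstep sees all messages at once
        messages = next_messages
    return dist
```

Each iteration of the while loop is one superstep; no message sent in a superstep is visible until the barrier at its end.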

GPUs use the SIMT (Single Instruction, Multiple Threads) parallel execution model

12 graph applications & non-graph applications

CUDA

platform: cycle-accurate simulator & NVIDIA GPU

tools: performance counters and software simulators

2. BACKGROUND

CPU-GPU heterogeneous architecture

GPU: massively parallel processing

CPU: organizes work and invokes application kernel functions

CPU & GPU are connected via PCI-E

3. METHODOLOGY

graph & non-graph algorithms

CUDA

4. RESULT AND ANALYSIS

A. Kernel execution pattern

a. The average number of kernel invocations is much higher (nearly an order of magnitude) in graph applications than in non-graph applications.

b. The amount of computation done per kernel invocation is significantly smaller in graph applications than in non-graph applications.

c. Short messages incur long latencies over PCI-E, and graph applications interact with the CPU more frequently.

d. The total time spent on PCI-E transfers is higher in graph applications.

e. Graph applications transfer only a small amount of data in each PCI-E transfer.

B. Performance bottlenecks

a. Long memory latency is the biggest bottleneck, causing over 70% of all pipeline stalls (bubbles) in graph applications.

b. Graph applications suffer from high cache miss rates.

C. SRAM resource sensitivity

a. register file: most effectively leveraged

b. shared memory:

If data is not reused enough, moving it from global memory to shared memory actually costs more than it saves, so shared memory is used less.

c. constant memory: developers are less inclined to use it

d. L1&L2 cache: L1 cache is entirely ineffective for graph processing

reason: graph applications spend much of their time on memory transfers between CPU and GPU, while non-graph applications actively use shared memory

D. SIMT lane utilization

The number of iterations executed by each SIMT lane varies with the degree of each vertex, so SIMT lane utilization varies significantly in graph applications.

E. Execution frequency of instruction types

The execution time differences between graph and non-graph applications are not influenced by the instruction mix.

F. Coarse and fine-grain load balancing

a. coarse-grain load balancing:

(i) number of CTAs assigned to each SM

SM-level imbalance depends on input size and program characteristics

assume m SMs, with a maximum of n CTAs per SM

CTAs (assigned round-robin by default)

>m*n: balanced, for two reasons

(1) a higher likelihood of assigning a similar number of CTAs to each SM

(2) large inputs lead to more CTAs, so the likelihood of balanced CTA assignments per SM also increases

=m*n: perfect balance

<m*n: uneven

b. fine-grain load balancing:

(i) execution time difference across CTAs

an opposing force to achieving balance

Large input sizes increase the execution-time variation

Applications that exhibit more warp divergence also have high execution-time variance at the CTA level.

(ii) execution time variance across warps within a CTA

(coefficient of variation σ/μ, where σ is the standard deviation and μ is the average execution time)

Execution time variation for warps within CTAs is not high.

G. Scheduler Sensitivity

three scheduler strategies: GTO (greedy-then-oldest), 2LV (two-level), LRR (loose round-robin)

Due to poor memory performance and divergence issues, graph applications have significantly lower IPC than non-graph applications.

5. DISCUSSION

A. Performance bottleneck

PCI-E calls and long-latency memory operations, which can be addressed by:

a. unified system memory

b. actively leveraging underutilized SRAM structures such as caches and shared memory

(data prefetching)

B. Load imbalance

Coarse-grain load distribution:

When the input is large enough, it is well balanced.

Fine-grain load distribution:

determined by the longest-running warp; addressing it requires programmer effort

6. RELATED WORK

Other works have investigated GPU performance, similar to this paper.

7. CONCLUSION

how graph applications interact with GPU microarchitectural features

non-graph applications are used as a comparison baseline

Categories: Uncategorized

wineqq install instructions


First look at this page: https://phpcj.org/wineqq/

Download the package offered on that page.

Extract the package following the instructions on that page.

Then the important steps:

  • Run wine-QQ once; Wine will automatically install Mono and other components
  • Download simsun.ttc (download address)
  • Copy simsun.ttc to ~/.wine/drive_c/windows/Fonts
  • Edit ~/.wine/system.reg

change

"MS Shell Dlg"="Tahoma"
"MS Shell Dlg 2"="Tahoma"

to

"MS Shell Dlg"="SimSun"
"MS Shell Dlg 2"="SimSun"

  • Create zh.reg and insert the following:

REGEDIT4
[HKEY_LOCAL_MACHINE\Software\Microsoft\Windows NT\CurrentVersion\FontSubstitutes]
"Arial"="simsun"
"Arial CE,238"="simsun"
"Arial CYR,204"="simsun"
"Arial Greek,161"="simsun"
"Arial TUR,162"="simsun"
"Courier New"="simsun"
"Courier New CE,238"="simsun"
"Courier New CYR,204"="simsun"
"Courier New Greek,161"="simsun"
"Courier New TUR,162"="simsun"
"FixedSys"="simsun"
"Helv"="simsun"
"Helvetica"="simsun"
"MS Sans Serif"="simsun"
"MS Shell Dlg"="simsun"
"MS Shell Dlg 2"="simsun"
"System"="simsun"
"Tahoma"="simsun"
"Times"="simsun"
"Times New Roman CE,238"="simsun"
"Times New Roman CYR,204"="simsun"
"Times New Roman Greek,161"="simsun"
"Times New Roman TUR,162"="simsun"
"Tms Rmn"="simsun"

  • Run the command: regedit zh.reg
  • Run wine-QQ again

Done!

Categories: Uncategorized

copy problem in Python


Look at the code below:

>>> v = [0.5, 0.75, 1.0, 1.5, 2.0]
>>> m = [v, v, v]
>>> v[0] = 'Python'
>>> m
[['Python', 0.75, 1.0, 1.5, 2.0], ['Python', 0.75, 1.0, 1.5, 2.0], ['Python', 0.75, 1.0, 1.5, 2.0]]
>>> from copy import deepcopy
>>> v = [0.5, 0.75, 1.0, 1.5, 2.0]
>>> m = 3*[deepcopy(v), ]
>>> v[0] = 'Python'
>>> m
[[0.5, 0.75, 1.0, 1.5, 2.0], [0.5, 0.75, 1.0, 1.5, 2.0], [0.5, 0.75, 1.0, 1.5, 2.0]]
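One subtlety worth noting: 3*[deepcopy(v)] shields m from later changes to v, but list multiplication copies v only once and then repeats a reference to that one copy, so the three rows still alias each other. A sketch of the behaviours (my own illustration):

```python
from copy import deepcopy

v = [0.5, 0.75, 1.0]
m = [v, v, v]                      # three references to the SAME list object
assert all(row is v for row in m)

# 3 * [deepcopy(v)] copies v once, then repeats that ONE copy three times
m = 3 * [deepcopy(v)]
assert m[0] is m[1]                # the rows still alias each other
m[0][0] = 'changed'
assert m[1][0] == 'changed'        # mutating one row changes all of them

# One independent copy per row:
m = [list(v) for _ in range(3)]
v[0] = 'Python'
assert m[0] == [0.5, 0.75, 1.0]    # unaffected by the change to v
m[0][0] = 'row0'
assert m[1][0] == 0.5              # and the rows are independent of each other
```

A list comprehension that copies v once per row is the simplest way to get fully independent rows.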

Categories: Uncategorized

March log (continuously updated)


A record of my daily life in March 2016, starting from the 24th.

24th: Continued looking at the scikit-learn Python library, continued studying material from Introduction to Machine Learning and some linear algebra, and watched part of the Coursera course: gradient descent and the normal equation.

25th: Computer organization; wrote a blog post (unfinished).

26th: Assembly language + machine learning.

27th: Machine learning, memorizing vocabulary, watching sentdex's image-processing videos, writing the assembly report.

Categories: Uncategorized