Records when constructing the scheduler

    1. How does Resnet code in tensorflow/models be distributed?

    High API Estimator

    distributed_strategy in utils/misc

    2. How to make codes in distributed?

    Refer Tensorflow tutorial:

    Set a distributed strategy and scope including model construction and model compile


    VGG uses data augmentation which is in conflict with distribution!

    In Keras tutorial, if we use fit_generator method, then we will meet this error:

    fit_generator` is not supported for models compiled with tf.distribute.strategy.

    Our Tensorflow version is 1.14

    ImageDataGenerator tutorial code

    If we use ‘manual’ example in the official tutorial, then the training will become wield:

    Use single GPU this is the std output:

    Above is normal(though different from using fit_generator). Below is the distributed version using mirror strategy, it’s abnormal:

    Distributed version stuck in the first epoch and the loss is high for a long time.


    This issue suggests using to deal with the generator.

    Categories: 未分类


    • Outline
    • Scheduling
    • Components:

    Decides Server order

    manage queue

    • Why do we need one?
    • What can scheduling disciplines do?
    • Requirements of a scheduling discipline
    • Ease of implementation
    • Fairness

    Fairness is global, scheduling(congestion avoidance) is local.

    • Notion of Fairness
    • Fundamental choices

    Work-conserving and non-work-conserving

    Degree of aggregation

    • Scheduling disciplines

    FIFO & other disciplines(SPT, SRPT), the performance among them

    (SRPT process the remaining time, which means if I’m processing a package which still needs 5 min, then comes a package which only need 1 min, then I go to process the new package)


    • The Conservation Law

    scheduling is independent of the packet service time

    [latex]\sum ρ_iq_i=constant[/latex]

    [latex]ρ_i[/latex] mean utilization of connection i and [latex]q_i[/latex] mean waiting time of connection i

    The average delay with FIFO is a tight lower bound for
    work conserving and service time independent scheduling

    • Fairness

    Jain’s index use equal share as the


    • Max-Min Fairness
    • General Process Sharing (GPS)

    Conceptually, GPS serves packets as if they are
    in separate logical queues, visiting each nonempty
    queues in turn.

    Generalized processor sharing assumes that traffic is fluid (infinitesimal packet sizes), and can be arbitrarily split.

    How to emulate GPS as fair as possible, also efficient

    • (Weighted) round robin

    Different weights, fixed packet size

    Different weights, variable size packets:normalize weights by mean packet size


    1. With variable size packets and different weights,
      need to know mean packet size in advance
    2. Can be unfair for long periods of time
    Categories: 未分类

    Registration Notes

    This note is for CS5240

    Last time we discussed Rigid & Nonrigid and their methods:


    Rigid Nonrigid
    similarity transformation affine transformation
    ICP nonrigid ICP


    Methods below are approximation. Now we discuss interpolation.

    Thin Plate Spline

    How to get TPS?

    [su_custom_gallery source=”media: 243″ limit=”19″ target=”blank” width=”800″ height=”480″]

    Minimizing bending energy!TPS maps [latex]p_i[/latex] to [latex]q_i[/latex] exactly.

    Consider jth component [latex]v_{ij}[/latex] of [latex]q_i[/latex], TPS maps [latex]p_i[/latex] to [latex]v_{ij}[/latex] by [latex]f(p_i)=v_{ij}[/latex] which minimize bending energy denoted as [latex]E_d(f)[/latex].

    Bending energy function takes two parameters the first is d(the dimension of the point), the second is m, which denotes order-m derivatives.

    Finally the function f that minimize the Bending energy takes the form

    [latex]f(x’) = a^Tx’+\sum_{i=1}^{n} w_iU(||x-p_i||)[/latex]

    a are affine parameters. w are weights. U(r) is increasing function of distance r.



    Categories: 未分类

    summary for the graph processing bottlenecks paper

    No Comments

    The paper address


    graph model: Bulk Synchronous Parallel model, vertex-centric

    (superstep: 1. Concurrent computation, 2. Communication, 3. Barrier synchronisation)

    GPU use SIMT(Single Instruction Multiple Threads) parallel execution model

    12 graph applications & non-graph applications


    platform: cycle-accurate simulator & NVIDIA GPU

    tools: performance counters and software simulators


    CPU-GPU heterogeneous structure

    GPU: massive parallel processing

    CPU: organize and invoke application kernel function

    CPU & GPU connect with PCI-E


    graph & non-graph algorithms



    A. Kernel execution pattern

    a. The average number of kernel invocations is much (nearly an order magnitude) higher in graph applications than non-graph applications.

    b. The amount of computation done per each kernel invocations is significantly smaller in graph applications than non-graph applications.

    c. Short messages require long latencies over PCI and graph applications interact with CPU more frequently.

    d. The total time spent on PCI transfers is higher in graph applications

    e. Graph applications only transfer smaller amount of data in each PCI transfer.

    B. Performance bottlenecks

    a. Long memory latency is the biggest bottleneck that causes over 70% of all pipeline stalls(bubble) in graph applications.

    b. Graph applications suffer from high cache miss rates.

    C. SRAM resource sensitivity

    a. register file: most effectively leveraged

    b. shared memory:

    If there’s not enough reuse of data then moving data from global memory to shared memory actually consume more. So shared memory is used less

    c. constant memory: developers are less inclined to use it

    d. L1&L2 cache: L1 cache is entirely ineffective for graph processing

    reason: In graph applications, memory transfer between CPU and GPU; In non-graph applications, shared memory is actively used

    D. SIMT lane utilization

    The number of iterations executed by each SIMT lane varies as the degree of each vertex varies. Thus the SIMT lane utilization varies significantly in graph applications.

    E. Execution frequency of instruction types

    The execution time differences between graph and non-graph applications are not influenced by the instruction mix.

    F. Coarse and fine-grain load balancing

    a. coarse-grain load balancing:

    (i) number of CTAs assigned to each SM

    SM level imbalance depends on input size and program characteristisc

    assume m SMs, maximum n CTAs for each SM

    CTAs (default round-robin)

    >m*n: two reasons balancing

    (1) higher likehood to assign similar number of CTAs per SM

    (2) Large inputs lead to more CTAs and hence the likehood of balancing CTA assignments per SM also increse

    =m*n: perfact balance

    <m*n: unevenly

    b. fine-grain load balancing:

    (i) execution time difference across CTAs

    opposing force to achieving balance

    Large input size increases the execution time variation

    applications that exhibit more warp divergence also have high execution time variance at the CTA level.

    (ii) execution time variance across warps within a CTA


    σ: standard deviation

    μ: average execution time)

    Execution time variation for warps within CTAs is not high.

    G. Scheduler Sensitivity

    three scheduler strategies: GTO, 2LV, LRR

    Due to poor memory performance and divergence issues, graph applications have significantly lower IPC than non-graph applications.


    A. Performance bottleneck

    PCI calls and Long latency memory operation, solved by:

    a. unified system memory

    b. actively leverage the underutilized SRAM structures such as cache and shared memory

    (data prefetching)

    B. Load imbalance

    Coarse-grain load distribution:

    Input large enough, well-balanced.

    Fine-grain load distribution:

    determined by the longest warp execution, solved by:

    programmer’s effort


    others investigated performance, similar to the paper’s work


    how GPU interact with microarchitectural features

    set non-graph applications as comparison

    Categories: 未分类

    wineqq install instructions

    No Comments

    First look at this page:

    download the package offered by that page

    extract the package followed by instructions on that page

    Then important things:

    • Run wine-QQ once, wine will auto install mono or something else
    • Download simsun.ttc (download address)
    • copy simsun.ttc to ~/.wine/drive_c/windows/Fonts
    • edit ~/.wine/system.reg


    “MS Shell Dlg”=”Tahoma”
    “MS Shell Dlg 2″=”Tahoma”


    “MS Shell Dlg”=”SimSun”
    “MS Shell Dlg 2″=”SimSun”

    • create zh.reg, insert these:

    [HKEY_LOCAL_MACHINE\Software\Microsoft\Windows NT\CurrentVersion\FontSubstitutes]
    “Arial CE,238″=”simsun”
    “Arial CYR,204″=”simsun”
    “Arial Greek,161″=”simsun”
    “Arial TUR,162″=”simsun”
    “Courier New”=”simsun”
    “Courier New CE,238″=”simsun”
    “Courier New CYR,204″=”simsun”
    “Courier New Greek,161″=”simsun”
    “Courier New TUR,162″=”simsun”
    “MS Sans Serif”=”simsun”
    “MS Shell Dlg”=”simsun”
    “MS Shell Dlg 2″=”simsun”
    “Times New Roman CE,238″=”simsun”
    “Times New Roman CYR,204″=”simsun”
    “Times New Roman Greek,161″=”simsun”
    “Times New Roman TUR,162″=”simsun”
    “Tms Rmn”=”simsun”

    • run command: regedit zh.reg
    • run wine-QQ again


    Categories: 未分类

    copy problem in Python

    No Comments

    Look at codes below:

    >>> v = [0.5, 0.75, 1.0, 1.5, 2.0]
    >>> m = [v, v, v]
    >>> v[0] = ‘Python’
    >>> m
    [[‘Python’, 0.75, 1.0, 1.5, 2.0], [‘Python’, 0.75, 1.0, 1.5, 2.0], [‘Python’, 0.75, 1.0, 1.5, 2.0]]
    >>> from copy import deepcopy
    >>> v = [0.5, 0.75, 1.0, 1.5, 2.0]
    >>> m = 3*[deepcopy(v), ]
    >>> v[0] = ‘Python’
    >>> m
    [[0.5, 0.75, 1.0, 1.5, 2.0], [0.5, 0.75, 1.0, 1.5, 2.0], [0.5, 0.75, 1.0, 1.5, 2.0]]

    Categories: 未分类

    Random Forest

    No Comments

    In order to learn svm(support vector machine), we have to learn about what the Random Forest is.

    1. What is a decision tree

    A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm.(from wiki)

    Two Types of decision tree

    1.Categorical Variable Decision Tree

    2.Continuous Variable Decision Tree

    Example:- Let’s say we have a problem to predict whether a customer will pay his renewal premium with an insurance company (yes/ no). Here we know that income of customer is a significant variable but insurance company does not have income details for all customers. Now, as we know this is an important variable, then we can build a decision tree to predict customer income based on occupation, product and various other variables. In this case, we are predicting values for continuous variable.

    Important Terminology related to Decision Trees

    Root Node, Splitting, Decision Node, Leaf/Terminal Node:

    Pruning: When we remove sub-nodes of a decision node, this process is called pruning. You can say opposite process of splitting.

    Branch / Sub-TreeParent and Child Node


    1. Over fitting: Over fitting is one of the most practical difficulty for decision tree models. This problem gets solved by setting constraints on model parameters and pruning (discussed in detailed below).
    2. Not fit for continuous variables: While working with continuous numerical variables, decision tree looses information when it categorizes variables in different categories.

    2. Regression Trees vs Classification Trees


    Categories: Programming

    Zero to C-Mips Compiler

    No Comments

    According to CS143 course task, complete a compiler by my own.

    There are four steps:

    1. Lexical & Syntax Analysis
    2. Semantic Analysis & Type Checking
    3. Intermediate Code
    4. Translated MIPS Code
    5. Optimization
    Categories: Programming


    No Comments


    寄存器堆(register file):CPU中多个寄存器组成的阵列,通常由快速的静态随机读写存储器(SRAM)实现。这种RAM具有专门的读端口与写端口,可以多路并发访问不同的寄存器。寄存器堆是指令集架构的一部分,程序可以访问,这与透明的CPU高速缓存(cache)不同。


    或称多路复用器,是一种可以从多个模拟数字输入信号中选择一个信号进行输出的器件。 一个有 2n 输入端的数据选择器有 n 个可选择的输入-输出线路,可以通过控制端来选择其中一个信号被选择作为输出。[3] 数据选择器主要用于增加一定量的时间和带宽内的可以通过网络发送的数据量。


    当然还有datapath(中文:数据通路),给出英文解释:A datapath is a collection of functional units, such as arithmetic logic units or multipliers, that perform data processing operations, registers, and buses.[1] Along with the control unit it composes the central processing unit (CPU).



    4.1 引言

    重点是A Basic MIPS Implementation,分为三类指令:存储访问指令,算数逻辑指令和分支指令。





    我们怎么样实现每条指令呢?首先PC(program counter)这个寄存器,它存放着当前执行指令的地址,把PC的值送入Instruction memory(内存),取出执行的指令,这个指令读取一到两个寄存器。接下来读取寄存器后,除了跳转指令,其他的都要用上ALU(算数逻辑单元),各类指令中ALU的功能各不相同。存储指令用ALU计算地址,算术逻辑指令用ALU执行运算,分支指令用ALU进行比较。还有一点值得注意:分支指令如果不修改下一条指令地址,则下一条指令地址默认是当前指令地址+4。

    当然,上图的设计并非尽善尽美,对于一些组件单元,它的数据来源远不止这么简单。比如说PC的来源有两个加法器,实际上不能简单得将两个加法器的线连在一起(图:DeepinScreenshot20160325210706),我们应该增加一个multiplexor,或者叫data selector(多选器),它是这个样子的:


    对应的两个输入分别连上两个加法器,像这样需要data selector的地方还有不少。

    加入control unit的设计:


    control unit的定义:It tells the computer’s memory, arithmetic/logic unit and input and output devices how to respond to a program’s instructions.

    在这里,control unit接收指令,对寄存器、Data memory(数据存储器)还有data selector都进行控制,比如说寄存器的读写,数据的读写和data selector中对输入的选择等等。



    4.2 逻辑设计的一般方法


    组合单元(combinational element)

    状态单元(state element)




    Categories: Programming Tags: Tags: ,