Random Forest


In order to learn about SVMs (support vector machines), we first have to learn what a random forest is.

    1. What is a decision tree

A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm. (from Wikipedia)

Two types of decision tree:

1. Categorical Variable Decision Tree

2. Continuous Variable Decision Tree

Example: suppose we want to predict whether a customer will pay their renewal premium to an insurance company (yes/no). We know that the customer's income is a significant variable, but the insurance company does not have income details for all customers. Since this is an important variable, we can build a decision tree to predict customer income based on occupation, product, and various other variables. In this case we are predicting values of a continuous variable.

    Important Terminology related to Decision Trees

Root Node, Splitting, Decision Node, Leaf/Terminal Node, Branch/Sub-Tree, Parent and Child Node.

Pruning: when we remove sub-nodes of a decision node, the process is called pruning. You can think of it as the opposite of splitting.
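Splitting is usually driven by an impurity measure such as Gini impurity. As a concrete illustration (a sketch of my own, not taken from any particular library), here is how the Gini impurity of a candidate split could be computed in C for binary labels; the layout, where the first n_left labels go to the left child, is a simplifying assumption:

#include <stddef.h>

/* Gini impurity of a set of binary labels: 1 - p0^2 - p1^2.
   0.0 means the node is pure; 0.5 is the worst case for two classes. */
static double gini(const int *labels, size_t n)
{
    if (n == 0)
        return 0.0;
    size_t ones = 0;
    for (size_t i = 0; i < n; i++)
        ones += (labels[i] == 1);
    double p1 = (double)ones / (double)n;
    double p0 = 1.0 - p1;
    return 1.0 - p0 * p0 - p1 * p1;
}

/* Weighted impurity of a split sending the first n_left labels to the
   left child and the remaining n - n_left labels to the right child. */
static double split_impurity(const int *labels, size_t n, size_t n_left)
{
    size_t n_right = n - n_left;
    return ((double)n_left / n) * gini(labels, n_left)
         + ((double)n_right / n) * gini(labels + n_left, n_right);
}

A splitting procedure tries many candidate splits and keeps the one with the lowest weighted impurity; pruning then removes sub-trees whose splits do not pay off on held-out data.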

    Disadvantages

1. Overfitting: overfitting is one of the most practical difficulties for decision tree models. It can be addressed by setting constraints on the model parameters and by pruning (discussed above).
2. Not fit for continuous variables: while working with continuous numerical variables, a decision tree loses information when it bins the variable into categories.

    2. Regression Trees vs Classification Trees

     


    Zero to C-Mips Compiler


Following the CS143 course project, I am building a compiler on my own.

There are five steps (a small sketch of step 1's lexing interface follows the list):

1. Lexical & Syntax Analysis
2. Semantic Analysis & Type Checking
3. Intermediate Code
4. Translated MIPS Code
5. Optimization
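As a taste of step 1, here is a hypothetical sketch of what the lexing interface might look like; the token kinds and function names are placeholders of my own, not the actual CS143 specification:

#include <stdio.h>
#include <ctype.h>

typedef enum { TOK_IDENT, TOK_NUMBER, TOK_PLUS, TOK_EOF } TokenKind;

typedef struct {
    TokenKind kind;
    char text[64];   /* the lexeme */
} Token;

/* Read one token from f. A real lexer handles many more token
   kinds, plus comments, string literals, and error recovery. */
Token next_token(FILE *f)
{
    Token t = { TOK_EOF, "" };
    int c = fgetc(f);
    while (c != EOF && isspace(c))       /* skip whitespace */
        c = fgetc(f);
    if (c == EOF)
        return t;
    if (c == '+') {
        t.kind = TOK_PLUS;
        t.text[0] = '+';
        return t;
    }
    int i = 0;
    if (isdigit(c)) {
        t.kind = TOK_NUMBER;
        while (c != EOF && isdigit(c) && i < 63) {
            t.text[i++] = (char)c;
            c = fgetc(f);
        }
    } else {
        t.kind = TOK_IDENT;              /* crude catch-all for this sketch */
        while (c != EOF && !isspace(c) && c != '+' && i < 63) {
            t.text[i++] = (char)c;
            c = fgetc(f);
        }
    }
    t.text[i] = '\0';
    if (c != EOF)
        ungetc(c, f);                    /* push back the lookahead character */
    return t;
}

The parser then pulls tokens from next_token() one at a time, which is why lexical and syntax analysis make up a single step.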

Computer Organization: The Processor (1)


It's been a while since I last looked at computer organization, so first let me review a few concepts:

Register file: an array of registers inside the CPU, usually implemented with fast static RAM (SRAM). This RAM has dedicated read ports and write ports, so different registers can be accessed concurrently. The register file is part of the instruction set architecture and is visible to programs, unlike the transparent CPU cache.

Multiplexer (data selector): a device that selects one of several analog or digital input signals and forwards it to a single output. A multiplexer with 2^n inputs has n select lines, which are used to choose which input is sent to the output. Multiplexers are mainly used to increase the amount of data that can be sent over a network within a given amount of time and bandwidth.

A multiplexer lets multiple signals share one device or resource, for example one analog-to-digital converter or one transmission line, instead of having one device per input signal.

And of course there is the datapath; here is the English definition: A datapath is a collection of functional units, such as arithmetic logic units or multipliers, that perform data processing operations, registers, and buses. Along with the control unit it composes the central processing unit (CPU).

Note: the above is taken from Wikipedia. (As for Baidu Baike, frankly, it's trash.)

Sections 4.1 to 4.3 cover, respectively, the introduction, logic design conventions, and building a datapath. Let's take them step by step:

4.1 Introduction

The focus is A Basic MIPS Implementation, which covers three classes of instructions: memory access instructions, arithmetic/logical instructions, and branch instructions.

Let's review how these are implemented:

[Figure 4_1: an abstract view of a MIPS subset implementation]

Below I try to stay close to the original text.

How do we implement each instruction? First there is the PC (program counter), a register holding the address of the current instruction. The PC value is sent to the instruction memory to fetch the instruction to execute, and the instruction then reads one or two registers. After the register read, every class of instruction except jumps uses the ALU (arithmetic logic unit), though each class uses it differently: memory access instructions use the ALU to compute an address, arithmetic/logical instructions use it to perform the operation, and branch instructions use it for comparison. One more point worth noting: unless a branch changes the next instruction address, the next instruction address defaults to the current address + 4.

Of course, the design above is far from complete. For some of the units, the data comes from more than one source. For example, the PC can be fed by two different adders, and we cannot simply wire the two adders' outputs together (figure: DeepinScreenshot20160325210706). Instead we add a multiplexor, also called a data selector, which looks like this:

[Figure Multiplexer2: a 2-to-1 multiplexer]

Its two inputs are connected to the two adders. There are quite a few other places in the design that need a data selector like this.
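In software terms, a 2-to-1 multiplexer is just a selection between two inputs. Here is a rough behavioral C sketch of the next-PC selection (my own illustration; the signal names are made up):

#include <stdint.h>

/* One adder produces PC + 4; the other produces the branch target
   (PC + 4 plus the sign-extended offset shifted left by 2, per MIPS).
   A 1-bit control signal selects which adder output becomes the new PC. */
uint32_t next_pc(uint32_t pc, int32_t branch_offset, int take_branch)
{
    uint32_t pc_plus_4     = pc + 4;
    uint32_t branch_target = pc + 4 + ((uint32_t)branch_offset << 2);

    /* the multiplexer */
    return take_branch ? branch_target : pc_plus_4;
}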

Now we add the control unit to the design:

[Figure 4_2: the MIPS subset datapath with the control unit added]

Definition of the control unit: It tells the computer's memory, arithmetic/logic unit and input and output devices how to respond to a program's instructions.

Here the control unit receives the instruction and controls the registers, the data memory, and the data selectors: for example, whether registers are read or written, whether data is read or written, and which input each data selector passes through.

One thing to stress: this design is not very efficient, because the clock cycle must be set long enough to accommodate the instruction with the longest execution time. A pipelined control design will be introduced later.

     

4.2 Logic Design Conventions

First, the key concepts:

Combinational element

State element

The difference between them is that a combinational element has no internal storage, while a state element does. In addition, a state element has at least two inputs and one output; one of the inputs is the clock signal, and the output provides the value that was written into the element on an earlier clock cycle. What does this mean? I recently took a digital logic course that covers exactly this: the computer clock cycle.
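A rough way to see the difference in code (my own analogy, not from the book): a combinational element is a pure function of its current inputs, while a state element holds a value and only captures a new one on a clock edge:

#include <stdint.h>

/* Combinational element: the output depends only on the current inputs. */
uint32_t alu_add(uint32_t a, uint32_t b)
{
    return a + b;
}

/* State element: a register modeled as stored state plus a clock input.
   The output q is the value written on an earlier clock edge; the data
   input d is only captured when the clock ticks. */
typedef struct {
    uint32_t q;   /* current output (the stored value) */
} Register;

void register_tick(Register *r, uint32_t d, int clock_edge)
{
    if (clock_edge)
        r->q = d;   /* capture the input on the rising edge */
}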

     

     


    benchmark optimize(1)


    source code: whetstone.c

baseline compiler flags: -std=c89 -DDP  -DROLL -lm

no warnings, no errors

Let's begin.

    GCC First:

1. Simply compile and run:

Rolled Double  Precision 703148 Kflops ; 2048 Reps

2. 703148 Kflops is too slow, so we add the -O4 flag to optimize the loops, compile again, and run:

Rolled Double  Precision 4177105 Kflops ; 2048 Reps

Better now!

Now let's add more flags:

gcc -std=c89 -DDP  -DROLL -O4 -ffast-math -funroll-all-loops -mavx whetstone.c -fopenmp -lm -o b.out

-ffast-math makes the code faster but sacrifices floating-point accuracy.

-funroll-all-loops asks the compiler to unroll every loop, even when the trip count is not known at compile time.

-mavx enables the AVX instruction set.

5340310 Kflops now!
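To illustrate what unrolling means (a hand-written sketch, not what the compiler literally emits), here is a whetstone-style accumulation loop unrolled by four:

/* Rolled loop: one add and one branch test per iteration. */
double sum_rolled(const double *x, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Unrolled by 4: fewer branches and more independent work per
   iteration, which gives the CPU and the vectorizer more to chew on. */
double sum_unrolled(const double *x, int n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; i++)   /* leftover iterations */
        s0 += x[i];
    return (s0 + s1) + (s2 + s3);
}

Note that splitting the accumulator reorders the floating-point additions, which is exactly the kind of reassociation -ffast-math permits.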

     

ICC Then:

1. Simply compile and run:

Rolled Double  Precision 4636137 Kflops ; 2048 Reps

It seems good at first, but adding the -O3 flag does not make the program any faster, so we start thinking about parallel methods.

The -xHost flag improves performance by about 14%.

2. Parallel methods:

First of all we have to run vtune_amplifier_xe. The software is located in /opt/intel/vtune_amplifier_xe_xxx/bin64; run /opt/intel/vtune_amplifier_xe_xxx/bin64/amplxe-gui and the GUI window will appear. (ps: xxx is the version of vtune_amplifier_xe)

Run this command (as root):

root# echo 0 > /proc/sys/kernel/yama/ptrace_scope

Then refer to the tutorial: hotspots_amplxe_lin.pdf

It shows the hotspots:

[Screenshot DeepinScreenshot20160229191414: hotspot list]

It also shows the CPU utilization:

[Screenshot DeepinScreenshot20160229191713: utilization summary]

Poor! Now we have to consider parallelizing it.
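Since the build already passes -fopenmp (and icc supports OpenMP too), here is a hedged sketch of what parallelizing a hotspot loop could look like; the loop below is a stand-in, not actual whetstone code:

#include <omp.h>

/* Hypothetical hotspot: the iterations are independent, so a
   parallel-for is safe. The reduction clause gives each thread a
   private partial sum that is combined at the end. */
double hotspot(const double *x, const double *y, int n)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}

Be careful, though: several whetstone modules have loop-carried dependencies, so not every hotspot can be parallelized this naively.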


CUDA learning(2): a simple parallel CUDA program


Now we use the function add, with this code:

__global__ void add(int *a, int *b, int *c) {
    *c = *a + *b;
}

add() runs on the device, so a, b, and c must point to device memory.

We can allocate that memory on the GPU from host code, using cudaMalloc(), cudaFree(), and cudaMemcpy() to handle device memory.

Now for a simple, complete program:

#include <stdio.h>

__global__ void add(int *a, int *b, int *c)
{
    *c = *a + *b;
}

int main(void)
{
    int a, b, c;              // host copies of a, b, c
    int *d_a, *d_b, *d_c;     // device copies of a, b, c
    int size = sizeof(int);

    // Allocate space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Set up input values and copy them to the device
    a = 2;
    b = 7;
    cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);

    // Launch add() kernel on GPU
    add<<<1,1>>>(d_a, d_b, d_c);

    // Copy result back to host
    cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);

    // Cleanup
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}

    So how do we run code in parallel on the device?

Change "add<<<1,1>>>();" to "add<<<N,1>>>();": instead of executing add() once, this executes it N times in parallel, where N is the number of blocks.

    Vector Addition on the Device

Terminology: each parallel invocation of add() is referred to as a block. Each invocation can refer to its block index using blockIdx.x.

So we change the add function:

__global__ void add(int *a, int *b, int *c) {
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

     

Since a, b, and c now point to arrays, main() has to change as well: we need to allocate N-element arrays and fill the inputs, perhaps with a helper function like random_ints(). A sketch follows.
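Here is a sketch of what the updated main() might look like, following the structure of the NVIDIA slides; random_ints() is a helper I define myself, since the slides leave it out:

#include <stdio.h>
#include <stdlib.h>

#define N 512

__global__ void add(int *a, int *b, int *c)
{
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

/* fill an array with random values -- the slides assume this helper exists */
void random_ints(int *a, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = rand() % 100;
}

int main(void)
{
    int *a, *b, *c;          // host copies of a, b, c
    int *d_a, *d_b, *d_c;    // device copies of a, b, c
    int size = N * sizeof(int);

    // Allocate space for device copies
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Allocate host arrays and set up input values
    a = (int *)malloc(size); random_ints(a, N);
    b = (int *)malloc(size); random_ints(b, N);
    c = (int *)malloc(size);

    // Copy inputs to device
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // Launch add() on N blocks of one thread each
    add<<<N,1>>>(d_a, d_b, d_c);

    // Copy result back to host
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    // Cleanup
    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}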


CUDA learning(1): start from hello-world


Learning from http://www.nvidia.com/docs/io/116711/sc11-cuda-c-basics.pdf

First, some concepts:

    • Heterogeneous Computing
    • Blocks
    • Threads
    • Indexing
    • Shared memory
    • __syncthreads()
    • Asynchronous operation
    • Handling errors
    • Managing devices

Now we have to understand what "Heterogeneous Computing" is. From page 8 of the slides we can easily see that it involves device code and host code, parallel code and serial code. The next slides are more intuitive: data moves between the CPU and the GPU, and the parallel code is executed on the GPU.

     

(ps: GigaThread™: this engine provides up to 10x faster context switching compared to previous-generation architectures, concurrent kernel execution, and improved thread block scheduling.)

So we add the device code, and the "hello world" program looks like this:

#include <stdio.h> // do not forget this!

__global__ void mykernel(void) {
}

int main(void)
{
    mykernel<<<1,1>>>();
    printf("Hello World!\n");
    return 0;
}

__global__: a CUDA C/C++ keyword indicating that the function:

    • Runs on the device
    • Is called from host code

nvcc separates the source code into host and device components (nvcc is NVIDIA's CUDA compiler driver):

Device functions (e.g. mykernel()) are processed by the NVIDIA compiler, and host functions (e.g. main()) are processed by the standard host compiler.

(ps: for Arch Linux users, you can install the NVIDIA toolkit with "pacman -S cuda", and you may have to reboot before nvcc works.)
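With the toolkit installed, compiling and running should be as simple as (hello.cu is just whatever I named the source file):

nvcc hello.cu -o hello
./hello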

The function mykernel does nothing here; we'll explain what "<<<1,1>>>" does in a moment.

Now I have to get back to the parallelism study I left unfinished before:

    reference book: CSAPP

(ps: nothing worth writing down yet; maybe I'll add something in subsequent sections)

     

     
