COMP 3231 2020-2021 Sem 1 Take-home Exam

General Instructions:

1. The deadline is Dec 24, 11:59 PM. This deadline is **strict**; late submissions will automatically be given zero marks. Please avoid last-minute submission.
2. Unlike assignments or projects, you **cannot** discuss with others before writing your code. The work must be done **alone**.
3. You may search online for general coding instructions (e.g., referring to a CUDA tutorial), but copying code from online resources is strictly prohibited (and will be identified by us).
4. If you use GitHub/Bitbucket to manage your code, please make your repo private.
5. The school's plagiarism policy will be strictly enforced on identified cases.

Submission:

1. Submit your code in a single zip/tar file, with two folders, q1 and q2q3, as in the code template.
2. The q1 folder should contain two files: vecmax.cu and your README file.
3. The q2q3 folder should contain your source files, a Makefile, and a README file.
   a) You can add more .hpp or .cpp files, but you need to change the Makefile accordingly.
4. The README file is important, especially when you only finish a question partially: describe which functionalities you have implemented in your README file.

Question 1 (CUDA Programming). (30')

In this question, you are going to write a CUDA program that finds the largest single-precision float value in a large array. You are given sample code, vecmax.cu, which contains the basic code and supplementary functions to compute the max value in a float array. However, the current code only works for arrays of length < 256. Your code must be able to run on the CS GPU Farm.

a) Fix the host code so that it can handle arrays of arbitrary length. (20')

1. Change *only* the host (CPU) code so that it iteratively launches kernels to find the maximum value of arbitrary-length vectors. This should involve some sort of loop wrapping the kernel launch's switch and case statements.
2. Make sure it can run for very large (1,000,000-element) vectors.
3. Hint: if every block calculates a local maximum, what is the new problem size/vector length for the next kernel launch? Do not worry about the inefficiencies of multiple memory transfers.
4. All reductions (i.e., comparisons) should be done on the GPU. You cannot generate an intermediate array and find the max using the CPU.

b) Change the kernel code to leverage shared memory. (10')

1. The latency of an access to shared memory (which resides inside an SMM/SMX) is orders of magnitude lower than that of an access to main memory.
2. Rewrite the kernel so that all threads in a thread block first load their values into a buffer in shared memory. Once all of these values are loaded, have the threadIdx.x == 0 thread (call it the "lead" thread) find the maximum from this buffer.
3. Remember that relaxed consistency means there is no guarantee as to when the lead thread will see the other threads' writes to shared memory. Place fences/barriers accordingly.
4. You should *not* change the CPU code in this part; i.e., you should use the CPU code you wrote for part (a).
5. Part b will only be evaluated based on correctness. (A combined sketch of parts (a) and (b) follows below.)
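To make the hint concrete, here is a minimal sketch of how parts (a) and (b) can fit together. This is not the provided vecmax.cu: the names maxKernel and vecMax and the fixed BLOCK_SIZE of 256 are illustrative assumptions, and the template's actual switch/case launch structure is omitted.

```cuda
#include <cfloat>
#include <cuda_runtime.h>

#define BLOCK_SIZE 256  // assumed block size; the real template may differ

// Part (b): each block stages its elements in shared memory; after the
// barrier, the lead thread scans the buffer and writes one local maximum.
__global__ void maxKernel(const float *in, float *out, int n) {
    __shared__ float buf[BLOCK_SIZE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Pad out-of-range threads with the smallest float so they never win.
    buf[threadIdx.x] = (i < n) ? in[i] : -FLT_MAX;
    __syncthreads();               // barrier: lead thread must see all writes
    if (threadIdx.x == 0) {
        float m = buf[0];
        for (int j = 1; j < BLOCK_SIZE; ++j) m = fmaxf(m, buf[j]);
        out[blockIdx.x] = m;       // one local maximum per block
    }
}

// Part (a): host loop that relaunches the kernel until one element remains.
// All comparisons stay on the GPU; only the final scalar is copied back.
float vecMax(const float *d_in, int n) {
    const float *src = d_in;
    float *dst = nullptr;
    while (n > 1) {
        int blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
        cudaMalloc(&dst, blocks * sizeof(float));
        maxKernel<<<blocks, BLOCK_SIZE>>>(src, dst, n);
        cudaDeviceSynchronize();
        if (src != d_in) cudaFree((void *)src);  // drop previous intermediate
        src = dst;                               // output feeds the next pass
        n = blocks;                              // new problem size
    }
    float result;
    cudaMemcpy(&result, src, sizeof(float), cudaMemcpyDeviceToHost);
    if (src != d_in) cudaFree((void *)src);
    return result;
}
```

Each launch shrinks the problem from n elements to ceil(n / BLOCK_SIZE) per-block maxima, so a 1,000,000-element vector takes three launches (1,000,000 → 3,907 → 16 → 1).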
Question 2-3 (CPU Pipeline).

In these two questions, you are going to implement a simple superscalar processor with instruction pipelining, out-of-order execution (the Tomasulo algorithm), and branch prediction.

Assumptions: For simplicity, you do not have to model issue width, retire width, the number of result buses, or PRF (Physical Register File) ports. Assume these are unlimited and do not stall the processor pipeline. Some other assumptions are described where relevant.

Code template: We have provided you with the following files:

1. Makefile: to compile your code
2. Driver.cpp: contains the main() method to run the simulator
3. Processor.hpp: used for defining structures and method declarations; you may edit this file to declare or define structures and methods
4. Processor.cpp: all your methods are written here
5. traces: contains the traces to pass to the simulator (more details in a later section)

How to run: The generated binary (processor) should be run from the root directory as:

cat <trace file> | ./processor -f F -k K -l L -m M -r R

The command line parameters are as follows:

- F – dispatch rate (instructions per cycle)
- K – number of k0 function units (ALU units, used for BRANCH as well)
- L – number of k1 function units (MUL units)
- M – number of k2 function units (LOAD/STORE units)
- R – number of reservation stations per function unit type; that means there are a total of R x (K + L + M) reservation stations
- <trace file> – path name to the trace file

The traces: The input traces are given in the following form; the driver file contains a function to parse this input and populate an instruction structure (a sketch of such a structure follows the notes below):

<address> <fu type> <dest reg> <src1 reg> <src2 reg> <ld/st addr> <branch target> <taken>

Where:

1. <address> is the address of the instruction (in hex)
2. <fu type> is the function unit type this instruction uses; you can directly treat -1 as 1 for non-branch instructions
3. <dest reg> is the destination register [0..31]
4. <src1 reg> is the first source register [0..31]
5. <src2 reg> is the second source register [0..31]
6. <ld/st addr> is the effective load or store address (i.e., the memory address for the operation)
7. <branch target> is the branch target for a branch instruction
8. <taken> tells whether the branch is actually taken

Note that:

- If any reg # is -1, then there is no register for that part of the instruction (e.g., a branch instruction has -1 for its <dest reg>)
- If the instruction is not a branch, you can ignore the <branch target> and <taken> fields
- If the instruction is not a load/store, you can ignore the <ld/st addr> field
- You do not need to care about the meaning of each instruction; just check the data hazards in an opaque way
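Driver.cpp already contains the real parser; the sketch below only mirrors the field layout implied by the list above. The struct and function names (TraceInstruction, readTraceLine) are made up for illustration.

```cpp
#include <cstdio>

// Hypothetical mirror of the structure Driver.cpp populates; the
// template's actual field names may differ.
struct TraceInstruction {
    unsigned long long address;       // <address>: instruction address (hex)
    int fuType;                       // <fu type>: function unit type
    int destReg, src1Reg, src2Reg;    // -1 means "no register in this slot"
    unsigned long long memAddress;    // <ld/st addr>
    unsigned long long branchTarget;  // <branch target>
    int taken;                        // <taken>: 1 if actually taken
};

// Reads one trace line from stdin (the trace is piped in via cat);
// returns false at end of input.
bool readTraceLine(TraceInstruction &inst) {
    return std::scanf("%llx %d %d %d %d %llx %llx %d",
                      &inst.address, &inst.fuType, &inst.destReg,
                      &inst.src1Reg, &inst.src2Reg, &inst.memAddress,
                      &inst.branchTarget, &inst.taken) == 8;
}
```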
Q2: Implement the Tomasulo pipeline. (40')

The pipeline has five stages, as below:

| Stage Name | Number of Cycles per Instruction |
|---|---|
| Fetch | 1 |
| Dispatch | Variable, depending upon resource conflicts |
| Schedule | Variable, depending upon data dependencies |
| Execute | Variable, depending on operation |
| Status Update (execute completion) | Variable, depending upon data dependencies |

To get full marks for Q2, your code does not need to work for branches. In other words, your code does not need to be able to run the test traces named xxx_branch.xxx.trace. For Q2 you can temporarily ignore the branch-prediction details (shown in blue in the original handout); those descriptions are used for Q3 (branch prediction).

Fetch:

1. The fetch unit fetches up to F instructions from the trace into empty slots in the dispatch queue each cycle. Note that the dispatch queue is infinite in size, hence the only source of stalls is branches (however, we don't stall the fetch stage, as explained later).
2. We assume that the frontend (fetch unit) never misses in the instruction cache (which we don't model).
3. When a branch occurs (OP_BR), the branch is predicted using the GSelect predictor.
4. The trace also has the target and the actual branch behavior, which is checked against the prediction. If the prediction was incorrect, you must stop fetching instructions until the mispredicted branch completes in the k0 unit. Restart fetching (step 1) the cycle after the mispredicted branch completes.

Dispatch:

1. The dispatcher attempts to dispatch as many instructions as it can from the dispatch queue into empty slots in the appropriate reservation station queue, in program order, each cycle. When there are no more slots in the scheduling queue/reservation stations, it stalls.
2. The dispatch unit also stalls when there is a branch misprediction. This is described in a later section.

Schedule:

1. There are R x ki entries in the scheduling unit for function units of type i (where ki is the number of units of that type).
2. If there are multiple independent instructions ready to fire during the same cycle in a reservation station, service them in program order and based on the availability of function units (a sketch of this step follows this section).
3. A fired instruction remains in the reservation station until it completes. The latency of each unit type is listed below.
4. The number of function units is a parameter of the simulation and should be adjustable over the range of 1 to 3 units of each type.
5. Instructions and the execute unit they use are listed above.

| Function Unit Type | Number of Units | Default | Latency |
|---|---|---|---|
| 0 | Parameter: k0 | 3 | 1 |
| 1 | Parameter: k1 | 1 | 3 |
| 2 | Parameter: k2 | 2 | 10 |

Execute:

1. The function units are present in this stage. When an instruction completes, it updates the reservation stations and, if it is a mispredicted branch, the fetch unit. Those updates are visible in the following cycle. An instruction is considered "completed" after it spends the required cycles in execute.
2. Note: all the FUs are pipelined. This means that all FUs are available for ready instructions to be scheduled onto in each cycle of the processor simulation. Set up your scoreboard to take this pipelined behavior into account.

Status Update (last stage of Execute):

1. Completed instructions broadcast their results in this stage. The register file and the branch predictor are updated here.
2. Instructions that complete are also removed from the reservation stations.

The clock: Instruction movement only happens when the latches are clocked, which occurs at the rising edge of each clock cycle. You must simulate the same behavior as the pipeline latches, even if you do not model the actual latches. For example, if an instruction is ready to move from scheduling to execute, the move only takes effect at the beginning of the next clock cycle. Note that the latching is not strict between the dispatch unit and the scheduling unit, since both units use the scheduling queues. For example, if the dispatch unit inserts an instruction into one of the scheduling queues during clock cycle J, that instruction must spend at least cycle J+1 in the scheduling unit.

In addition, assume each clock cycle is divided into two half cycles (you do not need to explicitly model this, but please make sure your simulator follows this ordering of events):

First half:
- The register file is written via a result bus
- Any independent instructions in the reservation stations are marked to fire
- The dispatch unit reserves slots in the scheduling queue

Second half:
- The register file is read by Dispatch
- Scheduling queues are updated via a result bus
- The state update unit deletes completed instructions from the scheduling queue (reservation stations)

Operation of the dispatch queue: Note that the dispatch queue is scanned from head to tail (in program order). When an instruction is inserted into the reservation stations (schedule queue), it is deleted from the dispatch queue.
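As a sketch of the "marked to fire" step referenced in the Schedule section above — the names RSEntry and markReadyToFire are assumptions, not declarations from the provided Processor.hpp:

```cpp
#include <algorithm>
#include <vector>

// Illustrative reservation-station entry for one FU type's queue.
struct RSEntry {
    int  tag;                   // sequential tag: 0, 1, 2, ...
    bool src1Ready, src2Ready;  // operand availability (set by result buses)
    bool fired;                 // already issued to a (pipelined) FU
};

// First-half-of-cycle step: mark ready, not-yet-fired entries to fire in
// program (tag) order. Because the FUs are pipelined, each of the units
// of this type can accept one new instruction per cycle.
void markReadyToFire(std::vector<RSEntry> &rs, int unitsOfThisType) {
    std::sort(rs.begin(), rs.end(),
              [](const RSEntry &a, const RSEntry &b) { return a.tag < b.tag; });
    int slots = unitsOfThisType;  // one issue slot per pipelined FU per cycle
    for (RSEntry &e : rs) {
        if (slots == 0) break;
        if (!e.fired && e.src1Ready && e.src2Ready) {
            e.fired = true;       // takes effect at the next rising clock edge
            --slots;
        }
    }
}
```

Sorting by tag before scanning is one simple way to enforce program order; keeping each FU type's reservation stations in a tag-ordered list from the start works equally well.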
Tag generation: Tags are generated sequentially for every instruction, beginning with 0. The first instruction in a trace will use a tag value of 0, the second will use 1, etc. The traces are sufficiently short that you will not have to reuse tags.

Bus arbitration: Since we are modeling infinite result buses, all instructions in the schedule queue (aka reservation stations) that complete may update the register file and schedule queue in the same cycle. Unless we agree on some ordering of these updates, multiple different (and valid) machine states can result, which would complicate validation. Therefore, assume that the order of updates is the same as tag order. Consider a scenario where tags 100, 102, 110, and 104 are waiting to complete: the order of completion is tag 100, 102, 104, and 110, all at the beginning of the next clock cycle!

Sample output: On each line, the first number is the instruction counter; the five other numbers are the cycle numbers at which the instruction is in the fetch, dispatch, schedule, execute, and status update stages.

Q3: Implement the GSelect branch predictor. (30')

First, please read the updates to the Q2 pipeline marked in blue in the original handout.

1. Branches are predicted in the instruction fetch stage.
2. Branches are predicted using a GSelect predictor built from a 128-entry table of 2-bit bimodal (Smith) counters, where the initial state is 01 for each counter. The GHR (global history register) is 3 bits wide. The instruction address is hashed to get an index into the branch target buffer using this function: ((Address >> 2) % 16). (A sketch of this predictor follows this list.)
3. Branch behavior (whether the branch is taken and where the branch goes) is known after the instruction has executed (after it has spent one cycle in execute in a k0 unit).
4. If the branch behavior and the prediction match (note: check the predictor value when the branch is put into the dispatch queue), continue executing the trace; there are no stalls.
5. However, if the branch prediction and actual behavior don't match, you will stall the DISPATCH unit until the mispredicted branch reaches the STATUS UPDATE stage of the pipeline. After the mispredicted branch has performed its status update, the DISPATCH unit can begin dispatching instructions again.
6. Note: the FETCH unit is never stalled. This is because the trace only contains correct instructions and does not contain the incorrect instructions that a branch misprediction would have induced, and multipath fetch (as performed in real processors) would have fetched both the taken and not-taken paths, squashing the incorrect instructions after branch resolution. This system logically performs the same function as an efficient instruction retirement system (such as checkpoint repair or a ROB with a future file) would perform with branches. The stall allows the penalty of misprediction to be included in the IPC calculations, while removing the complexity of implementing a smart retirement scheme in the simulator you are writing.
7. The FUs will be freed once the instruction broadcasts on the common data bus. If the instruction is a branch, update the predictor when you free the FU.
8. Note that when you start the simulation, the Smith counters are set to the value '01'.
9. If more than one branch exists in EX and is preparing for status update, serve them in lowest-tag-first order.
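The spec pins down the sizes (128 two-bit counters, a 3-bit GHR) and the PC hash; concatenating the 4-bit hash with the 3-bit GHR is the standard GSelect construction, and it yields exactly 2^7 = 128 entries. Treat the bit ordering (PC bits high, GHR low) and the names below as assumptions; this is a sketch, not the required implementation.

```cpp
#include <cstdint>

struct GSelect {
    uint8_t counters[128];  // 2-bit saturating Smith counters
    uint8_t ghr = 0;        // 3-bit global history register

    GSelect() { for (uint8_t &c : counters) c = 1; }  // initial state 01

    // 4-bit PC hash from the spec, concatenated with the 3-bit GHR.
    int index(uint64_t addr) const {
        return (int)((((addr >> 2) % 16) << 3) | (ghr & 0x7));
    }

    // Read when the branch is put into the dispatch queue.
    bool predictTaken(uint64_t addr) const {
        return counters[index(addr)] >= 2;   // 10 or 11 predicts taken
    }

    // Applied when the branch frees its FU at status update: saturate the
    // counter toward the outcome, then shift the outcome into the GHR.
    // (A full simulator would remember the index computed at prediction
    // time, since the GHR may have shifted in the meantime.)
    void update(uint64_t addr, bool taken) {
        uint8_t &c = counters[index(addr)];
        if (taken) { if (c < 3) ++c; }
        else       { if (c > 0) --c; }
        ghr = (uint8_t)(((ghr << 1) | (taken ? 1u : 0u)) & 0x7);
    }
};
```

Counter values 00/01 predict not-taken and 10/11 predict taken, so the initial state 01 predicts not-taken but flips after a single taken branch.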
