Performance

This is from section 1.6 of the book.
The Defining Performance section (p 29) always gives me a moment of pause.
- Just exactly what do you mean by performance?
- Are you moving one person or 1000?
- Are there any other ways you might measure performance for airplanes?
- What if you needed to move people 10,000 miles?
We often fall into the trap of assuming that a higher clock speed = better performance.
But a far better measure of performance is execution time.
- For our program
- On our data.

They define performance for a computer X as

                     1
Performance_X = ---------------
               Execution Time_X

So if computer X runs a program faster than computer Y, then

      Execution Time_X < Execution Time_Y

         1              1
--------------- > ---------------
Execution Time_X   Execution Time_Y

Performance_X > Performance_Y

Example: A single cycle WOMBAT computer with a clock speed of 2.63 MHz runs a program in 3.80 seconds. A multi cycle version with a clock speed of 8.33 MHz runs the same program in 3.94 seconds. What is the performance of each, and how many times faster is the multi cycle version than the single cycle version?
```
Execution Time_single = 3.80
Execution Time_multi = 3.94

Performance_single = 1/3.8 = .263
Performance_multi = 1/3.94 = .254

The single cycle machine has a higher performance since Performance_single > Performance_multi

The single cycle machine is 1.04 times faster since .263/.254 ≈ 1.04
(equivalently, 3.94/3.80 ≈ 1.04)
```
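The arithmetic in this example can be checked with a few lines of Python. This is only a sketch: the function names `performance` and `speedup` are made up for illustration and do not come from the book.

```python
def performance(execution_time_s):
    """Performance is the reciprocal of execution time (in seconds)."""
    return 1.0 / execution_time_s


def speedup(slower_time_s, faster_time_s):
    """How many times faster the quicker machine is (ratio of execution times)."""
    return slower_time_s / faster_time_s


# Numbers from the WOMBAT example above.
perf_single = performance(3.80)
perf_multi = performance(3.94)
ratio = speedup(3.94, 3.80)

print(f"single: {perf_single:.3f}  multi: {perf_multi:.3f}  ratio: {ratio:.2f}")
```

Note that the clock speeds never enter the calculation. Only the execution times matter.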
Measuring performance is difficult
- I/O depends on activity
- The number of users and processes each is running has an impact
- Other OS processes impact wall clock time.
So wall clock time is not a reliable measure of processor performance.
As we did with WOMBAT, we will concern ourselves with CPU time.
- The other factors are outside of the scope of this class.
In particular, we are mostly concerned with cycles per instruction.
- Known as CPI
```
                Cycles        Seconds
Instructions x ----------- x -------- = Seconds
               Instruction    Cycle
```
- As we have discussed, seconds/cycle is the clock period, and its reciprocal, cycles/second, is the clock rate.
- Instruction count is a measurement of the program.
- Cycles/instruction is (somewhat) under the control of the hardware designer.
A sort of holy grail of processors is to have a CPI of 1.0.
Most of these types of problems can be solved with dimensional analysis.
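As a rough sketch of the CPU time equation, the two functions below compute the same quantity, once from the clock rate (cycles/second) and once from the clock period (seconds/cycle). The function names and the sample numbers are hypothetical and chosen only to make the units easy to follow.

```python
def cpu_time_from_rate(instructions, cpi, clock_rate_hz):
    """instructions x (cycles/instruction) / (cycles/second) = seconds."""
    cycles = instructions * cpi
    return cycles / clock_rate_hz


def cpu_time_from_period(instructions, cpi, clock_period_s):
    """instructions x (cycles/instruction) x (seconds/cycle) = seconds."""
    return instructions * cpi * clock_period_s


# Hypothetical program: one million instructions, CPI of 2.0, 1 GHz clock.
t_rate = cpu_time_from_rate(1_000_000, 2.0, 1e9)
t_period = cpu_time_from_period(1_000_000, 2.0, 1e-9)

print(t_rate, t_period)
```

Both calls describe the same machine, so they should agree up to floating point rounding.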

If processor A has a clock rate of 2 GHz and a CPI of 1.3, and processor B has a clock rate of 2.5 GHz and a CPI of 1.7, which processor has the better performance? Assume both processors have the same instruction set.

    Since the processors have the same instruction set, we can assume
    that a program has the same number of instructions, n, on both.

    Processor A
                                1.3 cycles        1 second
        Time = n instructions x ----------- x ----------------
                                instruction   2.0 x 10⁹ cycles

                   1.3n
             = ----------- seconds
                2.0 x 10⁹

             = .65n x 10^-9 seconds.

    Processor B
                                1.7 cycles        1 second
        Time = n instructions x ----------- x ----------------
                                instruction   2.5 x 10⁹ cycles

                   1.7n
             = ----------- seconds
                2.5 x 10⁹

             = .68n x 10^-9 seconds.

    Processor A has the shorter execution time,
    therefore it has the better performance.
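Because n is the same for both processors, it cancels, and the comparison reduces to seconds per instruction (CPI divided by clock rate). A small sketch of that comparison follows; the helper name `seconds_per_instruction` is invented for this illustration.

```python
def seconds_per_instruction(cpi, clock_rate_hz):
    """(cycles/instruction) / (cycles/second) = seconds/instruction."""
    return cpi / clock_rate_hz


time_a = seconds_per_instruction(1.3, 2.0e9)  # processor A
time_b = seconds_per_instruction(1.7, 2.5e9)  # processor B

# The processor with the smaller time per instruction performs better.
faster = "A" if time_a < time_b else "B"
ratio = time_b / time_a

print(f"A: {time_a:.2e} s  B: {time_b:.2e} s  faster: {faster}  ratio: {ratio:.3f}")
```

The higher clock rate of processor B does not make up for its higher CPI.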

Assume a processor running at 3.0 GHz has an instruction set that can be divided into three classes. Class A is used 25% of the time and has a CPI of 1, class B is used 45% of the time and has a CPI of 2, and class C is used 30% of the time and has a CPI of 4. A program with 5.4 x 10⁷ instructions is executed.

- What is the global CPI?
- How many clock cycles are required to execute the program?
- How much time is required to execute the program?

If the compiler team is able to reduce the instruction count by 10% by shifting the instruction mix to be A: 30%, B: 35%, C: 35%, would this represent an improvement?

Global CPI =  CPI_A * Usage_A +  CPI_B * Usage_B + CPI_C * Usage_C
           = 1 * .25 +  2 * .45 + 4 * .3 
	   = 2.35

Clock Cycles =  Instructions x Global CPI
             (remember CPI is cycles/instruction,
              so instructions x cycles/instruction = cycles)

	     = 5.4x10⁷ x 2.35
	     = 5.4 x 2.35 x 10⁷
	     = 12.69 x 10⁷ cycles.
	     = 1.27 x 10⁸ cycles.

Total  time  = cycles x seconds/cycle
             = 1.27 x 10⁸ cycles x 1 second/(3.0 x 10⁹ cycles)
             = 1.27/3.0 x 10^-1 seconds
	     = 0.42 x 10^-1 seconds
	     = 0.042 seconds.
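The three answers above can be reproduced with a short Python sketch. The representation of the instruction mix as a list of (fraction, CPI) pairs and the name `global_cpi` are assumptions made for this example.

```python
def global_cpi(mix):
    """Weighted average CPI; mix is a list of (usage fraction, CPI) pairs."""
    return sum(fraction * cpi for fraction, cpi in mix)


CLOCK_RATE_HZ = 3.0e9
INSTRUCTIONS = 5.4e7

# Classes A, B, C from the problem statement.
mix = [(0.25, 1), (0.45, 2), (0.30, 4)]

cpi = global_cpi(mix)             # cycles per instruction
cycles = INSTRUCTIONS * cpi       # instructions x cycles/instruction
seconds = cycles / CLOCK_RATE_HZ  # cycles / (cycles/second)

print(f"CPI: {cpi:.2f}  cycles: {cycles:.3e}  time: {seconds:.4f} s")
```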

New compiler
     Instructions =  5.4 x 10⁷ * .9
                  = 4.86 x 10⁷ instructions

     Global CPI  = .3 x 1 + .35 x 2 + .35 x 4
                 = 2.4

     Clock Cycles = 4.86 x 10⁷ x 2.4
                  = 4.86 x 2.4 x 10⁷
                  = 11.66 x 10⁷ cycles
                  = 1.17 x 10⁸ cycles
              
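     Since this is lower than the old count (1.27 x 10⁸ cycles) and the clock
     rate is unchanged, execution time drops as well, even though the global
     CPI rose from 2.35 to 2.4.
     It is worth the compiler investment for this program on this processor.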
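A sketch of the same comparison in Python, under the assumption that the clock rate stays at 3.0 GHz for both compilers. The helper name `total_cycles` is hypothetical.

```python
def total_cycles(instructions, mix):
    """instructions x global CPI, where mix is a list of (fraction, CPI) pairs."""
    cpi = sum(fraction * class_cpi for fraction, class_cpi in mix)
    return instructions * cpi


old_cycles = total_cycles(5.4e7, [(0.25, 1), (0.45, 2), (0.30, 4)])
# New compiler: 10% fewer instructions, different instruction mix.
new_cycles = total_cycles(5.4e7 * 0.9, [(0.30, 1), (0.35, 2), (0.35, 4)])

# With the same clock rate, the cycle ratio equals the execution time ratio.
improvement = old_cycles / new_cycles

print(f"old: {old_cycles:.3e}  new: {new_cycles:.3e}  improvement: {improvement:.3f}x")
```

The new mix is slightly worse per instruction, but the smaller instruction count more than makes up for it.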

As they point out on page 39, multiple items impact the performance of a program.
- The algorithm has an impact on instruction count and CPI (through the instruction mix).
- The language used impacts both of these.
- The compiler has an impact on both of these.
- The CPU has an impact on all three factors: instruction count, CPI, and clock rate.