2016嵌入式系統作業w2 - raytracing效能改進與分析
作業(A)
目標
- 以
make PROFILE=1
重新編譯程式碼,並且學習 gprof- 以
gprof
指出效能瓶頸,並且著手改寫檔案math-toolkit.h
在內的函式實做,充分紀錄效能差異在共筆- 注意: 請勿更動編譯器最佳化選項
-O0
(關閉最佳化)- 檔案 math-toolkit.h 定義的若干數學操作函式很重要,可參考 SIMD optimized dot and cross product functions 和 2015q3 Homework #1 Ext
將你的觀察、分析,以及各式效能改善過程,並善用gnuplot
製圖。
gprof - Gun Profiler
Profiling能讓程式撰寫者了解程式花在各function的時間和function間的呼叫關係,並能有效找出與預期有落差的funciton,
改善整體效能。
gprof需要在compile時加入 -pg
flag,如此會告知 gcc
在compile時為每個function加入名為mcount ( or “_mcount” , or “__mcount” , depend on compiler and OS.)的函数
則每個funciton執行時都會call mcount,而mcount會存有一份calling graph,裡頭紀錄了child function和parent funciton的關係,呼叫時間以及呼叫次數等資訊。
利用make PROFILE=1
編譯時:
1 | ifeq ($(strip $(PROFILE)),1) |
$(strip $(PROFILE)), 1)
,strip去掉$PROFILE的開頭結尾空字符
比較$PROFILE==1?,成立時則將CFLAGS
和 LDFLAGS
加入 -pg
字串
因為在runtime會多執行mcount,故執行時間會比沒有加-pg
flag的慢許多。
1 | make PROFILE=1 |
執行 ./raytracing,得到gmon.out(只能用gprof來讀取),並把data
redirect到analysis.txt
1 | gprof raytracing gmon.out > analysis.txt |
利用 Graphviz 產生函數呼叫關係圖
Graphviz is open source graph visualization software. Graph visualization is a way of
representing structural information as diagrams of abstract graphs and networks. It has important
applications in networking, bioinformatics, software engineering, database and web design,
machine learning, and in visual interfaces for other technical domains.
1 | #安裝 |
- 利用gprof2dot,將
gmon.out
轉成graphviz
所用的dot language
1 | gprof raytracing | gprof2dot |
結果
1 | digraph { |
2.直接轉成圖片
1 | gprof raytracing | gprof2dot | dot -Tpng -o codemap.png |
analysis.txt
中分為幾個部份
- Flat Profile
The flat profile shows the total amount of time your program spent executing each function.
各column分別代表
time
function執行時間的百分比,加總起來應為100%。cumulative seconds
此function的執行時間(second),加上此funciton之前functions執行時間的總和。self seconds
此function獨自執行的時間,flat profile以此column為第一順位排序。calls
此function被called的次數,如果被called的次數沒辦法被統計或
This is the total number of times the function was called. If the function was never called, or the number of times it was called cannot be determined (probably because the function was not compiled with profiling enabled), the calls field is blank.self ms/call
代表 average number of milliseconds spent in this function per call, 若沒被profiled則為空。total ms/call
This represents the average number of milliseconds spent in this function and its descendants per call, if this function is profiled. Otherwise, this field is blank for this function. This is the only field in the flat profile that uses call graph analysis.name
funtion name.
由flat profile中可發現,主要的效能瓶頸在
function name | calls | second |
---|---|---|
dot_product | 69646433 | 1.19 |
subtract_vector | 56956357 | 0.58 |
multiply_vector | 31410180 | 0.34 |
修改 math-toolkit.h
可以發現 add_vector
, subtract_vector
, multiply_vectors
, multiply_vector
,dot_product
中的forloop都只有3個iteration,因為本作業強制使用-O0
flag(禁用最佳化),所以編譯器不會幫忙做loop unrolling。
loop unrolling的好處
- 減少index arthimetic operation,和end of loop test。
- 最小化branch penalty。
- 能讓instruction更能平行執行。
loop unrolling的壞處
- code size變大。
1 |
|
做完 loop unrolling後
1 |
|
整體執行時間快了1.1 second
利用inline
參考課程提示
GCC does not inline any functions when not optimizing unless you specify the ‘always_inline’
attribute for the function, like this:
/ Prototype.
inline void foo (const char) attribute((always_inline));The remainder of this section is specific to GNU C90 inlining.
因為關閉最佳化,所以function宣告為inline並不會有作用,但為了測試inline
對效能的影響,在Makefile中的CFLAGS
加上 -D__forceinline="__attribute__((always_inline))"
並把原本須告為static inline的functions後接加上 __forceinline
1 | static inline __forceinline |
GCC:
By declaring a function inline, you can direct GCC to make calls to that function faster. One way
GCC can achieve this is to integrate that function’s code into the code for its callers. This
makes execution faster by eliminating the function-call overhead; in addition, if any of the
actual argument values are constant, their known values may permit simplifications at compile time
so that not all of the inline function’s code needs to be included. The effect on code size is
less predictable; object code may be larger or smaller with function inlining, depending on the
particular case.
force inline + loop unrolling效能比較圖,可以看到比原本快了3.9 second
利用SIMD
SIMD是Single Instruction/Multiple Data的縮寫,相較於原本每個指令只處理單一data(SISD or Scalar operation),SIMD能夠在單一指令中處理多個data。
但SIMD只能處理特定的predefined processing pattern,例如以下情況是SIMD無法處理的。
根據Intel intrinstic guide列出的intrinstic instruction中屬於SIMD指令集的有
查看 /proc/cpuinfo
,發現本機cpu為i5-3320M
支援sse4.1
sse4.2
, avx
等SIMD指令集架構,因avx
是基於sse的延伸,且具有256 bit register,故用avx
來作SIMD加速。
AVX , YMM and XMM
XMM是舊有SEE指令集的暫存器,只有128 bits,AVX中則有16個256 bits AVX暫存器,和一個control/status register, MXCSR。
修改add_vector to SIMD
1 | static inline void add_vector(const double *a, const double *b, double *out) __forceinline |
可以看到時間不降反升,從0.10
-> 0.54
Reference
SIMD
evanjack2002 / 開發紀錄(A)
FastC++: Coding Cpp Efficiently
Introduction to Intel® Advanced Vector Extensions
Basic of SIMD Programming
Wiki about SSE/AVX