Hello!
I just thought of making an update to the parallelism in C++ series,
now that it has been about a week, and I have gotten lots of feedback.
After episode 1, a few people wondered
whether it is actually possible to encapsulate
those horrible-looking but powerful intrinsics,
into a nice template class interface.
And the answer is: Yes, yes it is.
In fact,
there is already considerable prior work.
For instance, there is a relatively new library,
Boost.SIMD, just for that purpose.
I highly recommend that you check it out.
It even does trigonometric functions.
I have not tested it,
but I read through its source code,
and it seems really, really neat.
It even has a remarkable test suite.
Thank you Josua Rieder for telling me.
And there are other libraries.
Libsimdpp by Povilas Kanapickas –
supports more SIMD architectures than Boost.SIMD does,
but it does not have as complete –
a library of mathematical functions as Boost.SIMD does.
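Just to give a rough idea of what that kind of encapsulation can look like,
here is a bare-bones sketch of my own.
It is not the actual interface of Boost.SIMD or libsimdpp,
and a real library would of course be a template over element type and vector width:

    #include <cstddef>       // std::size_t
    #include <xmmintrin.h>   // SSE intrinsics, available with GCC on x86

    // A tiny wrapper type: four packed floats behind ordinary operator syntax.
    struct vec4f
    {
        __m128 v;

        vec4f(float x) : v(_mm_set1_ps(x)) {}                     // broadcast a scalar
        explicit vec4f(__m128 raw) : v(raw) {}
        static vec4f load(const float* p) { return vec4f(_mm_loadu_ps(p)); }
        void store(float* p) const        { _mm_storeu_ps(p, v); }

        vec4f operator+(vec4f o) const { return vec4f(_mm_add_ps(v, o.v)); }
        vec4f operator*(vec4f o) const { return vec4f(_mm_mul_ps(v, o.v)); }
    };

    // With the intrinsics hidden, the calling code reads almost like scalar math.
    void scale_and_bias(float* data, std::size_t n, float scale, float bias)
    {
        for (std::size_t i = 0; i + 4 <= n; i += 4)
            (vec4f::load(data + i) * vec4f(scale) + vec4f(bias)).store(data + i);
        // A real library would also handle the leftover n % 4 elements,
        // other element types, and other vector widths.
    }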
Then, there was some confusion –
about the words I said about compilers and optimizations.
First, how good are compilers at SIMD optimizations?
The answer is, in general, really bad.
To paraphrase floodyberry's reply on Reddit,
to take advantage of the compiler's SIMD optimizations,
you must explicitly design and write your code –
as if you were writing SIMD intrinsics,
with attention to data layout, working in blocks and so on,
and then just hope the compiler does its thing.
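For illustration, this is roughly the kind of loop that a compiler
has a decent chance of vectorizing by itself,
because the data is contiguous, the pointers are declared non-aliasing,
and the iterations are independent.
This is just a generic example of mine, not code from the series:

    #include <cstddef>

    // Plain, "average" code: flat contiguous arrays, no aliasing (__restrict__),
    // independent iterations. GCC at -O3 will typically turn this loop into
    // SIMD instructions on its own, without any intrinsics in the source.
    void saxpy(float* __restrict__ out,
               const float* __restrict__ a,
               const float* __restrict__ b,
               float factor, std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i)
            out[i] = factor * a[i] + b[i];
    }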
But it is not reliable,
and your mileage depends on the compiler –
and even on the compiler version.
There is a good reason why x264,
a well-known free software library –
for H.264 video encoding,
has a significant portion of its source code –
written in assembler.
But I also said that the notion –
that you should not try to beat your compiler –
at optimizing, is hogwash.
In the general case, you should stick to that principle.
Very often, it does not matter –
whether the compiler generates optimal code or not.
If you are invoking system library functions,
like for instance, reading a file,
chances are that any minuscule improvements
that you might achieve –
by architecture-dependent expert hacker optimizations,
are totally irrelevant,
because the operating system calls are –
an order of magnitude slower than your code.
Or perhaps you might otherwise be able –
to change the entire algorithm in your program –
into a categorically faster one,
if you didn't taint the code with micro-optimizations –
that prevent you from seeing the big picture,
and as a consequence,
your code is not only slower than it could be,
but also unmaintainable,
because of the fragile micro-optimizations.
You see,
compilers are very good at optimizing regular code.
The more average your code is,
the better they are at optimizing.
For example, very often an extremely complicated –
and sophisticated template struct magic thing in C++ –
has been optimized down into –
just a couple of register movement instructions in assembler,
as I demonstrated in –
the C++ for C programmers lesson one video.
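As a tiny made-up illustration of that effect:

    // A pointless but harmless abstraction layer around a plain int.
    template<typename T>
    struct Wrapped
    {
        T value;
        Wrapped operator+(Wrapped other) const { return { T(value + other.value) }; }
    };

    int add(int a, int b)
    {
        // GCC at -O1 or higher compiles this to exactly the same code as
        // "return a + b;": the struct never exists at runtime.
        return (Wrapped<int>{a} + Wrapped<int>{b}).value;
    }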
And that is as it should be.
So I would just like to clarify that thing.
It is only when you get into –
a very sophisticated level of detail,
in performance-critical, self-contained, concise code,
that you might be able to outsmart the compiler,
and achieve tangible benefit,
if you know exactly what you are doing.
After episode 2, some people wanted to know –
how the performance changes –
if we use both threads and SIMD, but not GPU offloading.
Well, I had initially considered –
that a bit too niche a request,
but I made a chart anyway.
Here it is.
Episode 3 left one question hanging:
Why was the performance –
in the OpenMP and OpenACC versions so bad?
Well, first of all,
there was a misconception on my part.
In both versions, I had a clause –
that caused a couple of simple variables –
to be copied to the GPU memory.
This was actually a bad idea,
because what it means is that –
memory will be separately allocated on the GPU –
for those variables.
It is better to declare those variables as "firstprivate",
meaning they are treated like function parameters.
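In rough terms, the difference looks like this.
This is only an illustrative sketch with made-up variable names,
not the episode's actual code:

    // Hypothetical sketch; the variable names are invented for illustration.
    void render(float* out, unsigned npixels,
                float xscale, float yscale, float xcenter, float ycenter)
    {
        // Worse: listing the scalars in a data clause makes the runtime
        // allocate GPU memory for them and copy them over separately:
        //   #pragma acc parallel loop copyout(out[0:npixels]) \
        //                             copyin(xscale, yscale, xcenter, ycenter)

        // Better: firstprivate treats the scalars like function parameters,
        // so each GPU thread simply starts with its own initialized copy.
        #pragma acc parallel loop copyout(out[0:npixels]) \
                                  firstprivate(xscale, yscale, xcenter, ycenter)
        for (unsigned i = 0; i < npixels; ++i)
            out[i] = (i * xscale + xcenter) * (i * yscale + ycenter);  // stand-in math
    }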
Secondly, in the OpenMP version,
I had a "collapse" clause, which was a good idea,
but in the OpenACC version I didn't have it.
This was totally an oversight on my part.
I forgot to check whether OpenACC has something similar,
and yes it does.
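Roughly like this, again as a sketch rather than the episode's exact loops:

    // collapse(2) fuses both loops into one iteration space of xsize*ysize
    // items, giving the GPU far more parallel work to distribute than the
    // ysize iterations of the outer loop alone.
    void render_omp(float* out, unsigned xsize, unsigned ysize)
    {
        #pragma omp target teams distribute parallel for collapse(2) \
                    map(from: out[0:xsize*ysize])
        for (unsigned y = 0; y < ysize; ++y)
            for (unsigned x = 0; x < xsize; ++x)
                out[y*xsize + x] = float(x) * float(y);   // stand-in per-pixel math
    }

    // The OpenACC spelling of the same thing: the clause I had forgotten.
    void render_acc(float* out, unsigned xsize, unsigned ysize)
    {
        #pragma acc parallel loop collapse(2) copyout(out[0:xsize*ysize])
        for (unsigned y = 0; y < ysize; ++y)
            for (unsigned x = 0; x < xsize; ++x)
                out[y*xsize + x] = float(x) * float(y);
    }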
After fixing these two problems,
the OpenACC performance reached –
almost the same level as the CUDA version,
although it's still not quite there for some reason.
The OpenMP version is still slow though.
Why it is so slow is still a total mystery.
My guess is that the GCC implementation –
is still so very recent,
that it is immature and largely unoptimized,
and because of this,
it has a number of performance problems –
that have yet to be weeded out.
Quite a lot of people were a little disappointed –
that I only covered CUDA in the video, and not OpenCL.
For those who don't know,
CUDA is a GPU computing platform developed by NVidia in 2007.
It has only ever been supported by NVidia cards.
Users of cards from other manufacturers,
like ATI or Intel, had nothing of the sort.
And then OpenCL was developed in 2009.
OpenCL is now supported –
by a number of graphics card manufacturers,
and also by a few other sorts of compute devices.
It should obviously be the better choice, right?
Why did I not cover it in my video series?
There are three reasons.
Firstly, from the start –
my video series was going to be –
about ways to do parallelism in GCC –
without installing anything extra.
The focus was going to be on GCC, and nothing else.
Back then I had a graphics card –
that a friend lent to me.
The graphics card was the NVidia GTX 560,
which only has compute capability 2.
The code generated by GCC requires compute capability 3,
so I was unable to actually test the code generated by GCC,
except with extraordinary hacks.
So I grew genuinely curious, and tried out CUDA.
And it was surprisingly easy, and I thought to myself,
I must introduce this to people too.
A couple of months later,
I actually did buy an NVidia GTX 970,
using money contributed by Patreon supporters,
and I was finally able to try the GCC offloading code –
on the actual device,
but at that point my GCC installation was broken –
and none of the offloading versions worked.
It was only after GCC 7 was released,
that I tried reinstalling the offloading GCC builds,
and I figured out that ditching HSA support made it work,
so I began making the videos.
The second reason is,
when I wanted to check out OpenCL back then,
for some reason or another,
I had trouble getting it to work on my computer.
I can no longer remember what those problems were.
Whatever the case, it kept me from trying out OpenCL.
The third reason is that…
Well, look at OpenCL.
This is supposedly a "simple" example of using OpenCL.
It's ugly!
So much boilerplate code.
So much stuff you must set up –
and initialize and prepare and configure,
and functions you must call in order to accomplish anything.
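To give a concrete sense of the ceremony,
here is a condensed sketch I put together,
not the exact example shown on screen,
of what it takes just to double every element of one array:

    #include <CL/cl.h>
    #include <vector>

    int main()
    {
        const char* src =
            "__kernel void twice(__global float* d) { d[get_global_id(0)] *= 2.0f; }";
        std::vector<float> data(1024, 1.0f);
        size_t bytes = data.size() * sizeof(float), global = data.size();
        cl_int err;

        // Pick a platform and a device, create a context and a command queue.
        cl_platform_id platform;
        clGetPlatformIDs(1, &platform, nullptr);
        cl_device_id device;
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);
        cl_context ctx     = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);
        cl_command_queue q = clCreateCommandQueue(ctx, device, 0, &err);

        // Compile the kernel source at runtime and create a kernel object.
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, nullptr, &err);
        clBuildProgram(prog, 1, &device, nullptr, nullptr, nullptr);
        cl_kernel kern  = clCreateKernel(prog, "twice", &err);

        // Create a device buffer, pass it as the kernel argument, run, read back.
        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                    bytes, data.data(), &err);
        clSetKernelArg(kern, 0, sizeof(buf), &buf);
        clEnqueueNDRangeKernel(q, kern, 1, nullptr, &global, nullptr, 0, nullptr, nullptr);
        clEnqueueReadBuffer(q, buf, CL_TRUE, 0, bytes, data.data(), 0, nullptr, nullptr);

        // And then tear everything down again.
        clReleaseMemObject(buf); clReleaseKernel(kern); clReleaseProgram(prog);
        clReleaseCommandQueue(q); clReleaseContext(ctx);
    }

And that is with all error checking omitted.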
Nope.
Nope, nope, nope.
That's not how you do usability for programmers.
That's just not my style.
It does not make for a good, easily understandable –
and entertaining lesson in a YouTube video.
You do that kind of thing when you have no choice,
when it's just the way things have to be done.
But it is not the way that is fun to code –
and to learn.
But there is one more way to leverage GPU power.
That is, by using OpenGL, or maybe Vulkan,
or a certain vendor-specific library.
Compute shaders can be used –
to do arbitrary calculations on the GPU,
and the best thing is,
they work on nearly all graphics cards and operating systems.
The downside is that –
compute shaders are used exclusively for GPUs.
You cannot use them to offload computation –
to coprocessors like the Xeon Phi.
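For a taste of what that looks like, here is a hypothetical sketch:
the GLSL compute shader itself, plus the OpenGL 4.3 calls that dispatch it.
It assumes a GL context already exists, a loader such as GLEW provides
the function pointers, and the shader has been compiled and linked into "program":

    #include <GL/glew.h>   // any loader exposing the OpenGL 4.3 functions will do

    // A hypothetical compute shader: each invocation handles one element
    // of a storage buffer, with no graphics pipeline involved at all.
    static const char* computeSrc = R"(
        #version 430
        layout(local_size_x = 64) in;
        layout(std430, binding = 0) buffer Data { float values[]; };
        void main()
        {
            uint i = gl_GlobalInvocationID.x;
            values[i] *= 2.0;   // arbitrary calculation of your choosing
        }
    )";

    // Assumes "program" was built from computeSrc with
    // glCreateShader(GL_COMPUTE_SHADER), and "ssbo" already holds the data.
    void run(GLuint program, GLuint ssbo, GLuint numElements)
    {
        glBindBufferBase(GL_SHADER_STORAGE_BUFFER, 0, ssbo);
        glUseProgram(program);
        glDispatchCompute(numElements / 64, 1, 1);        // one workgroup per 64 items
        glMemoryBarrier(GL_SHADER_STORAGE_BARRIER_BIT);   // make the writes visible
    }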
So why did I not cover compute shaders in my video series?
Simply because the focus was,
from the beginning,
how to do parallelism in GCC.
And libraries like OpenGL or Vulkan –
do not come supplied with GCC,
nor are they part of the C++ standard.
That, and I was only vaguely aware –
of the existence of compute shaders –
as a separate and usable class of shaders,
compared to, say, vertex shaders.
But now you do know, and you can go ahead –
and study and learn more!
Finally, there were one or more requests –
to see the different renderers in real time.
So here is a desktop recording –
of each program rendering the fractal –
in real time at 480p resolution.
Do note that because I'm also using the same computer –
to record the desktop and to transcode into a video file,
I am not getting as good performance –
as I would if the fractal renderer were the only program running.
And that is the extent of material –
I had prepared for this video.
So, thank you for watching, and see you next time, bye!