Discussion:
[fluid-dev] Parallelize rendering using openMP
Tom M.
2018-04-11 18:59:29 UTC
Permalink
I would like to rework FluidSynth's parallel audio rendering. The current implementation is not very efficient; the synchronization overhead seems to dominate. At least for me, most of the time only one core is fully active. I would like to get rid of the manual thread handling in rvoice_mixer and use OpenMP to attempt a simpler, yet more efficient implementation. I think it should be possible without any hard synchronization between threads. In case the OpenMP implementation performs better than the current one, this will be provided as a feature for the next major version.

This means that, in order for FluidSynth to be able to render audio in parallel, it needs to be compiled with a compiler that supports at least OpenMP 3.0, because I need the omp task directive. OpenMP 3.0 was released in 2008. Nowadays all major compilers (clang, gcc, Intel, IBM, Sun Studio) support this version... except Microsoft's VC compiler, of course, which is still lagging behind with OpenMP 2.0 (released 2002), and apparently they have no plans to support any newer version [1] [2].

Just to be clear: OpenMP would be an optional dependency. You can continue to compile FluidSynth without it; it would just not be capable of parallel rendering.

Any thoughts on that? In case anybody is interested, you may follow the current implementation progress: [3]

Reference issue with some nice pictures illustrating the basic implementation of rvoice_mixer: [4]

Tom

[1] https://visualstudio.uservoice.com/forums/121579-visual-studio-2015/suggestions/2276847-support-openmp-3-0-or-3-1
[2] https://visualstudio.uservoice.com/forums/121579-visual-studio-ide/suggestions/13495731-add-support-for-openmp-4-5-to-vc
[3] https://github.com/FluidSynth/fluidsynth/compare/openmp
[4] https://github.com/FluidSynth/fluidsynth/issues/197
Ceresa Jean-Jacques
2018-04-12 14:16:21 UTC
Permalink
Hi, Tom

 
I would like to rework fluidsynths parallel audio rendering. The current implementation is not very efficient, the synchronization overhead seems to dominate. At least for me, most of the time only one core is fully active.
 

I am surprised that you get only one core active most of the time.

Are you using a very fast machine? Did you ask FluidSynth to play a sufficient number of notes?

 

I tested this on a 4-core CPU (from 1 to 4 cores), and on each successive test we see one more core in use (using Windows XP).

 

Here is the table of results showing that the current implementation works well:

The CPU load of one voice (voice(%)) is inversely proportional to the number of cores in use.


 

Using 1 core:

--------------
prof_set_notes 800
prof_start 1 1000
Generating 800 notes, generated voices:800
Number of measures(n_prof):1, duration of one mesure(dur):1000ms

Profiling time(mm:ss): Total=0:1  Remainder=0:1, press to cancel
 ------------------------------------------------------------------------------
 Cpu loads(%) (sr: 44100 Hz, sp: 22.68 microsecond) and maximum voices
 ------------------------------------------------------------------------------
 nVoices| total(%)|voices(%)| reverb(%)|chorus(%)| voice(%)|estimated maxVoices
 -------|---------|---------|----------|---------|---------|-------------------
     800|  100.308|   99.226|     1.083|    0.000|    0.123|              801

 

Using 2 cores:

--------------
prof_set_notes 800
prof_start 1 1000
Generating 800 notes, generated voices:800
Number of measures(n_prof):1, duration of one mesure(dur):1000ms

Profiling time(mm:ss): Total=0:1  Remainder=0:1, press to cancel
 ------------------------------------------------------------------------------
 Cpu loads(%) (sr: 44100 Hz, sp: 22.68 microsecond) and maximum voices
 ------------------------------------------------------------------------------
 nVoices| total(%)|voices(%)| reverb(%)|chorus(%)| voice(%)|estimated maxVoices
 -------|---------|---------|----------|---------|---------|-------------------
     800|   55.095|   53.999|     1.096|    0.000|    0.067|             1481

 

Using 3 cores:

--------------
prof_set_notes 800
prof_start 1 1000
Generating 800 notes, generated voices:800
Number of measures(n_prof):1, duration of one mesure(dur):1000ms

Profiling time(mm:ss): Total=0:1  Remainder=0:1, press to cancel
 ------------------------------------------------------------------------------
 Cpu loads(%) (sr: 44100 Hz, sp: 22.68 microsecond) and maximum voices
 ------------------------------------------------------------------------------
 nVoices| total(%)|voices(%)| reverb(%)|chorus(%)| voice(%)|estimated maxVoices
 -------|---------|---------|----------|---------|---------|-------------------
     800|   36.216|   35.079|     1.137|    0.000|    0.043|             2287

 

Using 4 cores:

--------------
prof_set_notes 800
prof_start 1 1000
Generating 800 notes, generated voices:800
Profiling time(mm:ss): Total=0:1  Remainder=0:1, press to cancel
 ------------------------------------------------------------------------------
 Cpu loads(%) (sr: 44100 Hz, sp: 22.68 microsecond) and maximum voices
 ------------------------------------------------------------------------------
 nVoices| total(%)|voices(%)| reverb(%)|chorus(%)| voice(%)|estimated maxVoices
 -------|---------|---------|----------|---------|---------|-------------------
     800|   28.083|   26.960|     1.122|    0.000|    0.033|             2996

 

Sorry for the ugly table formatting.

jjc.

 
Marcus Weseloh
2018-04-12 14:40:04 UTC
Permalink
Hi Tom,
Post by Tom M.
I would like to rework fluidsynths parallel audio rendering. The current
implementation is not very efficient, the synchronization overhead seems to
dominate.
How did you come to the conclusion that the synchronization overhead
dominates? Did you actually measure the overhead somehow?

[...]
Post by Tom M.
Any thoughts on that? In case anybody is interested, you may follow the
current implementation progress: [3]
Well, for starters I like that your implementation leads to less code. That
is nearly always a good sign, I think. I'm partial to the new OpenMP
dependency, but I don't know if it would cause problems for Android or iOS
platforms.

I do wonder though why OpenMP can do a better job than the current code.
Surely OpenMP also spawns threads in the background and uses conditions /
mutexes behind the scenes. What is it in the current synchronization that
makes it so inefficient?

Cheers,

Marcus
Tom M.
2018-04-14 15:58:36 UTC
Permalink
Thanks for the feedback so far.
Post by Ceresa Jean-Jacques
Are you using a very fast machine? Did you ask FluidSynth to play a sufficient number of notes?
Post by Marcus Weseloh
How did you come to the conclusion that the synchronization overhead dominates?
Admittedly this might be a wrong/premature conclusion based on my observations plus looking at the source code. I took a look at the call graph generated with valgrind --tool=callgrind ./fluidsynth. Synchronization functions like g_mutex_lock() or g_cond_wait() are called quite often by fluid_mixer_thread_func(), although callgrind also reports them as not that expensive. Still, I think it's worth evaluating what job OpenMP and other refactorings can do here. David Henningsson once told me that the parallel renderer was more like a (failed) experiment. So please see this current work as my little experiment.
Post by Marcus Weseloh
I do wonder though why OpenMP can do a better job than the current code.
OpenMP provides different scheduling strategies for processing for loops. Also, the restriction VOICES_PER_THREAD (== 8) to avoid threading overhead seems quite magic to me (it probably worked well when David tested it; still, why is 8 the right number?). Overall I'm not sure whether OpenMP alone can do a better job. It definitely reduces the complexity of the code. Additionally I want to revise the current implementation, e.g. by using a parallel logarithmic buffer reduction to mix audio between threads, or by rethinking data layout and memory accesses in general, hoping this makes it more efficient.

Tom
Marcus Weseloh
2018-04-16 08:33:35 UTC
Permalink
Hi Tom,
Post by Tom M.
Using FluidR3_GM.sf2 the cpu load looks better, but I'm yet quite far from the "perfect" scalability that your profiling interface gives you, JJC.

Well, assuming you are using "normal" MIDI files as input and test with
that, the advantages of parallel rendering are bound to be much smaller than
with an artificial test that exercises extreme cases, I think.
Post by Tom M.
Still I think it's worth evaluating what job openMP and other
refactorings can do here. David Henningsson once told me that the parallel
renderer was more like a (failed) experiment. So please see this current
work as my little experiment.

Please don't get me wrong, I also think that it's worth evaluating a better
parallel rendering implementation. So I am all for your approach of just
testing it out. I was just curious if you had done more measurements on
synchronization overhead.
Post by Tom M.
Additionally I want to revise the current implementation, like using a
parallel logarithmic buffer reduction to mix audio between threads or
rethinking data layout and memory accesses in general, hoping this makes it
more efficient.

That sounds good. Maybe it would also be worthwhile to look at the chorus
and reverb code. At least on my machine and with the setup I use, the
effects take up a large proportion of processing time. And that processing
is always single-threaded...

Cheers,

Marcus
Ceresa Jean-Jacques
2018-04-14 21:59:55 UTC
Permalink
Thanks for your answers.

 
Post by Tom M.
Apparently the soundfonts I used were not polyphonic enough.
Using FluidR3_GM.sf2 the cpu load looks better, but I'm yet quite far from the "perfect" scalability that your profiling interface gives you, JJC.
 

Indeed, your machine is fast, and in that case playing a MIDI file to simulate a note (voice) generator isn't efficient. This is why the profiling interface has its own note generator (but it is still limited to 256 x 16 notes!).

The most important thing is that FluidSynth is able to play a constant number of voices during the measurement. This gives consecutive measurements the same CPU load result, which makes any future performance measurements much more predictable.

Note: during my experiments, I initially noticed that results between consecutive measurements were not constant. I quickly realized that a background process was running. The job of this process was to save energy, and it was doing so by stealing CPU cycles! Of course, no reliable performance measurement is possible with this kind of job or service running silently behind the scenes.

 
Post by Tom M.
Additionally I want to revise the current implementation, like using a parallel logarithmic buffer reduction to mix audio between threads or rethinking data layout and memory accesses in general, hoping this makes it more efficient.
Interesting. Looking at the code (in the past), I noticed that a lot of things could perhaps be enhanced around the following subjects:

1) Avoiding mutual access to the "active list of voices" between the "primary task" and the pool of "extra tasks":

   - breaking the unique list into a local list for each task;

   - load balancing (the same number of voices in each local list).

2) Optimizing the mixing of buffers between the "primary" task and the "extra tasks" (to avoid the possible synchronization overhead domination).

3) Optimizing fluid_cond_signal() / fluid_cond_wait() in cases where the associated mutex is pointless.

Of course, all this is easier said than done :).

jjc
Ceresa Jean-Jacques
2018-04-15 00:34:05 UTC
Permalink
Hi,
Post by Ceresa Jean-Jacques
(but it is still limited to 256 x 16  notes !)
Oops, I must correct myself: please read 128, as of course there are only 128 notes maximum per MIDI channel.

jjc
Carlo Bramini
2018-04-15 09:31:50 UTC
Permalink
Hello,
do you think that it would be possible to implement this in addition to the existing code?
For example, something like:

#if THREADING_MODE_RVOICE == MODEL_SYNC_OPENMP3
<bla bla bla>
...
#elif THREADING_MODE_RVOICE == MODEL_SYNC_THREAD
<bla bla bla>
...
#else // THREADING_MODE_RVOICE == MODEL_SYNC_NONE
<bla bla bla>
...
#endif

Sincerely.
Tom M.
2018-04-15 09:58:02 UTC
Permalink
Post by Carlo Bramini
do you think that it would be possible to implement this thing in addition to existing code?
Technically it would be possible, but having three implementations around would make this part completely unmaintainable. So no, sorry. Why are you asking?

Tom
Carlo Bramini
2018-04-17 10:18:33 UTC
Permalink
Hello,
Post by Tom M.
Post by Carlo Bramini
do you think that it would be possible to implement this thing in addition to existing code?
Technically it would be possible, but having three implementations around would make this part completely unmaintainable. So no, sry. Why are you asking?
Just for portability.
Sincerely.
Ceresa Jean-Jacques
2018-04-16 10:31:41 UTC
Permalink
Hi,

 
Post by Marcus Weseloh
Maybe it would also be worthwhile to look at the chorus and reverb code. At least on my machine and with the setup I use, the effects take up a large proportion of processing time. And that processing is always single threaded...
 

Effects processing (reverb, chorus) is executed after the rendering of all voices (which is single- or multi-threaded). The effects are single-threaded only because they are common to all voices.

For information, the internal code of the current reverb (Freeverb: https://ccrma.stanford.edu/~jos/pasp/Freeverb.html) processes 8 comb filters + 8 first-order low-pass filters + 4 all-pass filters + a stereo unit. When executed on a CPU with a math coprocessor, the whole reverb load is equivalent to about 4.1 times the load of one voice.

The chorus is N time-variant delay lines (modulated by an LFO). It is special in that its CPU load depends on the number of delay lines chosen.

 
Post by Marcus Weseloh
At least on my machine and with the setup I use, the effects take up a large proportion of processing time.

When you get the chance, if the hardware you use is dedicated to a standalone synthesizer, could you run the following profiling commands and return the results (with the CPU model)?

 

# for reverb / chorus performance measurement
prof_set_notes 100  # must be set so that you get "total(%)" result never above 100%
chorus on
reverb on
prof_start 5 1000
 

cheers.

jjc

 

 

 
Marcus Weseloh
2018-04-19 21:14:13 UTC
Permalink
Hi JJC,

2018-04-16 12:31 GMT+02:00 Ceresa Jean-Jacques:
Post by Marcus Weseloh
At least on my machine and with the setup I use, the effects take up a
large proportion of processing time.
Post by Ceresa Jean-Jacques
If the hardware you use is dedicated to a standalone synthesizer, could
you run the profiling commands and return the results (with the CPU model)?
Here are the profiling results for my embedded system. It's an Allwinner
A20 based board (dual-core Cortex-A7 ARM), 960 MHz CPU frequency, 1 GB
memory, running Linux 4.14.12 with real-time patches. The whole system has
been optimised for low latency, not for high polyphony. So normally
FluidSynth runs with a buffer size of 64, a buffer count of 2 and only on one
core. FluidSynth has been compiled from the dynamic-sample-loading branch
with -Denable-floats=1 and the gcc options -O3 and -ffast-math. With this
setup, I get the following:

 nVoices| total(%)|voices(%)| reverb(%)|chorus(%)| voice(%)|estimated maxVoices
 -------|---------|---------|----------|---------|---------|-------------------
     100|   80.134|   73.262|     4.941|    1.931|    0.721|              129

If I were to use two cores with buffer size 64 and buffer count 2, it looks
like this:

 nVoices| total(%)|voices(%)| reverb(%)|chorus(%)| voice(%)|estimated maxVoices
 -------|---------|---------|----------|---------|---------|-------------------
     100|   46.175|   39.162|     5.131|    1.882|    0.381|              244

And with buffer size 1024, buffer count 2, and two cores, I get this:

 ------------------------------------------------------------------------------
 nVoices| total(%)|voices(%)| reverb(%)|chorus(%)| voice(%)|estimated maxVoices
 -------|---------|---------|----------|---------|---------|-------------------
     100|   36.994|   31.137|     4.290|    1.567|    0.305|              308


So it shows that when it comes to performance optimisation, having good
measurements is vital. Reverb and chorus take far fewer CPU cycles than I
thought. Thanks for giving us this great profiling interface, JJC!

Cheers,

Marcus
Tom M.
2018-04-20 19:56:15 UTC
Permalink
I'm pretty much done implementing my ideas for an OpenMP parallelization. In short: I was not able to beat the current implementation.

While a parallel logarithmic buffer reduction sounds nice indeed, it has the huge disadvantage that there needs to be a barrier waiting for all threads to finish rendering before the reduction can start. Using the profiling interface, where all voices just keep playing, this barrier is actually pretty minor. But for real MIDI files, where voices start and stop unpredictably, this barrier dominates everything. David actually did a very good job here by efficiently using thread idle time to mix in any finished buffers.

Only by getting rid of those barriers using the nowait clause (which consequently breaks rendering) was I able to be slightly faster. This convinced me that the current parallelization approach is the right way to go, even if it involves a lot of manual thread handling. And the little synchronization done currently is absolutely negligible compared to my approach.

So for now I'm withdrawing the idea of switching to OpenMP. Perhaps I'll investigate whether aligned memory accesses, which allow auto-vectorization by the compiler, turn out to be more beneficial.

Tom
Marcus Weseloh
2018-04-22 18:16:56 UTC
Permalink
Hi Tom,
Post by Tom M.
I'm pretty much done implementing my ideas of an openMP parallelization.
In short: I was not able to beat the current implementation.
Thank you very much for trying it out anyway! I think we still got quite a
few benefits from it. Better insight into the existing code, and the knowledge
that the current approach actually makes sense and compares well
against OpenMP, are probably the biggest.

Cheers,

Marcus
