
Topic: OpenMP parallelization

We should refactor some assembly methods at the engineering model level. Consider, for example, the implementation of EngngModel :: assembleVectorFromBC (simplified):

    for ( int i = 1; i <= nbc; ++i ) {
        GeneralBoundaryCondition *bc = domain->giveBc(i);
        if ( ( abc = dynamic_cast< ActiveBoundaryCondition * >(bc) ) ) {
            // active bc assembles its contribution itself
            va.assembleFromActiveBC(answer, *abc, tStep, mode, s, eNorms);
        } else if ( ( bodyLoad = dynamic_cast< BodyLoad * >(bc) ) ) { // Body load:
            // compute contrib and its location array loc
            answer.assemble(contrib, loc);
        } else if ( ( sLoad = dynamic_cast< SurfaceLoad * >(bc) ) ) { // Surface load:
            // compute contrib and loc
            answer.assemble(contrib, loc);
        }
    }

The problem with this code is the following: if we mark the loop over the bcs as parallel (using #pragma omp parallel for), we have branching inside with independent sections where the actual assembly into the destination is performed (the answer.assemble calls). Even if we mark each of these sections as critical, we can still get a wrong result, because threads processing different BC types can be inside different (and mutually independent) critical sections at the same time and assemble into the vector concurrently.
I see two different solutions:
1) make the FloatArray assemble method thread safe (my preferred solution; see the sketch below)
2) split the loop into independent loops over the particular BC types (surface, edge, etc.) and parallelize those loops
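
For illustration, a minimal sketch of what option 1 could look like, assuming the current body of FloatArray :: assemble is essentially answer.at(loc.at(i)) += fe.at(i); the use of omp atomic and givePointer here is just my proposal, not existing code:

    void FloatArray :: assemble(const FloatArray &fe, const IntArray &loc)
    {
        double *v = this->givePointer(); // raw, 0-based storage
        int n = fe.giveSize();
        for ( int i = 1; i <= n; ++i ) {
            int ii = loc.at(i);
            if ( ii ) { // zero code number means the entry is not assembled
                double contrib = fe.at(i);
                // atomic update makes concurrent assembly from several threads safe
                #pragma omp atomic
                v[ii - 1] += contrib;
            }
        }
    }

The atomic update costs something even in serial runs, so we may want to keep this as a separate method (e.g. a hypothetical assembleAtomic) or guard it with omp_in_parallel().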

Actually, we can apply the same principle to sparse matrices: they could implement a much smarter locking policy than marking the whole assembly as a critical section.
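
To sketch what I mean (everything below is hypothetical, in particular the class name and the rowLocks member; the real SparseMtrx interface would need to be adapted): one lock per matrix row already lets threads assembling into different rows proceed concurrently.

    #include <omp.h>

    // Hypothetical sparse matrix assembly with per-row locking.
    // rowLocks is an assumed member (one omp_lock_t per row), initialized
    // with omp_init_lock() when the matrix profile is built.
    int MySparseMtrx :: assemble(const IntArray &loc, const FloatMatrix &mat)
    {
        int n = loc.giveSize();
        for ( int i = 1; i <= n; ++i ) {
            int ii = loc.at(i);
            if ( ii == 0 ) {
                continue; // prescribed dof, nothing to assemble
            }
            omp_set_lock(& rowLocks[ii - 1]); // serialize access to row ii only
            for ( int j = 1; j <= n; ++j ) {
                int jj = loc.at(j);
                if ( jj ) {
                    this->at(ii, jj) += mat.at(i, j);
                }
            }
            omp_unset_lock(& rowLocks[ii - 1]);
        }
        return 1;
    }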


Any comments or suggestions are welcome.

Re: OpenMP parallelization

I think there is a third option here, and that's to use reduction. This would be completely lock free and require basically no changes, so it would almost certainly perform the best. Since it's just arrays, the additional memory usage (one array per thread) is completely negligible.

Here is an example of a custom reduction (which is basically what we would end up doing)
https://gist.github.com/eruffaldi/7180b … 0ba9a2d68c

The only concern would be which version of OpenMP each compiler supports; user-defined reductions need OpenMP 4.0, but the version of GCC that we already require (4.9+) supports that.

But we might still want to split the active BCs into their own loop, as they might want to parallelize internally.


----------
As for matrix assembly, I don't think it can be improved; the last time I measured it, the "assemble" itself was so fast compared to computing the Ke matrices that the critical section had no measurable performance penalty.

You also wouldn't be able to do anything different for external matrices, like PETSc, and even for the built-in ones I think it would be incredibly difficult to make anything smart.

And reduction can't be used for matrices either, because of the additional memory cost (nor would it be very efficient).

Re: OpenMP parallelization

// We should be able to stick this into floatarray.h
#pragma omp declare reduction(+ : FloatArray : omp_out += omp_in)

FloatArray ans(neq);
#pragma omp parallel for reduction(+ : ans)
for ( int i = 1; i <= nbc; ++i ) {
    // assembly to "ans"
}

answer += ans; // we might as well use the return value for this instead of passing the "answer" reference
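
One detail to verify (this is my reading of the OpenMP rules, not something I have tested): without an initializer clause, the thread-private copies of "ans" are default-constructed, i.e. empty, so assembling into them would go out of bounds. Sizing and zeroing them explicitly should fix that, assuming FloatArray(int) zero-initializes (I believe it does):

#pragma omp declare reduction(+ : FloatArray : omp_out += omp_in) initializer(omp_priv = FloatArray(omp_orig.giveSize()))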


Re: OpenMP parallelization

I am not very satisfied with the current scalability of the assembly routines.
I have experimented a bit with so-called coloring, where two elements get the same color only when they do not share any node. Processing the elements of one color therefore produces no race condition on the data, so no locking is needed within a color. I believe this approach should yield superior performance to what we have now.
Unfortunately, I have not been able to obtain satisfactory scalability so far. I am still exploring the problem; it seems to be caused by false sharing, and I am looking into how to work around it.
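
For reference, the structure I am experimenting with looks roughly like this (giveElementsWithColor is a hypothetical helper returning the element numbers of one color class; the real code is more involved):

    // Elements are pre-sorted into color classes so that no two elements of the
    // same color share a node (and therefore no dofs). Within one color, the
    // element contributions can be assembled without any locking.
    for ( int c = 0; c < nColors; ++c ) {
        const IntArray &elems = giveElementsWithColor(c); // hypothetical helper
        int ne = elems.giveSize();
        #pragma omp parallel for
        for ( int k = 1; k <= ne; ++k ) {
            Element *e = domain->giveElement( elems.at(k) );
            FloatMatrix Ke;
            IntArray loc;
            e->giveLocationArray(loc, s);
            // ... compute the element contribution Ke ...
            answer.assemble(loc, Ke); // no lock needed: same-color elements share no rows
        }
    }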