Topic: Iterative solver are too slow, parallel tests take forever

Running the test suite for parallel problems takes forever.
At first I thought these tests were simply too large (brazil_2d_nl4 takes over 10 minutes to finish on my computer), but as the input files reveal, there aren't really that many elements (perhaps a bit on the large side).
Changing from a KSP solver to a direct solver (LU factorization), using the Spooles library (through PETSc), reduced the execution time from 25 minutes to 5 seconds, a 300-fold speedup
(flags used: -ksp_type preonly -pc_type lu -pc_factor_mat_solver_package spooles).

Some options:
1. Just use a direct solver by default. Any direct solver should work, but PETSc doesn't include one by default, so it has to be specified. Tests will fail for anyone who doesn't have the library compiled with PETSc.
2. Make the code choose a suitable linear solver automatically (say, Cholmod if it's a symmetric positive definite problem, Spooles if it's not, and in the worst case some built-in KSP solver). That way it will be fast for anyone who compiled PETSc with suitable libraries.
3. Make the tests smaller, to the point where they are fast even with a really slow KSP solver. However, even the very tiny "4test" takes 4 seconds, so this might not be a good plan.
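
Option 2 could look roughly like the Python sketch below. It is only an illustration of the selection logic, not existing OOFEM code: the function name and the `available_packages` argument are hypothetical, and the Cholmod option string is an assumption patterned on the Spooles flags quoted above.

```python
def petsc_solver_flags(is_spd, available_packages):
    """Return PETSc command-line options for the linear solver.

    is_spd: True if the system matrix is symmetric positive definite.
    available_packages: names of direct-solver packages PETSc was built
    with (hypothetical input; how to detect them is left open here).
    """
    if is_spd and "cholmod" in available_packages:
        # Assumed Cholmod variant of the Spooles flags from the post.
        return ["-ksp_type", "preonly", "-pc_type", "cholesky",
                "-pc_factor_mat_solver_package", "cholmod"]
    if "spooles" in available_packages:
        # Flags taken from the post above.
        return ["-ksp_type", "preonly", "-pc_type", "lu",
                "-pc_factor_mat_solver_package", "spooles"]
    # Worst case: add no options, so PETSc uses its built-in KSP defaults.
    return []


print(" ".join(petsc_solver_flags(False, {"spooles"})))
```

Returning an empty list in the fallback case keeps the tests runnable, if slow, for anyone without the extra libraries.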


As a side note, I'd like to make a statement regarding the test suite:
The test suite should contain only the minimum necessary to make sure the code is running correctly. This usually means 1-10 elements in 1-2 time steps, and hopefully results in < 0.5 seconds execution time.
Anything longer would discourage developers from running the test suite, rather than making it a habit.
We can introduce another suite for benchmark problems if necessary, where I would have.


Re: Iterative solver are too slow, parallel tests take forever

Hi,
regarding the parallel tests and the time needed to run them: if you have a multicore (or parallel) computer, this works fine. For example, see the test run on my workstation (8 cores, see below); it took only about 86 seconds to run the parallel tests. The parallel tests are extremely slow when executed on a computer with fewer available cores than required (I don't know why, but they are then much slower than in the sequential version).

Regarding the tests in general, I agree in principle that the tests should be as small as possible and focused on testing a particular functionality. On the other hand, I think that we should also have a set of more complex tests that are intended to test the code in a more complex way, illustrate the use of the code for complex cases, serve as benchmarks, or test parallel scalability. Some of the existing parallel tests are in this category.

Perhaps we can split existing tests into regular tests (they will remain in the same directory) and into benchmarks, where more complex and demanding tests will be moved.

-------------------
Test project /home/bp/oofem/build/poofem-debug
      Start  1: 4test_test_np4
1/11 Test  #1: 4test_test_np4 ...................   Passed    1.22 sec
      Start  2: bar_test_np4
2/11 Test  #2: bar_test_np4 .....................   Passed    0.43 sec
      Start  3: barnl_test_np4
3/11 Test  #3: barnl_test_np4 ...................   Passed    0.81 sec
      Start  4: barnl_test_np2
4/11 Test  #4: barnl_test_np2 ...................   Passed    0.37 sec
      Start  5: brazil_test_np2
5/11 Test  #5: brazil_test_np2 ..................   Passed   22.83 sec
      Start  6: brazil_test_np4
6/11 Test  #6: brazil_test_np4 ..................   Passed   23.68 sec
      Start  7: brazil_test_np7
7/11 Test  #7: brazil_test_np7 ..................   Passed   25.20 sec
      Start  8: brazil3d_test_np7
8/11 Test  #8: brazil3d_test_np7 ................   Passed   10.88 sec
      Start  9: dyn_bar01_test_np2
9/11 Test  #9: dyn_bar01_test_np2 ...............   Passed    0.15 sec
      Start 10: dyn_bar02_test_np2
10/11 Test #10: dyn_bar02_test_np2 ...............   Passed    0.11 sec
      Start 11: dyn_bar03_test_np2
11/11 Test #11: dyn_bar03_test_np2 ...............   Passed    0.44 sec

100% tests passed, 0 tests failed out of 11
Total Test time (real) =  86.15 sec

Re: Iterative solver are too slow, parallel tests take forever

Borek,
The tests certainly don't seem to like my 4-core, 2.66 GHz CPU. I suspect some kind of performance bug in PETSc for that 25-minute simulation.

Also, I had plans to adapt the CMake scripts to allow for ctest-memcheck functionality (currently hindered because I used a Python wrapper script to run the test cases). Any test taking more than a few seconds would take a hundred times longer under Valgrind, basically killing that feature, which would otherwise be very valuable for finding bugs and memory leaks.
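
To put rough numbers on this, here is a small Python sketch that parses the timings from the 8-core ctest run posted earlier in this thread, lists the tests above the 0.5 second target, and scales the total by the hundredfold slowdown. The factor of 100 is only my rough estimate, not a measurement.

```python
import re

# Timings copied from the ctest run posted earlier in this thread.
CTEST_OUTPUT = """\
1/11 Test  #1: 4test_test_np4 ...................   Passed    1.22 sec
2/11 Test  #2: bar_test_np4 .....................   Passed    0.43 sec
3/11 Test  #3: barnl_test_np4 ...................   Passed    0.81 sec
4/11 Test  #4: barnl_test_np2 ...................   Passed    0.37 sec
5/11 Test  #5: brazil_test_np2 ..................   Passed   22.83 sec
6/11 Test  #6: brazil_test_np4 ..................   Passed   23.68 sec
7/11 Test  #7: brazil_test_np7 ..................   Passed   25.20 sec
8/11 Test  #8: brazil3d_test_np7 ................   Passed   10.88 sec
9/11 Test  #9: dyn_bar01_test_np2 ...............   Passed    0.15 sec
10/11 Test #10: dyn_bar02_test_np2 ...............   Passed    0.11 sec
11/11 Test #11: dyn_bar03_test_np2 ...............   Passed    0.44 sec
"""

TARGET_SECONDS = 0.5      # per-test target suggested in the first post
VALGRIND_SLOWDOWN = 100   # rough estimate, not a measured value

pattern = r"Test\s+#\d+:\s+(\S+)\s+\.+\s+Passed\s+([\d.]+)\s+sec"
timings = {name: float(sec) for name, sec in re.findall(pattern, CTEST_OUTPUT)}

total = sum(timings.values())
slow = [name for name, sec in timings.items() if sec > TARGET_SECONDS]

print("tests above target:", slow)
print("estimated memcheck time: %.1f hours" % (total * VALGRIND_SLOWDOWN / 3600))
```

Even on a machine where the parallel tests take under a minute and a half in total, the estimate for a memcheck run comes out at well over two hours.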

I would strongly suggest we introduce a benchmark folder, and possibly add something along the lines of "make benchmark".


Re: Iterative solver are too slow, parallel tests take forever

OK, the benchmark folder seems to be a good idea. Would you like to make this change already in the upcoming release?

Re: Iterative solver are too slow, parallel tests take forever

I can introduce the folders and prepare the tests and benchmarks by prepending "test_" and "benchmark_" to each test name.
This would allow running
ctest -R "test"
or
ctest -R "benchmark"
(of course, the word "test" is used redundantly inside many test names, so those should be renamed).
If you run plain "ctest" it will still run through all of them (there could be some disadvantages to that, but ctest doesn't support tests that don't run by default).
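
The reason for the renaming is that -R takes a regular expression that may match anywhere in the test name. A small Python sketch of the effect follows; the test names are made up for illustration, and `re.search` only stands in for ctest's own regex matching.

```python
import re

# Hypothetical test names after the proposed prefixing.
names = [
    "test_bar_np4",
    "benchmark_brazil_np4",
    "benchmark_4test_np4",  # still contains the word "test"
]


def select(pattern, names):
    """Mimic `ctest -R pattern`: keep names where the regex matches anywhere."""
    return [n for n in names if re.search(pattern, n)]


print(select("test", names))         # also picks up benchmark_4test_np4
print(select("^test_", names))       # anchored: only the real tests
print(select("^benchmark_", names))  # anchored: only the benchmarks
```

So either the redundant "test" has to go from the names, or the pattern should be anchored, as in ctest -R "^test_".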

Re: Iterative solver are too slow, parallel tests take forever

I need some more time for this, so I will recommend that we wait until after release 2.2

Re: Iterative solver are too slow, parallel tests take forever

I have introduced the benchmark folder and moved and renamed some tests appropriately.
There are still some things left to do:

  • Unnecessary tests; multiple tests seem to check the same element or model under a variation of loads. If we did this for every type of element we would have tons and tons of tests. At most one test per feature (unless there is a strong motivation for more).

  • Borderline problems. There are still a few which are uncomfortably large (memcheck will take too long). Specifically these tests:
    BN_cylinder.in
    isolinmoisture_cylinder.in
    tmpatch34.in
    tmpatch35.in
    should be minimized

  • Better names; Many tests are just named something along the lines of "patch", which doesn't say anything helpful.

  • I moved the largest tests to benchmark, but that leaves a bit of a gap in test coverage right now. Small-scale versions of bdam7 and xfem01 should be added to the tests again.

I will look into adding the memcheck functionality (which will somehow need to pass by the testhelper-script)

Re: Iterative solver are too slow, parallel tests take forever

So, today we got 7 new tests for the same material model, tm/nlisomoisture_*.in, all of which have 10 elements and 613 time steps, which is honestly exactly what I've been trying to move away from. This whole benchmark effort becomes counterproductive when the tests for this single material basically double the run time of an ever-growing test suite. I'm sure that these 7 alternatives can be merged into one test with 7 elements and perhaps 10 time steps (it doesn't have to reach steady state) that is still able to detect when something breaks (which is what I think the tests should be for).

So, to reiterate, here are the guidelines I added to the test cases (test/Guidelines.txt), which I felt were a good starting point (and which of course are up for debate).
-----------------------------------------------------------------------------------------------------------------------------------

The test suite allows for quality control by developers.
Test-suite guidelines:
- As few elements and time steps as possible (aim for elements < 5 and time steps < 5)
- Each test should test one specific component (or possibly test a few similar elements inside the same test file)
- There should be no redundant tests, e.g. maximum one test per element should suffice.

Larger benchmark problems may be added to the benchmark folder, but these should not be expected to be executed regularly by developers.
These benchmarks don't need to be very large. They can be simple showcase examples that should be included with the OOFEM distribution.
If a benchmark problem ever finds a problem that the test suite misses, a smaller version of the benchmark problem should be added to the test suite.
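
The size limits in these guidelines could also be checked automatically. Below is a minimal Python sketch, assuming the element and step counts can be read from the nelem and nsteps keywords of an input file; the function name is hypothetical and nothing like it exists in the test scripts.

```python
import re

MAX_ELEMENTS = 5  # guideline: aim for elements < 5
MAX_STEPS = 5     # guideline: aim for time steps < 5


def check_guidelines(input_text):
    """Return a list of guideline violations for one input file's text."""
    problems = []
    for keyword, limit, label in (("nelem", MAX_ELEMENTS, "elements"),
                                  ("nsteps", MAX_STEPS, "time steps")):
        # Assumption: the first occurrence of the keyword carries the count.
        match = re.search(rf"\b{keyword}\s+(\d+)", input_text, re.IGNORECASE)
        if match and int(match.group(1)) >= limit:
            problems.append(f"{match.group(1)} {label} (aim for < {limit})")
    return problems


sample = "nltransienttransportproblem nsteps 10 alpha 0.5\nndofman 28 nelem 7 nmat 7\n"
for problem in check_guidelines(sample):
    print(problem)
```

This only flags candidates for a closer look; a test may still have a good reason to exceed the limits.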

Re: Iterative solver are too slow, parallel tests take forever

Sorry, I do not visit this forum very often.
The tests for nlisomoisture have been added by Petr Havlasek, my PhD student.
I have sent him a message and he will simplify the tests.
I agree that the "mandatory" tests that are run each time the code is updated should be kept relatively short in terms of execution time.

Re: Iterative solver are too slow, parallel tests take forever

I have just talked to Petr. There is a good reason to have 7 tests for the "same" material model: The model is quite general and allows various choices of analytical expressions for the sorption isotherms and functions describing the dependence of permeability on relative humidity. Each test is run for a different choice of such functions and so it represents a different material model. The number of steps can probably be reduced. Each test takes only 3 seconds but it is true that the tests that are now kept as the basic ones typically take a fraction of a second. We will modify these tests in a few days.

Re: Iterative solver are too slow, parallel tests take forever

Dear Milan

I tried to explain in words what I meant by one single test, but I think it's easier to just show an example:

nltransienttransportproblem nsteps 10 alpha 0.5 rtol 1.e-10 lumpedcapa nsmax 10 prescribedtimes 10 ...
domain mass1transfer
OutputManager tsteps_out {1 5 10} dofman_output {4 8 12 16 20 24 28} element_output {1,2,3,4,5,6,7}
ndofman 28 nelem 7 ncrosssect 1 nmat 7 nbc 1 nic 1 nltf 1
#                x y z
node 1 coords 3  0 0 0 ic 1 1 bc 1 1
node 2 coords 3  1 0 0 ic 1 1 bc 1 1
node 3 coords 3  1 1 0 ic 1 1 
node 4 coords 3  0 1 0 ic 1 1 
node 5 coords 3  0 0 1 ic 1 1 bc 1 1
node 6 coords 3  1 0 1 ic 1 1 bc 1 1
node 7 coords 3  1 1 1 ic 1 1 
node 8 coords 3  0 1 1 ic 1 1 
... and so on for all 28 nodes, just offsetting the z-coordinate 
#
quad1mt 1 nodes 4  1 2  3  4 crossSect 1 mat 1 
quad1mt 2 nodes 4  5 6  7  8 crossSect 1 mat 2
quad1mt 3 nodes 4 9 10 11 12 crossSect 1 mat 3
... and so on for all 7 elements each with their own material model
#

This would be a suitable approach for, e.g., sm/spring0*.in (when the tests barely differ, the overhead of starting and initializing oofem several times is pretty large).
However, this material model might be a bit of a special case. It might be better to keep them as separate files.
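
If anyone wants to try the combined variant, the repetitive node and element records can be generated instead of typed by hand. A minimal Python sketch follows; it assumes the layers continue exactly as in the eight nodes shown above (bc only on the two y = 0 nodes of each quad, z equal to the layer index), which is my extrapolation of the pattern.

```python
def combined_test_lines(n_materials=7):
    """Generate node and element records for one quad per material."""
    lines = []
    corners = [(0, 0), (1, 0), (1, 1), (0, 1)]
    for layer in range(n_materials):
        for k, (x, y) in enumerate(corners):
            node = 4 * layer + k + 1
            line = f"node {node} coords 3  {x} {y} {layer} ic 1 1"
            if y == 0:
                # Assumed: the boundary condition sits on the y = 0 edge.
                line += " bc 1 1"
            lines.append(line)
    lines.append("#")
    for layer in range(n_materials):
        first = 4 * layer + 1
        nodes = " ".join(str(first + k) for k in range(4))
        lines.append(
            f"quad1mt {layer + 1} nodes 4  {nodes} crossSect 1 mat {layer + 1}"
        )
    return lines


print("\n".join(combined_test_lines()))
```

The header, material, boundary condition, and time function records still have to be written by hand.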

I also recommend adding a benchmark example of the model.


Unrelated news: I just got around to adding custom targets for CTest that run only the tests or only the benchmarks. The new targets are named "tests" and "benchmarks" respectively, and should be visible from within IDEs such as KDevelop or Visual Studio, or you can access them by doing

make tests
make benchmarks

(On Linux a target named "test", and in Visual Studio a target named RUN_TESTS, are also created automatically, which I can't do anything about. These targets just run the entire test suite.)

Re: Iterative solver are too slow, parallel tests take forever

Petr has simplified the tests of the nlisomoisture material; the time-consuming tests have been deleted.

It is useful to have the simple tests separated from benchmarks.

When I run the simple tests, I get failures for test_cemhyd01.in and test_cemhyd02.in. I think this
has been the case for some time already, and I wonder whether we should keep it like this.

For the benchmarks, I get a failure for benchmark_concrete_3point_direct.in. Is this expected?

Re: Iterative solver are too slow, parallel tests take forever

Hi.
test_cemhyd01 and test_cemhyd02 should work if you compile with USE_CEMHYD (which is off by default, since it requires additional libraries, namely libtinyxml2). I'll look into excluding the tests if cemhyd support isn't compiled in.

Also, benchmark_concrete_3point_direct should work, and in fact it probably does. It is common in many tests that the test tolerance is set too tight compared to the tolerance of the Newton solver.
The small difference in output for this large test is likely caused by my recent change, where I introduced dof-grouping in NRSolver (the difference in results is small enough to be that of a single additional Newton iteration).

You can see details like this by running ctest with the verbose flag: ctest -R benchmark_concrete_3point_direct -V

...
Error when checking rule 1: err = -0.006, value is 16.162 and correct value is 16.156
...

I'll fix it later.
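
For reference, the check behind that message is presumably something like the Python sketch below. This is not the actual checker code: the function name and both tolerance values are made up, and the sign convention (err = correct - value) is inferred from the numbers in the log line.

```python
def check_rule(value, correct, tolerance):
    """Return (passed, err), with err = correct - value as in the log line."""
    err = correct - value
    return abs(err) <= tolerance, err


# Numbers from the log line above; both tolerances are hypothetical.
strict_ok, err = check_rule(16.162, 16.156, tolerance=1e-3)
loose_ok, _ = check_rule(16.162, 16.156, tolerance=1e-2)

print("err = %.3f" % err)
print("passes with 1e-3:", strict_ok)
print("passes with 1e-2:", loose_ok)
```

The relative difference here is below 0.04 %, which is why I think it is a tolerance issue rather than a real regression.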