1. I am solving 3D finite-volume temperature fields with Intel Visual Fortran on Windows 7, using my own iterative methods. I have tried many variations of OpenMP directives but never got a speed-up of more than a factor of 2, even with 16 cores. What bothers me is that in the attached test program the CPU time recordings are so odd: the CPU time is more or less independent of NTHREADS, and NTHREADS=1 is faster than the ordinary serial loops.
I compile with ifort /c /Qopenmp test3.f90, link with link test3.obj, and run with test3. Test3.f90 is attached together with my test3.exe.
Test3.f90 is written the way it is because the typical sort of loop in my 'big' program looks like this:
do k=1,Nz
  do j=1,Ny
    do i=1,Nx
      c(i,j,k) = a(i,j,k)*b(i,j,k) + other matrix elements - other matrix elements
    enddo
  enddo
enddo
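For reference, a minimal self-contained version of what I have been trying looks like this (array names and sizes are just for illustration, and the body is reduced to the single product term):

program test_triple_loop
  implicit none
  integer, parameter :: Nx = 200, Ny = 200, Nz = 200
  real(8), allocatable :: a(:,:,:), b(:,:,:), c(:,:,:)
  integer :: i, j, k
  allocate(a(Nx,Ny,Nz), b(Nx,Ny,Nz), c(Nx,Ny,Nz))
  a = 1.0d0
  b = 2.0d0
  ! parallelize the outermost loop; i and j must be PRIVATE,
  ! k is private automatically as the parallel loop index
!$OMP PARALLEL DO PRIVATE(i,j)
  do k = 1, Nz
    do j = 1, Ny
      do i = 1, Nx
        c(i,j,k) = a(i,j,k)*b(i,j,k)
      enddo
    enddo
  enddo
!$OMP END PARALLEL DO
  print *, c(1,1,1)
end program test_triple_loop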
Q: Is this type of loop structure impeding the use of OpenMP, and how can I make it better? I also have loops like
do i = 1, N
  ! some arrays a(3,i)
enddo
which also do not run any faster with OpenMP.
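Spelled out, this second loop shape looks roughly like the following (again a sketch; the array contents and the loop body are placeholders):

program test_small_first_dim
  implicit none
  integer, parameter :: N = 1000000
  real(8), allocatable :: a(:,:)
  integer :: i
  allocate(a(3,N))
!$OMP PARALLEL DO
  do i = 1, N
    ! all three components of column i are touched together,
    ! so each thread works on contiguous columns
    a(1,i) = 1.0d0
    a(2,i) = 2.0d0
    a(3,i) = 3.0d0
  enddo
!$OMP END PARALLEL DO
  print *, a(1,1), a(3,N)
end program test_small_first_dim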
Q: Are there special compiler directives or options that would make this better?
Q: What should I use as diagnostics? (I have to admit that I compile and link the old-fashioned way with .bat files; I do not use the visual mode.)
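In case it matters, this is essentially how I time the loops (a reduced sketch; the array and loop are placeholders). I understand that cpu_time may report CPU time summed over all threads, whereas omp_get_wtime reports wall-clock time, so perhaps I am measuring the wrong thing:

program test_timing
  use omp_lib
  implicit none
  integer, parameter :: N = 50000000
  real(8), allocatable :: x(:)
  real(8) :: t_cpu0, t_cpu1, t_wall0, t_wall1, s
  integer :: i
  allocate(x(N))
  x = 1.0d0
  s = 0.0d0
  call cpu_time(t_cpu0)        ! CPU time: may accumulate over all threads
  t_wall0 = omp_get_wtime()    ! wall-clock time: what I actually wait for
!$OMP PARALLEL DO REDUCTION(+:s)
  do i = 1, N
    s = s + x(i)
  enddo
!$OMP END PARALLEL DO
  call cpu_time(t_cpu1)
  t_wall1 = omp_get_wtime()
  print *, 'sum      =', s
  print *, 'CPU time =', t_cpu1 - t_cpu0
  print *, 'walltime =', t_wall1 - t_wall0
end program test_timing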
Q: Can anyone tell me the main tripping hazards for a newbie in OpenMP?
2. On the positive side, other loops like the following scale with the number of threads as expected:
!$OMP PARALLEL PRIVATE(i,j,k) REDUCTION(+:prod)
!$OMP DO
do k=1,Nz
  do j=1,Ny
    do i=1,Nx
      prod = prod + a(i,j,k)*b(i,j,k)
    enddo
  enddo
enddo
!$OMP END DO
!$OMP END PARALLEL
Now I am confused and wonder what goes wrong with my matrix element multiplication.
Hoping someone can provide me with a key idea.
Best regards, Johannes