HECToR

FAQ: Error Messages and Debugging

This section describes some of the typical error messages you might encounter when running jobs, and how you may deal with them.

Exec /home/z03/z03/themos/pi failed: chdir /nfs01/z03/z03/themos No such file or directory
/usr/bin/ld: cannot find -lX11
Error - segmentation fault
Error - out of memory
Error - buffer issues with MPI
What tools are there to help with debugging my code?
How can I make my code dump a core file?
What does Exit Code xxx mean?
Why do I get LIBDMAPP ERROR/XT_SYMMETRIC_HEAP_SIZE errors when using Coarray Fortran?
How can I view how much memory my job is using?

Q. Exec /home/z03/z03/themos/pi failed: chdir /nfs01/z03/z03/themos No such file or directory?

A. Files on the /home filesystem, including programs and data, are not visible from the compute nodes. Make sure that everything your job needs exists on the /work filesystem and that you have supplied the correct filenames.
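
As a rough pre-flight check, a job script can confirm that the executable and input files live under /work before calling aprun. The sketch below is only illustrative: the helper name on_work_fs and the example paths are hypothetical, and it assumes absolute paths and a work filesystem mounted at /work.

```shell
# Hypothetical helper: succeeds only if the given absolute path lies under /work.
on_work_fs() {
    case "$1" in
        /work/*) return 0 ;;
        *)       return 1 ;;
    esac
}

# Example paths, made up for illustration.
for f in /work/z03/z03/themos/pi /home/z03/z03/themos/pi; do
    if on_work_fs "$f"; then
        echo "ok: $f"
    else
        echo "not visible from compute nodes: $f"
    fi
done
```

The check is purely textual, so it does not confirm that the file exists; it only catches paths that point at /home by mistake.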

Q. /usr/bin/ld: cannot find -lX11?

A. The linker is looking for non-shared ("static" or "archive") libraries but cannot find them. Perhaps you have specified a directory that contains shared objects only. The compute nodes cannot run code that contains shared objects and the linker will refuse to produce such code.
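
One way to check a library directory before linking is to look for the static archive directly. This is a minimal sketch: the helper name has_static_lib, the LIBDIR variable, and the default directory /usr/lib64 are assumptions for illustration, so substitute the directory given to -L on your own link line.

```shell
# Hypothetical helper: succeeds if DIR contains a static archive for library NAME.
has_static_lib() {
    dir=$1
    name=$2
    [ -f "$dir/lib$name.a" ]
}

# Placeholder directory; replace with the -L directory from your link line.
libdir=${LIBDIR:-/usr/lib64}
if has_static_lib "$libdir" X11; then
    echo "static libX11.a found in $libdir"
else
    echo "no libX11.a in $libdir; -lX11 cannot be satisfied statically from here"
fi
```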

Q. Error - segmentation fault?

A. These can be difficult to pinpoint, but one possible cause is that you are trying to run, on a login node, an executable that was compiled for the compute nodes.

Q. Error - out of memory?

A. If you are running fully packed (24 MPI processes per node) and using more than 1.33 GB per core, consider using fewer MPI processes per node so that each process has more memory available.
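
The 1.33 GB figure corresponds to a node's memory shared evenly among 24 processes. As a rough sketch, assuming 32 GB (32768 MB) per node, which is inferred here from 24 times 1.33 GB rather than stated above, the following estimates how many processes per node fit a given per-process memory requirement. The function name, the example requirement, and the executable name are hypothetical, and the estimate ignores memory used by the operating system.

```shell
NODE_MEM_MB=32768    # assumption: 32 GB of memory per node
CORES_PER_NODE=24    # fully packed node, as described above

# Print the largest number of processes per node that fits NEED_MB each.
max_procs_per_node() {
    need_mb=$1
    n=$((NODE_MEM_MB / need_mb))
    if [ "$n" -gt "$CORES_PER_NODE" ]; then
        n=$CORES_PER_NODE
    fi
    echo "$n"
}

ppn=$(max_procs_per_node 2048)   # e.g. each process needs about 2 GB
echo "use at most $ppn processes per node"
# aprun -n 64 -N "$ppn" ./my_app   # hypothetical launch line
```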

Q. Error - buffer issues with MPI?

A. Try increasing the value of the environment variable MPICH_UNEX_BUFFER_SIZE, which is 60MB by default. This enlarges the unexpected message buffer, which holds short messages sent using the eager messaging protocol before a corresponding receive has been posted. See the Good Practice Guide for Parallel Optimisation. Alternatively, manage your own buffer space by using MPI_Bsend and MPI_Buffer_attach.
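
In a job script this might look like the sketch below, which doubles the default. Giving the value in bytes is an assumption here, as are the process count and executable name; check the MPI man page on the system for the accepted format.

```shell
# Double the 60 MB default. Value expressed in bytes (assumption).
MPICH_UNEX_BUFFER_SIZE=$((120 * 1024 * 1024))
export MPICH_UNEX_BUFFER_SIZE
echo "MPICH_UNEX_BUFFER_SIZE=$MPICH_UNEX_BUFFER_SIZE"
# aprun -n 48 ./my_app   # hypothetical launch line
```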

Q. What tools are there to help with debugging my code?

A. You may use TotalView (see the TotalView user guide) and/or GDB (see the GDB documentation).

Q. How can I make my code dump a core file?

A. By default, the core file size limit on HECToR is zero, and therefore core dumps are disabled. In order to change this, use the ulimit command with the -c flag. For example, 'ulimit -c unlimited' allows core files of unlimited size (subject to quota restrictions) to be dumped. You should issue this command in your job script prior to calling aprun.
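
A minimal sketch of how this could sit in a job script follows. The helper name enable_core_dumps and the executable name are hypothetical; the helper simply wraps 'ulimit -c unlimited' and reports whether the shell accepted the new limit.

```shell
# Hypothetical helper: try to lift the core file size limit for this shell
# and for any processes it launches afterwards.
enable_core_dumps() {
    if ulimit -c unlimited 2>/dev/null; then
        echo "core dumps enabled (limit: $(ulimit -c))"
    else
        echo "could not raise the core file size limit" >&2
        return 1
    fi
}

enable_core_dumps || echo "continuing without core dumps"
# aprun -n 24 ./my_app   # hypothetical launch line; must come after the ulimit call
```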

Q. What does Exit Code xxx mean?

A. Exit codes are propagated by aprun from the application running on the compute nodes. If the application terminates successfully, aprun returns 0. If the application is terminated by a signal, the code returned by aprun is 128 plus the signal number. Two important and frequently occurring exit codes are 137 and 139.

137 indicates a SIGKILL signal (9) and usually means that the application ran out of memory on the compute node. In that case, try running the job on more processors, or try running in single core mode (the -N 1 option to aprun).

139 indicates a SIGSEGV signal (11), which typically means that the application tried to access an area of memory it should not. In that case the code needs to be debugged; a good first step is to recompile with bounds checking (see the man pages for the different compilers) and rerun.
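
The 128-plus-signal rule can be applied mechanically in a job script. The following is a small sketch; the function name describe_exit_code is hypothetical, and only the two signals discussed in this answer are given names.

```shell
# Hypothetical helper: interpret an exit code returned by aprun.
describe_exit_code() {
    code=$1
    if [ "$code" -eq 0 ]; then
        echo "success"
    elif [ "$code" -gt 128 ]; then
        sig=$((code - 128))
        case $sig in
            9)  echo "signal 9 (SIGKILL), often out of memory" ;;
            11) echo "signal 11 (SIGSEGV), invalid memory access" ;;
            *)  echo "signal $sig" ;;
        esac
    else
        echo "application exit status $code"
    fi
}

# In a job script this would follow the launch line, for example:
#   aprun -n 24 ./my_app; describe_exit_code $?
describe_exit_code 137
describe_exit_code 139
```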

Q. Why do I get LIBDMAPP ERROR/XT_SYMMETRIC_HEAP_SIZE errors when using Coarray Fortran?

A. If you get errors such as the following when running Coarray Fortran jobs:

LIBDMAPP ERROR: User error: Allocation request too large; must be less than or equal to 
XT_SYMMETRIC_HEAP_SIZE.

LIBDMAPP ERROR: User error: Sheap of size 0x4100000 is out of memory.
Increase setting of XT_SYMMETRIC_HEAP_SIZE.

then the symmetric heap (which is used to store coarrays and SHMEM data objects) has run out of space. As the messages suggest, increase the value of XT_SYMMETRIC_HEAP_SIZE. See the section on running Coarray Fortran jobs in the user guide for more information.
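
As a worked example, the size 0x4100000 in the second message is a byte count in hexadecimal. The sketch below converts it to MiB and requests twice that amount. The "M" suffix on the value and the choice to double are assumptions; the user guide is the authority on the accepted format and on sensible sizes.

```shell
# Size reported in the example error message, converted from hex bytes to MiB.
sheap_bytes=$((0x4100000))
sheap_mib=$((sheap_bytes / 1024 / 1024))
echo "symmetric heap in the failing run: ${sheap_mib} MiB"

# Request double the size. The "M" suffix is an assumption about the format.
XT_SYMMETRIC_HEAP_SIZE="$((sheap_mib * 2))M"
export XT_SYMMETRIC_HEAP_SIZE
echo "XT_SYMMETRIC_HEAP_SIZE=$XT_SYMMETRIC_HEAP_SIZE"
# aprun -n 24 ./my_caf_app   # hypothetical launch line
```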

Q. How can I view how much memory my job is using?

A. TotalView comes with an optional application for viewing memory statistics called MemoryScape. To use MemoryScape, add the following to your link line:

-L/opt/toolworks/totalview.8.11.0-0/linux-x86-64/lib -ltvheap_cnl_static

Then you must select the radio button "Enable Memory Debugging" at the start of your job. To view memory statistics select Debug and then Open MemoryScape. A new window will appear with a number of options for viewing different aspects of memory usage.
