This adds support for using CUDA managed memory with omp_alloc. AMD support
will be added in a future patch.
There is one new predefined allocator, "ompx_gnu_managed_mem_alloc", plus a
corresponding memory space, which can be used to allocate memory in the
"managed" space.
The nvptx plugin is modified to make the necessary CUDA calls via two new
(optional) plugin interfaces.
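For illustration, a minimal usage sketch (not taken from the new test cases;
the array size and loop body are arbitrary):

  #include <omp.h>
  #include <stdlib.h>

  int
  main (void)
  {
    enum { N = 1024 };
    /* Allocate CUDA managed memory via the new predefined allocator; the
       returned pointer is valid on both the host and the device.  */
    int *a = (int *) omp_alloc (N * sizeof (int), ompx_gnu_managed_mem_alloc);
    if (a == NULL)
      abort ();

    #pragma omp target teams distribute parallel for is_device_ptr (a)
    for (int i = 0; i < N; i++)
      a[i] = i;

    for (int i = 0; i < N; i++)
      if (a[i] != i)
        abort ();
    omp_free (a, ompx_gnu_managed_mem_alloc);

    /* Alternatively, build a custom allocator on the new memory space.  */
    omp_allocator_handle_t al
      = omp_init_allocator (ompx_gnu_managed_mem_space, 0, NULL);
    void *p = omp_alloc (64, al);
    omp_free (p, al);
    omp_destroy_allocator (al);
    return 0;
  }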
gcc/fortran/ChangeLog:
* openmp.cc (is_predefined_allocator): Use GOMP_OMP_PREDEF_ALLOC_MAX
and GOMP_OMPX_PREDEF_ALLOC_MIN/MAX instead of hardcoded values in the
comment.
include/ChangeLog:
* cuda/cuda.h (cuMemAllocManaged): Add declaration and related
CU_MEM_ATTACH_GLOBAL flag.
* gomp-constants.h (GOMP_OMPX_PREDEF_ALLOC_MAX): Update to 201.
(GOMP_OMP_PREDEF_MEMSPACE_MAX): New constant.
(GOMP_OMPX_PREDEF_MEMSPACE_MIN): New constant.
(GOMP_OMPX_PREDEF_MEMSPACE_MAX): New constant.
libgomp/ChangeLog:
* allocator.c (ompx_gnu_max_predefined_alloc): Update to
ompx_gnu_managed_mem_alloc.
(_Static_assert): Fix assertion messages for allocators and add
new assertions for memspace constants.
(omp_max_predefined_mem_space): New define.
(ompx_gnu_min_predefined_mem_space): New define.
(ompx_gnu_max_predefined_mem_space): New define.
(MEMSPACE_ALLOC): Add check for non-standard memspaces.
(MEMSPACE_CALLOC): Likewise.
(MEMSPACE_REALLOC): Likewise.
(MEMSPACE_VALIDATE): Likewise.
(predefined_ompx_gnu_alloc_mapping): Add ompx_gnu_managed_mem_space.
(omp_init_allocator): Add ompx_gnu_managed_mem_space validation.
* config/gcn/allocator.c (gcn_memspace_alloc): Add check for
non-standard memspaces.
(gcn_memspace_calloc): Likewise.
(gcn_memspace_realloc): Likewise.
(gcn_memspace_validate): Update to validate standard vs non-standard
memspaces.
* config/linux/allocator.c (linux_memspace_alloc): Add managed
memory space handling.
(linux_memspace_calloc): Likewise.
(linux_memspace_free): Likewise.
(linux_memspace_realloc): Likewise (returns NULL for fallback).
* config/nvptx/allocator.c (nvptx_memspace_alloc): Add check for
non-standard memspaces.
(nvptx_memspace_calloc): Likewise.
(nvptx_memspace_realloc): Likewise.
(nvptx_memspace_validate): Update to validate standard vs non-standard
memspaces.
* env.c (parse_allocator): Add ompx_gnu_managed_mem_alloc and
ompx_gnu_managed_mem_space, plus static asserts to keep these lists
in sync.
* libgomp-plugin.h (GOMP_OFFLOAD_managed_alloc): New declaration.
(GOMP_OFFLOAD_managed_free): New declaration.
* libgomp.h (gomp_managed_alloc): New declaration.
(gomp_managed_free): New declaration.
(struct gomp_device_descr): Add managed_alloc_func and
managed_free_func fields.
* libgomp.texi: Document ompx_gnu_managed_mem_alloc and
ompx_gnu_managed_mem_space, add C++ template documentation, and
describe NVPTX and AMD support.
* omp.h.in: Add ompx_gnu_managed_mem_space and
ompx_gnu_managed_mem_alloc enumerators, and gnu_managed_mem C++
allocator template.
* omp_lib.f90.in: Add Fortran bindings for new allocator and
memory space.
* omp_lib.h.in: Likewise.
* plugin/cuda-lib.def: Add cuMemAllocManaged.
* plugin/plugin-nvptx.c (nvptx_alloc): Add managed parameter to
support cuMemAllocManaged.
(GOMP_OFFLOAD_alloc): Move contents to ...
(cleanup_and_alloc): ... this new function, and add managed support.
(GOMP_OFFLOAD_managed_alloc): New function.
(GOMP_OFFLOAD_managed_free): New function.
* target.c (gomp_managed_alloc): New function.
(gomp_managed_free): New function.
(gomp_load_plugin_for_device): Load optional managed_alloc
and managed_free plugin APIs.
* testsuite/lib/libgomp.exp: Add check_effective_target_omp_managedmem.
* testsuite/libgomp.c++/alloc-managed-1.C: New test.
* testsuite/libgomp.c/alloc-managed-1.c: New test.
* testsuite/libgomp.c/alloc-managed-2.c: New test.
* testsuite/libgomp.c/alloc-managed-3.c: New test.
* testsuite/libgomp.c/alloc-managed-4.c: New test.
* testsuite/libgomp.fortran/alloc-managed-1.f90: New test.
Co-authored-by: Kwok Cheung Yeung <kcyeung@baylibre.com>
Co-authored-by: Thomas Schwinge <tschwinge@baylibre.com>
These TR13/OpenMP 6.0 routines permit reproducible offloading to
a specific device by mapping an OpenMP device number to a
unique ID (UID). The GPU device UIDs should be universally unique;
the one for the host is not.
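For illustration, a minimal sketch of the new API (with libgomp's default
device numbering, where the initial device has number omp_get_num_devices ()):

  #include <omp.h>
  #include <stdio.h>

  int
  main (void)
  {
    /* Iterate over all non-host devices plus the initial (host) device.  */
    for (int dev = 0; dev <= omp_get_num_devices (); dev++)
      {
        const char *uid = omp_get_uid_from_device (dev);
        printf ("device %d: %s\n", dev, uid);
        /* Mapping the UID back must yield the same device number.  */
        if (omp_get_device_from_uid (uid) != dev)
          return 1;
      }
    return 0;
  }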
gcc/ChangeLog:
* omp-general.cc (omp_runtime_api_procname): Add get_device_from_uid
and get_uid_from_device.
include/ChangeLog:
* cuda/cuda.h (cuDeviceGetUuid): Declare.
(cuDeviceGetUuid_v2): Add prototype.
libgomp/ChangeLog:
* config/gcn/target.c (omp_get_uid_from_device,
omp_get_device_from_uid): Add stub implementation.
* config/nvptx/target.c (omp_get_uid_from_device,
omp_get_device_from_uid): Likewise.
* fortran.c (omp_get_uid_from_device_,
omp_get_uid_from_device_8_): New functions.
* libgomp-plugin.h (GOMP_OFFLOAD_get_uid): Add prototype.
* libgomp.h (struct gomp_device_descr): Add 'uid' and 'get_uid_func'.
* libgomp.map (GOMP_6.0): New, including the new UID routines.
* libgomp.texi (OpenMP Technical Report 13): Mark UID routines as 'Y'.
(Device Information Routines): Document new UID routines.
(Offload-Target Specifics): Document UID format.
* omp.h.in (omp_get_device_from_uid, omp_get_uid_from_device):
New prototypes.
* omp_lib.f90.in (omp_get_device_from_uid, omp_get_uid_from_device):
New interfaces.
* omp_lib.h.in: Likewise.
* plugin/cuda-lib.def: Add cuDeviceGetUuid and cuDeviceGetUuid_v2 via
CUDA_ONE_CALL_MAYBE_NULL.
* plugin/plugin-gcn.c (GOMP_OFFLOAD_get_uid): New.
* plugin/plugin-nvptx.c (GOMP_OFFLOAD_get_uid): New.
* target.c (str_omp_initial_device): New static var.
(STR_OMP_DEV_PREFIX): Define.
(gomp_get_uid_for_device, omp_get_uid_from_device,
omp_get_device_from_uid): New.
(gomp_load_plugin_for_device): DLSYM_OPT the function 'get_uid'.
(gomp_target_init): Set the device's 'uid' field to NULL.
* testsuite/libgomp.c/device_uid.c: New test.
* testsuite/libgomp.fortran/device_uid.f90: New test.
A few high-end nvptx devices support the attribute
CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS; for those, unified shared
memory is supported in hardware. This patch enables support for those,
provided that all installed nvptx devices have this feature (as the
capabilities are per device type).
This exposes a bug in gomp_copy_back_icvs: it previously used
omp_get_mapped_ptr to find mapped variables, but that routine returns
the pointer unchanged in the case of shared memory. However, even then
a few pointers - like the ICV variables - are actually mapped.
Additionally, there was a mismatch with regard to '-1' as the device
number, as gomp_copy_back_icvs and omp_get_mapped_ptr count devices
differently. Hence, do the lookup manually.
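For illustration, the kind of code this enables on such devices (a minimal
sketch; with 'requires unified_shared_memory', host memory is directly
accessible inside target regions):

  #include <stdlib.h>

  #pragma omp requires unified_shared_memory

  int
  main (void)
  {
    enum { N = 1000 };
    int *a = (int *) malloc (N * sizeof (int));

    /* No explicit mapping of the array: with USM the host pointer is
       dereferenced directly on the device.  */
    #pragma omp target teams distribute parallel for
    for (int i = 0; i < N; i++)
      a[i] = i;

    for (int i = 0; i < N; i++)
      if (a[i] != i)
        abort ();
    free (a);
    return 0;
  }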
include/ChangeLog:
* cuda/cuda.h (CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS): Add.
libgomp/ChangeLog:
* libgomp.texi (nvptx): Update USM description.
* plugin/plugin-nvptx.c (GOMP_OFFLOAD_get_num_devices):
Claim support when requesting USM and all devices support
CU_DEVICE_ATTRIBUTE_PAGEABLE_MEMORY_ACCESS.
* target.c (gomp_copy_back_icvs): Fix device ptr lookup.
(gomp_target_init): Set GOMP_OFFLOAD_CAP_SHARED_MEM if the
device supports USM.
Currently, we silently disable libgomp GCN and nvptx plugins/devices in
presence of certain error conditions during device probing, thus typically
silently resorting to host-fallback execution. Make such errors fatal, as they
are for any other device access later on, so that we notice early and reliably
when things go wrong. (Keep just two cases non-fatal: (a) libgomp GCN or nvptx
plugins are available but 'libhsa-runtime64.so.1' or 'libcuda.so.1' are not,
and (b) those are available, but the corresponding devices are not.)
This resolves the issue that we've got execution test cases unexpectedly
PASSing, despite:
libgomp: GCN fatal error: Run-time could not be initialized
Runtime message: HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.
..., and therefore they were not offloaded to the GCN device, but ran in
host-fallback execution mode. What happened in that scenario is that in
'init_hsa_context', during the initial 'GOMP_OFFLOAD_get_num_devices', we ran
into 'HSA_STATUS_ERROR_OUT_OF_RESOURCES'; that was not fatal, but just
silently disabled the libgomp plugin/device.
Especially "entertaining" were cases where such unintended host-fallback
execution happened during effective-target checks like
'offload_device_available' (host-fallback execution there meaning: no offload
device available), but the actual test cases were then running with an offload
device available, and therefore were mis-configured.
include/
* cuda/cuda.h (CUresult): Add 'CUDA_ERROR_NO_DEVICE'.
libgomp/
* plugin/plugin-gcn.c (init_hsa_context): Add and handle
'bool probe' parameter. Adjust all users; errors during device
probing are fatal.
* plugin/plugin-nvptx.c (nvptx_get_num_devices): Aside from
'CUDA_ERROR_NO_DEVICE', errors during device probing are fatal.
Fixes for commit r14-2792-g25072a477a56a727b369bf9b20f4d18198ff5894
"OpenMP: Call cuMemcpy2D/cuMemcpy3D for nvptx for omp_target_memcpy_rect",
namely:
In that commit, the code was changed to handle shared-memory devices;
however, as pointed out, omp_target_memcpy_check already set the pointer
to NULL in that case. Hence, this commit reverts to the prior version.
In cuda.h, it adds cuMemcpyPeer{,Async} for symmetry with cuMemcpy3DPeer
(all currently unused) and, in three structs, fixes reserved-member names
and removes a bogus 'const'.
And it changes a DLSYM to DLSYM_OPT as not all plugins support the new
functions, yet.
include/ChangeLog:
* cuda/cuda.h (CUDA_MEMCPY2D, CUDA_MEMCPY3D, CUDA_MEMCPY3D_PEER):
Remove bogus 'const' from 'const void *dst' and fix reserved-member
names in those structs.
(cuMemcpyPeer, cuMemcpyPeerAsync): Add.
libgomp/ChangeLog:
* target.c (omp_target_memcpy_rect_worker): Undo dim=1 change for
GOMP_OFFLOAD_CAP_SHARED_MEM.
(omp_target_memcpy_rect_copy): Likewise for lock condition.
(gomp_load_plugin_for_device): Use DLSYM_OPT not DLSYM for
memcpy3d/memcpy2d.
* plugin/plugin-nvptx.c (GOMP_OFFLOAD_memcpy2d,
GOMP_OFFLOAD_memcpy3d): Use memset 0 to nullify reserved and
unused src/dst fields for that mem type; remove '{src,dst}LOD = 0'.
When copying a 2D or 3D rectangular memory block, performance is
better when using CUDA's cuMemcpy2D/cuMemcpy3D instead of copying the
data element by element. That's what this commit does.
Additionally, it permits device-to-device copies, if necessary using a
temporary buffer on the host.
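For illustration, a call that now goes through cuMemcpy2D on nvptx (a minimal
sketch; the sizes and offsets are arbitrary):

  #include <omp.h>
  #include <stdlib.h>

  int
  main (void)
  {
    enum { N = 8, M = 8 };
    int src[N][M];
    int dev = omp_get_default_device ();
    int host = omp_get_initial_device ();

    for (int i = 0; i < N; i++)
      for (int j = 0; j < M; j++)
        src[i][j] = i * M + j;

    void *d = omp_target_alloc (sizeof (src), dev);
    if (d == NULL)
      return 0;

    /* Copy a 4x4 sub-rectangle at offset (2,2) of the 8x8 arrays from the
       host to the device in one go.  */
    size_t volume[2] = { 4, 4 };
    size_t offsets[2] = { 2, 2 };
    size_t dims[2] = { N, M };
    if (omp_target_memcpy_rect (d, src, sizeof (int), 2, volume, offsets,
                                offsets, dims, dims, dev, host) != 0)
      abort ();

    omp_target_free (d, dev);
    return 0;
  }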
include/ChangeLog:
* cuda/cuda.h (CUresult): Add CUDA_ERROR_NOT_INITIALIZED,
CUDA_ERROR_DEINITIALIZED, CUDA_ERROR_INVALID_HANDLE.
(CUarray, CUmemorytype, CUDA_MEMCPY2D, CUDA_MEMCPY3D,
CUDA_MEMCPY3D_PEER): New typedefs.
(cuMemcpy2D, cuMemcpy2DAsync, cuMemcpy2DUnaligned,
cuMemcpy3D, cuMemcpy3DAsync, cuMemcpy3DPeer,
cuMemcpy3DPeerAsync): New prototypes.
libgomp/ChangeLog:
* libgomp-plugin.h (GOMP_OFFLOAD_memcpy2d,
GOMP_OFFLOAD_memcpy3d): New prototypes.
* libgomp.h (struct gomp_device_descr): Add memcpy2d_func
and memcpy3d_func.
* libgomp.texi (nvptx): Document when cuMemcpy2D/cuMemcpy3D is used.
* oacc-host.c (memcpy2d_func, memcpy3d_func): Init with NULL.
* plugin/cuda-lib.def (cuMemcpy2D, cuMemcpy2DUnaligned,
cuMemcpy3D): Invoke via CUDA_ONE_CALL.
* plugin/plugin-nvptx.c (GOMP_OFFLOAD_memcpy2d,
GOMP_OFFLOAD_memcpy3d): New.
* target.c (omp_target_memcpy_rect_worker, omp_target_memcpy_rect_check,
omp_target_memcpy_rect_copy): Permit all device-to-device copies; invoke
the new plugin functions for 2D and 3D copying when available.
(gomp_load_plugin_for_device): DLSYM the new plugin functions.
* testsuite/libgomp.c/target-12.c: Fix dimension bug.
* testsuite/libgomp.fortran/target-12.f90: Likewise.
* testsuite/libgomp.fortran/target-memcpy-rect-1.f90: New test.
This patch adds a stub 'gomp_target_rev' in the host's target.c, which will
later handle the reverse offload.
For nvptx, it adds support for forwarding a GOMP_target_ext call made in
device code back to the host by setting values in a struct on the device and
querying it on the host, then invoking gomp_target_rev on the result.
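For illustration, the kind of construct this machinery serves (a minimal
sketch; the host-side execution itself is only completed by later patches):

  #include <stdio.h>

  #pragma omp requires reverse_offload

  int
  main (void)
  {
    #pragma omp target
    {
      int i = 42;
      /* The nested region runs back on the host: the device fills in the
         GOMP_REV_OFFLOAD_VAR struct, which the nvptx plugin polls during
         GOMP_OFFLOAD_run and hands to gomp_target_rev.  */
      #pragma omp target device (ancestor : 1) map(to : i)
        printf ("reverse offload saw %d\n", i);
    }
    return 0;
  }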
include/ChangeLog:
* cuda/cuda.h (enum CUdevice_attribute): Add
CU_DEVICE_ATTRIBUTE_UNIFIED_ADDRESSING.
(CU_MEMHOSTALLOC_DEVICEMAP): Define.
(cuMemHostAlloc): Add prototype.
libgomp/ChangeLog:
* config/nvptx/icv-device.c (GOMP_DEVICE_NUM_VAR): Remove
'static' for this variable.
* config/nvptx/libgomp-nvptx.h: New file.
* config/nvptx/target.c: Include it.
(GOMP_ADDITIONAL_ICVS): Declare extern var.
(GOMP_REV_OFFLOAD_VAR): Declare var.
(GOMP_target_ext): Handle reverse offload.
* libgomp-plugin.h (GOMP_PLUGIN_target_rev): New prototype.
* libgomp-plugin.c (GOMP_PLUGIN_target_rev): New, call ...
* target.c (gomp_target_rev): ... this new stub function.
* libgomp.h (gomp_target_rev): Declare.
* libgomp.map (GOMP_PLUGIN_1.4): New; add GOMP_PLUGIN_target_rev.
* plugin/cuda-lib.def (cuMemHostAlloc): Add.
* plugin/plugin-nvptx.c: Include libgomp-nvptx.h.
(struct ptx_device): Add rev_data member.
(nvptx_open_device): Remove async_engines query, last used in
r10-304-g1f4c5b9b; add unified-address assert check.
(GOMP_OFFLOAD_get_num_devices): Claim unified address
support.
(GOMP_OFFLOAD_load_image): Free rev_fn_table if no
offload functions exist. Make offload var available
on host and device.
(rev_off_dev_to_host_cpy, rev_off_host_to_dev_cpy): New.
(GOMP_OFFLOAD_run): Handle reverse offload.
... so that it may be used by other projects that inherit GCC's 'include'
directory.
include/
* cuda/cuda.h: New file.
libgomp/
* plugin/cuda/cuda.h: Remove file.
* plugin/plugin-nvptx.c [PLUGIN_NVPTX_DYNAMIC]: Include
"cuda/cuda.h" instead of <cuda.h>.
* plugin/configfrag.ac <PLUGIN_NVPTX_DYNAMIC>: Don't set
'PLUGIN_NVPTX_CPPFLAGS'.
* configure: Regenerate.