Tong Liu01 [Wed, 15 Mar 2023 07:24:22 +0000 (15:24 +0800)]
drm/amdgpu: add mes resume when do gfx post soft reset
[why]
when gfx do soft reset, mes will also do reset, if mes is not
resumed when do recover from soft reset, mes is unable to respond
in later sequence
[how]
resume mes when do gfx post soft reset
Signed-off-by: Tong Liu01 <Tong.Liu01@amd.com> Acked-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Tim Huang [Thu, 9 Mar 2023 08:27:51 +0000 (16:27 +0800)]
drm/amdgpu: skip ASIC reset for APUs when go to S4
For GC IP v11.0.4/11, PSP TMR need to be reserved
for ASIC mode2 reset. But for S4, when psp suspend,
it will destroy the TMR that fails the ASIC reset.
We can skip the reset on APUs, assuming we can resume them
properly. Verified on some GFX11, GFX10 and old GFX9 APUs.
Signed-off-by: Tim Huang <tim.huang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Tim Huang [Wed, 15 Mar 2023 07:52:09 +0000 (15:52 +0800)]
drm/amdgpu: reposition the gpu reset checking for reuse
Move the amdgpu_acpi_should_gpu_reset out of
CONFIG_SUSPEND to share it with hibernate case.
Signed-off-by: Tim Huang <tim.huang@amd.com> Reviewed-by: Alex Deucher <alexander.deucher@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Alex Deucher [Wed, 15 Mar 2023 17:51:43 +0000 (13:51 -0400)]
drm/amdgpu: drop the extra sign extension
amdgpu_bo_gpu_offset_no_check() already calls
amdgpu_gmc_sign_extend() so no need to call it twice.
Reviewed-by: Yang Wang <kevinyang.wang@amd.com> Reviewed-by: Christian König <christian.koenig@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Dave Airlie [Wed, 22 Mar 2023 00:35:45 +0000 (10:35 +1000)]
Merge tag 'drm-habanalabs-next-2023-03-20' of https://git.kernel.org/pub/scm/linux/kernel/git/ogabbay/linux into drm-next
This tag contains habanalabs driver and accel changes for v6.4:
- uAPI changes:
- Add opcodes to the CS ioctl to allow user to stall/resume specific engines
inside Gaudi2. This is to allow the user to perform power
testing/measurements when training different topologies.
- Expose in the INFO ioctl the amount of device memory that the driver
and f/w reserve for themselves.
- Expose in the INFO ioctl a bit-mask of the available rotator engines
in Gaudi2. This is to align with other engines that are already exposed.
- Expose in the INFO ioctl the register's address of the f/w that should
be used to trigger interrupts from within the user's code running in the
compute engines.
- Add a critical-event bit in the eventfd bitmask so the user will know the
event that was received was critical, and a reset will now occur
- Expose in the INFO ioctl two new opcodes to fetch information on h/w and
f/w events. The events recorded are the events that were reported in the
eventfd.
- New features and improvements:
- Add a dedicated interrupt ID in MSI-X in the device to the notification of
an unexpected user-related event in Gaudi2. Handle it in the driver by
reporting this event.
- Allow the user to fetch the device memory current usage even when the
device is undergoing compute-reset (a reset type that only clears the
compute engines).
- Enable graceful reset mechanism for compute-reset. This will give the
user a few seconds before the device is reset. For example, the user can,
during that time, perform certain device operations (dump data for debug)
or close the device in an orderly fashion.
- Align the decoder with the rest of the engines in regard to notification
to the user about interrupts and in regard to performing graceful reset
when needed (instead of immediate reset).
- Add support for assert interrupt from the TPC engine.
- Get the reset type that is necessary to perform per event from the
auto-generated irq_map array.
- Print the specific reason why a device is still in use when notifying to
the user about it (after the user closed the device's FD).
- Move to threaded IRQ when handling interrupts of workload completions.
- Firmware related fixes:
- Fix RAZWI event handler to match newest f/w version.
- Read error cause register in dma core events because the f/w doesn't
do that.
- Increase maximum time to wait for completion of Gaudi2 reset due to f/w
bug.
- Align to the latest firmware specs.
- Enforce the release order of the compute device and dma-buf.
i.e increment the device file refcount for any dma-buf that was exported
for that device. This will make sure the compute device release function
won't be called until the user closes all the FDs of the relevant
dma-bufs. Without this change, closing the device's FD before/without
closing the dma-buf's FD would always lead to hard-reset of the device.
- Fix a link in the drm documentation to correctly point to the accel section.
- Compilation warnings cleanups
- Misc bug fixes and code cleanups
Signed-off-by: Dave Airlie <airlied@redhat.com>
# -----BEGIN PGP SIGNATURE-----
#
# iQEzBAABCgAdFiEE7TEboABC71LctBLFZR1NuKta54AFAmQYfcAACgkQZR1NuKta
# 54DB4Af/SuiHZkVXwr+yHPv9El726rz9ZQD7mQtzNmehWGonwAvz15yqocNMUSbF
# JbqE/vrZjvbXrP1Uv5UrlRVdnFHSPV18VnHU4BMS/WOm19SsR6vZ0QOXOoa6/AUb
# w+kF3D//DbFI4/mTGfpH5/pzwu51ti8aVktosPFlHIa8iI8CB4/4IV+ivQ8UW4oK
# HyDRkIvHdRmER7vGOfhwhsr4zdqSlJBYrv3C3Z1dkSYBPW/5ICbiM1UlKycwdYKI
# cajQBSdUQwUCWnI+i8RmSy3kjNO6OE4XRUvTv89F2bQeyK/1rJLG2m2xZR/Ml/o5
# 7Cgvbn0hWZyeqe7OObYiBlSOBSehCA==
# =wclm
# -----END PGP SIGNATURE-----
# gpg: Signature made Tue 21 Mar 2023 01:37:36 AEST
# gpg: using RSA key ED311BA00042EF52DCB412C5651D4DB8AB5AE780
# gpg: Can't check signature: No public key
From: Oded Gabbay <ogabbay@kernel.org> Link: https://patchwork.freedesktop.org/patch/msgid/20230320154026.GA766126@ogabbay-vm-u20.habana-labs.com
Dave Airlie [Tue, 21 Mar 2023 20:49:01 +0000 (06:49 +1000)]
Merge tag 'drm-intel-gt-next-2023-03-16' of git://anongit.freedesktop.org/drm/drm-intel into drm-next
Driver Changes:
- Fix issue #6333: "list_add corruption" and full system lockup from
performance monitoring (Janusz)
- Give the punit time to settle before fatally failing (Aravind, Chris)
- Don't use stolen memory or BAR for ring buffers on LLC platforms (John)
- Add missing ecodes and correct timeline seqno on GuC error captures (John)
- Make sure DSM size has correct 1MiB granularity on Gen12+ (Nirmoy,
Lucas)
- Fix potential SSEU max_subslices array-index-out-of-bounds access on Gen11 (Andrea)
- Whitelist COMMON_SLICE_CHICKEN3 for UMD access on Gen12+ (Matt R.)
- Apply Wa_1408615072/Wa_1407596294 correctly on Gen11 (Matt R)
- Apply LNCF/LBCF workarounds correctly on XeHP SDV/PVC/DG2 (Matt R)
- Implement Wa_1606376872 for Xe_LP (Gustavo)
- Consider GSI offset when doing MCR lookups on Meteorlake+ (Matt R.)
- Add engine TLB invalidation for Meteorlake (Matt R.)
- Fix GSC Driver-FLR completion on Meteorlake (Alan)
- Fix GSC races on driver load/unload on Meteorlake+ (Daniele)
- Disable MC6 for MTL A step (Badal)
- Consolidate TLB invalidation flow (Tvrtko)
- Improve debug GuC/HuC debug messages (Michal Wa., John)
- Move fd_install after last use of fence (Rob)
- Initialize the obj flags for shmem objects (Aravind)
- Fix missing debug object activation (Nirmoy)
- Probe lmem before the stolen portion (Matt A)
- Improve clean up of GuC busyness stats worker (John)
- Fix missing return code checks in GuC submission init (John)
- Annotate two more workaround/tuning registers as MCR on PVC (Matt R)
- Fix GEN8_MISCCPCTL definition and remove unused INF_UNIT_LEVEL_CLKGATE (Lucas)
- Use sysfs_emit() and sysfs_emit_at() (Nirmoy)
- Make kobj_type structures constant (Thomas W.)
- make kobj attributes const on gt/ (Jani)
- Remove the unused virtualized start hack on buddy allocator (Matt A)
- Remove redundant check for DG1 (Lucas)
- Move DG2 tuning to the right function (Lucas)
- Rename dev_priv to i915 for private data naming consistency in gt/ (Andi)
- Remove unnecessary whitelisting of CS_CTX_TIMESTAMP on Xe_HP platforms (Matt R.)
-
Dave Airlie [Tue, 21 Mar 2023 01:03:16 +0000 (11:03 +1000)]
Merge tag 'drm-misc-next-2023-03-16' of git://anongit.freedesktop.org/drm/drm-misc into drm-next
drm-misc-next for v6.4-rc1:
Cross-subsystem Changes:
- Add drm_bridge.h to drm_bridge maintainers.
Core Changes:
- Assorted fixes to TTM, tests, format-helper, accel.
- Assorted Makefile fixes to drivers and accel.
- Implement fbdev emulation for GEM DMA drivers, and convert a lot of
drivers to use it.
- Use tgid instead of pid for tracking clients.
Driver Changes:
- Assorted fixes in rockchip, vmwgfx, nouveau, cirrus.
- Add imx25 driver.
- Add Elida KD50T048A, Sony TD4353, Novatek NT36523, STARRY 2081101QFH032011-53G panels.
- Add 4K mode support to rockchip.
- Convert cirrus to use regular atomic helpers, and more cirrus
improvements.
- Add damage clipping to cirrus, virtio.
Dani Liberman [Thu, 16 Mar 2023 13:03:12 +0000 (15:03 +0200)]
accel/habanalabs: change razwi handle after fw fix
FW had one data route for tpc0 and tpc1 when running in secured mode
and a different one when running without secured mode. After fw fixed
this issue, both mode have the same data path.
Ofir Bitton [Wed, 8 Mar 2023 11:34:52 +0000 (13:34 +0200)]
accel/habanalabs: add handling for unexpected user event
In order for the user to be aware of unexpected events in Gaudi2 that
aren't assigned to a specific engine, we are adding the handling of
this dedicated interrupt.
Bagas Sanjaya [Tue, 7 Mar 2023 04:35:26 +0000 (11:35 +0700)]
accel: Link to compute accelerator subsystem intro
Commit 2c204f3d53218d ("accel: add dedicated minor for accelerator
devices") adds link to accelerator nodes section of DRM internals doc
(Documentation/gpu/drm-internals.rst), but the target doesn't exist.
Instead, there is only an introduction doc for computer accelerator
subsytem.
Link to that doc until there is documentation of accelerator internals.
Fixes: 2c204f3d53218d ("accel: add dedicated minor for accelerator devices") Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com> Reviewed-by: Jeffrey Hugo <quic_jhugo@quicinc.com> Signed-off-by: Oded Gabbay <ogabbay@kernel.org>
Tomer Tayar [Sun, 12 Mar 2023 15:15:03 +0000 (17:15 +0200)]
accel/habanalabs: remove '\n' when passing strings to gaudi2_print_event()
Remove all '\n' from strings which are passed as arguments to
gaudi2_print_event(), because the newline character is added internally
in this function.
Koby Elbaz [Tue, 7 Mar 2023 08:13:44 +0000 (10:13 +0200)]
accel/habanalabs: return tlb inv error code upon failure
Now that CQ-completion based jobs do not trigger a reset upon failure,
failure of such jobs (e.g., MMU cache invalidation) should be handled
by the caller itself depending on the error code returned to it.
- remove reset_sleep_ms arg from functions that don't use it.
- move the call msleep(reset_sleep_ms) from btm poll to gaudi2_hw_fini
as it is called from there already for other flow.
Dafna Hirschfeld [Mon, 20 Feb 2023 08:09:25 +0000 (10:09 +0200)]
accel/habanalabs: in hw_fini return error code if polling timed-out
In hw_fini callback, we use either the cpucp packet method or polling a
register. Currently we return error only in the case of cpucp packet
failure. In this patch we also return error if polling timed out.
Koby Elbaz [Mon, 6 Mar 2023 14:43:41 +0000 (16:43 +0200)]
accel/habanalabs: do not verify engine modes after being changed
Engines idle state can't always be verified between changes of
engine modes (e.g., stall/halt).
For example, if a CS is inflight when altering engine's mode,
idle state will return NOT idle, always.
Dave Airlie [Mon, 20 Mar 2023 06:44:36 +0000 (16:44 +1000)]
Merge tag 'amd-drm-next-6.4-2023-03-17' of https://gitlab.freedesktop.org/agd5f/linux into drm-next
amd-drm-next-6.4-2023-03-17:
amdgpu:
- Misc code cleanups
- Documentation fixes
- Make kobj structures const
- Add thermal throttling adjustments for supported APUs
- UMC RAS fixes
- Display reset fixes
- DCN 3.2 fixes
- Freesync fixes
- DC code reorg
- Generalize dmabuf import to work with KFD
- DC DML fixes
- SRIOV fixes
- UVD code cleanups
- IH 4.4.2 updates
- HDP 4.4.2 updates
- SDMA 4.4.2 updates
- PSP 13.0.6 updates
- Add capped/uncapped workload handling for supported APUs
- DCN 3.1.4 updates
- Re-org DC Kconfig
- USB4 fixes
- Reorg DC plane and stream handling
- Register vga_switcheroo for apple-gmux
- SMU 13.0.6 updates
- Fix error checking in read_mm_registers functions for affected families
- VCN 4.0.4 fix
- Drop redundant pci_enable_pcie_error_reporting() call
- RDNA2 SMU OD suspend/resume fix
- Expose additional memory stats via fdinfo
- RAS fixes
- Misc display fixes
- DP MST fixes
- IOMMU regression fix for KFD
amdkfd:
- Make kobj structures const
- Support for exporting buffers via dmabuf
- Multi-VMA page migration fixes
- NBIO fixes
- Misc code cleanups
- Fix possible double free
- Fix possible UAF
radeon:
- iMac fix
UAPI:
- KFD dmabuf export support. Required for importing KFD buffers into GEM contexts and for RDMA P2P support.
Proposed user mode changes: https://github.com/fxkamd/ROCT-Thunk-Interface/commits/fxkamd/dmabuf
Tom Rix [Fri, 3 Mar 2023 13:27:31 +0000 (08:27 -0500)]
drm/nouveau/fifo: set gf100_fifo_nonstall_block_dump storage-class-specifier to static
gcc with W=1 reports
drivers/gpu/drm/nouveau/nvkm/engine/fifo/gf100.c:451:1: error:
no previous prototype for ‘gf100_fifo_nonstall_block’ [-Werror=missing-prototypes]
451 | gf100_fifo_nonstall_block(struct nvkm_event *event, int type, int index)
| ^~~~~~~~~~~~~~~~~~~~~~~~~
gf100_fifo_nonstall_block is only used in gf100.c, so it should be static
Felix Kuehling [Tue, 14 Mar 2023 00:03:08 +0000 (20:03 -0400)]
drm/amdgpu: Don't resume IOMMU after incomplete init
Check kfd->init_complete in kgd2kfd_iommu_resume, consistent with other
kgd2kfd calls. This should fix IOMMU errors on resume from suspend when
KFD IOMMU initialization failed.
Hawking Zhang [Sat, 4 Mar 2023 12:22:23 +0000 (20:22 +0800)]
drm/amdgpu: drop ras check at asic level for new blocks
amdgpu_ras_register_ras_block should always be invoked
by ras_sw_init, where driver needs to check ras caps
at ip level, instead of asic level.
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Stanley Yang <Stanley.Yang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Hawking Zhang [Mon, 13 Mar 2023 06:18:34 +0000 (14:18 +0800)]
drm/amdgpu: Rework pcie_bif ras sw_init
pcie_bif ras blocks needs to be initialized as early
as possible to handle fatal error detected in hw_init
phase. also align the pcie_bif ras sw_init with other
ras blocks
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Stanley Yang <Stanley.Yang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Hawking Zhang [Sat, 4 Mar 2023 11:54:14 +0000 (19:54 +0800)]
drm/amdgpu: Rework xgmi_wafl_pcs ras sw_init
To align with other IP blocks.
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Stanley Yang <Stanley.Yang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Hawking Zhang [Wed, 15 Mar 2023 00:59:04 +0000 (08:59 +0800)]
drm/amdgpu: Rework mca ras sw_init
To align with other IP blocks
Signed-off-by: Hawking Zhang <Hawking.Zhang@amd.com> Reviewed-by: Stanley Yang <Stanley.Yang@amd.com> Reviewed-by: Tao Zhou <tao.zhou1@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
David Belanger [Tue, 28 Feb 2023 19:11:24 +0000 (14:11 -0500)]
drm/amdkfd: Fixed kfd_process cleanup on module exit.
Handle case when module is unloaded (kfd_exit) before a process space
(mm_struct) is released.
v2: Fixed potential race conditions by removing all kfd_process from
the process table first, then working on releasing the resources.
v3: Fixed loop element access / synchronization. Fixed extra empty lines.
Signed-off-by: David Belanger <david.belanger@amd.com> Reviewed-by: Felix Kuehling <Felix.Kuehling@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Aric Cyr [Mon, 6 Mar 2023 01:48:26 +0000 (20:48 -0500)]
drm/amd/display: 3.2.227
This version brings along the following:
- FW Release 0.0.158.0
- Fixes to HDCP, DP MST and more
- Improvements on USB4 links and more
- Code re-architecture on link.h
Reviewed-by: Aric Cyr <aric.cyr@amd.com> Acked-by: Qingqing Zhuo <qingqing.zhuo@amd.com> Signed-off-by: Aric Cyr <aric.cyr@amd.com> Tested-by: Daniel Wheeler <daniel.wheeler@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Samson Tam [Fri, 3 Mar 2023 22:30:25 +0000 (17:30 -0500)]
drm/amd/display: fix assert condition
[Why & How]
Reversed assert condition when checking that phy_pix_clk[] is not 0
Reviewed-by: Alvin Lee <Alvin.Lee2@amd.com> Acked-by: Qingqing Zhuo <qingqing.zhuo@amd.com> Signed-off-by: Samson Tam <Samson.Tam@amd.com> Tested-by: Daniel Wheeler <daniel.wheeler@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Stylon Wang [Wed, 1 Mar 2023 15:56:51 +0000 (23:56 +0800)]
drm/amd/display: Clearly states if long or short HPD event in dmesg logs
[Why]
The log "DMUB HPD callback" is crucial to identify when DP tunneling
is been established and driver is notified of this event from DMUB.
Same log is shared for long and short hotplug event and we need to
check trailing DC debug log to distinguish between them two, making
debugging on DPIA related issues a bit more troublesome.
[How]
Clearly states in dmesg logs whether this is a long or short hotplug
event.
Reviewed-by: Hamza Mahfooz <Hamza.Mahfooz@amd.com> Acked-by: Qingqing Zhuo <qingqing.zhuo@amd.com> Signed-off-by: Stylon Wang <stylon.wang@amd.com> Tested-by: Daniel Wheeler <daniel.wheeler@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Wesley Chalmers [Mon, 27 Feb 2023 18:21:17 +0000 (13:21 -0500)]
drm/amd/display: Make DCN32 functions available to future DCNs
[Why & How]
Make DCN32 functions available for more DCNs.
Reviewed-by: Chris Park <Chris.Park@amd.com> Acked-by: Qingqing Zhuo <qingqing.zhuo@amd.com> Signed-off-by: Wesley Chalmers <Wesley.Chalmers@amd.com> Tested-by: Daniel Wheeler <daniel.wheeler@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Yifan Zha [Mon, 6 Mar 2023 06:54:05 +0000 (14:54 +0800)]
drm/amdgpu: Init MMVM_CONTEXTS_DISABLE in gmc11 golden setting under SRIOV
[Why]
If disable the mmhub vm contexts(set MMVM_CONTEXTS_DISABLE to 0xffff),
driver loading failed on vf due to fence fallback timer expired on all rings.
FLR cannot reset MMVM_CONTEXTS_DISABLE.
So this vf can not be recovered anymore unless trigger a whole gpu reset.
[How]
Under SRIOV, init MMVM_CONTEXTS_DISABLE in gmc11 golden register setting.
Samson Tam [Tue, 28 Feb 2023 19:33:00 +0000 (14:33 -0500)]
drm/amd/display: reallocate DET for dual displays with high pixel rate ratio
[Why]
For dual displays where pixel rate is much higher on one display,
we may get underflow when DET is evenly allocated.
[How]
Allocate less DET segments for the lower pixel rate display and
more DET segments for the higher pixel rate display
Reviewed-by: Alvin Lee <Alvin.Lee2@amd.com> Acked-by: Qingqing Zhuo <qingqing.zhuo@amd.com> Signed-off-by: Samson Tam <Samson.Tam@amd.com> Tested-by: Daniel Wheeler <daniel.wheeler@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Cruise Hung [Thu, 2 Mar 2023 02:33:51 +0000 (10:33 +0800)]
drm/amd/display: Fix DP MST sinks removal issue
[Why]
In USB4 DP tunneling, it's possible to have this scenario that
the path becomes unavailable and CM tears down the path a little bit late.
So, in this case, the HPD is high but fails to read any DPCD register.
That causes the link connection type to be set to sst.
And not all sinks are removed behind the MST branch.
[How]
Restore the link connection type if it fails to read DPCD register.
Cc: stable@vger.kernel.org Cc: Mario Limonciello <mario.limonciello@amd.com> Reviewed-by: Wenjing Liu <Wenjing.Liu@amd.com> Acked-by: Qingqing Zhuo <qingqing.zhuo@amd.com> Signed-off-by: Cruise Hung <Cruise.Hung@amd.com> Tested-by: Daniel Wheeler <daniel.wheeler@amd.com> Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
Tvrtko Ursulin [Tue, 14 Mar 2023 14:18:55 +0000 (14:18 +0000)]
drm: Track clients by tgid and not tid
Thread group id (aka pid from userspace point of view) is a more
interesting thing to show as an owner of a DRM fd, so track and show that
instead of the thread id.
In the next patch we will make the owner updated post file descriptor
handover, which will also be tgid based to avoid ping-pong when multiple
threads access the fd.
Bjorn Helgaas [Tue, 7 Mar 2023 20:27:29 +0000 (14:27 -0600)]
accel/habanalabs: Drop redundant pci_enable_pcie_error_reporting()
pci_enable_pcie_error_reporting() enables the device to send ERR_*
Messages. Since
commit f26e58bf6f54 ("PCI/AER: Enable error reporting when AER is native"),
the PCI core does this for all devices during enumeration, so the
driver doesn't need to do it itself.
Remove the redundant pci_enable_pcie_error_reporting() call from the
driver. Also remove the corresponding pci_disable_pcie_error_reporting()
from the driver .remove() path.
Note that this only controls ERR_* Messages from the device. An ERR_*
Message may cause the Root Port to generate an interrupt, depending on the
AER Root Error Command register managed by the AER service driver.
Tomer Tayar [Wed, 1 Mar 2023 15:45:58 +0000 (17:45 +0200)]
accel/habanalabs: postpone mem_mgr IDR destruction to hpriv_release()
The memory manager IDR is currently destroyed when user releases the
file descriptor.
However, at this point the user context might be still held, and memory
buffers might be still in use.
Later on, calls to release those buffers will fail due to not finding
their handles in the IDR, leading to a memory leak.
To avoid this leak, split the IDR destruction from the memory manager
fini, and postpone it to hpriv_release() when there is no user context
and no buffers are used.
Dafna Hirschfeld [Mon, 20 Feb 2023 05:54:44 +0000 (07:54 +0200)]
accel/habanalabs: move soft-reset wait to soft-reset execute
We plan to do soft-reset either by mmio or by using cpucp packet
depending on the FW version. We don't want to check FW version in two
different places for that (execute soft-reset and wait to soft-reset)
so move the waiting to gaudi2_execute_soft_reset. This also makes sense
because the cpucp also does the waiting.
Koby Elbaz [Wed, 15 Feb 2023 15:51:14 +0000 (17:51 +0200)]
accel/habanalabs: add uapi to stall/resume engine
The user might want to stall/resume engines to perform power testing
for various scenarios. Because our current
HL_CS_FLAGS_ENGINE_CORE_COMMAND command only handles the engines' cores,
we need to add another opcode for handling entire engine and not just
its core.
The user supplies an array, where each entry holds the engine's ID and
the command to send to the engine. The size of the array is limited
by the number of engines in the ASIC (only Gaudi2 is currently
supported).
Tomer Tayar [Fri, 17 Feb 2023 10:56:48 +0000 (12:56 +0200)]
accel/habanalabs: use scnprintf() in print_device_in_use_info()
compose_device_in_use_info() was added to handle the snprintf() return
value in a single place.
However, the buffer size in print_device_in_use_info() is set such that
it would be enough for the max possible print, so
compose_device_in_use_info() is not really needed.
Moreover, scnprintf() can be used instead of snprintf(), to save the
check if the return value larger than the given size.
Koby Elbaz [Thu, 23 Feb 2023 16:17:02 +0000 (18:17 +0200)]
accel/habanalabs: use a mutex rather than a spinlock
There are two reasons why mutex is better here:
1. There's a critical section relatively long, where in
certain scenarios (e.g., multiple VM allocations) taking a spinlock
might cause noticeable performance degradation.
2. It will remove the incorrect usage of mutex under
spin_lock (where preemption is disabled).
Koby Elbaz [Tue, 21 Feb 2023 12:21:39 +0000 (14:21 +0200)]
accel/habanalabs: fix register address on PDMA/EDMA idle check
The PDMA/EDMA is_idle routines didn't check the correct CORE register
in order to get the accurate idle state.
Moreover, it's better to make the is_idle routine more robust by adding
additional checks (IS_HALTED) before announcing that the core is idle.
Koby Elbaz [Sun, 26 Feb 2023 06:22:45 +0000 (08:22 +0200)]
accel/habanalabs: remove a useless is_idle TPC flag
Is appears that the flag -
DCORE0_TPC0_CFG_STATUS_VECTOR_PIPE_EMPTY_MASK, has no actual use when
it comes to querying TPC idleness, since this flag's corresponding bit
turns-off after stalling the engine, and turns back on after resuming
it.
Koby Elbaz [Thu, 23 Feb 2023 08:43:14 +0000 (10:43 +0200)]
accel/habanalabs: verify return code after scrubbing ARCs DCCMs
In case the KDMA fails scrubbing the DCCMs (following a soft-reset
upon device release), the driver will only print failure until reset
flow ends, rather than escalating it into a hard-reset.
Dafna Hirschfeld [Wed, 15 Feb 2023 10:15:57 +0000 (12:15 +0200)]
accel/habanalabs: assert return value of hw_fini
Since hw_fini return error code for failure indication, we should
check its return value. Currently it might only fail upon soft-reset
from hl_device_reset. Later patch will add hw_fini failure in case of
polling timeout in hard-reset.
Tomer Tayar [Thu, 16 Feb 2023 17:54:32 +0000 (19:54 +0200)]
accel/habanalabs: add helper function to get vm hash node
Add a helper function to search the vm hash for a node with a given
virtual address.
As opposed to the current code, this function explicitly returns NULL
when no node is found, instead of basing on the loop cursor object's
value.
'irq_handler' in gaudi2_enable_msix(), is just assigned with a function
name and then used when calling request_threaded_irq().
Remove the variable and use the function name directly as an argument.
Tom Rix [Mon, 13 Feb 2023 14:48:14 +0000 (06:48 -0800)]
accel/habanalabs: set hl_capture_*_err storage-class-specifier to static
smatch reports
drivers/accel/habanalabs/common/device.c:2619:6: warning:
symbol 'hl_capture_hw_err' was not declared. Should it be static?
drivers/accel/habanalabs/common/device.c:2641:6: warning:
symbol 'hl_capture_fw_err' was not declared. Should it be static?
both are only used in device.c, so they should be static
Tom Rix [Wed, 8 Feb 2023 15:54:50 +0000 (07:54 -0800)]
accel/habanalabs: change unused extern decl of hdev to forward decl of hl_device
Building with clang W=2 has several similar warnings
drivers/accel/habanalabs/common/decoder.c:46:51: error: declaration shadows a variable in the global scope [-Werror,-Wshadow]
static void
dec_error_intr_work(struct hl_device *hdev, u32 base_addr, u32 core_id)
^
drivers/accel/habanalabs/common/security.h:13:26: note: previous declaration is here
extern struct hl_device *hdev;
^
There is no global definition of hdev, so the extern is not needed.
Searched with
grep -r '^struct' . | grep hl_dev
Change to an forward decl to resolve these issues
drivers/accel/habanalabs/common/mmu/../security.h:133:40: error: ‘struct hl_device’ declared inside parameter list will not be visible outside of this definition or declaration [-Werror]
133 | bool (*skip_block_hook)(struct hl_device *hdev,
| ^~~~~~~~~
accel/habanalabs: don't trace cpu accessible dma alloc/free
The cpu accessible dma allocations use the gen_pool api which actually
does not allocate new memory from the system but manages memory already
allocated before. When tracing this together with real dma
allocation/free it cause confusing logs like a '0' dma address and
a cpu address appearing twice etc.
accel/habanalabs: in hl_device_reset small refactor for readabilty
in the out_err flow, combine the two cases of soft-reset since
they have mostly common code. In addition unlock reset_info.lock
after touching reset count.
Ofir Bitton [Mon, 16 Jan 2023 17:56:23 +0000 (19:56 +0200)]
accel/habanalabs: add support for TPC assert
In order to allow TPC engines to raise an assert, we must expose
the relevant MSIX interrupt to the user so he will configure the engine
correctly. In addition, we implement the corresponding interrupt
handler that will notify the user upon such an event.
Tal Cohen [Wed, 25 Jan 2023 18:29:15 +0000 (20:29 +0200)]
accel/habanalabs: change user interrupt to threaded IRQ
We prefer not to handle the user interrupt job inside the interrupt
context. Instead, use threaded IRQ to handle the user interrupts.
This will allow to avoid disabling interrupts when the user process
registers for a new event and to avoid long handling inside an
interrupt.
Tomer Tayar [Wed, 25 Jan 2023 09:38:56 +0000 (11:38 +0200)]
accel/habanalabs: enable graceful reset mechanism for compute-reset
The graceful reset mechanism is currently enabled only for reset
requests that will end up with hard-reset.
In future, reset requests due to errors in some device engines, are
going to be modified to request compute-reset, as the much longer
hard-reset is not really needed there.
To allow it, enable graceful reset also for compute-reset, and reset
after user releases the device won't be escalated to hard-reset in those
cases.
If watchdog expires and user didn't release the device, hard-reset will
be initiated in any case.
Koby Elbaz [Sun, 29 Jan 2023 11:35:58 +0000 (13:35 +0200)]
accel/habanalabs: disable PCI when escalating compute to hard-reset
In case a compute reset has failed or a request for a hard reset has
just arrived, then we escalate current reset procedure from compute
to hard-reset.
In such a case, the FW should be aware of the updated error cause,
and if LKD is the one who performs the reset (rather than the FW),
then we ask the FW to disable PCI access.
We would also like to have relevant debug info and therefore
we print the currently escalating reset type.
Ofir Bitton [Sun, 22 Jan 2023 12:06:15 +0000 (14:06 +0200)]
accel/habanalabs: expose engine core int reg address
In order for engine cores to raise interrupts towards FW, They need
to know which register the event data should be written to.
Hence, we forward the relevant scratchpad register received during
dynamic regs handshake with FW.
Moti Haimovski [Tue, 10 Jan 2023 15:35:31 +0000 (17:35 +0200)]
accel/habanalabs: add critical-event bit in notifier
Enhance the existing user notifications by adding a HW and FW critical
event bits to be used when a HW or FW event occur that requires
both SW abort and hard-resetting the chip.
Tomer Tayar [Sun, 22 Jan 2023 10:17:02 +0000 (12:17 +0200)]
accel/habanalabs: enforce release order of compute device and dma-buf
When user closes the compute device file descriptor without closing a
dma-buf file descriptor, the device will be considered as in use,
leading to hard reset and killing the user process, to ensure the
release of the dma-buf.
Same thing will happen if user first releases the compute device file
and only then the dma-buf.
The implication of this is the duration of hard reset, during which the
device cannot be reacquired.
Moreover, this behavior adds a constraint on a user process to follow
this order of release operations.
To avoid killing the user process and to remove this constraint, enforce
the correct order of release operations inside the driver, by
incrementing the device file refcount for any dma-buf until it is
released.
Tomer Tayar [Wed, 18 Jan 2023 15:35:17 +0000 (17:35 +0200)]
accel/habanalabs: add info when FD released while device still in use
When user closes the device file descriptor, it is checked whether the
device is still in use, and a message is printed if it is.
To make this message more informative, add to this print also the reason
due to which the device is considered as in use.
The possible reasons which are checked for now are active CS and
exported dma-buf.
PSOC RAZWI handling code did not took into account single router that
supports several initiators with different XY coordinates. Also, it
ignored XY_HI coordinate. This caused 2 problems:
1. RAZWI handle ignored some initiators.
2. When getting PSOC RAZWI from some routers, there was a lot of
possible engines which could have caused the RAZWI.
Fixed the above issue by handling PSOC RAZWI with both low and high
XY coordinates. This way driver supports all initiators and in
the worst case there are not more than 2 possible engines for RAZWI.