Recent GPUs are capable of providing peak performance up to 6 TFlOps/s (DP), making them an attractive computational target for Finite Element Methods. However, effectively mapping an FEM solver to a GPU remains challenging due to the scattered memory access, large amounts of on-chip state space (eg registers) required for efficient execution, and the inherently large algorithmic variety encountered in local assembly kernels for variational forms.
In this talk, we focus on accelerating FEM action kernels for matrix-free operators on simplices. We describe a parametrized family of transformation strategies targeting these kernels, a heuristic cost model, and an auto-tuning strategy that enables us to achieve near-roofline performance for a wide variety of variational forms across domains such as fluid dynamics, solid mechanics, and wave motion. We propose a new computational offloading interface for Firedrake, and we have realized our UFL-to-GPU compilation pipeline within this interface. The pipeline can process general UFL with very few limitations and reliably produce routines for high-performance matrix-free 1-form assembly.
We close with remarks on further potential performance improvement opportunities.