XDP hardware hints discussion mail archive
* [xdp-hints] [PATCH bpf-next v7 00/17] xdp: hints via kfuncs
@ 2023-01-12  0:32 Stanislav Fomichev
  2023-01-12  0:32 ` [xdp-hints] [PATCH bpf-next v7 01/17] bpf: Document XDP RX metadata Stanislav Fomichev
                   ` (16 more replies)
  0 siblings, 17 replies; 42+ messages in thread
From: Stanislav Fomichev @ 2023-01-12  0:32 UTC
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

Please see the first patch in the series for the overall
design and use-cases.

See the following email from Toke for the per-packet metadata overhead:
https://lore.kernel.org/bpf/20221206024554.3826186-1-sdf@google.com/T/#m49d48ea08d525ec88360c7d14c4d34fb0e45e798
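
To give a feel for the programming model, here is a minimal sketch of
an XDP program using the two RX metadata kfuncs added in patch 7; the
struct meta layout and the zero-on-failure policy are application
choices, not part of the series:

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  /* Application-defined contract with the metadata consumer. */
  struct meta {
          __u64 rx_timestamp;
          __u32 rx_hash;
  };

  extern int bpf_xdp_metadata_rx_timestamp(const struct xdp_md *ctx,
                                           __u64 *timestamp) __ksym;
  extern int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx,
                                      __u32 *hash) __ksym;

  SEC("xdp")
  int rx(struct xdp_md *ctx)
  {
          struct meta *meta;
          void *data;

          /* Reserve space in front of the packet for the metadata. */
          if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(struct meta)))
                  return XDP_DROP;

          data = (void *)(long)ctx->data;
          meta = (void *)(long)ctx->data_meta;
          if ((void *)(meta + 1) > data)
                  return XDP_DROP;

          /* Kfuncs not implemented by the driver return -EOPNOTSUPP. */
          if (bpf_xdp_metadata_rx_timestamp(ctx, &meta->rx_timestamp))
                  meta->rx_timestamp = 0;
          if (bpf_xdp_metadata_rx_hash(ctx, &meta->rx_hash))
                  meta->rx_hash = 0;

          return XDP_PASS;
  }

  char _license[] SEC("license") = "GPL";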

Recent changes:

- Bring back parts of the "bpf: Introduce device-bound XDP programs" patch
  that were removed during patch reshuffling (Martin)

- Remove netdev NULL check from __bpf_prog_dev_bound_init (Martin)

- Remove netdev NULL check from bpf_dev_bound_resolve_kfunc (Martin)

- Move target bound device verification from bpf_tracing_prog_attach into
  bpf_check_attach_target (Martin)

- Move mlx5e_free_rx_in_progress_descs into txrx.h (Tariq)

- mlx5e_fill_xdp_buff -> mlx5e_fill_mxbuf (Tariq)

Prior art (to record pros/cons for different approaches):

- Stable UAPI approach:
  https://lore.kernel.org/bpf/20220628194812.1453059-1-alexandr.lobakin@intel.com/
- Metadata+BTF_ID approach:
  https://lore.kernel.org/bpf/166256538687.1434226.15760041133601409770.stgit@firesoul/
- v6:
  https://lore.kernel.org/bpf/20230104215949.529093-1-sdf@google.com/
- v5:
  https://lore.kernel.org/bpf/20221220222043.3348718-1-sdf@google.com/
- v4:
  https://lore.kernel.org/bpf/20221213023605.737383-1-sdf@google.com/
- v3:
  https://lore.kernel.org/bpf/20221206024554.3826186-1-sdf@google.com/
- v2:
  https://lore.kernel.org/bpf/20221121182552.2152891-1-sdf@google.com/
- v1:
  https://lore.kernel.org/bpf/20221115030210.3159213-1-sdf@google.com/
- kfuncs v2 RFC:
  https://lore.kernel.org/bpf/20221027200019.4106375-1-sdf@google.com/
- kfuncs v1 RFC:
  https://lore.kernel.org/bpf/20221104032532.1615099-1-sdf@google.com/

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org

Stanislav Fomichev (13):
  bpf: Document XDP RX metadata
  bpf: Rename bpf_{prog,map}_is_dev_bound to is_offloaded
  bpf: Move offload initialization into late_initcall
  bpf: Reshuffle some parts of bpf/offload.c
  bpf: Introduce device-bound XDP programs
  selftests/bpf: Update expected test_offload.py messages
  bpf: XDP metadata RX kfuncs
  veth: Introduce veth_xdp_buff wrapper for xdp_buff
  veth: Support RX XDP metadata
  selftests/bpf: Verify xdp_metadata xdp->af_xdp path
  net/mlx4_en: Introduce wrapper for xdp_buff
  net/mlx4_en: Support RX XDP metadata
  selftests/bpf: Simple program to dump XDP RX metadata

Toke Høiland-Jørgensen (4):
  bpf: Support consuming XDP HW metadata from fext programs
  xsk: Add cb area to struct xdp_buff_xsk
  net/mlx5e: Introduce wrapper for xdp_buff
  net/mlx5e: Support RX XDP metadata

 Documentation/networking/index.rst            |   1 +
 Documentation/networking/xdp-rx-metadata.rst  | 108 +++++
 drivers/net/ethernet/mellanox/mlx4/en_clock.c |  13 +-
 .../net/ethernet/mellanox/mlx4/en_netdev.c    |   6 +
 drivers/net/ethernet/mellanox/mlx4/en_rx.c    |  63 ++-
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h  |   5 +
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |   6 +-
 .../net/ethernet/mellanox/mlx5/core/en/txrx.h |   5 +
 .../net/ethernet/mellanox/mlx5/core/en/xdp.c  |  26 +-
 .../net/ethernet/mellanox/mlx5/core/en/xdp.h  |  11 +-
 .../ethernet/mellanox/mlx5/core/en/xsk/rx.c   |  35 +-
 .../ethernet/mellanox/mlx5/core/en/xsk/rx.h   |   2 +
 .../net/ethernet/mellanox/mlx5/core/en_main.c |   6 +
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   |  99 ++--
 drivers/net/netdevsim/bpf.c                   |   4 -
 drivers/net/veth.c                            |  87 ++--
 include/linux/bpf.h                           |  61 ++-
 include/linux/netdevice.h                     |   8 +
 include/net/xdp.h                             |  21 +
 include/net/xsk_buff_pool.h                   |   5 +
 include/uapi/linux/bpf.h                      |   5 +
 kernel/bpf/core.c                             |  12 +-
 kernel/bpf/offload.c                          | 425 ++++++++++++------
 kernel/bpf/syscall.c                          |  34 +-
 kernel/bpf/verifier.c                         |  44 +-
 net/bpf/test_run.c                            |   3 +
 net/core/dev.c                                |   9 +-
 net/core/filter.c                             |   2 +-
 net/core/xdp.c                                |  64 +++
 tools/include/uapi/linux/bpf.h                |   5 +
 tools/testing/selftests/bpf/.gitignore        |   1 +
 tools/testing/selftests/bpf/Makefile          |   9 +-
 .../selftests/bpf/prog_tests/xdp_metadata.c   | 410 +++++++++++++++++
 .../selftests/bpf/progs/xdp_hw_metadata.c     |  81 ++++
 .../selftests/bpf/progs/xdp_metadata.c        |  64 +++
 .../selftests/bpf/progs/xdp_metadata2.c       |  23 +
 tools/testing/selftests/bpf/test_offload.py   |  10 +-
 tools/testing/selftests/bpf/xdp_hw_metadata.c | 405 +++++++++++++++++
 tools/testing/selftests/bpf/xdp_metadata.h    |  15 +
 39 files changed, 1900 insertions(+), 293 deletions(-)
 create mode 100644 Documentation/networking/xdp-rx-metadata.rst
 create mode 100644 tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
 create mode 100644 tools/testing/selftests/bpf/progs/xdp_hw_metadata.c
 create mode 100644 tools/testing/selftests/bpf/progs/xdp_metadata.c
 create mode 100644 tools/testing/selftests/bpf/progs/xdp_metadata2.c
 create mode 100644 tools/testing/selftests/bpf/xdp_hw_metadata.c
 create mode 100644 tools/testing/selftests/bpf/xdp_metadata.h

-- 
2.39.0.314.g84b9a713c41-goog



* [xdp-hints] [PATCH bpf-next v7 01/17] bpf: Document XDP RX metadata
  2023-01-12  0:32 [xdp-hints] [PATCH bpf-next v7 00/17] xdp: hints via kfuncs Stanislav Fomichev
@ 2023-01-12  0:32 ` Stanislav Fomichev
  2023-01-16 13:09   ` [xdp-hints] " Jesper Dangaard Brouer
  2023-01-12  0:32 ` [xdp-hints] [PATCH bpf-next v7 02/17] bpf: Rename bpf_{prog,map}_is_dev_bound to is_offloaded Stanislav Fomichev
                   ` (15 subsequent siblings)
  16 siblings, 1 reply; 42+ messages in thread
From: Stanislav Fomichev @ 2023-01-12  0:32 UTC
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev, David Vernet

Document all current use-cases and assumptions.
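
To make the AF_XDP contract described in the document concrete, here
is a minimal sketch of the consumer side. It assumes the producing XDP
program prepended an application-defined struct meta directly in front
of the packet (so METADATA_SIZE is simply sizeof(struct meta));
handle_rx_desc() is a hypothetical helper called per RX descriptor:

  #include <stdio.h>
  #include <linux/if_xdp.h>
  #include <xdp/xsk.h>    /* xsk_umem__get_data(), from libxdp */

  /* Must mirror the producer's application-defined layout. */
  struct meta {
          __u64 rx_timestamp;
          __u32 rx_hash;
  };

  static void handle_rx_desc(void *umem_area, const struct xdp_desc *desc)
  {
          void *data = xsk_umem__get_data(umem_area, desc->addr);
          struct meta *meta =
                  (struct meta *)((char *)data - sizeof(struct meta));

          /* Fields are meaningful only if the producer's kfunc calls
           * succeeded; the producer uses 0 as "not available".
           */
          printf("ts=%llu hash=%u len=%u\n",
                 (unsigned long long)meta->rx_timestamp,
                 meta->rx_hash, desc->len);
  }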

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Acked-by: David Vernet <void@manifault.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 Documentation/networking/index.rst           |   1 +
 Documentation/networking/xdp-rx-metadata.rst | 108 +++++++++++++++++++
 2 files changed, 109 insertions(+)
 create mode 100644 Documentation/networking/xdp-rx-metadata.rst

diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst
index 4f2d1f682a18..4ddcae33c336 100644
--- a/Documentation/networking/index.rst
+++ b/Documentation/networking/index.rst
@@ -120,6 +120,7 @@ Refer to :ref:`netdev-FAQ` for a guide on netdev development process specifics.
    xfrm_proc
    xfrm_sync
    xfrm_sysctl
+   xdp-rx-metadata
 
 .. only::  subproject and html
 
diff --git a/Documentation/networking/xdp-rx-metadata.rst b/Documentation/networking/xdp-rx-metadata.rst
new file mode 100644
index 000000000000..b6c8c77937c4
--- /dev/null
+++ b/Documentation/networking/xdp-rx-metadata.rst
@@ -0,0 +1,108 @@
+===============
+XDP RX Metadata
+===============
+
+This document describes how an eXpress Data Path (XDP) program can access
+hardware metadata related to a packet using a set of helper functions,
+and how it can pass that metadata on to other consumers.
+
+General Design
+==============
+
+XDP has access to a set of kfuncs to manipulate the metadata in an XDP frame.
+Every device driver that wishes to expose additional packet metadata can
+implement these kfuncs. The set of kfuncs is declared in ``include/net/xdp.h``
+via ``XDP_METADATA_KFUNC_xxx``.
+
+Currently, the following kfuncs are supported. In the future, as more
+metadata is supported, this set will grow:
+
+.. kernel-doc:: net/core/xdp.c
+   :identifiers: bpf_xdp_metadata_rx_timestamp bpf_xdp_metadata_rx_hash
+
+An XDP program can use these kfuncs to read the metadata into stack
+variables for its own consumption. Or, to pass the metadata on to other
+consumers, an XDP program can store it into the metadata area carried
+ahead of the packet.
+
+Not all kfuncs have to be implemented by the device driver; when not
+implemented, the default ones that return ``-EOPNOTSUPP`` will be used.
+
+Within an XDP frame, the metadata layout is as follows::
+
+  +----------+-----------------+------+
+  | headroom | custom metadata | data |
+  +----------+-----------------+------+
+             ^                 ^
+             |                 |
+   xdp_buff->data_meta   xdp_buff->data
+
+An XDP program can store individual metadata items into this ``data_meta``
+area in whichever format it chooses. Later consumers of the metadata
+will have to agree on the format by some out-of-band contract (like for
+the AF_XDP use case, see below).
+
+AF_XDP
+======
+
+The :doc:`af_xdp` use-case implies a contract between the BPF
+program that redirects XDP frames into the ``AF_XDP`` socket (``XSK``) and
+the final consumer. Thus the BPF program manually allocates a fixed number of
+bytes of metadata space via ``bpf_xdp_adjust_meta`` and calls a subset
+of kfuncs to populate it. The userspace ``XSK`` consumer computes
+``xsk_umem__get_data() - METADATA_SIZE`` to locate that metadata.
+Note, ``xsk_umem__get_data`` is defined in ``libxdp`` and
+``METADATA_SIZE`` is an application-specific constant.
+
+Here is the ``AF_XDP`` consumer layout (note the missing ``data_meta`` pointer)::
+
+  +----------+-----------------+------+
+  | headroom | custom metadata | data |
+  +----------+-----------------+------+
+                               ^
+                               |
+                        rx_desc->address
+
+XDP_PASS
+========
+
+This is the path where the packets processed by the XDP program are passed
+into the kernel. The kernel creates the ``skb`` out of the ``xdp_buff``
+contents. Currently, every driver has custom kernel code to parse
+the descriptors and populate ``skb`` metadata when doing this ``xdp_buff->skb``
+conversion, and the XDP metadata is not used by the kernel when building
+``skbs``. However, TC-BPF programs can access the XDP metadata area using
+the ``data_meta`` pointer.
+
+In the future, we'd like to support a case where an XDP program
+can override some of the metadata used for building ``skbs``.
+
+bpf_redirect_map
+================
+
+``bpf_redirect_map`` can redirect the frame to a different device.
+Some devices (like virtual ethernet links) support running a second XDP
+program after the redirect. However, the final consumer doesn't have
+access to the original hardware descriptor and can't access any of
+the original metadata. The same applies to XDP programs installed
+into devmaps and cpumaps.
+
+This means that for redirected packets only custom metadata is
+currently supported, which has to be prepared by the initial XDP program
+before redirect. If the frame is eventually passed to the kernel, the
+``skb`` created from such a frame won't have any hardware metadata populated
+in it. If such a packet is later redirected into an ``XSK``, the
+consumer will likewise only have access to the custom metadata.
+
+bpf_tail_call
+=============
+
+Adding programs that access metadata kfuncs to the ``BPF_MAP_TYPE_PROG_ARRAY``
+is currently not supported.
+
+Example
+=======
+
+See ``tools/testing/selftests/bpf/progs/xdp_metadata.c`` and
+``tools/testing/selftests/bpf/prog_tests/xdp_metadata.c`` for an example of
+a BPF program that handles XDP metadata.
-- 
2.39.0.314.g84b9a713c41-goog



* [xdp-hints] [PATCH bpf-next v7 02/17] bpf: Rename bpf_{prog,map}_is_dev_bound to is_offloaded
  2023-01-12  0:32 [xdp-hints] [PATCH bpf-next v7 00/17] xdp: hints via kfuncs Stanislav Fomichev
  2023-01-12  0:32 ` [xdp-hints] [PATCH bpf-next v7 01/17] bpf: Document XDP RX metadata Stanislav Fomichev
@ 2023-01-12  0:32 ` Stanislav Fomichev
  2023-01-12  0:32 ` [xdp-hints] [PATCH bpf-next v7 03/17] bpf: Move offload initialization into late_initcall Stanislav Fomichev
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 42+ messages in thread
From: Stanislav Fomichev @ 2023-01-12  0:32 UTC
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev,
	Jakub Kicinski

BPF offloading infra will be reused to implement
bound-but-not-offloaded BPF programs. Rename the existing
helpers for clarity. No functional changes.

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 include/linux/bpf.h   |  8 ++++----
 kernel/bpf/core.c     |  4 ++--
 kernel/bpf/offload.c  |  4 ++--
 kernel/bpf/syscall.c  | 22 +++++++++++-----------
 kernel/bpf/verifier.c | 18 +++++++++---------
 net/core/dev.c        |  4 ++--
 net/core/filter.c     |  2 +-
 7 files changed, 31 insertions(+), 31 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index ae7771c7d750..1bb525c0130e 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -2481,12 +2481,12 @@ void unpriv_ebpf_notify(int new_state);
 #if defined(CONFIG_NET) && defined(CONFIG_BPF_SYSCALL)
 int bpf_prog_offload_init(struct bpf_prog *prog, union bpf_attr *attr);
 
-static inline bool bpf_prog_is_dev_bound(const struct bpf_prog_aux *aux)
+static inline bool bpf_prog_is_offloaded(const struct bpf_prog_aux *aux)
 {
 	return aux->offload_requested;
 }
 
-static inline bool bpf_map_is_dev_bound(struct bpf_map *map)
+static inline bool bpf_map_is_offloaded(struct bpf_map *map)
 {
 	return unlikely(map->ops == &bpf_map_offload_ops);
 }
@@ -2513,12 +2513,12 @@ static inline int bpf_prog_offload_init(struct bpf_prog *prog,
 	return -EOPNOTSUPP;
 }
 
-static inline bool bpf_prog_is_dev_bound(struct bpf_prog_aux *aux)
+static inline bool bpf_prog_is_offloaded(struct bpf_prog_aux *aux)
 {
 	return false;
 }
 
-static inline bool bpf_map_is_dev_bound(struct bpf_map *map)
+static inline bool bpf_map_is_offloaded(struct bpf_map *map)
 {
 	return false;
 }
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index ba3fff17e2f9..515f4f08591c 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -2182,7 +2182,7 @@ struct bpf_prog *bpf_prog_select_runtime(struct bpf_prog *fp, int *err)
 	 * valid program, which in this case would simply not
 	 * be JITed, but falls back to the interpreter.
 	 */
-	if (!bpf_prog_is_dev_bound(fp->aux)) {
+	if (!bpf_prog_is_offloaded(fp->aux)) {
 		*err = bpf_prog_alloc_jited_linfo(fp);
 		if (*err)
 			return fp;
@@ -2553,7 +2553,7 @@ static void bpf_prog_free_deferred(struct work_struct *work)
 #endif
 	bpf_free_used_maps(aux);
 	bpf_free_used_btfs(aux);
-	if (bpf_prog_is_dev_bound(aux))
+	if (bpf_prog_is_offloaded(aux))
 		bpf_prog_offload_destroy(aux->prog);
 #ifdef CONFIG_PERF_EVENTS
 	if (aux->prog->has_callchain_buf)
diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
index 13e4efc971e6..f5769a8ecbee 100644
--- a/kernel/bpf/offload.c
+++ b/kernel/bpf/offload.c
@@ -549,7 +549,7 @@ static bool __bpf_offload_dev_match(struct bpf_prog *prog,
 	struct bpf_offload_netdev *ondev1, *ondev2;
 	struct bpf_prog_offload *offload;
 
-	if (!bpf_prog_is_dev_bound(prog->aux))
+	if (!bpf_prog_is_offloaded(prog->aux))
 		return false;
 
 	offload = prog->aux->offload;
@@ -581,7 +581,7 @@ bool bpf_offload_prog_map_match(struct bpf_prog *prog, struct bpf_map *map)
 	struct bpf_offloaded_map *offmap;
 	bool ret;
 
-	if (!bpf_map_is_dev_bound(map))
+	if (!bpf_map_is_offloaded(map))
 		return bpf_map_offload_neutral(map);
 	offmap = map_to_offmap(map);
 
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 35ffd808f281..5e90b697f908 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -181,7 +181,7 @@ static int bpf_map_update_value(struct bpf_map *map, struct file *map_file,
 	int err;
 
 	/* Need to create a kthread, thus must support schedule */
-	if (bpf_map_is_dev_bound(map)) {
+	if (bpf_map_is_offloaded(map)) {
 		return bpf_map_offload_update_elem(map, key, value, flags);
 	} else if (map->map_type == BPF_MAP_TYPE_CPUMAP ||
 		   map->map_type == BPF_MAP_TYPE_STRUCT_OPS) {
@@ -238,7 +238,7 @@ static int bpf_map_copy_value(struct bpf_map *map, void *key, void *value,
 	void *ptr;
 	int err;
 
-	if (bpf_map_is_dev_bound(map))
+	if (bpf_map_is_offloaded(map))
 		return bpf_map_offload_lookup_elem(map, key, value);
 
 	bpf_disable_instrumentation();
@@ -1483,7 +1483,7 @@ static int map_delete_elem(union bpf_attr *attr, bpfptr_t uattr)
 		goto err_put;
 	}
 
-	if (bpf_map_is_dev_bound(map)) {
+	if (bpf_map_is_offloaded(map)) {
 		err = bpf_map_offload_delete_elem(map, key);
 		goto out;
 	} else if (IS_FD_PROG_ARRAY(map) ||
@@ -1547,7 +1547,7 @@ static int map_get_next_key(union bpf_attr *attr)
 	if (!next_key)
 		goto free_key;
 
-	if (bpf_map_is_dev_bound(map)) {
+	if (bpf_map_is_offloaded(map)) {
 		err = bpf_map_offload_get_next_key(map, key, next_key);
 		goto out;
 	}
@@ -1605,7 +1605,7 @@ int generic_map_delete_batch(struct bpf_map *map,
 				   map->key_size))
 			break;
 
-		if (bpf_map_is_dev_bound(map)) {
+		if (bpf_map_is_offloaded(map)) {
 			err = bpf_map_offload_delete_elem(map, key);
 			break;
 		}
@@ -1851,7 +1851,7 @@ static int map_lookup_and_delete_elem(union bpf_attr *attr)
 		   map->map_type == BPF_MAP_TYPE_PERCPU_HASH ||
 		   map->map_type == BPF_MAP_TYPE_LRU_HASH ||
 		   map->map_type == BPF_MAP_TYPE_LRU_PERCPU_HASH) {
-		if (!bpf_map_is_dev_bound(map)) {
+		if (!bpf_map_is_offloaded(map)) {
 			bpf_disable_instrumentation();
 			rcu_read_lock();
 			err = map->ops->map_lookup_and_delete_elem(map, key, value, attr->flags);
@@ -1944,7 +1944,7 @@ static int find_prog_type(enum bpf_prog_type type, struct bpf_prog *prog)
 	if (!ops)
 		return -EINVAL;
 
-	if (!bpf_prog_is_dev_bound(prog->aux))
+	if (!bpf_prog_is_offloaded(prog->aux))
 		prog->aux->ops = ops;
 	else
 		prog->aux->ops = &bpf_offload_prog_ops;
@@ -2255,7 +2255,7 @@ bool bpf_prog_get_ok(struct bpf_prog *prog,
 
 	if (prog->type != *attach_type)
 		return false;
-	if (bpf_prog_is_dev_bound(prog->aux) && !attach_drv)
+	if (bpf_prog_is_offloaded(prog->aux) && !attach_drv)
 		return false;
 
 	return true;
@@ -2598,7 +2598,7 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr)
 	atomic64_set(&prog->aux->refcnt, 1);
 	prog->gpl_compatible = is_gpl ? 1 : 0;
 
-	if (bpf_prog_is_dev_bound(prog->aux)) {
+	if (bpf_prog_is_offloaded(prog->aux)) {
 		err = bpf_prog_offload_init(prog, attr);
 		if (err)
 			goto free_prog_sec;
@@ -3997,7 +3997,7 @@ static int bpf_prog_get_info_by_fd(struct file *file,
 			return -EFAULT;
 	}
 
-	if (bpf_prog_is_dev_bound(prog->aux)) {
+	if (bpf_prog_is_offloaded(prog->aux)) {
 		err = bpf_prog_offload_info_fill(&info, prog);
 		if (err)
 			return err;
@@ -4225,7 +4225,7 @@ static int bpf_map_get_info_by_fd(struct file *file,
 	}
 	info.btf_vmlinux_value_type_id = map->btf_vmlinux_value_type_id;
 
-	if (bpf_map_is_dev_bound(map)) {
+	if (bpf_map_is_offloaded(map)) {
 		err = bpf_map_offload_info_fill(&info, map);
 		if (err)
 			return err;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index fa4c911603e9..026a6789e896 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -13826,7 +13826,7 @@ static int do_check(struct bpf_verifier_env *env)
 			env->prev_log_len = env->log.len_used;
 		}
 
-		if (bpf_prog_is_dev_bound(env->prog->aux)) {
+		if (bpf_prog_is_offloaded(env->prog->aux)) {
 			err = bpf_prog_offload_verify_insn(env, env->insn_idx,
 							   env->prev_insn_idx);
 			if (err)
@@ -14306,7 +14306,7 @@ static int check_map_prog_compatibility(struct bpf_verifier_env *env,
 		}
 	}
 
-	if ((bpf_prog_is_dev_bound(prog->aux) || bpf_map_is_dev_bound(map)) &&
+	if ((bpf_prog_is_offloaded(prog->aux) || bpf_map_is_offloaded(map)) &&
 	    !bpf_offload_prog_map_match(prog, map)) {
 		verbose(env, "offload device mismatch between prog and map\n");
 		return -EINVAL;
@@ -14787,7 +14787,7 @@ static int verifier_remove_insns(struct bpf_verifier_env *env, u32 off, u32 cnt)
 	unsigned int orig_prog_len = env->prog->len;
 	int err;
 
-	if (bpf_prog_is_dev_bound(env->prog->aux))
+	if (bpf_prog_is_offloaded(env->prog->aux))
 		bpf_prog_offload_remove_insns(env, off, cnt);
 
 	err = bpf_remove_insns(env->prog, off, cnt);
@@ -14868,7 +14868,7 @@ static void opt_hard_wire_dead_code_branches(struct bpf_verifier_env *env)
 		else
 			continue;
 
-		if (bpf_prog_is_dev_bound(env->prog->aux))
+		if (bpf_prog_is_offloaded(env->prog->aux))
 			bpf_prog_offload_replace_insn(env, i, &ja);
 
 		memcpy(insn, &ja, sizeof(ja));
@@ -15055,7 +15055,7 @@ static int convert_ctx_accesses(struct bpf_verifier_env *env)
 		}
 	}
 
-	if (bpf_prog_is_dev_bound(env->prog->aux))
+	if (bpf_prog_is_offloaded(env->prog->aux))
 		return 0;
 
 	insn = env->prog->insnsi + delta;
@@ -15455,7 +15455,7 @@ static int fixup_call_args(struct bpf_verifier_env *env)
 	int err = 0;
 
 	if (env->prog->jit_requested &&
-	    !bpf_prog_is_dev_bound(env->prog->aux)) {
+	    !bpf_prog_is_offloaded(env->prog->aux)) {
 		err = jit_subprogs(env);
 		if (err == 0)
 			return 0;
@@ -16942,7 +16942,7 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr, bpfptr_t uattr)
 	if (ret < 0)
 		goto skip_full_check;
 
-	if (bpf_prog_is_dev_bound(env->prog->aux)) {
+	if (bpf_prog_is_offloaded(env->prog->aux)) {
 		ret = bpf_prog_offload_verifier_prep(env->prog);
 		if (ret)
 			goto skip_full_check;
@@ -16955,7 +16955,7 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr, bpfptr_t uattr)
 	ret = do_check_subprogs(env);
 	ret = ret ?: do_check_main(env);
 
-	if (ret == 0 && bpf_prog_is_dev_bound(env->prog->aux))
+	if (ret == 0 && bpf_prog_is_offloaded(env->prog->aux))
 		ret = bpf_prog_offload_finalize(env);
 
 skip_full_check:
@@ -16990,7 +16990,7 @@ int bpf_check(struct bpf_prog **prog, union bpf_attr *attr, bpfptr_t uattr)
 	/* do 32-bit optimization after insn patching has done so those patched
 	 * insns could be handled correctly.
 	 */
-	if (ret == 0 && !bpf_prog_is_dev_bound(env->prog->aux)) {
+	if (ret == 0 && !bpf_prog_is_offloaded(env->prog->aux)) {
 		ret = opt_subreg_zext_lo32_rnd_hi32(env, attr);
 		env->prog->aux->verifier_zext = bpf_jit_needs_zext() ? !ret
 								     : false;
diff --git a/net/core/dev.c b/net/core/dev.c
index cf78f35bc0b9..a37829de6529 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -9224,8 +9224,8 @@ static int dev_xdp_attach(struct net_device *dev, struct netlink_ext_ack *extack
 			NL_SET_ERR_MSG(extack, "Native and generic XDP can't be active at the same time");
 			return -EEXIST;
 		}
-		if (!offload && bpf_prog_is_dev_bound(new_prog->aux)) {
-			NL_SET_ERR_MSG(extack, "Using device-bound program without HW_MODE flag is not supported");
+		if (!offload && bpf_prog_is_offloaded(new_prog->aux)) {
+			NL_SET_ERR_MSG(extack, "Using offloaded program without HW_MODE flag is not supported");
 			return -EINVAL;
 		}
 		if (new_prog->expected_attach_type == BPF_XDP_DEVMAP) {
diff --git a/net/core/filter.c b/net/core/filter.c
index d9befa6ba04e..35fb0aaccbef 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -8731,7 +8731,7 @@ static bool xdp_is_valid_access(int off, int size,
 	}
 
 	if (type == BPF_WRITE) {
-		if (bpf_prog_is_dev_bound(prog->aux)) {
+		if (bpf_prog_is_offloaded(prog->aux)) {
 			switch (off) {
 			case offsetof(struct xdp_md, rx_queue_index):
 				return __is_valid_xdp_access(off, size);
-- 
2.39.0.314.g84b9a713c41-goog



* [xdp-hints] [PATCH bpf-next v7 03/17] bpf: Move offload initialization into late_initcall
  2023-01-12  0:32 [xdp-hints] [PATCH bpf-next v7 00/17] xdp: hints via kfuncs Stanislav Fomichev
  2023-01-12  0:32 ` [xdp-hints] [PATCH bpf-next v7 01/17] bpf: Document XDP RX metadata Stanislav Fomichev
  2023-01-12  0:32 ` [xdp-hints] [PATCH bpf-next v7 02/17] bpf: Rename bpf_{prog,map}_is_dev_bound to is_offloaded Stanislav Fomichev
@ 2023-01-12  0:32 ` Stanislav Fomichev
  2023-01-12  0:32 ` [xdp-hints] [PATCH bpf-next v7 04/17] bpf: Reshuffle some parts of bpf/offload.c Stanislav Fomichev
                   ` (13 subsequent siblings)
  16 siblings, 0 replies; 42+ messages in thread
From: Stanislav Fomichev @ 2023-01-12  0:32 UTC
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

Move the offdevs rhashtable initialization into a late_initcall so
that it doesn't have to be initialized lazily from several call paths.

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 kernel/bpf/offload.c | 22 +++++++---------------
 1 file changed, 7 insertions(+), 15 deletions(-)

diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
index f5769a8ecbee..621e8738f304 100644
--- a/kernel/bpf/offload.c
+++ b/kernel/bpf/offload.c
@@ -56,7 +56,6 @@ static const struct rhashtable_params offdevs_params = {
 };
 
 static struct rhashtable offdevs;
-static bool offdevs_inited;
 
 static int bpf_dev_offload_check(struct net_device *netdev)
 {
@@ -72,8 +71,6 @@ bpf_offload_find_netdev(struct net_device *netdev)
 {
 	lockdep_assert_held(&bpf_devs_lock);
 
-	if (!offdevs_inited)
-		return NULL;
 	return rhashtable_lookup_fast(&offdevs, &netdev, offdevs_params);
 }
 
@@ -673,18 +670,6 @@ struct bpf_offload_dev *
 bpf_offload_dev_create(const struct bpf_prog_offload_ops *ops, void *priv)
 {
 	struct bpf_offload_dev *offdev;
-	int err;
-
-	down_write(&bpf_devs_lock);
-	if (!offdevs_inited) {
-		err = rhashtable_init(&offdevs, &offdevs_params);
-		if (err) {
-			up_write(&bpf_devs_lock);
-			return ERR_PTR(err);
-		}
-		offdevs_inited = true;
-	}
-	up_write(&bpf_devs_lock);
 
 	offdev = kzalloc(sizeof(*offdev), GFP_KERNEL);
 	if (!offdev)
@@ -710,3 +695,10 @@ void *bpf_offload_dev_priv(struct bpf_offload_dev *offdev)
 	return offdev->priv;
 }
 EXPORT_SYMBOL_GPL(bpf_offload_dev_priv);
+
+static int __init bpf_offload_init(void)
+{
+	return rhashtable_init(&offdevs, &offdevs_params);
+}
+
+late_initcall(bpf_offload_init);
-- 
2.39.0.314.g84b9a713c41-goog



* [xdp-hints] [PATCH bpf-next v7 04/17] bpf: Reshuffle some parts of bpf/offload.c
  2023-01-12  0:32 [xdp-hints] [PATCH bpf-next v7 00/17] xdp: hints via kfuncs Stanislav Fomichev
                   ` (2 preceding siblings ...)
  2023-01-12  0:32 ` [xdp-hints] [PATCH bpf-next v7 03/17] bpf: Move offload initialization into late_initcall Stanislav Fomichev
@ 2023-01-12  0:32 ` Stanislav Fomichev
  2023-01-12  0:32 ` [xdp-hints] [PATCH bpf-next v7 05/17] bpf: Introduce device-bound XDP programs Stanislav Fomichev
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 42+ messages in thread
From: Stanislav Fomichev @ 2023-01-12  0:32 UTC
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

To avoid adding forward declarations in the main patch, shuffle
some code around. No functional changes.

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 kernel/bpf/offload.c | 222 +++++++++++++++++++++++--------------------
 1 file changed, 117 insertions(+), 105 deletions(-)

diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
index 621e8738f304..deb06498da0b 100644
--- a/kernel/bpf/offload.c
+++ b/kernel/bpf/offload.c
@@ -74,6 +74,121 @@ bpf_offload_find_netdev(struct net_device *netdev)
 	return rhashtable_lookup_fast(&offdevs, &netdev, offdevs_params);
 }
 
+static int __bpf_offload_dev_netdev_register(struct bpf_offload_dev *offdev,
+					     struct net_device *netdev)
+{
+	struct bpf_offload_netdev *ondev;
+	int err;
+
+	ondev = kzalloc(sizeof(*ondev), GFP_KERNEL);
+	if (!ondev)
+		return -ENOMEM;
+
+	ondev->netdev = netdev;
+	ondev->offdev = offdev;
+	INIT_LIST_HEAD(&ondev->progs);
+	INIT_LIST_HEAD(&ondev->maps);
+
+	down_write(&bpf_devs_lock);
+	err = rhashtable_insert_fast(&offdevs, &ondev->l, offdevs_params);
+	if (err) {
+		netdev_warn(netdev, "failed to register for BPF offload\n");
+		goto err_unlock_free;
+	}
+
+	list_add(&ondev->offdev_netdevs, &offdev->netdevs);
+	up_write(&bpf_devs_lock);
+	return 0;
+
+err_unlock_free:
+	up_write(&bpf_devs_lock);
+	kfree(ondev);
+	return err;
+}
+
+static void __bpf_prog_offload_destroy(struct bpf_prog *prog)
+{
+	struct bpf_prog_offload *offload = prog->aux->offload;
+
+	if (offload->dev_state)
+		offload->offdev->ops->destroy(prog);
+
+	/* Make sure BPF_PROG_GET_NEXT_ID can't find this dead program */
+	bpf_prog_free_id(prog, true);
+
+	list_del_init(&offload->offloads);
+	kfree(offload);
+	prog->aux->offload = NULL;
+}
+
+static int bpf_map_offload_ndo(struct bpf_offloaded_map *offmap,
+			       enum bpf_netdev_command cmd)
+{
+	struct netdev_bpf data = {};
+	struct net_device *netdev;
+
+	ASSERT_RTNL();
+
+	data.command = cmd;
+	data.offmap = offmap;
+	/* Caller must make sure netdev is valid */
+	netdev = offmap->netdev;
+
+	return netdev->netdev_ops->ndo_bpf(netdev, &data);
+}
+
+static void __bpf_map_offload_destroy(struct bpf_offloaded_map *offmap)
+{
+	WARN_ON(bpf_map_offload_ndo(offmap, BPF_OFFLOAD_MAP_FREE));
+	/* Make sure BPF_MAP_GET_NEXT_ID can't find this dead map */
+	bpf_map_free_id(&offmap->map, true);
+	list_del_init(&offmap->offloads);
+	offmap->netdev = NULL;
+}
+
+static void __bpf_offload_dev_netdev_unregister(struct bpf_offload_dev *offdev,
+						struct net_device *netdev)
+{
+	struct bpf_offload_netdev *ondev, *altdev;
+	struct bpf_offloaded_map *offmap, *mtmp;
+	struct bpf_prog_offload *offload, *ptmp;
+
+	ASSERT_RTNL();
+
+	down_write(&bpf_devs_lock);
+	ondev = rhashtable_lookup_fast(&offdevs, &netdev, offdevs_params);
+	if (WARN_ON(!ondev))
+		goto unlock;
+
+	WARN_ON(rhashtable_remove_fast(&offdevs, &ondev->l, offdevs_params));
+	list_del(&ondev->offdev_netdevs);
+
+	/* Try to move the objects to another netdev of the device */
+	altdev = list_first_entry_or_null(&offdev->netdevs,
+					  struct bpf_offload_netdev,
+					  offdev_netdevs);
+	if (altdev) {
+		list_for_each_entry(offload, &ondev->progs, offloads)
+			offload->netdev = altdev->netdev;
+		list_splice_init(&ondev->progs, &altdev->progs);
+
+		list_for_each_entry(offmap, &ondev->maps, offloads)
+			offmap->netdev = altdev->netdev;
+		list_splice_init(&ondev->maps, &altdev->maps);
+	} else {
+		list_for_each_entry_safe(offload, ptmp, &ondev->progs, offloads)
+			__bpf_prog_offload_destroy(offload->prog);
+		list_for_each_entry_safe(offmap, mtmp, &ondev->maps, offloads)
+			__bpf_map_offload_destroy(offmap);
+	}
+
+	WARN_ON(!list_empty(&ondev->progs));
+	WARN_ON(!list_empty(&ondev->maps));
+	kfree(ondev);
+unlock:
+	up_write(&bpf_devs_lock);
+}
+
 int bpf_prog_offload_init(struct bpf_prog *prog, union bpf_attr *attr)
 {
 	struct bpf_offload_netdev *ondev;
@@ -206,21 +321,6 @@ bpf_prog_offload_remove_insns(struct bpf_verifier_env *env, u32 off, u32 cnt)
 	up_read(&bpf_devs_lock);
 }
 
-static void __bpf_prog_offload_destroy(struct bpf_prog *prog)
-{
-	struct bpf_prog_offload *offload = prog->aux->offload;
-
-	if (offload->dev_state)
-		offload->offdev->ops->destroy(prog);
-
-	/* Make sure BPF_PROG_GET_NEXT_ID can't find this dead program */
-	bpf_prog_free_id(prog, true);
-
-	list_del_init(&offload->offloads);
-	kfree(offload);
-	prog->aux->offload = NULL;
-}
-
 void bpf_prog_offload_destroy(struct bpf_prog *prog)
 {
 	down_write(&bpf_devs_lock);
@@ -340,22 +440,6 @@ int bpf_prog_offload_info_fill(struct bpf_prog_info *info,
 const struct bpf_prog_ops bpf_offload_prog_ops = {
 };
 
-static int bpf_map_offload_ndo(struct bpf_offloaded_map *offmap,
-			       enum bpf_netdev_command cmd)
-{
-	struct netdev_bpf data = {};
-	struct net_device *netdev;
-
-	ASSERT_RTNL();
-
-	data.command = cmd;
-	data.offmap = offmap;
-	/* Caller must make sure netdev is valid */
-	netdev = offmap->netdev;
-
-	return netdev->netdev_ops->ndo_bpf(netdev, &data);
-}
-
 struct bpf_map *bpf_map_offload_map_alloc(union bpf_attr *attr)
 {
 	struct net *net = current->nsproxy->net_ns;
@@ -405,15 +489,6 @@ struct bpf_map *bpf_map_offload_map_alloc(union bpf_attr *attr)
 	return ERR_PTR(err);
 }
 
-static void __bpf_map_offload_destroy(struct bpf_offloaded_map *offmap)
-{
-	WARN_ON(bpf_map_offload_ndo(offmap, BPF_OFFLOAD_MAP_FREE));
-	/* Make sure BPF_MAP_GET_NEXT_ID can't find this dead map */
-	bpf_map_free_id(&offmap->map, true);
-	list_del_init(&offmap->offloads);
-	offmap->netdev = NULL;
-}
-
 void bpf_map_offload_map_free(struct bpf_map *map)
 {
 	struct bpf_offloaded_map *offmap = map_to_offmap(map);
@@ -592,77 +667,14 @@ bool bpf_offload_prog_map_match(struct bpf_prog *prog, struct bpf_map *map)
 int bpf_offload_dev_netdev_register(struct bpf_offload_dev *offdev,
 				    struct net_device *netdev)
 {
-	struct bpf_offload_netdev *ondev;
-	int err;
-
-	ondev = kzalloc(sizeof(*ondev), GFP_KERNEL);
-	if (!ondev)
-		return -ENOMEM;
-
-	ondev->netdev = netdev;
-	ondev->offdev = offdev;
-	INIT_LIST_HEAD(&ondev->progs);
-	INIT_LIST_HEAD(&ondev->maps);
-
-	down_write(&bpf_devs_lock);
-	err = rhashtable_insert_fast(&offdevs, &ondev->l, offdevs_params);
-	if (err) {
-		netdev_warn(netdev, "failed to register for BPF offload\n");
-		goto err_unlock_free;
-	}
-
-	list_add(&ondev->offdev_netdevs, &offdev->netdevs);
-	up_write(&bpf_devs_lock);
-	return 0;
-
-err_unlock_free:
-	up_write(&bpf_devs_lock);
-	kfree(ondev);
-	return err;
+	return __bpf_offload_dev_netdev_register(offdev, netdev);
 }
 EXPORT_SYMBOL_GPL(bpf_offload_dev_netdev_register);
 
 void bpf_offload_dev_netdev_unregister(struct bpf_offload_dev *offdev,
 				       struct net_device *netdev)
 {
-	struct bpf_offload_netdev *ondev, *altdev;
-	struct bpf_offloaded_map *offmap, *mtmp;
-	struct bpf_prog_offload *offload, *ptmp;
-
-	ASSERT_RTNL();
-
-	down_write(&bpf_devs_lock);
-	ondev = rhashtable_lookup_fast(&offdevs, &netdev, offdevs_params);
-	if (WARN_ON(!ondev))
-		goto unlock;
-
-	WARN_ON(rhashtable_remove_fast(&offdevs, &ondev->l, offdevs_params));
-	list_del(&ondev->offdev_netdevs);
-
-	/* Try to move the objects to another netdev of the device */
-	altdev = list_first_entry_or_null(&offdev->netdevs,
-					  struct bpf_offload_netdev,
-					  offdev_netdevs);
-	if (altdev) {
-		list_for_each_entry(offload, &ondev->progs, offloads)
-			offload->netdev = altdev->netdev;
-		list_splice_init(&ondev->progs, &altdev->progs);
-
-		list_for_each_entry(offmap, &ondev->maps, offloads)
-			offmap->netdev = altdev->netdev;
-		list_splice_init(&ondev->maps, &altdev->maps);
-	} else {
-		list_for_each_entry_safe(offload, ptmp, &ondev->progs, offloads)
-			__bpf_prog_offload_destroy(offload->prog);
-		list_for_each_entry_safe(offmap, mtmp, &ondev->maps, offloads)
-			__bpf_map_offload_destroy(offmap);
-	}
-
-	WARN_ON(!list_empty(&ondev->progs));
-	WARN_ON(!list_empty(&ondev->maps));
-	kfree(ondev);
-unlock:
-	up_write(&bpf_devs_lock);
+	__bpf_offload_dev_netdev_unregister(offdev, netdev);
 }
 EXPORT_SYMBOL_GPL(bpf_offload_dev_netdev_unregister);
 
-- 
2.39.0.314.g84b9a713c41-goog



* [xdp-hints] [PATCH bpf-next v7 05/17] bpf: Introduce device-bound XDP programs
  2023-01-12  0:32 [xdp-hints] [PATCH bpf-next v7 00/17] xdp: hints via kfuncs Stanislav Fomichev
                   ` (3 preceding siblings ...)
  2023-01-12  0:32 ` [xdp-hints] [PATCH bpf-next v7 04/17] bpf: Reshuffle some parts of bpf/offload.c Stanislav Fomichev
@ 2023-01-12  0:32 ` Stanislav Fomichev
  2023-01-12  0:32 ` [xdp-hints] [PATCH bpf-next v7 06/17] selftests/bpf: Update expected test_offload.py messages Stanislav Fomichev
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 42+ messages in thread
From: Stanislav Fomichev @ 2023-01-12  0:32 UTC
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

Add a new flag, BPF_F_XDP_DEV_BOUND_ONLY, plus all the infra needed to
associate a netdev with a BPF program at load time.

The netdevsim-specific checks are dropped in favor of a generic check
in dev_xdp_attach.
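
For illustration, a sketch of requesting the new mode from userspace
via libbpf's low-level loading API ("eth0" and the program name are
placeholders; error handling elided):

  LIBBPF_OPTS(bpf_prog_load_opts, opts,
              .prog_flags = BPF_F_XDP_DEV_BOUND_ONLY,
              .prog_ifindex = if_nametoindex("eth0"));
  int fd = bpf_prog_load(BPF_PROG_TYPE_XDP, "xdp_meta", "GPL",
                         insns, insn_cnt, &opts);

Setting prog_ifindex without the new flag keeps its existing meaning
of requesting full hardware offload.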

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 drivers/net/netdevsim/bpf.c    |  4 --
 include/linux/bpf.h            | 24 +++++++--
 include/uapi/linux/bpf.h       |  5 ++
 kernel/bpf/core.c              |  4 +-
 kernel/bpf/offload.c           | 95 +++++++++++++++++++++++++---------
 kernel/bpf/syscall.c           |  9 ++--
 net/core/dev.c                 |  5 ++
 tools/include/uapi/linux/bpf.h |  5 ++
 8 files changed, 113 insertions(+), 38 deletions(-)

diff --git a/drivers/net/netdevsim/bpf.c b/drivers/net/netdevsim/bpf.c
index 50854265864d..f60eb97e3a62 100644
--- a/drivers/net/netdevsim/bpf.c
+++ b/drivers/net/netdevsim/bpf.c
@@ -315,10 +315,6 @@ nsim_setup_prog_hw_checks(struct netdevsim *ns, struct netdev_bpf *bpf)
 		NSIM_EA(bpf->extack, "xdpoffload of non-bound program");
 		return -EINVAL;
 	}
-	if (!bpf_offload_dev_match(bpf->prog, ns->netdev)) {
-		NSIM_EA(bpf->extack, "program bound to different dev");
-		return -EINVAL;
-	}
 
 	state = bpf->prog->aux->offload->dev_priv;
 	if (WARN_ON(strcmp(state->state, "xlated"))) {
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 1bb525c0130e..b97a05bb47be 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1261,7 +1261,8 @@ struct bpf_prog_aux {
 	enum bpf_prog_type saved_dst_prog_type;
 	enum bpf_attach_type saved_dst_attach_type;
 	bool verifier_zext; /* Zero extensions has been inserted by verifier. */
-	bool offload_requested;
+	bool dev_bound; /* Program is bound to the netdev. */
+	bool offload_requested; /* Program is bound and offloaded to the netdev. */
 	bool attach_btf_trace; /* true if attaching to BTF-enabled raw tp */
 	bool func_proto_unreliable;
 	bool sleepable;
@@ -2451,7 +2452,7 @@ void __bpf_free_used_maps(struct bpf_prog_aux *aux,
 bool bpf_prog_get_ok(struct bpf_prog *, enum bpf_prog_type *, bool);
 
 int bpf_prog_offload_compile(struct bpf_prog *prog);
-void bpf_prog_offload_destroy(struct bpf_prog *prog);
+void bpf_prog_dev_bound_destroy(struct bpf_prog *prog);
 int bpf_prog_offload_info_fill(struct bpf_prog_info *info,
 			       struct bpf_prog *prog);
 
@@ -2479,7 +2480,13 @@ bool bpf_offload_dev_match(struct bpf_prog *prog, struct net_device *netdev);
 void unpriv_ebpf_notify(int new_state);
 
 #if defined(CONFIG_NET) && defined(CONFIG_BPF_SYSCALL)
-int bpf_prog_offload_init(struct bpf_prog *prog, union bpf_attr *attr);
+int bpf_prog_dev_bound_init(struct bpf_prog *prog, union bpf_attr *attr);
+void bpf_dev_bound_netdev_unregister(struct net_device *dev);
+
+static inline bool bpf_prog_is_dev_bound(const struct bpf_prog_aux *aux)
+{
+	return aux->dev_bound;
+}
 
 static inline bool bpf_prog_is_offloaded(const struct bpf_prog_aux *aux)
 {
@@ -2507,12 +2514,21 @@ void sock_map_unhash(struct sock *sk);
 void sock_map_destroy(struct sock *sk);
 void sock_map_close(struct sock *sk, long timeout);
 #else
-static inline int bpf_prog_offload_init(struct bpf_prog *prog,
+static inline int bpf_prog_dev_bound_init(struct bpf_prog *prog,
 					union bpf_attr *attr)
 {
 	return -EOPNOTSUPP;
 }
 
+static inline void bpf_dev_bound_netdev_unregister(struct net_device *dev)
+{
+}
+
+static inline bool bpf_prog_is_dev_bound(const struct bpf_prog_aux *aux)
+{
+	return false;
+}
+
 static inline bool bpf_prog_is_offloaded(struct bpf_prog_aux *aux)
 {
 	return false;
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index bc1a3d232ae4..d0b6ac896699 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1156,6 +1156,11 @@ enum bpf_link_type {
  */
 #define BPF_F_XDP_HAS_FRAGS	(1U << 5)
 
+/* If BPF_F_XDP_DEV_BOUND_ONLY is used in BPF_PROG_LOAD command, the loaded
+ * program becomes device-bound but can access XDP metadata.
+ */
+#define BPF_F_XDP_DEV_BOUND_ONLY	(1U << 6)
+
 /* link_create.kprobe_multi.flags used in LINK_CREATE command for
  * BPF_TRACE_KPROBE_MULTI attach type to create return probe.
  */
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 515f4f08591c..1cf19da3c128 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -2553,8 +2553,8 @@ static void bpf_prog_free_deferred(struct work_struct *work)
 #endif
 	bpf_free_used_maps(aux);
 	bpf_free_used_btfs(aux);
-	if (bpf_prog_is_offloaded(aux))
-		bpf_prog_offload_destroy(aux->prog);
+	if (bpf_prog_is_dev_bound(aux))
+		bpf_prog_dev_bound_destroy(aux->prog);
 #ifdef CONFIG_PERF_EVENTS
 	if (aux->prog->has_callchain_buf)
 		put_callchain_buffers();
diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
index deb06498da0b..f767455ed732 100644
--- a/kernel/bpf/offload.c
+++ b/kernel/bpf/offload.c
@@ -41,7 +41,7 @@ struct bpf_offload_dev {
 struct bpf_offload_netdev {
 	struct rhash_head l;
 	struct net_device *netdev;
-	struct bpf_offload_dev *offdev;
+	struct bpf_offload_dev *offdev; /* NULL when bound-only */
 	struct list_head progs;
 	struct list_head maps;
 	struct list_head offdev_netdevs;
@@ -89,19 +89,17 @@ static int __bpf_offload_dev_netdev_register(struct bpf_offload_dev *offdev,
 	INIT_LIST_HEAD(&ondev->progs);
 	INIT_LIST_HEAD(&ondev->maps);
 
-	down_write(&bpf_devs_lock);
 	err = rhashtable_insert_fast(&offdevs, &ondev->l, offdevs_params);
 	if (err) {
 		netdev_warn(netdev, "failed to register for BPF offload\n");
-		goto err_unlock_free;
+		goto err_free;
 	}
 
-	list_add(&ondev->offdev_netdevs, &offdev->netdevs);
-	up_write(&bpf_devs_lock);
+	if (offdev)
+		list_add(&ondev->offdev_netdevs, &offdev->netdevs);
 	return 0;
 
-err_unlock_free:
-	up_write(&bpf_devs_lock);
+err_free:
 	kfree(ondev);
 	return err;
 }
@@ -149,24 +147,26 @@ static void __bpf_map_offload_destroy(struct bpf_offloaded_map *offmap)
 static void __bpf_offload_dev_netdev_unregister(struct bpf_offload_dev *offdev,
 						struct net_device *netdev)
 {
-	struct bpf_offload_netdev *ondev, *altdev;
+	struct bpf_offload_netdev *ondev, *altdev = NULL;
 	struct bpf_offloaded_map *offmap, *mtmp;
 	struct bpf_prog_offload *offload, *ptmp;
 
 	ASSERT_RTNL();
 
-	down_write(&bpf_devs_lock);
 	ondev = rhashtable_lookup_fast(&offdevs, &netdev, offdevs_params);
 	if (WARN_ON(!ondev))
-		goto unlock;
+		return;
 
 	WARN_ON(rhashtable_remove_fast(&offdevs, &ondev->l, offdevs_params));
-	list_del(&ondev->offdev_netdevs);
 
 	/* Try to move the objects to another netdev of the device */
-	altdev = list_first_entry_or_null(&offdev->netdevs,
-					  struct bpf_offload_netdev,
-					  offdev_netdevs);
+	if (offdev) {
+		list_del(&ondev->offdev_netdevs);
+		altdev = list_first_entry_or_null(&offdev->netdevs,
+						  struct bpf_offload_netdev,
+						  offdev_netdevs);
+	}
+
 	if (altdev) {
 		list_for_each_entry(offload, &ondev->progs, offloads)
 			offload->netdev = altdev->netdev;
@@ -185,11 +185,9 @@ static void __bpf_offload_dev_netdev_unregister(struct bpf_offload_dev *offdev,
 	WARN_ON(!list_empty(&ondev->progs));
 	WARN_ON(!list_empty(&ondev->maps));
 	kfree(ondev);
-unlock:
-	up_write(&bpf_devs_lock);
 }
 
-int bpf_prog_offload_init(struct bpf_prog *prog, union bpf_attr *attr)
+int bpf_prog_dev_bound_init(struct bpf_prog *prog, union bpf_attr *attr)
 {
 	struct bpf_offload_netdev *ondev;
 	struct bpf_prog_offload *offload;
@@ -199,7 +197,11 @@ int bpf_prog_offload_init(struct bpf_prog *prog, union bpf_attr *attr)
 	    attr->prog_type != BPF_PROG_TYPE_XDP)
 		return -EINVAL;
 
-	if (attr->prog_flags)
+	if (attr->prog_flags & ~BPF_F_XDP_DEV_BOUND_ONLY)
+		return -EINVAL;
+
+	if (attr->prog_type == BPF_PROG_TYPE_SCHED_CLS &&
+	    attr->prog_flags & BPF_F_XDP_DEV_BOUND_ONLY)
 		return -EINVAL;
 
 	offload = kzalloc(sizeof(*offload), GFP_USER);
@@ -214,11 +216,23 @@ int bpf_prog_offload_init(struct bpf_prog *prog, union bpf_attr *attr)
 	if (err)
 		goto err_maybe_put;
 
+	prog->aux->offload_requested = !(attr->prog_flags & BPF_F_XDP_DEV_BOUND_ONLY);
+
 	down_write(&bpf_devs_lock);
 	ondev = bpf_offload_find_netdev(offload->netdev);
 	if (!ondev) {
-		err = -EINVAL;
-		goto err_unlock;
+		if (bpf_prog_is_offloaded(prog->aux)) {
+			err = -EINVAL;
+			goto err_unlock;
+		}
+
+		/* When only binding to the device, explicitly
+		 * create an entry in the hashtable.
+		 */
+		err = __bpf_offload_dev_netdev_register(NULL, offload->netdev);
+		if (err)
+			goto err_unlock;
+		ondev = bpf_offload_find_netdev(offload->netdev);
 	}
 	offload->offdev = ondev->offdev;
 	prog->aux->offload = offload;
@@ -321,12 +335,25 @@ bpf_prog_offload_remove_insns(struct bpf_verifier_env *env, u32 off, u32 cnt)
 	up_read(&bpf_devs_lock);
 }
 
-void bpf_prog_offload_destroy(struct bpf_prog *prog)
+void bpf_prog_dev_bound_destroy(struct bpf_prog *prog)
 {
+	struct bpf_offload_netdev *ondev;
+	struct net_device *netdev;
+
+	rtnl_lock();
 	down_write(&bpf_devs_lock);
-	if (prog->aux->offload)
+	if (prog->aux->offload) {
+		list_del_init(&prog->aux->offload->offloads);
+
+		netdev = prog->aux->offload->netdev;
 		__bpf_prog_offload_destroy(prog);
+
+		ondev = bpf_offload_find_netdev(netdev);
+		if (!ondev->offdev && list_empty(&ondev->progs))
+			__bpf_offload_dev_netdev_unregister(NULL, netdev);
+	}
 	up_write(&bpf_devs_lock);
+	rtnl_unlock();
 }
 
 static int bpf_prog_offload_translate(struct bpf_prog *prog)
@@ -621,7 +648,7 @@ static bool __bpf_offload_dev_match(struct bpf_prog *prog,
 	struct bpf_offload_netdev *ondev1, *ondev2;
 	struct bpf_prog_offload *offload;
 
-	if (!bpf_prog_is_offloaded(prog->aux))
+	if (!bpf_prog_is_dev_bound(prog->aux))
 		return false;
 
 	offload = prog->aux->offload;
@@ -667,14 +694,21 @@ bool bpf_offload_prog_map_match(struct bpf_prog *prog, struct bpf_map *map)
 int bpf_offload_dev_netdev_register(struct bpf_offload_dev *offdev,
 				    struct net_device *netdev)
 {
-	return __bpf_offload_dev_netdev_register(offdev, netdev);
+	int err;
+
+	down_write(&bpf_devs_lock);
+	err = __bpf_offload_dev_netdev_register(offdev, netdev);
+	up_write(&bpf_devs_lock);
+	return err;
 }
 EXPORT_SYMBOL_GPL(bpf_offload_dev_netdev_register);
 
 void bpf_offload_dev_netdev_unregister(struct bpf_offload_dev *offdev,
 				       struct net_device *netdev)
 {
+	down_write(&bpf_devs_lock);
 	__bpf_offload_dev_netdev_unregister(offdev, netdev);
+	up_write(&bpf_devs_lock);
 }
 EXPORT_SYMBOL_GPL(bpf_offload_dev_netdev_unregister);
 
@@ -708,6 +742,19 @@ void *bpf_offload_dev_priv(struct bpf_offload_dev *offdev)
 }
 EXPORT_SYMBOL_GPL(bpf_offload_dev_priv);
 
+void bpf_dev_bound_netdev_unregister(struct net_device *dev)
+{
+	struct bpf_offload_netdev *ondev;
+
+	ASSERT_RTNL();
+
+	down_write(&bpf_devs_lock);
+	ondev = bpf_offload_find_netdev(dev);
+	if (ondev && !ondev->offdev)
+		__bpf_offload_dev_netdev_unregister(NULL, ondev->netdev);
+	up_write(&bpf_devs_lock);
+}
+
 static int __init bpf_offload_init(void)
 {
 	return rhashtable_init(&offdevs, &offdevs_params);
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 5e90b697f908..fdf4ff3d5a7f 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2491,7 +2491,8 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr)
 				 BPF_F_TEST_STATE_FREQ |
 				 BPF_F_SLEEPABLE |
 				 BPF_F_TEST_RND_HI32 |
-				 BPF_F_XDP_HAS_FRAGS))
+				 BPF_F_XDP_HAS_FRAGS |
+				 BPF_F_XDP_DEV_BOUND_ONLY))
 		return -EINVAL;
 
 	if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) &&
@@ -2575,7 +2576,7 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr)
 	prog->aux->attach_btf = attach_btf;
 	prog->aux->attach_btf_id = attr->attach_btf_id;
 	prog->aux->dst_prog = dst_prog;
-	prog->aux->offload_requested = !!attr->prog_ifindex;
+	prog->aux->dev_bound = !!attr->prog_ifindex;
 	prog->aux->sleepable = attr->prog_flags & BPF_F_SLEEPABLE;
 	prog->aux->xdp_has_frags = attr->prog_flags & BPF_F_XDP_HAS_FRAGS;
 
@@ -2598,8 +2599,8 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr)
 	atomic64_set(&prog->aux->refcnt, 1);
 	prog->gpl_compatible = is_gpl ? 1 : 0;
 
-	if (bpf_prog_is_offloaded(prog->aux)) {
-		err = bpf_prog_offload_init(prog, attr);
+	if (bpf_prog_is_dev_bound(prog->aux)) {
+		err = bpf_prog_dev_bound_init(prog, attr);
 		if (err)
 			goto free_prog_sec;
 	}
diff --git a/net/core/dev.c b/net/core/dev.c
index a37829de6529..e66da626df84 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -9228,6 +9228,10 @@ static int dev_xdp_attach(struct net_device *dev, struct netlink_ext_ack *extack
 			NL_SET_ERR_MSG(extack, "Using offloaded program without HW_MODE flag is not supported");
 			return -EINVAL;
 		}
+		if (bpf_prog_is_dev_bound(new_prog->aux) && !bpf_offload_dev_match(new_prog, dev)) {
+			NL_SET_ERR_MSG(extack, "Program bound to different device");
+			return -EINVAL;
+		}
 		if (new_prog->expected_attach_type == BPF_XDP_DEVMAP) {
 			NL_SET_ERR_MSG(extack, "BPF_XDP_DEVMAP programs can not be attached to a device");
 			return -EINVAL;
@@ -10830,6 +10834,7 @@ void unregister_netdevice_many_notify(struct list_head *head,
 		dev_shutdown(dev);
 
 		dev_xdp_uninstall(dev);
+		bpf_dev_bound_netdev_unregister(dev);
 
 		netdev_offload_xstats_disable_all(dev);
 
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index bc1a3d232ae4..d0b6ac896699 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1156,6 +1156,11 @@ enum bpf_link_type {
  */
 #define BPF_F_XDP_HAS_FRAGS	(1U << 5)
 
+/* If BPF_F_XDP_DEV_BOUND_ONLY is used in BPF_PROG_LOAD command, the loaded
+ * program becomes device-bound but can access XDP metadata.
+ */
+#define BPF_F_XDP_DEV_BOUND_ONLY	(1U << 6)
+
 /* link_create.kprobe_multi.flags used in LINK_CREATE command for
  * BPF_TRACE_KPROBE_MULTI attach type to create return probe.
  */
-- 
2.39.0.314.g84b9a713c41-goog



* [xdp-hints] [PATCH bpf-next v7 06/17] selftests/bpf: Update expected test_offload.py messages
  2023-01-12  0:32 [xdp-hints] [PATCH bpf-next v7 00/17] xdp: hints via kfuncs Stanislav Fomichev
                   ` (4 preceding siblings ...)
  2023-01-12  0:32 ` [xdp-hints] [PATCH bpf-next v7 05/17] bpf: Introduce device-bound XDP programs Stanislav Fomichev
@ 2023-01-12  0:32 ` Stanislav Fomichev
  2023-01-12  0:32 ` [xdp-hints] [PATCH bpf-next v7 07/17] bpf: XDP metadata RX kfuncs Stanislav Fomichev
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 42+ messages in thread
From: Stanislav Fomichev @ 2023-01-12  0:32 UTC
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

The generic check in dev_xdp_attach has a different error message;
update the selftest accordingly.

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 tools/testing/selftests/bpf/test_offload.py | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/tools/testing/selftests/bpf/test_offload.py b/tools/testing/selftests/bpf/test_offload.py
index 7cb1bc05e5cf..40cba8d368d9 100755
--- a/tools/testing/selftests/bpf/test_offload.py
+++ b/tools/testing/selftests/bpf/test_offload.py
@@ -1039,7 +1039,7 @@ netns = []
     offload = bpf_pinned("/sys/fs/bpf/offload")
     ret, _, err = sim.set_xdp(offload, "drv", fail=False, include_stderr=True)
     fail(ret == 0, "attached offloaded XDP program to drv")
-    check_extack(err, "Using device-bound program without HW_MODE flag is not supported.", args)
+    check_extack(err, "Using offloaded program without HW_MODE flag is not supported.", args)
     rm("/sys/fs/bpf/offload")
     sim.wait_for_flush()
 
@@ -1088,12 +1088,12 @@ netns = []
     ret, _, err = sim.set_xdp(pinned, "offload",
                               fail=False, include_stderr=True)
     fail(ret == 0, "Pinned program loaded for a different device accepted")
-    check_extack_nsim(err, "program bound to different dev.", args)
+    check_extack(err, "Program bound to different device.", args)
     simdev2.remove()
     ret, _, err = sim.set_xdp(pinned, "offload",
                               fail=False, include_stderr=True)
     fail(ret == 0, "Pinned program loaded for a removed device accepted")
-    check_extack_nsim(err, "xdpoffload of non-bound program.", args)
+    check_extack(err, "Program bound to different device.", args)
     rm(pin_file)
     bpftool_prog_list_wait(expected=0)
 
@@ -1334,12 +1334,12 @@ netns = []
     ret, _, err = simA.set_xdp(progB, "offload", force=True, JSON=False,
                                fail=False, include_stderr=True)
     fail(ret == 0, "cross-ASIC program allowed")
-    check_extack_nsim(err, "program bound to different dev.", args)
+    check_extack(err, "Program bound to different device.", args)
     for d in simdevB.nsims:
         ret, _, err = d.set_xdp(progA, "offload", force=True, JSON=False,
                                 fail=False, include_stderr=True)
         fail(ret == 0, "cross-ASIC program allowed")
-        check_extack_nsim(err, "program bound to different dev.", args)
+        check_extack(err, "Program bound to different device.", args)
 
     start_test("Test multi-dev ASIC cross-dev map reuse...")
 
-- 
2.39.0.314.g84b9a713c41-goog


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [xdp-hints] [PATCH bpf-next v7 07/17] bpf: XDP metadata RX kfuncs
  2023-01-12  0:32 [xdp-hints] [PATCH bpf-next v7 00/17] xdp: hints via kfuncs Stanislav Fomichev
                   ` (5 preceding siblings ...)
  2023-01-12  0:32 ` [xdp-hints] [PATCH bpf-next v7 06/17] selftests/bpf: Update expected test_offload.py messages Stanislav Fomichev
@ 2023-01-12  0:32 ` Stanislav Fomichev
  2023-01-12  0:32 ` [xdp-hints] [PATCH bpf-next v7 09/17] veth: Introduce veth_xdp_buff wrapper for xdp_buff Stanislav Fomichev
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 42+ messages in thread
From: Stanislav Fomichev @ 2023-01-12  0:32 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

Define a new kfunc set (xdp_metadata_kfunc_ids) which implements all possible
XDP metadata kfuncs. Not all devices have to implement them. If a kfunc is
not supported by the target device, the default implementation is called
instead. The verifier, at load time, replaces a call to the generic kfunc
with a call to the per-device one. Per-device kfunc pointers are stored in
a separate struct xdp_metadata_ops.
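
A minimal BPF-side sketch of how a program consumes these kfuncs; it
mirrors the selftest added later in this series, and the fallback
handling is illustrative only:

  #include <vmlinux.h>
  #include <bpf/bpf_helpers.h>

  extern int bpf_xdp_metadata_rx_timestamp(const struct xdp_md *ctx,
                                           __u64 *timestamp) __ksym;

  SEC("xdp")
  int rx(struct xdp_md *ctx)
  {
          __u64 timestamp = 0;

          /* At load time the verifier patches this call to the device's
           * xmo_rx_timestamp; on devices without an implementation the
           * default kfunc runs instead and returns -EOPNOTSUPP.
           */
          if (bpf_xdp_metadata_rx_timestamp(ctx, &timestamp))
                  timestamp = 0; /* no HW timestamp on this device */

          return XDP_PASS;
  }

  char _license[] SEC("license") = "GPL";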

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 include/linux/bpf.h       | 17 ++++++++++-
 include/linux/netdevice.h |  8 +++++
 include/net/xdp.h         | 21 +++++++++++++
 kernel/bpf/core.c         |  8 +++++
 kernel/bpf/offload.c      | 44 +++++++++++++++++++++++++++
 kernel/bpf/verifier.c     | 25 ++++++++++++++-
 net/bpf/test_run.c        |  3 ++
 net/core/xdp.c            | 64 +++++++++++++++++++++++++++++++++++++++
 8 files changed, 188 insertions(+), 2 deletions(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index b97a05bb47be..bb26c2e18092 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -2480,6 +2480,9 @@ bool bpf_offload_dev_match(struct bpf_prog *prog, struct net_device *netdev);
 void unpriv_ebpf_notify(int new_state);
 
 #if defined(CONFIG_NET) && defined(CONFIG_BPF_SYSCALL)
+int bpf_dev_bound_kfunc_check(struct bpf_verifier_log *log,
+			      struct bpf_prog_aux *prog_aux);
+void *bpf_dev_bound_resolve_kfunc(struct bpf_prog *prog, u32 func_id);
 int bpf_prog_dev_bound_init(struct bpf_prog *prog, union bpf_attr *attr);
 void bpf_dev_bound_netdev_unregister(struct net_device *dev);
 
@@ -2514,8 +2517,20 @@ void sock_map_unhash(struct sock *sk);
 void sock_map_destroy(struct sock *sk);
 void sock_map_close(struct sock *sk, long timeout);
 #else
+static inline int bpf_dev_bound_kfunc_check(struct bpf_verifier_log *log,
+					    struct bpf_prog_aux *prog_aux)
+{
+	return -EOPNOTSUPP;
+}
+
+static inline void *bpf_dev_bound_resolve_kfunc(struct bpf_prog *prog,
+						u32 func_id)
+{
+	return NULL;
+}
+
 static inline int bpf_prog_dev_bound_init(struct bpf_prog *prog,
-					union bpf_attr *attr)
+					  union bpf_attr *attr)
 {
 	return -EOPNOTSUPP;
 }
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index aad12a179e54..90f2be194bc5 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -74,6 +74,7 @@ struct udp_tunnel_nic_info;
 struct udp_tunnel_nic;
 struct bpf_prog;
 struct xdp_buff;
+struct xdp_md;
 
 void synchronize_net(void);
 void netdev_set_default_ethtool_ops(struct net_device *dev,
@@ -1618,6 +1619,11 @@ struct net_device_ops {
 						  bool cycles);
 };
 
+struct xdp_metadata_ops {
+	int	(*xmo_rx_timestamp)(const struct xdp_md *ctx, u64 *timestamp);
+	int	(*xmo_rx_hash)(const struct xdp_md *ctx, u32 *hash);
+};
+
 /**
  * enum netdev_priv_flags - &struct net_device priv_flags
  *
@@ -1801,6 +1807,7 @@ enum netdev_ml_priv_type {
  *
  *	@netdev_ops:	Includes several pointers to callbacks,
  *			if one wants to override the ndo_*() functions
+ *	@xdp_metadata_ops:	Includes pointers to XDP metadata callbacks.
  *	@ethtool_ops:	Management operations
  *	@l3mdev_ops:	Layer 3 master device operations
  *	@ndisc_ops:	Includes callbacks for different IPv6 neighbour
@@ -2050,6 +2057,7 @@ struct net_device {
 	unsigned int		flags;
 	unsigned long long	priv_flags;
 	const struct net_device_ops *netdev_ops;
+	const struct xdp_metadata_ops *xdp_metadata_ops;
 	int			ifindex;
 	unsigned short		gflags;
 	unsigned short		hard_header_len;
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 55dbc68bfffc..91292aa13bc0 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -409,4 +409,25 @@ void xdp_attachment_setup(struct xdp_attachment_info *info,
 
 #define DEV_MAP_BULK_SIZE XDP_BULK_QUEUE_SIZE
 
+#define XDP_METADATA_KFUNC_xxx	\
+	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_TIMESTAMP, \
+			   bpf_xdp_metadata_rx_timestamp) \
+	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_HASH, \
+			   bpf_xdp_metadata_rx_hash) \
+
+enum {
+#define XDP_METADATA_KFUNC(name, _) name,
+XDP_METADATA_KFUNC_xxx
+#undef XDP_METADATA_KFUNC
+MAX_XDP_METADATA_KFUNC,
+};
+
+#ifdef CONFIG_NET
+u32 bpf_xdp_metadata_kfunc_id(int id);
+bool bpf_dev_bound_kfunc_id(u32 btf_id);
+#else
+static inline u32 bpf_xdp_metadata_kfunc_id(int id) { return 0; }
+static inline bool bpf_dev_bound_kfunc_id(u32 btf_id) { return false; }
+#endif
+
 #endif /* __LINUX_NET_XDP_H__ */
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 1cf19da3c128..16da51093aff 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -2096,6 +2096,14 @@ bool bpf_prog_map_compatible(struct bpf_map *map,
 	if (fp->kprobe_override)
 		return false;
 
+	/* XDP programs inserted into maps are not guaranteed to run on
+	 * a particular netdev (and can run outside driver context entirely
+	 * in the case of devmap and cpumap). Until device checks
+	 * are implemented, prohibit adding dev-bound programs to program maps.
+	 */
+	if (bpf_prog_is_dev_bound(fp->aux))
+		return false;
+
 	spin_lock(&map->owner.lock);
 	if (!map->owner.type) {
 		/* There's no owner yet where we could check for
diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
index f767455ed732..3e173c694bbb 100644
--- a/kernel/bpf/offload.c
+++ b/kernel/bpf/offload.c
@@ -755,6 +755,50 @@ void bpf_dev_bound_netdev_unregister(struct net_device *dev)
 	up_write(&bpf_devs_lock);
 }
 
+int bpf_dev_bound_kfunc_check(struct bpf_verifier_log *log,
+			      struct bpf_prog_aux *prog_aux)
+{
+	if (!bpf_prog_is_dev_bound(prog_aux)) {
+		bpf_log(log, "metadata kfuncs require device-bound program\n");
+		return -EINVAL;
+	}
+
+	if (bpf_prog_is_offloaded(prog_aux)) {
+		bpf_log(log, "metadata kfuncs can't be offloaded\n");
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+void *bpf_dev_bound_resolve_kfunc(struct bpf_prog *prog, u32 func_id)
+{
+	const struct xdp_metadata_ops *ops;
+	void *p = NULL;
+
+	/* bpf_devs_lock is taken per kfunc resolution, not across
+	 * all of them, so we can race with unregister_netdevice().
+	 * We rely on the bpf_dev_bound_match() check at attach
+	 * to render this program unusable.
+	 */
+	down_read(&bpf_devs_lock);
+	if (!prog->aux->offload)
+		goto out;
+
+	ops = prog->aux->offload->netdev->xdp_metadata_ops;
+	if (!ops)
+		goto out;
+
+	if (func_id == bpf_xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP))
+		p = ops->xmo_rx_timestamp;
+	else if (func_id == bpf_xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_HASH))
+		p = ops->xmo_rx_hash;
+out:
+	up_read(&bpf_devs_lock);
+
+	return p;
+}
+
 static int __init bpf_offload_init(void)
 {
 	return rhashtable_init(&offdevs, &offdevs_params);
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 026a6789e896..4cfba6c340d7 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2189,6 +2189,12 @@ static int add_kfunc_call(struct bpf_verifier_env *env, u32 func_id, s16 offset)
 		return -EINVAL;
 	}
 
+	if (bpf_dev_bound_kfunc_id(func_id)) {
+		err = bpf_dev_bound_kfunc_check(&env->log, prog_aux);
+		if (err)
+			return err;
+	}
+
 	desc = &tab->descs[tab->nr_descs++];
 	desc->func_id = func_id;
 	desc->imm = call_imm;
@@ -15499,12 +15505,25 @@ static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 			    struct bpf_insn *insn_buf, int insn_idx, int *cnt)
 {
 	const struct bpf_kfunc_desc *desc;
+	void *xdp_kfunc;
 
 	if (!insn->imm) {
 		verbose(env, "invalid kernel function call not eliminated in verifier pass\n");
 		return -EINVAL;
 	}
 
+	*cnt = 0;
+
+	if (bpf_dev_bound_kfunc_id(insn->imm)) {
+		xdp_kfunc = bpf_dev_bound_resolve_kfunc(env->prog, insn->imm);
+		if (xdp_kfunc) {
+			insn->imm = BPF_CALL_IMM(xdp_kfunc);
+			return 0;
+		}
+
+		/* fallback to default kfunc when not supported by netdev */
+	}
+
 	/* insn->imm has the btf func_id. Replace it with
 	 * an address (relative to __bpf_call_base).
 	 */
@@ -15515,7 +15534,6 @@ static int fixup_kfunc_call(struct bpf_verifier_env *env, struct bpf_insn *insn,
 		return -EFAULT;
 	}
 
-	*cnt = 0;
 	insn->imm = desc->imm;
 	if (insn->off)
 		return 0;
@@ -16522,6 +16540,11 @@ int bpf_check_attach_target(struct bpf_verifier_log *log,
 	if (tgt_prog) {
 		struct bpf_prog_aux *aux = tgt_prog->aux;
 
+		if (bpf_prog_is_dev_bound(tgt_prog->aux)) {
+			bpf_log(log, "Replacing device-bound programs not supported\n");
+			return -EINVAL;
+		}
+
 		for (i = 0; i < aux->func_info_cnt; i++)
 			if (aux->func_info[i].type_id == btf_id) {
 				subprog = i;
diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index 2723623429ac..8da0d73b368e 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -1300,6 +1300,9 @@ int bpf_prog_test_run_xdp(struct bpf_prog *prog, const union bpf_attr *kattr,
 	if (kattr->test.flags & ~BPF_F_TEST_XDP_LIVE_FRAMES)
 		return -EINVAL;
 
+	if (bpf_prog_is_dev_bound(prog->aux))
+		return -EINVAL;
+
 	if (do_live) {
 		if (!batch_size)
 			batch_size = NAPI_POLL_WEIGHT;
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 844c9d99dc0e..a5a7ecf6391c 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -4,6 +4,7 @@
  * Copyright (c) 2017 Jesper Dangaard Brouer, Red Hat Inc.
  */
 #include <linux/bpf.h>
+#include <linux/btf_ids.h>
 #include <linux/filter.h>
 #include <linux/types.h>
 #include <linux/mm.h>
@@ -709,3 +710,66 @@ struct xdp_frame *xdpf_clone(struct xdp_frame *xdpf)
 
 	return nxdpf;
 }
+
+__diag_push();
+__diag_ignore_all("-Wmissing-prototypes",
+		  "Global functions as their definitions will be in vmlinux BTF");
+
+/**
+ * bpf_xdp_metadata_rx_timestamp - Read XDP frame RX timestamp.
+ * @ctx: XDP context pointer.
+ * @timestamp: Return value pointer.
+ *
+ * Returns 0 on success or ``-errno`` on error.
+ */
+int bpf_xdp_metadata_rx_timestamp(const struct xdp_md *ctx, u64 *timestamp)
+{
+	return -EOPNOTSUPP;
+}
+
+/**
+ * bpf_xdp_metadata_rx_hash - Read XDP frame RX hash.
+ * @ctx: XDP context pointer.
+ * @hash: Return value pointer.
+ *
+ * Returns 0 on success or ``-errno`` on error.
+ */
+int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx, u32 *hash)
+{
+	return -EOPNOTSUPP;
+}
+
+__diag_pop();
+
+BTF_SET8_START(xdp_metadata_kfunc_ids)
+#define XDP_METADATA_KFUNC(_, name) BTF_ID_FLAGS(func, name, 0)
+XDP_METADATA_KFUNC_xxx
+#undef XDP_METADATA_KFUNC
+BTF_SET8_END(xdp_metadata_kfunc_ids)
+
+static const struct btf_kfunc_id_set xdp_metadata_kfunc_set = {
+	.owner = THIS_MODULE,
+	.set   = &xdp_metadata_kfunc_ids,
+};
+
+BTF_ID_LIST(xdp_metadata_kfunc_ids_unsorted)
+#define XDP_METADATA_KFUNC(name, str) BTF_ID(func, str)
+XDP_METADATA_KFUNC_xxx
+#undef XDP_METADATA_KFUNC
+
+u32 bpf_xdp_metadata_kfunc_id(int id)
+{
+	/* xdp_metadata_kfunc_ids is sorted by BTF ID, not enum order */
+	return xdp_metadata_kfunc_ids_unsorted[id];
+}
+
+bool bpf_dev_bound_kfunc_id(u32 btf_id)
+{
+	return btf_id_set8_contains(&xdp_metadata_kfunc_ids, btf_id);
+}
+
+static int __init xdp_metadata_init(void)
+{
+	return register_btf_kfunc_id_set(BPF_PROG_TYPE_XDP, &xdp_metadata_kfunc_set);
+}
+late_initcall(xdp_metadata_init);
-- 
2.39.0.314.g84b9a713c41-goog


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [xdp-hints] [PATCH bpf-next v7 09/17] veth: Introduce veth_xdp_buff wrapper for xdp_buff
  2023-01-12  0:32 [xdp-hints] [PATCH bpf-next v7 00/17] xdp: hints via kfuncs Stanislav Fomichev
                   ` (6 preceding siblings ...)
  2023-01-12  0:32 ` [xdp-hints] [PATCH bpf-next v7 07/17] bpf: XDP metadata RX kfuncs Stanislav Fomichev
@ 2023-01-12  0:32 ` Stanislav Fomichev
  2023-01-12  0:32 ` [xdp-hints] [PATCH bpf-next v7 10/17] veth: Support RX XDP metadata Stanislav Fomichev
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 42+ messages in thread
From: Stanislav Fomichev @ 2023-01-12  0:32 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev

No functional changes. Boilerplate to allow stuffing more data after xdp_buff.
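
The wrapper keeps struct xdp_buff as the first member, so &vxbuf.xdp and
&vxbuf share an address and later patches can recover driver state from
the program's context pointer. A rough sketch of the pattern (the helper
name here is made up; the real cast-back appears in the next patch):

  struct veth_xdp_buff {
          struct xdp_buff xdp; /* must stay the first member */
          /* driver-private per-packet state is added here later */
  };

  static struct veth_xdp_buff *veth_mxbuf_from_ctx(const struct xdp_md *ctx)
  {
          return (void *)ctx; /* valid because xdp is the first member */
  }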

Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 drivers/net/veth.c | 56 +++++++++++++++++++++++++---------------------
 1 file changed, 31 insertions(+), 25 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index dfc7d87fad59..70f50602287a 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -116,6 +116,10 @@ static struct {
 	{ "peer_ifindex" },
 };
 
+struct veth_xdp_buff {
+	struct xdp_buff xdp;
+};
+
 static int veth_get_link_ksettings(struct net_device *dev,
 				   struct ethtool_link_ksettings *cmd)
 {
@@ -592,23 +596,24 @@ static struct xdp_frame *veth_xdp_rcv_one(struct veth_rq *rq,
 	rcu_read_lock();
 	xdp_prog = rcu_dereference(rq->xdp_prog);
 	if (likely(xdp_prog)) {
-		struct xdp_buff xdp;
+		struct veth_xdp_buff vxbuf;
+		struct xdp_buff *xdp = &vxbuf.xdp;
 		u32 act;
 
-		xdp_convert_frame_to_buff(frame, &xdp);
-		xdp.rxq = &rq->xdp_rxq;
+		xdp_convert_frame_to_buff(frame, xdp);
+		xdp->rxq = &rq->xdp_rxq;
 
-		act = bpf_prog_run_xdp(xdp_prog, &xdp);
+		act = bpf_prog_run_xdp(xdp_prog, xdp);
 
 		switch (act) {
 		case XDP_PASS:
-			if (xdp_update_frame_from_buff(&xdp, frame))
+			if (xdp_update_frame_from_buff(xdp, frame))
 				goto err_xdp;
 			break;
 		case XDP_TX:
 			orig_frame = *frame;
-			xdp.rxq->mem = frame->mem;
-			if (unlikely(veth_xdp_tx(rq, &xdp, bq) < 0)) {
+			xdp->rxq->mem = frame->mem;
+			if (unlikely(veth_xdp_tx(rq, xdp, bq) < 0)) {
 				trace_xdp_exception(rq->dev, xdp_prog, act);
 				frame = &orig_frame;
 				stats->rx_drops++;
@@ -619,8 +624,8 @@ static struct xdp_frame *veth_xdp_rcv_one(struct veth_rq *rq,
 			goto xdp_xmit;
 		case XDP_REDIRECT:
 			orig_frame = *frame;
-			xdp.rxq->mem = frame->mem;
-			if (xdp_do_redirect(rq->dev, &xdp, xdp_prog)) {
+			xdp->rxq->mem = frame->mem;
+			if (xdp_do_redirect(rq->dev, xdp, xdp_prog)) {
 				frame = &orig_frame;
 				stats->rx_drops++;
 				goto err_xdp;
@@ -801,7 +806,8 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
 {
 	void *orig_data, *orig_data_end;
 	struct bpf_prog *xdp_prog;
-	struct xdp_buff xdp;
+	struct veth_xdp_buff vxbuf;
+	struct xdp_buff *xdp = &vxbuf.xdp;
 	u32 act, metalen;
 	int off;
 
@@ -815,22 +821,22 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
 	}
 
 	__skb_push(skb, skb->data - skb_mac_header(skb));
-	if (veth_convert_skb_to_xdp_buff(rq, &xdp, &skb))
+	if (veth_convert_skb_to_xdp_buff(rq, xdp, &skb))
 		goto drop;
 
-	orig_data = xdp.data;
-	orig_data_end = xdp.data_end;
+	orig_data = xdp->data;
+	orig_data_end = xdp->data_end;
 
-	act = bpf_prog_run_xdp(xdp_prog, &xdp);
+	act = bpf_prog_run_xdp(xdp_prog, xdp);
 
 	switch (act) {
 	case XDP_PASS:
 		break;
 	case XDP_TX:
-		veth_xdp_get(&xdp);
+		veth_xdp_get(xdp);
 		consume_skb(skb);
-		xdp.rxq->mem = rq->xdp_mem;
-		if (unlikely(veth_xdp_tx(rq, &xdp, bq) < 0)) {
+		xdp->rxq->mem = rq->xdp_mem;
+		if (unlikely(veth_xdp_tx(rq, xdp, bq) < 0)) {
 			trace_xdp_exception(rq->dev, xdp_prog, act);
 			stats->rx_drops++;
 			goto err_xdp;
@@ -839,10 +845,10 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
 		rcu_read_unlock();
 		goto xdp_xmit;
 	case XDP_REDIRECT:
-		veth_xdp_get(&xdp);
+		veth_xdp_get(xdp);
 		consume_skb(skb);
-		xdp.rxq->mem = rq->xdp_mem;
-		if (xdp_do_redirect(rq->dev, &xdp, xdp_prog)) {
+		xdp->rxq->mem = rq->xdp_mem;
+		if (xdp_do_redirect(rq->dev, xdp, xdp_prog)) {
 			stats->rx_drops++;
 			goto err_xdp;
 		}
@@ -862,7 +868,7 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
 	rcu_read_unlock();
 
 	/* check if bpf_xdp_adjust_head was used */
-	off = orig_data - xdp.data;
+	off = orig_data - xdp->data;
 	if (off > 0)
 		__skb_push(skb, off);
 	else if (off < 0)
@@ -871,21 +877,21 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
 	skb_reset_mac_header(skb);
 
 	/* check if bpf_xdp_adjust_tail was used */
-	off = xdp.data_end - orig_data_end;
+	off = xdp->data_end - orig_data_end;
 	if (off != 0)
 		__skb_put(skb, off); /* positive on grow, negative on shrink */
 
 	/* XDP frag metadata (e.g. nr_frags) are updated in eBPF helpers
 	 * (e.g. bpf_xdp_adjust_tail), we need to update data_len here.
 	 */
-	if (xdp_buff_has_frags(&xdp))
+	if (xdp_buff_has_frags(xdp))
 		skb->data_len = skb_shinfo(skb)->xdp_frags_size;
 	else
 		skb->data_len = 0;
 
 	skb->protocol = eth_type_trans(skb, rq->dev);
 
-	metalen = xdp.data - xdp.data_meta;
+	metalen = xdp->data - xdp->data_meta;
 	if (metalen)
 		skb_metadata_set(skb, metalen);
 out:
@@ -898,7 +904,7 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
 	return NULL;
 err_xdp:
 	rcu_read_unlock();
-	xdp_return_buff(&xdp);
+	xdp_return_buff(xdp);
 xdp_xmit:
 	return NULL;
 }
-- 
2.39.0.314.g84b9a713c41-goog


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [xdp-hints] [PATCH bpf-next v7 10/17] veth: Support RX XDP metadata
  2023-01-12  0:32 [xdp-hints] [PATCH bpf-next v7 00/17] xdp: hints via kfuncs Stanislav Fomichev
                   ` (7 preceding siblings ...)
  2023-01-12  0:32 ` [xdp-hints] [PATCH bpf-next v7 09/17] veth: Introduce veth_xdp_buff wrapper for xdp_buff Stanislav Fomichev
@ 2023-01-12  0:32 ` Stanislav Fomichev
  2023-01-16 16:21   ` [xdp-hints] " Jesper Dangaard Brouer
  2023-01-12  0:32 ` [xdp-hints] [PATCH bpf-next v7 11/17] selftests/bpf: Verify xdp_metadata xdp->af_xdp path Stanislav Fomichev
                   ` (7 subsequent siblings)
  16 siblings, 1 reply; 42+ messages in thread
From: Stanislav Fomichev @ 2023-01-12  0:32 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

The goal is to enable end-to-end testing of the metadata for AF_XDP.

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 drivers/net/veth.c | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 70f50602287a..ba3e05832843 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -118,6 +118,7 @@ static struct {
 
 struct veth_xdp_buff {
 	struct xdp_buff xdp;
+	struct sk_buff *skb;
 };
 
 static int veth_get_link_ksettings(struct net_device *dev,
@@ -602,6 +603,7 @@ static struct xdp_frame *veth_xdp_rcv_one(struct veth_rq *rq,
 
 		xdp_convert_frame_to_buff(frame, xdp);
 		xdp->rxq = &rq->xdp_rxq;
+		vxbuf.skb = NULL;
 
 		act = bpf_prog_run_xdp(xdp_prog, xdp);
 
@@ -823,6 +825,7 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
 	__skb_push(skb, skb->data - skb_mac_header(skb));
 	if (veth_convert_skb_to_xdp_buff(rq, xdp, &skb))
 		goto drop;
+	vxbuf.skb = skb;
 
 	orig_data = xdp->data;
 	orig_data_end = xdp->data_end;
@@ -1602,6 +1605,28 @@ static int veth_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 	}
 }
 
+static int veth_xdp_rx_timestamp(const struct xdp_md *ctx, u64 *timestamp)
+{
+	struct veth_xdp_buff *_ctx = (void *)ctx;
+
+	if (!_ctx->skb)
+		return -EOPNOTSUPP;
+
+	*timestamp = skb_hwtstamps(_ctx->skb)->hwtstamp;
+	return 0;
+}
+
+static int veth_xdp_rx_hash(const struct xdp_md *ctx, u32 *hash)
+{
+	struct veth_xdp_buff *_ctx = (void *)ctx;
+
+	if (!_ctx->skb)
+		return -EOPNOTSUPP;
+
+	*hash = skb_get_hash(_ctx->skb);
+	return 0;
+}
+
 static const struct net_device_ops veth_netdev_ops = {
 	.ndo_init            = veth_dev_init,
 	.ndo_open            = veth_open,
@@ -1623,6 +1648,11 @@ static const struct net_device_ops veth_netdev_ops = {
 	.ndo_get_peer_dev	= veth_peer_dev,
 };
 
+static const struct xdp_metadata_ops veth_xdp_metadata_ops = {
+	.xmo_rx_timestamp		= veth_xdp_rx_timestamp,
+	.xmo_rx_hash			= veth_xdp_rx_hash,
+};
+
 #define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_HW_CSUM | \
 		       NETIF_F_RXCSUM | NETIF_F_SCTP_CRC | NETIF_F_HIGHDMA | \
 		       NETIF_F_GSO_SOFTWARE | NETIF_F_GSO_ENCAP_ALL | \
@@ -1639,6 +1669,7 @@ static void veth_setup(struct net_device *dev)
 	dev->priv_flags |= IFF_PHONY_HEADROOM;
 
 	dev->netdev_ops = &veth_netdev_ops;
+	dev->xdp_metadata_ops = &veth_xdp_metadata_ops;
 	dev->ethtool_ops = &veth_ethtool_ops;
 	dev->features |= NETIF_F_LLTX;
 	dev->features |= VETH_FEATURES;
-- 
2.39.0.314.g84b9a713c41-goog


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [xdp-hints] [PATCH bpf-next v7 11/17] selftests/bpf: Verify xdp_metadata xdp->af_xdp path
  2023-01-12  0:32 [xdp-hints] [PATCH bpf-next v7 00/17] xdp: hints via kfuncs Stanislav Fomichev
                   ` (8 preceding siblings ...)
  2023-01-12  0:32 ` [xdp-hints] [PATCH bpf-next v7 10/17] veth: Support RX XDP metadata Stanislav Fomichev
@ 2023-01-12  0:32 ` Stanislav Fomichev
  2023-01-12  0:32 ` [xdp-hints] [PATCH bpf-next v7 12/17] net/mlx4_en: Introduce wrapper for xdp_buff Stanislav Fomichev
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 42+ messages in thread
From: Stanislav Fomichev @ 2023-01-12  0:32 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

- create new netns
- create veth pair (veTX+veRX)
- setup AF_XDP socket for both interfaces
- attach bpf to veRX
- send packet via veTX
- verify the packet has expected metadata at veRX
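
On the consumer side the custom metadata lands immediately before the
packet payload in the umem frame, so the receiver steps back from the
data pointer to find it. A sketch of what verify_xsk_metadata() below
does (struct xsk and struct xdp_meta are the test's own types):

  static void read_metadata(struct xsk *xsk, __u64 addr)
  {
          /* addr is the frame address taken from the RX descriptor */
          void *data = xsk_umem__get_data(xsk->umem_area, addr);
          struct xdp_meta *meta = data - sizeof(struct xdp_meta);

          /* filled in by the XDP program before bpf_redirect_map() */
          printf("rx_timestamp=%llu rx_hash=%u\n",
                 meta->rx_timestamp, meta->rx_hash);
  }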

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 tools/testing/selftests/bpf/Makefile          |   2 +-
 .../selftests/bpf/prog_tests/xdp_metadata.c   | 410 ++++++++++++++++++
 .../selftests/bpf/progs/xdp_metadata.c        |  64 +++
 .../selftests/bpf/progs/xdp_metadata2.c       |  23 +
 tools/testing/selftests/bpf/xdp_metadata.h    |  15 +
 5 files changed, 513 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
 create mode 100644 tools/testing/selftests/bpf/progs/xdp_metadata.c
 create mode 100644 tools/testing/selftests/bpf/progs/xdp_metadata2.c
 create mode 100644 tools/testing/selftests/bpf/xdp_metadata.h

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 205e8c3c346a..5356f317bc62 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -527,7 +527,7 @@ TRUNNER_BPF_PROGS_DIR := progs
 TRUNNER_EXTRA_SOURCES := test_progs.c cgroup_helpers.c trace_helpers.c	\
 			 network_helpers.c testing_helpers.c		\
 			 btf_helpers.c flow_dissector_load.h		\
-			 cap_helpers.c test_loader.c
+			 cap_helpers.c test_loader.c xsk.c
 TRUNNER_EXTRA_FILES := $(OUTPUT)/urandom_read $(OUTPUT)/bpf_testmod.ko	\
 		       $(OUTPUT)/liburandom_read.so			\
 		       $(OUTPUT)/xdp_synproxy				\
diff --git a/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c b/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
new file mode 100644
index 000000000000..bb62d9d6bce0
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
@@ -0,0 +1,410 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <test_progs.h>
+#include <network_helpers.h>
+#include "xdp_metadata.skel.h"
+#include "xdp_metadata2.skel.h"
+#include "xdp_metadata.h"
+#include "xsk.h"
+
+#include <bpf/btf.h>
+#include <linux/errqueue.h>
+#include <linux/if_link.h>
+#include <linux/net_tstamp.h>
+#include <linux/udp.h>
+#include <sys/mman.h>
+#include <net/if.h>
+#include <poll.h>
+
+#define TX_NAME "veTX"
+#define RX_NAME "veRX"
+
+#define UDP_PAYLOAD_BYTES 4
+
+#define AF_XDP_SOURCE_PORT 1234
+#define AF_XDP_CONSUMER_PORT 8080
+
+#define UMEM_NUM 16
+#define UMEM_FRAME_SIZE XSK_UMEM__DEFAULT_FRAME_SIZE
+#define UMEM_SIZE (UMEM_FRAME_SIZE * UMEM_NUM)
+#define XDP_FLAGS XDP_FLAGS_DRV_MODE
+#define QUEUE_ID 0
+
+#define TX_ADDR "10.0.0.1"
+#define RX_ADDR "10.0.0.2"
+#define PREFIX_LEN "8"
+#define FAMILY AF_INET
+
+#define SYS(cmd) ({ \
+	if (!ASSERT_OK(system(cmd), (cmd))) \
+		goto out; \
+})
+
+struct xsk {
+	void *umem_area;
+	struct xsk_umem *umem;
+	struct xsk_ring_prod fill;
+	struct xsk_ring_cons comp;
+	struct xsk_ring_prod tx;
+	struct xsk_ring_cons rx;
+	struct xsk_socket *socket;
+};
+
+static int open_xsk(const char *ifname, struct xsk *xsk)
+{
+	int mmap_flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE;
+	const struct xsk_socket_config socket_config = {
+		.rx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
+		.tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
+		.libbpf_flags = XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD,
+		.xdp_flags = XDP_FLAGS,
+		.bind_flags = XDP_COPY,
+	};
+	const struct xsk_umem_config umem_config = {
+		.fill_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
+		.comp_size = XSK_RING_CONS__DEFAULT_NUM_DESCS,
+		.frame_size = XSK_UMEM__DEFAULT_FRAME_SIZE,
+		.flags = XDP_UMEM_UNALIGNED_CHUNK_FLAG,
+	};
+	__u32 idx;
+	u64 addr;
+	int ret;
+	int i;
+
+	xsk->umem_area = mmap(NULL, UMEM_SIZE, PROT_READ | PROT_WRITE, mmap_flags, -1, 0);
+	if (!ASSERT_NEQ(xsk->umem_area, MAP_FAILED, "mmap"))
+		return -1;
+
+	ret = xsk_umem__create(&xsk->umem,
+			       xsk->umem_area, UMEM_SIZE,
+			       &xsk->fill,
+			       &xsk->comp,
+			       &umem_config);
+	if (!ASSERT_OK(ret, "xsk_umem__create"))
+		return ret;
+
+	ret = xsk_socket__create(&xsk->socket, ifname, QUEUE_ID,
+				 xsk->umem,
+				 &xsk->rx,
+				 &xsk->tx,
+				 &socket_config);
+	if (!ASSERT_OK(ret, "xsk_socket__create"))
+		return ret;
+
+	/* First half of umem is for TX. This way address matches 1-to-1
+	 * to the completion queue index.
+	 */
+
+	for (i = 0; i < UMEM_NUM / 2; i++) {
+		addr = i * UMEM_FRAME_SIZE;
+		printf("%p: tx_desc[%d] -> %lx\n", xsk, i, addr);
+	}
+
+	/* Second half of umem is for RX. */
+
+	ret = xsk_ring_prod__reserve(&xsk->fill, UMEM_NUM / 2, &idx);
+	if (!ASSERT_EQ(UMEM_NUM / 2, ret, "xsk_ring_prod__reserve"))
+		return ret;
+	if (!ASSERT_EQ(idx, 0, "fill idx != 0"))
+		return -1;
+
+	for (i = 0; i < UMEM_NUM / 2; i++) {
+		addr = (UMEM_NUM / 2 + i) * UMEM_FRAME_SIZE;
+		printf("%p: rx_desc[%d] -> %lx\n", xsk, i, addr);
+		*xsk_ring_prod__fill_addr(&xsk->fill, i) = addr;
+	}
+	xsk_ring_prod__submit(&xsk->fill, ret);
+
+	return 0;
+}
+
+static void close_xsk(struct xsk *xsk)
+{
+	if (xsk->umem)
+		xsk_umem__delete(xsk->umem);
+	if (xsk->socket)
+		xsk_socket__delete(xsk->socket);
+	munmap(xsk->umem_area, UMEM_SIZE);
+}
+
+static void ip_csum(struct iphdr *iph)
+{
+	__u32 sum = 0;
+	__u16 *p;
+	int i;
+
+	iph->check = 0;
+	p = (void *)iph;
+	for (i = 0; i < sizeof(*iph) / sizeof(*p); i++)
+		sum += p[i];
+
+	while (sum >> 16)
+		sum = (sum & 0xffff) + (sum >> 16);
+
+	iph->check = ~sum;
+}
+
+static int generate_packet(struct xsk *xsk, __u16 dst_port)
+{
+	struct xdp_desc *tx_desc;
+	struct udphdr *udph;
+	struct ethhdr *eth;
+	struct iphdr *iph;
+	void *data;
+	__u32 idx;
+	int ret;
+
+	ret = xsk_ring_prod__reserve(&xsk->tx, 1, &idx);
+	if (!ASSERT_EQ(ret, 1, "xsk_ring_prod__reserve"))
+		return -1;
+
+	tx_desc = xsk_ring_prod__tx_desc(&xsk->tx, idx);
+	tx_desc->addr = idx % (UMEM_NUM / 2) * UMEM_FRAME_SIZE;
+	printf("%p: tx_desc[%u]->addr=%llx\n", xsk, idx, tx_desc->addr);
+	data = xsk_umem__get_data(xsk->umem_area, tx_desc->addr);
+
+	eth = data;
+	iph = (void *)(eth + 1);
+	udph = (void *)(iph + 1);
+
+	memcpy(eth->h_dest, "\x00\x00\x00\x00\x00\x02", ETH_ALEN);
+	memcpy(eth->h_source, "\x00\x00\x00\x00\x00\x01", ETH_ALEN);
+	eth->h_proto = htons(ETH_P_IP);
+
+	iph->version = 0x4;
+	iph->ihl = 0x5;
+	iph->tos = 0x9;
+	iph->tot_len = htons(sizeof(*iph) + sizeof(*udph) + UDP_PAYLOAD_BYTES);
+	iph->id = 0;
+	iph->frag_off = 0;
+	iph->ttl = 0;
+	iph->protocol = IPPROTO_UDP;
+	ASSERT_EQ(inet_pton(FAMILY, TX_ADDR, &iph->saddr), 1, "inet_pton(TX_ADDR)");
+	ASSERT_EQ(inet_pton(FAMILY, RX_ADDR, &iph->daddr), 1, "inet_pton(RX_ADDR)");
+	ip_csum(iph);
+
+	udph->source = htons(AF_XDP_SOURCE_PORT);
+	udph->dest = htons(dst_port);
+	udph->len = htons(sizeof(*udph) + UDP_PAYLOAD_BYTES);
+	udph->check = 0;
+
+	memset(udph + 1, 0xAA, UDP_PAYLOAD_BYTES);
+
+	tx_desc->len = sizeof(*eth) + sizeof(*iph) + sizeof(*udph) + UDP_PAYLOAD_BYTES;
+	xsk_ring_prod__submit(&xsk->tx, 1);
+
+	ret = sendto(xsk_socket__fd(xsk->socket), NULL, 0, MSG_DONTWAIT, NULL, 0);
+	if (!ASSERT_GE(ret, 0, "sendto"))
+		return ret;
+
+	return 0;
+}
+
+static void complete_tx(struct xsk *xsk)
+{
+	__u32 idx;
+	__u64 addr;
+
+	if (ASSERT_EQ(xsk_ring_cons__peek(&xsk->comp, 1, &idx), 1, "xsk_ring_cons__peek")) {
+		addr = *xsk_ring_cons__comp_addr(&xsk->comp, idx);
+
+		printf("%p: refill idx=%u addr=%llx\n", xsk, idx, addr);
+		*xsk_ring_prod__fill_addr(&xsk->fill, idx) = addr;
+		xsk_ring_prod__submit(&xsk->fill, 1);
+	}
+}
+
+static void refill_rx(struct xsk *xsk, __u64 addr)
+{
+	__u32 idx;
+
+	if (ASSERT_EQ(xsk_ring_prod__reserve(&xsk->fill, 1, &idx), 1, "xsk_ring_prod__reserve")) {
+		printf("%p: complete idx=%u addr=%llx\n", xsk, idx, addr);
+		*xsk_ring_prod__fill_addr(&xsk->fill, idx) = addr;
+		xsk_ring_prod__submit(&xsk->fill, 1);
+	}
+}
+
+static int verify_xsk_metadata(struct xsk *xsk)
+{
+	const struct xdp_desc *rx_desc;
+	struct pollfd fds = {};
+	struct xdp_meta *meta;
+	struct ethhdr *eth;
+	struct iphdr *iph;
+	__u64 comp_addr;
+	void *data;
+	__u64 addr;
+	__u32 idx;
+	int ret;
+
+	ret = recvfrom(xsk_socket__fd(xsk->socket), NULL, 0, MSG_DONTWAIT, NULL, NULL);
+	if (!ASSERT_EQ(ret, 0, "recvfrom"))
+		return -1;
+
+	fds.fd = xsk_socket__fd(xsk->socket);
+	fds.events = POLLIN;
+
+	ret = poll(&fds, 1, 1000);
+	if (!ASSERT_GT(ret, 0, "poll"))
+		return -1;
+
+	ret = xsk_ring_cons__peek(&xsk->rx, 1, &idx);
+	if (!ASSERT_EQ(ret, 1, "xsk_ring_cons__peek"))
+		return -2;
+
+	rx_desc = xsk_ring_cons__rx_desc(&xsk->rx, idx);
+	comp_addr = xsk_umem__extract_addr(rx_desc->addr);
+	addr = xsk_umem__add_offset_to_addr(rx_desc->addr);
+	printf("%p: rx_desc[%u]->addr=%llx addr=%llx comp_addr=%llx\n",
+	       xsk, idx, rx_desc->addr, addr, comp_addr);
+	data = xsk_umem__get_data(xsk->umem_area, addr);
+
+	/* Make sure we got the packet offset correctly. */
+
+	eth = data;
+	ASSERT_EQ(eth->h_proto, htons(ETH_P_IP), "eth->h_proto");
+	iph = (void *)(eth + 1);
+	ASSERT_EQ((int)iph->version, 4, "iph->version");
+
+	/* custom metadata */
+
+	meta = data - sizeof(struct xdp_meta);
+
+	if (!ASSERT_NEQ(meta->rx_timestamp, 0, "rx_timestamp"))
+		return -1;
+
+	if (!ASSERT_NEQ(meta->rx_hash, 0, "rx_hash"))
+		return -1;
+
+	xsk_ring_cons__release(&xsk->rx, 1);
+	refill_rx(xsk, comp_addr);
+
+	return 0;
+}
+
+void test_xdp_metadata(void)
+{
+	struct xdp_metadata2 *bpf_obj2 = NULL;
+	struct xdp_metadata *bpf_obj = NULL;
+	struct bpf_program *new_prog, *prog;
+	struct nstoken *tok = NULL;
+	__u32 queue_id = QUEUE_ID;
+	struct bpf_map *prog_arr;
+	struct xsk tx_xsk = {};
+	struct xsk rx_xsk = {};
+	__u32 val, key = 0;
+	int retries = 10;
+	int rx_ifindex;
+	int sock_fd;
+	int ret;
+
+	/* Setup new networking namespace, with a veth pair. */
+
+	SYS("ip netns add xdp_metadata");
+	tok = open_netns("xdp_metadata");
+	SYS("ip link add numtxqueues 1 numrxqueues 1 " TX_NAME
+	    " type veth peer " RX_NAME " numtxqueues 1 numrxqueues 1");
+	SYS("ip link set dev " TX_NAME " address 00:00:00:00:00:01");
+	SYS("ip link set dev " RX_NAME " address 00:00:00:00:00:02");
+	SYS("ip link set dev " TX_NAME " up");
+	SYS("ip link set dev " RX_NAME " up");
+	SYS("ip addr add " TX_ADDR "/" PREFIX_LEN " dev " TX_NAME);
+	SYS("ip addr add " RX_ADDR "/" PREFIX_LEN " dev " RX_NAME);
+
+	rx_ifindex = if_nametoindex(RX_NAME);
+
+	/* Setup separate AF_XDP for TX and RX interfaces. */
+
+	ret = open_xsk(TX_NAME, &tx_xsk);
+	if (!ASSERT_OK(ret, "open_xsk(TX_NAME)"))
+		goto out;
+
+	ret = open_xsk(RX_NAME, &rx_xsk);
+	if (!ASSERT_OK(ret, "open_xsk(RX_NAME)"))
+		goto out;
+
+	bpf_obj = xdp_metadata__open();
+	if (!ASSERT_OK_PTR(bpf_obj, "open skeleton"))
+		goto out;
+
+	prog = bpf_object__find_program_by_name(bpf_obj->obj, "rx");
+	bpf_program__set_ifindex(prog, rx_ifindex);
+	bpf_program__set_flags(prog, BPF_F_XDP_DEV_BOUND_ONLY);
+
+	if (!ASSERT_OK(xdp_metadata__load(bpf_obj), "load skeleton"))
+		goto out;
+
+	/* Make sure we can't add dev-bound programs to prog maps. */
+	prog_arr = bpf_object__find_map_by_name(bpf_obj->obj, "prog_arr");
+	if (!ASSERT_OK_PTR(prog_arr, "no prog_arr map"))
+		goto out;
+
+	val = bpf_program__fd(prog);
+	if (!ASSERT_ERR(bpf_map__update_elem(prog_arr, &key, sizeof(key),
+					     &val, sizeof(val), BPF_ANY),
+			"update prog_arr"))
+		goto out;
+
+	/* Attach BPF program to RX interface. */
+
+	ret = bpf_xdp_attach(rx_ifindex,
+			     bpf_program__fd(bpf_obj->progs.rx),
+			     XDP_FLAGS, NULL);
+	if (!ASSERT_GE(ret, 0, "bpf_xdp_attach"))
+		goto out;
+
+	sock_fd = xsk_socket__fd(rx_xsk.socket);
+	ret = bpf_map_update_elem(bpf_map__fd(bpf_obj->maps.xsk), &queue_id, &sock_fd, 0);
+	if (!ASSERT_GE(ret, 0, "bpf_map_update_elem"))
+		goto out;
+
+	/* Send packet destined to RX AF_XDP socket. */
+	if (!ASSERT_GE(generate_packet(&tx_xsk, AF_XDP_CONSUMER_PORT), 0,
+		       "generate AF_XDP_CONSUMER_PORT"))
+		goto out;
+
+	/* Verify AF_XDP RX packet has proper metadata. */
+	if (!ASSERT_GE(verify_xsk_metadata(&rx_xsk), 0,
+		       "verify_xsk_metadata"))
+		goto out;
+
+	complete_tx(&tx_xsk);
+
+	/* Make sure freplace correctly picks up original bound device
+	 * and doesn't crash.
+	 */
+
+	bpf_obj2 = xdp_metadata2__open();
+	if (!ASSERT_OK_PTR(bpf_obj2, "open skeleton"))
+		goto out;
+
+	new_prog = bpf_object__find_program_by_name(bpf_obj2->obj, "freplace_rx");
+	bpf_program__set_attach_target(new_prog, bpf_program__fd(prog), "rx");
+
+	if (!ASSERT_OK(xdp_metadata2__load(bpf_obj2), "load freplace skeleton"))
+		goto out;
+
+	if (!ASSERT_OK(xdp_metadata2__attach(bpf_obj2), "attach freplace"))
+		goto out;
+
+	/* Send packet to trigger the freplace program. */
+	if (!ASSERT_GE(generate_packet(&tx_xsk, AF_XDP_CONSUMER_PORT), 0,
+		       "generate freplace packet"))
+		goto out;
+
+	while (retries--) {
+		if (bpf_obj2->bss->called)
+			break;
+		usleep(10);
+	}
+	ASSERT_GT(bpf_obj2->bss->called, 0, "not called");
+
+out:
+	close_xsk(&rx_xsk);
+	close_xsk(&tx_xsk);
+	xdp_metadata2__destroy(bpf_obj2);
+	xdp_metadata__destroy(bpf_obj);
+	if (tok)
+		close_netns(tok);
+	system("ip netns del xdp_metadata");
+}
diff --git a/tools/testing/selftests/bpf/progs/xdp_metadata.c b/tools/testing/selftests/bpf/progs/xdp_metadata.c
new file mode 100644
index 000000000000..77678b034389
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/xdp_metadata.c
@@ -0,0 +1,64 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <vmlinux.h>
+#include "xdp_metadata.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+
+struct {
+	__uint(type, BPF_MAP_TYPE_XSKMAP);
+	__uint(max_entries, 4);
+	__type(key, __u32);
+	__type(value, __u32);
+} xsk SEC(".maps");
+
+struct {
+	__uint(type, BPF_MAP_TYPE_PROG_ARRAY);
+	__uint(max_entries, 1);
+	__type(key, __u32);
+	__type(value, __u32);
+} prog_arr SEC(".maps");
+
+extern int bpf_xdp_metadata_rx_timestamp(const struct xdp_md *ctx,
+					 __u64 *timestamp) __ksym;
+extern int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx,
+				    __u32 *hash) __ksym;
+
+SEC("xdp")
+int rx(struct xdp_md *ctx)
+{
+	void *data, *data_meta;
+	struct xdp_meta *meta;
+	u64 timestamp = -1;
+	int ret;
+
+	/* Reserve enough for all custom metadata. */
+
+	ret = bpf_xdp_adjust_meta(ctx, -(int)sizeof(struct xdp_meta));
+	if (ret != 0)
+		return XDP_DROP;
+
+	data = (void *)(long)ctx->data;
+	data_meta = (void *)(long)ctx->data_meta;
+
+	if (data_meta + sizeof(struct xdp_meta) > data)
+		return XDP_DROP;
+
+	meta = data_meta;
+
+	/* Export metadata. */
+
+	/* We expect veth's bpf_xdp_metadata_rx_timestamp to return a zero
+	 * HW timestamp, so put some non-zero value into the AF_XDP frame
+	 * for userspace to pick up.
+	 */
+	bpf_xdp_metadata_rx_timestamp(ctx, &timestamp);
+	if (timestamp == 0)
+		meta->rx_timestamp = 1;
+
+	bpf_xdp_metadata_rx_hash(ctx, &meta->rx_hash);
+
+	return bpf_redirect_map(&xsk, ctx->rx_queue_index, XDP_PASS);
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/progs/xdp_metadata2.c b/tools/testing/selftests/bpf/progs/xdp_metadata2.c
new file mode 100644
index 000000000000..cf69d05451c3
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/xdp_metadata2.c
@@ -0,0 +1,23 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <vmlinux.h>
+#include "xdp_metadata.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+
+extern int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx,
+				    __u32 *hash) __ksym;
+
+int called;
+
+SEC("freplace/rx")
+int freplace_rx(struct xdp_md *ctx)
+{
+	u32 hash = 0;
+	/* Call _any_ metadata function to make sure we don't crash. */
+	bpf_xdp_metadata_rx_hash(ctx, &hash);
+	called++;
+	return XDP_PASS;
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/xdp_metadata.h b/tools/testing/selftests/bpf/xdp_metadata.h
new file mode 100644
index 000000000000..f6780fbb0a21
--- /dev/null
+++ b/tools/testing/selftests/bpf/xdp_metadata.h
@@ -0,0 +1,15 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#pragma once
+
+#ifndef ETH_P_IP
+#define ETH_P_IP 0x0800
+#endif
+
+#ifndef ETH_P_IPV6
+#define ETH_P_IPV6 0x86DD
+#endif
+
+struct xdp_meta {
+	__u64 rx_timestamp;
+	__u32 rx_hash;
+};
-- 
2.39.0.314.g84b9a713c41-goog


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [xdp-hints] [PATCH bpf-next v7 12/17] net/mlx4_en: Introduce wrapper for xdp_buff
  2023-01-12  0:32 [xdp-hints] [PATCH bpf-next v7 00/17] xdp: hints via kfuncs Stanislav Fomichev
                   ` (9 preceding siblings ...)
  2023-01-12  0:32 ` [xdp-hints] [PATCH bpf-next v7 11/17] selftests/bpf: Verify xdp_metadata xdp->af_xdp path Stanislav Fomichev
@ 2023-01-12  0:32 ` Stanislav Fomichev
  2023-01-12  0:32 ` [xdp-hints] [PATCH bpf-next v7 13/17] net/mlx4_en: Support RX XDP metadata Stanislav Fomichev
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 42+ messages in thread
From: Stanislav Fomichev @ 2023-01-12  0:32 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev, Tariq Toukan

No functional changes. Boilerplate to allow stuffing more data after xdp_buff.

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c | 26 +++++++++++++---------
 1 file changed, 15 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 8f762fc170b3..014a80af2813 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -661,9 +661,14 @@ static int check_csum(struct mlx4_cqe *cqe, struct sk_buff *skb, void *va,
 #define MLX4_CQE_STATUS_IP_ANY (MLX4_CQE_STATUS_IPV4)
 #endif
 
+struct mlx4_en_xdp_buff {
+	struct xdp_buff xdp;
+};
+
 int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int budget)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
+	struct mlx4_en_xdp_buff mxbuf = {};
 	int factor = priv->cqe_factor;
 	struct mlx4_en_rx_ring *ring;
 	struct bpf_prog *xdp_prog;
@@ -671,7 +676,6 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 	bool doorbell_pending;
 	bool xdp_redir_flush;
 	struct mlx4_cqe *cqe;
-	struct xdp_buff xdp;
 	int polled = 0;
 	int index;
 
@@ -681,7 +685,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 	ring = priv->rx_ring[cq_ring];
 
 	xdp_prog = rcu_dereference_bh(ring->xdp_prog);
-	xdp_init_buff(&xdp, priv->frag_info[0].frag_stride, &ring->xdp_rxq);
+	xdp_init_buff(&mxbuf.xdp, priv->frag_info[0].frag_stride, &ring->xdp_rxq);
 	doorbell_pending = false;
 	xdp_redir_flush = false;
 
@@ -776,24 +780,24 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 						priv->frag_info[0].frag_size,
 						DMA_FROM_DEVICE);
 
-			xdp_prepare_buff(&xdp, va - frags[0].page_offset,
+			xdp_prepare_buff(&mxbuf.xdp, va - frags[0].page_offset,
 					 frags[0].page_offset, length, false);
-			orig_data = xdp.data;
+			orig_data = mxbuf.xdp.data;
 
-			act = bpf_prog_run_xdp(xdp_prog, &xdp);
+			act = bpf_prog_run_xdp(xdp_prog, &mxbuf.xdp);
 
-			length = xdp.data_end - xdp.data;
-			if (xdp.data != orig_data) {
-				frags[0].page_offset = xdp.data -
-					xdp.data_hard_start;
-				va = xdp.data;
+			length = mxbuf.xdp.data_end - mxbuf.xdp.data;
+			if (mxbuf.xdp.data != orig_data) {
+				frags[0].page_offset = mxbuf.xdp.data -
+					mxbuf.xdp.data_hard_start;
+				va = mxbuf.xdp.data;
 			}
 
 			switch (act) {
 			case XDP_PASS:
 				break;
 			case XDP_REDIRECT:
-				if (likely(!xdp_do_redirect(dev, &xdp, xdp_prog))) {
+				if (likely(!xdp_do_redirect(dev, &mxbuf.xdp, xdp_prog))) {
 					ring->xdp_redirect++;
 					xdp_redir_flush = true;
 					frags[0].page = NULL;
-- 
2.39.0.314.g84b9a713c41-goog


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [xdp-hints] [PATCH bpf-next v7 13/17] net/mlx4_en: Support RX XDP metadata
  2023-01-12  0:32 [xdp-hints] [PATCH bpf-next v7 00/17] xdp: hints via kfuncs Stanislav Fomichev
                   ` (10 preceding siblings ...)
  2023-01-12  0:32 ` [xdp-hints] [PATCH bpf-next v7 12/17] net/mlx4_en: Introduce wrapper for xdp_buff Stanislav Fomichev
@ 2023-01-12  0:32 ` Stanislav Fomichev
  2023-01-12  0:32 ` [xdp-hints] [PATCH bpf-next v7 14/17] xsk: Add cb area to struct xdp_buff_xsk Stanislav Fomichev
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 42+ messages in thread
From: Stanislav Fomichev @ 2023-01-12  0:32 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev, Tariq Toukan

Support RX timestamp and hash for now. Tested using the program from
the next patch.

Also enable XDP metadata support; it is unclear why it was disabled,
as there is enough headroom.

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_clock.c | 13 +++++---
 .../net/ethernet/mellanox/mlx4/en_netdev.c    |  6 ++++
 drivers/net/ethernet/mellanox/mlx4/en_rx.c    | 33 ++++++++++++++++++-
 drivers/net/ethernet/mellanox/mlx4/mlx4_en.h  |  5 +++
 4 files changed, 52 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_clock.c b/drivers/net/ethernet/mellanox/mlx4/en_clock.c
index 98b5ffb4d729..9e3b76182088 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_clock.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_clock.c
@@ -58,9 +58,7 @@ u64 mlx4_en_get_cqe_ts(struct mlx4_cqe *cqe)
 	return hi | lo;
 }
 
-void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev,
-			    struct skb_shared_hwtstamps *hwts,
-			    u64 timestamp)
+u64 mlx4_en_get_hwtstamp(struct mlx4_en_dev *mdev, u64 timestamp)
 {
 	unsigned int seq;
 	u64 nsec;
@@ -70,8 +68,15 @@ void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev,
 		nsec = timecounter_cyc2time(&mdev->clock, timestamp);
 	} while (read_seqretry(&mdev->clock_lock, seq));
 
+	return ns_to_ktime(nsec);
+}
+
+void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev,
+			    struct skb_shared_hwtstamps *hwts,
+			    u64 timestamp)
+{
 	memset(hwts, 0, sizeof(struct skb_shared_hwtstamps));
-	hwts->hwtstamp = ns_to_ktime(nsec);
+	hwts->hwtstamp = mlx4_en_get_hwtstamp(mdev, timestamp);
 }
 
 /**
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index 8800d3f1f55c..af4c4858f397 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -2889,6 +2889,11 @@ static const struct net_device_ops mlx4_netdev_ops_master = {
 	.ndo_bpf		= mlx4_xdp,
 };
 
+static const struct xdp_metadata_ops mlx4_xdp_metadata_ops = {
+	.xmo_rx_timestamp		= mlx4_en_xdp_rx_timestamp,
+	.xmo_rx_hash			= mlx4_en_xdp_rx_hash,
+};
+
 struct mlx4_en_bond {
 	struct work_struct work;
 	struct mlx4_en_priv *priv;
@@ -3310,6 +3315,7 @@ int mlx4_en_init_netdev(struct mlx4_en_dev *mdev, int port,
 		dev->netdev_ops = &mlx4_netdev_ops_master;
 	else
 		dev->netdev_ops = &mlx4_netdev_ops;
+	dev->xdp_metadata_ops = &mlx4_xdp_metadata_ops;
 	dev->watchdog_timeo = MLX4_EN_WATCHDOG_TIMEOUT;
 	netif_set_real_num_tx_queues(dev, priv->tx_ring_num[TX]);
 	netif_set_real_num_rx_queues(dev, priv->rx_ring_num);
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 014a80af2813..0869d4fff17b 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -663,8 +663,35 @@ static int check_csum(struct mlx4_cqe *cqe, struct sk_buff *skb, void *va,
 
 struct mlx4_en_xdp_buff {
 	struct xdp_buff xdp;
+	struct mlx4_cqe *cqe;
+	struct mlx4_en_dev *mdev;
+	struct mlx4_en_rx_ring *ring;
+	struct net_device *dev;
 };
 
+int mlx4_en_xdp_rx_timestamp(const struct xdp_md *ctx, u64 *timestamp)
+{
+	struct mlx4_en_xdp_buff *_ctx = (void *)ctx;
+
+	if (unlikely(_ctx->ring->hwtstamp_rx_filter != HWTSTAMP_FILTER_ALL))
+		return -EOPNOTSUPP;
+
+	*timestamp = mlx4_en_get_hwtstamp(_ctx->mdev,
+					  mlx4_en_get_cqe_ts(_ctx->cqe));
+	return 0;
+}
+
+int mlx4_en_xdp_rx_hash(const struct xdp_md *ctx, u32 *hash)
+{
+	struct mlx4_en_xdp_buff *_ctx = (void *)ctx;
+
+	if (unlikely(!(_ctx->dev->features & NETIF_F_RXHASH)))
+		return -EOPNOTSUPP;
+
+	*hash = be32_to_cpu(_ctx->cqe->immed_rss_invalid);
+	return 0;
+}
+
 int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int budget)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
@@ -781,8 +808,12 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 						DMA_FROM_DEVICE);
 
 			xdp_prepare_buff(&mxbuf.xdp, va - frags[0].page_offset,
-					 frags[0].page_offset, length, false);
+					 frags[0].page_offset, length, true);
 			orig_data = mxbuf.xdp.data;
+			mxbuf.cqe = cqe;
+			mxbuf.mdev = priv->mdev;
+			mxbuf.ring = ring;
+			mxbuf.dev = dev;
 
 			act = bpf_prog_run_xdp(xdp_prog, &mxbuf.xdp);
 
diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
index 3d4226ddba5e..544e09b97483 100644
--- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
+++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
@@ -796,10 +796,15 @@ void mlx4_en_update_pfc_stats_bitmap(struct mlx4_dev *dev,
 int mlx4_en_netdev_event(struct notifier_block *this,
 			 unsigned long event, void *ptr);
 
+struct xdp_md;
+int mlx4_en_xdp_rx_timestamp(const struct xdp_md *ctx, u64 *timestamp);
+int mlx4_en_xdp_rx_hash(const struct xdp_md *ctx, u32 *hash);
+
 /*
  * Functions for time stamping
  */
 u64 mlx4_en_get_cqe_ts(struct mlx4_cqe *cqe);
+u64 mlx4_en_get_hwtstamp(struct mlx4_en_dev *mdev, u64 timestamp);
 void mlx4_en_fill_hwtstamps(struct mlx4_en_dev *mdev,
 			    struct skb_shared_hwtstamps *hwts,
 			    u64 timestamp);
-- 
2.39.0.314.g84b9a713c41-goog


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [xdp-hints] [PATCH bpf-next v7 14/17] xsk: Add cb area to struct xdp_buff_xsk
  2023-01-12  0:32 [xdp-hints] [PATCH bpf-next v7 00/17] xdp: hints via kfuncs Stanislav Fomichev
                   ` (11 preceding siblings ...)
  2023-01-12  0:32 ` [xdp-hints] [PATCH bpf-next v7 13/17] net/mlx4_en: Support RX XDP metadata Stanislav Fomichev
@ 2023-01-12  0:32 ` Stanislav Fomichev
  2023-01-12  0:32 ` [xdp-hints] [PATCH bpf-next v7 15/17] net/mlx5e: Introduce wrapper for xdp_buff Stanislav Fomichev
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 42+ messages in thread
From: Stanislav Fomichev @ 2023-01-12  0:32 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, Toke Høiland-Jørgensen,
	David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev

From: Toke Høiland-Jørgensen <toke@redhat.com>

Add an area after the xdp_buff in struct xdp_buff_xsk that drivers can use
to stash extra information to use in metadata kfuncs. The maximum size of
24 bytes means the full xdp_buff_xsk structure will take up exactly two
cache lines (with the cb field spanning both). Also add a macro drivers can
use to check their own wrapping structs against the available size.
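
A rough sketch of the intended driver usage (the mydrv names are
hypothetical; mlx5e does the equivalent later in this series):

  #include <net/xsk_buff_pool.h>

  struct mydrv_cqe; /* hypothetical driver completion entry */

  struct mydrv_xdp_buff {
          struct xdp_buff xdp;
          struct mydrv_cqe *cqe; /* stashed for later metadata kfuncs */
  };

  static inline void mydrv_check_xsk_priv(void)
  {
          /* XSK_CHECK_PRIV_TYPE() is BUILD_BUG_ON-based, so it needs
           * function scope: the whole wrapper must fit within xdp_buff
           * plus the 24-byte cb area.
           */
          XSK_CHECK_PRIV_TYPE(struct mydrv_xdp_buff);
  }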

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 include/net/xsk_buff_pool.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
index f787c3f524b0..3e952e569418 100644
--- a/include/net/xsk_buff_pool.h
+++ b/include/net/xsk_buff_pool.h
@@ -19,8 +19,11 @@ struct xdp_sock;
 struct device;
 struct page;
 
+#define XSK_PRIV_MAX 24
+
 struct xdp_buff_xsk {
 	struct xdp_buff xdp;
+	u8 cb[XSK_PRIV_MAX];
 	dma_addr_t dma;
 	dma_addr_t frame_dma;
 	struct xsk_buff_pool *pool;
@@ -28,6 +31,8 @@ struct xdp_buff_xsk {
 	struct list_head free_list_node;
 };
 
+#define XSK_CHECK_PRIV_TYPE(t) BUILD_BUG_ON(sizeof(t) > offsetofend(struct xdp_buff_xsk, cb))
+
 struct xsk_dma_map {
 	dma_addr_t *dma_pages;
 	struct device *dev;
-- 
2.39.0.314.g84b9a713c41-goog


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [xdp-hints] [PATCH bpf-next v7 15/17] net/mlx5e: Introduce wrapper for xdp_buff
  2023-01-12  0:32 [xdp-hints] [PATCH bpf-next v7 00/17] xdp: hints via kfuncs Stanislav Fomichev
                   ` (12 preceding siblings ...)
  2023-01-12  0:32 ` [xdp-hints] [PATCH bpf-next v7 14/17] xsk: Add cb area to struct xdp_buff_xsk Stanislav Fomichev
@ 2023-01-12  0:32 ` Stanislav Fomichev
  2023-01-12  8:07   ` [xdp-hints] " Tariq Toukan
  2023-01-12  0:32 ` [xdp-hints] [PATCH bpf-next v7 16/17] net/mlx5e: Support RX XDP metadata Stanislav Fomichev
                   ` (2 subsequent siblings)
  16 siblings, 1 reply; 42+ messages in thread
From: Stanislav Fomichev @ 2023-01-12  0:32 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, Toke Høiland-Jørgensen,
	Tariq Toukan, Saeed Mahameed, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

From: Toke Høiland-Jørgensen <toke@redhat.com>

Preparation for implementing HW metadata kfuncs. No functional change.

Cc: Tariq Toukan <tariqt@nvidia.com>
Cc: Saeed Mahameed <saeedm@nvidia.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |  1 +
 .../net/ethernet/mellanox/mlx5/core/en/xdp.c  |  3 +-
 .../net/ethernet/mellanox/mlx5/core/en/xdp.h  |  6 +-
 .../ethernet/mellanox/mlx5/core/en/xsk/rx.c   | 25 ++++----
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   | 58 +++++++++----------
 5 files changed, 50 insertions(+), 43 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 2d77fb8a8a01..af663978d1b4 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -469,6 +469,7 @@ struct mlx5e_txqsq {
 union mlx5e_alloc_unit {
 	struct page *page;
 	struct xdp_buff *xsk;
+	struct mlx5e_xdp_buff *mxbuf;
 };
 
 /* XDP packets can be transmitted in different ways. On completion, we need to
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
index 20507ef2f956..31bb6806bf5d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
@@ -158,8 +158,9 @@ mlx5e_xmit_xdp_buff(struct mlx5e_xdpsq *sq, struct mlx5e_rq *rq,
 
 /* returns true if packet was consumed by xdp */
 bool mlx5e_xdp_handle(struct mlx5e_rq *rq, struct page *page,
-		      struct bpf_prog *prog, struct xdp_buff *xdp)
+		      struct bpf_prog *prog, struct mlx5e_xdp_buff *mxbuf)
 {
+	struct xdp_buff *xdp = &mxbuf->xdp;
 	u32 act;
 	int err;
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
index bc2d9034af5b..389818bf6833 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
@@ -44,10 +44,14 @@
 	(MLX5E_XDP_INLINE_WQE_MAX_DS_CNT * MLX5_SEND_WQE_DS - \
 	 sizeof(struct mlx5_wqe_inline_seg))
 
+struct mlx5e_xdp_buff {
+	struct xdp_buff xdp;
+};
+
 struct mlx5e_xsk_param;
 int mlx5e_xdp_max_mtu(struct mlx5e_params *params, struct mlx5e_xsk_param *xsk);
 bool mlx5e_xdp_handle(struct mlx5e_rq *rq, struct page *page,
-		      struct bpf_prog *prog, struct xdp_buff *xdp);
+		      struct bpf_prog *prog, struct mlx5e_xdp_buff *mlctx);
 void mlx5e_xdp_mpwqe_complete(struct mlx5e_xdpsq *sq);
 bool mlx5e_poll_xdpsq_cq(struct mlx5e_cq *cq);
 void mlx5e_free_xdpsq_descs(struct mlx5e_xdpsq *sq);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
index c91b54d9ff27..9cff82d764e3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
@@ -22,6 +22,7 @@ int mlx5e_xsk_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
 		goto err;
 
 	BUILD_BUG_ON(sizeof(wi->alloc_units[0]) != sizeof(wi->alloc_units[0].xsk));
+	XSK_CHECK_PRIV_TYPE(struct mlx5e_xdp_buff);
 	batch = xsk_buff_alloc_batch(rq->xsk_pool, (struct xdp_buff **)wi->alloc_units,
 				     rq->mpwqe.pages_per_wqe);
 
@@ -233,7 +234,7 @@ struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,
 						    u32 head_offset,
 						    u32 page_idx)
 {
-	struct xdp_buff *xdp = wi->alloc_units[page_idx].xsk;
+	struct mlx5e_xdp_buff *mxbuf = wi->alloc_units[page_idx].mxbuf;
 	struct bpf_prog *prog;
 
 	/* Check packet size. Note LRO doesn't use linear SKB */
@@ -249,9 +250,9 @@ struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,
 	 */
 	WARN_ON_ONCE(head_offset);
 
-	xsk_buff_set_size(xdp, cqe_bcnt);
-	xsk_buff_dma_sync_for_cpu(xdp, rq->xsk_pool);
-	net_prefetch(xdp->data);
+	xsk_buff_set_size(&mxbuf->xdp, cqe_bcnt);
+	xsk_buff_dma_sync_for_cpu(&mxbuf->xdp, rq->xsk_pool);
+	net_prefetch(mxbuf->xdp.data);
 
 	/* Possible flows:
 	 * - XDP_REDIRECT to XSKMAP:
@@ -269,7 +270,7 @@ struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,
 	 */
 
 	prog = rcu_dereference(rq->xdp_prog);
-	if (likely(prog && mlx5e_xdp_handle(rq, NULL, prog, xdp))) {
+	if (likely(prog && mlx5e_xdp_handle(rq, NULL, prog, mxbuf))) {
 		if (likely(__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)))
 			__set_bit(page_idx, wi->xdp_xmit_bitmap); /* non-atomic */
 		return NULL; /* page/packet was consumed by XDP */
@@ -278,14 +279,14 @@ struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,
 	/* XDP_PASS: copy the data from the UMEM to a new SKB and reuse the
 	 * frame. On SKB allocation failure, NULL is returned.
 	 */
-	return mlx5e_xsk_construct_skb(rq, xdp);
+	return mlx5e_xsk_construct_skb(rq, &mxbuf->xdp);
 }
 
 struct sk_buff *mlx5e_xsk_skb_from_cqe_linear(struct mlx5e_rq *rq,
 					      struct mlx5e_wqe_frag_info *wi,
 					      u32 cqe_bcnt)
 {
-	struct xdp_buff *xdp = wi->au->xsk;
+	struct mlx5e_xdp_buff *mxbuf = wi->au->mxbuf;
 	struct bpf_prog *prog;
 
 	/* wi->offset is not used in this function, because xdp->data and the
@@ -295,17 +296,17 @@ struct sk_buff *mlx5e_xsk_skb_from_cqe_linear(struct mlx5e_rq *rq,
 	 */
 	WARN_ON_ONCE(wi->offset);
 
-	xsk_buff_set_size(xdp, cqe_bcnt);
-	xsk_buff_dma_sync_for_cpu(xdp, rq->xsk_pool);
-	net_prefetch(xdp->data);
+	xsk_buff_set_size(&mxbuf->xdp, cqe_bcnt);
+	xsk_buff_dma_sync_for_cpu(&mxbuf->xdp, rq->xsk_pool);
+	net_prefetch(mxbuf->xdp.data);
 
 	prog = rcu_dereference(rq->xdp_prog);
-	if (likely(prog && mlx5e_xdp_handle(rq, NULL, prog, xdp)))
+	if (likely(prog && mlx5e_xdp_handle(rq, NULL, prog, mxbuf)))
 		return NULL; /* page/packet was consumed by XDP */
 
 	/* XDP_PASS: copy the data from the UMEM to a new SKB. The frame reuse
 	 * will be handled by mlx5e_free_rx_wqe.
 	 * On SKB allocation failure, NULL is returned.
 	 */
-	return mlx5e_xsk_construct_skb(rq, xdp);
+	return mlx5e_xsk_construct_skb(rq, &mxbuf->xdp);
 }
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index c8820ab22169..6affdddf5bcf 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -1575,11 +1575,11 @@ struct sk_buff *mlx5e_build_linear_skb(struct mlx5e_rq *rq, void *va,
 	return skb;
 }
 
-static void mlx5e_fill_xdp_buff(struct mlx5e_rq *rq, void *va, u16 headroom,
-				u32 len, struct xdp_buff *xdp)
+static void mlx5e_fill_mxbuf(struct mlx5e_rq *rq, void *va, u16 headroom,
+			     u32 len, struct mlx5e_xdp_buff *mxbuf)
 {
-	xdp_init_buff(xdp, rq->buff.frame0_sz, &rq->xdp_rxq);
-	xdp_prepare_buff(xdp, va, headroom, len, true);
+	xdp_init_buff(&mxbuf->xdp, rq->buff.frame0_sz, &rq->xdp_rxq);
+	xdp_prepare_buff(&mxbuf->xdp, va, headroom, len, true);
 }
 
 static struct sk_buff *
@@ -1606,16 +1606,16 @@ mlx5e_skb_from_cqe_linear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi,
 
 	prog = rcu_dereference(rq->xdp_prog);
 	if (prog) {
-		struct xdp_buff xdp;
+		struct mlx5e_xdp_buff mxbuf;
 
 		net_prefetchw(va); /* xdp_frame data area */
-		mlx5e_fill_xdp_buff(rq, va, rx_headroom, cqe_bcnt, &xdp);
-		if (mlx5e_xdp_handle(rq, au->page, prog, &xdp))
+		mlx5e_fill_mxbuf(rq, va, rx_headroom, cqe_bcnt, &mxbuf);
+		if (mlx5e_xdp_handle(rq, au->page, prog, &mxbuf))
 			return NULL; /* page/packet was consumed by XDP */
 
-		rx_headroom = xdp.data - xdp.data_hard_start;
-		metasize = xdp.data - xdp.data_meta;
-		cqe_bcnt = xdp.data_end - xdp.data;
+		rx_headroom = mxbuf.xdp.data - mxbuf.xdp.data_hard_start;
+		metasize = mxbuf.xdp.data - mxbuf.xdp.data_meta;
+		cqe_bcnt = mxbuf.xdp.data_end - mxbuf.xdp.data;
 	}
 	frag_size = MLX5_SKB_FRAG_SZ(rx_headroom + cqe_bcnt);
 	skb = mlx5e_build_linear_skb(rq, va, frag_size, rx_headroom, cqe_bcnt, metasize);
@@ -1637,9 +1637,9 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
 	union mlx5e_alloc_unit *au = wi->au;
 	u16 rx_headroom = rq->buff.headroom;
 	struct skb_shared_info *sinfo;
+	struct mlx5e_xdp_buff mxbuf;
 	u32 frag_consumed_bytes;
 	struct bpf_prog *prog;
-	struct xdp_buff xdp;
 	struct sk_buff *skb;
 	dma_addr_t addr;
 	u32 truesize;
@@ -1654,8 +1654,8 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
 	net_prefetchw(va); /* xdp_frame data area */
 	net_prefetch(va + rx_headroom);
 
-	mlx5e_fill_xdp_buff(rq, va, rx_headroom, frag_consumed_bytes, &xdp);
-	sinfo = xdp_get_shared_info_from_buff(&xdp);
+	mlx5e_fill_mxbuf(rq, va, rx_headroom, frag_consumed_bytes, &mxbuf);
+	sinfo = xdp_get_shared_info_from_buff(&mxbuf.xdp);
 	truesize = 0;
 
 	cqe_bcnt -= frag_consumed_bytes;
@@ -1673,13 +1673,13 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
 		dma_sync_single_for_cpu(rq->pdev, addr + wi->offset,
 					frag_consumed_bytes, rq->buff.map_dir);
 
-		if (!xdp_buff_has_frags(&xdp)) {
+		if (!xdp_buff_has_frags(&mxbuf.xdp)) {
 			/* Init on the first fragment to avoid cold cache access
 			 * when possible.
 			 */
 			sinfo->nr_frags = 0;
 			sinfo->xdp_frags_size = 0;
-			xdp_buff_set_frags_flag(&xdp);
+			xdp_buff_set_frags_flag(&mxbuf.xdp);
 		}
 
 		frag = &sinfo->frags[sinfo->nr_frags++];
@@ -1688,7 +1688,7 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
 		skb_frag_size_set(frag, frag_consumed_bytes);
 
 		if (page_is_pfmemalloc(au->page))
-			xdp_buff_set_frag_pfmemalloc(&xdp);
+			xdp_buff_set_frag_pfmemalloc(&mxbuf.xdp);
 
 		sinfo->xdp_frags_size += frag_consumed_bytes;
 		truesize += frag_info->frag_stride;
@@ -1701,7 +1701,7 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
 	au = head_wi->au;
 
 	prog = rcu_dereference(rq->xdp_prog);
-	if (prog && mlx5e_xdp_handle(rq, au->page, prog, &xdp)) {
+	if (prog && mlx5e_xdp_handle(rq, au->page, prog, &mxbuf)) {
 		if (test_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)) {
 			int i;
 
@@ -1711,22 +1711,22 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
 		return NULL; /* page/packet was consumed by XDP */
 	}
 
-	skb = mlx5e_build_linear_skb(rq, xdp.data_hard_start, rq->buff.frame0_sz,
-				     xdp.data - xdp.data_hard_start,
-				     xdp.data_end - xdp.data,
-				     xdp.data - xdp.data_meta);
+	skb = mlx5e_build_linear_skb(rq, mxbuf.xdp.data_hard_start, rq->buff.frame0_sz,
+				     mxbuf.xdp.data - mxbuf.xdp.data_hard_start,
+				     mxbuf.xdp.data_end - mxbuf.xdp.data,
+				     mxbuf.xdp.data - mxbuf.xdp.data_meta);
 	if (unlikely(!skb))
 		return NULL;
 
 	page_ref_inc(au->page);
 
-	if (unlikely(xdp_buff_has_frags(&xdp))) {
+	if (unlikely(xdp_buff_has_frags(&mxbuf.xdp))) {
 		int i;
 
 		/* sinfo->nr_frags is reset by build_skb, calculate again. */
 		xdp_update_skb_shared_info(skb, wi - head_wi - 1,
 					   sinfo->xdp_frags_size, truesize,
-					   xdp_buff_is_frag_pfmemalloc(&xdp));
+					   xdp_buff_is_frag_pfmemalloc(&mxbuf.xdp));
 
 		for (i = 0; i < sinfo->nr_frags; i++) {
 			skb_frag_t *frag = &sinfo->frags[i];
@@ -2007,19 +2007,19 @@ mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
 
 	prog = rcu_dereference(rq->xdp_prog);
 	if (prog) {
-		struct xdp_buff xdp;
+		struct mlx5e_xdp_buff mxbuf;
 
 		net_prefetchw(va); /* xdp_frame data area */
-		mlx5e_fill_xdp_buff(rq, va, rx_headroom, cqe_bcnt, &xdp);
-		if (mlx5e_xdp_handle(rq, au->page, prog, &xdp)) {
+		mlx5e_fill_mxbuf(rq, va, rx_headroom, cqe_bcnt, &mxbuf);
+		if (mlx5e_xdp_handle(rq, au->page, prog, &mxbuf)) {
 			if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags))
 				__set_bit(page_idx, wi->xdp_xmit_bitmap); /* non-atomic */
 			return NULL; /* page/packet was consumed by XDP */
 		}
 
-		rx_headroom = xdp.data - xdp.data_hard_start;
-		metasize = xdp.data - xdp.data_meta;
-		cqe_bcnt = xdp.data_end - xdp.data;
+		rx_headroom = mxbuf.xdp.data - mxbuf.xdp.data_hard_start;
+		metasize = mxbuf.xdp.data - mxbuf.xdp.data_meta;
+		cqe_bcnt = mxbuf.xdp.data_end - mxbuf.xdp.data;
 	}
 	frag_size = MLX5_SKB_FRAG_SZ(rx_headroom + cqe_bcnt);
 	skb = mlx5e_build_linear_skb(rq, va, frag_size, rx_headroom, cqe_bcnt, metasize);
-- 
2.39.0.314.g84b9a713c41-goog


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [xdp-hints] [PATCH bpf-next v7 16/17] net/mlx5e: Support RX XDP metadata
  2023-01-12  0:32 [xdp-hints] [PATCH bpf-next v7 00/17] xdp: hints via kfuncs Stanislav Fomichev
                   ` (13 preceding siblings ...)
  2023-01-12  0:32 ` [xdp-hints] [PATCH bpf-next v7 15/17] net/mlx5e: Introduce wrapper for xdp_buff Stanislav Fomichev
@ 2023-01-12  0:32 ` Stanislav Fomichev
  2023-01-12  8:13   ` [xdp-hints] " Tariq Toukan
  2023-01-12  0:32 ` [xdp-hints] [PATCH bpf-next v7 17/17] selftests/bpf: Simple program to dump XDP RX metadata Stanislav Fomichev
  2023-01-12  7:29 ` [xdp-hints] Re: [PATCH bpf-next v7 00/17] xdp: hints via kfuncs Martin KaFai Lau
  16 siblings, 1 reply; 42+ messages in thread
From: Stanislav Fomichev @ 2023-01-12  0:32 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, Toke Høiland-Jørgensen,
	Tariq Toukan, Saeed Mahameed, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

From: Toke Høiland-Jørgensen <toke@redhat.com>

Support RX hash and timestamp metadata kfuncs. To implement them, pass the
cqe pointer into the mlx5e_skb_from* functions so that the kfuncs can later
retrieve it from the XDP ctx.
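
For reference, the kfuncs recover the driver wrapper straight from the
xdp_md pointer. This only works because struct xdp_buff remains the first
member of struct mlx5e_xdp_buff, so the two pointers alias; a minimal
sketch of that pattern (simplified from the patch below):

	struct mlx5e_xdp_buff {
		struct xdp_buff xdp;	/* must stay first for the cast to hold */
		struct mlx5_cqe64 *cqe;
		struct mlx5e_rq *rq;
	};

	/* in a kfunc: the xdp_md pointer is really &mxbuf->xdp */
	const struct mlx5e_xdp_buff *mxbuf = (void *)ctx;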

Cc: Tariq Toukan <tariqt@nvidia.com>
Cc: Saeed Mahameed <saeedm@nvidia.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |  5 +-
 .../net/ethernet/mellanox/mlx5/core/en/txrx.h |  5 ++
 .../net/ethernet/mellanox/mlx5/core/en/xdp.c  | 23 +++++++++
 .../net/ethernet/mellanox/mlx5/core/en/xdp.h  |  5 ++
 .../ethernet/mellanox/mlx5/core/en/xsk/rx.c   | 10 ++++
 .../ethernet/mellanox/mlx5/core/en/xsk/rx.h   |  2 +
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  6 +++
 .../net/ethernet/mellanox/mlx5/core/en_rx.c   | 51 +++++++++++----------
 8 files changed, 81 insertions(+), 26 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index af663978d1b4..6de02d8aeab8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -627,10 +627,11 @@ struct mlx5e_rq;
 typedef void (*mlx5e_fp_handle_rx_cqe)(struct mlx5e_rq*, struct mlx5_cqe64*);
 typedef struct sk_buff *
 (*mlx5e_fp_skb_from_cqe_mpwrq)(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
-			       u16 cqe_bcnt, u32 head_offset, u32 page_idx);
+			       struct mlx5_cqe64 *cqe, u16 cqe_bcnt,
+			       u32 head_offset, u32 page_idx);
 typedef struct sk_buff *
 (*mlx5e_fp_skb_from_cqe)(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi,
-			 u32 cqe_bcnt);
+			 struct mlx5_cqe64 *cqe, u32 cqe_bcnt);
 typedef bool (*mlx5e_fp_post_rx_wqes)(struct mlx5e_rq *rq);
 typedef void (*mlx5e_fp_dealloc_wqe)(struct mlx5e_rq*, u16);
 typedef void (*mlx5e_fp_shampo_dealloc_hd)(struct mlx5e_rq*, u16, u16, bool);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
index 853f312cd757..757c012ece27 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
@@ -73,6 +73,11 @@ int mlx5e_poll_rx_cq(struct mlx5e_cq *cq, int budget);
 void mlx5e_free_rx_descs(struct mlx5e_rq *rq);
 void mlx5e_free_rx_in_progress_descs(struct mlx5e_rq *rq);
 
+static inline bool mlx5e_rx_hw_stamp(struct hwtstamp_config *config)
+{
+	return config->rx_filter == HWTSTAMP_FILTER_ALL;
+}
+
 /* TX */
 netdev_tx_t mlx5e_xmit(struct sk_buff *skb, struct net_device *dev);
 bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
index 31bb6806bf5d..d10d31e12ba2 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
@@ -156,6 +156,29 @@ mlx5e_xmit_xdp_buff(struct mlx5e_xdpsq *sq, struct mlx5e_rq *rq,
 	return true;
 }
 
+int mlx5e_xdp_rx_timestamp(const struct xdp_md *ctx, u64 *timestamp)
+{
+	const struct mlx5e_xdp_buff *_ctx = (void *)ctx;
+
+	if (unlikely(!mlx5e_rx_hw_stamp(_ctx->rq->tstamp)))
+		return -EOPNOTSUPP;
+
+	*timestamp = mlx5e_cqe_ts_to_ns(_ctx->rq->ptp_cyc2time,
+					_ctx->rq->clock, get_cqe_ts(_ctx->cqe));
+	return 0;
+}
+
+int mlx5e_xdp_rx_hash(const struct xdp_md *ctx, u32 *hash)
+{
+	const struct mlx5e_xdp_buff *_ctx = (void *)ctx;
+
+	if (unlikely(!(_ctx->xdp.rxq->dev->features & NETIF_F_RXHASH)))
+		return -EOPNOTSUPP;
+
+	*hash = be32_to_cpu(_ctx->cqe->rss_hash_result);
+	return 0;
+}
+
 /* returns true if packet was consumed by xdp */
 bool mlx5e_xdp_handle(struct mlx5e_rq *rq, struct page *page,
 		      struct bpf_prog *prog, struct mlx5e_xdp_buff *mxbuf)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
index 389818bf6833..cb568c62aba0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
@@ -46,6 +46,8 @@
 
 struct mlx5e_xdp_buff {
 	struct xdp_buff xdp;
+	struct mlx5_cqe64 *cqe;
+	struct mlx5e_rq *rq;
 };
 
 struct mlx5e_xsk_param;
@@ -60,6 +62,9 @@ void mlx5e_xdp_rx_poll_complete(struct mlx5e_rq *rq);
 int mlx5e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
 		   u32 flags);
 
+int mlx5e_xdp_rx_timestamp(const struct xdp_md *ctx, u64 *timestamp);
+int mlx5e_xdp_rx_hash(const struct xdp_md *ctx, u32 *hash);
+
 INDIRECT_CALLABLE_DECLARE(bool mlx5e_xmit_xdp_frame_mpwqe(struct mlx5e_xdpsq *sq,
 							  struct mlx5e_xmit_data *xdptxd,
 							  struct skb_shared_info *sinfo,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
index 9cff82d764e3..8bf3029abd3c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
@@ -49,6 +49,7 @@ int mlx5e_xsk_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
 			umr_wqe->inline_mtts[i] = (struct mlx5_mtt) {
 				.ptag = cpu_to_be64(addr | MLX5_EN_WR),
 			};
+			wi->alloc_units[i].mxbuf->rq = rq;
 		}
 	} else if (unlikely(rq->mpwqe.umr_mode == MLX5E_MPWRQ_UMR_MODE_UNALIGNED)) {
 		for (i = 0; i < batch; i++) {
@@ -58,6 +59,7 @@ int mlx5e_xsk_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
 				.key = rq->mkey_be,
 				.va = cpu_to_be64(addr),
 			};
+			wi->alloc_units[i].mxbuf->rq = rq;
 		}
 	} else if (likely(rq->mpwqe.umr_mode == MLX5E_MPWRQ_UMR_MODE_TRIPLE)) {
 		u32 mapping_size = 1 << (rq->mpwqe.page_shift - 2);
@@ -81,6 +83,7 @@ int mlx5e_xsk_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
 				.key = rq->mkey_be,
 				.va = cpu_to_be64(rq->wqe_overflow.addr),
 			};
+			wi->alloc_units[i].mxbuf->rq = rq;
 		}
 	} else {
 		__be32 pad_size = cpu_to_be32((1 << rq->mpwqe.page_shift) -
@@ -100,6 +103,7 @@ int mlx5e_xsk_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
 				.va = cpu_to_be64(rq->wqe_overflow.addr),
 				.bcount = pad_size,
 			};
+			wi->alloc_units[i].mxbuf->rq = rq;
 		}
 	}
 
@@ -230,6 +234,7 @@ static struct sk_buff *mlx5e_xsk_construct_skb(struct mlx5e_rq *rq, struct xdp_b
 
 struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,
 						    struct mlx5e_mpw_info *wi,
+						    struct mlx5_cqe64 *cqe,
 						    u16 cqe_bcnt,
 						    u32 head_offset,
 						    u32 page_idx)
@@ -250,6 +255,8 @@ struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,
 	 */
 	WARN_ON_ONCE(head_offset);
 
+	/* mxbuf->rq is set on allocation, but cqe is per-packet so set it here */
+	mxbuf->cqe = cqe;
 	xsk_buff_set_size(&mxbuf->xdp, cqe_bcnt);
 	xsk_buff_dma_sync_for_cpu(&mxbuf->xdp, rq->xsk_pool);
 	net_prefetch(mxbuf->xdp.data);
@@ -284,6 +291,7 @@ struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,
 
 struct sk_buff *mlx5e_xsk_skb_from_cqe_linear(struct mlx5e_rq *rq,
 					      struct mlx5e_wqe_frag_info *wi,
+					      struct mlx5_cqe64 *cqe,
 					      u32 cqe_bcnt)
 {
 	struct mlx5e_xdp_buff *mxbuf = wi->au->mxbuf;
@@ -296,6 +304,8 @@ struct sk_buff *mlx5e_xsk_skb_from_cqe_linear(struct mlx5e_rq *rq,
 	 */
 	WARN_ON_ONCE(wi->offset);
 
+	/* mxbuf->rq is set on allocation, but cqe is per-packet so set it here */
+	mxbuf->cqe = cqe;
 	xsk_buff_set_size(&mxbuf->xdp, cqe_bcnt);
 	xsk_buff_dma_sync_for_cpu(&mxbuf->xdp, rq->xsk_pool);
 	net_prefetch(mxbuf->xdp.data);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.h b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.h
index 087c943bd8e9..cefc0ef6105d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.h
@@ -13,11 +13,13 @@ int mlx5e_xsk_alloc_rx_wqes_batched(struct mlx5e_rq *rq, u16 ix, int wqe_bulk);
 int mlx5e_xsk_alloc_rx_wqes(struct mlx5e_rq *rq, u16 ix, int wqe_bulk);
 struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,
 						    struct mlx5e_mpw_info *wi,
+						    struct mlx5_cqe64 *cqe,
 						    u16 cqe_bcnt,
 						    u32 head_offset,
 						    u32 page_idx);
 struct sk_buff *mlx5e_xsk_skb_from_cqe_linear(struct mlx5e_rq *rq,
 					      struct mlx5e_wqe_frag_info *wi,
+					      struct mlx5_cqe64 *cqe,
 					      u32 cqe_bcnt);
 
 #endif /* __MLX5_EN_XSK_RX_H__ */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index cff5f2e29e1e..be942c060774 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -4913,6 +4913,11 @@ const struct net_device_ops mlx5e_netdev_ops = {
 #endif
 };
 
+static const struct xdp_metadata_ops mlx5_xdp_metadata_ops = {
+	.xmo_rx_timestamp		= mlx5e_xdp_rx_timestamp,
+	.xmo_rx_hash			= mlx5e_xdp_rx_hash,
+};
+
 static u32 mlx5e_choose_lro_timeout(struct mlx5_core_dev *mdev, u32 wanted_timeout)
 {
 	int i;
@@ -5053,6 +5058,7 @@ static void mlx5e_build_nic_netdev(struct net_device *netdev)
 	SET_NETDEV_DEV(netdev, mdev->device);
 
 	netdev->netdev_ops = &mlx5e_netdev_ops;
+	netdev->xdp_metadata_ops = &mlx5_xdp_metadata_ops;
 
 	mlx5e_dcbnl_build_netdev(netdev);
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 6affdddf5bcf..7b08653be000 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -62,10 +62,12 @@
 
 static struct sk_buff *
 mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
-				u16 cqe_bcnt, u32 head_offset, u32 page_idx);
+				struct mlx5_cqe64 *cqe, u16 cqe_bcnt, u32 head_offset,
+				u32 page_idx);
 static struct sk_buff *
 mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
-				   u16 cqe_bcnt, u32 head_offset, u32 page_idx);
+				   struct mlx5_cqe64 *cqe, u16 cqe_bcnt, u32 head_offset,
+				   u32 page_idx);
 static void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
 static void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
 static void mlx5e_handle_rx_cqe_mpwrq_shampo(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
@@ -76,11 +78,6 @@ const struct mlx5e_rx_handlers mlx5e_rx_handlers_nic = {
 	.handle_rx_cqe_mpwqe_shampo = mlx5e_handle_rx_cqe_mpwrq_shampo,
 };
 
-static inline bool mlx5e_rx_hw_stamp(struct hwtstamp_config *config)
-{
-	return config->rx_filter == HWTSTAMP_FILTER_ALL;
-}
-
 static inline void mlx5e_read_cqe_slot(struct mlx5_cqwq *wq,
 				       u32 cqcc, void *data)
 {
@@ -1575,16 +1572,19 @@ struct sk_buff *mlx5e_build_linear_skb(struct mlx5e_rq *rq, void *va,
 	return skb;
 }
 
-static void mlx5e_fill_mxbuf(struct mlx5e_rq *rq, void *va, u16 headroom,
-			     u32 len, struct mlx5e_xdp_buff *mxbuf)
+static void mlx5e_fill_mxbuf(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
+			     void *va, u16 headroom, u32 len,
+			     struct mlx5e_xdp_buff *mxbuf)
 {
 	xdp_init_buff(&mxbuf->xdp, rq->buff.frame0_sz, &rq->xdp_rxq);
 	xdp_prepare_buff(&mxbuf->xdp, va, headroom, len, true);
+	mxbuf->cqe = cqe;
+	mxbuf->rq = rq;
 }
 
 static struct sk_buff *
 mlx5e_skb_from_cqe_linear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi,
-			  u32 cqe_bcnt)
+			  struct mlx5_cqe64 *cqe, u32 cqe_bcnt)
 {
 	union mlx5e_alloc_unit *au = wi->au;
 	u16 rx_headroom = rq->buff.headroom;
@@ -1609,5 +1609,5 @@ mlx5e_skb_from_cqe_linear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi,
 		struct mlx5e_xdp_buff mxbuf;
 
 		net_prefetchw(va); /* xdp_frame data area */
-		mlx5e_fill_mxbuf(rq, va, rx_headroom, cqe_bcnt, &mxbuf);
+		mlx5e_fill_mxbuf(rq, cqe, va, rx_headroom, cqe_bcnt, &mxbuf);
 		if (mlx5e_xdp_handle(rq, au->page, prog, &mxbuf))
@@ -1630,7 +1630,7 @@ mlx5e_skb_from_cqe_linear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi,
 
 static struct sk_buff *
 mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi,
-			     u32 cqe_bcnt)
+			     struct mlx5_cqe64 *cqe, u32 cqe_bcnt)
 {
 	struct mlx5e_rq_frag_info *frag_info = &rq->wqe.info.arr[0];
 	struct mlx5e_wqe_frag_info *head_wi = wi;
@@ -1654,7 +1654,7 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
 	net_prefetchw(va); /* xdp_frame data area */
 	net_prefetch(va + rx_headroom);
 
-	mlx5e_fill_mxbuf(rq, va, rx_headroom, frag_consumed_bytes, &mxbuf);
+	mlx5e_fill_mxbuf(rq, cqe, va, rx_headroom, frag_consumed_bytes, &mxbuf);
 	sinfo = xdp_get_shared_info_from_buff(&mxbuf.xdp);
 	truesize = 0;
 
@@ -1777,7 +1777,7 @@ static void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
 			      mlx5e_skb_from_cqe_linear,
 			      mlx5e_skb_from_cqe_nonlinear,
 			      mlx5e_xsk_skb_from_cqe_linear,
-			      rq, wi, cqe_bcnt);
+			      rq, wi, cqe, cqe_bcnt);
 	if (!skb) {
 		/* probably for XDP */
 		if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)) {
@@ -1830,7 +1830,7 @@ static void mlx5e_handle_rx_cqe_rep(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
 	skb = INDIRECT_CALL_2(rq->wqe.skb_from_cqe,
 			      mlx5e_skb_from_cqe_linear,
 			      mlx5e_skb_from_cqe_nonlinear,
-			      rq, wi, cqe_bcnt);
+			      rq, wi, cqe, cqe_bcnt);
 	if (!skb) {
 		/* probably for XDP */
 		if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)) {
@@ -1889,7 +1889,7 @@ static void mlx5e_handle_rx_cqe_mpwrq_rep(struct mlx5e_rq *rq, struct mlx5_cqe64
 	skb = INDIRECT_CALL_2(rq->mpwqe.skb_from_cqe_mpwrq,
 			      mlx5e_skb_from_cqe_mpwrq_linear,
 			      mlx5e_skb_from_cqe_mpwrq_nonlinear,
-			      rq, wi, cqe_bcnt, head_offset, page_idx);
+			      rq, wi, cqe, cqe_bcnt, head_offset, page_idx);
 	if (!skb)
 		goto mpwrq_cqe_out;
 
@@ -1940,7 +1940,8 @@ mlx5e_fill_skb_data(struct sk_buff *skb, struct mlx5e_rq *rq,
 
 static struct sk_buff *
 mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
-				   u16 cqe_bcnt, u32 head_offset, u32 page_idx)
+				   struct mlx5_cqe64 *cqe, u16 cqe_bcnt, u32 head_offset,
+				   u32 page_idx)
 {
 	union mlx5e_alloc_unit *au = &wi->alloc_units[page_idx];
 	u16 headlen = min_t(u16, MLX5E_RX_MAX_HEAD, cqe_bcnt);
@@ -1979,7 +1980,8 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
 
 static struct sk_buff *
 mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
-				u16 cqe_bcnt, u32 head_offset, u32 page_idx)
+				struct mlx5_cqe64 *cqe, u16 cqe_bcnt, u32 head_offset,
+				u32 page_idx)
 {
 	union mlx5e_alloc_unit *au = &wi->alloc_units[page_idx];
 	u16 rx_headroom = rq->buff.headroom;
@@ -2010,7 +2012,7 @@ mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
 		struct mlx5e_xdp_buff mxbuf;
 
 		net_prefetchw(va); /* xdp_frame data area */
-		mlx5e_fill_mxbuf(rq, va, rx_headroom, cqe_bcnt, &mxbuf);
+		mlx5e_fill_mxbuf(rq, cqe, va, rx_headroom, cqe_bcnt, &mxbuf);
 		if (mlx5e_xdp_handle(rq, au->page, prog, &mxbuf)) {
 			if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags))
 				__set_bit(page_idx, wi->xdp_xmit_bitmap); /* non-atomic */
@@ -2174,8 +2176,8 @@ static void mlx5e_handle_rx_cqe_mpwrq_shampo(struct mlx5e_rq *rq, struct mlx5_cq
 		if (likely(head_size))
 			*skb = mlx5e_skb_from_cqe_shampo(rq, wi, cqe, header_index);
 		else
-			*skb = mlx5e_skb_from_cqe_mpwrq_nonlinear(rq, wi, cqe_bcnt, data_offset,
-								  page_idx);
+			*skb = mlx5e_skb_from_cqe_mpwrq_nonlinear(rq, wi, cqe, cqe_bcnt,
+								  data_offset, page_idx);
 		if (unlikely(!*skb))
 			goto free_hd_entry;
 
@@ -2249,7 +2251,8 @@ static void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cq
 			      mlx5e_skb_from_cqe_mpwrq_linear,
 			      mlx5e_skb_from_cqe_mpwrq_nonlinear,
 			      mlx5e_xsk_skb_from_cqe_mpwrq_linear,
-			      rq, wi, cqe_bcnt, head_offset, page_idx);
+			      rq, wi, cqe, cqe_bcnt, head_offset,
+			      page_idx);
 	if (!skb)
 		goto mpwrq_cqe_out;
 
@@ -2494,7 +2497,7 @@ static void mlx5i_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
 	skb = INDIRECT_CALL_2(rq->wqe.skb_from_cqe,
 			      mlx5e_skb_from_cqe_linear,
 			      mlx5e_skb_from_cqe_nonlinear,
-			      rq, wi, cqe_bcnt);
+			      rq, wi, cqe, cqe_bcnt);
 	if (!skb)
 		goto wq_free_wqe;
 
@@ -2586,7 +2589,7 @@ static void mlx5e_trap_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe
 		goto free_wqe;
 	}
 
-	skb = mlx5e_skb_from_cqe_nonlinear(rq, wi, cqe_bcnt);
+	skb = mlx5e_skb_from_cqe_nonlinear(rq, wi, cqe, cqe_bcnt);
 	if (!skb)
 		goto free_wqe;
 
-- 
2.39.0.314.g84b9a713c41-goog


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [xdp-hints] [PATCH bpf-next v7 17/17] selftests/bpf: Simple program to dump XDP RX metadata
  2023-01-12  0:32 [xdp-hints] [PATCH bpf-next v7 00/17] xdp: hints via kfuncs Stanislav Fomichev
                   ` (14 preceding siblings ...)
  2023-01-12  0:32 ` [xdp-hints] [PATCH bpf-next v7 16/17] net/mlx5e: Support RX XDP metadata Stanislav Fomichev
@ 2023-01-12  0:32 ` Stanislav Fomichev
  2023-01-12  7:29 ` [xdp-hints] Re: [PATCH bpf-next v7 00/17] xdp: hints via kfuncs Martin KaFai Lau
  16 siblings, 0 replies; 42+ messages in thread
From: Stanislav Fomichev @ 2023-01-12  0:32 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

To be used for verifying driver implementations. Note that the skb
path is gone from the series, but I'm keeping its implementation
around for possible future work.

$ xdp_hw_metadata <ifname>

On the other machine:

$ echo -n xdp | nc -u -q1 <target> 9091 # for AF_XDP
$ echo -n skb | nc -u -q1 <target> 9092 # for skb

Sample output:

  # xdp
  xsk_ring_cons__peek: 1
  0x19f9090: rx_desc[0]->addr=100000000008000 addr=8100 comp_addr=8000
  rx_timestamp_supported: 1
  rx_timestamp: 1667850075063948829
  0x19f9090: complete idx=8 addr=8000

  # skb
  found skb hwtstamp = 1668314052.854274681

Decoding:
  # xdp
  rx_timestamp=1667850075.063948829

  $ date -d @1667850075
  Mon Nov  7 11:41:15 AM PST 2022
  $ date
  Mon Nov  7 11:42:05 AM PST 2022

  # skb
  $ date -d @1668314052
  Sat Nov 12 08:34:12 PM PST 2022
  $ date
  Sat Nov 12 08:37:06 PM PST 2022
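
For reference, the decoding above simply splits the raw nanosecond counter
into seconds and nanoseconds; a standalone sketch of the conversion (not
part of the selftest):

	#include <inttypes.h>
	#include <stdio.h>

	int main(void)
	{
		uint64_t rx_timestamp = 1667850075063948829ULL; /* from the sample */

		/* prints rx_timestamp=1667850075.063948829 */
		printf("rx_timestamp=%" PRIu64 ".%09" PRIu64 "\n",
		       rx_timestamp / 1000000000ULL, rx_timestamp % 1000000000ULL);
		return 0;
	}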

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 tools/testing/selftests/bpf/.gitignore        |   1 +
 tools/testing/selftests/bpf/Makefile          |   7 +-
 .../selftests/bpf/progs/xdp_hw_metadata.c     |  81 ++++
 tools/testing/selftests/bpf/xdp_hw_metadata.c | 405 ++++++++++++++++++
 4 files changed, 493 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/progs/xdp_hw_metadata.c
 create mode 100644 tools/testing/selftests/bpf/xdp_hw_metadata.c

diff --git a/tools/testing/selftests/bpf/.gitignore b/tools/testing/selftests/bpf/.gitignore
index 401a75844cc0..4aa5bba956ff 100644
--- a/tools/testing/selftests/bpf/.gitignore
+++ b/tools/testing/selftests/bpf/.gitignore
@@ -47,3 +47,4 @@ test_cpp
 xskxceiver
 xdp_redirect_multi
 xdp_synproxy
+xdp_hw_metadata
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 5356f317bc62..47de173b0a93 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -83,7 +83,7 @@ TEST_PROGS_EXTENDED := with_addr.sh \
 TEST_GEN_PROGS_EXTENDED = test_sock_addr test_skb_cgroup_id_user \
 	flow_dissector_load test_flow_dissector test_tcp_check_syncookie_user \
 	test_lirc_mode2_user xdping test_cpp runqslower bench bpf_testmod.ko \
-	xskxceiver xdp_redirect_multi xdp_synproxy veristat
+	xskxceiver xdp_redirect_multi xdp_synproxy veristat xdp_hw_metadata
 
 TEST_CUSTOM_PROGS = $(OUTPUT)/urandom_read $(OUTPUT)/sign-file
 TEST_GEN_FILES += liburandom_read.so
@@ -383,6 +383,7 @@ linked_maps.skel.h-deps := linked_maps1.bpf.o linked_maps2.bpf.o
 test_subskeleton.skel.h-deps := test_subskeleton_lib2.bpf.o test_subskeleton_lib.bpf.o test_subskeleton.bpf.o
 test_subskeleton_lib.skel.h-deps := test_subskeleton_lib2.bpf.o test_subskeleton_lib.bpf.o
 test_usdt.skel.h-deps := test_usdt.bpf.o test_usdt_multispec.bpf.o
+xdp_hw_metadata.skel.h-deps := xdp_hw_metadata.bpf.o
 
 LINKED_BPF_SRCS := $(patsubst %.bpf.o,%.c,$(foreach skel,$(LINKED_SKELS),$($(skel)-deps)))
 
@@ -576,6 +577,10 @@ $(OUTPUT)/test_verifier: test_verifier.c verifier/tests.h $(BPFOBJ) | $(OUTPUT)
 	$(call msg,BINARY,,$@)
 	$(Q)$(CC) $(CFLAGS) $(filter %.a %.o %.c,$^) $(LDLIBS) -o $@
 
+$(OUTPUT)/xdp_hw_metadata: xdp_hw_metadata.c $(OUTPUT)/network_helpers.o $(OUTPUT)/xsk.o $(OUTPUT)/xdp_hw_metadata.skel.h | $(OUTPUT)
+	$(call msg,BINARY,,$@)
+	$(Q)$(CC) $(CFLAGS) -static $(filter %.a %.o %.c,$^) $(LDLIBS) -o $@
+
 # Make sure we are able to include and link libbpf against c++.
 $(OUTPUT)/test_cpp: test_cpp.cpp $(OUTPUT)/test_core_extern.skel.h $(BPFOBJ)
 	$(call msg,CXX,,$@)
diff --git a/tools/testing/selftests/bpf/progs/xdp_hw_metadata.c b/tools/testing/selftests/bpf/progs/xdp_hw_metadata.c
new file mode 100644
index 000000000000..25b8178735ee
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/xdp_hw_metadata.c
@@ -0,0 +1,81 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <vmlinux.h>
+#include "xdp_metadata.h"
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+
+struct {
+	__uint(type, BPF_MAP_TYPE_XSKMAP);
+	__uint(max_entries, 256);
+	__type(key, __u32);
+	__type(value, __u32);
+} xsk SEC(".maps");
+
+extern int bpf_xdp_metadata_rx_timestamp(const struct xdp_md *ctx,
+					 __u64 *timestamp) __ksym;
+extern int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx,
+				    __u32 *hash) __ksym;
+
+SEC("xdp")
+int rx(struct xdp_md *ctx)
+{
+	void *data, *data_meta, *data_end;
+	struct ipv6hdr *ip6h = NULL;
+	struct ethhdr *eth = NULL;
+	struct udphdr *udp = NULL;
+	struct iphdr *iph = NULL;
+	struct xdp_meta *meta;
+	int ret;
+
+	data = (void *)(long)ctx->data;
+	data_end = (void *)(long)ctx->data_end;
+	eth = data;
+	if (eth + 1 < data_end) {
+		if (eth->h_proto == bpf_htons(ETH_P_IP)) {
+			iph = (void *)(eth + 1);
+			if (iph + 1 < data_end && iph->protocol == IPPROTO_UDP)
+				udp = (void *)(iph + 1);
+		}
+		if (eth->h_proto == bpf_htons(ETH_P_IPV6)) {
+			ip6h = (void *)(eth + 1);
+			if (ip6h + 1 < data_end && ip6h->nexthdr == IPPROTO_UDP)
+				udp = (void *)(ip6h + 1);
+		}
+		if (udp && udp + 1 > data_end)
+			udp = NULL;
+	}
+
+	if (!udp)
+		return XDP_PASS;
+
+	if (udp->dest != bpf_htons(9091))
+		return XDP_PASS;
+
+	bpf_printk("forwarding UDP:9091 to AF_XDP");
+
+	ret = bpf_xdp_adjust_meta(ctx, -(int)sizeof(struct xdp_meta));
+	if (ret != 0) {
+		bpf_printk("bpf_xdp_adjust_meta returned %d", ret);
+		return XDP_PASS;
+	}
+
+	data = (void *)(long)ctx->data;
+	data_meta = (void *)(long)ctx->data_meta;
+	meta = data_meta;
+
+	if (meta + 1 > data) {
+		bpf_printk("bpf_xdp_adjust_meta doesn't appear to work");
+		return XDP_PASS;
+	}
+
+	if (!bpf_xdp_metadata_rx_timestamp(ctx, &meta->rx_timestamp))
+		bpf_printk("populated rx_timestamp with %u", meta->rx_timestamp);
+
+	if (!bpf_xdp_metadata_rx_hash(ctx, &meta->rx_hash))
+		bpf_printk("populated rx_hash with %u", meta->rx_hash);
+
+	return bpf_redirect_map(&xsk, ctx->rx_queue_index, XDP_PASS);
+}
+
+char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/xdp_hw_metadata.c b/tools/testing/selftests/bpf/xdp_hw_metadata.c
new file mode 100644
index 000000000000..b8b7600dc275
--- /dev/null
+++ b/tools/testing/selftests/bpf/xdp_hw_metadata.c
@@ -0,0 +1,405 @@
+// SPDX-License-Identifier: GPL-2.0
+
+/* Reference program for verifying XDP metadata on real HW. Functional test
+ * only, doesn't test the performance.
+ *
+ * RX:
+ * - UDP 9091 packets are diverted into AF_XDP
+ * - Metadata verified:
+ *   - rx_timestamp
+ *   - rx_hash
+ *
+ * TX:
+ * - TBD
+ */
+
+#include <test_progs.h>
+#include <network_helpers.h>
+#include "xdp_hw_metadata.skel.h"
+#include "xsk.h"
+
+#include <error.h>
+#include <linux/errqueue.h>
+#include <linux/if_link.h>
+#include <linux/net_tstamp.h>
+#include <linux/udp.h>
+#include <linux/sockios.h>
+#include <sys/mman.h>
+#include <net/if.h>
+#include <poll.h>
+
+#include "xdp_metadata.h"
+
+#define UMEM_NUM 16
+#define UMEM_FRAME_SIZE XSK_UMEM__DEFAULT_FRAME_SIZE
+#define UMEM_SIZE (UMEM_FRAME_SIZE * UMEM_NUM)
+#define XDP_FLAGS (XDP_FLAGS_DRV_MODE | XDP_FLAGS_REPLACE)
+
+struct xsk {
+	void *umem_area;
+	struct xsk_umem *umem;
+	struct xsk_ring_prod fill;
+	struct xsk_ring_cons comp;
+	struct xsk_ring_prod tx;
+	struct xsk_ring_cons rx;
+	struct xsk_socket *socket;
+};
+
+struct xdp_hw_metadata *bpf_obj;
+struct xsk *rx_xsk;
+const char *ifname;
+int ifindex;
+int rxq;
+
+void test__fail(void) { /* for network_helpers.c */ }
+
+static int open_xsk(const char *ifname, struct xsk *xsk, __u32 queue_id)
+{
+	int mmap_flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE;
+	const struct xsk_socket_config socket_config = {
+		.rx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
+		.tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
+		.libbpf_flags = XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD,
+		.xdp_flags = XDP_FLAGS,
+		.bind_flags = XDP_COPY,
+	};
+	const struct xsk_umem_config umem_config = {
+		.fill_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
+		.comp_size = XSK_RING_CONS__DEFAULT_NUM_DESCS,
+		.frame_size = XSK_UMEM__DEFAULT_FRAME_SIZE,
+		.flags = XDP_UMEM_UNALIGNED_CHUNK_FLAG,
+	};
+	__u32 idx;
+	u64 addr;
+	int ret;
+	int i;
+
+	xsk->umem_area = mmap(NULL, UMEM_SIZE, PROT_READ | PROT_WRITE, mmap_flags, -1, 0);
+	if (xsk->umem_area == MAP_FAILED)
+		return -ENOMEM;
+
+	ret = xsk_umem__create(&xsk->umem,
+			       xsk->umem_area, UMEM_SIZE,
+			       &xsk->fill,
+			       &xsk->comp,
+			       &umem_config);
+	if (ret)
+		return ret;
+
+	ret = xsk_socket__create(&xsk->socket, ifname, queue_id,
+				 xsk->umem,
+				 &xsk->rx,
+				 &xsk->tx,
+				 &socket_config);
+	if (ret)
+		return ret;
+
+	/* First half of umem is for TX. This way address matches 1-to-1
+	 * to the completion queue index.
+	 */
+
+	for (i = 0; i < UMEM_NUM / 2; i++) {
+		addr = i * UMEM_FRAME_SIZE;
+		printf("%p: tx_desc[%d] -> %lx\n", xsk, i, addr);
+	}
+
+	/* Second half of umem is for RX. */
+
+	ret = xsk_ring_prod__reserve(&xsk->fill, UMEM_NUM / 2, &idx);
+	for (i = 0; i < UMEM_NUM / 2; i++) {
+		addr = (UMEM_NUM / 2 + i) * UMEM_FRAME_SIZE;
+		printf("%p: rx_desc[%d] -> %lx\n", xsk, i, addr);
+		*xsk_ring_prod__fill_addr(&xsk->fill, idx + i) = addr;
+	}
+	xsk_ring_prod__submit(&xsk->fill, ret);
+
+	return 0;
+}
+
+static void close_xsk(struct xsk *xsk)
+{
+	if (xsk->umem)
+		xsk_umem__delete(xsk->umem);
+	if (xsk->socket)
+		xsk_socket__delete(xsk->socket);
+	munmap(xsk->umem_area, UMEM_SIZE);
+}
+
+static void refill_rx(struct xsk *xsk, __u64 addr)
+{
+	__u32 idx;
+
+	if (xsk_ring_prod__reserve(&xsk->fill, 1, &idx) == 1) {
+		printf("%p: complete idx=%u addr=%llx\n", xsk, idx, addr);
+		*xsk_ring_prod__fill_addr(&xsk->fill, idx) = addr;
+		xsk_ring_prod__submit(&xsk->fill, 1);
+	}
+}
+
+static void verify_xdp_metadata(void *data)
+{
+	struct xdp_meta *meta;
+
+	meta = data - sizeof(*meta);
+
+	printf("rx_timestamp: %llu\n", meta->rx_timestamp);
+	printf("rx_hash: %u\n", meta->rx_hash);
+}
+
+static void verify_skb_metadata(int fd)
+{
+	char cmsg_buf[1024];
+	char packet_buf[128];
+
+	struct scm_timestamping *ts;
+	struct iovec packet_iov;
+	struct cmsghdr *cmsg;
+	struct msghdr hdr;
+
+	memset(&hdr, 0, sizeof(hdr));
+	hdr.msg_iov = &packet_iov;
+	hdr.msg_iovlen = 1;
+	packet_iov.iov_base = packet_buf;
+	packet_iov.iov_len = sizeof(packet_buf);
+
+	hdr.msg_control = cmsg_buf;
+	hdr.msg_controllen = sizeof(cmsg_buf);
+
+	if (recvmsg(fd, &hdr, 0) < 0)
+		error(-1, errno, "recvmsg");
+
+	for (cmsg = CMSG_FIRSTHDR(&hdr); cmsg != NULL;
+	     cmsg = CMSG_NXTHDR(&hdr, cmsg)) {
+
+		if (cmsg->cmsg_level != SOL_SOCKET)
+			continue;
+
+		switch (cmsg->cmsg_type) {
+		case SCM_TIMESTAMPING:
+			ts = (struct scm_timestamping *)CMSG_DATA(cmsg);
+			if (ts->ts[2].tv_sec || ts->ts[2].tv_nsec) {
+				printf("found skb hwtstamp = %lu.%lu\n",
+				       ts->ts[2].tv_sec, ts->ts[2].tv_nsec);
+				return;
+			}
+			break;
+		default:
+			break;
+		}
+	}
+
+	printf("skb hwtstamp is not found!\n");
+}
+
+static int verify_metadata(struct xsk *rx_xsk, int rxq, int server_fd)
+{
+	const struct xdp_desc *rx_desc;
+	struct pollfd fds[rxq + 1];
+	__u64 comp_addr;
+	__u64 addr;
+	__u32 idx;
+	int ret;
+	int i;
+
+	for (i = 0; i < rxq; i++) {
+		fds[i].fd = xsk_socket__fd(rx_xsk[i].socket);
+		fds[i].events = POLLIN;
+		fds[i].revents = 0;
+	}
+
+	fds[rxq].fd = server_fd;
+	fds[rxq].events = POLLIN;
+	fds[rxq].revents = 0;
+
+	while (true) {
+		errno = 0;
+		ret = poll(fds, rxq + 1, 1000);
+		printf("poll: %d (%d)\n", ret, errno);
+		if (ret < 0)
+			break;
+		if (ret == 0)
+			continue;
+
+		if (fds[rxq].revents)
+			verify_skb_metadata(server_fd);
+
+		for (i = 0; i < rxq; i++) {
+			if (fds[i].revents == 0)
+				continue;
+
+			struct xsk *xsk = &rx_xsk[i];
+
+			ret = xsk_ring_cons__peek(&xsk->rx, 1, &idx);
+			printf("xsk_ring_cons__peek: %d\n", ret);
+			if (ret != 1)
+				continue;
+
+			rx_desc = xsk_ring_cons__rx_desc(&xsk->rx, idx);
+			comp_addr = xsk_umem__extract_addr(rx_desc->addr);
+			addr = xsk_umem__add_offset_to_addr(rx_desc->addr);
+			printf("%p: rx_desc[%u]->addr=%llx addr=%llx comp_addr=%llx\n",
+			       xsk, idx, rx_desc->addr, addr, comp_addr);
+			verify_xdp_metadata(xsk_umem__get_data(xsk->umem_area, addr));
+			xsk_ring_cons__release(&xsk->rx, 1);
+			refill_rx(xsk, comp_addr);
+		}
+	}
+
+	return 0;
+}
+
+struct ethtool_channels {
+	__u32	cmd;
+	__u32	max_rx;
+	__u32	max_tx;
+	__u32	max_other;
+	__u32	max_combined;
+	__u32	rx_count;
+	__u32	tx_count;
+	__u32	other_count;
+	__u32	combined_count;
+};
+
+#define ETHTOOL_GCHANNELS	0x0000003c /* Get no of channels */
+
+static int rxq_num(const char *ifname)
+{
+	struct ethtool_channels ch = {
+		.cmd = ETHTOOL_GCHANNELS,
+	};
+
+	struct ifreq ifr = {
+		.ifr_data = (void *)&ch,
+	};
+	int fd, ret;
+	strcpy(ifr.ifr_name, ifname);
+
+	fd = socket(AF_UNIX, SOCK_DGRAM, 0);
+	if (fd < 0)
+		error(-1, errno, "socket");
+
+	ret = ioctl(fd, SIOCETHTOOL, &ifr);
+	if (ret < 0)
+		error(-1, errno, "socket");
+
+	close(fd);
+
+	return ch.rx_count + ch.combined_count;
+}
+
+static void cleanup(void)
+{
+	LIBBPF_OPTS(bpf_xdp_attach_opts, opts);
+	int ret;
+	int i;
+
+	if (bpf_obj) {
+		opts.old_prog_fd = bpf_program__fd(bpf_obj->progs.rx);
+		if (opts.old_prog_fd >= 0) {
+			printf("detaching bpf program....\n");
+			ret = bpf_xdp_detach(ifindex, XDP_FLAGS, &opts);
+			if (ret)
+				printf("failed to detach XDP program: %d\n", ret);
+		}
+	}
+
+	for (i = 0; i < rxq; i++)
+		close_xsk(&rx_xsk[i]);
+
+	if (bpf_obj)
+		xdp_hw_metadata__destroy(bpf_obj);
+}
+
+static void handle_signal(int sig)
+{
+	/* interrupting poll() is all we need */
+}
+
+static void timestamping_enable(int fd, int val)
+{
+	int ret;
+
+	ret = setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val));
+	if (ret < 0)
+		error(-1, errno, "setsockopt(SO_TIMESTAMPING)");
+}
+
+int main(int argc, char *argv[])
+{
+	int server_fd = -1;
+	int ret;
+	int i;
+
+	struct bpf_program *prog;
+
+	if (argc != 2) {
+		fprintf(stderr, "pass device name\n");
+		return -1;
+	}
+
+	ifname = argv[1];
+	ifindex = if_nametoindex(ifname);
+	rxq = rxq_num(ifname);
+
+	printf("rxq: %d\n", rxq);
+
+	rx_xsk = malloc(sizeof(struct xsk) * rxq);
+	if (!rx_xsk)
+		error(-1, ENOMEM, "malloc");
+
+	for (i = 0; i < rxq; i++) {
+		printf("open_xsk(%s, %p, %d)\n", ifname, &rx_xsk[i], i);
+		ret = open_xsk(ifname, &rx_xsk[i], i);
+		if (ret)
+			error(-1, -ret, "open_xsk");
+
+		printf("xsk_socket__fd() -> %d\n", xsk_socket__fd(rx_xsk[i].socket));
+	}
+
+	printf("open bpf program...\n");
+	bpf_obj = xdp_hw_metadata__open();
+	if (libbpf_get_error(bpf_obj))
+		error(-1, libbpf_get_error(bpf_obj), "xdp_hw_metadata__open");
+
+	prog = bpf_object__find_program_by_name(bpf_obj->obj, "rx");
+	bpf_program__set_ifindex(prog, ifindex);
+	bpf_program__set_flags(prog, BPF_F_XDP_DEV_BOUND_ONLY);
+
+	printf("load bpf program...\n");
+	ret = xdp_hw_metadata__load(bpf_obj);
+	if (ret)
+		error(-1, -ret, "xdp_hw_metadata__load");
+
+	printf("prepare skb endpoint...\n");
+	server_fd = start_server(AF_INET6, SOCK_DGRAM, NULL, 9092, 1000);
+	if (server_fd < 0)
+		error(-1, errno, "start_server");
+	timestamping_enable(server_fd,
+			    SOF_TIMESTAMPING_SOFTWARE |
+			    SOF_TIMESTAMPING_RAW_HARDWARE);
+
+	printf("prepare xsk map...\n");
+	for (i = 0; i < rxq; i++) {
+		int sock_fd = xsk_socket__fd(rx_xsk[i].socket);
+		__u32 queue_id = i;
+
+		printf("map[%d] = %d\n", queue_id, sock_fd);
+		ret = bpf_map_update_elem(bpf_map__fd(bpf_obj->maps.xsk), &queue_id, &sock_fd, 0);
+		if (ret)
+			error(-1, -ret, "bpf_map_update_elem");
+	}
+
+	printf("attach bpf program...\n");
+	ret = bpf_xdp_attach(ifindex,
+			     bpf_program__fd(bpf_obj->progs.rx),
+			     XDP_FLAGS, NULL);
+	if (ret)
+		error(-1, -ret, "bpf_xdp_attach");
+
+	signal(SIGINT, handle_signal);
+	ret = verify_metadata(rx_xsk, rxq, server_fd);
+	close(server_fd);
+	cleanup();
+	if (ret)
+		error(-1, -ret, "verify_metadata");
+}
-- 
2.39.0.314.g84b9a713c41-goog


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [xdp-hints] Re: [PATCH bpf-next v7 00/17] xdp: hints via kfuncs
  2023-01-12  0:32 [xdp-hints] [PATCH bpf-next v7 00/17] xdp: hints via kfuncs Stanislav Fomichev
                   ` (15 preceding siblings ...)
  2023-01-12  0:32 ` [xdp-hints] [PATCH bpf-next v7 17/17] selftests/bpf: Simple program to dump XDP RX metadata Stanislav Fomichev
@ 2023-01-12  7:29 ` Martin KaFai Lau
  2023-01-12  8:19   ` Tariq Toukan
  16 siblings, 1 reply; 42+ messages in thread
From: Martin KaFai Lau @ 2023-01-12  7:29 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: ast, daniel, andrii, song, yhs, john.fastabend, kpsingh, haoluo,
	jolsa, David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, Tariq Toukan, Saeed Mahameed,
	bpf, xdp-hints, netdev

On 1/11/23 4:32 PM, Stanislav Fomichev wrote:
> Please see the first patch in the series for the overall
> design and use-cases.
> 
> See the following email from Toke for the per-packet metadata overhead:
> https://lore.kernel.org/bpf/20221206024554.3826186-1-sdf@google.com/T/#m49d48ea08d525ec88360c7d14c4d34fb0e45e798
> 
> Recent changes:
> 
> - Bring back parts that were removed during patch reshuffling from "bpf:
>    Introduce device-bound XDP programs" patch (Martin)
> 
> - Remove netdev NULL check from __bpf_prog_dev_bound_init (Martin)
> 
> - Remove netdev NULL check from bpf_dev_bound_resolve_kfunc (Martin)
> 
> - Move target bound device verification from bpf_tracing_prog_attach into
>    bpf_check_attach_target (Martin)
> 
> - Move mlx5e_free_rx_in_progress_descs into txrx.h (Tariq)
> 
> - mlx5e_fill_xdp_buff -> mlx5e_fill_mxbuf (Tariq)

Thanks for the patches. The set lgtm.

The selftest patches 11 and 17 have conflicts with the recent changes in 
selftests/bpf/xsk.{h,c} and selftests/bpf/Makefile, e.g. they no longer need 
XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD, so please respin. From a quick look, 
the changes should be minor.

Not sure if Tariq will have a chance to look at the mlx5 changes shortly. The 
set is getting pretty long, and the core part is ready with veth and mlx4 
support. I think it is better to land the ready parts first so that other 
drivers can also start adding support. One option is to post the two mlx5 
patches as a separate patchset so they can be reviewed on their own.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [xdp-hints] Re: [PATCH bpf-next v7 15/17] net/mlx5e: Introduce wrapper for xdp_buff
  2023-01-12  0:32 ` [xdp-hints] [PATCH bpf-next v7 15/17] net/mlx5e: Introduce wrapper for xdp_buff Stanislav Fomichev
@ 2023-01-12  8:07   ` Tariq Toukan
  2023-01-12 19:10     ` Stanislav Fomichev
  0 siblings, 1 reply; 42+ messages in thread
From: Tariq Toukan @ 2023-01-12  8:07 UTC (permalink / raw)
  To: Stanislav Fomichev, bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, Toke Høiland-Jørgensen,
	Tariq Toukan, Saeed Mahameed, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev



On 12/01/2023 2:32, Stanislav Fomichev wrote:
> From: Toke Høiland-Jørgensen <toke@redhat.com>
> 
> Preparation for implementing HW metadata kfuncs. No functional change.
> 
> Cc: Tariq Toukan <tariqt@nvidia.com>
> Cc: Saeed Mahameed <saeedm@nvidia.com>
> Cc: John Fastabend <john.fastabend@gmail.com>
> Cc: David Ahern <dsahern@gmail.com>
> Cc: Martin KaFai Lau <martin.lau@linux.dev>
> Cc: Jakub Kicinski <kuba@kernel.org>
> Cc: Willem de Bruijn <willemb@google.com>
> Cc: Jesper Dangaard Brouer <brouer@redhat.com>
> Cc: Anatoly Burakov <anatoly.burakov@intel.com>
> Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
> Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
> Cc: Maryam Tahhan <mtahhan@redhat.com>
> Cc: xdp-hints@xdp-project.net
> Cc: netdev@vger.kernel.org
> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
> Signed-off-by: Stanislav Fomichev <sdf@google.com>
> ---
>   drivers/net/ethernet/mellanox/mlx5/core/en.h  |  1 +
>   .../net/ethernet/mellanox/mlx5/core/en/xdp.c  |  3 +-
>   .../net/ethernet/mellanox/mlx5/core/en/xdp.h  |  6 +-
>   .../ethernet/mellanox/mlx5/core/en/xsk/rx.c   | 25 ++++----
>   .../net/ethernet/mellanox/mlx5/core/en_rx.c   | 58 +++++++++----------
>   5 files changed, 50 insertions(+), 43 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> index 2d77fb8a8a01..af663978d1b4 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> @@ -469,6 +469,7 @@ struct mlx5e_txqsq {
>   union mlx5e_alloc_unit {
>   	struct page *page;
>   	struct xdp_buff *xsk;
> +	struct mlx5e_xdp_buff *mxbuf;

In the XSK files below you mix usage of alloc_units[page_idx].mxbuf and 
alloc_units[page_idx].xsk, even though both fields share the memory of a union.

As struct mlx5e_xdp_buff wraps struct xdp_buff, I think you just need to 
change the existing xsk field's type from struct xdp_buff * to 
struct mlx5e_xdp_buff * and align the usage.
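
Something like this (untested sketch of the suggestion):

	union mlx5e_alloc_unit {
		struct page *page;
		struct mlx5e_xdp_buff *xsk;	/* was: struct xdp_buff *xsk */
	};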

>   };
>   
>   /* XDP packets can be transmitted in different ways. On completion, we need to
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
> index 20507ef2f956..31bb6806bf5d 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
> @@ -158,8 +158,9 @@ mlx5e_xmit_xdp_buff(struct mlx5e_xdpsq *sq, struct mlx5e_rq *rq,
>   
>   /* returns true if packet was consumed by xdp */
>   bool mlx5e_xdp_handle(struct mlx5e_rq *rq, struct page *page,
> -		      struct bpf_prog *prog, struct xdp_buff *xdp)
> +		      struct bpf_prog *prog, struct mlx5e_xdp_buff *mxbuf)
>   {
> +	struct xdp_buff *xdp = &mxbuf->xdp;
>   	u32 act;
>   	int err;
>   
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
> index bc2d9034af5b..389818bf6833 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
> @@ -44,10 +44,14 @@
>   	(MLX5E_XDP_INLINE_WQE_MAX_DS_CNT * MLX5_SEND_WQE_DS - \
>   	 sizeof(struct mlx5_wqe_inline_seg))
>   
> +struct mlx5e_xdp_buff {
> +	struct xdp_buff xdp;
> +};
> +
>   struct mlx5e_xsk_param;
>   int mlx5e_xdp_max_mtu(struct mlx5e_params *params, struct mlx5e_xsk_param *xsk);
>   bool mlx5e_xdp_handle(struct mlx5e_rq *rq, struct page *page,
> -		      struct bpf_prog *prog, struct xdp_buff *xdp);
> +		      struct bpf_prog *prog, struct mlx5e_xdp_buff *mlctx);
>   void mlx5e_xdp_mpwqe_complete(struct mlx5e_xdpsq *sq);
>   bool mlx5e_poll_xdpsq_cq(struct mlx5e_cq *cq);
>   void mlx5e_free_xdpsq_descs(struct mlx5e_xdpsq *sq);
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
> index c91b54d9ff27..9cff82d764e3 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
> @@ -22,6 +22,7 @@ int mlx5e_xsk_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
>   		goto err;
>   
>   	BUILD_BUG_ON(sizeof(wi->alloc_units[0]) != sizeof(wi->alloc_units[0].xsk));
> +	XSK_CHECK_PRIV_TYPE(struct mlx5e_xdp_buff);
>   	batch = xsk_buff_alloc_batch(rq->xsk_pool, (struct xdp_buff **)wi->alloc_units,
>   				     rq->mpwqe.pages_per_wqe);
>   
> @@ -233,7 +234,7 @@ struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,
>   						    u32 head_offset,
>   						    u32 page_idx)
>   {
> -	struct xdp_buff *xdp = wi->alloc_units[page_idx].xsk;
> +	struct mlx5e_xdp_buff *mxbuf = wi->alloc_units[page_idx].mxbuf;
>   	struct bpf_prog *prog;
>   
>   	/* Check packet size. Note LRO doesn't use linear SKB */
> @@ -249,9 +250,9 @@ struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,
>   	 */
>   	WARN_ON_ONCE(head_offset);
>   
> -	xsk_buff_set_size(xdp, cqe_bcnt);
> -	xsk_buff_dma_sync_for_cpu(xdp, rq->xsk_pool);
> -	net_prefetch(xdp->data);
> +	xsk_buff_set_size(&mxbuf->xdp, cqe_bcnt);
> +	xsk_buff_dma_sync_for_cpu(&mxbuf->xdp, rq->xsk_pool);
> +	net_prefetch(mxbuf->xdp.data);
>   
>   	/* Possible flows:
>   	 * - XDP_REDIRECT to XSKMAP:
> @@ -269,7 +270,7 @@ struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,
>   	 */
>   
>   	prog = rcu_dereference(rq->xdp_prog);
> -	if (likely(prog && mlx5e_xdp_handle(rq, NULL, prog, xdp))) {
> +	if (likely(prog && mlx5e_xdp_handle(rq, NULL, prog, mxbuf))) {
>   		if (likely(__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)))
>   			__set_bit(page_idx, wi->xdp_xmit_bitmap); /* non-atomic */
>   		return NULL; /* page/packet was consumed by XDP */
> @@ -278,14 +279,14 @@ struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,
>   	/* XDP_PASS: copy the data from the UMEM to a new SKB and reuse the
>   	 * frame. On SKB allocation failure, NULL is returned.
>   	 */
> -	return mlx5e_xsk_construct_skb(rq, xdp);
> +	return mlx5e_xsk_construct_skb(rq, &mxbuf->xdp);
>   }
>   
>   struct sk_buff *mlx5e_xsk_skb_from_cqe_linear(struct mlx5e_rq *rq,
>   					      struct mlx5e_wqe_frag_info *wi,
>   					      u32 cqe_bcnt)
>   {
> -	struct xdp_buff *xdp = wi->au->xsk;
> +	struct mlx5e_xdp_buff *mxbuf = wi->au->mxbuf;
>   	struct bpf_prog *prog;
>   
>   	/* wi->offset is not used in this function, because xdp->data and the
> @@ -295,17 +296,17 @@ struct sk_buff *mlx5e_xsk_skb_from_cqe_linear(struct mlx5e_rq *rq,
>   	 */
>   	WARN_ON_ONCE(wi->offset);
>   
> -	xsk_buff_set_size(xdp, cqe_bcnt);
> -	xsk_buff_dma_sync_for_cpu(xdp, rq->xsk_pool);
> -	net_prefetch(xdp->data);
> +	xsk_buff_set_size(&mxbuf->xdp, cqe_bcnt);
> +	xsk_buff_dma_sync_for_cpu(&mxbuf->xdp, rq->xsk_pool);
> +	net_prefetch(mxbuf->xdp.data);
>   
>   	prog = rcu_dereference(rq->xdp_prog);
> -	if (likely(prog && mlx5e_xdp_handle(rq, NULL, prog, xdp)))
> +	if (likely(prog && mlx5e_xdp_handle(rq, NULL, prog, mxbuf)))
>   		return NULL; /* page/packet was consumed by XDP */
>   
>   	/* XDP_PASS: copy the data from the UMEM to a new SKB. The frame reuse
>   	 * will be handled by mlx5e_free_rx_wqe.
>   	 * On SKB allocation failure, NULL is returned.
>   	 */
> -	return mlx5e_xsk_construct_skb(rq, xdp);
> +	return mlx5e_xsk_construct_skb(rq, &mxbuf->xdp);
>   }
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> index c8820ab22169..6affdddf5bcf 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> @@ -1575,11 +1575,11 @@ struct sk_buff *mlx5e_build_linear_skb(struct mlx5e_rq *rq, void *va,
>   	return skb;
>   }
>   
> -static void mlx5e_fill_xdp_buff(struct mlx5e_rq *rq, void *va, u16 headroom,
> -				u32 len, struct xdp_buff *xdp)
> +static void mlx5e_fill_mxbuf(struct mlx5e_rq *rq, void *va, u16 headroom,
> +			     u32 len, struct mlx5e_xdp_buff *mxbuf)
>   {
> -	xdp_init_buff(xdp, rq->buff.frame0_sz, &rq->xdp_rxq);
> -	xdp_prepare_buff(xdp, va, headroom, len, true);
> +	xdp_init_buff(&mxbuf->xdp, rq->buff.frame0_sz, &rq->xdp_rxq);
> +	xdp_prepare_buff(&mxbuf->xdp, va, headroom, len, true);
>   }
>   
>   static struct sk_buff *
> @@ -1606,16 +1606,16 @@ mlx5e_skb_from_cqe_linear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi,
>   
>   	prog = rcu_dereference(rq->xdp_prog);
>   	if (prog) {
> -		struct xdp_buff xdp;
> +		struct mlx5e_xdp_buff mxbuf;
>   
>   		net_prefetchw(va); /* xdp_frame data area */
> -		mlx5e_fill_xdp_buff(rq, va, rx_headroom, cqe_bcnt, &xdp);
> -		if (mlx5e_xdp_handle(rq, au->page, prog, &xdp))
> +		mlx5e_fill_mxbuf(rq, cqe, va, rx_headroom, cqe_bcnt, &mxbuf);
> +		if (mlx5e_xdp_handle(rq, au->page, prog, &mxbuf))
>   			return NULL; /* page/packet was consumed by XDP */
>   
> -		rx_headroom = xdp.data - xdp.data_hard_start;
> -		metasize = xdp.data - xdp.data_meta;
> -		cqe_bcnt = xdp.data_end - xdp.data;
> +		rx_headroom = mxbuf.xdp.data - mxbuf.xdp.data_hard_start;
> +		metasize = mxbuf.xdp.data - mxbuf.xdp.data_meta;
> +		cqe_bcnt = mxbuf.xdp.data_end - mxbuf.xdp.data;
>   	}
>   	frag_size = MLX5_SKB_FRAG_SZ(rx_headroom + cqe_bcnt);
>   	skb = mlx5e_build_linear_skb(rq, va, frag_size, rx_headroom, cqe_bcnt, metasize);
> @@ -1637,9 +1637,9 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
>   	union mlx5e_alloc_unit *au = wi->au;
>   	u16 rx_headroom = rq->buff.headroom;
>   	struct skb_shared_info *sinfo;
> +	struct mlx5e_xdp_buff mxbuf;
>   	u32 frag_consumed_bytes;
>   	struct bpf_prog *prog;
> -	struct xdp_buff xdp;
>   	struct sk_buff *skb;
>   	dma_addr_t addr;
>   	u32 truesize;
> @@ -1654,8 +1654,8 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
>   	net_prefetchw(va); /* xdp_frame data area */
>   	net_prefetch(va + rx_headroom);
>   
> -	mlx5e_fill_xdp_buff(rq, va, rx_headroom, frag_consumed_bytes, &xdp);
> -	sinfo = xdp_get_shared_info_from_buff(&xdp);
> +	mlx5e_fill_mxbuf(rq, va, rx_headroom, frag_consumed_bytes, &mxbuf);
> +	sinfo = xdp_get_shared_info_from_buff(&mxbuf.xdp);
>   	truesize = 0;
>   
>   	cqe_bcnt -= frag_consumed_bytes;
> @@ -1673,13 +1673,13 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
>   		dma_sync_single_for_cpu(rq->pdev, addr + wi->offset,
>   					frag_consumed_bytes, rq->buff.map_dir);
>   
> -		if (!xdp_buff_has_frags(&xdp)) {
> +		if (!xdp_buff_has_frags(&mxbuf.xdp)) {
>   			/* Init on the first fragment to avoid cold cache access
>   			 * when possible.
>   			 */
>   			sinfo->nr_frags = 0;
>   			sinfo->xdp_frags_size = 0;
> -			xdp_buff_set_frags_flag(&xdp);
> +			xdp_buff_set_frags_flag(&mxbuf.xdp);
>   		}
>   
>   		frag = &sinfo->frags[sinfo->nr_frags++];
> @@ -1688,7 +1688,7 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
>   		skb_frag_size_set(frag, frag_consumed_bytes);
>   
>   		if (page_is_pfmemalloc(au->page))
> -			xdp_buff_set_frag_pfmemalloc(&xdp);
> +			xdp_buff_set_frag_pfmemalloc(&mxbuf.xdp);
>   
>   		sinfo->xdp_frags_size += frag_consumed_bytes;
>   		truesize += frag_info->frag_stride;
> @@ -1701,7 +1701,7 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
>   	au = head_wi->au;
>   
>   	prog = rcu_dereference(rq->xdp_prog);
> -	if (prog && mlx5e_xdp_handle(rq, au->page, prog, &xdp)) {
> +	if (prog && mlx5e_xdp_handle(rq, au->page, prog, &mxbuf)) {
>   		if (test_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)) {
>   			int i;
>   
> @@ -1711,22 +1711,22 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
>   		return NULL; /* page/packet was consumed by XDP */
>   	}
>   
> -	skb = mlx5e_build_linear_skb(rq, xdp.data_hard_start, rq->buff.frame0_sz,
> -				     xdp.data - xdp.data_hard_start,
> -				     xdp.data_end - xdp.data,
> -				     xdp.data - xdp.data_meta);
> +	skb = mlx5e_build_linear_skb(rq, mxbuf.xdp.data_hard_start, rq->buff.frame0_sz,
> +				     mxbuf.xdp.data - mxbuf.xdp.data_hard_start,
> +				     mxbuf.xdp.data_end - mxbuf.xdp.data,
> +				     mxbuf.xdp.data - mxbuf.xdp.data_meta);
>   	if (unlikely(!skb))
>   		return NULL;
>   
>   	page_ref_inc(au->page);
>   
> -	if (unlikely(xdp_buff_has_frags(&xdp))) {
> +	if (unlikely(xdp_buff_has_frags(&mxbuf.xdp))) {
>   		int i;
>   
>   		/* sinfo->nr_frags is reset by build_skb, calculate again. */
>   		xdp_update_skb_shared_info(skb, wi - head_wi - 1,
>   					   sinfo->xdp_frags_size, truesize,
> -					   xdp_buff_is_frag_pfmemalloc(&xdp));
> +					   xdp_buff_is_frag_pfmemalloc(&mxbuf.xdp));
>   
>   		for (i = 0; i < sinfo->nr_frags; i++) {
>   			skb_frag_t *frag = &sinfo->frags[i];
> @@ -2007,19 +2007,19 @@ mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
>   
>   	prog = rcu_dereference(rq->xdp_prog);
>   	if (prog) {
> -		struct xdp_buff xdp;
> +		struct mlx5e_xdp_buff mxbuf;
>   
>   		net_prefetchw(va); /* xdp_frame data area */
> -		mlx5e_fill_xdp_buff(rq, va, rx_headroom, cqe_bcnt, &xdp);
> -		if (mlx5e_xdp_handle(rq, au->page, prog, &xdp)) {
> +		mlx5e_fill_mxbuf(rq, va, rx_headroom, cqe_bcnt, &mxbuf);
> +		if (mlx5e_xdp_handle(rq, au->page, prog, &mxbuf)) {
>   			if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags))
>   				__set_bit(page_idx, wi->xdp_xmit_bitmap); /* non-atomic */
>   			return NULL; /* page/packet was consumed by XDP */
>   		}
>   
> -		rx_headroom = xdp.data - xdp.data_hard_start;
> -		metasize = xdp.data - xdp.data_meta;
> -		cqe_bcnt = xdp.data_end - xdp.data;
> +		rx_headroom = mxbuf.xdp.data - mxbuf.xdp.data_hard_start;
> +		metasize = mxbuf.xdp.data - mxbuf.xdp.data_meta;
> +		cqe_bcnt = mxbuf.xdp.data_end - mxbuf.xdp.data;
>   	}
>   	frag_size = MLX5_SKB_FRAG_SZ(rx_headroom + cqe_bcnt);
>   	skb = mlx5e_build_linear_skb(rq, va, frag_size, rx_headroom, cqe_bcnt, metasize);
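
One detail worth calling out for other drivers that want to copy this
pattern: struct xdp_buff stays the first member of the wrapper, so a
pointer to either struct can be converted to the other, and the
XSK_CHECK_PRIV_TYPE() added above build-asserts that the wrapper still
fits in the driver-private (cb) area of the XSK buffer. Roughly, as an
illustrative sketch (the to_mxbuf() helper is hypothetical, the driver
open-codes the casts):

	struct mlx5e_xdp_buff {
		struct xdp_buff xdp;	/* must remain the first member */
	};

	/* recover the driver wrapper from a struct xdp_buff pointer */
	static inline struct mlx5e_xdp_buff *to_mxbuf(struct xdp_buff *xdp)
	{
		return container_of(xdp, struct mlx5e_xdp_buff, xdp);
	}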

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [xdp-hints] Re: [PATCH bpf-next v7 16/17] net/mlx5e: Support RX XDP metadata
  2023-01-12  0:32 ` [xdp-hints] [PATCH bpf-next v7 16/17] net/mlx5e: Support RX XDP metadata Stanislav Fomichev
@ 2023-01-12  8:13   ` Tariq Toukan
  2023-01-12 19:09     ` Stanislav Fomichev
  0 siblings, 1 reply; 42+ messages in thread
From: Tariq Toukan @ 2023-01-12  8:13 UTC (permalink / raw)
  To: Stanislav Fomichev, bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, Toke Høiland-Jørgensen,
	Tariq Toukan, Saeed Mahameed, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev



On 12/01/2023 2:32, Stanislav Fomichev wrote:
> From: Toke Høiland-Jørgensen <toke@redhat.com>
> 
> Support RX hash and timestamp metadata kfuncs. We need to pass in the cqe
> pointer to the mlx5e_skb_from* functions so it can be retrieved from the
> XDP ctx to do this.
> 
> Cc: Tariq Toukan <tariqt@nvidia.com>
> Cc: Saeed Mahameed <saeedm@nvidia.com>
> Cc: John Fastabend <john.fastabend@gmail.com>
> Cc: David Ahern <dsahern@gmail.com>
> Cc: Martin KaFai Lau <martin.lau@linux.dev>
> Cc: Jakub Kicinski <kuba@kernel.org>
> Cc: Willem de Bruijn <willemb@google.com>
> Cc: Jesper Dangaard Brouer <brouer@redhat.com>
> Cc: Anatoly Burakov <anatoly.burakov@intel.com>
> Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
> Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
> Cc: Maryam Tahhan <mtahhan@redhat.com>
> Cc: xdp-hints@xdp-project.net
> Cc: netdev@vger.kernel.org
> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
> Signed-off-by: Stanislav Fomichev <sdf@google.com>
> ---
>   drivers/net/ethernet/mellanox/mlx5/core/en.h  |  5 +-
>   .../net/ethernet/mellanox/mlx5/core/en/txrx.h |  5 ++
>   .../net/ethernet/mellanox/mlx5/core/en/xdp.c  | 23 +++++++++
>   .../net/ethernet/mellanox/mlx5/core/en/xdp.h  |  5 ++
>   .../ethernet/mellanox/mlx5/core/en/xsk/rx.c   | 10 ++++
>   .../ethernet/mellanox/mlx5/core/en/xsk/rx.h   |  2 +
>   .../net/ethernet/mellanox/mlx5/core/en_main.c |  6 +++
>   .../net/ethernet/mellanox/mlx5/core/en_rx.c   | 49 ++++++++++---------
>   8 files changed, 80 insertions(+), 25 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> index af663978d1b4..6de02d8aeab8 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> @@ -627,10 +627,11 @@ struct mlx5e_rq;
>   typedef void (*mlx5e_fp_handle_rx_cqe)(struct mlx5e_rq*, struct mlx5_cqe64*);
>   typedef struct sk_buff *
>   (*mlx5e_fp_skb_from_cqe_mpwrq)(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
> -			       u16 cqe_bcnt, u32 head_offset, u32 page_idx);
> +			       struct mlx5_cqe64 *cqe, u16 cqe_bcnt,
> +			       u32 head_offset, u32 page_idx);
>   typedef struct sk_buff *
>   (*mlx5e_fp_skb_from_cqe)(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi,
> -			 u32 cqe_bcnt);
> +			 struct mlx5_cqe64 *cqe, u32 cqe_bcnt);
>   typedef bool (*mlx5e_fp_post_rx_wqes)(struct mlx5e_rq *rq);
>   typedef void (*mlx5e_fp_dealloc_wqe)(struct mlx5e_rq*, u16);
>   typedef void (*mlx5e_fp_shampo_dealloc_hd)(struct mlx5e_rq*, u16, u16, bool);
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
> index 853f312cd757..757c012ece27 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
> @@ -73,6 +73,11 @@ int mlx5e_poll_rx_cq(struct mlx5e_cq *cq, int budget);
>   void mlx5e_free_rx_descs(struct mlx5e_rq *rq);
>   void mlx5e_free_rx_in_progress_descs(struct mlx5e_rq *rq);
>   
> +static inline bool mlx5e_rx_hw_stamp(struct hwtstamp_config *config)
> +{
> +	return config->rx_filter == HWTSTAMP_FILTER_ALL;
> +}
> +
>   /* TX */
>   netdev_tx_t mlx5e_xmit(struct sk_buff *skb, struct net_device *dev);
>   bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget);
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
> index 31bb6806bf5d..d10d31e12ba2 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
> @@ -156,6 +156,29 @@ mlx5e_xmit_xdp_buff(struct mlx5e_xdpsq *sq, struct mlx5e_rq *rq,
>   	return true;
>   }
>   
> +int mlx5e_xdp_rx_timestamp(const struct xdp_md *ctx, u64 *timestamp)
> +{
> +	const struct mlx5e_xdp_buff *_ctx = (void *)ctx;
> +
> +	if (unlikely(!mlx5e_rx_hw_stamp(_ctx->rq->tstamp)))
> +		return -EOPNOTSUPP;
> +
> +	*timestamp =  mlx5e_cqe_ts_to_ns(_ctx->rq->ptp_cyc2time,
> +					 _ctx->rq->clock, get_cqe_ts(_ctx->cqe));
> +	return 0;
> +}
> +
> +int mlx5e_xdp_rx_hash(const struct xdp_md *ctx, u32 *hash)
> +{
> +	const struct mlx5e_xdp_buff *_ctx = (void *)ctx;
> +
> +	if (unlikely(!(_ctx->xdp.rxq->dev->features & NETIF_F_RXHASH)))
> +		return -EOPNOTSUPP;
> +
> +	*hash = be32_to_cpu(_ctx->cqe->rss_hash_result);
> +	return 0;
> +}
> +
>   /* returns true if packet was consumed by xdp */
>   bool mlx5e_xdp_handle(struct mlx5e_rq *rq, struct page *page,
>   		      struct bpf_prog *prog, struct mlx5e_xdp_buff *mxbuf)
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
> index 389818bf6833..cb568c62aba0 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
> @@ -46,6 +46,8 @@
>   
>   struct mlx5e_xdp_buff {
>   	struct xdp_buff xdp;
> +	struct mlx5_cqe64 *cqe;
> +	struct mlx5e_rq *rq;
>   };
>   
>   struct mlx5e_xsk_param;
> @@ -60,6 +62,9 @@ void mlx5e_xdp_rx_poll_complete(struct mlx5e_rq *rq);
>   int mlx5e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
>   		   u32 flags);
>   
> +int mlx5e_xdp_rx_timestamp(const struct xdp_md *ctx, u64 *timestamp);
> +int mlx5e_xdp_rx_hash(const struct xdp_md *ctx, u32 *hash);
> +
>   INDIRECT_CALLABLE_DECLARE(bool mlx5e_xmit_xdp_frame_mpwqe(struct mlx5e_xdpsq *sq,
>   							  struct mlx5e_xmit_data *xdptxd,
>   							  struct skb_shared_info *sinfo,
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
> index 9cff82d764e3..8bf3029abd3c 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
> @@ -49,6 +49,7 @@ int mlx5e_xsk_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
>   			umr_wqe->inline_mtts[i] = (struct mlx5_mtt) {
>   				.ptag = cpu_to_be64(addr | MLX5_EN_WR),
>   			};
> +			wi->alloc_units[i].mxbuf->rq = rq;
>   		}
>   	} else if (unlikely(rq->mpwqe.umr_mode == MLX5E_MPWRQ_UMR_MODE_UNALIGNED)) {
>   		for (i = 0; i < batch; i++) {
> @@ -58,6 +59,7 @@ int mlx5e_xsk_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
>   				.key = rq->mkey_be,
>   				.va = cpu_to_be64(addr),
>   			};
> +			wi->alloc_units[i].mxbuf->rq = rq;
>   		}
>   	} else if (likely(rq->mpwqe.umr_mode == MLX5E_MPWRQ_UMR_MODE_TRIPLE)) {
>   		u32 mapping_size = 1 << (rq->mpwqe.page_shift - 2);
> @@ -81,6 +83,7 @@ int mlx5e_xsk_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
>   				.key = rq->mkey_be,
>   				.va = cpu_to_be64(rq->wqe_overflow.addr),
>   			};
> +			wi->alloc_units[i].mxbuf->rq = rq;
>   		}
>   	} else {
>   		__be32 pad_size = cpu_to_be32((1 << rq->mpwqe.page_shift) -
> @@ -100,6 +103,7 @@ int mlx5e_xsk_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
>   				.va = cpu_to_be64(rq->wqe_overflow.addr),
>   				.bcount = pad_size,
>   			};
> +			wi->alloc_units[i].mxbuf->rq = rq;
>   		}
>   	}
>   
> @@ -230,6 +234,7 @@ static struct sk_buff *mlx5e_xsk_construct_skb(struct mlx5e_rq *rq, struct xdp_b
>   
>   struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,
>   						    struct mlx5e_mpw_info *wi,
> +						    struct mlx5_cqe64 *cqe,
>   						    u16 cqe_bcnt,
>   						    u32 head_offset,
>   						    u32 page_idx)
> @@ -250,6 +255,8 @@ struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,
>   	 */
>   	WARN_ON_ONCE(head_offset);
>   
> +	/* mxbuf->rq is set on allocation, but cqe is per-packet so set it here */
> +	mxbuf->cqe = cqe;
>   	xsk_buff_set_size(&mxbuf->xdp, cqe_bcnt);
>   	xsk_buff_dma_sync_for_cpu(&mxbuf->xdp, rq->xsk_pool);
>   	net_prefetch(mxbuf->xdp.data);
> @@ -284,6 +291,7 @@ struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,
>   
>   struct sk_buff *mlx5e_xsk_skb_from_cqe_linear(struct mlx5e_rq *rq,
>   					      struct mlx5e_wqe_frag_info *wi,
> +					      struct mlx5_cqe64 *cqe,
>   					      u32 cqe_bcnt)
>   {
>   	struct mlx5e_xdp_buff *mxbuf = wi->au->mxbuf;
> @@ -296,6 +304,8 @@ struct sk_buff *mlx5e_xsk_skb_from_cqe_linear(struct mlx5e_rq *rq,
>   	 */
>   	WARN_ON_ONCE(wi->offset);
>   
> +	/* mxbuf->rq is set on allocation, but cqe is per-packet so set it here */
> +	mxbuf->cqe = cqe;
>   	xsk_buff_set_size(&mxbuf->xdp, cqe_bcnt);
>   	xsk_buff_dma_sync_for_cpu(&mxbuf->xdp, rq->xsk_pool);
>   	net_prefetch(mxbuf->xdp.data);
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.h b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.h
> index 087c943bd8e9..cefc0ef6105d 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.h
> @@ -13,11 +13,13 @@ int mlx5e_xsk_alloc_rx_wqes_batched(struct mlx5e_rq *rq, u16 ix, int wqe_bulk);
>   int mlx5e_xsk_alloc_rx_wqes(struct mlx5e_rq *rq, u16 ix, int wqe_bulk);
>   struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,
>   						    struct mlx5e_mpw_info *wi,
> +						    struct mlx5_cqe64 *cqe,
>   						    u16 cqe_bcnt,
>   						    u32 head_offset,
>   						    u32 page_idx);
>   struct sk_buff *mlx5e_xsk_skb_from_cqe_linear(struct mlx5e_rq *rq,
>   					      struct mlx5e_wqe_frag_info *wi,
> +					      struct mlx5_cqe64 *cqe,
>   					      u32 cqe_bcnt);
>   
>   #endif /* __MLX5_EN_XSK_RX_H__ */
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> index cff5f2e29e1e..be942c060774 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> @@ -4913,6 +4913,11 @@ const struct net_device_ops mlx5e_netdev_ops = {
>   #endif
>   };
>   
> +static const struct xdp_metadata_ops mlx5_xdp_metadata_ops = {
> +	.xmo_rx_timestamp		= mlx5e_xdp_rx_timestamp,
> +	.xmo_rx_hash			= mlx5e_xdp_rx_hash,
> +};
> +

Instead of exposing every single xdp function in the xdp header, move 
this struct into en/xdp.c and expose it via en/xdp.h.

See example struct and its usage:
	netdev->ethtool_ops       = &mlx5e_ethtool_ops;
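
For illustration, something like this sketch, using the names already 
in the patch:

	/* en/xdp.c */
	const struct xdp_metadata_ops mlx5_xdp_metadata_ops = {
		.xmo_rx_timestamp	= mlx5e_xdp_rx_timestamp,
		.xmo_rx_hash		= mlx5e_xdp_rx_hash,
	};

	/* en/xdp.h */
	extern const struct xdp_metadata_ops mlx5_xdp_metadata_ops;

	/* en_main.c */
	netdev->xdp_metadata_ops = &mlx5_xdp_metadata_ops;

The two mlx5e_xdp_rx_* helpers could then become static in en/xdp.c 
instead of being declared in the header.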

>   static u32 mlx5e_choose_lro_timeout(struct mlx5_core_dev *mdev, u32 wanted_timeout)
>   {
>   	int i;
> @@ -5053,6 +5058,7 @@ static void mlx5e_build_nic_netdev(struct net_device *netdev)
>   	SET_NETDEV_DEV(netdev, mdev->device);
>   
>   	netdev->netdev_ops = &mlx5e_netdev_ops;
> +	netdev->xdp_metadata_ops = &mlx5_xdp_metadata_ops;
>   
>   	mlx5e_dcbnl_build_netdev(netdev);
>   
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> index 6affdddf5bcf..7b08653be000 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> @@ -62,10 +62,12 @@
>   
>   static struct sk_buff *
>   mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
> -				u16 cqe_bcnt, u32 head_offset, u32 page_idx);
> +				struct mlx5_cqe64 *cqe, u16 cqe_bcnt, u32 head_offset,
> +				u32 page_idx);
>   static struct sk_buff *
>   mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
> -				   u16 cqe_bcnt, u32 head_offset, u32 page_idx);
> +				   struct mlx5_cqe64 *cqe, u16 cqe_bcnt, u32 head_offset,
> +				   u32 page_idx);
>   static void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
>   static void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
>   static void mlx5e_handle_rx_cqe_mpwrq_shampo(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
> @@ -76,11 +78,6 @@ const struct mlx5e_rx_handlers mlx5e_rx_handlers_nic = {
>   	.handle_rx_cqe_mpwqe_shampo = mlx5e_handle_rx_cqe_mpwrq_shampo,
>   };
>   
> -static inline bool mlx5e_rx_hw_stamp(struct hwtstamp_config *config)
> -{
> -	return config->rx_filter == HWTSTAMP_FILTER_ALL;
> -}
> -
>   static inline void mlx5e_read_cqe_slot(struct mlx5_cqwq *wq,
>   				       u32 cqcc, void *data)
>   {
> @@ -1575,16 +1572,19 @@ struct sk_buff *mlx5e_build_linear_skb(struct mlx5e_rq *rq, void *va,
>   	return skb;
>   }
>   
> -static void mlx5e_fill_mxbuf(struct mlx5e_rq *rq, void *va, u16 headroom,
> -			     u32 len, struct mlx5e_xdp_buff *mxbuf)
> +static void mlx5e_fill_mxbuf(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
> +			     void *va, u16 headroom, u32 len,
> +			     struct mlx5e_xdp_buff *mxbuf)
>   {
>   	xdp_init_buff(&mxbuf->xdp, rq->buff.frame0_sz, &rq->xdp_rxq);
>   	xdp_prepare_buff(&mxbuf->xdp, va, headroom, len, true);
> +	mxbuf->cqe = cqe;
> +	mxbuf->rq = rq;
>   }
>   
>   static struct sk_buff *
>   mlx5e_skb_from_cqe_linear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi,
> -			  u32 cqe_bcnt)
> +			  struct mlx5_cqe64 *cqe, u32 cqe_bcnt)
>   {
>   	union mlx5e_alloc_unit *au = wi->au;
>   	u16 rx_headroom = rq->buff.headroom;
> @@ -1630,7 +1630,7 @@ mlx5e_skb_from_cqe_linear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi,
>   
>   static struct sk_buff *
>   mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi,
> -			     u32 cqe_bcnt)
> +			     struct mlx5_cqe64 *cqe, u32 cqe_bcnt)
>   {
>   	struct mlx5e_rq_frag_info *frag_info = &rq->wqe.info.arr[0];
>   	struct mlx5e_wqe_frag_info *head_wi = wi;
> @@ -1654,7 +1654,7 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
>   	net_prefetchw(va); /* xdp_frame data area */
>   	net_prefetch(va + rx_headroom);
>   
> -	mlx5e_fill_mxbuf(rq, va, rx_headroom, frag_consumed_bytes, &mxbuf);
> +	mlx5e_fill_mxbuf(rq, cqe, va, rx_headroom, frag_consumed_bytes, &mxbuf);
>   	sinfo = xdp_get_shared_info_from_buff(&mxbuf.xdp);
>   	truesize = 0;
>   
> @@ -1777,7 +1777,7 @@ static void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
>   			      mlx5e_skb_from_cqe_linear,
>   			      mlx5e_skb_from_cqe_nonlinear,
>   			      mlx5e_xsk_skb_from_cqe_linear,
> -			      rq, wi, cqe_bcnt);
> +			      rq, wi, cqe, cqe_bcnt);
>   	if (!skb) {
>   		/* probably for XDP */
>   		if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)) {
> @@ -1830,7 +1830,7 @@ static void mlx5e_handle_rx_cqe_rep(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
>   	skb = INDIRECT_CALL_2(rq->wqe.skb_from_cqe,
>   			      mlx5e_skb_from_cqe_linear,
>   			      mlx5e_skb_from_cqe_nonlinear,
> -			      rq, wi, cqe_bcnt);
> +			      rq, wi, cqe, cqe_bcnt);
>   	if (!skb) {
>   		/* probably for XDP */
>   		if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)) {
> @@ -1889,7 +1889,7 @@ static void mlx5e_handle_rx_cqe_mpwrq_rep(struct mlx5e_rq *rq, struct mlx5_cqe64
>   	skb = INDIRECT_CALL_2(rq->mpwqe.skb_from_cqe_mpwrq,
>   			      mlx5e_skb_from_cqe_mpwrq_linear,
>   			      mlx5e_skb_from_cqe_mpwrq_nonlinear,
> -			      rq, wi, cqe_bcnt, head_offset, page_idx);
> +			      rq, wi, cqe, cqe_bcnt, head_offset, page_idx);
>   	if (!skb)
>   		goto mpwrq_cqe_out;
>   
> @@ -1940,7 +1940,8 @@ mlx5e_fill_skb_data(struct sk_buff *skb, struct mlx5e_rq *rq,
>   
>   static struct sk_buff *
>   mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
> -				   u16 cqe_bcnt, u32 head_offset, u32 page_idx)
> +				   struct mlx5_cqe64 *cqe, u16 cqe_bcnt, u32 head_offset,
> +				   u32 page_idx)
>   {
>   	union mlx5e_alloc_unit *au = &wi->alloc_units[page_idx];
>   	u16 headlen = min_t(u16, MLX5E_RX_MAX_HEAD, cqe_bcnt);
> @@ -1979,7 +1980,8 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
>   
>   static struct sk_buff *
>   mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
> -				u16 cqe_bcnt, u32 head_offset, u32 page_idx)
> +				struct mlx5_cqe64 *cqe, u16 cqe_bcnt, u32 head_offset,
> +				u32 page_idx)
>   {
>   	union mlx5e_alloc_unit *au = &wi->alloc_units[page_idx];
>   	u16 rx_headroom = rq->buff.headroom;
> @@ -2010,7 +2012,7 @@ mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
>   		struct mlx5e_xdp_buff mxbuf;
>   
>   		net_prefetchw(va); /* xdp_frame data area */
> -		mlx5e_fill_mxbuf(rq, va, rx_headroom, cqe_bcnt, &mxbuf);
> +		mlx5e_fill_mxbuf(rq, cqe, va, rx_headroom, cqe_bcnt, &mxbuf);
>   		if (mlx5e_xdp_handle(rq, au->page, prog, &mxbuf)) {
>   			if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags))
>   				__set_bit(page_idx, wi->xdp_xmit_bitmap); /* non-atomic */
> @@ -2174,8 +2176,8 @@ static void mlx5e_handle_rx_cqe_mpwrq_shampo(struct mlx5e_rq *rq, struct mlx5_cq
>   		if (likely(head_size))
>   			*skb = mlx5e_skb_from_cqe_shampo(rq, wi, cqe, header_index);
>   		else
> -			*skb = mlx5e_skb_from_cqe_mpwrq_nonlinear(rq, wi, cqe_bcnt, data_offset,
> -								  page_idx);
> +			*skb = mlx5e_skb_from_cqe_mpwrq_nonlinear(rq, wi, cqe, cqe_bcnt,
> +								  data_offset, page_idx);
>   		if (unlikely(!*skb))
>   			goto free_hd_entry;
>   
> @@ -2249,7 +2251,8 @@ static void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cq
>   			      mlx5e_skb_from_cqe_mpwrq_linear,
>   			      mlx5e_skb_from_cqe_mpwrq_nonlinear,
>   			      mlx5e_xsk_skb_from_cqe_mpwrq_linear,
> -			      rq, wi, cqe_bcnt, head_offset, page_idx);
> +			      rq, wi, cqe, cqe_bcnt, head_offset,
> +			      page_idx);
>   	if (!skb)
>   		goto mpwrq_cqe_out;
>   
> @@ -2494,7 +2497,7 @@ static void mlx5i_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
>   	skb = INDIRECT_CALL_2(rq->wqe.skb_from_cqe,
>   			      mlx5e_skb_from_cqe_linear,
>   			      mlx5e_skb_from_cqe_nonlinear,
> -			      rq, wi, cqe_bcnt);
> +			      rq, wi, cqe, cqe_bcnt);
>   	if (!skb)
>   		goto wq_free_wqe;
>   
> @@ -2586,7 +2589,7 @@ static void mlx5e_trap_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe
>   		goto free_wqe;
>   	}
>   
> -	skb = mlx5e_skb_from_cqe_nonlinear(rq, wi, cqe_bcnt);
> +	skb = mlx5e_skb_from_cqe_nonlinear(rq, wi, cqe, cqe_bcnt);
>   	if (!skb)
>   		goto free_wqe;
>   
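
For completeness: these two hooks back the metadata kfuncs that the 
series exposes to BPF programs, so on the consumer side the flow is 
roughly the following sketch (kfunc names as used in the series' 
selftests; an error return such as -EOPNOTSUPP simply means the hint 
is not available for that packet):

	#include <linux/bpf.h>
	#include <bpf/bpf_helpers.h>

	extern int bpf_xdp_metadata_rx_timestamp(const struct xdp_md *ctx,
						 __u64 *timestamp) __ksym;
	extern int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx,
					    __u32 *hash) __ksym;

	SEC("xdp")
	int rx_hints(struct xdp_md *ctx)
	{
		__u64 ts;
		__u32 hash;

		if (!bpf_xdp_metadata_rx_timestamp(ctx, &ts))
			bpf_printk("rx timestamp: %llu", ts);
		if (!bpf_xdp_metadata_rx_hash(ctx, &hash))
			bpf_printk("rx hash: 0x%x", hash);
		return XDP_PASS;
	}

	char _license[] SEC("license") = "GPL";

Note the program has to be loaded as device-bound (with a target 
ifindex) for these kfuncs to resolve, per the cover letter.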

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [xdp-hints] Re: [PATCH bpf-next v7 00/17] xdp: hints via kfuncs
  2023-01-12  7:29 ` [xdp-hints] Re: [PATCH bpf-next v7 00/17] xdp: hints via kfuncs Martin KaFai Lau
@ 2023-01-12  8:19   ` Tariq Toukan
  2023-01-12 18:09     ` Stanislav Fomichev
  0 siblings, 1 reply; 42+ messages in thread
From: Tariq Toukan @ 2023-01-12  8:19 UTC (permalink / raw)
  To: Martin KaFai Lau, Stanislav Fomichev
  Cc: ast, daniel, andrii, song, yhs, john.fastabend, kpsingh, haoluo,
	jolsa, David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, Tariq Toukan, Saeed Mahameed,
	bpf, xdp-hints, netdev



On 12/01/2023 9:29, Martin KaFai Lau wrote:
> On 1/11/23 4:32 PM, Stanislav Fomichev wrote:
>> Please see the first patch in the series for the overall
>> design and use-cases.
>>
>> See the following email from Toke for the per-packet metadata overhead:
>> https://lore.kernel.org/bpf/20221206024554.3826186-1-sdf@google.com/T/#m49d48ea08d525ec88360c7d14c4d34fb0e45e798
>>
>> Recent changes:
>>
>> - Bring back parts that were removed during patch reshuffling from "bpf:
>>    Introduce device-bound XDP programs" patch (Martin)
>>
>> - Remove netdev NULL check from __bpf_prog_dev_bound_init (Martin)
>>
>> - Remove netdev NULL check from bpf_dev_bound_resolve_kfunc (Martin)
>>
>> - Move target bound device verification from bpf_tracing_prog_attach into
>>    bpf_check_attach_target (Martin)
>>
>> - Move mlx5e_free_rx_in_progress_descs into txrx.h (Tariq)
>>
>> - mlx5e_fill_xdp_buff -> mlx5e_fill_mxbuf (Tariq)
> 
> Thanks for the patches. The set lgtm.
> 
> The selftest patches 11 and 17 have conflicts with the recent changes in 
> selftests/bpf/xsk.{h,c} and selftests/bpf/Makefile (e.g. 
> XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD is no longer needed), so please respin. 
> From a quick look, it should only take some minor changes.
> 
> Not sure if Tariq has a chance to look at the mlx5 changes shortly. The 
> set is getting pretty long and the core part is ready with veth and mlx4 
> support. I think it is better to get the ready parts landed first such 
> that other drivers can also start adding support for it. One option is 
> to post the two mlx5 patches as another patchset and they can be 
> reviewed separately.
> 

Hi,
I posted new comments.
I think they can be handled quickly, and still be part of the next respin.

I'm fine with both options though. You can keep the mlx5e patches or 
defer them to a followup series. Whatever works best for you.

Tariq

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [xdp-hints] Re: [PATCH bpf-next v7 00/17] xdp: hints via kfuncs
  2023-01-12  8:19   ` Tariq Toukan
@ 2023-01-12 18:09     ` Stanislav Fomichev
  2023-01-12 18:20       ` Martin KaFai Lau
  0 siblings, 1 reply; 42+ messages in thread
From: Stanislav Fomichev @ 2023-01-12 18:09 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: Martin KaFai Lau, ast, daniel, andrii, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, Tariq Toukan,
	Saeed Mahameed, bpf, xdp-hints, netdev

On Thu, Jan 12, 2023 at 12:19 AM Tariq Toukan <ttoukan.linux@gmail.com> wrote:
>
>
>
> On 12/01/2023 9:29, Martin KaFai Lau wrote:
> > On 1/11/23 4:32 PM, Stanislav Fomichev wrote:
> >> Please see the first patch in the series for the overall
> >> design and use-cases.
> >>
> >> See the following email from Toke for the per-packet metadata overhead:
> >> https://lore.kernel.org/bpf/20221206024554.3826186-1-sdf@google.com/T/#m49d48ea08d525ec88360c7d14c4d34fb0e45e798
> >>
> >> Recent changes:
> >>
> >> - Bring back parts that were removed during patch reshuffling from "bpf:
> >>    Introduce device-bound XDP programs" patch (Martin)
> >>
> >> - Remove netdev NULL check from __bpf_prog_dev_bound_init (Martin)
> >>
> >> - Remove netdev NULL check from bpf_dev_bound_resolve_kfunc (Martin)
> >>
> >> - Move target bound device verification from bpf_tracing_prog_attach into
> >>    bpf_check_attach_target (Martin)
> >>
> >> - Move mlx5e_free_rx_in_progress_descs into txrx.h (Tariq)
> >>
> >> - mlx5e_fill_xdp_buff -> mlx5e_fill_mxbuf (Tariq)
> >
> > Thanks for the patches. The set lgtm.
> >
> > The selftest patches 11 and 17 have conflicts with the recent changes in
> > selftests/bpf/xsk.{h,c} and selftests/bpf/Makefile (e.g.
> > XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD is no longer needed), so please respin.
> > From a quick look, it should only take some minor changes.
> >
> > Not sure if Tariq has a chance to look at the mlx5 changes shortly. The
> > set is getting pretty long and the core part is ready with veth and mlx4
> > support. I think it is better to get the ready parts landed first such
> > that other drivers can also start adding support for it. One option is
> > to post the two mlx5 patches as another patchset and they can be
> > reviewed separately.
> >
>
> Hi,
> I posted new comments.
> I think they can be handled quickly, and still be part of the next respin.
>
> I'm fine with both options though. You can keep the mlx5e patches or
> defer them to a followup series. Whatever works best for you.

Either way is fine with me also. I can find some time today to address
Tariq's comments and respin if that works for everybody.

> Tariq

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [xdp-hints] Re: [PATCH bpf-next v7 00/17] xdp: hints via kfuncs
  2023-01-12 18:09     ` Stanislav Fomichev
@ 2023-01-12 18:20       ` Martin KaFai Lau
  0 siblings, 0 replies; 42+ messages in thread
From: Martin KaFai Lau @ 2023-01-12 18:20 UTC (permalink / raw)
  To: Stanislav Fomichev, Tariq Toukan
  Cc: ast, daniel, andrii, song, yhs, john.fastabend, kpsingh, haoluo,
	jolsa, David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, Tariq Toukan, Saeed Mahameed,
	bpf, xdp-hints, netdev

On 1/12/23 10:09 AM, Stanislav Fomichev wrote:
> On Thu, Jan 12, 2023 at 12:19 AM Tariq Toukan <ttoukan.linux@gmail.com> wrote:
>>
>>
>>
>> On 12/01/2023 9:29, Martin KaFai Lau wrote:
>>> On 1/11/23 4:32 PM, Stanislav Fomichev wrote:
>>>> Please see the first patch in the series for the overall
>>>> design and use-cases.
>>>>
>>>> See the following email from Toke for the per-packet metadata overhead:
>>>> https://lore.kernel.org/bpf/20221206024554.3826186-1-sdf@google.com/T/#m49d48ea08d525ec88360c7d14c4d34fb0e45e798
>>>>
>>>> Recent changes:
>>>>
>>>> - Bring back parts that were removed during patch reshuffling from "bpf:
>>>>     Introduce device-bound XDP programs" patch (Martin)
>>>>
>>>> - Remove netdev NULL check from __bpf_prog_dev_bound_init (Martin)
>>>>
>>>> - Remove netdev NULL check from bpf_dev_bound_resolve_kfunc (Martin)
>>>>
>>>> - Move target bound device verification from bpf_tracing_prog_attach into
>>>>     bpf_check_attach_target (Martin)
>>>>
>>>> - Move mlx5e_free_rx_in_progress_descs into txrx.h (Tariq)
>>>>
>>>> - mlx5e_fill_xdp_buff -> mlx5e_fill_mxbuf (Tariq)
>>>
>>> Thanks for the patches. The set lgtm.
>>>
>>> The selftest patches 11 and 17 have conflicts with the recent changes in
>>> selftests/bpf/xsk.{h,c} and selftests/bpf/Makefile (e.g.
>>> XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD is no longer needed), so please respin.
>>> From a quick look, it should only take some minor changes.
>>>
>>> Not sure if Tariq has a chance to look at the mlx5 changes shortly. The
>>> set is getting pretty long and the core part is ready with veth and mlx4
>>> support. I think it is better to get the ready parts landed first such
>>> that other drivers can also start adding support for it. One option is
>>> to post the two mlx5 patches as another patchset and they can be
>>> reviewed separately.
>>>
>>
>> Hi,
>> I posted new comments.
>> I think they can be handled quickly, and still be part of the next respin.
>>
>> I'm fine with both options though. You can keep the mlx5e patches or
>> defer them to a followup series. Whatever works best for you.
> 
> Either way is fine with me also. I can find some time today to address
> Tariq's comments and respin if that works for everybody.

Together SGTM also. I mentioned the option to separate them because I 
thought the core pieces could land faster than the mlx5 piece, but it 
seems there is no major concern about the mlx5 changes.


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [xdp-hints] Re: [PATCH bpf-next v7 16/17] net/mlx5e: Support RX XDP metadata
  2023-01-12  8:13   ` [xdp-hints] " Tariq Toukan
@ 2023-01-12 19:09     ` Stanislav Fomichev
  2023-01-13 20:25       ` Tariq Toukan
  0 siblings, 1 reply; 42+ messages in thread
From: Stanislav Fomichev @ 2023-01-12 19:09 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, Toke Høiland-Jørgensen,
	Tariq Toukan, Saeed Mahameed, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

On Thu, Jan 12, 2023 at 12:13 AM Tariq Toukan <ttoukan.linux@gmail.com> wrote:
>
>
>
> On 12/01/2023 2:32, Stanislav Fomichev wrote:
> > From: Toke Høiland-Jørgensen <toke@redhat.com>
> >
> > Support RX hash and timestamp metadata kfuncs. We need to pass in the cqe
> > pointer to the mlx5e_skb_from* functions so it can be retrieved from the
> > XDP ctx to do this.
> >
> > Cc: Tariq Toukan <tariqt@nvidia.com>
> > Cc: Saeed Mahameed <saeedm@nvidia.com>
> > Cc: John Fastabend <john.fastabend@gmail.com>
> > Cc: David Ahern <dsahern@gmail.com>
> > Cc: Martin KaFai Lau <martin.lau@linux.dev>
> > Cc: Jakub Kicinski <kuba@kernel.org>
> > Cc: Willem de Bruijn <willemb@google.com>
> > Cc: Jesper Dangaard Brouer <brouer@redhat.com>
> > Cc: Anatoly Burakov <anatoly.burakov@intel.com>
> > Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
> > Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
> > Cc: Maryam Tahhan <mtahhan@redhat.com>
> > Cc: xdp-hints@xdp-project.net
> > Cc: netdev@vger.kernel.org
> > Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > ---
> >   drivers/net/ethernet/mellanox/mlx5/core/en.h  |  5 +-
> >   .../net/ethernet/mellanox/mlx5/core/en/txrx.h |  5 ++
> >   .../net/ethernet/mellanox/mlx5/core/en/xdp.c  | 23 +++++++++
> >   .../net/ethernet/mellanox/mlx5/core/en/xdp.h  |  5 ++
> >   .../ethernet/mellanox/mlx5/core/en/xsk/rx.c   | 10 ++++
> >   .../ethernet/mellanox/mlx5/core/en/xsk/rx.h   |  2 +
> >   .../net/ethernet/mellanox/mlx5/core/en_main.c |  6 +++
> >   .../net/ethernet/mellanox/mlx5/core/en_rx.c   | 49 ++++++++++---------
> >   8 files changed, 80 insertions(+), 25 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> > index af663978d1b4..6de02d8aeab8 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> > @@ -627,10 +627,11 @@ struct mlx5e_rq;
> >   typedef void (*mlx5e_fp_handle_rx_cqe)(struct mlx5e_rq*, struct mlx5_cqe64*);
> >   typedef struct sk_buff *
> >   (*mlx5e_fp_skb_from_cqe_mpwrq)(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
> > -                            u16 cqe_bcnt, u32 head_offset, u32 page_idx);
> > +                            struct mlx5_cqe64 *cqe, u16 cqe_bcnt,
> > +                            u32 head_offset, u32 page_idx);
> >   typedef struct sk_buff *
> >   (*mlx5e_fp_skb_from_cqe)(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi,
> > -                      u32 cqe_bcnt);
> > +                      struct mlx5_cqe64 *cqe, u32 cqe_bcnt);
> >   typedef bool (*mlx5e_fp_post_rx_wqes)(struct mlx5e_rq *rq);
> >   typedef void (*mlx5e_fp_dealloc_wqe)(struct mlx5e_rq*, u16);
> >   typedef void (*mlx5e_fp_shampo_dealloc_hd)(struct mlx5e_rq*, u16, u16, bool);
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
> > index 853f312cd757..757c012ece27 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
> > @@ -73,6 +73,11 @@ int mlx5e_poll_rx_cq(struct mlx5e_cq *cq, int budget);
> >   void mlx5e_free_rx_descs(struct mlx5e_rq *rq);
> >   void mlx5e_free_rx_in_progress_descs(struct mlx5e_rq *rq);
> >
> > +static inline bool mlx5e_rx_hw_stamp(struct hwtstamp_config *config)
> > +{
> > +     return config->rx_filter == HWTSTAMP_FILTER_ALL;
> > +}
> > +
> >   /* TX */
> >   netdev_tx_t mlx5e_xmit(struct sk_buff *skb, struct net_device *dev);
> >   bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget);
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
> > index 31bb6806bf5d..d10d31e12ba2 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
> > @@ -156,6 +156,29 @@ mlx5e_xmit_xdp_buff(struct mlx5e_xdpsq *sq, struct mlx5e_rq *rq,
> >       return true;
> >   }
> >
> > +int mlx5e_xdp_rx_timestamp(const struct xdp_md *ctx, u64 *timestamp)
> > +{
> > +     const struct mlx5e_xdp_buff *_ctx = (void *)ctx;
> > +
> > +     if (unlikely(!mlx5e_rx_hw_stamp(_ctx->rq->tstamp)))
> > +             return -EOPNOTSUPP;
> > +
> > +     *timestamp =  mlx5e_cqe_ts_to_ns(_ctx->rq->ptp_cyc2time,
> > +                                      _ctx->rq->clock, get_cqe_ts(_ctx->cqe));
> > +     return 0;
> > +}
> > +
> > +int mlx5e_xdp_rx_hash(const struct xdp_md *ctx, u32 *hash)
> > +{
> > +     const struct mlx5e_xdp_buff *_ctx = (void *)ctx;
> > +
> > +     if (unlikely(!(_ctx->xdp.rxq->dev->features & NETIF_F_RXHASH)))
> > +             return -EOPNOTSUPP;
> > +
> > +     *hash = be32_to_cpu(_ctx->cqe->rss_hash_result);
> > +     return 0;
> > +}
> > +
> >   /* returns true if packet was consumed by xdp */
> >   bool mlx5e_xdp_handle(struct mlx5e_rq *rq, struct page *page,
> >                     struct bpf_prog *prog, struct mlx5e_xdp_buff *mxbuf)
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
> > index 389818bf6833..cb568c62aba0 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
> > @@ -46,6 +46,8 @@
> >
> >   struct mlx5e_xdp_buff {
> >       struct xdp_buff xdp;
> > +     struct mlx5_cqe64 *cqe;
> > +     struct mlx5e_rq *rq;
> >   };
> >
> >   struct mlx5e_xsk_param;
> > @@ -60,6 +62,9 @@ void mlx5e_xdp_rx_poll_complete(struct mlx5e_rq *rq);
> >   int mlx5e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
> >                  u32 flags);
> >
> > +int mlx5e_xdp_rx_timestamp(const struct xdp_md *ctx, u64 *timestamp);
> > +int mlx5e_xdp_rx_hash(const struct xdp_md *ctx, u32 *hash);
> > +
> >   INDIRECT_CALLABLE_DECLARE(bool mlx5e_xmit_xdp_frame_mpwqe(struct mlx5e_xdpsq *sq,
> >                                                         struct mlx5e_xmit_data *xdptxd,
> >                                                         struct skb_shared_info *sinfo,
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
> > index 9cff82d764e3..8bf3029abd3c 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
> > @@ -49,6 +49,7 @@ int mlx5e_xsk_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
> >                       umr_wqe->inline_mtts[i] = (struct mlx5_mtt) {
> >                               .ptag = cpu_to_be64(addr | MLX5_EN_WR),
> >                       };
> > +                     wi->alloc_units[i].mxbuf->rq = rq;
> >               }
> >       } else if (unlikely(rq->mpwqe.umr_mode == MLX5E_MPWRQ_UMR_MODE_UNALIGNED)) {
> >               for (i = 0; i < batch; i++) {
> > @@ -58,6 +59,7 @@ int mlx5e_xsk_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
> >                               .key = rq->mkey_be,
> >                               .va = cpu_to_be64(addr),
> >                       };
> > +                     wi->alloc_units[i].mxbuf->rq = rq;
> >               }
> >       } else if (likely(rq->mpwqe.umr_mode == MLX5E_MPWRQ_UMR_MODE_TRIPLE)) {
> >               u32 mapping_size = 1 << (rq->mpwqe.page_shift - 2);
> > @@ -81,6 +83,7 @@ int mlx5e_xsk_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
> >                               .key = rq->mkey_be,
> >                               .va = cpu_to_be64(rq->wqe_overflow.addr),
> >                       };
> > +                     wi->alloc_units[i].mxbuf->rq = rq;
> >               }
> >       } else {
> >               __be32 pad_size = cpu_to_be32((1 << rq->mpwqe.page_shift) -
> > @@ -100,6 +103,7 @@ int mlx5e_xsk_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
> >                               .va = cpu_to_be64(rq->wqe_overflow.addr),
> >                               .bcount = pad_size,
> >                       };
> > +                     wi->alloc_units[i].mxbuf->rq = rq;
> >               }
> >       }
> >
> > @@ -230,6 +234,7 @@ static struct sk_buff *mlx5e_xsk_construct_skb(struct mlx5e_rq *rq, struct xdp_b
> >
> >   struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,
> >                                                   struct mlx5e_mpw_info *wi,
> > +                                                 struct mlx5_cqe64 *cqe,
> >                                                   u16 cqe_bcnt,
> >                                                   u32 head_offset,
> >                                                   u32 page_idx)
> > @@ -250,6 +255,8 @@ struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,
> >        */
> >       WARN_ON_ONCE(head_offset);
> >
> > +     /* mxbuf->rq is set on allocation, but cqe is per-packet so set it here */
> > +     mxbuf->cqe = cqe;
> >       xsk_buff_set_size(&mxbuf->xdp, cqe_bcnt);
> >       xsk_buff_dma_sync_for_cpu(&mxbuf->xdp, rq->xsk_pool);
> >       net_prefetch(mxbuf->xdp.data);
> > @@ -284,6 +291,7 @@ struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,
> >
> >   struct sk_buff *mlx5e_xsk_skb_from_cqe_linear(struct mlx5e_rq *rq,
> >                                             struct mlx5e_wqe_frag_info *wi,
> > +                                           struct mlx5_cqe64 *cqe,
> >                                             u32 cqe_bcnt)
> >   {
> >       struct mlx5e_xdp_buff *mxbuf = wi->au->mxbuf;
> > @@ -296,6 +304,8 @@ struct sk_buff *mlx5e_xsk_skb_from_cqe_linear(struct mlx5e_rq *rq,
> >        */
> >       WARN_ON_ONCE(wi->offset);
> >
> > +     /* mxbuf->rq is set on allocation, but cqe is per-packet so set it here */
> > +     mxbuf->cqe = cqe;
> >       xsk_buff_set_size(&mxbuf->xdp, cqe_bcnt);
> >       xsk_buff_dma_sync_for_cpu(&mxbuf->xdp, rq->xsk_pool);
> >       net_prefetch(mxbuf->xdp.data);
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.h b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.h
> > index 087c943bd8e9..cefc0ef6105d 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.h
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.h
> > @@ -13,11 +13,13 @@ int mlx5e_xsk_alloc_rx_wqes_batched(struct mlx5e_rq *rq, u16 ix, int wqe_bulk);
> >   int mlx5e_xsk_alloc_rx_wqes(struct mlx5e_rq *rq, u16 ix, int wqe_bulk);
> >   struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,
> >                                                   struct mlx5e_mpw_info *wi,
> > +                                                 struct mlx5_cqe64 *cqe,
> >                                                   u16 cqe_bcnt,
> >                                                   u32 head_offset,
> >                                                   u32 page_idx);
> >   struct sk_buff *mlx5e_xsk_skb_from_cqe_linear(struct mlx5e_rq *rq,
> >                                             struct mlx5e_wqe_frag_info *wi,
> > +                                           struct mlx5_cqe64 *cqe,
> >                                             u32 cqe_bcnt);
> >
> >   #endif /* __MLX5_EN_XSK_RX_H__ */
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> > index cff5f2e29e1e..be942c060774 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> > @@ -4913,6 +4913,11 @@ const struct net_device_ops mlx5e_netdev_ops = {
> >   #endif
> >   };
> >
> > +static const struct xdp_metadata_ops mlx5_xdp_metadata_ops = {
> > +     .xmo_rx_timestamp               = mlx5e_xdp_rx_timestamp,
> > +     .xmo_rx_hash                    = mlx5e_xdp_rx_hash,
> > +};
> > +
>
> Instead of exposing every single xdp function in the xdp header, move
> this struct into en/xdp.c and expose it via en/xdp.h.
>
> See example struct and its usage:
>         netdev->ethtool_ops       = &mlx5e_ethtool_ops;

SG, will put "extern const struct xdp_metadata_ops
mlx5_xdp_metadata_ops;" in en/xdp.h. LMK if you prefer that to go into
en.h next to mlx5e_ethtool_ops extern.


> >   static u32 mlx5e_choose_lro_timeout(struct mlx5_core_dev *mdev, u32 wanted_timeout)
> >   {
> >       int i;
> > @@ -5053,6 +5058,7 @@ static void mlx5e_build_nic_netdev(struct net_device *netdev)
> >       SET_NETDEV_DEV(netdev, mdev->device);
> >
> >       netdev->netdev_ops = &mlx5e_netdev_ops;
> > +     netdev->xdp_metadata_ops = &mlx5_xdp_metadata_ops;
> >
> >       mlx5e_dcbnl_build_netdev(netdev);
> >
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > index 6affdddf5bcf..7b08653be000 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > @@ -62,10 +62,12 @@
> >
> >   static struct sk_buff *
> >   mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
> > -                             u16 cqe_bcnt, u32 head_offset, u32 page_idx);
> > +                             struct mlx5_cqe64 *cqe, u16 cqe_bcnt, u32 head_offset,
> > +                             u32 page_idx);
> >   static struct sk_buff *
> >   mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
> > -                                u16 cqe_bcnt, u32 head_offset, u32 page_idx);
> > +                                struct mlx5_cqe64 *cqe, u16 cqe_bcnt, u32 head_offset,
> > +                                u32 page_idx);
> >   static void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
> >   static void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
> >   static void mlx5e_handle_rx_cqe_mpwrq_shampo(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
> > @@ -76,11 +78,6 @@ const struct mlx5e_rx_handlers mlx5e_rx_handlers_nic = {
> >       .handle_rx_cqe_mpwqe_shampo = mlx5e_handle_rx_cqe_mpwrq_shampo,
> >   };
> >
> > -static inline bool mlx5e_rx_hw_stamp(struct hwtstamp_config *config)
> > -{
> > -     return config->rx_filter == HWTSTAMP_FILTER_ALL;
> > -}
> > -
> >   static inline void mlx5e_read_cqe_slot(struct mlx5_cqwq *wq,
> >                                      u32 cqcc, void *data)
> >   {
> > @@ -1575,16 +1572,19 @@ struct sk_buff *mlx5e_build_linear_skb(struct mlx5e_rq *rq, void *va,
> >       return skb;
> >   }
> >
> > -static void mlx5e_fill_mxbuf(struct mlx5e_rq *rq, void *va, u16 headroom,
> > -                          u32 len, struct mlx5e_xdp_buff *mxbuf)
> > +static void mlx5e_fill_mxbuf(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
> > +                          void *va, u16 headroom, u32 len,
> > +                          struct mlx5e_xdp_buff *mxbuf)
> >   {
> >       xdp_init_buff(&mxbuf->xdp, rq->buff.frame0_sz, &rq->xdp_rxq);
> >       xdp_prepare_buff(&mxbuf->xdp, va, headroom, len, true);
> > +     mxbuf->cqe = cqe;
> > +     mxbuf->rq = rq;
> >   }
> >
> >   static struct sk_buff *
> >   mlx5e_skb_from_cqe_linear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi,
> > -                       u32 cqe_bcnt)
> > +                       struct mlx5_cqe64 *cqe, u32 cqe_bcnt)
> >   {
> >       union mlx5e_alloc_unit *au = wi->au;
> >       u16 rx_headroom = rq->buff.headroom;
> > @@ -1630,7 +1630,7 @@ mlx5e_skb_from_cqe_linear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi,
> >
> >   static struct sk_buff *
> >   mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi,
> > -                          u32 cqe_bcnt)
> > +                          struct mlx5_cqe64 *cqe, u32 cqe_bcnt)
> >   {
> >       struct mlx5e_rq_frag_info *frag_info = &rq->wqe.info.arr[0];
> >       struct mlx5e_wqe_frag_info *head_wi = wi;
> > @@ -1654,7 +1654,7 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
> >       net_prefetchw(va); /* xdp_frame data area */
> >       net_prefetch(va + rx_headroom);
> >
> > -     mlx5e_fill_mxbuf(rq, va, rx_headroom, frag_consumed_bytes, &mxbuf);
> > +     mlx5e_fill_mxbuf(rq, cqe, va, rx_headroom, frag_consumed_bytes, &mxbuf);
> >       sinfo = xdp_get_shared_info_from_buff(&mxbuf.xdp);
> >       truesize = 0;
> >
> > @@ -1777,7 +1777,7 @@ static void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
> >                             mlx5e_skb_from_cqe_linear,
> >                             mlx5e_skb_from_cqe_nonlinear,
> >                             mlx5e_xsk_skb_from_cqe_linear,
> > -                           rq, wi, cqe_bcnt);
> > +                           rq, wi, cqe, cqe_bcnt);
> >       if (!skb) {
> >               /* probably for XDP */
> >               if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)) {
> > @@ -1830,7 +1830,7 @@ static void mlx5e_handle_rx_cqe_rep(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
> >       skb = INDIRECT_CALL_2(rq->wqe.skb_from_cqe,
> >                             mlx5e_skb_from_cqe_linear,
> >                             mlx5e_skb_from_cqe_nonlinear,
> > -                           rq, wi, cqe_bcnt);
> > +                           rq, wi, cqe, cqe_bcnt);
> >       if (!skb) {
> >               /* probably for XDP */
> >               if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)) {
> > @@ -1889,7 +1889,7 @@ static void mlx5e_handle_rx_cqe_mpwrq_rep(struct mlx5e_rq *rq, struct mlx5_cqe64
> >       skb = INDIRECT_CALL_2(rq->mpwqe.skb_from_cqe_mpwrq,
> >                             mlx5e_skb_from_cqe_mpwrq_linear,
> >                             mlx5e_skb_from_cqe_mpwrq_nonlinear,
> > -                           rq, wi, cqe_bcnt, head_offset, page_idx);
> > +                           rq, wi, cqe, cqe_bcnt, head_offset, page_idx);
> >       if (!skb)
> >               goto mpwrq_cqe_out;
> >
> > @@ -1940,7 +1940,8 @@ mlx5e_fill_skb_data(struct sk_buff *skb, struct mlx5e_rq *rq,
> >
> >   static struct sk_buff *
> >   mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
> > -                                u16 cqe_bcnt, u32 head_offset, u32 page_idx)
> > +                                struct mlx5_cqe64 *cqe, u16 cqe_bcnt, u32 head_offset,
> > +                                u32 page_idx)
> >   {
> >       union mlx5e_alloc_unit *au = &wi->alloc_units[page_idx];
> >       u16 headlen = min_t(u16, MLX5E_RX_MAX_HEAD, cqe_bcnt);
> > @@ -1979,7 +1980,8 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
> >
> >   static struct sk_buff *
> >   mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
> > -                             u16 cqe_bcnt, u32 head_offset, u32 page_idx)
> > +                             struct mlx5_cqe64 *cqe, u16 cqe_bcnt, u32 head_offset,
> > +                             u32 page_idx)
> >   {
> >       union mlx5e_alloc_unit *au = &wi->alloc_units[page_idx];
> >       u16 rx_headroom = rq->buff.headroom;
> > @@ -2010,7 +2012,7 @@ mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
> >               struct mlx5e_xdp_buff mxbuf;
> >
> >               net_prefetchw(va); /* xdp_frame data area */
> > -             mlx5e_fill_mxbuf(rq, va, rx_headroom, cqe_bcnt, &mxbuf);
> > +             mlx5e_fill_mxbuf(rq, cqe, va, rx_headroom, cqe_bcnt, &mxbuf);
> >               if (mlx5e_xdp_handle(rq, au->page, prog, &mxbuf)) {
> >                       if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags))
> >                               __set_bit(page_idx, wi->xdp_xmit_bitmap); /* non-atomic */
> > @@ -2174,8 +2176,8 @@ static void mlx5e_handle_rx_cqe_mpwrq_shampo(struct mlx5e_rq *rq, struct mlx5_cq
> >               if (likely(head_size))
> >                       *skb = mlx5e_skb_from_cqe_shampo(rq, wi, cqe, header_index);
> >               else
> > -                     *skb = mlx5e_skb_from_cqe_mpwrq_nonlinear(rq, wi, cqe_bcnt, data_offset,
> > -                                                               page_idx);
> > +                     *skb = mlx5e_skb_from_cqe_mpwrq_nonlinear(rq, wi, cqe, cqe_bcnt,
> > +                                                               data_offset, page_idx);
> >               if (unlikely(!*skb))
> >                       goto free_hd_entry;
> >
> > @@ -2249,7 +2251,8 @@ static void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cq
> >                             mlx5e_skb_from_cqe_mpwrq_linear,
> >                             mlx5e_skb_from_cqe_mpwrq_nonlinear,
> >                             mlx5e_xsk_skb_from_cqe_mpwrq_linear,
> > -                           rq, wi, cqe_bcnt, head_offset, page_idx);
> > +                           rq, wi, cqe, cqe_bcnt, head_offset,
> > +                           page_idx);
> >       if (!skb)
> >               goto mpwrq_cqe_out;
> >
> > @@ -2494,7 +2497,7 @@ static void mlx5i_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
> >       skb = INDIRECT_CALL_2(rq->wqe.skb_from_cqe,
> >                             mlx5e_skb_from_cqe_linear,
> >                             mlx5e_skb_from_cqe_nonlinear,
> > -                           rq, wi, cqe_bcnt);
> > +                           rq, wi, cqe, cqe_bcnt);
> >       if (!skb)
> >               goto wq_free_wqe;
> >
> > @@ -2586,7 +2589,7 @@ static void mlx5e_trap_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe
> >               goto free_wqe;
> >       }
> >
> > -     skb = mlx5e_skb_from_cqe_nonlinear(rq, wi, cqe_bcnt);
> > +     skb = mlx5e_skb_from_cqe_nonlinear(rq, wi, cqe, cqe_bcnt);
> >       if (!skb)
> >               goto free_wqe;
> >


* [xdp-hints] Re: [PATCH bpf-next v7 15/17] net/mlx5e: Introduce wrapper for xdp_buff
  2023-01-12  8:07   ` [xdp-hints] " Tariq Toukan
@ 2023-01-12 19:10     ` Stanislav Fomichev
  2023-01-12 21:09       ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 42+ messages in thread
From: Stanislav Fomichev @ 2023-01-12 19:10 UTC (permalink / raw)
  To: Tariq Toukan
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, Toke Høiland-Jørgensen,
	Tariq Toukan, Saeed Mahameed, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

On Thu, Jan 12, 2023 at 12:07 AM Tariq Toukan <ttoukan.linux@gmail.com> wrote:
>
>
>
> On 12/01/2023 2:32, Stanislav Fomichev wrote:
> > From: Toke Høiland-Jørgensen <toke@redhat.com>
> >
> > Preparation for implementing HW metadata kfuncs. No functional change.
> >
> > Cc: Tariq Toukan <tariqt@nvidia.com>
> > Cc: Saeed Mahameed <saeedm@nvidia.com>
> > Cc: John Fastabend <john.fastabend@gmail.com>
> > Cc: David Ahern <dsahern@gmail.com>
> > Cc: Martin KaFai Lau <martin.lau@linux.dev>
> > Cc: Jakub Kicinski <kuba@kernel.org>
> > Cc: Willem de Bruijn <willemb@google.com>
> > Cc: Jesper Dangaard Brouer <brouer@redhat.com>
> > Cc: Anatoly Burakov <anatoly.burakov@intel.com>
> > Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
> > Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
> > Cc: Maryam Tahhan <mtahhan@redhat.com>
> > Cc: xdp-hints@xdp-project.net
> > Cc: netdev@vger.kernel.org
> > Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > ---
> >   drivers/net/ethernet/mellanox/mlx5/core/en.h  |  1 +
> >   .../net/ethernet/mellanox/mlx5/core/en/xdp.c  |  3 +-
> >   .../net/ethernet/mellanox/mlx5/core/en/xdp.h  |  6 +-
> >   .../ethernet/mellanox/mlx5/core/en/xsk/rx.c   | 25 ++++----
> >   .../net/ethernet/mellanox/mlx5/core/en_rx.c   | 58 +++++++++----------
> >   5 files changed, 50 insertions(+), 43 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> > index 2d77fb8a8a01..af663978d1b4 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> > @@ -469,6 +469,7 @@ struct mlx5e_txqsq {
> >   union mlx5e_alloc_unit {
> >       struct page *page;
> >       struct xdp_buff *xsk;
> > +     struct mlx5e_xdp_buff *mxbuf;
>
> In XSK files below you mix usage of both alloc_units[page_idx].mxbuf and
> alloc_units[page_idx].xsk, while both fields share the memory of a union.
>
> As struct mlx5e_xdp_buff wraps struct xdp_buff, I think that you just
> need to change the existing xsk field type from struct xdp_buff *xsk
> into struct mlx5e_xdp_buff *xsk and align the usage.

Hmmm, good point. I'm actually not sure how it works currently.
mlx5e_alloc_unit.mxbuf doesn't seem to be initialized anywhere? Toke,
am I missing something?

I'm thinking about something like this:

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index af663978d1b4..2d77fb8a8a01 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -469,7 +469,6 @@ struct mlx5e_txqsq {
 union mlx5e_alloc_unit {
        struct page *page;
        struct xdp_buff *xsk;
-       struct mlx5e_xdp_buff *mxbuf;
 };

 /* XDP packets can be transmitted in different ways. On completion, we need to
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
index 9cff82d764e3..fd0805df34b7 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
@@ -234,7 +234,8 @@ struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,
                                                    u32 head_offset,
                                                    u32 page_idx)
 {
-       struct mlx5e_xdp_buff *mxbuf = wi->alloc_units[page_idx].mxbuf;
+       struct mlx5e_xdp_buff *mxbuf = container_of(wi->alloc_units[page_idx].xsk,
+                                                   struct mlx5e_xdp_buff, xdp);
        struct bpf_prog *prog;

        /* Check packet size. Note LRO doesn't use linear SKB */
@@ -286,7 +287,8 @@ struct sk_buff *mlx5e_xsk_skb_from_cqe_linear(struct mlx5e_rq *rq,
                                              struct mlx5e_wqe_frag_info *wi,
                                              u32 cqe_bcnt)
 {
-       struct mlx5e_xdp_buff *mxbuf = wi->au->mxbuf;
+       struct mlx5e_xdp_buff *mxbuf = container_of(wi->au->xsk,
+                                                   struct mlx5e_xdp_buff, xdp);
        struct bpf_prog *prog;

        /* wi->offset is not used in this function, because xdp->data and the

Does it look sensible?
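
For what it's worth, the container_of() above happens to be an
offset-zero operation, since struct mlx5e_xdp_buff keeps xdp as its
first member -- which is also why a plain (struct mlx5e_xdp_buff *)
cast works in the other variant. A minimal userspace sketch of that
invariant (stand-in types, not the kernel headers):

#include <assert.h>
#include <stddef.h>

/* Stand-ins for the kernel types; illustration only. */
struct xdp_buff {
	void *data;
};

struct mlx5e_xdp_buff {
	struct xdp_buff xdp;	/* first member, as in the patch */
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

int main(void)
{
	struct mlx5e_xdp_buff mxbuf;
	struct xdp_buff *xdp = &mxbuf.xdp;	/* what the xsk union member holds */

	/* container_of() recovers the wrapper from the embedded member;
	 * with xdp at offset 0 it is equivalent to a plain cast.
	 */
	assert(container_of(xdp, struct mlx5e_xdp_buff, xdp) == &mxbuf);
	assert((void *)xdp == (void *)&mxbuf);
	return 0;
}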


> >   };
> >
> >   /* XDP packets can be transmitted in different ways. On completion, we need to
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
> > index 20507ef2f956..31bb6806bf5d 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
> > @@ -158,8 +158,9 @@ mlx5e_xmit_xdp_buff(struct mlx5e_xdpsq *sq, struct mlx5e_rq *rq,
> >
> >   /* returns true if packet was consumed by xdp */
> >   bool mlx5e_xdp_handle(struct mlx5e_rq *rq, struct page *page,
> > -                   struct bpf_prog *prog, struct xdp_buff *xdp)
> > +                   struct bpf_prog *prog, struct mlx5e_xdp_buff *mxbuf)
> >   {
> > +     struct xdp_buff *xdp = &mxbuf->xdp;
> >       u32 act;
> >       int err;
> >
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
> > index bc2d9034af5b..389818bf6833 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
> > @@ -44,10 +44,14 @@
> >       (MLX5E_XDP_INLINE_WQE_MAX_DS_CNT * MLX5_SEND_WQE_DS - \
> >        sizeof(struct mlx5_wqe_inline_seg))
> >
> > +struct mlx5e_xdp_buff {
> > +     struct xdp_buff xdp;
> > +};
> > +
> >   struct mlx5e_xsk_param;
> >   int mlx5e_xdp_max_mtu(struct mlx5e_params *params, struct mlx5e_xsk_param *xsk);
> >   bool mlx5e_xdp_handle(struct mlx5e_rq *rq, struct page *page,
> > -                   struct bpf_prog *prog, struct xdp_buff *xdp);
> > +                   struct bpf_prog *prog, struct mlx5e_xdp_buff *mlctx);
> >   void mlx5e_xdp_mpwqe_complete(struct mlx5e_xdpsq *sq);
> >   bool mlx5e_poll_xdpsq_cq(struct mlx5e_cq *cq);
> >   void mlx5e_free_xdpsq_descs(struct mlx5e_xdpsq *sq);
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
> > index c91b54d9ff27..9cff82d764e3 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
> > @@ -22,6 +22,7 @@ int mlx5e_xsk_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
> >               goto err;
> >
> >       BUILD_BUG_ON(sizeof(wi->alloc_units[0]) != sizeof(wi->alloc_units[0].xsk));
> > +     XSK_CHECK_PRIV_TYPE(struct mlx5e_xdp_buff);
> >       batch = xsk_buff_alloc_batch(rq->xsk_pool, (struct xdp_buff **)wi->alloc_units,
> >                                    rq->mpwqe.pages_per_wqe);
> >
> > @@ -233,7 +234,7 @@ struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,
> >                                                   u32 head_offset,
> >                                                   u32 page_idx)
> >   {
> > -     struct xdp_buff *xdp = wi->alloc_units[page_idx].xsk;
> > +     struct mlx5e_xdp_buff *mxbuf = wi->alloc_units[page_idx].mxbuf;
> >       struct bpf_prog *prog;
> >
> >       /* Check packet size. Note LRO doesn't use linear SKB */
> > @@ -249,9 +250,9 @@ struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,
> >        */
> >       WARN_ON_ONCE(head_offset);
> >
> > -     xsk_buff_set_size(xdp, cqe_bcnt);
> > -     xsk_buff_dma_sync_for_cpu(xdp, rq->xsk_pool);
> > -     net_prefetch(xdp->data);
> > +     xsk_buff_set_size(&mxbuf->xdp, cqe_bcnt);
> > +     xsk_buff_dma_sync_for_cpu(&mxbuf->xdp, rq->xsk_pool);
> > +     net_prefetch(mxbuf->xdp.data);
> >
> >       /* Possible flows:
> >        * - XDP_REDIRECT to XSKMAP:
> > @@ -269,7 +270,7 @@ struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,
> >        */
> >
> >       prog = rcu_dereference(rq->xdp_prog);
> > -     if (likely(prog && mlx5e_xdp_handle(rq, NULL, prog, xdp))) {
> > +     if (likely(prog && mlx5e_xdp_handle(rq, NULL, prog, mxbuf))) {
> >               if (likely(__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)))
> >                       __set_bit(page_idx, wi->xdp_xmit_bitmap); /* non-atomic */
> >               return NULL; /* page/packet was consumed by XDP */
> > @@ -278,14 +279,14 @@ struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,
> >       /* XDP_PASS: copy the data from the UMEM to a new SKB and reuse the
> >        * frame. On SKB allocation failure, NULL is returned.
> >        */
> > -     return mlx5e_xsk_construct_skb(rq, xdp);
> > +     return mlx5e_xsk_construct_skb(rq, &mxbuf->xdp);
> >   }
> >
> >   struct sk_buff *mlx5e_xsk_skb_from_cqe_linear(struct mlx5e_rq *rq,
> >                                             struct mlx5e_wqe_frag_info *wi,
> >                                             u32 cqe_bcnt)
> >   {
> > -     struct xdp_buff *xdp = wi->au->xsk;
> > +     struct mlx5e_xdp_buff *mxbuf = wi->au->mxbuf;
> >       struct bpf_prog *prog;
> >
> >       /* wi->offset is not used in this function, because xdp->data and the
> > @@ -295,17 +296,17 @@ struct sk_buff *mlx5e_xsk_skb_from_cqe_linear(struct mlx5e_rq *rq,
> >        */
> >       WARN_ON_ONCE(wi->offset);
> >
> > -     xsk_buff_set_size(xdp, cqe_bcnt);
> > -     xsk_buff_dma_sync_for_cpu(xdp, rq->xsk_pool);
> > -     net_prefetch(xdp->data);
> > +     xsk_buff_set_size(&mxbuf->xdp, cqe_bcnt);
> > +     xsk_buff_dma_sync_for_cpu(&mxbuf->xdp, rq->xsk_pool);
> > +     net_prefetch(mxbuf->xdp.data);
> >
> >       prog = rcu_dereference(rq->xdp_prog);
> > -     if (likely(prog && mlx5e_xdp_handle(rq, NULL, prog, xdp)))
> > +     if (likely(prog && mlx5e_xdp_handle(rq, NULL, prog, mxbuf)))
> >               return NULL; /* page/packet was consumed by XDP */
> >
> >       /* XDP_PASS: copy the data from the UMEM to a new SKB. The frame reuse
> >        * will be handled by mlx5e_free_rx_wqe.
> >        * On SKB allocation failure, NULL is returned.
> >        */
> > -     return mlx5e_xsk_construct_skb(rq, xdp);
> > +     return mlx5e_xsk_construct_skb(rq, &mxbuf->xdp);
> >   }
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > index c8820ab22169..6affdddf5bcf 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> > @@ -1575,11 +1575,11 @@ struct sk_buff *mlx5e_build_linear_skb(struct mlx5e_rq *rq, void *va,
> >       return skb;
> >   }
> >
> > -static void mlx5e_fill_xdp_buff(struct mlx5e_rq *rq, void *va, u16 headroom,
> > -                             u32 len, struct xdp_buff *xdp)
> > +static void mlx5e_fill_mxbuf(struct mlx5e_rq *rq, void *va, u16 headroom,
> > +                          u32 len, struct mlx5e_xdp_buff *mxbuf)
> >   {
> > -     xdp_init_buff(xdp, rq->buff.frame0_sz, &rq->xdp_rxq);
> > -     xdp_prepare_buff(xdp, va, headroom, len, true);
> > +     xdp_init_buff(&mxbuf->xdp, rq->buff.frame0_sz, &rq->xdp_rxq);
> > +     xdp_prepare_buff(&mxbuf->xdp, va, headroom, len, true);
> >   }
> >
> >   static struct sk_buff *
> > @@ -1606,16 +1606,16 @@ mlx5e_skb_from_cqe_linear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi,
> >
> >       prog = rcu_dereference(rq->xdp_prog);
> >       if (prog) {
> > -             struct xdp_buff xdp;
> > +             struct mlx5e_xdp_buff mxbuf;
> >
> >               net_prefetchw(va); /* xdp_frame data area */
> > -             mlx5e_fill_xdp_buff(rq, va, rx_headroom, cqe_bcnt, &xdp);
> > -             if (mlx5e_xdp_handle(rq, au->page, prog, &xdp))
> > +             mlx5e_fill_mxbuf(rq, cqe, va, rx_headroom, cqe_bcnt, &mxbuf);
> > +             if (mlx5e_xdp_handle(rq, au->page, prog, &mxbuf))
> >                       return NULL; /* page/packet was consumed by XDP */
> >
> > -             rx_headroom = xdp.data - xdp.data_hard_start;
> > -             metasize = xdp.data - xdp.data_meta;
> > -             cqe_bcnt = xdp.data_end - xdp.data;
> > +             rx_headroom = mxbuf.xdp.data - mxbuf.xdp.data_hard_start;
> > +             metasize = mxbuf.xdp.data - mxbuf.xdp.data_meta;
> > +             cqe_bcnt = mxbuf.xdp.data_end - mxbuf.xdp.data;
> >       }
> >       frag_size = MLX5_SKB_FRAG_SZ(rx_headroom + cqe_bcnt);
> >       skb = mlx5e_build_linear_skb(rq, va, frag_size, rx_headroom, cqe_bcnt, metasize);
> > @@ -1637,9 +1637,9 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
> >       union mlx5e_alloc_unit *au = wi->au;
> >       u16 rx_headroom = rq->buff.headroom;
> >       struct skb_shared_info *sinfo;
> > +     struct mlx5e_xdp_buff mxbuf;
> >       u32 frag_consumed_bytes;
> >       struct bpf_prog *prog;
> > -     struct xdp_buff xdp;
> >       struct sk_buff *skb;
> >       dma_addr_t addr;
> >       u32 truesize;
> > @@ -1654,8 +1654,8 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
> >       net_prefetchw(va); /* xdp_frame data area */
> >       net_prefetch(va + rx_headroom);
> >
> > -     mlx5e_fill_xdp_buff(rq, va, rx_headroom, frag_consumed_bytes, &xdp);
> > -     sinfo = xdp_get_shared_info_from_buff(&xdp);
> > +     mlx5e_fill_mxbuf(rq, va, rx_headroom, frag_consumed_bytes, &mxbuf);
> > +     sinfo = xdp_get_shared_info_from_buff(&mxbuf.xdp);
> >       truesize = 0;
> >
> >       cqe_bcnt -= frag_consumed_bytes;
> > @@ -1673,13 +1673,13 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
> >               dma_sync_single_for_cpu(rq->pdev, addr + wi->offset,
> >                                       frag_consumed_bytes, rq->buff.map_dir);
> >
> > -             if (!xdp_buff_has_frags(&xdp)) {
> > +             if (!xdp_buff_has_frags(&mxbuf.xdp)) {
> >                       /* Init on the first fragment to avoid cold cache access
> >                        * when possible.
> >                        */
> >                       sinfo->nr_frags = 0;
> >                       sinfo->xdp_frags_size = 0;
> > -                     xdp_buff_set_frags_flag(&xdp);
> > +                     xdp_buff_set_frags_flag(&mxbuf.xdp);
> >               }
> >
> >               frag = &sinfo->frags[sinfo->nr_frags++];
> > @@ -1688,7 +1688,7 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
> >               skb_frag_size_set(frag, frag_consumed_bytes);
> >
> >               if (page_is_pfmemalloc(au->page))
> > -                     xdp_buff_set_frag_pfmemalloc(&xdp);
> > +                     xdp_buff_set_frag_pfmemalloc(&mxbuf.xdp);
> >
> >               sinfo->xdp_frags_size += frag_consumed_bytes;
> >               truesize += frag_info->frag_stride;
> > @@ -1701,7 +1701,7 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
> >       au = head_wi->au;
> >
> >       prog = rcu_dereference(rq->xdp_prog);
> > -     if (prog && mlx5e_xdp_handle(rq, au->page, prog, &xdp)) {
> > +     if (prog && mlx5e_xdp_handle(rq, au->page, prog, &mxbuf)) {
> >               if (test_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)) {
> >                       int i;
> >
> > @@ -1711,22 +1711,22 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
> >               return NULL; /* page/packet was consumed by XDP */
> >       }
> >
> > -     skb = mlx5e_build_linear_skb(rq, xdp.data_hard_start, rq->buff.frame0_sz,
> > -                                  xdp.data - xdp.data_hard_start,
> > -                                  xdp.data_end - xdp.data,
> > -                                  xdp.data - xdp.data_meta);
> > +     skb = mlx5e_build_linear_skb(rq, mxbuf.xdp.data_hard_start, rq->buff.frame0_sz,
> > +                                  mxbuf.xdp.data - mxbuf.xdp.data_hard_start,
> > +                                  mxbuf.xdp.data_end - mxbuf.xdp.data,
> > +                                  mxbuf.xdp.data - mxbuf.xdp.data_meta);
> >       if (unlikely(!skb))
> >               return NULL;
> >
> >       page_ref_inc(au->page);
> >
> > -     if (unlikely(xdp_buff_has_frags(&xdp))) {
> > +     if (unlikely(xdp_buff_has_frags(&mxbuf.xdp))) {
> >               int i;
> >
> >               /* sinfo->nr_frags is reset by build_skb, calculate again. */
> >               xdp_update_skb_shared_info(skb, wi - head_wi - 1,
> >                                          sinfo->xdp_frags_size, truesize,
> > -                                        xdp_buff_is_frag_pfmemalloc(&xdp));
> > +                                        xdp_buff_is_frag_pfmemalloc(&mxbuf.xdp));
> >
> >               for (i = 0; i < sinfo->nr_frags; i++) {
> >                       skb_frag_t *frag = &sinfo->frags[i];
> > @@ -2007,19 +2007,19 @@ mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
> >
> >       prog = rcu_dereference(rq->xdp_prog);
> >       if (prog) {
> > -             struct xdp_buff xdp;
> > +             struct mlx5e_xdp_buff mxbuf;
> >
> >               net_prefetchw(va); /* xdp_frame data area */
> > -             mlx5e_fill_xdp_buff(rq, va, rx_headroom, cqe_bcnt, &xdp);
> > -             if (mlx5e_xdp_handle(rq, au->page, prog, &xdp)) {
> > +             mlx5e_fill_mxbuf(rq, va, rx_headroom, cqe_bcnt, &mxbuf);
> > +             if (mlx5e_xdp_handle(rq, au->page, prog, &mxbuf)) {
> >                       if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags))
> >                               __set_bit(page_idx, wi->xdp_xmit_bitmap); /* non-atomic */
> >                       return NULL; /* page/packet was consumed by XDP */
> >               }
> >
> > -             rx_headroom = xdp.data - xdp.data_hard_start;
> > -             metasize = xdp.data - xdp.data_meta;
> > -             cqe_bcnt = xdp.data_end - xdp.data;
> > +             rx_headroom = mxbuf.xdp.data - mxbuf.xdp.data_hard_start;
> > +             metasize = mxbuf.xdp.data - mxbuf.xdp.data_meta;
> > +             cqe_bcnt = mxbuf.xdp.data_end - mxbuf.xdp.data;
> >       }
> >       frag_size = MLX5_SKB_FRAG_SZ(rx_headroom + cqe_bcnt);
> >       skb = mlx5e_build_linear_skb(rq, va, frag_size, rx_headroom, cqe_bcnt, metasize);


* [xdp-hints] Re: [PATCH bpf-next v7 15/17] net/mlx5e: Introduce wrapper for xdp_buff
  2023-01-12 19:10     ` Stanislav Fomichev
@ 2023-01-12 21:09       ` Toke Høiland-Jørgensen
  2023-01-12 21:55         ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 42+ messages in thread
From: Toke Høiland-Jørgensen @ 2023-01-12 21:09 UTC (permalink / raw)
  To: Stanislav Fomichev, Tariq Toukan
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, Tariq Toukan, Saeed Mahameed,
	David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev

Stanislav Fomichev <sdf@google.com> writes:

> On Thu, Jan 12, 2023 at 12:07 AM Tariq Toukan <ttoukan.linux@gmail.com> wrote:
>>
>>
>>
>> On 12/01/2023 2:32, Stanislav Fomichev wrote:
>> > From: Toke Høiland-Jørgensen <toke@redhat.com>
>> >
>> > Preparation for implementing HW metadata kfuncs. No functional change.
>> >
>> > Cc: Tariq Toukan <tariqt@nvidia.com>
>> > Cc: Saeed Mahameed <saeedm@nvidia.com>
>> > Cc: John Fastabend <john.fastabend@gmail.com>
>> > Cc: David Ahern <dsahern@gmail.com>
>> > Cc: Martin KaFai Lau <martin.lau@linux.dev>
>> > Cc: Jakub Kicinski <kuba@kernel.org>
>> > Cc: Willem de Bruijn <willemb@google.com>
>> > Cc: Jesper Dangaard Brouer <brouer@redhat.com>
>> > Cc: Anatoly Burakov <anatoly.burakov@intel.com>
>> > Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
>> > Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
>> > Cc: Maryam Tahhan <mtahhan@redhat.com>
>> > Cc: xdp-hints@xdp-project.net
>> > Cc: netdev@vger.kernel.org
>> > Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
>> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
>> > ---
>> >   drivers/net/ethernet/mellanox/mlx5/core/en.h  |  1 +
>> >   .../net/ethernet/mellanox/mlx5/core/en/xdp.c  |  3 +-
>> >   .../net/ethernet/mellanox/mlx5/core/en/xdp.h  |  6 +-
>> >   .../ethernet/mellanox/mlx5/core/en/xsk/rx.c   | 25 ++++----
>> >   .../net/ethernet/mellanox/mlx5/core/en_rx.c   | 58 +++++++++----------
>> >   5 files changed, 50 insertions(+), 43 deletions(-)
>> >
>> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
>> > index 2d77fb8a8a01..af663978d1b4 100644
>> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
>> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
>> > @@ -469,6 +469,7 @@ struct mlx5e_txqsq {
>> >   union mlx5e_alloc_unit {
>> >       struct page *page;
>> >       struct xdp_buff *xsk;
>> > +     struct mlx5e_xdp_buff *mxbuf;
>>
>> In XSK files below you mix usage of both alloc_units[page_idx].mxbuf and
>> alloc_units[page_idx].xsk, while both fields share the memory of a union.
>>
>> As struct mlx5e_xdp_buff wraps struct xdp_buff, I think that you just
>> need to change the existing xsk field type from struct xdp_buff *xsk
>> into struct mlx5e_xdp_buff *xsk and align the usage.
>
> Hmmm, good point. I'm actually not sure how it works currently.
> mlx5e_alloc_unit.mxbuf doesn't seem to be initialized anywhere? Toke,
> am I missing something?

It's initialised piecemeal in different places; but yeah, we're mixing
things a bit...

> I'm thinking about something like this:
>
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> index af663978d1b4..2d77fb8a8a01 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> @@ -469,7 +469,6 @@ struct mlx5e_txqsq {
>  union mlx5e_alloc_unit {
>         struct page *page;
>         struct xdp_buff *xsk;
> -       struct mlx5e_xdp_buff *mxbuf;
>  };

Hmm, for consistency with the non-XSK path we should rather go the other
direction and lose the xsk member, moving everything to mxbuf? Let me
give that a shot...

-Toke



* [xdp-hints] Re: [PATCH bpf-next v7 15/17] net/mlx5e: Introduce wrapper for xdp_buff
  2023-01-12 21:09       ` Toke Høiland-Jørgensen
@ 2023-01-12 21:55         ` Toke Høiland-Jørgensen
  2023-01-12 22:18           ` Stanislav Fomichev
  2023-01-13 20:53           ` Tariq Toukan
  0 siblings, 2 replies; 42+ messages in thread
From: Toke Høiland-Jørgensen @ 2023-01-12 21:55 UTC (permalink / raw)
  To: Stanislav Fomichev, Tariq Toukan
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, Tariq Toukan, Saeed Mahameed,
	David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev

Toke Høiland-Jørgensen <toke@redhat.com> writes:

> Stanislav Fomichev <sdf@google.com> writes:
>
>> On Thu, Jan 12, 2023 at 12:07 AM Tariq Toukan <ttoukan.linux@gmail.com> wrote:
>>>
>>>
>>>
>>> On 12/01/2023 2:32, Stanislav Fomichev wrote:
>>> > From: Toke Høiland-Jørgensen <toke@redhat.com>
>>> >
>>> > Preparation for implementing HW metadata kfuncs. No functional change.
>>> >
>>> > Cc: Tariq Toukan <tariqt@nvidia.com>
>>> > Cc: Saeed Mahameed <saeedm@nvidia.com>
>>> > Cc: John Fastabend <john.fastabend@gmail.com>
>>> > Cc: David Ahern <dsahern@gmail.com>
>>> > Cc: Martin KaFai Lau <martin.lau@linux.dev>
>>> > Cc: Jakub Kicinski <kuba@kernel.org>
>>> > Cc: Willem de Bruijn <willemb@google.com>
>>> > Cc: Jesper Dangaard Brouer <brouer@redhat.com>
>>> > Cc: Anatoly Burakov <anatoly.burakov@intel.com>
>>> > Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
>>> > Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
>>> > Cc: Maryam Tahhan <mtahhan@redhat.com>
>>> > Cc: xdp-hints@xdp-project.net
>>> > Cc: netdev@vger.kernel.org
>>> > Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
>>> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
>>> > ---
>>> >   drivers/net/ethernet/mellanox/mlx5/core/en.h  |  1 +
>>> >   .../net/ethernet/mellanox/mlx5/core/en/xdp.c  |  3 +-
>>> >   .../net/ethernet/mellanox/mlx5/core/en/xdp.h  |  6 +-
>>> >   .../ethernet/mellanox/mlx5/core/en/xsk/rx.c   | 25 ++++----
>>> >   .../net/ethernet/mellanox/mlx5/core/en_rx.c   | 58 +++++++++----------
>>> >   5 files changed, 50 insertions(+), 43 deletions(-)
>>> >
>>> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
>>> > index 2d77fb8a8a01..af663978d1b4 100644
>>> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
>>> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
>>> > @@ -469,6 +469,7 @@ struct mlx5e_txqsq {
>>> >   union mlx5e_alloc_unit {
>>> >       struct page *page;
>>> >       struct xdp_buff *xsk;
>>> > +     struct mlx5e_xdp_buff *mxbuf;
>>>
>>> In XSK files below you mix usage of both alloc_units[page_idx].mxbuf and
>>> alloc_units[page_idx].xsk, while both fields share the memory of a union.
>>>
>>> As struct mlx5e_xdp_buff wraps struct xdp_buff, I think that you just
>>> need to change the existing xsk field type from struct xdp_buff *xsk
>>> into struct mlx5e_xdp_buff *xsk and align the usage.
>>
>> Hmmm, good point. I'm actually not sure how it works currently.
>> mlx5e_alloc_unit.mxbuf doesn't seem to be initialized anywhere? Toke,
>> am I missing something?
>
> It's initialised piecemeal in different places; but yeah, we're mixing
> things a bit...
>
>> I'm thinking about something like this:
>>
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
>> index af663978d1b4..2d77fb8a8a01 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
>> @@ -469,7 +469,6 @@ struct mlx5e_txqsq {
>>  union mlx5e_alloc_unit {
>>         struct page *page;
>>         struct xdp_buff *xsk;
>> -       struct mlx5e_xdp_buff *mxbuf;
>>  };
>
> Hmm, for consistency with the non-XSK path we should rather go the other
> direction and lose the xsk member, moving everything to mxbuf? Let me
> give that a shot...

Something like the below?

-Toke

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 6de02d8aeab8..cb9cdb6421c5 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -468,7 +468,6 @@ struct mlx5e_txqsq {
 
 union mlx5e_alloc_unit {
 	struct page *page;
-	struct xdp_buff *xsk;
 	struct mlx5e_xdp_buff *mxbuf;
 };
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
index cb568c62aba0..95694a25ec31 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
@@ -33,6 +33,7 @@
 #define __MLX5_EN_XDP_H__
 
 #include <linux/indirect_call_wrapper.h>
+#include <net/xdp_sock_drv.h>
 
 #include "en.h"
 #include "en/txrx.h"
@@ -112,6 +113,21 @@ static inline void mlx5e_xmit_xdp_doorbell(struct mlx5e_xdpsq *sq)
 	}
 }
 
+static inline struct mlx5e_xdp_buff *mlx5e_xsk_buff_alloc(struct xsk_buff_pool *pool)
+{
+	return (struct mlx5e_xdp_buff *)xsk_buff_alloc(pool);
+}
+
+static inline void mlx5e_xsk_buff_free(struct mlx5e_xdp_buff *mxbuf)
+{
+	xsk_buff_free(&mxbuf->xdp);
+}
+
+static inline dma_addr_t mlx5e_xsk_buff_xdp_get_frame_dma(struct mlx5e_xdp_buff *mxbuf)
+{
+	return xsk_buff_xdp_get_frame_dma(&mxbuf->xdp);
+}
+
 /* Enable inline WQEs to shift some load from a congested HCA (HW) to
  * a less congested cpu (SW).
  */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
index 8bf3029abd3c..1f166dbb7f22 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
@@ -3,7 +3,6 @@
 
 #include "rx.h"
 #include "en/xdp.h"
-#include <net/xdp_sock_drv.h>
 #include <linux/filter.h>
 
 /* RX data path */
@@ -21,7 +20,7 @@ int mlx5e_xsk_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
 	if (unlikely(!xsk_buff_can_alloc(rq->xsk_pool, rq->mpwqe.pages_per_wqe)))
 		goto err;
 
-	BUILD_BUG_ON(sizeof(wi->alloc_units[0]) != sizeof(wi->alloc_units[0].xsk));
+	BUILD_BUG_ON(sizeof(wi->alloc_units[0]) != sizeof(wi->alloc_units[0].mxbuf));
 	XSK_CHECK_PRIV_TYPE(struct mlx5e_xdp_buff);
 	batch = xsk_buff_alloc_batch(rq->xsk_pool, (struct xdp_buff **)wi->alloc_units,
 				     rq->mpwqe.pages_per_wqe);
@@ -33,8 +32,8 @@ int mlx5e_xsk_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
 	 * the first error, which will mean there are no more valid descriptors.
 	 */
 	for (; batch < rq->mpwqe.pages_per_wqe; batch++) {
-		wi->alloc_units[batch].xsk = xsk_buff_alloc(rq->xsk_pool);
-		if (unlikely(!wi->alloc_units[batch].xsk))
+		wi->alloc_units[batch].mxbuf = mlx5e_xsk_buff_alloc(rq->xsk_pool);
+		if (unlikely(!wi->alloc_units[batch].mxbuf))
 			goto err_reuse_batch;
 	}
 
@@ -44,7 +43,7 @@ int mlx5e_xsk_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
 
 	if (likely(rq->mpwqe.umr_mode == MLX5E_MPWRQ_UMR_MODE_ALIGNED)) {
 		for (i = 0; i < batch; i++) {
-			dma_addr_t addr = xsk_buff_xdp_get_frame_dma(wi->alloc_units[i].xsk);
+			dma_addr_t addr = mlx5e_xsk_buff_xdp_get_frame_dma(wi->alloc_units[i].mxbuf);
 
 			umr_wqe->inline_mtts[i] = (struct mlx5_mtt) {
 				.ptag = cpu_to_be64(addr | MLX5_EN_WR),
@@ -53,7 +52,7 @@ int mlx5e_xsk_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
 		}
 	} else if (unlikely(rq->mpwqe.umr_mode == MLX5E_MPWRQ_UMR_MODE_UNALIGNED)) {
 		for (i = 0; i < batch; i++) {
-			dma_addr_t addr = xsk_buff_xdp_get_frame_dma(wi->alloc_units[i].xsk);
+			dma_addr_t addr = mlx5e_xsk_buff_xdp_get_frame_dma(wi->alloc_units[i].mxbuf);
 
 			umr_wqe->inline_ksms[i] = (struct mlx5_ksm) {
 				.key = rq->mkey_be,
@@ -65,7 +64,7 @@ int mlx5e_xsk_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
 		u32 mapping_size = 1 << (rq->mpwqe.page_shift - 2);
 
 		for (i = 0; i < batch; i++) {
-			dma_addr_t addr = xsk_buff_xdp_get_frame_dma(wi->alloc_units[i].xsk);
+			dma_addr_t addr = mlx5e_xsk_buff_xdp_get_frame_dma(wi->alloc_units[i].mxbuf);
 
 			umr_wqe->inline_ksms[i << 2] = (struct mlx5_ksm) {
 				.key = rq->mkey_be,
@@ -91,7 +90,7 @@ int mlx5e_xsk_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
 		__be32 frame_size = cpu_to_be32(rq->xsk_pool->chunk_size);
 
 		for (i = 0; i < batch; i++) {
-			dma_addr_t addr = xsk_buff_xdp_get_frame_dma(wi->alloc_units[i].xsk);
+			dma_addr_t addr = mlx5e_xsk_buff_xdp_get_frame_dma(wi->alloc_units[i].mxbuf);
 
 			umr_wqe->inline_klms[i << 1] = (struct mlx5_klm) {
 				.key = rq->mkey_be,
@@ -137,7 +136,7 @@ int mlx5e_xsk_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
 
 err_reuse_batch:
 	while (--batch >= 0)
-		xsk_buff_free(wi->alloc_units[batch].xsk);
+		mlx5e_xsk_buff_free(wi->alloc_units[batch].mxbuf);
 
 err:
 	rq->stats->buff_alloc_err++;
@@ -156,7 +155,7 @@ int mlx5e_xsk_alloc_rx_wqes_batched(struct mlx5e_rq *rq, u16 ix, int wqe_bulk)
 	 * allocate XDP buffers straight into alloc_units.
 	 */
 	BUILD_BUG_ON(sizeof(rq->wqe.alloc_units[0]) !=
-		     sizeof(rq->wqe.alloc_units[0].xsk));
+		     sizeof(rq->wqe.alloc_units[0].mxbuf));
 	buffs = (struct xdp_buff **)rq->wqe.alloc_units;
 	contig = mlx5_wq_cyc_get_size(wq) - ix;
 	if (wqe_bulk <= contig) {
@@ -177,8 +176,9 @@ int mlx5e_xsk_alloc_rx_wqes_batched(struct mlx5e_rq *rq, u16 ix, int wqe_bulk)
 		/* Assumes log_num_frags == 0. */
 		frag = &rq->wqe.frags[j];
 
-		addr = xsk_buff_xdp_get_frame_dma(frag->au->xsk);
+		addr = mlx5e_xsk_buff_xdp_get_frame_dma(frag->au->mxbuf);
 		wqe->data[0].addr = cpu_to_be64(addr + rq->buff.headroom);
+		frag->au->mxbuf->rq = rq;
 	}
 
 	return alloc;
@@ -199,12 +199,13 @@ int mlx5e_xsk_alloc_rx_wqes(struct mlx5e_rq *rq, u16 ix, int wqe_bulk)
 		/* Assumes log_num_frags == 0. */
 		frag = &rq->wqe.frags[j];
 
-		frag->au->xsk = xsk_buff_alloc(rq->xsk_pool);
-		if (unlikely(!frag->au->xsk))
+		frag->au->mxbuf = mlx5e_xsk_buff_alloc(rq->xsk_pool);
+		if (unlikely(!frag->au->mxbuf))
 			return i;
 
-		addr = xsk_buff_xdp_get_frame_dma(frag->au->xsk);
+		addr = mlx5e_xsk_buff_xdp_get_frame_dma(frag->au->mxbuf);
 		wqe->data[0].addr = cpu_to_be64(addr + rq->buff.headroom);
+		frag->au->mxbuf->rq = rq;
 	}
 
 	return wqe_bulk;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 7b08653be000..4313165709cb 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -41,7 +41,6 @@
 #include <net/gro.h>
 #include <net/udp.h>
 #include <net/tcp.h>
-#include <net/xdp_sock_drv.h>
 #include "en.h"
 #include "en/txrx.h"
 #include "en_tc.h"
@@ -434,7 +433,7 @@ static inline void mlx5e_free_rx_wqe(struct mlx5e_rq *rq,
 		 * put into the Reuse Ring, because there is no way to return
 		 * the page to the userspace when the interface goes down.
 		 */
-		xsk_buff_free(wi->au->xsk);
+		mlx5e_xsk_buff_free(wi->au->mxbuf);
 		return;
 	}
 
@@ -515,7 +514,7 @@ mlx5e_free_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi, bool recycle
 		 */
 		for (i = 0; i < rq->mpwqe.pages_per_wqe; i++)
 			if (no_xdp_xmit || !test_bit(i, wi->xdp_xmit_bitmap))
-				xsk_buff_free(alloc_units[i].xsk);
+				mlx5e_xsk_buff_free(alloc_units[i].mxbuf);
 	} else {
 		for (i = 0; i < rq->mpwqe.pages_per_wqe; i++)
 			if (no_xdp_xmit || !test_bit(i, wi->xdp_xmit_bitmap))



* [xdp-hints] Re: [PATCH bpf-next v7 15/17] net/mlx5e: Introduce wrapper for xdp_buff
  2023-01-12 21:55         ` Toke Høiland-Jørgensen
@ 2023-01-12 22:18           ` Stanislav Fomichev
  2023-01-12 22:29             ` Toke Høiland-Jørgensen
  2023-01-13 20:53           ` Tariq Toukan
  1 sibling, 1 reply; 42+ messages in thread
From: Stanislav Fomichev @ 2023-01-12 22:18 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Tariq Toukan, bpf, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, haoluo, jolsa, Tariq Toukan,
	Saeed Mahameed, David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev

On Thu, Jan 12, 2023 at 1:55 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Toke Høiland-Jørgensen <toke@redhat.com> writes:
>
> > Stanislav Fomichev <sdf@google.com> writes:
> >
> >> On Thu, Jan 12, 2023 at 12:07 AM Tariq Toukan <ttoukan.linux@gmail.com> wrote:
> >>>
> >>>
> >>>
> >>> On 12/01/2023 2:32, Stanislav Fomichev wrote:
> >>> > From: Toke Høiland-Jørgensen <toke@redhat.com>
> >>> >
> >>> > Preparation for implementing HW metadata kfuncs. No functional change.
> >>> >
> >>> > Cc: Tariq Toukan <tariqt@nvidia.com>
> >>> > Cc: Saeed Mahameed <saeedm@nvidia.com>
> >>> > Cc: John Fastabend <john.fastabend@gmail.com>
> >>> > Cc: David Ahern <dsahern@gmail.com>
> >>> > Cc: Martin KaFai Lau <martin.lau@linux.dev>
> >>> > Cc: Jakub Kicinski <kuba@kernel.org>
> >>> > Cc: Willem de Bruijn <willemb@google.com>
> >>> > Cc: Jesper Dangaard Brouer <brouer@redhat.com>
> >>> > Cc: Anatoly Burakov <anatoly.burakov@intel.com>
> >>> > Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
> >>> > Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
> >>> > Cc: Maryam Tahhan <mtahhan@redhat.com>
> >>> > Cc: xdp-hints@xdp-project.net
> >>> > Cc: netdev@vger.kernel.org
> >>> > Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
> >>> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> >>> > ---
> >>> >   drivers/net/ethernet/mellanox/mlx5/core/en.h  |  1 +
> >>> >   .../net/ethernet/mellanox/mlx5/core/en/xdp.c  |  3 +-
> >>> >   .../net/ethernet/mellanox/mlx5/core/en/xdp.h  |  6 +-
> >>> >   .../ethernet/mellanox/mlx5/core/en/xsk/rx.c   | 25 ++++----
> >>> >   .../net/ethernet/mellanox/mlx5/core/en_rx.c   | 58 +++++++++----------
> >>> >   5 files changed, 50 insertions(+), 43 deletions(-)
> >>> >
> >>> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> >>> > index 2d77fb8a8a01..af663978d1b4 100644
> >>> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
> >>> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> >>> > @@ -469,6 +469,7 @@ struct mlx5e_txqsq {
> >>> >   union mlx5e_alloc_unit {
> >>> >       struct page *page;
> >>> >       struct xdp_buff *xsk;
> >>> > +     struct mlx5e_xdp_buff *mxbuf;
> >>>
> >>> In XSK files below you mix usage of both alloc_units[page_idx].mxbuf and
> >>> alloc_units[page_idx].xsk, while both fields share the memory of a union.
> >>>
> >>> As struct mlx5e_xdp_buff wraps struct xdp_buff, I think that you just
> >>> need to change the existing xsk field type from struct xdp_buff *xsk
> >>> into struct mlx5e_xdp_buff *xsk and align the usage.
> >>
> >> Hmmm, good point. I'm actually not sure how it works currently.
> >> mlx5e_alloc_unit.mxbuf doesn't seem to be initialized anywhere? Toke,
> >> am I missing something?
> >
> > It's initialised piecemeal in different places; but yeah, we're mixing
> > things a bit...
> >
> >> I'm thinking about something like this:

Seems more invasive? I don't care much tbf, but what's wrong with
keeping the 'xdp_buff xsk' member and using it consistently?
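
For reference, a rough side-by-side of the two union shapes under
discussion (a sketch only; the _a/_b suffixes are made up so both
compile in one file):

struct page;
struct xdp_buff;
struct mlx5e_xdp_buff;

/* Option A: keep the xsk member; recover the wrapper with
 * container_of() at each use site.
 */
union mlx5e_alloc_unit_a {
	struct page *page;
	struct xdp_buff *xsk;
};

/* Option B: drop xsk; store the wrapper pointer and do the single
 * type-breaking cast once, at allocation time.
 */
union mlx5e_alloc_unit_b {
	struct page *page;
	struct mlx5e_xdp_buff *mxbuf;
};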

> >> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> >> index af663978d1b4..2d77fb8a8a01 100644
> >> --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
> >> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> >> @@ -469,7 +469,6 @@ struct mlx5e_txqsq {
> >>  union mlx5e_alloc_unit {
> >>         struct page *page;
> >>         struct xdp_buff *xsk;
> >> -       struct mlx5e_xdp_buff *mxbuf;
> >>  };
> >
> > Hmm, for consistency with the non-XSK path we should rather go the other
> > direction and lose the xsk member, moving everything to mxbuf? Let me
> > give that a shot...
>
> Something like the below?
>
> -Toke
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> index 6de02d8aeab8..cb9cdb6421c5 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> @@ -468,7 +468,6 @@ struct mlx5e_txqsq {
>
>  union mlx5e_alloc_unit {
>         struct page *page;
> -       struct xdp_buff *xsk;
>         struct mlx5e_xdp_buff *mxbuf;
>  };
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
> index cb568c62aba0..95694a25ec31 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
> @@ -33,6 +33,7 @@
>  #define __MLX5_EN_XDP_H__
>
>  #include <linux/indirect_call_wrapper.h>
> +#include <net/xdp_sock_drv.h>
>
>  #include "en.h"
>  #include "en/txrx.h"
> @@ -112,6 +113,21 @@ static inline void mlx5e_xmit_xdp_doorbell(struct mlx5e_xdpsq *sq)
>         }
>  }
>
> +static inline struct mlx5e_xdp_buff *mlx5e_xsk_buff_alloc(struct xsk_buff_pool *pool)
> +{
> +       return (struct mlx5e_xdp_buff *)xsk_buff_alloc(pool);
> +}
> +
> +static inline void mlx5e_xsk_buff_free(struct mlx5e_xdp_buff *mxbuf)
> +{
> +       xsk_buff_free(&mxbuf->xdp);
> +}
> +
> +static inline dma_addr_t mlx5e_xsk_buff_xdp_get_frame_dma(struct mlx5e_xdp_buff *mxbuf)
> +{
> +       return xsk_buff_xdp_get_frame_dma(&mxbuf->xdp);
> +}
> +
>  /* Enable inline WQEs to shift some load from a congested HCA (HW) to
>   * a less congested cpu (SW).
>   */
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
> index 8bf3029abd3c..1f166dbb7f22 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
> @@ -3,7 +3,6 @@
>
>  #include "rx.h"
>  #include "en/xdp.h"
> -#include <net/xdp_sock_drv.h>
>  #include <linux/filter.h>
>
>  /* RX data path */
> @@ -21,7 +20,7 @@ int mlx5e_xsk_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
>         if (unlikely(!xsk_buff_can_alloc(rq->xsk_pool, rq->mpwqe.pages_per_wqe)))
>                 goto err;
>
> -       BUILD_BUG_ON(sizeof(wi->alloc_units[0]) != sizeof(wi->alloc_units[0].xsk));
> +       BUILD_BUG_ON(sizeof(wi->alloc_units[0]) != sizeof(wi->alloc_units[0].mxbuf));
>         XSK_CHECK_PRIV_TYPE(struct mlx5e_xdp_buff);
>         batch = xsk_buff_alloc_batch(rq->xsk_pool, (struct xdp_buff **)wi->alloc_units,
>                                      rq->mpwqe.pages_per_wqe);
> @@ -33,8 +32,8 @@ int mlx5e_xsk_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
>          * the first error, which will mean there are no more valid descriptors.
>          */
>         for (; batch < rq->mpwqe.pages_per_wqe; batch++) {
> -               wi->alloc_units[batch].xsk = xsk_buff_alloc(rq->xsk_pool);
> -               if (unlikely(!wi->alloc_units[batch].xsk))
> +               wi->alloc_units[batch].mxbuf = mlx5e_xsk_buff_alloc(rq->xsk_pool);
> +               if (unlikely(!wi->alloc_units[batch].mxbuf))
>                         goto err_reuse_batch;
>         }
>
> @@ -44,7 +43,7 @@ int mlx5e_xsk_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
>
>         if (likely(rq->mpwqe.umr_mode == MLX5E_MPWRQ_UMR_MODE_ALIGNED)) {
>                 for (i = 0; i < batch; i++) {
> -                       dma_addr_t addr = xsk_buff_xdp_get_frame_dma(wi->alloc_units[i].xsk);
> +                       dma_addr_t addr = mlx5e_xsk_buff_xdp_get_frame_dma(wi->alloc_units[i].mxbuf);
>
>                         umr_wqe->inline_mtts[i] = (struct mlx5_mtt) {
>                                 .ptag = cpu_to_be64(addr | MLX5_EN_WR),
> @@ -53,7 +52,7 @@ int mlx5e_xsk_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
>                 }
>         } else if (unlikely(rq->mpwqe.umr_mode == MLX5E_MPWRQ_UMR_MODE_UNALIGNED)) {
>                 for (i = 0; i < batch; i++) {
> -                       dma_addr_t addr = xsk_buff_xdp_get_frame_dma(wi->alloc_units[i].xsk);
> +                       dma_addr_t addr = mlx5e_xsk_buff_xdp_get_frame_dma(wi->alloc_units[i].mxbuf);
>
>                         umr_wqe->inline_ksms[i] = (struct mlx5_ksm) {
>                                 .key = rq->mkey_be,
> @@ -65,7 +64,7 @@ int mlx5e_xsk_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
>                 u32 mapping_size = 1 << (rq->mpwqe.page_shift - 2);
>
>                 for (i = 0; i < batch; i++) {
> -                       dma_addr_t addr = xsk_buff_xdp_get_frame_dma(wi->alloc_units[i].xsk);
> +                       dma_addr_t addr = mlx5e_xsk_buff_xdp_get_frame_dma(wi->alloc_units[i].mxbuf);
>
>                         umr_wqe->inline_ksms[i << 2] = (struct mlx5_ksm) {
>                                 .key = rq->mkey_be,
> @@ -91,7 +90,7 @@ int mlx5e_xsk_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
>                 __be32 frame_size = cpu_to_be32(rq->xsk_pool->chunk_size);
>
>                 for (i = 0; i < batch; i++) {
> -                       dma_addr_t addr = xsk_buff_xdp_get_frame_dma(wi->alloc_units[i].xsk);
> +                       dma_addr_t addr = mlx5e_xsk_buff_xdp_get_frame_dma(wi->alloc_units[i].mxbuf);
>
>                         umr_wqe->inline_klms[i << 1] = (struct mlx5_klm) {
>                                 .key = rq->mkey_be,
> @@ -137,7 +136,7 @@ int mlx5e_xsk_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
>
>  err_reuse_batch:
>         while (--batch >= 0)
> -               xsk_buff_free(wi->alloc_units[batch].xsk);
> +               mlx5e_xsk_buff_free(wi->alloc_units[batch].mxbuf);
>
>  err:
>         rq->stats->buff_alloc_err++;
> @@ -156,7 +155,7 @@ int mlx5e_xsk_alloc_rx_wqes_batched(struct mlx5e_rq *rq, u16 ix, int wqe_bulk)
>          * allocate XDP buffers straight into alloc_units.
>          */
>         BUILD_BUG_ON(sizeof(rq->wqe.alloc_units[0]) !=
> -                    sizeof(rq->wqe.alloc_units[0].xsk));
> +                    sizeof(rq->wqe.alloc_units[0].mxbuf));
>         buffs = (struct xdp_buff **)rq->wqe.alloc_units;
>         contig = mlx5_wq_cyc_get_size(wq) - ix;
>         if (wqe_bulk <= contig) {
> @@ -177,8 +176,9 @@ int mlx5e_xsk_alloc_rx_wqes_batched(struct mlx5e_rq *rq, u16 ix, int wqe_bulk)
>                 /* Assumes log_num_frags == 0. */
>                 frag = &rq->wqe.frags[j];
>
> -               addr = xsk_buff_xdp_get_frame_dma(frag->au->xsk);
> +               addr = mlx5e_xsk_buff_xdp_get_frame_dma(frag->au->mxbuf);
>                 wqe->data[0].addr = cpu_to_be64(addr + rq->buff.headroom);
> +               frag->au->mxbuf->rq = rq;
>         }
>
>         return alloc;
> @@ -199,12 +199,13 @@ int mlx5e_xsk_alloc_rx_wqes(struct mlx5e_rq *rq, u16 ix, int wqe_bulk)
>                 /* Assumes log_num_frags == 0. */
>                 frag = &rq->wqe.frags[j];
>
> -               frag->au->xsk = xsk_buff_alloc(rq->xsk_pool);
> -               if (unlikely(!frag->au->xsk))
> +               frag->au->mxbuf = mlx5e_xsk_buff_alloc(rq->xsk_pool);
> +               if (unlikely(!frag->au->mxbuf))
>                         return i;
>
> -               addr = xsk_buff_xdp_get_frame_dma(frag->au->xsk);
> +               addr = mlx5e_xsk_buff_xdp_get_frame_dma(frag->au->mxbuf);
>                 wqe->data[0].addr = cpu_to_be64(addr + rq->buff.headroom);
> +               frag->au->mxbuf->rq = rq;
>         }
>
>         return wqe_bulk;
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> index 7b08653be000..4313165709cb 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> @@ -41,7 +41,6 @@
>  #include <net/gro.h>
>  #include <net/udp.h>
>  #include <net/tcp.h>
> -#include <net/xdp_sock_drv.h>
>  #include "en.h"
>  #include "en/txrx.h"
>  #include "en_tc.h"
> @@ -434,7 +433,7 @@ static inline void mlx5e_free_rx_wqe(struct mlx5e_rq *rq,
>                  * put into the Reuse Ring, because there is no way to return
>                  * the page to the userspace when the interface goes down.
>                  */
> -               xsk_buff_free(wi->au->xsk);
> +               mlx5e_xsk_buff_free(wi->au->mxbuf);
>                 return;
>         }
>
> @@ -515,7 +514,7 @@ mlx5e_free_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi, bool recycle
>                  */
>                 for (i = 0; i < rq->mpwqe.pages_per_wqe; i++)
>                         if (no_xdp_xmit || !test_bit(i, wi->xdp_xmit_bitmap))
> -                               xsk_buff_free(alloc_units[i].xsk);
> +                               mlx5e_xsk_buff_free(alloc_units[i].mxbuf);
>         } else {
>                 for (i = 0; i < rq->mpwqe.pages_per_wqe; i++)
>                         if (no_xdp_xmit || !test_bit(i, wi->xdp_xmit_bitmap))
>


* [xdp-hints] Re: [PATCH bpf-next v7 15/17] net/mlx5e: Introduce wrapper for xdp_buff
  2023-01-12 22:18           ` Stanislav Fomichev
@ 2023-01-12 22:29             ` Toke Høiland-Jørgensen
  2023-01-13 20:55               ` Tariq Toukan
  0 siblings, 1 reply; 42+ messages in thread
From: Toke Høiland-Jørgensen @ 2023-01-12 22:29 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Tariq Toukan, bpf, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, haoluo, jolsa, Tariq Toukan,
	Saeed Mahameed, David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev

Stanislav Fomichev <sdf@google.com> writes:

> On Thu, Jan 12, 2023 at 1:55 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Toke Høiland-Jørgensen <toke@redhat.com> writes:
>>
>> > Stanislav Fomichev <sdf@google.com> writes:
>> >
>> >> On Thu, Jan 12, 2023 at 12:07 AM Tariq Toukan <ttoukan.linux@gmail.com> wrote:
>> >>>
>> >>>
>> >>>
>> >>> On 12/01/2023 2:32, Stanislav Fomichev wrote:
>> >>> > From: Toke Høiland-Jørgensen <toke@redhat.com>
>> >>> >
>> >>> > Preparation for implementing HW metadata kfuncs. No functional change.
>> >>> >
>> >>> > Cc: Tariq Toukan <tariqt@nvidia.com>
>> >>> > Cc: Saeed Mahameed <saeedm@nvidia.com>
>> >>> > Cc: John Fastabend <john.fastabend@gmail.com>
>> >>> > Cc: David Ahern <dsahern@gmail.com>
>> >>> > Cc: Martin KaFai Lau <martin.lau@linux.dev>
>> >>> > Cc: Jakub Kicinski <kuba@kernel.org>
>> >>> > Cc: Willem de Bruijn <willemb@google.com>
>> >>> > Cc: Jesper Dangaard Brouer <brouer@redhat.com>
>> >>> > Cc: Anatoly Burakov <anatoly.burakov@intel.com>
>> >>> > Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
>> >>> > Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
>> >>> > Cc: Maryam Tahhan <mtahhan@redhat.com>
>> >>> > Cc: xdp-hints@xdp-project.net
>> >>> > Cc: netdev@vger.kernel.org
>> >>> > Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
>> >>> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
>> >>> > ---
>> >>> >   drivers/net/ethernet/mellanox/mlx5/core/en.h  |  1 +
>> >>> >   .../net/ethernet/mellanox/mlx5/core/en/xdp.c  |  3 +-
>> >>> >   .../net/ethernet/mellanox/mlx5/core/en/xdp.h  |  6 +-
>> >>> >   .../ethernet/mellanox/mlx5/core/en/xsk/rx.c   | 25 ++++----
>> >>> >   .../net/ethernet/mellanox/mlx5/core/en_rx.c   | 58 +++++++++----------
>> >>> >   5 files changed, 50 insertions(+), 43 deletions(-)
>> >>> >
>> >>> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
>> >>> > index 2d77fb8a8a01..af663978d1b4 100644
>> >>> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
>> >>> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
>> >>> > @@ -469,6 +469,7 @@ struct mlx5e_txqsq {
>> >>> >   union mlx5e_alloc_unit {
>> >>> >       struct page *page;
>> >>> >       struct xdp_buff *xsk;
>> >>> > +     struct mlx5e_xdp_buff *mxbuf;
>> >>>
>> >>> In XSK files below you mix usage of both alloc_units[page_idx].mxbuf and
>> >>> alloc_units[page_idx].xsk, while both fields share the memory of a union.
>> >>>
>> >>> As struct mlx5e_xdp_buff wraps struct xdp_buff, I think that you just
>> >>> need to change the existing xsk field type from struct xdp_buff *xsk
>> >>> into struct mlx5e_xdp_buff *xsk and align the usage.
>> >>
>> >> Hmmm, good point. I'm actually not sure how it works currently.
>> >> mlx5e_alloc_unit.mxbuf doesn't seem to be initialized anywhere? Toke,
>> >> am I missing something?
>> >
>> > It's initialised piecemeal in different places; but yeah, we're mixing
>> > things a bit...
>> >
>> >> I'm thinking about something like this:
>
> Seems more invasive? I don't care much tbf, but what's wrong with
> keeping 'xdp_buff xsk' member and use it consistently?

Yeah, it's more invasive, but it's also more consistent with the non-xsk
path where every usage of struct xdp_buff is replaced with the wrapping
struct?

Both will work, I suppose (in fact I think the resulting code will be
more or less identical), so it's more a matter of which one is easier to
read and where we put the type-safety-breaking casts.
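
Concretely, the difference is where that one cast sits; fragments
lifted from the two proposals above (not a complete patch):

/* Variant A: container_of() at each use site. */
struct mlx5e_xdp_buff *mxbuf = container_of(wi->alloc_units[page_idx].xsk,
					    struct mlx5e_xdp_buff, xdp);

/* Variant B: a single cast, centralised in the allocation helper. */
static inline struct mlx5e_xdp_buff *mlx5e_xsk_buff_alloc(struct xsk_buff_pool *pool)
{
	return (struct mlx5e_xdp_buff *)xsk_buff_alloc(pool);
}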

I can live with either one (just note you'll have to move the
'mxbuf->rq' initialisation next to the one for mxbuf->cqe for yours, but
that's probably fine too). Let's see which way Tariq prefers...
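
A rough idea of what that would look like in the container_of()
variant (hypothetical placement; the cqe/rq fields come from the
"Support RX XDP metadata" patch):

/* e.g. in mlx5e_xsk_skb_from_cqe_linear(): */
struct mlx5e_xdp_buff *mxbuf = container_of(wi->au->xsk,
					    struct mlx5e_xdp_buff, xdp);
mxbuf->cqe = cqe;	/* initialised side by side, as noted above */
mxbuf->rq  = rq;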

-Toke



* [xdp-hints] Re: [PATCH bpf-next v7 16/17] net/mlx5e: Support RX XDP metadata
  2023-01-12 19:09     ` Stanislav Fomichev
@ 2023-01-13 20:25       ` Tariq Toukan
  0 siblings, 0 replies; 42+ messages in thread
From: Tariq Toukan @ 2023-01-13 20:25 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, Toke Høiland-Jørgensen,
	Tariq Toukan, Saeed Mahameed, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev



On 12/01/2023 21:09, Stanislav Fomichev wrote:
> On Thu, Jan 12, 2023 at 12:13 AM Tariq Toukan <ttoukan.linux@gmail.com> wrote:
>>
>>
>>
>> On 12/01/2023 2:32, Stanislav Fomichev wrote:
>>> From: Toke Høiland-Jørgensen <toke@redhat.com>
>>>
>>> Support RX hash and timestamp metadata kfuncs. We need to pass in the cqe
>>> pointer to the mlx5e_skb_from* functions so it can be retrieved from the
>>> XDP ctx to do this.
>>>
>>> Cc: Tariq Toukan <tariqt@nvidia.com>
>>> Cc: Saeed Mahameed <saeedm@nvidia.com>
>>> Cc: John Fastabend <john.fastabend@gmail.com>
>>> Cc: David Ahern <dsahern@gmail.com>
>>> Cc: Martin KaFai Lau <martin.lau@linux.dev>
>>> Cc: Jakub Kicinski <kuba@kernel.org>
>>> Cc: Willem de Bruijn <willemb@google.com>
>>> Cc: Jesper Dangaard Brouer <brouer@redhat.com>
>>> Cc: Anatoly Burakov <anatoly.burakov@intel.com>
>>> Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
>>> Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
>>> Cc: Maryam Tahhan <mtahhan@redhat.com>
>>> Cc: xdp-hints@xdp-project.net
>>> Cc: netdev@vger.kernel.org
>>> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
>>> Signed-off-by: Stanislav Fomichev <sdf@google.com>
>>> ---
>>>    drivers/net/ethernet/mellanox/mlx5/core/en.h  |  5 +-
>>>    .../net/ethernet/mellanox/mlx5/core/en/txrx.h |  5 ++
>>>    .../net/ethernet/mellanox/mlx5/core/en/xdp.c  | 23 +++++++++
>>>    .../net/ethernet/mellanox/mlx5/core/en/xdp.h  |  5 ++
>>>    .../ethernet/mellanox/mlx5/core/en/xsk/rx.c   | 10 ++++
>>>    .../ethernet/mellanox/mlx5/core/en/xsk/rx.h   |  2 +
>>>    .../net/ethernet/mellanox/mlx5/core/en_main.c |  6 +++
>>>    .../net/ethernet/mellanox/mlx5/core/en_rx.c   | 49 ++++++++++---------
>>>    8 files changed, 80 insertions(+), 25 deletions(-)
>>>
>>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
>>> index af663978d1b4..6de02d8aeab8 100644
>>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
>>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
>>> @@ -627,10 +627,11 @@ struct mlx5e_rq;
>>>    typedef void (*mlx5e_fp_handle_rx_cqe)(struct mlx5e_rq*, struct mlx5_cqe64*);
>>>    typedef struct sk_buff *
>>>    (*mlx5e_fp_skb_from_cqe_mpwrq)(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
>>> -                            u16 cqe_bcnt, u32 head_offset, u32 page_idx);
>>> +                            struct mlx5_cqe64 *cqe, u16 cqe_bcnt,
>>> +                            u32 head_offset, u32 page_idx);
>>>    typedef struct sk_buff *
>>>    (*mlx5e_fp_skb_from_cqe)(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi,
>>> -                      u32 cqe_bcnt);
>>> +                      struct mlx5_cqe64 *cqe, u32 cqe_bcnt);
>>>    typedef bool (*mlx5e_fp_post_rx_wqes)(struct mlx5e_rq *rq);
>>>    typedef void (*mlx5e_fp_dealloc_wqe)(struct mlx5e_rq*, u16);
>>>    typedef void (*mlx5e_fp_shampo_dealloc_hd)(struct mlx5e_rq*, u16, u16, bool);
>>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
>>> index 853f312cd757..757c012ece27 100644
>>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
>>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
>>> @@ -73,6 +73,11 @@ int mlx5e_poll_rx_cq(struct mlx5e_cq *cq, int budget);
>>>    void mlx5e_free_rx_descs(struct mlx5e_rq *rq);
>>>    void mlx5e_free_rx_in_progress_descs(struct mlx5e_rq *rq);
>>>
>>> +static inline bool mlx5e_rx_hw_stamp(struct hwtstamp_config *config)
>>> +{
>>> +     return config->rx_filter == HWTSTAMP_FILTER_ALL;
>>> +}
>>> +
>>>    /* TX */
>>>    netdev_tx_t mlx5e_xmit(struct sk_buff *skb, struct net_device *dev);
>>>    bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget);
>>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
>>> index 31bb6806bf5d..d10d31e12ba2 100644
>>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
>>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
>>> @@ -156,6 +156,29 @@ mlx5e_xmit_xdp_buff(struct mlx5e_xdpsq *sq, struct mlx5e_rq *rq,
>>>        return true;
>>>    }
>>>
>>> +int mlx5e_xdp_rx_timestamp(const struct xdp_md *ctx, u64 *timestamp)
>>> +{
>>> +     const struct mlx5e_xdp_buff *_ctx = (void *)ctx;
>>> +
>>> +     if (unlikely(!mlx5e_rx_hw_stamp(_ctx->rq->tstamp)))
>>> +             return -EOPNOTSUPP;
>>> +
>>> +     *timestamp =  mlx5e_cqe_ts_to_ns(_ctx->rq->ptp_cyc2time,
>>> +                                      _ctx->rq->clock, get_cqe_ts(_ctx->cqe));
>>> +     return 0;
>>> +}
>>> +
>>> +int mlx5e_xdp_rx_hash(const struct xdp_md *ctx, u32 *hash)
>>> +{
>>> +     const struct mlx5e_xdp_buff *_ctx = (void *)ctx;
>>> +
>>> +     if (unlikely(!(_ctx->xdp.rxq->dev->features & NETIF_F_RXHASH)))
>>> +             return -EOPNOTSUPP;
>>> +
>>> +     *hash = be32_to_cpu(_ctx->cqe->rss_hash_result);
>>> +     return 0;
>>> +}
>>> +
>>>    /* returns true if packet was consumed by xdp */
>>>    bool mlx5e_xdp_handle(struct mlx5e_rq *rq, struct page *page,
>>>                      struct bpf_prog *prog, struct mlx5e_xdp_buff *mxbuf)
>>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
>>> index 389818bf6833..cb568c62aba0 100644
>>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
>>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
>>> @@ -46,6 +46,8 @@
>>>
>>>    struct mlx5e_xdp_buff {
>>>        struct xdp_buff xdp;
>>> +     struct mlx5_cqe64 *cqe;
>>> +     struct mlx5e_rq *rq;
>>>    };
>>>
>>>    struct mlx5e_xsk_param;
>>> @@ -60,6 +62,9 @@ void mlx5e_xdp_rx_poll_complete(struct mlx5e_rq *rq);
>>>    int mlx5e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
>>>                   u32 flags);
>>>
>>> +int mlx5e_xdp_rx_timestamp(const struct xdp_md *ctx, u64 *timestamp);
>>> +int mlx5e_xdp_rx_hash(const struct xdp_md *ctx, u32 *hash);
>>> +
>>>    INDIRECT_CALLABLE_DECLARE(bool mlx5e_xmit_xdp_frame_mpwqe(struct mlx5e_xdpsq *sq,
>>>                                                          struct mlx5e_xmit_data *xdptxd,
>>>                                                          struct skb_shared_info *sinfo,
>>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
>>> index 9cff82d764e3..8bf3029abd3c 100644
>>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
>>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
>>> @@ -49,6 +49,7 @@ int mlx5e_xsk_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
>>>                        umr_wqe->inline_mtts[i] = (struct mlx5_mtt) {
>>>                                .ptag = cpu_to_be64(addr | MLX5_EN_WR),
>>>                        };
>>> +                     wi->alloc_units[i].mxbuf->rq = rq;
>>>                }
>>>        } else if (unlikely(rq->mpwqe.umr_mode == MLX5E_MPWRQ_UMR_MODE_UNALIGNED)) {
>>>                for (i = 0; i < batch; i++) {
>>> @@ -58,6 +59,7 @@ int mlx5e_xsk_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
>>>                                .key = rq->mkey_be,
>>>                                .va = cpu_to_be64(addr),
>>>                        };
>>> +                     wi->alloc_units[i].mxbuf->rq = rq;
>>>                }
>>>        } else if (likely(rq->mpwqe.umr_mode == MLX5E_MPWRQ_UMR_MODE_TRIPLE)) {
>>>                u32 mapping_size = 1 << (rq->mpwqe.page_shift - 2);
>>> @@ -81,6 +83,7 @@ int mlx5e_xsk_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
>>>                                .key = rq->mkey_be,
>>>                                .va = cpu_to_be64(rq->wqe_overflow.addr),
>>>                        };
>>> +                     wi->alloc_units[i].mxbuf->rq = rq;
>>>                }
>>>        } else {
>>>                __be32 pad_size = cpu_to_be32((1 << rq->mpwqe.page_shift) -
>>> @@ -100,6 +103,7 @@ int mlx5e_xsk_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
>>>                                .va = cpu_to_be64(rq->wqe_overflow.addr),
>>>                                .bcount = pad_size,
>>>                        };
>>> +                     wi->alloc_units[i].mxbuf->rq = rq;
>>>                }
>>>        }
>>>
>>> @@ -230,6 +234,7 @@ static struct sk_buff *mlx5e_xsk_construct_skb(struct mlx5e_rq *rq, struct xdp_b
>>>
>>>    struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,
>>>                                                    struct mlx5e_mpw_info *wi,
>>> +                                                 struct mlx5_cqe64 *cqe,
>>>                                                    u16 cqe_bcnt,
>>>                                                    u32 head_offset,
>>>                                                    u32 page_idx)
>>> @@ -250,6 +255,8 @@ struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,
>>>         */
>>>        WARN_ON_ONCE(head_offset);
>>>
>>> +     /* mxbuf->rq is set on allocation, but cqe is per-packet so set it here */
>>> +     mxbuf->cqe = cqe;
>>>        xsk_buff_set_size(&mxbuf->xdp, cqe_bcnt);
>>>        xsk_buff_dma_sync_for_cpu(&mxbuf->xdp, rq->xsk_pool);
>>>        net_prefetch(mxbuf->xdp.data);
>>> @@ -284,6 +291,7 @@ struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,
>>>
>>>    struct sk_buff *mlx5e_xsk_skb_from_cqe_linear(struct mlx5e_rq *rq,
>>>                                              struct mlx5e_wqe_frag_info *wi,
>>> +                                           struct mlx5_cqe64 *cqe,
>>>                                              u32 cqe_bcnt)
>>>    {
>>>        struct mlx5e_xdp_buff *mxbuf = wi->au->mxbuf;
>>> @@ -296,6 +304,8 @@ struct sk_buff *mlx5e_xsk_skb_from_cqe_linear(struct mlx5e_rq *rq,
>>>         */
>>>        WARN_ON_ONCE(wi->offset);
>>>
>>> +     /* mxbuf->rq is set on allocation, but cqe is per-packet so set it here */
>>> +     mxbuf->cqe = cqe;
>>>        xsk_buff_set_size(&mxbuf->xdp, cqe_bcnt);
>>>        xsk_buff_dma_sync_for_cpu(&mxbuf->xdp, rq->xsk_pool);
>>>        net_prefetch(mxbuf->xdp.data);
>>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.h b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.h
>>> index 087c943bd8e9..cefc0ef6105d 100644
>>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.h
>>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.h
>>> @@ -13,11 +13,13 @@ int mlx5e_xsk_alloc_rx_wqes_batched(struct mlx5e_rq *rq, u16 ix, int wqe_bulk);
>>>    int mlx5e_xsk_alloc_rx_wqes(struct mlx5e_rq *rq, u16 ix, int wqe_bulk);
>>>    struct sk_buff *mlx5e_xsk_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq,
>>>                                                    struct mlx5e_mpw_info *wi,
>>> +                                                 struct mlx5_cqe64 *cqe,
>>>                                                    u16 cqe_bcnt,
>>>                                                    u32 head_offset,
>>>                                                    u32 page_idx);
>>>    struct sk_buff *mlx5e_xsk_skb_from_cqe_linear(struct mlx5e_rq *rq,
>>>                                              struct mlx5e_wqe_frag_info *wi,
>>> +                                           struct mlx5_cqe64 *cqe,
>>>                                              u32 cqe_bcnt);
>>>
>>>    #endif /* __MLX5_EN_XSK_RX_H__ */
>>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>>> index cff5f2e29e1e..be942c060774 100644
>>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
>>> @@ -4913,6 +4913,11 @@ const struct net_device_ops mlx5e_netdev_ops = {
>>>    #endif
>>>    };
>>>
>>> +static const struct xdp_metadata_ops mlx5_xdp_metadata_ops = {
>>> +     .xmo_rx_timestamp               = mlx5e_xdp_rx_timestamp,
>>> +     .xmo_rx_hash                    = mlx5e_xdp_rx_hash,
>>> +};
>>> +
>>
>> Instead of exposing every single XDP function in the XDP header, move
>> this struct into en/xdp.c and expose it via en/xdp.h.
>>
>> See example struct and its usage:
>>          netdev->ethtool_ops       = &mlx5e_ethtool_ops;
> 
> SG, will put "extern const struct xdp_metadata_ops
> mlx5_xdp_metadata_ops;" in en/xdp.h. 

Exactly, that's what I suggested above.

> LMK if you prefer that to go into
> en.h next to mlx5e_ethtool_ops extern.
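
For illustration, the agreed split would look roughly like this (a
sketch; whether the extern lands in en/xdp.h or en.h is exactly the
open question above):

/* en/xdp.c -- definition stays local to the XDP code */
const struct xdp_metadata_ops mlx5_xdp_metadata_ops = {
	.xmo_rx_timestamp	= mlx5e_xdp_rx_timestamp,
	.xmo_rx_hash		= mlx5e_xdp_rx_hash,
};

/* en/xdp.h (or en.h, next to the mlx5e_ethtool_ops extern) */
extern const struct xdp_metadata_ops mlx5_xdp_metadata_ops;

/* en_main.c then only needs the extern at the point of use: */
netdev->xdp_metadata_ops = &mlx5_xdp_metadata_ops;
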
>>>    static u32 mlx5e_choose_lro_timeout(struct mlx5_core_dev *mdev, u32 wanted_timeout)
>>>    {
>>>        int i;
>>> @@ -5053,6 +5058,7 @@ static void mlx5e_build_nic_netdev(struct net_device *netdev)
>>>        SET_NETDEV_DEV(netdev, mdev->device);
>>>
>>>        netdev->netdev_ops = &mlx5e_netdev_ops;
>>> +     netdev->xdp_metadata_ops = &mlx5_xdp_metadata_ops;
>>>
>>>        mlx5e_dcbnl_build_netdev(netdev);
>>>
>>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
>>> index 6affdddf5bcf..7b08653be000 100644
>>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
>>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
>>> @@ -62,10 +62,12 @@
>>>
>>>    static struct sk_buff *
>>>    mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
>>> -                             u16 cqe_bcnt, u32 head_offset, u32 page_idx);
>>> +                             struct mlx5_cqe64 *cqe, u16 cqe_bcnt, u32 head_offset,
>>> +                             u32 page_idx);
>>>    static struct sk_buff *
>>>    mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
>>> -                                u16 cqe_bcnt, u32 head_offset, u32 page_idx);
>>> +                                struct mlx5_cqe64 *cqe, u16 cqe_bcnt, u32 head_offset,
>>> +                                u32 page_idx);
>>>    static void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
>>>    static void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
>>>    static void mlx5e_handle_rx_cqe_mpwrq_shampo(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
>>> @@ -76,11 +78,6 @@ const struct mlx5e_rx_handlers mlx5e_rx_handlers_nic = {
>>>        .handle_rx_cqe_mpwqe_shampo = mlx5e_handle_rx_cqe_mpwrq_shampo,
>>>    };
>>>
>>> -static inline bool mlx5e_rx_hw_stamp(struct hwtstamp_config *config)
>>> -{
>>> -     return config->rx_filter == HWTSTAMP_FILTER_ALL;
>>> -}
>>> -
>>>    static inline void mlx5e_read_cqe_slot(struct mlx5_cqwq *wq,
>>>                                       u32 cqcc, void *data)
>>>    {
>>> @@ -1575,16 +1572,19 @@ struct sk_buff *mlx5e_build_linear_skb(struct mlx5e_rq *rq, void *va,
>>>        return skb;
>>>    }
>>>
>>> -static void mlx5e_fill_mxbuf(struct mlx5e_rq *rq, void *va, u16 headroom,
>>> -                          u32 len, struct mlx5e_xdp_buff *mxbuf)
>>> +static void mlx5e_fill_mxbuf(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
>>> +                          void *va, u16 headroom, u32 len,
>>> +                          struct mlx5e_xdp_buff *mxbuf)
>>>    {
>>>        xdp_init_buff(&mxbuf->xdp, rq->buff.frame0_sz, &rq->xdp_rxq);
>>>        xdp_prepare_buff(&mxbuf->xdp, va, headroom, len, true);
>>> +     mxbuf->cqe = cqe;
>>> +     mxbuf->rq = rq;
>>>    }
>>>
>>>    static struct sk_buff *
>>>    mlx5e_skb_from_cqe_linear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi,
>>> -                       u32 cqe_bcnt)
>>> +                       struct mlx5_cqe64 *cqe, u32 cqe_bcnt)
>>>    {
>>>        union mlx5e_alloc_unit *au = wi->au;
>>>        u16 rx_headroom = rq->buff.headroom;
>>> @@ -1630,7 +1630,7 @@ mlx5e_skb_from_cqe_linear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi,
>>>
>>>    static struct sk_buff *
>>>    mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi,
>>> -                          u32 cqe_bcnt)
>>> +                          struct mlx5_cqe64 *cqe, u32 cqe_bcnt)
>>>    {
>>>        struct mlx5e_rq_frag_info *frag_info = &rq->wqe.info.arr[0];
>>>        struct mlx5e_wqe_frag_info *head_wi = wi;
>>> @@ -1654,7 +1654,7 @@ mlx5e_skb_from_cqe_nonlinear(struct mlx5e_rq *rq, struct mlx5e_wqe_frag_info *wi
>>>        net_prefetchw(va); /* xdp_frame data area */
>>>        net_prefetch(va + rx_headroom);
>>>
>>> -     mlx5e_fill_mxbuf(rq, va, rx_headroom, frag_consumed_bytes, &mxbuf);
>>> +     mlx5e_fill_mxbuf(rq, cqe, va, rx_headroom, frag_consumed_bytes, &mxbuf);
>>>        sinfo = xdp_get_shared_info_from_buff(&mxbuf.xdp);
>>>        truesize = 0;
>>>
>>> @@ -1777,7 +1777,7 @@ static void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
>>>                              mlx5e_skb_from_cqe_linear,
>>>                              mlx5e_skb_from_cqe_nonlinear,
>>>                              mlx5e_xsk_skb_from_cqe_linear,
>>> -                           rq, wi, cqe_bcnt);
>>> +                           rq, wi, cqe, cqe_bcnt);
>>>        if (!skb) {
>>>                /* probably for XDP */
>>>                if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)) {
>>> @@ -1830,7 +1830,7 @@ static void mlx5e_handle_rx_cqe_rep(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
>>>        skb = INDIRECT_CALL_2(rq->wqe.skb_from_cqe,
>>>                              mlx5e_skb_from_cqe_linear,
>>>                              mlx5e_skb_from_cqe_nonlinear,
>>> -                           rq, wi, cqe_bcnt);
>>> +                           rq, wi, cqe, cqe_bcnt);
>>>        if (!skb) {
>>>                /* probably for XDP */
>>>                if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags)) {
>>> @@ -1889,7 +1889,7 @@ static void mlx5e_handle_rx_cqe_mpwrq_rep(struct mlx5e_rq *rq, struct mlx5_cqe64
>>>        skb = INDIRECT_CALL_2(rq->mpwqe.skb_from_cqe_mpwrq,
>>>                              mlx5e_skb_from_cqe_mpwrq_linear,
>>>                              mlx5e_skb_from_cqe_mpwrq_nonlinear,
>>> -                           rq, wi, cqe_bcnt, head_offset, page_idx);
>>> +                           rq, wi, cqe, cqe_bcnt, head_offset, page_idx);
>>>        if (!skb)
>>>                goto mpwrq_cqe_out;
>>>
>>> @@ -1940,7 +1940,8 @@ mlx5e_fill_skb_data(struct sk_buff *skb, struct mlx5e_rq *rq,
>>>
>>>    static struct sk_buff *
>>>    mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
>>> -                                u16 cqe_bcnt, u32 head_offset, u32 page_idx)
>>> +                                struct mlx5_cqe64 *cqe, u16 cqe_bcnt, u32 head_offset,
>>> +                                u32 page_idx)
>>>    {
>>>        union mlx5e_alloc_unit *au = &wi->alloc_units[page_idx];
>>>        u16 headlen = min_t(u16, MLX5E_RX_MAX_HEAD, cqe_bcnt);
>>> @@ -1979,7 +1980,8 @@ mlx5e_skb_from_cqe_mpwrq_nonlinear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *w
>>>
>>>    static struct sk_buff *
>>>    mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
>>> -                             u16 cqe_bcnt, u32 head_offset, u32 page_idx)
>>> +                             struct mlx5_cqe64 *cqe, u16 cqe_bcnt, u32 head_offset,
>>> +                             u32 page_idx)
>>>    {
>>>        union mlx5e_alloc_unit *au = &wi->alloc_units[page_idx];
>>>        u16 rx_headroom = rq->buff.headroom;
>>> @@ -2010,7 +2012,7 @@ mlx5e_skb_from_cqe_mpwrq_linear(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi,
>>>                struct mlx5e_xdp_buff mxbuf;
>>>
>>>                net_prefetchw(va); /* xdp_frame data area */
>>> -             mlx5e_fill_mxbuf(rq, va, rx_headroom, cqe_bcnt, &mxbuf);
>>> +             mlx5e_fill_mxbuf(rq, cqe, va, rx_headroom, cqe_bcnt, &mxbuf);
>>>                if (mlx5e_xdp_handle(rq, au->page, prog, &mxbuf)) {
>>>                        if (__test_and_clear_bit(MLX5E_RQ_FLAG_XDP_XMIT, rq->flags))
>>>                                __set_bit(page_idx, wi->xdp_xmit_bitmap); /* non-atomic */
>>> @@ -2174,8 +2176,8 @@ static void mlx5e_handle_rx_cqe_mpwrq_shampo(struct mlx5e_rq *rq, struct mlx5_cq
>>>                if (likely(head_size))
>>>                        *skb = mlx5e_skb_from_cqe_shampo(rq, wi, cqe, header_index);
>>>                else
>>> -                     *skb = mlx5e_skb_from_cqe_mpwrq_nonlinear(rq, wi, cqe_bcnt, data_offset,
>>> -                                                               page_idx);
>>> +                     *skb = mlx5e_skb_from_cqe_mpwrq_nonlinear(rq, wi, cqe, cqe_bcnt,
>>> +                                                               data_offset, page_idx);
>>>                if (unlikely(!*skb))
>>>                        goto free_hd_entry;
>>>
>>> @@ -2249,7 +2251,8 @@ static void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cq
>>>                              mlx5e_skb_from_cqe_mpwrq_linear,
>>>                              mlx5e_skb_from_cqe_mpwrq_nonlinear,
>>>                              mlx5e_xsk_skb_from_cqe_mpwrq_linear,
>>> -                           rq, wi, cqe_bcnt, head_offset, page_idx);
>>> +                           rq, wi, cqe, cqe_bcnt, head_offset,
>>> +                           page_idx);
>>>        if (!skb)
>>>                goto mpwrq_cqe_out;
>>>
>>> @@ -2494,7 +2497,7 @@ static void mlx5i_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
>>>        skb = INDIRECT_CALL_2(rq->wqe.skb_from_cqe,
>>>                              mlx5e_skb_from_cqe_linear,
>>>                              mlx5e_skb_from_cqe_nonlinear,
>>> -                           rq, wi, cqe_bcnt);
>>> +                           rq, wi, cqe, cqe_bcnt);
>>>        if (!skb)
>>>                goto wq_free_wqe;
>>>
>>> @@ -2586,7 +2589,7 @@ static void mlx5e_trap_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe
>>>                goto free_wqe;
>>>        }
>>>
>>> -     skb = mlx5e_skb_from_cqe_nonlinear(rq, wi, cqe_bcnt);
>>> +     skb = mlx5e_skb_from_cqe_nonlinear(rq, wi, cqe, cqe_bcnt);
>>>        if (!skb)
>>>                goto free_wqe;
>>>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [xdp-hints] Re: [PATCH bpf-next v7 15/17] net/mlx5e: Introduce wrapper for xdp_buff
  2023-01-12 21:55         ` Toke Høiland-Jørgensen
  2023-01-12 22:18           ` Stanislav Fomichev
@ 2023-01-13 20:53           ` Tariq Toukan
  2023-01-13 21:31             ` Toke Høiland-Jørgensen
  1 sibling, 1 reply; 42+ messages in thread
From: Tariq Toukan @ 2023-01-13 20:53 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Stanislav Fomichev, Tariq Toukan
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, Saeed Mahameed, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev



On 12/01/2023 23:55, Toke Høiland-Jørgensen wrote:
> Toke Høiland-Jørgensen <toke@redhat.com> writes:
> 
>> Stanislav Fomichev <sdf@google.com> writes:
>>
>>> On Thu, Jan 12, 2023 at 12:07 AM Tariq Toukan <ttoukan.linux@gmail.com> wrote:
>>>>
>>>>
>>>>
>>>> On 12/01/2023 2:32, Stanislav Fomichev wrote:
>>>>> From: Toke Høiland-Jørgensen <toke@redhat.com>
>>>>>
>>>>> Preparation for implementing HW metadata kfuncs. No functional change.
>>>>>
>>>>> Cc: Tariq Toukan <tariqt@nvidia.com>
>>>>> Cc: Saeed Mahameed <saeedm@nvidia.com>
>>>>> Cc: John Fastabend <john.fastabend@gmail.com>
>>>>> Cc: David Ahern <dsahern@gmail.com>
>>>>> Cc: Martin KaFai Lau <martin.lau@linux.dev>
>>>>> Cc: Jakub Kicinski <kuba@kernel.org>
>>>>> Cc: Willem de Bruijn <willemb@google.com>
>>>>> Cc: Jesper Dangaard Brouer <brouer@redhat.com>
>>>>> Cc: Anatoly Burakov <anatoly.burakov@intel.com>
>>>>> Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
>>>>> Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
>>>>> Cc: Maryam Tahhan <mtahhan@redhat.com>
>>>>> Cc: xdp-hints@xdp-project.net
>>>>> Cc: netdev@vger.kernel.org
>>>>> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
>>>>> Signed-off-by: Stanislav Fomichev <sdf@google.com>
>>>>> ---
>>>>>    drivers/net/ethernet/mellanox/mlx5/core/en.h  |  1 +
>>>>>    .../net/ethernet/mellanox/mlx5/core/en/xdp.c  |  3 +-
>>>>>    .../net/ethernet/mellanox/mlx5/core/en/xdp.h  |  6 +-
>>>>>    .../ethernet/mellanox/mlx5/core/en/xsk/rx.c   | 25 ++++----
>>>>>    .../net/ethernet/mellanox/mlx5/core/en_rx.c   | 58 +++++++++----------
>>>>>    5 files changed, 50 insertions(+), 43 deletions(-)
>>>>>
>>>>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
>>>>> index 2d77fb8a8a01..af663978d1b4 100644
>>>>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
>>>>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
>>>>> @@ -469,6 +469,7 @@ struct mlx5e_txqsq {
>>>>>    union mlx5e_alloc_unit {
>>>>>        struct page *page;
>>>>>        struct xdp_buff *xsk;
>>>>> +     struct mlx5e_xdp_buff *mxbuf;
>>>>
>>>> In XSK files below you mix usage of both alloc_units[page_idx].mxbuf and
>>>> alloc_units[page_idx].xsk, while both fields share the memory of a union.
>>>>
>>>> As struct mlx5e_xdp_buff wraps struct xdp_buff, I think that you just
>>>> need to change the existing xsk field type from struct xdp_buff *xsk
>>>> into struct mlx5e_xdp_buff *xsk and align the usage.
>>>
>>> Hmmm, good point. I'm actually not sure how it works currently.
>>> mlx5e_alloc_unit.mxbuf doesn't seem to be initialized anywhere? Toke,
>>> am I missing something?
>>
>> It's initialised piecemeal in different places; but yeah, we're mixing
>> things a bit...
>>
>>> I'm thinking about something like this:
>>>
>>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h
>>> b/drivers/net/ethernet/mellanox/mlx5/core/en.h
>>> index af663978d1b4..2d77fb8a8a01 100644
>>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
>>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
>>> @@ -469,7 +469,6 @@ struct mlx5e_txqsq {
>>>   union mlx5e_alloc_unit {
>>>          struct page *page;
>>>          struct xdp_buff *xsk;
>>> -       struct mlx5e_xdp_buff *mxbuf;
>>>   };
>>
>> Hmm, for consistency with the non-XSK path we should rather go the other
>> direction and lose the xsk member, moving everything to mxbuf? Let me
>> give that a shot...
> 
> Something like the below?
> 
> -Toke
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> index 6de02d8aeab8..cb9cdb6421c5 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> @@ -468,7 +468,6 @@ struct mlx5e_txqsq {
>   
>   union mlx5e_alloc_unit {
>   	struct page *page;
> -	struct xdp_buff *xsk;
>   	struct mlx5e_xdp_buff *mxbuf;
>   };
>   
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
> index cb568c62aba0..95694a25ec31 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
> @@ -33,6 +33,7 @@
>   #define __MLX5_EN_XDP_H__
>   
>   #include <linux/indirect_call_wrapper.h>
> +#include <net/xdp_sock_drv.h>
>   
>   #include "en.h"
>   #include "en/txrx.h"
> @@ -112,6 +113,21 @@ static inline void mlx5e_xmit_xdp_doorbell(struct mlx5e_xdpsq *sq)
>   	}
>   }
>   
> +static inline struct mlx5e_xdp_buff *mlx5e_xsk_buff_alloc(struct xsk_buff_pool *pool)
> +{
> +	return (struct mlx5e_xdp_buff *)xsk_buff_alloc(pool);

What about the space needed for the rq / cqe fields? xsk_buff_alloc 
won't allocate it.

> +}
> +
> +static inline void mlx5e_xsk_buff_free(struct mlx5e_xdp_buff *mxbuf)
> +{
> +	xsk_buff_free(&mxbuf->xdp);
> +}
> +
> +static inline dma_addr_t mlx5e_xsk_buff_xdp_get_frame_dma(struct mlx5e_xdp_buff *mxbuf)
> +{
> +	return xsk_buff_xdp_get_frame_dma(&mxbuf->xdp);
> +}
> +
>   /* Enable inline WQEs to shift some load from a congested HCA (HW) to
>    * a less congested cpu (SW).
>    */
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
> index 8bf3029abd3c..1f166dbb7f22 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
> @@ -3,7 +3,6 @@
>   
>   #include "rx.h"
>   #include "en/xdp.h"
> -#include <net/xdp_sock_drv.h>
>   #include <linux/filter.h>
>   
>   /* RX data path */
> @@ -21,7 +20,7 @@ int mlx5e_xsk_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
>   	if (unlikely(!xsk_buff_can_alloc(rq->xsk_pool, rq->mpwqe.pages_per_wqe)))
>   		goto err;
>   
> -	BUILD_BUG_ON(sizeof(wi->alloc_units[0]) != sizeof(wi->alloc_units[0].xsk));
> +	BUILD_BUG_ON(sizeof(wi->alloc_units[0]) != sizeof(wi->alloc_units[0].mxbuf));
>   	XSK_CHECK_PRIV_TYPE(struct mlx5e_xdp_buff);
>   	batch = xsk_buff_alloc_batch(rq->xsk_pool, (struct xdp_buff **)wi->alloc_units,
>   				     rq->mpwqe.pages_per_wqe);

This batching API gets broken as well...
xsk_buff_alloc_batch fills an array of struct xdp_buff pointers; it
cannot correctly act on an array of struct mlx5e_xdp_buff, as the
latter contains additional fields.

Maybe letting mlx5e_xdp_buff point to its struct xdp_buff (instead of
wrapping it) would solve the problems here: we'd loop over the
xdp_buff * array and copy the pointers into a struct mlx5e_xdp_buff *
array (see the sketch after the struct below).
I need to give this deeper thought...

struct mlx5e_xdp_buff {
	struct xdp_buff *xdp;
	struct mlx5_cqe64 *cqe;
	struct mlx5e_rq *rq;
};
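
For illustration, a rough sketch of that pointer-based variant
(hypothetical: the array bound macro is assumed, and it leaves open
where the wrapper structs themselves would be allocated, which is part
of what needs more thought):

/* scratch array; MLX5_MPWRQ_MAX_PAGES_PER_WQE is an assumed bound */
struct xdp_buff *bufs[MLX5_MPWRQ_MAX_PAGES_PER_WQE];
int i, batch;

/* let the batch API fill plain xdp_buff pointers, as it expects... */
batch = xsk_buff_alloc_batch(rq->xsk_pool, bufs, rq->mpwqe.pages_per_wqe);
for (i = 0; i < batch; i++) {
	/* ...then link each wrapper to the buffer the pool returned */
	wi->alloc_units[i].mxbuf->xdp = bufs[i];
	wi->alloc_units[i].mxbuf->rq = rq;
}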


> @@ -33,8 +32,8 @@ int mlx5e_xsk_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
>   	 * the first error, which will mean there are no more valid descriptors.
>   	 */
>   	for (; batch < rq->mpwqe.pages_per_wqe; batch++) {
> -		wi->alloc_units[batch].xsk = xsk_buff_alloc(rq->xsk_pool);
> -		if (unlikely(!wi->alloc_units[batch].xsk))
> +		wi->alloc_units[batch].mxbuf = mlx5e_xsk_buff_alloc(rq->xsk_pool);
> +		if (unlikely(!wi->alloc_units[batch].mxbuf))
>   			goto err_reuse_batch;
>   	}
>   
> @@ -44,7 +43,7 @@ int mlx5e_xsk_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
>   
>   	if (likely(rq->mpwqe.umr_mode == MLX5E_MPWRQ_UMR_MODE_ALIGNED)) {
>   		for (i = 0; i < batch; i++) {
> -			dma_addr_t addr = xsk_buff_xdp_get_frame_dma(wi->alloc_units[i].xsk);
> +			dma_addr_t addr = mlx5e_xsk_buff_xdp_get_frame_dma(wi->alloc_units[i].mxbuf);
>   
>   			umr_wqe->inline_mtts[i] = (struct mlx5_mtt) {
>   				.ptag = cpu_to_be64(addr | MLX5_EN_WR),
> @@ -53,7 +52,7 @@ int mlx5e_xsk_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
>   		}
>   	} else if (unlikely(rq->mpwqe.umr_mode == MLX5E_MPWRQ_UMR_MODE_UNALIGNED)) {
>   		for (i = 0; i < batch; i++) {
> -			dma_addr_t addr = xsk_buff_xdp_get_frame_dma(wi->alloc_units[i].xsk);
> +			dma_addr_t addr = mlx5e_xsk_buff_xdp_get_frame_dma(wi->alloc_units[i].mxbuf);
>   
>   			umr_wqe->inline_ksms[i] = (struct mlx5_ksm) {
>   				.key = rq->mkey_be,
> @@ -65,7 +64,7 @@ int mlx5e_xsk_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
>   		u32 mapping_size = 1 << (rq->mpwqe.page_shift - 2);
>   
>   		for (i = 0; i < batch; i++) {
> -			dma_addr_t addr = xsk_buff_xdp_get_frame_dma(wi->alloc_units[i].xsk);
> +			dma_addr_t addr = mlx5e_xsk_buff_xdp_get_frame_dma(wi->alloc_units[i].mxbuf);
>   
>   			umr_wqe->inline_ksms[i << 2] = (struct mlx5_ksm) {
>   				.key = rq->mkey_be,
> @@ -91,7 +90,7 @@ int mlx5e_xsk_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
>   		__be32 frame_size = cpu_to_be32(rq->xsk_pool->chunk_size);
>   
>   		for (i = 0; i < batch; i++) {
> -			dma_addr_t addr = xsk_buff_xdp_get_frame_dma(wi->alloc_units[i].xsk);
> +			dma_addr_t addr = mlx5e_xsk_buff_xdp_get_frame_dma(wi->alloc_units[i].mxbuf);
>   
>   			umr_wqe->inline_klms[i << 1] = (struct mlx5_klm) {
>   				.key = rq->mkey_be,
> @@ -137,7 +136,7 @@ int mlx5e_xsk_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
>   
>   err_reuse_batch:
>   	while (--batch >= 0)
> -		xsk_buff_free(wi->alloc_units[batch].xsk);
> +		mlx5e_xsk_buff_free(wi->alloc_units[batch].mxbuf);
>   
>   err:
>   	rq->stats->buff_alloc_err++;
> @@ -156,7 +155,7 @@ int mlx5e_xsk_alloc_rx_wqes_batched(struct mlx5e_rq *rq, u16 ix, int wqe_bulk)
>   	 * allocate XDP buffers straight into alloc_units.
>   	 */
>   	BUILD_BUG_ON(sizeof(rq->wqe.alloc_units[0]) !=
> -		     sizeof(rq->wqe.alloc_units[0].xsk));
> +		     sizeof(rq->wqe.alloc_units[0].mxbuf));
>   	buffs = (struct xdp_buff **)rq->wqe.alloc_units;
>   	contig = mlx5_wq_cyc_get_size(wq) - ix;
>   	if (wqe_bulk <= contig) {
> @@ -177,8 +176,9 @@ int mlx5e_xsk_alloc_rx_wqes_batched(struct mlx5e_rq *rq, u16 ix, int wqe_bulk)
>   		/* Assumes log_num_frags == 0. */
>   		frag = &rq->wqe.frags[j];
>   
> -		addr = xsk_buff_xdp_get_frame_dma(frag->au->xsk);
> +		addr = mlx5e_xsk_buff_xdp_get_frame_dma(frag->au->mxbuf);
>   		wqe->data[0].addr = cpu_to_be64(addr + rq->buff.headroom);
> +		frag->au->mxbuf->rq = rq;
>   	}
>   
>   	return alloc;
> @@ -199,12 +199,13 @@ int mlx5e_xsk_alloc_rx_wqes(struct mlx5e_rq *rq, u16 ix, int wqe_bulk)
>   		/* Assumes log_num_frags == 0. */
>   		frag = &rq->wqe.frags[j];
>   
> -		frag->au->xsk = xsk_buff_alloc(rq->xsk_pool);
> -		if (unlikely(!frag->au->xsk))
> +		frag->au->mxbuf = mlx5e_xsk_buff_alloc(rq->xsk_pool);
> +		if (unlikely(!frag->au->mxbuf))
>   			return i;
>   
> -		addr = xsk_buff_xdp_get_frame_dma(frag->au->xsk);
> +		addr = mlx5e_xsk_buff_xdp_get_frame_dma(frag->au->mxbuf);
>   		wqe->data[0].addr = cpu_to_be64(addr + rq->buff.headroom);
> +		frag->au->mxbuf->rq = rq;
>   	}
>   
>   	return wqe_bulk;
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> index 7b08653be000..4313165709cb 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> @@ -41,7 +41,6 @@
>   #include <net/gro.h>
>   #include <net/udp.h>
>   #include <net/tcp.h>
> -#include <net/xdp_sock_drv.h>
>   #include "en.h"
>   #include "en/txrx.h"
>   #include "en_tc.h"
> @@ -434,7 +433,7 @@ static inline void mlx5e_free_rx_wqe(struct mlx5e_rq *rq,
>   		 * put into the Reuse Ring, because there is no way to return
>   		 * the page to the userspace when the interface goes down.
>   		 */
> -		xsk_buff_free(wi->au->xsk);
> +		mlx5e_xsk_buff_free(wi->au->mxbuf);
>   		return;
>   	}
>   
> @@ -515,7 +514,7 @@ mlx5e_free_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi, bool recycle
>   		 */
>   		for (i = 0; i < rq->mpwqe.pages_per_wqe; i++)
>   			if (no_xdp_xmit || !test_bit(i, wi->xdp_xmit_bitmap))
> -				xsk_buff_free(alloc_units[i].xsk);
> +				mlx5e_xsk_buff_free(alloc_units[i].mxbuf);
>   	} else {
>   		for (i = 0; i < rq->mpwqe.pages_per_wqe; i++)
>   			if (no_xdp_xmit || !test_bit(i, wi->xdp_xmit_bitmap))
> 

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [xdp-hints] Re: [PATCH bpf-next v7 15/17] net/mlx5e: Introduce wrapper for xdp_buff
  2023-01-12 22:29             ` Toke Høiland-Jørgensen
@ 2023-01-13 20:55               ` Tariq Toukan
  0 siblings, 0 replies; 42+ messages in thread
From: Tariq Toukan @ 2023-01-13 20:55 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Stanislav Fomichev
  Cc: Tariq Toukan, bpf, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, haoluo, jolsa, Saeed Mahameed,
	David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev



On 13/01/2023 0:29, Toke Høiland-Jørgensen wrote:
> Stanislav Fomichev <sdf@google.com> writes:
> 
>> On Thu, Jan 12, 2023 at 1:55 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>>
>>> Toke Høiland-Jørgensen <toke@redhat.com> writes:
>>>
>>>> Stanislav Fomichev <sdf@google.com> writes:
>>>>
>>>>> On Thu, Jan 12, 2023 at 12:07 AM Tariq Toukan <ttoukan.linux@gmail.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 12/01/2023 2:32, Stanislav Fomichev wrote:
>>>>>>> From: Toke Høiland-Jørgensen <toke@redhat.com>
>>>>>>>
>>>>>>> Preparation for implementing HW metadata kfuncs. No functional change.
>>>>>>>
>>>>>>> Cc: Tariq Toukan <tariqt@nvidia.com>
>>>>>>> Cc: Saeed Mahameed <saeedm@nvidia.com>
>>>>>>> Cc: John Fastabend <john.fastabend@gmail.com>
>>>>>>> Cc: David Ahern <dsahern@gmail.com>
>>>>>>> Cc: Martin KaFai Lau <martin.lau@linux.dev>
>>>>>>> Cc: Jakub Kicinski <kuba@kernel.org>
>>>>>>> Cc: Willem de Bruijn <willemb@google.com>
>>>>>>> Cc: Jesper Dangaard Brouer <brouer@redhat.com>
>>>>>>> Cc: Anatoly Burakov <anatoly.burakov@intel.com>
>>>>>>> Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
>>>>>>> Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
>>>>>>> Cc: Maryam Tahhan <mtahhan@redhat.com>
>>>>>>> Cc: xdp-hints@xdp-project.net
>>>>>>> Cc: netdev@vger.kernel.org
>>>>>>> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
>>>>>>> Signed-off-by: Stanislav Fomichev <sdf@google.com>
>>>>>>> ---
>>>>>>>    drivers/net/ethernet/mellanox/mlx5/core/en.h  |  1 +
>>>>>>>    .../net/ethernet/mellanox/mlx5/core/en/xdp.c  |  3 +-
>>>>>>>    .../net/ethernet/mellanox/mlx5/core/en/xdp.h  |  6 +-
>>>>>>>    .../ethernet/mellanox/mlx5/core/en/xsk/rx.c   | 25 ++++----
>>>>>>>    .../net/ethernet/mellanox/mlx5/core/en_rx.c   | 58 +++++++++----------
>>>>>>>    5 files changed, 50 insertions(+), 43 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
>>>>>>> index 2d77fb8a8a01..af663978d1b4 100644
>>>>>>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
>>>>>>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
>>>>>>> @@ -469,6 +469,7 @@ struct mlx5e_txqsq {
>>>>>>>    union mlx5e_alloc_unit {
>>>>>>>        struct page *page;
>>>>>>>        struct xdp_buff *xsk;
>>>>>>> +     struct mlx5e_xdp_buff *mxbuf;
>>>>>>
>>>>>> In XSK files below you mix usage of both alloc_units[page_idx].mxbuf and
>>>>>> alloc_units[page_idx].xsk, while both fields share the memory of a union.
>>>>>>
>>>>>> As struct mlx5e_xdp_buff wraps struct xdp_buff, I think that you just
>>>>>> need to change the existing xsk field type from struct xdp_buff *xsk
>>>>>> into struct mlx5e_xdp_buff *xsk and align the usage.
>>>>>
>>>>> Hmmm, good point. I'm actually not sure how it works currently.
>>>>> mlx5e_alloc_unit.mxbuf doesn't seem to be initialized anywhere? Toke,
>>>>> am I missing something?
>>>>
>>>> It's initialised piecemeal in different places; but yeah, we're mixing
>>>> things a bit...
>>>>
>>>>> I'm thinking about something like this:
>>
>> Seems more invasive? I don't care much tbf, but what's wrong with
>> keeping the 'xdp_buff xsk' member and using it consistently?
> 
> Yeah, it's more invasive, but it's also more consistent with the non-xsk
> path where every usage of struct xdp_buff is replaced with the wrapping
> struct?
> 
> Both will work, I suppose (in fact I think the resulting code will be
> more or less identical), so it's more a matter of which one is easier to
> read and where we put the type-safety-breaking casts.
> 
> I can live with either one (just note you'll have to move the
> 'mxbuf->rq' initialisation next to the one for mxbuf->cqe for yours, but
> that's probably fine too). Let's see which way Tariq prefers...
> 
> -Toke
> 

I think both of the above will still have issues. See my comments in the 
email that I've just sent.

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [xdp-hints] Re: [PATCH bpf-next v7 15/17] net/mlx5e: Introduce wrapper for xdp_buff
  2023-01-13 20:53           ` Tariq Toukan
@ 2023-01-13 21:31             ` Toke Høiland-Jørgensen
  2023-01-15  6:59               ` Tariq Toukan
  0 siblings, 1 reply; 42+ messages in thread
From: Toke Høiland-Jørgensen @ 2023-01-13 21:31 UTC (permalink / raw)
  To: Tariq Toukan, Stanislav Fomichev, Tariq Toukan
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, Saeed Mahameed, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev

Tariq Toukan <ttoukan.linux@gmail.com> writes:

> On 12/01/2023 23:55, Toke Høiland-Jørgensen wrote:
>> Toke Høiland-Jørgensen <toke@redhat.com> writes:
>> 
>>> Stanislav Fomichev <sdf@google.com> writes:
>>>
>>>> On Thu, Jan 12, 2023 at 12:07 AM Tariq Toukan <ttoukan.linux@gmail.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 12/01/2023 2:32, Stanislav Fomichev wrote:
>>>>>> From: Toke Høiland-Jørgensen <toke@redhat.com>
>>>>>>
>>>>>> Preparation for implementing HW metadata kfuncs. No functional change.
>>>>>>
>>>>>> Cc: Tariq Toukan <tariqt@nvidia.com>
>>>>>> Cc: Saeed Mahameed <saeedm@nvidia.com>
>>>>>> Cc: John Fastabend <john.fastabend@gmail.com>
>>>>>> Cc: David Ahern <dsahern@gmail.com>
>>>>>> Cc: Martin KaFai Lau <martin.lau@linux.dev>
>>>>>> Cc: Jakub Kicinski <kuba@kernel.org>
>>>>>> Cc: Willem de Bruijn <willemb@google.com>
>>>>>> Cc: Jesper Dangaard Brouer <brouer@redhat.com>
>>>>>> Cc: Anatoly Burakov <anatoly.burakov@intel.com>
>>>>>> Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
>>>>>> Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
>>>>>> Cc: Maryam Tahhan <mtahhan@redhat.com>
>>>>>> Cc: xdp-hints@xdp-project.net
>>>>>> Cc: netdev@vger.kernel.org
>>>>>> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
>>>>>> Signed-off-by: Stanislav Fomichev <sdf@google.com>
>>>>>> ---
>>>>>>    drivers/net/ethernet/mellanox/mlx5/core/en.h  |  1 +
>>>>>>    .../net/ethernet/mellanox/mlx5/core/en/xdp.c  |  3 +-
>>>>>>    .../net/ethernet/mellanox/mlx5/core/en/xdp.h  |  6 +-
>>>>>>    .../ethernet/mellanox/mlx5/core/en/xsk/rx.c   | 25 ++++----
>>>>>>    .../net/ethernet/mellanox/mlx5/core/en_rx.c   | 58 +++++++++----------
>>>>>>    5 files changed, 50 insertions(+), 43 deletions(-)
>>>>>>
>>>>>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
>>>>>> index 2d77fb8a8a01..af663978d1b4 100644
>>>>>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
>>>>>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
>>>>>> @@ -469,6 +469,7 @@ struct mlx5e_txqsq {
>>>>>>    union mlx5e_alloc_unit {
>>>>>>        struct page *page;
>>>>>>        struct xdp_buff *xsk;
>>>>>> +     struct mlx5e_xdp_buff *mxbuf;
>>>>>
>>>>> In XSK files below you mix usage of both alloc_units[page_idx].mxbuf and
>>>>> alloc_units[page_idx].xsk, while both fields share the memory of a union.
>>>>>
>>>>> As struct mlx5e_xdp_buff wraps struct xdp_buff, I think that you just
>>>>> need to change the existing xsk field type from struct xdp_buff *xsk
>>>>> into struct mlx5e_xdp_buff *xsk and align the usage.
>>>>
>>>> Hmmm, good point. I'm actually not sure how it works currently.
>>>> mlx5e_alloc_unit.mxbuf doesn't seem to be initialized anywhere? Toke,
>>>> am I missing something?
>>>
>>> It's initialised piecemeal in different places; but yeah, we're mixing
>>> things a bit...
>>>
>>>> I'm thinking about something like this:
>>>>
>>>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h
>>>> b/drivers/net/ethernet/mellanox/mlx5/core/en.h
>>>> index af663978d1b4..2d77fb8a8a01 100644
>>>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
>>>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
>>>> @@ -469,7 +469,6 @@ struct mlx5e_txqsq {
>>>>   union mlx5e_alloc_unit {
>>>>          struct page *page;
>>>>          struct xdp_buff *xsk;
>>>> -       struct mlx5e_xdp_buff *mxbuf;
>>>>   };
>>>
>>> Hmm, for consistency with the non-XSK path we should rather go the other
>>> direction and lose the xsk member, moving everything to mxbuf? Let me
>>> give that a shot...
>> 
>> Something like the below?
>> 
>> -Toke
>> 
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
>> index 6de02d8aeab8..cb9cdb6421c5 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
>> @@ -468,7 +468,6 @@ struct mlx5e_txqsq {
>>   
>>   union mlx5e_alloc_unit {
>>   	struct page *page;
>> -	struct xdp_buff *xsk;
>>   	struct mlx5e_xdp_buff *mxbuf;
>>   };
>>   
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
>> index cb568c62aba0..95694a25ec31 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
>> @@ -33,6 +33,7 @@
>>   #define __MLX5_EN_XDP_H__
>>   
>>   #include <linux/indirect_call_wrapper.h>
>> +#include <net/xdp_sock_drv.h>
>>   
>>   #include "en.h"
>>   #include "en/txrx.h"
>> @@ -112,6 +113,21 @@ static inline void mlx5e_xmit_xdp_doorbell(struct mlx5e_xdpsq *sq)
>>   	}
>>   }
>>   
>> +static inline struct mlx5e_xdp_buff *mlx5e_xsk_buff_alloc(struct xsk_buff_pool *pool)
>> +{
>> +	return (struct mlx5e_xdp_buff *)xsk_buff_alloc(pool);
>
> What about the space needed for the rq / cqe fields? xsk_buff_alloc 
> won't allocate it.

It will! See patch 14 in the series that adds a 'cb' field to
xdp_buff_xsk, meaning there's actually space after the xdp_buff struct
being allocated by the xsk_buff_alloc API. The XSK_CHECK_PRIV_TYPE macro
call is there to ensure the cb field is big enough for the struct we're
casting to in the driver.
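
For reference, the relevant bits from that patch look roughly like
this (include/net/xsk_buff_pool.h, abbreviated):

#define XSK_PRIV_MAX 24

struct xdp_buff_xsk {
	struct xdp_buff xdp;
	u8 cb[XSK_PRIV_MAX];	/* driver-private area right after the xdp_buff */
	dma_addr_t dma;
	/* ... */
};

#define XSK_CHECK_PRIV_TYPE(t) \
	BUILD_BUG_ON(sizeof(t) > offsetofend(struct xdp_buff_xsk, cb))

So casting the returned xdp_buff pointer to struct mlx5e_xdp_buff is
fine as long as the extra cqe/rq fields fit inside cb.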

>> +}
>> +
>> +static inline void mlx5e_xsk_buff_free(struct mlx5e_xdp_buff *mxbuf)
>> +{
>> +	xsk_buff_free(&mxbuf->xdp);
>> +}
>> +
>> +static inline dma_addr_t mlx5e_xsk_buff_xdp_get_frame_dma(struct mlx5e_xdp_buff *mxbuf)
>> +{
>> +	return xsk_buff_xdp_get_frame_dma(&mxbuf->xdp);
>> +}
>> +
>>   /* Enable inline WQEs to shift some load from a congested HCA (HW) to
>>    * a less congested cpu (SW).
>>    */
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
>> index 8bf3029abd3c..1f166dbb7f22 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
>> @@ -3,7 +3,6 @@
>>   
>>   #include "rx.h"
>>   #include "en/xdp.h"
>> -#include <net/xdp_sock_drv.h>
>>   #include <linux/filter.h>
>>   
>>   /* RX data path */
>> @@ -21,7 +20,7 @@ int mlx5e_xsk_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
>>   	if (unlikely(!xsk_buff_can_alloc(rq->xsk_pool, rq->mpwqe.pages_per_wqe)))
>>   		goto err;
>>   
>> -	BUILD_BUG_ON(sizeof(wi->alloc_units[0]) != sizeof(wi->alloc_units[0].xsk));
>> +	BUILD_BUG_ON(sizeof(wi->alloc_units[0]) != sizeof(wi->alloc_units[0].mxbuf));
>>   	XSK_CHECK_PRIV_TYPE(struct mlx5e_xdp_buff);
>>   	batch = xsk_buff_alloc_batch(rq->xsk_pool, (struct xdp_buff **)wi->alloc_units,
>>   				     rq->mpwqe.pages_per_wqe);
>
> This batching API gets broken as well...
> xsk_buff_alloc_batch fills an array of struct xdp_buff pointers; it
> cannot correctly act on an array of struct mlx5e_xdp_buff, as the
> latter contains additional fields.

See above for why this does, in fact, work. I agree it's not totally
obvious, and in any case there's always going to be a point where the
cast happens and type safety breaks, which is what I was alluding to in
my reply to Stanislav.

I guess we could try to rework the API in xdp_sock_drv.h to make this
more obvious instead of using the casting driver-specific wrappers I
suggested here. Or we could go with Stanislav's suggestion of keeping
allocation etc. using xdp_buff and only casting to mlx5e_xdp_buff in the
function where it's used; then we can keep the casting localised to that
function and put a comment there explaining why it works (see the
sketch below)?
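
A minimal sketch of that localised-cast option (hypothetical helper
name; the comment is the kind of explanation proposed above):

static void mlx5e_fill_mxbuf_xsk(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
				 struct xdp_buff *xdp)
{
	/* Casting is safe: XSK buffers are allocated as struct
	 * xdp_buff_xsk, whose cb[] area reserves room for the fields
	 * mlx5e_xdp_buff adds after the embedded xdp_buff
	 * (enforced by XSK_CHECK_PRIV_TYPE).
	 */
	struct mlx5e_xdp_buff *mxbuf = (struct mlx5e_xdp_buff *)xdp;

	mxbuf->cqe = cqe;
	mxbuf->rq = rq;
}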

> Maybe letting mlx5e_xdp_buff point to its struct xdp_buff (instead of
> wrapping it) would solve the problems here: we'd loop over the
> xdp_buff * array and copy the pointers into a struct mlx5e_xdp_buff *
> array.
> I need to give this deeper thought...
>
> struct mlx5e_xdp_buff {
> 	struct xdp_buff *xdp;
> 	struct mlx5_cqe64 *cqe;
> 	struct mlx5e_rq *rq;
> };

This was actually my original proposal; we discussed this back on v2 of
this patch series. People generally felt that the 'cb' field approach
(originally suggested by Jakub) was better. See the discussion starting
from here:

https://lore.kernel.org/r/20221123111431.7b54668e@kernel.org

-Toke


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [xdp-hints] Re: [PATCH bpf-next v7 15/17] net/mlx5e: Introduce wrapper for xdp_buff
  2023-01-13 21:31             ` Toke Høiland-Jørgensen
@ 2023-01-15  6:59               ` Tariq Toukan
  2023-01-15 11:13                 ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 42+ messages in thread
From: Tariq Toukan @ 2023-01-15  6:59 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Stanislav Fomichev
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, Saeed Mahameed, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev



On 13/01/2023 23:31, Toke Høiland-Jørgensen wrote:
> Tariq Toukan <ttoukan.linux@gmail.com> writes:
> 
>> On 12/01/2023 23:55, Toke Høiland-Jørgensen wrote:
>>> Toke Høiland-Jørgensen <toke@redhat.com> writes:
>>>
>>>> Stanislav Fomichev <sdf@google.com> writes:
>>>>
>>>>> On Thu, Jan 12, 2023 at 12:07 AM Tariq Toukan <ttoukan.linux@gmail.com> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 12/01/2023 2:32, Stanislav Fomichev wrote:
>>>>>>> From: Toke Høiland-Jørgensen <toke@redhat.com>
>>>>>>>
>>>>>>> Preparation for implementing HW metadata kfuncs. No functional change.
>>>>>>>
>>>>>>> Cc: Tariq Toukan <tariqt@nvidia.com>
>>>>>>> Cc: Saeed Mahameed <saeedm@nvidia.com>
>>>>>>> Cc: John Fastabend <john.fastabend@gmail.com>
>>>>>>> Cc: David Ahern <dsahern@gmail.com>
>>>>>>> Cc: Martin KaFai Lau <martin.lau@linux.dev>
>>>>>>> Cc: Jakub Kicinski <kuba@kernel.org>
>>>>>>> Cc: Willem de Bruijn <willemb@google.com>
>>>>>>> Cc: Jesper Dangaard Brouer <brouer@redhat.com>
>>>>>>> Cc: Anatoly Burakov <anatoly.burakov@intel.com>
>>>>>>> Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
>>>>>>> Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
>>>>>>> Cc: Maryam Tahhan <mtahhan@redhat.com>
>>>>>>> Cc: xdp-hints@xdp-project.net
>>>>>>> Cc: netdev@vger.kernel.org
>>>>>>> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
>>>>>>> Signed-off-by: Stanislav Fomichev <sdf@google.com>
>>>>>>> ---
>>>>>>>     drivers/net/ethernet/mellanox/mlx5/core/en.h  |  1 +
>>>>>>>     .../net/ethernet/mellanox/mlx5/core/en/xdp.c  |  3 +-
>>>>>>>     .../net/ethernet/mellanox/mlx5/core/en/xdp.h  |  6 +-
>>>>>>>     .../ethernet/mellanox/mlx5/core/en/xsk/rx.c   | 25 ++++----
>>>>>>>     .../net/ethernet/mellanox/mlx5/core/en_rx.c   | 58 +++++++++----------
>>>>>>>     5 files changed, 50 insertions(+), 43 deletions(-)
>>>>>>>
>>>>>>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
>>>>>>> index 2d77fb8a8a01..af663978d1b4 100644
>>>>>>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
>>>>>>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
>>>>>>> @@ -469,6 +469,7 @@ struct mlx5e_txqsq {
>>>>>>>     union mlx5e_alloc_unit {
>>>>>>>         struct page *page;
>>>>>>>         struct xdp_buff *xsk;
>>>>>>> +     struct mlx5e_xdp_buff *mxbuf;
>>>>>>
>>>>>> In XSK files below you mix usage of both alloc_units[page_idx].mxbuf and
>>>>>> alloc_units[page_idx].xsk, while both fields share the memory of a union.
>>>>>>
>>>>>> As struct mlx5e_xdp_buff wraps struct xdp_buff, I think that you just
>>>>>> need to change the existing xsk field type from struct xdp_buff *xsk
>>>>>> into struct mlx5e_xdp_buff *xsk and align the usage.
>>>>>
>>>>> Hmmm, good point. I'm actually not sure how it works currently.
>>>>> mlx5e_alloc_unit.mxbuf doesn't seem to be initialized anywhere? Toke,
>>>>> am I missing something?
>>>>
>>>> It's initialised piecemeal in different places; but yeah, we're mixing
>>>> things a bit...
>>>>
>>>>> I'm thinking about something like this:
>>>>>
>>>>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h
>>>>> b/drivers/net/ethernet/mellanox/mlx5/core/en.h
>>>>> index af663978d1b4..2d77fb8a8a01 100644
>>>>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
>>>>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
>>>>> @@ -469,7 +469,6 @@ struct mlx5e_txqsq {
>>>>>    union mlx5e_alloc_unit {
>>>>>           struct page *page;
>>>>>           struct xdp_buff *xsk;
>>>>> -       struct mlx5e_xdp_buff *mxbuf;
>>>>>    };
>>>>
>>>> Hmm, for consistency with the non-XSK path we should rather go the other
>>>> direction and lose the xsk member, moving everything to mxbuf? Let me
>>>> give that a shot...
>>>
>>> Something like the below?
>>>
>>> -Toke
>>>
>>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
>>> index 6de02d8aeab8..cb9cdb6421c5 100644
>>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
>>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
>>> @@ -468,7 +468,6 @@ struct mlx5e_txqsq {
>>>    
>>>    union mlx5e_alloc_unit {
>>>    	struct page *page;
>>> -	struct xdp_buff *xsk;
>>>    	struct mlx5e_xdp_buff *mxbuf;
>>>    };
>>>    
>>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
>>> index cb568c62aba0..95694a25ec31 100644
>>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
>>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
>>> @@ -33,6 +33,7 @@
>>>    #define __MLX5_EN_XDP_H__
>>>    
>>>    #include <linux/indirect_call_wrapper.h>
>>> +#include <net/xdp_sock_drv.h>
>>>    
>>>    #include "en.h"
>>>    #include "en/txrx.h"
>>> @@ -112,6 +113,21 @@ static inline void mlx5e_xmit_xdp_doorbell(struct mlx5e_xdpsq *sq)
>>>    	}
>>>    }
>>>    
>>> +static inline struct mlx5e_xdp_buff *mlx5e_xsk_buff_alloc(struct xsk_buff_pool *pool)
>>> +{
>>> +	return (struct mlx5e_xdp_buff *)xsk_buff_alloc(pool);
>>
>> What about the space needed for the rq / cqe fields? xsk_buff_alloc
>> won't allocate it.
> 
> It will! See patch 14 in the series that adds a 'cb' field to
> xdp_buff_xsk, meaning there's actually space after the xdp_buff struct
> being allocated by the xsk_buff_alloc API. The XSK_CHECK_PRIV_TYPE macro
> call is there to ensure the cb field is big enough for the struct we're
> casting to in the driver.
> 

Oh okay, got it.

>>> +}
>>> +
>>> +static inline void mlx5e_xsk_buff_free(struct mlx5e_xdp_buff *mxbuf)
>>> +{
>>> +	xsk_buff_free(&mxbuf->xdp);
>>> +}
>>> +
>>> +static inline dma_addr_t mlx5e_xsk_buff_xdp_get_frame_dma(struct mlx5e_xdp_buff *mxbuf)
>>> +{
>>> +	return xsk_buff_xdp_get_frame_dma(&mxbuf->xdp);
>>> +}
>>> +
>>>    /* Enable inline WQEs to shift some load from a congested HCA (HW) to
>>>     * a less congested cpu (SW).
>>>     */
>>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
>>> index 8bf3029abd3c..1f166dbb7f22 100644
>>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
>>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
>>> @@ -3,7 +3,6 @@
>>>    
>>>    #include "rx.h"
>>>    #include "en/xdp.h"
>>> -#include <net/xdp_sock_drv.h>
>>>    #include <linux/filter.h>
>>>    
>>>    /* RX data path */
>>> @@ -21,7 +20,7 @@ int mlx5e_xsk_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
>>>    	if (unlikely(!xsk_buff_can_alloc(rq->xsk_pool, rq->mpwqe.pages_per_wqe)))
>>>    		goto err;
>>>    
>>> -	BUILD_BUG_ON(sizeof(wi->alloc_units[0]) != sizeof(wi->alloc_units[0].xsk));
>>> +	BUILD_BUG_ON(sizeof(wi->alloc_units[0]) != sizeof(wi->alloc_units[0].mxbuf));
>>>    	XSK_CHECK_PRIV_TYPE(struct mlx5e_xdp_buff);
>>>    	batch = xsk_buff_alloc_batch(rq->xsk_pool, (struct xdp_buff **)wi->alloc_units,
>>>    				     rq->mpwqe.pages_per_wqe);
>>
>> This batching API gets broken as well...
>> xsk_buff_alloc_batch fills an array of struct xdp_buff pointers; it
>> cannot correctly act on an array of struct mlx5e_xdp_buff, as the
>> latter contains additional fields.
> 
> See above for why this does, in fact, work. I agree it's not totally
> obvious, and in any case there's always going to be a point where the
> cast happens and type safety breaks, which is what I was alluding to in
> my reply to Stanislav.
> 
> I guess we could try to rework the API in xdp_sock_drv.h to make this
> more obvious instead of using the casting driver-specific wrappers I
> suggested here. Or we could go with Stanislav's suggestion of keeping
> allocation etc. using xdp_buff and only casting to mlx5e_xdp_buff in the
> function where it's used; then we can keep the casting localised to that
> function and put a comment there explaining why it works?
> 

Stanislav's proposal LGTM.
Let's keep the casting localised, and make sure there's a comment there.

>> Maybe letting mlx5e_xdp_buff point to its struct xdp_buff (instead of
>> wrapping it) would solve the problems here: we'd loop over the
>> xdp_buff * array and copy the pointers into a struct mlx5e_xdp_buff *
>> array.
>> I need to give this deeper thought...
>>
>> struct mlx5e_xdp_buff {
>> 	struct xdp_buff *xdp;
>> 	struct mlx5_cqe64 *cqe;
>> 	struct mlx5e_rq *rq;
>> };
> 
> This was actually my original proposal; we discussed this back on v2 of
> this patch series. People generally felt that the 'cb' field approach
> (originally suggested by Jakub) was better.

I agree.

> See the discussion starting
> from here:
> 
> https://lore.kernel.org/r/20221123111431.7b54668e@kernel.org
> 
> -Toke
> 


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [xdp-hints] Re: [PATCH bpf-next v7 15/17] net/mlx5e: Introduce wrapper for xdp_buff
  2023-01-15  6:59               ` Tariq Toukan
@ 2023-01-15 11:13                 ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 42+ messages in thread
From: Toke Høiland-Jørgensen @ 2023-01-15 11:13 UTC (permalink / raw)
  To: Tariq Toukan, Stanislav Fomichev
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, Saeed Mahameed, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev

Tariq Toukan <ttoukan.linux@gmail.com> writes:

> On 13/01/2023 23:31, Toke Høiland-Jørgensen wrote:
>> Tariq Toukan <ttoukan.linux@gmail.com> writes:
>> 
>>> On 12/01/2023 23:55, Toke Høiland-Jørgensen wrote:
>>>> Toke Høiland-Jørgensen <toke@redhat.com> writes:
>>>>
>>>>> Stanislav Fomichev <sdf@google.com> writes:
>>>>>
>>>>>> On Thu, Jan 12, 2023 at 12:07 AM Tariq Toukan <ttoukan.linux@gmail.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 12/01/2023 2:32, Stanislav Fomichev wrote:
>>>>>>>> From: Toke Høiland-Jørgensen <toke@redhat.com>
>>>>>>>>
>>>>>>>> Preparation for implementing HW metadata kfuncs. No functional change.
>>>>>>>>
>>>>>>>> Cc: Tariq Toukan <tariqt@nvidia.com>
>>>>>>>> Cc: Saeed Mahameed <saeedm@nvidia.com>
>>>>>>>> Cc: John Fastabend <john.fastabend@gmail.com>
>>>>>>>> Cc: David Ahern <dsahern@gmail.com>
>>>>>>>> Cc: Martin KaFai Lau <martin.lau@linux.dev>
>>>>>>>> Cc: Jakub Kicinski <kuba@kernel.org>
>>>>>>>> Cc: Willem de Bruijn <willemb@google.com>
>>>>>>>> Cc: Jesper Dangaard Brouer <brouer@redhat.com>
>>>>>>>> Cc: Anatoly Burakov <anatoly.burakov@intel.com>
>>>>>>>> Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
>>>>>>>> Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
>>>>>>>> Cc: Maryam Tahhan <mtahhan@redhat.com>
>>>>>>>> Cc: xdp-hints@xdp-project.net
>>>>>>>> Cc: netdev@vger.kernel.org
>>>>>>>> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
>>>>>>>> Signed-off-by: Stanislav Fomichev <sdf@google.com>
>>>>>>>> ---
>>>>>>>>     drivers/net/ethernet/mellanox/mlx5/core/en.h  |  1 +
>>>>>>>>     .../net/ethernet/mellanox/mlx5/core/en/xdp.c  |  3 +-
>>>>>>>>     .../net/ethernet/mellanox/mlx5/core/en/xdp.h  |  6 +-
>>>>>>>>     .../ethernet/mellanox/mlx5/core/en/xsk/rx.c   | 25 ++++----
>>>>>>>>     .../net/ethernet/mellanox/mlx5/core/en_rx.c   | 58 +++++++++----------
>>>>>>>>     5 files changed, 50 insertions(+), 43 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
>>>>>>>> index 2d77fb8a8a01..af663978d1b4 100644
>>>>>>>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
>>>>>>>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
>>>>>>>> @@ -469,6 +469,7 @@ struct mlx5e_txqsq {
>>>>>>>>     union mlx5e_alloc_unit {
>>>>>>>>         struct page *page;
>>>>>>>>         struct xdp_buff *xsk;
>>>>>>>> +     struct mlx5e_xdp_buff *mxbuf;
>>>>>>>
>>>>>>> In XSK files below you mix usage of both alloc_units[page_idx].mxbuf and
>>>>>>> alloc_units[page_idx].xsk, while both fields share the memory of a union.
>>>>>>>
>>>>>>> As struct mlx5e_xdp_buff wraps struct xdp_buff, I think that you just
>>>>>>> need to change the existing xsk field type from struct xdp_buff *xsk
>>>>>>> into struct mlx5e_xdp_buff *xsk and align the usage.
>>>>>>
>>>>>> Hmmm, good point. I'm actually not sure how it works currently.
>>>>>> mlx5e_alloc_unit.mxbuf doesn't seem to be initialized anywhere? Toke,
>>>>>> am I missing something?
>>>>>
>>>>> It's initialised piecemeal in different places; but yeah, we're mixing
>>>>> things a bit...
>>>>>
>>>>>> I'm thinking about something like this:
>>>>>>
>>>>>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h
>>>>>> b/drivers/net/ethernet/mellanox/mlx5/core/en.h
>>>>>> index af663978d1b4..2d77fb8a8a01 100644
>>>>>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
>>>>>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
>>>>>> @@ -469,7 +469,6 @@ struct mlx5e_txqsq {
>>>>>>    union mlx5e_alloc_unit {
>>>>>>           struct page *page;
>>>>>>           struct xdp_buff *xsk;
>>>>>> -       struct mlx5e_xdp_buff *mxbuf;
>>>>>>    };
>>>>>
>>>>> Hmm, for consistency with the non-XSK path we should rather go the other
>>>>> direction and lose the xsk member, moving everything to mxbuf? Let me
>>>>> give that a shot...
>>>>
>>>> Something like the below?
>>>>
>>>> -Toke
>>>>
>>>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
>>>> index 6de02d8aeab8..cb9cdb6421c5 100644
>>>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
>>>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
>>>> @@ -468,7 +468,6 @@ struct mlx5e_txqsq {
>>>>    
>>>>    union mlx5e_alloc_unit {
>>>>    	struct page *page;
>>>> -	struct xdp_buff *xsk;
>>>>    	struct mlx5e_xdp_buff *mxbuf;
>>>>    };
>>>>    
>>>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
>>>> index cb568c62aba0..95694a25ec31 100644
>>>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
>>>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
>>>> @@ -33,6 +33,7 @@
>>>>    #define __MLX5_EN_XDP_H__
>>>>    
>>>>    #include <linux/indirect_call_wrapper.h>
>>>> +#include <net/xdp_sock_drv.h>
>>>>    
>>>>    #include "en.h"
>>>>    #include "en/txrx.h"
>>>> @@ -112,6 +113,21 @@ static inline void mlx5e_xmit_xdp_doorbell(struct mlx5e_xdpsq *sq)
>>>>    	}
>>>>    }
>>>>    
>>>> +static inline struct mlx5e_xdp_buff *mlx5e_xsk_buff_alloc(struct xsk_buff_pool *pool)
>>>> +{
>>>> +	return (struct mlx5e_xdp_buff *)xsk_buff_alloc(pool);
>>>
>>> What about the space needed for the rq / cqe fields? xsk_buff_alloc
>>> won't allocate it.
>> 
>> It will! See patch 14 in the series that adds a 'cb' field to
>> xdp_buff_xsk, meaning there's actually space after the xdp_buff struct
>> being allocated by the xsk_buff_alloc API. The XSK_CHECK_PRIV_TYPE macro
>> call is there to ensure the cb field is big enough for the struct we're
>> casting to in the driver.
>> 
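>> For reference, the layout that patch 14 creates looks roughly like
>> this (a sketch from my reading of the series, not the literal patch):
>>
>>   struct xdp_buff_xsk {
>>           struct xdp_buff xdp;            /* must stay first */
>>           u8 cb[XSK_PRIV_MAX];            /* driver-private scratch space */
>>           /* ... pool bookkeeping fields ... */
>>   };
>>
>>   #define XSK_CHECK_PRIV_TYPE(t) \
>>           BUILD_BUG_ON(sizeof(t) > offsetofend(struct xdp_buff_xsk, cb))
>>
>> so a driver wrapper like struct mlx5e_xdp_buff (an xdp_buff plus a few
>> extra members) fits entirely within xdp + cb, and casting the xdp_buff
>> pointer returned by xsk_buff_alloc() to it is safe.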
>
> Oh okay, got it.
>
>>>> +}
>>>> +
>>>> +static inline void mlx5e_xsk_buff_free(struct mlx5e_xdp_buff *mxbuf)
>>>> +{
>>>> +	xsk_buff_free(&mxbuf->xdp);
>>>> +}
>>>> +
>>>> +static inline dma_addr_t mlx5e_xsk_buff_xdp_get_frame_dma(struct mlx5e_xdp_buff *mxbuf)
>>>> +{
>>>> +	return xsk_buff_xdp_get_frame_dma(&mxbuf->xdp);
>>>> +}
>>>> +
>>>>    /* Enable inline WQEs to shift some load from a congested HCA (HW) to
>>>>     * a less congested cpu (SW).
>>>>     */
>>>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
>>>> index 8bf3029abd3c..1f166dbb7f22 100644
>>>> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
>>>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
>>>> @@ -3,7 +3,6 @@
>>>>    
>>>>    #include "rx.h"
>>>>    #include "en/xdp.h"
>>>> -#include <net/xdp_sock_drv.h>
>>>>    #include <linux/filter.h>
>>>>    
>>>>    /* RX data path */
>>>> @@ -21,7 +20,7 @@ int mlx5e_xsk_alloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
>>>>    	if (unlikely(!xsk_buff_can_alloc(rq->xsk_pool, rq->mpwqe.pages_per_wqe)))
>>>>    		goto err;
>>>>    
>>>> -	BUILD_BUG_ON(sizeof(wi->alloc_units[0]) != sizeof(wi->alloc_units[0].xsk));
>>>> +	BUILD_BUG_ON(sizeof(wi->alloc_units[0]) != sizeof(wi->alloc_units[0].mxbuf));
>>>>    	XSK_CHECK_PRIV_TYPE(struct mlx5e_xdp_buff);
>>>>    	batch = xsk_buff_alloc_batch(rq->xsk_pool, (struct xdp_buff **)wi->alloc_units,
>>>>    				     rq->mpwqe.pages_per_wqe);
>>>
>>> This batching API gets broken as well...
>>> xsk_buff_alloc_batch fills an array of struct xdp_buff pointers; it
>>> cannot correctly act on the array of struct mlx5e_xdp_buff, as it
>>> contains additional fields.
>> 
>> See above for why this does, in fact, work. I agree it's not totally
>> obvious, and in any case there's going to be a point where the cast
>> happens and type safety breaks, which is what I was alluding to in
>> my reply to Stanislav.
>> 
>> I guess we could try to rework the API in xdp_sock_drv.h to make this
>> more obvious instead of using the casting driver-specific wrappers I
>> suggested here. Or we could go with Stanislav's suggestion of keeping
>> allocation etc using xdp_buff and only casting to mlx5e_xdp_buff in the
>> function where it's used; then we can keep the casting localised to that
>> function, and put a comment there explaining why it works?
>> 
>
> Stanislav's proposal LGTM.
> Let's keep the casting localised, and make sure there's a comment there.

Alright, cool!

-Toke
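
For illustration, the "localised cast" agreed on above could look roughly
like this (function and field names are placeholders, not the final patch):

  /* The alloc unit keeps the generic xdp_buff type... */
  union mlx5e_alloc_unit {
          struct page *page;
          struct xdp_buff *xsk;
  };

  /* ...and only the code that needs the extra members casts, once: */
  static void mlx5e_xsk_process_buff(struct mlx5e_rq *rq,
                                     struct mlx5_cqe64 *cqe,
                                     struct xdp_buff *xdp)
  {
          /* struct mlx5e_xdp_buff wraps struct xdp_buff, and
           * XSK_CHECK_PRIV_TYPE() has verified that the cb area of
           * xdp_buff_xsk is large enough to hold the extra members,
           * so this cast is safe for XSK buffers.
           */
          struct mlx5e_xdp_buff *mxbuf = (struct mlx5e_xdp_buff *)xdp;

          mxbuf->cqe = cqe;
          mxbuf->rq = rq;
  }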


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [xdp-hints] Re: [PATCH bpf-next v7 01/17] bpf: Document XDP RX metadata
  2023-01-12  0:32 ` [xdp-hints] [PATCH bpf-next v7 01/17] bpf: Document XDP RX metadata Stanislav Fomichev
@ 2023-01-16 13:09   ` Jesper Dangaard Brouer
  2023-01-17 20:33     ` Stanislav Fomichev
  0 siblings, 1 reply; 42+ messages in thread
From: Jesper Dangaard Brouer @ 2023-01-16 13:09 UTC (permalink / raw)
  To: Stanislav Fomichev, bpf
  Cc: brouer, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev, David Vernet



On 12/01/2023 01.32, Stanislav Fomichev wrote:
> Document all current use-cases and assumptions.
> 
> Cc: John Fastabend <john.fastabend@gmail.com>
> Cc: David Ahern <dsahern@gmail.com>
> Cc: Martin KaFai Lau <martin.lau@linux.dev>
> Cc: Jakub Kicinski <kuba@kernel.org>
> Cc: Willem de Bruijn <willemb@google.com>
> Cc: Jesper Dangaard Brouer <brouer@redhat.com>
> Cc: Anatoly Burakov <anatoly.burakov@intel.com>
> Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
> Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
> Cc: Maryam Tahhan <mtahhan@redhat.com>
> Cc: xdp-hints@xdp-project.net
> Cc: netdev@vger.kernel.org
> Acked-by: David Vernet <void@manifault.com>
> Signed-off-by: Stanislav Fomichev <sdf@google.com>
> ---
>   Documentation/networking/index.rst           |   1 +
>   Documentation/networking/xdp-rx-metadata.rst | 108 +++++++++++++++++++
>   2 files changed, 109 insertions(+)
>   create mode 100644 Documentation/networking/xdp-rx-metadata.rst
> 
> diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst
> index 4f2d1f682a18..4ddcae33c336 100644
> --- a/Documentation/networking/index.rst
> +++ b/Documentation/networking/index.rst
> @@ -120,6 +120,7 @@ Refer to :ref:`netdev-FAQ` for a guide on netdev development process specifics.
>      xfrm_proc
>      xfrm_sync
>      xfrm_sysctl
> +   xdp-rx-metadata
>   
>   .. only::  subproject and html
>   
> diff --git a/Documentation/networking/xdp-rx-metadata.rst b/Documentation/networking/xdp-rx-metadata.rst
> new file mode 100644
> index 000000000000..b6c8c77937c4
> --- /dev/null
> +++ b/Documentation/networking/xdp-rx-metadata.rst
> @@ -0,0 +1,108 @@
> +===============
> +XDP RX Metadata
> +===============
> +
> +This document describes how an eXpress Data Path (XDP) program can access
> +hardware metadata related to a packet using a set of helper functions,
> +and how it can pass that metadata on to other consumers.
> +
> +General Design
> +==============
> +
> +XDP has access to a set of kfuncs to manipulate the metadata in an XDP frame.
> +Every device driver that wishes to expose additional packet metadata can
> +implement these kfuncs. The set of kfuncs is declared in ``include/net/xdp.h``
> +via ``XDP_METADATA_KFUNC_xxx``.
> +
> +Currently, the following kfuncs are supported. In the future, as more
> +metadata is supported, this set will grow:
> +
> +.. kernel-doc:: net/core/xdp.c
> +   :identifiers: bpf_xdp_metadata_rx_timestamp bpf_xdp_metadata_rx_hash
> +
> +An XDP program can use these kfuncs to read the metadata into stack
> +variables for its own consumption. Or, to pass the metadata on to other
> +consumers, an XDP program can store it into the metadata area carried
> +ahead of the packet.
> +
> +Not all kfuncs have to be implemented by the device driver; when not
> +implemented, the default ones that return ``-EOPNOTSUPP`` will be used.
> +
> +Within an XDP frame, the metadata layout is as follows::

The diagram below describes an XDP buff (xdp_buff), but the text says 'XDP
frame', so 'XDP frame' isn't referring literally to xdp_frame, which I find
slightly confusing. It is likely because I think too much about the code
and the different objects: xdp_frame, xdp_buff, xdp_md (the xdp ctx seen by
the bpf-prog).

I tried to grep in the (recently added) bpf/xdp docs to see if there is a
definition of an XDP "packet" or "frame".  Nothing popped up, except that
Documentation/bpf/map_cpumap.rst talks about raw ``xdp_frame`` objects.

Perhaps we can improve this doc by calling out xdp_buff here, like:

  Within an XDP frame, the metadata layout (accessed via ``xdp_buff``)
  is as follows::

> +
> +  +----------+-----------------+------+
> +  | headroom | custom metadata | data |
> +  +----------+-----------------+------+
> +             ^                 ^
> +             |                 |
> +   xdp_buff->data_meta   xdp_buff->data
> +
> +An XDP program can store individual metadata items into this ``data_meta``
> +area in whichever format it chooses. Later consumers of the metadata
> +will have to agree on the format by some out of band contract (like for
> +the AF_XDP use case, see below).
> +
> +AF_XDP
> +======
> +
> +:doc:`af_xdp` use-case implies that there is a contract between the BPF
> +program that redirects XDP frames into the ``AF_XDP`` socket (``XSK``) and
> +the final consumer. Thus the BPF program manually allocates a fixed number of
> +bytes out of metadata via ``bpf_xdp_adjust_meta`` and calls a subset
> +of kfuncs to populate it. The userspace ``XSK`` consumer computes
> +``xsk_umem__get_data() - METADATA_SIZE`` to locate that metadata.
> +Note, ``xsk_umem__get_data`` is defined in ``libxdp`` and
> +``METADATA_SIZE`` is an application-specific constant.

The main problem with AF_XDP and metadata is that the AF_XDP descriptor
doesn't contain any info about the metadata length METADATA_SIZE.

The text does say this, but in a very convoluted way.
I think this challenge should be more clearly spelled out.

(p.s. This was something that XDP-hints via BTF had a proposed solution
for.)
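
For reference, a minimal sketch of the producer side of this contract
(patterned on the selftests in this series; the metadata struct and the
map are assumptions that must be agreed out of band with the consumer):

  #include <linux/bpf.h>
  #include <bpf/bpf_helpers.h>

  struct meta {
          __u64 rx_timestamp;
          __u32 rx_hash;
  };

  extern int bpf_xdp_metadata_rx_timestamp(const struct xdp_md *ctx,
                                           __u64 *timestamp) __ksym;
  extern int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx,
                                      __u32 *hash) __ksym;

  struct {
          __uint(type, BPF_MAP_TYPE_XSKMAP);
          __uint(max_entries, 64);
          __type(key, __u32);
          __type(value, __u32);
  } xsk_map SEC(".maps");

  SEC("xdp")
  int rx(struct xdp_md *ctx)
  {
          struct meta *meta;
          void *data;

          /* Reserve sizeof(*meta) bytes in front of the packet. */
          if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(*meta)))
                  return XDP_PASS;

          data = (void *)(long)ctx->data;
          meta = (void *)(long)ctx->data_meta;
          if ((void *)(meta + 1) > data)  /* verifier bounds check */
                  return XDP_PASS;

          bpf_xdp_metadata_rx_timestamp(ctx, &meta->rx_timestamp);
          bpf_xdp_metadata_rx_hash(ctx, &meta->rx_hash);

          return bpf_redirect_map(&xsk_map, ctx->rx_queue_index, XDP_PASS);
  }

  char _license[] SEC("license") = "GPL";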

> +
> +Here is the ``AF_XDP`` consumer layout (note missing ``data_meta`` pointer)::

The "note" also hint to this issue.

> +
> +  +----------+-----------------+------+
> +  | headroom | custom metadata | data |
> +  +----------+-----------------+------+
> +                               ^
> +                               |
> +                        rx_desc->address
> +
> +XDP_PASS
> +========
> +
> +This is the path where the packets processed by the XDP program are passed
> +into the kernel. The kernel creates the ``skb`` out of the ``xdp_buff``
> +contents. Currently, every driver has custom kernel code to parse
> +the descriptors and populate ``skb`` metadata when doing this ``xdp_buff->skb``
> +conversion, and the XDP metadata is not used by the kernel when building
> +``skbs``. However, TC-BPF programs can access the XDP metadata area using
> +the ``data_meta`` pointer.
> +
> +In the future, we'd like to support a case where an XDP program
> +can override some of the metadata used for building ``skbs``.

Happy this is mentioned as future work.

> +
> +bpf_redirect_map
> +================
> +
> +``bpf_redirect_map`` can redirect the frame to a different device.
> +Some devices (like virtual ethernet links) support running a second XDP
> +program after the redirect. However, the final consumer doesn't have
> +access to the original hardware descriptor and can't access any of
> +the original metadata. The same applies to XDP programs installed
> +into devmaps and cpumaps.
> +
> +This means that for redirected packets only custom metadata is
> +currently supported, which has to be prepared by the initial XDP program
> +before redirect. If the frame is eventually passed to the kernel, the
> +``skb`` created from such a frame won't have any hardware metadata populated
> +in its ``skb``. If such a packet is later redirected into an ``XSK``,
> +that will also only have access to the custom metadata.
> +

Good that this is documented, but I hope we can fix/improve this as
future work.

> +bpf_tail_call
> +=============
> +
> +Adding programs that access metadata kfuncs to the ``BPF_MAP_TYPE_PROG_ARRAY``
> +is currently not supported.
> +
> +Example
> +=======
> +
> +See ``tools/testing/selftests/bpf/progs/xdp_metadata.c`` and
> +``tools/testing/selftests/bpf/prog_tests/xdp_metadata.c`` for an example of
> +BPF program that handles XDP metadata.


--Jesper


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [xdp-hints] Re: [PATCH bpf-next v7 10/17] veth: Support RX XDP metadata
  2023-01-12  0:32 ` [xdp-hints] [PATCH bpf-next v7 10/17] veth: Support RX XDP metadata Stanislav Fomichev
@ 2023-01-16 16:21   ` Jesper Dangaard Brouer
  2023-01-17 20:33     ` Stanislav Fomichev
  0 siblings, 1 reply; 42+ messages in thread
From: Jesper Dangaard Brouer @ 2023-01-16 16:21 UTC (permalink / raw)
  To: Stanislav Fomichev, bpf
  Cc: brouer, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev



On 12/01/2023 01.32, Stanislav Fomichev wrote:
> The goal is to enable end-to-end testing of the metadata for AF_XDP.

For me the goal with veth goes beyond *testing*.

This patch ignores the xdp_frame case.  I'm not blocking this patch, but
I'm saying we need to make sure there is a way forward for accessing
XDP-hints when handling redirected xdp_frames.

I have two use-cases we should cover (as future work).

(#1) We have customers that want to redirect from physical NIC hardware
into containers, and then have the veth XDP-prog (selectively) redirect
into an AF_XDP socket (when matching fastpath packets).  Here they want
(at minimum) access to the XDP hint info on HW checksum.

(#2) Both veth and cpumap can create SKBs based on xdp_frames.  Here it
is essential to get the HW checksum and HW hash when creating these SKBs
(else the netstack has to do an expensive csum calc and parsing in the
flow-dissector).

> Cc: John Fastabend <john.fastabend@gmail.com>
> Cc: David Ahern <dsahern@gmail.com>
> Cc: Martin KaFai Lau <martin.lau@linux.dev>
> Cc: Jakub Kicinski <kuba@kernel.org>
> Cc: Willem de Bruijn <willemb@google.com>
> Cc: Jesper Dangaard Brouer <brouer@redhat.com>
> Cc: Anatoly Burakov <anatoly.burakov@intel.com>
> Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
> Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
> Cc: Maryam Tahhan <mtahhan@redhat.com>
> Cc: xdp-hints@xdp-project.net
> Cc: netdev@vger.kernel.org
> Signed-off-by: Stanislav Fomichev <sdf@google.com>
> ---
>   drivers/net/veth.c | 31 +++++++++++++++++++++++++++++++
>   1 file changed, 31 insertions(+)
> 
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index 70f50602287a..ba3e05832843 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -118,6 +118,7 @@ static struct {
>   
>   struct veth_xdp_buff {
>   	struct xdp_buff xdp;
> +	struct sk_buff *skb;
>   };
>   
>   static int veth_get_link_ksettings(struct net_device *dev,
> @@ -602,6 +603,7 @@ static struct xdp_frame *veth_xdp_rcv_one(struct veth_rq *rq,
>   
>   		xdp_convert_frame_to_buff(frame, xdp);
>   		xdp->rxq = &rq->xdp_rxq;
> +		vxbuf.skb = NULL;
>   
>   		act = bpf_prog_run_xdp(xdp_prog, xdp);
>   
> @@ -823,6 +825,7 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
>   	__skb_push(skb, skb->data - skb_mac_header(skb));
>   	if (veth_convert_skb_to_xdp_buff(rq, xdp, &skb))
>   		goto drop;
> +	vxbuf.skb = skb;
>   
>   	orig_data = xdp->data;
>   	orig_data_end = xdp->data_end;
> @@ -1602,6 +1605,28 @@ static int veth_xdp(struct net_device *dev, struct netdev_bpf *xdp)
>   	}
>   }
>   
> +static int veth_xdp_rx_timestamp(const struct xdp_md *ctx, u64 *timestamp)
> +{
> +	struct veth_xdp_buff *_ctx = (void *)ctx;
> +
> +	if (!_ctx->skb)
> +		return -EOPNOTSUPP;
> +
> +	*timestamp = skb_hwtstamps(_ctx->skb)->hwtstamp;

The SKB stores this skb_hwtstamps() in skb_shared_info memory area.
This memory area is actually also available to xdp_frames.  Thus, we
could store the HW rx_timestamp in same location for redirected
xdp_frames.  This could make code path sharing possible between SKB vs
xdp_frame in veth.

This would also make it fast to "transfer" HW rx_timestamp when creating
an SKB from an xdp_frame, as data is already written in the correct place.

Performance-wise the downside is that the skb_shared_info memory area is
in a separate cacheline.  Thus, when no HW rx_timestamp is available, it
is very expensive for a veth XDP bpf-prog to access this, just to get a
zero back.  Having an xdp_frame->flags bit that indicates whether a HW
rx_timestamp has been stored can mitigate this.


> +	return 0;
> +}
> +
> +static int veth_xdp_rx_hash(const struct xdp_md *ctx, u32 *hash)
> +{
> +	struct veth_xdp_buff *_ctx = (void *)ctx;
> +
> +	if (!_ctx->skb)
> +		return -EOPNOTSUPP;

For the xdp_frame case, I'm considering simply storing the u32 RX-hash in
struct xdp_frame.  This makes it easy to extract for the xdp_frame-to-SKB
create use-case.

As has been mentioned before, the SKB also requires knowing the RSS
hash-type.  This HW hash-type actually contains a lot of information
that today is lost when reduced to the SKB hash-type.  Due to
standardization from Microsoft, most HW provides info on (L3) IPv4 or
IPv6, and on (L4) TCP or UDP (and often SCTP).  Often the hardware
descriptor also provides info on the header length.  Future work in this
area is exciting, as we can speed up parsing of packets in XDP if we can
get more detailed HW info on the hash "packet-type".

> +
> +	*hash = skb_get_hash(_ctx->skb);
> +	return 0;
> +}
> +

--Jesper


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [xdp-hints] Re: [PATCH bpf-next v7 10/17] veth: Support RX XDP metadata
  2023-01-16 16:21   ` [xdp-hints] " Jesper Dangaard Brouer
@ 2023-01-17 20:33     ` Stanislav Fomichev
  2023-01-18 15:57       ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 42+ messages in thread
From: Stanislav Fomichev @ 2023-01-17 20:33 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: bpf, brouer, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

On Mon, Jan 16, 2023 at 8:21 AM Jesper Dangaard Brouer
<jbrouer@redhat.com> wrote:
>
>
>
> On 12/01/2023 01.32, Stanislav Fomichev wrote:
> > The goal is to enable end-to-end testing of the metadata for AF_XDP.
>
> For me the goal with veth goes beyond *testing*.
>
> This patch ignores the xdp_frame case.  I'm not blocking this patch, but
> I'm saying we need to make sure there is a way forward for accessing
> XDP-hints when handling redirected xdp_frames.

Sure, let's work towards getting that other part addressed!

> I have two use-cases we should cover (as future work).
>
> (#1) We have customers that want to redirect from physical NIC hardware
> into containers, and then have the veth XDP-prog (selectively) redirect
> into an AF_XDP socket (when matching fastpath packets).  Here they want
> (at minimum) access to the XDP hint info on HW checksum.
>
> (#2) Both veth and cpumap can create SKBs based on xdp_frames.  Here it
> is essential to get the HW checksum and HW hash when creating these SKBs
> (else the netstack has to do an expensive csum calc and parsing in the
> flow-dissector).

From my PoV, I'd probably have to look into the TX side first (tx
timestamp) before looking into xdp->skb path. So if somebody on your
side has cycles, feel free to drive this effort. I'm happy to provide
reviews/comments/etc. I think we've discussed in the past that this
will most likely look like another set of "export" kfuncs?

We can start with extending new
Documentation/networking/xdp-rx-metadata.rst with a high-level design.

> > Cc: John Fastabend <john.fastabend@gmail.com>
> > Cc: David Ahern <dsahern@gmail.com>
> > Cc: Martin KaFai Lau <martin.lau@linux.dev>
> > Cc: Jakub Kicinski <kuba@kernel.org>
> > Cc: Willem de Bruijn <willemb@google.com>
> > Cc: Jesper Dangaard Brouer <brouer@redhat.com>
> > Cc: Anatoly Burakov <anatoly.burakov@intel.com>
> > Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
> > Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
> > Cc: Maryam Tahhan <mtahhan@redhat.com>
> > Cc: xdp-hints@xdp-project.net
> > Cc: netdev@vger.kernel.org
> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > ---
> >   drivers/net/veth.c | 31 +++++++++++++++++++++++++++++++
> >   1 file changed, 31 insertions(+)
> >
> > diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> > index 70f50602287a..ba3e05832843 100644
> > --- a/drivers/net/veth.c
> > +++ b/drivers/net/veth.c
> > @@ -118,6 +118,7 @@ static struct {
> >
> >   struct veth_xdp_buff {
> >       struct xdp_buff xdp;
> > +     struct sk_buff *skb;
> >   };
> >
> >   static int veth_get_link_ksettings(struct net_device *dev,
> > @@ -602,6 +603,7 @@ static struct xdp_frame *veth_xdp_rcv_one(struct veth_rq *rq,
> >
> >               xdp_convert_frame_to_buff(frame, xdp);
> >               xdp->rxq = &rq->xdp_rxq;
> > +             vxbuf.skb = NULL;
> >
> >               act = bpf_prog_run_xdp(xdp_prog, xdp);
> >
> > @@ -823,6 +825,7 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
> >       __skb_push(skb, skb->data - skb_mac_header(skb));
> >       if (veth_convert_skb_to_xdp_buff(rq, xdp, &skb))
> >               goto drop;
> > +     vxbuf.skb = skb;
> >
> >       orig_data = xdp->data;
> >       orig_data_end = xdp->data_end;
> > @@ -1602,6 +1605,28 @@ static int veth_xdp(struct net_device *dev, struct netdev_bpf *xdp)
> >       }
> >   }
> >
> > +static int veth_xdp_rx_timestamp(const struct xdp_md *ctx, u64 *timestamp)
> > +{
> > +     struct veth_xdp_buff *_ctx = (void *)ctx;
> > +
> > +     if (!_ctx->skb)
> > +             return -EOPNOTSUPP;
> > +
> > +     *timestamp = skb_hwtstamps(_ctx->skb)->hwtstamp;
>
> The SKB stores this skb_hwtstamps() in skb_shared_info memory area.
> This memory area is actually also available to xdp_frames.  Thus, we
> could store the HW rx_timestamp in same location for redirected
> xdp_frames.  This could make code path sharing possible between SKB vs
> xdp_frame in veth.
>
> This would also make it fast to "transfer" HW rx_timestamp when creating
> an SKB from an xdp_frame, as data is already written in the correct place.
>
> Performance-wise the downside is that the skb_shared_info memory area is
> in a separate cacheline.  Thus, when no HW rx_timestamp is available, it
> is very expensive for a veth XDP bpf-prog to access this, just to get a
> zero back.  Having an xdp_frame->flags bit that indicates whether a HW
> rx_timestamp has been stored can mitigate this.

That's one way to do it; although I'm not sure about the cases which
don't use xdp_frame and use stack-allocated xdp_buff.

> > +     return 0;
> > +}
> > +
> > +static int veth_xdp_rx_hash(const struct xdp_md *ctx, u32 *hash)
> > +{
> > +     struct veth_xdp_buff *_ctx = (void *)ctx;
> > +
> > +     if (!_ctx->skb)
> > +             return -EOPNOTSUPP;
>
> For the xdp_frame case, I'm considering simply storing the u32 RX-hash in
> struct xdp_frame.  This makes it easy to extract for the xdp_frame-to-SKB
> create use-case.
>
> As has been mentioned before, the SKB also requires knowing the RSS
> hash-type.  This HW hash-type actually contains a lot of information
> that today is lost when reduced to the SKB hash-type.  Due to
> standardization from Microsoft, most HW provides info on (L3) IPv4 or
> IPv6, and on (L4) TCP or UDP (and often SCTP).  Often the hardware
> descriptor also provides info on the header length.  Future work in this
> area is exciting, as we can speed up parsing of packets in XDP if we can
> get more detailed HW info on the hash "packet-type".

Something like the version we've discussed a while back [0]?
Seems workable overall if we remove it from the UAPI? (not everyone
was happy about UAPI parts IIRC)

0: https://lore.kernel.org/bpf/20221115030210.3159213-7-sdf@google.com/


> > +
> > +     *hash = skb_get_hash(_ctx->skb);
> > +     return 0;
> > +}
> > +
>
> --Jesper
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [xdp-hints] Re: [PATCH bpf-next v7 01/17] bpf: Document XDP RX metadata
  2023-01-16 13:09   ` [xdp-hints] " Jesper Dangaard Brouer
@ 2023-01-17 20:33     ` Stanislav Fomichev
  2023-01-18 14:28       ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 42+ messages in thread
From: Stanislav Fomichev @ 2023-01-17 20:33 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: bpf, brouer, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev, David Vernet

On Mon, Jan 16, 2023 at 5:09 AM Jesper Dangaard Brouer
<jbrouer@redhat.com> wrote:
>
>
>
> On 12/01/2023 01.32, Stanislav Fomichev wrote:
> > Document all current use-cases and assumptions.
> >
> > Cc: John Fastabend <john.fastabend@gmail.com>
> > Cc: David Ahern <dsahern@gmail.com>
> > Cc: Martin KaFai Lau <martin.lau@linux.dev>
> > Cc: Jakub Kicinski <kuba@kernel.org>
> > Cc: Willem de Bruijn <willemb@google.com>
> > Cc: Jesper Dangaard Brouer <brouer@redhat.com>
> > Cc: Anatoly Burakov <anatoly.burakov@intel.com>
> > Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
> > Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
> > Cc: Maryam Tahhan <mtahhan@redhat.com>
> > Cc: xdp-hints@xdp-project.net
> > Cc: netdev@vger.kernel.org
> > Acked-by: David Vernet <void@manifault.com>
> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > ---
> >   Documentation/networking/index.rst           |   1 +
> >   Documentation/networking/xdp-rx-metadata.rst | 108 +++++++++++++++++++
> >   2 files changed, 109 insertions(+)
> >   create mode 100644 Documentation/networking/xdp-rx-metadata.rst
> >
> > diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst
> > index 4f2d1f682a18..4ddcae33c336 100644
> > --- a/Documentation/networking/index.rst
> > +++ b/Documentation/networking/index.rst
> > @@ -120,6 +120,7 @@ Refer to :ref:`netdev-FAQ` for a guide on netdev development process specifics.
> >      xfrm_proc
> >      xfrm_sync
> >      xfrm_sysctl
> > +   xdp-rx-metadata
> >
> >   .. only::  subproject and html
> >
> > diff --git a/Documentation/networking/xdp-rx-metadata.rst b/Documentation/networking/xdp-rx-metadata.rst
> > new file mode 100644
> > index 000000000000..b6c8c77937c4
> > --- /dev/null
> > +++ b/Documentation/networking/xdp-rx-metadata.rst
> > @@ -0,0 +1,108 @@
> > +===============
> > +XDP RX Metadata
> > +===============
> > +
> > +This document describes how an eXpress Data Path (XDP) program can access
> > +hardware metadata related to a packet using a set of helper functions,
> > +and how it can pass that metadata on to other consumers.
> > +
> > +General Design
> > +==============
> > +
> > +XDP has access to a set of kfuncs to manipulate the metadata in an XDP frame.
> > +Every device driver that wishes to expose additional packet metadata can
> > +implement these kfuncs. The set of kfuncs is declared in ``include/net/xdp.h``
> > +via ``XDP_METADATA_KFUNC_xxx``.
> > +
> > +Currently, the following kfuncs are supported. In the future, as more
> > +metadata is supported, this set will grow:
> > +
> > +.. kernel-doc:: net/core/xdp.c
> > +   :identifiers: bpf_xdp_metadata_rx_timestamp bpf_xdp_metadata_rx_hash
> > +
> > +An XDP program can use these kfuncs to read the metadata into stack
> > +variables for its own consumption. Or, to pass the metadata on to other
> > +consumers, an XDP program can store it into the metadata area carried
> > +ahead of the packet.
> > +
> > +Not all kfuncs have to be implemented by the device driver; when not
> > +implemented, the default ones that return ``-EOPNOTSUPP`` will be used.
> > +
> > +Within an XDP frame, the metadata layout is as follows::
>
> The diagram below describes an XDP buff (xdp_buff), but the text says 'XDP
> frame', so 'XDP frame' isn't referring literally to xdp_frame, which I find
> slightly confusing. It is likely because I think too much about the code
> and the different objects: xdp_frame, xdp_buff, xdp_md (the xdp ctx seen by
> the bpf-prog).
>
> I tried to grep in the (recently added) bpf/xdp docs to see if there is a
> definition of an XDP "packet" or "frame".  Nothing popped up, except that
> Documentation/bpf/map_cpumap.rst talks about raw ``xdp_frame`` objects.
>
> Perhaps we can improve this doc by calling out xdp_buff here, like:
>
>   Within an XDP frame, the metadata layout (accessed via ``xdp_buff``)
>   is as follows::

Sure, will amend!

> > +
> > +  +----------+-----------------+------+
> > +  | headroom | custom metadata | data |
> > +  +----------+-----------------+------+
> > +             ^                 ^
> > +             |                 |
> > +   xdp_buff->data_meta   xdp_buff->data
> > +
> > +An XDP program can store individual metadata items into this ``data_meta``
> > +area in whichever format it chooses. Later consumers of the metadata
> > +will have to agree on the format by some out of band contract (like for
> > +the AF_XDP use case, see below).
> > +
> > +AF_XDP
> > +======
> > +
> > +:doc:`af_xdp` use-case implies that there is a contract between the BPF
> > +program that redirects XDP frames into the ``AF_XDP`` socket (``XSK``) and
> > +the final consumer. Thus the BPF program manually allocates a fixed number of
> > +bytes out of metadata via ``bpf_xdp_adjust_meta`` and calls a subset
> > +of kfuncs to populate it. The userspace ``XSK`` consumer computes
> > +``xsk_umem__get_data() - METADATA_SIZE`` to locate that metadata.
> > +Note, ``xsk_umem__get_data`` is defined in ``libxdp`` and
> > +``METADATA_SIZE`` is an application-specific constant.
>
> The main problem with AF_XDP and metadata is that the AF_XDP descriptor
> doesn't contain any info about the metadata length METADATA_SIZE.
>
> The text does say this, but in a very convoluted way.
> I think this challenge should be more clearly spelled out.
>
> (p.s. This was something that XDP-hints via BTF had a proposed solution
> for.)

Any suggestions on how to clarify it better? I have two hints:
1. ``METADATA_SIZE`` is an application-specific constant
2. note missing ``data_meta`` pointer

Do you prefer I also add a sentence where I spell it out more
explicitly? Something like:

Note, ``xsk_umem__get_data`` is defined in ``libxdp`` and
``METADATA_SIZE`` is an application-specific constant (``AF_XDP``
receive descriptor does _not_ explicitly carry the size of the
metadata).
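
For concreteness, a minimal consumer-side sketch of that contract (all
names are assumptions; the struct layout must match whatever the BPF
program wrote via bpf_xdp_adjust_meta()):

  struct meta {                   /* same layout as the producer side */
          __u64 rx_timestamp;
          __u32 rx_hash;
  };

  /* rx_ring, idx and umem_area come from the usual AF_XDP receive loop. */
  const struct xdp_desc *desc = xsk_ring_cons__rx_desc(&rx_ring, idx);
  char *data = xsk_umem__get_data(umem_area, desc->addr);
  struct meta *meta = (struct meta *)(data - sizeof(struct meta));

  printf("hash: 0x%x ts: %llu\n", meta->rx_hash,
         (unsigned long long)meta->rx_timestamp);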

> > +
> > +Here is the ``AF_XDP`` consumer layout (note missing ``data_meta`` pointer)::
>
> The "note" also hint to this issue.

This seems like an explicit design choice of the AF_XDP? In theory, I
don't see why we can't have a v2 receive descriptor format where we
return the size of the metadata?

> > +
> > +  +----------+-----------------+------+
> > +  | headroom | custom metadata | data |
> > +  +----------+-----------------+------+
> > +                               ^
> > +                               |
> > +                        rx_desc->address
> > +
> > +XDP_PASS
> > +========
> > +
> > +This is the path where the packets processed by the XDP program are passed
> > +into the kernel. The kernel creates the ``skb`` out of the ``xdp_buff``
> > +contents. Currently, every driver has custom kernel code to parse
> > +the descriptors and populate ``skb`` metadata when doing this ``xdp_buff->skb``
> > +conversion, and the XDP metadata is not used by the kernel when building
> > +``skbs``. However, TC-BPF programs can access the XDP metadata area using
> > +the ``data_meta`` pointer.
> > +
> > +In the future, we'd like to support a case where an XDP program
> > +can override some of the metadata used for building ``skbs``.
>
> Happy this is mentioned as future work.

As mentioned in a separate email, if you prefer to focus on that, feel
free to drive it since I'm gonna look into the TX side first.

> > +
> > +bpf_redirect_map
> > +================
> > +
> > +``bpf_redirect_map`` can redirect the frame to a different device.
> > +Some devices (like virtual ethernet links) support running a second XDP
> > +program after the redirect. However, the final consumer doesn't have
> > +access to the original hardware descriptor and can't access any of
> > +the original metadata. The same applies to XDP programs installed
> > +into devmaps and cpumaps.
> > +
> > +This means that for redirected packets only custom metadata is
> > +currently supported, which has to be prepared by the initial XDP program
> > +before redirect. If the frame is eventually passed to the kernel, the
> > +``skb`` created from such a frame won't have any hardware metadata populated
> > +in its ``skb``. If such a packet is later redirected into an ``XSK``,
> > +that will also only have access to the custom metadata.
> > +
>
> Good that this is documented, but I hope we can fix/improve this as
> future work.

Definitely! Hopefully documenting it here acts as a sort-of TODO which
we can eventually address. Maybe even starting with a section here on
how it is supposed to work :-)


> > +bpf_tail_call
> > +=============
> > +
> > +Adding programs that access metadata kfuncs to the ``BPF_MAP_TYPE_PROG_ARRAY``
> > +is currently not supported.
> > +
> > +Example
> > +=======
> > +
> > +See ``tools/testing/selftests/bpf/progs/xdp_metadata.c`` and
> > +``tools/testing/selftests/bpf/prog_tests/xdp_metadata.c`` for an example of
> > +BPF program that handles XDP metadata.
>
>
> --Jesper
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

* [xdp-hints] Re: [PATCH bpf-next v7 01/17] bpf: Document XDP RX metadata
  2023-01-17 20:33     ` Stanislav Fomichev
@ 2023-01-18 14:28       ` Jesper Dangaard Brouer
  2023-01-18 17:55         ` Stanislav Fomichev
  0 siblings, 1 reply; 42+ messages in thread
From: Jesper Dangaard Brouer @ 2023-01-18 14:28 UTC (permalink / raw)
  To: Stanislav Fomichev, Jesper Dangaard Brouer
  Cc: brouer, bpf, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev, David Vernet, Björn Töpel


On 17/01/2023 21.33, Stanislav Fomichev wrote:
> On Mon, Jan 16, 2023 at 5:09 AM Jesper Dangaard Brouer
> <jbrouer@redhat.com> wrote:
>>
>> On 12/01/2023 01.32, Stanislav Fomichev wrote:
>>> Document all current use-cases and assumptions.
>>>
[...]
>>> Acked-by: David Vernet <void@manifault.com>
>>> Signed-off-by: Stanislav Fomichev <sdf@google.com>
>>> ---
>>>    Documentation/networking/index.rst           |   1 +
>>>    Documentation/networking/xdp-rx-metadata.rst | 108 +++++++++++++++++++
>>>    2 files changed, 109 insertions(+)
>>>    create mode 100644 Documentation/networking/xdp-rx-metadata.rst
>>>
>>> diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst
>>> index 4f2d1f682a18..4ddcae33c336 100644
>>> --- a/Documentation/networking/index.rst
>>> +++ b/Documentation/networking/index.rst
[...cut...]
>>> +AF_XDP
>>> +======
>>> +
>>> +:doc:`af_xdp` use-case implies that there is a contract between the BPF
>>> +program that redirects XDP frames into the ``AF_XDP`` socket (``XSK``) and
>>> +the final consumer. Thus the BPF program manually allocates a fixed number of
>>> +bytes out of metadata via ``bpf_xdp_adjust_meta`` and calls a subset
>>> +of kfuncs to populate it. The userspace ``XSK`` consumer computes
>>> +``xsk_umem__get_data() - METADATA_SIZE`` to locate that metadata.
>>> +Note, ``xsk_umem__get_data`` is defined in ``libxdp`` and
>>> +``METADATA_SIZE`` is an application-specific constant.
>>
>> The main problem with AF_XDP and metadata is that the AF_XDP descriptor
>> doesn't contain any info about the metadata length METADATA_SIZE.
>>
>> The text does say this, but in a very convoluted way.
>> I think this challenge should be more clearly spelled out.
>>
>> (p.s. This was something that XDP-hints via BTF had a proposed solution
>> for.)
> 
> Any suggestions on how to clarify it better? I have two hints:
> 1. ``METADATA_SIZE`` is an application-specific constant
> 2. note missing ``data_meta`` pointer
> 
> Do you prefer I also add a sentence where I spell it out more
> explicitly? Something like:
> 
> Note, ``xsk_umem__get_data`` is defined in ``libxdp`` and
> ``METADATA_SIZE`` is an application-specific constant (``AF_XDP``
> receive descriptor does _not_ explicitly carry the size of the
> metadata).

That addition works for me.
(Later we can hopefully come up with a solution for this)

>>> +
>>> +Here is the ``AF_XDP`` consumer layout (note missing ``data_meta`` pointer)::
>>
>> The "note" also hint to this issue.
> 
> This seems like an explicit design choice of the AF_XDP? In theory, I
> don't see why we can't have a v2 receive descriptor format where we
> return the size of the metadata?

(Cc. Magnus+Bjørn)
Yes, it was a design choice from AF_XDP not to include the metadata length.

The AF_XDP descriptor, struct xdp_desc (below), is defined in
include/uapi/linux/if_xdp.h.

  /* Rx/Tx descriptor */
  struct xdp_desc {
	__u64 addr;
	__u32 len;
	__u32 options;
  };

It does contain a 'u32 options' field that we could use.

In previous discussions, the proposed solution (from Bjørn+Magnus) was
to use some bits in the 'options' field to say metadata is present, and
to have the 4 (or 8) bytes before xsk_umem__get_data() contain a BTF_ID.
The AF_XDP programmer can then get the metadata length by looking up the
BTF_ID.
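
A rough consumer sketch of that scheme (the flag bit is purely
hypothetical; no such bit is defined in the options field today):

  #define XDP_DESC_OPT_HAS_META (1 << 0)  /* hypothetical options bit */

  char *data = xsk_umem__get_data(umem_area, desc->addr);

  if (desc->options & XDP_DESC_OPT_HAS_META) {
          __u32 btf_id = *(__u32 *)(data - sizeof(__u32));
          /* Look up btf_id to learn the metadata struct (and thus its
           * size); the metadata itself sits directly below the BTF_ID.
           */
  }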


>>> +
>>> +  +----------+-----------------+------+
>>> +  | headroom | custom metadata | data |
>>> +  +----------+-----------------+------+
>>> +                               ^
>>> +                               |
>>> +                        rx_desc->address
>>> +
>>> +XDP_PASS
>>> +========
>>> +
>>> +This is the path where the packets processed by the XDP program are passed
>>> +into the kernel. The kernel creates the ``skb`` out of the ``xdp_buff``
>>> +contents. Currently, every driver has custom kernel code to parse
>>> +the descriptors and populate ``skb`` metadata when doing this ``xdp_buff->skb``
>>> +conversion, and the XDP metadata is not used by the kernel when building
>>> +``skbs``. However, TC-BPF programs can access the XDP metadata area using
>>> +the ``data_meta`` pointer.
>>> +
>>> +In the future, we'd like to support a case where an XDP program
>>> +can override some of the metadata used for building ``skbs``.
>>
>> Happy this is mentioned as future work.
> 
> As mentioned in a separate email, if you prefer to focus on that, feel

Yes, I'm going to work on PoC code that explore this area.

> free to drive it since I'm gonna look into the TX side first.

Happy to hear you are going to look into TX-side.
Is your use case related to TX timestamping?

--Jesper


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [xdp-hints] Re: [PATCH bpf-next v7 10/17] veth: Support RX XDP metadata
  2023-01-17 20:33     ` Stanislav Fomichev
@ 2023-01-18 15:57       ` Jesper Dangaard Brouer
  0 siblings, 0 replies; 42+ messages in thread
From: Jesper Dangaard Brouer @ 2023-01-18 15:57 UTC (permalink / raw)
  To: Stanislav Fomichev, Jesper Dangaard Brouer
  Cc: brouer, bpf, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev


On 17/01/2023 21.33, Stanislav Fomichev wrote:
> On Mon, Jan 16, 2023 at 8:21 AM Jesper Dangaard Brouer
> <jbrouer@redhat.com> wrote:
>>
>>
>> On 12/01/2023 01.32, Stanislav Fomichev wrote:
>>> The goal is to enable end-to-end testing of the metadata for AF_XDP.
>>
>> For me the goal with veth goes beyond *testing*.
>>
>> This patch ignores the xdp_frame case.  I'm not blocking this patch, but
>> I'm saying we need to make sure there is a way forward for accessing
>> XDP-hints when handling redirected xdp_frames.
> 
> Sure, let's work towards getting that other part addressed!
> 
>> I have two use-cases we should cover (as future work).
>>
>> (#1) We have customers that want to redirect from physical NIC hardware
>> into containers, and then have the veth XDP-prog (selectively) redirect
>> into an AF_XDP socket (when matching fastpath packets).  Here they want
>> (at minimum) access to the XDP hint info on HW checksum.
>>
>> (#2) Both veth and cpumap can create SKBs based on xdp_frames.  Here it
>> is essential to get the HW checksum and HW hash when creating these SKBs
>> (else the netstack has to do an expensive csum calc and parsing in the
>> flow-dissector).
> 
>  From my PoV, I'd probably have to look into the TX side first (tx
> timestamp) before looking into xdp->skb path. So if somebody on your
> side has cycles, feel free to drive this effort. I'm happy to provide
> reviews/comments/etc. I think we've discussed in the past that this
> will most likely look like another set of "export" kfuncs?
> 

I can use some cycles to look at this and come up with some PoC code.

Yes, I'm thinking of creating 'another set of "export" kfuncs', as you
say. I'm no longer suggesting adding a "store" flag to this patchset's
kfuncs.

The advantages of another set of store/export kfuncs are that these
can live in the core-kernel code (i.e. not in drivers), and hopefully we
can "unroll" these as BPF-prog code (as you did earlier).


> We can start with extending new
> Documentation/networking/xdp-rx-metadata.rst with a high-level design.

Sure, but I'll try to get some code working first.

>>> Cc: John Fastabend <john.fastabend@gmail.com>
>>> Cc: David Ahern <dsahern@gmail.com>
>>> Cc: Martin KaFai Lau <martin.lau@linux.dev>
>>> Cc: Jakub Kicinski <kuba@kernel.org>
>>> Cc: Willem de Bruijn <willemb@google.com>
>>> Cc: Jesper Dangaard Brouer <brouer@redhat.com>
>>> Cc: Anatoly Burakov <anatoly.burakov@intel.com>
>>> Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
>>> Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
>>> Cc: Maryam Tahhan <mtahhan@redhat.com>
>>> Cc: xdp-hints@xdp-project.net
>>> Cc: netdev@vger.kernel.org
>>> Signed-off-by: Stanislav Fomichev <sdf@google.com>
>>> ---
>>>    drivers/net/veth.c | 31 +++++++++++++++++++++++++++++++
>>>    1 file changed, 31 insertions(+)
>>>
>>> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
>>> index 70f50602287a..ba3e05832843 100644
>>> --- a/drivers/net/veth.c
>>> +++ b/drivers/net/veth.c
>>> @@ -118,6 +118,7 @@ static struct {
>>>
>>>    struct veth_xdp_buff {
>>>        struct xdp_buff xdp;
>>> +     struct sk_buff *skb;
>>>    };
>>>
>>>    static int veth_get_link_ksettings(struct net_device *dev,
>>> @@ -602,6 +603,7 @@ static struct xdp_frame *veth_xdp_rcv_one(struct veth_rq *rq,
>>>
>>>                xdp_convert_frame_to_buff(frame, xdp);
>>>                xdp->rxq = &rq->xdp_rxq;
>>> +             vxbuf.skb = NULL;
>>>
>>>                act = bpf_prog_run_xdp(xdp_prog, xdp);
>>>
>>> @@ -823,6 +825,7 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
>>>        __skb_push(skb, skb->data - skb_mac_header(skb));
>>>        if (veth_convert_skb_to_xdp_buff(rq, xdp, &skb))
>>>                goto drop;
>>> +     vxbuf.skb = skb;
>>>
>>>        orig_data = xdp->data;
>>>        orig_data_end = xdp->data_end;
>>> @@ -1602,6 +1605,28 @@ static int veth_xdp(struct net_device *dev, struct netdev_bpf *xdp)
>>>        }
>>>    }
>>>
>>> +static int veth_xdp_rx_timestamp(const struct xdp_md *ctx, u64 *timestamp)
>>> +{
>>> +     struct veth_xdp_buff *_ctx = (void *)ctx;
>>> +
>>> +     if (!_ctx->skb)
>>> +             return -EOPNOTSUPP;
>>> +
>>> +     *timestamp = skb_hwtstamps(_ctx->skb)->hwtstamp;
>>
>> The SKB stores this skb_hwtstamps() in skb_shared_info memory area.
>> This memory area is actually also available to xdp_frames.  Thus, we
>> could store the HW rx_timestamp in same location for redirected
>> xdp_frames.  This could make code path sharing possible between SKB vs
>> xdp_frame in veth.
>>
>> This would also make it fast to "transfer" HW rx_timestamp when creating
>> an SKB from an xdp_frame, as data is already written in the correct place.
>>
>> Performance-wise the downside is that the skb_shared_info memory area is
>> in a separate cacheline.  Thus, when no HW rx_timestamp is available, it
>> is very expensive for a veth XDP bpf-prog to access this, just to get a
>> zero back.  Having an xdp_frame->flags bit that indicates whether a HW
>> rx_timestamp has been stored can mitigate this.
> 
> That's one way to do it; although I'm not sure about the cases which
> don't use xdp_frame and use stack-allocated xdp_buff.

Above I should have said xdp_buff->flags, to make it more clear that 
this doesn't depend on xdp_frame.  The xdp_buff->flags gets copied to 
xdp_frame->flags, so I see them as equivalent.

The skb_shared_info memory area is also available to xdp_buff's.
(See code #define xdp_data_hard_end in include/net/xdp.h)
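
A rough sketch of the shared code path this could enable (the flags bit
is hypothetical; only the frags-related bits exist in xdp_buff->flags
today):

  static int veth_xdp_rx_timestamp(const struct xdp_md *ctx, u64 *timestamp)
  {
          struct veth_xdp_buff *_ctx = (void *)ctx;
          struct skb_shared_info *sinfo;

          if (!(_ctx->xdp.flags & XDP_FLAGS_HW_RX_TIMESTAMP)) /* hypothetical */
                  return -EOPNOTSUPP;

          /* Same memory area an skb reads via skb_hwtstamps(). */
          sinfo = xdp_get_shared_info_from_buff(&_ctx->xdp);
          *timestamp = sinfo->hwtstamps.hwtstamp;
          return 0;
  }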


>>> +     return 0;
>>> +}
>>> +
>>> +static int veth_xdp_rx_hash(const struct xdp_md *ctx, u32 *hash)
>>> +{
>>> +     struct veth_xdp_buff *_ctx = (void *)ctx;
>>> +
>>> +     if (!_ctx->skb)
>>> +             return -EOPNOTSUPP;
>>
>> For the xdp_frame case, I'm considering simply storing the u32 RX-hash in
>> struct xdp_frame.  This makes it easy to extract for the xdp_frame-to-SKB
>> create use-case.
>>
>> As has been mentioned before, the SKB also requires knowing the RSS
>> hash-type.  This HW hash-type actually contains a lot of information
>> that today is lost when reduced to the SKB hash-type.  Due to
>> standardization from Microsoft, most HW provides info on (L3) IPv4 or
>> IPv6, and on (L4) TCP or UDP (and often SCTP).  Often the hardware
>> descriptor also provides info on the header length.  Future work in this
>> area is exciting, as we can speed up parsing of packets in XDP if we can
>> get more detailed HW info on the hash "packet-type".
> 
> Something like the version we've discussed a while back [0]?
> Seems workable overall if we remove it from the UAPI? (not everyone
> was happy about UAPI parts IIRC)
> 
> 0: https://lore.kernel.org/bpf/20221115030210.3159213-7-sdf@google.com/

Yes, somewhat similar to [0].
Except that:

(1) Have a more granular design, with more kfuncs for individually
exporting hints (like this patchset), giving the BPF programmer more
flexibility (but unrolled to BPF byte code to avoid func-call overhead).

(2) No UAPI, except the kfunc calls, and the kernel code can hide where
the hints are stored, e.g. as a member in struct xdp_frame, in the
skb_shared_info memory area, or somehow in the metadata memory area.
(Hopefully this avoids too much bikeshedding about the memory area, as we
are free to change this later.)
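
A rough sketch of what such an export kfunc could look like (every name
below is an assumption for illustration, not part of this patchset):

  /* Called by the XDP prog, e.g. before redirect, to stash the hint. */
  int bpf_xdp_metadata_export_rx_hash(struct xdp_md *ctx, u32 hash)
  {
          struct xdp_buff *xdp = (struct xdp_buff *)ctx;

          xdp->rx_hash = hash;                  /* hypothetical member */
          xdp->flags |= XDP_FLAGS_HAS_RX_HASH;  /* hypothetical flag bit */
          return 0;
  }

  /* Consumed when building an skb from an xdp_frame, e.g. in
   * __xdp_build_skb_from_frame():
   */
  if (xdpf->flags & XDP_FLAGS_HAS_RX_HASH)
          skb_set_hash(skb, xdpf->rx_hash, PKT_HASH_TYPE_L4);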

--Jesper


^ permalink raw reply	[flat|nested] 42+ messages in thread

* [xdp-hints] Re: [PATCH bpf-next v7 01/17] bpf: Document XDP RX metadata
  2023-01-18 14:28       ` Jesper Dangaard Brouer
@ 2023-01-18 17:55         ` Stanislav Fomichev
  0 siblings, 0 replies; 42+ messages in thread
From: Stanislav Fomichev @ 2023-01-18 17:55 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: brouer, bpf, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev, David Vernet, Björn Töpel

On Wed, Jan 18, 2023 at 6:28 AM Jesper Dangaard Brouer
<jbrouer@redhat.com> wrote:
>
>
> On 17/01/2023 21.33, Stanislav Fomichev wrote:
> > On Mon, Jan 16, 2023 at 5:09 AM Jesper Dangaard Brouer
> > <jbrouer@redhat.com> wrote:
> >>
> >> On 12/01/2023 01.32, Stanislav Fomichev wrote:
> >>> Document all current use-cases and assumptions.
> >>>
> [...]
> >>> Acked-by: David Vernet <void@manifault.com>
> >>> Signed-off-by: Stanislav Fomichev <sdf@google.com>
> >>> ---
> >>>    Documentation/networking/index.rst           |   1 +
> >>>    Documentation/networking/xdp-rx-metadata.rst | 108 +++++++++++++++++++
> >>>    2 files changed, 109 insertions(+)
> >>>    create mode 100644 Documentation/networking/xdp-rx-metadata.rst
> >>>
> >>> diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst
> >>> index 4f2d1f682a18..4ddcae33c336 100644
> >>> --- a/Documentation/networking/index.rst
> >>> +++ b/Documentation/networking/index.rst
> [...cut...]
> >>> +AF_XDP
> >>> +======
> >>> +
> >>> +:doc:`af_xdp` use-case implies that there is a contract between the BPF
> >>> +program that redirects XDP frames into the ``AF_XDP`` socket (``XSK``) and
> >>> +the final consumer. Thus the BPF program manually allocates a fixed number of
> >>> +bytes out of metadata via ``bpf_xdp_adjust_meta`` and calls a subset
> >>> +of kfuncs to populate it. The userspace ``XSK`` consumer computes
> >>> +``xsk_umem__get_data() - METADATA_SIZE`` to locate that metadata.
> >>> +Note, ``xsk_umem__get_data`` is defined in ``libxdp`` and
> >>> +``METADATA_SIZE`` is an application-specific constant.
> >>
> >> The main problem with AF_XDP and metadata is that the AF_XDP descriptor
> >> doesn't contain any info about the metadata length METADATA_SIZE.
> >>
> >> The text does say this, but in a very convoluted way.
> >> I think this challenge should be more clearly spelled out.
> >>
> >> (p.s. This was something that XDP-hints via BTF had a proposed solution
> >> for.)
> >
> > Any suggestions on how to clarify it better? I have two hints:
> > 1. ``METADATA_SIZE`` is an application-specific constant
> > 2. note missing ``data_meta`` pointer
> >
> > Do you prefer I also add a sentence where I spell it out more
> > explicitly? Something like:
> >
> > Note, ``xsk_umem__get_data`` is defined in ``libxdp`` and
> > ``METADATA_SIZE`` is an application-specific constant (``AF_XDP``
> > receive descriptor does _not_ explicitly carry the size of the
> > metadata).
>
> That addition works for me.
> (Later we can hopefully come up with a solution for this)
>
> >>> +
> >>> +Here is the ``AF_XDP`` consumer layout (note missing ``data_meta`` pointer)::
> >>
> >> The "note" also hint to this issue.
> >
> > This seems like an explicit design choice of the AF_XDP? In theory, I
> > don't see why we can't have a v2 receive descriptor format where we
> > return the size of the metadata?
>
> (Cc. Magnus+Bjørn)
> Yes, it was a design choice from AF_XDP not to include the metadata length.
>
> The AF_XDP descriptor, struct xdp_desc (below), is defined in
> include/uapi/linux/if_xdp.h.
>
>   /* Rx/Tx descriptor */
>   struct xdp_desc {
>         __u64 addr;
>         __u32 len;
>         __u32 options;
>   };
>
> It does contain a 'u32 options' field that we could use.
>
> In previous discussions, the proposed solution (from Bjørn+Magnus) was
> to use some bits in the 'options' field to say metadata is present, and
> to have the 4 (or 8) bytes before xsk_umem__get_data() contain a BTF_ID.
> The AF_XDP programmer can then get the metadata length by looking up the
> BTF_ID.

Yeah, that's one way to do it. Instead of a BTF_ID, we can just put the
size of the metadata there.
But it wasn't needed for the use-cases described in this patchset.
For the redirect use-case, I agree, we might want to carry some extra
information about the layout; up to you.

> >>> +
> >>> +  +----------+-----------------+------+
> >>> +  | headroom | custom metadata | data |
> >>> +  +----------+-----------------+------+
> >>> +                               ^
> >>> +                               |
> >>> +                        rx_desc->address
> >>> +
> >>> +XDP_PASS
> >>> +========
> >>> +
> >>> +This is the path where the packets processed by the XDP program are passed
> >>> +into the kernel. The kernel creates the ``skb`` out of the ``xdp_buff``
> >>> +contents. Currently, every driver has custom kernel code to parse
> >>> +the descriptors and populate ``skb`` metadata when doing this ``xdp_buff->skb``
> >>> +conversion, and the XDP metadata is not used by the kernel when building
> >>> +``skbs``. However, TC-BPF programs can access the XDP metadata area using
> >>> +the ``data_meta`` pointer.
> >>> +
> >>> +In the future, we'd like to support a case where an XDP program
> >>> +can override some of the metadata used for building ``skbs``.
> >>
> >> Happy this is mentioned as future work.
> >
> > As mentioned in a separate email, if you prefer to focus on that, feel
>
> Yes, I'm going to work on PoC code that explore this area.
>
> > free to drive it since I'm gonna look into the TX side first.
>
> Happy to hear you are going to look into TX-side.
> Are your use case related to TX timestamping?

Yes, I'll try to start with a tx timestamp. But looking (briefly) at
the code, that seems like a more invasive addition.
If we want to return that tx timestamp via the original umem, some
completion mechanism has to be added (besides the existing one).
LMK if you have some pointers to the previous discussions. Or maybe
Bjørn/Magnus had some plans/ideas about that?

> --Jesper
>

^ permalink raw reply	[flat|nested] 42+ messages in thread

end of thread, other threads:[~2023-01-18 17:55 UTC | newest]

Thread overview: 42+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-01-12  0:32 [xdp-hints] [PATCH bpf-next v7 00/17] xdp: hints via kfuncs Stanislav Fomichev
2023-01-12  0:32 ` [xdp-hints] [PATCH bpf-next v7 01/17] bpf: Document XDP RX metadata Stanislav Fomichev
2023-01-16 13:09   ` [xdp-hints] " Jesper Dangaard Brouer
2023-01-17 20:33     ` Stanislav Fomichev
2023-01-18 14:28       ` Jesper Dangaard Brouer
2023-01-18 17:55         ` Stanislav Fomichev
2023-01-12  0:32 ` [xdp-hints] [PATCH bpf-next v7 02/17] bpf: Rename bpf_{prog,map}_is_dev_bound to is_offloaded Stanislav Fomichev
2023-01-12  0:32 ` [xdp-hints] [PATCH bpf-next v7 03/17] bpf: Move offload initialization into late_initcall Stanislav Fomichev
2023-01-12  0:32 ` [xdp-hints] [PATCH bpf-next v7 04/17] bpf: Reshuffle some parts of bpf/offload.c Stanislav Fomichev
2023-01-12  0:32 ` [xdp-hints] [PATCH bpf-next v7 05/17] bpf: Introduce device-bound XDP programs Stanislav Fomichev
2023-01-12  0:32 ` [xdp-hints] [PATCH bpf-next v7 06/17] selftests/bpf: Update expected test_offload.py messages Stanislav Fomichev
2023-01-12  0:32 ` [xdp-hints] [PATCH bpf-next v7 07/17] bpf: XDP metadata RX kfuncs Stanislav Fomichev
2023-01-12  0:32 ` [xdp-hints] [PATCH bpf-next v7 09/17] veth: Introduce veth_xdp_buff wrapper for xdp_buff Stanislav Fomichev
2023-01-12  0:32 ` [xdp-hints] [PATCH bpf-next v7 10/17] veth: Support RX XDP metadata Stanislav Fomichev
2023-01-16 16:21   ` [xdp-hints] " Jesper Dangaard Brouer
2023-01-17 20:33     ` Stanislav Fomichev
2023-01-18 15:57       ` Jesper Dangaard Brouer
2023-01-12  0:32 ` [xdp-hints] [PATCH bpf-next v7 11/17] selftests/bpf: Verify xdp_metadata xdp->af_xdp path Stanislav Fomichev
2023-01-12  0:32 ` [xdp-hints] [PATCH bpf-next v7 12/17] net/mlx4_en: Introduce wrapper for xdp_buff Stanislav Fomichev
2023-01-12  0:32 ` [xdp-hints] [PATCH bpf-next v7 13/17] net/mlx4_en: Support RX XDP metadata Stanislav Fomichev
2023-01-12  0:32 ` [xdp-hints] [PATCH bpf-next v7 14/17] xsk: Add cb area to struct xdp_buff_xsk Stanislav Fomichev
2023-01-12  0:32 ` [xdp-hints] [PATCH bpf-next v7 15/17] net/mlx5e: Introduce wrapper for xdp_buff Stanislav Fomichev
2023-01-12  8:07   ` [xdp-hints] " Tariq Toukan
2023-01-12 19:10     ` Stanislav Fomichev
2023-01-12 21:09       ` Toke Høiland-Jørgensen
2023-01-12 21:55         ` Toke Høiland-Jørgensen
2023-01-12 22:18           ` Stanislav Fomichev
2023-01-12 22:29             ` Toke Høiland-Jørgensen
2023-01-13 20:55               ` Tariq Toukan
2023-01-13 20:53           ` Tariq Toukan
2023-01-13 21:31             ` Toke Høiland-Jørgensen
2023-01-15  6:59               ` Tariq Toukan
2023-01-15 11:13                 ` Toke Høiland-Jørgensen
2023-01-12  0:32 ` [xdp-hints] [PATCH bpf-next v7 16/17] net/mlx5e: Support RX XDP metadata Stanislav Fomichev
2023-01-12  8:13   ` [xdp-hints] " Tariq Toukan
2023-01-12 19:09     ` Stanislav Fomichev
2023-01-13 20:25       ` Tariq Toukan
2023-01-12  0:32 ` [xdp-hints] [PATCH bpf-next v7 17/17] selftests/bpf: Simple program to dump XDP RX metadata Stanislav Fomichev
2023-01-12  7:29 ` [xdp-hints] Re: [PATCH bpf-next v7 00/17] xdp: hints via kfuncs Martin KaFai Lau
2023-01-12  8:19   ` Tariq Toukan
2023-01-12 18:09     ` Stanislav Fomichev
2023-01-12 18:20       ` Martin KaFai Lau
