XDP hardware hints discussion mail archive
* [xdp-hints] [RFC bpf-next v3 00/14] bpf: Netdev TX metadata
@ 2023-07-07 19:29 Stanislav Fomichev
  2023-07-07 19:29 ` [xdp-hints] [RFC bpf-next v3 01/14] bpf: Rename some xdp-metadata functions into dev-bound Stanislav Fomichev
                   ` (13 more replies)
  0 siblings, 14 replies; 34+ messages in thread
From: Stanislav Fomichev @ 2023-07-07 19:29 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, kuba, toke, willemb, dsahern,
	magnus.karlsson, bjorn, maciej.fijalkowski, hawk, netdev,
	xdp-hints

--- Changes since RFC v2 ---

- Separate skb & xdp path
  - XSK is still weird in mlx5 driver; I have to create xdp_frame on
    the stack for af_xdp
- Add TX checksum kfunc
  - In a separate patch for now to show extensibility
  - Same API as we have in skb world
- mlx5 patches with tx timestamp/checksum support
  - xdp_hw_metadata is extended with appropriate bits
    - report sw completion timestamp as well

v2: https://lore.kernel.org/bpf/CAKH8qBvnNCY=eFh4pMRZqBs88JBd66sVD+Yt8mGyQJOAtq7jrA@mail.gmail.com/T/#m6e45f78b11ac10f724a9472359c44f1a38a679cb

--- Use cases ---

The goal of this series is to add two new standard-ish places
in the transmit path:

1. Right before the packet is transmitted (with access to TX
   descriptors)
2. Right after the packet is actually transmitted and we've received the
   completion (again, with access to TX completion descriptors)

Accessing TX descriptors unlocks the following use-cases:

- Setting device hints at TX: XDP/AF_XDP might use these new hooks to
  use device offloads. The series implements TX timestamps as the
  first example.
- Observability: global per-netdev hooks can be used for tracing
  the packets and exploring completion descriptors for all sorts of
  device errors.

Accessing TX descriptors also means that the hooks have to be called
from the drivers.

The hooks are a light-weight alternative to XDP at egress and currently
don't provide any packet modification abilities. However, eventually,
they can expose new kfuncs to operate on the packet (or, rather, the
actual descriptors, for performance's sake).

--- UAPI ---

The hooks are implemented in a HID-BPF style, meaning they don't
expose any new UAPI and are implemented as tracing programs that call
a bunch of kfuncs. The attach/detach operations happen via regular
global fentry points. Network namespace and ifindex are exposed
to allow filtering on a particular netdev.

--- skb vs xdp ---

The hooks operate on a new light-weight devtx_ctx which contains:
- sinfo (frags)
- netdev

skb and xdp_frame hook points are separate; skb and xdp_frame are
passed directly into the hook where appropriate.

--- TODO ---

Things that I'm planning to do for the non-RFC series:
- have some real device support to verify xdp_hw_metadata works
  - performance numbers with/without feature enabled (Toke)
- add has_timestamp flag to af_xdp tx_desc (Toke & Jesper)
- freplace
- explore dynptr (Toke)
- Documentation/networking/xdp-rx-metadata.rst - like documentation

Stanislav Fomichev (14):
  bpf: Rename some xdp-metadata functions into dev-bound
  bpf: Make it easier to add new metadata kfunc
  xsk: Support XDP_TX_METADATA_LEN
  bpf: Implement devtx hook points
  bpf: Implement devtx timestamp kfunc
  net: veth: Implement devtx timestamp kfuncs
  bpf: Introduce tx checksum devtx kfuncs
  net: veth: Implement devtx tx checksum
  net/mlx5e: Implement devtx kfuncs
  selftests/xsk: Support XDP_TX_METADATA_LEN
  selftests/bpf: Add helper to query current netns cookie
  selftests/bpf: Add csum helpers
  selftests/bpf: Extend xdp_metadata with devtx kfuncs
  selftests/bpf: Extend xdp_hw_metadata with devtx kfuncs

 MAINTAINERS                                   |   2 +
 .../net/ethernet/mellanox/mlx5/core/en/txrx.h |  15 +
 .../net/ethernet/mellanox/mlx5/core/en/xdp.c  | 155 +++++++++-
 .../net/ethernet/mellanox/mlx5/core/en/xdp.h  |   4 +-
 .../ethernet/mellanox/mlx5/core/en/xsk/tx.c   |  10 +
 .../net/ethernet/mellanox/mlx5/core/en_tx.c   |  24 ++
 .../net/ethernet/mellanox/mlx5/core/main.c    |  15 +-
 drivers/net/veth.c                            | 126 +++++++-
 include/linux/netdevice.h                     |   6 +
 include/net/devtx.h                           |  66 +++++
 include/net/offload.h                         |  45 +++
 include/net/xdp.h                             |  18 +-
 include/net/xdp_sock.h                        |   1 +
 include/net/xsk_buff_pool.h                   |   1 +
 include/uapi/linux/if_xdp.h                   |   1 +
 kernel/bpf/offload.c                          |  52 +++-
 kernel/bpf/verifier.c                         |   4 +-
 net/core/Makefile                             |   1 +
 net/core/dev.c                                |   1 +
 net/core/devtx.c                              | 168 +++++++++++
 net/core/xdp.c                                |  20 +-
 net/xdp/xsk.c                                 |  35 ++-
 net/xdp/xsk_buff_pool.c                       |   1 +
 net/xdp/xsk_queue.h                           |   7 +-
 tools/testing/selftests/bpf/network_helpers.c |  21 ++
 tools/testing/selftests/bpf/network_helpers.h |  44 +++
 .../selftests/bpf/prog_tests/xdp_metadata.c   |  64 ++++-
 .../selftests/bpf/progs/xdp_hw_metadata.c     | 173 +++++++++++
 .../selftests/bpf/progs/xdp_metadata.c        | 141 +++++++++
 tools/testing/selftests/bpf/xdp_hw_metadata.c | 269 +++++++++++++++++-
 tools/testing/selftests/bpf/xdp_metadata.h    |  16 ++
 tools/testing/selftests/bpf/xsk.c             |  17 ++
 tools/testing/selftests/bpf/xsk.h             |   1 +
 33 files changed, 1451 insertions(+), 73 deletions(-)
 create mode 100644 include/net/devtx.h
 create mode 100644 include/net/offload.h
 create mode 100644 net/core/devtx.c

-- 
2.41.0.255.g8b1d071c50-goog


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [xdp-hints] [RFC bpf-next v3 01/14] bpf: Rename some xdp-metadata functions into dev-bound
  2023-07-07 19:29 [xdp-hints] [RFC bpf-next v3 00/14] bpf: Netdev TX metadata Stanislav Fomichev
@ 2023-07-07 19:29 ` Stanislav Fomichev
  2023-07-07 19:29 ` [xdp-hints] [RFC bpf-next v3 02/14] bpf: Make it easier to add new metadata kfunc Stanislav Fomichev
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 34+ messages in thread
From: Stanislav Fomichev @ 2023-07-07 19:29 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, kuba, toke, willemb, dsahern,
	magnus.karlsson, bjorn, maciej.fijalkowski, hawk, netdev,
	xdp-hints

No functional changes.

To make the existing dev-bound infrastructure more generic and
less tightly bound to the XDP layer, rename some functions and
move kfunc-related things into kernel/bpf/offload.c.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 include/net/offload.h | 28 ++++++++++++++++++++++++++++
 include/net/xdp.h     | 18 +-----------------
 kernel/bpf/offload.c  | 26 ++++++++++++++++++++++++--
 kernel/bpf/verifier.c |  4 ++--
 net/core/xdp.c        | 20 ++------------------
 5 files changed, 57 insertions(+), 39 deletions(-)
 create mode 100644 include/net/offload.h

diff --git a/include/net/offload.h b/include/net/offload.h
new file mode 100644
index 000000000000..264a35881473
--- /dev/null
+++ b/include/net/offload.h
@@ -0,0 +1,28 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef __LINUX_NET_OFFLOAD_H__
+#define __LINUX_NET_OFFLOAD_H__
+
+#include <linux/types.h>
+
+#define XDP_METADATA_KFUNC_xxx	\
+	NETDEV_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_TIMESTAMP, \
+			      bpf_xdp_metadata_rx_timestamp) \
+	NETDEV_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_HASH, \
+			      bpf_xdp_metadata_rx_hash)
+
+enum {
+#define NETDEV_METADATA_KFUNC(name, _) name,
+XDP_METADATA_KFUNC_xxx
+#undef NETDEV_METADATA_KFUNC
+MAX_NETDEV_METADATA_KFUNC,
+};
+
+#ifdef CONFIG_NET
+u32 bpf_dev_bound_kfunc_id(int id);
+bool bpf_is_dev_bound_kfunc(u32 btf_id);
+#else
+static inline u32 bpf_dev_bound_kfunc_id(int id) { return 0; }
+static inline bool bpf_is_dev_bound_kfunc(u32 btf_id) { return false; }
+#endif
+
+#endif /* __LINUX_NET_OFFLOAD_H__ */
diff --git a/include/net/xdp.h b/include/net/xdp.h
index d1c5381fc95f..de4c3b70abde 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -9,6 +9,7 @@
 #include <linux/skbuff.h> /* skb_shared_info */
 #include <uapi/linux/netdev.h>
 #include <linux/bitfield.h>
+#include <net/offload.h>
 
 /**
  * DOC: XDP RX-queue information
@@ -384,19 +385,6 @@ void xdp_attachment_setup(struct xdp_attachment_info *info,
 
 #define DEV_MAP_BULK_SIZE XDP_BULK_QUEUE_SIZE
 
-#define XDP_METADATA_KFUNC_xxx	\
-	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_TIMESTAMP, \
-			   bpf_xdp_metadata_rx_timestamp) \
-	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_HASH, \
-			   bpf_xdp_metadata_rx_hash) \
-
-enum {
-#define XDP_METADATA_KFUNC(name, _) name,
-XDP_METADATA_KFUNC_xxx
-#undef XDP_METADATA_KFUNC
-MAX_XDP_METADATA_KFUNC,
-};
-
 enum xdp_rss_hash_type {
 	/* First part: Individual bits for L3/L4 types */
 	XDP_RSS_L3_IPV4		= BIT(0),
@@ -444,14 +432,10 @@ enum xdp_rss_hash_type {
 };
 
 #ifdef CONFIG_NET
-u32 bpf_xdp_metadata_kfunc_id(int id);
-bool bpf_dev_bound_kfunc_id(u32 btf_id);
 void xdp_set_features_flag(struct net_device *dev, xdp_features_t val);
 void xdp_features_set_redirect_target(struct net_device *dev, bool support_sg);
 void xdp_features_clear_redirect_target(struct net_device *dev);
 #else
-static inline u32 bpf_xdp_metadata_kfunc_id(int id) { return 0; }
-static inline bool bpf_dev_bound_kfunc_id(u32 btf_id) { return false; }
 
 static inline void
 xdp_set_features_flag(struct net_device *dev, xdp_features_t val)
diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
index 8a26cd8814c1..235d81f7e0ed 100644
--- a/kernel/bpf/offload.c
+++ b/kernel/bpf/offload.c
@@ -844,9 +844,9 @@ void *bpf_dev_bound_resolve_kfunc(struct bpf_prog *prog, u32 func_id)
 	if (!ops)
 		goto out;
 
-	if (func_id == bpf_xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP))
+	if (func_id == bpf_dev_bound_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP))
 		p = ops->xmo_rx_timestamp;
-	else if (func_id == bpf_xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_HASH))
+	else if (func_id == bpf_dev_bound_kfunc_id(XDP_METADATA_KFUNC_RX_HASH))
 		p = ops->xmo_rx_hash;
 out:
 	up_read(&bpf_devs_lock);
@@ -854,6 +854,28 @@ void *bpf_dev_bound_resolve_kfunc(struct bpf_prog *prog, u32 func_id)
 	return p;
 }
 
+BTF_SET_START(dev_bound_kfunc_ids)
+#define NETDEV_METADATA_KFUNC(name, str) BTF_ID(func, str)
+XDP_METADATA_KFUNC_xxx
+#undef NETDEV_METADATA_KFUNC
+BTF_SET_END(dev_bound_kfunc_ids)
+
+BTF_ID_LIST(dev_bound_kfunc_ids_unsorted)
+#define NETDEV_METADATA_KFUNC(name, str) BTF_ID(func, str)
+XDP_METADATA_KFUNC_xxx
+#undef NETDEV_METADATA_KFUNC
+
+u32 bpf_dev_bound_kfunc_id(int id)
+{
+	/* dev_bound_kfunc_ids is sorted and can't be used */
+	return dev_bound_kfunc_ids_unsorted[id];
+}
+
+bool bpf_is_dev_bound_kfunc(u32 btf_id)
+{
+	return btf_id_set_contains(&dev_bound_kfunc_ids, btf_id);
+}
+
 static int __init bpf_offload_init(void)
 {
 	return rhashtable_init(&offdevs, &offdevs_params);
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 11e54dd8b6dd..e7b2d0a6d7c6 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -2721,7 +2721,7 @@ static int add_kfunc_call(struct bpf_verifier_env *env, u32 func_id, s16 offset)
 		}
 	}
 
-	if (bpf_dev_bound_kfunc_id(func_id)) {
+	if (bpf_is_dev_bound_kfunc(func_id)) {
 		err = bpf_dev_bound_kfunc_check(&env->log, prog_aux);
 		if (err)
 			return err;
@@ -17923,7 +17923,7 @@ static void specialize_kfunc(struct bpf_verifier_env *env,
 	void *xdp_kfunc;
 	bool is_rdonly;
 
-	if (bpf_dev_bound_kfunc_id(func_id)) {
+	if (bpf_is_dev_bound_kfunc(func_id)) {
 		xdp_kfunc = bpf_dev_bound_resolve_kfunc(prog, func_id);
 		if (xdp_kfunc) {
 			*addr = (unsigned long)xdp_kfunc;
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 41e5ca8643ec..819767697370 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -741,9 +741,9 @@ __bpf_kfunc int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx, u32 *hash,
 __diag_pop();
 
 BTF_SET8_START(xdp_metadata_kfunc_ids)
-#define XDP_METADATA_KFUNC(_, name) BTF_ID_FLAGS(func, name, 0)
+#define NETDEV_METADATA_KFUNC(_, name) BTF_ID_FLAGS(func, name, 0)
 XDP_METADATA_KFUNC_xxx
-#undef XDP_METADATA_KFUNC
+#undef NETDEV_METADATA_KFUNC
 BTF_SET8_END(xdp_metadata_kfunc_ids)
 
 static const struct btf_kfunc_id_set xdp_metadata_kfunc_set = {
@@ -751,22 +751,6 @@ static const struct btf_kfunc_id_set xdp_metadata_kfunc_set = {
 	.set   = &xdp_metadata_kfunc_ids,
 };
 
-BTF_ID_LIST(xdp_metadata_kfunc_ids_unsorted)
-#define XDP_METADATA_KFUNC(name, str) BTF_ID(func, str)
-XDP_METADATA_KFUNC_xxx
-#undef XDP_METADATA_KFUNC
-
-u32 bpf_xdp_metadata_kfunc_id(int id)
-{
-	/* xdp_metadata_kfunc_ids is sorted and can't be used */
-	return xdp_metadata_kfunc_ids_unsorted[id];
-}
-
-bool bpf_dev_bound_kfunc_id(u32 btf_id)
-{
-	return btf_id_set8_contains(&xdp_metadata_kfunc_ids, btf_id);
-}
-
 static int __init xdp_metadata_init(void)
 {
 	return register_btf_kfunc_id_set(BPF_PROG_TYPE_XDP, &xdp_metadata_kfunc_set);
-- 
2.41.0.255.g8b1d071c50-goog



* [xdp-hints] [RFC bpf-next v3 02/14] bpf: Make it easier to add new metadata kfunc
  2023-07-07 19:29 [xdp-hints] [RFC bpf-next v3 00/14] bpf: Netdev TX metadata Stanislav Fomichev
  2023-07-07 19:29 ` [xdp-hints] [RFC bpf-next v3 01/14] bpf: Rename some xdp-metadata functions into dev-bound Stanislav Fomichev
@ 2023-07-07 19:29 ` Stanislav Fomichev
  2023-07-07 19:29 ` [xdp-hints] [RFC bpf-next v3 03/14] xsk: Support XDP_TX_METADATA_LEN Stanislav Fomichev
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 34+ messages in thread
From: Stanislav Fomichev @ 2023-07-07 19:29 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, kuba, toke, willemb, dsahern,
	magnus.karlsson, bjorn, maciej.fijalkowski, hawk, netdev,
	xdp-hints

No functional changes.

Instead of having hand-crafted code in bpf_dev_bound_resolve_kfunc,
move the kfunc <> xmo handler relationship into XDP_METADATA_KFUNC_xxx.
This way, any time a new kfunc is added, we don't have to touch
bpf_dev_bound_resolve_kfunc.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 include/net/offload.h |  8 +++++---
 kernel/bpf/offload.c  | 13 +++++++------
 net/core/xdp.c        |  2 +-
 3 files changed, 13 insertions(+), 10 deletions(-)

diff --git a/include/net/offload.h b/include/net/offload.h
index 264a35881473..de0fac38a95b 100644
--- a/include/net/offload.h
+++ b/include/net/offload.h
@@ -6,12 +6,14 @@
 
 #define XDP_METADATA_KFUNC_xxx	\
 	NETDEV_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_TIMESTAMP, \
-			      bpf_xdp_metadata_rx_timestamp) \
+			      bpf_xdp_metadata_rx_timestamp, \
+			      xmo_rx_timestamp) \
 	NETDEV_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_HASH, \
-			      bpf_xdp_metadata_rx_hash)
+			      bpf_xdp_metadata_rx_hash, \
+			      xmo_rx_hash)
 
 enum {
-#define NETDEV_METADATA_KFUNC(name, _) name,
+#define NETDEV_METADATA_KFUNC(name, _, __) name,
 XDP_METADATA_KFUNC_xxx
 #undef NETDEV_METADATA_KFUNC
 MAX_NETDEV_METADATA_KFUNC,
diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
index 235d81f7e0ed..cec63c76dce5 100644
--- a/kernel/bpf/offload.c
+++ b/kernel/bpf/offload.c
@@ -844,10 +844,11 @@ void *bpf_dev_bound_resolve_kfunc(struct bpf_prog *prog, u32 func_id)
 	if (!ops)
 		goto out;
 
-	if (func_id == bpf_dev_bound_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP))
-		p = ops->xmo_rx_timestamp;
-	else if (func_id == bpf_dev_bound_kfunc_id(XDP_METADATA_KFUNC_RX_HASH))
-		p = ops->xmo_rx_hash;
+#define NETDEV_METADATA_KFUNC(name, _, xmo) \
+	if (func_id == bpf_dev_bound_kfunc_id(name)) p = ops->xmo;
+	XDP_METADATA_KFUNC_xxx
+#undef NETDEV_METADATA_KFUNC
+
 out:
 	up_read(&bpf_devs_lock);
 
@@ -855,13 +856,13 @@ void *bpf_dev_bound_resolve_kfunc(struct bpf_prog *prog, u32 func_id)
 }
 
 BTF_SET_START(dev_bound_kfunc_ids)
-#define NETDEV_METADATA_KFUNC(name, str) BTF_ID(func, str)
+#define NETDEV_METADATA_KFUNC(name, str, _) BTF_ID(func, str)
 XDP_METADATA_KFUNC_xxx
 #undef NETDEV_METADATA_KFUNC
 BTF_SET_END(dev_bound_kfunc_ids)
 
 BTF_ID_LIST(dev_bound_kfunc_ids_unsorted)
-#define NETDEV_METADATA_KFUNC(name, str) BTF_ID(func, str)
+#define NETDEV_METADATA_KFUNC(name, str, _) BTF_ID(func, str)
 XDP_METADATA_KFUNC_xxx
 #undef NETDEV_METADATA_KFUNC
 
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 819767697370..c4be4367f2dd 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -741,7 +741,7 @@ __bpf_kfunc int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx, u32 *hash,
 __diag_pop();
 
 BTF_SET8_START(xdp_metadata_kfunc_ids)
-#define NETDEV_METADATA_KFUNC(_, name) BTF_ID_FLAGS(func, name, 0)
+#define NETDEV_METADATA_KFUNC(_, name, __) BTF_ID_FLAGS(func, name, 0)
 XDP_METADATA_KFUNC_xxx
 #undef NETDEV_METADATA_KFUNC
 BTF_SET8_END(xdp_metadata_kfunc_ids)
-- 
2.41.0.255.g8b1d071c50-goog



* [xdp-hints] [RFC bpf-next v3 03/14] xsk: Support XDP_TX_METADATA_LEN
  2023-07-07 19:29 [xdp-hints] [RFC bpf-next v3 00/14] bpf: Netdev TX metadata Stanislav Fomichev
  2023-07-07 19:29 ` [xdp-hints] [RFC bpf-next v3 01/14] bpf: Rename some xdp-metadata functions into dev-bound Stanislav Fomichev
  2023-07-07 19:29 ` [xdp-hints] [RFC bpf-next v3 02/14] bpf: Make it easier to add new metadata kfunc Stanislav Fomichev
@ 2023-07-07 19:29 ` Stanislav Fomichev
  2023-07-07 19:29 ` [xdp-hints] [RFC bpf-next v3 04/14] bpf: Implement devtx hook points Stanislav Fomichev
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 34+ messages in thread
From: Stanislav Fomichev @ 2023-07-07 19:29 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, kuba, toke, willemb, dsahern,
	magnus.karlsson, bjorn, maciej.fijalkowski, hawk, netdev,
	xdp-hints

For zerocopy mode, tx_desc->addr can point at an arbitrary offset
and carry some TX metadata in the headroom. For copy mode, there
is currently no way to populate skb metadata.

Introduce a new XDP_TX_METADATA_LEN setsockopt that indicates how many
bytes to treat as metadata. Metadata bytes come prior to the tx_desc
address (same as in the RX case).

Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 include/net/xdp_sock.h      |  1 +
 include/net/xsk_buff_pool.h |  1 +
 include/uapi/linux/if_xdp.h |  1 +
 net/xdp/xsk.c               | 35 +++++++++++++++++++++++++++++++++--
 net/xdp/xsk_buff_pool.c     |  1 +
 net/xdp/xsk_queue.h         |  7 ++++---
 6 files changed, 41 insertions(+), 5 deletions(-)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index e96a1151ec75..30018b3b862d 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -51,6 +51,7 @@ struct xdp_sock {
 	struct list_head flush_node;
 	struct xsk_buff_pool *pool;
 	u16 queue_id;
+	u8 tx_metadata_len;
 	bool zc;
 	enum {
 		XSK_READY = 0,
diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
index a8d7b8a3688a..751fea51a6af 100644
--- a/include/net/xsk_buff_pool.h
+++ b/include/net/xsk_buff_pool.h
@@ -75,6 +75,7 @@ struct xsk_buff_pool {
 	u32 chunk_size;
 	u32 chunk_shift;
 	u32 frame_len;
+	u8 tx_metadata_len; /* inherited from xsk_sock */
 	u8 cached_need_wakeup;
 	bool uses_need_wakeup;
 	bool dma_need_sync;
diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index a78a8096f4ce..2374eafff7db 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -63,6 +63,7 @@ struct xdp_mmap_offsets {
 #define XDP_UMEM_COMPLETION_RING	6
 #define XDP_STATISTICS			7
 #define XDP_OPTIONS			8
+#define XDP_TX_METADATA_LEN		9
 
 struct xdp_umem_reg {
 	__u64 addr; /* Start of packet data area */
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 5a8c0dd250af..6ef7bc4f514c 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -485,6 +485,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
 		int err;
 
 		hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(dev->needed_headroom));
+		hr = max(hr, L1_CACHE_ALIGN((u32)xs->tx_metadata_len));
 		tr = dev->needed_tailroom;
 		len = desc->len;
 
@@ -493,14 +494,21 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
 			return ERR_PTR(err);
 
 		skb_reserve(skb, hr);
-		skb_put(skb, len);
+		skb_put(skb, len + xs->tx_metadata_len);
 
 		buffer = xsk_buff_raw_get_data(xs->pool, desc->addr);
-		err = skb_store_bits(skb, 0, buffer, len);
+		buffer -= xs->tx_metadata_len;
+
+		err = skb_store_bits(skb, 0, buffer, len + xs->tx_metadata_len);
 		if (unlikely(err)) {
 			kfree_skb(skb);
 			return ERR_PTR(err);
 		}
+
+		if (xs->tx_metadata_len) {
+			skb_metadata_set(skb, xs->tx_metadata_len);
+			__skb_pull(skb, xs->tx_metadata_len);
+		}
 	}
 
 	skb->dev = dev;
@@ -1137,6 +1145,29 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
 		mutex_unlock(&xs->mutex);
 		return err;
 	}
+	case XDP_TX_METADATA_LEN:
+	{
+		int val;
+
+		if (optlen < sizeof(val))
+			return -EINVAL;
+		if (copy_from_sockptr(&val, optval, sizeof(val)))
+			return -EFAULT;
+
+		if (val >= 256)
+			return -EINVAL;
+		if (val % 4)
+			return -EINVAL;
+
+		mutex_lock(&xs->mutex);
+		if (xs->state != XSK_READY) {
+			mutex_unlock(&xs->mutex);
+			return -EBUSY;
+		}
+		xs->tx_metadata_len = val;
+		mutex_unlock(&xs->mutex);
+		return 0;
+	}
 	default:
 		break;
 	}
diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
index 26f6d304451e..66ff9c345a67 100644
--- a/net/xdp/xsk_buff_pool.c
+++ b/net/xdp/xsk_buff_pool.c
@@ -85,6 +85,7 @@ struct xsk_buff_pool *xp_create_and_assign_umem(struct xdp_sock *xs,
 		XDP_PACKET_HEADROOM;
 	pool->umem = umem;
 	pool->addrs = umem->addrs;
+	pool->tx_metadata_len = xs->tx_metadata_len;
 	INIT_LIST_HEAD(&pool->free_list);
 	INIT_LIST_HEAD(&pool->xsk_tx_list);
 	spin_lock_init(&pool->xsk_tx_list_lock);
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index 6d40a77fccbe..c8d287c18d64 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -133,12 +133,13 @@ static inline bool xskq_cons_read_addr_unchecked(struct xsk_queue *q, u64 *addr)
 static inline bool xp_aligned_validate_desc(struct xsk_buff_pool *pool,
 					    struct xdp_desc *desc)
 {
-	u64 offset = desc->addr & (pool->chunk_size - 1);
+	u64 addr = desc->addr - pool->tx_metadata_len;
+	u64 offset = addr & (pool->chunk_size - 1);
 
 	if (offset + desc->len > pool->chunk_size)
 		return false;
 
-	if (desc->addr >= pool->addrs_cnt)
+	if (addr >= pool->addrs_cnt)
 		return false;
 
 	if (desc->options)
@@ -149,7 +150,7 @@ static inline bool xp_aligned_validate_desc(struct xsk_buff_pool *pool,
 static inline bool xp_unaligned_validate_desc(struct xsk_buff_pool *pool,
 					      struct xdp_desc *desc)
 {
-	u64 addr = xp_unaligned_add_offset_to_addr(desc->addr);
+	u64 addr = xp_unaligned_add_offset_to_addr(desc->addr) - pool->tx_metadata_len;
 
 	if (desc->len > pool->chunk_size)
 		return false;
-- 
2.41.0.255.g8b1d071c50-goog



* [xdp-hints] [RFC bpf-next v3 04/14] bpf: Implement devtx hook points
  2023-07-07 19:29 [xdp-hints] [RFC bpf-next v3 00/14] bpf: Netdev TX metadata Stanislav Fomichev
                   ` (2 preceding siblings ...)
  2023-07-07 19:29 ` [xdp-hints] [RFC bpf-next v3 03/14] xsk: Support XDP_TX_METADATA_LEN Stanislav Fomichev
@ 2023-07-07 19:29 ` Stanislav Fomichev
  2023-07-07 19:29 ` [xdp-hints] [RFC bpf-next v3 05/14] bpf: Implement devtx timestamp kfunc Stanislav Fomichev
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 34+ messages in thread
From: Stanislav Fomichev @ 2023-07-07 19:29 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, kuba, toke, willemb, dsahern,
	magnus.karlsson, bjorn, maciej.fijalkowski, hawk, netdev,
	xdp-hints

devtx is a lightweight set of hooks before and after packet transmission.
The hooks are implemented as tracing programs which have access to
XDP-metadata-like kfuncs. The initial set of kfuncs is implemented
in the next patch, but the idea is similar to XDP metadata:
the kfuncs have a netdev-specific implementation but a common
interface. Upon loading, the kfuncs are resolved into direct
calls against the per-netdev implementation. This can be achieved
by marking devtx tracing programs as dev-bound (largely
reusing the xdp dev-bound program infrastructure).

Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 MAINTAINERS          |  2 ++
 include/net/devtx.h  | 66 +++++++++++++++++++++++++++++++++++++
 kernel/bpf/offload.c | 15 +++++++++
 net/core/Makefile    |  1 +
 net/core/dev.c       |  1 +
 net/core/devtx.c     | 78 ++++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 163 insertions(+)
 create mode 100644 include/net/devtx.h
 create mode 100644 net/core/devtx.c

diff --git a/MAINTAINERS b/MAINTAINERS
index acbe54087d1c..de6a2430d49a 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -23063,11 +23063,13 @@ L:	bpf@vger.kernel.org
 S:	Supported
 F:	drivers/net/ethernet/*/*/*/*/*xdp*
 F:	drivers/net/ethernet/*/*/*xdp*
+F:	include/net/devtx.h
 F:	include/net/xdp.h
 F:	include/net/xdp_priv.h
 F:	include/trace/events/xdp.h
 F:	kernel/bpf/cpumap.c
 F:	kernel/bpf/devmap.c
+F:	net/core/devtx.c
 F:	net/core/xdp.c
 F:	samples/bpf/xdp*
 F:	tools/testing/selftests/bpf/*/*xdp*
diff --git a/include/net/devtx.h b/include/net/devtx.h
new file mode 100644
index 000000000000..88127ca87b9a
--- /dev/null
+++ b/include/net/devtx.h
@@ -0,0 +1,66 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef __LINUX_NET_DEVTX_H__
+#define __LINUX_NET_DEVTX_H__
+
+#include <linux/jump_label.h>
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <linux/btf_ids.h>
+#include <net/xdp.h>
+
+struct devtx_ctx {
+	struct net_device *netdev;
+	struct skb_shared_info *sinfo; /* for frags */
+};
+
+#define DECLARE_DEVTX_HOOKS(PREFIX) \
+void PREFIX ## _devtx_submit_skb(struct devtx_ctx *ctx, struct sk_buff *skb); \
+void PREFIX ## _devtx_complete_skb(struct devtx_ctx *ctx, struct sk_buff *skb); \
+void PREFIX ## _devtx_submit_xdp(struct devtx_ctx *ctx, struct xdp_frame *xdpf); \
+void PREFIX ## _devtx_complete_xdp(struct devtx_ctx *ctx, struct xdp_frame *xdpf)
+
+#define DEFINE_DEVTX_HOOKS(PREFIX) \
+__weak noinline void PREFIX ## _devtx_submit_skb(struct devtx_ctx *ctx, \
+						 struct sk_buff *skb) {} \
+__weak noinline void PREFIX ## _devtx_complete_skb(struct devtx_ctx *ctx, \
+						   struct sk_buff *skb) {} \
+__weak noinline void PREFIX ## _devtx_submit_xdp(struct devtx_ctx *ctx, \
+						 struct xdp_frame *xdpf) {} \
+__weak noinline void PREFIX ## _devtx_complete_xdp(struct devtx_ctx *ctx, \
+						   struct xdp_frame *xdpf) {} \
+\
+BTF_SET8_START(PREFIX ## _devtx_hook_ids) \
+BTF_ID_FLAGS(func, PREFIX ## _devtx_submit_skb) \
+BTF_ID_FLAGS(func, PREFIX ## _devtx_complete_skb) \
+BTF_ID_FLAGS(func, PREFIX ## _devtx_submit_xdp) \
+BTF_ID_FLAGS(func, PREFIX ## _devtx_complete_xdp) \
+BTF_SET8_END(PREFIX ## _devtx_hook_ids)
+
+#ifdef CONFIG_NET
+void devtx_hooks_enable(void);
+void devtx_hooks_disable(void);
+bool devtx_hooks_match(u32 attach_btf_id, const struct xdp_metadata_ops *xmo);
+int devtx_hooks_register(struct btf_id_set8 *set, const struct xdp_metadata_ops *xmo);
+void devtx_hooks_unregister(struct btf_id_set8 *set);
+
+DECLARE_STATIC_KEY_FALSE(devtx_enabled_key);
+
+static inline bool devtx_enabled(void)
+{
+	return static_branch_unlikely(&devtx_enabled_key);
+}
+#else
+static inline void devtx_hooks_enable(void) {}
+static inline void devtx_hooks_disable(void) {}
+static inline bool devtx_hooks_match(u32 attach_btf_id, const struct xdp_metadata_ops *xmo) { return false; }
+static inline int devtx_hooks_register(struct btf_id_set8 *set,
+				       const struct xdp_metadata_ops *xmo) { return -EOPNOTSUPP; }
+static inline void devtx_hooks_unregister(struct btf_id_set8 *set) {}
+
+static inline bool devtx_enabled(void)
+{
+	return false;
+}
+#endif
+
+#endif /* __LINUX_NET_DEVTX_H__ */
diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
index cec63c76dce5..a4423803b3dd 100644
--- a/kernel/bpf/offload.c
+++ b/kernel/bpf/offload.c
@@ -25,6 +25,7 @@
 #include <linux/rhashtable.h>
 #include <linux/rtnetlink.h>
 #include <linux/rwsem.h>
+#include <net/devtx.h>
 
 /* Protects offdevs, members of bpf_offload_netdev and offload members
  * of all progs.
@@ -228,6 +229,7 @@ int bpf_prog_dev_bound_init(struct bpf_prog *prog, union bpf_attr *attr)
 	int err;
 
 	if (attr->prog_type != BPF_PROG_TYPE_SCHED_CLS &&
+	    attr->prog_type != BPF_PROG_TYPE_TRACING &&
 	    attr->prog_type != BPF_PROG_TYPE_XDP)
 		return -EINVAL;
 
@@ -242,6 +244,15 @@ int bpf_prog_dev_bound_init(struct bpf_prog *prog, union bpf_attr *attr)
 	if (!netdev)
 		return -EINVAL;
 
+	/* Make sure device-bound tracing programs are being attached
+	 * to the appropriate netdev.
+	 */
+	if (attr->prog_type == BPF_PROG_TYPE_TRACING &&
+	    !devtx_hooks_match(prog->aux->attach_btf_id, netdev->xdp_metadata_ops)) {
+		err = -EINVAL;
+		goto out;
+	}
+
 	err = bpf_dev_offload_check(netdev);
 	if (err)
 		goto out;
@@ -252,6 +263,9 @@ int bpf_prog_dev_bound_init(struct bpf_prog *prog, union bpf_attr *attr)
 	err = __bpf_prog_dev_bound_init(prog, netdev);
 	up_write(&bpf_devs_lock);
 
+	if (!err)
+		devtx_hooks_enable();
+
 out:
 	dev_put(netdev);
 	return err;
@@ -384,6 +398,7 @@ void bpf_prog_dev_bound_destroy(struct bpf_prog *prog)
 		ondev = bpf_offload_find_netdev(netdev);
 		if (!ondev->offdev && list_empty(&ondev->progs))
 			__bpf_offload_dev_netdev_unregister(NULL, netdev);
+		devtx_hooks_disable();
 	}
 	up_write(&bpf_devs_lock);
 	rtnl_unlock();
diff --git a/net/core/Makefile b/net/core/Makefile
index 731db2eaa610..97b4d6703a77 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -39,4 +39,5 @@ obj-$(CONFIG_FAILOVER) += failover.o
 obj-$(CONFIG_NET_SOCK_MSG) += skmsg.o
 obj-$(CONFIG_BPF_SYSCALL) += sock_map.o
 obj-$(CONFIG_BPF_SYSCALL) += bpf_sk_storage.o
+obj-$(CONFIG_BPF_SYSCALL) += devtx.o
 obj-$(CONFIG_OF)	+= of_net.o
diff --git a/net/core/dev.c b/net/core/dev.c
index 69a3e544676c..b9500a722591 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -150,6 +150,7 @@
 #include <linux/pm_runtime.h>
 #include <linux/prandom.h>
 #include <linux/once_lite.h>
+#include <net/devtx.h>
 
 #include "dev.h"
 #include "net-sysfs.h"
diff --git a/net/core/devtx.c b/net/core/devtx.c
new file mode 100644
index 000000000000..6ae1aecce2c5
--- /dev/null
+++ b/net/core/devtx.c
@@ -0,0 +1,78 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include <net/devtx.h>
+#include <linux/filter.h>
+
+DEFINE_STATIC_KEY_FALSE(devtx_enabled_key);
+EXPORT_SYMBOL_GPL(devtx_enabled_key);
+
+struct devtx_hook_entry {
+	struct list_head devtx_hooks;
+	struct btf_id_set8 *set;
+	const struct xdp_metadata_ops *xmo;
+};
+
+static LIST_HEAD(devtx_hooks);
+static DEFINE_MUTEX(devtx_hooks_lock);
+
+void devtx_hooks_enable(void)
+{
+	static_branch_inc(&devtx_enabled_key);
+}
+
+void devtx_hooks_disable(void)
+{
+	static_branch_dec(&devtx_enabled_key);
+}
+
+bool devtx_hooks_match(u32 attach_btf_id, const struct xdp_metadata_ops *xmo)
+{
+	struct devtx_hook_entry *entry, *tmp;
+	bool match = false;
+
+	mutex_lock(&devtx_hooks_lock);
+	list_for_each_entry_safe(entry, tmp, &devtx_hooks, devtx_hooks) {
+		if (btf_id_set8_contains(entry->set, attach_btf_id)) {
+			match = entry->xmo == xmo;
+			break;
+		}
+	}
+	mutex_unlock(&devtx_hooks_lock);
+
+	return match;
+}
+
+int devtx_hooks_register(struct btf_id_set8 *set, const struct xdp_metadata_ops *xmo)
+{
+	struct devtx_hook_entry *entry;
+
+	entry = kzalloc(sizeof(*entry), GFP_KERNEL);
+	if (!entry)
+		return -ENOMEM;
+
+	entry->set = set;
+	entry->xmo = xmo;
+
+	mutex_lock(&devtx_hooks_lock);
+	list_add(&entry->devtx_hooks, &devtx_hooks);
+	mutex_unlock(&devtx_hooks_lock);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(devtx_hooks_register);
+
+void devtx_hooks_unregister(struct btf_id_set8 *set)
+{
+	struct devtx_hook_entry *entry, *tmp;
+
+	mutex_lock(&devtx_hooks_lock);
+	list_for_each_entry_safe(entry, tmp, &devtx_hooks, devtx_hooks) {
+		if (entry->set == set) {
+			list_del(&entry->devtx_hooks);
+			kfree(entry);
+			break;
+		}
+	}
+	mutex_unlock(&devtx_hooks_lock);
+}
+EXPORT_SYMBOL_GPL(devtx_hooks_unregister);
-- 
2.41.0.255.g8b1d071c50-goog


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [xdp-hints] [RFC bpf-next v3 05/14] bpf: Implement devtx timestamp kfunc
  2023-07-07 19:29 [xdp-hints] [RFC bpf-next v3 00/14] bpf: Netdev TX metadata Stanislav Fomichev
                   ` (3 preceding siblings ...)
  2023-07-07 19:29 ` [xdp-hints] [RFC bpf-next v3 04/14] bpf: Implement devtx hook points Stanislav Fomichev
@ 2023-07-07 19:29 ` Stanislav Fomichev
  2023-07-07 19:29 ` [xdp-hints] [RFC bpf-next v3 06/14] net: veth: Implement devtx timestamp kfuncs Stanislav Fomichev
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 34+ messages in thread
From: Stanislav Fomichev @ 2023-07-07 19:29 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, kuba, toke, willemb, dsahern,
	magnus.karlsson, bjorn, maciej.fijalkowski, hawk, netdev,
	xdp-hints

Two kfuncs, one per hook point:

1. at submit time - bpf_devtx_request_tx_timestamp - to request that
   the HW put the TX timestamp into the TX completion descriptor

2. at completion time - bpf_devtx_tx_timestamp - to read out
   the TX timestamp

Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 include/linux/netdevice.h |  4 +++
 include/net/offload.h     | 12 +++++++
 kernel/bpf/offload.c      |  6 ++++
 net/core/devtx.c          | 73 +++++++++++++++++++++++++++++++++++++++
 4 files changed, 95 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index b828c7a75be2..5be6649ea3fa 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1654,10 +1654,14 @@ struct net_device_ops {
 						  bool cycles);
 };
 
+struct devtx_ctx;
+
 struct xdp_metadata_ops {
 	int	(*xmo_rx_timestamp)(const struct xdp_md *ctx, u64 *timestamp);
 	int	(*xmo_rx_hash)(const struct xdp_md *ctx, u32 *hash,
 			       enum xdp_rss_hash_type *rss_type);
+	int	(*xmo_request_tx_timestamp)(const struct devtx_ctx *ctx);
+	int	(*xmo_tx_timestamp)(const struct devtx_ctx *ctx, u64 *timestamp);
 };
 
 /**
diff --git a/include/net/offload.h b/include/net/offload.h
index de0fac38a95b..7e2c19c5aaef 100644
--- a/include/net/offload.h
+++ b/include/net/offload.h
@@ -12,9 +12,21 @@
 			      bpf_xdp_metadata_rx_hash, \
 			      xmo_rx_hash)
 
+#define DEVTX_SUBMIT_KFUNC_xxx	\
+	NETDEV_METADATA_KFUNC(DEVTX_KFUNC_REQUEST_TX_TIMESTAMP, \
+			      bpf_devtx_request_tx_timestamp, \
+			      xmo_request_tx_timestamp)
+
+#define DEVTX_COMPLETE_KFUNC_xxx	\
+	NETDEV_METADATA_KFUNC(DEVTX_KFUNC_TX_TIMESTAMP, \
+			      bpf_devtx_tx_timestamp, \
+			      xmo_tx_timestamp)
+
 enum {
 #define NETDEV_METADATA_KFUNC(name, _, __) name,
 XDP_METADATA_KFUNC_xxx
+DEVTX_SUBMIT_KFUNC_xxx
+DEVTX_COMPLETE_KFUNC_xxx
 #undef NETDEV_METADATA_KFUNC
 MAX_NETDEV_METADATA_KFUNC,
 };
diff --git a/kernel/bpf/offload.c b/kernel/bpf/offload.c
index a4423803b3dd..fe793387c972 100644
--- a/kernel/bpf/offload.c
+++ b/kernel/bpf/offload.c
@@ -862,6 +862,8 @@ void *bpf_dev_bound_resolve_kfunc(struct bpf_prog *prog, u32 func_id)
 #define NETDEV_METADATA_KFUNC(name, _, xmo) \
 	if (func_id == bpf_dev_bound_kfunc_id(name)) p = ops->xmo;
 	XDP_METADATA_KFUNC_xxx
+	DEVTX_SUBMIT_KFUNC_xxx
+	DEVTX_COMPLETE_KFUNC_xxx
 #undef NETDEV_METADATA_KFUNC
 
 out:
@@ -873,12 +875,16 @@ void *bpf_dev_bound_resolve_kfunc(struct bpf_prog *prog, u32 func_id)
 BTF_SET_START(dev_bound_kfunc_ids)
 #define NETDEV_METADATA_KFUNC(name, str, _) BTF_ID(func, str)
 XDP_METADATA_KFUNC_xxx
+DEVTX_SUBMIT_KFUNC_xxx
+DEVTX_COMPLETE_KFUNC_xxx
 #undef NETDEV_METADATA_KFUNC
 BTF_SET_END(dev_bound_kfunc_ids)
 
 BTF_ID_LIST(dev_bound_kfunc_ids_unsorted)
 #define NETDEV_METADATA_KFUNC(name, str, _) BTF_ID(func, str)
 XDP_METADATA_KFUNC_xxx
+DEVTX_SUBMIT_KFUNC_xxx
+DEVTX_COMPLETE_KFUNC_xxx
 #undef NETDEV_METADATA_KFUNC
 
 u32 bpf_dev_bound_kfunc_id(int id)
diff --git a/net/core/devtx.c b/net/core/devtx.c
index 6ae1aecce2c5..991a52fe81a3 100644
--- a/net/core/devtx.c
+++ b/net/core/devtx.c
@@ -76,3 +76,76 @@ void devtx_hooks_unregister(struct btf_id_set8 *set)
 	mutex_unlock(&devtx_hooks_lock);
 }
 EXPORT_SYMBOL_GPL(devtx_hooks_unregister);
+
+__diag_push();
+__diag_ignore_all("-Wmissing-prototypes",
+		  "Global functions as their definitions will be in vmlinux BTF");
+
+/**
+ * bpf_devtx_request_tx_timestamp - Request TX timestamp on the packet.
+ * Callable only from the devtx-submit hook.
+ * @ctx: devtx context pointer.
+ *
+ * Returns 0 on success or ``-errno`` on error.
+ */
+__bpf_kfunc int bpf_devtx_request_tx_timestamp(const struct devtx_ctx *ctx)
+{
+	return -EOPNOTSUPP;
+}
+
+/**
+ * bpf_devtx_tx_timestamp - Read TX timestamp of the packet. Callable
+ * only from the devtx-complete hook.
+ * @ctx: devtx context pointer.
+ * @timestamp: Return value pointer.
+ *
+ * Returns 0 on success or ``-errno`` on error.
+ */
+__bpf_kfunc int bpf_devtx_tx_timestamp(const struct devtx_ctx *ctx, __u64 *timestamp)
+{
+	return -EOPNOTSUPP;
+}
+
+__diag_pop();
+
+BTF_SET8_START(devtx_sb_kfunc_ids)
+#define NETDEV_METADATA_KFUNC(_, name, __) BTF_ID_FLAGS(func, name, 0)
+DEVTX_SUBMIT_KFUNC_xxx
+#undef NETDEV_METADATA_KFUNC
+BTF_SET8_END(devtx_sb_kfunc_ids)
+
+static const struct btf_kfunc_id_set devtx_sb_kfunc_set = {
+	.owner = THIS_MODULE,
+	.set   = &devtx_sb_kfunc_ids,
+};
+
+BTF_SET8_START(devtx_cp_kfunc_ids)
+#define NETDEV_METADATA_KFUNC(_, name, __) BTF_ID_FLAGS(func, name, 0)
+DEVTX_COMPLETE_KFUNC_xxx
+#undef NETDEV_METADATA_KFUNC
+BTF_SET8_END(devtx_cp_kfunc_ids)
+
+static const struct btf_kfunc_id_set devtx_cp_kfunc_set = {
+	.owner = THIS_MODULE,
+	.set   = &devtx_cp_kfunc_ids,
+};
+
+static int __init devtx_init(void)
+{
+	int ret;
+
+	ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &devtx_sb_kfunc_set);
+	if (ret) {
+		pr_warn("failed to register devtx_sb kfuncs: %d\n", ret);
+		return ret;
+	}
+
+	ret = register_btf_kfunc_id_set(BPF_PROG_TYPE_TRACING, &devtx_cp_kfunc_set);
+	if (ret) {
+		pr_warn("failed to register devtx_cp kfuncs: %d\n", ret);
+		return ret;
+	}
+
+	return 0;
+}
+late_initcall(devtx_init);
-- 
2.41.0.255.g8b1d071c50-goog



* [xdp-hints] [RFC bpf-next v3 06/14] net: veth: Implement devtx timestamp kfuncs
  2023-07-07 19:29 [xdp-hints] [RFC bpf-next v3 00/14] bpf: Netdev TX metadata Stanislav Fomichev
                   ` (4 preceding siblings ...)
  2023-07-07 19:29 ` [xdp-hints] [RFC bpf-next v3 05/14] bpf: Implement devtx timestamp kfunc Stanislav Fomichev
@ 2023-07-07 19:29 ` Stanislav Fomichev
  2023-07-07 19:29 ` [xdp-hints] [RFC bpf-next v3 07/14] bpf: Introduce tx checksum devtx kfuncs Stanislav Fomichev
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 34+ messages in thread
From: Stanislav Fomichev @ 2023-07-07 19:29 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, kuba, toke, willemb, dsahern,
	magnus.karlsson, bjorn, maciej.fijalkowski, hawk, netdev,
	xdp-hints

Have a software-based implementation of the kfuncs to showcase how
they can be used in real devices and to have something to test
against in the selftests.

Both paths (skb & xdp) are covered. Only the skb path is really
tested though.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 drivers/net/veth.c | 97 +++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 95 insertions(+), 2 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 614f3e3efab0..5af4b15e107c 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -27,6 +27,7 @@
 #include <linux/bpf_trace.h>
 #include <linux/net_tstamp.h>
 #include <net/page_pool.h>
+#include <net/devtx.h>
 
 #define DRV_NAME	"veth"
 #define DRV_VERSION	"1.0"
@@ -123,6 +124,13 @@ struct veth_xdp_buff {
 	struct sk_buff *skb;
 };
 
+struct veth_devtx_ctx {
+	struct devtx_ctx devtx;
+	struct xdp_frame *xdpf;
+	struct sk_buff *skb;
+	ktime_t xdp_tx_timestamp;
+};
+
 static int veth_get_link_ksettings(struct net_device *dev,
 				   struct ethtool_link_ksettings *cmd)
 {
@@ -313,10 +321,33 @@ static int veth_xdp_rx(struct veth_rq *rq, struct sk_buff *skb)
 	return NET_RX_SUCCESS;
 }
 
+DEFINE_DEVTX_HOOKS(veth);
+
 static int veth_forward_skb(struct net_device *dev, struct sk_buff *skb,
 			    struct veth_rq *rq, bool xdp)
 {
-	return __dev_forward_skb(dev, skb) ?: xdp ?
+	struct net_device *orig_dev = skb->dev;
+	int ret;
+
+	ret = __dev_forward_skb(dev, skb);
+	if (ret)
+		return ret;
+
+	if (devtx_enabled()) {
+		struct veth_devtx_ctx ctx = {
+			.devtx = {
+				.netdev = orig_dev,
+				.sinfo = skb_shinfo(skb),
+			},
+			.skb = skb,
+		};
+
+		__skb_push(skb, ETH_HLEN);
+		veth_devtx_complete_skb(&ctx.devtx, skb);
+		__skb_pull(skb, ETH_HLEN);
+	}
+
+	return xdp ?
 		veth_xdp_rx(rq, skb) :
 		__netif_rx(skb);
 }
@@ -356,6 +387,18 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct net_device *dev)
 		goto drop;
 	}
 
+	if (devtx_enabled()) {
+		struct veth_devtx_ctx ctx = {
+			.devtx = {
+				.netdev = skb->dev,
+				.sinfo = skb_shinfo(skb),
+			},
+			.skb = skb,
+		};
+
+		veth_devtx_submit_skb(&ctx.devtx, skb);
+	}
+
 	rcv_priv = netdev_priv(rcv);
 	rxq = skb_get_queue_mapping(skb);
 	if (rxq < rcv->real_num_rx_queues) {
@@ -509,11 +552,28 @@ static int veth_xdp_xmit(struct net_device *dev, int n,
 	for (i = 0; i < n; i++) {
 		struct xdp_frame *frame = frames[i];
 		void *ptr = veth_xdp_to_ptr(frame);
+		struct veth_devtx_ctx ctx;
 
 		if (unlikely(xdp_get_frame_len(frame) > max_len ||
-			     __ptr_ring_produce(&rq->xdp_ring, ptr)))
+			     __ptr_ring_full(&rq->xdp_ring)))
+			break;
+
+		if (devtx_enabled()) {
+			memset(&ctx, 0, sizeof(ctx));
+			ctx.devtx.netdev = dev;
+			ctx.devtx.sinfo = xdp_frame_has_frags(frame) ?
+				xdp_get_shared_info_from_frame(frame) : NULL;
+			ctx.xdpf = frame;
+
+			veth_devtx_submit_xdp(&ctx.devtx, frame);
+		}
+
+		if (unlikely(__ptr_ring_produce(&rq->xdp_ring, ptr)))
 			break;
 		nxmit++;
+
+		if (devtx_enabled())
+			veth_devtx_complete_xdp(&ctx.devtx, frame);
 	}
 	spin_unlock(&rq->xdp_ring.producer_lock);
 
@@ -1732,6 +1792,28 @@ static int veth_xdp_rx_hash(const struct xdp_md *ctx, u32 *hash,
 	return 0;
 }
 
+static int veth_devtx_request_tx_timestamp(const struct devtx_ctx *_ctx)
+{
+	struct veth_devtx_ctx *ctx = (struct veth_devtx_ctx *)_ctx;
+
+	if (ctx->skb)
+		__net_timestamp(ctx->skb);
+	else
+		ctx->xdp_tx_timestamp = ktime_get_real();
+
+	return 0;
+}
+
+static int veth_devtx_tx_timestamp(const struct devtx_ctx *_ctx, u64 *timestamp)
+{
+	struct veth_devtx_ctx *ctx = (struct veth_devtx_ctx *)_ctx;
+
+	if (ctx->skb)
+		*timestamp = ctx->skb->tstamp;
+
+	return 0;
+}
+
 static const struct net_device_ops veth_netdev_ops = {
 	.ndo_init            = veth_dev_init,
 	.ndo_open            = veth_open,
@@ -1756,6 +1838,8 @@ static const struct net_device_ops veth_netdev_ops = {
 static const struct xdp_metadata_ops veth_xdp_metadata_ops = {
 	.xmo_rx_timestamp		= veth_xdp_rx_timestamp,
 	.xmo_rx_hash			= veth_xdp_rx_hash,
+	.xmo_request_tx_timestamp	= veth_devtx_request_tx_timestamp,
+	.xmo_tx_timestamp		= veth_devtx_tx_timestamp,
 };
 
 #define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_HW_CSUM | \
@@ -2041,11 +2125,20 @@ static struct rtnl_link_ops veth_link_ops = {
 
 static __init int veth_init(void)
 {
+	int ret;
+
+	ret = devtx_hooks_register(&veth_devtx_hook_ids, &veth_xdp_metadata_ops);
+	if (ret) {
+		pr_warn("failed to register devtx hooks: %d\n", ret);
+		return ret;
+	}
+
 	return rtnl_link_register(&veth_link_ops);
 }
 
 static __exit void veth_exit(void)
 {
+	devtx_hooks_unregister(&veth_devtx_hook_ids);
 	rtnl_link_unregister(&veth_link_ops);
 }
 
-- 
2.41.0.255.g8b1d071c50-goog



* [xdp-hints] [RFC bpf-next v3 07/14] bpf: Introduce tx checksum devtx kfuncs
  2023-07-07 19:29 [xdp-hints] [RFC bpf-next v3 00/14] bpf: Netdev TX metadata Stanislav Fomichev
                   ` (5 preceding siblings ...)
  2023-07-07 19:29 ` [xdp-hints] [RFC bpf-next v3 06/14] net: veth: Implement devtx timestamp kfuncs Stanislav Fomichev
@ 2023-07-07 19:29 ` Stanislav Fomichev
  2023-07-07 19:30 ` [xdp-hints] [RFC bpf-next v3 08/14] net: veth: Implement devtx tx checksum Stanislav Fomichev
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 34+ messages in thread
From: Stanislav Fomichev @ 2023-07-07 19:29 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, kuba, toke, willemb, dsahern,
	magnus.karlsson, bjorn, maciej.fijalkowski, hawk, netdev,
	xdp-hints

Add a new kfunc that will be used for TX checksum offloading.
The API mirrors the existing one from the sk_buff world:
- csum_start - checksum the packet starting from this position
- csum_offset - put the resulting checksum at this offset

Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 include/linux/netdevice.h |  2 ++
 include/net/offload.h     |  5 ++++-
 net/core/devtx.c          | 17 +++++++++++++++++
 3 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 5be6649ea3fa..aeb1fa024d65 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1662,6 +1662,8 @@ struct xdp_metadata_ops {
 			       enum xdp_rss_hash_type *rss_type);
 	int	(*xmo_request_tx_timestamp)(const struct devtx_ctx *ctx);
 	int	(*xmo_tx_timestamp)(const struct devtx_ctx *ctx, u64 *timestamp);
+	int	(*xmo_request_l4_checksum)(const struct devtx_ctx *ctx,
+					   u16 csum_start, u16 csum_offset);
 };
 
 /**
diff --git a/include/net/offload.h b/include/net/offload.h
index 7e2c19c5aaef..d8f908af9e59 100644
--- a/include/net/offload.h
+++ b/include/net/offload.h
@@ -15,7 +15,10 @@
 #define DEVTX_SUBMIT_KFUNC_xxx	\
 	NETDEV_METADATA_KFUNC(DEVTX_KFUNC_REQUEST_TX_TIMESTAMP, \
 			      bpf_devtx_request_tx_timestamp, \
-			      xmo_request_tx_timestamp)
+			      xmo_request_tx_timestamp) \
+	NETDEV_METADATA_KFUNC(DEVTX_KFUNC_REQUEST_L4_CHECKSUM, \
+			      bpf_devtx_request_l4_csum, \
+			      xmo_request_l4_checksum)
 
 #define DEVTX_COMPLETE_KFUNC_xxx	\
 	NETDEV_METADATA_KFUNC(DEVTX_KFUNC_TX_TIMESTAMP, \
diff --git a/net/core/devtx.c b/net/core/devtx.c
index 991a52fe81a3..fd8a9ea125db 100644
--- a/net/core/devtx.c
+++ b/net/core/devtx.c
@@ -106,6 +106,23 @@ __bpf_kfunc int bpf_devtx_tx_timestamp(const struct devtx_ctx *ctx, __u64 *times
 	return -EOPNOTSUPP;
 }
 
+/**
+ * bpf_devtx_request_l4_csum - Request TX checksum offload on the packet.
+ * Callable only from the devtx-submit hook.
+ * @ctx: devtx context pointer.
+ * @csum_start: start checksumming from given position
+ * @csum_offset: add resulting checksum at given offset
+ *
+ * Note, this checksum offload doesn't calculate pseudo-header part.
+ *
+ * Returns 0 on success or ``-errno`` on error.
+ */
+__bpf_kfunc int bpf_devtx_request_l4_csum(const struct devtx_ctx *ctx,
+					  u16 csum_start, u16 csum_offset)
+{
+	return -EOPNOTSUPP;
+}
+
 __diag_pop();
 
 BTF_SET8_START(devtx_sb_kfunc_ids)
-- 
2.41.0.255.g8b1d071c50-goog



* [xdp-hints] [RFC bpf-next v3 08/14] net: veth: Implement devtx tx checksum
  2023-07-07 19:29 [xdp-hints] [RFC bpf-next v3 00/14] bpf: Netdev TX metadata Stanislav Fomichev
                   ` (6 preceding siblings ...)
  2023-07-07 19:29 ` [xdp-hints] [RFC bpf-next v3 07/14] bpf: Introduce tx checksum devtx kfuncs Stanislav Fomichev
@ 2023-07-07 19:30 ` Stanislav Fomichev
  2023-07-07 19:30 ` [xdp-hints] [RFC bpf-next v3 09/14] net/mlx5e: Implement devtx kfuncs Stanislav Fomichev
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 34+ messages in thread
From: Stanislav Fomichev @ 2023-07-07 19:30 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, kuba, toke, willemb, dsahern,
	magnus.karlsson, bjorn, maciej.fijalkowski, hawk, netdev,
	xdp-hints

Implement the TX checksum kfunc for veth by checksumming the packet
in software (since there is nothing to offload). The change mostly
exists to make a software-based selftest possible.

Perhaps we should instead set csum_start/csum_offset on the skb
itself?

Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 drivers/net/veth.c | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 5af4b15e107c..6f97511a545b 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -1814,6 +1814,34 @@ static int veth_devtx_tx_timestamp(const struct devtx_ctx *_ctx, u64 *timestamp)
 	return 0;
 }
 
+static int veth_devtx_request_l4_csum(const struct devtx_ctx *_ctx,
+				      u16 csum_start, u16 csum_offset)
+{
+	struct veth_devtx_ctx *ctx = (struct veth_devtx_ctx *)_ctx;
+	struct sk_buff *skb = ctx->skb;
+	__wsum csum;
+	int ret;
+
+	if (!skb)
+		return -EINVAL;
+
+	if (skb_transport_header_was_set(skb))
+		return -EINVAL;
+
+	if (csum_start >= skb->len)
+		return -EINVAL;
+
+	ret = skb_ensure_writable(skb, csum_offset + sizeof(__sum16));
+	if (ret)
+		return ret;
+
+	csum = csum_partial(skb->data + csum_start, skb->len - csum_start, 0);
+	*(__sum16 *)(skb->data + csum_offset) = csum_fold(csum) ?: CSUM_MANGLED_0;
+	skb->ip_summed = CHECKSUM_UNNECESSARY;
+
+	return 0;
+}
+
 static const struct net_device_ops veth_netdev_ops = {
 	.ndo_init            = veth_dev_init,
 	.ndo_open            = veth_open,
@@ -1840,6 +1868,7 @@ static const struct xdp_metadata_ops veth_xdp_metadata_ops = {
 	.xmo_rx_hash			= veth_xdp_rx_hash,
 	.xmo_request_tx_timestamp	= veth_devtx_request_tx_timestamp,
 	.xmo_tx_timestamp		= veth_devtx_tx_timestamp,
+	.xmo_request_l4_checksum	= veth_devtx_request_l4_csum,
 };
 
 #define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_HW_CSUM | \
-- 
2.41.0.255.g8b1d071c50-goog



* [xdp-hints] [RFC bpf-next v3 09/14] net/mlx5e: Implement devtx kfuncs
  2023-07-07 19:29 [xdp-hints] [RFC bpf-next v3 00/14] bpf: Netdev TX metadata Stanislav Fomichev
                   ` (7 preceding siblings ...)
  2023-07-07 19:30 ` [xdp-hints] [RFC bpf-next v3 08/14] net: veth: Implement devtx tx checksum Stanislav Fomichev
@ 2023-07-07 19:30 ` Stanislav Fomichev
  2023-07-11 22:56   ` [xdp-hints] " Alexei Starovoitov
  2023-07-07 19:30 ` [xdp-hints] [RFC bpf-next v3 10/14] selftests/xsk: Support XDP_TX_METADATA_LEN Stanislav Fomichev
                   ` (4 subsequent siblings)
  13 siblings, 1 reply; 34+ messages in thread
From: Stanislav Fomichev @ 2023-07-07 19:30 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, kuba, toke, willemb, dsahern,
	magnus.karlsson, bjorn, maciej.fijalkowski, hawk, netdev,
	xdp-hints

TX timestamp:
- requires passing a clock; not sure I'm passing the correct one (from
  cq->mdev), but the timestamp value looks convincing

TX checksum:
- it looks like the device does its own packet parsing (and doesn't
  accept a custom start/offset), so I'm using the same approach we do
  for skbs (support specific header offsets)

Some quirks that might need some more thought:
- AF_XDP requires creating an on-stack xdp_frame
- AF_XDP passes only the head to the completion hook

Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 .../net/ethernet/mellanox/mlx5/core/en/txrx.h |  15 ++
 .../net/ethernet/mellanox/mlx5/core/en/xdp.c  | 155 +++++++++++++++++-
 .../net/ethernet/mellanox/mlx5/core/en/xdp.h  |   4 +-
 .../ethernet/mellanox/mlx5/core/en/xsk/tx.c   |  10 ++
 .../net/ethernet/mellanox/mlx5/core/en_tx.c   |  24 +++
 .../net/ethernet/mellanox/mlx5/core/main.c    |  15 +-
 6 files changed, 217 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
index 879d698b6119..02b617a534c4 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h
@@ -6,6 +6,7 @@
 
 #include "en.h"
 #include <linux/indirect_call_wrapper.h>
+#include <net/devtx.h>
 
 #define MLX5E_TX_WQE_EMPTY_DS_COUNT (sizeof(struct mlx5e_tx_wqe) / MLX5_SEND_WQE_DS)
 
@@ -506,4 +507,18 @@ static inline struct mlx5e_mpw_info *mlx5e_get_mpw_info(struct mlx5e_rq *rq, int
 
 	return (struct mlx5e_mpw_info *)((char *)rq->mpwqe.info + array_size(i, isz));
 }
+
+struct mlx5e_devtx_ctx {
+	struct devtx_ctx devtx;
+	union {
+		struct {
+			struct mlx5_cqe64 *cqe; /* tx completion */
+			struct mlx5e_cq *cq; /* tx completion */
+		};
+		struct mlx5e_tx_wqe *wqe; /* tx */
+	};
+};
+
+DECLARE_DEVTX_HOOKS(mlx5e);
+
 #endif
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
index f0e6095809fa..59342d5e7e6d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
@@ -255,9 +255,60 @@ static int mlx5e_xdp_rx_hash(const struct xdp_md *ctx, u32 *hash,
 	return 0;
 }
 
+static int mlx5e_devtx_request_tx_timestamp(const struct devtx_ctx *ctx)
+{
+	/* Nothing to do here, CQE always has a timestamp. */
+	return 0;
+}
+
+static int mlx5e_devtx_tx_timestamp(const struct devtx_ctx *_ctx, u64 *timestamp)
+{
+	const struct mlx5e_devtx_ctx *ctx = (void *)_ctx;
+	u64 ts;
+
+	if (unlikely(!ctx->cqe || !ctx->cq))
+		return -ENODATA;
+
+	ts = get_cqe_ts(ctx->cqe);
+
+	if (mlx5_is_real_time_rq(ctx->cq->mdev) || mlx5_is_real_time_sq(ctx->cq->mdev))
+		*timestamp = mlx5_real_time_cyc2time(&ctx->cq->mdev->clock, ts);
+	else
+		*timestamp = mlx5_timecounter_cyc2time(&ctx->cq->mdev->clock, ts);
+
+	return 0;
+}
+
+static int mlx5e_devtx_request_l4_checksum(const struct devtx_ctx *_ctx,
+					   u16 csum_start, u16 csum_offset)
+{
+	const struct mlx5e_devtx_ctx *ctx = (void *)_ctx;
+	struct mlx5_wqe_eth_seg *eseg;
+
+	if (unlikely(!ctx->wqe))
+		return -ENODATA;
+
+	eseg = &ctx->wqe->eth;
+
+	switch (csum_offset) {
+	case sizeof(struct ethhdr) + sizeof(struct iphdr) + offsetof(struct udphdr, check):
+	case sizeof(struct ethhdr) + sizeof(struct ipv6hdr) + offsetof(struct udphdr, check):
+		/* Looks like HW/FW is doing parsing, so offsets are largely ignored. */
+		eseg->cs_flags = MLX5_ETH_WQE_L3_CSUM | MLX5_ETH_WQE_L4_CSUM;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
 const struct xdp_metadata_ops mlx5e_xdp_metadata_ops = {
 	.xmo_rx_timestamp		= mlx5e_xdp_rx_timestamp,
 	.xmo_rx_hash			= mlx5e_xdp_rx_hash,
+	.xmo_request_tx_timestamp	= mlx5e_devtx_request_tx_timestamp,
+	.xmo_tx_timestamp		= mlx5e_devtx_tx_timestamp,
+	.xmo_request_l4_checksum	= mlx5e_devtx_request_l4_checksum,
 };
 
 /* returns true if packet was consumed by xdp */
@@ -453,6 +504,27 @@ mlx5e_xmit_xdp_frame_mpwqe(struct mlx5e_xdpsq *sq, struct mlx5e_xmit_data *xdptx
 
 	mlx5e_xdp_mpwqe_add_dseg(sq, p, stats);
 
+	if (devtx_enabled()) {
+		struct mlx5e_xmit_data_frags *xdptxdf =
+			container_of(xdptxd, struct mlx5e_xmit_data_frags, xd);
+
+		struct xdp_frame frame = {
+			.data = p->data,
+			.len = p->len,
+			.metasize = sq->xsk_pool->tx_metadata_len,
+		};
+
+		struct mlx5e_devtx_ctx ctx = {
+			.devtx = {
+				.netdev = sq->cq.netdev,
+				.sinfo = xdptxd->has_frags ? xdptxdf->sinfo : NULL,
+			},
+			.wqe = sq->mpwqe.wqe,
+		};
+
+		mlx5e_devtx_submit_xdp(&ctx.devtx, &frame);
+	}
+
 	if (unlikely(mlx5e_xdp_mpwqe_is_full(session, sq->max_sq_mpw_wqebbs)))
 		mlx5e_xdp_mpwqe_complete(sq);
 
@@ -560,6 +632,24 @@ mlx5e_xmit_xdp_frame(struct mlx5e_xdpsq *sq, struct mlx5e_xmit_data *xdptxd,
 		dseg++;
 	}
 
+	if (devtx_enabled()) {
+		struct xdp_frame frame = {
+			.data = xdptxd->data,
+			.len = xdptxd->len,
+			.metasize = sq->xsk_pool->tx_metadata_len,
+		};
+
+		struct mlx5e_devtx_ctx ctx = {
+			.devtx = {
+				.netdev = sq->cq.netdev,
+				.sinfo = xdptxd->has_frags ? xdptxdf->sinfo : NULL,
+			},
+			.wqe = wqe,
+		};
+
+		mlx5e_devtx_submit_xdp(&ctx.devtx, &frame);
+	}
+
 	cseg->opmod_idx_opcode = cpu_to_be32((sq->pc << 8) | MLX5_OPCODE_SEND);
 
 	if (test_bit(MLX5E_SQ_STATE_XDP_MULTIBUF, &sq->state)) {
@@ -607,7 +697,9 @@ mlx5e_xmit_xdp_frame(struct mlx5e_xdpsq *sq, struct mlx5e_xmit_data *xdptxd,
 static void mlx5e_free_xdpsq_desc(struct mlx5e_xdpsq *sq,
 				  struct mlx5e_xdp_wqe_info *wi,
 				  u32 *xsk_frames,
-				  struct xdp_frame_bulk *bq)
+				  struct xdp_frame_bulk *bq,
+				  struct mlx5e_cq *cq,
+				  struct mlx5_cqe64 *cqe)
 {
 	struct mlx5e_xdp_info_fifo *xdpi_fifo = &sq->db.xdpi_fifo;
 	u16 i;
@@ -626,6 +718,21 @@ static void mlx5e_free_xdpsq_desc(struct mlx5e_xdpsq *sq,
 			xdpi = mlx5e_xdpi_fifo_pop(xdpi_fifo);
 			dma_addr = xdpi.frame.dma_addr;
 
+			if (devtx_enabled()) {
+				struct mlx5e_devtx_ctx ctx = {
+					.devtx = {
+						.netdev = sq->cq.netdev,
+						.sinfo = xdp_frame_has_frags(xdpf) ?
+							 xdp_get_shared_info_from_frame(xdpf) :
+							 NULL,
+					},
+					.cqe = cqe,
+					.cq = cq,
+				};
+
+				mlx5e_devtx_complete_xdp(&ctx.devtx, xdpi.frame.xdpf);
+			}
+
 			dma_unmap_single(sq->pdev, dma_addr,
 					 xdpf->len, DMA_TO_DEVICE);
 			if (xdp_frame_has_frags(xdpf)) {
@@ -659,6 +766,24 @@ static void mlx5e_free_xdpsq_desc(struct mlx5e_xdpsq *sq,
 				xdpi = mlx5e_xdpi_fifo_pop(xdpi_fifo);
 				page = xdpi.page.page;
 
+				if (devtx_enabled()) {
+					struct xdp_frame frame = {
+						.data = page,
+						.len = PAGE_SIZE,
+						.metasize = sq->xsk_pool->tx_metadata_len,
+					};
+
+					struct mlx5e_devtx_ctx ctx = {
+						.devtx = {
+							.netdev = sq->cq.netdev,
+						},
+						.cqe = cqe,
+						.cq = cq,
+					};
+
+					mlx5e_devtx_complete_xdp(&ctx.devtx, &frame);
+				}
+
 				/* No need to check ((page->pp_magic & ~0x3UL) == PP_SIGNATURE)
 				 * as we know this is a page_pool page.
 				 */
@@ -668,10 +793,32 @@ static void mlx5e_free_xdpsq_desc(struct mlx5e_xdpsq *sq,
 
 			break;
 		}
-		case MLX5E_XDP_XMIT_MODE_XSK:
+		case MLX5E_XDP_XMIT_MODE_XSK: {
 			/* AF_XDP send */
+			struct xdp_frame frame = {};
+
+			xdpi = mlx5e_xdpi_fifo_pop(xdpi_fifo);
+			frame.data = xdpi.frame.xsk_head;
+			xdpi = mlx5e_xdpi_fifo_pop(xdpi_fifo);
+			frame.len = xdpi.frame.xsk_head_len;
+
+			if (devtx_enabled()) {
+				struct mlx5e_devtx_ctx ctx = {
+					.devtx = {
+						.netdev = sq->cq.netdev,
+					},
+					.cqe = cqe,
+					.cq = cq,
+				};
+
+				frame.metasize = sq->xsk_pool->tx_metadata_len;
+
+				mlx5e_devtx_complete_xdp(&ctx.devtx, &frame);
+			}
+
 			(*xsk_frames)++;
 			break;
+		}
 		default:
 			WARN_ON_ONCE(true);
 		}
@@ -720,7 +867,7 @@ bool mlx5e_poll_xdpsq_cq(struct mlx5e_cq *cq)
 
 			sqcc += wi->num_wqebbs;
 
-			mlx5e_free_xdpsq_desc(sq, wi, &xsk_frames, &bq);
+			mlx5e_free_xdpsq_desc(sq, wi, &xsk_frames, &bq, cq, cqe);
 		} while (!last_wqe);
 
 		if (unlikely(get_cqe_opcode(cqe) != MLX5_CQE_REQ)) {
@@ -767,7 +914,7 @@ void mlx5e_free_xdpsq_descs(struct mlx5e_xdpsq *sq)
 
 		sq->cc += wi->num_wqebbs;
 
-		mlx5e_free_xdpsq_desc(sq, wi, &xsk_frames, &bq);
+		mlx5e_free_xdpsq_desc(sq, wi, &xsk_frames, &bq, NULL, NULL);
 	}
 
 	xdp_flush_frame_bulk(&bq);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
index 9e8e6184f9e4..f3ea9f58273a 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
@@ -82,13 +82,15 @@ enum mlx5e_xdp_xmit_mode {
  *    num, page_1, page_2, ... , page_num.
  *
  * MLX5E_XDP_XMIT_MODE_XSK:
- *    none.
+ *    frame.xsk_head + frame.xsk_head_len for header portion only.
  */
 union mlx5e_xdp_info {
 	enum mlx5e_xdp_xmit_mode mode;
 	union {
 		struct xdp_frame *xdpf;
 		dma_addr_t dma_addr;
+		void *xsk_head;
+		u32 xsk_head_len;
 	} frame;
 	union {
 		struct mlx5e_rq *rq;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.c
index 597f319d4770..4ea1fc1aa500 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.c
@@ -55,6 +55,10 @@ static void mlx5e_xsk_tx_post_err(struct mlx5e_xdpsq *sq,
 
 	nopwqe = mlx5e_post_nop(&sq->wq, sq->sqn, &sq->pc);
 	mlx5e_xdpi_fifo_push(&sq->db.xdpi_fifo, *xdpi);
+	mlx5e_xdpi_fifo_push(&sq->db.xdpi_fifo,
+			     (union mlx5e_xdp_info) { .frame.xsk_head = NULL });
+	mlx5e_xdpi_fifo_push(&sq->db.xdpi_fifo,
+			     (union mlx5e_xdp_info) { .frame.xsk_head_len = 0 });
 	sq->doorbell_cseg = &nopwqe->ctrl;
 }
 
@@ -106,6 +110,12 @@ bool mlx5e_xsk_tx(struct mlx5e_xdpsq *sq, unsigned int budget)
 			mlx5e_xsk_tx_post_err(sq, &xdpi);
 		} else {
 			mlx5e_xdpi_fifo_push(&sq->db.xdpi_fifo, xdpi);
+			mlx5e_xdpi_fifo_push(&sq->db.xdpi_fifo,
+					     (union mlx5e_xdp_info)
+					     { .frame.xsk_head = xdptxd.data });
+			mlx5e_xdpi_fifo_push(&sq->db.xdpi_fifo,
+					     (union mlx5e_xdp_info)
+					     { .frame.xsk_head_len = xdptxd.len });
 		}
 
 		flush = true;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index c7eb6b238c2b..fc76e8efc9ea 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -758,6 +758,18 @@ static void mlx5e_tx_wi_consume_fifo_skbs(struct mlx5e_txqsq *sq, struct mlx5e_t
 	for (i = 0; i < wi->num_fifo_pkts; i++) {
 		struct sk_buff *skb = mlx5e_skb_fifo_pop(&sq->db.skb_fifo);
 
+		if (devtx_enabled()) {
+			struct mlx5e_devtx_ctx ctx = {
+				.devtx = {
+					.netdev = skb->dev,
+					.sinfo = skb_shinfo(skb),
+				},
+				.cqe = cqe,
+			};
+
+			mlx5e_devtx_complete_skb(&ctx.devtx, skb);
+		}
+
 		mlx5e_consume_skb(sq, skb, cqe, napi_budget);
 	}
 }
@@ -826,6 +838,18 @@ bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget)
 			sqcc += wi->num_wqebbs;
 
 			if (likely(wi->skb)) {
+				if (devtx_enabled()) {
+					struct mlx5e_devtx_ctx ctx = {
+						.devtx = {
+							.netdev = wi->skb->dev,
+							.sinfo = skb_shinfo(wi->skb),
+						},
+						.cqe = cqe,
+					};
+
+					mlx5e_devtx_complete_skb(&ctx.devtx, wi->skb);
+				}
+
 				mlx5e_tx_wi_dma_unmap(sq, wi, &dma_fifo_cc);
 				mlx5e_consume_skb(sq, wi->skb, cqe, napi_budget);
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/main.c b/drivers/net/ethernet/mellanox/mlx5/core/main.c
index 88dbea6631d5..ca0c71688b88 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/main.c
@@ -48,6 +48,7 @@
 #include <linux/mlx5/vport.h>
 #include <linux/version.h>
 #include <net/devlink.h>
+#include <net/devtx.h>
 #include "mlx5_core.h"
 #include "thermal.h"
 #include "lib/eq.h"
@@ -73,6 +74,7 @@
 #include "sf/dev/dev.h"
 #include "sf/sf.h"
 #include "mlx5_irq.h"
+#include "en/xdp.h"
 
 MODULE_AUTHOR("Eli Cohen <eli@mellanox.com>");
 MODULE_DESCRIPTION("Mellanox 5th generation network adapters (ConnectX series) core driver");
@@ -2277,6 +2279,8 @@ static void mlx5_core_verify_params(void)
 	}
 }
 
+DEFINE_DEVTX_HOOKS(mlx5e);
+
 static int __init mlx5_init(void)
 {
 	int err;
@@ -2289,9 +2293,15 @@ static int __init mlx5_init(void)
 	mlx5_core_verify_params();
 	mlx5_register_debugfs();
 
+	err = devtx_hooks_register(&mlx5e_devtx_hook_ids, &mlx5e_xdp_metadata_ops);
+	if (err) {
+		pr_warn("failed to register devtx hooks: %d\n", err);
+		goto err_debug;
+	}
+
 	err = mlx5e_init();
 	if (err)
-		goto err_debug;
+		goto err_devtx;
 
 	err = mlx5_sf_driver_register();
 	if (err)
@@ -2307,6 +2317,8 @@ static int __init mlx5_init(void)
 	mlx5_sf_driver_unregister();
 err_sf:
 	mlx5e_cleanup();
+err_devtx:
+	devtx_hooks_unregister(&mlx5e_devtx_hook_ids);
 err_debug:
 	mlx5_unregister_debugfs();
 	return err;
@@ -2314,6 +2326,7 @@ static int __init mlx5_init(void)
 
 static void __exit mlx5_cleanup(void)
 {
+	devtx_hooks_unregister(&mlx5e_devtx_hook_ids);
 	pci_unregister_driver(&mlx5_core_driver);
 	mlx5_sf_driver_unregister();
 	mlx5e_cleanup();
-- 
2.41.0.255.g8b1d071c50-goog


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [xdp-hints] [RFC bpf-next v3 10/14] selftests/xsk: Support XDP_TX_METADATA_LEN
  2023-07-07 19:29 [xdp-hints] [RFC bpf-next v3 00/14] bpf: Netdev TX metadata Stanislav Fomichev
                   ` (8 preceding siblings ...)
  2023-07-07 19:30 ` [xdp-hints] [RFC bpf-next v3 09/14] net/mlx5e: Implement devtx kfuncs Stanislav Fomichev
@ 2023-07-07 19:30 ` Stanislav Fomichev
  2023-07-07 19:30 ` [xdp-hints] [RFC bpf-next v3 11/14] selftests/bpf: Add helper to query current netns cookie Stanislav Fomichev
                   ` (3 subsequent siblings)
  13 siblings, 0 replies; 34+ messages in thread
From: Stanislav Fomichev @ 2023-07-07 19:30 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, kuba, toke, willemb, dsahern,
	magnus.karlsson, bjorn, maciej.fijalkowski, hawk, netdev,
	xdp-hints

Add a new tx_metadata_len config field and, when it is set, call the
XDP_TX_METADATA_LEN setsockopt during socket creation.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 tools/testing/selftests/bpf/xsk.c | 17 +++++++++++++++++
 tools/testing/selftests/bpf/xsk.h |  1 +
 2 files changed, 18 insertions(+)

diff --git a/tools/testing/selftests/bpf/xsk.c b/tools/testing/selftests/bpf/xsk.c
index 687d83e707f8..c659713e2d43 100644
--- a/tools/testing/selftests/bpf/xsk.c
+++ b/tools/testing/selftests/bpf/xsk.c
@@ -47,6 +47,10 @@
  #define PF_XDP AF_XDP
 #endif
 
+#ifndef XDP_TX_METADATA_LEN
+#define XDP_TX_METADATA_LEN 9
+#endif
+
 #define pr_warn(fmt, ...) fprintf(stderr, fmt, ##__VA_ARGS__)
 
 #define XSKMAP_SIZE 1
@@ -124,12 +128,14 @@ static int xsk_set_xdp_socket_config(struct xsk_socket_config *cfg,
 		cfg->rx_size = XSK_RING_CONS__DEFAULT_NUM_DESCS;
 		cfg->tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS;
 		cfg->bind_flags = 0;
+		cfg->tx_metadata_len = 0;
 		return 0;
 	}
 
 	cfg->rx_size = usr_cfg->rx_size;
 	cfg->tx_size = usr_cfg->tx_size;
 	cfg->bind_flags = usr_cfg->bind_flags;
+	cfg->tx_metadata_len = usr_cfg->tx_metadata_len;
 
 	return 0;
 }
@@ -479,6 +485,17 @@ int xsk_socket__create_shared(struct xsk_socket **xsk_ptr,
 			umem->tx_ring_setup_done = true;
 	}
 
+	if (xsk->config.tx_metadata_len) {
+		int optval = xsk->config.tx_metadata_len;
+
+		err = setsockopt(xsk->fd, SOL_XDP, XDP_TX_METADATA_LEN,
+				 &optval, sizeof(optval));
+		if (err) {
+			err = -errno;
+			goto out_put_ctx;
+		}
+	}
+
 	err = xsk_get_mmap_offsets(xsk->fd, &off);
 	if (err) {
 		err = -errno;
diff --git a/tools/testing/selftests/bpf/xsk.h b/tools/testing/selftests/bpf/xsk.h
index 8da8d557768b..57e0af403aa8 100644
--- a/tools/testing/selftests/bpf/xsk.h
+++ b/tools/testing/selftests/bpf/xsk.h
@@ -212,6 +212,7 @@ struct xsk_socket_config {
 	__u32 rx_size;
 	__u32 tx_size;
 	__u16 bind_flags;
+	__u8 tx_metadata_len;
 };
 
 /* Set config to NULL to get the default configuration. */
-- 
2.41.0.255.g8b1d071c50-goog



* [xdp-hints] [RFC bpf-next v3 11/14] selftests/bpf: Add helper to query current netns cookie
  2023-07-07 19:29 [xdp-hints] [RFC bpf-next v3 00/14] bpf: Netdev TX metadata Stanislav Fomichev
                   ` (9 preceding siblings ...)
  2023-07-07 19:30 ` [xdp-hints] [RFC bpf-next v3 10/14] selftests/xsk: Support XDP_TX_METADATA_LEN Stanislav Fomichev
@ 2023-07-07 19:30 ` Stanislav Fomichev
  2023-07-07 19:30 ` [xdp-hints] [RFC bpf-next v3 12/14] selftests/bpf: Add csum helpers Stanislav Fomichev
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 34+ messages in thread
From: Stanislav Fomichev @ 2023-07-07 19:30 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, kuba, toke, willemb, dsahern,
	magnus.karlsson, bjorn, maciej.fijalkowski, hawk, netdev,
	xdp-hints

The helper returns the netns cookie of the current process. It will
be used to filter out the netdev in the AF_XDP metadata selftests.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 tools/testing/selftests/bpf/network_helpers.c | 21 +++++++++++++++++++
 tools/testing/selftests/bpf/network_helpers.h |  1 +
 2 files changed, 22 insertions(+)

diff --git a/tools/testing/selftests/bpf/network_helpers.c b/tools/testing/selftests/bpf/network_helpers.c
index a105c0cd008a..34102fce5a88 100644
--- a/tools/testing/selftests/bpf/network_helpers.c
+++ b/tools/testing/selftests/bpf/network_helpers.c
@@ -450,3 +450,24 @@ int get_socket_local_port(int sock_fd)
 
 	return -1;
 }
+
+#ifndef SO_NETNS_COOKIE
+#define SO_NETNS_COOKIE 71
+#endif
+
+__u64 get_net_cookie(void)
+{
+	socklen_t optlen;
+	__u64 optval = 0;
+	int fd;
+
+	fd = socket(AF_LOCAL, SOCK_DGRAM, 0);
+	if (fd >= 0) {
+		optlen = sizeof(optval);
+		getsockopt(fd, SOL_SOCKET, SO_NETNS_COOKIE, &optval, &optlen);
+		close(fd);
+	}
+
+	return optval;
+}
+
diff --git a/tools/testing/selftests/bpf/network_helpers.h b/tools/testing/selftests/bpf/network_helpers.h
index 694185644da6..380047161aac 100644
--- a/tools/testing/selftests/bpf/network_helpers.h
+++ b/tools/testing/selftests/bpf/network_helpers.h
@@ -57,6 +57,7 @@ int make_sockaddr(int family, const char *addr_str, __u16 port,
 		  struct sockaddr_storage *addr, socklen_t *len);
 char *ping_command(int family);
 int get_socket_local_port(int sock_fd);
+__u64 get_net_cookie(void);
 
 struct nstoken;
 /**
-- 
2.41.0.255.g8b1d071c50-goog



* [xdp-hints] [RFC bpf-next v3 12/14] selftests/bpf: Add csum helpers
  2023-07-07 19:29 [xdp-hints] [RFC bpf-next v3 00/14] bpf: Netdev TX metadata Stanislav Fomichev
                   ` (10 preceding siblings ...)
  2023-07-07 19:30 ` [xdp-hints] [RFC bpf-next v3 11/14] selftests/bpf: Add helper to query current netns cookie Stanislav Fomichev
@ 2023-07-07 19:30 ` Stanislav Fomichev
  2023-07-07 19:30 ` [xdp-hints] [RFC bpf-next v3 13/14] selftests/bpf: Extend xdp_metadata with devtx kfuncs Stanislav Fomichev
  2023-07-07 19:30 ` [xdp-hints] [RFC bpf-next v3 14/14] selftests/bpf: Extend xdp_hw_metadata " Stanislav Fomichev
  13 siblings, 0 replies; 34+ messages in thread
From: Stanislav Fomichev @ 2023-07-07 19:30 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, kuba, toke, willemb, dsahern,
	magnus.karlsson, bjorn, maciej.fijalkowski, hawk, netdev,
	xdp-hints

Checksum helpers will be used to calculate pseudo-header checksum in
AF_XDP metadata selftests.

The helpers are mirroring existing kernel ones:
- csum_tcpudp_magic : IPv4 pseudo header csum
- csum_ipv6_magic : IPv6 pseudo header csum
- csum_fold : fold csum and do one's complement

Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 tools/testing/selftests/bpf/network_helpers.h | 43 +++++++++++++++++++
 1 file changed, 43 insertions(+)

diff --git a/tools/testing/selftests/bpf/network_helpers.h b/tools/testing/selftests/bpf/network_helpers.h
index 380047161aac..47a837c0d796 100644
--- a/tools/testing/selftests/bpf/network_helpers.h
+++ b/tools/testing/selftests/bpf/network_helpers.h
@@ -68,4 +68,47 @@ struct nstoken;
  */
 struct nstoken *open_netns(const char *name);
 void close_netns(struct nstoken *token);
+
+static inline __u16 csum_fold(__u32 csum)
+{
+	csum = (csum & 0xffff) + (csum >> 16);
+	csum = (csum & 0xffff) + (csum >> 16);
+
+	return (__u16)~csum;
+}
+
+static inline __sum16 csum_tcpudp_magic(__be32 saddr, __be32 daddr,
+					__u32 len, __u8 proto,
+					__wsum csum)
+{
+	__u64 s = csum;
+
+	s += (__u32)saddr;
+	s += (__u32)daddr;
+	s += htons(proto + len);
+	s = (s & 0xffffffff) + (s >> 32);
+	s = (s & 0xffffffff) + (s >> 32);
+
+	return csum_fold((__u32)s);
+}
+
+static inline __sum16 csum_ipv6_magic(const struct in6_addr *saddr,
+				      const struct in6_addr *daddr,
+				      __u32 len, __u8 proto,
+				      __wsum csum)
+{
+	__u64 s = csum;
+	int i;
+
+	for (i = 0; i < 4; i++)
+		s += (__u32)saddr->s6_addr32[i];
+	for (i = 0; i < 4; i++)
+		s += (__u32)daddr->s6_addr32[i];
+	s += htons(proto + len);
+	s = (s & 0xffffffff) + (s >> 32);
+	s = (s & 0xffffffff) + (s >> 32);
+
+	return csum_fold((__u32)s);
+}
+
 #endif
-- 
2.41.0.255.g8b1d071c50-goog



* [xdp-hints] [RFC bpf-next v3 13/14] selftests/bpf: Extend xdp_metadata with devtx kfuncs
  2023-07-07 19:29 [xdp-hints] [RFC bpf-next v3 00/14] bpf: Netdev TX metadata Stanislav Fomichev
                   ` (11 preceding siblings ...)
  2023-07-07 19:30 ` [xdp-hints] [RFC bpf-next v3 12/14] selftests/bpf: Add csum helpers Stanislav Fomichev
@ 2023-07-07 19:30 ` Stanislav Fomichev
  2023-07-07 19:30 ` [xdp-hints] [RFC bpf-next v3 14/14] selftests/bpf: Extend xdp_hw_metadata " Stanislav Fomichev
  13 siblings, 0 replies; 34+ messages in thread
From: Stanislav Fomichev @ 2023-07-07 19:30 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, kuba, toke, willemb, dsahern,
	magnus.karlsson, bjorn, maciej.fijalkowski, hawk, netdev,
	xdp-hints

Attach fentry programs that use the devtx kfuncs to request the TX
timestamp and report it via a ringbuf. Confirm on the userspace side
that the programs have triggered and that the timestamp is non-zero.

Also make sure the skb has sensible pointers and data.

In addition, calculate the pseudo-header checksum and offload the
payload checksum calculation to the kfunc.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 .../selftests/bpf/prog_tests/xdp_metadata.c   |  64 +++++++-
 .../selftests/bpf/progs/xdp_metadata.c        | 141 ++++++++++++++++++
 tools/testing/selftests/bpf/xdp_metadata.h    |  16 ++
 3 files changed, 217 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c b/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
index 626c461fa34d..ca1442d2c347 100644
--- a/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
+++ b/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
@@ -42,6 +42,8 @@ struct xsk {
 	struct xsk_ring_prod tx;
 	struct xsk_ring_cons rx;
 	struct xsk_socket *socket;
+	int tx_completions;
+	struct devtx_sample last_sample;
 };
 
 static int open_xsk(int ifindex, struct xsk *xsk)
@@ -51,6 +53,7 @@ static int open_xsk(int ifindex, struct xsk *xsk)
 		.rx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
 		.tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
 		.bind_flags = XDP_COPY,
+		.tx_metadata_len = TX_META_LEN,
 	};
 	const struct xsk_umem_config umem_config = {
 		.fill_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
@@ -138,6 +141,7 @@ static void ip_csum(struct iphdr *iph)
 
 static int generate_packet(struct xsk *xsk, __u16 dst_port)
 {
+	struct xdp_tx_meta *meta;
 	struct xdp_desc *tx_desc;
 	struct udphdr *udph;
 	struct ethhdr *eth;
@@ -151,10 +155,13 @@ static int generate_packet(struct xsk *xsk, __u16 dst_port)
 		return -1;
 
 	tx_desc = xsk_ring_prod__tx_desc(&xsk->tx, idx);
-	tx_desc->addr = idx % (UMEM_NUM / 2) * UMEM_FRAME_SIZE;
+	tx_desc->addr = idx % (UMEM_NUM / 2) * UMEM_FRAME_SIZE + TX_META_LEN;
 	printf("%p: tx_desc[%u]->addr=%llx\n", xsk, idx, tx_desc->addr);
 	data = xsk_umem__get_data(xsk->umem_area, tx_desc->addr);
 
+	meta = data - TX_META_LEN;
+	meta->request_timestamp = 1;
+
 	eth = data;
 	iph = (void *)(eth + 1);
 	udph = (void *)(iph + 1);
@@ -178,7 +185,8 @@ static int generate_packet(struct xsk *xsk, __u16 dst_port)
 	udph->source = htons(AF_XDP_SOURCE_PORT);
 	udph->dest = htons(dst_port);
 	udph->len = htons(sizeof(*udph) + UDP_PAYLOAD_BYTES);
-	udph->check = 0;
+	udph->check = ~csum_tcpudp_magic(iph->saddr, iph->daddr,
+					 ntohs(udph->len), IPPROTO_UDP, 0);
 
 	memset(udph + 1, 0xAA, UDP_PAYLOAD_BYTES);
 
@@ -192,7 +200,8 @@ static int generate_packet(struct xsk *xsk, __u16 dst_port)
 	return 0;
 }
 
-static void complete_tx(struct xsk *xsk)
+static void complete_tx(struct xsk *xsk, struct xdp_metadata *bpf_obj,
+			struct ring_buffer *ringbuf)
 {
 	__u32 idx;
 	__u64 addr;
@@ -202,6 +211,14 @@ static void complete_tx(struct xsk *xsk)
 
 		printf("%p: complete tx idx=%u addr=%llx\n", xsk, idx, addr);
 		xsk_ring_cons__release(&xsk->comp, 1);
+
+		ring_buffer__poll(ringbuf, 1000);
+
+		ASSERT_EQ(bpf_obj->bss->pkts_fail_tx, 0, "pkts_fail_tx");
+		ASSERT_GE(xsk->tx_completions, 1, "tx_completions");
+		ASSERT_EQ(xsk->last_sample.timestamp_retval, 0, "timestamp_retval");
+		ASSERT_GE(xsk->last_sample.hw_timestamp, 0, "hw_timestamp");
+		ASSERT_EQ(xsk->last_sample.tx_csum, 0x1c72, "tx_csum");
 	}
 }
 
@@ -276,8 +293,23 @@ static int verify_xsk_metadata(struct xsk *xsk)
 	return 0;
 }
 
+static int process_sample(void *ctx, void *data, size_t len)
+{
+	struct devtx_sample *sample = data;
+	struct xsk *xsk = ctx;
+
+	printf("%p: got tx timestamp sample %u %llu\n",
+	       xsk, sample->timestamp_retval, sample->hw_timestamp);
+
+	xsk->tx_completions++;
+	xsk->last_sample = *sample;
+
+	return 0;
+}
+
 void test_xdp_metadata(void)
 {
+	struct ring_buffer *tx_compl_ringbuf = NULL;
 	struct xdp_metadata2 *bpf_obj2 = NULL;
 	struct xdp_metadata *bpf_obj = NULL;
 	struct bpf_program *new_prog, *prog;
@@ -290,6 +322,7 @@ void test_xdp_metadata(void)
 	int retries = 10;
 	int rx_ifindex;
 	int tx_ifindex;
+	int syscall_fd;
 	int sock_fd;
 	int ret;
 
@@ -323,6 +356,14 @@ void test_xdp_metadata(void)
 	if (!ASSERT_OK_PTR(bpf_obj, "open skeleton"))
 		goto out;
 
+	prog = bpf_object__find_program_by_name(bpf_obj->obj, "tx_submit");
+	bpf_program__set_ifindex(prog, tx_ifindex);
+	bpf_program__set_flags(prog, BPF_F_XDP_DEV_BOUND_ONLY);
+
+	prog = bpf_object__find_program_by_name(bpf_obj->obj, "tx_complete");
+	bpf_program__set_ifindex(prog, tx_ifindex);
+	bpf_program__set_flags(prog, BPF_F_XDP_DEV_BOUND_ONLY);
+
 	prog = bpf_object__find_program_by_name(bpf_obj->obj, "rx");
 	bpf_program__set_ifindex(prog, rx_ifindex);
 	bpf_program__set_flags(prog, BPF_F_XDP_DEV_BOUND_ONLY);
@@ -330,6 +371,18 @@ void test_xdp_metadata(void)
 	if (!ASSERT_OK(xdp_metadata__load(bpf_obj), "load skeleton"))
 		goto out;
 
+	bpf_obj->data->ifindex = tx_ifindex;
+	bpf_obj->data->net_cookie = get_net_cookie();
+
+	ret = xdp_metadata__attach(bpf_obj);
+	if (!ASSERT_OK(ret, "xdp_metadata__attach"))
+		goto out;
+
+	tx_compl_ringbuf = ring_buffer__new(bpf_map__fd(bpf_obj->maps.tx_compl_buf),
+					    process_sample, &tx_xsk, NULL);
+	if (!ASSERT_OK_PTR(tx_compl_ringbuf, "ring_buffer__new"))
+		goto out;
+
 	/* Make sure we can't add dev-bound programs to prog maps. */
 	prog_arr = bpf_object__find_map_by_name(bpf_obj->obj, "prog_arr");
 	if (!ASSERT_OK_PTR(prog_arr, "no prog_arr map"))
@@ -364,7 +417,8 @@ void test_xdp_metadata(void)
 		       "verify_xsk_metadata"))
 		goto out;
 
-	complete_tx(&tx_xsk);
+	/* Verify AF_XDP TX packet has completion event with a timestamp. */
+	complete_tx(&tx_xsk, bpf_obj, tx_compl_ringbuf);
 
 	/* Make sure freplace correctly picks up original bound device
 	 * and doesn't crash.
@@ -402,5 +456,7 @@ void test_xdp_metadata(void)
 	xdp_metadata__destroy(bpf_obj);
 	if (tok)
 		close_netns(tok);
+	if (tx_compl_ringbuf)
+		ring_buffer__free(tx_compl_ringbuf);
 	SYS_NOFAIL("ip netns del xdp_metadata");
 }
diff --git a/tools/testing/selftests/bpf/progs/xdp_metadata.c b/tools/testing/selftests/bpf/progs/xdp_metadata.c
index d151d406a123..a5d9229acdf3 100644
--- a/tools/testing/selftests/bpf/progs/xdp_metadata.c
+++ b/tools/testing/selftests/bpf/progs/xdp_metadata.c
@@ -4,6 +4,11 @@
 #include "xdp_metadata.h"
 #include <bpf/bpf_helpers.h>
 #include <bpf/bpf_endian.h>
+#include <bpf/bpf_tracing.h>
+
+#ifndef ETH_P_IP
+#define ETH_P_IP 0x0800
+#endif
 
 struct {
 	__uint(type, BPF_MAP_TYPE_XSKMAP);
@@ -19,10 +24,24 @@ struct {
 	__type(value, __u32);
 } prog_arr SEC(".maps");
 
+struct {
+	__uint(type, BPF_MAP_TYPE_RINGBUF);
+	__uint(max_entries, 4096);
+} tx_compl_buf SEC(".maps");
+
+__u64 pkts_fail_tx = 0;
+
+int ifindex = -1;
+__u64 net_cookie = -1;
+
 extern int bpf_xdp_metadata_rx_timestamp(const struct xdp_md *ctx,
 					 __u64 *timestamp) __ksym;
 extern int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx, __u32 *hash,
 				    enum xdp_rss_hash_type *rss_type) __ksym;
+extern int bpf_devtx_request_tx_timestamp(const struct devtx_ctx *ctx) __ksym;
+extern int bpf_devtx_tx_timestamp(const struct devtx_ctx *ctx, __u64 *timestamp) __ksym;
+extern int bpf_devtx_request_l4_csum(const struct devtx_ctx *ctx,
+				     u16 csum_start, u16 csum_offset) __ksym;
 
 SEC("xdp")
 int rx(struct xdp_md *ctx)
@@ -61,4 +80,126 @@ int rx(struct xdp_md *ctx)
 	return bpf_redirect_map(&xsk, ctx->rx_queue_index, XDP_PASS);
 }
 
+static inline int verify_frame(const struct sk_buff *skb, const struct skb_shared_info *sinfo)
+{
+	struct ethhdr eth = {};
+
+	/* all the pointers are set up correctly */
+	if (!skb->data)
+		return -1;
+	if (!sinfo)
+		return -1;
+
+	/* can get to the frags */
+	if (sinfo->nr_frags != 0)
+		return -1;
+	if (sinfo->frags[0].bv_page != 0)
+		return -1;
+	if (sinfo->frags[0].bv_len != 0)
+		return -1;
+	if (sinfo->frags[0].bv_offset != 0)
+		return -1;
+
+	/* the data has something that looks like ethernet */
+	if (skb->len != 46)
+		return -1;
+	bpf_probe_read_kernel(&eth, sizeof(eth), skb->data);
+
+	if (eth.h_proto != bpf_htons(ETH_P_IP))
+		return -1;
+
+	return 0;
+}
+
+static inline bool my_netdev(const struct devtx_ctx *ctx)
+{
+	static struct net_device *netdev;
+
+	if (netdev)
+		return netdev == ctx->netdev;
+
+	if (ctx->netdev->ifindex != ifindex)
+		return false;
+	if (ctx->netdev->nd_net.net->net_cookie != net_cookie)
+		return false;
+
+	netdev = ctx->netdev;
+	return true;
+}
+
+SEC("fentry/veth_devtx_submit_skb")
+int BPF_PROG(tx_submit, const struct devtx_ctx *devtx, struct sk_buff *skb)
+{
+	int udpoff = sizeof(struct ethhdr) + sizeof(struct iphdr);
+	struct xdp_tx_meta meta = {};
+	int ret;
+
+	if (!my_netdev(devtx))
+		return 0;
+	if (devtx->sinfo->meta_len != TX_META_LEN)
+		return 0;
+
+	bpf_probe_read_kernel(&meta, sizeof(meta), skb->data - TX_META_LEN);
+	if (!meta.request_timestamp)
+		return 0;
+
+	ret = verify_frame(skb, devtx->sinfo);
+	if (ret < 0) {
+		__sync_add_and_fetch(&pkts_fail_tx, 1);
+		return 0;
+	}
+
+	ret = bpf_devtx_request_tx_timestamp(devtx);
+	if (ret < 0) {
+		__sync_add_and_fetch(&pkts_fail_tx, 1);
+		return 0;
+	}
+
+	ret = bpf_devtx_request_l4_csum(devtx, udpoff, udpoff + offsetof(struct udphdr, check));
+	if (ret < 0) {
+		__sync_add_and_fetch(&pkts_fail_tx, 1);
+		return 0;
+	}
+
+	return 0;
+}
+
+SEC("fentry/veth_devtx_complete_skb")
+int BPF_PROG(tx_complete, const struct devtx_ctx *devtx, struct sk_buff *skb)
+{
+	struct xdp_tx_meta meta = {};
+	struct devtx_sample *sample;
+	struct udphdr udph;
+	int ret;
+
+	if (!my_netdev(devtx))
+		return 0;
+	if (devtx->sinfo->meta_len != TX_META_LEN)
+		return 0;
+
+	bpf_probe_read_kernel(&meta, sizeof(meta), skb->data - TX_META_LEN);
+	if (!meta.request_timestamp)
+		return 0;
+
+	ret = verify_frame(skb, devtx->sinfo);
+	if (ret < 0) {
+		__sync_add_and_fetch(&pkts_fail_tx, 1);
+		return 0;
+	}
+
+	sample = bpf_ringbuf_reserve(&tx_compl_buf, sizeof(*sample), 0);
+	if (!sample)
+		return 0;
+
+	bpf_probe_read_kernel(&udph, sizeof(udph),
+			      skb->data + sizeof(struct ethhdr) + sizeof(struct iphdr));
+
+	sample->timestamp_retval = bpf_devtx_tx_timestamp(devtx, &sample->hw_timestamp);
+	sample->tx_csum = udph.check;
+
+	bpf_ringbuf_submit(sample, 0);
+
+	return 0;
+}
+
 char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/xdp_metadata.h b/tools/testing/selftests/bpf/xdp_metadata.h
index 938a729bd307..ecab5404f1d6 100644
--- a/tools/testing/selftests/bpf/xdp_metadata.h
+++ b/tools/testing/selftests/bpf/xdp_metadata.h
@@ -18,3 +18,19 @@ struct xdp_meta {
 		__s32 rx_hash_err;
 	};
 };
+
+struct devtx_sample {
+	int timestamp_retval;
+	__u64 hw_timestamp;
+	__u64 sw_complete_timestamp;
+	__u16 tx_csum;
+};
+
+#define TX_META_LEN	8
+
+struct xdp_tx_meta {
+	__u8 request_timestamp;
+	__u8 padding0;
+	__u16 padding1;
+	__u32 padding2;
+};
-- 
2.41.0.255.g8b1d071c50-goog



* [xdp-hints] [RFC bpf-next v3 14/14] selftests/bpf: Extend xdp_hw_metadata with devtx kfuncs
  2023-07-07 19:29 [xdp-hints] [RFC bpf-next v3 00/14] bpf: Netdev TX metadata Stanislav Fomichev
                   ` (12 preceding siblings ...)
  2023-07-07 19:30 ` [xdp-hints] [RFC bpf-next v3 13/14] selftests/bpf: Extend xdp_metadata with devtx kfuncs Stanislav Fomichev
@ 2023-07-07 19:30 ` Stanislav Fomichev
  13 siblings, 0 replies; 34+ messages in thread
From: Stanislav Fomichev @ 2023-07-07 19:30 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, kuba, toke, willemb, dsahern,
	magnus.karlsson, bjorn, maciej.fijalkowski, hawk, netdev,
	xdp-hints

When we get a packet on port 9091, we swap src/dst and send it back
out. At this point, we also request the TX timestamp and plumb it back
to userspace, where it is simply printed.

Also print the current UDP checksum, rewrite it with the pseudo-header
checksum, and offload the TX checksum calculation to devtx. Upon
completion, report the TX checksum back (mlx5 doesn't put it back, so
I've used tcpdump to confirm that the checksum is correct).

Some other related changes:
- switched to zero-copy mode by default; a new flag can be used to
  force the old copy behavior
- request fixed TX_META_LEN headroom
- some other small fixes (umem size, fill idx+i, etc)

mvbz3:~# ./xdp_hw_metadata eth3 -c mlx5e_devtx_complete_xdp -s mlx5e_devtx_submit_xd
attach rx bpf program...
poll: 0 (0) skip=1/0 redir=0 ring_full=0 rx_fail=0 tx_fail=0 l4_csum_fail=0
...
xsk_ring_cons__peek: 1
0xeb4cb8: rx_desc[0]->addr=80100 addr=80100 comp_addr=80100
rx_hash: 0x6A1A897A with RSS type:0x2A
rx_timestamp:  1688749963628930772 (sec:1688749963.6289)
XDP RX-time:   1688749963901850574 (sec:1688749963.9019) delta sec:0.2729 (272919.802 usec)
AF_XDP time:   1688749963901967812 (sec:1688749963.9020) delta sec:0.0001 (117.238 usec)
0xeb4cb8: ping_pong with csum=8e3b (want 4b0b)
0xeb4cb8: complete tx idx=0 addr=8
got tx sample: tx_err 0 hw_timestamp 1688749963859814790 sw_timestamp 1688749963902297286 csum 8e3b
0xeb4cb8: complete rx idx=128 addr=80100
poll: 0 (0) skip=7/0 redir=1 ring_full=0 rx_fail=0 tx_fail=0 l4_csum_fail=0

mvbz4:~# nc  -Nu -q1 ${MVBZ3_LINK_LOCAL_IP}%eth3 9091

mvbz4:~# tcpdump -vvx -i eth3 udp
tcpdump: listening on eth3, link-type EN10MB (Ethernet), snapshot length 262144 bytes
10:12:43.901436 IP6 (flowlabel 0x7a5d2, hlim 127, next-header UDP (17) payload length: 11) fe80::1270:fdff:fe48:1087.44339 > fe80::1270:fdff:fe48:1077.9091: [bad udp cksum 0x3b8e -> 0x0b4b!] UDP, length 3
        0x0000:  6007 a5d2 000b 117f fe80 0000 0000 0000
        0x0010:  1270 fdff fe48 1087 fe80 0000 0000 0000
        0x0020:  1270 fdff fe48 1077 ad33 2383 000b 3b8e
        0x0030:  7864 70
10:12:43.902125 IP6 (flowlabel 0x7a5d2, hlim 127, next-header UDP (17) payload length: 11) fe80::1270:fdff:fe48:1077.9091 > fe80::1270:fdff:fe48:1087.44339: [udp sum ok] UDP, length 3
        0x0000:  6007 a5d2 000b 117f fe80 0000 0000 0000
        0x0010:  1270 fdff fe48 1077 fe80 0000 0000 0000
        0x0020:  1270 fdff fe48 1087 2383 ad33 000b 0b4b
        0x0030:  7864 70

Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 .../selftests/bpf/progs/xdp_hw_metadata.c     | 173 +++++++++++
 tools/testing/selftests/bpf/xdp_hw_metadata.c | 269 +++++++++++++++++-
 2 files changed, 427 insertions(+), 15 deletions(-)

diff --git a/tools/testing/selftests/bpf/progs/xdp_hw_metadata.c b/tools/testing/selftests/bpf/progs/xdp_hw_metadata.c
index b2dfd7066c6e..2049bfa70ea9 100644
--- a/tools/testing/selftests/bpf/progs/xdp_hw_metadata.c
+++ b/tools/testing/selftests/bpf/progs/xdp_hw_metadata.c
@@ -4,6 +4,7 @@
 #include "xdp_metadata.h"
 #include <bpf/bpf_helpers.h>
 #include <bpf/bpf_endian.h>
+#include <bpf/bpf_tracing.h>
 
 struct {
 	__uint(type, BPF_MAP_TYPE_XSKMAP);
@@ -12,14 +13,30 @@ struct {
 	__type(value, __u32);
 } xsk SEC(".maps");
 
+struct {
+	__uint(type, BPF_MAP_TYPE_RINGBUF);
+	__uint(max_entries, 4096);
+} tx_compl_buf SEC(".maps");
+
 __u64 pkts_skip = 0;
+__u64 pkts_tx_skip = 0;
 __u64 pkts_fail = 0;
 __u64 pkts_redir = 0;
+__u64 pkts_fail_tx = 0;
+__u64 pkts_fail_l4_csum = 0;
+__u64 pkts_ringbuf_full = 0;
+
+int ifindex = -1;
+__u64 net_cookie = -1;
 
 extern int bpf_xdp_metadata_rx_timestamp(const struct xdp_md *ctx,
 					 __u64 *timestamp) __ksym;
 extern int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx, __u32 *hash,
 				    enum xdp_rss_hash_type *rss_type) __ksym;
+extern int bpf_devtx_request_tx_timestamp(const struct devtx_ctx *ctx) __ksym;
+extern int bpf_devtx_tx_timestamp(const struct devtx_ctx *ctx, __u64 *timestamp) __ksym;
+extern int bpf_devtx_request_l4_csum(const struct devtx_ctx *ctx,
+				     u16 csum_start, u16 csum_offset) __ksym;
 
 SEC("xdp")
 int rx(struct xdp_md *ctx)
@@ -90,4 +107,160 @@ int rx(struct xdp_md *ctx)
 	return bpf_redirect_map(&xsk, ctx->rx_queue_index, XDP_PASS);
 }
 
+/* This is not strictly required; only to showcase how to access the payload. */
+static __always_inline bool tx_filter(const struct devtx_ctx *devtx,
+				      const void *data, __be16 *proto)
+{
+	int port_offset = sizeof(struct ethhdr) + offsetof(struct udphdr, source);
+	struct ethhdr eth = {};
+	struct udphdr udp = {};
+
+	bpf_probe_read_kernel(&eth.h_proto, sizeof(eth.h_proto),
+			      data + offsetof(struct ethhdr, h_proto));
+
+	*proto = eth.h_proto;
+	if (eth.h_proto == bpf_htons(ETH_P_IP)) {
+		port_offset += sizeof(struct iphdr);
+	} else if (eth.h_proto == bpf_htons(ETH_P_IPV6)) {
+		port_offset += sizeof(struct ipv6hdr);
+	} else {
+		__sync_add_and_fetch(&pkts_tx_skip, 1);
+		return false;
+	}
+
+	bpf_probe_read_kernel(&udp.source, sizeof(udp.source), data + port_offset);
+
+	/* Replies to UDP:9091 */
+	if (udp.source != bpf_htons(9091)) {
+		__sync_add_and_fetch(&pkts_tx_skip, 1);
+		return false;
+	}
+
+	return true;
+}
+
+static inline bool my_netdev(const struct devtx_ctx *devtx)
+{
+	static struct net_device *netdev;
+
+	if (netdev)
+		return netdev == devtx->netdev;
+
+	if (devtx->netdev->ifindex != ifindex)
+		return false;
+	if (devtx->netdev->nd_net.net->net_cookie != net_cookie)
+		return false;
+
+	netdev = devtx->netdev;
+	return true;
+}
+
+static inline int udpoff(__be16 proto)
+{
+	if (proto == bpf_htons(ETH_P_IP))
+		return sizeof(struct ethhdr) + sizeof(struct iphdr);
+	else if (proto == bpf_htons(ETH_P_IPV6))
+		return sizeof(struct ethhdr) + sizeof(struct ipv6hdr);
+	else
+		return 0;
+}
+
+static inline int tx_submit(const struct devtx_ctx *devtx, const void *data, u8 meta_len)
+{
+	struct xdp_tx_meta meta = {};
+	__be16 proto = 0;
+	int off, ret;
+
+	if (!my_netdev(devtx))
+		return 0;
+	if (meta_len != TX_META_LEN)
+		return 0;
+
+	bpf_probe_read_kernel(&meta, sizeof(meta), data - TX_META_LEN);
+	if (!meta.request_timestamp)
+		return 0;
+
+	if (!tx_filter(devtx, data, &proto))
+		return 0;
+
+	ret = bpf_devtx_request_tx_timestamp(devtx);
+	if (ret < 0)
+		__sync_add_and_fetch(&pkts_fail_tx, 1);
+
+	off = udpoff(proto);
+	if (!off)
+		return 0;
+
+	ret = bpf_devtx_request_l4_csum(devtx, off, off + offsetof(struct udphdr, check));
+	if (ret < 0)
+		__sync_add_and_fetch(&pkts_fail_l4_csum, 1);
+
+	return 0;
+}
+
+SEC("?fentry")
+int BPF_PROG(tx_submit_xdp, const struct devtx_ctx *devtx, const struct xdp_frame *xdpf)
+{
+	return tx_submit(devtx, xdpf->data, xdpf->metasize);
+}
+
+SEC("?fentry")
+int BPF_PROG(tx_submit_skb, const struct devtx_ctx *devtx, const struct sk_buff *skb)
+{
+	return tx_submit(devtx, skb->data, devtx->sinfo->meta_len);
+}
+
+static inline int tx_complete(const struct devtx_ctx *devtx, const void *data, u8 meta_len)
+{
+	struct xdp_tx_meta meta = {};
+	struct devtx_sample *sample;
+	struct udphdr udph;
+	__be16 proto = 0;
+	int off;
+
+	if (!my_netdev(devtx))
+		return 0;
+	if (meta_len != TX_META_LEN)
+		return 0;
+
+	bpf_probe_read_kernel(&meta, sizeof(meta), data - TX_META_LEN);
+	if (!meta.request_timestamp)
+		return 0;
+
+	if (!tx_filter(devtx, data, &proto))
+		return 0;
+
+	off = udpoff(proto);
+	if (!off)
+		return 0;
+
+	bpf_probe_read_kernel(&udph, sizeof(udph), data + off);
+
+	sample = bpf_ringbuf_reserve(&tx_compl_buf, sizeof(*sample), 0);
+	if (!sample) {
+		__sync_add_and_fetch(&pkts_ringbuf_full, 1);
+		return 0;
+	}
+
+	sample->timestamp_retval = bpf_devtx_tx_timestamp(devtx, &sample->hw_timestamp);
+	sample->sw_complete_timestamp = bpf_ktime_get_tai_ns();
+	sample->tx_csum = udph.check;
+
+	bpf_ringbuf_submit(sample, 0);
+
+	return 0;
+}
+
+SEC("?fentry")
+int BPF_PROG(tx_complete_xdp, const struct devtx_ctx *devtx, const struct xdp_frame *xdpf)
+{
+	return tx_complete(devtx, xdpf->data, xdpf->metasize);
+}
+
+SEC("?fentry")
+int BPF_PROG(tx_complete_skb, const struct devtx_ctx *devtx, const struct sk_buff *skb)
+{
+	return tx_complete(devtx, skb->data, devtx->sinfo->meta_len);
+}
+
 char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/xdp_hw_metadata.c b/tools/testing/selftests/bpf/xdp_hw_metadata.c
index 613321eb84c1..3f9c47ad5cfa 100644
--- a/tools/testing/selftests/bpf/xdp_hw_metadata.c
+++ b/tools/testing/selftests/bpf/xdp_hw_metadata.c
@@ -10,7 +10,9 @@
  *   - rx_hash
  *
  * TX:
- * - TBD
+ * - UDP 9091 packets trigger TX reply
+ * - TX HW timestamp is requested and reported back upon completion
+ * - TX checksum is requested
  */
 
 #include <test_progs.h>
@@ -28,10 +30,12 @@
 #include <net/if.h>
 #include <poll.h>
 #include <time.h>
+#include <unistd.h>
+#include <libgen.h>
 
 #include "xdp_metadata.h"
 
-#define UMEM_NUM 16
+#define UMEM_NUM 256
 #define UMEM_FRAME_SIZE XSK_UMEM__DEFAULT_FRAME_SIZE
 #define UMEM_SIZE (UMEM_FRAME_SIZE * UMEM_NUM)
 #define XDP_FLAGS (XDP_FLAGS_DRV_MODE | XDP_FLAGS_REPLACE)
@@ -49,24 +53,27 @@ struct xsk {
 struct xdp_hw_metadata *bpf_obj;
 struct xsk *rx_xsk;
 const char *ifname;
+char *tx_complete;
+char *tx_submit;
 int ifindex;
 int rxq;
 
 void test__fail(void) { /* for network_helpers.c */ }
 
-static int open_xsk(int ifindex, struct xsk *xsk, __u32 queue_id)
+static int open_xsk(int ifindex, struct xsk *xsk, __u32 queue_id, int flags)
 {
 	int mmap_flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE;
 	const struct xsk_socket_config socket_config = {
 		.rx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
 		.tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
-		.bind_flags = XDP_COPY,
+		.bind_flags = flags,
+		.tx_metadata_len = TX_META_LEN,
 	};
 	const struct xsk_umem_config umem_config = {
 		.fill_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
 		.comp_size = XSK_RING_CONS__DEFAULT_NUM_DESCS,
 		.frame_size = XSK_UMEM__DEFAULT_FRAME_SIZE,
-		.flags = XDP_UMEM_UNALIGNED_CHUNK_FLAG,
+		.flags = XSK_UMEM__DEFAULT_FLAGS,
 	};
 	__u32 idx;
 	u64 addr;
@@ -108,7 +115,7 @@ static int open_xsk(int ifindex, struct xsk *xsk, __u32 queue_id)
 	for (i = 0; i < UMEM_NUM / 2; i++) {
 		addr = (UMEM_NUM / 2 + i) * UMEM_FRAME_SIZE;
 		printf("%p: rx_desc[%d] -> %lx\n", xsk, i, addr);
-		*xsk_ring_prod__fill_addr(&xsk->fill, i) = addr;
+		*xsk_ring_prod__fill_addr(&xsk->fill, idx + i) = addr;
 	}
 	xsk_ring_prod__submit(&xsk->fill, ret);
 
@@ -129,12 +136,22 @@ static void refill_rx(struct xsk *xsk, __u64 addr)
 	__u32 idx;
 
 	if (xsk_ring_prod__reserve(&xsk->fill, 1, &idx) == 1) {
-		printf("%p: complete idx=%u addr=%llx\n", xsk, idx, addr);
+		printf("%p: complete rx idx=%u addr=%llx\n", xsk, idx, addr);
 		*xsk_ring_prod__fill_addr(&xsk->fill, idx) = addr;
 		xsk_ring_prod__submit(&xsk->fill, 1);
 	}
 }
 
+static int kick_tx(struct xsk *xsk)
+{
+	return sendto(xsk_socket__fd(xsk->socket), NULL, 0, MSG_DONTWAIT, NULL, 0);
+}
+
+static int kick_rx(struct xsk *xsk)
+{
+	return recvfrom(xsk_socket__fd(xsk->socket), NULL, 0, MSG_DONTWAIT, NULL, NULL);
+}
+
 #define NANOSEC_PER_SEC 1000000000 /* 10^9 */
 static __u64 gettime(clockid_t clock_id)
 {
@@ -228,7 +245,102 @@ static void verify_skb_metadata(int fd)
 	printf("skb hwtstamp is not found!\n");
 }
 
-static int verify_metadata(struct xsk *rx_xsk, int rxq, int server_fd, clockid_t clock_id)
+static bool complete_tx(struct xsk *xsk, struct ring_buffer *ringbuf)
+{
+	__u32 idx;
+	__u64 addr;
+
+	if (!xsk_ring_cons__peek(&xsk->comp, 1, &idx))
+		return false;
+
+	addr = *xsk_ring_cons__comp_addr(&xsk->comp, idx);
+
+	printf("%p: complete tx idx=%u addr=%llx\n", xsk, idx, addr);
+	xsk_ring_cons__release(&xsk->comp, 1);
+
+	return true;
+}
+
+#define swap(a, b, len) do { \
+	for (int i = 0; i < len; i++) { \
+		__u8 tmp = ((__u8 *)a)[i]; \
+		((__u8 *)a)[i] = ((__u8 *)b)[i]; \
+		((__u8 *)b)[i] = tmp; \
+	} \
+} while (0)
+
+static void ping_pong(struct xsk *xsk, void *rx_packet)
+{
+	struct ipv6hdr *ip6h = NULL;
+	struct iphdr *iph = NULL;
+	struct xdp_tx_meta *meta;
+	struct xdp_desc *tx_desc;
+	struct udphdr *udph;
+	struct ethhdr *eth;
+	__sum16 want_csum;
+	void *data;
+	__u32 idx;
+	int ret;
+	int len;
+
+	ret = xsk_ring_prod__reserve(&xsk->tx, 1, &idx);
+	if (ret != 1) {
+		printf("%p: failed to reserve tx slot\n", xsk);
+		return;
+	}
+
+	tx_desc = xsk_ring_prod__tx_desc(&xsk->tx, idx);
+	tx_desc->addr = idx % (UMEM_NUM / 2) * UMEM_FRAME_SIZE + TX_META_LEN;
+	data = xsk_umem__get_data(xsk->umem_area, tx_desc->addr);
+
+	meta = data - TX_META_LEN;
+	meta->request_timestamp = 1;
+
+	eth = rx_packet;
+
+	if (eth->h_proto == htons(ETH_P_IP)) {
+		iph = (void *)(eth + 1);
+		udph = (void *)(iph + 1);
+	} else if (eth->h_proto == htons(ETH_P_IPV6)) {
+		ip6h = (void *)(eth + 1);
+		udph = (void *)(ip6h + 1);
+	} else {
+		printf("%p: failed to detect IP version for ping pong %04x\n", xsk, eth->h_proto);
+		xsk_ring_prod__cancel(&xsk->tx, 1);
+		return;
+	}
+
+	len = ETH_HLEN;
+	if (ip6h)
+		len += sizeof(*ip6h) + ntohs(ip6h->payload_len);
+	if (iph)
+		len += ntohs(iph->tot_len);
+
+	swap(eth->h_dest, eth->h_source, ETH_ALEN);
+	if (iph)
+		swap(&iph->saddr, &iph->daddr, 4);
+	else
+		swap(&ip6h->saddr, &ip6h->daddr, 16);
+	swap(&udph->source, &udph->dest, 2);
+
+	want_csum = udph->check;
+	if (iph)
+		udph->check = ~csum_tcpudp_magic(iph->saddr, iph->daddr,
+						 ntohs(udph->len), IPPROTO_UDP, 0);
+	else
+		udph->check = ~csum_ipv6_magic(&ip6h->saddr, &ip6h->daddr,
+					       ntohs(udph->len), IPPROTO_UDP, 0);
+
+	printf("%p: ping-pong with csum=%04x (want %04x)\n", xsk, udph->check, want_csum);
+
+	memcpy(data, rx_packet, len); /* don't share umem chunk for simplicity */
+	tx_desc->len = len;
+
+	xsk_ring_prod__submit(&xsk->tx, 1);
+}
+
+static int verify_metadata(struct xsk *rx_xsk, int rxq, int server_fd, clockid_t clock_id,
+			   struct ring_buffer *ringbuf)
 {
 	const struct xdp_desc *rx_desc;
 	struct pollfd fds[rxq + 1];
@@ -250,10 +362,22 @@ static int verify_metadata(struct xsk *rx_xsk, int rxq, int server_fd, clockid_t
 
 	while (true) {
 		errno = 0;
+
+		for (i = 0; i < rxq; i++) {
+			ret = kick_rx(&rx_xsk[i]);
+			if (ret)
+				printf("kick_rx ret=%d\n", ret);
+		}
+
 		ret = poll(fds, rxq + 1, 1000);
-		printf("poll: %d (%d) skip=%llu fail=%llu redir=%llu\n",
+		printf("poll: %d (%d) skip=%llu/%llu redir=%llu ring_full=%llu rx_fail=%llu tx_fail=%llu l4_csum_fail=%llu\n",
 		       ret, errno, bpf_obj->bss->pkts_skip,
-		       bpf_obj->bss->pkts_fail, bpf_obj->bss->pkts_redir);
+		       bpf_obj->bss->pkts_tx_skip,
+		       bpf_obj->bss->pkts_redir,
+		       bpf_obj->bss->pkts_ringbuf_full,
+		       bpf_obj->bss->pkts_fail,
+		       bpf_obj->bss->pkts_fail_tx,
+		       bpf_obj->bss->pkts_fail_l4_csum);
 		if (ret < 0)
 			break;
 		if (ret == 0)
@@ -280,6 +404,24 @@ static int verify_metadata(struct xsk *rx_xsk, int rxq, int server_fd, clockid_t
 			       xsk, idx, rx_desc->addr, addr, comp_addr);
 			verify_xdp_metadata(xsk_umem__get_data(xsk->umem_area, addr),
 					    clock_id);
+
+			if (tx_submit && tx_complete) {
+				/* mirror the packet back */
+				ping_pong(xsk, xsk_umem__get_data(xsk->umem_area, addr));
+
+				ret = kick_tx(xsk);
+				if (ret)
+					printf("kick_tx ret=%d\n", ret);
+
+				for (int j = 0; j < 500; j++) {
+					if (complete_tx(xsk, ringbuf))
+						break;
+					usleep(10*1000);
+				}
+
+				ring_buffer__poll(ringbuf, 1000);
+			}
+
 			xsk_ring_cons__release(&xsk->rx, 1);
 			refill_rx(xsk, comp_addr);
 		}
@@ -404,21 +546,69 @@ static void timestamping_enable(int fd, int val)
 		error(1, errno, "setsockopt(SO_TIMESTAMPING)");
 }
 
+static int process_sample(void *ctx, void *data, size_t len)
+{
+	struct devtx_sample *sample = data;
+
+	printf("got tx sample: tx_err %u hw_timestamp %llu sw_timestamp %llu csum %04x\n",
+	       sample->timestamp_retval, sample->hw_timestamp,
+	       sample->sw_complete_timestamp,
+	       sample->tx_csum);
+
+	return 0;
+}
+
+static void usage(const char *prog)
+{
+	fprintf(stderr,
+		"usage: %s [OPTS] <ifname>\n"
+		"OPTS:\n"
+		"    -s    symbol name for tx_submit\n"
+		"    -c    symbol name for tx_complete\n"
+		"    -Z    run in copy mode\n",
+		prog);
+}
+
 int main(int argc, char *argv[])
 {
+	int bind_flags =  XDP_USE_NEED_WAKEUP | XDP_ZEROCOPY;
+	struct ring_buffer *tx_compl_ringbuf = NULL;
 	clockid_t clock_id = CLOCK_TAI;
 	int server_fd = -1;
+	int opt;
 	int ret;
 	int i;
 
 	struct bpf_program *prog;
 
-	if (argc != 2) {
+	while ((opt = getopt(argc, argv, "s:c:Z")) != -1) {
+		switch (opt) {
+		case 's':
+			tx_submit = optarg;
+			break;
+		case 'c':
+			tx_complete = optarg;
+			break;
+		case 'Z':
+			bind_flags = XDP_USE_NEED_WAKEUP | XDP_COPY;
+			break;
+		default:
+			usage(basename(argv[0]));
+			return 1;
+		}
+	}
+
+	if (argc < 2) {
 		fprintf(stderr, "pass device name\n");
 		return -1;
 	}
 
-	ifname = argv[1];
+	if (optind >= argc) {
+		usage(basename(argv[0]));
+		return 1;
+	}
+
+	ifname = argv[optind];
 	ifindex = if_nametoindex(ifname);
 	rxq = rxq_num(ifname);
 
@@ -432,7 +622,7 @@ int main(int argc, char *argv[])
 
 	for (i = 0; i < rxq; i++) {
 		printf("open_xsk(%s, %p, %d)\n", ifname, &rx_xsk[i], i);
-		ret = open_xsk(ifindex, &rx_xsk[i], i);
+		ret = open_xsk(ifindex, &rx_xsk[i], i, bind_flags);
 		if (ret)
 			error(1, -ret, "open_xsk");
 
@@ -444,15 +634,64 @@ int main(int argc, char *argv[])
 	if (libbpf_get_error(bpf_obj))
 		error(1, libbpf_get_error(bpf_obj), "xdp_hw_metadata__open");
 
+	bpf_obj->data->ifindex = ifindex;
+	bpf_obj->data->net_cookie = get_net_cookie();
+
 	prog = bpf_object__find_program_by_name(bpf_obj->obj, "rx");
 	bpf_program__set_ifindex(prog, ifindex);
 	bpf_program__set_flags(prog, BPF_F_XDP_DEV_BOUND_ONLY);
 
+	prog = bpf_object__find_program_by_name(bpf_obj->obj,
+						bind_flags & XDP_COPY ?
+						"tx_submit_skb" :
+						"tx_submit_xdp");
+	bpf_program__set_ifindex(prog, ifindex);
+	bpf_program__set_flags(prog, BPF_F_XDP_DEV_BOUND_ONLY);
+	if (tx_submit) {
+		printf("attaching devtx submit program to %s\n", tx_submit);
+		ret = bpf_program__set_attach_target(prog, 0, tx_submit);
+		if (ret)
+			printf("failed to attach submit program to %s, ret=%d\n",
+			       tx_submit, ret);
+		bpf_program__set_autoattach(prog, true);
+		bpf_program__set_autoload(prog, true);
+	} else {
+		printf("skipping devtx submit program\n");
+	}
+
+	prog = bpf_object__find_program_by_name(bpf_obj->obj,
+						bind_flags & XDP_COPY ?
+						"tx_complete_skb" :
+						"tx_complete_xdp");
+	bpf_program__set_ifindex(prog, ifindex);
+	bpf_program__set_flags(prog, BPF_F_XDP_DEV_BOUND_ONLY);
+	if (tx_complete) {
+		printf("attaching devtx complete program to %s\n", tx_complete);
+		ret = bpf_program__set_attach_target(prog, 0, tx_complete);
+		if (ret)
+			printf("failed to attach complete program to %s, ret=%d\n",
+			       tx_complete, ret);
+		bpf_program__set_autoattach(prog, true);
+		bpf_program__set_autoload(prog, true);
+	} else {
+		printf("skipping devtx complete program\n");
+	}
+
 	printf("load bpf program...\n");
 	ret = xdp_hw_metadata__load(bpf_obj);
 	if (ret)
 		error(1, -ret, "xdp_hw_metadata__load");
 
+	printf("attach devtx bpf programs...\n");
+	ret = xdp_hw_metadata__attach(bpf_obj);
+	if (ret)
+		error(1, -ret, "xdp_hw_metadata__attach");
+
+	tx_compl_ringbuf = ring_buffer__new(bpf_map__fd(bpf_obj->maps.tx_compl_buf),
+					    process_sample, NULL, NULL);
+	if (libbpf_get_error(tx_compl_ringbuf))
+		error(1, -libbpf_get_error(tx_compl_ringbuf), "ring_buffer__new");
+
 	printf("prepare skb endpoint...\n");
 	server_fd = start_server(AF_INET6, SOCK_DGRAM, NULL, 9092, 1000);
 	if (server_fd < 0)
@@ -472,7 +711,7 @@ int main(int argc, char *argv[])
 			error(1, -ret, "bpf_map_update_elem");
 	}
 
-	printf("attach bpf program...\n");
+	printf("attach rx bpf program...\n");
 	ret = bpf_xdp_attach(ifindex,
 			     bpf_program__fd(bpf_obj->progs.rx),
 			     XDP_FLAGS, NULL);
@@ -480,7 +719,7 @@ int main(int argc, char *argv[])
 		error(1, -ret, "bpf_xdp_attach");
 
 	signal(SIGINT, handle_signal);
-	ret = verify_metadata(rx_xsk, rxq, server_fd, clock_id);
+	ret = verify_metadata(rx_xsk, rxq, server_fd, clock_id, tx_compl_ringbuf);
 	close(server_fd);
 	cleanup();
 	if (ret)
-- 
2.41.0.255.g8b1d071c50-goog


^ permalink raw reply	[flat|nested] 34+ messages in thread

* [xdp-hints] Re: [RFC bpf-next v3 09/14] net/mlx5e: Implement devtx kfuncs
  2023-07-07 19:30 ` [xdp-hints] [RFC bpf-next v3 09/14] net/mlx5e: Implement devtx kfuncs Stanislav Fomichev
@ 2023-07-11 22:56   ` Alexei Starovoitov
  2023-07-11 23:24     ` Stanislav Fomichev
  2023-07-12  0:32     ` Jakub Kicinski
  0 siblings, 2 replies; 34+ messages in thread
From: Alexei Starovoitov @ 2023-07-11 22:56 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, kuba, toke, willemb, dsahern,
	magnus.karlsson, bjorn, maciej.fijalkowski, hawk, netdev,
	xdp-hints

On Fri, Jul 07, 2023 at 12:30:01PM -0700, Stanislav Fomichev wrote:
> +
> +static int mlx5e_devtx_request_l4_checksum(const struct devtx_ctx *_ctx,
> +					   u16 csum_start, u16 csum_offset)
> +{
> +	const struct mlx5e_devtx_ctx *ctx = (void *)_ctx;
> +	struct mlx5_wqe_eth_seg *eseg;
> +
> +	if (unlikely(!ctx->wqe))
> +		return -ENODATA;
> +
> +	eseg = &ctx->wqe->eth;
> +
> +	switch (csum_offset) {
> +	case sizeof(struct ethhdr) + sizeof(struct iphdr) + offsetof(struct udphdr, check):
> +	case sizeof(struct ethhdr) + sizeof(struct ipv6hdr) + offsetof(struct udphdr, check):
> +		/* Looks like HW/FW is doing parsing, so offsets are largely ignored. */
> +		eseg->cs_flags = MLX5_ETH_WQE_L3_CSUM | MLX5_ETH_WQE_L4_CSUM;
> +		break;
> +	default:
> +		return -EINVAL;
> +	}

I think this proves my point: csum is not generalizable even across veth and mlx5.
Above is a square peg that tries to fit csum_start/offset api (that makes sense from SW pov)
into HW that has different ideas about csum-ing.

Here is what mlx5 does:
mlx5e_txwqe_build_eseg_csum(struct mlx5e_txqsq *sq, struct sk_buff *skb,
                            struct mlx5e_accel_tx_state *accel,
                            struct mlx5_wqe_eth_seg *eseg)
{
        if (unlikely(mlx5e_ipsec_txwqe_build_eseg_csum(sq, skb, eseg)))
                return;

        if (likely(skb->ip_summed == CHECKSUM_PARTIAL)) {
                eseg->cs_flags = MLX5_ETH_WQE_L3_CSUM;
                if (skb->encapsulation) {
                        eseg->cs_flags |= MLX5_ETH_WQE_L3_INNER_CSUM |
                                          MLX5_ETH_WQE_L4_INNER_CSUM;
                        sq->stats->csum_partial_inner++;
                } else {
                        eseg->cs_flags |= MLX5_ETH_WQE_L4_CSUM;
                        sq->stats->csum_partial++;
                }

How would you generalize that into csum api that will work across NICs ?

My answer stands: you cannot.

My proposal again:
add driver specific kfuncs and hooks for things like csum.

Kuba,
since you nacked driver specific stuff please suggest a way to unblock this stalemate.


* [xdp-hints] Re: [RFC bpf-next v3 09/14] net/mlx5e: Implement devtx kfuncs
  2023-07-11 22:56   ` [xdp-hints] " Alexei Starovoitov
@ 2023-07-11 23:24     ` Stanislav Fomichev
  2023-07-11 23:45       ` Alexei Starovoitov
  2023-07-12  0:32     ` Jakub Kicinski
  1 sibling, 1 reply; 34+ messages in thread
From: Stanislav Fomichev @ 2023-07-11 23:24 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, kuba, toke, willemb, dsahern,
	magnus.karlsson, bjorn, maciej.fijalkowski, hawk, netdev,
	xdp-hints

On Tue, Jul 11, 2023 at 3:57 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Fri, Jul 07, 2023 at 12:30:01PM -0700, Stanislav Fomichev wrote:
> > +
> > +static int mlx5e_devtx_request_l4_checksum(const struct devtx_ctx *_ctx,
> > +                                        u16 csum_start, u16 csum_offset)
> > +{
> > +     const struct mlx5e_devtx_ctx *ctx = (void *)_ctx;
> > +     struct mlx5_wqe_eth_seg *eseg;
> > +
> > +     if (unlikely(!ctx->wqe))
> > +             return -ENODATA;
> > +
> > +     eseg = &ctx->wqe->eth;
> > +
> > +     switch (csum_offset) {
> > +     case sizeof(struct ethhdr) + sizeof(struct iphdr) + offsetof(struct udphdr, check):
> > +     case sizeof(struct ethhdr) + sizeof(struct ipv6hdr) + offsetof(struct udphdr, check):
> > +             /* Looks like HW/FW is doing parsing, so offsets are largely ignored. */
> > +             eseg->cs_flags = MLX5_ETH_WQE_L3_CSUM | MLX5_ETH_WQE_L4_CSUM;
> > +             break;
> > +     default:
> > +             return -EINVAL;
> > +     }
>
> I think this proves my point: csum is not generalizable even across veth and mlx5.
> Above is a square peg that tries to fit csum_start/offset api (that makes sense from SW pov)
> into HW that has different ideas about csum-ing.
>
> Here is what mlx5 does:
> mlx5e_txwqe_build_eseg_csum(struct mlx5e_txqsq *sq, struct sk_buff *skb,
>                             struct mlx5e_accel_tx_state *accel,
>                             struct mlx5_wqe_eth_seg *eseg)
> {
>         if (unlikely(mlx5e_ipsec_txwqe_build_eseg_csum(sq, skb, eseg)))
>                 return;
>
>         if (likely(skb->ip_summed == CHECKSUM_PARTIAL)) {
>                 eseg->cs_flags = MLX5_ETH_WQE_L3_CSUM;
>                 if (skb->encapsulation) {
>                         eseg->cs_flags |= MLX5_ETH_WQE_L3_INNER_CSUM |
>                                           MLX5_ETH_WQE_L4_INNER_CSUM;
>                         sq->stats->csum_partial_inner++;
>                 } else {
>                         eseg->cs_flags |= MLX5_ETH_WQE_L4_CSUM;
>                         sq->stats->csum_partial++;
>                 }
>
> How would you generalize that into csum api that will work across NICs ?
>
> My answer stands: you cannot.
>
> My proposal again:
> add driver specifc kfuncs and hooks for things like csum.

I do see your point, but to also give you my perspective: I have no
clue what those _CSUM tx bits do (as a non-mlx employee). And what
kind of packets they support (initial patch doesn't give any info).
We can definitely expose mlx5 specific request_l4_checksum(bool encap)
which does things similar to mlx5e_txwqe_build_eseg_csum, but then,
what does it _actually_ do? It obviously can't checksum arbitrary
packet formats (because it has this inner/outer selection bit), so
there is really no way for me to provide a per-driver kfunc api. Maybe
the vendors can?

So having csum_start/csum_offset abstraction which works with fixed
offsets seems like at least it correctly sets the expectation for BPF
program writers.
The vendors are already supposed to conform to this start/offset API for skb.

But back to your point: should we maybe try to meet somewhere in the middle?
1. We try to provide "generic" offload kfuncs; for mlx5, we'll have
this mlx5e_devtx_request_l4_checksum which works for fixed offsets
2. We also let vendors do device-specific "extensions" where devices
deviate too much: bpf_request_RAW_mlx5e_l4_checksum(bool encap)
This can give BPF authors an opportunity to write somewhat portable
programs and also use vendor specific apis when/if needed.

I think we had a similar idea for rx side: have generic kfuncs, but
also let vendors experiment with custom kfuncs if they want to
differentiate.
WDYT? Can it give us the best things from both sides?

> Kuba,
> since you nacked driver specific stuff please suggest a way to unblock this stalemate.


* [xdp-hints] Re: [RFC bpf-next v3 09/14] net/mlx5e: Implement devtx kfuncs
  2023-07-11 23:24     ` Stanislav Fomichev
@ 2023-07-11 23:45       ` Alexei Starovoitov
  2023-07-12  0:14         ` Stanislav Fomichev
  0 siblings, 1 reply; 34+ messages in thread
From: Alexei Starovoitov @ 2023-07-11 23:45 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Jiri Olsa, Jakub Kicinski,
	Toke Høiland-Jørgensen, Willem de Bruijn, David Ahern,
	Karlsson, Magnus, Björn Töpel, Fijalkowski, Maciej,
	Jesper Dangaard Brouer, Network Development, xdp-hints

On Tue, Jul 11, 2023 at 4:25 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> On Tue, Jul 11, 2023 at 3:57 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Fri, Jul 07, 2023 at 12:30:01PM -0700, Stanislav Fomichev wrote:
> > > +
> > > +static int mlx5e_devtx_request_l4_checksum(const struct devtx_ctx *_ctx,
> > > +                                        u16 csum_start, u16 csum_offset)
> > > +{
> > > +     const struct mlx5e_devtx_ctx *ctx = (void *)_ctx;
> > > +     struct mlx5_wqe_eth_seg *eseg;
> > > +
> > > +     if (unlikely(!ctx->wqe))
> > > +             return -ENODATA;
> > > +
> > > +     eseg = &ctx->wqe->eth;
> > > +
> > > +     switch (csum_offset) {
> > > +     case sizeof(struct ethhdr) + sizeof(struct iphdr) + offsetof(struct udphdr, check):
> > > +     case sizeof(struct ethhdr) + sizeof(struct ipv6hdr) + offsetof(struct udphdr, check):
> > > +             /* Looks like HW/FW is doing parsing, so offsets are largely ignored. */
> > > +             eseg->cs_flags = MLX5_ETH_WQE_L3_CSUM | MLX5_ETH_WQE_L4_CSUM;
> > > +             break;
> > > +     default:
> > > +             return -EINVAL;
> > > +     }
> >
> > I think this proves my point: csum is not generalizable even across veth and mlx5.
> > Above is a square peg that tries to fit csum_start/offset api (that makes sense from SW pov)
> > into HW that has different ideas about csum-ing.
> >
> > Here is what mlx5 does:
> > mlx5e_txwqe_build_eseg_csum(struct mlx5e_txqsq *sq, struct sk_buff *skb,
> >                             struct mlx5e_accel_tx_state *accel,
> >                             struct mlx5_wqe_eth_seg *eseg)
> > {
> >         if (unlikely(mlx5e_ipsec_txwqe_build_eseg_csum(sq, skb, eseg)))
> >                 return;
> >
> >         if (likely(skb->ip_summed == CHECKSUM_PARTIAL)) {
> >                 eseg->cs_flags = MLX5_ETH_WQE_L3_CSUM;
> >                 if (skb->encapsulation) {
> >                         eseg->cs_flags |= MLX5_ETH_WQE_L3_INNER_CSUM |
> >                                           MLX5_ETH_WQE_L4_INNER_CSUM;
> >                         sq->stats->csum_partial_inner++;
> >                 } else {
> >                         eseg->cs_flags |= MLX5_ETH_WQE_L4_CSUM;
> >                         sq->stats->csum_partial++;
> >                 }
> >
> > How would you generalize that into csum api that will work across NICs ?
> >
> > My answer stands: you cannot.
> >
> > My proposal again:
> add driver specific kfuncs and hooks for things like csum.
>
> I do see your point, but to also give you my perspective: I have no
> clue what those _CSUM tx bits do (as a non-mlx employee). And what
> kind of packets they support (initial patch doesn't give any info).
> We can definitely expose mlx5 specific request_l4_checksum(bool encap)
> which does things similar to mlx5e_txwqe_build_eseg_csum, but then,
> what does it _actually_ do? It obviously can't checksum arbitrary
> packet formats (because it has this inner/outer selection bit), so
> there is really no way for me to provide a per-driver kfunc api. Maybe
> the vendors can?
>
> So having csum_start/csum_offset abstraction which works with fixed
> offsets seems like at least it correctly sets the expectation for BPF
> program writers.
> The vendors are already supposed to conform to this start/offset API for skb.
>
> But back to your point: should we maybe try to meet somewhere in the middle?
> 1. We try to provide "generic" offload kfuncs; for mlx5, we'll have
> this mlx5e_devtx_request_l4_checksum which works for fixed offsets

But it doesn't.
Even if you add a check for csum_start (that's missing in the patch)
there needs to be a way to somehow figure out
whether skb->encapsulation is true and set appropriate flags.
Otherwise this request csum will do "something" that only the HW vendor knows.
That would be even harder to debug for bpf prog writers.

So instead of helping bpf prog devs it will actively hurt them.

Another example: if a bpf prog was developed and tested on veth,
it will work for some values of csum_offset on real HW and return
-EINVAL for the other values.
Just a horrible user experience compared to the case where
the user knows that each netdev is potentially different and
_has_ to develop and test their prog on the given HW NIC and
not assume that the kernel can "do the right thing".
This csum exercise is a clear example that the kernel is not in a position
to do so.
For timestamp it's arguable, but for csum there is no generic api that
kernel can apply universally to NICs.

> 2. We also let vendors do device-specific "extensions" where devices
> deviate too much: bpf_request_RAW_mlx5e_l4_checksum(bool encap)
> This can give BPF authors opportunity to write somewhat portable
> programs and also use vendor specific apis when/if needed.
>
> I think we had a similar idea for rx side: have generic kfuncs, but
> also let vendors experiment with custom kfuncs if they want to
> differentiate.
> WDYT? Can it give us the best things from both sides?
>
> > Kuba,
> > since you nacked driver specific stuff please suggest a way to unblock this stalemate.


* [xdp-hints] Re: [RFC bpf-next v3 09/14] net/mlx5e: Implement devtx kfuncs
  2023-07-11 23:45       ` Alexei Starovoitov
@ 2023-07-12  0:14         ` Stanislav Fomichev
  2023-07-12  2:50           ` Alexei Starovoitov
  0 siblings, 1 reply; 34+ messages in thread
From: Stanislav Fomichev @ 2023-07-12  0:14 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Jiri Olsa, Jakub Kicinski,
	Toke Høiland-Jørgensen, Willem de Bruijn, David Ahern,
	Karlsson, Magnus, Björn Töpel, Fijalkowski, Maciej,
	Jesper Dangaard Brouer, Network Development, xdp-hints

On Tue, Jul 11, 2023 at 4:45 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Jul 11, 2023 at 4:25 PM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > On Tue, Jul 11, 2023 at 3:57 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Fri, Jul 07, 2023 at 12:30:01PM -0700, Stanislav Fomichev wrote:
> > > > +
> > > > +static int mlx5e_devtx_request_l4_checksum(const struct devtx_ctx *_ctx,
> > > > +                                        u16 csum_start, u16 csum_offset)
> > > > +{
> > > > +     const struct mlx5e_devtx_ctx *ctx = (void *)_ctx;
> > > > +     struct mlx5_wqe_eth_seg *eseg;
> > > > +
> > > > +     if (unlikely(!ctx->wqe))
> > > > +             return -ENODATA;
> > > > +
> > > > +     eseg = &ctx->wqe->eth;
> > > > +
> > > > +     switch (csum_offset) {
> > > > +     case sizeof(struct ethhdr) + sizeof(struct iphdr) + offsetof(struct udphdr, check):
> > > > +     case sizeof(struct ethhdr) + sizeof(struct ipv6hdr) + offsetof(struct udphdr, check):
> > > > +             /* Looks like HW/FW is doing parsing, so offsets are largely ignored. */
> > > > +             eseg->cs_flags = MLX5_ETH_WQE_L3_CSUM | MLX5_ETH_WQE_L4_CSUM;
> > > > +             break;
> > > > +     default:
> > > > +             return -EINVAL;
> > > > +     }
> > >
> > > I think this proves my point: csum is not generalizable even across veth and mlx5.
> > > Above is a square peg that tries to fit csum_start/offset api (that makes sense from SW pov)
> > > into HW that has different ideas about csum-ing.
> > >
> > > Here is what mlx5 does:
> > > mlx5e_txwqe_build_eseg_csum(struct mlx5e_txqsq *sq, struct sk_buff *skb,
> > >                             struct mlx5e_accel_tx_state *accel,
> > >                             struct mlx5_wqe_eth_seg *eseg)
> > > {
> > >         if (unlikely(mlx5e_ipsec_txwqe_build_eseg_csum(sq, skb, eseg)))
> > >                 return;
> > >
> > >         if (likely(skb->ip_summed == CHECKSUM_PARTIAL)) {
> > >                 eseg->cs_flags = MLX5_ETH_WQE_L3_CSUM;
> > >                 if (skb->encapsulation) {
> > >                         eseg->cs_flags |= MLX5_ETH_WQE_L3_INNER_CSUM |
> > >                                           MLX5_ETH_WQE_L4_INNER_CSUM;
> > >                         sq->stats->csum_partial_inner++;
> > >                 } else {
> > >                         eseg->cs_flags |= MLX5_ETH_WQE_L4_CSUM;
> > >                         sq->stats->csum_partial++;
> > >                 }
> > >
> > > How would you generalize that into csum api that will work across NICs ?
> > >
> > > My answer stands: you cannot.
> > >
> > > My proposal again:
> > add driver specific kfuncs and hooks for things like csum.
> >
> > I do see your point, but to also give you my perspective: I have no
> > clue what those _CSUM tx bits do (as a non-mlx employee). And what
> > kind of packets they support (initial patch doesn't give any info).
> > We can definitely expose mlx5 specific request_l4_checksum(bool encap)
> > which does things similar to mlx5e_txwqe_build_eseg_csum, but then,
> > what does it _actually_ do? It obviously can't checksum arbitrary
> > packet formats (because it has this inner/outer selection bit), so
> > there is really no way for me to provide a per-driver kfunc api. Maybe
> > the vendors can?
> >
> > So having csum_start/csum_offset abstraction which works with fixed
> > offsets seems like at least it correctly sets the expectation for BPF
> > program writers.
> > The vendors are already supposed to conform to this start/offset API for skb.
> >
> > But back to your point: should we maybe try to meet somewhere in the middle?
> > 1. We try to provide "generic" offload kfuncs; for mlx5, we'll have
> > this mlx5e_devtx_request_l4_checksum which works for fixed offsets
>
> But it doesn't.
> Even if you add a check for csum_start (that's missing in the patch)
> there needs to be a way to somehow figure out
> whether skb->encapsulation is true and set appropriate flags.
> Otherwise this request csum will do "something" that only the HW vendor knows.
> That would be even harder to debug for bpf prog writers.
>
> So instead of helping bpf prog devs it will actively hurt them.

Can we make it more robust? The device can look at the payload (via the
descriptors, or an extra payload pointer in devtx_ctx) and verify
h_proto/nexthdr.
It won't be perfect, I agree, but we can get it working for the common
cases (and have device-specific kfuncs for the rest).

> Another example: if a bpf prog was developed and tested on veth,
> it will work for some values of csum_offset on real HW and return
> -EINVAL for the other values.
> Just a horrible user experience compared to the case where
> the user knows that each netdev is potentially different and
> _has_ to develop and test their prog on the given HW NIC and
> not assume that the kernel can "do the right thing".

For this, I was actually thinking that we need to provide some
SW-based fallback mechanism.
Because if I have a program and a nic that doesn't have an offload
implemented at all, having a fallback might be useful:

if (bpf_devtx_request_l4_csum(...)) {
  /* oops, hw bailed on us, fallback to sw and expose a counter */
  bpf_devtx_l4_csum_slowpath(csum_start, csum_offset, data, len);
  pkt_sw_csum++;
}

This is probably needed regardless of which way we do it?

Regarding veth vs non-veth: we already have similar issues with
generic xdp vs non-generic.
I'm not sure we can completely avoid having surprises when switching
from sw to hw paths.
The question is whether users will have to debug 10-20% of their
program or start completely from scratch for every NIC.

> This csum exercise is a clear example that the kernel is not in a position
> to do so.
> For timestamp it's arguable, but for csum there is no generic api that
> kernel can apply universally to NICs.

Sure, I agree, it's a mix of both. For some offloads, we can have
something common, for some we can't.
But I'm not sure why we have to pick one or another. We can try to
have common apis (maybe not ideal, yes) and we can expose vendor
specific ones if there is a need.
If the generic ones get unused - we kill them in the future. If none
of the vendors comes up with non-generic ones - the generic ones are
good enough.

I'm assuming you favor non-generic ones because it's easier to implement?
But most of the netdev-bound infra is already there for rx, so there
is really not a lot of extra code to reuse it at tx. (it's two lines
to allow tracing progs to be dev-bound and to check whether devices
match at attach).
Or is there some other reason I'm missing?

> > 2. We also let vendors do device-specific "extensions" where devices
> > deviate too much: bpf_request_RAW_mlx5e_l4_checksum(bool encap)
> > This can give BPF authors opportunity to write somewhat portable
> > programs and also use vendor specific apis when/if needed.
> >
> > I think we had a similar idea for rx side: have generic kfuncs, but
> > also let vendors experiment with custom kfuncs if they want to
> > differentiate.
> > WDYT? Can it give us the best things from both sides?
> >
> > > Kuba,
> > > since you nacked driver specific stuff please suggest a way to unblock this stalemate.


* [xdp-hints] Re: [RFC bpf-next v3 09/14] net/mlx5e: Implement devtx kfuncs
  2023-07-11 22:56   ` [xdp-hints] " Alexei Starovoitov
  2023-07-11 23:24     ` Stanislav Fomichev
@ 2023-07-12  0:32     ` Jakub Kicinski
  2023-07-12  2:37       ` Alexei Starovoitov
  1 sibling, 1 reply; 34+ messages in thread
From: Jakub Kicinski @ 2023-07-12  0:32 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Stanislav Fomichev, bpf, ast, daniel, andrii, martin.lau, song,
	yhs, john.fastabend, kpsingh, haoluo, jolsa, toke, willemb,
	dsahern, magnus.karlsson, bjorn, maciej.fijalkowski, hawk,
	netdev, xdp-hints

On Tue, 11 Jul 2023 15:56:57 -0700 Alexei Starovoitov wrote:
> I think this proves my point: csum is not generalizable even across veth and mlx5.
> Above is a square peg that tries to fit csum_start/offset api (that makes sense from SW pov)
> into HW that has different ideas about csum-ing.
> 
> Here is what mlx5 does:
> mlx5e_txwqe_build_eseg_csum(struct mlx5e_txqsq *sq, struct sk_buff *skb,
>                             struct mlx5e_accel_tx_state *accel,
>                             struct mlx5_wqe_eth_seg *eseg)
> {
>         if (unlikely(mlx5e_ipsec_txwqe_build_eseg_csum(sq, skb, eseg)))
>                 return;
> 
>         if (likely(skb->ip_summed == CHECKSUM_PARTIAL)) {
>                 eseg->cs_flags = MLX5_ETH_WQE_L3_CSUM;
>                 if (skb->encapsulation) {

This should be irrelevant today, as LCO exists?

>                         eseg->cs_flags |= MLX5_ETH_WQE_L3_INNER_CSUM |
>                                           MLX5_ETH_WQE_L4_INNER_CSUM;
>                         sq->stats->csum_partial_inner++;
>                 } else {
>                         eseg->cs_flags |= MLX5_ETH_WQE_L4_CSUM;
>                         sq->stats->csum_partial++;
>                 }
> 
> How would you generalize that into csum api that will work across NICs ?
> 
> My answer stands: you cannot.
> 
> My proposal again:
> add driver specific kfuncs and hooks for things like csum.
> 
> Kuba,
> since you nacked driver specific stuff please suggest a way to unblock this stalemate.

I hope I'm not misremembering but I think I suggested at the beginning
to create a structure describing packet geometry and requested offloads,
and for the prog fill that in.

All operating systems I know end up doing that, we'll end up doing
that as well. The question is whether we're willing to learn from
experience or prefer to go on a wild ride first...

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [xdp-hints] Re: [RFC bpf-next v3 09/14] net/mlx5e: Implement devtx kfuncs
  2023-07-12  0:32     ` Jakub Kicinski
@ 2023-07-12  2:37       ` Alexei Starovoitov
  2023-07-12  3:07         ` Jakub Kicinski
  0 siblings, 1 reply; 34+ messages in thread
From: Alexei Starovoitov @ 2023-07-12  2:37 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Stanislav Fomichev, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Hao Luo, Jiri Olsa,
	Toke Høiland-Jørgensen, Willem de Bruijn, David Ahern,
	Karlsson, Magnus, Björn Töpel, Fijalkowski, Maciej,
	Jesper Dangaard Brouer, Network Development, xdp-hints

On Tue, Jul 11, 2023 at 5:32 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue, 11 Jul 2023 15:56:57 -0700 Alexei Starovoitov wrote:
> > I think this proves my point: csum is not generalizable even across veth and mlx5.
> > Above is a square peg that tries to fit csum_start/offset api (that makes sense from SW pov)
> > into HW that has different ideas about csum-ing.
> >
> > Here is what mlx5 does:
> > mlx5e_txwqe_build_eseg_csum(struct mlx5e_txqsq *sq, struct sk_buff *skb,
> >                             struct mlx5e_accel_tx_state *accel,
> >                             struct mlx5_wqe_eth_seg *eseg)
> > {
> >         if (unlikely(mlx5e_ipsec_txwqe_build_eseg_csum(sq, skb, eseg)))
> >                 return;
> >
> >         if (likely(skb->ip_summed == CHECKSUM_PARTIAL)) {
> >                 eseg->cs_flags = MLX5_ETH_WQE_L3_CSUM;
> >                 if (skb->encapsulation) {
>
> This should be irrelevant today, as LCO exists?

Hmm. Maybe. But LCO is an example of something prog devs have to be
aware of and use properly: for certain protocols, compute the outer
csum the LCO way and let the inner one go through HW csuming.
In this case I have no idea what these mlx5 flags do.
I hope this part of the code was tested with UDP tunnels.

> >                         eseg->cs_flags |= MLX5_ETH_WQE_L3_INNER_CSUM |
> >                                           MLX5_ETH_WQE_L4_INNER_CSUM;
> >                         sq->stats->csum_partial_inner++;
> >                 } else {
> >                         eseg->cs_flags |= MLX5_ETH_WQE_L4_CSUM;
> >                         sq->stats->csum_partial++;
> >                 }
> >
> > How would you generalize that into csum api that will work across NICs ?
> >
> > My answer stands: you cannot.
> >
> > My proposal again:
> > add driver specific kfuncs and hooks for things like csum.
> >
> > Kuba,
> > since you nacked driver specific stuff please suggest a way to unblock this stalemate.
>
> I hope I'm not misremembering but I think I suggested at the beginning
> to create a structure describing packet geometry and requested offloads,
> and for the prog fill that in.

hmm. but that's what skb is for. skb == packet geometry ==
layout of headers, payload, inner vs outer, csum partial, gso params.

bpf at the tc layer is supposed to interact with that correctly.
If the packet is modified, the skb geometry should be adjusted
accordingly, e.g. via the BPF_F_RECOMPUTE_CSUM flag in
bpf_skb_store_bytes().

>
> All operating systems I know end up doing that, we'll end up doing
> that as well. The question is whether we're willing to learn from
> experience or prefer to go on a wild ride first...

I don't follow. This thread was aimed at adding xdp layer knobs.
To me XDP is driver-level. 'struct xdp_md' along with
BPF_F_XDP_HAS_FRAGS is the best abstraction we could do generalizing
dma-buffers (page and multi-page) that drivers operate on.
Everything else at driver level is too unique to generalize.
skb layer is already doing its job.

In that sense "generic XDP" is a feature for testing only.
Trying to make "generic XDP" fast is missing the point of XDP.

AF_XDP is a different concept. Exposing timestamp,
csum, TSO to AF_XDP users is a different design challenge.
I'm all for doing that, but trying to combine "timestamp in xdp tx"
and "timestamp in AF_XDP" will lead to bad trade-off-s for both.
Which I think this patchset demonstrates.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [xdp-hints] Re: [RFC bpf-next v3 09/14] net/mlx5e: Implement devtx kfuncs
  2023-07-12  0:14         ` Stanislav Fomichev
@ 2023-07-12  2:50           ` Alexei Starovoitov
  2023-07-12  3:29             ` Stanislav Fomichev
  0 siblings, 1 reply; 34+ messages in thread
From: Alexei Starovoitov @ 2023-07-12  2:50 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Jiri Olsa, Jakub Kicinski,
	Toke Høiland-Jørgensen, Willem de Bruijn, David Ahern,
	Karlsson, Magnus, Björn Töpel, Fijalkowski, Maciej,
	Jesper Dangaard Brouer, Network Development, xdp-hints

On Tue, Jul 11, 2023 at 5:15 PM Stanislav Fomichev <sdf@google.com> wrote:
>
> On Tue, Jul 11, 2023 at 4:45 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Tue, Jul 11, 2023 at 4:25 PM Stanislav Fomichev <sdf@google.com> wrote:
> > >
> > > On Tue, Jul 11, 2023 at 3:57 PM Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > On Fri, Jul 07, 2023 at 12:30:01PM -0700, Stanislav Fomichev wrote:
> > > > > +
> > > > > +static int mlx5e_devtx_request_l4_checksum(const struct devtx_ctx *_ctx,
> > > > > +                                        u16 csum_start, u16 csum_offset)
> > > > > +{
> > > > > +     const struct mlx5e_devtx_ctx *ctx = (void *)_ctx;
> > > > > +     struct mlx5_wqe_eth_seg *eseg;
> > > > > +
> > > > > +     if (unlikely(!ctx->wqe))
> > > > > +             return -ENODATA;
> > > > > +
> > > > > +     eseg = &ctx->wqe->eth;
> > > > > +
> > > > > +     switch (csum_offset) {
> > > > > +     case sizeof(struct ethhdr) + sizeof(struct iphdr) + offsetof(struct udphdr, check):
> > > > > +     case sizeof(struct ethhdr) + sizeof(struct ipv6hdr) + offsetof(struct udphdr, check):
> > > > > +             /* Looks like HW/FW is doing parsing, so offsets are largely ignored. */
> > > > > +             eseg->cs_flags = MLX5_ETH_WQE_L3_CSUM | MLX5_ETH_WQE_L4_CSUM;
> > > > > +             break;
> > > > > +     default:
> > > > > +             return -EINVAL;
> > > > > +     }
> > > >
> > > > I think this proves my point: csum is not generalizable even across veth and mlx5.
> > > > Above is a square peg that tries to fit csum_start/offset api (that makes sense from SW pov)
> > > > into HW that has different ideas about csum-ing.
> > > >
> > > > Here is what mlx5 does:
> > > > mlx5e_txwqe_build_eseg_csum(struct mlx5e_txqsq *sq, struct sk_buff *skb,
> > > >                             struct mlx5e_accel_tx_state *accel,
> > > >                             struct mlx5_wqe_eth_seg *eseg)
> > > > {
> > > >         if (unlikely(mlx5e_ipsec_txwqe_build_eseg_csum(sq, skb, eseg)))
> > > >                 return;
> > > >
> > > >         if (likely(skb->ip_summed == CHECKSUM_PARTIAL)) {
> > > >                 eseg->cs_flags = MLX5_ETH_WQE_L3_CSUM;
> > > >                 if (skb->encapsulation) {
> > > >                         eseg->cs_flags |= MLX5_ETH_WQE_L3_INNER_CSUM |
> > > >                                           MLX5_ETH_WQE_L4_INNER_CSUM;
> > > >                         sq->stats->csum_partial_inner++;
> > > >                 } else {
> > > >                         eseg->cs_flags |= MLX5_ETH_WQE_L4_CSUM;
> > > >                         sq->stats->csum_partial++;
> > > >                 }
> > > >
> > > > How would you generalize that into csum api that will work across NICs ?
> > > >
> > > > My answer stands: you cannot.
> > > >
> > > > My proposal again:
> > > > add driver specific kfuncs and hooks for things like csum.
> > >
> > > I do see your point, but to also give you my perspective: I have no
> > > clue what those _CSUM tx bits do (as a non-mlx employee). And what
> > > kind of packets they support (initial patch doesn't give any info).
> > > We can definitely expose mlx5 specific request_l4_checksum(bool encap)
> > > which does things similar to mlx5e_txwqe_build_eseg_csum, but then,
> > > what does it _actually_ do? It obviously can't checksum arbitrary
> > > packet formats (because it has this inner/outer selection bit), so
> > > there is really no way for me to provide a per-driver kfunc api. Maybe
> > > the vendors can?
> > >
> > > So having csum_start/csum_offset abstraction which works with fixed
> > > offsets seems like at least it correctly sets the expectation for BPF
> > > program writers.
> > > The vendors are already supposed to conform to this start/offset API for skb.
> > >
> > > But back to your point: should we maybe try to meet somewhere in the middle?
> > > 1. We try to provide "generic" offload kfuncs; for mlx5, we'll have
> > > this mlx5e_devtx_request_l4_checksum which works for fixed offsets
> >
> > But it doesn't.
> > Even if you add a check for csum_start (that's missing in the patch)
> > there needs to be a way to somehow figure out
> > whether skb->encapsulation is true and set appropriate flags.
> > Otherwise this request csum will do "something" that only the HW vendor knows.
> > That would be even harder to debug for bpf prog writers.
> >
> > So instead of helping bpf prog devs it will actively hurt them.
>
> Can we make it more robust? The device can look at the payload (via
> descriptors or extra payload pointer via devtx_ctx) and verify
> h_proto/nexthdr.
> It won't be perfect, I agree, but we can get it working for the common
> cases (and have device-specific kfuncs for the rest).

More robust with more checks?
That will slow things down and the main advantage of XDP vs skb
layer will be lost.
It's best to stay at the skb layer then, when csum and timestamp are available.

> > Another example. If bpf prog was developed and tested on veth
> > it will work for some values of csum_offset on real HW and will -EINVAL
> > for the other values.
> > Just horrible user experience compared to the case where
> > the user knows that each netdev is potentially different and
> > _has_ to develop and test their prog on the given HW NIC and
> > not assume that the kernel can "do the right thing".
>
> For this, I was actually thinking that we need to provide some
> SW-based fallback mechanism.
> Because if I have a program and a nic that doesn't have an offload
> implemented at all, having a fallback might be useful:
>
> if (bpf_devtx_request_l4_csum(...)) {
>   /* oops, hw bailed on us, fallback to sw and expose a counter */
>   bpf_devtx_l4_csum_slowpath(csum_start, csum_offset, data, len);
>   pkt_sw_csum++;
> }
>
> This is probably needed regardless of which way we do it?

sw fallback? We already have 'generic XDP' and people misuse it
thinking it's a layer they should be using.
It's a nice feeling to say that my XDP prog was developed
and tested on mlx5, but I can move it to a different server
with a brand new NIC that doesn't support XDP yet and my prog will
still work because of "generic XDP".
I think such devs are playing with fire and will get burned
when the "generic XDP" NIC is DDoSed.
Same thing here. If we do HW offload of csum, it had better be in HW.
Devs have to be 100% certain that HW is offloading it.

>
> Regarding veth vs non-veth: we already have similar issues with
> generic xdp vs non-generic.
> I'm not sure we can completely avoid having surprises when switching
> from sw to hw paths.
> It's whether the users will have to debug 10-20% of their program or
> they'd have to start completely from scratch for every nic.

If a rewrite per NIC is not acceptable, then they should be using the skb layer.

>
> > This csum exercise is clear example that kernel is not in a position
> > to do so.
> > For timestamp it's arguable, but for csum there is no generic api that
> > kernel can apply universally to NICs.
>
> Sure, I agree, it's a mix of both. For some offloads, we can have
> something common, for some we can't.
> But I'm not sure why we have to pick one or another. We can try to
> have common apis (maybe not ideal, yes) and we can expose vendor
> specific ones if there is need.
> If the generic ones get unused - we kill them in the future. If none
> of the vendors comes up with non-generic ones - the generic ones are
> good enough.
>
> I'm assuming you favor non-generic ones because it's easier to implement?

Not only that.
Yes, it's easier to implement, but the expectations are also clear.
The kernel won't be trying to fall back to the slow path.
XDP prog will tell HW 'do csum' and HW will do it.
For generality we have an skb layer.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [xdp-hints] Re: [RFC bpf-next v3 09/14] net/mlx5e: Implement devtx kfuncs
  2023-07-12  2:37       ` Alexei Starovoitov
@ 2023-07-12  3:07         ` Jakub Kicinski
  2023-07-12  3:23           ` Alexei Starovoitov
  0 siblings, 1 reply; 34+ messages in thread
From: Jakub Kicinski @ 2023-07-12  3:07 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Stanislav Fomichev, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Hao Luo, Jiri Olsa,
	Toke Høiland-Jørgensen, Willem de Bruijn, David Ahern,
	Karlsson, Magnus, Björn Töpel, Fijalkowski, Maciej,
	Jesper Dangaard Brouer, Network Development, xdp-hints

On Tue, 11 Jul 2023 19:37:23 -0700 Alexei Starovoitov wrote:
> > I hope I'm not misremembering but I think I suggested at the beginning
> > to create a structure describing packet geometry and requested offloads,
> > and for the prog fill that in.  
> 
> hmm. but that's what skb is for. skb == packet geometry ==
> layout of headers, payload, inner vs outer, csum partial, gso params.
> 
> > bpf at the tc layer is supposed to interact with that correctly.
> If the packet is modified skb geometry should be adjusted accordingly.
> Like BPF_F_RECOMPUTE_CSUM flag in bpf_skb_store_bytes().
> 
> > All operating systems I know end up doing that, we'll end up doing
> > that as well. The question is whether we're willing to learn from
> > experience or prefer to go on a wild ride first...  
> 
> I don't follow. This thread was aimed to add xdp layer knobs.
> To me XDP is a driver level.

Driver is not a layer of the networking stack; I don't think it's
a useful or meaningful anchor point for the mental model.

We're talking about a set of functionality, we can look at how that
functionality was exposed in existing code.

> 'struct xdp_md' along with
> BPF_F_XDP_HAS_FRAGS is the best abstraction we could do generalizing
> dma-buffers (page and multi-page) that drivers operate on.

I think you're working against your own claim.
xdp frags explicitly reuse struct skb_shared_info.

> Everything else at driver level is too unique to generalize.
> skb layer is already doing its job.

How can you say that drivers are impossible to generalize and then
that the skb layer "is doing its job", i.e. generalizes them?

> In that sense "generic XDP" is a feature for testing only.
> Trying to make "generic XDP" fast is missing the point of XDP.

That's a topic on its own.

> AF_XDP is a different concept. Exposing timestamp,
> csum, TSO to AF_XDP users is a different design challenge.
> I'm all for doing that, but trying to combine "timestamp in xdp tx"
> and "timestamp in AF_XDP" will lead to bad trade-off-s for both.
> Which I think this patchset demonstrates.

Too vague to agree or disagree with.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [xdp-hints] Re: [RFC bpf-next v3 09/14] net/mlx5e: Implement devtx kfuncs
  2023-07-12  3:07         ` Jakub Kicinski
@ 2023-07-12  3:23           ` Alexei Starovoitov
  0 siblings, 0 replies; 34+ messages in thread
From: Alexei Starovoitov @ 2023-07-12  3:23 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Stanislav Fomichev, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Hao Luo, Jiri Olsa,
	Toke Høiland-Jørgensen, Willem de Bruijn, David Ahern,
	Karlsson, Magnus, Björn Töpel, Fijalkowski, Maciej,
	Jesper Dangaard Brouer, Network Development, xdp-hints

On Tue, Jul 11, 2023 at 8:07 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Tue, 11 Jul 2023 19:37:23 -0700 Alexei Starovoitov wrote:
> > > I hope I'm not misremembering but I think I suggested at the beginning
> > > to create a structure describing packet geometry and requested offloads,
> > > and for the prog fill that in.
> >
> > hmm. but that's what skb is for. skb == packet geometry ==
> > layout of headers, payload, inner vs outer, csum partial, gso params.
> >
> > > bpf at the tc layer is supposed to interact with that correctly.
> > If the packet is modified skb geometry should be adjusted accordingly.
> > Like BPF_F_RECOMPUTE_CSUM flag in bpf_skb_store_bytes().
> >
> > > All operating systems I know end up doing that, we'll end up doing
> > > that as well. The question is whether we're willing to learn from
> > > experience or prefer to go on a wild ride first...
> >
> > I don't follow. This thread was aimed to add xdp layer knobs.
> > To me XDP is a driver level.
>
> Driver is not a layer of the networking stack, I don't think it's
> a useful or meaningful anchor point for the mental model.
>
> We're talking about a set of functionality, we can look at how that
> functionality was exposed in existing code.
>
> > 'struct xdp_md' along with
> > BPF_F_XDP_HAS_FRAGS is the best abstraction we could do generalizing
> > dma-buffers (page and multi-page) that drivers operate on.
>
> I think you're working against your own claim.
> xdp frags explicitly reuse struct skb_shared_info.

Yes they do and it's an implementation detail.
skb_shared_info is convenient for drivers to fill in.
xdp prog isn't aware of the exact mechanism.
skb_shared_info is not exposed to a program.
Not sure what point you're trying to make.

>
> > Everything else at driver level is too unique to generalize.
> > skb layer is already doing its job.
>
> How can you say that drivers are impossible to generalize and then
> that the skb layer "is doing its job", i.e. generalizes them?

Yeah. 'generalize' is the wrong word to use to describe a feature
that should have the same look and feel across drivers.
What I meant by 'csum is impossible to generalize across drivers'
is that there is no HW-only path across them.
The skb layer is doing it through a sw fallback and a combination
of csum_complete/unnecessary.

I'm saying that your proposed
"structure describing packet geometry and requested offloads"
already exists, and it's called skb.
I see no point in doing this exercise again at the XDP layer.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [xdp-hints] Re: [RFC bpf-next v3 09/14] net/mlx5e: Implement devtx kfuncs
  2023-07-12  2:50           ` Alexei Starovoitov
@ 2023-07-12  3:29             ` Stanislav Fomichev
  2023-07-12  4:59               ` Alexei Starovoitov
  0 siblings, 1 reply; 34+ messages in thread
From: Stanislav Fomichev @ 2023-07-12  3:29 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Jiri Olsa, Jakub Kicinski,
	Toke Høiland-Jørgensen, Willem de Bruijn, David Ahern,
	magnus.karlsson, Björn Töpel, maciej.fijalkowski,
	Jesper Dangaard Brouer, Network Development, xdp-hints

On 07/11, Alexei Starovoitov wrote:
> On Tue, Jul 11, 2023 at 5:15 PM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > On Tue, Jul 11, 2023 at 4:45 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Tue, Jul 11, 2023 at 4:25 PM Stanislav Fomichev <sdf@google.com> wrote:
> > > >
> > > > On Tue, Jul 11, 2023 at 3:57 PM Alexei Starovoitov
> > > > <alexei.starovoitov@gmail.com> wrote:
> > > > >
> > > > > On Fri, Jul 07, 2023 at 12:30:01PM -0700, Stanislav Fomichev wrote:
> > > > > > +
> > > > > > +static int mlx5e_devtx_request_l4_checksum(const struct devtx_ctx *_ctx,
> > > > > > +                                        u16 csum_start, u16 csum_offset)
> > > > > > +{
> > > > > > +     const struct mlx5e_devtx_ctx *ctx = (void *)_ctx;
> > > > > > +     struct mlx5_wqe_eth_seg *eseg;
> > > > > > +
> > > > > > +     if (unlikely(!ctx->wqe))
> > > > > > +             return -ENODATA;
> > > > > > +
> > > > > > +     eseg = &ctx->wqe->eth;
> > > > > > +
> > > > > > +     switch (csum_offset) {
> > > > > > +     case sizeof(struct ethhdr) + sizeof(struct iphdr) + offsetof(struct udphdr, check):
> > > > > > +     case sizeof(struct ethhdr) + sizeof(struct ipv6hdr) + offsetof(struct udphdr, check):
> > > > > > +             /* Looks like HW/FW is doing parsing, so offsets are largely ignored. */
> > > > > > +             eseg->cs_flags = MLX5_ETH_WQE_L3_CSUM | MLX5_ETH_WQE_L4_CSUM;
> > > > > > +             break;
> > > > > > +     default:
> > > > > > +             return -EINVAL;
> > > > > > +     }
> > > > >
> > > > > I think this proves my point: csum is not generalizable even across veth and mlx5.
> > > > > Above is a square peg that tries to fit csum_start/offset api (that makes sense from SW pov)
> > > > > into HW that has different ideas about csum-ing.
> > > > >
> > > > > Here is what mlx5 does:
> > > > > mlx5e_txwqe_build_eseg_csum(struct mlx5e_txqsq *sq, struct sk_buff *skb,
> > > > >                             struct mlx5e_accel_tx_state *accel,
> > > > >                             struct mlx5_wqe_eth_seg *eseg)
> > > > > {
> > > > >         if (unlikely(mlx5e_ipsec_txwqe_build_eseg_csum(sq, skb, eseg)))
> > > > >                 return;
> > > > >
> > > > >         if (likely(skb->ip_summed == CHECKSUM_PARTIAL)) {
> > > > >                 eseg->cs_flags = MLX5_ETH_WQE_L3_CSUM;
> > > > >                 if (skb->encapsulation) {
> > > > >                         eseg->cs_flags |= MLX5_ETH_WQE_L3_INNER_CSUM |
> > > > >                                           MLX5_ETH_WQE_L4_INNER_CSUM;
> > > > >                         sq->stats->csum_partial_inner++;
> > > > >                 } else {
> > > > >                         eseg->cs_flags |= MLX5_ETH_WQE_L4_CSUM;
> > > > >                         sq->stats->csum_partial++;
> > > > >                 }
> > > > >
> > > > > How would you generalize that into csum api that will work across NICs ?
> > > > >
> > > > > My answer stands: you cannot.
> > > > >
> > > > > My proposal again:
> > > > > add driver specific kfuncs and hooks for things like csum.
> > > >
> > > > I do see your point, but to also give you my perspective: I have no
> > > > clue what those _CSUM tx bits do (as a non-mlx employee). And what
> > > > kind of packets they support (initial patch doesn't give any info).
> > > > We can definitely expose mlx5 specific request_l4_checksum(bool encap)
> > > > which does things similar to mlx5e_txwqe_build_eseg_csum, but then,
> > > > what does it _actually_ do? It obviously can't checksum arbitrary
> > > > packet formats (because it has this inner/outer selection bit), so
> > > > there is really no way for me to provide a per-driver kfunc api. Maybe
> > > > the vendors can?
> > > >
> > > > So having csum_start/csum_offset abstraction which works with fixed
> > > > offsets seems like at least it correctly sets the expectation for BPF
> > > > program writers.
> > > > The vendors are already supposed to conform to this start/offset API for skb.
> > > >
> > > > But back to your point: should we maybe try to meet somewhere in the middle?
> > > > 1. We try to provide "generic" offload kfuncs; for mlx5, we'll have
> > > > this mlx5e_devtx_request_l4_checksum which works for fixed offsets
> > >
> > > But it doesn't.
> > > Even if you add a check for csum_start (that's missing in the patch)
> > > there needs to be a way to somehow figure out
> > > whether skb->encapsulation is true and set appropriate flags.
> > > Otherwise this request csum will do "something" that only the HW vendor knows.
> > > That would be even harder to debug for bpf prog writers.
> > >
> > > So instead of helping bpf prog devs it will actively hurt them.
> >
> > Can we make it more robust? The device can look at the payload (via
> > descriptors or extra payload pointer via devtx_ctx) and verify
> > h_proto/nexthdr.
> > It won't be perfect, I agree, but we can get it working for the common
> > cases (and have device-specific kfuncs for the rest).
> 
> More robust with more checks?
> That will slow things down and the main advantage of XDP vs skb
> layer will be lost.
> It's best to stay at the skb layer then, when csum and timestamp are available.

This will slow things down, but not to the point where it's on par
with doing sw checksum. At least in theory.
We can't stay at the skb layer when using AF_XDP. AF_XDP would benefit
from having the offloads.
the offloads.

> > > Another example. If bpf prog was developed and tested on veth
> > > it will work for some values of csum_offset on real HW and will -EINVAL
> > > for the other values.
> > > Just horrible user experience compared to the case where
> > > the user knows that each netdev is potentially different and
> > > _has_ to develop and test their prog on the given HW NIC and
> > > not assume that the kernel can "do the right thing".
> >
> > For this, I was actually thinking that we need to provide some
> > SW-based fallback mechanism.
> > Because if I have a program and a nic that doesn't have an offload
> > implemented at all, having a fallback might be useful:
> >
> > if (bpf_devtx_request_l4_csum(...)) {
> >   /* oops, hw bailed on us, fallback to sw and expose a counter */
> >   bpf_devtx_l4_csum_slowpath(csum_start, csum_offset, data, len);
> >   pkt_sw_csum++;
> > }
> >
> > This is probably needed regardless of which way we do it?
> 
> sw fallback? We already have 'generic XDP' and people misuse it
> thinking it's a layer they should be using.
> It's a nice feeling to say that my XDP prog was developed
> and tested on mlx5, but I can move it to a different server
> with brand new NIC that doesn't support XDP yet and my prog will
> still work because of "generic XDP".
> I think such devs are playing with fire and will be burned
> when "generic XDP" NIC will be DDoSed.

This is user controlled. The users might as well choose to
not fall back to sw by not calling another helper and fail hard.

I agree with the generic xdp concerns and the veth path being deceptive,
but I also believe they serve their purpose of getting users' feet
wet before they switch to experimenting with real devices:
you get most of your program working and then polish it up on the real HW.

Moving to a different NIC will obviously require testing. But the
difference with generic APIs is that the programs will mostly work
and won't have to be completely rewritten.

> Same thing here. If we do HW offload of csum, it had better be in HW.
> Devs have to be 100% certain that HW is offloading it.

I hope we can both agree that with an api like
mlx5_l4_csum_offload(bool encap) we can't be 100% certain that the
hw is gonna handle any packet layout? So how is that different
from a generic api that also can't work in all cases?

> > Regarding veth vs non-veth: we already have similar issues with
> > generic xdp vs non-generic.
> > I'm not sure we can completely avoid having surprises when switching
> > from sw to hw paths.
> > It's whether the users will have to debug 10-20% of their program or
> > they'd have to start completely from scratch for every nic.
> 
> If a rewrite per NIC is not acceptable, then they should be using the skb layer.

AF_XDP is a generic layer for low-level access and it provides a generic
descriptor format, so why do we suddenly have this requirement that we
have to rewrite the prog for every new NIC?

Current AF_XDP programs are pretty portable (obviously they depend on
a bunch of NIC features), so it seems like a good idea to try to preserve
this property? (again, with an asterisk, where we should allow some
differentiation, etc, etc)

> > > This csum exercise is clear example that kernel is not in a position
> > > to do so.
> > > For timestamp it's arguable, but for csum there is no generic api that
> > > kernel can apply universally to NICs.
> >
> > Sure, I agree, it's a mix of both. For some offloads, we can have
> > something common, for some we can't.
> > But I'm not sure why we have to pick one or another. We can try to
> > have common apis (maybe not ideal, yes) and we can expose vendor
> > specific ones if there is need.
> > If the generic ones get unused - we kill them in the future. If none
> > of the vendors comes up with non-generic ones - the generic ones are
> > good enough.
> >
> > I'm assuming you favor non-generic ones because it's easier to implement?
> 
> Not only.
> yes, it's easier to implement, but the expectations are also clear.
> The kernel won't be trying to fall back to the slow path.
> XDP prog will tell HW 'do csum' and HW will do it.
> For generality we have an skb layer.

(seems like I've addressed these points above, I don't see anything
extra here)

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [xdp-hints] Re: [RFC bpf-next v3 09/14] net/mlx5e: Implement devtx kfuncs
  2023-07-12  3:29             ` Stanislav Fomichev
@ 2023-07-12  4:59               ` Alexei Starovoitov
  2023-07-12  5:36                 ` Stanislav Fomichev
  0 siblings, 1 reply; 34+ messages in thread
From: Alexei Starovoitov @ 2023-07-12  4:59 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Jiri Olsa, Jakub Kicinski,
	Toke Høiland-Jørgensen, Willem de Bruijn, David Ahern,
	Karlsson, Magnus, Björn Töpel, Fijalkowski, Maciej,
	Jesper Dangaard Brouer, Network Development, xdp-hints

On Tue, Jul 11, 2023 at 8:29 PM Stanislav Fomichev <sdf@google.com> wrote:
>
>
> This will slow things down, but not to the point where it's on par
> with doing sw checksum. At least in theory.
> We can't stay at skb when using AF_XDP. AF_XDP would benefit from having
> the offloads.

To clarify: yes, AF_XDP needs generalized HW offloads.
I just don't see how xdp tx offloads are moving the needle in that direction.

> I hope we can both agree that with an api like
> mlx5_l4_csum_offload(bool encap) we can't be 100% certain that the
> hw is gonna handle any packet layout? So how is that different
> from a generic api that also can't work in all cases?

If it's hw specific then yes.
Will [mlx5|...]_l4_csum_offload apply to other NICs? I doubt it.

> AF_XDP is a generic layer for low-level access and it provides generic
> descriptor format, so why suddenly we have this requirement where we have
> to do prog rewrite for every new nic?
>
> Current AF_XDP programs are pretty portable (obviously depend on
> a bunch of nic features), it seems like a good idea to try to preserve
> this property? (again, with an asterisk, where we should allow some
> differentiation, etc, etc)

Agree. AF_XDP needs a generic api that will allow user space to request
HW to do TSO, csum offload, etc.
xdp tx and af_xdp are too different to pull under the same framework.
xdp progs will interact with the kernel via kfuncs.
af_xdp needs a different api to express packet geometry and offload requests.
The user space cannot do it with bpf programs.
In case of AF_XDP the bpf prog in the kernel is a filter only.
For the majority of the cases bpf prog is not necessary and shouldn't be
required to express HW offloads.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [xdp-hints] Re: [RFC bpf-next v3 09/14] net/mlx5e: Implement devtx kfuncs
  2023-07-12  4:59               ` Alexei Starovoitov
@ 2023-07-12  5:36                 ` Stanislav Fomichev
  2023-07-12 15:16                   ` Willem de Bruijn
  0 siblings, 1 reply; 34+ messages in thread
From: Stanislav Fomichev @ 2023-07-12  5:36 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Song Liu, Yonghong Song, John Fastabend,
	KP Singh, Hao Luo, Jiri Olsa, Jakub Kicinski,
	Toke Høiland-Jørgensen, Willem de Bruijn, David Ahern,
	Karlsson, Magnus, Björn Töpel, Fijalkowski, Maciej,
	Jesper Dangaard Brouer, Network Development, xdp-hints

On Tue, Jul 11, 2023 at 9:59 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Jul 11, 2023 at 8:29 PM Stanislav Fomichev <sdf@google.com> wrote:
> >
> >
> > This will slow things down, but not to the point where it's on par
> > with doing sw checksum. At least in theory.
> > We can't stay at skb when using AF_XDP. AF_XDP would benefit from having
> > the offloads.
>
> To clarify: yes, AF_XDP needs generalized HW offloads.

Great! To reiterate, I'm mostly interested in af_xdp wrt tx
timestamps. So if the consensus is not to mix xdp-tx and af_xdp-tx,
I'm fine with switching to adding some fixed af_xdp descriptor format
to enable offloads on tx.

> I just don't see how xdp tx offloads are moving the needle in that direction.

Let me try to explain how both might be similar, maybe I wasn't clear
enough on that.
For af_xdp tx packet, the userspace puts something in the af_xdp frame
metadata area (headroom) which then gets executed/interpreted by the
bpf program at devtx (which calls kfuncs to enable particular
offloads).
IOW, instead of defining some fixed layout for the tx offloads, the
userspace and bpf program have some agreement on the layout (and bpf
program "applies" the offloads by calling the kfuncs).
Also (in theory) the same hooks can be used for xdp-tx.
Does it make sense? But, again, happy to scratch that whole idea if
we're fine with a fixed layout for af_xdp.
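The userspace/bpf-program contract described here can be sketched in a few lines. Everything below (the struct layout, field names, and flag values) is hypothetical, purely to illustrate that the layout is a private agreement between the two cooperating programs rather than kernel uapi:

```python
import struct

# Hypothetical TX-offload request that userspace writes into the
# AF_XDP frame headroom; a devtx BPF program would read it and
# translate each flag into the corresponding kfunc call.
TX_META_FMT = "<IHHQ"   # flags, csum_start, csum_offset, cookie
TX_META_SIZE = struct.calcsize(TX_META_FMT)

REQ_CSUM = 1 << 0       # ask HW to insert an L4 checksum
REQ_TIMESTAMP = 1 << 1  # ask HW for a TX completion timestamp

def build_tx_meta(flags, csum_start=0, csum_offset=0, cookie=0):
    """Userspace side: pack the request in front of the packet."""
    return struct.pack(TX_META_FMT, flags, csum_start, csum_offset, cookie)

def parse_tx_meta(headroom):
    """devtx-prog side: unpack and decide which offloads to apply."""
    flags, start, off, cookie = struct.unpack_from(TX_META_FMT, headroom)
    applied = []
    if flags & REQ_CSUM:
        applied.append(("csum", start, off))  # stand-in for a csum kfunc
    if flags & REQ_TIMESTAMP:
        applied.append(("tstamp", cookie))    # stand-in for a tstamp kfunc
    return applied

meta = build_tx_meta(REQ_CSUM | REQ_TIMESTAMP,
                     csum_start=34, csum_offset=16, cookie=7)
```

Nothing here is uapi: the two ends can change TX_META_FMT together at any time, which is the flexibility being argued for.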


* [xdp-hints] Re: [RFC bpf-next v3 09/14] net/mlx5e: Implement devtx kfuncs
  2023-07-12  5:36                 ` Stanislav Fomichev
@ 2023-07-12 15:16                   ` Willem de Bruijn
  2023-07-12 16:28                     ` Willem de Bruijn
  2023-07-12 19:03                     ` Alexei Starovoitov
  0 siblings, 2 replies; 34+ messages in thread
From: Willem de Bruijn @ 2023-07-12 15:16 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Alexei Starovoitov, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Hao Luo, Jiri Olsa, Jakub Kicinski,
	Toke Høiland-Jørgensen, Willem de Bruijn, David Ahern,
	Karlsson, Magnus, Björn Töpel, Fijalkowski, Maciej,
	Jesper Dangaard Brouer, Network Development, xdp-hints

On Wed, Jul 12, 2023 at 1:36 AM Stanislav Fomichev <sdf@google.com> wrote:
>
> On Tue, Jul 11, 2023 at 9:59 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Tue, Jul 11, 2023 at 8:29 PM Stanislav Fomichev <sdf@google.com> wrote:
> > >
> > >
> > > This will slow things down, but not to the point where it's on par
> > > with doing sw checksum. At least in theory.
> > > We can't stay at skb when using AF_XDP. AF_XDP would benefit from having
> > > the offloads.
> >
> > To clarify: yes, AF_XDP needs generalized HW offloads.
>
> Great! To reiterate, I'm mostly interested in af_xdp wrt tx
> timestamps. So if the consensus is not to mix xdp-tx and af_xdp-tx,
> I'm fine with switching to adding some fixed af_xdp descriptor format
> to enable offloads on tx.
>
> > I just don't see how xdp tx offloads are moving the needle in that direction.
>
> Let me try to explain how both might be similar, maybe I wasn't clear
> enough on that.
> For af_xdp tx packet, the userspace puts something in the af_xdp frame
> metadata area (headroom) which then gets executed/interpreted by the
> bpf program at devtx (which calls kfuncs to enable particular
> offloads).
> IOW, instead of defining some fixed layout for the tx offloads, the
> userspace and bpf program have some agreement on the layout (and bpf
> program "applies" the offloads by calling the kfuncs).
> Also (in theory) the same hooks can be used for xdp-tx.
> Does it make sense? But, again, happy to scratch that whole idea if
> we're fine with a fixed layout for af_xdp.

Checksum offload is an important demonstrator too.

It is admittedly a non-trivial one. Checksum offload has often been
discussed as a pain point ("protocol ossification").

In general, drivers can accept every CHECKSUM_COMPLETE skb that
matches their advertised feature NETIF_F_[HW|IP|IPV6]_CSUM. I don't
see why this would be different for kfuncs for packets coming from
userspace.

The problematic drivers are the ones that do not implement
CHECKSUM_COMPLETE as intended, but ignore this simple
protocol-independent hint in favor of parsing from scratch, possibly
zeroing the field, computing multiple layers, etc.

All of which is unnecessary with LCO. An AF_XDP user can be expected
to apply LCO and only request checksum insertion for the innermost
checksum.
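The LCO claim can be checked numerically: once the (offloaded) inner checksum is filled in, the bytes from the inner csum_start onward always sum to the complement of the inner pseudo-header sum, so the outer checksum depends only on headers, never on payload. A minimal sketch using RFC 1071 one's-complement arithmetic, with an arbitrary payload and an arbitrary pseudo-header sum:

```python
def csum_add(a, b):
    """One's-complement add with end-around carry (RFC 1071)."""
    s = a + b
    return (s & 0xFFFF) + (s >> 16)

def csum_sum(data):
    """One's-complement sum over the 16-bit big-endian words of data."""
    if len(data) % 2:
        data = data + b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total = csum_add(total, (data[i] << 8) | data[i + 1])
    return total

# Arbitrary inner transport header + payload; the checksum field
# (offset 16, as for TCP) is zeroed. Pseudo-header sum is arbitrary too.
inner = bytearray(b"\xab\xcd" * 30)
inner[16:18] = b"\x00\x00"
pseudo_sum = 0x1A2B

# What the HW will eventually write into the inner checksum field:
inner_csum = 0xFFFF - csum_add(csum_sum(bytes(inner)), pseudo_sum)
inner[16:18] = inner_csum.to_bytes(2, "big")

# LCO: the filled-in region now sums to ~pseudo_sum regardless of the
# payload bytes, so the outer checksum can be computed from headers alone.
region_sum = csum_sum(bytes(inner))  # == 0xFFFF - pseudo_sum == 0xE5D4
```

Changing the payload bytes changes inner_csum but not region_sum, which is exactly why the outer checksum request can be dropped.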

The biggest problem with these devices that parse in hardware (and
possibly also in the driver, to identify and fix up hardware
limitations) is that they will fail when encountering an unknown
protocol. Which brings us to advertising limited typed support:
NETIF_F_HW_CSUM vs NETIF_F_IP_CSUM.

The fact that some devices that deviate from industry best practices
cannot support more advanced packet formats is unfortunate, but not a
reason to hold others back. No different from current kernel path. The
BPF program can fall back to software checksumming on these devices,
like the kernel path. Perhaps we do need to pass, along with csum_start
and csum_off, a csum_type that matches the existing
NETIF_F_[HW|IP|IPV6]_CSUM, to let drivers return with -EOPNOTSUPP
quickly for the generic case.

For implementation in essence it is just reordering driver code that
already exists for the skb case. I think the ice patch series to
support rx timestamping is a good indication of what it takes to
support XDP kfuncs: not so much new code, but reordering the driver
logic.

Which also indicates to me that the driver *is* the right place to
implement this logic, rather than reimplementing it in a BPF library. It
avoids both code duplication and dependency hell, if the library ships
independently of the driver.


* [xdp-hints] Re: [RFC bpf-next v3 09/14] net/mlx5e: Implement devtx kfuncs
  2023-07-12 15:16                   ` Willem de Bruijn
@ 2023-07-12 16:28                     ` Willem de Bruijn
  2023-07-12 19:03                     ` Alexei Starovoitov
  1 sibling, 0 replies; 34+ messages in thread
From: Willem de Bruijn @ 2023-07-12 16:28 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Alexei Starovoitov, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Hao Luo, Jiri Olsa, Jakub Kicinski,
	Toke Høiland-Jørgensen, Willem de Bruijn, David Ahern,
	Karlsson, Magnus, Björn Töpel, Fijalkowski, Maciej,
	Jesper Dangaard Brouer, Network Development, xdp-hints

On Wed, Jul 12, 2023 at 11:16 AM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> On Wed, Jul 12, 2023 at 1:36 AM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > On Tue, Jul 11, 2023 at 9:59 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Tue, Jul 11, 2023 at 8:29 PM Stanislav Fomichev <sdf@google.com> wrote:
> > > >
> > > >
> > > > This will slow things down, but not to the point where it's on par
> > > > with doing sw checksum. At least in theory.
> > > > We can't stay at skb when using AF_XDP. AF_XDP would benefit from having
> > > > the offloads.
> > >
> > > To clarify: yes, AF_XDP needs generalized HW offloads.
> >
> > Great! To reiterate, I'm mostly interested in af_xdp wrt tx
> > timestamps. So if the consensus is not to mix xdp-tx and af_xdp-tx,
> > I'm fine with switching to adding some fixed af_xdp descriptor format
> > to enable offloads on tx.
> >
> > > I just don't see how xdp tx offloads are moving the needle in that direction.
> >
> > Let me try to explain how both might be similar, maybe I wasn't clear
> > enough on that.
> > For af_xdp tx packet, the userspace puts something in the af_xdp frame
> > metadata area (headroom) which then gets executed/interpreted by the
> > bpf program at devtx (which calls kfuncs to enable particular
> > offloads).
> > IOW, instead of defining some fixed layout for the tx offloads, the
> > userspace and bpf program have some agreement on the layout (and bpf
> > program "applies" the offloads by calling the kfuncs).
> > Also (in theory) the same hooks can be used for xdp-tx.
> > Does it make sense? But, again, happy to scratch that whole idea if
> > we're fine with a fixed layout for af_xdp.
>
> Checksum offload is an important demonstrator too.
>
> It is admittedly a non-trivial one. Checksum offload has often been
> discussed as a pain point ("protocol ossification").
>
> In general, drivers can accept every CHECKSUM_COMPLETE skb that

Erm.. CHECKSUM_PARTIAL


* [xdp-hints] Re: [RFC bpf-next v3 09/14] net/mlx5e: Implement devtx kfuncs
  2023-07-12 15:16                   ` Willem de Bruijn
  2023-07-12 16:28                     ` Willem de Bruijn
@ 2023-07-12 19:03                     ` Alexei Starovoitov
  2023-07-12 19:11                       ` Willem de Bruijn
  1 sibling, 1 reply; 34+ messages in thread
From: Alexei Starovoitov @ 2023-07-12 19:03 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Stanislav Fomichev, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Hao Luo, Jiri Olsa, Jakub Kicinski,
	Toke Høiland-Jørgensen, Willem de Bruijn, David Ahern,
	Karlsson, Magnus, Björn Töpel, Fijalkowski, Maciej,
	Jesper Dangaard Brouer, Network Development, xdp-hints

On Wed, Jul 12, 2023 at 11:16:04AM -0400, Willem de Bruijn wrote:
> On Wed, Jul 12, 2023 at 1:36 AM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > On Tue, Jul 11, 2023 at 9:59 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Tue, Jul 11, 2023 at 8:29 PM Stanislav Fomichev <sdf@google.com> wrote:
> > > >
> > > >
> > > > This will slow things down, but not to the point where it's on par
> > > > with doing sw checksum. At least in theory.
> > > > We can't stay at skb when using AF_XDP. AF_XDP would benefit from having
> > > > the offloads.
> > >
> > > To clarify: yes, AF_XDP needs generalized HW offloads.
> >
> > Great! To reiterate, I'm mostly interested in af_xdp wrt tx
> > timestamps. So if the consensus is not to mix xdp-tx and af_xdp-tx,
> > I'm fine with switching to adding some fixed af_xdp descriptor format
> > to enable offloads on tx.

Since af_xdp is a primary user, let's figure out the best api for that.
If any code can be salvaged for xdp tx, great, but let's not start with
xdp tx as a prerequisite.

> >
> > > I just don't see how xdp tx offloads are moving the needle in that direction.
> >
> > Let me try to explain how both might be similar, maybe I wasn't clear
> > enough on that.
> > For af_xdp tx packet, the userspace puts something in the af_xdp frame
> > metadata area (headroom) which then gets executed/interpreted by the
> > bpf program at devtx (which calls kfuncs to enable particular
> > offloads).
> > IOW, instead of defining some fixed layout for the tx offloads, the
> > userspace and bpf program have some agreement on the layout (and bpf
> > program "applies" the offloads by calling the kfuncs).
> > Also (in theory) the same hooks can be used for xdp-tx.
> > Does it make sense? But, again, happy to scratch that whole idea if
> > we're fine with a fixed layout for af_xdp.

So instead of defining a csum offload format in xsk metadata we'll
be defining it as a set of arguments to a kfunc, and the tx-side xsk prog
will just copy the args from metadata into kfunc args?
Seems like an unnecessary step. Such an xsk prog won't be doing
anything useful. Just copying from one place to another.
It seems the only purpose of such a bpf prog is to side-step uapi exposure.
bpf is not used to program anything. There won't be any control flow.
Just an odd intermediate copy step.
Instead we can define a metadata struct for csum nic offload
outside of uapi/linux/if_xdp.h with a big 'this is not uapi' warning.
User space can request it via setsockopt.
And probably feature query the nic via getsockopt.

Error handling is critical here. With an xsk tx prog the errors
are messy. What to do when a kfunc returns an error? Store it back into
the packet metadata? And then user space needs to check every single
packet for errors? Not practical imo.

Feature query via getsockopt would be done once instead, and
user space will fill in a "csum offload struct" in the packet metadata
and won't check for per-packet errors. If the driver said the csum feature
is available, it had better work for every packet.
Notice mlx5e_txwqe_build_eseg_csum() returns void.

> 
> Checksum offload is an important demonstrator too.
> 
> It is admittedly a non-trivial one. Checksum offload has often been
> discussed as a pain point ("protocol ossification").
> 
> In general, drivers can accept every CHECKSUM_COMPLETE skb that
> matches their advertised feature NETIF_F_[HW|IP|IPV6]_CSUM. I don't
> see why this would be different for kfuncs for packets coming from
> userspace.
> 
> The problematic drivers are the ones that do not implement
> CHECKSUM_COMPLETE as intended, but ignore this simple
> protocol-independent hint in favor of parsing from scratch, possibly
> zeroing the field, computing multiple layers, etc.
> 
> All of which is unnecessary with LCO. An AF_XDP user can be expected
> to apply LCO and only request checksum insertion for the innermost
> checksum.
> 
> The biggest problem with these devices that parse in hardware (and
> possibly also in the driver, to identify and fix up hardware
> limitations) is that they will fail when encountering an unknown
> protocol. Which brings us to advertising limited typed support:
> NETIF_F_HW_CSUM vs NETIF_F_IP_CSUM.
> 
> The fact that some devices that deviate from industry best practices
> cannot support more advanced packet formats is unfortunate, but not a
> reason to hold others back. No different from current kernel path. The
> BPF program can fall back to software checksumming on these devices,
> like the kernel path. Perhaps we do need to pass, along with csum_start
> and csum_off, a csum_type that matches the existing
> NETIF_F_[HW|IP|IPV6]_CSUM, to let drivers return with -EOPNOTSUPP
> quickly for the generic case.
> 
> For implementation in essence it is just reordering driver code that
> already exists for the skb case. I think the ice patch series to
> support rx timestamping is a good indication of what it takes to
> support XDP kfuncs: not so much new code, but reordering the driver
> logic.
> 
> Which also indicates to me that the driver *is* the right place to
> implement this logic, rather than reimplementing it in a BPF library. It
> avoids both code duplication and dependency hell, if the library ships
> independently of the driver.

Agree with all of the above.
I think defining a CHECKSUM_PARTIAL struct request for af_xdp is doable and
won't require many changes in the drivers.
If we do it for more than one driver from the start there is a chance it
will work for other drivers too. imo ice+gve+mlx5 would be enough.


* [xdp-hints] Re: [RFC bpf-next v3 09/14] net/mlx5e: Implement devtx kfuncs
  2023-07-12 19:03                     ` Alexei Starovoitov
@ 2023-07-12 19:11                       ` Willem de Bruijn
  2023-07-12 19:42                         ` Alexei Starovoitov
  0 siblings, 1 reply; 34+ messages in thread
From: Willem de Bruijn @ 2023-07-12 19:11 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Stanislav Fomichev, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Hao Luo, Jiri Olsa, Jakub Kicinski,
	Toke Høiland-Jørgensen, Willem de Bruijn, David Ahern,
	Karlsson, Magnus, Björn Töpel, Fijalkowski, Maciej,
	Jesper Dangaard Brouer, Network Development, xdp-hints

On Wed, Jul 12, 2023 at 3:03 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Jul 12, 2023 at 11:16:04AM -0400, Willem de Bruijn wrote:
> > On Wed, Jul 12, 2023 at 1:36 AM Stanislav Fomichev <sdf@google.com> wrote:
> > >
> > > On Tue, Jul 11, 2023 at 9:59 PM Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > On Tue, Jul 11, 2023 at 8:29 PM Stanislav Fomichev <sdf@google.com> wrote:
> > > > >
> > > > >
> > > > > This will slow things down, but not to the point where it's on par
> > > > > with doing sw checksum. At least in theory.
> > > > > We can't stay at skb when using AF_XDP. AF_XDP would benefit from having
> > > > > the offloads.
> > > >
> > > > To clarify: yes, AF_XDP needs generalized HW offloads.
> > >
> > > Great! To reiterate, I'm mostly interested in af_xdp wrt tx
> > > timestamps. So if the consensus is not to mix xdp-tx and af_xdp-tx,
> > > I'm fine with switching to adding some fixed af_xdp descriptor format
> > > to enable offloads on tx.
>
> Since af_xdp is a primary user, let's figure out the best api for that.
> If any code can be salvaged for xdp tx, great, but let's not start with
> xdp tx as a prerequisite.
>
> > >
> > > > I just don't see how xdp tx offloads are moving the needle in that direction.
> > >
> > > Let me try to explain how both might be similar, maybe I wasn't clear
> > > enough on that.
> > > For af_xdp tx packet, the userspace puts something in the af_xdp frame
> > > metadata area (headroom) which then gets executed/interpreted by the
> > > bpf program at devtx (which calls kfuncs to enable particular
> > > offloads).
> > > IOW, instead of defining some fixed layout for the tx offloads, the
> > > userspace and bpf program have some agreement on the layout (and bpf
> > > program "applies" the offloads by calling the kfuncs).
> > > Also (in theory) the same hooks can be used for xdp-tx.
> > > Does it make sense? But, again, happy to scratch that whole idea if
> > > we're fine with a fixed layout for af_xdp.
>
> So instead of defining a csum offload format in xsk metadata we'll
> be defining it as a set of arguments to a kfunc, and the tx-side xsk prog
> will just copy the args from metadata into kfunc args?
> Seems like an unnecessary step. Such an xsk prog won't be doing
> anything useful. Just copying from one place to another.
> It seems the only purpose of such a bpf prog is to side-step uapi exposure.
> bpf is not used to program anything. There won't be any control flow.
> Just an odd intermediate copy step.
> Instead we can define a metadata struct for csum nic offload
> outside of uapi/linux/if_xdp.h with a big 'this is not uapi' warning.
> User space can request it via setsockopt.
> And probably feature query the nic via getsockopt.
>
> Error handling is critical here. With an xsk tx prog the errors
> are messy. What to do when a kfunc returns an error? Store it back into
> the packet metadata? And then user space needs to check every single
> packet for errors? Not practical imo.
>
> Feature query via getsockopt would be done once instead, and
> user space will fill in a "csum offload struct" in the packet metadata
> and won't check for per-packet errors. If the driver said the csum feature
> is available, it had better work for every packet.
> Notice mlx5e_txwqe_build_eseg_csum() returns void.
>
> >
> > Checksum offload is an important demonstrator too.
> >
> > It is admittedly a non-trivial one. Checksum offload has often been
> > discussed as a pain point ("protocol ossification").
> >
> > In general, drivers can accept every CHECKSUM_COMPLETE skb that
> > matches their advertised feature NETIF_F_[HW|IP|IPV6]_CSUM. I don't
> > see why this would be different for kfuncs for packets coming from
> > userspace.
> >
> > The problematic drivers are the ones that do not implement
> > CHECKSUM_COMPLETE as intended, but ignore this simple
> > protocol-independent hint in favor of parsing from scratch, possibly
> > zeroing the field, computing multiple layers, etc.
> >
> > All of which is unnecessary with LCO. An AF_XDP user can be expected
> > to apply LCO and only request checksum insertion for the innermost
> > checksum.
> >
> > The biggest problem with these devices that parse in hardware (and
> > possibly also in the driver, to identify and fix up hardware
> > limitations) is that they will fail when encountering an unknown
> > protocol. Which brings us to advertising limited typed support:
> > NETIF_F_HW_CSUM vs NETIF_F_IP_CSUM.
> >
> > The fact that some devices that deviate from industry best practices
> > cannot support more advanced packet formats is unfortunate, but not a
> > reason to hold others back. No different from current kernel path. The
> > BPF program can fall back to software checksumming on these devices,
> > like the kernel path. Perhaps we do need to pass, along with csum_start
> > and csum_off, a csum_type that matches the existing
> > NETIF_F_[HW|IP|IPV6]_CSUM, to let drivers return with -EOPNOTSUPP
> > quickly for the generic case.
> >
> > For implementation in essence it is just reordering driver code that
> > already exists for the skb case. I think the ice patch series to
> > support rx timestamping is a good indication of what it takes to
> > support XDP kfuncs: not so much new code, but reordering the driver
> > logic.
> >
> > Which also indicates to me that the driver *is* the right place to
> > implement this logic, rather than reimplementing it in a BPF library. It
> > avoids both code duplication and dependency hell, if the library ships
> > independently of the driver.
>
> Agree with all of the above.
> I think defining CHECKSUM_PARTIAL struct request for af_xdp is doable and
> won't require much changes in the drivers.
> If we do it for more than one driver from the start there is a chance it
> will work for other drivers too. imo ice+gve+mlx5 would be enough.

Basically, add to AF_XDP what we already have for its predecessor
AF_PACKET: setsockopt PACKET_VNET_HDR?

Possibly with a separate new struct, rather than virtio_net_hdr. As
that has dependencies on other drivers, notably virtio and its
specification process.


* [xdp-hints] Re: [RFC bpf-next v3 09/14] net/mlx5e: Implement devtx kfuncs
  2023-07-12 19:11                       ` Willem de Bruijn
@ 2023-07-12 19:42                         ` Alexei Starovoitov
  2023-07-12 20:09                           ` Jakub Kicinski
  0 siblings, 1 reply; 34+ messages in thread
From: Alexei Starovoitov @ 2023-07-12 19:42 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Stanislav Fomichev, bpf, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Hao Luo, Jiri Olsa, Jakub Kicinski,
	Toke Høiland-Jørgensen, Willem de Bruijn, David Ahern,
	Karlsson, Magnus, Björn Töpel, Fijalkowski, Maciej,
	Jesper Dangaard Brouer, Network Development, xdp-hints

On Wed, Jul 12, 2023 at 12:12 PM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> Basically, add to AF_XDP what we already have for its predecessor
> AF_PACKET: setsockopt PACKET_VNET_HDR?
>
> Possibly with a separate new struct, rather than virtio_net_hdr. As
> that has dependencies on other drivers, notably virtio and its
> specification process.

yeah. Forgot about this one.
That's a perfect fit. I would reuse virtio_net_hdr as-is.
Why reinvent the wheel?
It would force uapi, but some might argue it's a good thing.


* [xdp-hints] Re: [RFC bpf-next v3 09/14] net/mlx5e: Implement devtx kfuncs
  2023-07-12 19:42                         ` Alexei Starovoitov
@ 2023-07-12 20:09                           ` Jakub Kicinski
  2023-07-12 20:53                             ` Stanislav Fomichev
  0 siblings, 1 reply; 34+ messages in thread
From: Jakub Kicinski @ 2023-07-12 20:09 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Willem de Bruijn, Stanislav Fomichev, bpf, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Hao Luo, Jiri Olsa,
	Toke Høiland-Jørgensen, Willem de Bruijn, David Ahern,
	Karlsson, Magnus, Björn Töpel, Fijalkowski, Maciej,
	Jesper Dangaard Brouer, Network Development, xdp-hints

On Wed, 12 Jul 2023 12:42:35 -0700 Alexei Starovoitov wrote:
> > Basically, add to AF_XDP what we already have for its predecessor
> > AF_PACKET: setsockopt PACKET_VNET_HDR?
> >
> > Possibly with a separate new struct, rather than virtio_net_hdr. As
> > that has dependencies on other drivers, notably virtio and its
> > specification process.  
> 
> yeah. Forgot about this one.
> That's a perfect fit. I would reuse virtio_net_hdr as-is.
> Why reinvent the wheel?
> It would force uapi, but some might argue it's a good thing.

I was gonna reply on the other leg of the thread but it sounds like
we're in agreement now? 👏️  virtio_net_hdr is the kind of generic
descriptor I had in mind.

I'd suggest breaking hdr_len into L2len, L3len and L4len, tho. How does
virtio do IP length updates during TSO if it doesn't know where L3
starts? HW will want to know, and it's easy to add them together in
cases where it doesn't. Which is why I kept saying "packet geometry"
rather than pointing at virtio_net_hdr.
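For reference, struct virtio_net_hdr carries flags and gso_type as single bytes followed by four little-endian 16-bit fields: hdr_len, gso_size, csum_start, csum_offset. The sketch below packs one and shows the geometry point: hdr_len and csum_start are only sums of the L2/L3/L4 lengths, so the individual lengths cannot be recovered from the packed header (the helper and its example values are illustrative):

```python
import struct

# struct virtio_net_hdr: u8 flags, u8 gso_type, then four le16s:
# hdr_len, gso_size, csum_start, csum_offset.
VNET_HDR_FMT = "<BBHHHH"
VIRTIO_NET_HDR_F_NEEDS_CSUM = 1
VIRTIO_NET_HDR_GSO_TCPV4 = 1

def pack_vnet_hdr(l2len, l3len, l4len, gso_size):
    """Pack a TSO+csum request from an explicit packet geometry.
    The split (l2len, l3len, l4len) collapses into two sums, so a
    device that needs to know where L3 starts can't get it back."""
    hdr_len = l2len + l3len + l4len   # e.g. 14 + 20 + 20 for TCPv4
    csum_start = l2len + l3len        # L4 checksum covers from here
    csum_offset = 16                  # TCP checksum field offset in L4 hdr
    return struct.pack(VNET_HDR_FMT,
                       VIRTIO_NET_HDR_F_NEEDS_CSUM,
                       VIRTIO_NET_HDR_GSO_TCPV4,
                       hdr_len, gso_size, csum_start, csum_offset)

hdr = pack_vnet_hdr(l2len=14, l3len=20, l4len=20, gso_size=1448)
fields = struct.unpack(VNET_HDR_FMT, hdr)
```

Carrying l2len/l3len/l4len separately, as suggested, keeps the full geometry; consumers that only need the sums can still add them together.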


* [xdp-hints] Re: [RFC bpf-next v3 09/14] net/mlx5e: Implement devtx kfuncs
  2023-07-12 20:09                           ` Jakub Kicinski
@ 2023-07-12 20:53                             ` Stanislav Fomichev
  0 siblings, 0 replies; 34+ messages in thread
From: Stanislav Fomichev @ 2023-07-12 20:53 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Alexei Starovoitov, Willem de Bruijn, bpf, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
	Yonghong Song, John Fastabend, KP Singh, Hao Luo, Jiri Olsa,
	Toke Høiland-Jørgensen, Willem de Bruijn, David Ahern,
	magnus.karlsson, Björn Töpel, maciej.fijalkowski,
	Jesper Dangaard Brouer, Network Development, xdp-hints

On 07/12, Jakub Kicinski wrote:
> On Wed, 12 Jul 2023 12:42:35 -0700 Alexei Starovoitov wrote:
> > > Basically, add to AF_XDP what we already have for its predecessor
> > > AF_PACKET: setsockopt PACKET_VNET_HDR?
> > >
> > > Possibly with a separate new struct, rather than virtio_net_hdr. As
> > > that has dependencies on other drivers, notably virtio and its
> > > specification process.  
> > 
> > yeah. Forgot about this one.
> > That's a perfect fit. I would reuse virtio_net_hdr as-is.
> > Why reinvent the wheel?
> > It would force uapi, but some might argue it's a good thing.
> 
> I was gonna reply on the other leg of the thread but it sounds like
> we're in agreement now? 👏️  virtio_net_hdr is the kind of generic
> descriptor I had in mind.
> 
> I'd suggest breaking hdr_len into L2len, L3len and L4len, tho. How does
> virtio do IP length updates during TSO if it doesn't know where L3
> starts? HW will want to know, and it's easy to add them together in
> cases where it doesn't. Which is why I kept saying "packet geometry"
> rather than pointing at virtio_net_hdr.

Perfect, I'll drop the kfuncs and will switch to a more static way
of passing this info. We can discuss the particular layout once I have
something concrete to show :-)

Thanks, again, everyone for the comments!


end of thread, other threads:[~2023-07-12 20:53 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-07-07 19:29 [xdp-hints] [RFC bpf-next v3 00/14] bpf: Netdev TX metadata Stanislav Fomichev
2023-07-07 19:29 ` [xdp-hints] [RFC bpf-next v3 01/14] bpf: Rename some xdp-metadata functions into dev-bound Stanislav Fomichev
2023-07-07 19:29 ` [xdp-hints] [RFC bpf-next v3 02/14] bpf: Make it easier to add new metadata kfunc Stanislav Fomichev
2023-07-07 19:29 ` [xdp-hints] [RFC bpf-next v3 03/14] xsk: Support XDP_TX_METADATA_LEN Stanislav Fomichev
2023-07-07 19:29 ` [xdp-hints] [RFC bpf-next v3 04/14] bpf: Implement devtx hook points Stanislav Fomichev
2023-07-07 19:29 ` [xdp-hints] [RFC bpf-next v3 05/14] bpf: Implement devtx timestamp kfunc Stanislav Fomichev
2023-07-07 19:29 ` [xdp-hints] [RFC bpf-next v3 06/14] net: veth: Implement devtx timestamp kfuncs Stanislav Fomichev
2023-07-07 19:29 ` [xdp-hints] [RFC bpf-next v3 07/14] bpf: Introduce tx checksum devtx kfuncs Stanislav Fomichev
2023-07-07 19:30 ` [xdp-hints] [RFC bpf-next v3 08/14] net: veth: Implement devtx tx checksum Stanislav Fomichev
2023-07-07 19:30 ` [xdp-hints] [RFC bpf-next v3 09/14] net/mlx5e: Implement devtx kfuncs Stanislav Fomichev
2023-07-11 22:56   ` [xdp-hints] " Alexei Starovoitov
2023-07-11 23:24     ` Stanislav Fomichev
2023-07-11 23:45       ` Alexei Starovoitov
2023-07-12  0:14         ` Stanislav Fomichev
2023-07-12  2:50           ` Alexei Starovoitov
2023-07-12  3:29             ` Stanislav Fomichev
2023-07-12  4:59               ` Alexei Starovoitov
2023-07-12  5:36                 ` Stanislav Fomichev
2023-07-12 15:16                   ` Willem de Bruijn
2023-07-12 16:28                     ` Willem de Bruijn
2023-07-12 19:03                     ` Alexei Starovoitov
2023-07-12 19:11                       ` Willem de Bruijn
2023-07-12 19:42                         ` Alexei Starovoitov
2023-07-12 20:09                           ` Jakub Kicinski
2023-07-12 20:53                             ` Stanislav Fomichev
2023-07-12  0:32     ` Jakub Kicinski
2023-07-12  2:37       ` Alexei Starovoitov
2023-07-12  3:07         ` Jakub Kicinski
2023-07-12  3:23           ` Alexei Starovoitov
2023-07-07 19:30 ` [xdp-hints] [RFC bpf-next v3 10/14] selftests/xsk: Support XDP_TX_METADATA_LEN Stanislav Fomichev
2023-07-07 19:30 ` [xdp-hints] [RFC bpf-next v3 11/14] selftests/bpf: Add helper to query current netns cookie Stanislav Fomichev
2023-07-07 19:30 ` [xdp-hints] [RFC bpf-next v3 12/14] selftests/bpf: Add csum helpers Stanislav Fomichev
2023-07-07 19:30 ` [xdp-hints] [RFC bpf-next v3 13/14] selftests/bpf: Extend xdp_metadata with devtx kfuncs Stanislav Fomichev
2023-07-07 19:30 ` [xdp-hints] [RFC bpf-next v3 14/14] selftests/bpf: Extend xdp_hw_metadata " Stanislav Fomichev

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox