XDP hardware hints discussion mail archive
 help / color / mirror / Atom feed
* [xdp-hints] [PATCH bpf-next 0/9] xsk: TX metadata
@ 2023-08-09 16:54 Stanislav Fomichev
  2023-08-09 16:54 ` [xdp-hints] [PATCH bpf-next 1/9] xsk: Support XDP_TX_METADATA_LEN Stanislav Fomichev
                   ` (10 more replies)
  0 siblings, 11 replies; 29+ messages in thread
From: Stanislav Fomichev @ 2023-08-09 16:54 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, kuba, toke, willemb, dsahern,
	magnus.karlsson, bjorn, maciej.fijalkowski, hawk, netdev,
	xdp-hints, Saeed Mahameed

This series implements initial TX metadata (offloads) for AF_XDP.
See patch #2 for the main implementation and mlx5-one for the
example on how to consume the metadata on the device side.

Starting with two types of offloads:
- request TX timestamp (and write it back into the metadata area)
- request TX checksum offload

Changes since last RFC:
- add /* private: */ comments to headers (Simon)
- handle metadata only in the first frag (Simon)
- rename xdp_hw_metadata flags (Willem)
- s/tmo_request_checksum/tmo_request_timestamp/ in xdp_metadata_ops
  comment (Willem)
- Documentation/networking/xsk-tx-metadata.rst

RFC v4: https://lore.kernel.org/bpf/20230724235957.1953861-1-sdf@google.com/

Performance:

I've implemented a small xskgen tool to try to saturate single tx queue:
https://github.com/fomichev/xskgen/tree/master

Here are the performance numbers with some analysis.

1. Baseline. Running with commit eb62e6aef940 ("Merge branch 'bpf:
Support bpf_get_func_ip helper in uprobes'"), nothing from this series:

- with 1400 bytes of payload: 98 gbps, 8 mpps
./xskgen -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000000 packets 116960000000 bits, took 1.189130 sec, 98.357623 gbps 8.409509 mpps

- with 200 bytes of payload: 49 gbps, 23 mpps
./xskgen -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000064 packets 20960134144 bits, took 0.422235 sec, 49.640921 gbps 23.683645 mpps

2. Adding single commit that supports reserving XDP_TX_METADATA_LEN
   changes nothing numbers-wise.

- baseline for 1400
./xskgen -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000000 packets 116960000000 bits, took 1.189247 sec, 98.347946 gbps 8.408682 mpps

- baseline for 200
./xskgen -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000000 packets 20960000000 bits, took 0.421248 sec, 49.756913 gbps 23.738985 mpps

3. Adding -M flag causes xskgen to reserve the metadata and fill it, but
   doesn't set XDP_TX_METADATA descriptor option.

- new baseline for 1400 (with only filling the metadata)
./xskgen -M -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000000 packets 116960000000 bits, took 1.188767 sec, 98.387657 gbps 8.412077 mpps

- new baseline for 200 (with only filling the metadata)
./xskgen -M -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000000 packets 20960000000 bits, took 0.410213 sec, 51.095407 gbps 24.377579 mpps
(the numbers go sligtly up here, not really sure why, maybe some cache-related
side-effects?

4. Next, I'm running the same test but with the commit that adds actual
   general infra to parse XDP_TX_METADATA (but no driver support).
   Essentially applying "xsk: add TX timestamp and TX checksum offload support"
   from this series. Numbers are the same.

- fill metadata for 1400
./xskgen -M -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000000 packets 116960000000 bits, took 1.188430 sec, 98.415557 gbps 8.414463 mpps

- fill metadata for 200
./xskgen -M -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000000 packets 20960000000 bits, took 0.411559 sec, 50.928299 gbps 24.297853 mpps

- request metadata for 1400
./xskgen -m -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000000 packets 116960000000 bits, took 1.188723 sec, 98.391299 gbps 8.412389 mpps

- request metadata for 200
./xskgen -m -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000064 packets 20960134144 bits, took 0.411240 sec, 50.968131 gbps 24.316856 mpps

5. Now, for the most interesting part, I'm adding mlx5 driver support.
   The mpps for 200 bytes case goes down from 23 mpps to 19 mpps, but
   _only_ when I enable the metadata. This looks like a side effect
   of me pushing extra metadata pointer via mlx5e_xdpi_fifo_push.
   Hence, this part is wrapped into 'if (xp_tx_metadata_enabled)'
   to not affect the existing non-metadata use-cases. Since this is not
   regressing existing workloads, I'm not spending any time trying to
   optimize it more (and leaving it up to mlx owners to purse if
   they see any good way to do it).

- same baseline
./xskgen -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000000 packets 116960000000 bits, took 1.189434 sec, 98.332484 gbps 8.407360 mpps

./xskgen -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000128 packets 20960268288 bits, took 0.425254 sec, 49.288821 gbps 23.515659 mpps

- fill metadata for 1400
./xskgen -M -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000000 packets 116960000000 bits, took 1.189528 sec, 98.324714 gbps 8.406696 mpps

- fill metadata for 200
./xskgen -M -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000128 packets 20960268288 bits, took 0.519085 sec, 40.379260 gbps 19.264914 mpps

- request metadata for 1400
./xskgen -m -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000000 packets 116960000000 bits, took 1.189329 sec, 98.341165 gbps 8.408102 mpps

- request metadata for 200
./xskgen -m -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000128 packets 20960268288 bits, took 0.519929 sec, 40.313713 gbps 19.233642 mpps

Cc: Saeed Mahameed <saeedm@nvidia.com>

Stanislav Fomichev (9):
  xsk: Support XDP_TX_METADATA_LEN
  xsk: add TX timestamp and TX checksum offload support
  tools: ynl: print xsk-features from the sample
  net/mlx5e: Implement AF_XDP TX timestamp and checksum offload
  selftests/xsk: Support XDP_TX_METADATA_LEN
  selftests/bpf: Add csum helpers
  selftests/bpf: Add TX side to xdp_metadata
  selftests/bpf: Add TX side to xdp_hw_metadata
  xsk: document XDP_TX_METADATA_LEN layout

 Documentation/netlink/specs/netdev.yaml       |  20 ++
 Documentation/networking/index.rst            |   1 +
 Documentation/networking/xsk-tx-metadata.rst  |  75 +++++++
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |   4 +-
 .../net/ethernet/mellanox/mlx5/core/en/xdp.c  |  72 ++++++-
 .../net/ethernet/mellanox/mlx5/core/en/xdp.h  |  10 +-
 .../ethernet/mellanox/mlx5/core/en/xsk/tx.c   |  11 +-
 .../net/ethernet/mellanox/mlx5/core/en_main.c |   1 +
 include/linux/netdevice.h                     |  27 +++
 include/linux/skbuff.h                        |   5 +-
 include/net/xdp_sock.h                        |  61 ++++++
 include/net/xdp_sock_drv.h                    |  13 ++
 include/net/xsk_buff_pool.h                   |   6 +
 include/uapi/linux/if_xdp.h                   |  36 ++++
 include/uapi/linux/netdev.h                   |  16 ++
 net/core/netdev-genl.c                        |  12 +-
 net/xdp/xsk.c                                 |  61 ++++++
 net/xdp/xsk_buff_pool.c                       |   1 +
 net/xdp/xsk_queue.h                           |  19 +-
 tools/include/uapi/linux/if_xdp.h             |  50 ++++-
 tools/include/uapi/linux/netdev.h             |  15 ++
 tools/net/ynl/generated/netdev-user.c         |  19 ++
 tools/net/ynl/generated/netdev-user.h         |   3 +
 tools/net/ynl/samples/netdev.c                |   6 +
 tools/testing/selftests/bpf/network_helpers.h |  43 ++++
 .../selftests/bpf/prog_tests/xdp_metadata.c   |  31 ++-
 tools/testing/selftests/bpf/xdp_hw_metadata.c | 201 +++++++++++++++++-
 tools/testing/selftests/bpf/xsk.c             |  17 ++
 tools/testing/selftests/bpf/xsk.h             |   1 +
 29 files changed, 793 insertions(+), 44 deletions(-)
 create mode 100644 Documentation/networking/xsk-tx-metadata.rst

-- 
2.41.0.640.ga95def55d0-goog


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [xdp-hints] [PATCH bpf-next 1/9] xsk: Support XDP_TX_METADATA_LEN
  2023-08-09 16:54 [xdp-hints] [PATCH bpf-next 0/9] xsk: TX metadata Stanislav Fomichev
@ 2023-08-09 16:54 ` Stanislav Fomichev
  2023-08-14 10:56   ` [xdp-hints] " Maciej Fijalkowski
  2023-08-09 16:54 ` [xdp-hints] [PATCH bpf-next 2/9] xsk: add TX timestamp and TX checksum offload support Stanislav Fomichev
                   ` (9 subsequent siblings)
  10 siblings, 1 reply; 29+ messages in thread
From: Stanislav Fomichev @ 2023-08-09 16:54 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, kuba, toke, willemb, dsahern,
	magnus.karlsson, bjorn, maciej.fijalkowski, hawk, netdev,
	xdp-hints

For zerocopy mode, tx_desc->addr can point to the arbitrary offset
and carry some TX metadata in the headroom. For copy mode, there
is no way currently to populate skb metadata.

Introduce new XDP_TX_METADATA_LEN that indicates how many bytes
to treat as metadata. Metadata bytes come prior to tx_desc address
(same as in RX case).

The size of the metadata has the same constraints as XDP:
- less than 256 bytes
- 4-byte aligned
- non-zero

This data is not interpreted in any way right now.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 include/net/xdp_sock.h      |  1 +
 include/net/xsk_buff_pool.h |  1 +
 include/uapi/linux/if_xdp.h |  1 +
 net/xdp/xsk.c               | 20 ++++++++++++++++++++
 net/xdp/xsk_buff_pool.c     |  1 +
 net/xdp/xsk_queue.h         | 17 ++++++++++-------
 6 files changed, 34 insertions(+), 7 deletions(-)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 1617af380162..467b9fb56827 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -51,6 +51,7 @@ struct xdp_sock {
 	struct list_head flush_node;
 	struct xsk_buff_pool *pool;
 	u16 queue_id;
+	u8 tx_metadata_len;
 	bool zc;
 	bool sg;
 	enum {
diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
index b0bdff26fc88..9c31e8d1e198 100644
--- a/include/net/xsk_buff_pool.h
+++ b/include/net/xsk_buff_pool.h
@@ -77,6 +77,7 @@ struct xsk_buff_pool {
 	u32 chunk_size;
 	u32 chunk_shift;
 	u32 frame_len;
+	u8 tx_metadata_len; /* inherited from xsk_sock */
 	u8 cached_need_wakeup;
 	bool uses_need_wakeup;
 	bool dma_need_sync;
diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index 8d48863472b9..b37b50102e1c 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -69,6 +69,7 @@ struct xdp_mmap_offsets {
 #define XDP_UMEM_COMPLETION_RING	6
 #define XDP_STATISTICS			7
 #define XDP_OPTIONS			8
+#define XDP_TX_METADATA_LEN		9
 
 struct xdp_umem_reg {
 	__u64 addr; /* Start of packet data area */
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 47796a5a79b3..28df3280501d 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -1338,6 +1338,26 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
 		mutex_unlock(&xs->mutex);
 		return err;
 	}
+	case XDP_TX_METADATA_LEN:
+	{
+		int val;
+
+		if (optlen < sizeof(val))
+			return -EINVAL;
+		if (copy_from_sockptr(&val, optval, sizeof(val)))
+			return -EFAULT;
+		if (!val || val > 256 || val % 4)
+			return -EINVAL;
+
+		mutex_lock(&xs->mutex);
+		if (xs->state != XSK_READY) {
+			mutex_unlock(&xs->mutex);
+			return -EBUSY;
+		}
+		xs->tx_metadata_len = val;
+		mutex_unlock(&xs->mutex);
+		return 0;
+	}
 	default:
 		break;
 	}
diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
index b3f7b310811e..b351732f1032 100644
--- a/net/xdp/xsk_buff_pool.c
+++ b/net/xdp/xsk_buff_pool.c
@@ -85,6 +85,7 @@ struct xsk_buff_pool *xp_create_and_assign_umem(struct xdp_sock *xs,
 		XDP_PACKET_HEADROOM;
 	pool->umem = umem;
 	pool->addrs = umem->addrs;
+	pool->tx_metadata_len = xs->tx_metadata_len;
 	INIT_LIST_HEAD(&pool->free_list);
 	INIT_LIST_HEAD(&pool->xskb_list);
 	INIT_LIST_HEAD(&pool->xsk_tx_list);
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index 13354a1e4280..c74a1372bcb9 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -143,15 +143,17 @@ static inline bool xp_unused_options_set(u32 options)
 static inline bool xp_aligned_validate_desc(struct xsk_buff_pool *pool,
 					    struct xdp_desc *desc)
 {
-	u64 offset = desc->addr & (pool->chunk_size - 1);
+	u64 addr = desc->addr - pool->tx_metadata_len;
+	u64 len = desc->len + pool->tx_metadata_len;
+	u64 offset = addr & (pool->chunk_size - 1);
 
 	if (!desc->len)
 		return false;
 
-	if (offset + desc->len > pool->chunk_size)
+	if (offset + len > pool->chunk_size)
 		return false;
 
-	if (desc->addr >= pool->addrs_cnt)
+	if (addr >= pool->addrs_cnt)
 		return false;
 
 	if (xp_unused_options_set(desc->options))
@@ -162,16 +164,17 @@ static inline bool xp_aligned_validate_desc(struct xsk_buff_pool *pool,
 static inline bool xp_unaligned_validate_desc(struct xsk_buff_pool *pool,
 					      struct xdp_desc *desc)
 {
-	u64 addr = xp_unaligned_add_offset_to_addr(desc->addr);
+	u64 addr = xp_unaligned_add_offset_to_addr(desc->addr) - pool->tx_metadata_len;
+	u64 len = desc->len + pool->tx_metadata_len;
 
 	if (!desc->len)
 		return false;
 
-	if (desc->len > pool->chunk_size)
+	if (len > pool->chunk_size)
 		return false;
 
-	if (addr >= pool->addrs_cnt || addr + desc->len > pool->addrs_cnt ||
-	    xp_desc_crosses_non_contig_pg(pool, addr, desc->len))
+	if (addr >= pool->addrs_cnt || addr + len > pool->addrs_cnt ||
+	    xp_desc_crosses_non_contig_pg(pool, addr, len))
 		return false;
 
 	if (xp_unused_options_set(desc->options))
-- 
2.41.0.640.ga95def55d0-goog


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [xdp-hints] [PATCH bpf-next 2/9] xsk: add TX timestamp and TX checksum offload support
  2023-08-09 16:54 [xdp-hints] [PATCH bpf-next 0/9] xsk: TX metadata Stanislav Fomichev
  2023-08-09 16:54 ` [xdp-hints] [PATCH bpf-next 1/9] xsk: Support XDP_TX_METADATA_LEN Stanislav Fomichev
@ 2023-08-09 16:54 ` Stanislav Fomichev
  2023-08-09 20:18   ` [xdp-hints] " Jesper Dangaard Brouer
                     ` (3 more replies)
  2023-08-09 16:54 ` [xdp-hints] [PATCH bpf-next 3/9] tools: ynl: print xsk-features from the sample Stanislav Fomichev
                   ` (8 subsequent siblings)
  10 siblings, 4 replies; 29+ messages in thread
From: Stanislav Fomichev @ 2023-08-09 16:54 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, kuba, toke, willemb, dsahern,
	magnus.karlsson, bjorn, maciej.fijalkowski, hawk, netdev,
	xdp-hints

This change actually defines the (initial) metadata layout
that should be used by AF_XDP userspace (xsk_tx_metadata).
The first field is flags which requests appropriate offloads,
followed by the offload-specific fields. The supported per-device
offloads are exported via netlink (new xsk-flags).

The offloads themselves are still implemented in a bit of a
framework-y fashion that's left from my initial kfunc attempt.
I'm introducing new xsk_tx_metadata_ops which drivers are
supposed to implement. The drivers are also supposed
to call xsk_tx_metadata_request/xsk_tx_metadata_complete in
the right places. Since xsk_tx_metadata_{request,_complete}
are static inline, we don't incur any extra overhead doing
indirect calls.

The benefit of this scheme is as follows:
- keeps all metadata layout parsing away from driver code
- makes it easy to grep and see which drivers implement what
- don't need any extra flags to maintain to keep track of what
  offloads are implemented; if the callback is implemented - the offload
  is supported (used by netlink reporting code)

Two offloads are defined right now:
1. XDP_TX_METADATA_CHECKSUM: skb-style csum_start+csum_offset
2. XDP_TX_METADATA_TIMESTAMP: writes TX timestamp back into metadata
   area upon completion (tx_timestamp field)

The offloads are also implemented for copy mode:
1. Extra XDP_TX_METADATA_CHECKSUM_SW to trigger skb_checksum_help; this
   might be useful as a reference implementation and for testing
2. XDP_TX_METADATA_TIMESTAMP writes SW timestamp from the skb
   destructor (note I'm reusing hwtstamps to pass metadata pointer)

The struct is forward-compatible and can be extended in the future
by appending more fields.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 Documentation/netlink/specs/netdev.yaml | 20 +++++++++
 include/linux/netdevice.h               | 27 +++++++++++
 include/linux/skbuff.h                  |  5 ++-
 include/net/xdp_sock.h                  | 60 +++++++++++++++++++++++++
 include/net/xdp_sock_drv.h              | 13 ++++++
 include/net/xsk_buff_pool.h             |  5 +++
 include/uapi/linux/if_xdp.h             | 35 +++++++++++++++
 include/uapi/linux/netdev.h             | 16 +++++++
 net/core/netdev-genl.c                  | 12 ++++-
 net/xdp/xsk.c                           | 41 +++++++++++++++++
 net/xdp/xsk_queue.h                     |  2 +-
 tools/include/uapi/linux/if_xdp.h       | 50 ++++++++++++++++++---
 tools/include/uapi/linux/netdev.h       | 15 +++++++
 13 files changed, 293 insertions(+), 8 deletions(-)

diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
index 1c7284fd535b..9002b37b7676 100644
--- a/Documentation/netlink/specs/netdev.yaml
+++ b/Documentation/netlink/specs/netdev.yaml
@@ -42,6 +42,19 @@ name: netdev
         doc:
           This feature informs if netdev implements non-linear XDP buffer
           support in ndo_xdp_xmit callback.
+  -
+    type: flags
+    name: xsk-flags
+    render-max: true
+    entries:
+      -
+        name: tx-timestamp
+        doc:
+          HW timestamping egress packets is supported by the driver.
+      -
+        name: tx-checksum
+        doc:
+          L3 checksum HW offload is supported by the driver.
 
 attribute-sets:
   -
@@ -68,6 +81,12 @@ name: netdev
         type: u32
         checks:
           min: 1
+      -
+        name: xsk-features
+        doc: Bitmask of enabled AF_XDP features.
+        type: u64
+        enum: xsk-flags
+        enum-as-flags: true
 
 operations:
   list:
@@ -84,6 +103,7 @@ name: netdev
             - ifindex
             - xdp-features
             - xdp-zc-max-segs
+            - xsk-features
       dump:
         reply: *dev-all
     -
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 0896aaa91dd7..3f02aaa30590 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1647,6 +1647,31 @@ struct net_device_ops {
 						    struct netlink_ext_ack *extack);
 };
 
+/*
+ * This structure defines the AF_XDP TX metadata hooks for network devices.
+ * The following hooks can be defined; unless noted otherwise, they are
+ * optional and can be filled with a null pointer.
+ *
+ * int (*tmo_request_timestamp)(void *priv)
+ *     This function is called when AF_XDP frame requested egress timestamp.
+ *
+ * int (*tmo_fill_timestamp)(void *priv)
+ *     This function is called when AF_XDP frame, that had requested
+ *     egress timestamp, received a completion. The hook needs to return
+ *     the actual HW timestamp.
+ *
+ * int (*tmo_request_checksum)(u16 csum_start, u16 csum_offset, void *priv)
+ *     This function is called when AF_XDP frame requested HW checksum
+ *     offload. csum_start indicates position where checksumming should start.
+ *     csum_offset indicates position where checksum should be stored.
+ *
+ */
+struct xsk_tx_metadata_ops {
+	void	(*tmo_request_timestamp)(void *priv);
+	u64	(*tmo_fill_timestamp)(void *priv);
+	void	(*tmo_request_checksum)(u16 csum_start, u16 csum_offset, void *priv);
+};
+
 /**
  * enum netdev_priv_flags - &struct net_device priv_flags
  *
@@ -1835,6 +1860,7 @@ enum netdev_ml_priv_type {
  *	@netdev_ops:	Includes several pointers to callbacks,
  *			if one wants to override the ndo_*() functions
  *	@xdp_metadata_ops:	Includes pointers to XDP metadata callbacks.
+ *	@xsk_tx_metadata_ops:	Includes pointers to AF_XDP TX metadata callbacks.
  *	@ethtool_ops:	Management operations
  *	@l3mdev_ops:	Layer 3 master device operations
  *	@ndisc_ops:	Includes callbacks for different IPv6 neighbour
@@ -2091,6 +2117,7 @@ struct net_device {
 	unsigned long long	priv_flags;
 	const struct net_device_ops *netdev_ops;
 	const struct xdp_metadata_ops *xdp_metadata_ops;
+	const struct xsk_tx_metadata_ops *xsk_tx_metadata_ops;
 	int			ifindex;
 	unsigned short		gflags;
 	unsigned short		hard_header_len;
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 16a49ba534e4..5d73d5df67fb 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -579,7 +579,10 @@ struct skb_shared_info {
 	/* Warning: this field is not always filled in (UFO)! */
 	unsigned short	gso_segs;
 	struct sk_buff	*frag_list;
-	struct skb_shared_hwtstamps hwtstamps;
+	union {
+		struct skb_shared_hwtstamps hwtstamps;
+		struct xsk_tx_metadata *xsk_meta;
+	};
 	unsigned int	gso_type;
 	u32		tskey;
 
diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 467b9fb56827..288fa58c4665 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -90,6 +90,54 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
 int __xsk_map_redirect(struct xdp_sock *xs, struct xdp_buff *xdp);
 void __xsk_map_flush(void);
 
+/**
+ *  xsk_tx_metadata_request - Evaluate AF_XDP TX metadata at submission
+ *  and call appropriate xsk_tx_metadata_ops operation.
+ *  @meta: pointer to AF_XDP metadata area
+ *  @ops: pointer to struct xsk_tx_metadata_ops
+ *  @priv: pointer to driver-private aread
+ *
+ *  This function should be called by the networking device when
+ *  it prepares AF_XDP egress packet.
+ */
+static inline void xsk_tx_metadata_request(const struct xsk_tx_metadata *meta,
+					   const struct xsk_tx_metadata_ops *ops,
+					   void *priv)
+{
+	if (!meta)
+		return;
+
+	if (ops->tmo_request_timestamp)
+		if (meta->flags & XDP_TX_METADATA_TIMESTAMP)
+			ops->tmo_request_timestamp(priv);
+
+	if (ops->tmo_request_checksum)
+		if (meta->flags & XDP_TX_METADATA_CHECKSUM)
+			ops->tmo_request_checksum(meta->csum_start, meta->csum_offset, priv);
+}
+
+/**
+ *  xsk_tx_metadata_complete - Evaluate AF_XDP TX metadata at completion
+ *  and call appropriate xsk_tx_metadata_ops operation.
+ *  @meta: pointer to AF_XDP metadata area
+ *  @ops: pointer to struct xsk_tx_metadata_ops
+ *  @priv: pointer to driver-private aread
+ *
+ *  This function should be called by the networking device upon
+ *  AF_XDP egress completion.
+ */
+static inline void xsk_tx_metadata_complete(struct xsk_tx_metadata *meta,
+					    const struct xsk_tx_metadata_ops *ops,
+					    void *priv)
+{
+	if (!meta)
+		return;
+
+	if (ops->tmo_fill_timestamp)
+		if (meta->flags & XDP_TX_METADATA_TIMESTAMP)
+			meta->tx_timestamp = ops->tmo_fill_timestamp(priv);
+}
+
 #else
 
 static inline int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
@@ -106,6 +154,18 @@ static inline void __xsk_map_flush(void)
 {
 }
 
+static inline void xsk_tx_metadata_request(struct xsk_tx_metadata *meta,
+					   const struct xsk_tx_metadata_ops *ops,
+					   void *priv)
+{
+}
+
+static inline void xsk_tx_metadata_complete(struct xsk_tx_metadata *meta,
+					    const struct xsk_tx_metadata_ops *ops,
+					    void *priv)
+{
+}
+
 #endif /* CONFIG_XDP_SOCKETS */
 
 #endif /* _LINUX_XDP_SOCK_H */
diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
index 1f6fc8c7a84c..e2558ac3e195 100644
--- a/include/net/xdp_sock_drv.h
+++ b/include/net/xdp_sock_drv.h
@@ -165,6 +165,14 @@ static inline void *xsk_buff_raw_get_data(struct xsk_buff_pool *pool, u64 addr)
 	return xp_raw_get_data(pool, addr);
 }
 
+static inline struct xsk_tx_metadata *xsk_buff_get_metadata(struct xsk_buff_pool *pool, u64 addr)
+{
+	if (!pool->tx_metadata_len)
+		return NULL;
+
+	return xp_raw_get_data(pool, addr) - pool->tx_metadata_len;
+}
+
 static inline void xsk_buff_dma_sync_for_cpu(struct xdp_buff *xdp, struct xsk_buff_pool *pool)
 {
 	struct xdp_buff_xsk *xskb = container_of(xdp, struct xdp_buff_xsk, xdp);
@@ -324,6 +332,11 @@ static inline void *xsk_buff_raw_get_data(struct xsk_buff_pool *pool, u64 addr)
 	return NULL;
 }
 
+static inline struct xsk_tx_metadata *xsk_buff_get_metadata(struct xsk_buff_pool *pool, u64 addr)
+{
+	return NULL;
+}
+
 static inline void xsk_buff_dma_sync_for_cpu(struct xdp_buff *xdp, struct xsk_buff_pool *pool)
 {
 }
diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
index 9c31e8d1e198..3a559753e793 100644
--- a/include/net/xsk_buff_pool.h
+++ b/include/net/xsk_buff_pool.h
@@ -234,4 +234,9 @@ static inline u64 xp_get_handle(struct xdp_buff_xsk *xskb)
 	return xskb->orig_addr + (offset << XSK_UNALIGNED_BUF_OFFSET_SHIFT);
 }
 
+static inline bool xp_tx_metadata_enabled(const xdp_buff_xsk *xskb)
+{
+	return sq->xsk_pool->tx_metadata_len > 0;
+}
+
 #endif /* XSK_BUFF_POOL_H_ */
diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index b37b50102e1c..b9b1b2c4108a 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -106,6 +106,38 @@ struct xdp_options {
 #define XSK_UNALIGNED_BUF_ADDR_MASK \
 	((1ULL << XSK_UNALIGNED_BUF_OFFSET_SHIFT) - 1)
 
+/* Request transmit timestamp. Upon completion, put it into tx_timestamp
+ * field of struct xsk_tx_metadata.
+ */
+#define XDP_TX_METADATA_TIMESTAMP		(1 << 0)
+
+/* Request transmit checksum offload. Checksum start position and offset
+ * are communicated via csum_start and csum_offset fields of struct
+ * xsk_tx_metadata.
+ */
+#define XDP_TX_METADATA_CHECKSUM		(1 << 1)
+
+/* Force checksum calculation in software. Can be used for testing or
+ * working around potential HW issues. This option causes performance
+ * degradation and only works in XDP_COPY mode.
+ */
+#define XDP_TX_METADATA_CHECKSUM_SW		(1 << 2)
+
+struct xsk_tx_metadata {
+	__u32 flags;
+
+	/* XDP_TX_METADATA_CHECKSUM */
+
+	/* Offset from desc->addr where checksumming should start. */
+	__u16 csum_start;
+	/* Offset from csum_start where checksum should be stored. */
+	__u16 csum_offset;
+
+	/* XDP_TX_METADATA_TIMESTAMP */
+
+	__u64 tx_timestamp;
+};
+
 /* Rx/Tx descriptor */
 struct xdp_desc {
 	__u64 addr;
@@ -122,4 +154,7 @@ struct xdp_desc {
  */
 #define XDP_PKT_CONTD (1 << 0)
 
+/* TX packet carries valid metadata. */
+#define XDP_TX_METADATA (1 << 1)
+
 #endif /* _LINUX_IF_XDP_H */
diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
index c1634b95c223..138e467a09d6 100644
--- a/include/uapi/linux/netdev.h
+++ b/include/uapi/linux/netdev.h
@@ -38,11 +38,27 @@ enum netdev_xdp_act {
 	NETDEV_XDP_ACT_MASK = 127,
 };
 
+/**
+ * enum netdev_xsk_flags
+ * @NETDEV_XSK_FLAGS_TX_TIMESTAMP: HW timestamping egress packets is supported
+ *   by the driver.
+ * @NETDEV_XSK_FLAGS_TX_CHECKSUM: L3 checksum HW offload is supported by the
+ *   driver.
+ */
+enum netdev_xsk_flags {
+	NETDEV_XSK_FLAGS_TX_TIMESTAMP = 1,
+	NETDEV_XSK_FLAGS_TX_CHECKSUM = 2,
+
+	/* private: */
+	NETDEV_XSK_FLAGS_MASK = 3,
+};
+
 enum {
 	NETDEV_A_DEV_IFINDEX = 1,
 	NETDEV_A_DEV_PAD,
 	NETDEV_A_DEV_XDP_FEATURES,
 	NETDEV_A_DEV_XDP_ZC_MAX_SEGS,
+	NETDEV_A_DEV_XSK_FEATURES,
 
 	__NETDEV_A_DEV_MAX,
 	NETDEV_A_DEV_MAX = (__NETDEV_A_DEV_MAX - 1)
diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
index 797c813c7c77..18f9fb3ff43d 100644
--- a/net/core/netdev-genl.c
+++ b/net/core/netdev-genl.c
@@ -12,15 +12,25 @@ static int
 netdev_nl_dev_fill(struct net_device *netdev, struct sk_buff *rsp,
 		   u32 portid, u32 seq, int flags, u32 cmd)
 {
+	u64 xsk_flags = 0;
 	void *hdr;
 
 	hdr = genlmsg_put(rsp, portid, seq, &netdev_nl_family, flags, cmd);
 	if (!hdr)
 		return -EMSGSIZE;
 
+	if (netdev->xsk_tx_metadata_ops) {
+		if (netdev->xsk_tx_metadata_ops->tmo_fill_timestamp)
+			xsk_flags |= NETDEV_XSK_FLAGS_TX_TIMESTAMP;
+		if (netdev->xsk_tx_metadata_ops->tmo_request_checksum)
+			xsk_flags |= NETDEV_XSK_FLAGS_TX_CHECKSUM;
+	}
+
 	if (nla_put_u32(rsp, NETDEV_A_DEV_IFINDEX, netdev->ifindex) ||
 	    nla_put_u64_64bit(rsp, NETDEV_A_DEV_XDP_FEATURES,
-			      netdev->xdp_features, NETDEV_A_DEV_PAD)) {
+			      netdev->xdp_features, NETDEV_A_DEV_PAD) ||
+	    nla_put_u64_64bit(rsp, NETDEV_A_DEV_XSK_FEATURES,
+			      xsk_flags, NETDEV_A_DEV_PAD)) {
 		genlmsg_cancel(rsp, hdr);
 		return -EINVAL;
 	}
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 28df3280501d..bd9b0e82eb25 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -543,6 +543,15 @@ static u32 xsk_get_num_desc(struct sk_buff *skb)
 
 static void xsk_destruct_skb(struct sk_buff *skb)
 {
+	struct xsk_tx_metadata *meta = skb_shinfo(skb)->xsk_meta;
+
+	if (meta) {
+		if (meta->flags & XDP_TX_METADATA_TIMESTAMP) {
+			/* sw completion timestamp, not a real one */
+			meta->tx_timestamp = ktime_get_tai_fast_ns();
+		}
+	}
+
 	xsk_cq_submit_locked(xdp_sk(skb->sk), xsk_get_num_desc(skb));
 	sock_wfree(skb);
 }
@@ -627,8 +636,10 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
 static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
 				     struct xdp_desc *desc)
 {
+	struct xsk_tx_metadata *meta = NULL;
 	struct net_device *dev = xs->dev;
 	struct sk_buff *skb = xs->skb;
+	bool first_frag = false;
 	int err;
 
 	if (dev->priv_flags & IFF_TX_SKB_NO_LINEAR) {
@@ -657,6 +668,8 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
 			err = skb_store_bits(skb, 0, buffer, len);
 			if (unlikely(err))
 				goto free_err;
+
+			first_frag = true;
 		} else {
 			int nr_frags = skb_shinfo(skb)->nr_frags;
 			struct page *page;
@@ -679,12 +692,40 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
 
 			skb_add_rx_frag(skb, nr_frags, page, 0, len, 0);
 		}
+
+		if (first_frag && desc->options & XDP_TX_METADATA) {
+			if (unlikely(xs->tx_metadata_len == 0)) {
+				err = -EINVAL;
+				goto free_err;
+			}
+
+			meta = buffer - xs->tx_metadata_len;
+
+			if (meta->flags & XDP_TX_METADATA_CHECKSUM) {
+				if (unlikely(meta->csum_start + meta->csum_offset +
+					     sizeof(__sum16) > len)) {
+					err = -EINVAL;
+					goto free_err;
+				}
+
+				skb->csum_start = hr + meta->csum_start;
+				skb->csum_offset = meta->csum_offset;
+				skb->ip_summed = CHECKSUM_PARTIAL;
+
+				if (unlikely(meta->flags & XDP_TX_METADATA_CHECKSUM_SW)) {
+					err = skb_checksum_help(skb);
+					if (err)
+						goto free_err;
+				}
+			}
+		}
 	}
 
 	skb->dev = dev;
 	skb->priority = xs->sk.sk_priority;
 	skb->mark = READ_ONCE(xs->sk.sk_mark);
 	skb->destructor = xsk_destruct_skb;
+	skb_shinfo(skb)->xsk_meta = meta;
 	xsk_set_destructor_arg(skb);
 
 	return skb;
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index c74a1372bcb9..6f2d1621c992 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -137,7 +137,7 @@ static inline bool xskq_cons_read_addr_unchecked(struct xsk_queue *q, u64 *addr)
 
 static inline bool xp_unused_options_set(u32 options)
 {
-	return options & ~XDP_PKT_CONTD;
+	return options & ~(XDP_PKT_CONTD | XDP_TX_METADATA);
 }
 
 static inline bool xp_aligned_validate_desc(struct xsk_buff_pool *pool,
diff --git a/tools/include/uapi/linux/if_xdp.h b/tools/include/uapi/linux/if_xdp.h
index 73a47da885dc..b9b1b2c4108a 100644
--- a/tools/include/uapi/linux/if_xdp.h
+++ b/tools/include/uapi/linux/if_xdp.h
@@ -26,11 +26,11 @@
  */
 #define XDP_USE_NEED_WAKEUP (1 << 3)
 /* By setting this option, userspace application indicates that it can
- * handle multiple descriptors per packet thus enabling xsk core to split
+ * handle multiple descriptors per packet thus enabling AF_XDP to split
  * multi-buffer XDP frames into multiple Rx descriptors. Without this set
- * such frames will be dropped by xsk.
+ * such frames will be dropped.
  */
-#define XDP_USE_SG     (1 << 4)
+#define XDP_USE_SG	(1 << 4)
 
 /* Flags for xsk_umem_config flags */
 #define XDP_UMEM_UNALIGNED_CHUNK_FLAG (1 << 0)
@@ -69,6 +69,7 @@ struct xdp_mmap_offsets {
 #define XDP_UMEM_COMPLETION_RING	6
 #define XDP_STATISTICS			7
 #define XDP_OPTIONS			8
+#define XDP_TX_METADATA_LEN		9
 
 struct xdp_umem_reg {
 	__u64 addr; /* Start of packet data area */
@@ -105,6 +106,38 @@ struct xdp_options {
 #define XSK_UNALIGNED_BUF_ADDR_MASK \
 	((1ULL << XSK_UNALIGNED_BUF_OFFSET_SHIFT) - 1)
 
+/* Request transmit timestamp. Upon completion, put it into tx_timestamp
+ * field of struct xsk_tx_metadata.
+ */
+#define XDP_TX_METADATA_TIMESTAMP		(1 << 0)
+
+/* Request transmit checksum offload. Checksum start position and offset
+ * are communicated via csum_start and csum_offset fields of struct
+ * xsk_tx_metadata.
+ */
+#define XDP_TX_METADATA_CHECKSUM		(1 << 1)
+
+/* Force checksum calculation in software. Can be used for testing or
+ * working around potential HW issues. This option causes performance
+ * degradation and only works in XDP_COPY mode.
+ */
+#define XDP_TX_METADATA_CHECKSUM_SW		(1 << 2)
+
+struct xsk_tx_metadata {
+	__u32 flags;
+
+	/* XDP_TX_METADATA_CHECKSUM */
+
+	/* Offset from desc->addr where checksumming should start. */
+	__u16 csum_start;
+	/* Offset from csum_start where checksum should be stored. */
+	__u16 csum_offset;
+
+	/* XDP_TX_METADATA_TIMESTAMP */
+
+	__u64 tx_timestamp;
+};
+
 /* Rx/Tx descriptor */
 struct xdp_desc {
 	__u64 addr;
@@ -112,9 +145,16 @@ struct xdp_desc {
 	__u32 options;
 };
 
-/* Flag indicating packet constitutes of multiple buffers*/
+/* UMEM descriptor is __u64 */
+
+/* Flag indicating that the packet continues with the buffer pointed out by the
+ * next frame in the ring. The end of the packet is signalled by setting this
+ * bit to zero. For single buffer packets, every descriptor has 'options' set
+ * to 0 and this maintains backward compatibility.
+ */
 #define XDP_PKT_CONTD (1 << 0)
 
-/* UMEM descriptor is __u64 */
+/* TX packet carries valid metadata. */
+#define XDP_TX_METADATA (1 << 1)
 
 #endif /* _LINUX_IF_XDP_H */
diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h
index c1634b95c223..e8fdc530dcc9 100644
--- a/tools/include/uapi/linux/netdev.h
+++ b/tools/include/uapi/linux/netdev.h
@@ -38,11 +38,26 @@ enum netdev_xdp_act {
 	NETDEV_XDP_ACT_MASK = 127,
 };
 
+/**
+ * enum netdev_xsk_flags
+ * @NETDEV_XSK_FLAGS_TX_TIMESTAMP: HW timestamping egress packets is supported
+ *   by the driver.
+ * @NETDEV_XSK_FLAGS_TX_CHECKSUM: L3 checksum HW offload is supported by the
+ *   driver.
+ */
+enum netdev_xsk_flags {
+	NETDEV_XSK_FLAGS_TX_TIMESTAMP = 1,
+	NETDEV_XSK_FLAGS_TX_CHECKSUM = 2,
+
+	NETDEV_XSK_FLAGS_MASK = 3,
+};
+
 enum {
 	NETDEV_A_DEV_IFINDEX = 1,
 	NETDEV_A_DEV_PAD,
 	NETDEV_A_DEV_XDP_FEATURES,
 	NETDEV_A_DEV_XDP_ZC_MAX_SEGS,
+	NETDEV_A_DEV_XSK_FEATURES,
 
 	__NETDEV_A_DEV_MAX,
 	NETDEV_A_DEV_MAX = (__NETDEV_A_DEV_MAX - 1)
-- 
2.41.0.640.ga95def55d0-goog


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [xdp-hints] [PATCH bpf-next 3/9] tools: ynl: print xsk-features from the sample
  2023-08-09 16:54 [xdp-hints] [PATCH bpf-next 0/9] xsk: TX metadata Stanislav Fomichev
  2023-08-09 16:54 ` [xdp-hints] [PATCH bpf-next 1/9] xsk: Support XDP_TX_METADATA_LEN Stanislav Fomichev
  2023-08-09 16:54 ` [xdp-hints] [PATCH bpf-next 2/9] xsk: add TX timestamp and TX checksum offload support Stanislav Fomichev
@ 2023-08-09 16:54 ` Stanislav Fomichev
  2023-08-09 16:54 ` [xdp-hints] [PATCH bpf-next 4/9] net/mlx5e: Implement AF_XDP TX timestamp and checksum offload Stanislav Fomichev
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 29+ messages in thread
From: Stanislav Fomichev @ 2023-08-09 16:54 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, kuba, toke, willemb, dsahern,
	magnus.karlsson, bjorn, maciej.fijalkowski, hawk, netdev,
	xdp-hints

Regenerate the userspace specs and print xsk-features bitmask.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 tools/net/ynl/generated/netdev-user.c | 19 +++++++++++++++++++
 tools/net/ynl/generated/netdev-user.h |  3 +++
 tools/net/ynl/samples/netdev.c        |  6 ++++++
 3 files changed, 28 insertions(+)

diff --git a/tools/net/ynl/generated/netdev-user.c b/tools/net/ynl/generated/netdev-user.c
index 68b408ca0f7f..f8dd6aa0ad97 100644
--- a/tools/net/ynl/generated/netdev-user.c
+++ b/tools/net/ynl/generated/netdev-user.c
@@ -45,12 +45,26 @@ const char *netdev_xdp_act_str(enum netdev_xdp_act value)
 	return netdev_xdp_act_strmap[value];
 }
 
+static const char * const netdev_xsk_flags_strmap[] = {
+	[0] = "tx-timestamp",
+	[1] = "tx-checksum",
+};
+
+const char *netdev_xsk_flags_str(enum netdev_xsk_flags value)
+{
+	value = ffs(value) - 1;
+	if (value < 0 || value >= (int)MNL_ARRAY_SIZE(netdev_xsk_flags_strmap))
+		return NULL;
+	return netdev_xsk_flags_strmap[value];
+}
+
 /* Policies */
 struct ynl_policy_attr netdev_dev_policy[NETDEV_A_DEV_MAX + 1] = {
 	[NETDEV_A_DEV_IFINDEX] = { .name = "ifindex", .type = YNL_PT_U32, },
 	[NETDEV_A_DEV_PAD] = { .name = "pad", .type = YNL_PT_IGNORE, },
 	[NETDEV_A_DEV_XDP_FEATURES] = { .name = "xdp-features", .type = YNL_PT_U64, },
 	[NETDEV_A_DEV_XDP_ZC_MAX_SEGS] = { .name = "xdp-zc-max-segs", .type = YNL_PT_U32, },
+	[NETDEV_A_DEV_XSK_FEATURES] = { .name = "xsk-features", .type = YNL_PT_U64, },
 };
 
 struct ynl_policy_nest netdev_dev_nest = {
@@ -97,6 +111,11 @@ int netdev_dev_get_rsp_parse(const struct nlmsghdr *nlh, void *data)
 				return MNL_CB_ERROR;
 			dst->_present.xdp_zc_max_segs = 1;
 			dst->xdp_zc_max_segs = mnl_attr_get_u32(attr);
+		} else if (type == NETDEV_A_DEV_XSK_FEATURES) {
+			if (ynl_attr_validate(yarg, attr))
+				return MNL_CB_ERROR;
+			dst->_present.xsk_features = 1;
+			dst->xsk_features = mnl_attr_get_u64(attr);
 		}
 	}
 
diff --git a/tools/net/ynl/generated/netdev-user.h b/tools/net/ynl/generated/netdev-user.h
index 0952d3261f4d..b8c5cdb331b4 100644
--- a/tools/net/ynl/generated/netdev-user.h
+++ b/tools/net/ynl/generated/netdev-user.h
@@ -18,6 +18,7 @@ extern const struct ynl_family ynl_netdev_family;
 /* Enums */
 const char *netdev_op_str(int op);
 const char *netdev_xdp_act_str(enum netdev_xdp_act value);
+const char *netdev_xsk_flags_str(enum netdev_xsk_flags value);
 
 /* Common nested types */
 /* ============== NETDEV_CMD_DEV_GET ============== */
@@ -48,11 +49,13 @@ struct netdev_dev_get_rsp {
 		__u32 ifindex:1;
 		__u32 xdp_features:1;
 		__u32 xdp_zc_max_segs:1;
+		__u32 xsk_features:1;
 	} _present;
 
 	__u32 ifindex;
 	__u64 xdp_features;
 	__u32 xdp_zc_max_segs;
+	__u64 xsk_features;
 };
 
 void netdev_dev_get_rsp_free(struct netdev_dev_get_rsp *rsp);
diff --git a/tools/net/ynl/samples/netdev.c b/tools/net/ynl/samples/netdev.c
index 06433400dddd..06377e3f1df5 100644
--- a/tools/net/ynl/samples/netdev.c
+++ b/tools/net/ynl/samples/netdev.c
@@ -38,6 +38,12 @@ static void netdev_print_device(struct netdev_dev_get_rsp *d, unsigned int op)
 			printf(" %s", netdev_xdp_act_str(1 << i));
 	}
 
+	printf(" %llx:", d->xsk_features);
+	for (int i = 0; d->xsk_features > 1U << i; i++) {
+		if (d->xsk_features & (1U << i))
+			printf(" %s", netdev_xsk_flags_str(1 << i));
+	}
+
 	printf(" xdp-zc-max-segs=%u", d->xdp_zc_max_segs);
 
 	name = netdev_op_str(op);
-- 
2.41.0.640.ga95def55d0-goog


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [xdp-hints] [PATCH bpf-next 4/9] net/mlx5e: Implement AF_XDP TX timestamp and checksum offload
  2023-08-09 16:54 [xdp-hints] [PATCH bpf-next 0/9] xsk: TX metadata Stanislav Fomichev
                   ` (2 preceding siblings ...)
  2023-08-09 16:54 ` [xdp-hints] [PATCH bpf-next 3/9] tools: ynl: print xsk-features from the sample Stanislav Fomichev
@ 2023-08-09 16:54 ` Stanislav Fomichev
  2023-08-14 11:02   ` [xdp-hints] " Maciej Fijalkowski
  2023-08-09 16:54 ` [xdp-hints] [PATCH bpf-next 5/9] selftests/xsk: Support XDP_TX_METADATA_LEN Stanislav Fomichev
                   ` (6 subsequent siblings)
  10 siblings, 1 reply; 29+ messages in thread
From: Stanislav Fomichev @ 2023-08-09 16:54 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, kuba, toke, willemb, dsahern,
	magnus.karlsson, bjorn, maciej.fijalkowski, hawk, netdev,
	xdp-hints, Saeed Mahameed

TX timestamp:
- requires passing clock, not sure I'm passing the correct one (from
  cq->mdev), but the timestamp value looks convincing

TX checksum:
- looks like device does packet parsing (and doesn't accept custom
  start/offset), so I'm ignoring user offsets

Cc: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h  |  4 +-
 .../net/ethernet/mellanox/mlx5/core/en/xdp.c  | 72 ++++++++++++++++---
 .../net/ethernet/mellanox/mlx5/core/en/xdp.h  | 10 ++-
 .../ethernet/mellanox/mlx5/core/en/xsk/tx.c   | 11 ++-
 .../net/ethernet/mellanox/mlx5/core/en_main.c |  1 +
 5 files changed, 82 insertions(+), 16 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 0f8f70b91485..6f38627ae7f8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -476,10 +476,12 @@ struct mlx5e_xdp_info_fifo {
 
 struct mlx5e_xdpsq;
 struct mlx5e_xmit_data;
+struct xsk_tx_metadata;
 typedef int (*mlx5e_fp_xmit_xdp_frame_check)(struct mlx5e_xdpsq *);
 typedef bool (*mlx5e_fp_xmit_xdp_frame)(struct mlx5e_xdpsq *,
 					struct mlx5e_xmit_data *,
-					int);
+					int,
+					struct xsk_tx_metadata *);
 
 struct mlx5e_xdpsq {
 	/* data path */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
index 40589cebb773..197d372048ec 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
@@ -102,7 +102,7 @@ mlx5e_xmit_xdp_buff(struct mlx5e_xdpsq *sq, struct mlx5e_rq *rq,
 		xdptxd->dma_addr = dma_addr;
 
 		if (unlikely(!INDIRECT_CALL_2(sq->xmit_xdp_frame, mlx5e_xmit_xdp_frame_mpwqe,
-					      mlx5e_xmit_xdp_frame, sq, xdptxd, 0)))
+					      mlx5e_xmit_xdp_frame, sq, xdptxd, 0, NULL)))
 			return false;
 
 		/* xmit_mode == MLX5E_XDP_XMIT_MODE_FRAME */
@@ -144,7 +144,7 @@ mlx5e_xmit_xdp_buff(struct mlx5e_xdpsq *sq, struct mlx5e_rq *rq,
 	xdptxd->dma_addr = dma_addr;
 
 	if (unlikely(!INDIRECT_CALL_2(sq->xmit_xdp_frame, mlx5e_xmit_xdp_frame_mpwqe,
-				      mlx5e_xmit_xdp_frame, sq, xdptxd, 0)))
+				      mlx5e_xmit_xdp_frame, sq, xdptxd, 0, NULL)))
 		return false;
 
 	/* xmit_mode == MLX5E_XDP_XMIT_MODE_PAGE */
@@ -260,6 +260,37 @@ const struct xdp_metadata_ops mlx5e_xdp_metadata_ops = {
 	.xmo_rx_hash			= mlx5e_xdp_rx_hash,
 };
 
+struct mlx5e_xsk_tx_complete {
+	struct mlx5_cqe64 *cqe;
+	struct mlx5e_cq *cq;
+};
+
+static u64 mlx5e_xsk_fill_timestamp(void *_priv)
+{
+	struct mlx5e_xsk_tx_complete *priv = _priv;
+	u64 ts;
+
+	ts = get_cqe_ts(priv->cqe);
+
+	if (mlx5_is_real_time_rq(priv->cq->mdev) || mlx5_is_real_time_sq(priv->cq->mdev))
+		return mlx5_real_time_cyc2time(&priv->cq->mdev->clock, ts);
+
+	return  mlx5_timecounter_cyc2time(&priv->cq->mdev->clock, ts);
+}
+
+static void mlx5e_xsk_request_checksum(u16 csum_start, u16 csum_offset, void *priv)
+{
+	struct mlx5_wqe_eth_seg *eseg = priv;
+
+	/* HW/FW is doing parsing, so offsets are largely ignored. */
+	eseg->cs_flags |= MLX5_ETH_WQE_L3_CSUM | MLX5_ETH_WQE_L4_CSUM;
+}
+
+const struct xsk_tx_metadata_ops mlx5e_xsk_tx_metadata_ops = {
+	.tmo_fill_timestamp		= mlx5e_xsk_fill_timestamp,
+	.tmo_request_checksum		= mlx5e_xsk_request_checksum,
+};
+
 /* returns true if packet was consumed by xdp */
 bool mlx5e_xdp_handle(struct mlx5e_rq *rq,
 		      struct bpf_prog *prog, struct mlx5e_xdp_buff *mxbuf)
@@ -397,11 +428,11 @@ INDIRECT_CALLABLE_SCOPE int mlx5e_xmit_xdp_frame_check_mpwqe(struct mlx5e_xdpsq
 
 INDIRECT_CALLABLE_SCOPE bool
 mlx5e_xmit_xdp_frame(struct mlx5e_xdpsq *sq, struct mlx5e_xmit_data *xdptxd,
-		     int check_result);
+		     int check_result, struct xsk_tx_metadata *meta);
 
 INDIRECT_CALLABLE_SCOPE bool
 mlx5e_xmit_xdp_frame_mpwqe(struct mlx5e_xdpsq *sq, struct mlx5e_xmit_data *xdptxd,
-			   int check_result)
+			   int check_result, struct xsk_tx_metadata *meta)
 {
 	struct mlx5e_tx_mpwqe *session = &sq->mpwqe;
 	struct mlx5e_xdpsq_stats *stats = sq->stats;
@@ -419,7 +450,7 @@ mlx5e_xmit_xdp_frame_mpwqe(struct mlx5e_xdpsq *sq, struct mlx5e_xmit_data *xdptx
 			 */
 			if (unlikely(sq->mpwqe.wqe))
 				mlx5e_xdp_mpwqe_complete(sq);
-			return mlx5e_xmit_xdp_frame(sq, xdptxd, 0);
+			return mlx5e_xmit_xdp_frame(sq, xdptxd, 0, meta);
 		}
 		if (!xdptxd->len) {
 			skb_frag_t *frag = &xdptxdf->sinfo->frags[0];
@@ -449,6 +480,7 @@ mlx5e_xmit_xdp_frame_mpwqe(struct mlx5e_xdpsq *sq, struct mlx5e_xmit_data *xdptx
 		 * and it's safe to complete it at any time.
 		 */
 		mlx5e_xdp_mpwqe_session_start(sq);
+		xsk_tx_metadata_request(meta, &mlx5e_xsk_tx_metadata_ops, &session->wqe->eth);
 	}
 
 	mlx5e_xdp_mpwqe_add_dseg(sq, p, stats);
@@ -479,7 +511,7 @@ INDIRECT_CALLABLE_SCOPE int mlx5e_xmit_xdp_frame_check(struct mlx5e_xdpsq *sq)
 
 INDIRECT_CALLABLE_SCOPE bool
 mlx5e_xmit_xdp_frame(struct mlx5e_xdpsq *sq, struct mlx5e_xmit_data *xdptxd,
-		     int check_result)
+		     int check_result, struct xsk_tx_metadata *meta)
 {
 	struct mlx5e_xmit_data_frags *xdptxdf =
 		container_of(xdptxd, struct mlx5e_xmit_data_frags, xd);
@@ -598,6 +630,8 @@ mlx5e_xmit_xdp_frame(struct mlx5e_xdpsq *sq, struct mlx5e_xmit_data *xdptxd,
 		sq->pc++;
 	}
 
+	xsk_tx_metadata_request(meta, &mlx5e_xsk_tx_metadata_ops, eseg);
+
 	sq->doorbell_cseg = cseg;
 
 	stats->xmit++;
@@ -607,7 +641,9 @@ mlx5e_xmit_xdp_frame(struct mlx5e_xdpsq *sq, struct mlx5e_xmit_data *xdptxd,
 static void mlx5e_free_xdpsq_desc(struct mlx5e_xdpsq *sq,
 				  struct mlx5e_xdp_wqe_info *wi,
 				  u32 *xsk_frames,
-				  struct xdp_frame_bulk *bq)
+				  struct xdp_frame_bulk *bq,
+				  struct mlx5e_cq *cq,
+				  struct mlx5_cqe64 *cqe)
 {
 	struct mlx5e_xdp_info_fifo *xdpi_fifo = &sq->db.xdpi_fifo;
 	u16 i;
@@ -667,10 +703,24 @@ static void mlx5e_free_xdpsq_desc(struct mlx5e_xdpsq *sq,
 
 			break;
 		}
-		case MLX5E_XDP_XMIT_MODE_XSK:
+		case MLX5E_XDP_XMIT_MODE_XSK: {
 			/* AF_XDP send */
+			struct mlx5e_xsk_tx_complete priv = {
+				.cqe = cqe,
+				.cq = cq,
+			};
+			struct xsk_tx_metadata *meta = NULL;
+
+			if (xp_tx_metadata_enabled(sq->xsk_pool)) {
+				xdpi = mlx5e_xdpi_fifo_pop(xdpi_fifo);
+				meta = (void *)xdpi.frame.xsk_meta;
+
+				xsk_tx_metadata_complete(meta, &mlx5e_xsk_tx_metadata_ops, &priv);
+			}
+
 			(*xsk_frames)++;
 			break;
+		}
 		default:
 			WARN_ON_ONCE(true);
 		}
@@ -719,7 +769,7 @@ bool mlx5e_poll_xdpsq_cq(struct mlx5e_cq *cq)
 
 			sqcc += wi->num_wqebbs;
 
-			mlx5e_free_xdpsq_desc(sq, wi, &xsk_frames, &bq);
+			mlx5e_free_xdpsq_desc(sq, wi, &xsk_frames, &bq, cq, cqe);
 		} while (!last_wqe);
 
 		if (unlikely(get_cqe_opcode(cqe) != MLX5_CQE_REQ)) {
@@ -766,7 +816,7 @@ void mlx5e_free_xdpsq_descs(struct mlx5e_xdpsq *sq)
 
 		sq->cc += wi->num_wqebbs;
 
-		mlx5e_free_xdpsq_desc(sq, wi, &xsk_frames, &bq);
+		mlx5e_free_xdpsq_desc(sq, wi, &xsk_frames, &bq, NULL, NULL);
 	}
 
 	xdp_flush_frame_bulk(&bq);
@@ -839,7 +889,7 @@ int mlx5e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
 		}
 
 		ret = INDIRECT_CALL_2(sq->xmit_xdp_frame, mlx5e_xmit_xdp_frame_mpwqe,
-				      mlx5e_xmit_xdp_frame, sq, xdptxd, 0);
+				      mlx5e_xmit_xdp_frame, sq, xdptxd, 0, NULL);
 		if (unlikely(!ret)) {
 			int j;
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
index 9e8e6184f9e4..2fcd19c16103 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
@@ -82,13 +82,14 @@ enum mlx5e_xdp_xmit_mode {
  *    num, page_1, page_2, ... , page_num.
  *
  * MLX5E_XDP_XMIT_MODE_XSK:
- *    none.
+ *    frame.xsk_meta.
  */
 union mlx5e_xdp_info {
 	enum mlx5e_xdp_xmit_mode mode;
 	union {
 		struct xdp_frame *xdpf;
 		dma_addr_t dma_addr;
+		void *xsk_meta;
 	} frame;
 	union {
 		struct mlx5e_rq *rq;
@@ -110,13 +111,16 @@ int mlx5e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
 		   u32 flags);
 
 extern const struct xdp_metadata_ops mlx5e_xdp_metadata_ops;
+extern const struct xsk_tx_metadata_ops mlx5e_xsk_tx_metadata_ops;
 
 INDIRECT_CALLABLE_DECLARE(bool mlx5e_xmit_xdp_frame_mpwqe(struct mlx5e_xdpsq *sq,
 							  struct mlx5e_xmit_data *xdptxd,
-							  int check_result));
+							  int check_result,
+							  struct xsk_tx_metadata *meta));
 INDIRECT_CALLABLE_DECLARE(bool mlx5e_xmit_xdp_frame(struct mlx5e_xdpsq *sq,
 						    struct mlx5e_xmit_data *xdptxd,
-						    int check_result));
+						    int check_result,
+						    struct xsk_tx_metadata *meta));
 INDIRECT_CALLABLE_DECLARE(int mlx5e_xmit_xdp_frame_check_mpwqe(struct mlx5e_xdpsq *sq));
 INDIRECT_CALLABLE_DECLARE(int mlx5e_xmit_xdp_frame_check(struct mlx5e_xdpsq *sq));
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.c
index 597f319d4770..2f69c7912490 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.c
@@ -55,12 +55,16 @@ static void mlx5e_xsk_tx_post_err(struct mlx5e_xdpsq *sq,
 
 	nopwqe = mlx5e_post_nop(&sq->wq, sq->sqn, &sq->pc);
 	mlx5e_xdpi_fifo_push(&sq->db.xdpi_fifo, *xdpi);
+	if (xp_tx_metadata_enabled(sq->xsk_pool))
+		mlx5e_xdpi_fifo_push(&sq->db.xdpi_fifo,
+				     (union mlx5e_xdp_info) { .frame.xsk_meta = NULL });
 	sq->doorbell_cseg = &nopwqe->ctrl;
 }
 
 bool mlx5e_xsk_tx(struct mlx5e_xdpsq *sq, unsigned int budget)
 {
 	struct xsk_buff_pool *pool = sq->xsk_pool;
+	struct xsk_tx_metadata *meta = NULL;
 	union mlx5e_xdp_info xdpi;
 	bool work_done = true;
 	bool flush = false;
@@ -93,12 +97,13 @@ bool mlx5e_xsk_tx(struct mlx5e_xdpsq *sq, unsigned int budget)
 		xdptxd.dma_addr = xsk_buff_raw_get_dma(pool, desc.addr);
 		xdptxd.data = xsk_buff_raw_get_data(pool, desc.addr);
 		xdptxd.len = desc.len;
+		meta = xsk_buff_get_metadata(pool, desc.addr);
 
 		xsk_buff_raw_dma_sync_for_device(pool, xdptxd.dma_addr, xdptxd.len);
 
 		ret = INDIRECT_CALL_2(sq->xmit_xdp_frame, mlx5e_xmit_xdp_frame_mpwqe,
 				      mlx5e_xmit_xdp_frame, sq, &xdptxd,
-				      check_result);
+				      check_result, meta);
 		if (unlikely(!ret)) {
 			if (sq->mpwqe.wqe)
 				mlx5e_xdp_mpwqe_complete(sq);
@@ -106,6 +111,10 @@ bool mlx5e_xsk_tx(struct mlx5e_xdpsq *sq, unsigned int budget)
 			mlx5e_xsk_tx_post_err(sq, &xdpi);
 		} else {
 			mlx5e_xdpi_fifo_push(&sq->db.xdpi_fifo, xdpi);
+			if (xp_tx_metadata_enabled(sq->xsk_pool))
+				mlx5e_xdpi_fifo_push(&sq->db.xdpi_fifo,
+						     (union mlx5e_xdp_info)
+						     { .frame.xsk_meta = (void *)meta });
 		}
 
 		flush = true;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 1c820119e438..99c2a6babaea 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -5097,6 +5097,7 @@ static void mlx5e_build_nic_netdev(struct net_device *netdev)
 
 	netdev->netdev_ops = &mlx5e_netdev_ops;
 	netdev->xdp_metadata_ops = &mlx5e_xdp_metadata_ops;
+	netdev->xsk_tx_metadata_ops = &mlx5e_xsk_tx_metadata_ops;
 
 	mlx5e_dcbnl_build_netdev(netdev);
 
-- 
2.41.0.640.ga95def55d0-goog


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [xdp-hints] [PATCH bpf-next 5/9] selftests/xsk: Support XDP_TX_METADATA_LEN
  2023-08-09 16:54 [xdp-hints] [PATCH bpf-next 0/9] xsk: TX metadata Stanislav Fomichev
                   ` (3 preceding siblings ...)
  2023-08-09 16:54 ` [xdp-hints] [PATCH bpf-next 4/9] net/mlx5e: Implement AF_XDP TX timestamp and checksum offload Stanislav Fomichev
@ 2023-08-09 16:54 ` Stanislav Fomichev
  2023-08-09 16:54 ` [xdp-hints] [PATCH bpf-next 6/9] selftests/bpf: Add csum helpers Stanislav Fomichev
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 29+ messages in thread
From: Stanislav Fomichev @ 2023-08-09 16:54 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, kuba, toke, willemb, dsahern,
	magnus.karlsson, bjorn, maciej.fijalkowski, hawk, netdev,
	xdp-hints

Add new config field and call setsockopt.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 tools/testing/selftests/bpf/xsk.c | 17 +++++++++++++++++
 tools/testing/selftests/bpf/xsk.h |  1 +
 2 files changed, 18 insertions(+)

diff --git a/tools/testing/selftests/bpf/xsk.c b/tools/testing/selftests/bpf/xsk.c
index d9fb2b730a2c..cb7e48f24289 100644
--- a/tools/testing/selftests/bpf/xsk.c
+++ b/tools/testing/selftests/bpf/xsk.c
@@ -49,6 +49,10 @@
  #define PF_XDP AF_XDP
 #endif
 
+#ifndef XDP_TX_METADATA_LEN
+#define XDP_TX_METADATA_LEN 9
+#endif
+
 #define pr_warn(fmt, ...) fprintf(stderr, fmt, ##__VA_ARGS__)
 
 #define XSKMAP_SIZE 1
@@ -132,12 +136,14 @@ static int xsk_set_xdp_socket_config(struct xsk_socket_config *cfg,
 		cfg->rx_size = XSK_RING_CONS__DEFAULT_NUM_DESCS;
 		cfg->tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS;
 		cfg->bind_flags = 0;
+		cfg->tx_metadata_len = 0;
 		return 0;
 	}
 
 	cfg->rx_size = usr_cfg->rx_size;
 	cfg->tx_size = usr_cfg->tx_size;
 	cfg->bind_flags = usr_cfg->bind_flags;
+	cfg->tx_metadata_len = usr_cfg->tx_metadata_len;
 
 	return 0;
 }
@@ -613,6 +619,17 @@ int xsk_socket__create_shared(struct xsk_socket **xsk_ptr,
 			umem->tx_ring_setup_done = true;
 	}
 
+	if (xsk->config.tx_metadata_len) {
+		int optval = xsk->config.tx_metadata_len;
+
+		err = setsockopt(xsk->fd, SOL_XDP, XDP_TX_METADATA_LEN,
+				 &optval, sizeof(optval));
+		if (err) {
+			err = -errno;
+			goto out_put_ctx;
+		}
+	}
+
 	err = xsk_get_mmap_offsets(xsk->fd, &off);
 	if (err) {
 		err = -errno;
diff --git a/tools/testing/selftests/bpf/xsk.h b/tools/testing/selftests/bpf/xsk.h
index d93200fdaa8d..325fe0c83e5d 100644
--- a/tools/testing/selftests/bpf/xsk.h
+++ b/tools/testing/selftests/bpf/xsk.h
@@ -212,6 +212,7 @@ struct xsk_socket_config {
 	__u32 rx_size;
 	__u32 tx_size;
 	__u16 bind_flags;
+	__u8 tx_metadata_len;
 };
 
 /* Set config to NULL to get the default configuration. */
-- 
2.41.0.640.ga95def55d0-goog


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [xdp-hints] [PATCH bpf-next 6/9] selftests/bpf: Add csum helpers
  2023-08-09 16:54 [xdp-hints] [PATCH bpf-next 0/9] xsk: TX metadata Stanislav Fomichev
                   ` (4 preceding siblings ...)
  2023-08-09 16:54 ` [xdp-hints] [PATCH bpf-next 5/9] selftests/xsk: Support XDP_TX_METADATA_LEN Stanislav Fomichev
@ 2023-08-09 16:54 ` Stanislav Fomichev
  2023-08-09 16:54 ` [xdp-hints] [PATCH bpf-next 7/9] selftests/bpf: Add TX side to xdp_metadata Stanislav Fomichev
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 29+ messages in thread
From: Stanislav Fomichev @ 2023-08-09 16:54 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, kuba, toke, willemb, dsahern,
	magnus.karlsson, bjorn, maciej.fijalkowski, hawk, netdev,
	xdp-hints

Checksum helpers will be used to calculate pseudo-header checksum in
AF_XDP metadata selftests.

The helpers are mirroring existing kernel ones:
- csum_tcpudp_magic : IPv4 pseudo header csum
- csum_ipv6_magic : IPv6 pseudo header csum
- csum_fold : fold csum and do one's complement

Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 tools/testing/selftests/bpf/network_helpers.h | 43 +++++++++++++++++++
 1 file changed, 43 insertions(+)

diff --git a/tools/testing/selftests/bpf/network_helpers.h b/tools/testing/selftests/bpf/network_helpers.h
index 5eccc67d1a99..654a854c9fb2 100644
--- a/tools/testing/selftests/bpf/network_helpers.h
+++ b/tools/testing/selftests/bpf/network_helpers.h
@@ -70,4 +70,47 @@ struct nstoken;
  */
 struct nstoken *open_netns(const char *name);
 void close_netns(struct nstoken *token);
+
+static __u16 csum_fold(__u32 csum)
+{
+	csum = (csum & 0xffff) + (csum >> 16);
+	csum = (csum & 0xffff) + (csum >> 16);
+
+	return (__u16)~csum;
+}
+
+static inline __sum16 csum_tcpudp_magic(__be32 saddr, __be32 daddr,
+					__u32 len, __u8 proto,
+					__wsum csum)
+{
+	__u64 s = csum;
+
+	s += (__u32)saddr;
+	s += (__u32)daddr;
+	s += htons(proto + len);
+	s = (s & 0xffffffff) + (s >> 32);
+	s = (s & 0xffffffff) + (s >> 32);
+
+	return csum_fold((__u32)s);
+}
+
+static inline __sum16 csum_ipv6_magic(const struct in6_addr *saddr,
+				      const struct in6_addr *daddr,
+					__u32 len, __u8 proto,
+					__wsum csum)
+{
+	__u64 s = csum;
+	int i;
+
+	for (i = 0; i < 4; i++)
+		s += (__u32)saddr->s6_addr32[i];
+	for (i = 0; i < 4; i++)
+		s += (__u32)daddr->s6_addr32[i];
+	s += htons(proto + len);
+	s = (s & 0xffffffff) + (s >> 32);
+	s = (s & 0xffffffff) + (s >> 32);
+
+	return csum_fold((__u32)s);
+}
+
 #endif
-- 
2.41.0.640.ga95def55d0-goog


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [xdp-hints] [PATCH bpf-next 7/9] selftests/bpf: Add TX side to xdp_metadata
  2023-08-09 16:54 [xdp-hints] [PATCH bpf-next 0/9] xsk: TX metadata Stanislav Fomichev
                   ` (5 preceding siblings ...)
  2023-08-09 16:54 ` [xdp-hints] [PATCH bpf-next 6/9] selftests/bpf: Add csum helpers Stanislav Fomichev
@ 2023-08-09 16:54 ` Stanislav Fomichev
  2023-08-09 16:54 ` [xdp-hints] [PATCH bpf-next 8/9] selftests/bpf: Add TX side to xdp_hw_metadata Stanislav Fomichev
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 29+ messages in thread
From: Stanislav Fomichev @ 2023-08-09 16:54 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, kuba, toke, willemb, dsahern,
	magnus.karlsson, bjorn, maciej.fijalkowski, hawk, netdev,
	xdp-hints

Request TX timestamp and make sure it's not empty.
Request TX checksum offload (SW-only) and make sure it's resolved
to the correct one.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 .../selftests/bpf/prog_tests/xdp_metadata.c   | 31 +++++++++++++++++--
 1 file changed, 28 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c b/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
index 626c461fa34d..04477a557b8c 100644
--- a/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
+++ b/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
@@ -48,6 +48,7 @@ static int open_xsk(int ifindex, struct xsk *xsk)
 {
 	int mmap_flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE;
 	const struct xsk_socket_config socket_config = {
+		.tx_metadata_len = sizeof(struct xsk_tx_metadata),
 		.rx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
 		.tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
 		.bind_flags = XDP_COPY,
@@ -138,6 +139,7 @@ static void ip_csum(struct iphdr *iph)
 
 static int generate_packet(struct xsk *xsk, __u16 dst_port)
 {
+	struct xsk_tx_metadata *meta;
 	struct xdp_desc *tx_desc;
 	struct udphdr *udph;
 	struct ethhdr *eth;
@@ -151,10 +153,14 @@ static int generate_packet(struct xsk *xsk, __u16 dst_port)
 		return -1;
 
 	tx_desc = xsk_ring_prod__tx_desc(&xsk->tx, idx);
-	tx_desc->addr = idx % (UMEM_NUM / 2) * UMEM_FRAME_SIZE;
+	tx_desc->addr = idx % (UMEM_NUM / 2) * UMEM_FRAME_SIZE + sizeof(struct xsk_tx_metadata);
 	printf("%p: tx_desc[%u]->addr=%llx\n", xsk, idx, tx_desc->addr);
 	data = xsk_umem__get_data(xsk->umem_area, tx_desc->addr);
 
+	meta = data - sizeof(struct xsk_tx_metadata);
+	memset(meta, 0, sizeof(*meta));
+	meta->flags = XDP_TX_METADATA_TIMESTAMP;
+
 	eth = data;
 	iph = (void *)(eth + 1);
 	udph = (void *)(iph + 1);
@@ -178,11 +184,17 @@ static int generate_packet(struct xsk *xsk, __u16 dst_port)
 	udph->source = htons(AF_XDP_SOURCE_PORT);
 	udph->dest = htons(dst_port);
 	udph->len = htons(sizeof(*udph) + UDP_PAYLOAD_BYTES);
-	udph->check = 0;
+	udph->check = ~csum_tcpudp_magic(iph->saddr, iph->daddr,
+					 ntohs(udph->len), IPPROTO_UDP, 0);
 
 	memset(udph + 1, 0xAA, UDP_PAYLOAD_BYTES);
 
+	meta->flags |= XDP_TX_METADATA_CHECKSUM | XDP_TX_METADATA_CHECKSUM_SW;
+	meta->csum_start = sizeof(*eth) + sizeof(*iph);
+	meta->csum_offset = offsetof(struct udphdr, check);
+
 	tx_desc->len = sizeof(*eth) + sizeof(*iph) + sizeof(*udph) + UDP_PAYLOAD_BYTES;
+	tx_desc->options |= XDP_TX_METADATA;
 	xsk_ring_prod__submit(&xsk->tx, 1);
 
 	ret = sendto(xsk_socket__fd(xsk->socket), NULL, 0, MSG_DONTWAIT, NULL, 0);
@@ -194,13 +206,21 @@ static int generate_packet(struct xsk *xsk, __u16 dst_port)
 
 static void complete_tx(struct xsk *xsk)
 {
-	__u32 idx;
+	struct xsk_tx_metadata *meta;
 	__u64 addr;
+	void *data;
+	__u32 idx;
 
 	if (ASSERT_EQ(xsk_ring_cons__peek(&xsk->comp, 1, &idx), 1, "xsk_ring_cons__peek")) {
 		addr = *xsk_ring_cons__comp_addr(&xsk->comp, idx);
 
 		printf("%p: complete tx idx=%u addr=%llx\n", xsk, idx, addr);
+
+		data = xsk_umem__get_data(xsk->umem_area, addr);
+		meta = data - sizeof(struct xsk_tx_metadata);
+
+		ASSERT_NEQ(meta->tx_timestamp, 0, "tx_timestamp");
+
 		xsk_ring_cons__release(&xsk->comp, 1);
 	}
 }
@@ -221,6 +241,7 @@ static int verify_xsk_metadata(struct xsk *xsk)
 	const struct xdp_desc *rx_desc;
 	struct pollfd fds = {};
 	struct xdp_meta *meta;
+	struct udphdr *udph;
 	struct ethhdr *eth;
 	struct iphdr *iph;
 	__u64 comp_addr;
@@ -257,6 +278,7 @@ static int verify_xsk_metadata(struct xsk *xsk)
 	ASSERT_EQ(eth->h_proto, htons(ETH_P_IP), "eth->h_proto");
 	iph = (void *)(eth + 1);
 	ASSERT_EQ((int)iph->version, 4, "iph->version");
+	udph = (void *)(iph + 1);
 
 	/* custom metadata */
 
@@ -270,6 +292,9 @@ static int verify_xsk_metadata(struct xsk *xsk)
 
 	ASSERT_EQ(meta->rx_hash_type, 0, "rx_hash_type");
 
+	/* checksum offload */
+	ASSERT_EQ(udph->check, 0x1c72, "csum");
+
 	xsk_ring_cons__release(&xsk->rx, 1);
 	refill_rx(xsk, comp_addr);
 
-- 
2.41.0.640.ga95def55d0-goog


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [xdp-hints] [PATCH bpf-next 8/9] selftests/bpf: Add TX side to xdp_hw_metadata
  2023-08-09 16:54 [xdp-hints] [PATCH bpf-next 0/9] xsk: TX metadata Stanislav Fomichev
                   ` (6 preceding siblings ...)
  2023-08-09 16:54 ` [xdp-hints] [PATCH bpf-next 7/9] selftests/bpf: Add TX side to xdp_metadata Stanislav Fomichev
@ 2023-08-09 16:54 ` Stanislav Fomichev
  2023-08-09 16:54 ` [xdp-hints] [PATCH bpf-next 9/9] xsk: document XDP_TX_METADATA_LEN layout Stanislav Fomichev
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 29+ messages in thread
From: Stanislav Fomichev @ 2023-08-09 16:54 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, kuba, toke, willemb, dsahern,
	magnus.karlsson, bjorn, maciej.fijalkowski, hawk, netdev,
	xdp-hints

When we get a packet on port 9091, we swap src/dst and send it out.
At this point we also request the timestamp and checksum offloads.

Checksum offload is verified by looking at the tcpdump on the other side.
The tool prints pseudo-header csum and the final one it expects.
The final checksum actually matches the incoming packets checksum
because we only flip the src/dst and don't change the payload.

Some other related changes:
- switched to zerocopy mode by default; new flag can be used to force
  old behavior
- request fixed TX_METADATA_LEN headroom
- some other small fixes (umem size, fill idx+i, etc)

mvbz3:~# ./xdp_hw_metadata eth3
...
0x1062cb8: rx_desc[0]->addr=80100 addr=80100 comp_addr=80100
rx_hash: 0x2E1B50B9 with RSS type:0x2A
rx_timestamp:  1691436369532047139 (sec:1691436369.5320)
XDP RX-time:   1691436369261756803 (sec:1691436369.2618) delta sec:-0.2703 (-270290.336 usec)
AF_XDP time:   1691436369261878839 (sec:1691436369.2619) delta sec:0.0001 (122.036 usec)
0x1062cb8: ping-pong with csum=3b8e (want de7e) csum_start=54 csum_offset=6
0x1062cb8: complete tx idx=0 addr=10
0x1062cb8: tx_timestamp:  1691436369598419505 (sec:1691436369.5984)
0x1062cb8: complete rx idx=128 addr=80100

mvbz4:~# nc  -Nu -q1 ${MVBZ3_LINK_LOCAL_IP}%eth3 9091

mvbz4:~# tcpdump -vvx -i eth3 udp
	tcpdump: listening on eth3, link-type EN10MB (Ethernet), snapshot length 262144 bytes
12:26:09.301074 IP6 (flowlabel 0x35fa5, hlim 127, next-header UDP (17) payload length: 11) fe80::1270:fdff:fe48:1087.55807 > fe80::1270:fdff:fe48:1077.9091: [bad udp cksum 0x3b8e -> 0xde7e!] UDP, length 3
        0x0000:  6003 5fa5 000b 117f fe80 0000 0000 0000
        0x0010:  1270 fdff fe48 1087 fe80 0000 0000 0000
        0x0020:  1270 fdff fe48 1077 d9ff 2383 000b 3b8e
        0x0030:  7864 70
12:26:09.301976 IP6 (flowlabel 0x35fa5, hlim 127, next-header UDP (17) payload length: 11) fe80::1270:fdff:fe48:1077.9091 > fe80::1270:fdff:fe48:1087.55807: [udp sum ok] UDP, length 3
        0x0000:  6003 5fa5 000b 117f fe80 0000 0000 0000
        0x0010:  1270 fdff fe48 1077 fe80 0000 0000 0000
        0x0020:  1270 fdff fe48 1087 2383 d9ff 000b de7e
        0x0030:  7864 70

Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 tools/testing/selftests/bpf/xdp_hw_metadata.c | 201 +++++++++++++++++-
 1 file changed, 191 insertions(+), 10 deletions(-)

diff --git a/tools/testing/selftests/bpf/xdp_hw_metadata.c b/tools/testing/selftests/bpf/xdp_hw_metadata.c
index 613321eb84c1..05707e640e6c 100644
--- a/tools/testing/selftests/bpf/xdp_hw_metadata.c
+++ b/tools/testing/selftests/bpf/xdp_hw_metadata.c
@@ -10,7 +10,9 @@
  *   - rx_hash
  *
  * TX:
- * - TBD
+ * - UDP 9091 packets trigger TX reply
+ * - TX HW timestamp is requested and reported back upon completion
+ * - TX checksum is requested
  */
 
 #include <test_progs.h>
@@ -24,14 +26,17 @@
 #include <linux/net_tstamp.h>
 #include <linux/udp.h>
 #include <linux/sockios.h>
+#include <linux/if_xdp.h>
 #include <sys/mman.h>
 #include <net/if.h>
 #include <poll.h>
 #include <time.h>
+#include <unistd.h>
+#include <libgen.h>
 
 #include "xdp_metadata.h"
 
-#define UMEM_NUM 16
+#define UMEM_NUM 256
 #define UMEM_FRAME_SIZE XSK_UMEM__DEFAULT_FRAME_SIZE
 #define UMEM_SIZE (UMEM_FRAME_SIZE * UMEM_NUM)
 #define XDP_FLAGS (XDP_FLAGS_DRV_MODE | XDP_FLAGS_REPLACE)
@@ -51,22 +56,24 @@ struct xsk *rx_xsk;
 const char *ifname;
 int ifindex;
 int rxq;
+bool skip_tx;
 
 void test__fail(void) { /* for network_helpers.c */ }
 
-static int open_xsk(int ifindex, struct xsk *xsk, __u32 queue_id)
+static int open_xsk(int ifindex, struct xsk *xsk, __u32 queue_id, int flags)
 {
 	int mmap_flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE;
 	const struct xsk_socket_config socket_config = {
+		.tx_metadata_len = sizeof(struct xsk_tx_metadata),
 		.rx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
 		.tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
-		.bind_flags = XDP_COPY,
+		.bind_flags = flags,
 	};
 	const struct xsk_umem_config umem_config = {
 		.fill_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
 		.comp_size = XSK_RING_CONS__DEFAULT_NUM_DESCS,
 		.frame_size = XSK_UMEM__DEFAULT_FRAME_SIZE,
-		.flags = XDP_UMEM_UNALIGNED_CHUNK_FLAG,
+		.flags = XSK_UMEM__DEFAULT_FLAGS,
 	};
 	__u32 idx;
 	u64 addr;
@@ -108,7 +115,7 @@ static int open_xsk(int ifindex, struct xsk *xsk, __u32 queue_id)
 	for (i = 0; i < UMEM_NUM / 2; i++) {
 		addr = (UMEM_NUM / 2 + i) * UMEM_FRAME_SIZE;
 		printf("%p: rx_desc[%d] -> %lx\n", xsk, i, addr);
-		*xsk_ring_prod__fill_addr(&xsk->fill, i) = addr;
+		*xsk_ring_prod__fill_addr(&xsk->fill, idx + i) = addr;
 	}
 	xsk_ring_prod__submit(&xsk->fill, ret);
 
@@ -129,12 +136,22 @@ static void refill_rx(struct xsk *xsk, __u64 addr)
 	__u32 idx;
 
 	if (xsk_ring_prod__reserve(&xsk->fill, 1, &idx) == 1) {
-		printf("%p: complete idx=%u addr=%llx\n", xsk, idx, addr);
+		printf("%p: complete rx idx=%u addr=%llx\n", xsk, idx, addr);
 		*xsk_ring_prod__fill_addr(&xsk->fill, idx) = addr;
 		xsk_ring_prod__submit(&xsk->fill, 1);
 	}
 }
 
+static int kick_tx(struct xsk *xsk)
+{
+	return sendto(xsk_socket__fd(xsk->socket), NULL, 0, MSG_DONTWAIT, NULL, 0);
+}
+
+static int kick_rx(struct xsk *xsk)
+{
+	return recvfrom(xsk_socket__fd(xsk->socket), NULL, 0, MSG_DONTWAIT, NULL, NULL);
+}
+
 #define NANOSEC_PER_SEC 1000000000 /* 10^9 */
 static __u64 gettime(clockid_t clock_id)
 {
@@ -228,6 +245,116 @@ static void verify_skb_metadata(int fd)
 	printf("skb hwtstamp is not found!\n");
 }
 
+static bool complete_tx(struct xsk *xsk)
+{
+	struct xsk_tx_metadata *meta;
+	__u64 addr;
+	void *data;
+	__u32 idx;
+
+	if (!xsk_ring_cons__peek(&xsk->comp, 1, &idx))
+		return false;
+
+	addr = *xsk_ring_cons__comp_addr(&xsk->comp, idx);
+	data = xsk_umem__get_data(xsk->umem_area, addr);
+	meta = data - sizeof(struct xsk_tx_metadata);
+
+	printf("%p: complete tx idx=%u addr=%llx\n", xsk, idx, addr);
+	printf("%p: tx_timestamp:  %llu (sec:%0.4f)\n", xsk, meta->tx_timestamp,
+	       (double)meta->tx_timestamp / NANOSEC_PER_SEC);
+	xsk_ring_cons__release(&xsk->comp, 1);
+
+	return true;
+}
+
+#define swap(a, b, len) do { \
+	for (int i = 0; i < len; i++) { \
+		__u8 tmp = ((__u8 *)a)[i]; \
+		((__u8 *)a)[i] = ((__u8 *)b)[i]; \
+		((__u8 *)b)[i] = tmp; \
+	} \
+} while (0)
+
+static void ping_pong(struct xsk *xsk, void *rx_packet)
+{
+	struct xsk_tx_metadata *meta;
+	struct ipv6hdr *ip6h = NULL;
+	struct iphdr *iph = NULL;
+	struct xdp_desc *tx_desc;
+	struct udphdr *udph;
+	struct ethhdr *eth;
+	__sum16 want_csum;
+	void *data;
+	__u32 idx;
+	int ret;
+	int len;
+
+	ret = xsk_ring_prod__reserve(&xsk->tx, 1, &idx);
+	if (ret != 1) {
+		printf("%p: failed to reserve tx slot\n", xsk);
+		return;
+	}
+
+	tx_desc = xsk_ring_prod__tx_desc(&xsk->tx, idx);
+	tx_desc->addr = idx % (UMEM_NUM / 2) * UMEM_FRAME_SIZE + sizeof(struct xsk_tx_metadata);
+	data = xsk_umem__get_data(xsk->umem_area, tx_desc->addr);
+
+	meta = data - sizeof(struct xsk_tx_metadata);
+	memset(meta, 0, sizeof(*meta));
+	meta->flags = XDP_TX_METADATA_TIMESTAMP;
+
+	eth = rx_packet;
+
+	if (eth->h_proto == htons(ETH_P_IP)) {
+		iph = (void *)(eth + 1);
+		udph = (void *)(iph + 1);
+	} else if (eth->h_proto == htons(ETH_P_IPV6)) {
+		ip6h = (void *)(eth + 1);
+		udph = (void *)(ip6h + 1);
+	} else {
+		printf("%p: failed to detect IP version for ping pong %04x\n", xsk, eth->h_proto);
+		xsk_ring_prod__cancel(&xsk->tx, 1);
+		return;
+	}
+
+	len = ETH_HLEN;
+	if (ip6h)
+		len += sizeof(*ip6h) + ntohs(ip6h->payload_len);
+	if (iph)
+		len += ntohs(iph->tot_len);
+
+	swap(eth->h_dest, eth->h_source, ETH_ALEN);
+	if (iph)
+		swap(&iph->saddr, &iph->daddr, 4);
+	else
+		swap(&ip6h->saddr, &ip6h->daddr, 16);
+	swap(&udph->source, &udph->dest, 2);
+
+	want_csum = udph->check;
+	if (ip6h)
+		udph->check = ~csum_ipv6_magic(&ip6h->saddr, &ip6h->daddr,
+					       ntohs(udph->len), IPPROTO_UDP, 0);
+	else
+		udph->check = ~csum_tcpudp_magic(iph->saddr, iph->daddr,
+						 ntohs(udph->len), IPPROTO_UDP, 0);
+
+	meta->flags |= XDP_TX_METADATA_CHECKSUM;
+	if (iph)
+		meta->csum_start = sizeof(*eth) + sizeof(*iph);
+	else
+		meta->csum_start = sizeof(*eth) + sizeof(*ip6h);
+	meta->csum_offset = offsetof(struct udphdr, check);
+
+	printf("%p: ping-pong with csum=%04x (want %04x) csum_start=%d csum_offset=%d\n",
+	       xsk, ntohs(udph->check), ntohs(want_csum), meta->csum_start, meta->csum_offset);
+
+	memcpy(data, rx_packet, len); /* don't share umem chunk for simplicity */
+	tx_desc->options |= XDP_TX_METADATA;
+	tx_desc->len = len;
+
+	xsk_ring_prod__submit(&xsk->tx, 1);
+}
+
 static int verify_metadata(struct xsk *rx_xsk, int rxq, int server_fd, clockid_t clock_id)
 {
 	const struct xdp_desc *rx_desc;
@@ -250,6 +377,13 @@ static int verify_metadata(struct xsk *rx_xsk, int rxq, int server_fd, clockid_t
 
 	while (true) {
 		errno = 0;
+
+		for (i = 0; i < rxq; i++) {
+			ret = kick_rx(&rx_xsk[i]);
+			if (ret)
+				printf("kick_rx ret=%d\n", ret);
+		}
+
 		ret = poll(fds, rxq + 1, 1000);
 		printf("poll: %d (%d) skip=%llu fail=%llu redir=%llu\n",
 		       ret, errno, bpf_obj->bss->pkts_skip,
@@ -280,6 +414,22 @@ static int verify_metadata(struct xsk *rx_xsk, int rxq, int server_fd, clockid_t
 			       xsk, idx, rx_desc->addr, addr, comp_addr);
 			verify_xdp_metadata(xsk_umem__get_data(xsk->umem_area, addr),
 					    clock_id);
+
+			if (!skip_tx) {
+				/* mirror the packet back */
+				ping_pong(xsk, xsk_umem__get_data(xsk->umem_area, addr));
+
+				ret = kick_tx(xsk);
+				if (ret)
+					printf("kick_tx ret=%d\n", ret);
+
+				for (int j = 0; j < 500; j++) {
+					if (complete_tx(xsk))
+						break;
+					usleep(10*1000);
+				}
+			}
+
 			xsk_ring_cons__release(&xsk->rx, 1);
 			refill_rx(xsk, comp_addr);
 		}
@@ -404,21 +554,52 @@ static void timestamping_enable(int fd, int val)
 		error(1, errno, "setsockopt(SO_TIMESTAMPING)");
 }
 
+static void usage(const char *prog)
+{
+	fprintf(stderr,
+		"usage: %s [OPTS] <ifname>\n"
+		"OPTS:\n"
+		"    -r    don't generate AF_XDP reply (rx metadata only)\n"
+		"    -c    run in copy mode\n",
+		prog);
+}
+
 int main(int argc, char *argv[])
 {
+	int bind_flags =  XDP_USE_NEED_WAKEUP | XDP_ZEROCOPY;
 	clockid_t clock_id = CLOCK_TAI;
 	int server_fd = -1;
+	int opt;
 	int ret;
 	int i;
 
 	struct bpf_program *prog;
 
-	if (argc != 2) {
+	while ((opt = getopt(argc, argv, "rc")) != -1) {
+		switch (opt) {
+		case 'r':
+			skip_tx = true;
+			break;
+		case 'c':
+			bind_flags = XDP_USE_NEED_WAKEUP | XDP_COPY;
+			break;
+		default:
+			usage(basename(argv[0]));
+			return 1;
+		}
+	}
+
+	if (argc < 2) {
 		fprintf(stderr, "pass device name\n");
 		return -1;
 	}
 
-	ifname = argv[1];
+	if (optind >= argc) {
+		usage(basename(argv[0]));
+		return 1;
+	}
+
+	ifname = argv[optind];
 	ifindex = if_nametoindex(ifname);
 	rxq = rxq_num(ifname);
 
@@ -432,7 +613,7 @@ int main(int argc, char *argv[])
 
 	for (i = 0; i < rxq; i++) {
 		printf("open_xsk(%s, %p, %d)\n", ifname, &rx_xsk[i], i);
-		ret = open_xsk(ifindex, &rx_xsk[i], i);
+		ret = open_xsk(ifindex, &rx_xsk[i], i, bind_flags);
 		if (ret)
 			error(1, -ret, "open_xsk");
 
-- 
2.41.0.640.ga95def55d0-goog


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [xdp-hints] [PATCH bpf-next 9/9] xsk: document XDP_TX_METADATA_LEN layout
  2023-08-09 16:54 [xdp-hints] [PATCH bpf-next 0/9] xsk: TX metadata Stanislav Fomichev
                   ` (7 preceding siblings ...)
  2023-08-09 16:54 ` [xdp-hints] [PATCH bpf-next 8/9] selftests/bpf: Add TX side to xdp_hw_metadata Stanislav Fomichev
@ 2023-08-09 16:54 ` Stanislav Fomichev
  2023-08-09 20:39   ` [xdp-hints] " Jesper Dangaard Brouer
  2023-08-09 20:09 ` [xdp-hints] Re: [PATCH bpf-next 0/9] xsk: TX metadata Jesper Dangaard Brouer
  2023-08-14 11:13 ` Maciej Fijalkowski
  10 siblings, 1 reply; 29+ messages in thread
From: Stanislav Fomichev @ 2023-08-09 16:54 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, kuba, toke, willemb, dsahern,
	magnus.karlsson, bjorn, maciej.fijalkowski, hawk, netdev,
	xdp-hints

- how to use
- how to query features
- pointers to the examples

Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 Documentation/networking/index.rst           |  1 +
 Documentation/networking/xsk-tx-metadata.rst | 75 ++++++++++++++++++++
 2 files changed, 76 insertions(+)
 create mode 100644 Documentation/networking/xsk-tx-metadata.rst

diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst
index 5b75c3f7a137..9b2accb48df7 100644
--- a/Documentation/networking/index.rst
+++ b/Documentation/networking/index.rst
@@ -123,6 +123,7 @@ Refer to :ref:`netdev-FAQ` for a guide on netdev development process specifics.
    xfrm_sync
    xfrm_sysctl
    xdp-rx-metadata
+   xsk-tx-metadata
 
 .. only::  subproject and html
 
diff --git a/Documentation/networking/xsk-tx-metadata.rst b/Documentation/networking/xsk-tx-metadata.rst
new file mode 100644
index 000000000000..41600f08f794
--- /dev/null
+++ b/Documentation/networking/xsk-tx-metadata.rst
@@ -0,0 +1,75 @@
+==================
+AF_XDP TX Metadata
+==================
+
+This document describes how to enable offloads when transmitting packets
+via :doc:`af_xdp`. Refer to :doc:`xdp-rx-metadata` on how to access similar
+metadata on the receive side.
+
+General Design
+==============
+
+The headroom for the metadata is reserved via ``setsockopt(fd, SOL_XDP,
+XDP_TX_METADATA_LEN, &len, 4)``. The metadata layout is a fixed UAPI,
+refer to ``struct xsk_tx_metadata`` in ``include/uapi/linux/if_xdp.h``.
+IOW, the ``len`` variable above should contain
+``sizeof(struct xsk_tx_metadata)``.
+
+The headroom and the metadata itself should be located right before
+``xdp_desc->addr`` in the umem frame. Within a frame, the metadata
+layout is as follows::
+
+         XDP_TX_METADATA_LEN
+     /                         \
+    +-----------------+---------+----------------------------+
+    | xsk_tx_metadata | padding |          payload           |
+    +-----------------+---------+----------------------------+
+                                ^
+                                |
+                          xdp_desc->addr
+
+An AF_XDP applications can request headrooms larger than ``sizeof(struct
+xsk_tx_metadata)``. The kernel will ignore the padding (and will still
+use ``xdp_desc->addr - XDP_TX_METADATA_LEN`` to locate
+the ``xsk_tx_metadata``). The application is expected to zero-out
+the metadata flags for the frames that shouldn't use any offloads.
+
+The flags field enables the particular offload:
+
+- ``XDP_TX_METADATA_TIMESTAMP``: requests the device to put transmission
+  timestamp into ``tx_timestamp`` field of ``struct xsk_tx_metadata``.
+- ``XDP_TX_METADATA_CHECKSUM``: requests the device to calculate L4
+  checksum. ``csum_start`` specifies byte offset of there the checksumming
+  should start and ``csum_offset`` specifies byte offset where the
+  device should store the computed checksum.
+- ``XDP_TX_METADATA_CHECKSUM_SW``: requests checksum calculation to
+  be done in software; this mode works only in ``XSK_COPY`` mode and
+  is mostly intended for testing. Do not enable this option, it
+  will negatively affect performance.
+
+Besides the flags above, in order to trigger the offloads, the first
+packet's ``struct xdp_desc`` descriptor should set ``XDP_TX_METADATA``
+bit in the ``options`` field. Also not that in a multi-buffer packet
+only the first chunk should carry the metadata.
+
+Querying Device Capabilities
+============================
+
+Every devices exports its offloads capabilities via netlink netdev family.
+Refer to ``xsk-flags`` features bitmask in
+``Documentation/netlink/specs/netdev.yaml``.
+
+- ``tx-timestamp``: device supports ``XDP_TX_METADATA_TIMESTAMP``
+- ``tx-checksum``: device supports ``XDP_TX_METADATA_CHECKSUM``
+
+Note that every devices supports ``XDP_TX_METADATA_CHECKSUM_SW`` when
+running in ``XSK_COPY`` mode.
+
+See ``tools/net/ynl/samples/netdev.c`` on how to query this information.
+
+Example
+=======
+
+See ``tools/testing/selftests/bpf/xdp_hw_metadata.c`` for an example
+program that handles TX metadata. Also see https://github.com/fomichev/xskgen
+for a more bare-bones example.
-- 
2.41.0.640.ga95def55d0-goog


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [xdp-hints] Re: [PATCH bpf-next 0/9] xsk: TX metadata
  2023-08-09 16:54 [xdp-hints] [PATCH bpf-next 0/9] xsk: TX metadata Stanislav Fomichev
                   ` (8 preceding siblings ...)
  2023-08-09 16:54 ` [xdp-hints] [PATCH bpf-next 9/9] xsk: document XDP_TX_METADATA_LEN layout Stanislav Fomichev
@ 2023-08-09 20:09 ` Jesper Dangaard Brouer
  2023-08-10 18:23   ` Stanislav Fomichev
  2023-08-14 11:13 ` Maciej Fijalkowski
  10 siblings, 1 reply; 29+ messages in thread
From: Jesper Dangaard Brouer @ 2023-08-09 20:09 UTC (permalink / raw)
  To: Stanislav Fomichev, bpf
  Cc: brouer, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, haoluo, jolsa, kuba, toke, willemb,
	dsahern, magnus.karlsson, bjorn, maciej.fijalkowski, hawk,
	netdev, xdp-hints, Saeed Mahameed



On 09/08/2023 18.54, Stanislav Fomichev wrote:
> This series implements initial TX metadata (offloads) for AF_XDP.
> See patch #2 for the main implementation and mlx5-one for the
> example on how to consume the metadata on the device side.
> 
> Starting with two types of offloads:
> - request TX timestamp (and write it back into the metadata area)
> - request TX checksum offload
> 
> Changes since last RFC:
> - add /* private: */ comments to headers (Simon)
> - handle metadata only in the first frag (Simon)
> - rename xdp_hw_metadata flags (Willem)
> - s/tmo_request_checksum/tmo_request_timestamp/ in xdp_metadata_ops
>    comment (Willem)
> - Documentation/networking/xsk-tx-metadata.rst
> 
> RFC v4: https://lore.kernel.org/bpf/20230724235957.1953861-1-sdf@google.com/
> 
> Performance:
> 
> I've implemented a small xskgen tool to try to saturate single tx queue:
> https://github.com/fomichev/xskgen/tree/master
> 
> Here are the performance numbers with some analysis.
> 

What is the clock freq GHz for the CPU used for this benchmark?

> 1. Baseline. Running with commit eb62e6aef940 ("Merge branch 'bpf:
> Support bpf_get_func_ip helper in uprobes'"), nothing from this series:
> 
> - with 1400 bytes of payload: 98 gbps, 8 mpps
> ./xskgen -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
> sent 10000000 packets 116960000000 bits, took 1.189130 sec, 98.357623 gbps 8.409509 mpps
> 
> - with 200 bytes of payload: 49 gbps, 23 mpps
> ./xskgen -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
> sent 10000064 packets 20960134144 bits, took 0.422235 sec, 49.640921 gbps 23.683645 mpps
> 
> 2. Adding single commit that supports reserving XDP_TX_METADATA_LEN
>     changes nothing numbers-wise.
> 
> - baseline for 1400
> ./xskgen -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
> sent 10000000 packets 116960000000 bits, took 1.189247 sec, 98.347946 gbps 8.408682 mpps
> 
> - baseline for 200
> ./xskgen -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
> sent 10000000 packets 20960000000 bits, took 0.421248 sec, 49.756913 gbps 23.738985 mpps
> 
> 3. Adding -M flag causes xskgen to reserve the metadata and fill it, but
>     doesn't set XDP_TX_METADATA descriptor option.
> 
> - new baseline for 1400 (with only filling the metadata)
> ./xskgen -M -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
> sent 10000000 packets 116960000000 bits, took 1.188767 sec, 98.387657 gbps 8.412077 mpps
> 
> - new baseline for 200 (with only filling the metadata)
> ./xskgen -M -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
> sent 10000000 packets 20960000000 bits, took 0.410213 sec, 51.095407 gbps 24.377579 mpps
> (the numbers go sligtly up here, not really sure why, maybe some cache-related
> side-effects?
> 

Notice this change/improvement (23.7 to 24.4 Mpps) is only 1 nanosec.
  - (1/24.377579-1/23.738985)*10^3 = -1.103499 nanosec

This 1 nanosec could be noise in your testlab.


> 4. Next, I'm running the same test but with the commit that adds actual
>     general infra to parse XDP_TX_METADATA (but no driver support).
>     Essentially applying "xsk: add TX timestamp and TX checksum offload support"
>     from this series. Numbers are the same.
> 
> - fill metadata for 1400
> ./xskgen -M -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
> sent 10000000 packets 116960000000 bits, took 1.188430 sec, 98.415557 gbps 8.414463 mpps
> 
> - fill metadata for 200
> ./xskgen -M -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
> sent 10000000 packets 20960000000 bits, took 0.411559 sec, 50.928299 gbps 24.297853 mpps
> 
> - request metadata for 1400
> ./xskgen -m -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
> sent 10000000 packets 116960000000 bits, took 1.188723 sec, 98.391299 gbps 8.412389 mpps
> 
> - request metadata for 200
> ./xskgen -m -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
> sent 10000064 packets 20960134144 bits, took 0.411240 sec, 50.968131 gbps 24.316856 mpps
> 
> 5. Now, for the most interesting part, I'm adding mlx5 driver support.
>     The mpps for 200 bytes case goes down from 23 mpps to 19 mpps, but
>     _only_ when I enable the metadata. This looks like a side effect
>     of me pushing extra metadata pointer via mlx5e_xdpi_fifo_push.
>     Hence, this part is wrapped into 'if (xp_tx_metadata_enabled)'
>     to not affect the existing non-metadata use-cases. Since this is not
>     regressing existing workloads, I'm not spending any time trying to
>     optimize it more (and leaving it up to mlx owners to purse if
>     they see any good way to do it).
> 
> - same baseline
> ./xskgen -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
> sent 10000000 packets 116960000000 bits, took 1.189434 sec, 98.332484 gbps 8.407360 mpps
> 
> ./xskgen -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
> sent 10000128 packets 20960268288 bits, took 0.425254 sec, 49.288821 gbps 23.515659 mpps
> 
> - fill metadata for 1400
> ./xskgen -M -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
> sent 10000000 packets 116960000000 bits, took 1.189528 sec, 98.324714 gbps 8.406696 mpps
> 
> - fill metadata for 200
> ./xskgen -M -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
> sent 10000128 packets 20960268288 bits, took 0.519085 sec, 40.379260 gbps 19.264914 mpps
> 

This change from 23 mpps to 19 mpps is approx 10 nanosec
  - (1/23.738985-1/19.264914)*10^3 = -9.78 nanosec
  - (1/23.515659-1/19.264914)*10^3 = -9.38 nanosec

A 10 nanosec difference is more than noise.


> - request metadata for 1400
> ./xskgen -m -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
> sent 10000000 packets 116960000000 bits, took 1.189329 sec, 98.341165 gbps 8.408102 mpps
> 
> - request metadata for 200
> ./xskgen -m -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
> sent 10000128 packets 20960268288 bits, took 0.519929 sec, 40.313713 gbps 19.233642 mpps
> 
> Cc: Saeed Mahameed <saeedm@nvidia.com>
> 
> Stanislav Fomichev (9):
>    xsk: Support XDP_TX_METADATA_LEN
>    xsk: add TX timestamp and TX checksum offload support
>    tools: ynl: print xsk-features from the sample
>    net/mlx5e: Implement AF_XDP TX timestamp and checksum offload
>    selftests/xsk: Support XDP_TX_METADATA_LEN
>    selftests/bpf: Add csum helpers
>    selftests/bpf: Add TX side to xdp_metadata
>    selftests/bpf: Add TX side to xdp_hw_metadata
>    xsk: document XDP_TX_METADATA_LEN layout
> 
>   Documentation/netlink/specs/netdev.yaml       |  20 ++
>   Documentation/networking/index.rst            |   1 +
>   Documentation/networking/xsk-tx-metadata.rst  |  75 +++++++
>   drivers/net/ethernet/mellanox/mlx5/core/en.h  |   4 +-
>   .../net/ethernet/mellanox/mlx5/core/en/xdp.c  |  72 ++++++-
>   .../net/ethernet/mellanox/mlx5/core/en/xdp.h  |  10 +-
>   .../ethernet/mellanox/mlx5/core/en/xsk/tx.c   |  11 +-
>   .../net/ethernet/mellanox/mlx5/core/en_main.c |   1 +
>   include/linux/netdevice.h                     |  27 +++
>   include/linux/skbuff.h                        |   5 +-
>   include/net/xdp_sock.h                        |  61 ++++++
>   include/net/xdp_sock_drv.h                    |  13 ++
>   include/net/xsk_buff_pool.h                   |   6 +
>   include/uapi/linux/if_xdp.h                   |  36 ++++
>   include/uapi/linux/netdev.h                   |  16 ++
>   net/core/netdev-genl.c                        |  12 +-
>   net/xdp/xsk.c                                 |  61 ++++++
>   net/xdp/xsk_buff_pool.c                       |   1 +
>   net/xdp/xsk_queue.h                           |  19 +-
>   tools/include/uapi/linux/if_xdp.h             |  50 ++++-
>   tools/include/uapi/linux/netdev.h             |  15 ++
>   tools/net/ynl/generated/netdev-user.c         |  19 ++
>   tools/net/ynl/generated/netdev-user.h         |   3 +
>   tools/net/ynl/samples/netdev.c                |   6 +
>   tools/testing/selftests/bpf/network_helpers.h |  43 ++++
>   .../selftests/bpf/prog_tests/xdp_metadata.c   |  31 ++-
>   tools/testing/selftests/bpf/xdp_hw_metadata.c | 201 +++++++++++++++++-
>   tools/testing/selftests/bpf/xsk.c             |  17 ++
>   tools/testing/selftests/bpf/xsk.h             |   1 +
>   29 files changed, 793 insertions(+), 44 deletions(-)
>   create mode 100644 Documentation/networking/xsk-tx-metadata.rst
> 

Thanks for working on this :-)
--Jesper


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [xdp-hints] Re: [PATCH bpf-next 2/9] xsk: add TX timestamp and TX checksum offload support
  2023-08-09 16:54 ` [xdp-hints] [PATCH bpf-next 2/9] xsk: add TX timestamp and TX checksum offload support Stanislav Fomichev
@ 2023-08-09 20:18   ` Jesper Dangaard Brouer
  2023-08-10 18:25     ` Stanislav Fomichev
  2023-08-10  5:26   ` kernel test robot
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 29+ messages in thread
From: Jesper Dangaard Brouer @ 2023-08-09 20:18 UTC (permalink / raw)
  To: Stanislav Fomichev, bpf
  Cc: brouer, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, haoluo, jolsa, kuba, toke, willemb,
	dsahern, magnus.karlsson, bjorn, maciej.fijalkowski, hawk,
	netdev, xdp-hints



On 09/08/2023 18.54, Stanislav Fomichev wrote:
> diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
> index 1f6fc8c7a84c..e2558ac3e195 100644
> --- a/include/net/xdp_sock_drv.h
> +++ b/include/net/xdp_sock_drv.h
> @@ -165,6 +165,14 @@ static inline void *xsk_buff_raw_get_data(struct xsk_buff_pool *pool, u64 addr)
>   	return xp_raw_get_data(pool, addr);
>   }
>   
> +static inline struct xsk_tx_metadata *xsk_buff_get_metadata(struct xsk_buff_pool *pool, u64 addr)
> +{
> +	if (!pool->tx_metadata_len)
> +		return NULL;
> +
> +	return xp_raw_get_data(pool, addr) - pool->tx_metadata_len;
> +}
> +
>   static inline void xsk_buff_dma_sync_for_cpu(struct xdp_buff *xdp, struct xsk_buff_pool *pool)
>   {
>   	struct xdp_buff_xsk *xskb = container_of(xdp, struct xdp_buff_xsk, xdp);
> @@ -324,6 +332,11 @@ static inline void *xsk_buff_raw_get_data(struct xsk_buff_pool *pool, u64 addr)
>   	return NULL;
>   }
>   
> +static inline struct xsk_tx_metadata *xsk_buff_get_metadata(struct xsk_buff_pool *pool, u64 addr)
> +{
> +	return NULL;
> +}
> +
>   static inline void xsk_buff_dma_sync_for_cpu(struct xdp_buff *xdp, struct xsk_buff_pool *pool)
>   {
>   }
> diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
> index 9c31e8d1e198..3a559753e793 100644
> --- a/include/net/xsk_buff_pool.h
> +++ b/include/net/xsk_buff_pool.h
> @@ -234,4 +234,9 @@ static inline u64 xp_get_handle(struct xdp_buff_xsk *xskb)
>   	return xskb->orig_addr + (offset << XSK_UNALIGNED_BUF_OFFSET_SHIFT);
>   }
>   
> +static inline bool xp_tx_metadata_enabled(const xdp_buff_xsk *xskb)

Hmm, shouldn't this argument be "struct xsk_buff_pool *pool" ?!?

> +{
> +	return sq->xsk_pool->tx_metadata_len > 0;
> +}

Will this even compile?

--Jesper


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [xdp-hints] Re: [PATCH bpf-next 9/9] xsk: document XDP_TX_METADATA_LEN layout
  2023-08-09 16:54 ` [xdp-hints] [PATCH bpf-next 9/9] xsk: document XDP_TX_METADATA_LEN layout Stanislav Fomichev
@ 2023-08-09 20:39   ` Jesper Dangaard Brouer
  2023-08-10 18:17     ` Stanislav Fomichev
  0 siblings, 1 reply; 29+ messages in thread
From: Jesper Dangaard Brouer @ 2023-08-09 20:39 UTC (permalink / raw)
  To: Stanislav Fomichev, bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, kuba, toke, willemb, dsahern,
	magnus.karlsson, bjorn, maciej.fijalkowski, hawk, netdev,
	xdp-hints


On 09/08/2023 18.54, Stanislav Fomichev wrote:
> +Example
> +=======
> +
> +See ``tools/testing/selftests/bpf/xdp_hw_metadata.c`` for an example
> +program that handles TX metadata. Also see https://github.com/fomichev/xskgen
> +for a more bare-bones example.

Will you consider maintaining this AF_XDP example here:
  https://github.com/xdp-project/bpf-examples

E..g. the kernels old samples/bpf/xdpsock program moved here:
  https://github.com/xdp-project/bpf-examples/tree/master/AF_XDP-example

--Jesper
p.s. Compile fail for fomichev/xskgen on my system

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [xdp-hints] Re: [PATCH bpf-next 2/9] xsk: add TX timestamp and TX checksum offload support
  2023-08-09 16:54 ` [xdp-hints] [PATCH bpf-next 2/9] xsk: add TX timestamp and TX checksum offload support Stanislav Fomichev
  2023-08-09 20:18   ` [xdp-hints] " Jesper Dangaard Brouer
@ 2023-08-10  5:26   ` kernel test robot
  2023-08-10  6:12   ` kernel test robot
  2023-08-14 11:01   ` Maciej Fijalkowski
  3 siblings, 0 replies; 29+ messages in thread
From: kernel test robot @ 2023-08-10  5:26 UTC (permalink / raw)
  To: Stanislav Fomichev, bpf
  Cc: oe-kbuild-all, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, kuba, toke, willemb,
	dsahern, magnus.karlsson, bjorn, maciej.fijalkowski, hawk,
	netdev, xdp-hints

Hi Stanislav,

kernel test robot noticed the following build errors:

[auto build test ERROR on bpf-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/Stanislav-Fomichev/xsk-Support-XDP_TX_METADATA_LEN/20230810-010031
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
patch link:    https://lore.kernel.org/r/20230809165418.2831456-3-sdf%40google.com
patch subject: [PATCH bpf-next 2/9] xsk: add TX timestamp and TX checksum offload support
config: arc-randconfig-r043-20230809 (https://download.01.org/0day-ci/archive/20230810/202308101302.nwabzxuw-lkp@intel.com/config)
compiler: arc-elf-gcc (GCC) 12.3.0
reproduce: (https://download.01.org/0day-ci/archive/20230810/202308101302.nwabzxuw-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202308101302.nwabzxuw-lkp@intel.com/

All errors (new ones prefixed by >>):

   In file included from include/net/xdp_sock_drv.h:10,
                    from net/ethtool/ioctl.c:31:
>> include/net/xsk_buff_pool.h:237:49: error: unknown type name 'xdp_buff_xsk'
     237 | static inline bool xp_tx_metadata_enabled(const xdp_buff_xsk *xskb)
         |                                                 ^~~~~~~~~~~~
   include/net/xsk_buff_pool.h: In function 'xp_tx_metadata_enabled':
>> include/net/xsk_buff_pool.h:239:16: error: 'sq' undeclared (first use in this function); did you mean 'rq'?
     239 |         return sq->xsk_pool->tx_metadata_len > 0;
         |                ^~
         |                rq
   include/net/xsk_buff_pool.h:239:16: note: each undeclared identifier is reported only once for each function it appears in


vim +/xdp_buff_xsk +237 include/net/xsk_buff_pool.h

   236	
 > 237	static inline bool xp_tx_metadata_enabled(const xdp_buff_xsk *xskb)
   238	{
 > 239		return sq->xsk_pool->tx_metadata_len > 0;
   240	}
   241	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [xdp-hints] Re: [PATCH bpf-next 2/9] xsk: add TX timestamp and TX checksum offload support
  2023-08-09 16:54 ` [xdp-hints] [PATCH bpf-next 2/9] xsk: add TX timestamp and TX checksum offload support Stanislav Fomichev
  2023-08-09 20:18   ` [xdp-hints] " Jesper Dangaard Brouer
  2023-08-10  5:26   ` kernel test robot
@ 2023-08-10  6:12   ` kernel test robot
  2023-08-14 11:01   ` Maciej Fijalkowski
  3 siblings, 0 replies; 29+ messages in thread
From: kernel test robot @ 2023-08-10  6:12 UTC (permalink / raw)
  To: Stanislav Fomichev, bpf
  Cc: llvm, oe-kbuild-all, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, sdf, haoluo, jolsa, kuba, toke, willemb,
	dsahern, magnus.karlsson, bjorn, maciej.fijalkowski, hawk,
	netdev, xdp-hints

Hi Stanislav,

kernel test robot noticed the following build errors:

[auto build test ERROR on bpf-next/master]

url:    https://github.com/intel-lab-lkp/linux/commits/Stanislav-Fomichev/xsk-Support-XDP_TX_METADATA_LEN/20230810-010031
base:   https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
patch link:    https://lore.kernel.org/r/20230809165418.2831456-3-sdf%40google.com
patch subject: [PATCH bpf-next 2/9] xsk: add TX timestamp and TX checksum offload support
config: arm-randconfig-r003-20230809 (https://download.01.org/0day-ci/archive/20230810/202308101330.Z7cRgzuH-lkp@intel.com/config)
compiler: clang version 17.0.0 (https://github.com/llvm/llvm-project.git 4a5ac14ee968ff0ad5d2cc1ffa0299048db4c88a)
reproduce: (https://download.01.org/0day-ci/archive/20230810/202308101330.Z7cRgzuH-lkp@intel.com/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202308101330.Z7cRgzuH-lkp@intel.com/

All error/warnings (new ones prefixed by >>):

   In file included from drivers/net/ethernet/intel/igc/igc_main.c:13:
   In file included from include/net/xdp_sock_drv.h:10:
>> include/net/xsk_buff_pool.h:237:49: error: must use 'struct' tag to refer to type 'xdp_buff_xsk'
     237 | static inline bool xp_tx_metadata_enabled(const xdp_buff_xsk *xskb)
         |                                                 ^
         |                                                 struct 
>> include/net/xsk_buff_pool.h:239:9: error: use of undeclared identifier 'sq'
     239 |         return sq->xsk_pool->tx_metadata_len > 0;
         |                ^
>> drivers/net/ethernet/intel/igc/igc_main.c:1267:14: warning: division by zero is undefined [-Wdivision-by-zero]
    1267 |         cmd_type |= IGC_SET_FLAG(tx_flags, IGC_TX_FLAGS_VLAN,
         |                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    1268 |                                  IGC_ADVTXD_DCMD_VLE);
         |                                  ~~~~~~~~~~~~~~~~~~~~
   drivers/net/ethernet/intel/igc/igc_main.c:1257:30: note: expanded from macro 'IGC_SET_FLAG'
    1257 |          ((u32)((_input) & (_flag)) / ((_flag) / (_result))))
         |                                     ^ ~~~~~~~~~~~~~~~~~~~~~
   drivers/net/ethernet/intel/igc/igc_main.c:1271:14: warning: division by zero is undefined [-Wdivision-by-zero]
    1271 |         cmd_type |= IGC_SET_FLAG(tx_flags, IGC_TX_FLAGS_TSO,
         |                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    1272 |                                  (IGC_ADVTXD_DCMD_TSE));
         |                                  ~~~~~~~~~~~~~~~~~~~~~~
   drivers/net/ethernet/intel/igc/igc_main.c:1257:30: note: expanded from macro 'IGC_SET_FLAG'
    1257 |          ((u32)((_input) & (_flag)) / ((_flag) / (_result))))
         |                                     ^ ~~~~~~~~~~~~~~~~~~~~~
   drivers/net/ethernet/intel/igc/igc_main.c:1275:14: warning: division by zero is undefined [-Wdivision-by-zero]
    1275 |         cmd_type |= IGC_SET_FLAG(tx_flags, IGC_TX_FLAGS_TSTAMP,
         |                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    1276 |                                  (IGC_ADVTXD_MAC_TSTAMP));
         |                                  ~~~~~~~~~~~~~~~~~~~~~~~~
   drivers/net/ethernet/intel/igc/igc_main.c:1257:30: note: expanded from macro 'IGC_SET_FLAG'
    1257 |          ((u32)((_input) & (_flag)) / ((_flag) / (_result))))
         |                                     ^ ~~~~~~~~~~~~~~~~~~~~~
   drivers/net/ethernet/intel/igc/igc_main.c:1279:14: warning: division by zero is undefined [-Wdivision-by-zero]
    1279 |         cmd_type ^= IGC_SET_FLAG(skb->no_fcs, 1, IGC_ADVTXD_DCMD_IFCS);
         |                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   drivers/net/ethernet/intel/igc/igc_main.c:1257:30: note: expanded from macro 'IGC_SET_FLAG'
    1257 |          ((u32)((_input) & (_flag)) / ((_flag) / (_result))))
         |                                     ^ ~~~~~~~~~~~~~~~~~~~~~
   drivers/net/ethernet/intel/igc/igc_main.c:6692:46: warning: shift count >= width of type [-Wshift-count-overflow]
    6692 |         err = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64));
         |                                                     ^~~~~~~~~~~~~~~~
   include/linux/dma-mapping.h:77:54: note: expanded from macro 'DMA_BIT_MASK'
      77 | #define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL<<(n))-1))
         |                                                      ^ ~~~
   5 warnings and 2 errors generated.
--
   In file included from drivers/net/ethernet/intel/igc/igc_xdp.c:5:
   In file included from include/net/xdp_sock_drv.h:10:
>> include/net/xsk_buff_pool.h:237:49: error: must use 'struct' tag to refer to type 'xdp_buff_xsk'
     237 | static inline bool xp_tx_metadata_enabled(const xdp_buff_xsk *xskb)
         |                                                 ^
         |                                                 struct 
>> include/net/xsk_buff_pool.h:239:9: error: use of undeclared identifier 'sq'
     239 |         return sq->xsk_pool->tx_metadata_len > 0;
         |                ^
   2 errors generated.
--
   In file included from drivers/net/ethernet/intel/i40e/i40e_main.c:16:
   In file included from include/net/xdp_sock_drv.h:10:
>> include/net/xsk_buff_pool.h:237:49: error: must use 'struct' tag to refer to type 'xdp_buff_xsk'
     237 | static inline bool xp_tx_metadata_enabled(const xdp_buff_xsk *xskb)
         |                                                 ^
         |                                                 struct 
>> include/net/xsk_buff_pool.h:239:9: error: use of undeclared identifier 'sq'
     239 |         return sq->xsk_pool->tx_metadata_len > 0;
         |                ^
   drivers/net/ethernet/intel/i40e/i40e_main.c:15666:46: warning: shift count >= width of type [-Wshift-count-overflow]
    15666 |         err = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64));
          |                                                     ^~~~~~~~~~~~~~~~
   include/linux/dma-mapping.h:77:54: note: expanded from macro 'DMA_BIT_MASK'
      77 | #define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL<<(n))-1))
         |                                                      ^ ~~~
   1 warning and 2 errors generated.
--
   In file included from drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:40:
   In file included from include/net/xdp_sock_drv.h:10:
>> include/net/xsk_buff_pool.h:237:49: error: must use 'struct' tag to refer to type 'xdp_buff_xsk'
     237 | static inline bool xp_tx_metadata_enabled(const xdp_buff_xsk *xskb)
         |                                                 ^
         |                                                 struct 
>> include/net/xsk_buff_pool.h:239:9: error: use of undeclared identifier 'sq'
     239 |         return sq->xsk_pool->tx_metadata_len > 0;
         |                ^
>> drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:8219:14: warning: division by zero is undefined [-Wdivision-by-zero]
    8219 |         cmd_type |= IXGBE_SET_FLAG(tx_flags, IXGBE_TX_FLAGS_HW_VLAN,
         |                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    8220 |                                    IXGBE_ADVTXD_DCMD_VLE);
         |                                    ~~~~~~~~~~~~~~~~~~~~~~
   drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:8209:26: note: expanded from macro 'IXGBE_SET_FLAG'
    8209 |          ((u32)(_input & _flag) / (_flag / _result)))
         |                                 ^ ~~~~~~~~~~~~~~~~~
   drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:8223:14: warning: division by zero is undefined [-Wdivision-by-zero]
    8223 |         cmd_type |= IXGBE_SET_FLAG(tx_flags, IXGBE_TX_FLAGS_TSO,
         |                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    8224 |                                    IXGBE_ADVTXD_DCMD_TSE);
         |                                    ~~~~~~~~~~~~~~~~~~~~~~
   drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:8209:26: note: expanded from macro 'IXGBE_SET_FLAG'
    8209 |          ((u32)(_input & _flag) / (_flag / _result)))
         |                                 ^ ~~~~~~~~~~~~~~~~~
   drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:8227:14: warning: division by zero is undefined [-Wdivision-by-zero]
    8227 |         cmd_type |= IXGBE_SET_FLAG(tx_flags, IXGBE_TX_FLAGS_TSTAMP,
         |                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    8228 |                                    IXGBE_ADVTXD_MAC_TSTAMP);
         |                                    ~~~~~~~~~~~~~~~~~~~~~~~~
   drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:8209:26: note: expanded from macro 'IXGBE_SET_FLAG'
    8209 |          ((u32)(_input & _flag) / (_flag / _result)))
         |                                 ^ ~~~~~~~~~~~~~~~~~
   drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:8231:14: warning: division by zero is undefined [-Wdivision-by-zero]
    8231 |         cmd_type ^= IXGBE_SET_FLAG(skb->no_fcs, 1, IXGBE_ADVTXD_DCMD_IFCS);
         |                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:8209:26: note: expanded from macro 'IXGBE_SET_FLAG'
    8209 |          ((u32)(_input & _flag) / (_flag / _result)))
         |                                 ^ ~~~~~~~~~~~~~~~~~
   drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:8242:19: warning: division by zero is undefined [-Wdivision-by-zero]
    8242 |         olinfo_status |= IXGBE_SET_FLAG(tx_flags,
         |                          ^~~~~~~~~~~~~~~~~~~~~~~~
    8243 |                                         IXGBE_TX_FLAGS_CSUM,
         |                                         ~~~~~~~~~~~~~~~~~~~~
    8244 |                                         IXGBE_ADVTXD_POPTS_TXSM);
         |                                         ~~~~~~~~~~~~~~~~~~~~~~~~
   drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:8209:26: note: expanded from macro 'IXGBE_SET_FLAG'
    8209 |          ((u32)(_input & _flag) / (_flag / _result)))
         |                                 ^ ~~~~~~~~~~~~~~~~~
   drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:8247:19: warning: division by zero is undefined [-Wdivision-by-zero]
    8247 |         olinfo_status |= IXGBE_SET_FLAG(tx_flags,
         |                          ^~~~~~~~~~~~~~~~~~~~~~~~
    8248 |                                         IXGBE_TX_FLAGS_IPV4,
         |                                         ~~~~~~~~~~~~~~~~~~~~
    8249 |                                         IXGBE_ADVTXD_POPTS_IXSM);
         |                                         ~~~~~~~~~~~~~~~~~~~~~~~~
   drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:8209:26: note: expanded from macro 'IXGBE_SET_FLAG'
    8209 |          ((u32)(_input & _flag) / (_flag / _result)))
         |                                 ^ ~~~~~~~~~~~~~~~~~
   drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:8252:19: warning: division by zero is undefined [-Wdivision-by-zero]
    8252 |         olinfo_status |= IXGBE_SET_FLAG(tx_flags,
         |                          ^~~~~~~~~~~~~~~~~~~~~~~~
    8253 |                                         IXGBE_TX_FLAGS_IPSEC,
         |                                         ~~~~~~~~~~~~~~~~~~~~~
    8254 |                                         IXGBE_ADVTXD_POPTS_IPSEC);
         |                                         ~~~~~~~~~~~~~~~~~~~~~~~~~
   drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:8209:26: note: expanded from macro 'IXGBE_SET_FLAG'
    8209 |          ((u32)(_input & _flag) / (_flag / _result)))
         |                                 ^ ~~~~~~~~~~~~~~~~~
   drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:8260:19: warning: division by zero is undefined [-Wdivision-by-zero]
    8260 |         olinfo_status |= IXGBE_SET_FLAG(tx_flags,
         |                          ^~~~~~~~~~~~~~~~~~~~~~~~
    8261 |                                         IXGBE_TX_FLAGS_CC,
         |                                         ~~~~~~~~~~~~~~~~~~
    8262 |                                         IXGBE_ADVTXD_CC);
         |                                         ~~~~~~~~~~~~~~~~
   drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:8209:26: note: expanded from macro 'IXGBE_SET_FLAG'
    8209 |          ((u32)(_input & _flag) / (_flag / _result)))
         |                                 ^ ~~~~~~~~~~~~~~~~~
   drivers/net/ethernet/intel/ixgbe/ixgbe_main.c:10789:46: warning: shift count >= width of type [-Wshift-count-overflow]
    10789 |         err = dma_set_mask_and_coherent(&pdev->dev, DMA_BIT_MASK(64));
          |                                                     ^~~~~~~~~~~~~~~~
   include/linux/dma-mapping.h:77:54: note: expanded from macro 'DMA_BIT_MASK'
      77 | #define DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL<<(n))-1))
         |                                                      ^ ~~~
   9 warnings and 2 errors generated.


vim +237 include/net/xsk_buff_pool.h

   236	
 > 237	static inline bool xp_tx_metadata_enabled(const xdp_buff_xsk *xskb)
   238	{
 > 239		return sq->xsk_pool->tx_metadata_len > 0;
   240	}
   241	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [xdp-hints] Re: [PATCH bpf-next 9/9] xsk: document XDP_TX_METADATA_LEN layout
  2023-08-09 20:39   ` [xdp-hints] " Jesper Dangaard Brouer
@ 2023-08-10 18:17     ` Stanislav Fomichev
  0 siblings, 0 replies; 29+ messages in thread
From: Stanislav Fomichev @ 2023-08-10 18:17 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, kuba, toke, willemb, dsahern,
	magnus.karlsson, bjorn, maciej.fijalkowski, netdev, xdp-hints

On 08/09, Jesper Dangaard Brouer wrote:
> 
> On 09/08/2023 18.54, Stanislav Fomichev wrote:
> > +Example
> > +=======
> > +
> > +See ``tools/testing/selftests/bpf/xdp_hw_metadata.c`` for an example
> > +program that handles TX metadata. Also see https://github.com/fomichev/xskgen
> > +for a more bare-bones example.
> 
> Will you consider maintaining this AF_XDP example here:
>  https://github.com/xdp-project/bpf-examples
> 
> E..g. the kernels old samples/bpf/xdpsock program moved here:
>  https://github.com/xdp-project/bpf-examples/tree/master/AF_XDP-example

Sure, I can put it wherever. Didn't want to put into the kernel tree..
 
> --Jesper
> p.s. Compile fail for fomichev/xskgen on my system

What is it complaining about? It's very ad-hoc, doesn't user proper
libxsk and has super simple hard-coded make rules :-)

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [xdp-hints] Re: [PATCH bpf-next 0/9] xsk: TX metadata
  2023-08-09 20:09 ` [xdp-hints] Re: [PATCH bpf-next 0/9] xsk: TX metadata Jesper Dangaard Brouer
@ 2023-08-10 18:23   ` Stanislav Fomichev
  0 siblings, 0 replies; 29+ messages in thread
From: Stanislav Fomichev @ 2023-08-10 18:23 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: bpf, brouer, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, haoluo, jolsa, kuba, toke, willemb,
	dsahern, magnus.karlsson, bjorn, maciej.fijalkowski, hawk,
	netdev, xdp-hints, Saeed Mahameed

On 08/09, Jesper Dangaard Brouer wrote:
> 
> 
> On 09/08/2023 18.54, Stanislav Fomichev wrote:
> > This series implements initial TX metadata (offloads) for AF_XDP.
> > See patch #2 for the main implementation and mlx5-one for the
> > example on how to consume the metadata on the device side.
> > 
> > Starting with two types of offloads:
> > - request TX timestamp (and write it back into the metadata area)
> > - request TX checksum offload
> > 
> > Changes since last RFC:
> > - add /* private: */ comments to headers (Simon)
> > - handle metadata only in the first frag (Simon)
> > - rename xdp_hw_metadata flags (Willem)
> > - s/tmo_request_checksum/tmo_request_timestamp/ in xdp_metadata_ops
> >    comment (Willem)
> > - Documentation/networking/xsk-tx-metadata.rst
> > 
> > RFC v4: https://lore.kernel.org/bpf/20230724235957.1953861-1-sdf@google.com/
> > 
> > Performance:
> > 
> > I've implemented a small xskgen tool to try to saturate single tx queue:
> > https://github.com/fomichev/xskgen/tree/master
> > 
> > Here are the performance numbers with some analysis.
> > 
> 
> What is the clock freq GHz for the CPU used for this benchmark?

That's an AMD:

model name      : AMD EPYC 7B12 64-Core Processor
cpu MHz         : 2250.071
cache size      : 512 KB


> > 1. Baseline. Running with commit eb62e6aef940 ("Merge branch 'bpf:
> > Support bpf_get_func_ip helper in uprobes'"), nothing from this series:
> > 
> > - with 1400 bytes of payload: 98 gbps, 8 mpps
> > ./xskgen -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
> > sent 10000000 packets 116960000000 bits, took 1.189130 sec, 98.357623 gbps 8.409509 mpps
> > 
> > - with 200 bytes of payload: 49 gbps, 23 mpps
> > ./xskgen -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
> > sent 10000064 packets 20960134144 bits, took 0.422235 sec, 49.640921 gbps 23.683645 mpps
> > 
> > 2. Adding single commit that supports reserving XDP_TX_METADATA_LEN
> >     changes nothing numbers-wise.
> > 
> > - baseline for 1400
> > ./xskgen -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
> > sent 10000000 packets 116960000000 bits, took 1.189247 sec, 98.347946 gbps 8.408682 mpps
> > 
> > - baseline for 200
> > ./xskgen -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
> > sent 10000000 packets 20960000000 bits, took 0.421248 sec, 49.756913 gbps 23.738985 mpps
> > 
> > 3. Adding -M flag causes xskgen to reserve the metadata and fill it, but
> >     doesn't set XDP_TX_METADATA descriptor option.
> > 
> > - new baseline for 1400 (with only filling the metadata)
> > ./xskgen -M -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
> > sent 10000000 packets 116960000000 bits, took 1.188767 sec, 98.387657 gbps 8.412077 mpps
> > 
> > - new baseline for 200 (with only filling the metadata)
> > ./xskgen -M -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
> > sent 10000000 packets 20960000000 bits, took 0.410213 sec, 51.095407 gbps 24.377579 mpps
> > (the numbers go sligtly up here, not really sure why, maybe some cache-related
> > side-effects?
> > 
> 
> Notice this change/improvement (23.7 to 24.4 Mpps) is only 1 nanosec.
>  - (1/24.377579-1/23.738985)*10^3 = -1.103499 nanosec
> 
> This 1 nanosec could be noise in your testlab.
 
Idk, it still doesn't make sense to me. I can consistently see this
difference with adding/removing -M flag :-/

> > 4. Next, I'm running the same test but with the commit that adds actual
> >     general infra to parse XDP_TX_METADATA (but no driver support).
> >     Essentially applying "xsk: add TX timestamp and TX checksum offload support"
> >     from this series. Numbers are the same.
> > 
> > - fill metadata for 1400
> > ./xskgen -M -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
> > sent 10000000 packets 116960000000 bits, took 1.188430 sec, 98.415557 gbps 8.414463 mpps
> > 
> > - fill metadata for 200
> > ./xskgen -M -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
> > sent 10000000 packets 20960000000 bits, took 0.411559 sec, 50.928299 gbps 24.297853 mpps
> > 
> > - request metadata for 1400
> > ./xskgen -m -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
> > sent 10000000 packets 116960000000 bits, took 1.188723 sec, 98.391299 gbps 8.412389 mpps
> > 
> > - request metadata for 200
> > ./xskgen -m -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
> > sent 10000064 packets 20960134144 bits, took 0.411240 sec, 50.968131 gbps 24.316856 mpps
> > 
> > 5. Now, for the most interesting part, I'm adding mlx5 driver support.
> >     The mpps for 200 bytes case goes down from 23 mpps to 19 mpps, but
> >     _only_ when I enable the metadata. This looks like a side effect
> >     of me pushing extra metadata pointer via mlx5e_xdpi_fifo_push.
> >     Hence, this part is wrapped into 'if (xp_tx_metadata_enabled)'
> >     to not affect the existing non-metadata use-cases. Since this is not
> >     regressing existing workloads, I'm not spending any time trying to
> >     optimize it more (and leaving it up to mlx owners to purse if
> >     they see any good way to do it).
> > 
> > - same baseline
> > ./xskgen -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
> > sent 10000000 packets 116960000000 bits, took 1.189434 sec, 98.332484 gbps 8.407360 mpps
> > 
> > ./xskgen -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
> > sent 10000128 packets 20960268288 bits, took 0.425254 sec, 49.288821 gbps 23.515659 mpps
> > 
> > - fill metadata for 1400
> > ./xskgen -M -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
> > sent 10000000 packets 116960000000 bits, took 1.189528 sec, 98.324714 gbps 8.406696 mpps
> > 
> > - fill metadata for 200
> > ./xskgen -M -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
> > sent 10000128 packets 20960268288 bits, took 0.519085 sec, 40.379260 gbps 19.264914 mpps
> > 
> 
> This change from 23 mpps to 19 mpps is approx 10 nanosec
>  - (1/23.738985-1/19.264914)*10^3 = -9.78 nanosec
>  - (1/23.515659-1/19.264914)*10^3 = -9.38 nanosec
> 
> A 10 nanosec difference is more than noise.

Yeah, it's because the of way mlx5 cleans the tx queue via that extra
mlx5e_xdpi_fifo_push/mlx5e_xdpi_fifo_pop. As long as I comment this part
out (since it's technically only needed for the timestamp), the numbers
look better.

> > - request metadata for 1400
> > ./xskgen -m -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
> > sent 10000000 packets 116960000000 bits, took 1.189329 sec, 98.341165 gbps 8.408102 mpps
> > 
> > - request metadata for 200
> > ./xskgen -m -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
> > sent 10000128 packets 20960268288 bits, took 0.519929 sec, 40.313713 gbps 19.233642 mpps
> > 
> > Cc: Saeed Mahameed <saeedm@nvidia.com>
> > 
> > Stanislav Fomichev (9):
> >    xsk: Support XDP_TX_METADATA_LEN
> >    xsk: add TX timestamp and TX checksum offload support
> >    tools: ynl: print xsk-features from the sample
> >    net/mlx5e: Implement AF_XDP TX timestamp and checksum offload
> >    selftests/xsk: Support XDP_TX_METADATA_LEN
> >    selftests/bpf: Add csum helpers
> >    selftests/bpf: Add TX side to xdp_metadata
> >    selftests/bpf: Add TX side to xdp_hw_metadata
> >    xsk: document XDP_TX_METADATA_LEN layout
> > 
> >   Documentation/netlink/specs/netdev.yaml       |  20 ++
> >   Documentation/networking/index.rst            |   1 +
> >   Documentation/networking/xsk-tx-metadata.rst  |  75 +++++++
> >   drivers/net/ethernet/mellanox/mlx5/core/en.h  |   4 +-
> >   .../net/ethernet/mellanox/mlx5/core/en/xdp.c  |  72 ++++++-
> >   .../net/ethernet/mellanox/mlx5/core/en/xdp.h  |  10 +-
> >   .../ethernet/mellanox/mlx5/core/en/xsk/tx.c   |  11 +-
> >   .../net/ethernet/mellanox/mlx5/core/en_main.c |   1 +
> >   include/linux/netdevice.h                     |  27 +++
> >   include/linux/skbuff.h                        |   5 +-
> >   include/net/xdp_sock.h                        |  61 ++++++
> >   include/net/xdp_sock_drv.h                    |  13 ++
> >   include/net/xsk_buff_pool.h                   |   6 +
> >   include/uapi/linux/if_xdp.h                   |  36 ++++
> >   include/uapi/linux/netdev.h                   |  16 ++
> >   net/core/netdev-genl.c                        |  12 +-
> >   net/xdp/xsk.c                                 |  61 ++++++
> >   net/xdp/xsk_buff_pool.c                       |   1 +
> >   net/xdp/xsk_queue.h                           |  19 +-
> >   tools/include/uapi/linux/if_xdp.h             |  50 ++++-
> >   tools/include/uapi/linux/netdev.h             |  15 ++
> >   tools/net/ynl/generated/netdev-user.c         |  19 ++
> >   tools/net/ynl/generated/netdev-user.h         |   3 +
> >   tools/net/ynl/samples/netdev.c                |   6 +
> >   tools/testing/selftests/bpf/network_helpers.h |  43 ++++
> >   .../selftests/bpf/prog_tests/xdp_metadata.c   |  31 ++-
> >   tools/testing/selftests/bpf/xdp_hw_metadata.c | 201 +++++++++++++++++-
> >   tools/testing/selftests/bpf/xsk.c             |  17 ++
> >   tools/testing/selftests/bpf/xsk.h             |   1 +
> >   29 files changed, 793 insertions(+), 44 deletions(-)
> >   create mode 100644 Documentation/networking/xsk-tx-metadata.rst
> > 
> 
> Thanks for working on this :-)
> --Jesper
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [xdp-hints] Re: [PATCH bpf-next 2/9] xsk: add TX timestamp and TX checksum offload support
  2023-08-09 20:18   ` [xdp-hints] " Jesper Dangaard Brouer
@ 2023-08-10 18:25     ` Stanislav Fomichev
  0 siblings, 0 replies; 29+ messages in thread
From: Stanislav Fomichev @ 2023-08-10 18:25 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: bpf, brouer, ast, daniel, andrii, martin.lau, song, yhs,
	john.fastabend, kpsingh, haoluo, jolsa, kuba, toke, willemb,
	dsahern, magnus.karlsson, bjorn, maciej.fijalkowski, hawk,
	netdev, xdp-hints

On 08/09, Jesper Dangaard Brouer wrote:
> 
> 
> On 09/08/2023 18.54, Stanislav Fomichev wrote:
> > diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
> > index 1f6fc8c7a84c..e2558ac3e195 100644
> > --- a/include/net/xdp_sock_drv.h
> > +++ b/include/net/xdp_sock_drv.h
> > @@ -165,6 +165,14 @@ static inline void *xsk_buff_raw_get_data(struct xsk_buff_pool *pool, u64 addr)
> >   	return xp_raw_get_data(pool, addr);
> >   }
> > +static inline struct xsk_tx_metadata *xsk_buff_get_metadata(struct xsk_buff_pool *pool, u64 addr)
> > +{
> > +	if (!pool->tx_metadata_len)
> > +		return NULL;
> > +
> > +	return xp_raw_get_data(pool, addr) - pool->tx_metadata_len;
> > +}
> > +
> >   static inline void xsk_buff_dma_sync_for_cpu(struct xdp_buff *xdp, struct xsk_buff_pool *pool)
> >   {
> >   	struct xdp_buff_xsk *xskb = container_of(xdp, struct xdp_buff_xsk, xdp);
> > @@ -324,6 +332,11 @@ static inline void *xsk_buff_raw_get_data(struct xsk_buff_pool *pool, u64 addr)
> >   	return NULL;
> >   }
> > +static inline struct xsk_tx_metadata *xsk_buff_get_metadata(struct xsk_buff_pool *pool, u64 addr)
> > +{
> > +	return NULL;
> > +}
> > +
> >   static inline void xsk_buff_dma_sync_for_cpu(struct xdp_buff *xdp, struct xsk_buff_pool *pool)
> >   {
> >   }
> > diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
> > index 9c31e8d1e198..3a559753e793 100644
> > --- a/include/net/xsk_buff_pool.h
> > +++ b/include/net/xsk_buff_pool.h
> > @@ -234,4 +234,9 @@ static inline u64 xp_get_handle(struct xdp_buff_xsk *xskb)
> >   	return xskb->orig_addr + (offset << XSK_UNALIGNED_BUF_OFFSET_SHIFT);
> >   }
> > +static inline bool xp_tx_metadata_enabled(const xdp_buff_xsk *xskb)
> 
> Hmm, shouldn't this argument be "struct xsk_buff_pool *pool" ?!?
> 
> > +{
> > +	return sq->xsk_pool->tx_metadata_len > 0;
> > +}
> 
> Will this even compile?

Yeah, you're right, this is completely bogus. This me doing the
'fixes' before sending out :-/ Will fix in a v2, thanks!

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [xdp-hints] Re: [PATCH bpf-next 1/9] xsk: Support XDP_TX_METADATA_LEN
  2023-08-09 16:54 ` [xdp-hints] [PATCH bpf-next 1/9] xsk: Support XDP_TX_METADATA_LEN Stanislav Fomichev
@ 2023-08-14 10:56   ` Maciej Fijalkowski
  2023-08-14 18:05     ` Stanislav Fomichev
  0 siblings, 1 reply; 29+ messages in thread
From: Maciej Fijalkowski @ 2023-08-14 10:56 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, kuba, toke, willemb, dsahern,
	magnus.karlsson, bjorn, hawk, netdev, xdp-hints

On Wed, Aug 09, 2023 at 09:54:10AM -0700, Stanislav Fomichev wrote:
> For zerocopy mode, tx_desc->addr can point to the arbitrary offset
> and carry some TX metadata in the headroom. For copy mode, there
> is no way currently to populate skb metadata.
> 
> Introduce new XDP_TX_METADATA_LEN that indicates how many bytes
> to treat as metadata. Metadata bytes come prior to tx_desc address
> (same as in RX case).
> 
> The size of the metadata has the same constraints as XDP:
> - less than 256 bytes
> - 4-byte aligned
> - non-zero
> 
> This data is not interpreted in any way right now.
> 
> Signed-off-by: Stanislav Fomichev <sdf@google.com>
> ---
>  include/net/xdp_sock.h      |  1 +
>  include/net/xsk_buff_pool.h |  1 +
>  include/uapi/linux/if_xdp.h |  1 +
>  net/xdp/xsk.c               | 20 ++++++++++++++++++++
>  net/xdp/xsk_buff_pool.c     |  1 +
>  net/xdp/xsk_queue.h         | 17 ++++++++++-------
>  6 files changed, 34 insertions(+), 7 deletions(-)
> 
> diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> index 1617af380162..467b9fb56827 100644
> --- a/include/net/xdp_sock.h
> +++ b/include/net/xdp_sock.h
> @@ -51,6 +51,7 @@ struct xdp_sock {
>  	struct list_head flush_node;
>  	struct xsk_buff_pool *pool;
>  	u16 queue_id;
> +	u8 tx_metadata_len;
>  	bool zc;
>  	bool sg;
>  	enum {
> diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
> index b0bdff26fc88..9c31e8d1e198 100644
> --- a/include/net/xsk_buff_pool.h
> +++ b/include/net/xsk_buff_pool.h
> @@ -77,6 +77,7 @@ struct xsk_buff_pool {
>  	u32 chunk_size;
>  	u32 chunk_shift;
>  	u32 frame_len;
> +	u8 tx_metadata_len; /* inherited from xsk_sock */
>  	u8 cached_need_wakeup;
>  	bool uses_need_wakeup;
>  	bool dma_need_sync;
> diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
> index 8d48863472b9..b37b50102e1c 100644
> --- a/include/uapi/linux/if_xdp.h
> +++ b/include/uapi/linux/if_xdp.h
> @@ -69,6 +69,7 @@ struct xdp_mmap_offsets {
>  #define XDP_UMEM_COMPLETION_RING	6
>  #define XDP_STATISTICS			7
>  #define XDP_OPTIONS			8
> +#define XDP_TX_METADATA_LEN		9
>  
>  struct xdp_umem_reg {
>  	__u64 addr; /* Start of packet data area */
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 47796a5a79b3..28df3280501d 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -1338,6 +1338,26 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
>  		mutex_unlock(&xs->mutex);
>  		return err;
>  	}
> +	case XDP_TX_METADATA_LEN:
> +	{
> +		int val;
> +
> +		if (optlen < sizeof(val))
> +			return -EINVAL;
> +		if (copy_from_sockptr(&val, optval, sizeof(val)))
> +			return -EFAULT;
> +		if (!val || val > 256 || val % 4)
> +			return -EINVAL;
> +
> +		mutex_lock(&xs->mutex);
> +		if (xs->state != XSK_READY) {
> +			mutex_unlock(&xs->mutex);
> +			return -EBUSY;
> +		}
> +		xs->tx_metadata_len = val;
> +		mutex_unlock(&xs->mutex);
> +		return 0;
> +	}
>  	default:
>  		break;
>  	}
> diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
> index b3f7b310811e..b351732f1032 100644
> --- a/net/xdp/xsk_buff_pool.c
> +++ b/net/xdp/xsk_buff_pool.c
> @@ -85,6 +85,7 @@ struct xsk_buff_pool *xp_create_and_assign_umem(struct xdp_sock *xs,
>  		XDP_PACKET_HEADROOM;
>  	pool->umem = umem;
>  	pool->addrs = umem->addrs;
> +	pool->tx_metadata_len = xs->tx_metadata_len;

Hey Stan,

what would happen in case when one socket sets pool->tx_metadata_len say
to 16 and the other one that is sharing the pool to 24? If sockets should
and can have different padding, then this will not work unless the
metadata_len is on a per socket level.

>  	INIT_LIST_HEAD(&pool->free_list);
>  	INIT_LIST_HEAD(&pool->xskb_list);
>  	INIT_LIST_HEAD(&pool->xsk_tx_list);
> diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
> index 13354a1e4280..c74a1372bcb9 100644
> --- a/net/xdp/xsk_queue.h
> +++ b/net/xdp/xsk_queue.h
> @@ -143,15 +143,17 @@ static inline bool xp_unused_options_set(u32 options)
>  static inline bool xp_aligned_validate_desc(struct xsk_buff_pool *pool,
>  					    struct xdp_desc *desc)
>  {
> -	u64 offset = desc->addr & (pool->chunk_size - 1);
> +	u64 addr = desc->addr - pool->tx_metadata_len;
> +	u64 len = desc->len + pool->tx_metadata_len;
> +	u64 offset = addr & (pool->chunk_size - 1);
>  
>  	if (!desc->len)
>  		return false;
>  
> -	if (offset + desc->len > pool->chunk_size)
> +	if (offset + len > pool->chunk_size)
>  		return false;
>  
> -	if (desc->addr >= pool->addrs_cnt)
> +	if (addr >= pool->addrs_cnt)
>  		return false;
>  
>  	if (xp_unused_options_set(desc->options))
> @@ -162,16 +164,17 @@ static inline bool xp_aligned_validate_desc(struct xsk_buff_pool *pool,
>  static inline bool xp_unaligned_validate_desc(struct xsk_buff_pool *pool,
>  					      struct xdp_desc *desc)
>  {
> -	u64 addr = xp_unaligned_add_offset_to_addr(desc->addr);
> +	u64 addr = xp_unaligned_add_offset_to_addr(desc->addr) - pool->tx_metadata_len;
> +	u64 len = desc->len + pool->tx_metadata_len;
>  
>  	if (!desc->len)
>  		return false;
>  
> -	if (desc->len > pool->chunk_size)
> +	if (len > pool->chunk_size)
>  		return false;
>  
> -	if (addr >= pool->addrs_cnt || addr + desc->len > pool->addrs_cnt ||
> -	    xp_desc_crosses_non_contig_pg(pool, addr, desc->len))
> +	if (addr >= pool->addrs_cnt || addr + len > pool->addrs_cnt ||
> +	    xp_desc_crosses_non_contig_pg(pool, addr, len))
>  		return false;
>  
>  	if (xp_unused_options_set(desc->options))
> -- 
> 2.41.0.640.ga95def55d0-goog
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [xdp-hints] Re: [PATCH bpf-next 2/9] xsk: add TX timestamp and TX checksum offload support
  2023-08-09 16:54 ` [xdp-hints] [PATCH bpf-next 2/9] xsk: add TX timestamp and TX checksum offload support Stanislav Fomichev
                     ` (2 preceding siblings ...)
  2023-08-10  6:12   ` kernel test robot
@ 2023-08-14 11:01   ` Maciej Fijalkowski
  2023-08-14 18:05     ` Stanislav Fomichev
  3 siblings, 1 reply; 29+ messages in thread
From: Maciej Fijalkowski @ 2023-08-14 11:01 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, kuba, toke, willemb, dsahern,
	magnus.karlsson, bjorn, hawk, netdev, xdp-hints

On Wed, Aug 09, 2023 at 09:54:11AM -0700, Stanislav Fomichev wrote:
> This change actually defines the (initial) metadata layout
> that should be used by AF_XDP userspace (xsk_tx_metadata).
> The first field is flags which requests appropriate offloads,
> followed by the offload-specific fields. The supported per-device
> offloads are exported via netlink (new xsk-flags).
> 
> The offloads themselves are still implemented in a bit of a
> framework-y fashion that's left from my initial kfunc attempt.
> I'm introducing new xsk_tx_metadata_ops which drivers are
> supposed to implement. The drivers are also supposed
> to call xsk_tx_metadata_request/xsk_tx_metadata_complete in
> the right places. Since xsk_tx_metadata_{request,_complete}
> are static inline, we don't incur any extra overhead doing
> indirect calls.
> 
> The benefit of this scheme is as follows:
> - keeps all metadata layout parsing away from driver code
> - makes it easy to grep and see which drivers implement what
> - don't need any extra flags to maintain to keep track of what
>   offloads are implemented; if the callback is implemented - the offload
>   is supported (used by netlink reporting code)
> 
> Two offloads are defined right now:
> 1. XDP_TX_METADATA_CHECKSUM: skb-style csum_start+csum_offset
> 2. XDP_TX_METADATA_TIMESTAMP: writes TX timestamp back into metadata
>    area upon completion (tx_timestamp field)
> 
> The offloads are also implemented for copy mode:
> 1. Extra XDP_TX_METADATA_CHECKSUM_SW to trigger skb_checksum_help; this
>    might be useful as a reference implementation and for testing
> 2. XDP_TX_METADATA_TIMESTAMP writes SW timestamp from the skb
>    destructor (note I'm reusing hwtstamps to pass metadata pointer)
> 
> The struct is forward-compatible and can be extended in the future
> by appending more fields.
> 
> Signed-off-by: Stanislav Fomichev <sdf@google.com>
> ---
>  Documentation/netlink/specs/netdev.yaml | 20 +++++++++
>  include/linux/netdevice.h               | 27 +++++++++++
>  include/linux/skbuff.h                  |  5 ++-
>  include/net/xdp_sock.h                  | 60 +++++++++++++++++++++++++
>  include/net/xdp_sock_drv.h              | 13 ++++++
>  include/net/xsk_buff_pool.h             |  5 +++
>  include/uapi/linux/if_xdp.h             | 35 +++++++++++++++
>  include/uapi/linux/netdev.h             | 16 +++++++
>  net/core/netdev-genl.c                  | 12 ++++-
>  net/xdp/xsk.c                           | 41 +++++++++++++++++
>  net/xdp/xsk_queue.h                     |  2 +-
>  tools/include/uapi/linux/if_xdp.h       | 50 ++++++++++++++++++---
>  tools/include/uapi/linux/netdev.h       | 15 +++++++
>  13 files changed, 293 insertions(+), 8 deletions(-)
> 

[...]

> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 0896aaa91dd7..3f02aaa30590 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1647,6 +1647,31 @@ struct net_device_ops {
>  						    struct netlink_ext_ack *extack);
>  };
>  
> +/*
> + * This structure defines the AF_XDP TX metadata hooks for network devices.
> + * The following hooks can be defined; unless noted otherwise, they are
> + * optional and can be filled with a null pointer.
> + *
> + * int (*tmo_request_timestamp)(void *priv)
> + *     This function is called when AF_XDP frame requested egress timestamp.
> + *
> + * int (*tmo_fill_timestamp)(void *priv)
> + *     This function is called when AF_XDP frame, that had requested
> + *     egress timestamp, received a completion. The hook needs to return
> + *     the actual HW timestamp.
> + *
> + * int (*tmo_request_checksum)(u16 csum_start, u16 csum_offset, void *priv)
> + *     This function is called when AF_XDP frame requested HW checksum
> + *     offload. csum_start indicates position where checksumming should start.
> + *     csum_offset indicates position where checksum should be stored.
> + *
> + */
> +struct xsk_tx_metadata_ops {
> +	void	(*tmo_request_timestamp)(void *priv);
> +	u64	(*tmo_fill_timestamp)(void *priv);
> +	void	(*tmo_request_checksum)(u16 csum_start, u16 csum_offset, void *priv);
> +};
> +
>  /**
>   * enum netdev_priv_flags - &struct net_device priv_flags
>   *
> @@ -1835,6 +1860,7 @@ enum netdev_ml_priv_type {
>   *	@netdev_ops:	Includes several pointers to callbacks,
>   *			if one wants to override the ndo_*() functions
>   *	@xdp_metadata_ops:	Includes pointers to XDP metadata callbacks.
> + *	@xsk_tx_metadata_ops:	Includes pointers to AF_XDP TX metadata callbacks.
>   *	@ethtool_ops:	Management operations
>   *	@l3mdev_ops:	Layer 3 master device operations
>   *	@ndisc_ops:	Includes callbacks for different IPv6 neighbour
> @@ -2091,6 +2117,7 @@ struct net_device {
>  	unsigned long long	priv_flags;
>  	const struct net_device_ops *netdev_ops;
>  	const struct xdp_metadata_ops *xdp_metadata_ops;
> +	const struct xsk_tx_metadata_ops *xsk_tx_metadata_ops;
>  	int			ifindex;
>  	unsigned short		gflags;
>  	unsigned short		hard_header_len;
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 16a49ba534e4..5d73d5df67fb 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -579,7 +579,10 @@ struct skb_shared_info {
>  	/* Warning: this field is not always filled in (UFO)! */
>  	unsigned short	gso_segs;
>  	struct sk_buff	*frag_list;
> -	struct skb_shared_hwtstamps hwtstamps;
> +	union {
> +		struct skb_shared_hwtstamps hwtstamps;
> +		struct xsk_tx_metadata *xsk_meta;
> +	};
>  	unsigned int	gso_type;
>  	u32		tskey;
>  
> diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> index 467b9fb56827..288fa58c4665 100644
> --- a/include/net/xdp_sock.h
> +++ b/include/net/xdp_sock.h
> @@ -90,6 +90,54 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
>  int __xsk_map_redirect(struct xdp_sock *xs, struct xdp_buff *xdp);
>  void __xsk_map_flush(void);
>  
> +/**
> + *  xsk_tx_metadata_request - Evaluate AF_XDP TX metadata at submission
> + *  and call appropriate xsk_tx_metadata_ops operation.
> + *  @meta: pointer to AF_XDP metadata area
> + *  @ops: pointer to struct xsk_tx_metadata_ops
> + *  @priv: pointer to driver-private aread
> + *
> + *  This function should be called by the networking device when
> + *  it prepares AF_XDP egress packet.
> + */
> +static inline void xsk_tx_metadata_request(const struct xsk_tx_metadata *meta,
> +					   const struct xsk_tx_metadata_ops *ops,
> +					   void *priv)
> +{
> +	if (!meta)
> +		return;
> +
> +	if (ops->tmo_request_timestamp)
> +		if (meta->flags & XDP_TX_METADATA_TIMESTAMP)

We should have a copy of flags or any other things that we read multiple
times from metadata in order to avoid potential attacks from user space.
An example of that is the fact that timestamp metadata handling is two
step process, meaning to fill the timestamp you have to request it in the
first place. If user space would set XDP_TX_METADATA_TIMESTAMP after
sending but before completing we would crash the kernel potentially.

We could also move the responsibility of handling that issue to driver
programmers but IMHO that would be harder to implement, hence we think
handling it in core would be better.


> +			ops->tmo_request_timestamp(priv);
> +
> +	if (ops->tmo_request_checksum)
> +		if (meta->flags & XDP_TX_METADATA_CHECKSUM)
> +			ops->tmo_request_checksum(meta->csum_start, meta->csum_offset, priv);
> +}
> +
> +/**
> + *  xsk_tx_metadata_complete - Evaluate AF_XDP TX metadata at completion
> + *  and call appropriate xsk_tx_metadata_ops operation.
> + *  @meta: pointer to AF_XDP metadata area
> + *  @ops: pointer to struct xsk_tx_metadata_ops
> + *  @priv: pointer to driver-private aread
> + *
> + *  This function should be called by the networking device upon
> + *  AF_XDP egress completion.
> + */
> +static inline void xsk_tx_metadata_complete(struct xsk_tx_metadata *meta,
> +					    const struct xsk_tx_metadata_ops *ops,
> +					    void *priv)
> +{
> +	if (!meta)
> +		return;
> +
> +	if (ops->tmo_fill_timestamp)
> +		if (meta->flags & XDP_TX_METADATA_TIMESTAMP)
> +			meta->tx_timestamp = ops->tmo_fill_timestamp(priv);
> +}
> +
>  #else
>  
>  static inline int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
> @@ -106,6 +154,18 @@ static inline void __xsk_map_flush(void)
>  {
>  }
>  
> +static inline void xsk_tx_metadata_request(struct xsk_tx_metadata *meta,
> +					   const struct xsk_tx_metadata_ops *ops,
> +					   void *priv)
> +{
> +}
> +
> +static inline void xsk_tx_metadata_complete(struct xsk_tx_metadata *meta,
> +					    const struct xsk_tx_metadata_ops *ops,
> +					    void *priv)
> +{
> +}
> +
>  #endif /* CONFIG_XDP_SOCKETS */
>  
>  #endif /* _LINUX_XDP_SOCK_H */

[...]

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [xdp-hints] Re: [PATCH bpf-next 4/9] net/mlx5e: Implement AF_XDP TX timestamp and checksum offload
  2023-08-09 16:54 ` [xdp-hints] [PATCH bpf-next 4/9] net/mlx5e: Implement AF_XDP TX timestamp and checksum offload Stanislav Fomichev
@ 2023-08-14 11:02   ` Maciej Fijalkowski
  2023-08-14 18:05     ` Stanislav Fomichev
  0 siblings, 1 reply; 29+ messages in thread
From: Maciej Fijalkowski @ 2023-08-14 11:02 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, kuba, toke, willemb, dsahern,
	magnus.karlsson, bjorn, hawk, netdev, xdp-hints, Saeed Mahameed

On Wed, Aug 09, 2023 at 09:54:13AM -0700, Stanislav Fomichev wrote:
> TX timestamp:
> - requires passing clock, not sure I'm passing the correct one (from
>   cq->mdev), but the timestamp value looks convincing
> 
> TX checksum:
> - looks like device does packet parsing (and doesn't accept custom
>   start/offset), so I'm ignoring user offsets
> 
> Cc: Saeed Mahameed <saeedm@nvidia.com>
> Signed-off-by: Stanislav Fomichev <sdf@google.com>
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/en.h  |  4 +-
>  .../net/ethernet/mellanox/mlx5/core/en/xdp.c  | 72 ++++++++++++++++---
>  .../net/ethernet/mellanox/mlx5/core/en/xdp.h  | 10 ++-
>  .../ethernet/mellanox/mlx5/core/en/xsk/tx.c   | 11 ++-
>  .../net/ethernet/mellanox/mlx5/core/en_main.c |  1 +
>  5 files changed, 82 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> index 0f8f70b91485..6f38627ae7f8 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> @@ -476,10 +476,12 @@ struct mlx5e_xdp_info_fifo {
>  
>  struct mlx5e_xdpsq;
>  struct mlx5e_xmit_data;
> +struct xsk_tx_metadata;
>  typedef int (*mlx5e_fp_xmit_xdp_frame_check)(struct mlx5e_xdpsq *);
>  typedef bool (*mlx5e_fp_xmit_xdp_frame)(struct mlx5e_xdpsq *,
>  					struct mlx5e_xmit_data *,
> -					int);
> +					int,
> +					struct xsk_tx_metadata *);
>  
>  struct mlx5e_xdpsq {
>  	/* data path */
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
> index 40589cebb773..197d372048ec 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
> @@ -102,7 +102,7 @@ mlx5e_xmit_xdp_buff(struct mlx5e_xdpsq *sq, struct mlx5e_rq *rq,
>  		xdptxd->dma_addr = dma_addr;
>  
>  		if (unlikely(!INDIRECT_CALL_2(sq->xmit_xdp_frame, mlx5e_xmit_xdp_frame_mpwqe,
> -					      mlx5e_xmit_xdp_frame, sq, xdptxd, 0)))
> +					      mlx5e_xmit_xdp_frame, sq, xdptxd, 0, NULL)))
>  			return false;
>  
>  		/* xmit_mode == MLX5E_XDP_XMIT_MODE_FRAME */
> @@ -144,7 +144,7 @@ mlx5e_xmit_xdp_buff(struct mlx5e_xdpsq *sq, struct mlx5e_rq *rq,
>  	xdptxd->dma_addr = dma_addr;
>  
>  	if (unlikely(!INDIRECT_CALL_2(sq->xmit_xdp_frame, mlx5e_xmit_xdp_frame_mpwqe,
> -				      mlx5e_xmit_xdp_frame, sq, xdptxd, 0)))
> +				      mlx5e_xmit_xdp_frame, sq, xdptxd, 0, NULL)))
>  		return false;
>  
>  	/* xmit_mode == MLX5E_XDP_XMIT_MODE_PAGE */
> @@ -260,6 +260,37 @@ const struct xdp_metadata_ops mlx5e_xdp_metadata_ops = {
>  	.xmo_rx_hash			= mlx5e_xdp_rx_hash,
>  };
>  
> +struct mlx5e_xsk_tx_complete {
> +	struct mlx5_cqe64 *cqe;
> +	struct mlx5e_cq *cq;
> +};
> +
> +static u64 mlx5e_xsk_fill_timestamp(void *_priv)
> +{
> +	struct mlx5e_xsk_tx_complete *priv = _priv;
> +	u64 ts;
> +
> +	ts = get_cqe_ts(priv->cqe);
> +
> +	if (mlx5_is_real_time_rq(priv->cq->mdev) || mlx5_is_real_time_sq(priv->cq->mdev))
> +		return mlx5_real_time_cyc2time(&priv->cq->mdev->clock, ts);
> +
> +	return  mlx5_timecounter_cyc2time(&priv->cq->mdev->clock, ts);
> +}
> +
> +static void mlx5e_xsk_request_checksum(u16 csum_start, u16 csum_offset, void *priv)
> +{
> +	struct mlx5_wqe_eth_seg *eseg = priv;
> +
> +	/* HW/FW is doing parsing, so offsets are largely ignored. */
> +	eseg->cs_flags |= MLX5_ETH_WQE_L3_CSUM | MLX5_ETH_WQE_L4_CSUM;
> +}
> +
> +const struct xsk_tx_metadata_ops mlx5e_xsk_tx_metadata_ops = {
> +	.tmo_fill_timestamp		= mlx5e_xsk_fill_timestamp,
> +	.tmo_request_checksum		= mlx5e_xsk_request_checksum,

Can you explain to us why mlx5 doesn't need to implement the request
timestamp op?

> +};
> +
>  /* returns true if packet was consumed by xdp */
>  bool mlx5e_xdp_handle(struct mlx5e_rq *rq,
>  		      struct bpf_prog *prog, struct mlx5e_xdp_buff *mxbuf)
> @@ -397,11 +428,11 @@ INDIRECT_CALLABLE_SCOPE int mlx5e_xmit_xdp_frame_check_mpwqe(struct mlx5e_xdpsq
>  
>  INDIRECT_CALLABLE_SCOPE bool
>  mlx5e_xmit_xdp_frame(struct mlx5e_xdpsq *sq, struct mlx5e_xmit_data *xdptxd,
> -		     int check_result);
> +		     int check_result, struct xsk_tx_metadata *meta);
>  
>  INDIRECT_CALLABLE_SCOPE bool
>  mlx5e_xmit_xdp_frame_mpwqe(struct mlx5e_xdpsq *sq, struct mlx5e_xmit_data *xdptxd,
> -			   int check_result)
> +			   int check_result, struct xsk_tx_metadata *meta)
>  {
>  	struct mlx5e_tx_mpwqe *session = &sq->mpwqe;
>  	struct mlx5e_xdpsq_stats *stats = sq->stats;
> @@ -419,7 +450,7 @@ mlx5e_xmit_xdp_frame_mpwqe(struct mlx5e_xdpsq *sq, struct mlx5e_xmit_data *xdptx
>  			 */
>  			if (unlikely(sq->mpwqe.wqe))
>  				mlx5e_xdp_mpwqe_complete(sq);
> -			return mlx5e_xmit_xdp_frame(sq, xdptxd, 0);
> +			return mlx5e_xmit_xdp_frame(sq, xdptxd, 0, meta);
>  		}
>  		if (!xdptxd->len) {
>  			skb_frag_t *frag = &xdptxdf->sinfo->frags[0];
> @@ -449,6 +480,7 @@ mlx5e_xmit_xdp_frame_mpwqe(struct mlx5e_xdpsq *sq, struct mlx5e_xmit_data *xdptx
>  		 * and it's safe to complete it at any time.
>  		 */
>  		mlx5e_xdp_mpwqe_session_start(sq);
> +		xsk_tx_metadata_request(meta, &mlx5e_xsk_tx_metadata_ops, &session->wqe->eth);
>  	}
>  
>  	mlx5e_xdp_mpwqe_add_dseg(sq, p, stats);
> @@ -479,7 +511,7 @@ INDIRECT_CALLABLE_SCOPE int mlx5e_xmit_xdp_frame_check(struct mlx5e_xdpsq *sq)
>  
>  INDIRECT_CALLABLE_SCOPE bool
>  mlx5e_xmit_xdp_frame(struct mlx5e_xdpsq *sq, struct mlx5e_xmit_data *xdptxd,
> -		     int check_result)
> +		     int check_result, struct xsk_tx_metadata *meta)
>  {
>  	struct mlx5e_xmit_data_frags *xdptxdf =
>  		container_of(xdptxd, struct mlx5e_xmit_data_frags, xd);
> @@ -598,6 +630,8 @@ mlx5e_xmit_xdp_frame(struct mlx5e_xdpsq *sq, struct mlx5e_xmit_data *xdptxd,
>  		sq->pc++;
>  	}
>  
> +	xsk_tx_metadata_request(meta, &mlx5e_xsk_tx_metadata_ops, eseg);
> +
>  	sq->doorbell_cseg = cseg;
>  
>  	stats->xmit++;
> @@ -607,7 +641,9 @@ mlx5e_xmit_xdp_frame(struct mlx5e_xdpsq *sq, struct mlx5e_xmit_data *xdptxd,
>  static void mlx5e_free_xdpsq_desc(struct mlx5e_xdpsq *sq,
>  				  struct mlx5e_xdp_wqe_info *wi,
>  				  u32 *xsk_frames,
> -				  struct xdp_frame_bulk *bq)
> +				  struct xdp_frame_bulk *bq,
> +				  struct mlx5e_cq *cq,
> +				  struct mlx5_cqe64 *cqe)
>  {
>  	struct mlx5e_xdp_info_fifo *xdpi_fifo = &sq->db.xdpi_fifo;
>  	u16 i;
> @@ -667,10 +703,24 @@ static void mlx5e_free_xdpsq_desc(struct mlx5e_xdpsq *sq,
>  
>  			break;
>  		}
> -		case MLX5E_XDP_XMIT_MODE_XSK:
> +		case MLX5E_XDP_XMIT_MODE_XSK: {
>  			/* AF_XDP send */
> +			struct mlx5e_xsk_tx_complete priv = {
> +				.cqe = cqe,
> +				.cq = cq,
> +			};
> +			struct xsk_tx_metadata *meta = NULL;
> +
> +			if (xp_tx_metadata_enabled(sq->xsk_pool)) {
> +				xdpi = mlx5e_xdpi_fifo_pop(xdpi_fifo);
> +				meta = (void *)xdpi.frame.xsk_meta;
> +
> +				xsk_tx_metadata_complete(meta, &mlx5e_xsk_tx_metadata_ops, &priv);
> +			}
> +
>  			(*xsk_frames)++;
>  			break;
> +		}
>  		default:
>  			WARN_ON_ONCE(true);
>  		}
> @@ -719,7 +769,7 @@ bool mlx5e_poll_xdpsq_cq(struct mlx5e_cq *cq)
>  
>  			sqcc += wi->num_wqebbs;
>  
> -			mlx5e_free_xdpsq_desc(sq, wi, &xsk_frames, &bq);
> +			mlx5e_free_xdpsq_desc(sq, wi, &xsk_frames, &bq, cq, cqe);
>  		} while (!last_wqe);
>  
>  		if (unlikely(get_cqe_opcode(cqe) != MLX5_CQE_REQ)) {
> @@ -766,7 +816,7 @@ void mlx5e_free_xdpsq_descs(struct mlx5e_xdpsq *sq)
>  
>  		sq->cc += wi->num_wqebbs;
>  
> -		mlx5e_free_xdpsq_desc(sq, wi, &xsk_frames, &bq);
> +		mlx5e_free_xdpsq_desc(sq, wi, &xsk_frames, &bq, NULL, NULL);
>  	}
>  
>  	xdp_flush_frame_bulk(&bq);
> @@ -839,7 +889,7 @@ int mlx5e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
>  		}
>  
>  		ret = INDIRECT_CALL_2(sq->xmit_xdp_frame, mlx5e_xmit_xdp_frame_mpwqe,
> -				      mlx5e_xmit_xdp_frame, sq, xdptxd, 0);
> +				      mlx5e_xmit_xdp_frame, sq, xdptxd, 0, NULL);
>  		if (unlikely(!ret)) {
>  			int j;
>  
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
> index 9e8e6184f9e4..2fcd19c16103 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
> @@ -82,13 +82,14 @@ enum mlx5e_xdp_xmit_mode {
>   *    num, page_1, page_2, ... , page_num.
>   *
>   * MLX5E_XDP_XMIT_MODE_XSK:
> - *    none.
> + *    frame.xsk_meta.
>   */
>  union mlx5e_xdp_info {
>  	enum mlx5e_xdp_xmit_mode mode;
>  	union {
>  		struct xdp_frame *xdpf;
>  		dma_addr_t dma_addr;
> +		void *xsk_meta;
>  	} frame;
>  	union {
>  		struct mlx5e_rq *rq;
> @@ -110,13 +111,16 @@ int mlx5e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
>  		   u32 flags);
>  
>  extern const struct xdp_metadata_ops mlx5e_xdp_metadata_ops;
> +extern const struct xsk_tx_metadata_ops mlx5e_xsk_tx_metadata_ops;
>  
>  INDIRECT_CALLABLE_DECLARE(bool mlx5e_xmit_xdp_frame_mpwqe(struct mlx5e_xdpsq *sq,
>  							  struct mlx5e_xmit_data *xdptxd,
> -							  int check_result));
> +							  int check_result,
> +							  struct xsk_tx_metadata *meta));
>  INDIRECT_CALLABLE_DECLARE(bool mlx5e_xmit_xdp_frame(struct mlx5e_xdpsq *sq,
>  						    struct mlx5e_xmit_data *xdptxd,
> -						    int check_result));
> +						    int check_result,
> +						    struct xsk_tx_metadata *meta));
>  INDIRECT_CALLABLE_DECLARE(int mlx5e_xmit_xdp_frame_check_mpwqe(struct mlx5e_xdpsq *sq));
>  INDIRECT_CALLABLE_DECLARE(int mlx5e_xmit_xdp_frame_check(struct mlx5e_xdpsq *sq));
>  
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.c
> index 597f319d4770..2f69c7912490 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.c
> @@ -55,12 +55,16 @@ static void mlx5e_xsk_tx_post_err(struct mlx5e_xdpsq *sq,
>  
>  	nopwqe = mlx5e_post_nop(&sq->wq, sq->sqn, &sq->pc);
>  	mlx5e_xdpi_fifo_push(&sq->db.xdpi_fifo, *xdpi);
> +	if (xp_tx_metadata_enabled(sq->xsk_pool))
> +		mlx5e_xdpi_fifo_push(&sq->db.xdpi_fifo,
> +				     (union mlx5e_xdp_info) { .frame.xsk_meta = NULL });
>  	sq->doorbell_cseg = &nopwqe->ctrl;
>  }
>  
>  bool mlx5e_xsk_tx(struct mlx5e_xdpsq *sq, unsigned int budget)
>  {
>  	struct xsk_buff_pool *pool = sq->xsk_pool;
> +	struct xsk_tx_metadata *meta = NULL;
>  	union mlx5e_xdp_info xdpi;
>  	bool work_done = true;
>  	bool flush = false;
> @@ -93,12 +97,13 @@ bool mlx5e_xsk_tx(struct mlx5e_xdpsq *sq, unsigned int budget)
>  		xdptxd.dma_addr = xsk_buff_raw_get_dma(pool, desc.addr);
>  		xdptxd.data = xsk_buff_raw_get_data(pool, desc.addr);
>  		xdptxd.len = desc.len;
> +		meta = xsk_buff_get_metadata(pool, desc.addr);
>  
>  		xsk_buff_raw_dma_sync_for_device(pool, xdptxd.dma_addr, xdptxd.len);
>  
>  		ret = INDIRECT_CALL_2(sq->xmit_xdp_frame, mlx5e_xmit_xdp_frame_mpwqe,
>  				      mlx5e_xmit_xdp_frame, sq, &xdptxd,
> -				      check_result);
> +				      check_result, meta);
>  		if (unlikely(!ret)) {
>  			if (sq->mpwqe.wqe)
>  				mlx5e_xdp_mpwqe_complete(sq);
> @@ -106,6 +111,10 @@ bool mlx5e_xsk_tx(struct mlx5e_xdpsq *sq, unsigned int budget)
>  			mlx5e_xsk_tx_post_err(sq, &xdpi);
>  		} else {
>  			mlx5e_xdpi_fifo_push(&sq->db.xdpi_fifo, xdpi);
> +			if (xp_tx_metadata_enabled(sq->xsk_pool))
> +				mlx5e_xdpi_fifo_push(&sq->db.xdpi_fifo,
> +						     (union mlx5e_xdp_info)
> +						     { .frame.xsk_meta = (void *)meta });
>  		}
>  
>  		flush = true;
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> index 1c820119e438..99c2a6babaea 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> @@ -5097,6 +5097,7 @@ static void mlx5e_build_nic_netdev(struct net_device *netdev)
>  
>  	netdev->netdev_ops = &mlx5e_netdev_ops;
>  	netdev->xdp_metadata_ops = &mlx5e_xdp_metadata_ops;
> +	netdev->xsk_tx_metadata_ops = &mlx5e_xsk_tx_metadata_ops;
>  
>  	mlx5e_dcbnl_build_netdev(netdev);
>  
> -- 
> 2.41.0.640.ga95def55d0-goog
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [xdp-hints] Re: [PATCH bpf-next 0/9] xsk: TX metadata
  2023-08-09 16:54 [xdp-hints] [PATCH bpf-next 0/9] xsk: TX metadata Stanislav Fomichev
                   ` (9 preceding siblings ...)
  2023-08-09 20:09 ` [xdp-hints] Re: [PATCH bpf-next 0/9] xsk: TX metadata Jesper Dangaard Brouer
@ 2023-08-14 11:13 ` Maciej Fijalkowski
  2023-08-14 18:04   ` Stanislav Fomichev
  10 siblings, 1 reply; 29+ messages in thread
From: Maciej Fijalkowski @ 2023-08-14 11:13 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, kuba, toke, willemb, dsahern,
	magnus.karlsson, bjorn, hawk, netdev, xdp-hints, Saeed Mahameed

On Wed, Aug 09, 2023 at 09:54:09AM -0700, Stanislav Fomichev wrote:
> This series implements initial TX metadata (offloads) for AF_XDP.
> See patch #2 for the main implementation and mlx5-one for the
> example on how to consume the metadata on the device side.
> 
> Starting with two types of offloads:
> - request TX timestamp (and write it back into the metadata area)
> - request TX checksum offload
> 
> Changes since last RFC:
> - add /* private: */ comments to headers (Simon)
> - handle metadata only in the first frag (Simon)
> - rename xdp_hw_metadata flags (Willem)
> - s/tmo_request_checksum/tmo_request_timestamp/ in xdp_metadata_ops
>   comment (Willem)
> - Documentation/networking/xsk-tx-metadata.rst

Stan,

thanks for working on it - we reviewed the patchset with Magnus and we
have some questions (responded inline to patches). Overall we think it
would be worth implementing this work against another ZC driver
(preferably ice :P) to check that proposed API is generic enough.
> 
> RFC v4: https://lore.kernel.org/bpf/20230724235957.1953861-1-sdf@google.com/
> 
> Performance:
> 
> I've implemented a small xskgen tool to try to saturate single tx queue:
> https://github.com/fomichev/xskgen/tree/master
> 
> Here are the performance numbers with some analysis.
> 
> 1. Baseline. Running with commit eb62e6aef940 ("Merge branch 'bpf:
> Support bpf_get_func_ip helper in uprobes'"), nothing from this series:
> 
> - with 1400 bytes of payload: 98 gbps, 8 mpps
> ./xskgen -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
> sent 10000000 packets 116960000000 bits, took 1.189130 sec, 98.357623 gbps 8.409509 mpps
> 
> - with 200 bytes of payload: 49 gbps, 23 mpps
> ./xskgen -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
> sent 10000064 packets 20960134144 bits, took 0.422235 sec, 49.640921 gbps 23.683645 mpps
> 
> 2. Adding single commit that supports reserving XDP_TX_METADATA_LEN
>    changes nothing numbers-wise.
> 
> - baseline for 1400
> ./xskgen -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
> sent 10000000 packets 116960000000 bits, took 1.189247 sec, 98.347946 gbps 8.408682 mpps
> 
> - baseline for 200
> ./xskgen -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
> sent 10000000 packets 20960000000 bits, took 0.421248 sec, 49.756913 gbps 23.738985 mpps
> 
> 3. Adding -M flag causes xskgen to reserve the metadata and fill it, but
>    doesn't set XDP_TX_METADATA descriptor option.
> 
> - new baseline for 1400 (with only filling the metadata)
> ./xskgen -M -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
> sent 10000000 packets 116960000000 bits, took 1.188767 sec, 98.387657 gbps 8.412077 mpps
> 
> - new baseline for 200 (with only filling the metadata)
> ./xskgen -M -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
> sent 10000000 packets 20960000000 bits, took 0.410213 sec, 51.095407 gbps 24.377579 mpps
> (the numbers go sligtly up here, not really sure why, maybe some cache-related
> side-effects?
> 
> 4. Next, I'm running the same test but with the commit that adds actual
>    general infra to parse XDP_TX_METADATA (but no driver support).
>    Essentially applying "xsk: add TX timestamp and TX checksum offload support"
>    from this series. Numbers are the same.
> 
> - fill metadata for 1400
> ./xskgen -M -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
> sent 10000000 packets 116960000000 bits, took 1.188430 sec, 98.415557 gbps 8.414463 mpps
> 
> - fill metadata for 200
> ./xskgen -M -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
> sent 10000000 packets 20960000000 bits, took 0.411559 sec, 50.928299 gbps 24.297853 mpps
> 
> - request metadata for 1400
> ./xskgen -m -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
> sent 10000000 packets 116960000000 bits, took 1.188723 sec, 98.391299 gbps 8.412389 mpps
> 
> - request metadata for 200
> ./xskgen -m -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
> sent 10000064 packets 20960134144 bits, took 0.411240 sec, 50.968131 gbps 24.316856 mpps
> 
> 5. Now, for the most interesting part, I'm adding mlx5 driver support.
>    The mpps for 200 bytes case goes down from 23 mpps to 19 mpps, but
>    _only_ when I enable the metadata. This looks like a side effect
>    of me pushing extra metadata pointer via mlx5e_xdpi_fifo_push.
>    Hence, this part is wrapped into 'if (xp_tx_metadata_enabled)'
>    to not affect the existing non-metadata use-cases. Since this is not
>    regressing existing workloads, I'm not spending any time trying to
>    optimize it more (and leaving it up to mlx owners to purse if
>    they see any good way to do it).
> 
> - same baseline
> ./xskgen -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
> sent 10000000 packets 116960000000 bits, took 1.189434 sec, 98.332484 gbps 8.407360 mpps
> 
> ./xskgen -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
> sent 10000128 packets 20960268288 bits, took 0.425254 sec, 49.288821 gbps 23.515659 mpps
> 
> - fill metadata for 1400
> ./xskgen -M -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
> sent 10000000 packets 116960000000 bits, took 1.189528 sec, 98.324714 gbps 8.406696 mpps
> 
> - fill metadata for 200
> ./xskgen -M -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
> sent 10000128 packets 20960268288 bits, took 0.519085 sec, 40.379260 gbps 19.264914 mpps
> 
> - request metadata for 1400
> ./xskgen -m -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
> sent 10000000 packets 116960000000 bits, took 1.189329 sec, 98.341165 gbps 8.408102 mpps
> 
> - request metadata for 200
> ./xskgen -m -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
> sent 10000128 packets 20960268288 bits, took 0.519929 sec, 40.313713 gbps 19.233642 mpps
> 
> Cc: Saeed Mahameed <saeedm@nvidia.com>
> 
> Stanislav Fomichev (9):
>   xsk: Support XDP_TX_METADATA_LEN
>   xsk: add TX timestamp and TX checksum offload support
>   tools: ynl: print xsk-features from the sample
>   net/mlx5e: Implement AF_XDP TX timestamp and checksum offload
>   selftests/xsk: Support XDP_TX_METADATA_LEN
>   selftests/bpf: Add csum helpers
>   selftests/bpf: Add TX side to xdp_metadata
>   selftests/bpf: Add TX side to xdp_hw_metadata
>   xsk: document XDP_TX_METADATA_LEN layout
> 
>  Documentation/netlink/specs/netdev.yaml       |  20 ++
>  Documentation/networking/index.rst            |   1 +
>  Documentation/networking/xsk-tx-metadata.rst  |  75 +++++++
>  drivers/net/ethernet/mellanox/mlx5/core/en.h  |   4 +-
>  .../net/ethernet/mellanox/mlx5/core/en/xdp.c  |  72 ++++++-
>  .../net/ethernet/mellanox/mlx5/core/en/xdp.h  |  10 +-
>  .../ethernet/mellanox/mlx5/core/en/xsk/tx.c   |  11 +-
>  .../net/ethernet/mellanox/mlx5/core/en_main.c |   1 +
>  include/linux/netdevice.h                     |  27 +++
>  include/linux/skbuff.h                        |   5 +-
>  include/net/xdp_sock.h                        |  61 ++++++
>  include/net/xdp_sock_drv.h                    |  13 ++
>  include/net/xsk_buff_pool.h                   |   6 +
>  include/uapi/linux/if_xdp.h                   |  36 ++++
>  include/uapi/linux/netdev.h                   |  16 ++
>  net/core/netdev-genl.c                        |  12 +-
>  net/xdp/xsk.c                                 |  61 ++++++
>  net/xdp/xsk_buff_pool.c                       |   1 +
>  net/xdp/xsk_queue.h                           |  19 +-
>  tools/include/uapi/linux/if_xdp.h             |  50 ++++-
>  tools/include/uapi/linux/netdev.h             |  15 ++
>  tools/net/ynl/generated/netdev-user.c         |  19 ++
>  tools/net/ynl/generated/netdev-user.h         |   3 +
>  tools/net/ynl/samples/netdev.c                |   6 +
>  tools/testing/selftests/bpf/network_helpers.h |  43 ++++
>  .../selftests/bpf/prog_tests/xdp_metadata.c   |  31 ++-
>  tools/testing/selftests/bpf/xdp_hw_metadata.c | 201 +++++++++++++++++-
>  tools/testing/selftests/bpf/xsk.c             |  17 ++
>  tools/testing/selftests/bpf/xsk.h             |   1 +
>  29 files changed, 793 insertions(+), 44 deletions(-)
>  create mode 100644 Documentation/networking/xsk-tx-metadata.rst
> 
> -- 
> 2.41.0.640.ga95def55d0-goog
> 
> 

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [xdp-hints] Re: [PATCH bpf-next 0/9] xsk: TX metadata
  2023-08-14 11:13 ` Maciej Fijalkowski
@ 2023-08-14 18:04   ` Stanislav Fomichev
  0 siblings, 0 replies; 29+ messages in thread
From: Stanislav Fomichev @ 2023-08-14 18:04 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, kuba, toke, willemb, dsahern,
	magnus.karlsson, bjorn, hawk, netdev, xdp-hints, Saeed Mahameed

On Mon, Aug 14, 2023 at 4:16 AM Maciej Fijalkowski
<maciej.fijalkowski@intel.com> wrote:
>
> On Wed, Aug 09, 2023 at 09:54:09AM -0700, Stanislav Fomichev wrote:
> > This series implements initial TX metadata (offloads) for AF_XDP.
> > See patch #2 for the main implementation and mlx5-one for the
> > example on how to consume the metadata on the device side.
> >
> > Starting with two types of offloads:
> > - request TX timestamp (and write it back into the metadata area)
> > - request TX checksum offload
> >
> > Changes since last RFC:
> > - add /* private: */ comments to headers (Simon)
> > - handle metadata only in the first frag (Simon)
> > - rename xdp_hw_metadata flags (Willem)
> > - s/tmo_request_checksum/tmo_request_timestamp/ in xdp_metadata_ops
> >   comment (Willem)
> > - Documentation/networking/xsk-tx-metadata.rst
>
> Stan,
>
> thanks for working on it - we reviewed the patchset with Magnus and we
> have some questions (responded inline to patches). Overall we think it
> would be worth implementing this work against another ZC driver
> (preferably ice :P) to check that proposed API is generic enough.

Awesome, thanks for the review! I can try to see if I have some
ice-capable hw around, but if you want to help with the ice
implementation I'd gladly accept the offer :-p

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [xdp-hints] Re: [PATCH bpf-next 4/9] net/mlx5e: Implement AF_XDP TX timestamp and checksum offload
  2023-08-14 11:02   ` [xdp-hints] " Maciej Fijalkowski
@ 2023-08-14 18:05     ` Stanislav Fomichev
  0 siblings, 0 replies; 29+ messages in thread
From: Stanislav Fomichev @ 2023-08-14 18:05 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, kuba, toke, willemb, dsahern,
	magnus.karlsson, bjorn, hawk, netdev, xdp-hints, Saeed Mahameed

On Mon, Aug 14, 2023 at 4:02 AM Maciej Fijalkowski
<maciej.fijalkowski@intel.com> wrote:
>
> On Wed, Aug 09, 2023 at 09:54:13AM -0700, Stanislav Fomichev wrote:
> > TX timestamp:
> > - requires passing clock, not sure I'm passing the correct one (from
> >   cq->mdev), but the timestamp value looks convincing
> >
> > TX checksum:
> > - looks like device does packet parsing (and doesn't accept custom
> >   start/offset), so I'm ignoring user offsets
> >
> > Cc: Saeed Mahameed <saeedm@nvidia.com>
> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > ---
> >  drivers/net/ethernet/mellanox/mlx5/core/en.h  |  4 +-
> >  .../net/ethernet/mellanox/mlx5/core/en/xdp.c  | 72 ++++++++++++++++---
> >  .../net/ethernet/mellanox/mlx5/core/en/xdp.h  | 10 ++-
> >  .../ethernet/mellanox/mlx5/core/en/xsk/tx.c   | 11 ++-
> >  .../net/ethernet/mellanox/mlx5/core/en_main.c |  1 +
> >  5 files changed, 82 insertions(+), 16 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> > index 0f8f70b91485..6f38627ae7f8 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> > @@ -476,10 +476,12 @@ struct mlx5e_xdp_info_fifo {
> >
> >  struct mlx5e_xdpsq;
> >  struct mlx5e_xmit_data;
> > +struct xsk_tx_metadata;
> >  typedef int (*mlx5e_fp_xmit_xdp_frame_check)(struct mlx5e_xdpsq *);
> >  typedef bool (*mlx5e_fp_xmit_xdp_frame)(struct mlx5e_xdpsq *,
> >                                       struct mlx5e_xmit_data *,
> > -                                     int);
> > +                                     int,
> > +                                     struct xsk_tx_metadata *);
> >
> >  struct mlx5e_xdpsq {
> >       /* data path */
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
> > index 40589cebb773..197d372048ec 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
> > @@ -102,7 +102,7 @@ mlx5e_xmit_xdp_buff(struct mlx5e_xdpsq *sq, struct mlx5e_rq *rq,
> >               xdptxd->dma_addr = dma_addr;
> >
> >               if (unlikely(!INDIRECT_CALL_2(sq->xmit_xdp_frame, mlx5e_xmit_xdp_frame_mpwqe,
> > -                                           mlx5e_xmit_xdp_frame, sq, xdptxd, 0)))
> > +                                           mlx5e_xmit_xdp_frame, sq, xdptxd, 0, NULL)))
> >                       return false;
> >
> >               /* xmit_mode == MLX5E_XDP_XMIT_MODE_FRAME */
> > @@ -144,7 +144,7 @@ mlx5e_xmit_xdp_buff(struct mlx5e_xdpsq *sq, struct mlx5e_rq *rq,
> >       xdptxd->dma_addr = dma_addr;
> >
> >       if (unlikely(!INDIRECT_CALL_2(sq->xmit_xdp_frame, mlx5e_xmit_xdp_frame_mpwqe,
> > -                                   mlx5e_xmit_xdp_frame, sq, xdptxd, 0)))
> > +                                   mlx5e_xmit_xdp_frame, sq, xdptxd, 0, NULL)))
> >               return false;
> >
> >       /* xmit_mode == MLX5E_XDP_XMIT_MODE_PAGE */
> > @@ -260,6 +260,37 @@ const struct xdp_metadata_ops mlx5e_xdp_metadata_ops = {
> >       .xmo_rx_hash                    = mlx5e_xdp_rx_hash,
> >  };
> >
> > +struct mlx5e_xsk_tx_complete {
> > +     struct mlx5_cqe64 *cqe;
> > +     struct mlx5e_cq *cq;
> > +};
> > +
> > +static u64 mlx5e_xsk_fill_timestamp(void *_priv)
> > +{
> > +     struct mlx5e_xsk_tx_complete *priv = _priv;
> > +     u64 ts;
> > +
> > +     ts = get_cqe_ts(priv->cqe);
> > +
> > +     if (mlx5_is_real_time_rq(priv->cq->mdev) || mlx5_is_real_time_sq(priv->cq->mdev))
> > +             return mlx5_real_time_cyc2time(&priv->cq->mdev->clock, ts);
> > +
> > +     return  mlx5_timecounter_cyc2time(&priv->cq->mdev->clock, ts);
> > +}
> > +
> > +static void mlx5e_xsk_request_checksum(u16 csum_start, u16 csum_offset, void *priv)
> > +{
> > +     struct mlx5_wqe_eth_seg *eseg = priv;
> > +
> > +     /* HW/FW is doing parsing, so offsets are largely ignored. */
> > +     eseg->cs_flags |= MLX5_ETH_WQE_L3_CSUM | MLX5_ETH_WQE_L4_CSUM;
> > +}
> > +
> > +const struct xsk_tx_metadata_ops mlx5e_xsk_tx_metadata_ops = {
> > +     .tmo_fill_timestamp             = mlx5e_xsk_fill_timestamp,
> > +     .tmo_request_checksum           = mlx5e_xsk_request_checksum,
>
> Can you explain to us why mlx5 doesn't need to implement the request
> timestamp op?

There is always a timestamp in the tx completion descriptor, so no
need to explicitly request it.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [xdp-hints] Re: [PATCH bpf-next 2/9] xsk: add TX timestamp and TX checksum offload support
  2023-08-14 11:01   ` Maciej Fijalkowski
@ 2023-08-14 18:05     ` Stanislav Fomichev
  0 siblings, 0 replies; 29+ messages in thread
From: Stanislav Fomichev @ 2023-08-14 18:05 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, kuba, toke, willemb, dsahern,
	magnus.karlsson, bjorn, hawk, netdev, xdp-hints

On Mon, Aug 14, 2023 at 4:01 AM Maciej Fijalkowski
<maciej.fijalkowski@intel.com> wrote:
>
> On Wed, Aug 09, 2023 at 09:54:11AM -0700, Stanislav Fomichev wrote:
> > This change actually defines the (initial) metadata layout
> > that should be used by AF_XDP userspace (xsk_tx_metadata).
> > The first field is flags which requests appropriate offloads,
> > followed by the offload-specific fields. The supported per-device
> > offloads are exported via netlink (new xsk-flags).
> >
> > The offloads themselves are still implemented in a bit of a
> > framework-y fashion that's left from my initial kfunc attempt.
> > I'm introducing new xsk_tx_metadata_ops which drivers are
> > supposed to implement. The drivers are also supposed
> > to call xsk_tx_metadata_request/xsk_tx_metadata_complete in
> > the right places. Since xsk_tx_metadata_{request,_complete}
> > are static inline, we don't incur any extra overhead doing
> > indirect calls.
> >
> > The benefit of this scheme is as follows:
> > - keeps all metadata layout parsing away from driver code
> > - makes it easy to grep and see which drivers implement what
> > - don't need any extra flags to maintain to keep track of what
> >   offloads are implemented; if the callback is implemented - the offload
> >   is supported (used by netlink reporting code)
> >
> > Two offloads are defined right now:
> > 1. XDP_TX_METADATA_CHECKSUM: skb-style csum_start+csum_offset
> > 2. XDP_TX_METADATA_TIMESTAMP: writes TX timestamp back into metadata
> >    area upon completion (tx_timestamp field)
> >
> > The offloads are also implemented for copy mode:
> > 1. Extra XDP_TX_METADATA_CHECKSUM_SW to trigger skb_checksum_help; this
> >    might be useful as a reference implementation and for testing
> > 2. XDP_TX_METADATA_TIMESTAMP writes SW timestamp from the skb
> >    destructor (note I'm reusing hwtstamps to pass metadata pointer)
> >
> > The struct is forward-compatible and can be extended in the future
> > by appending more fields.
> >
> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > ---
> >  Documentation/netlink/specs/netdev.yaml | 20 +++++++++
> >  include/linux/netdevice.h               | 27 +++++++++++
> >  include/linux/skbuff.h                  |  5 ++-
> >  include/net/xdp_sock.h                  | 60 +++++++++++++++++++++++++
> >  include/net/xdp_sock_drv.h              | 13 ++++++
> >  include/net/xsk_buff_pool.h             |  5 +++
> >  include/uapi/linux/if_xdp.h             | 35 +++++++++++++++
> >  include/uapi/linux/netdev.h             | 16 +++++++
> >  net/core/netdev-genl.c                  | 12 ++++-
> >  net/xdp/xsk.c                           | 41 +++++++++++++++++
> >  net/xdp/xsk_queue.h                     |  2 +-
> >  tools/include/uapi/linux/if_xdp.h       | 50 ++++++++++++++++++---
> >  tools/include/uapi/linux/netdev.h       | 15 +++++++
> >  13 files changed, 293 insertions(+), 8 deletions(-)
> >
>
> [...]
>
> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > index 0896aaa91dd7..3f02aaa30590 100644
> > --- a/include/linux/netdevice.h
> > +++ b/include/linux/netdevice.h
> > @@ -1647,6 +1647,31 @@ struct net_device_ops {
> >                                                   struct netlink_ext_ack *extack);
> >  };
> >
> > +/*
> > + * This structure defines the AF_XDP TX metadata hooks for network devices.
> > + * The following hooks can be defined; unless noted otherwise, they are
> > + * optional and can be filled with a null pointer.
> > + *
> > + * int (*tmo_request_timestamp)(void *priv)
> > + *     This function is called when AF_XDP frame requested egress timestamp.
> > + *
> > + * int (*tmo_fill_timestamp)(void *priv)
> > + *     This function is called when AF_XDP frame, that had requested
> > + *     egress timestamp, received a completion. The hook needs to return
> > + *     the actual HW timestamp.
> > + *
> > + * int (*tmo_request_checksum)(u16 csum_start, u16 csum_offset, void *priv)
> > + *     This function is called when AF_XDP frame requested HW checksum
> > + *     offload. csum_start indicates position where checksumming should start.
> > + *     csum_offset indicates position where checksum should be stored.
> > + *
> > + */
> > +struct xsk_tx_metadata_ops {
> > +     void    (*tmo_request_timestamp)(void *priv);
> > +     u64     (*tmo_fill_timestamp)(void *priv);
> > +     void    (*tmo_request_checksum)(u16 csum_start, u16 csum_offset, void *priv);
> > +};
> > +
> >  /**
> >   * enum netdev_priv_flags - &struct net_device priv_flags
> >   *
> > @@ -1835,6 +1860,7 @@ enum netdev_ml_priv_type {
> >   *   @netdev_ops:    Includes several pointers to callbacks,
> >   *                   if one wants to override the ndo_*() functions
> >   *   @xdp_metadata_ops:      Includes pointers to XDP metadata callbacks.
> > + *   @xsk_tx_metadata_ops:   Includes pointers to AF_XDP TX metadata callbacks.
> >   *   @ethtool_ops:   Management operations
> >   *   @l3mdev_ops:    Layer 3 master device operations
> >   *   @ndisc_ops:     Includes callbacks for different IPv6 neighbour
> > @@ -2091,6 +2117,7 @@ struct net_device {
> >       unsigned long long      priv_flags;
> >       const struct net_device_ops *netdev_ops;
> >       const struct xdp_metadata_ops *xdp_metadata_ops;
> > +     const struct xsk_tx_metadata_ops *xsk_tx_metadata_ops;
> >       int                     ifindex;
> >       unsigned short          gflags;
> >       unsigned short          hard_header_len;
> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index 16a49ba534e4..5d73d5df67fb 100644
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -579,7 +579,10 @@ struct skb_shared_info {
> >       /* Warning: this field is not always filled in (UFO)! */
> >       unsigned short  gso_segs;
> >       struct sk_buff  *frag_list;
> > -     struct skb_shared_hwtstamps hwtstamps;
> > +     union {
> > +             struct skb_shared_hwtstamps hwtstamps;
> > +             struct xsk_tx_metadata *xsk_meta;
> > +     };
> >       unsigned int    gso_type;
> >       u32             tskey;
> >
> > diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> > index 467b9fb56827..288fa58c4665 100644
> > --- a/include/net/xdp_sock.h
> > +++ b/include/net/xdp_sock.h
> > @@ -90,6 +90,54 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
> >  int __xsk_map_redirect(struct xdp_sock *xs, struct xdp_buff *xdp);
> >  void __xsk_map_flush(void);
> >
> > +/**
> > + *  xsk_tx_metadata_request - Evaluate AF_XDP TX metadata at submission
> > + *  and call appropriate xsk_tx_metadata_ops operation.
> > + *  @meta: pointer to AF_XDP metadata area
> > + *  @ops: pointer to struct xsk_tx_metadata_ops
> > + *  @priv: pointer to driver-private aread
> > + *
> > + *  This function should be called by the networking device when
> > + *  it prepares AF_XDP egress packet.
> > + */
> > +static inline void xsk_tx_metadata_request(const struct xsk_tx_metadata *meta,
> > +                                        const struct xsk_tx_metadata_ops *ops,
> > +                                        void *priv)
> > +{
> > +     if (!meta)
> > +             return;
> > +
> > +     if (ops->tmo_request_timestamp)
> > +             if (meta->flags & XDP_TX_METADATA_TIMESTAMP)
>
> We should have a copy of flags or any other things that we read multiple
> times from metadata in order to avoid potential attacks from user space.
> An example of that is the fact that timestamp metadata handling is two
> step process, meaning to fill the timestamp you have to request it in the
> first place. If user space would set XDP_TX_METADATA_TIMESTAMP after
> sending but before completing we would crash the kernel potentially.
>
> We could also move the responsibility of handling that issue to driver
> programmers but IMHO that would be harder to implement, hence we think
> handling it in core would be better.

Hmm, very good point. I believe we only care about the timestamp
address for the completion, right? Not the rest of the metadata field.
So saving/passing that single pointer might be good enough..
For copy mode I think I can abuse skb_shared_info the same way I'm
adding new xsk_meta (IOW, store tx_timestamp ptr instead of overall
xsk_meta pointer).
For the native mode, not sure how we could implement that in the
generic fashion? Let me play with it and see if I can provide some
helpers for the drivers..

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [xdp-hints] Re: [PATCH bpf-next 1/9] xsk: Support XDP_TX_METADATA_LEN
  2023-08-14 10:56   ` [xdp-hints] " Maciej Fijalkowski
@ 2023-08-14 18:05     ` Stanislav Fomichev
  2023-08-14 22:24       ` Stanislav Fomichev
  0 siblings, 1 reply; 29+ messages in thread
From: Stanislav Fomichev @ 2023-08-14 18:05 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, kuba, toke, willemb, dsahern,
	magnus.karlsson, bjorn, hawk, netdev, xdp-hints

On Mon, Aug 14, 2023 at 3:57 AM Maciej Fijalkowski
<maciej.fijalkowski@intel.com> wrote:
>
> On Wed, Aug 09, 2023 at 09:54:10AM -0700, Stanislav Fomichev wrote:
> > For zerocopy mode, tx_desc->addr can point to the arbitrary offset
> > and carry some TX metadata in the headroom. For copy mode, there
> > is no way currently to populate skb metadata.
> >
> > Introduce new XDP_TX_METADATA_LEN that indicates how many bytes
> > to treat as metadata. Metadata bytes come prior to tx_desc address
> > (same as in RX case).
> >
> > The size of the metadata has the same constraints as XDP:
> > - less than 256 bytes
> > - 4-byte aligned
> > - non-zero
> >
> > This data is not interpreted in any way right now.
> >
> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > ---
> >  include/net/xdp_sock.h      |  1 +
> >  include/net/xsk_buff_pool.h |  1 +
> >  include/uapi/linux/if_xdp.h |  1 +
> >  net/xdp/xsk.c               | 20 ++++++++++++++++++++
> >  net/xdp/xsk_buff_pool.c     |  1 +
> >  net/xdp/xsk_queue.h         | 17 ++++++++++-------
> >  6 files changed, 34 insertions(+), 7 deletions(-)
> >
> > diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> > index 1617af380162..467b9fb56827 100644
> > --- a/include/net/xdp_sock.h
> > +++ b/include/net/xdp_sock.h
> > @@ -51,6 +51,7 @@ struct xdp_sock {
> >       struct list_head flush_node;
> >       struct xsk_buff_pool *pool;
> >       u16 queue_id;
> > +     u8 tx_metadata_len;
> >       bool zc;
> >       bool sg;
> >       enum {
> > diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
> > index b0bdff26fc88..9c31e8d1e198 100644
> > --- a/include/net/xsk_buff_pool.h
> > +++ b/include/net/xsk_buff_pool.h
> > @@ -77,6 +77,7 @@ struct xsk_buff_pool {
> >       u32 chunk_size;
> >       u32 chunk_shift;
> >       u32 frame_len;
> > +     u8 tx_metadata_len; /* inherited from xsk_sock */
> >       u8 cached_need_wakeup;
> >       bool uses_need_wakeup;
> >       bool dma_need_sync;
> > diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
> > index 8d48863472b9..b37b50102e1c 100644
> > --- a/include/uapi/linux/if_xdp.h
> > +++ b/include/uapi/linux/if_xdp.h
> > @@ -69,6 +69,7 @@ struct xdp_mmap_offsets {
> >  #define XDP_UMEM_COMPLETION_RING     6
> >  #define XDP_STATISTICS                       7
> >  #define XDP_OPTIONS                  8
> > +#define XDP_TX_METADATA_LEN          9
> >
> >  struct xdp_umem_reg {
> >       __u64 addr; /* Start of packet data area */
> > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > index 47796a5a79b3..28df3280501d 100644
> > --- a/net/xdp/xsk.c
> > +++ b/net/xdp/xsk.c
> > @@ -1338,6 +1338,26 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
> >               mutex_unlock(&xs->mutex);
> >               return err;
> >       }
> > +     case XDP_TX_METADATA_LEN:
> > +     {
> > +             int val;
> > +
> > +             if (optlen < sizeof(val))
> > +                     return -EINVAL;
> > +             if (copy_from_sockptr(&val, optval, sizeof(val)))
> > +                     return -EFAULT;
> > +             if (!val || val > 256 || val % 4)
> > +                     return -EINVAL;
> > +
> > +             mutex_lock(&xs->mutex);
> > +             if (xs->state != XSK_READY) {
> > +                     mutex_unlock(&xs->mutex);
> > +                     return -EBUSY;
> > +             }
> > +             xs->tx_metadata_len = val;
> > +             mutex_unlock(&xs->mutex);
> > +             return 0;
> > +     }
> >       default:
> >               break;
> >       }
> > diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
> > index b3f7b310811e..b351732f1032 100644
> > --- a/net/xdp/xsk_buff_pool.c
> > +++ b/net/xdp/xsk_buff_pool.c
> > @@ -85,6 +85,7 @@ struct xsk_buff_pool *xp_create_and_assign_umem(struct xdp_sock *xs,
> >               XDP_PACKET_HEADROOM;
> >       pool->umem = umem;
> >       pool->addrs = umem->addrs;
> > +     pool->tx_metadata_len = xs->tx_metadata_len;
>
> Hey Stan,
>
> what would happen in case when one socket sets pool->tx_metadata_len say
> to 16 and the other one that is sharing the pool to 24? If sockets should
> and can have different padding, then this will not work unless the
> metadata_len is on a per socket level.

Hmm, good point. I didn't think about umem sharing :-/
Maybe they all have to agree on the size? And once the size has been
size by one socket, the same size should be set on the others? (or at
least be implied that the other sockets will use the same metadata
layout)

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [xdp-hints] Re: [PATCH bpf-next 1/9] xsk: Support XDP_TX_METADATA_LEN
  2023-08-14 18:05     ` Stanislav Fomichev
@ 2023-08-14 22:24       ` Stanislav Fomichev
  2023-08-15 12:19         ` Magnus Karlsson
  0 siblings, 1 reply; 29+ messages in thread
From: Stanislav Fomichev @ 2023-08-14 22:24 UTC (permalink / raw)
  To: Maciej Fijalkowski
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, kuba, toke, willemb, dsahern,
	magnus.karlsson, bjorn, hawk, netdev, xdp-hints

On Mon, Aug 14, 2023 at 11:05 AM Stanislav Fomichev <sdf@google.com> wrote:
>
> On Mon, Aug 14, 2023 at 3:57 AM Maciej Fijalkowski
> <maciej.fijalkowski@intel.com> wrote:
> >
> > On Wed, Aug 09, 2023 at 09:54:10AM -0700, Stanislav Fomichev wrote:
> > > For zerocopy mode, tx_desc->addr can point to the arbitrary offset
> > > and carry some TX metadata in the headroom. For copy mode, there
> > > is no way currently to populate skb metadata.
> > >
> > > Introduce new XDP_TX_METADATA_LEN that indicates how many bytes
> > > to treat as metadata. Metadata bytes come prior to tx_desc address
> > > (same as in RX case).
> > >
> > > The size of the metadata has the same constraints as XDP:
> > > - less than 256 bytes
> > > - 4-byte aligned
> > > - non-zero
> > >
> > > This data is not interpreted in any way right now.
> > >
> > > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > > ---
> > >  include/net/xdp_sock.h      |  1 +
> > >  include/net/xsk_buff_pool.h |  1 +
> > >  include/uapi/linux/if_xdp.h |  1 +
> > >  net/xdp/xsk.c               | 20 ++++++++++++++++++++
> > >  net/xdp/xsk_buff_pool.c     |  1 +
> > >  net/xdp/xsk_queue.h         | 17 ++++++++++-------
> > >  6 files changed, 34 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> > > index 1617af380162..467b9fb56827 100644
> > > --- a/include/net/xdp_sock.h
> > > +++ b/include/net/xdp_sock.h
> > > @@ -51,6 +51,7 @@ struct xdp_sock {
> > >       struct list_head flush_node;
> > >       struct xsk_buff_pool *pool;
> > >       u16 queue_id;
> > > +     u8 tx_metadata_len;
> > >       bool zc;
> > >       bool sg;
> > >       enum {
> > > diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
> > > index b0bdff26fc88..9c31e8d1e198 100644
> > > --- a/include/net/xsk_buff_pool.h
> > > +++ b/include/net/xsk_buff_pool.h
> > > @@ -77,6 +77,7 @@ struct xsk_buff_pool {
> > >       u32 chunk_size;
> > >       u32 chunk_shift;
> > >       u32 frame_len;
> > > +     u8 tx_metadata_len; /* inherited from xsk_sock */
> > >       u8 cached_need_wakeup;
> > >       bool uses_need_wakeup;
> > >       bool dma_need_sync;
> > > diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
> > > index 8d48863472b9..b37b50102e1c 100644
> > > --- a/include/uapi/linux/if_xdp.h
> > > +++ b/include/uapi/linux/if_xdp.h
> > > @@ -69,6 +69,7 @@ struct xdp_mmap_offsets {
> > >  #define XDP_UMEM_COMPLETION_RING     6
> > >  #define XDP_STATISTICS                       7
> > >  #define XDP_OPTIONS                  8
> > > +#define XDP_TX_METADATA_LEN          9
> > >
> > >  struct xdp_umem_reg {
> > >       __u64 addr; /* Start of packet data area */
> > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > > index 47796a5a79b3..28df3280501d 100644
> > > --- a/net/xdp/xsk.c
> > > +++ b/net/xdp/xsk.c
> > > @@ -1338,6 +1338,26 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
> > >               mutex_unlock(&xs->mutex);
> > >               return err;
> > >       }
> > > +     case XDP_TX_METADATA_LEN:
> > > +     {
> > > +             int val;
> > > +
> > > +             if (optlen < sizeof(val))
> > > +                     return -EINVAL;
> > > +             if (copy_from_sockptr(&val, optval, sizeof(val)))
> > > +                     return -EFAULT;
> > > +             if (!val || val > 256 || val % 4)
> > > +                     return -EINVAL;
> > > +
> > > +             mutex_lock(&xs->mutex);
> > > +             if (xs->state != XSK_READY) {
> > > +                     mutex_unlock(&xs->mutex);
> > > +                     return -EBUSY;
> > > +             }
> > > +             xs->tx_metadata_len = val;
> > > +             mutex_unlock(&xs->mutex);
> > > +             return 0;
> > > +     }
> > >       default:
> > >               break;
> > >       }
> > > diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
> > > index b3f7b310811e..b351732f1032 100644
> > > --- a/net/xdp/xsk_buff_pool.c
> > > +++ b/net/xdp/xsk_buff_pool.c
> > > @@ -85,6 +85,7 @@ struct xsk_buff_pool *xp_create_and_assign_umem(struct xdp_sock *xs,
> > >               XDP_PACKET_HEADROOM;
> > >       pool->umem = umem;
> > >       pool->addrs = umem->addrs;
> > > +     pool->tx_metadata_len = xs->tx_metadata_len;
> >
> > Hey Stan,
> >
> > what would happen in case when one socket sets pool->tx_metadata_len say
> > to 16 and the other one that is sharing the pool to 24? If sockets should
> > and can have different padding, then this will not work unless the
> > metadata_len is on a per socket level.
>
> Hmm, good point. I didn't think about umem sharing :-/
> Maybe they all have to agree on the size? And once the size has been
> size by one socket, the same size should be set on the others? (or at
> least be implied that the other sockets will use the same metadata
> layout)

Thinking about it a bit more: should that tx_metadata_len be part of a
umem then?
Users can provide it as part of xdp_umem_reg which is probably a
better place to pass it compared to the setsockopt?

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [xdp-hints] Re: [PATCH bpf-next 1/9] xsk: Support XDP_TX_METADATA_LEN
  2023-08-14 22:24       ` Stanislav Fomichev
@ 2023-08-15 12:19         ` Magnus Karlsson
  2023-08-15 18:21           ` Stanislav Fomichev
  0 siblings, 1 reply; 29+ messages in thread
From: Magnus Karlsson @ 2023-08-15 12:19 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Maciej Fijalkowski, bpf, ast, daniel, andrii, martin.lau, song,
	yhs, john.fastabend, kpsingh, haoluo, jolsa, kuba, toke, willemb,
	dsahern, magnus.karlsson, bjorn, hawk, netdev, xdp-hints

On Tue, 15 Aug 2023 at 00:25, Stanislav Fomichev <sdf@google.com> wrote:
>
> On Mon, Aug 14, 2023 at 11:05 AM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > On Mon, Aug 14, 2023 at 3:57 AM Maciej Fijalkowski
> > <maciej.fijalkowski@intel.com> wrote:
> > >
> > > On Wed, Aug 09, 2023 at 09:54:10AM -0700, Stanislav Fomichev wrote:
> > > > For zerocopy mode, tx_desc->addr can point to the arbitrary offset
> > > > and carry some TX metadata in the headroom. For copy mode, there
> > > > is no way currently to populate skb metadata.
> > > >
> > > > Introduce new XDP_TX_METADATA_LEN that indicates how many bytes
> > > > to treat as metadata. Metadata bytes come prior to tx_desc address
> > > > (same as in RX case).
> > > >
> > > > The size of the metadata has the same constraints as XDP:
> > > > - less than 256 bytes
> > > > - 4-byte aligned
> > > > - non-zero
> > > >
> > > > This data is not interpreted in any way right now.
> > > >
> > > > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > > > ---
> > > >  include/net/xdp_sock.h      |  1 +
> > > >  include/net/xsk_buff_pool.h |  1 +
> > > >  include/uapi/linux/if_xdp.h |  1 +
> > > >  net/xdp/xsk.c               | 20 ++++++++++++++++++++
> > > >  net/xdp/xsk_buff_pool.c     |  1 +
> > > >  net/xdp/xsk_queue.h         | 17 ++++++++++-------
> > > >  6 files changed, 34 insertions(+), 7 deletions(-)
> > > >
> > > > diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> > > > index 1617af380162..467b9fb56827 100644
> > > > --- a/include/net/xdp_sock.h
> > > > +++ b/include/net/xdp_sock.h
> > > > @@ -51,6 +51,7 @@ struct xdp_sock {
> > > >       struct list_head flush_node;
> > > >       struct xsk_buff_pool *pool;
> > > >       u16 queue_id;
> > > > +     u8 tx_metadata_len;
> > > >       bool zc;
> > > >       bool sg;
> > > >       enum {
> > > > diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
> > > > index b0bdff26fc88..9c31e8d1e198 100644
> > > > --- a/include/net/xsk_buff_pool.h
> > > > +++ b/include/net/xsk_buff_pool.h
> > > > @@ -77,6 +77,7 @@ struct xsk_buff_pool {
> > > >       u32 chunk_size;
> > > >       u32 chunk_shift;
> > > >       u32 frame_len;
> > > > +     u8 tx_metadata_len; /* inherited from xsk_sock */
> > > >       u8 cached_need_wakeup;
> > > >       bool uses_need_wakeup;
> > > >       bool dma_need_sync;
> > > > diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
> > > > index 8d48863472b9..b37b50102e1c 100644
> > > > --- a/include/uapi/linux/if_xdp.h
> > > > +++ b/include/uapi/linux/if_xdp.h
> > > > @@ -69,6 +69,7 @@ struct xdp_mmap_offsets {
> > > >  #define XDP_UMEM_COMPLETION_RING     6
> > > >  #define XDP_STATISTICS                       7
> > > >  #define XDP_OPTIONS                  8
> > > > +#define XDP_TX_METADATA_LEN          9
> > > >
> > > >  struct xdp_umem_reg {
> > > >       __u64 addr; /* Start of packet data area */
> > > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > > > index 47796a5a79b3..28df3280501d 100644
> > > > --- a/net/xdp/xsk.c
> > > > +++ b/net/xdp/xsk.c
> > > > @@ -1338,6 +1338,26 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
> > > >               mutex_unlock(&xs->mutex);
> > > >               return err;
> > > >       }
> > > > +     case XDP_TX_METADATA_LEN:
> > > > +     {
> > > > +             int val;
> > > > +
> > > > +             if (optlen < sizeof(val))
> > > > +                     return -EINVAL;
> > > > +             if (copy_from_sockptr(&val, optval, sizeof(val)))
> > > > +                     return -EFAULT;
> > > > +             if (!val || val > 256 || val % 4)
> > > > +                     return -EINVAL;
> > > > +
> > > > +             mutex_lock(&xs->mutex);
> > > > +             if (xs->state != XSK_READY) {
> > > > +                     mutex_unlock(&xs->mutex);
> > > > +                     return -EBUSY;
> > > > +             }
> > > > +             xs->tx_metadata_len = val;
> > > > +             mutex_unlock(&xs->mutex);
> > > > +             return 0;
> > > > +     }
> > > >       default:
> > > >               break;
> > > >       }
> > > > diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
> > > > index b3f7b310811e..b351732f1032 100644
> > > > --- a/net/xdp/xsk_buff_pool.c
> > > > +++ b/net/xdp/xsk_buff_pool.c
> > > > @@ -85,6 +85,7 @@ struct xsk_buff_pool *xp_create_and_assign_umem(struct xdp_sock *xs,
> > > >               XDP_PACKET_HEADROOM;
> > > >       pool->umem = umem;
> > > >       pool->addrs = umem->addrs;
> > > > +     pool->tx_metadata_len = xs->tx_metadata_len;
> > >
> > > Hey Stan,
> > >
> > > what would happen in case when one socket sets pool->tx_metadata_len say
> > > to 16 and the other one that is sharing the pool to 24? If sockets should
> > > and can have different padding, then this will not work unless the
> > > metadata_len is on a per socket level.
> >
> > Hmm, good point. I didn't think about umem sharing :-/
> > Maybe they all have to agree on the size? And once the size has been
> > size by one socket, the same size should be set on the others? (or at
> > least be implied that the other sockets will use the same metadata
> > layout)
>
> Thinking about it a bit more: should that tx_metadata_len be part of a
> umem then?
> Users can provide it as part of xdp_umem_reg which is probably a
> better place to pass it compared to the setsockopt?

If all the sockets sharing the umem have to agree on the
tx_metadata_len, then this is a better place than the setsockopt.
IMHO, it sounds unlikely that multiple versions of the same program,
or different programs, would like to share the same umem.

Please note that some of the members of struct xdp_umem are copied out
to the individual struct xsk_buff_pool even though they are the same
between all of them (chunk_size and the size of the umem for example).
The reason for this is performance, as to avoid having to access the
umem struct in the fast path. Something similar might be a good idea
here too, even though tx_metadata_len is fixed for a umem being
shared. Might be worth trying.

Again, thanks for working on this Stanislav.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* [xdp-hints] Re: [PATCH bpf-next 1/9] xsk: Support XDP_TX_METADATA_LEN
  2023-08-15 12:19         ` Magnus Karlsson
@ 2023-08-15 18:21           ` Stanislav Fomichev
  0 siblings, 0 replies; 29+ messages in thread
From: Stanislav Fomichev @ 2023-08-15 18:21 UTC (permalink / raw)
  To: Magnus Karlsson
  Cc: Maciej Fijalkowski, bpf, ast, daniel, andrii, martin.lau, song,
	yhs, john.fastabend, kpsingh, haoluo, jolsa, kuba, toke, willemb,
	dsahern, magnus.karlsson, bjorn, hawk, netdev, xdp-hints

On Tue, Aug 15, 2023 at 5:19 AM Magnus Karlsson
<magnus.karlsson@gmail.com> wrote:
>
> On Tue, 15 Aug 2023 at 00:25, Stanislav Fomichev <sdf@google.com> wrote:
> >
> > On Mon, Aug 14, 2023 at 11:05 AM Stanislav Fomichev <sdf@google.com> wrote:
> > >
> > > On Mon, Aug 14, 2023 at 3:57 AM Maciej Fijalkowski
> > > <maciej.fijalkowski@intel.com> wrote:
> > > >
> > > > On Wed, Aug 09, 2023 at 09:54:10AM -0700, Stanislav Fomichev wrote:
> > > > > For zerocopy mode, tx_desc->addr can point to the arbitrary offset
> > > > > and carry some TX metadata in the headroom. For copy mode, there
> > > > > is no way currently to populate skb metadata.
> > > > >
> > > > > Introduce new XDP_TX_METADATA_LEN that indicates how many bytes
> > > > > to treat as metadata. Metadata bytes come prior to tx_desc address
> > > > > (same as in RX case).
> > > > >
> > > > > The size of the metadata has the same constraints as XDP:
> > > > > - less than 256 bytes
> > > > > - 4-byte aligned
> > > > > - non-zero
> > > > >
> > > > > This data is not interpreted in any way right now.
> > > > >
> > > > > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > > > > ---
> > > > >  include/net/xdp_sock.h      |  1 +
> > > > >  include/net/xsk_buff_pool.h |  1 +
> > > > >  include/uapi/linux/if_xdp.h |  1 +
> > > > >  net/xdp/xsk.c               | 20 ++++++++++++++++++++
> > > > >  net/xdp/xsk_buff_pool.c     |  1 +
> > > > >  net/xdp/xsk_queue.h         | 17 ++++++++++-------
> > > > >  6 files changed, 34 insertions(+), 7 deletions(-)
> > > > >
> > > > > diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> > > > > index 1617af380162..467b9fb56827 100644
> > > > > --- a/include/net/xdp_sock.h
> > > > > +++ b/include/net/xdp_sock.h
> > > > > @@ -51,6 +51,7 @@ struct xdp_sock {
> > > > >       struct list_head flush_node;
> > > > >       struct xsk_buff_pool *pool;
> > > > >       u16 queue_id;
> > > > > +     u8 tx_metadata_len;
> > > > >       bool zc;
> > > > >       bool sg;
> > > > >       enum {
> > > > > diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
> > > > > index b0bdff26fc88..9c31e8d1e198 100644
> > > > > --- a/include/net/xsk_buff_pool.h
> > > > > +++ b/include/net/xsk_buff_pool.h
> > > > > @@ -77,6 +77,7 @@ struct xsk_buff_pool {
> > > > >       u32 chunk_size;
> > > > >       u32 chunk_shift;
> > > > >       u32 frame_len;
> > > > > +     u8 tx_metadata_len; /* inherited from xsk_sock */
> > > > >       u8 cached_need_wakeup;
> > > > >       bool uses_need_wakeup;
> > > > >       bool dma_need_sync;
> > > > > diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
> > > > > index 8d48863472b9..b37b50102e1c 100644
> > > > > --- a/include/uapi/linux/if_xdp.h
> > > > > +++ b/include/uapi/linux/if_xdp.h
> > > > > @@ -69,6 +69,7 @@ struct xdp_mmap_offsets {
> > > > >  #define XDP_UMEM_COMPLETION_RING     6
> > > > >  #define XDP_STATISTICS                       7
> > > > >  #define XDP_OPTIONS                  8
> > > > > +#define XDP_TX_METADATA_LEN          9
> > > > >
> > > > >  struct xdp_umem_reg {
> > > > >       __u64 addr; /* Start of packet data area */
> > > > > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > > > > index 47796a5a79b3..28df3280501d 100644
> > > > > --- a/net/xdp/xsk.c
> > > > > +++ b/net/xdp/xsk.c
> > > > > @@ -1338,6 +1338,26 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
> > > > >               mutex_unlock(&xs->mutex);
> > > > >               return err;
> > > > >       }
> > > > > +     case XDP_TX_METADATA_LEN:
> > > > > +     {
> > > > > +             int val;
> > > > > +
> > > > > +             if (optlen < sizeof(val))
> > > > > +                     return -EINVAL;
> > > > > +             if (copy_from_sockptr(&val, optval, sizeof(val)))
> > > > > +                     return -EFAULT;
> > > > > +             if (!val || val > 256 || val % 4)
> > > > > +                     return -EINVAL;
> > > > > +
> > > > > +             mutex_lock(&xs->mutex);
> > > > > +             if (xs->state != XSK_READY) {
> > > > > +                     mutex_unlock(&xs->mutex);
> > > > > +                     return -EBUSY;
> > > > > +             }
> > > > > +             xs->tx_metadata_len = val;
> > > > > +             mutex_unlock(&xs->mutex);
> > > > > +             return 0;
> > > > > +     }
> > > > >       default:
> > > > >               break;
> > > > >       }
> > > > > diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
> > > > > index b3f7b310811e..b351732f1032 100644
> > > > > --- a/net/xdp/xsk_buff_pool.c
> > > > > +++ b/net/xdp/xsk_buff_pool.c
> > > > > @@ -85,6 +85,7 @@ struct xsk_buff_pool *xp_create_and_assign_umem(struct xdp_sock *xs,
> > > > >               XDP_PACKET_HEADROOM;
> > > > >       pool->umem = umem;
> > > > >       pool->addrs = umem->addrs;
> > > > > +     pool->tx_metadata_len = xs->tx_metadata_len;
> > > >
> > > > Hey Stan,
> > > >
> > > > what would happen in case when one socket sets pool->tx_metadata_len say
> > > > to 16 and the other one that is sharing the pool to 24? If sockets should
> > > > and can have different padding, then this will not work unless the
> > > > metadata_len is on a per socket level.
> > >
> > > Hmm, good point. I didn't think about umem sharing :-/
> > > Maybe they all have to agree on the size? And once the size has been
> > > size by one socket, the same size should be set on the others? (or at
> > > least be implied that the other sockets will use the same metadata
> > > layout)
> >
> > Thinking about it a bit more: should that tx_metadata_len be part of a
> > umem then?
> > Users can provide it as part of xdp_umem_reg which is probably a
> > better place to pass it compared to the setsockopt?
>
> If all the sockets sharing the umem have to agree on the
> tx_metadata_len, then this is a better place than the setsockopt.
> IMHO, it sounds unlikely that multiple versions of the same program,
> or different programs, would like to share the same umem.
>
> Please note that some of the members of struct xdp_umem are copied out
> to the individual struct xsk_buff_pool even though they are the same
> between all of them (chunk_size and the size of the umem for example).
> The reason for this is performance, as to avoid having to access the
> umem struct in the fast path. Something similar might be a good idea
> here too, even though tx_metadata_len is fixed for a umem being
> shared. Might be worth trying.
>
> Again, thanks for working on this Stanislav.

Perfect, thank you, I was trying a similar idea with a copy into the pool!

^ permalink raw reply	[flat|nested] 29+ messages in thread

end of thread, other threads:[~2023-08-15 18:21 UTC | newest]

Thread overview: 29+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-08-09 16:54 [xdp-hints] [PATCH bpf-next 0/9] xsk: TX metadata Stanislav Fomichev
2023-08-09 16:54 ` [xdp-hints] [PATCH bpf-next 1/9] xsk: Support XDP_TX_METADATA_LEN Stanislav Fomichev
2023-08-14 10:56   ` [xdp-hints] " Maciej Fijalkowski
2023-08-14 18:05     ` Stanislav Fomichev
2023-08-14 22:24       ` Stanislav Fomichev
2023-08-15 12:19         ` Magnus Karlsson
2023-08-15 18:21           ` Stanislav Fomichev
2023-08-09 16:54 ` [xdp-hints] [PATCH bpf-next 2/9] xsk: add TX timestamp and TX checksum offload support Stanislav Fomichev
2023-08-09 20:18   ` [xdp-hints] " Jesper Dangaard Brouer
2023-08-10 18:25     ` Stanislav Fomichev
2023-08-10  5:26   ` kernel test robot
2023-08-10  6:12   ` kernel test robot
2023-08-14 11:01   ` Maciej Fijalkowski
2023-08-14 18:05     ` Stanislav Fomichev
2023-08-09 16:54 ` [xdp-hints] [PATCH bpf-next 3/9] tools: ynl: print xsk-features from the sample Stanislav Fomichev
2023-08-09 16:54 ` [xdp-hints] [PATCH bpf-next 4/9] net/mlx5e: Implement AF_XDP TX timestamp and checksum offload Stanislav Fomichev
2023-08-14 11:02   ` [xdp-hints] " Maciej Fijalkowski
2023-08-14 18:05     ` Stanislav Fomichev
2023-08-09 16:54 ` [xdp-hints] [PATCH bpf-next 5/9] selftests/xsk: Support XDP_TX_METADATA_LEN Stanislav Fomichev
2023-08-09 16:54 ` [xdp-hints] [PATCH bpf-next 6/9] selftests/bpf: Add csum helpers Stanislav Fomichev
2023-08-09 16:54 ` [xdp-hints] [PATCH bpf-next 7/9] selftests/bpf: Add TX side to xdp_metadata Stanislav Fomichev
2023-08-09 16:54 ` [xdp-hints] [PATCH bpf-next 8/9] selftests/bpf: Add TX side to xdp_hw_metadata Stanislav Fomichev
2023-08-09 16:54 ` [xdp-hints] [PATCH bpf-next 9/9] xsk: document XDP_TX_METADATA_LEN layout Stanislav Fomichev
2023-08-09 20:39   ` [xdp-hints] " Jesper Dangaard Brouer
2023-08-10 18:17     ` Stanislav Fomichev
2023-08-09 20:09 ` [xdp-hints] Re: [PATCH bpf-next 0/9] xsk: TX metadata Jesper Dangaard Brouer
2023-08-10 18:23   ` Stanislav Fomichev
2023-08-14 11:13 ` Maciej Fijalkowski
2023-08-14 18:04   ` Stanislav Fomichev

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox