* [xdp-hints] [PATCH bpf-next v3 00/10] xsk: TX metadata
@ 2023-10-03 20:05 Stanislav Fomichev
From: Stanislav Fomichev @ 2023-10-03 20:05 UTC
To: bpf
Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
kpsingh, sdf, haoluo, jolsa, kuba, toke, willemb, dsahern,
magnus.karlsson, bjorn, maciej.fijalkowski, hawk,
yoong.siang.song, netdev, xdp-hints
This series implements initial TX metadata (offloads) for AF_XDP.
See patch #2 for the main implementation, and the mlx5 and stmmac patches
for examples of how to consume the metadata on the device side.
The series starts with two types of offloads:
- request TX timestamp (and write it back into the metadata area)
- request TX checksum offload
Changes since v2:
- fix compile issue with XDP_SOCKETS=n (Vinicius Costa Gomes and Intel bots)
- include stmmac support by Song Yoong Siang
v2: https://lore.kernel.org/bpf/20230914210452.2588884-1-sdf@google.com/T/#t
Performance (mlx5):
I've implemented a small xskgen tool to try to saturate a single TX queue:
https://github.com/fomichev/xskgen/tree/master
Here are the performance numbers with some analysis.
1. Baseline. Running with commit eb62e6aef940 ("Merge branch 'bpf:
Support bpf_get_func_ip helper in uprobes'"), nothing from this series:
- with 1400 bytes of payload: 98 gbps, 8 mpps
./xskgen -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000000 packets 116960000000 bits, took 1.189130 sec, 98.357623 gbps 8.409509 mpps
- with 200 bytes of payload: 49 gbps, 23 mpps
./xskgen -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000064 packets 20960134144 bits, took 0.422235 sec, 49.640921 gbps 23.683645 mpps
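As a rough cross-check of the figures above, the reported rates look like plain totals divided by elapsed time. The helpers below are a minimal sketch under that assumption; the names tx_rate_from_totals and frame_bytes are illustrative and are not taken from xskgen.

```c
#include <stdint.h>

/* Throughput derived from the totals on a "sent ..." line.
 * Assumption: gbps means 10^9 bits/s and mpps means 10^6 packets/s. */
struct tx_rate {
	double gbps;
	double mpps;
};

static struct tx_rate tx_rate_from_totals(uint64_t packets, uint64_t bits,
					  double seconds)
{
	struct tx_rate r;

	r.gbps = (double)bits / seconds / 1e9;
	r.mpps = (double)packets / seconds / 1e6;
	return r;
}

/* Frame size in bytes implied by the totals. */
static uint64_t frame_bytes(uint64_t packets, uint64_t bits)
{
	return bits / packets / 8;
}
```

Feeding in the 1400-byte baseline line (10000000 packets, 116960000000 bits, 1.189130 sec) reproduces the reported 98.36 gbps and 8.41 mpps. The implied frame size is 1462 bytes for the 1400-byte payload and 262 bytes for the 200-byte payload, i.e. 62 bytes of headers in both cases. That would be consistent with Ethernet + IPv6 + UDP, though the breakdown is an inference and is not stated in the output.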
2. Adding the single commit that supports reserving tx_metadata_len
changes nothing numbers-wise.
- baseline for 1400
./xskgen -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000000 packets 116960000000 bits, took 1.189247 sec, 98.347946 gbps 8.408682 mpps
- baseline for 200
./xskgen -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000000 packets 20960000000 bits, took 0.421248 sec, 49.756913 gbps 23.738985 mpps
3. Adding the -M flag causes xskgen to reserve the metadata and fill it,
but not to set the XDP_TX_METADATA descriptor option.
- new baseline for 1400 (with only filling the metadata)
./xskgen -M -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000000 packets 116960000000 bits, took 1.188767 sec, 98.387657 gbps 8.412077 mpps
- new baseline for 200 (with only filling the metadata)
./xskgen -M -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000000 packets 20960000000 bits, took 0.410213 sec, 51.095407 gbps 24.377579 mpps
(The numbers go slightly up here; I'm not really sure why, maybe some
cache-related side effects?)
4. Next, I'm running the same test but with the commit that adds actual
general infra to parse XDP_TX_METADATA (but no driver support).
Essentially applying "xsk: add TX timestamp and TX checksum offload support"
from this series. Numbers are the same.
- fill metadata for 1400
./xskgen -M -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000000 packets 116960000000 bits, took 1.188430 sec, 98.415557 gbps 8.414463 mpps
- fill metadata for 200
./xskgen -M -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000000 packets 20960000000 bits, took 0.411559 sec, 50.928299 gbps 24.297853 mpps
- request metadata for 1400
./xskgen -m -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000000 packets 116960000000 bits, took 1.188723 sec, 98.391299 gbps 8.412389 mpps
- request metadata for 200
./xskgen -m -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000064 packets 20960134144 bits, took 0.411240 sec, 50.968131 gbps 24.316856 mpps
5. Now, for the most interesting part, I'm adding mlx5 driver support.
The mpps for the 200-byte case goes down from 23 mpps to 19 mpps, but
_only_ when I enable the metadata. This looks like a side effect
of me pushing an extra metadata pointer via mlx5e_xdpi_fifo_push.
Hence, this part is wrapped into 'if (xp_tx_metadata_enabled)'
to not affect the existing non-metadata use-cases. Since this is not
regressing existing workloads, I'm not spending any time trying to
optimize it more (and leaving it up to the mlx owners to pursue if
they see any good way to do it).
- same baseline
./xskgen -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000000 packets 116960000000 bits, took 1.189434 sec, 98.332484 gbps 8.407360 mpps
./xskgen -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000128 packets 20960268288 bits, took 0.425254 sec, 49.288821 gbps 23.515659 mpps
- fill metadata for 1400
./xskgen -M -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000000 packets 116960000000 bits, took 1.189528 sec, 98.324714 gbps 8.406696 mpps
- fill metadata for 200
./xskgen -M -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000128 packets 20960268288 bits, took 0.519085 sec, 40.379260 gbps 19.264914 mpps
- request metadata for 1400
./xskgen -m -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000000 packets 116960000000 bits, took 1.189329 sec, 98.341165 gbps 8.408102 mpps
- request metadata for 200
./xskgen -m -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000128 packets 20960268288 bits, took 0.519929 sec, 40.313713 gbps 19.233642 mpps
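To put the 200-byte slowdown in per-packet terms, the elapsed times above can be converted to nanoseconds per packet. This is only a back-of-the-envelope sketch; the helper names are illustrative.

```c
#include <stdint.h>

/* Average time spent per packet, in nanoseconds. */
static double ns_per_packet(uint64_t packets, double seconds)
{
	return seconds * 1e9 / (double)packets;
}

/* Extra per-packet cost of run B relative to run A, in nanoseconds. */
static double extra_ns_per_packet(uint64_t pkts_a, double sec_a,
				  uint64_t pkts_b, double sec_b)
{
	return ns_per_packet(pkts_b, sec_b) - ns_per_packet(pkts_a, sec_a);
}
```

With the 200-byte baseline (10000128 packets in 0.425254 sec) against the -M run (10000128 packets in 0.519085 sec), this comes out to roughly 42.5 ns versus 51.9 ns per packet, so the metadata path costs on the order of 9 ns per packet in this setup.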
Song Yoong Siang (1):
net: stmmac: Add Tx HWTS support to XDP ZC
Stanislav Fomichev (9):
xsk: Support tx_metadata_len
xsk: add TX timestamp and TX checksum offload support
tools: ynl: print xsk-features from the sample
net/mlx5e: Implement AF_XDP TX timestamp and checksum offload
selftests/xsk: Support tx_metadata_len
selftests/bpf: Add csum helpers
selftests/bpf: Add TX side to xdp_metadata
selftests/bpf: Add TX side to xdp_hw_metadata
xsk: document tx_metadata_len layout
Documentation/netlink/specs/netdev.yaml | 19 ++
Documentation/networking/index.rst | 1 +
Documentation/networking/xsk-tx-metadata.rst | 77 +++++++
drivers/net/ethernet/mellanox/mlx5/core/en.h | 4 +-
.../net/ethernet/mellanox/mlx5/core/en/xdp.c | 72 ++++++-
.../net/ethernet/mellanox/mlx5/core/en/xdp.h | 11 +-
.../ethernet/mellanox/mlx5/core/en/xsk/tx.c | 17 +-
.../net/ethernet/mellanox/mlx5/core/en_main.c | 1 +
drivers/net/ethernet/stmicro/stmmac/stmmac.h | 12 ++
.../net/ethernet/stmicro/stmmac/stmmac_main.c | 63 +++++-
include/linux/netdevice.h | 27 +++
include/linux/skbuff.h | 14 +-
include/net/xdp_sock.h | 81 +++++++
include/net/xdp_sock_drv.h | 13 ++
include/net/xsk_buff_pool.h | 7 +
include/uapi/linux/if_xdp.h | 41 ++++
include/uapi/linux/netdev.h | 16 ++
net/core/netdev-genl.c | 12 +-
net/xdp/xdp_umem.c | 4 +
net/xdp/xsk.c | 51 ++++-
net/xdp/xsk_buff_pool.c | 1 +
net/xdp/xsk_queue.h | 19 +-
tools/include/uapi/linux/if_xdp.h | 55 ++++-
tools/include/uapi/linux/netdev.h | 16 ++
tools/net/ynl/generated/netdev-user.c | 19 ++
tools/net/ynl/generated/netdev-user.h | 3 +
tools/net/ynl/samples/netdev.c | 6 +
tools/testing/selftests/bpf/network_helpers.h | 43 ++++
.../selftests/bpf/prog_tests/xdp_metadata.c | 31 ++-
tools/testing/selftests/bpf/xdp_hw_metadata.c | 202 +++++++++++++++++-
tools/testing/selftests/bpf/xsk.c | 3 +
tools/testing/selftests/bpf/xsk.h | 1 +
32 files changed, 896 insertions(+), 46 deletions(-)
create mode 100644 Documentation/networking/xsk-tx-metadata.rst
--
2.42.0.582.g8ccd20d70d-goog
* [xdp-hints] [PATCH bpf-next v3 01/10] xsk: Support tx_metadata_len
From: Stanislav Fomichev @ 2023-10-03 20:05 UTC
To: bpf
Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
kpsingh, sdf, haoluo, jolsa, kuba, toke, willemb, dsahern,
magnus.karlsson, bjorn, maciej.fijalkowski, hawk,
yoong.siang.song, netdev, xdp-hints
For zerocopy mode, tx_desc->addr can point to an arbitrary offset
and carry some TX metadata in the headroom. For copy mode, there
is currently no way to populate skb metadata.
Introduce a new tx_metadata_len umem config option that indicates how many
bytes to treat as metadata. Metadata bytes come prior to the tx_desc address
(same as in the RX case).
The size of the metadata has the same constraints as XDP:
- less than 256 bytes
- 4-byte aligned
- non-zero
This data is not interpreted in any way right now.
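The size constraints and the placement rule described above can be sketched in plain C as follows. This is an illustration only, not the kernel code: the helper names are hypothetical, and treating a length of zero as "no metadata reserved" is an assumption of the sketch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Size constraints as listed above: below 256 bytes and a multiple of 4. */
static bool tx_metadata_len_ok(uint32_t len)
{
	return len < 256 && len % 4 == 0;
}

/* The metadata area sits immediately before the packet data that
 * tx_desc->addr points to. Returns NULL when nothing is reserved
 * (assumption: a length of zero means metadata is disabled). */
static void *tx_metadata_ptr(void *pkt_data, uint32_t tx_metadata_len)
{
	if (tx_metadata_len == 0)
		return NULL;
	return (uint8_t *)pkt_data - tx_metadata_len;
}
```

For example, with 8 bytes reserved and packet data starting at offset 16 of a frame, the metadata would occupy offsets 8 through 15.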
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
include/net/xdp_sock.h | 1 +
include/net/xsk_buff_pool.h | 1 +
include/uapi/linux/if_xdp.h | 1 +
net/xdp/xdp_umem.c | 4 ++++
net/xdp/xsk.c | 12 +++++++++++-
net/xdp/xsk_buff_pool.c | 1 +
net/xdp/xsk_queue.h | 17 ++++++++++-------
tools/include/uapi/linux/if_xdp.h | 1 +
8 files changed, 30 insertions(+), 8 deletions(-)
diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 69b472604b86..caa1f04106be 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -30,6 +30,7 @@ struct xdp_umem {
struct user_struct *user;
refcount_t users;
u8 flags;
+ u8 tx_metadata_len;
bool zc;
struct page **pgs;
int id;
diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
index b0bdff26fc88..1985ffaf9b0c 100644
--- a/include/net/xsk_buff_pool.h
+++ b/include/net/xsk_buff_pool.h
@@ -77,6 +77,7 @@ struct xsk_buff_pool {
u32 chunk_size;
u32 chunk_shift;
u32 frame_len;
+ u8 tx_metadata_len; /* inherited from umem */
u8 cached_need_wakeup;
bool uses_need_wakeup;
bool dma_need_sync;
diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index 8d48863472b9..2ecf79282c26 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -76,6 +76,7 @@ struct xdp_umem_reg {
__u32 chunk_size;
__u32 headroom;
__u32 flags;
+ __u32 tx_metadata_len;
};
struct xdp_statistics {
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index 06cead2b8e34..333f3d53aad4 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -199,6 +199,9 @@ static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
if (headroom >= chunk_size - XDP_PACKET_HEADROOM)
return -EINVAL;
+ if (mr->tx_metadata_len >= 256 || mr->tx_metadata_len % 4)
+ return -EINVAL;
+
umem->size = size;
umem->headroom = headroom;
umem->chunk_size = chunk_size;
@@ -207,6 +210,7 @@ static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
umem->pgs = NULL;
umem->user = NULL;
umem->flags = mr->flags;
+ umem->tx_metadata_len = mr->tx_metadata_len;
INIT_LIST_HEAD(&umem->xsk_dma_list);
refcount_set(&umem->users, 1);
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 7482d0aca504..c1e12b602213 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -1255,6 +1255,14 @@ struct xdp_umem_reg_v1 {
__u32 headroom;
};
+struct xdp_umem_reg_v2 {
+ __u64 addr; /* Start of packet data area */
+ __u64 len; /* Length of packet data area */
+ __u32 chunk_size;
+ __u32 headroom;
+ __u32 flags;
+};
+
static int xsk_setsockopt(struct socket *sock, int level, int optname,
sockptr_t optval, unsigned int optlen)
{
@@ -1298,8 +1306,10 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
if (optlen < sizeof(struct xdp_umem_reg_v1))
return -EINVAL;
- else if (optlen < sizeof(mr))
+ else if (optlen < sizeof(struct xdp_umem_reg_v2))
mr_size = sizeof(struct xdp_umem_reg_v1);
+ else if (optlen < sizeof(mr))
+ mr_size = sizeof(struct xdp_umem_reg_v2);
if (copy_from_sockptr(&mr, optval, mr_size))
return -EFAULT;
diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
index 49cb9f9a09be..386eddcdf837 100644
--- a/net/xdp/xsk_buff_pool.c
+++ b/net/xdp/xsk_buff_pool.c
@@ -85,6 +85,7 @@ struct xsk_buff_pool *xp_create_and_assign_umem(struct xdp_sock *xs,
XDP_PACKET_HEADROOM;
pool->umem = umem;
pool->addrs = umem->addrs;
+ pool->tx_metadata_len = umem->tx_metadata_len;
INIT_LIST_HEAD(&pool->free_list);
INIT_LIST_HEAD(&pool->xskb_list);
INIT_LIST_HEAD(&pool->xsk_tx_list);
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index 13354a1e4280..c74a1372bcb9 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -143,15 +143,17 @@ static inline bool xp_unused_options_set(u32 options)
static inline bool xp_aligned_validate_desc(struct xsk_buff_pool *pool,
struct xdp_desc *desc)
{
- u64 offset = desc->addr & (pool->chunk_size - 1);
+ u64 addr = desc->addr - pool->tx_metadata_len;
+ u64 len = desc->len + pool->tx_metadata_len;
+ u64 offset = addr & (pool->chunk_size - 1);
if (!desc->len)
return false;
- if (offset + desc->len > pool->chunk_size)
+ if (offset + len > pool->chunk_size)
return false;
- if (desc->addr >= pool->addrs_cnt)
+ if (addr >= pool->addrs_cnt)
return false;
if (xp_unused_options_set(desc->options))
@@ -162,16 +164,17 @@ static inline bool xp_aligned_validate_desc(struct xsk_buff_pool *pool,
static inline bool xp_unaligned_validate_desc(struct xsk_buff_pool *pool,
struct xdp_desc *desc)
{
- u64 addr = xp_unaligned_add_offset_to_addr(desc->addr);
+ u64 addr = xp_unaligned_add_offset_to_addr(desc->addr) - pool->tx_metadata_len;
+ u64 len = desc->len + pool->tx_metadata_len;
if (!desc->len)
return false;
- if (desc->len > pool->chunk_size)
+ if (len > pool->chunk_size)
return false;
- if (addr >= pool->addrs_cnt || addr + desc->len > pool->addrs_cnt ||
- xp_desc_crosses_non_contig_pg(pool, addr, desc->len))
+ if (addr >= pool->addrs_cnt || addr + len > pool->addrs_cnt ||
+ xp_desc_crosses_non_contig_pg(pool, addr, len))
return false;
if (xp_unused_options_set(desc->options))
diff --git a/tools/include/uapi/linux/if_xdp.h b/tools/include/uapi/linux/if_xdp.h
index 73a47da885dc..34411a2e5b6c 100644
--- a/tools/include/uapi/linux/if_xdp.h
+++ b/tools/include/uapi/linux/if_xdp.h
@@ -76,6 +76,7 @@ struct xdp_umem_reg {
__u32 chunk_size;
__u32 headroom;
__u32 flags;
+ __u32 tx_metadata_len;
};
struct xdp_statistics {
--
2.42.0.582.g8ccd20d70d-goog
* [xdp-hints] [PATCH bpf-next v3 02/10] xsk: add TX timestamp and TX checksum offload support
From: Stanislav Fomichev @ 2023-10-03 20:05 UTC
To: bpf
Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
kpsingh, sdf, haoluo, jolsa, kuba, toke, willemb, dsahern,
magnus.karlsson, bjorn, maciej.fijalkowski, hawk,
yoong.siang.song, netdev, xdp-hints
This change actually defines the (initial) metadata layout
that should be used by AF_XDP userspace (xsk_tx_metadata).
The first field is flags which requests appropriate offloads,
followed by the offload-specific fields. The supported per-device
offloads are exported via netlink (new xsk-flags).
The offloads themselves are still implemented in a bit of a
framework-y fashion that's left from my initial kfunc attempt.
I'm introducing new xsk_tx_metadata_ops which drivers are
supposed to implement. The drivers are also supposed
to call xsk_tx_metadata_request/xsk_tx_metadata_complete in
the right places. Since xsk_tx_metadata_{request,complete}
are static inline, we don't incur any extra overhead doing
indirect calls.
The benefits of this scheme are as follows:
- keeps all metadata layout parsing away from driver code
- makes it easy to grep and see which drivers implement what
- no extra flags are needed to keep track of which offloads are
implemented; if the callback is implemented, the offload
is supported (used by the netlink reporting code)
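The "callback present means offload supported" rule from the last bullet can be illustrated with a small userspace mock. The struct and constants below only mirror the shape of what this patch adds; every mock_/MOCK_ name is hypothetical and none of this is kernel code.

```c
#include <stddef.h>
#include <stdint.h>

/* Userspace mock of the per-device callback table. */
struct mock_tx_metadata_ops {
	void (*tmo_request_timestamp)(void *priv);
	uint64_t (*tmo_fill_timestamp)(void *priv);
	void (*tmo_request_checksum)(uint16_t csum_start,
				     uint16_t csum_offset, void *priv);
};

enum {
	MOCK_XSK_TX_TIMESTAMP = 1,
	MOCK_XSK_TX_CHECKSUM = 2,
};

/* Derive the advertised feature mask purely from which callbacks
 * are non-NULL, so no separate capability flags are maintained. */
static uint64_t mock_xsk_features(const struct mock_tx_metadata_ops *ops)
{
	uint64_t features = 0;

	if (!ops)
		return 0;
	if (ops->tmo_fill_timestamp)
		features |= MOCK_XSK_TX_TIMESTAMP;
	if (ops->tmo_request_checksum)
		features |= MOCK_XSK_TX_CHECKSUM;
	return features;
}

/* Callback of a hypothetical driver that only supports timestamps;
 * priv is assumed to point at a u64 holding the completion time. */
static uint64_t mock_fill_timestamp(void *priv)
{
	return *(uint64_t *)priv;
}
```

A device that fills in only the timestamp callback would then report just the timestamp bit, and a device with no ops table reports nothing.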
Two offloads are defined right now:
1. XDP_TX_METADATA_CHECKSUM: skb-style csum_start+csum_offset
2. XDP_TX_METADATA_TIMESTAMP: writes TX timestamp back into metadata
area upon completion (tx_timestamp field)
The offloads are also implemented for copy mode:
1. Extra XDP_TX_METADATA_CHECKSUM_SW to trigger skb_checksum_help; this
might be useful as a reference implementation and for testing
2. XDP_TX_METADATA_TIMESTAMP writes SW timestamp from the skb
destructor (note I'm reusing hwtstamps to pass metadata pointer)
The struct is forward-compatible and can be extended in the future
by appending more fields.
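From the userspace side, filling the metadata area might look roughly like the sketch below. The struct is a local mirror of the layout proposed in this patch, declared here so the example stands alone; the sketch_/SKETCH_ names are hypothetical, and the umem is assumed to have been registered with tx_metadata_len equal to sizeof(struct sketch_tx_metadata).

```c
#include <stdint.h>
#include <string.h>

#define SKETCH_TX_METADATA_TIMESTAMP (1u << 0)
#define SKETCH_TX_METADATA_CHECKSUM  (1u << 1)

/* Local mirror of the proposed layout: request fields and the
 * completion timestamp share the same 8 bytes. */
struct sketch_tx_metadata {
	union {
		struct {
			uint32_t flags;
			uint16_t csum_start;  /* offset from start of packet data */
			uint16_t csum_offset; /* offset from csum_start */
		};
		struct {
			uint64_t tx_timestamp;
		} completion;
	};
};

/* Write a timestamp + checksum request into the area that directly
 * precedes pkt_data. pkt_data is assumed to be 8-byte aligned. */
static struct sketch_tx_metadata *
sketch_fill_tx_metadata(void *pkt_data, uint16_t csum_start,
			uint16_t csum_offset)
{
	struct sketch_tx_metadata *meta =
		(struct sketch_tx_metadata *)((uint8_t *)pkt_data - sizeof(*meta));

	memset(meta, 0, sizeof(*meta));
	meta->flags = SKETCH_TX_METADATA_TIMESTAMP | SKETCH_TX_METADATA_CHECKSUM;
	meta->csum_start = csum_start;
	meta->csum_offset = csum_offset;
	return meta;
}
```

For a UDP-over-IPv6 frame, csum_start would be 54 (14 bytes of Ethernet plus 40 bytes of IPv6) and csum_offset would be 6 (the checksum field within the UDP header). The descriptor would additionally need the XDP_TX_METADATA option set. Because of the union, the request fields are overwritten once the timestamp is written back on completion.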
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
Documentation/netlink/specs/netdev.yaml | 19 ++++++
include/linux/netdevice.h | 27 +++++++++
include/linux/skbuff.h | 14 ++++-
include/net/xdp_sock.h | 80 +++++++++++++++++++++++++
include/net/xdp_sock_drv.h | 13 ++++
include/net/xsk_buff_pool.h | 6 ++
include/uapi/linux/if_xdp.h | 40 +++++++++++++
include/uapi/linux/netdev.h | 16 +++++
net/core/netdev-genl.c | 12 +++-
net/xdp/xsk.c | 39 ++++++++++++
net/xdp/xsk_queue.h | 2 +-
tools/include/uapi/linux/if_xdp.h | 54 +++++++++++++++--
tools/include/uapi/linux/netdev.h | 16 +++++
tools/net/ynl/generated/netdev-user.c | 19 ++++++
tools/net/ynl/generated/netdev-user.h | 3 +
15 files changed, 352 insertions(+), 8 deletions(-)
diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
index c46fcc78fc04..3735c26c8646 100644
--- a/Documentation/netlink/specs/netdev.yaml
+++ b/Documentation/netlink/specs/netdev.yaml
@@ -55,6 +55,19 @@ name: netdev
name: hash
doc:
Device is capable of exposing receive packet hash via bpf_xdp_metadata_rx_hash().
+ -
+ type: flags
+ name: xsk-flags
+ render-max: true
+ entries:
+ -
+ name: tx-timestamp
+ doc:
+ HW timestamping egress packets is supported by the driver.
+ -
+ name: tx-checksum
+ doc:
+ L3 checksum HW offload is supported by the driver.
attribute-sets:
-
@@ -88,6 +101,11 @@ name: netdev
type: u64
enum: xdp-rx-metadata
enum-as-flags: true
+ -
+ name: xsk-features
+ doc: Bitmask of enabled AF_XDP features.
+ type: u64
+ enum: xsk-flags
operations:
list:
@@ -105,6 +123,7 @@ name: netdev
- xdp-features
- xdp-zc-max-segs
- xdp-rx-metadata-features
+ - xsk-features
dump:
reply: *dev-all
-
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 7e520c14eb8c..0e1cb026cbe5 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1650,6 +1650,31 @@ struct net_device_ops {
struct netlink_ext_ack *extack);
};
+/*
+ * This structure defines the AF_XDP TX metadata hooks for network devices.
+ * The following hooks can be defined; unless noted otherwise, they are
+ * optional and can be filled with a null pointer.
+ *
+ * void (*tmo_request_timestamp)(void *priv)
+ * This function is called when AF_XDP frame requested egress timestamp.
+ *
+ * u64 (*tmo_fill_timestamp)(void *priv)
+ * This function is called when AF_XDP frame, that had requested
+ * egress timestamp, received a completion. The hook needs to return
+ * the actual HW timestamp.
+ *
+ * void (*tmo_request_checksum)(u16 csum_start, u16 csum_offset, void *priv)
+ * This function is called when AF_XDP frame requested HW checksum
+ * offload. csum_start indicates position where checksumming should start.
+ * csum_offset indicates position where checksum should be stored.
+ *
+ */
+struct xsk_tx_metadata_ops {
+ void (*tmo_request_timestamp)(void *priv);
+ u64 (*tmo_fill_timestamp)(void *priv);
+ void (*tmo_request_checksum)(u16 csum_start, u16 csum_offset, void *priv);
+};
+
/**
* enum netdev_priv_flags - &struct net_device priv_flags
*
@@ -1838,6 +1863,7 @@ enum netdev_ml_priv_type {
* @netdev_ops: Includes several pointers to callbacks,
* if one wants to override the ndo_*() functions
* @xdp_metadata_ops: Includes pointers to XDP metadata callbacks.
+ * @xsk_tx_metadata_ops: Includes pointers to AF_XDP TX metadata callbacks.
* @ethtool_ops: Management operations
* @l3mdev_ops: Layer 3 master device operations
* @ndisc_ops: Includes callbacks for different IPv6 neighbour
@@ -2097,6 +2123,7 @@ struct net_device {
unsigned long long priv_flags;
const struct net_device_ops *netdev_ops;
const struct xdp_metadata_ops *xdp_metadata_ops;
+ const struct xsk_tx_metadata_ops *xsk_tx_metadata_ops;
int ifindex;
unsigned short gflags;
unsigned short hard_header_len;
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 4174c4b82d13..444d35dcd690 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -566,6 +566,15 @@ struct ubuf_info_msgzc {
int mm_account_pinned_pages(struct mmpin *mmp, size_t size);
void mm_unaccount_pinned_pages(struct mmpin *mmp);
+/* Preserve some data across TX submission and completion.
+ *
+ * Note, this state is stored in the driver. Extending the layout
+ * might need some special care.
+ */
+struct xsk_tx_metadata_compl {
+ __u64 *tx_timestamp;
+};
+
/* This data is invariant across clones and lives at
* the end of the header data, ie. at skb->end.
*/
@@ -578,7 +587,10 @@ struct skb_shared_info {
/* Warning: this field is not always filled in (UFO)! */
unsigned short gso_segs;
struct sk_buff *frag_list;
- struct skb_shared_hwtstamps hwtstamps;
+ union {
+ struct skb_shared_hwtstamps hwtstamps;
+ struct xsk_tx_metadata_compl xsk_meta;
+ };
unsigned int gso_type;
u32 tskey;
diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index caa1f04106be..29427a69784d 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -92,6 +92,74 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
int __xsk_map_redirect(struct xdp_sock *xs, struct xdp_buff *xdp);
void __xsk_map_flush(void);
+/**
+ * xsk_tx_metadata_to_compl - Save enough relevant metadata information
+ * to perform tx completion in the future.
+ * @meta: pointer to AF_XDP metadata area
+ * @compl: pointer to output struct xsk_tx_metadata_compl
+ *
+ * This function should be called by the networking device when
+ * it prepares AF_XDP egress packet. The value of @compl should be stored
+ * and passed to xsk_tx_metadata_complete upon TX completion.
+ */
+static inline void xsk_tx_metadata_to_compl(struct xsk_tx_metadata *meta,
+ struct xsk_tx_metadata_compl *compl)
+{
+ if (!meta)
+ return;
+
+ if (meta->flags & XDP_TX_METADATA_TIMESTAMP)
+ compl->tx_timestamp = &meta->completion.tx_timestamp;
+ else
+ compl->tx_timestamp = NULL;
+}
+
+/**
+ * xsk_tx_metadata_request - Evaluate AF_XDP TX metadata at submission
+ * and call appropriate xsk_tx_metadata_ops operation.
+ * @meta: pointer to AF_XDP metadata area
+ * @ops: pointer to struct xsk_tx_metadata_ops
+ * @priv: pointer to driver-private area
+ *
+ * This function should be called by the networking device when
+ * it prepares AF_XDP egress packet.
+ */
+static inline void xsk_tx_metadata_request(const struct xsk_tx_metadata *meta,
+ const struct xsk_tx_metadata_ops *ops,
+ void *priv)
+{
+ if (!meta)
+ return;
+
+ if (ops->tmo_request_timestamp)
+ if (meta->flags & XDP_TX_METADATA_TIMESTAMP)
+ ops->tmo_request_timestamp(priv);
+
+ if (ops->tmo_request_checksum)
+ if (meta->flags & XDP_TX_METADATA_CHECKSUM)
+ ops->tmo_request_checksum(meta->csum_start, meta->csum_offset, priv);
+}
+
+/**
+ * xsk_tx_metadata_complete - Evaluate AF_XDP TX metadata at completion
+ * and call appropriate xsk_tx_metadata_ops operation.
+ * @compl: pointer to completion metadata produced from xsk_tx_metadata_to_compl
+ * @ops: pointer to struct xsk_tx_metadata_ops
+ * @priv: pointer to driver-private area
+ *
+ * This function should be called by the networking device upon
+ * AF_XDP egress completion.
+ */
+static inline void xsk_tx_metadata_complete(struct xsk_tx_metadata_compl *compl,
+ const struct xsk_tx_metadata_ops *ops,
+ void *priv)
+{
+ if (!compl || !compl->tx_timestamp)
+ return;
+
+ *compl->tx_timestamp = ops->tmo_fill_timestamp(priv);
+}
+
#else
static inline int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
@@ -108,6 +176,18 @@ static inline void __xsk_map_flush(void)
{
}
+static inline void xsk_tx_metadata_request(struct xsk_tx_metadata *meta,
+ const struct xsk_tx_metadata_ops *ops,
+ void *priv)
+{
+}
+
+static inline void xsk_tx_metadata_complete(struct xsk_tx_metadata_compl *compl,
+ const struct xsk_tx_metadata_ops *ops,
+ void *priv)
+{
+}
+
#endif /* CONFIG_XDP_SOCKETS */
#endif /* _LINUX_XDP_SOCK_H */
diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
index 1f6fc8c7a84c..e2558ac3e195 100644
--- a/include/net/xdp_sock_drv.h
+++ b/include/net/xdp_sock_drv.h
@@ -165,6 +165,14 @@ static inline void *xsk_buff_raw_get_data(struct xsk_buff_pool *pool, u64 addr)
return xp_raw_get_data(pool, addr);
}
+static inline struct xsk_tx_metadata *xsk_buff_get_metadata(struct xsk_buff_pool *pool, u64 addr)
+{
+ if (!pool->tx_metadata_len)
+ return NULL;
+
+ return xp_raw_get_data(pool, addr) - pool->tx_metadata_len;
+}
+
static inline void xsk_buff_dma_sync_for_cpu(struct xdp_buff *xdp, struct xsk_buff_pool *pool)
{
struct xdp_buff_xsk *xskb = container_of(xdp, struct xdp_buff_xsk, xdp);
@@ -324,6 +332,11 @@ static inline void *xsk_buff_raw_get_data(struct xsk_buff_pool *pool, u64 addr)
return NULL;
}
+static inline struct xsk_tx_metadata *xsk_buff_get_metadata(struct xsk_buff_pool *pool, u64 addr)
+{
+ return NULL;
+}
+
static inline void xsk_buff_dma_sync_for_cpu(struct xdp_buff *xdp, struct xsk_buff_pool *pool)
{
}
diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
index 1985ffaf9b0c..97f5cc10d79e 100644
--- a/include/net/xsk_buff_pool.h
+++ b/include/net/xsk_buff_pool.h
@@ -33,6 +33,7 @@ struct xdp_buff_xsk {
};
#define XSK_CHECK_PRIV_TYPE(t) BUILD_BUG_ON(sizeof(t) > offsetofend(struct xdp_buff_xsk, cb))
+#define XSK_TX_COMPL_FITS(t) BUILD_BUG_ON(sizeof(struct xsk_tx_metadata_compl) > sizeof(t))
struct xsk_dma_map {
dma_addr_t *dma_pages;
@@ -234,4 +235,9 @@ static inline u64 xp_get_handle(struct xdp_buff_xsk *xskb)
return xskb->orig_addr + (offset << XSK_UNALIGNED_BUF_OFFSET_SHIFT);
}
+static inline bool xp_tx_metadata_enabled(const struct xsk_buff_pool *pool)
+{
+ return pool->tx_metadata_len > 0;
+}
+
#endif /* XSK_BUFF_POOL_H_ */
diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index 2ecf79282c26..ecfd67988283 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -106,6 +106,43 @@ struct xdp_options {
#define XSK_UNALIGNED_BUF_ADDR_MASK \
((1ULL << XSK_UNALIGNED_BUF_OFFSET_SHIFT) - 1)
+/* Request transmit timestamp. Upon completion, put it into tx_timestamp
+ * field of struct xsk_tx_metadata.
+ */
+#define XDP_TX_METADATA_TIMESTAMP (1 << 0)
+
+/* Request transmit checksum offload. Checksum start position and offset
+ * are communicated via csum_start and csum_offset fields of struct
+ * xsk_tx_metadata.
+ */
+#define XDP_TX_METADATA_CHECKSUM (1 << 1)
+
+/* Force checksum calculation in software. Can be used for testing or
+ * working around potential HW issues. This option causes performance
+ * degradation and only works in XDP_COPY mode.
+ */
+#define XDP_TX_METADATA_CHECKSUM_SW (1 << 2)
+
+struct xsk_tx_metadata {
+ union {
+ struct {
+ __u32 flags;
+
+ /* XDP_TX_METADATA_CHECKSUM */
+
+ /* Offset from desc->addr where checksumming should start. */
+ __u16 csum_start;
+ /* Offset from csum_start where checksum should be stored. */
+ __u16 csum_offset;
+ };
+
+ struct {
+ /* XDP_TX_METADATA_TIMESTAMP */
+ __u64 tx_timestamp;
+ } completion;
+ };
+};
+
/* Rx/Tx descriptor */
struct xdp_desc {
__u64 addr;
@@ -122,4 +159,7 @@ struct xdp_desc {
*/
#define XDP_PKT_CONTD (1 << 0)
+/* TX packet carries valid metadata. */
+#define XDP_TX_METADATA (1 << 1)
+
#endif /* _LINUX_IF_XDP_H */
diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
index 2943a151d4f1..48d5477a668c 100644
--- a/include/uapi/linux/netdev.h
+++ b/include/uapi/linux/netdev.h
@@ -53,12 +53,28 @@ enum netdev_xdp_rx_metadata {
NETDEV_XDP_RX_METADATA_MASK = 3,
};
+/**
+ * enum netdev_xsk_flags
+ * @NETDEV_XSK_FLAGS_TX_TIMESTAMP: HW timestamping egress packets is supported
+ * by the driver.
+ * @NETDEV_XSK_FLAGS_TX_CHECKSUM: L3 checksum HW offload is supported by the
+ * driver.
+ */
+enum netdev_xsk_flags {
+ NETDEV_XSK_FLAGS_TX_TIMESTAMP = 1,
+ NETDEV_XSK_FLAGS_TX_CHECKSUM = 2,
+
+ /* private: */
+ NETDEV_XSK_FLAGS_MASK = 3,
+};
+
enum {
NETDEV_A_DEV_IFINDEX = 1,
NETDEV_A_DEV_PAD,
NETDEV_A_DEV_XDP_FEATURES,
NETDEV_A_DEV_XDP_ZC_MAX_SEGS,
NETDEV_A_DEV_XDP_RX_METADATA_FEATURES,
+ NETDEV_A_DEV_XSK_FEATURES,
__NETDEV_A_DEV_MAX,
NETDEV_A_DEV_MAX = (__NETDEV_A_DEV_MAX - 1)
diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
index fe61f85bcf33..5d889c2425fd 100644
--- a/net/core/netdev-genl.c
+++ b/net/core/netdev-genl.c
@@ -14,6 +14,7 @@ netdev_nl_dev_fill(struct net_device *netdev, struct sk_buff *rsp,
const struct genl_info *info)
{
u64 xdp_rx_meta = 0;
+ u64 xsk_features = 0;
void *hdr;
hdr = genlmsg_iput(rsp, info);
@@ -26,11 +27,20 @@ netdev_nl_dev_fill(struct net_device *netdev, struct sk_buff *rsp,
XDP_METADATA_KFUNC_xxx
#undef XDP_METADATA_KFUNC
+ if (netdev->xsk_tx_metadata_ops) {
+ if (netdev->xsk_tx_metadata_ops->tmo_fill_timestamp)
+ xsk_features |= NETDEV_XSK_FLAGS_TX_TIMESTAMP;
+ if (netdev->xsk_tx_metadata_ops->tmo_request_checksum)
+ xsk_features |= NETDEV_XSK_FLAGS_TX_CHECKSUM;
+ }
+
if (nla_put_u32(rsp, NETDEV_A_DEV_IFINDEX, netdev->ifindex) ||
nla_put_u64_64bit(rsp, NETDEV_A_DEV_XDP_FEATURES,
netdev->xdp_features, NETDEV_A_DEV_PAD) ||
nla_put_u64_64bit(rsp, NETDEV_A_DEV_XDP_RX_METADATA_FEATURES,
- xdp_rx_meta, NETDEV_A_DEV_PAD)) {
+ xdp_rx_meta, NETDEV_A_DEV_PAD) ||
+ nla_put_u64_64bit(rsp, NETDEV_A_DEV_XSK_FEATURES,
+ xsk_features, NETDEV_A_DEV_PAD)) {
genlmsg_cancel(rsp, hdr);
return -EINVAL;
}
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index c1e12b602213..c427e02c13eb 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -543,6 +543,13 @@ static u32 xsk_get_num_desc(struct sk_buff *skb)
static void xsk_destruct_skb(struct sk_buff *skb)
{
+ struct xsk_tx_metadata_compl *compl = &skb_shinfo(skb)->xsk_meta;
+
+ if (compl->tx_timestamp) {
+ /* sw completion timestamp, not a real one */
+ *compl->tx_timestamp = ktime_get_tai_fast_ns();
+ }
+
xsk_cq_submit_locked(xdp_sk(skb->sk), xsk_get_num_desc(skb));
sock_wfree(skb);
}
@@ -627,8 +634,10 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
struct xdp_desc *desc)
{
+ struct xsk_tx_metadata *meta = NULL;
struct net_device *dev = xs->dev;
struct sk_buff *skb = xs->skb;
+ bool first_frag = false;
int err;
if (dev->priv_flags & IFF_TX_SKB_NO_LINEAR) {
@@ -659,6 +668,8 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
kfree_skb(skb);
goto free_err;
}
+
+ first_frag = true;
} else {
int nr_frags = skb_shinfo(skb)->nr_frags;
struct page *page;
@@ -681,12 +692,40 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
skb_add_rx_frag(skb, nr_frags, page, 0, len, 0);
}
+
+ if (first_frag && desc->options & XDP_TX_METADATA) {
+ if (unlikely(xs->pool->tx_metadata_len == 0)) {
+ err = -EINVAL;
+ goto free_err;
+ }
+
+ meta = buffer - xs->pool->tx_metadata_len;
+
+ if (meta->flags & XDP_TX_METADATA_CHECKSUM) {
+ if (unlikely(meta->csum_start + meta->csum_offset +
+ sizeof(__sum16) > len)) {
+ err = -EINVAL;
+ goto free_err;
+ }
+
+ skb->csum_start = hr + meta->csum_start;
+ skb->csum_offset = meta->csum_offset;
+ skb->ip_summed = CHECKSUM_PARTIAL;
+
+ if (unlikely(meta->flags & XDP_TX_METADATA_CHECKSUM_SW)) {
+ err = skb_checksum_help(skb);
+ if (err)
+ goto free_err;
+ }
+ }
+ }
}
skb->dev = dev;
skb->priority = xs->sk.sk_priority;
skb->mark = READ_ONCE(xs->sk.sk_mark);
skb->destructor = xsk_destruct_skb;
+ xsk_tx_metadata_to_compl(meta, &skb_shinfo(skb)->xsk_meta);
xsk_set_destructor_arg(skb);
return skb;
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index c74a1372bcb9..6f2d1621c992 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -137,7 +137,7 @@ static inline bool xskq_cons_read_addr_unchecked(struct xsk_queue *q, u64 *addr)
static inline bool xp_unused_options_set(u32 options)
{
- return options & ~XDP_PKT_CONTD;
+ return options & ~(XDP_PKT_CONTD | XDP_TX_METADATA);
}
static inline bool xp_aligned_validate_desc(struct xsk_buff_pool *pool,
diff --git a/tools/include/uapi/linux/if_xdp.h b/tools/include/uapi/linux/if_xdp.h
index 34411a2e5b6c..53ceaae10dd1 100644
--- a/tools/include/uapi/linux/if_xdp.h
+++ b/tools/include/uapi/linux/if_xdp.h
@@ -26,11 +26,11 @@
*/
#define XDP_USE_NEED_WAKEUP (1 << 3)
/* By setting this option, userspace application indicates that it can
- * handle multiple descriptors per packet thus enabling xsk core to split
+ * handle multiple descriptors per packet thus enabling AF_XDP to split
* multi-buffer XDP frames into multiple Rx descriptors. Without this set
- * such frames will be dropped by xsk.
+ * such frames will be dropped.
*/
-#define XDP_USE_SG (1 << 4)
+#define XDP_USE_SG (1 << 4)
/* Flags for xsk_umem_config flags */
#define XDP_UMEM_UNALIGNED_CHUNK_FLAG (1 << 0)
@@ -106,6 +106,43 @@ struct xdp_options {
#define XSK_UNALIGNED_BUF_ADDR_MASK \
((1ULL << XSK_UNALIGNED_BUF_OFFSET_SHIFT) - 1)
+/* Request transmit timestamp. Upon completion, put it into tx_timestamp
+ * field of struct xsk_tx_metadata.
+ */
+#define XDP_TX_METADATA_TIMESTAMP (1 << 0)
+
+/* Request transmit checksum offload. Checksum start position and offset
+ * are communicated via csum_start and csum_offset fields of struct
+ * xsk_tx_metadata.
+ */
+#define XDP_TX_METADATA_CHECKSUM (1 << 1)
+
+/* Force checksum calculation in software. Can be used for testing or
+ * working around potential HW issues. This option causes performance
+ * degradation and only works in XDP_COPY mode.
+ */
+#define XDP_TX_METADATA_CHECKSUM_SW (1 << 2)
+
+struct xsk_tx_metadata {
+ union {
+ struct {
+ __u32 flags;
+
+ /* XDP_TX_METADATA_CHECKSUM */
+
+ /* Offset from desc->addr where checksumming should start. */
+ __u16 csum_start;
+ /* Offset from csum_start where checksum should be stored. */
+ __u16 csum_offset;
+ };
+
+ struct {
+ /* XDP_TX_METADATA_TIMESTAMP */
+ __u64 tx_timestamp;
+ } completion;
+ };
+};
+
/* Rx/Tx descriptor */
struct xdp_desc {
__u64 addr;
@@ -113,9 +150,16 @@ struct xdp_desc {
__u32 options;
};
-/* Flag indicating packet constitutes of multiple buffers*/
+/* UMEM descriptor is __u64 */
+
+/* Flag indicating that the packet continues with the buffer pointed out by the
+ * next frame in the ring. The end of the packet is signalled by setting this
+ * bit to zero. For single buffer packets, every descriptor has 'options' set
+ * to 0 and this maintains backward compatibility.
+ */
#define XDP_PKT_CONTD (1 << 0)
-/* UMEM descriptor is __u64 */
+/* TX packet carries valid metadata. */
+#define XDP_TX_METADATA (1 << 1)
#endif /* _LINUX_IF_XDP_H */
diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h
index 2943a151d4f1..48d5477a668c 100644
--- a/tools/include/uapi/linux/netdev.h
+++ b/tools/include/uapi/linux/netdev.h
@@ -53,12 +53,28 @@ enum netdev_xdp_rx_metadata {
NETDEV_XDP_RX_METADATA_MASK = 3,
};
+/**
+ * enum netdev_xsk_flags
+ * @NETDEV_XSK_FLAGS_TX_TIMESTAMP: HW timestamping egress packets is supported
+ * by the driver.
+ * @NETDEV_XSK_FLAGS_TX_CHECKSUM: L3 checksum HW offload is supported by the
+ * driver.
+ */
+enum netdev_xsk_flags {
+ NETDEV_XSK_FLAGS_TX_TIMESTAMP = 1,
+ NETDEV_XSK_FLAGS_TX_CHECKSUM = 2,
+
+ /* private: */
+ NETDEV_XSK_FLAGS_MASK = 3,
+};
+
enum {
NETDEV_A_DEV_IFINDEX = 1,
NETDEV_A_DEV_PAD,
NETDEV_A_DEV_XDP_FEATURES,
NETDEV_A_DEV_XDP_ZC_MAX_SEGS,
NETDEV_A_DEV_XDP_RX_METADATA_FEATURES,
+ NETDEV_A_DEV_XSK_FEATURES,
__NETDEV_A_DEV_MAX,
NETDEV_A_DEV_MAX = (__NETDEV_A_DEV_MAX - 1)
diff --git a/tools/net/ynl/generated/netdev-user.c b/tools/net/ynl/generated/netdev-user.c
index b5ffe8cd1144..6283d87dad37 100644
--- a/tools/net/ynl/generated/netdev-user.c
+++ b/tools/net/ynl/generated/netdev-user.c
@@ -58,6 +58,19 @@ const char *netdev_xdp_rx_metadata_str(enum netdev_xdp_rx_metadata value)
return netdev_xdp_rx_metadata_strmap[value];
}
+static const char * const netdev_xsk_flags_strmap[] = {
+ [0] = "tx-timestamp",
+ [1] = "tx-checksum",
+};
+
+const char *netdev_xsk_flags_str(enum netdev_xsk_flags value)
+{
+ value = ffs(value) - 1;
+ if (value < 0 || value >= (int)MNL_ARRAY_SIZE(netdev_xsk_flags_strmap))
+ return NULL;
+ return netdev_xsk_flags_strmap[value];
+}
+
/* Policies */
struct ynl_policy_attr netdev_dev_policy[NETDEV_A_DEV_MAX + 1] = {
[NETDEV_A_DEV_IFINDEX] = { .name = "ifindex", .type = YNL_PT_U32, },
@@ -65,6 +78,7 @@ struct ynl_policy_attr netdev_dev_policy[NETDEV_A_DEV_MAX + 1] = {
[NETDEV_A_DEV_XDP_FEATURES] = { .name = "xdp-features", .type = YNL_PT_U64, },
[NETDEV_A_DEV_XDP_ZC_MAX_SEGS] = { .name = "xdp-zc-max-segs", .type = YNL_PT_U32, },
[NETDEV_A_DEV_XDP_RX_METADATA_FEATURES] = { .name = "xdp-rx-metadata-features", .type = YNL_PT_U64, },
+ [NETDEV_A_DEV_XSK_FEATURES] = { .name = "xsk-features", .type = YNL_PT_U64, },
};
struct ynl_policy_nest netdev_dev_nest = {
@@ -116,6 +130,11 @@ int netdev_dev_get_rsp_parse(const struct nlmsghdr *nlh, void *data)
return MNL_CB_ERROR;
dst->_present.xdp_rx_metadata_features = 1;
dst->xdp_rx_metadata_features = mnl_attr_get_u64(attr);
+ } else if (type == NETDEV_A_DEV_XSK_FEATURES) {
+ if (ynl_attr_validate(yarg, attr))
+ return MNL_CB_ERROR;
+ dst->_present.xsk_features = 1;
+ dst->xsk_features = mnl_attr_get_u64(attr);
}
}
diff --git a/tools/net/ynl/generated/netdev-user.h b/tools/net/ynl/generated/netdev-user.h
index b4351ff34595..bdbd1766ce46 100644
--- a/tools/net/ynl/generated/netdev-user.h
+++ b/tools/net/ynl/generated/netdev-user.h
@@ -19,6 +19,7 @@ extern const struct ynl_family ynl_netdev_family;
const char *netdev_op_str(int op);
const char *netdev_xdp_act_str(enum netdev_xdp_act value);
const char *netdev_xdp_rx_metadata_str(enum netdev_xdp_rx_metadata value);
+const char *netdev_xsk_flags_str(enum netdev_xsk_flags value);
/* Common nested types */
/* ============== NETDEV_CMD_DEV_GET ============== */
@@ -50,12 +51,14 @@ struct netdev_dev_get_rsp {
__u32 xdp_features:1;
__u32 xdp_zc_max_segs:1;
__u32 xdp_rx_metadata_features:1;
+ __u32 xsk_features:1;
} _present;
__u32 ifindex;
__u64 xdp_features;
__u32 xdp_zc_max_segs;
__u64 xdp_rx_metadata_features;
+ __u64 xsk_features;
};
void netdev_dev_get_rsp_free(struct netdev_dev_get_rsp *rsp);
--
2.42.0.582.g8ccd20d70d-goog
* [xdp-hints] [PATCH bpf-next v3 03/10] tools: ynl: print xsk-features from the sample
2023-10-03 20:05 [xdp-hints] [PATCH bpf-next v3 00/10] xsk: TX metadata Stanislav Fomichev
2023-10-03 20:05 ` [xdp-hints] [PATCH bpf-next v3 01/10] xsk: Support tx_metadata_len Stanislav Fomichev
2023-10-03 20:05 ` [xdp-hints] [PATCH bpf-next v3 02/10] xsk: add TX timestamp and TX checksum offload support Stanislav Fomichev
@ 2023-10-03 20:05 ` Stanislav Fomichev
2023-10-03 20:05 ` [xdp-hints] [PATCH bpf-next v3 04/10] net/mlx5e: Implement AF_XDP TX timestamp and checksum offload Stanislav Fomichev
` (6 subsequent siblings)
9 siblings, 0 replies; 25+ messages in thread
From: Stanislav Fomichev @ 2023-10-03 20:05 UTC (permalink / raw)
To: bpf
Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
kpsingh, sdf, haoluo, jolsa, kuba, toke, willemb, dsahern,
magnus.karlsson, bjorn, maciej.fijalkowski, hawk,
yoong.siang.song, netdev, xdp-hints
Regenerate the userspace specs and print xsk-features bitmask.
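For illustration only, the bitmask decoding done by the sample can be sketched as a standalone helper. The flag names and bit positions come from enum netdev_xsk_flags in this series; xsk_flag_names and xsk_flags_to_str are hypothetical names and not part of the generated YNL code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bit positions mirror enum netdev_xsk_flags from this series. */
static const char *const xsk_flag_names[] = {
	"tx-timestamp",	/* bit 0 */
	"tx-checksum",	/* bit 1 */
};

/*
 * Hypothetical helper: write the space-separated names of all known
 * flags set in mask into buf and return how many were found.
 */
static int xsk_flags_to_str(uint64_t mask, char *buf, size_t len)
{
	size_t n = sizeof(xsk_flag_names) / sizeof(xsk_flag_names[0]);
	int found = 0;

	assert(buf != NULL && len > 0);
	buf[0] = '\0';

	for (size_t i = 0; i < n; i++) {
		if (!(mask & (UINT64_C(1) << i)))
			continue;
		if (found)
			strncat(buf, " ", len - strlen(buf) - 1);
		strncat(buf, xsk_flag_names[i], len - strlen(buf) - 1);
		found++;
	}

	return found;
}
```

Iterating over the table of known names rather than over the mask keeps the loop bounded even when a newer kernel reports bits this table does not know about.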
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
tools/net/ynl/samples/netdev.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/tools/net/ynl/samples/netdev.c b/tools/net/ynl/samples/netdev.c
index b828225daad0..da7c2848f773 100644
--- a/tools/net/ynl/samples/netdev.c
+++ b/tools/net/ynl/samples/netdev.c
@@ -44,6 +44,12 @@ static void netdev_print_device(struct netdev_dev_get_rsp *d, unsigned int op)
printf(" %s", netdev_xdp_rx_metadata_str(1 << i));
}
+ printf(" xsk-features (%llx):", d->xsk_features);
+ for (int i = 0; d->xsk_features >= 1U << i; i++) {
+ if (d->xsk_features & (1U << i))
+ printf(" %s", netdev_xsk_flags_str(1 << i));
+ }
+
printf(" xdp-zc-max-segs=%u", d->xdp_zc_max_segs);
name = netdev_op_str(op);
--
2.42.0.582.g8ccd20d70d-goog
* [xdp-hints] [PATCH bpf-next v3 04/10] net/mlx5e: Implement AF_XDP TX timestamp and checksum offload
2023-10-03 20:05 [xdp-hints] [PATCH bpf-next v3 00/10] xsk: TX metadata Stanislav Fomichev
` (2 preceding siblings ...)
2023-10-03 20:05 ` [xdp-hints] [PATCH bpf-next v3 03/10] tools: ynl: print xsk-features from the sample Stanislav Fomichev
@ 2023-10-03 20:05 ` Stanislav Fomichev
2023-10-04 23:47 ` [xdp-hints] " kernel test robot
2023-10-03 20:05 ` [xdp-hints] [PATCH bpf-next v3 05/10] net: stmmac: Add Tx HWTS support to XDP ZC Stanislav Fomichev
` (5 subsequent siblings)
9 siblings, 1 reply; 25+ messages in thread
From: Stanislav Fomichev @ 2023-10-03 20:05 UTC (permalink / raw)
To: bpf
Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
kpsingh, sdf, haoluo, jolsa, kuba, toke, willemb, dsahern,
magnus.karlsson, bjorn, maciej.fijalkowski, hawk,
yoong.siang.song, netdev, xdp-hints, Saeed Mahameed
TX timestamp:
- requires passing clock, not sure I'm passing the correct one (from
cq->mdev), but the timestamp value looks convincing
TX checksum:
- looks like device does packet parsing (and doesn't accept custom
start/offset), so I'm ignoring user offsets
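As a rough userspace model of how a driver can ignore the user offsets, the sketch below dispatches a metadata request to per-driver callbacks. The callback signatures follow the ops used in this patch. The dispatch logic in tx_meta_request_model is an assumption inferred from the call sites, not a copy of the core helper, and every *_model and toy_* name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define XDP_TX_METADATA_TIMESTAMP (1 << 0)
#define XDP_TX_METADATA_CHECKSUM  (1 << 1)

/* Simplified stand-in for the request half of struct xsk_tx_metadata. */
struct tx_meta_model {
	uint32_t flags;
	uint16_t csum_start;
	uint16_t csum_offset;
};

/* Callback table shaped like the driver ops in this series. */
struct tx_meta_ops_model {
	void (*request_timestamp)(void *priv);
	void (*request_checksum)(uint16_t csum_start, uint16_t csum_offset,
				 void *priv);
};

/*
 * Assumed dispatch: call a callback only when the packet carries
 * metadata, the matching flag is set and the driver provides the hook.
 */
static void tx_meta_request_model(const struct tx_meta_model *meta,
				  const struct tx_meta_ops_model *ops,
				  void *priv)
{
	assert(ops != NULL);

	if (!meta)
		return;
	if (ops->request_timestamp &&
	    (meta->flags & XDP_TX_METADATA_TIMESTAMP))
		ops->request_timestamp(priv);
	if (ops->request_checksum &&
	    (meta->flags & XDP_TX_METADATA_CHECKSUM))
		ops->request_checksum(meta->csum_start, meta->csum_offset,
				      priv);
}

/* Toy descriptor for a device that parses the packet by itself. */
struct toy_desc {
	unsigned int csum_requested;
};

/* The offsets are accepted but unused, as in the mlx5 callback. */
static void toy_request_checksum(uint16_t csum_start, uint16_t csum_offset,
				 void *priv)
{
	struct toy_desc *desc = priv;

	(void)csum_start;
	(void)csum_offset;
	desc->csum_requested = 1;
}

/* No request_timestamp hook: nothing needs to be armed at submit time. */
static const struct tx_meta_ops_model toy_ops = {
	.request_checksum = toy_request_checksum,
};
```

With toy_ops, a request that sets both flags only reaches the checksum hook, and a NULL metadata pointer is a no-op, which is how the non-AF_XDP transmit paths in this patch pass NULL.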
Cc: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
drivers/net/ethernet/mellanox/mlx5/core/en.h | 4 +-
.../net/ethernet/mellanox/mlx5/core/en/xdp.c | 72 ++++++++++++++++---
.../net/ethernet/mellanox/mlx5/core/en/xdp.h | 11 ++-
.../ethernet/mellanox/mlx5/core/en/xsk/tx.c | 17 ++++-
.../net/ethernet/mellanox/mlx5/core/en_main.c | 1 +
5 files changed, 89 insertions(+), 16 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 86f2690c5e01..f64ceedcc665 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -476,10 +476,12 @@ struct mlx5e_xdp_info_fifo {
struct mlx5e_xdpsq;
struct mlx5e_xmit_data;
+struct xsk_tx_metadata;
typedef int (*mlx5e_fp_xmit_xdp_frame_check)(struct mlx5e_xdpsq *);
typedef bool (*mlx5e_fp_xmit_xdp_frame)(struct mlx5e_xdpsq *,
struct mlx5e_xmit_data *,
- int);
+ int,
+ struct xsk_tx_metadata *);
struct mlx5e_xdpsq {
/* data path */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
index 12f56d0db0af..b3227b73fc0d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
@@ -103,7 +103,7 @@ mlx5e_xmit_xdp_buff(struct mlx5e_xdpsq *sq, struct mlx5e_rq *rq,
xdptxd->dma_addr = dma_addr;
if (unlikely(!INDIRECT_CALL_2(sq->xmit_xdp_frame, mlx5e_xmit_xdp_frame_mpwqe,
- mlx5e_xmit_xdp_frame, sq, xdptxd, 0)))
+ mlx5e_xmit_xdp_frame, sq, xdptxd, 0, NULL)))
return false;
/* xmit_mode == MLX5E_XDP_XMIT_MODE_FRAME */
@@ -145,7 +145,7 @@ mlx5e_xmit_xdp_buff(struct mlx5e_xdpsq *sq, struct mlx5e_rq *rq,
xdptxd->dma_addr = dma_addr;
if (unlikely(!INDIRECT_CALL_2(sq->xmit_xdp_frame, mlx5e_xmit_xdp_frame_mpwqe,
- mlx5e_xmit_xdp_frame, sq, xdptxd, 0)))
+ mlx5e_xmit_xdp_frame, sq, xdptxd, 0, NULL)))
return false;
/* xmit_mode == MLX5E_XDP_XMIT_MODE_PAGE */
@@ -261,6 +261,37 @@ const struct xdp_metadata_ops mlx5e_xdp_metadata_ops = {
.xmo_rx_hash = mlx5e_xdp_rx_hash,
};
+struct mlx5e_xsk_tx_complete {
+ struct mlx5_cqe64 *cqe;
+ struct mlx5e_cq *cq;
+};
+
+static u64 mlx5e_xsk_fill_timestamp(void *_priv)
+{
+ struct mlx5e_xsk_tx_complete *priv = _priv;
+ u64 ts;
+
+ ts = get_cqe_ts(priv->cqe);
+
+ if (mlx5_is_real_time_rq(priv->cq->mdev) || mlx5_is_real_time_sq(priv->cq->mdev))
+ return mlx5_real_time_cyc2time(&priv->cq->mdev->clock, ts);
+
+ return mlx5_timecounter_cyc2time(&priv->cq->mdev->clock, ts);
+}
+
+static void mlx5e_xsk_request_checksum(u16 csum_start, u16 csum_offset, void *priv)
+{
+ struct mlx5_wqe_eth_seg *eseg = priv;
+
+ /* HW/FW is doing parsing, so offsets are largely ignored. */
+ eseg->cs_flags |= MLX5_ETH_WQE_L3_CSUM | MLX5_ETH_WQE_L4_CSUM;
+}
+
+const struct xsk_tx_metadata_ops mlx5e_xsk_tx_metadata_ops = {
+ .tmo_fill_timestamp = mlx5e_xsk_fill_timestamp,
+ .tmo_request_checksum = mlx5e_xsk_request_checksum,
+};
+
/* returns true if packet was consumed by xdp */
bool mlx5e_xdp_handle(struct mlx5e_rq *rq,
struct bpf_prog *prog, struct mlx5e_xdp_buff *mxbuf)
@@ -398,11 +429,11 @@ INDIRECT_CALLABLE_SCOPE int mlx5e_xmit_xdp_frame_check_mpwqe(struct mlx5e_xdpsq
INDIRECT_CALLABLE_SCOPE bool
mlx5e_xmit_xdp_frame(struct mlx5e_xdpsq *sq, struct mlx5e_xmit_data *xdptxd,
- int check_result);
+ int check_result, struct xsk_tx_metadata *meta);
INDIRECT_CALLABLE_SCOPE bool
mlx5e_xmit_xdp_frame_mpwqe(struct mlx5e_xdpsq *sq, struct mlx5e_xmit_data *xdptxd,
- int check_result)
+ int check_result, struct xsk_tx_metadata *meta)
{
struct mlx5e_tx_mpwqe *session = &sq->mpwqe;
struct mlx5e_xdpsq_stats *stats = sq->stats;
@@ -420,7 +451,7 @@ mlx5e_xmit_xdp_frame_mpwqe(struct mlx5e_xdpsq *sq, struct mlx5e_xmit_data *xdptx
*/
if (unlikely(sq->mpwqe.wqe))
mlx5e_xdp_mpwqe_complete(sq);
- return mlx5e_xmit_xdp_frame(sq, xdptxd, 0);
+ return mlx5e_xmit_xdp_frame(sq, xdptxd, 0, meta);
}
if (!xdptxd->len) {
skb_frag_t *frag = &xdptxdf->sinfo->frags[0];
@@ -450,6 +481,7 @@ mlx5e_xmit_xdp_frame_mpwqe(struct mlx5e_xdpsq *sq, struct mlx5e_xmit_data *xdptx
* and it's safe to complete it at any time.
*/
mlx5e_xdp_mpwqe_session_start(sq);
+ xsk_tx_metadata_request(meta, &mlx5e_xsk_tx_metadata_ops, &session->wqe->eth);
}
mlx5e_xdp_mpwqe_add_dseg(sq, p, stats);
@@ -480,7 +512,7 @@ INDIRECT_CALLABLE_SCOPE int mlx5e_xmit_xdp_frame_check(struct mlx5e_xdpsq *sq)
INDIRECT_CALLABLE_SCOPE bool
mlx5e_xmit_xdp_frame(struct mlx5e_xdpsq *sq, struct mlx5e_xmit_data *xdptxd,
- int check_result)
+ int check_result, struct xsk_tx_metadata *meta)
{
struct mlx5e_xmit_data_frags *xdptxdf =
container_of(xdptxd, struct mlx5e_xmit_data_frags, xd);
@@ -599,6 +631,8 @@ mlx5e_xmit_xdp_frame(struct mlx5e_xdpsq *sq, struct mlx5e_xmit_data *xdptxd,
sq->pc++;
}
+ xsk_tx_metadata_request(meta, &mlx5e_xsk_tx_metadata_ops, eseg);
+
sq->doorbell_cseg = cseg;
stats->xmit++;
@@ -608,7 +642,9 @@ mlx5e_xmit_xdp_frame(struct mlx5e_xdpsq *sq, struct mlx5e_xmit_data *xdptxd,
static void mlx5e_free_xdpsq_desc(struct mlx5e_xdpsq *sq,
struct mlx5e_xdp_wqe_info *wi,
u32 *xsk_frames,
- struct xdp_frame_bulk *bq)
+ struct xdp_frame_bulk *bq,
+ struct mlx5e_cq *cq,
+ struct mlx5_cqe64 *cqe)
{
struct mlx5e_xdp_info_fifo *xdpi_fifo = &sq->db.xdpi_fifo;
u16 i;
@@ -668,10 +704,24 @@ static void mlx5e_free_xdpsq_desc(struct mlx5e_xdpsq *sq,
break;
}
- case MLX5E_XDP_XMIT_MODE_XSK:
+ case MLX5E_XDP_XMIT_MODE_XSK: {
/* AF_XDP send */
+ struct xsk_tx_metadata_compl *compl = NULL;
+ struct mlx5e_xsk_tx_complete priv = {
+ .cqe = cqe,
+ .cq = cq,
+ };
+
+ if (xp_tx_metadata_enabled(sq->xsk_pool)) {
+ xdpi = mlx5e_xdpi_fifo_pop(xdpi_fifo);
+ compl = &xdpi.xsk_meta;
+
+ xsk_tx_metadata_complete(compl, &mlx5e_xsk_tx_metadata_ops, &priv);
+ }
+
(*xsk_frames)++;
break;
+ }
default:
WARN_ON_ONCE(true);
}
@@ -720,7 +770,7 @@ bool mlx5e_poll_xdpsq_cq(struct mlx5e_cq *cq)
sqcc += wi->num_wqebbs;
- mlx5e_free_xdpsq_desc(sq, wi, &xsk_frames, &bq);
+ mlx5e_free_xdpsq_desc(sq, wi, &xsk_frames, &bq, cq, cqe);
} while (!last_wqe);
if (unlikely(get_cqe_opcode(cqe) != MLX5_CQE_REQ)) {
@@ -767,7 +817,7 @@ void mlx5e_free_xdpsq_descs(struct mlx5e_xdpsq *sq)
sq->cc += wi->num_wqebbs;
- mlx5e_free_xdpsq_desc(sq, wi, &xsk_frames, &bq);
+ mlx5e_free_xdpsq_desc(sq, wi, &xsk_frames, &bq, NULL, NULL);
}
xdp_flush_frame_bulk(&bq);
@@ -840,7 +890,7 @@ int mlx5e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
}
ret = INDIRECT_CALL_2(sq->xmit_xdp_frame, mlx5e_xmit_xdp_frame_mpwqe,
- mlx5e_xmit_xdp_frame, sq, xdptxd, 0);
+ mlx5e_xmit_xdp_frame, sq, xdptxd, 0, NULL);
if (unlikely(!ret)) {
int j;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
index ecfe93a479da..e054db1e10f8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
@@ -33,6 +33,7 @@
#define __MLX5_EN_XDP_H__
#include <linux/indirect_call_wrapper.h>
+#include <net/xdp_sock.h>
#include "en.h"
#include "en/txrx.h"
@@ -82,7 +83,7 @@ enum mlx5e_xdp_xmit_mode {
* num, page_1, page_2, ... , page_num.
*
* MLX5E_XDP_XMIT_MODE_XSK:
- * none.
+ * frame.xsk_meta.
*/
#define MLX5E_XDP_FIFO_ENTRIES2DS_MAX_RATIO 4
@@ -97,6 +98,7 @@ union mlx5e_xdp_info {
u8 num;
struct page *page;
} page;
+ struct xsk_tx_metadata_compl xsk_meta;
};
struct mlx5e_xsk_param;
@@ -112,13 +114,16 @@ int mlx5e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
u32 flags);
extern const struct xdp_metadata_ops mlx5e_xdp_metadata_ops;
+extern const struct xsk_tx_metadata_ops mlx5e_xsk_tx_metadata_ops;
INDIRECT_CALLABLE_DECLARE(bool mlx5e_xmit_xdp_frame_mpwqe(struct mlx5e_xdpsq *sq,
struct mlx5e_xmit_data *xdptxd,
- int check_result));
+ int check_result,
+ struct xsk_tx_metadata *meta));
INDIRECT_CALLABLE_DECLARE(bool mlx5e_xmit_xdp_frame(struct mlx5e_xdpsq *sq,
struct mlx5e_xmit_data *xdptxd,
- int check_result));
+ int check_result,
+ struct xsk_tx_metadata *meta));
INDIRECT_CALLABLE_DECLARE(int mlx5e_xmit_xdp_frame_check_mpwqe(struct mlx5e_xdpsq *sq));
INDIRECT_CALLABLE_DECLARE(int mlx5e_xmit_xdp_frame_check(struct mlx5e_xdpsq *sq));
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.c
index 597f319d4770..a59199ed590d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.c
@@ -55,12 +55,16 @@ static void mlx5e_xsk_tx_post_err(struct mlx5e_xdpsq *sq,
nopwqe = mlx5e_post_nop(&sq->wq, sq->sqn, &sq->pc);
mlx5e_xdpi_fifo_push(&sq->db.xdpi_fifo, *xdpi);
+ if (xp_tx_metadata_enabled(sq->xsk_pool))
+ mlx5e_xdpi_fifo_push(&sq->db.xdpi_fifo,
+ (union mlx5e_xdp_info) { .xsk_meta = {} });
sq->doorbell_cseg = &nopwqe->ctrl;
}
bool mlx5e_xsk_tx(struct mlx5e_xdpsq *sq, unsigned int budget)
{
struct xsk_buff_pool *pool = sq->xsk_pool;
+ struct xsk_tx_metadata *meta = NULL;
union mlx5e_xdp_info xdpi;
bool work_done = true;
bool flush = false;
@@ -93,12 +97,13 @@ bool mlx5e_xsk_tx(struct mlx5e_xdpsq *sq, unsigned int budget)
xdptxd.dma_addr = xsk_buff_raw_get_dma(pool, desc.addr);
xdptxd.data = xsk_buff_raw_get_data(pool, desc.addr);
xdptxd.len = desc.len;
+ meta = xsk_buff_get_metadata(pool, desc.addr);
xsk_buff_raw_dma_sync_for_device(pool, xdptxd.dma_addr, xdptxd.len);
ret = INDIRECT_CALL_2(sq->xmit_xdp_frame, mlx5e_xmit_xdp_frame_mpwqe,
mlx5e_xmit_xdp_frame, sq, &xdptxd,
- check_result);
+ check_result, meta);
if (unlikely(!ret)) {
if (sq->mpwqe.wqe)
mlx5e_xdp_mpwqe_complete(sq);
@@ -106,6 +111,16 @@ bool mlx5e_xsk_tx(struct mlx5e_xdpsq *sq, unsigned int budget)
mlx5e_xsk_tx_post_err(sq, &xdpi);
} else {
mlx5e_xdpi_fifo_push(&sq->db.xdpi_fifo, xdpi);
+ if (xp_tx_metadata_enabled(sq->xsk_pool)) {
+ struct xsk_tx_metadata_compl compl;
+
+ xsk_tx_metadata_to_compl(meta, &compl);
+ XSK_TX_COMPL_FITS(void *);
+
+ mlx5e_xdpi_fifo_push(&sq->db.xdpi_fifo,
+ (union mlx5e_xdp_info)
+ { .xsk_meta = compl });
+ }
}
flush = true;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index a2ae791538ed..61109d2f4a1c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -5096,6 +5096,7 @@ static void mlx5e_build_nic_netdev(struct net_device *netdev)
netdev->netdev_ops = &mlx5e_netdev_ops;
netdev->xdp_metadata_ops = &mlx5e_xdp_metadata_ops;
+ netdev->xsk_tx_metadata_ops = &mlx5e_xsk_tx_metadata_ops;
mlx5e_dcbnl_build_netdev(netdev);
--
2.42.0.582.g8ccd20d70d-goog
* [xdp-hints] [PATCH bpf-next v3 05/10] net: stmmac: Add Tx HWTS support to XDP ZC
2023-10-03 20:05 [xdp-hints] [PATCH bpf-next v3 00/10] xsk: TX metadata Stanislav Fomichev
` (3 preceding siblings ...)
2023-10-03 20:05 ` [xdp-hints] [PATCH bpf-next v3 04/10] net/mlx5e: Implement AF_XDP TX timestamp and checksum offload Stanislav Fomichev
@ 2023-10-03 20:05 ` Stanislav Fomichev
2023-10-04 23:05 ` [xdp-hints] " kernel test robot
2023-10-06 4:38 ` kernel test robot
2023-10-03 20:05 ` [xdp-hints] [PATCH bpf-next v3 06/10] selftests/xsk: Support tx_metadata_len Stanislav Fomichev
` (4 subsequent siblings)
9 siblings, 2 replies; 25+ messages in thread
From: Stanislav Fomichev @ 2023-10-03 20:05 UTC (permalink / raw)
To: bpf
Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
kpsingh, sdf, haoluo, jolsa, kuba, toke, willemb, dsahern,
magnus.karlsson, bjorn, maciej.fijalkowski, hawk,
yoong.siang.song, netdev, xdp-hints
From: Song Yoong Siang <yoong.siang.song@intel.com>
This patch adds transmit hardware timestamp support to XDP zero copy
via the XDP Tx metadata framework.
This patchset was tested with tools/testing/selftests/bpf/xdp_hw_metadata
on an Intel Tiger Lake platform. The test steps and results are below.
Command on DUT:
sudo ./xdp_hw_metadata <interface name>
sudo hwstamp_ctl -i <interface name> -t 1 -r 1
Command on Link Partner:
echo -n xdp | nc -u -q1 <destination IPv4 addr> 9091
Result:
xsk_ring_cons__peek: 1
0x562e3313b6d0: rx_desc[3]->addr=8e100 addr=8e100 comp_addr=8e100
No rx_hash err=-95
rx_timestamp: 1677763849292380229 (sec:1677763849.2924)
XDP RX-time: 1677763849292641940 (sec:1677763849.2926)
delta sec:0.0003 (261.711 usec)
AF_XDP time: 1677763849292666175 (sec:1677763849.2927)
delta sec:0.0000 (24.235 usec)
0x562e3313b6d0: ping-pong with csum=561c (want 08af)
csum_start=34 csum_offset=6
0x562e3313b6d0: complete tx idx=3 addr=3008
0x562e3313b6d0: tx_timestamp: 1677763849295700005 (sec:1677763849.2957)
0x562e3313b6d0: complete rx idx=131 addr=8e100
Additionally, to confirm that rx_timestamp and tx_timestamp are taken
from the PTP Hardware Clock (PHC), we set the PHC to a specific value
using tools/testing/selftests/ptp/testptp. The test steps and results
are below.
Command to set PHC to a specific value:
sudo ./testptp -d /dev/ptp2 -T 123000000
Result:
xsk_ring_cons__peek: 1
0x562e3313b6d0: rx_desc[7]->addr=9e100 addr=9e100 comp_addr=9e100
No rx_hash err=-95
rx_timestamp: 123000002731730589 (sec:123000002.7317)
XDP RX-time: 1677763869396644361 (sec:1677763869.3966)
delta sec:1554763866.6649 (1554763866664913.750 usec)
AF_XDP time: 1677763869396671376 (sec:1677763869.3967)
delta sec:0.0000 (27.015 usec)
0x562e3313b6d0: ping-pong with csum=561c (want d1bf)
csum_start=34 csum_offset=6
0x562e3313b6d0: complete tx idx=7 addr=7008
0x562e3313b6d0: tx_timestamp: 123000002735048790 (sec:123000002.7350)
0x562e3313b6d0: complete rx idx=135 addr=9e100
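As a sanity check on the two runs above, the gap between rx_timestamp and tx_timestamp can be computed directly. The constants below are copied from the quoted output; ts_delta_ns and the run* names are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Timestamps (in ns) copied from the two runs quoted above. */
static const uint64_t run1_rx = 1677763849292380229ULL;
static const uint64_t run1_tx = 1677763849295700005ULL;
static const uint64_t run2_rx = 123000002731730589ULL;
static const uint64_t run2_tx = 123000002735048790ULL;

/* Difference between two nanosecond timestamps read from the same clock. */
static int64_t ts_delta_ns(uint64_t later, uint64_t earlier)
{
	assert(later >= earlier);
	return (int64_t)(later - earlier);
}
```

ts_delta_ns(run1_tx, run1_rx) is 3319776 ns and ts_delta_ns(run2_tx, run2_rx) is 3318201 ns, so the RX-to-TX gap stays at roughly 3.3 ms after the PHC was moved. This is consistent with both timestamps being read from the same clock.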
Signed-off-by: Song Yoong Siang <yoong.siang.song@intel.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
drivers/net/ethernet/stmicro/stmmac/stmmac.h | 12 ++++
.../net/ethernet/stmicro/stmmac/stmmac_main.c | 63 ++++++++++++++++++-
2 files changed, 74 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac.h b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
index cd7a9768de5f..686c94c2e8a7 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac.h
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
@@ -51,6 +51,7 @@ struct stmmac_tx_info {
bool last_segment;
bool is_jumbo;
enum stmmac_txbuf_type buf_type;
+ struct xsk_tx_metadata_compl xsk_meta;
};
#define STMMAC_TBS_AVAIL BIT(0)
@@ -100,6 +101,17 @@ struct stmmac_xdp_buff {
struct dma_desc *ndesc;
};
+struct stmmac_metadata_request {
+ struct stmmac_priv *priv;
+ struct dma_desc *tx_desc;
+ bool *set_ic;
+};
+
+struct stmmac_xsk_tx_complete {
+ struct stmmac_priv *priv;
+ struct dma_desc *desc;
+};
+
struct stmmac_rx_queue {
u32 rx_count_frames;
u32 queue_index;
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 81b6f3ecdf92..697712dd4024 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -2422,6 +2422,46 @@ static void stmmac_dma_operation_mode(struct stmmac_priv *priv)
}
}
+static void stmmac_xsk_request_timestamp(void *_priv)
+{
+ struct stmmac_metadata_request *meta_req = _priv;
+
+ stmmac_enable_tx_timestamp(meta_req->priv, meta_req->tx_desc);
+ *meta_req->set_ic = true;
+}
+
+static u64 stmmac_xsk_fill_timestamp(void *_priv)
+{
+ struct stmmac_xsk_tx_complete *tx_compl = _priv;
+ struct stmmac_priv *priv = tx_compl->priv;
+ struct dma_desc *desc = tx_compl->desc;
+ bool found = false;
+ u64 ns = 0;
+
+ if (!priv->hwts_tx_en)
+ return 0;
+
+ /* check tx tstamp status */
+ if (stmmac_get_tx_timestamp_status(priv, desc)) {
+ stmmac_get_timestamp(priv, desc, priv->adv_ts, &ns);
+ found = true;
+ } else if (!stmmac_get_mac_tx_timestamp(priv, priv->hw, &ns)) {
+ found = true;
+ }
+
+ if (found) {
+ ns -= priv->plat->cdc_error_adj;
+ return ns_to_ktime(ns);
+ }
+
+ return 0;
+}
+
+static const struct xsk_tx_metadata_ops stmmac_xsk_tx_metadata_ops = {
+ .tmo_request_timestamp = stmmac_xsk_request_timestamp,
+ .tmo_fill_timestamp = stmmac_xsk_fill_timestamp,
+};
+
static bool stmmac_xdp_xmit_zc(struct stmmac_priv *priv, u32 queue, u32 budget)
{
struct netdev_queue *nq = netdev_get_tx_queue(priv->dev, queue);
@@ -2441,6 +2481,8 @@ static bool stmmac_xdp_xmit_zc(struct stmmac_priv *priv, u32 queue, u32 budget)
budget = min(budget, stmmac_tx_avail(priv, queue));
while (budget-- > 0) {
+ struct stmmac_metadata_request meta_req;
+ struct xsk_tx_metadata *meta = NULL;
dma_addr_t dma_addr;
bool set_ic;
@@ -2464,6 +2506,7 @@ static bool stmmac_xdp_xmit_zc(struct stmmac_priv *priv, u32 queue, u32 budget)
tx_desc = tx_q->dma_tx + entry;
dma_addr = xsk_buff_raw_get_dma(pool, xdp_desc.addr);
+ meta = xsk_buff_get_metadata(pool, xdp_desc.addr);
xsk_buff_raw_dma_sync_for_device(pool, dma_addr, xdp_desc.len);
tx_q->tx_skbuff_dma[entry].buf_type = STMMAC_TXBUF_T_XSK_TX;
@@ -2491,6 +2534,11 @@ static bool stmmac_xdp_xmit_zc(struct stmmac_priv *priv, u32 queue, u32 budget)
else
set_ic = false;
+ meta_req.priv = priv;
+ meta_req.tx_desc = tx_desc;
+ meta_req.set_ic = &set_ic;
+ xsk_tx_metadata_request(meta, &stmmac_xsk_tx_metadata_ops, &meta_req);
+
if (set_ic) {
tx_q->tx_count_frames = 0;
stmmac_set_tx_ic(priv, tx_desc);
@@ -2503,6 +2551,8 @@ static bool stmmac_xdp_xmit_zc(struct stmmac_priv *priv, u32 queue, u32 budget)
stmmac_enable_dma_transmission(priv, priv->ioaddr);
+ xsk_tx_metadata_to_compl(meta, &tx_q->tx_skbuff_dma[entry].xsk_meta);
+
tx_q->cur_tx = STMMAC_GET_ENTRY(tx_q->cur_tx, priv->dma_conf.dma_tx_size);
entry = tx_q->cur_tx;
}
@@ -2608,8 +2658,18 @@ static int stmmac_tx_clean(struct stmmac_priv *priv, int budget, u32 queue)
} else {
tx_packets++;
}
- if (skb)
+ if (skb) {
stmmac_get_tx_hwtstamp(priv, p, skb);
+ } else {
+ struct stmmac_xsk_tx_complete tx_compl = {
+ .priv = priv,
+ .desc = p,
+ };
+
+ xsk_tx_metadata_complete(&tx_q->tx_skbuff_dma[entry].xsk_meta,
+ &stmmac_xsk_tx_metadata_ops,
+ &tx_compl);
+ }
}
if (likely(tx_q->tx_skbuff_dma[entry].buf &&
@@ -7444,6 +7504,7 @@ int stmmac_dvr_probe(struct device *device,
ndev->netdev_ops = &stmmac_netdev_ops;
ndev->xdp_metadata_ops = &stmmac_xdp_metadata_ops;
+ ndev->xsk_tx_metadata_ops = &stmmac_xsk_tx_metadata_ops;
ndev->hw_features = NETIF_F_SG | NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM |
NETIF_F_RXCSUM;
--
2.42.0.582.g8ccd20d70d-goog
* [xdp-hints] [PATCH bpf-next v3 06/10] selftests/xsk: Support tx_metadata_len
2023-10-03 20:05 [xdp-hints] [PATCH bpf-next v3 00/10] xsk: TX metadata Stanislav Fomichev
` (4 preceding siblings ...)
2023-10-03 20:05 ` [xdp-hints] [PATCH bpf-next v3 05/10] net: stmmac: Add Tx HWTS support to XDP ZC Stanislav Fomichev
@ 2023-10-03 20:05 ` Stanislav Fomichev
2023-10-03 20:05 ` [xdp-hints] [PATCH bpf-next v3 07/10] selftests/bpf: Add csum helpers Stanislav Fomichev
` (3 subsequent siblings)
9 siblings, 0 replies; 25+ messages in thread
From: Stanislav Fomichev @ 2023-10-03 20:05 UTC (permalink / raw)
To: bpf
Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
kpsingh, sdf, haoluo, jolsa, kuba, toke, willemb, dsahern,
magnus.karlsson, bjorn, maciej.fijalkowski, hawk,
yoong.siang.song, netdev, xdp-hints
Add a new config field and propagate it to the umem registration setsockopt.
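A minimal sketch of how an application might use the reserved area, assuming the layout from patch 02 where the metadata sits tx_metadata_len bytes before desc->addr. The structures and flags are local copies of the uapi proposed in this series so that the sketch builds without patched headers. prepare_tx_csum_metadata is a hypothetical helper, and its minimum-length check is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local copies of the definitions proposed in this series. */
#define XDP_TX_METADATA_CHECKSUM (1 << 1)	/* xsk_tx_metadata.flags */
#define XDP_TX_METADATA          (1 << 1)	/* xdp_desc.options */

struct xsk_tx_metadata {
	union {
		struct {
			uint32_t flags;
			uint16_t csum_start;
			uint16_t csum_offset;
		};
		struct {
			uint64_t tx_timestamp;
		} completion;
	};
};

struct xdp_desc {
	uint64_t addr;
	uint32_t len;
	uint32_t options;
};

/*
 * Hypothetical helper: request checksum offload for the packet that
 * desc describes. Returns 0 on success and -1 if the request would be
 * rejected.
 */
static int prepare_tx_csum_metadata(uint8_t *umem, struct xdp_desc *desc,
				    uint32_t tx_metadata_len,
				    uint16_t csum_start, uint16_t csum_offset)
{
	struct xsk_tx_metadata meta;

	assert(umem != NULL && desc != NULL);

	/* Assumption: the reserved area must hold the whole structure. */
	if (tx_metadata_len < sizeof(meta) || desc->addr < tx_metadata_len)
		return -1;

	/* Same bound that the copy-mode path in patch 02 enforces. */
	if ((size_t)csum_start + csum_offset + sizeof(uint16_t) > desc->len)
		return -1;

	memset(&meta, 0, sizeof(meta));
	meta.flags = XDP_TX_METADATA_CHECKSUM;
	meta.csum_start = csum_start;
	meta.csum_offset = csum_offset;

	/* The metadata starts tx_metadata_len bytes before the packet. */
	memcpy(umem + desc->addr - tx_metadata_len, &meta, sizeof(meta));
	desc->options |= XDP_TX_METADATA;

	return 0;
}
```

For a UDP/IPv4 frame without VLAN tags, csum_start would be 34 (14 bytes of Ethernet plus 20 bytes of IPv4) and csum_offset 6, which matches the values printed by xdp_hw_metadata in the stmmac test log.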
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
tools/testing/selftests/bpf/xsk.c | 3 +++
tools/testing/selftests/bpf/xsk.h | 1 +
2 files changed, 4 insertions(+)
diff --git a/tools/testing/selftests/bpf/xsk.c b/tools/testing/selftests/bpf/xsk.c
index d9fb2b730a2c..24f5313dbfde 100644
--- a/tools/testing/selftests/bpf/xsk.c
+++ b/tools/testing/selftests/bpf/xsk.c
@@ -115,6 +115,7 @@ static void xsk_set_umem_config(struct xsk_umem_config *cfg,
cfg->frame_size = XSK_UMEM__DEFAULT_FRAME_SIZE;
cfg->frame_headroom = XSK_UMEM__DEFAULT_FRAME_HEADROOM;
cfg->flags = XSK_UMEM__DEFAULT_FLAGS;
+ cfg->tx_metadata_len = 0;
return;
}
@@ -123,6 +124,7 @@ static void xsk_set_umem_config(struct xsk_umem_config *cfg,
cfg->frame_size = usr_cfg->frame_size;
cfg->frame_headroom = usr_cfg->frame_headroom;
cfg->flags = usr_cfg->flags;
+ cfg->tx_metadata_len = usr_cfg->tx_metadata_len;
}
static int xsk_set_xdp_socket_config(struct xsk_socket_config *cfg,
@@ -252,6 +254,7 @@ int xsk_umem__create(struct xsk_umem **umem_ptr, void *umem_area,
mr.chunk_size = umem->config.frame_size;
mr.headroom = umem->config.frame_headroom;
mr.flags = umem->config.flags;
+ mr.tx_metadata_len = umem->config.tx_metadata_len;
err = setsockopt(umem->fd, SOL_XDP, XDP_UMEM_REG, &mr, sizeof(mr));
if (err) {
diff --git a/tools/testing/selftests/bpf/xsk.h b/tools/testing/selftests/bpf/xsk.h
index d93200fdaa8d..bff8e50d7532 100644
--- a/tools/testing/selftests/bpf/xsk.h
+++ b/tools/testing/selftests/bpf/xsk.h
@@ -200,6 +200,7 @@ struct xsk_umem_config {
__u32 frame_size;
__u32 frame_headroom;
__u32 flags;
+ __u32 tx_metadata_len;
};
int xsk_attach_xdp_program(struct bpf_program *prog, int ifindex, u32 xdp_flags);
--
2.42.0.582.g8ccd20d70d-goog
* [xdp-hints] [PATCH bpf-next v3 07/10] selftests/bpf: Add csum helpers
2023-10-03 20:05 [xdp-hints] [PATCH bpf-next v3 00/10] xsk: TX metadata Stanislav Fomichev
` (5 preceding siblings ...)
2023-10-03 20:05 ` [xdp-hints] [PATCH bpf-next v3 06/10] selftests/xsk: Support tx_metadata_len Stanislav Fomichev
@ 2023-10-03 20:05 ` Stanislav Fomichev
2023-10-03 20:05 ` [xdp-hints] [PATCH bpf-next v3 08/10] selftests/bpf: Add TX side to xdp_metadata Stanislav Fomichev
` (2 subsequent siblings)
9 siblings, 0 replies; 25+ messages in thread
From: Stanislav Fomichev @ 2023-10-03 20:05 UTC (permalink / raw)
To: bpf
Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
kpsingh, sdf, haoluo, jolsa, kuba, toke, willemb, dsahern,
magnus.karlsson, bjorn, maciej.fijalkowski, hawk,
yoong.siang.song, netdev, xdp-hints
Checksum helpers will be used to calculate pseudo-header checksum in
AF_XDP metadata selftests.
The helpers are mirroring existing kernel ones:
- csum_tcpudp_magic : IPv4 pseudo header csum
- csum_ipv6_magic : IPv6 pseudo header csum
- csum_fold : fold csum and do one's complement
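As an illustration of the folding step only, here is a minimal standalone sketch. The name fold16() and the <stdint.h> types are assumptions made so that it builds outside the selftests tree; the arithmetic is the same as in the csum_fold() helper added below.

```c
#include <assert.h>
#include <stdint.h>

/* Fold a 32-bit one's-complement accumulator down to 16 bits and invert it.
 * fold16() is a hypothetical stand-in for csum_fold() from this patch.
 */
static inline uint16_t fold16(uint32_t csum)
{
	csum = (csum & 0xffff) + (csum >> 16);	/* add the carries back in */
	csum = (csum & 0xffff) + (csum >> 16);	/* the first add may carry again */

	return (uint16_t)~csum;
}

/* A few hand-checked values. */
static void fold16_selfcheck(void)
{
	assert(fold16(0x00000000) == 0xffff);
	assert(fold16(0x12345678) == 0x9753);	/* 0x1234 + 0x5678 = 0x68ac */
	assert(fold16(0xffffffff) == 0x0000);	/* needs both folding passes */
}
```

The second pass matters for inputs such as 0xffffffff, where the first addition yields 0x1fffe and leaves one more carry to fold.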
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
tools/testing/selftests/bpf/network_helpers.h | 43 +++++++++++++++++++
1 file changed, 43 insertions(+)
diff --git a/tools/testing/selftests/bpf/network_helpers.h b/tools/testing/selftests/bpf/network_helpers.h
index 5eccc67d1a99..654a854c9fb2 100644
--- a/tools/testing/selftests/bpf/network_helpers.h
+++ b/tools/testing/selftests/bpf/network_helpers.h
@@ -70,4 +70,47 @@ struct nstoken;
*/
struct nstoken *open_netns(const char *name);
void close_netns(struct nstoken *token);
+
+static inline __u16 csum_fold(__u32 csum)
+{
+ csum = (csum & 0xffff) + (csum >> 16);
+ csum = (csum & 0xffff) + (csum >> 16);
+
+ return (__u16)~csum;
+}
+
+static inline __sum16 csum_tcpudp_magic(__be32 saddr, __be32 daddr,
+ __u32 len, __u8 proto,
+ __wsum csum)
+{
+ __u64 s = csum;
+
+ s += (__u32)saddr;
+ s += (__u32)daddr;
+ s += htons(proto + len);
+ s = (s & 0xffffffff) + (s >> 32);
+ s = (s & 0xffffffff) + (s >> 32);
+
+ return csum_fold((__u32)s);
+}
+
+static inline __sum16 csum_ipv6_magic(const struct in6_addr *saddr,
+ const struct in6_addr *daddr,
+ __u32 len, __u8 proto,
+ __wsum csum)
+{
+ __u64 s = csum;
+ int i;
+
+ for (i = 0; i < 4; i++)
+ s += (__u32)saddr->s6_addr32[i];
+ for (i = 0; i < 4; i++)
+ s += (__u32)daddr->s6_addr32[i];
+ s += htons(proto + len);
+ s = (s & 0xffffffff) + (s >> 32);
+ s = (s & 0xffffffff) + (s >> 32);
+
+ return csum_fold((__u32)s);
+}
+
#endif
--
2.42.0.582.g8ccd20d70d-goog
^ permalink raw reply [flat|nested] 25+ messages in thread
* [xdp-hints] [PATCH bpf-next v3 08/10] selftests/bpf: Add TX side to xdp_metadata
2023-10-03 20:05 [xdp-hints] [PATCH bpf-next v3 00/10] xsk: TX metadata Stanislav Fomichev
` (6 preceding siblings ...)
2023-10-03 20:05 ` [xdp-hints] [PATCH bpf-next v3 07/10] selftests/bpf: Add csum helpers Stanislav Fomichev
@ 2023-10-03 20:05 ` Stanislav Fomichev
2023-10-03 20:05 ` [xdp-hints] [PATCH bpf-next v3 09/10] selftests/bpf: Add TX side to xdp_hw_metadata Stanislav Fomichev
2023-10-03 20:05 ` [xdp-hints] [PATCH bpf-next v3 10/10] xsk: document tx_metadata_len layout Stanislav Fomichev
9 siblings, 0 replies; 25+ messages in thread
From: Stanislav Fomichev @ 2023-10-03 20:05 UTC (permalink / raw)
To: bpf
Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
kpsingh, sdf, haoluo, jolsa, kuba, toke, willemb, dsahern,
magnus.karlsson, bjorn, maciej.fijalkowski, hawk,
yoong.siang.song, netdev, xdp-hints
Request TX timestamp and make sure it's not empty.
Request TX checksum offload (SW-only) and make sure it's resolved
to the correct one.
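For readers unfamiliar with skb-style csum_start/csum_offset, the following is a rough userspace sketch of what "resolving" the checksum in software means. All function names here are hypothetical and the code is not the kernel implementation; it only assumes, as the test below does, that the checksum field is pre-seeded with the folded, non-inverted pseudo-header sum.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One's-complement sum of big-endian 16-bit words; an odd trailing byte
 * is padded with a zero byte.
 */
static uint32_t sum_be16(const uint8_t *p, size_t len, uint32_t sum)
{
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)p[i] << 8) | p[i + 1];
	if (len & 1)
		sum += (uint32_t)p[len - 1] << 8;
	return sum;
}

static uint16_t fold(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Folded, non-inverted IPv4 pseudo-header sum (the "seed"). */
static uint16_t pseudo_seed(const uint8_t saddr[4], const uint8_t daddr[4],
			    uint8_t proto, uint16_t l4_len)
{
	uint32_t sum = sum_be16(saddr, 4, 0);

	sum = sum_be16(daddr, 4, sum);
	return fold(sum + proto + l4_len);
}

/* Sum everything from csum_start to the end of the frame (seed included)
 * and store the inverted result at csum_start + csum_offset.
 */
static void resolve_csum(uint8_t *frame, size_t len,
			 uint16_t csum_start, uint16_t csum_offset)
{
	uint16_t csum = (uint16_t)~fold(sum_be16(frame + csum_start,
						 len - csum_start, 0));

	frame[csum_start + csum_offset] = csum >> 8;
	frame[csum_start + csum_offset + 1] = csum & 0xff;
}

/* Build eth(14) + ipv4(20) + udp(8) + 4 payload bytes, resolve the checksum
 * and verify it the way a receiver would. Returns 1 on success.
 */
static int resolve_demo(void)
{
	const uint8_t saddr[4] = {10, 0, 0, 1}, daddr[4] = {10, 0, 0, 2};
	uint8_t frame[14 + 20 + 8 + 4] = {0};
	uint8_t *udp = frame + 34;
	uint16_t seed = pseudo_seed(saddr, daddr, 17, 12);
	uint32_t verify;

	udp[0] = 0x23; udp[1] = 0x83;		/* source port 9091 */
	udp[2] = 0x23; udp[3] = 0x84;		/* destination port 9092 */
	udp[4] = 0x00; udp[5] = 0x0c;		/* UDP length: 8 + 4 */
	udp[6] = seed >> 8; udp[7] = seed & 0xff;
	udp[8] = udp[9] = udp[10] = udp[11] = 0xaa;

	resolve_csum(frame, sizeof(frame), 34, 6);

	/* pseudo-header + segment (checksum included) must fold to 0xffff */
	verify = sum_be16(saddr, 4, 0);
	verify = sum_be16(daddr, 4, verify);
	verify = sum_be16(udp, 12, verify + 17 + 12);
	return fold(verify) == 0xffff;
}
```

Here csum_start is 34 (Ethernet plus IPv4 header) and csum_offset is 6, the position of the check field inside the UDP header, which mirrors the values the selftest stores in the metadata.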
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
.../selftests/bpf/prog_tests/xdp_metadata.c | 31 +++++++++++++++++--
1 file changed, 28 insertions(+), 3 deletions(-)
diff --git a/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c b/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
index 626c461fa34d..f0da8fe93276 100644
--- a/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
+++ b/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
@@ -57,6 +57,7 @@ static int open_xsk(int ifindex, struct xsk *xsk)
.comp_size = XSK_RING_CONS__DEFAULT_NUM_DESCS,
.frame_size = XSK_UMEM__DEFAULT_FRAME_SIZE,
.flags = XDP_UMEM_UNALIGNED_CHUNK_FLAG,
+ .tx_metadata_len = sizeof(struct xsk_tx_metadata),
};
__u32 idx;
u64 addr;
@@ -138,6 +139,7 @@ static void ip_csum(struct iphdr *iph)
static int generate_packet(struct xsk *xsk, __u16 dst_port)
{
+ struct xsk_tx_metadata *meta;
struct xdp_desc *tx_desc;
struct udphdr *udph;
struct ethhdr *eth;
@@ -151,10 +153,14 @@ static int generate_packet(struct xsk *xsk, __u16 dst_port)
return -1;
tx_desc = xsk_ring_prod__tx_desc(&xsk->tx, idx);
- tx_desc->addr = idx % (UMEM_NUM / 2) * UMEM_FRAME_SIZE;
+ tx_desc->addr = idx % (UMEM_NUM / 2) * UMEM_FRAME_SIZE + sizeof(struct xsk_tx_metadata);
printf("%p: tx_desc[%u]->addr=%llx\n", xsk, idx, tx_desc->addr);
data = xsk_umem__get_data(xsk->umem_area, tx_desc->addr);
+ meta = data - sizeof(struct xsk_tx_metadata);
+ memset(meta, 0, sizeof(*meta));
+ meta->flags = XDP_TX_METADATA_TIMESTAMP;
+
eth = data;
iph = (void *)(eth + 1);
udph = (void *)(iph + 1);
@@ -178,11 +184,17 @@ static int generate_packet(struct xsk *xsk, __u16 dst_port)
udph->source = htons(AF_XDP_SOURCE_PORT);
udph->dest = htons(dst_port);
udph->len = htons(sizeof(*udph) + UDP_PAYLOAD_BYTES);
- udph->check = 0;
+ udph->check = ~csum_tcpudp_magic(iph->saddr, iph->daddr,
+ ntohs(udph->len), IPPROTO_UDP, 0);
memset(udph + 1, 0xAA, UDP_PAYLOAD_BYTES);
+ meta->flags |= XDP_TX_METADATA_CHECKSUM | XDP_TX_METADATA_CHECKSUM_SW;
+ meta->csum_start = sizeof(*eth) + sizeof(*iph);
+ meta->csum_offset = offsetof(struct udphdr, check);
+
tx_desc->len = sizeof(*eth) + sizeof(*iph) + sizeof(*udph) + UDP_PAYLOAD_BYTES;
+ tx_desc->options |= XDP_TX_METADATA;
xsk_ring_prod__submit(&xsk->tx, 1);
ret = sendto(xsk_socket__fd(xsk->socket), NULL, 0, MSG_DONTWAIT, NULL, 0);
@@ -194,13 +206,21 @@ static int generate_packet(struct xsk *xsk, __u16 dst_port)
static void complete_tx(struct xsk *xsk)
{
- __u32 idx;
+ struct xsk_tx_metadata *meta;
__u64 addr;
+ void *data;
+ __u32 idx;
if (ASSERT_EQ(xsk_ring_cons__peek(&xsk->comp, 1, &idx), 1, "xsk_ring_cons__peek")) {
addr = *xsk_ring_cons__comp_addr(&xsk->comp, idx);
printf("%p: complete tx idx=%u addr=%llx\n", xsk, idx, addr);
+
+ data = xsk_umem__get_data(xsk->umem_area, addr);
+ meta = data - sizeof(struct xsk_tx_metadata);
+
+ ASSERT_NEQ(meta->completion.tx_timestamp, 0, "tx_timestamp");
+
xsk_ring_cons__release(&xsk->comp, 1);
}
}
@@ -221,6 +241,7 @@ static int verify_xsk_metadata(struct xsk *xsk)
const struct xdp_desc *rx_desc;
struct pollfd fds = {};
struct xdp_meta *meta;
+ struct udphdr *udph;
struct ethhdr *eth;
struct iphdr *iph;
__u64 comp_addr;
@@ -257,6 +278,7 @@ static int verify_xsk_metadata(struct xsk *xsk)
ASSERT_EQ(eth->h_proto, htons(ETH_P_IP), "eth->h_proto");
iph = (void *)(eth + 1);
ASSERT_EQ((int)iph->version, 4, "iph->version");
+ udph = (void *)(iph + 1);
/* custom metadata */
@@ -270,6 +292,9 @@ static int verify_xsk_metadata(struct xsk *xsk)
ASSERT_EQ(meta->rx_hash_type, 0, "rx_hash_type");
+ /* checksum offload */
+ ASSERT_EQ(udph->check, 0x1c72, "csum");
+
xsk_ring_cons__release(&xsk->rx, 1);
refill_rx(xsk, comp_addr);
--
2.42.0.582.g8ccd20d70d-goog
^ permalink raw reply [flat|nested] 25+ messages in thread
* [xdp-hints] [PATCH bpf-next v3 09/10] selftests/bpf: Add TX side to xdp_hw_metadata
2023-10-03 20:05 [xdp-hints] [PATCH bpf-next v3 00/10] xsk: TX metadata Stanislav Fomichev
` (7 preceding siblings ...)
2023-10-03 20:05 ` [xdp-hints] [PATCH bpf-next v3 08/10] selftests/bpf: Add TX side to xdp_metadata Stanislav Fomichev
@ 2023-10-03 20:05 ` Stanislav Fomichev
2023-10-09 8:12 ` [xdp-hints] " Jesper Dangaard Brouer
2023-10-03 20:05 ` [xdp-hints] [PATCH bpf-next v3 10/10] xsk: document tx_metadata_len layout Stanislav Fomichev
9 siblings, 1 reply; 25+ messages in thread
From: Stanislav Fomichev @ 2023-10-03 20:05 UTC (permalink / raw)
To: bpf
Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
kpsingh, sdf, haoluo, jolsa, kuba, toke, willemb, dsahern,
magnus.karlsson, bjorn, maciej.fijalkowski, hawk,
yoong.siang.song, netdev, xdp-hints
When we get a packet on port 9091, we swap src/dst and send it out.
At this point we also request the timestamp and checksum offloads.
Checksum offload is verified by looking at the tcpdump on the other side.
The tool prints the pseudo-header csum and the final one it expects.
The final checksum actually matches the incoming packet's checksum
because we only flip the src/dst and don't change the payload.
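The invariance follows from one's-complement addition being commutative: swapping the addresses and the ports only reorders the 16-bit words being summed. The sketch below recomputes the UDP checksum for both directions, using the addresses, ports and payload bytes from the tcpdump output in this message. The helper names are hypothetical and the code is illustrative only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One's-complement sum of big-endian 16-bit words; an odd trailing byte
 * is padded with a zero byte.
 */
static uint32_t sum_be16(const uint8_t *p, size_t len, uint32_t sum)
{
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)p[i] << 8) | p[i + 1];
	if (len & 1)
		sum += (uint32_t)p[len - 1] << 8;
	return sum;
}

static uint16_t fold(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* UDP-over-IPv6 checksum: pseudo-header (src, dst, length, next header 17)
 * plus the UDP segment with its checksum field zeroed.
 */
static uint16_t udp6_csum(const uint8_t src[16], const uint8_t dst[16],
			  const uint8_t *udp, size_t udp_len)
{
	uint32_t sum = sum_be16(src, 16, 0);

	sum = sum_be16(dst, 16, sum);
	sum += (uint32_t)udp_len + 17;
	sum = sum_be16(udp, udp_len, sum);
	return (uint16_t)~fold(sum);
}

/* fe80::1270:fdff:fe48:1077 and fe80::1270:fdff:fe48:1087 */
static const uint8_t addr_77[16] = {
	0xfe, 0x80, 0, 0, 0, 0, 0, 0, 0x12, 0x70, 0xfd, 0xff, 0xfe, 0x48, 0x10, 0x77,
};
static const uint8_t addr_87[16] = {
	0xfe, 0x80, 0, 0, 0, 0, 0, 0, 0x12, 0x70, 0xfd, 0xff, 0xfe, 0x48, 0x10, 0x87,
};

/* Request: 55807 -> 9091; reply: 9091 -> 55807; same 3-byte payload. */
static const uint8_t udp_req[11] = {
	0xd9, 0xff, 0x23, 0x83, 0x00, 0x0b, 0x00, 0x00, 0x78, 0x64, 0x70,
};
static const uint8_t udp_rep[11] = {
	0x23, 0x83, 0xd9, 0xff, 0x00, 0x0b, 0x00, 0x00, 0x78, 0x64, 0x70,
};
```

Both udp6_csum(addr_87, addr_77, udp_req, 11) and udp6_csum(addr_77, addr_87, udp_rep, 11) evaluate to 0xde7e, the value tcpdump reports as correct for the reply.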
Some other related changes:
- switched to zerocopy mode by default; new flag can be used to force
old behavior
- request fixed tx_metadata_len headroom
- some other small fixes (umem size, fill idx+i, etc)
mvbz3:~# ./xdp_hw_metadata eth3
...
0x1062cb8: rx_desc[0]->addr=80100 addr=80100 comp_addr=80100
rx_hash: 0x2E1B50B9 with RSS type:0x2A
rx_timestamp: 1691436369532047139 (sec:1691436369.5320)
XDP RX-time: 1691436369261756803 (sec:1691436369.2618) delta sec:-0.2703 (-270290.336 usec)
AF_XDP time: 1691436369261878839 (sec:1691436369.2619) delta sec:0.0001 (122.036 usec)
0x1062cb8: ping-pong with csum=3b8e (want de7e) csum_start=54 csum_offset=6
0x1062cb8: complete tx idx=0 addr=10
0x1062cb8: tx_timestamp: 1691436369598419505 (sec:1691436369.5984)
0x1062cb8: complete rx idx=128 addr=80100
mvbz4:~# nc -Nu -q1 ${MVBZ3_LINK_LOCAL_IP}%eth3 9091
mvbz4:~# tcpdump -vvx -i eth3 udp
tcpdump: listening on eth3, link-type EN10MB (Ethernet), snapshot length 262144 bytes
12:26:09.301074 IP6 (flowlabel 0x35fa5, hlim 127, next-header UDP (17) payload length: 11) fe80::1270:fdff:fe48:1087.55807 > fe80::1270:fdff:fe48:1077.9091: [bad udp cksum 0x3b8e -> 0xde7e!] UDP, length 3
0x0000: 6003 5fa5 000b 117f fe80 0000 0000 0000
0x0010: 1270 fdff fe48 1087 fe80 0000 0000 0000
0x0020: 1270 fdff fe48 1077 d9ff 2383 000b 3b8e
0x0030: 7864 70
12:26:09.301976 IP6 (flowlabel 0x35fa5, hlim 127, next-header UDP (17) payload length: 11) fe80::1270:fdff:fe48:1077.9091 > fe80::1270:fdff:fe48:1087.55807: [udp sum ok] UDP, length 3
0x0000: 6003 5fa5 000b 117f fe80 0000 0000 0000
0x0010: 1270 fdff fe48 1077 fe80 0000 0000 0000
0x0020: 1270 fdff fe48 1087 2383 d9ff 000b de7e
0x0030: 7864 70
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
tools/testing/selftests/bpf/xdp_hw_metadata.c | 202 +++++++++++++++++-
1 file changed, 192 insertions(+), 10 deletions(-)
diff --git a/tools/testing/selftests/bpf/xdp_hw_metadata.c b/tools/testing/selftests/bpf/xdp_hw_metadata.c
index 613321eb84c1..ab83d0ba6763 100644
--- a/tools/testing/selftests/bpf/xdp_hw_metadata.c
+++ b/tools/testing/selftests/bpf/xdp_hw_metadata.c
@@ -10,7 +10,9 @@
* - rx_hash
*
* TX:
- * - TBD
+ * - UDP 9091 packets trigger TX reply
+ * - TX HW timestamp is requested and reported back upon completion
+ * - TX checksum is requested
*/
#include <test_progs.h>
@@ -24,14 +26,17 @@
#include <linux/net_tstamp.h>
#include <linux/udp.h>
#include <linux/sockios.h>
+#include <linux/if_xdp.h>
#include <sys/mman.h>
#include <net/if.h>
#include <poll.h>
#include <time.h>
+#include <unistd.h>
+#include <libgen.h>
#include "xdp_metadata.h"
-#define UMEM_NUM 16
+#define UMEM_NUM 256
#define UMEM_FRAME_SIZE XSK_UMEM__DEFAULT_FRAME_SIZE
#define UMEM_SIZE (UMEM_FRAME_SIZE * UMEM_NUM)
#define XDP_FLAGS (XDP_FLAGS_DRV_MODE | XDP_FLAGS_REPLACE)
@@ -51,22 +56,24 @@ struct xsk *rx_xsk;
const char *ifname;
int ifindex;
int rxq;
+bool skip_tx;
void test__fail(void) { /* for network_helpers.c */ }
-static int open_xsk(int ifindex, struct xsk *xsk, __u32 queue_id)
+static int open_xsk(int ifindex, struct xsk *xsk, __u32 queue_id, int flags)
{
int mmap_flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE;
const struct xsk_socket_config socket_config = {
.rx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
.tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
- .bind_flags = XDP_COPY,
+ .bind_flags = flags,
};
const struct xsk_umem_config umem_config = {
.fill_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
.comp_size = XSK_RING_CONS__DEFAULT_NUM_DESCS,
.frame_size = XSK_UMEM__DEFAULT_FRAME_SIZE,
- .flags = XDP_UMEM_UNALIGNED_CHUNK_FLAG,
+ .flags = XSK_UMEM__DEFAULT_FLAGS,
+ .tx_metadata_len = sizeof(struct xsk_tx_metadata),
};
__u32 idx;
u64 addr;
@@ -108,7 +115,7 @@ static int open_xsk(int ifindex, struct xsk *xsk, __u32 queue_id)
for (i = 0; i < UMEM_NUM / 2; i++) {
addr = (UMEM_NUM / 2 + i) * UMEM_FRAME_SIZE;
printf("%p: rx_desc[%d] -> %lx\n", xsk, i, addr);
- *xsk_ring_prod__fill_addr(&xsk->fill, i) = addr;
+ *xsk_ring_prod__fill_addr(&xsk->fill, idx + i) = addr;
}
xsk_ring_prod__submit(&xsk->fill, ret);
@@ -129,12 +136,22 @@ static void refill_rx(struct xsk *xsk, __u64 addr)
__u32 idx;
if (xsk_ring_prod__reserve(&xsk->fill, 1, &idx) == 1) {
- printf("%p: complete idx=%u addr=%llx\n", xsk, idx, addr);
+ printf("%p: complete rx idx=%u addr=%llx\n", xsk, idx, addr);
*xsk_ring_prod__fill_addr(&xsk->fill, idx) = addr;
xsk_ring_prod__submit(&xsk->fill, 1);
}
}
+static int kick_tx(struct xsk *xsk)
+{
+ return sendto(xsk_socket__fd(xsk->socket), NULL, 0, MSG_DONTWAIT, NULL, 0);
+}
+
+static int kick_rx(struct xsk *xsk)
+{
+ return recvfrom(xsk_socket__fd(xsk->socket), NULL, 0, MSG_DONTWAIT, NULL, NULL);
+}
+
#define NANOSEC_PER_SEC 1000000000 /* 10^9 */
static __u64 gettime(clockid_t clock_id)
{
@@ -228,6 +245,117 @@ static void verify_skb_metadata(int fd)
printf("skb hwtstamp is not found!\n");
}
+static bool complete_tx(struct xsk *xsk)
+{
+ struct xsk_tx_metadata *meta;
+ __u64 addr;
+ void *data;
+ __u32 idx;
+
+ if (!xsk_ring_cons__peek(&xsk->comp, 1, &idx))
+ return false;
+
+ addr = *xsk_ring_cons__comp_addr(&xsk->comp, idx);
+ data = xsk_umem__get_data(xsk->umem_area, addr);
+ meta = data - sizeof(struct xsk_tx_metadata);
+
+ printf("%p: complete tx idx=%u addr=%llx\n", xsk, idx, addr);
+ printf("%p: tx_timestamp: %llu (sec:%0.4f)\n", xsk,
+ meta->completion.tx_timestamp,
+ (double)meta->completion.tx_timestamp / NANOSEC_PER_SEC);
+ xsk_ring_cons__release(&xsk->comp, 1);
+
+ return true;
+}
+
+#define swap(a, b, len) do { \
+ for (int i = 0; i < len; i++) { \
+ __u8 tmp = ((__u8 *)a)[i]; \
+ ((__u8 *)a)[i] = ((__u8 *)b)[i]; \
+ ((__u8 *)b)[i] = tmp; \
+ } \
+} while (0)
+
+static void ping_pong(struct xsk *xsk, void *rx_packet)
+{
+ struct xsk_tx_metadata *meta;
+ struct ipv6hdr *ip6h = NULL;
+ struct iphdr *iph = NULL;
+ struct xdp_desc *tx_desc;
+ struct udphdr *udph;
+ struct ethhdr *eth;
+ __sum16 want_csum;
+ void *data;
+ __u32 idx;
+ int ret;
+ int len;
+
+ ret = xsk_ring_prod__reserve(&xsk->tx, 1, &idx);
+ if (ret != 1) {
+ printf("%p: failed to reserve tx slot\n", xsk);
+ return;
+ }
+
+ tx_desc = xsk_ring_prod__tx_desc(&xsk->tx, idx);
+ tx_desc->addr = idx % (UMEM_NUM / 2) * UMEM_FRAME_SIZE + sizeof(struct xsk_tx_metadata);
+ data = xsk_umem__get_data(xsk->umem_area, tx_desc->addr);
+
+ meta = data - sizeof(struct xsk_tx_metadata);
+ memset(meta, 0, sizeof(*meta));
+ meta->flags = XDP_TX_METADATA_TIMESTAMP;
+
+ eth = rx_packet;
+
+ if (eth->h_proto == htons(ETH_P_IP)) {
+ iph = (void *)(eth + 1);
+ udph = (void *)(iph + 1);
+ } else if (eth->h_proto == htons(ETH_P_IPV6)) {
+ ip6h = (void *)(eth + 1);
+ udph = (void *)(ip6h + 1);
+ } else {
+ printf("%p: failed to detect IP version for ping pong %04x\n", xsk, eth->h_proto);
+ xsk_ring_prod__cancel(&xsk->tx, 1);
+ return;
+ }
+
+ len = ETH_HLEN;
+ if (ip6h)
+ len += sizeof(*ip6h) + ntohs(ip6h->payload_len);
+ if (iph)
+ len += ntohs(iph->tot_len);
+
+ swap(eth->h_dest, eth->h_source, ETH_ALEN);
+ if (iph)
+ swap(&iph->saddr, &iph->daddr, 4);
+ else
+ swap(&ip6h->saddr, &ip6h->daddr, 16);
+ swap(&udph->source, &udph->dest, 2);
+
+ want_csum = udph->check;
+ if (ip6h)
+ udph->check = ~csum_ipv6_magic(&ip6h->saddr, &ip6h->daddr,
+ ntohs(udph->len), IPPROTO_UDP, 0);
+ else
+ udph->check = ~csum_tcpudp_magic(iph->saddr, iph->daddr,
+ ntohs(udph->len), IPPROTO_UDP, 0);
+
+ meta->flags |= XDP_TX_METADATA_CHECKSUM;
+ if (iph)
+ meta->csum_start = sizeof(*eth) + sizeof(*iph);
+ else
+ meta->csum_start = sizeof(*eth) + sizeof(*ip6h);
+ meta->csum_offset = offsetof(struct udphdr, check);
+
+ printf("%p: ping-pong with csum=%04x (want %04x) csum_start=%d csum_offset=%d\n",
+ xsk, ntohs(udph->check), ntohs(want_csum), meta->csum_start, meta->csum_offset);
+
+ memcpy(data, rx_packet, len); /* don't share umem chunk for simplicity */
+ tx_desc->options |= XDP_TX_METADATA;
+ tx_desc->len = len;
+
+ xsk_ring_prod__submit(&xsk->tx, 1);
+}
+
static int verify_metadata(struct xsk *rx_xsk, int rxq, int server_fd, clockid_t clock_id)
{
const struct xdp_desc *rx_desc;
@@ -250,6 +378,13 @@ static int verify_metadata(struct xsk *rx_xsk, int rxq, int server_fd, clockid_t
while (true) {
errno = 0;
+
+ for (i = 0; i < rxq; i++) {
+ ret = kick_rx(&rx_xsk[i]);
+ if (ret)
+ printf("kick_rx ret=%d\n", ret);
+ }
+
ret = poll(fds, rxq + 1, 1000);
printf("poll: %d (%d) skip=%llu fail=%llu redir=%llu\n",
ret, errno, bpf_obj->bss->pkts_skip,
@@ -280,6 +415,22 @@ static int verify_metadata(struct xsk *rx_xsk, int rxq, int server_fd, clockid_t
xsk, idx, rx_desc->addr, addr, comp_addr);
verify_xdp_metadata(xsk_umem__get_data(xsk->umem_area, addr),
clock_id);
+
+ if (!skip_tx) {
+ /* mirror the packet back */
+ ping_pong(xsk, xsk_umem__get_data(xsk->umem_area, addr));
+
+ ret = kick_tx(xsk);
+ if (ret)
+ printf("kick_tx ret=%d\n", ret);
+
+ for (int j = 0; j < 500; j++) {
+ if (complete_tx(xsk))
+ break;
+ usleep(10*1000);
+ }
+ }
+
xsk_ring_cons__release(&xsk->rx, 1);
refill_rx(xsk, comp_addr);
}
@@ -404,21 +555,52 @@ static void timestamping_enable(int fd, int val)
error(1, errno, "setsockopt(SO_TIMESTAMPING)");
}
+static void usage(const char *prog)
+{
+ fprintf(stderr,
+ "usage: %s [OPTS] <ifname>\n"
+ "OPTS:\n"
+ " -r don't generate AF_XDP reply (rx metadata only)\n"
+ " -c run in copy mode\n",
+ prog);
+}
+
int main(int argc, char *argv[])
{
+ int bind_flags = XDP_USE_NEED_WAKEUP | XDP_ZEROCOPY;
clockid_t clock_id = CLOCK_TAI;
int server_fd = -1;
+ int opt;
int ret;
int i;
struct bpf_program *prog;
- if (argc != 2) {
+ while ((opt = getopt(argc, argv, "rc")) != -1) {
+ switch (opt) {
+ case 'r':
+ skip_tx = true;
+ break;
+ case 'c':
+ bind_flags = XDP_USE_NEED_WAKEUP | XDP_COPY;
+ break;
+ default:
+ usage(basename(argv[0]));
+ return 1;
+ }
+ }
+
+ if (argc < 2) {
fprintf(stderr, "pass device name\n");
return -1;
}
- ifname = argv[1];
+ if (optind >= argc) {
+ usage(basename(argv[0]));
+ return 1;
+ }
+
+ ifname = argv[optind];
ifindex = if_nametoindex(ifname);
rxq = rxq_num(ifname);
@@ -432,7 +614,7 @@ int main(int argc, char *argv[])
for (i = 0; i < rxq; i++) {
printf("open_xsk(%s, %p, %d)\n", ifname, &rx_xsk[i], i);
- ret = open_xsk(ifindex, &rx_xsk[i], i);
+ ret = open_xsk(ifindex, &rx_xsk[i], i, bind_flags);
if (ret)
error(1, -ret, "open_xsk");
--
2.42.0.582.g8ccd20d70d-goog
^ permalink raw reply [flat|nested] 25+ messages in thread
* [xdp-hints] [PATCH bpf-next v3 10/10] xsk: document tx_metadata_len layout
2023-10-03 20:05 [xdp-hints] [PATCH bpf-next v3 00/10] xsk: TX metadata Stanislav Fomichev
` (8 preceding siblings ...)
2023-10-03 20:05 ` [xdp-hints] [PATCH bpf-next v3 09/10] selftests/bpf: Add TX side to xdp_hw_metadata Stanislav Fomichev
@ 2023-10-03 20:05 ` Stanislav Fomichev
9 siblings, 0 replies; 25+ messages in thread
From: Stanislav Fomichev @ 2023-10-03 20:05 UTC (permalink / raw)
To: bpf
Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
kpsingh, sdf, haoluo, jolsa, kuba, toke, willemb, dsahern,
magnus.karlsson, bjorn, maciej.fijalkowski, hawk,
yoong.siang.song, netdev, xdp-hints
- how to use
- how to query features
- pointers to the examples
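As a quick orientation before the document itself, this is a minimal userspace sketch of the addressing rule it describes: the descriptor address points at the payload and the metadata sits tx_metadata_len bytes before it. struct mock_tx_metadata and the two helpers are hypothetical stand-ins invented for this sketch; the real layout is defined in include/uapi/linux/if_xdp.h and may differ.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the UAPI metadata struct; it only mirrors the
 * field names used by the selftests in this series.
 */
struct mock_tx_metadata {
	uint32_t flags;
	uint16_t csum_start;
	uint16_t csum_offset;
	struct {
		uint64_t tx_timestamp;
	} completion;
};

/* Descriptor address for a frame: skip the reserved metadata headroom. */
static uint64_t tx_frame_addr(uint32_t frame_idx, uint32_t frame_size,
			      uint32_t tx_metadata_len)
{
	return (uint64_t)frame_idx * frame_size + tx_metadata_len;
}

/* Metadata lives at desc_addr - tx_metadata_len inside the umem. */
static struct mock_tx_metadata *tx_meta(void *umem_area, uint64_t desc_addr,
					uint32_t tx_metadata_len)
{
	return (struct mock_tx_metadata *)((uint8_t *)umem_area +
					   desc_addr - tx_metadata_len);
}

/* Returns 1 if the metadata of frame 1 starts exactly at the frame start. */
static int layout_demo(void)
{
	const uint32_t frame_size = 4096;
	const uint32_t meta_len = sizeof(struct mock_tx_metadata);
	uint8_t *umem = calloc(2, frame_size);
	struct mock_tx_metadata *meta;
	uint64_t addr;
	int ok;

	if (!umem)
		return 0;

	addr = tx_frame_addr(1, frame_size, meta_len);
	meta = tx_meta(umem, addr, meta_len);
	meta->csum_start = 34;	/* e.g. Ethernet + IPv4 header */
	meta->csum_offset = 6;	/* e.g. UDP check field */

	ok = (uint8_t *)meta == umem + frame_size &&
	     addr == (uint64_t)frame_size + meta_len;
	free(umem);
	return ok;
}
```

If an application reserves more headroom than the metadata struct needs, the same subtraction still applies; the extra bytes simply become padding between the metadata and the payload.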
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
Documentation/networking/index.rst | 1 +
Documentation/networking/xsk-tx-metadata.rst | 77 ++++++++++++++++++++
2 files changed, 78 insertions(+)
create mode 100644 Documentation/networking/xsk-tx-metadata.rst
diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst
index 5b75c3f7a137..9b2accb48df7 100644
--- a/Documentation/networking/index.rst
+++ b/Documentation/networking/index.rst
@@ -123,6 +123,7 @@ Refer to :ref:`netdev-FAQ` for a guide on netdev development process specifics.
xfrm_sync
xfrm_sysctl
xdp-rx-metadata
+ xsk-tx-metadata
.. only:: subproject and html
diff --git a/Documentation/networking/xsk-tx-metadata.rst b/Documentation/networking/xsk-tx-metadata.rst
new file mode 100644
index 000000000000..b7289f06745c
--- /dev/null
+++ b/Documentation/networking/xsk-tx-metadata.rst
@@ -0,0 +1,77 @@
+==================
+AF_XDP TX Metadata
+==================
+
+This document describes how to enable offloads when transmitting packets
+via :doc:`af_xdp`. Refer to :doc:`xdp-rx-metadata` on how to access similar
+metadata on the receive side.
+
+General Design
+==============
+
+The headroom for the metadata is reserved via ``tx_metadata_len`` in
+``struct xdp_umem_reg``. The metadata length is therefore the same for
+every socket that shares the same umem. The metadata layout is a fixed UAPI;
+refer to ``struct xsk_tx_metadata`` in ``include/uapi/linux/if_xdp.h``.
+Thus, generally, the ``tx_metadata_len`` field above should contain
+``sizeof(struct xsk_tx_metadata)``.
+
+The headroom and the metadata itself should be located right before
+``xdp_desc->addr`` in the umem frame. Within a frame, the metadata
+layout is as follows::
+
+ tx_metadata_len
+ / \
+ +-----------------+---------+----------------------------+
+ | xsk_tx_metadata | padding | payload |
+ +-----------------+---------+----------------------------+
+ ^
+ |
+ xdp_desc->addr
+
+An AF_XDP application can request headrooms larger than ``sizeof(struct
+xsk_tx_metadata)``. The kernel will ignore the padding (and will still
+use ``xdp_desc->addr - tx_metadata_len`` to locate
+the ``xsk_tx_metadata``). For the frames that shouldn't carry
+any metadata (i.e., the ones that don't have ``XDP_TX_METADATA`` option),
+the metadata area is ignored by the kernel as well.
+
+The flags field enables the particular offload:
+
+- ``XDP_TX_METADATA_TIMESTAMP``: requests the device to put the transmission
+  timestamp into the ``tx_timestamp`` field of ``struct xsk_tx_metadata``.
+- ``XDP_TX_METADATA_CHECKSUM``: requests the device to calculate the L4
+  checksum. ``csum_start`` specifies the byte offset where the checksumming
+  should start and ``csum_offset`` specifies the byte offset, relative to
+  ``csum_start``, where the device should store the computed checksum.
+- ``XDP_TX_METADATA_CHECKSUM_SW``: requests checksum calculation to
+  be done in software; this mode works only in ``XDP_COPY`` mode and
+  is mostly intended for testing. Do not enable this option in production:
+  it will negatively affect performance.
+
+Besides the flags above, in order to trigger the offloads, the packet's
+first ``struct xdp_desc`` descriptor should set the ``XDP_TX_METADATA``
+bit in the ``options`` field. Also note that in a multi-buffer packet
+only the first chunk should carry the metadata.
+
+Querying Device Capabilities
+============================
+
+Every device exports its offload capabilities via the netlink netdev family.
+Refer to the ``xsk-flags`` features bitmask in
+``Documentation/netlink/specs/netdev.yaml``.
+
+- ``tx-timestamp``: device supports ``XDP_TX_METADATA_TIMESTAMP``
+- ``tx-checksum``: device supports ``XDP_TX_METADATA_CHECKSUM``
+
+Note that every device supports ``XDP_TX_METADATA_CHECKSUM_SW`` when
+running in ``XDP_COPY`` mode.
+
+See ``tools/net/ynl/samples/netdev.c`` on how to query this information.
+
+Example
+=======
+
+See ``tools/testing/selftests/bpf/xdp_hw_metadata.c`` for an example
+program that handles TX metadata. Also see https://github.com/fomichev/xskgen
+for a more bare-bones example.
--
2.42.0.582.g8ccd20d70d-goog
^ permalink raw reply [flat|nested] 25+ messages in thread
* [xdp-hints] Re: [PATCH bpf-next v3 02/10] xsk: add TX timestamp and TX checksum offload support
2023-10-03 20:05 ` [xdp-hints] [PATCH bpf-next v3 02/10] xsk: add TX timestamp and TX checksum offload support Stanislav Fomichev
@ 2023-10-04 6:18 ` Song, Yoong Siang
2023-10-04 17:48 ` Stanislav Fomichev
0 siblings, 1 reply; 25+ messages in thread
From: Song, Yoong Siang @ 2023-10-04 6:18 UTC (permalink / raw)
To: Stanislav Fomichev, bpf
Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
kpsingh, haoluo, jolsa, kuba, toke, willemb, dsahern, Karlsson,
Magnus, bjorn, Fijalkowski, Maciej, hawk, netdev, xdp-hints
On Wednesday, October 4, 2023 4:05 AM Stanislav Fomichev <sdf@google.com> wrote:
>This change actually defines the (initial) metadata layout that should be used by
>AF_XDP userspace (xsk_tx_metadata).
>The first field is flags which requests appropriate offloads, followed by the offload-
>specific fields. The supported per-device offloads are exported via netlink (new
>xsk-flags).
>
>The offloads themselves are still implemented in a bit of a framework-y fashion
>that's left from my initial kfunc attempt.
>I'm introducing new xsk_tx_metadata_ops which drivers are supposed to
>implement. The drivers are also supposed to call
>xsk_tx_metadata_request/xsk_tx_metadata_complete in the right places. Since
>xsk_tx_metadata_{request,complete}
>are static inline, we don't incur any extra overhead doing indirect calls.
>
>The benefit of this scheme is as follows:
>- keeps all metadata layout parsing away from driver code
>- makes it easy to grep and see which drivers implement what
>- don't need any extra flags to maintain to keep track of what
> offloads are implemented; if the callback is implemented - the offload
> is supported (used by netlink reporting code)
>
>Two offloads are defined right now:
>1. XDP_TX_METADATA_CHECKSUM: skb-style csum_start+csum_offset
>2. XDP_TX_METADATA_TIMESTAMP: writes TX timestamp back into metadata
> area upon completion (tx_timestamp field)
>
>The offloads are also implemented for copy mode:
>1. Extra XDP_TX_METADATA_CHECKSUM_SW to trigger skb_checksum_help; this
> might be useful as a reference implementation and for testing
>2. XDP_TX_METADATA_TIMESTAMP writes SW timestamp from the skb
> destructor (note I'm reusing hwtstamps to pass metadata pointer)
>
>The struct is forward-compatible and can be extended in the future by appending
>more fields.
>
>Signed-off-by: Stanislav Fomichev <sdf@google.com>
>---
> Documentation/netlink/specs/netdev.yaml | 19 ++++++
> include/linux/netdevice.h | 27 +++++++++
> include/linux/skbuff.h | 14 ++++-
> include/net/xdp_sock.h | 80 +++++++++++++++++++++++++
> include/net/xdp_sock_drv.h | 13 ++++
> include/net/xsk_buff_pool.h | 6 ++
> include/uapi/linux/if_xdp.h | 40 +++++++++++++
> include/uapi/linux/netdev.h | 16 +++++
> net/core/netdev-genl.c | 12 +++-
> net/xdp/xsk.c | 39 ++++++++++++
> net/xdp/xsk_queue.h | 2 +-
> tools/include/uapi/linux/if_xdp.h | 54 +++++++++++++++--
> tools/include/uapi/linux/netdev.h | 16 +++++
> tools/net/ynl/generated/netdev-user.c | 19 ++++++
> tools/net/ynl/generated/netdev-user.h | 3 +
> 15 files changed, 352 insertions(+), 8 deletions(-)
>
>diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
>index c46fcc78fc04..3735c26c8646 100644
>--- a/Documentation/netlink/specs/netdev.yaml
>+++ b/Documentation/netlink/specs/netdev.yaml
>@@ -55,6 +55,19 @@ name: netdev
> name: hash
> doc:
> Device is capable of exposing receive packet hash via bpf_xdp_metadata_rx_hash().
>+ -
>+ type: flags
>+ name: xsk-flags
>+ render-max: true
>+ entries:
>+ -
>+ name: tx-timestamp
>+ doc:
>+ HW timestamping egress packets is supported by the driver.
>+ -
>+ name: tx-checksum
>+ doc:
>+ L4 checksum HW offload is supported by the driver.
>
> attribute-sets:
> -
>@@ -88,6 +101,11 @@ name: netdev
> type: u64
> enum: xdp-rx-metadata
> enum-as-flags: true
>+ -
>+ name: xsk-features
>+ doc: Bitmask of enabled AF_XDP features.
>+ type: u64
>+ enum: xsk-flags
>
> operations:
> list:
>@@ -105,6 +123,7 @@ name: netdev
> - xdp-features
> - xdp-zc-max-segs
> - xdp-rx-metadata-features
>+ - xsk-features
> dump:
> reply: *dev-all
> -
>diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>index 7e520c14eb8c..0e1cb026cbe5 100644
>--- a/include/linux/netdevice.h
>+++ b/include/linux/netdevice.h
>@@ -1650,6 +1650,31 @@ struct net_device_ops {
> struct netlink_ext_ack *extack);
> };
>
>+/*
>+ * This structure defines the AF_XDP TX metadata hooks for network devices.
>+ * The following hooks can be defined; unless noted otherwise, they are
>+ * optional and can be filled with a null pointer.
>+ *
>+ * int (*tmo_request_timestamp)(void *priv)
Should be "void" instead of "int"
>+ * This function is called when AF_XDP frame requested egress timestamp.
>+ *
>+ * int (*tmo_fill_timestamp)(void *priv)
Should be "u64" instead of "int"
>+ * This function is called when AF_XDP frame, that had requested
>+ * egress timestamp, received a completion. The hook needs to return
>+ * the actual HW timestamp.
>+ *
>+ * int (*tmo_request_checksum)(u16 csum_start, u16 csum_offset, void *priv)
Should be "void" instead of "int"
>+ * This function is called when AF_XDP frame requested HW checksum
>+ * offload. csum_start indicates position where checksumming should start.
>+ * csum_offset indicates position where checksum should be stored.
>+ *
>+ */
>+struct xsk_tx_metadata_ops {
>+ void (*tmo_request_timestamp)(void *priv);
>+ u64 (*tmo_fill_timestamp)(void *priv);
>+ void (*tmo_request_checksum)(u16 csum_start, u16 csum_offset, void *priv);
>+};
>+
> /**
> * enum netdev_priv_flags - &struct net_device priv_flags
> *
>@@ -1838,6 +1863,7 @@ enum netdev_ml_priv_type {
> * @netdev_ops: Includes several pointers to callbacks,
> * if one wants to override the ndo_*() functions
> * @xdp_metadata_ops: Includes pointers to XDP metadata callbacks.
>+ * @xsk_tx_metadata_ops: Includes pointers to AF_XDP TX metadata callbacks.
> * @ethtool_ops: Management operations
> * @l3mdev_ops: Layer 3 master device operations
> * @ndisc_ops: Includes callbacks for different IPv6 neighbour
>@@ -2097,6 +2123,7 @@ struct net_device {
> unsigned long long priv_flags;
> const struct net_device_ops *netdev_ops;
> const struct xdp_metadata_ops *xdp_metadata_ops;
>+ const struct xsk_tx_metadata_ops *xsk_tx_metadata_ops;
> int ifindex;
> unsigned short gflags;
> unsigned short hard_header_len;
>diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>index 4174c4b82d13..444d35dcd690 100644
>--- a/include/linux/skbuff.h
>+++ b/include/linux/skbuff.h
>@@ -566,6 +566,15 @@ struct ubuf_info_msgzc {
> int mm_account_pinned_pages(struct mmpin *mmp, size_t size);
> void mm_unaccount_pinned_pages(struct mmpin *mmp);
>
>+/* Preserve some data across TX submission and completion.
>+ *
>+ * Note, this state is stored in the driver. Extending the layout
>+ * might need some special care.
>+ */
>+struct xsk_tx_metadata_compl {
>+ __u64 *tx_timestamp;
>+};
>+
> /* This data is invariant across clones and lives at
> * the end of the header data, ie. at skb->end.
> */
>@@ -578,7 +587,10 @@ struct skb_shared_info {
> /* Warning: this field is not always filled in (UFO)! */
> unsigned short gso_segs;
> struct sk_buff *frag_list;
>- struct skb_shared_hwtstamps hwtstamps;
>+ union {
>+ struct skb_shared_hwtstamps hwtstamps;
>+ struct xsk_tx_metadata_compl xsk_meta;
>+ };
> unsigned int gso_type;
> u32 tskey;
>
>diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
>index caa1f04106be..29427a69784d 100644
>--- a/include/net/xdp_sock.h
>+++ b/include/net/xdp_sock.h
>@@ -92,6 +92,74 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
> int __xsk_map_redirect(struct xdp_sock *xs, struct xdp_buff *xdp);
> void __xsk_map_flush(void);
>
>+/**
>+ * xsk_tx_metadata_to_compl - Save enough relevant metadata information
>+ * to perform tx completion in the future.
>+ * @meta: pointer to AF_XDP metadata area
>+ * @compl: pointer to output struct xsk_tx_metadata_compl
>+ *
>+ * This function should be called by the networking device when
>+ * it prepares AF_XDP egress packet. The value of @compl should be stored
>+ * and passed to xsk_tx_metadata_complete upon TX completion.
>+ */
>+static inline void xsk_tx_metadata_to_compl(struct xsk_tx_metadata *meta,
>+ struct xsk_tx_metadata_compl *compl)
>+{
>+ if (!meta)
>+ return;
>+
>+ if (meta->flags & XDP_TX_METADATA_TIMESTAMP)
>+ compl->tx_timestamp = &meta->completion.tx_timestamp;
>+ else
>+ compl->tx_timestamp = NULL;
>+}
>+
>+/**
>+ * xsk_tx_metadata_request - Evaluate AF_XDP TX metadata at submission
>+ * and call appropriate xsk_tx_metadata_ops operation.
>+ * @meta: pointer to AF_XDP metadata area
>+ * @ops: pointer to struct xsk_tx_metadata_ops
>+ * @priv: pointer to driver-private area
>+ *
>+ * This function should be called by the networking device when
>+ * it prepares AF_XDP egress packet.
>+ */
>+static inline void xsk_tx_metadata_request(const struct xsk_tx_metadata *meta,
>+ const struct xsk_tx_metadata_ops *ops,
>+ void *priv)
>+{
>+ if (!meta)
>+ return;
>+
>+ if (ops->tmo_request_timestamp)
>+ if (meta->flags & XDP_TX_METADATA_TIMESTAMP)
>+ ops->tmo_request_timestamp(priv);
>+
>+ if (ops->tmo_request_checksum)
>+ if (meta->flags & XDP_TX_METADATA_CHECKSUM)
>+ ops->tmo_request_checksum(meta->csum_start, meta->csum_offset, priv);
>+}
>+
>+/**
>+ * xsk_tx_metadata_complete - Evaluate AF_XDP TX metadata at completion
>+ * and call appropriate xsk_tx_metadata_ops operation.
>+ * @compl: pointer to completion metadata produced from xsk_tx_metadata_to_compl
>+ * @ops: pointer to struct xsk_tx_metadata_ops
>+ * @priv: pointer to driver-private area
>+ *
>+ * This function should be called by the networking device upon
>+ * AF_XDP egress completion.
>+ */
>+static inline void xsk_tx_metadata_complete(struct xsk_tx_metadata_compl *compl,
>+ const struct xsk_tx_metadata_ops *ops,
>+ void *priv)
>+{
>+ if (!compl)
>+ return;
>+
>+ *compl->tx_timestamp = ops->tmo_fill_timestamp(priv);
>+}
>+
> #else
>
> static inline int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
>@@ -108,6 +176,18 @@ static inline void __xsk_map_flush(void)
> {
> }
>
>+static inline void xsk_tx_metadata_request(struct xsk_tx_metadata *meta,
>+ const struct xsk_tx_metadata_ops *ops,
>+ void *priv)
>+{
>+}
>+
>+static inline void xsk_tx_metadata_complete(struct xsk_tx_metadata_compl *compl,
>+ const struct xsk_tx_metadata_ops *ops,
>+ void *priv)
>+{
>+}
>+
> #endif /* CONFIG_XDP_SOCKETS */
>
> #endif /* _LINUX_XDP_SOCK_H */
>diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
>index 1f6fc8c7a84c..e2558ac3e195 100644
>--- a/include/net/xdp_sock_drv.h
>+++ b/include/net/xdp_sock_drv.h
>@@ -165,6 +165,14 @@ static inline void *xsk_buff_raw_get_data(struct xsk_buff_pool *pool, u64 addr)
> return xp_raw_get_data(pool, addr);
> }
>
>+static inline struct xsk_tx_metadata *xsk_buff_get_metadata(struct xsk_buff_pool *pool, u64 addr)
>+{
>+ if (!pool->tx_metadata_len)
>+ return NULL;
>+
>+ return xp_raw_get_data(pool, addr) - pool->tx_metadata_len;
>+}
>+
> static inline void xsk_buff_dma_sync_for_cpu(struct xdp_buff *xdp, struct xsk_buff_pool *pool)
> {
> struct xdp_buff_xsk *xskb = container_of(xdp, struct xdp_buff_xsk, xdp);
>@@ -324,6 +332,11 @@ static inline void *xsk_buff_raw_get_data(struct xsk_buff_pool *pool, u64 addr)
> return NULL;
> }
>
>+static inline struct xsk_tx_metadata *xsk_buff_get_metadata(struct xsk_buff_pool *pool, u64 addr)
>+{
>+ return NULL;
>+}
>+
> static inline void xsk_buff_dma_sync_for_cpu(struct xdp_buff *xdp, struct xsk_buff_pool *pool)
> {
> }
>diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
>index 1985ffaf9b0c..97f5cc10d79e 100644
>--- a/include/net/xsk_buff_pool.h
>+++ b/include/net/xsk_buff_pool.h
>@@ -33,6 +33,7 @@ struct xdp_buff_xsk {
> };
>
> #define XSK_CHECK_PRIV_TYPE(t) BUILD_BUG_ON(sizeof(t) > offsetofend(struct xdp_buff_xsk, cb))
>+#define XSK_TX_COMPL_FITS(t) BUILD_BUG_ON(sizeof(struct xsk_tx_metadata_compl) > sizeof(t))
>
> struct xsk_dma_map {
> dma_addr_t *dma_pages;
>@@ -234,4 +235,9 @@ static inline u64 xp_get_handle(struct xdp_buff_xsk *xskb)
> return xskb->orig_addr + (offset << XSK_UNALIGNED_BUF_OFFSET_SHIFT);
> }
>
>+static inline bool xp_tx_metadata_enabled(const struct xsk_buff_pool *pool)
>+{
>+ return pool->tx_metadata_len > 0;
>+}
>+
> #endif /* XSK_BUFF_POOL_H_ */
>diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
>index 2ecf79282c26..ecfd67988283 100644
>--- a/include/uapi/linux/if_xdp.h
>+++ b/include/uapi/linux/if_xdp.h
>@@ -106,6 +106,43 @@ struct xdp_options {
> #define XSK_UNALIGNED_BUF_ADDR_MASK \
> ((1ULL << XSK_UNALIGNED_BUF_OFFSET_SHIFT) - 1)
>
>+/* Request transmit timestamp. Upon completion, put it into tx_timestamp
>+ * field of struct xsk_tx_metadata.
>+ */
>+#define XDP_TX_METADATA_TIMESTAMP (1 << 0)
Suggestion from checkpatch.pl:
CHECK: Prefer using the BIT macro
>+
>+/* Request transmit checksum offload. Checksum start position and offset
>+ * are communicated via csum_start and csum_offset fields of struct
>+ * xsk_tx_metadata.
>+ */
>+#define XDP_TX_METADATA_CHECKSUM (1 << 1)
Suggestion from checkpatch.pl:
CHECK: Prefer using the BIT macro
>+
>+/* Force checksum calculation in software. Can be used for testing or
>+ * working around potential HW issues. This option causes performance
>+ * degradation and only works in XDP_COPY mode.
>+ */
>+#define XDP_TX_METADATA_CHECKSUM_SW (1 << 2)
Suggestion from checkpatch.pl:
CHECK: Prefer using the BIT macro
>+
>+struct xsk_tx_metadata {
>+ union {
>+ struct {
>+ __u32 flags;
>+
>+ /* XDP_TX_METADATA_CHECKSUM */
>+
>+ /* Offset from desc->addr where checksumming should start. */
>+ __u16 csum_start;
>+ /* Offset from csum_start where checksum should be stored. */
>+ __u16 csum_offset;
>+ };
>+
>+ struct {
>+ /* XDP_TX_METADATA_TIMESTAMP */
>+ __u64 tx_timestamp;
>+ } completion;
>+ };
>+};
>+
> /* Rx/Tx descriptor */
> struct xdp_desc {
> __u64 addr;
>@@ -122,4 +159,7 @@ struct xdp_desc {
> */
> #define XDP_PKT_CONTD (1 << 0)
>
>+/* TX packet carries valid metadata. */
>+#define XDP_TX_METADATA (1 << 1)
>+
> #endif /* _LINUX_IF_XDP_H */
>diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
>index 2943a151d4f1..48d5477a668c 100644
>--- a/include/uapi/linux/netdev.h
>+++ b/include/uapi/linux/netdev.h
>@@ -53,12 +53,28 @@ enum netdev_xdp_rx_metadata {
> NETDEV_XDP_RX_METADATA_MASK = 3,
> };
>
>+/**
>+ * enum netdev_xsk_flags
>+ * @NETDEV_XSK_FLAGS_TX_TIMESTAMP: HW timestamping egress packets is supported
>+ * by the driver.
>+ * @NETDEV_XSK_FLAGS_TX_CHECKSUM: L3 checksum HW offload is supported by the
>+ * driver.
>+ */
>+enum netdev_xsk_flags {
>+ NETDEV_XSK_FLAGS_TX_TIMESTAMP = 1,
>+ NETDEV_XSK_FLAGS_TX_CHECKSUM = 2,
>+
>+ /* private: */
>+ NETDEV_XSK_FLAGS_MASK = 3,
>+};
>+
> enum {
> NETDEV_A_DEV_IFINDEX = 1,
> NETDEV_A_DEV_PAD,
> NETDEV_A_DEV_XDP_FEATURES,
> NETDEV_A_DEV_XDP_ZC_MAX_SEGS,
> NETDEV_A_DEV_XDP_RX_METADATA_FEATURES,
>+ NETDEV_A_DEV_XSK_FEATURES,
>
> __NETDEV_A_DEV_MAX,
> NETDEV_A_DEV_MAX = (__NETDEV_A_DEV_MAX - 1)
>diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
>index fe61f85bcf33..5d889c2425fd 100644
>--- a/net/core/netdev-genl.c
>+++ b/net/core/netdev-genl.c
>@@ -14,6 +14,7 @@ netdev_nl_dev_fill(struct net_device *netdev, struct sk_buff *rsp,
> const struct genl_info *info)
> {
> u64 xdp_rx_meta = 0;
>+ u64 xsk_features = 0;
> void *hdr;
>
> hdr = genlmsg_iput(rsp, info);
>@@ -26,11 +27,20 @@ netdev_nl_dev_fill(struct net_device *netdev, struct sk_buff *rsp,
> XDP_METADATA_KFUNC_xxx
> #undef XDP_METADATA_KFUNC
>
>+ if (netdev->xsk_tx_metadata_ops) {
>+ if (netdev->xsk_tx_metadata_ops->tmo_fill_timestamp)
>+ xsk_features |= NETDEV_XSK_FLAGS_TX_TIMESTAMP;
>+ if (netdev->xsk_tx_metadata_ops->tmo_request_checksum)
>+ xsk_features |= NETDEV_XSK_FLAGS_TX_CHECKSUM;
>+ }
>+
> if (nla_put_u32(rsp, NETDEV_A_DEV_IFINDEX, netdev->ifindex) ||
> nla_put_u64_64bit(rsp, NETDEV_A_DEV_XDP_FEATURES,
> netdev->xdp_features, NETDEV_A_DEV_PAD) ||
> nla_put_u64_64bit(rsp, NETDEV_A_DEV_XDP_RX_METADATA_FEATURES,
>- xdp_rx_meta, NETDEV_A_DEV_PAD)) {
>+ xdp_rx_meta, NETDEV_A_DEV_PAD) ||
>+ nla_put_u64_64bit(rsp, NETDEV_A_DEV_XSK_FEATURES,
>+ xsk_features, NETDEV_A_DEV_PAD)) {
> genlmsg_cancel(rsp, hdr);
> return -EINVAL;
> }
>diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
>index c1e12b602213..c427e02c13eb 100644
>--- a/net/xdp/xsk.c
>+++ b/net/xdp/xsk.c
>@@ -543,6 +543,13 @@ static u32 xsk_get_num_desc(struct sk_buff *skb)
>
> static void xsk_destruct_skb(struct sk_buff *skb)
> {
>+ struct xsk_tx_metadata_compl *compl = &skb_shinfo(skb)->xsk_meta;
>+
>+ if (compl->tx_timestamp) {
>+ /* sw completion timestamp, not a real one */
>+ *compl->tx_timestamp = ktime_get_tai_fast_ns();
>+ }
>+
> xsk_cq_submit_locked(xdp_sk(skb->sk), xsk_get_num_desc(skb));
> sock_wfree(skb);
> }
>@@ -627,8 +634,10 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
> static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> struct xdp_desc *desc)
> {
>+ struct xsk_tx_metadata *meta = NULL;
> struct net_device *dev = xs->dev;
> struct sk_buff *skb = xs->skb;
>+ bool first_frag = false;
> int err;
>
> if (dev->priv_flags & IFF_TX_SKB_NO_LINEAR) {
>@@ -659,6 +668,8 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> kfree_skb(skb);
> goto free_err;
> }
>+
>+ first_frag = true;
> } else {
> int nr_frags = skb_shinfo(skb)->nr_frags;
> struct page *page;
>@@ -681,12 +692,40 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
>
> skb_add_rx_frag(skb, nr_frags, page, 0, len, 0);
> }
>+
>+ if (first_frag && desc->options & XDP_TX_METADATA) {
>+ if (unlikely(xs->pool->tx_metadata_len == 0)) {
>+ err = -EINVAL;
>+ goto free_err;
>+ }
>+
>+ meta = buffer - xs->pool->tx_metadata_len;
>+
>+ if (meta->flags & XDP_TX_METADATA_CHECKSUM) {
>+ if (unlikely(meta->csum_start + meta->csum_offset +
>+ sizeof(__sum16) > len)) {
>+ err = -EINVAL;
>+ goto free_err;
>+ }
>+
>+ skb->csum_start = hr + meta->csum_start;
>+ skb->csum_offset = meta->csum_offset;
>+ skb->ip_summed = CHECKSUM_PARTIAL;
>+
>+ if (unlikely(meta->flags & XDP_TX_METADATA_CHECKSUM_SW)) {
>+ err = skb_checksum_help(skb);
>+ if (err)
>+ goto free_err;
>+ }
>+ }
>+ }
> }
>
> skb->dev = dev;
> skb->priority = xs->sk.sk_priority;
> skb->mark = READ_ONCE(xs->sk.sk_mark);
> skb->destructor = xsk_destruct_skb;
>+ xsk_tx_metadata_to_compl(meta, &skb_shinfo(skb)->xsk_meta);
> xsk_set_destructor_arg(skb);
>
> return skb;
>diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
>index c74a1372bcb9..6f2d1621c992 100644
>--- a/net/xdp/xsk_queue.h
>+++ b/net/xdp/xsk_queue.h
>@@ -137,7 +137,7 @@ static inline bool xskq_cons_read_addr_unchecked(struct xsk_queue *q, u64 *addr)
>
> static inline bool xp_unused_options_set(u32 options)
> {
>- return options & ~XDP_PKT_CONTD;
>+ return options & ~(XDP_PKT_CONTD | XDP_TX_METADATA);
> }
>
> static inline bool xp_aligned_validate_desc(struct xsk_buff_pool *pool,
>diff --git a/tools/include/uapi/linux/if_xdp.h b/tools/include/uapi/linux/if_xdp.h
>index 34411a2e5b6c..53ceaae10dd1 100644
>--- a/tools/include/uapi/linux/if_xdp.h
>+++ b/tools/include/uapi/linux/if_xdp.h
>@@ -26,11 +26,11 @@
> */
> #define XDP_USE_NEED_WAKEUP (1 << 3)
> /* By setting this option, userspace application indicates that it can
>- * handle multiple descriptors per packet thus enabling xsk core to split
>+ * handle multiple descriptors per packet thus enabling AF_XDP to split
> * multi-buffer XDP frames into multiple Rx descriptors. Without this set
>- * such frames will be dropped by xsk.
>+ * such frames will be dropped.
> */
>-#define XDP_USE_SG (1 << 4)
>+#define XDP_USE_SG (1 << 4)
>
> /* Flags for xsk_umem_config flags */
> #define XDP_UMEM_UNALIGNED_CHUNK_FLAG (1 << 0)
>@@ -106,6 +106,43 @@ struct xdp_options {
> #define XSK_UNALIGNED_BUF_ADDR_MASK \
> ((1ULL << XSK_UNALIGNED_BUF_OFFSET_SHIFT) - 1)
>
>+/* Request transmit timestamp. Upon completion, put it into tx_timestamp
>+ * field of union xsk_tx_metadata.
>+ */
>+#define XDP_TX_METADATA_TIMESTAMP (1 << 0)
>+
>+/* Request transmit checksum offload. Checksum start position and offset
>+ * are communicated via csum_start and csum_offset fields of union
>+ * xsk_tx_metadata.
>+ */
>+#define XDP_TX_METADATA_CHECKSUM (1 << 1)
>+
>+/* Force checksum calculation in software. Can be used for testing or
>+ * working around potential HW issues. This option causes performance
>+ * degradation and only works in XDP_COPY mode.
>+ */
>+#define XDP_TX_METADATA_CHECKSUM_SW (1 << 2)
>+
>+struct xsk_tx_metadata {
>+ union {
>+ struct {
>+ __u32 flags;
>+
>+ /* XDP_TX_METADATA_CHECKSUM */
>+
>+ /* Offset from desc->addr where checksumming should start. */
>+ __u16 csum_start;
>+ /* Offset from csum_start where checksum should be stored. */
>+ __u16 csum_offset;
>+ };
>+
>+ struct {
>+ /* XDP_TX_METADATA_TIMESTAMP */
>+ __u64 tx_timestamp;
>+ } completion;
>+ };
>+};
>+
> /* Rx/Tx descriptor */
> struct xdp_desc {
> __u64 addr;
>@@ -113,9 +150,16 @@ struct xdp_desc {
> __u32 options;
> };
>
>-/* Flag indicating packet constitutes of multiple buffers*/
>+/* UMEM descriptor is __u64 */
>+
>+/* Flag indicating that the packet continues with the buffer pointed out by the
>+ * next frame in the ring. The end of the packet is signalled by setting this
>+ * bit to zero. For single buffer packets, every descriptor has 'options' set
>+ * to 0 and this maintains backward compatibility.
>+ */
> #define XDP_PKT_CONTD (1 << 0)
>
>-/* UMEM descriptor is __u64 */
>+/* TX packet carries valid metadata. */
>+#define XDP_TX_METADATA (1 << 1)
>
> #endif /* _LINUX_IF_XDP_H */
>diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h
>index 2943a151d4f1..48d5477a668c 100644
>--- a/tools/include/uapi/linux/netdev.h
>+++ b/tools/include/uapi/linux/netdev.h
>@@ -53,12 +53,28 @@ enum netdev_xdp_rx_metadata {
> NETDEV_XDP_RX_METADATA_MASK = 3,
> };
>
>+/**
>+ * enum netdev_xsk_flags
>+ * @NETDEV_XSK_FLAGS_TX_TIMESTAMP: HW timestamping egress packets is supported
>+ * by the driver.
>+ * @NETDEV_XSK_FLAGS_TX_CHECKSUM: L3 checksum HW offload is supported by the
>+ * driver.
>+ */
>+enum netdev_xsk_flags {
>+ NETDEV_XSK_FLAGS_TX_TIMESTAMP = 1,
>+ NETDEV_XSK_FLAGS_TX_CHECKSUM = 2,
>+
>+ /* private: */
>+ NETDEV_XSK_FLAGS_MASK = 3,
>+};
>+
> enum {
> NETDEV_A_DEV_IFINDEX = 1,
> NETDEV_A_DEV_PAD,
> NETDEV_A_DEV_XDP_FEATURES,
> NETDEV_A_DEV_XDP_ZC_MAX_SEGS,
> NETDEV_A_DEV_XDP_RX_METADATA_FEATURES,
>+ NETDEV_A_DEV_XSK_FEATURES,
>
> __NETDEV_A_DEV_MAX,
> NETDEV_A_DEV_MAX = (__NETDEV_A_DEV_MAX - 1)
>diff --git a/tools/net/ynl/generated/netdev-user.c b/tools/net/ynl/generated/netdev-user.c
>index b5ffe8cd1144..6283d87dad37 100644
>--- a/tools/net/ynl/generated/netdev-user.c
>+++ b/tools/net/ynl/generated/netdev-user.c
>@@ -58,6 +58,19 @@ const char *netdev_xdp_rx_metadata_str(enum netdev_xdp_rx_metadata value)
> return netdev_xdp_rx_metadata_strmap[value];
> }
>
>+static const char * const netdev_xsk_flags_strmap[] = {
>+ [0] = "tx-timestamp",
>+ [1] = "tx-checksum",
>+};
>+
>+const char *netdev_xsk_flags_str(enum netdev_xsk_flags value)
>+{
>+ value = ffs(value) - 1;
>+ if (value < 0 || value >= (int)MNL_ARRAY_SIZE(netdev_xsk_flags_strmap))
>+ return NULL;
>+ return netdev_xsk_flags_strmap[value];
>+}
>+
> /* Policies */
> struct ynl_policy_attr netdev_dev_policy[NETDEV_A_DEV_MAX + 1] = {
> [NETDEV_A_DEV_IFINDEX] = { .name = "ifindex", .type = YNL_PT_U32, },
>@@ -65,6 +78,7 @@ struct ynl_policy_attr netdev_dev_policy[NETDEV_A_DEV_MAX + 1] = {
> [NETDEV_A_DEV_XDP_FEATURES] = { .name = "xdp-features", .type = YNL_PT_U64, },
> [NETDEV_A_DEV_XDP_ZC_MAX_SEGS] = { .name = "xdp-zc-max-segs", .type = YNL_PT_U32, },
> [NETDEV_A_DEV_XDP_RX_METADATA_FEATURES] = { .name = "xdp-rx-metadata-features", .type = YNL_PT_U64, },
>+ [NETDEV_A_DEV_XSK_FEATURES] = { .name = "xsk-features", .type = YNL_PT_U64, },
> };
>
> struct ynl_policy_nest netdev_dev_nest = {
>@@ -116,6 +130,11 @@ int netdev_dev_get_rsp_parse(const struct nlmsghdr *nlh, void *data)
> return MNL_CB_ERROR;
> dst->_present.xdp_rx_metadata_features = 1;
> dst->xdp_rx_metadata_features = mnl_attr_get_u64(attr);
>+ } else if (type == NETDEV_A_DEV_XSK_FEATURES) {
>+ if (ynl_attr_validate(yarg, attr))
>+ return MNL_CB_ERROR;
>+ dst->_present.xsk_features = 1;
>+ dst->xsk_features = mnl_attr_get_u64(attr);
> }
> }
>
>diff --git a/tools/net/ynl/generated/netdev-user.h b/tools/net/ynl/generated/netdev-user.h
>index b4351ff34595..bdbd1766ce46 100644
>--- a/tools/net/ynl/generated/netdev-user.h
>+++ b/tools/net/ynl/generated/netdev-user.h
>@@ -19,6 +19,7 @@ extern const struct ynl_family ynl_netdev_family;
> const char *netdev_op_str(int op);
> const char *netdev_xdp_act_str(enum netdev_xdp_act value);
> const char *netdev_xdp_rx_metadata_str(enum netdev_xdp_rx_metadata value);
>+const char *netdev_xsk_flags_str(enum netdev_xsk_flags value);
>
> /* Common nested types */
> /* ============== NETDEV_CMD_DEV_GET ============== */
>@@ -50,12 +51,14 @@ struct netdev_dev_get_rsp {
> __u32 xdp_features:1;
> __u32 xdp_zc_max_segs:1;
> __u32 xdp_rx_metadata_features:1;
>+ __u32 xsk_features:1;
> } _present;
>
> __u32 ifindex;
> __u64 xdp_features;
> __u32 xdp_zc_max_segs;
> __u64 xdp_rx_metadata_features;
>+ __u64 xsk_features;
> };
>
> void netdev_dev_get_rsp_free(struct netdev_dev_get_rsp *rsp);
>--
>2.42.0.582.g8ccd20d70d-goog
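
For readers following the xsk_build_skb() hunk quoted above, here is a minimal userspace sketch of the layout it implies: the metadata area sits immediately before the packet data (meta = buffer - tx_metadata_len), and a checksum request is rejected when csum_start + csum_offset + sizeof(__sum16) runs past the packet length. The struct and flag values are local copies of the quoted uapi definitions. meta_from_data(), request_csum() and csum_request_fits() are hypothetical helper names, not part of the patch, and the sketch assumes tx_metadata_len equals sizeof(struct xsk_tx_metadata).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Local copies of the definitions quoted from include/uapi/linux/if_xdp.h. */
#define XDP_TX_METADATA_TIMESTAMP (1 << 0)
#define XDP_TX_METADATA_CHECKSUM  (1 << 1)

struct xsk_tx_metadata {
	union {
		struct {
			uint32_t flags;
			uint16_t csum_start;  /* offset from desc->addr */
			uint16_t csum_offset; /* offset from csum_start */
		};
		struct {
			uint64_t tx_timestamp;
		} completion;
	};
};

/* Hypothetical helper: the metadata area ends where the packet data begins.
 * Assumes tx_metadata_len == sizeof(struct xsk_tx_metadata).
 */
static struct xsk_tx_metadata *meta_from_data(void *data)
{
	return (struct xsk_tx_metadata *)((char *)data -
					  sizeof(struct xsk_tx_metadata));
}

/* Hypothetical helper: fill in a checksum offload request. */
static void request_csum(struct xsk_tx_metadata *meta,
			 uint16_t csum_start, uint16_t csum_offset)
{
	meta->flags |= XDP_TX_METADATA_CHECKSUM;
	meta->csum_start = csum_start;
	meta->csum_offset = csum_offset;
}

/* Mirrors the bounds check in the quoted hunk; a 16-bit checksum
 * (sizeof(__sum16) == 2) must lie entirely within the packet.
 */
static bool csum_request_fits(const struct xsk_tx_metadata *meta, size_t len)
{
	return (size_t)meta->csum_start + meta->csum_offset +
	       sizeof(uint16_t) <= len;
}
```

With a 14-byte Ethernet header and a 40-byte IPv6 header, a UDP checksum request would use csum_start = 54 and csum_offset = 6, which fits in any packet of at least 62 bytes. The descriptor would additionally need XDP_TX_METADATA set in its options field.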
* [xdp-hints] Re: [PATCH bpf-next v3 02/10] xsk: add TX timestamp and TX checksum offload support
2023-10-04 6:18 ` [xdp-hints] " Song, Yoong Siang
@ 2023-10-04 17:48 ` Stanislav Fomichev
2023-10-04 17:56 ` Stanislav Fomichev
0 siblings, 1 reply; 25+ messages in thread
From: Stanislav Fomichev @ 2023-10-04 17:48 UTC (permalink / raw)
To: yoong.siang.song
Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
kpsingh, haoluo, jolsa, kuba, toke, willemb, dsahern,
magnus.karlsson, bjorn, maciej.fijalkowski, hawk, netdev,
xdp-hints
On 10/04, Song, Yoong Siang wrote:
> On Wednesday, October 4, 2023 4:05 AM Stanislav Fomichev <sdf@google.com> wrote:
> >This change actually defines the (initial) metadata layout that should be used by
> >AF_XDP userspace (xsk_tx_metadata).
> >The first field is flags which requests appropriate offloads, followed by the offload-
> >specific fields. The supported per-device offloads are exported via netlink (new
> >xsk-flags).
> >
> >The offloads themselves are still implemented in a bit of a framework-y fashion
> >that's left from my initial kfunc attempt.
> >I'm introducing new xsk_tx_metadata_ops which drivers are supposed to
> >implement. The drivers are also supposed to call
> >xsk_tx_metadata_request/xsk_tx_metadata_complete in the right places. Since
> >xsk_tx_metadata_{request,complete}
> >are static inline, we don't incur any extra overhead doing indirect calls.
> >
> >The benefit of this scheme is as follows:
> >- keeps all metadata layout parsing away from driver code
> >- makes it easy to grep and see which drivers implement what
> >- don't need any extra flags to maintain to keep track of what
> > offloads are implemented; if the callback is implemented - the offload
> > is supported (used by netlink reporting code)
> >
> >Two offloads are defined right now:
> >1. XDP_TX_METADATA_CHECKSUM: skb-style csum_start+csum_offset
> >2. XDP_TX_METADATA_TIMESTAMP: writes TX timestamp back into metadata
> > area upon completion (tx_timestamp field)
> >
> >The offloads are also implemented for copy mode:
> >1. Extra XDP_TX_METADATA_CHECKSUM_SW to trigger skb_checksum_help; this
> > might be useful as a reference implementation and for testing
> >2. XDP_TX_METADATA_TIMESTAMP writes SW timestamp from the skb
> > destructor (note I'm reusing hwtstamps to pass metadata pointer)
> >
> >The struct is forward-compatible and can be extended in the future by appending
> >more fields.
> >
> >Signed-off-by: Stanislav Fomichev <sdf@google.com>
> >---
> > Documentation/netlink/specs/netdev.yaml | 19 ++++++
> > include/linux/netdevice.h | 27 +++++++++
> > include/linux/skbuff.h | 14 ++++-
> > include/net/xdp_sock.h | 80 +++++++++++++++++++++++++
> > include/net/xdp_sock_drv.h | 13 ++++
> > include/net/xsk_buff_pool.h | 6 ++
> > include/uapi/linux/if_xdp.h | 40 +++++++++++++
> > include/uapi/linux/netdev.h | 16 +++++
> > net/core/netdev-genl.c | 12 +++-
> > net/xdp/xsk.c | 39 ++++++++++++
> > net/xdp/xsk_queue.h | 2 +-
> > tools/include/uapi/linux/if_xdp.h | 54 +++++++++++++++--
> > tools/include/uapi/linux/netdev.h | 16 +++++
> > tools/net/ynl/generated/netdev-user.c | 19 ++++++
> > tools/net/ynl/generated/netdev-user.h | 3 +
> > 15 files changed, 352 insertions(+), 8 deletions(-)
> >
> >diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
> >index c46fcc78fc04..3735c26c8646 100644
> >--- a/Documentation/netlink/specs/netdev.yaml
> >+++ b/Documentation/netlink/specs/netdev.yaml
> >@@ -55,6 +55,19 @@ name: netdev
> > name: hash
> > doc:
> > Device is capable of exposing receive packet hash via bpf_xdp_metadata_rx_hash().
> >+ -
> >+ type: flags
> >+ name: xsk-flags
> >+ render-max: true
> >+ entries:
> >+ -
> >+ name: tx-timestamp
> >+ doc:
> >+ HW timestamping egress packets is supported by the driver.
> >+ -
> >+ name: tx-checksum
> >+ doc:
> >+ L3 checksum HW offload is supported by the driver.
> >
> > attribute-sets:
> > -
> >@@ -88,6 +101,11 @@ name: netdev
> > type: u64
> > enum: xdp-rx-metadata
> > enum-as-flags: true
> >+ -
> >+ name: xsk-features
> >+ doc: Bitmask of enabled AF_XDP features.
> >+ type: u64
> >+ enum: xsk-flags
> >
> > operations:
> > list:
> >@@ -105,6 +123,7 @@ name: netdev
> > - xdp-features
> > - xdp-zc-max-segs
> > - xdp-rx-metadata-features
> >+ - xsk-features
> > dump:
> > reply: *dev-all
> > -
> >diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> >index 7e520c14eb8c..0e1cb026cbe5 100644
> >--- a/include/linux/netdevice.h
> >+++ b/include/linux/netdevice.h
> >@@ -1650,6 +1650,31 @@ struct net_device_ops {
> > struct netlink_ext_ack *extack);
> > };
> >
> >+/*
> >+ * This structure defines the AF_XDP TX metadata hooks for network devices.
> >+ * The following hooks can be defined; unless noted otherwise, they are
> >+ * optional and can be filled with a null pointer.
> >+ *
> >+ * int (*tmo_request_timestamp)(void *priv)
[..]
> Should be "void" instead of "int"
>
> >+ * This function is called when AF_XDP frame requested egress timestamp.
> >+ *
> >+ * int (*tmo_fill_timestamp)(void *priv)
>
> Should be "u64" instead of "int"
>
> >+ * This function is called when AF_XDP frame, that had requested
> >+ * egress timestamp, received a completion. The hook needs to return
> >+ * the actual HW timestamp.
> >+ *
> >+ * int (*tmo_request_checksum)(u16 csum_start, u16 csum_offset, void *priv)
>
> Should be "void" instead of "int"
Oh, good catch, will update these doc entries!
> >+ * This function is called when AF_XDP frame requested HW checksum
> >+ * offload. csum_start indicates position where checksumming should start.
> >+ * csum_offset indicates position where checksum should be stored.
> >+ *
> >+ */
> >+struct xsk_tx_metadata_ops {
> >+ void (*tmo_request_timestamp)(void *priv);
> >+ u64 (*tmo_fill_timestamp)(void *priv);
> >+ void (*tmo_request_checksum)(u16 csum_start, u16 csum_offset, void *priv);
> >+};
> >+
> > /**
> > * enum netdev_priv_flags - &struct net_device priv_flags
> > *
> >@@ -1838,6 +1863,7 @@ enum netdev_ml_priv_type {
> > * @netdev_ops: Includes several pointers to callbacks,
> > * if one wants to override the ndo_*() functions
> > * @xdp_metadata_ops: Includes pointers to XDP metadata callbacks.
> >+ * @xsk_tx_metadata_ops: Includes pointers to AF_XDP TX metadata callbacks.
> > * @ethtool_ops: Management operations
> > * @l3mdev_ops: Layer 3 master device operations
> > * @ndisc_ops: Includes callbacks for different IPv6 neighbour
> >@@ -2097,6 +2123,7 @@ struct net_device {
> > unsigned long long priv_flags;
> > const struct net_device_ops *netdev_ops;
> > const struct xdp_metadata_ops *xdp_metadata_ops;
> >+ const struct xsk_tx_metadata_ops *xsk_tx_metadata_ops;
> > int ifindex;
> > unsigned short gflags;
> > unsigned short hard_header_len;
> >diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> >index 4174c4b82d13..444d35dcd690 100644
> >--- a/include/linux/skbuff.h
> >+++ b/include/linux/skbuff.h
> >@@ -566,6 +566,15 @@ struct ubuf_info_msgzc {
> > int mm_account_pinned_pages(struct mmpin *mmp, size_t size);
> > void mm_unaccount_pinned_pages(struct mmpin *mmp);
> >
> >+/* Preserve some data across TX submission and completion.
> >+ *
> >+ * Note, this state is stored in the driver. Extending the layout
> >+ * might need some special care.
> >+ */
> >+struct xsk_tx_metadata_compl {
> >+ __u64 *tx_timestamp;
> >+};
> >+
> > /* This data is invariant across clones and lives at
> > * the end of the header data, ie. at skb->end.
> > */
> >@@ -578,7 +587,10 @@ struct skb_shared_info {
> > /* Warning: this field is not always filled in (UFO)! */
> > unsigned short gso_segs;
> > struct sk_buff *frag_list;
> >- struct skb_shared_hwtstamps hwtstamps;
> >+ union {
> >+ struct skb_shared_hwtstamps hwtstamps;
> >+ struct xsk_tx_metadata_compl xsk_meta;
> >+ };
> > unsigned int gso_type;
> > u32 tskey;
> >
> >diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> >index caa1f04106be..29427a69784d 100644
> >--- a/include/net/xdp_sock.h
> >+++ b/include/net/xdp_sock.h
> >@@ -92,6 +92,74 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
> > int __xsk_map_redirect(struct xdp_sock *xs, struct xdp_buff *xdp);
> > void __xsk_map_flush(void);
> >
> >+/**
> >+ * xsk_tx_metadata_to_compl - Save enough relevant metadata information
> >+ * to perform tx completion in the future.
> >+ * @meta: pointer to AF_XDP metadata area
> >+ * @compl: pointer to output struct xsk_tx_metadata_compl
> >+ *
> >+ * This function should be called by the networking device when
> >+ * it prepares AF_XDP egress packet. The value of @compl should be stored
> >+ * and passed to xsk_tx_metadata_complete upon TX completion.
> >+ */
> >+static inline void xsk_tx_metadata_to_compl(struct xsk_tx_metadata *meta,
> >+ struct xsk_tx_metadata_compl *compl)
> >+{
> >+ if (!meta)
> >+ return;
> >+
> >+ if (meta->flags & XDP_TX_METADATA_TIMESTAMP)
> >+ compl->tx_timestamp = &meta->completion.tx_timestamp;
> >+ else
> >+ compl->tx_timestamp = NULL;
> >+}
> >+
> >+/**
> >+ * xsk_tx_metadata_request - Evaluate AF_XDP TX metadata at submission
> >+ * and call appropriate xsk_tx_metadata_ops operation.
> >+ * @meta: pointer to AF_XDP metadata area
> >+ * @ops: pointer to struct xsk_tx_metadata_ops
> >+ * @priv: pointer to driver-private area
> >+ *
> >+ * This function should be called by the networking device when
> >+ * it prepares AF_XDP egress packet.
> >+ */
> >+static inline void xsk_tx_metadata_request(const struct xsk_tx_metadata *meta,
> >+ const struct xsk_tx_metadata_ops *ops,
> >+ void *priv)
> >+{
> >+ if (!meta)
> >+ return;
> >+
> >+ if (ops->tmo_request_timestamp)
> >+ if (meta->flags & XDP_TX_METADATA_TIMESTAMP)
> >+ ops->tmo_request_timestamp(priv);
> >+
> >+ if (ops->tmo_request_checksum)
> >+ if (meta->flags & XDP_TX_METADATA_CHECKSUM)
> >+ ops->tmo_request_checksum(meta->csum_start, meta->csum_offset, priv);
> >+}
> >+
> >+/**
> >+ * xsk_tx_metadata_complete - Evaluate AF_XDP TX metadata at completion
> >+ * and call appropriate xsk_tx_metadata_ops operation.
> >+ * @compl: pointer to completion metadata produced from xsk_tx_metadata_to_compl
> >+ * @ops: pointer to struct xsk_tx_metadata_ops
> >+ * @priv: pointer to driver-private area
> >+ *
> >+ * This function should be called by the networking device upon
> >+ * AF_XDP egress completion.
> >+ */
> >+static inline void xsk_tx_metadata_complete(struct xsk_tx_metadata_compl *compl,
> >+ const struct xsk_tx_metadata_ops *ops,
> >+ void *priv)
> >+{
> >+ if (!compl)
> >+ return;
> >+
> >+ *compl->tx_timestamp = ops->tmo_fill_timestamp(priv);
> >+}
> >+
> > #else
> >
> > static inline int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
> >@@ -108,6 +176,18 @@ static inline void __xsk_map_flush(void)
> > {
> > }
> >
> >+static inline void xsk_tx_metadata_request(struct xsk_tx_metadata *meta,
> >+ const struct xsk_tx_metadata_ops *ops,
> >+ void *priv)
> >+{
> >+}
> >+
> >+static inline void xsk_tx_metadata_complete(struct xsk_tx_metadata_compl *compl,
> >+ const struct xsk_tx_metadata_ops *ops,
> >+ void *priv)
> >+{
> >+}
> >+
> > #endif /* CONFIG_XDP_SOCKETS */
> >
> > #endif /* _LINUX_XDP_SOCK_H */
> >diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
> >index 1f6fc8c7a84c..e2558ac3e195 100644
> >--- a/include/net/xdp_sock_drv.h
> >+++ b/include/net/xdp_sock_drv.h
> >@@ -165,6 +165,14 @@ static inline void *xsk_buff_raw_get_data(struct xsk_buff_pool *pool, u64 addr)
> > return xp_raw_get_data(pool, addr);
> > }
> >
> >+static inline struct xsk_tx_metadata *xsk_buff_get_metadata(struct xsk_buff_pool *pool, u64 addr)
> >+{
> >+ if (!pool->tx_metadata_len)
> >+ return NULL;
> >+
> >+ return xp_raw_get_data(pool, addr) - pool->tx_metadata_len;
> >+}
> >+
> > static inline void xsk_buff_dma_sync_for_cpu(struct xdp_buff *xdp, struct xsk_buff_pool *pool)
> > {
> > struct xdp_buff_xsk *xskb = container_of(xdp, struct xdp_buff_xsk, xdp);
> >@@ -324,6 +332,11 @@ static inline void *xsk_buff_raw_get_data(struct xsk_buff_pool *pool, u64 addr)
> > return NULL;
> > }
> >
> >+static inline struct xsk_tx_metadata *xsk_buff_get_metadata(struct xsk_buff_pool *pool, u64 addr)
> >+{
> >+ return NULL;
> >+}
> >+
> > static inline void xsk_buff_dma_sync_for_cpu(struct xdp_buff *xdp, struct xsk_buff_pool *pool)
> > {
> > }
> >diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
> >index 1985ffaf9b0c..97f5cc10d79e 100644
> >--- a/include/net/xsk_buff_pool.h
> >+++ b/include/net/xsk_buff_pool.h
> >@@ -33,6 +33,7 @@ struct xdp_buff_xsk {
> > };
> >
> > #define XSK_CHECK_PRIV_TYPE(t) BUILD_BUG_ON(sizeof(t) > offsetofend(struct xdp_buff_xsk, cb))
> >+#define XSK_TX_COMPL_FITS(t) BUILD_BUG_ON(sizeof(struct xsk_tx_metadata_compl) > sizeof(t))
> >
> > struct xsk_dma_map {
> > dma_addr_t *dma_pages;
> >@@ -234,4 +235,9 @@ static inline u64 xp_get_handle(struct xdp_buff_xsk *xskb)
> > return xskb->orig_addr + (offset << XSK_UNALIGNED_BUF_OFFSET_SHIFT);
> > }
> >
> >+static inline bool xp_tx_metadata_enabled(const struct xsk_buff_pool *pool)
> >+{
> >+ return pool->tx_metadata_len > 0;
> >+}
> >+
> > #endif /* XSK_BUFF_POOL_H_ */
> >diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
> >index 2ecf79282c26..ecfd67988283 100644
> >--- a/include/uapi/linux/if_xdp.h
> >+++ b/include/uapi/linux/if_xdp.h
> >@@ -106,6 +106,43 @@ struct xdp_options {
> > #define XSK_UNALIGNED_BUF_ADDR_MASK \
> > ((1ULL << XSK_UNALIGNED_BUF_OFFSET_SHIFT) - 1)
> >
> >+/* Request transmit timestamp. Upon completion, put it into tx_timestamp
> >+ * field of struct xsk_tx_metadata.
> >+ */
> >+#define XDP_TX_METADATA_TIMESTAMP (1 << 0)
[..]
> Suggestion from checkpatch.pl:
> CHECK: Prefer using the BIT macro
>
> >+
> >+/* Request transmit checksum offload. Checksum start position and offset
> >+ * are communicated via csum_start and csum_offset fields of struct
> >+ * xsk_tx_metadata.
> >+ */
> >+#define XDP_TX_METADATA_CHECKSUM (1 << 1)
>
> Suggestion from checkpatch.pl:
> CHECK: Prefer using the BIT macro
>
> >+
> >+/* Force checksum calculation in software. Can be used for testing or
> >+ * working around potential HW issues. This option causes performance
> >+ * degradation and only works in XDP_COPY mode.
> >+ */
> >+#define XDP_TX_METADATA_CHECKSUM_SW (1 << 2)
>
> Suggestion from checkpatch.pl:
> CHECK: Prefer using the BIT macro
Will do! Hopefully nothing breaks, let me check...
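
One thing worth checking before the switch: as far as I know, BIT() is defined only in kernel-internal headers, so a uapi header that also has to build in userspace may not be able to use it directly. The sketch below is only a value-preservation check under that assumption. XSK_BIT() is a local, hypothetical stand-in rather than a kernel or uapi macro, and xsk_known_flags() is likewise an illustrative helper.

```c
#include <assert.h>

/* Hypothetical stand-in for a BIT()-style macro; not a kernel or uapi name. */
#define XSK_BIT(nr) (1U << (nr))

/* Flag definitions as quoted in the patch. */
#define XDP_TX_METADATA_TIMESTAMP   (1 << 0)
#define XDP_TX_METADATA_CHECKSUM    (1 << 1)
#define XDP_TX_METADATA_CHECKSUM_SW (1 << 2)

/* Compile-time checks that both spellings produce the same bit values. */
_Static_assert(XDP_TX_METADATA_TIMESTAMP == XSK_BIT(0), "timestamp bit");
_Static_assert(XDP_TX_METADATA_CHECKSUM == XSK_BIT(1), "checksum bit");
_Static_assert(XDP_TX_METADATA_CHECKSUM_SW == XSK_BIT(2), "sw checksum bit");

/* Illustrative helper: mask of all metadata flags defined so far. */
static unsigned int xsk_known_flags(void)
{
	return XDP_TX_METADATA_TIMESTAMP |
	       XDP_TX_METADATA_CHECKSUM |
	       XDP_TX_METADATA_CHECKSUM_SW;
}
```

If the macro spelling changes, the flag values seen by userspace stay the same as long as these assertions hold.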
* [xdp-hints] Re: [PATCH bpf-next v3 02/10] xsk: add TX timestamp and TX checksum offload support
2023-10-04 17:48 ` Stanislav Fomichev
@ 2023-10-04 17:56 ` Stanislav Fomichev
2023-10-05 1:16 ` Song, Yoong Siang
0 siblings, 1 reply; 25+ messages in thread
From: Stanislav Fomichev @ 2023-10-04 17:56 UTC (permalink / raw)
To: yoong.siang.song
Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
kpsingh, haoluo, jolsa, kuba, toke, willemb, dsahern,
magnus.karlsson, bjorn, maciej.fijalkowski, hawk, netdev,
xdp-hints
On 10/04, Stanislav Fomichev wrote:
> On 10/04, Song, Yoong Siang wrote:
> > On Wednesday, October 4, 2023 4:05 AM Stanislav Fomichev <sdf@google.com> wrote:
> > >This change actually defines the (initial) metadata layout that should be used by
> > >AF_XDP userspace (xsk_tx_metadata).
> > >The first field is flags which requests appropriate offloads, followed by the offload-
> > >specific fields. The supported per-device offloads are exported via netlink (new
> > >xsk-flags).
> > >
> > >The offloads themselves are still implemented in a bit of a framework-y fashion
> > >that's left from my initial kfunc attempt.
> > >I'm introducing new xsk_tx_metadata_ops which drivers are supposed to
> > >implement. The drivers are also supposed to call
> > >xsk_tx_metadata_request/xsk_tx_metadata_complete in the right places. Since
> > >xsk_tx_metadata_{request,complete} are static inline, we don't incur any
> > >extra overhead doing indirect calls.
> > >
> > >The benefit of this scheme is as follows:
> > >- keeps all metadata layout parsing away from driver code
> > >- makes it easy to grep and see which drivers implement what
> > >- don't need any extra flags to maintain to keep track of what
> > > offloads are implemented; if the callback is implemented - the offload
> > > is supported (used by netlink reporting code)
> > >
> > >Two offloads are defined right now:
> > >1. XDP_TX_METADATA_CHECKSUM: skb-style csum_start+csum_offset
> > >2. XDP_TX_METADATA_TIMESTAMP: writes TX timestamp back into metadata
> > >   area upon completion (tx_timestamp field)
> > >
> > >The offloads are also implemented for copy mode:
> > >1. Extra XDP_TX_METADATA_CHECKSUM_SW to trigger skb_checksum_help; this
> > >   might be useful as a reference implementation and for testing
> > >2. XDP_TX_METADATA_TIMESTAMP writes SW timestamp from the skb
> > >   destructor (note I'm reusing hwtstamps to pass metadata pointer)
> > >
> > >The struct is forward-compatible and can be extended in the future by appending
> > >more fields.
> > >
> > >Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > >---
> > > Documentation/netlink/specs/netdev.yaml | 19 ++++++
> > > include/linux/netdevice.h | 27 +++++++++
> > > include/linux/skbuff.h | 14 ++++-
> > > include/net/xdp_sock.h | 80 +++++++++++++++++++++++++
> > > include/net/xdp_sock_drv.h | 13 ++++
> > > include/net/xsk_buff_pool.h | 6 ++
> > > include/uapi/linux/if_xdp.h | 40 +++++++++++++
> > > include/uapi/linux/netdev.h | 16 +++++
> > > net/core/netdev-genl.c | 12 +++-
> > > net/xdp/xsk.c | 39 ++++++++++++
> > > net/xdp/xsk_queue.h | 2 +-
> > > tools/include/uapi/linux/if_xdp.h | 54 +++++++++++++++--
> > > tools/include/uapi/linux/netdev.h | 16 +++++
> > > tools/net/ynl/generated/netdev-user.c | 19 ++++++
> > > tools/net/ynl/generated/netdev-user.h | 3 +
> > > 15 files changed, 352 insertions(+), 8 deletions(-)
> > >
> > >diff --git a/Documentation/netlink/specs/netdev.yaml
> > >b/Documentation/netlink/specs/netdev.yaml
> > >index c46fcc78fc04..3735c26c8646 100644
> > >--- a/Documentation/netlink/specs/netdev.yaml
> > >+++ b/Documentation/netlink/specs/netdev.yaml
> > >@@ -55,6 +55,19 @@ name: netdev
> > > name: hash
> > > doc:
> > > Device is capable of exposing receive packet hash via
> > >bpf_xdp_metadata_rx_hash().
> > >+ -
> > >+ type: flags
> > >+ name: xsk-flags
> > >+ render-max: true
> > >+ entries:
> > >+ -
> > >+ name: tx-timestamp
> > >+ doc:
> > >+ HW timestamping egress packets is supported by the driver.
> > >+ -
> > >+ name: tx-checksum
> > >+ doc:
> > >+ L3 checksum HW offload is supported by the driver.
> > >
> > > attribute-sets:
> > > -
> > >@@ -88,6 +101,11 @@ name: netdev
> > > type: u64
> > > enum: xdp-rx-metadata
> > > enum-as-flags: true
> > >+ -
> > >+ name: xsk-features
> > >+ doc: Bitmask of enabled AF_XDP features.
> > >+ type: u64
> > >+ enum: xsk-flags
> > >
> > > operations:
> > > list:
> > >@@ -105,6 +123,7 @@ name: netdev
> > > - xdp-features
> > > - xdp-zc-max-segs
> > > - xdp-rx-metadata-features
> > >+ - xsk-features
> > > dump:
> > > reply: *dev-all
> > > -
> > >diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index
> > >7e520c14eb8c..0e1cb026cbe5 100644
> > >--- a/include/linux/netdevice.h
> > >+++ b/include/linux/netdevice.h
> > >@@ -1650,6 +1650,31 @@ struct net_device_ops {
> > > struct netlink_ext_ack
> > >*extack); };
> > >
> > >+/*
> > >+ * This structure defines the AF_XDP TX metadata hooks for network devices.
> > >+ * The following hooks can be defined; unless noted otherwise, they are
> > >+ * optional and can be filled with a null pointer.
> > >+ *
> > >+ * int (*tmo_request_timestamp)(void *priv)
>
> [..]
>
> > Should be "void" instead of "int"
> >
> > >+ * This function is called when AF_XDP frame requested egress timestamp.
> > >+ *
> > >+ * int (*tmo_fill_timestamp)(void *priv)
> >
> > Should be "u64" instead of "int"
> >
> > >+ * This function is called when AF_XDP frame, that had requested
> > >+ * egress timestamp, received a completion. The hook needs to return
> > >+ * the actual HW timestamp.
> > >+ *
> > >+ * int (*tmo_request_checksum)(u16 csum_start, u16 csum_offset, void *priv)
> >
> > Should be "void" instead of "int"
>
> Oh, good catch, will update these doc entries!
>
> > >+ * This function is called when AF_XDP frame requested HW checksum
> > >+ * offload. csum_start indicates position where checksumming should start.
> > >+ * csum_offset indicates position where checksum should be stored.
> > >+ *
> > >+ */
> > >+struct xsk_tx_metadata_ops {
> > >+ void (*tmo_request_timestamp)(void *priv);
> > >+ u64 (*tmo_fill_timestamp)(void *priv);
> > >+ void (*tmo_request_checksum)(u16 csum_start, u16 csum_offset, void
> > >*priv);
> > >+};
> > >+
> > > /**
> > > * enum netdev_priv_flags - &struct net_device priv_flags
> > > *
> > >@@ -1838,6 +1863,7 @@ enum netdev_ml_priv_type {
> > > * @netdev_ops: Includes several pointers to callbacks,
> > > * if one wants to override the ndo_*() functions
> > > * @xdp_metadata_ops: Includes pointers to XDP metadata callbacks.
> > >+ * @xsk_tx_metadata_ops: Includes pointers to AF_XDP TX
> > >metadata callbacks.
> > > * @ethtool_ops: Management operations
> > > * @l3mdev_ops: Layer 3 master device operations
> > > * @ndisc_ops: Includes callbacks for different IPv6 neighbour
> > >@@ -2097,6 +2123,7 @@ struct net_device {
> > > unsigned long long priv_flags;
> > > const struct net_device_ops *netdev_ops;
> > > const struct xdp_metadata_ops *xdp_metadata_ops;
> > >+ const struct xsk_tx_metadata_ops *xsk_tx_metadata_ops;
> > > int ifindex;
> > > unsigned short gflags;
> > > unsigned short hard_header_len;
> > >diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index
> > >4174c4b82d13..444d35dcd690 100644
> > >--- a/include/linux/skbuff.h
> > >+++ b/include/linux/skbuff.h
> > >@@ -566,6 +566,15 @@ struct ubuf_info_msgzc { int
> > >mm_account_pinned_pages(struct mmpin *mmp, size_t size); void
> > >mm_unaccount_pinned_pages(struct mmpin *mmp);
> > >
> > >+/* Preserve some data across TX submission and completion.
> > >+ *
> > >+ * Note, this state is stored in the driver. Extending the layout
> > >+ * might need some special care.
> > >+ */
> > >+struct xsk_tx_metadata_compl {
> > >+ __u64 *tx_timestamp;
> > >+};
> > >+
> > > /* This data is invariant across clones and lives at
> > > * the end of the header data, ie. at skb->end.
> > > */
> > >@@ -578,7 +587,10 @@ struct skb_shared_info {
> > > /* Warning: this field is not always filled in (UFO)! */
> > > unsigned short gso_segs;
> > > struct sk_buff *frag_list;
> > >- struct skb_shared_hwtstamps hwtstamps;
> > >+ union {
> > >+ struct skb_shared_hwtstamps hwtstamps;
> > >+ struct xsk_tx_metadata_compl xsk_meta;
> > >+ };
> > > unsigned int gso_type;
> > > u32 tskey;
> > >
> > >diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h index
> > >caa1f04106be..29427a69784d 100644
> > >--- a/include/net/xdp_sock.h
> > >+++ b/include/net/xdp_sock.h
> > >@@ -92,6 +92,74 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff
> > >*xdp); int __xsk_map_redirect(struct xdp_sock *xs, struct xdp_buff *xdp); void
> > >__xsk_map_flush(void);
> > >
> > >+/**
> > >+ * xsk_tx_metadata_to_compl - Save enough relevant metadata
> > >+information
> > >+ * to perform tx completion in the future.
> > >+ * @meta: pointer to AF_XDP metadata area
> > >+ * @compl: pointer to output struct xsk_tx_metadata_to_compl
> > >+ *
> > >+ * This function should be called by the networking device when
> > >+ * it prepares AF_XDP egress packet. The value of @compl should be
> > >+stored
> > >+ * and passed to xsk_tx_metadata_complete upon TX completion.
> > >+ */
> > >+static inline void xsk_tx_metadata_to_compl(struct xsk_tx_metadata *meta,
> > >+ struct xsk_tx_metadata_compl
> > >*compl) {
> > >+ if (!meta)
> > >+ return;
> > >+
> > >+ if (meta->flags & XDP_TX_METADATA_TIMESTAMP)
> > >+ compl->tx_timestamp = &meta->completion.tx_timestamp;
> > >+ else
> > >+ compl->tx_timestamp = NULL;
> > >+}
> > >+
> > >+/**
> > >+ * xsk_tx_metadata_request - Evaluate AF_XDP TX metadata at submission
> > >+ * and call appropriate xsk_tx_metadata_ops operation.
> > >+ * @meta: pointer to AF_XDP metadata area
> > >+ * @ops: pointer to struct xsk_tx_metadata_ops
> > >+ * @priv: pointer to driver-private area
> > >+ *
> > >+ * This function should be called by the networking device when
> > >+ * it prepares AF_XDP egress packet.
> > >+ */
> > >+static inline void xsk_tx_metadata_request(const struct xsk_tx_metadata *meta,
> > >+ const struct xsk_tx_metadata_ops
> > >*ops,
> > >+ void *priv)
> > >+{
> > >+ if (!meta)
> > >+ return;
> > >+
> > >+ if (ops->tmo_request_timestamp)
> > >+ if (meta->flags & XDP_TX_METADATA_TIMESTAMP)
> > >+ ops->tmo_request_timestamp(priv);
> > >+
> > >+ if (ops->tmo_request_checksum)
> > >+ if (meta->flags & XDP_TX_METADATA_CHECKSUM)
> > >+ ops->tmo_request_checksum(meta->csum_start, meta-
> > >>csum_offset,
> > >+priv); }
> > >+
> > >+/**
> > >+ * xsk_tx_metadata_complete - Evaluate AF_XDP TX metadata at
> > >+completion
> > >+ * and call appropriate xsk_tx_metadata_ops operation.
> > >+ * @compl: pointer to completion metadata produced from
> > >+xsk_tx_metadata_to_compl
> > >+ * @ops: pointer to struct xsk_tx_metadata_ops
> > >+ * @priv: pointer to driver-private area
> > >+ *
> > >+ * This function should be called by the networking device upon
> > >+ * AF_XDP egress completion.
> > >+ */
> > >+static inline void xsk_tx_metadata_complete(struct xsk_tx_metadata_compl
> > >*compl,
> > >+ const struct xsk_tx_metadata_ops
> > >*ops,
> > >+ void *priv)
> > >+{
> > >+ if (!compl)
> > >+ return;
> > >+
> > >+ *compl->tx_timestamp = ops->tmo_fill_timestamp(priv); }
> > >+
> > > #else
> > >
> > > static inline int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp) @@ -
> > >108,6 +176,18 @@ static inline void __xsk_map_flush(void) { }
> > >
> > >+static inline void xsk_tx_metadata_request(struct xsk_tx_metadata *meta,
> > >+ const struct xsk_tx_metadata_ops
> > >*ops,
> > >+ void *priv)
> > >+{
> > >+}
> > >+
> > >+static inline void xsk_tx_metadata_complete(struct xsk_tx_metadata_compl
> > >*compl,
> > >+ const struct xsk_tx_metadata_ops
> > >*ops,
> > >+ void *priv)
> > >+{
> > >+}
> > >+
> > > #endif /* CONFIG_XDP_SOCKETS */
> > >
> > > #endif /* _LINUX_XDP_SOCK_H */
> > >diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h index
> > >1f6fc8c7a84c..e2558ac3e195 100644
> > >--- a/include/net/xdp_sock_drv.h
> > >+++ b/include/net/xdp_sock_drv.h
> > >@@ -165,6 +165,14 @@ static inline void *xsk_buff_raw_get_data(struct
> > >xsk_buff_pool *pool, u64 addr)
> > > return xp_raw_get_data(pool, addr);
> > > }
> > >
> > >+static inline struct xsk_tx_metadata *xsk_buff_get_metadata(struct
> > >+xsk_buff_pool *pool, u64 addr) {
> > >+ if (!pool->tx_metadata_len)
> > >+ return NULL;
> > >+
> > >+ return xp_raw_get_data(pool, addr) - pool->tx_metadata_len; }
> > >+
> > > static inline void xsk_buff_dma_sync_for_cpu(struct xdp_buff *xdp, struct
> > >xsk_buff_pool *pool) {
> > > struct xdp_buff_xsk *xskb = container_of(xdp, struct xdp_buff_xsk, xdp);
> > >@@ -324,6 +332,11 @@ static inline void *xsk_buff_raw_get_data(struct
> > >xsk_buff_pool *pool, u64 addr)
> > > return NULL;
> > > }
> > >
> > >+static inline struct xsk_tx_metadata *xsk_buff_get_metadata(struct
> > >+xsk_buff_pool *pool, u64 addr) {
> > >+ return NULL;
> > >+}
> > >+
> > > static inline void xsk_buff_dma_sync_for_cpu(struct xdp_buff *xdp, struct
> > >xsk_buff_pool *pool) { } diff --git a/include/net/xsk_buff_pool.h
> > >b/include/net/xsk_buff_pool.h index 1985ffaf9b0c..97f5cc10d79e 100644
> > >--- a/include/net/xsk_buff_pool.h
> > >+++ b/include/net/xsk_buff_pool.h
> > >@@ -33,6 +33,7 @@ struct xdp_buff_xsk {
> > > };
> > >
> > > #define XSK_CHECK_PRIV_TYPE(t) BUILD_BUG_ON(sizeof(t) > offsetofend(struct
> > >xdp_buff_xsk, cb))
> > >+#define XSK_TX_COMPL_FITS(t) BUILD_BUG_ON(sizeof(struct
> > >+xsk_tx_metadata_compl) > sizeof(t))
> > >
> > > struct xsk_dma_map {
> > > dma_addr_t *dma_pages;
> > >@@ -234,4 +235,9 @@ static inline u64 xp_get_handle(struct xdp_buff_xsk *xskb)
> > > return xskb->orig_addr + (offset <<
> > >XSK_UNALIGNED_BUF_OFFSET_SHIFT); }
> > >
> > >+static inline bool xp_tx_metadata_enabled(const struct xsk_buff_pool
> > >+*pool) {
> > >+ return pool->tx_metadata_len > 0;
> > >+}
> > >+
> > > #endif /* XSK_BUFF_POOL_H_ */
> > >diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h index
> > >2ecf79282c26..ecfd67988283 100644
> > >--- a/include/uapi/linux/if_xdp.h
> > >+++ b/include/uapi/linux/if_xdp.h
> > >@@ -106,6 +106,43 @@ struct xdp_options { #define
> > >XSK_UNALIGNED_BUF_ADDR_MASK \
> > > ((1ULL << XSK_UNALIGNED_BUF_OFFSET_SHIFT) - 1)
> > >
> > >+/* Request transmit timestamp. Upon completion, put it into
> > >+tx_timestamp
> > >+ * field of struct xsk_tx_metadata.
> > >+ */
> > >+#define XDP_TX_METADATA_TIMESTAMP (1 << 0)
>
> [..]
>
> > Suggestion from checkpatch.pl:
> > CHECK: Prefer using the BIT macro
> >
> > >+
> > >+/* Request transmit checksum offload. Checksum start position and
> > >+offset
> > >+ * are communicated via csum_start and csum_offset fields of struct
> > >+ * xsk_tx_metadata.
> > >+ */
> > >+#define XDP_TX_METADATA_CHECKSUM (1 << 1)
> >
> > Suggestion from checkpatch.pl:
> > CHECK: Prefer using the BIT macro
> >
> > >+
> > >+/* Force checksum calculation in software. Can be used for testing or
> > >+ * working around potential HW issues. This option causes performance
> > >+ * degradation and only works in XDP_COPY mode.
> > >+ */
> > >+#define XDP_TX_METADATA_CHECKSUM_SW (1 << 2)
> >
> > Suggestion from checkpatch.pl:
> > CHECK: Prefer using the BIT macro
>
> Will do! Hopefully nothing breaks, let me check...
Yeah, looks like this part is not happy: BIT() doesn't appear to be
exported to UAPI, per:
check for #defines like: 1 << <digit> that could be BIT(digit), it is not exported to uapi
So I'll revert to << like in the rest of this file.
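
For illustration only, a minimal userspace sketch of what keeping << means for a UAPI consumer: the three flag values quoted above are spelled with plain shifts, so they build without the kernel-internal BIT() helper. The flag names and values come from the patch; demo_wants_timestamp() and demo_wants_hw_csum() are hypothetical helpers that are not part of the series.

```c
/* Flag values as quoted from the patch. Plain shifts need no kernel-only
 * macro, so a userspace program can include the definitions as-is.
 */
#define XDP_TX_METADATA_TIMESTAMP	(1 << 0)
#define XDP_TX_METADATA_CHECKSUM	(1 << 1)
#define XDP_TX_METADATA_CHECKSUM_SW	(1 << 2)

/* Hypothetical helper (not in the series): does the flag word request a
 * TX timestamp?
 */
static inline int demo_wants_timestamp(unsigned int flags)
{
	return (flags & XDP_TX_METADATA_TIMESTAMP) != 0;
}

/* Hypothetical helper (not in the series): does the flag word request HW
 * checksum offload, i.e. CHECKSUM set without the software override?
 */
static inline int demo_wants_hw_csum(unsigned int flags)
{
	return (flags & XDP_TX_METADATA_CHECKSUM) &&
	       !(flags & XDP_TX_METADATA_CHECKSUM_SW);
}
```

A caller would OR the flags together, e.g. XDP_TX_METADATA_TIMESTAMP | XDP_TX_METADATA_CHECKSUM, and the resulting value is the same as it would have been with BIT().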
^ permalink raw reply [flat|nested] 25+ messages in thread
* [xdp-hints] Re: [PATCH bpf-next v3 05/10] net: stmmac: Add Tx HWTS support to XDP ZC
2023-10-03 20:05 ` [xdp-hints] [PATCH bpf-next v3 05/10] net: stmmac: Add Tx HWTS support to XDP ZC Stanislav Fomichev
@ 2023-10-04 23:05 ` kernel test robot
2023-10-04 23:14 ` Stanislav Fomichev
2023-10-06 4:38 ` kernel test robot
1 sibling, 1 reply; 25+ messages in thread
From: kernel test robot @ 2023-10-04 23:05 UTC (permalink / raw)
To: Stanislav Fomichev, bpf
Cc: oe-kbuild-all, ast, daniel, andrii, martin.lau, song, yhs,
john.fastabend, kpsingh, sdf, haoluo, jolsa, kuba, toke, willemb,
dsahern, magnus.karlsson, bjorn, maciej.fijalkowski, hawk,
yoong.siang.song, netdev, xdp-hints
Hi Stanislav,
kernel test robot noticed the following build errors:
[auto build test ERROR on bpf-next/master]
url: https://github.com/intel-lab-lkp/linux/commits/Stanislav-Fomichev/xsk-Support-tx_metadata_len/20231004-040718
base: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
patch link: https://lore.kernel.org/r/20231003200522.1914523-6-sdf%40google.com
patch subject: [PATCH bpf-next v3 05/10] net: stmmac: Add Tx HWTS support to XDP ZC
config: riscv-defconfig (https://download.01.org/0day-ci/archive/20231005/202310050607.UQ0bU3ct-lkp@intel.com/config)
compiler: riscv64-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231005/202310050607.UQ0bU3ct-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202310050607.UQ0bU3ct-lkp@intel.com/
All errors (new ones prefixed by >>):
drivers/net/ethernet/stmicro/stmmac/stmmac_main.c: In function 'stmmac_xdp_xmit_zc':
>> drivers/net/ethernet/stmicro/stmmac/stmmac_main.c:2554:17: error: implicit declaration of function 'xsk_tx_metadata_to_compl'; did you mean 'xsk_tx_metadata_complete'? [-Werror=implicit-function-declaration]
2554 | xsk_tx_metadata_to_compl(meta, &tx_q->tx_skbuff_dma[entry].xsk_meta);
| ^~~~~~~~~~~~~~~~~~~~~~~~
| xsk_tx_metadata_complete
cc1: some warnings being treated as errors
vim +2554 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
2464
2465 static bool stmmac_xdp_xmit_zc(struct stmmac_priv *priv, u32 queue, u32 budget)
2466 {
2467 struct netdev_queue *nq = netdev_get_tx_queue(priv->dev, queue);
2468 struct stmmac_tx_queue *tx_q = &priv->dma_conf.tx_queue[queue];
2469 struct stmmac_txq_stats *txq_stats = &priv->xstats.txq_stats[queue];
2470 struct xsk_buff_pool *pool = tx_q->xsk_pool;
2471 unsigned int entry = tx_q->cur_tx;
2472 struct dma_desc *tx_desc = NULL;
2473 struct xdp_desc xdp_desc;
2474 bool work_done = true;
2475 u32 tx_set_ic_bit = 0;
2476 unsigned long flags;
2477
2478 /* Avoids TX time-out as we are sharing with slow path */
2479 txq_trans_cond_update(nq);
2480
2481 budget = min(budget, stmmac_tx_avail(priv, queue));
2482
2483 while (budget-- > 0) {
2484 struct stmmac_metadata_request meta_req;
2485 struct xsk_tx_metadata *meta = NULL;
2486 dma_addr_t dma_addr;
2487 bool set_ic;
2488
2489 /* We are sharing with slow path and stop XSK TX desc submission when
2490 * available TX ring is less than threshold.
2491 */
2492 if (unlikely(stmmac_tx_avail(priv, queue) < STMMAC_TX_XSK_AVAIL) ||
2493 !netif_carrier_ok(priv->dev)) {
2494 work_done = false;
2495 break;
2496 }
2497
2498 if (!xsk_tx_peek_desc(pool, &xdp_desc))
2499 break;
2500
2501 if (likely(priv->extend_desc))
2502 tx_desc = (struct dma_desc *)(tx_q->dma_etx + entry);
2503 else if (tx_q->tbs & STMMAC_TBS_AVAIL)
2504 tx_desc = &tx_q->dma_entx[entry].basic;
2505 else
2506 tx_desc = tx_q->dma_tx + entry;
2507
2508 dma_addr = xsk_buff_raw_get_dma(pool, xdp_desc.addr);
2509 meta = xsk_buff_get_metadata(pool, xdp_desc.addr);
2510 xsk_buff_raw_dma_sync_for_device(pool, dma_addr, xdp_desc.len);
2511
2512 tx_q->tx_skbuff_dma[entry].buf_type = STMMAC_TXBUF_T_XSK_TX;
2513
2514 /* To return XDP buffer to XSK pool, we simple call
2515 * xsk_tx_completed(), so we don't need to fill up
2516 * 'buf' and 'xdpf'.
2517 */
2518 tx_q->tx_skbuff_dma[entry].buf = 0;
2519 tx_q->xdpf[entry] = NULL;
2520
2521 tx_q->tx_skbuff_dma[entry].map_as_page = false;
2522 tx_q->tx_skbuff_dma[entry].len = xdp_desc.len;
2523 tx_q->tx_skbuff_dma[entry].last_segment = true;
2524 tx_q->tx_skbuff_dma[entry].is_jumbo = false;
2525
2526 stmmac_set_desc_addr(priv, tx_desc, dma_addr);
2527
2528 tx_q->tx_count_frames++;
2529
2530 if (!priv->tx_coal_frames[queue])
2531 set_ic = false;
2532 else if (tx_q->tx_count_frames % priv->tx_coal_frames[queue] == 0)
2533 set_ic = true;
2534 else
2535 set_ic = false;
2536
2537 meta_req.priv = priv;
2538 meta_req.tx_desc = tx_desc;
2539 meta_req.set_ic = &set_ic;
2540 xsk_tx_metadata_request(meta, &stmmac_xsk_tx_metadata_ops, &meta_req);
2541
2542 if (set_ic) {
2543 tx_q->tx_count_frames = 0;
2544 stmmac_set_tx_ic(priv, tx_desc);
2545 tx_set_ic_bit++;
2546 }
2547
2548 stmmac_prepare_tx_desc(priv, tx_desc, 1, xdp_desc.len,
2549 true, priv->mode, true, true,
2550 xdp_desc.len);
2551
2552 stmmac_enable_dma_transmission(priv, priv->ioaddr);
2553
> 2554 xsk_tx_metadata_to_compl(meta, &tx_q->tx_skbuff_dma[entry].xsk_meta);
2555
2556 tx_q->cur_tx = STMMAC_GET_ENTRY(tx_q->cur_tx, priv->dma_conf.dma_tx_size);
2557 entry = tx_q->cur_tx;
2558 }
2559 flags = u64_stats_update_begin_irqsave(&txq_stats->syncp);
2560 txq_stats->tx_set_ic_bit += tx_set_ic_bit;
2561 u64_stats_update_end_irqrestore(&txq_stats->syncp, flags);
2562
2563 if (tx_desc) {
2564 stmmac_flush_tx_descriptors(priv, queue);
2565 xsk_tx_release(pool);
2566 }
2567
2568 /* Return true if all of the 3 conditions are met
2569 * a) TX Budget is still available
2570 * b) work_done = true when XSK TX desc peek is empty (no more
2571 * pending XSK TX for transmission)
2572 */
2573 return !!budget && work_done;
2574 }
2575
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 25+ messages in thread
* [xdp-hints] Re: [PATCH bpf-next v3 05/10] net: stmmac: Add Tx HWTS support to XDP ZC
2023-10-04 23:05 ` [xdp-hints] " kernel test robot
@ 2023-10-04 23:14 ` Stanislav Fomichev
0 siblings, 0 replies; 25+ messages in thread
From: Stanislav Fomichev @ 2023-10-04 23:14 UTC (permalink / raw)
To: kernel test robot
Cc: bpf, oe-kbuild-all, ast, daniel, andrii, martin.lau, song, yhs,
john.fastabend, kpsingh, haoluo, jolsa, kuba, toke, willemb,
dsahern, magnus.karlsson, bjorn, maciej.fijalkowski, hawk,
yoong.siang.song, netdev, xdp-hints
On 10/05, kernel test robot wrote:
> Hi Stanislav,
>
> kernel test robot noticed the following build errors:
>
> [auto build test ERROR on bpf-next/master]
>
> url: https://github.com/intel-lab-lkp/linux/commits/Stanislav-Fomichev/xsk-Support-tx_metadata_len/20231004-040718
> base: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
> patch link: https://lore.kernel.org/r/20231003200522.1914523-6-sdf%40google.com
> patch subject: [PATCH bpf-next v3 05/10] net: stmmac: Add Tx HWTS support to XDP ZC
> config: riscv-defconfig (https://download.01.org/0day-ci/archive/20231005/202310050607.UQ0bU3ct-lkp@intel.com/config)
> compiler: riscv64-linux-gcc (GCC) 13.2.0
> reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231005/202310050607.UQ0bU3ct-lkp@intel.com/reproduce)
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <lkp@intel.com>
> | Closes: https://lore.kernel.org/oe-kbuild-all/202310050607.UQ0bU3ct-lkp@intel.com/
>
> All errors (new ones prefixed by >>):
>
> drivers/net/ethernet/stmicro/stmmac/stmmac_main.c: In function 'stmmac_xdp_xmit_zc':
> >> drivers/net/ethernet/stmicro/stmmac/stmmac_main.c:2554:17: error: implicit declaration of function 'xsk_tx_metadata_to_compl'; did you mean 'xsk_tx_metadata_complete'? [-Werror=implicit-function-declaration]
> 2554 | xsk_tx_metadata_to_compl(meta, &tx_q->tx_skbuff_dma[entry].xsk_meta);
> | ^~~~~~~~~~~~~~~~~~~~~~~~
> | xsk_tx_metadata_complete
> cc1: some warnings being treated as errors
The "static inline xsk_tx_metadata_to_compl" stub is missing for
!CONFIG_XDP_SOCKETS. Will fix in the patch where I add
xsk_tx_metadata_to_compl...
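
A rough sketch of the usual stub pattern, so the shape of the fix is clear; this is not the actual follow-up patch. The struct definitions are simplified userspace stand-ins (the real xsk_tx_metadata keeps the timestamp under completion.tx_timestamp), and the non-stub body just mirrors the helper quoted earlier in the thread.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins for the kernel types so the sketch builds in
 * userspace (assumption: the field layout is flattened for brevity).
 */
struct xsk_tx_metadata {
	uint32_t flags;
	uint64_t tx_timestamp;
};

struct xsk_tx_metadata_compl {
	uint64_t *tx_timestamp;
};

#define XDP_TX_METADATA_TIMESTAMP	(1 << 0)

#ifdef CONFIG_XDP_SOCKETS

/* Real helper: remember where the completion timestamp should be written. */
static inline void xsk_tx_metadata_to_compl(struct xsk_tx_metadata *meta,
					    struct xsk_tx_metadata_compl *compl)
{
	if (!meta)
		return;

	if (meta->flags & XDP_TX_METADATA_TIMESTAMP)
		compl->tx_timestamp = &meta->tx_timestamp;
	else
		compl->tx_timestamp = NULL;
}

#else

/* Empty stub: callers in drivers still see a declaration, so the build
 * no longer trips over an implicit function declaration.
 */
static inline void xsk_tx_metadata_to_compl(struct xsk_tx_metadata *meta,
					    struct xsk_tx_metadata_compl *compl)
{
}

#endif /* CONFIG_XDP_SOCKETS */
```

With CONFIG_XDP_SOCKETS undefined the stub leaves *compl untouched, which matches how the existing xsk_tx_metadata_request/xsk_tx_metadata_complete stubs behave in the #else branch of xdp_sock.h.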
^ permalink raw reply [flat|nested] 25+ messages in thread
* [xdp-hints] Re: [PATCH bpf-next v3 04/10] net/mlx5e: Implement AF_XDP TX timestamp and checksum offload
2023-10-03 20:05 ` [xdp-hints] [PATCH bpf-next v3 04/10] net/mlx5e: Implement AF_XDP TX timestamp and checksum offload Stanislav Fomichev
@ 2023-10-04 23:47 ` kernel test robot
0 siblings, 0 replies; 25+ messages in thread
From: kernel test robot @ 2023-10-04 23:47 UTC (permalink / raw)
To: Stanislav Fomichev, bpf
Cc: oe-kbuild-all, ast, daniel, andrii, martin.lau, song, yhs,
john.fastabend, kpsingh, sdf, haoluo, jolsa, kuba, toke, willemb,
dsahern, magnus.karlsson, bjorn, maciej.fijalkowski, hawk,
yoong.siang.song, netdev, xdp-hints, Saeed Mahameed
Hi Stanislav,
kernel test robot noticed the following build errors:
[auto build test ERROR on bpf-next/master]
url: https://github.com/intel-lab-lkp/linux/commits/Stanislav-Fomichev/xsk-Support-tx_metadata_len/20231004-040718
base: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
patch link: https://lore.kernel.org/r/20231003200522.1914523-5-sdf%40google.com
patch subject: [PATCH bpf-next v3 04/10] net/mlx5e: Implement AF_XDP TX timestamp and checksum offload
config: s390-defconfig (https://download.01.org/0day-ci/archive/20231005/202310050738.ZFOKzSlA-lkp@intel.com/config)
compiler: s390-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231005/202310050738.ZFOKzSlA-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202310050738.ZFOKzSlA-lkp@intel.com/
All errors (new ones prefixed by >>):
drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.c: In function 'mlx5e_xsk_tx':
>> drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.c:117:33: error: implicit declaration of function 'xsk_tx_metadata_to_compl'; did you mean 'xsk_tx_metadata_complete'? [-Werror=implicit-function-declaration]
117 | xsk_tx_metadata_to_compl(meta, &compl);
| ^~~~~~~~~~~~~~~~~~~~~~~~
| xsk_tx_metadata_complete
cc1: some warnings being treated as errors
vim +117 drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.c
63
64 bool mlx5e_xsk_tx(struct mlx5e_xdpsq *sq, unsigned int budget)
65 {
66 struct xsk_buff_pool *pool = sq->xsk_pool;
67 struct xsk_tx_metadata *meta = NULL;
68 union mlx5e_xdp_info xdpi;
69 bool work_done = true;
70 bool flush = false;
71
72 xdpi.mode = MLX5E_XDP_XMIT_MODE_XSK;
73
74 for (; budget; budget--) {
75 int check_result = INDIRECT_CALL_2(sq->xmit_xdp_frame_check,
76 mlx5e_xmit_xdp_frame_check_mpwqe,
77 mlx5e_xmit_xdp_frame_check,
78 sq);
79 struct mlx5e_xmit_data xdptxd = {};
80 struct xdp_desc desc;
81 bool ret;
82
83 if (unlikely(check_result < 0)) {
84 work_done = false;
85 break;
86 }
87
88 if (!xsk_tx_peek_desc(pool, &desc)) {
89 /* TX will get stuck until something wakes it up by
90 * triggering NAPI. Currently it's expected that the
91 * application calls sendto() if there are consumed, but
92 * not completed frames.
93 */
94 break;
95 }
96
97 xdptxd.dma_addr = xsk_buff_raw_get_dma(pool, desc.addr);
98 xdptxd.data = xsk_buff_raw_get_data(pool, desc.addr);
99 xdptxd.len = desc.len;
100 meta = xsk_buff_get_metadata(pool, desc.addr);
101
102 xsk_buff_raw_dma_sync_for_device(pool, xdptxd.dma_addr, xdptxd.len);
103
104 ret = INDIRECT_CALL_2(sq->xmit_xdp_frame, mlx5e_xmit_xdp_frame_mpwqe,
105 mlx5e_xmit_xdp_frame, sq, &xdptxd,
106 check_result, meta);
107 if (unlikely(!ret)) {
108 if (sq->mpwqe.wqe)
109 mlx5e_xdp_mpwqe_complete(sq);
110
111 mlx5e_xsk_tx_post_err(sq, &xdpi);
112 } else {
113 mlx5e_xdpi_fifo_push(&sq->db.xdpi_fifo, xdpi);
114 if (xp_tx_metadata_enabled(sq->xsk_pool)) {
115 struct xsk_tx_metadata_compl compl;
116
> 117 xsk_tx_metadata_to_compl(meta, &compl);
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 25+ messages in thread
* [xdp-hints] Re: [PATCH bpf-next v3 02/10] xsk: add TX timestamp and TX checksum offload support
2023-10-04 17:56 ` Stanislav Fomichev
@ 2023-10-05 1:16 ` Song, Yoong Siang
0 siblings, 0 replies; 25+ messages in thread
From: Song, Yoong Siang @ 2023-10-05 1:16 UTC (permalink / raw)
To: Stanislav Fomichev
Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
kpsingh, haoluo, jolsa, kuba, toke, willemb, dsahern, Karlsson,
Magnus, bjorn, Fijalkowski, Maciej, hawk, netdev, xdp-hints
On Thursday, October 5, 2023 1:57 AM Stanislav Fomichev <sdf@google.com> wrote:
>Yeah, looks like this part is not happy, doesn't look like BIT() is exported to UAPI,
>per:
>
>check for #defines like: 1 << <digit> that could be BIT(digit), it is not exported to
>uapi
Noted.
>
>So I'll revert to << like in the rest of this file.
Sure. I am ok to keep <<
^ permalink raw reply [flat|nested] 25+ messages in thread
* [xdp-hints] Re: [PATCH bpf-next v3 05/10] net: stmmac: Add Tx HWTS support to XDP ZC
2023-10-03 20:05 ` [xdp-hints] [PATCH bpf-next v3 05/10] net: stmmac: Add Tx HWTS support to XDP ZC Stanislav Fomichev
2023-10-04 23:05 ` [xdp-hints] " kernel test robot
@ 2023-10-06 4:38 ` kernel test robot
1 sibling, 0 replies; 25+ messages in thread
From: kernel test robot @ 2023-10-06 4:38 UTC (permalink / raw)
To: Stanislav Fomichev, bpf
Cc: llvm, oe-kbuild-all, ast, daniel, andrii, martin.lau, song, yhs,
john.fastabend, kpsingh, sdf, haoluo, jolsa, kuba, toke, willemb,
dsahern, magnus.karlsson, bjorn, maciej.fijalkowski, hawk,
yoong.siang.song, netdev, xdp-hints
Hi Stanislav,
kernel test robot noticed the following build errors:
[auto build test ERROR on bpf-next/master]
url: https://github.com/intel-lab-lkp/linux/commits/Stanislav-Fomichev/xsk-Support-tx_metadata_len/20231004-040718
base: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
patch link: https://lore.kernel.org/r/20231003200522.1914523-6-sdf%40google.com
patch subject: [PATCH bpf-next v3 05/10] net: stmmac: Add Tx HWTS support to XDP ZC
config: riscv-rv32_defconfig (https://download.01.org/0day-ci/archive/20231006/202310061208.tATL34LR-lkp@intel.com/config)
compiler: clang version 17.0.0 (https://github.com/llvm/llvm-project.git 4a5ac14ee968ff0ad5d2cc1ffa0299048db4c88a)
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231006/202310061208.tATL34LR-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202310061208.tATL34LR-lkp@intel.com/
All errors (new ones prefixed by >>):
>> drivers/net/ethernet/stmicro/stmmac/stmmac_main.c:2554:3: error: call to undeclared function 'xsk_tx_metadata_to_compl'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration]
2554 | xsk_tx_metadata_to_compl(meta, &tx_q->tx_skbuff_dma[entry].xsk_meta);
| ^
drivers/net/ethernet/stmicro/stmmac/stmmac_main.c:2554:3: note: did you mean 'xsk_tx_metadata_complete'?
include/net/xdp_sock.h:185:20: note: 'xsk_tx_metadata_complete' declared here
185 | static inline void xsk_tx_metadata_complete(struct xsk_tx_metadata_compl *compl,
| ^
1 error generated.
vim +/xsk_tx_metadata_to_compl +2554 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
2464
2465 static bool stmmac_xdp_xmit_zc(struct stmmac_priv *priv, u32 queue, u32 budget)
2466 {
2467 struct netdev_queue *nq = netdev_get_tx_queue(priv->dev, queue);
2468 struct stmmac_tx_queue *tx_q = &priv->dma_conf.tx_queue[queue];
2469 struct stmmac_txq_stats *txq_stats = &priv->xstats.txq_stats[queue];
2470 struct xsk_buff_pool *pool = tx_q->xsk_pool;
2471 unsigned int entry = tx_q->cur_tx;
2472 struct dma_desc *tx_desc = NULL;
2473 struct xdp_desc xdp_desc;
2474 bool work_done = true;
2475 u32 tx_set_ic_bit = 0;
2476 unsigned long flags;
2477
2478 /* Avoids TX time-out as we are sharing with slow path */
2479 txq_trans_cond_update(nq);
2480
2481 budget = min(budget, stmmac_tx_avail(priv, queue));
2482
2483 while (budget-- > 0) {
2484 struct stmmac_metadata_request meta_req;
2485 struct xsk_tx_metadata *meta = NULL;
2486 dma_addr_t dma_addr;
2487 bool set_ic;
2488
2489 /* We are sharing with slow path and stop XSK TX desc submission when
2490 * available TX ring is less than threshold.
2491 */
2492 if (unlikely(stmmac_tx_avail(priv, queue) < STMMAC_TX_XSK_AVAIL) ||
2493 !netif_carrier_ok(priv->dev)) {
2494 work_done = false;
2495 break;
2496 }
2497
2498 if (!xsk_tx_peek_desc(pool, &xdp_desc))
2499 break;
2500
2501 if (likely(priv->extend_desc))
2502 tx_desc = (struct dma_desc *)(tx_q->dma_etx + entry);
2503 else if (tx_q->tbs & STMMAC_TBS_AVAIL)
2504 tx_desc = &tx_q->dma_entx[entry].basic;
2505 else
2506 tx_desc = tx_q->dma_tx + entry;
2507
2508 dma_addr = xsk_buff_raw_get_dma(pool, xdp_desc.addr);
2509 meta = xsk_buff_get_metadata(pool, xdp_desc.addr);
2510 xsk_buff_raw_dma_sync_for_device(pool, dma_addr, xdp_desc.len);
2511
2512 tx_q->tx_skbuff_dma[entry].buf_type = STMMAC_TXBUF_T_XSK_TX;
2513
2514 /* To return XDP buffer to XSK pool, we simple call
2515 * xsk_tx_completed(), so we don't need to fill up
2516 * 'buf' and 'xdpf'.
2517 */
2518 tx_q->tx_skbuff_dma[entry].buf = 0;
2519 tx_q->xdpf[entry] = NULL;
2520
2521 tx_q->tx_skbuff_dma[entry].map_as_page = false;
2522 tx_q->tx_skbuff_dma[entry].len = xdp_desc.len;
2523 tx_q->tx_skbuff_dma[entry].last_segment = true;
2524 tx_q->tx_skbuff_dma[entry].is_jumbo = false;
2525
2526 stmmac_set_desc_addr(priv, tx_desc, dma_addr);
2527
2528 tx_q->tx_count_frames++;
2529
2530 if (!priv->tx_coal_frames[queue])
2531 set_ic = false;
2532 else if (tx_q->tx_count_frames % priv->tx_coal_frames[queue] == 0)
2533 set_ic = true;
2534 else
2535 set_ic = false;
2536
2537 meta_req.priv = priv;
2538 meta_req.tx_desc = tx_desc;
2539 meta_req.set_ic = &set_ic;
2540 xsk_tx_metadata_request(meta, &stmmac_xsk_tx_metadata_ops, &meta_req);
2541
2542 if (set_ic) {
2543 tx_q->tx_count_frames = 0;
2544 stmmac_set_tx_ic(priv, tx_desc);
2545 tx_set_ic_bit++;
2546 }
2547
2548 stmmac_prepare_tx_desc(priv, tx_desc, 1, xdp_desc.len,
2549 true, priv->mode, true, true,
2550 xdp_desc.len);
2551
2552 stmmac_enable_dma_transmission(priv, priv->ioaddr);
2553
> 2554 xsk_tx_metadata_to_compl(meta, &tx_q->tx_skbuff_dma[entry].xsk_meta);
2555
2556 tx_q->cur_tx = STMMAC_GET_ENTRY(tx_q->cur_tx, priv->dma_conf.dma_tx_size);
2557 entry = tx_q->cur_tx;
2558 }
2559 flags = u64_stats_update_begin_irqsave(&txq_stats->syncp);
2560 txq_stats->tx_set_ic_bit += tx_set_ic_bit;
2561 u64_stats_update_end_irqrestore(&txq_stats->syncp, flags);
2562
2563 if (tx_desc) {
2564 stmmac_flush_tx_descriptors(priv, queue);
2565 xsk_tx_release(pool);
2566 }
2567
2568 /* Return true if all of the 3 conditions are met
2569 * a) TX Budget is still available
2570 * b) work_done = true when XSK TX desc peek is empty (no more
2571 * pending XSK TX for transmission)
2572 */
2573 return !!budget && work_done;
2574 }
2575
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 25+ messages in thread
* [xdp-hints] Re: [PATCH bpf-next v3 09/10] selftests/bpf: Add TX side to xdp_hw_metadata
2023-10-03 20:05 ` [xdp-hints] [PATCH bpf-next v3 09/10] selftests/bpf: Add TX side to xdp_hw_metadata Stanislav Fomichev
@ 2023-10-09 8:12 ` Jesper Dangaard Brouer
2023-10-09 16:37 ` Stanislav Fomichev
0 siblings, 1 reply; 25+ messages in thread
From: Jesper Dangaard Brouer @ 2023-10-09 8:12 UTC (permalink / raw)
To: Stanislav Fomichev, bpf
Cc: hawk, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
kpsingh, haoluo, jolsa, kuba, toke, willemb, dsahern,
magnus.karlsson, bjorn, maciej.fijalkowski, yoong.siang.song,
netdev, xdp-hints
On 03/10/2023 22.05, Stanislav Fomichev wrote:
> When we get a packet on port 9091, we swap src/dst and send it out.
> At this point we also request the timestamp and checksum offloads.
>
> Checksum offload is verified by looking at the tcpdump on the other side.
> The tool prints pseudo-header csum and the final one it expects.
> The final checksum actually matches the incoming packets checksum
> because we only flip the src/dst and don't change the payload.
>
> Some other related changes:
> - switched to zerocopy mode by default; new flag can be used to force
> old behavior
> - request fixed tx_metadata_len headroom
> - some other small fixes (umem size, fill idx+i, etc)
>
> mvbz3:~# ./xdp_hw_metadata eth3
> ...
> 0x1062cb8: rx_desc[0]->addr=80100 addr=80100 comp_addr=80100
> rx_hash: 0x2E1B50B9 with RSS type:0x2A
> rx_timestamp: 1691436369532047139 (sec:1691436369.5320)
> XDP RX-time: 1691436369261756803 (sec:1691436369.2618) delta sec:-0.2703 (-270290.336 usec)
I guess system time isn't configured to be in sync with NIC HW time,
as delta is a negative offset.
> AF_XDP time: 1691436369261878839 (sec:1691436369.2619) delta sec:0.0001 (122.036 usec)
The AF_XDP time is also software system time and compared to XDP
RX-time, it shows a delta of 122 usec. This number indicates to me that
the CPU is likely configured with power saving sleep states.
> 0x1062cb8: ping-pong with csum=3b8e (want de7e) csum_start=54 csum_offset=6
> 0x1062cb8: complete tx idx=0 addr=10
> 0x1062cb8: tx_timestamp: 1691436369598419505 (sec:1691436369.5984)
Could we add something that we can relate tx_timestamp to?
Like we do with the other delta calculations, as it helps us to
understand/validate whether the number we get back is sane. For example,
the negative offset above tells us that system time sync isn't
configured, and the 122 usec delta suggests that the system has
C-states configured.
I suggest comparing "tx_timestamp" to "rx_timestamp" as a delta, as they
are in the same clock domain. It will tell us the total time spent from
HW RX to HW TX, counting all the time used by the software "ping-pong".
1691436369.5984-1691436369.5320 = 0.0664 sec = 66.4 ms
When implementing this, should we (1) store the "rx_timestamp" in the
metadata area of the completion packet, which seems practical, or (2)
have a mechanism for external storage that can be keyed on the umem
"addr"?
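To make option (2) concrete, here is a minimal sketch of external storage keyed on the umem "addr": one slot per umem frame, filled when the reply is queued and read back at completion time. The frame size, frame count, and all demo_* names are assumptions for illustration and are not taken from the selftest.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_FRAME_SIZE 4096u	/* assumed umem chunk size */
#define DEMO_NUM_FRAMES 256u	/* assumed number of umem chunks */

/* One HW RX timestamp slot per umem frame. */
static uint64_t demo_rx_tstamp[DEMO_NUM_FRAMES];

/* Map a umem address (including any headroom offset) to its frame index. */
static uint32_t demo_frame_idx(uint64_t addr)
{
	uint64_t idx = addr / DEMO_FRAME_SIZE;

	assert(idx < DEMO_NUM_FRAMES);
	return (uint32_t)idx;
}

/* Called when the TX descriptor is filled: remember the HW RX time. */
static void demo_store_rx_tstamp(uint64_t tx_addr, uint64_t hw_rx_tstamp)
{
	demo_rx_tstamp[demo_frame_idx(tx_addr)] = hw_rx_tstamp;
}

/* Called at TX completion: HW RX to HW TX delta in nanoseconds. */
static int64_t demo_hw_rx_to_tx_ns(uint64_t comp_addr, uint64_t hw_tx_tstamp)
{
	uint64_t rx = demo_rx_tstamp[demo_frame_idx(comp_addr)];

	return (int64_t)hw_tx_tstamp - (int64_t)rx;
}
```

With the timestamps from the output above, demo_store_rx_tstamp(0x10, 1691436369532047139) followed by demo_hw_rx_to_tx_ns(0x10, 1691436369598419505) yields 66372366 ns, i.e. the 66.4 ms computed by hand.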
> 0x1062cb8: complete rx idx=128 addr=80100
>
> mvbz4:~# nc -Nu -q1 ${MVBZ3_LINK_LOCAL_IP}%eth3 9091
>
> mvbz4:~# tcpdump -vvx -i eth3 udp
> tcpdump: listening on eth3, link-type EN10MB (Ethernet), snapshot length 262144 bytes
> 12:26:09.301074 IP6 (flowlabel 0x35fa5, hlim 127, next-header UDP (17) payload length: 11) fe80::1270:fdff:fe48:1087.55807 > fe80::1270:fdff:fe48:1077.9091: [bad udp cksum 0x3b8e -> 0xde7e!] UDP, length 3
> 0x0000: 6003 5fa5 000b 117f fe80 0000 0000 0000
> 0x0010: 1270 fdff fe48 1087 fe80 0000 0000 0000
> 0x0020: 1270 fdff fe48 1077 d9ff 2383 000b 3b8e
> 0x0030: 7864 70
> 12:26:09.301976 IP6 (flowlabel 0x35fa5, hlim 127, next-header UDP (17) payload length: 11) fe80::1270:fdff:fe48:1077.9091 > fe80::1270:fdff:fe48:1087.55807: [udp sum ok] UDP, length 3
> 0x0000: 6003 5fa5 000b 117f fe80 0000 0000 0000
> 0x0010: 1270 fdff fe48 1077 fe80 0000 0000 0000
> 0x0020: 1270 fdff fe48 1087 2383 d9ff 000b de7e
> 0x0030: 7864 70
>
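The two checksums in the dump can be cross-checked by hand. The following is a minimal user-space sketch, not the kernel's csum_ipv6_magic(): it computes the folded pseudo-header sum that is placed into the UDP check field, and then the final value a device would produce by summing from csum_start to the end of the packet. The helper names are made up for illustration, and the sketch ignores the special case where a computed UDP checksum of 0 must be sent as 0xffff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Add len bytes as big-endian 16-bit words to a running sum; an odd
 * trailing byte is padded with a zero low byte.
 */
static uint32_t demo_csum_add(uint32_t sum, const uint8_t *p, size_t len)
{
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)p[i] << 8) | p[i + 1];
	if (len & 1)
		sum += (uint32_t)p[len - 1] << 8;
	return sum;
}

/* Fold the carries back in (ones' complement addition). */
static uint16_t demo_csum_fold(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Folded, non-complemented pseudo-header sum for UDP over IPv6. */
static uint16_t demo_udp6_pseudo_csum(const uint8_t saddr[16],
				      const uint8_t daddr[16],
				      uint16_t udp_len)
{
	uint32_t sum = 0;

	sum = demo_csum_add(sum, saddr, 16);
	sum = demo_csum_add(sum, daddr, 16);
	sum += udp_len;
	sum += 17;	/* next header: UDP */
	return demo_csum_fold(sum);
}

/* Sum the UDP header and payload (check field already holds the
 * pseudo-header sum), then fold and complement.
 */
static uint16_t demo_udp_final_csum(const uint8_t *udp, size_t udp_len)
{
	assert(udp_len >= 8);
	return (uint16_t)~demo_csum_fold(demo_csum_add(0, udp, udp_len));
}
```

Fed with the addresses and the 11 UDP bytes from the dump above (check field set to 0x3b8e), the pseudo-header sum comes out as 0x3b8e and the final checksum as 0xde7e, matching both the tool output and tcpdump.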
> Signed-off-by: Stanislav Fomichev <sdf@google.com>
> ---
> tools/testing/selftests/bpf/xdp_hw_metadata.c | 202 +++++++++++++++++-
> 1 file changed, 192 insertions(+), 10 deletions(-)
>
> diff --git a/tools/testing/selftests/bpf/xdp_hw_metadata.c b/tools/testing/selftests/bpf/xdp_hw_metadata.c
> index 613321eb84c1..ab83d0ba6763 100644
> --- a/tools/testing/selftests/bpf/xdp_hw_metadata.c
> +++ b/tools/testing/selftests/bpf/xdp_hw_metadata.c
> @@ -10,7 +10,9 @@
> * - rx_hash
> *
> * TX:
> - * - TBD
> + * - UDP 9091 packets trigger TX reply
> + * - TX HW timestamp is requested and reported back upon completion
> + * - TX checksum is requested
> */
>
> #include <test_progs.h>
> @@ -24,14 +26,17 @@
[...]
> @@ -51,22 +56,24 @@ struct xsk *rx_xsk;
[...]
> @@ -129,12 +136,22 @@ static void refill_rx(struct xsk *xsk, __u64 addr)
[...]
> @@ -228,6 +245,117 @@ static void verify_skb_metadata(int fd)
> printf("skb hwtstamp is not found!\n");
> }
>
> +static bool complete_tx(struct xsk *xsk)
> +{
> + struct xsk_tx_metadata *meta;
> + __u64 addr;
> + void *data;
> + __u32 idx;
> +
> + if (!xsk_ring_cons__peek(&xsk->comp, 1, &idx))
> + return false;
> +
> + addr = *xsk_ring_cons__comp_addr(&xsk->comp, idx);
> + data = xsk_umem__get_data(xsk->umem_area, addr);
> + meta = data - sizeof(struct xsk_tx_metadata);
> +
> + printf("%p: complete tx idx=%u addr=%llx\n", xsk, idx, addr);
> + printf("%p: tx_timestamp: %llu (sec:%0.4f)\n", xsk,
> + meta->completion.tx_timestamp,
> + (double)meta->completion.tx_timestamp / NANOSEC_PER_SEC);
> + xsk_ring_cons__release(&xsk->comp, 1);
> +
> + return true;
> +}
> +
> +#define swap(a, b, len) do { \
> + for (int i = 0; i < len; i++) { \
> + __u8 tmp = ((__u8 *)a)[i]; \
> + ((__u8 *)a)[i] = ((__u8 *)b)[i]; \
> + ((__u8 *)b)[i] = tmp; \
> + } \
> +} while (0)
> +
> +static void ping_pong(struct xsk *xsk, void *rx_packet)
> +{
> + struct xsk_tx_metadata *meta;
> + struct ipv6hdr *ip6h = NULL;
> + struct iphdr *iph = NULL;
> + struct xdp_desc *tx_desc;
> + struct udphdr *udph;
> + struct ethhdr *eth;
> + __sum16 want_csum;
> + void *data;
> + __u32 idx;
> + int ret;
> + int len;
> +
> + ret = xsk_ring_prod__reserve(&xsk->tx, 1, &idx);
> + if (ret != 1) {
> + printf("%p: failed to reserve tx slot\n", xsk);
> + return;
> + }
> +
> + tx_desc = xsk_ring_prod__tx_desc(&xsk->tx, idx);
> + tx_desc->addr = idx % (UMEM_NUM / 2) * UMEM_FRAME_SIZE + sizeof(struct xsk_tx_metadata);
> + data = xsk_umem__get_data(xsk->umem_area, tx_desc->addr);
> +
> + meta = data - sizeof(struct xsk_tx_metadata);
> + memset(meta, 0, sizeof(*meta));
> + meta->flags = XDP_TX_METADATA_TIMESTAMP;
> +
> + eth = rx_packet;
> +
> + if (eth->h_proto == htons(ETH_P_IP)) {
> + iph = (void *)(eth + 1);
> + udph = (void *)(iph + 1);
> + } else if (eth->h_proto == htons(ETH_P_IPV6)) {
> + ip6h = (void *)(eth + 1);
> + udph = (void *)(ip6h + 1);
> + } else {
> + printf("%p: failed to detect IP version for ping pong %04x\n", xsk, eth->h_proto);
> + xsk_ring_prod__cancel(&xsk->tx, 1);
> + return;
> + }
> +
> + len = ETH_HLEN;
> + if (ip6h)
> + len += sizeof(*ip6h) + ntohs(ip6h->payload_len);
> + if (iph)
> + len += ntohs(iph->tot_len);
> +
> + swap(eth->h_dest, eth->h_source, ETH_ALEN);
> + if (iph)
> + swap(&iph->saddr, &iph->daddr, 4);
> + else
> + swap(&ip6h->saddr, &ip6h->daddr, 16);
> + swap(&udph->source, &udph->dest, 2);
> +
> + want_csum = udph->check;
> + if (ip6h)
> + udph->check = ~csum_ipv6_magic(&ip6h->saddr, &ip6h->daddr,
> + ntohs(udph->len), IPPROTO_UDP, 0);
> + else
> + udph->check = ~csum_tcpudp_magic(iph->saddr, iph->daddr,
> + ntohs(udph->len), IPPROTO_UDP, 0);
> +
> + meta->flags |= XDP_TX_METADATA_CHECKSUM;
> + if (iph)
> + meta->csum_start = sizeof(*eth) + sizeof(*iph);
> + else
> + meta->csum_start = sizeof(*eth) + sizeof(*ip6h);
> + meta->csum_offset = offsetof(struct udphdr, check);
> +
> + printf("%p: ping-pong with csum=%04x (want %04x) csum_start=%d csum_offset=%d\n",
> + xsk, ntohs(udph->check), ntohs(want_csum), meta->csum_start, meta->csum_offset);
> +
> + memcpy(data, rx_packet, len); /* don't share umem chunk for simplicity */
> + tx_desc->options |= XDP_TX_METADATA;
> + tx_desc->len = len;
> +
> + xsk_ring_prod__submit(&xsk->tx, 1);
> +}
> +
> static int verify_metadata(struct xsk *rx_xsk, int rxq, int server_fd, clockid_t clock_id)
> {
> const struct xdp_desc *rx_desc;
> @@ -250,6 +378,13 @@ static int verify_metadata(struct xsk *rx_xsk, int rxq, int server_fd, clockid_t
>
> while (true) {
> errno = 0;
> +
> + for (i = 0; i < rxq; i++) {
> + ret = kick_rx(&rx_xsk[i]);
> + if (ret)
> + printf("kick_rx ret=%d\n", ret);
> + }
> +
> ret = poll(fds, rxq + 1, 1000);
> printf("poll: %d (%d) skip=%llu fail=%llu redir=%llu\n",
> ret, errno, bpf_obj->bss->pkts_skip,
> @@ -280,6 +415,22 @@ static int verify_metadata(struct xsk *rx_xsk, int rxq, int server_fd, clockid_t
> xsk, idx, rx_desc->addr, addr, comp_addr);
> verify_xdp_metadata(xsk_umem__get_data(xsk->umem_area, addr),
> clock_id);
> +
> + if (!skip_tx) {
> + /* mirror the packet back */
> + ping_pong(xsk, xsk_umem__get_data(xsk->umem_area, addr));
> +
> + ret = kick_tx(xsk);
> + if (ret)
> + printf("kick_tx ret=%d\n", ret);
> +
> + for (int j = 0; j < 500; j++) {
> + if (complete_tx(xsk))
> + break;
> + usleep(10*1000);
I don't fully follow why we need this usleep here.
> + }
> + }
> +
> xsk_ring_cons__release(&xsk->rx, 1);
> refill_rx(xsk, comp_addr);
> }
> @@ -404,21 +555,52 @@ static void timestamping_enable(int fd, int val)
^ permalink raw reply [flat|nested] 25+ messages in thread
* [xdp-hints] Re: [PATCH bpf-next v3 09/10] selftests/bpf: Add TX side to xdp_hw_metadata
2023-10-09 8:12 ` [xdp-hints] " Jesper Dangaard Brouer
@ 2023-10-09 16:37 ` Stanislav Fomichev
2023-10-10 20:40 ` Stanislav Fomichev
0 siblings, 1 reply; 25+ messages in thread
From: Stanislav Fomichev @ 2023-10-09 16:37 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
kpsingh, haoluo, jolsa, kuba, toke, willemb, dsahern,
magnus.karlsson, bjorn, maciej.fijalkowski, yoong.siang.song,
netdev, xdp-hints
On 10/09, Jesper Dangaard Brouer wrote:
>
>
> On 03/10/2023 22.05, Stanislav Fomichev wrote:
> > When we get a packet on port 9091, we swap src/dst and send it out.
> > At this point we also request the timestamp and checksum offloads.
> >
> > Checksum offload is verified by looking at the tcpdump on the other side.
> > The tool prints pseudo-header csum and the final one it expects.
> > The final checksum actually matches the incoming packets checksum
> > because we only flip the src/dst and don't change the payload.
> >
> > Some other related changes:
> > - switched to zerocopy mode by default; new flag can be used to force
> > old behavior
> > - request fixed tx_metadata_len headroom
> > - some other small fixes (umem size, fill idx+i, etc)
> >
> > mvbz3:~# ./xdp_hw_metadata eth3
> > ...
> > 0x1062cb8: rx_desc[0]->addr=80100 addr=80100 comp_addr=80100
> > rx_hash: 0x2E1B50B9 with RSS type:0x2A
> > rx_timestamp: 1691436369532047139 (sec:1691436369.5320)
> > XDP RX-time: 1691436369261756803 (sec:1691436369.2618) delta sec:-0.2703 (-270290.336 usec)
>
> I guess system time isn't configured to be in sync with NIC HW time,
> as delta is a negative offset.
>
> > AF_XDP time: 1691436369261878839 (sec:1691436369.2619) delta sec:0.0001 (122.036 usec)
> The AF_XDP time is also software system time and compared to XDP
> RX-time, it shows a delta of 122 usec. This number indicate to me that
> the CPU is likely configured with power saving sleep states.
Yes, I don't do any synchronization and don't disable the sleep states.
> > 0x1062cb8: ping-pong with csum=3b8e (want de7e) csum_start=54 csum_offset=6
> > 0x1062cb8: complete tx idx=0 addr=10
> > 0x1062cb8: tx_timestamp: 1691436369598419505 (sec:1691436369.5984)
>
> Could we add something that we can relate tx_timestamp to?
>
> Like we do with the other delta calculations, as it helps us to
> understand/validate if the number we get back is sane. Like negative
> offset aboves tells us that system time sync isn't configured, and that
> system have configures C-states.
>
> I suggest delta comparing "tx_timestamp" to "rx_timestamp", as they are
> the same clock domain. It will tell us the total time spend from HW RX
> to HW TX, counting all the time used by software "ping-pong".
>
> 1691436369.5984-1691436369.5320 = 0.0664 sec = 66.4 ms
>
> When implementing this, it could be (1) practical to store the
> "rx_timestamp" in the metadata area of the completion packet, or (2)
> should we have a mechanism for external storage that can be keyed on the
> umem "addr"?
Sounds good. I can probably just store the last rx_timestamp in a
global variable and compute the delta on tx? Since the test is
single-threaded and sequential, I'm not sure we need a mechanism to pass
the timestamp around. LMK if you disagree or I'm missing something.
> > 0x1062cb8: complete rx idx=128 addr=80100
> >
> > mvbz4:~# nc -Nu -q1 ${MVBZ3_LINK_LOCAL_IP}%eth3 9091
> >
> > mvbz4:~# tcpdump -vvx -i eth3 udp
> > tcpdump: listening on eth3, link-type EN10MB (Ethernet), snapshot length 262144 bytes
> > 12:26:09.301074 IP6 (flowlabel 0x35fa5, hlim 127, next-header UDP (17) payload length: 11) fe80::1270:fdff:fe48:1087.55807 > fe80::1270:fdff:fe48:1077.9091: [bad udp cksum 0x3b8e -> 0xde7e!] UDP, length 3
> > 0x0000: 6003 5fa5 000b 117f fe80 0000 0000 0000
> > 0x0010: 1270 fdff fe48 1087 fe80 0000 0000 0000
> > 0x0020: 1270 fdff fe48 1077 d9ff 2383 000b 3b8e
> > 0x0030: 7864 70
> > 12:26:09.301976 IP6 (flowlabel 0x35fa5, hlim 127, next-header UDP (17) payload length: 11) fe80::1270:fdff:fe48:1077.9091 > fe80::1270:fdff:fe48:1087.55807: [udp sum ok] UDP, length 3
> > 0x0000: 6003 5fa5 000b 117f fe80 0000 0000 0000
> > 0x0010: 1270 fdff fe48 1077 fe80 0000 0000 0000
> > 0x0020: 1270 fdff fe48 1087 2383 d9ff 000b de7e
> > 0x0030: 7864 70
> >
> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > ---
> > tools/testing/selftests/bpf/xdp_hw_metadata.c | 202 +++++++++++++++++-
> > 1 file changed, 192 insertions(+), 10 deletions(-)
> >
> > diff --git a/tools/testing/selftests/bpf/xdp_hw_metadata.c b/tools/testing/selftests/bpf/xdp_hw_metadata.c
> > index 613321eb84c1..ab83d0ba6763 100644
> > --- a/tools/testing/selftests/bpf/xdp_hw_metadata.c
> > +++ b/tools/testing/selftests/bpf/xdp_hw_metadata.c
> > @@ -10,7 +10,9 @@
> > * - rx_hash
> > *
> > * TX:
> > - * - TBD
> > + * - UDP 9091 packets trigger TX reply
> > + * - TX HW timestamp is requested and reported back upon completion
> > + * - TX checksum is requested
> > */
> > #include <test_progs.h>
> > @@ -24,14 +26,17 @@
> [...]
> > @@ -51,22 +56,24 @@ struct xsk *rx_xsk;
> [...]
> > @@ -129,12 +136,22 @@ static void refill_rx(struct xsk *xsk, __u64 addr)
> [...]
> > @@ -228,6 +245,117 @@ static void verify_skb_metadata(int fd)
> > printf("skb hwtstamp is not found!\n");
> > }
> > +static bool complete_tx(struct xsk *xsk)
> > +{
> > + struct xsk_tx_metadata *meta;
> > + __u64 addr;
> > + void *data;
> > + __u32 idx;
> > +
> > + if (!xsk_ring_cons__peek(&xsk->comp, 1, &idx))
> > + return false;
> > +
> > + addr = *xsk_ring_cons__comp_addr(&xsk->comp, idx);
> > + data = xsk_umem__get_data(xsk->umem_area, addr);
> > + meta = data - sizeof(struct xsk_tx_metadata);
> > +
> > + printf("%p: complete tx idx=%u addr=%llx\n", xsk, idx, addr);
> > + printf("%p: tx_timestamp: %llu (sec:%0.4f)\n", xsk,
> > + meta->completion.tx_timestamp,
> > + (double)meta->completion.tx_timestamp / NANOSEC_PER_SEC);
> > + xsk_ring_cons__release(&xsk->comp, 1);
> > +
> > + return true;
> > +}
> > +
> > +#define swap(a, b, len) do { \
> > + for (int i = 0; i < len; i++) { \
> > + __u8 tmp = ((__u8 *)a)[i]; \
> > + ((__u8 *)a)[i] = ((__u8 *)b)[i]; \
> > + ((__u8 *)b)[i] = tmp; \
> > + } \
> > +} while (0)
> > +
> > +static void ping_pong(struct xsk *xsk, void *rx_packet)
> > +{
> > + struct xsk_tx_metadata *meta;
> > + struct ipv6hdr *ip6h = NULL;
> > + struct iphdr *iph = NULL;
> > + struct xdp_desc *tx_desc;
> > + struct udphdr *udph;
> > + struct ethhdr *eth;
> > + __sum16 want_csum;
> > + void *data;
> > + __u32 idx;
> > + int ret;
> > + int len;
> > +
> > + ret = xsk_ring_prod__reserve(&xsk->tx, 1, &idx);
> > + if (ret != 1) {
> > + printf("%p: failed to reserve tx slot\n", xsk);
> > + return;
> > + }
> > +
> > + tx_desc = xsk_ring_prod__tx_desc(&xsk->tx, idx);
> > + tx_desc->addr = idx % (UMEM_NUM / 2) * UMEM_FRAME_SIZE + sizeof(struct xsk_tx_metadata);
> > + data = xsk_umem__get_data(xsk->umem_area, tx_desc->addr);
> > +
> > + meta = data - sizeof(struct xsk_tx_metadata);
> > + memset(meta, 0, sizeof(*meta));
> > + meta->flags = XDP_TX_METADATA_TIMESTAMP;
> > +
> > + eth = rx_packet;
> > +
> > + if (eth->h_proto == htons(ETH_P_IP)) {
> > + iph = (void *)(eth + 1);
> > + udph = (void *)(iph + 1);
> > + } else if (eth->h_proto == htons(ETH_P_IPV6)) {
> > + ip6h = (void *)(eth + 1);
> > + udph = (void *)(ip6h + 1);
> > + } else {
> > + printf("%p: failed to detect IP version for ping pong %04x\n", xsk, eth->h_proto);
> > + xsk_ring_prod__cancel(&xsk->tx, 1);
> > + return;
> > + }
> > +
> > + len = ETH_HLEN;
> > + if (ip6h)
> > + len += sizeof(*ip6h) + ntohs(ip6h->payload_len);
> > + if (iph)
> > + len += ntohs(iph->tot_len);
> > +
> > + swap(eth->h_dest, eth->h_source, ETH_ALEN);
> > + if (iph)
> > + swap(&iph->saddr, &iph->daddr, 4);
> > + else
> > + swap(&ip6h->saddr, &ip6h->daddr, 16);
> > + swap(&udph->source, &udph->dest, 2);
> > +
> > + want_csum = udph->check;
> > + if (ip6h)
> > + udph->check = ~csum_ipv6_magic(&ip6h->saddr, &ip6h->daddr,
> > + ntohs(udph->len), IPPROTO_UDP, 0);
> > + else
> > + udph->check = ~csum_tcpudp_magic(iph->saddr, iph->daddr,
> > + ntohs(udph->len), IPPROTO_UDP, 0);
> > +
> > + meta->flags |= XDP_TX_METADATA_CHECKSUM;
> > + if (iph)
> > + meta->csum_start = sizeof(*eth) + sizeof(*iph);
> > + else
> > + meta->csum_start = sizeof(*eth) + sizeof(*ip6h);
> > + meta->csum_offset = offsetof(struct udphdr, check);
> > +
> > + printf("%p: ping-pong with csum=%04x (want %04x) csum_start=%d csum_offset=%d\n",
> > + xsk, ntohs(udph->check), ntohs(want_csum), meta->csum_start, meta->csum_offset);
> > +
> > + memcpy(data, rx_packet, len); /* don't share umem chunk for simplicity */
> > + tx_desc->options |= XDP_TX_METADATA;
> > + tx_desc->len = len;
> > +
> > + xsk_ring_prod__submit(&xsk->tx, 1);
> > +}
> > +
> > static int verify_metadata(struct xsk *rx_xsk, int rxq, int server_fd, clockid_t clock_id)
> > {
> > const struct xdp_desc *rx_desc;
> > @@ -250,6 +378,13 @@ static int verify_metadata(struct xsk *rx_xsk, int rxq, int server_fd, clockid_t
> > while (true) {
> > errno = 0;
> > +
> > + for (i = 0; i < rxq; i++) {
> > + ret = kick_rx(&rx_xsk[i]);
> > + if (ret)
> > + printf("kick_rx ret=%d\n", ret);
> > + }
> > +
> > ret = poll(fds, rxq + 1, 1000);
> > printf("poll: %d (%d) skip=%llu fail=%llu redir=%llu\n",
> > ret, errno, bpf_obj->bss->pkts_skip,
> > @@ -280,6 +415,22 @@ static int verify_metadata(struct xsk *rx_xsk, int rxq, int server_fd, clockid_t
> > xsk, idx, rx_desc->addr, addr, comp_addr);
> > verify_xdp_metadata(xsk_umem__get_data(xsk->umem_area, addr),
> > clock_id);
> > +
> > + if (!skip_tx) {
> > + /* mirror the packet back */
> > + ping_pong(xsk, xsk_umem__get_data(xsk->umem_area, addr));
> > +
> > + ret = kick_tx(xsk);
> > + if (ret)
> > + printf("kick_tx ret=%d\n", ret);
> > +
> > + for (int j = 0; j < 500; j++) {
> > + if (complete_tx(xsk))
> > + break;
> > + usleep(10*1000);
>
> I don't fully follow why we need this usleep here.
To avoid busy-polling here (since we don't care too much about perf in
the test). But I agree, it should be OK to drop; will do.
^ permalink raw reply [flat|nested] 25+ messages in thread
* [xdp-hints] Re: [PATCH bpf-next v3 09/10] selftests/bpf: Add TX side to xdp_hw_metadata
2023-10-09 16:37 ` Stanislav Fomichev
@ 2023-10-10 20:40 ` Stanislav Fomichev
2023-10-13 1:13 ` Song, Yoong Siang
0 siblings, 1 reply; 25+ messages in thread
From: Stanislav Fomichev @ 2023-10-10 20:40 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
kpsingh, haoluo, jolsa, kuba, toke, willemb, dsahern,
magnus.karlsson, bjorn, maciej.fijalkowski, yoong.siang.song,
netdev, xdp-hints
On 10/09, Stanislav Fomichev wrote:
> On 10/09, Jesper Dangaard Brouer wrote:
> >
> >
> > On 03/10/2023 22.05, Stanislav Fomichev wrote:
> > > When we get a packet on port 9091, we swap src/dst and send it out.
> > > At this point we also request the timestamp and checksum offloads.
> > >
> > > Checksum offload is verified by looking at the tcpdump on the other side.
> > > The tool prints pseudo-header csum and the final one it expects.
> > > The final checksum actually matches the incoming packets checksum
> > > because we only flip the src/dst and don't change the payload.
> > >
> > > Some other related changes:
> > > - switched to zerocopy mode by default; new flag can be used to force
> > > old behavior
> > > - request fixed tx_metadata_len headroom
> > > - some other small fixes (umem size, fill idx+i, etc)
> > >
> > > mvbz3:~# ./xdp_hw_metadata eth3
> > > ...
> > > 0x1062cb8: rx_desc[0]->addr=80100 addr=80100 comp_addr=80100
> > > rx_hash: 0x2E1B50B9 with RSS type:0x2A
> > > rx_timestamp: 1691436369532047139 (sec:1691436369.5320)
> > > XDP RX-time: 1691436369261756803 (sec:1691436369.2618) delta sec:-0.2703 (-270290.336 usec)
> >
> > I guess system time isn't configured to be in sync with NIC HW time,
> > as delta is a negative offset.
> >
> > > AF_XDP time: 1691436369261878839 (sec:1691436369.2619) delta sec:0.0001 (122.036 usec)
> > The AF_XDP time is also software system time and compared to XDP
> > RX-time, it shows a delta of 122 usec. This number indicate to me that
> > the CPU is likely configured with power saving sleep states.
>
> Yes, I don't do any synchronization and don't disable the sleep states.
>
> > > 0x1062cb8: ping-pong with csum=3b8e (want de7e) csum_start=54 csum_offset=6
> > > 0x1062cb8: complete tx idx=0 addr=10
> > > 0x1062cb8: tx_timestamp: 1691436369598419505 (sec:1691436369.5984)
> >
> > Could we add something that we can relate tx_timestamp to?
> >
> > Like we do with the other delta calculations, as it helps us to
> > understand/validate if the number we get back is sane. Like negative
> > offset aboves tells us that system time sync isn't configured, and that
> > system have configures C-states.
> >
> > I suggest delta comparing "tx_timestamp" to "rx_timestamp", as they are
> > the same clock domain. It will tell us the total time spend from HW RX
> > to HW TX, counting all the time used by software "ping-pong".
> >
> > 1691436369.5984-1691436369.5320 = 0.0664 sec = 66.4 ms
> >
> > When implementing this, it could be (1) practical to store the
> > "rx_timestamp" in the metadata area of the completion packet, or (2)
> > should we have a mechanism for external storage that can be keyed on the
> > umem "addr"?
>
> Sounds good. I can probably just store last rx_timestamp somewhere in the
> global var and do a delta on tx? Since the test is single threaded
> and sequential, not sure we need the mechanism to pass the tstamp around.
> LMK if you disagree and I'm missing something.
I ended up reshuffling the current code a bit to basically use CLOCK_TAI
as a reference for every line. It feels a bit simpler when
everything is referenced against the same clock?
For the RX part, I renamed the existing XDP/AF_XDP lines to HW/SW and
dump them both relative to CLOCK_TAI.
0x195d1f0: rx_desc[0]->addr=80100 addr=80100 comp_addr=80100
rx_hash: 0xEE2BBD59 with RSS type:0x2A
rx_timestamp: 1696969312125212179 (sec:1696969312.1252)
HW RX-time: 1696969312125212179 (sec:1696969312.1252) to CLOCK_TAI delta sec:-0.1339 (-133862.968 usec)
SW RX-time: 1696969311991283421 (sec:1696969311.9913) to CLOCK_TAI delta sec:0.0001 (65.790 usec)
0x195d1f0: ping-pong with csum=3b8e (want de5f) csum_start=54 csum_offset=6
0x195d1f0: complete tx idx=0 addr=8
tx_timestamp: 1696969312152959759 (sec:1696969312.1530)
SW RX-time: 1696969311991283421 (sec:1696969311.9913) to CLOCK_TAI delta sec:0.0101 (10139.862 usec)
HW RX-time: 1696969312125212179 (sec:1696969312.1252) to HW TX-complete-time delta sec:0.0277 (27747.580 usec)
HW TX-complete-time: 1696969312152959759 (sec:1696969312.1530) to CLOCK_TAI delta sec:-0.1515 (-151536.476 usec)
For the TX part, I added a bunch of reference points:
1) SW RX-time (meta->xdp_timestamp) vs CLOCK_TAI (aka tai-at-complete-time)
2) HW RX-time (meta->rx_timestamp) vs HW TX-complete-time (new af_xdp timestamp)
3) HW TX-complete-time vs CLOCK_TAI
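As a sanity check, the three TX deltas in the sample output above are mutually consistent. Below is a minimal sketch of the arithmetic (reference minus timestamp). The function names are hypothetical, and the CLOCK_TAI reading at completion time is not printed in the log, so the value used in the usage note is inferred from the HW TX-complete-time line and should be treated as an assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define DEMO_NSEC_PER_SEC 1000000000LL

/* Signed delta in nanoseconds: positive when reference is later. */
static int64_t demo_tstamp_delta_ns(uint64_t tstamp, uint64_t reference)
{
	return (int64_t)reference - (int64_t)tstamp;
}

/* Format one line in the same layout as the sample output. */
static int demo_format_delta(char *buf, size_t len, const char *name,
			     const char *refname, uint64_t tstamp,
			     uint64_t reference)
{
	int64_t delta = demo_tstamp_delta_ns(tstamp, reference);

	assert(buf && name && refname);
	return snprintf(buf, len,
			"%s: %llu (sec:%0.4f) to %s delta sec:%0.4f (%0.3f usec)",
			name, (unsigned long long)tstamp,
			(double)tstamp / DEMO_NSEC_PER_SEC, refname,
			(double)delta / DEMO_NSEC_PER_SEC,
			(double)delta / 1000);
}
```

For (2), demo_tstamp_delta_ns(1696969312125212179, 1696969312152959759) gives 27747580 ns, which is the 27747.580 usec shown. Working backwards from (3), the reference clock at completion would have been 1696969312001423283, and plugging that into (1) gives 10139862 ns, again matching the log.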
What do you think? See the patch below.
Note: all 3 of the above should, in theory, be more or less constant (with irq
moderation etc. disabled). But for me on mlx5, (2) is not constant, and it
looks like the HW RX timestamp jitters quite a bit. I don't have a clue right
now as to why; I will take a separate look, but it's unrelated to the TX side.
diff --git a/tools/testing/selftests/bpf/xdp_hw_metadata.c b/tools/testing/selftests/bpf/xdp_hw_metadata.c
index ab83d0ba6763..64a90d7479c1 100644
--- a/tools/testing/selftests/bpf/xdp_hw_metadata.c
+++ b/tools/testing/selftests/bpf/xdp_hw_metadata.c
@@ -57,6 +57,8 @@ const char *ifname;
int ifindex;
int rxq;
bool skip_tx;
+__u64 last_hw_rx_timestamp;
+__u64 last_sw_rx_timestamp;
void test__fail(void) { /* for network_helpers.c */ }
@@ -167,6 +169,16 @@ static __u64 gettime(clockid_t clock_id)
return (__u64) t.tv_sec * NANOSEC_PER_SEC + t.tv_nsec;
}
+static void print_tstamp_delta(const char *name, const char *refname, __u64 tstamp, __u64 reference)
+{
+ __s64 delta = (__s64)reference - (__s64)tstamp;
+
+ printf("%s: %llu (sec:%0.4f) to %s delta sec:%0.4f (%0.3f usec)\n",
+ name, tstamp, (double)tstamp / NANOSEC_PER_SEC, refname,
+ (double)delta / NANOSEC_PER_SEC,
+ (double)delta / 1000);
+}
+
static void verify_xdp_metadata(void *data, clockid_t clock_id)
{
struct xdp_meta *meta;
@@ -182,22 +194,15 @@ static void verify_xdp_metadata(void *data, clockid_t clock_id)
printf("rx_timestamp: %llu (sec:%0.4f)\n", meta->rx_timestamp,
(double)meta->rx_timestamp / NANOSEC_PER_SEC);
if (meta->rx_timestamp) {
- __u64 usr_clock = gettime(clock_id);
- __u64 xdp_clock = meta->xdp_timestamp;
- __s64 delta_X = xdp_clock - meta->rx_timestamp;
- __s64 delta_X2U = usr_clock - xdp_clock;
-
- printf("XDP RX-time: %llu (sec:%0.4f) delta sec:%0.4f (%0.3f usec)\n",
- xdp_clock, (double)xdp_clock / NANOSEC_PER_SEC,
- (double)delta_X / NANOSEC_PER_SEC,
- (double)delta_X / 1000);
-
- printf("AF_XDP time: %llu (sec:%0.4f) delta sec:%0.4f (%0.3f usec)\n",
- usr_clock, (double)usr_clock / NANOSEC_PER_SEC,
- (double)delta_X2U / NANOSEC_PER_SEC,
- (double)delta_X2U / 1000);
- }
+ __u64 ref_tstamp = gettime(clock_id);
+
+ /* store received timestamps to calculate a delta at tx */
+ last_hw_rx_timestamp = meta->rx_timestamp;
+ last_sw_rx_timestamp = meta->xdp_timestamp;
+ print_tstamp_delta("HW RX-time", "CLOCK_TAI", meta->rx_timestamp, ref_tstamp);
+ print_tstamp_delta("SW RX-time", "CLOCK_TAI", meta->xdp_timestamp, ref_tstamp);
+ }
}
static void verify_skb_metadata(int fd)
@@ -245,7 +250,7 @@ static void verify_skb_metadata(int fd)
printf("skb hwtstamp is not found!\n");
}
-static bool complete_tx(struct xsk *xsk)
+static bool complete_tx(struct xsk *xsk, clockid_t clock_id)
{
struct xsk_tx_metadata *meta;
__u64 addr;
@@ -260,9 +265,17 @@ static bool complete_tx(struct xsk *xsk)
meta = data - sizeof(struct xsk_tx_metadata);
printf("%p: complete tx idx=%u addr=%llx\n", xsk, idx, addr);
- printf("%p: tx_timestamp: %llu (sec:%0.4f)\n", xsk,
- meta->completion.tx_timestamp,
+
+ printf("tx_timestamp: %llu (sec:%0.4f)\n", meta->completion.tx_timestamp,
(double)meta->completion.tx_timestamp / NANOSEC_PER_SEC);
+ if (meta->completion.tx_timestamp) {
+ __u64 ref_tstamp = gettime(clock_id);
+
+ print_tstamp_delta("HW TX-complete-time", "CLOCK_TAI", meta->completion.tx_timestamp, ref_tstamp);
+ print_tstamp_delta("SW RX-time", "CLOCK_TAI", last_sw_rx_timestamp, ref_tstamp);
+ print_tstamp_delta("HW RX-time", "HW TX-complete-time", last_hw_rx_timestamp, meta->completion.tx_timestamp);
+ }
+
xsk_ring_cons__release(&xsk->comp, 1);
return true;
@@ -276,7 +289,7 @@ static bool complete_tx(struct xsk *xsk)
} \
} while (0)
-static void ping_pong(struct xsk *xsk, void *rx_packet)
+static void ping_pong(struct xsk *xsk, void *rx_packet, clockid_t clock_id)
{
struct xsk_tx_metadata *meta;
struct ipv6hdr *ip6h = NULL;
@@ -418,14 +431,14 @@ static int verify_metadata(struct xsk *rx_xsk, int rxq, int server_fd, clockid_t
if (!skip_tx) {
/* mirror the packet back */
- ping_pong(xsk, xsk_umem__get_data(xsk->umem_area, addr));
+ ping_pong(xsk, xsk_umem__get_data(xsk->umem_area, addr), clock_id);
ret = kick_tx(xsk);
if (ret)
printf("kick_tx ret=%d\n", ret);
for (int j = 0; j < 500; j++) {
- if (complete_tx(xsk))
+ if (complete_tx(xsk, clock_id))
break;
usleep(10*1000);
}
> > > 0x1062cb8: complete rx idx=128 addr=80100
> > >
> > > mvbz4:~# nc -Nu -q1 ${MVBZ3_LINK_LOCAL_IP}%eth3 9091
> > >
> > > mvbz4:~# tcpdump -vvx -i eth3 udp
> > > tcpdump: listening on eth3, link-type EN10MB (Ethernet), snapshot length 262144 bytes
> > > 12:26:09.301074 IP6 (flowlabel 0x35fa5, hlim 127, next-header UDP (17) payload length: 11) fe80::1270:fdff:fe48:1087.55807 > fe80::1270:fdff:fe48:1077.9091: [bad udp cksum 0x3b8e -> 0xde7e!] UDP, length 3
> > > 0x0000: 6003 5fa5 000b 117f fe80 0000 0000 0000
> > > 0x0010: 1270 fdff fe48 1087 fe80 0000 0000 0000
> > > 0x0020: 1270 fdff fe48 1077 d9ff 2383 000b 3b8e
> > > 0x0030: 7864 70
> > > 12:26:09.301976 IP6 (flowlabel 0x35fa5, hlim 127, next-header UDP (17) payload length: 11) fe80::1270:fdff:fe48:1077.9091 > fe80::1270:fdff:fe48:1087.55807: [udp sum ok] UDP, length 3
> > > 0x0000: 6003 5fa5 000b 117f fe80 0000 0000 0000
> > > 0x0010: 1270 fdff fe48 1077 fe80 0000 0000 0000
> > > 0x0020: 1270 fdff fe48 1087 2383 d9ff 000b de7e
> > > 0x0030: 7864 70
> > >
> > > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > > ---
> > > tools/testing/selftests/bpf/xdp_hw_metadata.c | 202 +++++++++++++++++-
> > > 1 file changed, 192 insertions(+), 10 deletions(-)
> > >
> > > diff --git a/tools/testing/selftests/bpf/xdp_hw_metadata.c b/tools/testing/selftests/bpf/xdp_hw_metadata.c
> > > index 613321eb84c1..ab83d0ba6763 100644
> > > --- a/tools/testing/selftests/bpf/xdp_hw_metadata.c
> > > +++ b/tools/testing/selftests/bpf/xdp_hw_metadata.c
> > > @@ -10,7 +10,9 @@
> > > * - rx_hash
> > > *
> > > * TX:
> > > - * - TBD
> > > + * - UDP 9091 packets trigger TX reply
> > > + * - TX HW timestamp is requested and reported back upon completion
> > > + * - TX checksum is requested
> > > */
> > > #include <test_progs.h>
> > > @@ -24,14 +26,17 @@
> > [...]
> > > @@ -51,22 +56,24 @@ struct xsk *rx_xsk;
> > [...]
> > > @@ -129,12 +136,22 @@ static void refill_rx(struct xsk *xsk, __u64 addr)
> > [...]
> > > @@ -228,6 +245,117 @@ static void verify_skb_metadata(int fd)
> > > printf("skb hwtstamp is not found!\n");
> > > }
> > > +static bool complete_tx(struct xsk *xsk)
> > > +{
> > > + struct xsk_tx_metadata *meta;
> > > + __u64 addr;
> > > + void *data;
> > > + __u32 idx;
> > > +
> > > + if (!xsk_ring_cons__peek(&xsk->comp, 1, &idx))
> > > + return false;
> > > +
> > > + addr = *xsk_ring_cons__comp_addr(&xsk->comp, idx);
> > > + data = xsk_umem__get_data(xsk->umem_area, addr);
> > > + meta = data - sizeof(struct xsk_tx_metadata);
> > > +
> > > + printf("%p: complete tx idx=%u addr=%llx\n", xsk, idx, addr);
> > > + printf("%p: tx_timestamp: %llu (sec:%0.4f)\n", xsk,
> > > + meta->completion.tx_timestamp,
> > > + (double)meta->completion.tx_timestamp / NANOSEC_PER_SEC);
> > > + xsk_ring_cons__release(&xsk->comp, 1);
> > > +
> > > + return true;
> > > +}
> > > +
> > > +#define swap(a, b, len) do { \
> > > + for (int i = 0; i < len; i++) { \
> > > + __u8 tmp = ((__u8 *)a)[i]; \
> > > + ((__u8 *)a)[i] = ((__u8 *)b)[i]; \
> > > + ((__u8 *)b)[i] = tmp; \
> > > + } \
> > > +} while (0)
> > > +
> > > +static void ping_pong(struct xsk *xsk, void *rx_packet)
> > > +{
> > > + struct xsk_tx_metadata *meta;
> > > + struct ipv6hdr *ip6h = NULL;
> > > + struct iphdr *iph = NULL;
> > > + struct xdp_desc *tx_desc;
> > > + struct udphdr *udph;
> > > + struct ethhdr *eth;
> > > + __sum16 want_csum;
> > > + void *data;
> > > + __u32 idx;
> > > + int ret;
> > > + int len;
> > > +
> > > + ret = xsk_ring_prod__reserve(&xsk->tx, 1, &idx);
> > > + if (ret != 1) {
> > > + printf("%p: failed to reserve tx slot\n", xsk);
> > > + return;
> > > + }
> > > +
> > > + tx_desc = xsk_ring_prod__tx_desc(&xsk->tx, idx);
> > > + tx_desc->addr = idx % (UMEM_NUM / 2) * UMEM_FRAME_SIZE + sizeof(struct xsk_tx_metadata);
> > > + data = xsk_umem__get_data(xsk->umem_area, tx_desc->addr);
> > > +
> > > + meta = data - sizeof(struct xsk_tx_metadata);
> > > + memset(meta, 0, sizeof(*meta));
> > > + meta->flags = XDP_TX_METADATA_TIMESTAMP;
> > > +
> > > + eth = rx_packet;
> > > +
> > > + if (eth->h_proto == htons(ETH_P_IP)) {
> > > + iph = (void *)(eth + 1);
> > > + udph = (void *)(iph + 1);
> > > + } else if (eth->h_proto == htons(ETH_P_IPV6)) {
> > > + ip6h = (void *)(eth + 1);
> > > + udph = (void *)(ip6h + 1);
> > > + } else {
> > > + printf("%p: failed to detect IP version for ping pong %04x\n", xsk, eth->h_proto);
> > > + xsk_ring_prod__cancel(&xsk->tx, 1);
> > > + return;
> > > + }
> > > +
> > > + len = ETH_HLEN;
> > > + if (ip6h)
> > > + len += sizeof(*ip6h) + ntohs(ip6h->payload_len);
> > > + if (iph)
> > > + len += ntohs(iph->tot_len);
> > > +
> > > + swap(eth->h_dest, eth->h_source, ETH_ALEN);
> > > + if (iph)
> > > + swap(&iph->saddr, &iph->daddr, 4);
> > > + else
> > > + swap(&ip6h->saddr, &ip6h->daddr, 16);
> > > + swap(&udph->source, &udph->dest, 2);
> > > +
> > > + want_csum = udph->check;
> > > + if (ip6h)
> > > + udph->check = ~csum_ipv6_magic(&ip6h->saddr, &ip6h->daddr,
> > > + ntohs(udph->len), IPPROTO_UDP, 0);
> > > + else
> > > + udph->check = ~csum_tcpudp_magic(iph->saddr, iph->daddr,
> > > + ntohs(udph->len), IPPROTO_UDP, 0);
> > > +
> > > + meta->flags |= XDP_TX_METADATA_CHECKSUM;
> > > + if (iph)
> > > + meta->csum_start = sizeof(*eth) + sizeof(*iph);
> > > + else
> > > + meta->csum_start = sizeof(*eth) + sizeof(*ip6h);
> > > + meta->csum_offset = offsetof(struct udphdr, check);
> > > +
> > > + printf("%p: ping-pong with csum=%04x (want %04x) csum_start=%d csum_offset=%d\n",
> > > + xsk, ntohs(udph->check), ntohs(want_csum), meta->csum_start, meta->csum_offset);
> > > +
> > > + memcpy(data, rx_packet, len); /* don't share umem chunk for simplicity */
> > > + tx_desc->options |= XDP_TX_METADATA;
> > > + tx_desc->len = len;
> > > +
> > > + xsk_ring_prod__submit(&xsk->tx, 1);
> > > +}
> > > +
> > > static int verify_metadata(struct xsk *rx_xsk, int rxq, int server_fd, clockid_t clock_id)
> > > {
> > > const struct xdp_desc *rx_desc;
> > > @@ -250,6 +378,13 @@ static int verify_metadata(struct xsk *rx_xsk, int rxq, int server_fd, clockid_t
> > > while (true) {
> > > errno = 0;
> > > +
> > > + for (i = 0; i < rxq; i++) {
> > > + ret = kick_rx(&rx_xsk[i]);
> > > + if (ret)
> > > + printf("kick_rx ret=%d\n", ret);
> > > + }
> > > +
> > > ret = poll(fds, rxq + 1, 1000);
> > > printf("poll: %d (%d) skip=%llu fail=%llu redir=%llu\n",
> > > ret, errno, bpf_obj->bss->pkts_skip,
> > > @@ -280,6 +415,22 @@ static int verify_metadata(struct xsk *rx_xsk, int rxq, int server_fd, clockid_t
> > > xsk, idx, rx_desc->addr, addr, comp_addr);
> > > verify_xdp_metadata(xsk_umem__get_data(xsk->umem_area, addr),
> > > clock_id);
> > > +
> > > + if (!skip_tx) {
> > > + /* mirror the packet back */
> > > + ping_pong(xsk, xsk_umem__get_data(xsk->umem_area, addr));
> > > +
> > > + ret = kick_tx(xsk);
> > > + if (ret)
> > > + printf("kick_tx ret=%d\n", ret);
> > > +
> > > + for (int j = 0; j < 500; j++) {
> > > + if (complete_tx(xsk))
> > > + break;
> > > + usleep(10*1000);
> >
> > I don't fully follow why we need this usleep here.
>
> To avoid the busypoll here (since we don't care too much about perf in
> the test). But I agree, should be ok to drop, will do.
I take that back, I have to keep it. Otherwise I don't have a good
bound on when to stop/abort while waiting for the completion (and the
number of loops would need to go from 500 to unsure-how-many).
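
A minimal sketch of the trade-off described above, under stated assumptions: sleeping between polls turns the retry count into a rough time bound (500 tries with a 10 ms sleep is about 5 seconds), whereas a pure busy loop would need an arbitrary iteration count. The names wait_bounded, poll_fn and countdown_done are hypothetical and not part of the patch; in the selftest the callback role would be played by complete_tx().

```c
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical callback type: returns true once a completion was consumed. */
typedef bool (*poll_fn)(void *ctx);

/*
 * Poll up to max_tries times, sleeping sleep_us microseconds after each
 * failed attempt. The worst-case wait is roughly max_tries * sleep_us.
 * Returns the number of attempts used, or -1 if the bound was hit.
 */
static int wait_bounded(poll_fn poll_once, void *ctx, int max_tries,
			unsigned int sleep_us)
{
	for (int i = 0; i < max_tries; i++) {
		if (poll_once(ctx))
			return i + 1;
		usleep(sleep_us);
	}
	return -1;
}

/* Illustration-only callback: reports success once the counter reaches zero. */
static bool countdown_done(void *ctx)
{
	int *remaining = ctx;

	if (*remaining == 0)
		return true;
	(*remaining)--;
	return false;
}
```

With this shape, wait_bounded(cb, ctx, 500, 10 * 1000) would correspond to the loop in the patch, and a timeout becomes an explicit return value rather than a silent fall-through.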
^ permalink raw reply [flat|nested] 25+ messages in thread
* [xdp-hints] Re: [PATCH bpf-next v3 09/10] selftests/bpf: Add TX side to xdp_hw_metadata
2023-10-10 20:40 ` Stanislav Fomichev
@ 2023-10-13 1:13 ` Song, Yoong Siang
2023-10-13 18:47 ` Stanislav Fomichev
0 siblings, 1 reply; 25+ messages in thread
From: Song, Yoong Siang @ 2023-10-13 1:13 UTC (permalink / raw)
To: Stanislav Fomichev, Jesper Dangaard Brouer
Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
kpsingh, haoluo, jolsa, kuba, toke, willemb, dsahern, Karlsson,
Magnus, bjorn, Fijalkowski, Maciej, netdev, xdp-hints
On Wednesday, October 11, 2023 4:40 AM, Stanislav Fomichev <sdf@google.com> wrote:
>On 10/09, Stanislav Fomichev wrote:
>> On 10/09, Jesper Dangaard Brouer wrote:
>> >
>> >
>> > On 03/10/2023 22.05, Stanislav Fomichev wrote:
>> > > When we get a packet on port 9091, we swap src/dst and send it out.
>> > > At this point we also request the timestamp and checksum offloads.
>> > >
>> > > Checksum offload is verified by looking at the tcpdump on the other side.
>> > > The tool prints pseudo-header csum and the final one it expects.
>> > > The final checksum actually matches the incoming packets checksum
>> > > because we only flip the src/dst and don't change the payload.
>> > >
>> > > Some other related changes:
>> > > - switched to zerocopy mode by default; new flag can be used to force
>> > > old behavior
>> > > - request fixed tx_metadata_len headroom
>> > > - some other small fixes (umem size, fill idx+i, etc)
>> > >
>> > > mvbz3:~# ./xdp_hw_metadata eth3
>> > > ...
>> > > 0x1062cb8: rx_desc[0]->addr=80100 addr=80100 comp_addr=80100
>> > > rx_hash: 0x2E1B50B9 with RSS type:0x2A
>> > > rx_timestamp: 1691436369532047139 (sec:1691436369.5320)
>> > > XDP RX-time: 1691436369261756803 (sec:1691436369.2618) delta sec:-0.2703 (-270290.336 usec)
>> >
>> > I guess system time isn't configured to be in sync with NIC HW time,
>> > as delta is a negative offset.
>> >
>> > > AF_XDP time: 1691436369261878839 (sec:1691436369.2619) delta sec:0.0001 (122.036 usec)
>> > The AF_XDP time is also software system time and compared to XDP
>> > RX-time, it shows a delta of 122 usec. This number indicates to me
>> > that the CPU is likely configured with power saving sleep states.
>>
>> Yes, I don't do any synchronization and don't disable the sleep states.
>>
>> > > 0x1062cb8: ping-pong with csum=3b8e (want de7e) csum_start=54
>> > > csum_offset=6
>> > > 0x1062cb8: complete tx idx=0 addr=10
>> > > 0x1062cb8: tx_timestamp: 1691436369598419505
>> > > (sec:1691436369.5984)
>> >
>> > Could we add something that we can relate tx_timestamp to?
>> >
>> > Like we do with the other delta calculations, as it helps us to
>> > understand/validate if the number we get back is sane. Like the negative
>> > offset above tells us that system time sync isn't configured, and
>> > that the system has C-states configured.
>> >
>> > I suggest delta comparing "tx_timestamp" to "rx_timestamp", as they
>> > are in the same clock domain. It will tell us the total time spent
>> > from HW RX to HW TX, counting all the time used by software "ping-pong".
>> >
>> > 1691436369.5984-1691436369.5320 = 0.0664 sec = 66.4 ms
>> >
>> > When implementing this, it could be (1) practical to store the
>> > "rx_timestamp" in the metadata area of the completion packet, or (2)
>> > should we have a mechanism for external storage that can be keyed on
>> > the umem "addr"?
>>
>> Sounds good. I can probably just store last rx_timestamp somewhere in
>> the global var and do a delta on tx? Since the test is single threaded
>> and sequential, not sure we need the mechanism to pass the tstamp around.
>> LMK if you disagree and I'm missing something.
>
>I ended up reshuffling the current code a bit to basically use CLOCK_TAI as a reference for
>every line. Feels like it's a bit simpler when everything is referenced against the
>same clock?
>
>For RX part, I rename existing XDP/AF_XDP to HW/SW and dump them both
>relative to tai.
>
>0x195d1f0: rx_desc[0]->addr=80100 addr=80100 comp_addr=80100
>rx_hash: 0xEE2BBD59 with RSS type:0x2A
>rx_timestamp: 1696969312125212179 (sec:1696969312.1252)
>HW RX-time: 1696969312125212179 (sec:1696969312.1252) to CLOCK_TAI delta sec:-0.1339 (-133862.968 usec)
>SW RX-time: 1696969311991283421 (sec:1696969311.9913) to CLOCK_TAI delta sec:0.0001 (65.790 usec)
>0x195d1f0: ping-pong with csum=3b8e (want de5f) csum_start=54 csum_offset=6
>0x195d1f0: complete tx idx=0 addr=8
>tx_timestamp: 1696969312152959759 (sec:1696969312.1530)
>SW RX-time: 1696969311991283421 (sec:1696969311.9913) to CLOCK_TAI delta sec:0.0101 (10139.862 usec)
>HW RX-time: 1696969312125212179 (sec:1696969312.1252) to HW TX-complete-time delta sec:0.0277 (27747.580 usec)
>HW TX-complete-time: 1696969312152959759 (sec:1696969312.1530) to CLOCK_TAI delta sec:-0.1515 (-151536.476 usec)
>
>For TX part, I add a bunch of reference points:
>1) SW RX-time (meta->xdp_timestamp) vs CLOCK_TAI (aka tai-at-complete-time)
>2) HW RX-time (meta->rx_timestamp) vs HW TX-complete-time (new af_xdp
>timestamp)
>3) HW TX-complete-time vs CLOCK_TAI
>
>What do you think? See the patch below.
Hi Stanislav,
For me, the "CLOCK_TAI" label in the output is a bit confusing because:
1. There are two TAI values that refer to different moments but share the same name "CLOCK_TAI".
2. SW RX-time is also a CLOCK_TAI value.
So I suggest changing the naming:
- HW RX-time: the moment the NIC receives the packet (based on PHC)
- XDP RX-time: the moment the BPF prog parses the packet (based on TAI)
- SW RX-time: the moment the user app receives the packet (based on TAI)
- HW TX-complete-time: the moment the NIC sends out the packet (based on PHC)
- SW TX-complete-time: the moment the user app learns the packet was sent out (based on TAI)
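
A rough sketch of how the five reference points above could be kept apart in code, assuming all values are nanosecond counts. The struct pkt_tstamps and the helpers tstamp_delta_ns() and print_delta() are hypothetical names for illustration only; the selftest itself uses print_tstamp_delta() and two globals. Note that only deltas within one clock domain (PHC to PHC, or TAI to TAI) are meaningful unless the PHC is synchronized to system time.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical container for the five proposed timestamps (nanoseconds). */
struct pkt_tstamps {
	uint64_t hw_rx;          /* NIC received the packet (PHC) */
	uint64_t xdp_rx;         /* BPF prog parsed the packet (CLOCK_TAI) */
	uint64_t sw_rx;          /* user app received the packet (CLOCK_TAI) */
	uint64_t hw_tx_complete; /* NIC sent out the packet (PHC) */
	uint64_t sw_tx_complete; /* user app saw the completion (CLOCK_TAI) */
};

/* Signed difference "to - from"; negative when "to" is the earlier value. */
static int64_t tstamp_delta_ns(uint64_t from, uint64_t to)
{
	return (int64_t)to - (int64_t)from;
}

/* Print one delta with both endpoints named, so no label is ambiguous. */
static void print_delta(const char *from_name, const char *to_name,
			uint64_t from, uint64_t to)
{
	int64_t delta = tstamp_delta_ns(from, to);

	printf("%s -> %s: delta sec:%0.4f (%0.3f usec)\n", from_name, to_name,
	       (double)delta / 1000000000.0, (double)delta / 1000.0);
}

/* Dump the same-clock-domain deltas for one packet. */
static void print_all_deltas(const struct pkt_tstamps *t)
{
	print_delta("HW RX-time", "HW TX-complete-time",
		    t->hw_rx, t->hw_tx_complete);
	print_delta("XDP RX-time", "SW RX-time", t->xdp_rx, t->sw_rx);
	print_delta("SW RX-time", "SW TX-complete-time",
		    t->sw_rx, t->sw_tx_complete);
}
```

Naming both endpoints of every delta avoids the situation where two different moments are both printed as "CLOCK_TAI".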
Thanks & Regards
Siang
>
>Note: all 3 of the above should, in theory, be more or less constant (with irq
>moderation / etc disabled). But for me on mlx5 (2) they are not, and it looks like the hw rx
>timestamp jitters quite a bit. I don't have a clue right now as to why; will try to take a
>separate look, but it's unrelated to the tx side.
>
>
>diff --git a/tools/testing/selftests/bpf/xdp_hw_metadata.c
>b/tools/testing/selftests/bpf/xdp_hw_metadata.c
>index ab83d0ba6763..64a90d7479c1 100644
>--- a/tools/testing/selftests/bpf/xdp_hw_metadata.c
>+++ b/tools/testing/selftests/bpf/xdp_hw_metadata.c
>@@ -57,6 +57,8 @@ const char *ifname;
> int ifindex;
> int rxq;
> bool skip_tx;
>+__u64 last_hw_rx_timestamp;
>+__u64 last_sw_rx_timestamp;
>
> void test__fail(void) { /* for network_helpers.c */ }
>
>@@ -167,6 +169,16 @@ static __u64 gettime(clockid_t clock_id)
> return (__u64) t.tv_sec * NANOSEC_PER_SEC + t.tv_nsec; }
>
>+static void print_tstamp_delta(const char *name, const char *refname,
>+__u64 tstamp, __u64 reference) {
>+ __s64 delta = (__s64)reference - (__s64)tstamp;
>+
>+ printf("%s: %llu (sec:%0.4f) to %s delta sec:%0.4f (%0.3f usec)\n",
>+ name, tstamp, (double)tstamp / NANOSEC_PER_SEC, refname,
>+ (double)delta / NANOSEC_PER_SEC,
>+ (double)delta / 1000);
>+}
>+
> static void verify_xdp_metadata(void *data, clockid_t clock_id) {
> struct xdp_meta *meta;
>@@ -182,22 +194,15 @@ static void verify_xdp_metadata(void *data, clockid_t
>clock_id)
> printf("rx_timestamp: %llu (sec:%0.4f)\n", meta->rx_timestamp,
> (double)meta->rx_timestamp / NANOSEC_PER_SEC);
> if (meta->rx_timestamp) {
>- __u64 usr_clock = gettime(clock_id);
>- __u64 xdp_clock = meta->xdp_timestamp;
>- __s64 delta_X = xdp_clock - meta->rx_timestamp;
>- __s64 delta_X2U = usr_clock - xdp_clock;
>-
>- printf("XDP RX-time: %llu (sec:%0.4f) delta sec:%0.4f (%0.3f
>usec)\n",
>- xdp_clock, (double)xdp_clock / NANOSEC_PER_SEC,
>- (double)delta_X / NANOSEC_PER_SEC,
>- (double)delta_X / 1000);
>-
>- printf("AF_XDP time: %llu (sec:%0.4f) delta sec:%0.4f (%0.3f
>usec)\n",
>- usr_clock, (double)usr_clock / NANOSEC_PER_SEC,
>- (double)delta_X2U / NANOSEC_PER_SEC,
>- (double)delta_X2U / 1000);
>- }
>+ __u64 ref_tstamp = gettime(clock_id);
>+
>+ /* store received timestamps to calculate a delta at tx */
>+ last_hw_rx_timestamp = meta->rx_timestamp;
>+ last_sw_rx_timestamp = meta->xdp_timestamp;
>
>+ print_tstamp_delta("HW RX-time", "CLOCK_TAI", meta-
>>rx_timestamp, ref_tstamp);
>+ print_tstamp_delta("SW RX-time", "CLOCK_TAI", meta-
>>xdp_timestamp, ref_tstamp);
>+ }
> }
>
> static void verify_skb_metadata(int fd) @@ -245,7 +250,7 @@ static void
>verify_skb_metadata(int fd)
> printf("skb hwtstamp is not found!\n"); }
>
>-static bool complete_tx(struct xsk *xsk)
>+static bool complete_tx(struct xsk *xsk, clockid_t clock_id)
> {
> struct xsk_tx_metadata *meta;
> __u64 addr;
>@@ -260,9 +265,17 @@ static bool complete_tx(struct xsk *xsk)
> meta = data - sizeof(struct xsk_tx_metadata);
>
> printf("%p: complete tx idx=%u addr=%llx\n", xsk, idx, addr);
>- printf("%p: tx_timestamp: %llu (sec:%0.4f)\n", xsk,
>- meta->completion.tx_timestamp,
>+
>+ printf("tx_timestamp: %llu (sec:%0.4f)\n",
>+meta->completion.tx_timestamp,
> (double)meta->completion.tx_timestamp / NANOSEC_PER_SEC);
>+ if (meta->completion.tx_timestamp) {
>+ __u64 ref_tstamp = gettime(clock_id);
>+
>+ print_tstamp_delta("HW TX-complete-time", "CLOCK_TAI", meta-
>>completion.tx_timestamp, ref_tstamp);
>+ print_tstamp_delta("SW RX-time", "CLOCK_TAI",
>last_sw_rx_timestamp, ref_tstamp);
>+ print_tstamp_delta("HW RX-time", "HW TX-complete-time",
>last_hw_rx_timestamp, meta->completion.tx_timestamp);
>+ }
>+
> xsk_ring_cons__release(&xsk->comp, 1);
>
> return true;
>@@ -276,7 +289,7 @@ static bool complete_tx(struct xsk *xsk)
> } \
> } while (0)
>
>-static void ping_pong(struct xsk *xsk, void *rx_packet)
>+static void ping_pong(struct xsk *xsk, void *rx_packet, clockid_t
>+clock_id)
> {
> struct xsk_tx_metadata *meta;
> struct ipv6hdr *ip6h = NULL;
>@@ -418,14 +431,14 @@ static int verify_metadata(struct xsk *rx_xsk, int rxq, int
>server_fd, clockid_t
>
> if (!skip_tx) {
> /* mirror the packet back */
>- ping_pong(xsk, xsk_umem__get_data(xsk-
>>umem_area, addr));
>+ ping_pong(xsk, xsk_umem__get_data(xsk-
>>umem_area, addr), clock_id);
>
> ret = kick_tx(xsk);
> if (ret)
> printf("kick_tx ret=%d\n", ret);
>
> for (int j = 0; j < 500; j++) {
>- if (complete_tx(xsk))
>+ if (complete_tx(xsk, clock_id))
> break;
> usleep(10*1000);
> }
>
>
>> > > 0x1062cb8: complete rx idx=128 addr=80100
>> > >
>> > > mvbz4:~# nc -Nu -q1 ${MVBZ3_LINK_LOCAL_IP}%eth3 9091
>> > >
>> > > mvbz4:~# tcpdump -vvx -i eth3 udp
>> > > tcpdump: listening on eth3, link-type EN10MB (Ethernet), snapshot
>> > > length 262144 bytes
>> > > 12:26:09.301074 IP6 (flowlabel 0x35fa5, hlim 127, next-header UDP (17)
>payload length: 11) fe80::1270:fdff:fe48:1087.55807 >
>fe80::1270:fdff:fe48:1077.9091: [bad udp cksum 0x3b8e -> 0xde7e!] UDP, length 3
>> > > 0x0000: 6003 5fa5 000b 117f fe80 0000 0000 0000
>> > > 0x0010: 1270 fdff fe48 1087 fe80 0000 0000 0000
>> > > 0x0020: 1270 fdff fe48 1077 d9ff 2383 000b 3b8e
>> > > 0x0030: 7864 70
>> > > 12:26:09.301976 IP6 (flowlabel 0x35fa5, hlim 127, next-header UDP (17)
>payload length: 11) fe80::1270:fdff:fe48:1077.9091 >
>fe80::1270:fdff:fe48:1087.55807: [udp sum ok] UDP, length 3
>> > > 0x0000: 6003 5fa5 000b 117f fe80 0000 0000 0000
>> > > 0x0010: 1270 fdff fe48 1077 fe80 0000 0000 0000
>> > > 0x0020: 1270 fdff fe48 1087 2383 d9ff 000b de7e
>> > > 0x0030: 7864 70
>> > >
>> > > Signed-off-by: Stanislav Fomichev <sdf@google.com>
>> > > ---
>> > > tools/testing/selftests/bpf/xdp_hw_metadata.c | 202
>+++++++++++++++++-
>> > > 1 file changed, 192 insertions(+), 10 deletions(-)
>> > >
>> > > diff --git a/tools/testing/selftests/bpf/xdp_hw_metadata.c
>> > > b/tools/testing/selftests/bpf/xdp_hw_metadata.c
>> > > index 613321eb84c1..ab83d0ba6763 100644
>> > > --- a/tools/testing/selftests/bpf/xdp_hw_metadata.c
>> > > +++ b/tools/testing/selftests/bpf/xdp_hw_metadata.c
>> > > @@ -10,7 +10,9 @@
>> > > * - rx_hash
>> > > *
>> > > * TX:
>> > > - * - TBD
>> > > + * - UDP 9091 packets trigger TX reply
>> > > + * - TX HW timestamp is requested and reported back upon
>> > > + completion
>> > > + * - TX checksum is requested
>> > > */
>> > > #include <test_progs.h>
>> > > @@ -24,14 +26,17 @@
>> > [...]
>> > > @@ -51,22 +56,24 @@ struct xsk *rx_xsk;
>> > [...]
>> > > @@ -129,12 +136,22 @@ static void refill_rx(struct xsk *xsk, __u64
>> > > addr)
>> > [...]
>> > > @@ -228,6 +245,117 @@ static void verify_skb_metadata(int fd)
>> > > printf("skb hwtstamp is not found!\n");
>> > > }
>> > > +static bool complete_tx(struct xsk *xsk) {
>> > > + struct xsk_tx_metadata *meta;
>> > > + __u64 addr;
>> > > + void *data;
>> > > + __u32 idx;
>> > > +
>> > > + if (!xsk_ring_cons__peek(&xsk->comp, 1, &idx))
>> > > + return false;
>> > > +
>> > > + addr = *xsk_ring_cons__comp_addr(&xsk->comp, idx);
>> > > + data = xsk_umem__get_data(xsk->umem_area, addr);
>> > > + meta = data - sizeof(struct xsk_tx_metadata);
>> > > +
>> > > + printf("%p: complete tx idx=%u addr=%llx\n", xsk, idx, addr);
>> > > + printf("%p: tx_timestamp: %llu (sec:%0.4f)\n", xsk,
>> > > + meta->completion.tx_timestamp,
>> > > + (double)meta->completion.tx_timestamp / NANOSEC_PER_SEC);
>> > > + xsk_ring_cons__release(&xsk->comp, 1);
>> > > +
>> > > + return true;
>> > > +}
>> > > +
>> > > +#define swap(a, b, len) do { \
>> > > + for (int i = 0; i < len; i++) { \
>> > > + __u8 tmp = ((__u8 *)a)[i]; \
>> > > + ((__u8 *)a)[i] = ((__u8 *)b)[i]; \
>> > > + ((__u8 *)b)[i] = tmp; \
>> > > + } \
>> > > +} while (0)
>> > > +
>> > > +static void ping_pong(struct xsk *xsk, void *rx_packet) {
>> > > + struct xsk_tx_metadata *meta;
>> > > + struct ipv6hdr *ip6h = NULL;
>> > > + struct iphdr *iph = NULL;
>> > > + struct xdp_desc *tx_desc;
>> > > + struct udphdr *udph;
>> > > + struct ethhdr *eth;
>> > > + __sum16 want_csum;
>> > > + void *data;
>> > > + __u32 idx;
>> > > + int ret;
>> > > + int len;
>> > > +
>> > > + ret = xsk_ring_prod__reserve(&xsk->tx, 1, &idx);
>> > > + if (ret != 1) {
>> > > + printf("%p: failed to reserve tx slot\n", xsk);
>> > > + return;
>> > > + }
>> > > +
>> > > + tx_desc = xsk_ring_prod__tx_desc(&xsk->tx, idx);
>> > > + tx_desc->addr = idx % (UMEM_NUM / 2) * UMEM_FRAME_SIZE +
>sizeof(struct xsk_tx_metadata);
>> > > + data = xsk_umem__get_data(xsk->umem_area, tx_desc->addr);
>> > > +
>> > > + meta = data - sizeof(struct xsk_tx_metadata);
>> > > + memset(meta, 0, sizeof(*meta));
>> > > + meta->flags = XDP_TX_METADATA_TIMESTAMP;
>> > > +
>> > > + eth = rx_packet;
>> > > +
>> > > + if (eth->h_proto == htons(ETH_P_IP)) {
>> > > + iph = (void *)(eth + 1);
>> > > + udph = (void *)(iph + 1);
>> > > + } else if (eth->h_proto == htons(ETH_P_IPV6)) {
>> > > + ip6h = (void *)(eth + 1);
>> > > + udph = (void *)(ip6h + 1);
>> > > + } else {
>> > > + printf("%p: failed to detect IP version for ping pong %04x\n", xsk,
>eth->h_proto);
>> > > + xsk_ring_prod__cancel(&xsk->tx, 1);
>> > > + return;
>> > > + }
>> > > +
>> > > + len = ETH_HLEN;
>> > > + if (ip6h)
>> > > + len += sizeof(*ip6h) + ntohs(ip6h->payload_len);
>> > > + if (iph)
>> > > + len += ntohs(iph->tot_len);
>> > > +
>> > > + swap(eth->h_dest, eth->h_source, ETH_ALEN);
>> > > + if (iph)
>> > > + swap(&iph->saddr, &iph->daddr, 4);
>> > > + else
>> > > + swap(&ip6h->saddr, &ip6h->daddr, 16);
>> > > + swap(&udph->source, &udph->dest, 2);
>> > > +
>> > > + want_csum = udph->check;
>> > > + if (ip6h)
>> > > + udph->check = ~csum_ipv6_magic(&ip6h->saddr, &ip6h->daddr,
>> > > + ntohs(udph->len), IPPROTO_UDP, 0);
>> > > + else
>> > > + udph->check = ~csum_tcpudp_magic(iph->saddr, iph->daddr,
>> > > + ntohs(udph->len),
>IPPROTO_UDP, 0);
>> > > +
>> > > + meta->flags |= XDP_TX_METADATA_CHECKSUM;
>> > > + if (iph)
>> > > + meta->csum_start = sizeof(*eth) + sizeof(*iph);
>> > > + else
>> > > + meta->csum_start = sizeof(*eth) + sizeof(*ip6h);
>> > > + meta->csum_offset = offsetof(struct udphdr, check);
>> > > +
>> > > + printf("%p: ping-pong with csum=%04x (want %04x) csum_start=%d
>csum_offset=%d\n",
>> > > + xsk, ntohs(udph->check), ntohs(want_csum),
>> > > +meta->csum_start, meta->csum_offset);
>> > > +
>> > > + memcpy(data, rx_packet, len); /* don't share umem chunk for simplicity
>*/
>> > > + tx_desc->options |= XDP_TX_METADATA;
>> > > + tx_desc->len = len;
>> > > +
>> > > + xsk_ring_prod__submit(&xsk->tx, 1); }
>> > > +
>> > > static int verify_metadata(struct xsk *rx_xsk, int rxq, int server_fd, clockid_t
>clock_id)
>> > > {
>> > > const struct xdp_desc *rx_desc; @@ -250,6 +378,13 @@ static int
>> > > verify_metadata(struct xsk *rx_xsk, int rxq, int server_fd, clockid_t
>> > > while (true) {
>> > > errno = 0;
>> > > +
>> > > + for (i = 0; i < rxq; i++) {
>> > > + ret = kick_rx(&rx_xsk[i]);
>> > > + if (ret)
>> > > + printf("kick_rx ret=%d\n", ret);
>> > > + }
>> > > +
>> > > ret = poll(fds, rxq + 1, 1000);
>> > > printf("poll: %d (%d) skip=%llu fail=%llu redir=%llu\n",
>> > > ret, errno, bpf_obj->bss->pkts_skip, @@ -280,6 +415,22
>> > > @@ static int verify_metadata(struct xsk *rx_xsk, int rxq, int server_fd,
>clockid_t
>> > > xsk, idx, rx_desc->addr, addr, comp_addr);
>> > > verify_xdp_metadata(xsk_umem__get_data(xsk-
>>umem_area, addr),
>> > > clock_id);
>> > > +
>> > > + if (!skip_tx) {
>> > > + /* mirror the packet back */
>> > > + ping_pong(xsk, xsk_umem__get_data(xsk-
>>umem_area, addr));
>> > > +
>> > > + ret = kick_tx(xsk);
>> > > + if (ret)
>> > > + printf("kick_tx ret=%d\n", ret);
>> > > +
>> > > + for (int j = 0; j < 500; j++) {
>> > > + if (complete_tx(xsk))
>> > > + break;
>> > > + usleep(10*1000);
>> >
>> > I don't fully follow why we need this usleep here.
>>
>> To avoid the busypoll here (since we don't care too much about perf in
>> the test). But I agree, should be ok to drop, will do.
>
>I take that back, I have to keep it. Otherwise I don't have a good bound on when to
>stop/abort when waiting for completion. (and the number of loops needs to go
>from 500 to unsure-how-many).
^ permalink raw reply [flat|nested] 25+ messages in thread
* [xdp-hints] Re: [PATCH bpf-next v3 09/10] selftests/bpf: Add TX side to xdp_hw_metadata
2023-10-13 1:13 ` Song, Yoong Siang
@ 2023-10-13 18:47 ` Stanislav Fomichev
2023-10-15 13:28 ` Song, Yoong Siang
0 siblings, 1 reply; 25+ messages in thread
From: Stanislav Fomichev @ 2023-10-13 18:47 UTC (permalink / raw)
To: Song, Yoong Siang
Cc: Jesper Dangaard Brouer, bpf, ast, daniel, andrii, martin.lau,
song, yhs, john.fastabend, kpsingh, haoluo, jolsa, kuba, toke,
willemb, dsahern, Karlsson, Magnus, bjorn, Fijalkowski, Maciej,
netdev, xdp-hints
On Thu, Oct 12, 2023 at 6:14 PM Song, Yoong Siang
<yoong.siang.song@intel.com> wrote:
>
> On Wednesday, October 11, 2023 4:40 AM, Stanislav Fomichev <sdf@google.com> wrote:
> >On 10/09, Stanislav Fomichev wrote:
> >> On 10/09, Jesper Dangaard Brouer wrote:
> >> >
> >> >
> >> > On 03/10/2023 22.05, Stanislav Fomichev wrote:
> >> > > When we get a packet on port 9091, we swap src/dst and send it out.
> >> > > At this point we also request the timestamp and checksum offloads.
> >> > >
> >> > > Checksum offload is verified by looking at the tcpdump on the other side.
> >> > > The tool prints pseudo-header csum and the final one it expects.
> >> > > The final checksum actually matches the incoming packets checksum
> >> > > because we only flip the src/dst and don't change the payload.
> >> > >
> >> > > Some other related changes:
> >> > > - switched to zerocopy mode by default; new flag can be used to force
> >> > > old behavior
> >> > > - request fixed tx_metadata_len headroom
> >> > > - some other small fixes (umem size, fill idx+i, etc)
> >> > >
> >> > > mvbz3:~# ./xdp_hw_metadata eth3
> >> > > ...
> >> > > 0x1062cb8: rx_desc[0]->addr=80100 addr=80100 comp_addr=80100
> >> > > rx_hash: 0x2E1B50B9 with RSS type:0x2A
> >> > > rx_timestamp: 1691436369532047139 (sec:1691436369.5320)
> >> > > XDP RX-time: 1691436369261756803 (sec:1691436369.2618) delta sec:-0.2703 (-270290.336 usec)
> >> >
> >> > I guess system time isn't configured to be in sync with NIC HW time,
> >> > as delta is a negative offset.
> >> >
> >> > > AF_XDP time: 1691436369261878839 (sec:1691436369.2619) delta sec:0.0001 (122.036 usec)
> >> > The AF_XDP time is also software system time and compared to XDP
> >> > RX-time, it shows a delta of 122 usec. This number indicates to me
> >> > that the CPU is likely configured with power saving sleep states.
> >>
> >> Yes, I don't do any synchronization and don't disable the sleep states.
> >>
> >> > > 0x1062cb8: ping-pong with csum=3b8e (want de7e) csum_start=54
> >> > > csum_offset=6
> >> > > 0x1062cb8: complete tx idx=0 addr=10
> >> > > 0x1062cb8: tx_timestamp: 1691436369598419505
> >> > > (sec:1691436369.5984)
> >> >
> >> > Could we add something that we can relate tx_timestamp to?
> >> >
> >> > Like we do with the other delta calculations, as it helps us to
> >> > understand/validate if the number we get back is sane. Like the negative
> >> > offset above tells us that system time sync isn't configured, and
> >> > that the system has C-states configured.
> >> >
> >> > I suggest delta comparing "tx_timestamp" to "rx_timestamp", as they
> >> > are in the same clock domain. It will tell us the total time spent
> >> > from HW RX to HW TX, counting all the time used by software "ping-pong".
> >> >
> >> > 1691436369.5984-1691436369.5320 = 0.0664 sec = 66.4 ms
> >> >
> >> > When implementing this, it could be (1) practical to store the
> >> > "rx_timestamp" in the metadata area of the completion packet, or (2)
> >> > should we have a mechanism for external storage that can be keyed on
> >> > the umem "addr"?
> >>
> >> Sounds good. I can probably just store last rx_timestamp somewhere in
> >> the global var and do a delta on tx? Since the test is single threaded
> >> and sequential, not sure we need the mechanism to pass the tstamp around.
> >> LMK if you disagree and I'm missing something.
> >
> >I ended up reshuffling the current code a bit to basically use CLOCK_TAI as a reference for
> >every line. Feels like it's a bit simpler when everything is referenced against the
> >same clock?
> >
> >For RX part, I rename existing XDP/AF_XDP to HW/SW and dump them both
> >relative to tai.
> >
> >0x195d1f0: rx_desc[0]->addr=80100 addr=80100 comp_addr=80100
> >rx_hash: 0xEE2BBD59 with RSS type:0x2A
> >rx_timestamp: 1696969312125212179 (sec:1696969312.1252)
> >HW RX-time: 1696969312125212179 (sec:1696969312.1252) to CLOCK_TAI delta sec:-0.1339 (-133862.968 usec)
> >SW RX-time: 1696969311991283421 (sec:1696969311.9913) to CLOCK_TAI delta sec:0.0001 (65.790 usec)
> >0x195d1f0: ping-pong with csum=3b8e (want de5f) csum_start=54 csum_offset=6
> >0x195d1f0: complete tx idx=0 addr=8
> >tx_timestamp: 1696969312152959759 (sec:1696969312.1530)
> >SW RX-time: 1696969311991283421 (sec:1696969311.9913) to CLOCK_TAI delta sec:0.0101 (10139.862 usec)
> >HW RX-time: 1696969312125212179 (sec:1696969312.1252) to HW TX-complete-time delta sec:0.0277 (27747.580 usec)
> >HW TX-complete-time: 1696969312152959759 (sec:1696969312.1530) to CLOCK_TAI delta sec:-0.1515 (-151536.476 usec)
> >
> >For TX part, I add a bunch of reference points:
> >1) SW RX-time (meta->xdp_timestamp) vs CLOCK_TAI (aka tai-at-complete-time)
> >2) HW RX-time (meta->rx_timestamp) vs HW TX-complete-time (new af_xdp
> >timestamp)
> >3) HW TX-complete-time vs CLOCK_TAI
> >
> >What do you think? See the patch below.
>
> Hi Stanislav,
>
> For me, the "CLOCK_TAI" label in the output is a bit confusing because:
> 1. There are two TAI values that refer to different moments but share the same name "CLOCK_TAI".
> 2. SW RX-time is also a CLOCK_TAI value.
>
> So I suggest changing the naming:
> - HW RX-time: the moment the NIC receives the packet (based on PHC)
> - XDP RX-time: the moment the BPF prog parses the packet (based on TAI)
> - SW RX-time: the moment the user app receives the packet (based on TAI)
> - HW TX-complete-time: the moment the NIC sends out the packet (based on PHC)
> - SW TX-complete-time: the moment the user app learns the packet was sent out (based on TAI)
SG. Maybe also do s/SW/User/, to signify that these are userspace-level
timestamps?
> Thanks & Regards
> Siang
>
> >
> >Note: all 3 of the above should, in theory, be more or less constant (with irq
> >moderation / etc disabled). But for me on mlx5 (2) they are not, and it looks like the hw rx
> >timestamp jitters quite a bit. I don't have a clue right now as to why; will try to take a
> >separate look, but it's unrelated to the tx side.
> >
> >
> >diff --git a/tools/testing/selftests/bpf/xdp_hw_metadata.c
> >b/tools/testing/selftests/bpf/xdp_hw_metadata.c
> >index ab83d0ba6763..64a90d7479c1 100644
> >--- a/tools/testing/selftests/bpf/xdp_hw_metadata.c
> >+++ b/tools/testing/selftests/bpf/xdp_hw_metadata.c
> >@@ -57,6 +57,8 @@ const char *ifname;
> > int ifindex;
> > int rxq;
> > bool skip_tx;
> >+__u64 last_hw_rx_timestamp;
> >+__u64 last_sw_rx_timestamp;
> >
> > void test__fail(void) { /* for network_helpers.c */ }
> >
> >@@ -167,6 +169,16 @@ static __u64 gettime(clockid_t clock_id)
> > return (__u64) t.tv_sec * NANOSEC_PER_SEC + t.tv_nsec; }
> >
> >+static void print_tstamp_delta(const char *name, const char *refname,
> >+__u64 tstamp, __u64 reference) {
> >+ __s64 delta = (__s64)reference - (__s64)tstamp;
> >+
> >+ printf("%s: %llu (sec:%0.4f) to %s delta sec:%0.4f (%0.3f usec)\n",
> >+ name, tstamp, (double)tstamp / NANOSEC_PER_SEC, refname,
> >+ (double)delta / NANOSEC_PER_SEC,
> >+ (double)delta / 1000);
> >+}
> >+
> > static void verify_xdp_metadata(void *data, clockid_t clock_id) {
> > struct xdp_meta *meta;
> >@@ -182,22 +194,15 @@ static void verify_xdp_metadata(void *data, clockid_t
> >clock_id)
> > printf("rx_timestamp: %llu (sec:%0.4f)\n", meta->rx_timestamp,
> > (double)meta->rx_timestamp / NANOSEC_PER_SEC);
> > if (meta->rx_timestamp) {
> >- __u64 usr_clock = gettime(clock_id);
> >- __u64 xdp_clock = meta->xdp_timestamp;
> >- __s64 delta_X = xdp_clock - meta->rx_timestamp;
> >- __s64 delta_X2U = usr_clock - xdp_clock;
> >-
> >- printf("XDP RX-time: %llu (sec:%0.4f) delta sec:%0.4f (%0.3f
> >usec)\n",
> >- xdp_clock, (double)xdp_clock / NANOSEC_PER_SEC,
> >- (double)delta_X / NANOSEC_PER_SEC,
> >- (double)delta_X / 1000);
> >-
> >- printf("AF_XDP time: %llu (sec:%0.4f) delta sec:%0.4f (%0.3f
> >usec)\n",
> >- usr_clock, (double)usr_clock / NANOSEC_PER_SEC,
> >- (double)delta_X2U / NANOSEC_PER_SEC,
> >- (double)delta_X2U / 1000);
> >- }
> >+ __u64 ref_tstamp = gettime(clock_id);
> >+
> >+ /* store received timestamps to calculate a delta at tx */
> >+ last_hw_rx_timestamp = meta->rx_timestamp;
> >+ last_sw_rx_timestamp = meta->xdp_timestamp;
> >
> >+ print_tstamp_delta("HW RX-time", "CLOCK_TAI", meta-
> >>rx_timestamp, ref_tstamp);
> >+ print_tstamp_delta("SW RX-time", "CLOCK_TAI", meta-
> >>xdp_timestamp, ref_tstamp);
> >+ }
> > }
> >
> > static void verify_skb_metadata(int fd) @@ -245,7 +250,7 @@ static void
> >verify_skb_metadata(int fd)
> > printf("skb hwtstamp is not found!\n"); }
> >
> >-static bool complete_tx(struct xsk *xsk)
> >+static bool complete_tx(struct xsk *xsk, clockid_t clock_id)
> > {
> > struct xsk_tx_metadata *meta;
> > __u64 addr;
> >@@ -260,9 +265,17 @@ static bool complete_tx(struct xsk *xsk)
> > meta = data - sizeof(struct xsk_tx_metadata);
> >
> > printf("%p: complete tx idx=%u addr=%llx\n", xsk, idx, addr);
> >- printf("%p: tx_timestamp: %llu (sec:%0.4f)\n", xsk,
> >- meta->completion.tx_timestamp,
> >+
> >+ printf("tx_timestamp: %llu (sec:%0.4f)\n",
> >+meta->completion.tx_timestamp,
> > (double)meta->completion.tx_timestamp / NANOSEC_PER_SEC);
> >+ if (meta->completion.tx_timestamp) {
> >+ __u64 ref_tstamp = gettime(clock_id);
> >+
> >+ print_tstamp_delta("HW TX-complete-time", "CLOCK_TAI", meta-
> >>completion.tx_timestamp, ref_tstamp);
> >+ print_tstamp_delta("SW RX-time", "CLOCK_TAI",
> >last_sw_rx_timestamp, ref_tstamp);
> >+ print_tstamp_delta("HW RX-time", "HW TX-complete-time",
> >last_hw_rx_timestamp, meta->completion.tx_timestamp);
> >+ }
> >+
> > xsk_ring_cons__release(&xsk->comp, 1);
> >
> > return true;
> >@@ -276,7 +289,7 @@ static bool complete_tx(struct xsk *xsk)
> > } \
> > } while (0)
> >
> >-static void ping_pong(struct xsk *xsk, void *rx_packet)
> >+static void ping_pong(struct xsk *xsk, void *rx_packet, clockid_t
> >+clock_id)
> > {
> > struct xsk_tx_metadata *meta;
> > struct ipv6hdr *ip6h = NULL;
> >@@ -418,14 +431,14 @@ static int verify_metadata(struct xsk *rx_xsk, int rxq, int
> >server_fd, clockid_t
> >
> > if (!skip_tx) {
> > /* mirror the packet back */
> >- ping_pong(xsk, xsk_umem__get_data(xsk-
> >>umem_area, addr));
> >+ ping_pong(xsk, xsk_umem__get_data(xsk-
> >>umem_area, addr), clock_id);
> >
> > ret = kick_tx(xsk);
> > if (ret)
> > printf("kick_tx ret=%d\n", ret);
> >
> > for (int j = 0; j < 500; j++) {
> >- if (complete_tx(xsk))
> >+ if (complete_tx(xsk, clock_id))
> > break;
> > usleep(10*1000);
> > }
> >
> >
> >> > > 0x1062cb8: complete rx idx=128 addr=80100
> >> > >
> >> > > mvbz4:~# nc -Nu -q1 ${MVBZ3_LINK_LOCAL_IP}%eth3 9091
> >> > >
> >> > > mvbz4:~# tcpdump -vvx -i eth3 udp
> >> > > tcpdump: listening on eth3, link-type EN10MB (Ethernet), snapshot
> >> > > length 262144 bytes
> >> > > 12:26:09.301074 IP6 (flowlabel 0x35fa5, hlim 127, next-header UDP (17)
> >payload length: 11) fe80::1270:fdff:fe48:1087.55807 >
> >fe80::1270:fdff:fe48:1077.9091: [bad udp cksum 0x3b8e -> 0xde7e!] UDP, length 3
> >> > > 0x0000: 6003 5fa5 000b 117f fe80 0000 0000 0000
> >> > > 0x0010: 1270 fdff fe48 1087 fe80 0000 0000 0000
> >> > > 0x0020: 1270 fdff fe48 1077 d9ff 2383 000b 3b8e
> >> > > 0x0030: 7864 70
> >> > > 12:26:09.301976 IP6 (flowlabel 0x35fa5, hlim 127, next-header UDP (17)
> >payload length: 11) fe80::1270:fdff:fe48:1077.9091 >
> >fe80::1270:fdff:fe48:1087.55807: [udp sum ok] UDP, length 3
> >> > > 0x0000: 6003 5fa5 000b 117f fe80 0000 0000 0000
> >> > > 0x0010: 1270 fdff fe48 1077 fe80 0000 0000 0000
> >> > > 0x0020: 1270 fdff fe48 1087 2383 d9ff 000b de7e
> >> > > 0x0030: 7864 70
> >> > >
> >> > > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> >> > > ---
> >> > > tools/testing/selftests/bpf/xdp_hw_metadata.c | 202
> >+++++++++++++++++-
> >> > > 1 file changed, 192 insertions(+), 10 deletions(-)
> >> > >
> >> > > diff --git a/tools/testing/selftests/bpf/xdp_hw_metadata.c
> >> > > b/tools/testing/selftests/bpf/xdp_hw_metadata.c
> >> > > index 613321eb84c1..ab83d0ba6763 100644
> >> > > --- a/tools/testing/selftests/bpf/xdp_hw_metadata.c
> >> > > +++ b/tools/testing/selftests/bpf/xdp_hw_metadata.c
> >> > > @@ -10,7 +10,9 @@
> >> > > * - rx_hash
> >> > > *
> >> > > * TX:
> >> > > - * - TBD
> >> > > + * - UDP 9091 packets trigger TX reply
> >> > > + * - TX HW timestamp is requested and reported back upon
> >> > > + completion
> >> > > + * - TX checksum is requested
> >> > > */
> >> > > #include <test_progs.h>
> >> > > @@ -24,14 +26,17 @@
> >> > [...]
> >> > > @@ -51,22 +56,24 @@ struct xsk *rx_xsk;
> >> > [...]
> >> > > @@ -129,12 +136,22 @@ static void refill_rx(struct xsk *xsk, __u64
> >> > > addr)
> >> > [...]
> >> > > @@ -228,6 +245,117 @@ static void verify_skb_metadata(int fd)
> >> > > printf("skb hwtstamp is not found!\n");
> >> > > }
> >> > > +static bool complete_tx(struct xsk *xsk) {
> >> > > + struct xsk_tx_metadata *meta;
> >> > > + __u64 addr;
> >> > > + void *data;
> >> > > + __u32 idx;
> >> > > +
> >> > > + if (!xsk_ring_cons__peek(&xsk->comp, 1, &idx))
> >> > > + return false;
> >> > > +
> >> > > + addr = *xsk_ring_cons__comp_addr(&xsk->comp, idx);
> >> > > + data = xsk_umem__get_data(xsk->umem_area, addr);
> >> > > + meta = data - sizeof(struct xsk_tx_metadata);
> >> > > +
> >> > > + printf("%p: complete tx idx=%u addr=%llx\n", xsk, idx, addr);
> >> > > + printf("%p: tx_timestamp: %llu (sec:%0.4f)\n", xsk,
> >> > > + meta->completion.tx_timestamp,
> >> > > + (double)meta->completion.tx_timestamp / NANOSEC_PER_SEC);
> >> > > + xsk_ring_cons__release(&xsk->comp, 1);
> >> > > +
> >> > > + return true;
> >> > > +}
> >> > > +
> >> > > +#define swap(a, b, len) do { \
> >> > > + for (int i = 0; i < len; i++) { \
> >> > > + __u8 tmp = ((__u8 *)a)[i]; \
> >> > > + ((__u8 *)a)[i] = ((__u8 *)b)[i]; \
> >> > > + ((__u8 *)b)[i] = tmp; \
> >> > > + } \
> >> > > +} while (0)
> >> > > +
> >> > > +static void ping_pong(struct xsk *xsk, void *rx_packet) {
> >> > > + struct xsk_tx_metadata *meta;
> >> > > + struct ipv6hdr *ip6h = NULL;
> >> > > + struct iphdr *iph = NULL;
> >> > > + struct xdp_desc *tx_desc;
> >> > > + struct udphdr *udph;
> >> > > + struct ethhdr *eth;
> >> > > + __sum16 want_csum;
> >> > > + void *data;
> >> > > + __u32 idx;
> >> > > + int ret;
> >> > > + int len;
> >> > > +
> >> > > + ret = xsk_ring_prod__reserve(&xsk->tx, 1, &idx);
> >> > > + if (ret != 1) {
> >> > > + printf("%p: failed to reserve tx slot\n", xsk);
> >> > > + return;
> >> > > + }
> >> > > +
> >> > > + tx_desc = xsk_ring_prod__tx_desc(&xsk->tx, idx);
> >> > > + tx_desc->addr = idx % (UMEM_NUM / 2) * UMEM_FRAME_SIZE +
> >sizeof(struct xsk_tx_metadata);
> >> > > + data = xsk_umem__get_data(xsk->umem_area, tx_desc->addr);
> >> > > +
> >> > > + meta = data - sizeof(struct xsk_tx_metadata);
> >> > > + memset(meta, 0, sizeof(*meta));
> >> > > + meta->flags = XDP_TX_METADATA_TIMESTAMP;
> >> > > +
> >> > > + eth = rx_packet;
> >> > > +
> >> > > + if (eth->h_proto == htons(ETH_P_IP)) {
> >> > > + iph = (void *)(eth + 1);
> >> > > + udph = (void *)(iph + 1);
> >> > > + } else if (eth->h_proto == htons(ETH_P_IPV6)) {
> >> > > + ip6h = (void *)(eth + 1);
> >> > > + udph = (void *)(ip6h + 1);
> >> > > + } else {
> >> > > + printf("%p: failed to detect IP version for ping pong %04x\n", xsk,
> >eth->h_proto);
> >> > > + xsk_ring_prod__cancel(&xsk->tx, 1);
> >> > > + return;
> >> > > + }
> >> > > +
> >> > > + len = ETH_HLEN;
> >> > > + if (ip6h)
> >> > > + len += sizeof(*ip6h) + ntohs(ip6h->payload_len);
> >> > > + if (iph)
> >> > > + len += ntohs(iph->tot_len);
> >> > > +
> >> > > + swap(eth->h_dest, eth->h_source, ETH_ALEN);
> >> > > + if (iph)
> >> > > + swap(&iph->saddr, &iph->daddr, 4);
> >> > > + else
> >> > > + swap(&ip6h->saddr, &ip6h->daddr, 16);
> >> > > + swap(&udph->source, &udph->dest, 2);
> >> > > +
> >> > > + want_csum = udph->check;
> >> > > + if (ip6h)
> >> > > + udph->check = ~csum_ipv6_magic(&ip6h->saddr, &ip6h->daddr,
> >> > > + ntohs(udph->len), IPPROTO_UDP, 0);
> >> > > + else
> >> > > + udph->check = ~csum_tcpudp_magic(iph->saddr, iph->daddr,
> >> > > + ntohs(udph->len),
> >IPPROTO_UDP, 0);
> >> > > +
> >> > > + meta->flags |= XDP_TX_METADATA_CHECKSUM;
> >> > > + if (iph)
> >> > > + meta->csum_start = sizeof(*eth) + sizeof(*iph);
> >> > > + else
> >> > > + meta->csum_start = sizeof(*eth) + sizeof(*ip6h);
> >> > > + meta->csum_offset = offsetof(struct udphdr, check);
> >> > > +
> >> > > + printf("%p: ping-pong with csum=%04x (want %04x) csum_start=%d
> >csum_offset=%d\n",
> >> > > + xsk, ntohs(udph->check), ntohs(want_csum),
> >> > > +meta->csum_start, meta->csum_offset);
> >> > > +
> >> > > + memcpy(data, rx_packet, len); /* don't share umem chunk for simplicity
> >*/
> >> > > + tx_desc->options |= XDP_TX_METADATA;
> >> > > + tx_desc->len = len;
> >> > > +
> >> > > + xsk_ring_prod__submit(&xsk->tx, 1); }
> >> > > +
> >> > > static int verify_metadata(struct xsk *rx_xsk, int rxq, int server_fd, clockid_t
> >clock_id)
> >> > > {
> >> > > const struct xdp_desc *rx_desc; @@ -250,6 +378,13 @@ static int
> >> > > verify_metadata(struct xsk *rx_xsk, int rxq, int server_fd, clockid_t
> >> > > while (true) {
> >> > > errno = 0;
> >> > > +
> >> > > + for (i = 0; i < rxq; i++) {
> >> > > + ret = kick_rx(&rx_xsk[i]);
> >> > > + if (ret)
> >> > > + printf("kick_rx ret=%d\n", ret);
> >> > > + }
> >> > > +
> >> > > ret = poll(fds, rxq + 1, 1000);
> >> > > printf("poll: %d (%d) skip=%llu fail=%llu redir=%llu\n",
> >> > > ret, errno, bpf_obj->bss->pkts_skip, @@ -280,6 +415,22
> >> > > @@ static int verify_metadata(struct xsk *rx_xsk, int rxq, int server_fd,
> >clockid_t
> >> > > xsk, idx, rx_desc->addr, addr, comp_addr);
> >> > > verify_xdp_metadata(xsk_umem__get_data(xsk-
> >>umem_area, addr),
> >> > > clock_id);
> >> > > +
> >> > > + if (!skip_tx) {
> >> > > + /* mirror the packet back */
> >> > > + ping_pong(xsk, xsk_umem__get_data(xsk-
> >>umem_area, addr));
> >> > > +
> >> > > + ret = kick_tx(xsk);
> >> > > + if (ret)
> >> > > + printf("kick_tx ret=%d\n", ret);
> >> > > +
> >> > > + for (int j = 0; j < 500; j++) {
> >> > > + if (complete_tx(xsk))
> >> > > + break;
> >> > > + usleep(10*1000);
> >> >
> >> > I don't fully follow why we need this usleep here.
> >>
> >> To avoid the busypoll here (since we don't care too much about perf in
> >> the test). But I agree, should be ok to drop, will do.
> >
> >I take that back, I have to keep it. Otherwise I don't have a good bound on when to
> >stop/abort when waiting for completion. (and the number of loops needs to go
> >from 500 to unsure-how-many).
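A minimal sketch of the bounded wait discussed above, not the selftest's actual code: with 500 attempts and a 10 ms sleep between them, the loop gives up after roughly 5 seconds instead of spinning without a limit. The names wait_bounded, try_fn and countdown_done are hypothetical and exist only for illustration; in the selftest the callback's role is played by complete_tx().

```c
#define _DEFAULT_SOURCE		/* for usleep() under strict -std= modes */
#include <assert.h>
#include <stdbool.h>
#include <unistd.h>

/* Hypothetical callback type: returns true once the awaited event was seen. */
typedef bool (*try_fn)(void *ctx);

/*
 * Call try_once() up to max_tries times, sleeping sleep_us microseconds
 * after each failed attempt. Returns the number of attempts used on
 * success, or -1 if the bound was reached without success.
 */
static int wait_bounded(try_fn try_once, void *ctx, int max_tries,
			unsigned int sleep_us)
{
	for (int i = 0; i < max_tries; i++) {
		if (try_once(ctx))
			return i + 1;
		usleep(sleep_us);
	}
	return -1;
}

/* Stand-in for a completion check: succeeds once the counter reaches zero. */
static bool countdown_done(void *ctx)
{
	int *left = ctx;

	if (*left > 0)
		(*left)--;
	return *left == 0;
}
```

Calling wait_bounded(countdown_done, &left, 500, 10 * 1000) mirrors the 500 x 10 ms bound from the patch; without the sleep, the iteration count alone says nothing about how long the loop waits.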
* [xdp-hints] Re: [PATCH bpf-next v3 09/10] selftests/bpf: Add TX side to xdp_hw_metadata
2023-10-13 18:47 ` Stanislav Fomichev
@ 2023-10-15 13:28 ` Song, Yoong Siang
0 siblings, 0 replies; 25+ messages in thread
From: Song, Yoong Siang @ 2023-10-15 13:28 UTC (permalink / raw)
To: Stanislav Fomichev
Cc: Jesper Dangaard Brouer, bpf, ast, daniel, andrii, martin.lau,
song, yhs, john.fastabend, kpsingh, haoluo, jolsa, kuba, toke,
willemb, dsahern, Karlsson, Magnus, bjorn, Fijalkowski, Maciej,
netdev, xdp-hints
On Saturday, October 14, 2023 2:48 AM Stanislav Fomichev <sdf@google.com> wrote:
>On Thu, Oct 12, 2023 at 6:14 PM Song, Yoong Siang <yoong.siang.song@intel.com> wrote:
[...]
>> So, I suggest to change the naming:
>> - HW RX-time: the moment NIC receive the packet (based on PHC)
>> - XDP RX-time: the moment bpf prog parse the packet (based on tai)
>> - SW RX-time: the moment user app receive the packet (based on tai)
>> - HW TX-complete-time: the moment NIC send out the packet (based on
>> PHC)
>> - SW TX-complete-time: the moment user app know the packet being
>> send out (based on tai)
>
>SG. Maybe also do s/SW/User/ to signify that these are userspace-level timestamps?
Sounds good.
>
>> Thanks & Regards
>> Siang
>>
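A rough sketch of how the five timestamps could be held and compared under the naming agreed above, with SW renamed to User. The struct pkt_tstamps and the helpers tstamp_delta_ns and format_delta are hypothetical and are not taken from the patch; only the delta arithmetic follows print_tstamp_delta() from the diff. It also assumes the PHC is synchronized to CLOCK_TAI, since otherwise the HW-to-User deltas compare two unrelated clocks.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical container using the proposed names (all in nanoseconds). */
struct pkt_tstamps {
	uint64_t hw_rx;			/* NIC receives the packet (PHC) */
	uint64_t xdp_rx;		/* BPF program parses the packet (TAI) */
	uint64_t user_rx;		/* user app receives the packet (TAI) */
	uint64_t hw_tx_complete;	/* NIC sends out the packet (PHC) */
	uint64_t user_tx_complete;	/* user app sees the completion (TAI) */
};

/* Signed distance from tstamp to reference; negative if tstamp is later. */
static int64_t tstamp_delta_ns(uint64_t tstamp, uint64_t reference)
{
	return (int64_t)(reference - tstamp);
}

/* Format one "name to refname" delta in microseconds into buf. */
static int format_delta(char *buf, size_t len, const char *name,
			const char *refname, uint64_t tstamp,
			uint64_t reference)
{
	int64_t delta = tstamp_delta_ns(tstamp, reference);

	return snprintf(buf, len, "%s to %s delta: %0.3f usec",
			name, refname, (double)delta / 1000);
}
```

Naming both ends of every delta, for example format_delta(buf, sizeof(buf), "HW RX-time", "User RX-time", t.hw_rx, t.user_rx), avoids the ambiguity of printing two different moments under the same "CLOCK_TAI" label.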
Thread overview: 25+ messages
2023-10-03 20:05 [xdp-hints] [PATCH bpf-next v3 00/10] xsk: TX metadata Stanislav Fomichev
2023-10-03 20:05 ` [xdp-hints] [PATCH bpf-next v3 01/10] xsk: Support tx_metadata_len Stanislav Fomichev
2023-10-03 20:05 ` [xdp-hints] [PATCH bpf-next v3 02/10] xsk: add TX timestamp and TX checksum offload support Stanislav Fomichev
2023-10-04 6:18 ` [xdp-hints] " Song, Yoong Siang
2023-10-04 17:48 ` Stanislav Fomichev
2023-10-04 17:56 ` Stanislav Fomichev
2023-10-05 1:16 ` Song, Yoong Siang
2023-10-03 20:05 ` [xdp-hints] [PATCH bpf-next v3 03/10] tools: ynl: print xsk-features from the sample Stanislav Fomichev
2023-10-03 20:05 ` [xdp-hints] [PATCH bpf-next v3 04/10] net/mlx5e: Implement AF_XDP TX timestamp and checksum offload Stanislav Fomichev
2023-10-04 23:47 ` [xdp-hints] " kernel test robot
2023-10-03 20:05 ` [xdp-hints] [PATCH bpf-next v3 05/10] net: stmmac: Add Tx HWTS support to XDP ZC Stanislav Fomichev
2023-10-04 23:05 ` [xdp-hints] " kernel test robot
2023-10-04 23:14 ` Stanislav Fomichev
2023-10-06 4:38 ` kernel test robot
2023-10-03 20:05 ` [xdp-hints] [PATCH bpf-next v3 06/10] selftests/xsk: Support tx_metadata_len Stanislav Fomichev
2023-10-03 20:05 ` [xdp-hints] [PATCH bpf-next v3 07/10] selftests/bpf: Add csum helpers Stanislav Fomichev
2023-10-03 20:05 ` [xdp-hints] [PATCH bpf-next v3 08/10] selftests/bpf: Add TX side to xdp_metadata Stanislav Fomichev
2023-10-03 20:05 ` [xdp-hints] [PATCH bpf-next v3 09/10] selftests/bpf: Add TX side to xdp_hw_metadata Stanislav Fomichev
2023-10-09 8:12 ` [xdp-hints] " Jesper Dangaard Brouer
2023-10-09 16:37 ` Stanislav Fomichev
2023-10-10 20:40 ` Stanislav Fomichev
2023-10-13 1:13 ` Song, Yoong Siang
2023-10-13 18:47 ` Stanislav Fomichev
2023-10-15 13:28 ` Song, Yoong Siang
2023-10-03 20:05 ` [xdp-hints] [PATCH bpf-next v3 10/10] xsk: document tx_metadata_len layout Stanislav Fomichev