* [xdp-hints] [RFC net-next v4 0/8] xsk: TX metadata
@ 2023-07-24 23:59 Stanislav Fomichev
2023-07-24 23:59 ` [xdp-hints] [RFC net-next v4 1/8] xsk: Support XDP_TX_METADATA_LEN Stanislav Fomichev
` (7 more replies)
0 siblings, 8 replies; 30+ messages in thread
From: Stanislav Fomichev @ 2023-07-24 23:59 UTC (permalink / raw)
To: bpf
Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
kpsingh, sdf, haoluo, jolsa, kuba, toke, willemb, dsahern,
magnus.karlsson, bjorn, maciej.fijalkowski, hawk, netdev,
xdp-hints
This series implements initial TX metadata (offloads) for AF_XDP.
See patch #2 for the main implementation and mlx5-one for the
example on how to consume the metadata on the device side.
With v4, I'm switching from BPF hooks and kfuncs to a fixed TX
metadata/offload format: the metadata is passed with every tx
umem chunk and it's up to the device driver to interpret it.
Starting with two types of offloads:
- request TX timestamp (and write it back into the metadata area)
- request TX checksum offload
TODO for the non-RFC series:
- have some real device support to verify xdp_hw_metadata works
- Documentation/networking/xdp-rx-metadata.rst - like documentation
v3: https://lore.kernel.org/bpf/20230707193006.1309662-1-sdf@google.com/
Stanislav Fomichev (8):
xsk: Support XDP_TX_METADATA_LEN
xsk: add TX timestamp and TX checksum offload support
net/mlx5e: Implement AF_XDP TX timestamp and checksum offload
tools: ynl: update netdev sample to dump xsk-features
selftests/xsk: Support XDP_TX_METADATA_LEN
selftests/bpf: Add csum helpers
selftests/bpf: Add TX side to xdp_metadata
selftests/bpf: Add TX side to xdp_hw_metadata
Documentation/netlink/specs/netdev.yaml | 25 ++-
drivers/net/ethernet/mellanox/mlx5/core/en.h | 4 +-
.../net/ethernet/mellanox/mlx5/core/en/xdp.c | 71 ++++++-
.../net/ethernet/mellanox/mlx5/core/en/xdp.h | 10 +-
.../ethernet/mellanox/mlx5/core/en/xsk/tx.c | 9 +-
.../net/ethernet/mellanox/mlx5/core/en_main.c | 1 +
include/linux/netdevice.h | 27 +++
include/linux/skbuff.h | 5 +-
include/net/xdp_sock.h | 61 ++++++
include/net/xdp_sock_drv.h | 13 ++
include/net/xsk_buff_pool.h | 1 +
include/uapi/linux/if_xdp.h | 36 ++++
include/uapi/linux/netdev.h | 15 ++
net/core/netdev-genl.c | 12 +-
net/xdp/xsk.c | 58 +++++
net/xdp/xsk_buff_pool.c | 1 +
net/xdp/xsk_queue.h | 19 +-
tools/include/uapi/linux/if_xdp.h | 50 ++++-
tools/include/uapi/linux/netdev.h | 15 ++
tools/net/ynl/generated/netdev-user.c | 25 +++
tools/net/ynl/generated/netdev-user.h | 5 +
tools/net/ynl/samples/netdev.c | 8 +
tools/testing/selftests/bpf/network_helpers.h | 43 ++++
.../selftests/bpf/prog_tests/xdp_metadata.c | 31 ++-
tools/testing/selftests/bpf/xdp_hw_metadata.c | 201 +++++++++++++++++-
tools/testing/selftests/bpf/xsk.c | 17 ++
tools/testing/selftests/bpf/xsk.h | 1 +
27 files changed, 718 insertions(+), 46 deletions(-)
--
2.41.0.487.g6d72f3e995-goog
^ permalink raw reply [flat|nested] 30+ messages in thread
* [xdp-hints] [RFC net-next v4 1/8] xsk: Support XDP_TX_METADATA_LEN
2023-07-24 23:59 [xdp-hints] [RFC net-next v4 0/8] xsk: TX metadata Stanislav Fomichev
@ 2023-07-24 23:59 ` Stanislav Fomichev
2023-07-24 23:59 ` [xdp-hints] [RFC net-next v4 2/8] xsk: add TX timestamp and TX checksum offload support Stanislav Fomichev
` (6 subsequent siblings)
7 siblings, 0 replies; 30+ messages in thread
From: Stanislav Fomichev @ 2023-07-24 23:59 UTC (permalink / raw)
To: bpf
Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
kpsingh, sdf, haoluo, jolsa, kuba, toke, willemb, dsahern,
magnus.karlsson, bjorn, maciej.fijalkowski, hawk, netdev,
xdp-hints
For zerocopy mode, tx_desc->addr can point to the arbitrary offset
and carry some TX metadata in the headroom. For copy mode, there
is no way currently to populate skb metadata.
Introduce new XDP_TX_METADATA_LEN that indicates how many bytes
to treat as metadata. Metadata bytes come prior to tx_desc address
(same as in RX case).
The size of the metadata has the same constraints as XDP:
- less than 256 bytes
- 4-byte aligned
- non-zero
This data is not interpreted in any way right now.
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
include/net/xdp_sock.h | 1 +
include/net/xsk_buff_pool.h | 1 +
include/uapi/linux/if_xdp.h | 1 +
net/xdp/xsk.c | 20 ++++++++++++++++++++
net/xdp/xsk_buff_pool.c | 1 +
net/xdp/xsk_queue.h | 17 ++++++++++-------
6 files changed, 34 insertions(+), 7 deletions(-)
diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 1617af380162..467b9fb56827 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -51,6 +51,7 @@ struct xdp_sock {
struct list_head flush_node;
struct xsk_buff_pool *pool;
u16 queue_id;
+ u8 tx_metadata_len;
bool zc;
bool sg;
enum {
diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
index b0bdff26fc88..9c31e8d1e198 100644
--- a/include/net/xsk_buff_pool.h
+++ b/include/net/xsk_buff_pool.h
@@ -77,6 +77,7 @@ struct xsk_buff_pool {
u32 chunk_size;
u32 chunk_shift;
u32 frame_len;
+ u8 tx_metadata_len; /* inherited from xsk_sock */
u8 cached_need_wakeup;
bool uses_need_wakeup;
bool dma_need_sync;
diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index 8d48863472b9..b37b50102e1c 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -69,6 +69,7 @@ struct xdp_mmap_offsets {
#define XDP_UMEM_COMPLETION_RING 6
#define XDP_STATISTICS 7
#define XDP_OPTIONS 8
+#define XDP_TX_METADATA_LEN 9
struct xdp_umem_reg {
__u64 addr; /* Start of packet data area */
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 4f1e0599146e..81106e4d6e0f 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -1337,6 +1337,26 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
mutex_unlock(&xs->mutex);
return err;
}
+ case XDP_TX_METADATA_LEN:
+ {
+ int val;
+
+ if (optlen < sizeof(val))
+ return -EINVAL;
+ if (copy_from_sockptr(&val, optval, sizeof(val)))
+ return -EFAULT;
+ if (!val || val > 256 || val % 4)
+ return -EINVAL;
+
+ mutex_lock(&xs->mutex);
+ if (xs->state != XSK_READY) {
+ mutex_unlock(&xs->mutex);
+ return -EBUSY;
+ }
+ xs->tx_metadata_len = val;
+ mutex_unlock(&xs->mutex);
+ return 0;
+ }
default:
break;
}
diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
index b3f7b310811e..b351732f1032 100644
--- a/net/xdp/xsk_buff_pool.c
+++ b/net/xdp/xsk_buff_pool.c
@@ -85,6 +85,7 @@ struct xsk_buff_pool *xp_create_and_assign_umem(struct xdp_sock *xs,
XDP_PACKET_HEADROOM;
pool->umem = umem;
pool->addrs = umem->addrs;
+ pool->tx_metadata_len = xs->tx_metadata_len;
INIT_LIST_HEAD(&pool->free_list);
INIT_LIST_HEAD(&pool->xskb_list);
INIT_LIST_HEAD(&pool->xsk_tx_list);
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index 13354a1e4280..c74a1372bcb9 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -143,15 +143,17 @@ static inline bool xp_unused_options_set(u32 options)
static inline bool xp_aligned_validate_desc(struct xsk_buff_pool *pool,
struct xdp_desc *desc)
{
- u64 offset = desc->addr & (pool->chunk_size - 1);
+ u64 addr = desc->addr - pool->tx_metadata_len;
+ u64 len = desc->len + pool->tx_metadata_len;
+ u64 offset = addr & (pool->chunk_size - 1);
if (!desc->len)
return false;
- if (offset + desc->len > pool->chunk_size)
+ if (offset + len > pool->chunk_size)
return false;
- if (desc->addr >= pool->addrs_cnt)
+ if (addr >= pool->addrs_cnt)
return false;
if (xp_unused_options_set(desc->options))
@@ -162,16 +164,17 @@ static inline bool xp_aligned_validate_desc(struct xsk_buff_pool *pool,
static inline bool xp_unaligned_validate_desc(struct xsk_buff_pool *pool,
struct xdp_desc *desc)
{
- u64 addr = xp_unaligned_add_offset_to_addr(desc->addr);
+ u64 addr = xp_unaligned_add_offset_to_addr(desc->addr) - pool->tx_metadata_len;
+ u64 len = desc->len + pool->tx_metadata_len;
if (!desc->len)
return false;
- if (desc->len > pool->chunk_size)
+ if (len > pool->chunk_size)
return false;
- if (addr >= pool->addrs_cnt || addr + desc->len > pool->addrs_cnt ||
- xp_desc_crosses_non_contig_pg(pool, addr, desc->len))
+ if (addr >= pool->addrs_cnt || addr + len > pool->addrs_cnt ||
+ xp_desc_crosses_non_contig_pg(pool, addr, len))
return false;
if (xp_unused_options_set(desc->options))
--
2.41.0.487.g6d72f3e995-goog
^ permalink raw reply [flat|nested] 30+ messages in thread
* [xdp-hints] [RFC net-next v4 2/8] xsk: add TX timestamp and TX checksum offload support
2023-07-24 23:59 [xdp-hints] [RFC net-next v4 0/8] xsk: TX metadata Stanislav Fomichev
2023-07-24 23:59 ` [xdp-hints] [RFC net-next v4 1/8] xsk: Support XDP_TX_METADATA_LEN Stanislav Fomichev
@ 2023-07-24 23:59 ` Stanislav Fomichev
2023-07-25 19:28 ` [xdp-hints] " Simon Horman
` (3 more replies)
2023-07-24 23:59 ` [xdp-hints] [RFC net-next v4 3/8] net/mlx5e: Implement AF_XDP TX timestamp and checksum offload Stanislav Fomichev
` (5 subsequent siblings)
7 siblings, 4 replies; 30+ messages in thread
From: Stanislav Fomichev @ 2023-07-24 23:59 UTC (permalink / raw)
To: bpf
Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
kpsingh, sdf, haoluo, jolsa, kuba, toke, willemb, dsahern,
magnus.karlsson, bjorn, maciej.fijalkowski, hawk, netdev,
xdp-hints
This change actually defines the (initial) metadata layout
that should be used by AF_XDP userspace (xsk_tx_metadata).
The first field is flags which requests appropriate offloads,
followed by the offload-specific fields. The supported per-device
offloads are exported via netlink (new xsk-flags).
The offloads themselves are still implemented in a bit of a
framework-y fashion that's left from my initial kfunc attempt.
I'm introducing new xsk_tx_metadata_ops which drivers are
supposed to implement. The drivers are also supposed
to call xsk_tx_metadata_request/xsk_tx_metadata_complete in
the right places. Since xsk_tx_metadata_{request,_complete}
are static inline, we don't incur any extra overhead doing
indirect calls.
The benefit of this scheme is as follows:
- keeps all metadata layout parsing away from driver code
- makes it easy to grep and see which drivers implement what
- don't need any extra flags to maintain to keep track of that
offloads are implemented; if the callback is implemented - the offload
is supported (used by netlink reporting code)
Two offloads are defined right now:
1. XDP_TX_METADATA_CHECKSUM: skb-style csum_start+csum_offset
2. XDP_TX_METADATA_TIMESTAMP: writes TX timestamp back into metadata
area upon completion (tx_timestamp field)
The offloads are also implemented for copy mode:
1. Extra XDP_TX_METADATA_CHECKSUM_SW to trigger skb_checksum_help; this
might be useful as a reference implementation and for testing
2. XDP_TX_METADATA_TIMESTAMP writes SW timestamp from the skb
destructor (note I'm reusing hwtstamps to pass metadata pointer)
The struct is forward-compatible and can be extended in the future
by appending more fields.
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
Documentation/netlink/specs/netdev.yaml | 19 ++++++++
include/linux/netdevice.h | 27 +++++++++++
include/linux/skbuff.h | 5 ++-
include/net/xdp_sock.h | 60 +++++++++++++++++++++++++
include/net/xdp_sock_drv.h | 13 ++++++
include/uapi/linux/if_xdp.h | 35 +++++++++++++++
include/uapi/linux/netdev.h | 15 +++++++
net/core/netdev-genl.c | 12 ++++-
net/xdp/xsk.c | 38 ++++++++++++++++
net/xdp/xsk_queue.h | 2 +-
tools/include/uapi/linux/if_xdp.h | 50 ++++++++++++++++++---
11 files changed, 268 insertions(+), 8 deletions(-)
diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
index e41015310a6e..bf9c1cc32614 100644
--- a/Documentation/netlink/specs/netdev.yaml
+++ b/Documentation/netlink/specs/netdev.yaml
@@ -42,6 +42,19 @@ name: netdev
doc:
This feature informs if netdev implements non-linear XDP buffer
support in ndo_xdp_xmit callback.
+ -
+ type: flags
+ name: xsk-flags
+ render-max: true
+ entries:
+ -
+ name: tx-timestamp
+ doc:
+ HW timestamping egress packets is supported by the driver.
+ -
+ name: tx-checksum
+ doc:
+ L3 checksum HW offload is supported by the driver.
attribute-sets:
-
@@ -68,6 +81,12 @@ name: netdev
type: u32
checks:
min: 1
+ -
+ name: xsk-features
+ doc: Bitmask of enabled AF_XDP features.
+ type: u64
+ enum: xsk-flags
+ enum-as-flags: true
operations:
list:
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 11652e464f5d..8b40c80557aa 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1660,6 +1660,31 @@ struct xdp_metadata_ops {
enum xdp_rss_hash_type *rss_type);
};
+/*
+ * This structure defines the AF_XDP TX metadata hooks for network devices.
+ * The following hooks can be defined; unless noted otherwise, they are
+ * optional and can be filled with a null pointer.
+ *
+ * int (*tmo_request_timestamp)(void *priv)
+ * This function is called when AF_XDP frame requested egress timestamp.
+ *
+ * int (*tmo_fill_timestamp)(void *priv)
+ * This function is called when AF_XDP frame, that had requested
+ * egress timestamp, received a completion. The hook needs to return
+ * the actual HW timestamp.
+ *
+ * int (*tmo_request_timestamp)(u16 csum_start, u16 csum_offset, void *priv)
+ * This function is called when AF_XDP frame requested HW checksum
+ * offload. csum_start indicates position where checksumming should start.
+ * csum_offset indicates position where checksum should be stored.
+ *
+ */
+struct xsk_tx_metadata_ops {
+ void (*tmo_request_timestamp)(void *priv);
+ u64 (*tmo_fill_timestamp)(void *priv);
+ void (*tmo_request_checksum)(u16 csum_start, u16 csum_offset, void *priv);
+};
+
/**
* enum netdev_priv_flags - &struct net_device priv_flags
*
@@ -1844,6 +1869,7 @@ enum netdev_ml_priv_type {
* @netdev_ops: Includes several pointers to callbacks,
* if one wants to override the ndo_*() functions
* @xdp_metadata_ops: Includes pointers to XDP metadata callbacks.
+ * @xsk_tx_metadata_ops: Includes pointers to AF_XDP TX metadata callbacks.
* @ethtool_ops: Management operations
* @l3mdev_ops: Layer 3 master device operations
* @ndisc_ops: Includes callbacks for different IPv6 neighbour
@@ -2100,6 +2126,7 @@ struct net_device {
unsigned long long priv_flags;
const struct net_device_ops *netdev_ops;
const struct xdp_metadata_ops *xdp_metadata_ops;
+ const struct xsk_tx_metadata_ops *xsk_tx_metadata_ops;
int ifindex;
unsigned short gflags;
unsigned short hard_header_len;
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index faaba050f843..5febc1a5131e 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -581,7 +581,10 @@ struct skb_shared_info {
/* Warning: this field is not always filled in (UFO)! */
unsigned short gso_segs;
struct sk_buff *frag_list;
- struct skb_shared_hwtstamps hwtstamps;
+ union {
+ struct skb_shared_hwtstamps hwtstamps;
+ struct xsk_tx_metadata *xsk_meta;
+ };
unsigned int gso_type;
u32 tskey;
diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 467b9fb56827..288fa58c4665 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -90,6 +90,54 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
int __xsk_map_redirect(struct xdp_sock *xs, struct xdp_buff *xdp);
void __xsk_map_flush(void);
+/**
+ * xsk_tx_metadata_request - Evaluate AF_XDP TX metadata at submission
+ * and call appropriate xsk_tx_metadata_ops operation.
+ * @meta: pointer to AF_XDP metadata area
+ * @ops: pointer to struct xsk_tx_metadata_ops
+ * @priv: pointer to driver-private aread
+ *
+ * This function should be called by the networking device when
+ * it prepares AF_XDP egress packet.
+ */
+static inline void xsk_tx_metadata_request(const struct xsk_tx_metadata *meta,
+ const struct xsk_tx_metadata_ops *ops,
+ void *priv)
+{
+ if (!meta)
+ return;
+
+ if (ops->tmo_request_timestamp)
+ if (meta->flags & XDP_TX_METADATA_TIMESTAMP)
+ ops->tmo_request_timestamp(priv);
+
+ if (ops->tmo_request_checksum)
+ if (meta->flags & XDP_TX_METADATA_CHECKSUM)
+ ops->tmo_request_checksum(meta->csum_start, meta->csum_offset, priv);
+}
+
+/**
+ * xsk_tx_metadata_complete - Evaluate AF_XDP TX metadata at completion
+ * and call appropriate xsk_tx_metadata_ops operation.
+ * @meta: pointer to AF_XDP metadata area
+ * @ops: pointer to struct xsk_tx_metadata_ops
+ * @priv: pointer to driver-private aread
+ *
+ * This function should be called by the networking device upon
+ * AF_XDP egress completion.
+ */
+static inline void xsk_tx_metadata_complete(struct xsk_tx_metadata *meta,
+ const struct xsk_tx_metadata_ops *ops,
+ void *priv)
+{
+ if (!meta)
+ return;
+
+ if (ops->tmo_fill_timestamp)
+ if (meta->flags & XDP_TX_METADATA_TIMESTAMP)
+ meta->tx_timestamp = ops->tmo_fill_timestamp(priv);
+}
+
#else
static inline int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
@@ -106,6 +154,18 @@ static inline void __xsk_map_flush(void)
{
}
+static inline void xsk_tx_metadata_request(struct xsk_tx_metadata *meta,
+ const struct xsk_tx_metadata_ops *ops,
+ void *priv)
+{
+}
+
+static inline void xsk_tx_metadata_complete(struct xsk_tx_metadata *meta,
+ const struct xsk_tx_metadata_ops *ops,
+ void *priv)
+{
+}
+
#endif /* CONFIG_XDP_SOCKETS */
#endif /* _LINUX_XDP_SOCK_H */
diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
index 1f6fc8c7a84c..e2558ac3e195 100644
--- a/include/net/xdp_sock_drv.h
+++ b/include/net/xdp_sock_drv.h
@@ -165,6 +165,14 @@ static inline void *xsk_buff_raw_get_data(struct xsk_buff_pool *pool, u64 addr)
return xp_raw_get_data(pool, addr);
}
+static inline struct xsk_tx_metadata *xsk_buff_get_metadata(struct xsk_buff_pool *pool, u64 addr)
+{
+ if (!pool->tx_metadata_len)
+ return NULL;
+
+ return xp_raw_get_data(pool, addr) - pool->tx_metadata_len;
+}
+
static inline void xsk_buff_dma_sync_for_cpu(struct xdp_buff *xdp, struct xsk_buff_pool *pool)
{
struct xdp_buff_xsk *xskb = container_of(xdp, struct xdp_buff_xsk, xdp);
@@ -324,6 +332,11 @@ static inline void *xsk_buff_raw_get_data(struct xsk_buff_pool *pool, u64 addr)
return NULL;
}
+static inline struct xsk_tx_metadata *xsk_buff_get_metadata(struct xsk_buff_pool *pool, u64 addr)
+{
+ return NULL;
+}
+
static inline void xsk_buff_dma_sync_for_cpu(struct xdp_buff *xdp, struct xsk_buff_pool *pool)
{
}
diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index b37b50102e1c..b9b1b2c4108a 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -106,6 +106,38 @@ struct xdp_options {
#define XSK_UNALIGNED_BUF_ADDR_MASK \
((1ULL << XSK_UNALIGNED_BUF_OFFSET_SHIFT) - 1)
+/* Request transmit timestamp. Upon completion, put it into tx_timestamp
+ * field of struct xsk_tx_metadata.
+ */
+#define XDP_TX_METADATA_TIMESTAMP (1 << 0)
+
+/* Request transmit checksum offload. Checksum start position and offset
+ * are communicated via csum_start and csum_offset fields of struct
+ * xsk_tx_metadata.
+ */
+#define XDP_TX_METADATA_CHECKSUM (1 << 1)
+
+/* Force checksum calculation in software. Can be used for testing or
+ * working around potential HW issues. This option causes performance
+ * degradation and only works in XDP_COPY mode.
+ */
+#define XDP_TX_METADATA_CHECKSUM_SW (1 << 2)
+
+struct xsk_tx_metadata {
+ __u32 flags;
+
+ /* XDP_TX_METADATA_CHECKSUM */
+
+ /* Offset from desc->addr where checksumming should start. */
+ __u16 csum_start;
+ /* Offset from csum_start where checksum should be stored. */
+ __u16 csum_offset;
+
+ /* XDP_TX_METADATA_TIMESTAMP */
+
+ __u64 tx_timestamp;
+};
+
/* Rx/Tx descriptor */
struct xdp_desc {
__u64 addr;
@@ -122,4 +154,7 @@ struct xdp_desc {
*/
#define XDP_PKT_CONTD (1 << 0)
+/* TX packet carries valid metadata. */
+#define XDP_TX_METADATA (1 << 1)
+
#endif /* _LINUX_IF_XDP_H */
diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
index bf71698a1e82..cf1e11c76339 100644
--- a/include/uapi/linux/netdev.h
+++ b/include/uapi/linux/netdev.h
@@ -37,11 +37,26 @@ enum netdev_xdp_act {
NETDEV_XDP_ACT_MASK = 127,
};
+/**
+ * enum netdev_xsk_flags
+ * @NETDEV_XSK_FLAGS_TX_TIMESTAMP: HW timestamping egress packets is supported
+ * by the driver.
+ * @NETDEV_XSK_FLAGS_TX_CHECKSUM: L3 checksum HW offload is supported by the
+ * driver.
+ */
+enum netdev_xsk_flags {
+ NETDEV_XSK_FLAGS_TX_TIMESTAMP = 1,
+ NETDEV_XSK_FLAGS_TX_CHECKSUM = 2,
+
+ NETDEV_XSK_FLAGS_MASK = 3,
+};
+
enum {
NETDEV_A_DEV_IFINDEX = 1,
NETDEV_A_DEV_PAD,
NETDEV_A_DEV_XDP_FEATURES,
NETDEV_A_DEV_XDP_ZC_MAX_SEGS,
+ NETDEV_A_DEV_XSK_FEATURES,
__NETDEV_A_DEV_MAX,
NETDEV_A_DEV_MAX = (__NETDEV_A_DEV_MAX - 1)
diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
index 65ef4867fc49..9e8c1f3caf36 100644
--- a/net/core/netdev-genl.c
+++ b/net/core/netdev-genl.c
@@ -12,15 +12,25 @@ static int
netdev_nl_dev_fill(struct net_device *netdev, struct sk_buff *rsp,
u32 portid, u32 seq, int flags, u32 cmd)
{
+ u64 xsk_flags = 0;
void *hdr;
hdr = genlmsg_put(rsp, portid, seq, &netdev_nl_family, flags, cmd);
if (!hdr)
return -EMSGSIZE;
+ if (netdev->xsk_tx_metadata_ops) {
+ if (netdev->xsk_tx_metadata_ops->tmo_fill_timestamp)
+ xsk_flags |= NETDEV_XSK_FLAGS_TX_TIMESTAMP;
+ if (netdev->xsk_tx_metadata_ops->tmo_request_checksum)
+ xsk_flags |= NETDEV_XSK_FLAGS_TX_CHECKSUM;
+ }
+
if (nla_put_u32(rsp, NETDEV_A_DEV_IFINDEX, netdev->ifindex) ||
nla_put_u64_64bit(rsp, NETDEV_A_DEV_XDP_FEATURES,
- netdev->xdp_features, NETDEV_A_DEV_PAD)) {
+ netdev->xdp_features, NETDEV_A_DEV_PAD) ||
+ nla_put_u64_64bit(rsp, NETDEV_A_DEV_XSK_FEATURES,
+ xsk_flags, NETDEV_A_DEV_PAD)) {
genlmsg_cancel(rsp, hdr);
return -EINVAL;
}
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 81106e4d6e0f..9a5c4e63898d 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -542,6 +542,15 @@ static u32 xsk_get_num_desc(struct sk_buff *skb)
static void xsk_destruct_skb(struct sk_buff *skb)
{
+ struct xsk_tx_metadata *meta = skb_shinfo(skb)->xsk_meta;
+
+ if (meta) {
+ if (meta->flags & XDP_TX_METADATA_TIMESTAMP) {
+ /* sw completion timestamp, not a real one */
+ meta->tx_timestamp = ktime_get_tai_fast_ns();
+ }
+ }
+
xsk_cq_submit_locked(xdp_sk(skb->sk), xsk_get_num_desc(skb));
sock_wfree(skb);
}
@@ -626,6 +635,7 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
struct xdp_desc *desc)
{
+ struct xsk_tx_metadata *meta = NULL;
struct net_device *dev = xs->dev;
struct sk_buff *skb = xs->skb;
int err;
@@ -678,12 +688,40 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
skb_add_rx_frag(skb, nr_frags, page, 0, len, 0);
}
+
+ if (desc->options & XDP_TX_METADATA) {
+ if (unlikely(xs->tx_metadata_len == 0)) {
+ err = -EINVAL;
+ goto free_err;
+ }
+
+ meta = buffer - xs->tx_metadata_len;
+
+ if (meta->flags & XDP_TX_METADATA_CHECKSUM) {
+ if (unlikely(meta->csum_start + meta->csum_offset +
+ sizeof(__sum16) > len)) {
+ err = -EINVAL;
+ goto free_err;
+ }
+
+ skb->csum_start = hr + meta->csum_start;
+ skb->csum_offset = meta->csum_offset;
+ skb->ip_summed = CHECKSUM_PARTIAL;
+
+ if (unlikely(meta->flags & XDP_TX_METADATA_CHECKSUM_SW)) {
+ err = skb_checksum_help(skb);
+ if (err)
+ goto free_err;
+ }
+ }
+ }
}
skb->dev = dev;
skb->priority = xs->sk.sk_priority;
skb->mark = xs->sk.sk_mark;
skb->destructor = xsk_destruct_skb;
+ skb_shinfo(skb)->xsk_meta = meta;
xsk_set_destructor_arg(skb);
return skb;
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index c74a1372bcb9..6f2d1621c992 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -137,7 +137,7 @@ static inline bool xskq_cons_read_addr_unchecked(struct xsk_queue *q, u64 *addr)
static inline bool xp_unused_options_set(u32 options)
{
- return options & ~XDP_PKT_CONTD;
+ return options & ~(XDP_PKT_CONTD | XDP_TX_METADATA);
}
static inline bool xp_aligned_validate_desc(struct xsk_buff_pool *pool,
diff --git a/tools/include/uapi/linux/if_xdp.h b/tools/include/uapi/linux/if_xdp.h
index 73a47da885dc..b9b1b2c4108a 100644
--- a/tools/include/uapi/linux/if_xdp.h
+++ b/tools/include/uapi/linux/if_xdp.h
@@ -26,11 +26,11 @@
*/
#define XDP_USE_NEED_WAKEUP (1 << 3)
/* By setting this option, userspace application indicates that it can
- * handle multiple descriptors per packet thus enabling xsk core to split
+ * handle multiple descriptors per packet thus enabling AF_XDP to split
* multi-buffer XDP frames into multiple Rx descriptors. Without this set
- * such frames will be dropped by xsk.
+ * such frames will be dropped.
*/
-#define XDP_USE_SG (1 << 4)
+#define XDP_USE_SG (1 << 4)
/* Flags for xsk_umem_config flags */
#define XDP_UMEM_UNALIGNED_CHUNK_FLAG (1 << 0)
@@ -69,6 +69,7 @@ struct xdp_mmap_offsets {
#define XDP_UMEM_COMPLETION_RING 6
#define XDP_STATISTICS 7
#define XDP_OPTIONS 8
+#define XDP_TX_METADATA_LEN 9
struct xdp_umem_reg {
__u64 addr; /* Start of packet data area */
@@ -105,6 +106,38 @@ struct xdp_options {
#define XSK_UNALIGNED_BUF_ADDR_MASK \
((1ULL << XSK_UNALIGNED_BUF_OFFSET_SHIFT) - 1)
+/* Request transmit timestamp. Upon completion, put it into tx_timestamp
+ * field of struct xsk_tx_metadata.
+ */
+#define XDP_TX_METADATA_TIMESTAMP (1 << 0)
+
+/* Request transmit checksum offload. Checksum start position and offset
+ * are communicated via csum_start and csum_offset fields of struct
+ * xsk_tx_metadata.
+ */
+#define XDP_TX_METADATA_CHECKSUM (1 << 1)
+
+/* Force checksum calculation in software. Can be used for testing or
+ * working around potential HW issues. This option causes performance
+ * degradation and only works in XDP_COPY mode.
+ */
+#define XDP_TX_METADATA_CHECKSUM_SW (1 << 2)
+
+struct xsk_tx_metadata {
+ __u32 flags;
+
+ /* XDP_TX_METADATA_CHECKSUM */
+
+ /* Offset from desc->addr where checksumming should start. */
+ __u16 csum_start;
+ /* Offset from csum_start where checksum should be stored. */
+ __u16 csum_offset;
+
+ /* XDP_TX_METADATA_TIMESTAMP */
+
+ __u64 tx_timestamp;
+};
+
/* Rx/Tx descriptor */
struct xdp_desc {
__u64 addr;
@@ -112,9 +145,16 @@ struct xdp_desc {
__u32 options;
};
-/* Flag indicating packet constitutes of multiple buffers*/
+/* UMEM descriptor is __u64 */
+
+/* Flag indicating that the packet continues with the buffer pointed out by the
+ * next frame in the ring. The end of the packet is signalled by setting this
+ * bit to zero. For single buffer packets, every descriptor has 'options' set
+ * to 0 and this maintains backward compatibility.
+ */
#define XDP_PKT_CONTD (1 << 0)
-/* UMEM descriptor is __u64 */
+/* TX packet carries valid metadata. */
+#define XDP_TX_METADATA (1 << 1)
#endif /* _LINUX_IF_XDP_H */
--
2.41.0.487.g6d72f3e995-goog
^ permalink raw reply [flat|nested] 30+ messages in thread
* [xdp-hints] [RFC net-next v4 3/8] net/mlx5e: Implement AF_XDP TX timestamp and checksum offload
2023-07-24 23:59 [xdp-hints] [RFC net-next v4 0/8] xsk: TX metadata Stanislav Fomichev
2023-07-24 23:59 ` [xdp-hints] [RFC net-next v4 1/8] xsk: Support XDP_TX_METADATA_LEN Stanislav Fomichev
2023-07-24 23:59 ` [xdp-hints] [RFC net-next v4 2/8] xsk: add TX timestamp and TX checksum offload support Stanislav Fomichev
@ 2023-07-24 23:59 ` Stanislav Fomichev
2023-07-24 23:59 ` [xdp-hints] [RFC net-next v4 4/8] tools: ynl: update netdev sample to dump xsk-features Stanislav Fomichev
` (4 subsequent siblings)
7 siblings, 0 replies; 30+ messages in thread
From: Stanislav Fomichev @ 2023-07-24 23:59 UTC (permalink / raw)
To: bpf
Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
kpsingh, sdf, haoluo, jolsa, kuba, toke, willemb, dsahern,
magnus.karlsson, bjorn, maciej.fijalkowski, hawk, netdev,
xdp-hints
TX timestamp:
- requires passing clock, not sure I'm passing the correct one (from
cq->mdev), but the timestamp value looks convincing
TX checksum:
- looks like device does packet parsing (and doesn't accept custom
start/offset), so I'm ignoring user offsets
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
drivers/net/ethernet/mellanox/mlx5/core/en.h | 4 +-
.../net/ethernet/mellanox/mlx5/core/en/xdp.c | 71 ++++++++++++++++---
.../net/ethernet/mellanox/mlx5/core/en/xdp.h | 10 ++-
.../ethernet/mellanox/mlx5/core/en/xsk/tx.c | 9 ++-
.../net/ethernet/mellanox/mlx5/core/en_main.c | 1 +
5 files changed, 79 insertions(+), 16 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index b1807bfb815f..dcbef1074148 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -476,10 +476,12 @@ struct mlx5e_xdp_info_fifo {
struct mlx5e_xdpsq;
struct mlx5e_xmit_data;
+struct xsk_tx_metadata;
typedef int (*mlx5e_fp_xmit_xdp_frame_check)(struct mlx5e_xdpsq *);
typedef bool (*mlx5e_fp_xmit_xdp_frame)(struct mlx5e_xdpsq *,
struct mlx5e_xmit_data *,
- int);
+ int,
+ struct xsk_tx_metadata *);
struct mlx5e_xdpsq {
/* data path */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
index 40589cebb773..16e16d047542 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
@@ -102,7 +102,7 @@ mlx5e_xmit_xdp_buff(struct mlx5e_xdpsq *sq, struct mlx5e_rq *rq,
xdptxd->dma_addr = dma_addr;
if (unlikely(!INDIRECT_CALL_2(sq->xmit_xdp_frame, mlx5e_xmit_xdp_frame_mpwqe,
- mlx5e_xmit_xdp_frame, sq, xdptxd, 0)))
+ mlx5e_xmit_xdp_frame, sq, xdptxd, 0, NULL)))
return false;
/* xmit_mode == MLX5E_XDP_XMIT_MODE_FRAME */
@@ -144,7 +144,7 @@ mlx5e_xmit_xdp_buff(struct mlx5e_xdpsq *sq, struct mlx5e_rq *rq,
xdptxd->dma_addr = dma_addr;
if (unlikely(!INDIRECT_CALL_2(sq->xmit_xdp_frame, mlx5e_xmit_xdp_frame_mpwqe,
- mlx5e_xmit_xdp_frame, sq, xdptxd, 0)))
+ mlx5e_xmit_xdp_frame, sq, xdptxd, 0, NULL)))
return false;
/* xmit_mode == MLX5E_XDP_XMIT_MODE_PAGE */
@@ -260,6 +260,38 @@ const struct xdp_metadata_ops mlx5e_xdp_metadata_ops = {
.xmo_rx_hash = mlx5e_xdp_rx_hash,
};
+struct mlx5e_xsk_tx_complete {
+ struct mlx5_cqe64 *cqe;
+ struct mlx5e_cq *cq;
+};
+
+static u64 mlx5e_xsk_fill_timestamp(void *_priv)
+{
+ struct mlx5e_xsk_tx_complete *priv = _priv;
+ u64 ts;
+
+ ts = get_cqe_ts(priv->cqe);
+
+ if (mlx5_is_real_time_rq(priv->cq->mdev) || mlx5_is_real_time_sq(priv->cq->mdev))
+ return mlx5_real_time_cyc2time(&priv->cq->mdev->clock, ts);
+
+ return mlx5_timecounter_cyc2time(&priv->cq->mdev->clock, ts);
+}
+
+static void mlx5e_xsk_request_checksum(u16 csum_start, u16 csum_offset, void *priv)
+{
+ struct mlx5_wqe_eth_seg *eseg;
+
+ eseg = priv;
+ /* HW/FW is doing parsing, so offsets are largely ignored. */
+ eseg->cs_flags |= MLX5_ETH_WQE_L3_CSUM | MLX5_ETH_WQE_L4_CSUM;
+}
+
+const struct xsk_tx_metadata_ops mlx5e_xsk_tx_metadata_ops = {
+ .tmo_fill_timestamp = mlx5e_xsk_fill_timestamp,
+ .tmo_request_checksum = mlx5e_xsk_request_checksum,
+};
+
/* returns true if packet was consumed by xdp */
bool mlx5e_xdp_handle(struct mlx5e_rq *rq,
struct bpf_prog *prog, struct mlx5e_xdp_buff *mxbuf)
@@ -397,11 +429,11 @@ INDIRECT_CALLABLE_SCOPE int mlx5e_xmit_xdp_frame_check_mpwqe(struct mlx5e_xdpsq
INDIRECT_CALLABLE_SCOPE bool
mlx5e_xmit_xdp_frame(struct mlx5e_xdpsq *sq, struct mlx5e_xmit_data *xdptxd,
- int check_result);
+ int check_result, struct xsk_tx_metadata *meta);
INDIRECT_CALLABLE_SCOPE bool
mlx5e_xmit_xdp_frame_mpwqe(struct mlx5e_xdpsq *sq, struct mlx5e_xmit_data *xdptxd,
- int check_result)
+ int check_result, struct xsk_tx_metadata *meta)
{
struct mlx5e_tx_mpwqe *session = &sq->mpwqe;
struct mlx5e_xdpsq_stats *stats = sq->stats;
@@ -419,7 +451,7 @@ mlx5e_xmit_xdp_frame_mpwqe(struct mlx5e_xdpsq *sq, struct mlx5e_xmit_data *xdptx
*/
if (unlikely(sq->mpwqe.wqe))
mlx5e_xdp_mpwqe_complete(sq);
- return mlx5e_xmit_xdp_frame(sq, xdptxd, 0);
+ return mlx5e_xmit_xdp_frame(sq, xdptxd, 0, meta);
}
if (!xdptxd->len) {
skb_frag_t *frag = &xdptxdf->sinfo->frags[0];
@@ -449,6 +481,7 @@ mlx5e_xmit_xdp_frame_mpwqe(struct mlx5e_xdpsq *sq, struct mlx5e_xmit_data *xdptx
* and it's safe to complete it at any time.
*/
mlx5e_xdp_mpwqe_session_start(sq);
+ xsk_tx_metadata_request(meta, &mlx5e_xsk_tx_metadata_ops, &session->wqe->eth);
}
mlx5e_xdp_mpwqe_add_dseg(sq, p, stats);
@@ -479,7 +512,7 @@ INDIRECT_CALLABLE_SCOPE int mlx5e_xmit_xdp_frame_check(struct mlx5e_xdpsq *sq)
INDIRECT_CALLABLE_SCOPE bool
mlx5e_xmit_xdp_frame(struct mlx5e_xdpsq *sq, struct mlx5e_xmit_data *xdptxd,
- int check_result)
+ int check_result, struct xsk_tx_metadata *meta)
{
struct mlx5e_xmit_data_frags *xdptxdf =
container_of(xdptxd, struct mlx5e_xmit_data_frags, xd);
@@ -598,6 +631,8 @@ mlx5e_xmit_xdp_frame(struct mlx5e_xdpsq *sq, struct mlx5e_xmit_data *xdptxd,
sq->pc++;
}
+ xsk_tx_metadata_request(meta, &mlx5e_xsk_tx_metadata_ops, eseg);
+
sq->doorbell_cseg = cseg;
stats->xmit++;
@@ -607,7 +642,9 @@ mlx5e_xmit_xdp_frame(struct mlx5e_xdpsq *sq, struct mlx5e_xmit_data *xdptxd,
static void mlx5e_free_xdpsq_desc(struct mlx5e_xdpsq *sq,
struct mlx5e_xdp_wqe_info *wi,
u32 *xsk_frames,
- struct xdp_frame_bulk *bq)
+ struct xdp_frame_bulk *bq,
+ struct mlx5e_cq *cq,
+ struct mlx5_cqe64 *cqe)
{
struct mlx5e_xdp_info_fifo *xdpi_fifo = &sq->db.xdpi_fifo;
u16 i;
@@ -667,10 +704,22 @@ static void mlx5e_free_xdpsq_desc(struct mlx5e_xdpsq *sq,
break;
}
- case MLX5E_XDP_XMIT_MODE_XSK:
+ case MLX5E_XDP_XMIT_MODE_XSK: {
/* AF_XDP send */
+ struct mlx5e_xsk_tx_complete priv = {
+ .cqe = cqe,
+ .cq = cq,
+ };
+ struct xsk_tx_metadata *meta;
+
+ xdpi = mlx5e_xdpi_fifo_pop(xdpi_fifo);
+ meta = (void *)xdpi.frame.xsk_meta;
+
+ xsk_tx_metadata_complete(meta, &mlx5e_xsk_tx_metadata_ops, &priv);
+
(*xsk_frames)++;
break;
+ }
default:
WARN_ON_ONCE(true);
}
@@ -719,7 +768,7 @@ bool mlx5e_poll_xdpsq_cq(struct mlx5e_cq *cq)
sqcc += wi->num_wqebbs;
- mlx5e_free_xdpsq_desc(sq, wi, &xsk_frames, &bq);
+ mlx5e_free_xdpsq_desc(sq, wi, &xsk_frames, &bq, cq, cqe);
} while (!last_wqe);
if (unlikely(get_cqe_opcode(cqe) != MLX5_CQE_REQ)) {
@@ -766,7 +815,7 @@ void mlx5e_free_xdpsq_descs(struct mlx5e_xdpsq *sq)
sq->cc += wi->num_wqebbs;
- mlx5e_free_xdpsq_desc(sq, wi, &xsk_frames, &bq);
+ mlx5e_free_xdpsq_desc(sq, wi, &xsk_frames, &bq, NULL, NULL);
}
xdp_flush_frame_bulk(&bq);
@@ -839,7 +888,7 @@ int mlx5e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
}
ret = INDIRECT_CALL_2(sq->xmit_xdp_frame, mlx5e_xmit_xdp_frame_mpwqe,
- mlx5e_xmit_xdp_frame, sq, xdptxd, 0);
+ mlx5e_xmit_xdp_frame, sq, xdptxd, 0, NULL);
if (unlikely(!ret)) {
int j;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
index 9e8e6184f9e4..2fcd19c16103 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
@@ -82,13 +82,14 @@ enum mlx5e_xdp_xmit_mode {
* num, page_1, page_2, ... , page_num.
*
* MLX5E_XDP_XMIT_MODE_XSK:
- * none.
+ * frame.xsk_meta.
*/
union mlx5e_xdp_info {
enum mlx5e_xdp_xmit_mode mode;
union {
struct xdp_frame *xdpf;
dma_addr_t dma_addr;
+ void *xsk_meta;
} frame;
union {
struct mlx5e_rq *rq;
@@ -110,13 +111,16 @@ int mlx5e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
u32 flags);
extern const struct xdp_metadata_ops mlx5e_xdp_metadata_ops;
+extern const struct xsk_tx_metadata_ops mlx5e_xsk_tx_metadata_ops;
INDIRECT_CALLABLE_DECLARE(bool mlx5e_xmit_xdp_frame_mpwqe(struct mlx5e_xdpsq *sq,
struct mlx5e_xmit_data *xdptxd,
- int check_result));
+ int check_result,
+ struct xsk_tx_metadata *meta));
INDIRECT_CALLABLE_DECLARE(bool mlx5e_xmit_xdp_frame(struct mlx5e_xdpsq *sq,
struct mlx5e_xmit_data *xdptxd,
- int check_result));
+ int check_result,
+ struct xsk_tx_metadata *meta));
INDIRECT_CALLABLE_DECLARE(int mlx5e_xmit_xdp_frame_check_mpwqe(struct mlx5e_xdpsq *sq));
INDIRECT_CALLABLE_DECLARE(int mlx5e_xmit_xdp_frame_check(struct mlx5e_xdpsq *sq));
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.c
index 597f319d4770..86e66d916176 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.c
@@ -55,12 +55,15 @@ static void mlx5e_xsk_tx_post_err(struct mlx5e_xdpsq *sq,
nopwqe = mlx5e_post_nop(&sq->wq, sq->sqn, &sq->pc);
mlx5e_xdpi_fifo_push(&sq->db.xdpi_fifo, *xdpi);
+ mlx5e_xdpi_fifo_push(&sq->db.xdpi_fifo,
+ (union mlx5e_xdp_info) { .frame.xsk_meta = NULL });
sq->doorbell_cseg = &nopwqe->ctrl;
}
bool mlx5e_xsk_tx(struct mlx5e_xdpsq *sq, unsigned int budget)
{
struct xsk_buff_pool *pool = sq->xsk_pool;
+ struct xsk_tx_metadata *meta = NULL;
union mlx5e_xdp_info xdpi;
bool work_done = true;
bool flush = false;
@@ -93,12 +96,13 @@ bool mlx5e_xsk_tx(struct mlx5e_xdpsq *sq, unsigned int budget)
xdptxd.dma_addr = xsk_buff_raw_get_dma(pool, desc.addr);
xdptxd.data = xsk_buff_raw_get_data(pool, desc.addr);
xdptxd.len = desc.len;
+ meta = xsk_buff_get_metadata(pool, desc.addr);
xsk_buff_raw_dma_sync_for_device(pool, xdptxd.dma_addr, xdptxd.len);
ret = INDIRECT_CALL_2(sq->xmit_xdp_frame, mlx5e_xmit_xdp_frame_mpwqe,
mlx5e_xmit_xdp_frame, sq, &xdptxd,
- check_result);
+ check_result, meta);
if (unlikely(!ret)) {
if (sq->mpwqe.wqe)
mlx5e_xdp_mpwqe_complete(sq);
@@ -106,6 +110,9 @@ bool mlx5e_xsk_tx(struct mlx5e_xdpsq *sq, unsigned int budget)
mlx5e_xsk_tx_post_err(sq, &xdpi);
} else {
mlx5e_xdpi_fifo_push(&sq->db.xdpi_fifo, xdpi);
+ mlx5e_xdpi_fifo_push(&sq->db.xdpi_fifo,
+ (union mlx5e_xdp_info)
+ { .frame.xsk_meta = (void *)meta });
}
flush = true;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index defb1efccb78..e19f313f4612 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -5084,6 +5084,7 @@ static void mlx5e_build_nic_netdev(struct net_device *netdev)
netdev->netdev_ops = &mlx5e_netdev_ops;
netdev->xdp_metadata_ops = &mlx5e_xdp_metadata_ops;
+ netdev->xsk_tx_metadata_ops = &mlx5e_xsk_tx_metadata_ops;
mlx5e_dcbnl_build_netdev(netdev);
--
2.41.0.487.g6d72f3e995-goog
^ permalink raw reply [flat|nested] 30+ messages in thread
* [xdp-hints] [RFC net-next v4 4/8] tools: ynl: update netdev sample to dump xsk-features
2023-07-24 23:59 [xdp-hints] [RFC net-next v4 0/8] xsk: TX metadata Stanislav Fomichev
` (2 preceding siblings ...)
2023-07-24 23:59 ` [xdp-hints] [RFC net-next v4 3/8] net/mlx5e: Implement AF_XDP TX timestamp and checksum offload Stanislav Fomichev
@ 2023-07-24 23:59 ` Stanislav Fomichev
2023-07-24 23:59 ` [xdp-hints] [RFC net-next v4 5/8] selftests/xsk: Support XDP_TX_METADATA_LEN Stanislav Fomichev
` (3 subsequent siblings)
7 siblings, 0 replies; 30+ messages in thread
From: Stanislav Fomichev @ 2023-07-24 23:59 UTC (permalink / raw)
To: bpf
Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
kpsingh, sdf, haoluo, jolsa, kuba, toke, willemb, dsahern,
magnus.karlsson, bjorn, maciej.fijalkowski, hawk, netdev,
xdp-hints
While at it, also dump recently added ZC max segs (and rename
it to use dashes). Plus fix small spelling issue.
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
Documentation/netlink/specs/netdev.yaml | 6 ++++--
tools/include/uapi/linux/netdev.h | 15 +++++++++++++++
tools/net/ynl/generated/netdev-user.c | 25 +++++++++++++++++++++++++
tools/net/ynl/generated/netdev-user.h | 5 +++++
tools/net/ynl/samples/netdev.c | 8 ++++++++
5 files changed, 57 insertions(+), 2 deletions(-)
diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
index bf9c1cc32614..9002b37b7676 100644
--- a/Documentation/netlink/specs/netdev.yaml
+++ b/Documentation/netlink/specs/netdev.yaml
@@ -14,7 +14,7 @@ name: netdev
-
name: basic
doc:
- XDP feautues set supported by all drivers
+ XDP features set supported by all drivers
(XDP_ABORTED, XDP_DROP, XDP_PASS, XDP_TX)
-
name: redirect
@@ -76,7 +76,7 @@ name: netdev
enum: xdp-act
enum-as-flags: true
-
- name: xdp_zc_max_segs
+ name: xdp-zc-max-segs
doc: max fragment count supported by ZC driver
type: u32
checks:
@@ -102,6 +102,8 @@ name: netdev
attributes:
- ifindex
- xdp-features
+ - xdp-zc-max-segs
+ - xsk-features
dump:
reply: *dev-all
-
diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h
index bf71698a1e82..cf1e11c76339 100644
--- a/tools/include/uapi/linux/netdev.h
+++ b/tools/include/uapi/linux/netdev.h
@@ -37,11 +37,26 @@ enum netdev_xdp_act {
NETDEV_XDP_ACT_MASK = 127,
};
+/**
+ * enum netdev_xsk_flags
+ * @NETDEV_XSK_FLAGS_TX_TIMESTAMP: HW timestamping egress packets is supported
+ * by the driver.
+ * @NETDEV_XSK_FLAGS_TX_CHECKSUM: L3 checksum HW offload is supported by the
+ * driver.
+ */
+enum netdev_xsk_flags {
+ NETDEV_XSK_FLAGS_TX_TIMESTAMP = 1,
+ NETDEV_XSK_FLAGS_TX_CHECKSUM = 2,
+
+ NETDEV_XSK_FLAGS_MASK = 3,
+};
+
enum {
NETDEV_A_DEV_IFINDEX = 1,
NETDEV_A_DEV_PAD,
NETDEV_A_DEV_XDP_FEATURES,
NETDEV_A_DEV_XDP_ZC_MAX_SEGS,
+ NETDEV_A_DEV_XSK_FEATURES,
__NETDEV_A_DEV_MAX,
NETDEV_A_DEV_MAX = (__NETDEV_A_DEV_MAX - 1)
diff --git a/tools/net/ynl/generated/netdev-user.c b/tools/net/ynl/generated/netdev-user.c
index 4eb8aefef0cd..f8dd6aa0ad97 100644
--- a/tools/net/ynl/generated/netdev-user.c
+++ b/tools/net/ynl/generated/netdev-user.c
@@ -45,11 +45,26 @@ const char *netdev_xdp_act_str(enum netdev_xdp_act value)
return netdev_xdp_act_strmap[value];
}
+static const char * const netdev_xsk_flags_strmap[] = {
+ [0] = "tx-timestamp",
+ [1] = "tx-checksum",
+};
+
+const char *netdev_xsk_flags_str(enum netdev_xsk_flags value)
+{
+ value = ffs(value) - 1;
+ if (value < 0 || value >= (int)MNL_ARRAY_SIZE(netdev_xsk_flags_strmap))
+ return NULL;
+ return netdev_xsk_flags_strmap[value];
+}
+
/* Policies */
struct ynl_policy_attr netdev_dev_policy[NETDEV_A_DEV_MAX + 1] = {
[NETDEV_A_DEV_IFINDEX] = { .name = "ifindex", .type = YNL_PT_U32, },
[NETDEV_A_DEV_PAD] = { .name = "pad", .type = YNL_PT_IGNORE, },
[NETDEV_A_DEV_XDP_FEATURES] = { .name = "xdp-features", .type = YNL_PT_U64, },
+ [NETDEV_A_DEV_XDP_ZC_MAX_SEGS] = { .name = "xdp-zc-max-segs", .type = YNL_PT_U32, },
+ [NETDEV_A_DEV_XSK_FEATURES] = { .name = "xsk-features", .type = YNL_PT_U64, },
};
struct ynl_policy_nest netdev_dev_nest = {
@@ -91,6 +106,16 @@ int netdev_dev_get_rsp_parse(const struct nlmsghdr *nlh, void *data)
return MNL_CB_ERROR;
dst->_present.xdp_features = 1;
dst->xdp_features = mnl_attr_get_u64(attr);
+ } else if (type == NETDEV_A_DEV_XDP_ZC_MAX_SEGS) {
+ if (ynl_attr_validate(yarg, attr))
+ return MNL_CB_ERROR;
+ dst->_present.xdp_zc_max_segs = 1;
+ dst->xdp_zc_max_segs = mnl_attr_get_u32(attr);
+ } else if (type == NETDEV_A_DEV_XSK_FEATURES) {
+ if (ynl_attr_validate(yarg, attr))
+ return MNL_CB_ERROR;
+ dst->_present.xsk_features = 1;
+ dst->xsk_features = mnl_attr_get_u64(attr);
}
}
diff --git a/tools/net/ynl/generated/netdev-user.h b/tools/net/ynl/generated/netdev-user.h
index 5554dc69bb9c..b8c5cdb331b4 100644
--- a/tools/net/ynl/generated/netdev-user.h
+++ b/tools/net/ynl/generated/netdev-user.h
@@ -18,6 +18,7 @@ extern const struct ynl_family ynl_netdev_family;
/* Enums */
const char *netdev_op_str(int op);
const char *netdev_xdp_act_str(enum netdev_xdp_act value);
+const char *netdev_xsk_flags_str(enum netdev_xsk_flags value);
/* Common nested types */
/* ============== NETDEV_CMD_DEV_GET ============== */
@@ -47,10 +48,14 @@ struct netdev_dev_get_rsp {
struct {
__u32 ifindex:1;
__u32 xdp_features:1;
+ __u32 xdp_zc_max_segs:1;
+ __u32 xsk_features:1;
} _present;
__u32 ifindex;
__u64 xdp_features;
+ __u32 xdp_zc_max_segs;
+ __u64 xsk_features;
};
void netdev_dev_get_rsp_free(struct netdev_dev_get_rsp *rsp);
diff --git a/tools/net/ynl/samples/netdev.c b/tools/net/ynl/samples/netdev.c
index d31268aa47c5..06377e3f1df5 100644
--- a/tools/net/ynl/samples/netdev.c
+++ b/tools/net/ynl/samples/netdev.c
@@ -38,6 +38,14 @@ static void netdev_print_device(struct netdev_dev_get_rsp *d, unsigned int op)
printf(" %s", netdev_xdp_act_str(1 << i));
}
+ printf(" %llx:", d->xsk_features);
+ for (int i = 0; d->xsk_features > 1U << i; i++) {
+ if (d->xsk_features & (1U << i))
+ printf(" %s", netdev_xsk_flags_str(1 << i));
+ }
+
+ printf(" xdp-zc-max-segs=%u", d->xdp_zc_max_segs);
+
name = netdev_op_str(op);
if (name)
printf(" (ntf: %s)", name);
--
2.41.0.487.g6d72f3e995-goog
^ permalink raw reply [flat|nested] 30+ messages in thread
* [xdp-hints] [RFC net-next v4 5/8] selftests/xsk: Support XDP_TX_METADATA_LEN
2023-07-24 23:59 [xdp-hints] [RFC net-next v4 0/8] xsk: TX metadata Stanislav Fomichev
` (3 preceding siblings ...)
2023-07-24 23:59 ` [xdp-hints] [RFC net-next v4 4/8] tools: ynl: update netdev sample to dump xsk-features Stanislav Fomichev
@ 2023-07-24 23:59 ` Stanislav Fomichev
2023-07-24 23:59 ` [xdp-hints] [RFC net-next v4 6/8] selftests/bpf: Add csum helpers Stanislav Fomichev
` (2 subsequent siblings)
7 siblings, 0 replies; 30+ messages in thread
From: Stanislav Fomichev @ 2023-07-24 23:59 UTC (permalink / raw)
To: bpf
Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
kpsingh, sdf, haoluo, jolsa, kuba, toke, willemb, dsahern,
magnus.karlsson, bjorn, maciej.fijalkowski, hawk, netdev,
xdp-hints
Add new config field and call setsockopt.
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
tools/testing/selftests/bpf/xsk.c | 17 +++++++++++++++++
tools/testing/selftests/bpf/xsk.h | 1 +
2 files changed, 18 insertions(+)
diff --git a/tools/testing/selftests/bpf/xsk.c b/tools/testing/selftests/bpf/xsk.c
index d9fb2b730a2c..cb7e48f24289 100644
--- a/tools/testing/selftests/bpf/xsk.c
+++ b/tools/testing/selftests/bpf/xsk.c
@@ -49,6 +49,10 @@
#define PF_XDP AF_XDP
#endif
+#ifndef XDP_TX_METADATA_LEN
+#define XDP_TX_METADATA_LEN 9
+#endif
+
#define pr_warn(fmt, ...) fprintf(stderr, fmt, ##__VA_ARGS__)
#define XSKMAP_SIZE 1
@@ -132,12 +136,14 @@ static int xsk_set_xdp_socket_config(struct xsk_socket_config *cfg,
cfg->rx_size = XSK_RING_CONS__DEFAULT_NUM_DESCS;
cfg->tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS;
cfg->bind_flags = 0;
+ cfg->tx_metadata_len = 0;
return 0;
}
cfg->rx_size = usr_cfg->rx_size;
cfg->tx_size = usr_cfg->tx_size;
cfg->bind_flags = usr_cfg->bind_flags;
+ cfg->tx_metadata_len = usr_cfg->tx_metadata_len;
return 0;
}
@@ -613,6 +619,17 @@ int xsk_socket__create_shared(struct xsk_socket **xsk_ptr,
umem->tx_ring_setup_done = true;
}
+ if (xsk->config.tx_metadata_len) {
+ int optval = xsk->config.tx_metadata_len;
+
+ err = setsockopt(xsk->fd, SOL_XDP, XDP_TX_METADATA_LEN,
+ &optval, sizeof(optval));
+ if (err) {
+ err = -errno;
+ goto out_put_ctx;
+ }
+ }
+
err = xsk_get_mmap_offsets(xsk->fd, &off);
if (err) {
err = -errno;
diff --git a/tools/testing/selftests/bpf/xsk.h b/tools/testing/selftests/bpf/xsk.h
index d93200fdaa8d..325fe0c83e5d 100644
--- a/tools/testing/selftests/bpf/xsk.h
+++ b/tools/testing/selftests/bpf/xsk.h
@@ -212,6 +212,7 @@ struct xsk_socket_config {
__u32 rx_size;
__u32 tx_size;
__u16 bind_flags;
+ __u8 tx_metadata_len;
};
/* Set config to NULL to get the default configuration. */
--
2.41.0.487.g6d72f3e995-goog
^ permalink raw reply [flat|nested] 30+ messages in thread
* [xdp-hints] [RFC net-next v4 6/8] selftests/bpf: Add csum helpers
2023-07-24 23:59 [xdp-hints] [RFC net-next v4 0/8] xsk: TX metadata Stanislav Fomichev
` (4 preceding siblings ...)
2023-07-24 23:59 ` [xdp-hints] [RFC net-next v4 5/8] selftests/xsk: Support XDP_TX_METADATA_LEN Stanislav Fomichev
@ 2023-07-24 23:59 ` Stanislav Fomichev
2023-07-24 23:59 ` [xdp-hints] [RFC net-next v4 7/8] selftests/bpf: Add TX side to xdp_metadata Stanislav Fomichev
2023-07-24 23:59 ` [xdp-hints] [RFC net-next v4 8/8] selftests/bpf: Add TX side to xdp_hw_metadata Stanislav Fomichev
7 siblings, 0 replies; 30+ messages in thread
From: Stanislav Fomichev @ 2023-07-24 23:59 UTC (permalink / raw)
To: bpf
Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
kpsingh, sdf, haoluo, jolsa, kuba, toke, willemb, dsahern,
magnus.karlsson, bjorn, maciej.fijalkowski, hawk, netdev,
xdp-hints
Checksum helpers will be used to calculate pseudo-header checksum in
AF_XDP metadata selftests.
The helpers are mirroring existing kernel ones:
- csum_tcpudp_magic : IPv4 pseudo header csum
- csum_ipv6_magic : IPv6 pseudo header csum
- csum_fold : fold csum and do one's complement
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
tools/testing/selftests/bpf/network_helpers.h | 43 +++++++++++++++++++
1 file changed, 43 insertions(+)
diff --git a/tools/testing/selftests/bpf/network_helpers.h b/tools/testing/selftests/bpf/network_helpers.h
index 694185644da6..d749757a36a3 100644
--- a/tools/testing/selftests/bpf/network_helpers.h
+++ b/tools/testing/selftests/bpf/network_helpers.h
@@ -67,4 +67,47 @@ struct nstoken;
*/
struct nstoken *open_netns(const char *name);
void close_netns(struct nstoken *token);
+
+static __u16 csum_fold(__u32 csum)
+{
+ csum = (csum & 0xffff) + (csum >> 16);
+ csum = (csum & 0xffff) + (csum >> 16);
+
+ return (__u16)~csum;
+}
+
+static inline __sum16 csum_tcpudp_magic(__be32 saddr, __be32 daddr,
+ __u32 len, __u8 proto,
+ __wsum csum)
+{
+ __u64 s = csum;
+
+ s += (__u32)saddr;
+ s += (__u32)daddr;
+ s += htons(proto + len);
+ s = (s & 0xffffffff) + (s >> 32);
+ s = (s & 0xffffffff) + (s >> 32);
+
+ return csum_fold((__u32)s);
+}
+
+static inline __sum16 csum_ipv6_magic(const struct in6_addr *saddr,
+ const struct in6_addr *daddr,
+ __u32 len, __u8 proto,
+ __wsum csum)
+{
+ __u64 s = csum;
+ int i;
+
+ for (i = 0; i < 4; i++)
+ s += (__u32)saddr->s6_addr32[i];
+ for (i = 0; i < 4; i++)
+ s += (__u32)daddr->s6_addr32[i];
+ s += htons(proto + len);
+ s = (s & 0xffffffff) + (s >> 32);
+ s = (s & 0xffffffff) + (s >> 32);
+
+ return csum_fold((__u32)s);
+}
+
#endif
--
2.41.0.487.g6d72f3e995-goog
^ permalink raw reply [flat|nested] 30+ messages in thread
* [xdp-hints] [RFC net-next v4 7/8] selftests/bpf: Add TX side to xdp_metadata
2023-07-24 23:59 [xdp-hints] [RFC net-next v4 0/8] xsk: TX metadata Stanislav Fomichev
` (5 preceding siblings ...)
2023-07-24 23:59 ` [xdp-hints] [RFC net-next v4 6/8] selftests/bpf: Add csum helpers Stanislav Fomichev
@ 2023-07-24 23:59 ` Stanislav Fomichev
2023-07-24 23:59 ` [xdp-hints] [RFC net-next v4 8/8] selftests/bpf: Add TX side to xdp_hw_metadata Stanislav Fomichev
7 siblings, 0 replies; 30+ messages in thread
From: Stanislav Fomichev @ 2023-07-24 23:59 UTC (permalink / raw)
To: bpf
Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
kpsingh, sdf, haoluo, jolsa, kuba, toke, willemb, dsahern,
magnus.karlsson, bjorn, maciej.fijalkowski, hawk, netdev,
xdp-hints
Request TX timestamp and make sure it's not empty.
Request TX checksum offload (SW-only) and make sure it's resolved
to the correct one.
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
.../selftests/bpf/prog_tests/xdp_metadata.c | 31 +++++++++++++++++--
1 file changed, 28 insertions(+), 3 deletions(-)
diff --git a/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c b/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
index 626c461fa34d..04477a557b8c 100644
--- a/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
+++ b/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
@@ -48,6 +48,7 @@ static int open_xsk(int ifindex, struct xsk *xsk)
{
int mmap_flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE;
const struct xsk_socket_config socket_config = {
+ .tx_metadata_len = sizeof(struct xsk_tx_metadata),
.rx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
.tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
.bind_flags = XDP_COPY,
@@ -138,6 +139,7 @@ static void ip_csum(struct iphdr *iph)
static int generate_packet(struct xsk *xsk, __u16 dst_port)
{
+ struct xsk_tx_metadata *meta;
struct xdp_desc *tx_desc;
struct udphdr *udph;
struct ethhdr *eth;
@@ -151,10 +153,14 @@ static int generate_packet(struct xsk *xsk, __u16 dst_port)
return -1;
tx_desc = xsk_ring_prod__tx_desc(&xsk->tx, idx);
- tx_desc->addr = idx % (UMEM_NUM / 2) * UMEM_FRAME_SIZE;
+ tx_desc->addr = idx % (UMEM_NUM / 2) * UMEM_FRAME_SIZE + sizeof(struct xsk_tx_metadata);
printf("%p: tx_desc[%u]->addr=%llx\n", xsk, idx, tx_desc->addr);
data = xsk_umem__get_data(xsk->umem_area, tx_desc->addr);
+ meta = data - sizeof(struct xsk_tx_metadata);
+ memset(meta, 0, sizeof(*meta));
+ meta->flags = XDP_TX_METADATA_TIMESTAMP;
+
eth = data;
iph = (void *)(eth + 1);
udph = (void *)(iph + 1);
@@ -178,11 +184,17 @@ static int generate_packet(struct xsk *xsk, __u16 dst_port)
udph->source = htons(AF_XDP_SOURCE_PORT);
udph->dest = htons(dst_port);
udph->len = htons(sizeof(*udph) + UDP_PAYLOAD_BYTES);
- udph->check = 0;
+ udph->check = ~csum_tcpudp_magic(iph->saddr, iph->daddr,
+ ntohs(udph->len), IPPROTO_UDP, 0);
memset(udph + 1, 0xAA, UDP_PAYLOAD_BYTES);
+ meta->flags |= XDP_TX_METADATA_CHECKSUM | XDP_TX_METADATA_CHECKSUM_SW;
+ meta->csum_start = sizeof(*eth) + sizeof(*iph);
+ meta->csum_offset = offsetof(struct udphdr, check);
+
tx_desc->len = sizeof(*eth) + sizeof(*iph) + sizeof(*udph) + UDP_PAYLOAD_BYTES;
+ tx_desc->options |= XDP_TX_METADATA;
xsk_ring_prod__submit(&xsk->tx, 1);
ret = sendto(xsk_socket__fd(xsk->socket), NULL, 0, MSG_DONTWAIT, NULL, 0);
@@ -194,13 +206,21 @@ static int generate_packet(struct xsk *xsk, __u16 dst_port)
static void complete_tx(struct xsk *xsk)
{
- __u32 idx;
+ struct xsk_tx_metadata *meta;
__u64 addr;
+ void *data;
+ __u32 idx;
if (ASSERT_EQ(xsk_ring_cons__peek(&xsk->comp, 1, &idx), 1, "xsk_ring_cons__peek")) {
addr = *xsk_ring_cons__comp_addr(&xsk->comp, idx);
printf("%p: complete tx idx=%u addr=%llx\n", xsk, idx, addr);
+
+ data = xsk_umem__get_data(xsk->umem_area, addr);
+ meta = data - sizeof(struct xsk_tx_metadata);
+
+ ASSERT_NEQ(meta->tx_timestamp, 0, "tx_timestamp");
+
xsk_ring_cons__release(&xsk->comp, 1);
}
}
@@ -221,6 +241,7 @@ static int verify_xsk_metadata(struct xsk *xsk)
const struct xdp_desc *rx_desc;
struct pollfd fds = {};
struct xdp_meta *meta;
+ struct udphdr *udph;
struct ethhdr *eth;
struct iphdr *iph;
__u64 comp_addr;
@@ -257,6 +278,7 @@ static int verify_xsk_metadata(struct xsk *xsk)
ASSERT_EQ(eth->h_proto, htons(ETH_P_IP), "eth->h_proto");
iph = (void *)(eth + 1);
ASSERT_EQ((int)iph->version, 4, "iph->version");
+ udph = (void *)(iph + 1);
/* custom metadata */
@@ -270,6 +292,9 @@ static int verify_xsk_metadata(struct xsk *xsk)
ASSERT_EQ(meta->rx_hash_type, 0, "rx_hash_type");
+ /* checksum offload */
+ ASSERT_EQ(udph->check, 0x1c72, "csum");
+
xsk_ring_cons__release(&xsk->rx, 1);
refill_rx(xsk, comp_addr);
--
2.41.0.487.g6d72f3e995-goog
^ permalink raw reply [flat|nested] 30+ messages in thread
* [xdp-hints] [RFC net-next v4 8/8] selftests/bpf: Add TX side to xdp_hw_metadata
2023-07-24 23:59 [xdp-hints] [RFC net-next v4 0/8] xsk: TX metadata Stanislav Fomichev
` (6 preceding siblings ...)
2023-07-24 23:59 ` [xdp-hints] [RFC net-next v4 7/8] selftests/bpf: Add TX side to xdp_metadata Stanislav Fomichev
@ 2023-07-24 23:59 ` Stanislav Fomichev
2023-07-25 20:59 ` [xdp-hints] " Willem de Bruijn
7 siblings, 1 reply; 30+ messages in thread
From: Stanislav Fomichev @ 2023-07-24 23:59 UTC (permalink / raw)
To: bpf
Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
kpsingh, sdf, haoluo, jolsa, kuba, toke, willemb, dsahern,
magnus.karlsson, bjorn, maciej.fijalkowski, hawk, netdev,
xdp-hints
When we get packets on port 9091, we swap src/dst and send it out.
At this point, we also request the timestamp and plumb it back
to the userspace. The userspace simply prints the timestamp.
Also print current UDP checksum, rewrite it with the pseudo-header
checksum and offload TX checksum calculation to devtx. Upon
completion, report TX checksum back (mlx5 doesn't put it back, so
I've used tcpdump to confirm that the checksum is correct).
Some other related changes:
- switched to zerocopy mode by default; new flag can be used to force
old behavior
- request fixed TX_METADATA_LEN headroom
- some other small fixes (umem size, fill idx+i, etc)
mvbz3:~# ./xdp_hw_metadata eth3 -c mlx5e_devtx_complete_xdp -s mlx5e_devtx_submit_xd
attach rx bpf program...
...
0x206d298: rx_desc[0]->addr=80100 addr=80100 comp_addr=80100
rx_hash: 0x2BFB7FEC with RSS type:0x2A
rx_timestamp: 1690238278345877848 (sec:1690238278.3459)
XDP RX-time: 1690238278538397674 (sec:1690238278.5384) delta sec:0.1925 (192519.826 usec)
AF_XDP time: 1690238278538515250 (sec:1690238278.5385) delta sec:0.0001 (117.576 usec)
0x206d298: ping-pong with csum=8e3b (want 57c9) csum_start=54 csum_offset=6
0x206d298: complete tx idx=0 addr=10
0x206d298: tx_timestamp: 1690238278577008140 (sec:1690238278.5770)
0x206d298: complete rx idx=128 addr=80100
mvbz4:~# nc -Nu -q1 ${MVBZ3_LINK_LOCAL_IP}%eth3 9091
mvbz4:~# tcpdump -vvx -i eth3 udp
tcpdump: listening on eth3, link-type EN10MB (Ethernet), snapshot length 262144 bytes
10:12:43.901436 IP6 (flowlabel 0x7a5d2, hlim 127, next-header UDP (17) payload length: 11) fe80::1270:fdff:fe48:1087.44339 > fe80::1270:fdff:fe48:1077.9091: [bad udp cksum 0x3b8e -> 0x0b4b!] UDP, length 3
0x0000: 6007 a5d2 000b 117f fe80 0000 0000 0000
0x0010: 1270 fdff fe48 1087 fe80 0000 0000 0000
0x0020: 1270 fdff fe48 1077 ad33 2383 000b 3b8e
0x0030: 7864 70
10:12:43.902125 IP6 (flowlabel 0x7a5d2, hlim 127, next-header UDP (17) payload length: 11) fe80::1270:fdff:fe48:1077.9091 > fe80::1270:fdff:fe48:1087.44339: [udp sum ok] UDP, length 3
0x0000: 6007 a5d2 000b 117f fe80 0000 0000 0000
0x0010: 1270 fdff fe48 1077 fe80 0000 0000 0000
0x0020: 1270 fdff fe48 1087 2383 ad33 000b 0b4b
0x0030: 7864 70
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
tools/testing/selftests/bpf/xdp_hw_metadata.c | 201 +++++++++++++++++-
1 file changed, 191 insertions(+), 10 deletions(-)
diff --git a/tools/testing/selftests/bpf/xdp_hw_metadata.c b/tools/testing/selftests/bpf/xdp_hw_metadata.c
index 613321eb84c1..0bef79ffac7a 100644
--- a/tools/testing/selftests/bpf/xdp_hw_metadata.c
+++ b/tools/testing/selftests/bpf/xdp_hw_metadata.c
@@ -10,7 +10,9 @@
* - rx_hash
*
* TX:
- * - TBD
+ * - UDP 9091 packets trigger TX reply
+ * - TX HW timestamp is requested and reported back upon completion
+ * - TX checksum is requested
*/
#include <test_progs.h>
@@ -24,14 +26,17 @@
#include <linux/net_tstamp.h>
#include <linux/udp.h>
#include <linux/sockios.h>
+#include <linux/if_xdp.h>
#include <sys/mman.h>
#include <net/if.h>
#include <poll.h>
#include <time.h>
+#include <unistd.h>
+#include <libgen.h>
#include "xdp_metadata.h"
-#define UMEM_NUM 16
+#define UMEM_NUM 256
#define UMEM_FRAME_SIZE XSK_UMEM__DEFAULT_FRAME_SIZE
#define UMEM_SIZE (UMEM_FRAME_SIZE * UMEM_NUM)
#define XDP_FLAGS (XDP_FLAGS_DRV_MODE | XDP_FLAGS_REPLACE)
@@ -51,22 +56,24 @@ struct xsk *rx_xsk;
const char *ifname;
int ifindex;
int rxq;
+bool skip_tx;
void test__fail(void) { /* for network_helpers.c */ }
-static int open_xsk(int ifindex, struct xsk *xsk, __u32 queue_id)
+static int open_xsk(int ifindex, struct xsk *xsk, __u32 queue_id, int flags)
{
int mmap_flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE;
const struct xsk_socket_config socket_config = {
+ .tx_metadata_len = sizeof(struct xsk_tx_metadata),
.rx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
.tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
- .bind_flags = XDP_COPY,
+ .bind_flags = flags,
};
const struct xsk_umem_config umem_config = {
.fill_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
.comp_size = XSK_RING_CONS__DEFAULT_NUM_DESCS,
.frame_size = XSK_UMEM__DEFAULT_FRAME_SIZE,
- .flags = XDP_UMEM_UNALIGNED_CHUNK_FLAG,
+ .flags = XSK_UMEM__DEFAULT_FLAGS,
};
__u32 idx;
u64 addr;
@@ -108,7 +115,7 @@ static int open_xsk(int ifindex, struct xsk *xsk, __u32 queue_id)
for (i = 0; i < UMEM_NUM / 2; i++) {
addr = (UMEM_NUM / 2 + i) * UMEM_FRAME_SIZE;
printf("%p: rx_desc[%d] -> %lx\n", xsk, i, addr);
- *xsk_ring_prod__fill_addr(&xsk->fill, i) = addr;
+ *xsk_ring_prod__fill_addr(&xsk->fill, idx + i) = addr;
}
xsk_ring_prod__submit(&xsk->fill, ret);
@@ -129,12 +136,22 @@ static void refill_rx(struct xsk *xsk, __u64 addr)
__u32 idx;
if (xsk_ring_prod__reserve(&xsk->fill, 1, &idx) == 1) {
- printf("%p: complete idx=%u addr=%llx\n", xsk, idx, addr);
+ printf("%p: complete rx idx=%u addr=%llx\n", xsk, idx, addr);
*xsk_ring_prod__fill_addr(&xsk->fill, idx) = addr;
xsk_ring_prod__submit(&xsk->fill, 1);
}
}
+static int kick_tx(struct xsk *xsk)
+{
+ return sendto(xsk_socket__fd(xsk->socket), NULL, 0, MSG_DONTWAIT, NULL, 0);
+}
+
+static int kick_rx(struct xsk *xsk)
+{
+ return recvfrom(xsk_socket__fd(xsk->socket), NULL, 0, MSG_DONTWAIT, NULL, NULL);
+}
+
#define NANOSEC_PER_SEC 1000000000 /* 10^9 */
static __u64 gettime(clockid_t clock_id)
{
@@ -228,6 +245,116 @@ static void verify_skb_metadata(int fd)
printf("skb hwtstamp is not found!\n");
}
+static bool complete_tx(struct xsk *xsk)
+{
+ struct xsk_tx_metadata *meta;
+ __u64 addr;
+ void *data;
+ __u32 idx;
+
+ if (!xsk_ring_cons__peek(&xsk->comp, 1, &idx))
+ return false;
+
+ addr = *xsk_ring_cons__comp_addr(&xsk->comp, idx);
+ data = xsk_umem__get_data(xsk->umem_area, addr);
+ meta = data - sizeof(struct xsk_tx_metadata);
+
+ printf("%p: complete tx idx=%u addr=%llx\n", xsk, idx, addr);
+ printf("%p: tx_timestamp: %llu (sec:%0.4f)\n", xsk, meta->tx_timestamp,
+ (double)meta->tx_timestamp / NANOSEC_PER_SEC);
+ xsk_ring_cons__release(&xsk->comp, 1);
+
+ return true;
+}
+
+#define swap(a, b, len) do { \
+ for (int i = 0; i < len; i++) { \
+ __u8 tmp = ((__u8 *)a)[i]; \
+ ((__u8 *)a)[i] = ((__u8 *)b)[i]; \
+ ((__u8 *)b)[i] = tmp; \
+ } \
+} while (0)
+
+static void ping_pong(struct xsk *xsk, void *rx_packet)
+{
+ struct xsk_tx_metadata *meta;
+ struct ipv6hdr *ip6h = NULL;
+ struct iphdr *iph = NULL;
+ struct xdp_desc *tx_desc;
+ struct udphdr *udph;
+ struct ethhdr *eth;
+ __sum16 want_csum;
+ void *data;
+ __u32 idx;
+ int ret;
+ int len;
+
+ ret = xsk_ring_prod__reserve(&xsk->tx, 1, &idx);
+ if (ret != 1) {
+ printf("%p: failed to reserve tx slot\n", xsk);
+ return;
+ }
+
+ tx_desc = xsk_ring_prod__tx_desc(&xsk->tx, idx);
+ tx_desc->addr = idx % (UMEM_NUM / 2) * UMEM_FRAME_SIZE + sizeof(struct xsk_tx_metadata);
+ data = xsk_umem__get_data(xsk->umem_area, tx_desc->addr);
+
+ meta = data - sizeof(struct xsk_tx_metadata);
+ memset(meta, 0, sizeof(*meta));
+ meta->flags = XDP_TX_METADATA_TIMESTAMP;
+
+ eth = rx_packet;
+
+ if (eth->h_proto == htons(ETH_P_IP)) {
+ iph = (void *)(eth + 1);
+ udph = (void *)(iph + 1);
+ } else if (eth->h_proto == htons(ETH_P_IPV6)) {
+ ip6h = (void *)(eth + 1);
+ udph = (void *)(ip6h + 1);
+ } else {
+ printf("%p: failed to detect IP version for ping pong %04x\n", xsk, eth->h_proto);
+ xsk_ring_prod__cancel(&xsk->tx, 1);
+ return;
+ }
+
+ len = ETH_HLEN;
+ if (ip6h)
+ len += sizeof(*ip6h) + ntohs(ip6h->payload_len);
+ if (iph)
+ len += ntohs(iph->tot_len);
+
+ swap(eth->h_dest, eth->h_source, ETH_ALEN);
+ if (iph)
+ swap(&iph->saddr, &iph->daddr, 4);
+ else
+ swap(&ip6h->saddr, &ip6h->daddr, 16);
+ swap(&udph->source, &udph->dest, 2);
+
+ want_csum = udph->check;
+ if (ip6h)
+ udph->check = ~csum_ipv6_magic(&ip6h->saddr, &ip6h->daddr,
+ ntohs(udph->len), IPPROTO_UDP, 0);
+ else
+ udph->check = ~csum_tcpudp_magic(iph->saddr, iph->daddr,
+ ntohs(udph->len), IPPROTO_UDP, 0);
+
+ meta->flags |= XDP_TX_METADATA_CHECKSUM;
+ if (iph)
+ meta->csum_start = sizeof(*eth) + sizeof(*iph);
+ else
+ meta->csum_start = sizeof(*eth) + sizeof(*ip6h);
+ meta->csum_offset = offsetof(struct udphdr, check);
+
+ printf("%p: ping-pong with csum=%04x (want %04x) csum_start=%d csum_offset=%d\n",
+ xsk, udph->check, want_csum, meta->csum_start, meta->csum_offset);
+
+ memcpy(data, rx_packet, len); /* don't share umem chunk for simplicity */
+ tx_desc->options |= XDP_TX_METADATA;
+ tx_desc->len = len;
+
+ xsk_ring_prod__submit(&xsk->tx, 1);
+}
+
static int verify_metadata(struct xsk *rx_xsk, int rxq, int server_fd, clockid_t clock_id)
{
const struct xdp_desc *rx_desc;
@@ -250,6 +377,13 @@ static int verify_metadata(struct xsk *rx_xsk, int rxq, int server_fd, clockid_t
while (true) {
errno = 0;
+
+ for (i = 0; i < rxq; i++) {
+ ret = kick_rx(&rx_xsk[i]);
+ if (ret)
+ printf("kick_rx ret=%d\n", ret);
+ }
+
ret = poll(fds, rxq + 1, 1000);
printf("poll: %d (%d) skip=%llu fail=%llu redir=%llu\n",
ret, errno, bpf_obj->bss->pkts_skip,
@@ -280,6 +414,22 @@ static int verify_metadata(struct xsk *rx_xsk, int rxq, int server_fd, clockid_t
xsk, idx, rx_desc->addr, addr, comp_addr);
verify_xdp_metadata(xsk_umem__get_data(xsk->umem_area, addr),
clock_id);
+
+ if (!skip_tx) {
+ /* mirror the packet back */
+ ping_pong(xsk, xsk_umem__get_data(xsk->umem_area, addr));
+
+ ret = kick_tx(xsk);
+ if (ret)
+ printf("kick_tx ret=%d\n", ret);
+
+ for (int j = 0; j < 500; j++) {
+ if (complete_tx(xsk))
+ break;
+ usleep(10*1000);
+ }
+ }
+
xsk_ring_cons__release(&xsk->rx, 1);
refill_rx(xsk, comp_addr);
}
@@ -404,21 +554,52 @@ static void timestamping_enable(int fd, int val)
error(1, errno, "setsockopt(SO_TIMESTAMPING)");
}
+static void usage(const char *prog)
+{
+ fprintf(stderr,
+ "usage: %s [OPTS] <ifname>\n"
+ "OPTS:\n"
+ " -T don't generate AF_XDP reply (rx metadata only)\n"
+ " -Z run in copy mode\n",
+ prog);
+}
+
int main(int argc, char *argv[])
{
+ int bind_flags = XDP_USE_NEED_WAKEUP | XDP_ZEROCOPY;
clockid_t clock_id = CLOCK_TAI;
int server_fd = -1;
+ int opt;
int ret;
int i;
struct bpf_program *prog;
- if (argc != 2) {
+ while ((opt = getopt(argc, argv, "TZ")) != -1) {
+ switch (opt) {
+ case 'T':
+ skip_tx = true;
+ break;
+ case 'Z':
+ bind_flags = XDP_USE_NEED_WAKEUP | XDP_COPY;
+ break;
+ default:
+ usage(basename(argv[0]));
+ return 1;
+ }
+ }
+
+ if (argc < 2) {
fprintf(stderr, "pass device name\n");
return -1;
}
- ifname = argv[1];
+ if (optind >= argc) {
+ usage(basename(argv[0]));
+ return 1;
+ }
+
+ ifname = argv[optind];
ifindex = if_nametoindex(ifname);
rxq = rxq_num(ifname);
@@ -432,7 +613,7 @@ int main(int argc, char *argv[])
for (i = 0; i < rxq; i++) {
printf("open_xsk(%s, %p, %d)\n", ifname, &rx_xsk[i], i);
- ret = open_xsk(ifindex, &rx_xsk[i], i);
+ ret = open_xsk(ifindex, &rx_xsk[i], i, bind_flags);
if (ret)
error(1, -ret, "open_xsk");
--
2.41.0.487.g6d72f3e995-goog
^ permalink raw reply [flat|nested] 30+ messages in thread
* [xdp-hints] Re: [RFC net-next v4 2/8] xsk: add TX timestamp and TX checksum offload support
2023-07-24 23:59 ` [xdp-hints] [RFC net-next v4 2/8] xsk: add TX timestamp and TX checksum offload support Stanislav Fomichev
@ 2023-07-25 19:28 ` Simon Horman
2023-07-25 20:30 ` Stanislav Fomichev
2023-07-25 20:54 ` Willem de Bruijn
` (2 subsequent siblings)
3 siblings, 1 reply; 30+ messages in thread
From: Simon Horman @ 2023-07-25 19:28 UTC (permalink / raw)
To: Stanislav Fomichev
Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
kpsingh, haoluo, jolsa, kuba, toke, willemb, dsahern,
magnus.karlsson, bjorn, maciej.fijalkowski, hawk, netdev,
xdp-hints
On Mon, Jul 24, 2023 at 04:59:51PM -0700, Stanislav Fomichev wrote:
...
Hi Stan,
> diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
> index bf71698a1e82..cf1e11c76339 100644
> --- a/include/uapi/linux/netdev.h
> +++ b/include/uapi/linux/netdev.h
> @@ -37,11 +37,26 @@ enum netdev_xdp_act {
> NETDEV_XDP_ACT_MASK = 127,
> };
>
> +/**
> + * enum netdev_xsk_flags
> + * @NETDEV_XSK_FLAGS_TX_TIMESTAMP: HW timestamping egress packets is supported
> + * by the driver.
> + * @NETDEV_XSK_FLAGS_TX_CHECKSUM: L3 checksum HW offload is supported by the
> + * driver.
> + */
> +enum netdev_xsk_flags {
> + NETDEV_XSK_FLAGS_TX_TIMESTAMP = 1,
> + NETDEV_XSK_FLAGS_TX_CHECKSUM = 2,
> +
I know that it isn't the practice in this file.
but adding the following makes kernel-doc happier
about NETDEV_XSK_FLAGS_MASK not being documented.
/* private: */
> + NETDEV_XSK_FLAGS_MASK = 3,
> +};
> +
> enum {
> NETDEV_A_DEV_IFINDEX = 1,
> NETDEV_A_DEV_PAD,
> NETDEV_A_DEV_XDP_FEATURES,
> NETDEV_A_DEV_XDP_ZC_MAX_SEGS,
> + NETDEV_A_DEV_XSK_FEATURES,
>
> __NETDEV_A_DEV_MAX,
> NETDEV_A_DEV_MAX = (__NETDEV_A_DEV_MAX - 1)
...
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
...
> @@ -626,6 +635,7 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
> static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> struct xdp_desc *desc)
> {
> + struct xsk_tx_metadata *meta = NULL;
> struct net_device *dev = xs->dev;
> struct sk_buff *skb = xs->skb;
> int err;
> @@ -678,12 +688,40 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
>
> skb_add_rx_frag(skb, nr_frags, page, 0, len, 0);
> }
> +
> + if (desc->options & XDP_TX_METADATA) {
> + if (unlikely(xs->tx_metadata_len == 0)) {
> + err = -EINVAL;
> + goto free_err;
> + }
> +
> + meta = buffer - xs->tx_metadata_len;
> +
> + if (meta->flags & XDP_TX_METADATA_CHECKSUM) {
> + if (unlikely(meta->csum_start + meta->csum_offset +
> + sizeof(__sum16) > len)) {
> + err = -EINVAL;
> + goto free_err;
> + }
> +
> + skb->csum_start = hr + meta->csum_start;
hr seems to only be set - by earlier, existing code in this function -
if skb is NULL. Is it always safe to use it here?
Smatch flags hr as being potentially used uninitialised,
the above is my understanding of why it thinks that is so.
> + skb->csum_offset = meta->csum_offset;
> + skb->ip_summed = CHECKSUM_PARTIAL;
> +
> + if (unlikely(meta->flags & XDP_TX_METADATA_CHECKSUM_SW)) {
> + err = skb_checksum_help(skb);
> + if (err)
> + goto free_err;
> + }
> + }
> + }
> }
>
> skb->dev = dev;
> skb->priority = xs->sk.sk_priority;
> skb->mark = xs->sk.sk_mark;
> skb->destructor = xsk_destruct_skb;
> + skb_shinfo(skb)->xsk_meta = meta;
> xsk_set_destructor_arg(skb);
>
> return skb;
...
^ permalink raw reply [flat|nested] 30+ messages in thread
* [xdp-hints] Re: [RFC net-next v4 2/8] xsk: add TX timestamp and TX checksum offload support
2023-07-25 19:28 ` [xdp-hints] " Simon Horman
@ 2023-07-25 20:30 ` Stanislav Fomichev
2023-07-25 21:28 ` Jakub Kicinski
0 siblings, 1 reply; 30+ messages in thread
From: Stanislav Fomichev @ 2023-07-25 20:30 UTC (permalink / raw)
To: Simon Horman
Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
kpsingh, haoluo, jolsa, kuba, toke, willemb, dsahern,
magnus.karlsson, bjorn, maciej.fijalkowski, hawk, netdev,
xdp-hints
On 07/25, Simon Horman wrote:
> On Mon, Jul 24, 2023 at 04:59:51PM -0700, Stanislav Fomichev wrote:
>
> ...
>
> Hi Stan,
>
> > diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
> > index bf71698a1e82..cf1e11c76339 100644
> > --- a/include/uapi/linux/netdev.h
> > +++ b/include/uapi/linux/netdev.h
> > @@ -37,11 +37,26 @@ enum netdev_xdp_act {
> > NETDEV_XDP_ACT_MASK = 127,
> > };
> >
> > +/**
> > + * enum netdev_xsk_flags
> > + * @NETDEV_XSK_FLAGS_TX_TIMESTAMP: HW timestamping egress packets is supported
> > + * by the driver.
> > + * @NETDEV_XSK_FLAGS_TX_CHECKSUM: L3 checksum HW offload is supported by the
> > + * driver.
> > + */
> > +enum netdev_xsk_flags {
> > + NETDEV_XSK_FLAGS_TX_TIMESTAMP = 1,
> > + NETDEV_XSK_FLAGS_TX_CHECKSUM = 2,
> > +
>
> I know that it isn't the practice in this file.
> but adding the following makes kernel-doc happier
> about NETDEV_XSK_FLAGS_MASK not being documented.
>
> /* private: */
This is autogenerated file :-( But I guess I can try to extend ynl
scripts to put this comment before the mask. Let me look into that...
> > + NETDEV_XSK_FLAGS_MASK = 3,
> > +};
> > +
> > enum {
> > NETDEV_A_DEV_IFINDEX = 1,
> > NETDEV_A_DEV_PAD,
> > NETDEV_A_DEV_XDP_FEATURES,
> > NETDEV_A_DEV_XDP_ZC_MAX_SEGS,
> > + NETDEV_A_DEV_XSK_FEATURES,
> >
> > __NETDEV_A_DEV_MAX,
> > NETDEV_A_DEV_MAX = (__NETDEV_A_DEV_MAX - 1)
>
> ...
>
> > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
>
> ...
>
> > @@ -626,6 +635,7 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
> > static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > struct xdp_desc *desc)
> > {
> > + struct xsk_tx_metadata *meta = NULL;
> > struct net_device *dev = xs->dev;
> > struct sk_buff *skb = xs->skb;
> > int err;
> > @@ -678,12 +688,40 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> >
> > skb_add_rx_frag(skb, nr_frags, page, 0, len, 0);
> > }
> > +
> > + if (desc->options & XDP_TX_METADATA) {
> > + if (unlikely(xs->tx_metadata_len == 0)) {
> > + err = -EINVAL;
> > + goto free_err;
> > + }
> > +
> > + meta = buffer - xs->tx_metadata_len;
> > +
> > + if (meta->flags & XDP_TX_METADATA_CHECKSUM) {
> > + if (unlikely(meta->csum_start + meta->csum_offset +
> > + sizeof(__sum16) > len)) {
> > + err = -EINVAL;
> > + goto free_err;
> > + }
> > +
> > + skb->csum_start = hr + meta->csum_start;
>
> hr seems to only be set - by earlier, existing code in this function -
> if skb is NULL. Is it always safe to use it here?
>
> Smatch flags hr as being potentially used uninitialised,
> the above is my understanding of why it thinks that is so.
Ugh, good point, I've missed that. This part is new and it's from
multi-frag support. I need to be more careful here and apply metadata
only from the first descriptor in the chain. Thanks!
^ permalink raw reply [flat|nested] 30+ messages in thread
* [xdp-hints] Re: [RFC net-next v4 2/8] xsk: add TX timestamp and TX checksum offload support
2023-07-24 23:59 ` [xdp-hints] [RFC net-next v4 2/8] xsk: add TX timestamp and TX checksum offload support Stanislav Fomichev
2023-07-25 19:28 ` [xdp-hints] " Simon Horman
@ 2023-07-25 20:54 ` Willem de Bruijn
2023-07-25 22:39 ` Stanislav Fomichev
2023-07-25 20:58 ` Willem de Bruijn
2023-07-26 19:25 ` Jesper Dangaard Brouer
3 siblings, 1 reply; 30+ messages in thread
From: Willem de Bruijn @ 2023-07-25 20:54 UTC (permalink / raw)
To: Stanislav Fomichev, bpf
Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
kpsingh, sdf, haoluo, jolsa, kuba, toke, willemb, dsahern,
magnus.karlsson, bjorn, maciej.fijalkowski, hawk, netdev,
xdp-hints
Stanislav Fomichev wrote:
> This change actually defines the (initial) metadata layout
> that should be used by AF_XDP userspace (xsk_tx_metadata).
> The first field is flags which requests appropriate offloads,
> followed by the offload-specific fields. The supported per-device
> offloads are exported via netlink (new xsk-flags).
>
> The offloads themselves are still implemented in a bit of a
> framework-y fashion that's left from my initial kfunc attempt.
> I'm introducing new xsk_tx_metadata_ops which drivers are
> supposed to implement. The drivers are also supposed
> to call xsk_tx_metadata_request/xsk_tx_metadata_complete in
> the right places. Since xsk_tx_metadata_{request,_complete}
> are static inline, we don't incur any extra overhead doing
> indirect calls.
>
> The benefit of this scheme is as follows:
> - keeps all metadata layout parsing away from driver code
> - makes it easy to grep and see which drivers implement what
> - don't need any extra flags to maintain to keep track of that
> offloads are implemented; if the callback is implemented - the offload
> is supported (used by netlink reporting code)
>
> Two offloads are defined right now:
> 1. XDP_TX_METADATA_CHECKSUM: skb-style csum_start+csum_offset
> 2. XDP_TX_METADATA_TIMESTAMP: writes TX timestamp back into metadata
> area upon completion (tx_timestamp field)
>
> The offloads are also implemented for copy mode:
> 1. Extra XDP_TX_METADATA_CHECKSUM_SW to trigger skb_checksum_help; this
> might be useful as a reference implementation and for testing
> 2. XDP_TX_METADATA_TIMESTAMP writes SW timestamp from the skb
> destructor (note I'm reusing hwtstamps to pass metadata pointer)
>
> The struct is forward-compatible and can be extended in the future
> by appending more fields.
>
> Signed-off-by: Stanislav Fomichev <sdf@google.com>
> ---
> Documentation/netlink/specs/netdev.yaml | 19 ++++++++
> include/linux/netdevice.h | 27 +++++++++++
> include/linux/skbuff.h | 5 ++-
> include/net/xdp_sock.h | 60 +++++++++++++++++++++++++
> include/net/xdp_sock_drv.h | 13 ++++++
> include/uapi/linux/if_xdp.h | 35 +++++++++++++++
> include/uapi/linux/netdev.h | 15 +++++++
> net/core/netdev-genl.c | 12 ++++-
> net/xdp/xsk.c | 38 ++++++++++++++++
> net/xdp/xsk_queue.h | 2 +-
> tools/include/uapi/linux/if_xdp.h | 50 ++++++++++++++++++---
> 11 files changed, 268 insertions(+), 8 deletions(-)
>
> diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
> index e41015310a6e..bf9c1cc32614 100644
> --- a/Documentation/netlink/specs/netdev.yaml
> +++ b/Documentation/netlink/specs/netdev.yaml
> @@ -42,6 +42,19 @@ name: netdev
> doc:
> This feature informs if netdev implements non-linear XDP buffer
> support in ndo_xdp_xmit callback.
> + -
> + type: flags
> + name: xsk-flags
> + render-max: true
> + entries:
> + -
> + name: tx-timestamp
> + doc:
> + HW timestamping egress packets is supported by the driver.
> + -
> + name: tx-checksum
> + doc:
> + L3 checksum HW offload is supported by the driver.
>
> attribute-sets:
> -
> @@ -68,6 +81,12 @@ name: netdev
> type: u32
> checks:
> min: 1
> + -
> + name: xsk-features
> + doc: Bitmask of enabled AF_XDP features.
> + type: u64
> + enum: xsk-flags
> + enum-as-flags: true
>
> operations:
> list:
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 11652e464f5d..8b40c80557aa 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1660,6 +1660,31 @@ struct xdp_metadata_ops {
> enum xdp_rss_hash_type *rss_type);
> };
>
> +/*
> + * This structure defines the AF_XDP TX metadata hooks for network devices.
> + * The following hooks can be defined; unless noted otherwise, they are
> + * optional and can be filled with a null pointer.
> + *
> + * int (*tmo_request_timestamp)(void *priv)
> + * This function is called when AF_XDP frame requested egress timestamp.
> + *
> + * int (*tmo_fill_timestamp)(void *priv)
> + * This function is called when AF_XDP frame, that had requested
> + * egress timestamp, received a completion. The hook needs to return
> + * the actual HW timestamp.
> + *
> + * int (*tmo_request_timestamp)(u16 csum_start, u16 csum_offset, void *priv)
> + * This function is called when AF_XDP frame requested HW checksum
> + * offload. csum_start indicates position where checksumming should start.
> + * csum_offset indicates position where checksum should be stored.
> + *
> + */
> +struct xsk_tx_metadata_ops {
> + void (*tmo_request_timestamp)(void *priv);
> + u64 (*tmo_fill_timestamp)(void *priv);
> + void (*tmo_request_checksum)(u16 csum_start, u16 csum_offset, void *priv);
> +};
> +
> /**
> * enum netdev_priv_flags - &struct net_device priv_flags
> *
> @@ -1844,6 +1869,7 @@ enum netdev_ml_priv_type {
> * @netdev_ops: Includes several pointers to callbacks,
> * if one wants to override the ndo_*() functions
> * @xdp_metadata_ops: Includes pointers to XDP metadata callbacks.
> + * @xsk_tx_metadata_ops: Includes pointers to AF_XDP TX metadata callbacks.
> * @ethtool_ops: Management operations
> * @l3mdev_ops: Layer 3 master device operations
> * @ndisc_ops: Includes callbacks for different IPv6 neighbour
> @@ -2100,6 +2126,7 @@ struct net_device {
> unsigned long long priv_flags;
> const struct net_device_ops *netdev_ops;
> const struct xdp_metadata_ops *xdp_metadata_ops;
> + const struct xsk_tx_metadata_ops *xsk_tx_metadata_ops;
> int ifindex;
> unsigned short gflags;
> unsigned short hard_header_len;
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index faaba050f843..5febc1a5131e 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -581,7 +581,10 @@ struct skb_shared_info {
> /* Warning: this field is not always filled in (UFO)! */
> unsigned short gso_segs;
> struct sk_buff *frag_list;
> - struct skb_shared_hwtstamps hwtstamps;
> + union {
> + struct skb_shared_hwtstamps hwtstamps;
> + struct xsk_tx_metadata *xsk_meta;
> + };
> unsigned int gso_type;
> u32 tskey;
>
> diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> index 467b9fb56827..288fa58c4665 100644
> --- a/include/net/xdp_sock.h
> +++ b/include/net/xdp_sock.h
> @@ -90,6 +90,54 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
> int __xsk_map_redirect(struct xdp_sock *xs, struct xdp_buff *xdp);
> void __xsk_map_flush(void);
>
> +/**
> + * xsk_tx_metadata_request - Evaluate AF_XDP TX metadata at submission
> + * and call appropriate xsk_tx_metadata_ops operation.
> + * @meta: pointer to AF_XDP metadata area
> + * @ops: pointer to struct xsk_tx_metadata_ops
> + * @priv: pointer to driver-private aread
> + *
> + * This function should be called by the networking device when
> + * it prepares AF_XDP egress packet.
> + */
> +static inline void xsk_tx_metadata_request(const struct xsk_tx_metadata *meta,
> + const struct xsk_tx_metadata_ops *ops,
> + void *priv)
> +{
> + if (!meta)
> + return;
> +
> + if (ops->tmo_request_timestamp)
> + if (meta->flags & XDP_TX_METADATA_TIMESTAMP)
> + ops->tmo_request_timestamp(priv);
> +
> + if (ops->tmo_request_checksum)
> + if (meta->flags & XDP_TX_METADATA_CHECKSUM)
> + ops->tmo_request_checksum(meta->csum_start, meta->csum_offset, priv);
> +}
> +
> +/**
> + * xsk_tx_metadata_complete - Evaluate AF_XDP TX metadata at completion
> + * and call appropriate xsk_tx_metadata_ops operation.
> + * @meta: pointer to AF_XDP metadata area
> + * @ops: pointer to struct xsk_tx_metadata_ops
> + * @priv: pointer to driver-private aread
> + *
> + * This function should be called by the networking device upon
> + * AF_XDP egress completion.
> + */
> +static inline void xsk_tx_metadata_complete(struct xsk_tx_metadata *meta,
> + const struct xsk_tx_metadata_ops *ops,
> + void *priv)
> +{
> + if (!meta)
> + return;
> +
> + if (ops->tmo_fill_timestamp)
> + if (meta->flags & XDP_TX_METADATA_TIMESTAMP)
> + meta->tx_timestamp = ops->tmo_fill_timestamp(priv);
> +}
> +
> #else
>
> static inline int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
> @@ -106,6 +154,18 @@ static inline void __xsk_map_flush(void)
> {
> }
>
> +static inline void xsk_tx_metadata_request(struct xsk_tx_metadata *meta,
> + const struct xsk_tx_metadata_ops *ops,
> + void *priv)
> +{
> +}
> +
> +static inline void xsk_tx_metadata_complete(struct xsk_tx_metadata *meta,
> + const struct xsk_tx_metadata_ops *ops,
> + void *priv)
> +{
> +}
> +
> #endif /* CONFIG_XDP_SOCKETS */
>
> #endif /* _LINUX_XDP_SOCK_H */
> diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
> index 1f6fc8c7a84c..e2558ac3e195 100644
> --- a/include/net/xdp_sock_drv.h
> +++ b/include/net/xdp_sock_drv.h
> @@ -165,6 +165,14 @@ static inline void *xsk_buff_raw_get_data(struct xsk_buff_pool *pool, u64 addr)
> return xp_raw_get_data(pool, addr);
> }
>
> +static inline struct xsk_tx_metadata *xsk_buff_get_metadata(struct xsk_buff_pool *pool, u64 addr)
> +{
> + if (!pool->tx_metadata_len)
> + return NULL;
> +
> + return xp_raw_get_data(pool, addr) - pool->tx_metadata_len;
> +}
> +
> static inline void xsk_buff_dma_sync_for_cpu(struct xdp_buff *xdp, struct xsk_buff_pool *pool)
> {
> struct xdp_buff_xsk *xskb = container_of(xdp, struct xdp_buff_xsk, xdp);
> @@ -324,6 +332,11 @@ static inline void *xsk_buff_raw_get_data(struct xsk_buff_pool *pool, u64 addr)
> return NULL;
> }
>
> +static inline struct xsk_tx_metadata *xsk_buff_get_metadata(struct xsk_buff_pool *pool, u64 addr)
> +{
> + return NULL;
> +}
> +
> static inline void xsk_buff_dma_sync_for_cpu(struct xdp_buff *xdp, struct xsk_buff_pool *pool)
> {
> }
> diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
> index b37b50102e1c..b9b1b2c4108a 100644
> --- a/include/uapi/linux/if_xdp.h
> +++ b/include/uapi/linux/if_xdp.h
> @@ -106,6 +106,38 @@ struct xdp_options {
> #define XSK_UNALIGNED_BUF_ADDR_MASK \
> ((1ULL << XSK_UNALIGNED_BUF_OFFSET_SHIFT) - 1)
>
> +/* Request transmit timestamp. Upon completion, put it into tx_timestamp
> + * field of struct xsk_tx_metadata.
> + */
> +#define XDP_TX_METADATA_TIMESTAMP (1 << 0)
> +
> +/* Request transmit checksum offload. Checksum start position and offset
> + * are communicated via csum_start and csum_offset fields of struct
> + * xsk_tx_metadata.
> + */
> +#define XDP_TX_METADATA_CHECKSUM (1 << 1)
> +
> +/* Force checksum calculation in software. Can be used for testing or
> + * working around potential HW issues. This option causes performance
> + * degradation and only works in XDP_COPY mode.
> + */
> +#define XDP_TX_METADATA_CHECKSUM_SW (1 << 2)
> +
> +struct xsk_tx_metadata {
> + __u32 flags;
> +
> + /* XDP_TX_METADATA_CHECKSUM */
> +
> + /* Offset from desc->addr where checksumming should start. */
> + __u16 csum_start;
> + /* Offset from csum_start where checksum should be stored. */
> + __u16 csum_offset;
> +
> + /* XDP_TX_METADATA_TIMESTAMP */
> +
> + __u64 tx_timestamp;
> +};
> +
> /* Rx/Tx descriptor */
> struct xdp_desc {
> __u64 addr;
> @@ -122,4 +154,7 @@ struct xdp_desc {
> */
> #define XDP_PKT_CONTD (1 << 0)
>
> +/* TX packet carries valid metadata. */
> +#define XDP_TX_METADATA (1 << 1)
> +
> #endif /* _LINUX_IF_XDP_H */
> diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
> index bf71698a1e82..cf1e11c76339 100644
> --- a/include/uapi/linux/netdev.h
> +++ b/include/uapi/linux/netdev.h
> @@ -37,11 +37,26 @@ enum netdev_xdp_act {
> NETDEV_XDP_ACT_MASK = 127,
> };
>
> +/**
> + * enum netdev_xsk_flags
> + * @NETDEV_XSK_FLAGS_TX_TIMESTAMP: HW timestamping egress packets is supported
> + * by the driver.
> + * @NETDEV_XSK_FLAGS_TX_CHECKSUM: L3 checksum HW offload is supported by the
> + * driver.
> + */
> +enum netdev_xsk_flags {
> + NETDEV_XSK_FLAGS_TX_TIMESTAMP = 1,
> + NETDEV_XSK_FLAGS_TX_CHECKSUM = 2,
> +
> + NETDEV_XSK_FLAGS_MASK = 3,
> +};
> +
> enum {
> NETDEV_A_DEV_IFINDEX = 1,
> NETDEV_A_DEV_PAD,
> NETDEV_A_DEV_XDP_FEATURES,
> NETDEV_A_DEV_XDP_ZC_MAX_SEGS,
> + NETDEV_A_DEV_XSK_FEATURES,
>
> __NETDEV_A_DEV_MAX,
> NETDEV_A_DEV_MAX = (__NETDEV_A_DEV_MAX - 1)
> diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
> index 65ef4867fc49..9e8c1f3caf36 100644
> --- a/net/core/netdev-genl.c
> +++ b/net/core/netdev-genl.c
> @@ -12,15 +12,25 @@ static int
> netdev_nl_dev_fill(struct net_device *netdev, struct sk_buff *rsp,
> u32 portid, u32 seq, int flags, u32 cmd)
> {
> + u64 xsk_flags = 0;
> void *hdr;
>
> hdr = genlmsg_put(rsp, portid, seq, &netdev_nl_family, flags, cmd);
> if (!hdr)
> return -EMSGSIZE;
>
> + if (netdev->xsk_tx_metadata_ops) {
> + if (netdev->xsk_tx_metadata_ops->tmo_fill_timestamp)
> + xsk_flags |= NETDEV_XSK_FLAGS_TX_TIMESTAMP;
> + if (netdev->xsk_tx_metadata_ops->tmo_request_checksum)
> + xsk_flags |= NETDEV_XSK_FLAGS_TX_CHECKSUM;
> + }
> +
> if (nla_put_u32(rsp, NETDEV_A_DEV_IFINDEX, netdev->ifindex) ||
> nla_put_u64_64bit(rsp, NETDEV_A_DEV_XDP_FEATURES,
> - netdev->xdp_features, NETDEV_A_DEV_PAD)) {
> + netdev->xdp_features, NETDEV_A_DEV_PAD) ||
> + nla_put_u64_64bit(rsp, NETDEV_A_DEV_XSK_FEATURES,
> + xsk_flags, NETDEV_A_DEV_PAD)) {
> genlmsg_cancel(rsp, hdr);
> return -EINVAL;
> }
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 81106e4d6e0f..9a5c4e63898d 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -542,6 +542,15 @@ static u32 xsk_get_num_desc(struct sk_buff *skb)
>
> static void xsk_destruct_skb(struct sk_buff *skb)
> {
> + struct xsk_tx_metadata *meta = skb_shinfo(skb)->xsk_meta;
> +
> + if (meta) {
> + if (meta->flags & XDP_TX_METADATA_TIMESTAMP) {
> + /* sw completion timestamp, not a real one */
> + meta->tx_timestamp = ktime_get_tai_fast_ns();
> + }
> + }
> +
> xsk_cq_submit_locked(xdp_sk(skb->sk), xsk_get_num_desc(skb));
> sock_wfree(skb);
> }
> @@ -626,6 +635,7 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
> static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> struct xdp_desc *desc)
> {
> + struct xsk_tx_metadata *meta = NULL;
> struct net_device *dev = xs->dev;
> struct sk_buff *skb = xs->skb;
> int err;
> @@ -678,12 +688,40 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
>
> skb_add_rx_frag(skb, nr_frags, page, 0, len, 0);
> }
> +
> + if (desc->options & XDP_TX_METADATA) {
> + if (unlikely(xs->tx_metadata_len == 0)) {
> + err = -EINVAL;
> + goto free_err;
> + }
> +
> + meta = buffer - xs->tx_metadata_len;
> +
> + if (meta->flags & XDP_TX_METADATA_CHECKSUM) {
> + if (unlikely(meta->csum_start + meta->csum_offset +
> + sizeof(__sum16) > len)) {
> + err = -EINVAL;
> + goto free_err;
> + }
> +
> + skb->csum_start = hr + meta->csum_start;
> + skb->csum_offset = meta->csum_offset;
> + skb->ip_summed = CHECKSUM_PARTIAL;
> +
> + if (unlikely(meta->flags & XDP_TX_METADATA_CHECKSUM_SW)) {
> + err = skb_checksum_help(skb);
> + if (err)
> + goto free_err;
> + }
> + }
> + }
> }
>
> skb->dev = dev;
> skb->priority = xs->sk.sk_priority;
> skb->mark = xs->sk.sk_mark;
> skb->destructor = xsk_destruct_skb;
> + skb_shinfo(skb)->xsk_meta = meta;
> xsk_set_destructor_arg(skb);
>
> return skb;
> diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
> index c74a1372bcb9..6f2d1621c992 100644
> --- a/net/xdp/xsk_queue.h
> +++ b/net/xdp/xsk_queue.h
> @@ -137,7 +137,7 @@ static inline bool xskq_cons_read_addr_unchecked(struct xsk_queue *q, u64 *addr)
>
> static inline bool xp_unused_options_set(u32 options)
> {
> - return options & ~XDP_PKT_CONTD;
> + return options & ~(XDP_PKT_CONTD | XDP_TX_METADATA);
> }
>
> static inline bool xp_aligned_validate_desc(struct xsk_buff_pool *pool,
> diff --git a/tools/include/uapi/linux/if_xdp.h b/tools/include/uapi/linux/if_xdp.h
> index 73a47da885dc..b9b1b2c4108a 100644
> --- a/tools/include/uapi/linux/if_xdp.h
> +++ b/tools/include/uapi/linux/if_xdp.h
> @@ -26,11 +26,11 @@
> */
> #define XDP_USE_NEED_WAKEUP (1 << 3)
> /* By setting this option, userspace application indicates that it can
> - * handle multiple descriptors per packet thus enabling xsk core to split
> + * handle multiple descriptors per packet thus enabling AF_XDP to split
> * multi-buffer XDP frames into multiple Rx descriptors. Without this set
> - * such frames will be dropped by xsk.
> + * such frames will be dropped.
> */
> -#define XDP_USE_SG (1 << 4)
> +#define XDP_USE_SG (1 << 4)
>
> /* Flags for xsk_umem_config flags */
> #define XDP_UMEM_UNALIGNED_CHUNK_FLAG (1 << 0)
> @@ -69,6 +69,7 @@ struct xdp_mmap_offsets {
> #define XDP_UMEM_COMPLETION_RING 6
> #define XDP_STATISTICS 7
> #define XDP_OPTIONS 8
> +#define XDP_TX_METADATA_LEN 9
>
> struct xdp_umem_reg {
> __u64 addr; /* Start of packet data area */
> @@ -105,6 +106,38 @@ struct xdp_options {
> #define XSK_UNALIGNED_BUF_ADDR_MASK \
> ((1ULL << XSK_UNALIGNED_BUF_OFFSET_SHIFT) - 1)
>
> +/* Request transmit timestamp. Upon completion, put it into tx_timestamp
> + * field of struct xsk_tx_metadata.
> + */
> +#define XDP_TX_METADATA_TIMESTAMP (1 << 0)
> +
> +/* Request transmit checksum offload. Checksum start position and offset
> + * are communicated via csum_start and csum_offset fields of struct
> + * xsk_tx_metadata.
> + */
> +#define XDP_TX_METADATA_CHECKSUM (1 << 1)
> +
> +/* Force checksum calculation in software. Can be used for testing or
> + * working around potential HW issues. This option causes performance
> + * degradation and only works in XDP_COPY mode.
> + */
> +#define XDP_TX_METADATA_CHECKSUM_SW (1 << 2)
Not sure how useful this is, especially if only for copy mode.
> +struct xsk_tx_metadata {
> + __u32 flags;
> +
> + /* XDP_TX_METADATA_CHECKSUM */
> +
> + /* Offset from desc->addr where checksumming should start. */
> + __u16 csum_start;
> + /* Offset from csum_start where checksum should be stored. */
> + __u16 csum_offset;
> +
> + /* XDP_TX_METADATA_TIMESTAMP */
> +
> + __u64 tx_timestamp;
> +};
Is this structure easily extensible for future offloads,
such as USO?
^ permalink raw reply [flat|nested] 30+ messages in thread
* [xdp-hints] Re: [RFC net-next v4 2/8] xsk: add TX timestamp and TX checksum offload support
2023-07-24 23:59 ` [xdp-hints] [RFC net-next v4 2/8] xsk: add TX timestamp and TX checksum offload support Stanislav Fomichev
2023-07-25 19:28 ` [xdp-hints] " Simon Horman
2023-07-25 20:54 ` Willem de Bruijn
@ 2023-07-25 20:58 ` Willem de Bruijn
2023-07-28 11:56 ` Jesper Dangaard Brouer
2023-07-26 19:25 ` Jesper Dangaard Brouer
3 siblings, 1 reply; 30+ messages in thread
From: Willem de Bruijn @ 2023-07-25 20:58 UTC (permalink / raw)
To: Stanislav Fomichev, bpf
Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
kpsingh, sdf, haoluo, jolsa, kuba, toke, willemb, dsahern,
magnus.karlsson, bjorn, maciej.fijalkowski, hawk, netdev,
xdp-hints
Stanislav Fomichev wrote:
> This change actually defines the (initial) metadata layout
> that should be used by AF_XDP userspace (xsk_tx_metadata).
> The first field is flags which requests appropriate offloads,
> followed by the offload-specific fields. The supported per-device
> offloads are exported via netlink (new xsk-flags).
>
> The offloads themselves are still implemented in a bit of a
> framework-y fashion that's left from my initial kfunc attempt.
> I'm introducing new xsk_tx_metadata_ops which drivers are
> supposed to implement. The drivers are also supposed
> to call xsk_tx_metadata_request/xsk_tx_metadata_complete in
> the right places. Since xsk_tx_metadata_{request,_complete}
> are static inline, we don't incur any extra overhead doing
> indirect calls.
>
> The benefit of this scheme is as follows:
> - keeps all metadata layout parsing away from driver code
> - makes it easy to grep and see which drivers implement what
> - don't need any extra flags to maintain to keep track of that
> offloads are implemented; if the callback is implemented - the offload
> is supported (used by netlink reporting code)
>
> Two offloads are defined right now:
> 1. XDP_TX_METADATA_CHECKSUM: skb-style csum_start+csum_offset
> 2. XDP_TX_METADATA_TIMESTAMP: writes TX timestamp back into metadata
> area upon completion (tx_timestamp field)
>
> The offloads are also implemented for copy mode:
> 1. Extra XDP_TX_METADATA_CHECKSUM_SW to trigger skb_checksum_help; this
> might be useful as a reference implementation and for testing
> 2. XDP_TX_METADATA_TIMESTAMP writes SW timestamp from the skb
> destructor (note I'm reusing hwtstamps to pass metadata pointer)
>
> The struct is forward-compatible and can be extended in the future
> by appending more fields.
>
> Signed-off-by: Stanislav Fomichev <sdf@google.com>
> ---
> Documentation/netlink/specs/netdev.yaml | 19 ++++++++
> include/linux/netdevice.h | 27 +++++++++++
> include/linux/skbuff.h | 5 ++-
> include/net/xdp_sock.h | 60 +++++++++++++++++++++++++
> include/net/xdp_sock_drv.h | 13 ++++++
> include/uapi/linux/if_xdp.h | 35 +++++++++++++++
> include/uapi/linux/netdev.h | 15 +++++++
> net/core/netdev-genl.c | 12 ++++-
> net/xdp/xsk.c | 38 ++++++++++++++++
> net/xdp/xsk_queue.h | 2 +-
> tools/include/uapi/linux/if_xdp.h | 50 ++++++++++++++++++---
> 11 files changed, 268 insertions(+), 8 deletions(-)
>
> diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
> index e41015310a6e..bf9c1cc32614 100644
> --- a/Documentation/netlink/specs/netdev.yaml
> +++ b/Documentation/netlink/specs/netdev.yaml
> @@ -42,6 +42,19 @@ name: netdev
> doc:
> This feature informs if netdev implements non-linear XDP buffer
> support in ndo_xdp_xmit callback.
> + -
> + type: flags
> + name: xsk-flags
> + render-max: true
> + entries:
> + -
> + name: tx-timestamp
> + doc:
> + HW timestamping egress packets is supported by the driver.
> + -
> + name: tx-checksum
> + doc:
> + L3 checksum HW offload is supported by the driver.
>
> attribute-sets:
> -
> @@ -68,6 +81,12 @@ name: netdev
> type: u32
> checks:
> min: 1
> + -
> + name: xsk-features
> + doc: Bitmask of enabled AF_XDP features.
> + type: u64
> + enum: xsk-flags
> + enum-as-flags: true
>
> operations:
> list:
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 11652e464f5d..8b40c80557aa 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1660,6 +1660,31 @@ struct xdp_metadata_ops {
> enum xdp_rss_hash_type *rss_type);
> };
>
> +/*
> + * This structure defines the AF_XDP TX metadata hooks for network devices.
> + * The following hooks can be defined; unless noted otherwise, they are
> + * optional and can be filled with a null pointer.
> + *
> + * int (*tmo_request_timestamp)(void *priv)
> + * This function is called when AF_XDP frame requested egress timestamp.
> + *
> + * int (*tmo_fill_timestamp)(void *priv)
> + * This function is called when AF_XDP frame, that had requested
> + * egress timestamp, received a completion. The hook needs to return
> + * the actual HW timestamp.
> + *
> + * int (*tmo_request_timestamp)(u16 csum_start, u16 csum_offset, void *priv)
typo: tmo_request_checksum
> + * This function is called when AF_XDP frame requested HW checksum
> + * offload. csum_start indicates position where checksumming should start.
> + * csum_offset indicates position where checksum should be stored.
> + *
> + */
> +struct xsk_tx_metadata_ops {
> + void (*tmo_request_timestamp)(void *priv);
> + u64 (*tmo_fill_timestamp)(void *priv);
> + void (*tmo_request_checksum)(u16 csum_start, u16 csum_offset, void *priv);
> +};
> +
> /**
> * enum netdev_priv_flags - &struct net_device priv_flags
> *
> @@ -1844,6 +1869,7 @@ enum netdev_ml_priv_type {
> * @netdev_ops: Includes several pointers to callbacks,
> * if one wants to override the ndo_*() functions
> * @xdp_metadata_ops: Includes pointers to XDP metadata callbacks.
> + * @xsk_tx_metadata_ops: Includes pointers to AF_XDP TX metadata callbacks.
> * @ethtool_ops: Management operations
> * @l3mdev_ops: Layer 3 master device operations
> * @ndisc_ops: Includes callbacks for different IPv6 neighbour
> @@ -2100,6 +2126,7 @@ struct net_device {
> unsigned long long priv_flags;
> const struct net_device_ops *netdev_ops;
> const struct xdp_metadata_ops *xdp_metadata_ops;
> + const struct xsk_tx_metadata_ops *xsk_tx_metadata_ops;
> int ifindex;
> unsigned short gflags;
> unsigned short hard_header_len;
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index faaba050f843..5febc1a5131e 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -581,7 +581,10 @@ struct skb_shared_info {
> /* Warning: this field is not always filled in (UFO)! */
> unsigned short gso_segs;
> struct sk_buff *frag_list;
> - struct skb_shared_hwtstamps hwtstamps;
> + union {
> + struct skb_shared_hwtstamps hwtstamps;
> + struct xsk_tx_metadata *xsk_meta;
> + };
> unsigned int gso_type;
> u32 tskey;
>
> diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> index 467b9fb56827..288fa58c4665 100644
> --- a/include/net/xdp_sock.h
> +++ b/include/net/xdp_sock.h
> @@ -90,6 +90,54 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
> int __xsk_map_redirect(struct xdp_sock *xs, struct xdp_buff *xdp);
> void __xsk_map_flush(void);
>
> +/**
> + * xsk_tx_metadata_request - Evaluate AF_XDP TX metadata at submission
> + * and call appropriate xsk_tx_metadata_ops operation.
> + * @meta: pointer to AF_XDP metadata area
> + * @ops: pointer to struct xsk_tx_metadata_ops
> + * @priv: pointer to driver-private aread
> + *
> + * This function should be called by the networking device when
> + * it prepares AF_XDP egress packet.
> + */
> +static inline void xsk_tx_metadata_request(const struct xsk_tx_metadata *meta,
> + const struct xsk_tx_metadata_ops *ops,
> + void *priv)
> +{
> + if (!meta)
> + return;
> +
> + if (ops->tmo_request_timestamp)
> + if (meta->flags & XDP_TX_METADATA_TIMESTAMP)
> + ops->tmo_request_timestamp(priv);
> +
> + if (ops->tmo_request_checksum)
> + if (meta->flags & XDP_TX_METADATA_CHECKSUM)
> + ops->tmo_request_checksum(meta->csum_start, meta->csum_offset, priv);
> +}
Might be cheaper to test the flag in the hot cacheline before
dereferencing ops?
Also, just add these functions to net_device_ops directly,
rather than dereferencing another pointer to xsk_tx_metadata_ops?
^ permalink raw reply [flat|nested] 30+ messages in thread
* [xdp-hints] Re: [RFC net-next v4 8/8] selftests/bpf: Add TX side to xdp_hw_metadata
2023-07-24 23:59 ` [xdp-hints] [RFC net-next v4 8/8] selftests/bpf: Add TX side to xdp_hw_metadata Stanislav Fomichev
@ 2023-07-25 20:59 ` Willem de Bruijn
2023-07-25 22:36 ` Stanislav Fomichev
0 siblings, 1 reply; 30+ messages in thread
From: Willem de Bruijn @ 2023-07-25 20:59 UTC (permalink / raw)
To: Stanislav Fomichev, bpf
Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
kpsingh, sdf, haoluo, jolsa, kuba, toke, willemb, dsahern,
magnus.karlsson, bjorn, maciej.fijalkowski, hawk, netdev,
xdp-hints
Stanislav Fomichev wrote:
> When we get packets on port 9091, we swap src/dst and send it out.
> At this point, we also request the timestamp and plumb it back
> to the userspace. The userspace simply prints the timestamp.
>
> Also print current UDP checksum, rewrite it with the pseudo-header
> checksum and offload TX checksum calculation to devtx. Upon
> completion, report TX checksum back (mlx5 doesn't put it back, so
> I've used tcpdump to confirm that the checksum is correct).
>
> Some other related changes:
> - switched to zerocopy mode by default; new flag can be used to force
> old behavior
> - request fixed TX_METADATA_LEN headroom
> - some other small fixes (umem size, fill idx+i, etc)
>
> mvbz3:~# ./xdp_hw_metadata eth3 -c mlx5e_devtx_complete_xdp -s mlx5e_devtx_submit_xd
> attach rx bpf program...
> ...
> 0x206d298: rx_desc[0]->addr=80100 addr=80100 comp_addr=80100
> rx_hash: 0x2BFB7FEC with RSS type:0x2A
> rx_timestamp: 1690238278345877848 (sec:1690238278.3459)
> XDP RX-time: 1690238278538397674 (sec:1690238278.5384) delta sec:0.1925 (192519.826 usec)
> AF_XDP time: 1690238278538515250 (sec:1690238278.5385) delta sec:0.0001 (117.576 usec)
> 0x206d298: ping-pong with csum=8e3b (want 57c9) csum_start=54 csum_offset=6
> 0x206d298: complete tx idx=0 addr=10
> 0x206d298: tx_timestamp: 1690238278577008140 (sec:1690238278.5770)
> 0x206d298: complete rx idx=128 addr=80100
>
> mvbz4:~# nc -Nu -q1 ${MVBZ3_LINK_LOCAL_IP}%eth3 9091
>
> mvbz4:~# tcpdump -vvx -i eth3 udp
> tcpdump: listening on eth3, link-type EN10MB (Ethernet), snapshot length 262144 bytes
> 10:12:43.901436 IP6 (flowlabel 0x7a5d2, hlim 127, next-header UDP (17) payload length: 11) fe80::1270:fdff:fe48:1087.44339 > fe80::1270:fdff:fe48:1077.9091: [bad udp cksum 0x3b8e -> 0x0b4b!] UDP, length 3
> 0x0000: 6007 a5d2 000b 117f fe80 0000 0000 0000
> 0x0010: 1270 fdff fe48 1087 fe80 0000 0000 0000
> 0x0020: 1270 fdff fe48 1077 ad33 2383 000b 3b8e
> 0x0030: 7864 70
> 10:12:43.902125 IP6 (flowlabel 0x7a5d2, hlim 127, next-header UDP (17) payload length: 11) fe80::1270:fdff:fe48:1077.9091 > fe80::1270:fdff:fe48:1087.44339: [udp sum ok] UDP, length 3
> 0x0000: 6007 a5d2 000b 117f fe80 0000 0000 0000
> 0x0010: 1270 fdff fe48 1077 fe80 0000 0000 0000
> 0x0020: 1270 fdff fe48 1087 2383 ad33 000b 0b4b
> 0x0030: 7864 70
>
> Signed-off-by: Stanislav Fomichev <sdf@google.com>
> ---
> tools/testing/selftests/bpf/xdp_hw_metadata.c | 201 +++++++++++++++++-
> 1 file changed, 191 insertions(+), 10 deletions(-)
>
> +static void usage(const char *prog)
> +{
> + fprintf(stderr,
> + "usage: %s [OPTS] <ifname>\n"
> + "OPTS:\n"
> + " -T don't generate AF_XDP reply (rx metadata only)\n"
> + " -Z run in copy mode\n",
nit: makes more sense to call copy mode 'C', rather than 'Z'
^ permalink raw reply [flat|nested] 30+ messages in thread
* [xdp-hints] Re: [RFC net-next v4 2/8] xsk: add TX timestamp and TX checksum offload support
2023-07-25 20:30 ` Stanislav Fomichev
@ 2023-07-25 21:28 ` Jakub Kicinski
2023-07-25 22:40 ` Stanislav Fomichev
0 siblings, 1 reply; 30+ messages in thread
From: Jakub Kicinski @ 2023-07-25 21:28 UTC (permalink / raw)
To: Stanislav Fomichev
Cc: Simon Horman, bpf, ast, daniel, andrii, martin.lau, song, yhs,
john.fastabend, kpsingh, haoluo, jolsa, toke, willemb, dsahern,
magnus.karlsson, bjorn, maciej.fijalkowski, hawk, netdev,
xdp-hints
On Tue, 25 Jul 2023 13:30:58 -0700 Stanislav Fomichev wrote:
> > I know that it isn't the practice in this file.
> > but adding the following makes kernel-doc happier
> > about NETDEV_XSK_FLAGS_MASK not being documented.
> >
> > /* private: */
>
> This is autogenerated file :-( But I guess I can try to extend ynl
> scripts to put this comment before the mask. Let me look into that...
Yes, please! I think I even wrote a patch for it at some point...
but then we realized that enums didn't support /* private: */.
Commit e27cb89a22ada4 has added the support, so we can get back
to getting the YNL changes in place.
^ permalink raw reply [flat|nested] 30+ messages in thread
* [xdp-hints] Re: [RFC net-next v4 8/8] selftests/bpf: Add TX side to xdp_hw_metadata
2023-07-25 20:59 ` [xdp-hints] " Willem de Bruijn
@ 2023-07-25 22:36 ` Stanislav Fomichev
2023-07-25 22:55 ` Willem de Bruijn
0 siblings, 1 reply; 30+ messages in thread
From: Stanislav Fomichev @ 2023-07-25 22:36 UTC (permalink / raw)
To: Willem de Bruijn
Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
kpsingh, haoluo, jolsa, kuba, toke, willemb, dsahern,
magnus.karlsson, bjorn, maciej.fijalkowski, hawk, netdev,
xdp-hints
On 07/25, Willem de Bruijn wrote:
> Stanislav Fomichev wrote:
> > When we get packets on port 9091, we swap src/dst and send it out.
> > At this point, we also request the timestamp and plumb it back
> > to the userspace. The userspace simply prints the timestamp.
> >
> > Also print current UDP checksum, rewrite it with the pseudo-header
> > checksum and offload TX checksum calculation to devtx. Upon
> > completion, report TX checksum back (mlx5 doesn't put it back, so
> > I've used tcpdump to confirm that the checksum is correct).
> >
> > Some other related changes:
> > - switched to zerocopy mode by default; new flag can be used to force
> > old behavior
> > - request fixed TX_METADATA_LEN headroom
> > - some other small fixes (umem size, fill idx+i, etc)
> >
> > mvbz3:~# ./xdp_hw_metadata eth3 -c mlx5e_devtx_complete_xdp -s mlx5e_devtx_submit_xd
> > attach rx bpf program...
> > ...
> > 0x206d298: rx_desc[0]->addr=80100 addr=80100 comp_addr=80100
> > rx_hash: 0x2BFB7FEC with RSS type:0x2A
> > rx_timestamp: 1690238278345877848 (sec:1690238278.3459)
> > XDP RX-time: 1690238278538397674 (sec:1690238278.5384) delta sec:0.1925 (192519.826 usec)
> > AF_XDP time: 1690238278538515250 (sec:1690238278.5385) delta sec:0.0001 (117.576 usec)
> > 0x206d298: ping-pong with csum=8e3b (want 57c9) csum_start=54 csum_offset=6
> > 0x206d298: complete tx idx=0 addr=10
> > 0x206d298: tx_timestamp: 1690238278577008140 (sec:1690238278.5770)
> > 0x206d298: complete rx idx=128 addr=80100
> >
> > mvbz4:~# nc -Nu -q1 ${MVBZ3_LINK_LOCAL_IP}%eth3 9091
> >
> > mvbz4:~# tcpdump -vvx -i eth3 udp
> > tcpdump: listening on eth3, link-type EN10MB (Ethernet), snapshot length 262144 bytes
> > 10:12:43.901436 IP6 (flowlabel 0x7a5d2, hlim 127, next-header UDP (17) payload length: 11) fe80::1270:fdff:fe48:1087.44339 > fe80::1270:fdff:fe48:1077.9091: [bad udp cksum 0x3b8e -> 0x0b4b!] UDP, length 3
> > 0x0000: 6007 a5d2 000b 117f fe80 0000 0000 0000
> > 0x0010: 1270 fdff fe48 1087 fe80 0000 0000 0000
> > 0x0020: 1270 fdff fe48 1077 ad33 2383 000b 3b8e
> > 0x0030: 7864 70
> > 10:12:43.902125 IP6 (flowlabel 0x7a5d2, hlim 127, next-header UDP (17) payload length: 11) fe80::1270:fdff:fe48:1077.9091 > fe80::1270:fdff:fe48:1087.44339: [udp sum ok] UDP, length 3
> > 0x0000: 6007 a5d2 000b 117f fe80 0000 0000 0000
> > 0x0010: 1270 fdff fe48 1077 fe80 0000 0000 0000
> > 0x0020: 1270 fdff fe48 1087 2383 ad33 000b 0b4b
> > 0x0030: 7864 70
> >
> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > ---
> > tools/testing/selftests/bpf/xdp_hw_metadata.c | 201 +++++++++++++++++-
> > 1 file changed, 191 insertions(+), 10 deletions(-)
> >
>
> > +static void usage(const char *prog)
> > +{
> > + fprintf(stderr,
> > + "usage: %s [OPTS] <ifname>\n"
> > + "OPTS:\n"
> > + " -T don't generate AF_XDP reply (rx metadata only)\n"
> > + " -Z run in copy mode\n",
>
> nit: makes more sense to call copy mode 'C', rather than 'Z'
Initially I had -c and -s for completion/submission bpf hooks. Now that
these are gone, can actually use -c. Capital letter here actually means
'not'. Z - not zerocopy. T - no tx.
I'll rename to:
-r - rx only
-c - copy mode
LMK if it doesn't make sense..
^ permalink raw reply [flat|nested] 30+ messages in thread
* [xdp-hints] Re: [RFC net-next v4 2/8] xsk: add TX timestamp and TX checksum offload support
2023-07-25 20:54 ` Willem de Bruijn
@ 2023-07-25 22:39 ` Stanislav Fomichev
2023-07-25 23:10 ` Willem de Bruijn
0 siblings, 1 reply; 30+ messages in thread
From: Stanislav Fomichev @ 2023-07-25 22:39 UTC (permalink / raw)
To: Willem de Bruijn
Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
kpsingh, haoluo, jolsa, kuba, toke, willemb, dsahern,
magnus.karlsson, bjorn, maciej.fijalkowski, hawk, netdev,
xdp-hints
On 07/25, Willem de Bruijn wrote:
> Stanislav Fomichev wrote:
> > This change actually defines the (initial) metadata layout
> > that should be used by AF_XDP userspace (xsk_tx_metadata).
> > The first field is flags which requests appropriate offloads,
> > followed by the offload-specific fields. The supported per-device
> > offloads are exported via netlink (new xsk-flags).
> >
> > The offloads themselves are still implemented in a bit of a
> > framework-y fashion that's left from my initial kfunc attempt.
> > I'm introducing new xsk_tx_metadata_ops which drivers are
> > supposed to implement. The drivers are also supposed
> > to call xsk_tx_metadata_request/xsk_tx_metadata_complete in
> > the right places. Since xsk_tx_metadata_{request,_complete}
> > are static inline, we don't incur any extra overhead doing
> > indirect calls.
> >
> > The benefit of this scheme is as follows:
> > - keeps all metadata layout parsing away from driver code
> > - makes it easy to grep and see which drivers implement what
> > - don't need any extra flags to maintain to keep track of that
> > offloads are implemented; if the callback is implemented - the offload
> > is supported (used by netlink reporting code)
> >
> > Two offloads are defined right now:
> > 1. XDP_TX_METADATA_CHECKSUM: skb-style csum_start+csum_offset
> > 2. XDP_TX_METADATA_TIMESTAMP: writes TX timestamp back into metadata
> > area upon completion (tx_timestamp field)
> >
> > The offloads are also implemented for copy mode:
> > 1. Extra XDP_TX_METADATA_CHECKSUM_SW to trigger skb_checksum_help; this
> > might be useful as a reference implementation and for testing
> > 2. XDP_TX_METADATA_TIMESTAMP writes SW timestamp from the skb
> > destructor (note I'm reusing hwtstamps to pass metadata pointer)
> >
> > The struct is forward-compatible and can be extended in the future
> > by appending more fields.
> >
> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > ---
> > Documentation/netlink/specs/netdev.yaml | 19 ++++++++
> > include/linux/netdevice.h | 27 +++++++++++
> > include/linux/skbuff.h | 5 ++-
> > include/net/xdp_sock.h | 60 +++++++++++++++++++++++++
> > include/net/xdp_sock_drv.h | 13 ++++++
> > include/uapi/linux/if_xdp.h | 35 +++++++++++++++
> > include/uapi/linux/netdev.h | 15 +++++++
> > net/core/netdev-genl.c | 12 ++++-
> > net/xdp/xsk.c | 38 ++++++++++++++++
> > net/xdp/xsk_queue.h | 2 +-
> > tools/include/uapi/linux/if_xdp.h | 50 ++++++++++++++++++---
> > 11 files changed, 268 insertions(+), 8 deletions(-)
> >
> > diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
> > index e41015310a6e..bf9c1cc32614 100644
> > --- a/Documentation/netlink/specs/netdev.yaml
> > +++ b/Documentation/netlink/specs/netdev.yaml
> > @@ -42,6 +42,19 @@ name: netdev
> > doc:
> > This feature informs if netdev implements non-linear XDP buffer
> > support in ndo_xdp_xmit callback.
> > + -
> > + type: flags
> > + name: xsk-flags
> > + render-max: true
> > + entries:
> > + -
> > + name: tx-timestamp
> > + doc:
> > + HW timestamping egress packets is supported by the driver.
> > + -
> > + name: tx-checksum
> > + doc:
> > + L3 checksum HW offload is supported by the driver.
> >
> > attribute-sets:
> > -
> > @@ -68,6 +81,12 @@ name: netdev
> > type: u32
> > checks:
> > min: 1
> > + -
> > + name: xsk-features
> > + doc: Bitmask of enabled AF_XDP features.
> > + type: u64
> > + enum: xsk-flags
> > + enum-as-flags: true
> >
> > operations:
> > list:
> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > index 11652e464f5d..8b40c80557aa 100644
> > --- a/include/linux/netdevice.h
> > +++ b/include/linux/netdevice.h
> > @@ -1660,6 +1660,31 @@ struct xdp_metadata_ops {
> > enum xdp_rss_hash_type *rss_type);
> > };
> >
> > +/*
> > + * This structure defines the AF_XDP TX metadata hooks for network devices.
> > + * The following hooks can be defined; unless noted otherwise, they are
> > + * optional and can be filled with a null pointer.
> > + *
> > + * int (*tmo_request_timestamp)(void *priv)
> > + * This function is called when AF_XDP frame requested egress timestamp.
> > + *
> > + * int (*tmo_fill_timestamp)(void *priv)
> > + * This function is called when AF_XDP frame, that had requested
> > + * egress timestamp, received a completion. The hook needs to return
> > + * the actual HW timestamp.
> > + *
> > + * int (*tmo_request_timestamp)(u16 csum_start, u16 csum_offset, void *priv)
> > + * This function is called when AF_XDP frame requested HW checksum
> > + * offload. csum_start indicates position where checksumming should start.
> > + * csum_offset indicates position where checksum should be stored.
> > + *
> > + */
> > +struct xsk_tx_metadata_ops {
> > + void (*tmo_request_timestamp)(void *priv);
> > + u64 (*tmo_fill_timestamp)(void *priv);
> > + void (*tmo_request_checksum)(u16 csum_start, u16 csum_offset, void *priv);
> > +};
> > +
> > /**
> > * enum netdev_priv_flags - &struct net_device priv_flags
> > *
> > @@ -1844,6 +1869,7 @@ enum netdev_ml_priv_type {
> > * @netdev_ops: Includes several pointers to callbacks,
> > * if one wants to override the ndo_*() functions
> > * @xdp_metadata_ops: Includes pointers to XDP metadata callbacks.
> > + * @xsk_tx_metadata_ops: Includes pointers to AF_XDP TX metadata callbacks.
> > * @ethtool_ops: Management operations
> > * @l3mdev_ops: Layer 3 master device operations
> > * @ndisc_ops: Includes callbacks for different IPv6 neighbour
> > @@ -2100,6 +2126,7 @@ struct net_device {
> > unsigned long long priv_flags;
> > const struct net_device_ops *netdev_ops;
> > const struct xdp_metadata_ops *xdp_metadata_ops;
> > + const struct xsk_tx_metadata_ops *xsk_tx_metadata_ops;
> > int ifindex;
> > unsigned short gflags;
> > unsigned short hard_header_len;
> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index faaba050f843..5febc1a5131e 100644
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -581,7 +581,10 @@ struct skb_shared_info {
> > /* Warning: this field is not always filled in (UFO)! */
> > unsigned short gso_segs;
> > struct sk_buff *frag_list;
> > - struct skb_shared_hwtstamps hwtstamps;
> > + union {
> > + struct skb_shared_hwtstamps hwtstamps;
> > + struct xsk_tx_metadata *xsk_meta;
> > + };
> > unsigned int gso_type;
> > u32 tskey;
> >
> > diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> > index 467b9fb56827..288fa58c4665 100644
> > --- a/include/net/xdp_sock.h
> > +++ b/include/net/xdp_sock.h
> > @@ -90,6 +90,54 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
> > int __xsk_map_redirect(struct xdp_sock *xs, struct xdp_buff *xdp);
> > void __xsk_map_flush(void);
> >
> > +/**
> > + * xsk_tx_metadata_request - Evaluate AF_XDP TX metadata at submission
> > + * and call appropriate xsk_tx_metadata_ops operation.
> > + * @meta: pointer to AF_XDP metadata area
> > + * @ops: pointer to struct xsk_tx_metadata_ops
> > + * @priv: pointer to driver-private aread
> > + *
> > + * This function should be called by the networking device when
> > + * it prepares AF_XDP egress packet.
> > + */
> > +static inline void xsk_tx_metadata_request(const struct xsk_tx_metadata *meta,
> > + const struct xsk_tx_metadata_ops *ops,
> > + void *priv)
> > +{
> > + if (!meta)
> > + return;
> > +
> > + if (ops->tmo_request_timestamp)
> > + if (meta->flags & XDP_TX_METADATA_TIMESTAMP)
> > + ops->tmo_request_timestamp(priv);
> > +
> > + if (ops->tmo_request_checksum)
> > + if (meta->flags & XDP_TX_METADATA_CHECKSUM)
> > + ops->tmo_request_checksum(meta->csum_start, meta->csum_offset, priv);
> > +}
> > +
> > +/**
> > + * xsk_tx_metadata_complete - Evaluate AF_XDP TX metadata at completion
> > + * and call appropriate xsk_tx_metadata_ops operation.
> > + * @meta: pointer to AF_XDP metadata area
> > + * @ops: pointer to struct xsk_tx_metadata_ops
> > + * @priv: pointer to driver-private aread
> > + *
> > + * This function should be called by the networking device upon
> > + * AF_XDP egress completion.
> > + */
> > +static inline void xsk_tx_metadata_complete(struct xsk_tx_metadata *meta,
> > + const struct xsk_tx_metadata_ops *ops,
> > + void *priv)
> > +{
> > + if (!meta)
> > + return;
> > +
> > + if (ops->tmo_fill_timestamp)
> > + if (meta->flags & XDP_TX_METADATA_TIMESTAMP)
> > + meta->tx_timestamp = ops->tmo_fill_timestamp(priv);
> > +}
> > +
> > #else
> >
> > static inline int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
> > @@ -106,6 +154,18 @@ static inline void __xsk_map_flush(void)
> > {
> > }
> >
> > +static inline void xsk_tx_metadata_request(struct xsk_tx_metadata *meta,
> > + const struct xsk_tx_metadata_ops *ops,
> > + void *priv)
> > +{
> > +}
> > +
> > +static inline void xsk_tx_metadata_complete(struct xsk_tx_metadata *meta,
> > + const struct xsk_tx_metadata_ops *ops,
> > + void *priv)
> > +{
> > +}
> > +
> > #endif /* CONFIG_XDP_SOCKETS */
> >
> > #endif /* _LINUX_XDP_SOCK_H */
> > diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
> > index 1f6fc8c7a84c..e2558ac3e195 100644
> > --- a/include/net/xdp_sock_drv.h
> > +++ b/include/net/xdp_sock_drv.h
> > @@ -165,6 +165,14 @@ static inline void *xsk_buff_raw_get_data(struct xsk_buff_pool *pool, u64 addr)
> > return xp_raw_get_data(pool, addr);
> > }
> >
> > +static inline struct xsk_tx_metadata *xsk_buff_get_metadata(struct xsk_buff_pool *pool, u64 addr)
> > +{
> > + if (!pool->tx_metadata_len)
> > + return NULL;
> > +
> > + return xp_raw_get_data(pool, addr) - pool->tx_metadata_len;
> > +}
> > +
> > static inline void xsk_buff_dma_sync_for_cpu(struct xdp_buff *xdp, struct xsk_buff_pool *pool)
> > {
> > struct xdp_buff_xsk *xskb = container_of(xdp, struct xdp_buff_xsk, xdp);
> > @@ -324,6 +332,11 @@ static inline void *xsk_buff_raw_get_data(struct xsk_buff_pool *pool, u64 addr)
> > return NULL;
> > }
> >
> > +static inline struct xsk_tx_metadata *xsk_buff_get_metadata(struct xsk_buff_pool *pool, u64 addr)
> > +{
> > + return NULL;
> > +}
> > +
> > static inline void xsk_buff_dma_sync_for_cpu(struct xdp_buff *xdp, struct xsk_buff_pool *pool)
> > {
> > }
> > diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
> > index b37b50102e1c..b9b1b2c4108a 100644
> > --- a/include/uapi/linux/if_xdp.h
> > +++ b/include/uapi/linux/if_xdp.h
> > @@ -106,6 +106,38 @@ struct xdp_options {
> > #define XSK_UNALIGNED_BUF_ADDR_MASK \
> > ((1ULL << XSK_UNALIGNED_BUF_OFFSET_SHIFT) - 1)
> >
> > +/* Request transmit timestamp. Upon completion, put it into tx_timestamp
> > + * field of struct xsk_tx_metadata.
> > + */
> > +#define XDP_TX_METADATA_TIMESTAMP (1 << 0)
> > +
> > +/* Request transmit checksum offload. Checksum start position and offset
> > + * are communicated via csum_start and csum_offset fields of struct
> > + * xsk_tx_metadata.
> > + */
> > +#define XDP_TX_METADATA_CHECKSUM (1 << 1)
> > +
> > +/* Force checksum calculation in software. Can be used for testing or
> > + * working around potential HW issues. This option causes performance
> > + * degradation and only works in XDP_COPY mode.
> > + */
> > +#define XDP_TX_METADATA_CHECKSUM_SW (1 << 2)
> > +
> > +struct xsk_tx_metadata {
> > + __u32 flags;
> > +
> > + /* XDP_TX_METADATA_CHECKSUM */
> > +
> > + /* Offset from desc->addr where checksumming should start. */
> > + __u16 csum_start;
> > + /* Offset from csum_start where checksum should be stored. */
> > + __u16 csum_offset;
> > +
> > + /* XDP_TX_METADATA_TIMESTAMP */
> > +
> > + __u64 tx_timestamp;
> > +};
> > +
> > /* Rx/Tx descriptor */
> > struct xdp_desc {
> > __u64 addr;
> > @@ -122,4 +154,7 @@ struct xdp_desc {
> > */
> > #define XDP_PKT_CONTD (1 << 0)
> >
> > +/* TX packet carries valid metadata. */
> > +#define XDP_TX_METADATA (1 << 1)
> > +
> > #endif /* _LINUX_IF_XDP_H */
> > diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
> > index bf71698a1e82..cf1e11c76339 100644
> > --- a/include/uapi/linux/netdev.h
> > +++ b/include/uapi/linux/netdev.h
> > @@ -37,11 +37,26 @@ enum netdev_xdp_act {
> > NETDEV_XDP_ACT_MASK = 127,
> > };
> >
> > +/**
> > + * enum netdev_xsk_flags
> > + * @NETDEV_XSK_FLAGS_TX_TIMESTAMP: HW timestamping egress packets is supported
> > + * by the driver.
> > + * @NETDEV_XSK_FLAGS_TX_CHECKSUM: L3 checksum HW offload is supported by the
> > + * driver.
> > + */
> > +enum netdev_xsk_flags {
> > + NETDEV_XSK_FLAGS_TX_TIMESTAMP = 1,
> > + NETDEV_XSK_FLAGS_TX_CHECKSUM = 2,
> > +
> > + NETDEV_XSK_FLAGS_MASK = 3,
> > +};
> > +
> > enum {
> > NETDEV_A_DEV_IFINDEX = 1,
> > NETDEV_A_DEV_PAD,
> > NETDEV_A_DEV_XDP_FEATURES,
> > NETDEV_A_DEV_XDP_ZC_MAX_SEGS,
> > + NETDEV_A_DEV_XSK_FEATURES,
> >
> > __NETDEV_A_DEV_MAX,
> > NETDEV_A_DEV_MAX = (__NETDEV_A_DEV_MAX - 1)
> > diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
> > index 65ef4867fc49..9e8c1f3caf36 100644
> > --- a/net/core/netdev-genl.c
> > +++ b/net/core/netdev-genl.c
> > @@ -12,15 +12,25 @@ static int
> > netdev_nl_dev_fill(struct net_device *netdev, struct sk_buff *rsp,
> > u32 portid, u32 seq, int flags, u32 cmd)
> > {
> > + u64 xsk_flags = 0;
> > void *hdr;
> >
> > hdr = genlmsg_put(rsp, portid, seq, &netdev_nl_family, flags, cmd);
> > if (!hdr)
> > return -EMSGSIZE;
> >
> > + if (netdev->xsk_tx_metadata_ops) {
> > + if (netdev->xsk_tx_metadata_ops->tmo_fill_timestamp)
> > + xsk_flags |= NETDEV_XSK_FLAGS_TX_TIMESTAMP;
> > + if (netdev->xsk_tx_metadata_ops->tmo_request_checksum)
> > + xsk_flags |= NETDEV_XSK_FLAGS_TX_CHECKSUM;
> > + }
> > +
> > if (nla_put_u32(rsp, NETDEV_A_DEV_IFINDEX, netdev->ifindex) ||
> > nla_put_u64_64bit(rsp, NETDEV_A_DEV_XDP_FEATURES,
> > - netdev->xdp_features, NETDEV_A_DEV_PAD)) {
> > + netdev->xdp_features, NETDEV_A_DEV_PAD) ||
> > + nla_put_u64_64bit(rsp, NETDEV_A_DEV_XSK_FEATURES,
> > + xsk_flags, NETDEV_A_DEV_PAD)) {
> > genlmsg_cancel(rsp, hdr);
> > return -EINVAL;
> > }
> > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > index 81106e4d6e0f..9a5c4e63898d 100644
> > --- a/net/xdp/xsk.c
> > +++ b/net/xdp/xsk.c
> > @@ -542,6 +542,15 @@ static u32 xsk_get_num_desc(struct sk_buff *skb)
> >
> > static void xsk_destruct_skb(struct sk_buff *skb)
> > {
> > + struct xsk_tx_metadata *meta = skb_shinfo(skb)->xsk_meta;
> > +
> > + if (meta) {
> > + if (meta->flags & XDP_TX_METADATA_TIMESTAMP) {
> > + /* sw completion timestamp, not a real one */
> > + meta->tx_timestamp = ktime_get_tai_fast_ns();
> > + }
> > + }
> > +
> > xsk_cq_submit_locked(xdp_sk(skb->sk), xsk_get_num_desc(skb));
> > sock_wfree(skb);
> > }
> > @@ -626,6 +635,7 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
> > static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> > struct xdp_desc *desc)
> > {
> > + struct xsk_tx_metadata *meta = NULL;
> > struct net_device *dev = xs->dev;
> > struct sk_buff *skb = xs->skb;
> > int err;
> > @@ -678,12 +688,40 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
> >
> > skb_add_rx_frag(skb, nr_frags, page, 0, len, 0);
> > }
> > +
> > + if (desc->options & XDP_TX_METADATA) {
> > + if (unlikely(xs->tx_metadata_len == 0)) {
> > + err = -EINVAL;
> > + goto free_err;
> > + }
> > +
> > + meta = buffer - xs->tx_metadata_len;
> > +
> > + if (meta->flags & XDP_TX_METADATA_CHECKSUM) {
> > + if (unlikely(meta->csum_start + meta->csum_offset +
> > + sizeof(__sum16) > len)) {
> > + err = -EINVAL;
> > + goto free_err;
> > + }
> > +
> > + skb->csum_start = hr + meta->csum_start;
> > + skb->csum_offset = meta->csum_offset;
> > + skb->ip_summed = CHECKSUM_PARTIAL;
> > +
> > + if (unlikely(meta->flags & XDP_TX_METADATA_CHECKSUM_SW)) {
> > + err = skb_checksum_help(skb);
> > + if (err)
> > + goto free_err;
> > + }
> > + }
> > + }
> > }
> >
> > skb->dev = dev;
> > skb->priority = xs->sk.sk_priority;
> > skb->mark = xs->sk.sk_mark;
> > skb->destructor = xsk_destruct_skb;
> > + skb_shinfo(skb)->xsk_meta = meta;
> > xsk_set_destructor_arg(skb);
> >
> > return skb;
> > diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
> > index c74a1372bcb9..6f2d1621c992 100644
> > --- a/net/xdp/xsk_queue.h
> > +++ b/net/xdp/xsk_queue.h
> > @@ -137,7 +137,7 @@ static inline bool xskq_cons_read_addr_unchecked(struct xsk_queue *q, u64 *addr)
> >
> > static inline bool xp_unused_options_set(u32 options)
> > {
> > - return options & ~XDP_PKT_CONTD;
> > + return options & ~(XDP_PKT_CONTD | XDP_TX_METADATA);
> > }
> >
> > static inline bool xp_aligned_validate_desc(struct xsk_buff_pool *pool,
> > diff --git a/tools/include/uapi/linux/if_xdp.h b/tools/include/uapi/linux/if_xdp.h
> > index 73a47da885dc..b9b1b2c4108a 100644
> > --- a/tools/include/uapi/linux/if_xdp.h
> > +++ b/tools/include/uapi/linux/if_xdp.h
> > @@ -26,11 +26,11 @@
> > */
> > #define XDP_USE_NEED_WAKEUP (1 << 3)
> > /* By setting this option, userspace application indicates that it can
> > - * handle multiple descriptors per packet thus enabling xsk core to split
> > + * handle multiple descriptors per packet thus enabling AF_XDP to split
> > * multi-buffer XDP frames into multiple Rx descriptors. Without this set
> > - * such frames will be dropped by xsk.
> > + * such frames will be dropped.
> > */
> > -#define XDP_USE_SG (1 << 4)
> > +#define XDP_USE_SG (1 << 4)
> >
> > /* Flags for xsk_umem_config flags */
> > #define XDP_UMEM_UNALIGNED_CHUNK_FLAG (1 << 0)
> > @@ -69,6 +69,7 @@ struct xdp_mmap_offsets {
> > #define XDP_UMEM_COMPLETION_RING 6
> > #define XDP_STATISTICS 7
> > #define XDP_OPTIONS 8
> > +#define XDP_TX_METADATA_LEN 9
> >
> > struct xdp_umem_reg {
> > __u64 addr; /* Start of packet data area */
> > @@ -105,6 +106,38 @@ struct xdp_options {
> > #define XSK_UNALIGNED_BUF_ADDR_MASK \
> > ((1ULL << XSK_UNALIGNED_BUF_OFFSET_SHIFT) - 1)
> >
> > +/* Request transmit timestamp. Upon completion, put it into tx_timestamp
> > + * field of struct xsk_tx_metadata.
> > + */
> > +#define XDP_TX_METADATA_TIMESTAMP (1 << 0)
> > +
> > +/* Request transmit checksum offload. Checksum start position and offset
> > + * are communicated via csum_start and csum_offset fields of struct
> > + * xsk_tx_metadata.
> > + */
> > +#define XDP_TX_METADATA_CHECKSUM (1 << 1)
> > +
> > +/* Force checksum calculation in software. Can be used for testing or
> > + * working around potential HW issues. This option causes performance
> > + * degradation and only works in XDP_COPY mode.
> > + */
> > +#define XDP_TX_METADATA_CHECKSUM_SW (1 << 2)
>
> Not sure how useful this is, especially if only for copy mode.
Seems useful at least as a reference implementation? But I'm happy
to drop. It's used only in the tests for now. I was using it to
verify csum_offset/start field values.
> > +struct xsk_tx_metadata {
> > + __u32 flags;
> > +
> > + /* XDP_TX_METADATA_CHECKSUM */
> > +
> > + /* Offset from desc->addr where checksumming should start. */
> > + __u16 csum_start;
> > + /* Offset from csum_start where checksum should be stored. */
> > + __u16 csum_offset;
> > +
> > + /* XDP_TX_METADATA_TIMESTAMP */
> > +
> > + __u64 tx_timestamp;
> > +};
>
> Is this structure easily extensible for future offloads,
> such as USO?
We can append more field. What do we need for USO? Something akin
to gso_size/gso_segs/gso_type ?
^ permalink raw reply [flat|nested] 30+ messages in thread
* [xdp-hints] Re: [RFC net-next v4 2/8] xsk: add TX timestamp and TX checksum offload support
2023-07-25 21:28 ` Jakub Kicinski
@ 2023-07-25 22:40 ` Stanislav Fomichev
2023-07-26 7:47 ` Simon Horman
0 siblings, 1 reply; 30+ messages in thread
From: Stanislav Fomichev @ 2023-07-25 22:40 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Simon Horman, bpf, ast, daniel, andrii, martin.lau, song, yhs,
john.fastabend, kpsingh, haoluo, jolsa, toke, willemb, dsahern,
magnus.karlsson, bjorn, maciej.fijalkowski, hawk, netdev,
xdp-hints
On 07/25, Jakub Kicinski wrote:
> On Tue, 25 Jul 2023 13:30:58 -0700 Stanislav Fomichev wrote:
> > > I know that it isn't the practice in this file.
> > > but adding the following makes kernel-doc happier
> > > about NETDEV_XSK_FLAGS_MASK not being documented.
> > >
> > > /* private: */
> >
> > This is autogenerated file :-( But I guess I can try to extend ynl
> > scripts to put this comment before the mask. Let me look into that...
>
> Yes, please! I think I even wrote a patch for it at some point...
> but then we realized that enums didn't support /* private: */.
> Commit e27cb89a22ada4 has added the support, so we can get back
> to getting the YNL changes in place.
Let me actually route these separately to you. I'll fix mxdp_zc_max_seg
thing as well..
^ permalink raw reply [flat|nested] 30+ messages in thread
* [xdp-hints] Re: [RFC net-next v4 8/8] selftests/bpf: Add TX side to xdp_hw_metadata
2023-07-25 22:36 ` Stanislav Fomichev
@ 2023-07-25 22:55 ` Willem de Bruijn
0 siblings, 0 replies; 30+ messages in thread
From: Willem de Bruijn @ 2023-07-25 22:55 UTC (permalink / raw)
To: Stanislav Fomichev, Willem de Bruijn
Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
kpsingh, haoluo, jolsa, kuba, toke, willemb, dsahern,
magnus.karlsson, bjorn, maciej.fijalkowski, hawk, netdev,
xdp-hints
Stanislav Fomichev wrote:
> On 07/25, Willem de Bruijn wrote:
> > Stanislav Fomichev wrote:
> > > When we get packets on port 9091, we swap src/dst and send it out.
> > > At this point, we also request the timestamp and plumb it back
> > > to the userspace. The userspace simply prints the timestamp.
> > >
> > > Also print current UDP checksum, rewrite it with the pseudo-header
> > > checksum and offload TX checksum calculation to devtx. Upon
> > > completion, report TX checksum back (mlx5 doesn't put it back, so
> > > I've used tcpdump to confirm that the checksum is correct).
> > >
> > > Some other related changes:
> > > - switched to zerocopy mode by default; new flag can be used to force
> > > old behavior
> > > - request fixed TX_METADATA_LEN headroom
> > > - some other small fixes (umem size, fill idx+i, etc)
> > >
> > > mvbz3:~# ./xdp_hw_metadata eth3 -c mlx5e_devtx_complete_xdp -s mlx5e_devtx_submit_xd
> > > attach rx bpf program...
> > > ...
> > > 0x206d298: rx_desc[0]->addr=80100 addr=80100 comp_addr=80100
> > > rx_hash: 0x2BFB7FEC with RSS type:0x2A
> > > rx_timestamp: 1690238278345877848 (sec:1690238278.3459)
> > > XDP RX-time: 1690238278538397674 (sec:1690238278.5384) delta sec:0.1925 (192519.826 usec)
> > > AF_XDP time: 1690238278538515250 (sec:1690238278.5385) delta sec:0.0001 (117.576 usec)
> > > 0x206d298: ping-pong with csum=8e3b (want 57c9) csum_start=54 csum_offset=6
> > > 0x206d298: complete tx idx=0 addr=10
> > > 0x206d298: tx_timestamp: 1690238278577008140 (sec:1690238278.5770)
> > > 0x206d298: complete rx idx=128 addr=80100
> > >
> > > mvbz4:~# nc -Nu -q1 ${MVBZ3_LINK_LOCAL_IP}%eth3 9091
> > >
> > > mvbz4:~# tcpdump -vvx -i eth3 udp
> > > tcpdump: listening on eth3, link-type EN10MB (Ethernet), snapshot length 262144 bytes
> > > 10:12:43.901436 IP6 (flowlabel 0x7a5d2, hlim 127, next-header UDP (17) payload length: 11) fe80::1270:fdff:fe48:1087.44339 > fe80::1270:fdff:fe48:1077.9091: [bad udp cksum 0x3b8e -> 0x0b4b!] UDP, length 3
> > > 0x0000: 6007 a5d2 000b 117f fe80 0000 0000 0000
> > > 0x0010: 1270 fdff fe48 1087 fe80 0000 0000 0000
> > > 0x0020: 1270 fdff fe48 1077 ad33 2383 000b 3b8e
> > > 0x0030: 7864 70
> > > 10:12:43.902125 IP6 (flowlabel 0x7a5d2, hlim 127, next-header UDP (17) payload length: 11) fe80::1270:fdff:fe48:1077.9091 > fe80::1270:fdff:fe48:1087.44339: [udp sum ok] UDP, length 3
> > > 0x0000: 6007 a5d2 000b 117f fe80 0000 0000 0000
> > > 0x0010: 1270 fdff fe48 1077 fe80 0000 0000 0000
> > > 0x0020: 1270 fdff fe48 1087 2383 ad33 000b 0b4b
> > > 0x0030: 7864 70
> > >
> > > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > > ---
> > > tools/testing/selftests/bpf/xdp_hw_metadata.c | 201 +++++++++++++++++-
> > > 1 file changed, 191 insertions(+), 10 deletions(-)
> > >
> >
> > > +static void usage(const char *prog)
> > > +{
> > > + fprintf(stderr,
> > > + "usage: %s [OPTS] <ifname>\n"
> > > + "OPTS:\n"
> > > + " -T don't generate AF_XDP reply (rx metadata only)\n"
> > > + " -Z run in copy mode\n",
> >
> > nit: makes more sense to call copy mode 'C', rather than 'Z'
>
> Initially I had -c and -s for completion/submission bpf hooks. Now that
> these are gone, can actually use -c. Capital letter here actually means
> 'not'. Z - not zerocopy. T - no tx.
>
> I'll rename to:
> -r - rx only
> -c - copy mode
>
> LMK if it doesn't make sense..
Sounds great, thanks. I did not grasp the capitalization implies negation.
^ permalink raw reply [flat|nested] 30+ messages in thread
* [xdp-hints] Re: [RFC net-next v4 2/8] xsk: add TX timestamp and TX checksum offload support
2023-07-25 22:39 ` Stanislav Fomichev
@ 2023-07-25 23:10 ` Willem de Bruijn
2023-07-25 23:48 ` Stanislav Fomichev
2023-07-27 14:11 ` Jesper Dangaard Brouer
0 siblings, 2 replies; 30+ messages in thread
From: Willem de Bruijn @ 2023-07-25 23:10 UTC (permalink / raw)
To: Stanislav Fomichev, Willem de Bruijn
Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
kpsingh, haoluo, jolsa, kuba, toke, willemb, dsahern,
magnus.karlsson, bjorn, maciej.fijalkowski, hawk, netdev,
xdp-hints
Stanislav Fomichev wrote:
> On 07/25, Willem de Bruijn wrote:
> > Stanislav Fomichev wrote:
> > > This change actually defines the (initial) metadata layout
> > > that should be used by AF_XDP userspace (xsk_tx_metadata).
> > > The first field is flags which requests appropriate offloads,
> > > followed by the offload-specific fields. The supported per-device
> > > offloads are exported via netlink (new xsk-flags).
> > >
> > > The offloads themselves are still implemented in a bit of a
> > > framework-y fashion that's left from my initial kfunc attempt.
> > > I'm introducing new xsk_tx_metadata_ops which drivers are
> > > supposed to implement. The drivers are also supposed
> > > to call xsk_tx_metadata_request/xsk_tx_metadata_complete in
> > > the right places. Since xsk_tx_metadata_{request,_complete}
> > > are static inline, we don't incur any extra overhead doing
> > > indirect calls.
> > >
> > > The benefit of this scheme is as follows:
> > > - keeps all metadata layout parsing away from driver code
> > > - makes it easy to grep and see which drivers implement what
> > > - don't need any extra flags to maintain to keep track of that
> > > offloads are implemented; if the callback is implemented - the offload
> > > is supported (used by netlink reporting code)
> > >
> > > Two offloads are defined right now:
> > > 1. XDP_TX_METADATA_CHECKSUM: skb-style csum_start+csum_offset
> > > 2. XDP_TX_METADATA_TIMESTAMP: writes TX timestamp back into metadata
> > > area upon completion (tx_timestamp field)
> > >
> > > The offloads are also implemented for copy mode:
> > > 1. Extra XDP_TX_METADATA_CHECKSUM_SW to trigger skb_checksum_help; this
> > > might be useful as a reference implementation and for testing
> > > 2. XDP_TX_METADATA_TIMESTAMP writes SW timestamp from the skb
> > > destructor (note I'm reusing hwtstamps to pass metadata pointer)
> > >
> > > The struct is forward-compatible and can be extended in the future
> > > by appending more fields.
> > >
> > > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > > +/* Request transmit checksum offload. Checksum start position and offset
> > > + * are communicated via csum_start and csum_offset fields of struct
> > > + * xsk_tx_metadata.
> > > + */
> > > +#define XDP_TX_METADATA_CHECKSUM (1 << 1)
> > > +
> > > +/* Force checksum calculation in software. Can be used for testing or
> > > + * working around potential HW issues. This option causes performance
> > > + * degradation and only works in XDP_COPY mode.
> > > + */
> > > +#define XDP_TX_METADATA_CHECKSUM_SW (1 << 2)
> >
> > Not sure how useful this is, especially if only for copy mode.
>
> Seems useful at least as a reference implementation? But I'm happy
> to drop. It's used only in the tests for now. I was using it to
> verify csum_offset/start field values.
If testing over veth, does anything even look at the checksum?
> > > +struct xsk_tx_metadata {
> > > + __u32 flags;
> > > +
> > > + /* XDP_TX_METADATA_CHECKSUM */
> > > +
> > > + /* Offset from desc->addr where checksumming should start. */
> > > + __u16 csum_start;
> > > + /* Offset from csum_start where checksum should be stored. */
> > > + __u16 csum_offset;
> > > +
> > > + /* XDP_TX_METADATA_TIMESTAMP */
> > > +
> > > + __u64 tx_timestamp;
> > > +};
> >
> > Is this structure easily extensible for future offloads,
> > such as USO?
>
> We can append more field. What do we need for USO? Something akin
> to gso_size/gso_segs/gso_type ?
Yes, a bit to set the feature (gso_type) and a field to store the
segment size (gso_size).
Pacing offload is the other feature that comes to mind. That could
conceivably use the tx_timestamp field.
^ permalink raw reply [flat|nested] 30+ messages in thread
* [xdp-hints] Re: [RFC net-next v4 2/8] xsk: add TX timestamp and TX checksum offload support
2023-07-25 23:10 ` Willem de Bruijn
@ 2023-07-25 23:48 ` Stanislav Fomichev
2023-07-27 14:11 ` Jesper Dangaard Brouer
1 sibling, 0 replies; 30+ messages in thread
From: Stanislav Fomichev @ 2023-07-25 23:48 UTC (permalink / raw)
To: Willem de Bruijn
Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
kpsingh, haoluo, jolsa, kuba, toke, willemb, dsahern,
magnus.karlsson, bjorn, maciej.fijalkowski, hawk, netdev,
xdp-hints
On 07/25, Willem de Bruijn wrote:
> Stanislav Fomichev wrote:
> > On 07/25, Willem de Bruijn wrote:
> > > Stanislav Fomichev wrote:
> > > > This change actually defines the (initial) metadata layout
> > > > that should be used by AF_XDP userspace (xsk_tx_metadata).
> > > > The first field is flags which requests appropriate offloads,
> > > > followed by the offload-specific fields. The supported per-device
> > > > offloads are exported via netlink (new xsk-flags).
> > > >
> > > > The offloads themselves are still implemented in a bit of a
> > > > framework-y fashion that's left from my initial kfunc attempt.
> > > > I'm introducing new xsk_tx_metadata_ops which drivers are
> > > > supposed to implement. The drivers are also supposed
> > > > to call xsk_tx_metadata_request/xsk_tx_metadata_complete in
> > > > the right places. Since xsk_tx_metadata_{request,_complete}
> > > > are static inline, we don't incur any extra overhead doing
> > > > indirect calls.
> > > >
> > > > The benefit of this scheme is as follows:
> > > > - keeps all metadata layout parsing away from driver code
> > > > - makes it easy to grep and see which drivers implement what
> > > > - don't need any extra flags to maintain to keep track of that
> > > > offloads are implemented; if the callback is implemented - the offload
> > > > is supported (used by netlink reporting code)
> > > >
> > > > Two offloads are defined right now:
> > > > 1. XDP_TX_METADATA_CHECKSUM: skb-style csum_start+csum_offset
> > > > 2. XDP_TX_METADATA_TIMESTAMP: writes TX timestamp back into metadata
> > > > area upon completion (tx_timestamp field)
> > > >
> > > > The offloads are also implemented for copy mode:
> > > > 1. Extra XDP_TX_METADATA_CHECKSUM_SW to trigger skb_checksum_help; this
> > > > might be useful as a reference implementation and for testing
> > > > 2. XDP_TX_METADATA_TIMESTAMP writes SW timestamp from the skb
> > > > destructor (note I'm reusing hwtstamps to pass metadata pointer)
> > > >
> > > > The struct is forward-compatible and can be extended in the future
> > > > by appending more fields.
> > > >
> > > > Signed-off-by: Stanislav Fomichev <sdf@google.com>
>
> > > > +/* Request transmit checksum offload. Checksum start position and offset
> > > > + * are communicated via csum_start and csum_offset fields of struct
> > > > + * xsk_tx_metadata.
> > > > + */
> > > > +#define XDP_TX_METADATA_CHECKSUM (1 << 1)
> > > > +
> > > > +/* Force checksum calculation in software. Can be used for testing or
> > > > + * working around potential HW issues. This option causes performance
> > > > + * degradation and only works in XDP_COPY mode.
> > > > + */
> > > > +#define XDP_TX_METADATA_CHECKSUM_SW (1 << 2)
> > >
> > > Not sure how useful this is, especially if only for copy mode.
> >
> > Seems useful at least as a reference implementation? But I'm happy
> > to drop. It's used only in the tests for now. I was using it to
> > verify csum_offset/start field values.
>
> If testing over veth, does anything even look at the checksum?
My receiver in the xdp_metadata test looks at it and compares to the
fixed (verified) value:
ASSERT_EQ(udph->check, 0x1c72, "csum");
The packet is always the same (and macs are fixed), so we are able
to do that.
> > > > +struct xsk_tx_metadata {
> > > > + __u32 flags;
> > > > +
> > > > + /* XDP_TX_METADATA_CHECKSUM */
> > > > +
> > > > + /* Offset from desc->addr where checksumming should start. */
> > > > + __u16 csum_start;
> > > > + /* Offset from csum_start where checksum should be stored. */
> > > > + __u16 csum_offset;
> > > > +
> > > > + /* XDP_TX_METADATA_TIMESTAMP */
> > > > +
> > > > + __u64 tx_timestamp;
> > > > +};
> > >
> > > Is this structure easily extensible for future offloads,
> > > such as USO?
> >
> > We can append more field. What do we need for USO? Something akin
> > to gso_size/gso_segs/gso_type ?
>
> Yes, a bit to set the feature (gso_type) and a field to store the
> segment size (gso_size).
>
> Pacing offload is the other feature that comes to mind. That could
> conceivably use the tx_timestamp field.
Right, so we can append to this struct and add more XDP_TX_METADATA_$(FLAG)s
to signal various features. Jakub mentioned that it might be handy
to pass l2_offset/l3_offset/l4_offset, but I'm not sure whether
he was talking about xSO offloads or something else.
^ permalink raw reply [flat|nested] 30+ messages in thread
* [xdp-hints] Re: [RFC net-next v4 2/8] xsk: add TX timestamp and TX checksum offload support
2023-07-25 22:40 ` Stanislav Fomichev
@ 2023-07-26 7:47 ` Simon Horman
0 siblings, 0 replies; 30+ messages in thread
From: Simon Horman @ 2023-07-26 7:47 UTC (permalink / raw)
To: Stanislav Fomichev
Cc: Jakub Kicinski, bpf, ast, daniel, andrii, martin.lau, song, yhs,
john.fastabend, kpsingh, haoluo, jolsa, toke, willemb, dsahern,
magnus.karlsson, bjorn, maciej.fijalkowski, hawk, netdev,
xdp-hints
On Tue, Jul 25, 2023 at 03:40:22PM -0700, Stanislav Fomichev wrote:
> On 07/25, Jakub Kicinski wrote:
> > On Tue, 25 Jul 2023 13:30:58 -0700 Stanislav Fomichev wrote:
> > > > I know that it isn't the practice in this file.
> > > > but adding the following makes kernel-doc happier
> > > > about NETDEV_XSK_FLAGS_MASK not being documented.
> > > >
> > > > /* private: */
> > >
> > > This is autogenerated file :-( But I guess I can try to extend ynl
> > > scripts to put this comment before the mask. Let me look into that...
> >
> > Yes, please! I think I even wrote a patch for it at some point...
> > but then we realized that enums didn't support /* private: */.
> > Commit e27cb89a22ada4 has added the support, so we can get back
> > to getting the YNL changes in place.
>
> Let me actually route these separately to you. I'll fix mxdp_zc_max_seg
> thing as well..
Thanks.
I did miss that this is an auto generated file, sorry about that.
But I do think it would be good to resolve this, at some point.
^ permalink raw reply [flat|nested] 30+ messages in thread
* [xdp-hints] Re: [RFC net-next v4 2/8] xsk: add TX timestamp and TX checksum offload support
2023-07-24 23:59 ` [xdp-hints] [RFC net-next v4 2/8] xsk: add TX timestamp and TX checksum offload support Stanislav Fomichev
` (2 preceding siblings ...)
2023-07-25 20:58 ` Willem de Bruijn
@ 2023-07-26 19:25 ` Jesper Dangaard Brouer
2023-07-26 21:25 ` Stanislav Fomichev
3 siblings, 1 reply; 30+ messages in thread
From: Jesper Dangaard Brouer @ 2023-07-26 19:25 UTC (permalink / raw)
To: Stanislav Fomichev, bpf
Cc: brouer, ast, daniel, andrii, martin.lau, song, yhs,
john.fastabend, kpsingh, haoluo, jolsa, kuba, toke, willemb,
dsahern, magnus.karlsson, bjorn, maciej.fijalkowski, hawk,
netdev, xdp-hints
On 25/07/2023 01.59, Stanislav Fomichev wrote:
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 11652e464f5d..8b40c80557aa 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1660,6 +1660,31 @@ struct xdp_metadata_ops {
> enum xdp_rss_hash_type *rss_type);
> };
>
> +/*
> + * This structure defines the AF_XDP TX metadata hooks for network devices.
> + * The following hooks can be defined; unless noted otherwise, they are
> + * optional and can be filled with a null pointer.
> + *
> + * int (*tmo_request_timestamp)(void *priv)
> + * This function is called when AF_XDP frame requested egress timestamp.
> + *
> + * int (*tmo_fill_timestamp)(void *priv)
> + * This function is called when AF_XDP frame, that had requested
> + * egress timestamp, received a completion. The hook needs to return
> + * the actual HW timestamp.
> + *
> + * int (*tmo_request_timestamp)(u16 csum_start, u16 csum_offset, void *priv)
> + * This function is called when AF_XDP frame requested HW checksum
> + * offload. csum_start indicates position where checksumming should start.
> + * csum_offset indicates position where checksum should be stored.
> + *
> + */
> +struct xsk_tx_metadata_ops {
> + void (*tmo_request_timestamp)(void *priv);
> + u64 (*tmo_fill_timestamp)(void *priv);
> + void (*tmo_request_checksum)(u16 csum_start, u16 csum_offset, void *priv);
> +};
> +
> /**
> * enum netdev_priv_flags - &struct net_device priv_flags
> *
> @@ -1844,6 +1869,7 @@ enum netdev_ml_priv_type {
> * @netdev_ops: Includes several pointers to callbacks,
> * if one wants to override the ndo_*() functions
> * @xdp_metadata_ops: Includes pointers to XDP metadata callbacks.
> + * @xsk_tx_metadata_ops: Includes pointers to AF_XDP TX metadata callbacks.
> * @ethtool_ops: Management operations
> * @l3mdev_ops: Layer 3 master device operations
> * @ndisc_ops: Includes callbacks for different IPv6 neighbour
> @@ -2100,6 +2126,7 @@ struct net_device {
> unsigned long long priv_flags;
> const struct net_device_ops *netdev_ops;
> const struct xdp_metadata_ops *xdp_metadata_ops;
> + const struct xsk_tx_metadata_ops *xsk_tx_metadata_ops;
> int ifindex;
> unsigned short gflags;
> unsigned short hard_header_len;
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index faaba050f843..5febc1a5131e 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -581,7 +581,10 @@ struct skb_shared_info {
> /* Warning: this field is not always filled in (UFO)! */
> unsigned short gso_segs;
> struct sk_buff *frag_list;
> - struct skb_shared_hwtstamps hwtstamps;
> + union {
> + struct skb_shared_hwtstamps hwtstamps;
> + struct xsk_tx_metadata *xsk_meta;
> + };
> unsigned int gso_type;
> u32 tskey;
>
> diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> index 467b9fb56827..288fa58c4665 100644
> --- a/include/net/xdp_sock.h
> +++ b/include/net/xdp_sock.h
> @@ -90,6 +90,54 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
> int __xsk_map_redirect(struct xdp_sock *xs, struct xdp_buff *xdp);
> void __xsk_map_flush(void);
>
> +/**
> + * xsk_tx_metadata_request - Evaluate AF_XDP TX metadata at submission
> + * and call appropriate xsk_tx_metadata_ops operation.
> + * @meta: pointer to AF_XDP metadata area
> + * @ops: pointer to struct xsk_tx_metadata_ops
> + * @priv: pointer to driver-private aread
> + *
> + * This function should be called by the networking device when
> + * it prepares AF_XDP egress packet.
> + */
> +static inline void xsk_tx_metadata_request(const struct xsk_tx_metadata *meta,
> + const struct xsk_tx_metadata_ops *ops,
> + void *priv)
(As you mentioned) this gets inlined in drivers for performance.
> +{
> + if (!meta)
> + return;
> +
> + if (ops->tmo_request_timestamp)
> + if (meta->flags & XDP_TX_METADATA_TIMESTAMP)
> + ops->tmo_request_timestamp(priv);
We still have the overhead of function pointer call.
With RETPOLINE this is costly.
Measured on my testlab CPU E5-1650 v4 @ 3.60GHz
Type:function_call_cost: 3 cycles(tsc) 1.010 ns
Type:func_ptr_call_cost: 30 cycles(tsc) 8.341 ns
Given this get inlined in drivers, perhaps we can add some
INDIRECT_CALL_1 macros to avoid these indirect calls?
> +
> + if (ops->tmo_request_checksum)
> + if (meta->flags & XDP_TX_METADATA_CHECKSUM)
> + ops->tmo_request_checksum(meta->csum_start, meta->csum_offset, priv);
> +}
> +
^ permalink raw reply [flat|nested] 30+ messages in thread
* [xdp-hints] Re: [RFC net-next v4 2/8] xsk: add TX timestamp and TX checksum offload support
2023-07-26 19:25 ` Jesper Dangaard Brouer
@ 2023-07-26 21:25 ` Stanislav Fomichev
2023-07-27 13:50 ` Jesper Dangaard Brouer
0 siblings, 1 reply; 30+ messages in thread
From: Stanislav Fomichev @ 2023-07-26 21:25 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: bpf, brouer, ast, daniel, andrii, martin.lau, song, yhs,
john.fastabend, kpsingh, haoluo, jolsa, kuba, toke, willemb,
dsahern, magnus.karlsson, bjorn, maciej.fijalkowski, hawk,
netdev, xdp-hints
On 07/26, Jesper Dangaard Brouer wrote:
>
>
> On 25/07/2023 01.59, Stanislav Fomichev wrote:
> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > index 11652e464f5d..8b40c80557aa 100644
> > --- a/include/linux/netdevice.h
> > +++ b/include/linux/netdevice.h
> > @@ -1660,6 +1660,31 @@ struct xdp_metadata_ops {
> > enum xdp_rss_hash_type *rss_type);
> > };
> > +/*
> > + * This structure defines the AF_XDP TX metadata hooks for network devices.
> > + * The following hooks can be defined; unless noted otherwise, they are
> > + * optional and can be filled with a null pointer.
> > + *
> > + * int (*tmo_request_timestamp)(void *priv)
> > + * This function is called when AF_XDP frame requested egress timestamp.
> > + *
> > + * int (*tmo_fill_timestamp)(void *priv)
> > + * This function is called when AF_XDP frame, that had requested
> > + * egress timestamp, received a completion. The hook needs to return
> > + * the actual HW timestamp.
> > + *
> > + * int (*tmo_request_timestamp)(u16 csum_start, u16 csum_offset, void *priv)
> > + * This function is called when AF_XDP frame requested HW checksum
> > + * offload. csum_start indicates position where checksumming should start.
> > + * csum_offset indicates position where checksum should be stored.
> > + *
> > + */
> > +struct xsk_tx_metadata_ops {
> > + void (*tmo_request_timestamp)(void *priv);
> > + u64 (*tmo_fill_timestamp)(void *priv);
> > + void (*tmo_request_checksum)(u16 csum_start, u16 csum_offset, void *priv);
> > +};
> > +
> > /**
> > * enum netdev_priv_flags - &struct net_device priv_flags
> > *
> > @@ -1844,6 +1869,7 @@ enum netdev_ml_priv_type {
> > * @netdev_ops: Includes several pointers to callbacks,
> > * if one wants to override the ndo_*() functions
> > * @xdp_metadata_ops: Includes pointers to XDP metadata callbacks.
> > + * @xsk_tx_metadata_ops: Includes pointers to AF_XDP TX metadata callbacks.
> > * @ethtool_ops: Management operations
> > * @l3mdev_ops: Layer 3 master device operations
> > * @ndisc_ops: Includes callbacks for different IPv6 neighbour
> > @@ -2100,6 +2126,7 @@ struct net_device {
> > unsigned long long priv_flags;
> > const struct net_device_ops *netdev_ops;
> > const struct xdp_metadata_ops *xdp_metadata_ops;
> > + const struct xsk_tx_metadata_ops *xsk_tx_metadata_ops;
> > int ifindex;
> > unsigned short gflags;
> > unsigned short hard_header_len;
> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index faaba050f843..5febc1a5131e 100644
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -581,7 +581,10 @@ struct skb_shared_info {
> > /* Warning: this field is not always filled in (UFO)! */
> > unsigned short gso_segs;
> > struct sk_buff *frag_list;
> > - struct skb_shared_hwtstamps hwtstamps;
> > + union {
> > + struct skb_shared_hwtstamps hwtstamps;
> > + struct xsk_tx_metadata *xsk_meta;
> > + };
> > unsigned int gso_type;
> > u32 tskey;
> > diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> > index 467b9fb56827..288fa58c4665 100644
> > --- a/include/net/xdp_sock.h
> > +++ b/include/net/xdp_sock.h
> > @@ -90,6 +90,54 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
> > int __xsk_map_redirect(struct xdp_sock *xs, struct xdp_buff *xdp);
> > void __xsk_map_flush(void);
> > +/**
> > + * xsk_tx_metadata_request - Evaluate AF_XDP TX metadata at submission
> > + * and call appropriate xsk_tx_metadata_ops operation.
> > + * @meta: pointer to AF_XDP metadata area
> > + * @ops: pointer to struct xsk_tx_metadata_ops
> > + * @priv: pointer to driver-private aread
> > + *
> > + * This function should be called by the networking device when
> > + * it prepares AF_XDP egress packet.
> > + */
> > +static inline void xsk_tx_metadata_request(const struct xsk_tx_metadata *meta,
> > + const struct xsk_tx_metadata_ops *ops,
> > + void *priv)
>
> (As you mentioned) this gets inlined in drivers for performance.
>
> > +{
> > + if (!meta)
> > + return;
> > +
> > + if (ops->tmo_request_timestamp)
> > + if (meta->flags & XDP_TX_METADATA_TIMESTAMP)
> > + ops->tmo_request_timestamp(priv);
>
> We still have the overhead of function pointer call.
> With RETPOLINE this is costly.
>
> Measured on my testlab CPU E5-1650 v4 @ 3.60GHz
> Type:function_call_cost: 3 cycles(tsc) 1.010 ns
> Type:func_ptr_call_cost: 30 cycles(tsc) 8.341 ns
>
> Given this get inlined in drivers, perhaps we can add some
> INDIRECT_CALL_1 macros to avoid these indirect calls?
I'm assuming that the compiler is smart enough to resolve these const
struct ops definitions as long as they are in the same file.
At least here is what I see for mlx5e_xmit_xdp_frame_mpwqe: those
indirect jumps are resolved and the calls themselves are unrolled.
TBF, I don't have retpolines enabled in the config, but I don't think
it will bring indirect jumps back? Am I missing anything?
0000000000001930 <mlx5e_xmit_xdp_frame_mpwqe>:
; mlx5e_xmit_xdp_frame_mpwqe():
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c:436
; {
1930: f3 0f 1e fa endbr64
1934: e8 00 00 00 00 callq 0x1939 <mlx5e_xmit_xdp_frame_mpwqe+0x9>
0000000000001935: R_X86_64_PLT32 __fentry__-0x4
1939: 55 pushq %rbp
193a: 48 89 e5 movq %rsp, %rbp
193d: 41 57 pushq %r15
193f: 41 56 pushq %r14
1941: 41 54 pushq %r12
1943: 53 pushq %rbx
1944: 48 83 ec 18 subq $0x18, %rsp
1948: 49 89 cf movq %rcx, %r15
194b: 49 89 f6 movq %rsi, %r14
194e: 48 89 fb movq %rdi, %rbx
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c:438
; struct mlx5e_xdpsq_stats *stats = sq->stats;
1951: 4c 8b a7 30 02 00 00 movq 0x230(%rdi), %r12
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c:442
; if (xdptxd->has_frags) {
1958: 8b 46 10 movl 0x10(%rsi), %eax
195b: 85 c0 testl %eax, %eax
195d: 0f 88 ec 00 00 00 js 0x1a4f <mlx5e_xmit_xdp_frame_mpwqe+0x11f>
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c:467
; if (unlikely(p->len > sq->hw_mtu)) {
1963: 25 ff ff ff 7f andl $0x7fffffff, %eax # imm = 0x7FFFFFFF
1968: 3b 83 98 02 00 00 cmpl 0x298(%rbx), %eax
196e: 0f 87 b5 02 00 00 ja 0x1c29 <mlx5e_xmit_xdp_frame_mpwqe+0x2f9>
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c:472
; if (!check_result)
1974: 85 d2 testl %edx, %edx
1976: 0f 84 40 01 00 00 je 0x1abc <mlx5e_xmit_xdp_frame_mpwqe+0x18c>
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c:474
; if (unlikely(check_result < 0))
197c: 0f 88 ac 02 00 00 js 0x1c2e <mlx5e_xmit_xdp_frame_mpwqe+0x2fe>
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c:477
; if (check_result == MLX5E_XDP_CHECK_START_MPWQE) {
1982: 83 fa 02 cmpl $0x2, %edx
1985: 0f 85 d7 01 00 00 jne 0x1b62 <mlx5e_xmit_xdp_frame_mpwqe+0x232>
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c:338
; pi = mlx5_wq_cyc_ctr2ix(wq, sq->pc);
198b: 0f b7 4b 44 movzwl 0x44(%rbx), %ecx
198f: 8b 93 10 02 00 00 movl 0x210(%rbx), %edx
; ./drivers/net/ethernet/mellanox/mlx5/core/wq.h:144
; return ctr & wq->fbc.sz_m1;
1995: 21 d1 andl %edx, %ecx
; ./drivers/net/ethernet/mellanox/mlx5/core/wq.h:164
; return mlx5_frag_buf_get_idx_last_contig_stride(&wq->fbc, ix) - ix + 1;
1997: 0f b7 c1 movzwl %cx, %eax
; ././include/linux/mlx5/driver.h:959
; u32 last_frag_stride_idx = (ix + fbc->strides_offset) | fbc->frag_sz_m1;
199a: 44 0f b7 8b 16 02 00 00 movzwl 0x216(%rbx), %r9d
19a2: 42 8d 3c 08 leal (%rax,%r9), %edi
19a6: 0f b7 b3 14 02 00 00 movzwl 0x214(%rbx), %esi
19ad: 41 89 f8 movl %edi, %r8d
19b0: 41 09 f0 orl %esi, %r8d
; ././include/linux/mlx5/driver.h:961
; return min_t(u32, last_frag_stride_idx - fbc->strides_offset, fbc->sz_m1);
19b3: 45 29 c8 subl %r9d, %r8d
19b6: 41 39 d0 cmpl %edx, %r8d
19b9: 44 0f 43 c2 cmovael %edx, %r8d
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c:369
; pi = mlx5e_xdpsq_get_next_pi(sq, sq->max_sq_mpw_wqebbs);
19bd: 0f b6 93 8e 02 00 00 movzbl 0x28e(%rbx), %edx
; ./drivers/net/ethernet/mellanox/mlx5/core/wq.h:164
; return mlx5_frag_buf_get_idx_last_contig_stride(&wq->fbc, ix) - ix + 1;
19c4: 41 29 c8 subl %ecx, %r8d
19c7: 41 ff c0 incl %r8d
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c:340
; if (unlikely(contig_wqebbs < size)) {
19ca: 66 41 39 d0 cmpw %dx, %r8w
19ce: 0f 82 5e 02 00 00 jb 0x1c32 <mlx5e_xmit_xdp_frame_mpwqe+0x302>
; ././include/linux/mlx5/driver.h:951
; frag = ix >> fbc->log_frag_strides;
19d4: 0f b6 8b 1a 02 00 00 movzbl 0x21a(%rbx), %ecx
; ././include/linux/mlx5/driver.h:953
; return fbc->frags[frag].buf + ((fbc->frag_sz_m1 & ix) << fbc->log_stride);
19db: 89 f8 movl %edi, %eax
; ././include/linux/mlx5/driver.h:951
; frag = ix >> fbc->log_frag_strides;
19dd: 48 d3 e8 shrq %cl, %rax
; ././include/linux/mlx5/driver.h:953
; return fbc->frags[frag].buf + ((fbc->frag_sz_m1 & ix) << fbc->log_stride);
19e0: 48 8b 8b 08 02 00 00 movq 0x208(%rbx), %rcx
19e7: 48 c1 e0 04 shlq $0x4, %rax
19eb: 48 8b 14 01 movq (%rcx,%rax), %rdx
19ef: 21 fe andl %edi, %esi
19f1: 0f b6 8b 19 02 00 00 movzbl 0x219(%rbx), %ecx
19f8: d3 e6 shll %cl, %esi
19fa: 48 8d 04 32 leaq (%rdx,%rsi), %rax
; ./drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h:114
; memset(wqe, 0, wqe_size);
19fe: 48 c7 44 32 18 00 00 00 00 movq $0x0, 0x18(%rdx,%rsi)
1a07: 48 c7 44 32 10 00 00 00 00 movq $0x0, 0x10(%rdx,%rsi)
1a10: 48 c7 44 32 08 00 00 00 00 movq $0x0, 0x8(%rdx,%rsi)
1a19: 48 c7 04 32 00 00 00 00 movq $0x0, (%rdx,%rsi)
; ././arch/x86/include/asm/processor.h:618
; alternative_input(BASE_PREFETCH, "prefetchw %P1",
1a21: 0f 18 4c 32 20 prefetcht0 0x20(%rdx,%rsi)
1a26: 0f 18 4c 32 60 prefetcht0 0x60(%rdx,%rsi)
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c:378
; .inline_on = mlx5e_xdp_get_inline_state(sq, session->inline_on),
1a2b: 0f b6 4b 5e movzbl 0x5e(%rbx), %ecx
1a2f: 8b 53 40 movl 0x40(%rbx), %edx
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h:168
; u16 outstanding = sq->xdpi_fifo_pc - sq->xdpi_fifo_cc;
1a32: 2b 13 subl (%rbx), %edx
1a34: 0f b7 d2 movzwl %dx, %edx
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c:378
; .inline_on = mlx5e_xdp_get_inline_state(sq, session->inline_on),
1a37: 84 c9 testb %cl, %cl
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h:173
; if (cur && outstanding <= MLX5E_XDP_INLINE_WATERMARK_LOW)
1a39: 0f 84 e8 00 00 00 je 0x1b27 <mlx5e_xmit_xdp_frame_mpwqe+0x1f7>
1a3f: 83 fa 0b cmpl $0xb, %edx
1a42: 0f 83 df 00 00 00 jae 0x1b27 <mlx5e_xmit_xdp_frame_mpwqe+0x1f7>
1a48: 31 c9 xorl %ecx, %ecx
1a4a: e9 e7 00 00 00 jmp 0x1b36 <mlx5e_xmit_xdp_frame_mpwqe+0x206>
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c:446
; if (!!xdptxd->len + xdptxdf->sinfo->nr_frags > 1) {
1a4f: 89 c1 movl %eax, %ecx
1a51: 81 e1 ff ff ff 7f andl $0x7fffffff, %ecx # imm = 0x7FFFFFFF
1a57: 49 8b 76 18 movq 0x18(%r14), %rsi
1a5b: 0f b6 7e 02 movzbl 0x2(%rsi), %edi
1a5f: 83 f9 01 cmpl $0x1, %ecx
1a62: 83 df ff sbbl $-0x1, %edi
1a65: 83 ff 02 cmpl $0x2, %edi
1a68: 0f 83 99 00 00 00 jae 0x1b07 <mlx5e_xmit_xdp_frame_mpwqe+0x1d7>
1a6e: a9 ff ff ff 7f testl $0x7fffffff, %eax # imm = 0x7FFFFFFF
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c:455
; if (!xdptxd->len) {
1a73: 0f 85 ea fe ff ff jne 0x1963 <mlx5e_xmit_xdp_frame_mpwqe+0x33>
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c:458
; tmp.data = skb_frag_address(frag);
1a79: 48 8b 7e 30 movq 0x30(%rsi), %rdi
1a7d: 8b 4e 3c movl 0x3c(%rsi), %ecx
; ././include/linux/mm.h:2120
; return page_to_virt(page);
1a80: 48 89 f8 movq %rdi, %rax
1a83: 48 2b 05 00 00 00 00 subq (%rip), %rax # 0x1a8a <mlx5e_xmit_xdp_frame_mpwqe+0x15a>
0000000000001a86: R_X86_64_PC32 vmemmap_base-0x4
1a8a: 48 c1 e0 06 shlq $0x6, %rax
1a8e: 48 03 05 00 00 00 00 addq (%rip), %rax # 0x1a95 <mlx5e_xmit_xdp_frame_mpwqe+0x165>
0000000000001a91: R_X86_64_PC32 page_offset_base-0x4
; ././include/linux/skbuff.h:3478
; return page_address(skb_frag_page(frag)) + skb_frag_off(frag);
1a95: 48 01 c8 addq %rcx, %rax
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c:458
; tmp.data = skb_frag_address(frag);
1a98: 48 89 45 d0 movq %rax, -0x30(%rbp)
1a9c: b8 ff ff ff 7f movl $0x7fffffff, %eax # imm = 0x7FFFFFFF
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c:459
; tmp.len = skb_frag_size(frag);
1aa1: 23 46 38 andl 0x38(%rsi), %eax
1aa4: 89 45 d8 movl %eax, -0x28(%rbp)
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c:460
; tmp.dma_addr = xdptxdf->dma_arr ? xdptxdf->dma_arr[0] :
1aa7: 49 8b 76 20 movq 0x20(%r14), %rsi
1aab: 48 85 f6 testq %rsi, %rsi
1aae: 0f 84 64 01 00 00 je 0x1c18 <mlx5e_xmit_xdp_frame_mpwqe+0x2e8>
1ab4: 48 8b 0e movq (%rsi), %rcx
1ab7: e9 60 01 00 00 jmp 0x1c1c <mlx5e_xmit_xdp_frame_mpwqe+0x2ec>
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c:414
; if (unlikely(!sq->mpwqe.wqe)) {
1abc: 48 83 7b 50 00 cmpq $0x0, 0x50(%rbx)
1ac1: 0f 85 9b 00 00 00 jne 0x1b62 <mlx5e_xmit_xdp_frame_mpwqe+0x232>
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c:415
; if (unlikely(!mlx5e_wqc_has_room_for(&sq->wq, sq->cc, sq->pc,
1ac7: 0f b7 43 04 movzwl 0x4(%rbx), %eax
1acb: 0f b7 4b 44 movzwl 0x44(%rbx), %ecx
1acf: 8b 93 10 02 00 00 movl 0x210(%rbx), %edx
; ./drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h:106
; return (mlx5_wq_cyc_ctr2ix(wq, cc - pc) >= n) || (cc == pc);
1ad5: 66 29 c8 subw %cx, %ax
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c:415
; if (unlikely(!mlx5e_wqc_has_room_for(&sq->wq, sq->cc, sq->pc,
1ad8: 0f 84 b7 fe ff ff je 0x1995 <mlx5e_xmit_xdp_frame_mpwqe+0x65>
1ade: 21 d0 andl %edx, %eax
1ae0: 66 3b 83 8c 02 00 00 cmpw 0x28c(%rbx), %ax
1ae7: 0f 83 a8 fe ff ff jae 0x1995 <mlx5e_xmit_xdp_frame_mpwqe+0x65>
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c:418
; mlx5e_xmit_xdp_doorbell(sq);
1aed: 48 89 df movq %rbx, %rdi
1af0: e8 0b fc ff ff callq 0x1700 <mlx5e_xmit_xdp_doorbell>
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c:419
; sq->stats->full++;
1af5: 48 8b 83 30 02 00 00 movq 0x230(%rbx), %rax
1afc: 48 ff 40 20 incq 0x20(%rax)
1b00: 31 c0 xorl %eax, %eax
1b02: e9 f2 00 00 00 jmp 0x1bf9 <mlx5e_xmit_xdp_frame_mpwqe+0x2c9>
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c:451
; if (unlikely(sq->mpwqe.wqe))
1b07: 48 83 7b 50 00 cmpq $0x0, 0x50(%rbx)
1b0c: 0f 85 03 02 00 00 jne 0x1d15 <mlx5e_xmit_xdp_frame_mpwqe+0x3e5>
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c:453
; return mlx5e_xmit_xdp_frame(sq, xdptxd, 0, meta);
1b12: 48 89 df movq %rbx, %rdi
1b15: 4c 89 f6 movq %r14, %rsi
1b18: 31 d2 xorl %edx, %edx
1b1a: 4c 89 f9 movq %r15, %rcx
1b1d: e8 0e 02 00 00 callq 0x1d30 <mlx5e_xmit_xdp_frame>
1b22: e9 d2 00 00 00 jmp 0x1bf9 <mlx5e_xmit_xdp_frame_mpwqe+0x2c9>
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c:378
; .inline_on = mlx5e_xdp_get_inline_state(sq, session->inline_on),
1b27: 84 c9 testb %cl, %cl
1b29: 40 0f 95 c6 setne %sil
1b2d: 83 fa 7f cmpl $0x7f, %edx
1b30: 0f 97 c1 seta %cl
1b33: 40 08 f1 orb %sil, %cl
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c:373
; *session = (struct mlx5e_tx_mpwqe) {
1b36: 48 89 43 50 movq %rax, 0x50(%rbx)
1b3a: c7 43 58 00 00 00 00 movl $0x0, 0x58(%rbx)
1b41: 66 c7 43 5c 02 00 movw $0x2, 0x5c(%rbx)
1b47: 88 4b 5e movb %cl, 0x5e(%rbx)
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c:381
; stats->mpwqe++;
1b4a: 49 ff 44 24 08 incq 0x8(%r12)
; ././include/net/xdp_sock.h:107
; if (!meta)
1b4f: 4d 85 ff testq %r15, %r15
1b52: 74 0e je 0x1b62 <mlx5e_xmit_xdp_frame_mpwqe+0x232>
; ././include/net/xdp_sock.h:115
; if (meta->flags & XDP_TX_METADATA_CHECKSUM)
1b54: 41 f6 07 02 testb $0x2, (%r15)
1b58: 74 08 je 0x1b62 <mlx5e_xmit_xdp_frame_mpwqe+0x232>
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c:483
; xsk_tx_metadata_request(meta, &mlx5e_xsk_tx_metadata_ops, &session->wqe->eth);
1b5a: 48 8b 43 50 movq 0x50(%rbx), %rax
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c:286
; eseg->cs_flags |= MLX5_ETH_WQE_L3_CSUM | MLX5_ETH_WQE_L4_CSUM;
1b5e: 80 48 14 c0 orb $-0x40, 0x14(%rax)
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h:203
; (struct mlx5_wqe_data_seg *)session->wqe + session->ds_count;
1b62: 48 8b 43 50 movq 0x50(%rbx), %rax
1b66: 0f b6 4b 5c movzbl 0x5c(%rbx), %ecx
1b6a: 41 8b 76 10 movl 0x10(%r14), %esi
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h:204
; u32 dma_len = xdptxd->len;
1b6e: 89 f2 movl %esi, %edx
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h:206
; session->pkt_count++;
1b70: fe 43 5d incb 0x5d(%rbx)
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h:204
; u32 dma_len = xdptxd->len;
1b73: 81 e2 ff ff ff 7f andl $0x7fffffff, %edx # imm = 0x7FFFFFFF
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h:207
; session->bytes_count += dma_len;
1b79: 01 53 58 addl %edx, 0x58(%rbx)
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h:203
; (struct mlx5_wqe_data_seg *)session->wqe + session->ds_count;
1b7c: 48 c1 e1 04 shlq $0x4, %rcx
1b80: 48 8d 3c 08 leaq (%rax,%rcx), %rdi
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h:209
; if (session->inline_on && dma_len <= MLX5E_XDP_INLINE_WQE_SZ_THRSD) {
1b84: 80 7b 5e 00 cmpb $0x0, 0x5e(%rbx)
1b88: 74 32 je 0x1bbc <mlx5e_xmit_xdp_frame_mpwqe+0x28c>
1b8a: 81 fa fc 00 00 00 cmpl $0xfc, %edx
1b90: 77 2a ja 0x1bbc <mlx5e_xmit_xdp_frame_mpwqe+0x28c>
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h:213
; u16 ds_cnt = DIV_ROUND_UP(ds_len, MLX5_SEND_WQE_DS);
1b92: 44 8d 7e 13 leal 0x13(%rsi), %r15d
1b96: 41 c1 ef 04 shrl $0x4, %r15d
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h:215
; inline_dseg->byte_count = cpu_to_be32(dma_len | MLX5_INLINE_SEG);
1b9a: 81 ce 00 00 00 80 orl $0x80000000, %esi # imm = 0x80000000
1ba0: 0f ce bswapl %esi
1ba2: 89 37 movl %esi, (%rdi)
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h:216
; memcpy(inline_dseg->data, xdptxd->data, dma_len);
1ba4: 48 83 c7 04 addq $0x4, %rdi
1ba8: 49 8b 76 08 movq 0x8(%r14), %rsi
1bac: e8 00 00 00 00 callq 0x1bb1 <mlx5e_xmit_xdp_frame_mpwqe+0x281>
0000000000001bad: R_X86_64_PLT32 memcpy-0x4
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h:218
; session->ds_count += ds_cnt;
1bb1: 44 00 7b 5c addb %r15b, 0x5c(%rbx)
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h:219
; stats->inlnw++;
1bb5: 49 ff 44 24 10 incq 0x10(%r12)
1bba: eb 1c jmp 0x1bd8 <mlx5e_xmit_xdp_frame_mpwqe+0x2a8>
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h:223
; dseg->addr = cpu_to_be64(xdptxd->dma_addr);
1bbc: 49 8b 36 movq (%r14), %rsi
1bbf: 48 0f ce bswapq %rsi
1bc2: 48 89 74 08 08 movq %rsi, 0x8(%rax,%rcx)
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h:224
; dseg->byte_count = cpu_to_be32(dma_len);
1bc7: 0f ca bswapl %edx
1bc9: 89 17 movl %edx, (%rdi)
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h:225
; dseg->lkey = sq->mkey_be;
1bcb: 8b 93 88 02 00 00 movl 0x288(%rbx), %edx
1bd1: 89 54 08 04 movl %edx, 0x4(%rax,%rcx)
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h:226
; session->ds_count++;
1bd5: fe 43 5c incb 0x5c(%rbx)
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c:488
; if (unlikely(mlx5e_xdp_mpwqe_is_full(session, sq->max_sq_mpw_wqebbs)))
1bd8: 0f b6 83 8e 02 00 00 movzbl 0x28e(%rbx), %eax
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h:184
; if (session->inline_on)
1bdf: 80 7b 5e 00 cmpb $0x0, 0x5e(%rbx)
1be3: 0f b6 4b 5c movzbl 0x5c(%rbx), %ecx
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h:184
; if (session->inline_on)
1be7: 74 1e je 0x1c07 <mlx5e_xmit_xdp_frame_mpwqe+0x2d7>
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h:185
; return session->ds_count + MLX5E_XDP_INLINE_WQE_MAX_DS_CNT >
1be9: 83 c1 10 addl $0x10, %ecx
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h:186
; max_sq_mpw_wqebbs * MLX5_SEND_WQEBB_NUM_DS;
1bec: c1 e0 02 shll $0x2, %eax
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h:185
; return session->ds_count + MLX5E_XDP_INLINE_WQE_MAX_DS_CNT >
1bef: 39 c1 cmpl %eax, %ecx
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c:488
; if (unlikely(mlx5e_xdp_mpwqe_is_full(session, sq->max_sq_mpw_wqebbs)))
1bf1: 77 1b ja 0x1c0e <mlx5e_xmit_xdp_frame_mpwqe+0x2de>
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c:491
; stats->xmit++;
1bf3: 49 ff 04 24 incq (%r12)
1bf7: b0 01 movb $0x1, %al
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c:493
; }
1bf9: 48 83 c4 18 addq $0x18, %rsp
1bfd: 5b popq %rbx
1bfe: 41 5c popq %r12
1c00: 41 5e popq %r14
1c02: 41 5f popq %r15
1c04: 5d popq %rbp
1c05: c3 retq
1c06: cc int3
; ./drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h:342
; return session->ds_count == max_sq_mpw_wqebbs * MLX5_SEND_WQEBB_NUM_DS;
1c07: c1 e0 02 shll $0x2, %eax
1c0a: 39 c8 cmpl %ecx, %eax
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c:488
; if (unlikely(mlx5e_xdp_mpwqe_is_full(session, sq->max_sq_mpw_wqebbs)))
1c0c: 75 e5 jne 0x1bf3 <mlx5e_xmit_xdp_frame_mpwqe+0x2c3>
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c:489
; mlx5e_xdp_mpwqe_complete(sq);
1c0e: 48 89 df movq %rbx, %rdi
1c11: e8 00 00 00 00 callq 0x1c16 <mlx5e_xmit_xdp_frame_mpwqe+0x2e6>
0000000000001c12: R_X86_64_PLT32 mlx5e_xdp_mpwqe_complete-0x4
1c16: eb db jmp 0x1bf3 <mlx5e_xmit_xdp_frame_mpwqe+0x2c3>
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c:461
; page_pool_get_dma_addr(skb_frag_page(frag)) +
1c18: 48 03 4f 20 addq 0x20(%rdi), %rcx
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c:460
; tmp.dma_addr = xdptxdf->dma_arr ? xdptxdf->dma_arr[0] :
1c1c: 48 89 4d c8 movq %rcx, -0x38(%rbp)
1c20: 4c 8d 75 c8 leaq -0x38(%rbp), %r14
1c24: e9 3a fd ff ff jmp 0x1963 <mlx5e_xmit_xdp_frame_mpwqe+0x33>
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c:468
; stats->err++;
1c29: 49 ff 44 24 28 incq 0x28(%r12)
1c2e: 31 c0 xorl %eax, %eax
1c30: eb c7 jmp 0x1bf9 <mlx5e_xmit_xdp_frame_mpwqe+0x2c9>
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c:343
; wi = &sq->db.wqe_info[pi];
1c32: 48 01 c0 addq %rax, %rax
1c35: 48 03 83 48 02 00 00 addq 0x248(%rbx), %rax
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c:344
; edge_wi = wi + contig_wqebbs;
1c3c: 41 0f b7 d0 movzwl %r8w, %edx
1c40: 48 8d 34 50 leaq (%rax,%rdx,2), %rsi
1c44: 4c 89 e1 movq %r12, %rcx
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c:347
; for (; wi < edge_wi; wi++) {
1c47: 48 39 f0 cmpq %rsi, %rax
1c4a: 0f 83 9f 00 00 00 jae 0x1cef <mlx5e_xmit_xdp_frame_mpwqe+0x3bf>
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c:348
; *wi = (struct mlx5e_xdp_wqe_info) {
1c50: 66 c7 00 01 00 movw $0x1, (%rax)
1c55: 0f b7 4b 44 movzwl 0x44(%rbx), %ecx
; ./drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h:125
; u16 pi = mlx5_wq_cyc_ctr2ix(wq, *pc);
1c59: 0f b7 bb 10 02 00 00 movzwl 0x210(%rbx), %edi
; ./drivers/net/ethernet/mellanox/mlx5/core/wq.h:144
; return ctr & wq->fbc.sz_m1;
1c60: 21 cf andl %ecx, %edi
; ././include/linux/mlx5/driver.h:950
; ix += fbc->strides_offset;
1c62: 44 0f b7 83 16 02 00 00 movzwl 0x216(%rbx), %r8d
1c6a: 49 01 f8 addq %rdi, %r8
; ././include/linux/mlx5/driver.h:951
; frag = ix >> fbc->log_frag_strides;
1c6d: 0f b6 8b 1a 02 00 00 movzbl 0x21a(%rbx), %ecx
; ././include/linux/mlx5/driver.h:953
; return fbc->frags[frag].buf + ((fbc->frag_sz_m1 & ix) << fbc->log_stride);
1c74: 0f b7 bb 14 02 00 00 movzwl 0x214(%rbx), %edi
1c7b: 44 21 c7 andl %r8d, %edi
; ././include/linux/mlx5/driver.h:951
; frag = ix >> fbc->log_frag_strides;
1c7e: 49 d3 e8 shrq %cl, %r8
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c:352
; mlx5e_post_nop(wq, sq->sqn, &sq->pc);
1c81: 44 8b 8b 78 02 00 00 movl 0x278(%rbx), %r9d
; ././include/linux/mlx5/driver.h:953
; return fbc->frags[frag].buf + ((fbc->frag_sz_m1 & ix) << fbc->log_stride);
1c88: 4c 8b 93 08 02 00 00 movq 0x208(%rbx), %r10
1c8f: 49 c1 e0 04 shlq $0x4, %r8
1c93: 0f b6 8b 19 02 00 00 movzbl 0x219(%rbx), %ecx
1c9a: d3 e7 shll %cl, %edi
1c9c: 4b 8b 0c 02 movq (%r10,%r8), %rcx
; ./drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h:129
; memset(cseg, 0, sizeof(*cseg));
1ca0: 48 c7 44 39 08 00 00 00 00 movq $0x0, 0x8(%rcx,%rdi)
1ca9: 48 c7 04 39 00 00 00 00 movq $0x0, (%rcx,%rdi)
; ./drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h:131
; cseg->opmod_idx_opcode = cpu_to_be32((*pc << 8) | MLX5_OPCODE_NOP);
1cb1: 44 0f b7 43 44 movzwl 0x44(%rbx), %r8d
1cb6: 66 41 c1 c0 08 rolw $0x8, %r8w
1cbb: 45 0f b7 c0 movzwl %r8w, %r8d
1cbf: 41 c1 e0 08 shll $0x8, %r8d
1cc3: 44 89 04 39 movl %r8d, (%rcx,%rdi)
; ./drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h:132
; cseg->qpn_ds = cpu_to_be32((sqn << 8) | 0x01);
1cc7: 41 c1 e1 08 shll $0x8, %r9d
1ccb: 41 83 c9 01 orl $0x1, %r9d
1ccf: 41 0f c9 bswapl %r9d
1cd2: 44 89 4c 39 04 movl %r9d, 0x4(%rcx,%rdi)
; ./drivers/net/ethernet/mellanox/mlx5/core/en/txrx.h:134
; (*pc)++;
1cd7: 66 ff 43 44 incw 0x44(%rbx)
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c:347
; for (; wi < edge_wi; wi++) {
1cdb: 48 83 c0 02 addq $0x2, %rax
1cdf: 48 39 f0 cmpq %rsi, %rax
1ce2: 0f 82 68 ff ff ff jb 0x1c50 <mlx5e_xmit_xdp_frame_mpwqe+0x320>
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c:354
; sq->stats->nops += contig_wqebbs;
1ce8: 48 8b 8b 30 02 00 00 movq 0x230(%rbx), %rcx
1cef: 48 01 51 18 addq %rdx, 0x18(%rcx)
; ./drivers/net/ethernet/mellanox/mlx5/core/wq.h:159
; return mlx5_frag_buf_get_wqe(&wq->fbc, ix);
1cf3: 0f b7 43 44 movzwl 0x44(%rbx), %eax
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c:356
; pi = mlx5_wq_cyc_ctr2ix(wq, sq->pc);
1cf7: 0f b7 8b 10 02 00 00 movzwl 0x210(%rbx), %ecx
; ./drivers/net/ethernet/mellanox/mlx5/core/wq.h:144
; return ctr & wq->fbc.sz_m1;
1cfe: 21 c1 andl %eax, %ecx
; ././include/linux/mlx5/driver.h:950
; ix += fbc->strides_offset;
1d00: 0f b7 bb 16 02 00 00 movzwl 0x216(%rbx), %edi
1d07: 01 cf addl %ecx, %edi
; ././include/linux/mlx5/driver.h:953
; return fbc->frags[frag].buf + ((fbc->frag_sz_m1 & ix) << fbc->log_stride);
1d09: 0f b7 b3 14 02 00 00 movzwl 0x214(%rbx), %esi
1d10: e9 bf fc ff ff jmp 0x19d4 <mlx5e_xmit_xdp_frame_mpwqe+0xa4>
; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c:452
; mlx5e_xdp_mpwqe_complete(sq);
1d15: 48 89 df movq %rbx, %rdi
1d18: e8 00 00 00 00 callq 0x1d1d <mlx5e_xmit_xdp_frame_mpwqe+0x3ed>
0000000000001d19: R_X86_64_PLT32 mlx5e_xdp_mpwqe_complete-0x4
1d1d: e9 f0 fd ff ff jmp 0x1b12 <mlx5e_xmit_xdp_frame_mpwqe+0x1e2>
1d22: 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 nopw %cs:(%rax,%rax)
^ permalink raw reply [flat|nested] 30+ messages in thread
* [xdp-hints] Re: [RFC net-next v4 2/8] xsk: add TX timestamp and TX checksum offload support
2023-07-26 21:25 ` Stanislav Fomichev
@ 2023-07-27 13:50 ` Jesper Dangaard Brouer
2023-07-27 16:34 ` Stanislav Fomichev
0 siblings, 1 reply; 30+ messages in thread
From: Jesper Dangaard Brouer @ 2023-07-27 13:50 UTC (permalink / raw)
To: Stanislav Fomichev, Jesper Dangaard Brouer
Cc: brouer, bpf, ast, daniel, andrii, martin.lau, song, yhs,
john.fastabend, kpsingh, haoluo, jolsa, kuba, toke, willemb,
dsahern, magnus.karlsson, bjorn, maciej.fijalkowski, hawk,
netdev, xdp-hints
On 26/07/2023 23.25, Stanislav Fomichev wrote:
> On 07/26, Jesper Dangaard Brouer wrote:
>>
>>
>> On 25/07/2023 01.59, Stanislav Fomichev wrote:
>>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>>> index 11652e464f5d..8b40c80557aa 100644
>>> --- a/include/linux/netdevice.h
>>> +++ b/include/linux/netdevice.h
>>> @@ -1660,6 +1660,31 @@ struct xdp_metadata_ops {
>>> enum xdp_rss_hash_type *rss_type);
>>> };
>>> +/*
>>> + * This structure defines the AF_XDP TX metadata hooks for network devices.
>>> + * The following hooks can be defined; unless noted otherwise, they are
>>> + * optional and can be filled with a null pointer.
>>> + *
>>> + * int (*tmo_request_timestamp)(void *priv)
>>> + * This function is called when AF_XDP frame requested egress timestamp.
>>> + *
>>> + * int (*tmo_fill_timestamp)(void *priv)
>>> + * This function is called when AF_XDP frame, that had requested
>>> + * egress timestamp, received a completion. The hook needs to return
>>> + * the actual HW timestamp.
>>> + *
>>> + * int (*tmo_request_timestamp)(u16 csum_start, u16 csum_offset, void *priv)
>>> + * This function is called when AF_XDP frame requested HW checksum
>>> + * offload. csum_start indicates position where checksumming should start.
>>> + * csum_offset indicates position where checksum should be stored.
>>> + *
>>> + */
>>> +struct xsk_tx_metadata_ops {
>>> + void (*tmo_request_timestamp)(void *priv);
>>> + u64 (*tmo_fill_timestamp)(void *priv);
>>> + void (*tmo_request_checksum)(u16 csum_start, u16 csum_offset, void *priv);
>>> +};
>>> +
>>> /**
>>> * enum netdev_priv_flags - &struct net_device priv_flags
>>> *
>>> @@ -1844,6 +1869,7 @@ enum netdev_ml_priv_type {
>>> * @netdev_ops: Includes several pointers to callbacks,
>>> * if one wants to override the ndo_*() functions
>>> * @xdp_metadata_ops: Includes pointers to XDP metadata callbacks.
>>> + * @xsk_tx_metadata_ops: Includes pointers to AF_XDP TX metadata callbacks.
>>> * @ethtool_ops: Management operations
>>> * @l3mdev_ops: Layer 3 master device operations
>>> * @ndisc_ops: Includes callbacks for different IPv6 neighbour
>>> @@ -2100,6 +2126,7 @@ struct net_device {
>>> unsigned long long priv_flags;
>>> const struct net_device_ops *netdev_ops;
>>> const struct xdp_metadata_ops *xdp_metadata_ops;
>>> + const struct xsk_tx_metadata_ops *xsk_tx_metadata_ops;
>>> int ifindex;
>>> unsigned short gflags;
>>> unsigned short hard_header_len;
>>> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>>> index faaba050f843..5febc1a5131e 100644
>>> --- a/include/linux/skbuff.h
>>> +++ b/include/linux/skbuff.h
>>> @@ -581,7 +581,10 @@ struct skb_shared_info {
>>> /* Warning: this field is not always filled in (UFO)! */
>>> unsigned short gso_segs;
>>> struct sk_buff *frag_list;
>>> - struct skb_shared_hwtstamps hwtstamps;
>>> + union {
>>> + struct skb_shared_hwtstamps hwtstamps;
>>> + struct xsk_tx_metadata *xsk_meta;
>>> + };
>>> unsigned int gso_type;
>>> u32 tskey;
>>> diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
>>> index 467b9fb56827..288fa58c4665 100644
>>> --- a/include/net/xdp_sock.h
>>> +++ b/include/net/xdp_sock.h
>>> @@ -90,6 +90,54 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
>>> int __xsk_map_redirect(struct xdp_sock *xs, struct xdp_buff *xdp);
>>> void __xsk_map_flush(void);
>>> +/**
>>> + * xsk_tx_metadata_request - Evaluate AF_XDP TX metadata at submission
>>> + * and call appropriate xsk_tx_metadata_ops operation.
>>> + * @meta: pointer to AF_XDP metadata area
>>> + * @ops: pointer to struct xsk_tx_metadata_ops
>>> + * @priv: pointer to driver-private aread
>>> + *
>>> + * This function should be called by the networking device when
>>> + * it prepares AF_XDP egress packet.
>>> + */
>>> +static inline void xsk_tx_metadata_request(const struct xsk_tx_metadata *meta,
>>> + const struct xsk_tx_metadata_ops *ops,
>>> + void *priv)
>>
>> (As you mentioned) this gets inlined in drivers for performance.
>>
>>> +{
>>> + if (!meta)
>>> + return;
>>> +
>>> + if (ops->tmo_request_timestamp)
>>> + if (meta->flags & XDP_TX_METADATA_TIMESTAMP)
>>> + ops->tmo_request_timestamp(priv);
>>
>> We still have the overhead of function pointer call.
>> With RETPOLINE this is costly.
>>
>> Measured on my testlab CPU E5-1650 v4 @ 3.60GHz
>> Type:function_call_cost: 3 cycles(tsc) 1.010 ns
>> Type:func_ptr_call_cost: 30 cycles(tsc) 8.341 ns
>>
>> Given this get inlined in drivers, perhaps we can add some
>> INDIRECT_CALL_1 macros to avoid these indirect calls?
>
> I'm assuming that the compiler is smart enough to resolve these const
> struct ops definitions as long as they are in the same file.
>
> At least here is what I see for mlx5e_xmit_xdp_frame_mpwqe: those
> indirect jumps are resolved and the calls themselves are unrolled.
> TBF, I don't have retpolines enabled in the config, but I don't think
> it will bring indirect jumps back? Am I missing anything?
>
I tried this with CONFIG_RETPOLINE and same results.
The compiler is smart and inlines the call to mlx5e_xsk_request_checksum().
This is great, no need for crazy INDIRECT_CALL_1 macros :-)
>
> 0000000000001930 <mlx5e_xmit_xdp_frame_mpwqe>:
> ; mlx5e_xmit_xdp_frame_mpwqe():
> ; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c:436
[...]
> ; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c:381
> ; stats->mpwqe++;
> 1b4a: 49 ff 44 24 08 incq 0x8(%r12)
> ; ././include/net/xdp_sock.h:107
How do you get objdump to add these file:line annotations?
I use:
objdump -rS --visualize-jumps=color -Mintel | less -R
> ; if (!meta)
> 1b4f: 4d 85 ff testq %r15, %r15
> 1b52: 74 0e je 0x1b62 <mlx5e_xmit_xdp_frame_mpwqe+0x232>
> ; ././include/net/xdp_sock.h:115
> ; if (meta->flags & XDP_TX_METADATA_CHECKSUM)
> 1b54: 41 f6 07 02 testb $0x2, (%r15)
> 1b58: 74 08 je 0x1b62 <mlx5e_xmit_xdp_frame_mpwqe+0x232>
> ; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c:483
> ; xsk_tx_metadata_request(meta, &mlx5e_xsk_tx_metadata_ops, &session->wqe->eth);
> 1b5a: 48 8b 43 50 movq 0x50(%rbx), %rax
> ; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c:286
> ; eseg->cs_flags |= MLX5_ETH_WQE_L3_CSUM | MLX5_ETH_WQE_L4_CSUM;
> 1b5e: 80 48 14 c0 orb $-0x40, 0x14(%rax)
Yes, here mlx5e_xsk_request_checksum() gets inlined, as
MLX5_ETH_WQE_L3_CSUM | MLX5_ETH_WQE_L4_CSUM = 192
192 = 0xC0 and signed byte value $-0x40, which we see in inst 'orb'.
I usually prefer Intel ASM code output via objdump -Mintel
and it decode ASM with unsigned value 0xc0 / 192:
or BYTE PTR ds:0x4,0xc0
> ; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h:203
> ; (struct mlx5_wqe_data_seg *)session->wqe + session->ds_count;
> 1b62: 48 8b 43 50 movq 0x50(%rbx), %rax
[...]
Thank you for checking up on this! :-)
--Jesper
^ permalink raw reply [flat|nested] 30+ messages in thread
* [xdp-hints] Re: [RFC net-next v4 2/8] xsk: add TX timestamp and TX checksum offload support
2023-07-25 23:10 ` Willem de Bruijn
2023-07-25 23:48 ` Stanislav Fomichev
@ 2023-07-27 14:11 ` Jesper Dangaard Brouer
2023-07-27 16:37 ` Stanislav Fomichev
1 sibling, 1 reply; 30+ messages in thread
From: Jesper Dangaard Brouer @ 2023-07-27 14:11 UTC (permalink / raw)
To: Willem de Bruijn, Stanislav Fomichev
Cc: brouer, bpf, ast, daniel, andrii, martin.lau, song, yhs,
john.fastabend, kpsingh, haoluo, jolsa, kuba, toke, willemb,
dsahern, magnus.karlsson, bjorn, maciej.fijalkowski, hawk,
netdev, xdp-hints
On 26/07/2023 01.10, Willem de Bruijn wrote:
> Stanislav Fomichev wrote:
>> On 07/25, Willem de Bruijn wrote:
>>> Stanislav Fomichev wrote:
[...]
>>>> +struct xsk_tx_metadata {
>>>> + __u32 flags;
>>>> +
>>>> + /* XDP_TX_METADATA_CHECKSUM */
>>>> +
>>>> + /* Offset from desc->addr where checksumming should start. */
>>>> + __u16 csum_start;
>>>> + /* Offset from csum_start where checksum should be stored. */
>>>> + __u16 csum_offset;
>>>> +
>>>> + /* XDP_TX_METADATA_TIMESTAMP */
>>>> +
>>>> + __u64 tx_timestamp;
>>>> +};
>>>
>>> Is this structure easily extensible for future offloads,
[...]
>
> Pacing offload is the other feature that comes to mind. That could
> conceivably use the tx_timestamp field.
I would really like to see hardware offload "pacing" or LaunchTime as
hardware chips i210 and i225 calls it. I looked at the TX descriptor
details for drivers igc (i225) and igb (i210), and documented my finding
here[1], which should help with the code details if someone is motivated
for implementing this (I know of people on xdp-hints list that wanted
this LaunchTime feature).
[1]
https://github.com/xdp-project/xdp-project/blob/master/areas/tsn/code01_follow_qdisc_TSN_offload.org#tx-time-to-hardware-driver-igc
AFAIK this patchset uses struct xsk_tx_metadata as both TX and
TX-Completion, right?. It would be "conceivable" to reuse tx_timestamp
field, but it might be confusing for uAPI end-users. Maybe a union?
--Jesper
^ permalink raw reply [flat|nested] 30+ messages in thread
* [xdp-hints] Re: [RFC net-next v4 2/8] xsk: add TX timestamp and TX checksum offload support
2023-07-27 13:50 ` Jesper Dangaard Brouer
@ 2023-07-27 16:34 ` Stanislav Fomichev
0 siblings, 0 replies; 30+ messages in thread
From: Stanislav Fomichev @ 2023-07-27 16:34 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: brouer, bpf, ast, daniel, andrii, martin.lau, song, yhs,
john.fastabend, kpsingh, haoluo, jolsa, kuba, toke, willemb,
dsahern, magnus.karlsson, bjorn, maciej.fijalkowski, hawk,
netdev, xdp-hints
On 07/27, Jesper Dangaard Brouer wrote:
>
>
> On 26/07/2023 23.25, Stanislav Fomichev wrote:
> > On 07/26, Jesper Dangaard Brouer wrote:
> > >
> > >
> > > On 25/07/2023 01.59, Stanislav Fomichev wrote:
> > > > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > > > index 11652e464f5d..8b40c80557aa 100644
> > > > --- a/include/linux/netdevice.h
> > > > +++ b/include/linux/netdevice.h
> > > > @@ -1660,6 +1660,31 @@ struct xdp_metadata_ops {
> > > > enum xdp_rss_hash_type *rss_type);
> > > > };
> > > > +/*
> > > > + * This structure defines the AF_XDP TX metadata hooks for network devices.
> > > > + * The following hooks can be defined; unless noted otherwise, they are
> > > > + * optional and can be filled with a null pointer.
> > > > + *
> > > > + * int (*tmo_request_timestamp)(void *priv)
> > > > + * This function is called when AF_XDP frame requested egress timestamp.
> > > > + *
> > > > + * int (*tmo_fill_timestamp)(void *priv)
> > > > + * This function is called when AF_XDP frame, that had requested
> > > > + * egress timestamp, received a completion. The hook needs to return
> > > > + * the actual HW timestamp.
> > > > + *
> > > > + * int (*tmo_request_timestamp)(u16 csum_start, u16 csum_offset, void *priv)
> > > > + * This function is called when AF_XDP frame requested HW checksum
> > > > + * offload. csum_start indicates position where checksumming should start.
> > > > + * csum_offset indicates position where checksum should be stored.
> > > > + *
> > > > + */
> > > > +struct xsk_tx_metadata_ops {
> > > > + void (*tmo_request_timestamp)(void *priv);
> > > > + u64 (*tmo_fill_timestamp)(void *priv);
> > > > + void (*tmo_request_checksum)(u16 csum_start, u16 csum_offset, void *priv);
> > > > +};
> > > > +
> > > > /**
> > > > * enum netdev_priv_flags - &struct net_device priv_flags
> > > > *
> > > > @@ -1844,6 +1869,7 @@ enum netdev_ml_priv_type {
> > > > * @netdev_ops: Includes several pointers to callbacks,
> > > > * if one wants to override the ndo_*() functions
> > > > * @xdp_metadata_ops: Includes pointers to XDP metadata callbacks.
> > > > + * @xsk_tx_metadata_ops: Includes pointers to AF_XDP TX metadata callbacks.
> > > > * @ethtool_ops: Management operations
> > > > * @l3mdev_ops: Layer 3 master device operations
> > > > * @ndisc_ops: Includes callbacks for different IPv6 neighbour
> > > > @@ -2100,6 +2126,7 @@ struct net_device {
> > > > unsigned long long priv_flags;
> > > > const struct net_device_ops *netdev_ops;
> > > > const struct xdp_metadata_ops *xdp_metadata_ops;
> > > > + const struct xsk_tx_metadata_ops *xsk_tx_metadata_ops;
> > > > int ifindex;
> > > > unsigned short gflags;
> > > > unsigned short hard_header_len;
> > > > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > > > index faaba050f843..5febc1a5131e 100644
> > > > --- a/include/linux/skbuff.h
> > > > +++ b/include/linux/skbuff.h
> > > > @@ -581,7 +581,10 @@ struct skb_shared_info {
> > > > /* Warning: this field is not always filled in (UFO)! */
> > > > unsigned short gso_segs;
> > > > struct sk_buff *frag_list;
> > > > - struct skb_shared_hwtstamps hwtstamps;
> > > > + union {
> > > > + struct skb_shared_hwtstamps hwtstamps;
> > > > + struct xsk_tx_metadata *xsk_meta;
> > > > + };
> > > > unsigned int gso_type;
> > > > u32 tskey;
> > > > diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> > > > index 467b9fb56827..288fa58c4665 100644
> > > > --- a/include/net/xdp_sock.h
> > > > +++ b/include/net/xdp_sock.h
> > > > @@ -90,6 +90,54 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
> > > > int __xsk_map_redirect(struct xdp_sock *xs, struct xdp_buff *xdp);
> > > > void __xsk_map_flush(void);
> > > > +/**
> > > > + * xsk_tx_metadata_request - Evaluate AF_XDP TX metadata at submission
> > > > + * and call appropriate xsk_tx_metadata_ops operation.
> > > > + * @meta: pointer to AF_XDP metadata area
> > > > + * @ops: pointer to struct xsk_tx_metadata_ops
> > > > + * @priv: pointer to driver-private aread
> > > > + *
> > > > + * This function should be called by the networking device when
> > > > + * it prepares AF_XDP egress packet.
> > > > + */
> > > > +static inline void xsk_tx_metadata_request(const struct xsk_tx_metadata *meta,
> > > > + const struct xsk_tx_metadata_ops *ops,
> > > > + void *priv)
> > >
> > > (As you mentioned) this gets inlined in drivers for performance.
> > >
> > > > +{
> > > > + if (!meta)
> > > > + return;
> > > > +
> > > > + if (ops->tmo_request_timestamp)
> > > > + if (meta->flags & XDP_TX_METADATA_TIMESTAMP)
> > > > + ops->tmo_request_timestamp(priv);
> > >
> > > We still have the overhead of function pointer call.
> > > With RETPOLINE this is costly.
> > >
> > > Measured on my testlab CPU E5-1650 v4 @ 3.60GHz
> > > Type:function_call_cost: 3 cycles(tsc) 1.010 ns
> > > Type:func_ptr_call_cost: 30 cycles(tsc) 8.341 ns
> > >
> > > Given this get inlined in drivers, perhaps we can add some
> > > INDIRECT_CALL_1 macros to avoid these indirect calls?
> >
> > I'm assuming that the compiler is smart enough to resolve these const
> > struct ops definitions as long as they are in the same file.
> >
> > At least here is what I see for mlx5e_xmit_xdp_frame_mpwqe: those
> > indirect jumps are resolved and the calls themselves are unrolled.
> > TBF, I don't have retpolines enabled in the config, but I don't think
> > it will bring indirect jumps back? Am I missing anything?
> >
>
> I tried this with CONFIG_RETPOLINE and same results.
> The compiler is smart and inlines the call to mlx5e_xsk_request_checksum().
> This is great, no need for crazy INDIRECT_CALL_1 macros :-)
Perfect, thanks for checking!
> >
> > 0000000000001930 <mlx5e_xmit_xdp_frame_mpwqe>:
> > ; mlx5e_xmit_xdp_frame_mpwqe():
> > ; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c:436
> [...]
> > ; ./drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c:381
> > ; stats->mpwqe++;
> > 1b4a: 49 ff 44 24 08 incq 0x8(%r12)
> > ; ././include/net/xdp_sock.h:107
>
> How do you get objdump to add these file:line annotations?
>
> I use:
> objdump -rS --visualize-jumps=color -Mintel | less -R
I use llvm tools (and complier), so I did:
llvm-objdump -r -S -l --disassemble xxx.o
Most likely it's that -l option? man objdump seems to have it..
^ permalink raw reply [flat|nested] 30+ messages in thread
* [xdp-hints] Re: [RFC net-next v4 2/8] xsk: add TX timestamp and TX checksum offload support
2023-07-27 14:11 ` Jesper Dangaard Brouer
@ 2023-07-27 16:37 ` Stanislav Fomichev
0 siblings, 0 replies; 30+ messages in thread
From: Stanislav Fomichev @ 2023-07-27 16:37 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Willem de Bruijn, brouer, bpf, ast, daniel, andrii, martin.lau,
song, yhs, john.fastabend, kpsingh, haoluo, jolsa, kuba, toke,
willemb, dsahern, magnus.karlsson, bjorn, maciej.fijalkowski,
hawk, netdev, xdp-hints
On 07/27, Jesper Dangaard Brouer wrote:
>
> On 26/07/2023 01.10, Willem de Bruijn wrote:
> > Stanislav Fomichev wrote:
> > > On 07/25, Willem de Bruijn wrote:
> > > > Stanislav Fomichev wrote:
> [...]
> > > > > +struct xsk_tx_metadata {
> > > > > + __u32 flags;
> > > > > +
> > > > > + /* XDP_TX_METADATA_CHECKSUM */
> > > > > +
> > > > > + /* Offset from desc->addr where checksumming should start. */
> > > > > + __u16 csum_start;
> > > > > + /* Offset from csum_start where checksum should be stored. */
> > > > > + __u16 csum_offset;
> > > > > +
> > > > > + /* XDP_TX_METADATA_TIMESTAMP */
> > > > > +
> > > > > + __u64 tx_timestamp;
> > > > > +};
> >>>
> > > > Is this structure easily extensible for future offloads,
> [...]
> >
> > Pacing offload is the other feature that comes to mind. That could
> > conceivably use the tx_timestamp field.
>
> I would really like to see hardware offload "pacing" or LaunchTime as
> hardware chips i210 and i225 calls it. I looked at the TX descriptor
> details for drivers igc (i225) and igb (i210), and documented my finding
> here[1], which should help with the code details if someone is motivated
> for implementing this (I know of people on xdp-hints list that wanted
> this LaunchTime feature).
>
> [1] https://github.com/xdp-project/xdp-project/blob/master/areas/tsn/code01_follow_qdisc_TSN_offload.org#tx-time-to-hardware-driver-igc
Nice!
> AFAIK this patchset uses struct xsk_tx_metadata as both TX and
> TX-Completion, right?. It would be "conceivable" to reuse tx_timestamp
> field, but it might be confusing for uAPI end-users. Maybe a union?
Sure, but do we add it later when we add launch time? For now, let's
figure out whether 'tx_timestamp' is the right name for the 'tx
completion timestamp' :-D Later on, we can union it with launch_time?
^ permalink raw reply [flat|nested] 30+ messages in thread
* [xdp-hints] Re: [RFC net-next v4 2/8] xsk: add TX timestamp and TX checksum offload support
2023-07-25 20:58 ` Willem de Bruijn
@ 2023-07-28 11:56 ` Jesper Dangaard Brouer
2023-07-28 13:19 ` Willem de Bruijn
0 siblings, 1 reply; 30+ messages in thread
From: Jesper Dangaard Brouer @ 2023-07-28 11:56 UTC (permalink / raw)
To: Willem de Bruijn, Stanislav Fomichev, bpf
Cc: brouer, ast, daniel, andrii, martin.lau, song, yhs,
john.fastabend, kpsingh, haoluo, jolsa, kuba, toke, willemb,
dsahern, magnus.karlsson, bjorn, maciej.fijalkowski, hawk,
netdev, xdp-hints
On 25/07/2023 22.58, Willem de Bruijn wrote:
> Stanislav Fomichev wrote:
>> This change actually defines the (initial) metadata layout
>> that should be used by AF_XDP userspace (xsk_tx_metadata).
>> The first field is flags which requests appropriate offloads,
>> followed by the offload-specific fields. The supported per-device
>> offloads are exported via netlink (new xsk-flags).
>>
>> The offloads themselves are still implemented in a bit of a
>> framework-y fashion that's left from my initial kfunc attempt.
>> I'm introducing new xsk_tx_metadata_ops which drivers are
>> supposed to implement. The drivers are also supposed
>> to call xsk_tx_metadata_request/xsk_tx_metadata_complete in
>> the right places. Since xsk_tx_metadata_{request,_complete}
>> are static inline, we don't incur any extra overhead doing
>> indirect calls.
>>
>> The benefit of this scheme is as follows:
>> - keeps all metadata layout parsing away from driver code
>> - makes it easy to grep and see which drivers implement what
>> - don't need any extra flags to maintain to keep track of that
>> offloads are implemented; if the callback is implemented - the offload
>> is supported (used by netlink reporting code)
>>
>> Two offloads are defined right now:
>> 1. XDP_TX_METADATA_CHECKSUM: skb-style csum_start+csum_offset
>> 2. XDP_TX_METADATA_TIMESTAMP: writes TX timestamp back into metadata
>> area upon completion (tx_timestamp field)
>>
>> The offloads are also implemented for copy mode:
>> 1. Extra XDP_TX_METADATA_CHECKSUM_SW to trigger skb_checksum_help; this
>> might be useful as a reference implementation and for testing
>> 2. XDP_TX_METADATA_TIMESTAMP writes SW timestamp from the skb
>> destructor (note I'm reusing hwtstamps to pass metadata pointer)
>>
>> The struct is forward-compatible and can be extended in the future
>> by appending more fields.
>>
>> Signed-off-by: Stanislav Fomichev <sdf@google.com>
>> ---
>> Documentation/netlink/specs/netdev.yaml | 19 ++++++++
>> include/linux/netdevice.h | 27 +++++++++++
>> include/linux/skbuff.h | 5 ++-
>> include/net/xdp_sock.h | 60 +++++++++++++++++++++++++
>> include/net/xdp_sock_drv.h | 13 ++++++
>> include/uapi/linux/if_xdp.h | 35 +++++++++++++++
>> include/uapi/linux/netdev.h | 15 +++++++
>> net/core/netdev-genl.c | 12 ++++-
>> net/xdp/xsk.c | 38 ++++++++++++++++
>> net/xdp/xsk_queue.h | 2 +-
>> tools/include/uapi/linux/if_xdp.h | 50 ++++++++++++++++++---
>> 11 files changed, 268 insertions(+), 8 deletions(-)
>>
>> diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
>> index e41015310a6e..bf9c1cc32614 100644
>> --- a/Documentation/netlink/specs/netdev.yaml
>> +++ b/Documentation/netlink/specs/netdev.yaml
>> @@ -42,6 +42,19 @@ name: netdev
>> doc:
>> This feature informs if netdev implements non-linear XDP buffer
>> support in ndo_xdp_xmit callback.
>> + -
>> + type: flags
>> + name: xsk-flags
>> + render-max: true
>> + entries:
>> + -
>> + name: tx-timestamp
>> + doc:
>> + HW timestamping egress packets is supported by the driver.
>> + -
>> + name: tx-checksum
>> + doc:
>> + L3 checksum HW offload is supported by the driver.
>>
>> attribute-sets:
>> -
>> @@ -68,6 +81,12 @@ name: netdev
>> type: u32
>> checks:
>> min: 1
>> + -
>> + name: xsk-features
>> + doc: Bitmask of enabled AF_XDP features.
>> + type: u64
>> + enum: xsk-flags
>> + enum-as-flags: true
>>
>> operations:
>> list:
>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>> index 11652e464f5d..8b40c80557aa 100644
>> --- a/include/linux/netdevice.h
>> +++ b/include/linux/netdevice.h
>> @@ -1660,6 +1660,31 @@ struct xdp_metadata_ops {
>> enum xdp_rss_hash_type *rss_type);
>> };
>>
>> +/*
>> + * This structure defines the AF_XDP TX metadata hooks for network devices.
>> + * The following hooks can be defined; unless noted otherwise, they are
>> + * optional and can be filled with a null pointer.
>> + *
>> + * int (*tmo_request_timestamp)(void *priv)
>> + * This function is called when AF_XDP frame requested egress timestamp.
>> + *
>> + * int (*tmo_fill_timestamp)(void *priv)
>> + * This function is called when AF_XDP frame, that had requested
>> + * egress timestamp, received a completion. The hook needs to return
>> + * the actual HW timestamp.
>> + *
>> + * int (*tmo_request_timestamp)(u16 csum_start, u16 csum_offset, void *priv)
>
> typo: tmo_request_checksum
>
>> + * This function is called when AF_XDP frame requested HW checksum
>> + * offload. csum_start indicates position where checksumming should start.
>> + * csum_offset indicates position where checksum should be stored.
>> + *
>> + */
>> +struct xsk_tx_metadata_ops {
>> + void (*tmo_request_timestamp)(void *priv);
>> + u64 (*tmo_fill_timestamp)(void *priv);
>> + void (*tmo_request_checksum)(u16 csum_start, u16 csum_offset, void *priv);
>> +};
>> +
>> /**
>> * enum netdev_priv_flags - &struct net_device priv_flags
>> *
>> @@ -1844,6 +1869,7 @@ enum netdev_ml_priv_type {
>> * @netdev_ops: Includes several pointers to callbacks,
>> * if one wants to override the ndo_*() functions
>> * @xdp_metadata_ops: Includes pointers to XDP metadata callbacks.
>> + * @xsk_tx_metadata_ops: Includes pointers to AF_XDP TX metadata callbacks.
>> * @ethtool_ops: Management operations
>> * @l3mdev_ops: Layer 3 master device operations
>> * @ndisc_ops: Includes callbacks for different IPv6 neighbour
>> @@ -2100,6 +2126,7 @@ struct net_device {
>> unsigned long long priv_flags;
>> const struct net_device_ops *netdev_ops;
>> const struct xdp_metadata_ops *xdp_metadata_ops;
>> + const struct xsk_tx_metadata_ops *xsk_tx_metadata_ops;
>> int ifindex;
>> unsigned short gflags;
>> unsigned short hard_header_len;
>> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>> index faaba050f843..5febc1a5131e 100644
>> --- a/include/linux/skbuff.h
>> +++ b/include/linux/skbuff.h
>> @@ -581,7 +581,10 @@ struct skb_shared_info {
>> /* Warning: this field is not always filled in (UFO)! */
>> unsigned short gso_segs;
>> struct sk_buff *frag_list;
>> - struct skb_shared_hwtstamps hwtstamps;
>> + union {
>> + struct skb_shared_hwtstamps hwtstamps;
>> + struct xsk_tx_metadata *xsk_meta;
>> + };
>> unsigned int gso_type;
>> u32 tskey;
>>
>> diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
>> index 467b9fb56827..288fa58c4665 100644
>> --- a/include/net/xdp_sock.h
>> +++ b/include/net/xdp_sock.h
>> @@ -90,6 +90,54 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
>> int __xsk_map_redirect(struct xdp_sock *xs, struct xdp_buff *xdp);
>> void __xsk_map_flush(void);
>>
>> +/**
>> + * xsk_tx_metadata_request - Evaluate AF_XDP TX metadata at submission
>> + * and call appropriate xsk_tx_metadata_ops operation.
>> + * @meta: pointer to AF_XDP metadata area
>> + * @ops: pointer to struct xsk_tx_metadata_ops
>> + * @priv: pointer to driver-private aread
>> + *
>> + * This function should be called by the networking device when
>> + * it prepares AF_XDP egress packet.
>> + */
>> +static inline void xsk_tx_metadata_request(const struct xsk_tx_metadata *meta,
>> + const struct xsk_tx_metadata_ops *ops,
>> + void *priv)
>> +{
>> + if (!meta)
>> + return;
>> +
>> + if (ops->tmo_request_timestamp)
>> + if (meta->flags & XDP_TX_METADATA_TIMESTAMP)
>> + ops->tmo_request_timestamp(priv);
>> +
>> + if (ops->tmo_request_checksum)
>> + if (meta->flags & XDP_TX_METADATA_CHECKSUM)
>> + ops->tmo_request_checksum(meta->csum_start, meta->csum_offset, priv);
>> +}
>
> Might be cheaper to test the flag in the hot cacheline before
> dereferencing ops?
>
I was thinking the same thing, but I was wrong, see below, as these ops
deref's are optimized out by the compiler.
> Also, just add these functions to net_device_ops directly,
> rather than dereferencing another pointer to xsk_tx_metadata_ops?
>
After the ASM/objdump discussion[1] in this thread, I think Stanislav's
code approach here is actually optimal. Because when the 'const' ops are
defined locally in the same file that use/does the inlining of
xsk_tx_metadata_request() then the compilers (both llvm and gcc) are
smart enough to inline and do dead-code elimination. E.g. I noticed
that generated ASM code eliminated `if (ops->tmo_request_timestamp)`
code path in mlx5 as its not implemented.
[1]
https://lore.kernel.org/all/cce9db50-8c9d-ea97-cb88-171fa46cc064@redhat.com/
--Jesper
^ permalink raw reply [flat|nested] 30+ messages in thread
* [xdp-hints] Re: [RFC net-next v4 2/8] xsk: add TX timestamp and TX checksum offload support
2023-07-28 11:56 ` Jesper Dangaard Brouer
@ 2023-07-28 13:19 ` Willem de Bruijn
0 siblings, 0 replies; 30+ messages in thread
From: Willem de Bruijn @ 2023-07-28 13:19 UTC (permalink / raw)
To: Jesper Dangaard Brouer, Willem de Bruijn, Stanislav Fomichev, bpf
Cc: brouer, ast, daniel, andrii, martin.lau, song, yhs,
john.fastabend, kpsingh, haoluo, jolsa, kuba, toke, willemb,
dsahern, magnus.karlsson, bjorn, maciej.fijalkowski, hawk,
netdev, xdp-hints
Jesper Dangaard Brouer wrote:
>
>
> On 25/07/2023 22.58, Willem de Bruijn wrote:
> > Stanislav Fomichev wrote:
> >> This change actually defines the (initial) metadata layout
> >> that should be used by AF_XDP userspace (xsk_tx_metadata).
> >> The first field is flags which requests appropriate offloads,
> >> followed by the offload-specific fields. The supported per-device
> >> offloads are exported via netlink (new xsk-flags).
> >>
> >> The offloads themselves are still implemented in a bit of a
> >> framework-y fashion that's left from my initial kfunc attempt.
> >> I'm introducing new xsk_tx_metadata_ops which drivers are
> >> supposed to implement. The drivers are also supposed
> >> to call xsk_tx_metadata_request/xsk_tx_metadata_complete in
> >> the right places. Since xsk_tx_metadata_{request,_complete}
> >> are static inline, we don't incur any extra overhead doing
> >> indirect calls.
> >>
> >> The benefit of this scheme is as follows:
> >> - keeps all metadata layout parsing away from driver code
> >> - makes it easy to grep and see which drivers implement what
> >> - don't need any extra flags to maintain to keep track of that
> >> offloads are implemented; if the callback is implemented - the offload
> >> is supported (used by netlink reporting code)
> >>
> >> Two offloads are defined right now:
> >> 1. XDP_TX_METADATA_CHECKSUM: skb-style csum_start+csum_offset
> >> 2. XDP_TX_METADATA_TIMESTAMP: writes TX timestamp back into metadata
> >> area upon completion (tx_timestamp field)
> >>
> >> The offloads are also implemented for copy mode:
> >> 1. Extra XDP_TX_METADATA_CHECKSUM_SW to trigger skb_checksum_help; this
> >> might be useful as a reference implementation and for testing
> >> 2. XDP_TX_METADATA_TIMESTAMP writes SW timestamp from the skb
> >> destructor (note I'm reusing hwtstamps to pass metadata pointer)
> >>
> >> The struct is forward-compatible and can be extended in the future
> >> by appending more fields.
> >>
> >> Signed-off-by: Stanislav Fomichev <sdf@google.com>
> >> ---
> >> Documentation/netlink/specs/netdev.yaml | 19 ++++++++
> >> include/linux/netdevice.h | 27 +++++++++++
> >> include/linux/skbuff.h | 5 ++-
> >> include/net/xdp_sock.h | 60 +++++++++++++++++++++++++
> >> include/net/xdp_sock_drv.h | 13 ++++++
> >> include/uapi/linux/if_xdp.h | 35 +++++++++++++++
> >> include/uapi/linux/netdev.h | 15 +++++++
> >> net/core/netdev-genl.c | 12 ++++-
> >> net/xdp/xsk.c | 38 ++++++++++++++++
> >> net/xdp/xsk_queue.h | 2 +-
> >> tools/include/uapi/linux/if_xdp.h | 50 ++++++++++++++++++---
> >> 11 files changed, 268 insertions(+), 8 deletions(-)
> >>
> >> diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
> >> index e41015310a6e..bf9c1cc32614 100644
> >> --- a/Documentation/netlink/specs/netdev.yaml
> >> +++ b/Documentation/netlink/specs/netdev.yaml
> >> @@ -42,6 +42,19 @@ name: netdev
> >> doc:
> >> This feature informs if netdev implements non-linear XDP buffer
> >> support in ndo_xdp_xmit callback.
> >> + -
> >> + type: flags
> >> + name: xsk-flags
> >> + render-max: true
> >> + entries:
> >> + -
> >> + name: tx-timestamp
> >> + doc:
> >> + HW timestamping egress packets is supported by the driver.
> >> + -
> >> + name: tx-checksum
> >> + doc:
> >> + L3 checksum HW offload is supported by the driver.
> >>
> >> attribute-sets:
> >> -
> >> @@ -68,6 +81,12 @@ name: netdev
> >> type: u32
> >> checks:
> >> min: 1
> >> + -
> >> + name: xsk-features
> >> + doc: Bitmask of enabled AF_XDP features.
> >> + type: u64
> >> + enum: xsk-flags
> >> + enum-as-flags: true
> >>
> >> operations:
> >> list:
> >> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> >> index 11652e464f5d..8b40c80557aa 100644
> >> --- a/include/linux/netdevice.h
> >> +++ b/include/linux/netdevice.h
> >> @@ -1660,6 +1660,31 @@ struct xdp_metadata_ops {
> >> enum xdp_rss_hash_type *rss_type);
> >> };
> >>
> >> +/*
> >> + * This structure defines the AF_XDP TX metadata hooks for network devices.
> >> + * The following hooks can be defined; unless noted otherwise, they are
> >> + * optional and can be filled with a null pointer.
> >> + *
> >> + * int (*tmo_request_timestamp)(void *priv)
> >> + * This function is called when AF_XDP frame requested egress timestamp.
> >> + *
> >> + * int (*tmo_fill_timestamp)(void *priv)
> >> + * This function is called when AF_XDP frame, that had requested
> >> + * egress timestamp, received a completion. The hook needs to return
> >> + * the actual HW timestamp.
> >> + *
> >> + * int (*tmo_request_timestamp)(u16 csum_start, u16 csum_offset, void *priv)
> >
> > typo: tmo_request_checksum
> >
> >> + * This function is called when AF_XDP frame requested HW checksum
> >> + * offload. csum_start indicates position where checksumming should start.
> >> + * csum_offset indicates position where checksum should be stored.
> >> + *
> >> + */
> >> +struct xsk_tx_metadata_ops {
> >> + void (*tmo_request_timestamp)(void *priv);
> >> + u64 (*tmo_fill_timestamp)(void *priv);
> >> + void (*tmo_request_checksum)(u16 csum_start, u16 csum_offset, void *priv);
> >> +};
> >> +
> >> /**
> >> * enum netdev_priv_flags - &struct net_device priv_flags
> >> *
> >> @@ -1844,6 +1869,7 @@ enum netdev_ml_priv_type {
> >> * @netdev_ops: Includes several pointers to callbacks,
> >> * if one wants to override the ndo_*() functions
> >> * @xdp_metadata_ops: Includes pointers to XDP metadata callbacks.
> >> + * @xsk_tx_metadata_ops: Includes pointers to AF_XDP TX metadata callbacks.
> >> * @ethtool_ops: Management operations
> >> * @l3mdev_ops: Layer 3 master device operations
> >> * @ndisc_ops: Includes callbacks for different IPv6 neighbour
> >> @@ -2100,6 +2126,7 @@ struct net_device {
> >> unsigned long long priv_flags;
> >> const struct net_device_ops *netdev_ops;
> >> const struct xdp_metadata_ops *xdp_metadata_ops;
> >> + const struct xsk_tx_metadata_ops *xsk_tx_metadata_ops;
> >> int ifindex;
> >> unsigned short gflags;
> >> unsigned short hard_header_len;
> >> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> >> index faaba050f843..5febc1a5131e 100644
> >> --- a/include/linux/skbuff.h
> >> +++ b/include/linux/skbuff.h
> >> @@ -581,7 +581,10 @@ struct skb_shared_info {
> >> /* Warning: this field is not always filled in (UFO)! */
> >> unsigned short gso_segs;
> >> struct sk_buff *frag_list;
> >> - struct skb_shared_hwtstamps hwtstamps;
> >> + union {
> >> + struct skb_shared_hwtstamps hwtstamps;
> >> + struct xsk_tx_metadata *xsk_meta;
> >> + };
> >> unsigned int gso_type;
> >> u32 tskey;
> >>
> >> diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> >> index 467b9fb56827..288fa58c4665 100644
> >> --- a/include/net/xdp_sock.h
> >> +++ b/include/net/xdp_sock.h
> >> @@ -90,6 +90,54 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
> >> int __xsk_map_redirect(struct xdp_sock *xs, struct xdp_buff *xdp);
> >> void __xsk_map_flush(void);
> >>
> >> +/**
> >> + * xsk_tx_metadata_request - Evaluate AF_XDP TX metadata at submission
> >> + * and call appropriate xsk_tx_metadata_ops operation.
> >> + * @meta: pointer to AF_XDP metadata area
> >> + * @ops: pointer to struct xsk_tx_metadata_ops
> >> + * @priv: pointer to driver-private aread
> >> + *
> >> + * This function should be called by the networking device when
> >> + * it prepares AF_XDP egress packet.
> >> + */
> >> +static inline void xsk_tx_metadata_request(const struct xsk_tx_metadata *meta,
> >> + const struct xsk_tx_metadata_ops *ops,
> >> + void *priv)
> >> +{
> >> + if (!meta)
> >> + return;
> >> +
> >> + if (ops->tmo_request_timestamp)
> >> + if (meta->flags & XDP_TX_METADATA_TIMESTAMP)
> >> + ops->tmo_request_timestamp(priv);
> >> +
> >> + if (ops->tmo_request_checksum)
> >> + if (meta->flags & XDP_TX_METADATA_CHECKSUM)
> >> + ops->tmo_request_checksum(meta->csum_start, meta->csum_offset, priv);
> >> +}
> >
> > Might be cheaper to test the flag in the hot cacheline before
> > dereferencing ops?
> >
>
> I was thinking the same thing, but I was wrong, see below, as these ops
> deref's are optimized out by the compiler.
>
> > Also, just add these functions to net_device_ops directly,
> > rather than dereferencing another pointer to xsk_tx_metadata_ops?
> >
>
> After the ASM/objdump discussion[1] in this thread, I think Stanislav's
> code approach here is actually optimal. Because when the 'const' ops are
> defined locally in the same file that use/does the inlining of
> xsk_tx_metadata_request() then the compilers (both llvm and gcc) are
> smart enough to inline and do dead-code elimination. E.g. I noticed
> that generated ASM code eliminated `if (ops->tmo_request_timestamp)`
> code path in mlx5 as its not implemented.
>
> [1]
> https://lore.kernel.org/all/cce9db50-8c9d-ea97-cb88-171fa46cc064@redhat.com/
>
> --Jesper
>
Agreed. That is pretty slick :-) I had entirely missed that.
^ permalink raw reply [flat|nested] 30+ messages in thread
end of thread, other threads:[~2023-07-28 13:20 UTC | newest]
Thread overview: 30+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-07-24 23:59 [xdp-hints] [RFC net-next v4 0/8] xsk: TX metadata Stanislav Fomichev
2023-07-24 23:59 ` [xdp-hints] [RFC net-next v4 1/8] xsk: Support XDP_TX_METADATA_LEN Stanislav Fomichev
2023-07-24 23:59 ` [xdp-hints] [RFC net-next v4 2/8] xsk: add TX timestamp and TX checksum offload support Stanislav Fomichev
2023-07-25 19:28 ` [xdp-hints] " Simon Horman
2023-07-25 20:30 ` Stanislav Fomichev
2023-07-25 21:28 ` Jakub Kicinski
2023-07-25 22:40 ` Stanislav Fomichev
2023-07-26 7:47 ` Simon Horman
2023-07-25 20:54 ` Willem de Bruijn
2023-07-25 22:39 ` Stanislav Fomichev
2023-07-25 23:10 ` Willem de Bruijn
2023-07-25 23:48 ` Stanislav Fomichev
2023-07-27 14:11 ` Jesper Dangaard Brouer
2023-07-27 16:37 ` Stanislav Fomichev
2023-07-25 20:58 ` Willem de Bruijn
2023-07-28 11:56 ` Jesper Dangaard Brouer
2023-07-28 13:19 ` Willem de Bruijn
2023-07-26 19:25 ` Jesper Dangaard Brouer
2023-07-26 21:25 ` Stanislav Fomichev
2023-07-27 13:50 ` Jesper Dangaard Brouer
2023-07-27 16:34 ` Stanislav Fomichev
2023-07-24 23:59 ` [xdp-hints] [RFC net-next v4 3/8] net/mlx5e: Implement AF_XDP TX timestamp and checksum offload Stanislav Fomichev
2023-07-24 23:59 ` [xdp-hints] [RFC net-next v4 4/8] tools: ynl: update netdev sample to dump xsk-features Stanislav Fomichev
2023-07-24 23:59 ` [xdp-hints] [RFC net-next v4 5/8] selftests/xsk: Support XDP_TX_METADATA_LEN Stanislav Fomichev
2023-07-24 23:59 ` [xdp-hints] [RFC net-next v4 6/8] selftests/bpf: Add csum helpers Stanislav Fomichev
2023-07-24 23:59 ` [xdp-hints] [RFC net-next v4 7/8] selftests/bpf: Add TX side to xdp_metadata Stanislav Fomichev
2023-07-24 23:59 ` [xdp-hints] [RFC net-next v4 8/8] selftests/bpf: Add TX side to xdp_hw_metadata Stanislav Fomichev
2023-07-25 20:59 ` [xdp-hints] " Willem de Bruijn
2023-07-25 22:36 ` Stanislav Fomichev
2023-07-25 22:55 ` Willem de Bruijn
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox