From: Jesper Dangaard Brouer <hawk@kernel.org>
To: Stanislav Fomichev <sdf@google.com>, bpf@vger.kernel.org
Cc: hawk@kernel.org, ast@kernel.org, daniel@iogearbox.net,
andrii@kernel.org, martin.lau@linux.dev, song@kernel.org,
yhs@fb.com, john.fastabend@gmail.com, kpsingh@kernel.org,
haoluo@google.com, jolsa@kernel.org, kuba@kernel.org,
toke@kernel.org, willemb@google.com, dsahern@kernel.org,
magnus.karlsson@intel.com, bjorn@kernel.org,
maciej.fijalkowski@intel.com, yoong.siang.song@intel.com,
netdev@vger.kernel.org, xdp-hints@xdp-project.net
Subject: [xdp-hints] Re: [PATCH bpf-next v3 09/10] selftests/bpf: Add TX side to xdp_hw_metadata
Date: Mon, 9 Oct 2023 10:12:34 +0200 [thread overview]
Message-ID: <532af030-f2a6-5569-549b-bbd012ad2640@kernel.org> (raw)
In-Reply-To: <20231003200522.1914523-10-sdf@google.com>
On 03/10/2023 22.05, Stanislav Fomichev wrote:
> When we get a packet on port 9091, we swap src/dst and send it out.
> At this point we also request the timestamp and checksum offloads.
>
> Checksum offload is verified by looking at the tcpdump on the other side.
> The tool prints pseudo-header csum and the final one it expects.
> The final checksum actually matches the incoming packets checksum
> because we only flip the src/dst and don't change the payload.
>
> Some other related changes:
> - switched to zerocopy mode by default; new flag can be used to force
> old behavior
> - request fixed tx_metadata_len headroom
> - some other small fixes (umem size, fill idx+i, etc)
>
> mvbz3:~# ./xdp_hw_metadata eth3
> ...
> 0x1062cb8: rx_desc[0]->addr=80100 addr=80100 comp_addr=80100
> rx_hash: 0x2E1B50B9 with RSS type:0x2A
> rx_timestamp: 1691436369532047139 (sec:1691436369.5320)
> XDP RX-time: 1691436369261756803 (sec:1691436369.2618) delta sec:-0.2703 (-270290.336 usec)
I guess system time isn't configured to be in sync with NIC HW time,
as delta is a negative offset.
> AF_XDP time: 1691436369261878839 (sec:1691436369.2619) delta sec:0.0001 (122.036 usec)
The AF_XDP time is also software system time and compared to XDP
RX-time, it shows a delta of 122 usec. This number indicate to me that
the CPU is likely configured with power saving sleep states.
> 0x1062cb8: ping-pong with csum=3b8e (want de7e) csum_start=54 csum_offset=6
> 0x1062cb8: complete tx idx=0 addr=10
> 0x1062cb8: tx_timestamp: 1691436369598419505 (sec:1691436369.5984)
Could we add something that we can relate tx_timestamp to?
Like we do with the other delta calculations, as it helps us to
understand/validate if the number we get back is sane. Like negative
offset aboves tells us that system time sync isn't configured, and that
system have configures C-states.
I suggest delta comparing "tx_timestamp" to "rx_timestamp", as they are
the same clock domain. It will tell us the total time spend from HW RX
to HW TX, counting all the time used by software "ping-pong".
1691436369.5984-1691436369.5320 = 0.0664 sec = 66.4 ms
When implementing this, it could be (1) practical to store the
"rx_timestamp" in the metadata area of the completion packet, or (2)
should we have a mechanism for external storage that can be keyed on the
umem "addr"?
> 0x1062cb8: complete rx idx=128 addr=80100
>
> mvbz4:~# nc -Nu -q1 ${MVBZ3_LINK_LOCAL_IP}%eth3 9091
>
> mvbz4:~# tcpdump -vvx -i eth3 udp
> tcpdump: listening on eth3, link-type EN10MB (Ethernet), snapshot length 262144 bytes
> 12:26:09.301074 IP6 (flowlabel 0x35fa5, hlim 127, next-header UDP (17) payload length: 11) fe80::1270:fdff:fe48:1087.55807 > fe80::1270:fdff:fe48:1077.9091: [bad udp cksum 0x3b8e -> 0xde7e!] UDP, length 3
> 0x0000: 6003 5fa5 000b 117f fe80 0000 0000 0000
> 0x0010: 1270 fdff fe48 1087 fe80 0000 0000 0000
> 0x0020: 1270 fdff fe48 1077 d9ff 2383 000b 3b8e
> 0x0030: 7864 70
> 12:26:09.301976 IP6 (flowlabel 0x35fa5, hlim 127, next-header UDP (17) payload length: 11) fe80::1270:fdff:fe48:1077.9091 > fe80::1270:fdff:fe48:1087.55807: [udp sum ok] UDP, length 3
> 0x0000: 6003 5fa5 000b 117f fe80 0000 0000 0000
> 0x0010: 1270 fdff fe48 1077 fe80 0000 0000 0000
> 0x0020: 1270 fdff fe48 1087 2383 d9ff 000b de7e
> 0x0030: 7864 70
>
> Signed-off-by: Stanislav Fomichev <sdf@google.com>
> ---
> tools/testing/selftests/bpf/xdp_hw_metadata.c | 202 +++++++++++++++++-
> 1 file changed, 192 insertions(+), 10 deletions(-)
>
> diff --git a/tools/testing/selftests/bpf/xdp_hw_metadata.c b/tools/testing/selftests/bpf/xdp_hw_metadata.c
> index 613321eb84c1..ab83d0ba6763 100644
> --- a/tools/testing/selftests/bpf/xdp_hw_metadata.c
> +++ b/tools/testing/selftests/bpf/xdp_hw_metadata.c
> @@ -10,7 +10,9 @@
> * - rx_hash
> *
> * TX:
> - * - TBD
> + * - UDP 9091 packets trigger TX reply
> + * - TX HW timestamp is requested and reported back upon completion
> + * - TX checksum is requested
> */
>
> #include <test_progs.h>
> @@ -24,14 +26,17 @@
[...]
> @@ -51,22 +56,24 @@ struct xsk *rx_xsk;
[...]
> @@ -129,12 +136,22 @@ static void refill_rx(struct xsk *xsk, __u64 addr)
[...]
> @@ -228,6 +245,117 @@ static void verify_skb_metadata(int fd)
> printf("skb hwtstamp is not found!\n");
> }
>
> +static bool complete_tx(struct xsk *xsk)
> +{
> + struct xsk_tx_metadata *meta;
> + __u64 addr;
> + void *data;
> + __u32 idx;
> +
> + if (!xsk_ring_cons__peek(&xsk->comp, 1, &idx))
> + return false;
> +
> + addr = *xsk_ring_cons__comp_addr(&xsk->comp, idx);
> + data = xsk_umem__get_data(xsk->umem_area, addr);
> + meta = data - sizeof(struct xsk_tx_metadata);
> +
> + printf("%p: complete tx idx=%u addr=%llx\n", xsk, idx, addr);
> + printf("%p: tx_timestamp: %llu (sec:%0.4f)\n", xsk,
> + meta->completion.tx_timestamp,
> + (double)meta->completion.tx_timestamp / NANOSEC_PER_SEC);
> + xsk_ring_cons__release(&xsk->comp, 1);
> +
> + return true;
> +}
> +
> +#define swap(a, b, len) do { \
> + for (int i = 0; i < len; i++) { \
> + __u8 tmp = ((__u8 *)a)[i]; \
> + ((__u8 *)a)[i] = ((__u8 *)b)[i]; \
> + ((__u8 *)b)[i] = tmp; \
> + } \
> +} while (0)
> +
> +static void ping_pong(struct xsk *xsk, void *rx_packet)
> +{
> + struct xsk_tx_metadata *meta;
> + struct ipv6hdr *ip6h = NULL;
> + struct iphdr *iph = NULL;
> + struct xdp_desc *tx_desc;
> + struct udphdr *udph;
> + struct ethhdr *eth;
> + __sum16 want_csum;
> + void *data;
> + __u32 idx;
> + int ret;
> + int len;
> +
> + ret = xsk_ring_prod__reserve(&xsk->tx, 1, &idx);
> + if (ret != 1) {
> + printf("%p: failed to reserve tx slot\n", xsk);
> + return;
> + }
> +
> + tx_desc = xsk_ring_prod__tx_desc(&xsk->tx, idx);
> + tx_desc->addr = idx % (UMEM_NUM / 2) * UMEM_FRAME_SIZE + sizeof(struct xsk_tx_metadata);
> + data = xsk_umem__get_data(xsk->umem_area, tx_desc->addr);
> +
> + meta = data - sizeof(struct xsk_tx_metadata);
> + memset(meta, 0, sizeof(*meta));
> + meta->flags = XDP_TX_METADATA_TIMESTAMP;
> +
> + eth = rx_packet;
> +
> + if (eth->h_proto == htons(ETH_P_IP)) {
> + iph = (void *)(eth + 1);
> + udph = (void *)(iph + 1);
> + } else if (eth->h_proto == htons(ETH_P_IPV6)) {
> + ip6h = (void *)(eth + 1);
> + udph = (void *)(ip6h + 1);
> + } else {
> + printf("%p: failed to detect IP version for ping pong %04x\n", xsk, eth->h_proto);
> + xsk_ring_prod__cancel(&xsk->tx, 1);
> + return;
> + }
> +
> + len = ETH_HLEN;
> + if (ip6h)
> + len += sizeof(*ip6h) + ntohs(ip6h->payload_len);
> + if (iph)
> + len += ntohs(iph->tot_len);
> +
> + swap(eth->h_dest, eth->h_source, ETH_ALEN);
> + if (iph)
> + swap(&iph->saddr, &iph->daddr, 4);
> + else
> + swap(&ip6h->saddr, &ip6h->daddr, 16);
> + swap(&udph->source, &udph->dest, 2);
> +
> + want_csum = udph->check;
> + if (ip6h)
> + udph->check = ~csum_ipv6_magic(&ip6h->saddr, &ip6h->daddr,
> + ntohs(udph->len), IPPROTO_UDP, 0);
> + else
> + udph->check = ~csum_tcpudp_magic(iph->saddr, iph->daddr,
> + ntohs(udph->len), IPPROTO_UDP, 0);
> +
> + meta->flags |= XDP_TX_METADATA_CHECKSUM;
> + if (iph)
> + meta->csum_start = sizeof(*eth) + sizeof(*iph);
> + else
> + meta->csum_start = sizeof(*eth) + sizeof(*ip6h);
> + meta->csum_offset = offsetof(struct udphdr, check);
> +
> + printf("%p: ping-pong with csum=%04x (want %04x) csum_start=%d csum_offset=%d\n",
> + xsk, ntohs(udph->check), ntohs(want_csum), meta->csum_start, meta->csum_offset);
> +
> + memcpy(data, rx_packet, len); /* don't share umem chunk for simplicity */
> + tx_desc->options |= XDP_TX_METADATA;
> + tx_desc->len = len;
> +
> + xsk_ring_prod__submit(&xsk->tx, 1);
> +}
> +
> static int verify_metadata(struct xsk *rx_xsk, int rxq, int server_fd, clockid_t clock_id)
> {
> const struct xdp_desc *rx_desc;
> @@ -250,6 +378,13 @@ static int verify_metadata(struct xsk *rx_xsk, int rxq, int server_fd, clockid_t
>
> while (true) {
> errno = 0;
> +
> + for (i = 0; i < rxq; i++) {
> + ret = kick_rx(&rx_xsk[i]);
> + if (ret)
> + printf("kick_rx ret=%d\n", ret);
> + }
> +
> ret = poll(fds, rxq + 1, 1000);
> printf("poll: %d (%d) skip=%llu fail=%llu redir=%llu\n",
> ret, errno, bpf_obj->bss->pkts_skip,
> @@ -280,6 +415,22 @@ static int verify_metadata(struct xsk *rx_xsk, int rxq, int server_fd, clockid_t
> xsk, idx, rx_desc->addr, addr, comp_addr);
> verify_xdp_metadata(xsk_umem__get_data(xsk->umem_area, addr),
> clock_id);
> +
> + if (!skip_tx) {
> + /* mirror the packet back */
> + ping_pong(xsk, xsk_umem__get_data(xsk->umem_area, addr));
> +
> + ret = kick_tx(xsk);
> + if (ret)
> + printf("kick_tx ret=%d\n", ret);
> +
> + for (int j = 0; j < 500; j++) {
> + if (complete_tx(xsk))
> + break;
> + usleep(10*1000);
I don't fully follow why we need this usleep here.
> + }
> + }
> +
> xsk_ring_cons__release(&xsk->rx, 1);
> refill_rx(xsk, comp_addr);
> }
> @@ -404,21 +555,52 @@ static void timestamping_enable(int fd, int val)
next prev parent reply other threads:[~2023-10-09 8:12 UTC|newest]
Thread overview: 25+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-10-03 20:05 [xdp-hints] [PATCH bpf-next v3 00/10] xsk: TX metadata Stanislav Fomichev
2023-10-03 20:05 ` [xdp-hints] [PATCH bpf-next v3 01/10] xsk: Support tx_metadata_len Stanislav Fomichev
2023-10-03 20:05 ` [xdp-hints] [PATCH bpf-next v3 02/10] xsk: add TX timestamp and TX checksum offload support Stanislav Fomichev
2023-10-04 6:18 ` [xdp-hints] " Song, Yoong Siang
2023-10-04 17:48 ` Stanislav Fomichev
2023-10-04 17:56 ` Stanislav Fomichev
2023-10-05 1:16 ` Song, Yoong Siang
2023-10-03 20:05 ` [xdp-hints] [PATCH bpf-next v3 03/10] tools: ynl: print xsk-features from the sample Stanislav Fomichev
2023-10-03 20:05 ` [xdp-hints] [PATCH bpf-next v3 04/10] net/mlx5e: Implement AF_XDP TX timestamp and checksum offload Stanislav Fomichev
2023-10-04 23:47 ` [xdp-hints] " kernel test robot
2023-10-03 20:05 ` [xdp-hints] [PATCH bpf-next v3 05/10] net: stmmac: Add Tx HWTS support to XDP ZC Stanislav Fomichev
2023-10-04 23:05 ` [xdp-hints] " kernel test robot
2023-10-04 23:14 ` Stanislav Fomichev
2023-10-06 4:38 ` kernel test robot
2023-10-03 20:05 ` [xdp-hints] [PATCH bpf-next v3 06/10] selftests/xsk: Support tx_metadata_len Stanislav Fomichev
2023-10-03 20:05 ` [xdp-hints] [PATCH bpf-next v3 07/10] selftests/bpf: Add csum helpers Stanislav Fomichev
2023-10-03 20:05 ` [xdp-hints] [PATCH bpf-next v3 08/10] selftests/bpf: Add TX side to xdp_metadata Stanislav Fomichev
2023-10-03 20:05 ` [xdp-hints] [PATCH bpf-next v3 09/10] selftests/bpf: Add TX side to xdp_hw_metadata Stanislav Fomichev
2023-10-09 8:12 ` Jesper Dangaard Brouer [this message]
2023-10-09 16:37 ` [xdp-hints] " Stanislav Fomichev
2023-10-10 20:40 ` Stanislav Fomichev
2023-10-13 1:13 ` Song, Yoong Siang
2023-10-13 18:47 ` Stanislav Fomichev
2023-10-15 13:28 ` Song, Yoong Siang
2023-10-03 20:05 ` [xdp-hints] [PATCH bpf-next v3 10/10] xsk: document tx_metadata_len layout Stanislav Fomichev
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: https://lists.xdp-project.net/postorius/lists/xdp-hints.xdp-project.net/
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=532af030-f2a6-5569-549b-bbd012ad2640@kernel.org \
--to=hawk@kernel.org \
--cc=andrii@kernel.org \
--cc=ast@kernel.org \
--cc=bjorn@kernel.org \
--cc=bpf@vger.kernel.org \
--cc=daniel@iogearbox.net \
--cc=dsahern@kernel.org \
--cc=haoluo@google.com \
--cc=john.fastabend@gmail.com \
--cc=jolsa@kernel.org \
--cc=kpsingh@kernel.org \
--cc=kuba@kernel.org \
--cc=maciej.fijalkowski@intel.com \
--cc=magnus.karlsson@intel.com \
--cc=martin.lau@linux.dev \
--cc=netdev@vger.kernel.org \
--cc=sdf@google.com \
--cc=song@kernel.org \
--cc=toke@kernel.org \
--cc=willemb@google.com \
--cc=xdp-hints@xdp-project.net \
--cc=yhs@fb.com \
--cc=yoong.siang.song@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox