XDP hardware hints discussion mail archive
 help / color / mirror / Atom feed
From: Stanislav Fomichev <sdf@google.com>
To: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Cc: bpf <bpf@vger.kernel.org>, "Alexei Starovoitov" <ast@kernel.org>,
	"Daniel Borkmann" <daniel@iogearbox.net>,
	"Andrii Nakryiko" <andrii@kernel.org>,
	"Martin KaFai Lau" <martin.lau@linux.dev>,
	"Song Liu" <song@kernel.org>, "Yonghong Song" <yhs@fb.com>,
	"John Fastabend" <john.fastabend@gmail.com>,
	"KP Singh" <kpsingh@kernel.org>, "Hao Luo" <haoluo@google.com>,
	"Jiri Olsa" <jolsa@kernel.org>,
	"Jakub Kicinski" <kuba@kernel.org>,
	"Toke Høiland-Jørgensen" <toke@kernel.org>,
	"Willem de Bruijn" <willemb@google.com>,
	"David Ahern" <dsahern@kernel.org>,
	"Karlsson, Magnus" <magnus.karlsson@intel.com>,
	"Björn Töpel" <bjorn@kernel.org>,
	"Fijalkowski, Maciej" <maciej.fijalkowski@intel.com>,
	"Jesper Dangaard Brouer" <hawk@kernel.org>,
	"Network Development" <netdev@vger.kernel.org>,
	xdp-hints@xdp-project.net
Subject: [xdp-hints] Re: [RFC bpf-next v3 09/14] net/mlx5e: Implement devtx kfuncs
Date: Tue, 11 Jul 2023 17:14:54 -0700	[thread overview]
Message-ID: <CAKH8qBvnMd2JgobQf1bvc=x7uEn1RPVHcuu3F7gB6vS627g-Xg@mail.gmail.com> (raw)
In-Reply-To: <CAADnVQKnWCYjOQA-=61pDP4TQ-LKC7S-tOSX9Lm6tB3vJcf4dw@mail.gmail.com>

On Tue, Jul 11, 2023 at 4:45 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Jul 11, 2023 at 4:25 PM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > On Tue, Jul 11, 2023 at 3:57 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Fri, Jul 07, 2023 at 12:30:01PM -0700, Stanislav Fomichev wrote:
> > > > +
> > > > +static int mlx5e_devtx_request_l4_checksum(const struct devtx_ctx *_ctx,
> > > > +                                        u16 csum_start, u16 csum_offset)
> > > > +{
> > > > +     const struct mlx5e_devtx_ctx *ctx = (void *)_ctx;
> > > > +     struct mlx5_wqe_eth_seg *eseg;
> > > > +
> > > > +     if (unlikely(!ctx->wqe))
> > > > +             return -ENODATA;
> > > > +
> > > > +     eseg = &ctx->wqe->eth;
> > > > +
> > > > +     switch (csum_offset) {
> > > > +     case sizeof(struct ethhdr) + sizeof(struct iphdr) + offsetof(struct udphdr, check):
> > > > +     case sizeof(struct ethhdr) + sizeof(struct ipv6hdr) + offsetof(struct udphdr, check):
> > > > +             /* Looks like HW/FW is doing parsing, so offsets are largely ignored. */
> > > > +             eseg->cs_flags = MLX5_ETH_WQE_L3_CSUM | MLX5_ETH_WQE_L4_CSUM;
> > > > +             break;
> > > > +     default:
> > > > +             return -EINVAL;
> > > > +     }
> > >
> > > I think this proves my point: csum is not generalizable even across veth and mlx5.
> > > Above is a square peg that tries to fit csum_start/offset api (that makes sense from SW pov)
> > > into HW that has different ideas about csum-ing.
> > >
> > > Here is what mlx5 does:
> > > mlx5e_txwqe_build_eseg_csum(struct mlx5e_txqsq *sq, struct sk_buff *skb,
> > >                             struct mlx5e_accel_tx_state *accel,
> > >                             struct mlx5_wqe_eth_seg *eseg)
> > > {
> > >         if (unlikely(mlx5e_ipsec_txwqe_build_eseg_csum(sq, skb, eseg)))
> > >                 return;
> > >
> > >         if (likely(skb->ip_summed == CHECKSUM_PARTIAL)) {
> > >                 eseg->cs_flags = MLX5_ETH_WQE_L3_CSUM;
> > >                 if (skb->encapsulation) {
> > >                         eseg->cs_flags |= MLX5_ETH_WQE_L3_INNER_CSUM |
> > >                                           MLX5_ETH_WQE_L4_INNER_CSUM;
> > >                         sq->stats->csum_partial_inner++;
> > >                 } else {
> > >                         eseg->cs_flags |= MLX5_ETH_WQE_L4_CSUM;
> > >                         sq->stats->csum_partial++;
> > >                 }
> > >
> > > How would you generalize that into csum api that will work across NICs ?
> > >
> > > My answer stands: you cannot.
> > >
> > > My proposal again:
> > > add driver specifc kfuncs and hooks for things like csum.
> >
> > I do see your point, but to also give you my perspective: I have no
> > clue what those _CSUM tx bits do (as a non-mlx employee). And what
> > kind of packets they support (initial patch doesn't give any info).
> > We can definitely expose mlx5 specific request_l4_checksum(bool encap)
> > which does things similar to mlx5e_txwqe_build_eseg_csum, but then,
> > what does it _actually_ do? It obviously can't checksum arbitrary
> > packet formats (because it has this inner/outer selection bit), so
> > there is really no way for me to provide a per-driver kfunc api. Maybe
> > the vendors can?
> >
> > So having csum_start/csum_offset abstraction which works with fixed
> > offsets seems like at least it correctly sets the expectation for BPF
> > program writers.
> > The vendors are already supposed to conform to this start/offset API for skb.
> >
> > But back to your point: should we maybe try to meet somewhere in the middle?
> > 1. We try to provide "generic" offload kfuncs; for mlx5, we'll have
> > this mlx5e_devtx_request_l4_checksum which works for fixed offsets
>
> But it doesn't.
> Even if you add a check for csum_start (that's missing in the patch)
> there need to be a way to somehow figure out
> whether skb->encapsulation is true and set appropriate flags.
> Otherwise this request csum will do "something" that only the HW vendor knows.
> That would be even harder to debug for bpf prog writers.
>
> So instead of helping bpf prog devs it will actively hurt them.

Can we make it more robust? The device can look at the payload (via
descriptors or extra payload pointer via devtx_ctx) and verify
h_proto/nexthdr.
It won't be perfect, I agree, but we can get it working for the common
cases (and have device-specific kfuncs for the rest).

> Another example. If bpf prog was developed and tested on veth
> it will work for some values of csum_offset on real HW and will -EINVAL
> for the other values.
> Just horrible user experience comparing to the case where
> the user knows that each netdev is potentially different and
> _has_ to develop and test their prog on the given HW NIC and
> not assume that the kernel can "do the right thing".

For this, I was actually thinking that we need to provide some
SW-based fallback mechanism.
Because if I have a program and a nic that doesn't have an offload
implemented at all, having a fallback might be useful:

if (bpf_devtx_request_l4_csum(...)) {
  /* oops, hw bailed on us, fallback to sw and expose a counter */
  bpf_devtx_l4_csum_slowpath(csum_start, csum_offset, data, len);
  pkt_sw_csum++;
}

This is probably needed regardless of which way we do it?

Regarding veth vs non-veth: we already have similar issues with
generic xdp vs non-generic.
I'm not sure we can completely avoid having surprises when switching
from sw to hw paths.
It's whether the users will have to debug 10-20% of their program or
they'd have to start completely from scratch for every nic.

> This csum exercise is clear example that kernel is not in a position
> to do so.
> For timestamp it's arguable, but for csum there is no generic api that
> kernel can apply universally to NICs.

Sure, I agree, it's a mix of both. For some offloads, we can have
something common, for some we can't.
But I'm not sure why we have to pick one or another. We can try to
have common apis (maybe not ideal, yes) and we can expose vendor
specific ones if there is need.
If the generic ones get unused - we kill them in the future. If none
of the vendors comes up with non-generic ones - the generic ones are
good enough.

I'm assuming you favor non-generic ones because it's easier to implement?
But most of the netdev-bound infra is already there for rx, so there
is really not a lot of extra code to reuse it at tx. (it's two lines
to allow tracing progs to be dev-bound and to check whether devices
match at attach).
Or is there some other reason I'm missing?

> > 2. We also let vendors do device-specific "extensions" where devices
> > deviate too much: bpf_request_RAW_mlx5e_l4_checksum(bool encap)
> > This can give BPF authors opportunity to write somewhat portable
> > programs and also use vendor specific apis when/if needed.
> >
> > I think we had a similar idea for rx side: have generic kfuncs, but
> > also let vendors experiment with custom kfuncs if they want to
> > differentiate.
> > WDYT? Can it give us the best things from both sides?
> >
> > > Kuba,
> > > since you nacked driver specific stuff please suggest a way to unblock this stalemate.

  reply	other threads:[~2023-07-12  0:15 UTC|newest]

Thread overview: 34+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-07-07 19:29 [xdp-hints] [RFC bpf-next v3 00/14] bpf: Netdev TX metadata Stanislav Fomichev
2023-07-07 19:29 ` [xdp-hints] [RFC bpf-next v3 01/14] bpf: Rename some xdp-metadata functions into dev-bound Stanislav Fomichev
2023-07-07 19:29 ` [xdp-hints] [RFC bpf-next v3 02/14] bpf: Make it easier to add new metadata kfunc Stanislav Fomichev
2023-07-07 19:29 ` [xdp-hints] [RFC bpf-next v3 03/14] xsk: Support XDP_TX_METADATA_LEN Stanislav Fomichev
2023-07-07 19:29 ` [xdp-hints] [RFC bpf-next v3 04/14] bpf: Implement devtx hook points Stanislav Fomichev
2023-07-07 19:29 ` [xdp-hints] [RFC bpf-next v3 05/14] bpf: Implement devtx timestamp kfunc Stanislav Fomichev
2023-07-07 19:29 ` [xdp-hints] [RFC bpf-next v3 06/14] net: veth: Implement devtx timestamp kfuncs Stanislav Fomichev
2023-07-07 19:29 ` [xdp-hints] [RFC bpf-next v3 07/14] bpf: Introduce tx checksum devtx kfuncs Stanislav Fomichev
2023-07-07 19:30 ` [xdp-hints] [RFC bpf-next v3 08/14] net: veth: Implement devtx tx checksum Stanislav Fomichev
2023-07-07 19:30 ` [xdp-hints] [RFC bpf-next v3 09/14] net/mlx5e: Implement devtx kfuncs Stanislav Fomichev
2023-07-11 22:56   ` [xdp-hints] " Alexei Starovoitov
2023-07-11 23:24     ` Stanislav Fomichev
2023-07-11 23:45       ` Alexei Starovoitov
2023-07-12  0:14         ` Stanislav Fomichev [this message]
2023-07-12  2:50           ` Alexei Starovoitov
2023-07-12  3:29             ` Stanislav Fomichev
2023-07-12  4:59               ` Alexei Starovoitov
2023-07-12  5:36                 ` Stanislav Fomichev
2023-07-12 15:16                   ` Willem de Bruijn
2023-07-12 16:28                     ` Willem de Bruijn
2023-07-12 19:03                     ` Alexei Starovoitov
2023-07-12 19:11                       ` Willem de Bruijn
2023-07-12 19:42                         ` Alexei Starovoitov
2023-07-12 20:09                           ` Jakub Kicinski
2023-07-12 20:53                             ` Stanislav Fomichev
2023-07-12  0:32     ` Jakub Kicinski
2023-07-12  2:37       ` Alexei Starovoitov
2023-07-12  3:07         ` Jakub Kicinski
2023-07-12  3:23           ` Alexei Starovoitov
2023-07-07 19:30 ` [xdp-hints] [RFC bpf-next v3 10/14] selftests/xsk: Support XDP_TX_METADATA_LEN Stanislav Fomichev
2023-07-07 19:30 ` [xdp-hints] [RFC bpf-next v3 11/14] selftests/bpf: Add helper to query current netns cookie Stanislav Fomichev
2023-07-07 19:30 ` [xdp-hints] [RFC bpf-next v3 12/14] selftests/bpf: Add csum helpers Stanislav Fomichev
2023-07-07 19:30 ` [xdp-hints] [RFC bpf-next v3 13/14] selftests/bpf: Extend xdp_metadata with devtx kfuncs Stanislav Fomichev
2023-07-07 19:30 ` [xdp-hints] [RFC bpf-next v3 14/14] selftests/bpf: Extend xdp_hw_metadata " Stanislav Fomichev

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://lists.xdp-project.net/postorius/lists/xdp-hints.xdp-project.net/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAKH8qBvnMd2JgobQf1bvc=x7uEn1RPVHcuu3F7gB6vS627g-Xg@mail.gmail.com' \
    --to=sdf@google.com \
    --cc=alexei.starovoitov@gmail.com \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=bjorn@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=dsahern@kernel.org \
    --cc=haoluo@google.com \
    --cc=hawk@kernel.org \
    --cc=john.fastabend@gmail.com \
    --cc=jolsa@kernel.org \
    --cc=kpsingh@kernel.org \
    --cc=kuba@kernel.org \
    --cc=maciej.fijalkowski@intel.com \
    --cc=magnus.karlsson@intel.com \
    --cc=martin.lau@linux.dev \
    --cc=netdev@vger.kernel.org \
    --cc=song@kernel.org \
    --cc=toke@kernel.org \
    --cc=willemb@google.com \
    --cc=xdp-hints@xdp-project.net \
    --cc=yhs@fb.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox