XDP hardware hints discussion mail archive
 help / color / mirror / Atom feed
From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
To: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Cc: "Stanislav Fomichev" <sdf@google.com>, bpf <bpf@vger.kernel.org>,
	"Alexei Starovoitov" <ast@kernel.org>,
	"Daniel Borkmann" <daniel@iogearbox.net>,
	"Andrii Nakryiko" <andrii@kernel.org>,
	"Martin KaFai Lau" <martin.lau@linux.dev>,
	"Song Liu" <song@kernel.org>, "Yonghong Song" <yhs@fb.com>,
	"John Fastabend" <john.fastabend@gmail.com>,
	"KP Singh" <kpsingh@kernel.org>, "Hao Luo" <haoluo@google.com>,
	"Jiri Olsa" <jolsa@kernel.org>,
	"Jakub Kicinski" <kuba@kernel.org>,
	"Toke Høiland-Jørgensen" <toke@kernel.org>,
	"Willem de Bruijn" <willemb@google.com>,
	"David Ahern" <dsahern@kernel.org>,
	"Karlsson, Magnus" <magnus.karlsson@intel.com>,
	"Björn Töpel" <bjorn@kernel.org>,
	"Fijalkowski, Maciej" <maciej.fijalkowski@intel.com>,
	"Jesper Dangaard Brouer" <hawk@kernel.org>,
	"Network Development" <netdev@vger.kernel.org>,
	xdp-hints@xdp-project.net
Subject: [xdp-hints] Re: [RFC bpf-next v3 09/14] net/mlx5e: Implement devtx kfuncs
Date: Wed, 12 Jul 2023 12:03:42 -0700	[thread overview]
Message-ID: <20230712190342.dlgwh6uka5bcjfkl@macbook-pro-8.dhcp.thefacebook.com> (raw)
In-Reply-To: <CAF=yD-LO=LDWhKM--r9F119-J_9v-Znm4saxFrhhxhMV6nnmJQ@mail.gmail.com>

On Wed, Jul 12, 2023 at 11:16:04AM -0400, Willem de Bruijn wrote:
> On Wed, Jul 12, 2023 at 1:36 AM Stanislav Fomichev <sdf@google.com> wrote:
> >
> > On Tue, Jul 11, 2023 at 9:59 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Tue, Jul 11, 2023 at 8:29 PM Stanislav Fomichev <sdf@google.com> wrote:
> > > >
> > > >
> > > > This will slow things down, but not to the point where it's on par
> > > > with doing sw checksum. At least in theory.
> > > > We can't stay at skb when using AF_XDP. AF_XDP would benefit from having
> > > > the offloads.
> > >
> > > To clarify: yes, AF_XDP needs generalized HW offloads.
> >
> > Great! To reiterate, I'm mostly interested in af_xdp wrt tx
> > timestamps. So if the consensus is not to mix xdp-tx and af_xdp-tx,
> > I'm fine with switching to adding some fixed af_xdp descriptor format
> > to enable offloads on tx.

since af_xdp is a primary user let's figure out what is the best api for that.
If any code can be salvaged for xdp tx, great, but let's not start with xdp tx
as prerequisite.

> >
> > > I just don't see how xdp tx offloads are moving a needle in that direction.
> >
> > Let me try to explain how both might be similar, maybe I wasn't clear
> > enough on that.
> > For af_xdp tx packet, the userspace puts something in the af_xdp frame
> > metadata area (headrom) which then gets executed/interpreted by the
> > bpf program at devtx (which calls kfuncs to enable particular
> > offloads).
> > IOW, instead of defining some fixed layout for the tx offloads, the
> > userspace and bpf program have some agreement on the layout (and bpf
> > program "applies" the offloads by calling the kfuncs).
> > Also (in theory) the same hooks can be used for xdp-tx.
> > Does it make sense? But, again, happy to scratch that whole idea if
> > we're fine with a fixed layout for af_xdp.

So instead of defining csum offload format in xsk metadata we'll
defining it as a set of arguments to a kfunc and tx-side xsk prog
will just copy the args from metadata into kfunc args ?
Seems like an unnecesary step. Such xsk prog won't be doing
anything useful. Just copying from one place to another.
It seems the only purpose of such bpf prog is to side step uapi exposure.
bpf is not used to program anything. There won't be any control flow.
Just odd intermediate copy step.
Instead we can define a metadata struct for csum nic offload
outside of uapi/linux/if_xdp.h with big 'this is not an uapi' warning.
User space can request it via setsockopt.
And probably feature query the nic via getsockopt.

Error handling is critical here. With xsk tx prog the errors
are messy. What to do when kfunc returns error? Store it back into
packet metadata ? and then user space needs to check every single
packet for errors? Not practical imo.

Feature query via getsockopt would be done once instead and
user space will fill in "csum offload struct" in packet metadata
and won't check per-packet error. If driver said the csum feature
is available it's better work for every packet.
Notice mlx5e_txwqe_build_eseg_csum() returns void.

> 
> Checksum offload is an important demonstrator too.
> 
> It is admittedly a non-trivial one. Checksum offload has often been
> discussed as a pain point ("protocol ossification").
> 
> In general, drivers can accept every CHECKSUM_COMPLETE skb that
> matches their advertised feature NETIF_F_[HW|IP|IPV6]_CSUM. I don't
> see why this would be different for kfuncs for packets coming from
> userspace.
> 
> The problematic drivers are the ones that do not implement
> CHECKSUM_COMPLETE as intended, but ignore this simple
> protocol-independent hint in favor of parsing from scratch, possibly
> zeroing the field, computing multiple layers, etc.
> 
> All of which is unnecessary with LCO. An AF_XDP user can be expected
> to apply LCO and only request checksum insertion for the innermost
> checksum.
> 
> The biggest problem is with these devices that parse in hardware (and
> possibly also in the driver to identify and fix up hardware
> limitations) is that they will fail if encountering an unknown
> protocol. Which brings us to advertising limited typed support:
> NETIF_F_HW_CSUM vs NETIF_F_IP_CSUM.
> 
> The fact that some devices that deviate from industry best practices
> cannot support more advanced packet formats is unfortunate, but not a
> reason to hold others back. No different from current kernel path. The
> BPF program can fallback onto software checksumming on these devices,
> like the kernel path. Perhaps we do need to pass along with csum_start
> and csum_off a csum_type that matches the existing
> NETIF_F_[HW|IP|IPV6]_CSUM, to let drivers return with -EOPNOTSUPP
> quickly if for the generic case.
> 
> For implementation in essence it is just reordering driver code that
> already exists for the skb case. I think the ice patch series to
> support rx timestamping is a good indication of what it takes to
> support XDP kfuncs: not so much new code, but reordering the driver
> logic.
> 
> Which also indicates to me that the driver *is* the right place to
> implement this logic, rather than reimplement it in a BPF library. It
> avoids both code duplication and dependency hell, if the library ships
> independent from the driver.

Agree with all of the above.
I think defining CHECKSUM_PARTIAL struct request for af_xdp is doable and
won't require much changes in the drivers.
If we do it for more than one driver from the start there is a chance it
will work for other drivers too. imo ice+gve+mlx5 would be enough.

  parent reply	other threads:[~2023-07-12 19:03 UTC|newest]

Thread overview: 34+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-07-07 19:29 [xdp-hints] [RFC bpf-next v3 00/14] bpf: Netdev TX metadata Stanislav Fomichev
2023-07-07 19:29 ` [xdp-hints] [RFC bpf-next v3 01/14] bpf: Rename some xdp-metadata functions into dev-bound Stanislav Fomichev
2023-07-07 19:29 ` [xdp-hints] [RFC bpf-next v3 02/14] bpf: Make it easier to add new metadata kfunc Stanislav Fomichev
2023-07-07 19:29 ` [xdp-hints] [RFC bpf-next v3 03/14] xsk: Support XDP_TX_METADATA_LEN Stanislav Fomichev
2023-07-07 19:29 ` [xdp-hints] [RFC bpf-next v3 04/14] bpf: Implement devtx hook points Stanislav Fomichev
2023-07-07 19:29 ` [xdp-hints] [RFC bpf-next v3 05/14] bpf: Implement devtx timestamp kfunc Stanislav Fomichev
2023-07-07 19:29 ` [xdp-hints] [RFC bpf-next v3 06/14] net: veth: Implement devtx timestamp kfuncs Stanislav Fomichev
2023-07-07 19:29 ` [xdp-hints] [RFC bpf-next v3 07/14] bpf: Introduce tx checksum devtx kfuncs Stanislav Fomichev
2023-07-07 19:30 ` [xdp-hints] [RFC bpf-next v3 08/14] net: veth: Implement devtx tx checksum Stanislav Fomichev
2023-07-07 19:30 ` [xdp-hints] [RFC bpf-next v3 09/14] net/mlx5e: Implement devtx kfuncs Stanislav Fomichev
2023-07-11 22:56   ` [xdp-hints] " Alexei Starovoitov
2023-07-11 23:24     ` Stanislav Fomichev
2023-07-11 23:45       ` Alexei Starovoitov
2023-07-12  0:14         ` Stanislav Fomichev
2023-07-12  2:50           ` Alexei Starovoitov
2023-07-12  3:29             ` Stanislav Fomichev
2023-07-12  4:59               ` Alexei Starovoitov
2023-07-12  5:36                 ` Stanislav Fomichev
2023-07-12 15:16                   ` Willem de Bruijn
2023-07-12 16:28                     ` Willem de Bruijn
2023-07-12 19:03                     ` Alexei Starovoitov [this message]
2023-07-12 19:11                       ` Willem de Bruijn
2023-07-12 19:42                         ` Alexei Starovoitov
2023-07-12 20:09                           ` Jakub Kicinski
2023-07-12 20:53                             ` Stanislav Fomichev
2023-07-12  0:32     ` Jakub Kicinski
2023-07-12  2:37       ` Alexei Starovoitov
2023-07-12  3:07         ` Jakub Kicinski
2023-07-12  3:23           ` Alexei Starovoitov
2023-07-07 19:30 ` [xdp-hints] [RFC bpf-next v3 10/14] selftests/xsk: Support XDP_TX_METADATA_LEN Stanislav Fomichev
2023-07-07 19:30 ` [xdp-hints] [RFC bpf-next v3 11/14] selftests/bpf: Add helper to query current netns cookie Stanislav Fomichev
2023-07-07 19:30 ` [xdp-hints] [RFC bpf-next v3 12/14] selftests/bpf: Add csum helpers Stanislav Fomichev
2023-07-07 19:30 ` [xdp-hints] [RFC bpf-next v3 13/14] selftests/bpf: Extend xdp_metadata with devtx kfuncs Stanislav Fomichev
2023-07-07 19:30 ` [xdp-hints] [RFC bpf-next v3 14/14] selftests/bpf: Extend xdp_hw_metadata " Stanislav Fomichev

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://lists.xdp-project.net/postorius/lists/xdp-hints.xdp-project.net/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230712190342.dlgwh6uka5bcjfkl@macbook-pro-8.dhcp.thefacebook.com \
    --to=alexei.starovoitov@gmail.com \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=bjorn@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=dsahern@kernel.org \
    --cc=haoluo@google.com \
    --cc=hawk@kernel.org \
    --cc=john.fastabend@gmail.com \
    --cc=jolsa@kernel.org \
    --cc=kpsingh@kernel.org \
    --cc=kuba@kernel.org \
    --cc=maciej.fijalkowski@intel.com \
    --cc=magnus.karlsson@intel.com \
    --cc=martin.lau@linux.dev \
    --cc=netdev@vger.kernel.org \
    --cc=sdf@google.com \
    --cc=song@kernel.org \
    --cc=toke@kernel.org \
    --cc=willemb@google.com \
    --cc=willemdebruijn.kernel@gmail.com \
    --cc=xdp-hints@xdp-project.net \
    --cc=yhs@fb.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox