XDP hardware hints discussion mail archive
From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
To: Larysa Zaremba <larysa.zaremba@intel.com>
Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>,
	bpf <bpf@vger.kernel.org>, Alexei Starovoitov <ast@kernel.org>,
	Daniel Borkmann <daniel@iogearbox.net>,
	Andrii Nakryiko <andrii@kernel.org>,
	Martin KaFai Lau <martin.lau@linux.dev>,
	Song Liu <song@kernel.org>, Yonghong Song <yhs@fb.com>,
	John Fastabend <john.fastabend@gmail.com>,
	KP Singh <kpsingh@kernel.org>,
	Stanislav Fomichev <sdf@google.com>, Hao Luo <haoluo@google.com>,
	Jiri Olsa <jolsa@kernel.org>, David Ahern <dsahern@gmail.com>,
	Jakub Kicinski <kuba@kernel.org>,
	Willem de Bruijn <willemb@google.com>,
	Jesper Dangaard Brouer <brouer@redhat.com>,
	Anatoly Burakov <anatoly.burakov@intel.com>,
	Alexander Lobakin <alexandr.lobakin@intel.com>,
	Magnus Karlsson <magnus.karlsson@gmail.com>,
	Maryam Tahhan <mtahhan@redhat.com>,
	xdp-hints@xdp-project.net,
	Network Development <netdev@vger.kernel.org>,
	Simon Horman <simon.horman@corigine.com>
Subject: [xdp-hints] Re: [PATCH bpf-next v4 12/21] xdp: Add checksum hint
Date: Mon, 31 Jul 2023 18:03:26 -0700
Message-ID: <CAADnVQJPgpo7J0qVTQJYYocZ=Jnw=O5GfN2=PyAQ55+WWG_DVg@mail.gmail.com>
In-Reply-To: <ZMeSUrOfhq9dWz6f@lincoln>

On Mon, Jul 31, 2023 at 3:56 AM Larysa Zaremba <larysa.zaremba@intel.com> wrote:
>
> On Sun, Jul 30, 2023 at 09:13:02AM -0400, Willem de Bruijn wrote:
> > Alexei Starovoitov wrote:
> > > On Sat, Jul 29, 2023 at 9:15 AM Willem de Bruijn
> > > <willemdebruijn.kernel@gmail.com> wrote:
> > > >
> > > > Alexei Starovoitov wrote:
> > > > > On Fri, Jul 28, 2023 at 07:39:14PM +0200, Larysa Zaremba wrote:
> > > > > >
> > > > > > +union xdp_csum_info {
> > > > > > +   /* Checksum referred to by ``csum_start + csum_offset`` is considered
> > > > > > +    * valid, but was never calculated, TX device has to do this,
> > > > > > +    * starting from csum_start packet byte.
> > > > > > +    * Any preceding checksums are also considered valid.
> > > > > > +    * Available, if ``status == XDP_CHECKSUM_PARTIAL``.
> > > > > > +    */
> > > > > > +   struct {
> > > > > > +           u16 csum_start;
> > > > > > +           u16 csum_offset;
> > > > > > +   };
> > > > > > +
> > > > >
> > > > > > CHECKSUM_PARTIAL makes sense on TX, but this is RX. I don't see it in the above.
> > > >
> > > > It can be observed on RX when packets are looped.
> > > >
> > > > This may be observed even in XDP on veth.
> > >
> > > veth and XDP is a broken combination. GSO packets coming out of containers
> > > cannot be parsed properly by XDP.
> > > It was added mainly for testing. Just like "generic XDP".
> > > > bpf progs at the skb layer are a much better fit for veth.
> >
> > Ok. Still, seems forward looking and little cost to define the
> > constant?
> >
>
> +1
> CHECKSUM_PARTIAL is mostly for testing and removing/adding it doesn't change
> anything from the perspective of the user that does not use it, so I think it is
> worth having.

"little cost to define the constant".
Not really. A constant in UAPI is a heavy burden.
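
For reference, the ``csum_start``/``csum_offset`` pair in the quoted patch can be illustrated with a small user-space sketch. It assumes the usual 16-bit ones' complement Internet checksum, an even ``csum_offset``, and a checksum field that is zero (or pre-seeded) before the call. The helper names ``csum_fold_range`` and ``csum_partial_finish`` are hypothetical and come from neither the patch nor the kernel.

```c
#include <stddef.h>
#include <stdint.h>

/* Ones' complement sum of big-endian 16-bit words over buf[0..len). */
static uint16_t csum_fold_range(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += (uint32_t)buf[i] << 8 | buf[i + 1];
	if (i < len)			/* odd trailing byte, padded with zero */
		sum += (uint32_t)buf[i] << 8;
	while (sum >> 16)		/* fold carries back into 16 bits */
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Hypothetical helper: finish a "partial" checksum in software.
 * Sums from csum_start to the end of the packet and stores the
 * complemented result at csum_start + csum_offset.
 * Returns 0 on success, -1 if the offsets fall outside the packet.
 */
static int csum_partial_finish(uint8_t *pkt, size_t len,
			       uint16_t csum_start, uint16_t csum_offset)
{
	size_t field = (size_t)csum_start + csum_offset;
	uint16_t csum;

	if (csum_start > len || field + 2 > len)
		return -1;
	csum = (uint16_t)~csum_fold_range(pkt + csum_start, len - csum_start);
	pkt[field] = csum >> 8;
	pkt[field + 1] = csum & 0xff;
	return 0;
}
```

Once the field has been filled in, summing the same range again folds to 0xffff, which is how a receiver would recognise the checksum as valid.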

> > > > > > +   /* Checksum, calculated over the whole packet.
> > > > > > +    * Available, if ``status & XDP_CHECKSUM_COMPLETE``.
> > > > > > +    */
> > > > > > +   u32 checksum;
> > > > >
> > > > > imo XDP RX should only support XDP_CHECKSUM_COMPLETE with u32 checksum
> > > > > or XDP_CHECKSUM_UNNECESSARY.
> > > > >
> > > > > > +};
> > > > > > +
> > > > > > +enum xdp_csum_status {
> > > > > > +   /* HW had parsed several transport headers and validated their
> > > > > > +    * checksums, same as ``CHECKSUM_UNNECESSARY`` in ``sk_buff``.
> > > > > > > +    * 3 least significant bits contain number of consecutive checksums,
> > > > > > +    * starting with the outermost, reported by hardware as valid.
> > > > > > +    * ``sk_buff`` checksum level (``csum_level``) notation is provided
> > > > > > +    * for driver developers.
> > > > > > +    */
> > > > > > +   XDP_CHECKSUM_VALID_LVL0         = 1,    /* 1 outermost checksum */
> > > > > > +   XDP_CHECKSUM_VALID_LVL1         = 2,    /* 2 outermost checksums */
> > > > > > +   XDP_CHECKSUM_VALID_LVL2         = 3,    /* 3 outermost checksums */
> > > > > > +   XDP_CHECKSUM_VALID_LVL3         = 4,    /* 4 outermost checksums */
> > > > > > +   XDP_CHECKSUM_VALID_NUM_MASK     = GENMASK(2, 0),
> > > > > > +   XDP_CHECKSUM_VALID              = XDP_CHECKSUM_VALID_NUM_MASK,
> > > > >
> > > > > > I don't see what a bpf prog is supposed to do with these levels.
> > > > > The driver should pick between 3:
> > > > > XDP_CHECKSUM_UNNECESSARY, XDP_CHECKSUM_COMPLETE, XDP_CHECKSUM_NONE.
> > > > >
> > > > > > No levels and nothing partial, please.
> > > >
> > > > This levels business is an unfortunate side effect of
> > > > CHECKSUM_UNNECESSARY. For a packet with multiple checksum fields, what
> > > > does the boolean actually mean? With these levels, at least that is
> > > > well defined: the first N checksum fields.
> > >
> > > If I understand this correctly, this is an Intel-specific feature that
> > > other NICs don't have. The skb layer also doesn't have such a concept.
>
> Please look into the csum_level field in sk_buff. It is not the most used property
> in the kernel networking code, but it is certainly (1) used by the networking stack
> and (2) set to a non-zero value by many vendors.
>
> So you do not need to search yourself, I'll copy-paste the docs for
> CHECKSUM_UNNECESSARY here:
>
>  *   %CHECKSUM_UNNECESSARY is applicable to following protocols:
>  *
>  *     - TCP: IPv6 and IPv4.
>  *     - UDP: IPv4 and IPv6. A device may apply CHECKSUM_UNNECESSARY to a
>  *       zero UDP checksum for either IPv4 or IPv6, the networking stack
>  *       may perform further validation in this case.
>  *     - GRE: only if the checksum is present in the header.
>  *     - SCTP: indicates the CRC in SCTP header has been validated.
>  *     - FCOE: indicates the CRC in FC frame has been validated.
>  *
>
> Please, look at this:
>
>  *   &sk_buff.csum_level indicates the number of consecutive checksums found in
>  *   the packet minus one that have been verified as %CHECKSUM_UNNECESSARY.
>  *   For instance if a device receives an IPv6->UDP->GRE->IPv4->TCP packet
>  *   and a device is able to verify the checksums for UDP (possibly zero),
>  *   GRE (checksum flag is set) and TCP, &sk_buff.csum_level would be set to
>  *   two. If the device were only able to verify the UDP checksum and not
>  *   GRE, either because it doesn't support GRE checksum or because GRE
>  *   checksum is bad, skb->csum_level would be set to zero (TCP checksum is
>  *   not considered in this case).
>
> From:
> https://elixir.bootlin.com/linux/v6.5-rc4/source/include/linux/skbuff.h#L115
>
> > > The driver should say CHECKSUM_UNNECESSARY when it's sure
> > > or don't pretend that it checks the checksum and just say NONE.
> >
>
> Well, in such case, most of the NICs that use CHECKSUM_UNNECESSARY would have to
> return CHECKSUM_NONE instead, because based on my quick search, they mostly
> return checksum level of 0 (no tunneling detected) or 1 (tunneling detected),
> so they only parse headers up to a certain depth, meaning it's not possible
> to tell whether there isn't another CHECKSUM_UNNECESSARY-eligible header hiding
> in the payload, so those NICs cannot guarantee ALL the checksums present in the
> packet are correct. So, by your logic, we should make e.g. AF_XDP user re-check
> already verified checksums themselves, because HW "doesn't pretend that it
> checks the checksum and just says NONE".
>
> > I did not know how much this was used, but quick grep for non constant
> > csum_level shows devices from at least six vendors.
>
> Yes, there are several vendors that set the csum_level, including Broadcom
> (bnxt) and Mellanox (mlx4 and mlx5).
>
> Also, CHECKSUM_UNNECESSARY is found in 100+ drivers/net/ethernet files,
> while csum_level is in about 20, which means the overwhelming majority of
> CHECKSUM_UNNECESSARY NICs actually stay with the default checksum level of '0'
> (they check only the outermost checksum - anything else needs to be verified by
> the networking stack).
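
The csum_level rule from the skbuff.h comment quoted above (number of consecutive verified checksums, starting with the outermost, minus one) can be sketched as below. This is only an illustration of that sentence: the function name and the ``verified`` array are hypothetical, and returning -1 for "nothing verified" is a convention of this sketch, not of the kernel.

```c
#include <stdbool.h>
#include <stddef.h>

/* verified[i] says whether the i-th checksum-bearing header, counted
 * from the outermost one, was validated by the device.
 * Returns the csum_level (consecutive verified checksums minus one),
 * or -1 when not even the outermost checksum was verified.
 */
static int csum_level_from_results(const bool *verified, size_t n)
{
	size_t ok = 0;

	while (ok < n && verified[ok])
		ok++;
	return (int)ok - 1;
}
```

For the IPv6->UDP->GRE->IPv4->TCP example in the quoted comment, verifying UDP, GRE and TCP gives level two, while a device that stops at GRE reports level zero regardless of what it knows about the inner TCP checksum.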

No. What I'm saying is that XDP_CHECKSUM_UNNECESSARY should be
equivalent to skb's CHECKSUM_UNNECESSARY with csum_level = 0.
I'm well aware that some drivers are trying to be smart and put csum_level=1.
There is no use case for it in XDP.
"But our HW supports it so XDP prog should read it" is the reason NOT
to expose it to bpf in generic api.

Either we're doing per-driver kfuncs and no common infra, or a common kfunc
that covers 99% of the drivers, which is CHECKSUM_UNNECESSARY && csum_level = 0.

It's not acceptable to present a generic api to xdp prog with multi level
csum that only works on a specific HW. Next thing there will be new flags
and MAX_CSUM_LEVEL in XDP features.
Pretending to be generic while being HW specific is the worst interface.
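
A minimal sketch of the three-way choice argued for above, written as plain user-space C. Every name here (``sketch_rx_csum``, ``sketch_rx_desc``, ``sketch_rx_csum_status``) is hypothetical and taken from neither the patch nor the kernel. Preferring "unnecessary" over "complete" when a device reports both is also an assumption of this sketch.

```c
#include <stdint.h>

/* Hypothetical names for this sketch; not from the patch or the kernel. */
enum sketch_rx_csum {
	SKETCH_RX_CSUM_NONE,		/* nothing verified, software must check */
	SKETCH_RX_CSUM_UNNECESSARY,	/* outermost checksum verified by HW */
	SKETCH_RX_CSUM_COMPLETE,	/* HW supplied a sum over the packet */
};

/* Hypothetical, already-decoded RX descriptor fields. */
struct sketch_rx_desc {
	int hw_verified_csums;	/* consecutive checksums HW validated, 0 if none */
	int has_full_sum;	/* non-zero if HW reported a whole-packet sum */
	uint32_t full_sum;
};

/* Collapse whatever the hardware knows into one of three states.
 * Any verification depth beyond the outermost checksum is dropped,
 * which corresponds to CHECKSUM_UNNECESSARY with csum_level = 0.
 */
static enum sketch_rx_csum sketch_rx_csum_status(const struct sketch_rx_desc *d,
						 uint32_t *csum)
{
	if (d->hw_verified_csums > 0)
		return SKETCH_RX_CSUM_UNNECESSARY;
	if (d->has_full_sum) {
		*csum = d->full_sum;
		return SKETCH_RX_CSUM_COMPLETE;
	}
	return SKETCH_RX_CSUM_NONE;
}
```

With this shape a consumer only ever learns that the outermost checksum is good, so any inner checksums still have to be validated in software, which is the trade-off debated in this thread.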

