XDP hardware hints discussion mail archive
 help / color / mirror / Atom feed
From: "Toke Høiland-Jørgensen" <toke@redhat.com>
To: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Stanislav Fomichev <sdf@google.com>,
	ast@kernel.org, daniel@iogearbox.net, andrii@kernel.org,
	song@kernel.org, yhs@fb.com, john.fastabend@gmail.com,
	kpsingh@kernel.org, haoluo@google.com, jolsa@kernel.org,
	David Ahern <dsahern@gmail.com>, Jakub Kicinski <kuba@kernel.org>,
	Willem de Bruijn <willemb@google.com>,
	Jesper Dangaard Brouer <brouer@redhat.com>,
	Anatoly Burakov <anatoly.burakov@intel.com>,
	Alexander Lobakin <alexandr.lobakin@intel.com>,
	Magnus Karlsson <magnus.karlsson@gmail.com>,
	Maryam Tahhan <mtahhan@redhat.com>,
	xdp-hints@xdp-project.net, netdev@vger.kernel.org,
	bpf@vger.kernel.org
Subject: [xdp-hints] Re: [RFC bpf-next v2 06/14] xdp: Carry over xdp metadata into skb context
Date: Fri, 11 Nov 2022 00:29:43 +0100	[thread overview]
Message-ID: <87mt8yxuag.fsf@toke.dk> (raw)
In-Reply-To: <7eb3e22a-c416-e898-dff0-1146d3cc82c0@linux.dev>

Martin KaFai Lau <martin.lau@linux.dev> writes:

> On 11/10/22 6:19 AM, Toke Høiland-Jørgensen wrote:
>> Martin KaFai Lau <martin.lau@linux.dev> writes:
>> 
>>> On 11/9/22 3:10 AM, Toke Høiland-Jørgensen wrote:
>>>> Snipping a bit of context to reply to this bit:
>>>>
>>>>>>>> Can the xdp prog still change the metadata through xdp->data_meta? tbh, I am not
>>>>>>>> sure it is solid enough by asking the xdp prog not to use the same random number
>>>>>>>> in its own metadata + not to change the metadata through xdp->data_meta after
>>>>>>>> calling bpf_xdp_metadata_export_to_skb().
>>>>>>>
>>>>>>> What do you think the usecase here might be? Or are you suggesting we
>>>>>>> reject further access to data_meta after
>>>>>>> bpf_xdp_metadata_export_to_skb somehow?
>>>>>>>
>>>>>>> If we want to let the programs override some of this
>>>>>>> bpf_xdp_metadata_export_to_skb() metadata, it feels like we can add
>>>>>>> more kfuncs instead of exposing the layout?
>>>>>>>
>>>>>>> bpf_xdp_metadata_export_to_skb(ctx);
>>>>>>> bpf_xdp_metadata_export_skb_hash(ctx, 1234);
>>>>
>>>> There are several use cases for needing to access the metadata after
>>>> calling bpf_xdp_metdata_export_to_skb():
>>>>
>>>> - Accessing the metadata after redirect (in a cpumap or devmap program,
>>>>     or on a veth device)
>>>> - Transferring the packet+metadata to AF_XDP
>>> fwiw, the xdp prog could also be more selective and only stores one of the hints
>>> instead of the whole 'struct xdp_to_skb_metadata'.
>> 
>> Yup, absolutely! In that sense, reusing the SKB format is mostly a
>> convenience feature. However, lots of people consume AF_XDP through the
>> default program installed by libxdp in the XSK setup code, and including
>> custom metadata there is awkward. So having the metadata consumed by the
>> stack as the "default subset" would enable easy consumption by
>> non-advanced users, while advanced users can still do custom stuff by
>> writing their own XDP program that calls the kfuncs.
>> 
>>>> - Returning XDP_PASS, but accessing some of the metadata first (whether
>>>>     to read or change it)
>>>>
>>>> The last one could be solved by calling additional kfuncs, but that
>>>> would be less efficient than just directly editing the struct which
>>>> will be cache-hot after the helper returns.
>>>
>>> Yeah, it is more efficient to directly write if possible.  I think this set
>>> allows the direct reading and writing already through data_meta (as a _u8 *).
>> 
>> Yup, totally fine with just keeping that capability :)
>> 
>>>> And yeah, this will allow the XDP program to inject arbitrary metadata
>>>> into the netstack; but it can already inject arbitrary *packet* data
>>>> into the stack, so not sure if this is much of an additional risk? If it
>>>> does lead to trivial crashes, we should probably harden the stack
>>>> against that?
>>>>
>>>> As for the random number, Jesper and I discussed replacing this with the
>>>> same BTF-ID scheme that he was using in his patch series. I.e., instead
>>>> of just putting in a random number, we insert the BTF ID of the metadata
>>>> struct at the end of it. This will allow us to support multiple
>>>> different formats in the future (not just changing the layout, but
>>>> having multiple simultaneous formats in the same kernel image), in case
>>>> we run out of space.
>>>
>>> This seems a bit hypothetical.  How much headroom does it usually have for the
>>> xdp prog?  Potentially the hints can use all the remaining space left after the
>>> header encap and the current bpf_xdp_adjust_meta() usage?
>> 
>> For the metadata consumed by the stack right now it's a bit
>> hypothetical, yeah. However, there's a bunch of metadata commonly
>> supported by hardware that the stack currently doesn't consume and that
>> hopefully this feature will end up making more accessible. My hope is
>> that the stack can also learn how to use this in the future, in which
>> case we may run out of space. So I think of that bit mostly as
>> future-proofing...
>
> ic. in this case, Can the btf_id be added to 'struct xdp_to_skb_metadata' later 
> if it is indeed needed?  The 'struct xdp_to_skb_metadata' is not in UAPI and 
> doing it with CO-RE is to give us flexibility to make this kind of changes in 
> the future.

My worry is mostly that it'll be more painful to add it later than just
including it from the start, mostly because of AF_XDP users. But if we
do the randomisation thing (thus forcing AF_XDP users to deal with the
dynamic layout as well), it should be possible to add it later, and I
can live with that option as well...

>>>> We should probably also have a flag set on the xdp_frame so the stack
>>>> knows that the metadata area contains relevant-to-skb data, to guard
>>>> against an XDP program accidentally hitting the "magic number" (BTF_ID)
>>>> in unrelated stuff it puts into the metadata area.
>>>
>>> Yeah, I think having a flag is useful.  The flag will be set at xdp_buff and
>>> then transfer to the xdp_frame?
>> 
>> Yeah, exactly!
>> 
>>>>> After re-reading patch 6, have another question. The 'void
>>>>> bpf_xdp_metadata_export_to_skb();' function signature. Should it at
>>>>> least return ok/err? or even return a 'struct xdp_to_skb_metadata *'
>>>>> pointer and the xdp prog can directly read (or even write) it?
>>>>
>>>> Hmm, I'm not sure returning a failure makes sense? Failure to read one
>>>> or more fields just means that those fields will not be populated? We
>>>> should probably have a flags field inside the metadata struct itself to
>>>> indicate which fields are set or not, but I'm not sure returning an
>>>> error value adds anything? Returning a pointer to the metadata field
>>>> might be convenient for users (it would just be an alias to the
>>>> data_meta pointer, but the verifier could know its size, so the program
>>>> doesn't have to bounds check it).
>>>
>>> If some hints are not available, those hints should be initialized to
>>> 0/CHECKSUM_NONE/...etc.
>> 
>> The problem with that is that then we have to spend cycles writing
>> eight bytes of zeroes into the checksum field :)
>
> With a common 'struct xdp_to_skb_metadata', I am not sure how some of these zero 
> writes can be avoided.  If the xdp prog wants to optimize, it can call 
> individual kfunc to get individual hints.

Erm, we just... don't write those fields? Something like:

void write_skb_meta(hw, ctx) {
  struct xdp_skb_metadata meta = ctx->data_meta - sizeof(struct xdp_skb_metadata);
  meta->valid_fields = 0;

  if (hw_has_timestamp) {
    meta->timestamp = hw->timestamp;
    meta->valid_fields |= FIELD_TIMESTAMP;
  } /* otherwise meta->timestamp is just uninitialised */

  if (hw_has_rxhash) {
    meta->rxhash = hw->rxhash;
    meta->valid_fields |= FIELD_RXHASH;
  } /* otherwise meta->rxhash is just uninitialised */
  ...etc...
}

>>> The xdp prog needs a direct way to tell hard failure when it cannot
>>> write the meta area because of not enough space. Comparing
>>> xdp->data_meta with xdp->data as a side effect is not intuitive.
>> 
>> Yeah, hence a flags field so we can just see if setting each field
>> succeeded?
>
> How testing a flag is different from checking 0/invalid-value of a
> field?

The flags field is smaller, so we write fewer bytes if not all fields
are populated.

> or some fields just don't have an invalid value to check for
> like vlan_tci?

Yeah, that could also be an issue.

> You meant a flags field as a return value or in the 'struct xdp_to_skb_metadata' ?

See above.


>> 
>>> It is more than saving the bound check.  With type info of 'struct
>>> xdp_to_skb_metadata *', the verifier can do more checks like reading in the
>>> middle of an integer member.  The verifier could also limit write access only to
>>> a few struct's members if it is needed.
>>>
>>> The returning 'struct xdp_to_skb_metadata *' should not be an alias to the
>>> xdp->data_meta.  They should actually point to different locations in the
>>> headroom.  bpf_xdp_metadata_export_to_skb() sets a flag in xdp_buff.
>>> xdp->data_meta won't be changed and keeps pointing to the last
>>> bpf_xdp_adjust_meta() location.  The kernel will know if there is
>>> xdp_to_skb_metadata before the xdp->data_meta when that bit is set in the
>>> xdp_{buff,frame}.  Would it work?
>> 
>> Hmm, logically splitting the program metadata and the xdp_hints metadata
>> (but having them share the same area) *could* work, I guess, I'm just
>> not sure it's worth the extra complexity?
>
> It shouldn't stop the existing xdp prog writing its own metadata from using the 
> the new bpf_xdp_metadata_export_to_skb().

Right, I think I see what you mean, and I agree that splitting it
internally in the kernel does make sense (see my other reply to
Stanislav).

>> 
>>>>> A related question, why 'struct xdp_to_skb_metadata' needs
>>>>> __randomize_layout?
>>>>
>>>> The __randomize_layout thing is there to force BPF programs to use CO-RE
>>>> to access the field. This is to avoid the struct layout accidentally
>>>> ossifying because people in practice rely on a particular layout, even
>>>> though we tell them to use CO-RE. There are lots of examples of this
>>>> happening in other domains (IP header options, TCP options, etc), and
>>>> __randomize_layout seemed like a neat trick to enforce CO-RE usage :)
>>>
>>> I am not sure if it is necessary or helpful to only enforce __randomize_layout
>>> in 'struct xdp_to_skb_metadata'.  There are other CO-RE use cases (tracing and
>>> non tracing) that already have direct access (reading and/or writing) to other
>>> kernel structures.
>>>
>>> It is more important for the verifier to see the xdp prog accessing it as a
>>> 'struct xdp_to_skb_metadata *' instead of xdp->data_meta which is a __u8 * so
>>> that the verifier can enforce the rules of access.
>> 
>> That only works inside the kernel, though. Since the metadata field can
>> be copied wholesale to AF_XDP, having it randomized forces userspace
>> consumers to also write code to deal with the layout being dynamic...
>
> hm... I still don't see how useful it is, in particular you mentioned
> the libxdp will install a xdp prog to write this default format
> (xdp_to_skb_metadata) and likely libxdp will also provide some helpers
> to parse the xdp_to_skb_metadata and the libxdp user should not need
> to know if CO-RE is used or not. Considering it is a kernel internal
> struct, I think it is fine to keep it and can be revisited later if
> needed. Lets get on to other things first :)

Well, if it was just kernel-internal, sure. But we're also exporting it
to userspace (through AF_XDP). Well-behaved users will obviously do the
right thing and use CO-RE. I'm trying to guard against users just
blindly copy-pasting the struct definition from the kernel and doing a
static cast, then complaining about their code breaking the first time
we change the struct layout. Which I sadly expect that there absolutely
will be people who do unless we actively make sure that doesn't work
from the get-go. And since the randomisation is literally just adding
__randomize_layout to the struct definition, I'd rather start out with
having it, and removing it later if it turns out not to be needed... :)

-Toke


  reply	other threads:[~2022-11-10 23:29 UTC|newest]

Thread overview: 66+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-11-04  3:25 [xdp-hints] [RFC bpf-next v2 00/14] xdp: hints via kfuncs Stanislav Fomichev
2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 01/14] bpf: Introduce bpf_patch Stanislav Fomichev
2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 02/14] bpf: Support inlined/unrolled kfuncs for xdp metadata Stanislav Fomichev
2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 03/14] veth: Introduce veth_xdp_buff wrapper for xdp_buff Stanislav Fomichev
2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 04/14] veth: Support rx timestamp metadata for xdp Stanislav Fomichev
2022-11-09 11:21   ` [xdp-hints] " Toke Høiland-Jørgensen
2022-11-09 21:34     ` Stanislav Fomichev
2022-11-10  0:25   ` John Fastabend
2022-11-10  1:02     ` Stanislav Fomichev
2022-11-10  1:35       ` John Fastabend
2022-11-10  6:44         ` Stanislav Fomichev
2022-11-10 17:39           ` John Fastabend
2022-11-10 18:52             ` Stanislav Fomichev
2022-11-11 10:41             ` Jesper Dangaard Brouer
2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 05/14] selftests/bpf: Verify xdp_metadata xdp->af_xdp path Stanislav Fomichev
2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 06/14] xdp: Carry over xdp metadata into skb context Stanislav Fomichev
2022-11-07 22:01   ` [xdp-hints] " Martin KaFai Lau
2022-11-08 21:54     ` Stanislav Fomichev
2022-11-09  3:07       ` Martin KaFai Lau
2022-11-09  4:19         ` Martin KaFai Lau
2022-11-09 11:10           ` Toke Høiland-Jørgensen
2022-11-09 18:22             ` Martin KaFai Lau
2022-11-09 21:33               ` Stanislav Fomichev
2022-11-10  0:13                 ` Martin KaFai Lau
2022-11-10  1:02                   ` Stanislav Fomichev
2022-11-10 14:26                     ` Toke Høiland-Jørgensen
2022-11-10 18:52                       ` Stanislav Fomichev
2022-11-10 23:14                         ` Toke Høiland-Jørgensen
2022-11-10 23:52                           ` Stanislav Fomichev
2022-11-11  0:10                             ` Toke Høiland-Jørgensen
2022-11-11  0:45                               ` Martin KaFai Lau
2022-11-11  9:37                                 ` Toke Høiland-Jørgensen
2022-11-11  0:33                             ` Martin KaFai Lau
2022-11-11  0:57                               ` Stanislav Fomichev
2022-11-11  1:26                                 ` Martin KaFai Lau
2022-11-11  9:41                                   ` Toke Høiland-Jørgensen
2022-11-10 23:58                         ` Martin KaFai Lau
2022-11-11  0:20                           ` Stanislav Fomichev
2022-11-10 14:19               ` Toke Høiland-Jørgensen
2022-11-10 19:04                 ` Martin KaFai Lau
2022-11-10 23:29                   ` Toke Høiland-Jørgensen [this message]
2022-11-11  1:39                     ` Martin KaFai Lau
2022-11-11  9:44                       ` Toke Høiland-Jørgensen
2022-11-10  1:26             ` John Fastabend
2022-11-10 14:32               ` Toke Høiland-Jørgensen
2022-11-10 17:30                 ` John Fastabend
2022-11-10 22:49                   ` Toke Høiland-Jørgensen
2022-11-10  1:09   ` John Fastabend
2022-11-10  6:44     ` Stanislav Fomichev
2022-11-10 21:21       ` David Ahern
2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 07/14] selftests/bpf: Verify xdp_metadata xdp->skb path Stanislav Fomichev
2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 08/14] bpf: Helper to simplify calling kernel routines from unrolled kfuncs Stanislav Fomichev
2022-11-05  0:40   ` [xdp-hints] " Alexei Starovoitov
2022-11-05  2:18     ` Stanislav Fomichev
2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 09/14] ice: Introduce ice_xdp_buff wrapper for xdp_buff Stanislav Fomichev
2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 10/14] ice: Support rx timestamp metadata for xdp Stanislav Fomichev
2022-11-04 14:35   ` [xdp-hints] " Alexander Lobakin
2022-11-04 18:21     ` Stanislav Fomichev
2022-11-07 17:11       ` Alexander Lobakin
2022-11-07 19:10         ` Stanislav Fomichev
2022-12-15 11:54   ` Larysa Zaremba
2022-12-15 14:29     ` Toke Høiland-Jørgensen
2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 11/14] mlx4: Introduce mlx4_xdp_buff wrapper for xdp_buff Stanislav Fomichev
2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 12/14] mxl4: Support rx timestamp metadata for xdp Stanislav Fomichev
2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 13/14] bnxt: Introduce bnxt_xdp_buff wrapper for xdp_buff Stanislav Fomichev
2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 14/14] bnxt: Support rx timestamp metadata for xdp Stanislav Fomichev

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://lists.xdp-project.net/postorius/lists/xdp-hints.xdp-project.net/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87mt8yxuag.fsf@toke.dk \
    --to=toke@redhat.com \
    --cc=alexandr.lobakin@intel.com \
    --cc=anatoly.burakov@intel.com \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=brouer@redhat.com \
    --cc=daniel@iogearbox.net \
    --cc=dsahern@gmail.com \
    --cc=haoluo@google.com \
    --cc=john.fastabend@gmail.com \
    --cc=jolsa@kernel.org \
    --cc=kpsingh@kernel.org \
    --cc=kuba@kernel.org \
    --cc=magnus.karlsson@gmail.com \
    --cc=martin.lau@linux.dev \
    --cc=mtahhan@redhat.com \
    --cc=netdev@vger.kernel.org \
    --cc=sdf@google.com \
    --cc=song@kernel.org \
    --cc=willemb@google.com \
    --cc=xdp-hints@xdp-project.net \
    --cc=yhs@fb.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox