From: "Karlsson, Magnus" <magnus.karlsson@intel.com>
To: "Toke Høiland-Jørgensen" <toke@redhat.com>,
"Jesper Dangaard Brouer" <jbrouer@redhat.com>,
"Desouza, Ederson" <ederson.desouza@intel.com>
Cc: "brouer@redhat.com" <brouer@redhat.com>,
"xdp-hints@xdp-project.net" <xdp-hints@xdp-project.net>,
Eelco Chaudron <echaudro@redhat.com>,
Andrii Nakryiko <andrii@kernel.org>,
"Fijalkowski, Maciej" <maciej.fijalkowski@intel.com>,
"Burakov, Anatoly" <anatoly.burakov@intel.com>
Subject: [xdp-hints] Re: XDP-hints via local BTF info
Date: Thu, 18 Nov 2021 08:05:31 +0000 [thread overview]
Message-ID: <MW3PR11MB460254394F268761DA9E1AA1F79B9@MW3PR11MB4602.namprd11.prod.outlook.com> (raw)
In-Reply-To: <875ysqflg1.fsf@toke.dk>
> -----Original Message-----
> From: Toke Høiland-Jørgensen <toke@redhat.com>
> Sent: Wednesday, November 17, 2021 11:48 PM
> To: Karlsson, Magnus <magnus.karlsson@intel.com>; Jesper Dangaard
> Brouer <jbrouer@redhat.com>; Desouza, Ederson
> <ederson.desouza@intel.com>
> Cc: brouer@redhat.com; xdp-hints@xdp-project.net; Eelco Chaudron
> <echaudro@redhat.com>; Andrii Nakryiko <andrii@kernel.org>; Fijalkowski,
> Maciej <maciej.fijalkowski@intel.com>; Burakov, Anatoly
> <anatoly.burakov@intel.com>
> Subject: Re: [xdp-hints] Re: XDP-hints via local BTF info
>
> "Karlsson, Magnus" <magnus.karlsson@intel.com> writes:
>
> > Together with Maciej and Anatoly, I have been toying with how to
> > accomplish this, but it is early days so warning for some serious
> > handwaving. Will produce some code to see if it is possible at all.
> > One drawback with having it completely flexible and letting user-space
> > decide is the complexity implementing this in the driver/kernel. But
> > is this not why we have eBPF for in the first place? Maybe it can come
> > to the rescue here.
>
> I think there are a couple of distinctions we need to make, which your
> "handwaving" glosses over a little bit :)
>
> First, we have to distinguish between two cases here: how to work with
> existing hardware and drivers and their metadata implementations, and how
> to work with future devices that will (presumably) be completely flexible
> and/or programmable to provide any metadata configuration directly from
> the hardware. The XDP hints scheme obviously needs to work with both.
Thanks Toke for providing feedback on this. Just had to get this stuff "off my chest" 😉. I was stuck in the zero-copy world where you can modify drivers, but of course we need to consider "older" drivers with XDP support. With that in mind, I do agree with most of your comments. There are some questions and thoughts interspersed below and a suggestion for how to go forward at the end.
> Secondly, we should distinguish between the configuration interface and the
> metadata consumption from within the data path. I think those should be
> separate interfaces; in particular, I don't think loading an XDP program in
> itself should have the side effect of reconfiguring the hardware metadata
> format. Rather, the reconfiguration should be a separate, explicit step using
> whatever API we can end up agreeing on (ethtool or rtnetlink come to mind
> as obvious contenders). However, the
> *format* for this configuration could very well be BTF-based, so userspace
> can get whatever format it wants, assuming the hardware supports it.
>
> So, say we have this fancy programmable hardware, and we write a program
> with a struct definition like:
>
> struct my_meta_format {
> __u64 rx_timestamp;
> __u64 magic_colour_of_packet;
> __u32 btf_id;
> };
>
> and from userspace we can then do:
>
> dev_metadata_configure(ifindex, BTF_OF_STRUCT(my_meta_format));
>
> which will pass the BTF-format struct to the kernel and configure the
> hardware to write those fields. After this, the XDP program can just do:
>
> int my_xdp_prog(struct xdp_buff *ctx)
> {
> struct my_meta_format *meta = ctx->data_meta;
>
> do_something_with(meta->magic_colour_of_packet);
> return XDP_XXX;
> }
>
> and it will just work. Same thing from userspace.
>
> Or it can define a CO-RE enabled struct like:
>
> struct my_meta_subset {
> __u64 magic_colour_of_packet; /* we only care about this attr */ }
> __attribute__((preserve_access_index));
>
> int my_xdp_prog(struct xdp_buff *ctx)
> {
> struct my_meta_subset *meta = ctx->data_meta;
>
> do_something_with(meta->magic_colour_of_packet);
> return XDP_XXX;
> }
>
> and libbpf will rewrite the field accesses using the BTF information, re-
> exported from the kernel from what userspace passed in when configuring.
>
> With the above in mind, a few comments on some of your other points:
>
> > The key feature here is that bpf_xdp_get_metadata() will actually go
> > and fetch this from the HW, by calling a small callback function in
> > the driver that reads for example rx_timestamp from the HW and returns
> > it to the XDP program. This is clearly not possible today and would
> > require new plumbing, if it is even possible to implement this. But
> > let us leave aside the implementation for now and just focus on the
> > benefit this (or something like it) could provide compared to a
> > kernel-centric approach.
>
> Took me a while to understand what you meant with this, but I think it finally
> clicked with your example below, so I'll leave the comment to there...
>
> > * User-space can completely control where to put data and what it puts
> > there. Think of producing the structure you want in user space
> > directly without having to copy things around. You could for example
> > produce a DPDK mbuf, or the VPP, Surikata equivalent directly. Would
> > save a lot of cycles.
>
> I think that userspace being in control of the metadata format is a great goal
> we should keep in mind, and as mentioned above we can do that with a BTF-
> based interface.
>
> > * No need for a metadata enablement interface. eBPF could find this
> > out by just parsing the XDP program and enable the used metadata
> > features in the HW by calling enable/disable metadata functions in the
> > driver.
>
> As mentioned above I think this is a mis-feature: configuration should be
> explicit and out of band, not tied to the data path implementation.
>
> > * No reason to expose BTFs to user-space. Makes it a lot simpler.
> > Actually, no need to use them in the driver either.
>
> BTF is just a handy format for describing the data layout. There has to be
> *some* format to communicate this, and exposing the BTF format to
> userspace is just a way for the kernel to say "this is how the hardware is
> currently configured". I don't understand why you think not having that
> makes things simpler?
I just think that a one-liner in a program is simpler than an entry in the BTF format. But this is moot anyway. You cannot use a program to communicate from the kernel, so BTF it is.
> > * We do not have to use the metadata area unless it is actually
> > needed. In the case of AF_XDP, we have to write it there, but local
> > use in XDP does not. You would save one cache miss per packet right
> > there.
>
> Regardless of how the format is configured, why would an XDP program read
> it if it doesn't need it?
The driver reads some metadata entry from HW and puts it in the metadata section before the packet. XDP is invoked and reads this entry from the metadata area and uses it in some way. This generates a cache miss for the write/read to the metadata entry. Compare this to the XDP program calls a helper that reads the metadata from HW and puts it directly into a local variable in the XDP program. Metadata area is not used, and we save one cache miss.
> > * We get away from non-scalable driver implementations such as:
> >
> > If (metadata_a_enabled)
> > meta->a = Read_metadata_a();
> > If (metadata_b_enabled)
> > meta->b = Read_metadata_b();
> > If (metadata_c_enabled)
> > meta->c = Read_metadata_c();
> > and so on for many entries.....
>
> Implementations such as this exist, though, and we have to support them; as
> mentioned above, I think it's absolutely a worthy goal to make the metadata
> format completely fungible from userspace, but the XDP hints interface
> needs to also support old drivers that do things in the traditional way, even if
> it's inefficient.
> >
> > Or even worse:
> > meta->a = Read_metadata_b();
> > meta->b = Read_metadata_a();
> > meta->c = Read_metadata_c();
> > and so on....
> >
> > The last one would kill performance any day. Even the first one is
> > shunned in DPDK for performance reasons.
> >
> > Note that in my proposal above, read_metadata_a() is a function that
> > will be called from the XDP program when needed and only then, not
> > from the driver, so driver code becomes uncluttered by these large if
> > statements.
>
> Ah, with this bit I understood your point above about "a helper reading the
> data from the hardware". Let me see if I can rephrase your proposal, and let
> me know if I got it wrong:
>
> The problem you're speaking of here, is that of transforming the bits that
> come in from the hardware (in an RX-desc, say), into the metadata format
> consumed by XDP. Say, for instance that the driver has a function
> like:
>
> u64 my_driver_read_rx_timestamp(void *pkt_dsc);
>
> which does whatever bit of device-specific magic it needed to get a well-
> formed u64 timestamp out of the packet descriptor, maybe involving
> handling HW-specific quirks etc. And you want to expose this function
> directly to BPF instead of having a big conditional in the driver C code, so we
> can use BPF code elimination to only get the bits we want, right?
You got it!
> This is actually something Jesper and I have been discussing as a possible
> future solution! However, what we discussed was not exposing this *to
> XDP*, but rather to have a separate BPF program *which is part of the
> driver* that does this. And this BPF program can then be paired with the BTF
> format coming in from userspace, so that it will only load the fields
> requested. So, going back to my imaginary configuration interface above,
> when userspace calls:
>
> dev_metadata_configure(ifindex, BTF_OF_STRUCT(my_meta_format));
>
> the driver turns this over to the BPF subsystem with a BPF program like:
>
> int my_driver_convert_meta(void *ctx, void *meta_dst) {
> struct my_meta_format *meta_dst_fmt = meta_dst;
>
> if (bpf_core_field_exists(meta_dst->rx_timestamp))
> meta_dst->rx_timestamp = my_driver_read_rx_timestamp(ctx);
> if (bpf_core_field_exists(meta_dst->magic_colour_of_packet))
> meta_dst->magic_colour_of_packet =
> my_driver_read_magic_colour_of_packet(ctx);
> ...etc
> }
>
>
> And then the driver code where before you had, as you said:
>
> > If (metadata_a_enabled)
> > meta->a = Read_metadata_a();
> > If (metadata_b_enabled)
> > meta->b = Read_metadata_b();
>
> will just be
>
> BPF_PROG_RUN(drv->meta_writer_prog, meta);
>
> and that will run the BPF program that was loaded (via a usermode helper,
> probably) when userspace configured the metadata format and we'll get the
> conditional code goodness and better performance.
Perfect! This was actually my first thought too, to have a new hook point for metadata population. But I turned to the current approach because I thought it would be too hard to get a new hook point in. Though, there are potentially a number of advantages with having a new hook point for this. One from an AF_XDP point of view would be that we could optionally perform the attach on a per netdev/queue basis and get a much easier handling of per AF_XDP application/socket metadata usage. But great that we are thinking about roughly the same things. Encouraging.
> However, having such a facility for drivers to dynamically *write* metadata
> is, I believe, completely orthogonal to how we implement the description
> format for consuming this metadata from XDP and userspace.
> For one thing, not all drivers will probably want to use the dynamic method,
> so we should not tie up the consumption of metadata to this. For another,
> it's way too complex to implement everything at once, so defining the
> consumption format first, then working on efficiently writing it afterwards (or
> in parallel, maybe) makes it possible to have incremental progress.
Agreed.
> That being said, having both things in mind when defining things is a good
> idea. But as I have hopefully outlined above, using BTF as the description
> format allows both writing and reading of metadata to be flexible and
> extensible.
>
> > * We also get away from the problem that the BTF id can change from
> > underneath the AF_XDP application. This becomes especially problematic
> > when we would like to apply this to the Tx path and do offloads.
>
> For this point, I think maybe the existence of the btf_id field in the metadata
> struct has given you the impression that the format can go and change at any
> time? I don't think that has to be the case: if a user knows that they will
> always use the same metadata format, and they know what that format is,
> it's perfectly valid to just ignore the btf_id field and just cast the metadata
> field to a struct. However, the field has to be there so that it is possible for
> the consumer to detect when it changes, in order to support portable
> applications. There are two use cases for this:
It would be great if we could know it is fixed, but I do not understand how the user can know this, especially since the control of this is out-of-band. How would we deal with the following scenario?
App 1 comes up, opens up an AF_XDP socket and requests metadata_1
App 2 comes up, opens up another AF_XDP socket on the same netdev and requests metadata_2
We can provide the apps with two different btf_ids, but is this something that an existing driver can support and how does this scale as we add sockets and different usages of metadata? Note that we have no idea what the destination is until after we have executed our XDP program and potentially used the metadata area there. But our population of the metadata field is before the XDP program. Kind of chicken and egg.
The idea of a separate metadata population hook point on the netdev/queue_id level could potentially solve this. Well, as long as you are not attaching several sockets to the same netdev and queue_id, but that is rare.
> - Writing an application that can support different metadata formats so
> it can run without re-compilation on different hardware. For this, it
> is possible to load the program with a target device, so that the long
> sequence of checks of the btf_id can be eliminated by CO-RE at load
> time, and turned into just a single sanity check so that the metadata
> is not loaded if the configuration changes.
Would be great if we had CO-RE for user-space, but as it is not there in llvm or gcc for x86, we probably have to go through all the tests in AF_XDP.
> - Writing an application that deals with packets from multiple devices
> in the same program (say, a devmap program that gets redirects from
> different hardware into the same egress device).
Yes. The same goes for AF_XDP and a shared umem between different NICs.
> - Dealing with metadata configuration where a subset of the traffic has
> different metadata (e.g., timestamps may only appear on PTP packets,
> say). Here the check of the BTF ID can be used to demux the packet
> type.
Would this not lead to a combinatorial explosion of BTF IDs as we add fields that might or might not have a valid entry? Would it not be better with a flags field in the MD section?
Thank you so much for your feedback. It really helped! In summary, BTF it is as an interface and I agree that we should start simple and just get this to work with the if-statement method. But it seems that I can also continue to prototype the main idea of reading metadata from XDP. Do you think it would be possible to get a new hook point for XDP metadata population? If so, I think we should pursue that, especially if we could tie it to a netdev/queue_id pair. That would simplify a lot for AF_XDP.
/Magnus
> -Toke
next prev parent reply other threads:[~2021-11-18 8:05 UTC|newest]
Thread overview: 15+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-11-17 17:22 [xdp-hints] " Jesper Dangaard Brouer
2021-11-17 20:07 ` [xdp-hints] " Karlsson, Magnus
2021-11-17 22:48 ` Toke Høiland-Jørgensen
2021-11-18 8:05 ` Karlsson, Magnus [this message]
2021-11-18 14:30 ` Jesper Dangaard Brouer
2021-11-18 14:57 ` Karlsson, Magnus
2021-11-18 15:18 ` John Fastabend
2021-11-19 14:53 ` Toke Høiland-Jørgensen
2021-11-22 12:45 ` [xdp-hints] Basic/Dumb question WAS(Re: " Jamal Hadi Salim
2021-11-22 13:59 ` [xdp-hints] " Toke Høiland-Jørgensen
2021-11-22 15:31 ` Tom Herbert
2021-11-22 18:25 ` Toke Høiland-Jørgensen
2021-11-22 12:57 ` [xdp-hints] " Alexander Lobakin
2021-11-24 11:54 ` Jesper Dangaard Brouer
2021-11-25 20:04 ` Alexander Lobakin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
List information: https://lists.xdp-project.net/postorius/lists/xdp-hints.xdp-project.net/
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=MW3PR11MB460254394F268761DA9E1AA1F79B9@MW3PR11MB4602.namprd11.prod.outlook.com \
--to=magnus.karlsson@intel.com \
--cc=anatoly.burakov@intel.com \
--cc=andrii@kernel.org \
--cc=brouer@redhat.com \
--cc=echaudro@redhat.com \
--cc=ederson.desouza@intel.com \
--cc=jbrouer@redhat.com \
--cc=maciej.fijalkowski@intel.com \
--cc=toke@redhat.com \
--cc=xdp-hints@xdp-project.net \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox