XDP hardware hints discussion mail archive
From: John Fastabend <john.fastabend@gmail.com>
To: "Jesper Dangaard Brouer" <jbrouer@redhat.com>,
	"Karlsson, Magnus" <magnus.karlsson@intel.com>,
	"Toke Høiland-Jørgensen" <toke@redhat.com>,
	"Desouza, Ederson" <ederson.desouza@intel.com>
Cc: brouer@redhat.com,
	"xdp-hints@xdp-project.net" <xdp-hints@xdp-project.net>,
	Eelco Chaudron <echaudro@redhat.com>,
	Andrii Nakryiko <andrii@kernel.org>,
	"Fijalkowski, Maciej" <maciej.fijalkowski@intel.com>,
	"Burakov, Anatoly" <anatoly.burakov@intel.com>
Subject: [xdp-hints] Re: XDP-hints via local BTF info
Date: Thu, 18 Nov 2021 07:18:24 -0800
Message-ID: <61966ec0722fe_2f3212080@john.notmuch>
In-Reply-To: <f1beafa9-8e1c-ae44-140a-2bccbe8dccbf@redhat.com>

Jesper Dangaard Brouer wrote:
> 
> 
> On 18/11/2021 09.05, Karlsson, Magnus wrote:
> > 
> >> -----Original Message-----
> >> From: Toke Høiland-Jørgensen <toke@redhat.com>
> >> Sent: Wednesday, November 17, 2021 11:48 PM
> >> To: Karlsson, Magnus <magnus.karlsson@intel.com>; Jesper Dangaard
> >> Brouer <jbrouer@redhat.com>; Desouza, Ederson
> >> <ederson.desouza@intel.com>
> >> Cc: brouer@redhat.com; xdp-hints@xdp-project.net; Eelco Chaudron
> >> <echaudro@redhat.com>; Andrii Nakryiko <andrii@kernel.org>; Fijalkowski,
> >> Maciej <maciej.fijalkowski@intel.com>; Burakov, Anatoly
> >> <anatoly.burakov@intel.com>
> >> Subject: Re: [xdp-hints] Re: XDP-hints via local BTF info
> >>
> >> "Karlsson, Magnus" <magnus.karlsson@intel.com> writes:
> >>
> >>> Together with Maciej and Anatoly, I have been toying with how to
> >>> accomplish this, but it is early days so warning for some serious
> >>> handwaving. Will produce some code to see if it is possible at all.
> >>> One drawback with having it completely flexible and letting user-space
> >>> decide is the complexity implementing this in the driver/kernel. But
> >>> is this not what we have eBPF for in the first place? Maybe it can come
> >>> to the rescue here.
> >>
> >> I think there are a couple of distinctions we need to make, which your
> >> "handwaving" glosses over a little bit :)
> >>
> >> First, we have to distinguish between two cases here: how to work with
> >> existing hardware and drivers and their metadata implementations, and how
> >> to work with future devices that will (presumably) be completely flexible
> >> and/or programmable to provide any metadata configuration directly from
> >> the hardware. The XDP hints scheme obviously needs to work with both.
> > 
> > Thanks Toke for providing feedback on this. Just had to get this stuff "off my chest" 😉. I was stuck in the zero-copy world where you can modify drivers, but of course we need to consider "older" drivers with XDP support. With that in mind, I do agree with most of your comments. There are some questions and thoughts interspersed below and a suggestion for how to go forward at the end.
> > 
> 
> I do appreciate that you want to discuss driver implementation, but this
> email and example program[1] were targeted at a much smaller step.
> 
> The main purpose was to show-with-code that decoding BTF and using it 
> from userspace is not difficult.  And iterate on the API a bit.
> 
>   [1] 
> https://github.com/xdp-project/bpf-examples/tree/master/AF_XDP-interaction
> 
> As Toke mentioned, I also believe it is orthogonal.
> I think we all agree that BTF will be the way the XDP-hints will be 
> "described", and we need to propose APIs/code that does this decoding 
> e.g. in userspace. (But I also want things like cpumap kernel side 
> XDP-prog to understand this).
> 
> But I'm taking the bait, and will discuss with you below...
> 
> 
> >> Secondly, we should distinguish between the configuration interface and the
> >> metadata consumption from within the data path. I think those should be
> >> separate interfaces; in particular, I don't think loading an XDP program in
> >> itself should have the side effect of reconfiguring the hardware metadata
> >> format. Rather, the reconfiguration should be a separate, explicit step using
> >> whatever API we can end up agreeing on (ethtool or rtnetlink come to mind
> >> as obvious contenders).
> 
> I agree with Toke, that we should distinguish between the configuration 
> interface and metadata consumption.

+1. As a first iteration we can ignore the configuration mechanism and
assume it's either hard-coded or programmed into the NIC by the hardware
vendor.

> 
> Do notice that we ALREADY have existing config interfaces that enable
> and disable HW hints provided by the hardware. How to handle this?
> 
> E.g. the HW RX-timestamp feature can be enabled when calling tcpdump,
> and will be disabled again when tcpdump stops.
> Thus, this action (enable HW-timestamp) will likely change which
> XDP-hints structure the driver uses.
> What do you think the action should be?
> 
> My opinion is to simply use the existing kernel API to enable these 
> HW-timestamps.  Today the SKB is updated with the HW-timestamp, and in 
> the future I would like the SKB to be created later (in net-core) and 
> get this HW-timestamp via a BPF-prog that extracts it from the
> metadata area. (Hint: today the metadata area actually survives with the
> SKB, and can be accessed by e.g. TC-BPF).
> Thus, I do want this to work with our existing netstack config interfaces.

I don't disagree; it would be great if the existing knobs worked as
expected and played nicely with hints. I view it as step 2, though. Just
my $.02: it's more a question of how to get from point A to B, not so
much a difference in the end result.

> 
> 
> >> However, the
> >> *format* for this configuration could very well be BTF-based, so userspace
> >> can get whatever format it wants, assuming the hardware supports it.
> >>
> >> So, say we have this fancy programmable hardware, and we write a program
> >> with a struct definition like:
> >>
> >> struct my_meta_format {
> >>         __u64 rx_timestamp;
> >>         __u64 magic_colour_of_packet;
> >>         __u32 btf_id;
> >> };
> >>
> >> and from userspace we can then do:
> >>
> >> dev_metadata_configure(ifindex, BTF_OF_STRUCT(my_meta_format));

I have some doubts/questions about complexity on firmware/driver side
to consume such sparse info and create such complex reconfig of hw.
But maybe some simple pattern matching would be sufficient on the hw side
and useful to get things moving forward.

Seeing real hardware with support here would be great.

> >>
> >> which will pass the BTF-format struct to the kernel and configure the
> >> hardware to write those fields. After this, the XDP program can just do:
> >>
> >> int my_xdp_prog(struct xdp_buff *ctx)
> >> {
> >>    struct my_meta_format *meta = ctx->data_meta;
> >>
> >>    do_something_with(meta->magic_colour_of_packet);
> >>    return XDP_XXX;
> >> }
> >>
> >> and it will just work. Same thing from userspace.
> >>
> >> Or it can define a CO-RE enabled struct like:
> >>
> >> struct my_meta_subset {
> >>         __u64 magic_colour_of_packet; /* we only care about this attr */
> >> } __attribute__((preserve_access_index));
> >>
> >> int my_xdp_prog(struct xdp_buff *ctx)
> >> {
> >>    struct my_meta_subset *meta = ctx->data_meta;
> >>
> >>    do_something_with(meta->magic_colour_of_packet);
> >>    return XDP_XXX;
> >> }
> >>
> >> and libbpf will rewrite the field accesses using the BTF information, re-
> >> exported from the kernel from what userspace passed in when configuring.
> >>
> 
> I've also been down the same rabbit hole, wanting userspace to define
> a BTF layout as the config interface through which the HW gets reconfigured.
> I no longer believe in this model.  One reason is the existing config
> interfaces that enable/disable NIC HW features.
> 
> One way we can allow userspace to define the contents of the XDP-hints 
> struct, not the HW config, is to add this new BPF 'hints-hook'.
> Userspace can query the BPF-prog loaded in the 'hints-hook' and see which
> BTF structs it provides.
> This is similar to what I do for AF_XDP in [1], as the XDP BPF-prog
> defines the layout and AF_XDP userspace queries the BTF available.

What I expected, but it hasn't happened yet(?), is that the first users
would go a different route.  The way I see it, a hw vendor can configure the
NIC to put any hints they like in the header via a firmware update. User
space would understand the layout of the hints because it programmed
these hints. In general it's not very friendly for distributions and
their end users, but for a DPDK user running on top of AF_XDP this would be
all that's needed. Or an embedded end system at a telco or a POC on IDS
would work. Anyone doing this today?
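
A minimal sketch of the consumer side of this route, assuming the hints
sit directly in front of the packet data in the AF_XDP frame. The
struct vendor_hints layout and the read_vendor_hints() helper are made up
for illustration; the real layout would be whatever the firmware was
programmed to write.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout, agreed out of band between NIC firmware and the app. */
struct vendor_hints {
	uint64_t rx_timestamp;
	uint32_t rx_hash;
	uint32_t reserved;
};

/*
 * Copy the hints out of the metadata area, which is assumed to end exactly
 * where the packet data starts. Returns 0 on success, or -1 if there is not
 * enough headroom in front of the packet to hold the struct.
 */
static int read_vendor_hints(const uint8_t *frame_start, const uint8_t *pkt,
			     struct vendor_hints *out)
{
	assert(frame_start && pkt && out);

	if (pkt < frame_start || (size_t)(pkt - frame_start) < sizeof(*out))
		return -1;

	/* memcpy avoids unaligned loads and strict-aliasing problems. */
	memcpy(out, pkt - sizeof(*out), sizeof(*out));
	return 0;
}
```

The bounds check matters because nothing in this scheme tells the
application that the hardware config changed underneath it.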

> 
> 
> >> With the above in mind, a few comments on some of your other points:
> >>
> >>> The key feature here is that bpf_xdp_get_metadata() will actually go
> >>> and fetch this from the HW, by calling a small callback function in
> >>> the driver that reads for example rx_timestamp from the HW and returns
> >>> it to the XDP program. This is clearly not possible today and would
> >>> require new plumbing, if it is even possible to implement this. But
> >>> let us leave aside the implementation for now and just focus on the
> >>> benefit this (or something like it) could provide compared to a
> >>> kernel-centric approach.
> >>
> >> Took me a while to understand what you meant with this, but I think it finally
> >> clicked with your example below, so I'll leave the comment to there...
> >>
> >>> * User-space can completely control where to put data and what it puts
> >>> there. Think of producing the structure you want in user space
> >>> directly without having to copy things around. You could for example
> >>> produce a DPDK mbuf, or the VPP, Suricata equivalent directly. Would
> >>> save a lot of cycles.
> >>
> >> I think that userspace being in control of the metadata format is a great goal
> >> we should keep in mind, and as mentioned above we can do that with a BTF-
> >> based interface.
> >>
> >>> * No need for a metadata enablement interface. eBPF could find this
> >>> out by just parsing the XDP program and enable the used metadata
> >>> features in the HW by calling enable/disable metadata functions in the
> >>> driver.
> >>
> >> As mentioned above I think this is a mis-feature: configuration should be
> >> explicit and out of band, not tied to the data path implementation.
> >>
> 
> I agree. Also consider the existing config interfaces that we cannot 
> ignore and have to collaborate with.

I also wonder about hw capability to consume a BTF/eBPF program
and reconfigure the hardware.

> 
> >>> * No reason to expose BTFs to user-space. Makes it a lot simpler.
> >>> Actually, no need to use them in the driver either.
> >>
> >> BTF is just a handy format for describing the data layout. There has to be
> >> *some* format to communicate this, and exposing the BTF format to
> >> userspace is just a way for the kernel to say "this is how the hardware is
> >> currently configured". I don't understand why you think not having that
> >> makes things simpler?
> > 

... (cut cache miss discussion)

> 
> >> This is actually something Jesper and I have been discussing as a possible
> >> future solution! However, what we discussed was not exposing this *to
> >> XDP*, but rather to have a separate BPF program *which is part of the
> >> driver* that does this. And this BPF program can then be paired with the BTF
> >> format coming in from userspace, so that it will only load the fields
> >> requested. So, going back to my imaginary configuration interface above,
> >> when userspace calls:
> >>
> 
> Yes, exactly.
> If we introduce a BPF hints-hook, we could give this new BPF hook access 
> to the RX-desc (and XDP->ctx / xdp_buff) plus some driver specific 
> status quirks-bits.  The driver developers, that know the quirks, could 
> ship a BPF-prog together with driver code that creates/writes the 
> metadata area AND this BPF-prog could DEFINE the struct name and layout 
> as BTF (which userspace can query and extract).
> 
> 

... cut some more interesting discussion.

> > It would be great if we could know it is fixed, but I do not understand how the user can know this, especially since the control of this is out-of-band. How would we deal with the following scenario?
> > 
> > App 1 comes up, opens up an AF_XDP socket and requests metadata_1
> > App 2 comes up, opens up another AF_XDP socket on the same netdev and requests metadata_2
> > 
> > We can provide the apps with two different btf_ids, but is this something that an existing driver can support and how does this scale as we add sockets and different usages of metadata? Note that we have no idea what the destination is until after we have executed our XDP program and potentially used the metadata area there. But our population of the metadata field is before the XDP program. Kind of chicken and egg.
> > 
> > The idea of a separate metadata population hook point on the netdev/queue_id level could potentially solve this. Well, as long as you are not attaching several sockets to the same netdev and queue_id, but that is rare.

Interesting, but I would get a basic single config working first. If a user
really wants multiple configs then I would guess the NIC might partition
the hardware into VFs or virtual interfaces of some kind.

> > 
> >> - Writing an application that can support different metadata formats so
> >>    it can run without re-compilation on different hardware. For this, it
> >>    is possible to load the program with a target device, so that the long
> >>    sequence of checks of the btf_id can be eliminated by CO-RE at load
> >>    time, and turned into just a single sanity check so that the metadata
> >>    is not loaded if the configuration changes.
> > 
> > Would be great if we had CO-RE for user-space, but as it is not there in llvm or gcc for x86, we probably have to go through all the tests in AF_XDP.

I think CO-RE for user space should happen regardless of hardware use
cases. Looks to me like multiple use cases for this exist.
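
Until CO-RE exists for user space, the fallback mentioned above is a chain
of btf_id checks. A rough sketch, assuming the btf_id is the last member of
every hints struct (as in the my_meta_format example) so that it always sits
in the 4 bytes just before the packet data. The IDs, struct names and
classify_hints() are hypothetical; real BTF IDs would be queried from the
kernel at startup.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical BTF IDs; placeholders for values looked up at startup. */
enum { BTF_ID_HINTS_TS = 101, BTF_ID_HINTS_HASH = 102 };

/* Both layouts end with btf_id so it can be found without knowing the type. */
struct hints_ts   { uint64_t rx_timestamp; uint32_t pad; uint32_t btf_id; };
struct hints_hash { uint32_t rx_hash; uint32_t btf_id; };

enum hints_kind { HINTS_UNKNOWN = 0, HINTS_TS, HINTS_HASH };

/*
 * Decide which known layout precedes the packet. meta_len is the number of
 * metadata bytes available in front of pkt.
 */
static enum hints_kind classify_hints(const uint8_t *pkt, size_t meta_len)
{
	uint32_t id;

	assert(pkt);
	if (meta_len < sizeof(id))
		return HINTS_UNKNOWN;

	/* The ID is the last 4 bytes of the metadata area. */
	memcpy(&id, pkt - sizeof(id), sizeof(id));

	if (id == BTF_ID_HINTS_TS && meta_len >= sizeof(struct hints_ts))
		return HINTS_TS;
	if (id == BTF_ID_HINTS_HASH && meta_len >= sizeof(struct hints_hash))
		return HINTS_HASH;
	return HINTS_UNKNOWN;
}
```

The chain grows by one branch per supported layout, which is the cost CO-RE
in user space would remove.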


> > 
> >> - Writing an application that deals with packets from multiple devices
> >>    in the same program (say, a devmap program that gets redirects from
> >>    different hardware into the same egress device).
> > 
> > Yes. The same goes for AF_XDP and a shared umem between different NICs.
> > 
> >> - Dealing with metadata configuration where a subset of the traffic has
> >>    different metadata (e.g., timestamps may only appear on PTP packets,
> >>    say). Here the check of the BTF ID can be used to demux the packet
> >>    type.
> > 
> > Would this not lead to a combinatorial explosion of BTF IDs as we add fields that might or might not have a valid entry? Would it not be better with a flags field in the MD section?
> > 
> 
> I've debated this with John before, and I thought we resolved the
> concerns in that discussion.
> In summary:
> 
> 1. The driver is 100% free to create a flags field in the XDP-hints
> struct that defines which fields are valid.  If you like to be
> old-fashioned, we can even define the bits as UAPI for common fields that
> more drivers provide. This just requires more work in userspace/BPF-prog
> to check flags-valid before accessing a member.
> 
> 2. Until we get the XDP-hints-hook, drivers will be old-fashioned and
> define a rather limited number of XDP-hints structs.  Thus, I would not 
> worry about drivers creating too many structs and BTF-IDs.

I would add a 3: hw + user can agree on what's in the hints and are
free to put anything in there if they use an out-of-band config
mechanism.
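
For option 1 (a flags field), the extra work in the consumer is roughly one
bit test per field. A minimal sketch; the struct, the HINT_F_* bits and the
helper are hypothetical and not a proposed UAPI.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical validity bits, one per optional field. */
#define HINT_F_RX_TIMESTAMP (1u << 0)
#define HINT_F_RX_HASH      (1u << 1)

struct hints_with_flags {
	uint64_t rx_timestamp;
	uint32_t rx_hash;
	uint32_t flags;		/* which of the fields above are valid */
};

/*
 * Store the RX timestamp in *ts and return true only if the producer marked
 * the field as valid; otherwise leave *ts untouched and return false.
 */
static bool hints_get_rx_timestamp(const struct hints_with_flags *h,
				   uint64_t *ts)
{
	assert(h && ts);

	if (!(h->flags & HINT_F_RX_TIMESTAMP))
		return false;

	*ts = h->rx_timestamp;
	return true;
}
```

This keeps a single struct (and a single BTF ID) for the case where only
some packets, e.g. PTP, carry a timestamp.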

> 
> Today drivers are not very flexible, and this work is about creating a 
> super flexible interface.  It doesn't mean that drivers and hardware 
> will all of a sudden get super flexible.  Driver and hardware start 
> using this new flexible BTF in very simple ways.
> 
> 
> > Thank you so much for your feedback. It really helped! In summary, BTF it is as an interface and I agree that we should start simple and just get this to work with the if-statement method. But it seems that I can also continue to prototype the main idea of reading metadata from XDP. Do you think it would be possible to get a new hook point for XDP metadata population? If so, I think we should pursue that, especially if we could tie it to a netdev/queue_id pair. That would simplify a lot for AF_XDP.
> > 
> 
> I'm all for prototyping something so we can make progress.

In general I would suggest building the simplest possible thing
first, and then improving it once users start to actually use it.
In parallel I would like to see someone build a 'real' demo
using 3 above to show off the amazing performance improvements,
to justify adding hooks and lots of other effort in driver/kernel.

Thanks.
