[xdp-hints] XDP-hints via local BTF info

XDP hardware hints discussion mail archive
 help / color / mirror / Atom feed

* [xdp-hints] XDP-hints via local BTF info
@ 2021-11-17 17:22 Jesper Dangaard Brouer
  2021-11-17 20:07 ` [xdp-hints] " Karlsson, Magnus
  0 siblings, 1 reply; 15+ messages in thread
From: Jesper Dangaard Brouer @ 2021-11-17 17:22 UTC (permalink / raw)
  To: Desouza, Ederson; +Cc: brouer, xdp-hints, Eelco Chaudron, Andrii Nakryiko

Hi Ederson and others,

I've coded up an XDP + AF_XDP application[1] that creates and shares 
XDP-hints via BTF from XDP-prog to AF_XDP.

As explained in README[2] file, API for decoding the BTF data-structures 
in userspace have been placed into separate files lib_xsk_extend.c and 
lib_xsk_extend.h.
The API is based on what Ederson proposed[3], but I have heavily 
modified the API while using it in the AF_XDP app.  I've kept the 
hashmap optimization for now, but ended-up not using it... so I will 
propose removing it(?).
My hope is we can discuss and mature the API together, before proposing 
the API to be included in either libbpf or libxdp.

Notice no kernel changes required.  Everything works today with what BTF 
info are provided in the BPF ELF-object file.
Later, we want to extend the NIC drivers with the same kind of BTF info 
that describe the metadata area (our XDP-hints).  For now, AF_XDP 
userspace will instead get the BTF-info from decoding the ELF-object 
file.  Thus, for now the XDP BPF-prog will be the creator/provider of 
the XDP-hints layout.

The requirement for being a valid XDP-hints data-struct is that the last 
member in the struct is named "btf_id" and have size 4 bytes (32-bit).
With BTF the struct "string" names and member names becomes important, 
and is something that userspace will be "searching" for, to establish if 
a driver/BPF-prog provide a specific XDP-hints data structure.
There is no restriction on what the XDP-hints struct should be named, 
but we might want some conversions when added this to the kernel drivers.

Open to feedback,
-Jesper

  [1] 
https://github.com/xdp-project/bpf-examples/tree/master/AF_XDP-interaction

  [2] 
https://github.com/xdp-project/bpf-examples/blob/master/AF_XDP-interaction/README.org#xdp-hints-via-local-btf-info

  [3] 
https://lore.kernel.org/bpf/20210803010331.39453-15-ederson.desouza@intel.com/

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [xdp-hints] Re: XDP-hints via local BTF info
  2021-11-17 17:22 [xdp-hints] XDP-hints via local BTF info Jesper Dangaard Brouer
@ 2021-11-17 20:07 ` Karlsson, Magnus
  2021-11-17 22:48   ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 15+ messages in thread
From: Karlsson, Magnus @ 2021-11-17 20:07 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Desouza, Ederson
  Cc: brouer, xdp-hints, Eelco Chaudron, Andrii Nakryiko, Fijalkowski,
	Maciej, Burakov, Anatoly

> -----Original Message-----
> From: Jesper Dangaard Brouer <jbrouer@redhat.com>
> Sent: Wednesday, November 17, 2021 6:23 PM
> To: Desouza, Ederson <ederson.desouza@intel.com>
> Cc: brouer@redhat.com; xdp-hints@xdp-project.net; Eelco Chaudron
> <echaudro@redhat.com>; Andrii Nakryiko <andrii@kernel.org>
> Subject: [xdp-hints] XDP-hints via local BTF info
> 
> Hi Ederson and others,
> 
> I've coded up an XDP + AF_XDP application[1] that creates and shares XDP-
> hints via BTF from XDP-prog to AF_XDP.
> 
> As explained in README[2] file, API for decoding the BTF data-structures in
> userspace have been placed into separate files lib_xsk_extend.c and
> lib_xsk_extend.h.
> The API is based on what Ederson proposed[3], but I have heavily modified
> the API while using it in the AF_XDP app.  I've kept the hashmap optimization
> for now, but ended-up not using it... so I will propose removing it(?).
> My hope is we can discuss and mature the API together, before proposing
> the API to be included in either libbpf or libxdp.
> 
> Notice no kernel changes required.  Everything works today with what BTF
> info are provided in the BPF ELF-object file.
> Later, we want to extend the NIC drivers with the same kind of BTF info that
> describe the metadata area (our XDP-hints).  For now, AF_XDP userspace will
> instead get the BTF-info from decoding the ELF-object file.  Thus, for now the
> XDP BPF-prog will be the creator/provider of the XDP-hints layout.
> 
> The requirement for being a valid XDP-hints data-struct is that the last
> member in the struct is named "btf_id" and have size 4 bytes (32-bit).
> With BTF the struct "string" names and member names becomes important,
> and is something that userspace will be "searching" for, to establish if a
> driver/BPF-prog provide a specific XDP-hints data structure.
> There is no restriction on what the XDP-hints struct should be named, but we
> might want some conversions when added this to the kernel drivers.
> 
> Open to feedback,
> -Jesper

Thanks Jesper for all of this! Good to see some concrete code. Will go through it in detail in the coming days.

One thing though that worries me with all the approaches so far is that they are kernel-centric in the way that it is the kernel that decides were to put things and in some proposals what to put there. I would like to turn this around and produce a scheme that is user-space centric i.e., the application decides what it needs and where it wants it, because it only knows this, not the kernel. Your proposal has some of these properties (from my brief glancing over it), but not all. 

Together with Maciej and Anatoly, I have been toying with how to accomplish this, but it is early days so warning for some serious handwaving. Will produce some code to see if it is possible at all. One drawback with having it completely flexible and letting user-space decide is the complexity implementing this in the driver/kernel. But is this not why we have eBPF for in the first place? Maybe it can come to the rescue here.

So the high level idea is the following:

User-space load an XDP program. It can look something like this skipping a lot of stuff, but hopefully understandable anyway:

int my_xdp_prog(struct xdp_buff *ctx)
{
    /* If passing something to AF_XDP */
    meta->rx_timestamp = bpf_xdp_get_metadata(RX_TIMESTAMP);

   /* If just using in the XDP program, there is no need to store this in the metadata area thus saving one whole cache miss! */
   my_local_var  = bpf_xdp_get_metadata(VLAN_ID);

   use my_local_var in some way;

   return SOMETHING;
}

The key feature here is that bpf_xdp_get_metadata() will actually go and fetch this from the HW, by calling a small callback function in the driver that reads for example rx_timestamp from the HW and returns it to the XDP program. This is clearly not possible today and would require new plumbing, if it is even possible to implement this. But let us leave aside the implementation for now and just focus on the benefit this (or something like it) could provide compared to a kernel-centric approach.

* User-space can completely control where to put data and what it puts there. Think of producing the structure you want in user space directly without having to copy things around. You could for example produce a DPDK mbuf, or the VPP, Surikata equivalent directly. Would save a lot of cycles.

* No need for a metadata enablement interface. eBPF could find this out by just parsing the XDP program and enable the used metadata features in the HW by calling enable/disable metadata functions in the driver.

* No reason to expose BTFs to user-space. Makes it a lot simpler. Actually, no need to use them in the driver either.

* We do not have to use the metadata area unless it is actually needed. In the case of AF_XDP, we have to write it there, but local use in XDP does not. You would save one cache miss per packet right there.

* We get away from non-scalable driver implementations such as:

If (metadata_a_enabled)
    meta->a = Read_metadata_a();
If (metadata_b_enabled)
    meta->b = Read_metadata_b();
If (metadata_c_enabled)
    meta->c = Read_metadata_c();
 and so on for many entries.....

Or even worse:
    meta->a = Read_metadata_b();
    meta->b = Read_metadata_a();
    meta->c = Read_metadata_c();
    and so on....

The last one would kill performance any day. Even the first one is shunned in DPDK for performance reasons.

Note that in my proposal above, read_metadata_a() is a function that will be called from the XDP program when needed and only then, not from the driver, so driver code becomes uncluttered by these large if statements.

* We also get away from the problem that the BTF id can change from underneath the AF_XDP application. This becomes especially problematic when we would like to apply this to the Tx path and do offloads. 

For drawbacks:

* Might just be a pipe dream since there is no code 😊. But even so, maybe there is some simpler and possible way to realize this than this proposal? Please come with ideas. It just feels that eBPF is the key to configurability and a good way forward.

* Will the performance actually be better?

* And a lot of other things that you can tell me about.

I will write down the implementation idea in another mail and hopefully be able to produce some code. The earlier we can throw something in the garbage bin, the better.

Anyways, good stuff Jesper. Will go through it in detail.

> 
>   [1]
> https://github.com/xdp-project/bpf-examples/tree/master/AF_XDP-
> interaction
> 
>   [2]
> https://github.com/xdp-project/bpf-examples/blob/master/AF_XDP-
> interaction/README.org#xdp-hints-via-local-btf-info
> 
>   [3]
> https://lore.kernel.org/bpf/20210803010331.39453-15-
> ederson.desouza@intel.com/

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [xdp-hints] Re: XDP-hints via local BTF info
  2021-11-17 20:07 ` [xdp-hints] " Karlsson, Magnus
@ 2021-11-17 22:48   ` Toke Høiland-Jørgensen
  2021-11-18  8:05     ` Karlsson, Magnus
  0 siblings, 1 reply; 15+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-11-17 22:48 UTC (permalink / raw)
  To: Karlsson, Magnus, Jesper Dangaard Brouer, Desouza, Ederson
  Cc: brouer, xdp-hints, Eelco Chaudron, Andrii Nakryiko, Fijalkowski,
	Maciej, Burakov, Anatoly

"Karlsson, Magnus" <magnus.karlsson@intel.com> writes:

> Together with Maciej and Anatoly, I have been toying with how to
> accomplish this, but it is early days so warning for some serious
> handwaving. Will produce some code to see if it is possible at all.
> One drawback with having it completely flexible and letting user-space
> decide is the complexity implementing this in the driver/kernel. But
> is this not why we have eBPF for in the first place? Maybe it can come
> to the rescue here.

I think there are a couple of distinctions we need to make, which your
"handwaving" glosses over a little bit :)

First, we have to distinguish between two cases here: how to work with
existing hardware and drivers and their metadata implementations, and
how to work with future devices that will (presumably) be completely
flexible and/or programmable to provide any metadata configuration
directly from the hardware. The XDP hints scheme obviously needs to work
with both.

Secondly, we should distinguish between the configuration interface and
the metadata consumption from within the data path. I think those should
be separate interfaces; in particular, I don't think loading an XDP
program in itself should have the side effect of reconfiguring the
hardware metadata format. Rather, the reconfiguration should be a
separate, explicit step using whatever API we can end up agreeing on
(ethtool or rtnetlink come to mind as obvious contenders). However, the
*format* for this configuration could very well be BTF-based, so
userspace can get whatever format it wants, assuming the hardware
supports it.

So, say we have this fancy programmable hardware, and we write a program
with a struct definition like:

struct my_meta_format {
       __u64 rx_timestamp;
       __u64 magic_colour_of_packet;
       __u32 btf_id;
};

and from userspace we can then do:

dev_metadata_configure(ifindex, BTF_OF_STRUCT(my_meta_format));

which will pass the BTF-format struct to the kernel and configure the
hardware to write those fields. After this, the XDP program can just do:

int my_xdp_prog(struct xdp_buff *ctx)
{
  struct my_meta_format *meta = ctx->data_meta;

  do_something_with(meta->magic_colour_of_packet);
  return XDP_XXX;
}

and it will just work. Same thing from userspace.

Or it can define a CO-RE enabled struct like:

struct my_meta_subset {
       __u64 magic_colour_of_packet; /* we only care about this attr */
} __attribute__((preserve_access_index));

int my_xdp_prog(struct xdp_buff *ctx)
{
  struct my_meta_subset *meta = ctx->data_meta;

  do_something_with(meta->magic_colour_of_packet);
  return XDP_XXX;
}

and libbpf will rewrite the field accesses using the BTF information,
re-exported from the kernel from what userspace passed in when
configuring.

With the above in mind, a few comments on some of your other points:

> The key feature here is that bpf_xdp_get_metadata() will actually go
> and fetch this from the HW, by calling a small callback function in
> the driver that reads for example rx_timestamp from the HW and returns
> it to the XDP program. This is clearly not possible today and would
> require new plumbing, if it is even possible to implement this. But
> let us leave aside the implementation for now and just focus on the
> benefit this (or something like it) could provide compared to a
> kernel-centric approach.

Took me a while to understand what you meant with this, but I think it
finally clicked with your example below, so I'll leave the comment to
there...

> * User-space can completely control where to put data and what it puts
> there. Think of producing the structure you want in user space
> directly without having to copy things around. You could for example
> produce a DPDK mbuf, or the VPP, Surikata equivalent directly. Would
> save a lot of cycles.

I think that userspace being in control of the metadata format is a
great goal we should keep in mind, and as mentioned above we can do that
with a BTF-based interface.

> * No need for a metadata enablement interface. eBPF could find this
> out by just parsing the XDP program and enable the used metadata
> features in the HW by calling enable/disable metadata functions in the
> driver.

As mentioned above I think this is a mis-feature: configuration should
be explicit and out of band, not tied to the data path implementation.

> * No reason to expose BTFs to user-space. Makes it a lot simpler.
> Actually, no need to use them in the driver either.

BTF is just a handy format for describing the data layout. There has to
be *some* format to communicate this, and exposing the BTF format to
userspace is just a way for the kernel to say "this is how the hardware
is currently configured". I don't understand why you think not having
that makes things simpler?

> * We do not have to use the metadata area unless it is actually
> needed. In the case of AF_XDP, we have to write it there, but local
> use in XDP does not. You would save one cache miss per packet right
> there.

Regardless of how the format is configured, why would an XDP program
read it if it doesn't need it?

> * We get away from non-scalable driver implementations such as:
>
> If (metadata_a_enabled)
>     meta->a = Read_metadata_a();
> If (metadata_b_enabled)
>     meta->b = Read_metadata_b();
> If (metadata_c_enabled)
>     meta->c = Read_metadata_c();
>  and so on for many entries.....

Implementations such as this exist, though, and we have to support them;
as mentioned above, I think it's absolutely a worthy goal to make the
metadata format completely fungible from userspace, but the XDP hints
interface needs to also support old drivers that do things in the
traditional way, even if it's inefficient.
>
> Or even worse:
>     meta->a = Read_metadata_b();
>     meta->b = Read_metadata_a();
>     meta->c = Read_metadata_c();
>     and so on....
>
> The last one would kill performance any day. Even the first one is
> shunned in DPDK for performance reasons.
>
> Note that in my proposal above, read_metadata_a() is a function that
> will be called from the XDP program when needed and only then, not
> from the driver, so driver code becomes uncluttered by these large if
> statements.

Ah, with this bit I understood your point above about "a helper reading
the data from the hardware". Let me see if I can rephrase your proposal,
and let me know if I got it wrong:

The problem you're speaking of here, is that of transforming the bits
that come in from the hardware (in an RX-desc, say), into the metadata
format consumed by XDP. Say, for instance that the driver has a function
like:

u64 my_driver_read_rx_timestamp(void *pkt_dsc);

which does whatever bit of device-specific magic it needed to get a
well-formed u64 timestamp out of the packet descriptor, maybe involving
handling HW-specific quirks etc. And you want to expose this function
directly to BPF instead of having a big conditional in the driver C
code, so we can use BPF code elimination to only get the bits we want,
right?

This is actually something Jesper and I have been discussing as a
possible future solution! However, what we discussed was not exposing
this *to XDP*, but rather to have a separate BPF program *which is part
of the driver* that does this. And this BPF program can then be paired
with the BTF format coming in from userspace, so that it will only load
the fields requested. So, going back to my imaginary configuration
interface above, when userspace calls:

dev_metadata_configure(ifindex, BTF_OF_STRUCT(my_meta_format));

the driver turns this over to the BPF subsystem with a BPF program like:

int my_driver_convert_meta(void *ctx, void *meta_dst)
{
        struct my_meta_format *meta_dst_fmt = meta_dst;

        if (bpf_core_field_exists(meta_dst->rx_timestamp))
           meta_dst->rx_timestamp = my_driver_read_rx_timestamp(ctx);
        if (bpf_core_field_exists(meta_dst->magic_colour_of_packet))
           meta_dst->magic_colour_of_packet = my_driver_read_magic_colour_of_packet(ctx);
        ...etc
}

And then the driver code where before you had, as you said:

> If (metadata_a_enabled)
>     meta->a = Read_metadata_a();
> If (metadata_b_enabled)
>     meta->b = Read_metadata_b();

will just be

BPF_PROG_RUN(drv->meta_writer_prog, meta);

and that will run the BPF program that was loaded (via a usermode
helper, probably) when userspace configured the metadata format and
we'll get the conditional code goodness and better performance.

However, having such a facility for drivers to dynamically *write*
metadata is, I believe, completely orthogonal to how we implement the
description format for consuming this metadata from XDP and userspace.
For one thing, not all drivers will probably want to use the dynamic
method, so we should not tie up the consumption of metadata to this. For
another, it's way too complex to implement everything at once, so
defining the consumption format first, then working on efficiently
writing it afterwards (or in parallel, maybe) makes it possible to have
incremental progress.

That being said, having both things in mind when defining things is a
good idea. But as I have hopefully outlined above, using BTF as the
description format allows both writing and reading of metadata to be
flexible and extensible.

> * We also get away from the problem that the BTF id can change from
> underneath the AF_XDP application. This becomes especially problematic
> when we would like to apply this to the Tx path and do offloads.

For this point, I think maybe the existence of the btf_id field in the
metadata struct has given you the impression that the format can go and
change at any time? I don't think that has to be the case: if a user
knows that they will always use the same metadata format, and they know
what that format is, it's perfectly valid to just ignore the btf_id
field and just cast the metadata field to a struct. However, the field
has to be there so that it is possible for the consumer to detect when
it changes, in order to support portable applications. There are two use
cases for this:

- Writing an application that can support different metadata formats so
  it can run without re-compilation on different hardware. For this, it
  is possible to load the program with a target device, so that the long
  sequence of checks of the btf_id can be eliminated by CO-RE at load
  time, and turned into just a single sanity check so that the metadata
  is not loaded if the configuration changes.

- Writing an application that deals with packets from multiple devices
  in the same program (say, a devmap program that gets redirects from
  different hardware into the same egress device).

- Dealing with metadata configuration where a subset of the traffic has
  different metadata (e.g., timestamps may only appear on PTP packets,
  say). Here the check of the BTF ID can be used to demux the packet
  type.

-Toke

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [xdp-hints] Re: XDP-hints via local BTF info
  2021-11-17 22:48   ` Toke Høiland-Jørgensen
@ 2021-11-18  8:05     ` Karlsson, Magnus
  2021-11-18 14:30       ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 15+ messages in thread
From: Karlsson, Magnus @ 2021-11-18  8:05 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Jesper Dangaard Brouer,
	Desouza, Ederson
  Cc: brouer, xdp-hints, Eelco Chaudron, Andrii Nakryiko, Fijalkowski,
	Maciej, Burakov, Anatoly



> -----Original Message-----
> From: Toke Høiland-Jørgensen <toke@redhat.com>
> Sent: Wednesday, November 17, 2021 11:48 PM
> To: Karlsson, Magnus <magnus.karlsson@intel.com>; Jesper Dangaard
> Brouer <jbrouer@redhat.com>; Desouza, Ederson
> <ederson.desouza@intel.com>
> Cc: brouer@redhat.com; xdp-hints@xdp-project.net; Eelco Chaudron
> <echaudro@redhat.com>; Andrii Nakryiko <andrii@kernel.org>; Fijalkowski,
> Maciej <maciej.fijalkowski@intel.com>; Burakov, Anatoly
> <anatoly.burakov@intel.com>
> Subject: Re: [xdp-hints] Re: XDP-hints via local BTF info
> 
> "Karlsson, Magnus" <magnus.karlsson@intel.com> writes:
> 
> > Together with Maciej and Anatoly, I have been toying with how to
> > accomplish this, but it is early days so warning for some serious
> > handwaving. Will produce some code to see if it is possible at all.
> > One drawback with having it completely flexible and letting user-space
> > decide is the complexity implementing this in the driver/kernel. But
> > is this not why we have eBPF for in the first place? Maybe it can come
> > to the rescue here.
> 
> I think there are a couple of distinctions we need to make, which your
> "handwaving" glosses over a little bit :)
> 
> First, we have to distinguish between two cases here: how to work with
> existing hardware and drivers and their metadata implementations, and how
> to work with future devices that will (presumably) be completely flexible
> and/or programmable to provide any metadata configuration directly from
> the hardware. The XDP hints scheme obviously needs to work with both.

Thanks Toke for providing feedback on this. Just had to get this stuff "off my chest" 😉. I was stuck in the zero-copy world where you can modify drivers, but of course we need to consider "older" drivers with XDP support. With that in mind, I do agree with most of your comments. There are some questions and thoughts interspersed below and a suggestion for how to go forward at the end.

> Secondly, we should distinguish between the configuration interface and the
> metadata consumption from within the data path. I think those should be
> separate interfaces; in particular, I don't think loading an XDP program in
> itself should have the side effect of reconfiguring the hardware metadata
> format. Rather, the reconfiguration should be a separate, explicit step using
> whatever API we can end up agreeing on (ethtool or rtnetlink come to mind
> as obvious contenders). However, the
> *format* for this configuration could very well be BTF-based, so userspace
> can get whatever format it wants, assuming the hardware supports it.
> 
> So, say we have this fancy programmable hardware, and we write a program
> with a struct definition like:
> 
> struct my_meta_format {
>        __u64 rx_timestamp;
>        __u64 magic_colour_of_packet;
>        __u32 btf_id;
> };
> 
> and from userspace we can then do:
> 
> dev_metadata_configure(ifindex, BTF_OF_STRUCT(my_meta_format));
> 
> which will pass the BTF-format struct to the kernel and configure the
> hardware to write those fields. After this, the XDP program can just do:
> 
> int my_xdp_prog(struct xdp_buff *ctx)
> {
>   struct my_meta_format *meta = ctx->data_meta;
> 
>   do_something_with(meta->magic_colour_of_packet);
>   return XDP_XXX;
> }
> 
> and it will just work. Same thing from userspace.
> 
> Or it can define a CO-RE enabled struct like:
> 
> struct my_meta_subset {
>        __u64 magic_colour_of_packet; /* we only care about this attr */ }
> __attribute__((preserve_access_index));
> 
> int my_xdp_prog(struct xdp_buff *ctx)
> {
>   struct my_meta_subset *meta = ctx->data_meta;
> 
>   do_something_with(meta->magic_colour_of_packet);
>   return XDP_XXX;
> }
> 
> and libbpf will rewrite the field accesses using the BTF information, re-
> exported from the kernel from what userspace passed in when configuring.
> 
> With the above in mind, a few comments on some of your other points:
> 
> > The key feature here is that bpf_xdp_get_metadata() will actually go
> > and fetch this from the HW, by calling a small callback function in
> > the driver that reads for example rx_timestamp from the HW and returns
> > it to the XDP program. This is clearly not possible today and would
> > require new plumbing, if it is even possible to implement this. But
> > let us leave aside the implementation for now and just focus on the
> > benefit this (or something like it) could provide compared to a
> > kernel-centric approach.
> 
> Took me a while to understand what you meant with this, but I think it finally
> clicked with your example below, so I'll leave the comment to there...
> 
> > * User-space can completely control where to put data and what it puts
> > there. Think of producing the structure you want in user space
> > directly without having to copy things around. You could for example
> > produce a DPDK mbuf, or the VPP, Surikata equivalent directly. Would
> > save a lot of cycles.
> 
> I think that userspace being in control of the metadata format is a great goal
> we should keep in mind, and as mentioned above we can do that with a BTF-
> based interface.
> 
> > * No need for a metadata enablement interface. eBPF could find this
> > out by just parsing the XDP program and enable the used metadata
> > features in the HW by calling enable/disable metadata functions in the
> > driver.
> 
> As mentioned above I think this is a mis-feature: configuration should be
> explicit and out of band, not tied to the data path implementation.
> 
> > * No reason to expose BTFs to user-space. Makes it a lot simpler.
> > Actually, no need to use them in the driver either.
> 
> BTF is just a handy format for describing the data layout. There has to be
> *some* format to communicate this, and exposing the BTF format to
> userspace is just a way for the kernel to say "this is how the hardware is
> currently configured". I don't understand why you think not having that
> makes things simpler?

I just think that a one-liner in a program is simpler than an entry in the BTF format. But this is moot anyway. You cannot use a program to communicate from the kernel, so BTF it is.

> > * We do not have to use the metadata area unless it is actually
> > needed. In the case of AF_XDP, we have to write it there, but local
> > use in XDP does not. You would save one cache miss per packet right
> > there.
> 
> Regardless of how the format is configured, why would an XDP program read
> it if it doesn't need it?

The driver reads some metadata entry from HW and puts it in the metadata section before the packet. XDP is invoked and reads this entry from the metadata area and uses it in some way. This generates a cache miss for the write/read to the metadata entry. Compare this to the XDP program calls a helper that reads the metadata from HW and puts it directly into a local variable in the XDP program. Metadata area is not used, and we save one cache miss.

> > * We get away from non-scalable driver implementations such as:
> >
> > If (metadata_a_enabled)
> >     meta->a = Read_metadata_a();
> > If (metadata_b_enabled)
> >     meta->b = Read_metadata_b();
> > If (metadata_c_enabled)
> >     meta->c = Read_metadata_c();
> >  and so on for many entries.....
> 
> Implementations such as this exist, though, and we have to support them; as
> mentioned above, I think it's absolutely a worthy goal to make the metadata
> format completely fungible from userspace, but the XDP hints interface
> needs to also support old drivers that do things in the traditional way, even if
> it's inefficient.
> >
> > Or even worse:
> >     meta->a = Read_metadata_b();
> >     meta->b = Read_metadata_a();
> >     meta->c = Read_metadata_c();
> >     and so on....
> >
> > The last one would kill performance any day. Even the first one is
> > shunned in DPDK for performance reasons.
> >
> > Note that in my proposal above, read_metadata_a() is a function that
> > will be called from the XDP program when needed and only then, not
> > from the driver, so driver code becomes uncluttered by these large if
> > statements.
> 
> Ah, with this bit I understood your point above about "a helper reading the
> data from the hardware". Let me see if I can rephrase your proposal, and let
> me know if I got it wrong:
> 
> The problem you're speaking of here, is that of transforming the bits that
> come in from the hardware (in an RX-desc, say), into the metadata format
> consumed by XDP. Say, for instance that the driver has a function
> like:
> 
> u64 my_driver_read_rx_timestamp(void *pkt_dsc);
> 
> which does whatever bit of device-specific magic it needed to get a well-
> formed u64 timestamp out of the packet descriptor, maybe involving
> handling HW-specific quirks etc. And you want to expose this function
> directly to BPF instead of having a big conditional in the driver C code, so we
> can use BPF code elimination to only get the bits we want, right?

You got it!

> This is actually something Jesper and I have been discussing as a possible
> future solution! However, what we discussed was not exposing this *to
> XDP*, but rather to have a separate BPF program *which is part of the
> driver* that does this. And this BPF program can then be paired with the BTF
> format coming in from userspace, so that it will only load the fields
> requested. So, going back to my imaginary configuration interface above,
> when userspace calls:
> 
> dev_metadata_configure(ifindex, BTF_OF_STRUCT(my_meta_format));
> 
> the driver turns this over to the BPF subsystem with a BPF program like:
> 
> int my_driver_convert_meta(void *ctx, void *meta_dst) {
>         struct my_meta_format *meta_dst_fmt = meta_dst;
> 
>         if (bpf_core_field_exists(meta_dst->rx_timestamp))
>            meta_dst->rx_timestamp = my_driver_read_rx_timestamp(ctx);
>         if (bpf_core_field_exists(meta_dst->magic_colour_of_packet))
>            meta_dst->magic_colour_of_packet =
> my_driver_read_magic_colour_of_packet(ctx);
>         ...etc
> }
> 
> 
> And then the driver code where before you had, as you said:
> 
> > If (metadata_a_enabled)
> >     meta->a = Read_metadata_a();
> > If (metadata_b_enabled)
> >     meta->b = Read_metadata_b();
> 
> will just be
> 
> BPF_PROG_RUN(drv->meta_writer_prog, meta);
> 
> and that will run the BPF program that was loaded (via a usermode helper,
> probably) when userspace configured the metadata format and we'll get the
> conditional code goodness and better performance.

Perfect! This was actually my first thought too, to have a new hook point for metadata population. But I turned to the current approach because I thought it would be too hard to get a new hook point in. Though, there are potentially a number of advantages with having a new hook point for this. One from an AF_XDP point of view would be that we could optionally perform the attach on a per netdev/queue basis and get a much easier handling of per AF_XDP application/socket metadata usage. But great that we are thinking about roughly the same things. Encouraging.

> However, having such a facility for drivers to dynamically *write* metadata
> is, I believe, completely orthogonal to how we implement the description
> format for consuming this metadata from XDP and userspace.
> For one thing, not all drivers will probably want to use the dynamic method,
> so we should not tie up the consumption of metadata to this. For another,
> it's way too complex to implement everything at once, so defining the
> consumption format first, then working on efficiently writing it afterwards (or
> in parallel, maybe) makes it possible to have incremental progress.

Agreed.

> That being said, having both things in mind when defining things is a good
> idea. But as I have hopefully outlined above, using BTF as the description
> format allows both writing and reading of metadata to be flexible and
> extensible.
> 
> > * We also get away from the problem that the BTF id can change from
> > underneath the AF_XDP application. This becomes especially problematic
> > when we would like to apply this to the Tx path and do offloads.
> 
> For this point, I think maybe the existence of the btf_id field in the metadata
> struct has given you the impression that the format can go and change at any
> time? I don't think that has to be the case: if a user knows that they will
> always use the same metadata format, and they know what that format is,
> it's perfectly valid to just ignore the btf_id field and just cast the metadata
> field to a struct. However, the field has to be there so that it is possible for
> the consumer to detect when it changes, in order to support portable
> applications. There are two use cases for this:

It would be great if we could know it is fixed, but I do not understand how the user can know this, especially since the control of this is out-of-band. How would we deal with the following scenario?

App 1 comes up, opens up an AF_XDP socket and requests metadata_1
App 2 comes up, opens up another AF_XDP socket on the same netdev and requests metadata_2

We can provide the apps with two different btf_ids, but is this something that an existing driver can support and how does this scale as we add sockets and different usages of metadata? Note that we have no idea what the destination is until after we have executed our XDP program and potentially used the metadata area there. But our population of the metadata field is before the XDP program. Kind of chicken and egg.

The idea of a separate metadata population hook point on the netdev/queue_id level could potentially solve this. Well, as long as you are not attaching several sockets to the same netdev and queue_id, but that is rare.

> - Writing an application that can support different metadata formats so
>   it can run without re-compilation on different hardware. For this, it
>   is possible to load the program with a target device, so that the long
>   sequence of checks of the btf_id can be eliminated by CO-RE at load
>   time, and turned into just a single sanity check so that the metadata
>   is not loaded if the configuration changes.

Would be great if we had CO-RE for user-space, but as it is not there in llvm or gcc for x86, we probably have to go through all the tests in AF_XDP.

> - Writing an application that deals with packets from multiple devices
>   in the same program (say, a devmap program that gets redirects from
>   different hardware into the same egress device).

Yes. The same goes for AF_XDP and a shared umem between different NICs.

> - Dealing with metadata configuration where a subset of the traffic has
>   different metadata (e.g., timestamps may only appear on PTP packets,
>   say). Here the check of the BTF ID can be used to demux the packet
>   type.

Would this not lead to a combinatorial explosion of BTF IDs as we add fields that might or might not have a valid entry? Would it not be better with a flags field in the MD section?

Thank you so much for your feedback. It really helped! In summary, BTF it is as an interface and I agree that we should start simple and just get this to work with the if-statement method. But it seems that I can also continue to prototype the main idea of reading metadata from XDP. Do you think it would be possible to get a new hook point for XDP metadata population? If so, I think we should pursue that, especially if we could tie it to a netdev/queue_id pair. That would simplify a lot for AF_XDP. 

/Magnus

> -Toke


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [xdp-hints] Re: XDP-hints via local BTF info
  2021-11-18  8:05     ` Karlsson, Magnus
@ 2021-11-18 14:30       ` Jesper Dangaard Brouer
  2021-11-18 14:57         ` Karlsson, Magnus
  2021-11-18 15:18         ` John Fastabend
  0 siblings, 2 replies; 15+ messages in thread
From: Jesper Dangaard Brouer @ 2021-11-18 14:30 UTC (permalink / raw)
  To: Karlsson, Magnus, Toke Høiland-Jørgensen, Desouza, Ederson
  Cc: brouer, xdp-hints, Eelco Chaudron, Andrii Nakryiko, Fijalkowski,
	Maciej, Burakov, Anatoly



On 18/11/2021 09.05, Karlsson, Magnus wrote:
> 
>> -----Original Message-----
>> From: Toke Høiland-Jørgensen <toke@redhat.com>
>> Sent: Wednesday, November 17, 2021 11:48 PM
>> To: Karlsson, Magnus <magnus.karlsson@intel.com>; Jesper Dangaard
>> Brouer <jbrouer@redhat.com>; Desouza, Ederson
>> <ederson.desouza@intel.com>
>> Cc: brouer@redhat.com; xdp-hints@xdp-project.net; Eelco Chaudron
>> <echaudro@redhat.com>; Andrii Nakryiko <andrii@kernel.org>; Fijalkowski,
>> Maciej <maciej.fijalkowski@intel.com>; Burakov, Anatoly
>> <anatoly.burakov@intel.com>
>> Subject: Re: [xdp-hints] Re: XDP-hints via local BTF info
>>
>> "Karlsson, Magnus" <magnus.karlsson@intel.com> writes:
>>
>>> Together with Maciej and Anatoly, I have been toying with how to
>>> accomplish this, but it is early days so warning for some serious
>>> handwaving. Will produce some code to see if it is possible at all.
>>> One drawback with having it completely flexible and letting user-space
>>> decide is the complexity implementing this in the driver/kernel. But
>>> is this not why we have eBPF for in the first place? Maybe it can come
>>> to the rescue here.
>>
>> I think there are a couple of distinctions we need to make, which your
>> "handwaving" glosses over a little bit :)
>>
>> First, we have to distinguish between two cases here: how to work with
>> existing hardware and drivers and their metadata implementations, and how
>> to work with future devices that will (presumably) be completely flexible
>> and/or programmable to provide any metadata configuration directly from
>> the hardware. The XDP hints scheme obviously needs to work with both.
> 
> Thanks Toke for providing feedback on this. Just had to get this stuff "off my chest" 😉. I was stuck in the zero-copy world where you can modify drivers, but of course we need to consider "older" drivers with XDP support. With that in mind, I do agree with most of your comments. There are some questions and thoughts interspersed below and a suggestion for how to go forward at the end.
> 

I do appriciate that you want to discuss driver implementation, but this 
email and example program[1] was targetted at a much smaller step.

The main purpose was to show-with-code that decoding BTF and using it 
from userspace is not difficult.  And iterate on the API a bit.

  [1] 
https://github.com/xdp-project/bpf-examples/tree/master/AF_XDP-interaction

As Toke mentioned, I also believe it is orthogonal.
I think we all agree that BTF will be the way the XDP-hints will be 
"described", and we need to propose APIs/code that does this decoding 
e.g. in userspace. (But I also want things like cpumap kernel side 
XDP-prog to understand this).

But I'm taking the bait, and will discuss with you below...


>> Secondly, we should distinguish between the configuration interface and the
>> metadata consumption from within the data path. I think those should be
>> separate interfaces; in particular, I don't think loading an XDP program in
>> itself should have the side effect of reconfiguring the hardware metadata
>> format. Rather, the reconfiguration should be a separate, explicit step using
>> whatever API we can end up agreeing on (ethtool or rtnetlink come to mind
>> as obvious contenders).

I agree with Toke, that we should distinguish between the configuration 
interface and metadata consumption.

Do notice that we ALREADY have existing config interfaces that enables 
and disabled HW hints provided by the hardware. How to handle this?

E.g. The HW RX-timestamp features can be enabled when calling tcpdump, 
and will be disabled again when tcpdump stop again.
Thus, this action (enable HW-timestamp) will likely change what 
XDP-hints structure that driver uses.
What do you think the action should be?

My opinion is to simply use the existing kernel API to enable these 
HW-timestamps.  Today the SKB is updated with the HW-timestamp, and in 
the future I would like the SKB to be created later (in net-core) and 
get this HW-timestamp via an BPF-prog that extract this from the 
metadata area. (Hint today metadata area actually survives with the SKB, 
and can be accessed by e.g. TC-BPF).
Thus, I do want this to work with our existing netstack config interfaces.


>> However, the
>> *format* for this configuration could very well be BTF-based, so userspace
>> can get whatever format it wants, assuming the hardware supports it.
>>
>> So, say we have this fancy programmable hardware, and we write a program
>> with a struct definition like:
>>
>> struct my_meta_format {
>>         __u64 rx_timestamp;
>>         __u64 magic_colour_of_packet;
>>         __u32 btf_id;
>> };
>>
>> and from userspace we can then do:
>>
>> dev_metadata_configure(ifindex, BTF_OF_STRUCT(my_meta_format));
>>
>> which will pass the BTF-format struct to the kernel and configure the
>> hardware to write those fields. After this, the XDP program can just do:
>>
>> int my_xdp_prog(struct xdp_buff *ctx)
>> {
>>    struct my_meta_format *meta = ctx->data_meta;
>>
>>    do_something_with(meta->magic_colour_of_packet);
>>    return XDP_XXX;
>> }
>>
>> and it will just work. Same thing from userspace.
>>
>> Or it can define a CO-RE enabled struct like:
>>
>> struct my_meta_subset {
>>         __u64 magic_colour_of_packet; /* we only care about this attr */ }
>> __attribute__((preserve_access_index));
>>
>> int my_xdp_prog(struct xdp_buff *ctx)
>> {
>>    struct my_meta_subset *meta = ctx->data_meta;
>>
>>    do_something_with(meta->magic_colour_of_packet);
>>    return XDP_XXX;
>> }
>>
>> and libbpf will rewrite the field accesses using the BTF information, re-
>> exported from the kernel from what userspace passed in when configuring.
>>

I've also been down the same rabbit hole, wanting userspace to define 
BTF layout as the config interface that HW will get reconfigured via.
I no-longer believe in this mode.  One reason is the existing config 
interfaces that enable/disable NIC HW features.

One way we can allow userspace to define the contents of the XDP-hints 
struct, not the HW config, is to add this new BPF 'hints-hook'.
Userspace can query the BPF-prog loaded in the 'hints-hook' and see that 
BTF structs it provide.
This is similar to that I do for AF_XDP in [1], as the XDP BPF-prog 
defines the layout and AF_XDP userspace queries the BTF avail.


>> With the above in mind, a few comments on some of your other points:
>>
>>> The key feature here is that bpf_xdp_get_metadata() will actually go
>>> and fetch this from the HW, by calling a small callback function in
>>> the driver that reads for example rx_timestamp from the HW and returns
>>> it to the XDP program. This is clearly not possible today and would
>>> require new plumbing, if it is even possible to implement this. But
>>> let us leave aside the implementation for now and just focus on the
>>> benefit this (or something like it) could provide compared to a
>>> kernel-centric approach.
>>
>> Took me a while to understand what you meant with this, but I think it finally
>> clicked with your example below, so I'll leave the comment to there...
>>
>>> * User-space can completely control where to put data and what it puts
>>> there. Think of producing the structure you want in user space
>>> directly without having to copy things around. You could for example
>>> produce a DPDK mbuf, or the VPP, Surikata equivalent directly. Would
>>> save a lot of cycles.
>>
>> I think that userspace being in control of the metadata format is a great goal
>> we should keep in mind, and as mentioned above we can do that with a BTF-
>> based interface.
>>
>>> * No need for a metadata enablement interface. eBPF could find this
>>> out by just parsing the XDP program and enable the used metadata
>>> features in the HW by calling enable/disable metadata functions in the
>>> driver.
>>
>> As mentioned above I think this is a mis-feature: configuration should be
>> explicit and out of band, not tied to the data path implementation.
>>

I agree. Also consider the existing config interfaces that we cannot 
ignore and have to collaborate with.

>>> * No reason to expose BTFs to user-space. Makes it a lot simpler.
>>> Actually, no need to use them in the driver either.
>>
>> BTF is just a handy format for describing the data layout. There has to be
>> *some* format to communicate this, and exposing the BTF format to
>> userspace is just a way for the kernel to say "this is how the hardware is
>> currently configured". I don't understand why you think not having that
>> makes things simpler?
> 
> I just think that a one-liner in a program is simpler than an entry in the BTF format. But this is moot anyway. You cannot use a program to communicate from the kernel, so BTF it is.
> 

As I demonstrated with code, it is not that difficult to decode and read 
BTF from userspace.
(Maybe in the future GCC will get better support and hide part of this 
for us)

>>> * We do not have to use the metadata area unless it is actually
>>> needed. In the case of AF_XDP, we have to write it there, but local
>>> use in XDP does not. You would save one cache miss per packet right
>>> there.
>>
>> Regardless of how the format is configured, why would an XDP program read
>> it if it doesn't need it?
> 
> The driver reads some metadata entry from HW and puts it in the metadata section before the packet. XDP is invoked and reads this entry from the metadata area and uses it in some way. This generates a cache miss for the write/read to the metadata entry. Compare this to the XDP program calls a helper that reads the metadata from HW and puts it directly into a local variable in the XDP program. Metadata area is not used, and we save one cache miss.
> 

Hmm... I believe we can hide this cache miss via prefetching.

And there are actually two situations we need to consider.

  1. Software path: metadata is written to by driver.

  2. Hardware path: metadata is written by new NIC hardware.

Case 1, packet data starts on a cache-line, thus minus offset will be on 
a separate cache-line, that is not covered by DMA-operation.  Thus this 
cache-line can be prefetched and is likely already in cache.

Case 2, metadata area will be covered by DMA area, and cache-line will 
now be more cold.  It will be covered by DDIO feature, meaning it will 
be delivered in L3-cache.  Thus, not a full cache-miss and prefetch from 
L3 could be doable, while reading RX-desc data.


>>> * We get away from non-scalable driver implementations such as:
>>>
>>> If (metadata_a_enabled)
>>>      meta->a = Read_metadata_a();
>>> If (metadata_b_enabled)
>>>      meta->b = Read_metadata_b();
>>> If (metadata_c_enabled)
>>>      meta->c = Read_metadata_c();
>>>   and so on for many entries.....
>>
>> Implementations such as this exist, though, and we have to support them; as
>> mentioned above, I think it's absolutely a worthy goal to make the metadata
>> format completely fungible from userspace, but the XDP hints interface
>> needs to also support old drivers that do things in the traditional way, even if
>> it's inefficient.
>>>
>>> Or even worse:
>>>      meta->a = Read_metadata_b();
>>>      meta->b = Read_metadata_a();
>>>      meta->c = Read_metadata_c();
>>>      and so on....
>>>
>>> The last one would kill performance any day. Even the first one is
>>> shunned in DPDK for performance reasons.
>>>
>>> Note that in my proposal above, read_metadata_a() is a function that
>>> will be called from the XDP program when needed and only then, not
>>> from the driver, so driver code becomes uncluttered by these large if
>>> statements.
>>
>> Ah, with this bit I understood your point above about "a helper reading the
>> data from the hardware". Let me see if I can rephrase your proposal, and let
>> me know if I got it wrong:
>>
>> The problem you're speaking of here, is that of transforming the bits that
>> come in from the hardware (in an RX-desc, say), into the metadata format
>> consumed by XDP. Say, for instance that the driver has a function
>> like:
>>
>> u64 my_driver_read_rx_timestamp(void *pkt_dsc);
>>
>> which does whatever bit of device-specific magic it needed to get a well-
>> formed u64 timestamp out of the packet descriptor, maybe involving
>> handling HW-specific quirks etc. And you want to expose this function
>> directly to BPF instead of having a big conditional in the driver C code, so we
>> can use BPF code elimination to only get the bits we want, right?
> 
> You got it!

Notice the handling HW-specific quirks, which is why we cannot simply 
give XDP BPF-prog access to the RX-desc.


>> This is actually something Jesper and I have been discussing as a possible
>> future solution! However, what we discussed was not exposing this *to
>> XDP*, but rather to have a separate BPF program *which is part of the
>> driver* that does this. And this BPF program can then be paired with the BTF
>> format coming in from userspace, so that it will only load the fields
>> requested. So, going back to my imaginary configuration interface above,
>> when userspace calls:
>>

Yes, exactly.

If we introduce a BPF hints-hook, we could give this new BPF hook access 
to the RX-desc (and XDP->ctx / xdp_buff) plus some driver specific 
status quirks-bits.  The driver developers, that know the quirks, could 
ship a BPF-prog together with driver code that creates/writes the 
metadata area AND this BPF-prog could DEFINE the struct name and layout 
as BTF (which userspace can query and extract).


>> dev_metadata_configure(ifindex, BTF_OF_STRUCT(my_meta_format));
>>
>> the driver turns this over to the BPF subsystem with a BPF program like:
>>
>> int my_driver_convert_meta(void *ctx, void *meta_dst) {
>>          struct my_meta_format *meta_dst_fmt = meta_dst;
>>
>>          if (bpf_core_field_exists(meta_dst->rx_timestamp))
>>             meta_dst->rx_timestamp = my_driver_read_rx_timestamp(ctx);
>>          if (bpf_core_field_exists(meta_dst->magic_colour_of_packet))
>>             meta_dst->magic_colour_of_packet =
>> my_driver_read_magic_colour_of_packet(ctx);
>>          ...etc
>> }
>>
>>
>> And then the driver code where before you had, as you said:
>>
>>> If (metadata_a_enabled)
>>>      meta->a = Read_metadata_a();
>>> If (metadata_b_enabled)
>>>      meta->b = Read_metadata_b();
>>
>> will just be
>>
>> BPF_PROG_RUN(drv->meta_writer_prog, meta);
>>
>> and that will run the BPF program that was loaded (via a usermode helper,
>> probably) when userspace configured the metadata format and we'll get the
>> conditional code goodness and better performance.
> 
> Perfect! This was actually my first thought too, to have a new hook point for metadata population. But I turned to the current approach because I thought it would be too hard to get a new hook point in. Though, there are potentially a number of advantages with having a new hook point for this. One from an AF_XDP point of view would be that we could optionally perform the attach on a per netdev/queue basis and get a much easier handling of per AF_XDP application/socket metadata usage. But great that we are thinking about roughly the same things. Encouraging.
> 

I would support a new hook point for metadata population.

But before we get this far, I think that drivers should be extended with 
metadata population using normal C-code.  And then we can add this new 
hook, that gives flexibility back to end-users (allowing BPF-programmers 
to replace the driver provided program).


>> However, having such a facility for drivers to dynamically *write* metadata
>> is, I believe, completely orthogonal to how we implement the description
>> format for consuming this metadata from XDP and userspace.
>> For one thing, not all drivers will probably want to use the dynamic method,
>> so we should not tie up the consumption of metadata to this. For another,
>> it's way too complex to implement everything at once, so defining the
>> consumption format first, then working on efficiently writing it afterwards (or
>> in parallel, maybe) makes it possible to have incremental progress.
> 
> Agreed.

Yes, we need partition this work effort into smaller bits, so we can 
make progress.  I think people have talked about XDP-hints for more than 
3-years now... we need to make some real code progress.

> 
>> That being said, having both things in mind when defining things is a good
>> idea. But as I have hopefully outlined above, using BTF as the description
>> format allows both writing and reading of metadata to be flexible and
>> extensible.
>>
>>> * We also get away from the problem that the BTF id can change from
>>> underneath the AF_XDP application. This becomes especially problematic
>>> when we would like to apply this to the Tx path and do offloads.
>>
>> For this point, I think maybe the existence of the btf_id field in the metadata
>> struct has given you the impression that the format can go and change at any
>> time? I don't think that has to be the case: if a user knows that they will
>> always use the same metadata format, and they know what that format is,
>> it's perfectly valid to just ignore the btf_id field and just cast the metadata
>> field to a struct. However, the field has to be there so that it is possible for
>> the consumer to detect when it changes, in order to support portable
>> applications. There are two use cases for this:
> 
> It would be great if we could know it is fixed, but I do not understand how the user can know this, especially since the control of this is out-of-band. How would we deal with the following scenario?
> 
> App 1 comes up, opens up an AF_XDP socket and requests metadata_1
> App 2 comes up, opens up another AF_XDP socket on the same netdev and requests metadata_2
> 
> We can provide the apps with two different btf_ids, but is this something that an existing driver can support and how does this scale as we add sockets and different usages of metadata? Note that we have no idea what the destination is until after we have executed our XDP program and potentially used the metadata area there. But our population of the metadata field is before the XDP program. Kind of chicken and egg.
> 
> The idea of a separate metadata population hook point on the netdev/queue_id level could potentially solve this. Well, as long as you are not attaching several sockets to the same netdev and queue_id, but that is rare.
> 
>> - Writing an application that can support different metadata formats so
>>    it can run without re-compilation on different hardware. For this, it
>>    is possible to load the program with a target device, so that the long
>>    sequence of checks of the btf_id can be eliminated by CO-RE at load
>>    time, and turned into just a single sanity check so that the metadata
>>    is not loaded if the configuration changes.
> 
> Would be great if we had CO-RE for user-space, but as it is not there in llvm or gcc for x86, we probably have to go through all the tests in AF_XDP.
> 
>> - Writing an application that deals with packets from multiple devices
>>    in the same program (say, a devmap program that gets redirects from
>>    different hardware into the same egress device).
> 
> Yes. The same goes for AF_XDP and a shared umem between different NICs.
> 
>> - Dealing with metadata configuration where a subset of the traffic has
>>    different metadata (e.g., timestamps may only appear on PTP packets,
>>    say). Here the check of the BTF ID can be used to demux the packet
>>    type.
> 
> Would this not lead to a combinatorial explosion of BTF IDs as we add fields that might or might not have a valid entry? Would it not be better with a flags field in the MD section?
> 

I've debated this with John before, and I though we resolved the 
concerns in that discussion.
In summary:

1. The driver is 100% free to create a flags field in the XDP-hints 
struct, that define what fields are valid.  If you like to be old 
fashion, and we can even define the bits as UAPI for common fields that 
more drivers provide. This just requires more work in userspace/BPF-prog 
to check flags-valid before accessing a member.

2. Until we get the XDP-hints-hook, drivers will be old fashion and 
define a rather limited number of XDP-hints structs.  Thus, I would not 
worry about drivers creating too many structs and BTF-IDs.

Today drivers are not very flexible, and this work is about creating a 
super flexible interface.  It doesn't mean that drivers and hardware 
will all of a sudden get super flexible.  Driver and hardware start 
using this new flexible BTF in very simple ways.


> Thank you so much for your feedback. It really helped! In summary, BTF it is as an interface and I agree that we should start simple and just get this to work with the if-statement method. But it seems that I can also continue to prototype the main idea of reading metadata from XDP. Do you think it would be possible to get a new hook point for XDP metadata population? If so, I think we should pursue that, especially if we could tie it to a netdev/queue_id pair. That would simplify a lot for AF_XDP.
> 

I've all for prototypeing something so we can make progress.

--Jesper


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [xdp-hints] Re: XDP-hints via local BTF info
  2021-11-18 14:30       ` Jesper Dangaard Brouer
@ 2021-11-18 14:57         ` Karlsson, Magnus
  2021-11-18 15:18         ` John Fastabend
  1 sibling, 0 replies; 15+ messages in thread
From: Karlsson, Magnus @ 2021-11-18 14:57 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Toke Høiland-Jørgensen,
	Desouza, Ederson
  Cc: brouer, xdp-hints, Eelco Chaudron, Andrii Nakryiko, Fijalkowski,
	Maciej, Burakov, Anatoly



> -----Original Message-----
> From: Jesper Dangaard Brouer <jbrouer@redhat.com>
> Sent: Thursday, November 18, 2021 3:30 PM
> To: Karlsson, Magnus <magnus.karlsson@intel.com>; Toke Høiland-
> Jørgensen <toke@redhat.com>; Desouza, Ederson
> <ederson.desouza@intel.com>
> Cc: brouer@redhat.com; xdp-hints@xdp-project.net; Eelco Chaudron
> <echaudro@redhat.com>; Andrii Nakryiko <andrii@kernel.org>; Fijalkowski,
> Maciej <maciej.fijalkowski@intel.com>; Burakov, Anatoly
> <anatoly.burakov@intel.com>
> Subject: Re: [xdp-hints] Re: XDP-hints via local BTF info
> 
> 
> 
> On 18/11/2021 09.05, Karlsson, Magnus wrote:
> >
> >> -----Original Message-----
> >> From: Toke Høiland-Jørgensen <toke@redhat.com>
> >> Sent: Wednesday, November 17, 2021 11:48 PM
> >> To: Karlsson, Magnus <magnus.karlsson@intel.com>; Jesper Dangaard
> >> Brouer <jbrouer@redhat.com>; Desouza, Ederson
> >> <ederson.desouza@intel.com>
> >> Cc: brouer@redhat.com; xdp-hints@xdp-project.net; Eelco Chaudron
> >> <echaudro@redhat.com>; Andrii Nakryiko <andrii@kernel.org>;
> >> Fijalkowski, Maciej <maciej.fijalkowski@intel.com>; Burakov, Anatoly
> >> <anatoly.burakov@intel.com>
> >> Subject: Re: [xdp-hints] Re: XDP-hints via local BTF info
> >>
> >> "Karlsson, Magnus" <magnus.karlsson@intel.com> writes:
> >>
> >>> Together with Maciej and Anatoly, I have been toying with how to
> >>> accomplish this, but it is early days so warning for some serious
> >>> handwaving. Will produce some code to see if it is possible at all.
> >>> One drawback with having it completely flexible and letting
> >>> user-space decide is the complexity implementing this in the
> >>> driver/kernel. But is this not why we have eBPF for in the first
> >>> place? Maybe it can come to the rescue here.
> >>
> >> I think there are a couple of distinctions we need to make, which
> >> your "handwaving" glosses over a little bit :)
> >>
> >> First, we have to distinguish between two cases here: how to work
> >> with existing hardware and drivers and their metadata
> >> implementations, and how to work with future devices that will
> >> (presumably) be completely flexible and/or programmable to provide
> >> any metadata configuration directly from the hardware. The XDP hints
> scheme obviously needs to work with both.
> >
> > Thanks Toke for providing feedback on this. Just had to get this stuff "off
> my chest" 😉. I was stuck in the zero-copy world where you can modify
> drivers, but of course we need to consider "older" drivers with XDP support.
> With that in mind, I do agree with most of your comments. There are some
> questions and thoughts interspersed below and a suggestion for how to go
> forward at the end.
> >
> 
> I do appriciate that you want to discuss driver implementation, but this email
> and example program[1] was targetted at a much smaller step.
> 
> The main purpose was to show-with-code that decoding BTF and using it
> from userspace is not difficult.  And iterate on the API a bit.
> 
>   [1]
> https://github.com/xdp-project/bpf-examples/tree/master/AF_XDP-
> interaction
> 
> As Toke mentioned, I also believe it is orthogonal.
> I think we all agree that BTF will be the way the XDP-hints will be "described",
> and we need to propose APIs/code that does this decoding e.g. in userspace.
> (But I also want things like cpumap kernel side XDP-prog to understand this).
> 
> But I'm taking the bait, and will discuss with you below...

Good 😊

> 
> >> Secondly, we should distinguish between the configuration interface
> >> and the metadata consumption from within the data path. I think those
> >> should be separate interfaces; in particular, I don't think loading
> >> an XDP program in itself should have the side effect of reconfiguring
> >> the hardware metadata format. Rather, the reconfiguration should be a
> >> separate, explicit step using whatever API we can end up agreeing on
> >> (ethtool or rtnetlink come to mind as obvious contenders).
> 
> I agree with Toke, that we should distinguish between the configuration
> interface and metadata consumption.
> 
> Do notice that we ALREADY have existing config interfaces that enables and
> disabled HW hints provided by the hardware. How to handle this?
> 
> E.g. The HW RX-timestamp features can be enabled when calling tcpdump,
> and will be disabled again when tcpdump stop again.
> Thus, this action (enable HW-timestamp) will likely change what XDP-hints
> structure that driver uses.
> What do you think the action should be?

Do not know. How do you think this should be handled when we have multiple programs running on this netdev when this happens?

> My opinion is to simply use the existing kernel API to enable these HW-
> timestamps.  Today the SKB is updated with the HW-timestamp, and in the
> future I would like the SKB to be created later (in net-core) and get this HW-
> timestamp via an BPF-prog that extract this from the metadata area. (Hint
> today metadata area actually survives with the SKB, and can be accessed by
> e.g. TC-BPF).
> Thus, I do want this to work with our existing netstack config interfaces.
> 
> 
> >> However, the
> >> *format* for this configuration could very well be BTF-based, so
> >> userspace can get whatever format it wants, assuming the hardware
> supports it.
> >>
> >> So, say we have this fancy programmable hardware, and we write a
> >> program with a struct definition like:
> >>
> >> struct my_meta_format {
> >>         __u64 rx_timestamp;
> >>         __u64 magic_colour_of_packet;
> >>         __u32 btf_id;
> >> };
> >>
> >> and from userspace we can then do:
> >>
> >> dev_metadata_configure(ifindex, BTF_OF_STRUCT(my_meta_format));
> >>
> >> which will pass the BTF-format struct to the kernel and configure the
> >> hardware to write those fields. After this, the XDP program can just do:
> >>
> >> int my_xdp_prog(struct xdp_buff *ctx) {
> >>    struct my_meta_format *meta = ctx->data_meta;
> >>
> >>    do_something_with(meta->magic_colour_of_packet);
> >>    return XDP_XXX;
> >> }
> >>
> >> and it will just work. Same thing from userspace.
> >>
> >> Or it can define a CO-RE enabled struct like:
> >>
> >> struct my_meta_subset {
> >>         __u64 magic_colour_of_packet; /* we only care about this attr
> >> */ } __attribute__((preserve_access_index));
> >>
> >> int my_xdp_prog(struct xdp_buff *ctx) {
> >>    struct my_meta_subset *meta = ctx->data_meta;
> >>
> >>    do_something_with(meta->magic_colour_of_packet);
> >>    return XDP_XXX;
> >> }
> >>
> >> and libbpf will rewrite the field accesses using the BTF information,
> >> re- exported from the kernel from what userspace passed in when
> configuring.
> >>
> 
> I've also been down the same rabbit hole, wanting userspace to define BTF
> layout as the config interface that HW will get reconfigured via.
> I no-longer believe in this mode.  One reason is the existing config interfaces
> that enable/disable NIC HW features.
> 
> One way we can allow userspace to define the contents of the XDP-hints
> struct, not the HW config, is to add this new BPF 'hints-hook'.
> Userspace can query the BPF-prog loaded in the 'hints-hook' and see that
> BTF structs it provide.
> This is similar to that I do for AF_XDP in [1], as the XDP BPF-prog defines the
> layout and AF_XDP userspace queries the BTF avail.

I like the hints-hook approach. Hopefully it will give us some more freedom than actually doing this in the XDP program itself.

> 
> >> With the above in mind, a few comments on some of your other points:
> >>
> >>> The key feature here is that bpf_xdp_get_metadata() will actually go
> >>> and fetch this from the HW, by calling a small callback function in
> >>> the driver that reads for example rx_timestamp from the HW and
> >>> returns it to the XDP program. This is clearly not possible today
> >>> and would require new plumbing, if it is even possible to implement
> >>> this. But let us leave aside the implementation for now and just
> >>> focus on the benefit this (or something like it) could provide
> >>> compared to a kernel-centric approach.
> >>
> >> Took me a while to understand what you meant with this, but I think
> >> it finally clicked with your example below, so I'll leave the comment to
> there...
> >>
> >>> * User-space can completely control where to put data and what it
> >>> puts there. Think of producing the structure you want in user space
> >>> directly without having to copy things around. You could for example
> >>> produce a DPDK mbuf, or the VPP, Surikata equivalent directly. Would
> >>> save a lot of cycles.
> >>
> >> I think that userspace being in control of the metadata format is a
> >> great goal we should keep in mind, and as mentioned above we can do
> >> that with a BTF- based interface.
> >>
> >>> * No need for a metadata enablement interface. eBPF could find this
> >>> out by just parsing the XDP program and enable the used metadata
> >>> features in the HW by calling enable/disable metadata functions in
> >>> the driver.
> >>
> >> As mentioned above I think this is a mis-feature: configuration
> >> should be explicit and out of band, not tied to the data path
> implementation.
> >>
> 
> I agree. Also consider the existing config interfaces that we cannot ignore
> and have to collaborate with.
> 
> >>> * No reason to expose BTFs to user-space. Makes it a lot simpler.
> >>> Actually, no need to use them in the driver either.
> >>
> >> BTF is just a handy format for describing the data layout. There has
> >> to be
> >> *some* format to communicate this, and exposing the BTF format to
> >> userspace is just a way for the kernel to say "this is how the
> >> hardware is currently configured". I don't understand why you think
> >> not having that makes things simpler?
> >
> > I just think that a one-liner in a program is simpler than an entry in the BTF
> format. But this is moot anyway. You cannot use a program to communicate
> from the kernel, so BTF it is.
> >
> 
> As I demonstrated with code, it is not that difficult to decode and read BTF
> from userspace.
> (Maybe in the future GCC will get better support and hide part of this for us)

It is doable for sure, but I still think it is more complicated. But does not really matter as we have agreed on BTF as the interface. Though when we implement this in libxdp, I do not think we should expose BTFs at that level since this is an implementation detail that is a don’t care for most users. They are only interested in metadata/hints, not BTF itself. There will be power users though that is going to be interested in BTF manipulation, so we need interfaces like the ones you have proposed too.

> >>> * We do not have to use the metadata area unless it is actually
> >>> needed. In the case of AF_XDP, we have to write it there, but local
> >>> use in XDP does not. You would save one cache miss per packet right
> >>> there.
> >>
> >> Regardless of how the format is configured, why would an XDP program
> >> read it if it doesn't need it?
> >
> > The driver reads some metadata entry from HW and puts it in the metadata
> section before the packet. XDP is invoked and reads this entry from the
> metadata area and uses it in some way. This generates a cache miss for the
> write/read to the metadata entry. Compare this to the XDP program calls a
> helper that reads the metadata from HW and puts it directly into a local
> variable in the XDP program. Metadata area is not used, and we save one
> cache miss.
> >
> 
> Hmm... I believe we can hide this cache miss via prefetching.

Better not to have to use prefetches at all since they are highly variable between different CPUs.

> And there are actually two situations we need to consider.
> 
>   1. Software path: metadata is written to by driver.
> 
>   2. Hardware path: metadata is written by new NIC hardware.
> 
> Case 1, packet data starts on a cache-line, thus minus offset will be on a
> separate cache-line, that is not covered by DMA-operation.  Thus this cache-
> line can be prefetched and is likely already in cache.
> 
> Case 2, metadata area will be covered by DMA area, and cache-line will now
> be more cold.  It will be covered by DDIO feature, meaning it will be delivered
> in L3-cache.  Thus, not a full cache-miss and prefetch from
> L3 could be doable, while reading RX-desc data.
> 
> 
> >>> * We get away from non-scalable driver implementations such as:
> >>>
> >>> If (metadata_a_enabled)
> >>>      meta->a = Read_metadata_a();
> >>> If (metadata_b_enabled)
> >>>      meta->b = Read_metadata_b();
> >>> If (metadata_c_enabled)
> >>>      meta->c = Read_metadata_c();
> >>>   and so on for many entries.....
> >>
> >> Implementations such as this exist, though, and we have to support
> >> them; as mentioned above, I think it's absolutely a worthy goal to
> >> make the metadata format completely fungible from userspace, but the
> >> XDP hints interface needs to also support old drivers that do things
> >> in the traditional way, even if it's inefficient.
> >>>
> >>> Or even worse:
> >>>      meta->a = Read_metadata_b();
> >>>      meta->b = Read_metadata_a();
> >>>      meta->c = Read_metadata_c();
> >>>      and so on....
> >>>
> >>> The last one would kill performance any day. Even the first one is
> >>> shunned in DPDK for performance reasons.
> >>>
> >>> Note that in my proposal above, read_metadata_a() is a function that
> >>> will be called from the XDP program when needed and only then, not
> >>> from the driver, so driver code becomes uncluttered by these large
> >>> if statements.
> >>
> >> Ah, with this bit I understood your point above about "a helper
> >> reading the data from the hardware". Let me see if I can rephrase
> >> your proposal, and let me know if I got it wrong:
> >>
> >> The problem you're speaking of here, is that of transforming the bits
> >> that come in from the hardware (in an RX-desc, say), into the
> >> metadata format consumed by XDP. Say, for instance that the driver
> >> has a function
> >> like:
> >>
> >> u64 my_driver_read_rx_timestamp(void *pkt_dsc);
> >>
> >> which does whatever bit of device-specific magic it needed to get a
> >> well- formed u64 timestamp out of the packet descriptor, maybe
> >> involving handling HW-specific quirks etc. And you want to expose
> >> this function directly to BPF instead of having a big conditional in
> >> the driver C code, so we can use BPF code elimination to only get the bits
> we want, right?
> >
> > You got it!
> 
> Notice the handling HW-specific quirks, which is why we cannot simply give
> XDP BPF-prog access to the RX-desc.
> 
> 
> >> This is actually something Jesper and I have been discussing as a possible
> >> future solution! However, what we discussed was not exposing this *to
> >> XDP*, but rather to have a separate BPF program *which is part of the
> >> driver* that does this. And this BPF program can then be paired with the
> BTF
> >> format coming in from userspace, so that it will only load the fields
> >> requested. So, going back to my imaginary configuration interface above,
> >> when userspace calls:
> >>
> 
> Yes, exactly.
> 
> If we introduce a BPF hints-hook, we could give this new BPF hook access
> to the RX-desc (and XDP->ctx / xdp_buff) plus some driver specific
> status quirks-bits.  The driver developers, that know the quirks, could
> ship a BPF-prog together with driver code that creates/writes the
> metadata area AND this BPF-prog could DEFINE the struct name and layout
> as BTF (which userspace can query and extract).
> 
> 
> >> dev_metadata_configure(ifindex, BTF_OF_STRUCT(my_meta_format));
> >>
> >> the driver turns this over to the BPF subsystem with a BPF program like:
> >>
> >> int my_driver_convert_meta(void *ctx, void *meta_dst) {
> >>          struct my_meta_format *meta_dst_fmt = meta_dst;
> >>
> >>          if (bpf_core_field_exists(meta_dst->rx_timestamp))
> >>             meta_dst->rx_timestamp = my_driver_read_rx_timestamp(ctx);
> >>          if (bpf_core_field_exists(meta_dst->magic_colour_of_packet))
> >>             meta_dst->magic_colour_of_packet =
> >> my_driver_read_magic_colour_of_packet(ctx);
> >>          ...etc
> >> }
> >>
> >>
> >> And then the driver code where before you had, as you said:
> >>
> >>> If (metadata_a_enabled)
> >>>      meta->a = Read_metadata_a();
> >>> If (metadata_b_enabled)
> >>>      meta->b = Read_metadata_b();
> >>
> >> will just be
> >>
> >> BPF_PROG_RUN(drv->meta_writer_prog, meta);
> >>
> >> and that will run the BPF program that was loaded (via a usermode helper,
> >> probably) when userspace configured the metadata format and we'll get
> the
> >> conditional code goodness and better performance.
> >
> > Perfect! This was actually my first thought too, to have a new hook point
> for metadata population. But I turned to the current approach because I
> thought it would be too hard to get a new hook point in. Though, there are
> potentially a number of advantages with having a new hook point for this.
> One from an AF_XDP point of view would be that we could optionally
> perform the attach on a per netdev/queue basis and get a much easier
> handling of per AF_XDP application/socket metadata usage. But great that
> we are thinking about roughly the same things. Encouraging.
> >
> 
> I would support a new hook point for metadata population.
> 
> But before we get this far, I think that drivers should be extended with
> metadata population using normal C-code.  And then we can add this new
> hook, that gives flexibility back to end-users (allowing BPF-programmers
> to replace the driver provided program).

Agree.

> 
> >> However, having such a facility for drivers to dynamically *write*
> metadata
> >> is, I believe, completely orthogonal to how we implement the description
> >> format for consuming this metadata from XDP and userspace.
> >> For one thing, not all drivers will probably want to use the dynamic
> method,
> >> so we should not tie up the consumption of metadata to this. For another,
> >> it's way too complex to implement everything at once, so defining the
> >> consumption format first, then working on efficiently writing it afterwards
> (or
> >> in parallel, maybe) makes it possible to have incremental progress.
> >
> > Agreed.
> 
> Yes, we need partition this work effort into smaller bits, so we can
> make progress.  I think people have talked about XDP-hints for more than
> 3-years now... we need to make some real code progress.
> 
> >
> >> That being said, having both things in mind when defining things is a good
> >> idea. But as I have hopefully outlined above, using BTF as the description
> >> format allows both writing and reading of metadata to be flexible and
> >> extensible.
> >>
> >>> * We also get away from the problem that the BTF id can change from
> >>> underneath the AF_XDP application. This becomes especially
> problematic
> >>> when we would like to apply this to the Tx path and do offloads.
> >>
> >> For this point, I think maybe the existence of the btf_id field in the
> metadata
> >> struct has given you the impression that the format can go and change at
> any
> >> time? I don't think that has to be the case: if a user knows that they will
> >> always use the same metadata format, and they know what that format
> is,
> >> it's perfectly valid to just ignore the btf_id field and just cast the metadata
> >> field to a struct. However, the field has to be there so that it is possible for
> >> the consumer to detect when it changes, in order to support portable
> >> applications. There are two use cases for this:
> >
> > It would be great if we could know it is fixed, but I do not understand how
> the user can know this, especially since the control of this is out-of-band.
> How would we deal with the following scenario?
> >
> > App 1 comes up, opens up an AF_XDP socket and requests metadata_1
> > App 2 comes up, opens up another AF_XDP socket on the same netdev
> and requests metadata_2
> >
> > We can provide the apps with two different btf_ids, but is this something
> that an existing driver can support and how does this scale as we add sockets
> and different usages of metadata? Note that we have no idea what the
> destination is until after we have executed our XDP program and potentially
> used the metadata area there. But our population of the metadata field is
> before the XDP program. Kind of chicken and egg.
> >
> > The idea of a separate metadata population hook point on the
> netdev/queue_id level could potentially solve this. Well, as long as you are
> not attaching several sockets to the same netdev and queue_id, but that is
> rare.
> >
> >> - Writing an application that can support different metadata formats so
> >>    it can run without re-compilation on different hardware. For this, it
> >>    is possible to load the program with a target device, so that the long
> >>    sequence of checks of the btf_id can be eliminated by CO-RE at load
> >>    time, and turned into just a single sanity check so that the metadata
> >>    is not loaded if the configuration changes.
> >
> > Would be great if we had CO-RE for user-space, but as it is not there in llvm
> or gcc for x86, we probably have to go through all the tests in AF_XDP.
> >
> >> - Writing an application that deals with packets from multiple devices
> >>    in the same program (say, a devmap program that gets redirects from
> >>    different hardware into the same egress device).
> >
> > Yes. The same goes for AF_XDP and a shared umem between different
> NICs.
> >
> >> - Dealing with metadata configuration where a subset of the traffic has
> >>    different metadata (e.g., timestamps may only appear on PTP packets,
> >>    say). Here the check of the BTF ID can be used to demux the packet
> >>    type.
> >
> > Would this not lead to a combinatorial explosion of BTF IDs as we add fields
> that might or might not have a valid entry? Would it not be better with a flags
> field in the MD section?
> >
> 
> I've debated this with John before, and I though we resolved the
> concerns in that discussion.
> In summary:
> 
> 1. The driver is 100% free to create a flags field in the XDP-hints
> struct, that define what fields are valid.  If you like to be old
> fashion, and we can even define the bits as UAPI for common fields that
> more drivers provide. This just requires more work in userspace/BPF-prog
> to check flags-valid before accessing a member.
> 
> 2. Until we get the XDP-hints-hook, drivers will be old fashion and
> define a rather limited number of XDP-hints structs.  Thus, I would not
> worry about drivers creating too many structs and BTF-IDs.
> 
> Today drivers are not very flexible, and this work is about creating a
> super flexible interface.  It doesn't mean that drivers and hardware
> will all of a sudden get super flexible.  Driver and hardware start
> using this new flexible BTF in very simple ways.

I buy this, that this is not a problem we are going to see for quite a while for sure.

> 
> > Thank you so much for your feedback. It really helped! In summary, BTF it is
> as an interface and I agree that we should start simple and just get this to
> work with the if-statement method. But it seems that I can also continue to
> prototype the main idea of reading metadata from XDP. Do you think it
> would be possible to get a new hook point for XDP metadata population? If
> so, I think we should pursue that, especially if we could tie it to a
> netdev/queue_id pair. That would simplify a lot for AF_XDP.
> >
> 
> I've all for prototypeing something so we can make progress.

Thanks for all the feedback Jesper. Appreciated.

/Magnus

> --Jesper


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [xdp-hints] Re: XDP-hints via local BTF info
  2021-11-18 14:30       ` Jesper Dangaard Brouer
  2021-11-18 14:57         ` Karlsson, Magnus
@ 2021-11-18 15:18         ` John Fastabend
  2021-11-19 14:53           ` Toke Høiland-Jørgensen
  1 sibling, 1 reply; 15+ messages in thread
From: John Fastabend @ 2021-11-18 15:18 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Karlsson, Magnus,
	Toke Høiland-Jørgensen, Desouza, Ederson
  Cc: brouer, xdp-hints, Eelco Chaudron, Andrii Nakryiko, Fijalkowski,
	Maciej, Burakov, Anatoly

Jesper Dangaard Brouer wrote:
> 
> 
> On 18/11/2021 09.05, Karlsson, Magnus wrote:
> > 
> >> -----Original Message-----
> >> From: Toke Høiland-Jørgensen <toke@redhat.com>
> >> Sent: Wednesday, November 17, 2021 11:48 PM
> >> To: Karlsson, Magnus <magnus.karlsson@intel.com>; Jesper Dangaard
> >> Brouer <jbrouer@redhat.com>; Desouza, Ederson
> >> <ederson.desouza@intel.com>
> >> Cc: brouer@redhat.com; xdp-hints@xdp-project.net; Eelco Chaudron
> >> <echaudro@redhat.com>; Andrii Nakryiko <andrii@kernel.org>; Fijalkowski,
> >> Maciej <maciej.fijalkowski@intel.com>; Burakov, Anatoly
> >> <anatoly.burakov@intel.com>
> >> Subject: Re: [xdp-hints] Re: XDP-hints via local BTF info
> >>
> >> "Karlsson, Magnus" <magnus.karlsson@intel.com> writes:
> >>
> >>> Together with Maciej and Anatoly, I have been toying with how to
> >>> accomplish this, but it is early days so warning for some serious
> >>> handwaving. Will produce some code to see if it is possible at all.
> >>> One drawback with having it completely flexible and letting user-space
> >>> decide is the complexity implementing this in the driver/kernel. But
> >>> is this not why we have eBPF for in the first place? Maybe it can come
> >>> to the rescue here.
> >>
> >> I think there are a couple of distinctions we need to make, which your
> >> "handwaving" glosses over a little bit :)
> >>
> >> First, we have to distinguish between two cases here: how to work with
> >> existing hardware and drivers and their metadata implementations, and how
> >> to work with future devices that will (presumably) be completely flexible
> >> and/or programmable to provide any metadata configuration directly from
> >> the hardware. The XDP hints scheme obviously needs to work with both.
> > 
> > Thanks Toke for providing feedback on this. Just had to get this stuff "off my chest" 😉. I was stuck in the zero-copy world where you can modify drivers, but of course we need to consider "older" drivers with XDP support. With that in mind, I do agree with most of your comments. There are some questions and thoughts interspersed below and a suggestion for how to go forward at the end.
> > 
> 
> I do appriciate that you want to discuss driver implementation, but this 
> email and example program[1] was targetted at a much smaller step.
> 
> The main purpose was to show-with-code that decoding BTF and using it 
> from userspace is not difficult.  And iterate on the API a bit.
> 
>   [1] 
> https://github.com/xdp-project/bpf-examples/tree/master/AF_XDP-interaction
> 
> As Toke mentioned, I also believe it is orthogonal.
> I think we all agree that BTF will be the way the XDP-hints will be 
> "described", and we need to propose APIs/code that does this decoding 
> e.g. in userspace. (But I also want things like cpumap kernel side 
> XDP-prog to understand this).
> 
> But I'm taking the bait, and will discuss with you below...
> 
> 
> >> Secondly, we should distinguish between the configuration interface and the
> >> metadata consumption from within the data path. I think those should be
> >> separate interfaces; in particular, I don't think loading an XDP program in
> >> itself should have the side effect of reconfiguring the hardware metadata
> >> format. Rather, the reconfiguration should be a separate, explicit step using
> >> whatever API we can end up agreeing on (ethtool or rtnetlink come to mind
> >> as obvious contenders).
> 
> I agree with Toke, that we should distinguish between the configuration 
> interface and metadata consumption.

+1 as a first iteration we can ignore the configuration mechanism and
assume its either hard coded or programmed into the NIC by the hardware
vendor.

> 
> Do notice that we ALREADY have existing config interfaces that enables 
> and disabled HW hints provided by the hardware. How to handle this?
> 
> E.g. The HW RX-timestamp features can be enabled when calling tcpdump, 
> and will be disabled again when tcpdump stop again.
> Thus, this action (enable HW-timestamp) will likely change what 
> XDP-hints structure that driver uses.
> What do you think the action should be?
> 
> My opinion is to simply use the existing kernel API to enable these 
> HW-timestamps.  Today the SKB is updated with the HW-timestamp, and in 
> the future I would like the SKB to be created later (in net-core) and 
> get this HW-timestamp via an BPF-prog that extract this from the 
> metadata area. (Hint today metadata area actually survives with the SKB, 
> and can be accessed by e.g. TC-BPF).
> Thus, I do want this to work with our existing netstack config interfaces.

I don't disagree above would be great if existing knobs worked as
expected and played nice with hints. I view it as step 2 though. Just
my $.02 its more of a question on how to get from point A to B, not so
much differences in end result.

> 
> 
> >> However, the
> >> *format* for this configuration could very well be BTF-based, so userspace
> >> can get whatever format it wants, assuming the hardware supports it.
> >>
> >> So, say we have this fancy programmable hardware, and we write a program
> >> with a struct definition like:
> >>
> >> struct my_meta_format {
> >>         __u64 rx_timestamp;
> >>         __u64 magic_colour_of_packet;
> >>         __u32 btf_id;
> >> };
> >>
> >> and from userspace we can then do:
> >>
> >> dev_metadata_configure(ifindex, BTF_OF_STRUCT(my_meta_format));

I have some doubts/questions about complexity on firmware/driver side
to consume such sparse info and create such complex reconfig of hw.
But, maybe some simple pattern matching would sufficient on hw side
and useful to get things moving forward.

Seeing real hardware with support here would be great.

> >>
> >> which will pass the BTF-format struct to the kernel and configure the
> >> hardware to write those fields. After this, the XDP program can just do:
> >>
> >> int my_xdp_prog(struct xdp_buff *ctx)
> >> {
> >>    struct my_meta_format *meta = ctx->data_meta;
> >>
> >>    do_something_with(meta->magic_colour_of_packet);
> >>    return XDP_XXX;
> >> }
> >>
> >> and it will just work. Same thing from userspace.
> >>
> >> Or it can define a CO-RE enabled struct like:
> >>
> >> struct my_meta_subset {
> >>         __u64 magic_colour_of_packet; /* we only care about this attr */ }
> >> __attribute__((preserve_access_index));
> >>
> >> int my_xdp_prog(struct xdp_buff *ctx)
> >> {
> >>    struct my_meta_subset *meta = ctx->data_meta;
> >>
> >>    do_something_with(meta->magic_colour_of_packet);
> >>    return XDP_XXX;
> >> }
> >>
> >> and libbpf will rewrite the field accesses using the BTF information, re-
> >> exported from the kernel from what userspace passed in when configuring.
> >>
> 
> I've also been down the same rabbit hole, wanting userspace to define 
> BTF layout as the config interface that HW will get reconfigured via.
> I no-longer believe in this mode.  One reason is the existing config 
> interfaces that enable/disable NIC HW features.
> 
> One way we can allow userspace to define the contents of the XDP-hints 
> struct, not the HW config, is to add this new BPF 'hints-hook'.
> Userspace can query the BPF-prog loaded in the 'hints-hook' and see that 
> BTF structs it provide.
> This is similar to that I do for AF_XDP in [1], as the XDP BPF-prog 
> defines the layout and AF_XDP userspace queries the BTF avail.

I expected, but it didn't happen yet(?), is first users would go a
different route.  The way I see it is, hw vendor can configure the NIC to put
any hints they like in the header via firmware update. The user space
would understand the layout of the hints because it programmed
these hints. In general its not very friendly for distributions and
their end users, but for a DPDK user running on top of AF_XDP this would be
all thats needed. Or an embedded end system at a telco or POC on IDS would
work. Anyoen doing this today?

> 
> 
> >> With the above in mind, a few comments on some of your other points:
> >>
> >>> The key feature here is that bpf_xdp_get_metadata() will actually go
> >>> and fetch this from the HW, by calling a small callback function in
> >>> the driver that reads for example rx_timestamp from the HW and returns
> >>> it to the XDP program. This is clearly not possible today and would
> >>> require new plumbing, if it is even possible to implement this. But
> >>> let us leave aside the implementation for now and just focus on the
> >>> benefit this (or something like it) could provide compared to a
> >>> kernel-centric approach.
> >>
> >> Took me a while to understand what you meant with this, but I think it finally
> >> clicked with your example below, so I'll leave the comment to there...
> >>
> >>> * User-space can completely control where to put data and what it puts
> >>> there. Think of producing the structure you want in user space
> >>> directly without having to copy things around. You could for example
> >>> produce a DPDK mbuf, or the VPP, Surikata equivalent directly. Would
> >>> save a lot of cycles.
> >>
> >> I think that userspace being in control of the metadata format is a great goal
> >> we should keep in mind, and as mentioned above we can do that with a BTF-
> >> based interface.
> >>
> >>> * No need for a metadata enablement interface. eBPF could find this
> >>> out by just parsing the XDP program and enable the used metadata
> >>> features in the HW by calling enable/disable metadata functions in the
> >>> driver.
> >>
> >> As mentioned above I think this is a mis-feature: configuration should be
> >> explicit and out of band, not tied to the data path implementation.
> >>
> 
> I agree. Also consider the existing config interfaces that we cannot 
> ignore and have to collaborate with.

I also wonder about hw capability to consume a BTF/eBPF program
and reconfigure the hardware.

> 
> >>> * No reason to expose BTFs to user-space. Makes it a lot simpler.
> >>> Actually, no need to use them in the driver either.
> >>
> >> BTF is just a handy format for describing the data layout. There has to be
> >> *some* format to communicate this, and exposing the BTF format to
> >> userspace is just a way for the kernel to say "this is how the hardware is
> >> currently configured". I don't understand why you think not having that
> >> makes things simpler?
> > 

... (cut cache miss discussion)

> 
> >> This is actually something Jesper and I have been discussing as a possible
> >> future solution! However, what we discussed was not exposing this *to
> >> XDP*, but rather to have a separate BPF program *which is part of the
> >> driver* that does this. And this BPF program can then be paired with the BTF
> >> format coming in from userspace, so that it will only load the fields
> >> requested. So, going back to my imaginary configuration interface above,
> >> when userspace calls:
> >>
> 
> Yes, exactly.
> If we introduce a BPF hints-hook, we could give this new BPF hook access 
> to the RX-desc (and XDP->ctx / xdp_buff) plus some driver specific 
> status quirks-bits.  The driver developers, that know the quirks, could 
> ship a BPF-prog together with driver code that creates/writes the 
> metadata area AND this BPF-prog could DEFINE the struct name and layout 
> as BTF (which userspace can query and extract).
> 
> 

... cut some more interesting discussion.

>> > It would be great if we could know it is fixed, but I do not understand how the user can know this, especially since the control of this is out-of-band. How would we deal with the following scenario?
> > 
> > App 1 comes up, opens up an AF_XDP socket and requests metadata_1
> > App 2 comes up, opens up another AF_XDP socket on the same netdev and requests metadata_2
> > 
> > We can provide the apps with two different btf_ids, but is this something that an existing driver can support and how does this scale as we add sockets and different usages of metadata? Note that we have no idea what the destination is until after we have executed our XDP program and potentially used the metadata area there. But our population of the metadata field is before the XDP program. Kind of chicken and egg.
> > 
> > The idea of a separate metadata population hook point on the netdev/queue_id level could potentially solve this. Well, as long as you are not attaching several sockets to the same netdev and queue_id, but that is rare.

Interesting, but I would get basic single config working first. If user
really wants multiple configs then I would guess the NIC might partition
the hardware into VFs or virtual interfaces of some kind.

> > 
> >> - Writing an application that can support different metadata formats so
> >>    it can run without re-compilation on different hardware. For this, it
> >>    is possible to load the program with a target device, so that the long
> >>    sequence of checks of the btf_id can be eliminated by CO-RE at load
> >>    time, and turned into just a single sanity check so that the metadata
> >>    is not loaded if the configuration changes.
> > 
> > Would be great if we had CO-RE for user-space, but as it is not there in llvm or gcc for x86, we probably have to go through all the tests in AF_XDP.

I think CO-RE for user space should happen regardless of hardware use
cases. Looks to me like multiple use cases for this exist.


> > 
> >> - Writing an application that deals with packets from multiple devices
> >>    in the same program (say, a devmap program that gets redirects from
> >>    different hardware into the same egress device).
> > 
> > Yes. The same goes for AF_XDP and a shared umem between different NICs.
> > 
> >> - Dealing with metadata configuration where a subset of the traffic has
> >>    different metadata (e.g., timestamps may only appear on PTP packets,
> >>    say). Here the check of the BTF ID can be used to demux the packet
> >>    type.
> > 
> > Would this not lead to a combinatorial explosion of BTF IDs as we add fields that might or might not have a valid entry? Would it not be better with a flags field in the MD section?
> > 
> 
> I've debated this with John before, and I though we resolved the 
> concerns in that discussion.
> In summary:
> 
> 1. The driver is 100% free to create a flags field in the XDP-hints 
> struct, that define what fields are valid.  If you like to be old 
> fashion, and we can even define the bits as UAPI for common fields that 
> more drivers provide. This just requires more work in userspace/BPF-prog 
> to check flags-valid before accessing a member.
> 
> 2. Until we get the XDP-hints-hook, drivers will be old fashion and 
> define a rather limited number of XDP-hints structs.  Thus, I would not 
> worry about drivers creating too many structs and BTF-IDs.

I would add a 3. hw + user can agree on whats in the hints and are
free to put anything in there if they use out of band config
mechanism.

> 
> Today drivers are not very flexible, and this work is about creating a 
> super flexible interface.  It doesn't mean that drivers and hardware 
> will all of a sudden get super flexible.  Driver and hardware start 
> using this new flexible BTF in very simple ways.
> 
> 
> > Thank you so much for your feedback. It really helped! In summary, BTF it is as an interface and I agree that we should start simple and just get this to work with the if-statement method. But it seems that I can also continue to prototype the main idea of reading metadata from XDP. Do you think it would be possible to get a new hook point for XDP metadata population? If so, I think we should pursue that, especially if we could tie it to a netdev/queue_id pair. That would simplify a lot for AF_XDP.
> > 
> 
> I've all for prototypeing something so we can make progress.

In general I would suggest building the simplest possible thing
first. And then improve once users start to actually use it.
In parrellel I would like to see someone build a 'real' demo
using 3 to show-off the amazing performance improvements
to justify adding hooks and lots of other effort in driver/kernel.

Thanks.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [xdp-hints] Re: XDP-hints via local BTF info
  2021-11-18 15:18         ` John Fastabend
@ 2021-11-19 14:53           ` Toke Høiland-Jørgensen
  2021-11-22 12:45             ` [xdp-hints] Basic/Dumb question WAS(Re: " Jamal Hadi Salim
  2021-11-22 12:57             ` [xdp-hints] " Alexander Lobakin
  0 siblings, 2 replies; 15+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-11-19 14:53 UTC (permalink / raw)
  To: John Fastabend, Jesper Dangaard Brouer, Karlsson, Magnus,
	Desouza, Ederson
  Cc: brouer, xdp-hints, Eelco Chaudron, Andrii Nakryiko, Fijalkowski,
	Maciej, Burakov, Anatoly

Just a few additional comments, as I think y'all mostly covered everything:

>> >> However, the
>> >> *format* for this configuration could very well be BTF-based, so userspace
>> >> can get whatever format it wants, assuming the hardware supports it.
>> >>
>> >> So, say we have this fancy programmable hardware, and we write a program
>> >> with a struct definition like:
>> >>
>> >> struct my_meta_format {
>> >>         __u64 rx_timestamp;
>> >>         __u64 magic_colour_of_packet;
>> >>         __u32 btf_id;
>> >> };
>> >>
>> >> and from userspace we can then do:
>> >>
>> >> dev_metadata_configure(ifindex, BTF_OF_STRUCT(my_meta_format));
>
> I have some doubts/questions about complexity on firmware/driver side
> to consume such sparse info and create such complex reconfig of hw.
> But, maybe some simple pattern matching would sufficient on hw side
> and useful to get things moving forward.

Just a quick note on this: if we're using BTF as a configuration format,
that's basically just another way of passing in a list of metadata item
names + data types, and their order. So the above would tell the
hardware (or driver) to enable "u64 rx_timestamp" and "u64
magic_colour_of_packet", where the only way the driver could figure out
what that's supposed to mean is by string matching on the names.

We could of course provide some common names in the core that many
drivers could support, but my main point with this is that BTF is "just"
a convenient format to pack this list into via a struct definition, it's
not magic faerie dust that makes sure there's also a *semantic* match. :)

> Seeing real hardware with support here would be great.

I don't think the BTF support has to go all the way to the hardware, a
driver could support this format just fine today (cf the above).

>> I've also been down the same rabbit hole, wanting userspace to define 
>> BTF layout as the config interface that HW will get reconfigured via.
>> I no-longer believe in this mode.  One reason is the existing config 
>> interfaces that enable/disable NIC HW features.
>> 
>> One way we can allow userspace to define the contents of the XDP-hints 
>> struct, not the HW config, is to add this new BPF 'hints-hook'.
>> Userspace can query the BPF-prog loaded in the 'hints-hook' and see that 
>> BTF structs it provide.
>> This is similar to that I do for AF_XDP in [1], as the XDP BPF-prog 
>> defines the layout and AF_XDP userspace queries the BTF avail.
>
> I expected, but it didn't happen yet(?), is first users would go a
> different route. The way I see it is, hw vendor can configure the NIC
> to put any hints they like in the header via firmware update. The user
> space would understand the layout of the hints because it programmed
> these hints. In general its not very friendly for distributions and
> their end users, but for a DPDK user running on top of AF_XDP this
> would be all thats needed. Or an embedded end system at a telco or POC
> on IDS would work.

People could still do this, of course, but I view the BTF layout stuff
mostly as a way to make something like this nicer to consume: you'll
be able to have essentially the same workflow, but you have
introspection of the result so you can verify you don't have a
misconfiguration somewhere, etc.

>>> > It would be great if we could know it is fixed, but I do not
>>> > understand how the user can know this, especially since the
>>> > control of this is out-of-band. How would we deal with the
>>> > following scenario?
>> > 
>> > App 1 comes up, opens up an AF_XDP socket and requests metadata_1
>> > App 2 comes up, opens up another AF_XDP socket on the same netdev and requests metadata_2
>> > 
>> > We can provide the apps with two different btf_ids, but is this
>> > something that an existing driver can support and how does this
>> > scale as we add sockets and different usages of metadata? Note that
>> > we have no idea what the destination is until after we have
>> > executed our XDP program and potentially used the metadata area
>> > there. But our population of the metadata field is before the XDP
>> > program. Kind of chicken and egg.
>> > 
>> > The idea of a separate metadata population hook point on the
>> > netdev/queue_id level could potentially solve this. Well, as long
>> > as you are not attaching several sockets to the same netdev and
>> > queue_id, but that is rare.
>
> Interesting, but I would get basic single config working first. If user
> really wants multiple configs then I would guess the NIC might partition
> the hardware into VFs or virtual interfaces of some kind.

Or manually configure the metadata to be the union of what the two
applications require. I don't think that's completely unreasonable,
actually: for instance, a web server still expects the network
interfaces to have IP addresses assigned before it starts up, and I view
this as similar.

So, if App 1 requires metadata X and Y, and App 2 requires Y and Z, the
administrator would enable all three, and the apps would both be able to
find what they need because the BTF exported by the driver tells them
where in the metadata struct they're each located...

-Toke

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [xdp-hints] Basic/Dumb question WAS(Re: Re: XDP-hints via local BTF info
  2021-11-19 14:53           ` Toke Høiland-Jørgensen
@ 2021-11-22 12:45             ` Jamal Hadi Salim
  2021-11-22 13:59               ` [xdp-hints] " Toke Høiland-Jørgensen
  2021-11-22 12:57             ` [xdp-hints] " Alexander Lobakin
  1 sibling, 1 reply; 15+ messages in thread
From: Jamal Hadi Salim @ 2021-11-22 12:45 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, John Fastabend,
	Jesper Dangaard Brouer, Karlsson, Magnus, Desouza, Ederson
  Cc: brouer, xdp-hints, Eelco Chaudron, Andrii Nakryiko, Fijalkowski,
	Maciej, Burakov, Anatoly, tom Herbert

And it goes something like this:

Why does the metadata have to go in the DMA descriptors?
Our experience with the XDP metadata is: you start accessing
that there is a performance penalty (extra cache miss(es)).

Why is the metadata not encapped as part of the data? We
dont have MTU issues on receive since that is entirely a local matter;
meaning the hardware can expand the packet as much as it wants within
the boundaries of alloced DMA buffer space and XDP and any other
subsystem (TC for example) can take advantage of the metadata.
Then extracting metadata becomes a parser issue and you get rid of the
extra cache misses.
Another advantage is: If you make the metadata part the packet data
(and appropriately handle MTU issues) you can pass it around as well
to VMs, containers, other machines,etc

We have done something along these lines using TC IFE, see:
https://legacy.netdevconf.info/0.1/sessions/9.html

The BTF aspect for discoverability is still of value. I am wondering
if we can extend the IFE action to suck in some BTF info so we dont
have to create new kernel code...

cheers,
jamal

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [xdp-hints] Re: Basic/Dumb question WAS(Re: Re: XDP-hints via local BTF info
  2021-11-22 12:45             ` [xdp-hints] Basic/Dumb question WAS(Re: " Jamal Hadi Salim
@ 2021-11-22 13:59               ` Toke Høiland-Jørgensen
  2021-11-22 15:31                 ` Tom Herbert
  0 siblings, 1 reply; 15+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-11-22 13:59 UTC (permalink / raw)
  To: Jamal Hadi Salim, John Fastabend, Jesper Dangaard Brouer,
	Karlsson, Magnus, Desouza, Ederson
  Cc: brouer, xdp-hints, Eelco Chaudron, Andrii Nakryiko, Fijalkowski,
	Maciej, Burakov, Anatoly, tom Herbert

Jamal Hadi Salim <jhs@mojatatu.com> writes:

> And it goes something like this:
>
> Why does the metadata have to go in the DMA descriptors?
> Our experience with the XDP metadata is: you start accessing
> that there is a performance penalty (extra cache miss(es)).
>
> Why is the metadata not encapped as part of the data? We
> dont have MTU issues on receive since that is entirely a local matter;
> meaning the hardware can expand the packet as much as it wants within
> the boundaries of alloced DMA buffer space and XDP and any other
> subsystem (TC for example) can take advantage of the metadata.

Once the hardware learns how to do that, this is absolutely the
direction we want to go in. But for existing stuff, the metadata is in
the descriptor and needs to be translated somehow...

-Toke


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [xdp-hints] Re: Basic/Dumb question WAS(Re: Re: XDP-hints via local BTF info
  2021-11-22 13:59               ` [xdp-hints] " Toke Høiland-Jørgensen
@ 2021-11-22 15:31                 ` Tom Herbert
  2021-11-22 18:25                   ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 15+ messages in thread
From: Tom Herbert @ 2021-11-22 15:31 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Jamal Hadi Salim, John Fastabend, Jesper Dangaard Brouer,
	Karlsson, Magnus, Desouza, Ederson, brouer, xdp-hints,
	Eelco Chaudron, Andrii Nakryiko, Fijalkowski, Maciej, Burakov,
	Anatoly

On Mon, Nov 22, 2021 at 5:59 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Jamal Hadi Salim <jhs@mojatatu.com> writes:
>
> > And it goes something like this:
> >
> > Why does the metadata have to go in the DMA descriptors?
> > Our experience with the XDP metadata is: you start accessing
> > that there is a performance penalty (extra cache miss(es)).
> >
> > Why is the metadata not encapped as part of the data? We
> > dont have MTU issues on receive since that is entirely a local matter;
> > meaning the hardware can expand the packet as much as it wants within
> > the boundaries of alloced DMA buffer space and XDP and any other
> > subsystem (TC for example) can take advantage of the metadata.
>
> Once the hardware learns how to do that, this is absolutely the
> direction we want to go in. But for existing stuff, the metadata is in
> the descriptor and needs to be translated somehow...
>
Toke,

By existing stuff I think you mean legacy stuff :-)

I tend to agree with Jamal. Descriptor space is exceedingly limited
and it's unclear to me that the benefits of getting a few bits of
*hints* from the device outweigh the costs to develop XDP hints and
perpetually maintain the facility. That is to say it doesn't seem to
address the two fundamental limitations of the venerable 50 year old
device driver model: 1) the narrow waist of PCIe bus in expressing
ancillary information 2) the black box nature of devices that prevent
the stack from having any visibility into what it's actually doing
(and hence we can only get hints at best and not real operational data
for primary protocol processing).

I believe CXL and programmable devices are the emerging architectural
solution. This will enable split plane acceleration. For instance, if
we do this right, a NIC would be able to perform all the stateless
processing of a TCP/IP packet such that when the host gets the packet
it could immediately jump to the TCP receive processing function ala
TXDP-- or even more than that the device could process the whole
received TCP datapath and the host just puts the received data on the
socket (necessary to solve the TLS offload OOO problem). All this
requires a very tight integration between the stack and the device to
the extent that the lines are blurred and the boundary of the software
stack is effectively pushed into the device (but definitely not TOE!).

Tom

> -Toke
>

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [xdp-hints] Re: Basic/Dumb question WAS(Re: Re: XDP-hints via local BTF info
  2021-11-22 15:31                 ` Tom Herbert
@ 2021-11-22 18:25                   ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 15+ messages in thread
From: Toke Høiland-Jørgensen @ 2021-11-22 18:25 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Jamal Hadi Salim, John Fastabend, Jesper Dangaard Brouer,
	Karlsson, Magnus, Desouza, Ederson, brouer, xdp-hints,
	Eelco Chaudron, Andrii Nakryiko, Fijalkowski, Maciej, Burakov,
	Anatoly

Tom Herbert <tom@sipanda.io> writes:

> On Mon, Nov 22, 2021 at 5:59 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Jamal Hadi Salim <jhs@mojatatu.com> writes:
>>
>> > And it goes something like this:
>> >
>> > Why does the metadata have to go in the DMA descriptors?
>> > Our experience with the XDP metadata is: you start accessing
>> > that there is a performance penalty (extra cache miss(es)).
>> >
>> > Why is the metadata not encapped as part of the data? We
>> > dont have MTU issues on receive since that is entirely a local matter;
>> > meaning the hardware can expand the packet as much as it wants within
>> > the boundaries of alloced DMA buffer space and XDP and any other
>> > subsystem (TC for example) can take advantage of the metadata.
>>
>> Once the hardware learns how to do that, this is absolutely the
>> direction we want to go in. But for existing stuff, the metadata is in
>> the descriptor and needs to be translated somehow...
>>
> Toke,
>
> By existing stuff I think you mean legacy stuff :-)

Well, whatever term you want to use for "the stuff that's in hardware
today" :)

> I tend to agree with Jamal. Descriptor space is exceedingly limited
> and it's unclear to me that the benefits of getting a few bits of
> *hints* from the device outweigh the costs to develop XDP hints and
> perpetually maintain the facility. That is to say it doesn't seem to
> address the two fundamental limitations of the venerable 50 year old
> device driver model: 1) the narrow waist of PCIe bus in expressing
> ancillary information 2) the black box nature of devices that prevent
> the stack from having any visibility into what it's actually doing
> (and hence we can only get hints at best and not real operational data
> for primary protocol processing).
>
> I believe CXL and programmable devices are the emerging architectural
> solution. This will enable split plane acceleration. For instance, if
> we do this right, a NIC would be able to perform all the stateless
> processing of a TCP/IP packet such that when the host gets the packet
> it could immediately jump to the TCP receive processing function ala
> TXDP-- or even more than that the device could process the whole
> received TCP datapath and the host just puts the received data on the
> socket (necessary to solve the TLS offload OOO problem). All this
> requires a very tight integration between the stack and the device to
> the extent that the lines are blurred and the boundary of the software
> stack is effectively pushed into the device (but definitely not TOE!).

This is certainly something that xdp-hints should be able to address.
And it does: the XDP metadata area is right before the packet data, and
while today that bit of memory is not covered by DMA, there is
absolutely no reason why it couldn't be for devices that support putting
arbitrary stuff in there. In which case the driver can just export a BTF
description of what the hardware writes into that area, and XDP can
access it directly using xdp-hints.

All the talk about translating from hardware descriptors applies to
existing hardware (or "legacy" if you want to call it that), but if the
hardware can just write the metadata directly, driver translation is
obviously not needed.

-Toke


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [xdp-hints] Re: XDP-hints via local BTF info
  2021-11-19 14:53           ` Toke Høiland-Jørgensen
  2021-11-22 12:45             ` [xdp-hints] Basic/Dumb question WAS(Re: " Jamal Hadi Salim
@ 2021-11-22 12:57             ` Alexander Lobakin
  2021-11-24 11:54               ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 15+ messages in thread
From: Alexander Lobakin @ 2021-11-22 12:57 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Alexander Lobakin, John Fastabend, Jesper Dangaard Brouer,
	Magnus Karlsson, Ederson Desouza, brouer, xdp-hints,
	Eelco Chaudron, Andrii Nakryiko, Maciej Fijalkowski,
	Anatoly Burakov

From: Toke Høiland-Jørgensen <toke@redhat.com>
Date: Fri, 19 Nov 2021 15:53:39 +0100

> Just a few additional comments, as I think y'all mostly covered everything:

Just wanted to say that I've almost finished my iteration with
cpumap and veth covered.
I'll commit the current snapshot during the next three days ([0]).
I also wanted to include GRO for cpumap into this first series,
but not sure if it really belongs to it. WDYGT?

[ snip ]

[0] https://github.com/alobakin/linux

> -Toke

Thanks,
Al

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [xdp-hints] Re: XDP-hints via local BTF info
  2021-11-22 12:57             ` [xdp-hints] " Alexander Lobakin
@ 2021-11-24 11:54               ` Jesper Dangaard Brouer
  2021-11-25 20:04                 ` Alexander Lobakin
  0 siblings, 1 reply; 15+ messages in thread
From: Jesper Dangaard Brouer @ 2021-11-24 11:54 UTC (permalink / raw)
  To: Alexander Lobakin, Toke Høiland-Jørgensen
  Cc: brouer, John Fastabend, Jesper Dangaard Brouer, Magnus Karlsson,
	Ederson Desouza, xdp-hints, Eelco Chaudron, Andrii Nakryiko,
	Maciej Fijalkowski, Anatoly Burakov

On 22/11/2021 13.57, Alexander Lobakin wrote:
> From: Toke Høiland-Jørgensen <toke@redhat.com> Date: Fri, 19 Nov 2021
> 15:53:39 +0100
> 
>> Just a few additional comments, as I think y'all mostly covered
>> everything:
> 
> Just wanted to say that I've almost finished my iteration with cpumap
> and veth covered.

What does this mean?

Can you consume XDP-hints in cpumap and veth, and transfer that to the
SKB fields? (I'm very interested in this).

> I'll commit the current snapshot during the next three days ([0]). I
> also wanted to include GRO for cpumap into this first series, but not
> sure if it really belongs to it. WDYGT?

Everybody seems eager to hijack the thread subject :-P

When introducing/proposing kernel changes, the discussion belong on the
mailing-list bpf@vger.kernel.org.

> [ snip ]
> 
> [0] https://github.com/alobakin/linux

Kernel changes are usually send as (RFC) patches in the mailing list.

If you want to share kernel changes via GitHub, please create a branch,
and please avoid rebasing once shared.  I usually version my branches
via a descriptive-name plus an increasing number, so reviewers know
which branch is the latest code iteration.

It is unusual, but we can collaborate on kernel changes via GitHub. I've
only done with changes that are not ready for upstream submission yet,
and as part of a (1-to-1) review over IRC (with co-workers).

I'm not sure how you expect people to help you review[0] this?

--Jesper

^ permalink raw reply	[flat|nested] 15+ messages in thread

* [xdp-hints] Re: XDP-hints via local BTF info
  2021-11-24 11:54               ` Jesper Dangaard Brouer
@ 2021-11-25 20:04                 ` Alexander Lobakin
  0 siblings, 0 replies; 15+ messages in thread
From: Alexander Lobakin @ 2021-11-25 20:04 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Alexander Lobakin, Toke Høiland-Jørgensen, brouer,
	John Fastabend, Magnus Karlsson, Ederson Desouza, xdp-hints,
	Eelco Chaudron, Andrii Nakryiko, Maciej Fijalkowski,
	Anatoly Burakov

From: Jesper Dangaard Brouer <jbrouer@redhat.com>
Date: Wed, 24 Nov 2021 12:54:00 +0100

> On 22/11/2021 13.57, Alexander Lobakin wrote:
> > From: Toke Høiland-Jørgensen <toke@redhat.com> Date: Fri, 19 Nov 2021
> > 15:53:39 +0100
> >
> >> Just a few additional comments, as I think y'all mostly covered
> >> everything:
> >
> > Just wanted to say that I've almost finished my iteration with cpumap
> > and veth covered.
> 
> What does this mean?
> 
> Can you consume XDP-hints in cpumap and veth, and transfer that to the
> SKB fields? (I'm very interested in this).

Exactly!
NIC driver (I'm working on a current generation NIC, so no hardware
metadata composing) builds the metadata in a known format (generic
structure), XDP prog redirects the frame to another core, function
that converts xdp_frame to sk_buff parses the metadata and fills in
skb fields.
This gives nothing on the frame size lesser than 70-90, but starting
from ~100 bytes we have a nice boost due to that CPUs don't have to
spend cycles calculating checksums (hashes etc.) anymore. So I
expect we can now benefit from GRO on cpumap and veth, thus asked
about it.
Regarding that I didn't manage to receive any performance benefits
from Hints on frame sizes lesser than beforementioned, I decided to
give an option to specify the frame size which the NIC {,driver}
should start composing metadata from. This is debatable of course,
will see.

> > I'll commit the current snapshot during the next three days ([0]). I
> > also wanted to include GRO for cpumap into this first series, but not
> > sure if it really belongs to it. WDYGT?
> 
> Everybody seems eager to hijack the thread subject
> 
> When introducing/proposing kernel changes, the discussion belong on the
> mailing-list bpf@vger.kernel.org.

Sure thing. I was just interested in a more "local" opinion, like
if by any chance you had similar thoughts or drafts etc.

> > [ snip ]
> >
> > [0] https://github.com/alobakin/linux
> 
> Kernel changes are usually send as (RFC) patches in the mailing list.

Sure thing. This repo is just an "early access", so anyone could see
what is going on before I prepare and publish any series.
There might've been a misunderstanding, but I didn't attach this
link as a "please review", it was an "I have a repo" (:

> If you want to share kernel changes via GitHub, please create a branch,
> and please avoid rebasing once shared.  I usually version my branches
> via a descriptive-name plus an increasing number, so reviewers know
> which branch is the latest code iteration.

All my changesets have separate branches, sure thing.
Not sure about rebasing thou. Many devs and kernel maintainers use
rebasing on a daily basis, without it any upstream updates mess up
everything.

> It is unusual, but we can collaborate on kernel changes via GitHub. I've
> only done with changes that are not ready for upstream submission yet,
> and as part of a (1-to-1) review over IRC (with co-workers).
> 
> I'm not sure how you expect people to help you review[0] this?

As I said, it's not a "please review" link. I'll post a link to
a separate branch here once finish the code, just for a sneak peek
for a couple days before publishing an RFC to LKML, so anyone could
leave some early comments using XDP Hints ML or GitHub itself. Some
sort of a preliminary review if you prefer :P

--

I'm sorry that I didn't have a chance yet to look at your code: it's
been a busy time. I'll dig a bit later, but anyways, thanks for all
of the work in that area

> --Jesper 

Al

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2021-11-25 20:04 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-17 17:22 [xdp-hints] XDP-hints via local BTF info Jesper Dangaard Brouer
2021-11-17 20:07 ` [xdp-hints] " Karlsson, Magnus
2021-11-17 22:48   ` Toke Høiland-Jørgensen
2021-11-18  8:05     ` Karlsson, Magnus
2021-11-18 14:30       ` Jesper Dangaard Brouer
2021-11-18 14:57         ` Karlsson, Magnus
2021-11-18 15:18         ` John Fastabend
2021-11-19 14:53           ` Toke Høiland-Jørgensen
2021-11-22 12:45             ` [xdp-hints] Basic/Dumb question WAS(Re: " Jamal Hadi Salim
2021-11-22 13:59               ` [xdp-hints] " Toke Høiland-Jørgensen
2021-11-22 15:31                 ` Tom Herbert
2021-11-22 18:25                   ` Toke Høiland-Jørgensen
2021-11-22 12:57             ` [xdp-hints] " Alexander Lobakin
2021-11-24 11:54               ` Jesper Dangaard Brouer
2021-11-25 20:04                 ` Alexander Lobakin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox