XDP hardware hints discussion mail archive
 help / color / mirror / Atom feed
From: "Toke Høiland-Jørgensen" <toke@redhat.com>
To: "Karlsson, Magnus" <magnus.karlsson@intel.com>,
	Jesper Dangaard Brouer <jbrouer@redhat.com>,
	"Desouza, Ederson" <ederson.desouza@intel.com>
Cc: "brouer@redhat.com" <brouer@redhat.com>,
	"xdp-hints@xdp-project.net" <xdp-hints@xdp-project.net>,
	Eelco Chaudron <echaudro@redhat.com>,
	Andrii Nakryiko <andrii@kernel.org>,
	"Fijalkowski, Maciej" <maciej.fijalkowski@intel.com>,
	"Burakov, Anatoly" <anatoly.burakov@intel.com>
Subject: [xdp-hints] Re: XDP-hints via local BTF info
Date: Wed, 17 Nov 2021 23:48:14 +0100	[thread overview]
Message-ID: <875ysqflg1.fsf@toke.dk> (raw)
In-Reply-To: <MW3PR11MB4602E58178E7419934505308F79A9@MW3PR11MB4602.namprd11.prod.outlook.com>

"Karlsson, Magnus" <magnus.karlsson@intel.com> writes:

> Together with Maciej and Anatoly, I have been toying with how to
> accomplish this, but it is early days so warning for some serious
> handwaving. Will produce some code to see if it is possible at all.
> One drawback with having it completely flexible and letting user-space
> decide is the complexity implementing this in the driver/kernel. But
> is this not why we have eBPF for in the first place? Maybe it can come
> to the rescue here.

I think there are a couple of distinctions we need to make, which your
"handwaving" glosses over a little bit :)

First, we have to distinguish between two cases here: how to work with
existing hardware and drivers and their metadata implementations, and
how to work with future devices that will (presumably) be completely
flexible and/or programmable to provide any metadata configuration
directly from the hardware. The XDP hints scheme obviously needs to work
with both.

Secondly, we should distinguish between the configuration interface and
the metadata consumption from within the data path. I think those should
be separate interfaces; in particular, I don't think loading an XDP
program in itself should have the side effect of reconfiguring the
hardware metadata format. Rather, the reconfiguration should be a
separate, explicit step using whatever API we can end up agreeing on
(ethtool or rtnetlink come to mind as obvious contenders). However, the
*format* for this configuration could very well be BTF-based, so
userspace can get whatever format it wants, assuming the hardware
supports it.

So, say we have this fancy programmable hardware, and we write a program
with a struct definition like:

struct my_meta_format {
       __u64 rx_timestamp;
       __u64 magic_colour_of_packet;
       __u32 btf_id;
};

and from userspace we can then do:

dev_metadata_configure(ifindex, BTF_OF_STRUCT(my_meta_format));

which will pass the BTF-format struct to the kernel and configure the
hardware to write those fields. After this, the XDP program can just do:

int my_xdp_prog(struct xdp_buff *ctx)
{
  struct my_meta_format *meta = ctx->data_meta;

  do_something_with(meta->magic_colour_of_packet);
  return XDP_XXX;
}

and it will just work. Same thing from userspace.

Or it can define a CO-RE enabled struct like:

struct my_meta_subset {
       __u64 magic_colour_of_packet; /* we only care about this attr */
} __attribute__((preserve_access_index));

int my_xdp_prog(struct xdp_buff *ctx)
{
  struct my_meta_subset *meta = ctx->data_meta;

  do_something_with(meta->magic_colour_of_packet);
  return XDP_XXX;
}

and libbpf will rewrite the field accesses using the BTF information,
re-exported from the kernel from what userspace passed in when
configuring.

With the above in mind, a few comments on some of your other points:

> The key feature here is that bpf_xdp_get_metadata() will actually go
> and fetch this from the HW, by calling a small callback function in
> the driver that reads for example rx_timestamp from the HW and returns
> it to the XDP program. This is clearly not possible today and would
> require new plumbing, if it is even possible to implement this. But
> let us leave aside the implementation for now and just focus on the
> benefit this (or something like it) could provide compared to a
> kernel-centric approach.

Took me a while to understand what you meant with this, but I think it
finally clicked with your example below, so I'll leave the comment to
there...

> * User-space can completely control where to put data and what it puts
> there. Think of producing the structure you want in user space
> directly without having to copy things around. You could for example
> produce a DPDK mbuf, or the VPP, Surikata equivalent directly. Would
> save a lot of cycles.

I think that userspace being in control of the metadata format is a
great goal we should keep in mind, and as mentioned above we can do that
with a BTF-based interface.

> * No need for a metadata enablement interface. eBPF could find this
> out by just parsing the XDP program and enable the used metadata
> features in the HW by calling enable/disable metadata functions in the
> driver.

As mentioned above I think this is a mis-feature: configuration should
be explicit and out of band, not tied to the data path implementation.

> * No reason to expose BTFs to user-space. Makes it a lot simpler.
> Actually, no need to use them in the driver either.

BTF is just a handy format for describing the data layout. There has to
be *some* format to communicate this, and exposing the BTF format to
userspace is just a way for the kernel to say "this is how the hardware
is currently configured". I don't understand why you think not having
that makes things simpler?

> * We do not have to use the metadata area unless it is actually
> needed. In the case of AF_XDP, we have to write it there, but local
> use in XDP does not. You would save one cache miss per packet right
> there.

Regardless of how the format is configured, why would an XDP program
read it if it doesn't need it?

> * We get away from non-scalable driver implementations such as:
>
> If (metadata_a_enabled)
>     meta->a = Read_metadata_a();
> If (metadata_b_enabled)
>     meta->b = Read_metadata_b();
> If (metadata_c_enabled)
>     meta->c = Read_metadata_c();
>  and so on for many entries.....

Implementations such as this exist, though, and we have to support them;
as mentioned above, I think it's absolutely a worthy goal to make the
metadata format completely fungible from userspace, but the XDP hints
interface needs to also support old drivers that do things in the
traditional way, even if it's inefficient.
>
> Or even worse:
>     meta->a = Read_metadata_b();
>     meta->b = Read_metadata_a();
>     meta->c = Read_metadata_c();
>     and so on....
>
> The last one would kill performance any day. Even the first one is
> shunned in DPDK for performance reasons.
>
> Note that in my proposal above, read_metadata_a() is a function that
> will be called from the XDP program when needed and only then, not
> from the driver, so driver code becomes uncluttered by these large if
> statements.

Ah, with this bit I understood your point above about "a helper reading
the data from the hardware". Let me see if I can rephrase your proposal,
and let me know if I got it wrong:

The problem you're speaking of here, is that of transforming the bits
that come in from the hardware (in an RX-desc, say), into the metadata
format consumed by XDP. Say, for instance that the driver has a function
like:

u64 my_driver_read_rx_timestamp(void *pkt_dsc);

which does whatever bit of device-specific magic it needed to get a
well-formed u64 timestamp out of the packet descriptor, maybe involving
handling HW-specific quirks etc. And you want to expose this function
directly to BPF instead of having a big conditional in the driver C
code, so we can use BPF code elimination to only get the bits we want,
right?

This is actually something Jesper and I have been discussing as a
possible future solution! However, what we discussed was not exposing
this *to XDP*, but rather to have a separate BPF program *which is part
of the driver* that does this. And this BPF program can then be paired
with the BTF format coming in from userspace, so that it will only load
the fields requested. So, going back to my imaginary configuration
interface above, when userspace calls:

dev_metadata_configure(ifindex, BTF_OF_STRUCT(my_meta_format));

the driver turns this over to the BPF subsystem with a BPF program like:

int my_driver_convert_meta(void *ctx, void *meta_dst)
{
        struct my_meta_format *meta_dst_fmt = meta_dst;

        if (bpf_core_field_exists(meta_dst->rx_timestamp))
           meta_dst->rx_timestamp = my_driver_read_rx_timestamp(ctx);
        if (bpf_core_field_exists(meta_dst->magic_colour_of_packet))
           meta_dst->magic_colour_of_packet = my_driver_read_magic_colour_of_packet(ctx);
        ...etc
}


And then the driver code where before you had, as you said:

> If (metadata_a_enabled)
>     meta->a = Read_metadata_a();
> If (metadata_b_enabled)
>     meta->b = Read_metadata_b();

will just be

BPF_PROG_RUN(drv->meta_writer_prog, meta);

and that will run the BPF program that was loaded (via a usermode
helper, probably) when userspace configured the metadata format and
we'll get the conditional code goodness and better performance.

However, having such a facility for drivers to dynamically *write*
metadata is, I believe, completely orthogonal to how we implement the
description format for consuming this metadata from XDP and userspace.
For one thing, not all drivers will probably want to use the dynamic
method, so we should not tie up the consumption of metadata to this. For
another, it's way too complex to implement everything at once, so
defining the consumption format first, then working on efficiently
writing it afterwards (or in parallel, maybe) makes it possible to have
incremental progress.

That being said, having both things in mind when defining things is a
good idea. But as I have hopefully outlined above, using BTF as the
description format allows both writing and reading of metadata to be
flexible and extensible.

> * We also get away from the problem that the BTF id can change from
> underneath the AF_XDP application. This becomes especially problematic
> when we would like to apply this to the Tx path and do offloads.

For this point, I think maybe the existence of the btf_id field in the
metadata struct has given you the impression that the format can go and
change at any time? I don't think that has to be the case: if a user
knows that they will always use the same metadata format, and they know
what that format is, it's perfectly valid to just ignore the btf_id
field and just cast the metadata field to a struct. However, the field
has to be there so that it is possible for the consumer to detect when
it changes, in order to support portable applications. There are two use
cases for this:

- Writing an application that can support different metadata formats so
  it can run without re-compilation on different hardware. For this, it
  is possible to load the program with a target device, so that the long
  sequence of checks of the btf_id can be eliminated by CO-RE at load
  time, and turned into just a single sanity check so that the metadata
  is not loaded if the configuration changes.

- Writing an application that deals with packets from multiple devices
  in the same program (say, a devmap program that gets redirects from
  different hardware into the same egress device).

- Dealing with metadata configuration where a subset of the traffic has
  different metadata (e.g., timestamps may only appear on PTP packets,
  say). Here the check of the BTF ID can be used to demux the packet
  type.

-Toke


  reply	other threads:[~2021-11-17 22:48 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-11-17 17:22 [xdp-hints] " Jesper Dangaard Brouer
2021-11-17 20:07 ` [xdp-hints] " Karlsson, Magnus
2021-11-17 22:48   ` Toke Høiland-Jørgensen [this message]
2021-11-18  8:05     ` Karlsson, Magnus
2021-11-18 14:30       ` Jesper Dangaard Brouer
2021-11-18 14:57         ` Karlsson, Magnus
2021-11-18 15:18         ` John Fastabend
2021-11-19 14:53           ` Toke Høiland-Jørgensen
2021-11-22 12:45             ` [xdp-hints] Basic/Dumb question WAS(Re: " Jamal Hadi Salim
2021-11-22 13:59               ` [xdp-hints] " Toke Høiland-Jørgensen
2021-11-22 15:31                 ` Tom Herbert
2021-11-22 18:25                   ` Toke Høiland-Jørgensen
2021-11-22 12:57             ` [xdp-hints] " Alexander Lobakin
2021-11-24 11:54               ` Jesper Dangaard Brouer
2021-11-25 20:04                 ` Alexander Lobakin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

  List information: https://lists.xdp-project.net/postorius/lists/xdp-hints.xdp-project.net/

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=875ysqflg1.fsf@toke.dk \
    --to=toke@redhat.com \
    --cc=anatoly.burakov@intel.com \
    --cc=andrii@kernel.org \
    --cc=brouer@redhat.com \
    --cc=echaudro@redhat.com \
    --cc=ederson.desouza@intel.com \
    --cc=jbrouer@redhat.com \
    --cc=maciej.fijalkowski@intel.com \
    --cc=magnus.karlsson@intel.com \
    --cc=xdp-hints@xdp-project.net \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox