[xdp-hints] Re: [LSF/MM/BPF TOPIC] XDP metadata for TX

From: Jesper Dangaard Brouer <jbrouer@redhat.com>
To: Stanislav Fomichev <sdf@google.com>,
	Jesper Dangaard Brouer <jbrouer@redhat.com>
Cc: brouer@redhat.com, "Magnus Karlsson" <magnus.karlsson@gmail.com>,
	"Toke Høiland-Jørgensen" <toke@kernel.org>,
	lsf-pc@lists.linux-foundation.org, bpf@vger.kernel.org,
	"xdp-hints@xdp-project.net" <xdp-hints@xdp-project.net>
Subject: [xdp-hints] Re: [LSF/MM/BPF TOPIC] XDP metadata for TX
Date: Fri, 10 Mar 2023 12:09:32 +0100
Message-ID: <3c70e01d-f685-e341-64f8-8eb4e8bd46fc@redhat.com>
In-Reply-To: <CAKH8qBuviabUfBTFg3gOfpkWc+oFvFP-NcV4g2ipn7D=C2u_2g@mail.gmail.com>



On 09/03/2023 19.04, Stanislav Fomichev wrote:
> On Tue, Mar 7, 2023 at 11:32 AM Jesper Dangaard Brouer
>>
>> On 03/03/2023 08.42, Magnus Karlsson wrote:
>>> On Mon, 27 Feb 2023 at 21:16, Stanislav Fomichev <sdf@google.com> wrote:
>>>> On Mon, Feb 27, 2023 at 6:17 AM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>>>>> Stanislav Fomichev <sdf@google.com> writes:
>>>>>> On Thu, Feb 23, 2023 at 3:22 PM Toke Høiland-Jørgensen <toke@kernel.org> wrote:
>>>>>>>
>>>>>>> Stanislav Fomichev <sdf@google.com> writes:
>>>>>>>
>>>>>>>> I'd like to discuss a potential follow up for the previous "XDP RX
>>>>>>>> metadata" series [0].
>>>>>>>>
>>>>>>>> Now that we can access (a subset of) packet metadata at RX, I'd like to
>>>>>>>> explore the options where we can export some of that metadata on TX. And
>>>>>>>> also whether it might be possible to access some of the TX completion
>>>>>>>> metadata (things like TX timestamp).
>>>>>>>>
>>
>> IMHO it makes sense to see TX metadata as two separate operations.
>>
>>    (1) Metadata written into the TX descriptor.
>>    (2) Metadata read when processing TX completion.
>>
>> These operations happen at two different points in time. Thus they
>> likely need different BPF hooks.  Having BPF-progs running at each of
>> these points in time will allow us to e.g. implement BQL (which is
>> relevant to the XDP queueing effort).
> 
> I guess for (2) the question here is: is it worth having a separate
> hook? Or will a simple tracepoint as Toke suggested be enough? For
> BQL purposes, we can still attach a prog to that tracepoint, right?
> 

Yes, see below. For (2) I'm hoping we can avoid a separate hook and
leverage xdp_return_frame() somehow.

For (1) "write into the TX descriptor", it happens so late in the TX
path that I don't think we can avoid a driver TX hook.  Even if we come
up with a common struct that drivers can consume hints from, we will
still need to modify the drivers to do the translation into the
hardware-specific TX desc format.  If we have to modify the drivers
anyhow, then adding a TX BPF-hook would likely be worthwhile for the
flexibility this creates.
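
As a rough illustration of the translation step described above, here is
a minimal user-space sketch of a driver turning a common hints struct
into its own descriptor format. Everything in it is an assumption made
for illustration: struct xdp_tx_hints, the TX_HINT_* flags and the
fake_hw_tx_desc layout are hypothetical and do not exist in the kernel
or in any real NIC.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical driver-independent TX hints; NOT an existing kernel struct. */
#define TX_HINT_CSUM (1u << 0)
#define TX_HINT_VLAN (1u << 1)

struct xdp_tx_hints {
	uint32_t flags;       /* TX_HINT_* bits saying which fields are valid */
	uint16_t csum_start;  /* offset where checksumming starts */
	uint16_t csum_offset; /* where to store the checksum, from csum_start */
	uint16_t vlan_tci;    /* VLAN tag to insert */
};

/* Made-up hardware TX descriptor for an imaginary NIC. */
#define FAKE_CMD_CSUM 0x01
#define FAKE_CMD_VLAN 0x02

struct fake_hw_tx_desc {
	uint64_t addr;
	uint16_t len;
	uint16_t vlan;
	uint8_t  cmd;
	uint8_t  csum_start;  /* this imaginary HW only has 8-bit offsets */
	uint8_t  csum_offset;
};

/*
 * Driver-side translation: common hints -> hardware-specific descriptor.
 * Returns false when a hint cannot be expressed in this descriptor format,
 * so the caller can fall back (e.g. to a software checksum).
 */
static bool fake_drv_fill_tx_desc(struct fake_hw_tx_desc *desc,
				  uint64_t dma_addr, uint16_t len,
				  const struct xdp_tx_hints *hints)
{
	memset(desc, 0, sizeof(*desc));
	desc->addr = dma_addr;
	desc->len = len;

	if (!hints)
		return true; /* no hints: plain transmit */

	if (hints->flags & TX_HINT_CSUM) {
		if (hints->csum_start > UINT8_MAX ||
		    hints->csum_offset > UINT8_MAX)
			return false;
		desc->cmd |= FAKE_CMD_CSUM;
		desc->csum_start = (uint8_t)hints->csum_start;
		desc->csum_offset = (uint8_t)hints->csum_offset;
	}

	if (hints->flags & TX_HINT_VLAN) {
		desc->cmd |= FAKE_CMD_VLAN;
		desc->vlan = hints->vlan_tci;
	}

	return true;
}
```

The point of the sketch is only that the per-driver part stays small: the
hints struct is shared, while each driver owns one translation function
for its descriptor layout.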


>>>>>>>> I'm currently trying to understand whether the same approach I've used
>>>>>>>> on RX could work at TX. By May I plan to have a bunch of options laid
>>>>>>>> out (currently considering XSK tx/compl programs and XDP tx/compl
>>>>>>>> programs) so we have something to discuss.
>>>>>>>
>>>>>>> I've been looking at ways of getting a TX-completion hook for the XDP
>>>>>>> queueing stuff as well. For that, I think it could work to just hook
>>>>>>> into xdp_return_frame(), but if you want to access hardware metadata
>>>>>>> it'll obviously have to be in the driver. A hook in the driver could
>>>>>>> certainly be used for the queueing return as well, though, which may
>>>>>>> help making it worth the trouble :)
>>>>>>
>>>>>> Yeah, I'd like to get to completion descriptors ideally; so nothing
>>>>>> better than a driver hook comes to mind so far :-(
>>
>> As Toke mentions, I'm also hoping we could leverage or extend the
>> xdp_return_frame() call.  Or implicitly add the "hook" at the existing
>> xdp_return_frame() call. This is about operation (2) *reading* some
>> metadata at TX completion time.
> 
> Ack, noted, thx. Although, at least for mlx5e_free_xdpsq_desc, I don't
> see it being called for the af_xdp tx path. But maybe that's something
> we can amend in a couple of places (so xdp_return_frame would handle
> most xdp cases, and some new tbd func for af_xdp tx)?
> 

Looking at mlx5e_free_xdpsq_desc(), it has a switch handling the
different frame types.  It wouldn't be much work, as you say, to amend
the code and add e.g. a tracepoint/hook catching more frame types (both
xdp and af_xdp).
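
To make the idea of one hook covering several frame types concrete, here
is a small user-space sketch. It is not the mlx5 code: the frame-type
enum, the context struct and the hook signature are all hypothetical, and
the counters merely stand in for the real per-type cleanup (such as
calling xdp_return_frame()).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical frame types a driver's XDP send queue might complete. */
enum tx_frame_type {
	TX_FRAME_XDP_REDIRECT,
	TX_FRAME_XDP_TX,
	TX_FRAME_AF_XDP,
};

/* Hypothetical context handed to the completion hook. */
struct tx_compl_ctx {
	enum tx_frame_type type;
	uint64_t hw_timestamp; /* pretend value read from the completion */
};

typedef void (*tx_compl_hook_t)(const struct tx_compl_ctx *ctx, void *priv);

/* Counters standing in for the real per-type cleanup work. */
struct tx_compl_stats {
	unsigned int frames_returned; /* would be xdp_return_frame() */
	unsigned int pages_recycled;  /* would be page recycling for XDP_TX */
	unsigned int xsk_completed;   /* would be the AF_XDP completion path */
};

static void free_tx_desc(const struct tx_compl_ctx *ctx,
			 struct tx_compl_stats *stats,
			 tx_compl_hook_t hook, void *priv)
{
	/* Run the hook first, while the completion data is still valid. */
	if (hook)
		hook(ctx, priv);

	switch (ctx->type) {
	case TX_FRAME_XDP_REDIRECT:
		stats->frames_returned++;
		break;
	case TX_FRAME_XDP_TX:
		stats->pages_recycled++;
		break;
	case TX_FRAME_AF_XDP:
		stats->xsk_completed++;
		break;
	}
}

/* Example hook: count completions per frame type. */
static void count_hook(const struct tx_compl_ctx *ctx, void *priv)
{
	unsigned int *seen = priv;

	seen[ctx->type]++;
}
```

Placing the hook call ahead of the switch means every frame type passes
through it, including the af_xdp one, without each case needing its own
call site.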

>> Can this be mapped to the RX-kfuncs approach(?), by driver extending
>> (call/structs) with pointer to TX-desc + adaptor info and BPF-prog/hook
>> doing TX-kfuncs calls into driver (that knows how to extract completion
>> data).
> 
> Yeah, that seems like a natural thing to do here.

Great. I hope drivers will not have hidden or freed the TX completion
descriptor before our BPF-prog gets a chance to read this data.
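
A sketch of how a kfunc-style accessor for TX completion data could be
shaped, modelled in plain user-space C. All names are assumptions made
for illustration (struct tx_compl_md, its ops table, the fake_cqe layout
and tx_compl_md_timestamp() do not exist in the kernel), and the
-EOPNOTSUPP / -ENODATA return convention is likewise just an assumption
of the sketch.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

struct tx_compl_md;

/* Hypothetical per-driver ops that know how to parse the completion. */
struct tx_compl_md_ops {
	int (*timestamp)(const struct tx_compl_md *md, uint64_t *ts);
};

/* Hypothetical context: pointer to the HW completion plus adaptor info. */
struct tx_compl_md {
	const void *hw_cqe;
	const struct tx_compl_md_ops *ops;
};

/* Made-up completion entry layout for an imaginary NIC. */
struct fake_cqe {
	uint32_t ts_hi;
	uint32_t ts_lo;
	uint8_t  ts_valid;
};

/* Driver side: extract the timestamp from its own completion format. */
static int fake_drv_tx_timestamp(const struct tx_compl_md *md, uint64_t *ts)
{
	const struct fake_cqe *cqe = md->hw_cqe;

	if (!cqe->ts_valid)
		return -ENODATA; /* HW did not timestamp this packet */

	*ts = ((uint64_t)cqe->ts_hi << 32) | cqe->ts_lo;
	return 0;
}

static const struct tx_compl_md_ops fake_drv_ops = {
	.timestamp = fake_drv_tx_timestamp,
};

/* Generic entry point a BPF-prog would call (kfunc-style). */
static int tx_compl_md_timestamp(const struct tx_compl_md *md, uint64_t *ts)
{
	if (!md->ops || !md->ops->timestamp)
		return -EOPNOTSUPP; /* driver does not implement this */

	return md->ops->timestamp(md, ts);
}
```

The accessor only works while hw_cqe still points at live completion
data, which is exactly the lifetime concern raised above.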

>>
>> [...]
>>>>> Well, to me XDP_REDIRECT is the most interesting one (see above). I
>>>>> think we could even drop the XDP_TX case and only do this for
>>>>> XDP_REDIRECT, since XDP_TX is basically a special-case optimisation.
>>>>> I.e., it's possible to XDP_REDIRECT back to the same device, the frames
>>>>> will just take a slight detour up through the stack; but that could also
>>>>> be a good thing if it means we'll have to do less surgery to the drivers
>>>>> to implement this for two paths.
>>>>>
>>>>> It does have the same challenge as you outlined above, though: At that
>>>>> point the TX descriptor probably doesn't exist, so the driver NDO will
>>>>> have to do something else with the data; but maybe we can solve that
>>>>> without moving the hook into the driver itself somehow?
>>>>
>>>> Ah, ok, yeah, I was putting XDP_TX / XDP_REDIRECT under the same
>>>> "transmit something out of xdp_rx hook" umbrella. We can maybe come up
>>>> with a skb-like-private metadata layout (as we've discussed previously
>>>> for skb) here as well? But not sure it would solve all the problems?
>>
>> This is operation (1) writing metadata into the TX descriptor.
>> In this case we have a metadata mapping problem, from RX on one device
>> driver to TX on another device driver. As you say, we also need to map
>> this to SKBs, which have a fairly static set of metadata.
>>
>> For the most common metadata offloads (like TX-checksum, TX-vlan) I
>> think it makes sense to store those in the xdp_frame area (used for SKB
>> mapping) and re-use these at TX when writing into the TX descriptor.
> 
> [..]
> 
>> BUT there are also metadata TX offloads, like asking for a specific
>> Launch-Time for a packet, that need a more flexible approach.
> 
> Why can't these go into the same "common" xdp_frame area?
> 

For several reasons:

  1. The "common" area in xdp_frame will be size constrained.

  2. The Launch-Time for a packet is not something we *read* from the RX
     descriptor, thus having room to store it in common area seems
     strange.

  3. The Launch-Time for a packet is something we likely calculate on the
     fly based on "time queue" state at our TX-hook egress. See TC-BPF
     code example for FQ-pacing here[1].

  4. We likely need an XDP queueing layer (Toke's work) to handle the HW
     limitations of the Launch-Time feature, as HW is limited in how far
     into the future it can schedule data (for chips i210 and i225 see[2]).

[1] https://github.com/xdp-project/bpf-examples/blob/master/traffic-pacing-edt/edt_pacer_vlan.c

[2] https://github.com/xdp-project/xdp-project/blob/master/areas/tsn/code01_follow_qdisc_TSN_offload.org
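
To illustrate points 3 and 4 of the list above, here is a minimal
user-space sketch of computing a Launch-Time on the fly from "time
queue" state and refusing to schedule beyond a hardware horizon. It is
not the code from [1]; the struct, the verdict names and the
hw_horizon_ns field are hypothetical, and the horizon is a placeholder
rather than a number taken from any specific chip.

```c
#include <stdint.h>

/* Hypothetical per-queue pacing state kept at the TX hook. */
struct pacer_state {
	uint64_t next_tx_ns;    /* earliest time the next packet may leave */
	uint64_t ns_per_byte;   /* pacing rate, e.g. 8 ns/byte ~ 1 Gbit/s */
	uint64_t hw_horizon_ns; /* how far ahead HW accepts a Launch-Time */
};

enum pacer_verdict {
	PACER_SEND_NOW,        /* no Launch-Time needed */
	PACER_SET_LAUNCH_TIME, /* write *launch_ns into the TX descriptor */
	PACER_HOLD_IN_QUEUE,   /* too far ahead: keep it in a SW queue */
};

static enum pacer_verdict pacer_schedule(struct pacer_state *st,
					 uint64_t now_ns, uint32_t pkt_len,
					 uint64_t *launch_ns)
{
	/* Departure time is the later of "now" and the queue's next slot. */
	uint64_t t = st->next_tx_ns > now_ns ? st->next_tx_ns : now_ns;

	/*
	 * Beyond the horizon the HW cannot take the packet yet. The state
	 * is left untouched so the caller can retry the same packet later.
	 */
	if (t - now_ns > st->hw_horizon_ns)
		return PACER_HOLD_IN_QUEUE;

	*launch_ns = t;
	/* Advance the time queue by this packet's serialization time. */
	st->next_tx_ns = t + (uint64_t)pkt_len * st->ns_per_byte;

	return t == now_ns ? PACER_SEND_NOW : PACER_SET_LAUNCH_TIME;
}
```

The Launch-Time here depends on queue state at egress and never comes
from the RX descriptor, and the PACER_HOLD_IN_QUEUE case is where a
queueing layer would be needed.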

>>>> I'm thinking of an af_xdp case where it wants to program something
>>>> similar to tso/encap/tunneling offload (assuming af_xdp will get 4k+
>>>> support) or a checksum offload. Exposing access to the driver tx hooks
>>>> seems like the easiest way to get there?
>>>>
>>>>>> - AF_XDP TX - this one needs something deep in the driver (due to tx
>>>>>> zc) to populate the descriptors?
>>>>>
>>>>> Yeah, this one is a bit more challenging, but having a way to process
>>>>> AF_XDP frames in the kernel before they're sent out would be good in any
>>>>> case (for things like policing what packets an AF_XDP application can
>>>>> send in a cloud deployment, for instance). Would be best if we could
>>>>> consolidate the XDP_REDIRECT and AF_XDP paths, I suppose...
>>>>>
>>
>> I agree, it would be best if we can consolidate the XDP_REDIRECT and
>> AF_XDP paths, else we have to re-implement the same for AF_XDP xmit path
>> (and maintain both paths). I also agree that being able to police what
>> packets an AF_XDP application can send is a useful feature (e.g. cloud
>> deployments).
>>
>> Looking forward to collaborating on this work!
>> --Jesper
> 
> Thank you for the comments! So it looks like two things we potentially
> need to do/agree upon:
>
> 1. User-facing API. One hook + tracepoint vs two hooks (and at what
> level: af_xdp vs xdp). I'll try to focus on that first (waiting for
> af_xdp patches that Magnus mentioned).
>

Yes, I think we need to dig into the code to figure out which options
are doable, balancing them against the future maintenance cost.

> 2. Potentially internal refactoring to consolidate XDP_REDIRECT+AF_XDP
> (seems like something we should be able to discuss as we go; aka
> implementation details)
> 

Sounds like we have a plan going forward :-)

--Jesper
