From: Martin KaFai Lau <martin.lau@linux.dev>
To: Stanislav Fomichev <sdf@google.com>
Cc: "Toke Høiland-Jørgensen" <toke@redhat.com>,
	"Bezdeka, Florian" <florian.bezdeka@siemens.com>,
	"kuba@kernel.org" <kuba@kernel.org>,
	"john.fastabend@gmail.com" <john.fastabend@gmail.com>,
	"alexandr.lobakin@intel.com" <alexandr.lobakin@intel.com>,
	"anatoly.burakov@intel.com" <anatoly.burakov@intel.com>,
	"song@kernel.org" <song@kernel.org>,
	"Deric, Nemanja" <nemanja.deric@siemens.com>,
	"andrii@kernel.org" <andrii@kernel.org>,
	"Kiszka, Jan" <jan.kiszka@siemens.com>,
	"magnus.karlsson@gmail.com" <magnus.karlsson@gmail.com>,
	"willemb@google.com" <willemb@google.com>,
	"ast@kernel.org" <ast@kernel.org>,
	"brouer@redhat.com" <brouer@redhat.com>,
	"yhs@fb.com" <yhs@fb.com>,
	"kpsingh@kernel.org" <kpsingh@kernel.org>,
	"daniel@iogearbox.net" <daniel@iogearbox.net>,
	"bpf@vger.kernel.org" <bpf@vger.kernel.org>,
	"mtahhan@redhat.com" <mtahhan@redhat.com>,
	"xdp-hints@xdp-project.net" <xdp-hints@xdp-project.net>,
	"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
	"jolsa@kernel.org" <jolsa@kernel.org>,
	"haoluo@google.com" <haoluo@google.com>,
	"Yonghong Song" <yhs@meta.com>
Subject: [xdp-hints] Re: [RFC bpf-next 0/5] xdp: hints via kfuncs
Date: Tue, 1 Nov 2022 10:31:19 -0700
Message-ID: <752afbbb-1a14-3dad-53d0-35bb32632c91@linux.dev>
In-Reply-To: <CAKH8qBt=As5ON+CbH304tRanudvTF27bzeSnjH2GQR2TVx+mXw@mail.gmail.com>
On 10/31/22 3:09 PM, Stanislav Fomichev wrote:
> On Mon, Oct 31, 2022 at 12:36 PM Yonghong Song <yhs@meta.com> wrote:
>>
>>
>>
>> On 10/31/22 8:28 AM, Toke Høiland-Jørgensen wrote:
>>> "Bezdeka, Florian" <florian.bezdeka@siemens.com> writes:
>>>
>>>> Hi all,
>>>>
>>>> I was closely following this discussion for some time now. Seems we
>>>> reached the point where it's getting interesting for me.
>>>>
>>>> On Fri, 2022-10-28 at 18:14 -0700, Jakub Kicinski wrote:
>>>>> On Fri, 28 Oct 2022 16:16:17 -0700 John Fastabend wrote:
>>>>>>>> And it's actually harder to abstract away inter HW generation
>>>>>>>> differences if the user space code has to handle all of it.
>>>>>>
>>>>>> I don't see how it's any harder in practice though?
>>>>>
>>>>> You need to find out what HW/FW/config you're running, right?
>>>>> And all you have is a pointer to a blob of unknown type.
>>>>>
>>>>> Take timestamps for example, some NICs support adjusting the PHC
>>>>> or doing SW corrections (with different versions of hw/fw/server
>>>>> platforms being capable of both/one/neither).
>>>>>
>>>>> Sure you can extract all this info with tracing and careful
>>>>> inspection via uAPI. But I don't think that's _easier_.
>>>>> And the vendors can't run the results thru their validation
>>>>> (for whatever that's worth).
>>>>>
>>>>>>> I've had the same concern:
>>>>>>>
>>>>>>> Until we have some userspace library that abstracts all these details,
>>>>>>> it's not really convenient to use. IIUC, with a kptr, I'd get a blob
>>>>>>> of data and I need to go through the code and see what particular type
>>>>>>> it represents for my particular device and how the data I need is
>>>>>>> represented there. There are also these "if this is device v1 -> use
>>>>>>> v1 descriptor format; if it's a v2->use this another struct; etc"
>>>>>>> complexities that we'll be pushing onto the users. With kfuncs, we put
>>>>>>> this burden on the driver developers, but I agree that the drawback
>>>>>>> here is that we actually have to wait for the implementations to catch
>>>>>>> up.
>>>>>>
>>>>>> I agree with everything there, you will get a blob of data and then
>>>>>> will need to know what field you want to read using BTF. But, we
>>>>>> already do this for BPF programs all over the place so it's not a big
>>>>>> lift for us. All other BPF tracing/observability requires the same
>>>>>> logic. I think users of BPF in general perhaps XDP/tc are the only
>>>>>> place left to write BPF programs without thinking about BTF and
>>>>>> kernel data structures.
>>>>>>
>>>>>> But, with proposed kptr the complexity lives in userspace and can be
>>>>>> fixed, added, updated without having to bother with kernel updates, etc.
>>>>>> From my point of view of supporting Cilium it's a win and much preferred
>>>>>> to having to deal with driver owners on all cloud vendors, distributions,
>>>>>> and so on.
>>>>>>
>>>>>> If vendor updates firmware with new fields I get those immediately.
>>>>>
>>>>> Conversely it's a valid concern that those who *do* actually update
>>>>> their kernel regularly will have more things to worry about.
>>>>>
>>>>>>> Jakub mentions FW and I haven't even thought about that; so yeah, bpf
>>>>>>> programs might have to take a lot of other state into consideration
>>>>>>> when parsing the descriptors; all those details do seem like they
>>>>>>> belong to the driver code.
>>>>>>
>>>>>> I would prefer to avoid being stuck on requiring driver writers to
>>>>>> be involved. With just a kptr I can support the device and any
>>>>>> firmware versions without requiring help.
>>>>>
>>>>> 1) where are you getting all those HW / FW specs :S
>>>>> 2) maybe *you* can but you're not exactly not an ex-driver developer :S
>>>>>
>>>>>>> Feel free to send it early with just a handful of drivers implemented;
>>>>>>> I'm more interested about bpf/af_xdp/user api story; if we have some
>>>>>>> nice sample/test case that shows how the metadata can be used, that
>>>>>>> might push us closer to the agreement on the best way to proceed.
>>>>>>
>>>>>> I'll try to do an Intel and mlx implementation to get a cross section.
>>>>>> I have a good collection of nics here so should be able to show a
>>>>>> couple firmware versions. It could be fine I think to have the raw
>>>>>> kptr access and then also kfuncs for some things perhaps.
>>>>>>
>>>>>>>> I'd prefer if we left the door open for new vendors. Punting descriptor
>>>>>>>> parsing to user space will indeed result in what you just said - major
>>>>>>>> vendors are supported and that's it.
>>>>>>
>>>>>> I'm not sure about why it would make it harder for new vendors? I think
>>>>>> the opposite,
>>>>>
>>>>> TBH I'm only replying to the email because of the above part :)
>>>>> I thought this would be self evident, but I guess our perspectives
>>>>> are different.
>>>>>
>>>>> Perhaps you look at it from the perspective of SW running on someone
>>>>> else's cloud, and being able to move to another cloud, without having
>>>>> to worry if feature X is available in xdp or just skb.
>>>>>
>>>>> I look at it from the perspective of maintaining a cloud, with people
>>>>> writing random XDP applications. If I swap a NIC from an incumbent to a
>>>>> (superior) startup, and cloud users are messing with raw descriptor -
>>>>> I'd need to go find every XDP program out there and make sure it
>>>>> understands the new descriptors.
>>>>
>>>> Here is another perspective:
>>>>
>>>> As AF_XDP application developer I don't want to deal with the
>>>> underlying hardware in detail. I like to request a feature from the OS
>>>> (in this case rx/tx timestamping). If the feature is available I will
>>>> simply use it, if not I might have to work around it - maybe by falling
>>>> back to SW timestamping.
>>>>
>>>> All parts of my application (BPF program included) should not be
>>>> optimized/adjusted for all the different HW variants out there.
>>>
>>> Yes, absolutely agreed. Abstracting away those kinds of hardware
>>> differences is the whole *point* of having an OS/driver model. I.e.,
>>> it's what the kernel is there for! If people want to bypass that and get
>>> direct access to the hardware, they can already do that by using DPDK.
>>>
>>> So in other words, 100% agreed that we should not expect the BPF
>>> developers to deal with hardware details as would be required with a
>>> kptr-based interface.
>>>
>>> As for the kfunc-based interface, I think it shows some promise.
>>> Exposing a list of function names to retrieve individual metadata items
>>> instead of a struct layout is sorta comparable in terms of developer UI
>>> accessibility etc (IMO).
>>
>> Looks like there are quite some use cases for hw_timestamp.
>> Do you think we could add it to the uapi like struct xdp_md?
>>
>> The following is the current xdp_md:
>> struct xdp_md {
>> 	__u32 data;
>> 	__u32 data_end;
>> 	__u32 data_meta;
>> 	/* Below access go through struct xdp_rxq_info */
>> 	__u32 ingress_ifindex; /* rxq->dev->ifindex */
>> 	__u32 rx_queue_index;  /* rxq->queue_index */
>>
>> 	__u32 egress_ifindex;  /* txq->dev->ifindex */
>> };
>>
>> We could add __u64 hw_timestamp to the xdp_md so user
>> can just do xdp_md->hw_timestamp to get the value.
>> xdp_md->hw_timestamp == 0 means hw_timestamp is not
>> available.
>>
>> Inside the kernel, the ctx rewriter can generate code
>> to call driver specific function to retrieve the data.
>
> If the driver generates the code to retrieve the data, how's that
> different from the kfunc approach?
> The only difference I see is that it would be a stronger UAPI than
> the kfuncs?
Another thing that may be worth considering: some hints for some HW/drivers
may be harder (or may not be worth it) to unroll/inline. For example, I see a
driver doing spin_lock_bh while getting the hwtstamp. In this case, continuing
to call a kfunc and avoiding the unroll/inline may be the right thing to do.
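
To make the distinction concrete, here is a rough user-space sketch. It is
not taken from any driver: struct demo_nic, demo_rx_hash() and
demo_rx_timestamp() are made-up names, and a C11 atomic_flag spin lock only
stands in for the spin_lock_bh() a real driver would take. The first getter is
a single load and is the kind of thing that could plausibly be unrolled; the
second has to take the lock, so leaving it as a regular call seems simpler.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical device state, not taken from any real driver. */
struct demo_nic {
	atomic_flag lock;	/* stands in for the driver's spin_lock_bh() */
	uint64_t base_ns;	/* time base for the conversion */
	uint64_t cycles;	/* raw counter value from the descriptor */
	uint32_t mult;		/* cycles -> ns multiplier */
	uint32_t shift;		/* cycles -> ns shift */
	uint32_t rx_hash;	/* plain descriptor field */
};

/* A single load from the descriptor: a natural candidate for unrolling. */
static inline uint32_t demo_rx_hash(const struct demo_nic *nic)
{
	return nic->rx_hash;
}

/* Needs the lock around the conversion, so it stays an out-of-line call. */
static uint64_t demo_rx_timestamp(struct demo_nic *nic)
{
	uint64_t ns;

	while (atomic_flag_test_and_set_explicit(&nic->lock,
						 memory_order_acquire))
		;	/* spin until the lock is free */
	ns = nic->base_ns + ((nic->cycles * nic->mult) >> nic->shift);
	atomic_flag_clear_explicit(&nic->lock, memory_order_release);
	return ns;
}
```

The conversion formula is only there to give the locked section something to
protect; the point is the shape of the two getters, not the arithmetic.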
>
>> The kfunc approach can be used for *less* common use cases?
>
> What's the advantage of having two approaches when one can cover
> common and uncommon cases?
>
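
For reference, the two shapes being compared can be modelled in plain C. This
is only a sketch under my own assumptions: struct sketch_rx_desc,
struct sketch_xdp_md, sketch_fill_ctx() and sketch_get_rx_timestamp() are
hypothetical names, and -ENOENT is an arbitrary choice of error code. The
context-field variant can only signal "not available" through the 0 sentinel,
while the call-style variant can return an explicit error.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical per-packet state known to the driver. */
struct sketch_rx_desc {
	int      ts_valid;	/* did the HW provide a timestamp? */
	uint64_t ts_ns;
};

/* Shape 1: a context field, where 0 means "not available". */
struct sketch_xdp_md {
	uint64_t hw_timestamp;
};

static void sketch_fill_ctx(struct sketch_xdp_md *ctx,
			    const struct sketch_rx_desc *desc)
{
	ctx->hw_timestamp = desc->ts_valid ? desc->ts_ns : 0;
}

/* Shape 2: a kfunc-style call with an explicit error return. */
static int sketch_get_rx_timestamp(const struct sketch_rx_desc *desc,
				   uint64_t *ts)
{
	if (!desc->ts_valid)
		return -ENOENT;	/* arbitrary error code for this sketch */
	*ts = desc->ts_ns;
	return 0;
}
```

With shape 1 a genuine timestamp of 0 cannot be told apart from a missing
one; whether that matters in practice is a separate question.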
>>> There are three main drawbacks, AFAICT:
>>>
>>> 1. It requires driver developers to write and maintain the code that
>>> generates the unrolled BPF bytecode to access the metadata fields, which
>>> is a non-trivial amount of complexity. Maybe this can be abstracted away
>>> with some internal helpers though (like, e.g., a
>>> bpf_xdp_metadata_copy_u64(dst, src, offset) helper which would spit out
>>> the required JMP/MOV/LDX instructions)?
>>>
>>> 2. AF_XDP programs won't be able to access the metadata without using a
>>> custom XDP program that calls the kfuncs and puts the data into the
>>> metadata area. We could solve this with some code in libxdp, though; if
>>> this code can be made generic enough (so it just dumps the available
>>> metadata functions from the running kernel at load time), it may be
>>> possible to make it generic enough that it will be forward-compatible
>>> with new versions of the kernel that add new fields, which should
>>> alleviate Florian's concern about keeping things in sync.
>>>
>>> 3. It will make it harder to consume the metadata when building SKBs. I
>>> think the CPUMAP and veth use cases are also quite important, and that
>>> we want metadata to be available for building SKBs in this path. Maybe
>>> this can be resolved by having a convenient kfunc for this that can be
>>> used for programs doing such redirects. E.g., you could just call
>>> xdp_copy_metadata_for_skb() before doing the bpf_redirect, and that
>>> would recursively expand into all the kfunc calls needed to extract the
>>> metadata supported by the SKB path?
>>>
>>> -Toke
>>>
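
On point 2 above, the "XDP program puts the data into the metadata area" flow
can be modelled in user space roughly as follows. Everything here is
hypothetical (struct toy_meta, struct toy_frame, toy_xdp_store_meta(),
toy_xsk_read_ts(), the 64-byte headroom); a real XDP program would grow the
metadata area with bpf_xdp_adjust_meta() and get the value from a kfunc. The
sketch only shows the layout contract: the producer writes a small struct
right in front of the packet data, and the AF_XDP consumer reads it back from
that same position.

```c
#include <stdint.h>
#include <string.h>

#define TOY_HEADROOM 64

/* Hypothetical layout agreed on by the XDP program and the consumer. */
struct toy_meta {
	uint64_t rx_timestamp;
};

/* Toy frame: offsets into buf, similar in spirit to data_meta/data. */
struct toy_frame {
	uint8_t  buf[TOY_HEADROOM + 128];
	uint32_t data_meta;	/* start of the metadata area */
	uint32_t data;		/* start of the packet data */
};

/* XDP side: grow the metadata area towards the headroom and fill it. */
static int toy_xdp_store_meta(struct toy_frame *f, uint64_t ts)
{
	struct toy_meta m = { .rx_timestamp = ts };

	if (f->data_meta < sizeof(m))
		return -1;	/* not enough headroom left */
	f->data_meta -= sizeof(m);
	memcpy(f->buf + f->data_meta, &m, sizeof(m));
	return 0;
}

/* Consumer side: the metadata sits immediately before the packet data. */
static uint64_t toy_xsk_read_ts(const struct toy_frame *f)
{
	struct toy_meta m;

	memcpy(&m, f->buf + f->data - sizeof(m), sizeof(m));
	return m.rx_timestamp;
}
```

The consumer has to know the struct layout out of band, which is the part a
generic libxdp program would need to take care of.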