Date: Thu, 18 Nov 2021 07:18:24 -0800
From: John Fastabend <john.fastabend@gmail.com>
To: Jesper Dangaard Brouer, "Karlsson, Magnus", Toke Høiland-Jørgensen, "Desouza, Ederson"
Cc: brouer@redhat.com, xdp-hints@xdp-project.net, Eelco Chaudron, Andrii Nakryiko, "Fijalkowski, Maciej", "Burakov, Anatoly"
Message-ID: <61966ec0722fe_2f3212080@john.notmuch>
References: <875ysqflg1.fsf@toke.dk>
Subject: [xdp-hints] Re: XDP-hints via local BTF info
List-Id: XDP hardware hints design discussion

Jesper Dangaard Brouer wrote:
>
>
> On 18/11/2021 09.05, Karlsson, Magnus wrote:
> >
> >> -----Original Message-----
> >> From: Toke Høiland-Jørgensen
> >> Sent: Wednesday, November 17, 2021 11:48 PM
> >> To: Karlsson, Magnus; Jesper Dangaard Brouer; Desouza, Ederson
> >> Cc: brouer@redhat.com; xdp-hints@xdp-project.net; Eelco Chaudron;
> >> Andrii Nakryiko; Fijalkowski, Maciej; Burakov, Anatoly
> >> Subject: Re: [xdp-hints] Re: XDP-hints via local BTF info
> >>
> >> "Karlsson, Magnus" writes:
> >>
> >>> Together with Maciej and Anatoly, I have been toying with how to
> >>> accomplish this, but it is early days so
> >>> warning for some serious
> >>> handwaving. Will produce some code to see if it is possible at all.
> >>> One drawback with having it completely flexible and letting user-space
> >>> decide is the complexity of implementing this in the driver/kernel. But
> >>> is this not what we have eBPF for in the first place? Maybe it can come
> >>> to the rescue here.
> >>
> >> I think there are a couple of distinctions we need to make, which your
> >> "handwaving" glosses over a little bit :)
> >>
> >> First, we have to distinguish between two cases here: how to work with
> >> existing hardware and drivers and their metadata implementations, and how
> >> to work with future devices that will (presumably) be completely flexible
> >> and/or programmable to provide any metadata configuration directly from
> >> the hardware. The XDP hints scheme obviously needs to work with both.
> >
> > Thanks Toke for providing feedback on this. Just had to get this stuff
> > "off my chest" 😉. I was stuck in the zero-copy world where you can
> > modify drivers, but of course we need to consider "older" drivers with
> > XDP support. With that in mind, I do agree with most of your comments.
> > There are some questions and thoughts interspersed below and a
> > suggestion for how to go forward at the end.
> >
>
> I do appreciate that you want to discuss driver implementation, but this
> email and example program[1] was targeted at a much smaller step.
>
> The main purpose was to show-with-code that decoding BTF and using it
> from userspace is not difficult. And iterate on the API a bit.
>
> [1]
> https://github.com/xdp-project/bpf-examples/tree/master/AF_XDP-interaction
>
> As Toke mentioned, I also believe it is orthogonal.
> I think we all agree that BTF will be the way the XDP-hints will be
> "described", and we need to propose APIs/code that does this decoding
> e.g. in userspace.
> (But I also want things like the cpumap kernel-side
> XDP-prog to understand this).
>
> But I'm taking the bait, and will discuss with you below...
>
>
> >> Secondly, we should distinguish between the configuration interface and the
> >> metadata consumption from within the data path. I think those should be
> >> separate interfaces; in particular, I don't think loading an XDP program in
> >> itself should have the side effect of reconfiguring the hardware metadata
> >> format. Rather, the reconfiguration should be a separate, explicit step using
> >> whatever API we can end up agreeing on (ethtool or rtnetlink come to mind
> >> as obvious contenders).
>
> I agree with Toke that we should distinguish between the configuration
> interface and metadata consumption.

+1. As a first iteration we can ignore the configuration mechanism and
assume it's either hard-coded or programmed into the NIC by the hardware
vendor.

>
> Do notice that we ALREADY have existing config interfaces that enable
> and disable HW hints provided by the hardware. How to handle this?
>
> E.g. the HW RX-timestamp feature can be enabled when calling tcpdump,
> and will be disabled again when tcpdump stops.
> Thus, this action (enable HW-timestamp) will likely change which
> XDP-hints structure the driver uses.
> What do you think the action should be?
>
> My opinion is to simply use the existing kernel API to enable these
> HW-timestamps. Today the SKB is updated with the HW-timestamp, and in
> the future I would like the SKB to be created later (in net-core) and
> get this HW-timestamp via a BPF-prog that extracts it from the
> metadata area. (Hint: today the metadata area actually survives with the
> SKB, and can be accessed by e.g. TC-BPF.)
> Thus, I do want this to work with our existing netstack config interfaces.

I don't disagree; it would be great if the existing knobs worked as
expected and played nicely with hints. I view it as step 2 though.
Just my $.02: it's more a question of how to get from point A to B, not so
much a difference in the end result.

>
>
> >> However, the
> >> *format* for this configuration could very well be BTF-based, so userspace
> >> can get whatever format it wants, assuming the hardware supports it.
> >>
> >> So, say we have this fancy programmable hardware, and we write a program
> >> with a struct definition like:
> >>
> >> struct my_meta_format {
> >>     __u64 rx_timestamp;
> >>     __u64 magic_colour_of_packet;
> >>     __u32 btf_id;
> >> };
> >>
> >> and from userspace we can then do:
> >>
> >> dev_metadata_configure(ifindex, BTF_OF_STRUCT(my_meta_format));

I have some doubts/questions about the complexity on the firmware/driver
side of consuming such sparse info and performing such a complex reconfig
of the hw. But maybe some simple pattern matching would be sufficient on
the hw side and useful to get things moving forward. Seeing real hardware
with support here would be great.

> >>
> >> which will pass the BTF-format struct to the kernel and configure the
> >> hardware to write those fields. After this, the XDP program can just do:
> >>
> >> int my_xdp_prog(struct xdp_buff *ctx)
> >> {
> >>     struct my_meta_format *meta = ctx->data_meta;
> >>
> >>     do_something_with(meta->magic_colour_of_packet);
> >>     return XDP_XXX;
> >> }
> >>
> >> and it will just work. Same thing from userspace.
> >>
> >> Or it can define a CO-RE enabled struct like:
> >>
> >> struct my_meta_subset {
> >>     __u64 magic_colour_of_packet; /* we only care about this attr */
> >> } __attribute__((preserve_access_index));
> >>
> >> int my_xdp_prog(struct xdp_buff *ctx)
> >> {
> >>     struct my_meta_subset *meta = ctx->data_meta;
> >>
> >>     do_something_with(meta->magic_colour_of_packet);
> >>     return XDP_XXX;
> >> }
> >>
> >> and libbpf will rewrite the field accesses using the BTF information,
> >> re-exported from the kernel from what userspace passed in when configuring.
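
To make the "same thing from userspace" remark concrete, here is a minimal
sketch (not code from the thread) of how a userspace consumer, for example
an AF_XDP application, might read such a metadata area. It assumes the
metadata ends exactly where the packet data starts, that the struct is
packed so btf_id occupies the last four bytes, and that values are in host
byte order. MY_META_FORMAT_BTF_ID, meta_btf_id() and meta_read() are
hypothetical names; a real ID would be looked up from BTF at run time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout modelled on the struct quoted above. Packed
 * (assumption) so that btf_id is the last 4 bytes before packet data. */
struct my_meta_format {
	uint64_t rx_timestamp;
	uint64_t magic_colour_of_packet;
	uint32_t btf_id;
} __attribute__((packed));

/* Illustrative value only; a real ID would come from a BTF lookup. */
#define MY_META_FORMAT_BTF_ID 42u

/* Read the btf_id stored directly in front of the packet data.
 * The caller must guarantee at least 4 bytes of metadata headroom. */
static uint32_t meta_btf_id(const uint8_t *pkt_data)
{
	uint32_t id;

	memcpy(&id, pkt_data - sizeof(id), sizeof(id));
	return id;
}

/* Copy the metadata out if there is room for it and the ID matches.
 * Returns 0 on success, -1 if the layout is not the expected one. */
static int meta_read(const uint8_t *pkt_data, size_t headroom,
		     struct my_meta_format *out)
{
	assert(pkt_data != NULL && out != NULL);

	if (headroom < sizeof(*out))
		return -1;
	if (meta_btf_id(pkt_data) != MY_META_FORMAT_BTF_ID)
		return -1;
	memcpy(out, pkt_data - sizeof(*out), sizeof(*out));
	return 0;
}
```

Checking the ID before touching any other member is what lets the consumer
refuse a layout it does not know, instead of misreading it.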
> >>
>
> I've also been down the same rabbit hole, wanting userspace to define
> the BTF layout as the config interface that HW will get reconfigured via.
> I no longer believe in this mode. One reason is the existing config
> interfaces that enable/disable NIC HW features.
>
> One way we can allow userspace to define the contents of the XDP-hints
> struct, not the HW config, is to add this new BPF 'hints-hook'.
> Userspace can query the BPF-prog loaded in the 'hints-hook' and see what
> BTF structs it provides.
> This is similar to what I do for AF_XDP in [1], as the XDP BPF-prog
> defines the layout and AF_XDP userspace queries the available BTF.

What I expected, but which hasn't happened yet(?), is that the first users
would go a different route. The way I see it, the hw vendor can configure
the NIC to put any hints they like in the header via a firmware update.
User space would understand the layout of the hints because it programmed
these hints. In general it's not very friendly for distributions and their
end users, but for a DPDK user running on top of AF_XDP this would be all
that's needed. Or an embedded end system at a telco, or a POC on an IDS,
would work. Anyone doing this today?

>
>
> >> With the above in mind, a few comments on some of your other points:
> >>
> >>> The key feature here is that bpf_xdp_get_metadata() will actually go
> >>> and fetch this from the HW, by calling a small callback function in
> >>> the driver that reads for example rx_timestamp from the HW and returns
> >>> it to the XDP program. This is clearly not possible today and would
> >>> require new plumbing, if it is even possible to implement this. But
> >>> let us leave aside the implementation for now and just focus on the
> >>> benefit this (or something like it) could provide compared to a
> >>> kernel-centric approach.
> >>
> >> Took me a while to understand what you meant with this, but I think it finally
> >> clicked with your example below, so I'll leave the comment until there...
> >>
> >>> * User-space can completely control where to put data and what it puts
> >>> there. Think of producing the structure you want in user space
> >>> directly without having to copy things around. You could for example
> >>> produce a DPDK mbuf, or the VPP or Suricata equivalent, directly. Would
> >>> save a lot of cycles.
> >>
> >> I think that userspace being in control of the metadata format is a great goal
> >> we should keep in mind, and as mentioned above we can do that with a
> >> BTF-based interface.
> >>
> >>> * No need for a metadata enablement interface. eBPF could find this
> >>> out by just parsing the XDP program and enable the used metadata
> >>> features in the HW by calling enable/disable metadata functions in the
> >>> driver.
> >>
> >> As mentioned above I think this is a misfeature: configuration should be
> >> explicit and out of band, not tied to the data path implementation.
> >>
>
> I agree. Also consider the existing config interfaces that we cannot
> ignore and have to collaborate with.

I also wonder about the hw's capability to consume a BTF/eBPF program and
reconfigure the hardware.

>
> >>> * No reason to expose BTFs to user-space. Makes it a lot simpler.
> >>> Actually, no need to use them in the driver either.
> >>
> >> BTF is just a handy format for describing the data layout. There has to be
> >> *some* format to communicate this, and exposing the BTF format to
> >> userspace is just a way for the kernel to say "this is how the hardware is
> >> currently configured". I don't understand why you think not having that
> >> makes things simpler?
> >

... (cut cache miss discussion)

>
> >> This is actually something Jesper and I have been discussing as a possible
> >> future solution!
> >> However, what we discussed was not exposing this *to
> >> XDP*, but rather to have a separate BPF program *which is part of the
> >> driver* that does this. And this BPF program can then be paired with the BTF
> >> format coming in from userspace, so that it will only load the fields
> >> requested. So, going back to my imaginary configuration interface above,
> >> when userspace calls:
> >>
>
> Yes, exactly.
> If we introduce a BPF hints-hook, we could give this new BPF hook access
> to the RX-desc (and XDP->ctx / xdp_buff) plus some driver-specific
> status quirk bits. The driver developers, who know the quirks, could
> ship a BPF-prog together with the driver code that creates/writes the
> metadata area, AND this BPF-prog could DEFINE the struct name and layout
> as BTF (which userspace can query and extract).
>
>

... cut some more interesting discussion.

> > It would be great if we could know it is fixed, but I do not understand
> > how the user can know this, especially since the control of this is
> > out-of-band. How would we deal with the following scenario?
> >
> > App 1 comes up, opens up an AF_XDP socket and requests metadata_1
> > App 2 comes up, opens up another AF_XDP socket on the same netdev and
> > requests metadata_2
> >
> > We can provide the apps with two different btf_ids, but is this
> > something that an existing driver can support, and how does this scale
> > as we add sockets and different usages of metadata? Note that we have no
> > idea what the destination is until after we have executed our XDP
> > program and potentially used the metadata area there. But our population
> > of the metadata field happens before the XDP program. Kind of chicken
> > and egg.
> >
> > The idea of a separate metadata population hook point on the
> > netdev/queue_id level could potentially solve this. Well, as long as you
> > are not attaching several sockets to the same netdev and queue_id, but
> > that is rare.

Interesting, but I would get a basic single config working first. If a user
really wants multiple configs then I would guess the NIC might partition
the hardware into VFs or virtual interfaces of some kind.

> >
> >> - Writing an application that can support different metadata formats so
> >>   it can run without re-compilation on different hardware. For this, it
> >>   is possible to load the program with a target device, so that the long
> >>   sequence of checks of the btf_id can be eliminated by CO-RE at load
> >>   time, and turned into just a single sanity check so that the metadata
> >>   is not loaded if the configuration changes.
> >
> > Would be great if we had CO-RE for user-space, but as it is not there in
> > llvm or gcc for x86, we probably have to go through all the tests in
> > AF_XDP.

I think CO-RE for user space should happen regardless of hardware use
cases. Looks to me like multiple use cases for this exist.

> >
> >> - Writing an application that deals with packets from multiple devices
> >>   in the same program (say, a devmap program that gets redirects from
> >>   different hardware into the same egress device).
> >
> > Yes. The same goes for AF_XDP and a shared umem between different NICs.
> >
> >> - Dealing with metadata configuration where a subset of the traffic has
> >>   different metadata (e.g., timestamps may only appear on PTP packets,
> >>   say). Here the check of the BTF ID can be used to demux the packet
> >>   type.
> >
> > Would this not lead to a combinatorial explosion of BTF IDs as we add
> > fields that might or might not have a valid entry? Would it not be
> > better with a flags field in the MD section?
> >
>
> I've debated this with John before, and I thought we resolved the
> concerns in that discussion.
> In summary:
>
> 1. The driver is 100% free to create a flags field in the XDP-hints
> struct that defines which fields are valid.
> If you like to be old-fashioned, we can even define the bits as UAPI
> for common fields that more drivers provide. This just requires more
> work in the userspace/BPF-prog to check flags-valid before accessing
> a member.
>
> 2. Until we get the XDP-hints-hook, drivers will be old-fashioned and
> define a rather limited number of XDP-hints structs. Thus, I would not
> worry about drivers creating too many structs and BTF-IDs.

I would add a 3.: hw + user can agree on what's in the hints and are free
to put anything in there if they use an out-of-band config mechanism.

>
> Today drivers are not very flexible, and this work is about creating a
> super flexible interface. It doesn't mean that drivers and hardware
> will all of a sudden get super flexible. Drivers and hardware will start
> using this new flexible BTF in very simple ways.
>
>
> > Thank you so much for your feedback. It really helped! In summary, BTF
> > it is as an interface, and I agree that we should start simple and just
> > get this to work with the if-statement method. But it seems that I can
> > also continue to prototype the main idea of reading metadata from XDP.
> > Do you think it would be possible to get a new hook point for XDP
> > metadata population? If so, I think we should pursue that, especially if
> > we could tie it to a netdev/queue_id pair. That would simplify a lot for
> > AF_XDP.
> >
>
> I'm all for prototyping something so we can make progress.

In general I would suggest building the simplest possible thing first, and
then improving it once users start to actually use it. In parallel I would
like to see someone build a 'real' demo using 3 to show off the amazing
performance improvements, to justify adding hooks and lots of other effort
in the driver/kernel.

Thanks.
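
A minimal sketch of the flags-field approach from point 1 above, for
illustration only. The struct, the HINT_VALID_* bits and
hints_get_rx_timestamp() are hypothetical names invented here; they are not
an existing UAPI and do not come from any driver. The point is just that a
consumer tests the validity bit before it reads the member.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical validity bits, one per optional field (not a real UAPI). */
#define HINT_VALID_RX_TIMESTAMP (1u << 0)
#define HINT_VALID_RX_HASH      (1u << 1)

/* Hypothetical hints struct: flags says which members the driver filled. */
struct xdp_hints_example {
	uint64_t rx_timestamp;
	uint32_t rx_hash;
	uint32_t flags;
};

/* Store the timestamp in *ts and return true only if it is marked valid;
 * otherwise leave *ts untouched and return false. */
static bool hints_get_rx_timestamp(const struct xdp_hints_example *h,
				   uint64_t *ts)
{
	assert(h != NULL && ts != NULL);

	if (!(h->flags & HINT_VALID_RX_TIMESTAMP))
		return false;
	*ts = h->rx_timestamp;
	return true;
}
```

With this shape one struct (and one BTF ID) can cover packets that carry a
timestamp and packets that do not, at the cost of an extra branch per
access.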