From: Dave Taht
Date: Fri, 9 Dec 2022 07:19:57 -0800
To: Jesper Dangaard Brouer
CC: Saeed Mahameed, Stanislav Fomichev, brouer@redhat.com,
    Toke Høiland-Jørgensen, Alexei Starovoitov, bpf, Alexei Starovoitov,
    Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau, Song Liu,
    Yonghong Song, John Fastabend, KP Singh, Hao Luo, Jiri Olsa,
    Saeed Mahameed, David Ahern, Jakub Kicinski, Willem de Bruijn,
    Anatoly Burakov, Alexander Lobakin, Magnus Karlsson, Maryam Tahhan,
    xdp-hints@xdp-project.net, Network Development
Subject: [xdp-hints] Re: [PATCH bpf-next v3 11/12] mlx5: Support RX XDP metadata
In-Reply-To: <66fa1861-30dd-6d00-ed14-0cf4a6b39f3c@redhat.com>
List-Id: XDP hardware hints design discussion

On Fri, Dec 9, 2022 at 5:29 AM Jesper Dangaard Brouer wrote:
>
>
> On 09/12/2022 06.24, Saeed Mahameed wrote:
> > On 08 Dec 18:57, Stanislav Fomichev wrote:
> >> On Thu, Dec 8, 2022 at 4:54 PM Toke Høiland-Jørgensen
> >> wrote:
> >>>
> >>> Alexei Starovoitov writes:
> >>>
> >>> > On Thu, Dec 8, 2022 at 4:29 PM Toke Høiland-Jørgensen wrote:
> >>> >>
> >>> >> Alexei Starovoitov writes:
> >>> >>
> >>> >> > On Thu, Dec 8, 2022 at 4:02 PM Toke Høiland-Jørgensen wrote:
> >>> >> >>
> >>> >> >> Stanislav Fomichev writes:
> >>> >> >>
> >>> >> >> > On Thu, Dec 8, 2022 at 2:59 PM Toke Høiland-Jørgensen wrote:
> >>> >> >> >>
> >>> >> >> >> Stanislav Fomichev writes:
> >>> >> >> >>
> >>> >> >> >> > From: Toke Høiland-Jørgensen
> >>> >> >> >> >
> >>> >> >> >> > Support RX hash and timestamp metadata kfuncs. We need to pass in the cqe
> >>> >> >> >> > pointer to the mlx5e_skb_from* functions so it can be retrieved from the
> >>> >> >> >> > XDP ctx to do this.
> >>> >> >> >>
> >>> >> >> >> So I finally managed to get enough ducks in a row to actually benchmark
> >>> >> >> >> this. With the caveat that I suddenly can't get the timestamp support to
> >>> >> >> >> work (it was working in an earlier version, but now
> >>> >> >> >> timestamp_supported() just returns false). I'm not sure if this is an
> >>> >> >> >> issue with the enablement patch, or if I just haven't gotten the
> >>> >> >> >> hardware configured properly. I'll investigate some more, but figured
> >>> >> >> >> I'd post these results now:
> >>> >> >> >>
> >>> >> >> >> Baseline XDP_DROP:         25,678,262 pps / 38.94 ns/pkt
> >>> >> >> >> XDP_DROP + read metadata:  23,924,109 pps / 41.80 ns/pkt
> >>> >> >> >> Overhead:                   1,754,153 pps /  2.86 ns/pkt
> >>> >> >> >>
> >>> >> >> >> As per the above, this is with calling three kfuncs/pkt
> >>> >> >> >> (metadata_supported(), rx_hash_supported() and rx_hash()). So that's
> >>> >> >> >> ~0.95 ns per function call, which is a bit less, but not far off from
> >>> >> >> >> the ~1.2 ns that I'm used to. The tests where I accidentally called the
> >>> >> >> >> default kfuncs cut off ~1.3 ns for one less kfunc call, so it's
> >>> >> >> >> definitely in that ballpark.
> >>> >> >> >>
> >>> >> >> >> I'm not doing anything with the data, just reading it into an on-stack
> >>> >> >> >> buffer, so this is the smallest possible delta from just getting the
> >>> >> >> >> data out of the driver. I did confirm that the call instructions are
> >>> >> >> >> still in the BPF program bytecode when it's dumped back out from the
> >>> >> >> >> kernel.
> >>> >> >> >>
> >>> >> >> >> -Toke
> >>> >> >> >>
> >>> >> >> >
> >>> >> >> > Oh, that's great, thanks for running the numbers! Will definitely
> >>> >> >> > reference them in v4!
> >>> >> >> > Presumably, we should be able to at least unroll most of the
> >>> >> >> > _supported callbacks if we want, they should be relatively easy; but
> >>> >> >> > the numbers look fine as is?
> >>> >> >>
> >>> >> >> Well, this is for one (and a half) piece of metadata. If we extrapolate
> >>> >> >> it adds up quickly. Say we add csum and vlan tags, say, and maybe
> >>> >> >> another callback to get the type of hash (l3/l4). Those would probably
> >>> >> >> be relevant for most packets in a fairly common setup. Extrapolating
> >>> >> >> from the ~1 ns/call figure, that's 8 ns/pkt, which is 20% of the
> >>> >> >> baseline of 39 ns.
> >>> >> >>
> >>> >> >> So in that sense I still think unrolling makes sense. At least for the
> >>> >> >> _supported() calls, as eating a whole function call just for that is
> >>> >> >> probably a bit much (which I think was also Jakub's point in a sibling
> >>> >> >> thread somewhere).
> >>> >> >
> >>> >> > imo the overhead is tiny enough that we can wait until
> >>> >> > generic 'kfunc inlining' infra is ready.
> >>> >> >
> >>> >> > We're planning to dual-compile some_kernel_file.c
> >>> >> > into native arch and into bpf arch.
> >>> >> > Then the verifier will automatically inline bpf asm
> >>> >> > of corresponding kfunc.
> >>> >>
> >>> >> Is that "planning" or "actively working on"? Just trying to get a sense
> >>> >> of the time frames here, as this sounds neat, but also something that
> >>> >> could potentially require quite a bit of fiddling with the build system
> >>> >> to get to work? :)
> >>> >
> >>> > "planning", but regardless of how long it takes I'd rather not
> >>> > add any more tech debt in the form of manual bpf asm generation.
> >>> > We have too much of it already: gen_lookup, convert_ctx_access, etc.
> >>>
> >>> Right, I'm no fan of the manual ASM stuff either. However, if we're
> >>> stuck with the function call overhead for the foreseeable future, maybe
> >>> we should think about other ways of cutting down the number of function
> >>> calls needed?
> >>>
> >>> One thing I can think of is to get rid of the individual _supported()
> >>> kfuncs and instead have a single one that lets you query multiple
> >>> features at once, like:
> >>>
> >>> __u64 features_supported, features_wanted = XDP_META_RX_HASH |
> >>>                                             XDP_META_TIMESTAMP;
> >>>
> >>> features_supported = bpf_xdp_metadata_query_features(ctx,
> >>>                                                      features_wanted);
> >>>
> >>> if (features_supported & XDP_META_RX_HASH)
> >>>         hash = bpf_xdp_metadata_rx_hash(ctx);
> >>>
> >>> ...etc
> >>
> >> I'm not too happy about having the bitmasks tbh :-(
> >> If we want to get rid of the cost of those _supported calls, maybe we
> >> can do some kind of libbpf-like probing? That would require loading a
> >> program + waiting for some packet though :-(
> >>
> >> Or maybe they can just be cached for now?
> >>
> >> if (unlikely(!got_first_packet)) {
> >>   have_hash = bpf_xdp_metadata_rx_hash_supported();
> >>   have_timestamp = bpf_xdp_metadata_rx_timestamp_supported();
> >>   got_first_packet = true;
> >> }
> >
> > hash/timestamp/csum is per packet .. vlan as well depending how you look at
> > it..
>
> True, we cannot cache this as it is *per packet* info.
>
> > Sorry I haven't been following the progress of xdp meta data, but why did
> > we drop the idea of btf and driver copying metadata in front of the xdp
> > frame ?
> >
>
> It took me some time to understand this new approach, and why it makes
> sense. This is my understanding of the design direction change:
>
> This approach gives more control to the XDP BPF-prog to pick and choose
> which XDP hints are relevant for the specific use-case. BPF-prog can
> also skip storing hints anywhere and just read+react on value (that e.g.
> comes from RX-desc).
>
> For the use-cases redirect, AF_XDP, chained BPF-progs, XDP-to-TC,
> SKB-creation, we *do* need to store hints somewhere, as RX-desc will be
> out-of-scope. I think this patchset hand-waves and says BPF-prog can just
> manually store this in a prog custom layout in metadata area. I'm not
> super happy with ignoring/hand-waving all these use-cases, but I
> hope/think we later can extend this with some more structure to support
> these use-cases better (with this patchset as a foundation).
>
> I actually like this kfunc design, because the BPF-progs get an
> intuitive API, and on driver side we can hide the details of how to
> extract the HW hints.
>
>
> > hopefully future HW generations will do that for free ..
>
> True. I think it is worth repeating, that the approach of storing HW
> hints in metadata area (in front of packet data) was to allow future HW
> generations to write this. Thus, eliminating the 6 ns (that I showed it
> cost), and then it would be up to XDP BPF-prog to pick and choose which
> to read, like this patchset already offers.

As a hope for future generations of hw, being able to choose a cpu to
interrupt from an LPM table would be great. I keep hoping to find a card
that can do this already...

Also I would like to thank everyone working on this project so far for
what you've accomplished.
We're now pushing 20Gbit (through a vlan even) through libreqos.io
for thousands of ISP subscribers using all this great stuff, on 16
cores at only 24% of cpu through CAKE and also successfully monitoring
TCP RTTs at this scale via ebpf pping.

( https://www.yahoo.com/now/libreqoe-releases-version-1-3-214700756.html )

"Our hat is off to the creators of CAKE and the new Linux XDP and eBPF
subsystems!"

In our case, timestamp, and *3* hashes, are needed for cake, and
interrupting the right cpu would be great...

>
> This patchset isn't incompatible with future HW generations doing this,
> as the kfunc would hide the details and point to this area instead of
> the RX-desc. While we get the "store for free" from hardware, I do
> worry that reading this memory area (which will be part of the DMA area)
> is going to be slower than reading from RX-desc.
>
> > if btf is the problem then each vendor can provide a bpf func(s) that would
> > parse the metadata inside of the xdp/bpf prog domain to help programs
> > extract the vendor specific data..
> >
>
> In some sense, if unrolling becomes a thing, then this patchset is
> partly doing this.
>
> I did imagine that after/followup on XDP-hints with BTF patchset, we
> would allow drivers to load a BPF-prog that changed/selected which HW
> hints were relevant, to reduce the 6 ns overhead we introduced.
>
> --Jesper
>

-- 
This song goes out to all the folk that thought Stadia would work:
https://www.linkedin.com/posts/dtaht_the-mushroom-song-activity-6981366665607352320-FXtz
Dave Täht CEO, TekLibre, LLC
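
P.S. For sizing the "timestamp and *3* hashes" case, here is a rough
sketch of the arithmetic behind Toke's numbers quoted above. It is only a
back-of-the-envelope model: it assumes per-packet cost grows linearly
with the number of kfunc calls (the same assumption as the 8 ns/pkt
extrapolation in the thread), and the helper names ns_per_pkt(),
ns_per_call() and projected_overhead_ns() are made up for illustration;
they are not part of any kernel or libbpf API.

```c
#include <assert.h>

/* Convert a packet rate (packets per second) into a per-packet cost in ns. */
static double ns_per_pkt(double pps)
{
	assert(pps > 0.0);
	return 1e9 / pps;
}

/*
 * Average cost of one kfunc call, derived from two measured rates:
 * base_pps without metadata reads, meta_pps with `calls` kfunc calls/pkt.
 */
static double ns_per_call(double base_pps, double meta_pps, int calls)
{
	assert(calls > 0);
	return (ns_per_pkt(meta_pps) - ns_per_pkt(base_pps)) / calls;
}

/* Linear extrapolation (an assumption): overhead for `calls` calls/pkt. */
static double projected_overhead_ns(double per_call_ns, int calls)
{
	return per_call_ns * calls;
}
```

With the measured rates, ns_per_call(25678262, 23924109, 3) comes out at
roughly 0.95 ns. If timestamp plus three hashes each needed one data call
and one _supported() call (eight calls, which is purely a guess at how
such an API would look), projected_overhead_ns(0.95, 8) would be about
7.6 ns/pkt, in the same range as the 8 ns/pkt estimate above.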