From mboxrd@z Thu Jan 1 00:00:00 1970
Message-ID: <25860d8b-a980-4f04-a376-b9cec03605fb@kernel.org>
Date: Tue, 13 Aug 2024 11:51:45 +0200
From: Jesper Dangaard Brouer
To: Jakub Kicinski, Alexander Lobakin
In-Reply-To: <20240812183307.0b6fbd60@kernel.org>
References: <20220628194812.1453059-1-alexandr.lobakin@intel.com>
 <20220628194812.1453059-33-alexandr.lobakin@intel.com>
 <54aab7ec-80e9-44fd-8249-fe0cabda0393@intel.com>
 <308fd4f1-83a9-4b74-a482-216c8211a028@app.fastmail.com>
 <99662019-7e9b-410d-99fe-a85d04af215c@intel.com>
 <20240812183307.0b6fbd60@kernel.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Subject: [xdp-hints] Re: [PATCH RFC bpf-next 32/52] bpf, cpumap: switch to GRO from netif_receive_skb_list()
List-Id: XDP hardware hints design discussion
CC: Daniel Xu, Lorenzo Bianconi, Alexander Lobakin, Alexei Starovoitov,
 Daniel Borkmann, Andrii Nakryiko, Larysa Zaremba, Michal Swiatkowski,
 Björn Töpel, Magnus Karlsson, Maciej Fijalkowski, Jonathan Lemon,
 "toke@redhat.com", Lorenzo Bianconi, David Miller, Eric Dumazet,
 Paolo Abeni, Jesse Brandeburg, John Fastabend, Yajun Deng,
 Willem de Bruijn, "bpf@vger.kernel.org", netdev@vger.kernel.org,
 linux-kernel@vger.kernel.org, xdp-hints@xdp-project.net

On 13/08/2024 03.33, Jakub Kicinski wrote:
> On Fri, 9 Aug 2024 14:20:25 +0200 Alexander Lobakin wrote:
>> But I think one solution could be:
>>
>> 1. We create some generic structure for cpumap, like
>>
>> struct cpumap_meta {
>> 	u32 magic;
>> 	u32 hash;
>> }
>>
>> 2. We add such check in the cpumap code
>>
>> if (xdpf->metalen == sizeof(struct cpumap_meta) &&
>>     /* ...magic check... */)
>> 	skb->hash = meta->hash;
>>
>> 3. In XDP prog, you call Rx hints kfuncs when they're available, obtain
>> RSS hash and then put it in the struct cpumap_meta as XDP frame metadata.
>
> I wonder what the overhead of skb metadata allocation is in practice.
> With Eric's "return skb to the CPU of origin" we can feed the lockless
> skb cache on the right CPU, and also feed the lockless page pool
> cache. I wonder if batched RFS wouldn't be faster than the XDP thing
> that requires all the groundwork.
I explicitly developed CPUMAP because I was benchmarking Receive Flow
Steering (RFS) and Receive Packet Steering (RPS), and observed that they
were the bottleneck. The overhead on the RX-CPU was too large, partly
because RFS and RPS maintain data structures to avoid Out-of-Order
packets, and the Flow Dissector step was also a limiting factor. By
bottleneck I mean it didn't scale: the RX-CPU packets-per-second
processing speed was too low compared to the remote-CPU pps.

Digging in my old notes, I can see that RPS was limited to around
4.8 Mpps (and I have a weird result, with part of it disabled, showing
7.5 Mpps). In [1] the remote-CPU could process (starting at) 2.7 Mpps
when dropping UDP packets due to UdpNoPorts being configured (and a
baseline of 3.3 Mpps if not remote), thus it only scales up to 1.78
remote-CPUs. [1] shows how optimizations bring the remote-CPU to handle
3.2 Mpps (close to the 3.3 Mpps non-remote baseline). In [2] those
optimizations bring the remote-CPU to 4 Mpps (for the UdpNoPorts case).
XDP RX-redirect in [1]+[2] was around 19 Mpps (which might be lower
today due to perf paper cuts).

[1] https://github.com/xdp-project/xdp-project/blob/master/areas/cpumap/cpumap02-optimizations.org
[2] https://github.com/xdp-project/xdp-project/blob/master/areas/cpumap/cpumap03-optimizations.org

Eric's "return skb to the CPU of origin" should help improve the case
for the remote-CPU, as I was seeing some bottlenecks in how we returned
the memory.

--Jesper