From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by mail.toke.dk (Postfix) with ESMTPS id EC44A8E794F for ; Wed, 17 Nov 2021 23:48:27 +0100 (CET) Authentication-Results: mail.toke.dk; dkim=pass (1024-bit key) header.d=redhat.com header.i=@redhat.com header.b=YYs6bjkQ DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1637189306; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=Bj9sHtdFuXMbJVCBKZbcbQjq6hb2ltNJK7D8TxsRHxA=; b=YYs6bjkQnu5QqvMOiYDCFOkrShaeaS6XqHe0toqntZhKxNm+6Ns/4HiPhXOMwgzoNoK3gZ 9t5POceh1veaz7i9Am9c2YVtq0fNBIsXNhRyF/5AuYor9d+mXfPBAPUtJMdK8fwj3aTG5U XluxNCNjGgTxv/2IARby7JGKrgYh6uE= Received: from mail-ed1-f71.google.com (mail-ed1-f71.google.com [209.85.208.71]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-144-Tg1N6fkQM8yZoUNXSZ2Ycg-1; Wed, 17 Nov 2021 17:48:24 -0500 X-MC-Unique: Tg1N6fkQM8yZoUNXSZ2Ycg-1 Received: by mail-ed1-f71.google.com with SMTP id i19-20020a05640242d300b003e7d13ebeedso3500249edc.7 for ; Wed, 17 Nov 2021 14:48:24 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:from:to:cc:subject:in-reply-to:references:date :message-id:mime-version; bh=Bj9sHtdFuXMbJVCBKZbcbQjq6hb2ltNJK7D8TxsRHxA=; b=dWeulGk1paEs7NKQam5dUDndk/0AKTpnwP/BPeM4mRaLn6bxjhn569c/J9PDjZ0q0Z pcMRwER9/eTYOGVNGEU++UidStLGwy513PDUckcQ5b/Frqf79cAwiK0nqM+2fkF/q00f oN2EZBHswPAzwb+u4lGgSOS22FtFJOCeoxKIvltrlDodk+p+ACVWttn6WLJMAGqavjF+ qNEJDjUXwcOx5YvHU6gTCw1uQMasRe6smsaRLK1mYXOCvAqCBvVqdrXSPK009YlR9Srd E9F+KdeG5UznZBrJdi1Im7nx/gxPH1w0Er5IrhOPNYCGasDuvwqmDdQccdnKYq/pr6Fe imaA== X-Gm-Message-State: AOAM533r6TYZtYTQ0QA6mPpu+FOYzLCUCw0+lFLmIKyjYb3BcYutkGEs pI08WyKkC5CjVtelKWy8KgEvYb9WesCiRZn+PRn+Dc/KXCT4hEquj6nNrli6jjK2uvz6NlQJsPA lV+FF6JUqcvyokA6Ndj8o X-Received: by 2002:a05:6402:35ce:: with SMTP id z14mr3475367edc.197.1637189302441; Wed, 17 Nov 2021 14:48:22 -0800 (PST) X-Google-Smtp-Source: ABdhPJwiyTMniEtmsH67dRuiR8MpXHQzOuOv3JiVaMTY5oB7IYRbSFQA/u/V8lGZ6FasDmPRrDGkbw== X-Received: by 2002:a05:6402:35ce:: with SMTP id z14mr3475205edc.197.1637189301244; Wed, 17 Nov 2021 14:48:21 -0800 (PST) Received: from alrua-x1.borgediget.toke.dk ([2a0c:4d80:42:443::2]) by smtp.gmail.com with ESMTPSA id n16sm618966edt.67.2021.11.17.14.48.20 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 17 Nov 2021 14:48:20 -0800 (PST) Received: by alrua-x1.borgediget.toke.dk (Postfix, from userid 1000) id E397D180270; Wed, 17 Nov 2021 23:48:14 +0100 (CET) From: Toke =?utf-8?Q?H=C3=B8iland-J=C3=B8rgensen?= To: "Karlsson, Magnus" , Jesper Dangaard Brouer , "Desouza, Ederson" In-Reply-To: References: X-Clacks-Overhead: GNU Terry Pratchett Date: Wed, 17 Nov 2021 23:48:14 +0100 Message-ID: <875ysqflg1.fsf@toke.dk> MIME-Version: 1.0 Authentication-Results: relay.mimecast.com; auth=pass smtp.auth=CUSA124A263 smtp.mailfrom=toke@redhat.com X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Type: text/plain Message-ID-Hash: VB4Q4BERRDFLRSOPJFAGUIXCA5QMZD2C X-Message-ID-Hash: VB4Q4BERRDFLRSOPJFAGUIXCA5QMZD2C X-MailFrom: toke@redhat.com X-Mailman-Rule-Misses: dmarc-mitigation; no-senders; approved; emergency; loop; banned-address; member-moderation; nonmember-moderation; administrivia; implicit-dest; max-recipients; max-size; news-moderation; no-subject; digests; suspicious-header CC: "brouer@redhat.com" , "xdp-hints@xdp-project.net" , Eelco Chaudron , Andrii Nakryiko , "Fijalkowski, Maciej" , "Burakov, Anatoly" X-Mailman-Version: 3.3.4 Precedence: list Subject: [xdp-hints] Re: XDP-hints via local BTF info List-Id: XDP hardware hints design discussion Archived-At: List-Archive: List-Help: List-Owner: List-Post: List-Subscribe: List-Unsubscribe: "Karlsson, Magnus" writes: > Together with Maciej and Anatoly, I have been toying with how to > accomplish this, but it is early days so warning for some serious > handwaving. Will produce some code to see if it is possible at all. > One drawback with having it completely flexible and letting user-space > decide is the complexity implementing this in the driver/kernel. But > is this not why we have eBPF for in the first place? Maybe it can come > to the rescue here. I think there are a couple of distinctions we need to make, which your "handwaving" glosses over a little bit :) First, we have to distinguish between two cases here: how to work with existing hardware and drivers and their metadata implementations, and how to work with future devices that will (presumably) be completely flexible and/or programmable to provide any metadata configuration directly from the hardware. The XDP hints scheme obviously needs to work with both. Secondly, we should distinguish between the configuration interface and the metadata consumption from within the data path. I think those should be separate interfaces; in particular, I don't think loading an XDP program in itself should have the side effect of reconfiguring the hardware metadata format. Rather, the reconfiguration should be a separate, explicit step using whatever API we can end up agreeing on (ethtool or rtnetlink come to mind as obvious contenders). However, the *format* for this configuration could very well be BTF-based, so userspace can get whatever format it wants, assuming the hardware supports it. So, say we have this fancy programmable hardware, and we write a program with a struct definition like: struct my_meta_format { __u64 rx_timestamp; __u64 magic_colour_of_packet; __u32 btf_id; }; and from userspace we can then do: dev_metadata_configure(ifindex, BTF_OF_STRUCT(my_meta_format)); which will pass the BTF-format struct to the kernel and configure the hardware to write those fields. After this, the XDP program can just do: int my_xdp_prog(struct xdp_buff *ctx) { struct my_meta_format *meta = ctx->data_meta; do_something_with(meta->magic_colour_of_packet); return XDP_XXX; } and it will just work. Same thing from userspace. Or it can define a CO-RE enabled struct like: struct my_meta_subset { __u64 magic_colour_of_packet; /* we only care about this attr */ } __attribute__((preserve_access_index)); int my_xdp_prog(struct xdp_buff *ctx) { struct my_meta_subset *meta = ctx->data_meta; do_something_with(meta->magic_colour_of_packet); return XDP_XXX; } and libbpf will rewrite the field accesses using the BTF information, re-exported from the kernel from what userspace passed in when configuring. With the above in mind, a few comments on some of your other points: > The key feature here is that bpf_xdp_get_metadata() will actually go > and fetch this from the HW, by calling a small callback function in > the driver that reads for example rx_timestamp from the HW and returns > it to the XDP program. This is clearly not possible today and would > require new plumbing, if it is even possible to implement this. But > let us leave aside the implementation for now and just focus on the > benefit this (or something like it) could provide compared to a > kernel-centric approach. Took me a while to understand what you meant with this, but I think it finally clicked with your example below, so I'll leave the comment to there... > * User-space can completely control where to put data and what it puts > there. Think of producing the structure you want in user space > directly without having to copy things around. You could for example > produce a DPDK mbuf, or the VPP, Surikata equivalent directly. Would > save a lot of cycles. I think that userspace being in control of the metadata format is a great goal we should keep in mind, and as mentioned above we can do that with a BTF-based interface. > * No need for a metadata enablement interface. eBPF could find this > out by just parsing the XDP program and enable the used metadata > features in the HW by calling enable/disable metadata functions in the > driver. As mentioned above I think this is a mis-feature: configuration should be explicit and out of band, not tied to the data path implementation. > * No reason to expose BTFs to user-space. Makes it a lot simpler. > Actually, no need to use them in the driver either. BTF is just a handy format for describing the data layout. There has to be *some* format to communicate this, and exposing the BTF format to userspace is just a way for the kernel to say "this is how the hardware is currently configured". I don't understand why you think not having that makes things simpler? > * We do not have to use the metadata area unless it is actually > needed. In the case of AF_XDP, we have to write it there, but local > use in XDP does not. You would save one cache miss per packet right > there. Regardless of how the format is configured, why would an XDP program read it if it doesn't need it? > * We get away from non-scalable driver implementations such as: > > If (metadata_a_enabled) > meta->a = Read_metadata_a(); > If (metadata_b_enabled) > meta->b = Read_metadata_b(); > If (metadata_c_enabled) > meta->c = Read_metadata_c(); > and so on for many entries..... Implementations such as this exist, though, and we have to support them; as mentioned above, I think it's absolutely a worthy goal to make the metadata format completely fungible from userspace, but the XDP hints interface needs to also support old drivers that do things in the traditional way, even if it's inefficient. > > Or even worse: > meta->a = Read_metadata_b(); > meta->b = Read_metadata_a(); > meta->c = Read_metadata_c(); > and so on.... > > The last one would kill performance any day. Even the first one is > shunned in DPDK for performance reasons. > > Note that in my proposal above, read_metadata_a() is a function that > will be called from the XDP program when needed and only then, not > from the driver, so driver code becomes uncluttered by these large if > statements. Ah, with this bit I understood your point above about "a helper reading the data from the hardware". Let me see if I can rephrase your proposal, and let me know if I got it wrong: The problem you're speaking of here, is that of transforming the bits that come in from the hardware (in an RX-desc, say), into the metadata format consumed by XDP. Say, for instance that the driver has a function like: u64 my_driver_read_rx_timestamp(void *pkt_dsc); which does whatever bit of device-specific magic it needed to get a well-formed u64 timestamp out of the packet descriptor, maybe involving handling HW-specific quirks etc. And you want to expose this function directly to BPF instead of having a big conditional in the driver C code, so we can use BPF code elimination to only get the bits we want, right? This is actually something Jesper and I have been discussing as a possible future solution! However, what we discussed was not exposing this *to XDP*, but rather to have a separate BPF program *which is part of the driver* that does this. And this BPF program can then be paired with the BTF format coming in from userspace, so that it will only load the fields requested. So, going back to my imaginary configuration interface above, when userspace calls: dev_metadata_configure(ifindex, BTF_OF_STRUCT(my_meta_format)); the driver turns this over to the BPF subsystem with a BPF program like: int my_driver_convert_meta(void *ctx, void *meta_dst) { struct my_meta_format *meta_dst_fmt = meta_dst; if (bpf_core_field_exists(meta_dst->rx_timestamp)) meta_dst->rx_timestamp = my_driver_read_rx_timestamp(ctx); if (bpf_core_field_exists(meta_dst->magic_colour_of_packet)) meta_dst->magic_colour_of_packet = my_driver_read_magic_colour_of_packet(ctx); ...etc } And then the driver code where before you had, as you said: > If (metadata_a_enabled) > meta->a = Read_metadata_a(); > If (metadata_b_enabled) > meta->b = Read_metadata_b(); will just be BPF_PROG_RUN(drv->meta_writer_prog, meta); and that will run the BPF program that was loaded (via a usermode helper, probably) when userspace configured the metadata format and we'll get the conditional code goodness and better performance. However, having such a facility for drivers to dynamically *write* metadata is, I believe, completely orthogonal to how we implement the description format for consuming this metadata from XDP and userspace. For one thing, not all drivers will probably want to use the dynamic method, so we should not tie up the consumption of metadata to this. For another, it's way too complex to implement everything at once, so defining the consumption format first, then working on efficiently writing it afterwards (or in parallel, maybe) makes it possible to have incremental progress. That being said, having both things in mind when defining things is a good idea. But as I have hopefully outlined above, using BTF as the description format allows both writing and reading of metadata to be flexible and extensible. > * We also get away from the problem that the BTF id can change from > underneath the AF_XDP application. This becomes especially problematic > when we would like to apply this to the Tx path and do offloads. For this point, I think maybe the existence of the btf_id field in the metadata struct has given you the impression that the format can go and change at any time? I don't think that has to be the case: if a user knows that they will always use the same metadata format, and they know what that format is, it's perfectly valid to just ignore the btf_id field and just cast the metadata field to a struct. However, the field has to be there so that it is possible for the consumer to detect when it changes, in order to support portable applications. There are two use cases for this: - Writing an application that can support different metadata formats so it can run without re-compilation on different hardware. For this, it is possible to load the program with a target device, so that the long sequence of checks of the btf_id can be eliminated by CO-RE at load time, and turned into just a single sanity check so that the metadata is not loaded if the configuration changes. - Writing an application that deals with packets from multiple devices in the same program (say, a devmap program that gets redirects from different hardware into the same egress device). - Dealing with metadata configuration where a subset of the traffic has different metadata (e.g., timestamps may only appear on PTP packets, say). Here the check of the BTF ID can be used to demux the packet type. -Toke