* [xdp-hints] Re: [RFC bpf-next v2 00/11] bpf: Netdev TX metadata
From: Jesper Dangaard Brouer @ 2023-06-22  8:41 UTC
To: Stanislav Fomichev, bpf
Cc: brouer, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
  kpsingh, haoluo, jolsa, toke, willemb, dsahern, magnus.karlsson, bjorn,
  maciej.fijalkowski, netdev, xdp-hints

On 21/06/2023 19.02, Stanislav Fomichev wrote:
> CC'ing people only on the cover letter. Hopefully can find the rest via
> lore.

Could you please Cc me on all the patches, please.
(also please use hawk@kernel.org instead of my RH addr)

Also consider Cc'ing xdp-hints@xdp-project.net as we have end-users and
NIC engineers that can bring value to this conversation.

--Jesper
* [xdp-hints] Re: [RFC bpf-next v2 00/11] bpf: Netdev TX metadata
From: Stanislav Fomichev @ 2023-06-22 17:55 UTC
To: Jesper Dangaard Brouer
Cc: bpf, brouer, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
  kpsingh, haoluo, jolsa, toke, willemb, dsahern, magnus.karlsson, bjorn,
  maciej.fijalkowski, netdev, xdp-hints

On Thu, Jun 22, 2023 at 1:41 AM Jesper Dangaard Brouer <jbrouer@redhat.com> wrote:
>
>
> On 21/06/2023 19.02, Stanislav Fomichev wrote:
> > CC'ing people only on the cover letter. Hopefully can find the rest via
> > lore.
>
> Could you please Cc me on all the patches, please.
> (also please use hawk@kernel.org instead of my RH addr)
>
> Also consider Cc'ing xdp-hints@xdp-project.net as we have end-users and
> NIC engineers that can bring value to this conversation.

Definitely! Didn't want to spam people too much (assuming they mostly
use the lore for reading).
* [xdp-hints] Re: [RFC bpf-next v2 03/11] xsk: Support XDP_TX_METADATA_LEN
From: Jesper D. Brouer @ 2023-06-22  9:11 UTC
To: Stanislav Fomichev, bpf
Cc: brouer, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
  kpsingh, haoluo, jolsa, Björn Töpel, Karlsson, Magnus, xdp-hints

This needs to be reviewed by AF_XDP maintainers Magnus and Bjørn (Cc)

On 21/06/2023 19.02, Stanislav Fomichev wrote:
> For zerocopy mode, tx_desc->addr can point to the arbitrary offset
> and carry some TX metadata in the headroom. For copy mode, there
> is no way currently to populate skb metadata.
>
> Introduce new XDP_TX_METADATA_LEN that indicates how many bytes
> to treat as metadata. Metadata bytes come prior to tx_desc address
> (same as in RX case).

From looking at the code, this introduces a socket option for this TX
metadata length (tx_metadata_len).
This implies the same fixed TX metadata size is used for all packets.
Maybe describe this in patch desc.

What is the plan for dealing with cases that doesn't populate same/full
TX metadata size ?

>
> Signed-off-by: Stanislav Fomichev <sdf@google.com>
> ---
>  include/net/xdp_sock.h      |  1 +
>  include/net/xsk_buff_pool.h |  1 +
>  include/uapi/linux/if_xdp.h |  1 +
>  net/xdp/xsk.c               | 31 ++++++++++++++++++++++++++++++-
>  net/xdp/xsk_buff_pool.c     |  1 +
>  net/xdp/xsk_queue.h         |  7 ++++---
>  6 files changed, 38 insertions(+), 4 deletions(-)
>
> diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
> index e96a1151ec75..30018b3b862d 100644
> --- a/include/net/xdp_sock.h
> +++ b/include/net/xdp_sock.h
> @@ -51,6 +51,7 @@ struct xdp_sock {
>          struct list_head flush_node;
>          struct xsk_buff_pool *pool;
>          u16 queue_id;
> +        u8 tx_metadata_len;
>          bool zc;
>          enum {
>                  XSK_READY = 0,
> diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
> index a8d7b8a3688a..751fea51a6af 100644
> --- a/include/net/xsk_buff_pool.h
> +++ b/include/net/xsk_buff_pool.h
> @@ -75,6 +75,7 @@ struct xsk_buff_pool {
>          u32 chunk_size;
>          u32 chunk_shift;
>          u32 frame_len;
> +        u8 tx_metadata_len; /* inherited from xsk_sock */
>          u8 cached_need_wakeup;
>          bool uses_need_wakeup;
>          bool dma_need_sync;
> diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
> index a78a8096f4ce..2374eafff7db 100644
> --- a/include/uapi/linux/if_xdp.h
> +++ b/include/uapi/linux/if_xdp.h
> @@ -63,6 +63,7 @@ struct xdp_mmap_offsets {
>  #define XDP_UMEM_COMPLETION_RING        6
>  #define XDP_STATISTICS                  7
>  #define XDP_OPTIONS                     8
> +#define XDP_TX_METADATA_LEN             9
>
>  struct xdp_umem_reg {
>          __u64 addr; /* Start of packet data area */
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index cc1e7f15fa73..c9b2daba7b6d 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -485,6 +485,7 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
>                  int err;
>
>                  hr = max(NET_SKB_PAD, L1_CACHE_ALIGN(dev->needed_headroom));
> +                hr = max(hr, L1_CACHE_ALIGN((u32)xs->tx_metadata_len));
>                  tr = dev->needed_tailroom;
>                  len = desc->len;
>
> @@ -493,14 +494,21 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
>                          return ERR_PTR(err);
>
>                  skb_reserve(skb, hr);
> -                skb_put(skb, len);
> +                skb_put(skb, len + xs->tx_metadata_len);
>
>                  buffer = xsk_buff_raw_get_data(xs->pool, desc->addr);
> +                buffer -= xs->tx_metadata_len;
> +
>                  err = skb_store_bits(skb, 0, buffer, len);
>                  if (unlikely(err)) {
>                          kfree_skb(skb);
>                          return ERR_PTR(err);
>                  }
> +
> +                if (xs->tx_metadata_len) {
> +                        skb_metadata_set(skb, xs->tx_metadata_len);
> +                        __skb_pull(skb, xs->tx_metadata_len);
> +                }
>          }
>
>          skb->dev = dev;
> @@ -1137,6 +1145,27 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
>                  mutex_unlock(&xs->mutex);
>                  return err;
>          }
> +        case XDP_TX_METADATA_LEN:
> +        {
> +                int val;
> +
> +                if (optlen < sizeof(val))
> +                        return -EINVAL;
> +                if (copy_from_sockptr(&val, optval, sizeof(val)))
> +                        return -EFAULT;
> +
> +                if (val >= 256)
> +                        return -EINVAL;
> +
> +                mutex_lock(&xs->mutex);
> +                if (xs->state != XSK_READY) {
> +                        mutex_unlock(&xs->mutex);
> +                        return -EBUSY;
> +                }
> +                xs->tx_metadata_len = val;
> +                mutex_unlock(&xs->mutex);
> +                return err;
> +        }
>          default:
>                  break;
>          }
> diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
> index 26f6d304451e..66ff9c345a67 100644
> --- a/net/xdp/xsk_buff_pool.c
> +++ b/net/xdp/xsk_buff_pool.c
> @@ -85,6 +85,7 @@ struct xsk_buff_pool *xp_create_and_assign_umem(struct xdp_sock *xs,
>                  XDP_PACKET_HEADROOM;
>          pool->umem = umem;
>          pool->addrs = umem->addrs;
> +        pool->tx_metadata_len = xs->tx_metadata_len;
>          INIT_LIST_HEAD(&pool->free_list);
>          INIT_LIST_HEAD(&pool->xsk_tx_list);
>          spin_lock_init(&pool->xsk_tx_list_lock);
> diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
> index 6d40a77fccbe..c8d287c18d64 100644
> --- a/net/xdp/xsk_queue.h
> +++ b/net/xdp/xsk_queue.h
> @@ -133,12 +133,13 @@ static inline bool xskq_cons_read_addr_unchecked(struct xsk_queue *q, u64 *addr)
>  static inline bool xp_aligned_validate_desc(struct xsk_buff_pool *pool,
>                                              struct xdp_desc *desc)
>  {
> -        u64 offset = desc->addr & (pool->chunk_size - 1);
> +        u64 addr = desc->addr - pool->tx_metadata_len;
> +        u64 offset = addr & (pool->chunk_size - 1);
>
>          if (offset + desc->len > pool->chunk_size)
>                  return false;
>
> -        if (desc->addr >= pool->addrs_cnt)
> +        if (addr >= pool->addrs_cnt)
>                  return false;
>
>          if (desc->options)
> @@ -149,7 +150,7 @@ static inline bool xp_aligned_validate_desc(struct xsk_buff_pool *pool,
>  static inline bool xp_unaligned_validate_desc(struct xsk_buff_pool *pool,
>                                                struct xdp_desc *desc)
>  {
> -        u64 addr = xp_unaligned_add_offset_to_addr(desc->addr);
> +        u64 addr = xp_unaligned_add_offset_to_addr(desc->addr) - pool->tx_metadata_len;
>
>          if (desc->len > pool->chunk_size)
>                  return false;
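For reference, a minimal userspace sketch of how an AF_XDP application would opt in to the fixed TX metadata headroom added by this patch; only setsockopt(SOL_XDP, XDP_TX_METADATA_LEN) itself comes from the patch, the helper and the fallback define are illustrative:

#include <sys/socket.h>
#include <linux/if_xdp.h>

#ifndef XDP_TX_METADATA_LEN
#define XDP_TX_METADATA_LEN 9   /* proposed by this patch, not yet in uapi headers */
#endif

/* Reserve meta_len bytes of per-frame TX metadata headroom.
 * Per the patch, the value must be < 256 and the call must happen
 * while the socket is still in XSK_READY state (before bind()),
 * otherwise the kernel returns -EBUSY. */
static int xsk_enable_tx_metadata(int xsk_fd, int meta_len)
{
        return setsockopt(xsk_fd, SOL_XDP, XDP_TX_METADATA_LEN,
                          &meta_len, sizeof(meta_len));
}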
* [xdp-hints] Re: [RFC bpf-next v2 03/11] xsk: Support XDP_TX_METADATA_LEN
From: Stanislav Fomichev @ 2023-06-22 17:55 UTC
To: Jesper D. Brouer
Cc: bpf, brouer, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
  kpsingh, haoluo, jolsa, Björn Töpel, Karlsson, Magnus, xdp-hints

On Thu, Jun 22, 2023 at 2:11 AM Jesper D. Brouer <netdev@brouer.com> wrote:
>
>
> This needs to be reviewed by AF_XDP maintainers Magnus and Bjørn (Cc)
>
> On 21/06/2023 19.02, Stanislav Fomichev wrote:
> > For zerocopy mode, tx_desc->addr can point to the arbitrary offset
> > and carry some TX metadata in the headroom. For copy mode, there
> > is no way currently to populate skb metadata.
> >
> > Introduce new XDP_TX_METADATA_LEN that indicates how many bytes
> > to treat as metadata. Metadata bytes come prior to tx_desc address
> > (same as in RX case).
>
> From looking at the code, this introduces a socket option for this TX
> metadata length (tx_metadata_len).
> This implies the same fixed TX metadata size is used for all packets.
> Maybe describe this in patch desc.

I was planning to do a proper documentation page once we settle on all
the details (similar to the one we have for rx).

> What is the plan for dealing with cases that doesn't populate same/full
> TX metadata size ?

Do we need to support that? I was assuming that the TX layout would be
fixed between the userspace and BPF.

If every packet would have a different metadata length, it seems like
a nightmare to parse?
* [xdp-hints] Re: [RFC bpf-next v2 03/11] xsk: Support XDP_TX_METADATA_LEN
From: Jesper Dangaard Brouer @ 2023-06-23 10:24 UTC
To: Stanislav Fomichev
Cc: brouer, bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
  kpsingh, haoluo, jolsa, Björn Töpel, Karlsson, Magnus, xdp-hints

On 22/06/2023 19.55, Stanislav Fomichev wrote:
> On Thu, Jun 22, 2023 at 2:11 AM Jesper D. Brouer <netdev@brouer.com> wrote:
>>
>>
>> This needs to be reviewed by AF_XDP maintainers Magnus and Bjørn (Cc)
>>
>> On 21/06/2023 19.02, Stanislav Fomichev wrote:
>>> For zerocopy mode, tx_desc->addr can point to the arbitrary offset
>>> and carry some TX metadata in the headroom. For copy mode, there
>>> is no way currently to populate skb metadata.
>>>
>>> Introduce new XDP_TX_METADATA_LEN that indicates how many bytes
>>> to treat as metadata. Metadata bytes come prior to tx_desc address
>>> (same as in RX case).
>>
>> From looking at the code, this introduces a socket option for this TX
>> metadata length (tx_metadata_len).
>> This implies the same fixed TX metadata size is used for all packets.
>> Maybe describe this in patch desc.
>
> I was planning to do a proper documentation page once we settle on all
> the details (similar to the one we have for rx).
>
>> What is the plan for dealing with cases that doesn't populate same/full
>> TX metadata size ?
>
> Do we need to support that? I was assuming that the TX layout would be
> fixed between the userspace and BPF.

I hope you don't mean fixed layout, as the whole point is adding
flexibility and extensibility.

> If every packet would have a different metadata length, it seems like
> a nightmare to parse?
>

No parsing is really needed. We can simply use BTF IDs and type cast in
BPF-prog. Both BPF-prog and userspace have access to the local BTF ids,
see [1] and [2].

It seems we are talking slightly past each-other(?). Let me rephrase
and reframe the question, what is your *plan* for dealing with different
*types* of TX metadata. The different struct *types* will of-cause have
different sizes, but that is okay as long as they fit into the maximum
size set by this new socket option XDP_TX_METADATA_LEN.
Thus, in principle I'm fine with XSK having configured a fixed headroom
for metadata, but we need a plan for handling more than one type and
perhaps a xsk desc indicator/flag for knowing TX metadata isn't random
data ("leftover" since last time this mem was used).

With this kfunc approach, then things in-principle, becomes a contract
between the "local" TX-hook BPF-prog and AF_XDP userspace. These two
components can as illustrated here [1]+[2] can coordinate based on local
BPF-prog BTF IDs. This approach works as-is today, but patchset
selftests examples don't use this and instead have a very static
approach (that people will copy-paste).

An unsolved problem with TX-hook is that it can also get packets from
XDP_REDIRECT and even normal SKBs gets processed (right?). How does the
BPF-prog know if metadata is valid and intended to be used for e.g.
requesting the timestamp? (imagine metadata size happen to match)

--Jesper

BPF-prog API bpf_core_type_id_local:
- [1]
https://github.com/xdp-project/bpf-examples/blob/master/AF_XDP-interaction/af_xdp_kern.c#L80

Userspace API btf__find_by_name_kind:
- [2]
https://github.com/xdp-project/bpf-examples/blob/master/AF_XDP-interaction/lib_xsk_extend.c#L185
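To make the BTF-ID coordination concrete, here is a rough BPF-prog sketch modelled on the AF_XDP-interaction example linked in [1]/[2]; the struct, its field names and the RX-style data_meta placement are illustrative only, not something this patchset defines:

/* SPDX-License-Identifier: GPL-2.0 */
#include <linux/types.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

struct xdp_hints_tx_launch_time {
        __u64 launch_time;
        __u32 btf_id;   /* last member, i.e. adjacent to packet data */
} __attribute__((aligned(4)));

static __always_inline int
meta_set_launch_time(void *data_meta, void *data, __u64 launch_time)
{
        struct xdp_hints_tx_launch_time *meta = data_meta;

        if ((void *)(meta + 1) > data)
                return -1;      /* not enough metadata headroom */

        meta->launch_time = launch_time;
        /* Userspace resolves the same id with btf__find_by_name_kind()
         * on this program's BTF and compares it against this field. */
        meta->btf_id = bpf_core_type_id_local(struct xdp_hints_tx_launch_time);
        return 0;
}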
* [xdp-hints] Re: [RFC bpf-next v2 03/11] xsk: Support XDP_TX_METADATA_LEN
From: Stanislav Fomichev @ 2023-06-23 17:41 UTC
To: Jesper Dangaard Brouer
Cc: brouer, bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
  kpsingh, haoluo, jolsa, Björn Töpel, Karlsson, Magnus, xdp-hints

On Fri, Jun 23, 2023 at 3:24 AM Jesper Dangaard Brouer <jbrouer@redhat.com> wrote:
>
>
>
> On 22/06/2023 19.55, Stanislav Fomichev wrote:
> > On Thu, Jun 22, 2023 at 2:11 AM Jesper D. Brouer <netdev@brouer.com> wrote:
> >>
> >>
> >> This needs to be reviewed by AF_XDP maintainers Magnus and Bjørn (Cc)
> >>
> >> On 21/06/2023 19.02, Stanislav Fomichev wrote:
> >>> For zerocopy mode, tx_desc->addr can point to the arbitrary offset
> >>> and carry some TX metadata in the headroom. For copy mode, there
> >>> is no way currently to populate skb metadata.
> >>>
> >>> Introduce new XDP_TX_METADATA_LEN that indicates how many bytes
> >>> to treat as metadata. Metadata bytes come prior to tx_desc address
> >>> (same as in RX case).
> >>
> >> From looking at the code, this introduces a socket option for this TX
> >> metadata length (tx_metadata_len).
> >> This implies the same fixed TX metadata size is used for all packets.
> >> Maybe describe this in patch desc.
> >
> > I was planning to do a proper documentation page once we settle on all
> > the details (similar to the one we have for rx).
> >
> >> What is the plan for dealing with cases that doesn't populate same/full
> >> TX metadata size ?
> >
> > Do we need to support that? I was assuming that the TX layout would be
> > fixed between the userspace and BPF.
>
> I hope you don't mean fixed layout, as the whole point is adding
> flexibility and extensibility.

I do mean a fixed layout between the userspace (af_xdp) and devtx program.
At least fixed max size of the metadata. The userspace and the bpf
prog can then use this fixed space to implement some flexibility
(btf_ids, versioned structs, bitmasks, tlv, etc).
If we were to make the metalen vary per packet, we'd have to signal
its size per packet. Probably not worth it?

> > If every packet would have a different metadata length, it seems like
> > a nightmare to parse?
> >
>
> No parsing is really needed. We can simply use BTF IDs and type cast in
> BPF-prog. Both BPF-prog and userspace have access to the local BTF ids,
> see [1] and [2].
>
> It seems we are talking slightly past each-other(?). Let me rephrase
> and reframe the question, what is your *plan* for dealing with different
> *types* of TX metadata. The different struct *types* will of-cause have
> different sizes, but that is okay as long as they fit into the maximum
> size set by this new socket option XDP_TX_METADATA_LEN.
> Thus, in principle I'm fine with XSK having configured a fixed headroom
> for metadata, but we need a plan for handling more than one type and
> perhaps a xsk desc indicator/flag for knowing TX metadata isn't random
> data ("leftover" since last time this mem was used).

Yeah, I think the above correctly catches my expectation here. Some
headroom is reserved via XDP_TX_METADATA_LEN and the flexibility is
offloaded to the bpf program via btf_id/tlv/etc.

Regarding leftover metadata: can we assume the userspace will take
care of setting it up?

> With this kfunc approach, then things in-principle, becomes a contract
> between the "local" TX-hook BPF-prog and AF_XDP userspace. These two
> components can as illustrated here [1]+[2] can coordinate based on local
> BPF-prog BTF IDs. This approach works as-is today, but patchset
> selftests examples don't use this and instead have a very static
> approach (that people will copy-paste).
>
> An unsolved problem with TX-hook is that it can also get packets from
> XDP_REDIRECT and even normal SKBs gets processed (right?). How does the
> BPF-prog know if metadata is valid and intended to be used for e.g.
> requesting the timestamp? (imagine metadata size happen to match)

My assumption was the bpf program can do ifindex/netns filtering. Plus
maybe check that the meta_len is the one that's expected.
Will that be enough to handle XDP_REDIRECT?

> --Jesper
>
> BPF-prog API bpf_core_type_id_local:
> - [1]
> https://github.com/xdp-project/bpf-examples/blob/master/AF_XDP-interaction/af_xdp_kern.c#L80
>
> Userspace API btf__find_by_name_kind:
> - [2]
> https://github.com/xdp-project/bpf-examples/blob/master/AF_XDP-interaction/lib_xsk_extend.c#L185
>
* [xdp-hints] Re: [RFC bpf-next v2 03/11] xsk: Support XDP_TX_METADATA_LEN 2023-06-23 17:41 ` Stanislav Fomichev @ 2023-06-24 9:02 ` Jesper Dangaard Brouer 2023-06-26 17:00 ` Stanislav Fomichev 0 siblings, 1 reply; 20+ messages in thread From: Jesper Dangaard Brouer @ 2023-06-24 9:02 UTC (permalink / raw) To: Stanislav Fomichev, Jesper Dangaard Brouer Cc: brouer, bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend, kpsingh, haoluo, jolsa, Björn Töpel, Karlsson, Magnus, xdp-hints On 23/06/2023 19.41, Stanislav Fomichev wrote: > On Fri, Jun 23, 2023 at 3:24 AM Jesper Dangaard Brouer > <jbrouer@redhat.com> wrote: >> >> >> >> On 22/06/2023 19.55, Stanislav Fomichev wrote: >>> On Thu, Jun 22, 2023 at 2:11 AM Jesper D. Brouer <netdev@brouer.com> wrote: >>>> >>>> >>>> This needs to be reviewed by AF_XDP maintainers Magnus and Bjørn (Cc) >>>> >>>> On 21/06/2023 19.02, Stanislav Fomichev wrote: >>>>> For zerocopy mode, tx_desc->addr can point to the arbitrary offset >>>>> and carry some TX metadata in the headroom. For copy mode, there >>>>> is no way currently to populate skb metadata. >>>>> >>>>> Introduce new XDP_TX_METADATA_LEN that indicates how many bytes >>>>> to treat as metadata. Metadata bytes come prior to tx_desc address >>>>> (same as in RX case). >>>> >>>> From looking at the code, this introduces a socket option for this TX >>>> metadata length (tx_metadata_len). >>>> This implies the same fixed TX metadata size is used for all packets. >>>> Maybe describe this in patch desc. >>> >>> I was planning to do a proper documentation page once we settle on all >>> the details (similar to the one we have for rx). >>> >>>> What is the plan for dealing with cases that doesn't populate same/full >>>> TX metadata size ? >>> >>> Do we need to support that? I was assuming that the TX layout would be >>> fixed between the userspace and BPF. >> >> I hope you don't mean fixed layout, as the whole point is adding >> flexibility and extensibility. > > I do mean a fixed layout between the userspace (af_xdp) and devtx program. > At least fixed max size of the metadata. The userspace and the bpf > prog can then use this fixed space to implement some flexibility > (btf_ids, versioned structs, bitmasks, tlv, etc). > If we were to make the metalen vary per packet, we'd have to signal > its size per packet. Probably not worth it? Existing XDP metadata implementation also expand in a fixed/limited sized memory area, but communicate size per packet in this area (also for validation purposes). BUT for AF_XDP we don't have room for another pointer or size in the AF_XDP descriptor (see struct xdp_desc). > >>> If every packet would have a different metadata length, it seems like >>> a nightmare to parse? >>> >> >> No parsing is really needed. We can simply use BTF IDs and type cast in >> BPF-prog. Both BPF-prog and userspace have access to the local BTF ids, >> see [1] and [2]. >> >> It seems we are talking slightly past each-other(?). Let me rephrase >> and reframe the question, what is your *plan* for dealing with different >> *types* of TX metadata. The different struct *types* will of-cause have >> different sizes, but that is okay as long as they fit into the maximum >> size set by this new socket option XDP_TX_METADATA_LEN. >> Thus, in principle I'm fine with XSK having configured a fixed headroom >> for metadata, but we need a plan for handling more than one type and >> perhaps a xsk desc indicator/flag for knowing TX metadata isn't random >> data ("leftover" since last time this mem was used). 
> > Yeah, I think the above correctly catches my expectation here. Some > headroom is reserved via XDP_TX_METADATA_LEN and the flexibility is > offloaded to the bpf program via btf_id/tlv/etc. > > Regarding leftover metadata: can we assume the userspace will take > care of setting it up? > >> With this kfunc approach, then things in-principle, becomes a contract >> between the "local" TX-hook BPF-prog and AF_XDP userspace. These two >> components can as illustrated here [1]+[2] can coordinate based on local >> BPF-prog BTF IDs. This approach works as-is today, but patchset >> selftests examples don't use this and instead have a very static >> approach (that people will copy-paste). >> >> An unsolved problem with TX-hook is that it can also get packets from >> XDP_REDIRECT and even normal SKBs gets processed (right?). How does the >> BPF-prog know if metadata is valid and intended to be used for e.g. >> requesting the timestamp? (imagine metadata size happen to match) > > My assumption was the bpf program can do ifindex/netns filtering. Plus > maybe check that the meta_len is the one that's expected. > Will that be enough to handle XDP_REDIRECT? I don't think so, using the meta_len (+ ifindex/netns) to communicate activation of TX hardware hints is too weak and not enough. This is an implicit API for BPF-programmers to understand and can lead to implicit activation. Think about what will happen for your AF_XDP send use-case. For performance reasons AF_XDP don't zero out frame memory. Thus, meta_len is fixed even if not used (and can contain garbage), it can by accident create hard-to-debug situations. As discussed with Magnus+Maryam before, we found it was practical (and faster than mem zero) to extend AF_XDP descriptor (see struct xdp_desc) with some flags to indicate/communicate this frame comes with TX metadata hints. >> >> BPF-prog API bpf_core_type_id_local: >> - [1] >> https://github.com/xdp-project/bpf-examples/blob/master/AF_XDP-interaction/af_xdp_kern.c#L80 >> >> Userspace API btf__find_by_name_kind: >> - [2] >> https://github.com/xdp-project/bpf-examples/blob/master/AF_XDP-interaction/lib_xsk_extend.c#L185 >> > ^ permalink raw reply [flat|nested] 20+ messages in thread
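A sketch of the descriptor-flag idea raised above (purely hypothetical; neither the flag nor these helpers exist in this patchset or in if_xdp.h): one bit in xdp_desc->options that tells the TX path the frame's headroom was intentionally populated, so stale garbage is never mistaken for a hint request:

#include <stdbool.h>
#include <linux/if_xdp.h>

/* Hypothetical bit; a real assignment would have to be coordinated with
 * other users of the options field (e.g. the multi-buffer work mentioned
 * later in this thread). */
#define XDP_TX_METADATA_VALID   (1U << 1)

static inline bool xsk_desc_has_tx_metadata(const struct xdp_desc *desc)
{
        return desc->options & XDP_TX_METADATA_VALID;
}

/* Userspace sets the bit only on frames where it actually filled the
 * metadata headroom; all other frames keep options == 0 as today. */
static inline void xsk_desc_mark_tx_metadata(struct xdp_desc *desc)
{
        desc->options |= XDP_TX_METADATA_VALID;
}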
* [xdp-hints] Re: [RFC bpf-next v2 03/11] xsk: Support XDP_TX_METADATA_LEN 2023-06-24 9:02 ` Jesper Dangaard Brouer @ 2023-06-26 17:00 ` Stanislav Fomichev 2023-06-28 8:09 ` Magnus Karlsson 0 siblings, 1 reply; 20+ messages in thread From: Stanislav Fomichev @ 2023-06-26 17:00 UTC (permalink / raw) To: Jesper Dangaard Brouer Cc: brouer, bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend, kpsingh, haoluo, jolsa, Björn Töpel, Karlsson, Magnus, xdp-hints On Sat, Jun 24, 2023 at 2:02 AM Jesper Dangaard Brouer <jbrouer@redhat.com> wrote: > > > > On 23/06/2023 19.41, Stanislav Fomichev wrote: > > On Fri, Jun 23, 2023 at 3:24 AM Jesper Dangaard Brouer > > <jbrouer@redhat.com> wrote: > >> > >> > >> > >> On 22/06/2023 19.55, Stanislav Fomichev wrote: > >>> On Thu, Jun 22, 2023 at 2:11 AM Jesper D. Brouer <netdev@brouer.com> wrote: > >>>> > >>>> > >>>> This needs to be reviewed by AF_XDP maintainers Magnus and Bjørn (Cc) > >>>> > >>>> On 21/06/2023 19.02, Stanislav Fomichev wrote: > >>>>> For zerocopy mode, tx_desc->addr can point to the arbitrary offset > >>>>> and carry some TX metadata in the headroom. For copy mode, there > >>>>> is no way currently to populate skb metadata. > >>>>> > >>>>> Introduce new XDP_TX_METADATA_LEN that indicates how many bytes > >>>>> to treat as metadata. Metadata bytes come prior to tx_desc address > >>>>> (same as in RX case). > >>>> > >>>> From looking at the code, this introduces a socket option for this TX > >>>> metadata length (tx_metadata_len). > >>>> This implies the same fixed TX metadata size is used for all packets. > >>>> Maybe describe this in patch desc. > >>> > >>> I was planning to do a proper documentation page once we settle on all > >>> the details (similar to the one we have for rx). > >>> > >>>> What is the plan for dealing with cases that doesn't populate same/full > >>>> TX metadata size ? > >>> > >>> Do we need to support that? I was assuming that the TX layout would be > >>> fixed between the userspace and BPF. > >> > >> I hope you don't mean fixed layout, as the whole point is adding > >> flexibility and extensibility. > > > > I do mean a fixed layout between the userspace (af_xdp) and devtx program. > > At least fixed max size of the metadata. The userspace and the bpf > > prog can then use this fixed space to implement some flexibility > > (btf_ids, versioned structs, bitmasks, tlv, etc). > > If we were to make the metalen vary per packet, we'd have to signal > > its size per packet. Probably not worth it? > > Existing XDP metadata implementation also expand in a fixed/limited > sized memory area, but communicate size per packet in this area (also > for validation purposes). BUT for AF_XDP we don't have room for another > pointer or size in the AF_XDP descriptor (see struct xdp_desc). > > > > > >>> If every packet would have a different metadata length, it seems like > >>> a nightmare to parse? > >>> > >> > >> No parsing is really needed. We can simply use BTF IDs and type cast in > >> BPF-prog. Both BPF-prog and userspace have access to the local BTF ids, > >> see [1] and [2]. > >> > >> It seems we are talking slightly past each-other(?). Let me rephrase > >> and reframe the question, what is your *plan* for dealing with different > >> *types* of TX metadata. The different struct *types* will of-cause have > >> different sizes, but that is okay as long as they fit into the maximum > >> size set by this new socket option XDP_TX_METADATA_LEN. 
> >> Thus, in principle I'm fine with XSK having configured a fixed headroom > >> for metadata, but we need a plan for handling more than one type and > >> perhaps a xsk desc indicator/flag for knowing TX metadata isn't random > >> data ("leftover" since last time this mem was used). > > > > Yeah, I think the above correctly catches my expectation here. Some > > headroom is reserved via XDP_TX_METADATA_LEN and the flexibility is > > offloaded to the bpf program via btf_id/tlv/etc. > > > > Regarding leftover metadata: can we assume the userspace will take > > care of setting it up? > > > >> With this kfunc approach, then things in-principle, becomes a contract > >> between the "local" TX-hook BPF-prog and AF_XDP userspace. These two > >> components can as illustrated here [1]+[2] can coordinate based on local > >> BPF-prog BTF IDs. This approach works as-is today, but patchset > >> selftests examples don't use this and instead have a very static > >> approach (that people will copy-paste). > >> > >> An unsolved problem with TX-hook is that it can also get packets from > >> XDP_REDIRECT and even normal SKBs gets processed (right?). How does the > >> BPF-prog know if metadata is valid and intended to be used for e.g. > >> requesting the timestamp? (imagine metadata size happen to match) > > > > My assumption was the bpf program can do ifindex/netns filtering. Plus > > maybe check that the meta_len is the one that's expected. > > Will that be enough to handle XDP_REDIRECT? > > I don't think so, using the meta_len (+ ifindex/netns) to communicate > activation of TX hardware hints is too weak and not enough. This is an > implicit API for BPF-programmers to understand and can lead to implicit > activation. > > Think about what will happen for your AF_XDP send use-case. For > performance reasons AF_XDP don't zero out frame memory. Thus, meta_len > is fixed even if not used (and can contain garbage), it can by accident > create hard-to-debug situations. As discussed with Magnus+Maryam > before, we found it was practical (and faster than mem zero) to extend > AF_XDP descriptor (see struct xdp_desc) with some flags to > indicate/communicate this frame comes with TX metadata hints. What is that "if not used" situation? Can the metadata itself have is_used bit? The userspace has to initialize at least that bit. We can definitely add that extra "has_metadata" bit to the descriptor, but I'm trying to understand whether we can do without it. > >> > >> BPF-prog API bpf_core_type_id_local: > >> - [1] > >> https://github.com/xdp-project/bpf-examples/blob/master/AF_XDP-interaction/af_xdp_kern.c#L80 > >> > >> Userspace API btf__find_by_name_kind: > >> - [2] > >> https://github.com/xdp-project/bpf-examples/blob/master/AF_XDP-interaction/lib_xsk_extend.c#L185 > >> > > > ^ permalink raw reply [flat|nested] 20+ messages in thread
* [xdp-hints] Re: [RFC bpf-next v2 03/11] xsk: Support XDP_TX_METADATA_LEN 2023-06-26 17:00 ` Stanislav Fomichev @ 2023-06-28 8:09 ` Magnus Karlsson 2023-06-28 18:49 ` Stanislav Fomichev 0 siblings, 1 reply; 20+ messages in thread From: Magnus Karlsson @ 2023-06-28 8:09 UTC (permalink / raw) To: Stanislav Fomichev Cc: Jesper Dangaard Brouer, brouer, bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend, kpsingh, haoluo, jolsa, Björn Töpel, Karlsson, Magnus, xdp-hints On Mon, 26 Jun 2023 at 19:06, Stanislav Fomichev <sdf@google.com> wrote: > > On Sat, Jun 24, 2023 at 2:02 AM Jesper Dangaard Brouer > <jbrouer@redhat.com> wrote: > > > > > > > > On 23/06/2023 19.41, Stanislav Fomichev wrote: > > > On Fri, Jun 23, 2023 at 3:24 AM Jesper Dangaard Brouer > > > <jbrouer@redhat.com> wrote: > > >> > > >> > > >> > > >> On 22/06/2023 19.55, Stanislav Fomichev wrote: > > >>> On Thu, Jun 22, 2023 at 2:11 AM Jesper D. Brouer <netdev@brouer.com> wrote: > > >>>> > > >>>> > > >>>> This needs to be reviewed by AF_XDP maintainers Magnus and Bjørn (Cc) > > >>>> > > >>>> On 21/06/2023 19.02, Stanislav Fomichev wrote: > > >>>>> For zerocopy mode, tx_desc->addr can point to the arbitrary offset > > >>>>> and carry some TX metadata in the headroom. For copy mode, there > > >>>>> is no way currently to populate skb metadata. > > >>>>> > > >>>>> Introduce new XDP_TX_METADATA_LEN that indicates how many bytes > > >>>>> to treat as metadata. Metadata bytes come prior to tx_desc address > > >>>>> (same as in RX case). > > >>>> > > >>>> From looking at the code, this introduces a socket option for this TX > > >>>> metadata length (tx_metadata_len). > > >>>> This implies the same fixed TX metadata size is used for all packets. > > >>>> Maybe describe this in patch desc. > > >>> > > >>> I was planning to do a proper documentation page once we settle on all > > >>> the details (similar to the one we have for rx). > > >>> > > >>>> What is the plan for dealing with cases that doesn't populate same/full > > >>>> TX metadata size ? > > >>> > > >>> Do we need to support that? I was assuming that the TX layout would be > > >>> fixed between the userspace and BPF. > > >> > > >> I hope you don't mean fixed layout, as the whole point is adding > > >> flexibility and extensibility. > > > > > > I do mean a fixed layout between the userspace (af_xdp) and devtx program. > > > At least fixed max size of the metadata. The userspace and the bpf > > > prog can then use this fixed space to implement some flexibility > > > (btf_ids, versioned structs, bitmasks, tlv, etc). > > > If we were to make the metalen vary per packet, we'd have to signal > > > its size per packet. Probably not worth it? > > > > Existing XDP metadata implementation also expand in a fixed/limited > > sized memory area, but communicate size per packet in this area (also > > for validation purposes). BUT for AF_XDP we don't have room for another > > pointer or size in the AF_XDP descriptor (see struct xdp_desc). > > > > > > > > > >>> If every packet would have a different metadata length, it seems like > > >>> a nightmare to parse? > > >>> > > >> > > >> No parsing is really needed. We can simply use BTF IDs and type cast in > > >> BPF-prog. Both BPF-prog and userspace have access to the local BTF ids, > > >> see [1] and [2]. > > >> > > >> It seems we are talking slightly past each-other(?). Let me rephrase > > >> and reframe the question, what is your *plan* for dealing with different > > >> *types* of TX metadata. 
The different struct *types* will of-cause have > > >> different sizes, but that is okay as long as they fit into the maximum > > >> size set by this new socket option XDP_TX_METADATA_LEN. > > >> Thus, in principle I'm fine with XSK having configured a fixed headroom > > >> for metadata, but we need a plan for handling more than one type and > > >> perhaps a xsk desc indicator/flag for knowing TX metadata isn't random > > >> data ("leftover" since last time this mem was used). > > > > > > Yeah, I think the above correctly catches my expectation here. Some > > > headroom is reserved via XDP_TX_METADATA_LEN and the flexibility is > > > offloaded to the bpf program via btf_id/tlv/etc. > > > > > > Regarding leftover metadata: can we assume the userspace will take > > > care of setting it up? > > > > > >> With this kfunc approach, then things in-principle, becomes a contract > > >> between the "local" TX-hook BPF-prog and AF_XDP userspace. These two > > >> components can as illustrated here [1]+[2] can coordinate based on local > > >> BPF-prog BTF IDs. This approach works as-is today, but patchset > > >> selftests examples don't use this and instead have a very static > > >> approach (that people will copy-paste). > > >> > > >> An unsolved problem with TX-hook is that it can also get packets from > > >> XDP_REDIRECT and even normal SKBs gets processed (right?). How does the > > >> BPF-prog know if metadata is valid and intended to be used for e.g. > > >> requesting the timestamp? (imagine metadata size happen to match) > > > > > > My assumption was the bpf program can do ifindex/netns filtering. Plus > > > maybe check that the meta_len is the one that's expected. > > > Will that be enough to handle XDP_REDIRECT? > > > > I don't think so, using the meta_len (+ ifindex/netns) to communicate > > activation of TX hardware hints is too weak and not enough. This is an > > implicit API for BPF-programmers to understand and can lead to implicit > > activation. > > > > Think about what will happen for your AF_XDP send use-case. For > > performance reasons AF_XDP don't zero out frame memory. Thus, meta_len > > is fixed even if not used (and can contain garbage), it can by accident > > create hard-to-debug situations. As discussed with Magnus+Maryam > > before, we found it was practical (and faster than mem zero) to extend > > AF_XDP descriptor (see struct xdp_desc) with some flags to > > indicate/communicate this frame comes with TX metadata hints. > > What is that "if not used" situation? Can the metadata itself have > is_used bit? The userspace has to initialize at least that bit. > We can definitely add that extra "has_metadata" bit to the descriptor, > but I'm trying to understand whether we can do without it. To me, this "has_metadata" bit in the descriptor is just an optimization. If it is 0, then there is no need to go and check the metadata field and you save some performance. Regardless of this bit, you need some way to say "is_used" for each metadata entry (at least when the number of metadata entries is >1). Three options come to mind each with their pros and cons. #1: Let each metadata entry have an invalid state. Not possible for every metadata and requires the user/kernel to go scan through every entry for every packet. #2: Have a field of bits at the start of the metadata section (closest to packet data) that signifies if a metadata entry is valid or not. 
If there are N metadata entries in the metadata area, then N bits in this field would be used to signify if the corresponding metadata is used or not. Only requires the user/kernel to scan the valid entries plus one access for the "is_used" bits. #3: Have N bits in the AF_XDP descriptor options field instead of the N bits in the metadata area of #2. Faster but would consume many precious bits in the fixed descriptor and cap the number of metadata entries possible at around 8. E.g., 8 for Rx, 8 for Tx, 1 for the multi-buffer work, and 15 for some future use. Depends on how daring we are. The "has_metadata" bit suggestion can be combined with 1 or 2. Approach 3 is just a fine grained extension of the idea itself. IMO, the best approach unfortunately depends on the metadata itself. If it is rarely valid, you want something like the "has_metadata" bit. If it is nearly always valid and used, approach #1 (if possible for the metadata) should be the fastest. The decision also depends on the number of metadata entries you have per packet. Sorry that I do not have a good answer. My feeling is that we need something like #1 or #2, or maybe both, then if needed we can add the "has_metadata" bit or bits (#3) optimization. Can we do this encoding and choice (#1, #2, or a combo) in the eBPF program itself? Would provide us with the flexibility, if possible. > > > >> > > >> BPF-prog API bpf_core_type_id_local: > > >> - [1] > > >> https://github.com/xdp-project/bpf-examples/blob/master/AF_XDP-interaction/af_xdp_kern.c#L80 > > >> > > >> Userspace API btf__find_by_name_kind: > > >> - [2] > > >> https://github.com/xdp-project/bpf-examples/blob/master/AF_XDP-interaction/lib_xsk_extend.c#L185 > > >> > > > > > > ^ permalink raw reply [flat|nested] 20+ messages in thread
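A minimal sketch of option #2 above, assuming a hypothetical two-entry layout (the struct, its names and its entries are illustrative; the patchset does not define such a struct): the per-entry valid bits live inside the fixed XDP_TX_METADATA_LEN area itself, as the bytes closest to the packet data, so only one extra access is needed to find out which entries are populated:

#include <linux/types.h>

enum tx_meta_valid_bits {
        TX_META_VALID_LAUNCH_TIME = 1U << 0,
        TX_META_VALID_CSUM        = 1U << 1,
};

/* Occupies the XDP_TX_METADATA_LEN bytes preceding tx_desc->addr. */
struct tx_meta {
        __u64 launch_time;      /* entry 0, meaningful only if its bit is set */
        __u16 csum_start;       /* entry 1 */
        __u16 csum_offset;
        __u32 valid;            /* option #2 "is_used" bits; last member, i.e.
                                 * immediately before the packet data */
};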
* [xdp-hints] Re: [RFC bpf-next v2 03/11] xsk: Support XDP_TX_METADATA_LEN 2023-06-28 8:09 ` Magnus Karlsson @ 2023-06-28 18:49 ` Stanislav Fomichev 2023-06-29 6:15 ` Magnus Karlsson 2023-06-29 11:30 ` Toke Høiland-Jørgensen 0 siblings, 2 replies; 20+ messages in thread From: Stanislav Fomichev @ 2023-06-28 18:49 UTC (permalink / raw) To: Magnus Karlsson Cc: Jesper Dangaard Brouer, brouer, bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend, kpsingh, haoluo, jolsa, Björn Töpel, Karlsson, Magnus, xdp-hints On Wed, Jun 28, 2023 at 1:09 AM Magnus Karlsson <magnus.karlsson@gmail.com> wrote: > > On Mon, 26 Jun 2023 at 19:06, Stanislav Fomichev <sdf@google.com> wrote: > > > > On Sat, Jun 24, 2023 at 2:02 AM Jesper Dangaard Brouer > > <jbrouer@redhat.com> wrote: > > > > > > > > > > > > On 23/06/2023 19.41, Stanislav Fomichev wrote: > > > > On Fri, Jun 23, 2023 at 3:24 AM Jesper Dangaard Brouer > > > > <jbrouer@redhat.com> wrote: > > > >> > > > >> > > > >> > > > >> On 22/06/2023 19.55, Stanislav Fomichev wrote: > > > >>> On Thu, Jun 22, 2023 at 2:11 AM Jesper D. Brouer <netdev@brouer.com> wrote: > > > >>>> > > > >>>> > > > >>>> This needs to be reviewed by AF_XDP maintainers Magnus and Bjørn (Cc) > > > >>>> > > > >>>> On 21/06/2023 19.02, Stanislav Fomichev wrote: > > > >>>>> For zerocopy mode, tx_desc->addr can point to the arbitrary offset > > > >>>>> and carry some TX metadata in the headroom. For copy mode, there > > > >>>>> is no way currently to populate skb metadata. > > > >>>>> > > > >>>>> Introduce new XDP_TX_METADATA_LEN that indicates how many bytes > > > >>>>> to treat as metadata. Metadata bytes come prior to tx_desc address > > > >>>>> (same as in RX case). > > > >>>> > > > >>>> From looking at the code, this introduces a socket option for this TX > > > >>>> metadata length (tx_metadata_len). > > > >>>> This implies the same fixed TX metadata size is used for all packets. > > > >>>> Maybe describe this in patch desc. > > > >>> > > > >>> I was planning to do a proper documentation page once we settle on all > > > >>> the details (similar to the one we have for rx). > > > >>> > > > >>>> What is the plan for dealing with cases that doesn't populate same/full > > > >>>> TX metadata size ? > > > >>> > > > >>> Do we need to support that? I was assuming that the TX layout would be > > > >>> fixed between the userspace and BPF. > > > >> > > > >> I hope you don't mean fixed layout, as the whole point is adding > > > >> flexibility and extensibility. > > > > > > > > I do mean a fixed layout between the userspace (af_xdp) and devtx program. > > > > At least fixed max size of the metadata. The userspace and the bpf > > > > prog can then use this fixed space to implement some flexibility > > > > (btf_ids, versioned structs, bitmasks, tlv, etc). > > > > If we were to make the metalen vary per packet, we'd have to signal > > > > its size per packet. Probably not worth it? > > > > > > Existing XDP metadata implementation also expand in a fixed/limited > > > sized memory area, but communicate size per packet in this area (also > > > for validation purposes). BUT for AF_XDP we don't have room for another > > > pointer or size in the AF_XDP descriptor (see struct xdp_desc). > > > > > > > > > > > > > >>> If every packet would have a different metadata length, it seems like > > > >>> a nightmare to parse? > > > >>> > > > >> > > > >> No parsing is really needed. We can simply use BTF IDs and type cast in > > > >> BPF-prog. 
Both BPF-prog and userspace have access to the local BTF ids, > > > >> see [1] and [2]. > > > >> > > > >> It seems we are talking slightly past each-other(?). Let me rephrase > > > >> and reframe the question, what is your *plan* for dealing with different > > > >> *types* of TX metadata. The different struct *types* will of-cause have > > > >> different sizes, but that is okay as long as they fit into the maximum > > > >> size set by this new socket option XDP_TX_METADATA_LEN. > > > >> Thus, in principle I'm fine with XSK having configured a fixed headroom > > > >> for metadata, but we need a plan for handling more than one type and > > > >> perhaps a xsk desc indicator/flag for knowing TX metadata isn't random > > > >> data ("leftover" since last time this mem was used). > > > > > > > > Yeah, I think the above correctly catches my expectation here. Some > > > > headroom is reserved via XDP_TX_METADATA_LEN and the flexibility is > > > > offloaded to the bpf program via btf_id/tlv/etc. > > > > > > > > Regarding leftover metadata: can we assume the userspace will take > > > > care of setting it up? > > > > > > > >> With this kfunc approach, then things in-principle, becomes a contract > > > >> between the "local" TX-hook BPF-prog and AF_XDP userspace. These two > > > >> components can as illustrated here [1]+[2] can coordinate based on local > > > >> BPF-prog BTF IDs. This approach works as-is today, but patchset > > > >> selftests examples don't use this and instead have a very static > > > >> approach (that people will copy-paste). > > > >> > > > >> An unsolved problem with TX-hook is that it can also get packets from > > > >> XDP_REDIRECT and even normal SKBs gets processed (right?). How does the > > > >> BPF-prog know if metadata is valid and intended to be used for e.g. > > > >> requesting the timestamp? (imagine metadata size happen to match) > > > > > > > > My assumption was the bpf program can do ifindex/netns filtering. Plus > > > > maybe check that the meta_len is the one that's expected. > > > > Will that be enough to handle XDP_REDIRECT? > > > > > > I don't think so, using the meta_len (+ ifindex/netns) to communicate > > > activation of TX hardware hints is too weak and not enough. This is an > > > implicit API for BPF-programmers to understand and can lead to implicit > > > activation. > > > > > > Think about what will happen for your AF_XDP send use-case. For > > > performance reasons AF_XDP don't zero out frame memory. Thus, meta_len > > > is fixed even if not used (and can contain garbage), it can by accident > > > create hard-to-debug situations. As discussed with Magnus+Maryam > > > before, we found it was practical (and faster than mem zero) to extend > > > AF_XDP descriptor (see struct xdp_desc) with some flags to > > > indicate/communicate this frame comes with TX metadata hints. > > > > What is that "if not used" situation? Can the metadata itself have > > is_used bit? The userspace has to initialize at least that bit. > > We can definitely add that extra "has_metadata" bit to the descriptor, > > but I'm trying to understand whether we can do without it. > > To me, this "has_metadata" bit in the descriptor is just an > optimization. If it is 0, then there is no need to go and check the > metadata field and you save some performance. Regardless of this bit, > you need some way to say "is_used" for each metadata entry (at least > when the number of metadata entries is >1). Three options come to mind > each with their pros and cons. 
> > #1: Let each metadata entry have an invalid state. Not possible for > every metadata and requires the user/kernel to go scan through every > entry for every packet. > > #2: Have a field of bits at the start of the metadata section (closest > to packet data) that signifies if a metadata entry is valid or not. If > there are N metadata entries in the metadata area, then N bits in this > field would be used to signify if the corresponding metadata is used > or not. Only requires the user/kernel to scan the valid entries plus > one access for the "is_used" bits. > > #3: Have N bits in the AF_XDP descriptor options field instead of the > N bits in the metadata area of #2. Faster but would consume many > precious bits in the fixed descriptor and cap the number of metadata > entries possible at around 8. E.g., 8 for Rx, 8 for Tx, 1 for the > multi-buffer work, and 15 for some future use. Depends on how daring > we are. > > The "has_metadata" bit suggestion can be combined with 1 or 2. > Approach 3 is just a fine grained extension of the idea itself. > > IMO, the best approach unfortunately depends on the metadata itself. > If it is rarely valid, you want something like the "has_metadata" bit. > If it is nearly always valid and used, approach #1 (if possible for > the metadata) should be the fastest. The decision also depends on the > number of metadata entries you have per packet. Sorry that I do not > have a good answer. My feeling is that we need something like #1 or > #2, or maybe both, then if needed we can add the "has_metadata" bit or > bits (#3) optimization. Can we do this encoding and choice (#1, #2, or > a combo) in the eBPF program itself? Would provide us with the > flexibility, if possible. Here is my take on it, lmk if I'm missing something: af_xdp users call this new setsockopt(XDP_TX_METADATA_LEN) when they plan to use metadata on tx. This essentially requires allocating a fixed headroom to carry the metadata. af_xdp machinery exports this fixed len into the bpf programs somehow (devtx_frame.meta_len in this series). Then it's up to the userspace and bpf program to agree on the layout. If not every packet is expected to carry the metadata, there might be some bitmask in the metadata area to indicate that. Iow, the metadata isn't interpreted by the kernel. It's up to the prog to interpret it and call appropriate kfunc to enable some offload. Jesper raises a valid point with "what about redirected packets?". But I'm not sure we need to care? Presumably the programs that do xdp_redirect will have to conform to the same metadata layout? ^ permalink raw reply [flat|nested] 20+ messages in thread
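To illustrate the userspace side of that contract, a sketch that assumes XDP_TX_METADATA_LEN was set to sizeof(struct tx_meta) from the layout sketched a few messages up (both that layout and this helper are hypothetical): the application writes its metadata immediately before the address it puts into the TX descriptor and marks the entries it actually filled:

#include <linux/if_xdp.h>

/* umem_area is the mmap'ed UMEM; addr/len describe the packet as usual. */
static void fill_tx_desc_with_launch_time(struct xdp_desc *desc,
                                          void *umem_area, __u64 addr,
                                          __u32 len, __u64 launch_time)
{
        struct tx_meta *meta =
                (struct tx_meta *)((char *)umem_area + addr - sizeof(*meta));

        meta->launch_time = launch_time;
        meta->valid = TX_META_VALID_LAUNCH_TIME;  /* csum entry left unused */

        desc->addr = addr;      /* still points at packet data; metadata sits before it */
        desc->len = len;
        desc->options = 0;
}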
* [xdp-hints] Re: [RFC bpf-next v2 03/11] xsk: Support XDP_TX_METADATA_LEN 2023-06-28 18:49 ` Stanislav Fomichev @ 2023-06-29 6:15 ` Magnus Karlsson 2023-06-29 11:30 ` Toke Høiland-Jørgensen 1 sibling, 0 replies; 20+ messages in thread From: Magnus Karlsson @ 2023-06-29 6:15 UTC (permalink / raw) To: Stanislav Fomichev Cc: Jesper Dangaard Brouer, brouer, bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend, kpsingh, haoluo, jolsa, Björn Töpel, Karlsson, Magnus, xdp-hints On Wed, 28 Jun 2023 at 20:49, Stanislav Fomichev <sdf@google.com> wrote: > > On Wed, Jun 28, 2023 at 1:09 AM Magnus Karlsson > <magnus.karlsson@gmail.com> wrote: > > > > On Mon, 26 Jun 2023 at 19:06, Stanislav Fomichev <sdf@google.com> wrote: > > > > > > On Sat, Jun 24, 2023 at 2:02 AM Jesper Dangaard Brouer > > > <jbrouer@redhat.com> wrote: > > > > > > > > > > > > > > > > On 23/06/2023 19.41, Stanislav Fomichev wrote: > > > > > On Fri, Jun 23, 2023 at 3:24 AM Jesper Dangaard Brouer > > > > > <jbrouer@redhat.com> wrote: > > > > >> > > > > >> > > > > >> > > > > >> On 22/06/2023 19.55, Stanislav Fomichev wrote: > > > > >>> On Thu, Jun 22, 2023 at 2:11 AM Jesper D. Brouer <netdev@brouer.com> wrote: > > > > >>>> > > > > >>>> > > > > >>>> This needs to be reviewed by AF_XDP maintainers Magnus and Bjørn (Cc) > > > > >>>> > > > > >>>> On 21/06/2023 19.02, Stanislav Fomichev wrote: > > > > >>>>> For zerocopy mode, tx_desc->addr can point to the arbitrary offset > > > > >>>>> and carry some TX metadata in the headroom. For copy mode, there > > > > >>>>> is no way currently to populate skb metadata. > > > > >>>>> > > > > >>>>> Introduce new XDP_TX_METADATA_LEN that indicates how many bytes > > > > >>>>> to treat as metadata. Metadata bytes come prior to tx_desc address > > > > >>>>> (same as in RX case). > > > > >>>> > > > > >>>> From looking at the code, this introduces a socket option for this TX > > > > >>>> metadata length (tx_metadata_len). > > > > >>>> This implies the same fixed TX metadata size is used for all packets. > > > > >>>> Maybe describe this in patch desc. > > > > >>> > > > > >>> I was planning to do a proper documentation page once we settle on all > > > > >>> the details (similar to the one we have for rx). > > > > >>> > > > > >>>> What is the plan for dealing with cases that doesn't populate same/full > > > > >>>> TX metadata size ? > > > > >>> > > > > >>> Do we need to support that? I was assuming that the TX layout would be > > > > >>> fixed between the userspace and BPF. > > > > >> > > > > >> I hope you don't mean fixed layout, as the whole point is adding > > > > >> flexibility and extensibility. > > > > > > > > > > I do mean a fixed layout between the userspace (af_xdp) and devtx program. > > > > > At least fixed max size of the metadata. The userspace and the bpf > > > > > prog can then use this fixed space to implement some flexibility > > > > > (btf_ids, versioned structs, bitmasks, tlv, etc). > > > > > If we were to make the metalen vary per packet, we'd have to signal > > > > > its size per packet. Probably not worth it? > > > > > > > > Existing XDP metadata implementation also expand in a fixed/limited > > > > sized memory area, but communicate size per packet in this area (also > > > > for validation purposes). BUT for AF_XDP we don't have room for another > > > > pointer or size in the AF_XDP descriptor (see struct xdp_desc). > > > > > > > > > > > > > > > > > >>> If every packet would have a different metadata length, it seems like > > > > >>> a nightmare to parse? 
> > > > >>> > > > > >> > > > > >> No parsing is really needed. We can simply use BTF IDs and type cast in > > > > >> BPF-prog. Both BPF-prog and userspace have access to the local BTF ids, > > > > >> see [1] and [2]. > > > > >> > > > > >> It seems we are talking slightly past each-other(?). Let me rephrase > > > > >> and reframe the question, what is your *plan* for dealing with different > > > > >> *types* of TX metadata. The different struct *types* will of-cause have > > > > >> different sizes, but that is okay as long as they fit into the maximum > > > > >> size set by this new socket option XDP_TX_METADATA_LEN. > > > > >> Thus, in principle I'm fine with XSK having configured a fixed headroom > > > > >> for metadata, but we need a plan for handling more than one type and > > > > >> perhaps a xsk desc indicator/flag for knowing TX metadata isn't random > > > > >> data ("leftover" since last time this mem was used). > > > > > > > > > > Yeah, I think the above correctly catches my expectation here. Some > > > > > headroom is reserved via XDP_TX_METADATA_LEN and the flexibility is > > > > > offloaded to the bpf program via btf_id/tlv/etc. > > > > > > > > > > Regarding leftover metadata: can we assume the userspace will take > > > > > care of setting it up? > > > > > > > > > >> With this kfunc approach, then things in-principle, becomes a contract > > > > >> between the "local" TX-hook BPF-prog and AF_XDP userspace. These two > > > > >> components can as illustrated here [1]+[2] can coordinate based on local > > > > >> BPF-prog BTF IDs. This approach works as-is today, but patchset > > > > >> selftests examples don't use this and instead have a very static > > > > >> approach (that people will copy-paste). > > > > >> > > > > >> An unsolved problem with TX-hook is that it can also get packets from > > > > >> XDP_REDIRECT and even normal SKBs gets processed (right?). How does the > > > > >> BPF-prog know if metadata is valid and intended to be used for e.g. > > > > >> requesting the timestamp? (imagine metadata size happen to match) > > > > > > > > > > My assumption was the bpf program can do ifindex/netns filtering. Plus > > > > > maybe check that the meta_len is the one that's expected. > > > > > Will that be enough to handle XDP_REDIRECT? > > > > > > > > I don't think so, using the meta_len (+ ifindex/netns) to communicate > > > > activation of TX hardware hints is too weak and not enough. This is an > > > > implicit API for BPF-programmers to understand and can lead to implicit > > > > activation. > > > > > > > > Think about what will happen for your AF_XDP send use-case. For > > > > performance reasons AF_XDP don't zero out frame memory. Thus, meta_len > > > > is fixed even if not used (and can contain garbage), it can by accident > > > > create hard-to-debug situations. As discussed with Magnus+Maryam > > > > before, we found it was practical (and faster than mem zero) to extend > > > > AF_XDP descriptor (see struct xdp_desc) with some flags to > > > > indicate/communicate this frame comes with TX metadata hints. > > > > > > What is that "if not used" situation? Can the metadata itself have > > > is_used bit? The userspace has to initialize at least that bit. > > > We can definitely add that extra "has_metadata" bit to the descriptor, > > > but I'm trying to understand whether we can do without it. > > > > To me, this "has_metadata" bit in the descriptor is just an > > optimization. If it is 0, then there is no need to go and check the > > metadata field and you save some performance. 
Regardless of this bit, > > you need some way to say "is_used" for each metadata entry (at least > > when the number of metadata entries is >1). Three options come to mind > > each with their pros and cons. > > > > #1: Let each metadata entry have an invalid state. Not possible for > > every metadata and requires the user/kernel to go scan through every > > entry for every packet. > > > > #2: Have a field of bits at the start of the metadata section (closest > > to packet data) that signifies if a metadata entry is valid or not. If > > there are N metadata entries in the metadata area, then N bits in this > > field would be used to signify if the corresponding metadata is used > > or not. Only requires the user/kernel to scan the valid entries plus > > one access for the "is_used" bits. > > > > #3: Have N bits in the AF_XDP descriptor options field instead of the > > N bits in the metadata area of #2. Faster but would consume many > > precious bits in the fixed descriptor and cap the number of metadata > > entries possible at around 8. E.g., 8 for Rx, 8 for Tx, 1 for the > > multi-buffer work, and 15 for some future use. Depends on how daring > > we are. > > > > The "has_metadata" bit suggestion can be combined with 1 or 2. > > Approach 3 is just a fine grained extension of the idea itself. > > > > IMO, the best approach unfortunately depends on the metadata itself. > > If it is rarely valid, you want something like the "has_metadata" bit. > > If it is nearly always valid and used, approach #1 (if possible for > > the metadata) should be the fastest. The decision also depends on the > > number of metadata entries you have per packet. Sorry that I do not > > have a good answer. My feeling is that we need something like #1 or > > #2, or maybe both, then if needed we can add the "has_metadata" bit or > > bits (#3) optimization. Can we do this encoding and choice (#1, #2, or > > a combo) in the eBPF program itself? Would provide us with the > > flexibility, if possible. > > Here is my take on it, lmk if I'm missing something: > > af_xdp users call this new setsockopt(XDP_TX_METADATA_LEN) when they > plan to use metadata on tx. > This essentially requires allocating a fixed headroom to carry the metadata. > af_xdp machinery exports this fixed len into the bpf programs somehow > (devtx_frame.meta_len in this series). > Then it's up to the userspace and bpf program to agree on the layout. > If not every packet is expected to carry the metadata, there might be > some bitmask in the metadata area to indicate that. > > Iow, the metadata isn't interpreted by the kernel. It's up to the prog > to interpret it and call appropriate kfunc to enable some offload. Sounds good. This flexibility is needed. > Jesper raises a valid point with "what about redirected packets?". But > I'm not sure we need to care? Presumably the programs that do > xdp_redirect will have to conform to the same metadata layout? ^ permalink raw reply [flat|nested] 20+ messages in thread
* [xdp-hints] Re: [RFC bpf-next v2 03/11] xsk: Support XDP_TX_METADATA_LEN 2023-06-28 18:49 ` Stanislav Fomichev 2023-06-29 6:15 ` Magnus Karlsson @ 2023-06-29 11:30 ` Toke Høiland-Jørgensen 2023-06-29 11:48 ` Magnus Karlsson 1 sibling, 1 reply; 20+ messages in thread From: Toke Høiland-Jørgensen @ 2023-06-29 11:30 UTC (permalink / raw) To: Stanislav Fomichev, Magnus Karlsson Cc: Jesper Dangaard Brouer, brouer, bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend, kpsingh, haoluo, jolsa, Björn Töpel, Karlsson, Magnus, xdp-hints Stanislav Fomichev <sdf@google.com> writes: > On Wed, Jun 28, 2023 at 1:09 AM Magnus Karlsson > <magnus.karlsson@gmail.com> wrote: >> >> On Mon, 26 Jun 2023 at 19:06, Stanislav Fomichev <sdf@google.com> wrote: >> > >> > On Sat, Jun 24, 2023 at 2:02 AM Jesper Dangaard Brouer >> > <jbrouer@redhat.com> wrote: >> > > >> > > >> > > >> > > On 23/06/2023 19.41, Stanislav Fomichev wrote: >> > > > On Fri, Jun 23, 2023 at 3:24 AM Jesper Dangaard Brouer >> > > > <jbrouer@redhat.com> wrote: >> > > >> >> > > >> >> > > >> >> > > >> On 22/06/2023 19.55, Stanislav Fomichev wrote: >> > > >>> On Thu, Jun 22, 2023 at 2:11 AM Jesper D. Brouer <netdev@brouer.com> wrote: >> > > >>>> >> > > >>>> >> > > >>>> This needs to be reviewed by AF_XDP maintainers Magnus and Bjørn (Cc) >> > > >>>> >> > > >>>> On 21/06/2023 19.02, Stanislav Fomichev wrote: >> > > >>>>> For zerocopy mode, tx_desc->addr can point to the arbitrary offset >> > > >>>>> and carry some TX metadata in the headroom. For copy mode, there >> > > >>>>> is no way currently to populate skb metadata. >> > > >>>>> >> > > >>>>> Introduce new XDP_TX_METADATA_LEN that indicates how many bytes >> > > >>>>> to treat as metadata. Metadata bytes come prior to tx_desc address >> > > >>>>> (same as in RX case). >> > > >>>> >> > > >>>> From looking at the code, this introduces a socket option for this TX >> > > >>>> metadata length (tx_metadata_len). >> > > >>>> This implies the same fixed TX metadata size is used for all packets. >> > > >>>> Maybe describe this in patch desc. >> > > >>> >> > > >>> I was planning to do a proper documentation page once we settle on all >> > > >>> the details (similar to the one we have for rx). >> > > >>> >> > > >>>> What is the plan for dealing with cases that doesn't populate same/full >> > > >>>> TX metadata size ? >> > > >>> >> > > >>> Do we need to support that? I was assuming that the TX layout would be >> > > >>> fixed between the userspace and BPF. >> > > >> >> > > >> I hope you don't mean fixed layout, as the whole point is adding >> > > >> flexibility and extensibility. >> > > > >> > > > I do mean a fixed layout between the userspace (af_xdp) and devtx program. >> > > > At least fixed max size of the metadata. The userspace and the bpf >> > > > prog can then use this fixed space to implement some flexibility >> > > > (btf_ids, versioned structs, bitmasks, tlv, etc). >> > > > If we were to make the metalen vary per packet, we'd have to signal >> > > > its size per packet. Probably not worth it? >> > > >> > > Existing XDP metadata implementation also expand in a fixed/limited >> > > sized memory area, but communicate size per packet in this area (also >> > > for validation purposes). BUT for AF_XDP we don't have room for another >> > > pointer or size in the AF_XDP descriptor (see struct xdp_desc). >> > > >> > > >> > > > >> > > >>> If every packet would have a different metadata length, it seems like >> > > >>> a nightmare to parse? 
>> > > >>> >> > > >> >> > > >> No parsing is really needed. We can simply use BTF IDs and type cast in >> > > >> BPF-prog. Both BPF-prog and userspace have access to the local BTF ids, >> > > >> see [1] and [2]. >> > > >> >> > > >> It seems we are talking slightly past each-other(?). Let me rephrase >> > > >> and reframe the question, what is your *plan* for dealing with different >> > > >> *types* of TX metadata. The different struct *types* will of-cause have >> > > >> different sizes, but that is okay as long as they fit into the maximum >> > > >> size set by this new socket option XDP_TX_METADATA_LEN. >> > > >> Thus, in principle I'm fine with XSK having configured a fixed headroom >> > > >> for metadata, but we need a plan for handling more than one type and >> > > >> perhaps a xsk desc indicator/flag for knowing TX metadata isn't random >> > > >> data ("leftover" since last time this mem was used). >> > > > >> > > > Yeah, I think the above correctly catches my expectation here. Some >> > > > headroom is reserved via XDP_TX_METADATA_LEN and the flexibility is >> > > > offloaded to the bpf program via btf_id/tlv/etc. >> > > > >> > > > Regarding leftover metadata: can we assume the userspace will take >> > > > care of setting it up? >> > > > >> > > >> With this kfunc approach, then things in-principle, becomes a contract >> > > >> between the "local" TX-hook BPF-prog and AF_XDP userspace. These two >> > > >> components can as illustrated here [1]+[2] can coordinate based on local >> > > >> BPF-prog BTF IDs. This approach works as-is today, but patchset >> > > >> selftests examples don't use this and instead have a very static >> > > >> approach (that people will copy-paste). >> > > >> >> > > >> An unsolved problem with TX-hook is that it can also get packets from >> > > >> XDP_REDIRECT and even normal SKBs gets processed (right?). How does the >> > > >> BPF-prog know if metadata is valid and intended to be used for e.g. >> > > >> requesting the timestamp? (imagine metadata size happen to match) >> > > > >> > > > My assumption was the bpf program can do ifindex/netns filtering. Plus >> > > > maybe check that the meta_len is the one that's expected. >> > > > Will that be enough to handle XDP_REDIRECT? >> > > >> > > I don't think so, using the meta_len (+ ifindex/netns) to communicate >> > > activation of TX hardware hints is too weak and not enough. This is an >> > > implicit API for BPF-programmers to understand and can lead to implicit >> > > activation. >> > > >> > > Think about what will happen for your AF_XDP send use-case. For >> > > performance reasons AF_XDP don't zero out frame memory. Thus, meta_len >> > > is fixed even if not used (and can contain garbage), it can by accident >> > > create hard-to-debug situations. As discussed with Magnus+Maryam >> > > before, we found it was practical (and faster than mem zero) to extend >> > > AF_XDP descriptor (see struct xdp_desc) with some flags to >> > > indicate/communicate this frame comes with TX metadata hints. >> > >> > What is that "if not used" situation? Can the metadata itself have >> > is_used bit? The userspace has to initialize at least that bit. >> > We can definitely add that extra "has_metadata" bit to the descriptor, >> > but I'm trying to understand whether we can do without it. >> >> To me, this "has_metadata" bit in the descriptor is just an >> optimization. If it is 0, then there is no need to go and check the >> metadata field and you save some performance. 
Regardless of this bit, >> you need some way to say "is_used" for each metadata entry (at least >> when the number of metadata entries is >1). Three options come to mind >> each with their pros and cons. >> >> #1: Let each metadata entry have an invalid state. Not possible for >> every metadata and requires the user/kernel to go scan through every >> entry for every packet. >> >> #2: Have a field of bits at the start of the metadata section (closest >> to packet data) that signifies if a metadata entry is valid or not. If >> there are N metadata entries in the metadata area, then N bits in this >> field would be used to signify if the corresponding metadata is used >> or not. Only requires the user/kernel to scan the valid entries plus >> one access for the "is_used" bits. >> >> #3: Have N bits in the AF_XDP descriptor options field instead of the >> N bits in the metadata area of #2. Faster but would consume many >> precious bits in the fixed descriptor and cap the number of metadata >> entries possible at around 8. E.g., 8 for Rx, 8 for Tx, 1 for the >> multi-buffer work, and 15 for some future use. Depends on how daring >> we are. >> >> The "has_metadata" bit suggestion can be combined with 1 or 2. >> Approach 3 is just a fine grained extension of the idea itself. >> >> IMO, the best approach unfortunately depends on the metadata itself. >> If it is rarely valid, you want something like the "has_metadata" bit. >> If it is nearly always valid and used, approach #1 (if possible for >> the metadata) should be the fastest. The decision also depends on the >> number of metadata entries you have per packet. Sorry that I do not >> have a good answer. My feeling is that we need something like #1 or >> #2, or maybe both, then if needed we can add the "has_metadata" bit or >> bits (#3) optimization. Can we do this encoding and choice (#1, #2, or >> a combo) in the eBPF program itself? Would provide us with the >> flexibility, if possible. > > Here is my take on it, lmk if I'm missing something: > > af_xdp users call this new setsockopt(XDP_TX_METADATA_LEN) when they > plan to use metadata on tx. > This essentially requires allocating a fixed headroom to carry the metadata. > af_xdp machinery exports this fixed len into the bpf programs somehow > (devtx_frame.meta_len in this series). > Then it's up to the userspace and bpf program to agree on the layout. > If not every packet is expected to carry the metadata, there might be > some bitmask in the metadata area to indicate that. > > Iow, the metadata isn't interpreted by the kernel. It's up to the prog > to interpret it and call appropriate kfunc to enable some offload. The reason for the flag on RX is mostly performance: there's a substantial performance hit from reading the metadata area because it's not cache-hot; we want to avoid that when no metadata is in use. Putting the flag inside the metadata area itself doesn't work for this, because then you incur the cache miss just to read the flag. This effect is probably most pronounced on RX (because the packet is coming in off the NIC, so very unlikely that the memory has been touched before), but I see no reason it could not also occur on TX (it'll mostly depend on data alignment I guess?). So I think having a separate "TX metadata" flag in the descriptor is probably worth it for the performance gains? > Jesper raises a valid point with "what about redirected packets?". But > I'm not sure we need to care? 
Presumably the programs that do > xdp_redirect will have to conform to the same metadata layout? I don't think they have to do anything different semantically (agreeing on the data format in the metadata area will have to be by some out of band mechanism); but the performance angle is definitely valid here as well... -Toke ^ permalink raw reply [flat|nested] 20+ messages in thread
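A sketch of the descriptor bit being argued for here. The flag name and bit position are hypothetical; only the xdp_desc.options field itself exists today, with bit 0 assumed reserved for the multi-buffer continuation flag mentioned elsewhere in the thread.

#include <stdint.h>
#include <linux/if_xdp.h>

#define XDP_TX_METADATA_FLAG	(1 << 1)	/* hypothetical options bit */

/* Producer (userspace): only flag frames whose headroom was populated. */
static void xsk_tx_fill_desc(struct xdp_desc *desc, uint64_t addr,
			     uint32_t len, int has_meta)
{
	desc->addr = addr;
	desc->len = len;
	desc->options = has_meta ? XDP_TX_METADATA_FLAG : 0;
}

/* Consumer (conceptually the kernel Tx path): the cold metadata cache line
 * is never touched when the bit is clear. */
static int xsk_tx_has_metadata(const struct xdp_desc *desc)
{
	return desc->options & XDP_TX_METADATA_FLAG;
}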
* [xdp-hints] Re: [RFC bpf-next v2 03/11] xsk: Support XDP_TX_METADATA_LEN 2023-06-29 11:30 ` Toke Høiland-Jørgensen @ 2023-06-29 11:48 ` Magnus Karlsson 2023-06-29 12:01 ` Toke Høiland-Jørgensen 0 siblings, 1 reply; 20+ messages in thread From: Magnus Karlsson @ 2023-06-29 11:48 UTC (permalink / raw) To: Toke Høiland-Jørgensen Cc: Stanislav Fomichev, Jesper Dangaard Brouer, brouer, bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend, kpsingh, haoluo, jolsa, Björn Töpel, Karlsson, Magnus, xdp-hints On Thu, 29 Jun 2023 at 13:30, Toke Høiland-Jørgensen <toke@redhat.com> wrote: > > Stanislav Fomichev <sdf@google.com> writes: > > > On Wed, Jun 28, 2023 at 1:09 AM Magnus Karlsson > > <magnus.karlsson@gmail.com> wrote: > >> > >> On Mon, 26 Jun 2023 at 19:06, Stanislav Fomichev <sdf@google.com> wrote: > >> > > >> > On Sat, Jun 24, 2023 at 2:02 AM Jesper Dangaard Brouer > >> > <jbrouer@redhat.com> wrote: > >> > > > >> > > > >> > > > >> > > On 23/06/2023 19.41, Stanislav Fomichev wrote: > >> > > > On Fri, Jun 23, 2023 at 3:24 AM Jesper Dangaard Brouer > >> > > > <jbrouer@redhat.com> wrote: > >> > > >> > >> > > >> > >> > > >> > >> > > >> On 22/06/2023 19.55, Stanislav Fomichev wrote: > >> > > >>> On Thu, Jun 22, 2023 at 2:11 AM Jesper D. Brouer <netdev@brouer.com> wrote: > >> > > >>>> > >> > > >>>> > >> > > >>>> This needs to be reviewed by AF_XDP maintainers Magnus and Bjørn (Cc) > >> > > >>>> > >> > > >>>> On 21/06/2023 19.02, Stanislav Fomichev wrote: > >> > > >>>>> For zerocopy mode, tx_desc->addr can point to the arbitrary offset > >> > > >>>>> and carry some TX metadata in the headroom. For copy mode, there > >> > > >>>>> is no way currently to populate skb metadata. > >> > > >>>>> > >> > > >>>>> Introduce new XDP_TX_METADATA_LEN that indicates how many bytes > >> > > >>>>> to treat as metadata. Metadata bytes come prior to tx_desc address > >> > > >>>>> (same as in RX case). > >> > > >>>> > >> > > >>>> From looking at the code, this introduces a socket option for this TX > >> > > >>>> metadata length (tx_metadata_len). > >> > > >>>> This implies the same fixed TX metadata size is used for all packets. > >> > > >>>> Maybe describe this in patch desc. > >> > > >>> > >> > > >>> I was planning to do a proper documentation page once we settle on all > >> > > >>> the details (similar to the one we have for rx). > >> > > >>> > >> > > >>>> What is the plan for dealing with cases that doesn't populate same/full > >> > > >>>> TX metadata size ? > >> > > >>> > >> > > >>> Do we need to support that? I was assuming that the TX layout would be > >> > > >>> fixed between the userspace and BPF. > >> > > >> > >> > > >> I hope you don't mean fixed layout, as the whole point is adding > >> > > >> flexibility and extensibility. > >> > > > > >> > > > I do mean a fixed layout between the userspace (af_xdp) and devtx program. > >> > > > At least fixed max size of the metadata. The userspace and the bpf > >> > > > prog can then use this fixed space to implement some flexibility > >> > > > (btf_ids, versioned structs, bitmasks, tlv, etc). > >> > > > If we were to make the metalen vary per packet, we'd have to signal > >> > > > its size per packet. Probably not worth it? > >> > > > >> > > Existing XDP metadata implementation also expand in a fixed/limited > >> > > sized memory area, but communicate size per packet in this area (also > >> > > for validation purposes). BUT for AF_XDP we don't have room for another > >> > > pointer or size in the AF_XDP descriptor (see struct xdp_desc). 
> >> > > > >> > > > >> > > > > >> > > >>> If every packet would have a different metadata length, it seems like > >> > > >>> a nightmare to parse? > >> > > >>> > >> > > >> > >> > > >> No parsing is really needed. We can simply use BTF IDs and type cast in > >> > > >> BPF-prog. Both BPF-prog and userspace have access to the local BTF ids, > >> > > >> see [1] and [2]. > >> > > >> > >> > > >> It seems we are talking slightly past each-other(?). Let me rephrase > >> > > >> and reframe the question, what is your *plan* for dealing with different > >> > > >> *types* of TX metadata. The different struct *types* will of-cause have > >> > > >> different sizes, but that is okay as long as they fit into the maximum > >> > > >> size set by this new socket option XDP_TX_METADATA_LEN. > >> > > >> Thus, in principle I'm fine with XSK having configured a fixed headroom > >> > > >> for metadata, but we need a plan for handling more than one type and > >> > > >> perhaps a xsk desc indicator/flag for knowing TX metadata isn't random > >> > > >> data ("leftover" since last time this mem was used). > >> > > > > >> > > > Yeah, I think the above correctly catches my expectation here. Some > >> > > > headroom is reserved via XDP_TX_METADATA_LEN and the flexibility is > >> > > > offloaded to the bpf program via btf_id/tlv/etc. > >> > > > > >> > > > Regarding leftover metadata: can we assume the userspace will take > >> > > > care of setting it up? > >> > > > > >> > > >> With this kfunc approach, then things in-principle, becomes a contract > >> > > >> between the "local" TX-hook BPF-prog and AF_XDP userspace. These two > >> > > >> components can as illustrated here [1]+[2] can coordinate based on local > >> > > >> BPF-prog BTF IDs. This approach works as-is today, but patchset > >> > > >> selftests examples don't use this and instead have a very static > >> > > >> approach (that people will copy-paste). > >> > > >> > >> > > >> An unsolved problem with TX-hook is that it can also get packets from > >> > > >> XDP_REDIRECT and even normal SKBs gets processed (right?). How does the > >> > > >> BPF-prog know if metadata is valid and intended to be used for e.g. > >> > > >> requesting the timestamp? (imagine metadata size happen to match) > >> > > > > >> > > > My assumption was the bpf program can do ifindex/netns filtering. Plus > >> > > > maybe check that the meta_len is the one that's expected. > >> > > > Will that be enough to handle XDP_REDIRECT? > >> > > > >> > > I don't think so, using the meta_len (+ ifindex/netns) to communicate > >> > > activation of TX hardware hints is too weak and not enough. This is an > >> > > implicit API for BPF-programmers to understand and can lead to implicit > >> > > activation. > >> > > > >> > > Think about what will happen for your AF_XDP send use-case. For > >> > > performance reasons AF_XDP don't zero out frame memory. Thus, meta_len > >> > > is fixed even if not used (and can contain garbage), it can by accident > >> > > create hard-to-debug situations. As discussed with Magnus+Maryam > >> > > before, we found it was practical (and faster than mem zero) to extend > >> > > AF_XDP descriptor (see struct xdp_desc) with some flags to > >> > > indicate/communicate this frame comes with TX metadata hints. > >> > > >> > What is that "if not used" situation? Can the metadata itself have > >> > is_used bit? The userspace has to initialize at least that bit. 
> >> > We can definitely add that extra "has_metadata" bit to the descriptor, > >> > but I'm trying to understand whether we can do without it. > >> > >> To me, this "has_metadata" bit in the descriptor is just an > >> optimization. If it is 0, then there is no need to go and check the > >> metadata field and you save some performance. Regardless of this bit, > >> you need some way to say "is_used" for each metadata entry (at least > >> when the number of metadata entries is >1). Three options come to mind > >> each with their pros and cons. > >> > >> #1: Let each metadata entry have an invalid state. Not possible for > >> every metadata and requires the user/kernel to go scan through every > >> entry for every packet. > >> > >> #2: Have a field of bits at the start of the metadata section (closest > >> to packet data) that signifies if a metadata entry is valid or not. If > >> there are N metadata entries in the metadata area, then N bits in this > >> field would be used to signify if the corresponding metadata is used > >> or not. Only requires the user/kernel to scan the valid entries plus > >> one access for the "is_used" bits. > >> > >> #3: Have N bits in the AF_XDP descriptor options field instead of the > >> N bits in the metadata area of #2. Faster but would consume many > >> precious bits in the fixed descriptor and cap the number of metadata > >> entries possible at around 8. E.g., 8 for Rx, 8 for Tx, 1 for the > >> multi-buffer work, and 15 for some future use. Depends on how daring > >> we are. > >> > >> The "has_metadata" bit suggestion can be combined with 1 or 2. > >> Approach 3 is just a fine grained extension of the idea itself. > >> > >> IMO, the best approach unfortunately depends on the metadata itself. > >> If it is rarely valid, you want something like the "has_metadata" bit. > >> If it is nearly always valid and used, approach #1 (if possible for > >> the metadata) should be the fastest. The decision also depends on the > >> number of metadata entries you have per packet. Sorry that I do not > >> have a good answer. My feeling is that we need something like #1 or > >> #2, or maybe both, then if needed we can add the "has_metadata" bit or > >> bits (#3) optimization. Can we do this encoding and choice (#1, #2, or > >> a combo) in the eBPF program itself? Would provide us with the > >> flexibility, if possible. > > > > Here is my take on it, lmk if I'm missing something: > > > > af_xdp users call this new setsockopt(XDP_TX_METADATA_LEN) when they > > plan to use metadata on tx. > > This essentially requires allocating a fixed headroom to carry the metadata. > > af_xdp machinery exports this fixed len into the bpf programs somehow > > (devtx_frame.meta_len in this series). > > Then it's up to the userspace and bpf program to agree on the layout. > > If not every packet is expected to carry the metadata, there might be > > some bitmask in the metadata area to indicate that. > > > > Iow, the metadata isn't interpreted by the kernel. It's up to the prog > > to interpret it and call appropriate kfunc to enable some offload. > > The reason for the flag on RX is mostly performance: there's a > substantial performance hit from reading the metadata area because it's > not cache-hot; we want to avoid that when no metadata is in use. Putting > the flag inside the metadata area itself doesn't work for this, because > then you incur the cache miss just to read the flag. Not necessarily. Let us say that the flag is 4 bytes. 
Increase the start address of the packet buffer with 4 and the flags field will be on the same cache line as the first 60 bytes of the packet data (assuming a 64 byte cache line size and the flags field is closest to the start of the packet data). As long as you write something in those first 60 bytes of packet data that cache line will be brought in and will likely be in the cache when you access the bits in the metadata field. The trick works similarly for Rx by setting the umem headroom accordingly. But you are correct in that dedicating a bit in the descriptor will make sure it is always hot, while the trick above is dependent on the app wanting to read or write the first cache line worth of packet data. > This effect is probably most pronounced on RX (because the packet is > coming in off the NIC, so very unlikely that the memory has been touched > before), but I see no reason it could not also occur on TX (it'll mostly > depend on data alignment I guess?). So I think having a separate "TX > metadata" flag in the descriptor is probably worth it for the > performance gains? > > > Jesper raises a valid point with "what about redirected packets?". But > > I'm not sure we need to care? Presumably the programs that do > > xdp_redirect will have to conform to the same metadata layout? > > I don't think they have to do anything different semantically (agreeing > on the data format in the metadata area will have to be by some out of > band mechanism); but the performance angle is definitely valid here as > well... > > -Toke > ^ permalink raw reply [flat|nested] 20+ messages in thread
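Roughly, the alignment trick described above, assuming 64-byte cache lines and a 4-byte flags word placed last in the metadata area (closest to packet data); the numbers and the helper are illustrative only.

#include <linux/types.h>

#define CACHELINE_SZ	64
#define FLAGS_SZ	4

/*
 *            desc->addr - 4    desc->addr
 *                   |              |
 *  ... metadata ... | flags (4 B)  | packet data (first 60 B) | ...
 *                   \________ one 64 B cache line ___________/
 *
 * Choosing the buffer start (Tx) or the umem headroom (Rx) so that
 * desc->addr % 64 == 4 means writing or reading the packet headers pulls
 * the flags word into the cache as a side effect.
 */
static inline int flags_share_cacheline_with_data(__u64 addr)
{
	return (addr % CACHELINE_SZ) == FLAGS_SZ;
}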
* [xdp-hints] Re: [RFC bpf-next v2 03/11] xsk: Support XDP_TX_METADATA_LEN 2023-06-29 11:48 ` Magnus Karlsson @ 2023-06-29 12:01 ` Toke Høiland-Jørgensen 2023-06-29 16:21 ` Stanislav Fomichev 2023-06-30 6:22 ` Magnus Karlsson 0 siblings, 2 replies; 20+ messages in thread From: Toke Høiland-Jørgensen @ 2023-06-29 12:01 UTC (permalink / raw) To: Magnus Karlsson Cc: Stanislav Fomichev, Jesper Dangaard Brouer, brouer, bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend, kpsingh, haoluo, jolsa, Björn Töpel, Karlsson, Magnus, xdp-hints Magnus Karlsson <magnus.karlsson@gmail.com> writes: > On Thu, 29 Jun 2023 at 13:30, Toke Høiland-Jørgensen <toke@redhat.com> wrote: >> >> Stanislav Fomichev <sdf@google.com> writes: >> >> > On Wed, Jun 28, 2023 at 1:09 AM Magnus Karlsson >> > <magnus.karlsson@gmail.com> wrote: >> >> >> >> On Mon, 26 Jun 2023 at 19:06, Stanislav Fomichev <sdf@google.com> wrote: >> >> > >> >> > On Sat, Jun 24, 2023 at 2:02 AM Jesper Dangaard Brouer >> >> > <jbrouer@redhat.com> wrote: >> >> > > >> >> > > >> >> > > >> >> > > On 23/06/2023 19.41, Stanislav Fomichev wrote: >> >> > > > On Fri, Jun 23, 2023 at 3:24 AM Jesper Dangaard Brouer >> >> > > > <jbrouer@redhat.com> wrote: >> >> > > >> >> >> > > >> >> >> > > >> >> >> > > >> On 22/06/2023 19.55, Stanislav Fomichev wrote: >> >> > > >>> On Thu, Jun 22, 2023 at 2:11 AM Jesper D. Brouer <netdev@brouer.com> wrote: >> >> > > >>>> >> >> > > >>>> >> >> > > >>>> This needs to be reviewed by AF_XDP maintainers Magnus and Bjørn (Cc) >> >> > > >>>> >> >> > > >>>> On 21/06/2023 19.02, Stanislav Fomichev wrote: >> >> > > >>>>> For zerocopy mode, tx_desc->addr can point to the arbitrary offset >> >> > > >>>>> and carry some TX metadata in the headroom. For copy mode, there >> >> > > >>>>> is no way currently to populate skb metadata. >> >> > > >>>>> >> >> > > >>>>> Introduce new XDP_TX_METADATA_LEN that indicates how many bytes >> >> > > >>>>> to treat as metadata. Metadata bytes come prior to tx_desc address >> >> > > >>>>> (same as in RX case). >> >> > > >>>> >> >> > > >>>> From looking at the code, this introduces a socket option for this TX >> >> > > >>>> metadata length (tx_metadata_len). >> >> > > >>>> This implies the same fixed TX metadata size is used for all packets. >> >> > > >>>> Maybe describe this in patch desc. >> >> > > >>> >> >> > > >>> I was planning to do a proper documentation page once we settle on all >> >> > > >>> the details (similar to the one we have for rx). >> >> > > >>> >> >> > > >>>> What is the plan for dealing with cases that doesn't populate same/full >> >> > > >>>> TX metadata size ? >> >> > > >>> >> >> > > >>> Do we need to support that? I was assuming that the TX layout would be >> >> > > >>> fixed between the userspace and BPF. >> >> > > >> >> >> > > >> I hope you don't mean fixed layout, as the whole point is adding >> >> > > >> flexibility and extensibility. >> >> > > > >> >> > > > I do mean a fixed layout between the userspace (af_xdp) and devtx program. >> >> > > > At least fixed max size of the metadata. The userspace and the bpf >> >> > > > prog can then use this fixed space to implement some flexibility >> >> > > > (btf_ids, versioned structs, bitmasks, tlv, etc). >> >> > > > If we were to make the metalen vary per packet, we'd have to signal >> >> > > > its size per packet. Probably not worth it? 
>> >> > > >> >> > > Existing XDP metadata implementation also expand in a fixed/limited >> >> > > sized memory area, but communicate size per packet in this area (also >> >> > > for validation purposes). BUT for AF_XDP we don't have room for another >> >> > > pointer or size in the AF_XDP descriptor (see struct xdp_desc). >> >> > > >> >> > > >> >> > > > >> >> > > >>> If every packet would have a different metadata length, it seems like >> >> > > >>> a nightmare to parse? >> >> > > >>> >> >> > > >> >> >> > > >> No parsing is really needed. We can simply use BTF IDs and type cast in >> >> > > >> BPF-prog. Both BPF-prog and userspace have access to the local BTF ids, >> >> > > >> see [1] and [2]. >> >> > > >> >> >> > > >> It seems we are talking slightly past each-other(?). Let me rephrase >> >> > > >> and reframe the question, what is your *plan* for dealing with different >> >> > > >> *types* of TX metadata. The different struct *types* will of-cause have >> >> > > >> different sizes, but that is okay as long as they fit into the maximum >> >> > > >> size set by this new socket option XDP_TX_METADATA_LEN. >> >> > > >> Thus, in principle I'm fine with XSK having configured a fixed headroom >> >> > > >> for metadata, but we need a plan for handling more than one type and >> >> > > >> perhaps a xsk desc indicator/flag for knowing TX metadata isn't random >> >> > > >> data ("leftover" since last time this mem was used). >> >> > > > >> >> > > > Yeah, I think the above correctly catches my expectation here. Some >> >> > > > headroom is reserved via XDP_TX_METADATA_LEN and the flexibility is >> >> > > > offloaded to the bpf program via btf_id/tlv/etc. >> >> > > > >> >> > > > Regarding leftover metadata: can we assume the userspace will take >> >> > > > care of setting it up? >> >> > > > >> >> > > >> With this kfunc approach, then things in-principle, becomes a contract >> >> > > >> between the "local" TX-hook BPF-prog and AF_XDP userspace. These two >> >> > > >> components can as illustrated here [1]+[2] can coordinate based on local >> >> > > >> BPF-prog BTF IDs. This approach works as-is today, but patchset >> >> > > >> selftests examples don't use this and instead have a very static >> >> > > >> approach (that people will copy-paste). >> >> > > >> >> >> > > >> An unsolved problem with TX-hook is that it can also get packets from >> >> > > >> XDP_REDIRECT and even normal SKBs gets processed (right?). How does the >> >> > > >> BPF-prog know if metadata is valid and intended to be used for e.g. >> >> > > >> requesting the timestamp? (imagine metadata size happen to match) >> >> > > > >> >> > > > My assumption was the bpf program can do ifindex/netns filtering. Plus >> >> > > > maybe check that the meta_len is the one that's expected. >> >> > > > Will that be enough to handle XDP_REDIRECT? >> >> > > >> >> > > I don't think so, using the meta_len (+ ifindex/netns) to communicate >> >> > > activation of TX hardware hints is too weak and not enough. This is an >> >> > > implicit API for BPF-programmers to understand and can lead to implicit >> >> > > activation. >> >> > > >> >> > > Think about what will happen for your AF_XDP send use-case. For >> >> > > performance reasons AF_XDP don't zero out frame memory. Thus, meta_len >> >> > > is fixed even if not used (and can contain garbage), it can by accident >> >> > > create hard-to-debug situations. 
As discussed with Magnus+Maryam >> >> > > before, we found it was practical (and faster than mem zero) to extend >> >> > > AF_XDP descriptor (see struct xdp_desc) with some flags to >> >> > > indicate/communicate this frame comes with TX metadata hints. >> >> > >> >> > What is that "if not used" situation? Can the metadata itself have >> >> > is_used bit? The userspace has to initialize at least that bit. >> >> > We can definitely add that extra "has_metadata" bit to the descriptor, >> >> > but I'm trying to understand whether we can do without it. >> >> >> >> To me, this "has_metadata" bit in the descriptor is just an >> >> optimization. If it is 0, then there is no need to go and check the >> >> metadata field and you save some performance. Regardless of this bit, >> >> you need some way to say "is_used" for each metadata entry (at least >> >> when the number of metadata entries is >1). Three options come to mind >> >> each with their pros and cons. >> >> >> >> #1: Let each metadata entry have an invalid state. Not possible for >> >> every metadata and requires the user/kernel to go scan through every >> >> entry for every packet. >> >> >> >> #2: Have a field of bits at the start of the metadata section (closest >> >> to packet data) that signifies if a metadata entry is valid or not. If >> >> there are N metadata entries in the metadata area, then N bits in this >> >> field would be used to signify if the corresponding metadata is used >> >> or not. Only requires the user/kernel to scan the valid entries plus >> >> one access for the "is_used" bits. >> >> >> >> #3: Have N bits in the AF_XDP descriptor options field instead of the >> >> N bits in the metadata area of #2. Faster but would consume many >> >> precious bits in the fixed descriptor and cap the number of metadata >> >> entries possible at around 8. E.g., 8 for Rx, 8 for Tx, 1 for the >> >> multi-buffer work, and 15 for some future use. Depends on how daring >> >> we are. >> >> >> >> The "has_metadata" bit suggestion can be combined with 1 or 2. >> >> Approach 3 is just a fine grained extension of the idea itself. >> >> >> >> IMO, the best approach unfortunately depends on the metadata itself. >> >> If it is rarely valid, you want something like the "has_metadata" bit. >> >> If it is nearly always valid and used, approach #1 (if possible for >> >> the metadata) should be the fastest. The decision also depends on the >> >> number of metadata entries you have per packet. Sorry that I do not >> >> have a good answer. My feeling is that we need something like #1 or >> >> #2, or maybe both, then if needed we can add the "has_metadata" bit or >> >> bits (#3) optimization. Can we do this encoding and choice (#1, #2, or >> >> a combo) in the eBPF program itself? Would provide us with the >> >> flexibility, if possible. >> > >> > Here is my take on it, lmk if I'm missing something: >> > >> > af_xdp users call this new setsockopt(XDP_TX_METADATA_LEN) when they >> > plan to use metadata on tx. >> > This essentially requires allocating a fixed headroom to carry the metadata. >> > af_xdp machinery exports this fixed len into the bpf programs somehow >> > (devtx_frame.meta_len in this series). >> > Then it's up to the userspace and bpf program to agree on the layout. >> > If not every packet is expected to carry the metadata, there might be >> > some bitmask in the metadata area to indicate that. >> > >> > Iow, the metadata isn't interpreted by the kernel. 
It's up to the prog >> > to interpret it and call appropriate kfunc to enable some offload. >> >> The reason for the flag on RX is mostly performance: there's a >> substantial performance hit from reading the metadata area because it's >> not cache-hot; we want to avoid that when no metadata is in use. Putting >> the flag inside the metadata area itself doesn't work for this, because >> then you incur the cache miss just to read the flag. > > Not necessarily. Let us say that the flag is 4 bytes. Increase the > start address of the packet buffer with 4 and the flags field will be > on the same cache line as the first 60 bytes of the packet data > (assuming a 64 byte cache line size and the flags field is closest to > the start of the packet data). As long as you write something in those > first 60 bytes of packet data that cache line will be brought in and > will likely be in the cache when you access the bits in the metadata > field. The trick works similarly for Rx by setting the umem headroom > accordingly. Yeah, a trick like that was what I was alluding to with the "could" in this bit: >> but I see no reason it could not also occur on TX (it'll mostly >> depend on data alignment I guess?). right below the text you quoted ;) > But you are correct in that dedicating a bit in the descriptor will > make sure it is always hot, while the trick above is dependent on the > app wanting to read or write the first cache line worth of packet > data. Exactly; which is why I think it's worth the flag bit :) -Toke ^ permalink raw reply [flat|nested] 20+ messages in thread
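For comparison, a sketch of option #2 from earlier in the thread: the per-entry "is_used" bits live inside the metadata area itself, placed closest to the packet data so they can benefit from the cache-line trick above. The individual entries are invented for illustration.

#include <linux/types.h>

struct xsk_tx_meta {
	__u64 tx_timestamp;	/* illustrative entry 0, filled in on completion */
	__u16 csum_start;	/* illustrative entry 1 */
	__u16 csum_offset;
	__u32 valid;		/* bit N set => entry N is used/requested */
};

#define TX_META_TSTAMP_VALID	(1 << 0)
#define TX_META_CSUM_VALID	(1 << 1)

Option #3 moves those N bits into xdp_desc.options instead, which is always cache-hot but burns scarce descriptor bits; the "has_metadata" descriptor flag can be layered on top of either variant.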
* [xdp-hints] Re: [RFC bpf-next v2 03/11] xsk: Support XDP_TX_METADATA_LEN 2023-06-29 12:01 ` Toke Høiland-Jørgensen @ 2023-06-29 16:21 ` Stanislav Fomichev 2023-06-29 20:58 ` Toke Høiland-Jørgensen 2023-06-30 6:22 ` Magnus Karlsson 1 sibling, 1 reply; 20+ messages in thread From: Stanislav Fomichev @ 2023-06-29 16:21 UTC (permalink / raw) To: Toke Høiland-Jørgensen Cc: Magnus Karlsson, Jesper Dangaard Brouer, brouer, bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend, kpsingh, haoluo, jolsa, Björn Töpel, Karlsson, Magnus, xdp-hints On Thu, Jun 29, 2023 at 5:01 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote: > > Magnus Karlsson <magnus.karlsson@gmail.com> writes: > > > On Thu, 29 Jun 2023 at 13:30, Toke Høiland-Jørgensen <toke@redhat.com> wrote: > >> > >> Stanislav Fomichev <sdf@google.com> writes: > >> > >> > On Wed, Jun 28, 2023 at 1:09 AM Magnus Karlsson > >> > <magnus.karlsson@gmail.com> wrote: > >> >> > >> >> On Mon, 26 Jun 2023 at 19:06, Stanislav Fomichev <sdf@google.com> wrote: > >> >> > > >> >> > On Sat, Jun 24, 2023 at 2:02 AM Jesper Dangaard Brouer > >> >> > <jbrouer@redhat.com> wrote: > >> >> > > > >> >> > > > >> >> > > > >> >> > > On 23/06/2023 19.41, Stanislav Fomichev wrote: > >> >> > > > On Fri, Jun 23, 2023 at 3:24 AM Jesper Dangaard Brouer > >> >> > > > <jbrouer@redhat.com> wrote: > >> >> > > >> > >> >> > > >> > >> >> > > >> > >> >> > > >> On 22/06/2023 19.55, Stanislav Fomichev wrote: > >> >> > > >>> On Thu, Jun 22, 2023 at 2:11 AM Jesper D. Brouer <netdev@brouer.com> wrote: > >> >> > > >>>> > >> >> > > >>>> > >> >> > > >>>> This needs to be reviewed by AF_XDP maintainers Magnus and Bjørn (Cc) > >> >> > > >>>> > >> >> > > >>>> On 21/06/2023 19.02, Stanislav Fomichev wrote: > >> >> > > >>>>> For zerocopy mode, tx_desc->addr can point to the arbitrary offset > >> >> > > >>>>> and carry some TX metadata in the headroom. For copy mode, there > >> >> > > >>>>> is no way currently to populate skb metadata. > >> >> > > >>>>> > >> >> > > >>>>> Introduce new XDP_TX_METADATA_LEN that indicates how many bytes > >> >> > > >>>>> to treat as metadata. Metadata bytes come prior to tx_desc address > >> >> > > >>>>> (same as in RX case). > >> >> > > >>>> > >> >> > > >>>> From looking at the code, this introduces a socket option for this TX > >> >> > > >>>> metadata length (tx_metadata_len). > >> >> > > >>>> This implies the same fixed TX metadata size is used for all packets. > >> >> > > >>>> Maybe describe this in patch desc. > >> >> > > >>> > >> >> > > >>> I was planning to do a proper documentation page once we settle on all > >> >> > > >>> the details (similar to the one we have for rx). > >> >> > > >>> > >> >> > > >>>> What is the plan for dealing with cases that doesn't populate same/full > >> >> > > >>>> TX metadata size ? > >> >> > > >>> > >> >> > > >>> Do we need to support that? I was assuming that the TX layout would be > >> >> > > >>> fixed between the userspace and BPF. > >> >> > > >> > >> >> > > >> I hope you don't mean fixed layout, as the whole point is adding > >> >> > > >> flexibility and extensibility. > >> >> > > > > >> >> > > > I do mean a fixed layout between the userspace (af_xdp) and devtx program. > >> >> > > > At least fixed max size of the metadata. The userspace and the bpf > >> >> > > > prog can then use this fixed space to implement some flexibility > >> >> > > > (btf_ids, versioned structs, bitmasks, tlv, etc). > >> >> > > > If we were to make the metalen vary per packet, we'd have to signal > >> >> > > > its size per packet. 
Probably not worth it? > >> >> > > > >> >> > > Existing XDP metadata implementation also expand in a fixed/limited > >> >> > > sized memory area, but communicate size per packet in this area (also > >> >> > > for validation purposes). BUT for AF_XDP we don't have room for another > >> >> > > pointer or size in the AF_XDP descriptor (see struct xdp_desc). > >> >> > > > >> >> > > > >> >> > > > > >> >> > > >>> If every packet would have a different metadata length, it seems like > >> >> > > >>> a nightmare to parse? > >> >> > > >>> > >> >> > > >> > >> >> > > >> No parsing is really needed. We can simply use BTF IDs and type cast in > >> >> > > >> BPF-prog. Both BPF-prog and userspace have access to the local BTF ids, > >> >> > > >> see [1] and [2]. > >> >> > > >> > >> >> > > >> It seems we are talking slightly past each-other(?). Let me rephrase > >> >> > > >> and reframe the question, what is your *plan* for dealing with different > >> >> > > >> *types* of TX metadata. The different struct *types* will of-cause have > >> >> > > >> different sizes, but that is okay as long as they fit into the maximum > >> >> > > >> size set by this new socket option XDP_TX_METADATA_LEN. > >> >> > > >> Thus, in principle I'm fine with XSK having configured a fixed headroom > >> >> > > >> for metadata, but we need a plan for handling more than one type and > >> >> > > >> perhaps a xsk desc indicator/flag for knowing TX metadata isn't random > >> >> > > >> data ("leftover" since last time this mem was used). > >> >> > > > > >> >> > > > Yeah, I think the above correctly catches my expectation here. Some > >> >> > > > headroom is reserved via XDP_TX_METADATA_LEN and the flexibility is > >> >> > > > offloaded to the bpf program via btf_id/tlv/etc. > >> >> > > > > >> >> > > > Regarding leftover metadata: can we assume the userspace will take > >> >> > > > care of setting it up? > >> >> > > > > >> >> > > >> With this kfunc approach, then things in-principle, becomes a contract > >> >> > > >> between the "local" TX-hook BPF-prog and AF_XDP userspace. These two > >> >> > > >> components can as illustrated here [1]+[2] can coordinate based on local > >> >> > > >> BPF-prog BTF IDs. This approach works as-is today, but patchset > >> >> > > >> selftests examples don't use this and instead have a very static > >> >> > > >> approach (that people will copy-paste). > >> >> > > >> > >> >> > > >> An unsolved problem with TX-hook is that it can also get packets from > >> >> > > >> XDP_REDIRECT and even normal SKBs gets processed (right?). How does the > >> >> > > >> BPF-prog know if metadata is valid and intended to be used for e.g. > >> >> > > >> requesting the timestamp? (imagine metadata size happen to match) > >> >> > > > > >> >> > > > My assumption was the bpf program can do ifindex/netns filtering. Plus > >> >> > > > maybe check that the meta_len is the one that's expected. > >> >> > > > Will that be enough to handle XDP_REDIRECT? > >> >> > > > >> >> > > I don't think so, using the meta_len (+ ifindex/netns) to communicate > >> >> > > activation of TX hardware hints is too weak and not enough. This is an > >> >> > > implicit API for BPF-programmers to understand and can lead to implicit > >> >> > > activation. > >> >> > > > >> >> > > Think about what will happen for your AF_XDP send use-case. For > >> >> > > performance reasons AF_XDP don't zero out frame memory. Thus, meta_len > >> >> > > is fixed even if not used (and can contain garbage), it can by accident > >> >> > > create hard-to-debug situations. 
As discussed with Magnus+Maryam > >> >> > > before, we found it was practical (and faster than mem zero) to extend > >> >> > > AF_XDP descriptor (see struct xdp_desc) with some flags to > >> >> > > indicate/communicate this frame comes with TX metadata hints. > >> >> > > >> >> > What is that "if not used" situation? Can the metadata itself have > >> >> > is_used bit? The userspace has to initialize at least that bit. > >> >> > We can definitely add that extra "has_metadata" bit to the descriptor, > >> >> > but I'm trying to understand whether we can do without it. > >> >> > >> >> To me, this "has_metadata" bit in the descriptor is just an > >> >> optimization. If it is 0, then there is no need to go and check the > >> >> metadata field and you save some performance. Regardless of this bit, > >> >> you need some way to say "is_used" for each metadata entry (at least > >> >> when the number of metadata entries is >1). Three options come to mind > >> >> each with their pros and cons. > >> >> > >> >> #1: Let each metadata entry have an invalid state. Not possible for > >> >> every metadata and requires the user/kernel to go scan through every > >> >> entry for every packet. > >> >> > >> >> #2: Have a field of bits at the start of the metadata section (closest > >> >> to packet data) that signifies if a metadata entry is valid or not. If > >> >> there are N metadata entries in the metadata area, then N bits in this > >> >> field would be used to signify if the corresponding metadata is used > >> >> or not. Only requires the user/kernel to scan the valid entries plus > >> >> one access for the "is_used" bits. > >> >> > >> >> #3: Have N bits in the AF_XDP descriptor options field instead of the > >> >> N bits in the metadata area of #2. Faster but would consume many > >> >> precious bits in the fixed descriptor and cap the number of metadata > >> >> entries possible at around 8. E.g., 8 for Rx, 8 for Tx, 1 for the > >> >> multi-buffer work, and 15 for some future use. Depends on how daring > >> >> we are. > >> >> > >> >> The "has_metadata" bit suggestion can be combined with 1 or 2. > >> >> Approach 3 is just a fine grained extension of the idea itself. > >> >> > >> >> IMO, the best approach unfortunately depends on the metadata itself. > >> >> If it is rarely valid, you want something like the "has_metadata" bit. > >> >> If it is nearly always valid and used, approach #1 (if possible for > >> >> the metadata) should be the fastest. The decision also depends on the > >> >> number of metadata entries you have per packet. Sorry that I do not > >> >> have a good answer. My feeling is that we need something like #1 or > >> >> #2, or maybe both, then if needed we can add the "has_metadata" bit or > >> >> bits (#3) optimization. Can we do this encoding and choice (#1, #2, or > >> >> a combo) in the eBPF program itself? Would provide us with the > >> >> flexibility, if possible. > >> > > >> > Here is my take on it, lmk if I'm missing something: > >> > > >> > af_xdp users call this new setsockopt(XDP_TX_METADATA_LEN) when they > >> > plan to use metadata on tx. > >> > This essentially requires allocating a fixed headroom to carry the metadata. > >> > af_xdp machinery exports this fixed len into the bpf programs somehow > >> > (devtx_frame.meta_len in this series). > >> > Then it's up to the userspace and bpf program to agree on the layout. > >> > If not every packet is expected to carry the metadata, there might be > >> > some bitmask in the metadata area to indicate that. 
> >> > > >> > Iow, the metadata isn't interpreted by the kernel. It's up to the prog > >> > to interpret it and call appropriate kfunc to enable some offload. > >> > >> The reason for the flag on RX is mostly performance: there's a > >> substantial performance hit from reading the metadata area because it's > >> not cache-hot; we want to avoid that when no metadata is in use. Putting > >> the flag inside the metadata area itself doesn't work for this, because > >> then you incur the cache miss just to read the flag. > > > > Not necessarily. Let us say that the flag is 4 bytes. Increase the > > start address of the packet buffer with 4 and the flags field will be > > on the same cache line as the first 60 bytes of the packet data > > (assuming a 64 byte cache line size and the flags field is closest to > > the start of the packet data). As long as you write something in those > > first 60 bytes of packet data that cache line will be brought in and > > will likely be in the cache when you access the bits in the metadata > > field. The trick works similarly for Rx by setting the umem headroom > > accordingly. > > Yeah, a trick like that was what I was alluding to with the "could" in > this bit: > > >> but I see no reason it could not also occur on TX (it'll mostly > >> depend on data alignment I guess?). > > right below the text you quoted ;) > > > But you are correct in that dedicating a bit in the descriptor will > > make sure it is always hot, while the trick above is dependent on the > > app wanting to read or write the first cache line worth of packet > > data. > > Exactly; which is why I think it's worth the flag bit :) Ack. Let me add this to the list of things to follow up on. I'm assuming it's fair to start without the flag and add it later as a performance optimization? We have a fair bit of core things we need to agree on first :-D > -Toke > ^ permalink raw reply [flat|nested] 20+ messages in thread
* [xdp-hints] Re: [RFC bpf-next v2 03/11] xsk: Support XDP_TX_METADATA_LEN 2023-06-29 16:21 ` Stanislav Fomichev @ 2023-06-29 20:58 ` Toke Høiland-Jørgensen 0 siblings, 0 replies; 20+ messages in thread From: Toke Høiland-Jørgensen @ 2023-06-29 20:58 UTC (permalink / raw) To: Stanislav Fomichev Cc: Magnus Karlsson, Jesper Dangaard Brouer, brouer, bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend, kpsingh, haoluo, jolsa, Björn Töpel, Karlsson, Magnus, xdp-hints Stanislav Fomichev <sdf@google.com> writes: > On Thu, Jun 29, 2023 at 5:01 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote: >> >> Magnus Karlsson <magnus.karlsson@gmail.com> writes: >> >> > On Thu, 29 Jun 2023 at 13:30, Toke Høiland-Jørgensen <toke@redhat.com> wrote: >> >> >> >> Stanislav Fomichev <sdf@google.com> writes: >> >> >> >> > On Wed, Jun 28, 2023 at 1:09 AM Magnus Karlsson >> >> > <magnus.karlsson@gmail.com> wrote: >> >> >> >> >> >> On Mon, 26 Jun 2023 at 19:06, Stanislav Fomichev <sdf@google.com> wrote: >> >> >> > >> >> >> > On Sat, Jun 24, 2023 at 2:02 AM Jesper Dangaard Brouer >> >> >> > <jbrouer@redhat.com> wrote: >> >> >> > > >> >> >> > > >> >> >> > > >> >> >> > > On 23/06/2023 19.41, Stanislav Fomichev wrote: >> >> >> > > > On Fri, Jun 23, 2023 at 3:24 AM Jesper Dangaard Brouer >> >> >> > > > <jbrouer@redhat.com> wrote: >> >> >> > > >> >> >> >> > > >> >> >> >> > > >> >> >> >> > > >> On 22/06/2023 19.55, Stanislav Fomichev wrote: >> >> >> > > >>> On Thu, Jun 22, 2023 at 2:11 AM Jesper D. Brouer <netdev@brouer.com> wrote: >> >> >> > > >>>> >> >> >> > > >>>> >> >> >> > > >>>> This needs to be reviewed by AF_XDP maintainers Magnus and Bjørn (Cc) >> >> >> > > >>>> >> >> >> > > >>>> On 21/06/2023 19.02, Stanislav Fomichev wrote: >> >> >> > > >>>>> For zerocopy mode, tx_desc->addr can point to the arbitrary offset >> >> >> > > >>>>> and carry some TX metadata in the headroom. For copy mode, there >> >> >> > > >>>>> is no way currently to populate skb metadata. >> >> >> > > >>>>> >> >> >> > > >>>>> Introduce new XDP_TX_METADATA_LEN that indicates how many bytes >> >> >> > > >>>>> to treat as metadata. Metadata bytes come prior to tx_desc address >> >> >> > > >>>>> (same as in RX case). >> >> >> > > >>>> >> >> >> > > >>>> From looking at the code, this introduces a socket option for this TX >> >> >> > > >>>> metadata length (tx_metadata_len). >> >> >> > > >>>> This implies the same fixed TX metadata size is used for all packets. >> >> >> > > >>>> Maybe describe this in patch desc. >> >> >> > > >>> >> >> >> > > >>> I was planning to do a proper documentation page once we settle on all >> >> >> > > >>> the details (similar to the one we have for rx). >> >> >> > > >>> >> >> >> > > >>>> What is the plan for dealing with cases that doesn't populate same/full >> >> >> > > >>>> TX metadata size ? >> >> >> > > >>> >> >> >> > > >>> Do we need to support that? I was assuming that the TX layout would be >> >> >> > > >>> fixed between the userspace and BPF. >> >> >> > > >> >> >> >> > > >> I hope you don't mean fixed layout, as the whole point is adding >> >> >> > > >> flexibility and extensibility. >> >> >> > > > >> >> >> > > > I do mean a fixed layout between the userspace (af_xdp) and devtx program. >> >> >> > > > At least fixed max size of the metadata. The userspace and the bpf >> >> >> > > > prog can then use this fixed space to implement some flexibility >> >> >> > > > (btf_ids, versioned structs, bitmasks, tlv, etc). 
>> >> >> > > > If we were to make the metalen vary per packet, we'd have to signal >> >> >> > > > its size per packet. Probably not worth it? >> >> >> > > >> >> >> > > Existing XDP metadata implementation also expand in a fixed/limited >> >> >> > > sized memory area, but communicate size per packet in this area (also >> >> >> > > for validation purposes). BUT for AF_XDP we don't have room for another >> >> >> > > pointer or size in the AF_XDP descriptor (see struct xdp_desc). >> >> >> > > >> >> >> > > >> >> >> > > > >> >> >> > > >>> If every packet would have a different metadata length, it seems like >> >> >> > > >>> a nightmare to parse? >> >> >> > > >>> >> >> >> > > >> >> >> >> > > >> No parsing is really needed. We can simply use BTF IDs and type cast in >> >> >> > > >> BPF-prog. Both BPF-prog and userspace have access to the local BTF ids, >> >> >> > > >> see [1] and [2]. >> >> >> > > >> >> >> >> > > >> It seems we are talking slightly past each-other(?). Let me rephrase >> >> >> > > >> and reframe the question, what is your *plan* for dealing with different >> >> >> > > >> *types* of TX metadata. The different struct *types* will of-cause have >> >> >> > > >> different sizes, but that is okay as long as they fit into the maximum >> >> >> > > >> size set by this new socket option XDP_TX_METADATA_LEN. >> >> >> > > >> Thus, in principle I'm fine with XSK having configured a fixed headroom >> >> >> > > >> for metadata, but we need a plan for handling more than one type and >> >> >> > > >> perhaps a xsk desc indicator/flag for knowing TX metadata isn't random >> >> >> > > >> data ("leftover" since last time this mem was used). >> >> >> > > > >> >> >> > > > Yeah, I think the above correctly catches my expectation here. Some >> >> >> > > > headroom is reserved via XDP_TX_METADATA_LEN and the flexibility is >> >> >> > > > offloaded to the bpf program via btf_id/tlv/etc. >> >> >> > > > >> >> >> > > > Regarding leftover metadata: can we assume the userspace will take >> >> >> > > > care of setting it up? >> >> >> > > > >> >> >> > > >> With this kfunc approach, then things in-principle, becomes a contract >> >> >> > > >> between the "local" TX-hook BPF-prog and AF_XDP userspace. These two >> >> >> > > >> components can as illustrated here [1]+[2] can coordinate based on local >> >> >> > > >> BPF-prog BTF IDs. This approach works as-is today, but patchset >> >> >> > > >> selftests examples don't use this and instead have a very static >> >> >> > > >> approach (that people will copy-paste). >> >> >> > > >> >> >> >> > > >> An unsolved problem with TX-hook is that it can also get packets from >> >> >> > > >> XDP_REDIRECT and even normal SKBs gets processed (right?). How does the >> >> >> > > >> BPF-prog know if metadata is valid and intended to be used for e.g. >> >> >> > > >> requesting the timestamp? (imagine metadata size happen to match) >> >> >> > > > >> >> >> > > > My assumption was the bpf program can do ifindex/netns filtering. Plus >> >> >> > > > maybe check that the meta_len is the one that's expected. >> >> >> > > > Will that be enough to handle XDP_REDIRECT? >> >> >> > > >> >> >> > > I don't think so, using the meta_len (+ ifindex/netns) to communicate >> >> >> > > activation of TX hardware hints is too weak and not enough. This is an >> >> >> > > implicit API for BPF-programmers to understand and can lead to implicit >> >> >> > > activation. >> >> >> > > >> >> >> > > Think about what will happen for your AF_XDP send use-case. 
For >> >> >> > > performance reasons AF_XDP don't zero out frame memory. Thus, meta_len >> >> >> > > is fixed even if not used (and can contain garbage), it can by accident >> >> >> > > create hard-to-debug situations. As discussed with Magnus+Maryam >> >> >> > > before, we found it was practical (and faster than mem zero) to extend >> >> >> > > AF_XDP descriptor (see struct xdp_desc) with some flags to >> >> >> > > indicate/communicate this frame comes with TX metadata hints. >> >> >> > >> >> >> > What is that "if not used" situation? Can the metadata itself have >> >> >> > is_used bit? The userspace has to initialize at least that bit. >> >> >> > We can definitely add that extra "has_metadata" bit to the descriptor, >> >> >> > but I'm trying to understand whether we can do without it. >> >> >> >> >> >> To me, this "has_metadata" bit in the descriptor is just an >> >> >> optimization. If it is 0, then there is no need to go and check the >> >> >> metadata field and you save some performance. Regardless of this bit, >> >> >> you need some way to say "is_used" for each metadata entry (at least >> >> >> when the number of metadata entries is >1). Three options come to mind >> >> >> each with their pros and cons. >> >> >> >> >> >> #1: Let each metadata entry have an invalid state. Not possible for >> >> >> every metadata and requires the user/kernel to go scan through every >> >> >> entry for every packet. >> >> >> >> >> >> #2: Have a field of bits at the start of the metadata section (closest >> >> >> to packet data) that signifies if a metadata entry is valid or not. If >> >> >> there are N metadata entries in the metadata area, then N bits in this >> >> >> field would be used to signify if the corresponding metadata is used >> >> >> or not. Only requires the user/kernel to scan the valid entries plus >> >> >> one access for the "is_used" bits. >> >> >> >> >> >> #3: Have N bits in the AF_XDP descriptor options field instead of the >> >> >> N bits in the metadata area of #2. Faster but would consume many >> >> >> precious bits in the fixed descriptor and cap the number of metadata >> >> >> entries possible at around 8. E.g., 8 for Rx, 8 for Tx, 1 for the >> >> >> multi-buffer work, and 15 for some future use. Depends on how daring >> >> >> we are. >> >> >> >> >> >> The "has_metadata" bit suggestion can be combined with 1 or 2. >> >> >> Approach 3 is just a fine grained extension of the idea itself. >> >> >> >> >> >> IMO, the best approach unfortunately depends on the metadata itself. >> >> >> If it is rarely valid, you want something like the "has_metadata" bit. >> >> >> If it is nearly always valid and used, approach #1 (if possible for >> >> >> the metadata) should be the fastest. The decision also depends on the >> >> >> number of metadata entries you have per packet. Sorry that I do not >> >> >> have a good answer. My feeling is that we need something like #1 or >> >> >> #2, or maybe both, then if needed we can add the "has_metadata" bit or >> >> >> bits (#3) optimization. Can we do this encoding and choice (#1, #2, or >> >> >> a combo) in the eBPF program itself? Would provide us with the >> >> >> flexibility, if possible. >> >> > >> >> > Here is my take on it, lmk if I'm missing something: >> >> > >> >> > af_xdp users call this new setsockopt(XDP_TX_METADATA_LEN) when they >> >> > plan to use metadata on tx. >> >> > This essentially requires allocating a fixed headroom to carry the metadata. 
>> >> > af_xdp machinery exports this fixed len into the bpf programs somehow >> >> > (devtx_frame.meta_len in this series). >> >> > Then it's up to the userspace and bpf program to agree on the layout. >> >> > If not every packet is expected to carry the metadata, there might be >> >> > some bitmask in the metadata area to indicate that. >> >> > >> >> > Iow, the metadata isn't interpreted by the kernel. It's up to the prog >> >> > to interpret it and call appropriate kfunc to enable some offload. >> >> >> >> The reason for the flag on RX is mostly performance: there's a >> >> substantial performance hit from reading the metadata area because it's >> >> not cache-hot; we want to avoid that when no metadata is in use. Putting >> >> the flag inside the metadata area itself doesn't work for this, because >> >> then you incur the cache miss just to read the flag. >> > >> > Not necessarily. Let us say that the flag is 4 bytes. Increase the >> > start address of the packet buffer with 4 and the flags field will be >> > on the same cache line as the first 60 bytes of the packet data >> > (assuming a 64 byte cache line size and the flags field is closest to >> > the start of the packet data). As long as you write something in those >> > first 60 bytes of packet data that cache line will be brought in and >> > will likely be in the cache when you access the bits in the metadata >> > field. The trick works similarly for Rx by setting the umem headroom >> > accordingly. >> >> Yeah, a trick like that was what I was alluding to with the "could" in >> this bit: >> >> >> but I see no reason it could not also occur on TX (it'll mostly >> >> depend on data alignment I guess?). >> >> right below the text you quoted ;) >> >> > But you are correct in that dedicating a bit in the descriptor will >> > make sure it is always hot, while the trick above is dependent on the >> > app wanting to read or write the first cache line worth of packet >> > data. >> >> Exactly; which is why I think it's worth the flag bit :) > > Ack. Let me add this to the list of things to follow up on. I'm > assuming it's fair to start without the flag and add it later as a > performance optimization? > We have a fair bit of core things we need to agree on first :-D Certainly no objection as long as we are doing RFC patches, but I think we should probably add this before merging something; no reason to change API more than we have to :) -Toke ^ permalink raw reply [flat|nested] 20+ messages in thread
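To make the contract sketched above concrete: a minimal, purely illustrative producer-side layout under the proposed model, assuming XDP_TX_METADATA_LEN has been set to sizeof(struct my_tx_meta). The struct, its fields and the flag are made-up names for illustration, not part of the RFC; the kernel never interprets any of it.

#include <string.h>
#include <linux/types.h>
#include <linux/if_xdp.h>

/* Illustrative layout agreed on between userspace and the devtx BPF
 * program.  The flags word sits closest to the packet data so it can
 * share a cache line with the headers (cf. option #2 above). */
struct my_tx_meta {
	__u64 completion_tstamp;	/* written back on TX complete */
	__u32 reserved;
	__u32 flags;			/* bit 0: request HW timestamp */
};
#define MY_TX_META_REQ_TSTAMP	(1U << 0)

/* Fill one TX descriptor: the packet starts sizeof(struct my_tx_meta)
 * bytes into the UMEM chunk, and the metadata occupies the headroom
 * immediately before tx_desc->addr. */
static void fill_tx_desc(void *umem_base, __u64 chunk_addr,
			 struct xdp_desc *tx_desc, __u32 pkt_len)
{
	struct my_tx_meta *meta =
		(struct my_tx_meta *)((char *)umem_base + chunk_addr);

	memset(meta, 0, sizeof(*meta));		/* no stale "leftover" bits */
	meta->flags = MY_TX_META_REQ_TSTAMP;

	tx_desc->addr = chunk_addr + sizeof(*meta);	/* metadata ends where addr begins */
	tx_desc->len = pkt_len;
}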
* [xdp-hints] Re: [RFC bpf-next v2 03/11] xsk: Support XDP_TX_METADATA_LEN 2023-06-29 12:01 ` Toke Høiland-Jørgensen 2023-06-29 16:21 ` Stanislav Fomichev @ 2023-06-30 6:22 ` Magnus Karlsson 2023-06-30 9:19 ` Toke Høiland-Jørgensen 1 sibling, 1 reply; 20+ messages in thread From: Magnus Karlsson @ 2023-06-30 6:22 UTC (permalink / raw) To: Toke Høiland-Jørgensen Cc: Stanislav Fomichev, Jesper Dangaard Brouer, brouer, bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend, kpsingh, haoluo, jolsa, Björn Töpel, Karlsson, Magnus, xdp-hints On Thu, 29 Jun 2023 at 14:01, Toke Høiland-Jørgensen <toke@redhat.com> wrote: > > Magnus Karlsson <magnus.karlsson@gmail.com> writes: > > > On Thu, 29 Jun 2023 at 13:30, Toke Høiland-Jørgensen <toke@redhat.com> wrote: > >> > >> Stanislav Fomichev <sdf@google.com> writes: > >> > >> > On Wed, Jun 28, 2023 at 1:09 AM Magnus Karlsson > >> > <magnus.karlsson@gmail.com> wrote: > >> >> > >> >> On Mon, 26 Jun 2023 at 19:06, Stanislav Fomichev <sdf@google.com> wrote: > >> >> > > >> >> > On Sat, Jun 24, 2023 at 2:02 AM Jesper Dangaard Brouer > >> >> > <jbrouer@redhat.com> wrote: > >> >> > > > >> >> > > > >> >> > > > >> >> > > On 23/06/2023 19.41, Stanislav Fomichev wrote: > >> >> > > > On Fri, Jun 23, 2023 at 3:24 AM Jesper Dangaard Brouer > >> >> > > > <jbrouer@redhat.com> wrote: > >> >> > > >> > >> >> > > >> > >> >> > > >> > >> >> > > >> On 22/06/2023 19.55, Stanislav Fomichev wrote: > >> >> > > >>> On Thu, Jun 22, 2023 at 2:11 AM Jesper D. Brouer <netdev@brouer.com> wrote: > >> >> > > >>>> > >> >> > > >>>> > >> >> > > >>>> This needs to be reviewed by AF_XDP maintainers Magnus and Bjørn (Cc) > >> >> > > >>>> > >> >> > > >>>> On 21/06/2023 19.02, Stanislav Fomichev wrote: > >> >> > > >>>>> For zerocopy mode, tx_desc->addr can point to the arbitrary offset > >> >> > > >>>>> and carry some TX metadata in the headroom. For copy mode, there > >> >> > > >>>>> is no way currently to populate skb metadata. > >> >> > > >>>>> > >> >> > > >>>>> Introduce new XDP_TX_METADATA_LEN that indicates how many bytes > >> >> > > >>>>> to treat as metadata. Metadata bytes come prior to tx_desc address > >> >> > > >>>>> (same as in RX case). > >> >> > > >>>> > >> >> > > >>>> From looking at the code, this introduces a socket option for this TX > >> >> > > >>>> metadata length (tx_metadata_len). > >> >> > > >>>> This implies the same fixed TX metadata size is used for all packets. > >> >> > > >>>> Maybe describe this in patch desc. > >> >> > > >>> > >> >> > > >>> I was planning to do a proper documentation page once we settle on all > >> >> > > >>> the details (similar to the one we have for rx). > >> >> > > >>> > >> >> > > >>>> What is the plan for dealing with cases that doesn't populate same/full > >> >> > > >>>> TX metadata size ? > >> >> > > >>> > >> >> > > >>> Do we need to support that? I was assuming that the TX layout would be > >> >> > > >>> fixed between the userspace and BPF. > >> >> > > >> > >> >> > > >> I hope you don't mean fixed layout, as the whole point is adding > >> >> > > >> flexibility and extensibility. > >> >> > > > > >> >> > > > I do mean a fixed layout between the userspace (af_xdp) and devtx program. > >> >> > > > At least fixed max size of the metadata. The userspace and the bpf > >> >> > > > prog can then use this fixed space to implement some flexibility > >> >> > > > (btf_ids, versioned structs, bitmasks, tlv, etc). > >> >> > > > If we were to make the metalen vary per packet, we'd have to signal > >> >> > > > its size per packet. 
Probably not worth it? > >> >> > > > >> >> > > Existing XDP metadata implementation also expand in a fixed/limited > >> >> > > sized memory area, but communicate size per packet in this area (also > >> >> > > for validation purposes). BUT for AF_XDP we don't have room for another > >> >> > > pointer or size in the AF_XDP descriptor (see struct xdp_desc). > >> >> > > > >> >> > > > >> >> > > > > >> >> > > >>> If every packet would have a different metadata length, it seems like > >> >> > > >>> a nightmare to parse? > >> >> > > >>> > >> >> > > >> > >> >> > > >> No parsing is really needed. We can simply use BTF IDs and type cast in > >> >> > > >> BPF-prog. Both BPF-prog and userspace have access to the local BTF ids, > >> >> > > >> see [1] and [2]. > >> >> > > >> > >> >> > > >> It seems we are talking slightly past each-other(?). Let me rephrase > >> >> > > >> and reframe the question, what is your *plan* for dealing with different > >> >> > > >> *types* of TX metadata. The different struct *types* will of-cause have > >> >> > > >> different sizes, but that is okay as long as they fit into the maximum > >> >> > > >> size set by this new socket option XDP_TX_METADATA_LEN. > >> >> > > >> Thus, in principle I'm fine with XSK having configured a fixed headroom > >> >> > > >> for metadata, but we need a plan for handling more than one type and > >> >> > > >> perhaps a xsk desc indicator/flag for knowing TX metadata isn't random > >> >> > > >> data ("leftover" since last time this mem was used). > >> >> > > > > >> >> > > > Yeah, I think the above correctly catches my expectation here. Some > >> >> > > > headroom is reserved via XDP_TX_METADATA_LEN and the flexibility is > >> >> > > > offloaded to the bpf program via btf_id/tlv/etc. > >> >> > > > > >> >> > > > Regarding leftover metadata: can we assume the userspace will take > >> >> > > > care of setting it up? > >> >> > > > > >> >> > > >> With this kfunc approach, then things in-principle, becomes a contract > >> >> > > >> between the "local" TX-hook BPF-prog and AF_XDP userspace. These two > >> >> > > >> components can as illustrated here [1]+[2] can coordinate based on local > >> >> > > >> BPF-prog BTF IDs. This approach works as-is today, but patchset > >> >> > > >> selftests examples don't use this and instead have a very static > >> >> > > >> approach (that people will copy-paste). > >> >> > > >> > >> >> > > >> An unsolved problem with TX-hook is that it can also get packets from > >> >> > > >> XDP_REDIRECT and even normal SKBs gets processed (right?). How does the > >> >> > > >> BPF-prog know if metadata is valid and intended to be used for e.g. > >> >> > > >> requesting the timestamp? (imagine metadata size happen to match) > >> >> > > > > >> >> > > > My assumption was the bpf program can do ifindex/netns filtering. Plus > >> >> > > > maybe check that the meta_len is the one that's expected. > >> >> > > > Will that be enough to handle XDP_REDIRECT? > >> >> > > > >> >> > > I don't think so, using the meta_len (+ ifindex/netns) to communicate > >> >> > > activation of TX hardware hints is too weak and not enough. This is an > >> >> > > implicit API for BPF-programmers to understand and can lead to implicit > >> >> > > activation. > >> >> > > > >> >> > > Think about what will happen for your AF_XDP send use-case. For > >> >> > > performance reasons AF_XDP don't zero out frame memory. Thus, meta_len > >> >> > > is fixed even if not used (and can contain garbage), it can by accident > >> >> > > create hard-to-debug situations. 
As discussed with Magnus+Maryam > >> >> > > before, we found it was practical (and faster than mem zero) to extend > >> >> > > AF_XDP descriptor (see struct xdp_desc) with some flags to > >> >> > > indicate/communicate this frame comes with TX metadata hints. > >> >> > > >> >> > What is that "if not used" situation? Can the metadata itself have > >> >> > is_used bit? The userspace has to initialize at least that bit. > >> >> > We can definitely add that extra "has_metadata" bit to the descriptor, > >> >> > but I'm trying to understand whether we can do without it. > >> >> > >> >> To me, this "has_metadata" bit in the descriptor is just an > >> >> optimization. If it is 0, then there is no need to go and check the > >> >> metadata field and you save some performance. Regardless of this bit, > >> >> you need some way to say "is_used" for each metadata entry (at least > >> >> when the number of metadata entries is >1). Three options come to mind > >> >> each with their pros and cons. > >> >> > >> >> #1: Let each metadata entry have an invalid state. Not possible for > >> >> every metadata and requires the user/kernel to go scan through every > >> >> entry for every packet. > >> >> > >> >> #2: Have a field of bits at the start of the metadata section (closest > >> >> to packet data) that signifies if a metadata entry is valid or not. If > >> >> there are N metadata entries in the metadata area, then N bits in this > >> >> field would be used to signify if the corresponding metadata is used > >> >> or not. Only requires the user/kernel to scan the valid entries plus > >> >> one access for the "is_used" bits. > >> >> > >> >> #3: Have N bits in the AF_XDP descriptor options field instead of the > >> >> N bits in the metadata area of #2. Faster but would consume many > >> >> precious bits in the fixed descriptor and cap the number of metadata > >> >> entries possible at around 8. E.g., 8 for Rx, 8 for Tx, 1 for the > >> >> multi-buffer work, and 15 for some future use. Depends on how daring > >> >> we are. > >> >> > >> >> The "has_metadata" bit suggestion can be combined with 1 or 2. > >> >> Approach 3 is just a fine grained extension of the idea itself. > >> >> > >> >> IMO, the best approach unfortunately depends on the metadata itself. > >> >> If it is rarely valid, you want something like the "has_metadata" bit. > >> >> If it is nearly always valid and used, approach #1 (if possible for > >> >> the metadata) should be the fastest. The decision also depends on the > >> >> number of metadata entries you have per packet. Sorry that I do not > >> >> have a good answer. My feeling is that we need something like #1 or > >> >> #2, or maybe both, then if needed we can add the "has_metadata" bit or > >> >> bits (#3) optimization. Can we do this encoding and choice (#1, #2, or > >> >> a combo) in the eBPF program itself? Would provide us with the > >> >> flexibility, if possible. > >> > > >> > Here is my take on it, lmk if I'm missing something: > >> > > >> > af_xdp users call this new setsockopt(XDP_TX_METADATA_LEN) when they > >> > plan to use metadata on tx. > >> > This essentially requires allocating a fixed headroom to carry the metadata. > >> > af_xdp machinery exports this fixed len into the bpf programs somehow > >> > (devtx_frame.meta_len in this series). > >> > Then it's up to the userspace and bpf program to agree on the layout. > >> > If not every packet is expected to carry the metadata, there might be > >> > some bitmask in the metadata area to indicate that. 
> >> > > >> > Iow, the metadata isn't interpreted by the kernel. It's up to the prog > >> > to interpret it and call appropriate kfunc to enable some offload. > >> > >> The reason for the flag on RX is mostly performance: there's a > >> substantial performance hit from reading the metadata area because it's > >> not cache-hot; we want to avoid that when no metadata is in use. Putting > >> the flag inside the metadata area itself doesn't work for this, because > >> then you incur the cache miss just to read the flag. > > > > Not necessarily. Let us say that the flag is 4 bytes. Increase the > > start address of the packet buffer with 4 and the flags field will be > > on the same cache line as the first 60 bytes of the packet data > > (assuming a 64 byte cache line size and the flags field is closest to > > the start of the packet data). As long as you write something in those > > first 60 bytes of packet data that cache line will be brought in and > > will likely be in the cache when you access the bits in the metadata > > field. The trick works similarly for Rx by setting the umem headroom > > accordingly. > > Yeah, a trick like that was what I was alluding to with the "could" in > this bit: > > >> but I see no reason it could not also occur on TX (it'll mostly > >> depend on data alignment I guess?). > > right below the text you quoted ;) Ouch! Sorry Toke. Was a bit too trigger-happy there. > > But you are correct in that dedicating a bit in the descriptor will > > make sure it is always hot, while the trick above is dependent on the > > app wanting to read or write the first cache line worth of packet > > data. > > Exactly; which is why I think it's worth the flag bit :) > > -Toke > ^ permalink raw reply [flat|nested] 20+ messages in thread
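Magnus' trick of shifting the buffer start is plain address arithmetic; a tiny self-contained sketch of the point, assuming a 64-byte cache line and a 4-byte flags word placed directly in front of the packet data (the chunk address and names are made up):

#include <assert.h>
#include <stdint.h>

#define CACHELINE	64
#define FLAGS_SZ	4

int main(void)
{
	uint64_t chunk_addr = 0x4000;			/* cache-line aligned chunk */
	uint64_t pkt_addr   = chunk_addr + FLAGS_SZ;	/* packet data starts 4 bytes in */
	uint64_t flags_addr = pkt_addr - FLAGS_SZ;	/* flags fill bytes 0-3 */

	/* The flags word and the first 60 bytes of packet data sit on the
	 * same cache line, so touching the headers also brings in the flags. */
	assert(flags_addr / CACHELINE == pkt_addr / CACHELINE);
	assert((pkt_addr + 59) / CACHELINE == flags_addr / CACHELINE);
	return 0;
}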
* [xdp-hints] Re: [RFC bpf-next v2 03/11] xsk: Support XDP_TX_METADATA_LEN 2023-06-30 6:22 ` Magnus Karlsson @ 2023-06-30 9:19 ` Toke Høiland-Jørgensen 0 siblings, 0 replies; 20+ messages in thread From: Toke Høiland-Jørgensen @ 2023-06-30 9:19 UTC (permalink / raw) To: Magnus Karlsson Cc: Stanislav Fomichev, Jesper Dangaard Brouer, brouer, bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend, kpsingh, haoluo, jolsa, Björn Töpel, Karlsson, Magnus, xdp-hints Magnus Karlsson <magnus.karlsson@gmail.com> writes: > On Thu, 29 Jun 2023 at 14:01, Toke Høiland-Jørgensen <toke@redhat.com> wrote: >> >> Magnus Karlsson <magnus.karlsson@gmail.com> writes: >> >> > On Thu, 29 Jun 2023 at 13:30, Toke Høiland-Jørgensen <toke@redhat.com> wrote: >> >> >> >> Stanislav Fomichev <sdf@google.com> writes: >> >> >> >> > On Wed, Jun 28, 2023 at 1:09 AM Magnus Karlsson >> >> > <magnus.karlsson@gmail.com> wrote: >> >> >> >> >> >> On Mon, 26 Jun 2023 at 19:06, Stanislav Fomichev <sdf@google.com> wrote: >> >> >> > >> >> >> > On Sat, Jun 24, 2023 at 2:02 AM Jesper Dangaard Brouer >> >> >> > <jbrouer@redhat.com> wrote: >> >> >> > > >> >> >> > > >> >> >> > > >> >> >> > > On 23/06/2023 19.41, Stanislav Fomichev wrote: >> >> >> > > > On Fri, Jun 23, 2023 at 3:24 AM Jesper Dangaard Brouer >> >> >> > > > <jbrouer@redhat.com> wrote: >> >> >> > > >> >> >> >> > > >> >> >> >> > > >> >> >> >> > > >> On 22/06/2023 19.55, Stanislav Fomichev wrote: >> >> >> > > >>> On Thu, Jun 22, 2023 at 2:11 AM Jesper D. Brouer <netdev@brouer.com> wrote: >> >> >> > > >>>> >> >> >> > > >>>> >> >> >> > > >>>> This needs to be reviewed by AF_XDP maintainers Magnus and Bjørn (Cc) >> >> >> > > >>>> >> >> >> > > >>>> On 21/06/2023 19.02, Stanislav Fomichev wrote: >> >> >> > > >>>>> For zerocopy mode, tx_desc->addr can point to the arbitrary offset >> >> >> > > >>>>> and carry some TX metadata in the headroom. For copy mode, there >> >> >> > > >>>>> is no way currently to populate skb metadata. >> >> >> > > >>>>> >> >> >> > > >>>>> Introduce new XDP_TX_METADATA_LEN that indicates how many bytes >> >> >> > > >>>>> to treat as metadata. Metadata bytes come prior to tx_desc address >> >> >> > > >>>>> (same as in RX case). >> >> >> > > >>>> >> >> >> > > >>>> From looking at the code, this introduces a socket option for this TX >> >> >> > > >>>> metadata length (tx_metadata_len). >> >> >> > > >>>> This implies the same fixed TX metadata size is used for all packets. >> >> >> > > >>>> Maybe describe this in patch desc. >> >> >> > > >>> >> >> >> > > >>> I was planning to do a proper documentation page once we settle on all >> >> >> > > >>> the details (similar to the one we have for rx). >> >> >> > > >>> >> >> >> > > >>>> What is the plan for dealing with cases that doesn't populate same/full >> >> >> > > >>>> TX metadata size ? >> >> >> > > >>> >> >> >> > > >>> Do we need to support that? I was assuming that the TX layout would be >> >> >> > > >>> fixed between the userspace and BPF. >> >> >> > > >> >> >> >> > > >> I hope you don't mean fixed layout, as the whole point is adding >> >> >> > > >> flexibility and extensibility. >> >> >> > > > >> >> >> > > > I do mean a fixed layout between the userspace (af_xdp) and devtx program. >> >> >> > > > At least fixed max size of the metadata. The userspace and the bpf >> >> >> > > > prog can then use this fixed space to implement some flexibility >> >> >> > > > (btf_ids, versioned structs, bitmasks, tlv, etc). 
>> >> >> > > > If we were to make the metalen vary per packet, we'd have to signal >> >> >> > > > its size per packet. Probably not worth it? >> >> >> > > >> >> >> > > Existing XDP metadata implementation also expand in a fixed/limited >> >> >> > > sized memory area, but communicate size per packet in this area (also >> >> >> > > for validation purposes). BUT for AF_XDP we don't have room for another >> >> >> > > pointer or size in the AF_XDP descriptor (see struct xdp_desc). >> >> >> > > >> >> >> > > >> >> >> > > > >> >> >> > > >>> If every packet would have a different metadata length, it seems like >> >> >> > > >>> a nightmare to parse? >> >> >> > > >>> >> >> >> > > >> >> >> >> > > >> No parsing is really needed. We can simply use BTF IDs and type cast in >> >> >> > > >> BPF-prog. Both BPF-prog and userspace have access to the local BTF ids, >> >> >> > > >> see [1] and [2]. >> >> >> > > >> >> >> >> > > >> It seems we are talking slightly past each-other(?). Let me rephrase >> >> >> > > >> and reframe the question, what is your *plan* for dealing with different >> >> >> > > >> *types* of TX metadata. The different struct *types* will of-cause have >> >> >> > > >> different sizes, but that is okay as long as they fit into the maximum >> >> >> > > >> size set by this new socket option XDP_TX_METADATA_LEN. >> >> >> > > >> Thus, in principle I'm fine with XSK having configured a fixed headroom >> >> >> > > >> for metadata, but we need a plan for handling more than one type and >> >> >> > > >> perhaps a xsk desc indicator/flag for knowing TX metadata isn't random >> >> >> > > >> data ("leftover" since last time this mem was used). >> >> >> > > > >> >> >> > > > Yeah, I think the above correctly catches my expectation here. Some >> >> >> > > > headroom is reserved via XDP_TX_METADATA_LEN and the flexibility is >> >> >> > > > offloaded to the bpf program via btf_id/tlv/etc. >> >> >> > > > >> >> >> > > > Regarding leftover metadata: can we assume the userspace will take >> >> >> > > > care of setting it up? >> >> >> > > > >> >> >> > > >> With this kfunc approach, then things in-principle, becomes a contract >> >> >> > > >> between the "local" TX-hook BPF-prog and AF_XDP userspace. These two >> >> >> > > >> components can as illustrated here [1]+[2] can coordinate based on local >> >> >> > > >> BPF-prog BTF IDs. This approach works as-is today, but patchset >> >> >> > > >> selftests examples don't use this and instead have a very static >> >> >> > > >> approach (that people will copy-paste). >> >> >> > > >> >> >> >> > > >> An unsolved problem with TX-hook is that it can also get packets from >> >> >> > > >> XDP_REDIRECT and even normal SKBs gets processed (right?). How does the >> >> >> > > >> BPF-prog know if metadata is valid and intended to be used for e.g. >> >> >> > > >> requesting the timestamp? (imagine metadata size happen to match) >> >> >> > > > >> >> >> > > > My assumption was the bpf program can do ifindex/netns filtering. Plus >> >> >> > > > maybe check that the meta_len is the one that's expected. >> >> >> > > > Will that be enough to handle XDP_REDIRECT? >> >> >> > > >> >> >> > > I don't think so, using the meta_len (+ ifindex/netns) to communicate >> >> >> > > activation of TX hardware hints is too weak and not enough. This is an >> >> >> > > implicit API for BPF-programmers to understand and can lead to implicit >> >> >> > > activation. >> >> >> > > >> >> >> > > Think about what will happen for your AF_XDP send use-case. 
For >> >> >> > > performance reasons AF_XDP don't zero out frame memory. Thus, meta_len >> >> >> > > is fixed even if not used (and can contain garbage), it can by accident >> >> >> > > create hard-to-debug situations. As discussed with Magnus+Maryam >> >> >> > > before, we found it was practical (and faster than mem zero) to extend >> >> >> > > AF_XDP descriptor (see struct xdp_desc) with some flags to >> >> >> > > indicate/communicate this frame comes with TX metadata hints. >> >> >> > >> >> >> > What is that "if not used" situation? Can the metadata itself have >> >> >> > is_used bit? The userspace has to initialize at least that bit. >> >> >> > We can definitely add that extra "has_metadata" bit to the descriptor, >> >> >> > but I'm trying to understand whether we can do without it. >> >> >> >> >> >> To me, this "has_metadata" bit in the descriptor is just an >> >> >> optimization. If it is 0, then there is no need to go and check the >> >> >> metadata field and you save some performance. Regardless of this bit, >> >> >> you need some way to say "is_used" for each metadata entry (at least >> >> >> when the number of metadata entries is >1). Three options come to mind >> >> >> each with their pros and cons. >> >> >> >> >> >> #1: Let each metadata entry have an invalid state. Not possible for >> >> >> every metadata and requires the user/kernel to go scan through every >> >> >> entry for every packet. >> >> >> >> >> >> #2: Have a field of bits at the start of the metadata section (closest >> >> >> to packet data) that signifies if a metadata entry is valid or not. If >> >> >> there are N metadata entries in the metadata area, then N bits in this >> >> >> field would be used to signify if the corresponding metadata is used >> >> >> or not. Only requires the user/kernel to scan the valid entries plus >> >> >> one access for the "is_used" bits. >> >> >> >> >> >> #3: Have N bits in the AF_XDP descriptor options field instead of the >> >> >> N bits in the metadata area of #2. Faster but would consume many >> >> >> precious bits in the fixed descriptor and cap the number of metadata >> >> >> entries possible at around 8. E.g., 8 for Rx, 8 for Tx, 1 for the >> >> >> multi-buffer work, and 15 for some future use. Depends on how daring >> >> >> we are. >> >> >> >> >> >> The "has_metadata" bit suggestion can be combined with 1 or 2. >> >> >> Approach 3 is just a fine grained extension of the idea itself. >> >> >> >> >> >> IMO, the best approach unfortunately depends on the metadata itself. >> >> >> If it is rarely valid, you want something like the "has_metadata" bit. >> >> >> If it is nearly always valid and used, approach #1 (if possible for >> >> >> the metadata) should be the fastest. The decision also depends on the >> >> >> number of metadata entries you have per packet. Sorry that I do not >> >> >> have a good answer. My feeling is that we need something like #1 or >> >> >> #2, or maybe both, then if needed we can add the "has_metadata" bit or >> >> >> bits (#3) optimization. Can we do this encoding and choice (#1, #2, or >> >> >> a combo) in the eBPF program itself? Would provide us with the >> >> >> flexibility, if possible. >> >> > >> >> > Here is my take on it, lmk if I'm missing something: >> >> > >> >> > af_xdp users call this new setsockopt(XDP_TX_METADATA_LEN) when they >> >> > plan to use metadata on tx. >> >> > This essentially requires allocating a fixed headroom to carry the metadata. 
>> >> > af_xdp machinery exports this fixed len into the bpf programs somehow >> >> > (devtx_frame.meta_len in this series). >> >> > Then it's up to the userspace and bpf program to agree on the layout. >> >> > If not every packet is expected to carry the metadata, there might be >> >> > some bitmask in the metadata area to indicate that. >> >> > >> >> > Iow, the metadata isn't interpreted by the kernel. It's up to the prog >> >> > to interpret it and call appropriate kfunc to enable some offload. >> >> >> >> The reason for the flag on RX is mostly performance: there's a >> >> substantial performance hit from reading the metadata area because it's >> >> not cache-hot; we want to avoid that when no metadata is in use. Putting >> >> the flag inside the metadata area itself doesn't work for this, because >> >> then you incur the cache miss just to read the flag. >> > >> > Not necessarily. Let us say that the flag is 4 bytes. Increase the >> > start address of the packet buffer with 4 and the flags field will be >> > on the same cache line as the first 60 bytes of the packet data >> > (assuming a 64 byte cache line size and the flags field is closest to >> > the start of the packet data). As long as you write something in those >> > first 60 bytes of packet data that cache line will be brought in and >> > will likely be in the cache when you access the bits in the metadata >> > field. The trick works similarly for Rx by setting the umem headroom >> > accordingly. >> >> Yeah, a trick like that was what I was alluding to with the "could" in >> this bit: >> >> >> but I see no reason it could not also occur on TX (it'll mostly >> >> depend on data alignment I guess?). >> >> right below the text you quoted ;) > > Ouch! Sorry Toke. Was a bit too trigger-happy there. Haha, no worries, seems like we're basically in agreement anyway :) -Toke ^ permalink raw reply [flat|nested] 20+ messages in thread
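A rough sketch of the descriptor bit this subthread converges on, reusing the illustrative struct my_tx_meta from earlier; the flag name and value are hypothetical and not part of this RFC, while xsk_buff_raw_get_data() and pool->tx_metadata_len are the helper and field this series already touches:

/* Hypothetical follow-up, not in this series: a bit in xdp_desc->options
 * saying the (cold) metadata headroom is populated, so the consumer only
 * reads it when the producer asked for it. */
#define XDP_TX_METADATA	(1 << 1)	/* illustrative value only */

static bool xsk_tx_wants_hw_tstamp(struct xsk_buff_pool *pool,
				   const struct xdp_desc *desc)
{
	struct my_tx_meta *meta;
	void *data;

	if (!(desc->options & XDP_TX_METADATA))
		return false;			/* skip the cold cache line */

	data = xsk_buff_raw_get_data(pool, desc->addr);
	meta = (struct my_tx_meta *)((char *)data - pool->tx_metadata_len);

	return meta->flags & MY_TX_META_REQ_TSTAMP;
}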
[parent not found: <20230621170244.1283336-10-sdf@google.com>]
* [xdp-hints] Re: [RFC bpf-next v2 09/11] selftests/bpf: Extend xdp_metadata with devtx kfuncs [not found] ` <20230621170244.1283336-10-sdf@google.com> @ 2023-06-23 11:12 ` Jesper D. Brouer 2023-06-23 17:40 ` Stanislav Fomichev 0 siblings, 1 reply; 20+ messages in thread From: Jesper D. Brouer @ 2023-06-23 11:12 UTC (permalink / raw) To: Stanislav Fomichev, bpf Cc: brouer, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend, kpsingh, haoluo, jolsa, netdev, xdp-hints On 21/06/2023 19.02, Stanislav Fomichev wrote: > Attach kfuncs that request and report TX timestamp via ringbuf. > Confirm on the userspace side that the program has triggered > and the timestamp is non-zero. > > Also make sure devtx_frame has a sensible pointers and data. > [...] > diff --git a/tools/testing/selftests/bpf/progs/xdp_metadata.c b/tools/testing/selftests/bpf/progs/xdp_metadata.c > index d151d406a123..fc025183d45a 100644 > --- a/tools/testing/selftests/bpf/progs/xdp_metadata.c > +++ b/tools/testing/selftests/bpf/progs/xdp_metadata.c [...] > @@ -19,10 +24,25 @@ struct { > __type(value, __u32); > } prog_arr SEC(".maps"); > > +struct { > + __uint(type, BPF_MAP_TYPE_RINGBUF); > + __uint(max_entries, 10); > +} tx_compl_buf SEC(".maps"); > + > +__u64 pkts_fail_tx = 0; > + > +int ifindex = -1; > +__u64 net_cookie = -1; > + > extern int bpf_xdp_metadata_rx_timestamp(const struct xdp_md *ctx, > __u64 *timestamp) __ksym; > extern int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx, __u32 *hash, > enum xdp_rss_hash_type *rss_type) __ksym; > +extern int bpf_devtx_sb_request_timestamp(const struct devtx_frame *ctx) __ksym; > +extern int bpf_devtx_cp_timestamp(const struct devtx_frame *ctx, __u64 *timestamp) __ksym; > + > +extern int bpf_devtx_sb_attach(int ifindex, int prog_fd) __ksym; > +extern int bpf_devtx_cp_attach(int ifindex, int prog_fd) __ksym; > > SEC("xdp") > int rx(struct xdp_md *ctx) > @@ -61,4 +81,102 @@ int rx(struct xdp_md *ctx) > return bpf_redirect_map(&xsk, ctx->rx_queue_index, XDP_PASS); > } > > +static inline int verify_frame(const struct devtx_frame *frame) > +{ > + struct ethhdr eth = {}; > + > + /* all the pointers are set up correctly */ > + if (!frame->data) > + return -1; > + if (!frame->sinfo) > + return -1; > + > + /* can get to the frags */ > + if (frame->sinfo->nr_frags != 0) > + return -1; > + if (frame->sinfo->frags[0].bv_page != 0) > + return -1; > + if (frame->sinfo->frags[0].bv_len != 0) > + return -1; > + if (frame->sinfo->frags[0].bv_offset != 0) > + return -1; > + > + /* the data has something that looks like ethernet */ > + if (frame->len != 46) > + return -1; > + bpf_probe_read_kernel(ð, sizeof(eth), frame->data); > + > + if (eth.h_proto != bpf_htons(ETH_P_IP)) > + return -1; > + > + return 0; > +} > + > +SEC("fentry/veth_devtx_submit") > +int BPF_PROG(tx_submit, const struct devtx_frame *frame) > +{ > + struct xdp_tx_meta meta = {}; > + int ret; > + > + if (frame->netdev->ifindex != ifindex) > + return 0; > + if (frame->netdev->nd_net.net->net_cookie != net_cookie) > + return 0; > + if (frame->meta_len != TX_META_LEN) > + return 0; > + > + bpf_probe_read_kernel(&meta, sizeof(meta), frame->data - TX_META_LEN); > + if (!meta.request_timestamp) > + return 0; > + > + ret = verify_frame(frame); > + if (ret < 0) { > + __sync_add_and_fetch(&pkts_fail_tx, 1); > + return 0; > + } > + > + ret = bpf_devtx_sb_request_timestamp(frame); My original design thoughts were that BPF-progs would write into metadata area, with the intend that at TX-complete we can access this metadata area again. 
In this case with request_timestamp it would make sense to me to store a sequence number (+ the TX-queue number), such that program code can correlate on the completion event. Like the xdp_hw_metadata example, I would likely also add a software timestamp that I could check at the TX completion hook. > + if (ret < 0) { > + __sync_add_and_fetch(&pkts_fail_tx, 1); > + return 0; > + } > + > + return 0; > +} > + > +SEC("fentry/veth_devtx_complete") > +int BPF_PROG(tx_complete, const struct devtx_frame *frame) > +{ > + struct xdp_tx_meta meta = {}; > + struct devtx_sample *sample; > + int ret; > + > + if (frame->netdev->ifindex != ifindex) > + return 0; > + if (frame->netdev->nd_net.net->net_cookie != net_cookie) > + return 0; > + if (frame->meta_len != TX_META_LEN) > + return 0; > + > + bpf_probe_read_kernel(&meta, sizeof(meta), frame->data - TX_META_LEN); > + if (!meta.request_timestamp) > + return 0; > + > + ret = verify_frame(frame); > + if (ret < 0) { > + __sync_add_and_fetch(&pkts_fail_tx, 1); > + return 0; > + } > + > + sample = bpf_ringbuf_reserve(&tx_compl_buf, sizeof(*sample), 0); > + if (!sample) > + return 0; Sending this via a ringbuffer to userspace will make it hard to correlate. (For AF_XDP it would help a little to add the TX-queue number, as this hook isn't queue bound but AF_XDP is). > + > + sample->timestamp_retval = bpf_devtx_cp_timestamp(frame, &sample->timestamp); > + I was expecting to see information being written into the metadata area of the frame, such that the AF_XDP completion-queue handling can extract this obtained timestamp. > + bpf_ringbuf_submit(sample, 0); > + > + return 0; > +} > + > char _license[] SEC("license") = "GPL"; > diff --git a/tools/testing/selftests/bpf/xdp_metadata.h b/tools/testing/selftests/bpf/xdp_metadata.h > index 938a729bd307..e410f2b95e64 100644 > --- a/tools/testing/selftests/bpf/xdp_metadata.h > +++ b/tools/testing/selftests/bpf/xdp_metadata.h > @@ -18,3 +18,17 @@ struct xdp_meta { > __s32 rx_hash_err; > }; > }; > + > +struct devtx_sample { > + int timestamp_retval; > + __u64 timestamp; > +}; > + > +#define TX_META_LEN 8 Very static design. > + > +struct xdp_tx_meta { > + __u8 request_timestamp; > + __u8 padding0; > + __u16 padding1; > + __u32 padding2; > +}; padding2 could be a btf_id for creating a more flexible design. --Jesper ^ permalink raw reply [flat|nested] 20+ messages in thread
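One way to realize the suggestions above (a correlation cookie plus TX-queue number stamped at submit, a software timestamp to check at completion, and a BTF id instead of static padding) would be to grow the selftest's metadata along these lines; field names and sizes are illustrative only, with XDP_TX_METADATA_LEN bumped to match:

struct xdp_tx_meta {
	__u8  request_timestamp;
	__u8  padding0;
	__u16 tx_queue;		/* stamped by the submit hook            */
	__u32 btf_id;		/* local BTF id describing this layout   */
	__u64 seq;		/* producer cookie; lets submit/complete
				 * be correlated without a ringbuf       */
	__u64 sw_tstamp;	/* software timestamp taken at submit    */
	__u64 hw_tstamp;	/* written back at TX complete           */
};
#define TX_META_LEN	sizeof(struct xdp_tx_meta)	/* 32 bytes */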
* [xdp-hints] Re: [RFC bpf-next v2 09/11] selftests/bpf: Extend xdp_metadata with devtx kfuncs 2023-06-23 11:12 ` [xdp-hints] Re: [RFC bpf-next v2 09/11] selftests/bpf: Extend xdp_metadata with devtx kfuncs Jesper D. Brouer @ 2023-06-23 17:40 ` Stanislav Fomichev 0 siblings, 0 replies; 20+ messages in thread From: Stanislav Fomichev @ 2023-06-23 17:40 UTC (permalink / raw) To: Jesper D. Brouer Cc: bpf, brouer, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend, kpsingh, haoluo, jolsa, netdev, xdp-hints On Fri, Jun 23, 2023 at 4:12 AM Jesper D. Brouer <netdev@brouer.com> wrote: > > > > On 21/06/2023 19.02, Stanislav Fomichev wrote: > > Attach kfuncs that request and report TX timestamp via ringbuf. > > Confirm on the userspace side that the program has triggered > > and the timestamp is non-zero. > > > > Also make sure devtx_frame has a sensible pointers and data. > > > [...] > > > > diff --git a/tools/testing/selftests/bpf/progs/xdp_metadata.c b/tools/testing/selftests/bpf/progs/xdp_metadata.c > > index d151d406a123..fc025183d45a 100644 > > --- a/tools/testing/selftests/bpf/progs/xdp_metadata.c > > +++ b/tools/testing/selftests/bpf/progs/xdp_metadata.c > [...] > > @@ -19,10 +24,25 @@ struct { > > __type(value, __u32); > > } prog_arr SEC(".maps"); > > > > +struct { > > + __uint(type, BPF_MAP_TYPE_RINGBUF); > > + __uint(max_entries, 10); > > +} tx_compl_buf SEC(".maps"); > > + > > +__u64 pkts_fail_tx = 0; > > + > > +int ifindex = -1; > > +__u64 net_cookie = -1; > > + > > extern int bpf_xdp_metadata_rx_timestamp(const struct xdp_md *ctx, > > __u64 *timestamp) __ksym; > > extern int bpf_xdp_metadata_rx_hash(const struct xdp_md *ctx, __u32 *hash, > > enum xdp_rss_hash_type *rss_type) __ksym; > > +extern int bpf_devtx_sb_request_timestamp(const struct devtx_frame *ctx) __ksym; > > +extern int bpf_devtx_cp_timestamp(const struct devtx_frame *ctx, __u64 *timestamp) __ksym; > > + > > +extern int bpf_devtx_sb_attach(int ifindex, int prog_fd) __ksym; > > +extern int bpf_devtx_cp_attach(int ifindex, int prog_fd) __ksym; > > > > SEC("xdp") > > int rx(struct xdp_md *ctx) > > @@ -61,4 +81,102 @@ int rx(struct xdp_md *ctx) > > return bpf_redirect_map(&xsk, ctx->rx_queue_index, XDP_PASS); > > } > > > > +static inline int verify_frame(const struct devtx_frame *frame) > > +{ > > + struct ethhdr eth = {}; > > + > > + /* all the pointers are set up correctly */ > > + if (!frame->data) > > + return -1; > > + if (!frame->sinfo) > > + return -1; > > + > > + /* can get to the frags */ > > + if (frame->sinfo->nr_frags != 0) > > + return -1; > > + if (frame->sinfo->frags[0].bv_page != 0) > > + return -1; > > + if (frame->sinfo->frags[0].bv_len != 0) > > + return -1; > > + if (frame->sinfo->frags[0].bv_offset != 0) > > + return -1; > > + > > + /* the data has something that looks like ethernet */ > > + if (frame->len != 46) > > + return -1; > > + bpf_probe_read_kernel(ð, sizeof(eth), frame->data); > > + > > + if (eth.h_proto != bpf_htons(ETH_P_IP)) > > + return -1; > > + > > + return 0; > > +} > > + > > +SEC("fentry/veth_devtx_submit") > > +int BPF_PROG(tx_submit, const struct devtx_frame *frame) > > +{ > > + struct xdp_tx_meta meta = {}; > > + int ret; > > + > > + if (frame->netdev->ifindex != ifindex) > > + return 0; > > + if (frame->netdev->nd_net.net->net_cookie != net_cookie) > > + return 0; > > + if (frame->meta_len != TX_META_LEN) > > + return 0; > > + > > + bpf_probe_read_kernel(&meta, sizeof(meta), frame->data - TX_META_LEN); > > + if (!meta.request_timestamp) > > + return 0; > > + > > + ret = 
verify_frame(frame); > > + if (ret < 0) { > > + __sync_add_and_fetch(&pkts_fail_tx, 1); > > + return 0; > > + } > > + > > + ret = bpf_devtx_sb_request_timestamp(frame); > > My original design thoughts were that BPF-progs would write into > metadata area, with the intend that at TX-complete we can access this > metadata area again. > > In this case with request_timestamp it would make sense to me, to store > a sequence number (+ the TX-queue number), such that program code can > correlate on complete event. Yeah, we can probably follow up on that. I'm trying to start with a read-only path for now. Can we expose metadata mutating operations via some new kfunc helpers? Something that returns a ptr/dynptr to the metadata portion? > Like xdp_hw_metadata example, I would likely also to add a software > timestamp, what I could check at TX complete hook. > > > + if (ret < 0) { > > + __sync_add_and_fetch(&pkts_fail_tx, 1); > > + return 0; > > + } > > + > > + return 0; > > +} > > + > > +SEC("fentry/veth_devtx_complete") > > +int BPF_PROG(tx_complete, const struct devtx_frame *frame) > > +{ > > + struct xdp_tx_meta meta = {}; > > + struct devtx_sample *sample; > > + int ret; > > + > > + if (frame->netdev->ifindex != ifindex) > > + return 0; > > + if (frame->netdev->nd_net.net->net_cookie != net_cookie) > > + return 0; > > + if (frame->meta_len != TX_META_LEN) > > + return 0; > > + > > + bpf_probe_read_kernel(&meta, sizeof(meta), frame->data - TX_META_LEN); > > + if (!meta.request_timestamp) > > + return 0; > > + > > + ret = verify_frame(frame); > > + if (ret < 0) { > > + __sync_add_and_fetch(&pkts_fail_tx, 1); > > + return 0; > > + } > > + > > + sample = bpf_ringbuf_reserve(&tx_compl_buf, sizeof(*sample), 0); > > + if (!sample) > > + return 0; > > Sending this via a ringbuffer to userspace, will make it hard to > correlate. (For AF_XDP it would help a little to add the TX-queue > number, as this hook isn't queue bound but AF_XDP is). Agreed. I was looking into putting the metadata back into the ring initially. It's somewhat doable for zero-copy, but needs some special care for copy mode. So I've decided not to over-complicate the series and land the read-only hooks at least. Does it sound fair? We can allow mutating metadata separately. > > + > > + sample->timestamp_retval = bpf_devtx_cp_timestamp(frame, &sample->timestamp); > > + > > I were expecting to see, information being written into the metadata > area of the frame, such that AF_XDP completion-queue handling can > extract this obtained timestamp. SG, will add! > > + bpf_ringbuf_submit(sample, 0); > > + > > + return 0; > > +} > > + > > char _license[] SEC("license") = "GPL"; > > diff --git a/tools/testing/selftests/bpf/xdp_metadata.h b/tools/testing/selftests/bpf/xdp_metadata.h > > index 938a729bd307..e410f2b95e64 100644 > > --- a/tools/testing/selftests/bpf/xdp_metadata.h > > +++ b/tools/testing/selftests/bpf/xdp_metadata.h > > @@ -18,3 +18,17 @@ struct xdp_meta { > > __s32 rx_hash_err; > > }; > > }; > > + > > +struct devtx_sample { > > + int timestamp_retval; > > + __u64 timestamp; > > +}; > > + > > +#define TX_META_LEN 8 > > Very static design. > > > + > > +struct xdp_tx_meta { > > + __u8 request_timestamp; > > + __u8 padding0; > > + __u16 padding1; > > + __u32 padding2; > > +}; > > padding2 could be a btf_id for creating a more flexible design. Right, up to the programs on how to make it more flexible (same as rx), will add more on that in your other reply. ^ permalink raw reply [flat|nested] 20+ messages in thread
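Once the completion hook writes the timestamp back into the metadata area as agreed above, userspace could pick it up straight off the AF_XDP completion ring. A sketch using the libbpf/libxdp xsk helpers, assuming the extended xdp_tx_meta layout with an hw_tstamp field and an xsk_umem_info wrapper like the one in the xdpsock sample:

#include <stdio.h>
#include <linux/types.h>
#include <xdp/xsk.h>	/* xsk_ring_cons__* helpers */

struct xsk_umem_info {
	struct xsk_ring_prod fq;
	struct xsk_ring_cons cq;
	struct xsk_umem *umem;
	void *buffer;
};

static void drain_tx_completions(struct xsk_umem_info *umem)
{
	__u32 idx, i;
	__u32 n = xsk_ring_cons__peek(&umem->cq, 64, &idx);

	for (i = 0; i < n; i++) {
		__u64 addr = *xsk_ring_cons__comp_addr(&umem->cq, idx + i);
		struct xdp_tx_meta *meta = (struct xdp_tx_meta *)
			((char *)xsk_umem__get_data(umem->buffer, addr) -
			 sizeof(struct xdp_tx_meta));

		printf("tx complete addr=%llu hw_tstamp=%llu\n",
		       (unsigned long long)addr,
		       (unsigned long long)meta->hw_tstamp);
	}
	if (n)
		xsk_ring_cons__release(&umem->cq, n);
}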
end of thread, other threads:[~2023-06-30 9:19 UTC | newest] Thread overview: 20+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- [not found] <20230621170244.1283336-1-sdf@google.com> 2023-06-22 8:41 ` [xdp-hints] Re: [RFC bpf-next v2 00/11] bpf: Netdev TX metadata Jesper Dangaard Brouer 2023-06-22 17:55 ` Stanislav Fomichev [not found] ` <20230621170244.1283336-4-sdf@google.com> 2023-06-22 9:11 ` [xdp-hints] Re: [RFC bpf-next v2 03/11] xsk: Support XDP_TX_METADATA_LEN Jesper D. Brouer 2023-06-22 17:55 ` Stanislav Fomichev 2023-06-23 10:24 ` Jesper Dangaard Brouer 2023-06-23 17:41 ` Stanislav Fomichev 2023-06-24 9:02 ` Jesper Dangaard Brouer 2023-06-26 17:00 ` Stanislav Fomichev 2023-06-28 8:09 ` Magnus Karlsson 2023-06-28 18:49 ` Stanislav Fomichev 2023-06-29 6:15 ` Magnus Karlsson 2023-06-29 11:30 ` Toke Høiland-Jørgensen 2023-06-29 11:48 ` Magnus Karlsson 2023-06-29 12:01 ` Toke Høiland-Jørgensen 2023-06-29 16:21 ` Stanislav Fomichev 2023-06-29 20:58 ` Toke Høiland-Jørgensen 2023-06-30 6:22 ` Magnus Karlsson 2023-06-30 9:19 ` Toke Høiland-Jørgensen [not found] ` <20230621170244.1283336-10-sdf@google.com> 2023-06-23 11:12 ` [xdp-hints] Re: [RFC bpf-next v2 09/11] selftests/bpf: Extend xdp_metadata with devtx kfuncs Jesper D. Brouer 2023-06-23 17:40 ` Stanislav Fomichev