[xdp-hints] [RFC bpf-next v2 00/14] xdp: hints via kfuncs


* [xdp-hints] [RFC bpf-next v2 00/14] xdp: hints via kfuncs
@ 2022-11-04  3:25 Stanislav Fomichev
  2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 01/14] bpf: Introduce bpf_patch Stanislav Fomichev
                   ` (13 more replies)
  0 siblings, 14 replies; 66+ messages in thread
From: Stanislav Fomichev @ 2022-11-04  3:25 UTC
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

Changes since the original RFC:

- 'struct bpf_patch' to more easily manage insn arrays

  While working on longer unrolled functions I've bumped into the verifier's
  insn_buf[16] limit. Instead of increasing it, I've added a simple
  abstraction that handles resizing.

  *insn++ = BPF_XXX_XXX();
  *insn++ = BPF_YYY_YYY();

  vs

  bpf_patch_append(patch,
          BPF_XXX_XXX(),
          BPF_YYY_YYY(),
  );

  There are also some tricks where we resolve BPF_JMP_IMM(op, dst, imm,
  S16_MAX) to the real end of the program; mostly to make it easy to generate
  the following:

  if (some_condition) {
          if (some_other_condition) {
                  // insns
          }
  }

- Drop xdp_buff->priv in favor of container_of (Jakub)

  The drivers might need to maintain more data, so instead of
  constraining them to a single ->priv pointer in xdp_buff, use container_of
  and require the users to define the outer struct. xdp_buff should
  be the first member.

  Each driver now has two patches: the first one introduces the new struct
  wrapper; the second one does the actual work.

- bpf_xdp_metadata_export_to_skb (Toke)

  To solve the case where we're constructing skb from a redirected
  frame. bpf_xdp_metadata_export_to_skb is an unrolled kfunc
  that prepares 'struct xdp_to_skb_metadata' in the metadata.
  The layout is randomized and it has a randomized magic number
  to make sure userspace doesn't depend on it. I'm not sure how
  strict we should be here. Toke/Jesper seem to be ok with
  userspace depending on this as long as it reads the
  layout via BTF, so maybe having a fixed magic is ok/better here?

  Later, at skb_metadata_set time, we call into
  skb_metadata_import_from_xdp that tries to parse this fixed format
  and extract relevant metadata.

  Note that because we are unrolling bpf_xdp_metadata_export_to_skb,
  we have to constrain other kfuncs to R2-R5 only; I've added the
  reasoning to the comments. It might be a better idea to do a real
  kfunc call (but we have to generate this kfunc anyway)?

- helper to make it easier to call kernel code from unrolled kfuncs

  Since we're unrolling, we're running in a somewhat constrained
  environment. R6-R9 belong to the real callee, so we can't use them
  to stash our R1-R5. We also can't use stack in any way. Another
  constraint comes from bpf_xdp_metadata_export_to_skb which
  might call several metadata kfuncs and wants its R1 to be preserved.

  Thus, we add xdp_kfunc_call_preserving_r1, which generates the
  bytecode to preserve r1 somewhere in the user-supplied context.

  Again, we have to do this because we unroll bpf_xdp_metadata_export_to_skb.

- mlx4/bnxt/ice examples (Jesper)

  ice is the only one where it seems feasible to unroll. The code is
  untested, but at least it shows that it's somewhat possible to
  get to the timestamp in our R2-R5-constrained environment.

  mlx4/bnxt do spinlocks/seqlocks, so I'm just calling into the kernel
  code.

- Unroll default implementation to return false/0/NULL

  The original RFC left the default kfunc calls in place when the driver
  doesn't do the unrolling. Here, instead, we unroll to a single 'return 0'
  instruction.

- Drop prog_ifindex libbpf patch, use bpf_program__set_ifindex instead (Andrii)

- Keep returning metadata by value instead of using a pointer

  I've tried using the pointer, it works currently, but it requires an extra
  argument per commit eb1f7f71c126 ("bpf/verifier: allow kfunc to return
  an allocated mem"). The requirement can probably be lifted for our
  case, but I'm not sure it's necessary for now.

  While adding rx_timestamp support for the drivers, it turns out
  we never really return the raw pointer to the descriptor field. We read
  the descriptor field, do shifts/multiplies, convert to kernel clock,
  etc/etc. So returning a value instead of a pointer seems more logical,
  at least for the rx timestamp case. For the other types of metadata,
  we might reconsider.

- rename bpf_xdp_metadata_have_<xxx> to bpf_xdp_metadata_<xxx>_supported

  Spotted in one of Toke's emails. Seems like it better conveys
  the intent that it actually tests that the device supports the
  metadata, not that the particular packet has the metadata.
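
The S16_MAX jump placeholder mentioned in the bpf_patch item above can be
illustrated with a small userspace sketch. It is not the code from this
series: 'struct demo_insn', its 'is_jmp' flag and demo_resolve_jumps() are
hypothetical names, and the only thing it assumes about BPF is that a jump
offset is relative to the instruction following the jump.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder offset, standing in for the S16_MAX marker. */
#define DEMO_JMP_TO_END INT16_MAX

/* Hypothetical, heavily simplified instruction. */
struct demo_insn {
	uint8_t is_jmp;	/* 1 for a jump, 0 for anything else */
	int16_t off;	/* jump offset, relative to the next instruction */
};

/* Rewrite every placeholder jump so that it lands just past the last
 * instruction of the generated sequence. Jumps with a real offset are
 * left untouched.
 */
static void demo_resolve_jumps(struct demo_insn *insn, size_t len)
{
	for (size_t i = 0; i < len; i++) {
		if (insn[i].is_jmp && insn[i].off == DEMO_JMP_TO_END)
			insn[i].off = (int16_t)(len - i - 1);
	}
}
```

With this, nested conditions can all be emitted as "jump to end if the
check fails" without knowing the final length while generating them.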

The following is unchanged since the original RFC:

This is an RFC for the alternative approach suggested by Martin and
Jakub. I've tried to CC most of the people from the original discussion,
feel free to add more if you think I've missed somebody.

Summary:
- add new BPF_F_XDP_HAS_METADATA program flag and abuse
  attr->prog_ifindex to pass target device ifindex at load time
- at load time, find appropriate ndo_unroll_kfunc and call
  it to unroll/inline kfuncs; kfuncs have the default "safe"
  implementation if unrolling is not supported by a particular
  device
- rewrite xskxceiver test to use C bpf program and extend
  it to export rx_timestamp (plus add rx timestamp to veth driver)

I've intentionally kept it small and hacky to see whether the approach is
workable or not.

Pros:
- we avoid BTF complexity; the BPF programs themselves are now responsible
  for agreeing on the metadata layout with the AF_XDP consumer
- the metadata is free if not used
- the metadata should, in theory, be cheap if used; kfuncs should be
  unrolled to the same code as if the metadata was pre-populated and
  passed with a BTF id
- it's not all or nothing; users can use a small subset of metadata which
  is more efficient than the BTF id approach where all metadata has to be
  exposed for every frame (and selectively consumed by the users)

Cons:
- forwarding has to be handled explicitly; the BPF programs have to
  agree on the metadata layout (IOW, the forwarding program
  has to be aware of the final AF_XDP consumer metadata layout)
- TX picture is not clear; but it's not clear with BTF ids either;
  I think we've agreed that just reusing whatever we have at RX
  won't fly at TX; seems like a TX XDP program might be the answer
  here? (with another set of tx kfuncs to "expose" bpf/af_xdp metadata
  back into the kernel)
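
To make the "programs agree on the metadata layout" point concrete, here
is a minimal userspace sketch. 'struct demo_meta' and both helpers are
hypothetical and not part of this series; the sketch only assumes that
the metadata sits immediately before the packet data, and it models the
frame as a plain byte buffer.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout that producer and consumer agree on privately. */
struct demo_meta {
	uint64_t rx_timestamp;
};

/* Producer side: stands in for the XDP program storing the value it
 * got from a metadata kfunc into the area right before the payload.
 */
static void demo_store_meta(uint8_t *data, uint64_t rx_timestamp)
{
	struct demo_meta meta = { .rx_timestamp = rx_timestamp };

	memcpy(data - sizeof(meta), &meta, sizeof(meta));
}

/* Consumer side: stands in for the AF_XDP application reading the
 * same struct back from the same place.
 */
static uint64_t demo_load_meta(const uint8_t *data)
{
	struct demo_meta meta;

	memcpy(&meta, data - sizeof(meta), sizeof(meta));
	return meta.rx_timestamp;
}
```

Nothing in the kernel knows about 'struct demo_meta' here, which is both
the pro (no BTF involved) and the con (a forwarding program has to know
the final consumer's layout) listed above.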

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org

Stanislav Fomichev (14):
  bpf: Introduce bpf_patch
  bpf: Support inlined/unrolled kfuncs for xdp metadata
  veth: Introduce veth_xdp_buff wrapper for xdp_buff
  veth: Support rx timestamp metadata for xdp
  selftests/bpf: Verify xdp_metadata xdp->af_xdp path
  xdp: Carry over xdp metadata into skb context
  selftests/bpf: Verify xdp_metadata xdp->skb path
  bpf: Helper to simplify calling kernel routines from unrolled kfuncs
  ice: Introduce ice_xdp_buff wrapper for xdp_buff
  ice: Support rx timestamp metadata for xdp
  mlx4: Introduce mlx4_xdp_buff wrapper for xdp_buff
  mlx4: Support rx timestamp metadata for xdp
  bnxt: Introduce bnxt_xdp_buff wrapper for xdp_buff
  bnxt: Support rx timestamp metadata for xdp

 drivers/net/ethernet/broadcom/bnxt/bnxt.c     |  73 +++-
 drivers/net/ethernet/intel/ice/ice.h          |   5 +
 drivers/net/ethernet/intel/ice/ice_main.c     |   1 +
 drivers/net/ethernet/intel/ice/ice_txrx.c     | 105 ++++-
 .../net/ethernet/mellanox/mlx4/en_netdev.c    |   1 +
 drivers/net/ethernet/mellanox/mlx4/en_rx.c    |  66 ++-
 drivers/net/veth.c                            |  89 ++--
 include/linux/bpf.h                           |   1 +
 include/linux/bpf_patch.h                     |  29 ++
 include/linux/btf.h                           |   1 +
 include/linux/btf_ids.h                       |   4 +
 include/linux/mlx4/device.h                   |   7 +
 include/linux/netdevice.h                     |   5 +
 include/linux/skbuff.h                        |   4 +
 include/net/xdp.h                             |  41 ++
 include/uapi/linux/bpf.h                      |   5 +
 kernel/bpf/Makefile                           |   2 +-
 kernel/bpf/bpf_patch.c                        |  81 ++++
 kernel/bpf/syscall.c                          |  28 +-
 kernel/bpf/verifier.c                         |  85 ++++
 net/core/dev.c                                |   7 +
 net/core/skbuff.c                             |  25 ++
 net/core/xdp.c                                | 165 +++++++-
 tools/include/uapi/linux/bpf.h                |   5 +
 tools/testing/selftests/bpf/Makefile          |   2 +-
 .../selftests/bpf/prog_tests/xdp_metadata.c   | 394 ++++++++++++++++++
 .../selftests/bpf/progs/xdp_metadata.c        |  78 ++++
 27 files changed, 1244 insertions(+), 65 deletions(-)
 create mode 100644 include/linux/bpf_patch.h
 create mode 100644 kernel/bpf/bpf_patch.c
 create mode 100644 tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
 create mode 100644 tools/testing/selftests/bpf/progs/xdp_metadata.c

-- 
2.38.1.431.g37b22c650d-goog


* [xdp-hints] [RFC bpf-next v2 01/14] bpf: Introduce bpf_patch
  2022-11-04  3:25 [xdp-hints] [RFC bpf-next v2 00/14] xdp: hints via kfuncs Stanislav Fomichev
@ 2022-11-04  3:25 ` Stanislav Fomichev
  2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 02/14] bpf: Support inlined/unrolled kfuncs for xdp metadata Stanislav Fomichev
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 66+ messages in thread
From: Stanislav Fomichev @ 2022-11-04  3:25 UTC
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

A simple abstraction around a series of instructions that transparently
handles resizing.

Currently, we have insn_buf[16] in convert_ctx_accesses which might
not be enough for xdp kfuncs.

If we find this abstraction helpful, we might convert existing
insn_buf[16] to it in the future.
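
For readers who want to see the intended semantics in isolation, below
is a userspace sketch of the same idea: a growable instruction array
with a sticky error. It is not the kernel code; the demo_* names are
hypothetical, realloc() stands in for krealloc_array(), and the
instruction struct is a simplified stand-in for 'struct bpf_insn'.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct bpf_insn. */
struct demo_insn {
	uint8_t code;
	uint8_t regs;
	int16_t off;
	int32_t imm;
};

struct demo_patch {
	struct demo_insn *insn;
	size_t capacity;
	size_t len;
	int err;
};

/* Append one instruction. Capacity starts at 16 and doubles when full.
 * After the first allocation failure every further call is a no-op, so
 * callers only need to check the error once at the end.
 */
static void demo_patch_append(struct demo_patch *patch, struct demo_insn insn)
{
	if (patch->err)
		return;

	if (patch->len == patch->capacity) {
		size_t cap = patch->capacity ? patch->capacity * 2 : 16;
		void *arr = realloc(patch->insn, cap * sizeof(insn));

		if (!arr) {
			/* The old buffer is still valid; free it later. */
			patch->err = -ENOMEM;
			return;
		}
		patch->insn = arr;
		patch->capacity = cap;
	}
	patch->insn[patch->len++] = insn;
}

static void demo_patch_free(struct demo_patch *patch)
{
	free(patch->insn);
	patch->insn = NULL;
	patch->len = patch->capacity = 0;
}
```

The sticky error is the design choice worth noting: a long sequence of
appends needs no per-call error handling, only one check of 'err' before
the result is used.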

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 include/linux/bpf_patch.h | 27 +++++++++++++++++++++
 kernel/bpf/Makefile       |  2 +-
 kernel/bpf/bpf_patch.c    | 51 +++++++++++++++++++++++++++++++++++++++
 3 files changed, 79 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/bpf_patch.h
 create mode 100644 kernel/bpf/bpf_patch.c

diff --git a/include/linux/bpf_patch.h b/include/linux/bpf_patch.h
new file mode 100644
index 000000000000..81ff738eef8d
--- /dev/null
+++ b/include/linux/bpf_patch.h
@@ -0,0 +1,27 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef _LINUX_BPF_PATCH_H
+#define _LINUX_BPF_PATCH_H 1
+
+#include <linux/bpf.h>
+
+struct bpf_patch {
+	struct bpf_insn *insn;
+	size_t capacity;
+	size_t len;
+	int err;
+};
+
+void bpf_patch_free(struct bpf_patch *patch);
+size_t bpf_patch_len(const struct bpf_patch *patch);
+int bpf_patch_err(const struct bpf_patch *patch);
+void __bpf_patch_append(struct bpf_patch *patch, struct bpf_insn insn);
+struct bpf_insn *bpf_patch_data(const struct bpf_patch *patch);
+
+#define bpf_patch_append(patch, ...) ({ \
+	struct bpf_insn insn[] = { __VA_ARGS__ }; \
+	int i; \
+	for (i = 0; i < ARRAY_SIZE(insn); i++) \
+		__bpf_patch_append(patch, insn[i]); \
+})
+
+#endif
diff --git a/kernel/bpf/Makefile b/kernel/bpf/Makefile
index 3a12e6b400a2..5724f36292a5 100644
--- a/kernel/bpf/Makefile
+++ b/kernel/bpf/Makefile
@@ -13,7 +13,7 @@ obj-$(CONFIG_BPF_SYSCALL) += bpf_local_storage.o bpf_task_storage.o
 obj-${CONFIG_BPF_LSM}	  += bpf_inode_storage.o
 obj-$(CONFIG_BPF_SYSCALL) += disasm.o
 obj-$(CONFIG_BPF_JIT) += trampoline.o
-obj-$(CONFIG_BPF_SYSCALL) += btf.o memalloc.o
+obj-$(CONFIG_BPF_SYSCALL) += btf.o memalloc.o bpf_patch.o
 obj-$(CONFIG_BPF_JIT) += dispatcher.o
 ifeq ($(CONFIG_NET),y)
 obj-$(CONFIG_BPF_SYSCALL) += devmap.o
diff --git a/kernel/bpf/bpf_patch.c b/kernel/bpf/bpf_patch.c
new file mode 100644
index 000000000000..82a10bf5624a
--- /dev/null
+++ b/kernel/bpf/bpf_patch.c
@@ -0,0 +1,51 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include <linux/bpf_patch.h>
+
+void bpf_patch_free(struct bpf_patch *patch)
+{
+	kfree(patch->insn);
+}
+
+size_t bpf_patch_len(const struct bpf_patch *patch)
+{
+	return patch->len;
+}
+
+int bpf_patch_err(const struct bpf_patch *patch)
+{
+	return patch->err;
+}
+
+void __bpf_patch_append(struct bpf_patch *patch, struct bpf_insn insn)
+{
+	void *arr;
+
+	if (patch->err)
+		return;
+
+	if (patch->len + 1 > patch->capacity) {
+		if (!patch->capacity)
+			patch->capacity = 16;
+		else
+			patch->capacity *= 2;
+
+		arr = krealloc_array(patch->insn, patch->capacity, sizeof(insn), GFP_KERNEL);
+		if (!arr) {
+			patch->err = -ENOMEM;
+			kfree(patch->insn);
+			patch->insn = NULL;
+			return;
+		}
+
+		patch->insn = arr;
+	}
+
+	patch->insn[patch->len++] = insn;
+}
+EXPORT_SYMBOL(__bpf_patch_append);
+
+struct bpf_insn *bpf_patch_data(const struct bpf_patch *patch)
+{
+	return patch->insn;
+}
-- 
2.38.1.431.g37b22c650d-goog



* [xdp-hints] [RFC bpf-next v2 02/14] bpf: Support inlined/unrolled kfuncs for xdp metadata
  2022-11-04  3:25 [xdp-hints] [RFC bpf-next v2 00/14] xdp: hints via kfuncs Stanislav Fomichev
  2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 01/14] bpf: Introduce bpf_patch Stanislav Fomichev
@ 2022-11-04  3:25 ` Stanislav Fomichev
  2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 03/14] veth: Introduce veth_xdp_buff wrapper for xdp_buff Stanislav Fomichev
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 66+ messages in thread
From: Stanislav Fomichev @ 2022-11-04  3:25 UTC
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

Kfuncs have to be defined with KF_UNROLL for an attempted unroll.
For now, only XDP programs can have their kfuncs unrolled, but
we can extend this later on if more programs would like to use it.

For XDP, we define a new kfunc set (xdp_metadata_kfunc_ids) which
implements all possible metadata kfuncs. Not all devices have to
implement them. If unrolling is not supported by the target device,
the default implementation is called instead. The default
implementation is unconditionally unrolled to 'return false/0/NULL'
for now.

Upon loading, if BPF_F_XDP_HAS_METADATA is passed via prog_flags,
we treat prog_ifindex as the target device for kfunc unrolling.
net_device_ops gains a new ndo_unroll_kfunc which does the actual
dirty work per device.

The kfunc unrolling itself largely follows the existing map_gen_lookup
unrolling example, so there is nothing new here.
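
The "ask the driver, otherwise fall back to return 0" flow described
above can be modelled in userspace as follows. This is a sketch, not the
verifier code: all demo_* names are hypothetical, the patch is a fixed
array for brevity, and treating a missing callback the same as "emitted
nothing" is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>

enum demo_kfunc {
	DEMO_KFUNC_RX_TIMESTAMP_SUPPORTED,
	DEMO_KFUNC_RX_TIMESTAMP,
};

enum demo_op { DEMO_MOV_R0_IMM };

struct demo_insn {
	enum demo_op op;
	int imm;
};

struct demo_patch {
	struct demo_insn insn[8];
	size_t len;
};

typedef void (*demo_unroll_fn)(enum demo_kfunc id, struct demo_patch *patch);

/* Hypothetical driver that only knows how to inline the "supported"
 * query and leaves every other kfunc alone.
 */
static void demo_driver_unroll(enum demo_kfunc id, struct demo_patch *patch)
{
	if (id == DEMO_KFUNC_RX_TIMESTAMP_SUPPORTED)
		patch->insn[patch->len++] = (struct demo_insn){ DEMO_MOV_R0_IMM, 1 };
}

/* Give the driver the first shot; if it emits nothing, substitute the
 * default body, a single "r0 = 0".
 */
static void demo_unroll_kfunc(demo_unroll_fn unroll, enum demo_kfunc id,
			      struct demo_patch *patch)
{
	patch->len = 0;
	if (unroll)
		unroll(id, patch);
	if (patch->len == 0)
		patch->insn[patch->len++] = (struct demo_insn){ DEMO_MOV_R0_IMM, 0 };
}
```

Either way the call site ends up replaced by straight-line instructions,
so a BPF program that tests the "supported" kfunc can have the unused
branch treated as dead code.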

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 include/linux/bpf.h            |  1 +
 include/linux/btf.h            |  1 +
 include/linux/btf_ids.h        |  4 ++
 include/linux/netdevice.h      |  5 +++
 include/net/xdp.h              | 24 ++++++++++++
 include/uapi/linux/bpf.h       |  5 +++
 kernel/bpf/syscall.c           | 28 +++++++++++++-
 kernel/bpf/verifier.c          | 67 ++++++++++++++++++++++++++++++++++
 net/core/dev.c                 |  7 ++++
 net/core/xdp.c                 | 39 ++++++++++++++++++++
 tools/include/uapi/linux/bpf.h |  5 +++
 11 files changed, 185 insertions(+), 1 deletion(-)

diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 8d948bfcb984..54b353a88a03 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1188,6 +1188,7 @@ struct bpf_prog_aux {
 		struct work_struct work;
 		struct rcu_head	rcu;
 	};
+	const struct net_device_ops *xdp_kfunc_ndo;
 };
 
 struct bpf_prog {
diff --git a/include/linux/btf.h b/include/linux/btf.h
index f9aababc5d78..23ad5f8313e4 100644
--- a/include/linux/btf.h
+++ b/include/linux/btf.h
@@ -51,6 +51,7 @@
 #define KF_TRUSTED_ARGS (1 << 4) /* kfunc only takes trusted pointer arguments */
 #define KF_SLEEPABLE    (1 << 5) /* kfunc may sleep */
 #define KF_DESTRUCTIVE  (1 << 6) /* kfunc performs destructive actions */
+#define KF_UNROLL       (1 << 7) /* kfunc unrolling can be attempted */
 
 /*
  * Return the name of the passed struct, if exists, or halt the build if for
diff --git a/include/linux/btf_ids.h b/include/linux/btf_ids.h
index c9744efd202f..eb448e9c79bb 100644
--- a/include/linux/btf_ids.h
+++ b/include/linux/btf_ids.h
@@ -195,6 +195,10 @@ asm(							\
 __BTF_ID_LIST(name, local)				\
 __BTF_SET8_START(name, local)
 
+#define BTF_SET8_START_GLOBAL(name)			\
+__BTF_ID_LIST(name, global)				\
+__BTF_SET8_START(name, global)
+
 #define BTF_SET8_END(name)				\
 asm(							\
 ".pushsection " BTF_IDS_SECTION ",\"a\";      \n"	\
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 4b5052db978f..33171e5cf83a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -73,6 +73,8 @@ struct udp_tunnel_info;
 struct udp_tunnel_nic_info;
 struct udp_tunnel_nic;
 struct bpf_prog;
+struct bpf_insn;
+struct bpf_patch;
 struct xdp_buff;
 
 void synchronize_net(void);
@@ -1609,6 +1611,9 @@ struct net_device_ops {
 	ktime_t			(*ndo_get_tstamp)(struct net_device *dev,
 						  const struct skb_shared_hwtstamps *hwtstamps,
 						  bool cycles);
+	void			(*ndo_unroll_kfunc)(const struct bpf_prog *prog,
+						    u32 func_id,
+						    struct bpf_patch *patch);
 };
 
 /**
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 55dbc68bfffc..2a82a98f2f9f 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -7,6 +7,7 @@
 #define __LINUX_NET_XDP_H__
 
 #include <linux/skbuff.h> /* skb_shared_info */
+#include <linux/btf_ids.h> /* btf_id_set8 */
 
 /**
  * DOC: XDP RX-queue information
@@ -409,4 +410,27 @@ void xdp_attachment_setup(struct xdp_attachment_info *info,
 
 #define DEV_MAP_BULK_SIZE XDP_BULK_QUEUE_SIZE
 
+#define XDP_METADATA_KFUNC_xxx	\
+	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED, \
+			   bpf_xdp_metadata_rx_timestamp_supported) \
+	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_TIMESTAMP, \
+			   bpf_xdp_metadata_rx_timestamp) \
+
+enum {
+#define XDP_METADATA_KFUNC(name, str) name,
+XDP_METADATA_KFUNC_xxx
+#undef XDP_METADATA_KFUNC
+MAX_XDP_METADATA_KFUNC,
+};
+
+#ifdef CONFIG_DEBUG_INFO_BTF
+extern struct btf_id_set8 xdp_metadata_kfunc_ids;
+static inline u32 xdp_metadata_kfunc_id(int id)
+{
+	return xdp_metadata_kfunc_ids.pairs[id].id;
+}
+#else
+static inline u32 xdp_metadata_kfunc_id(int id) { return 0; }
+#endif
+
 #endif /* __LINUX_NET_XDP_H__ */
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 94659f6b3395..6938fc4f1ec5 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1156,6 +1156,11 @@ enum bpf_link_type {
  */
 #define BPF_F_XDP_HAS_FRAGS	(1U << 5)
 
+/* If BPF_F_XDP_HAS_METADATA is used in BPF_PROG_LOAD command, the loaded
+ * program becomes device-bound but can access its XDP metadata.
+ */
+#define BPF_F_XDP_HAS_METADATA	(1U << 6)
+
 /* link_create.kprobe_multi.flags used in LINK_CREATE command for
  * BPF_TRACE_KPROBE_MULTI attach type to create return probe.
  */
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 5887592eeb93..4b76eee03a10 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2461,6 +2461,20 @@ static bool is_perfmon_prog_type(enum bpf_prog_type prog_type)
 /* last field in 'union bpf_attr' used by this command */
 #define	BPF_PROG_LOAD_LAST_FIELD core_relo_rec_size
 
+static int xdp_resolve_netdev(struct bpf_prog *prog, int ifindex)
+{
+	struct net *net = current->nsproxy->net_ns;
+	struct net_device *dev;
+
+	for_each_netdev(net, dev) {
+		if (dev->ifindex == ifindex) {
+			prog->aux->xdp_kfunc_ndo = dev->netdev_ops;
+			return 0;
+		}
+	}
+	return -EINVAL;
+}
+
 static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr)
 {
 	enum bpf_prog_type type = attr->prog_type;
@@ -2478,7 +2492,8 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr)
 				 BPF_F_TEST_STATE_FREQ |
 				 BPF_F_SLEEPABLE |
 				 BPF_F_TEST_RND_HI32 |
-				 BPF_F_XDP_HAS_FRAGS))
+				 BPF_F_XDP_HAS_FRAGS |
+				 BPF_F_XDP_HAS_METADATA))
 		return -EINVAL;
 
 	if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) &&
@@ -2566,6 +2581,17 @@ static int bpf_prog_load(union bpf_attr *attr, bpfptr_t uattr)
 	prog->aux->sleepable = attr->prog_flags & BPF_F_SLEEPABLE;
 	prog->aux->xdp_has_frags = attr->prog_flags & BPF_F_XDP_HAS_FRAGS;
 
+	if (attr->prog_flags & BPF_F_XDP_HAS_METADATA) {
+		/* Reuse prog_ifindex to carry request to unroll
+		 * metadata kfuncs.
+		 */
+		prog->aux->offload_requested = false;
+
+		err = xdp_resolve_netdev(prog, attr->prog_ifindex);
+		if (err < 0)
+			goto free_prog;
+	}
+
 	err = security_bpf_prog_alloc(prog->aux);
 	if (err)
 		goto free_prog;
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 82c07fe0bfb1..4e5c5ff35d5f 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -9,6 +9,7 @@
 #include <linux/types.h>
 #include <linux/slab.h>
 #include <linux/bpf.h>
+#include <linux/bpf_patch.h>
 #include <linux/btf.h>
 #include <linux/bpf_verifier.h>
 #include <linux/filter.h>
@@ -13865,6 +13866,45 @@ static int fixup_call_args(struct bpf_verifier_env *env)
 	return err;
 }
 
+static int unroll_kfunc_call(struct bpf_verifier_env *env,
+			     struct bpf_insn *insn,
+			     struct bpf_patch *patch)
+{
+	enum bpf_prog_type prog_type;
+	struct bpf_prog_aux *aux;
+	struct btf *desc_btf;
+	u32 *kfunc_flags;
+	u32 func_id;
+
+	desc_btf = find_kfunc_desc_btf(env, insn->off);
+	if (IS_ERR(desc_btf))
+		return PTR_ERR(desc_btf);
+
+	prog_type = resolve_prog_type(env->prog);
+	func_id = insn->imm;
+
+	kfunc_flags = btf_kfunc_id_set_contains(desc_btf, prog_type, func_id);
+	if (!kfunc_flags)
+		return 0;
+	if (!(*kfunc_flags & KF_UNROLL))
+		return 0;
+	if (prog_type != BPF_PROG_TYPE_XDP)
+		return 0;
+
+	aux = env->prog->aux;
+	if (!aux->xdp_kfunc_ndo)
+		return 0;
+
+	aux->xdp_kfunc_ndo->ndo_unroll_kfunc(env->prog, func_id, patch);
+	if (bpf_patch_len(patch) == 0) {
+		/* Default optimized kfunc implementation that
+		 * returns NULL/0/false.
+		 */
+		bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 0));
+	}
+	return bpf_patch_err(patch);
+}
+
 static int fixup_kfunc_call(struct bpf_verifier_env *env,
 			    struct bpf_insn *insn)
 {
@@ -14028,6 +14068,33 @@ static int do_misc_fixups(struct bpf_verifier_env *env)
 		if (insn->src_reg == BPF_PSEUDO_CALL)
 			continue;
 		if (insn->src_reg == BPF_PSEUDO_KFUNC_CALL) {
+			struct bpf_patch patch = {};
+
+			if (bpf_prog_is_dev_bound(env->prog->aux)) {
+				verbose(env, "no metadata kfuncs offload\n");
+				return -EINVAL;
+			}
+
+			ret = unroll_kfunc_call(env, insn, &patch);
+			if (ret < 0) {
+				verbose(env, "failed to unroll kfunc with func_id=%d\n", insn->imm);
+				return ret;
+			}
+			cnt = bpf_patch_len(&patch);
+			if (cnt) {
+				new_prog = bpf_patch_insn_data(env, i + delta,
+							       bpf_patch_data(&patch),
+							       bpf_patch_len(&patch));
+				bpf_patch_free(&patch);
+				if (!new_prog)
+					return -ENOMEM;
+
+				delta    += cnt - 1;
+				env->prog = prog = new_prog;
+				insn      = new_prog->insnsi + i + delta;
+				continue;
+			}
+
 			ret = fixup_kfunc_call(env, insn);
 			if (ret)
 				return ret;
diff --git a/net/core/dev.c b/net/core/dev.c
index 2e4f1c97b59e..c64d9ea9be3a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -9259,6 +9259,13 @@ static int dev_xdp_attach(struct net_device *dev, struct netlink_ext_ack *extack
 			return -EOPNOTSUPP;
 		}
 
+		if (new_prog &&
+		    new_prog->aux->xdp_kfunc_ndo &&
+		    new_prog->aux->xdp_kfunc_ndo != dev->netdev_ops) {
+			NL_SET_ERR_MSG(extack, "Target device was specified at load time; can only attach to the same device type");
+			return -EINVAL;
+		}
+
 		err = dev_xdp_install(dev, mode, bpf_op, extack, flags, new_prog);
 		if (err)
 			return err;
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 844c9d99dc0e..22f1e44700eb 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -4,6 +4,8 @@
  * Copyright (c) 2017 Jesper Dangaard Brouer, Red Hat Inc.
  */
 #include <linux/bpf.h>
+#include <linux/bpf_patch.h>
+#include <linux/btf_ids.h>
 #include <linux/filter.h>
 #include <linux/types.h>
 #include <linux/mm.h>
@@ -709,3 +711,40 @@ struct xdp_frame *xdpf_clone(struct xdp_frame *xdpf)
 
 	return nxdpf;
 }
+
+/* Indicates whether particular device supports rx_timestamp metadata.
+ * This is an optional helper to support marking some branches as
+ * "dead code" in the BPF programs.
+ */
+noinline int bpf_xdp_metadata_rx_timestamp_supported(const struct xdp_md *ctx)
+{
+	/* payload is ignored, see default case in unroll_kfunc_call */
+	return false;
+}
+
+/* Returns rx_timestamp metadata or 0 when the frame doesn't have it.
+ */
+noinline const __u64 bpf_xdp_metadata_rx_timestamp(const struct xdp_md *ctx)
+{
+	/* payload is ignored, see default case in unroll_kfunc_call */
+	return 0;
+}
+
+#ifdef CONFIG_DEBUG_INFO_BTF
+BTF_SET8_START_GLOBAL(xdp_metadata_kfunc_ids)
+#define XDP_METADATA_KFUNC(name, str) BTF_ID_FLAGS(func, str, KF_RET_NULL | KF_UNROLL)
+XDP_METADATA_KFUNC_xxx
+#undef XDP_METADATA_KFUNC
+BTF_SET8_END(xdp_metadata_kfunc_ids)
+
+static const struct btf_kfunc_id_set xdp_metadata_kfunc_set = {
+	.owner = THIS_MODULE,
+	.set   = &xdp_metadata_kfunc_ids,
+};
+
+static int __init xdp_metadata_init(void)
+{
+	return register_btf_kfunc_id_set(BPF_PROG_TYPE_XDP, &xdp_metadata_kfunc_set);
+}
+late_initcall(xdp_metadata_init);
+#endif
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 94659f6b3395..6938fc4f1ec5 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1156,6 +1156,11 @@ enum bpf_link_type {
  */
 #define BPF_F_XDP_HAS_FRAGS	(1U << 5)
 
+/* If BPF_F_XDP_HAS_METADATA is used in BPF_PROG_LOAD command, the loaded
+ * program becomes device-bound but can access its XDP metadata.
+ */
+#define BPF_F_XDP_HAS_METADATA	(1U << 6)
+
 /* link_create.kprobe_multi.flags used in LINK_CREATE command for
  * BPF_TRACE_KPROBE_MULTI attach type to create return probe.
  */
-- 
2.38.1.431.g37b22c650d-goog



* [xdp-hints] [RFC bpf-next v2 03/14] veth: Introduce veth_xdp_buff wrapper for xdp_buff
  2022-11-04  3:25 [xdp-hints] [RFC bpf-next v2 00/14] xdp: hints via kfuncs Stanislav Fomichev
  2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 01/14] bpf: Introduce bpf_patch Stanislav Fomichev
  2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 02/14] bpf: Support inlined/unrolled kfuncs for xdp metadata Stanislav Fomichev
@ 2022-11-04  3:25 ` Stanislav Fomichev
  2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 04/14] veth: Support rx timestamp metadata for xdp Stanislav Fomichev
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 66+ messages in thread
From: Stanislav Fomichev @ 2022-11-04  3:25 UTC
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

No functional changes. Boilerplate to allow stuffing more data after xdp_buff.
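
As a reminder of why the wrapper is enough, here is a userspace sketch
of the container_of pattern it enables. The demo_* types are
hypothetical stand-ins (the real veth_xdp_buff carries no extra fields
in this patch), and demo_container_of() is a local equivalent of the
kernel macro.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace equivalent of the kernel's container_of() for this sketch. */
#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-in for struct xdp_buff. */
struct demo_xdp_buff {
	void *data;
	void *data_end;
};

/* Driver-private wrapper; the generic buffer is the first member. */
struct demo_drv_xdp_buff {
	struct demo_xdp_buff xdp;
	uint64_t rx_timestamp;	/* hypothetical extra per-packet state */
};

/* Code that is handed only the generic pointer recovers the wrapper
 * and with it the driver-private fields.
 */
static uint64_t demo_rx_timestamp(struct demo_xdp_buff *xdp)
{
	struct demo_drv_xdp_buff *wrapper =
		demo_container_of(xdp, struct demo_drv_xdp_buff, xdp);

	return wrapper->rx_timestamp;
}
```

The generic code keeps passing 'struct xdp_buff *' around unchanged;
only the driver needs to know what surrounds it.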

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 drivers/net/veth.c | 56 +++++++++++++++++++++++++---------------------
 1 file changed, 31 insertions(+), 25 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index b1ed5a93b6c5..917ba57453c1 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -116,6 +116,10 @@ static struct {
 	{ "peer_ifindex" },
 };
 
+struct veth_xdp_buff {
+	struct xdp_buff xdp;
+};
+
 static int veth_get_link_ksettings(struct net_device *dev,
 				   struct ethtool_link_ksettings *cmd)
 {
@@ -592,23 +596,24 @@ static struct xdp_frame *veth_xdp_rcv_one(struct veth_rq *rq,
 	rcu_read_lock();
 	xdp_prog = rcu_dereference(rq->xdp_prog);
 	if (likely(xdp_prog)) {
-		struct xdp_buff xdp;
+		struct veth_xdp_buff vxbuf;
+		struct xdp_buff *xdp = &vxbuf.xdp;
 		u32 act;
 
-		xdp_convert_frame_to_buff(frame, &xdp);
-		xdp.rxq = &rq->xdp_rxq;
+		xdp_convert_frame_to_buff(frame, xdp);
+		xdp->rxq = &rq->xdp_rxq;
 
-		act = bpf_prog_run_xdp(xdp_prog, &xdp);
+		act = bpf_prog_run_xdp(xdp_prog, xdp);
 
 		switch (act) {
 		case XDP_PASS:
-			if (xdp_update_frame_from_buff(&xdp, frame))
+			if (xdp_update_frame_from_buff(xdp, frame))
 				goto err_xdp;
 			break;
 		case XDP_TX:
 			orig_frame = *frame;
-			xdp.rxq->mem = frame->mem;
-			if (unlikely(veth_xdp_tx(rq, &xdp, bq) < 0)) {
+			xdp->rxq->mem = frame->mem;
+			if (unlikely(veth_xdp_tx(rq, xdp, bq) < 0)) {
 				trace_xdp_exception(rq->dev, xdp_prog, act);
 				frame = &orig_frame;
 				stats->rx_drops++;
@@ -619,8 +624,8 @@ static struct xdp_frame *veth_xdp_rcv_one(struct veth_rq *rq,
 			goto xdp_xmit;
 		case XDP_REDIRECT:
 			orig_frame = *frame;
-			xdp.rxq->mem = frame->mem;
-			if (xdp_do_redirect(rq->dev, &xdp, xdp_prog)) {
+			xdp->rxq->mem = frame->mem;
+			if (xdp_do_redirect(rq->dev, xdp, xdp_prog)) {
 				frame = &orig_frame;
 				stats->rx_drops++;
 				goto err_xdp;
@@ -801,7 +806,8 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
 {
 	void *orig_data, *orig_data_end;
 	struct bpf_prog *xdp_prog;
-	struct xdp_buff xdp;
+	struct veth_xdp_buff vxbuf;
+	struct xdp_buff *xdp = &vxbuf.xdp;
 	u32 act, metalen;
 	int off;
 
@@ -815,22 +821,22 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
 	}
 
 	__skb_push(skb, skb->data - skb_mac_header(skb));
-	if (veth_convert_skb_to_xdp_buff(rq, &xdp, &skb))
+	if (veth_convert_skb_to_xdp_buff(rq, xdp, &skb))
 		goto drop;
 
-	orig_data = xdp.data;
-	orig_data_end = xdp.data_end;
+	orig_data = xdp->data;
+	orig_data_end = xdp->data_end;
 
-	act = bpf_prog_run_xdp(xdp_prog, &xdp);
+	act = bpf_prog_run_xdp(xdp_prog, xdp);
 
 	switch (act) {
 	case XDP_PASS:
 		break;
 	case XDP_TX:
-		veth_xdp_get(&xdp);
+		veth_xdp_get(xdp);
 		consume_skb(skb);
-		xdp.rxq->mem = rq->xdp_mem;
-		if (unlikely(veth_xdp_tx(rq, &xdp, bq) < 0)) {
+		xdp->rxq->mem = rq->xdp_mem;
+		if (unlikely(veth_xdp_tx(rq, xdp, bq) < 0)) {
 			trace_xdp_exception(rq->dev, xdp_prog, act);
 			stats->rx_drops++;
 			goto err_xdp;
@@ -839,10 +845,10 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
 		rcu_read_unlock();
 		goto xdp_xmit;
 	case XDP_REDIRECT:
-		veth_xdp_get(&xdp);
+		veth_xdp_get(xdp);
 		consume_skb(skb);
-		xdp.rxq->mem = rq->xdp_mem;
-		if (xdp_do_redirect(rq->dev, &xdp, xdp_prog)) {
+		xdp->rxq->mem = rq->xdp_mem;
+		if (xdp_do_redirect(rq->dev, xdp, xdp_prog)) {
 			stats->rx_drops++;
 			goto err_xdp;
 		}
@@ -862,7 +868,7 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
 	rcu_read_unlock();
 
 	/* check if bpf_xdp_adjust_head was used */
-	off = orig_data - xdp.data;
+	off = orig_data - xdp->data;
 	if (off > 0)
 		__skb_push(skb, off);
 	else if (off < 0)
@@ -871,21 +877,21 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
 	skb_reset_mac_header(skb);
 
 	/* check if bpf_xdp_adjust_tail was used */
-	off = xdp.data_end - orig_data_end;
+	off = xdp->data_end - orig_data_end;
 	if (off != 0)
 		__skb_put(skb, off); /* positive on grow, negative on shrink */
 
 	/* XDP frag metadata (e.g. nr_frags) are updated in eBPF helpers
 	 * (e.g. bpf_xdp_adjust_tail), we need to update data_len here.
 	 */
-	if (xdp_buff_has_frags(&xdp))
+	if (xdp_buff_has_frags(xdp))
 		skb->data_len = skb_shinfo(skb)->xdp_frags_size;
 	else
 		skb->data_len = 0;
 
 	skb->protocol = eth_type_trans(skb, rq->dev);
 
-	metalen = xdp.data - xdp.data_meta;
+	metalen = xdp->data - xdp->data_meta;
 	if (metalen)
 		skb_metadata_set(skb, metalen);
 out:
@@ -898,7 +904,7 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
 	return NULL;
 err_xdp:
 	rcu_read_unlock();
-	xdp_return_buff(&xdp);
+	xdp_return_buff(xdp);
 xdp_xmit:
 	return NULL;
 }
-- 
2.38.1.431.g37b22c650d-goog


* [xdp-hints] [RFC bpf-next v2 04/14] veth: Support rx timestamp metadata for xdp
  2022-11-04  3:25 [xdp-hints] [RFC bpf-next v2 00/14] xdp: hints via kfuncs Stanislav Fomichev
                   ` (2 preceding siblings ...)
  2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 03/14] veth: Introduce veth_xdp_buff wrapper for xdp_buff Stanislav Fomichev
@ 2022-11-04  3:25 ` Stanislav Fomichev
  2022-11-09 11:21   ` [xdp-hints] " Toke Høiland-Jørgensen
  2022-11-10  0:25   ` John Fastabend
  2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 05/14] selftests/bpf: Verify xdp_metadata xdp->af_xdp path Stanislav Fomichev
                   ` (9 subsequent siblings)
  13 siblings, 2 replies; 66+ messages in thread
From: Stanislav Fomichev @ 2022-11-04  3:25 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

xskxceiver conveniently sets up veth pairs, so it seems logical
to use veth as an example for some of the metadata handling.

We timestamp the skb right when we "receive" it, store its
pointer in the new veth_xdp_buff wrapper and generate BPF bytecode to
reach it from the BPF program.

This largely follows the idea of "store some queue context in
the xdp_buff/xdp_frame so the metadata can be reached
from the BPF program".
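
For readers less familiar with the wrapper pattern, here is a minimal
userspace sketch of the idea, not the driver code itself. All *_sketch
names and the field sets are made up for illustration; only the "generic
buffer is the first member, private context follows" layout and the
"return 0 when there is no skb" behaviour are taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace stand-ins for the kernel types; the fields are illustrative. */
struct xdp_buff_sketch {
	void *data;
	void *data_end;
};

struct pkt_sketch {
	uint64_t tstamp;	/* receive timestamp, nanoseconds */
};

/* Driver-private wrapper: the generic buffer must be the first member. */
struct wrapper_sketch {
	struct xdp_buff_sketch xdp;
	struct pkt_sketch *pkt;	/* NULL when no packet context is available */
};

#define container_of_sketch(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Mirrors what the unrolled bytecode computes: 0 when there is no context. */
static uint64_t rx_timestamp_sketch(struct xdp_buff_sketch *xdp)
{
	struct wrapper_sketch *w =
		container_of_sketch(xdp, struct wrapper_sketch, xdp);

	/* A plain pointer cast is only valid while xdp sits at offset 0. */
	assert(offsetof(struct wrapper_sketch, xdp) == 0);

	return w->pkt ? w->pkt->tstamp : 0;
}
```

The caller only ever hands out the inner pointer (&w.xdp); the private
context is recovered from it on demand, which is why the wrapper can grow
without touching the generic struct.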

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 drivers/net/veth.c | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 917ba57453c1..0e629ceb087b 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -25,6 +25,7 @@
 #include <linux/filter.h>
 #include <linux/ptr_ring.h>
 #include <linux/bpf_trace.h>
+#include <linux/bpf_patch.h>
 #include <linux/net_tstamp.h>
 
 #define DRV_NAME	"veth"
@@ -118,6 +119,7 @@ static struct {
 
 struct veth_xdp_buff {
 	struct xdp_buff xdp;
+	struct sk_buff *skb;
 };
 
 static int veth_get_link_ksettings(struct net_device *dev,
@@ -602,6 +604,7 @@ static struct xdp_frame *veth_xdp_rcv_one(struct veth_rq *rq,
 
 		xdp_convert_frame_to_buff(frame, xdp);
 		xdp->rxq = &rq->xdp_rxq;
+		vxbuf.skb = NULL;
 
 		act = bpf_prog_run_xdp(xdp_prog, xdp);
 
@@ -826,6 +829,7 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
 
 	orig_data = xdp->data;
 	orig_data_end = xdp->data_end;
+	vxbuf.skb = skb;
 
 	act = bpf_prog_run_xdp(xdp_prog, xdp);
 
@@ -942,6 +946,7 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
 			struct sk_buff *skb = ptr;
 
 			stats->xdp_bytes += skb->len;
+			__net_timestamp(skb);
 			skb = veth_xdp_rcv_skb(rq, skb, bq, stats);
 			if (skb) {
 				if (skb_shared(skb) || skb_unclone(skb, GFP_ATOMIC))
@@ -1665,6 +1670,31 @@ static int veth_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 	}
 }
 
+static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
+			      struct bpf_patch *patch)
+{
+	if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) {
+		/* return true; */
+		bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1));
+	} else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) {
+		bpf_patch_append(patch,
+			/* r5 = ((struct veth_xdp_buff *)r1)->skb; */
+			BPF_LDX_MEM(BPF_DW, BPF_REG_5, BPF_REG_1,
+				    offsetof(struct veth_xdp_buff, skb)),
+			/* if (r5 == NULL) { */
+			BPF_JMP_IMM(BPF_JNE, BPF_REG_5, 0, 2),
+			/*	return 0; */
+			BPF_MOV64_IMM(BPF_REG_0, 0),
+			BPF_JMP_A(1),
+			/* } else { */
+			/*	return ((struct sk_buff *)r5)->tstamp; */
+			BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_5,
+				    offsetof(struct sk_buff, tstamp)),
+			/* } */
+		);
+	}
+}
+
 static const struct net_device_ops veth_netdev_ops = {
 	.ndo_init            = veth_dev_init,
 	.ndo_open            = veth_open,
@@ -1684,6 +1714,7 @@ static const struct net_device_ops veth_netdev_ops = {
 	.ndo_bpf		= veth_xdp,
 	.ndo_xdp_xmit		= veth_ndo_xdp_xmit,
 	.ndo_get_peer_dev	= veth_peer_dev,
+	.ndo_unroll_kfunc       = veth_unroll_kfunc,
 };
 
 #define VETH_FEATURES (NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_HW_CSUM | \
-- 
2.38.1.431.g37b22c650d-goog


* [xdp-hints] [RFC bpf-next v2 05/14] selftests/bpf: Verify xdp_metadata xdp->af_xdp path
  2022-11-04  3:25 [xdp-hints] [RFC bpf-next v2 00/14] xdp: hints via kfuncs Stanislav Fomichev
                   ` (3 preceding siblings ...)
  2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 04/14] veth: Support rx timestamp metadata for xdp Stanislav Fomichev
@ 2022-11-04  3:25 ` Stanislav Fomichev
  2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 06/14] xdp: Carry over xdp metadata into skb context Stanislav Fomichev
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 66+ messages in thread
From: Stanislav Fomichev @ 2022-11-04  3:25 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

- create new netns
- create veth pair (veTX+veRX)
- set up an AF_XDP socket for each interface
- attach bpf to veRX
- send packet via veTX
- verify the packet has expected metadata at veRX
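
The last step boils down to reading the bytes that sit directly in front
of the packet data in the UMEM frame. A rough sketch of that read, under
the assumption that the BPF program stored a single 8-byte timestamp via
bpf_xdp_adjust_meta (the helper name and arguments below are hypothetical):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: returns the 8-byte value stored immediately
 * before the packet data, or 0 if the frame has no room for it.
 */
static uint64_t read_rx_timestamp_sketch(const uint8_t *frame, size_t data_off)
{
	uint64_t ts;

	assert(frame != NULL);

	if (data_off < sizeof(ts))
		return 0;

	/* memcpy avoids an unaligned 64-bit load. */
	memcpy(&ts, frame + data_off - sizeof(ts), sizeof(ts));
	return ts;
}
```

Here frame would be the start of the UMEM frame and data_off the offset of
the packet data inside it; a zero result is treated as "no metadata".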

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 tools/testing/selftests/bpf/Makefile          |   2 +-
 .../selftests/bpf/prog_tests/xdp_metadata.c   | 302 ++++++++++++++++++
 .../selftests/bpf/progs/xdp_metadata.c        |  50 +++
 3 files changed, 353 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
 create mode 100644 tools/testing/selftests/bpf/progs/xdp_metadata.c

diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 79edef1dbda4..815bfd6b80cc 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -522,7 +522,7 @@ TRUNNER_BPF_PROGS_DIR := progs
 TRUNNER_EXTRA_SOURCES := test_progs.c cgroup_helpers.c trace_helpers.c	\
 			 network_helpers.c testing_helpers.c		\
 			 btf_helpers.c flow_dissector_load.h		\
-			 cap_helpers.c
+			 cap_helpers.c xsk.c
 TRUNNER_EXTRA_FILES := $(OUTPUT)/urandom_read $(OUTPUT)/bpf_testmod.ko	\
 		       $(OUTPUT)/liburandom_read.so			\
 		       $(OUTPUT)/xdp_synproxy				\
diff --git a/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c b/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
new file mode 100644
index 000000000000..bb06e25fb2bb
--- /dev/null
+++ b/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
@@ -0,0 +1,302 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <test_progs.h>
+#include <network_helpers.h>
+#include "xdp_metadata.skel.h"
+#include "xsk.h"
+
+#include <linux/errqueue.h>
+#include <linux/if_link.h>
+#include <linux/net_tstamp.h>
+#include <linux/udp.h>
+#include <sys/mman.h>
+#include <net/if.h>
+#include <poll.h>
+
+#define TX_NAME "veTX"
+#define RX_NAME "veRX"
+
+#define UDP_PAYLOAD_BYTES 4
+
+#define AF_XDP_SOURCE_PORT 1234
+#define AF_XDP_CONSUMER_PORT 8080
+
+#define UMEM_NUM 16
+#define UMEM_FRAME_SIZE XSK_UMEM__DEFAULT_FRAME_SIZE
+#define UMEM_SIZE (UMEM_FRAME_SIZE * UMEM_NUM)
+#define XDP_FLAGS XDP_FLAGS_DRV_MODE
+#define QUEUE_ID 0
+
+#define TX_ADDR "10.0.0.1"
+#define RX_ADDR "10.0.0.2"
+#define PREFIX_LEN "8"
+#define FAMILY AF_INET
+
+#define SYS(cmd) ({ \
+	if (!ASSERT_OK(system(cmd), (cmd))) \
+		goto out; \
+})
+
+struct xsk {
+	void *umem_area;
+	struct xsk_umem *umem;
+	struct xsk_ring_prod fill;
+	struct xsk_ring_cons comp;
+	struct xsk_ring_prod tx;
+	struct xsk_ring_cons rx;
+	struct xsk_socket *socket;
+	int next_tx;
+};
+
+int open_xsk(const char *ifname, struct xsk *xsk)
+{
+	int mmap_flags = MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE;
+	const struct xsk_socket_config socket_config = {
+		.rx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
+		.tx_size = XSK_RING_PROD__DEFAULT_NUM_DESCS,
+		.libbpf_flags = XSK_LIBBPF_FLAGS__INHIBIT_PROG_LOAD,
+		.xdp_flags = XDP_FLAGS,
+		.bind_flags = XDP_COPY,
+	};
+	u64 addr;
+	int ret;
+	int i;
+
+	xsk->umem_area = mmap(NULL, UMEM_SIZE, PROT_READ | PROT_WRITE, mmap_flags, -1, 0);
+	if (!ASSERT_NEQ(xsk->umem_area, MAP_FAILED, "mmap"))
+		return -1;
+
+	ret = xsk_umem__create(&xsk->umem,
+			       xsk->umem_area, UMEM_SIZE,
+			       &xsk->fill,
+			       &xsk->comp,
+			       NULL);
+	if (!ASSERT_OK(ret, "xsk_umem__create"))
+		return ret;
+
+	ret = xsk_socket__create(&xsk->socket, ifname, QUEUE_ID,
+				 xsk->umem,
+				 &xsk->rx,
+				 &xsk->tx,
+				 &socket_config);
+	if (!ASSERT_OK(ret, "xsk_socket__create"))
+		return ret;
+
+	/* First half of umem is for TX. This way address matches 1-to-1
+	 * to the completion queue index.
+	 */
+
+	xsk->next_tx = 0;
+
+	/* Second half of umem is for RX. */
+
+	__u32 idx;
+	ret = xsk_ring_prod__reserve(&xsk->fill, UMEM_NUM / 2, &idx);
+	if (!ASSERT_EQ(UMEM_NUM / 2, ret, "xsk_ring_prod__reserve"))
+		return ret;
+	if (!ASSERT_EQ(idx, 0, "fill idx != 0"))
+		return -1;
+
+	for (i = 0; i < UMEM_NUM / 2; i++) {
+		addr = (UMEM_NUM / 2 + i) * UMEM_FRAME_SIZE;
+		*xsk_ring_prod__fill_addr(&xsk->fill, i) = addr;
+	}
+	xsk_ring_prod__submit(&xsk->fill, ret);
+
+	return 0;
+}
+
+void close_xsk(struct xsk *xsk)
+{
+	if (xsk->umem)
+		xsk_umem__delete(xsk->umem);
+	if (xsk->socket)
+		xsk_socket__delete(xsk->socket);
+	munmap(xsk->umem_area, UMEM_SIZE);
+}
+
+static void ip_csum(struct iphdr *iph)
+{
+	__u32 sum = 0;
+	__u16 *p;
+	int i;
+
+	iph->check = 0;
+	p = (void *)iph;
+	for (i = 0; i < sizeof(*iph) / sizeof(*p); i++)
+		sum += p[i];
+
+	while (sum >> 16)
+		sum = (sum & 0xffff) + (sum >> 16);
+
+	iph->check = ~sum;
+}
+
+int generate_packet(struct xsk *xsk, __u16 dst_port)
+{
+	struct xdp_desc *tx_desc;
+	struct udphdr *udph;
+	struct ethhdr *eth;
+	struct iphdr *iph;
+	void *data;
+	__u32 idx;
+	int ret;
+
+	ret = xsk_ring_prod__reserve(&xsk->tx, 1, &idx);
+	if (!ASSERT_EQ(ret, 1, "xsk_ring_prod__reserve"))
+		return -1;
+
+	tx_desc = xsk_ring_prod__tx_desc(&xsk->tx, idx);
+	tx_desc->addr = (xsk->next_tx++ % (UMEM_NUM / 2)) * UMEM_FRAME_SIZE;
+	data = xsk_umem__get_data(xsk->umem_area, tx_desc->addr);
+
+	eth = data;
+	iph = (void *)(eth + 1);
+	udph = (void *)(iph + 1);
+
+	memcpy(eth->h_dest, "\x00\x00\x00\x00\x00\x02", ETH_ALEN);
+	memcpy(eth->h_source, "\x00\x00\x00\x00\x00\x01", ETH_ALEN);
+	eth->h_proto = htons(ETH_P_IP);
+
+	iph->version = 0x4;
+	iph->ihl = 0x5;
+	iph->tos = 0x9;
+	iph->tot_len = htons(sizeof(*iph) + sizeof(*udph) + UDP_PAYLOAD_BYTES);
+	iph->id = 0;
+	iph->frag_off = 0;
+	iph->ttl = 0;
+	iph->protocol = IPPROTO_UDP;
+	ASSERT_EQ(inet_pton(FAMILY, TX_ADDR, &iph->saddr), 1, "inet_pton(TX_ADDR)");
+	ASSERT_EQ(inet_pton(FAMILY, RX_ADDR, &iph->daddr), 1, "inet_pton(RX_ADDR)");
+	ip_csum(iph);
+
+	udph->source = htons(AF_XDP_SOURCE_PORT);
+	udph->dest = htons(dst_port);
+	udph->len = htons(sizeof(*udph) + UDP_PAYLOAD_BYTES);
+	udph->check = 0;
+
+	memset(udph + 1, 0xAA, UDP_PAYLOAD_BYTES);
+
+	tx_desc->len = sizeof(*eth) + sizeof(*iph) + sizeof(*udph) + UDP_PAYLOAD_BYTES;
+	xsk_ring_prod__submit(&xsk->tx, 1);
+
+	ret = sendto(xsk_socket__fd(xsk->socket), NULL, 0, MSG_DONTWAIT, NULL, 0);
+	if (!ASSERT_GE(ret, 0, "sendto"))
+		return ret;
+
+	return 0;
+}
+
+int verify_xsk_metadata(struct xsk *xsk)
+{
+	const struct xdp_desc *rx_desc;
+	struct pollfd fds = {};
+	void *data_meta;
+	void *data;
+	__u32 idx;
+	int ret;
+
+	ret = recvfrom(xsk_socket__fd(xsk->socket), NULL, 0, MSG_DONTWAIT, NULL, NULL);
+	if (!ASSERT_EQ(ret, 0, "recvfrom"))
+		return -1;
+
+	fds.fd = xsk_socket__fd(xsk->socket);
+	fds.events = POLLIN;
+
+	ret = poll(&fds, 1, 1000);
+	if (!ASSERT_GT(ret, 0, "poll"))
+		return -1;
+
+	ret = xsk_ring_cons__peek(&xsk->rx, 1, &idx);
+	if (!ASSERT_EQ(ret, 1, "xsk_ring_cons__peek"))
+		return -2;
+
+	rx_desc = xsk_ring_cons__rx_desc(&xsk->rx, idx++);
+	data = xsk_umem__get_data(xsk->umem_area, rx_desc->addr);
+
+	data_meta = data - 8; /* oh boy, this seems wrong! */
+
+	if (*(__u32 *)data_meta == 0)
+		return -1;
+
+	return 0;
+}
+
+void test_xdp_metadata(void)
+{
+	struct xdp_metadata *bpf_obj = NULL;
+	struct nstoken *tok = NULL;
+	struct bpf_program *prog;
+	struct xsk tx_xsk = {};
+	struct xsk rx_xsk = {};
+	int rx_ifindex;
+	int ret;
+
+	/* Setup new networking namespace, with a veth pair. */
+
+	SYS("ip netns add xdp_metadata");
+	tok = open_netns("xdp_metadata");
+	SYS("ip link add numtxqueues 1 numrxqueues 1 " TX_NAME " type veth "
+	    "peer " RX_NAME " numtxqueues 1 numrxqueues 1");
+	SYS("ip link set dev " TX_NAME " address 00:00:00:00:00:01");
+	SYS("ip link set dev " RX_NAME " address 00:00:00:00:00:02");
+	SYS("ip link set dev " TX_NAME " up");
+	SYS("ip link set dev " RX_NAME " up");
+	SYS("ip addr add " TX_ADDR "/" PREFIX_LEN " dev " TX_NAME);
+	SYS("ip addr add " RX_ADDR "/" PREFIX_LEN " dev " RX_NAME);
+
+	rx_ifindex = if_nametoindex(RX_NAME);
+
+	/* Setup separate AF_XDP for TX and RX interfaces. */
+
+	ret = open_xsk(TX_NAME, &tx_xsk);
+	if (!ASSERT_OK(ret, "open_xsk(TX_NAME)"))
+		goto out;
+
+	ret = open_xsk(RX_NAME, &rx_xsk);
+	if (!ASSERT_OK(ret, "open_xsk(RX_NAME)"))
+		goto out;
+
+	/* Attach BPF program to RX interface. */
+
+	bpf_obj = xdp_metadata__open();
+	if (!ASSERT_OK_PTR(bpf_obj, "open skeleton"))
+		goto out;
+
+	prog = bpf_object__find_program_by_name(bpf_obj->obj, "rx");
+	bpf_program__set_ifindex(prog, rx_ifindex);
+	bpf_program__set_flags(prog, BPF_F_XDP_HAS_METADATA);
+
+	if (!ASSERT_OK(xdp_metadata__load(bpf_obj), "load skeleton"))
+		goto out;
+
+	ret = bpf_xdp_attach(rx_ifindex,
+			     bpf_program__fd(bpf_obj->progs.rx),
+			     XDP_FLAGS, NULL);
+	if (!ASSERT_GE(ret, 0, "bpf_xdp_attach"))
+		goto out;
+
+	__u32 queue_id = QUEUE_ID;
+	int sock_fd = xsk_socket__fd(rx_xsk.socket);
+	ret = bpf_map_update_elem(bpf_map__fd(bpf_obj->maps.xsk), &queue_id, &sock_fd, 0);
+	if (!ASSERT_GE(ret, 0, "bpf_map_update_elem"))
+		goto out;
+
+	/* Send packet destined to RX AF_XDP socket. */
+	if (!ASSERT_GE(generate_packet(&tx_xsk, AF_XDP_CONSUMER_PORT), 0,
+		       "generate AF_XDP_CONSUMER_PORT"))
+		goto out;
+
+	/* Verify AF_XDP RX packet has proper metadata. */
+	if (!ASSERT_GE(verify_xsk_metadata(&rx_xsk), 0,
+		       "verify_xsk_metadata"))
+		goto out;
+
+out:
+	close_xsk(&rx_xsk);
+	close_xsk(&tx_xsk);
+	if (bpf_obj)
+		xdp_metadata__destroy(bpf_obj);
+	system("ip netns del xdp_metadata");
+	if (tok)
+		close_netns(tok);
+}
diff --git a/tools/testing/selftests/bpf/progs/xdp_metadata.c b/tools/testing/selftests/bpf/progs/xdp_metadata.c
new file mode 100644
index 000000000000..bdde17961ab6
--- /dev/null
+++ b/tools/testing/selftests/bpf/progs/xdp_metadata.c
@@ -0,0 +1,50 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/bpf.h>
+#include <linux/if_ether.h>
+#include <linux/ip.h>
+#include <linux/ipv6.h>
+#include <linux/in.h>
+#include <linux/udp.h>
+
+#include <bpf/bpf_helpers.h>
+#include <bpf/bpf_endian.h>
+
+struct {
+	__uint(type, BPF_MAP_TYPE_XSKMAP);
+	__uint(max_entries, 4);
+	__type(key, __u32);
+	__type(value, __u32);
+} xsk SEC(".maps");
+
+extern int bpf_xdp_metadata_rx_timestamp_supported(const struct xdp_md *ctx) __ksym;
+extern const __u64 bpf_xdp_metadata_rx_timestamp(const struct xdp_md *ctx) __ksym;
+
+SEC("xdp")
+int rx(struct xdp_md *ctx)
+{
+	void *data, *data_meta;
+	int ret;
+
+	if (bpf_xdp_metadata_rx_timestamp_supported(ctx)) {
+		__u64 rx_timestamp = bpf_xdp_metadata_rx_timestamp(ctx);
+
+		if (rx_timestamp) {
+			ret = bpf_xdp_adjust_meta(ctx, -(int)sizeof(rx_timestamp));
+			if (ret != 0)
+				return XDP_DROP;
+
+			data = (void *)(long)ctx->data;
+			data_meta = (void *)(long)ctx->data_meta;
+
+			if (data_meta + sizeof(rx_timestamp) > data)
+				return XDP_DROP;
+
+			*(__u64 *)data_meta = rx_timestamp;
+		}
+	}
+
+	return bpf_redirect_map(&xsk, ctx->rx_queue_index, XDP_PASS);
+}
+
+char _license[] SEC("license") = "GPL";
-- 
2.38.1.431.g37b22c650d-goog


* [xdp-hints] [RFC bpf-next v2 06/14] xdp: Carry over xdp metadata into skb context
  2022-11-04  3:25 [xdp-hints] [RFC bpf-next v2 00/14] xdp: hints via kfuncs Stanislav Fomichev
                   ` (4 preceding siblings ...)
  2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 05/14] selftests/bpf: Verify xdp_metadata xdp->af_xdp path Stanislav Fomichev
@ 2022-11-04  3:25 ` Stanislav Fomichev
  2022-11-07 22:01   ` [xdp-hints] " Martin KaFai Lau
  2022-11-10  1:09   ` John Fastabend
  2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 07/14] selftests/bpf: Verify xdp_metadata xdp->skb path Stanislav Fomichev
                   ` (7 subsequent siblings)
  13 siblings, 2 replies; 66+ messages in thread
From: Stanislav Fomichev @ 2022-11-04  3:25 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

Implement a new bpf_xdp_metadata_export_to_skb kfunc which
prepares compatible xdp metadata for kernel consumption.
This kfunc should be called prior to bpf_redirect
or (unless already called) when XDP_PASS'ing the frame
into the kernel.

The implementation currently maintains the xdp_to_skb_metadata
layout by calling bpf_xdp_metadata_rx_timestamp and placing a
small magic number. In skb_metadata_set, when we see the expected
magic number, we interpret the metadata accordingly.
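
As a rough userspace illustration of the "length and magic must both
match" rule (a sketch only; struct meta_sketch and import_meta_sketch are
invented names, and the real struct layout is randomized and
kernel-internal):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Illustrative layout; not the kernel's xdp_to_skb_metadata. */
struct meta_sketch {
	uint32_t magic;
	uint64_t rx_timestamp;
};

/* Parse metadata that ends right where the packet data starts.
 * Returns true and fills *ts only if both the length and the magic match.
 */
static bool import_meta_sketch(const uint8_t *data, size_t meta_len,
			       uint32_t expected_magic, uint64_t *ts)
{
	struct meta_sketch meta;

	assert(data && ts);

	if (meta_len != sizeof(meta))
		return false;	/* some other metadata, leave it alone */

	memcpy(&meta, data - meta_len, sizeof(meta));
	if (meta.magic != expected_magic)
		return false;

	*ts = meta.rx_timestamp;
	return true;
}
```

Metadata written by the BPF program for its own purposes fails one of the
two checks and is passed through untouched.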

Both the magic number and the struct layout are randomized to make sure
they don't leak into userspace.

skb_metadata_set is amended with skb_metadata_import_from_xdp which
tries to parse out the metadata and put it into skb.

See the comment for r1 vs r2/r3/r4/r5 conventions.
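
The convention can be checked mechanically: collect the destination
registers of an unrolled sequence into a bitmask and compare it against
the allowed set. A simplified sketch follows, assuming a stripped-down
instruction type that carries only dst_reg (struct insn_sketch and both
function names are hypothetical):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal stand-in for an instruction: only the destination register. */
struct insn_sketch {
	uint8_t dst_reg;
};

/* Bitmask with bit N set for every register N used as a destination. */
static uint32_t written_regs_sketch(const struct insn_sketch *insn, size_t len)
{
	uint32_t mask = 0;
	size_t i;

	for (i = 0; i < len; i++) {
		assert(insn[i].dst_reg < 32);
		mask |= UINT32_C(1) << insn[i].dst_reg;
	}
	return mask;
}

/* True if the sequence only writes R0 and R2-R5, leaving R1 intact. */
static bool keeps_r1_sketch(const struct insn_sketch *insn, size_t len)
{
	const uint32_t allowed = (1u << 0) | (1u << 2) | (1u << 3) |
				 (1u << 4) | (1u << 5);

	return (written_regs_sketch(insn, len) & ~allowed) == 0;
}
```

Treating every dst_reg as "written" is conservative: for store
instructions dst_reg is only the base pointer, so a stricter check might
want to skip those.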

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 drivers/net/veth.c        |   4 +-
 include/linux/bpf_patch.h |   2 +
 include/linux/skbuff.h    |   4 ++
 include/net/xdp.h         |  13 +++++
 kernel/bpf/bpf_patch.c    |  30 +++++++++++
 kernel/bpf/verifier.c     |  18 +++++++
 net/core/skbuff.c         |  25 +++++++++
 net/core/xdp.c            | 104 +++++++++++++++++++++++++++++++++++---
 8 files changed, 193 insertions(+), 7 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 0e629ceb087b..d4cd0938360b 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -1673,7 +1673,9 @@ static int veth_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
 			      struct bpf_patch *patch)
 {
-	if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) {
+	if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_EXPORT_TO_SKB)) {
+		xdp_metadata_export_to_skb(prog, patch);
+	} else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) {
 		/* return true; */
 		bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1));
 	} else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) {
diff --git a/include/linux/bpf_patch.h b/include/linux/bpf_patch.h
index 81ff738eef8d..359c165ad68b 100644
--- a/include/linux/bpf_patch.h
+++ b/include/linux/bpf_patch.h
@@ -16,6 +16,8 @@ size_t bpf_patch_len(const struct bpf_patch *patch);
 int bpf_patch_err(const struct bpf_patch *patch);
 void __bpf_patch_append(struct bpf_patch *patch, struct bpf_insn insn);
 struct bpf_insn *bpf_patch_data(const struct bpf_patch *patch);
+void bpf_patch_resolve_jmp(struct bpf_patch *patch);
+u32 bpf_patch_magles_registers(const struct bpf_patch *patch);
 
 #define bpf_patch_append(patch, ...) ({ \
 	struct bpf_insn insn[] = { __VA_ARGS__ }; \
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 59c9fd55699d..dba857f212d7 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -4217,9 +4217,13 @@ static inline bool skb_metadata_differs(const struct sk_buff *skb_a,
 	       true : __skb_metadata_differs(skb_a, skb_b, len_a);
 }
 
+void skb_metadata_import_from_xdp(struct sk_buff *skb, size_t len);
+
 static inline void skb_metadata_set(struct sk_buff *skb, u8 meta_len)
 {
 	skb_shinfo(skb)->meta_len = meta_len;
+	if (meta_len)
+		skb_metadata_import_from_xdp(skb, meta_len);
 }
 
 static inline void skb_metadata_clear(struct sk_buff *skb)
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 2a82a98f2f9f..8c97c6996172 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -411,6 +411,8 @@ void xdp_attachment_setup(struct xdp_attachment_info *info,
 #define DEV_MAP_BULK_SIZE XDP_BULK_QUEUE_SIZE
 
 #define XDP_METADATA_KFUNC_xxx	\
+	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_EXPORT_TO_SKB, \
+			   bpf_xdp_metadata_export_to_skb) \
 	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED, \
 			   bpf_xdp_metadata_rx_timestamp_supported) \
 	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_TIMESTAMP, \
@@ -423,14 +425,25 @@ XDP_METADATA_KFUNC_xxx
 MAX_XDP_METADATA_KFUNC,
 };
 
+struct xdp_to_skb_metadata {
+	u32 magic; /* xdp_metadata_magic */
+	u64 rx_timestamp;
+} __randomize_layout;
+
+struct bpf_patch;
+
 #ifdef CONFIG_DEBUG_INFO_BTF
+extern u32 xdp_metadata_magic;
 extern struct btf_id_set8 xdp_metadata_kfunc_ids;
 static inline u32 xdp_metadata_kfunc_id(int id)
 {
 	return xdp_metadata_kfunc_ids.pairs[id].id;
 }
+void xdp_metadata_export_to_skb(const struct bpf_prog *prog, struct bpf_patch *patch);
 #else
+#define xdp_metadata_magic 0
 static inline u32 xdp_metadata_kfunc_id(int id) { return 0; }
+static inline void xdp_metadata_export_to_skb(const struct bpf_prog *prog, struct bpf_patch *patch) { }
 #endif
 
 #endif /* __LINUX_NET_XDP_H__ */
diff --git a/kernel/bpf/bpf_patch.c b/kernel/bpf/bpf_patch.c
index 82a10bf5624a..8f1fef74299c 100644
--- a/kernel/bpf/bpf_patch.c
+++ b/kernel/bpf/bpf_patch.c
@@ -49,3 +49,33 @@ struct bpf_insn *bpf_patch_data(const struct bpf_patch *patch)
 {
 	return patch->insn;
 }
+
+void bpf_patch_resolve_jmp(struct bpf_patch *patch)
+{
+	int i;
+
+	for (i = 0; i < patch->len; i++) {
+		if (BPF_CLASS(patch->insn[i].code) != BPF_JMP)
+			continue;
+
+		if (BPF_SRC(patch->insn[i].code) != BPF_X)
+			continue;
+
+		if (patch->insn[i].off != S16_MAX)
+			continue;
+
+		patch->insn[i].off = patch->len - i - 1;
+	}
+}
+
+u32 bpf_patch_magles_registers(const struct bpf_patch *patch)
+{
+	u32 mask = 0;
+	int i;
+
+	for (i = 0; i < patch->len; i++) {
+		mask |= 1 << patch->insn[i].dst_reg;
+	}
+
+	return mask;
+}
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 4e5c5ff35d5f..49f55f81e7f3 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -13902,6 +13902,24 @@ static int unroll_kfunc_call(struct bpf_verifier_env *env,
 		 */
 		bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 0));
 	}
+
+	if (func_id != xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_EXPORT_TO_SKB)) {
+		u32 allowed = 0;
+		u32 mangled = bpf_patch_magles_registers(patch);
+
+		/* See xdp_metadata_export_to_skb comment for the
+		 * reason about these constraints.
+		 */
+
+		allowed |= 1 << BPF_REG_0;
+		allowed |= 1 << BPF_REG_2;
+		allowed |= 1 << BPF_REG_3;
+		allowed |= 1 << BPF_REG_4;
+		allowed |= 1 << BPF_REG_5;
+
+		WARN_ON_ONCE(mangled & ~allowed);
+	}
+
 	return bpf_patch_err(patch);
 }
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 42a35b59fb1e..37e3aef46525 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -72,6 +72,7 @@
 #include <net/mptcp.h>
 #include <net/mctp.h>
 #include <net/page_pool.h>
+#include <net/xdp.h>
 
 #include <linux/uaccess.h>
 #include <trace/events/skb.h>
@@ -6672,3 +6673,27 @@ nodefer:	__kfree_skb(skb);
 	if (unlikely(kick) && !cmpxchg(&sd->defer_ipi_scheduled, 0, 1))
 		smp_call_function_single_async(cpu, &sd->defer_csd);
 }
+
+void skb_metadata_import_from_xdp(struct sk_buff *skb, size_t len)
+{
+	struct xdp_to_skb_metadata *meta = (void *)(skb_mac_header(skb) - len);
+
+	/* Optional SKB info, currently missing:
+	 * - HW checksum info		(skb->ip_summed)
+	 * - HW RX hash			(skb_set_hash)
+	 * - RX ring dev queue index	(skb_record_rx_queue)
+	 */
+
+	if (len != sizeof(struct xdp_to_skb_metadata))
+		return;
+
+	if (meta->magic != xdp_metadata_magic)
+		return;
+
+	if (meta->rx_timestamp) {
+		*skb_hwtstamps(skb) = (struct skb_shared_hwtstamps){
+			.hwtstamp = ns_to_ktime(meta->rx_timestamp),
+		};
+	}
+}
+EXPORT_SYMBOL(skb_metadata_import_from_xdp);
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 22f1e44700eb..8204fa05c5e9 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -653,12 +653,6 @@ struct sk_buff *__xdp_build_skb_from_frame(struct xdp_frame *xdpf,
 	/* Essential SKB info: protocol and skb->dev */
 	skb->protocol = eth_type_trans(skb, dev);
 
-	/* Optional SKB info, currently missing:
-	 * - HW checksum info		(skb->ip_summed)
-	 * - HW RX hash			(skb_set_hash)
-	 * - RX ring dev queue index	(skb_record_rx_queue)
-	 */
-
 	/* Until page_pool get SKB return path, release DMA here */
 	xdp_release_frame(xdpf);
 
@@ -712,6 +706,13 @@ struct xdp_frame *xdpf_clone(struct xdp_frame *xdpf)
 	return nxdpf;
 }
 
+/* For the packets directed to the kernel, this kfunc exports XDP metadata
+ * into skb context.
+ */
+noinline void bpf_xdp_metadata_export_to_skb(const struct xdp_md *ctx)
+{
+}
+
 /* Indicates whether particular device supports rx_timestamp metadata.
  * This is an optional helper to support marking some branches as
  * "dead code" in the BPF programs.
@@ -737,13 +738,104 @@ XDP_METADATA_KFUNC_xxx
 #undef XDP_METADATA_KFUNC
 BTF_SET8_END(xdp_metadata_kfunc_ids)
 
+/* Make sure userspace doesn't depend on our layout by using
+ * different pseudo-generated magic value.
+ */
+u32 xdp_metadata_magic;
+
 static const struct btf_kfunc_id_set xdp_metadata_kfunc_set = {
 	.owner = THIS_MODULE,
 	.set   = &xdp_metadata_kfunc_ids,
 };
 
+/* Since we're not actually doing a call but instead rewriting
+ * in place, we can only afford to use R0-R5 scratch registers.
+ *
+ * We reserve R1 for bpf_xdp_metadata_export_to_skb and let individual
+ * metadata kfuncs use only R0 and R2-R5.
+ *
+ * The above also means we _cannot_ easily call any other helper/kfunc
+ * because there is no place for us to preserve our R1 argument;
+ * existing R6-R9 belong to the calling BPF program.
+ */
+void xdp_metadata_export_to_skb(const struct bpf_prog *prog, struct bpf_patch *patch)
+{
+	u32 func_id;
+
+	/*
+	 * The code below generates the following:
+	 *
+	 * void bpf_xdp_metadata_export_to_skb(struct xdp_md *ctx)
+	 * {
+	 *	struct xdp_to_skb_metadata *meta;
+	 *	int ret;
+	 *
+	 *	ret = bpf_xdp_adjust_meta(ctx, -sizeof(*meta));
+	 *	if (ret)
+	 *		return;
+	 *
+	 *	meta = ctx->data_meta;
+	 *	meta->magic = xdp_metadata_magic;
+	 *	meta->rx_timestamp = bpf_xdp_metadata_rx_timestamp(ctx);
+	 * }
+	 *
+	 */
+
+	bpf_patch_append(patch,
+		/* r2 = ((struct xdp_buff *)r1)->data_meta; */
+		BPF_LDX_MEM(BPF_DW, BPF_REG_2, BPF_REG_1,
+			    offsetof(struct xdp_buff, data_meta)),
+		/* r3 = ((struct xdp_buff *)r1)->data; */
+		BPF_LDX_MEM(BPF_DW, BPF_REG_3, BPF_REG_1,
+			    offsetof(struct xdp_buff, data)),
+		/* if (data_meta != data) return;
+		 *
+		 *	data_meta > data: xdp_data_meta_unsupported()
+		 *	data_meta < data: already used, no need to touch
+		 */
+		BPF_JMP_REG(BPF_JNE, BPF_REG_2, BPF_REG_3, S16_MAX),
+
+		/* r2 -= sizeof(struct xdp_to_skb_metadata); */
+		BPF_ALU64_IMM(BPF_SUB, BPF_REG_2,
+			      sizeof(struct xdp_to_skb_metadata)),
+		/* r3 = ((struct xdp_buff *)r1)->data_hard_start; */
+		BPF_LDX_MEM(BPF_DW, BPF_REG_3, BPF_REG_1,
+			    offsetof(struct xdp_buff, data_hard_start)),
+		/* r3 += sizeof(struct xdp_frame) */
+		BPF_ALU64_IMM(BPF_ADD, BPF_REG_3,
+			      sizeof(struct xdp_frame)),
+		/* if (data-sizeof(struct xdp_to_skb_metadata) < data_hard_start+sizeof(struct xdp_frame)) return; */
+		BPF_JMP_REG(BPF_JLT, BPF_REG_2, BPF_REG_3, S16_MAX),
+
+		/* ((struct xdp_buff *)r1)->data_meta = r2; */
+		BPF_STX_MEM(BPF_DW, BPF_REG_1, BPF_REG_2,
+			    offsetof(struct xdp_buff, data_meta)),
+
+		/* ((struct xdp_to_skb_metadata *)r2)->magic = xdp_metadata_magic; */
+		BPF_ST_MEM(BPF_W, BPF_REG_2,
+			   offsetof(struct xdp_to_skb_metadata, magic),
+			   xdp_metadata_magic),
+	);
+
+	/*	r0 = bpf_xdp_metadata_rx_timestamp(ctx); */
+	func_id = xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP);
+	prog->aux->xdp_kfunc_ndo->ndo_unroll_kfunc(prog, func_id, patch);
+
+	bpf_patch_append(patch,
+		/* r2 = ((struct xdp_buff *)r1)->data_meta; */
+		BPF_LDX_MEM(BPF_DW, BPF_REG_2, BPF_REG_1,
+			    offsetof(struct xdp_buff, data_meta)),
+		/* ((struct xdp_to_skb_metadata *)r2)->rx_timestamp = r0; */
+		BPF_STX_MEM(BPF_DW, BPF_REG_2, BPF_REG_0,
+			    offsetof(struct xdp_to_skb_metadata, rx_timestamp)),
+	);
+
+	bpf_patch_resolve_jmp(patch);
+}
+
 static int __init xdp_metadata_init(void)
 {
+	xdp_metadata_magic = get_random_u32() | 1;
 	return register_btf_kfunc_id_set(BPF_PROG_TYPE_XDP, &xdp_metadata_kfunc_set);
 }
 late_initcall(xdp_metadata_init);
-- 
2.38.1.431.g37b22c650d-goog


* [xdp-hints] [RFC bpf-next v2 07/14] selftests/bpf: Verify xdp_metadata xdp->skb path
  2022-11-04  3:25 [xdp-hints] [RFC bpf-next v2 00/14] xdp: hints via kfuncs Stanislav Fomichev
                   ` (5 preceding siblings ...)
  2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 06/14] xdp: Carry over xdp metadata into skb context Stanislav Fomichev
@ 2022-11-04  3:25 ` Stanislav Fomichev
  2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 08/14] bpf: Helper to simplify calling kernel routines from unrolled kfuncs Stanislav Fomichev
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 66+ messages in thread
From: Stanislav Fomichev @ 2022-11-04  3:25 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

- divert UDP traffic destined to port 9081 to the kernel
- call bpf_xdp_metadata_export_to_skb for such packets
- the kernel should fill in hwtstamp
- verify that the received packet has a non-zero hwtstamp
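
The last step can be sketched from user space roughly as follows. This is
a minimal illustration, not the selftest itself: find_hw_tstamp is a
hypothetical helper name, and the sketch assumes a Linux libc that exposes
SCM_TIMESTAMPING and struct scm_timestamping. It walks the control
messages of a received msghdr and reports whether the raw hardware slot,
ts[2], is non-zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <time.h>
#include <sys/socket.h>
#include <linux/errqueue.h>	/* struct scm_timestamping */

#ifndef SCM_TIMESTAMPING
#define SCM_TIMESTAMPING SO_TIMESTAMPING
#endif

/* Hypothetical helper: returns true when the message carries a non-zero
 * raw hardware timestamp (slot 2 of struct scm_timestamping).
 */
static bool find_hw_tstamp(struct msghdr *hdr, struct timespec *out)
{
	struct cmsghdr *cmsg;

	assert(hdr != NULL);

	for (cmsg = CMSG_FIRSTHDR(hdr); cmsg; cmsg = CMSG_NXTHDR(hdr, cmsg)) {
		struct scm_timestamping ts;

		if (cmsg->cmsg_level != SOL_SOCKET ||
		    cmsg->cmsg_type != SCM_TIMESTAMPING)
			continue;
		/* Skip control messages too short to hold the payload. */
		if (cmsg->cmsg_len < CMSG_LEN(sizeof(ts)))
			continue;

		/* Copy out instead of relying on CMSG_DATA() alignment. */
		memcpy(&ts, CMSG_DATA(cmsg), sizeof(ts));
		if (ts.ts[2].tv_sec || ts.ts[2].tv_nsec) {
			if (out)
				*out = ts.ts[2];
			return true;
		}
	}
	return false;
}
```

The socket has to request SOF_TIMESTAMPING_RAW_HARDWARE via SO_TIMESTAMPING
before recvmsg() for the control message to show up at all.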

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 .../selftests/bpf/prog_tests/xdp_metadata.c   | 92 +++++++++++++++++++
 .../selftests/bpf/progs/xdp_metadata.c        | 28 ++++++
 2 files changed, 120 insertions(+)

diff --git a/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c b/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
index bb06e25fb2bb..96cc6d7697f8 100644
--- a/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
+++ b/tools/testing/selftests/bpf/prog_tests/xdp_metadata.c
@@ -19,6 +19,11 @@
 
 #define AF_XDP_SOURCE_PORT 1234
 #define AF_XDP_CONSUMER_PORT 8080
+#define SOCKET_CONSUMER_PORT 9081
+
+#ifndef SOL_UDP
+#define SOL_UDP		17
+#endif
 
 #define UMEM_NUM 16
 #define UMEM_FRAME_SIZE XSK_UMEM__DEFAULT_FRAME_SIZE
@@ -221,6 +226,71 @@ int verify_xsk_metadata(struct xsk *xsk)
 	return 0;
 }
 
+static void disable_rx_checksum(int fd)
+{
+	int ret, val;
+
+	val = 1;
+	ret = setsockopt(fd, SOL_UDP, UDP_NO_CHECK6_RX, &val, sizeof(val));
+	ASSERT_OK(ret, "setsockopt(UDP_NO_CHECK6_RX)");
+}
+
+static void timestamping_enable(int fd)
+{
+	int ret, val;
+
+	val = SOF_TIMESTAMPING_SOFTWARE | SOF_TIMESTAMPING_RAW_HARDWARE;
+	ret = setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, &val, sizeof(val));
+	ASSERT_OK(ret, "setsockopt(SO_TIMESTAMPING)");
+}
+
+int verify_skb_metadata(int fd)
+{
+	char cmsg_buf[1024];
+	char packet_buf[128];
+
+	struct scm_timestamping *ts;
+	struct iovec packet_iov;
+	struct cmsghdr *cmsg;
+	struct msghdr hdr;
+	bool found_hwtstamp = false;
+
+	memset(&hdr, 0, sizeof(hdr));
+	hdr.msg_iov = &packet_iov;
+	hdr.msg_iovlen = 1;
+	packet_iov.iov_base = packet_buf;
+	packet_iov.iov_len = sizeof(packet_buf);
+
+	hdr.msg_control = cmsg_buf;
+	hdr.msg_controllen = sizeof(cmsg_buf);
+
+	if (ASSERT_GE(recvmsg(fd, &hdr, 0), 0, "recvmsg")) {
+		for (cmsg = CMSG_FIRSTHDR(&hdr); cmsg != NULL;
+		     cmsg = CMSG_NXTHDR(&hdr, cmsg)) {
+
+			if (cmsg->cmsg_level != SOL_SOCKET)
+				continue;
+
+			switch (cmsg->cmsg_type) {
+			case SCM_TIMESTAMPING:
+				ts = (struct scm_timestamping *)CMSG_DATA(cmsg);
+				if (ts->ts[2].tv_sec || ts->ts[2].tv_nsec) {
+					found_hwtstamp = true;
+					break;
+				}
+				break;
+			default:
+				break;
+			}
+		}
+	}
+
+	if (!ASSERT_EQ(found_hwtstamp, true, "no hwtstamp!"))
+		return -1;
+
+	return 0;
+}
+
 void test_xdp_metadata(void)
 {
 	struct xdp_metadata *bpf_obj = NULL;
@@ -228,6 +298,7 @@ void test_xdp_metadata(void)
 	struct bpf_program *prog;
 	struct xsk tx_xsk = {};
 	struct xsk rx_xsk = {};
+	int rx_udp_fd = -1;
 	int rx_ifindex;
 	int ret;
 
@@ -243,6 +314,8 @@ void test_xdp_metadata(void)
 	SYS("ip link set dev " RX_NAME " up");
 	SYS("ip addr add " TX_ADDR "/" PREFIX_LEN " dev " TX_NAME);
 	SYS("ip addr add " RX_ADDR "/" PREFIX_LEN " dev " RX_NAME);
+	SYS("sysctl -q net.ipv4.ip_forward=1");
+	SYS("sysctl -q net.ipv4.conf.all.accept_local=1");
 
 	rx_ifindex = if_nametoindex(RX_NAME);
 
@@ -256,6 +329,14 @@ void test_xdp_metadata(void)
 	if (!ASSERT_OK(ret, "open_xsk(RX_NAME)"))
 		goto out;
 
+	/* Setup UDP listener for RX interface. */
+
+	rx_udp_fd = start_server(FAMILY, SOCK_DGRAM, NULL, SOCKET_CONSUMER_PORT, 1000);
+	if (!ASSERT_GE(rx_udp_fd, 0, "start_server"))
+		goto out;
+	disable_rx_checksum(rx_udp_fd);
+	timestamping_enable(rx_udp_fd);
+
 	/* Attach BPF program to RX interface. */
 
 	bpf_obj = xdp_metadata__open();
@@ -291,9 +372,20 @@ void test_xdp_metadata(void)
 		       "verify_xsk_metadata"))
 	    goto out;
 
+	/* Send packet destined to RX UDP socket. */
+	if (!ASSERT_GE(generate_packet(&tx_xsk, SOCKET_CONSUMER_PORT), 0,
+		       "generate SOCKET_CONSUMER_PORT"))
+	    goto out;
+
+	/* Verify SKB RX packet has proper metadata. */
+	if (!ASSERT_GE(verify_skb_metadata(rx_udp_fd), 0,
+		       "verify_skb_metadata"))
+	    goto out;
+
 out:
 	close_xsk(&rx_xsk);
 	close_xsk(&tx_xsk);
+	close(rx_udp_fd);
 	if (bpf_obj)
 		xdp_metadata__destroy(bpf_obj);
 	system("ip netns del xdp_metadata");
diff --git a/tools/testing/selftests/bpf/progs/xdp_metadata.c b/tools/testing/selftests/bpf/progs/xdp_metadata.c
index bdde17961ab6..6e7292c58b86 100644
--- a/tools/testing/selftests/bpf/progs/xdp_metadata.c
+++ b/tools/testing/selftests/bpf/progs/xdp_metadata.c
@@ -17,6 +17,7 @@ struct {
 	__type(value, __u32);
 } xsk SEC(".maps");
 
+extern void bpf_xdp_metadata_export_to_skb(const struct xdp_md *ctx) __ksym;
 extern int bpf_xdp_metadata_rx_timestamp_supported(const struct xdp_md *ctx) __ksym;
 extern const __u64 bpf_xdp_metadata_rx_timestamp(const struct xdp_md *ctx) __ksym;
 
@@ -24,8 +25,35 @@ SEC("xdp")
 int rx(struct xdp_md *ctx)
 {
 	void *data, *data_meta;
+	struct ethhdr *eth = NULL;
+	struct udphdr *udp = NULL;
+	struct iphdr *iph = NULL;
+	void *data_end;
 	int ret;
 
+	/* Exercise xdp -> skb metadata path by diverting some traffic
+	 * into the kernel (UDP destination port 9081).
+	 */
+
+	data = (void *)(long)ctx->data;
+	data_end = (void *)(long)ctx->data_end;
+	eth = data;
+	if (eth + 1 < data_end) {
+		if (eth->h_proto == bpf_htons(ETH_P_IP)) {
+			iph = (void *)(eth + 1);
+			if (iph + 1 < data_end && iph->protocol == IPPROTO_UDP)
+				udp = (void *)(iph + 1);
+		}
+		if (udp && udp + 1 > data_end)
+			udp = NULL;
+	}
+	if (udp && udp->dest == bpf_htons(9081)) {
+		bpf_xdp_metadata_export_to_skb(ctx);
+		bpf_printk("exporting metadata to skb for UDP port 9081");
+		/*return bpf_redirect(ifindex, BPF_F_INGRESS);*/
+		return XDP_PASS;
+	}
+
 	if (bpf_xdp_metadata_rx_timestamp_supported(ctx)) {
 		__u64 rx_timestamp = bpf_xdp_metadata_rx_timestamp(ctx);
 
-- 
2.38.1.431.g37b22c650d-goog


* [xdp-hints] [RFC bpf-next v2 08/14] bpf: Helper to simplify calling kernel routines from unrolled kfuncs
  2022-11-04  3:25 [xdp-hints] [RFC bpf-next v2 00/14] xdp: hints via kfuncs Stanislav Fomichev
                   ` (6 preceding siblings ...)
  2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 07/14] selftests/bpf: Verify xdp_metadata xdp->skb path Stanislav Fomichev
@ 2022-11-04  3:25 ` Stanislav Fomichev
  2022-11-05  0:40   ` [xdp-hints] " Alexei Starovoitov
  2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 09/14] ice: Introduce ice_xdp_buff wrapper for xdp_buff Stanislav Fomichev
                   ` (5 subsequent siblings)
  13 siblings, 1 reply; 66+ messages in thread
From: Stanislav Fomichev @ 2022-11-04  3:25 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

When we need to call a kernel function from an unrolled
kfunc, we have to take extra care: r6-r9 belong to the BPF program
we are unrolled into, not to us, so we can't use these registers
to stash our r1.

We use the same trick we use elsewhere: ask the user
to provide extra on-stack storage.

Also note that the kernel function being called has to receive
the context pointer and return that same pointer.
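
A plain-C model of this convention might look like the sketch below. All
names here (demo_ctx, demo_rx_timestamp, demo_call_preserving_r1) are
made up for illustration and are not part of the series; the placeholder
arithmetic stands in for whatever the real routine computes. The point is
only the data flow: the routine parks its result inside the context and
hands the context pointer back, so the caller can rebuild r1 from r0 and
then load the actual result.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a driver context; in the series the first
 * member would be struct xdp_buff.
 */
struct demo_ctx {
	uint64_t hw_cycles;	/* input the routine reads */
	uint64_t r0;		/* slot where the routine parks its result */
};

/* The called routine takes the context, stores its result inside it,
 * and returns the same pointer so the caller can recover r1 from r0.
 */
static struct demo_ctx *demo_rx_timestamp(struct demo_ctx *ctx)
{
	assert(ctx != NULL);
	ctx->r0 = ctx->hw_cycles * 2;	/* placeholder conversion */
	return ctx;
}

/* Plain-C model of the three emitted instructions. */
static uint64_t demo_call_preserving_r1(struct demo_ctx **r1, size_t r0_offset)
{
	uint64_t r0;
	void *ret;

	ret = demo_rx_timestamp(*r1);	/* r0 = kfunc(r1); r1 is clobbered */
	*r1 = ret;			/* r1 = r0; */
	/* r0 = *(r1 + r0_offset); */
	memcpy(&r0, (char *)*r1 + r0_offset, sizeof(r0));
	return r0;
}
```

Calling demo_call_preserving_r1(&r1, offsetof(struct demo_ctx, r0)) leaves
r1 pointing at the same context it started with, which is the property the
unrolled code relies on.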

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 include/net/xdp.h |  4 ++++
 net/core/xdp.c    | 24 +++++++++++++++++++++++-
 2 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/include/net/xdp.h b/include/net/xdp.h
index 8c97c6996172..09c05d1da69c 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -440,10 +440,14 @@ static inline u32 xdp_metadata_kfunc_id(int id)
 	return xdp_metadata_kfunc_ids.pairs[id].id;
 }
 void xdp_metadata_export_to_skb(const struct bpf_prog *prog, struct bpf_patch *patch);
+void xdp_kfunc_call_preserving_r1(struct bpf_patch *patch, size_t r0_offset,
+				  void *kfunc);
 #else
 #define xdp_metadata_magic 0
 static inline u32 xdp_metadata_kfunc_id(int id) { return 0; }
 static void xdp_metadata_export_to_skb(const struct bpf_prog *prog, struct bpf_patch *patch) { return 0; }
+static inline void xdp_kfunc_call_preserving_r1(struct bpf_patch *patch, size_t r0_offset,
+						void *kfunc) {}
 #endif
 
 #endif /* __LINUX_NET_XDP_H__ */
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 8204fa05c5e9..16dd7850b9b0 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -737,6 +737,7 @@ BTF_SET8_START_GLOBAL(xdp_metadata_kfunc_ids)
 XDP_METADATA_KFUNC_xxx
 #undef XDP_METADATA_KFUNC
 BTF_SET8_END(xdp_metadata_kfunc_ids)
+EXPORT_SYMBOL(xdp_metadata_kfunc_ids);
 
 /* Make sure userspace doesn't depend on our layout by using
  * different pseudo-generated magic value.
@@ -756,7 +757,8 @@ static const struct btf_kfunc_id_set xdp_metadata_kfunc_set = {
  *
  * The above also means we _cannot_ easily call any other helper/kfunc
  * because there is no place for us to preserve our R1 argument;
- * existing R6-R9 belong to the callee.
+ * existing R6-R9 belong to the callee. For the cases where calling into
+ * the kernel is the only option, see xdp_kfunc_call_preserving_r1.
  */
 void xdp_metadata_export_to_skb(const struct bpf_prog *prog, struct bpf_patch *patch)
 {
@@ -832,6 +834,26 @@ void xdp_metadata_export_to_skb(const struct bpf_prog *prog, struct bpf_patch *p
 
 	bpf_patch_resolve_jmp(patch);
 }
+EXPORT_SYMBOL(xdp_metadata_export_to_skb);
+
+/* Helper to generate the bytecode that calls the supplied kfunc.
+ * The kfunc has to accept a pointer to the context and return the
+ * same pointer back. The user also has to supply an offset within
+ * the context to store r0.
+ */
+void xdp_kfunc_call_preserving_r1(struct bpf_patch *patch, size_t r0_offset,
+				  void *kfunc)
+{
+	bpf_patch_append(patch,
+		/* r0 = kfunc(r1); */
+		BPF_EMIT_CALL(kfunc),
+		/* r1 = r0; */
+		BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
+		/* r0 = *(r1 + r0_offset); */
+		BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_1, r0_offset),
+	);
+}
+EXPORT_SYMBOL(xdp_kfunc_call_preserving_r1);
 
 static int __init xdp_metadata_init(void)
 {
-- 
2.38.1.431.g37b22c650d-goog


* [xdp-hints] [RFC bpf-next v2 09/14] ice: Introduce ice_xdp_buff wrapper for xdp_buff
  2022-11-04  3:25 [xdp-hints] [RFC bpf-next v2 00/14] xdp: hints via kfuncs Stanislav Fomichev
                   ` (7 preceding siblings ...)
  2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 08/14] bpf: Helper to simplify calling kernel routines from unrolled kfuncs Stanislav Fomichev
@ 2022-11-04  3:25 ` Stanislav Fomichev
  2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 10/14] ice: Support rx timestamp metadata for xdp Stanislav Fomichev
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 66+ messages in thread
From: Stanislav Fomichev @ 2022-11-04  3:25 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

No functional changes. Boilerplate to allow stuffing more data after xdp_buff.
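
The wrapper pattern can be illustrated in user space as below. This is
only a sketch under stated assumptions: demo_container_of, demo_xdp_buff,
demo_drv_xdp_buff and queue_index are hypothetical stand-ins (the kernel
supplies container_of() and struct xdp_buff), chosen to show how code
that is handed the inner struct gets back to the driver's outer struct.

```c
#include <assert.h>
#include <stddef.h>

/* User-space stand-in for the kernel's container_of(). */
#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Stand-in for struct xdp_buff. */
struct demo_xdp_buff {
	void *data;
	void *data_meta;
};

/* Driver wrapper: the generic buffer stays the first member, and
 * driver-private fields follow it.
 */
struct demo_drv_xdp_buff {
	struct demo_xdp_buff xdp;
	int queue_index;		/* hypothetical driver-private field */
};

/* Generic code only sees the inner struct; the driver recovers its wrapper. */
static int demo_queue_index(struct demo_xdp_buff *xdp)
{
	struct demo_drv_xdp_buff *wrap;

	assert(xdp != NULL);
	wrap = demo_container_of(xdp, struct demo_drv_xdp_buff, xdp);
	return wrap->queue_index;
}
```

Keeping the generic struct at offset zero also means a pointer to the
wrapper and a pointer to the inner struct are numerically equal, which is
convenient for unrolled code that uses offsetof() on the wrapper directly.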

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 drivers/net/ethernet/intel/ice/ice_txrx.c | 30 +++++++++++++----------
 1 file changed, 17 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
index dbe80e5053a8..1b6afa168501 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
@@ -1096,6 +1096,10 @@ ice_is_non_eop(struct ice_rx_ring *rx_ring, union ice_32b_rx_flex_desc *rx_desc)
 	return true;
 }
 
+struct ice_xdp_buff {
+	struct xdp_buff xdp;
+};
+
 /**
  * ice_clean_rx_irq - Clean completed descriptors from Rx ring - bounce buf
  * @rx_ring: Rx descriptor ring to transact packets on
@@ -1117,14 +1121,14 @@ int ice_clean_rx_irq(struct ice_rx_ring *rx_ring, int budget)
 	unsigned int xdp_res, xdp_xmit = 0;
 	struct sk_buff *skb = rx_ring->skb;
 	struct bpf_prog *xdp_prog = NULL;
-	struct xdp_buff xdp;
+	struct ice_xdp_buff ixbuf;
 	bool failure;
 
 	/* Frame size depend on rx_ring setup when PAGE_SIZE=4K */
 #if (PAGE_SIZE < 8192)
 	frame_sz = ice_rx_frame_truesize(rx_ring, 0);
 #endif
-	xdp_init_buff(&xdp, frame_sz, &rx_ring->xdp_rxq);
+	xdp_init_buff(&ixbuf.xdp, frame_sz, &rx_ring->xdp_rxq);
 
 	xdp_prog = READ_ONCE(rx_ring->xdp_prog);
 	if (xdp_prog)
@@ -1178,30 +1182,30 @@ int ice_clean_rx_irq(struct ice_rx_ring *rx_ring, int budget)
 		rx_buf = ice_get_rx_buf(rx_ring, size, &rx_buf_pgcnt);
 
 		if (!size) {
-			xdp.data = NULL;
-			xdp.data_end = NULL;
-			xdp.data_hard_start = NULL;
-			xdp.data_meta = NULL;
+			ixbuf.xdp.data = NULL;
+			ixbuf.xdp.data_end = NULL;
+			ixbuf.xdp.data_hard_start = NULL;
+			ixbuf.xdp.data_meta = NULL;
 			goto construct_skb;
 		}
 
 		hard_start = page_address(rx_buf->page) + rx_buf->page_offset -
 			     offset;
-		xdp_prepare_buff(&xdp, hard_start, offset, size, true);
+		xdp_prepare_buff(&ixbuf.xdp, hard_start, offset, size, true);
 #if (PAGE_SIZE > 4096)
 		/* At larger PAGE_SIZE, frame_sz depend on len size */
-		xdp.frame_sz = ice_rx_frame_truesize(rx_ring, size);
+		ixbuf.xdp.frame_sz = ice_rx_frame_truesize(rx_ring, size);
 #endif
 
 		if (!xdp_prog)
 			goto construct_skb;
 
-		xdp_res = ice_run_xdp(rx_ring, &xdp, xdp_prog, xdp_ring);
+		xdp_res = ice_run_xdp(rx_ring, &ixbuf.xdp, xdp_prog, xdp_ring);
 		if (!xdp_res)
 			goto construct_skb;
 		if (xdp_res & (ICE_XDP_TX | ICE_XDP_REDIR)) {
 			xdp_xmit |= xdp_res;
-			ice_rx_buf_adjust_pg_offset(rx_buf, xdp.frame_sz);
+			ice_rx_buf_adjust_pg_offset(rx_buf, ixbuf.xdp.frame_sz);
 		} else {
 			rx_buf->pagecnt_bias++;
 		}
@@ -1214,11 +1218,11 @@ int ice_clean_rx_irq(struct ice_rx_ring *rx_ring, int budget)
 construct_skb:
 		if (skb) {
 			ice_add_rx_frag(rx_ring, rx_buf, skb, size);
-		} else if (likely(xdp.data)) {
+		} else if (likely(ixbuf.xdp.data)) {
 			if (ice_ring_uses_build_skb(rx_ring))
-				skb = ice_build_skb(rx_ring, rx_buf, &xdp);
+				skb = ice_build_skb(rx_ring, rx_buf, &ixbuf.xdp);
 			else
-				skb = ice_construct_skb(rx_ring, rx_buf, &xdp);
+				skb = ice_construct_skb(rx_ring, rx_buf, &ixbuf.xdp);
 		}
 		/* exit if we failed to retrieve a buffer */
 		if (!skb) {
-- 
2.38.1.431.g37b22c650d-goog


* [xdp-hints] [RFC bpf-next v2 10/14] ice: Support rx timestamp metadata for xdp
  2022-11-04  3:25 [xdp-hints] [RFC bpf-next v2 00/14] xdp: hints via kfuncs Stanislav Fomichev
                   ` (8 preceding siblings ...)
  2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 09/14] ice: Introduce ice_xdp_buff wrapper for xdp_buff Stanislav Fomichev
@ 2022-11-04  3:25 ` Stanislav Fomichev
  2022-11-04 14:35   ` [xdp-hints] " Alexander Lobakin
  2022-12-15 11:54   ` Larysa Zaremba
  2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 11/14] mlx4: Introduce mlx4_xdp_buff wrapper for xdp_buff Stanislav Fomichev
                   ` (3 subsequent siblings)
  13 siblings, 2 replies; 66+ messages in thread
From: Stanislav Fomichev @ 2022-11-04  3:25 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

COMPILE-TESTED ONLY!
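
The unrolled bytecode in this patch is loosely based on
ice_ptp_rx_hwtstamp. The 32-bit to 64-bit extension step that its comments
describe can be modeled in plain C roughly as follows. demo_extend_32b_ts
is a hypothetical name, and the sketch assumes the cached PHC time is
within about 2^31 ns of the sampled timestamp; it is a user-space
illustration of the arithmetic, not the driver code.

```c
#include <assert.h>
#include <stdint.h>

/* Extend a 32-bit hardware timestamp to 64 bits using a cached 64-bit
 * PHC time that is assumed to be close to the sample.
 */
static uint64_t demo_extend_32b_ts(uint64_t cached_phc_time, uint32_t in_tstamp)
{
	uint32_t phc_time_lo = (uint32_t)cached_phc_time;
	uint32_t delta = in_tstamp - phc_time_lo;	/* wraps modulo 2^32 */

	if (delta > UINT32_MAX / 2) {
		/* Sample is older than the cached time: subtract instead. */
		delta = phc_time_lo - in_tstamp;
		/* The reversed delta can be at most 2^31. */
		assert(delta <= (uint32_t)1 << 31);
		return cached_phc_time - delta;
	}
	/* Sample is at or after the cached time: add the 64-bit base. */
	return cached_phc_time + delta;
}
```

Note that the final addition and subtraction are done on the full 64-bit
cached value; only the delta is computed in 32 bits.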

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 drivers/net/ethernet/intel/ice/ice.h      |  5 ++
 drivers/net/ethernet/intel/ice/ice_main.c |  1 +
 drivers/net/ethernet/intel/ice/ice_txrx.c | 75 +++++++++++++++++++++++
 3 files changed, 81 insertions(+)

diff --git a/drivers/net/ethernet/intel/ice/ice.h b/drivers/net/ethernet/intel/ice/ice.h
index f88ee051e71c..c51a392d64a4 100644
--- a/drivers/net/ethernet/intel/ice/ice.h
+++ b/drivers/net/ethernet/intel/ice/ice.h
@@ -925,6 +925,11 @@ int ice_open_internal(struct net_device *netdev);
 int ice_stop(struct net_device *netdev);
 void ice_service_task_schedule(struct ice_pf *pf);
 
+struct bpf_insn;
+struct bpf_patch;
+void ice_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
+		      struct bpf_patch *patch);
+
 /**
  * ice_set_rdma_cap - enable RDMA support
  * @pf: PF struct
diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
index 1f27dc20b4f1..8ddc6851ef86 100644
--- a/drivers/net/ethernet/intel/ice/ice_main.c
+++ b/drivers/net/ethernet/intel/ice/ice_main.c
@@ -9109,4 +9109,5 @@ static const struct net_device_ops ice_netdev_ops = {
 	.ndo_xdp_xmit = ice_xdp_xmit,
 	.ndo_xsk_wakeup = ice_xsk_wakeup,
 	.ndo_get_devlink_port = ice_get_devlink_port,
+	.ndo_unroll_kfunc = ice_unroll_kfunc,
 };
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
index 1b6afa168501..e9b5e883753e 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
@@ -7,6 +7,7 @@
 #include <linux/netdevice.h>
 #include <linux/prefetch.h>
 #include <linux/bpf_trace.h>
+#include <linux/bpf_patch.h>
 #include <net/dsfield.h>
 #include <net/mpls.h>
 #include <net/xdp.h>
@@ -1098,8 +1099,80 @@ ice_is_non_eop(struct ice_rx_ring *rx_ring, union ice_32b_rx_flex_desc *rx_desc)
 
 struct ice_xdp_buff {
 	struct xdp_buff xdp;
+	struct ice_rx_ring *rx_ring;
+	union ice_32b_rx_flex_desc *rx_desc;
 };
 
+void ice_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
+		      struct bpf_patch *patch)
+{
+	if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_EXPORT_TO_SKB)) {
+		return xdp_metadata_export_to_skb(prog, patch);
+	} else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) {
+		/* return true; */
+		bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1));
+	} else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) {
+		bpf_patch_append(patch,
+			/* Loosely based on ice_ptp_rx_hwtstamp. */
+
+			BPF_MOV64_IMM(BPF_REG_0, 0),
+
+			/* r5 = ((struct ice_xdp_buff *)r1)->rx_ring; */
+			BPF_LDX_MEM(BPF_DW, BPF_REG_5, BPF_REG_1,
+				    offsetof(struct ice_xdp_buff, rx_ring)),
+			/* if (r5 == NULL) return; */
+			BPF_JMP_IMM(BPF_JEQ, BPF_REG_5, 0, S16_MAX),
+
+			/* r5 = rx_ring->cached_phctime; */
+			BPF_LDX_MEM(BPF_DW, BPF_REG_5, BPF_REG_5,
+				    offsetof(struct ice_rx_ring, cached_phctime)),
+			/* if (r5 == 0) return; */
+			BPF_JMP_IMM(BPF_JEQ, BPF_REG_5, 0, S16_MAX),
+
+			/* r4 = ((struct ice_xdp_buff *)r1)->rx_desc; */
+			BPF_LDX_MEM(BPF_DW, BPF_REG_4, BPF_REG_1,
+				    offsetof(struct ice_xdp_buff, rx_desc)),
+
+			/* r3 = rx_desc->wb.time_stamp_low; */
+			BPF_LDX_MEM(BPF_DW, BPF_REG_3, BPF_REG_4,
+				    offsetof(union ice_32b_rx_flex_desc, wb.time_stamp_low)),
+			/* r3 = r3 & ICE_PTP_TS_VALID; */
+			BPF_ALU64_IMM(BPF_AND, BPF_REG_3, 1),
+			/* if (r3 == 0) return; */
+			BPF_JMP_IMM(BPF_JEQ, BPF_REG_3, 0, S16_MAX),
+
+			/* r3 = rx_desc->wb.flex_ts.ts_high; */
+			BPF_LDX_MEM(BPF_DW, BPF_REG_3, BPF_REG_4,
+				    offsetof(union ice_32b_rx_flex_desc, wb.flex_ts.ts_high)),
+
+			/* r5 == cached_phc_time; */
+			/* r3 == in_tstamp */
+
+			/* r4 = in_tstamp - phc_time_lo; (delta) */
+			BPF_MOV32_REG(BPF_REG_4, BPF_REG_3),
+			BPF_ALU32_REG(BPF_SUB, BPF_REG_4, BPF_REG_5),
+
+			/* if (delta <= U32_MAX / 2) { */
+			BPF_JMP_IMM(BPF_JGT, BPF_REG_4, U32_MAX / 2, 3),
+
+			/*	return cached_phc_time + delta */
+			BPF_MOV64_REG(BPF_REG_0, BPF_REG_4),
+			BPF_ALU64_REG(BPF_ADD, BPF_REG_0, BPF_REG_5),
+			BPF_JMP_A(4),
+
+			/* } else { */
+			/*	r4 = cached_phc_time_lo - in_tstamp; (delta) */
+			BPF_MOV64_REG(BPF_REG_4, BPF_REG_5),
+			BPF_ALU32_REG(BPF_SUB, BPF_REG_4, BPF_REG_3),
+
+			/*	return cached_phc_time - delta */
+			BPF_MOV64_REG(BPF_REG_0, BPF_REG_5),
+			BPF_ALU64_REG(BPF_SUB, BPF_REG_0, BPF_REG_4),
+			/* } */
+		);
+	}
+}
+
 /**
  * ice_clean_rx_irq - Clean completed descriptors from Rx ring - bounce buf
  * @rx_ring: Rx descriptor ring to transact packets on
@@ -1196,6 +1269,8 @@ int ice_clean_rx_irq(struct ice_rx_ring *rx_ring, int budget)
 		/* At larger PAGE_SIZE, frame_sz depend on len size */
 		ixbuf.xdp.frame_sz = ice_rx_frame_truesize(rx_ring, size);
 #endif
+		ixbuf.rx_ring = rx_ring;
+		ixbuf.rx_desc = rx_desc;
 
 		if (!xdp_prog)
 			goto construct_skb;
-- 
2.38.1.431.g37b22c650d-goog


* [xdp-hints] [RFC bpf-next v2 11/14] mlx4: Introduce mlx4_xdp_buff wrapper for xdp_buff
  2022-11-04  3:25 [xdp-hints] [RFC bpf-next v2 00/14] xdp: hints via kfuncs Stanislav Fomichev
                   ` (9 preceding siblings ...)
  2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 10/14] ice: Support rx timestamp metadata for xdp Stanislav Fomichev
@ 2022-11-04  3:25 ` Stanislav Fomichev
  2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 12/14] mxl4: Support rx timestamp metadata for xdp Stanislav Fomichev
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 66+ messages in thread
From: Stanislav Fomichev @ 2022-11-04  3:25 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

No functional changes. Boilerplate to allow stuffing more data after xdp_buff.

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 drivers/net/ethernet/mellanox/mlx4/en_rx.c | 26 +++++++++++++---------
 1 file changed, 15 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 8f762fc170b3..467356633172 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -661,17 +661,21 @@ static int check_csum(struct mlx4_cqe *cqe, struct sk_buff *skb, void *va,
 #define MLX4_CQE_STATUS_IP_ANY (MLX4_CQE_STATUS_IPV4)
 #endif
 
+struct mlx4_xdp_buff {
+	struct xdp_buff xdp;
+};
+
 int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int budget)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
 	int factor = priv->cqe_factor;
 	struct mlx4_en_rx_ring *ring;
+	struct mlx4_xdp_buff mxbuf;
 	struct bpf_prog *xdp_prog;
 	int cq_ring = cq->ring;
 	bool doorbell_pending;
 	bool xdp_redir_flush;
 	struct mlx4_cqe *cqe;
-	struct xdp_buff xdp;
 	int polled = 0;
 	int index;
 
@@ -681,7 +685,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 	ring = priv->rx_ring[cq_ring];
 
 	xdp_prog = rcu_dereference_bh(ring->xdp_prog);
-	xdp_init_buff(&xdp, priv->frag_info[0].frag_stride, &ring->xdp_rxq);
+	xdp_init_buff(&mxbuf.xdp, priv->frag_info[0].frag_stride, &ring->xdp_rxq);
 	doorbell_pending = false;
 	xdp_redir_flush = false;
 
@@ -776,24 +780,24 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 						priv->frag_info[0].frag_size,
 						DMA_FROM_DEVICE);
 
-			xdp_prepare_buff(&xdp, va - frags[0].page_offset,
+			xdp_prepare_buff(&mxbuf.xdp, va - frags[0].page_offset,
 					 frags[0].page_offset, length, false);
-			orig_data = xdp.data;
+			orig_data = mxbuf.xdp.data;
 
-			act = bpf_prog_run_xdp(xdp_prog, &xdp);
+			act = bpf_prog_run_xdp(xdp_prog, &mxbuf.xdp);
 
-			length = xdp.data_end - xdp.data;
-			if (xdp.data != orig_data) {
-				frags[0].page_offset = xdp.data -
-					xdp.data_hard_start;
-				va = xdp.data;
+			length = mxbuf.xdp.data_end - mxbuf.xdp.data;
+			if (mxbuf.xdp.data != orig_data) {
+				frags[0].page_offset = mxbuf.xdp.data -
+					mxbuf.xdp.data_hard_start;
+				va = mxbuf.xdp.data;
 			}
 
 			switch (act) {
 			case XDP_PASS:
 				break;
 			case XDP_REDIRECT:
-				if (likely(!xdp_do_redirect(dev, &xdp, xdp_prog))) {
+				if (likely(!xdp_do_redirect(dev, &mxbuf.xdp, xdp_prog))) {
 					ring->xdp_redirect++;
 					xdp_redir_flush = true;
 					frags[0].page = NULL;
-- 
2.38.1.431.g37b22c650d-goog


* [xdp-hints] [RFC bpf-next v2 12/14] mxl4: Support rx timestamp metadata for xdp
  2022-11-04  3:25 [xdp-hints] [RFC bpf-next v2 00/14] xdp: hints via kfuncs Stanislav Fomichev
                   ` (10 preceding siblings ...)
  2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 11/14] mlx4: Introduce mlx4_xdp_buff wrapper for xdp_buff Stanislav Fomichev
@ 2022-11-04  3:25 ` Stanislav Fomichev
  2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 13/14] bnxt: Introduce bnxt_xdp_buff wrapper for xdp_buff Stanislav Fomichev
  2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 14/14] bnxt: Support rx timestamp metadata for xdp Stanislav Fomichev
  13 siblings, 0 replies; 66+ messages in thread
From: Stanislav Fomichev @ 2022-11-04  3:25 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

COMPILE-TESTED ONLY!

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 .../net/ethernet/mellanox/mlx4/en_netdev.c    |  1 +
 drivers/net/ethernet/mellanox/mlx4/en_rx.c    | 40 +++++++++++++++++++
 include/linux/mlx4/device.h                   |  7 ++++
 3 files changed, 48 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
index ca4b93a01034..3ef7995950dc 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_netdev.c
@@ -2890,6 +2890,7 @@ static const struct net_device_ops mlx4_netdev_ops_master = {
 	.ndo_features_check	= mlx4_en_features_check,
 	.ndo_set_tx_maxrate	= mlx4_en_set_tx_maxrate,
 	.ndo_bpf		= mlx4_xdp,
+	.ndo_unroll_kfunc	= mlx4_unroll_kfunc,
 };
 
 struct mlx4_en_bond {
diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 467356633172..83c89a1db3b8 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -33,6 +33,7 @@
 
 #include <linux/bpf.h>
 #include <linux/bpf_trace.h>
+#include <linux/bpf_patch.h>
 #include <linux/mlx4/cq.h>
 #include <linux/slab.h>
 #include <linux/mlx4/qp.h>
@@ -663,8 +664,43 @@ static int check_csum(struct mlx4_cqe *cqe, struct sk_buff *skb, void *va,
 
 struct mlx4_xdp_buff {
 	struct xdp_buff xdp;
+	struct mlx4_cqe *cqe;
+	struct mlx4_en_dev *mdev;
+	u64 r0;
 };
 
+struct mlx4_xdp_buff *mlx4_xdp_rx_timestamp(struct mlx4_xdp_buff *ctx)
+{
+	unsigned int seq;
+	u64 timestamp;
+	u64 nsec;
+
+	timestamp = mlx4_en_get_cqe_ts(ctx->cqe);
+
+	do {
+		seq = read_seqbegin(&ctx->mdev->clock_lock);
+		nsec = timecounter_cyc2time(&ctx->mdev->clock, timestamp);
+	} while (read_seqretry(&ctx->mdev->clock_lock, seq));
+
+	ctx->r0 = (u64)ns_to_ktime(nsec);
+	return ctx;
+}
+
+void mlx4_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
+		       struct bpf_patch *patch)
+{
+	if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_EXPORT_TO_SKB)) {
+		return xdp_metadata_export_to_skb(prog, patch);
+	} else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) {
+		/* return true; */
+		bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1));
+	} else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) {
+		xdp_kfunc_call_preserving_r1(patch,
+					     offsetof(struct mlx4_xdp_buff, r0),
+					     mlx4_xdp_rx_timestamp);
+	}
+}
+
 int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int budget)
 {
 	struct mlx4_en_priv *priv = netdev_priv(dev);
@@ -783,6 +819,10 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 			xdp_prepare_buff(&mxbuf.xdp, va - frags[0].page_offset,
 					 frags[0].page_offset, length, false);
 			orig_data = mxbuf.xdp.data;
+			if (unlikely(ring->hwtstamp_rx_filter == HWTSTAMP_FILTER_ALL)) {
+				mxbuf.cqe = cqe;
+				mxbuf.mdev = priv->mdev;
+			}
 
 			act = bpf_prog_run_xdp(xdp_prog, &mxbuf.xdp);
 
diff --git a/include/linux/mlx4/device.h b/include/linux/mlx4/device.h
index 6646634a0b9d..a0e4d490b2fb 100644
--- a/include/linux/mlx4/device.h
+++ b/include/linux/mlx4/device.h
@@ -1585,4 +1585,11 @@ static inline int mlx4_get_num_reserved_uar(struct mlx4_dev *dev)
 	/* The first 128 UARs are used for EQ doorbells */
 	return (128 >> (PAGE_SHIFT - dev->uar_page_shift));
 }
+
+struct bpf_prog;
+struct bpf_insn;
+struct bpf_patch;
+
+void mlx4_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
+		       struct bpf_patch *patch);
 #endif /* MLX4_DEVICE_H */
-- 
2.38.1.431.g37b22c650d-goog


^ permalink raw reply	[flat|nested] 66+ messages in thread

* [xdp-hints] [RFC bpf-next v2 13/14] bnxt: Introduce bnxt_xdp_buff wrapper for xdp_buff
  2022-11-04  3:25 [xdp-hints] [RFC bpf-next v2 00/14] xdp: hints via kfuncs Stanislav Fomichev
                   ` (11 preceding siblings ...)
  2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 12/14] mxl4: Support rx timestamp metadata for xdp Stanislav Fomichev
@ 2022-11-04  3:25 ` Stanislav Fomichev
  2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 14/14] bnxt: Support rx timestamp metadata for xdp Stanislav Fomichev
  13 siblings, 0 replies; 66+ messages in thread
From: Stanislav Fomichev @ 2022-11-04  3:25 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

No functional changes. Boilerplate to allow stuffing more data after xdp_buff.
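
For illustration only (not part of this patch): a minimal user-space sketch of
the wrapper pattern, assuming a simplified stand-in for struct xdp_buff. All
demo_* names and the hw_flags field are hypothetical. Because the xdp_buff is
embedded in the wrapper, code that only receives the inner pointer can get
back to the driver data with a container_of-style cast.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for the kernel's struct xdp_buff (assumption:
 * the real structure has more fields; only the embedding matters here).
 */
struct demo_xdp_buff {
	void *data;
	void *data_end;
};

/* Driver wrapper: the xdp_buff comes first, driver-private data follows. */
struct demo_drv_xdp_buff {
	struct demo_xdp_buff xdp;
	uint32_t hw_flags;	/* hypothetical per-packet driver data */
};

/* User-space equivalent of the kernel's container_of(). */
#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Recover the wrapper from a pointer to the embedded xdp_buff. */
static uint32_t demo_get_hw_flags(struct demo_xdp_buff *xdp)
{
	struct demo_drv_xdp_buff *wrapper =
		demo_container_of(xdp, struct demo_drv_xdp_buff, xdp);

	return wrapper->hw_flags;
}
```

Keeping the xdp_buff as the first member means the wrapper and the embedded
buffer share the same address, so the cast is a no-op at runtime.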

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index 04cf7684f1b0..b2e0607a6400 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -1789,6 +1789,10 @@ static void bnxt_deliver_skb(struct bnxt *bp, struct bnxt_napi *bnapi,
 	napi_gro_receive(&bnapi->napi, skb);
 }
 
+struct bnxt_xdp_buff {
+	struct xdp_buff xdp;
+};
+
 /* returns the following:
  * 1       - 1 packet successfully received
  * 0       - successful TPA_START, packet not completed yet
@@ -1812,7 +1816,7 @@ static int bnxt_rx_pkt(struct bnxt *bp, struct bnxt_cp_ring_info *cpr,
 	bool xdp_active = false;
 	dma_addr_t dma_addr;
 	struct sk_buff *skb;
-	struct xdp_buff xdp;
+	struct bnxt_xdp_buff bxbuf;
 	u32 flags, misc;
 	void *data;
 	int rc = 0;
@@ -1922,9 +1926,9 @@ static int bnxt_rx_pkt(struct bnxt *bp, struct bnxt_cp_ring_info *cpr,
 	dma_addr = rx_buf->mapping;
 
 	if (bnxt_xdp_attached(bp, rxr)) {
-		bnxt_xdp_buff_init(bp, rxr, cons, &data_ptr, &len, &xdp);
+		bnxt_xdp_buff_init(bp, rxr, cons, &data_ptr, &len, &bxbuf.xdp);
 		if (agg_bufs) {
-			u32 frag_len = bnxt_rx_agg_pages_xdp(bp, cpr, &xdp,
+			u32 frag_len = bnxt_rx_agg_pages_xdp(bp, cpr, &bxbuf.xdp,
 							     cp_cons, agg_bufs,
 							     false);
 			if (!frag_len) {
@@ -1937,7 +1941,7 @@ static int bnxt_rx_pkt(struct bnxt *bp, struct bnxt_cp_ring_info *cpr,
 	}
 
 	if (xdp_active) {
-		if (bnxt_rx_xdp(bp, rxr, cons, xdp, data, &len, event)) {
+		if (bnxt_rx_xdp(bp, rxr, cons, bxbuf.xdp, data, &len, event)) {
 			rc = 1;
 			goto next_rx;
 		}
@@ -1952,7 +1956,7 @@ static int bnxt_rx_pkt(struct bnxt *bp, struct bnxt_cp_ring_info *cpr,
 					bnxt_reuse_rx_agg_bufs(cpr, cp_cons, 0,
 							       agg_bufs, false);
 				else
-					bnxt_xdp_buff_frags_free(rxr, &xdp);
+					bnxt_xdp_buff_frags_free(rxr, &bxbuf.xdp);
 			}
 			cpr->sw_stats.rx.rx_oom_discards += 1;
 			rc = -ENOMEM;
@@ -1983,10 +1987,10 @@ static int bnxt_rx_pkt(struct bnxt *bp, struct bnxt_cp_ring_info *cpr,
 				goto next_rx;
 			}
 		} else {
-			skb = bnxt_xdp_build_skb(bp, skb, agg_bufs, rxr->page_pool, &xdp, rxcmp1);
+			skb = bnxt_xdp_build_skb(bp, skb, agg_bufs, rxr->page_pool, &bxbuf.xdp, rxcmp1);
 			if (!skb) {
 				/* we should be able to free the old skb here */
-				bnxt_xdp_buff_frags_free(rxr, &xdp);
+				bnxt_xdp_buff_frags_free(rxr, &bxbuf.xdp);
 				cpr->sw_stats.rx.rx_oom_discards += 1;
 				rc = -ENOMEM;
 				goto next_rx;
-- 
2.38.1.431.g37b22c650d-goog


^ permalink raw reply	[flat|nested] 66+ messages in thread

* [xdp-hints] [RFC bpf-next v2 14/14] bnxt: Support rx timestamp metadata for xdp
  2022-11-04  3:25 [xdp-hints] [RFC bpf-next v2 00/14] xdp: hints via kfuncs Stanislav Fomichev
                   ` (12 preceding siblings ...)
  2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 13/14] bnxt: Introduce bnxt_xdp_buff wrapper for xdp_buff Stanislav Fomichev
@ 2022-11-04  3:25 ` Stanislav Fomichev
  13 siblings, 0 replies; 66+ messages in thread
From: Stanislav Fomichev @ 2022-11-04  3:25 UTC (permalink / raw)
  To: bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

COMPILE-TESTED ONLY!

Cc: John Fastabend <john.fastabend@gmail.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin KaFai Lau <martin.lau@linux.dev>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: Anatoly Burakov <anatoly.burakov@intel.com>
Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
Cc: Maryam Tahhan <mtahhan@redhat.com>
Cc: xdp-hints@xdp-project.net
Cc: netdev@vger.kernel.org
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
 drivers/net/ethernet/broadcom/bnxt/bnxt.c | 55 +++++++++++++++++++++++
 1 file changed, 55 insertions(+)

diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
index b2e0607a6400..968266844f49 100644
--- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c
+++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c
@@ -37,6 +37,7 @@
 #include <linux/if_bridge.h>
 #include <linux/rtc.h>
 #include <linux/bpf.h>
+#include <linux/bpf_patch.h>
 #include <net/gro.h>
 #include <net/ip.h>
 #include <net/tcp.h>
@@ -1791,8 +1792,53 @@ static void bnxt_deliver_skb(struct bnxt *bp, struct bnxt_napi *bnapi,
 
 struct bnxt_xdp_buff {
 	struct xdp_buff xdp;
+	struct rx_cmp_ext *rxcmp1;
+	struct bnxt *bp;
+	u64 r0;
 };
 
+struct bnxt_xdp_buff *bnxt_xdp_rx_timestamp(struct bnxt_xdp_buff *ctx)
+{
+	struct bnxt_ptp_cfg *ptp;
+	u32 cmpl_ts;
+	u64 ns, ts;
+
+	if (!ctx->rxcmp1) {
+		ctx->r0 = 0;
+		return ctx;
+	}
+
+	cmpl_ts = le32_to_cpu(ctx->rxcmp1->rx_cmp_timestamp);
+	if (bnxt_get_rx_ts_p5(ctx->bp, &ts, cmpl_ts) < 0) {
+		ctx->r0 = 0;
+		return ctx;
+	}
+
+	ptp = ctx->bp->ptp_cfg;
+
+	spin_lock_bh(&ptp->ptp_lock);
+	ns = timecounter_cyc2time(&ptp->tc, ts);
+	spin_unlock_bh(&ptp->ptp_lock);
+
+	ctx->r0 = (u64)ns_to_ktime(ns);
+	return ctx;
+}
+
+void bnxt_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
+		       struct bpf_patch *patch)
+{
+	if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_EXPORT_TO_SKB)) {
+		return xdp_metadata_export_to_skb(prog, patch);
+	} else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) {
+		/* return true; */
+		bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1));
+	} else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) {
+		xdp_kfunc_call_preserving_r1(patch,
+					     offsetof(struct bnxt_xdp_buff, r0),
+					     bnxt_xdp_rx_timestamp);
+	}
+}
+
 /* returns the following:
  * 1       - 1 packet successfully received
  * 0       - successful TPA_START, packet not completed yet
@@ -1941,6 +1987,14 @@ static int bnxt_rx_pkt(struct bnxt *bp, struct bnxt_cp_ring_info *cpr,
 	}
 
 	if (xdp_active) {
+		if (unlikely((flags & RX_CMP_FLAGS_ITYPES_MASK) ==
+			     RX_CMP_FLAGS_ITYPE_PTP_W_TS) || bp->ptp_all_rx_tstamp) {
+			if (bp->flags & BNXT_FLAG_CHIP_P5) {
+				bxbuf.rxcmp1 = rxcmp1;
+				bxbuf.bp = bp;
+			}
+		}
+
 		if (bnxt_rx_xdp(bp, rxr, cons, bxbuf.xdp, data, &len, event)) {
 			rc = 1;
 			goto next_rx;
@@ -13116,6 +13170,7 @@ static const struct net_device_ops bnxt_netdev_ops = {
 	.ndo_bridge_getlink	= bnxt_bridge_getlink,
 	.ndo_bridge_setlink	= bnxt_bridge_setlink,
 	.ndo_get_devlink_port	= bnxt_get_devlink_port,
+	.ndo_unroll_kfunc	= bnxt_unroll_kfunc,
 };
 
 static void bnxt_remove_one(struct pci_dev *pdev)
-- 
2.38.1.431.g37b22c650d-goog


^ permalink raw reply	[flat|nested] 66+ messages in thread

* [xdp-hints] Re: [RFC bpf-next v2 10/14] ice: Support rx timestamp metadata for xdp
  2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 10/14] ice: Support rx timestamp metadata for xdp Stanislav Fomichev
@ 2022-11-04 14:35   ` Alexander Lobakin
  2022-11-04 18:21     ` Stanislav Fomichev
  2022-12-15 11:54   ` Larysa Zaremba
  1 sibling, 1 reply; 66+ messages in thread
From: Alexander Lobakin @ 2022-11-04 14:35 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Alexander Lobakin, bpf, ast, daniel, andrii, martin.lau, song,
	yhs, john.fastabend, kpsingh, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

From: Stanislav Fomichev <sdf@google.com>
Date: Thu,3 Nov 2022 20:25:28 -0700

> COMPILE-TESTED ONLY!
> 
> Cc: John Fastabend <john.fastabend@gmail.com>
> Cc: David Ahern <dsahern@gmail.com>
> Cc: Martin KaFai Lau <martin.lau@linux.dev>
> Cc: Jakub Kicinski <kuba@kernel.org>
> Cc: Willem de Bruijn <willemb@google.com>
> Cc: Jesper Dangaard Brouer <brouer@redhat.com>
> Cc: Anatoly Burakov <anatoly.burakov@intel.com>
> Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
> Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
> Cc: Maryam Tahhan <mtahhan@redhat.com>
> Cc: xdp-hints@xdp-project.net
> Cc: netdev@vger.kernel.org
> Signed-off-by: Stanislav Fomichev <sdf@google.com>
> ---
>  drivers/net/ethernet/intel/ice/ice.h      |  5 ++
>  drivers/net/ethernet/intel/ice/ice_main.c |  1 +
>  drivers/net/ethernet/intel/ice/ice_txrx.c | 75 +++++++++++++++++++++++
>  3 files changed, 81 insertions(+)
> 
> diff --git a/drivers/net/ethernet/intel/ice/ice.h b/drivers/net/ethernet/intel/ice/ice.h
> index f88ee051e71c..c51a392d64a4 100644
> --- a/drivers/net/ethernet/intel/ice/ice.h
> +++ b/drivers/net/ethernet/intel/ice/ice.h
> @@ -925,6 +925,11 @@ int ice_open_internal(struct net_device *netdev);
>  int ice_stop(struct net_device *netdev);
>  void ice_service_task_schedule(struct ice_pf *pf);
>  
> +struct bpf_insn;
> +struct bpf_patch;
> +void ice_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
> +		      struct bpf_patch *patch);
> +
>  /**
>   * ice_set_rdma_cap - enable RDMA support
>   * @pf: PF struct
> diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
> index 1f27dc20b4f1..8ddc6851ef86 100644
> --- a/drivers/net/ethernet/intel/ice/ice_main.c
> +++ b/drivers/net/ethernet/intel/ice/ice_main.c
> @@ -9109,4 +9109,5 @@ static const struct net_device_ops ice_netdev_ops = {
>  	.ndo_xdp_xmit = ice_xdp_xmit,
>  	.ndo_xsk_wakeup = ice_xsk_wakeup,
>  	.ndo_get_devlink_port = ice_get_devlink_port,
> +	.ndo_unroll_kfunc = ice_unroll_kfunc,
>  };
> diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
> index 1b6afa168501..e9b5e883753e 100644
> --- a/drivers/net/ethernet/intel/ice/ice_txrx.c
> +++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
> @@ -7,6 +7,7 @@
>  #include <linux/netdevice.h>
>  #include <linux/prefetch.h>
>  #include <linux/bpf_trace.h>
> +#include <linux/bpf_patch.h>
>  #include <net/dsfield.h>
>  #include <net/mpls.h>
>  #include <net/xdp.h>
> @@ -1098,8 +1099,80 @@ ice_is_non_eop(struct ice_rx_ring *rx_ring, union ice_32b_rx_flex_desc *rx_desc)
>  
>  struct ice_xdp_buff {
>  	struct xdp_buff xdp;
> +	struct ice_rx_ring *rx_ring;
> +	union ice_32b_rx_flex_desc *rx_desc;
>  };
>  
> +void ice_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
> +		      struct bpf_patch *patch)
> +{
> +	if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_EXPORT_TO_SKB)) {
> +		return xdp_metadata_export_to_skb(prog, patch);

Hey,

FYI, our team wants to write a follow-up patch adding ice support,
not as a draft but as more mature code. I'm thinking of having the
BPF code emitted by the unrolling callback call an ice C function
that processes Rx descriptors -- otherwise, reimplementing a couple
hundred lines of C from ice_txrx_lib.c would be a bit too much :D

> +	} else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) {
> +		/* return true; */
> +		bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1));
> +	} else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) {

[...]

> -- 
> 2.38.1.431.g37b22c650d-goog

Thanks,
Olek

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [xdp-hints] Re: [RFC bpf-next v2 10/14] ice: Support rx timestamp metadata for xdp
  2022-11-04 14:35   ` [xdp-hints] " Alexander Lobakin
@ 2022-11-04 18:21     ` Stanislav Fomichev
  2022-11-07 17:11       ` Alexander Lobakin
  0 siblings, 1 reply; 66+ messages in thread
From: Stanislav Fomichev @ 2022-11-04 18:21 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev

On Fri, Nov 4, 2022 at 7:38 AM Alexander Lobakin
<alexandr.lobakin@intel.com> wrote:
>
> From: Stanislav Fomichev <sdf@google.com>
> Date: Thu,3 Nov 2022 20:25:28 -0700
>
> > COMPILE-TESTED ONLY!
> >
> > Cc: John Fastabend <john.fastabend@gmail.com>
> > Cc: David Ahern <dsahern@gmail.com>
> > Cc: Martin KaFai Lau <martin.lau@linux.dev>
> > Cc: Jakub Kicinski <kuba@kernel.org>
> > Cc: Willem de Bruijn <willemb@google.com>
> > Cc: Jesper Dangaard Brouer <brouer@redhat.com>
> > Cc: Anatoly Burakov <anatoly.burakov@intel.com>
> > Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
> > Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
> > Cc: Maryam Tahhan <mtahhan@redhat.com>
> > Cc: xdp-hints@xdp-project.net
> > Cc: netdev@vger.kernel.org
> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > ---
> >  drivers/net/ethernet/intel/ice/ice.h      |  5 ++
> >  drivers/net/ethernet/intel/ice/ice_main.c |  1 +
> >  drivers/net/ethernet/intel/ice/ice_txrx.c | 75 +++++++++++++++++++++++
> >  3 files changed, 81 insertions(+)
> >
> > diff --git a/drivers/net/ethernet/intel/ice/ice.h b/drivers/net/ethernet/intel/ice/ice.h
> > index f88ee051e71c..c51a392d64a4 100644
> > --- a/drivers/net/ethernet/intel/ice/ice.h
> > +++ b/drivers/net/ethernet/intel/ice/ice.h
> > @@ -925,6 +925,11 @@ int ice_open_internal(struct net_device *netdev);
> >  int ice_stop(struct net_device *netdev);
> >  void ice_service_task_schedule(struct ice_pf *pf);
> >
> > +struct bpf_insn;
> > +struct bpf_patch;
> > +void ice_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
> > +                   struct bpf_patch *patch);
> > +
> >  /**
> >   * ice_set_rdma_cap - enable RDMA support
> >   * @pf: PF struct
> > diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
> > index 1f27dc20b4f1..8ddc6851ef86 100644
> > --- a/drivers/net/ethernet/intel/ice/ice_main.c
> > +++ b/drivers/net/ethernet/intel/ice/ice_main.c
> > @@ -9109,4 +9109,5 @@ static const struct net_device_ops ice_netdev_ops = {
> >       .ndo_xdp_xmit = ice_xdp_xmit,
> >       .ndo_xsk_wakeup = ice_xsk_wakeup,
> >       .ndo_get_devlink_port = ice_get_devlink_port,
> > +     .ndo_unroll_kfunc = ice_unroll_kfunc,
> >  };
> > diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
> > index 1b6afa168501..e9b5e883753e 100644
> > --- a/drivers/net/ethernet/intel/ice/ice_txrx.c
> > +++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
> > @@ -7,6 +7,7 @@
> >  #include <linux/netdevice.h>
> >  #include <linux/prefetch.h>
> >  #include <linux/bpf_trace.h>
> > +#include <linux/bpf_patch.h>
> >  #include <net/dsfield.h>
> >  #include <net/mpls.h>
> >  #include <net/xdp.h>
> > @@ -1098,8 +1099,80 @@ ice_is_non_eop(struct ice_rx_ring *rx_ring, union ice_32b_rx_flex_desc *rx_desc)
> >
> >  struct ice_xdp_buff {
> >       struct xdp_buff xdp;
> > +     struct ice_rx_ring *rx_ring;
> > +     union ice_32b_rx_flex_desc *rx_desc;
> >  };
> >
> > +void ice_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
> > +                   struct bpf_patch *patch)
> > +{
> > +     if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_EXPORT_TO_SKB)) {
> > +             return xdp_metadata_export_to_skb(prog, patch);
>
> Hey,
>
> FYI, our team wants to write a follow-up patch with ice support
> added, not like a draft, more of a mature code. I'm thinking of
> calling ice C function which would process Rx descriptors from
> that BPF code from the unrolling callback -- otherwise,
> implementing a couple hundred C code lines from ice_txrx_lib.c
> would be a bit too much :D

Sounds good! I would gladly drop all/most of the driver changes for
the non-RFC posting :-)
I'll probably keep an mlx4 one because there is a chance I might find
HW, but the rest I'll most likely drop.
(they are here to show what the driver changes might look like, hence
compile-tested only)

Btw, does it make sense to have some small xsk selftest binary that
can be used to test the metadata with a real device?
The one I have right now depends on veth/namespaces; having a
similar one for real hardware, to qualify it, sounds useful?
Something simple that sets up AF_XDP for all queues, diverts some
traffic, and exposes all the frame metadata info to the userspace
consumer...
Or maybe there is something I can reuse already?


> > +     } else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) {
> > +             /* return true; */
> > +             bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1));
> > +     } else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) {
>
> [...]
>
> > --
> > 2.38.1.431.g37b22c650d-goog
>
> Thanks,
> Olek

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [xdp-hints] Re: [RFC bpf-next v2 08/14] bpf: Helper to simplify calling kernel routines from unrolled kfuncs
  2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 08/14] bpf: Helper to simplify calling kernel routines from unrolled kfuncs Stanislav Fomichev
@ 2022-11-05  0:40   ` Alexei Starovoitov
  2022-11-05  2:18     ` Stanislav Fomichev
  0 siblings, 1 reply; 66+ messages in thread
From: Alexei Starovoitov @ 2022-11-05  0:40 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

On Thu, Nov 03, 2022 at 08:25:26PM -0700, Stanislav Fomichev wrote:
> When we need to call the kernel function from the unrolled
> kfunc, we have to take extra care: r6-r9 belong to the callee,
> not us, so we can't use these registers to stash our r1.
> 
> We use the same trick we use elsewhere: ask the user
> to provide extra on-stack storage.
> 
> Also, note, the program being called has to receive and
> return the context.
> 
> Cc: John Fastabend <john.fastabend@gmail.com>
> Cc: David Ahern <dsahern@gmail.com>
> Cc: Martin KaFai Lau <martin.lau@linux.dev>
> Cc: Jakub Kicinski <kuba@kernel.org>
> Cc: Willem de Bruijn <willemb@google.com>
> Cc: Jesper Dangaard Brouer <brouer@redhat.com>
> Cc: Anatoly Burakov <anatoly.burakov@intel.com>
> Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
> Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
> Cc: Maryam Tahhan <mtahhan@redhat.com>
> Cc: xdp-hints@xdp-project.net
> Cc: netdev@vger.kernel.org
> Signed-off-by: Stanislav Fomichev <sdf@google.com>
> ---
>  include/net/xdp.h |  4 ++++
>  net/core/xdp.c    | 24 +++++++++++++++++++++++-
>  2 files changed, 27 insertions(+), 1 deletion(-)
> 
> diff --git a/include/net/xdp.h b/include/net/xdp.h
> index 8c97c6996172..09c05d1da69c 100644
> --- a/include/net/xdp.h
> +++ b/include/net/xdp.h
> @@ -440,10 +440,14 @@ static inline u32 xdp_metadata_kfunc_id(int id)
>  	return xdp_metadata_kfunc_ids.pairs[id].id;
>  }
>  void xdp_metadata_export_to_skb(const struct bpf_prog *prog, struct bpf_patch *patch);
> +void xdp_kfunc_call_preserving_r1(struct bpf_patch *patch, size_t r0_offset,
> +				  void *kfunc);
>  #else
>  #define xdp_metadata_magic 0
>  static inline u32 xdp_metadata_kfunc_id(int id) { return 0; }
>  static void xdp_metadata_export_to_skb(const struct bpf_prog *prog, struct bpf_patch *patch) { return 0; }
> +static void xdp_kfunc_call_preserving_r1(struct bpf_patch *patch, size_t r0_offset,
> +					 void *kfunc) {}
>  #endif
>  
>  #endif /* __LINUX_NET_XDP_H__ */
> diff --git a/net/core/xdp.c b/net/core/xdp.c
> index 8204fa05c5e9..16dd7850b9b0 100644
> --- a/net/core/xdp.c
> +++ b/net/core/xdp.c
> @@ -737,6 +737,7 @@ BTF_SET8_START_GLOBAL(xdp_metadata_kfunc_ids)
>  XDP_METADATA_KFUNC_xxx
>  #undef XDP_METADATA_KFUNC
>  BTF_SET8_END(xdp_metadata_kfunc_ids)
> +EXPORT_SYMBOL(xdp_metadata_kfunc_ids);
>  
>  /* Make sure userspace doesn't depend on our layout by using
>   * different pseudo-generated magic value.
> @@ -756,7 +757,8 @@ static const struct btf_kfunc_id_set xdp_metadata_kfunc_set = {
>   *
>   * The above also means we _cannot_ easily call any other helper/kfunc
>   * because there is no place for us to preserve our R1 argument;
> - * existing R6-R9 belong to the callee.
> + * existing R6-R9 belong to the callee. For the cases where calling into
> + * the kernel is the only option, see xdp_kfunc_call_preserving_r1.
>   */
>  void xdp_metadata_export_to_skb(const struct bpf_prog *prog, struct bpf_patch *patch)
>  {
> @@ -832,6 +834,26 @@ void xdp_metadata_export_to_skb(const struct bpf_prog *prog, struct bpf_patch *p
>  
>  	bpf_patch_resolve_jmp(patch);
>  }
> +EXPORT_SYMBOL(xdp_metadata_export_to_skb);
> +
> +/* Helper to generate the bytecode that calls the supplied kfunc.
> + * The kfunc has to accept a pointer to the context and return the
> + * same pointer back. The user also has to supply an offset within
> + * the context to store r0.
> + */
> +void xdp_kfunc_call_preserving_r1(struct bpf_patch *patch, size_t r0_offset,
> +				  void *kfunc)
> +{
> +	bpf_patch_append(patch,
> +		/* r0 = kfunc(r1); */
> +		BPF_EMIT_CALL(kfunc),
> +		/* r1 = r0; */
> +		BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
> +		/* r0 = *(r1 + r0_offset); */
> +		BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_1, r0_offset),
> +	);
> +}
> +EXPORT_SYMBOL(xdp_kfunc_call_preserving_r1);

That's one twisted logic :)
I guess it works for preserving r1, but r2-r5 are gone and r6-r9 cannot be touched.
So it works for the most basic case of returning a single value from that additional
kfunc while preserving a single r1.
I'm afraid that will be very limiting moving forward.
imo we need push/pop insns. They shouldn't be difficult to add to the interpreter and JITs.
Since this patching is done after verification, they will be kernel-internal insns,
like the BPF_REG_AX internal reg we already have.
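
To make the convention concrete, here is a minimal user-space sketch of what
the three emitted instructions do. It is an emulation under stated
assumptions, not kernel code: all demo_* names are hypothetical, and the
"cycles to ns" math is a placeholder for the real timecounter conversion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for a driver context wrapper; only the r0 slot matters here. */
struct demo_ctx {
	int placeholder_xdp;	/* stands in for the embedded struct xdp_buff */
	uint64_t hw_cycles;	/* hypothetical raw hardware timestamp */
	uint64_t r0;		/* slot where the callee stashes its result */
};

/* Callee contract: take the context, store the result in ctx->r0,
 * and return the same context pointer.
 */
static struct demo_ctx *demo_rx_timestamp(struct demo_ctx *ctx)
{
	ctx->r0 = ctx->hw_cycles * 2;	/* placeholder conversion */
	return ctx;
}

/* Emulates the emitted sequence:
 *   r0 = kfunc(r1);
 *   r1 = r0;
 *   r0 = *(u64 *)(r1 + r0_offset);
 */
static uint64_t demo_call_preserving_r1(
	struct demo_ctx *(*kfunc)(struct demo_ctx *),
	size_t r0_offset, struct demo_ctx **r1)
{
	uint64_t r0;
	struct demo_ctx *ret = kfunc(*r1);	/* r0 = kfunc(r1) */

	*r1 = ret;				/* r1 = r0 */
	memcpy(&r0, (char *)*r1 + r0_offset, sizeof(r0));
	return r0;
}
```

Only r1 and one 64-bit result survive the call; anything the caller kept in
r2-r5 is lost, which is the limitation described above.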

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [xdp-hints] Re: [RFC bpf-next v2 08/14] bpf: Helper to simplify calling kernel routines from unrolled kfuncs
  2022-11-05  0:40   ` [xdp-hints] " Alexei Starovoitov
@ 2022-11-05  2:18     ` Stanislav Fomichev
  0 siblings, 0 replies; 66+ messages in thread
From: Stanislav Fomichev @ 2022-11-05  2:18 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

On Fri, Nov 4, 2022 at 5:40 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Nov 03, 2022 at 08:25:26PM -0700, Stanislav Fomichev wrote:
> > When we need to call the kernel function from the unrolled
> > kfunc, we have to take extra care: r6-r9 belong to the callee,
> > not us, so we can't use these registers to stash our r1.
> >
> > We use the same trick we use elsewhere: ask the user
> > to provide extra on-stack storage.
> >
> > Also, note, the program being called has to receive and
> > return the context.
> >
> > Cc: John Fastabend <john.fastabend@gmail.com>
> > Cc: David Ahern <dsahern@gmail.com>
> > Cc: Martin KaFai Lau <martin.lau@linux.dev>
> > Cc: Jakub Kicinski <kuba@kernel.org>
> > Cc: Willem de Bruijn <willemb@google.com>
> > Cc: Jesper Dangaard Brouer <brouer@redhat.com>
> > Cc: Anatoly Burakov <anatoly.burakov@intel.com>
> > Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
> > Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
> > Cc: Maryam Tahhan <mtahhan@redhat.com>
> > Cc: xdp-hints@xdp-project.net
> > Cc: netdev@vger.kernel.org
> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > ---
> >  include/net/xdp.h |  4 ++++
> >  net/core/xdp.c    | 24 +++++++++++++++++++++++-
> >  2 files changed, 27 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/net/xdp.h b/include/net/xdp.h
> > index 8c97c6996172..09c05d1da69c 100644
> > --- a/include/net/xdp.h
> > +++ b/include/net/xdp.h
> > @@ -440,10 +440,14 @@ static inline u32 xdp_metadata_kfunc_id(int id)
> >       return xdp_metadata_kfunc_ids.pairs[id].id;
> >  }
> >  void xdp_metadata_export_to_skb(const struct bpf_prog *prog, struct bpf_patch *patch);
> > +void xdp_kfunc_call_preserving_r1(struct bpf_patch *patch, size_t r0_offset,
> > +                               void *kfunc);
> >  #else
> >  #define xdp_metadata_magic 0
> >  static inline u32 xdp_metadata_kfunc_id(int id) { return 0; }
> >  static void xdp_metadata_export_to_skb(const struct bpf_prog *prog, struct bpf_patch *patch) { return 0; }
> > +static void xdp_kfunc_call_preserving_r1(struct bpf_patch *patch, size_t r0_offset,
> > +                                      void *kfunc) {}
> >  #endif
> >
> >  #endif /* __LINUX_NET_XDP_H__ */
> > diff --git a/net/core/xdp.c b/net/core/xdp.c
> > index 8204fa05c5e9..16dd7850b9b0 100644
> > --- a/net/core/xdp.c
> > +++ b/net/core/xdp.c
> > @@ -737,6 +737,7 @@ BTF_SET8_START_GLOBAL(xdp_metadata_kfunc_ids)
> >  XDP_METADATA_KFUNC_xxx
> >  #undef XDP_METADATA_KFUNC
> >  BTF_SET8_END(xdp_metadata_kfunc_ids)
> > +EXPORT_SYMBOL(xdp_metadata_kfunc_ids);
> >
> >  /* Make sure userspace doesn't depend on our layout by using
> >   * different pseudo-generated magic value.
> > @@ -756,7 +757,8 @@ static const struct btf_kfunc_id_set xdp_metadata_kfunc_set = {
> >   *
> >   * The above also means we _cannot_ easily call any other helper/kfunc
> >   * because there is no place for us to preserve our R1 argument;
> > - * existing R6-R9 belong to the callee.
> > + * existing R6-R9 belong to the callee. For the cases where calling into
> > + * the kernel is the only option, see xdp_kfunc_call_preserving_r1.
> >   */
> >  void xdp_metadata_export_to_skb(const struct bpf_prog *prog, struct bpf_patch *patch)
> >  {
> > @@ -832,6 +834,26 @@ void xdp_metadata_export_to_skb(const struct bpf_prog *prog, struct bpf_patch *p
> >
> >       bpf_patch_resolve_jmp(patch);
> >  }
> > +EXPORT_SYMBOL(xdp_metadata_export_to_skb);
> > +
> > +/* Helper to generate the bytecode that calls the supplied kfunc.
> > + * The kfunc has to accept a pointer to the context and return the
> > + * same pointer back. The user also has to supply an offset within
> > + * the context to store r0.
> > + */
> > +void xdp_kfunc_call_preserving_r1(struct bpf_patch *patch, size_t r0_offset,
> > +                               void *kfunc)
> > +{
> > +     bpf_patch_append(patch,
> > +             /* r0 = kfunc(r1); */
> > +             BPF_EMIT_CALL(kfunc),
> > +             /* r1 = r0; */
> > +             BPF_MOV64_REG(BPF_REG_1, BPF_REG_0),
> > +             /* r0 = *(r1 + r0_offset); */
> > +             BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_1, r0_offset),
> > +     );
> > +}
> > +EXPORT_SYMBOL(xdp_kfunc_call_preserving_r1);
>
> That's one twisted logic :)
> I guess it works for preserving r1, but r2-r5 are gone and r6-r9 cannot be touched.
> So it works for the most basic case of returning single value from that additional
> kfunc while preserving single r1.
> I'm afraid that will be very limiting moving forward.
> imo we need push/pop insns. It shouldn't be difficult to add to the interpreter and JITs.
> Since this patching is done after verification they will be kernel internal insns.
> Like we have BPF_REG_AX internal reg.

Heh, having push/pop would help a lot, agree :-)
Good suggestion, will look into that, thank you!

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [xdp-hints] Re: [RFC bpf-next v2 10/14] ice: Support rx timestamp metadata for xdp
  2022-11-04 18:21     ` Stanislav Fomichev
@ 2022-11-07 17:11       ` Alexander Lobakin
  2022-11-07 19:10         ` Stanislav Fomichev
  0 siblings, 1 reply; 66+ messages in thread
From: Alexander Lobakin @ 2022-11-07 17:11 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Alexander Lobakin, bpf, ast, daniel, andrii, martin.lau, song,
	yhs, john.fastabend, kpsingh, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

From: Stanislav Fomichev <sdf@google.com>
Date: Fri, 4 Nov 2022 11:21:47 -0700

> On Fri, Nov 4, 2022 at 7:38 AM Alexander Lobakin
> <alexandr.lobakin@intel.com> wrote:
> >
> > From: Stanislav Fomichev <sdf@google.com>
> > Date: Thu,3 Nov 2022 20:25:28 -0700

[...]

> > Hey,
> >
> > FYI, our team wants to write a follow-up patch with ice support
> > added, not like a draft, more of a mature code. I'm thinking of
> > calling ice C function which would process Rx descriptors from
> > that BPF code from the unrolling callback -- otherwise,
> > implementing a couple hundred C code lines from ice_txrx_lib.c
> > would be a bit too much :D
> 
> Sounds good! I would gladly drop all/most of the driver changes for
> the non-rfc posting :-)
> I'll probably have a mlx4 one because there is a chance I might find
> HW, but the rest I'll drop most likely.
> (they are here to show how the driver changes might look like, hence
> compile-tested only)
> 
> Btw, does it make sense to have some small xsk selftest binary that
> can be used to test the metadata with the real device?
> The one I'm having right now depends on veth/namespaces; having a
> similar one for the real hardware to qualify it sounds useful?
> Something simple that sets up af_xdp for all queues, diverts some
> traffic, and exposes to the userspace consumer all the info about
> frame metadata...
> Or maybe there is something I can reuse already?

There's XSk selftest already and recently Maciej added support for
executing it on a physical device (via `-i <iface>` cmdline arg)[0].
I guess the most optimal solution is to expand it to cover metadata
cases as it has some sort of useful helper functions / infra? In the
set I posted in June, I simply expanded xdp_meta selftest, but there
weren't any XSk bits, so I don't think it's the way to go.

> 
> 
> > > +     } else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) {
> > > +             /* return true; */
> > > +             bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1));
> > > +     } else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) {
> >
> > [...]
> >
> > > --
> > > 2.38.1.431.g37b22c650d-goog
> >
> > Thanks,
> > Olek

[0] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=a693ff3ed5610a07b1b0dd831d10f516e13cf6c6

Thanks,
Olek

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [xdp-hints] Re: [RFC bpf-next v2 10/14] ice: Support rx timestamp metadata for xdp
  2022-11-07 17:11       ` Alexander Lobakin
@ 2022-11-07 19:10         ` Stanislav Fomichev
  0 siblings, 0 replies; 66+ messages in thread
From: Stanislav Fomichev @ 2022-11-07 19:10 UTC (permalink / raw)
  To: Alexander Lobakin
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev

On Mon, Nov 7, 2022 at 9:33 AM Alexander Lobakin
<alexandr.lobakin@intel.com> wrote:
>
> From: Stanislav Fomichev <sdf@google.com>
> Date: Fri, 4 Nov 2022 11:21:47 -0700
>
> > On Fri, Nov 4, 2022 at 7:38 AM Alexander Lobakin
> > <alexandr.lobakin@intel.com> wrote:
> > >
> > > From: Stanislav Fomichev <sdf@google.com>
> > > Date: Thu,3 Nov 2022 20:25:28 -0700
>
> [...]
>
> > > Hey,
> > >
> > > FYI, our team wants to write a follow-up patch with ice support
> > > added, not like a draft, more of a mature code. I'm thinking of
> > > calling ice C function which would process Rx descriptors from
> > > that BPF code from the unrolling callback -- otherwise,
> > > implementing a couple hundred C code lines from ice_txrx_lib.c
> > > would be a bit too much :D
> >
> > Sounds good! I would gladly drop all/most of the driver changes for
> > the non-rfc posting :-)
> > I'll probably have a mlx4 one because there is a chance I might find
> > HW, but the rest I'll drop most likely.
> > (they are here to show how the driver changes might look like, hence
> > compile-tested only)
> >
> > Btw, does it make sense to have some small xsk selftest binary that
> > can be used to test the metadata with the real device?
> > The one I'm having right now depends on veth/namespaces; having a
> > similar one for the real hardware to qualify it sounds useful?
> > Something simple that sets up af_xdp for all queues, diverts some
> > traffic, and exposes to the userspace consumer all the info about
> > frame metadata...
> > Or maybe there is something I can reuse already?
>
> There's XSk selftest already and recently Maciej added support for
> executing it on a physical device (via `-i <iface>` cmdline arg)[0].
> I guess the most optimal solution is to expand it to cover metadata
> cases as it has some sort of useful helper functions / infra? In the
> set I posted in June, I simply expanded xdp_meta selftest, but there
> weren't any XSk bits, so I don't think it's the way to go.

Yeah, I was also extending xskxceiver initially, but not sure we want
to put metadata in there (that test is already too big imo)?
Jesper also pointed out [0], so maybe that thing should live out-of-tree?

I got mlx4 working locally with a small xskxceiver-like selftest. I'll
probably include it in the next non-rfc submission and we can discuss
whether it makes sense to keep it or it's better to extend xskxceiver.

0: https://github.com/xdp-project/bpf-examples/tree/master/AF_XDP-interaction


> >
> >
> > > > +     } else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) {
> > > > +             /* return true; */
> > > > +             bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1));
> > > > +     } else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) {
> > >
> > > [...]
> > >
> > > > --
> > > > 2.38.1.431.g37b22c650d-goog
> > >
> > > Thanks,
> > > Olek
>
> [0] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commit/?id=a693ff3ed5610a07b1b0dd831d10f516e13cf6c6
>
> Thanks,
> Olek

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [xdp-hints] Re: [RFC bpf-next v2 06/14] xdp: Carry over xdp metadata into skb context
  2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 06/14] xdp: Carry over xdp metadata into skb context Stanislav Fomichev
@ 2022-11-07 22:01   ` Martin KaFai Lau
  2022-11-08 21:54     ` Stanislav Fomichev
  2022-11-10  1:09   ` John Fastabend
  1 sibling, 1 reply; 66+ messages in thread
From: Martin KaFai Lau @ 2022-11-07 22:01 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: ast, daniel, andrii, song, yhs, john.fastabend, kpsingh, haoluo,
	jolsa, David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev, bpf

On 11/3/22 8:25 PM, Stanislav Fomichev wrote:
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 59c9fd55699d..dba857f212d7 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -4217,9 +4217,13 @@ static inline bool skb_metadata_differs(const struct sk_buff *skb_a,
>   	       true : __skb_metadata_differs(skb_a, skb_b, len_a);
>   }
>   
> +void skb_metadata_import_from_xdp(struct sk_buff *skb, size_t len);
> +
>   static inline void skb_metadata_set(struct sk_buff *skb, u8 meta_len)
>   {
>   	skb_shinfo(skb)->meta_len = meta_len;
> +	if (meta_len)
> +		skb_metadata_import_from_xdp(skb, meta_len);
>   }
>
[ ... ]

> +struct xdp_to_skb_metadata {
> +	u32 magic; /* xdp_metadata_magic */
> +	u64 rx_timestamp;
> +} __randomize_layout;
> +
> +struct bpf_patch;
> +

[ ... ]

> +void skb_metadata_import_from_xdp(struct sk_buff *skb, size_t len)
> +{
> +	struct xdp_to_skb_metadata *meta = (void *)(skb_mac_header(skb) - len);
> +
> +	/* Optional SKB info, currently missing:
> +	 * - HW checksum info		(skb->ip_summed)
> +	 * - HW RX hash			(skb_set_hash)
> +	 * - RX ring dev queue index	(skb_record_rx_queue)
> +	 */
> +
> +	if (len != sizeof(struct xdp_to_skb_metadata))
> +		return;
> +
> +	if (meta->magic != xdp_metadata_magic)
> +		return;
> +
> +	if (meta->rx_timestamp) {
> +		*skb_hwtstamps(skb) = (struct skb_shared_hwtstamps){
> +			.hwtstamp = ns_to_ktime(meta->rx_timestamp),
> +		};
> +	}
> +}

Considering the metadata will affect the gro, should the meta be cleared after 
importing to the skb?
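
A rough user-space sketch of what I mean, to show the effect on a GRO-style comparison. Everything here is a mock: mock_skb, mock_meta, mock_magic and the helpers are hypothetical stand-ins, not the kernel structures or the code from this patch. The assumption being modelled is that two packets whose metadata areas differ are not aggregated, so per-packet timestamps left in the metadata would keep otherwise identical flows apart.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct xdp_to_skb_metadata. */
struct mock_meta {
	uint32_t magic;
	uint64_t rx_timestamp;
};

/* Hypothetical stand-in for an skb; metadata ends where the MAC header starts. */
struct mock_skb {
	uint8_t headroom[64];
	uint8_t meta_len;	/* mirrors skb_shinfo(skb)->meta_len */
	uint64_t hwtstamp;	/* mirrors skb_hwtstamps(skb)->hwtstamp */
};

static const uint32_t mock_magic = 0x01234567u | 1;

/* Place kernel-style metadata with the given timestamp in front of the MAC header. */
static void mock_fill(struct mock_skb *skb, uint64_t ts)
{
	struct mock_meta m;

	assert(sizeof(m) <= sizeof(skb->headroom));
	memset(skb, 0, sizeof(*skb));
	memset(&m, 0, sizeof(m));	/* keep padding bytes deterministic */
	m.magic = mock_magic;
	m.rx_timestamp = ts;
	memcpy(skb->headroom + sizeof(skb->headroom) - sizeof(m), &m, sizeof(m));
	skb->meta_len = sizeof(m);
}

/* Import the hints and then clear the metadata. Returns 1 if consumed. */
static int mock_import_from_xdp(struct mock_skb *skb)
{
	uint8_t *mac = skb->headroom + sizeof(skb->headroom);
	struct mock_meta m;

	if (skb->meta_len != sizeof(m))
		return 0;
	memcpy(&m, mac - skb->meta_len, sizeof(m));
	if (m.magic != mock_magic)
		return 0;
	if (m.rx_timestamp)
		skb->hwtstamp = m.rx_timestamp;

	/* The step in question: drop the metadata once it has been consumed. */
	memset(mac - skb->meta_len, 0, skb->meta_len);
	skb->meta_len = 0;
	return 1;
}

/* Simplified comparison in the spirit of skb_metadata_differs(). */
static int mock_metadata_differs(const struct mock_skb *a, const struct mock_skb *b)
{
	const uint8_t *end_a = a->headroom + sizeof(a->headroom);
	const uint8_t *end_b = b->headroom + sizeof(b->headroom);

	if (a->meta_len != b->meta_len)
		return 1;
	return memcmp(end_a - a->meta_len, end_b - b->meta_len, a->meta_len) != 0;
}
```

Two mock skbs filled with different timestamps compare as different before the import and as equal after it, while each keeps its own hwtstamp.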

[ ... ]

> +/* Since we're not actually doing a call but instead rewriting
> + * in place, we can only afford to use R0-R5 scratch registers.
> + *
> + * We reserve R1 for bpf_xdp_metadata_export_to_skb and let individual
> + * metadata kfuncs use only R0,R4-R5.
> + *
> + * The above also means we _cannot_ easily call any other helper/kfunc
> + * because there is no place for us to preserve our R1 argument;
> + * existing R6-R9 belong to the callee.
> + */
> +void xdp_metadata_export_to_skb(const struct bpf_prog *prog, struct bpf_patch *patch)
> +{
> +	u32 func_id;
> +
> +	/*
> +	 * The code below generates the following:
> +	 *
> +	 * void bpf_xdp_metadata_export_to_skb(struct xdp_md *ctx)
> +	 * {
> +	 *	struct xdp_to_skb_metadata *meta;
> +	 *	int ret;
> +	 *
> +	 *	ret = bpf_xdp_adjust_meta(ctx, -sizeof(*meta));
> +	 *	if (!ret)
> +	 *		return;
> +	 *
> +	 *	meta = ctx->data_meta;
> +	 *	meta->magic = xdp_metadata_magic;
> +	 *	meta->rx_timestamp = bpf_xdp_metadata_rx_timestamp(ctx);
> +	 * }
> +	 *
> +	 */
> +
> +	bpf_patch_append(patch,
> +		/* r2 = ((struct xdp_buff *)r1)->data_meta; */
> +		BPF_LDX_MEM(BPF_DW, BPF_REG_2, BPF_REG_1,
> +			    offsetof(struct xdp_buff, data_meta)),
> +		/* r3 = ((struct xdp_buff *)r1)->data; */
> +		BPF_LDX_MEM(BPF_DW, BPF_REG_3, BPF_REG_1,
> +			    offsetof(struct xdp_buff, data)),
> +		/* if (data_meta != data) return;
> +		 *
> +		 *	data_meta > data: xdp_data_meta_unsupported()
> +		 *	data_meta < data: already used, no need to touch
> +		 */
> +		BPF_JMP_REG(BPF_JNE, BPF_REG_2, BPF_REG_3, S16_MAX),
> +
> +		/* r2 -= sizeof(struct xdp_to_skb_metadata); */
> +		BPF_ALU64_IMM(BPF_SUB, BPF_REG_2,
> +			      sizeof(struct xdp_to_skb_metadata)),
> +		/* r3 = ((struct xdp_buff *)r1)->data_hard_start; */
> +		BPF_LDX_MEM(BPF_DW, BPF_REG_3, BPF_REG_1,
> +			    offsetof(struct xdp_buff, data_hard_start)),
> +		/* r3 += sizeof(struct xdp_frame) */
> +		BPF_ALU64_IMM(BPF_ADD, BPF_REG_3,
> +			      sizeof(struct xdp_frame)),
> +		/* if (data-sizeof(struct xdp_to_skb_metadata) < data_hard_start+sizeof(struct xdp_frame)) return; */
> +		BPF_JMP_REG(BPF_JLT, BPF_REG_2, BPF_REG_3, S16_MAX),
> +
> +		/* ((struct xdp_buff *)r1)->data_meta = r2; */
> +		BPF_STX_MEM(BPF_DW, BPF_REG_1, BPF_REG_2,
> +			    offsetof(struct xdp_buff, data_meta)),
> +
> +		/* *((struct xdp_to_skb_metadata *)r2)->magic = xdp_metadata_magic; */
> +		BPF_ST_MEM(BPF_W, BPF_REG_2,
> +			   offsetof(struct xdp_to_skb_metadata, magic),
> +			   xdp_metadata_magic),
> +	);
> +
> +	/*	r0 = bpf_xdp_metadata_rx_timestamp(ctx); */
> +	func_id = xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP);
> +	prog->aux->xdp_kfunc_ndo->ndo_unroll_kfunc(prog, func_id, patch);
> +
> +	bpf_patch_append(patch,
> +		/* r2 = ((struct xdp_buff *)r1)->data_meta; */
> +		BPF_LDX_MEM(BPF_DW, BPF_REG_2, BPF_REG_1,
> +			    offsetof(struct xdp_buff, data_meta)),
> +		/* *((struct xdp_to_skb_metadata *)r2)->rx_timestamp = r0; */
> +		BPF_STX_MEM(BPF_DW, BPF_REG_2, BPF_REG_0,
> +			    offsetof(struct xdp_to_skb_metadata, rx_timestamp)),

Can the xdp prog still change the metadata through xdp->data_meta? tbh, I am not
sure it is solid enough to ask the xdp prog not to use the same random number
in its own metadata + not to change the metadata through xdp->data_meta after
calling bpf_xdp_metadata_export_to_skb().

Does xdp_to_skb_metadata have a use case for XDP_PASS (like patch 7), or can
xdp_to_skb_metadata be limited to XDP_REDIRECT only?


> +	);
> +
> +	bpf_patch_resolve_jmp(patch);
> +}
> +
>   static int __init xdp_metadata_init(void)
>   {
> +	xdp_metadata_magic = get_random_u32() | 1;
>   	return register_btf_kfunc_id_set(BPF_PROG_TYPE_XDP, &xdp_metadata_kfunc_set);
>   }
>   late_initcall(xdp_metadata_init);

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [xdp-hints] Re: [RFC bpf-next v2 06/14] xdp: Carry over xdp metadata into skb context
  2022-11-07 22:01   ` [xdp-hints] " Martin KaFai Lau
@ 2022-11-08 21:54     ` Stanislav Fomichev
  2022-11-09  3:07       ` Martin KaFai Lau
  0 siblings, 1 reply; 66+ messages in thread
From: Stanislav Fomichev @ 2022-11-08 21:54 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: ast, daniel, andrii, song, yhs, john.fastabend, kpsingh, haoluo,
	jolsa, David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev, bpf

On Mon, Nov 7, 2022 at 2:02 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 11/3/22 8:25 PM, Stanislav Fomichev wrote:
> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index 59c9fd55699d..dba857f212d7 100644
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -4217,9 +4217,13 @@ static inline bool skb_metadata_differs(const struct sk_buff *skb_a,
> >              true : __skb_metadata_differs(skb_a, skb_b, len_a);
> >   }
> >
> > +void skb_metadata_import_from_xdp(struct sk_buff *skb, size_t len);
> > +
> >   static inline void skb_metadata_set(struct sk_buff *skb, u8 meta_len)
> >   {
> >       skb_shinfo(skb)->meta_len = meta_len;
> > +     if (meta_len)
> > +             skb_metadata_import_from_xdp(skb, meta_len);
> >   }
> >
> [ ... ]
>
> > +struct xdp_to_skb_metadata {
> > +     u32 magic; /* xdp_metadata_magic */
> > +     u64 rx_timestamp;
> > +} __randomize_layout;
> > +
> > +struct bpf_patch;
> > +
>
> [ ... ]
>
> > +void skb_metadata_import_from_xdp(struct sk_buff *skb, size_t len)
> > +{
> > +     struct xdp_to_skb_metadata *meta = (void *)(skb_mac_header(skb) - len);
> > +
> > +     /* Optional SKB info, currently missing:
> > +      * - HW checksum info           (skb->ip_summed)
> > +      * - HW RX hash                 (skb_set_hash)
> > +      * - RX ring dev queue index    (skb_record_rx_queue)
> > +      */
> > +
> > +     if (len != sizeof(struct xdp_to_skb_metadata))
> > +             return;
> > +
> > +     if (meta->magic != xdp_metadata_magic)
> > +             return;
> > +
> > +     if (meta->rx_timestamp) {
> > +             *skb_hwtstamps(skb) = (struct skb_shared_hwtstamps){
> > +                     .hwtstamp = ns_to_ktime(meta->rx_timestamp),
> > +             };
> > +     }
> > +}
>
> Considering the metadata will affect the gro, should the meta be cleared after
> importing to the skb?

Yeah, good suggestion, will clear it here.

> [ ... ]
>
> > +/* Since we're not actually doing a call but instead rewriting
> > + * in place, we can only afford to use R0-R5 scratch registers.
> > + *
> > + * We reserve R1 for bpf_xdp_metadata_export_to_skb and let individual
> > + * metadata kfuncs use only R0,R4-R5.
> > + *
> > + * The above also means we _cannot_ easily call any other helper/kfunc
> > + * because there is no place for us to preserve our R1 argument;
> > + * existing R6-R9 belong to the callee.
> > + */
> > +void xdp_metadata_export_to_skb(const struct bpf_prog *prog, struct bpf_patch *patch)
> > +{
> > +     u32 func_id;
> > +
> > +     /*
> > +      * The code below generates the following:
> > +      *
> > +      * void bpf_xdp_metadata_export_to_skb(struct xdp_md *ctx)
> > +      * {
> > +      *      struct xdp_to_skb_metadata *meta;
> > +      *      int ret;
> > +      *
> > +      *      ret = bpf_xdp_adjust_meta(ctx, -sizeof(*meta));
> > +      *      if (!ret)
> > +      *              return;
> > +      *
> > +      *      meta = ctx->data_meta;
> > +      *      meta->magic = xdp_metadata_magic;
> > +      *      meta->rx_timestamp = bpf_xdp_metadata_rx_timestamp(ctx);
> > +      * }
> > +      *
> > +      */
> > +
> > +     bpf_patch_append(patch,
> > +             /* r2 = ((struct xdp_buff *)r1)->data_meta; */
> > +             BPF_LDX_MEM(BPF_DW, BPF_REG_2, BPF_REG_1,
> > +                         offsetof(struct xdp_buff, data_meta)),
> > +             /* r3 = ((struct xdp_buff *)r1)->data; */
> > +             BPF_LDX_MEM(BPF_DW, BPF_REG_3, BPF_REG_1,
> > +                         offsetof(struct xdp_buff, data)),
> > +             /* if (data_meta != data) return;
> > +              *
> > +              *      data_meta > data: xdp_data_meta_unsupported()
> > +              *      data_meta < data: already used, no need to touch
> > +              */
> > +             BPF_JMP_REG(BPF_JNE, BPF_REG_2, BPF_REG_3, S16_MAX),
> > +
> > +             /* r2 -= sizeof(struct xdp_to_skb_metadata); */
> > +             BPF_ALU64_IMM(BPF_SUB, BPF_REG_2,
> > +                           sizeof(struct xdp_to_skb_metadata)),
> > +             /* r3 = ((struct xdp_buff *)r1)->data_hard_start; */
> > +             BPF_LDX_MEM(BPF_DW, BPF_REG_3, BPF_REG_1,
> > +                         offsetof(struct xdp_buff, data_hard_start)),
> > +             /* r3 += sizeof(struct xdp_frame) */
> > +             BPF_ALU64_IMM(BPF_ADD, BPF_REG_3,
> > +                           sizeof(struct xdp_frame)),
> > +             /* if (data-sizeof(struct xdp_to_skb_metadata) < data_hard_start+sizeof(struct xdp_frame)) return; */
> > +             BPF_JMP_REG(BPF_JLT, BPF_REG_2, BPF_REG_3, S16_MAX),
> > +
> > +             /* ((struct xdp_buff *)r1)->data_meta = r2; */
> > +             BPF_STX_MEM(BPF_DW, BPF_REG_1, BPF_REG_2,
> > +                         offsetof(struct xdp_buff, data_meta)),
> > +
> > +             /* *((struct xdp_to_skb_metadata *)r2)->magic = xdp_metadata_magic; */
> > +             BPF_ST_MEM(BPF_W, BPF_REG_2,
> > +                        offsetof(struct xdp_to_skb_metadata, magic),
> > +                        xdp_metadata_magic),
> > +     );
> > +
> > +     /*      r0 = bpf_xdp_metadata_rx_timestamp(ctx); */
> > +     func_id = xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP);
> > +     prog->aux->xdp_kfunc_ndo->ndo_unroll_kfunc(prog, func_id, patch);
> > +
> > +     bpf_patch_append(patch,
> > +             /* r2 = ((struct xdp_buff *)r1)->data_meta; */
> > +             BPF_LDX_MEM(BPF_DW, BPF_REG_2, BPF_REG_1,
> > +                         offsetof(struct xdp_buff, data_meta)),
> > +             /* *((struct xdp_to_skb_metadata *)r2)->rx_timestamp = r0; */
> > +             BPF_STX_MEM(BPF_DW, BPF_REG_2, BPF_REG_0,
> > +                         offsetof(struct xdp_to_skb_metadata, rx_timestamp)),
>
> Can the xdp prog still change the metadata through xdp->data_meta? tbh, I am not
> sure it is solid enough by asking the xdp prog not to use the same random number
> in its own metadata + not to change the metadata through xdp->data_meta after
> calling bpf_xdp_metadata_export_to_skb().

What do you think the usecase here might be? Or are you suggesting we
reject further access to data_meta after
bpf_xdp_metadata_export_to_skb somehow?

If we want to let the programs override some of this
bpf_xdp_metadata_export_to_skb() metadata, it feels like we can add
more kfuncs instead of exposing the layout?

bpf_xdp_metadata_export_to_skb(ctx);
bpf_xdp_metadata_export_skb_hash(ctx, 1234);
...

> Does xdp_to_skb_metadata have a use case for XDP_PASS (like patch 7) or the
> xdp_to_skb_metadata can be limited to XDP_REDIRECT only?

XDP_PASS cases where we convert xdp_buff into skb in the drivers right
now usually have C code to manually pull out the metadata (out of hw
desc) and put it into skb.

So, currently, if we're calling bpf_xdp_metadata_export_to_skb() for
XDP_PASS, we're doing a double amount of work:
skb_metadata_import_from_xdp first, then custom driver code second.

In theory, maybe we should completely skip drivers custom parsing when
there is a prog with BPF_F_XDP_HAS_METADATA?
Then both xdp->skb paths (XDP_PASS+XDP_REDIRECT) will be bpf-driven
and won't require any mental work (plus, the drivers won't have to
care either in the future).

WDYT?

> > +     );
> > +
> > +     bpf_patch_resolve_jmp(patch);
> > +}
> > +
> >   static int __init xdp_metadata_init(void)
> >   {
> > +     xdp_metadata_magic = get_random_u32() | 1;
> >       return register_btf_kfunc_id_set(BPF_PROG_TYPE_XDP, &xdp_metadata_kfunc_set);
> >   }
> >   late_initcall(xdp_metadata_init);

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [xdp-hints] Re: [RFC bpf-next v2 06/14] xdp: Carry over xdp metadata into skb context
  2022-11-08 21:54     ` Stanislav Fomichev
@ 2022-11-09  3:07       ` Martin KaFai Lau
  2022-11-09  4:19         ` Martin KaFai Lau
  0 siblings, 1 reply; 66+ messages in thread
From: Martin KaFai Lau @ 2022-11-09  3:07 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: ast, daniel, andrii, song, yhs, john.fastabend, kpsingh, haoluo,
	jolsa, David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev, bpf

On 11/8/22 1:54 PM, Stanislav Fomichev wrote:
> On Mon, Nov 7, 2022 at 2:02 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>
>> On 11/3/22 8:25 PM, Stanislav Fomichev wrote:
>>> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>>> index 59c9fd55699d..dba857f212d7 100644
>>> --- a/include/linux/skbuff.h
>>> +++ b/include/linux/skbuff.h
>>> @@ -4217,9 +4217,13 @@ static inline bool skb_metadata_differs(const struct sk_buff *skb_a,
>>>               true : __skb_metadata_differs(skb_a, skb_b, len_a);
>>>    }
>>>
>>> +void skb_metadata_import_from_xdp(struct sk_buff *skb, size_t len);
>>> +
>>>    static inline void skb_metadata_set(struct sk_buff *skb, u8 meta_len)
>>>    {
>>>        skb_shinfo(skb)->meta_len = meta_len;
>>> +     if (meta_len)
>>> +             skb_metadata_import_from_xdp(skb, meta_len);
>>>    }
>>>
>> [ ... ]
>>
>>> +struct xdp_to_skb_metadata {
>>> +     u32 magic; /* xdp_metadata_magic */
>>> +     u64 rx_timestamp;
>>> +} __randomize_layout;
>>> +
>>> +struct bpf_patch;
>>> +
>>
>> [ ... ]
>>
>>> +void skb_metadata_import_from_xdp(struct sk_buff *skb, size_t len)
>>> +{
>>> +     struct xdp_to_skb_metadata *meta = (void *)(skb_mac_header(skb) - len);
>>> +
>>> +     /* Optional SKB info, currently missing:
>>> +      * - HW checksum info           (skb->ip_summed)
>>> +      * - HW RX hash                 (skb_set_hash)
>>> +      * - RX ring dev queue index    (skb_record_rx_queue)
>>> +      */
>>> +
>>> +     if (len != sizeof(struct xdp_to_skb_metadata))
>>> +             return;
>>> +
>>> +     if (meta->magic != xdp_metadata_magic)
>>> +             return;
>>> +
>>> +     if (meta->rx_timestamp) {
>>> +             *skb_hwtstamps(skb) = (struct skb_shared_hwtstamps){
>>> +                     .hwtstamp = ns_to_ktime(meta->rx_timestamp),
>>> +             };
>>> +     }
>>> +}
>>
>> Considering the metadata will affect the gro, should the meta be cleared after
>> importing to the skb?
> 
> Yeah, good suggestion, will clear it here.
> 
>> [ ... ]
>>
>>> +/* Since we're not actually doing a call but instead rewriting
>>> + * in place, we can only afford to use R0-R5 scratch registers.
>>> + *
>>> + * We reserve R1 for bpf_xdp_metadata_export_to_skb and let individual
>>> + * metadata kfuncs use only R0,R4-R5.
>>> + *
>>> + * The above also means we _cannot_ easily call any other helper/kfunc
>>> + * because there is no place for us to preserve our R1 argument;
>>> + * existing R6-R9 belong to the callee.
>>> + */
>>> +void xdp_metadata_export_to_skb(const struct bpf_prog *prog, struct bpf_patch *patch)
>>> +{
>>> +     u32 func_id;
>>> +
>>> +     /*
>>> +      * The code below generates the following:
>>> +      *
>>> +      * void bpf_xdp_metadata_export_to_skb(struct xdp_md *ctx)
>>> +      * {
>>> +      *      struct xdp_to_skb_metadata *meta;
>>> +      *      int ret;
>>> +      *
>>> +      *      ret = bpf_xdp_adjust_meta(ctx, -sizeof(*meta));
>>> +      *      if (!ret)
>>> +      *              return;
>>> +      *
>>> +      *      meta = ctx->data_meta;
>>> +      *      meta->magic = xdp_metadata_magic;
>>> +      *      meta->rx_timestamp = bpf_xdp_metadata_rx_timestamp(ctx);
>>> +      * }
>>> +      *
>>> +      */
>>> +
>>> +     bpf_patch_append(patch,
>>> +             /* r2 = ((struct xdp_buff *)r1)->data_meta; */
>>> +             BPF_LDX_MEM(BPF_DW, BPF_REG_2, BPF_REG_1,
>>> +                         offsetof(struct xdp_buff, data_meta)),
>>> +             /* r3 = ((struct xdp_buff *)r1)->data; */
>>> +             BPF_LDX_MEM(BPF_DW, BPF_REG_3, BPF_REG_1,
>>> +                         offsetof(struct xdp_buff, data)),
>>> +             /* if (data_meta != data) return;
>>> +              *
>>> +              *      data_meta > data: xdp_data_meta_unsupported()
>>> +              *      data_meta < data: already used, no need to touch
>>> +              */
>>> +             BPF_JMP_REG(BPF_JNE, BPF_REG_2, BPF_REG_3, S16_MAX),
>>> +
>>> +             /* r2 -= sizeof(struct xdp_to_skb_metadata); */
>>> +             BPF_ALU64_IMM(BPF_SUB, BPF_REG_2,
>>> +                           sizeof(struct xdp_to_skb_metadata)),
>>> +             /* r3 = ((struct xdp_buff *)r1)->data_hard_start; */
>>> +             BPF_LDX_MEM(BPF_DW, BPF_REG_3, BPF_REG_1,
>>> +                         offsetof(struct xdp_buff, data_hard_start)),
>>> +             /* r3 += sizeof(struct xdp_frame) */
>>> +             BPF_ALU64_IMM(BPF_ADD, BPF_REG_3,
>>> +                           sizeof(struct xdp_frame)),
>>> +             /* if (data-sizeof(struct xdp_to_skb_metadata) < data_hard_start+sizeof(struct xdp_frame)) return; */
>>> +             BPF_JMP_REG(BPF_JLT, BPF_REG_2, BPF_REG_3, S16_MAX),
>>> +
>>> +             /* ((struct xdp_buff *)r1)->data_meta = r2; */
>>> +             BPF_STX_MEM(BPF_DW, BPF_REG_1, BPF_REG_2,
>>> +                         offsetof(struct xdp_buff, data_meta)),
>>> +
>>> +             /* *((struct xdp_to_skb_metadata *)r2)->magic = xdp_metadata_magic; */
>>> +             BPF_ST_MEM(BPF_W, BPF_REG_2,
>>> +                        offsetof(struct xdp_to_skb_metadata, magic),
>>> +                        xdp_metadata_magic),
>>> +     );
>>> +
>>> +     /*      r0 = bpf_xdp_metadata_rx_timestamp(ctx); */
>>> +     func_id = xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP);
>>> +     prog->aux->xdp_kfunc_ndo->ndo_unroll_kfunc(prog, func_id, patch);
>>> +
>>> +     bpf_patch_append(patch,
>>> +             /* r2 = ((struct xdp_buff *)r1)->data_meta; */
>>> +             BPF_LDX_MEM(BPF_DW, BPF_REG_2, BPF_REG_1,
>>> +                         offsetof(struct xdp_buff, data_meta)),
>>> +             /* *((struct xdp_to_skb_metadata *)r2)->rx_timestamp = r0; */
>>> +             BPF_STX_MEM(BPF_DW, BPF_REG_2, BPF_REG_0,
>>> +                         offsetof(struct xdp_to_skb_metadata, rx_timestamp)),
>>
>> Can the xdp prog still change the metadata through xdp->data_meta? tbh, I am not
>> sure it is solid enough by asking the xdp prog not to use the same random number
>> in its own metadata + not to change the metadata through xdp->data_meta after
>> calling bpf_xdp_metadata_export_to_skb().
> 
> What do you think the usecase here might be? Or are you suggesting we
> reject further access to data_meta after
> bpf_xdp_metadata_export_to_skb somehow?
> 
> If we want to let the programs override some of this
> bpf_xdp_metadata_export_to_skb() metadata, it feels like we can add
> more kfuncs instead of exposing the layout?
> 
> bpf_xdp_metadata_export_to_skb(ctx);
> bpf_xdp_metadata_export_skb_hash(ctx, 1234);


I can't think of a use case now for the xdp prog to use the xdp_to_skb_metadata
while the xdp prog can directly call the kfunc (eg
bpf_xdp_metadata_rx_timestamp) to get an individual hint.  I was asking whether
patch 7 is an actual use case, because it does test the tstamp in XDP_PASS, or
whether it is mostly for selftest purposes.  Yeah, maybe the xdp prog will be
able to change the xdp_to_skb_metadata eventually but that is for later.

My concern is that the xdp prog is allowed to change xdp_to_skb_metadata, or
can coincidentally write metadata that matches both the random magic and
sizeof(struct xdp_to_skb_metadata).

Also, the added opacity of xdp_to_skb_metadata (__randomize_layout + random int)
is trying very hard to hide it from the xdp prog.  Instead, would it be cleaner to
have a flag in xdp->flags (to be set by bpf_xdp_metadata_export_to_skb?) to
guard this, as in one of Jesper's patches?  xdp_convert_ctx_access() and
bpf_xdp_adjust_meta() can check this bit to ensure that xdp_to_skb_metadata
cannot be read and that no metadata can be added/deleted after that.  btw, is it
possible to keep both xdp_to_skb_metadata and the xdp_prog's metadata?  After
skb_metadata_import_from_xdp pops the xdp_to_skb_metadata, the remaining
xdp_prog's metadata can still be used by bpf-tc.
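
A minimal user-space sketch of the flag idea, under loud assumptions: MOCK_XDP_FLAGS_META_EXPORTED, MOCK_EXPORT_LEN, mock_xdp_buff and both helpers are invented for illustration. They are not existing kernel flags or functions, and the choice of -EACCES as the error is arbitrary. Pointers are modelled as plain offsets into the frame. The point is only that once the export helper has set the bit, a later adjust_meta-style call is refused and data_meta stays where the export left it.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MOCK_XDP_FLAGS_META_EXPORTED	(1u << 0)	/* hypothetical flag bit */
#define MOCK_EXPORT_LEN			16u		/* hypothetical metadata size */

/* Hypothetical stand-in for xdp_buff, using offsets instead of pointers. */
struct mock_xdp_buff {
	size_t data;		/* offset of the packet data */
	size_t data_meta;	/* offset of the metadata start, <= data */
	uint32_t flags;
};

/* Reserve the metadata area and mark it as owned by the kernel. */
static int mock_export_to_skb(struct mock_xdp_buff *xdp)
{
	assert(xdp->data_meta <= xdp->data);

	if (xdp->data_meta != xdp->data)
		return -EINVAL;		/* metadata area already in use */
	if (xdp->data < MOCK_EXPORT_LEN)
		return -EINVAL;		/* not enough headroom */

	xdp->data_meta = xdp->data - MOCK_EXPORT_LEN;
	xdp->flags |= MOCK_XDP_FLAGS_META_EXPORTED;
	return 0;
}

/* A negative offset grows the metadata area, a positive one shrinks it. */
static int mock_adjust_meta(struct mock_xdp_buff *xdp, int offset)
{
	long long meta = (long long)xdp->data_meta + offset;

	if (xdp->flags & MOCK_XDP_FLAGS_META_EXPORTED)
		return -EACCES;		/* metadata is frozen after export */
	if (meta < 0 || (unsigned long long)meta > xdp->data)
		return -EINVAL;

	xdp->data_meta = (size_t)meta;
	return 0;
}
```

Before the export, growing and shrinking the metadata works as usual; after it, the same call fails and the reserved area is untouched. Keeping the prog's own metadata alongside the exported struct would need the first check in mock_export_to_skb() relaxed, which this sketch does not attempt.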

> ...
> 
>> Does xdp_to_skb_metadata have a use case for XDP_PASS (like patch 7) or the
>> xdp_to_skb_metadata can be limited to XDP_REDIRECT only?
> 
> XDP_PASS cases where we convert xdp_buff into skb in the drivers right
> now usually have C code to manually pull out the metadata (out of hw
> desc) and put it into skb.
> 
> So, currently, if we're calling bpf_xdp_metadata_export_to_skb() for
> XDP_PASS, we're doing a double amount of work:
> skb_metadata_import_from_xdp first, then custom driver code second.
> 
> In theory, maybe we should completely skip drivers custom parsing when
> there is a prog with BPF_F_XDP_HAS_METADATA?
> Then both xdp->skb paths (XDP_PASS+XDP_REDIRECT) will be bpf-driven
> and won't require any mental work (plus, the drivers won't have to
> care either in the future).
>
> WDYT?


Yeah, not sure if it can solely depend on BPF_F_XDP_HAS_METADATA but it makes 
sense to only use the hints (if ever written) from xdp prog especially if it 
will eventually support xdp prog changing some of the hints in the future.  For 
now, I think either way is fine since they are the same and the xdp prog is sort 
of doing extra unnecessary work anyway by calling 
bpf_xdp_metadata_export_to_skb() with XDP_PASS and knowing nothing can be 
changed now.


> 
>>> +     );
>>> +
>>> +     bpf_patch_resolve_jmp(patch);
>>> +}
>>> +
>>>    static int __init xdp_metadata_init(void)
>>>    {
>>> +     xdp_metadata_magic = get_random_u32() | 1;
>>>        return register_btf_kfunc_id_set(BPF_PROG_TYPE_XDP, &xdp_metadata_kfunc_set);
>>>    }
>>>    late_initcall(xdp_metadata_init);


^ permalink raw reply	[flat|nested] 66+ messages in thread

* [xdp-hints] Re: [RFC bpf-next v2 06/14] xdp: Carry over xdp metadata into skb context
  2022-11-09  3:07       ` Martin KaFai Lau
@ 2022-11-09  4:19         ` Martin KaFai Lau
  2022-11-09 11:10           ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 66+ messages in thread
From: Martin KaFai Lau @ 2022-11-09  4:19 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: ast, daniel, andrii, song, yhs, john.fastabend, kpsingh, haoluo,
	jolsa, David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev, bpf

On 11/8/22 7:07 PM, Martin KaFai Lau wrote:
> On 11/8/22 1:54 PM, Stanislav Fomichev wrote:
>> On Mon, Nov 7, 2022 at 2:02 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>>
>>> On 11/3/22 8:25 PM, Stanislav Fomichev wrote:
>>>> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
>>>> index 59c9fd55699d..dba857f212d7 100644
>>>> --- a/include/linux/skbuff.h
>>>> +++ b/include/linux/skbuff.h
>>>> @@ -4217,9 +4217,13 @@ static inline bool skb_metadata_differs(const struct 
>>>> sk_buff *skb_a,
>>>>               true : __skb_metadata_differs(skb_a, skb_b, len_a);
>>>>    }
>>>>
>>>> +void skb_metadata_import_from_xdp(struct sk_buff *skb, size_t len);
>>>> +
>>>>    static inline void skb_metadata_set(struct sk_buff *skb, u8 meta_len)
>>>>    {
>>>>        skb_shinfo(skb)->meta_len = meta_len;
>>>> +     if (meta_len)
>>>> +             skb_metadata_import_from_xdp(skb, meta_len);
>>>>    }
>>>>
>>> [ ... ]
>>>
>>>> +struct xdp_to_skb_metadata {
>>>> +     u32 magic; /* xdp_metadata_magic */
>>>> +     u64 rx_timestamp;
>>>> +} __randomize_layout;
>>>> +
>>>> +struct bpf_patch;
>>>> +
>>>
>>> [ ... ]
>>>
>>>> +void skb_metadata_import_from_xdp(struct sk_buff *skb, size_t len)
>>>> +{
>>>> +     struct xdp_to_skb_metadata *meta = (void *)(skb_mac_header(skb) - len);
>>>> +
>>>> +     /* Optional SKB info, currently missing:
>>>> +      * - HW checksum info           (skb->ip_summed)
>>>> +      * - HW RX hash                 (skb_set_hash)
>>>> +      * - RX ring dev queue index    (skb_record_rx_queue)
>>>> +      */
>>>> +
>>>> +     if (len != sizeof(struct xdp_to_skb_metadata))
>>>> +             return;
>>>> +
>>>> +     if (meta->magic != xdp_metadata_magic)
>>>> +             return;
>>>> +
>>>> +     if (meta->rx_timestamp) {
>>>> +             *skb_hwtstamps(skb) = (struct skb_shared_hwtstamps){
>>>> +                     .hwtstamp = ns_to_ktime(meta->rx_timestamp),
>>>> +             };
>>>> +     }
>>>> +}
>>>
>>> Considering the metadata will affect the gro, should the meta be cleared after
>>> importing to the skb?
>>
>> Yeah, good suggestion, will clear it here.
>>
>>> [ ... ]
>>>
>>>> +/* Since we're not actually doing a call but instead rewriting
>>>> + * in place, we can only afford to use R0-R5 scratch registers.
>>>> + *
>>>> + * We reserve R1 for bpf_xdp_metadata_export_to_skb and let individual
>>>> + * metadata kfuncs use only R0,R4-R5.
>>>> + *
>>>> + * The above also means we _cannot_ easily call any other helper/kfunc
>>>> + * because there is no place for us to preserve our R1 argument;
>>>> + * existing R6-R9 belong to the callee.
>>>> + */
>>>> +void xdp_metadata_export_to_skb(const struct bpf_prog *prog, struct bpf_patch *patch)
>>>> +{
>>>> +     u32 func_id;
>>>> +
>>>> +     /*
>>>> +      * The code below generates the following:
>>>> +      *
>>>> +      * void bpf_xdp_metadata_export_to_skb(struct xdp_md *ctx)
>>>> +      * {
>>>> +      *      struct xdp_to_skb_metadata *meta;
>>>> +      *      int ret;
>>>> +      *
>>>> +      *      ret = bpf_xdp_adjust_meta(ctx, -sizeof(*meta));
>>>> +      *      if (ret)
>>>> +      *              return;
>>>> +      *
>>>> +      *      meta = ctx->data_meta;
>>>> +      *      meta->magic = xdp_metadata_magic;
>>>> +      *      meta->rx_timestamp = bpf_xdp_metadata_rx_timestamp(ctx);
>>>> +      * }
>>>> +      *
>>>> +      */
>>>> +
>>>> +     bpf_patch_append(patch,
>>>> +             /* r2 = ((struct xdp_buff *)r1)->data_meta; */
>>>> +             BPF_LDX_MEM(BPF_DW, BPF_REG_2, BPF_REG_1,
>>>> +                         offsetof(struct xdp_buff, data_meta)),
>>>> +             /* r3 = ((struct xdp_buff *)r1)->data; */
>>>> +             BPF_LDX_MEM(BPF_DW, BPF_REG_3, BPF_REG_1,
>>>> +                         offsetof(struct xdp_buff, data)),
>>>> +             /* if (data_meta != data) return;
>>>> +              *
>>>> +              *      data_meta > data: xdp_data_meta_unsupported()
>>>> +              *      data_meta < data: already used, no need to touch
>>>> +              */
>>>> +             BPF_JMP_REG(BPF_JNE, BPF_REG_2, BPF_REG_3, S16_MAX),
>>>> +
>>>> +             /* r2 -= sizeof(struct xdp_to_skb_metadata); */
>>>> +             BPF_ALU64_IMM(BPF_SUB, BPF_REG_2,
>>>> +                           sizeof(struct xdp_to_skb_metadata)),
>>>> +             /* r3 = ((struct xdp_buff *)r1)->data_hard_start; */
>>>> +             BPF_LDX_MEM(BPF_DW, BPF_REG_3, BPF_REG_1,
>>>> +                         offsetof(struct xdp_buff, data_hard_start)),
>>>> +             /* r3 += sizeof(struct xdp_frame) */
>>>> +             BPF_ALU64_IMM(BPF_ADD, BPF_REG_3,
>>>> +                           sizeof(struct xdp_frame)),
>>>> +             /* if (data-sizeof(struct xdp_to_skb_metadata) < data_hard_start+sizeof(struct xdp_frame)) return; */
>>>> +             BPF_JMP_REG(BPF_JLT, BPF_REG_2, BPF_REG_3, S16_MAX),
>>>> +
>>>> +             /* ((struct xdp_buff *)r1)->data_meta = r2; */
>>>> +             BPF_STX_MEM(BPF_DW, BPF_REG_1, BPF_REG_2,
>>>> +                         offsetof(struct xdp_buff, data_meta)),
>>>> +
>>>> +             /* *((struct xdp_to_skb_metadata *)r2)->magic = xdp_metadata_magic; */
>>>> +             BPF_ST_MEM(BPF_W, BPF_REG_2,
>>>> +                        offsetof(struct xdp_to_skb_metadata, magic),
>>>> +                        xdp_metadata_magic),
>>>> +     );
>>>> +
>>>> +     /*      r0 = bpf_xdp_metadata_rx_timestamp(ctx); */
>>>> +     func_id = xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP);
>>>> +     prog->aux->xdp_kfunc_ndo->ndo_unroll_kfunc(prog, func_id, patch);
>>>> +
>>>> +     bpf_patch_append(patch,
>>>> +             /* r2 = ((struct xdp_buff *)r1)->data_meta; */
>>>> +             BPF_LDX_MEM(BPF_DW, BPF_REG_2, BPF_REG_1,
>>>> +                         offsetof(struct xdp_buff, data_meta)),
>>>> +             /* *((struct xdp_to_skb_metadata *)r2)->rx_timestamp = r0; */
>>>> +             BPF_STX_MEM(BPF_DW, BPF_REG_2, BPF_REG_0,
>>>> +                         offsetof(struct xdp_to_skb_metadata, rx_timestamp)),
>>>
>>> Can the xdp prog still change the metadata through xdp->data_meta? tbh, I am not
>>> sure it is solid enough by asking the xdp prog not to use the same random number
>>> in its own metadata + not to change the metadata through xdp->data_meta after
>>> calling bpf_xdp_metadata_export_to_skb().
>>
>> What do you think the usecase here might be? Or are you suggesting we
>> reject further access to data_meta after
>> bpf_xdp_metadata_export_to_skb somehow?
>>
>> If we want to let the programs override some of this
>> bpf_xdp_metadata_export_to_skb() metadata, it feels like we can add
>> more kfuncs instead of exposing the layout?
>>
>> bpf_xdp_metadata_export_to_skb(ctx);
>> bpf_xdp_metadata_export_skb_hash(ctx, 1234);

After re-reading patch 6, I have another question about the 'void
bpf_xdp_metadata_export_to_skb();' function signature.  Should it at least
return ok/err?  Or even return a 'struct xdp_to_skb_metadata *' pointer so the
xdp prog can directly read (or even write) it?  A related question: why does
'struct xdp_to_skb_metadata' need __randomize_layout?
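
To make the pointer-returning option concrete, here is a rough userspace
model of it.  This is only a sketch under assumptions: the model_* names and
MODEL_FRAME_RESERVE are made up for illustration, it is not what patch 6
generates, and the buffer is assumed to be 8-byte aligned so the cast is safe.
It keeps the same two checks as the unrolled bytecode (data_meta == data,
enough headroom) but reports failure through the return value.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel structs; only the fields this
 * sketch needs are modeled.
 */
struct model_xdp_buff {
	uint8_t *data_hard_start;
	uint8_t *data_meta;
	uint8_t *data;
};

struct model_xdp_to_skb_metadata {
	uint32_t magic;
	uint64_t rx_timestamp;
};

/* Placeholder for sizeof(struct xdp_frame); not the real value. */
#define MODEL_FRAME_RESERVE 40

/* Returns a typed pointer on success and NULL on hard failure, so the
 * caller does not have to compare data_meta with data to find out.
 */
static struct model_xdp_to_skb_metadata *
model_export_to_skb(struct model_xdp_buff *xdp, uint32_t magic, uint64_t tstamp)
{
	struct model_xdp_to_skb_metadata *meta;
	size_t headroom;

	/* Metadata unsupported or already used by the program. */
	if (xdp->data_meta != xdp->data)
		return NULL;

	/* Not enough room between the reserved area and the packet. */
	headroom = (size_t)(xdp->data - xdp->data_hard_start);
	if (headroom < MODEL_FRAME_RESERVE + sizeof(*meta))
		return NULL;

	xdp->data_meta = xdp->data - sizeof(*meta);
	/* Assumes the resulting address is suitably aligned for the struct. */
	meta = (struct model_xdp_to_skb_metadata *)xdp->data_meta;
	meta->magic = magic;
	meta->rx_timestamp = tstamp;
	return meta;
}
```

With something along these lines the prog would only need a NULL check on
the return value before reading or writing the hints.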


> 
> 
> I can't think of a use case now for the xdp prog to use the xdp_to_skb_metadata 
> while the xdp prog can directly call the kfunc (eg 
> bpf_xdp_metadata_rx_timestamp) to get individual hint.  I was asking if patch 7 
> is an actual use case because it does test the tstamp in XDP_PASS or it is 
> mostly for selftest purpose?  Yeah, may be the xdp prog will be able to change 
> the xdp_to_skb_metadata eventually but that is for later.
> 
> My concern is that the xdp prog is allowed to change xdp_to_skb_metadata, or
> may by coincidence write metadata that matches both the random number and the
> sizeof(struct xdp_to_skb_metadata).
> 
> Also, the added opacity of xdp_to_skb_metadata (__randomize_layout + random int)
> is trying very hard to hide it from the xdp prog.  Instead, would it be cleaner to
> have a flag in xdp->flags (to be set by bpf_xdp_metadata_export_to_skb?) to
> guard this, like in one of Jesper's patches?  The xdp_convert_ctx_access() and
> bpf_xdp_adjust_meta() can check this bit to ensure the xdp_to_skb_metadata
> cannot be read and no metadata can be added/deleted after that.  btw, is it
> possible to keep both xdp_to_skb_metadata and the xdp_prog's metadata?  After
> skb_metadata_import_from_xdp pops the xdp_to_skb_metadata, the remaining
> xdp_prog's metadata can still be used by the bpf-tc.
> 
>> ...
>>
>>> Does xdp_to_skb_metadata have a use case for XDP_PASS (like patch 7) or the
>>> xdp_to_skb_metadata can be limited to XDP_REDIRECT only?
>>
>> XDP_PASS cases where we convert xdp_buff into skb in the drivers right
>> now usually have C code to manually pull out the metadata (out of hw
>> desc) and put it into skb.
>>
>> So, currently, if we're calling bpf_xdp_metadata_export_to_skb() for
>> XDP_PASS, we're doing a double amount of work:
>> skb_metadata_import_from_xdp first, then custom driver code second.
>>
>> In theory, maybe we should completely skip drivers custom parsing when
>> there is a prog with BPF_F_XDP_HAS_METADATA?
>> Then both xdp->skb paths (XDP_PASS+XDP_REDIRECT) will be bpf-driven
>> and won't require any mental work (plus, the drivers won't have to
>> care either in the future).
>> WDYT?
> 
> 
> Yeah, not sure if it can solely depend on BPF_F_XDP_HAS_METADATA but it makes 
> sense to only use the hints (if ever written) from xdp prog especially if it 
> will eventually support xdp prog changing some of the hints in the future.  For 
> now, I think either way is fine since they are the same and the xdp prog is sort 
> of doing extra unnecessary work anyway by calling 
> bpf_xdp_metadata_export_to_skb() with XDP_PASS and knowing nothing can be 
> changed now.
> 
> 
>>
>>>> +     );
>>>> +
>>>> +     bpf_patch_resolve_jmp(patch);
>>>> +}
>>>> +
>>>>    static int __init xdp_metadata_init(void)
>>>>    {
>>>> +     xdp_metadata_magic = get_random_u32() | 1;
>>>>        return register_btf_kfunc_id_set(BPF_PROG_TYPE_XDP, &xdp_metadata_kfunc_set);
>>>>    }
>>>>    late_initcall(xdp_metadata_init);
> 



* [xdp-hints] Re: [RFC bpf-next v2 06/14] xdp: Carry over xdp metadata into skb context
  2022-11-09  4:19         ` Martin KaFai Lau
@ 2022-11-09 11:10           ` Toke Høiland-Jørgensen
  2022-11-09 18:22             ` Martin KaFai Lau
  2022-11-10  1:26             ` John Fastabend
  0 siblings, 2 replies; 66+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-11-09 11:10 UTC (permalink / raw)
  To: Martin KaFai Lau, Stanislav Fomichev
  Cc: ast, daniel, andrii, song, yhs, john.fastabend, kpsingh, haoluo,
	jolsa, David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev, bpf

Snipping a bit of context to reply to this bit:

>>>> Can the xdp prog still change the metadata through xdp->data_meta? tbh, I am not
>>>> sure it is solid enough by asking the xdp prog not to use the same random number
>>>> in its own metadata + not to change the metadata through xdp->data_meta after
>>>> calling bpf_xdp_metadata_export_to_skb().
>>>
>>> What do you think the usecase here might be? Or are you suggesting we
>>> reject further access to data_meta after
>>> bpf_xdp_metadata_export_to_skb somehow?
>>>
>>> If we want to let the programs override some of this
>>> bpf_xdp_metadata_export_to_skb() metadata, it feels like we can add
>>> more kfuncs instead of exposing the layout?
>>>
>>> bpf_xdp_metadata_export_to_skb(ctx);
>>> bpf_xdp_metadata_export_skb_hash(ctx, 1234);

There are several use cases for needing to access the metadata after
calling bpf_xdp_metadata_export_to_skb():

- Accessing the metadata after redirect (in a cpumap or devmap program,
  or on a veth device)
- Transferring the packet+metadata to AF_XDP
- Returning XDP_PASS, but accessing some of the metadata first (whether
  to read or change it)

The last one could be solved by calling additional kfuncs, but that
would be less efficient than just directly editing the struct which
will be cache-hot after the helper returns.

And yeah, this will allow the XDP program to inject arbitrary metadata
into the netstack; but it can already inject arbitrary *packet* data
into the stack, so not sure if this is much of an additional risk? If it
does lead to trivial crashes, we should probably harden the stack
against that?

As for the random number, Jesper and I discussed replacing this with the
same BTF-ID scheme that he was using in his patch series. I.e., instead
of just putting in a random number, we insert the BTF ID of the metadata
struct at the end of it. This will allow us to support multiple
different formats in the future (not just changing the layout, but
having multiple simultaneous formats in the same kernel image), in case
we run out of space.

We should probably also have a flag set on the xdp_frame so the stack
knows that the metadata area contains relevant-to-skb data, to guard
against an XDP program accidentally hitting the "magic number" (BTF_ID)
in unrelated stuff it puts into the metadata area.
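
As a rough illustration of how the two guards could combine on the import
side, here is a userspace sketch.  Everything named model_* or MODEL_* is
hypothetical, the struct contents are invented for the example, and the real
xdp_frame has no such flag in this series.  The idea is that the stack only
looks at the metadata area when the flag is set, and then checks the ID stored
in the last bytes of the area:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define MODEL_FRAME_F_SKB_META (1u << 0)	/* hypothetical flag bit */

/* Hypothetical stand-in for the relevant xdp_frame fields. */
struct model_xdp_frame {
	uint32_t flags;
	uint8_t metasize;
	const uint8_t *meta;	/* start of the metadata area */
};

/* The ID is the last member and the struct has no tail padding, so the
 * ID occupies the final four bytes before the packet data.
 */
struct model_skb_meta {
	uint64_t rx_timestamp;
	uint32_t rx_hash;
	uint32_t btf_id;
};

/* Copies the metadata into *out and returns true only if the frame is
 * flagged and the trailing ID matches; *out is untouched otherwise.
 */
static bool model_meta_is_for_skb(const struct model_xdp_frame *frame,
				  uint32_t expected_btf_id,
				  struct model_skb_meta *out)
{
	struct model_skb_meta tmp;

	if (!(frame->flags & MODEL_FRAME_F_SKB_META))
		return false;
	if (frame->metasize < sizeof(tmp))
		return false;

	/* The struct sits at the end of the area, right before the packet. */
	memcpy(&tmp, frame->meta + frame->metasize - sizeof(tmp), sizeof(tmp));
	if (tmp.btf_id != expected_btf_id)
		return false;

	*out = tmp;
	return true;
}
```

The flag check comes first on purpose: an unflagged frame is never
interpreted, no matter what bytes the program happened to write.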

> After re-reading patch 6, have another question. The 'void
> bpf_xdp_metadata_export_to_skb();' function signature. Should it at
> least return ok/err? or even return a 'struct xdp_to_skb_metadata *'
> pointer and the xdp prog can directly read (or even write) it?

Hmm, I'm not sure returning a failure makes sense? Failure to read one
or more fields just means that those fields will not be populated? We
should probably have a flags field inside the metadata struct itself to
indicate which fields are set or not, but I'm not sure returning an
error value adds anything? Returning a pointer to the metadata field
might be convenient for users (it would just be an alias to the
data_meta pointer, but the verifier could know its size, so the program
doesn't have to bounds check it).
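
To sketch what I mean by a flags field inside the struct (names and layout
are made up for illustration, this is not a proposal for the actual struct),
something like the following would let a consumer tell "field not populated"
apart from "field is zero":

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-field validity bits. */
enum model_meta_field {
	MODEL_META_RX_TIMESTAMP = 1u << 0,
	MODEL_META_RX_HASH      = 1u << 1,
};

/* Hypothetical metadata struct carrying its own validity mask. */
struct model_meta {
	uint32_t valid;		/* bitmask of enum model_meta_field */
	uint32_t rx_hash;
	uint64_t rx_timestamp;
};

/* Producer side: store the value and mark the field as populated. */
static void model_meta_set_timestamp(struct model_meta *m, uint64_t ts)
{
	m->rx_timestamp = ts;
	m->valid |= MODEL_META_RX_TIMESTAMP;
}

/* Consumer side: returns 1 and fills *ts if the field was populated,
 * returns 0 and leaves *ts alone otherwise.
 */
static int model_meta_get_timestamp(const struct model_meta *m, uint64_t *ts)
{
	if (!(m->valid & MODEL_META_RX_TIMESTAMP))
		return 0;
	*ts = m->rx_timestamp;
	return 1;
}
```

That way a driver that cannot provide one hint simply leaves its bit
clear, and there is no error to return from the export call itself.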

> A related question, why 'struct xdp_to_skb_metadata' needs
> __randomize_layout?

The __randomize_layout thing is there to force BPF programs to use CO-RE
to access the field. This is to avoid the struct layout accidentally
ossifying because people in practice rely on a particular layout, even
though we tell them to use CO-RE. There are lots of examples of this
happening in other domains (IP header options, TCP options, etc), and
__randomize_layout seemed like a neat trick to enforce CO-RE usage :)

>>>> Does xdp_to_skb_metadata have a use case for XDP_PASS (like patch 7) or the
>>>> xdp_to_skb_metadata can be limited to XDP_REDIRECT only?
>>>
>>> XDP_PASS cases where we convert xdp_buff into skb in the drivers right
>>> now usually have C code to manually pull out the metadata (out of hw
>>> desc) and put it into skb.
>>>
>>> So, currently, if we're calling bpf_xdp_metadata_export_to_skb() for
>>> XDP_PASS, we're doing a double amount of work:
>>> skb_metadata_import_from_xdp first, then custom driver code second.
>>>
>>> In theory, maybe we should completely skip drivers custom parsing when
>>> there is a prog with BPF_F_XDP_HAS_METADATA?
>>> Then both xdp->skb paths (XDP_PASS+XDP_REDIRECT) will be bpf-driven
>>> and won't require any mental work (plus, the drivers won't have to
>>> care either in the future).
>>> WDYT?
>> 
>> 
>> Yeah, not sure if it can solely depend on BPF_F_XDP_HAS_METADATA but it makes 
>> sense to only use the hints (if ever written) from xdp prog especially if it 
>> will eventually support xdp prog changing some of the hints in the future.  For 
>> now, I think either way is fine since they are the same and the xdp prog is sort 
>> of doing extra unnecessary work anyway by calling 
>> bpf_xdp_metadata_export_to_skb() with XDP_PASS and knowing nothing can be 
>> changed now.

I agree it would be best if the drivers also use the XDP metadata (if
present) on XDP_PASS. Longer term my hope is we can make the XDP
metadata support the only thing drivers need to implement (i.e., have
the stack call into that code even when no XDP program is loaded), but
for now just for consistency (and allowing the XDP program to update the
metadata), we should probably at least consume it on XDP_PASS.

-Toke


* [xdp-hints] Re: [RFC bpf-next v2 04/14] veth: Support rx timestamp metadata for xdp
  2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 04/14] veth: Support rx timestamp metadata for xdp Stanislav Fomichev
@ 2022-11-09 11:21   ` Toke Høiland-Jørgensen
  2022-11-09 21:34     ` Stanislav Fomichev
  2022-11-10  0:25   ` John Fastabend
  1 sibling, 1 reply; 66+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-11-09 11:21 UTC (permalink / raw)
  To: Stanislav Fomichev, bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

Stanislav Fomichev <sdf@google.com> writes:

> xskxceiver conveniently sets up veth pairs so it seems logical
> to use veth as an example for some of the metadata handling.
>
> We timestamp skb right when we "receive" it, store its
> pointer in new veth_xdp_buff wrapper and generate BPF bytecode to
> reach it from the BPF program.
>
> This largely follows the idea of "store some queue context in
> the xdp_buff/xdp_frame so the metadata can be reached out
> from the BPF program".
>
> Cc: John Fastabend <john.fastabend@gmail.com>
> Cc: David Ahern <dsahern@gmail.com>
> Cc: Martin KaFai Lau <martin.lau@linux.dev>
> Cc: Jakub Kicinski <kuba@kernel.org>
> Cc: Willem de Bruijn <willemb@google.com>
> Cc: Jesper Dangaard Brouer <brouer@redhat.com>
> Cc: Anatoly Burakov <anatoly.burakov@intel.com>
> Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
> Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
> Cc: Maryam Tahhan <mtahhan@redhat.com>
> Cc: xdp-hints@xdp-project.net
> Cc: netdev@vger.kernel.org
> Signed-off-by: Stanislav Fomichev <sdf@google.com>
> ---
>  drivers/net/veth.c | 31 +++++++++++++++++++++++++++++++
>  1 file changed, 31 insertions(+)
>
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index 917ba57453c1..0e629ceb087b 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -25,6 +25,7 @@
>  #include <linux/filter.h>
>  #include <linux/ptr_ring.h>
>  #include <linux/bpf_trace.h>
> +#include <linux/bpf_patch.h>
>  #include <linux/net_tstamp.h>
>  
>  #define DRV_NAME	"veth"
> @@ -118,6 +119,7 @@ static struct {
>  
>  struct veth_xdp_buff {
>  	struct xdp_buff xdp;
> +	struct sk_buff *skb;
>  };
>  
>  static int veth_get_link_ksettings(struct net_device *dev,
> @@ -602,6 +604,7 @@ static struct xdp_frame *veth_xdp_rcv_one(struct veth_rq *rq,
>  
>  		xdp_convert_frame_to_buff(frame, xdp);
>  		xdp->rxq = &rq->xdp_rxq;
> +		vxbuf.skb = NULL;
>  
>  		act = bpf_prog_run_xdp(xdp_prog, xdp);
>  
> @@ -826,6 +829,7 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
>  
>  	orig_data = xdp->data;
>  	orig_data_end = xdp->data_end;
> +	vxbuf.skb = skb;
>  
>  	act = bpf_prog_run_xdp(xdp_prog, xdp);
>  
> @@ -942,6 +946,7 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
>  			struct sk_buff *skb = ptr;
>  
>  			stats->xdp_bytes += skb->len;
> +			__net_timestamp(skb);
>  			skb = veth_xdp_rcv_skb(rq, skb, bq, stats);
>  			if (skb) {
>  				if (skb_shared(skb) || skb_unclone(skb, GFP_ATOMIC))
> @@ -1665,6 +1670,31 @@ static int veth_xdp(struct net_device *dev, struct netdev_bpf *xdp)
>  	}
>  }
>  
> +static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
> +			      struct bpf_patch *patch)
> +{
> +	if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) {
> +		/* return true; */
> +		bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1));
> +	} else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) {
> +		bpf_patch_append(patch,
> +			/* r5 = ((struct veth_xdp_buff *)r1)->skb; */
> +			BPF_LDX_MEM(BPF_DW, BPF_REG_5, BPF_REG_1,
> +				    offsetof(struct veth_xdp_buff, skb)),
> +			/* if (r5 == NULL) { */
> +			BPF_JMP_IMM(BPF_JNE, BPF_REG_5, 0, 2),
> +			/*	return 0; */
> +			BPF_MOV64_IMM(BPF_REG_0, 0),
> +			BPF_JMP_A(1),
> +			/* } else { */
> +			/*	return ((struct sk_buff *)r5)->tstamp; */
> +			BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_5,
> +				    offsetof(struct sk_buff, tstamp)),
> +			/* } */

I don't think it's realistic to expect driver developers to write this
level of BPF instructions for everything. With the 'patch' thing it
should be feasible to write some helpers that driver developers can use,
right? E.g., this one could be:

bpf_read_context_member_u64(size_t ctx_offset, size_t member_offset)

called as:

bpf_read_context_member_u64(offsetof(struct veth_xdp_buff, skb), offsetof(struct sk_buff, tstamp));

or with some macro trickery we could even hide the offsetof so you just
pass in types and member names?
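
Something like the following is what I have in mind for the macro part.
This is a plain userspace model, not a bpf_patch helper: instead of emitting
instructions it just performs the loads they would do (pointer at ctx +
ctx_offset, NULL check, u64 at pointer + member_offset), and it takes the ctx
pointer as an extra argument for that reason.  The model_* structs and the
macro name are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Models what the generated instructions would compute: load the pointer
 * stored at ctx + ctx_offset, return 0 if it is NULL, otherwise load the
 * u64 at pointer + member_offset.
 */
static uint64_t model_read_context_member_u64(const void *ctx,
					      size_t ctx_offset,
					      size_t member_offset)
{
	const uint8_t *obj;
	uint64_t val;

	/* Assumes all object pointers share one representation. */
	memcpy(&obj, (const uint8_t *)ctx + ctx_offset, sizeof(obj));
	if (!obj)
		return 0;

	memcpy(&val, obj + member_offset, sizeof(val));
	return val;
}

/* Hides the offsetof() calls: the caller passes types and member names. */
#define MODEL_READ_CTX_MEMBER_U64(ctx, ctx_type, ptr_member, obj_type, obj_member) \
	model_read_context_member_u64((ctx),                                        \
				      offsetof(ctx_type, ptr_member),               \
				      offsetof(obj_type, obj_member))

/* Hypothetical stand-ins for struct sk_buff and struct veth_xdp_buff. */
struct model_skb {
	uint32_t len;
	uint64_t tstamp;
};

struct model_veth_xdp_buff {
	int xdp_placeholder;	/* stands in for the embedded xdp_buff */
	struct model_skb *skb;
};
```

A driver would then write one line per hint, e.g.
MODEL_READ_CTX_MEMBER_U64(ctx, struct model_veth_xdp_buff, skb,
struct model_skb, tstamp), and never touch the jump offsets by hand.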

-Toke



* [xdp-hints] Re: [RFC bpf-next v2 06/14] xdp: Carry over xdp metadata into skb context
  2022-11-09 11:10           ` Toke Høiland-Jørgensen
@ 2022-11-09 18:22             ` Martin KaFai Lau
  2022-11-09 21:33               ` Stanislav Fomichev
  2022-11-10 14:19               ` Toke Høiland-Jørgensen
  2022-11-10  1:26             ` John Fastabend
  1 sibling, 2 replies; 66+ messages in thread
From: Martin KaFai Lau @ 2022-11-09 18:22 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Stanislav Fomichev, ast, daniel, andrii, song, yhs,
	john.fastabend, kpsingh, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev, bpf

On 11/9/22 3:10 AM, Toke Høiland-Jørgensen wrote:
> Snipping a bit of context to reply to this bit:
> 
>>>>> Can the xdp prog still change the metadata through xdp->data_meta? tbh, I am not
>>>>> sure it is solid enough by asking the xdp prog not to use the same random number
>>>>> in its own metadata + not to change the metadata through xdp->data_meta after
>>>>> calling bpf_xdp_metadata_export_to_skb().
>>>>
>>>> What do you think the usecase here might be? Or are you suggesting we
>>>> reject further access to data_meta after
>>>> bpf_xdp_metadata_export_to_skb somehow?
>>>>
>>>> If we want to let the programs override some of this
>>>> bpf_xdp_metadata_export_to_skb() metadata, it feels like we can add
>>>> more kfuncs instead of exposing the layout?
>>>>
>>>> bpf_xdp_metadata_export_to_skb(ctx);
>>>> bpf_xdp_metadata_export_skb_hash(ctx, 1234);
> 
> There are several use cases for needing to access the metadata after
> calling bpf_xdp_metadata_export_to_skb():
> 
> - Accessing the metadata after redirect (in a cpumap or devmap program,
>    or on a veth device)
> - Transferring the packet+metadata to AF_XDP

fwiw, the xdp prog could also be more selective and only store one of the hints
instead of the whole 'struct xdp_to_skb_metadata'.

> - Returning XDP_PASS, but accessing some of the metadata first (whether
>    to read or change it)
> 
> The last one could be solved by calling additional kfuncs, but that
> would be less efficient than just directly editing the struct which
> will be cache-hot after the helper returns.

Yeah, it is more efficient to directly write if possible.  I think this set
already allows direct reading and writing through data_meta (as a __u8 *).

> 
> And yeah, this will allow the XDP program to inject arbitrary metadata
> into the netstack; but it can already inject arbitrary *packet* data
> into the stack, so not sure if this is much of an additional risk? If it
> does lead to trivial crashes, we should probably harden the stack
> against that?
> 
> As for the random number, Jesper and I discussed replacing this with the
> same BTF-ID scheme that he was using in his patch series. I.e., instead
> of just putting in a random number, we insert the BTF ID of the metadata
> struct at the end of it. This will allow us to support multiple
> different formats in the future (not just changing the layout, but
> having multiple simultaneous formats in the same kernel image), in case
> we run out of space.

This seems a bit hypothetical.  How much headroom is usually available to the
xdp prog?  Potentially the hints can use all the remaining space left after the
header encap and the current bpf_xdp_adjust_meta() usage?

> 
> We should probably also have a flag set on the xdp_frame so the stack
> knows that the metadata area contains relevant-to-skb data, to guard
> against an XDP program accidentally hitting the "magic number" (BTF_ID)
> in unrelated stuff it puts into the metadata area.

Yeah, I think having a flag is useful.  The flag will be set in the xdp_buff and
then transferred to the xdp_frame?

> 
>> After re-reading patch 6, have another question. The 'void
>> bpf_xdp_metadata_export_to_skb();' function signature. Should it at
>> least return ok/err? or even return a 'struct xdp_to_skb_metadata *'
>> pointer and the xdp prog can directly read (or even write) it?
> 
> Hmm, I'm not sure returning a failure makes sense? Failure to read one
> or more fields just means that those fields will not be populated? We
> should probably have a flags field inside the metadata struct itself to
> indicate which fields are set or not, but I'm not sure returning an
> error value adds anything? Returning a pointer to the metadata field
> might be convenient for users (it would just be an alias to the
> data_meta pointer, but the verifier could know its size, so the program
> doesn't have to bounds check it).

If some hints are not available, those hints should be initialized to
0/CHECKSUM_NONE/...etc.  The xdp prog needs a direct way to detect a hard
failure, i.e. when the meta area cannot be written because there is not enough
space.  Comparing xdp->data_meta with xdp->data as a side effect is not intuitive.

It is more than saving the bounds check.  With the type info of 'struct
xdp_to_skb_metadata *', the verifier can do more checks, like rejecting a read
in the middle of an integer member.  The verifier could also limit write access
to only a few of the struct's members if needed.

The returned 'struct xdp_to_skb_metadata *' should not be an alias to
xdp->data_meta.  They should actually point to different locations in the
headroom.  bpf_xdp_metadata_export_to_skb() sets a flag in xdp_buff.
xdp->data_meta won't be changed and keeps pointing to the last
bpf_xdp_adjust_meta() location.  The kernel will know there is an
xdp_to_skb_metadata before xdp->data_meta when that bit is set in the
xdp_{buff,frame}.  Would it work?
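
A rough userspace model of that layout, to show what I mean.  The model_*
names and the flag bit are hypothetical, and the single-member struct is just
a placeholder.  The kernel-owned struct is written right before data_meta,
data_meta itself stays where the prog left it, and the import side relies only
on the flag:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_XDP_F_SKB_META (1u << 0)	/* hypothetical flag bit */

/* Hypothetical stand-in for the relevant xdp_buff fields. */
struct model_xdp_buff {
	uint8_t *data_hard_start;
	uint8_t *data_meta;	/* start of the prog's own metadata */
	uint8_t *data;
	uint32_t flags;
};

struct model_skb_meta {
	uint64_t rx_timestamp;
};

/* Export: write the kernel struct immediately before data_meta and set
 * the flag.  data_meta is not moved.  Returns 0 on success, -1 if the
 * headroom in front of data_meta is too small.
 */
static int model_export(struct model_xdp_buff *xdp,
			const struct model_skb_meta *src)
{
	size_t room = (size_t)(xdp->data_meta - xdp->data_hard_start);

	if (room < sizeof(*src))
		return -1;

	memcpy(xdp->data_meta - sizeof(*src), src, sizeof(*src));
	xdp->flags |= MODEL_XDP_F_SKB_META;
	return 0;
}

/* Import: if the flag is set, read the kernel struct from just before
 * data_meta and return 1.  The prog's metadata in [data_meta, data) is
 * left in place for later consumers.  Returns 0 if nothing was exported.
 */
static int model_import(const struct model_xdp_buff *xdp,
			struct model_skb_meta *dst)
{
	if (!(xdp->flags & MODEL_XDP_F_SKB_META))
		return 0;

	memcpy(dst, xdp->data_meta - sizeof(*dst), sizeof(*dst));
	return 1;
}
```

In this model no magic number is needed at all, since the bit in the
buff/frame is the only thing that says the area is there.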

> 
>> A related question, why 'struct xdp_to_skb_metadata' needs
>> __randomize_layout?
> 
> The __randomize_layout thing is there to force BPF programs to use CO-RE
> to access the field. This is to avoid the struct layout accidentally
> ossifying because people in practice rely on a particular layout, even
> though we tell them to use CO-RE. There are lots of examples of this
> happening in other domains (IP header options, TCP options, etc), and
> __randomize_layout seemed like a neat trick to enforce CO-RE usage :)

I am not sure if it is necessary or helpful to only enforce __randomize_layout 
in 'struct xdp_to_skb_metadata'.  There are other CO-RE use cases (tracing and 
non tracing) that already have direct access (reading and/or writing) to other 
kernel structures.

It is more important for the verifier to see the xdp prog accessing it as a 
'struct xdp_to_skb_metadata *' instead of xdp->data_meta which is a __u8 * so 
that the verifier can enforce the rules of access.

> 
>>>>> Does xdp_to_skb_metadata have a use case for XDP_PASS (like patch 7) or the
>>>>> xdp_to_skb_metadata can be limited to XDP_REDIRECT only?
>>>>
>>>> XDP_PASS cases where we convert xdp_buff into skb in the drivers right
>>>> now usually have C code to manually pull out the metadata (out of hw
>>>> desc) and put it into skb.
>>>>
>>>> So, currently, if we're calling bpf_xdp_metadata_export_to_skb() for
>>>> XDP_PASS, we're doing a double amount of work:
>>>> skb_metadata_import_from_xdp first, then custom driver code second.
>>>>
>>>> In theory, maybe we should completely skip drivers custom parsing when
>>>> there is a prog with BPF_F_XDP_HAS_METADATA?
>>>> Then both xdp->skb paths (XDP_PASS+XDP_REDIRECT) will be bpf-driven
>>>> and won't require any mental work (plus, the drivers won't have to
>>>> care either in the future).
>>>> WDYT?
>>>
>>>
>>> Yeah, not sure if it can solely depend on BPF_F_XDP_HAS_METADATA but it makes
>>> sense to only use the hints (if ever written) from xdp prog especially if it
>>> will eventually support xdp prog changing some of the hints in the future.  For
>>> now, I think either way is fine since they are the same and the xdp prog is sort
>>> of doing extra unnecessary work anyway by calling
>>> bpf_xdp_metadata_export_to_skb() with XDP_PASS and knowing nothing can be
>>> changed now.
> 
> I agree it would be best if the drivers also use the XDP metadata (if
> present) on XDP_PASS. Longer term my hope is we can make the XDP
> metadata support the only thing drivers need to implement (i.e., have
> the stack call into that code even when no XDP program is loaded), but
> for now just for consistency (and allowing the XDP program to update the
> metadata), we should probably at least consume it on XDP_PASS.
> 
> -Toke
> 



* [xdp-hints] Re: [RFC bpf-next v2 06/14] xdp: Carry over xdp metadata into skb context
  2022-11-09 18:22             ` Martin KaFai Lau
@ 2022-11-09 21:33               ` Stanislav Fomichev
  2022-11-10  0:13                 ` Martin KaFai Lau
  2022-11-10 14:19               ` Toke Høiland-Jørgensen
  1 sibling, 1 reply; 66+ messages in thread
From: Stanislav Fomichev @ 2022-11-09 21:33 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Toke Høiland-Jørgensen, ast, daniel, andrii, song, yhs,
	john.fastabend, kpsingh, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev, bpf

On Wed, Nov 9, 2022 at 10:22 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 11/9/22 3:10 AM, Toke Høiland-Jørgensen wrote:
> > Snipping a bit of context to reply to this bit:
> >
> >>>>> Can the xdp prog still change the metadata through xdp->data_meta? tbh, I am not
> >>>>> sure it is solid enough by asking the xdp prog not to use the same random number
> >>>>> in its own metadata + not to change the metadata through xdp->data_meta after
> >>>>> calling bpf_xdp_metadata_export_to_skb().
> >>>>
> >>>> What do you think the usecase here might be? Or are you suggesting we
> >>>> reject further access to data_meta after
> >>>> bpf_xdp_metadata_export_to_skb somehow?
> >>>>
> >>>> If we want to let the programs override some of this
> >>>> bpf_xdp_metadata_export_to_skb() metadata, it feels like we can add
> >>>> more kfuncs instead of exposing the layout?
> >>>>
> >>>> bpf_xdp_metadata_export_to_skb(ctx);
> >>>> bpf_xdp_metadata_export_skb_hash(ctx, 1234);
> >
> > There are several use cases for needing to access the metadata after
> > calling bpf_xdp_metadata_export_to_skb():
> >
> > - Accessing the metadata after redirect (in a cpumap or devmap program,
> >    or on a veth device)
> > - Transferring the packet+metadata to AF_XDP
> fwiw, the xdp prog could also be more selective and only stores one of the hints
> instead of the whole 'struct xdp_to_skb_metadata'.
>
> > - Returning XDP_PASS, but accessing some of the metadata first (whether
> >    to read or change it)
> >
> > The last one could be solved by calling additional kfuncs, but that
> > would be less efficient than just directly editing the struct which
> > will be cache-hot after the helper returns.
>
> Yeah, it is more efficient to directly write if possible.  I think this set
> allows the direct reading and writing already through data_meta (as a __u8 *).
>
> >
> > And yeah, this will allow the XDP program to inject arbitrary metadata
> > into the netstack; but it can already inject arbitrary *packet* data
> > into the stack, so not sure if this is much of an additional risk? If it
> > does lead to trivial crashes, we should probably harden the stack
> > against that?
> >
> > As for the random number, Jesper and I discussed replacing this with the
> > same BTF-ID scheme that he was using in his patch series. I.e., instead
> > of just putting in a random number, we insert the BTF ID of the metadata
> > struct at the end of it. This will allow us to support multiple
> > different formats in the future (not just changing the layout, but
> > having multiple simultaneous formats in the same kernel image), in case
> > we run out of space.
>
> This seems a bit hypothetical.  How much headroom does it usually have for the
> xdp prog?  Potentially the hints can use all the remaining space left after the
> header encap and the current bpf_xdp_adjust_meta() usage?
>
> >
> > We should probably also have a flag set on the xdp_frame so the stack
> > knows that the metadata area contains relevant-to-skb data, to guard
> > against an XDP program accidentally hitting the "magic number" (BTF_ID)
> > in unrelated stuff it puts into the metadata area.
>
> Yeah, I think having a flag is useful.  The flag will be set at xdp_buff and
> then transfer to the xdp_frame?
>
> >
> >> After re-reading patch 6, have another question. The 'void
> >> bpf_xdp_metadata_export_to_skb();' function signature. Should it at
> >> least return ok/err? or even return a 'struct xdp_to_skb_metadata *'
> >> pointer and the xdp prog can directly read (or even write) it?
> >
> > Hmm, I'm not sure returning a failure makes sense? Failure to read one
> > or more fields just means that those fields will not be populated? We
> > should probably have a flags field inside the metadata struct itself to
> > indicate which fields are set or not, but I'm not sure returning an
> > error value adds anything? Returning a pointer to the metadata field
> > might be convenient for users (it would just be an alias to the
> > data_meta pointer, but the verifier could know its size, so the program
> > doesn't have to bounds check it).
>
> If some hints are not available, those hints should be initialized to
> 0/CHECKSUM_NONE/...etc.  The xdp prog needs a direct way to tell hard failure
> when it cannot write the meta area because of not enough space.  Comparing
> xdp->data_meta with xdp->data as a side effect is not intuitive.
>
> It is more than saving the bound check.  With type info of 'struct
> xdp_to_skb_metadata *', the verifier can do more checks like reading in the
> middle of an integer member.  The verifier could also limit write access only to
> a few struct's members if it is needed.
>
> The returning 'struct xdp_to_skb_metadata *' should not be an alias to the
> xdp->data_meta.  They should actually point to different locations in the
> headroom.  bpf_xdp_metadata_export_to_skb() sets a flag in xdp_buff.
> xdp->data_meta won't be changed and keeps pointing to the last
> bpf_xdp_adjust_meta() location.  The kernel will know if there is
> xdp_to_skb_metadata before the xdp->data_meta when that bit is set in the
> xdp_{buff,frame}.  Would it work?
>
> >
> >> A related question, why 'struct xdp_to_skb_metadata' needs
> >> __randomize_layout?
> >
> > The __randomize_layout thing is there to force BPF programs to use CO-RE
> > to access the field. This is to avoid the struct layout accidentally
> > ossifying because people in practice rely on a particular layout, even
> > though we tell them to use CO-RE. There are lots of examples of this
> > happening in other domains (IP header options, TCP options, etc), and
> > __randomize_layout seemed like a neat trick to enforce CO-RE usage :)
>
> I am not sure if it is necessary or helpful to only enforce __randomize_layout
> in 'struct xdp_to_skb_metadata'.  There are other CO-RE use cases (tracing and
> non tracing) that already have direct access (reading and/or writing) to other
> kernel structures.
>
> It is more important for the verifier to see the xdp prog accessing it as a
> 'struct xdp_to_skb_metadata *' instead of xdp->data_meta which is a __u8 * so
> that the verifier can enforce the rules of access.
>
> >
> >>>>> Does xdp_to_skb_metadata have a use case for XDP_PASS (like patch 7) or the
> >>>>> xdp_to_skb_metadata can be limited to XDP_REDIRECT only?
> >>>>
> >>>> XDP_PASS cases where we convert xdp_buff into skb in the drivers right
> >>>> now usually have C code to manually pull out the metadata (out of hw
> >>>> desc) and put it into skb.
> >>>>
> >>>> So, currently, if we're calling bpf_xdp_metadata_export_to_skb() for
> >>>> XDP_PASS, we're doing a double amount of work:
> >>>> skb_metadata_import_from_xdp first, then custom driver code second.
> >>>>
> >>>> In theory, maybe we should completely skip drivers custom parsing when
> >>>> there is a prog with BPF_F_XDP_HAS_METADATA?
> >>>> Then both xdp->skb paths (XDP_PASS+XDP_REDIRECT) will be bpf-driven
> >>>> and won't require any mental work (plus, the drivers won't have to
> >>>> care either in the future).
> >>>>   > WDYT?
> >>>
> >>>
> >>> Yeah, not sure if it can solely depend on BPF_F_XDP_HAS_METADATA but it makes
> >>> sense to only use the hints (if ever written) from xdp prog especially if it
> >>> will eventually support xdp prog changing some of the hints in the future.  For
> >>> now, I think either way is fine since they are the same and the xdp prog is sort
> >>> of doing extra unnecessary work anyway by calling
> >>> bpf_xdp_metadata_export_to_skb() with XDP_PASS and knowing nothing can be
> >>> changed now.
> >
> > I agree it would be best if the drivers also use the XDP metadata (if
> > present) on XDP_PASS. Longer term my hope is we can make the XDP
> > metadata support the only thing drivers need to implement (i.e., have
> > the stack call into that code even when no XDP program is loaded), but
> > for now just for consistency (and allowing the XDP program to update the
> > metadata), we should probably at least consume it on XDP_PASS.
> >
> > -Toke
> >

Not to derail the discussion (left the last message intact on top,
feel free to continue), but to summarize. The proposed changes seem to
be:

1. bpf_xdp_metadata_export_to_skb() should return pointer to "struct
xdp_to_skb_metadata"
  - This should let bpf programs change the metadata passed to the skb

2. "struct xdp_to_skb_metadata" should have its btf_id as the first
__u32 member (and remove the magic)
  - This is for the redirect case where the end users, including
AF_XDP, can parse this metadata from btf_id
  - This, however, is not all the metadata that the device can
support, but a much narrower set that the kernel is expected to use
for skb construction

3. __randomize_layout isn't really helping, since CO-RE will trigger
regardless; the only case where it might matter is AF_XDP, so maybe
it's still useful?

4. The presence of the metadata generated by
bpf_xdp_metadata_export_to_skb should be indicated by a flag in
xdp_{buff,frame}->flags
  - Assuming exposing it via xdp_md->has_skb_metadata is ok?
  - Since the programs probably need to do the following:

  if (xdp_md->has_skb_metadata) {
    access/change skb metadata by doing struct xdp_to_skb_metadata *p
= data_meta;
  } else {
    use kfuncs
  }

5. Support the case where we keep program's metadata and kernel's
xdp_to_skb_metadata
  - skb_metadata_import_from_xdp() will "consume" it by mem-moving the
rest of the metadata over it and adjusting the headroom
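
To make item 4 a bit more concrete, here is a rough userspace sketch of
the program-side logic (not kernel code). The mock_* structs, their
fields, the has_skb_metadata member and the stand-in kfunc are all
assumptions made up for illustration; a real program would use CO-RE and
the actual kfuncs:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout; the real struct would be accessed via CO-RE. */
struct mock_xdp_to_skb_metadata {
	uint32_t btf_id;	/* item 2: btf_id as the first __u32 member */
	uint64_t rx_timestamp;
};

/* Hypothetical view of the program context after item 4. */
struct mock_xdp_md {
	bool has_skb_metadata;
	void *data_meta;
	void *data;
};

/* Stand-in for a per-driver rx timestamp kfunc; returns a fixed value. */
static uint64_t mock_rx_timestamp_kfunc(const struct mock_xdp_md *ctx)
{
	(void)ctx;
	return 1000;
}

static uint64_t get_rx_timestamp(const struct mock_xdp_md *ctx)
{
	if (ctx->has_skb_metadata) {
		const struct mock_xdp_to_skb_metadata *m = ctx->data_meta;

		/* Bounds check needed when going through data_meta. */
		if ((const char *)(m + 1) > (const char *)ctx->data)
			return 0;
		return m->rx_timestamp;
	}

	/* No exported metadata: fall back to the kfunc. */
	return mock_rx_timestamp_kfunc(ctx);
}
```

The explicit bounds check is the part that goes away if the kfunc returns
a typed pointer instead of the program casting data_meta.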


I think the above solves all the cases Toke points to?

a) Accessing the metadata after redirect (in a cpumap or devmap
program, or on a veth device)
  - only a small xdp_to_skb_metadata subset will work out of the box
iff the redirector calls bpf_xdp_metadata_export_to_skb; for the rest
the progs will have to agree on the layout, right?

b) Transferring the packet+metadata to AF_XDP
  - here, again, the AF_XDP consumer will have to either expect
xdp_to_skb_metadata with a smaller set of skb-related metadata, or
will have to make sure the producer builds a custom layout using
kfuncs; there is also no flag to indicate whether xdp_to_skb_metadata
is there or not; the consumer will have to test btf_id at the right
offset

c) Returning XDP_PASS, but accessing some of the metadata first
(whether to read or change it)
  - can read via kfuncs, can change via
bpf_xdp_metadata_export_to_skb(); m->xyz=abc;

Anything I'm missing?


* [xdp-hints] Re: [RFC bpf-next v2 04/14] veth: Support rx timestamp metadata for xdp
  2022-11-09 11:21   ` [xdp-hints] " Toke Høiland-Jørgensen
@ 2022-11-09 21:34     ` Stanislav Fomichev
  0 siblings, 0 replies; 66+ messages in thread
From: Stanislav Fomichev @ 2022-11-09 21:34 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

On Wed, Nov 9, 2022 at 3:21 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Stanislav Fomichev <sdf@google.com> writes:
>
> > xskxceiver conveniently sets up veth pairs so it seems logical
> > to use veth as an example for some of the metadata handling.
> >
> > We timestamp skb right when we "receive" it, store its
> > pointer in new veth_xdp_buff wrapper and generate BPF bytecode to
> > reach it from the BPF program.
> >
> > This largely follows the idea of "store some queue context in
> > the xdp_buff/xdp_frame so the metadata can be reached out
> > from the BPF program".
> >
> > Cc: John Fastabend <john.fastabend@gmail.com>
> > Cc: David Ahern <dsahern@gmail.com>
> > Cc: Martin KaFai Lau <martin.lau@linux.dev>
> > Cc: Jakub Kicinski <kuba@kernel.org>
> > Cc: Willem de Bruijn <willemb@google.com>
> > Cc: Jesper Dangaard Brouer <brouer@redhat.com>
> > Cc: Anatoly Burakov <anatoly.burakov@intel.com>
> > Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
> > Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
> > Cc: Maryam Tahhan <mtahhan@redhat.com>
> > Cc: xdp-hints@xdp-project.net
> > Cc: netdev@vger.kernel.org
> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > ---
> >  drivers/net/veth.c | 31 +++++++++++++++++++++++++++++++
> >  1 file changed, 31 insertions(+)
> >
> > diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> > index 917ba57453c1..0e629ceb087b 100644
> > --- a/drivers/net/veth.c
> > +++ b/drivers/net/veth.c
> > @@ -25,6 +25,7 @@
> >  #include <linux/filter.h>
> >  #include <linux/ptr_ring.h>
> >  #include <linux/bpf_trace.h>
> > +#include <linux/bpf_patch.h>
> >  #include <linux/net_tstamp.h>
> >
> >  #define DRV_NAME     "veth"
> > @@ -118,6 +119,7 @@ static struct {
> >
> >  struct veth_xdp_buff {
> >       struct xdp_buff xdp;
> > +     struct sk_buff *skb;
> >  };
> >
> >  static int veth_get_link_ksettings(struct net_device *dev,
> > @@ -602,6 +604,7 @@ static struct xdp_frame *veth_xdp_rcv_one(struct veth_rq *rq,
> >
> >               xdp_convert_frame_to_buff(frame, xdp);
> >               xdp->rxq = &rq->xdp_rxq;
> > +             vxbuf.skb = NULL;
> >
> >               act = bpf_prog_run_xdp(xdp_prog, xdp);
> >
> > @@ -826,6 +829,7 @@ static struct sk_buff *veth_xdp_rcv_skb(struct veth_rq *rq,
> >
> >       orig_data = xdp->data;
> >       orig_data_end = xdp->data_end;
> > +     vxbuf.skb = skb;
> >
> >       act = bpf_prog_run_xdp(xdp_prog, xdp);
> >
> > @@ -942,6 +946,7 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
> >                       struct sk_buff *skb = ptr;
> >
> >                       stats->xdp_bytes += skb->len;
> > +                     __net_timestamp(skb);
> >                       skb = veth_xdp_rcv_skb(rq, skb, bq, stats);
> >                       if (skb) {
> >                               if (skb_shared(skb) || skb_unclone(skb, GFP_ATOMIC))
> > @@ -1665,6 +1670,31 @@ static int veth_xdp(struct net_device *dev, struct netdev_bpf *xdp)
> >       }
> >  }
> >
> > +static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
> > +                           struct bpf_patch *patch)
> > +{
> > +     if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) {
> > +             /* return true; */
> > +             bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1));
> > +     } else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) {
> > +             bpf_patch_append(patch,
> > +                     /* r5 = ((struct veth_xdp_buff *)r1)->skb; */
> > +                     BPF_LDX_MEM(BPF_DW, BPF_REG_5, BPF_REG_1,
> > +                                 offsetof(struct veth_xdp_buff, skb)),
> > +                     /* if (r5 == NULL) { */
> > +                     BPF_JMP_IMM(BPF_JNE, BPF_REG_5, 0, 2),
> > +                     /*      return 0; */
> > +                     BPF_MOV64_IMM(BPF_REG_0, 0),
> > +                     BPF_JMP_A(1),
> > +                     /* } else { */
> > +                     /*      return ((struct sk_buff *)r5)->tstamp; */
> > +                     BPF_LDX_MEM(BPF_DW, BPF_REG_0, BPF_REG_5,
> > +                                 offsetof(struct sk_buff, tstamp)),
> > +                     /* } */
>
> I don't think it's realistic to expect driver developers to write this
> level of BPF instructions for everything. With the 'patch' thing it
> should be feasible to write some helpers that driver developers can use,
> right? E.g., this one could be:
>
> bpf_read_context_member_u64(size_t ctx_offset, size_t member_offset)
>
> called as:
>
> bpf_read_context_member_u64(offsetof(struct veth_xdp_buff, skb), offsetof(struct sk_buff, tstamp));
>
> or with some macro trickery we could even hide the offsetof so you just
> pass in types and member names?

Definitely; let's start with the one you're proposing, we'll figure
out the rest as we go; thx!
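
For reference, a userspace sketch of what such a helper would compute
once unrolled. The mock_* structs and the macro name are hypothetical,
and the real helper would emit BPF instructions through bpf_patch rather
than run C; this only mirrors the load / NULL check / load sequence from
the veth patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the kernel structs; only the fields used here. */
struct mock_sk_buff {
	uint32_t len;
	uint64_t tstamp;
};

struct mock_xdp_buff {
	void *data;
	void *data_end;
};

struct mock_veth_xdp_buff {
	struct mock_xdp_buff xdp;	/* stays the first member, as in the series */
	struct mock_sk_buff *skb;
};

/*
 * Plain-C equivalent of the unrolled sequence:
 *   r5 = *(void **)(ctx + ctx_offset);
 *   if (r5 == NULL) return 0;
 *   return *(u64 *)(r5 + member_offset);
 */
static uint64_t read_context_member_u64(const void *ctx, size_t ctx_offset,
					size_t member_offset)
{
	const void *obj;
	uint64_t val;

	memcpy(&obj, (const char *)ctx + ctx_offset, sizeof(obj));
	if (!obj)
		return 0;

	memcpy(&val, (const char *)obj + member_offset, sizeof(val));
	return val;
}

/* "Macro trickery" variant: pass types and member names, not offsets. */
#define READ_CONTEXT_MEMBER_U64(ctx, ctx_type, ptr_member, obj_type, member) \
	read_context_member_u64((ctx), offsetof(ctx_type, ptr_member),      \
				offsetof(obj_type, member))
```

With that, the veth case would read roughly as
READ_CONTEXT_MEMBER_U64(ctx, struct mock_veth_xdp_buff, skb,
struct mock_sk_buff, tstamp).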

> -Toke
>


* [xdp-hints] Re: [RFC bpf-next v2 06/14] xdp: Carry over xdp metadata into skb context
  2022-11-09 21:33               ` Stanislav Fomichev
@ 2022-11-10  0:13                 ` Martin KaFai Lau
  2022-11-10  1:02                   ` Stanislav Fomichev
  0 siblings, 1 reply; 66+ messages in thread
From: Martin KaFai Lau @ 2022-11-10  0:13 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Toke Høiland-Jørgensen, ast, daniel, andrii, song, yhs,
	john.fastabend, kpsingh, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev, bpf

On 11/9/22 1:33 PM, Stanislav Fomichev wrote:
> On Wed, Nov 9, 2022 at 10:22 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>
>> On 11/9/22 3:10 AM, Toke Høiland-Jørgensen wrote:
>>> Snipping a bit of context to reply to this bit:
>>>
>>>>>>> Can the xdp prog still change the metadata through xdp->data_meta? tbh, I am not
>>>>>>> sure it is solid enough by asking the xdp prog not to use the same random number
>>>>>>> in its own metadata + not to change the metadata through xdp->data_meta after
>>>>>>> calling bpf_xdp_metadata_export_to_skb().
>>>>>>
>>>>>> What do you think the usecase here might be? Or are you suggesting we
>>>>>> reject further access to data_meta after
>>>>>> bpf_xdp_metadata_export_to_skb somehow?
>>>>>>
>>>>>> If we want to let the programs override some of this
>>>>>> bpf_xdp_metadata_export_to_skb() metadata, it feels like we can add
>>>>>> more kfuncs instead of exposing the layout?
>>>>>>
>>>>>> bpf_xdp_metadata_export_to_skb(ctx);
>>>>>> bpf_xdp_metadata_export_skb_hash(ctx, 1234);
>>>
>>> There are several use cases for needing to access the metadata after
>>> calling bpf_xdp_metadata_export_to_skb():
>>>
>>> - Accessing the metadata after redirect (in a cpumap or devmap program,
>>>     or on a veth device)
>>> - Transferring the packet+metadata to AF_XDP
>> fwiw, the xdp prog could also be more selective and only stores one of the hints
>> instead of the whole 'struct xdp_to_skb_metadata'.
>>
>>> - Returning XDP_PASS, but accessing some of the metadata first (whether
>>>     to read or change it)
>>>
>>> The last one could be solved by calling additional kfuncs, but that
>>> would be less efficient than just directly editing the struct which
>>> will be cache-hot after the helper returns.
>>
>> Yeah, it is more efficient to directly write if possible.  I think this set
>> allows the direct reading and writing already through data_meta (as a __u8 *).
>>
>>>
>>> And yeah, this will allow the XDP program to inject arbitrary metadata
>>> into the netstack; but it can already inject arbitrary *packet* data
>>> into the stack, so not sure if this is much of an additional risk? If it
>>> does lead to trivial crashes, we should probably harden the stack
>>> against that?
>>>
>>> As for the random number, Jesper and I discussed replacing this with the
>>> same BTF-ID scheme that he was using in his patch series. I.e., instead
>>> of just putting in a random number, we insert the BTF ID of the metadata
>>> struct at the end of it. This will allow us to support multiple
>>> different formats in the future (not just changing the layout, but
>>> having multiple simultaneous formats in the same kernel image), in case
>>> we run out of space.
>>
>> This seems a bit hypothetical.  How much headroom does it usually have for the
>> xdp prog?  Potentially the hints can use all the remaining space left after the
>> header encap and the current bpf_xdp_adjust_meta() usage?
>>
>>>
>>> We should probably also have a flag set on the xdp_frame so the stack
>>> knows that the metadata area contains relevant-to-skb data, to guard
>>> against an XDP program accidentally hitting the "magic number" (BTF_ID)
>>> in unrelated stuff it puts into the metadata area.
>>
>> Yeah, I think having a flag is useful.  The flag will be set at xdp_buff and
>> then transfer to the xdp_frame?
>>
>>>
>>>> After re-reading patch 6, have another question. The 'void
>>>> bpf_xdp_metadata_export_to_skb();' function signature. Should it at
>>>> least return ok/err? or even return a 'struct xdp_to_skb_metadata *'
>>>> pointer and the xdp prog can directly read (or even write) it?
>>>
>>> Hmm, I'm not sure returning a failure makes sense? Failure to read one
>>> or more fields just means that those fields will not be populated? We
>>> should probably have a flags field inside the metadata struct itself to
>>> indicate which fields are set or not, but I'm not sure returning an
>>> error value adds anything? Returning a pointer to the metadata field
>>> might be convenient for users (it would just be an alias to the
>>> data_meta pointer, but the verifier could know its size, so the program
>>> doesn't have to bounds check it).
>>
>> If some hints are not available, those hints should be initialized to
>> 0/CHECKSUM_NONE/...etc.  The xdp prog needs a direct way to tell hard failure
>> when it cannot write the meta area because of not enough space.  Comparing
>> xdp->data_meta with xdp->data as a side effect is not intuitive.
>>
>> It is more than saving the bound check.  With type info of 'struct
>> xdp_to_skb_metadata *', the verifier can do more checks like reading in the
>> middle of an integer member.  The verifier could also limit write access only to
>> a few struct's members if it is needed.
>>
>> The returning 'struct xdp_to_skb_metadata *' should not be an alias to the
>> xdp->data_meta.  They should actually point to different locations in the
>> headroom.  bpf_xdp_metadata_export_to_skb() sets a flag in xdp_buff.
>> xdp->data_meta won't be changed and keeps pointing to the last
>> bpf_xdp_adjust_meta() location.  The kernel will know if there is
>> xdp_to_skb_metadata before the xdp->data_meta when that bit is set in the
>> xdp_{buff,frame}.  Would it work?
>>
>>>
>>>> A related question, why 'struct xdp_to_skb_metadata' needs
>>>> __randomize_layout?
>>>
>>> The __randomize_layout thing is there to force BPF programs to use CO-RE
>>> to access the field. This is to avoid the struct layout accidentally
>>> ossifying because people in practice rely on a particular layout, even
>>> though we tell them to use CO-RE. There are lots of examples of this
>>> happening in other domains (IP header options, TCP options, etc), and
>>> __randomize_layout seemed like a neat trick to enforce CO-RE usage :)
>>
>> I am not sure if it is necessary or helpful to only enforce __randomize_layout
>> in 'struct xdp_to_skb_metadata'.  There are other CO-RE use cases (tracing and
>> non tracing) that already have direct access (reading and/or writing) to other
>> kernel structures.
>>
>> It is more important for the verifier to see the xdp prog accessing it as a
>> 'struct xdp_to_skb_metadata *' instead of xdp->data_meta which is a __u8 * so
>> that the verifier can enforce the rules of access.
>>
>>>
>>>>>>> Does xdp_to_skb_metadata have a use case for XDP_PASS (like patch 7) or the
>>>>>>> xdp_to_skb_metadata can be limited to XDP_REDIRECT only?
>>>>>>
>>>>>> XDP_PASS cases where we convert xdp_buff into skb in the drivers right
>>>>>> now usually have C code to manually pull out the metadata (out of hw
>>>>>> desc) and put it into skb.
>>>>>>
>>>>>> So, currently, if we're calling bpf_xdp_metadata_export_to_skb() for
>>>>>> XDP_PASS, we're doing a double amount of work:
>>>>>> skb_metadata_import_from_xdp first, then custom driver code second.
>>>>>>
>>>>>> In theory, maybe we should completely skip drivers custom parsing when
>>>>>> there is a prog with BPF_F_XDP_HAS_METADATA?
>>>>>> Then both xdp->skb paths (XDP_PASS+XDP_REDIRECT) will be bpf-driven
>>>>>> and won't require any mental work (plus, the drivers won't have to
>>>>>> care either in the future).
>>>>>>    > WDYT?
>>>>>
>>>>>
>>>>> Yeah, not sure if it can solely depend on BPF_F_XDP_HAS_METADATA but it makes
>>>>> sense to only use the hints (if ever written) from xdp prog especially if it
>>>>> will eventually support xdp prog changing some of the hints in the future.  For
>>>>> now, I think either way is fine since they are the same and the xdp prog is sort
>>>>> of doing extra unnecessary work anyway by calling
>>>>> bpf_xdp_metadata_export_to_skb() with XDP_PASS and knowing nothing can be
>>>>> changed now.
>>>
>>> I agree it would be best if the drivers also use the XDP metadata (if
>>> present) on XDP_PASS. Longer term my hope is we can make the XDP
>>> metadata support the only thing drivers need to implement (i.e., have
>>> the stack call into that code even when no XDP program is loaded), but
>>> for now just for consistency (and allowing the XDP program to update the
>>> metadata), we should probably at least consume it on XDP_PASS.
>>>
>>> -Toke
>>>
> 
> Not to derail the discussion (left the last message intact on top,
> feel free to continue), but to summarize. The proposed changes seem to
> be:
> 
> 1. bpf_xdp_metadata_export_to_skb() should return pointer to "struct
> xdp_to_skb_metadata"
>    - This should let bpf programs change the metadata passed to the skb
> 
> 2. "struct xdp_to_skb_metadata" should have its btf_id as the first
> __u32 member (and remove the magic)
>    - This is for the redirect case where the end users, including
> AF_XDP, can parse this metadata from btf_id

I think Toke's idea is to put the btf_id at the end of xdp_to_skb_metadata.  I
can see why the end is needed for the userspace AF_XDP because, afaict, the AF_XDP
rx_desc currently cannot tell whether there is metadata written by the xdp prog or
not.  However, if the 'has_skb_metadata' bit can also be passed to the AF_XDP
rx_desc->options, the btf_id may not be needed for now.  The btf_id
and other future new members can still be added to the xdp_to_skb_metadata later if
there is a need.

For the kernel and xdp prog, a bit in the xdp->flags should be enough to get to 
the xdp_to_skb_metadata.  The xdp prog will use CO-RE to access the members in 
xdp_to_skb_metadata.

>    - This, however, is not all the metadata that the device can
> support, but a much narrower set that the kernel is expected to use
> for skb construction
> 
> 3. __randomize_layout isn't really helping, since CO-RE will trigger
> regardless; the only case where it might matter is AF_XDP, so maybe
> it's still useful?
> 
> 4. The presence of the metadata generated by
> bpf_xdp_metadata_export_to_skb should be indicated by a flag in
> xdp_{buff,frame}->flags
>    - Assuming exposing it via xdp_md->has_skb_metadata is ok?

probably __bpf_md_ptr(struct xdp_to_skb_metadata *, skb_metadata) and the type 
will be PTR_TO_BTF_ID_OR_NULL.

>    - Since the programs probably need to do the following:
> 
>    if (xdp_md->has_skb_metadata) {
>      access/change skb metadata by doing struct xdp_to_skb_metadata *p
> = data_meta;

and directly access/change xdp->skb_metadata instead of using xdp->data_meta.

>    } else {
>      use kfuncs
>    }
> 
> 5. Support the case where we keep program's metadata and kernel's
> xdp_to_skb_metadata
>    - skb_metadata_import_from_xdp() will "consume" it by mem-moving the
> rest of the metadata over it and adjusting the headroom

I was thinking the kernel's xdp_to_skb_metadata is always before the program's 
metadata.  xdp prog should usually work in this order also: read/write headers, 
write its own metadata, call bpf_xdp_metadata_export_to_skb(), and return 
XDP_PASS/XDP_REDIRECT.  When it is XDP_PASS, the kernel just needs to pop the 
xdp_to_skb_metadata and pass the remaining program's metadata to the bpf-tc.

For the kernel and xdp prog, I don't think it matters where the 
xdp_to_skb_metadata is.  However, the xdp->data_meta (program's metadata) has to 
be before xdp->data because of the current data_meta and data comparison usage 
in the xdp prog.

The order of the kernel's xdp_to_skb_metadata and the program's metadata 
probably only matters to the userspace AF_XDP.  However, I don't see how AF_XDP 
supports the program's metadata now.  afaict, it can only work now if there is 
some sort of contract between them or the AF_XDP currently does not use the 
program's metadata.  Either way, we can do the mem-moving only for AF_XDP and it 
should be a no op if there is no program's metadata?  This behavior could also 
be configurable through setsockopt?
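
To illustrate the layout I have in mind, here is a small userspace sketch
(not kernel code). The mock_* names, the single rx_timestamp field and
the has_skb_metadata flag are assumptions for illustration only.  The
kernel's struct is written right before data_meta, data_meta itself is
never moved, and "popping" on XDP_PASS is just reading the struct and
clearing the flag:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical kernel-owned metadata; the real struct has more members. */
struct mock_xdp_to_skb_metadata {
	uint64_t rx_timestamp;
};

/* Hypothetical subset of xdp_buff. */
struct mock_xdp_buff {
	unsigned char *data_hard_start;
	unsigned char *data_meta;	/* program's metadata starts here */
	unsigned char *data;
	bool has_skb_metadata;		/* the proposed flag bit */
};

/* Export: write the kernel metadata just before data_meta. */
static bool mock_export_to_skb(struct mock_xdp_buff *xdp, uint64_t tstamp)
{
	struct mock_xdp_to_skb_metadata m = { .rx_timestamp = tstamp };

	/* Hard failure the program can see: not enough headroom left. */
	if ((size_t)(xdp->data_meta - xdp->data_hard_start) < sizeof(m))
		return false;

	memcpy(xdp->data_meta - sizeof(m), &m, sizeof(m));
	xdp->has_skb_metadata = true;
	return true;
}

/* XDP_PASS: consume the kernel metadata; the program's metadata stays put. */
static bool mock_consume_skb_metadata(struct mock_xdp_buff *xdp,
				      struct mock_xdp_to_skb_metadata *out)
{
	if (!xdp->has_skb_metadata)
		return false;

	memcpy(out, xdp->data_meta - sizeof(*out), sizeof(*out));
	xdp->has_skb_metadata = false;
	return true;
}
```

Since data_meta is untouched, whatever the program stored between
data_meta and data reaches bpf-tc without any mem-moving.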

Thanks for the summary!

> 
> 
> I think the above solves all the cases Toke points to?
> 
> a) Accessing the metadata after redirect (in a cpumap or devmap
> program, or on a veth device)
>    - only a small xdp_to_skb_metadata subset will work out of the box
> iff the redirector calls bpf_xdp_metadata_export_to_skb; for the rest
> the progs will have to agree on the layout, right?
> 
> b) Transferring the packet+metadata to AF_XDP
>    - here, again, the AF_XDP consumer will have to either expect
> xdp_to_skb_metadata with a smaller set of skb-related metadata, or
> will have to make sure the producer builds a custom layout using
> kfuncs; there is also no flag to indicate whether xdp_to_skb_metadata
> is there or not; the consumer will have to test btf_id at the right
> offset
> 
> c) Returning XDP_PASS, but accessing some of the metadata first
> (whether to read or change it)
>    - can read via kfuncs, can change via
> bpf_xdp_metadata_export_to_skb(); m->xyz=abc;
> 
> Anything I'm missing?



* [xdp-hints] Re: [RFC bpf-next v2 04/14] veth: Support rx timestamp metadata for xdp
  2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 04/14] veth: Support rx timestamp metadata for xdp Stanislav Fomichev
  2022-11-09 11:21   ` [xdp-hints] " Toke Høiland-Jørgensen
@ 2022-11-10  0:25   ` John Fastabend
  2022-11-10  1:02     ` Stanislav Fomichev
  1 sibling, 1 reply; 66+ messages in thread
From: John Fastabend @ 2022-11-10  0:25 UTC (permalink / raw)
  To: Stanislav Fomichev, bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

Stanislav Fomichev wrote:
> xskxceiver conveniently sets up veth pairs so it seems logical
> to use veth as an example for some of the metadata handling.
> 
> We timestamp skb right when we "receive" it, store its
> pointer in new veth_xdp_buff wrapper and generate BPF bytecode to
> reach it from the BPF program.
> 
> This largely follows the idea of "store some queue context in
> the xdp_buff/xdp_frame so the metadata can be reached out
> from the BPF program".
> 

[...]

>  	orig_data = xdp->data;
>  	orig_data_end = xdp->data_end;
> +	vxbuf.skb = skb;
>  
>  	act = bpf_prog_run_xdp(xdp_prog, xdp);
>  
> @@ -942,6 +946,7 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
>  			struct sk_buff *skb = ptr;
>  
>  			stats->xdp_bytes += skb->len;
> +			__net_timestamp(skb);

Just getting to reviewing this in a bit more depth. We hit veth with lots of
packets in some configurations, so I don't think we want to add a
__net_timestamp here when the vast majority of use cases will have no need for
a timestamp on a veth device. I didn't do a benchmark, but it's not free.

If there is a real use case for timestamping on veth we could do it through
an XDP program directly? Basically a fallback for devices without hw timestamps.
Anyways, I need the helper to support hardware without timestamping.

Not sure if this was just part of the RFC to explore BPF programs or not.

>  			skb = veth_xdp_rcv_skb(rq, skb, bq, stats);
>  			if (skb) {
>  				if (skb_shared(skb) || skb_unclone(skb, GFP_ATOMIC))


* [xdp-hints] Re: [RFC bpf-next v2 04/14] veth: Support rx timestamp metadata for xdp
  2022-11-10  0:25   ` John Fastabend
@ 2022-11-10  1:02     ` Stanislav Fomichev
  2022-11-10  1:35       ` John Fastabend
  0 siblings, 1 reply; 66+ messages in thread
From: Stanislav Fomichev @ 2022-11-10  1:02 UTC (permalink / raw)
  To: John Fastabend
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, kpsingh, haoluo,
	jolsa, David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev

On Wed, Nov 9, 2022 at 4:26 PM John Fastabend <john.fastabend@gmail.com> wrote:
>
> Stanislav Fomichev wrote:
> > xskxceiver conveniently sets up veth pairs so it seems logical
> > to use veth as an example for some of the metadata handling.
> >
> > We timestamp skb right when we "receive" it, store its
> > pointer in new veth_xdp_buff wrapper and generate BPF bytecode to
> > reach it from the BPF program.
> >
> > This largely follows the idea of "store some queue context in
> > the xdp_buff/xdp_frame so the metadata can be reached out
> > from the BPF program".
> >
>
> [...]
>
> >       orig_data = xdp->data;
> >       orig_data_end = xdp->data_end;
> > +     vxbuf.skb = skb;
> >
> >       act = bpf_prog_run_xdp(xdp_prog, xdp);
> >
> > @@ -942,6 +946,7 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
> >                       struct sk_buff *skb = ptr;
> >
> >                       stats->xdp_bytes += skb->len;
> > +                     __net_timestamp(skb);
>
> Just getting to reviewing this in a bit more depth. We hit veth with lots of
> packets in some configurations, so I don't think we want to add a
> __net_timestamp here when the vast majority of use cases will have no need for
> a timestamp on a veth device. I didn't do a benchmark, but it's not free.
>
> If there is a real use case for timestamping on veth we could do it through
> an XDP program directly? Basically a fallback for devices without hw timestamps.
> Anyways, I need the helper to support hardware without timestamping.
>
> Not sure if this was just part of the RFC to explore BPF programs or not.

Initially I did it mostly so I could have selftests on top of the veth
driver, but I'd still prefer to keep it so the tests keep working.
Is there any way I can make it configurable? Is there some ethtool "enable tx
timestamping" option I can reuse?

> >                       skb = veth_xdp_rcv_skb(rq, skb, bq, stats);
> >                       if (skb) {
> >                               if (skb_shared(skb) || skb_unclone(skb, GFP_ATOMIC))


* [xdp-hints] Re: [RFC bpf-next v2 06/14] xdp: Carry over xdp metadata into skb context
  2022-11-10  0:13                 ` Martin KaFai Lau
@ 2022-11-10  1:02                   ` Stanislav Fomichev
  2022-11-10 14:26                     ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 66+ messages in thread
From: Stanislav Fomichev @ 2022-11-10  1:02 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Toke Høiland-Jørgensen, ast, daniel, andrii, song, yhs,
	john.fastabend, kpsingh, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev, bpf

On Wed, Nov 9, 2022 at 4:13 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 11/9/22 1:33 PM, Stanislav Fomichev wrote:
> > On Wed, Nov 9, 2022 at 10:22 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >>
> >> On 11/9/22 3:10 AM, Toke Høiland-Jørgensen wrote:
> >>> Snipping a bit of context to reply to this bit:
> >>>
> >>>>>>> Can the xdp prog still change the metadata through xdp->data_meta? tbh, I am not
> >>>>>>> sure it is solid enough by asking the xdp prog not to use the same random number
> >>>>>>> in its own metadata + not to change the metadata through xdp->data_meta after
> >>>>>>> calling bpf_xdp_metadata_export_to_skb().
> >>>>>>
> >>>>>> What do you think the usecase here might be? Or are you suggesting we
> >>>>>> reject further access to data_meta after
> >>>>>> bpf_xdp_metadata_export_to_skb somehow?
> >>>>>>
> >>>>>> If we want to let the programs override some of this
> >>>>>> bpf_xdp_metadata_export_to_skb() metadata, it feels like we can add
> >>>>>> more kfuncs instead of exposing the layout?
> >>>>>>
> >>>>>> bpf_xdp_metadata_export_to_skb(ctx);
> >>>>>> bpf_xdp_metadata_export_skb_hash(ctx, 1234);
> >>>
> >>> There are several use cases for needing to access the metadata after
> >>> calling bpf_xdp_metadata_export_to_skb():
> >>>
> >>> - Accessing the metadata after redirect (in a cpumap or devmap program,
> >>>     or on a veth device)
> >>> - Transferring the packet+metadata to AF_XDP
> >> fwiw, the xdp prog could also be more selective and only stores one of the hints
> >> instead of the whole 'struct xdp_to_skb_metadata'.
> >>
> >>> - Returning XDP_PASS, but accessing some of the metadata first (whether
> >>>     to read or change it)
> >>>
> >>> The last one could be solved by calling additional kfuncs, but that
> >>> would be less efficient than just directly editing the struct which
> >>> will be cache-hot after the helper returns.
> >>
> >> Yeah, it is more efficient to directly write if possible.  I think this set
> >> allows the direct reading and writing already through data_meta (as a __u8 *).
> >>
> >>>
> >>> And yeah, this will allow the XDP program to inject arbitrary metadata
> >>> into the netstack; but it can already inject arbitrary *packet* data
> >>> into the stack, so not sure if this is much of an additional risk? If it
> >>> does lead to trivial crashes, we should probably harden the stack
> >>> against that?
> >>>
> >>> As for the random number, Jesper and I discussed replacing this with the
> >>> same BTF-ID scheme that he was using in his patch series. I.e., instead
> >>> of just putting in a random number, we insert the BTF ID of the metadata
> >>> struct at the end of it. This will allow us to support multiple
> >>> different formats in the future (not just changing the layout, but
> >>> having multiple simultaneous formats in the same kernel image), in case
> >>> we run out of space.
> >>
> >> This seems a bit hypothetical.  How much headroom does it usually have for the
> >> xdp prog?  Potentially the hints can use all the remaining space left after the
> >> header encap and the current bpf_xdp_adjust_meta() usage?
> >>
> >>>
> >>> We should probably also have a flag set on the xdp_frame so the stack
> >>> knows that the metadata area contains relevant-to-skb data, to guard
> >>> against an XDP program accidentally hitting the "magic number" (BTF_ID)
> >>> in unrelated stuff it puts into the metadata area.
> >>
> >> Yeah, I think having a flag is useful.  The flag will be set at xdp_buff and
> >> then transfer to the xdp_frame?
> >>
> >>>
> >>>> After re-reading patch 6, have another question. The 'void
> >>>> bpf_xdp_metadata_export_to_skb();' function signature. Should it at
> >>>> least return ok/err? or even return a 'struct xdp_to_skb_metadata *'
> >>>> pointer and the xdp prog can directly read (or even write) it?
> >>>
> >>> Hmm, I'm not sure returning a failure makes sense? Failure to read one
> >>> or more fields just means that those fields will not be populated? We
> >>> should probably have a flags field inside the metadata struct itself to
> >>> indicate which fields are set or not, but I'm not sure returning an
> >>> error value adds anything? Returning a pointer to the metadata field
> >>> might be convenient for users (it would just be an alias to the
> >>> data_meta pointer, but the verifier could know its size, so the program
> >>> doesn't have to bounds check it).
> >>
> >> If some hints are not available, those hints should be initialized to
> >> 0/CHECKSUM_NONE/...etc.  The xdp prog needs a direct way to tell hard failure
> >> when it cannot write the meta area because of not enough space.  Comparing
> >> xdp->data_meta with xdp->data as a side effect is not intuitive.
> >>
> >> It is more than saving the bound check.  With type info of 'struct
> >> xdp_to_skb_metadata *', the verifier can do more checks like reading in the
> >> middle of an integer member.  The verifier could also limit write access only to
> >> a few struct's members if it is needed.
> >>
> >> The returning 'struct xdp_to_skb_metadata *' should not be an alias to the
> >> xdp->data_meta.  They should actually point to different locations in the
> >> headroom.  bpf_xdp_metadata_export_to_skb() sets a flag in xdp_buff.
> >> xdp->data_meta won't be changed and keeps pointing to the last
> >> bpf_xdp_adjust_meta() location.  The kernel will know if there is
> >> xdp_to_skb_metadata before the xdp->data_meta when that bit is set in the
> >> xdp_{buff,frame}.  Would it work?
> >>
> >>>
> >>>> A related question, why 'struct xdp_to_skb_metadata' needs
> >>>> __randomize_layout?
> >>>
> >>> The __randomize_layout thing is there to force BPF programs to use CO-RE
> >>> to access the field. This is to avoid the struct layout accidentally
> >>> ossifying because people in practice rely on a particular layout, even
> >>> though we tell them to use CO-RE. There are lots of examples of this
> >>> happening in other domains (IP header options, TCP options, etc), and
> >>> __randomize_layout seemed like a neat trick to enforce CO-RE usage :)
> >>
> >> I am not sure if it is necessary or helpful to only enforce __randomize_layout
> >> in 'struct xdp_to_skb_metadata'.  There are other CO-RE use cases (tracing and
> >> non tracing) that already have direct access (reading and/or writing) to other
> >> kernel structures.
> >>
> >> It is more important for the verifier to see the xdp prog accessing it as a
> >> 'struct xdp_to_skb_metadata *' instead of xdp->data_meta which is a __u8 * so
> >> that the verifier can enforce the rules of access.
> >>
> >>>
> >>>>>>> Does xdp_to_skb_metadata have a use case for XDP_PASS (like patch 7) or the
> >>>>>>> xdp_to_skb_metadata can be limited to XDP_REDIRECT only?
> >>>>>>
> >>>>>> XDP_PASS cases where we convert xdp_buff into skb in the drivers right
> >>>>>> now usually have C code to manually pull out the metadata (out of hw
> >>>>>> desc) and put it into skb.
> >>>>>>
> >>>>>> So, currently, if we're calling bpf_xdp_metadata_export_to_skb() for
> >>>>>> XDP_PASS, we're doing a double amount of work:
> >>>>>> skb_metadata_import_from_xdp first, then custom driver code second.
> >>>>>>
> >>>>>> In theory, maybe we should completely skip drivers custom parsing when
> >>>>>> there is a prog with BPF_F_XDP_HAS_METADATA?
> >>>>>> Then both xdp->skb paths (XDP_PASS+XDP_REDIRECT) will be bpf-driven
> >>>>>> and won't require any mental work (plus, the drivers won't have to
> >>>>>> care either in the future).
> >>>>>> WDYT?
> >>>>>
> >>>>>
> >>>>> Yeah, not sure if it can solely depend on BPF_F_XDP_HAS_METADATA but it makes
> >>>>> sense to only use the hints (if ever written) from xdp prog especially if it
> >>>>> will eventually support xdp prog changing some of the hints in the future.  For
> >>>>> now, I think either way is fine since they are the same and the xdp prog is sort
> >>>>> of doing extra unnecessary work anyway by calling
> >>>>> bpf_xdp_metadata_export_to_skb() with XDP_PASS and knowing nothing can be
> >>>>> changed now.
> >>>
> >>> I agree it would be best if the drivers also use the XDP metadata (if
> >>> present) on XDP_PASS. Longer term my hope is we can make the XDP
> >>> metadata support the only thing drivers need to implement (i.e., have
> >>> the stack call into that code even when no XDP program is loaded), but
> >>> for now just for consistency (and allowing the XDP program to update the
> >>> metadata), we should probably at least consume it on XDP_PASS.
> >>>
> >>> -Toke
> >>>
> >
> > Not to derail the discussion (left the last message intact on top,
> > feel free to continue), but to summarize. The proposed changes seem to
> > be:
> >
> > 1. bpf_xdp_metadata_export_to_skb() should return pointer to "struct
> > xdp_to_skb_metadata"
> >    - This should let bpf programs change the metadata passed to the skb
> >
> > 2. "struct xdp_to_skb_metadata" should have its btf_id as the first
> > __u32 member (and remove the magic)
> >    - This is for the redirect case where the end users, including
> > AF_XDP, can parse this metadata from btf_id
>
> I think Toke's idea is to put the btf_id at the end of xdp_to_skb_metadata.  I
> can see why the end is needed for the userspace AF_XDP because, afaict, AF_XDP
> rx_desc currently cannot tell if there is metadata written by the xdp prog or
> not.  However, if the 'has_skb_metadata' bit can also be passed to the AF_XDP
> rx_desc->options, the btf_id may as well be not needed now.  However, the btf_id
> and other future new members can be added to the xdp_to_skb_metadata later if
> there is a need.
>
> For the kernel and xdp prog, a bit in the xdp->flags should be enough to get to
> the xdp_to_skb_metadata.  The xdp prog will use CO-RE to access the members in
> xdp_to_skb_metadata.

Ack, good points on putting it at the end.
Regarding bit in desc->options vs btf_id: since it seems that btf_id
is useful anyway, let's start with that? We can add a bit later on if
it turns out using metadata is problematic otherwise.

> >    - This, however, is not all the metadata that the device can
> > support, but a much narrower set that the kernel is expected to use
> > for skb construction
> >
> > 3. __randomize_layout isn't really helping, CO-RE will trigger
> > regardless; maybe only the case where it matters is probably AF_XDP,
> > so still useful?
> >
> > 4. The presence of the metadata generated by
> > bpf_xdp_metadata_export_to_skb should be indicated by a flag in
> > xdp_{buff,frame}->flags
> >    - Assuming exposing it via xdp_md->has_skb_metadata is ok?
>
> probably __bpf_md_ptr(struct xdp_to_skb_metadata *, skb_metadata) and the type
> will be PTR_TO_BTF_ID_OR_NULL.

Oh, that seems even better than returning it from
bpf_xdp_metadata_export_to_skb.
bpf_xdp_metadata_export_to_skb can return true/false and the rest goes
via the default verifier ctx resolution mechanism.
(returning ptr from a kfunc seems to be a bit complicated right now)
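
To make the proposed flow concrete, here is a rough userspace model of it (not
kernel code): the export call reports success or failure, and the program then
reaches the metadata through a nullable ctx member. Every name ending in
_model is made up for illustration, and the single rx_timestamp field is only
an assumption about what the struct would carry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for 'struct xdp_to_skb_metadata'. */
struct xdp_to_skb_metadata_model {
	uint64_t rx_timestamp;
};

/* Hypothetical stand-in for the XDP context seen by the program. */
struct xdp_md_model {
	/* NULL until export succeeds (models PTR_TO_BTF_ID_OR_NULL) */
	struct xdp_to_skb_metadata_model *skb_metadata;
	struct xdp_to_skb_metadata_model storage; /* models reserved headroom */
	bool headroom_available;                  /* models "enough space" */
	uint64_t hw_rx_timestamp;                 /* models the HW descriptor */
};

/* Models the kfunc returning true/false instead of a pointer. */
static bool export_to_skb_model(struct xdp_md_model *ctx)
{
	if (!ctx->headroom_available)
		return false; /* hard failure: cannot write the meta area */

	ctx->storage.rx_timestamp = ctx->hw_rx_timestamp;
	ctx->skb_metadata = &ctx->storage;
	return true;
}

/* Models an XDP program that exports and then overrides one hint. */
static bool prog_model(struct xdp_md_model *ctx, uint64_t override_ts)
{
	if (!export_to_skb_model(ctx))
		return false;

	/* A nullable ctx pointer would need this check before any access. */
	if (!ctx->skb_metadata)
		return false;

	ctx->skb_metadata->rx_timestamp = override_ts;
	return true;
}
```

The point of the model is only that the failure case is visible directly from
the return value, without comparing data_meta against data.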

> >    - Since the programs probably need to do the following:
> >
> >    if (xdp_md->has_skb_metadata) {
> >      access/change skb metadata by doing struct xdp_to_skb_metadata *p
> > = data_meta;
>
> and directly access/change xdp->skb_metadata instead of using xdp->data_meta.

Ack.

> >    } else {
> >      use kfuncs
> >    }
> >
> > 5. Support the case where we keep program's metadata and kernel's
> > xdp_to_skb_metadata
> >    - skb_metadata_import_from_xdp() will "consume" it by mem-moving the
> > rest of the metadata over it and adjusting the headroom
>
> I was thinking the kernel's xdp_to_skb_metadata is always before the program's
> metadata.  xdp prog should usually work in this order also: read/write headers,
> write its own metadata, call bpf_xdp_metadata_export_to_skb(), and return
> XDP_PASS/XDP_REDIRECT.  When it is XDP_PASS, the kernel just needs to pop the
> xdp_to_skb_metadata and pass the remaining program's metadata to the bpf-tc.
>
> For the kernel and xdp prog, I don't think it matters where the
> xdp_to_skb_metadata is.  However, the xdp->data_meta (program's metadata) has to
> be before xdp->data because of the current data_meta and data comparison usage
> in the xdp prog.
>
> The order of the kernel's xdp_to_skb_metadata and the program's metadata
> probably only matters to the userspace AF_XDP.  However, I don't see how AF_XDP
> supports the program's metadata now.  afaict, it can only work now if there is
> some sort of contract between them or the AF_XDP currently does not use the
> program's metadata.  Either way, we can do the mem-moving only for AF_XDP and it
> should be a no op if there is no program's metadata?  This behavior could also
> be configurable through setsockopt?

Agreed on all of the above. For now it seems like the safest thing to
do is to put xdp_to_skb_metadata last to allow af_xdp to properly
locate btf_id.
Let's see if Toke disagrees :-)
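
As a rough sketch of what "locate btf_id" could look like for a userspace
consumer: assuming xdp_to_skb_metadata sits last in the metadata area (ending
exactly where the packet data starts) and carries btf_id as its final member,
the consumer can copy the struct out from just before data and compare the id.
The struct layout, the reserved field, and the function name below are
assumptions for illustration, not the layout from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout with btf_id as the last member. The explicit
 * 'reserved' field avoids tail padding after btf_id.
 */
struct xdp_to_skb_metadata_sketch {
	uint64_t rx_timestamp;
	uint32_t reserved;
	uint32_t btf_id;
};

/* Returns 0 and fills *out when the expected metadata is found directly
 * before 'data'; returns -1 when the area is too small or the id differs.
 */
static int locate_skb_metadata(const uint8_t *data_meta, const uint8_t *data,
			       uint32_t expected_btf_id,
			       struct xdp_to_skb_metadata_sketch *out)
{
	size_t meta_len = (size_t)(data - data_meta);

	if (meta_len < sizeof(*out))
		return -1; /* metadata area cannot hold the struct */

	/* memcpy avoids unaligned access into the packet buffer */
	memcpy(out, data - sizeof(*out), sizeof(*out));

	return out->btf_id == expected_btf_id ? 0 : -1;
}
```

Any program-private metadata would then live between data_meta and the start
of this struct, which is why the ordering matters for AF_XDP.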


> Thanks for the summary!
>
> >
> >
> > I think the above solves all the cases Toke points to?
> >
> > a) Accessing the metadata after redirect (in a cpumap or devmap
> > program, or on a veth device)
> >    - only a small xdp_to_skb_metadata subset will work out of the box
> > iff the redirector calls bpf_xdp_metadata_export_to_skb; for the rest
> > the progs will have to agree on the layout, right?
> >
> > b) Transferring the packet+metadata to AF_XDP
> >    - here, again, the AF_XDP consumer will have to either expect
> > xdp_to_skb_metadata with a smaller set of skb-related metadata, or
> > will have to make sure the producer builds a custom layout using
> > kfuncs; there is also no flag to indicate whether xdp_to_skb_metadata
> > is there or not; the consumer will have to test btf_id at the right
> > offset
> >
> > c) Returning XDP_PASS, but accessing some of the metadata first
> > (whether to read or change it)
> >    - can read via kfuncs, can change via
> > bpf_xdp_metadata_export_to_skb(); m->xyz=abc;
> >
> > Anything I'm missing?
>


* [xdp-hints] Re: [RFC bpf-next v2 06/14] xdp: Carry over xdp metadata into skb context
  2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 06/14] xdp: Carry over xdp metadata into skb context Stanislav Fomichev
  2022-11-07 22:01   ` [xdp-hints] " Martin KaFai Lau
@ 2022-11-10  1:09   ` John Fastabend
  2022-11-10  6:44     ` Stanislav Fomichev
  1 sibling, 1 reply; 66+ messages in thread
From: John Fastabend @ 2022-11-10  1:09 UTC (permalink / raw)
  To: Stanislav Fomichev, bpf
  Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, sdf, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

Stanislav Fomichev wrote:
> Implement new bpf_xdp_metadata_export_to_skb kfunc which
> prepares compatible xdp metadata for kernel consumption.
> This kfunc should be called prior to bpf_redirect
> or (unless already called) when XDP_PASS'ing the frame
> into the kernel.

Hi,

I had a couple of high-level questions, so I'm starting a new thread; I thought
it would be more confusing than helpful to add to the thread on
this patch.

> 
> The implementation currently maintains xdp_to_skb_metadata
> layout by calling bpf_xdp_metadata_rx_timestamp and placing
> small magic number. From skb_metadata_set, when we get expected magic number,
> we interpret metadata accordingly.

From commit message side I'm not able to parse this paragraph without
reading code. Maybe expand it a bit for next version or it could
just be me.

> 
> Both magic number and struct layout are randomized to make sure
> it doesn't leak into the userspace.

Are we worried about leaking pointers into XDP program here? We already
leak pointers into XDP through helpers so I'm not sure it matters.

> 
> skb_metadata_set is amended with skb_metadata_import_from_xdp which
> tries to parse out the metadata and put it into skb.
> 
> See the comment for r1 vs r2/r3/r4/r5 conventions.

I think for next version an expanded commit message with use
cases would help. I had to follow the thread to get some ideas
why this might be useful.

> 
> Cc: John Fastabend <john.fastabend@gmail.com>
> Cc: David Ahern <dsahern@gmail.com>
> Cc: Martin KaFai Lau <martin.lau@linux.dev>
> Cc: Jakub Kicinski <kuba@kernel.org>
> Cc: Willem de Bruijn <willemb@google.com>
> Cc: Jesper Dangaard Brouer <brouer@redhat.com>
> Cc: Anatoly Burakov <anatoly.burakov@intel.com>
> Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
> Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
> Cc: Maryam Tahhan <mtahhan@redhat.com>
> Cc: xdp-hints@xdp-project.net
> Cc: netdev@vger.kernel.org
> Signed-off-by: Stanislav Fomichev <sdf@google.com>
> ---
>  drivers/net/veth.c        |   4 +-
>  include/linux/bpf_patch.h |   2 +
>  include/linux/skbuff.h    |   4 ++
>  include/net/xdp.h         |  13 +++++
>  kernel/bpf/bpf_patch.c    |  30 +++++++++++
>  kernel/bpf/verifier.c     |  18 +++++++
>  net/core/skbuff.c         |  25 +++++++++
>  net/core/xdp.c            | 104 +++++++++++++++++++++++++++++++++++---
>  8 files changed, 193 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> index 0e629ceb087b..d4cd0938360b 100644
> --- a/drivers/net/veth.c
> +++ b/drivers/net/veth.c
> @@ -1673,7 +1673,9 @@ static int veth_xdp(struct net_device *dev, struct netdev_bpf *xdp)
>  static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
>  			      struct bpf_patch *patch)
>  {
> -	if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) {
> +	if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_EXPORT_TO_SKB)) {
> +		return xdp_metadata_export_to_skb(prog, patch);
> +	} else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) {
>  		/* return true; */
>  		bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1));
>  	} else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) {
> diff --git a/include/linux/bpf_patch.h b/include/linux/bpf_patch.h
> index 81ff738eef8d..359c165ad68b 100644
> --- a/include/linux/bpf_patch.h
> +++ b/include/linux/bpf_patch.h
> @@ -16,6 +16,8 @@ size_t bpf_patch_len(const struct bpf_patch *patch);
>  int bpf_patch_err(const struct bpf_patch *patch);
>  void __bpf_patch_append(struct bpf_patch *patch, struct bpf_insn insn);
>  struct bpf_insn *bpf_patch_data(const struct bpf_patch *patch);
> +void bpf_patch_resolve_jmp(struct bpf_patch *patch);
> +u32 bpf_patch_magles_registers(const struct bpf_patch *patch);
>  
>  #define bpf_patch_append(patch, ...) ({ \
>  	struct bpf_insn insn[] = { __VA_ARGS__ }; \
> diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> index 59c9fd55699d..dba857f212d7 100644
> --- a/include/linux/skbuff.h
> +++ b/include/linux/skbuff.h
> @@ -4217,9 +4217,13 @@ static inline bool skb_metadata_differs(const struct sk_buff *skb_a,
>  	       true : __skb_metadata_differs(skb_a, skb_b, len_a);
>  }
>  
> +void skb_metadata_import_from_xdp(struct sk_buff *skb, size_t len);
> +
>  static inline void skb_metadata_set(struct sk_buff *skb, u8 meta_len)
>  {
>  	skb_shinfo(skb)->meta_len = meta_len;
> +	if (meta_len)
> +		skb_metadata_import_from_xdp(skb, meta_len);
>  }
>  
>  static inline void skb_metadata_clear(struct sk_buff *skb)
> diff --git a/include/net/xdp.h b/include/net/xdp.h
> index 2a82a98f2f9f..8c97c6996172 100644
> --- a/include/net/xdp.h
> +++ b/include/net/xdp.h
> @@ -411,6 +411,8 @@ void xdp_attachment_setup(struct xdp_attachment_info *info,
>  #define DEV_MAP_BULK_SIZE XDP_BULK_QUEUE_SIZE
>  
>  #define XDP_METADATA_KFUNC_xxx	\
> +	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_EXPORT_TO_SKB, \
> +			   bpf_xdp_metadata_export_to_skb) \
>  	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED, \
>  			   bpf_xdp_metadata_rx_timestamp_supported) \
>  	XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_TIMESTAMP, \
> @@ -423,14 +425,25 @@ XDP_METADATA_KFUNC_xxx
>  MAX_XDP_METADATA_KFUNC,
>  };
>  
> +struct xdp_to_skb_metadata {
> +	u32 magic; /* xdp_metadata_magic */
> +	u64 rx_timestamp;

Slightly confused. I thought most drivers already populate the skb timestamp
if they can? So why do we need to bounce these through some xdp
metadata? Won't all this cost more than a load/store directly from the
descriptor into the skb? Even if drivers are not populating the skb now,
shouldn't an ethtool knob be enough to turn this on?

I don't see the value of getting this on the veth side; it's just a sw
timestamp there.

If it's specific to cpumap, shouldn't we land this in the cpumap code paths
rather than in the general XDP code paths?


> +} __randomize_layout;
> +
> +struct bpf_patch;
> +
>  #ifdef CONFIG_DEBUG_INFO_BTF
> +extern u32 xdp_metadata_magic;
>  extern struct btf_id_set8 xdp_metadata_kfunc_ids;
>  static inline u32 xdp_metadata_kfunc_id(int id)
>  {
>  	return xdp_metadata_kfunc_ids.pairs[id].id;
>  }
> +void xdp_metadata_export_to_skb(const struct bpf_prog *prog, struct bpf_patch *patch);
>  #else
> +#define xdp_metadata_magic 0
>  static inline u32 xdp_metadata_kfunc_id(int id) { return 0; }
> +static inline void xdp_metadata_export_to_skb(const struct bpf_prog *prog, struct bpf_patch *patch) { }
>  #endif
>  
>  #endif /* __LINUX_NET_XDP_H__ */
> diff --git a/kernel/bpf/bpf_patch.c b/kernel/bpf/bpf_patch.c
> index 82a10bf5624a..8f1fef74299c 100644
> --- a/kernel/bpf/bpf_patch.c
> +++ b/kernel/bpf/bpf_patch.c
> @@ -49,3 +49,33 @@ struct bpf_insn *bpf_patch_data(const struct bpf_patch *patch)
>  {
>  	return patch->insn;
>  }

[...]

>  
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 42a35b59fb1e..37e3aef46525 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -72,6 +72,7 @@
>  #include <net/mptcp.h>
>  #include <net/mctp.h>
>  #include <net/page_pool.h>
> +#include <net/xdp.h>
>  
>  #include <linux/uaccess.h>
>  #include <trace/events/skb.h>
> @@ -6672,3 +6673,27 @@ nodefer:	__kfree_skb(skb);
>  	if (unlikely(kick) && !cmpxchg(&sd->defer_ipi_scheduled, 0, 1))
>  		smp_call_function_single_async(cpu, &sd->defer_csd);
>  }
> +
> +void skb_metadata_import_from_xdp(struct sk_buff *skb, size_t len)
> +{
> +	struct xdp_to_skb_metadata *meta = (void *)(skb_mac_header(skb) - len);
> +
> +	/* Optional SKB info, currently missing:
> +	 * - HW checksum info		(skb->ip_summed)
> +	 * - HW RX hash			(skb_set_hash)
> +	 * - RX ring dev queue index	(skb_record_rx_queue)
> +	 */
> +
> +	if (len != sizeof(struct xdp_to_skb_metadata))
> +		return;
> +
> +	if (meta->magic != xdp_metadata_magic)
> +		return;
> +
> +	if (meta->rx_timestamp) {
> +		*skb_hwtstamps(skb) = (struct skb_shared_hwtstamps){
> +			.hwtstamp = ns_to_ktime(meta->rx_timestamp),
> +		};
> +	}
> +}
> +EXPORT_SYMBOL(skb_metadata_import_from_xdp);
> diff --git a/net/core/xdp.c b/net/core/xdp.c
> index 22f1e44700eb..8204fa05c5e9 100644
> --- a/net/core/xdp.c
> +++ b/net/core/xdp.c
> @@ -653,12 +653,6 @@ struct sk_buff *__xdp_build_skb_from_frame(struct xdp_frame *xdpf,
>  	/* Essential SKB info: protocol and skb->dev */
>  	skb->protocol = eth_type_trans(skb, dev);
>  
> -	/* Optional SKB info, currently missing:
> -	 * - HW checksum info		(skb->ip_summed)
> -	 * - HW RX hash			(skb_set_hash)
> -	 * - RX ring dev queue index	(skb_record_rx_queue)
> -	 */
> -
>  	/* Until page_pool get SKB return path, release DMA here */
>  	xdp_release_frame(xdpf);
>  
> @@ -712,6 +706,13 @@ struct xdp_frame *xdpf_clone(struct xdp_frame *xdpf)
>  	return nxdpf;
>  }
>  
> +/* For the packets directed to the kernel, this kfunc exports XDP metadata
> + * into skb context.
> + */
> +noinline void bpf_xdp_metadata_export_to_skb(const struct xdp_md *ctx)
> +{
> +}
> +
>  /* Indicates whether particular device supports rx_timestamp metadata.
>   * This is an optional helper to support marking some branches as
>   * "dead code" in the BPF programs.
> @@ -737,13 +738,104 @@ XDP_METADATA_KFUNC_xxx
>  #undef XDP_METADATA_KFUNC
>  BTF_SET8_END(xdp_metadata_kfunc_ids)
>  
> +/* Make sure userspace doesn't depend on our layout by using
> + * different pseudo-generated magic value.
> + */
> +u32 xdp_metadata_magic;
> +
>  static const struct btf_kfunc_id_set xdp_metadata_kfunc_set = {
>  	.owner = THIS_MODULE,
>  	.set   = &xdp_metadata_kfunc_ids,
>  };
>  
> +/* Since we're not actually doing a call but instead rewriting
> + * in place, we can only afford to use R0-R5 scratch registers.

Why not just do a call? It's neat to inline this, but you're going
to build an skb next. That's not cheap, and the cost of a call
should be complete noise when hitting the entire stack.

Any benchmark to convince us this is a worthwhile optimization?

> + *
> + * We reserve R1 for bpf_xdp_metadata_export_to_skb and let individual
> + * metadata kfuncs use only R0,R4-R5.
> + *
> + * The above also means we _cannot_ easily call any other helper/kfunc
> + * because there is no place for us to preserve our R1 argument;
> + * existing R6-R9 belong to the callee.
> + */
> +void xdp_metadata_export_to_skb(const struct bpf_prog *prog, struct bpf_patch *patch)
> +{

[...]

>  }
>  late_initcall(xdp_metadata_init);

Thanks,
John




* [xdp-hints] Re: [RFC bpf-next v2 06/14] xdp: Carry over xdp metadata into skb context
  2022-11-09 11:10           ` Toke Høiland-Jørgensen
  2022-11-09 18:22             ` Martin KaFai Lau
@ 2022-11-10  1:26             ` John Fastabend
  2022-11-10 14:32               ` Toke Høiland-Jørgensen
  1 sibling, 1 reply; 66+ messages in thread
From: John Fastabend @ 2022-11-10  1:26 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Martin KaFai Lau, Stanislav Fomichev
  Cc: ast, daniel, andrii, song, yhs, john.fastabend, kpsingh, haoluo,
	jolsa, David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev, bpf

Toke Høiland-Jørgensen wrote:
> Snipping a bit of context to reply to this bit:
> 
> >>>> Can the xdp prog still change the metadata through xdp->data_meta? tbh, I am not
> >>>> sure it is solid enough by asking the xdp prog not to use the same random number
> >>>> in its own metadata + not to change the metadata through xdp->data_meta after
> >>>> calling bpf_xdp_metadata_export_to_skb().
> >>>
> >>> What do you think the usecase here might be? Or are you suggesting we
> >>> reject further access to data_meta after
> >>> bpf_xdp_metadata_export_to_skb somehow?
> >>>
> >>> If we want to let the programs override some of this
> >>> bpf_xdp_metadata_export_to_skb() metadata, it feels like we can add
> >>> more kfuncs instead of exposing the layout?
> >>>
> >>> bpf_xdp_metadata_export_to_skb(ctx);
> >>> bpf_xdp_metadata_export_skb_hash(ctx, 1234);
> 

Hi Toke,

Trying not to bifurcate your thread. Can I start a new one here to
elaborate on these use cases? I'm still a bit lost on any use case
for this that makes sense to actually deploy on a network.

> There are several use cases for needing to access the metadata after
> calling bpf_xdp_metadata_export_to_skb():
> 
> - Accessing the metadata after redirect (in a cpumap or devmap program,
>   or on a veth device)

I think for devmap there are still lots of open questions about how/where
the skb is even built.

For cpumap I'm a bit unsure what the use case is. For ice, mlx and
such you should use the hardware RSS if performance is top of mind.
And then for specific devices on cpumap (maybe realtime or ptp things?)
could we just throw it through the xdp_frame?

> - Transferring the packet+metadata to AF_XDP

In this case we have the metadata, and the AF_XDP program and the XDP program
simply need to agree on the metadata format. There is no need for magic
numbers and driver-specific kfuncs.

> - Returning XDP_PASS, but accessing some of the metadata first (whether
>   to read or change it)
> 

I don't get this case. XDP_PASS should go to the stack normally through
the drivers' build_skb routines. These will populate the timestamp normally.
My guess is that a simple descriptor->skb load/store is cheaper than carrying
around this metadata and doing the call on the BPF side. Anyway, you
just built an entire skb and hit the stack; I don't think you will
notice this noise in any benchmark.


* [xdp-hints] Re: [RFC bpf-next v2 04/14] veth: Support rx timestamp metadata for xdp
  2022-11-10  1:02     ` Stanislav Fomichev
@ 2022-11-10  1:35       ` John Fastabend
  2022-11-10  6:44         ` Stanislav Fomichev
  0 siblings, 1 reply; 66+ messages in thread
From: John Fastabend @ 2022-11-10  1:35 UTC (permalink / raw)
  To: Stanislav Fomichev, John Fastabend
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, kpsingh, haoluo,
	jolsa, David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev

Stanislav Fomichev wrote:
> On Wed, Nov 9, 2022 at 4:26 PM John Fastabend <john.fastabend@gmail.com> wrote:
> >
> > Stanislav Fomichev wrote:
> > > xskxceiver conveniently sets up veth pairs so it seems logical
> > > to use veth as an example for some of the metadata handling.
> > >
> > > We timestamp skb right when we "receive" it, store its
> > > pointer in new veth_xdp_buff wrapper and generate BPF bytecode to
> > > reach it from the BPF program.
> > >
> > > This largely follows the idea of "store some queue context in
> > > the xdp_buff/xdp_frame so the metadata can be reached out
> > > from the BPF program".
> > >
> >
> > [...]
> >
> > >       orig_data = xdp->data;
> > >       orig_data_end = xdp->data_end;
> > > +     vxbuf.skb = skb;
> > >
> > >       act = bpf_prog_run_xdp(xdp_prog, xdp);
> > >
> > > @@ -942,6 +946,7 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
> > >                       struct sk_buff *skb = ptr;
> > >
> > >                       stats->xdp_bytes += skb->len;
> > > +                     __net_timestamp(skb);
> >
> > Just getting to reviewing in depth a bit more. But we hit veth with lots of
> > packets in some configurations I don't think we want to add a __net_timestamp
> > here when vast majority of use cases will have no need for timestamp on veth
> > device. I didn't do a benchmark but it's not free.
> >
> > If there is a real use case for timestamping on veth we could do it through
> > a XDP program directly? Basically fallback for devices without hw timestamps.
> > Anyways I need the helper to support hardware without time stamping.
> >
> > Not sure if this was just part of the RFC to explore BPF programs or not.
> 
> Initially I've done it mostly so I can have selftests on top of veth
> driver, but I'd still prefer to keep it to have working tests.

I can't think of a use for it though, so it's just extra cycles. There
is a helper to read the ktime.

> Any way I can make it configurable? Is there some ethtool "enable tx
> timestamping" option I can reuse?

There is a -T option for timestamping in ethtool. There are also the
socket controls for it, so you could spin up a socket and use it.
But that is a bit broken as well; I think it would be better if the
timestamp came from the receiving physical nic?

I have some mlx nics here and a k8s cluster with lots of veth
devices so I could think a bit more. I'm just not sure why I would
want the veth to timestamp things offhand.
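
For reference, the socket-level control is SO_TIMESTAMPING. A minimal sketch,
assuming a Linux socket and software RX timestamps only; the helper name
enable_sw_rx_timestamping is made up for illustration, and reading the
timestamps back (via recvmsg control messages) is left out.

```c
#include <assert.h>
#include <linux/net_tstamp.h>
#include <sys/socket.h>
#include <unistd.h>

/* Ask the stack to generate and report software RX timestamps on 'fd'.
 * Returns 0 on success, -1 on error (errno set by setsockopt).
 */
static int enable_sw_rx_timestamping(int fd)
{
	int flags = SOF_TIMESTAMPING_RX_SOFTWARE | /* generate on receive */
		    SOF_TIMESTAMPING_SOFTWARE;     /* report sw timestamps */

	return setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING,
			  &flags, sizeof(flags));
}
```

This is per-socket state, so it does not say anything about where in the
receive path (physical nic vs veth) the timestamp is taken.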

> 
> > >                       skb = veth_xdp_rcv_skb(rq, skb, bq, stats);
> > >                       if (skb) {
> > >                               if (skb_shared(skb) || skb_unclone(skb, GFP_ATOMIC))




* [xdp-hints] Re: [RFC bpf-next v2 06/14] xdp: Carry over xdp metadata into skb context
  2022-11-10  1:09   ` John Fastabend
@ 2022-11-10  6:44     ` Stanislav Fomichev
  2022-11-10 21:21       ` David Ahern
  0 siblings, 1 reply; 66+ messages in thread
From: Stanislav Fomichev @ 2022-11-10  6:44 UTC (permalink / raw)
  To: John Fastabend
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, kpsingh, haoluo,
	jolsa, David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev

On Wed, Nov 9, 2022 at 5:09 PM John Fastabend <john.fastabend@gmail.com> wrote:
>
> Stanislav Fomichev wrote:
> > Implement new bpf_xdp_metadata_export_to_skb kfunc which
> > prepares compatible xdp metadata for kernel consumption.
> > This kfunc should be called prior to bpf_redirect
> > or (unless already called) when XDP_PASS'ing the frame
> > into the kernel.
>
> Hi,
>
> Had a couple of high-level questions, so I'm starting a new thread; I thought
> it would be more confusing than helpful to add to the thread on
> this patch.
>
> >
> > The implementation currently maintains xdp_to_skb_metadata
> > layout by calling bpf_xdp_metadata_rx_timestamp and placing a
> > small magic number. From skb_metadata_set, when we get the expected magic number,
> > we interpret the metadata accordingly.
>
> From commit message side I'm not able to parse this paragraph without
> reading code. Maybe expand it a bit for next version or it could
> just be me.

Sure, will try to reword. In another thread we've discussed removing
the magic number, so I'll have to rewrite this part anyway :-)

> >
> > Both magic number and struct layout are randomized to make sure
> > it doesn't leak into the userspace.
>
> Are we worried about leaking pointers into XDP program here? We already
> leak pointers into XDP through helpers so I'm not sure it matters.

I wasn't sure whether we want the xdp->skb (redirect) path to be
completely opaque or whether it's ok to expose it to
bpf/af_xdp.
Seems like everybody's on board with exposing it, so I'll most likely be
removing the randomization part.
(see my recent reply to Martin/Toke)

> >
> > skb_metadata_set is amended with skb_metadata_import_from_xdp which
> > tries to parse out the metadata and put it into skb.
> >
> > See the comment for r1 vs r2/r3/r4/r5 conventions.
>
> I think for next version an expanded commit message with use
> cases would help. I had to follow the thread to get some ideas
> why this might be useful.

Sure. I was planning to put together
Documentation/bpf/xdp_metadata.rst with more
details/assumptions/use-cases for the final non-RFC submission.
Maybe that would do a better job of building up the full picture?

> > Cc: John Fastabend <john.fastabend@gmail.com>
> > Cc: David Ahern <dsahern@gmail.com>
> > Cc: Martin KaFai Lau <martin.lau@linux.dev>
> > Cc: Jakub Kicinski <kuba@kernel.org>
> > Cc: Willem de Bruijn <willemb@google.com>
> > Cc: Jesper Dangaard Brouer <brouer@redhat.com>
> > Cc: Anatoly Burakov <anatoly.burakov@intel.com>
> > Cc: Alexander Lobakin <alexandr.lobakin@intel.com>
> > Cc: Magnus Karlsson <magnus.karlsson@gmail.com>
> > Cc: Maryam Tahhan <mtahhan@redhat.com>
> > Cc: xdp-hints@xdp-project.net
> > Cc: netdev@vger.kernel.org
> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > ---
> >  drivers/net/veth.c        |   4 +-
> >  include/linux/bpf_patch.h |   2 +
> >  include/linux/skbuff.h    |   4 ++
> >  include/net/xdp.h         |  13 +++++
> >  kernel/bpf/bpf_patch.c    |  30 +++++++++++
> >  kernel/bpf/verifier.c     |  18 +++++++
> >  net/core/skbuff.c         |  25 +++++++++
> >  net/core/xdp.c            | 104 +++++++++++++++++++++++++++++++++++---
> >  8 files changed, 193 insertions(+), 7 deletions(-)
> >
> > diff --git a/drivers/net/veth.c b/drivers/net/veth.c
> > index 0e629ceb087b..d4cd0938360b 100644
> > --- a/drivers/net/veth.c
> > +++ b/drivers/net/veth.c
> > @@ -1673,7 +1673,9 @@ static int veth_xdp(struct net_device *dev, struct netdev_bpf *xdp)
> >  static void veth_unroll_kfunc(const struct bpf_prog *prog, u32 func_id,
> >                             struct bpf_patch *patch)
> >  {
> > -     if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) {
> > +     if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_EXPORT_TO_SKB)) {
> > +             return xdp_metadata_export_to_skb(prog, patch);
> > +     } else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED)) {
> >               /* return true; */
> >               bpf_patch_append(patch, BPF_MOV64_IMM(BPF_REG_0, 1));
> >       } else if (func_id == xdp_metadata_kfunc_id(XDP_METADATA_KFUNC_RX_TIMESTAMP)) {
> > diff --git a/include/linux/bpf_patch.h b/include/linux/bpf_patch.h
> > index 81ff738eef8d..359c165ad68b 100644
> > --- a/include/linux/bpf_patch.h
> > +++ b/include/linux/bpf_patch.h
> > @@ -16,6 +16,8 @@ size_t bpf_patch_len(const struct bpf_patch *patch);
> >  int bpf_patch_err(const struct bpf_patch *patch);
> >  void __bpf_patch_append(struct bpf_patch *patch, struct bpf_insn insn);
> >  struct bpf_insn *bpf_patch_data(const struct bpf_patch *patch);
> > +void bpf_patch_resolve_jmp(struct bpf_patch *patch);
> > +u32 bpf_patch_magles_registers(const struct bpf_patch *patch);
> >
> >  #define bpf_patch_append(patch, ...) ({ \
> >       struct bpf_insn insn[] = { __VA_ARGS__ }; \
> > diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
> > index 59c9fd55699d..dba857f212d7 100644
> > --- a/include/linux/skbuff.h
> > +++ b/include/linux/skbuff.h
> > @@ -4217,9 +4217,13 @@ static inline bool skb_metadata_differs(const struct sk_buff *skb_a,
> >              true : __skb_metadata_differs(skb_a, skb_b, len_a);
> >  }
> >
> > +void skb_metadata_import_from_xdp(struct sk_buff *skb, size_t len);
> > +
> >  static inline void skb_metadata_set(struct sk_buff *skb, u8 meta_len)
> >  {
> >       skb_shinfo(skb)->meta_len = meta_len;
> > +     if (meta_len)
> > +             skb_metadata_import_from_xdp(skb, meta_len);
> >  }
> >
> >  static inline void skb_metadata_clear(struct sk_buff *skb)
> > diff --git a/include/net/xdp.h b/include/net/xdp.h
> > index 2a82a98f2f9f..8c97c6996172 100644
> > --- a/include/net/xdp.h
> > +++ b/include/net/xdp.h
> > @@ -411,6 +411,8 @@ void xdp_attachment_setup(struct xdp_attachment_info *info,
> >  #define DEV_MAP_BULK_SIZE XDP_BULK_QUEUE_SIZE
> >
> >  #define XDP_METADATA_KFUNC_xxx       \
> > +     XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_EXPORT_TO_SKB, \
> > +                        bpf_xdp_metadata_export_to_skb) \
> >       XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_TIMESTAMP_SUPPORTED, \
> >                          bpf_xdp_metadata_rx_timestamp_supported) \
> >       XDP_METADATA_KFUNC(XDP_METADATA_KFUNC_RX_TIMESTAMP, \
> > @@ -423,14 +425,25 @@ XDP_METADATA_KFUNC_xxx
> >  MAX_XDP_METADATA_KFUNC,
> >  };
> >
> > +struct xdp_to_skb_metadata {
> > +     u32 magic; /* xdp_metadata_magic */
> > +     u64 rx_timestamp;
>
> Slightly confused. I thought/think most drivers populate the skb timestamp
> if they can already? So why do we need to bounce these through some xdp
> metadata? Won't all this cost more than the load/store directly from the
> descriptor into the skb? Even if drivers are not populating skb now
> shouldn't an ethtool knob be enough to turn this on?

dsahern@ pointed out that it might be useful for the program to be
able to override some of that metadata.
Or, for example, if there is no rx vlan offload, the program can strip
it (as part of parsing) and put the vlan tag into that
xdp_to_skb_metadata.

> I don't see the value of getting this on the veth side; it's just a sw
> timestamp there.

(veth is there so we can have some selftests)

> If it's specific to cpumap, shouldn't we land this in the cpumap code paths,
> out of the general XDP code paths?

See above: the kfunc has to be called before the redirect, so if we run
this at cpumap time it's too late?

> > +} __randomize_layout;
> > +
> > +struct bpf_patch;
> > +
> >  #ifdef CONFIG_DEBUG_INFO_BTF
> > +extern u32 xdp_metadata_magic;
> >  extern struct btf_id_set8 xdp_metadata_kfunc_ids;
> >  static inline u32 xdp_metadata_kfunc_id(int id)
> >  {
> >       return xdp_metadata_kfunc_ids.pairs[id].id;
> >  }
> > +void xdp_metadata_export_to_skb(const struct bpf_prog *prog, struct bpf_patch *patch);
> >  #else
> > +#define xdp_metadata_magic 0
> >  static inline u32 xdp_metadata_kfunc_id(int id) { return 0; }
> > +static inline void xdp_metadata_export_to_skb(const struct bpf_prog *prog, struct bpf_patch *patch) { }
> >  #endif
> >
> >  #endif /* __LINUX_NET_XDP_H__ */
> > diff --git a/kernel/bpf/bpf_patch.c b/kernel/bpf/bpf_patch.c
> > index 82a10bf5624a..8f1fef74299c 100644
> > --- a/kernel/bpf/bpf_patch.c
> > +++ b/kernel/bpf/bpf_patch.c
> > @@ -49,3 +49,33 @@ struct bpf_insn *bpf_patch_data(const struct bpf_patch *patch)
> >  {
> >       return patch->insn;
> >  }
>
> [...]
>
> >
> > diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> > index 42a35b59fb1e..37e3aef46525 100644
> > --- a/net/core/skbuff.c
> > +++ b/net/core/skbuff.c
> > @@ -72,6 +72,7 @@
> >  #include <net/mptcp.h>
> >  #include <net/mctp.h>
> >  #include <net/page_pool.h>
> > +#include <net/xdp.h>
> >
> >  #include <linux/uaccess.h>
> >  #include <trace/events/skb.h>
> > @@ -6672,3 +6673,27 @@ nodefer:       __kfree_skb(skb);
> >       if (unlikely(kick) && !cmpxchg(&sd->defer_ipi_scheduled, 0, 1))
> >               smp_call_function_single_async(cpu, &sd->defer_csd);
> >  }
> > +
> > +void skb_metadata_import_from_xdp(struct sk_buff *skb, size_t len)
> > +{
> > +     struct xdp_to_skb_metadata *meta = (void *)(skb_mac_header(skb) - len);
> > +
> > +     /* Optional SKB info, currently missing:
> > +      * - HW checksum info           (skb->ip_summed)
> > +      * - HW RX hash                 (skb_set_hash)
> > +      * - RX ring dev queue index    (skb_record_rx_queue)
> > +      */
> > +
> > +     if (len != sizeof(struct xdp_to_skb_metadata))
> > +             return;
> > +
> > +     if (meta->magic != xdp_metadata_magic)
> > +             return;
> > +
> > +     if (meta->rx_timestamp) {
> > +             *skb_hwtstamps(skb) = (struct skb_shared_hwtstamps){
> > +                     .hwtstamp = ns_to_ktime(meta->rx_timestamp),
> > +             };
> > +     }
> > +}
> > +EXPORT_SYMBOL(skb_metadata_import_from_xdp);
> > diff --git a/net/core/xdp.c b/net/core/xdp.c
> > index 22f1e44700eb..8204fa05c5e9 100644
> > --- a/net/core/xdp.c
> > +++ b/net/core/xdp.c
> > @@ -653,12 +653,6 @@ struct sk_buff *__xdp_build_skb_from_frame(struct xdp_frame *xdpf,
> >       /* Essential SKB info: protocol and skb->dev */
> >       skb->protocol = eth_type_trans(skb, dev);
> >
> > -     /* Optional SKB info, currently missing:
> > -      * - HW checksum info           (skb->ip_summed)
> > -      * - HW RX hash                 (skb_set_hash)
> > -      * - RX ring dev queue index    (skb_record_rx_queue)
> > -      */
> > -
> >       /* Until page_pool get SKB return path, release DMA here */
> >       xdp_release_frame(xdpf);
> >
> > @@ -712,6 +706,13 @@ struct xdp_frame *xdpf_clone(struct xdp_frame *xdpf)
> >       return nxdpf;
> >  }
> >
> > +/* For the packets directed to the kernel, this kfunc exports XDP metadata
> > + * into skb context.
> > + */
> > +noinline void bpf_xdp_metadata_export_to_skb(const struct xdp_md *ctx)
> > +{
> > +}
> > +
> >  /* Indicates whether particular device supports rx_timestamp metadata.
> >   * This is an optional helper to support marking some branches as
> >   * "dead code" in the BPF programs.
> > @@ -737,13 +738,104 @@ XDP_METADATA_KFUNC_xxx
> >  #undef XDP_METADATA_KFUNC
> >  BTF_SET8_END(xdp_metadata_kfunc_ids)
> >
> > +/* Make sure userspace doesn't depend on our layout by using
> > + * different pseudo-generated magic value.
> > + */
> > +u32 xdp_metadata_magic;
> > +
> >  static const struct btf_kfunc_id_set xdp_metadata_kfunc_set = {
> >       .owner = THIS_MODULE,
> >       .set   = &xdp_metadata_kfunc_ids,
> >  };
> >
> > +/* Since we're not actually doing a call but instead rewriting
> > + * in place, we can only afford to use R0-R5 scratch registers.
>
> Why not just do a call? It's neat to inline this, but you're going
> to build an skb next. That's not cheap, and the cost of a call
> should be complete noise when hitting the entire stack?
>
> Any benchmark to convince us this is a worthwhile optimization?

I do have a suspicion that a non-zero number of drivers will actually
resort to calling the kernel instead of writing bpf bytecode (especially
since there might be some locks involved).

However, if we prefer to go back to calls, there still has to be some
translation table.
I'm assuming we want to at least resolve indirect
netdev->kfunc_rx_metadata() calls at prog load time.
So we still need a per-netdev map from kfunc_id to the real implementation func.
And I don't think that it would be simpler than the switch
statement that I have?
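
(Illustration only, not code from the series: a rough userspace sketch of
what such a per-netdev map from kfunc id to implementation could look like.
Every fake_* name below is hypothetical; only the idea of resolving the id
once at prog load time comes from the discussion above.)

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical kfunc ids, modelled on the XDP_METADATA_KFUNC_* enum. */
enum fake_kfunc_id {
	FAKE_KFUNC_EXPORT_TO_SKB,
	FAKE_KFUNC_RX_TIMESTAMP_SUPPORTED,
	FAKE_KFUNC_RX_TIMESTAMP,
	FAKE_KFUNC_MAX,
};

typedef uint64_t (*fake_kfunc_impl)(void *ctx);

/* Hypothetical per-netdev table; a NULL slot means "not implemented". */
struct fake_netdev {
	fake_kfunc_impl kfuncs[FAKE_KFUNC_MAX];
};

static uint64_t fake_veth_rx_timestamp_supported(void *ctx)
{
	(void)ctx;
	return 1;
}

/* In this sketch ctx simply points at a stored u64 timestamp. */
static uint64_t fake_veth_rx_timestamp(void *ctx)
{
	return *(uint64_t *)ctx;
}

static const struct fake_netdev fake_veth = {
	.kfuncs = {
		[FAKE_KFUNC_RX_TIMESTAMP_SUPPORTED] = fake_veth_rx_timestamp_supported,
		[FAKE_KFUNC_RX_TIMESTAMP] = fake_veth_rx_timestamp,
	},
};

/* Done once at prog load time: turn a kfunc id into a direct call target,
 * so no indirect netdev call remains on the per-packet path.
 */
static fake_kfunc_impl fake_resolve_kfunc(const struct fake_netdev *dev, int id)
{
	if (id < 0 || id >= FAKE_KFUNC_MAX)
		return NULL;
	return dev->kfuncs[id];
}
```

The table is no shorter than the if/else chain in veth_unroll_kfunc(); the
only difference is that the lookup yields a function to call rather than
instructions to emit.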



> > + *
> > + * We reserve R1 for bpf_xdp_metadata_export_to_skb and let individual
> > + * metadata kfuncs use only R0,R4-R5.
> > + *
> > + * The above also means we _cannot_ easily call any other helper/kfunc
> > + * because there is no place for us to preserve our R1 argument;
> > + * existing R6-R9 belong to the callee.
> > + */
> > +void xdp_metadata_export_to_skb(const struct bpf_prog *prog, struct bpf_patch *patch)
> > +{
>
> [...]
>
> >  }
> >  late_initcall(xdp_metadata_init);
>
> Thanks,
> John
>
>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [xdp-hints] Re: [RFC bpf-next v2 04/14] veth: Support rx timestamp metadata for xdp
  2022-11-10  1:35       ` John Fastabend
@ 2022-11-10  6:44         ` Stanislav Fomichev
  2022-11-10 17:39           ` John Fastabend
  0 siblings, 1 reply; 66+ messages in thread
From: Stanislav Fomichev @ 2022-11-10  6:44 UTC (permalink / raw)
  To: John Fastabend
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, kpsingh, haoluo,
	jolsa, David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev

On Wed, Nov 9, 2022 at 5:35 PM John Fastabend <john.fastabend@gmail.com> wrote:
>
> Stanislav Fomichev wrote:
> > On Wed, Nov 9, 2022 at 4:26 PM John Fastabend <john.fastabend@gmail.com> wrote:
> > >
> > > Stanislav Fomichev wrote:
> > > > xskxceiver conveniently sets up veth pairs so it seems logical
> > > > to use veth as an example for some of the metadata handling.
> > > >
> > > > We timestamp skb right when we "receive" it, store its
> > > > pointer in new veth_xdp_buff wrapper and generate BPF bytecode to
> > > > reach it from the BPF program.
> > > >
> > > > This largely follows the idea of "store some queue context in
> > > > the xdp_buff/xdp_frame so the metadata can be reached out
> > > > from the BPF program".
> > > >
> > >
> > > [...]
> > >
> > > >       orig_data = xdp->data;
> > > >       orig_data_end = xdp->data_end;
> > > > +     vxbuf.skb = skb;
> > > >
> > > >       act = bpf_prog_run_xdp(xdp_prog, xdp);
> > > >
> > > > @@ -942,6 +946,7 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
> > > >                       struct sk_buff *skb = ptr;
> > > >
> > > >                       stats->xdp_bytes += skb->len;
> > > > +                     __net_timestamp(skb);
> > >
> > > Just getting to reviewing in depth a bit more. But we hit veth with lots of
> > > packets in some configurations, so I don't think we want to add a __net_timestamp
> > > here when the vast majority of use cases will have no need for a timestamp on a veth
> > > device. I didn't do a benchmark but it's not free.
> > >
> > > If there is a real use case for timestamping on veth we could do it through
> > > a XDP program directly? Basically fallback for devices without hw timestamps.
> > > Anyways I need the helper to support hardware without time stamping.
> > >
> > > Not sure if this was just part of the RFC to explore BPF programs or not.
> >
> > Initially I've done it mostly so I can have selftests on top of veth
> > driver, but I'd still prefer to keep it to have working tests.
>
> I can't think of a use for it though so its just extra cycles. There
> is a helper to read the ktime.

As I mentioned in another reply, I wanted something SW-only to test
this whole metadata story.
The idea was:
- veth rx sets skb->tstamp (otherwise it's 0 at this point)
- veth kfunc to access rx_timestamp returns skb->tstamp
- xsk bpf program verifies that the metadata is non-zero
- the above shows end-to-end functionality with a software driver
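
(Illustration only: the four steps above, modelled in plain userspace C.
The fake_* types and functions are hypothetical stand-ins; only
veth_xdp_buff, skb->tstamp and __net_timestamp() are names from the patch.)

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for struct sk_buff; only the timestamp matters here. */
struct fake_skb {
	uint64_t tstamp;	/* 0 until the driver stamps it */
};

/* Stand-in for the veth_xdp_buff wrapper that carries the skb pointer. */
struct fake_veth_xdp_buff {
	struct fake_skb *skb;
};

/* Step 1: veth rx stamps the skb (stand-in for __net_timestamp()) and
 * stores its pointer in the wrapper before the program runs.
 */
static void fake_veth_rx(struct fake_veth_xdp_buff *vxbuf,
			 struct fake_skb *skb, uint64_t now_ns)
{
	skb->tstamp = now_ns;
	vxbuf->skb = skb;
}

/* Step 2: the rx_timestamp kfunc just returns skb->tstamp. */
static uint64_t fake_rx_timestamp(const struct fake_veth_xdp_buff *vxbuf)
{
	return vxbuf->skb ? vxbuf->skb->tstamp : 0;
}

/* Step 3: the test program only checks that the value is non-zero. */
static bool fake_prog_sees_metadata(const struct fake_veth_xdp_buff *vxbuf)
{
	return fake_rx_timestamp(vxbuf) != 0;
}
```

Without step 1 the check in step 3 fails, which is the reason the
__net_timestamp() call is there in the first place.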

> > Any way I can make it configurable? Is there some ethtool "enable tx
> > timestamping" option I can reuse?
>
> There is a -T option for timestamping in ethtool. There are also the
> socket controls for it. So you could spin up a socket and use it.
> But that is a bit broken as well I think it would be better if the
> timestamp came from the receiving physical nic?
>
> I have some mlx nics here and a k8s cluster with lots of veth
> devices so I could think a bit more. I'm just not sure why I would
> want the veth to timestamp things off hand?

-T is for dumping only it seems?

I'm probably using skb->tstamp in an unconventional manner here :-/
Do you know if enabling timestamping on the socket, as you suggest,
will get me some non-zero skb_hwtstamps with xsk?
I need something to show how the kfunc can return this data and how
this data can land in the xdp prog / af_xdp chunk.


> >
> > > >                       skb = veth_xdp_rcv_skb(rq, skb, bq, stats);
> > > >                       if (skb) {
> > > >                               if (skb_shared(skb) || skb_unclone(skb, GFP_ATOMIC))
>
>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [xdp-hints] Re: [RFC bpf-next v2 06/14] xdp: Carry over xdp metadata into skb context
  2022-11-09 18:22             ` Martin KaFai Lau
  2022-11-09 21:33               ` Stanislav Fomichev
@ 2022-11-10 14:19               ` Toke Høiland-Jørgensen
  2022-11-10 19:04                 ` Martin KaFai Lau
  1 sibling, 1 reply; 66+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-11-10 14:19 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Stanislav Fomichev, ast, daniel, andrii, song, yhs,
	john.fastabend, kpsingh, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev, bpf

Martin KaFai Lau <martin.lau@linux.dev> writes:

> On 11/9/22 3:10 AM, Toke Høiland-Jørgensen wrote:
>> Snipping a bit of context to reply to this bit:
>> 
>>>>>> Can the xdp prog still change the metadata through xdp->data_meta? tbh, I am not
>>>>>> sure it is solid enough by asking the xdp prog not to use the same random number
>>>>>> in its own metadata + not to change the metadata through xdp->data_meta after
>>>>>> calling bpf_xdp_metadata_export_to_skb().
>>>>>
>>>>> What do you think the usecase here might be? Or are you suggesting we
>>>>> reject further access to data_meta after
>>>>> bpf_xdp_metadata_export_to_skb somehow?
>>>>>
>>>>> If we want to let the programs override some of this
>>>>> bpf_xdp_metadata_export_to_skb() metadata, it feels like we can add
>>>>> more kfuncs instead of exposing the layout?
>>>>>
>>>>> bpf_xdp_metadata_export_to_skb(ctx);
>>>>> bpf_xdp_metadata_export_skb_hash(ctx, 1234);
>> 
>> There are several use cases for needing to access the metadata after
>> calling bpf_xdp_metadata_export_to_skb():
>> 
>> - Accessing the metadata after redirect (in a cpumap or devmap program,
>>    or on a veth device)
>> - Transferring the packet+metadata to AF_XDP
> fwiw, the xdp prog could also be more selective and only store one of the hints
> instead of the whole 'struct xdp_to_skb_metadata'.

Yup, absolutely! In that sense, reusing the SKB format is mostly a
convenience feature. However, lots of people consume AF_XDP through the
default program installed by libxdp in the XSK setup code, and including
custom metadata there is awkward. So having the metadata consumed by the
stack as the "default subset" would enable easy consumption by
non-advanced users, while advanced users can still do custom stuff by
writing their own XDP program that calls the kfuncs.

>> - Returning XDP_PASS, but accessing some of the metadata first (whether
>>    to read or change it)
>> 
>> The last one could be solved by calling additional kfuncs, but that
>> would be less efficient than just directly editing the struct which
>> will be cache-hot after the helper returns.
>
> Yeah, it is more efficient to directly write if possible.  I think this set 
> allows the direct reading and writing already through data_meta (as a __u8 *).

Yup, totally fine with just keeping that capability :)

>> And yeah, this will allow the XDP program to inject arbitrary metadata
>> into the netstack; but it can already inject arbitrary *packet* data
>> into the stack, so not sure if this is much of an additional risk? If it
>> does lead to trivial crashes, we should probably harden the stack
>> against that?
>> 
>> As for the random number, Jesper and I discussed replacing this with the
>> same BTF-ID scheme that he was using in his patch series. I.e., instead
>> of just putting in a random number, we insert the BTF ID of the metadata
>> struct at the end of it. This will allow us to support multiple
>> different formats in the future (not just changing the layout, but
>> having multiple simultaneous formats in the same kernel image), in case
>> we run out of space.
>
> This seems a bit hypothetical.  How much headroom does it usually have for the 
> xdp prog?  Potentially the hints can use all the remaining space left after the 
> header encap and the current bpf_xdp_adjust_meta() usage?

For the metadata consumed by the stack right now it's a bit
hypothetical, yeah. However, there's a bunch of metadata commonly
supported by hardware that the stack currently doesn't consume and that
hopefully this feature will end up making more accessible. My hope is
that the stack can also learn how to use this in the future, in which
case we may run out of space. So I think of that bit mostly as
future-proofing...
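
(Illustration only: a rough userspace sketch of the "ID at the end of the
struct" idea. The two layouts, the constant IDs and all fake_* names are
made up for the example; in the actual proposal the ID would be the BTF ID
of the metadata struct, assigned by the kernel.)

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical format IDs standing in for BTF IDs. */
#define FAKE_FMT_V1 101u	/* layout: [rx_timestamp:8][id:4]            */
#define FAKE_FMT_V2 202u	/* layout: [rx_hash:4][rx_timestamp:8][id:4] */

/* The ID always occupies the last four bytes of the metadata area, so a
 * consumer can find it without knowing which layout precedes it.
 */
static uint32_t fake_meta_format(const uint8_t *meta, size_t len)
{
	uint32_t id;

	if (len < sizeof(id))
		return 0;
	memcpy(&id, meta + len - sizeof(id), sizeof(id));
	return id;
}

/* Pick the field offset based on the trailing ID; returns 0 on success
 * and -1 when the format is unknown or the length does not match it.
 */
static int fake_meta_rx_timestamp(const uint8_t *meta, size_t len, uint64_t *ts)
{
	switch (fake_meta_format(meta, len)) {
	case FAKE_FMT_V1:
		if (len != 12)
			return -1;
		memcpy(ts, meta, sizeof(*ts));
		return 0;
	case FAKE_FMT_V2:
		if (len != 16)
			return -1;
		memcpy(ts, meta + 4, sizeof(*ts));
		return 0;
	default:
		return -1;
	}
}
```

Both formats can coexist in one kernel image because the consumer
dispatches on the trailing ID instead of assuming a single layout.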

>> We should probably also have a flag set on the xdp_frame so the stack
>> knows that the metadata area contains relevant-to-skb data, to guard
>> against an XDP program accidentally hitting the "magic number" (BTF_ID)
>> in unrelated stuff it puts into the metadata area.
>
> Yeah, I think having a flag is useful.  The flag will be set at xdp_buff and 
> then transfer to the xdp_frame?

Yeah, exactly!

>>> After re-reading patch 6, have another question. The 'void
>>> bpf_xdp_metadata_export_to_skb();' function signature. Should it at
>>> least return ok/err? or even return a 'struct xdp_to_skb_metadata *'
>>> pointer and the xdp prog can directly read (or even write) it?
>> 
>> Hmm, I'm not sure returning a failure makes sense? Failure to read one
>> or more fields just means that those fields will not be populated? We
>> should probably have a flags field inside the metadata struct itself to
>> indicate which fields are set or not, but I'm not sure returning an
>> error value adds anything? Returning a pointer to the metadata field
>> might be convenient for users (it would just be an alias to the
>> data_meta pointer, but the verifier could know its size, so the program
>> doesn't have to bounds check it).
>
> If some hints are not available, those hints should be initialized to
> 0/CHECKSUM_NONE/...etc.

The problem with that is that then we have to spend cycles writing
eight bytes of zeroes into the checksum field :)

> The xdp prog needs a direct way to tell hard failure when it cannot
> write the meta area because of not enough space. Comparing
> xdp->data_meta with xdp->data as a side effect is not intuitive.

Yeah, hence a flags field so we can just see if setting each field
succeeded?
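
(Illustration only: a rough sketch of the flags idea. Only rx_timestamp
exists in the RFC struct; the flags word, the rx_hash field, the bit names
and the fake_* helpers are assumptions made up for the example.)

```c
#include <stdbool.h>
#include <stdint.h>

#define FAKE_META_F_RX_TIMESTAMP (1u << 0)
#define FAKE_META_F_RX_HASH      (1u << 1)

struct fake_xdp_to_skb_metadata {
	uint32_t flags;		/* which of the fields below are valid */
	uint32_t rx_hash;
	uint64_t rx_timestamp;
};

/* Only the flags word is cleared; unset fields are left untouched, so
 * no cycles are spent zeroing hints the hardware did not provide.
 */
static void fake_meta_init(struct fake_xdp_to_skb_metadata *m)
{
	m->flags = 0;
}

static void fake_meta_set_timestamp(struct fake_xdp_to_skb_metadata *m,
				    uint64_t ts)
{
	m->rx_timestamp = ts;
	m->flags |= FAKE_META_F_RX_TIMESTAMP;
}

/* Readers consult the flag first and never interpret a stale field. */
static bool fake_meta_get_timestamp(const struct fake_xdp_to_skb_metadata *m,
				    uint64_t *ts)
{
	if (!(m->flags & FAKE_META_F_RX_TIMESTAMP))
		return false;
	*ts = m->rx_timestamp;
	return true;
}

static bool fake_meta_get_hash(const struct fake_xdp_to_skb_metadata *m,
			       uint32_t *hash)
{
	if (!(m->flags & FAKE_META_F_RX_HASH))
		return false;
	*hash = m->rx_hash;
	return true;
}
```

The same flags word doubles as the per-field success indication: a bit
that is still clear after the export means that hint was not written.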

> It is more than saving the bound check.  With type info of 'struct 
> xdp_to_skb_metadata *', the verifier can do more checks like reading in the 
> middle of an integer member.  The verifier could also limit write access only to 
> a few struct's members if it is needed.
>
> The returning 'struct xdp_to_skb_metadata *' should not be an alias to the 
> xdp->data_meta.  They should actually point to different locations in the 
> headroom.  bpf_xdp_metadata_export_to_skb() sets a flag in xdp_buff. 
> xdp->data_meta won't be changed and keeps pointing to the last 
> bpf_xdp_adjust_meta() location.  The kernel will know if there is 
> xdp_to_skb_metadata before the xdp->data_meta when that bit is set in the 
> xdp_{buff,frame}.  Would it work?

Hmm, logically splitting the program metadata and the xdp_hints metadata
(but having them share the same area) *could* work, I guess, I'm just
not sure it's worth the extra complexity?

>>> A related question, why 'struct xdp_to_skb_metadata' needs
>>> __randomize_layout?
>> 
>> The __randomize_layout thing is there to force BPF programs to use CO-RE
>> to access the field. This is to avoid the struct layout accidentally
>> ossifying because people in practice rely on a particular layout, even
>> though we tell them to use CO-RE. There are lots of examples of this
>> happening in other domains (IP header options, TCP options, etc), and
>> __randomize_layout seemed like a neat trick to enforce CO-RE usage :)
>
> I am not sure if it is necessary or helpful to only enforce __randomize_layout 
> in 'struct xdp_to_skb_metadata'.  There are other CO-RE use cases (tracing and 
> non tracing) that already have direct access (reading and/or writing) to other 
> kernel structures.
>
> It is more important for the verifier to see the xdp prog accessing it as a 
> 'struct xdp_to_skb_metadata *' instead of xdp->data_meta which is a __u8 * so 
> that the verifier can enforce the rules of access.

That only works inside the kernel, though. Since the metadata field can
be copied wholesale to AF_XDP, having it randomized forces userspace
consumers to also write code to deal with the layout being dynamic...

-Toke


^ permalink raw reply	[flat|nested] 66+ messages in thread

* [xdp-hints] Re: [RFC bpf-next v2 06/14] xdp: Carry over xdp metadata into skb context
  2022-11-10  1:02                   ` Stanislav Fomichev
@ 2022-11-10 14:26                     ` Toke Høiland-Jørgensen
  2022-11-10 18:52                       ` Stanislav Fomichev
  0 siblings, 1 reply; 66+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-11-10 14:26 UTC (permalink / raw)
  To: Stanislav Fomichev, Martin KaFai Lau
  Cc: ast, daniel, andrii, song, yhs, john.fastabend, kpsingh, haoluo,
	jolsa, David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev, bpf

Stanislav Fomichev <sdf@google.com> writes:

> On Wed, Nov 9, 2022 at 4:13 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>
>> On 11/9/22 1:33 PM, Stanislav Fomichev wrote:
>> > On Wed, Nov 9, 2022 at 10:22 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>> >>
>> >> On 11/9/22 3:10 AM, Toke Høiland-Jørgensen wrote:
>> >>> Snipping a bit of context to reply to this bit:
>> >>>
>> >>>>>>> Can the xdp prog still change the metadata through xdp->data_meta? tbh, I am not
>> >>>>>>> sure it is solid enough by asking the xdp prog not to use the same random number
>> >>>>>>> in its own metadata + not to change the metadata through xdp->data_meta after
>> >>>>>>> calling bpf_xdp_metadata_export_to_skb().
>> >>>>>>
>> >>>>>> What do you think the usecase here might be? Or are you suggesting we
>> >>>>>> reject further access to data_meta after
>> >>>>>> bpf_xdp_metadata_export_to_skb somehow?
>> >>>>>>
>> >>>>>> If we want to let the programs override some of this
>> >>>>>> bpf_xdp_metadata_export_to_skb() metadata, it feels like we can add
>> >>>>>> more kfuncs instead of exposing the layout?
>> >>>>>>
>> >>>>>> bpf_xdp_metadata_export_to_skb(ctx);
>> >>>>>> bpf_xdp_metadata_export_skb_hash(ctx, 1234);
>> >>>
>> >>> There are several use cases for needing to access the metadata after
>> >>> calling bpf_xdp_metadata_export_to_skb():
>> >>>
>> >>> - Accessing the metadata after redirect (in a cpumap or devmap program,
>> >>>     or on a veth device)
>> >>> - Transferring the packet+metadata to AF_XDP
>> >> fwiw, the xdp prog could also be more selective and only store one of the hints
>> >> instead of the whole 'struct xdp_to_skb_metadata'.
>> >>
>> >>> - Returning XDP_PASS, but accessing some of the metadata first (whether
>> >>>     to read or change it)
>> >>>
>> >>> The last one could be solved by calling additional kfuncs, but that
>> >>> would be less efficient than just directly editing the struct which
>> >>> will be cache-hot after the helper returns.
>> >>
>> >> Yeah, it is more efficient to directly write if possible.  I think this set
>> >> allows the direct reading and writing already through data_meta (as a __u8 *).
>> >>
>> >>>
>> >>> And yeah, this will allow the XDP program to inject arbitrary metadata
>> >>> into the netstack; but it can already inject arbitrary *packet* data
>> >>> into the stack, so not sure if this is much of an additional risk? If it
>> >>> does lead to trivial crashes, we should probably harden the stack
>> >>> against that?
>> >>>
>> >>> As for the random number, Jesper and I discussed replacing this with the
>> >>> same BTF-ID scheme that he was using in his patch series. I.e., instead
>> >>> of just putting in a random number, we insert the BTF ID of the metadata
>> >>> struct at the end of it. This will allow us to support multiple
>> >>> different formats in the future (not just changing the layout, but
>> >>> having multiple simultaneous formats in the same kernel image), in case
>> >>> we run out of space.
>> >>
>> >> This seems a bit hypothetical.  How much headroom does it usually have for the
>> >> xdp prog?  Potentially the hints can use all the remaining space left after the
>> >> header encap and the current bpf_xdp_adjust_meta() usage?
>> >>
>> >>>
>> >>> We should probably also have a flag set on the xdp_frame so the stack
>> >>> knows that the metadata area contains relevant-to-skb data, to guard
>> >>> against an XDP program accidentally hitting the "magic number" (BTF_ID)
>> >>> in unrelated stuff it puts into the metadata area.
>> >>
>> >> Yeah, I think having a flag is useful.  The flag will be set at xdp_buff and
>> >> then transfer to the xdp_frame?
>> >>
>> >>>
>> >>>> After re-reading patch 6, have another question. The 'void
>> >>>> bpf_xdp_metadata_export_to_skb();' function signature. Should it at
>> >>>> least return ok/err? or even return a 'struct xdp_to_skb_metadata *'
>> >>>> pointer and the xdp prog can directly read (or even write) it?
>> >>>
>> >>> Hmm, I'm not sure returning a failure makes sense? Failure to read one
>> >>> or more fields just means that those fields will not be populated? We
>> >>> should probably have a flags field inside the metadata struct itself to
>> >>> indicate which fields are set or not, but I'm not sure returning an
>> >>> error value adds anything? Returning a pointer to the metadata field
>> >>> might be convenient for users (it would just be an alias to the
>> >>> data_meta pointer, but the verifier could know its size, so the program
>> >>> doesn't have to bounds check it).
>> >>
>> >> If some hints are not available, those hints should be initialized to
>> >> 0/CHECKSUM_NONE/...etc.  The xdp prog needs a direct way to tell hard failure
>> >> when it cannot write the meta area because of not enough space.  Comparing
>> >> xdp->data_meta with xdp->data as a side effect is not intuitive.
>> >>
>> >> It is more than saving the bound check.  With type info of 'struct
>> >> xdp_to_skb_metadata *', the verifier can do more checks like reading in the
>> >> middle of an integer member.  The verifier could also limit write access only to
>> >> a few struct's members if it is needed.
>> >>
>> >> The returning 'struct xdp_to_skb_metadata *' should not be an alias to the
>> >> xdp->data_meta.  They should actually point to different locations in the
>> >> headroom.  bpf_xdp_metadata_export_to_skb() sets a flag in xdp_buff.
>> >> xdp->data_meta won't be changed and keeps pointing to the last
>> >> bpf_xdp_adjust_meta() location.  The kernel will know if there is
>> >> xdp_to_skb_metadata before the xdp->data_meta when that bit is set in the
>> >> xdp_{buff,frame}.  Would it work?
>> >>
>> >>>
>> >>>> A related question, why 'struct xdp_to_skb_metadata' needs
>> >>>> __randomize_layout?
>> >>>
>> >>> The __randomize_layout thing is there to force BPF programs to use CO-RE
>> >>> to access the field. This is to avoid the struct layout accidentally
>> >>> ossifying because people in practice rely on a particular layout, even
>> >>> though we tell them to use CO-RE. There are lots of examples of this
>> >>> happening in other domains (IP header options, TCP options, etc), and
>> >>> __randomize_layout seemed like a neat trick to enforce CO-RE usage :)
>> >>
>> >> I am not sure if it is necessary or helpful to only enforce __randomize_layout
>> >> in 'struct xdp_to_skb_metadata'.  There are other CO-RE use cases (tracing and
>> >> non tracing) that already have direct access (reading and/or writing) to other
>> >> kernel structures.
>> >>
>> >> It is more important for the verifier to see the xdp prog accessing it as a
>> >> 'struct xdp_to_skb_metadata *' instead of xdp->data_meta which is a __u8 * so
>> >> that the verifier can enforce the rules of access.
>> >>
>> >>>
>> >>>>>>> Does xdp_to_skb_metadata have a use case for XDP_PASS (like patch 7) or the
>> >>>>>>> xdp_to_skb_metadata can be limited to XDP_REDIRECT only?
>> >>>>>>
>> >>>>>> XDP_PASS cases where we convert xdp_buff into skb in the drivers right
>> >>>>>> now usually have C code to manually pull out the metadata (out of hw
>> >>>>>> desc) and put it into skb.
>> >>>>>>
>> >>>>>> So, currently, if we're calling bpf_xdp_metadata_export_to_skb() for
>> >>>>>> XDP_PASS, we're doing a double amount of work:
>> >>>>>> skb_metadata_import_from_xdp first, then custom driver code second.
>> >>>>>>
>> >>>>>> In theory, maybe we should completely skip drivers custom parsing when
>> >>>>>> there is a prog with BPF_F_XDP_HAS_METADATA?
>> >>>>>> Then both xdp->skb paths (XDP_PASS+XDP_REDIRECT) will be bpf-driven
>> >>>>>> and won't require any mental work (plus, the drivers won't have to
>> >>>>>> care either in the future).
>> >>>>>> WDYT?
>> >>>>>
>> >>>>>
>> >>>>> Yeah, not sure if it can solely depend on BPF_F_XDP_HAS_METADATA but it makes
>> >>>>> sense to only use the hints (if ever written) from xdp prog especially if it
>> >>>>> will eventually support xdp prog changing some of the hints in the future.  For
>> >>>>> now, I think either way is fine since they are the same and the xdp prog is sort
>> >>>>> of doing extra unnecessary work anyway by calling
>> >>>>> bpf_xdp_metadata_export_to_skb() with XDP_PASS and knowing nothing can be
>> >>>>> changed now.
>> >>>
>> >>> I agree it would be best if the drivers also use the XDP metadata (if
>> >>> present) on XDP_PASS. Longer term my hope is we can make the XDP
>> >>> metadata support the only thing drivers need to implement (i.e., have
>> >>> the stack call into that code even when no XDP program is loaded), but
>> >>> for now just for consistency (and allowing the XDP program to update the
>> >>> metadata), we should probably at least consume it on XDP_PASS.
>> >>>
>> >>> -Toke
>> >>>
>> >
>> > Not to derail the discussion (left the last message intact on top,
>> > feel free to continue), but to summarize. The proposed changes seem to
>> > be:
>> >
>> > 1. bpf_xdp_metadata_export_to_skb() should return pointer to "struct
>> > xdp_to_skb_metadata"
>> >    - This should let bpf programs change the metadata passed to the skb
>> >
>> > 2. "struct xdp_to_skb_metadata" should have its btf_id as the first
>> > __u32 member (and remove the magic)
>> >    - This is for the redirect case where the end users, including
>> > AF_XDP, can parse this metadata from btf_id
>>
>> I think Toke's idea is to put the btf_id at the end of xdp_to_skb_metadata.  I
>> can see why the end is needed for the userspace AF_XDP because, afaict, AF_XDP
>> rx_desc currently cannot tell if there is metadata written by the xdp prog or
>> not.  However, if the 'has_skb_metadata' bit can also be passed to the AF_XDP
>> rx_desc->options, the btf_id may as well be not needed now.  However, the btf_id
>> and other future new members can be added to the xdp_to_skb_metadata later if
>> there is a need.
>>
>> For the kernel and xdp prog, a bit in the xdp->flags should be enough to get to
>> the xdp_to_skb_metadata.  The xdp prog will use CO-RE to access the members in
>> xdp_to_skb_metadata.
>
> Ack, good points on putting it at the end.
> Regarding bit in desc->options vs btf_id: since it seems that btf_id
> is useful anyway, let's start with that? We can add a bit later on if
> it turns out using metadata is problematic otherwise.

I think the bit is mostly useful so that the stack can know that the
metadata has been set before consuming it (to guard against regular
xdp_metadata usage accidentally hitting the "right" BTF ID). I don't
think it needs to be exposed to the XDP programs themselves.

>> >    - This, however, is not all the metadata that the device can
>> > support, but a much narrower set that the kernel is expected to use
>> > for skb construction
>> >
>> > 3. __randomize_layout isn't really helping, CO-RE will trigger
>> > regardless; maybe only the case where it matters is probably AF_XDP,
>> > so still useful?

Yeah, see my response to Martin, I think the randomisation is useful for
AF_XDP transfer.

>> > 4. The presence of the metadata generated by
>> > bpf_xdp_metadata_export_to_skb should be indicated by a flag in
>> > xdp_{buff,frame}->flags
>> >    - Assuming exposing it via xdp_md->has_skb_metadata is ok?
>>
>> probably __bpf_md_ptr(struct xdp_to_skb_metadata *, skb_metadata) and the type
>> will be PTR_TO_BTF_ID_OR_NULL.
>
> Oh, that seems even better than returning it from
> bpf_xdp_metadata_export_to_skb.
> bpf_xdp_metadata_export_to_skb can return true/false and the rest goes
> via default verifier ctx resolution mechanism..
> (returning ptr from a kfunc seems to be a bit complicated right now)

See my response to John in the other thread about mixing stable UAPI (in
xdp_md) and unstable BTF structures in the xdp_md struct: I think this
is confusing and would prefer a kfunc.

>> >    - Since the programs probably need to do the following:
>> >
>> >    if (xdp_md->has_skb_metadata) {
>> >      access/change skb metadata by doing struct xdp_to_skb_metadata *p
>> > = data_meta;
>>
>> and directly access/change xdp->skb_metadata instead of using xdp->data_meta.
>
> Ack.
>
>> >    } else {
>> >      use kfuncs
>> >    }
>> >
>> > 5. Support the case where we keep program's metadata and kernel's
>> > xdp_to_skb_metadata
>> >    - skb_metadata_import_from_xdp() will "consume" it by mem-moving the
>> > rest of the metadata over it and adjusting the headroom
>>
>> I was thinking the kernel's xdp_to_skb_metadata is always before the program's
>> metadata.  xdp prog should usually work in this order also: read/write headers,
>> write its own metadata, call bpf_xdp_metadata_export_to_skb(), and return
>> XDP_PASS/XDP_REDIRECT.  When it is XDP_PASS, the kernel just needs to pop the
>> xdp_to_skb_metadata and pass the remaining program's metadata to the bpf-tc.
>>
>> For the kernel and xdp prog, I don't think it matters where the
>> xdp_to_skb_metadata is.  However, the xdp->data_meta (program's metadata) has to
>> be before xdp->data because of the current data_meta and data comparison usage
>> in the xdp prog.
>>
>> The order of the kernel's xdp_to_skb_metadata and the program's metadata
>> probably only matters to the userspace AF_XDP.  However, I don't see how AF_XDP
>> supports the program's metadata now.  afaict, it can only work now if there is
>> some sort of contract between them or the AF_XDP currently does not use the
>> program's metadata.  Either way, we can do the mem-moving only for AF_XDP and it
>> should be a no op if there is no program's metadata?  This behavior could also
>> be configurable through setsockopt?
>
> Agreed on all of the above. For now it seems like the safest thing to
> do is to put xdp_to_skb_metadata last to allow af_xdp to properly
> locate btf_id.
> Let's see if Toke disagrees :-)

As I replied to Martin, I'm not sure it's worth the complexity to
logically split the SKB metadata from the program's own metadata (as
opposed to just reusing the existing data_meta pointer)?

However, if we do, the layout that makes most sense to me is putting the
skb metadata before the program metadata, like:

--------------
| skb_metadata
--------------
| data_meta
--------------
| data
--------------

Not sure if that's what you meant? :)

-Toke
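
The layout in the diagram above, together with the flag discussed
earlier in this message, can be sketched in plain userspace C. This is
only an illustration under stated assumptions: struct headroom_sketch
and the three helpers are hypothetical names, offsets stand in for the
real xdp_buff pointers, and nothing here is the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the relevant xdp_buff state, expressed as
 * byte offsets from the start of the headroom. Not a kernel structure. */
struct headroom_sketch {
	size_t skb_meta_off;	/* start of the kernel's skb metadata, if any */
	size_t data_meta_off;	/* start of the program's own metadata */
	size_t data_off;	/* start of packet data */
	bool has_skb_meta;	/* the proposed "skb metadata present" flag */
};

/* Reserve skb_meta_len bytes immediately before data_meta.
 * Returns 0 on success and -1 when the headroom is too small.
 * data_meta itself is never moved. */
static int export_skb_meta(struct headroom_sketch *h, size_t skb_meta_len)
{
	assert(h->data_meta_off <= h->data_off);
	if (h->data_meta_off < skb_meta_len)
		return -1;
	h->skb_meta_off = h->data_meta_off - skb_meta_len;
	h->has_skb_meta = true;
	return 0;
}

/* Length of the program's own metadata; unchanged by the export. */
static size_t prog_meta_len(const struct headroom_sketch *h)
{
	return h->data_off - h->data_meta_off;
}

/* XDP_PASS path: consume the skb metadata only when the flag is set,
 * then "pop" it. The program's metadata stays where it is. */
static bool consume_skb_meta(struct headroom_sketch *h)
{
	if (!h->has_skb_meta)
		return false;
	h->has_skb_meta = false;
	return true;
}
```

With the skb metadata placed before data_meta, popping it needs no
memmove of the program's metadata, and the explicit -1 return gives the
program a direct way to detect the "not enough space" case.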



* [xdp-hints] Re: [RFC bpf-next v2 06/14] xdp: Carry over xdp metadata into skb context
  2022-11-10  1:26             ` John Fastabend
@ 2022-11-10 14:32               ` Toke Høiland-Jørgensen
  2022-11-10 17:30                 ` John Fastabend
  0 siblings, 1 reply; 66+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-11-10 14:32 UTC (permalink / raw)
  To: John Fastabend, Martin KaFai Lau, Stanislav Fomichev
  Cc: ast, daniel, andrii, song, yhs, john.fastabend, kpsingh, haoluo,
	jolsa, David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev, bpf

John Fastabend <john.fastabend@gmail.com> writes:

> Toke Høiland-Jørgensen wrote:
>> Snipping a bit of context to reply to this bit:
>> 
>> >>>> Can the xdp prog still change the metadata through xdp->data_meta? tbh, I am not
>> >>>> sure it is solid enough by asking the xdp prog not to use the same random number
>> >>>> in its own metadata + not to change the metadata through xdp->data_meta after
>> >>>> calling bpf_xdp_metadata_export_to_skb().
>> >>>
>> >>> What do you think the usecase here might be? Or are you suggesting we
>> >>> reject further access to data_meta after
>> >>> bpf_xdp_metadata_export_to_skb somehow?
>> >>>
>> >>> If we want to let the programs override some of this
>> >>> bpf_xdp_metadata_export_to_skb() metadata, it feels like we can add
>> >>> more kfuncs instead of exposing the layout?
>> >>>
>> >>> bpf_xdp_metadata_export_to_skb(ctx);
>> >>> bpf_xdp_metadata_export_skb_hash(ctx, 1234);
>> 
>
> Hi Toke,
>
> Trying not to bifurcate your thread. Can I start a new one here to
> elaborate on these use cases. I'm still a bit lost on any use case
> for this that makes sense to actually deploy on a network.
>
>> There are several use cases for needing to access the metadata after
>> calling bpf_xdp_metadata_export_to_skb():
>> 
>> - Accessing the metadata after redirect (in a cpumap or devmap program,
>>   or on a veth device)
>
> I think for devmap there are still lots of opens how/where the skb
> is even built.

For veth it's pretty clear; i.e., when redirecting into containers.

> For cpumap I'm a bit unsure what the use case is. For ice, mlx and
> such you should use the hardware RSS if performance is top of mind.

Hardware RSS works fine if your hardware supports the hashing you want;
many do not. As an example, Jesper wrote this application that uses
cpumap to divide out ISP customer traffic among different CPUs (solving
an HTB scaling problem):

https://github.com/xdp-project/xdp-cpumap-tc

> And then for specific devices on cpumap (maybe realtime or ptp
> things?) could we just throw it through the xdp_frame?

Not sure what you mean here? Throw what through the xdp_frame?

>> - Transferring the packet+metadata to AF_XDP
>
> In this case we have the metadata and AF_XDP program and XDP program
> simply need to agree on metadata format. No need to have some magic
> numbers and driver specific kfuncs.

See my other reply to Martin: Yeah, for AF_XDP users that write their
own kernel XDP programs, they can just do whatever they want. But many
users just rely on the default program in libxdp, so having a standard
format to include with that is useful.

>> - Returning XDP_PASS, but accessing some of the metadata first (whether
>>   to read or change it)
>> 
>
> I don't get this case? XDP_PASS should go to stack normally through
> drivers build_skb routines. These will populate timestamp normally.
> My guess is simply descriptor->skb load/store is cheaper than carrying
> around this metadata and doing the call in BPF side. Anyways you
> just built an entire skb and hit the stack I don't think you will
> notice this noise in any benchmark.

If you modify the packet before calling XDP_PASS you may want to update
the metadata as well (for instance the RX hash, or in the future the
metadata could also carry transport header offsets).

-Toke
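
As a rough illustration of the cpumap point above (steering in software
by a key that hardware RSS may not hash on), here is a minimal userspace
sketch. It is not how xdp-cpumap-tc is implemented; mix32() and
pick_cpu() are hypothetical names, and the mixer is a generic
Murmur3-style finalizer chosen only for the example.

```c
#include <stdint.h>

/* Generic 32-bit integer mixer, used here purely for illustration. */
static uint32_t mix32(uint32_t x)
{
	x ^= x >> 16;
	x *= 0x85ebca6bu;
	x ^= x >> 13;
	x *= 0xc2b2ae35u;
	x ^= x >> 16;
	return x;
}

/* Map a customer key (for example an IPv4 address) to a CPU index so
 * that all traffic for one customer lands on the same CPU.
 * Returns 0 when n_cpus is 0 to avoid dividing by zero. */
static uint32_t pick_cpu(uint32_t customer_ip, uint32_t n_cpus)
{
	return n_cpus ? mix32(customer_ip) % n_cpus : 0;
}
```

In an actual XDP program the resulting index would be the key passed to
a cpumap redirect; the metadata question in this thread is what happens
to the hardware hints once the frame arrives on that remote CPU.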



* [xdp-hints] Re: [RFC bpf-next v2 06/14] xdp: Carry over xdp metadata into skb context
  2022-11-10 14:32               ` Toke Høiland-Jørgensen
@ 2022-11-10 17:30                 ` John Fastabend
  2022-11-10 22:49                   ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 66+ messages in thread
From: John Fastabend @ 2022-11-10 17:30 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, John Fastabend,
	Martin KaFai Lau, Stanislav Fomichev
  Cc: ast, daniel, andrii, song, yhs, john.fastabend, kpsingh, haoluo,
	jolsa, David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev, bpf

Toke Høiland-Jørgensen wrote:
> John Fastabend <john.fastabend@gmail.com> writes:
> 
> > Toke Høiland-Jørgensen wrote:
> >> Snipping a bit of context to reply to this bit:
> >> 
> >> >>>> Can the xdp prog still change the metadata through xdp->data_meta? tbh, I am not
> >> >>>> sure it is solid enough by asking the xdp prog not to use the same random number
> >> >>>> in its own metadata + not to change the metadata through xdp->data_meta after
> >> >>>> calling bpf_xdp_metadata_export_to_skb().
> >> >>>
> >> >>> What do you think the usecase here might be? Or are you suggesting we
> >> >>> reject further access to data_meta after
> >> >>> bpf_xdp_metadata_export_to_skb somehow?
> >> >>>
> >> >>> If we want to let the programs override some of this
> >> >>> bpf_xdp_metadata_export_to_skb() metadata, it feels like we can add
> >> >>> more kfuncs instead of exposing the layout?
> >> >>>
> >> >>> bpf_xdp_metadata_export_to_skb(ctx);
> >> >>> bpf_xdp_metadata_export_skb_hash(ctx, 1234);
> >> 
> >
> > Hi Toke,
> >
> > Trying not to bifurcate your thread. Can I start a new one here to
> > elaborate on these use cases. I'm still a bit lost on any use case
> > for this that makes sense to actually deploy on a network.
> >
> >> There are several use cases for needing to access the metadata after
> >> calling bpf_xdp_metadata_export_to_skb():
> >> 
> >> - Accessing the metadata after redirect (in a cpumap or devmap program,
> >>   or on a veth device)
> >
> > I think for devmap there are still lots of opens how/where the skb
> > is even built.
> 
> For veth it's pretty clear; i.e., when redirecting into containers.

Ah, but I think XDP on veth is a bit questionable in general. The use
case is NFV, I guess, but it's not how I would build out NFV. I've never
seen it actually deployed other than in CI. Anyway, it's not necessary to
drop into that debate here. It exists, so OK.

> 
> > For cpumap I'm a bit unsure what the use case is. For ice, mlx and
> > such you should use the hardware RSS if performance is top of mind.
> 
> Hardware RSS works fine if your hardware supports the hashing you want;
> many do not. As an example, Jesper wrote this application that uses
> cpumap to divide out ISP customer traffic among different CPUs (solving
> an HTB scaling problem):
> 
> https://github.com/xdp-project/xdp-cpumap-tc

I'm going to argue that hw should still be able to do this and that we
should fix the hw, but that may not be easily doable without convincing
hardware folks to talk to us.

Also, it's not obvious to me how the linked code works without more
studying. Is it ingress HTB? So you would push the rxhash and timestamp
into cpumap and then build the skb here with the correct skb->tstamp?

OK, even if I can't exactly find the use case for cpumap, if I had
a use case I can see how passing metadata through is useful.

> 
> > And then for specific devices on cpumap (maybe realtime or ptp
> > things?) could we just throw it through the xdp_frame?
> 
> Not sure what you mean here? Throw what through the xdp_frame?

Doesn't matter; I reread the patches and figured it out. I was slightly
confused.

> 
> >> - Transferring the packet+metadata to AF_XDP
> >
> > In this case we have the metadata and AF_XDP program and XDP program
> > simply need to agree on metadata format. No need to have some magic
> > numbers and driver specific kfuncs.
> 
> See my other reply to Martin: Yeah, for AF_XDP users that write their
> own kernel XDP programs, they can just do whatever they want. But many
> users just rely on the default program in libxdp, so having a standard
> format to include with that is useful.
> 

I don't think your AF_XDP program is any different from other AF_XDP
programs. Your lib can create a standard format if it wants, but the
kernel doesn't need to enforce it anyway.


> >> - Returning XDP_PASS, but accessing some of the metadata first (whether
> >>   to read or change it)
> >> 
> >
> > I don't get this case? XDP_PASS should go to stack normally through
> > drivers build_skb routines. These will populate timestamp normally.
> > My guess is simply descriptor->skb load/store is cheaper than carrying
> > around this metadata and doing the call in BPF side. Anyways you
> > just built an entire skb and hit the stack I don't think you will
> > notice this noise in any benchmark.
> 
> If you modify the packet before calling XDP_PASS you may want to update
> the metadata as well (for instance the RX hash, or in the future the
> metadata could also carry transport header offsets).

OK. So when you modify the pkt, fixing up the rxhash makes sense. Thanks
for the response, I can see the argument.

Thanks,
John
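
The rxhash fix-up agreed on above can be sketched as follows. This is a
userspace illustration only: flow_sketch, meta_sketch and rewrite_dport()
are hypothetical names, and FNV-1a is used as a stand-in hash, not the
Toeplitz hash that NICs typically compute.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flow key and metadata; not kernel structures. */
struct flow_sketch {
	uint32_t saddr, daddr;
	uint16_t sport, dport;
};

struct meta_sketch {
	uint32_t rx_hash;
	bool rx_hash_valid;
};

/* Feed one 32-bit value into a running FNV-1a hash, low byte first. */
static uint32_t fnv1a_u32(uint32_t h, uint32_t v)
{
	for (int i = 0; i < 4; i++) {
		h ^= (v >> (8 * i)) & 0xffu;
		h *= 16777619u;
	}
	return h;
}

/* Hash the 4-tuple; a stand-in for whatever hash the device uses. */
static uint32_t flow_hash(const struct flow_sketch *f)
{
	uint32_t h = 2166136261u;

	h = fnv1a_u32(h, f->saddr);
	h = fnv1a_u32(h, f->daddr);
	h = fnv1a_u32(h, ((uint32_t)f->sport << 16) | f->dport);
	return h;
}

/* Rewrite the destination port and keep the metadata consistent with
 * the packet that will actually be passed up the stack. */
static void rewrite_dport(struct flow_sketch *f, struct meta_sketch *m,
			  uint16_t new_dport)
{
	f->dport = new_dport;
	m->rx_hash = flow_hash(f);
	m->rx_hash_valid = true;
}
```

The point is only that the hash stored in the metadata is recomputed in
the same step that changes the header, so the stack never sees a hash
that describes the pre-rewrite flow.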


* [xdp-hints] Re: [RFC bpf-next v2 04/14] veth: Support rx timestamp metadata for xdp
  2022-11-10  6:44         ` Stanislav Fomichev
@ 2022-11-10 17:39           ` John Fastabend
  2022-11-10 18:52             ` Stanislav Fomichev
  2022-11-11 10:41             ` Jesper Dangaard Brouer
  0 siblings, 2 replies; 66+ messages in thread
From: John Fastabend @ 2022-11-10 17:39 UTC (permalink / raw)
  To: Stanislav Fomichev, John Fastabend
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, kpsingh, haoluo,
	jolsa, David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev

Stanislav Fomichev wrote:
> On Wed, Nov 9, 2022 at 5:35 PM John Fastabend <john.fastabend@gmail.com> wrote:
> >
> > Stanislav Fomichev wrote:
> > > On Wed, Nov 9, 2022 at 4:26 PM John Fastabend <john.fastabend@gmail.com> wrote:
> > > >
> > > > Stanislav Fomichev wrote:
> > > > > xskxceiver conveniently sets up veth pairs so it seems logical
> > > > > to use veth as an example for some of the metadata handling.
> > > > >
> > > > > We timestamp skb right when we "receive" it, store its
> > > > > pointer in new veth_xdp_buff wrapper and generate BPF bytecode to
> > > > > reach it from the BPF program.
> > > > >
> > > > > This largely follows the idea of "store some queue context in
> > > > > the xdp_buff/xdp_frame so the metadata can be reached out
> > > > > from the BPF program".
> > > > >
> > > >
> > > > [...]
> > > >
> > > > >       orig_data = xdp->data;
> > > > >       orig_data_end = xdp->data_end;
> > > > > +     vxbuf.skb = skb;
> > > > >
> > > > >       act = bpf_prog_run_xdp(xdp_prog, xdp);
> > > > >
> > > > > @@ -942,6 +946,7 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
> > > > >                       struct sk_buff *skb = ptr;
> > > > >
> > > > >                       stats->xdp_bytes += skb->len;
> > > > > +                     __net_timestamp(skb);
> > > >
> > > > Just getting to reviewing in depth a bit more. But we hit veth with lots of
> > > > packets in some configurations I don't think we want to add a __net_timestamp
> > > > here when vast majority of use cases will have no need for timestamp on veth
> > > > device. I didn't do a benchmark but it's not free.
> > > >
> > > > If there is a real use case for timestamping on veth we could do it through
> > > > a XDP program directly? Basically fallback for devices without hw timestamps.
> > > > Anyways I need the helper to support hardware without time stamping.
> > > >
> > > > Not sure if this was just part of the RFC to explore BPF programs or not.
> > >
> > > Initially I've done it mostly so I can have selftests on top of veth
> > > driver, but I'd still prefer to keep it to have working tests.
> >
> > I can't think of a use for it though so it's just extra cycles. There
> > is a helper to read the ktime.
> 
> As I mentioned in another reply, I wanted something SW-only to test
> this whole metadata story.

Yeah, I see the value there. Also, this is in the veth_xdp_rcv path, and
we don't actually attach XDP programs to veths except in CI anyway. I
assume, though, that if someone actually does use this in prod, having
an extra __net_timestamp there would cost extra unwanted cycles.

> The idea was:
> - veth rx sets skb->tstamp (otherwise it's 0 at this point)
> - veth kfunc to access rx_timestamp returns skb->tstamp
> - xsk bpf program verifies that the metadata is non-zero
> - the above shows end-to-end functionality with a software driver

Yep, 100% agree, very handy for testing. I'm just not sure we can add
code to the hotpath just for testing.

> 
> > > Any way I can make it configurable? Is there some ethtool "enable tx
> > > timestamping" option I can reuse?
> >
> > There is a -T option for timestamping in ethtool. There are also the
> > socket controls for it. So you could spin up a socket and use it.
> > But that is a bit broken as well I think it would be better if the
> > timestamp came from the receiving physical nic?
> >
> > I have some mlx nics here and a k8s cluster with lots of veth
> > devices so I could think a bit more. I'm just not sure why I would
> > want the veth to timestamp things off hand?
> 
> -T is for dumping only it seems?
> 
> I'm probably using skb->tstamp in an unconventional manner here :-/
> Do you know if enabling timestamping on the socket, as you suggest,
> will get me some non-zero skb_hwtstamps with xsk?
> I need something to show how the kfunc can return this data and how
> can this data land in xdp prog / af_xdp chunk..

Take a look at ./Documentation/networking/timestamping.rst; section 3.1
is maybe relevant. But then you end up implementing a bunch of random
ioctls for no reason other than testing. Maybe worth doing for this,
though; not sure.

Using the virtio driver might actually be useful and give you a test
device. In the early XDP days I used it for testing a lot. It would
require qemu to set up, though.
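
For reference, the software-only test flow quoted above (veth rx sets
the timestamp, a kfunc returns it, the program checks it is non-zero)
can be mocked in userspace C. All names are hypothetical stand-ins, and
the want_tstamp switch is an assumption added here to illustrate keeping
the timestamp off the hot path unless requested; it is not part of the
posted patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for sk_buff and the veth_xdp_buff wrapper. */
struct skb_sketch {
	uint64_t tstamp;	/* 0 means "not timestamped" */
};

struct veth_xdp_buff_sketch {
	struct skb_sketch *skb;	/* mirrors "vxbuf.skb = skb" in the patch */
};

/* veth rx: stamp the skb only when asked to (assumed opt-in switch). */
static void veth_rx_sketch(struct skb_sketch *skb, uint64_t now_ns,
			   bool want_tstamp)
{
	if (want_tstamp)
		skb->tstamp = now_ns;
}

/* What the rx_timestamp kfunc would return for veth in this model. */
static uint64_t rx_timestamp_sketch(const struct veth_xdp_buff_sketch *vxbuf)
{
	return vxbuf->skb->tstamp;
}

/* What the test program checks: the metadata is non-zero. */
static bool prog_sees_timestamp(const struct veth_xdp_buff_sketch *vxbuf)
{
	return rx_timestamp_sketch(vxbuf) != 0;
}
```

How such a switch would be exposed (ethtool, a socket option, or the
presence of a metadata-aware program) is exactly the open question in
this sub-thread.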


* [xdp-hints] Re: [RFC bpf-next v2 06/14] xdp: Carry over xdp metadata into skb context
  2022-11-10 14:26                     ` Toke Høiland-Jørgensen
@ 2022-11-10 18:52                       ` Stanislav Fomichev
  2022-11-10 23:14                         ` Toke Høiland-Jørgensen
  2022-11-10 23:58                         ` Martin KaFai Lau
  0 siblings, 2 replies; 66+ messages in thread
From: Stanislav Fomichev @ 2022-11-10 18:52 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Martin KaFai Lau, ast, daniel, andrii, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev, bpf

On Thu, Nov 10, 2022 at 6:27 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Stanislav Fomichev <sdf@google.com> writes:
>
> > On Wed, Nov 9, 2022 at 4:13 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >>
> >> On 11/9/22 1:33 PM, Stanislav Fomichev wrote:
> >> > On Wed, Nov 9, 2022 at 10:22 AM Martin KaFai Lau <martin.lau@linux.dev> wrote:
> >> >>
> >> >> On 11/9/22 3:10 AM, Toke Høiland-Jørgensen wrote:
> >> >>> Snipping a bit of context to reply to this bit:
> >> >>>
> >> >>>>>>> Can the xdp prog still change the metadata through xdp->data_meta? tbh, I am not
> >> >>>>>>> sure it is solid enough by asking the xdp prog not to use the same random number
> >> >>>>>>> in its own metadata + not to change the metadata through xdp->data_meta after
> >> >>>>>>> calling bpf_xdp_metadata_export_to_skb().
> >> >>>>>>
> >> >>>>>> What do you think the usecase here might be? Or are you suggesting we
> >> >>>>>> reject further access to data_meta after
> >> >>>>>> bpf_xdp_metadata_export_to_skb somehow?
> >> >>>>>>
> >> >>>>>> If we want to let the programs override some of this
> >> >>>>>> bpf_xdp_metadata_export_to_skb() metadata, it feels like we can add
> >> >>>>>> more kfuncs instead of exposing the layout?
> >> >>>>>>
> >> >>>>>> bpf_xdp_metadata_export_to_skb(ctx);
> >> >>>>>> bpf_xdp_metadata_export_skb_hash(ctx, 1234);
> >> >>>
> >> >>> There are several use cases for needing to access the metadata after
> >> >>> calling bpf_xdp_metadata_export_to_skb():
> >> >>>
> >> >>> - Accessing the metadata after redirect (in a cpumap or devmap program,
> >> >>>     or on a veth device)
> >> >>> - Transferring the packet+metadata to AF_XDP
> >> >> fwiw, the xdp prog could also be more selective and only stores one of the hints
> >> >> instead of the whole 'struct xdp_to_skb_metadata'.
> >> >>
> >> >>> - Returning XDP_PASS, but accessing some of the metadata first (whether
> >> >>>     to read or change it)
> >> >>>
> >> >>> The last one could be solved by calling additional kfuncs, but that
> >> >>> would be less efficient than just directly editing the struct which
> >> >>> will be cache-hot after the helper returns.
> >> >>
> >> >> Yeah, it is more efficient to directly write if possible.  I think this set
> >> >> allows the direct reading and writing already through data_meta (as a __u8 *).
> >> >>
> >> >>>
> >> >>> And yeah, this will allow the XDP program to inject arbitrary metadata
> >> >>> into the netstack; but it can already inject arbitrary *packet* data
> >> >>> into the stack, so not sure if this is much of an additional risk? If it
> >> >>> does lead to trivial crashes, we should probably harden the stack
> >> >>> against that?
> >> >>>
> >> >>> As for the random number, Jesper and I discussed replacing this with the
> >> >>> same BTF-ID scheme that he was using in his patch series. I.e., instead
> >> >>> of just putting in a random number, we insert the BTF ID of the metadata
> >> >>> struct at the end of it. This will allow us to support multiple
> >> >>> different formats in the future (not just changing the layout, but
> >> >>> having multiple simultaneous formats in the same kernel image), in case
> >> >>> we run out of space.
> >> >>
> >> >> This seems a bit hypothetical.  How much headroom does it usually have for the
> >> >> xdp prog?  Potentially the hints can use all the remaining space left after the
> >> >> header encap and the current bpf_xdp_adjust_meta() usage?
> >> >>
> >> >>>
> >> >>> We should probably also have a flag set on the xdp_frame so the stack
> >> >>> knows that the metadata area contains relevant-to-skb data, to guard
> >> >>> against an XDP program accidentally hitting the "magic number" (BTF_ID)
> >> >>> in unrelated stuff it puts into the metadata area.
> >> >>
> >> >> Yeah, I think having a flag is useful.  The flag will be set at xdp_buff and
> >> >> then transfer to the xdp_frame?
> >> >>
> >> >>>
> >> >>>> After re-reading patch 6, have another question. The 'void
> >> >>>> bpf_xdp_metadata_export_to_skb();' function signature. Should it at
> >> >>>> least return ok/err? or even return a 'struct xdp_to_skb_metadata *'
> >> >>>> pointer and the xdp prog can directly read (or even write) it?
> >> >>>
> >> >>> Hmm, I'm not sure returning a failure makes sense? Failure to read one
> >> >>> or more fields just means that those fields will not be populated? We
> >> >>> should probably have a flags field inside the metadata struct itself to
> >> >>> indicate which fields are set or not, but I'm not sure returning an
> >> >>> error value adds anything? Returning a pointer to the metadata field
> >> >>> might be convenient for users (it would just be an alias to the
> >> >>> data_meta pointer, but the verifier could know its size, so the program
> >> >>> doesn't have to bounds check it).
> >> >>
> >> >> If some hints are not available, those hints should be initialized to
> >> >> 0/CHECKSUM_NONE/...etc.  The xdp prog needs a direct way to tell hard failure
> >> >> when it cannot write the meta area because of not enough space.  Comparing
> >> >> xdp->data_meta with xdp->data as a side effect is not intuitive.
> >> >>
> >> >> It is more than saving the bound check.  With type info of 'struct
> >> >> xdp_to_skb_metadata *', the verifier can do more checks like reading in the
> >> >> middle of an integer member.  The verifier could also limit write access only to
> >> >> a few struct's members if it is needed.
> >> >>
> >> >> The returning 'struct xdp_to_skb_metadata *' should not be an alias to the
> >> >> xdp->data_meta.  They should actually point to different locations in the
> >> >> headroom.  bpf_xdp_metadata_export_to_skb() sets a flag in xdp_buff.
> >> >> xdp->data_meta won't be changed and keeps pointing to the last
> >> >> bpf_xdp_adjust_meta() location.  The kernel will know if there is
> >> >> xdp_to_skb_metadata before the xdp->data_meta when that bit is set in the
> >> >> xdp_{buff,frame}.  Would it work?
> >> >>
> >> >>>
> >> >>>> A related question, why 'struct xdp_to_skb_metadata' needs
> >> >>>> __randomize_layout?
> >> >>>
> >> >>> The __randomize_layout thing is there to force BPF programs to use CO-RE
> >> >>> to access the field. This is to avoid the struct layout accidentally
> >> >>> ossifying because people in practice rely on a particular layout, even
> >> >>> though we tell them to use CO-RE. There are lots of examples of this
> >> >>> happening in other domains (IP header options, TCP options, etc), and
> >> >>> __randomize_layout seemed like a neat trick to enforce CO-RE usage :)
> >> >>
> >> >> I am not sure if it is necessary or helpful to only enforce __randomize_layout
> >> >> in 'struct xdp_to_skb_metadata'.  There are other CO-RE use cases (tracing and
> >> >> non tracing) that already have direct access (reading and/or writing) to other
> >> >> kernel structures.
> >> >>
> >> >> It is more important for the verifier to see the xdp prog accessing it as a
> >> >> 'struct xdp_to_skb_metadata *' instead of xdp->data_meta which is a __u8 * so
> >> >> that the verifier can enforce the rules of access.
> >> >>
> >> >>>
> >> >>>>>>> Does xdp_to_skb_metadata have a use case for XDP_PASS (like patch 7) or the
> >> >>>>>>> xdp_to_skb_metadata can be limited to XDP_REDIRECT only?
> >> >>>>>>
> >> >>>>>> XDP_PASS cases where we convert xdp_buff into skb in the drivers right
> >> >>>>>> now usually have C code to manually pull out the metadata (out of hw
> >> >>>>>> desc) and put it into skb.
> >> >>>>>>
> >> >>>>>> So, currently, if we're calling bpf_xdp_metadata_export_to_skb() for
> >> >>>>>> XDP_PASS, we're doing a double amount of work:
> >> >>>>>> skb_metadata_import_from_xdp first, then custom driver code second.
> >> >>>>>>
> >> >>>>>> In theory, maybe we should completely skip drivers custom parsing when
> >> >>>>>> there is a prog with BPF_F_XDP_HAS_METADATA?
> >> >>>>>> Then both xdp->skb paths (XDP_PASS+XDP_REDIRECT) will be bpf-driven
> >> >>>>>> and won't require any mental work (plus, the drivers won't have to
> >> >>>>>> care either in the future).
> >> >>>>>> WDYT?
> >> >>>>>
> >> >>>>>
> >> >>>>> Yeah, not sure if it can solely depend on BPF_F_XDP_HAS_METADATA but it makes
> >> >>>>> sense to only use the hints (if ever written) from xdp prog especially if it
> >> >>>>> will eventually support xdp prog changing some of the hints in the future.  For
> >> >>>>> now, I think either way is fine since they are the same and the xdp prog is sort
> >> >>>>> of doing extra unnecessary work anyway by calling
> >> >>>>> bpf_xdp_metadata_export_to_skb() with XDP_PASS and knowing nothing can be
> >> >>>>> changed now.
> >> >>>
> >> >>> I agree it would be best if the drivers also use the XDP metadata (if
> >> >>> present) on XDP_PASS. Longer term my hope is we can make the XDP
> >> >>> metadata support the only thing drivers need to implement (i.e., have
> >> >>> the stack call into that code even when no XDP program is loaded), but
> >> >>> for now just for consistency (and allowing the XDP program to update the
> >> >>> metadata), we should probably at least consume it on XDP_PASS.
> >> >>>
> >> >>> -Toke
> >> >>>
> >> >
> >> > Not to derail the discussion (left the last message intact on top,
> >> > feel free to continue), but to summarize. The proposed changes seem to
> >> > be:
> >> >
> >> > 1. bpf_xdp_metadata_export_to_skb() should return pointer to "struct
> >> > xdp_to_skb_metadata"
> >> >    - This should let bpf programs change the metadata passed to the skb
> >> >
> >> > 2. "struct xdp_to_skb_metadata" should have its btf_id as the first
> >> > __u32 member (and remove the magic)
> >> >    - This is for the redirect case where the end users, including
> >> > AF_XDP, can parse this metadata from btf_id
> >>
> >> I think Toke's idea is to put the btf_id at the end of xdp_to_skb_metadata.  I
> >> can see why the end is needed for the userspace AF_XDP because, afaict, AF_XDP
> >> rx_desc currently cannot tell if there is metadata written by the xdp prog or
> >> not.  However, if the 'has_skb_metadata' bit can also be passed to the AF_XDP
> >> rx_desc->options, the btf_id may as well be not needed now.  However, the btf_id
> >> and other future new members can be added to the xdp_to_skb_metadata later if
> >> there is a need.
> >>
> >> For the kernel and xdp prog, a bit in the xdp->flags should be enough to get to
> >> the xdp_to_skb_metadata.  The xdp prog will use CO-RE to access the members in
> >> xdp_to_skb_metadata.
> >
> > Ack, good points on putting it at the end.
> > Regarding bit in desc->options vs btf_id: since it seems that btf_id
> > is useful anyway, let's start with that? We can add a bit later on if
> > it turns out using metadata is problematic otherwise.
>
> I think the bit is mostly useful so that the stack can know that the
> metadata has been set before consuming it (to guard against regular
> xdp_metadata usage accidentally hitting the "right" BTF ID). I don't
> think it needs to be exposed to the XDP programs themselves.

SG! A flag for xdp_buff/frame and a kfunc to query it for bpf.

> >> >    - This, however, is not all the metadata that the device can
> >> > support, but a much narrower set that the kernel is expected to use
> >> > for skb construction
> >> >
> >> > 3. __randomize_layout isn't really helping, CO-RE will trigger
> >> > regardless; maybe only the case where it matters is probably AF_XDP,
> >> > so still useful?
>
> Yeah, see my response to Martin, I think the randomisation is useful for
> AF_XDP transfer.

SG. Let's keep it for now. Worst case, if it hurts, we can remove it later...

> >> > 4. The presence of the metadata generated by
> >> > bpf_xdp_metadata_export_to_skb should be indicated by a flag in
> >> > xdp_{buff,frame}->flags
> >> >    - Assuming exposing it via xdp_md->has_skb_metadata is ok?
> >>
> >> probably __bpf_md_ptr(struct xdp_to_skb_metadata *, skb_metadata) and the type
> >> will be PTR_TO_BTF_ID_OR_NULL.
> >
> > Oh, that seems even better than returning it from
> > bpf_xdp_metadata_export_to_skb.
> > bpf_xdp_metadata_export_to_skb can return true/false and the rest goes
> > via default verifier ctx resolution mechanism..
> > (returning ptr from a kfunc seems to be a bit complicated right now)
>
> See my response to John in the other thread about mixing stable UAPI (in
> xdp_md) and unstable BTF structures in the xdp_md struct: I think this
> is confusing and would prefer a kfunc.

SG!

> >> >    - Since the programs probably need to do the following:
> >> >
> >> >    if (xdp_md->has_skb_metadata) {
> >> >      access/change skb metadata by doing struct xdp_to_skb_metadata *p
> >> > = data_meta;
> >>
> >> and directly access/change xdp->skb_metadata instead of using xdp->data_meta.
> >
> > Ack.
> >
> >> >    } else {
> >> >      use kfuncs
> >> >    }
> >> >
> >> > 5. Support the case where we keep program's metadata and kernel's
> >> > xdp_to_skb_metadata
> >> >    - skb_metadata_import_from_xdp() will "consume" it by mem-moving the
> >> > rest of the metadata over it and adjusting the headroom
> >>
> >> I was thinking the kernel's xdp_to_skb_metadata is always before the program's
> >> metadata.  xdp prog should usually work in this order also: read/write headers,
> >> write its own metadata, call bpf_xdp_metadata_export_to_skb(), and return
> >> XDP_PASS/XDP_REDIRECT.  When it is XDP_PASS, the kernel just needs to pop the
> >> xdp_to_skb_metadata and pass the remaining program's metadata to the bpf-tc.
> >>
> >> For the kernel and xdp prog, I don't think it matters where the
> >> xdp_to_skb_metadata is.  However, the xdp->data_meta (program's metadata) has to
> >> be before xdp->data because of the current data_meta and data comparison usage
> >> in the xdp prog.
> >>
> >> The order of the kernel's xdp_to_skb_metadata and the program's metadata
> >> probably only matters to the userspace AF_XDP.  However, I don't see how AF_XDP
> >> supports the program's metadata now.  afaict, it can only work now if there is
> >> some sort of contract between them or the AF_XDP currently does not use the
> >> program's metadata.  Either way, we can do the mem-moving only for AF_XDP and it
> >> should be a no op if there is no program's metadata?  This behavior could also
> >> be configurable through setsockopt?
> >
> > Agreed on all of the above. For now it seems like the safest thing to
> > do is to put xdp_to_skb_metadata last to allow af_xdp to properly
> > locate btf_id.
> > Let's see if Toke disagrees :-)
>
> As I replied to Martin, I'm not sure it's worth the complexity to
> logically split the SKB metadata from the program's own metadata (as
> opposed to just reusing the existing data_meta pointer)?

I'd gladly keep my current requirement where it's either or, but not both :-)
We can relax it later if required?

> However, if we do, the layout that makes most sense to me is putting the
> skb metadata before the program metadata, like:
>
> --------------
> | skb_metadata
> --------------
> | data_meta
> --------------
> | data
> --------------
>
> Not sure if that's what you meant? :)

I was suggesting the other way around: |custom meta|skb_metadata|data|
(but, as Martin points out, consuming skb_metadata in the kernel
becomes messier)

af_xdp can check whether skb_metadata is present by looking at data -
offsetof(struct skb_metadata, btf_id).
progs that know how to handle custom metadata, will look at data -
sizeof(skb_metadata)

Otherwise, if it's the other way around, how do we find skb_metadata
in a redirected frame?
Let's say we have |skb_metadata|custom meta|data|, how does the final
program find skb_metadata?
All the progs have to agree on the sizeof(tc/custom meta), right?
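
A rough userspace sketch of the lookup for the |custom meta|skb_metadata|data|
layout, in case it helps the discussion. Everything here is an assumption made
up for illustration: the struct members, the SKETCH_BTF_ID value and the helper
names are hypothetical, and the real struct is kernel-internal and meant to be
accessed via CO-RE. The only point is that with btf_id as the last member, it
occupies the bytes immediately before data, so a consumer can probe for it
without knowing the size of any custom metadata:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical layout, for illustration only. */
struct skb_metadata_sketch {
	uint64_t rx_timestamp;
	uint32_t rx_hash;
	uint32_t btf_id;	/* last member, so it ends exactly at `data` */
};

#define SKETCH_BTF_ID 0x1234u	/* made-up stand-in for the real BTF ID */

/*
 * Return a pointer to the skb metadata placed directly before `data`, or
 * NULL if the headroom is too small or the trailing btf_id does not match.
 */
static const uint8_t *find_skb_metadata(const uint8_t *headroom_start,
					const uint8_t *data)
{
	uint32_t id;

	assert(data >= headroom_start);
	if ((size_t)(data - headroom_start) < sizeof(struct skb_metadata_sketch))
		return NULL;

	/* Read the 4 bytes right before `data` without assuming alignment. */
	memcpy(&id, data - sizeof(id), sizeof(id));
	if (id != SKETCH_BTF_ID)
		return NULL;

	return data - sizeof(struct skb_metadata_sketch);
}

/* Custom (program) metadata, if any, ends where the skb metadata starts. */
static const uint8_t *custom_meta_end(const uint8_t *data)
{
	return data - sizeof(struct skb_metadata_sketch);
}
```

Note that a matching btf_id alone can be hit by accident, which is why the
flag in xdp_buff/frame discussed above is still wanted on the kernel side.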

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [xdp-hints] Re: [RFC bpf-next v2 04/14] veth: Support rx timestamp metadata for xdp
  2022-11-10 17:39           ` John Fastabend
@ 2022-11-10 18:52             ` Stanislav Fomichev
  2022-11-11 10:41             ` Jesper Dangaard Brouer
  1 sibling, 0 replies; 66+ messages in thread
From: Stanislav Fomichev @ 2022-11-10 18:52 UTC (permalink / raw)
  To: John Fastabend
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, kpsingh, haoluo,
	jolsa, David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev

On Thu, Nov 10, 2022 at 9:39 AM John Fastabend <john.fastabend@gmail.com> wrote:
>
> Stanislav Fomichev wrote:
> > On Wed, Nov 9, 2022 at 5:35 PM John Fastabend <john.fastabend@gmail.com> wrote:
> > >
> > > Stanislav Fomichev wrote:
> > > > On Wed, Nov 9, 2022 at 4:26 PM John Fastabend <john.fastabend@gmail.com> wrote:
> > > > >
> > > > > Stanislav Fomichev wrote:
> > > > > > > xskxceiver conveniently sets up veth pairs so it seems logical
> > > > > > to use veth as an example for some of the metadata handling.
> > > > > >
> > > > > > We timestamp skb right when we "receive" it, store its
> > > > > > pointer in new veth_xdp_buff wrapper and generate BPF bytecode to
> > > > > > reach it from the BPF program.
> > > > > >
> > > > > > This largely follows the idea of "store some queue context in
> > > > > > the xdp_buff/xdp_frame so the metadata can be reached out
> > > > > > from the BPF program".
> > > > > >
> > > > >
> > > > > [...]
> > > > >
> > > > > >       orig_data = xdp->data;
> > > > > >       orig_data_end = xdp->data_end;
> > > > > > +     vxbuf.skb = skb;
> > > > > >
> > > > > >       act = bpf_prog_run_xdp(xdp_prog, xdp);
> > > > > >
> > > > > > @@ -942,6 +946,7 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
> > > > > >                       struct sk_buff *skb = ptr;
> > > > > >
> > > > > >                       stats->xdp_bytes += skb->len;
> > > > > > +                     __net_timestamp(skb);
> > > > >
> > > > > Just getting to reviewing in depth a bit more. But we hit veth with lots of
> > > > > packets in some configurations I don't think we want to add a __net_timestamp
> > > > > here when vast majority of use cases will have no need for timestamp on veth
> > > > > > device. I didn't do a benchmark but it's not free.
> > > > >
> > > > > If there is a real use case for timestamping on veth we could do it through
> > > > > a XDP program directly? Basically fallback for devices without hw timestamps.
> > > > > Anyways I need the helper to support hardware without time stamping.
> > > > >
> > > > > Not sure if this was just part of the RFC to explore BPF programs or not.
> > > >
> > > > Initially I've done it mostly so I can have selftests on top of veth
> > > > driver, but I'd still prefer to keep it to have working tests.
> > >
> > > I can't think of a use for it though so its just extra cycles. There
> > > is a helper to read the ktime.
> >
> > As I mentioned in another reply, I wanted something SW-only to test
> > this whole metadata story.
>
> Yeah I see the value there. Also because this is in the veth_xdp_rcv
> path we don't actually attach XDP programs to veths except for in
> CI anyways. I assume though if someone actually does use this in
> prod having an extra __net_timestamp there would be extra unwanted
> cycles.
>
> > The idea was:
> > - veth rx sets skb->tstamp (otherwise it's 0 at this point)
> > - veth kfunc to access rx_timestamp returns skb->tstamp
> > - xsk bpf program verifies that the metadata is non-zero
> > - the above shows end-to-end functionality with a software driver
>
> Yep 100% agree very handy for testing just not sure we can add code
> to hotpath just for testing.
>
> >
> > > > Any way I can make it configurable? Is there some ethtool "enable tx
> > > > timestamping" option I can reuse?
> > >
> > > There is a -T option for timestamping in ethtool. There are also the
> > > socket controls for it. So you could spin up a socket and use it.
> > > But that is a bit broken as well I think it would be better if the
> > > timestamp came from the receiving physical nic?
> > >
> > > I have some mlx nics here and a k8s cluster with lots of veth
> > > devices so I could think a bit more. I'm just not sure why I would
> > > want the veth to timestamp things off hand?
> >
> > -T is for dumping only it seems?
> >
> > I'm probably using skb->tstamp in an unconventional manner here :-/
> > Do you know if enabling timestamping on the socket, as you suggest,
> > will get me some non-zero skb_hwtstamps with xsk?
> > I need something to show how the kfunc can return this data and how
> > can this data land in xdp prog / af_xdp chunk..
>
> Take a look at ./Documentation/networking/timestamping.rst the 3.1
> section is maybe relevant. But then you end up implementing a bunch
> of random ioctls for no reason other than testing. Maybe worth doing
> though for this not sure.

Hmm, there is a call to skb_tx_timestamp in veth_xmit that I missed.
Let me see if I can make it insert skb->tstamp by turning on one of
the timestamping options you mentioned..
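
For reference, the end-to-end idea quoted above (veth rx sets skb->tstamp, the
veth kfunc returns it, the program checks it is non-zero) boils down to
something like the following userspace sketch. The *_sketch types and names are
simplified stand-ins I made up for this mail, not the kernel definitions; the
part that mirrors the series is the wrapper struct with xdp_buff as the first
member and the container_of step:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal userspace stand-ins; not the kernel structs. */
struct xdp_buff_sketch {
	void *data;
	void *data_end;
};

struct sk_buff_sketch {
	uint64_t tstamp;
};

/* Driver-private wrapper: the xdp_buff has to be the first member. */
struct veth_xdp_buff_sketch {
	struct xdp_buff_sketch xdp;
	struct sk_buff_sketch *skb;
};

static_assert(offsetof(struct veth_xdp_buff_sketch, xdp) == 0,
	      "xdp_buff must be the first member");

#define container_of_sketch(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* What the rx_timestamp accessor conceptually does for veth. */
static uint64_t veth_rx_timestamp_sketch(struct xdp_buff_sketch *xdp)
{
	struct veth_xdp_buff_sketch *vxbuf =
		container_of_sketch(xdp, struct veth_xdp_buff_sketch, xdp);

	/* No skb attached means no timestamp; report 0 in that case. */
	return vxbuf->skb ? vxbuf->skb->tstamp : 0;
}
```

In the actual series this access is emitted as unrolled BPF bytecode rather
than a C function call; the sketch only shows the pointer arithmetic.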


> Using virtio driver might be actually useful and give you a test device.
> Early XDP days I used it for testing a lot. Would require qemu to
> setup though.

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [xdp-hints] Re: [RFC bpf-next v2 06/14] xdp: Carry over xdp metadata into skb context
  2022-11-10 14:19               ` Toke Høiland-Jørgensen
@ 2022-11-10 19:04                 ` Martin KaFai Lau
  2022-11-10 23:29                   ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 66+ messages in thread
From: Martin KaFai Lau @ 2022-11-10 19:04 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Stanislav Fomichev, ast, daniel, andrii, song, yhs,
	john.fastabend, kpsingh, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev, bpf

On 11/10/22 6:19 AM, Toke Høiland-Jørgensen wrote:
> Martin KaFai Lau <martin.lau@linux.dev> writes:
> 
>> On 11/9/22 3:10 AM, Toke Høiland-Jørgensen wrote:
>>> Snipping a bit of context to reply to this bit:
>>>
>>>>>>> Can the xdp prog still change the metadata through xdp->data_meta? tbh, I am not
>>>>>>> sure it is solid enough by asking the xdp prog not to use the same random number
>>>>>>> in its own metadata + not to change the metadata through xdp->data_meta after
>>>>>>> calling bpf_xdp_metadata_export_to_skb().
>>>>>>
>>>>>> What do you think the usecase here might be? Or are you suggesting we
>>>>>> reject further access to data_meta after
>>>>>> bpf_xdp_metadata_export_to_skb somehow?
>>>>>>
>>>>>> If we want to let the programs override some of this
>>>>>> bpf_xdp_metadata_export_to_skb() metadata, it feels like we can add
>>>>>> more kfuncs instead of exposing the layout?
>>>>>>
>>>>>> bpf_xdp_metadata_export_to_skb(ctx);
>>>>>> bpf_xdp_metadata_export_skb_hash(ctx, 1234);
>>>
>>> There are several use cases for needing to access the metadata after
>>> calling bpf_xdp_metadata_export_to_skb():
>>>
>>> - Accessing the metadata after redirect (in a cpumap or devmap program,
>>>     or on a veth device)
>>> - Transferring the packet+metadata to AF_XDP
>> fwiw, the xdp prog could also be more selective and only stores one of the hints
>> instead of the whole 'struct xdp_to_skb_metadata'.
> 
> Yup, absolutely! In that sense, reusing the SKB format is mostly a
> convenience feature. However, lots of people consume AF_XDP through the
> default program installed by libxdp in the XSK setup code, and including
> custom metadata there is awkward. So having the metadata consumed by the
> stack as the "default subset" would enable easy consumption by
> non-advanced users, while advanced users can still do custom stuff by
> writing their own XDP program that calls the kfuncs.
> 
>>> - Returning XDP_PASS, but accessing some of the metadata first (whether
>>>     to read or change it)
>>>
>>> The last one could be solved by calling additional kfuncs, but that
>>> would be less efficient than just directly editing the struct which
>>> will be cache-hot after the helper returns.
>>
>> Yeah, it is more efficient to directly write if possible.  I think this set
>> allows the direct reading and writing already through data_meta (as a __u8 *).
> 
> Yup, totally fine with just keeping that capability :)
> 
>>> And yeah, this will allow the XDP program to inject arbitrary metadata
>>> into the netstack; but it can already inject arbitrary *packet* data
>>> into the stack, so not sure if this is much of an additional risk? If it
>>> does lead to trivial crashes, we should probably harden the stack
>>> against that?
>>>
>>> As for the random number, Jesper and I discussed replacing this with the
>>> same BTF-ID scheme that he was using in his patch series. I.e., instead
>>> of just putting in a random number, we insert the BTF ID of the metadata
>>> struct at the end of it. This will allow us to support multiple
>>> different formats in the future (not just changing the layout, but
>>> having multiple simultaneous formats in the same kernel image), in case
>>> we run out of space.
>>
>> This seems a bit hypothetical.  How much headroom does it usually have for the
>> xdp prog?  Potentially the hints can use all the remaining space left after the
>> header encap and the current bpf_xdp_adjust_meta() usage?
> 
> For the metadata consumed by the stack right now it's a bit
> hypothetical, yeah. However, there's a bunch of metadata commonly
> supported by hardware that the stack currently doesn't consume and that
> hopefully this feature will end up making more accessible. My hope is
> that the stack can also learn how to use this in the future, in which
> case we may run out of space. So I think of that bit mostly as
> future-proofing...

I see. In this case, can the btf_id be added to 'struct xdp_to_skb_metadata' later
if it is indeed needed?  The 'struct xdp_to_skb_metadata' is not in UAPI, and
accessing it through CO-RE gives us the flexibility to make this kind of change
in the future.

> 
>>> We should probably also have a flag set on the xdp_frame so the stack
>>> knows that the metadata area contains relevant-to-skb data, to guard
>>> against an XDP program accidentally hitting the "magic number" (BTF_ID)
>>> in unrelated stuff it puts into the metadata area.
>>
>> Yeah, I think having a flag is useful.  The flag will be set at xdp_buff and
>> then transfer to the xdp_frame?
> 
> Yeah, exactly!
> 
>>>> After re-reading patch 6, have another question. The 'void
>>>> bpf_xdp_metadata_export_to_skb();' function signature. Should it at
>>>> least return ok/err? or even return a 'struct xdp_to_skb_metadata *'
>>>> pointer and the xdp prog can directly read (or even write) it?
>>>
>>> Hmm, I'm not sure returning a failure makes sense? Failure to read one
>>> or more fields just means that those fields will not be populated? We
>>> should probably have a flags field inside the metadata struct itself to
>>> indicate which fields are set or not, but I'm not sure returning an
>>> error value adds anything? Returning a pointer to the metadata field
>>> might be convenient for users (it would just be an alias to the
>>> data_meta pointer, but the verifier could know its size, so the program
>>> doesn't have to bounds check it).
>>
>> If some hints are not available, those hints should be initialized to
>> 0/CHECKSUM_NONE/...etc.
> 
> The problem with that is that then we have to spend cycles writing
> eight bytes of zeroes into the checksum field :)

With a common 'struct xdp_to_skb_metadata', I am not sure how some of these zero 
writes can be avoided.  If the xdp prog wants to optimize, it can call 
individual kfunc to get individual hints.

> 
>> The xdp prog needs a direct way to tell hard failure when it cannot
>> write the meta area because of not enough space. Comparing
>> xdp->data_meta with xdp->data as a side effect is not intuitive.
> 
> Yeah, hence a flags field so we can just see if setting each field
> succeeded?

How is testing a flag different from checking for a 0/invalid value in a field?
Or is it that some fields, like vlan_tci, just don't have an invalid value to
check for?

Did you mean a flags field as a return value, or in 'struct xdp_to_skb_metadata'?
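
To make the question concrete, here is a rough sketch of the in-struct variant
as I understand it. The struct, the flag names and the helpers are all
hypothetical (nothing like this exists in the patch set); the sketch only shows
that a validity bitmask lets a consumer tell "vlan_tci was set to 0" from
"vlan_tci was never written", and that only the bitmask has to be zeroed up
front:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-field validity bits. */
enum {
	META_F_RX_TIMESTAMP = 1u << 0,
	META_F_RX_HASH      = 1u << 1,
	META_F_VLAN_TCI     = 1u << 2,
};

/* Hypothetical metadata struct carrying its own validity bitmask. */
struct meta_with_flags_sketch {
	uint32_t valid;		/* which of the fields below were written */
	uint16_t vlan_tci;
	uint32_t rx_hash;
	uint64_t rx_timestamp;
};

/* Only the bitmask is cleared; the hint fields are left untouched. */
static void meta_init(struct meta_with_flags_sketch *m)
{
	assert(m);
	m->valid = 0;
}

static void meta_set_vlan_tci(struct meta_with_flags_sketch *m, uint16_t tci)
{
	assert(m);
	m->vlan_tci = tci;
	m->valid |= META_F_VLAN_TCI;
}

/* Returns false if the field was never written, regardless of its bytes. */
static bool meta_get_vlan_tci(const struct meta_with_flags_sketch *m,
			      uint16_t *tci)
{
	assert(m && tci);
	if (!(m->valid & META_F_VLAN_TCI))
		return false;
	*tci = m->vlan_tci;
	return true;
}
```

If that is the intent, the cost is one extra load and test per field on the
consumer side, versus zero-initializing every field on the producer side.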

> 
>> It is more than saving the bound check.  With type info of 'struct
>> xdp_to_skb_metadata *', the verifier can do more checks like reading in the
>> middle of an integer member.  The verifier could also limit write access only to
>> a few struct's members if it is needed.
>>
>> The returning 'struct xdp_to_skb_metadata *' should not be an alias to the
>> xdp->data_meta.  They should actually point to different locations in the
>> headroom.  bpf_xdp_metadata_export_to_skb() sets a flag in xdp_buff.
>> xdp->data_meta won't be changed and keeps pointing to the last
>> bpf_xdp_adjust_meta() location.  The kernel will know if there is
>> xdp_to_skb_metadata before the xdp->data_meta when that bit is set in the
>> xdp_{buff,frame}.  Would it work?
> 
> Hmm, logically splitting the program metadata and the xdp_hints metadata
> (but having them share the same area) *could* work, I guess, I'm just
> not sure it's worth the extra complexity?

It shouldn't stop the existing xdp prog writing its own metadata from using the
new bpf_xdp_metadata_export_to_skb().

> 
>>>> A related question, why 'struct xdp_to_skb_metadata' needs
>>>> __randomize_layout?
>>>
>>> The __randomize_layout thing is there to force BPF programs to use CO-RE
>>> to access the field. This is to avoid the struct layout accidentally
>>> ossifying because people in practice rely on a particular layout, even
>>> though we tell them to use CO-RE. There are lots of examples of this
>>> happening in other domains (IP header options, TCP options, etc), and
>>> __randomize_layout seemed like a neat trick to enforce CO-RE usage :)
>>
>> I am not sure if it is necessary or helpful to only enforce __randomize_layout
>> in 'struct xdp_to_skb_metadata'.  There are other CO-RE use cases (tracing and
>> non tracing) that already have direct access (reading and/or writing) to other
>> kernel structures.
>>
>> It is more important for the verifier to see the xdp prog accessing it as a
>> 'struct xdp_to_skb_metadata *' instead of xdp->data_meta which is a __u8 * so
>> that the verifier can enforce the rules of access.
> 
> That only works inside the kernel, though. Since the metadata field can
> be copied wholesale to AF_XDP, having it randomized forces userspace
> consumers to also write code to deal with the layout being dynamic...

Hmm... I still don't see how useful it is. In particular, you mentioned that
libxdp will install an xdp prog to write this default format
(xdp_to_skb_metadata), and libxdp will likely also provide some helpers to parse
the xdp_to_skb_metadata, so the libxdp user should not need to know whether
CO-RE is used or not.  Considering it is a kernel-internal struct, I think it is
fine to keep it; it can be revisited later if needed.  Let's get on to other
things first :)


^ permalink raw reply	[flat|nested] 66+ messages in thread

* [xdp-hints] Re: [RFC bpf-next v2 06/14] xdp: Carry over xdp metadata into skb context
  2022-11-10  6:44     ` Stanislav Fomichev
@ 2022-11-10 21:21       ` David Ahern
  0 siblings, 0 replies; 66+ messages in thread
From: David Ahern @ 2022-11-10 21:21 UTC (permalink / raw)
  To: Stanislav Fomichev, John Fastabend
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, kpsingh, haoluo,
	jolsa, Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev

On 11/9/22 11:44 PM, Stanislav Fomichev wrote:
>>> @@ -423,14 +425,25 @@ XDP_METADATA_KFUNC_xxx
>>>  MAX_XDP_METADATA_KFUNC,
>>>  };
>>>
>>> +struct xdp_to_skb_metadata {
>>> +     u32 magic; /* xdp_metadata_magic */
>>> +     u64 rx_timestamp;
>>
>> Slightly confused. I thought/think most drivers populate the skb timestamp
>> if they can already? So why do we need to bounce these through some xdp
>> metadata? Won't all this cost more than the load/store directly from the
>> descriptor into the skb? Even if drivers are not populating skb now
>> shouldn't an ethtool knob be enough to turn this on?
> 
> dsahern@ pointed out that it might be useful for the program to be
> able to override some of that metadata.

Examples that come to mind from previous work:
1. changing vlans on a redirect: Rx on vlan 1 with h/w popping the vlan
so it is provided in metadata. Then on a redirect it shifts to vlan 2
for Tx. But this example use case assumes Tx accesses the metadata to
ask h/w to insert the header.

2. popping or updating an encap header and wanting to update the checksum

3. changing the h/w provided hash to affect steering of a subsequent skb
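
As a small illustration of example 1, the override itself is just a rewrite of
the 12-bit VLAN ID inside the TCI carried in the metadata, leaving the
priority (PCP) and DEI bits alone. The helper below is a hypothetical userspace
sketch, not an existing kernel or BPF API, and where the TCI lives in the
metadata is left open:

```c
#include <assert.h>
#include <stdint.h>

#define VLAN_VID_MASK_SKETCH 0x0fffu	/* low 12 bits of the TCI are the VID */

/* Replace the VLAN ID in a TCI while keeping the PCP/DEI bits intact. */
static uint16_t tci_set_vid(uint16_t tci, uint16_t vid)
{
	assert(vid <= VLAN_VID_MASK_SKETCH);
	return (uint16_t)((tci & ~VLAN_VID_MASK_SKETCH) | vid);
}
```

A redirecting program would apply something like tci_set_vid(tci, 2) to the
Rx-provided value before Tx consumes the metadata.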

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [xdp-hints] Re: [RFC bpf-next v2 06/14] xdp: Carry over xdp metadata into skb context
  2022-11-10 17:30                 ` John Fastabend
@ 2022-11-10 22:49                   ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 66+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-11-10 22:49 UTC (permalink / raw)
  To: John Fastabend, John Fastabend, Martin KaFai Lau, Stanislav Fomichev
  Cc: ast, daniel, andrii, song, yhs, john.fastabend, kpsingh, haoluo,
	jolsa, David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev, bpf

John Fastabend <john.fastabend@gmail.com> writes:

> Toke Høiland-Jørgensen wrote:
>> John Fastabend <john.fastabend@gmail.com> writes:
>> 
>> > Toke Høiland-Jørgensen wrote:
>> >> Snipping a bit of context to reply to this bit:
>> >> 
>> >> >>>> Can the xdp prog still change the metadata through xdp->data_meta? tbh, I am not
>> >> >>>> sure it is solid enough by asking the xdp prog not to use the same random number
>> >> >>>> in its own metadata + not to change the metadata through xdp->data_meta after
>> >> >>>> calling bpf_xdp_metadata_export_to_skb().
>> >> >>>
>> >> >>> What do you think the usecase here might be? Or are you suggesting we
>> >> >>> reject further access to data_meta after
>> >> >>> bpf_xdp_metadata_export_to_skb somehow?
>> >> >>>
>> >> >>> If we want to let the programs override some of this
>> >> >>> bpf_xdp_metadata_export_to_skb() metadata, it feels like we can add
>> >> >>> more kfuncs instead of exposing the layout?
>> >> >>>
>> >> >>> bpf_xdp_metadata_export_to_skb(ctx);
>> >> >>> bpf_xdp_metadata_export_skb_hash(ctx, 1234);
>> >> 
>> >
>> > Hi Toke,
>> >
>> > Trying not to bifurcate your thread. Can I start a new one here to
>> > elaborate on these use cases. I'm still a bit lost on any use case
>> > for this that makes sense to actually deploy on a network.
>> >
>> >> There are several use cases for needing to access the metadata after
>> >> calling bpf_xdp_metadata_export_to_skb():
>> >> 
>> >> - Accessing the metadata after redirect (in a cpumap or devmap program,
>> >>   or on a veth device)
>> >
>> > I think for devmap there are still lots of opens how/where the skb
>> > is even built.
>> 
>> For veth it's pretty clear; i.e., when redirecting into containers.
>
> Ah but I think XDP on veth is a bit questionable in general. The use
> case is NFV I guess but its not how I would build out NFV. I've never
> seen it actually deployed other than in CI. Anyways not necessary to
> drop into that debate here. It exists so OK.
>
>> 
>> > For cpumap I'm a bit unsure what the use case is. For ice, mlx and
>> > such you should use the hardware RSS if performance is top of mind.
>> 
>> Hardware RSS works fine if your hardware supports the hashing you want;
>> many do not. As an example, Jesper wrote this application that uses
>> cpumap to divide out ISP customer traffic among different CPUs (solving
>> an HTB scaling problem):
>> 
>> https://github.com/xdp-project/xdp-cpumap-tc
>
> I'm going to argue hw should be able to do this still and we
> should fix the hw but maybe not easily doable without convincing
> hardware folks to talk to us.

Sure, in the ideal world the hardware should just be able to do this.
Unfortunately, we don't live in that ideal world :)

> Also not obvious to me how linked code works without more studying,
> its ingress HTB? So you would push the rxhash and timestamp into
> cpumap and then build the skb here with the correct skb->timestamp?

No, the HTB tree is on egress. The use case is an ISP middlebox that
shapes (say) 1000 customers to their subscribed rate, using a big HTB
tree. If you just do this with a single HTB instance on the egress NIC
you run into the global qdisc lock and you can't scale beyond a pretty
measly bandwidth. Whereas if you use multiple HW TXQs and the mq qdisc,
you can partition the HTB tree so you only have a subset of customers on
each HWQ/HTB instance. But for this to work, and still guarantee each
customer gets shaped to the right rate, you need to ensure that all that
customer's traffic hits the same HWQ. The xdp-cpumap-tc tool does this
by configuring the TXQs to correspond to individual CPUs, and then runs
an XDP program that matches traffic to customers and redirects them to
the right CPU (using an LPM map).
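
To give a feel for the matching step, here is a simplified userspace sketch of
a longest-prefix-match lookup from customer prefix to CPU. This is not the
xdp-cpumap-tc code: the real thing uses a BPF LPM map and a cpumap redirect,
whereas the table, struct and function names below are made up for
illustration, and a linear scan stands in for the trie:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table entry: customer prefix -> CPU (and thus TXQ/HTB). */
struct lpm_entry_sketch {
	uint32_t prefix;	/* IPv4 prefix, host byte order */
	uint8_t  prefix_len;	/* 0..32 */
	uint32_t cpu;
};

static uint32_t prefix_mask(uint8_t len)
{
	assert(len <= 32);
	/* Shifting a 32-bit value by 32 is undefined, so handle /0 explicitly. */
	return len == 0 ? 0 : 0xffffffffu << (32 - len);
}

/* Linear-scan longest-prefix match; returns the CPU, or -1 if nothing matches. */
static int lpm_lookup_cpu(const struct lpm_entry_sketch *tbl, size_t n,
			  uint32_t addr)
{
	int best_len = -1;
	int cpu = -1;

	for (size_t i = 0; i < n; i++) {
		uint32_t mask = prefix_mask(tbl[i].prefix_len);

		if ((addr & mask) == (tbl[i].prefix & mask) &&
		    (int)tbl[i].prefix_len > best_len) {
			best_len = tbl[i].prefix_len;
			cpu = (int)tbl[i].cpu;
		}
	}
	return cpu;
}
```

Because every packet of a given customer resolves to the same CPU, it also
ends up on the same HWQ/HTB instance, which is what keeps the shaping correct.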

This solution runs in production in quite a number of smallish WISPs,
BTW, with quite nice results. The software to set it all up is also open
sourced: https://libreqos.io/

Coming back to HW metadata, the LibreQoS system could benefit from the
hardware flow hash in particular, since that would save a hash operation
when enqueueing the packet into sch_cake.

> OK even if I can't exactly find the use case for cpumap if I had
> a use case I can see how passing metadata through is useful.

Great!

>> > And then for specific devices on cpumap (maybe realtime or ptp
>> > things?) could we just throw it through the xdp_frame?
>> 
>> Not sure what you mean here? Throw what through the xdp_frame?
>
> Doesn't matter reread patches and figured it out I was slightly
> confused.

Right, OK.

>> 
>> >> - Transferring the packet+metadata to AF_XDP
>> >
>> > In this case we have the metadata and AF_XDP program and XDP program
>> > simply need to agree on metadata format. No need to have some magic
>> > numbers and driver specific kfuncs.
>> 
>> See my other reply to Martin: Yeah, for AF_XDP users that write their
>> own kernel XDP programs, they can just do whatever they want. But many
>> users just rely on the default program in libxdp, so having a standard
>> format to include with that is useful.
>> 
>
> I don't think your AF_XDP program is any different than other AF_XDP
> programs. Your lib can create a standard format if it wants but
> kernel doesn't need to enforce it anyway.

Yeah, we totally could. But since we're defining a "standard" format for
kernel (skb) consumption anyway, making this available to AF_XDP is
kinda convenient so we don't have to :)

>> >> - Returning XDP_PASS, but accessing some of the metadata first (whether
>> >>   to read or change it)
>> >> 
>> >
>> > I don't get this case? XDP_PASS should go to stack normally through
>> > drivers build_skb routines. These will populate timestamp normally.
>> > My guess is simply descriptor->skb load/store is cheaper than carrying
>> > around this metadata and doing the call in BPF side. Anyways you
>> > just built an entire skb and hit the stack I don't think you will
>> > notice this noise in any benchmark.
>> 
>> If you modify the packet before calling XDP_PASS you may want to update
>> the metadata as well (for instance the RX hash, or in the future the
>> metadata could also carry transport header offsets).
>
> OK. So when you modify the pkt fixing up the rxhash makes sense. Thanks
> for the response I can see the argument.

Great! You're welcome :)

-Toke


^ permalink raw reply	[flat|nested] 66+ messages in thread

* [xdp-hints] Re: [RFC bpf-next v2 06/14] xdp: Carry over xdp metadata into skb context
  2022-11-10 18:52                       ` Stanislav Fomichev
@ 2022-11-10 23:14                         ` Toke Høiland-Jørgensen
  2022-11-10 23:52                           ` Stanislav Fomichev
  2022-11-10 23:58                         ` Martin KaFai Lau
  1 sibling, 1 reply; 66+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-11-10 23:14 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Martin KaFai Lau, ast, daniel, andrii, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev, bpf

Skipping to the last bit:

>> >> >    } else {
>> >> >      use kfuncs
>> >> >    }
>> >> >
>> >> > 5. Support the case where we keep program's metadata and kernel's
>> >> > xdp_to_skb_metadata
>> >> >    - skb_metadata_import_from_xdp() will "consume" it by mem-moving the
>> >> > rest of the metadata over it and adjusting the headroom
>> >>
>> >> I was thinking the kernel's xdp_to_skb_metadata is always before the program's
>> >> metadata.  xdp prog should usually work in this order also: read/write headers,
>> >> write its own metadata, call bpf_xdp_metadata_export_to_skb(), and return
>> >> XDP_PASS/XDP_REDIRECT.  When it is XDP_PASS, the kernel just needs to pop the
>> >> xdp_to_skb_metadata and pass the remaining program's metadata to the bpf-tc.
>> >>
>> >> For the kernel and xdp prog, I don't think it matters where the
>> >> xdp_to_skb_metadata is.  However, the xdp->data_meta (program's metadata) has to
>> >> be before xdp->data because of the current data_meta and data comparison usage
>> >> in the xdp prog.
>> >>
>> >> The order of the kernel's xdp_to_skb_metadata and the program's metadata
>> >> probably only matters to the userspace AF_XDP.  However, I don't see how AF_XDP
>> >> supports the program's metadata now.  afaict, it can only work now if there is
>> >> some sort of contract between them or the AF_XDP currently does not use the
>> >> program's metadata.  Either way, we can do the mem-moving only for AF_XDP and it
>> >> should be a no op if there is no program's metadata?  This behavior could also
>> >> be configurable through setsockopt?
>> >
>> > Agreed on all of the above. For now it seems like the safest thing to
>> > do is to put xdp_to_skb_metadata last to allow af_xdp to properly
>> > locate btf_id.
>> > Let's see if Toke disagrees :-)
>>
>> As I replied to Martin, I'm not sure it's worth the complexity to
>> logically split the SKB metadata from the program's own metadata (as
>> opposed to just reusing the existing data_meta pointer)?
>
> I'd gladly keep my current requirement where it's either or, but not both :-)
> We can relax it later if required?

So the way I've been thinking about it is simply that the skb_metadata
would live in the same place as the data_meta pointer (including
adjusting that pointer to accommodate it), simply overwriting the
existing program metadata, if any exists. But looking at it now, I guess
having the split makes it easier for a program to write its own custom
metadata and still use the skb metadata. See below about the ordering.

>> However, if we do, the layout that makes most sense to me is putting the
>> skb metadata before the program metadata, like:
>>
>> --------------
>> | skb_metadata
>> --------------
>> | data_meta
>> --------------
>> | data
>> --------------
>>
>> Not sure if that's what you meant? :)
>
> I was suggesting the other way around: |custom meta|skb_metadata|data|
> (but, as Martin points out, consuming skb_metadata in the kernel
> becomes messier)
>
> af_xdp can check whether skb_metadata is present by looking at data -
> offsetof(struct skb_metadata, btf_id).
> progs that know how to handle custom metadata, will look at data -
> sizeof(skb_metadata)
>
> Otherwise, if it's the other way around, how do we find skb_metadata
> in a redirected frame?
> Let's say we have |skb_metadata|custom meta|data|, how does the final
> program find skb_metadata?
> All the progs have to agree on the sizeof(tc/custom meta), right?

Erm, maybe I'm missing something here, but skb_metadata is fixed size,
right? So if the "skb_metadata is present" flag is set, we know that the
sizeof(skb_metadata) bytes before the data_meta pointer contains the
metadata, and if the flag is not set, we know those bytes are not valid
metadata.

For AF_XDP, we'd need to transfer the flag as well, and it could apply
the same logic (getting the size from the vmlinux BTF).

By this logic, the BTF_ID should be the *first* entry of struct
skb_metadata, since that will be the field AF_XDP programs can find
right off the bat, no?
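
To make the flag-gated lookup concrete, here is a minimal sketch. The
struct names, the field names and the has_skb_metadata flag are all
hypothetical stand-ins, not existing kernel definitions:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the fixed-size kernel struct. */
struct skb_metadata {
	uint32_t btf_id;	/* first member, as suggested above */
	uint32_t rx_hash;
	uint64_t rx_timestamp;
};

/* Hypothetical stand-in for the per-frame state. */
struct frame {
	uint8_t *data_meta;	/* start of the program's own metadata */
	uint8_t *data;		/* start of the packet data */
	bool has_skb_metadata;	/* the "skb_metadata is present" flag */
};

/*
 * Return the start of the skb metadata area, or NULL when the flag is
 * unset and the bytes before data_meta are not valid metadata.
 */
static uint8_t *frame_skb_metadata(const struct frame *f)
{
	if (!f->has_skb_metadata)
		return NULL;
	return f->data_meta - sizeof(struct skb_metadata);
}
```

The lookup never inspects the bytes themselves; the flag alone decides
whether they are meaningful.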

-Toke


^ permalink raw reply	[flat|nested] 66+ messages in thread

* [xdp-hints] Re: [RFC bpf-next v2 06/14] xdp: Carry over xdp metadata into skb context
  2022-11-10 19:04                 ` Martin KaFai Lau
@ 2022-11-10 23:29                   ` Toke Høiland-Jørgensen
  2022-11-11  1:39                     ` Martin KaFai Lau
  0 siblings, 1 reply; 66+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-11-10 23:29 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Stanislav Fomichev, ast, daniel, andrii, song, yhs,
	john.fastabend, kpsingh, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev, bpf

Martin KaFai Lau <martin.lau@linux.dev> writes:

> On 11/10/22 6:19 AM, Toke Høiland-Jørgensen wrote:
>> Martin KaFai Lau <martin.lau@linux.dev> writes:
>> 
>>> On 11/9/22 3:10 AM, Toke Høiland-Jørgensen wrote:
>>>> Snipping a bit of context to reply to this bit:
>>>>
>>>>>>>> Can the xdp prog still change the metadata through xdp->data_meta? tbh, I am not
>>>>>>>> sure it is solid enough by asking the xdp prog not to use the same random number
>>>>>>>> in its own metadata + not to change the metadata through xdp->data_meta after
>>>>>>>> calling bpf_xdp_metadata_export_to_skb().
>>>>>>>
>>>>>>> What do you think the usecase here might be? Or are you suggesting we
>>>>>>> reject further access to data_meta after
>>>>>>> bpf_xdp_metadata_export_to_skb somehow?
>>>>>>>
>>>>>>> If we want to let the programs override some of this
>>>>>>> bpf_xdp_metadata_export_to_skb() metadata, it feels like we can add
>>>>>>> more kfuncs instead of exposing the layout?
>>>>>>>
>>>>>>> bpf_xdp_metadata_export_to_skb(ctx);
>>>>>>> bpf_xdp_metadata_export_skb_hash(ctx, 1234);
>>>>
>>>> There are several use cases for needing to access the metadata after
>>>> calling bpf_xdp_metadata_export_to_skb():
>>>>
>>>> - Accessing the metadata after redirect (in a cpumap or devmap program,
>>>>     or on a veth device)
>>>> - Transferring the packet+metadata to AF_XDP
>>> fwiw, the xdp prog could also be more selective and only stores one of the hints
>>> instead of the whole 'struct xdp_to_skb_metadata'.
>> 
>> Yup, absolutely! In that sense, reusing the SKB format is mostly a
>> convenience feature. However, lots of people consume AF_XDP through the
>> default program installed by libxdp in the XSK setup code, and including
>> custom metadata there is awkward. So having the metadata consumed by the
>> stack as the "default subset" would enable easy consumption by
>> non-advanced users, while advanced users can still do custom stuff by
>> writing their own XDP program that calls the kfuncs.
>> 
>>>> - Returning XDP_PASS, but accessing some of the metadata first (whether
>>>>     to read or change it)
>>>>
>>>> The last one could be solved by calling additional kfuncs, but that
>>>> would be less efficient than just directly editing the struct which
>>>> will be cache-hot after the helper returns.
>>>
>>> Yeah, it is more efficient to directly write if possible.  I think this set
>>> allows the direct reading and writing already through data_meta (as a _u8 *).
>> 
>> Yup, totally fine with just keeping that capability :)
>> 
>>>> And yeah, this will allow the XDP program to inject arbitrary metadata
>>>> into the netstack; but it can already inject arbitrary *packet* data
>>>> into the stack, so not sure if this is much of an additional risk? If it
>>>> does lead to trivial crashes, we should probably harden the stack
>>>> against that?
>>>>
>>>> As for the random number, Jesper and I discussed replacing this with the
>>>> same BTF-ID scheme that he was using in his patch series. I.e., instead
>>>> of just putting in a random number, we insert the BTF ID of the metadata
>>>> struct at the end of it. This will allow us to support multiple
>>>> different formats in the future (not just changing the layout, but
>>>> having multiple simultaneous formats in the same kernel image), in case
>>>> we run out of space.
>>>
>>> This seems a bit hypothetical.  How much headroom does it usually have for the
>>> xdp prog?  Potentially the hints can use all the remaining space left after the
>>> header encap and the current bpf_xdp_adjust_meta() usage?
>> 
>> For the metadata consumed by the stack right now it's a bit
>> hypothetical, yeah. However, there's a bunch of metadata commonly
>> supported by hardware that the stack currently doesn't consume and that
>> hopefully this feature will end up making more accessible. My hope is
>> that the stack can also learn how to use this in the future, in which
>> case we may run out of space. So I think of that bit mostly as
>> future-proofing...
>
> ic. in this case, Can the btf_id be added to 'struct xdp_to_skb_metadata' later 
> if it is indeed needed?  The 'struct xdp_to_skb_metadata' is not in UAPI and 
> doing it with CO-RE is to give us flexibility to make this kind of changes in 
> the future.

My worry is mostly that it'll be more painful to add it later than just
including it from the start, mostly because of AF_XDP users. But if we
do the randomisation thing (thus forcing AF_XDP users to deal with the
dynamic layout as well), it should be possible to add it later, and I
can live with that option as well...

>>>> We should probably also have a flag set on the xdp_frame so the stack
>>>> knows that the metadata area contains relevant-to-skb data, to guard
>>>> against an XDP program accidentally hitting the "magic number" (BTF_ID)
>>>> in unrelated stuff it puts into the metadata area.
>>>
>>> Yeah, I think having a flag is useful.  The flag will be set at xdp_buff and
>>> then transfer to the xdp_frame?
>> 
>> Yeah, exactly!
>> 
>>>>> After re-reading patch 6, have another question. The 'void
>>>>> bpf_xdp_metadata_export_to_skb();' function signature. Should it at
>>>>> least return ok/err? or even return a 'struct xdp_to_skb_metadata *'
>>>>> pointer and the xdp prog can directly read (or even write) it?
>>>>
>>>> Hmm, I'm not sure returning a failure makes sense? Failure to read one
>>>> or more fields just means that those fields will not be populated? We
>>>> should probably have a flags field inside the metadata struct itself to
>>>> indicate which fields are set or not, but I'm not sure returning an
>>>> error value adds anything? Returning a pointer to the metadata field
>>>> might be convenient for users (it would just be an alias to the
>>>> data_meta pointer, but the verifier could know its size, so the program
>>>> doesn't have to bounds check it).
>>>
>>> If some hints are not available, those hints should be initialized to
>>> 0/CHECKSUM_NONE/...etc.
>> 
>> The problem with that is that then we have to spend cycles writing
>> eight bytes of zeroes into the checksum field :)
>
> With a common 'struct xdp_to_skb_metadata', I am not sure how some of these zero 
> writes can be avoided.  If the xdp prog wants to optimize, it can call 
> individual kfunc to get individual hints.

Erm, we just... don't write those fields? Something like:

void write_skb_meta(hw, ctx) {
  struct xdp_skb_metadata *meta = ctx->data_meta - sizeof(struct xdp_skb_metadata);
  meta->valid_fields = 0;

  if (hw_has_timestamp) {
    meta->timestamp = hw->timestamp;
    meta->valid_fields |= FIELD_TIMESTAMP;
  } /* otherwise meta->timestamp is just uninitialised */

  if (hw_has_rxhash) {
    meta->rxhash = hw->rxhash;
    meta->valid_fields |= FIELD_RXHASH;
  } /* otherwise meta->rxhash is just uninitialised */
  ...etc...
}
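
On the consumer side, the same bitmask tells the reader which fields
to trust. A minimal sketch, assuming the hypothetical field names and
flag bits from the pseudocode above (none of this is an existing
kernel API):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits and layout, mirroring the pseudocode above. */
#define FIELD_TIMESTAMP (1u << 0)
#define FIELD_RXHASH    (1u << 1)

struct xdp_skb_metadata {
	uint32_t valid_fields;	/* bitmask of FIELD_* bits that were written */
	uint32_t rxhash;
	uint64_t timestamp;
};

/* Copy the RX hash out only if the producer marked it valid. */
static bool meta_get_rxhash(const struct xdp_skb_metadata *meta, uint32_t *hash)
{
	if (!(meta->valid_fields & FIELD_RXHASH))
		return false;	/* never written, contents are undefined */
	*hash = meta->rxhash;
	return true;
}

/* Copy the timestamp out only if the producer marked it valid. */
static bool meta_get_timestamp(const struct xdp_skb_metadata *meta, uint64_t *ts)
{
	if (!(meta->valid_fields & FIELD_TIMESTAMP))
		return false;	/* never written, contents are undefined */
	*ts = meta->timestamp;
	return true;
}
```

This also covers fields like vlan_tci that have no in-band "invalid"
value: validity lives in the bitmask, not in the field.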

>>> The xdp prog needs a direct way to tell hard failure when it cannot
>>> write the meta area because of not enough space. Comparing
>>> xdp->data_meta with xdp->data as a side effect is not intuitive.
>> 
>> Yeah, hence a flags field so we can just see if setting each field
>> succeeded?
>
> How testing a flag is different from checking 0/invalid-value of a
> field?

The flags field is smaller, so we write fewer bytes if not all fields
are populated.

> or some fields just don't have an invalid value to check for
> like vlan_tci?

Yeah, that could also be an issue.

> You meant a flags field as a return value or in the 'struct xdp_to_skb_metadata' ?

See above.


>> 
>>> It is more than saving the bound check.  With type info of 'struct
>>> xdp_to_skb_metadata *', the verifier can do more checks like reading in the
>>> middle of an integer member.  The verifier could also limit write access only to
>>> a few struct's members if it is needed.
>>>
>>> The returning 'struct xdp_to_skb_metadata *' should not be an alias to the
>>> xdp->data_meta.  They should actually point to different locations in the
>>> headroom.  bpf_xdp_metadata_export_to_skb() sets a flag in xdp_buff.
>>> xdp->data_meta won't be changed and keeps pointing to the last
>>> bpf_xdp_adjust_meta() location.  The kernel will know if there is
>>> xdp_to_skb_metadata before the xdp->data_meta when that bit is set in the
>>> xdp_{buff,frame}.  Would it work?
>> 
>> Hmm, logically splitting the program metadata and the xdp_hints metadata
>> (but having them share the same area) *could* work, I guess, I'm just
>> not sure it's worth the extra complexity?
>
> It shouldn't stop the existing xdp prog writing its own metadata from using the 
> new bpf_xdp_metadata_export_to_skb().

Right, I think I see what you mean, and I agree that splitting it
internally in the kernel does make sense (see my other reply to
Stanislav).

>> 
>>>>> A related question, why 'struct xdp_to_skb_metadata' needs
>>>>> __randomize_layout?
>>>>
>>>> The __randomize_layout thing is there to force BPF programs to use CO-RE
>>>> to access the field. This is to avoid the struct layout accidentally
>>>> ossifying because people in practice rely on a particular layout, even
>>>> though we tell them to use CO-RE. There are lots of examples of this
>>>> happening in other domains (IP header options, TCP options, etc), and
>>>> __randomize_layout seemed like a neat trick to enforce CO-RE usage :)
>>>
>>> I am not sure if it is necessary or helpful to only enforce __randomize_layout
>>> in 'struct xdp_to_skb_metadata'.  There are other CO-RE use cases (tracing and
>>> non tracing) that already have direct access (reading and/or writing) to other
>>> kernel structures.
>>>
>>> It is more important for the verifier to see the xdp prog accessing it as a
>>> 'struct xdp_to_skb_metadata *' instead of xdp->data_meta which is a __u8 * so
>>> that the verifier can enforce the rules of access.
>> 
>> That only works inside the kernel, though. Since the metadata field can
>> be copied wholesale to AF_XDP, having it randomized forces userspace
>> consumers to also write code to deal with the layout being dynamic...
>
> hm... I still don't see how useful it is, in particular you mentioned
> the libxdp will install a xdp prog to write this default format
> (xdp_to_skb_metadata) and likely libxdp will also provide some helpers
> to parse the xdp_to_skb_metadata and the libxdp user should not need
> to know if CO-RE is used or not. Considering it is a kernel internal
> struct, I think it is fine to keep it and can be revisited later if
> needed. Lets get on to other things first :)

Well, if it was just kernel-internal, sure. But we're also exporting it
to userspace (through AF_XDP). Well-behaved users will obviously do the
right thing and use CO-RE. I'm trying to guard against users just
blindly copy-pasting the struct definition from the kernel and doing a
static cast, then complaining about their code breaking the first time
we change the struct layout. Sadly, I fully expect that there will be
people who do exactly that unless we actively make sure it doesn't work
from the get-go. And since the randomisation is literally just adding
__randomize_layout to the struct definition, I'd rather start out with
having it, and removing it later if it turns out not to be needed... :)

-Toke


^ permalink raw reply	[flat|nested] 66+ messages in thread

* [xdp-hints] Re: [RFC bpf-next v2 06/14] xdp: Carry over xdp metadata into skb context
  2022-11-10 23:14                         ` Toke Høiland-Jørgensen
@ 2022-11-10 23:52                           ` Stanislav Fomichev
  2022-11-11  0:10                             ` Toke Høiland-Jørgensen
  2022-11-11  0:33                             ` Martin KaFai Lau
  0 siblings, 2 replies; 66+ messages in thread
From: Stanislav Fomichev @ 2022-11-10 23:52 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Martin KaFai Lau, ast, daniel, andrii, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev, bpf

On Thu, Nov 10, 2022 at 3:14 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Skipping to the last bit:
>
> >> >> >    } else {
> >> >> >      use kfuncs
> >> >> >    }
> >> >> >
> >> >> > 5. Support the case where we keep program's metadata and kernel's
> >> >> > xdp_to_skb_metadata
> >> >> >    - skb_metadata_import_from_xdp() will "consume" it by mem-moving the
> >> >> > rest of the metadata over it and adjusting the headroom
> >> >>
> >> >> I was thinking the kernel's xdp_to_skb_metadata is always before the program's
> >> >> metadata.  xdp prog should usually work in this order also: read/write headers,
> >> >> write its own metadata, call bpf_xdp_metadata_export_to_skb(), and return
> >> >> XDP_PASS/XDP_REDIRECT.  When it is XDP_PASS, the kernel just needs to pop the
> >> >> xdp_to_skb_metadata and pass the remaining program's metadata to the bpf-tc.
> >> >>
> >> >> For the kernel and xdp prog, I don't think it matters where the
> >> >> xdp_to_skb_metadata is.  However, the xdp->data_meta (program's metadata) has to
> >> >> be before xdp->data because of the current data_meta and data comparison usage
> >> >> in the xdp prog.
> >> >>
> >> >> The order of the kernel's xdp_to_skb_metadata and the program's metadata
> >> >> probably only matters to the userspace AF_XDP.  However, I don't see how AF_XDP
> >> >> supports the program's metadata now.  afaict, it can only work now if there is
> >> >> some sort of contract between them or the AF_XDP currently does not use the
> >> >> program's metadata.  Either way, we can do the mem-moving only for AF_XDP and it
> >> >> should be a no op if there is no program's metadata?  This behavior could also
> >> >> be configurable through setsockopt?
> >> >
> >> > Agreed on all of the above. For now it seems like the safest thing to
> >> > do is to put xdp_to_skb_metadata last to allow af_xdp to properly
> >> > locate btf_id.
> >> > Let's see if Toke disagrees :-)
> >>
> >> As I replied to Martin, I'm not sure it's worth the complexity to
> >> logically split the SKB metadata from the program's own metadata (as
> >> opposed to just reusing the existing data_meta pointer)?
> >
> > I'd gladly keep my current requirement where it's either or, but not both :-)
> > We can relax it later if required?
>
> So the way I've been thinking about it is simply that the skb_metadata
> would live in the same place at the data_meta pointer (including
> adjusting that pointer to accommodate it), and just overriding the
> existing program metadata, if any exists. But looking at it now, I guess
> having the split makes it easier for a program to write its own custom
> metadata and still use the skb metadata. See below about the ordering.
>
> >> However, if we do, the layout that makes most sense to me is putting the
> >> skb metadata before the program metadata, like:
> >>
> >> --------------
> >> | skb_metadata
> >> --------------
> >> | data_meta
> >> --------------
> >> | data
> >> --------------
> >>
> >> Not sure if that's what you meant? :)
> >
> > I was suggesting the other way around: |custom meta|skb_metadata|data|
> > (but, as Martin points out, consuming skb_metadata in the kernel
> > becomes messier)
> >
> > af_xdp can check whether skb_metadata is present by looking at data -
> > offsetof(struct skb_metadata, btf_id).
> > progs that know how to handle custom metadata, will look at data -
> > sizeof(skb_metadata)
> >
> > Otherwise, if it's the other way around, how do we find skb_metadata
> > in a redirected frame?
> > Let's say we have |skb_metadata|custom meta|data|, how does the final
> > program find skb_metadata?
> > All the progs have to agree on the sizeof(tc/custom meta), right?
>
> Erm, maybe I'm missing something here, but skb_metadata is fixed size,
> right? So if the "skb_metadata is present" flag is set, we know that the
> sizeof(skb_metadata) bytes before the data_meta pointer contains the
> metadata, and if the flag is not set, we know those bytes are not valid
> metadata.
>
> For AF_XDP, we'd need to transfer the flag as well, and it could apply
> the same logic (getting the size from the vmlinux BTF).
>
> By this logic, the BTF_ID should be the *first* entry of struct
> skb_metadata, since that will be the field AF_XDP programs can find
> right off the bat, no?

The problem with AF_XDP is that, IIUC, it doesn't have a data_meta
pointer in the userspace.

You get an rx descriptor where the address points to the 'data':
| 256 bytes headroom where metadata can go | data |

So you have (at most) 256 bytes of headroom, some of that might be the
metadata, but you really don't know where it starts. But you know it
definitely ends where the data begins.

So if we have the following, we can locate skb_metadata:
| 256-sizeof(skb_metadata) headroom | custom metadata | skb_metadata | data |
data - sizeof(skb_metadata) will get you there

But if it's the other way around, the program has to know
sizeof(custom metadata) to locate skb_metadata:
| 256-sizeof(skb_metadata) headroom | skb_metadata | custom metadata | data |

Am I missing something here?
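
A minimal userspace sketch of the two lookups above. The struct layout
and both helper names are hypothetical; the real struct is
kernel-internal and its size would have to come from BTF:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical fixed-size layout, for illustration only. */
struct skb_metadata {
	uint64_t rx_timestamp;
	uint32_t rx_hash;
	uint32_t btf_id;
};

/*
 * Layout: | headroom | custom metadata | skb_metadata | data |
 * Only the data pointer from the rx descriptor is needed.
 */
static struct skb_metadata read_skb_meta_tail(const uint8_t *data)
{
	struct skb_metadata meta;

	/* memcpy avoids an unaligned access through a cast pointer */
	memcpy(&meta, data - sizeof(meta), sizeof(meta));
	return meta;
}

/*
 * Layout: | headroom | skb_metadata | custom metadata | data |
 * The consumer must also know custom_len out of band.
 */
static struct skb_metadata read_skb_meta_before_custom(const uint8_t *data,
							size_t custom_len)
{
	struct skb_metadata meta;

	memcpy(&meta, data - custom_len - sizeof(meta), sizeof(meta));
	return meta;
}
```

The second helper only works if every producer and consumer agree on
custom_len, which is the coordination problem described above.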

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [xdp-hints] Re: [RFC bpf-next v2 06/14] xdp: Carry over xdp metadata into skb context
  2022-11-10 18:52                       ` Stanislav Fomichev
  2022-11-10 23:14                         ` Toke Høiland-Jørgensen
@ 2022-11-10 23:58                         ` Martin KaFai Lau
  2022-11-11  0:20                           ` Stanislav Fomichev
  1 sibling, 1 reply; 66+ messages in thread
From: Martin KaFai Lau @ 2022-11-10 23:58 UTC (permalink / raw)
  To: Stanislav Fomichev, Toke Høiland-Jørgensen
  Cc: ast, daniel, andrii, song, yhs, john.fastabend, kpsingh, haoluo,
	jolsa, David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev, bpf

On 11/10/22 10:52 AM, Stanislav Fomichev wrote:
>>> Oh, that seems even better than returning it from
>>> bpf_xdp_metadata_export_to_skb.
>>> bpf_xdp_metadata_export_to_skb can return true/false and the rest goes
>>> via default verifier ctx resolution mechanism..
>>> (returning ptr from a kfunc seems to be a bit complicated right now)

What is the complication in returning ptr from a kfunc?  I want to see if it can 
be solved because it will be an easier API to use instead of calling another 
kfunc to get the ptr.

>> See my response to John in the other thread about mixing stable UAPI (in
>> xdp_md) and unstable BTF structures in the xdp_md struct: I think this
>> is confusing and would prefer a kfunc.

hmm... right, it will be one of the first considering the current __bpf_md_ptr() 
usages are only exposing another struct in uapi, eg. __bpf_md_ptr(struct 
bpf_sock *, sk).

A kfunc will also be fine.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* [xdp-hints] Re: [RFC bpf-next v2 06/14] xdp: Carry over xdp metadata into skb context
  2022-11-10 23:52                           ` Stanislav Fomichev
@ 2022-11-11  0:10                             ` Toke Høiland-Jørgensen
  2022-11-11  0:45                               ` Martin KaFai Lau
  2022-11-11  0:33                             ` Martin KaFai Lau
  1 sibling, 1 reply; 66+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-11-11  0:10 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Martin KaFai Lau, ast, daniel, andrii, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev, bpf

Stanislav Fomichev <sdf@google.com> writes:

> On Thu, Nov 10, 2022 at 3:14 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Skipping to the last bit:
>>
>> >> >> >    } else {
>> >> >> >      use kfuncs
>> >> >> >    }
>> >> >> >
>> >> >> > 5. Support the case where we keep program's metadata and kernel's
>> >> >> > xdp_to_skb_metadata
>> >> >> >    - skb_metadata_import_from_xdp() will "consume" it by mem-moving the
>> >> >> > rest of the metadata over it and adjusting the headroom
>> >> >>
>> >> >> I was thinking the kernel's xdp_to_skb_metadata is always before the program's
>> >> >> metadata.  xdp prog should usually work in this order also: read/write headers,
>> >> >> write its own metadata, call bpf_xdp_metadata_export_to_skb(), and return
>> >> >> XDP_PASS/XDP_REDIRECT.  When it is XDP_PASS, the kernel just needs to pop the
>> >> >> xdp_to_skb_metadata and pass the remaining program's metadata to the bpf-tc.
>> >> >>
>> >> >> For the kernel and xdp prog, I don't think it matters where the
>> >> >> xdp_to_skb_metadata is.  However, the xdp->data_meta (program's metadata) has to
>> >> >> be before xdp->data because of the current data_meta and data comparison usage
>> >> >> in the xdp prog.
>> >> >>
>> >> >> The order of the kernel's xdp_to_skb_metadata and the program's metadata
>> >> >> probably only matters to the userspace AF_XDP.  However, I don't see how AF_XDP
>> >> >> supports the program's metadata now.  afaict, it can only work now if there is
>> >> >> some sort of contract between them or the AF_XDP currently does not use the
>> >> >> program's metadata.  Either way, we can do the mem-moving only for AF_XDP and it
>> >> >> should be a no op if there is no program's metadata?  This behavior could also
>> >> >> be configurable through setsockopt?
>> >> >
>> >> > Agreed on all of the above. For now it seems like the safest thing to
>> >> > do is to put xdp_to_skb_metadata last to allow af_xdp to properly
>> >> > locate btf_id.
>> >> > Let's see if Toke disagrees :-)
>> >>
>> >> As I replied to Martin, I'm not sure it's worth the complexity to
>> >> logically split the SKB metadata from the program's own metadata (as
>> >> opposed to just reusing the existing data_meta pointer)?
>> >
>> > I'd gladly keep my current requirement where it's either or, but not both :-)
>> > We can relax it later if required?
>>
>> So the way I've been thinking about it is simply that the skb_metadata
>> would live in the same place at the data_meta pointer (including
>> adjusting that pointer to accommodate it), and just overriding the
>> existing program metadata, if any exists. But looking at it now, I guess
>> having the split makes it easier for a program to write its own custom
>> metadata and still use the skb metadata. See below about the ordering.
>>
>> >> However, if we do, the layout that makes most sense to me is putting the
>> >> skb metadata before the program metadata, like:
>> >>
>> >> --------------
>> >> | skb_metadata
>> >> --------------
>> >> | data_meta
>> >> --------------
>> >> | data
>> >> --------------
>> >>
>> >> Not sure if that's what you meant? :)
>> >
>> > I was suggesting the other way around: |custom meta|skb_metadata|data|
>> > (but, as Martin points out, consuming skb_metadata in the kernel
>> > becomes messier)
>> >
>> > af_xdp can check whether skb_metadata is present by looking at data -
>> > offsetof(struct skb_metadata, btf_id).
>> > progs that know how to handle custom metadata, will look at data -
>> > sizeof(skb_metadata)
>> >
>> > Otherwise, if it's the other way around, how do we find skb_metadata
>> > in a redirected frame?
>> > Let's say we have |skb_metadata|custom meta|data|, how does the final
>> > program find skb_metadata?
>> > All the progs have to agree on the sizeof(tc/custom meta), right?
>>
>> Erm, maybe I'm missing something here, but skb_metadata is fixed size,
>> right? So if the "skb_metadata is present" flag is set, we know that the
>> sizeof(skb_metadata) bytes before the data_meta pointer contains the
>> metadata, and if the flag is not set, we know those bytes are not valid
>> metadata.
>>
>> For AF_XDP, we'd need to transfer the flag as well, and it could apply
>> the same logic (getting the size from the vmlinux BTF).
>>
>> By this logic, the BTF_ID should be the *first* entry of struct
>> skb_metadata, since that will be the field AF_XDP programs can find
>> right off the bat, no?
>
> The problem with AF_XDP is that, IIUC, it doesn't have a data_meta
> pointer in the userspace.
>
> You get an rx descriptor where the address points to the 'data':
> | 256 bytes headroom where metadata can go | data |

Ah, I was missing the bit where the data pointer actually points at
data, not the start of the buf. Oops, my bad!

> So you have (at most) 256 bytes of headroom, some of that might be the
> metadata, but you really don't know where it starts. But you know it
> definitely ends where the data begins.
>
> So if we have the following, we can locate skb_metadata:
> | 256-sizeof(skb_metadata) headroom | custom metadata | skb_metadata | data |
> data - sizeof(skb_metadata) will get you there
>
> But if it's the other way around, the program has to know
> sizeof(custom metadata) to locate skb_metadata:
> | 256-sizeof(skb_metadata) headroom | skb_metadata | custom metadata | data |
>
> Am I missing something here?

Hmm, so one could argue that the only way AF_XDP can consume custom
metadata today is if it knows out of band what the size of it is. And if
it knows that, it can just skip over it to go back to the skb_metadata,
no?

The only problem left then is if there were multiple XDP programs called
in sequence (whether before a redirect, or by libxdp chaining or tail
calls), and the first one resized the metadata area without the last one
knowing about it. For this, we could add a CLOBBER_PROGRAM_META flag to
the skb_metadata helper which if set will ensure that the program
metadata length is reset to 0?
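
A rough sketch of what such a flag could do, with the headroom modelled
as plain offsets. Only the flag name comes from the proposal above; the
struct, the helper and the 16-byte size are hypothetical placeholders:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag value; only the name comes from the proposal above. */
#define CLOBBER_PROGRAM_META (1u << 0)
/* Placeholder for sizeof(struct skb_metadata). */
#define SKB_METADATA_SIZE 16u

/* Offsets into the frame buffer, modelled as plain integers. */
struct meta_area {
	size_t data;		/* offset of the packet data */
	size_t data_meta;	/* offset of program metadata (== data if empty) */
	size_t skb_meta;	/* offset of skb metadata, set by the export */
};

/* Place the skb metadata directly before the program metadata. */
static void export_to_skb(struct meta_area *m, uint32_t flags)
{
	if (flags & CLOBBER_PROGRAM_META)
		m->data_meta = m->data;	/* program metadata length becomes 0 */
	m->skb_meta = m->data_meta - SKB_METADATA_SIZE;
}
```

With the flag set, the skb metadata always ends exactly where the data
begins, so a consumer that only has the data pointer can still find it.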

-Toke


^ permalink raw reply	[flat|nested] 66+ messages in thread

* [xdp-hints] Re: [RFC bpf-next v2 06/14] xdp: Carry over xdp metadata into skb context
  2022-11-10 23:58                         ` Martin KaFai Lau
@ 2022-11-11  0:20                           ` Stanislav Fomichev
  0 siblings, 0 replies; 66+ messages in thread
From: Stanislav Fomichev @ 2022-11-11  0:20 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Toke Høiland-Jørgensen, ast, daniel, andrii, song, yhs,
	john.fastabend, kpsingh, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev, bpf

On Thu, Nov 10, 2022 at 3:58 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 11/10/22 10:52 AM, Stanislav Fomichev wrote:
> >>> Oh, that seems even better than returning it from
> >>> bpf_xdp_metadata_export_to_skb.
> >>> bpf_xdp_metadata_export_to_skb can return true/false and the rest goes
> >>> via default verifier ctx resolution mechanism..
> >>> (returning ptr from a kfunc seems to be a bit complicated right now)
>
> What is the complication in returning ptr from a kfunc?  I want to see if it can
> be solved because it will be an easier API to use instead of calling another
> kfunc to get the ptr.

At least for returning a pointer to the basic C types like int/long, I
was hitting the following:

commit eb1f7f71c126c8fd50ea81af98f97c4b581ea4ae
Author:     Benjamin Tissoires <benjamin.tissoires@redhat.com>
AuthorDate: Tue Sep 6 17:13:02 2022 +0200
Commit:     Alexei Starovoitov <ast@kernel.org>
CommitDate: Wed Sep 7 11:05:17 2022 -0700

    bpf/verifier: allow kfunc to return an allocated mem

Where I had to add an extra "const int rdonly_buf_size" argument to
the kfunc to make the verifier happy.
But I haven't really looked at what would happen with the pointers to
structs (or why, at all, we have this restriction).


> >> See my response to John in the other thread about mixing stable UAPI (in
> >> xdp_md) and unstable BTF structures in the xdp_md struct: I think this
> >> is confusing and would prefer a kfunc.
>
> hmm... right, it will be one of the first considering the current __bpf_md_ptr()
> usages are only exposing another struct in uapi, eg. __bpf_md_ptr(struct
> bpf_sock *, sk).
>
> A kfunc will also be fine.
>

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [xdp-hints] Re: [RFC bpf-next v2 06/14] xdp: Carry over xdp metadata into skb context
  2022-11-10 23:52                           ` Stanislav Fomichev
  2022-11-11  0:10                             ` Toke Høiland-Jørgensen
@ 2022-11-11  0:33                             ` Martin KaFai Lau
  2022-11-11  0:57                               ` Stanislav Fomichev
  1 sibling, 1 reply; 66+ messages in thread
From: Martin KaFai Lau @ 2022-11-11  0:33 UTC (permalink / raw)
  To: Stanislav Fomichev, Toke Høiland-Jørgensen
  Cc: ast, daniel, andrii, song, yhs, john.fastabend, kpsingh, haoluo,
	jolsa, David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev, bpf

On 11/10/22 3:52 PM, Stanislav Fomichev wrote:
> On Thu, Nov 10, 2022 at 3:14 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Skipping to the last bit:
>>
>>>>>>>     } else {
>>>>>>>       use kfuncs
>>>>>>>     }
>>>>>>>
>>>>>>> 5. Support the case where we keep program's metadata and kernel's
>>>>>>> xdp_to_skb_metadata
>>>>>>>     - skb_metadata_import_from_xdp() will "consume" it by mem-moving the
>>>>>>> rest of the metadata over it and adjusting the headroom
>>>>>>
>>>>>> I was thinking the kernel's xdp_to_skb_metadata is always before the program's
>>>>>> metadata.  xdp prog should usually work in this order also: read/write headers,
>>>>>> write its own metadata, call bpf_xdp_metadata_export_to_skb(), and return
>>>>>> XDP_PASS/XDP_REDIRECT.  When it is XDP_PASS, the kernel just needs to pop the
>>>>>> xdp_to_skb_metadata and pass the remaining program's metadata to the bpf-tc.
>>>>>>
>>>>>> For the kernel and xdp prog, I don't think it matters where the
>>>>>> xdp_to_skb_metadata is.  However, the xdp->data_meta (program's metadata) has to
>>>>>> be before xdp->data because of the current data_meta and data comparison usage
>>>>>> in the xdp prog.
>>>>>>
>>>>>> The order of the kernel's xdp_to_skb_metadata and the program's metadata
>>>>>> probably only matters to the userspace AF_XDP.  However, I don't see how AF_XDP
>>>>>> supports the program's metadata now.  afaict, it can only work now if there is
>>>>>> some sort of contract between them or the AF_XDP currently does not use the
>>>>>> program's metadata.  Either way, we can do the mem-moving only for AF_XDP and it
>>>>>> should be a no op if there is no program's metadata?  This behavior could also
>>>>>> be configurable through setsockopt?
>>>>>
>>>>> Agreed on all of the above. For now it seems like the safest thing to
>>>>> do is to put xdp_to_skb_metadata last to allow af_xdp to properly
>>>>> locate btf_id.
>>>>> Let's see if Toke disagrees :-)
>>>>
>>>> As I replied to Martin, I'm not sure it's worth the complexity to
>>>> logically split the SKB metadata from the program's own metadata (as
>>>> opposed to just reusing the existing data_meta pointer)?
>>>
>>> I'd gladly keep my current requirement where it's either or, but not both :-)
>>> We can relax it later if required?
>>
>> So the way I've been thinking about it is simply that the skb_metadata
>> would live in the same place at the data_meta pointer (including
>> adjusting that pointer to accommodate it), and just overriding the
>> existing program metadata, if any exists. But looking at it now, I guess
>> having the split makes it easier for a program to write its own custom
>> metadata and still use the skb metadata. See below about the ordering.
>>
>>>> However, if we do, the layout that makes most sense to me is putting the
>>>> skb metadata before the program metadata, like:
>>>>
>>>> --------------
>>>> | skb_metadata
>>>> --------------
>>>> | data_meta
>>>> --------------
>>>> | data
>>>> --------------
>>>>

Yeah, for the kernel and xdp prog (ie not AF_XDP), I meant this:

| skb_metadata | custom metadata | data |

>>>> Not sure if that's what you meant? :)
>>>
>>> I was suggesting the other way around: |custom meta|skb_metadata|data|
>>> (but, as Martin points out, consuming skb_metadata in the kernel
>>> becomes messier)
>>>
>>> af_xdp can check whether skb_metadata is present by looking at data -
>>> offsetof(struct skb_metadata, btf_id).
>>> progs that know how to handle custom metadata, will look at data -
>>> sizeof(skb_metadata)
>>>
>>> Otherwise, if it's the other way around, how do we find skb_metadata
>>> in a redirected frame?
>>> Let's say we have |skb_metadata|custom meta|data|, how does the final
>>> program find skb_metadata?
>>> All the progs have to agree on the sizeof(tc/custom meta), right?
>>
>> Erm, maybe I'm missing something here, but skb_metadata is fixed size,
>> right? So if the "skb_metadata is present" flag is set, we know that the
>> sizeof(skb_metadata) bytes before the data_meta pointer contains the
>> metadata, and if the flag is not set, we know those bytes are not valid
>> metadata.

right, so to get to the skb_metadata, it will be
data_meta -= sizeof(skb_metadata);  /* probably need alignment */

>>
>> For AF_XDP, we'd need to transfer the flag as well, and it could apply
>> the same logic (getting the size from the vmlinux BTF).
>>
>> By this logic, the BTF_ID should be the *first* entry of struct
>> skb_metadata, since that will be the field AF_XDP programs can find
>> right off the bat, no?
>
> The problem with AF_XDP is that, IIUC, it doesn't have a data_meta
> pointer in the userspace.

Yep. It is my understanding also.  Missing data_meta pointer in the AF_XDP 
rx_desc is a potential problem.  Having BTF_ID or not won't help.

> 
> You get an rx descriptor where the address points to the 'data':
> | 256 bytes headroom where metadata can go | data |
> 
> So you have (at most) 256 bytes of headroom, some of that might be the
> metadata, but you really don't know where it starts. But you know it
> definitely ends where the data begins.
> 
> So if we have the following, we can locate skb_metadata:
> | 256-sizeof(skb_metadata) headroom | custom metadata | skb_metadata | data |
> data - sizeof(skb_metadata) will get you there
> 
> But if it's the other way around, the program has to know
> sizeof(custom metadata) to locate skb_metadata:
> | 256-sizeof(skb_metadata) headroom | skb_metadata | custom metadata | data |

Right, this won't work if the AF_XDP user does not know how big the custom 
metadata is.  The kernel then needs to swap the "skb_metadata" and "custom
metadata" and set a flag in the AF_XDP rx_desc->options so that it looks like
this:
| custom metadata | skb_metadata | data |

However, since data_meta is missing from the rx_desc, maybe we can safely
assume the AF_XDP user always knows the size of the custom metadata or there is 
usually no "custom metadata" and no swap is needed?
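
The two layouts discussed above can be sketched in plain userspace C. This is
only an illustration under assumptions: the fields of struct skb_metadata and
the helper names read_meta_last()/read_meta_first() are hypothetical, and
"data" stands for the address an AF_XDP rx descriptor points at. It shows why
the second layout needs the custom metadata size to be known out of band.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the kernel's fixed-size hints struct;
 * the real field list is not settled in this thread. */
struct skb_metadata {
	uint64_t rx_timestamp;
	uint32_t rx_hash;
	uint32_t btf_id;
};

/* Layout: | headroom | custom metadata | skb_metadata | data |
 * The hints end exactly where the data begins, so only the data
 * pointer is needed. memcpy avoids assuming any alignment. */
static void read_meta_last(const uint8_t *data, struct skb_metadata *out)
{
	memcpy(out, data - sizeof(*out), sizeof(*out));
}

/* Layout: | headroom | skb_metadata | custom metadata | data |
 * The reader must additionally know custom_len out of band. */
static void read_meta_first(const uint8_t *data, size_t custom_len,
			    struct skb_metadata *out)
{
	memcpy(out, data - custom_len - sizeof(*out), sizeof(*out));
}
```

With the first layout a consumer that ignores custom metadata still finds the
hints; with the second, a wrong custom_len silently reads the wrong bytes.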

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [xdp-hints] Re: [RFC bpf-next v2 06/14] xdp: Carry over xdp metadata into skb context
  2022-11-11  0:10                             ` Toke Høiland-Jørgensen
@ 2022-11-11  0:45                               ` Martin KaFai Lau
  2022-11-11  9:37                                 ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 66+ messages in thread
From: Martin KaFai Lau @ 2022-11-11  0:45 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Stanislav Fomichev
  Cc: ast, daniel, andrii, song, yhs, john.fastabend, kpsingh, haoluo,
	jolsa, David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev, bpf

On 11/10/22 4:10 PM, Toke Høiland-Jørgensen wrote:
>> The problem with AF_XDP is that, IIUC, it doesn't have a data_meta
>> pointer in the userspace.
>>
>> You get an rx descriptor where the address points to the 'data':
>> | 256 bytes headroom where metadata can go | data |
> 
> Ah, I was missing the bit where the data pointer actually points at
> data, not the start of the buf. Oops, my bad!
> 
>> So you have (at most) 256 bytes of headroom, some of that might be the
>> metadata, but you really don't know where it starts. But you know it
>> definitely ends where the data begins.
>>
>> So if we have the following, we can locate skb_metadata:
>> | 256-sizeof(skb_metadata) headroom | custom metadata | skb_metadata | data |
>> data - sizeof(skb_metadata) will get you there
>>
>> But if it's the other way around, the program has to know
>> sizeof(custom metadata) to locate skb_metadata:
>> | 256-sizeof(skb_metadata) headroom | skb_metadata | custom metadata | data |
>>
>> Am I missing something here?
> 
> Hmm, so one could argue that the only way AF_XDP can consume custom
> metadata today is if it knows out of band what the size of it is. And if
> it knows that, it can just skip over it to go back to the skb_metadata,
> no?

+1 I replied with a similar point in another email.  I also think we can safely 
assume this.

> 
> The only problem left then is if there were multiple XDP programs called
> in sequence (whether before a redirect, or by libxdp chaining or tail
> calls), and the first one resized the metadata area without the last one
> knowing about it. For this, we could add a CLOBBER_PROGRAM_META flag to
> the skb_metadata helper which if set will ensure that the program
> metadata length is reset to 0?

How is it different from the same xdp prog calling bpf_xdp_adjust_meta() and
bpf_xdp_metadata_export_to_skb() multiple times?  The earlier stored
skb_metadata needs to be moved during the later bpf_xdp_adjust_meta().  The
later bpf_xdp_metadata_export_to_skb() will overwrite the earlier skb_metadata.


^ permalink raw reply	[flat|nested] 66+ messages in thread

* [xdp-hints] Re: [RFC bpf-next v2 06/14] xdp: Carry over xdp metadata into skb context
  2022-11-11  0:33                             ` Martin KaFai Lau
@ 2022-11-11  0:57                               ` Stanislav Fomichev
  2022-11-11  1:26                                 ` Martin KaFai Lau
  0 siblings, 1 reply; 66+ messages in thread
From: Stanislav Fomichev @ 2022-11-11  0:57 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Toke Høiland-Jørgensen, ast, daniel, andrii, song, yhs,
	john.fastabend, kpsingh, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev, bpf

On Thu, Nov 10, 2022 at 4:33 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>
> On 11/10/22 3:52 PM, Stanislav Fomichev wrote:
> > On Thu, Nov 10, 2022 at 3:14 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >>
> >> Skipping to the last bit:
> >>
> >>>>>>>     } else {
> >>>>>>>       use kfuncs
> >>>>>>>     }
> >>>>>>>
> >>>>>>> 5. Support the case where we keep program's metadata and kernel's
> >>>>>>> xdp_to_skb_metadata
> >>>>>>>     - skb_metadata_import_from_xdp() will "consume" it by mem-moving the
> >>>>>>> rest of the metadata over it and adjusting the headroom
> >>>>>>
> >>>>>> I was thinking the kernel's xdp_to_skb_metadata is always before the program's
> >>>>>> metadata.  xdp prog should usually work in this order also: read/write headers,
> >>>>>> write its own metadata, call bpf_xdp_metadata_export_to_skb(), and return
> >>>>>> XDP_PASS/XDP_REDIRECT.  When it is XDP_PASS, the kernel just needs to pop the
> >>>>>> xdp_to_skb_metadata and pass the remaining program's metadata to the bpf-tc.
> >>>>>>
> >>>>>> For the kernel and xdp prog, I don't think it matters where the
> >>>>>> xdp_to_skb_metadata is.  However, the xdp->data_meta (program's metadata) has to
> >>>>>> be before xdp->data because of the current data_meta and data comparison usage
> >>>>>> in the xdp prog.
> >>>>>>
> >>>>>> The order of the kernel's xdp_to_skb_metadata and the program's metadata
> >>>>>> probably only matters to the userspace AF_XDP.  However, I don't see how AF_XDP
> >>>>>> supports the program's metadata now.  afaict, it can only work now if there is
> >>>>>> some sort of contract between them or the AF_XDP currently does not use the
> >>>>>> program's metadata.  Either way, we can do the mem-moving only for AF_XDP and it
> >>>>>> should be a no op if there is no program's metadata?  This behavior could also
> >>>>>> be configurable through setsockopt?
> >>>>>
> >>>>> Agreed on all of the above. For now it seems like the safest thing to
> >>>>> do is to put xdp_to_skb_metadata last to allow af_xdp to properly
> >>>>> locate btf_id.
> >>>>> Let's see if Toke disagrees :-)
> >>>>
> >>>> As I replied to Martin, I'm not sure it's worth the complexity to
> >>>> logically split the SKB metadata from the program's own metadata (as
> >>>> opposed to just reusing the existing data_meta pointer)?
> >>>
> >>> I'd gladly keep my current requirement where it's either or, but not both :-)
> >>> We can relax it later if required?
> >>
> >> So the way I've been thinking about it is simply that the skb_metadata
> >> would live in the same place at the data_meta pointer (including
> >> adjusting that pointer to accommodate it), and just overriding the
> >> existing program metadata, if any exists. But looking at it now, I guess
> >> having the split makes it easier for a program to write its own custom
> >> metadata and still use the skb metadata. See below about the ordering.
> >>
> >>>> However, if we do, the layout that makes most sense to me is putting the
> >>>> skb metadata before the program metadata, like:
> >>>>
> >>>> --------------
> >>>> | skb_metadata
> >>>> --------------
> >>>> | data_meta
> >>>> --------------
> >>>> | data
> >>>> --------------
> >>>>
>
> Yeah, for the kernel and xdp prog (ie not AF_XDP), I meant this:
>
> | skb_metadata | custom metadata | data |
>
> >>>> Not sure if that's what you meant? :)
> >>>
> >>> I was suggesting the other way around: |custom meta|skb_metadata|data|
> >>> (but, as Martin points out, consuming skb_metadata in the kernel
> >>> becomes messier)
> >>>
> >>> af_xdp can check whether skb_metadata is present by looking at data -
> >>> offsetof(struct skb_metadata, btf_id).
> >>> progs that know how to handle custom metadata, will look at data -
> >>> sizeof(skb_metadata)
> >>>
> >>> Otherwise, if it's the other way around, how do we find skb_metadata
> >>> in a redirected frame?
> >>> Let's say we have |skb_metadata|custom meta|data|, how does the final
> >>> program find skb_metadata?
> >>> All the progs have to agree on the sizeof(tc/custom meta), right?
> >>
> >> Erm, maybe I'm missing something here, but skb_metadata is fixed size,
> >> right? So if the "skb_metadata is present" flag is set, we know that the
> >> sizeof(skb_metadata) bytes before the data_meta pointer contains the
> >> metadata, and if the flag is not set, we know those bytes are not valid
> >> metadata.
>
> right, so to get to the skb_metadata, it will be
> data_meta -= sizeof(skb_metadata);  /* probably need alignment */
>
> >>
> >> For AF_XDP, we'd need to transfer the flag as well, and it could apply
> >> the same logic (getting the size from the vmlinux BTF).
> >>
> >> By this logic, the BTF_ID should be the *first* entry of struct
> >> skb_metadata, since that will be the field AF_XDP programs can find
> >> right off the bat, no?
> >
> > The problem with AF_XDP is that, IIUC, it doesn't have a data_meta
> > pointer in the userspace.
>
> Yep. It is my understanding also.  Missing data_meta pointer in the AF_XDP
> rx_desc is a potential problem.  Having BTF_ID or not won't help.
>
> >
> > You get an rx descriptor where the address points to the 'data':
> > | 256 bytes headroom where metadata can go | data |
> >
> > So you have (at most) 256 bytes of headroom, some of that might be the
> > metadata, but you really don't know where it starts. But you know it
> > definitely ends where the data begins.
> >
> > So if we have the following, we can locate skb_metadata:
> > | 256-sizeof(skb_metadata) headroom | custom metadata | skb_metadata | data |
> > data - sizeof(skb_metadata) will get you there
> >
> > But if it's the other way around, the program has to know
> > sizeof(custom metadata) to locate skb_metadata:
> > | 256-sizeof(skb_metadata) headroom | skb_metadata | custom metadata | data |
>
> Right, this won't work if the AF_XDP user does not know how big the custom
> metadata is.  The kernel then needs to swap the "skb_metadata" and "custom
> metadata" and set a flag in the AF_XDP rx_desc->options so that it looks like
> this:
> | custom metadata | skb_metadata | data |
>
> However, since data_meta is missing from the rx_desc, maybe we can safely
> assume the AF_XDP user always knows the size of the custom metadata or there is
> usually no "custom metadata" and no swap is needed?

If we can assume they can share that info, can they also share more
info on what kind of metadata they would prefer to get?
If they can agree on the size, maybe they also can agree on the flows
that need skb_metadata vs the flows that need a custom one?

Seems like we can start with supporting either one, but not both and
extend in the future once we have more understanding on whether it's
actually needed or not?

bpf_xdp_metadata_export_to_skb: adjust data meta, add uses-skb-metadata flag
bpf_xdp_adjust_meta: unconditionally reset uses-skb-metadata flag

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [xdp-hints] Re: [RFC bpf-next v2 06/14] xdp: Carry over xdp metadata into skb context
  2022-11-11  0:57                               ` Stanislav Fomichev
@ 2022-11-11  1:26                                 ` Martin KaFai Lau
  2022-11-11  9:41                                   ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 66+ messages in thread
From: Martin KaFai Lau @ 2022-11-11  1:26 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: Toke Høiland-Jørgensen, ast, daniel, andrii, song, yhs,
	john.fastabend, kpsingh, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev, bpf

On 11/10/22 4:57 PM, Stanislav Fomichev wrote:
> On Thu, Nov 10, 2022 at 4:33 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>
>> On 11/10/22 3:52 PM, Stanislav Fomichev wrote:
>>> On Thu, Nov 10, 2022 at 3:14 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>>>
>>>> Skipping to the last bit:
>>>>
>>>>>>>>>      } else {
>>>>>>>>>        use kfuncs
>>>>>>>>>      }
>>>>>>>>>
>>>>>>>>> 5. Support the case where we keep program's metadata and kernel's
>>>>>>>>> xdp_to_skb_metadata
>>>>>>>>>      - skb_metadata_import_from_xdp() will "consume" it by mem-moving the
>>>>>>>>> rest of the metadata over it and adjusting the headroom
>>>>>>>>
>>>>>>>> I was thinking the kernel's xdp_to_skb_metadata is always before the program's
>>>>>>>> metadata.  xdp prog should usually work in this order also: read/write headers,
>>>>>>>> write its own metadata, call bpf_xdp_metadata_export_to_skb(), and return
>>>>>>>> XDP_PASS/XDP_REDIRECT.  When it is XDP_PASS, the kernel just needs to pop the
>>>>>>>> xdp_to_skb_metadata and pass the remaining program's metadata to the bpf-tc.
>>>>>>>>
>>>>>>>> For the kernel and xdp prog, I don't think it matters where the
>>>>>>>> xdp_to_skb_metadata is.  However, the xdp->data_meta (program's metadata) has to
>>>>>>>> be before xdp->data because of the current data_meta and data comparison usage
>>>>>>>> in the xdp prog.
>>>>>>>>
>>>>>>>> The order of the kernel's xdp_to_skb_metadata and the program's metadata
>>>>>>>> probably only matters to the userspace AF_XDP.  However, I don't see how AF_XDP
>>>>>>>> supports the program's metadata now.  afaict, it can only work now if there is
>>>>>>>> some sort of contract between them or the AF_XDP currently does not use the
>>>>>>>> program's metadata.  Either way, we can do the mem-moving only for AF_XDP and it
>>>>>>>> should be a no op if there is no program's metadata?  This behavior could also
>>>>>>>> be configurable through setsockopt?
>>>>>>>
>>>>>>> Agreed on all of the above. For now it seems like the safest thing to
>>>>>>> do is to put xdp_to_skb_metadata last to allow af_xdp to properly
>>>>>>> locate btf_id.
>>>>>>> Let's see if Toke disagrees :-)
>>>>>>
>>>>>> As I replied to Martin, I'm not sure it's worth the complexity to
>>>>>> logically split the SKB metadata from the program's own metadata (as
>>>>>> opposed to just reusing the existing data_meta pointer)?
>>>>>
>>>>> I'd gladly keep my current requirement where it's either or, but not both :-)
>>>>> We can relax it later if required?
>>>>
>>>> So the way I've been thinking about it is simply that the skb_metadata
>>>> would live in the same place at the data_meta pointer (including
>>>> adjusting that pointer to accommodate it), and just overriding the
>>>> existing program metadata, if any exists. But looking at it now, I guess
>>>> having the split makes it easier for a program to write its own custom
>>>> metadata and still use the skb metadata. See below about the ordering.
>>>>
>>>>>> However, if we do, the layout that makes most sense to me is putting the
>>>>>> skb metadata before the program metadata, like:
>>>>>>
>>>>>> --------------
>>>>>> | skb_metadata
>>>>>> --------------
>>>>>> | data_meta
>>>>>> --------------
>>>>>> | data
>>>>>> --------------
>>>>>>
>>
>> Yeah, for the kernel and xdp prog (ie not AF_XDP), I meant this:
>>
>> | skb_metadata | custom metadata | data |
>>
>>>>>> Not sure if that's what you meant? :)
>>>>>
>>>>> I was suggesting the other way around: |custom meta|skb_metadata|data|
>>>>> (but, as Martin points out, consuming skb_metadata in the kernel
>>>>> becomes messier)
>>>>>
>>>>> af_xdp can check whether skb_metadata is present by looking at data -
>>>>> offsetof(struct skb_metadata, btf_id).
>>>>> progs that know how to handle custom metadata, will look at data -
>>>>> sizeof(skb_metadata)
>>>>>
>>>>> Otherwise, if it's the other way around, how do we find skb_metadata
>>>>> in a redirected frame?
>>>>> Let's say we have |skb_metadata|custom meta|data|, how does the final
>>>>> program find skb_metadata?
>>>>> All the progs have to agree on the sizeof(tc/custom meta), right?
>>>>
>>>> Erm, maybe I'm missing something here, but skb_metadata is fixed size,
>>>> right? So if the "skb_metadata is present" flag is set, we know that the
>>>> sizeof(skb_metadata) bytes before the data_meta pointer contains the
>>>> metadata, and if the flag is not set, we know those bytes are not valid
>>>> metadata.
>>
>> right, so to get to the skb_metadata, it will be
>> data_meta -= sizeof(skb_metadata);  /* probably need alignment */
>>
>>>>
>>>> For AF_XDP, we'd need to transfer the flag as well, and it could apply
>>>> the same logic (getting the size from the vmlinux BTF).
>>>>
>>>> By this logic, the BTF_ID should be the *first* entry of struct
>>>> skb_metadata, since that will be the field AF_XDP programs can find
>>>> right off the bat, no?
>>>
>>> The problem with AF_XDP is that, IIUC, it doesn't have a data_meta
>>> pointer in the userspace.
>>
>> Yep. It is my understanding also.  Missing data_meta pointer in the AF_XDP
>> rx_desc is a potential problem.  Having BTF_ID or not won't help.
>>
>>>
>>> You get an rx descriptor where the address points to the 'data':
>>> | 256 bytes headroom where metadata can go | data |
>>>
>>> So you have (at most) 256 bytes of headroom, some of that might be the
>>> metadata, but you really don't know where it starts. But you know it
>>> definitely ends where the data begins.
>>>
>>> So if we have the following, we can locate skb_metadata:
>>> | 256-sizeof(skb_metadata) headroom | custom metadata | skb_metadata | data |
>>> data - sizeof(skb_metadata) will get you there
>>>
>>> But if it's the other way around, the program has to know
>>> sizeof(custom metadata) to locate skb_metadata:
>>> | 256-sizeof(skb_metadata) headroom | skb_metadata | custom metadata | data |
>>
>> Right, this won't work if the AF_XDP user does not know how big the custom
>> metadata is.  The kernel then needs to swap the "skb_metadata" and "custom
>> metadata" and set a flag in the AF_XDP rx_desc->options so that it looks like
>> this:
>> | custom metadata | skb_metadata | data |
>>
>> However, since data_meta is missing from the rx_desc, maybe we can safely
>> assume the AF_XDP user always knows the size of the custom metadata or there is
>> usually no "custom metadata" and no swap is needed?
> 
> If we can assume they can share that info, can they also share more
> info on what kind of metadata they would prefer to get?
> If they can agree on the size, maybe they also can agree on the flows
> that need skb_metadata vs the flows that need a custom one?
> 
> Seems like we can start with supporting either one, but not both and
> extend in the future once we have more understanding on whether it's
> actually needed or not?
> 
> bpf_xdp_metadata_export_to_skb: adjust data meta, add uses-skb-metadata flag
> bpf_xdp_adjust_meta: unconditionally reset uses-skb-metadata flag

hmm... I am thinking:

bpf_xdp_adjust_meta: move the existing (if any) skb_metadata and adjust 
xdp->data_meta.

bpf_xdp_metadata_export_to_skb: If skb_metadata exists, overwrite the existing
one.  If it does not exist, get headroom before xdp->data_meta and write the hints.
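
A rough userspace model of these two rules, to make the intended behaviour
concrete. Everything here is an assumption for illustration: struct frame is a
toy stand-in for xdp_buff that uses offsets instead of pointers, the
skb_metadata fields are made up, and has_skb_meta plays the role of the
"skb_metadata is present" flag mentioned earlier in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HEADROOM 256

/* Hypothetical fixed-size hints block; field names are illustrative. */
struct skb_metadata {
	uint64_t rx_timestamp;
	uint32_t rx_hash;
	uint32_t btf_id;
};

/* Toy model of an xdp_buff: offsets into buf instead of pointers. */
struct frame {
	uint8_t buf[HEADROOM + 64];
	size_t data;        /* start of packet data */
	size_t data_meta;   /* start of program metadata, <= data */
	bool has_skb_meta;  /* skb_metadata sits right before data_meta */
};

static void frame_init(struct frame *f)
{
	memset(f, 0, sizeof(*f));
	f->data = HEADROOM;
	f->data_meta = HEADROOM;
}

/* Like bpf_xdp_adjust_meta(): a negative delta grows the program
 * metadata. An existing skb_metadata is moved so that it stays
 * directly in front of the new data_meta. */
static int adjust_meta(struct frame *f, long delta)
{
	const size_t sz = sizeof(struct skb_metadata);
	long new_meta = (long)f->data_meta + delta;
	long floor = f->has_skb_meta ? (long)sz : 0;

	if (new_meta < floor || new_meta > (long)f->data)
		return -1;
	if (f->has_skb_meta)
		memmove(f->buf + new_meta - sz, f->buf + f->data_meta - sz, sz);
	f->data_meta = (size_t)new_meta;
	return 0;
}

/* Like the proposed export kfunc: overwrite an existing skb_metadata,
 * or claim headroom in front of data_meta for a new one. */
static int export_to_skb(struct frame *f, const struct skb_metadata *hints)
{
	const size_t sz = sizeof(*hints);

	if (!f->has_skb_meta) {
		if (f->data_meta < sz)
			return -1; /* not enough headroom */
		f->has_skb_meta = true;
	}
	memcpy(f->buf + f->data_meta - sz, hints, sz);
	return 0;
}

/* Copy out the skb_metadata, if any is present. */
static bool read_skb_meta(const struct frame *f, struct skb_metadata *out)
{
	if (!f->has_skb_meta)
		return false;
	memcpy(out, f->buf + f->data_meta - sizeof(*out), sizeof(*out));
	return true;
}
```

In this model the hints survive any number of later adjust_meta() calls, and a
second export simply replaces the first, so call order does not matter.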

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [xdp-hints] Re: [RFC bpf-next v2 06/14] xdp: Carry over xdp metadata into skb context
  2022-11-10 23:29                   ` Toke Høiland-Jørgensen
@ 2022-11-11  1:39                     ` Martin KaFai Lau
  2022-11-11  9:44                       ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 66+ messages in thread
From: Martin KaFai Lau @ 2022-11-11  1:39 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Stanislav Fomichev, ast, daniel, andrii, song, yhs,
	john.fastabend, kpsingh, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev, bpf

On 11/10/22 3:29 PM, Toke Høiland-Jørgensen wrote:
>>> For the metadata consumed by the stack right now it's a bit
>>> hypothetical, yeah. However, there's a bunch of metadata commonly
>>> supported by hardware that the stack currently doesn't consume and that
>>> hopefully this feature will end up making more accessible. My hope is
>>> that the stack can also learn how to use this in the future, in which
>>> case we may run out of space. So I think of that bit mostly as
>>> future-proofing...
>>
>> ic. in this case, Can the btf_id be added to 'struct xdp_to_skb_metadata' later
>> if it is indeed needed?  The 'struct xdp_to_skb_metadata' is not in UAPI and
>> doing it with CO-RE is to give us flexibility to make this kind of changes in
>> the future.
> 
> My worry is mostly that it'll be more painful to add it later than just
> including it from the start, mostly because of AF_XDP users. But if we
> do the randomisation thing (thus forcing AF_XDP users to deal with the
> dynamic layout as well), it should be possible to add it later, and I
> can live with that option as well...

imo, considering we are trying to optimize away unnecessary field initialization
as below, it is sort of wasteful to always initialize the btf_id with the same
value.  It is better to add it in the future when there is a need.

>>>>> We should probably also have a flag set on the xdp_frame so the stack
>>>>> knows that the metadata area contains relevant-to-skb data, to guard
>>>>> against an XDP program accidentally hitting the "magic number" (BTF_ID)
>>>>> in unrelated stuff it puts into the metadata area.
>>>>
>>>> Yeah, I think having a flag is useful.  The flag will be set at xdp_buff and
>>>> then transfer to the xdp_frame?
>>>
>>> Yeah, exactly!
>>>
>>>>>> After re-reading patch 6, have another question. The 'void
>>>>>> bpf_xdp_metadata_export_to_skb();' function signature. Should it at
>>>>>> least return ok/err? or even return a 'struct xdp_to_skb_metadata *'
>>>>>> pointer and the xdp prog can directly read (or even write) it?
>>>>>
>>>>> Hmm, I'm not sure returning a failure makes sense? Failure to read one
>>>>> or more fields just means that those fields will not be populated? We
>>>>> should probably have a flags field inside the metadata struct itself to
>>>>> indicate which fields are set or not, but I'm not sure returning an
>>>>> error value adds anything? Returning a pointer to the metadata field
>>>>> might be convenient for users (it would just be an alias to the
>>>>> data_meta pointer, but the verifier could know its size, so the program
>>>>> doesn't have to bounds check it).
>>>>
>>>> If some hints are not available, those hints should be initialized to
>>>> 0/CHECKSUM_NONE/...etc.
>>>
>>> The problem with that is that then we have to spend cycles writing
>>> eight bytes of zeroes into the checksum field :)
>>
>> With a common 'struct xdp_to_skb_metadata', I am not sure how some of these zero
>> writes can be avoided.  If the xdp prog wants to optimize, it can call
>> individual kfunc to get individual hints.
> 
> Erm, we just... don't write those fields? Something like:
> 
> void write_skb_meta(hw, ctx) {
>    struct xdp_skb_metadata *meta = ctx->data_meta - sizeof(struct xdp_skb_metadata);
>    meta->valid_fields = 0;
> 
>    if (hw_has_timestamp) {
>      meta->timestamp = hw->timestamp;
>      meta->valid_fields |= FIELD_TIMESTAMP;
>    } /* otherwise meta->timestamp is just uninitialised */
> 
>    if (hw_has_rxhash) {
>      meta->rxhash = hw->rxhash;
>      meta->valid_fields |= FIELD_RXHASH;
>    } /* otherwise meta->rxhash is just uninitialised */
>    ...etc...
> }

Ah, got it.  Makes sense.  My mind was stuck in the paradigm that a helper
needs to initialize the result.
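
For completeness, the consumer side of that scheme could look roughly like the
sketch below. The FIELD_* names and valid_fields come from the pseudocode
quoted above; the bit values, struct hw_desc and the meta_timestamp() /
meta_rxhash() accessors are hypothetical. The point is that a reader must test
the bit before touching a field, since unset fields hold garbage.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define FIELD_TIMESTAMP (1u << 0) /* bit values are made up */
#define FIELD_RXHASH    (1u << 1)

struct xdp_skb_metadata {
	uint32_t valid_fields;
	uint32_t rxhash;
	uint64_t timestamp;
};

/* Hypothetical view of what the NIC reported for one packet. */
struct hw_desc {
	bool has_timestamp;
	bool has_rxhash;
	uint64_t timestamp;
	uint32_t rxhash;
};

/* Producer: only valid_fields is always written; other fields are
 * written only when the hardware supplied them. */
static void write_skb_meta(const struct hw_desc *hw,
			   struct xdp_skb_metadata *meta)
{
	meta->valid_fields = 0;

	if (hw->has_timestamp) {
		meta->timestamp = hw->timestamp;
		meta->valid_fields |= FIELD_TIMESTAMP;
	}
	if (hw->has_rxhash) {
		meta->rxhash = hw->rxhash;
		meta->valid_fields |= FIELD_RXHASH;
	}
}

/* Consumers: return false instead of exposing uninitialised data. */
static bool meta_timestamp(const struct xdp_skb_metadata *meta, uint64_t *ts)
{
	if (!(meta->valid_fields & FIELD_TIMESTAMP))
		return false;
	*ts = meta->timestamp;
	return true;
}

static bool meta_rxhash(const struct xdp_skb_metadata *meta, uint32_t *hash)
{
	if (!(meta->valid_fields & FIELD_RXHASH))
		return false;
	*hash = meta->rxhash;
	return true;
}
```

The cost is one always-written 32-bit word instead of zeroing every absent
field.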

^ permalink raw reply	[flat|nested] 66+ messages in thread

* [xdp-hints] Re: [RFC bpf-next v2 06/14] xdp: Carry over xdp metadata into skb context
  2022-11-11  0:45                               ` Martin KaFai Lau
@ 2022-11-11  9:37                                 ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 66+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-11-11  9:37 UTC (permalink / raw)
  To: Martin KaFai Lau, Stanislav Fomichev
  Cc: ast, daniel, andrii, song, yhs, john.fastabend, kpsingh, haoluo,
	jolsa, David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev, bpf

Martin KaFai Lau <martin.lau@linux.dev> writes:

> On 11/10/22 4:10 PM, Toke Høiland-Jørgensen wrote:
>>> The problem with AF_XDP is that, IIUC, it doesn't have a data_meta
>>> pointer in the userspace.
>>>
>>> You get an rx descriptor where the address points to the 'data':
>>> | 256 bytes headroom where metadata can go | data |
>> 
>> Ah, I was missing the bit where the data pointer actually points at
>> data, not the start of the buf. Oops, my bad!
>> 
>>> So you have (at most) 256 bytes of headroom, some of that might be the
>>> metadata, but you really don't know where it starts. But you know it
>>> definitely ends where the data begins.
>>>
>>> So if we have the following, we can locate skb_metadata:
>>> | 256-sizeof(skb_metadata) headroom | custom metadata | skb_metadata | data |
>>> data - sizeof(skb_metadata) will get you there
>>>
>>> But if it's the other way around, the program has to know
>>> sizeof(custom metadata) to locate skb_metadata:
>>> | 256-sizeof(skb_metadata) headroom | skb_metadata | custom metadata | data |
>>>
>>> Am I missing something here?
>> 
>> Hmm, so one could argue that the only way AF_XDP can consume custom
>> metadata today is if it knows out of band what the size of it is. And if
>> it knows that, it can just skip over it to go back to the skb_metadata,
>> no?
>
> +1 I replied with a similar point in another email. I also think we
> can safely assume this.

Great!

>> 
>> The only problem left then is if there were multiple XDP programs called
>> in sequence (whether before a redirect, or by libxdp chaining or tail
>> calls), and the first one resized the metadata area without the last one
>> knowing about it. For this, we could add a CLOBBER_PROGRAM_META flag to
>> the skb_metadata helper which if set will ensure that the program
>> metadata length is reset to 0?
>
> How is it different from the same xdp prog calling bpf_xdp_adjust_meta() and 
> bpf_xdp_metadata_export_to_skb() multiple times.  The earlier stored 
> skb_metadata needs to be moved during the latter bpf_xdp_adjust_meta().  The 
> latter bpf_xdp_metadata_export_to_skb() will overwrite the earlier skb_metadata.

Well, it would just be a convenience flag, so instead of doing:

metalen = ctx->data - ctx->data_meta;
if (metalen)
  bpf_xdp_adjust_meta(ctx, metalen); /* positive delta shrinks the metadata area */
bpf_xdp_metadata_export_to_skb(ctx);

you could just do:

bpf_xdp_metadata_export_to_skb(ctx, CLOBBER_PROGRAM_META);

and the kernel would do the check+move for you. But, well, the couple of
extra instructions to do the check in BPF is probably fine.

(I'm talking here about a program that wants to make sure that any
custom metadata that may have been added by an earlier program is
removed before redirecting to an XSK socket; I expect we'd want to do
something like this in the default program in libxdp).
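
A userspace model of that check+move, as a rough sketch: `struct mock_xdp` and `mock_adjust_meta()` are hypothetical stand-ins that only mimic the delta convention of bpf_xdp_adjust_meta() as I understand it (the delta is added to data_meta, so a negative delta grows the metadata area and a positive one shrinks it).

```c
/* Hypothetical model of the XDP context, using buffer offsets
 * instead of pointers. Invariant: 0 <= data_meta <= data. */
struct mock_xdp {
	int data_meta; /* offset where metadata starts */
	int data;      /* offset where packet data starts */
};

/* Mimics the delta convention assumed above: data_meta += delta.
 * Returns 0 on success, -1 if the result would leave the buffer
 * or move past the start of the packet data. */
static int mock_adjust_meta(struct mock_xdp *ctx, int delta)
{
	int meta = ctx->data_meta + delta;

	if (meta < 0 || meta > ctx->data)
		return -1;
	ctx->data_meta = meta;
	return 0;
}

/* Drop any metadata left behind by an earlier program so that
 * data_meta == data afterwards. */
static int clear_program_meta(struct mock_xdp *ctx)
{
	int metalen = ctx->data - ctx->data_meta;

	if (metalen)
		return mock_adjust_meta(ctx, metalen);
	return 0;
}
```

With 16 bytes of leftover metadata (data_meta = 240, data = 256), clear_program_meta() moves data_meta up to 256; with no metadata it does nothing.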

-Toke



* [xdp-hints] Re: [RFC bpf-next v2 06/14] xdp: Carry over xdp metadata into skb context
  2022-11-11  1:26                                 ` Martin KaFai Lau
@ 2022-11-11  9:41                                   ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 66+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-11-11  9:41 UTC (permalink / raw)
  To: Martin KaFai Lau, Stanislav Fomichev
  Cc: ast, daniel, andrii, song, yhs, john.fastabend, kpsingh, haoluo,
	jolsa, David Ahern, Jakub Kicinski, Willem de Bruijn,
	Jesper Dangaard Brouer, Anatoly Burakov, Alexander Lobakin,
	Magnus Karlsson, Maryam Tahhan, xdp-hints, netdev, bpf

Martin KaFai Lau <martin.lau@linux.dev> writes:

> On 11/10/22 4:57 PM, Stanislav Fomichev wrote:
>> On Thu, Nov 10, 2022 at 4:33 PM Martin KaFai Lau <martin.lau@linux.dev> wrote:
>>>
>>> On 11/10/22 3:52 PM, Stanislav Fomichev wrote:
>>>> On Thu, Nov 10, 2022 at 3:14 PM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>>>>
>>>>> Skipping to the last bit:
>>>>>
>>>>>>>>>>      } else {
>>>>>>>>>>        use kfuncs
>>>>>>>>>>      }
>>>>>>>>>>
>>>>>>>>>> 5. Support the case where we keep program's metadata and kernel's
>>>>>>>>>> xdp_to_skb_metadata
>>>>>>>>>>      - skb_metadata_import_from_xdp() will "consume" it by mem-moving the
>>>>>>>>>> rest of the metadata over it and adjusting the headroom
>>>>>>>>>
>>>>>>>>> I was thinking the kernel's xdp_to_skb_metadata is always before the program's
>>>>>>>>> metadata.  xdp prog should usually work in this order also: read/write headers,
>>>>>>>>> write its own metadata, call bpf_xdp_metadata_export_to_skb(), and return
>>>>>>>>> XDP_PASS/XDP_REDIRECT.  When it is XDP_PASS, the kernel just needs to pop the
>>>>>>>>> xdp_to_skb_metadata and pass the remaining program's metadata to the bpf-tc.
>>>>>>>>>
>>>>>>>>> For the kernel and xdp prog, I don't think it matters where the
>>>>>>>>> xdp_to_skb_metadata is.  However, the xdp->data_meta (program's metadata) has to
>>>>>>>>> be before xdp->data because of the current data_meta and data comparison usage
>>>>>>>>> in the xdp prog.
>>>>>>>>>
>>>>>>>>> The order of the kernel's xdp_to_skb_metadata and the program's metadata
>>>>>>>>> probably only matters to the userspace AF_XDP.  However, I don't see how AF_XDP
>>>>>>>>> supports the program's metadata now.  afaict, it can only work now if there is
>>>>>>>>> some sort of contract between them or the AF_XDP currently does not use the
>>>>>>>>> program's metadata.  Either way, we can do the mem-moving only for AF_XDP and it
>>>>>>>>> should be a no op if there is no program's metadata?  This behavior could also
>>>>>>>>> be configurable through setsockopt?
>>>>>>>>
>>>>>>>> Agreed on all of the above. For now it seems like the safest thing to
>>>>>>>> do is to put xdp_to_skb_metadata last to allow af_xdp to properly
>>>>>>>> locate btf_id.
>>>>>>>> Let's see if Toke disagrees :-)
>>>>>>>
>>>>>>> As I replied to Martin, I'm not sure it's worth the complexity to
>>>>>>> logically split the SKB metadata from the program's own metadata (as
>>>>>>> opposed to just reusing the existing data_meta pointer)?
>>>>>>
>>>>>> I'd gladly keep my current requirement where it's either or, but not both :-)
>>>>>> We can relax it later if required?
>>>>>
>>>>> So the way I've been thinking about it is simply that the skb_metadata
>>>>> would live in the same place at the data_meta pointer (including
>>>>> adjusting that pointer to accommodate it), and just overriding the
>>>>> existing program metadata, if any exists. But looking at it now, I guess
>>>>> having the split makes it easier for a program to write its own custom
>>>>> metadata and still use the skb metadata. See below about the ordering.
>>>>>
>>>>>>> However, if we do, the layout that makes most sense to me is putting the
>>>>>>> skb metadata before the program metadata, like:
>>>>>>>
>>>>>>> --------------
>>>>>>> | skb_metadata
>>>>>>> --------------
>>>>>>> | data_meta
>>>>>>> --------------
>>>>>>> | data
>>>>>>> --------------
>>>>>>>
>>>
>>> Yeah, for the kernel and xdp prog (ie not AF_XDP), I meant this:
>>>
>>> | skb_metadata | custom metadata | data |
>>>
>>>>>>> Not sure if that's what you meant? :)
>>>>>>
>>>>>> I was suggesting the other way around: |custom meta|skb_metadata|data|
>>>>>> (but, as Martin points out, consuming skb_metadata in the kernel
>>>>>> becomes messier)
>>>>>>
>>>>>> af_xdp can check whether skb_metadata is present by looking at data -
>>>>>> offsetof(struct skb_metadata, btf_id).
>>>>>> progs that know how to handle custom metadata, will look at data -
>>>>>> sizeof(skb_metadata)
>>>>>>
>>>>>> Otherwise, if it's the other way around, how do we find skb_metadata
>>>>>> in a redirected frame?
>>>>>> Let's say we have |skb_metadata|custom meta|data|, how does the final
>>>>>> program find skb_metadata?
>>>>>> All the progs have to agree on the sizeof(tc/custom meta), right?
>>>>>
>>>>> Erm, maybe I'm missing something here, but skb_metadata is fixed size,
>>>>> right? So if the "skb_metadata is present" flag is set, we know that the
>>>>> sizeof(skb_metadata) bytes before the data_meta pointer contains the
>>>>> metadata, and if the flag is not set, we know those bytes are not valid
>>>>> metadata.
>>>
>>> right, so to get to the skb_metadata, it will be
>>> data_meta -= sizeof(skb_metadata);  /* probably need alignment */
>>>
>>>>>
>>>>> For AF_XDP, we'd need to transfer the flag as well, and it could apply
>>>>> the same logic (getting the size from the vmlinux BTF).
>>>>>
>>>>> By this logic, the BTF_ID should be the *first* entry of struct
>>>>> skb_metadata, since that will be the field AF_XDP programs can find
>>>>> right off the bat, no?
>>>> The problem with AF_XDP is that, IIUC, it doesn't have a data_meta
>>>> pointer in the userspace.
>>>
>>> Yep. It is my understanding also.  Missing data_meta pointer in the AF_XDP
>>> rx_desc is a potential problem.  Having BTF_ID or not won't help.
>>>
>>>>
>>>> You get an rx descriptor where the address points to the 'data':
>>>> | 256 bytes headroom where metadata can go | data |
>>>>
>>>> So you have (at most) 256 bytes of headroom, some of that might be the
>>>> metadata, but you really don't know where it starts. But you know it
>>>> definitely ends where the data begins.
>>>>
>>>> So if we have the following, we can locate skb_metadata:
>>>> | 256-sizeof(skb_metadata) headroom | custom metadata | skb_metadata | data |
>>>> data - sizeof(skb_metadata) will get you there
>>>>
>>>> But if it's the other way around, the program has to know
>>>> sizeof(custom metadata) to locate skb_metadata:
>>>> | 256-sizeof(skb_metadata) headroom | skb_metadata | custom metadata | data |
>>>
>>> Right, this won't work if the AF_XDP user does not know how big the custom
>>> metadata is.  The kernel then needs to swap the "skb_metadata" and "custom
>>> metadata" + setting a flag in the AF_XDP rx_desc->options to make it look like
>>> this:
>>> | custom metadata | skb_metadata | data |
>>>
>>> However, since data_meta is missing from the rx_desc, maybe we can safely
>>> assume the AF_XDP user always knows the size of the custom metadata or there is
>>> usually no "custom metadata" and no swap is needed?
>> 
>> If we can assume they can share that info, can they also share more
>> info on what kind of metadata they would prefer to get?
>> If they can agree on the size, maybe they also can agree on the flows
>> that need skb_metadata vs the flows that need a custom one?
>> 
>> Seems like we can start with supporting either one, but not both and
>> extend in the future once we have more understanding on whether it's
>> actually needed or not?
>> 
>> bpf_xdp_metadata_export_to_skb: adjust data meta, add uses-skb-metadata flag
>> bpf_xdp_adjust_meta: unconditionally reset uses-skb-metadata flag
> hmm... I am thinking:
>
> bpf_xdp_adjust_meta: move the existing (if any) skb_metadata and adjust 
> xdp->data_meta.
>
> bpf_xdp_metadata_export_to_skb: If skb_metadata exists, overwrites the existing 
> one.  If not exists, gets headroom before xdp->data_meta and writes hints.

Yeah, +1 on this. For AF_XDP that means that if you only do
bpf_xdp_metadata_export_to_skb() you'll get the skb metadata right
before data, and if you add custom metadata yourself, that goes
in-between so you'll need to have some way to communicate the length to
the AF_XDP consumer. And a default program (like in libxdp) can just
check the metadata pointer and make sure to remove any metadata before
redirecting to userspace, like in the example from my other reply to
Martin.
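
To make the AF_XDP side concrete, here is a small userspace sketch of how a consumer could find the skb metadata in a `| headroom | skb_metadata | custom metadata | data |` layout once it knows the custom metadata length out of band. The `struct skb_metadata` fields and the `locate_skb_metadata()` helper are hypothetical, purely for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical fixed-size metadata struct; the real layout would
 * come from the kernel (e.g. via BTF). */
struct skb_metadata {
	uint64_t rx_timestamp;
	uint32_t rx_hash;
	uint32_t pad;
};

/* buf_start:  start of the frame's headroom
 * data:       address from the rx descriptor (start of packet data)
 * custom_len: size of the program's own metadata, agreed out of band
 *
 * Copies the skb metadata into *out and returns 0, or returns -1 if
 * the headroom is too small to hold both metadata areas. */
static int locate_skb_metadata(const uint8_t *buf_start, const uint8_t *data,
			       size_t custom_len, struct skb_metadata *out)
{
	size_t headroom = (size_t)(data - buf_start);

	if (custom_len > headroom || sizeof(*out) > headroom - custom_len)
		return -1;

	/* Skip back over the custom metadata, then over the struct itself. */
	memcpy(out, data - custom_len - sizeof(*out), sizeof(*out));
	return 0;
}
```

With custom_len == 0 this reduces to reading at data - sizeof(struct skb_metadata), which is the "only bpf_xdp_metadata_export_to_skb()" case.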

-Toke



* [xdp-hints] Re: [RFC bpf-next v2 06/14] xdp: Carry over xdp metadata into skb context
  2022-11-11  1:39                     ` Martin KaFai Lau
@ 2022-11-11  9:44                       ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 66+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-11-11  9:44 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: Stanislav Fomichev, ast, daniel, andrii, song, yhs,
	john.fastabend, kpsingh, haoluo, jolsa, David Ahern,
	Jakub Kicinski, Willem de Bruijn, Jesper Dangaard Brouer,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev, bpf

Martin KaFai Lau <martin.lau@linux.dev> writes:

> On 11/10/22 3:29 PM, Toke Høiland-Jørgensen wrote:
>>>> For the metadata consumed by the stack right now it's a bit
>>>> hypothetical, yeah. However, there's a bunch of metadata commonly
>>>> supported by hardware that the stack currently doesn't consume and that
>>>> hopefully this feature will end up making more accessible. My hope is
>>>> that the stack can also learn how to use this in the future, in which
>>>> case we may run out of space. So I think of that bit mostly as
>>>> future-proofing...
>>>
>>> ic. in this case, Can the btf_id be added to 'struct xdp_to_skb_metadata' later
>>> if it is indeed needed?  The 'struct xdp_to_skb_metadata' is not in UAPI and
>>> doing it with CO-RE is to give us flexibility to make this kind of changes in
>>> the future.
>> 
>> My worry is mostly that it'll be more painful to add it later than just
>> including it from the start, mostly because of AF_XDP users. But if we
>> do the randomisation thing (thus forcing AF_XDP users to deal with the
>> dynamic layout as well), it should be possible to add it later, and I
>> can live with that option as well...
>
> imo, considering we are trying to optimize unnecessary field
> initialization as below, it is sort of wasteful to always initialize
> the btf_id with the same value. It is better to add it in the future
> when there is a need.

Okay, let's omit the BTF ID for now, and see what that looks like. I'll
try to keep in mind to see if I can find any reasons why we'd need to
add it back and make sure to complain before this lands if I find any :)

-Toke



* [xdp-hints] Re: [RFC bpf-next v2 04/14] veth: Support rx timestamp metadata for xdp
  2022-11-10 17:39           ` John Fastabend
  2022-11-10 18:52             ` Stanislav Fomichev
@ 2022-11-11 10:41             ` Jesper Dangaard Brouer
  1 sibling, 0 replies; 66+ messages in thread
From: Jesper Dangaard Brouer @ 2022-11-11 10:41 UTC (permalink / raw)
  To: John Fastabend, Stanislav Fomichev
  Cc: brouer, bpf, ast, daniel, andrii, martin.lau, song, yhs, kpsingh,
	haoluo, jolsa, David Ahern, Jakub Kicinski, Willem de Bruijn,
	Anatoly Burakov, Alexander Lobakin, Magnus Karlsson,
	Maryam Tahhan, xdp-hints, netdev


On 10/11/2022 18.39, John Fastabend wrote:
> Stanislav Fomichev wrote:
>> On Wed, Nov 9, 2022 at 5:35 PM John Fastabend <john.fastabend@gmail.com> wrote:
>>>
>>> Stanislav Fomichev wrote:
>>>> On Wed, Nov 9, 2022 at 4:26 PM John Fastabend <john.fastabend@gmail.com> wrote:
>>>>>
>>>>> Stanislav Fomichev wrote:
>>>>>> xskxceiver conveniently sets up veth pairs so it seems logical
>>>>>> to use veth as an example for some of the metadata handling.
>>>>>>
>>>>>> We timestamp skb right when we "receive" it, store its
>>>>>> pointer in new veth_xdp_buff wrapper and generate BPF bytecode to
>>>>>> reach it from the BPF program.
>>>>>>
>>>>>> This largely follows the idea of "store some queue context in
>>>>>> the xdp_buff/xdp_frame so the metadata can be reached out
>>>>>> from the BPF program".
>>>>>>
>>>>>
>>>>> [...]
>>>>>
>>>>>>        orig_data = xdp->data;
>>>>>>        orig_data_end = xdp->data_end;
>>>>>> +     vxbuf.skb = skb;
>>>>>>
>>>>>>        act = bpf_prog_run_xdp(xdp_prog, xdp);
>>>>>>
>>>>>> @@ -942,6 +946,7 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
>>>>>>                        struct sk_buff *skb = ptr;
>>>>>>
>>>>>>                        stats->xdp_bytes += skb->len;
>>>>>> +                     __net_timestamp(skb);
>>>>>
>>>>> Just getting to reviewing in depth a bit more. But we hit veth with lots of
>>>>> packets in some configurations I don't think we want to add a __net_timestamp
>>>>> here when vast majority of use cases will have no need for timestamp on veth
>>>>> device. I didn't do a benchmark but it's not free.
>>>>>
>>>>> If there is a real use case for timestamping on veth we could do it through
>>>>> a XDP program directly? Basically fallback for devices without hw timestamps.
>>>>> Anyways I need the helper to support hardware without time stamping.
>>>>>
>>>>> Not sure if this was just part of the RFC to explore BPF programs or not.
>>>>
>>>> Initially I've done it mostly so I can have selftests on top of veth
>>>> driver, but I'd still prefer to keep it to have working tests.
>>>
>>> I can't think of a use for it though so it's just extra cycles. There
>>> is a helper to read the ktime.
>>
>> As I mentioned in another reply, I wanted something SW-only to test
>> this whole metadata story.
> 
> Yeah I see the value there. Also because this is in the veth_xdp_rcv
> path we don't actually attach XDP programs to veths except for in
> CI anyways. I assume though if someone actually does use this in
> prod having an extra __net_timestamp there would be extra unwanted
> cycles.
> 

Sorry, but I think it is wrong to add this SW-timestamp to veth.
As John already pointed out the BPF-prog can just call ktime helper
itself. Plus, we *do* want to use this code path as a fast-path, not
just for CI testing.

I suggest you use the offloaded VLAN tag instead.


>> The idea was:
>> - veth rx sets skb->tstamp (otherwise it's 0 at this point)
>> - veth kfunc to access rx_timestamp returns skb->tstamp
>> - xsk bpf program verifies that the metadata is non-zero
>> - the above shows end-to-end functionality with a software driver
> 
> Yep 100% agree very handy for testing just not sure we can add code
> to hotpath just for testing.

I agree, I really dislike adding code to hotpath just for testing.

Using VLAN instead would solve a practical problem: XDP lacks access to
the offloaded VLAN tag, which is one of the missing features that
XDP-hints aims to add.

--Jesper



* [xdp-hints] Re: [RFC bpf-next v2 10/14] ice: Support rx timestamp metadata for xdp
  2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 10/14] ice: Support rx timestamp metadata for xdp Stanislav Fomichev
  2022-11-04 14:35   ` [xdp-hints] " Alexander Lobakin
@ 2022-12-15 11:54   ` Larysa Zaremba
  2022-12-15 14:29     ` Toke Høiland-Jørgensen
  1 sibling, 1 reply; 66+ messages in thread
From: Larysa Zaremba @ 2022-12-15 11:54 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

On Thu, Nov 03, 2022 at 08:25:28PM -0700, Stanislav Fomichev wrote:
> +			/* if (r5 == NULL) return; */
> +			BPF_JMP_IMM(BPF_JNE, BPF_REG_5, 0, S16_MAX),

S16_MAX jump crashes my system and I do not see such jumps used very often
in bpf code found in-tree, setting a fixed jump length worked for me.
Also, I think BPF_JEQ is a correct condition in this case, not BPF_JNE.

But the main reason for my reply is that I have implemented RX hash hint
for ice both as unrolled bpf code and with BPF_EMIT_CALL [0].
Both bpf_xdp_metadata_rx_hash() and bpf_xdp_metadata_rx_hash_supported() 
are implemented in those 2 ways.

RX hash is the easiest hint to read, so performance difference
should be more visible than when reading timestamp.

Counting packets in an rxdrop XDP program on a single queue
gave me the following numbers:

- unrolled:		41264360 pps
- BPF_EMIT_CALL:	40370651 pps

So, reading a single hint in an unrolled way instead of calling 2 driver
functions in a row, gives us a 2.2% performance boost.
Surely, the difference will increase, if we read more than a single hint.
Therefore, it would be great to implement at least some simple hints
functions as unrolled.

[0] https://github.com/walking-machine/linux/tree/ice-kfunc-hints-clean


* [xdp-hints] Re: [RFC bpf-next v2 10/14] ice: Support rx timestamp metadata for xdp
  2022-12-15 11:54   ` Larysa Zaremba
@ 2022-12-15 14:29     ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 66+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-12-15 14:29 UTC (permalink / raw)
  To: Larysa Zaremba, Stanislav Fomichev
  Cc: bpf, ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
	kpsingh, haoluo, jolsa, David Ahern, Jakub Kicinski,
	Willem de Bruijn, Jesper Dangaard Brouer, Anatoly Burakov,
	Alexander Lobakin, Magnus Karlsson, Maryam Tahhan, xdp-hints,
	netdev

Larysa Zaremba <larysa.zaremba@intel.com> writes:

> On Thu, Nov 03, 2022 at 08:25:28PM -0700, Stanislav Fomichev wrote:
>> +			/* if (r5 == NULL) return; */
>> +			BPF_JMP_IMM(BPF_JNE, BPF_REG_5, 0, S16_MAX),
>
> S16_MAX jump crashes my system and I do not see such jumps used very often
> in bpf code found in-tree, setting a fixed jump length worked for me.
> Also, I think BPF_JEQ is a correct condition in this case, not BPF_JNE.
>
> But the main reason for my reply is that I have implemented RX hash hint
> for ice both as unrolled bpf code and with BPF_EMIT_CALL [0].
> Both bpf_xdp_metadata_rx_hash() and bpf_xdp_metadata_rx_hash_supported() 
> are implemented in those 2 ways.
>
> RX hash is the easiest hint to read, so performance difference
> should be more visible than when reading timestamp.
>
> Counting packets in an rxdrop XDP program on a single queue
> gave me the following numbers:
>
> - unrolled:		41264360 pps
> - BPF_EMIT_CALL:	40370651 pps
>
> So, reading a single hint in an unrolled way instead of calling 2 driver
> functions in a row, gives us a 2.2% performance boost.
> Surely, the difference will increase, if we read more than a single hint.
> Therefore, it would be great to implement at least some simple hints
> functions as unrolled.
>
> [0] https://github.com/walking-machine/linux/tree/ice-kfunc-hints-clean

Right, so this corresponds to ~0.5ns function call overhead, which is a
bit less than what I was seeing[0], but you're also getting 41 Mpps
where I was getting 25, so I assume your hardware is newer :)
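
For anyone wanting to redo the arithmetic: the per-packet cost is just 1e9 / pps, so the two rates above work out to roughly 24.2 ns and 24.8 ns per packet. A small sketch (the helper names are made up for this example):

```c
/* Time budget per packet, in nanoseconds, at a given packet rate. */
static double ns_per_packet(double pps)
{
	return 1e9 / pps;
}

/* Extra per-packet cost of the slower variant, in nanoseconds. */
static double call_overhead_ns(double pps_unrolled, double pps_called)
{
	return ns_per_packet(pps_called) - ns_per_packet(pps_unrolled);
}

/* Relative throughput gain of the unrolled variant, in percent. */
static double speedup_percent(double pps_unrolled, double pps_called)
{
	return (pps_unrolled / pps_called - 1.0) * 100.0;
}
```

Feeding in 41264360 and 40370651 pps gives a bit over 0.5 ns of overhead per packet and the 2.2% throughput difference quoted above.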

And yeah, I agree that ideally we really should inline these functions.
However, seeing as that may be a ways off[1], I suppose we'll have to
live with the function call overhead for now. As long as we're
reasonably confident that inlining can be added later without disruptive
API breaks I am OK with proceeding without inlining for now, though.
That way, inlining will just be a nice performance optimisation once it
does land, and who knows, maybe this will provide the impetus for
someone to land it sooner rather than later...

-Toke

[0] https://lore.kernel.org/r/875yellcx6.fsf@toke.dk
[1] https://lore.kernel.org/r/CAADnVQ+MyE280Q-7iw2Y-P6qGs4xcDML-tUrXEv_EQTmeESVaQ@mail.gmail.com




Thread overview: 66+ messages
2022-11-04  3:25 [xdp-hints] [RFC bpf-next v2 00/14] xdp: hints via kfuncs Stanislav Fomichev
2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 01/14] bpf: Introduce bpf_patch Stanislav Fomichev
2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 02/14] bpf: Support inlined/unrolled kfuncs for xdp metadata Stanislav Fomichev
2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 03/14] veth: Introduce veth_xdp_buff wrapper for xdp_buff Stanislav Fomichev
2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 04/14] veth: Support rx timestamp metadata for xdp Stanislav Fomichev
2022-11-09 11:21   ` [xdp-hints] " Toke Høiland-Jørgensen
2022-11-09 21:34     ` Stanislav Fomichev
2022-11-10  0:25   ` John Fastabend
2022-11-10  1:02     ` Stanislav Fomichev
2022-11-10  1:35       ` John Fastabend
2022-11-10  6:44         ` Stanislav Fomichev
2022-11-10 17:39           ` John Fastabend
2022-11-10 18:52             ` Stanislav Fomichev
2022-11-11 10:41             ` Jesper Dangaard Brouer
2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 05/14] selftests/bpf: Verify xdp_metadata xdp->af_xdp path Stanislav Fomichev
2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 06/14] xdp: Carry over xdp metadata into skb context Stanislav Fomichev
2022-11-07 22:01   ` [xdp-hints] " Martin KaFai Lau
2022-11-08 21:54     ` Stanislav Fomichev
2022-11-09  3:07       ` Martin KaFai Lau
2022-11-09  4:19         ` Martin KaFai Lau
2022-11-09 11:10           ` Toke Høiland-Jørgensen
2022-11-09 18:22             ` Martin KaFai Lau
2022-11-09 21:33               ` Stanislav Fomichev
2022-11-10  0:13                 ` Martin KaFai Lau
2022-11-10  1:02                   ` Stanislav Fomichev
2022-11-10 14:26                     ` Toke Høiland-Jørgensen
2022-11-10 18:52                       ` Stanislav Fomichev
2022-11-10 23:14                         ` Toke Høiland-Jørgensen
2022-11-10 23:52                           ` Stanislav Fomichev
2022-11-11  0:10                             ` Toke Høiland-Jørgensen
2022-11-11  0:45                               ` Martin KaFai Lau
2022-11-11  9:37                                 ` Toke Høiland-Jørgensen
2022-11-11  0:33                             ` Martin KaFai Lau
2022-11-11  0:57                               ` Stanislav Fomichev
2022-11-11  1:26                                 ` Martin KaFai Lau
2022-11-11  9:41                                   ` Toke Høiland-Jørgensen
2022-11-10 23:58                         ` Martin KaFai Lau
2022-11-11  0:20                           ` Stanislav Fomichev
2022-11-10 14:19               ` Toke Høiland-Jørgensen
2022-11-10 19:04                 ` Martin KaFai Lau
2022-11-10 23:29                   ` Toke Høiland-Jørgensen
2022-11-11  1:39                     ` Martin KaFai Lau
2022-11-11  9:44                       ` Toke Høiland-Jørgensen
2022-11-10  1:26             ` John Fastabend
2022-11-10 14:32               ` Toke Høiland-Jørgensen
2022-11-10 17:30                 ` John Fastabend
2022-11-10 22:49                   ` Toke Høiland-Jørgensen
2022-11-10  1:09   ` John Fastabend
2022-11-10  6:44     ` Stanislav Fomichev
2022-11-10 21:21       ` David Ahern
2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 07/14] selftests/bpf: Verify xdp_metadata xdp->skb path Stanislav Fomichev
2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 08/14] bpf: Helper to simplify calling kernel routines from unrolled kfuncs Stanislav Fomichev
2022-11-05  0:40   ` [xdp-hints] " Alexei Starovoitov
2022-11-05  2:18     ` Stanislav Fomichev
2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 09/14] ice: Introduce ice_xdp_buff wrapper for xdp_buff Stanislav Fomichev
2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 10/14] ice: Support rx timestamp metadata for xdp Stanislav Fomichev
2022-11-04 14:35   ` [xdp-hints] " Alexander Lobakin
2022-11-04 18:21     ` Stanislav Fomichev
2022-11-07 17:11       ` Alexander Lobakin
2022-11-07 19:10         ` Stanislav Fomichev
2022-12-15 11:54   ` Larysa Zaremba
2022-12-15 14:29     ` Toke Høiland-Jørgensen
2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 11/14] mlx4: Introduce mlx4_xdp_buff wrapper for xdp_buff Stanislav Fomichev
2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 12/14] mxl4: Support rx timestamp metadata for xdp Stanislav Fomichev
2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 13/14] bnxt: Introduce bnxt_xdp_buff wrapper for xdp_buff Stanislav Fomichev
2022-11-04  3:25 ` [xdp-hints] [RFC bpf-next v2 14/14] bnxt: Support rx timestamp metadata for xdp Stanislav Fomichev
