* [xdp-hints] [PATCH RFC bpf-next 01/52] libbpf: factor out BTF loading from load_module_btfs()
From: Alexander Lobakin @ 2022-06-28 19:47 UTC
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
Cc: Alexander Lobakin, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Toke Hoiland-Jorgensen,
Lorenzo Bianconi, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jesse Brandeburg, John Fastabend, Yajun Deng,
Willem de Bruijn, bpf, netdev, linux-kernel, xdp-hints
From: Larysa Zaremba <larysa.zaremba@intel.com>
In order to be able to reuse the BTF loading logic, move it to the new
btf_load_next_with_info() and call it from load_module_btfs()
instead.
To still be able to get the ID, introduce an ID field to the
userspace struct btf and return it via the new btf_obj_id().
To still be able to use bpf_btf_info::name as a string, locally add
a counterpart to ptr_to_u64() - u64_to_ptr() - and use it to filter
vmlinux/module BTFs.
Also, add a definition for easily declaring bpf_btf_info name
buffers and make btf_get_from_fd() static, as it's now used only in
btf.c.
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
---
tools/lib/bpf/btf.c | 110 +++++++++++++++++++++++++++++++-
tools/lib/bpf/libbpf.c | 52 ++++-----------
tools/lib/bpf/libbpf_internal.h | 7 +-
3 files changed, 126 insertions(+), 43 deletions(-)
diff --git a/tools/lib/bpf/btf.c b/tools/lib/bpf/btf.c
index ae1520f7e1b0..7e4dbf71fd52 100644
--- a/tools/lib/bpf/btf.c
+++ b/tools/lib/bpf/btf.c
@@ -121,6 +121,9 @@ struct btf {
/* Pointer size (in bytes) for a target architecture of this BTF */
int ptr_sz;
+
+ /* BTF object ID, valid for vmlinux and module BTF */
+ __u32 id;
};
static inline __u64 ptr_to_u64(const void *ptr)
@@ -128,6 +131,11 @@ static inline __u64 ptr_to_u64(const void *ptr)
return (__u64) (unsigned long) ptr;
}
+static inline const void *u64_to_ptr(__u64 val)
+{
+ return (const void *)(unsigned long)val;
+}
+
/* Ensure given dynamically allocated memory region pointed to by *data* with
* capacity of *cap_cnt* elements each taking *elem_sz* bytes has enough
* memory to accommodate *add_cnt* new elements, assuming *cur_cnt* elements
@@ -463,6 +471,11 @@ const struct btf *btf__base_btf(const struct btf *btf)
return btf->base_btf;
}
+__u32 btf_obj_id(const struct btf *btf)
+{
+ return btf->id;
+}
+
/* internal helper returning non-const pointer to a type */
struct btf_type *btf_type_by_id(const struct btf *btf, __u32 type_id)
{
@@ -819,6 +832,7 @@ static struct btf *btf_new_empty(struct btf *base_btf)
btf->fd = -1;
btf->ptr_sz = sizeof(void *);
btf->swapped_endian = false;
+ btf->id = 0;
if (base_btf) {
btf->base_btf = base_btf;
@@ -869,6 +883,7 @@ static struct btf *btf_new(const void *data, __u32 size, struct btf *base_btf)
btf->start_id = 1;
btf->start_str_off = 0;
btf->fd = -1;
+ btf->id = 0;
if (base_btf) {
btf->base_btf = base_btf;
@@ -1334,7 +1349,7 @@ const char *btf__name_by_offset(const struct btf *btf, __u32 offset)
return btf__str_by_offset(btf, offset);
}
-struct btf *btf_get_from_fd(int btf_fd, struct btf *base_btf)
+static struct btf *btf_get_from_fd(int btf_fd, struct btf *base_btf)
{
struct bpf_btf_info btf_info;
__u32 len = sizeof(btf_info);
@@ -1382,6 +1397,8 @@ struct btf *btf_get_from_fd(int btf_fd, struct btf *base_btf)
}
btf = btf_new(ptr, btf_info.btf_size, base_btf);
+ if (!IS_ERR_OR_NULL(btf))
+ btf->id = btf_info.id;
exit_free:
free(ptr);
@@ -4819,6 +4836,97 @@ static int btf_dedup_remap_types(struct btf_dedup *d)
return 0;
}
+/**
+ * btf_load_next_with_info - get first BTF with ID bigger than the input one.
+ * @start_id: ID to start the search from
+ * @info: buffer to put BTF info to
+ * @base_btf: base BTF, can be %NULL if @vmlinux is true
+ * @vmlinux: true to look for the vmlinux BTF instead of a module BTF
+ *
+ * Obtains the first BTF with an ID bigger than @start_id. @info::name and
+ * @info::name_len must be initialized by the caller. The default name buffer
+ * size is %BTF_NAME_BUF_LEN.
+ * FD must be closed after BTF is no longer needed. If @vmlinux is true, FD can
+ * be closed and set to -1 right away without preventing later usage.
+ *
+ * Returns pointer to the BTF loaded from the kernel or an error pointer.
+ */
+struct btf *btf_load_next_with_info(__u32 start_id, struct bpf_btf_info *info,
+ struct btf *base_btf, bool vmlinux)
+{
+ __u32 name_len = info->name_len;
+ __u64 name = info->name;
+ const char *name_str;
+ __u32 id = start_id;
+
+ if (!name)
+ return ERR_PTR(-EINVAL);
+
+ name_str = u64_to_ptr(name);
+
+ while (true) {
+ __u32 len = sizeof(*info);
+ struct btf *btf;
+ int err, fd;
+
+ err = bpf_btf_get_next_id(id, &id);
+ if (err) {
+ err = -errno;
+ if (err != -ENOENT)
+ pr_warn("failed to iterate BTF objects: %d\n",
+ err);
+ return ERR_PTR(err);
+ }
+
+ fd = bpf_btf_get_fd_by_id(id);
+ if (fd < 0) {
+ err = -errno;
+ if (err == -ENOENT)
+ /* Expected race: non-vmlinux BTF was
+ * unloaded
+ */
+ continue;
+ pr_warn("failed to get BTF object #%d FD: %d\n",
+ id, err);
+ return ERR_PTR(err);
+ }
+
+ memset(info, 0, len);
+ info->name = name;
+ info->name_len = name_len;
+
+ err = bpf_obj_get_info_by_fd(fd, info, &len);
+ if (err) {
+ err = -errno;
+ pr_warn("failed to get BTF object #%d info: %d\n",
+ id, err);
+ goto err_out;
+ }
+
+ /* Filter BTFs */
+ if (!info->kernel_btf ||
+ !strcmp(name_str, "vmlinux") != vmlinux) {
+ close(fd);
+ continue;
+ }
+
+ btf = btf_get_from_fd(fd, base_btf);
+ err = libbpf_get_error(btf);
+ if (err) {
+ pr_warn("failed to load module [%s]'s BTF object #%d: %d\n",
+ name_str, id, err);
+ goto err_out;
+ }
+
+ btf->fd = fd;
+ return btf;
+
+err_out:
+ close(fd);
+ return ERR_PTR(err);
+ }
+}
+
/*
* Probe few well-known locations for vmlinux kernel image and try to load BTF
* data out of it to use for target BTF.
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 335467ece75f..8e27bad5e80f 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -5559,11 +5559,11 @@ int bpf_core_add_cands(struct bpf_core_cand *local_cand,
static int load_module_btfs(struct bpf_object *obj)
{
- struct bpf_btf_info info;
+ char name[BTF_NAME_BUF_LEN] = { };
struct module_btf *mod_btf;
+ struct bpf_btf_info info;
struct btf *btf;
- char name[64];
- __u32 id = 0, len;
+ __u32 id = 0;
int err, fd;
if (obj->btf_modules_loaded)
@@ -5580,49 +5580,19 @@ static int load_module_btfs(struct bpf_object *obj)
return 0;
while (true) {
- err = bpf_btf_get_next_id(id, &id);
- if (err && errno == ENOENT)
- return 0;
- if (err) {
- err = -errno;
- pr_warn("failed to iterate BTF objects: %d\n", err);
- return err;
- }
-
- fd = bpf_btf_get_fd_by_id(id);
- if (fd < 0) {
- if (errno == ENOENT)
- continue; /* expected race: BTF was unloaded */
- err = -errno;
- pr_warn("failed to get BTF object #%d FD: %d\n", id, err);
- return err;
- }
-
- len = sizeof(info);
memset(&info, 0, sizeof(info));
info.name = ptr_to_u64(name);
info.name_len = sizeof(name);
- err = bpf_obj_get_info_by_fd(fd, &info, &len);
- if (err) {
- err = -errno;
- pr_warn("failed to get BTF object #%d info: %d\n", id, err);
- goto err_out;
- }
-
- /* ignore non-module BTFs */
- if (!info.kernel_btf || strcmp(name, "vmlinux") == 0) {
- close(fd);
- continue;
- }
-
- btf = btf_get_from_fd(fd, obj->btf_vmlinux);
+ btf = btf_load_next_with_info(id, &info, obj->btf_vmlinux,
+ false);
err = libbpf_get_error(btf);
- if (err) {
- pr_warn("failed to load module [%s]'s BTF object #%d: %d\n",
- name, id, err);
- goto err_out;
- }
+ if (err)
+ return err == -ENOENT ? 0 : err;
+
+ fd = btf__fd(btf);
+ btf__set_fd(btf, -1);
+ id = btf_obj_id(btf);
err = libbpf_ensure_mem((void **)&obj->btf_modules, &obj->btf_module_cap,
sizeof(*obj->btf_modules), obj->btf_module_cnt + 1);
diff --git a/tools/lib/bpf/libbpf_internal.h b/tools/lib/bpf/libbpf_internal.h
index a1ad145ffa74..9b0bbd4a5f64 100644
--- a/tools/lib/bpf/libbpf_internal.h
+++ b/tools/lib/bpf/libbpf_internal.h
@@ -366,9 +366,14 @@ int libbpf__load_raw_btf(const char *raw_types, size_t types_len,
const char *str_sec, size_t str_len);
int btf_load_into_kernel(struct btf *btf, char *log_buf, size_t log_sz, __u32 log_level);
-struct btf *btf_get_from_fd(int btf_fd, struct btf *base_btf);
void btf_get_kernel_prefix_kind(enum bpf_attach_type attach_type,
const char **prefix, int *kind);
+__u32 btf_obj_id(const struct btf *btf);
+
+#define BTF_NAME_BUF_LEN 64
+
+struct btf *btf_load_next_with_info(__u32 start_id, struct bpf_btf_info *info,
+ struct btf *base_btf, bool vmlinux);
struct btf_ext_info {
/*
--
2.36.1
* [xdp-hints] [PATCH RFC bpf-next 02/52] libbpf: try to load vmlinux BTF from the kernel first
From: Alexander Lobakin @ 2022-06-28 19:47 UTC
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
Cc: Alexander Lobakin, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Toke Hoiland-Jorgensen,
Lorenzo Bianconi, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jesse Brandeburg, John Fastabend, Yajun Deng,
Willem de Bruijn, bpf, netdev, linux-kernel, xdp-hints
From: Larysa Zaremba <larysa.zaremba@intel.com>
Try to acquire the vmlinux BTF the same way it's done for module
BTFs: use btf_load_next_with_info() and resort to the filesystem
lookup only if it fails.
Also, adjust the debug messages in btf__load_vmlinux_btf() to reflect
that what it actually tries to load is the vmlinux BTF.
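Nothing changes for callers, the kernel-provided BTF just takes
precedence now. A sketch:

	struct btf *vmlinux_btf;
	long err;

	vmlinux_btf = btf__load_vmlinux_btf();
	err = libbpf_get_error(vmlinux_btf);
	if (err)
		/* no BTF in the kernel and none found on the fs */
		return err;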
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
---
tools/lib/bpf/btf.c | 32 ++++++++++++++++++++++++++++++--
1 file changed, 30 insertions(+), 2 deletions(-)
diff --git a/tools/lib/bpf/btf.c b/tools/lib/bpf/btf.c
index 7e4dbf71fd52..8ecd50923fab 100644
--- a/tools/lib/bpf/btf.c
+++ b/tools/lib/bpf/btf.c
@@ -4927,6 +4927,25 @@ struct btf *btf_load_next_with_info(__u32 start_id, struct bpf_btf_info *info,
}
}
+static struct btf *btf_load_vmlinux_from_kernel(void)
+{
+ char name[BTF_NAME_BUF_LEN] = { };
+ struct bpf_btf_info info;
+ struct btf *btf;
+
+ memset(&info, 0, sizeof(info));
+ info.name = ptr_to_u64(name);
+ info.name_len = sizeof(name);
+
+ btf = btf_load_next_with_info(0, &info, NULL, true);
+ if (!libbpf_get_error(btf)) {
+ close(btf->fd);
+ btf__set_fd(btf, -1);
+ }
+
+ return btf;
+}
+
/*
* Probe few well-known locations for vmlinux kernel image and try to load BTF
* data out of it to use for target BTF.
@@ -4953,6 +4972,15 @@ struct btf *btf__load_vmlinux_btf(void)
struct btf *btf;
int i, err;
+ btf = btf_load_vmlinux_from_kernel();
+ err = libbpf_get_error(btf);
+ pr_debug("loading vmlinux BTF from kernel: %d\n", err);
+ if (!err)
+ return btf;
+
+ pr_info("failed to load vmlinux BTF from kernel: %d, will look through filesystem\n",
+ err);
+
uname(&buf);
for (i = 0; i < ARRAY_SIZE(locations); i++) {
@@ -4966,14 +4994,14 @@ struct btf *btf__load_vmlinux_btf(void)
else
btf = btf__parse_elf(path, NULL);
err = libbpf_get_error(btf);
- pr_debug("loading kernel BTF '%s': %d\n", path, err);
+ pr_debug("loading vmlinux BTF '%s': %d\n", path, err);
if (err)
continue;
return btf;
}
- pr_warn("failed to find valid kernel BTF\n");
+ pr_warn("failed to find valid vmlinux BTF\n");
return libbpf_err_ptr(-ESRCH);
}
--
2.36.1
* [xdp-hints] [PATCH RFC bpf-next 03/52] libbpf: add function to get the pair BTF ID + type ID for a given type
From: Alexander Lobakin @ 2022-06-28 19:47 UTC
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
Cc: Alexander Lobakin, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Toke Hoiland-Jorgensen,
Lorenzo Bianconi, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jesse Brandeburg, John Fastabend, Yajun Deng,
Willem de Bruijn, bpf, netdev, linux-kernel, xdp-hints
Add a new libbpf API function, libbpf_get_type_btf_id(), to provide
a short way to get the pair BTF ID << 32 | type ID for the provided
type. The primary purpose is to use it in userspace BPF prog loaders
to pass those IDs to the kernel to tell it which XDP generic metadata
to create, as well as in AF_XDP programs, to compare them against the
IDs from frame metadata.
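For example (a sketch; the type string carries the BTF kind prefix,
which the new btf_kind_from_str() strips before the lookup):

	__u64 id;
	int err;

	err = libbpf_get_type_btf_id("struct sk_buff", &id);
	if (err)
		return err;

	/* upper 32 bits: BTF object ID, lower 32 bits: type ID */
	printf("btf %u type %u\n", (__u32)(id >> 32), (__u32)id);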
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
---
tools/lib/bpf/libbpf.c | 113 +++++++++++++++++++++++++++++++++++++++
tools/lib/bpf/libbpf.h | 1 +
tools/lib/bpf/libbpf.map | 1 +
3 files changed, 115 insertions(+)
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 8e27bad5e80f..9bda111c8167 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -2252,6 +2252,28 @@ const char *btf_kind_str(const struct btf_type *t)
return __btf_kind_str(btf_kind(t));
}
+static __u32 btf_kind_from_str(const char **type)
+{
+ const char *pos, *orig = *type;
+ __u32 kind;
+ int len;
+
+ pos = strchr(orig, ' ');
+ if (pos) {
+ len = pos - orig;
+ *type = pos + 1;
+ } else {
+ len = strlen(orig);
+ }
+
+ for (kind = BTF_KIND_UNKN; kind < NR_BTF_KINDS; kind++) {
+ if (!strncmp(orig, __btf_kind_str(kind), len))
+ break;
+ }
+
+ return kind < NR_BTF_KINDS ? kind : BTF_KIND_UNKN;
+}
+
/*
* Fetch integer attribute of BTF map definition. Such attributes are
* represented using a pointer to an array, in which dimensionality of array
@@ -9617,6 +9639,97 @@ int libbpf_find_vmlinux_btf_id(const char *name,
return libbpf_err(err);
}
+static __s32 libbpf_find_btf_id(const char *type, __u32 kind,
+ struct btf **res_btf)
+{
+ char name[BTF_NAME_BUF_LEN] = { };
+ struct btf *vmlinux_btf, *btf;
+ struct bpf_btf_info info;
+ __u32 id = 0;
+ __s32 ret;
+
+ if (res_btf)
+ *res_btf = NULL;
+
+ if (!type || !*type)
+ return -EINVAL;
+
+ vmlinux_btf = btf__load_vmlinux_btf();
+ ret = libbpf_get_error(vmlinux_btf);
+ if (ret < 0)
+ goto free_vmlinux;
+
+ ret = btf__find_by_name_kind(vmlinux_btf, type, kind);
+ if (ret > 0) {
+ btf = vmlinux_btf;
+ goto out;
+ }
+
+ while (true) {
+ memset(&info, 0, sizeof(info));
+ info.name = ptr_to_u64(name);
+ info.name_len = sizeof(name);
+
+ btf = btf_load_next_with_info(id, &info, vmlinux_btf, false);
+ ret = libbpf_get_error(btf);
+ if (ret)
+ break;
+
+ ret = btf__find_by_name_kind(btf, type, kind);
+ if (ret > 0)
+ break;
+
+ id = btf_obj_id(btf);
+ btf__free(btf);
+ }
+
+free_vmlinux:
+ btf__free(vmlinux_btf);
+
+out:
+ if (ret > 0 && res_btf)
+ *res_btf = btf;
+
+ return ret ? : -ESRCH;
+}
+
+/**
+ * libbpf_get_type_btf_id - get the pair BTF ID + type ID for a given type
+ * @type: pointer to the name of the type to look for
+ * @res_id: pointer to write the result to
+ *
+ * Tries to find the BTF corresponding to the provided type (full string) and
+ * writes the pair BTF ID << 32 | type ID. Such encoded __u64 values are used
+ * in XDP generic-compatible metadata to distinguish between different
+ * metadata structures.
+ * @res_id can be %NULL to only check if a particular type exists within
+ * the BTF.
+ *
+ * Returns 0 in case of success, -errno otherwise.
+ */
+int libbpf_get_type_btf_id(const char *type, __u64 *res_id)
+{
+ struct btf *btf = NULL;
+ __s32 type_id;
+ __u32 kind;
+
+ if (res_id)
+ *res_id = 0;
+
+ if (!type || !*type)
+ return libbpf_err(-EINVAL);
+
+ kind = btf_kind_from_str(&type);
+
+ type_id = libbpf_find_btf_id(type, kind, &btf);
+ if (type_id > 0 && res_id)
+ *res_id = ((__u64)btf_obj_id(btf) << 32) | type_id;
+
+ btf__free(btf);
+
+ return libbpf_err(min(type_id, 0));
+}
+
static int libbpf_find_prog_btf_id(const char *name, __u32 attach_prog_fd)
{
struct bpf_prog_info info = {};
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index fa27969da0da..4056e9038086 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -295,6 +295,7 @@ LIBBPF_API int libbpf_attach_type_by_name(const char *name,
enum bpf_attach_type *attach_type);
LIBBPF_API int libbpf_find_vmlinux_btf_id(const char *name,
enum bpf_attach_type attach_type);
+LIBBPF_API int libbpf_get_type_btf_id(const char *type, __u64 *id);
/* Accessors of bpf_program */
struct bpf_program;
diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
index 116a2a8ee7c2..f0987df15b7a 100644
--- a/tools/lib/bpf/libbpf.map
+++ b/tools/lib/bpf/libbpf.map
@@ -468,6 +468,7 @@ LIBBPF_1.0.0 {
libbpf_bpf_link_type_str;
libbpf_bpf_map_type_str;
libbpf_bpf_prog_type_str;
+ libbpf_get_type_btf_id;
local: *;
};
--
2.36.1
* [xdp-hints] [PATCH RFC bpf-next 04/52] libbpf: patch module BTF ID into BPF insns
From: Alexander Lobakin @ 2022-06-28 19:47 UTC
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
Cc: Alexander Lobakin, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Toke Hoiland-Jorgensen,
Lorenzo Bianconi, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jesse Brandeburg, John Fastabend, Yajun Deng,
Willem de Bruijn, bpf, netdev, linux-kernel, xdp-hints
From: Larysa Zaremba <larysa.zaremba@intel.com>
Return both the type ID and the BTF ID from
bpf_core_type_id_kernel(). Previously, only the type ID was
returned, despite the fact that LLVM has enabled a 64-bit return
type for this instruction [1].
That was done in preparation for the patch [2], which
also strongly served as an inspiration for this implementation.
[1] https://reviews.llvm.org/D91489
[2] https://lore.kernel.org/all/20201205025140.443115-1-andrii@kernel.org
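With that, a BPF program can split the result itself, e.g. (a
sketch):

	__u64 id = bpf_core_type_id_kernel(struct sk_buff);
	__u32 btf_obj_id = id >> 32; /* BTF obj ID patched by libbpf */
	__u32 type_id = (__u32)id;   /* type ID within that BTF */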
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
---
tools/lib/bpf/bpf_core_read.h | 3 ++-
tools/lib/bpf/relo_core.c | 8 +++++++-
tools/lib/bpf/relo_core.h | 1 +
3 files changed, 10 insertions(+), 2 deletions(-)
diff --git a/tools/lib/bpf/bpf_core_read.h b/tools/lib/bpf/bpf_core_read.h
index fd48b1ff59ca..2b7d675b2dd0 100644
--- a/tools/lib/bpf/bpf_core_read.h
+++ b/tools/lib/bpf/bpf_core_read.h
@@ -167,7 +167,8 @@ enum bpf_enum_value_kind {
* Convenience macro to get BTF type ID of a target kernel's type that matches
* specified local type.
* Returns:
- * - valid 32-bit unsigned type ID in kernel BTF;
+ * - valid 64-bit unsigned integer: the upper 32 bits are the BTF ID
+ * and the lower 32 bits are the type ID within the BTF;
* - 0, if no matching type was found in a target kernel BTF.
*/
#define bpf_core_type_id_kernel(type) \
diff --git a/tools/lib/bpf/relo_core.c b/tools/lib/bpf/relo_core.c
index e070123332cd..020f0f81374c 100644
--- a/tools/lib/bpf/relo_core.c
+++ b/tools/lib/bpf/relo_core.c
@@ -884,6 +884,7 @@ static int bpf_core_calc_relo(const char *prog_name,
res->fail_memsz_adjust = false;
res->orig_sz = res->new_sz = 0;
res->orig_type_id = res->new_type_id = 0;
+ res->btf_obj_id = 0;
if (core_relo_is_field_based(relo->kind)) {
err = bpf_core_calc_field_relo(prog_name, relo, local_spec,
@@ -934,6 +935,8 @@ static int bpf_core_calc_relo(const char *prog_name,
} else if (core_relo_is_type_based(relo->kind)) {
err = bpf_core_calc_type_relo(relo, local_spec, &res->orig_val, &res->validate);
err = err ?: bpf_core_calc_type_relo(relo, targ_spec, &res->new_val, NULL);
+ if (!err && relo->kind == BPF_CORE_TYPE_ID_TARGET)
+ res->btf_obj_id = btf_obj_id(targ_spec->btf);
} else if (core_relo_is_enumval_based(relo->kind)) {
err = bpf_core_calc_enumval_relo(relo, local_spec, &res->orig_val);
err = err ?: bpf_core_calc_enumval_relo(relo, targ_spec, &res->new_val);
@@ -1125,7 +1128,10 @@ int bpf_core_patch_insn(const char *prog_name, struct bpf_insn *insn,
}
insn[0].imm = new_val;
- insn[1].imm = new_val >> 32;
+ /* For type IDs, upper 32 bits are used for BTF ID */
+ insn[1].imm = relo->kind == BPF_CORE_TYPE_ID_TARGET ?
+ res->btf_obj_id :
+ (new_val >> 32);
pr_debug("prog '%s': relo #%d: patched insn #%d (LDIMM64) imm64 %llu -> %llu\n",
prog_name, relo_idx, insn_idx,
(unsigned long long)imm, (unsigned long long)new_val);
diff --git a/tools/lib/bpf/relo_core.h b/tools/lib/bpf/relo_core.h
index 3fd3842d4230..f026ea36140e 100644
--- a/tools/lib/bpf/relo_core.h
+++ b/tools/lib/bpf/relo_core.h
@@ -66,6 +66,7 @@ struct bpf_core_relo_res {
__u32 orig_type_id;
__u32 new_sz;
__u32 new_type_id;
+ __u32 btf_obj_id;
};
int __bpf_core_types_are_compat(const struct btf *local_btf, __u32 local_id,
--
2.36.1
* [xdp-hints] [PATCH RFC bpf-next 05/52] net, xdp: decouple XDP code from the core networking code
From: Alexander Lobakin @ 2022-06-28 19:47 UTC
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
Cc: Alexander Lobakin, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Toke Hoiland-Jorgensen,
Lorenzo Bianconi, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jesse Brandeburg, John Fastabend, Yajun Deng,
Willem de Bruijn, bpf, netdev, linux-kernel, xdp-hints
Currently, there are a couple of rather big pieces of purely XDP
code residing in `net/core/dev.c` and `net/core/filter.c`, and they
won't get smaller any time soon.
To make this more scalable, move them, along with `net/core/xdp.c`,
to new separate files inside `net/bpf/`, which is almost empty now.
This works out so well that only 3 previously-static functions (plus
1 static key) had to be made global. The only mentions of
XDP left in `filter.c` are helpers which share code with their skb
variants; making that shared code global instead would cost much
more.
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
---
MAINTAINERS | 4 +-
include/linux/filter.h | 2 +
include/linux/netdevice.h | 5 +
net/bpf/Makefile | 5 +-
net/{core/xdp.c => bpf/core.c} | 2 +-
net/bpf/dev.c | 776 ++++++++++++++++++++++++++++
net/bpf/prog_ops.c | 911 +++++++++++++++++++++++++++++++++
net/core/Makefile | 2 +-
net/core/dev.c | 771 ----------------------------
net/core/dev.h | 4 -
net/core/filter.c | 883 +-------------------------------
11 files changed, 1705 insertions(+), 1660 deletions(-)
rename net/{core/xdp.c => bpf/core.c} (99%)
create mode 100644 net/bpf/dev.c
create mode 100644 net/bpf/prog_ops.c
diff --git a/MAINTAINERS b/MAINTAINERS
index ca95b1833b97..91190e12a157 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -21726,7 +21726,9 @@ F: include/net/xdp_priv.h
F: include/trace/events/xdp.h
F: kernel/bpf/cpumap.c
F: kernel/bpf/devmap.c
-F: net/core/xdp.c
+F: net/bpf/core.c
+F: net/bpf/dev.c
+F: net/bpf/prog_ops.c
F: samples/bpf/xdp*
F: tools/testing/selftests/bpf/*xdp*
F: tools/testing/selftests/bpf/*/*xdp*
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 4c1a8b247545..360e60a425ad 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -992,6 +992,8 @@ void xdp_do_flush(void);
#define xdp_do_flush_map xdp_do_flush
void bpf_warn_invalid_xdp_action(struct net_device *dev, struct bpf_prog *prog, u32 act);
+const struct bpf_func_proto *xdp_inet_func_proto(enum bpf_func_id func_id);
+bool xdp_helper_changes_pkt_data(const void *func);
#ifdef CONFIG_INET
struct sock *bpf_run_sk_reuseport(struct sock_reuseport *reuse, struct sock *sk,
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 89afa4f7747d..0b8169c23f22 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3848,7 +3848,12 @@ struct sk_buff *validate_xmit_skb_list(struct sk_buff *skb, struct net_device *d
struct sk_buff *dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
struct netdev_queue *txq, int *ret);
+DECLARE_STATIC_KEY_FALSE(generic_xdp_needed_key);
+
int bpf_xdp_link_attach(const union bpf_attr *attr, struct bpf_prog *prog);
+int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack,
+ int fd, int expected_fd, u32 flags);
+void dev_xdp_uninstall(struct net_device *dev);
u8 dev_xdp_prog_count(struct net_device *dev);
u32 dev_xdp_prog_id(struct net_device *dev, enum bpf_xdp_mode mode);
diff --git a/net/bpf/Makefile b/net/bpf/Makefile
index 1ebe270bde23..715550f9048b 100644
--- a/net/bpf/Makefile
+++ b/net/bpf/Makefile
@@ -1,5 +1,8 @@
# SPDX-License-Identifier: GPL-2.0-only
-obj-$(CONFIG_BPF_SYSCALL) := test_run.o
+
+obj-y := core.o dev.o prog_ops.o
+
+obj-$(CONFIG_BPF_SYSCALL) += test_run.o
ifeq ($(CONFIG_BPF_JIT),y)
obj-$(CONFIG_BPF_SYSCALL) += bpf_dummy_struct_ops.o
endif
diff --git a/net/core/xdp.c b/net/bpf/core.c
similarity index 99%
rename from net/core/xdp.c
rename to net/bpf/core.c
index 24420209bf0e..fbb72792320a 100644
--- a/net/core/xdp.c
+++ b/net/bpf/core.c
@@ -1,5 +1,5 @@
// SPDX-License-Identifier: GPL-2.0-only
-/* net/core/xdp.c
+/* net/bpf/core.c
*
* Copyright (c) 2017 Jesper Dangaard Brouer, Red Hat Inc.
*/
diff --git a/net/bpf/dev.c b/net/bpf/dev.c
new file mode 100644
index 000000000000..dfe0402947f8
--- /dev/null
+++ b/net/bpf/dev.c
@@ -0,0 +1,776 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include <trace/events/xdp.h>
+
+DEFINE_STATIC_KEY_FALSE(generic_xdp_needed_key);
+
+static struct netdev_rx_queue *netif_get_rxqueue(struct sk_buff *skb)
+{
+ struct net_device *dev = skb->dev;
+ struct netdev_rx_queue *rxqueue;
+
+ rxqueue = dev->_rx;
+
+ if (skb_rx_queue_recorded(skb)) {
+ u16 index = skb_get_rx_queue(skb);
+
+ if (unlikely(index >= dev->real_num_rx_queues)) {
+ WARN_ONCE(dev->real_num_rx_queues > 1,
+ "%s received packet on queue %u, but number "
+ "of RX queues is %u\n",
+ dev->name, index, dev->real_num_rx_queues);
+
+ return rxqueue; /* Return first rxqueue */
+ }
+ rxqueue += index;
+ }
+ return rxqueue;
+}
+
+u32 bpf_prog_run_generic_xdp(struct sk_buff *skb, struct xdp_buff *xdp,
+ struct bpf_prog *xdp_prog)
+{
+ void *orig_data, *orig_data_end, *hard_start;
+ struct netdev_rx_queue *rxqueue;
+ bool orig_bcast, orig_host;
+ u32 mac_len, frame_sz;
+ __be16 orig_eth_type;
+ struct ethhdr *eth;
+ u32 metalen, act;
+ int off;
+
+ /* The XDP program wants to see the packet starting at the MAC
+ * header.
+ */
+ mac_len = skb->data - skb_mac_header(skb);
+ hard_start = skb->data - skb_headroom(skb);
+
+ /* SKB "head" area always have tailroom for skb_shared_info */
+ frame_sz = (void *)skb_end_pointer(skb) - hard_start;
+ frame_sz += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+
+ rxqueue = netif_get_rxqueue(skb);
+ xdp_init_buff(xdp, frame_sz, &rxqueue->xdp_rxq);
+ xdp_prepare_buff(xdp, hard_start, skb_headroom(skb) - mac_len,
+ skb_headlen(skb) + mac_len, true);
+
+ orig_data_end = xdp->data_end;
+ orig_data = xdp->data;
+ eth = (struct ethhdr *)xdp->data;
+ orig_host = ether_addr_equal_64bits(eth->h_dest, skb->dev->dev_addr);
+ orig_bcast = is_multicast_ether_addr_64bits(eth->h_dest);
+ orig_eth_type = eth->h_proto;
+
+ act = bpf_prog_run_xdp(xdp_prog, xdp);
+
+ /* check if bpf_xdp_adjust_head was used */
+ off = xdp->data - orig_data;
+ if (off) {
+ if (off > 0)
+ __skb_pull(skb, off);
+ else if (off < 0)
+ __skb_push(skb, -off);
+
+ skb->mac_header += off;
+ skb_reset_network_header(skb);
+ }
+
+ /* check if bpf_xdp_adjust_tail was used */
+ off = xdp->data_end - orig_data_end;
+ if (off != 0) {
+ skb_set_tail_pointer(skb, xdp->data_end - xdp->data);
+ skb->len += off; /* positive on grow, negative on shrink */
+ }
+
+ /* check if XDP changed eth hdr such SKB needs update */
+ eth = (struct ethhdr *)xdp->data;
+ if ((orig_eth_type != eth->h_proto) ||
+ (orig_host != ether_addr_equal_64bits(eth->h_dest,
+ skb->dev->dev_addr)) ||
+ (orig_bcast != is_multicast_ether_addr_64bits(eth->h_dest))) {
+ __skb_push(skb, ETH_HLEN);
+ skb->pkt_type = PACKET_HOST;
+ skb->protocol = eth_type_trans(skb, skb->dev);
+ }
+
+ /* Redirect/Tx gives L2 packet, code that will reuse skb must __skb_pull
+ * before calling us again on redirect path. We do not call do_redirect
+ * as we leave that up to the caller.
+ *
+ * Caller is responsible for managing lifetime of skb (i.e. calling
+ * kfree_skb in response to actions it cannot handle/XDP_DROP).
+ */
+ switch (act) {
+ case XDP_REDIRECT:
+ case XDP_TX:
+ __skb_push(skb, mac_len);
+ break;
+ case XDP_PASS:
+ metalen = xdp->data - xdp->data_meta;
+ if (metalen)
+ skb_metadata_set(skb, metalen);
+ break;
+ }
+
+ return act;
+}
+
+static u32 netif_receive_generic_xdp(struct sk_buff *skb,
+ struct xdp_buff *xdp,
+ struct bpf_prog *xdp_prog)
+{
+ u32 act = XDP_DROP;
+
+ /* Reinjected packets coming from act_mirred or similar should
+ * not get XDP generic processing.
+ */
+ if (skb_is_redirected(skb))
+ return XDP_PASS;
+
+ /* XDP packets must be linear and must have sufficient headroom
+ * of XDP_PACKET_HEADROOM bytes. This is the guarantee that also
+ * native XDP provides, thus we need to do it here as well.
+ */
+ if (skb_cloned(skb) || skb_is_nonlinear(skb) ||
+ skb_headroom(skb) < XDP_PACKET_HEADROOM) {
+ int hroom = XDP_PACKET_HEADROOM - skb_headroom(skb);
+ int troom = skb->tail + skb->data_len - skb->end;
+
+ /* In case we have to go down the path and also linearize,
+ * then lets do the pskb_expand_head() work just once here.
+ */
+ if (pskb_expand_head(skb,
+ hroom > 0 ? ALIGN(hroom, NET_SKB_PAD) : 0,
+ troom > 0 ? troom + 128 : 0, GFP_ATOMIC))
+ goto do_drop;
+ if (skb_linearize(skb))
+ goto do_drop;
+ }
+
+ act = bpf_prog_run_generic_xdp(skb, xdp, xdp_prog);
+ switch (act) {
+ case XDP_REDIRECT:
+ case XDP_TX:
+ case XDP_PASS:
+ break;
+ default:
+ bpf_warn_invalid_xdp_action(skb->dev, xdp_prog, act);
+ fallthrough;
+ case XDP_ABORTED:
+ trace_xdp_exception(skb->dev, xdp_prog, act);
+ fallthrough;
+ case XDP_DROP:
+ do_drop:
+ kfree_skb(skb);
+ break;
+ }
+
+ return act;
+}
+
+/* When doing generic XDP we have to bypass the qdisc layer and the
+ * network taps in order to match in-driver-XDP behavior.
+ */
+void generic_xdp_tx(struct sk_buff *skb, struct bpf_prog *xdp_prog)
+{
+ struct net_device *dev = skb->dev;
+ struct netdev_queue *txq;
+ bool free_skb = true;
+ int cpu, rc;
+
+ txq = netdev_core_pick_tx(dev, skb, NULL);
+ cpu = smp_processor_id();
+ HARD_TX_LOCK(dev, txq, cpu);
+ if (!netif_xmit_stopped(txq)) {
+ rc = netdev_start_xmit(skb, dev, txq, 0);
+ if (dev_xmit_complete(rc))
+ free_skb = false;
+ }
+ HARD_TX_UNLOCK(dev, txq);
+ if (free_skb) {
+ trace_xdp_exception(dev, xdp_prog, XDP_TX);
+ kfree_skb(skb);
+ }
+}
+
+int do_xdp_generic(struct bpf_prog *xdp_prog, struct sk_buff *skb)
+{
+ if (xdp_prog) {
+ struct xdp_buff xdp;
+ u32 act;
+ int err;
+
+ act = netif_receive_generic_xdp(skb, &xdp, xdp_prog);
+ if (act != XDP_PASS) {
+ switch (act) {
+ case XDP_REDIRECT:
+ err = xdp_do_generic_redirect(skb->dev, skb,
+ &xdp, xdp_prog);
+ if (err)
+ goto out_redir;
+ break;
+ case XDP_TX:
+ generic_xdp_tx(skb, xdp_prog);
+ break;
+ }
+ return XDP_DROP;
+ }
+ }
+ return XDP_PASS;
+out_redir:
+ kfree_skb_reason(skb, SKB_DROP_REASON_XDP);
+ return XDP_DROP;
+}
+EXPORT_SYMBOL_GPL(do_xdp_generic);
+
+/**
+ * dev_disable_gro_hw - disable HW Generic Receive Offload on a device
+ * @dev: device
+ *
+ * Disable HW Generic Receive Offload (GRO_HW) on a net device. Must be
+ * called under RTNL. This is needed if Generic XDP is installed on
+ * the device.
+ */
+static void dev_disable_gro_hw(struct net_device *dev)
+{
+ dev->wanted_features &= ~NETIF_F_GRO_HW;
+ netdev_update_features(dev);
+
+ if (unlikely(dev->features & NETIF_F_GRO_HW))
+ netdev_WARN(dev, "failed to disable GRO_HW!\n");
+}
+
+static int generic_xdp_install(struct net_device *dev, struct netdev_bpf *xdp)
+{
+ struct bpf_prog *old = rtnl_dereference(dev->xdp_prog);
+ struct bpf_prog *new = xdp->prog;
+ int ret = 0;
+
+ switch (xdp->command) {
+ case XDP_SETUP_PROG:
+ rcu_assign_pointer(dev->xdp_prog, new);
+ if (old)
+ bpf_prog_put(old);
+
+ if (old && !new) {
+ static_branch_dec(&generic_xdp_needed_key);
+ } else if (new && !old) {
+ static_branch_inc(&generic_xdp_needed_key);
+ dev_disable_lro(dev);
+ dev_disable_gro_hw(dev);
+ }
+ break;
+
+ default:
+ ret = -EINVAL;
+ break;
+ }
+
+ return ret;
+}
+
+struct bpf_xdp_link {
+ struct bpf_link link;
+ struct net_device *dev; /* protected by rtnl_lock, no refcnt held */
+ int flags;
+};
+
+typedef int (*bpf_op_t)(struct net_device *dev, struct netdev_bpf *bpf);
+
+static enum bpf_xdp_mode dev_xdp_mode(struct net_device *dev, u32 flags)
+{
+ if (flags & XDP_FLAGS_HW_MODE)
+ return XDP_MODE_HW;
+ if (flags & XDP_FLAGS_DRV_MODE)
+ return XDP_MODE_DRV;
+ if (flags & XDP_FLAGS_SKB_MODE)
+ return XDP_MODE_SKB;
+ return dev->netdev_ops->ndo_bpf ? XDP_MODE_DRV : XDP_MODE_SKB;
+}
+
+static bpf_op_t dev_xdp_bpf_op(struct net_device *dev, enum bpf_xdp_mode mode)
+{
+ switch (mode) {
+ case XDP_MODE_SKB:
+ return generic_xdp_install;
+ case XDP_MODE_DRV:
+ case XDP_MODE_HW:
+ return dev->netdev_ops->ndo_bpf;
+ default:
+ return NULL;
+ }
+}
+
+static struct bpf_xdp_link *dev_xdp_link(struct net_device *dev,
+ enum bpf_xdp_mode mode)
+{
+ return dev->xdp_state[mode].link;
+}
+
+static struct bpf_prog *dev_xdp_prog(struct net_device *dev,
+ enum bpf_xdp_mode mode)
+{
+ struct bpf_xdp_link *link = dev_xdp_link(dev, mode);
+
+ if (link)
+ return link->link.prog;
+ return dev->xdp_state[mode].prog;
+}
+
+u8 dev_xdp_prog_count(struct net_device *dev)
+{
+ u8 count = 0;
+ int i;
+
+ for (i = 0; i < __MAX_XDP_MODE; i++)
+ if (dev->xdp_state[i].prog || dev->xdp_state[i].link)
+ count++;
+ return count;
+}
+EXPORT_SYMBOL_GPL(dev_xdp_prog_count);
+
+u32 dev_xdp_prog_id(struct net_device *dev, enum bpf_xdp_mode mode)
+{
+ struct bpf_prog *prog = dev_xdp_prog(dev, mode);
+
+ return prog ? prog->aux->id : 0;
+}
+
+static void dev_xdp_set_link(struct net_device *dev, enum bpf_xdp_mode mode,
+ struct bpf_xdp_link *link)
+{
+ dev->xdp_state[mode].link = link;
+ dev->xdp_state[mode].prog = NULL;
+}
+
+static void dev_xdp_set_prog(struct net_device *dev, enum bpf_xdp_mode mode,
+ struct bpf_prog *prog)
+{
+ dev->xdp_state[mode].link = NULL;
+ dev->xdp_state[mode].prog = prog;
+}
+
+static int dev_xdp_install(struct net_device *dev, enum bpf_xdp_mode mode,
+ bpf_op_t bpf_op, struct netlink_ext_ack *extack,
+ u32 flags, struct bpf_prog *prog)
+{
+ struct netdev_bpf xdp;
+ int err;
+
+ memset(&xdp, 0, sizeof(xdp));
+ xdp.command = mode == XDP_MODE_HW ? XDP_SETUP_PROG_HW : XDP_SETUP_PROG;
+ xdp.extack = extack;
+ xdp.flags = flags;
+ xdp.prog = prog;
+
+ /* Drivers assume refcnt is already incremented (i.e, prog pointer is
+ * "moved" into driver), so they don't increment it on their own, but
+ * they do decrement refcnt when program is detached or replaced.
+ * Given net_device also owns link/prog, we need to bump refcnt here
+ * to prevent drivers from underflowing it.
+ */
+ if (prog)
+ bpf_prog_inc(prog);
+ err = bpf_op(dev, &xdp);
+ if (err) {
+ if (prog)
+ bpf_prog_put(prog);
+ return err;
+ }
+
+ if (mode != XDP_MODE_HW)
+ bpf_prog_change_xdp(dev_xdp_prog(dev, mode), prog);
+
+ return 0;
+}
+
+void dev_xdp_uninstall(struct net_device *dev)
+{
+ struct bpf_xdp_link *link;
+ struct bpf_prog *prog;
+ enum bpf_xdp_mode mode;
+ bpf_op_t bpf_op;
+
+ ASSERT_RTNL();
+
+ for (mode = XDP_MODE_SKB; mode < __MAX_XDP_MODE; mode++) {
+ prog = dev_xdp_prog(dev, mode);
+ if (!prog)
+ continue;
+
+ bpf_op = dev_xdp_bpf_op(dev, mode);
+ if (!bpf_op)
+ continue;
+
+ WARN_ON(dev_xdp_install(dev, mode, bpf_op, NULL, 0, NULL));
+
+ /* auto-detach link from net device */
+ link = dev_xdp_link(dev, mode);
+ if (link)
+ link->dev = NULL;
+ else
+ bpf_prog_put(prog);
+
+ dev_xdp_set_link(dev, mode, NULL);
+ }
+}
+
+static int dev_xdp_attach(struct net_device *dev, struct netlink_ext_ack *extack,
+ struct bpf_xdp_link *link, struct bpf_prog *new_prog,
+ struct bpf_prog *old_prog, u32 flags)
+{
+ unsigned int num_modes = hweight32(flags & XDP_FLAGS_MODES);
+ struct bpf_prog *cur_prog;
+ struct net_device *upper;
+ struct list_head *iter;
+ enum bpf_xdp_mode mode;
+ bpf_op_t bpf_op;
+ int err;
+
+ ASSERT_RTNL();
+
+ /* either link or prog attachment, never both */
+ if (link && (new_prog || old_prog))
+ return -EINVAL;
+ /* link supports only XDP mode flags */
+ if (link && (flags & ~XDP_FLAGS_MODES)) {
+ NL_SET_ERR_MSG(extack, "Invalid XDP flags for BPF link attachment");
+ return -EINVAL;
+ }
+ /* just one XDP mode bit should be set, zero defaults to drv/skb mode */
+ if (num_modes > 1) {
+ NL_SET_ERR_MSG(extack, "Only one XDP mode flag can be set");
+ return -EINVAL;
+ }
+ /* avoid ambiguity if offload + drv/skb mode progs are both loaded */
+ if (!num_modes && dev_xdp_prog_count(dev) > 1) {
+ NL_SET_ERR_MSG(extack,
+ "More than one program loaded, unset mode is ambiguous");
+ return -EINVAL;
+ }
+ /* old_prog != NULL implies XDP_FLAGS_REPLACE is set */
+ if (old_prog && !(flags & XDP_FLAGS_REPLACE)) {
+ NL_SET_ERR_MSG(extack, "XDP_FLAGS_REPLACE is not specified");
+ return -EINVAL;
+ }
+
+ mode = dev_xdp_mode(dev, flags);
+ /* can't replace attached link */
+ if (dev_xdp_link(dev, mode)) {
+ NL_SET_ERR_MSG(extack, "Can't replace active BPF XDP link");
+ return -EBUSY;
+ }
+
+ /* don't allow if an upper device already has a program */
+ netdev_for_each_upper_dev_rcu(dev, upper, iter) {
+ if (dev_xdp_prog_count(upper) > 0) {
+ NL_SET_ERR_MSG(extack, "Cannot attach when an upper device already has a program");
+ return -EEXIST;
+ }
+ }
+
+ cur_prog = dev_xdp_prog(dev, mode);
+ /* can't replace attached prog with link */
+ if (link && cur_prog) {
+ NL_SET_ERR_MSG(extack, "Can't replace active XDP program with BPF link");
+ return -EBUSY;
+ }
+ if ((flags & XDP_FLAGS_REPLACE) && cur_prog != old_prog) {
+ NL_SET_ERR_MSG(extack, "Active program does not match expected");
+ return -EEXIST;
+ }
+
+ /* put effective new program into new_prog */
+ if (link)
+ new_prog = link->link.prog;
+
+ if (new_prog) {
+ bool offload = mode == XDP_MODE_HW;
+ enum bpf_xdp_mode other_mode = mode == XDP_MODE_SKB
+ ? XDP_MODE_DRV : XDP_MODE_SKB;
+
+ if ((flags & XDP_FLAGS_UPDATE_IF_NOEXIST) && cur_prog) {
+ NL_SET_ERR_MSG(extack, "XDP program already attached");
+ return -EBUSY;
+ }
+ if (!offload && dev_xdp_prog(dev, other_mode)) {
+ NL_SET_ERR_MSG(extack, "Native and generic XDP can't be active at the same time");
+ return -EEXIST;
+ }
+ if (!offload && bpf_prog_is_dev_bound(new_prog->aux)) {
+ NL_SET_ERR_MSG(extack, "Using device-bound program without HW_MODE flag is not supported");
+ return -EINVAL;
+ }
+ if (new_prog->expected_attach_type == BPF_XDP_DEVMAP) {
+ NL_SET_ERR_MSG(extack, "BPF_XDP_DEVMAP programs can not be attached to a device");
+ return -EINVAL;
+ }
+ if (new_prog->expected_attach_type == BPF_XDP_CPUMAP) {
+ NL_SET_ERR_MSG(extack, "BPF_XDP_CPUMAP programs can not be attached to a device");
+ return -EINVAL;
+ }
+ }
+
+ /* don't call drivers if the effective program didn't change */
+ if (new_prog != cur_prog) {
+ bpf_op = dev_xdp_bpf_op(dev, mode);
+ if (!bpf_op) {
+ NL_SET_ERR_MSG(extack, "Underlying driver does not support XDP in native mode");
+ return -EOPNOTSUPP;
+ }
+
+ err = dev_xdp_install(dev, mode, bpf_op, extack, flags, new_prog);
+ if (err)
+ return err;
+ }
+
+ if (link)
+ dev_xdp_set_link(dev, mode, link);
+ else
+ dev_xdp_set_prog(dev, mode, new_prog);
+ if (cur_prog)
+ bpf_prog_put(cur_prog);
+
+ return 0;
+}
+
+static int dev_xdp_attach_link(struct net_device *dev,
+ struct netlink_ext_ack *extack,
+ struct bpf_xdp_link *link)
+{
+ return dev_xdp_attach(dev, extack, link, NULL, NULL, link->flags);
+}
+
+static int dev_xdp_detach_link(struct net_device *dev,
+ struct netlink_ext_ack *extack,
+ struct bpf_xdp_link *link)
+{
+ enum bpf_xdp_mode mode;
+ bpf_op_t bpf_op;
+
+ ASSERT_RTNL();
+
+ mode = dev_xdp_mode(dev, link->flags);
+ if (dev_xdp_link(dev, mode) != link)
+ return -EINVAL;
+
+ bpf_op = dev_xdp_bpf_op(dev, mode);
+ WARN_ON(dev_xdp_install(dev, mode, bpf_op, NULL, 0, NULL));
+ dev_xdp_set_link(dev, mode, NULL);
+ return 0;
+}
+
+static void bpf_xdp_link_release(struct bpf_link *link)
+{
+ struct bpf_xdp_link *xdp_link = container_of(link, struct bpf_xdp_link, link);
+
+ rtnl_lock();
+
+ /* if racing with net_device's tear down, xdp_link->dev might be
+ * already NULL, in which case link was already auto-detached
+ */
+ if (xdp_link->dev) {
+ WARN_ON(dev_xdp_detach_link(xdp_link->dev, NULL, xdp_link));
+ xdp_link->dev = NULL;
+ }
+
+ rtnl_unlock();
+}
+
+static int bpf_xdp_link_detach(struct bpf_link *link)
+{
+ bpf_xdp_link_release(link);
+ return 0;
+}
+
+static void bpf_xdp_link_dealloc(struct bpf_link *link)
+{
+ struct bpf_xdp_link *xdp_link = container_of(link, struct bpf_xdp_link, link);
+
+ kfree(xdp_link);
+}
+
+static void bpf_xdp_link_show_fdinfo(const struct bpf_link *link,
+ struct seq_file *seq)
+{
+ struct bpf_xdp_link *xdp_link = container_of(link, struct bpf_xdp_link, link);
+ u32 ifindex = 0;
+
+ rtnl_lock();
+ if (xdp_link->dev)
+ ifindex = xdp_link->dev->ifindex;
+ rtnl_unlock();
+
+ seq_printf(seq, "ifindex:\t%u\n", ifindex);
+}
+
+static int bpf_xdp_link_fill_link_info(const struct bpf_link *link,
+ struct bpf_link_info *info)
+{
+ struct bpf_xdp_link *xdp_link = container_of(link, struct bpf_xdp_link, link);
+ u32 ifindex = 0;
+
+ rtnl_lock();
+ if (xdp_link->dev)
+ ifindex = xdp_link->dev->ifindex;
+ rtnl_unlock();
+
+ info->xdp.ifindex = ifindex;
+ return 0;
+}
+
+static int bpf_xdp_link_update(struct bpf_link *link, struct bpf_prog *new_prog,
+ struct bpf_prog *old_prog)
+{
+ struct bpf_xdp_link *xdp_link = container_of(link, struct bpf_xdp_link, link);
+ enum bpf_xdp_mode mode;
+ bpf_op_t bpf_op;
+ int err = 0;
+
+ rtnl_lock();
+
+ /* link might have been auto-released already, so fail */
+ if (!xdp_link->dev) {
+ err = -ENOLINK;
+ goto out_unlock;
+ }
+
+ if (old_prog && link->prog != old_prog) {
+ err = -EPERM;
+ goto out_unlock;
+ }
+ old_prog = link->prog;
+ if (old_prog->type != new_prog->type ||
+ old_prog->expected_attach_type != new_prog->expected_attach_type) {
+ err = -EINVAL;
+ goto out_unlock;
+ }
+
+ if (old_prog == new_prog) {
+ /* no-op, don't disturb drivers */
+ bpf_prog_put(new_prog);
+ goto out_unlock;
+ }
+
+ mode = dev_xdp_mode(xdp_link->dev, xdp_link->flags);
+ bpf_op = dev_xdp_bpf_op(xdp_link->dev, mode);
+ err = dev_xdp_install(xdp_link->dev, mode, bpf_op, NULL,
+ xdp_link->flags, new_prog);
+ if (err)
+ goto out_unlock;
+
+ old_prog = xchg(&link->prog, new_prog);
+ bpf_prog_put(old_prog);
+
+out_unlock:
+ rtnl_unlock();
+ return err;
+}
+
+static const struct bpf_link_ops bpf_xdp_link_lops = {
+ .release = bpf_xdp_link_release,
+ .dealloc = bpf_xdp_link_dealloc,
+ .detach = bpf_xdp_link_detach,
+ .show_fdinfo = bpf_xdp_link_show_fdinfo,
+ .fill_link_info = bpf_xdp_link_fill_link_info,
+ .update_prog = bpf_xdp_link_update,
+};
+
+int bpf_xdp_link_attach(const union bpf_attr *attr, struct bpf_prog *prog)
+{
+ struct net *net = current->nsproxy->net_ns;
+ struct bpf_link_primer link_primer;
+ struct bpf_xdp_link *link;
+ struct net_device *dev;
+ int err, fd;
+
+ rtnl_lock();
+ dev = dev_get_by_index(net, attr->link_create.target_ifindex);
+ if (!dev) {
+ rtnl_unlock();
+ return -EINVAL;
+ }
+
+ link = kzalloc(sizeof(*link), GFP_USER);
+ if (!link) {
+ err = -ENOMEM;
+ goto unlock;
+ }
+
+ bpf_link_init(&link->link, BPF_LINK_TYPE_XDP, &bpf_xdp_link_lops, prog);
+ link->dev = dev;
+ link->flags = attr->link_create.flags;
+
+ err = bpf_link_prime(&link->link, &link_primer);
+ if (err) {
+ kfree(link);
+ goto unlock;
+ }
+
+ err = dev_xdp_attach_link(dev, NULL, link);
+ rtnl_unlock();
+
+ if (err) {
+ link->dev = NULL;
+ bpf_link_cleanup(&link_primer);
+ goto out_put_dev;
+ }
+
+ fd = bpf_link_settle(&link_primer);
+ /* link itself doesn't hold dev's refcnt to not complicate shutdown */
+ dev_put(dev);
+ return fd;
+
+unlock:
+ rtnl_unlock();
+
+out_put_dev:
+ dev_put(dev);
+ return err;
+}
+
+/**
+ * dev_change_xdp_fd - set or clear a bpf program for a device rx path
+ * @dev: device
+ * @extack: netlink extended ack
+ * @fd: new program fd or negative value to clear
+ * @expected_fd: old program fd that userspace expects to replace or clear
+ * @flags: xdp-related flags
+ *
+ * Set or clear a bpf program for a device
+ */
+int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack,
+ int fd, int expected_fd, u32 flags)
+{
+ enum bpf_xdp_mode mode = dev_xdp_mode(dev, flags);
+ struct bpf_prog *new_prog = NULL, *old_prog = NULL;
+ int err;
+
+ ASSERT_RTNL();
+
+ if (fd >= 0) {
+ new_prog = bpf_prog_get_type_dev(fd, BPF_PROG_TYPE_XDP,
+ mode != XDP_MODE_SKB);
+ if (IS_ERR(new_prog))
+ return PTR_ERR(new_prog);
+ }
+
+ if (expected_fd >= 0) {
+ old_prog = bpf_prog_get_type_dev(expected_fd, BPF_PROG_TYPE_XDP,
+ mode != XDP_MODE_SKB);
+ if (IS_ERR(old_prog)) {
+ err = PTR_ERR(old_prog);
+ old_prog = NULL;
+ goto err_out;
+ }
+ }
+
+ err = dev_xdp_attach(dev, extack, NULL, new_prog, old_prog, flags);
+
+err_out:
+ if (err && new_prog)
+ bpf_prog_put(new_prog);
+ if (old_prog)
+ bpf_prog_put(old_prog);
+ return err;
+}
diff --git a/net/bpf/prog_ops.c b/net/bpf/prog_ops.c
new file mode 100644
index 000000000000..33f02842e715
--- /dev/null
+++ b/net/bpf/prog_ops.c
@@ -0,0 +1,911 @@
+// SPDX-License-Identifier: GPL-2.0-only
+
+#include <linux/btf.h>
+#include <linux/btf_ids.h>
+#include <net/xdp_sock.h>
+#include <trace/events/xdp.h>
+
+BPF_CALL_1(bpf_xdp_get_buff_len, struct xdp_buff*, xdp)
+{
+ return xdp_get_buff_len(xdp);
+}
+
+static const struct bpf_func_proto bpf_xdp_get_buff_len_proto = {
+ .func = bpf_xdp_get_buff_len,
+ .gpl_only = false,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_PTR_TO_CTX,
+};
+
+BTF_ID_LIST_SINGLE(bpf_xdp_get_buff_len_bpf_ids, struct, xdp_buff)
+
+const struct bpf_func_proto bpf_xdp_get_buff_len_trace_proto = {
+ .func = bpf_xdp_get_buff_len,
+ .gpl_only = false,
+ .arg1_type = ARG_PTR_TO_BTF_ID,
+ .arg1_btf_id = &bpf_xdp_get_buff_len_bpf_ids[0],
+};
+
+static unsigned long xdp_get_metalen(const struct xdp_buff *xdp)
+{
+ return xdp_data_meta_unsupported(xdp) ? 0 :
+ xdp->data - xdp->data_meta;
+}
+
+BPF_CALL_2(bpf_xdp_adjust_head, struct xdp_buff *, xdp, int, offset)
+{
+ void *xdp_frame_end = xdp->data_hard_start + sizeof(struct xdp_frame);
+ unsigned long metalen = xdp_get_metalen(xdp);
+ void *data_start = xdp_frame_end + metalen;
+ void *data = xdp->data + offset;
+
+ if (unlikely(data < data_start ||
+ data > xdp->data_end - ETH_HLEN))
+ return -EINVAL;
+
+ if (metalen)
+ memmove(xdp->data_meta + offset,
+ xdp->data_meta, metalen);
+ xdp->data_meta += offset;
+ xdp->data = data;
+
+ return 0;
+}
+
+static const struct bpf_func_proto bpf_xdp_adjust_head_proto = {
+ .func = bpf_xdp_adjust_head,
+ .gpl_only = false,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_PTR_TO_CTX,
+ .arg2_type = ARG_ANYTHING,
+};
+
+static void bpf_xdp_copy_buf(struct xdp_buff *xdp, unsigned long off,
+ void *buf, unsigned long len, bool flush)
+{
+ unsigned long ptr_len, ptr_off = 0;
+ skb_frag_t *next_frag, *end_frag;
+ struct skb_shared_info *sinfo;
+ void *src, *dst;
+ u8 *ptr_buf;
+
+ if (likely(xdp->data_end - xdp->data >= off + len)) {
+ src = flush ? buf : xdp->data + off;
+ dst = flush ? xdp->data + off : buf;
+ memcpy(dst, src, len);
+ return;
+ }
+
+ sinfo = xdp_get_shared_info_from_buff(xdp);
+ end_frag = &sinfo->frags[sinfo->nr_frags];
+ next_frag = &sinfo->frags[0];
+
+ ptr_len = xdp->data_end - xdp->data;
+ ptr_buf = xdp->data;
+
+ while (true) {
+ if (off < ptr_off + ptr_len) {
+ unsigned long copy_off = off - ptr_off;
+ unsigned long copy_len = min(len, ptr_len - copy_off);
+
+ src = flush ? buf : ptr_buf + copy_off;
+ dst = flush ? ptr_buf + copy_off : buf;
+ memcpy(dst, src, copy_len);
+
+ off += copy_len;
+ len -= copy_len;
+ buf += copy_len;
+ }
+
+ if (!len || next_frag == end_frag)
+ break;
+
+ ptr_off += ptr_len;
+ ptr_buf = skb_frag_address(next_frag);
+ ptr_len = skb_frag_size(next_frag);
+ next_frag++;
+ }
+}
+
+static void *bpf_xdp_pointer(struct xdp_buff *xdp, u32 offset, u32 len)
+{
+ struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
+ u32 size = xdp->data_end - xdp->data;
+ void *addr = xdp->data;
+ int i;
+
+ if (unlikely(offset > 0xffff || len > 0xffff))
+ return ERR_PTR(-EFAULT);
+
+ if (offset + len > xdp_get_buff_len(xdp))
+ return ERR_PTR(-EINVAL);
+
+ if (offset < size) /* linear area */
+ goto out;
+
+ offset -= size;
+ for (i = 0; i < sinfo->nr_frags; i++) { /* paged area */
+ u32 frag_size = skb_frag_size(&sinfo->frags[i]);
+
+ if (offset < frag_size) {
+ addr = skb_frag_address(&sinfo->frags[i]);
+ size = frag_size;
+ break;
+ }
+ offset -= frag_size;
+ }
+out:
+ return offset + len < size ? addr + offset : NULL;
+}
+
+BPF_CALL_4(bpf_xdp_load_bytes, struct xdp_buff *, xdp, u32, offset,
+ void *, buf, u32, len)
+{
+ void *ptr;
+
+ ptr = bpf_xdp_pointer(xdp, offset, len);
+ if (IS_ERR(ptr))
+ return PTR_ERR(ptr);
+
+ if (!ptr)
+ bpf_xdp_copy_buf(xdp, offset, buf, len, false);
+ else
+ memcpy(buf, ptr, len);
+
+ return 0;
+}
+
+static const struct bpf_func_proto bpf_xdp_load_bytes_proto = {
+ .func = bpf_xdp_load_bytes,
+ .gpl_only = false,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_PTR_TO_CTX,
+ .arg2_type = ARG_ANYTHING,
+ .arg3_type = ARG_PTR_TO_UNINIT_MEM,
+ .arg4_type = ARG_CONST_SIZE,
+};
+
+BPF_CALL_4(bpf_xdp_store_bytes, struct xdp_buff *, xdp, u32, offset,
+ void *, buf, u32, len)
+{
+ void *ptr;
+
+ ptr = bpf_xdp_pointer(xdp, offset, len);
+ if (IS_ERR(ptr))
+ return PTR_ERR(ptr);
+
+ if (!ptr)
+ bpf_xdp_copy_buf(xdp, offset, buf, len, true);
+ else
+ memcpy(ptr, buf, len);
+
+ return 0;
+}
+
+static const struct bpf_func_proto bpf_xdp_store_bytes_proto = {
+ .func = bpf_xdp_store_bytes,
+ .gpl_only = false,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_PTR_TO_CTX,
+ .arg2_type = ARG_ANYTHING,
+ .arg3_type = ARG_PTR_TO_UNINIT_MEM,
+ .arg4_type = ARG_CONST_SIZE,
+};
+
+static int bpf_xdp_frags_increase_tail(struct xdp_buff *xdp, int offset)
+{
+ struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
+ skb_frag_t *frag = &sinfo->frags[sinfo->nr_frags - 1];
+ struct xdp_rxq_info *rxq = xdp->rxq;
+ unsigned int tailroom;
+
+ if (!rxq->frag_size || rxq->frag_size > xdp->frame_sz)
+ return -EOPNOTSUPP;
+
+ tailroom = rxq->frag_size - skb_frag_size(frag) - skb_frag_off(frag);
+ if (unlikely(offset > tailroom))
+ return -EINVAL;
+
+ memset(skb_frag_address(frag) + skb_frag_size(frag), 0, offset);
+ skb_frag_size_add(frag, offset);
+ sinfo->xdp_frags_size += offset;
+
+ return 0;
+}
+
+static int bpf_xdp_frags_shrink_tail(struct xdp_buff *xdp, int offset)
+{
+ struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
+ int i, n_frags_free = 0, len_free = 0;
+
+ if (unlikely(offset > (int)xdp_get_buff_len(xdp) - ETH_HLEN))
+ return -EINVAL;
+
+ for (i = sinfo->nr_frags - 1; i >= 0 && offset > 0; i--) {
+ skb_frag_t *frag = &sinfo->frags[i];
+ int shrink = min_t(int, offset, skb_frag_size(frag));
+
+ len_free += shrink;
+ offset -= shrink;
+
+ if (skb_frag_size(frag) == shrink) {
+ struct page *page = skb_frag_page(frag);
+
+ __xdp_return(page_address(page), &xdp->rxq->mem,
+ false, NULL);
+ n_frags_free++;
+ } else {
+ skb_frag_size_sub(frag, shrink);
+ break;
+ }
+ }
+ sinfo->nr_frags -= n_frags_free;
+ sinfo->xdp_frags_size -= len_free;
+
+ if (unlikely(!sinfo->nr_frags)) {
+ xdp_buff_clear_frags_flag(xdp);
+ xdp->data_end -= offset;
+ }
+
+ return 0;
+}
+
+BPF_CALL_2(bpf_xdp_adjust_tail, struct xdp_buff *, xdp, int, offset)
+{
+ void *data_hard_end = xdp_data_hard_end(xdp); /* use xdp->frame_sz */
+ void *data_end = xdp->data_end + offset;
+
+ if (unlikely(xdp_buff_has_frags(xdp))) { /* non-linear xdp buff */
+ if (offset < 0)
+ return bpf_xdp_frags_shrink_tail(xdp, -offset);
+
+ return bpf_xdp_frags_increase_tail(xdp, offset);
+ }
+
+ /* Notice that xdp_data_hard_end have reserved some tailroom */
+ if (unlikely(data_end > data_hard_end))
+ return -EINVAL;
+
+ /* ALL drivers MUST init xdp->frame_sz, chicken check below */
+ if (unlikely(xdp->frame_sz > PAGE_SIZE)) {
+ WARN_ONCE(1, "Too BIG xdp->frame_sz = %d\n", xdp->frame_sz);
+ return -EINVAL;
+ }
+
+ if (unlikely(data_end < xdp->data + ETH_HLEN))
+ return -EINVAL;
+
+ /* Clear memory area on grow, can contain uninit kernel memory */
+ if (offset > 0)
+ memset(xdp->data_end, 0, offset);
+
+ xdp->data_end = data_end;
+
+ return 0;
+}
+
+static const struct bpf_func_proto bpf_xdp_adjust_tail_proto = {
+ .func = bpf_xdp_adjust_tail,
+ .gpl_only = false,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_PTR_TO_CTX,
+ .arg2_type = ARG_ANYTHING,
+};
+
+BPF_CALL_2(bpf_xdp_adjust_meta, struct xdp_buff *, xdp, int, offset)
+{
+ void *xdp_frame_end = xdp->data_hard_start + sizeof(struct xdp_frame);
+ void *meta = xdp->data_meta + offset;
+ unsigned long metalen = xdp->data - meta;
+
+ if (xdp_data_meta_unsupported(xdp))
+ return -ENOTSUPP;
+ if (unlikely(meta < xdp_frame_end ||
+ meta > xdp->data))
+ return -EINVAL;
+ if (unlikely(xdp_metalen_invalid(metalen)))
+ return -EACCES;
+
+ xdp->data_meta = meta;
+
+ return 0;
+}
+
+static const struct bpf_func_proto bpf_xdp_adjust_meta_proto = {
+ .func = bpf_xdp_adjust_meta,
+ .gpl_only = false,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_PTR_TO_CTX,
+ .arg2_type = ARG_ANYTHING,
+};
+
+/* XDP_REDIRECT works by a three-step process, implemented in the functions
+ * below:
+ *
+ * 1. The bpf_redirect() and bpf_redirect_map() helpers will lookup the target
+ * of the redirect and store it (along with some other metadata) in a per-CPU
+ * struct bpf_redirect_info.
+ *
+ * 2. When the program returns the XDP_REDIRECT return code, the driver will
+ * call xdp_do_redirect() which will use the information in struct
+ * bpf_redirect_info to actually enqueue the frame into a map type-specific
+ * bulk queue structure.
+ *
+ * 3. Before exiting its NAPI poll loop, the driver will call xdp_do_flush(),
+ * which will flush all the different bulk queues, thus completing the
+ * redirect.
+ *
+ * Pointers to the map entries will be kept around for this whole sequence of
+ * steps, protected by RCU. However, there is no top-level rcu_read_lock() in
+ * the core code; instead, the RCU protection relies on everything happening
+ * inside a single NAPI poll sequence, which means it's between a pair of calls
+ * to local_bh_disable()/local_bh_enable().
+ *
+ * The map entries are marked as __rcu and the map code makes sure to
+ * dereference those pointers with rcu_dereference_check() in a way that works
+ * for both sections that to hold an rcu_read_lock() and sections that are
+ * called from NAPI without a separate rcu_read_lock(). The code below does not
+ * use RCU annotations, but relies on those in the map code.
+ */
+void xdp_do_flush(void)
+{
+ __dev_flush();
+ __cpu_map_flush();
+ __xsk_map_flush();
+}
+EXPORT_SYMBOL_GPL(xdp_do_flush);
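
To make the three steps above concrete, a condensed, hypothetical
driver poll loop (the foo_* names and types are invented for the
sketch, the XDP core calls are the real ones):

#include <linux/filter.h>
#include <linux/netdevice.h>
#include <net/xdp.h>

/* hypothetical driver glue, declared only to keep the sketch compact */
struct foo_rxq {
	struct net_device *netdev;
	struct bpf_prog *xdp_prog;
};
bool foo_next_rx_buff(struct foo_rxq *rq, struct xdp_buff *xdp);
void foo_recycle_buff(struct foo_rxq *rq, struct xdp_buff *xdp);

static int foo_napi_poll(struct foo_rxq *rq, int budget)
{
	int done;

	for (done = 0; done < budget; done++) {
		struct xdp_buff xdp;

		if (!foo_next_rx_buff(rq, &xdp))
			break;

		/* step 1 happened inside the program if it called
		 * bpf_redirect() / bpf_redirect_map() */
		switch (bpf_prog_run_xdp(rq->xdp_prog, &xdp)) {
		case XDP_REDIRECT:
			/* step 2: enqueue into the map's bulk queue */
			if (xdp_do_redirect(rq->netdev, &xdp, rq->xdp_prog))
				foo_recycle_buff(rq, &xdp);
			break;
		default:
			foo_recycle_buff(rq, &xdp);
			break;
		}
	}

	/* step 3: one flush per poll loop drains all bulk queues */
	xdp_do_flush();

	return done;
}
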
+
+void bpf_clear_redirect_map(struct bpf_map *map)
+{
+ struct bpf_redirect_info *ri;
+ int cpu;
+
+ for_each_possible_cpu(cpu) {
+ ri = per_cpu_ptr(&bpf_redirect_info, cpu);
+ /* Avoid polluting remote cacheline due to writes if
+ * not needed. Once we pass this test, we need the
+ * cmpxchg() to make sure it hasn't been changed in
+ * the meantime by remote CPU.
+ */
+ if (unlikely(READ_ONCE(ri->map) == map))
+ cmpxchg(&ri->map, map, NULL);
+ }
+}
+
+DEFINE_STATIC_KEY_FALSE(bpf_master_redirect_enabled_key);
+EXPORT_SYMBOL_GPL(bpf_master_redirect_enabled_key);
+
+u32 xdp_master_redirect(struct xdp_buff *xdp)
+{
+ struct net_device *master, *slave;
+ struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+
+ master = netdev_master_upper_dev_get_rcu(xdp->rxq->dev);
+ slave = master->netdev_ops->ndo_xdp_get_xmit_slave(master, xdp);
+ if (slave && slave != xdp->rxq->dev) {
+ /* The target device is different from the receiving device, so
+ * redirect it to the new device.
+ * Using XDP_REDIRECT gets the correct behaviour from XDP enabled
+ * drivers to unmap the packet from their rx ring.
+ */
+ ri->tgt_index = slave->ifindex;
+ ri->map_id = INT_MAX;
+ ri->map_type = BPF_MAP_TYPE_UNSPEC;
+ return XDP_REDIRECT;
+ }
+ return XDP_TX;
+}
+EXPORT_SYMBOL_GPL(xdp_master_redirect);
+
+static inline int __xdp_do_redirect_xsk(struct bpf_redirect_info *ri,
+ struct net_device *dev,
+ struct xdp_buff *xdp,
+ struct bpf_prog *xdp_prog)
+{
+ enum bpf_map_type map_type = ri->map_type;
+ void *fwd = ri->tgt_value;
+ u32 map_id = ri->map_id;
+ int err;
+
+ ri->map_id = 0; /* Valid map id idr range: [1,INT_MAX[ */
+ ri->map_type = BPF_MAP_TYPE_UNSPEC;
+
+ err = __xsk_map_redirect(fwd, xdp);
+ if (unlikely(err))
+ goto err;
+
+ _trace_xdp_redirect_map(dev, xdp_prog, fwd, map_type, map_id, ri->tgt_index);
+ return 0;
+err:
+ _trace_xdp_redirect_map_err(dev, xdp_prog, fwd, map_type, map_id, ri->tgt_index, err);
+ return err;
+}
+
+static __always_inline int __xdp_do_redirect_frame(struct bpf_redirect_info *ri,
+ struct net_device *dev,
+ struct xdp_frame *xdpf,
+ struct bpf_prog *xdp_prog)
+{
+ enum bpf_map_type map_type = ri->map_type;
+ void *fwd = ri->tgt_value;
+ u32 map_id = ri->map_id;
+ struct bpf_map *map;
+ int err;
+
+ ri->map_id = 0; /* Valid map id idr range: [1,INT_MAX[ */
+ ri->map_type = BPF_MAP_TYPE_UNSPEC;
+
+ if (unlikely(!xdpf)) {
+ err = -EOVERFLOW;
+ goto err;
+ }
+
+ switch (map_type) {
+ case BPF_MAP_TYPE_DEVMAP:
+ fallthrough;
+ case BPF_MAP_TYPE_DEVMAP_HASH:
+ map = READ_ONCE(ri->map);
+ if (unlikely(map)) {
+ WRITE_ONCE(ri->map, NULL);
+ err = dev_map_enqueue_multi(xdpf, dev, map,
+ ri->flags & BPF_F_EXCLUDE_INGRESS);
+ } else {
+ err = dev_map_enqueue(fwd, xdpf, dev);
+ }
+ break;
+ case BPF_MAP_TYPE_CPUMAP:
+ err = cpu_map_enqueue(fwd, xdpf, dev);
+ break;
+ case BPF_MAP_TYPE_UNSPEC:
+ if (map_id == INT_MAX) {
+ fwd = dev_get_by_index_rcu(dev_net(dev), ri->tgt_index);
+ if (unlikely(!fwd)) {
+ err = -EINVAL;
+ break;
+ }
+ err = dev_xdp_enqueue(fwd, xdpf, dev);
+ break;
+ }
+ fallthrough;
+ default:
+ err = -EBADRQC;
+ }
+
+ if (unlikely(err))
+ goto err;
+
+ _trace_xdp_redirect_map(dev, xdp_prog, fwd, map_type, map_id, ri->tgt_index);
+ return 0;
+err:
+ _trace_xdp_redirect_map_err(dev, xdp_prog, fwd, map_type, map_id, ri->tgt_index, err);
+ return err;
+}
+
+int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
+ struct bpf_prog *xdp_prog)
+{
+ struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+ enum bpf_map_type map_type = ri->map_type;
+
+ /* XDP_REDIRECT is not fully supported yet for xdp frags since
+ * not all XDP capable drivers can map non-linear xdp_frame in
+ * ndo_xdp_xmit.
+ */
+ if (unlikely(xdp_buff_has_frags(xdp) &&
+ map_type != BPF_MAP_TYPE_CPUMAP))
+ return -EOPNOTSUPP;
+
+ if (map_type == BPF_MAP_TYPE_XSKMAP)
+ return __xdp_do_redirect_xsk(ri, dev, xdp, xdp_prog);
+
+ return __xdp_do_redirect_frame(ri, dev, xdp_convert_buff_to_frame(xdp),
+ xdp_prog);
+}
+EXPORT_SYMBOL_GPL(xdp_do_redirect);
+
+int xdp_do_redirect_frame(struct net_device *dev, struct xdp_buff *xdp,
+ struct xdp_frame *xdpf, struct bpf_prog *xdp_prog)
+{
+ struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+ enum bpf_map_type map_type = ri->map_type;
+
+ if (map_type == BPF_MAP_TYPE_XSKMAP)
+ return __xdp_do_redirect_xsk(ri, dev, xdp, xdp_prog);
+
+ return __xdp_do_redirect_frame(ri, dev, xdpf, xdp_prog);
+}
+EXPORT_SYMBOL_GPL(xdp_do_redirect_frame);
+
+static int xdp_do_generic_redirect_map(struct net_device *dev,
+ struct sk_buff *skb,
+ struct xdp_buff *xdp,
+ struct bpf_prog *xdp_prog,
+ void *fwd,
+ enum bpf_map_type map_type, u32 map_id)
+{
+ struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+ struct bpf_map *map;
+ int err;
+
+ switch (map_type) {
+ case BPF_MAP_TYPE_DEVMAP:
+ fallthrough;
+ case BPF_MAP_TYPE_DEVMAP_HASH:
+ map = READ_ONCE(ri->map);
+ if (unlikely(map)) {
+ WRITE_ONCE(ri->map, NULL);
+ err = dev_map_redirect_multi(dev, skb, xdp_prog, map,
+ ri->flags & BPF_F_EXCLUDE_INGRESS);
+ } else {
+ err = dev_map_generic_redirect(fwd, skb, xdp_prog);
+ }
+ if (unlikely(err))
+ goto err;
+ break;
+ case BPF_MAP_TYPE_XSKMAP:
+ err = xsk_generic_rcv(fwd, xdp);
+ if (err)
+ goto err;
+ consume_skb(skb);
+ break;
+ case BPF_MAP_TYPE_CPUMAP:
+ err = cpu_map_generic_redirect(fwd, skb);
+ if (unlikely(err))
+ goto err;
+ break;
+ default:
+ err = -EBADRQC;
+ goto err;
+ }
+
+ _trace_xdp_redirect_map(dev, xdp_prog, fwd, map_type, map_id, ri->tgt_index);
+ return 0;
+err:
+ _trace_xdp_redirect_map_err(dev, xdp_prog, fwd, map_type, map_id, ri->tgt_index, err);
+ return err;
+}
+
+int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
+ struct xdp_buff *xdp, struct bpf_prog *xdp_prog)
+{
+ struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+ enum bpf_map_type map_type = ri->map_type;
+ void *fwd = ri->tgt_value;
+ u32 map_id = ri->map_id;
+ int err;
+
+ ri->map_id = 0; /* Valid map id idr range: [1,INT_MAX[ */
+ ri->map_type = BPF_MAP_TYPE_UNSPEC;
+
+ if (map_type == BPF_MAP_TYPE_UNSPEC && map_id == INT_MAX) {
+ fwd = dev_get_by_index_rcu(dev_net(dev), ri->tgt_index);
+ if (unlikely(!fwd)) {
+ err = -EINVAL;
+ goto err;
+ }
+
+ err = xdp_ok_fwd_dev(fwd, skb->len);
+ if (unlikely(err))
+ goto err;
+
+ skb->dev = fwd;
+ _trace_xdp_redirect(dev, xdp_prog, ri->tgt_index);
+ generic_xdp_tx(skb, xdp_prog);
+ return 0;
+ }
+
+ return xdp_do_generic_redirect_map(dev, skb, xdp, xdp_prog, fwd, map_type, map_id);
+err:
+ _trace_xdp_redirect_err(dev, xdp_prog, ri->tgt_index, err);
+ return err;
+}
+
+BPF_CALL_2(bpf_xdp_redirect, u32, ifindex, u64, flags)
+{
+ struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
+
+ if (unlikely(flags))
+ return XDP_ABORTED;
+
+ /* NB! Map type UNSPEC and map_id == INT_MAX (never generated
+ * by map_idr) is used for ifindex based XDP redirect.
+ */
+ ri->tgt_index = ifindex;
+ ri->map_id = INT_MAX;
+ ri->map_type = BPF_MAP_TYPE_UNSPEC;
+
+ return XDP_REDIRECT;
+}
+
+static const struct bpf_func_proto bpf_xdp_redirect_proto = {
+ .func = bpf_xdp_redirect,
+ .gpl_only = false,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_ANYTHING,
+ .arg2_type = ARG_ANYTHING,
+};
+
+BPF_CALL_3(bpf_xdp_redirect_map, struct bpf_map *, map, u32, ifindex,
+ u64, flags)
+{
+ return map->ops->map_redirect(map, ifindex, flags);
+}
+
+static const struct bpf_func_proto bpf_xdp_redirect_map_proto = {
+ .func = bpf_xdp_redirect_map,
+ .gpl_only = false,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_CONST_MAP_PTR,
+ .arg2_type = ARG_ANYTHING,
+ .arg3_type = ARG_ANYTHING,
+};
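
For illustration, a minimal user of the two helpers above, redirecting
through a DEVMAP and falling back to XDP_PASS when the lookup fails
(the lookup-failure action is encoded in the low bits of flags):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_DEVMAP);
	__uint(max_entries, 64);
	__type(key, __u32);
	__type(value, __u32);
} tx_ports SEC(".maps");

SEC("xdp")
int redirect_slot0(struct xdp_md *ctx)
{
	/* map_redirect() resolves the target and records it in the
	 * per-CPU bpf_redirect_info consumed by xdp_do_redirect() */
	return bpf_redirect_map(&tx_ports, 0, XDP_PASS);
}

char _license[] SEC("license") = "GPL";
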
+
+static unsigned long bpf_xdp_copy(void *dst, const void *ctx,
+ unsigned long off, unsigned long len)
+{
+ struct xdp_buff *xdp = (struct xdp_buff *)ctx;
+
+ bpf_xdp_copy_buf(xdp, off, dst, len, false);
+ return 0;
+}
+
+BPF_CALL_5(bpf_xdp_event_output, struct xdp_buff *, xdp, struct bpf_map *, map,
+ u64, flags, void *, meta, u64, meta_size)
+{
+ u64 xdp_size = (flags & BPF_F_CTXLEN_MASK) >> 32;
+
+ if (unlikely(flags & ~(BPF_F_CTXLEN_MASK | BPF_F_INDEX_MASK)))
+ return -EINVAL;
+
+ if (unlikely(!xdp || xdp_size > xdp_get_buff_len(xdp)))
+ return -EFAULT;
+
+ return bpf_event_output(map, flags, meta, meta_size, xdp,
+ xdp_size, bpf_xdp_copy);
+}
+
+static const struct bpf_func_proto bpf_xdp_event_output_proto = {
+ .func = bpf_xdp_event_output,
+ .gpl_only = true,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_PTR_TO_CTX,
+ .arg2_type = ARG_CONST_MAP_PTR,
+ .arg3_type = ARG_ANYTHING,
+ .arg4_type = ARG_PTR_TO_MEM | MEM_RDONLY,
+ .arg5_type = ARG_CONST_SIZE_OR_ZERO,
+};
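
A hedged sketch of the corresponding program side: on XDP,
bpf_perf_event_output() resolves to the proto above, and the number of
packet bytes to append travels in the upper 32 bits of flags (the
event_meta layout is made up for the example):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
	__uint(key_size, sizeof(int));
	__uint(value_size, sizeof(__u32));
} events SEC(".maps");

struct event_meta {
	__u32 ifindex;
	__u32 pkt_len;
};

SEC("xdp")
int sample_pkts(struct xdp_md *ctx)
{
	struct event_meta meta = {
		.ifindex = ctx->ingress_ifindex,
		.pkt_len = ctx->data_end - ctx->data,
	};
	/* append the first 64 packet bytes after the metadata; per the
	 * check above, the helper returns -EFAULT if the frame is
	 * shorter than that */
	__u64 flags = BPF_F_CURRENT_CPU | ((__u64)64 << 32);

	bpf_perf_event_output(ctx, &events, flags, &meta, sizeof(meta));

	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";
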
+
+BTF_ID_LIST_SINGLE(bpf_xdp_output_btf_ids, struct, xdp_buff)
+
+const struct bpf_func_proto bpf_xdp_output_proto = {
+ .func = bpf_xdp_event_output,
+ .gpl_only = true,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_PTR_TO_BTF_ID,
+ .arg1_btf_id = &bpf_xdp_output_btf_ids[0],
+ .arg2_type = ARG_CONST_MAP_PTR,
+ .arg3_type = ARG_ANYTHING,
+ .arg4_type = ARG_PTR_TO_MEM | MEM_RDONLY,
+ .arg5_type = ARG_CONST_SIZE_OR_ZERO,
+};
+
+#ifdef CONFIG_INET
+bool bpf_xdp_sock_is_valid_access(int off, int size, enum bpf_access_type type,
+ struct bpf_insn_access_aux *info)
+{
+ if (off < 0 || off >= offsetofend(struct bpf_xdp_sock, queue_id))
+ return false;
+
+ if (off % size != 0)
+ return false;
+
+ switch (off) {
+ default:
+ return size == sizeof(__u32);
+ }
+}
+
+u32 bpf_xdp_sock_convert_ctx_access(enum bpf_access_type type,
+ const struct bpf_insn *si,
+ struct bpf_insn *insn_buf,
+ struct bpf_prog *prog, u32 *target_size)
+{
+ struct bpf_insn *insn = insn_buf;
+
+#define BPF_XDP_SOCK_GET(FIELD) \
+ do { \
+ BUILD_BUG_ON(sizeof_field(struct xdp_sock, FIELD) > \
+ sizeof_field(struct bpf_xdp_sock, FIELD)); \
+ *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_sock, FIELD),\
+ si->dst_reg, si->src_reg, \
+ offsetof(struct xdp_sock, FIELD)); \
+ } while (0)
+
+ switch (si->off) {
+ case offsetof(struct bpf_xdp_sock, queue_id):
+ BPF_XDP_SOCK_GET(queue_id);
+ break;
+ }
+
+ return insn - insn_buf;
+}
+#endif /* CONFIG_INET */
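
The rewrite above is what makes e.g. an XSKMAP lookup result usable: the
lookup returns a struct bpf_xdp_sock pointer whose queue_id read is
converted into a direct load from struct xdp_sock. A sketch, assuming
one socket per RX queue:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_XSKMAP);
	__uint(max_entries, 64);
	__uint(key_size, sizeof(__u32));
	__uint(value_size, sizeof(__u32));
} xsks SEC(".maps");

SEC("xdp")
int pick_socket(struct xdp_md *ctx)
{
	__u32 qid = ctx->rx_queue_index;
	struct bpf_xdp_sock *xsk;

	xsk = bpf_map_lookup_elem(&xsks, &qid);
	/* this queue_id access goes through BPF_XDP_SOCK_GET() above */
	if (xsk && xsk->queue_id == qid)
		return bpf_redirect_map(&xsks, qid, XDP_PASS);

	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";
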
+
+static int xdp_noop_prologue(struct bpf_insn *insn_buf, bool direct_write,
+ const struct bpf_prog *prog)
+{
+ /* Neither direct read nor direct write requires any preliminary
+ * action.
+ */
+ return 0;
+}
+
+static bool __is_valid_xdp_access(int off, int size)
+{
+ if (off < 0 || off >= sizeof(struct xdp_md))
+ return false;
+ if (off % size != 0)
+ return false;
+ if (size != sizeof(__u32))
+ return false;
+
+ return true;
+}
+
+static bool xdp_is_valid_access(int off, int size,
+ enum bpf_access_type type,
+ const struct bpf_prog *prog,
+ struct bpf_insn_access_aux *info)
+{
+ if (prog->expected_attach_type != BPF_XDP_DEVMAP) {
+ switch (off) {
+ case offsetof(struct xdp_md, egress_ifindex):
+ return false;
+ }
+ }
+
+ if (type == BPF_WRITE) {
+ if (bpf_prog_is_dev_bound(prog->aux)) {
+ switch (off) {
+ case offsetof(struct xdp_md, rx_queue_index):
+ return __is_valid_xdp_access(off, size);
+ }
+ }
+ return false;
+ }
+
+ switch (off) {
+ case offsetof(struct xdp_md, data):
+ info->reg_type = PTR_TO_PACKET;
+ break;
+ case offsetof(struct xdp_md, data_meta):
+ info->reg_type = PTR_TO_PACKET_META;
+ break;
+ case offsetof(struct xdp_md, data_end):
+ info->reg_type = PTR_TO_PACKET_END;
+ break;
+ }
+
+ return __is_valid_xdp_access(off, size);
+}
+
+void bpf_warn_invalid_xdp_action(struct net_device *dev, struct bpf_prog *prog, u32 act)
+{
+ const u32 act_max = XDP_REDIRECT;
+
+ pr_warn_once("%s XDP return value %u on prog %s (id %d) dev %s, expect packet loss!\n",
+ act > act_max ? "Illegal" : "Driver unsupported",
+ act, prog->aux->name, prog->aux->id, dev ? dev->name : "N/A");
+}
+EXPORT_SYMBOL_GPL(bpf_warn_invalid_xdp_action);
+
+static u32 xdp_convert_ctx_access(enum bpf_access_type type,
+ const struct bpf_insn *si,
+ struct bpf_insn *insn_buf,
+ struct bpf_prog *prog, u32 *target_size)
+{
+ struct bpf_insn *insn = insn_buf;
+
+ switch (si->off) {
+ case offsetof(struct xdp_md, data):
+ *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_buff, data),
+ si->dst_reg, si->src_reg,
+ offsetof(struct xdp_buff, data));
+ break;
+ case offsetof(struct xdp_md, data_meta):
+ *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_buff, data_meta),
+ si->dst_reg, si->src_reg,
+ offsetof(struct xdp_buff, data_meta));
+ break;
+ case offsetof(struct xdp_md, data_end):
+ *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_buff, data_end),
+ si->dst_reg, si->src_reg,
+ offsetof(struct xdp_buff, data_end));
+ break;
+ case offsetof(struct xdp_md, ingress_ifindex):
+ *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_buff, rxq),
+ si->dst_reg, si->src_reg,
+ offsetof(struct xdp_buff, rxq));
+ *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_rxq_info, dev),
+ si->dst_reg, si->dst_reg,
+ offsetof(struct xdp_rxq_info, dev));
+ *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
+ offsetof(struct net_device, ifindex));
+ break;
+ case offsetof(struct xdp_md, rx_queue_index):
+ *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_buff, rxq),
+ si->dst_reg, si->src_reg,
+ offsetof(struct xdp_buff, rxq));
+ *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
+ offsetof(struct xdp_rxq_info,
+ queue_index));
+ break;
+ case offsetof(struct xdp_md, egress_ifindex):
+ *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_buff, txq),
+ si->dst_reg, si->src_reg,
+ offsetof(struct xdp_buff, txq));
+ *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_txq_info, dev),
+ si->dst_reg, si->dst_reg,
+ offsetof(struct xdp_txq_info, dev));
+ *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
+ offsetof(struct net_device, ifindex));
+ break;
+ }
+
+ return insn - insn_buf;
+}
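
In other words, plain field reads on struct xdp_md compile to the load
sequences above; a trivial, illustrative-only example:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

SEC("xdp")
int show_ctx(struct xdp_md *ctx)
{
	/* each ctx field read below is rewritten at verification time
	 * into loads through xdp_buff/xdp_rxq_info as per the switch
	 * statement above */
	bpf_printk("if %u q %u len %u",
		   ctx->ingress_ifindex, ctx->rx_queue_index,
		   ctx->data_end - ctx->data);

	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";
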
+
+bool xdp_helper_changes_pkt_data(const void *func)
+{
+ return func == bpf_xdp_adjust_head ||
+ func == bpf_xdp_adjust_meta ||
+ func == bpf_xdp_adjust_tail;
+}
+
+static const struct bpf_func_proto *
+xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+{
+ switch (func_id) {
+ case BPF_FUNC_perf_event_output:
+ return &bpf_xdp_event_output_proto;
+ case BPF_FUNC_get_smp_processor_id:
+ return &bpf_get_smp_processor_id_proto;
+ case BPF_FUNC_xdp_adjust_head:
+ return &bpf_xdp_adjust_head_proto;
+ case BPF_FUNC_xdp_adjust_meta:
+ return &bpf_xdp_adjust_meta_proto;
+ case BPF_FUNC_redirect:
+ return &bpf_xdp_redirect_proto;
+ case BPF_FUNC_redirect_map:
+ return &bpf_xdp_redirect_map_proto;
+ case BPF_FUNC_xdp_adjust_tail:
+ return &bpf_xdp_adjust_tail_proto;
+ case BPF_FUNC_xdp_get_buff_len:
+ return &bpf_xdp_get_buff_len_proto;
+ case BPF_FUNC_xdp_load_bytes:
+ return &bpf_xdp_load_bytes_proto;
+ case BPF_FUNC_xdp_store_bytes:
+ return &bpf_xdp_store_bytes_proto;
+ default:
+ return xdp_inet_func_proto(func_id);
+ }
+}
+
+const struct bpf_verifier_ops xdp_verifier_ops = {
+ .get_func_proto = xdp_func_proto,
+ .is_valid_access = xdp_is_valid_access,
+ .convert_ctx_access = xdp_convert_ctx_access,
+ .gen_prologue = xdp_noop_prologue,
+};
+
+const struct bpf_prog_ops xdp_prog_ops = {
+ .test_run = bpf_prog_test_run_xdp,
+};
+
+DEFINE_BPF_DISPATCHER(xdp)
+
+void bpf_prog_change_xdp(struct bpf_prog *prev_prog, struct bpf_prog *prog)
+{
+ bpf_dispatcher_change_prog(BPF_DISPATCHER_PTR(xdp), prev_prog, prog);
+}
diff --git a/net/core/Makefile b/net/core/Makefile
index e8ce3bd283a6..f6eceff1cf36 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -12,7 +12,7 @@ obj-$(CONFIG_SYSCTL) += sysctl_net_core.o
obj-y += dev.o dev_addr_lists.o dst.o netevent.o \
neighbour.o rtnetlink.o utils.o link_watch.o filter.o \
sock_diag.o dev_ioctl.o tso.o sock_reuseport.o \
- fib_notifier.o xdp.o flow_offload.o gro.o
+ fib_notifier.o flow_offload.o gro.o
obj-$(CONFIG_NETDEV_ADDR_LIST_TEST) += dev_addr_lists_test.o
diff --git a/net/core/dev.c b/net/core/dev.c
index 8958c4227b67..52b64d24c439 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1593,23 +1593,6 @@ void dev_disable_lro(struct net_device *dev)
}
EXPORT_SYMBOL(dev_disable_lro);
-/**
- * dev_disable_gro_hw - disable HW Generic Receive Offload on a device
- * @dev: device
- *
- * Disable HW Generic Receive Offload (GRO_HW) on a net device. Must be
- * called under RTNL. This is needed if Generic XDP is installed on
- * the device.
- */
-static void dev_disable_gro_hw(struct net_device *dev)
-{
- dev->wanted_features &= ~NETIF_F_GRO_HW;
- netdev_update_features(dev);
-
- if (unlikely(dev->features & NETIF_F_GRO_HW))
- netdev_WARN(dev, "failed to disable GRO_HW!\n");
-}
-
const char *netdev_cmd_to_name(enum netdev_cmd cmd)
{
#define N(val) \
@@ -4696,227 +4679,6 @@ static int enqueue_to_backlog(struct sk_buff *skb, int cpu,
return NET_RX_DROP;
}
-static struct netdev_rx_queue *netif_get_rxqueue(struct sk_buff *skb)
-{
- struct net_device *dev = skb->dev;
- struct netdev_rx_queue *rxqueue;
-
- rxqueue = dev->_rx;
-
- if (skb_rx_queue_recorded(skb)) {
- u16 index = skb_get_rx_queue(skb);
-
- if (unlikely(index >= dev->real_num_rx_queues)) {
- WARN_ONCE(dev->real_num_rx_queues > 1,
- "%s received packet on queue %u, but number "
- "of RX queues is %u\n",
- dev->name, index, dev->real_num_rx_queues);
-
- return rxqueue; /* Return first rxqueue */
- }
- rxqueue += index;
- }
- return rxqueue;
-}
-
-u32 bpf_prog_run_generic_xdp(struct sk_buff *skb, struct xdp_buff *xdp,
- struct bpf_prog *xdp_prog)
-{
- void *orig_data, *orig_data_end, *hard_start;
- struct netdev_rx_queue *rxqueue;
- bool orig_bcast, orig_host;
- u32 mac_len, frame_sz;
- __be16 orig_eth_type;
- struct ethhdr *eth;
- u32 metalen, act;
- int off;
-
- /* The XDP program wants to see the packet starting at the MAC
- * header.
- */
- mac_len = skb->data - skb_mac_header(skb);
- hard_start = skb->data - skb_headroom(skb);
-
- /* SKB "head" area always have tailroom for skb_shared_info */
- frame_sz = (void *)skb_end_pointer(skb) - hard_start;
- frame_sz += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
-
- rxqueue = netif_get_rxqueue(skb);
- xdp_init_buff(xdp, frame_sz, &rxqueue->xdp_rxq);
- xdp_prepare_buff(xdp, hard_start, skb_headroom(skb) - mac_len,
- skb_headlen(skb) + mac_len, true);
-
- orig_data_end = xdp->data_end;
- orig_data = xdp->data;
- eth = (struct ethhdr *)xdp->data;
- orig_host = ether_addr_equal_64bits(eth->h_dest, skb->dev->dev_addr);
- orig_bcast = is_multicast_ether_addr_64bits(eth->h_dest);
- orig_eth_type = eth->h_proto;
-
- act = bpf_prog_run_xdp(xdp_prog, xdp);
-
- /* check if bpf_xdp_adjust_head was used */
- off = xdp->data - orig_data;
- if (off) {
- if (off > 0)
- __skb_pull(skb, off);
- else if (off < 0)
- __skb_push(skb, -off);
-
- skb->mac_header += off;
- skb_reset_network_header(skb);
- }
-
- /* check if bpf_xdp_adjust_tail was used */
- off = xdp->data_end - orig_data_end;
- if (off != 0) {
- skb_set_tail_pointer(skb, xdp->data_end - xdp->data);
- skb->len += off; /* positive on grow, negative on shrink */
- }
-
- /* check if XDP changed eth hdr such SKB needs update */
- eth = (struct ethhdr *)xdp->data;
- if ((orig_eth_type != eth->h_proto) ||
- (orig_host != ether_addr_equal_64bits(eth->h_dest,
- skb->dev->dev_addr)) ||
- (orig_bcast != is_multicast_ether_addr_64bits(eth->h_dest))) {
- __skb_push(skb, ETH_HLEN);
- skb->pkt_type = PACKET_HOST;
- skb->protocol = eth_type_trans(skb, skb->dev);
- }
-
- /* Redirect/Tx gives L2 packet, code that will reuse skb must __skb_pull
- * before calling us again on redirect path. We do not call do_redirect
- * as we leave that up to the caller.
- *
- * Caller is responsible for managing lifetime of skb (i.e. calling
- * kfree_skb in response to actions it cannot handle/XDP_DROP).
- */
- switch (act) {
- case XDP_REDIRECT:
- case XDP_TX:
- __skb_push(skb, mac_len);
- break;
- case XDP_PASS:
- metalen = xdp->data - xdp->data_meta;
- if (metalen)
- skb_metadata_set(skb, metalen);
- break;
- }
-
- return act;
-}
-
-static u32 netif_receive_generic_xdp(struct sk_buff *skb,
- struct xdp_buff *xdp,
- struct bpf_prog *xdp_prog)
-{
- u32 act = XDP_DROP;
-
- /* Reinjected packets coming from act_mirred or similar should
- * not get XDP generic processing.
- */
- if (skb_is_redirected(skb))
- return XDP_PASS;
-
- /* XDP packets must be linear and must have sufficient headroom
- * of XDP_PACKET_HEADROOM bytes. This is the guarantee that also
- * native XDP provides, thus we need to do it here as well.
- */
- if (skb_cloned(skb) || skb_is_nonlinear(skb) ||
- skb_headroom(skb) < XDP_PACKET_HEADROOM) {
- int hroom = XDP_PACKET_HEADROOM - skb_headroom(skb);
- int troom = skb->tail + skb->data_len - skb->end;
-
- /* In case we have to go down the path and also linearize,
- * then lets do the pskb_expand_head() work just once here.
- */
- if (pskb_expand_head(skb,
- hroom > 0 ? ALIGN(hroom, NET_SKB_PAD) : 0,
- troom > 0 ? troom + 128 : 0, GFP_ATOMIC))
- goto do_drop;
- if (skb_linearize(skb))
- goto do_drop;
- }
-
- act = bpf_prog_run_generic_xdp(skb, xdp, xdp_prog);
- switch (act) {
- case XDP_REDIRECT:
- case XDP_TX:
- case XDP_PASS:
- break;
- default:
- bpf_warn_invalid_xdp_action(skb->dev, xdp_prog, act);
- fallthrough;
- case XDP_ABORTED:
- trace_xdp_exception(skb->dev, xdp_prog, act);
- fallthrough;
- case XDP_DROP:
- do_drop:
- kfree_skb(skb);
- break;
- }
-
- return act;
-}
-
-/* When doing generic XDP we have to bypass the qdisc layer and the
- * network taps in order to match in-driver-XDP behavior.
- */
-void generic_xdp_tx(struct sk_buff *skb, struct bpf_prog *xdp_prog)
-{
- struct net_device *dev = skb->dev;
- struct netdev_queue *txq;
- bool free_skb = true;
- int cpu, rc;
-
- txq = netdev_core_pick_tx(dev, skb, NULL);
- cpu = smp_processor_id();
- HARD_TX_LOCK(dev, txq, cpu);
- if (!netif_xmit_stopped(txq)) {
- rc = netdev_start_xmit(skb, dev, txq, 0);
- if (dev_xmit_complete(rc))
- free_skb = false;
- }
- HARD_TX_UNLOCK(dev, txq);
- if (free_skb) {
- trace_xdp_exception(dev, xdp_prog, XDP_TX);
- kfree_skb(skb);
- }
-}
-
-static DEFINE_STATIC_KEY_FALSE(generic_xdp_needed_key);
-
-int do_xdp_generic(struct bpf_prog *xdp_prog, struct sk_buff *skb)
-{
- if (xdp_prog) {
- struct xdp_buff xdp;
- u32 act;
- int err;
-
- act = netif_receive_generic_xdp(skb, &xdp, xdp_prog);
- if (act != XDP_PASS) {
- switch (act) {
- case XDP_REDIRECT:
- err = xdp_do_generic_redirect(skb->dev, skb,
- &xdp, xdp_prog);
- if (err)
- goto out_redir;
- break;
- case XDP_TX:
- generic_xdp_tx(skb, xdp_prog);
- break;
- }
- return XDP_DROP;
- }
- }
- return XDP_PASS;
-out_redir:
- kfree_skb_reason(skb, SKB_DROP_REASON_XDP);
- return XDP_DROP;
-}
-EXPORT_SYMBOL_GPL(do_xdp_generic);
-
static int netif_rx_internal(struct sk_buff *skb)
{
int ret;
@@ -5624,35 +5386,6 @@ static void __netif_receive_skb_list(struct list_head *head)
memalloc_noreclaim_restore(noreclaim_flag);
}
-static int generic_xdp_install(struct net_device *dev, struct netdev_bpf *xdp)
-{
- struct bpf_prog *old = rtnl_dereference(dev->xdp_prog);
- struct bpf_prog *new = xdp->prog;
- int ret = 0;
-
- switch (xdp->command) {
- case XDP_SETUP_PROG:
- rcu_assign_pointer(dev->xdp_prog, new);
- if (old)
- bpf_prog_put(old);
-
- if (old && !new) {
- static_branch_dec(&generic_xdp_needed_key);
- } else if (new && !old) {
- static_branch_inc(&generic_xdp_needed_key);
- dev_disable_lro(dev);
- dev_disable_gro_hw(dev);
- }
- break;
-
- default:
- ret = -EINVAL;
- break;
- }
-
- return ret;
-}
-
static int netif_receive_skb_internal(struct sk_buff *skb)
{
int ret;
@@ -9016,510 +8749,6 @@ void dev_change_proto_down_reason(struct net_device *dev, unsigned long mask,
}
}
-struct bpf_xdp_link {
- struct bpf_link link;
- struct net_device *dev; /* protected by rtnl_lock, no refcnt held */
- int flags;
-};
-
-static enum bpf_xdp_mode dev_xdp_mode(struct net_device *dev, u32 flags)
-{
- if (flags & XDP_FLAGS_HW_MODE)
- return XDP_MODE_HW;
- if (flags & XDP_FLAGS_DRV_MODE)
- return XDP_MODE_DRV;
- if (flags & XDP_FLAGS_SKB_MODE)
- return XDP_MODE_SKB;
- return dev->netdev_ops->ndo_bpf ? XDP_MODE_DRV : XDP_MODE_SKB;
-}
-
-static bpf_op_t dev_xdp_bpf_op(struct net_device *dev, enum bpf_xdp_mode mode)
-{
- switch (mode) {
- case XDP_MODE_SKB:
- return generic_xdp_install;
- case XDP_MODE_DRV:
- case XDP_MODE_HW:
- return dev->netdev_ops->ndo_bpf;
- default:
- return NULL;
- }
-}
-
-static struct bpf_xdp_link *dev_xdp_link(struct net_device *dev,
- enum bpf_xdp_mode mode)
-{
- return dev->xdp_state[mode].link;
-}
-
-static struct bpf_prog *dev_xdp_prog(struct net_device *dev,
- enum bpf_xdp_mode mode)
-{
- struct bpf_xdp_link *link = dev_xdp_link(dev, mode);
-
- if (link)
- return link->link.prog;
- return dev->xdp_state[mode].prog;
-}
-
-u8 dev_xdp_prog_count(struct net_device *dev)
-{
- u8 count = 0;
- int i;
-
- for (i = 0; i < __MAX_XDP_MODE; i++)
- if (dev->xdp_state[i].prog || dev->xdp_state[i].link)
- count++;
- return count;
-}
-EXPORT_SYMBOL_GPL(dev_xdp_prog_count);
-
-u32 dev_xdp_prog_id(struct net_device *dev, enum bpf_xdp_mode mode)
-{
- struct bpf_prog *prog = dev_xdp_prog(dev, mode);
-
- return prog ? prog->aux->id : 0;
-}
-
-static void dev_xdp_set_link(struct net_device *dev, enum bpf_xdp_mode mode,
- struct bpf_xdp_link *link)
-{
- dev->xdp_state[mode].link = link;
- dev->xdp_state[mode].prog = NULL;
-}
-
-static void dev_xdp_set_prog(struct net_device *dev, enum bpf_xdp_mode mode,
- struct bpf_prog *prog)
-{
- dev->xdp_state[mode].link = NULL;
- dev->xdp_state[mode].prog = prog;
-}
-
-static int dev_xdp_install(struct net_device *dev, enum bpf_xdp_mode mode,
- bpf_op_t bpf_op, struct netlink_ext_ack *extack,
- u32 flags, struct bpf_prog *prog)
-{
- struct netdev_bpf xdp;
- int err;
-
- memset(&xdp, 0, sizeof(xdp));
- xdp.command = mode == XDP_MODE_HW ? XDP_SETUP_PROG_HW : XDP_SETUP_PROG;
- xdp.extack = extack;
- xdp.flags = flags;
- xdp.prog = prog;
-
- /* Drivers assume refcnt is already incremented (i.e, prog pointer is
- * "moved" into driver), so they don't increment it on their own, but
- * they do decrement refcnt when program is detached or replaced.
- * Given net_device also owns link/prog, we need to bump refcnt here
- * to prevent drivers from underflowing it.
- */
- if (prog)
- bpf_prog_inc(prog);
- err = bpf_op(dev, &xdp);
- if (err) {
- if (prog)
- bpf_prog_put(prog);
- return err;
- }
-
- if (mode != XDP_MODE_HW)
- bpf_prog_change_xdp(dev_xdp_prog(dev, mode), prog);
-
- return 0;
-}
-
-static void dev_xdp_uninstall(struct net_device *dev)
-{
- struct bpf_xdp_link *link;
- struct bpf_prog *prog;
- enum bpf_xdp_mode mode;
- bpf_op_t bpf_op;
-
- ASSERT_RTNL();
-
- for (mode = XDP_MODE_SKB; mode < __MAX_XDP_MODE; mode++) {
- prog = dev_xdp_prog(dev, mode);
- if (!prog)
- continue;
-
- bpf_op = dev_xdp_bpf_op(dev, mode);
- if (!bpf_op)
- continue;
-
- WARN_ON(dev_xdp_install(dev, mode, bpf_op, NULL, 0, NULL));
-
- /* auto-detach link from net device */
- link = dev_xdp_link(dev, mode);
- if (link)
- link->dev = NULL;
- else
- bpf_prog_put(prog);
-
- dev_xdp_set_link(dev, mode, NULL);
- }
-}
-
-static int dev_xdp_attach(struct net_device *dev, struct netlink_ext_ack *extack,
- struct bpf_xdp_link *link, struct bpf_prog *new_prog,
- struct bpf_prog *old_prog, u32 flags)
-{
- unsigned int num_modes = hweight32(flags & XDP_FLAGS_MODES);
- struct bpf_prog *cur_prog;
- struct net_device *upper;
- struct list_head *iter;
- enum bpf_xdp_mode mode;
- bpf_op_t bpf_op;
- int err;
-
- ASSERT_RTNL();
-
- /* either link or prog attachment, never both */
- if (link && (new_prog || old_prog))
- return -EINVAL;
- /* link supports only XDP mode flags */
- if (link && (flags & ~XDP_FLAGS_MODES)) {
- NL_SET_ERR_MSG(extack, "Invalid XDP flags for BPF link attachment");
- return -EINVAL;
- }
- /* just one XDP mode bit should be set, zero defaults to drv/skb mode */
- if (num_modes > 1) {
- NL_SET_ERR_MSG(extack, "Only one XDP mode flag can be set");
- return -EINVAL;
- }
- /* avoid ambiguity if offload + drv/skb mode progs are both loaded */
- if (!num_modes && dev_xdp_prog_count(dev) > 1) {
- NL_SET_ERR_MSG(extack,
- "More than one program loaded, unset mode is ambiguous");
- return -EINVAL;
- }
- /* old_prog != NULL implies XDP_FLAGS_REPLACE is set */
- if (old_prog && !(flags & XDP_FLAGS_REPLACE)) {
- NL_SET_ERR_MSG(extack, "XDP_FLAGS_REPLACE is not specified");
- return -EINVAL;
- }
-
- mode = dev_xdp_mode(dev, flags);
- /* can't replace attached link */
- if (dev_xdp_link(dev, mode)) {
- NL_SET_ERR_MSG(extack, "Can't replace active BPF XDP link");
- return -EBUSY;
- }
-
- /* don't allow if an upper device already has a program */
- netdev_for_each_upper_dev_rcu(dev, upper, iter) {
- if (dev_xdp_prog_count(upper) > 0) {
- NL_SET_ERR_MSG(extack, "Cannot attach when an upper device already has a program");
- return -EEXIST;
- }
- }
-
- cur_prog = dev_xdp_prog(dev, mode);
- /* can't replace attached prog with link */
- if (link && cur_prog) {
- NL_SET_ERR_MSG(extack, "Can't replace active XDP program with BPF link");
- return -EBUSY;
- }
- if ((flags & XDP_FLAGS_REPLACE) && cur_prog != old_prog) {
- NL_SET_ERR_MSG(extack, "Active program does not match expected");
- return -EEXIST;
- }
-
- /* put effective new program into new_prog */
- if (link)
- new_prog = link->link.prog;
-
- if (new_prog) {
- bool offload = mode == XDP_MODE_HW;
- enum bpf_xdp_mode other_mode = mode == XDP_MODE_SKB
- ? XDP_MODE_DRV : XDP_MODE_SKB;
-
- if ((flags & XDP_FLAGS_UPDATE_IF_NOEXIST) && cur_prog) {
- NL_SET_ERR_MSG(extack, "XDP program already attached");
- return -EBUSY;
- }
- if (!offload && dev_xdp_prog(dev, other_mode)) {
- NL_SET_ERR_MSG(extack, "Native and generic XDP can't be active at the same time");
- return -EEXIST;
- }
- if (!offload && bpf_prog_is_dev_bound(new_prog->aux)) {
- NL_SET_ERR_MSG(extack, "Using device-bound program without HW_MODE flag is not supported");
- return -EINVAL;
- }
- if (new_prog->expected_attach_type == BPF_XDP_DEVMAP) {
- NL_SET_ERR_MSG(extack, "BPF_XDP_DEVMAP programs can not be attached to a device");
- return -EINVAL;
- }
- if (new_prog->expected_attach_type == BPF_XDP_CPUMAP) {
- NL_SET_ERR_MSG(extack, "BPF_XDP_CPUMAP programs can not be attached to a device");
- return -EINVAL;
- }
- }
-
- /* don't call drivers if the effective program didn't change */
- if (new_prog != cur_prog) {
- bpf_op = dev_xdp_bpf_op(dev, mode);
- if (!bpf_op) {
- NL_SET_ERR_MSG(extack, "Underlying driver does not support XDP in native mode");
- return -EOPNOTSUPP;
- }
-
- err = dev_xdp_install(dev, mode, bpf_op, extack, flags, new_prog);
- if (err)
- return err;
- }
-
- if (link)
- dev_xdp_set_link(dev, mode, link);
- else
- dev_xdp_set_prog(dev, mode, new_prog);
- if (cur_prog)
- bpf_prog_put(cur_prog);
-
- return 0;
-}
-
-static int dev_xdp_attach_link(struct net_device *dev,
- struct netlink_ext_ack *extack,
- struct bpf_xdp_link *link)
-{
- return dev_xdp_attach(dev, extack, link, NULL, NULL, link->flags);
-}
-
-static int dev_xdp_detach_link(struct net_device *dev,
- struct netlink_ext_ack *extack,
- struct bpf_xdp_link *link)
-{
- enum bpf_xdp_mode mode;
- bpf_op_t bpf_op;
-
- ASSERT_RTNL();
-
- mode = dev_xdp_mode(dev, link->flags);
- if (dev_xdp_link(dev, mode) != link)
- return -EINVAL;
-
- bpf_op = dev_xdp_bpf_op(dev, mode);
- WARN_ON(dev_xdp_install(dev, mode, bpf_op, NULL, 0, NULL));
- dev_xdp_set_link(dev, mode, NULL);
- return 0;
-}
-
-static void bpf_xdp_link_release(struct bpf_link *link)
-{
- struct bpf_xdp_link *xdp_link = container_of(link, struct bpf_xdp_link, link);
-
- rtnl_lock();
-
- /* if racing with net_device's tear down, xdp_link->dev might be
- * already NULL, in which case link was already auto-detached
- */
- if (xdp_link->dev) {
- WARN_ON(dev_xdp_detach_link(xdp_link->dev, NULL, xdp_link));
- xdp_link->dev = NULL;
- }
-
- rtnl_unlock();
-}
-
-static int bpf_xdp_link_detach(struct bpf_link *link)
-{
- bpf_xdp_link_release(link);
- return 0;
-}
-
-static void bpf_xdp_link_dealloc(struct bpf_link *link)
-{
- struct bpf_xdp_link *xdp_link = container_of(link, struct bpf_xdp_link, link);
-
- kfree(xdp_link);
-}
-
-static void bpf_xdp_link_show_fdinfo(const struct bpf_link *link,
- struct seq_file *seq)
-{
- struct bpf_xdp_link *xdp_link = container_of(link, struct bpf_xdp_link, link);
- u32 ifindex = 0;
-
- rtnl_lock();
- if (xdp_link->dev)
- ifindex = xdp_link->dev->ifindex;
- rtnl_unlock();
-
- seq_printf(seq, "ifindex:\t%u\n", ifindex);
-}
-
-static int bpf_xdp_link_fill_link_info(const struct bpf_link *link,
- struct bpf_link_info *info)
-{
- struct bpf_xdp_link *xdp_link = container_of(link, struct bpf_xdp_link, link);
- u32 ifindex = 0;
-
- rtnl_lock();
- if (xdp_link->dev)
- ifindex = xdp_link->dev->ifindex;
- rtnl_unlock();
-
- info->xdp.ifindex = ifindex;
- return 0;
-}
-
-static int bpf_xdp_link_update(struct bpf_link *link, struct bpf_prog *new_prog,
- struct bpf_prog *old_prog)
-{
- struct bpf_xdp_link *xdp_link = container_of(link, struct bpf_xdp_link, link);
- enum bpf_xdp_mode mode;
- bpf_op_t bpf_op;
- int err = 0;
-
- rtnl_lock();
-
- /* link might have been auto-released already, so fail */
- if (!xdp_link->dev) {
- err = -ENOLINK;
- goto out_unlock;
- }
-
- if (old_prog && link->prog != old_prog) {
- err = -EPERM;
- goto out_unlock;
- }
- old_prog = link->prog;
- if (old_prog->type != new_prog->type ||
- old_prog->expected_attach_type != new_prog->expected_attach_type) {
- err = -EINVAL;
- goto out_unlock;
- }
-
- if (old_prog == new_prog) {
- /* no-op, don't disturb drivers */
- bpf_prog_put(new_prog);
- goto out_unlock;
- }
-
- mode = dev_xdp_mode(xdp_link->dev, xdp_link->flags);
- bpf_op = dev_xdp_bpf_op(xdp_link->dev, mode);
- err = dev_xdp_install(xdp_link->dev, mode, bpf_op, NULL,
- xdp_link->flags, new_prog);
- if (err)
- goto out_unlock;
-
- old_prog = xchg(&link->prog, new_prog);
- bpf_prog_put(old_prog);
-
-out_unlock:
- rtnl_unlock();
- return err;
-}
-
-static const struct bpf_link_ops bpf_xdp_link_lops = {
- .release = bpf_xdp_link_release,
- .dealloc = bpf_xdp_link_dealloc,
- .detach = bpf_xdp_link_detach,
- .show_fdinfo = bpf_xdp_link_show_fdinfo,
- .fill_link_info = bpf_xdp_link_fill_link_info,
- .update_prog = bpf_xdp_link_update,
-};
-
-int bpf_xdp_link_attach(const union bpf_attr *attr, struct bpf_prog *prog)
-{
- struct net *net = current->nsproxy->net_ns;
- struct bpf_link_primer link_primer;
- struct bpf_xdp_link *link;
- struct net_device *dev;
- int err, fd;
-
- rtnl_lock();
- dev = dev_get_by_index(net, attr->link_create.target_ifindex);
- if (!dev) {
- rtnl_unlock();
- return -EINVAL;
- }
-
- link = kzalloc(sizeof(*link), GFP_USER);
- if (!link) {
- err = -ENOMEM;
- goto unlock;
- }
-
- bpf_link_init(&link->link, BPF_LINK_TYPE_XDP, &bpf_xdp_link_lops, prog);
- link->dev = dev;
- link->flags = attr->link_create.flags;
-
- err = bpf_link_prime(&link->link, &link_primer);
- if (err) {
- kfree(link);
- goto unlock;
- }
-
- err = dev_xdp_attach_link(dev, NULL, link);
- rtnl_unlock();
-
- if (err) {
- link->dev = NULL;
- bpf_link_cleanup(&link_primer);
- goto out_put_dev;
- }
-
- fd = bpf_link_settle(&link_primer);
- /* link itself doesn't hold dev's refcnt to not complicate shutdown */
- dev_put(dev);
- return fd;
-
-unlock:
- rtnl_unlock();
-
-out_put_dev:
- dev_put(dev);
- return err;
-}
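
From userspace this path is typically reached via libbpf; a
hypothetical snippet (the "xdp_main" program name is made up, error
handling trimmed to the minimum):

#include <bpf/libbpf.h>
#include <net/if.h>

int attach_xdp_link(struct bpf_object *obj, const char *ifname)
{
	struct bpf_program *prog;
	struct bpf_link *link;

	prog = bpf_object__find_program_by_name(obj, "xdp_main");
	if (!prog)
		return -1;

	/* creates a BPF_LINK_TYPE_XDP link, ending up in
	 * bpf_xdp_link_attach() above */
	link = bpf_program__attach_xdp(prog, if_nametoindex(ifname));
	if (!link)
		return -1;

	/* the returned fd pins the attachment: closing it (or plain
	 * process exit) auto-detaches the program */
	return bpf_link__fd(link);
}
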
-
-/**
- * dev_change_xdp_fd - set or clear a bpf program for a device rx path
- * @dev: device
- * @extack: netlink extended ack
- * @fd: new program fd or negative value to clear
- * @expected_fd: old program fd that userspace expects to replace or clear
- * @flags: xdp-related flags
- *
- * Set or clear a bpf program for a device
- */
-int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack,
- int fd, int expected_fd, u32 flags)
-{
- enum bpf_xdp_mode mode = dev_xdp_mode(dev, flags);
- struct bpf_prog *new_prog = NULL, *old_prog = NULL;
- int err;
-
- ASSERT_RTNL();
-
- if (fd >= 0) {
- new_prog = bpf_prog_get_type_dev(fd, BPF_PROG_TYPE_XDP,
- mode != XDP_MODE_SKB);
- if (IS_ERR(new_prog))
- return PTR_ERR(new_prog);
- }
-
- if (expected_fd >= 0) {
- old_prog = bpf_prog_get_type_dev(expected_fd, BPF_PROG_TYPE_XDP,
- mode != XDP_MODE_SKB);
- if (IS_ERR(old_prog)) {
- err = PTR_ERR(old_prog);
- old_prog = NULL;
- goto err_out;
- }
- }
-
- err = dev_xdp_attach(dev, extack, NULL, new_prog, old_prog, flags);
-
-err_out:
- if (err && new_prog)
- bpf_prog_put(new_prog);
- if (old_prog)
- bpf_prog_put(old_prog);
- return err;
-}
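
The classic (non-link) counterpart lands here through rtnetlink; with
libbpf >= 0.8 it can be driven as below (a sketch, not from this
series):

#include <bpf/libbpf.h>
#include <linux/if_link.h>

int attach_xdp_fd(int ifindex, int prog_fd)
{
	/* XDP_FLAGS_UPDATE_IF_NOEXIST refuses to silently replace an
	 * already attached program */
	return bpf_xdp_attach(ifindex, prog_fd,
			      XDP_FLAGS_UPDATE_IF_NOEXIST, NULL);
}
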
-
/**
* dev_new_index - allocate an ifindex
* @net: the applicable net namespace
diff --git a/net/core/dev.h b/net/core/dev.h
index cbb8a925175a..36a68992f17b 100644
--- a/net/core/dev.h
+++ b/net/core/dev.h
@@ -78,10 +78,6 @@ int dev_change_proto_down(struct net_device *dev, bool proto_down);
void dev_change_proto_down_reason(struct net_device *dev, unsigned long mask,
u32 value);
-typedef int (*bpf_op_t)(struct net_device *dev, struct netdev_bpf *bpf);
-int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack,
- int fd, int expected_fd, u32 flags);
-
int dev_change_tx_queue_len(struct net_device *dev, unsigned long new_len);
void dev_set_group(struct net_device *dev, int new_group);
int dev_change_carrier(struct net_device *dev, bool new_carrier);
diff --git a/net/core/filter.c b/net/core/filter.c
index 151aa4756bd6..3933465eb972 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -3788,641 +3788,6 @@ static const struct bpf_func_proto sk_skb_change_head_proto = {
.arg3_type = ARG_ANYTHING,
};
-BPF_CALL_1(bpf_xdp_get_buff_len, struct xdp_buff*, xdp)
-{
- return xdp_get_buff_len(xdp);
-}
-
-static const struct bpf_func_proto bpf_xdp_get_buff_len_proto = {
- .func = bpf_xdp_get_buff_len,
- .gpl_only = false,
- .ret_type = RET_INTEGER,
- .arg1_type = ARG_PTR_TO_CTX,
-};
-
-BTF_ID_LIST_SINGLE(bpf_xdp_get_buff_len_bpf_ids, struct, xdp_buff)
-
-const struct bpf_func_proto bpf_xdp_get_buff_len_trace_proto = {
- .func = bpf_xdp_get_buff_len,
- .gpl_only = false,
- .arg1_type = ARG_PTR_TO_BTF_ID,
- .arg1_btf_id = &bpf_xdp_get_buff_len_bpf_ids[0],
-};
-
-static unsigned long xdp_get_metalen(const struct xdp_buff *xdp)
-{
- return xdp_data_meta_unsupported(xdp) ? 0 :
- xdp->data - xdp->data_meta;
-}
-
-BPF_CALL_2(bpf_xdp_adjust_head, struct xdp_buff *, xdp, int, offset)
-{
- void *xdp_frame_end = xdp->data_hard_start + sizeof(struct xdp_frame);
- unsigned long metalen = xdp_get_metalen(xdp);
- void *data_start = xdp_frame_end + metalen;
- void *data = xdp->data + offset;
-
- if (unlikely(data < data_start ||
- data > xdp->data_end - ETH_HLEN))
- return -EINVAL;
-
- if (metalen)
- memmove(xdp->data_meta + offset,
- xdp->data_meta, metalen);
- xdp->data_meta += offset;
- xdp->data = data;
-
- return 0;
-}
-
-static const struct bpf_func_proto bpf_xdp_adjust_head_proto = {
- .func = bpf_xdp_adjust_head,
- .gpl_only = false,
- .ret_type = RET_INTEGER,
- .arg1_type = ARG_PTR_TO_CTX,
- .arg2_type = ARG_ANYTHING,
-};
-
-static void bpf_xdp_copy_buf(struct xdp_buff *xdp, unsigned long off,
- void *buf, unsigned long len, bool flush)
-{
- unsigned long ptr_len, ptr_off = 0;
- skb_frag_t *next_frag, *end_frag;
- struct skb_shared_info *sinfo;
- void *src, *dst;
- u8 *ptr_buf;
-
- if (likely(xdp->data_end - xdp->data >= off + len)) {
- src = flush ? buf : xdp->data + off;
- dst = flush ? xdp->data + off : buf;
- memcpy(dst, src, len);
- return;
- }
-
- sinfo = xdp_get_shared_info_from_buff(xdp);
- end_frag = &sinfo->frags[sinfo->nr_frags];
- next_frag = &sinfo->frags[0];
-
- ptr_len = xdp->data_end - xdp->data;
- ptr_buf = xdp->data;
-
- while (true) {
- if (off < ptr_off + ptr_len) {
- unsigned long copy_off = off - ptr_off;
- unsigned long copy_len = min(len, ptr_len - copy_off);
-
- src = flush ? buf : ptr_buf + copy_off;
- dst = flush ? ptr_buf + copy_off : buf;
- memcpy(dst, src, copy_len);
-
- off += copy_len;
- len -= copy_len;
- buf += copy_len;
- }
-
- if (!len || next_frag == end_frag)
- break;
-
- ptr_off += ptr_len;
- ptr_buf = skb_frag_address(next_frag);
- ptr_len = skb_frag_size(next_frag);
- next_frag++;
- }
-}
-
-static void *bpf_xdp_pointer(struct xdp_buff *xdp, u32 offset, u32 len)
-{
- struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
- u32 size = xdp->data_end - xdp->data;
- void *addr = xdp->data;
- int i;
-
- if (unlikely(offset > 0xffff || len > 0xffff))
- return ERR_PTR(-EFAULT);
-
- if (offset + len > xdp_get_buff_len(xdp))
- return ERR_PTR(-EINVAL);
-
- if (offset < size) /* linear area */
- goto out;
-
- offset -= size;
- for (i = 0; i < sinfo->nr_frags; i++) { /* paged area */
- u32 frag_size = skb_frag_size(&sinfo->frags[i]);
-
- if (offset < frag_size) {
- addr = skb_frag_address(&sinfo->frags[i]);
- size = frag_size;
- break;
- }
- offset -= frag_size;
- }
-out:
- return offset + len < size ? addr + offset : NULL;
-}
-
-BPF_CALL_4(bpf_xdp_load_bytes, struct xdp_buff *, xdp, u32, offset,
- void *, buf, u32, len)
-{
- void *ptr;
-
- ptr = bpf_xdp_pointer(xdp, offset, len);
- if (IS_ERR(ptr))
- return PTR_ERR(ptr);
-
- if (!ptr)
- bpf_xdp_copy_buf(xdp, offset, buf, len, false);
- else
- memcpy(buf, ptr, len);
-
- return 0;
-}
-
-static const struct bpf_func_proto bpf_xdp_load_bytes_proto = {
- .func = bpf_xdp_load_bytes,
- .gpl_only = false,
- .ret_type = RET_INTEGER,
- .arg1_type = ARG_PTR_TO_CTX,
- .arg2_type = ARG_ANYTHING,
- .arg3_type = ARG_PTR_TO_UNINIT_MEM,
- .arg4_type = ARG_CONST_SIZE,
-};
-
-BPF_CALL_4(bpf_xdp_store_bytes, struct xdp_buff *, xdp, u32, offset,
- void *, buf, u32, len)
-{
- void *ptr;
-
- ptr = bpf_xdp_pointer(xdp, offset, len);
- if (IS_ERR(ptr))
- return PTR_ERR(ptr);
-
- if (!ptr)
- bpf_xdp_copy_buf(xdp, offset, buf, len, true);
- else
- memcpy(ptr, buf, len);
-
- return 0;
-}
-
-static const struct bpf_func_proto bpf_xdp_store_bytes_proto = {
- .func = bpf_xdp_store_bytes,
- .gpl_only = false,
- .ret_type = RET_INTEGER,
- .arg1_type = ARG_PTR_TO_CTX,
- .arg2_type = ARG_ANYTHING,
- .arg3_type = ARG_PTR_TO_UNINIT_MEM,
- .arg4_type = ARG_CONST_SIZE,
-};
-
-static int bpf_xdp_frags_increase_tail(struct xdp_buff *xdp, int offset)
-{
- struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
- skb_frag_t *frag = &sinfo->frags[sinfo->nr_frags - 1];
- struct xdp_rxq_info *rxq = xdp->rxq;
- unsigned int tailroom;
-
- if (!rxq->frag_size || rxq->frag_size > xdp->frame_sz)
- return -EOPNOTSUPP;
-
- tailroom = rxq->frag_size - skb_frag_size(frag) - skb_frag_off(frag);
- if (unlikely(offset > tailroom))
- return -EINVAL;
-
- memset(skb_frag_address(frag) + skb_frag_size(frag), 0, offset);
- skb_frag_size_add(frag, offset);
- sinfo->xdp_frags_size += offset;
-
- return 0;
-}
-
-static int bpf_xdp_frags_shrink_tail(struct xdp_buff *xdp, int offset)
-{
- struct skb_shared_info *sinfo = xdp_get_shared_info_from_buff(xdp);
- int i, n_frags_free = 0, len_free = 0;
-
- if (unlikely(offset > (int)xdp_get_buff_len(xdp) - ETH_HLEN))
- return -EINVAL;
-
- for (i = sinfo->nr_frags - 1; i >= 0 && offset > 0; i--) {
- skb_frag_t *frag = &sinfo->frags[i];
- int shrink = min_t(int, offset, skb_frag_size(frag));
-
- len_free += shrink;
- offset -= shrink;
-
- if (skb_frag_size(frag) == shrink) {
- struct page *page = skb_frag_page(frag);
-
- __xdp_return(page_address(page), &xdp->rxq->mem,
- false, NULL);
- n_frags_free++;
- } else {
- skb_frag_size_sub(frag, shrink);
- break;
- }
- }
- sinfo->nr_frags -= n_frags_free;
- sinfo->xdp_frags_size -= len_free;
-
- if (unlikely(!sinfo->nr_frags)) {
- xdp_buff_clear_frags_flag(xdp);
- xdp->data_end -= offset;
- }
-
- return 0;
-}
-
-BPF_CALL_2(bpf_xdp_adjust_tail, struct xdp_buff *, xdp, int, offset)
-{
- void *data_hard_end = xdp_data_hard_end(xdp); /* use xdp->frame_sz */
- void *data_end = xdp->data_end + offset;
-
- if (unlikely(xdp_buff_has_frags(xdp))) { /* non-linear xdp buff */
- if (offset < 0)
- return bpf_xdp_frags_shrink_tail(xdp, -offset);
-
- return bpf_xdp_frags_increase_tail(xdp, offset);
- }
-
- /* Notice that xdp_data_hard_end have reserved some tailroom */
- if (unlikely(data_end > data_hard_end))
- return -EINVAL;
-
- /* ALL drivers MUST init xdp->frame_sz, chicken check below */
- if (unlikely(xdp->frame_sz > PAGE_SIZE)) {
- WARN_ONCE(1, "Too BIG xdp->frame_sz = %d\n", xdp->frame_sz);
- return -EINVAL;
- }
-
- if (unlikely(data_end < xdp->data + ETH_HLEN))
- return -EINVAL;
-
- /* Clear memory area on grow, can contain uninit kernel memory */
- if (offset > 0)
- memset(xdp->data_end, 0, offset);
-
- xdp->data_end = data_end;
-
- return 0;
-}
-
-static const struct bpf_func_proto bpf_xdp_adjust_tail_proto = {
- .func = bpf_xdp_adjust_tail,
- .gpl_only = false,
- .ret_type = RET_INTEGER,
- .arg1_type = ARG_PTR_TO_CTX,
- .arg2_type = ARG_ANYTHING,
-};
-
-BPF_CALL_2(bpf_xdp_adjust_meta, struct xdp_buff *, xdp, int, offset)
-{
- void *xdp_frame_end = xdp->data_hard_start + sizeof(struct xdp_frame);
- void *meta = xdp->data_meta + offset;
- unsigned long metalen = xdp->data - meta;
-
- if (xdp_data_meta_unsupported(xdp))
- return -ENOTSUPP;
- if (unlikely(meta < xdp_frame_end ||
- meta > xdp->data))
- return -EINVAL;
- if (unlikely(xdp_metalen_invalid(metalen)))
- return -EACCES;
-
- xdp->data_meta = meta;
-
- return 0;
-}
-
-static const struct bpf_func_proto bpf_xdp_adjust_meta_proto = {
- .func = bpf_xdp_adjust_meta,
- .gpl_only = false,
- .ret_type = RET_INTEGER,
- .arg1_type = ARG_PTR_TO_CTX,
- .arg2_type = ARG_ANYTHING,
-};
-
-/* XDP_REDIRECT works by a three-step process, implemented in the functions
- * below:
- *
- * 1. The bpf_redirect() and bpf_redirect_map() helpers will lookup the target
- * of the redirect and store it (along with some other metadata) in a per-CPU
- * struct bpf_redirect_info.
- *
- * 2. When the program returns the XDP_REDIRECT return code, the driver will
- * call xdp_do_redirect() which will use the information in struct
- * bpf_redirect_info to actually enqueue the frame into a map type-specific
- * bulk queue structure.
- *
- * 3. Before exiting its NAPI poll loop, the driver will call xdp_do_flush(),
- * which will flush all the different bulk queues, thus completing the
- * redirect.
- *
- * Pointers to the map entries will be kept around for this whole sequence of
- * steps, protected by RCU. However, there is no top-level rcu_read_lock() in
- * the core code; instead, the RCU protection relies on everything happening
- * inside a single NAPI poll sequence, which means it's between a pair of calls
- * to local_bh_disable()/local_bh_enable().
- *
- * The map entries are marked as __rcu and the map code makes sure to
- * dereference those pointers with rcu_dereference_check() in a way that works
- * for both sections that to hold an rcu_read_lock() and sections that are
- * called from NAPI without a separate rcu_read_lock(). The code below does not
- * use RCU annotations, but relies on those in the map code.
- */
-void xdp_do_flush(void)
-{
- __dev_flush();
- __cpu_map_flush();
- __xsk_map_flush();
-}
-EXPORT_SYMBOL_GPL(xdp_do_flush);
-
-void bpf_clear_redirect_map(struct bpf_map *map)
-{
- struct bpf_redirect_info *ri;
- int cpu;
-
- for_each_possible_cpu(cpu) {
- ri = per_cpu_ptr(&bpf_redirect_info, cpu);
- /* Avoid polluting remote cacheline due to writes if
- * not needed. Once we pass this test, we need the
- * cmpxchg() to make sure it hasn't been changed in
- * the meantime by remote CPU.
- */
- if (unlikely(READ_ONCE(ri->map) == map))
- cmpxchg(&ri->map, map, NULL);
- }
-}
-
-DEFINE_STATIC_KEY_FALSE(bpf_master_redirect_enabled_key);
-EXPORT_SYMBOL_GPL(bpf_master_redirect_enabled_key);
-
-u32 xdp_master_redirect(struct xdp_buff *xdp)
-{
- struct net_device *master, *slave;
- struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
-
- master = netdev_master_upper_dev_get_rcu(xdp->rxq->dev);
- slave = master->netdev_ops->ndo_xdp_get_xmit_slave(master, xdp);
- if (slave && slave != xdp->rxq->dev) {
- /* The target device is different from the receiving device, so
- * redirect it to the new device.
- * Using XDP_REDIRECT gets the correct behaviour from XDP enabled
- * drivers to unmap the packet from their rx ring.
- */
- ri->tgt_index = slave->ifindex;
- ri->map_id = INT_MAX;
- ri->map_type = BPF_MAP_TYPE_UNSPEC;
- return XDP_REDIRECT;
- }
- return XDP_TX;
-}
-EXPORT_SYMBOL_GPL(xdp_master_redirect);
-
-static inline int __xdp_do_redirect_xsk(struct bpf_redirect_info *ri,
- struct net_device *dev,
- struct xdp_buff *xdp,
- struct bpf_prog *xdp_prog)
-{
- enum bpf_map_type map_type = ri->map_type;
- void *fwd = ri->tgt_value;
- u32 map_id = ri->map_id;
- int err;
-
- ri->map_id = 0; /* Valid map id idr range: [1,INT_MAX[ */
- ri->map_type = BPF_MAP_TYPE_UNSPEC;
-
- err = __xsk_map_redirect(fwd, xdp);
- if (unlikely(err))
- goto err;
-
- _trace_xdp_redirect_map(dev, xdp_prog, fwd, map_type, map_id, ri->tgt_index);
- return 0;
-err:
- _trace_xdp_redirect_map_err(dev, xdp_prog, fwd, map_type, map_id, ri->tgt_index, err);
- return err;
-}
-
-static __always_inline int __xdp_do_redirect_frame(struct bpf_redirect_info *ri,
- struct net_device *dev,
- struct xdp_frame *xdpf,
- struct bpf_prog *xdp_prog)
-{
- enum bpf_map_type map_type = ri->map_type;
- void *fwd = ri->tgt_value;
- u32 map_id = ri->map_id;
- struct bpf_map *map;
- int err;
-
- ri->map_id = 0; /* Valid map id idr range: [1,INT_MAX[ */
- ri->map_type = BPF_MAP_TYPE_UNSPEC;
-
- if (unlikely(!xdpf)) {
- err = -EOVERFLOW;
- goto err;
- }
-
- switch (map_type) {
- case BPF_MAP_TYPE_DEVMAP:
- fallthrough;
- case BPF_MAP_TYPE_DEVMAP_HASH:
- map = READ_ONCE(ri->map);
- if (unlikely(map)) {
- WRITE_ONCE(ri->map, NULL);
- err = dev_map_enqueue_multi(xdpf, dev, map,
- ri->flags & BPF_F_EXCLUDE_INGRESS);
- } else {
- err = dev_map_enqueue(fwd, xdpf, dev);
- }
- break;
- case BPF_MAP_TYPE_CPUMAP:
- err = cpu_map_enqueue(fwd, xdpf, dev);
- break;
- case BPF_MAP_TYPE_UNSPEC:
- if (map_id == INT_MAX) {
- fwd = dev_get_by_index_rcu(dev_net(dev), ri->tgt_index);
- if (unlikely(!fwd)) {
- err = -EINVAL;
- break;
- }
- err = dev_xdp_enqueue(fwd, xdpf, dev);
- break;
- }
- fallthrough;
- default:
- err = -EBADRQC;
- }
-
- if (unlikely(err))
- goto err;
-
- _trace_xdp_redirect_map(dev, xdp_prog, fwd, map_type, map_id, ri->tgt_index);
- return 0;
-err:
- _trace_xdp_redirect_map_err(dev, xdp_prog, fwd, map_type, map_id, ri->tgt_index, err);
- return err;
-}
-
-int xdp_do_redirect(struct net_device *dev, struct xdp_buff *xdp,
- struct bpf_prog *xdp_prog)
-{
- struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
- enum bpf_map_type map_type = ri->map_type;
-
- /* XDP_REDIRECT is not fully supported yet for xdp frags since
- * not all XDP capable drivers can map non-linear xdp_frame in
- * ndo_xdp_xmit.
- */
- if (unlikely(xdp_buff_has_frags(xdp) &&
- map_type != BPF_MAP_TYPE_CPUMAP))
- return -EOPNOTSUPP;
-
- if (map_type == BPF_MAP_TYPE_XSKMAP)
- return __xdp_do_redirect_xsk(ri, dev, xdp, xdp_prog);
-
- return __xdp_do_redirect_frame(ri, dev, xdp_convert_buff_to_frame(xdp),
- xdp_prog);
-}
-EXPORT_SYMBOL_GPL(xdp_do_redirect);
-
-int xdp_do_redirect_frame(struct net_device *dev, struct xdp_buff *xdp,
- struct xdp_frame *xdpf, struct bpf_prog *xdp_prog)
-{
- struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
- enum bpf_map_type map_type = ri->map_type;
-
- if (map_type == BPF_MAP_TYPE_XSKMAP)
- return __xdp_do_redirect_xsk(ri, dev, xdp, xdp_prog);
-
- return __xdp_do_redirect_frame(ri, dev, xdpf, xdp_prog);
-}
-EXPORT_SYMBOL_GPL(xdp_do_redirect_frame);
-
-static int xdp_do_generic_redirect_map(struct net_device *dev,
- struct sk_buff *skb,
- struct xdp_buff *xdp,
- struct bpf_prog *xdp_prog,
- void *fwd,
- enum bpf_map_type map_type, u32 map_id)
-{
- struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
- struct bpf_map *map;
- int err;
-
- switch (map_type) {
- case BPF_MAP_TYPE_DEVMAP:
- fallthrough;
- case BPF_MAP_TYPE_DEVMAP_HASH:
- map = READ_ONCE(ri->map);
- if (unlikely(map)) {
- WRITE_ONCE(ri->map, NULL);
- err = dev_map_redirect_multi(dev, skb, xdp_prog, map,
- ri->flags & BPF_F_EXCLUDE_INGRESS);
- } else {
- err = dev_map_generic_redirect(fwd, skb, xdp_prog);
- }
- if (unlikely(err))
- goto err;
- break;
- case BPF_MAP_TYPE_XSKMAP:
- err = xsk_generic_rcv(fwd, xdp);
- if (err)
- goto err;
- consume_skb(skb);
- break;
- case BPF_MAP_TYPE_CPUMAP:
- err = cpu_map_generic_redirect(fwd, skb);
- if (unlikely(err))
- goto err;
- break;
- default:
- err = -EBADRQC;
- goto err;
- }
-
- _trace_xdp_redirect_map(dev, xdp_prog, fwd, map_type, map_id, ri->tgt_index);
- return 0;
-err:
- _trace_xdp_redirect_map_err(dev, xdp_prog, fwd, map_type, map_id, ri->tgt_index, err);
- return err;
-}
-
-int xdp_do_generic_redirect(struct net_device *dev, struct sk_buff *skb,
- struct xdp_buff *xdp, struct bpf_prog *xdp_prog)
-{
- struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
- enum bpf_map_type map_type = ri->map_type;
- void *fwd = ri->tgt_value;
- u32 map_id = ri->map_id;
- int err;
-
- ri->map_id = 0; /* Valid map id idr range: [1,INT_MAX[ */
- ri->map_type = BPF_MAP_TYPE_UNSPEC;
-
- if (map_type == BPF_MAP_TYPE_UNSPEC && map_id == INT_MAX) {
- fwd = dev_get_by_index_rcu(dev_net(dev), ri->tgt_index);
- if (unlikely(!fwd)) {
- err = -EINVAL;
- goto err;
- }
-
- err = xdp_ok_fwd_dev(fwd, skb->len);
- if (unlikely(err))
- goto err;
-
- skb->dev = fwd;
- _trace_xdp_redirect(dev, xdp_prog, ri->tgt_index);
- generic_xdp_tx(skb, xdp_prog);
- return 0;
- }
-
- return xdp_do_generic_redirect_map(dev, skb, xdp, xdp_prog, fwd, map_type, map_id);
-err:
- _trace_xdp_redirect_err(dev, xdp_prog, ri->tgt_index, err);
- return err;
-}
-
-BPF_CALL_2(bpf_xdp_redirect, u32, ifindex, u64, flags)
-{
- struct bpf_redirect_info *ri = this_cpu_ptr(&bpf_redirect_info);
-
- if (unlikely(flags))
- return XDP_ABORTED;
-
- /* NB! Map type UNSPEC and map_id == INT_MAX (never generated
- * by map_idr) is used for ifindex based XDP redirect.
- */
- ri->tgt_index = ifindex;
- ri->map_id = INT_MAX;
- ri->map_type = BPF_MAP_TYPE_UNSPEC;
-
- return XDP_REDIRECT;
-}
-
-static const struct bpf_func_proto bpf_xdp_redirect_proto = {
- .func = bpf_xdp_redirect,
- .gpl_only = false,
- .ret_type = RET_INTEGER,
- .arg1_type = ARG_ANYTHING,
- .arg2_type = ARG_ANYTHING,
-};
-
-BPF_CALL_3(bpf_xdp_redirect_map, struct bpf_map *, map, u32, ifindex,
- u64, flags)
-{
- return map->ops->map_redirect(map, ifindex, flags);
-}
-
-static const struct bpf_func_proto bpf_xdp_redirect_map_proto = {
- .func = bpf_xdp_redirect_map,
- .gpl_only = false,
- .ret_type = RET_INTEGER,
- .arg1_type = ARG_CONST_MAP_PTR,
- .arg2_type = ARG_ANYTHING,
- .arg3_type = ARG_ANYTHING,
-};
-
static unsigned long bpf_skb_copy(void *dst_buff, const void *skb,
unsigned long off, unsigned long len)
{
@@ -4830,55 +4195,6 @@ static const struct bpf_func_proto bpf_sk_ancestor_cgroup_id_proto = {
};
#endif
-static unsigned long bpf_xdp_copy(void *dst, const void *ctx,
- unsigned long off, unsigned long len)
-{
- struct xdp_buff *xdp = (struct xdp_buff *)ctx;
-
- bpf_xdp_copy_buf(xdp, off, dst, len, false);
- return 0;
-}
-
-BPF_CALL_5(bpf_xdp_event_output, struct xdp_buff *, xdp, struct bpf_map *, map,
- u64, flags, void *, meta, u64, meta_size)
-{
- u64 xdp_size = (flags & BPF_F_CTXLEN_MASK) >> 32;
-
- if (unlikely(flags & ~(BPF_F_CTXLEN_MASK | BPF_F_INDEX_MASK)))
- return -EINVAL;
-
- if (unlikely(!xdp || xdp_size > xdp_get_buff_len(xdp)))
- return -EFAULT;
-
- return bpf_event_output(map, flags, meta, meta_size, xdp,
- xdp_size, bpf_xdp_copy);
-}
-
-static const struct bpf_func_proto bpf_xdp_event_output_proto = {
- .func = bpf_xdp_event_output,
- .gpl_only = true,
- .ret_type = RET_INTEGER,
- .arg1_type = ARG_PTR_TO_CTX,
- .arg2_type = ARG_CONST_MAP_PTR,
- .arg3_type = ARG_ANYTHING,
- .arg4_type = ARG_PTR_TO_MEM | MEM_RDONLY,
- .arg5_type = ARG_CONST_SIZE_OR_ZERO,
-};
-
-BTF_ID_LIST_SINGLE(bpf_xdp_output_btf_ids, struct, xdp_buff)
-
-const struct bpf_func_proto bpf_xdp_output_proto = {
- .func = bpf_xdp_event_output,
- .gpl_only = true,
- .ret_type = RET_INTEGER,
- .arg1_type = ARG_PTR_TO_BTF_ID,
- .arg1_btf_id = &bpf_xdp_output_btf_ids[0],
- .arg2_type = ARG_CONST_MAP_PTR,
- .arg3_type = ARG_ANYTHING,
- .arg4_type = ARG_PTR_TO_MEM | MEM_RDONLY,
- .arg5_type = ARG_CONST_SIZE_OR_ZERO,
-};
-
BPF_CALL_1(bpf_get_socket_cookie, struct sk_buff *, skb)
{
return skb->sk ? __sock_gen_cookie(skb->sk) : 0;
@@ -6957,46 +6273,6 @@ BPF_CALL_1(bpf_skb_ecn_set_ce, struct sk_buff *, skb)
return INET_ECN_set_ce(skb);
}
-bool bpf_xdp_sock_is_valid_access(int off, int size, enum bpf_access_type type,
- struct bpf_insn_access_aux *info)
-{
- if (off < 0 || off >= offsetofend(struct bpf_xdp_sock, queue_id))
- return false;
-
- if (off % size != 0)
- return false;
-
- switch (off) {
- default:
- return size == sizeof(__u32);
- }
-}
-
-u32 bpf_xdp_sock_convert_ctx_access(enum bpf_access_type type,
- const struct bpf_insn *si,
- struct bpf_insn *insn_buf,
- struct bpf_prog *prog, u32 *target_size)
-{
- struct bpf_insn *insn = insn_buf;
-
-#define BPF_XDP_SOCK_GET(FIELD) \
- do { \
- BUILD_BUG_ON(sizeof_field(struct xdp_sock, FIELD) > \
- sizeof_field(struct bpf_xdp_sock, FIELD)); \
- *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_sock, FIELD),\
- si->dst_reg, si->src_reg, \
- offsetof(struct xdp_sock, FIELD)); \
- } while (0)
-
- switch (si->off) {
- case offsetof(struct bpf_xdp_sock, queue_id):
- BPF_XDP_SOCK_GET(queue_id);
- break;
- }
-
- return insn - insn_buf;
-}
-
static const struct bpf_func_proto bpf_skb_ecn_set_ce_proto = {
.func = bpf_skb_ecn_set_ce,
.gpl_only = false,
@@ -7569,12 +6845,10 @@ bool bpf_helper_changes_pkt_data(void *func)
func == bpf_clone_redirect ||
func == bpf_l3_csum_replace ||
func == bpf_l4_csum_replace ||
- func == bpf_xdp_adjust_head ||
- func == bpf_xdp_adjust_meta ||
+ xdp_helper_changes_pkt_data(func) ||
func == bpf_msg_pull_data ||
func == bpf_msg_push_data ||
func == bpf_msg_pop_data ||
- func == bpf_xdp_adjust_tail ||
#if IS_ENABLED(CONFIG_IPV6_SEG6_BPF)
func == bpf_lwt_seg6_store_bytes ||
func == bpf_lwt_seg6_adjust_srh ||
@@ -7929,32 +7203,11 @@ tc_cls_act_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
}
}
-static const struct bpf_func_proto *
-xdp_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
+const struct bpf_func_proto *xdp_inet_func_proto(enum bpf_func_id func_id)
{
switch (func_id) {
- case BPF_FUNC_perf_event_output:
- return &bpf_xdp_event_output_proto;
- case BPF_FUNC_get_smp_processor_id:
- return &bpf_get_smp_processor_id_proto;
case BPF_FUNC_csum_diff:
return &bpf_csum_diff_proto;
- case BPF_FUNC_xdp_adjust_head:
- return &bpf_xdp_adjust_head_proto;
- case BPF_FUNC_xdp_adjust_meta:
- return &bpf_xdp_adjust_meta_proto;
- case BPF_FUNC_redirect:
- return &bpf_xdp_redirect_proto;
- case BPF_FUNC_redirect_map:
- return &bpf_xdp_redirect_map_proto;
- case BPF_FUNC_xdp_adjust_tail:
- return &bpf_xdp_adjust_tail_proto;
- case BPF_FUNC_xdp_get_buff_len:
- return &bpf_xdp_get_buff_len_proto;
- case BPF_FUNC_xdp_load_bytes:
- return &bpf_xdp_load_bytes_proto;
- case BPF_FUNC_xdp_store_bytes:
- return &bpf_xdp_store_bytes_proto;
case BPF_FUNC_fib_lookup:
return &bpf_xdp_fib_lookup_proto;
case BPF_FUNC_check_mtu:
@@ -8643,64 +7896,6 @@ static bool tc_cls_act_is_valid_access(int off, int size,
return bpf_skb_is_valid_access(off, size, type, prog, info);
}
-static bool __is_valid_xdp_access(int off, int size)
-{
- if (off < 0 || off >= sizeof(struct xdp_md))
- return false;
- if (off % size != 0)
- return false;
- if (size != sizeof(__u32))
- return false;
-
- return true;
-}
-
-static bool xdp_is_valid_access(int off, int size,
- enum bpf_access_type type,
- const struct bpf_prog *prog,
- struct bpf_insn_access_aux *info)
-{
- if (prog->expected_attach_type != BPF_XDP_DEVMAP) {
- switch (off) {
- case offsetof(struct xdp_md, egress_ifindex):
- return false;
- }
- }
-
- if (type == BPF_WRITE) {
- if (bpf_prog_is_dev_bound(prog->aux)) {
- switch (off) {
- case offsetof(struct xdp_md, rx_queue_index):
- return __is_valid_xdp_access(off, size);
- }
- }
- return false;
- }
-
- switch (off) {
- case offsetof(struct xdp_md, data):
- info->reg_type = PTR_TO_PACKET;
- break;
- case offsetof(struct xdp_md, data_meta):
- info->reg_type = PTR_TO_PACKET_META;
- break;
- case offsetof(struct xdp_md, data_end):
- info->reg_type = PTR_TO_PACKET_END;
- break;
- }
-
- return __is_valid_xdp_access(off, size);
-}
-
-void bpf_warn_invalid_xdp_action(struct net_device *dev, struct bpf_prog *prog, u32 act)
-{
- const u32 act_max = XDP_REDIRECT;
-
- pr_warn_once("%s XDP return value %u on prog %s (id %d) dev %s, expect packet loss!\n",
- act > act_max ? "Illegal" : "Driver unsupported",
- act, prog->aux->name, prog->aux->id, dev ? dev->name : "N/A");
-}
-EXPORT_SYMBOL_GPL(bpf_warn_invalid_xdp_action);
static bool sock_addr_is_valid_access(int off, int size,
enum bpf_access_type type,
@@ -9705,62 +8900,6 @@ static u32 tc_cls_act_convert_ctx_access(enum bpf_access_type type,
return insn - insn_buf;
}
-static u32 xdp_convert_ctx_access(enum bpf_access_type type,
- const struct bpf_insn *si,
- struct bpf_insn *insn_buf,
- struct bpf_prog *prog, u32 *target_size)
-{
- struct bpf_insn *insn = insn_buf;
-
- switch (si->off) {
- case offsetof(struct xdp_md, data):
- *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_buff, data),
- si->dst_reg, si->src_reg,
- offsetof(struct xdp_buff, data));
- break;
- case offsetof(struct xdp_md, data_meta):
- *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_buff, data_meta),
- si->dst_reg, si->src_reg,
- offsetof(struct xdp_buff, data_meta));
- break;
- case offsetof(struct xdp_md, data_end):
- *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_buff, data_end),
- si->dst_reg, si->src_reg,
- offsetof(struct xdp_buff, data_end));
- break;
- case offsetof(struct xdp_md, ingress_ifindex):
- *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_buff, rxq),
- si->dst_reg, si->src_reg,
- offsetof(struct xdp_buff, rxq));
- *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_rxq_info, dev),
- si->dst_reg, si->dst_reg,
- offsetof(struct xdp_rxq_info, dev));
- *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
- offsetof(struct net_device, ifindex));
- break;
- case offsetof(struct xdp_md, rx_queue_index):
- *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_buff, rxq),
- si->dst_reg, si->src_reg,
- offsetof(struct xdp_buff, rxq));
- *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
- offsetof(struct xdp_rxq_info,
- queue_index));
- break;
- case offsetof(struct xdp_md, egress_ifindex):
- *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_buff, txq),
- si->dst_reg, si->src_reg,
- offsetof(struct xdp_buff, txq));
- *insn++ = BPF_LDX_MEM(BPF_FIELD_SIZEOF(struct xdp_txq_info, dev),
- si->dst_reg, si->dst_reg,
- offsetof(struct xdp_txq_info, dev));
- *insn++ = BPF_LDX_MEM(BPF_W, si->dst_reg, si->dst_reg,
- offsetof(struct net_device, ifindex));
- break;
- }
-
- return insn - insn_buf;
-}
-
/* SOCK_ADDR_LOAD_NESTED_FIELD() loads Nested Field S.F.NF where S is type of
* context Structure, F is Field in context structure that contains a pointer
* to Nested Structure of type NS that has the field NF.
@@ -10602,17 +9741,6 @@ const struct bpf_prog_ops tc_cls_act_prog_ops = {
.test_run = bpf_prog_test_run_skb,
};
-const struct bpf_verifier_ops xdp_verifier_ops = {
- .get_func_proto = xdp_func_proto,
- .is_valid_access = xdp_is_valid_access,
- .convert_ctx_access = xdp_convert_ctx_access,
- .gen_prologue = bpf_noop_prologue,
-};
-
-const struct bpf_prog_ops xdp_prog_ops = {
- .test_run = bpf_prog_test_run_xdp,
-};
-
const struct bpf_verifier_ops cg_skb_verifier_ops = {
.get_func_proto = cg_skb_func_proto,
.is_valid_access = cg_skb_is_valid_access,
@@ -11266,13 +10394,6 @@ const struct bpf_verifier_ops sk_lookup_verifier_ops = {
#endif /* CONFIG_INET */
-DEFINE_BPF_DISPATCHER(xdp)
-
-void bpf_prog_change_xdp(struct bpf_prog *prev_prog, struct bpf_prog *prog)
-{
- bpf_dispatcher_change_prog(BPF_DISPATCHER_PTR(xdp), prev_prog, prog);
-}
-
BTF_ID_LIST_GLOBAL(btf_sock_ids, MAX_BTF_SOCK_TYPE)
#define BTF_SOCK_TYPE(name, type) BTF_ID(struct, type)
BTF_SOCK_TYPE_xxx
--
2.36.1
^ permalink raw reply [flat|nested] 98+ messages in thread
* [xdp-hints] [PATCH RFC bpf-next 06/52] bpf: pass a pointer to union bpf_attr to bpf_link_ops::update_prog()
2022-06-28 19:47 [xdp-hints] [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata Alexander Lobakin
` (4 preceding siblings ...)
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 05/52] net, xdp: decouple XDP code from the core networking code Alexander Lobakin
@ 2022-06-28 19:47 ` Alexander Lobakin
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 07/52] net, xdp: remove redundant arguments from dev_xdp_{at,de}tach_link() Alexander Lobakin
` (46 subsequent siblings)
52 siblings, 0 replies; 98+ messages in thread
From: Alexander Lobakin @ 2022-06-28 19:47 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
Cc: Alexander Lobakin, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Toke Hoiland-Jorgensen,
Lorenzo Bianconi, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jesse Brandeburg, John Fastabend, Yajun Deng,
Willem de Bruijn, bpf, netdev, linux-kernel, xdp-hints
In order to be able to use arbitrary data from
bpf_attr::link_update inside the bpf_link_ops::update_prog()
implementations, pass a pointer to the whole attr as a callback
argument.
The @new_prog and @old_prog arguments are kept as ::link_update
contains only their FDs (a usage sketch follows below).
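For illustration, an implementation can now reach any ::link_update
member directly (a minimal sketch; nothing here is added by this
patch, later patches introduce the actual link type-specific data):

static int foo_link_update(struct bpf_link *link,
			   const union bpf_attr *attr,
			   struct bpf_prog *new_prog,
			   struct bpf_prog *old_prog)
{
	/* e.g. attr->link_update.flags, or any link type-specific
	 * field a later patch adds to the ::link_update union
	 */
	return 0;
}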
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
---
include/linux/bpf.h | 3 ++-
kernel/bpf/bpf_iter.c | 1 +
kernel/bpf/cgroup.c | 4 +++-
kernel/bpf/net_namespace.c | 1 +
kernel/bpf/syscall.c | 2 +-
net/bpf/dev.c | 4 +++-
6 files changed, 11 insertions(+), 4 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index d05e1495a06e..c08690a49011 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -1155,7 +1155,8 @@ struct bpf_link_ops {
void (*release)(struct bpf_link *link);
void (*dealloc)(struct bpf_link *link);
int (*detach)(struct bpf_link *link);
- int (*update_prog)(struct bpf_link *link, struct bpf_prog *new_prog,
+ int (*update_prog)(struct bpf_link *link, const union bpf_attr *attr,
+ struct bpf_prog *new_prog,
struct bpf_prog *old_prog);
void (*show_fdinfo)(const struct bpf_link *link, struct seq_file *seq);
int (*fill_link_info)(const struct bpf_link *link,
diff --git a/kernel/bpf/bpf_iter.c b/kernel/bpf/bpf_iter.c
index 7e8fd49406f6..1d3dcc853f70 100644
--- a/kernel/bpf/bpf_iter.c
+++ b/kernel/bpf/bpf_iter.c
@@ -400,6 +400,7 @@ static void bpf_iter_link_dealloc(struct bpf_link *link)
}
static int bpf_iter_link_replace(struct bpf_link *link,
+ const union bpf_attr *attr,
struct bpf_prog *new_prog,
struct bpf_prog *old_prog)
{
diff --git a/kernel/bpf/cgroup.c b/kernel/bpf/cgroup.c
index 7a394f7c205c..f4d8100dd22f 100644
--- a/kernel/bpf/cgroup.c
+++ b/kernel/bpf/cgroup.c
@@ -664,7 +664,9 @@ static int __cgroup_bpf_replace(struct cgroup *cgrp,
return 0;
}
-static int cgroup_bpf_replace(struct bpf_link *link, struct bpf_prog *new_prog,
+static int cgroup_bpf_replace(struct bpf_link *link,
+ const union bpf_attr *attr,
+ struct bpf_prog *new_prog,
struct bpf_prog *old_prog)
{
struct bpf_cgroup_link *cg_link;
diff --git a/kernel/bpf/net_namespace.c b/kernel/bpf/net_namespace.c
index 868cc2c43899..5d80a4a9d0bd 100644
--- a/kernel/bpf/net_namespace.c
+++ b/kernel/bpf/net_namespace.c
@@ -162,6 +162,7 @@ static void bpf_netns_link_dealloc(struct bpf_link *link)
}
static int bpf_netns_link_update_prog(struct bpf_link *link,
+ const union bpf_attr *attr,
struct bpf_prog *new_prog,
struct bpf_prog *old_prog)
{
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 7d5af5b99f0d..f7a674656067 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -4614,7 +4614,7 @@ static int link_update(union bpf_attr *attr)
}
if (link->ops->update_prog)
- ret = link->ops->update_prog(link, new_prog, old_prog);
+ ret = link->ops->update_prog(link, attr, new_prog, old_prog);
else
ret = -EINVAL;
diff --git a/net/bpf/dev.c b/net/bpf/dev.c
index dfe0402947f8..68a7b2c49392 100644
--- a/net/bpf/dev.c
+++ b/net/bpf/dev.c
@@ -619,7 +619,9 @@ static int bpf_xdp_link_fill_link_info(const struct bpf_link *link,
return 0;
}
-static int bpf_xdp_link_update(struct bpf_link *link, struct bpf_prog *new_prog,
+static int bpf_xdp_link_update(struct bpf_link *link,
+ const union bpf_attr *attr,
+ struct bpf_prog *new_prog,
struct bpf_prog *old_prog)
{
struct bpf_xdp_link *xdp_link = container_of(link, struct bpf_xdp_link, link);
--
2.36.1
^ permalink raw reply [flat|nested] 98+ messages in thread
* [xdp-hints] [PATCH RFC bpf-next 07/52] net, xdp: remove redundant arguments from dev_xdp_{at,de}tach_link()
2022-06-28 19:47 [xdp-hints] [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata Alexander Lobakin
` (5 preceding siblings ...)
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 06/52] bpf: pass a pointer to union bpf_attr to bpf_link_ops::update_prog() Alexander Lobakin
@ 2022-06-28 19:47 ` Alexander Lobakin
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 08/52] net, xdp: factor out XDP install arguments to a separate structure Alexander Lobakin
` (45 subsequent siblings)
52 siblings, 0 replies; 98+ messages in thread
From: Alexander Lobakin @ 2022-06-28 19:47 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
Cc: Alexander Lobakin, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Toke Hoiland-Jorgensen,
Lorenzo Bianconi, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jesse Brandeburg, John Fastabend, Yajun Deng,
Willem de Bruijn, bpf, netdev, linux-kernel, xdp-hints
dev_xdp_attach_link(): the sole caller always passes %NULL as
@extack and @link->dev as @dev, so both arguments can be omitted.
The very same story with dev_xdp_detach_link(): remove both
@dev and @extack, as @dev is available as @link->dev and @extack
is always %NULL anyway.
This decreases stack usage with no functional changes.
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
---
net/bpf/dev.c | 15 ++++++---------
1 file changed, 6 insertions(+), 9 deletions(-)
diff --git a/net/bpf/dev.c b/net/bpf/dev.c
index 68a7b2c49392..0010b20719e8 100644
--- a/net/bpf/dev.c
+++ b/net/bpf/dev.c
@@ -534,17 +534,14 @@ static int dev_xdp_attach(struct net_device *dev, struct netlink_ext_ack *extack
return 0;
}
-static int dev_xdp_attach_link(struct net_device *dev,
- struct netlink_ext_ack *extack,
- struct bpf_xdp_link *link)
+static int dev_xdp_attach_link(struct bpf_xdp_link *link)
{
- return dev_xdp_attach(dev, extack, link, NULL, NULL, link->flags);
+ return dev_xdp_attach(link->dev, NULL, link, NULL, NULL, link->flags);
}
-static int dev_xdp_detach_link(struct net_device *dev,
- struct netlink_ext_ack *extack,
- struct bpf_xdp_link *link)
+static int dev_xdp_detach_link(struct bpf_xdp_link *link)
{
+ struct net_device *dev = link->dev;
enum bpf_xdp_mode mode;
bpf_op_t bpf_op;
@@ -570,7 +567,7 @@ static void bpf_xdp_link_release(struct bpf_link *link)
* already NULL, in which case link was already auto-detached
*/
if (xdp_link->dev) {
- WARN_ON(dev_xdp_detach_link(xdp_link->dev, NULL, xdp_link));
+ WARN_ON(dev_xdp_detach_link(xdp_link));
xdp_link->dev = NULL;
}
@@ -709,7 +706,7 @@ int bpf_xdp_link_attach(const union bpf_attr *attr, struct bpf_prog *prog)
goto unlock;
}
- err = dev_xdp_attach_link(dev, NULL, link);
+ err = dev_xdp_attach_link(link);
rtnl_unlock();
if (err) {
--
2.36.1
^ permalink raw reply [flat|nested] 98+ messages in thread
* [xdp-hints] [PATCH RFC bpf-next 08/52] net, xdp: factor out XDP install arguments to a separate structure
2022-06-28 19:47 [xdp-hints] [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata Alexander Lobakin
` (6 preceding siblings ...)
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 07/52] net, xdp: remove redundant arguments from dev_xdp_{at,de}tach_link() Alexander Lobakin
@ 2022-06-28 19:47 ` Alexander Lobakin
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 09/52] net, xdp: add ability to specify BTF ID for XDP metadata Alexander Lobakin
` (44 subsequent siblings)
52 siblings, 0 replies; 98+ messages in thread
From: Alexander Lobakin @ 2022-06-28 19:47 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
Cc: Alexander Lobakin, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Toke Hoiland-Jorgensen,
Lorenzo Bianconi, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jesse Brandeburg, John Fastabend, Yajun Deng,
Willem de Bruijn, bpf, netdev, linux-kernel, xdp-hints
The current way of passing parameters from userland/rtnetlink
(do_setlink()) to dev_change_xdp_fd() and in the end to the drivers
as separate arguments does not scale well: each new parameter
requires changing the prototypes of several functions at once.
To be able to pass more, gather them into a structure which for now
contains:
* dev, the actual netdevice,
* extack, Netlink extack to pass arbitrary messages to userland,
* flags, XDP install flags passed from the user.
and use it in the following functions instead of the separate
arguments: dev_change_xdp_fd(), dev_xdp_attach() and
dev_xdp_install(). Adjust the rest accordingly (see the call-site
sketch below).
Those three are used along the whole 'user -> driver' chain; the
rest can {,dis}appear later, thus not included.
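A call site then reduces to filling the struct once (sketch
mirroring what the diff below does in do_setlink()):

struct xdp_install_args args = {
	.dev	= dev,
	.extack	= extack,
	.flags	= xdp_flags,
};

/* new parameters become new fields above, with no prototype churn */
err = dev_change_xdp_fd(&args, fd, expected_fd);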
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
---
include/linux/netdevice.h | 10 +++++--
net/bpf/dev.c | 61 ++++++++++++++++++++++++---------------
net/core/rtnetlink.c | 10 +++++--
3 files changed, 53 insertions(+), 28 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 0b8169c23f22..1e342c285f48 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3848,11 +3848,17 @@ struct sk_buff *validate_xmit_skb_list(struct sk_buff *skb, struct net_device *d
struct sk_buff *dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
struct netdev_queue *txq, int *ret);
+struct xdp_install_args {
+ struct net_device *dev;
+ struct netlink_ext_ack *extack;
+ u32 flags;
+};
+
DECLARE_STATIC_KEY_FALSE(generic_xdp_needed_key);
int bpf_xdp_link_attach(const union bpf_attr *attr, struct bpf_prog *prog);
-int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack,
- int fd, int expected_fd, u32 flags);
+int dev_change_xdp_fd(const struct xdp_install_args *args, int fd,
+ int expected_fd);
void dev_xdp_uninstall(struct net_device *dev);
u8 dev_xdp_prog_count(struct net_device *dev);
u32 dev_xdp_prog_id(struct net_device *dev, enum bpf_xdp_mode mode);
diff --git a/net/bpf/dev.c b/net/bpf/dev.c
index 0010b20719e8..7df42bb886ad 100644
--- a/net/bpf/dev.c
+++ b/net/bpf/dev.c
@@ -350,17 +350,17 @@ static void dev_xdp_set_prog(struct net_device *dev, enum bpf_xdp_mode mode,
dev->xdp_state[mode].prog = prog;
}
-static int dev_xdp_install(struct net_device *dev, enum bpf_xdp_mode mode,
- bpf_op_t bpf_op, struct netlink_ext_ack *extack,
- u32 flags, struct bpf_prog *prog)
+static int dev_xdp_install(const struct xdp_install_args *args,
+ enum bpf_xdp_mode mode, bpf_op_t bpf_op,
+ struct bpf_prog *prog)
{
struct netdev_bpf xdp;
int err;
memset(&xdp, 0, sizeof(xdp));
xdp.command = mode == XDP_MODE_HW ? XDP_SETUP_PROG_HW : XDP_SETUP_PROG;
- xdp.extack = extack;
- xdp.flags = flags;
+ xdp.extack = args->extack;
+ xdp.flags = args->flags;
xdp.prog = prog;
/* Drivers assume refcnt is already incremented (i.e, prog pointer is
@@ -371,7 +371,7 @@ static int dev_xdp_install(struct net_device *dev, enum bpf_xdp_mode mode,
*/
if (prog)
bpf_prog_inc(prog);
- err = bpf_op(dev, &xdp);
+ err = bpf_op(args->dev, &xdp);
if (err) {
if (prog)
bpf_prog_put(prog);
@@ -379,13 +379,16 @@ static int dev_xdp_install(struct net_device *dev, enum bpf_xdp_mode mode,
}
if (mode != XDP_MODE_HW)
- bpf_prog_change_xdp(dev_xdp_prog(dev, mode), prog);
+ bpf_prog_change_xdp(dev_xdp_prog(args->dev, mode), prog);
return 0;
}
void dev_xdp_uninstall(struct net_device *dev)
{
+ struct xdp_install_args args = {
+ .dev = dev,
+ };
struct bpf_xdp_link *link;
struct bpf_prog *prog;
enum bpf_xdp_mode mode;
@@ -402,7 +405,7 @@ void dev_xdp_uninstall(struct net_device *dev)
if (!bpf_op)
continue;
- WARN_ON(dev_xdp_install(dev, mode, bpf_op, NULL, 0, NULL));
+ WARN_ON(dev_xdp_install(&args, mode, bpf_op, NULL));
/* auto-detach link from net device */
link = dev_xdp_link(dev, mode);
@@ -415,13 +418,16 @@ void dev_xdp_uninstall(struct net_device *dev)
}
}
-static int dev_xdp_attach(struct net_device *dev, struct netlink_ext_ack *extack,
+static int dev_xdp_attach(const struct xdp_install_args *args,
struct bpf_xdp_link *link, struct bpf_prog *new_prog,
- struct bpf_prog *old_prog, u32 flags)
+ struct bpf_prog *old_prog)
{
- unsigned int num_modes = hweight32(flags & XDP_FLAGS_MODES);
+ unsigned int num_modes = hweight32(args->flags & XDP_FLAGS_MODES);
+ struct netlink_ext_ack *extack = args->extack;
+ struct net_device *dev = args->dev;
struct bpf_prog *cur_prog;
struct net_device *upper;
+ u32 flags = args->flags;
struct list_head *iter;
enum bpf_xdp_mode mode;
bpf_op_t bpf_op;
@@ -519,7 +525,7 @@ static int dev_xdp_attach(struct net_device *dev, struct netlink_ext_ack *extack
return -EOPNOTSUPP;
}
- err = dev_xdp_install(dev, mode, bpf_op, extack, flags, new_prog);
+ err = dev_xdp_install(args, mode, bpf_op, new_prog);
if (err)
return err;
}
@@ -536,12 +542,20 @@ static int dev_xdp_attach(struct net_device *dev, struct netlink_ext_ack *extack
static int dev_xdp_attach_link(struct bpf_xdp_link *link)
{
- return dev_xdp_attach(link->dev, NULL, link, NULL, NULL, link->flags);
+ struct xdp_install_args args = {
+ .dev = link->dev,
+ .flags = link->flags,
+ };
+
+ return dev_xdp_attach(&args, link, NULL, NULL);
}
static int dev_xdp_detach_link(struct bpf_xdp_link *link)
{
struct net_device *dev = link->dev;
+ struct xdp_install_args args = {
+ .dev = dev,
+ };
enum bpf_xdp_mode mode;
bpf_op_t bpf_op;
@@ -552,7 +566,7 @@ static int dev_xdp_detach_link(struct bpf_xdp_link *link)
return -EINVAL;
bpf_op = dev_xdp_bpf_op(dev, mode);
- WARN_ON(dev_xdp_install(dev, mode, bpf_op, NULL, 0, NULL));
+ WARN_ON(dev_xdp_install(&args, mode, bpf_op, NULL));
dev_xdp_set_link(dev, mode, NULL);
return 0;
}
@@ -622,6 +636,10 @@ static int bpf_xdp_link_update(struct bpf_link *link,
struct bpf_prog *old_prog)
{
struct bpf_xdp_link *xdp_link = container_of(link, struct bpf_xdp_link, link);
+ struct xdp_install_args args = {
+ .dev = xdp_link->dev,
+ .flags = xdp_link->flags,
+ };
enum bpf_xdp_mode mode;
bpf_op_t bpf_op;
int err = 0;
@@ -653,8 +671,7 @@ static int bpf_xdp_link_update(struct bpf_link *link,
mode = dev_xdp_mode(xdp_link->dev, xdp_link->flags);
bpf_op = dev_xdp_bpf_op(xdp_link->dev, mode);
- err = dev_xdp_install(xdp_link->dev, mode, bpf_op, NULL,
- xdp_link->flags, new_prog);
+ err = dev_xdp_install(&args, mode, bpf_op, new_prog);
if (err)
goto out_unlock;
@@ -730,18 +747,16 @@ int bpf_xdp_link_attach(const union bpf_attr *attr, struct bpf_prog *prog)
/**
* dev_change_xdp_fd - set or clear a bpf program for a device rx path
- * @dev: device
- * @extack: netlink extended ack
+ * @args: common XDP arguments (device, extended ack, flags etc.)
* @fd: new program fd or negative value to clear
* @expected_fd: old program fd that userspace expects to replace or clear
- * @flags: xdp-related flags
*
* Set or clear a bpf program for a device
*/
-int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack,
- int fd, int expected_fd, u32 flags)
+int dev_change_xdp_fd(const struct xdp_install_args *args, int fd,
+ int expected_fd)
{
- enum bpf_xdp_mode mode = dev_xdp_mode(dev, flags);
+ enum bpf_xdp_mode mode = dev_xdp_mode(args->dev, args->flags);
struct bpf_prog *new_prog = NULL, *old_prog = NULL;
int err;
@@ -764,7 +779,7 @@ int dev_change_xdp_fd(struct net_device *dev, struct netlink_ext_ack *extack,
}
}
- err = dev_xdp_attach(dev, extack, NULL, new_prog, old_prog, flags);
+ err = dev_xdp_attach(args, NULL, new_prog, old_prog);
err_out:
if (err && new_prog)
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index ac45328607f7..5b06ded689b2 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -2987,6 +2987,11 @@ static int do_setlink(const struct sk_buff *skb,
}
if (xdp[IFLA_XDP_FD]) {
+ struct xdp_install_args args = {
+ .dev = dev,
+ .extack = extack,
+ .flags = xdp_flags,
+ };
int expected_fd = -1;
if (xdp_flags & XDP_FLAGS_REPLACE) {
@@ -2998,10 +3003,9 @@ static int do_setlink(const struct sk_buff *skb,
nla_get_s32(xdp[IFLA_XDP_EXPECTED_FD]);
}
- err = dev_change_xdp_fd(dev, extack,
+ err = dev_change_xdp_fd(&args,
nla_get_s32(xdp[IFLA_XDP_FD]),
- expected_fd,
- xdp_flags);
+ expected_fd);
if (err)
goto errout;
status |= DO_SETLINK_NOTIFY;
--
2.36.1
^ permalink raw reply [flat|nested] 98+ messages in thread
* [xdp-hints] [PATCH RFC bpf-next 09/52] net, xdp: add ability to specify BTF ID for XDP metadata
2022-06-28 19:47 [xdp-hints] [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata Alexander Lobakin
` (7 preceding siblings ...)
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 08/52] net, xdp: factor out XDP install arguments to a separate structure Alexander Lobakin
@ 2022-06-28 19:47 ` Alexander Lobakin
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 10/52] net, xdp: add ability to specify frame size threshold " Alexander Lobakin
` (43 subsequent siblings)
52 siblings, 0 replies; 98+ messages in thread
From: Alexander Lobakin @ 2022-06-28 19:47 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
Cc: Alexander Lobakin, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Toke Hoiland-Jorgensen,
Lorenzo Bianconi, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jesse Brandeburg, John Fastabend, Yajun Deng,
Willem de Bruijn, bpf, netdev, linux-kernel, xdp-hints
Add the UAPI and the corresponding kernel part for specifying the
BTF ID of the format in which the drivers should compose metadata
(if supported).
A driver might be able to provide XDP metadata in different formats,
e.g. the generic one and one or several custom ones (with some
non-universal data from DMA descriptors etc.). In this case, a BPF
loader program specifies the wanted BTF ID, and BPF and AF_XDP
programs then expect this format in the XDP metadata, comparing
BTF IDs against the one placed in front of each frame.
The BTF ID can be set and updated via both the BPF link and
rtnetlink (the %IFLA_XDP_BTF_ID attribute) interfaces, can be read
back via &bpf_link_info, and is passed to the drivers inside
&netdev_bpf.
net_device_ops::ndo_bpf() is now called not only when
@new_prog != @old_prog, but also when @new_prog == @old_prog &&
@new_btf_id != @btf_id, so the drivers should be able to handle
such cases.
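For illustration, a loader can then request a particular metadata
format at link creation time. A minimal sketch using the raw syscall
(assumes <linux/bpf.h>, <sys/syscall.h> and <unistd.h>; @prog_fd,
@ifindex and @btf_id obtained earlier; error handling omitted):

union bpf_attr attr = { };

attr.link_create.prog_fd	= prog_fd;
attr.link_create.target_ifindex	= ifindex;
attr.link_create.attach_type	= BPF_XDP;
attr.link_create.xdp.btf_id	= btf_id;

link_fd = syscall(__NR_bpf, BPF_LINK_CREATE, &attr, sizeof(attr));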
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
---
include/linux/netdevice.h | 2 ++
include/net/xdp.h | 1 +
include/uapi/linux/bpf.h | 12 ++++++++++++
include/uapi/linux/if_link.h | 1 +
kernel/bpf/syscall.c | 2 +-
net/bpf/core.c | 1 +
net/bpf/dev.c | 26 +++++++++++++++++++++++---
net/core/rtnetlink.c | 6 ++++++
tools/include/uapi/linux/bpf.h | 12 ++++++++++++
tools/include/uapi/linux/if_link.h | 1 +
10 files changed, 60 insertions(+), 4 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 1e342c285f48..2218c1901daf 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -985,6 +985,7 @@ struct netdev_bpf {
/* XDP_SETUP_PROG */
struct {
u32 flags;
+ u64 btf_id;
struct bpf_prog *prog;
struct netlink_ext_ack *extack;
};
@@ -3852,6 +3853,7 @@ struct xdp_install_args {
struct net_device *dev;
struct netlink_ext_ack *extack;
u32 flags;
+ u64 btf_id;
};
DECLARE_STATIC_KEY_FALSE(generic_xdp_needed_key);
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 04c852c7a77f..13133c7493bc 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -400,6 +400,7 @@ static inline bool xdp_metalen_invalid(unsigned long metalen)
struct xdp_attachment_info {
struct bpf_prog *prog;
+ u64 btf_id;
u32 flags;
};
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index e81362891596..c67ddb78915d 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1499,6 +1499,10 @@ union bpf_attr {
*/
__u64 cookie;
} tracing;
+ struct {
+ /* target metadata BTF + type ID */
+ __aligned_u64 btf_id;
+ } xdp;
};
} link_create;
@@ -1510,6 +1514,12 @@ union bpf_attr {
/* expected link's program fd; is specified only if
* BPF_F_REPLACE flag is set in flags */
__u32 old_prog_fd;
+ union {
+ struct {
+ /* new target metadata BTF + type ID */
+ __aligned_u64 new_btf_id;
+ } xdp;
+ };
} link_update;
struct {
@@ -6138,6 +6148,8 @@ struct bpf_link_info {
} netns;
struct {
__u32 ifindex;
+ __u32 :32;
+ __aligned_u64 btf_id;
} xdp;
};
} __attribute__((aligned(8)));
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 5f58dcfe2787..73cdcc86875e 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -1307,6 +1307,7 @@ enum {
IFLA_XDP_SKB_PROG_ID,
IFLA_XDP_HW_PROG_ID,
IFLA_XDP_EXPECTED_FD,
+ IFLA_XDP_BTF_ID,
__IFLA_XDP_MAX,
};
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index f7a674656067..2e86cfeae10f 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -4575,7 +4575,7 @@ static int link_create(union bpf_attr *attr, bpfptr_t uattr)
return ret;
}
-#define BPF_LINK_UPDATE_LAST_FIELD link_update.old_prog_fd
+#define BPF_LINK_UPDATE_LAST_FIELD link_update.xdp.new_btf_id
static int link_update(union bpf_attr *attr)
{
diff --git a/net/bpf/core.c b/net/bpf/core.c
index fbb72792320a..e5abd5a64df7 100644
--- a/net/bpf/core.c
+++ b/net/bpf/core.c
@@ -552,6 +552,7 @@ void xdp_attachment_setup(struct xdp_attachment_info *info,
if (info->prog)
bpf_prog_put(info->prog);
info->prog = bpf->prog;
+ info->btf_id = bpf->btf_id;
info->flags = bpf->flags;
}
EXPORT_SYMBOL_GPL(xdp_attachment_setup);
diff --git a/net/bpf/dev.c b/net/bpf/dev.c
index 7df42bb886ad..e96986220126 100644
--- a/net/bpf/dev.c
+++ b/net/bpf/dev.c
@@ -273,6 +273,7 @@ struct bpf_xdp_link {
struct bpf_link link;
struct net_device *dev; /* protected by rtnl_lock, no refcnt held */
int flags;
+ u64 btf_id;
};
typedef int (*bpf_op_t)(struct net_device *dev, struct netdev_bpf *bpf);
@@ -357,8 +358,13 @@ static int dev_xdp_install(const struct xdp_install_args *args,
struct netdev_bpf xdp;
int err;
+ /* BTF ID must not be set when uninstalling the program */
+ if (!prog && args->btf_id)
+ return -EINVAL;
+
memset(&xdp, 0, sizeof(xdp));
xdp.command = mode == XDP_MODE_HW ? XDP_SETUP_PROG_HW : XDP_SETUP_PROG;
+ xdp.btf_id = args->btf_id;
xdp.extack = args->extack;
xdp.flags = args->flags;
xdp.prog = prog;
@@ -517,8 +523,11 @@ static int dev_xdp_attach(const struct xdp_install_args *args,
}
}
- /* don't call drivers if the effective program didn't change */
- if (new_prog != cur_prog) {
+ /* don't call drivers if the effective program or BTF ID didn't change.
+ * If @link == %NULL, we don't know the old value, so the only thing we
+ * can do is to call installing unconditionally
+ */
+ if (new_prog != cur_prog || !link || args->btf_id != link->btf_id) {
bpf_op = dev_xdp_bpf_op(dev, mode);
if (!bpf_op) {
NL_SET_ERR_MSG(extack, "Underlying driver does not support XDP in native mode");
@@ -545,6 +554,7 @@ static int dev_xdp_attach_link(struct bpf_xdp_link *link)
struct xdp_install_args args = {
.dev = link->dev,
.flags = link->flags,
+ .btf_id = link->btf_id,
};
return dev_xdp_attach(&args, link, NULL, NULL);
@@ -606,13 +616,16 @@ static void bpf_xdp_link_show_fdinfo(const struct bpf_link *link,
{
struct bpf_xdp_link *xdp_link = container_of(link, struct bpf_xdp_link, link);
u32 ifindex = 0;
+ u64 btf_id;
rtnl_lock();
if (xdp_link->dev)
ifindex = xdp_link->dev->ifindex;
+ btf_id = xdp_link->btf_id;
rtnl_unlock();
seq_printf(seq, "ifindex:\t%u\n", ifindex);
+ seq_printf(seq, "btf_id:\t0x%llx\n", btf_id);
}
static int bpf_xdp_link_fill_link_info(const struct bpf_link *link,
@@ -620,13 +633,16 @@ static int bpf_xdp_link_fill_link_info(const struct bpf_link *link,
{
struct bpf_xdp_link *xdp_link = container_of(link, struct bpf_xdp_link, link);
u32 ifindex = 0;
+ u64 btf_id;
rtnl_lock();
if (xdp_link->dev)
ifindex = xdp_link->dev->ifindex;
+ btf_id = xdp_link->btf_id;
rtnl_unlock();
info->xdp.ifindex = ifindex;
+ info->xdp.btf_id = btf_id;
return 0;
}
@@ -639,6 +655,7 @@ static int bpf_xdp_link_update(struct bpf_link *link,
struct xdp_install_args args = {
.dev = xdp_link->dev,
.flags = xdp_link->flags,
+ .btf_id = attr->link_update.xdp.new_btf_id,
};
enum bpf_xdp_mode mode;
bpf_op_t bpf_op;
@@ -663,7 +680,7 @@ static int bpf_xdp_link_update(struct bpf_link *link,
goto out_unlock;
}
- if (old_prog == new_prog) {
+ if (old_prog == new_prog && args.btf_id == xdp_link->btf_id) {
/* no-op, don't disturb drivers */
bpf_prog_put(new_prog);
goto out_unlock;
@@ -678,6 +695,8 @@ static int bpf_xdp_link_update(struct bpf_link *link,
old_prog = xchg(&link->prog, new_prog);
bpf_prog_put(old_prog);
+ xdp_link->btf_id = args.btf_id;
+
out_unlock:
rtnl_unlock();
return err;
@@ -716,6 +735,7 @@ int bpf_xdp_link_attach(const union bpf_attr *attr, struct bpf_prog *prog)
bpf_link_init(&link->link, BPF_LINK_TYPE_XDP, &bpf_xdp_link_lops, prog);
link->dev = dev;
link->flags = attr->link_create.flags;
+ link->btf_id = attr->link_create.xdp.btf_id;
err = bpf_link_prime(&link->link, &link_primer);
if (err) {
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 5b06ded689b2..a30723b0e50c 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1979,6 +1979,7 @@ static const struct nla_policy ifla_xdp_policy[IFLA_XDP_MAX + 1] = {
[IFLA_XDP_ATTACHED] = { .type = NLA_U8 },
[IFLA_XDP_FLAGS] = { .type = NLA_U32 },
[IFLA_XDP_PROG_ID] = { .type = NLA_U32 },
+ [IFLA_XDP_BTF_ID] = { .type = NLA_U64 },
};
static const struct rtnl_link_ops *linkinfo_to_kind_ops(const struct nlattr *nla)
@@ -2962,6 +2963,7 @@ static int do_setlink(const struct sk_buff *skb,
if (tb[IFLA_XDP]) {
struct nlattr *xdp[IFLA_XDP_MAX + 1];
u32 xdp_flags = 0;
+ u64 btf_id = 0;
err = nla_parse_nested_deprecated(xdp, IFLA_XDP_MAX,
tb[IFLA_XDP],
@@ -2986,10 +2988,14 @@ static int do_setlink(const struct sk_buff *skb,
}
}
+ if (xdp[IFLA_XDP_BTF_ID])
+ btf_id = nla_get_u64(xdp[IFLA_XDP_BTF_ID]);
+
if (xdp[IFLA_XDP_FD]) {
struct xdp_install_args args = {
.dev = dev,
.extack = extack,
+ .btf_id = btf_id,
.flags = xdp_flags,
};
int expected_fd = -1;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index e81362891596..c67ddb78915d 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1499,6 +1499,10 @@ union bpf_attr {
*/
__u64 cookie;
} tracing;
+ struct {
+ /* target metadata BTF + type ID */
+ __aligned_u64 btf_id;
+ } xdp;
};
} link_create;
@@ -1510,6 +1514,12 @@ union bpf_attr {
/* expected link's program fd; is specified only if
* BPF_F_REPLACE flag is set in flags */
__u32 old_prog_fd;
+ union {
+ struct {
+ /* new target metadata BTF + type ID */
+ __aligned_u64 new_btf_id;
+ } xdp;
+ };
} link_update;
struct {
@@ -6138,6 +6148,8 @@ struct bpf_link_info {
} netns;
struct {
__u32 ifindex;
+ __u32 :32;
+ __aligned_u64 btf_id;
} xdp;
};
} __attribute__((aligned(8)));
diff --git a/tools/include/uapi/linux/if_link.h b/tools/include/uapi/linux/if_link.h
index b339bf2196ca..68b126678dc8 100644
--- a/tools/include/uapi/linux/if_link.h
+++ b/tools/include/uapi/linux/if_link.h
@@ -1212,6 +1212,7 @@ enum {
IFLA_XDP_SKB_PROG_ID,
IFLA_XDP_HW_PROG_ID,
IFLA_XDP_EXPECTED_FD,
+ IFLA_XDP_BTF_ID,
__IFLA_XDP_MAX,
};
--
2.36.1
^ permalink raw reply [flat|nested] 98+ messages in thread
* [xdp-hints] [PATCH RFC bpf-next 10/52] net, xdp: add ability to specify frame size threshold for XDP metadata
2022-06-28 19:47 [xdp-hints] [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata Alexander Lobakin
` (8 preceding siblings ...)
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 09/52] net, xdp: add ability to specify BTF ID for XDP metadata Alexander Lobakin
@ 2022-06-28 19:47 ` Alexander Lobakin
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 11/52] libbpf: factor out __bpf_set_link_xdp_fd_replace() args into a struct Alexander Lobakin
` (42 subsequent siblings)
52 siblings, 0 replies; 98+ messages in thread
From: Alexander Lobakin @ 2022-06-28 19:47 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
Cc: Alexander Lobakin, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Toke Hoiland-Jorgensen,
Lorenzo Bianconi, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jesse Brandeburg, John Fastabend, Yajun Deng,
Willem de Bruijn, bpf, netdev, linux-kernel, xdp-hints
Add the UAPI and the corresponding kernel part to be able to specify
the frame size from which the drivers should start composing
metadata (if supported).
Instead of having just 1 bit on/off, have the possibility to set the
threshold for drivers to start composing meta. It helps in
situations when e.g. lots of traffic receives the %XDP_DROP verdict
without ever looking at the meta. In such cases, the performance on
small frames (< 96 bytes) can suffer by several Mpps with no
benefit, so setting a threshold of 100-128 makes much sense.
Setting it to 0 or 1 works just like a bitflag: values of 2-14 work
like 1 (no Ethernet frame is shorter than 14 bytes), values of
SZ_16K+ work like 0 (no frame is expected to be that large).
So, the logic in the drivers should look like:
if (rx_desc->frame_size >= meta_thresh)
compose_meta();
bpf_prog_run_xdp();
The threshold can be set and updated via both the BPF link and
rtnetlink (the %IFLA_XDP_META_THRESH attribute) interfaces, can be
read back via &bpf_link_info, and is passed to the drivers inside
&netdev_bpf. net_device_ops::ndo_bpf() is now also called when
@new_prog == @old_prog && @new_btf_id == @btf_id &&
@new_meta_thresh != @meta_thresh.
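Note the core converts 0 to ~0 ("never") before calling down, so a
driver can cache the value and keep the plain comparison on the
hotpath. An illustrative sketch (the ring/descriptor naming is
hypothetical):

/* ndo_bpf(), %XDP_SETUP_PROG: ~0 from the core means "never" */
rx_ring->meta_thresh = xdp->meta_thresh;

/* Rx hot loop */
if (le16_to_cpu(rx_desc->pkt_len) >= rx_ring->meta_thresh)
	foo_compose_xdp_meta(rx_ring, rx_desc, &xdp_buff);

act = bpf_prog_run_xdp(prog, &xdp_buff);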
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
---
include/linux/netdevice.h | 2 ++
include/net/xdp.h | 1 +
include/uapi/linux/bpf.h | 10 +++++++-
include/uapi/linux/if_link.h | 1 +
kernel/bpf/syscall.c | 2 +-
net/bpf/core.c | 1 +
net/bpf/dev.c | 38 +++++++++++++++++++++++-------
net/core/rtnetlink.c | 6 +++++
tools/include/uapi/linux/bpf.h | 10 +++++++-
tools/include/uapi/linux/if_link.h | 1 +
10 files changed, 60 insertions(+), 12 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 2218c1901daf..bc2d82a3d0de 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -985,6 +985,7 @@ struct netdev_bpf {
/* XDP_SETUP_PROG */
struct {
u32 flags;
+ u32 meta_thresh;
u64 btf_id;
struct bpf_prog *prog;
struct netlink_ext_ack *extack;
@@ -3853,6 +3854,7 @@ struct xdp_install_args {
struct net_device *dev;
struct netlink_ext_ack *extack;
u32 flags;
+ u32 meta_thresh;
u64 btf_id;
};
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 13133c7493bc..7b8ba068d28a 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -401,6 +401,7 @@ static inline bool xdp_metalen_invalid(unsigned long metalen)
struct xdp_attachment_info {
struct bpf_prog *prog;
u64 btf_id;
+ u32 meta_thresh;
u32 flags;
};
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index c67ddb78915d..372170ded1d8 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -1502,6 +1502,10 @@ union bpf_attr {
struct {
/* target metadata BTF + type ID */
__aligned_u64 btf_id;
+ /* frame size to start composing XDP
+ * metadata from
+ */
+ __u32 meta_thresh;
} xdp;
};
} link_create;
@@ -1518,6 +1522,10 @@ union bpf_attr {
struct {
/* new target metadata BTF + type ID */
__aligned_u64 new_btf_id;
+ /* new frame size to start composing XDP
+ * metadata from
+ */
+ __u32 new_meta_thresh;
} xdp;
};
} link_update;
@@ -6148,7 +6156,7 @@ struct bpf_link_info {
} netns;
struct {
__u32 ifindex;
- __u32 :32;
+ __u32 meta_thresh;
__aligned_u64 btf_id;
} xdp;
};
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 73cdcc86875e..78b448ff1cb7 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -1308,6 +1308,7 @@ enum {
IFLA_XDP_HW_PROG_ID,
IFLA_XDP_EXPECTED_FD,
IFLA_XDP_BTF_ID,
+ IFLA_XDP_META_THRESH,
__IFLA_XDP_MAX,
};
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index 2e86cfeae10f..e1a56e62bdb4 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -4575,7 +4575,7 @@ static int link_create(union bpf_attr *attr, bpfptr_t uattr)
return ret;
}
-#define BPF_LINK_UPDATE_LAST_FIELD link_update.xdp.new_btf_id
+#define BPF_LINK_UPDATE_LAST_FIELD link_update.xdp.new_meta_thresh
static int link_update(union bpf_attr *attr)
{
diff --git a/net/bpf/core.c b/net/bpf/core.c
index e5abd5a64df7..dcd3b6ae86b7 100644
--- a/net/bpf/core.c
+++ b/net/bpf/core.c
@@ -553,6 +553,7 @@ void xdp_attachment_setup(struct xdp_attachment_info *info,
bpf_prog_put(info->prog);
info->prog = bpf->prog;
info->btf_id = bpf->btf_id;
+ info->meta_thresh = bpf->meta_thresh;
info->flags = bpf->flags;
}
EXPORT_SYMBOL_GPL(xdp_attachment_setup);
diff --git a/net/bpf/dev.c b/net/bpf/dev.c
index e96986220126..82948d0536c8 100644
--- a/net/bpf/dev.c
+++ b/net/bpf/dev.c
@@ -273,6 +273,7 @@ struct bpf_xdp_link {
struct bpf_link link;
struct net_device *dev; /* protected by rtnl_lock, no refcnt held */
int flags;
+ u32 meta_thresh;
u64 btf_id;
};
@@ -358,12 +359,20 @@ static int dev_xdp_install(const struct xdp_install_args *args,
struct netdev_bpf xdp;
int err;
- /* BTF ID must not be set when uninstalling the program */
- if (!prog && args->btf_id)
+ /* Neither BTF ID nor meta threshold can be set when uninstalling
+ * the program
+ */
+ if (!prog && (args->btf_id || args->meta_thresh))
+ return -EINVAL;
+
+ /* Both meta threshold and BTF ID must be either specified or not */
+ if (!args->btf_id != !args->meta_thresh)
return -EINVAL;
memset(&xdp, 0, sizeof(xdp));
xdp.command = mode == XDP_MODE_HW ? XDP_SETUP_PROG_HW : XDP_SETUP_PROG;
+ /* Convert 0 to "infinity" to allow plain >= comparison on the hotpath */
+ xdp.meta_thresh = args->meta_thresh ? : ~args->meta_thresh;
xdp.btf_id = args->btf_id;
xdp.extack = args->extack;
xdp.flags = args->flags;
@@ -523,11 +532,13 @@ static int dev_xdp_attach(const struct xdp_install_args *args,
}
}
- /* don't call drivers if the effective program or BTF ID didn't change.
- * If @link == %NULL, we don't know the old value, so the only thing we
- * can do is to call installing unconditionally
+ /* don't call drivers if the effective program or BTF ID / metadata
+ * threshold didn't change. If @link == %NULL, we don't know the
+ * old values, so the only thing we can do is to call installing
+ * unconditionally
*/
- if (new_prog != cur_prog || !link || args->btf_id != link->btf_id) {
+ if (new_prog != cur_prog || !link || args->btf_id != link->btf_id ||
+ args->meta_thresh != link->meta_thresh) {
bpf_op = dev_xdp_bpf_op(dev, mode);
if (!bpf_op) {
NL_SET_ERR_MSG(extack, "Underlying driver does not support XDP in native mode");
@@ -555,6 +566,7 @@ static int dev_xdp_attach_link(struct bpf_xdp_link *link)
.dev = link->dev,
.flags = link->flags,
.btf_id = link->btf_id,
+ .meta_thresh = link->meta_thresh,
};
return dev_xdp_attach(&args, link, NULL, NULL);
@@ -615,16 +627,18 @@ static void bpf_xdp_link_show_fdinfo(const struct bpf_link *link,
struct seq_file *seq)
{
struct bpf_xdp_link *xdp_link = container_of(link, struct bpf_xdp_link, link);
- u32 ifindex = 0;
+ u32 meta_thresh, ifindex = 0;
u64 btf_id;
rtnl_lock();
if (xdp_link->dev)
ifindex = xdp_link->dev->ifindex;
+ meta_thresh = xdp_link->meta_thresh;
btf_id = xdp_link->btf_id;
rtnl_unlock();
seq_printf(seq, "ifindex:\t%u\n", ifindex);
+ seq_printf(seq, "meta_thresh:\t%u\n", meta_thresh);
seq_printf(seq, "btf_id:\t0x%llx\n", btf_id);
}
@@ -632,17 +646,19 @@ static int bpf_xdp_link_fill_link_info(const struct bpf_link *link,
struct bpf_link_info *info)
{
struct bpf_xdp_link *xdp_link = container_of(link, struct bpf_xdp_link, link);
- u32 ifindex = 0;
+ u32 meta_thresh, ifindex = 0;
u64 btf_id;
rtnl_lock();
if (xdp_link->dev)
ifindex = xdp_link->dev->ifindex;
+ meta_thresh = xdp_link->meta_thresh;
btf_id = xdp_link->btf_id;
rtnl_unlock();
info->xdp.ifindex = ifindex;
info->xdp.btf_id = btf_id;
+ info->xdp.meta_thresh = meta_thresh;
return 0;
}
@@ -656,6 +672,7 @@ static int bpf_xdp_link_update(struct bpf_link *link,
.dev = xdp_link->dev,
.flags = xdp_link->flags,
.btf_id = attr->link_update.xdp.new_btf_id,
+ .meta_thresh = attr->link_update.xdp.new_meta_thresh,
};
enum bpf_xdp_mode mode;
bpf_op_t bpf_op;
@@ -680,7 +697,8 @@ static int bpf_xdp_link_update(struct bpf_link *link,
goto out_unlock;
}
- if (old_prog == new_prog && args.btf_id == xdp_link->btf_id) {
+ if (old_prog == new_prog && args.btf_id == xdp_link->btf_id &&
+ args.meta_thresh == xdp_link->meta_thresh) {
/* no-op, don't disturb drivers */
bpf_prog_put(new_prog);
goto out_unlock;
@@ -696,6 +714,7 @@ static int bpf_xdp_link_update(struct bpf_link *link,
bpf_prog_put(old_prog);
xdp_link->btf_id = args.btf_id;
+ xdp_link->meta_thresh = args.meta_thresh;
out_unlock:
rtnl_unlock();
@@ -736,6 +755,7 @@ int bpf_xdp_link_attach(const union bpf_attr *attr, struct bpf_prog *prog)
link->dev = dev;
link->flags = attr->link_create.flags;
link->btf_id = attr->link_create.xdp.btf_id;
+ link->meta_thresh = attr->link_create.xdp.meta_thresh;
err = bpf_link_prime(&link->link, &link_primer);
if (err) {
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index a30723b0e50c..500420d5017c 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1980,6 +1980,7 @@ static const struct nla_policy ifla_xdp_policy[IFLA_XDP_MAX + 1] = {
[IFLA_XDP_FLAGS] = { .type = NLA_U32 },
[IFLA_XDP_PROG_ID] = { .type = NLA_U32 },
[IFLA_XDP_BTF_ID] = { .type = NLA_U64 },
+ [IFLA_XDP_META_THRESH] = { .type = NLA_U32 },
};
static const struct rtnl_link_ops *linkinfo_to_kind_ops(const struct nlattr *nla)
@@ -2962,6 +2963,7 @@ static int do_setlink(const struct sk_buff *skb,
if (tb[IFLA_XDP]) {
struct nlattr *xdp[IFLA_XDP_MAX + 1];
+ u32 meta_thresh = 0;
u32 xdp_flags = 0;
u64 btf_id = 0;
@@ -2991,12 +2993,16 @@ static int do_setlink(const struct sk_buff *skb,
if (xdp[IFLA_XDP_BTF_ID])
btf_id = nla_get_u64(xdp[IFLA_XDP_BTF_ID]);
+ if (xdp[IFLA_XDP_META_THRESH])
+ meta_thresh = nla_get_u32(xdp[IFLA_XDP_META_THRESH]);
+
if (xdp[IFLA_XDP_FD]) {
struct xdp_install_args args = {
.dev = dev,
.extack = extack,
.btf_id = btf_id,
.flags = xdp_flags,
+ .meta_thresh = meta_thresh,
};
int expected_fd = -1;
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index c67ddb78915d..372170ded1d8 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -1502,6 +1502,10 @@ union bpf_attr {
struct {
/* target metadata BTF + type ID */
__aligned_u64 btf_id;
+ /* frame size to start composing XDP
+ * metadata from
+ */
+ __u32 meta_thresh;
} xdp;
};
} link_create;
@@ -1518,6 +1522,10 @@ union bpf_attr {
struct {
/* new target metadata BTF + type ID */
__aligned_u64 new_btf_id;
+ /* new frame size to start composing XDP
+ * metadata from
+ */
+ __u32 new_meta_thresh;
} xdp;
};
} link_update;
@@ -6148,7 +6156,7 @@ struct bpf_link_info {
} netns;
struct {
__u32 ifindex;
- __u32 :32;
+ __u32 meta_thresh;
__aligned_u64 btf_id;
} xdp;
};
diff --git a/tools/include/uapi/linux/if_link.h b/tools/include/uapi/linux/if_link.h
index 68b126678dc8..b73b9c0f06fb 100644
--- a/tools/include/uapi/linux/if_link.h
+++ b/tools/include/uapi/linux/if_link.h
@@ -1213,6 +1213,7 @@ enum {
IFLA_XDP_HW_PROG_ID,
IFLA_XDP_EXPECTED_FD,
IFLA_XDP_BTF_ID,
+ IFLA_XDP_META_THRESH,
__IFLA_XDP_MAX,
};
--
2.36.1
^ permalink raw reply [flat|nested] 98+ messages in thread
* [xdp-hints] [PATCH RFC bpf-next 11/52] libbpf: factor out __bpf_set_link_xdp_fd_replace() args into a struct
2022-06-28 19:47 [xdp-hints] [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata Alexander Lobakin
` (9 preceding siblings ...)
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 10/52] net, xdp: add ability to specify frame size threshold " Alexander Lobakin
@ 2022-06-28 19:47 ` Alexander Lobakin
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 12/52] libbpf: add ability to set the BTF/type ID on setting XDP prog Alexander Lobakin
` (41 subsequent siblings)
52 siblings, 0 replies; 98+ messages in thread
From: Alexander Lobakin @ 2022-06-28 19:47 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
Cc: Alexander Lobakin, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Toke Hoiland-Jorgensen,
Lorenzo Bianconi, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jesse Brandeburg, John Fastabend, Yajun Deng,
Willem de Bruijn, bpf, netdev, linux-kernel, xdp-hints
Its argument list already consists of 4 entries, and there are more
to be added. Adding new opts is convenient on the caller side, as
they are already being passed around using structs, but in the end
the mentioned function takes all the opts one by one.
Place them into a local struct which satisfies every initial call
site, so handling a new opt now becomes a matter of adding a new
field and a corresponding nlattr_add() (see the sketch below).
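E.g., a future opt would only need something like this (sketch:
both the field and the %IFLA_XDP_NEW_OPT attribute are
hypothetical):

struct __bpf_set_link_xdp_fd_opts {
	/* ... existing members ... */
	__u64 new_opt;
};

/* in __bpf_set_link_xdp_fd_replace() */
if (opts->new_opt) {
	ret = nlattr_add(&req, IFLA_XDP_NEW_OPT, &opts->new_opt,
			 sizeof(opts->new_opt));
	if (ret < 0)
		return ret;
}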
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
---
tools/lib/bpf/netlink.c | 60 ++++++++++++++++++++++++++++-------------
1 file changed, 42 insertions(+), 18 deletions(-)
diff --git a/tools/lib/bpf/netlink.c b/tools/lib/bpf/netlink.c
index cbc8967d5402..3a25178d0d12 100644
--- a/tools/lib/bpf/netlink.c
+++ b/tools/lib/bpf/netlink.c
@@ -230,8 +230,15 @@ static int libbpf_netlink_send_recv(struct libbpf_nla_req *req,
return ret;
}
-static int __bpf_set_link_xdp_fd_replace(int ifindex, int fd, int old_fd,
- __u32 flags)
+struct __bpf_set_link_xdp_fd_opts {
+ int ifindex;
+ int fd;
+ int old_fd;
+ __u32 flags;
+};
+
+static int
+__bpf_set_link_xdp_fd_replace(const struct __bpf_set_link_xdp_fd_opts *opts)
{
struct nlattr *nla;
int ret;
@@ -242,22 +249,23 @@ static int __bpf_set_link_xdp_fd_replace(int ifindex, int fd, int old_fd,
req.nh.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK;
req.nh.nlmsg_type = RTM_SETLINK;
req.ifinfo.ifi_family = AF_UNSPEC;
- req.ifinfo.ifi_index = ifindex;
+ req.ifinfo.ifi_index = opts->ifindex;
nla = nlattr_begin_nested(&req, IFLA_XDP);
if (!nla)
return -EMSGSIZE;
- ret = nlattr_add(&req, IFLA_XDP_FD, &fd, sizeof(fd));
+ ret = nlattr_add(&req, IFLA_XDP_FD, &opts->fd, sizeof(opts->fd));
if (ret < 0)
return ret;
- if (flags) {
- ret = nlattr_add(&req, IFLA_XDP_FLAGS, &flags, sizeof(flags));
+ if (opts->flags) {
+ ret = nlattr_add(&req, IFLA_XDP_FLAGS, &opts->flags,
+ sizeof(opts->flags));
if (ret < 0)
return ret;
}
- if (flags & XDP_FLAGS_REPLACE) {
- ret = nlattr_add(&req, IFLA_XDP_EXPECTED_FD, &old_fd,
- sizeof(old_fd));
+ if (opts->flags & XDP_FLAGS_REPLACE) {
+ ret = nlattr_add(&req, IFLA_XDP_EXPECTED_FD, &opts->old_fd,
+ sizeof(opts->old_fd));
if (ret < 0)
return ret;
}
@@ -268,18 +276,23 @@ static int __bpf_set_link_xdp_fd_replace(int ifindex, int fd, int old_fd,
int bpf_xdp_attach(int ifindex, int prog_fd, __u32 flags, const struct bpf_xdp_attach_opts *opts)
{
- int old_prog_fd, err;
+ struct __bpf_set_link_xdp_fd_opts sl_opts = {
+ .ifindex = ifindex,
+ .flags = flags,
+ .fd = prog_fd,
+ };
+ int err;
if (!OPTS_VALID(opts, bpf_xdp_attach_opts))
return libbpf_err(-EINVAL);
- old_prog_fd = OPTS_GET(opts, old_prog_fd, 0);
- if (old_prog_fd)
+ sl_opts.old_fd = OPTS_GET(opts, old_prog_fd, 0);
+ if (sl_opts.old_fd)
flags |= XDP_FLAGS_REPLACE;
else
- old_prog_fd = -1;
+ sl_opts.old_fd = -1;
- err = __bpf_set_link_xdp_fd_replace(ifindex, prog_fd, old_prog_fd, flags);
+ err = __bpf_set_link_xdp_fd_replace(&sl_opts);
return libbpf_err(err);
}
@@ -291,25 +304,36 @@ int bpf_xdp_detach(int ifindex, __u32 flags, const struct bpf_xdp_attach_opts *o
int bpf_set_link_xdp_fd_opts(int ifindex, int fd, __u32 flags,
const struct bpf_xdp_set_link_opts *opts)
{
- int old_fd = -1, ret;
+ struct __bpf_set_link_xdp_fd_opts sl_opts = {
+ .ifindex = ifindex,
+ .flags = flags,
+ .old_fd = -1,
+ .fd = fd,
+ };
+ int ret;
if (!OPTS_VALID(opts, bpf_xdp_set_link_opts))
return libbpf_err(-EINVAL);
if (OPTS_HAS(opts, old_fd)) {
- old_fd = OPTS_GET(opts, old_fd, -1);
+ sl_opts.old_fd = OPTS_GET(opts, old_fd, -1);
flags |= XDP_FLAGS_REPLACE;
}
- ret = __bpf_set_link_xdp_fd_replace(ifindex, fd, old_fd, flags);
+ ret = __bpf_set_link_xdp_fd_replace(&sl_opts);
return libbpf_err(ret);
}
int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags)
{
+ struct __bpf_set_link_xdp_fd_opts sl_opts = {
+ .ifindex = ifindex,
+ .flags = flags,
+ .fd = fd,
+ };
int ret;
- ret = __bpf_set_link_xdp_fd_replace(ifindex, fd, 0, flags);
+ ret = __bpf_set_link_xdp_fd_replace(&sl_opts);
return libbpf_err(ret);
}
--
2.36.1
^ permalink raw reply [flat|nested] 98+ messages in thread
* [xdp-hints] [PATCH RFC bpf-next 12/52] libbpf: add ability to set the BTF/type ID on setting XDP prog
2022-06-28 19:47 [xdp-hints] [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata Alexander Lobakin
` (10 preceding siblings ...)
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 11/52] libbpf: factor out __bpf_set_link_xdp_fd_replace() args into a struct Alexander Lobakin
@ 2022-06-28 19:47 ` Alexander Lobakin
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 13/52] libbpf: add ability to set the meta threshold " Alexander Lobakin
` (40 subsequent siblings)
52 siblings, 0 replies; 98+ messages in thread
From: Alexander Lobakin @ 2022-06-28 19:47 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
Cc: Alexander Lobakin, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Toke Hoiland-Jorgensen,
Lorenzo Bianconi, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jesse Brandeburg, John Fastabend, Yajun Deng,
Willem de Bruijn, bpf, netdev, linux-kernel, xdp-hints
Covered functions:
* bpf_link_create() - via &bpf_link_create_ops;
* bpf_link_update() - via &bpf_link_update_ops;
* bpf_xdp_attach() - via &bpf_xdp_attach_ops;
* bpf_set_link_xdp_fd_opts() - via &bpf_xdp_set_link_opts;
bpf_link_update() gains the ability to pass arbitrary link
type-specific data to the kernel, not just the old and new FDs.
bpf_get_link_xdp_info()/&xdp_link_info are not covered, as the
kernel stores additional data such as flags and BTF ID in BPF link
mode only.
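Attaching with a BTF ID in BPF link mode then looks roughly like
(usage sketch, error handling omitted):

LIBBPF_OPTS(bpf_link_create_opts, opts,
	    .xdp.btf_id = btf_id);

link_fd = bpf_link_create(prog_fd, ifindex, BPF_XDP, &opts);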
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
---
tools/lib/bpf/bpf.c | 19 +++++++++++++++++++
tools/lib/bpf/bpf.h | 16 +++++++++++++++-
tools/lib/bpf/libbpf.h | 8 ++++++--
tools/lib/bpf/netlink.c | 11 +++++++++++
4 files changed, 51 insertions(+), 3 deletions(-)
diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 240186aac8e6..6036dc75cc7b 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -805,6 +805,11 @@ int bpf_link_create(int prog_fd, int target_fd,
if (!OPTS_ZEROED(opts, tracing))
return libbpf_err(-EINVAL);
break;
+ case BPF_XDP:
+ attr.link_create.xdp.btf_id = OPTS_GET(opts, xdp.btf_id, 0);
+ if (!OPTS_ZEROED(opts, xdp))
+ return libbpf_err(-EINVAL);
+ break;
default:
if (!OPTS_ZEROED(opts, flags))
return libbpf_err(-EINVAL);
@@ -872,6 +877,20 @@ int bpf_link_update(int link_fd, int new_prog_fd,
attr.link_update.flags = OPTS_GET(opts, flags, 0);
attr.link_update.old_prog_fd = OPTS_GET(opts, old_prog_fd, 0);
+ /* As the union in both @attr and @opts is unnamed, just use a pointer
+ * to any of its members to copy the whole rest of the union/opts
+ */
+ if (opts && opts->sz > offsetof(typeof(*opts), xdp)) {
+ __u32 attr_left, opts_left;
+
+ attr_left = sizeof(attr.link_update) -
+ offsetof(typeof(attr.link_update), xdp);
+ opts_left = opts->sz - offsetof(typeof(*opts), xdp);
+
+ memcpy(&attr.link_update.xdp, &opts->xdp,
+ min(attr_left, opts_left));
+ }
+
ret = sys_bpf(BPF_LINK_UPDATE, &attr, sizeof(attr));
return libbpf_err_errno(ret);
}
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index cabc03703e29..4e17995fdaff 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -382,6 +382,10 @@ struct bpf_link_create_opts {
struct {
__u64 cookie;
} tracing;
+ struct {
+ /* target metadata BTF + type ID */
+ __aligned_u64 btf_id;
+ } xdp;
};
size_t :0;
};
@@ -397,8 +401,18 @@ struct bpf_link_update_opts {
size_t sz; /* size of this struct for forward/backward compatibility */
__u32 flags; /* extra flags */
__u32 old_prog_fd; /* expected old program FD */
+ /* must have the same layout as the same union from
+ * bpf_attr::link_update, uses direct memcpy() to there
+ */
+ union {
+ struct {
+ /* new target metadata BTF + type ID */
+ __aligned_u64 new_btf_id;
+ } xdp;
+ };
+ size_t :0;
};
-#define bpf_link_update_opts__last_field old_prog_fd
+#define bpf_link_update_opts__last_field xdp.new_btf_id
LIBBPF_API int bpf_link_update(int link_fd, int new_prog_fd,
const struct bpf_link_update_opts *opts);
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 4056e9038086..4f77128ba770 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -1193,9 +1193,11 @@ struct xdp_link_info {
struct bpf_xdp_set_link_opts {
size_t sz;
int old_fd;
+ __u32 :32;
+ __u64 btf_id;
size_t :0;
};
-#define bpf_xdp_set_link_opts__last_field old_fd
+#define bpf_xdp_set_link_opts__last_field btf_id
LIBBPF_DEPRECATED_SINCE(0, 8, "use bpf_xdp_attach() instead")
LIBBPF_API int bpf_set_link_xdp_fd(int ifindex, int fd, __u32 flags);
@@ -1211,9 +1213,11 @@ LIBBPF_API int bpf_get_link_xdp_info(int ifindex, struct xdp_link_info *info,
struct bpf_xdp_attach_opts {
size_t sz;
int old_prog_fd;
+ __u32 :32;
+ __u64 btf_id;
size_t :0;
};
-#define bpf_xdp_attach_opts__last_field old_prog_fd
+#define bpf_xdp_attach_opts__last_field btf_id
struct bpf_xdp_query_opts {
size_t sz;
diff --git a/tools/lib/bpf/netlink.c b/tools/lib/bpf/netlink.c
index 3a25178d0d12..104a809d5fb2 100644
--- a/tools/lib/bpf/netlink.c
+++ b/tools/lib/bpf/netlink.c
@@ -235,6 +235,7 @@ struct __bpf_set_link_xdp_fd_opts {
int fd;
int old_fd;
__u32 flags;
+ __u64 btf_id;
};
static int
@@ -269,6 +270,12 @@ __bpf_set_link_xdp_fd_replace(const struct __bpf_set_link_xdp_fd_opts *opts)
if (ret < 0)
return ret;
}
+ if (opts->btf_id) {
+ ret = nlattr_add(&req, IFLA_XDP_BTF_ID, &opts->btf_id,
+ sizeof(opts->btf_id));
+ if (ret < 0)
+ return ret;
+ }
nlattr_end_nested(&req, nla);
return libbpf_netlink_send_recv(&req, NULL, NULL, NULL);
@@ -292,6 +299,8 @@ int bpf_xdp_attach(int ifindex, int prog_fd, __u32 flags, const struct bpf_xdp_a
else
sl_opts.old_fd = -1;
+ sl_opts.btf_id = OPTS_GET(opts, btf_id, 0);
+
err = __bpf_set_link_xdp_fd_replace(&sl_opts);
return libbpf_err(err);
}
@@ -320,6 +329,8 @@ int bpf_set_link_xdp_fd_opts(int ifindex, int fd, __u32 flags,
flags |= XDP_FLAGS_REPLACE;
}
+ sl_opts.btf_id = OPTS_GET(opts, btf_id, 0);
+
ret = __bpf_set_link_xdp_fd_replace(&sl_opts);
return libbpf_err(ret);
}
--
2.36.1
^ permalink raw reply [flat|nested] 98+ messages in thread
* [xdp-hints] [PATCH RFC bpf-next 13/52] libbpf: add ability to set the meta threshold on setting XDP prog
2022-06-28 19:47 [xdp-hints] [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata Alexander Lobakin
` (11 preceding siblings ...)
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 12/52] libbpf: add ability to set the BTF/type ID on setting XDP prog Alexander Lobakin
@ 2022-06-28 19:47 ` Alexander Lobakin
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 14/52] libbpf: pass &bpf_link_create_opts directly to bpf_program__attach_fd() Alexander Lobakin
` (39 subsequent siblings)
52 siblings, 0 replies; 98+ messages in thread
From: Alexander Lobakin @ 2022-06-28 19:47 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
Cc: Alexander Lobakin, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Toke Hoiland-Jorgensen,
Lorenzo Bianconi, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jesse Brandeburg, John Fastabend, Yajun Deng,
Willem de Bruijn, bpf, netdev, linux-kernel, xdp-hints
Covered functions:
* bpf_link_create() - via &bpf_link_create_opts;
* bpf_link_update() - via &bpf_link_update_opts;
* bpf_xdp_attach() - via &bpf_xdp_attach_opts;
* bpf_set_link_xdp_fd_opts() - via &bpf_xdp_set_link_opts;
No support for bpf_get_link_xdp_info()/&xdp_link_info, as the kernel
stores the additional data in BPF link mode only.
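A sketch of requesting metadata generation starting from 128-byte
frames at attach time (assumes this series; @ifindex, @prog_fd and
@btf_id are placeholders from the caller's context):

	#include <linux/if_link.h>
	#include <bpf/libbpf.h>

	static int attach_with_meta_thresh(int ifindex, int prog_fd,
					   __u64 btf_id)
	{
		LIBBPF_OPTS(bpf_xdp_attach_opts, opts,
			    .btf_id	 = btf_id,
			    .meta_thresh = 128);

		return bpf_xdp_attach(ifindex, prog_fd,
				      XDP_FLAGS_DRV_MODE, &opts);
	}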
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
---
tools/lib/bpf/bpf.c | 3 +++
tools/lib/bpf/bpf.h | 8 +++++++-
tools/lib/bpf/libbpf.h | 4 ++--
tools/lib/bpf/netlink.c | 10 ++++++++++
4 files changed, 22 insertions(+), 3 deletions(-)
diff --git a/tools/lib/bpf/bpf.c b/tools/lib/bpf/bpf.c
index 6036dc75cc7b..e7c713a418f6 100644
--- a/tools/lib/bpf/bpf.c
+++ b/tools/lib/bpf/bpf.c
@@ -807,6 +807,9 @@ int bpf_link_create(int prog_fd, int target_fd,
break;
case BPF_XDP:
attr.link_create.xdp.btf_id = OPTS_GET(opts, xdp.btf_id, 0);
+ attr.link_create.xdp.meta_thresh = OPTS_GET(opts,
+ xdp.meta_thresh,
+ 0);
if (!OPTS_ZEROED(opts, xdp))
return libbpf_err(-EINVAL);
break;
diff --git a/tools/lib/bpf/bpf.h b/tools/lib/bpf/bpf.h
index 4e17995fdaff..c0f54f24d675 100644
--- a/tools/lib/bpf/bpf.h
+++ b/tools/lib/bpf/bpf.h
@@ -385,6 +385,8 @@ struct bpf_link_create_opts {
struct {
/* target metadata BTF + type ID */
__aligned_u64 btf_id;
+ /* frame size to start composing XDP metadata from */
+ __u32 meta_thresh;
} xdp;
};
size_t :0;
@@ -408,11 +410,15 @@ struct bpf_link_update_opts {
struct {
/* new target metadata BTF + type ID */
__aligned_u64 new_btf_id;
+ /* new frame size to start composing XDP
+ * metadata from
+ */
+ __u32 new_meta_thresh;
} xdp;
};
size_t :0;
};
-#define bpf_link_update_opts__last_field xdp.new_btf_id
+#define bpf_link_update_opts__last_field xdp.new_meta_thresh
LIBBPF_API int bpf_link_update(int link_fd, int new_prog_fd,
const struct bpf_link_update_opts *opts);
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 4f77128ba770..99ac94f148fc 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -1193,7 +1193,7 @@ struct xdp_link_info {
struct bpf_xdp_set_link_opts {
size_t sz;
int old_fd;
- __u32 :32;
+ __u32 meta_thresh;
__u64 btf_id;
size_t :0;
};
@@ -1213,7 +1213,7 @@ LIBBPF_API int bpf_get_link_xdp_info(int ifindex, struct xdp_link_info *info,
struct bpf_xdp_attach_opts {
size_t sz;
int old_prog_fd;
- __u32 :32;
+ __u32 meta_thresh;
__u64 btf_id;
size_t :0;
};
diff --git a/tools/lib/bpf/netlink.c b/tools/lib/bpf/netlink.c
index 104a809d5fb2..ac2a87243ecd 100644
--- a/tools/lib/bpf/netlink.c
+++ b/tools/lib/bpf/netlink.c
@@ -236,6 +236,7 @@ struct __bpf_set_link_xdp_fd_opts {
int old_fd;
__u32 flags;
__u64 btf_id;
+ __u32 meta_thresh;
};
static int
@@ -276,6 +277,13 @@ __bpf_set_link_xdp_fd_replace(const struct __bpf_set_link_xdp_fd_opts *opts)
if (ret < 0)
return ret;
}
+ if (opts->meta_thresh) {
+ ret = nlattr_add(&req, IFLA_XDP_META_THRESH,
+ &opts->meta_thresh,
+ sizeof(opts->meta_thresh));
+ if (ret < 0)
+ return ret;
+ }
nlattr_end_nested(&req, nla);
return libbpf_netlink_send_recv(&req, NULL, NULL, NULL);
@@ -300,6 +308,7 @@ int bpf_xdp_attach(int ifindex, int prog_fd, __u32 flags, const struct bpf_xdp_a
sl_opts.old_fd = -1;
sl_opts.btf_id = OPTS_GET(opts, btf_id, 0);
+ sl_opts.meta_thresh = OPTS_GET(opts, meta_thresh, 0);
err = __bpf_set_link_xdp_fd_replace(&sl_opts);
return libbpf_err(err);
@@ -330,6 +339,7 @@ int bpf_set_link_xdp_fd_opts(int ifindex, int fd, __u32 flags,
}
sl_opts.btf_id = OPTS_GET(opts, btf_id, 0);
+ sl_opts.meta_thresh = OPTS_GET(opts, meta_thresh, 0);
ret = __bpf_set_link_xdp_fd_replace(&sl_opts);
return libbpf_err(ret);
--
2.36.1
^ permalink raw reply [flat|nested] 98+ messages in thread
* [xdp-hints] [PATCH RFC bpf-next 14/52] libbpf: pass &bpf_link_create_opts directly to bpf_program__attach_fd()
2022-06-28 19:47 [xdp-hints] [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata Alexander Lobakin
` (12 preceding siblings ...)
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 13/52] libbpf: add ability to set the meta threshold " Alexander Lobakin
@ 2022-06-28 19:47 ` Alexander Lobakin
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 15/52] libbpf: add bpf_program__attach_xdp_opts() Alexander Lobakin
` (38 subsequent siblings)
52 siblings, 0 replies; 98+ messages in thread
From: Alexander Lobakin @ 2022-06-28 19:47 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
Cc: Alexander Lobakin, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Toke Hoiland-Jorgensen,
Lorenzo Bianconi, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jesse Brandeburg, John Fastabend, Yajun Deng,
Willem de Bruijn, bpf, netdev, linux-kernel, xdp-hints
Instead of providing an optional @btf_id argument, which is zero in
3 of 4 cases, pass a pointer to &bpf_link_create_opts directly. This
way, we can just pass %NULL when no opts are needed and use any of
the opts fields as required otherwise.
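The two resulting call patterns, sketched (libbpf-internal, mirroring
the diff below):

	/* no opts needed, e.g. a plain XDP attach: */
	link = bpf_program__attach_fd(prog, ifindex, NULL, "xdp");

	/* opts carrying a target BTF ID, e.g. freplace: */
	LIBBPF_OPTS(bpf_link_create_opts, opts,
		    .target_btf_id = btf_id);

	link = bpf_program__attach_fd(prog, target_fd, &opts, "freplace");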
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
---
tools/lib/bpf/libbpf.c | 20 ++++++++++++--------
1 file changed, 12 insertions(+), 8 deletions(-)
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index 9bda111c8167..f4014c09f1cf 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -11958,11 +11958,10 @@ static int attach_lsm(const struct bpf_program *prog, long cookie, struct bpf_li
}
static struct bpf_link *
-bpf_program__attach_fd(const struct bpf_program *prog, int target_fd, int btf_id,
+bpf_program__attach_fd(const struct bpf_program *prog, int target_fd,
+ const struct bpf_link_create_opts *opts,
const char *target_name)
{
- DECLARE_LIBBPF_OPTS(bpf_link_create_opts, opts,
- .target_btf_id = btf_id);
enum bpf_attach_type attach_type;
char errmsg[STRERR_BUFSIZE];
struct bpf_link *link;
@@ -11980,7 +11979,7 @@ bpf_program__attach_fd(const struct bpf_program *prog, int target_fd, int btf_id
link->detach = &bpf_link__detach_fd;
attach_type = bpf_program__expected_attach_type(prog);
- link_fd = bpf_link_create(prog_fd, target_fd, attach_type, &opts);
+ link_fd = bpf_link_create(prog_fd, target_fd, attach_type, opts);
if (link_fd < 0) {
link_fd = -errno;
free(link);
@@ -11996,19 +11995,19 @@ bpf_program__attach_fd(const struct bpf_program *prog, int target_fd, int btf_id
struct bpf_link *
bpf_program__attach_cgroup(const struct bpf_program *prog, int cgroup_fd)
{
- return bpf_program__attach_fd(prog, cgroup_fd, 0, "cgroup");
+ return bpf_program__attach_fd(prog, cgroup_fd, NULL, "cgroup");
}
struct bpf_link *
bpf_program__attach_netns(const struct bpf_program *prog, int netns_fd)
{
- return bpf_program__attach_fd(prog, netns_fd, 0, "netns");
+ return bpf_program__attach_fd(prog, netns_fd, NULL, "netns");
}
struct bpf_link *bpf_program__attach_xdp(const struct bpf_program *prog, int ifindex)
{
/* target_fd/target_ifindex use the same field in LINK_CREATE */
- return bpf_program__attach_fd(prog, ifindex, 0, "xdp");
+ return bpf_program__attach_fd(prog, ifindex, NULL, "xdp");
}
struct bpf_link *bpf_program__attach_freplace(const struct bpf_program *prog,
@@ -12030,11 +12029,16 @@ struct bpf_link *bpf_program__attach_freplace(const struct bpf_program *prog,
}
if (target_fd) {
+ LIBBPF_OPTS(bpf_link_create_opts, opts);
+
btf_id = libbpf_find_prog_btf_id(attach_func_name, target_fd);
if (btf_id < 0)
return libbpf_err_ptr(btf_id);
- return bpf_program__attach_fd(prog, target_fd, btf_id, "freplace");
+ opts.target_btf_id = btf_id;
+
+ return bpf_program__attach_fd(prog, target_fd, &opts,
+ "freplace");
} else {
/* no target, so use raw_tracepoint_open for compatibility
* with old kernels
--
2.36.1
^ permalink raw reply [flat|nested] 98+ messages in thread
* [xdp-hints] [PATCH RFC bpf-next 15/52] libbpf: add bpf_program__attach_xdp_opts()
2022-06-28 19:47 [xdp-hints] [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata Alexander Lobakin
` (13 preceding siblings ...)
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 14/52] libbpf: pass &bpf_link_create_opts directly to bpf_program__attach_fd() Alexander Lobakin
@ 2022-06-28 19:47 ` Alexander Lobakin
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 16/52] selftests/bpf: expand xdp_link to check that setting meta opts works Alexander Lobakin
` (37 subsequent siblings)
52 siblings, 0 replies; 98+ messages in thread
From: Alexander Lobakin @ 2022-06-28 19:47 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
Cc: Alexander Lobakin, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Toke Hoiland-Jorgensen,
Lorenzo Bianconi, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jesse Brandeburg, John Fastabend, Yajun Deng,
Willem de Bruijn, bpf, netdev, linux-kernel, xdp-hints
Add a version of bpf_program__attach_xdp() which can take an
optional pointer to &bpf_xdp_attach_opts to carry opts from it to
bpf_link_create(), primarily to be able to specify a BTF/type ID and
a metadata threshold when attaching an XDP program.
This struct was originally designed for bpf_xdp_{at,de}tach(); reuse
it here as well to avoid spawning new entities (with ::old_prog_fd
reused for carrying the XDP flags via a union).
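A usage sketch (assumes this series; `skel`, `ifindex` and `btf_id`
are placeholders from the caller's context):

	LIBBPF_OPTS(bpf_xdp_attach_opts, opts,
		    .meta_thresh = 256,
		    .btf_id	 = btf_id);
	struct bpf_link *link;

	link = bpf_program__attach_xdp_opts(skel->progs.xdp_handler,
					    ifindex, &opts);
	if (!link)
		return -errno;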
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
---
tools/lib/bpf/libbpf.c | 16 ++++++++++++++++
tools/lib/bpf/libbpf.h | 27 ++++++++++++++++++---------
tools/lib/bpf/libbpf.map | 1 +
3 files changed, 35 insertions(+), 9 deletions(-)
diff --git a/tools/lib/bpf/libbpf.c b/tools/lib/bpf/libbpf.c
index f4014c09f1cf..b6cc238a2657 100644
--- a/tools/lib/bpf/libbpf.c
+++ b/tools/lib/bpf/libbpf.c
@@ -12010,6 +12010,22 @@ struct bpf_link *bpf_program__attach_xdp(const struct bpf_program *prog, int ifi
return bpf_program__attach_fd(prog, ifindex, NULL, "xdp");
}
+struct bpf_link *
+bpf_program__attach_xdp_opts(const struct bpf_program *prog, int ifindex,
+ const struct bpf_xdp_attach_opts *opts)
+{
+ LIBBPF_OPTS(bpf_link_create_opts, lc_opts);
+
+ if (!OPTS_VALID(opts, bpf_xdp_attach_opts))
+ return libbpf_err_ptr(-EINVAL);
+
+ lc_opts.flags = OPTS_GET(opts, flags, 0);
+ lc_opts.xdp.btf_id = OPTS_GET(opts, btf_id, 0);
+ lc_opts.xdp.meta_thresh = OPTS_GET(opts, meta_thresh, 0);
+
+ return bpf_program__attach_fd(prog, ifindex, &lc_opts, "xdp");
+}
+
struct bpf_link *bpf_program__attach_freplace(const struct bpf_program *prog,
int target_fd,
const char *attach_func_name)
diff --git a/tools/lib/bpf/libbpf.h b/tools/lib/bpf/libbpf.h
index 99ac94f148fc..d6dd05b5b753 100644
--- a/tools/lib/bpf/libbpf.h
+++ b/tools/lib/bpf/libbpf.h
@@ -678,8 +678,26 @@ LIBBPF_API struct bpf_link *
bpf_program__attach_cgroup(const struct bpf_program *prog, int cgroup_fd);
LIBBPF_API struct bpf_link *
bpf_program__attach_netns(const struct bpf_program *prog, int netns_fd);
+
+struct bpf_xdp_attach_opts {
+ size_t sz;
+ union {
+ int old_prog_fd;
+ /* for bpf_program__attach_xdp_opts() */
+ __u32 flags;
+ };
+ __u32 meta_thresh;
+ __u64 btf_id;
+ size_t :0;
+};
+#define bpf_xdp_attach_opts__last_field btf_id
+
LIBBPF_API struct bpf_link *
bpf_program__attach_xdp(const struct bpf_program *prog, int ifindex);
+LIBBPF_API struct bpf_link *
+bpf_program__attach_xdp_opts(const struct bpf_program *prog, int ifindex,
+ const struct bpf_xdp_attach_opts *opts);
+
LIBBPF_API struct bpf_link *
bpf_program__attach_freplace(const struct bpf_program *prog,
int target_fd, const char *attach_func_name);
@@ -1210,15 +1228,6 @@ LIBBPF_DEPRECATED_SINCE(0, 8, "use bpf_xdp_query() instead")
LIBBPF_API int bpf_get_link_xdp_info(int ifindex, struct xdp_link_info *info,
size_t info_size, __u32 flags);
-struct bpf_xdp_attach_opts {
- size_t sz;
- int old_prog_fd;
- __u32 meta_thresh;
- __u64 btf_id;
- size_t :0;
-};
-#define bpf_xdp_attach_opts__last_field btf_id
-
struct bpf_xdp_query_opts {
size_t sz;
__u32 prog_id; /* output */
diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
index f0987df15b7a..d14bbf82e37c 100644
--- a/tools/lib/bpf/libbpf.map
+++ b/tools/lib/bpf/libbpf.map
@@ -464,6 +464,7 @@ LIBBPF_1.0.0 {
global:
btf__add_enum64;
btf__add_enum64_value;
+ bpf_program__attach_xdp_opts;
libbpf_bpf_attach_type_str;
libbpf_bpf_link_type_str;
libbpf_bpf_map_type_str;
--
2.36.1
^ permalink raw reply [flat|nested] 98+ messages in thread
* [xdp-hints] [PATCH RFC bpf-next 16/52] selftests/bpf: expand xdp_link to check that setting meta opts works
2022-06-28 19:47 [xdp-hints] [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata Alexander Lobakin
` (14 preceding siblings ...)
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 15/52] libbpf: add bpf_program__attach_xdp_opts() Alexander Lobakin
@ 2022-06-28 19:47 ` Alexander Lobakin
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 17/52] samples/bpf: pass a struct to sample_install_xdp() Alexander Lobakin
` (36 subsequent siblings)
52 siblings, 0 replies; 98+ messages in thread
From: Alexander Lobakin @ 2022-06-28 19:47 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
Cc: Alexander Lobakin, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Toke Hoiland-Jorgensen,
Lorenzo Bianconi, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jesse Brandeburg, John Fastabend, Yajun Deng,
Willem de Bruijn, bpf, netdev, linux-kernel, xdp-hints
Add a check to xdp_link to ensure that the values of btf_id and
meta_thresh obtained via bpf_obj_get_info_by_fd() are the same as
those passed via bpf_link_update(), and that the kernel refuses to
set btf_id to 0 while meta_thresh != 0.
Also, use the new bpf_program__attach_xdp_opts() instead of the
non-optsed version and set the initial metadata threshold value
to check whether the kernel is able to process this request.
When the threshold is set via the Netlink interface, it is not
stored anywhere in the kernel core, so there is no test for that
path.
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
---
.../selftests/bpf/prog_tests/xdp_link.c | 30 +++++++++++++++++--
1 file changed, 28 insertions(+), 2 deletions(-)
diff --git a/tools/testing/selftests/bpf/prog_tests/xdp_link.c b/tools/testing/selftests/bpf/prog_tests/xdp_link.c
index 3e9d5c5521f0..0723278c448f 100644
--- a/tools/testing/selftests/bpf/prog_tests/xdp_link.c
+++ b/tools/testing/selftests/bpf/prog_tests/xdp_link.c
@@ -10,6 +10,7 @@ void serial_test_xdp_link(void)
{
struct test_xdp_link *skel1 = NULL, *skel2 = NULL;
__u32 id1, id2, id0 = 0, prog_fd1, prog_fd2;
+ LIBBPF_OPTS(bpf_link_update_opts, lu_opts);
LIBBPF_OPTS(bpf_xdp_attach_opts, opts);
struct bpf_link_info link_info;
struct bpf_prog_info prog_info;
@@ -103,8 +104,16 @@ void serial_test_xdp_link(void)
bpf_link__destroy(skel1->links.xdp_handler);
skel1->links.xdp_handler = NULL;
+ opts.old_prog_fd = 0;
+ opts.meta_thresh = 128;
+
+ err = libbpf_get_type_btf_id("struct xdp_meta_generic", &opts.btf_id);
+ if (!ASSERT_OK(err, "libbpf_get_type_btf_id"))
+ goto cleanup;
+
/* new link attach should succeed */
- link = bpf_program__attach_xdp(skel2->progs.xdp_handler, IFINDEX_LO);
+ link = bpf_program__attach_xdp_opts(skel2->progs.xdp_handler,
+ IFINDEX_LO, &opts);
if (!ASSERT_OK_PTR(link, "link_attach"))
goto cleanup;
skel2->links.xdp_handler = link;
@@ -113,11 +122,25 @@ void serial_test_xdp_link(void)
if (!ASSERT_OK(err, "id2_check_err") || !ASSERT_EQ(id0, id2, "id2_check_val"))
goto cleanup;
+ lu_opts.xdp.new_meta_thresh = 256;
+ lu_opts.xdp.new_btf_id = opts.btf_id;
+
/* updating program under active BPF link works as expected */
- err = bpf_link__update_program(link, skel1->progs.xdp_handler);
+ err = bpf_link_update(bpf_link__fd(link),
+ bpf_program__fd(skel1->progs.xdp_handler),
+ &lu_opts);
if (!ASSERT_OK(err, "link_upd"))
goto cleanup;
+ lu_opts.xdp.new_btf_id = 0;
+
+ /* BTF ID can't be 0 when meta_thresh != 0, and vice versa */
+ err = bpf_link_update(bpf_link__fd(link),
+ bpf_program__fd(skel1->progs.xdp_handler),
+ &lu_opts);
+ if (!ASSERT_ERR(err, "link_upd_fail"))
+ goto cleanup;
+
memset(&link_info, 0, sizeof(link_info));
err = bpf_obj_get_info_by_fd(bpf_link__fd(link), &link_info, &link_info_len);
if (!ASSERT_OK(err, "link_info"))
@@ -126,6 +149,9 @@ void serial_test_xdp_link(void)
ASSERT_EQ(link_info.type, BPF_LINK_TYPE_XDP, "link_type");
ASSERT_EQ(link_info.prog_id, id1, "link_prog_id");
ASSERT_EQ(link_info.xdp.ifindex, IFINDEX_LO, "link_ifindex");
+ ASSERT_EQ(link_info.xdp.btf_id, opts.btf_id, "btf_id");
+ ASSERT_EQ(link_info.xdp.meta_thresh, lu_opts.xdp.new_meta_thresh,
+ "meta_thresh");
/* updating program under active BPF link with different type fails */
err = bpf_link__update_program(link, skel1->progs.tc_handler);
--
2.36.1
^ permalink raw reply [flat|nested] 98+ messages in thread
* [xdp-hints] [PATCH RFC bpf-next 17/52] samples/bpf: pass a struct to sample_install_xdp()
2022-06-28 19:47 [xdp-hints] [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata Alexander Lobakin
` (15 preceding siblings ...)
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 16/52] selftests/bpf: expand xdp_link to check that setting meta opts works Alexander Lobakin
@ 2022-06-28 19:47 ` Alexander Lobakin
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 18/52] samples/bpf: add ability to specify metadata threshold Alexander Lobakin
` (35 subsequent siblings)
52 siblings, 0 replies; 98+ messages in thread
From: Alexander Lobakin @ 2022-06-28 19:47 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
Cc: Alexander Lobakin, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Toke Hoiland-Jorgensen,
Lorenzo Bianconi, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jesse Brandeburg, John Fastabend, Yajun Deng,
Willem de Bruijn, bpf, netdev, linux-kernel, xdp-hints
In order to be able to pass more flags and/or other options to
sample_install_xdp() from userland programs built on top of this
framework, make it consume a const pointer to a structure carrying
all the parameters needed to initialize the sample, instead of a set
of standalone arguments, which doesn't scale.
Adjust all the samples accordingly.
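A sketch of the resulting calling convention (the interface name and
the error handling are illustrative, not from the patch):

	#include <net/if.h>

	struct sample_install_opts opts = {
		.ifindex = if_nametoindex("eth0"),
		.generic = 1,	/* -S: generic (skb) mode */
		.force	 = 0,	/* no -F */
	};

	if (sample_install_xdp(prog, &opts) < 0)
		return EXIT_FAIL_XDP;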
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
---
samples/bpf/xdp_redirect_cpu_user.c | 24 +++++++++++------------
samples/bpf/xdp_redirect_map_multi_user.c | 19 +++++++++---------
samples/bpf/xdp_redirect_map_user.c | 15 +++++++-------
samples/bpf/xdp_redirect_user.c | 15 +++++++-------
samples/bpf/xdp_router_ipv4_user.c | 13 ++++++------
samples/bpf/xdp_sample_user.c | 12 +++++++-----
samples/bpf/xdp_sample_user.h | 10 ++++++++--
7 files changed, 58 insertions(+), 50 deletions(-)
diff --git a/samples/bpf/xdp_redirect_cpu_user.c b/samples/bpf/xdp_redirect_cpu_user.c
index a12381c37d2b..15745d8cb5c2 100644
--- a/samples/bpf/xdp_redirect_cpu_user.c
+++ b/samples/bpf/xdp_redirect_cpu_user.c
@@ -306,6 +306,9 @@ int main(int argc, char **argv)
{
const char *redir_interface = NULL, *redir_map = NULL;
const char *mprog_filename = NULL, *mprog_name = NULL;
+ struct sample_install_opts opts = {
+ .ifindex = -1,
+ };
struct xdp_redirect_cpu *skel;
struct bpf_map_info info = {};
struct bpf_cpumap_val value;
@@ -315,13 +318,10 @@ int main(int argc, char **argv)
bool stress_mode = false;
struct bpf_program *prog;
const char *prog_name;
- bool generic = false;
- bool force = false;
int added_cpus = 0;
bool error = true;
int longindex = 0;
int add_cpu = -1;
- int ifindex = -1;
int *cpu, i, opt;
__u32 qsize;
int n_cpus;
@@ -391,10 +391,10 @@ int main(int argc, char **argv)
usage(argv, long_options, __doc__, mask, true, skel->obj);
goto end_cpu;
}
- ifindex = if_nametoindex(optarg);
- if (!ifindex)
- ifindex = strtoul(optarg, NULL, 0);
- if (!ifindex) {
+ opts.ifindex = if_nametoindex(optarg);
+ if (!opts.ifindex)
+ opts.ifindex = strtoul(optarg, NULL, 0);
+ if (!opts.ifindex) {
fprintf(stderr, "Bad interface index or name (%d): %s\n",
errno, strerror(errno));
usage(argv, long_options, __doc__, mask, true, skel->obj);
@@ -408,7 +408,7 @@ int main(int argc, char **argv)
interval = strtoul(optarg, NULL, 0);
break;
case 'S':
- generic = true;
+ opts.generic = true;
break;
case 'x':
stress_mode = true;
@@ -456,7 +456,7 @@ int main(int argc, char **argv)
qsize = strtoul(optarg, NULL, 0);
break;
case 'F':
- force = true;
+ opts.force = true;
break;
case 'v':
sample_switch_mode();
@@ -470,7 +470,7 @@ int main(int argc, char **argv)
}
ret = EXIT_FAIL_OPTION;
- if (ifindex == -1) {
+ if (opts.ifindex == -1) {
fprintf(stderr, "Required option --dev missing\n");
usage(argv, long_options, __doc__, mask, true, skel->obj);
goto end_cpu;
@@ -483,7 +483,7 @@ int main(int argc, char **argv)
goto end_cpu;
}
- skel->rodata->from_match[0] = ifindex;
+ skel->rodata->from_match[0] = opts.ifindex;
if (redir_interface)
skel->rodata->to_match[0] = if_nametoindex(redir_interface);
@@ -540,7 +540,7 @@ int main(int argc, char **argv)
}
ret = EXIT_FAIL_XDP;
- if (sample_install_xdp(prog, ifindex, generic, force) < 0)
+ if (sample_install_xdp(prog, &opts) < 0)
goto end_cpu;
ret = sample_run(interval, stress_mode ? stress_cpumap : NULL, &value);
diff --git a/samples/bpf/xdp_redirect_map_multi_user.c b/samples/bpf/xdp_redirect_map_multi_user.c
index 9e24f2705b67..85e66f9dc259 100644
--- a/samples/bpf/xdp_redirect_map_multi_user.c
+++ b/samples/bpf/xdp_redirect_map_multi_user.c
@@ -77,6 +77,7 @@ static int update_mac_map(struct bpf_map *map)
int main(int argc, char **argv)
{
struct bpf_devmap_val devmap_val = {};
+ struct sample_install_opts opts = { };
struct xdp_redirect_map_multi *skel;
struct bpf_program *ingress_prog;
bool xdp_devmap_attached = false;
@@ -84,9 +85,6 @@ int main(int argc, char **argv)
int ret = EXIT_FAIL_OPTION;
unsigned long interval = 2;
char ifname[IF_NAMESIZE];
- unsigned int ifindex;
- bool generic = false;
- bool force = false;
bool tried = false;
bool error = true;
int i, opt;
@@ -95,13 +93,13 @@ int main(int argc, char **argv)
long_options, NULL)) != -1) {
switch (opt) {
case 'S':
- generic = true;
+ opts.generic = true;
/* devmap_xmit tracepoint not available */
mask &= ~(SAMPLE_DEVMAP_XMIT_CNT |
SAMPLE_DEVMAP_XMIT_CNT_MULTI);
break;
case 'F':
- force = true;
+ opts.force = true;
break;
case 'X':
xdp_devmap_attached = true;
@@ -186,13 +184,13 @@ int main(int argc, char **argv)
forward_map = skel->maps.forward_map_native;
for (i = 0; ifaces[i] > 0; i++) {
- ifindex = ifaces[i];
+ opts.ifindex = ifaces[i];
ret = EXIT_FAIL_XDP;
restart:
/* bind prog_fd to each interface */
- if (sample_install_xdp(ingress_prog, ifindex, generic, force) < 0) {
- if (generic && !tried) {
+ if (sample_install_xdp(ingress_prog, &opts) < 0) {
+ if (opts.generic && !tried) {
fprintf(stderr,
"Trying fallback to sizeof(int) as value_size for devmap in generic mode\n");
ingress_prog = skel->progs.xdp_redirect_map_general;
@@ -206,10 +204,11 @@ int main(int argc, char **argv)
/* Add all the interfaces to forward group and attach
* egress devmap program if exist
*/
- devmap_val.ifindex = ifindex;
+ devmap_val.ifindex = opts.ifindex;
if (xdp_devmap_attached)
devmap_val.bpf_prog.fd = bpf_program__fd(skel->progs.xdp_devmap_prog);
- ret = bpf_map_update_elem(bpf_map__fd(forward_map), &ifindex, &devmap_val, 0);
+ ret = bpf_map_update_elem(bpf_map__fd(forward_map),
+ &opts.ifindex, &devmap_val, 0);
if (ret < 0) {
fprintf(stderr, "Failed to update devmap value: %s\n",
strerror(errno));
diff --git a/samples/bpf/xdp_redirect_map_user.c b/samples/bpf/xdp_redirect_map_user.c
index b6e4fc849577..d09ef866e62b 100644
--- a/samples/bpf/xdp_redirect_map_user.c
+++ b/samples/bpf/xdp_redirect_map_user.c
@@ -43,6 +43,7 @@ static const struct option long_options[] = {
int main(int argc, char **argv)
{
struct bpf_devmap_val devmap_val = {};
+ struct sample_install_opts opts = { };
bool xdp_devmap_attached = false;
struct xdp_redirect_map *skel;
char str[2 * IF_NAMESIZE + 1];
@@ -53,8 +54,6 @@ int main(int argc, char **argv)
unsigned long interval = 2;
int ret = EXIT_FAIL_OPTION;
struct bpf_program *prog;
- bool generic = false;
- bool force = false;
bool tried = false;
bool error = true;
int opt, key = 0;
@@ -63,13 +62,13 @@ int main(int argc, char **argv)
long_options, NULL)) != -1) {
switch (opt) {
case 'S':
- generic = true;
+ opts.generic = true;
/* devmap_xmit tracepoint not available */
mask &= ~(SAMPLE_DEVMAP_XMIT_CNT |
SAMPLE_DEVMAP_XMIT_CNT_MULTI);
break;
case 'F':
- force = true;
+ opts.force = true;
break;
case 'X':
xdp_devmap_attached = true;
@@ -157,13 +156,14 @@ int main(int argc, char **argv)
prog = skel->progs.xdp_redirect_map_native;
tx_port_map = skel->maps.tx_port_native;
restart:
- if (sample_install_xdp(prog, ifindex_in, generic, force) < 0) {
+ opts.ifindex = ifindex_in;
+ if (sample_install_xdp(prog, &opts) < 0) {
/* First try with struct bpf_devmap_val as value for generic
* mode, then fallback to sizeof(int) for older kernels.
*/
fprintf(stderr,
"Trying fallback to sizeof(int) as value_size for devmap in generic mode\n");
- if (generic && !tried) {
+ if (opts.generic && !tried) {
prog = skel->progs.xdp_redirect_map_general;
tx_port_map = skel->maps.tx_port_general;
tried = true;
@@ -174,7 +174,8 @@ int main(int argc, char **argv)
}
/* Loading dummy XDP prog on out-device */
- sample_install_xdp(skel->progs.xdp_redirect_dummy_prog, ifindex_out, generic, force);
+ opts.ifindex = ifindex_out;
+ sample_install_xdp(skel->progs.xdp_redirect_dummy_prog, &opts);
devmap_val.ifindex = ifindex_out;
if (xdp_devmap_attached)
diff --git a/samples/bpf/xdp_redirect_user.c b/samples/bpf/xdp_redirect_user.c
index 8663dd631b6e..2da686a9b8a0 100644
--- a/samples/bpf/xdp_redirect_user.c
+++ b/samples/bpf/xdp_redirect_user.c
@@ -41,6 +41,7 @@ static const struct option long_options[] = {
int main(int argc, char **argv)
{
+ struct sample_install_opts opts = { };
int ifindex_in, ifindex_out, opt;
char str[2 * IF_NAMESIZE + 1];
char ifname_out[IF_NAMESIZE];
@@ -48,20 +49,18 @@ int main(int argc, char **argv)
int ret = EXIT_FAIL_OPTION;
unsigned long interval = 2;
struct xdp_redirect *skel;
- bool generic = false;
- bool force = false;
bool error = true;
while ((opt = getopt_long(argc, argv, "hSFi:vs",
long_options, NULL)) != -1) {
switch (opt) {
case 'S':
- generic = true;
+ opts.generic = true;
mask &= ~(SAMPLE_DEVMAP_XMIT_CNT |
SAMPLE_DEVMAP_XMIT_CNT_MULTI);
break;
case 'F':
- force = true;
+ opts.force = true;
break;
case 'i':
interval = strtoul(optarg, NULL, 0);
@@ -132,13 +131,13 @@ int main(int argc, char **argv)
}
ret = EXIT_FAIL_XDP;
- if (sample_install_xdp(skel->progs.xdp_redirect_prog, ifindex_in,
- generic, force) < 0)
+ opts.ifindex = ifindex_in;
+ if (sample_install_xdp(skel->progs.xdp_redirect_prog, &opts) < 0)
goto end_destroy;
/* Loading dummy XDP prog on out-device */
- sample_install_xdp(skel->progs.xdp_redirect_dummy_prog, ifindex_out,
- generic, force);
+ opts.ifindex = ifindex_out;
+ sample_install_xdp(skel->progs.xdp_redirect_dummy_prog, &opts);
ret = EXIT_FAIL;
if (!if_indextoname(ifindex_in, ifname_in)) {
diff --git a/samples/bpf/xdp_router_ipv4_user.c b/samples/bpf/xdp_router_ipv4_user.c
index 294fc15ad1cb..48e9bcb38c8e 100644
--- a/samples/bpf/xdp_router_ipv4_user.c
+++ b/samples/bpf/xdp_router_ipv4_user.c
@@ -549,13 +549,14 @@ static void usage(char *argv[], const struct option *long_options,
int main(int argc, char **argv)
{
- bool error = true, generic = false, force = false;
+ struct sample_install_opts opts = { };
int opt, ret = EXIT_FAIL_BPF;
struct xdp_router_ipv4 *skel;
int i, total_ifindex = argc - 1;
char **ifname_list = argv + 1;
pthread_t routes_thread;
int longindex = 0;
+ bool error = true;
if (libbpf_set_strict_mode(LIBBPF_STRICT_ALL) < 0) {
fprintf(stderr, "Failed to set libbpf strict mode: %s\n",
@@ -606,12 +607,12 @@ int main(int argc, char **argv)
ifname_list += 2;
break;
case 'S':
- generic = true;
+ opts.generic = true;
total_ifindex--;
ifname_list++;
break;
case 'F':
- force = true;
+ opts.force = true;
total_ifindex--;
ifname_list++;
break;
@@ -661,15 +662,15 @@ int main(int argc, char **argv)
ret = EXIT_FAIL_XDP;
for (i = 0; i < total_ifindex; i++) {
- int index = if_nametoindex(ifname_list[i]);
+ opts.ifindex = if_nametoindex(ifname_list[i]);
- if (!index) {
+ if (!opts.ifindex) {
fprintf(stderr, "Interface %s not found %s\n",
ifname_list[i], strerror(-tx_port_map_fd));
goto end_destroy;
}
if (sample_install_xdp(skel->progs.xdp_router_ipv4_prog,
- index, generic, force) < 0)
+ &opts) < 0)
goto end_destroy;
}
diff --git a/samples/bpf/xdp_sample_user.c b/samples/bpf/xdp_sample_user.c
index 158682852162..8bc23b4c5f19 100644
--- a/samples/bpf/xdp_sample_user.c
+++ b/samples/bpf/xdp_sample_user.c
@@ -1280,9 +1280,10 @@ static int __sample_remove_xdp(int ifindex, __u32 prog_id, int xdp_flags)
return bpf_xdp_detach(ifindex, xdp_flags, NULL);
}
-int sample_install_xdp(struct bpf_program *xdp_prog, int ifindex, bool generic,
- bool force)
+int sample_install_xdp(struct bpf_program *xdp_prog,
+ const struct sample_install_opts *opts)
{
+ __u32 ifindex = opts->ifindex;
int ret, xdp_flags = 0;
__u32 prog_id = 0;
@@ -1292,8 +1293,8 @@ int sample_install_xdp(struct bpf_program *xdp_prog, int ifindex, bool generic,
return -ENOTSUP;
}
- xdp_flags |= !force ? XDP_FLAGS_UPDATE_IF_NOEXIST : 0;
- xdp_flags |= generic ? XDP_FLAGS_SKB_MODE : XDP_FLAGS_DRV_MODE;
+ xdp_flags |= !opts->force ? XDP_FLAGS_UPDATE_IF_NOEXIST : 0;
+ xdp_flags |= opts->generic ? XDP_FLAGS_SKB_MODE : XDP_FLAGS_DRV_MODE;
ret = bpf_xdp_attach(ifindex, bpf_program__fd(xdp_prog), xdp_flags, NULL);
if (ret < 0) {
ret = -errno;
@@ -1301,7 +1302,8 @@ int sample_install_xdp(struct bpf_program *xdp_prog, int ifindex, bool generic,
"Failed to install program \"%s\" on ifindex %d, mode = %s, "
"force = %s: %s\n",
bpf_program__name(xdp_prog), ifindex,
- generic ? "skb" : "native", force ? "true" : "false",
+ opts->generic ? "skb" : "native",
+ opts->force ? "true" : "false",
strerror(-ret));
return ret;
}
diff --git a/samples/bpf/xdp_sample_user.h b/samples/bpf/xdp_sample_user.h
index f45051679977..22afe844ae30 100644
--- a/samples/bpf/xdp_sample_user.h
+++ b/samples/bpf/xdp_sample_user.h
@@ -30,14 +30,20 @@ enum stats_mask {
#define EXIT_FAIL_BPF 4
#define EXIT_FAIL_MEM 5
+struct sample_install_opts {
+ int ifindex;
+ __u32 force:1;
+ __u32 generic:1;
+};
+
int sample_setup_maps(struct bpf_map **maps);
int __sample_init(int mask);
void sample_exit(int status);
int sample_run(int interval, void (*post_cb)(void *), void *ctx);
void sample_switch_mode(void);
-int sample_install_xdp(struct bpf_program *xdp_prog, int ifindex, bool generic,
- bool force);
+int sample_install_xdp(struct bpf_program *xdp_prog,
+ const struct sample_install_opts *opts);
void sample_usage(char *argv[], const struct option *long_options,
const char *doc, int mask, bool error);
--
2.36.1
^ permalink raw reply [flat|nested] 98+ messages in thread
* [xdp-hints] [PATCH RFC bpf-next 18/52] samples/bpf: add ability to specify metadata threshold
2022-06-28 19:47 [xdp-hints] [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata Alexander Lobakin
` (16 preceding siblings ...)
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 17/52] samples/bpf: pass a struct to sample_install_xdp() Alexander Lobakin
@ 2022-06-28 19:47 ` Alexander Lobakin
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 19/52] stddef: make __struct_group() UAPI C++-friendly Alexander Lobakin
` (34 subsequent siblings)
52 siblings, 0 replies; 98+ messages in thread
From: Alexander Lobakin @ 2022-06-28 19:47 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
Cc: Alexander Lobakin, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Toke Hoiland-Jorgensen,
Lorenzo Bianconi, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jesse Brandeburg, John Fastabend, Yajun Deng,
Willem de Bruijn, bpf, netdev, linux-kernel, xdp-hints
For all users of the sample_install_xdp() infra (primarily
xdp_redirect_cpu), add the ability to enable/disable/control generic
metadata generation using the new UAPI.
The format is either just '-M' to enable it unconditionally or
'--meta-thresh=<frame_size>' to enable it starting from frames
bigger than <frame_size>.
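For example (hypothetical invocations, other required options
omitted):

	# enable metadata composing unconditionally (threshold of 1 byte)
	./xdp_redirect_cpu --dev eth0 ... -M

	# only for frames bigger than 128 bytes
	./xdp_redirect_cpu --dev eth0 ... --meta-thresh=128

Note that, due to optional_argument, the value must be glued with
'=': a separate '--meta-thresh 128' leaves optarg empty and falls
back to the threshold of 1.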
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
---
samples/bpf/xdp_redirect_cpu_user.c | 7 +++++-
samples/bpf/xdp_redirect_map_multi_user.c | 7 +++++-
samples/bpf/xdp_redirect_map_user.c | 7 +++++-
samples/bpf/xdp_redirect_user.c | 7 +++++-
samples/bpf/xdp_router_ipv4_user.c | 7 +++++-
samples/bpf/xdp_sample_user.c | 28 +++++++++++++++++++----
samples/bpf/xdp_sample_user.h | 1 +
7 files changed, 54 insertions(+), 10 deletions(-)
diff --git a/samples/bpf/xdp_redirect_cpu_user.c b/samples/bpf/xdp_redirect_cpu_user.c
index 15745d8cb5c2..ca457c34eb0f 100644
--- a/samples/bpf/xdp_redirect_cpu_user.c
+++ b/samples/bpf/xdp_redirect_cpu_user.c
@@ -60,6 +60,7 @@ static const struct option long_options[] = {
{ "mprog-filename", required_argument, NULL, 'f' },
{ "redirect-device", required_argument, NULL, 'r' },
{ "redirect-map", required_argument, NULL, 'm' },
+ { "meta-thresh", optional_argument, NULL, 'M' },
{}
};
@@ -382,7 +383,7 @@ int main(int argc, char **argv)
}
prog = skel->progs.xdp_prognum5_lb_hash_ip_pairs;
- while ((opt = getopt_long(argc, argv, "d:si:Sxp:f:e:r:m:c:q:Fvh",
+ while ((opt = getopt_long(argc, argv, "d:si:Sxp:f:e:r:m:c:q:FMvh",
long_options, &longindex)) != -1) {
switch (opt) {
case 'd':
@@ -461,6 +462,10 @@ int main(int argc, char **argv)
case 'v':
sample_switch_mode();
break;
+ case 'M':
+ opts.meta_thresh = optarg ? strtoul(optarg, NULL, 0) :
+ 1;
+ break;
case 'h':
error = false;
default:
diff --git a/samples/bpf/xdp_redirect_map_multi_user.c b/samples/bpf/xdp_redirect_map_multi_user.c
index 85e66f9dc259..b1c575f3d5f6 100644
--- a/samples/bpf/xdp_redirect_map_multi_user.c
+++ b/samples/bpf/xdp_redirect_map_multi_user.c
@@ -43,6 +43,7 @@ static const struct option long_options[] = {
{ "stats", no_argument, NULL, 's' },
{ "interval", required_argument, NULL, 'i' },
{ "verbose", no_argument, NULL, 'v' },
+ { "meta-thresh", optional_argument, NULL, 'M' },
{}
};
@@ -89,7 +90,7 @@ int main(int argc, char **argv)
bool error = true;
int i, opt;
- while ((opt = getopt_long(argc, argv, "hSFXi:vs",
+ while ((opt = getopt_long(argc, argv, "hSFMXi:vs",
long_options, NULL)) != -1) {
switch (opt) {
case 'S':
@@ -98,6 +99,10 @@ int main(int argc, char **argv)
mask &= ~(SAMPLE_DEVMAP_XMIT_CNT |
SAMPLE_DEVMAP_XMIT_CNT_MULTI);
break;
+ case 'M':
+ opts.meta_thresh = optarg ? strtoul(optarg, NULL, 0) :
+ 1;
+ break;
case 'F':
opts.force = true;
break;
diff --git a/samples/bpf/xdp_redirect_map_user.c b/samples/bpf/xdp_redirect_map_user.c
index d09ef866e62b..29dd7df804dc 100644
--- a/samples/bpf/xdp_redirect_map_user.c
+++ b/samples/bpf/xdp_redirect_map_user.c
@@ -37,6 +37,7 @@ static const struct option long_options[] = {
{ "stats", no_argument, NULL, 's' },
{ "interval", required_argument, NULL, 'i' },
{ "verbose", no_argument, NULL, 'v' },
+ { "meta-thresh", optional_argument, NULL, 'M' },
{}
};
@@ -58,7 +59,7 @@ int main(int argc, char **argv)
bool error = true;
int opt, key = 0;
- while ((opt = getopt_long(argc, argv, "hSFXi:vs",
+ while ((opt = getopt_long(argc, argv, "hSFMXi:vs",
long_options, NULL)) != -1) {
switch (opt) {
case 'S':
@@ -67,6 +68,10 @@ int main(int argc, char **argv)
mask &= ~(SAMPLE_DEVMAP_XMIT_CNT |
SAMPLE_DEVMAP_XMIT_CNT_MULTI);
break;
+ case 'M':
+ opts.meta_thresh = optarg ? strtoul(optarg, NULL, 0) :
+ 1;
+ break;
case 'F':
opts.force = true;
break;
diff --git a/samples/bpf/xdp_redirect_user.c b/samples/bpf/xdp_redirect_user.c
index 2da686a9b8a0..f37c570877ca 100644
--- a/samples/bpf/xdp_redirect_user.c
+++ b/samples/bpf/xdp_redirect_user.c
@@ -36,6 +36,7 @@ static const struct option long_options[] = {
{"stats", no_argument, NULL, 's' },
{"interval", required_argument, NULL, 'i' },
{"verbose", no_argument, NULL, 'v' },
+ {"meta-thresh", optional_argument, NULL, 'M' },
{}
};
@@ -51,7 +52,7 @@ int main(int argc, char **argv)
struct xdp_redirect *skel;
bool error = true;
- while ((opt = getopt_long(argc, argv, "hSFi:vs",
+ while ((opt = getopt_long(argc, argv, "hSFMi:vs",
long_options, NULL)) != -1) {
switch (opt) {
case 'S':
@@ -59,6 +60,10 @@ int main(int argc, char **argv)
mask &= ~(SAMPLE_DEVMAP_XMIT_CNT |
SAMPLE_DEVMAP_XMIT_CNT_MULTI);
break;
+ case 'M':
+ opts.meta_thresh = optarg ? strtoul(optarg, NULL, 0) :
+ 1;
+ break;
case 'F':
opts.force = true;
break;
diff --git a/samples/bpf/xdp_router_ipv4_user.c b/samples/bpf/xdp_router_ipv4_user.c
index 48e9bcb38c8e..5ff12688a31b 100644
--- a/samples/bpf/xdp_router_ipv4_user.c
+++ b/samples/bpf/xdp_router_ipv4_user.c
@@ -53,6 +53,7 @@ static const struct option long_options[] = {
{ "interval", required_argument, NULL, 'i' },
{ "verbose", no_argument, NULL, 'v' },
{ "stats", no_argument, NULL, 's' },
+ { "meta-thresh", optional_argument, NULL, 'M' },
{}
};
@@ -593,7 +594,7 @@ int main(int argc, char **argv)
goto end_destroy;
}
- while ((opt = getopt_long(argc, argv, "si:SFvh",
+ while ((opt = getopt_long(argc, argv, "si:SFMvh",
long_options, &longindex)) != -1) {
switch (opt) {
case 's':
@@ -621,6 +622,10 @@ int main(int argc, char **argv)
total_ifindex--;
ifname_list++;
break;
+ case 'M':
+ opts.meta_thresh = optarg ? strtoul(optarg, NULL, 0) :
+ 1;
+ break;
case 'h':
error = false;
default:
diff --git a/samples/bpf/xdp_sample_user.c b/samples/bpf/xdp_sample_user.c
index 8bc23b4c5f19..354352541c5e 100644
--- a/samples/bpf/xdp_sample_user.c
+++ b/samples/bpf/xdp_sample_user.c
@@ -1283,6 +1283,8 @@ static int __sample_remove_xdp(int ifindex, __u32 prog_id, int xdp_flags)
int sample_install_xdp(struct bpf_program *xdp_prog,
const struct sample_install_opts *opts)
{
+ LIBBPF_OPTS(bpf_xdp_attach_opts, attach_opts,
+ .meta_thresh = opts->meta_thresh);
__u32 ifindex = opts->ifindex;
int ret, xdp_flags = 0;
__u32 prog_id = 0;
@@ -1293,18 +1295,34 @@ int sample_install_xdp(struct bpf_program *xdp_prog,
return -ENOTSUP;
}
+ if (attach_opts.meta_thresh) {
+ ret = libbpf_get_type_btf_id("struct xdp_meta_generic",
+ &attach_opts.btf_id);
+ if (ret) {
+ fprintf(stderr, "Failed to retrieve BTF ID: %s\n",
+ strerror(-ret));
+ return ret;
+ }
+ }
+
xdp_flags |= !opts->force ? XDP_FLAGS_UPDATE_IF_NOEXIST : 0;
xdp_flags |= opts->generic ? XDP_FLAGS_SKB_MODE : XDP_FLAGS_DRV_MODE;
- ret = bpf_xdp_attach(ifindex, bpf_program__fd(xdp_prog), xdp_flags, NULL);
+ ret = bpf_xdp_attach(ifindex, bpf_program__fd(xdp_prog), xdp_flags,
+ &attach_opts);
if (ret < 0) {
ret = -errno;
fprintf(stderr,
- "Failed to install program \"%s\" on ifindex %d, mode = %s, "
- "force = %s: %s\n",
+ "Failed to install program \"%s\" on ifindex %d, mode = %s, force = %s, metadata = ",
bpf_program__name(xdp_prog), ifindex,
opts->generic ? "skb" : "native",
- opts->force ? "true" : "false",
- strerror(-ret));
+ opts->force ? "true" : "false");
+ if (attach_opts.meta_thresh)
+ fprintf(stderr,
+ "true (from %u bytes, BTF ID is 0x%16llx)",
+ attach_opts.meta_thresh, attach_opts.btf_id);
+ else
+ fprintf(stderr, "false");
+ fprintf(stderr, ": %s\n", strerror(-ret));
return ret;
}
diff --git a/samples/bpf/xdp_sample_user.h b/samples/bpf/xdp_sample_user.h
index 22afe844ae30..207953406ee1 100644
--- a/samples/bpf/xdp_sample_user.h
+++ b/samples/bpf/xdp_sample_user.h
@@ -34,6 +34,7 @@ struct sample_install_opts {
int ifindex;
__u32 force:1;
__u32 generic:1;
+ __u32 meta_thresh;
};
int sample_setup_maps(struct bpf_map **maps);
--
2.36.1
^ permalink raw reply [flat|nested] 98+ messages in thread
* [xdp-hints] [PATCH RFC bpf-next 19/52] stddef: make __struct_group() UAPI C++-friendly
2022-06-28 19:47 [xdp-hints] [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata Alexander Lobakin
` (17 preceding siblings ...)
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 18/52] samples/bpf: add ability to specify metadata threshold Alexander Lobakin
@ 2022-06-28 19:47 ` Alexander Lobakin
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 20/52] net, xdp: move XDP metadata helpers into new xdp_meta.h Alexander Lobakin
` (33 subsequent siblings)
52 siblings, 0 replies; 98+ messages in thread
From: Alexander Lobakin @ 2022-06-28 19:47 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
Cc: Alexander Lobakin, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Toke Hoiland-Jorgensen,
Lorenzo Bianconi, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jesse Brandeburg, John Fastabend, Yajun Deng,
Willem de Bruijn, bpf, netdev, linux-kernel, xdp-hints
For most of C++'s history, it hasn't allowed type declarations
inside anonymous unions, for various reasons. At the same time,
__struct_group() relies on exactly that, so when the @TAG argument
is not empty, C++ code fails to build:
In file included from test_cpp.cpp:5:
In file included from tools/testing/selftests/bpf/tools/include/bpf/libbpf.h:18:
tools/include/uapi/linux/bpf.h:6774:17: error: types cannot be declared in an anonymous union
__struct_group(xdp_meta_generic_rx, rx_full, /* no attrs */,
^
The safest way to fix this, without trying to switch standards
(which is impossible in UAPI anyway), is to disable the tag
declaration for that language. This won't break anything, since
such code currently isn't buildable at all.
Use a separate definition of __struct_group() when __cplusplus is
defined to mitigate the error.
Also, mirror stddef.h into tools/ so that kernel-shipped userspace
code uses the fixed definition instead of whatever happens to be
present on the system.
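To illustrate the difference (a hypothetical user of the macro,
mirroring the bpf.h usage quoted above):

	struct some_meta {
		__struct_group(some_meta_rx, rx, /* no attrs */,
			__u32 hash;
			__u32 csum;
		);
	};

	/* C: expands to an anonymous struct plus `struct some_meta_rx rx;`,
	 * so `struct some_meta_rx` can be reused as a standalone layer.
	 * C++: expands to two identically-laid-out structs with only the
	 * named member `rx`; no tag is declared, so it now compiles.
	 */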
Fixes: 50d7bd38c3aa ("stddef: Introduce struct_group() helper macro")
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
---
include/uapi/linux/stddef.h | 12 ++++++--
tools/include/uapi/linux/stddef.h | 50 +++++++++++++++++++++++++++++++
2 files changed, 60 insertions(+), 2 deletions(-)
create mode 100644 tools/include/uapi/linux/stddef.h
diff --git a/include/uapi/linux/stddef.h b/include/uapi/linux/stddef.h
index 7837ba4fe728..67ee9c8aba56 100644
--- a/include/uapi/linux/stddef.h
+++ b/include/uapi/linux/stddef.h
@@ -20,14 +20,22 @@
* and size: one anonymous and one named. The former's members can be used
* normally without sub-struct naming, and the latter can be used to
* reason about the start, end, and size of the group of struct members.
- * The named struct can also be explicitly tagged for layer reuse, as well
- * as both having struct attributes appended.
+ * The named struct can also be explicitly tagged for layer reuse (C only),
+ * as well as both having struct attributes appended.
*/
+#ifndef __cplusplus
#define __struct_group(TAG, NAME, ATTRS, MEMBERS...) \
union { \
struct { MEMBERS } ATTRS; \
struct TAG { MEMBERS } ATTRS NAME; \
}
+#else
+#define __struct_group(__IGNORED, NAME, ATTRS, MEMBERS...) \
+ union { \
+ struct { MEMBERS } ATTRS; \
+ struct { MEMBERS } ATTRS NAME; \
+ }
+#endif
/**
* __DECLARE_FLEX_ARRAY() - Declare a flexible array usable in a union
diff --git a/tools/include/uapi/linux/stddef.h b/tools/include/uapi/linux/stddef.h
new file mode 100644
index 000000000000..40d1c4b21003
--- /dev/null
+++ b/tools/include/uapi/linux/stddef.h
@@ -0,0 +1,50 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+
+#ifndef __always_inline
+#define __always_inline inline
+#endif
+
+/**
+ * __struct_group() - Create a mirrored named and anonymous struct
+ *
+ * @TAG: The tag name for the named sub-struct (usually empty)
+ * @NAME: The identifier name of the mirrored sub-struct
+ * @ATTRS: Any struct attributes (usually empty)
+ * @MEMBERS: The member declarations for the mirrored structs
+ *
+ * Used to create an anonymous union of two structs with identical layout
+ * and size: one anonymous and one named. The former's members can be used
+ * normally without sub-struct naming, and the latter can be used to
+ * reason about the start, end, and size of the group of struct members.
+ * The named struct can also be explicitly tagged for layer reuse (C only),
+ * as well as both having struct attributes appended.
+ */
+#ifndef __cplusplus
+#define __struct_group(TAG, NAME, ATTRS, MEMBERS...) \
+ union { \
+ struct { MEMBERS } ATTRS; \
+ struct TAG { MEMBERS } ATTRS NAME; \
+ }
+#else
+#define __struct_group(__IGNORED, NAME, ATTRS, MEMBERS...) \
+ union { \
+ struct { MEMBERS } ATTRS; \
+ struct { MEMBERS } ATTRS NAME; \
+ }
+#endif
+
+/**
+ * __DECLARE_FLEX_ARRAY() - Declare a flexible array usable in a union
+ *
+ * @TYPE: The type of each flexible array element
+ * @NAME: The name of the flexible array member
+ *
+ * In order to have a flexible array member in a union or alone in a
+ * struct, it needs to be wrapped in an anonymous struct with at least 1
+ * named member, but that member can be empty.
+ */
+#define __DECLARE_FLEX_ARRAY(TYPE, NAME) \
+ struct { \
+ struct { } __empty_ ## NAME; \
+ TYPE NAME[]; \
+ }
--
2.36.1
^ permalink raw reply [flat|nested] 98+ messages in thread
* [xdp-hints] [PATCH RFC bpf-next 20/52] net, xdp: move XDP metadata helpers into new xdp_meta.h
2022-06-28 19:47 [xdp-hints] [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata Alexander Lobakin
` (18 preceding siblings ...)
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 19/52] stddef: make __struct_group() UAPI C++-friendly Alexander Lobakin
@ 2022-06-28 19:47 ` Alexander Lobakin
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 21/52] net, xdp: allow metadata > 32 Alexander Lobakin
` (32 subsequent siblings)
52 siblings, 0 replies; 98+ messages in thread
From: Alexander Lobakin @ 2022-06-28 19:47 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
Cc: Alexander Lobakin, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Toke Hoiland-Jorgensen,
Lorenzo Bianconi, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jesse Brandeburg, John Fastabend, Yajun Deng,
Willem de Bruijn, bpf, netdev, linux-kernel, xdp-hints
<net/xdp.h> gets included indirectly into tons of different files
across the kernel. To avoid making them depend on the header files
needed for the XDP metadata definitions, which will be used only
by several driver and XDP core files, and to keep the metadata code
logically separated, create a new header file, <net/xdp_meta.h>,
and move the several already existing metadata helpers to it.
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
---
MAINTAINERS | 1 +
.../ethernet/mellanox/mlx5/core/en/xsk/rx.c | 1 +
drivers/net/ethernet/netronome/nfp/nfd3/xsk.c | 1 +
drivers/net/tun.c | 2 +-
include/net/xdp.h | 20 -------------
include/net/xdp_meta.h | 29 +++++++++++++++++++
net/bpf/core.c | 2 +-
net/bpf/prog_ops.c | 1 +
net/bpf/test_run.c | 2 +-
net/xdp/xsk.c | 2 +-
10 files changed, 37 insertions(+), 24 deletions(-)
create mode 100644 include/net/xdp_meta.h
diff --git a/MAINTAINERS b/MAINTAINERS
index 91190e12a157..24a640c8a306 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -21722,6 +21722,7 @@ L: netdev@vger.kernel.org
L: bpf@vger.kernel.org
S: Supported
F: include/net/xdp.h
+F: include/net/xdp_meta.h
F: include/net/xdp_priv.h
F: include/trace/events/xdp.h
F: kernel/bpf/cpumap.c
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
index 9a1553598a7c..c1fc5c79d90f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
@@ -3,6 +3,7 @@
#include "rx.h"
#include "en/xdp.h"
+#include <net/xdp_meta.h>
#include <net/xdp_sock_drv.h>
#include <linux/filter.h>
diff --git a/drivers/net/ethernet/netronome/nfp/nfd3/xsk.c b/drivers/net/ethernet/netronome/nfp/nfd3/xsk.c
index 454fea4c8be2..0957e866799b 100644
--- a/drivers/net/ethernet/netronome/nfp/nfd3/xsk.c
+++ b/drivers/net/ethernet/netronome/nfp/nfd3/xsk.c
@@ -4,6 +4,7 @@
#include <linux/bpf_trace.h>
#include <linux/netdevice.h>
+#include <net/xdp_meta.h>
#include "../nfp_app.h"
#include "../nfp_net.h"
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 87a635aac008..0eb0cc6966e4 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -61,7 +61,7 @@
#include <net/netns/generic.h>
#include <net/rtnetlink.h>
#include <net/sock.h>
-#include <net/xdp.h>
+#include <net/xdp_meta.h>
#include <net/ip_tunnels.h>
#include <linux/seq_file.h>
#include <linux/uio.h>
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 7b8ba068d28a..1663d0b3a05a 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -378,26 +378,6 @@ int xdp_reg_mem_model(struct xdp_mem_info *mem,
enum xdp_mem_type type, void *allocator);
void xdp_unreg_mem_model(struct xdp_mem_info *mem);
-/* Drivers not supporting XDP metadata can use this helper, which
- * rejects any room expansion for metadata as a result.
- */
-static __always_inline void
-xdp_set_data_meta_invalid(struct xdp_buff *xdp)
-{
- xdp->data_meta = xdp->data + 1;
-}
-
-static __always_inline bool
-xdp_data_meta_unsupported(const struct xdp_buff *xdp)
-{
- return unlikely(xdp->data_meta > xdp->data);
-}
-
-static inline bool xdp_metalen_invalid(unsigned long metalen)
-{
- return (metalen & (sizeof(__u32) - 1)) || (metalen > 32);
-}
-
struct xdp_attachment_info {
struct bpf_prog *prog;
u64 btf_id;
diff --git a/include/net/xdp_meta.h b/include/net/xdp_meta.h
new file mode 100644
index 000000000000..e1f3df9ceb93
--- /dev/null
+++ b/include/net/xdp_meta.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/* Copyright (C) 2022, Intel Corporation. */
+
+#ifndef __LINUX_NET_XDP_META_H__
+#define __LINUX_NET_XDP_META_H__
+
+#include <net/xdp.h>
+
+/* Drivers not supporting XDP metadata can use this helper, which
+ * rejects any room expansion for metadata as a result.
+ */
+static __always_inline void
+xdp_set_data_meta_invalid(struct xdp_buff *xdp)
+{
+ xdp->data_meta = xdp->data + 1;
+}
+
+static __always_inline bool
+xdp_data_meta_unsupported(const struct xdp_buff *xdp)
+{
+ return unlikely(xdp->data_meta > xdp->data);
+}
+
+static inline bool xdp_metalen_invalid(unsigned long metalen)
+{
+ return (metalen & (sizeof(__u32) - 1)) || (metalen > 32);
+}
+
+#endif /* __LINUX_NET_XDP_META_H__ */
diff --git a/net/bpf/core.c b/net/bpf/core.c
index dcd3b6ae86b7..18174d6d8687 100644
--- a/net/bpf/core.c
+++ b/net/bpf/core.c
@@ -14,7 +14,7 @@
#include <linux/bug.h>
#include <net/page_pool.h>
-#include <net/xdp.h>
+#include <net/xdp_meta.h>
#include <net/xdp_priv.h> /* struct xdp_mem_allocator */
#include <trace/events/xdp.h>
#include <net/xdp_sock_drv.h>
diff --git a/net/bpf/prog_ops.c b/net/bpf/prog_ops.c
index 33f02842e715..bf174b8d8a36 100644
--- a/net/bpf/prog_ops.c
+++ b/net/bpf/prog_ops.c
@@ -2,6 +2,7 @@
#include <linux/btf.h>
#include <linux/btf_ids.h>
+#include <net/xdp_meta.h>
#include <net/xdp_sock.h>
#include <trace/events/xdp.h>
diff --git a/net/bpf/test_run.c b/net/bpf/test_run.c
index 2ca96acbc50a..596b523ccced 100644
--- a/net/bpf/test_run.c
+++ b/net/bpf/test_run.c
@@ -19,7 +19,7 @@
#include <linux/error-injection.h>
#include <linux/smp.h>
#include <linux/sock_diag.h>
-#include <net/xdp.h>
+#include <net/xdp_meta.h>
#define CREATE_TRACE_POINTS
#include <trace/events/bpf_test_run.h>
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 19ac872a6624..ebf6a67424cd 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -24,7 +24,7 @@
#include <linux/rculist.h>
#include <net/xdp_sock_drv.h>
#include <net/busy_poll.h>
-#include <net/xdp.h>
+#include <net/xdp_meta.h>
#include "xsk_queue.h"
#include "xdp_umem.h"
--
2.36.1
^ permalink raw reply [flat|nested] 98+ messages in thread
* [xdp-hints] [PATCH RFC bpf-next 21/52] net, xdp: allow metadata > 32
2022-06-28 19:47 [xdp-hints] [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata Alexander Lobakin
` (19 preceding siblings ...)
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 20/52] net, xdp: move XDP metadata helpers into new xdp_meta.h Alexander Lobakin
@ 2022-06-28 19:47 ` Alexander Lobakin
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 22/52] net, skbuff: add ability to skip skb metadata comparison Alexander Lobakin
` (31 subsequent siblings)
52 siblings, 0 replies; 98+ messages in thread
From: Alexander Lobakin @ 2022-06-28 19:47 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
Cc: Alexander Lobakin, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Toke Hoiland-Jorgensen,
Lorenzo Bianconi, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jesse Brandeburg, John Fastabend, Yajun Deng,
Willem de Bruijn, bpf, netdev, linux-kernel, xdp-hints
Hardware/driver-prepended XDP metadata might be much bigger than 32
bytes, especially if it includes a piece of a descriptor.
Relax the restriction to allow metadata larger than 32 bytes and
make __skb_metadata_differs() work with bigger lengths. The new
limit is pretty much mechanical: skb_shared_info::meta_len is a u8,
and XDP_PACKET_HEADROOM is 256 (minus `sizeof(struct xdp_frame)`).
The requirement that the length be aligned to 4 bytes is still
valid.
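For illustration only, assuming a hypothetical sizeof(struct
xdp_frame) of 40 bytes on a 64-bit build (the exact value depends on
the kernel version and arch), the effective cap would be:

	max = min(255, 256 - 40) = 216 bytes

with 255 coming from the u8 ::meta_len, 256 - 40 from the headroom,
and the length still required to be a multiple of 4.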
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
---
include/linux/skbuff.h | 13 ++++++++-----
include/net/xdp_meta.h | 21 ++++++++++++++++++++-
2 files changed, 28 insertions(+), 6 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 82edf0359ab3..a825ea7f375d 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -4096,10 +4096,13 @@ static inline bool __skb_metadata_differs(const struct sk_buff *skb_a,
{
const void *a = skb_metadata_end(skb_a);
const void *b = skb_metadata_end(skb_b);
- /* Using more efficient varaiant than plain call to memcmp(). */
-#if defined(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) && BITS_PER_LONG == 64
u64 diffs = 0;
+ if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) ||
+ BITS_PER_LONG != 64)
+ goto slow;
+
+ /* Using more efficient variant than plain call to memcmp(). */
switch (meta_len) {
#define __it(x, op) (x -= sizeof(u##op))
#define __it_diff(a, b, op) (*(u##op *)__it(a, op)) ^ (*(u##op *)__it(b, op))
@@ -4119,11 +4122,11 @@ static inline bool __skb_metadata_differs(const struct sk_buff *skb_a,
fallthrough;
case 4: diffs |= __it_diff(a, b, 32);
break;
+ default:
+slow:
+ return memcmp(a - meta_len, b - meta_len, meta_len);
}
return diffs;
-#else
- return memcmp(a - meta_len, b - meta_len, meta_len);
-#endif
}
static inline bool skb_metadata_differs(const struct sk_buff *skb_a,
diff --git a/include/net/xdp_meta.h b/include/net/xdp_meta.h
index e1f3df9ceb93..3a40189d71c6 100644
--- a/include/net/xdp_meta.h
+++ b/include/net/xdp_meta.h
@@ -5,6 +5,7 @@
#define __LINUX_NET_XDP_META_H__
#include <net/xdp.h>
+#include <uapi/linux/bpf.h>
/* Drivers not supporting XDP metadata can use this helper, which
* rejects any room expansion for metadata as a result.
@@ -21,9 +22,27 @@ xdp_data_meta_unsupported(const struct xdp_buff *xdp)
return unlikely(xdp->data_meta > xdp->data);
}
+/**
+ * xdp_metalen_invalid - check if the length of a frame's metadata is valid
+ * @metalen: the length of the frame's metadata
+ *
+ * skb_shared_info::meta_len is 1 byte long, thus the metadata can't be
+ * longer than 255 bytes, but this can always change. XDP_PACKET_HEADROOM
+ * is 256 and is a UAPI constant. sizeof(struct xdp_frame) is reserved, as
+ * an xdp_frame is placed at xdp_buff::data_hard_start while being
+ * constructed on XDP_REDIRECT.
+ * The 32-bit alignment requirement is arbitrary, kept for simplicity and,
+ * sometimes, speed.
+ */
static inline bool xdp_metalen_invalid(unsigned long metalen)
{
- return (metalen & (sizeof(__u32) - 1)) || (metalen > 32);
+ typeof(metalen) max;
+
+ max = min_t(typeof(max),
+ (typeof_member(struct skb_shared_info, meta_len))~0UL,
+ XDP_PACKET_HEADROOM - sizeof(struct xdp_frame));
+ BUILD_BUG_ON(!__builtin_constant_p(max));
+
+ return (metalen & (sizeof(u32) - 1)) || metalen > max;
}
#endif /* __LINUX_NET_XDP_META_H__ */
--
2.36.1
^ permalink raw reply [flat|nested] 98+ messages in thread
* [xdp-hints] [PATCH RFC bpf-next 22/52] net, skbuff: add ability to skip skb metadata comparison
2022-06-28 19:47 [xdp-hints] [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata Alexander Lobakin
` (20 preceding siblings ...)
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 21/52] net, xdp: allow metadata > 32 Alexander Lobakin
@ 2022-06-28 19:47 ` Alexander Lobakin
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 23/52] net, skbuff: constify the @skb argument of skb_hwtstamps() Alexander Lobakin
` (30 subsequent siblings)
52 siblings, 0 replies; 98+ messages in thread
From: Alexander Lobakin @ 2022-06-28 19:47 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
Cc: Alexander Lobakin, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Toke Hoiland-Jorgensen,
Lorenzo Bianconi, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jesse Brandeburg, John Fastabend, Yajun Deng,
Willem de Bruijn, bpf, netdev, linux-kernel, xdp-hints
Some XDP metadata fields may be unique from frame to frame, not
necessarily indicating that a frame comes from a different flow.
This includes frame checksums, timestamps, etc.
The drivers usually carry the metadata over to skbs along with the
payload, and the GRO layer tries to compare the metadata of
the frames. This not only leads to perf regressions (esp. given
that metadata can now be larger than 32 bytes -> a slower call to
memcmp() will be used), but also breaks frame coalescing entirely.
To avoid that, add an skb flag indicating that the metadata can
carry unique values and thus should not be compared. If at least
one of the skbs passed to skb_metadata_differs() carries it, the
function immediately returns, reporting that they're identical.
The underscored version of the function is not affected, allowing
the metadata to be compared explicitly if needed. The flag is
cleared in pskb_expand_head() when skb_shared_info::meta_len gets
zeroed.
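A minimal driver-side sketch (hypothetical `meta_len`), assuming the
driver has just written per-frame values such as timestamps into the
metadata area:
	/* The metadata carries unique per-frame values -> keep GRO
	 * from comparing it and breaking the coalescing.
	 */
	skb_metadata_set(skb, meta_len);
	skb_metadata_nocomp_set(skb);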
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
---
include/linux/skbuff.h | 18 ++++++++++++++++++
net/core/skbuff.c | 1 +
2 files changed, 19 insertions(+)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index a825ea7f375d..1c308511acbb 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -509,6 +509,11 @@ enum {
* charged to the kernel memory.
*/
SKBFL_PURE_ZEROCOPY = BIT(2),
+
+ /* skb metadata may contain unique values such as checksums
+ * and we should not compare it against others.
+ */
+ SKBFL_METADATA_NOCOMP = BIT(3),
};
#define SKBFL_ZEROCOPY_FRAG (SKBFL_ZEROCOPY_ENABLE | SKBFL_SHARED_FRAG)
@@ -4137,6 +4142,9 @@ static inline bool skb_metadata_differs(const struct sk_buff *skb_a,
if (!(len_a | len_b))
return false;
+ if ((skb_shinfo(skb_a)->flags | skb_shinfo(skb_b)->flags) &
+ SKBFL_METADATA_NOCOMP)
+ return false;
return len_a != len_b ?
true : __skb_metadata_differs(skb_a, skb_b, len_a);
@@ -4152,6 +4160,16 @@ static inline void skb_metadata_clear(struct sk_buff *skb)
skb_metadata_set(skb, 0);
}
+static inline void skb_metadata_nocomp_set(struct sk_buff *skb)
+{
+ skb_shinfo(skb)->flags |= SKBFL_METADATA_NOCOMP;
+}
+
+static inline void skb_metadata_nocomp_clear(struct sk_buff *skb)
+{
+ skb_shinfo(skb)->flags &= ~SKBFL_METADATA_NOCOMP;
+}
+
struct sk_buff *skb_clone_sk(struct sk_buff *skb);
#ifdef CONFIG_NETWORK_PHY_TIMESTAMPING
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 00bf35ee8205..5b23fc7f1157 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1750,6 +1750,7 @@ int pskb_expand_head(struct sk_buff *skb, int nhead, int ntail,
atomic_set(&skb_shinfo(skb)->dataref, 1);
skb_metadata_clear(skb);
+ skb_metadata_nocomp_clear(skb);
/* It is not generally safe to change skb->truesize.
* For the moment, we really care of rx path, or
--
2.36.1
^ permalink raw reply [flat|nested] 98+ messages in thread
* [xdp-hints] [PATCH RFC bpf-next 23/52] net, skbuff: constify the @skb argument of skb_hwtstamps()
2022-06-28 19:47 [xdp-hints] [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata Alexander Lobakin
` (21 preceding siblings ...)
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 22/52] net, skbuff: add ability to skip skb metadata comparison Alexander Lobakin
@ 2022-06-28 19:47 ` Alexander Lobakin
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 24/52] bpf, xdp: declare generic XDP metadata structure Alexander Lobakin
` (29 subsequent siblings)
52 siblings, 0 replies; 98+ messages in thread
From: Alexander Lobakin @ 2022-06-28 19:47 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
Cc: Alexander Lobakin, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Toke Hoiland-Jorgensen,
Lorenzo Bianconi, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jesse Brandeburg, John Fastabend, Yajun Deng,
Willem de Bruijn, bpf, netdev, linux-kernel, xdp-hints
The @skb argument is only used to dereference the &skb_shared_info
pointer, so the function doesn't need it to be writable. Constify
it to be able to pass const pointers from the code which uses this
function and to give the compilers a little more room for
optimization.
As an example, constify the @skb argument of tpacket_get_timestamp()
and __packet_set_timestamp() in the AF_PACKET core code. There are
a lot more places in the kernel where similar micro-optimizations
can be done in the future.
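As a sketch of what the constification enables (the helper below is
hypothetical), read-only users can now take a const pointer:
	static u64 rx_hw_tstamp_ns(const struct sk_buff *skb)
	{
		/* compiles only with the constified skb_hwtstamps() */
		return ktime_to_ns(skb_hwtstamps(skb)->hwtstamp);
	}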
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
---
include/linux/skbuff.h | 3 ++-
net/packet/af_packet.c | 8 ++++----
2 files changed, 6 insertions(+), 5 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 1c308511acbb..0a95f753c1d9 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1617,7 +1617,8 @@ int skb_zerocopy_iter_stream(struct sock *sk, struct sk_buff *skb,
/* Internal */
#define skb_shinfo(SKB) ((struct skb_shared_info *)(skb_end_pointer(SKB)))
-static inline struct skb_shared_hwtstamps *skb_hwtstamps(struct sk_buff *skb)
+static inline struct skb_shared_hwtstamps *
+skb_hwtstamps(const struct sk_buff *skb)
{
return &skb_shinfo(skb)->hwtstamps;
}
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index d08c4728523b..20eac049e69e 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -449,10 +449,10 @@ static int __packet_get_status(const struct packet_sock *po, void *frame)
}
}
-static __u32 tpacket_get_timestamp(struct sk_buff *skb, struct timespec64 *ts,
- unsigned int flags)
+static __u32 tpacket_get_timestamp(const struct sk_buff *skb,
+ struct timespec64 *ts, unsigned int flags)
{
- struct skb_shared_hwtstamps *shhwtstamps = skb_hwtstamps(skb);
+ const struct skb_shared_hwtstamps *shhwtstamps = skb_hwtstamps(skb);
if (shhwtstamps &&
(flags & SOF_TIMESTAMPING_RAW_HARDWARE) &&
@@ -467,7 +467,7 @@ static __u32 tpacket_get_timestamp(struct sk_buff *skb, struct timespec64 *ts,
}
static __u32 __packet_set_timestamp(struct packet_sock *po, void *frame,
- struct sk_buff *skb)
+ const struct sk_buff *skb)
{
union tpacket_uhdr h;
struct timespec64 ts;
--
2.36.1
^ permalink raw reply [flat|nested] 98+ messages in thread
* [xdp-hints] [PATCH RFC bpf-next 24/52] bpf, xdp: declare generic XDP metadata structure
2022-06-28 19:47 [xdp-hints] [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata Alexander Lobakin
` (22 preceding siblings ...)
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 23/52] net, skbuff: constify the @skb argument of skb_hwtstamps() Alexander Lobakin
@ 2022-06-28 19:47 ` Alexander Lobakin
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 25/52] net, xdp: add basic generic metadata accessors Alexander Lobakin
` (28 subsequent siblings)
52 siblings, 0 replies; 98+ messages in thread
From: Alexander Lobakin @ 2022-06-28 19:47 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
Cc: Alexander Lobakin, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Toke Hoiland-Jorgensen,
Lorenzo Bianconi, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jesse Brandeburg, John Fastabend, Yajun Deng,
Willem de Bruijn, bpf, netdev, linux-kernel, xdp-hints
From: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
The generic XDP metadata is a driver-independent "header" which
carries the essential info such as the checksum status, the hash
etc. It can be composed by both hardware and software (drivers)
and is designed to pass that info, usually taken from the NIC
descriptors, between the different subsystems and layers in one
unified format.
As it's "cross-everything" and can be composed by hardware
(primarily SmartNICs), an explicit Endianness is required. Most
hardware and hosts operate in LE nowadays, so the choice was
obvious although network frames themselves are in BE. The byteswap
macros will be no-ops for LE systems.
The first and the last field must always be 2-byte one to have
a natural alignment of 4 and 8 byte members on 32-bit platforms
where there's an "IP align" 2-byte padding in front of the data:
the first member paired with that padding makes the next one
aligned to 4 bytes, the last one stacks with the Ethernet header
to make its end aligned to 4 bytes.
As it's being prepended right in front of the Ethernet header, it
grows to the left, so all new fields must be added at the beginning
of the structure in the future.
The related definitions are declared inside an enum so that they're
visible to BPF programs. The struct is declared in UAPI so AF_XDP
programs, which can work with metadata as well, would have access
to it.
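A minimal sketch of an XDP program consuming the structure (assumes
the metadata area contains exactly one struct xdp_meta_generic and an
LE host; the program and section names are hypothetical):
	#include <linux/bpf.h>
	#include <bpf/bpf_helpers.h>

	SEC("xdp")
	int xdp_check_meta(struct xdp_md *ctx)
	{
		void *data = (void *)(long)ctx->data;
		void *meta = (void *)(long)ctx->data_meta;
		const struct xdp_meta_generic *md = meta;

		/* bounds check to satisfy the verifier */
		if (meta + sizeof(*md) > data)
			return XDP_PASS;

		/* not generic-compatible -> nothing to read */
		if (md->magic_id != XDP_META_GENERIC_MAGIC)
			return XDP_PASS;

		/* md->rx_hash, md->rx_csum etc. are usable from here */
		return XDP_PASS;
	}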
Signed-off-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com>
Co-developed-by: Larysa Zaremba <larysa.zaremba@intel.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Co-developed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
---
include/uapi/linux/bpf.h | 173 +++++++++++++++++++++++++++++++++
tools/include/uapi/linux/bpf.h | 173 +++++++++++++++++++++++++++++++++
2 files changed, 346 insertions(+)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 372170ded1d8..1caaec1de625 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -8,6 +8,7 @@
#ifndef _UAPI__LINUX_BPF_H__
#define _UAPI__LINUX_BPF_H__
+#include <asm/byteorder.h>
#include <linux/types.h>
#include <linux/bpf_common.h>
@@ -6859,4 +6860,176 @@ struct bpf_core_relo {
enum bpf_core_relo_kind kind;
};
+/* Definitions being used to work with &xdp_meta_generic, declared as an enum
+ * so they are visible for BPF programs via vmlinux.h.
+ */
+enum xdp_meta_generic_defs {
+ /* xdp_meta_generic::tx_flags */
+
+ /* Mask of bits containing Tx timestamp action */
+ XDP_META_TX_TSTAMP_ACT = (0x3 << 4),
+ /* No action is needed */
+ XDP_META_TX_TSTAMP_NONE = 0x0,
+ /* %SO_TIMESTAMP command */
+ XDP_META_TX_TSTAMP_SOCK = 0x1,
+ /* Set the value to the actual time when a packet is sent */
+ XDP_META_TX_TSTAMP_COMP = 0x2,
+ /* Mask of bits containing Tx VLAN action */
+ XDP_META_TX_VLAN_TYPE = (0x3 << 2),
+ /* No action is needed */
+ XDP_META_TX_VLAN_NONE = 0x0,
+ /* NIC must push C-VLAN tag */
+ XDP_META_TX_CVID = 0x1,
+ /* NIC must push S-VLAN tag */
+ XDP_META_TX_SVID = 0x2,
+ /* Mask of bits containing Tx checksum action */
+ XDP_META_TX_CSUM_ACT = (0x3 << 0),
+ /* No action for checksum */
+ XDP_META_TX_CSUM_ASIS = 0x0,
+ /* NIC must compute checksum, no start/offset are provided */
+ XDP_META_TX_CSUM_AUTO = 0x1,
+ /* NIC must compute checksum using the provided start and offset */
+ XDP_META_TX_CSUM_HELP = 0x2,
+
+ /* xdp_meta_generic::rx_flags */
+
+ /* Metadata contains valid Rx queue ID */
+ XDP_META_RX_QID_PRESENT = (0x1 << 9),
+ /* Metadata contains valid Rx timestamp */
+ XDP_META_RX_TSTAMP_PRESENT = (0x1 << 8),
+ /* Mask of bits containing Rx VLAN status */
+ XDP_META_RX_VLAN_TYPE = (0x3 << 6),
+ /* Metadata does not have any VLAN tags */
+ XDP_META_RX_VLAN_NONE = 0x0,
+ /* Metadata carries valid C-VLAN tag */
+ XDP_META_RX_CVID = 0x1,
+ /* Metadata carries valid S-VLAN tag */
+ XDP_META_RX_SVID = 0x2,
+ /* Mask of bits containing Rx hash status */
+ XDP_META_RX_HASH_TYPE = (0x3 << 4),
+ /* Metadata has no RSS hash */
+ XDP_META_RX_HASH_NONE = 0x0,
+ /* Metadata has valid L2 hash */
+ XDP_META_RX_HASH_L2 = 0x1,
+ /* Metadata has valid L3 hash */
+ XDP_META_RX_HASH_L3 = 0x2,
+ /* Metadata has valid L4 hash */
+ XDP_META_RX_HASH_L4 = 0x3,
+ /* Mask of the field containing checksum level (if there's encap) */
+ XDP_META_RX_CSUM_LEVEL = (0x3 << 2),
+ /* Mask of bits containing Rx checksum status */
+ XDP_META_RX_CSUM_STATUS = (0x3 << 0),
+ /* Metadata has no checksum info */
+ XDP_META_RX_CSUM_NONE = 0x0,
+ /* Checksum has been verified by NIC */
+ XDP_META_RX_CSUM_OK = 0x1,
+ /* Metadata carries valid checksum */
+ XDP_META_RX_CSUM_COMP = 0x2,
+
+ /* xdp_meta_generic::magic_id indicates that the metadata is either
+ * struct xdp_meta_generic itself or contains it at the end -> can be
+ * used to get/set HW hints.
+ * Direct btf_id comparison is not enough here as a custom structure
+ * carrying xdp_meta_generic at the end will have a different ID.
+ */
+ XDP_META_GENERIC_MAGIC = 0xeda6,
+};
+
+/* Generic metadata can be composed directly by HW, plus it should always
+ * have the first field as __le16 to account for the 2 bytes of "IP align", so
+ * we pack it to avoid unexpected paddings. Also, it should be aligned to
+ * sizeof(__be16) as any other Ethernet data, and to optimize access on the
+ * 32-bit platforms.
+ */
+#define __xdp_meta_generic_attrs \
+ __attribute__((__packed__)) \
+ __attribute__((aligned(sizeof(__be16))))
+
+/* Depending on the field layout inside the structure, it might or might not
+ * emit a "packed attribute is unnecessary" warning (when enabled, e.g. in
+ * libbpf). To not add and remove the attributes on each field addition,
+ * just suppress it.
+ */
+#pragma GCC diagnostic push
+#pragma GCC diagnostic ignored "-Wpacked"
+
+/* All fields have explicit endianness, as it might be composed by HW.
+ * Byteswaps are needed for the Big Endian architectures to access the
+ * fields.
+ */
+struct xdp_meta_generic {
+ /* Add new fields here */
+
+ /* Egress part */
+ __struct_group(/* no tag */, tx, __xdp_meta_generic_attrs,
+ /* Offset from the start of the frame to the L4 header
+ * to compute checksum for
+ */
+ __le16 tx_csum_start;
+ /* Offset inside the L4 header to the checksum field */
+ __le16 tx_csum_off;
+ /* ID for hardware VLAN push */
+ __le16 tx_vid;
+ /* Flags indicating which Tx metadata is used */
+ __le32 tx_flags;
+ /* Tx timestamp value */
+ __le64 tx_tstamp;
+ );
+
+ /* Shortcut for the half relevant on ingress: Rx + IDs */
+ __struct_group(xdp_meta_generic_rx, rx_full, __xdp_meta_generic_attrs,
+ /* Ingress part */
+ __struct_group(/* no tag */, rx, __xdp_meta_generic_attrs,
+ /* Rx timestamp value */
+ __le64 rx_tstamp;
+ /* Rx hash value */
+ __le32 rx_hash;
+ /* Rx checksum value */
+ __le32 rx_csum;
+ /* VLAN ID popped on Rx */
+ __le16 rx_vid;
+ /* Rx queue ID on which the frame has arrived */
+ __le16 rx_qid;
+ /* Flags indicating which Rx metadata is used */
+ __le32 rx_flags;
+ );
+
+ /* Unique metadata identifiers */
+ __struct_group(/* no tag */, id, __xdp_meta_generic_attrs,
+ union {
+ struct {
+#ifdef __BIG_ENDIAN_BITFIELD
+ /* Indicates the ID of the BTF which
+ * the below type ID comes from, as
+ * several kernel modules may have
+ * identical type IDs
+ */
+ __le32 btf_id;
+ /* Indicates the ID of the actual
+ * structure passed as metadata,
+ * within the above BTF ID
+ */
+ __le32 type_id;
+#else /* __LITTLE_ENDIAN_BITFIELD */
+ __le32 type_id;
+ __le32 btf_id;
+#endif /* __LITTLE_ENDIAN_BITFIELD */
+ };
+ /* BPF program gets IDs coded as one __u64:
+ * `btf_id << 32 | type_id`, allowing direct
+ * comparison
+ */
+ __le64 full_id;
+ };
+ /* If set to the correct value, indicates that the
+ * meta is generic-compatible and can be used by
+ * the consumers of generic metadata
+ */
+ __le16 magic_id;
+ );
+ );
+} __xdp_meta_generic_attrs;
+
+#pragma GCC diagnostic pop
+
#endif /* _UAPI__LINUX_BPF_H__ */
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 372170ded1d8..436b925adfb3 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -8,6 +8,7 @@
#ifndef _UAPI__LINUX_BPF_H__
#define _UAPI__LINUX_BPF_H__
+#include <asm/byteorder.h>
#include <linux/types.h>
#include <linux/bpf_common.h>
@@ -6859,4 +6860,176 @@ struct bpf_core_relo {
enum bpf_core_relo_kind kind;
};
+/* Definitions being used to work with &xdp_meta_generic, declared as an enum
+ * so they are visible for BPF programs via vmlinux.h.
+ */
+enum xdp_meta_generic_defs {
+ /* xdp_meta_generic::tx_flags */
+
+ /* Mask of bits containing Tx timestamp action */
+ XDP_META_TX_TSTAMP_ACT = (0x3 << 4),
+ /* No action is needed */
+ XDP_META_TX_TSTAMP_NONE = 0x0,
+ /* %SO_TIMESTAMP command */
+ XDP_META_TX_TSTAMP_SOCK = 0x1,
+ /* Set the value to the actual time when a packet is sent */
+ XDP_META_TX_TSTAMP_COMP = 0x2,
+ /* Mask of bits containing Tx VLAN action */
+ XDP_META_TX_VLAN_TYPE = (0x3 << 2),
+ /* No action is needed */
+ XDP_META_TX_VLAN_NONE = 0x0,
+ /* NIC must push C-VLAN tag */
+ XDP_META_TX_CVID = 0x1,
+ /* NIC must push S-VLAN tag */
+ XDP_META_TX_SVID = 0x2,
+ /* Mask of bits containing Tx checksum action */
+ XDP_META_TX_CSUM_ACT = (0x3 << 0),
+ /* No action for checksum */
+ XDP_META_TX_CSUM_ASIS = 0x0,
+ /* NIC must compute checksum, no start/offset are provided */
+ XDP_META_TX_CSUM_AUTO = 0x1,
+ /* NIC must compute checksum using the provided start and offset */
+ XDP_META_TX_CSUM_HELP = 0x2,
+
+ /* xdp_meta_generic::rx_flags */
+
+ /* Metadata contains valid Rx queue ID */
+ XDP_META_RX_QID_PRESENT = (0x1 << 9),
+ /* Metadata contains valid Rx timestamp */
+ XDP_META_RX_TSTAMP_PRESENT = (0x1 << 8),
+ /* Mask of bits containing Rx VLAN status */
+ XDP_META_RX_VLAN_TYPE = (0x3 << 6),
+ /* Metadata does not have any VLAN tags */
+ XDP_META_RX_VLAN_NONE = 0x0,
+ /* Metadata carries valid C-VLAN tag */
+ XDP_META_RX_CVID = 0x1,
+ /* Metadata carries valid S-VLAN tag */
+ XDP_META_RX_SVID = 0x2,
+ /* Mask of bits containing Rx hash status */
+ XDP_META_RX_HASH_TYPE = (0x3 << 4),
+ /* Metadata has no RSS hash */
+ XDP_META_RX_HASH_NONE = 0x0,
+ /* Metadata has valid L2 hash */
+ XDP_META_RX_HASH_L2 = 0x1,
+ /* Metadata has valid L3 hash */
+ XDP_META_RX_HASH_L3 = 0x2,
+ /* Metadata has valid L4 hash */
+ XDP_META_RX_HASH_L4 = 0x3,
+ /* Mask of the field containing checksum level (if there's encap) */
+ XDP_META_RX_CSUM_LEVEL = (0x3 << 2),
+ /* Mask of bits containing Rx checksum status */
+ XDP_META_RX_CSUM_STATUS = (0x3 << 0),
+ /* Metadata has no checksum info */
+ XDP_META_RX_CSUM_NONE = 0x0,
+ /* Checksum has been verified by NIC */
+ XDP_META_RX_CSUM_OK = 0x1,
+ /* Metadata carries valid checksum */
+ XDP_META_RX_CSUM_COMP = 0x2,
+
+ /* xdp_meta_generic::magic_id indicates that the metadata is either
+ * struct xdp_meta_generic itself or contains it at the end -> can be
+ * used to get/set HW hints.
+ * Direct btf_id comparison is not enough here as a custom structure
+ * carrying xdp_meta_generic at the end will have a different ID.
+ */
+ XDP_META_GENERIC_MAGIC = 0xeda6,
+};
+
+/* Generic metadata can be composed directly by HW, plus it should always
+ * have the first field as __le16 to account for the 2 bytes of "IP align", so
+ * we pack it to avoid unexpected paddings. Also, it should be aligned to
+ * sizeof(__be16) as any other Ethernet data, and to optimize access on the
+ * 32-bit platforms.
+ */
+#define __xdp_meta_generic_attrs \
+ __attribute__((__packed__)) \
+ __attribute__((aligned(sizeof(__be16))))
+
+/* Depending on the field layout inside the structure, it might or might not
+ * emit a "packed attribute is unnecessary" warning (when enabled, e.g. in
+ * libbpf). To not add and remove the attributes on each field addition,
+ * just suppress it.
+ */
+#pragma GCC diagnostic push
+#pragma GCC diagnostic ignored "-Wpacked"
+
+/* All fields have explicit endianness, as it might be composed by HW.
+ * Byteswaps are needed for the Big Endian architectures to access the
+ * fields.
+ */
+struct xdp_meta_generic {
+ /* Add new fields here */
+
+ /* Egress part */
+ __struct_group(/* no tag */, tx, __xdp_meta_generic_attrs,
+ /* Offset from the start of the frame to the L4 header
+ * to compute checksum for
+ */
+ __le16 tx_csum_start;
+ /* Offset inside the L4 header to the checksum field */
+ __le16 tx_csum_off;
+ /* ID for hardware VLAN push */
+ __le16 tx_vid;
+ /* Flags indicating which Tx metadata is used */
+ __le32 tx_flags;
+ /* Tx timestamp value */
+ __le64 tx_tstamp;
+ );
+
+ /* Shortcut for the half relevant on ingress: Rx + IDs */
+ __struct_group(xdp_meta_generic_rx, rx_full, __xdp_meta_generic_attrs,
+ /* Ingress part */
+ __struct_group(/* no tag */, rx, __xdp_meta_generic_attrs,
+ /* Rx timestamp value */
+ __le64 rx_tstamp;
+ /* Rx hash value */
+ __le32 rx_hash;
+ /* Rx checksum value */
+ __le32 rx_csum;
+ /* VLAN ID popped on Rx */
+ __le16 rx_vid;
+ /* Rx queue ID on which the frame has arrived */
+ __le16 rx_qid;
+ /* Flags indicating which Rx metadata is used */
+ __le32 rx_flags;
+ );
+
+ /* Unique metadata identifiers */
+ __struct_group(/* no tag */, id, __xdp_meta_generic_attrs,
+ union {
+ struct {
+#ifdef __BIG_ENDIAN_BITFIELD
+ /* Indicates the ID of the BTF which
+ * the below type ID comes from, as
+ * several kernel modules may have
+ * identical type IDs
+ */
+ __le32 btf_id;
+ /* Indicates the ID of the actual
+ * structure passed as metadata,
+ * within the above BTF ID
+ */
+ __le32 type_id;
+#else /* __LITTLE_ENDIAN_BITFIELD */
+ __le32 type_id;
+ __le32 btf_id;
+#endif /* __LITTLE_ENDIAN_BITFIELD */
+ };
+ /* BPF program gets IDs coded as one __u64:
+ * `btf_id << 32 | type_id`, allowing direct
+ * comparison
+ */
+ __le64 full_id;
+ };
+ /* If set to the correct value, indicates that the
+ * meta is generic-compatible and can be used by
+ * the consumers of generic metadata
+ */
+ __le16 magic_id;
+ );
+ );
+} __xdp_meta_generic_attrs;
+
+#pragma GCC diagnostic pop
+
#endif /* _UAPI__LINUX_BPF_H__ */
--
2.36.1
^ permalink raw reply [flat|nested] 98+ messages in thread
* [xdp-hints] [PATCH RFC bpf-next 25/52] net, xdp: add basic generic metadata accessors
2022-06-28 19:47 [xdp-hints] [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata Alexander Lobakin
` (23 preceding siblings ...)
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 24/52] bpf, xdp: declare generic XDP metadata structure Alexander Lobakin
@ 2022-06-28 19:47 ` Alexander Lobakin
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 26/52] bpf, btf: add a pair of functions to work with the BTF ID + type ID pair Alexander Lobakin
` (27 subsequent siblings)
52 siblings, 0 replies; 98+ messages in thread
From: Alexander Lobakin @ 2022-06-28 19:47 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
Cc: Alexander Lobakin, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Toke Hoiland-Jorgensen,
Lorenzo Bianconi, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jesse Brandeburg, John Fastabend, Yajun Deng,
Willem de Bruijn, bpf, netdev, linux-kernel, xdp-hints
As all of the fields in the generic XDP metadata structure have
explicit endianness, it's worth providing some basic helpers.
Add get and set accessors for each field, and get, set and rep
accessors for each bitfield of ::{rx,tx}_flags. The rep accessors
are for the cases when it's unknown whether a flags field is clear,
so they effectively replace the value in a bitfield instead of just
ORing it in.
Also add a couple of helpers: one to get a pointer to the generic
metadata structure and one to check whether a given metadata is
generic-compatible.
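A driver-side usage sketch (hypothetical `priv->meta_id` and
`rx_desc`; metadata-room reservation and error handling omitted):
	struct xdp_meta_generic *md = xdp_meta_generic_ptr(xdp->data);

	/* zeroes the struct and fills ::full_id + ::magic_id */
	xdp_meta_init(md, priv->meta_id);

	/* the flags are known to be zero after _init() -> _set() is
	 * enough; use _rep() when reusing a non-zeroed area
	 */
	xdp_meta_rx_hash_set(md, rx_desc->hash);
	xdp_meta_rx_hash_type_set(md, XDP_META_RX_HASH_L4);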
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
---
include/net/xdp_meta.h | 238 +++++++++++++++++++++++++++++++++++++++++
1 file changed, 238 insertions(+)
diff --git a/include/net/xdp_meta.h b/include/net/xdp_meta.h
index 3a40189d71c6..f61831e39eb0 100644
--- a/include/net/xdp_meta.h
+++ b/include/net/xdp_meta.h
@@ -4,6 +4,7 @@
#ifndef __LINUX_NET_XDP_META_H__
#define __LINUX_NET_XDP_META_H__
+#include <linux/bitfield.h>
#include <net/xdp.h>
#include <uapi/linux/bpf.h>
@@ -45,4 +46,241 @@ static inline bool xdp_metalen_invalid(unsigned long metalen)
return (metalen & (sizeof(u32) - 1)) || metalen > max;
}
+/* This builds _get(), _set() and _rep() for each bitfield.
+ * If you know for sure the field is empty (e.g. you zeroed the struct
+ * previously), use faster _set() op to save several cycles, otherwise
+ * use _rep() to avoid mixing values.
+ */
+#define XDP_META_BUILD_FLAGS_ACC(dir, pfx, FLD) \
+static inline u32 \
+xdp_meta_##dir##_##pfx##_get(const struct xdp_meta_generic *md) \
+{ \
+ static_assert(__same_type(md->dir##_flags, __le32)); \
+ \
+ return le32_get_bits(md->dir##_flags, XDP_META_##FLD); \
+} \
+ \
+static inline void \
+xdp_meta_##dir##_##pfx##_set(struct xdp_meta_generic *md, u32 val) \
+{ \
+ md->dir##_flags |= le32_encode_bits(val, XDP_META_##FLD); \
+} \
+ \
+static inline void \
+xdp_meta_##dir##_##pfx##_rep(struct xdp_meta_generic *md, u32 val) \
+{ \
+ le32p_replace_bits(&md->dir##_flags, val, XDP_META_##FLD); \
+} \
+
+/* This builds _get() and _set() for each structure field -- those are just
+ * byteswap operations, however.
+ * The second static assertion ensures that all of the fields in the
+ * structure are naturally aligned when ::magic_id starts at
+ * `XDP_PACKET_HEADROOM + 8n`, which is the default and recommended case.
+ * This check makes no sense on platforms with efficient unaligned access,
+ * but helps the rest.
+ */
+#define XDP_META_BUILD_ACC(dir, pfx, sz) \
+static inline u##sz \
+xdp_meta_##dir##_##pfx##_get(const struct xdp_meta_generic *md) \
+{ \
+ static_assert(__same_type(md->dir##_##pfx, __le##sz)); \
+ \
+ return le##sz##_to_cpu(md->dir##_##pfx); \
+} \
+ \
+static inline void \
+xdp_meta_##dir##_##pfx##_set(struct xdp_meta_generic *md, u##sz val) \
+{ \
+ static_assert((XDP_PACKET_HEADROOM - sizeof(*md) + \
+ sizeof_field(typeof(*md), magic_id) + \
+ offsetof(typeof(*md), dir##_##pfx)) % \
+ sizeof_field(typeof(*md), dir##_##pfx) == 0); \
+ \
+ md->dir##_##pfx = cpu_to_le##sz(val); \
+}
+
+#if 0 /* For grepping/indexers */
+u16 xdp_meta_tx_csum_action_get(const struct xdp_meta_generic *md);
+void xdp_meta_tx_csum_action_set(struct xdp_meta_generic *md, u16 val);
+void xdp_meta_tx_csum_action_rep(struct xdp_meta_generic *md, u16 val);
+u16 xdp_meta_tx_vlan_type_get(const struct xdp_meta_generic *md);
+void xdp_meta_tx_vlan_type_set(struct xdp_meta_generic *md, u16 val);
+void xdp_meta_tx_vlan_type_rep(struct xdp_meta_generic *md, u16 val);
+u16 xdp_meta_tx_tstamp_action_get(const struct xdp_meta_generic *md);
+void xdp_meta_tx_tstamp_action_set(struct xdp_meta_generic *md, u16 val);
+void xdp_meta_tx_tstamp_action_rep(struct xdp_meta_generic *md, u16 val);
+#endif
+XDP_META_BUILD_FLAGS_ACC(tx, csum_action, TX_CSUM_ACT);
+XDP_META_BUILD_FLAGS_ACC(tx, vlan_type, TX_VLAN_TYPE);
+XDP_META_BUILD_FLAGS_ACC(tx, tstamp_action, TX_TSTAMP_ACT);
+
+#if 0
+u16 xdp_meta_tx_csum_start_get(const struct xdp_meta_generic *md);
+void xdp_meta_tx_csum_start_set(struct xdp_meta_generic *md, u64 val);
+u16 xdp_meta_tx_csum_off_get(const struct xdp_meta_generic *md);
+void xdp_meta_tx_csum_off_set(struct xdp_meta_generic *md, u64 val);
+u16 xdp_meta_tx_vid_get(const struct xdp_meta_generic *md);
+void xdp_meta_tx_vid_set(struct xdp_meta_generic *md, u64 val);
+u32 xdp_meta_tx_flags_get(const struct xdp_meta_generic *md);
+void xdp_meta_tx_flags_set(struct xdp_meta_generic *md, u32 val);
+u64 xdp_meta_tx_tstamp_get(const struct xdp_meta_generic *md);
+void xdp_meta_tx_tstamp_set(struct xdp_meta_generic *md, u64 val);
+#endif
+XDP_META_BUILD_ACC(tx, csum_start, 16);
+XDP_META_BUILD_ACC(tx, csum_off, 16);
+XDP_META_BUILD_ACC(tx, vid, 16);
+XDP_META_BUILD_ACC(tx, flags, 32);
+XDP_META_BUILD_ACC(tx, tstamp, 64);
+
+#if 0
+u16 xdp_meta_rx_csum_status_get(const struct xdp_meta_generic *md);
+void xdp_meta_rx_csum_status_set(struct xdp_meta_generic *md, u16 val);
+void xdp_meta_rx_csum_status_rep(struct xdp_meta_generic *md, u16 val);
+u16 xdp_meta_rx_csum_level_get(const struct xdp_meta_generic *md);
+void xdp_meta_rx_csum_level_set(struct xdp_meta_generic *md, u16 val);
+void xdp_meta_rx_csum_level_rep(struct xdp_meta_generic *md, u16 val);
+u16 xdp_meta_rx_hash_type_get(const struct xdp_meta_generic *md);
+void xdp_meta_rx_hash_type_set(struct xdp_meta_generic *md, u16 val);
+void xdp_meta_rx_hash_type_rep(struct xdp_meta_generic *md, u16 val);
+u16 xdp_meta_rx_vlan_type_get(const struct xdp_meta_generic *md);
+void xdp_meta_rx_vlan_type_set(struct xdp_meta_generic *md, u16 val);
+void xdp_meta_rx_vlan_type_rep(struct xdp_meta_generic *md, u16 val);
+u16 xdp_meta_rx_tstamp_present_get(const struct xdp_meta_generic *md);
+void xdp_meta_rx_tstamp_present_set(struct xdp_meta_generic *md, u16 val);
+void xdp_meta_rx_tstamp_present_rep(struct xdp_meta_generic *md, u16 val);
+u16 xdp_meta_rx_qid_present_get(const struct xdp_meta_generic *md);
+void xdp_meta_rx_qid_present_set(struct xdp_meta_generic *md, u16 val);
+void xdp_meta_rx_qid_present_rep(struct xdp_meta_generic *md, u16 val);
+#endif
+XDP_META_BUILD_FLAGS_ACC(rx, csum_status, RX_CSUM_STATUS);
+XDP_META_BUILD_FLAGS_ACC(rx, csum_level, RX_CSUM_LEVEL);
+XDP_META_BUILD_FLAGS_ACC(rx, hash_type, RX_HASH_TYPE);
+XDP_META_BUILD_FLAGS_ACC(rx, vlan_type, RX_VLAN_TYPE);
+XDP_META_BUILD_FLAGS_ACC(rx, tstamp_present, RX_TSTAMP_PRESENT);
+XDP_META_BUILD_FLAGS_ACC(rx, qid_present, RX_QID_PRESENT);
+
+#if 0
+u64 xdp_meta_rx_tstamp_get(const struct xdp_meta_generic *md);
+void xdp_meta_rx_tstamp_set(struct xdp_meta_generic *md, u64 val);
+u32 xdp_meta_rx_hash_get(const struct xdp_meta_generic *md);
+void xdp_meta_rx_hash_set(struct xdp_meta_generic *md, u32 val);
+u32 xdp_meta_rx_csum_get(const struct xdp_meta_generic *md);
+void xdp_meta_rx_csum_set(struct xdp_meta_generic *md, u32 val);
+u16 xdp_meta_rx_vid_get(const struct xdp_meta_generic *md);
+void xdp_meta_rx_vid_set(struct xdp_meta_generic *md, u16 val);
+u16 xdp_meta_rx_qid_get(const struct xdp_meta_generic *md);
+void xdp_meta_rx_qid_set(struct xdp_meta_generic *md, u16 val);
+u32 xdp_meta_rx_flags_get(const struct xdp_meta_generic *md);
+void xdp_meta_rx_flags_set(struct xdp_meta_generic *md, u32 val);
+#endif
+XDP_META_BUILD_ACC(rx, tstamp, 64);
+XDP_META_BUILD_ACC(rx, hash, 32);
+XDP_META_BUILD_ACC(rx, csum, 32);
+XDP_META_BUILD_ACC(rx, vid, 16);
+XDP_META_BUILD_ACC(rx, qid, 16);
+XDP_META_BUILD_ACC(rx, flags, 32);
+
+#if 0
+u32 xdp_meta_btf_id_get(const struct xdp_meta_generic *md);
+void xdp_meta_btf_id_set(struct xdp_meta_generic *md, u32 val);
+u32 xdp_meta_type_id_get(const struct xdp_meta_generic *md);
+void xdp_meta_type_id_set(struct xdp_meta_generic *md, u32 val);
+u64 xdp_meta_full_id_get(const struct xdp_meta_generic *md);
+void xdp_meta_full_id_set(struct xdp_meta_generic *md, u64 val);
+u16 xdp_meta_magic_id_get(const struct xdp_meta_generic *md);
+void xdp_meta_magic_id_set(struct xdp_meta_generic *md, u16 val);
+#endif
+XDP_META_BUILD_ACC(btf, id, 32);
+XDP_META_BUILD_ACC(type, id, 32);
+XDP_META_BUILD_ACC(full, id, 64);
+XDP_META_BUILD_ACC(magic, id, 16);
+
+/* This allows jumping from xdp_meta_generic::{tx,rx_full,rx,id} to the
+ * parent if needed. For example, declare one of them on the stack for
+ * convenience and still pass a generic pointer.
+ * No out-of-bound checks, a caller must sanitize it on its side.
+ */
+#define _to_gen_md(ptr, locptr, locmd) ({ \
+ struct xdp_meta_generic *locmd; \
+ typeof(ptr) locptr = (ptr); \
+ \
+ if (__same_type(*locptr, typeof(locmd->tx))) \
+ locmd = (void *)locptr - offsetof(typeof(*locmd), tx); \
+ else if (__same_type(*locptr, typeof(locmd->rx_full))) \
+ locmd = (void *)locptr - offsetof(typeof(*locmd), rx_full); \
+ else if (__same_type(*locptr, typeof(locmd->rx))) \
+ locmd = (void *)locptr - offsetof(typeof(*locmd), rx); \
+ else if (__same_type(*locptr, typeof(locmd->id))) \
+ locmd = (void *)locptr - offsetof(typeof(*locmd), id); \
+ else if (__same_type(*locptr, typeof(locmd)) || \
+ __same_type(*locptr, void)) \
+ locmd = (void *)locptr; \
+ else \
+ BUILD_BUG(); \
+ \
+ locmd; \
+})
+#define to_gen_md(ptr) _to_gen_md((ptr), __UNIQUE_ID(ptr_), __UNIQUE_ID(md_))
+
+/* This allows passing an xdp_meta_generic pointer instead of an
+ * xdp_meta_generic::rx{,_full} pointer for convenience.
+ */
+#define _to_rx_md(ptr, locptr, locmd) ({ \
+ struct xdp_meta_generic_rx *locmd; \
+ typeof(ptr) locptr = (ptr); \
+ \
+ if (__same_type(*locptr, struct xdp_meta_generic_rx)) \
+ locmd = (struct xdp_meta_generic_rx *)locptr; \
+ else if (__same_type(*locptr, struct xdp_meta_generic) || \
+ __same_type(*locptr, void)) \
+ locmd = &((struct xdp_meta_generic *)locptr)->rx_full; \
+ else \
+ BUILD_BUG(); \
+ \
+ locmd; \
+})
+#define to_rx_md(ptr) _to_rx_md((ptr), __UNIQUE_ID(ptr_), __UNIQUE_ID(md_))
+
+/**
+ * xdp_meta_generic_ptr - get a pointer to the generic metadata before a frame
+ * @data: a pointer to the beginning of the frame
+ *
+ * Note: the function does not perform any access sanity checks, they should
+ * be done manually prior to calling it.
+ *
+ * Returns a pointer to the beginning of the generic metadata.
+ */
+static inline struct xdp_meta_generic *xdp_meta_generic_ptr(const void *data)
+{
+ BUILD_BUG_ON(xdp_metalen_invalid(sizeof(struct xdp_meta_generic)));
+
+ return (void *)data - sizeof(struct xdp_meta_generic);
+}
+
+/**
+ * xdp_meta_has_generic - check whether a frame has a generic meta in front
+ * @data: a pointer to the beginning of the frame
+ *
+ * Returns true if it does, false otherwise.
+ */
+static inline bool xdp_meta_has_generic(const void *data)
+{
+ return xdp_meta_generic_ptr(data)->magic_id ==
+ cpu_to_le16(XDP_META_GENERIC_MAGIC);
+}
+
+/**
+ * xdp_meta_skb_has_generic - check whether an skb has a generic meta
+ * @skb: a pointer to the &sk_buff
+ *
+ * Note: must be called only when skb_mac_header_was_set(skb) == true.
+ *
+ * Returns true if it does, false otherwise.
+ */
+static inline bool xdp_meta_skb_has_generic(const struct sk_buff *skb)
+{
+ return xdp_meta_has_generic(skb_metadata_end(skb));
+}
+
#endif /* __LINUX_NET_XDP_META_H__ */
--
2.36.1
^ permalink raw reply [flat|nested] 98+ messages in thread
* [xdp-hints] [PATCH RFC bpf-next 26/52] bpf, btf: add a pair of functions to work with the BTF ID + type ID pair
2022-06-28 19:47 [xdp-hints] [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata Alexander Lobakin
` (24 preceding siblings ...)
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 25/52] net, xdp: add basic generic metadata accessors Alexander Lobakin
@ 2022-06-28 19:47 ` Alexander Lobakin
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 27/52] net, xdp: add &sk_buff <-> &xdp_meta_generic converters Alexander Lobakin
` (26 subsequent siblings)
52 siblings, 0 replies; 98+ messages in thread
From: Alexander Lobakin @ 2022-06-28 19:47 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
Cc: Alexander Lobakin, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Toke Hoiland-Jorgensen,
Lorenzo Bianconi, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jesse Brandeburg, John Fastabend, Yajun Deng,
Willem de Bruijn, bpf, netdev, linux-kernel, xdp-hints
Add a kernel counterpart of libbpf_get_type_btf_id() to easily get
the pair of BTF ID << 32 | type ID for the provided type. Drivers
and the XDP core will use it to handle different XDP generic
metadata formats.
Also add a function that returns the index of the matching type
string (e.g. "struct foo") in an array of such strings for a given
BTF ID + type ID pair. The intention is to be able to quickly
identify an ID received from somewhere else and to assign own
constant identifiers to the supported types.
To not do:
priv->foo_id = bpf_get_type_btf_id("struct foo");
priv->bar_id = bpf_get_type_btf_id("struct bar");
[...]
if (id == priv->foo_id)
do_smth_for_foo();
else if (id == priv->bar_id)
do_smth_for_bar();
else
unsupp();
but instead:
const char * const supp[] = {
[FOO_ID] = "struct foo",
[BAR_ID] = "struct bar",
NULL, // serves as a terminator, can be ""
};
[...]
type = bpf_match_type_btf_id(supp, id);
switch(type) {
case FOO_ID:
do_smth_for_foo();
break;
case BAR_ID:
do_smth_for_bar();
break;
default:
unsupp();
break;
}
Aux functions:
* btf_kind_from_str(): returns the kind of the provided full type
string and removes the kind identifier to e.g. be able to pass it
directly to btf_find_by_name_kind(). For example, "struct foo"
becomes "foo" and the return value will be BTF_KIND_STRUCT.
* btf_get_by_id() is a shorthand to quickly get the BTF by its ID,
factored out from btf_get_fd_by_id().
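A probe-time sketch of how a driver could cache the full ID of its
own metadata type ("struct mydrv_xdp_meta" and `priv->meta_id` are
hypothetical):
	u64 id;
	int ret;

	ret = bpf_get_type_btf_id("struct mydrv_xdp_meta", &id);
	if (ret)
		return ret;

	/* later compared against the ID found in the meta area */
	priv->meta_id = id;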
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
---
include/linux/btf.h | 13 +++++
kernel/bpf/btf.c | 133 ++++++++++++++++++++++++++++++++++++++++++--
2 files changed, 140 insertions(+), 6 deletions(-)
diff --git a/include/linux/btf.h b/include/linux/btf.h
index 1bfed7fa0428..36bc9c499409 100644
--- a/include/linux/btf.h
+++ b/include/linux/btf.h
@@ -386,6 +386,8 @@ int register_btf_kfunc_id_set(enum bpf_prog_type prog_type,
s32 btf_find_dtor_kfunc(struct btf *btf, u32 btf_id);
int register_btf_id_dtor_kfuncs(const struct btf_id_dtor_kfunc *dtors, u32 add_cnt,
struct module *owner);
+int bpf_get_type_btf_id(const char *type, u64 *res_id);
+int bpf_match_type_btf_id(const char * const *list, u64 id);
#else
static inline const struct btf_type *btf_type_by_id(const struct btf *btf,
u32 type_id)
@@ -418,6 +420,17 @@ static inline int register_btf_id_dtor_kfuncs(const struct btf_id_dtor_kfunc *dt
{
return 0;
}
+static inline int bpf_get_type_btf_id(const char *type, u64 *res_id)
+{
+ if (res_id)
+ *res_id = 0;
+
+ return -ENOSYS;
+}
+static inline int bpf_match_type_btf_id(const char * const *list, u64 id)
+{
+ return -ENOSYS;
+}
#endif
#endif
diff --git a/kernel/bpf/btf.c b/kernel/bpf/btf.c
index 2e2066d6af94..dc316c43a348 100644
--- a/kernel/bpf/btf.c
+++ b/kernel/bpf/btf.c
@@ -317,6 +317,28 @@ const char *btf_type_str(const struct btf_type *t)
return btf_kind_str[BTF_INFO_KIND(t->info)];
}
+static u32 btf_kind_from_str(const char **type)
+{
+ const char *pos, *orig = *type;
+ u32 kind;
+ int len;
+
+ pos = strchr(orig, ' ');
+ if (pos) {
+ len = pos - orig;
+ *type = pos + 1;
+ } else {
+ len = strlen(orig);
+ }
+
+ for (kind = BTF_KIND_UNKN; kind < NR_BTF_KINDS; kind++) {
+ if (!strncasecmp(orig, btf_kind_str[kind], len))
+ break;
+ }
+
+ return kind < NR_BTF_KINDS ? kind : BTF_KIND_UNKN;
+}
+
/* Chunk size we use in safe copy of data to be shown. */
#define BTF_SHOW_OBJ_SAFE_SIZE 32
@@ -579,6 +601,110 @@ static s32 bpf_find_btf_id(const char *name, u32 kind, struct btf **btf_p)
return ret;
}
+/**
+ * bpf_get_type_btf_id - get the pair BTF ID + type ID for a given type
+ * @type: pointer to the name of the type to look for
+ * @res_id: pointer to write the result to
+ *
+ * Tries to find the BTF corresponding to the provided type (full string) and
+ * writes the pair BTF ID << 32 | type ID. Such encoded __u64 values are used
+ * in XDP generic-compatible metadata to distinguish between different
+ * metadata structures.
+ * @res_id can be %NULL to only check if a particular type exists within
+ * the BTF.
+ *
+ * Returns 0 in case of success, an error code otherwise.
+ */
+int bpf_get_type_btf_id(const char *type, u64 *res_id)
+{
+ struct btf *btf = NULL;
+ s32 type_id;
+ u32 kind;
+
+ if (res_id)
+ *res_id = 0;
+
+ if (!type || !*type)
+ return -EINVAL;
+
+ kind = btf_kind_from_str(&type);
+
+ type_id = bpf_find_btf_id(type, kind, &btf);
+ if (type_id > 0 && res_id)
+ *res_id = ((u64)btf_obj_id(btf) << 32) | type_id;
+
+ btf_put(btf);
+
+ return min(type_id, 0);
+}
+EXPORT_SYMBOL_GPL(bpf_get_type_btf_id);
+
+static struct btf *btf_get_by_id(u32 id)
+{
+ struct btf *btf;
+
+ rcu_read_lock();
+ btf = idr_find(&btf_idr, id);
+ if (!btf || !refcount_inc_not_zero(&btf->refcnt))
+ btf = ERR_PTR(-ENOENT);
+ rcu_read_unlock();
+
+ return btf;
+}
+
+/**
+ * bpf_match_type_btf_id - find a type name corresponding to a given full ID
+ * @list: pointer to the %NULL-terminated list of type names
+ * @id: full ID (BTF ID + type ID) of the type to look up
+ *
+ * Does the opposite of what bpf_get_type_btf_id() does: looks over the
+ * candidates in the %NULL-terminated @list and tries to find a match for
+ * the given ID. If found, returns its index.
+ *
+ * Returns a string array element index on success, an error code otherwise.
+ */
+int bpf_match_type_btf_id(const char * const *list, u64 id)
+{
+ const struct btf_type *t;
+ int ret = -ENOENT;
+ const char *name;
+ struct btf *btf;
+ u32 kind;
+
+ btf = btf_get_by_id(upper_32_bits(id));
+ if (IS_ERR(btf))
+ return PTR_ERR(btf);
+
+ t = btf_type_by_id(btf, lower_32_bits(id));
+ if (!t)
+ goto err_put;
+
+ name = btf_name_by_offset(btf, t->name_off);
+ if (!name) {
+ ret = -EINVAL;
+ goto err_put;
+ }
+
+ kind = BTF_INFO_KIND(t->info);
+
+ for (u32 i = 0; ; i++) {
+ const char *cand = list[i];
+
+ if (!cand)
+ break;
+
+ if (btf_kind_from_str(&cand) == kind && !strcmp(cand, name)) {
+ ret = i;
+ break;
+ }
+ }
+
+err_put:
+ btf_put(btf);
+
+ return ret;
+}
+
const struct btf_type *btf_type_skip_modifiers(const struct btf *btf,
u32 id, u32 *res_id)
{
@@ -6804,12 +6930,7 @@ int btf_get_fd_by_id(u32 id)
struct btf *btf;
int fd;
- rcu_read_lock();
- btf = idr_find(&btf_idr, id);
- if (!btf || !refcount_inc_not_zero(&btf->refcnt))
- btf = ERR_PTR(-ENOENT);
- rcu_read_unlock();
-
+ btf = btf_get_by_id(id);
if (IS_ERR(btf))
return PTR_ERR(btf);
--
2.36.1
^ permalink raw reply [flat|nested] 98+ messages in thread
* [xdp-hints] [PATCH RFC bpf-next 27/52] net, xdp: add &sk_buff <-> &xdp_meta_generic converters
2022-06-28 19:47 [xdp-hints] [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata Alexander Lobakin
` (25 preceding siblings ...)
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 26/52] bpf, btf: add a pair of functions to work with the BTF ID + type ID pair Alexander Lobakin
@ 2022-06-28 19:47 ` Alexander Lobakin
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 28/52] net, xdp: prefetch data a bit when building an skb from an &xdp_frame Alexander Lobakin
` (25 subsequent siblings)
52 siblings, 0 replies; 98+ messages in thread
From: Alexander Lobakin @ 2022-06-28 19:47 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
Cc: Alexander Lobakin, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Toke Hoiland-Jorgensen,
Lorenzo Bianconi, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jesse Brandeburg, John Fastabend, Yajun Deng,
Willem de Bruijn, bpf, netdev, linux-kernel, xdp-hints
Add two functions (with their underscored versions) to pass
HW-originated info (checksums, hashes, Rx queue ID, etc.) from an
skb to an XDP generic metadata and vice versa. They can be used to
carry that info between hardware, xdp_buff/xdp_frame and sk_buff.
The &sk_buff -> &xdp_meta_generic converter uses a static,
init-time-filled full ID (xdp_meta_generic_id) to avoid querying
BTF info on the hotpath.
For the fields whose values are assigned directly, make sure they
match with the help of static asserts.
Also add xdp_meta_match_id(), a wrapper over bpf_match_type_btf_id()
designed especially for drivers, which takes care of corner cases.
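A sketch of the intended driver flow on Rx (hypothetical `metasize`;
note both converters require a valid skb->mac_header, i.e. they run
after eth_type_trans()):
	skb->protocol = eth_type_trans(skb, dev);

	skb_metadata_set(skb, metasize);
	/* checksum/hash/VLAN/etc. land in the skb fields */
	xdp_populate_skb_meta_generic(skb);
and the reverse direction, e.g. on the Generic XDP path, is a single
call:
	xdp_build_meta_generic_from_skb(skb);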
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
---
include/net/xdp_meta.h | 112 +++++++++++++++++++++++++++++++
net/bpf/core.c | 148 ++++++++++++++++++++++++++++++++++++++++-
2 files changed, 259 insertions(+), 1 deletion(-)
diff --git a/include/net/xdp_meta.h b/include/net/xdp_meta.h
index f61831e39eb0..d37ea873a6a8 100644
--- a/include/net/xdp_meta.h
+++ b/include/net/xdp_meta.h
@@ -46,6 +46,17 @@ static inline bool xdp_metalen_invalid(unsigned long metalen)
return (metalen & (sizeof(u32) - 1)) || metalen > max;
}
+/* We use direct assignments from &xdp_meta_generic to &sk_buff fields,
+ * thus they must match.
+ */
+static_assert((u32)XDP_META_RX_CSUM_NONE == (u32)CHECKSUM_NONE);
+static_assert((u32)XDP_META_RX_CSUM_OK == (u32)CHECKSUM_UNNECESSARY);
+static_assert((u32)XDP_META_RX_CSUM_COMP == (u32)CHECKSUM_COMPLETE);
+static_assert((u32)XDP_META_RX_HASH_NONE == (u32)PKT_HASH_TYPE_NONE);
+static_assert((u32)XDP_META_RX_HASH_L2 == (u32)PKT_HASH_TYPE_L2);
+static_assert((u32)XDP_META_RX_HASH_L3 == (u32)PKT_HASH_TYPE_L3);
+static_assert((u32)XDP_META_RX_HASH_L4 == (u32)PKT_HASH_TYPE_L4);
+
/* This builds _get(), _set() and _rep() for each bitfield.
* If you know for sure the field is empty (e.g. you zeroed the struct
* previously), use faster _set() op to save several cycles, otherwise
@@ -283,4 +294,105 @@ static inline bool xdp_meta_skb_has_generic(const struct sk_buff *skb)
return xdp_meta_has_generic(skb_metadata_end(skb));
}
+/**
+ * xdp_meta_init - initialize a metadata structure
+ * @md: pointer to xdp_meta_generic or its ::rx_full or its ::id member
+ * @id: full BTF + type ID for the metadata type (can be u* or __le64)
+ *
+ * Zeroes the passed metadata struct (or part) and initializes its tail, so
+ * it becomes ready for further processing. If a driver is responsible for
+ * composing metadata, it is important to zero the space it occupies in each
+ * Rx buffer as `xdp->data - xdp->data_hard_start` doesn't get initialized
+ * by default.
+ */
+#define _xdp_meta_init(md, id, locmd, locid) ({ \
+ typeof(md) locmd = (md); \
+ typeof(id) locid = (id); \
+ \
+ if (offsetof(typeof(*locmd), full_id)) \
+ memset(locmd, 0, offsetof(typeof(*locmd), full_id)); \
+ \
+ locmd->full_id = __same_type(locid, __le64) ? (__force __le64)locid : \
+ cpu_to_le64((__force u64)locid); \
+ locmd->magic_id = cpu_to_le16(XDP_META_GENERIC_MAGIC); \
+})
+#define xdp_meta_init(md, id) \
+ _xdp_meta_init((md), (id), __UNIQUE_ID(md_), __UNIQUE_ID(id_))
+
+void ___xdp_build_meta_generic_from_skb(struct xdp_meta_generic_rx *rx_md,
+ const struct sk_buff *skb);
+void ___xdp_populate_skb_meta_generic(struct sk_buff *skb,
+ const struct xdp_meta_generic_rx *rx_md);
+
+#define _xdp_build_meta_generic_from_skb(md, skb, locmd) ({ \
+ typeof(md) locmd = (md); \
+ \
+ if (offsetof(typeof(*locmd), rx)) \
+ memset(locmd, 0, offsetof(typeof(*locmd), rx)); \
+ \
+ ___xdp_build_meta_generic_from_skb(to_rx_md(locmd), skb); \
+})
+#define __xdp_build_meta_generic_from_skb(md, skb) \
+ _xdp_build_meta_generic_from_skb((md), (skb), __UNIQUE_ID(md_))
+
+#define __xdp_populate_skb_meta_generic(skb, md) \
+ ___xdp_populate_skb_meta_generic((skb), to_rx_md(md))
+
+/**
+ * xdp_build_meta_generic_from_skb - build the generic meta before the skb data
+ * @skb: a pointer to the &sk_buff
+ *
+ * Builds an XDP generic metadata in front of the skb data from its fields.
+ * Note: skb->mac_header must be set and valid.
+ */
+static inline void xdp_build_meta_generic_from_skb(struct sk_buff *skb)
+{
+ struct xdp_meta_generic *md;
+ u32 needed;
+
+ /* skb_headroom() is `skb->data - skb->head`, i.e. it doesn't account
+ * for the pulled headers, e.g. MAC header. Metadata resides in front
+ * of the MAC header, so counting starts from there, not the current
+ * data pointer position.
+ * CoW won't happen here when coming from the Generic XDP path as it
+ * ensures that an skb has at least %XDP_PACKET_HEADROOM beforehand.
+ * It also won't happen as long as `sizeof(*md) <= NET_SKB_PAD`.
+ */
+ needed = (void *)skb->data - skb_metadata_end(skb) + sizeof(*md);
+ if (unlikely(skb_cow_head(skb, needed)))
+ return;
+
+ md = xdp_meta_generic_ptr(skb_metadata_end(skb));
+ __xdp_build_meta_generic_from_skb(md, skb);
+
+ skb_metadata_set(skb, sizeof(*md));
+ skb_metadata_nocomp_set(skb);
+}
+
+/**
+ * xdp_populate_skb_meta_generic - fill an skb from the metadata in front of it
+ * @skb: a pointer to the &sk_buff
+ *
+ * Fills the skb fields from the metadata in front of its MAC header and marks
+ * its metadata as "non-comparable".
+ * Note: skb->mac_header must be set and valid.
+ */
+static inline void xdp_populate_skb_meta_generic(struct sk_buff *skb)
+{
+ const struct xdp_meta_generic *md;
+
+ if (skb_metadata_len(skb) < sizeof(*md))
+ return;
+
+ md = xdp_meta_generic_ptr(skb_metadata_end(skb));
+ __xdp_populate_skb_meta_generic(skb, md);
+
+ /* We know at this point that skb metadata may contain
+ * unique values, mark it as nocomp to not confuse GRO.
+ */
+ skb_metadata_nocomp_set(skb);
+}
+
+int xdp_meta_match_id(const char * const *list, u64 id);
+
#endif /* __LINUX_NET_XDP_META_H__ */
diff --git a/net/bpf/core.c b/net/bpf/core.c
index 18174d6d8687..a8685bcc6e00 100644
--- a/net/bpf/core.c
+++ b/net/bpf/core.c
@@ -3,7 +3,7 @@
*
* Copyright (c) 2017 Jesper Dangaard Brouer, Red Hat Inc.
*/
-#include <linux/bpf.h>
+#include <linux/btf.h>
#include <linux/filter.h>
#include <linux/types.h>
#include <linux/mm.h>
@@ -713,3 +713,149 @@ struct xdp_frame *xdpf_clone(struct xdp_frame *xdpf)
return nxdpf;
}
+
+/**
+ * xdp_meta_match_id - find a type name corresponding to a given full ID
+ * @list: pointer to the %NULL-terminated list of type names
+ * @id: full ID (BTF ID + type ID) of the type to look up
+ *
+ * Convenience wrapper over bpf_match_type_btf_id() for use in drivers, which
+ * takes care of a zeroed ID and of the BPF syscall not being compiled in (to
+ * not break the code flow and return "no meta").
+ *
+ * Returns a string array element index on success, an error code otherwise.
+ */
+int xdp_meta_match_id(const char * const *list, u64 id)
+{
+ int ret;
+
+ if (unlikely(!list || !*list))
+ return id ? -EINVAL : 0;
+
+ ret = bpf_match_type_btf_id(list, id);
+ if (ret == -ENOSYS || !id) {
+ for (ret = 0; list[ret]; ret++)
+ ;
+ }
+
+ return ret;
+}
+EXPORT_SYMBOL_GPL(xdp_meta_match_id);
+
+/* Used in __xdp_build_meta_generic_from_skb() to quickly get the ID
+ * on hotpath.
+ */
+static __le64 xdp_meta_generic_id __ro_after_init;
+
+static int __init xdp_meta_generic_id_init(void)
+{
+ int ret;
+ u64 id;
+
+ ret = bpf_get_type_btf_id("struct xdp_meta_generic", &id);
+ xdp_meta_generic_id = cpu_to_le64(id);
+
+ return ret;
+}
+late_initcall(xdp_meta_generic_id_init);
+
+#define _xdp_meta_rx_hash_type_from_skb(skb, locskb) ({ \
+ typeof(skb) locskb = (skb); \
+ \
+ likely((locskb)->l4_hash) ? XDP_META_RX_HASH_L4 : \
+ skb_get_hash_raw(locskb) ? XDP_META_RX_HASH_L3 : \
+ XDP_META_RX_HASH_NONE; \
+})
+#define xdp_meta_rx_hash_type_from_skb(skb) \
+ _xdp_meta_rx_hash_type_from_skb((skb), __UNIQUE_ID(skb_))
+
+#define xdp_meta_rx_vlan_from_prot(skb) ({ \
+ (skb)->vlan_proto == htons(ETH_P_8021Q) ? \
+ XDP_META_RX_CVID : XDP_META_RX_SVID; \
+})
+
+#define xdp_meta_rx_vlan_to_prot(md) ({ \
+ xdp_meta_rx_vlan_type_get(md) == XDP_META_RX_CVID ? \
+ htons(ETH_P_8021Q) : htons(ETH_P_8021AD); \
+})
+
+/**
+ * ___xdp_build_meta_generic_from_skb - fill a generic metadata from an skb
+ * @rx_md: a pointer to the XDP generic metadata to be filled
+ * @skb: a pointer to the skb to take the info from
+ *
+ * Fills a given generic metadata struct with the info set previously in
+ * an skb. @rx_md can point anywhere and the function doesn't use the
+ * skb_metadata_{end,len}().
+ */
+void ___xdp_build_meta_generic_from_skb(struct xdp_meta_generic_rx *rx_md,
+ const struct sk_buff *skb)
+{
+ struct xdp_meta_generic *md = to_gen_md(rx_md);
+ ktime_t ts;
+
+ xdp_meta_init(rx_md, xdp_meta_generic_id);
+
+ xdp_meta_rx_csum_level_set(md, skb->csum_level);
+ xdp_meta_rx_csum_status_set(md, skb->ip_summed);
+ xdp_meta_rx_csum_set(md, skb->csum);
+
+ xdp_meta_rx_hash_set(md, skb_get_hash_raw(skb));
+ xdp_meta_rx_hash_type_set(md, xdp_meta_rx_hash_type_from_skb(skb));
+
+ if (likely(skb_rx_queue_recorded(skb))) {
+ xdp_meta_rx_qid_present_set(md, 1);
+ xdp_meta_rx_qid_set(md, skb_get_rx_queue(skb));
+ }
+
+ if (skb_vlan_tag_present(skb)) {
+ xdp_meta_rx_vlan_type_set(md, xdp_meta_rx_vlan_from_prot(skb));
+ xdp_meta_rx_vid_set(md, skb_vlan_tag_get(skb));
+ }
+
+ ts = skb_hwtstamps(skb)->hwtstamp;
+ if (ts) {
+ xdp_meta_rx_tstamp_present_set(md, 1);
+ xdp_meta_rx_tstamp_set(md, ktime_to_ns(ts));
+ }
+}
+EXPORT_SYMBOL_GPL(___xdp_build_meta_generic_from_skb);
+
+/**
+ * ___xdp_populate_skb_meta_generic - fill the skb fields from a generic meta
+ * @skb: a pointer to the skb to be filled
+ * @rx_md: a pointer to the generic metadata to take the values from
+ *
+ * Populates the &sk_buff fields from a given XDP generic metadata. The meta
+ * can come from anywhere; the function doesn't use skb_metadata_{end,len}().
+ * Checks whether the metadata is generic-compatible before accessing other
+ * fields.
+ */
+void ___xdp_populate_skb_meta_generic(struct sk_buff *skb,
+ const struct xdp_meta_generic_rx *rx_md)
+{
+ const struct xdp_meta_generic *md = to_gen_md(rx_md);
+
+ if (unlikely(!xdp_meta_has_generic(md + 1)))
+ return;
+
+ skb->csum_level = xdp_meta_rx_csum_level_get(md);
+ skb->ip_summed = xdp_meta_rx_csum_status_get(md);
+ skb->csum = xdp_meta_rx_csum_get(md);
+
+ skb_set_hash(skb, xdp_meta_rx_hash_get(md),
+ xdp_meta_rx_hash_type_get(md));
+
+ if (likely(xdp_meta_rx_qid_present_get(md)))
+ skb_record_rx_queue(skb, xdp_meta_rx_qid_get(md));
+
+ if (xdp_meta_rx_vlan_type_get(md))
+ __vlan_hwaccel_put_tag(skb, xdp_meta_rx_vlan_to_prot(md),
+ xdp_meta_rx_vid_get(md));
+
+ if (xdp_meta_rx_tstamp_present_get(md))
+ *skb_hwtstamps(skb) = (struct skb_shared_hwtstamps){
+ .hwtstamp = ns_to_ktime(xdp_meta_rx_tstamp_get(md)),
+ };
+}
+EXPORT_SYMBOL_GPL(___xdp_populate_skb_meta_generic);
--
2.36.1
^ permalink raw reply [flat|nested] 98+ messages in thread
* [xdp-hints] [PATCH RFC bpf-next 28/52] net, xdp: prefetch data a bit when building an skb from an &xdp_frame
2022-06-28 19:47 [xdp-hints] [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata Alexander Lobakin
` (26 preceding siblings ...)
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 27/52] net, xdp: add &sk_buff <-> &xdp_meta_generic converters Alexander Lobakin
@ 2022-06-28 19:47 ` Alexander Lobakin
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 29/52] net, xdp: try to fill skb fields when converting " Alexander Lobakin
` (24 subsequent siblings)
52 siblings, 0 replies; 98+ messages in thread
From: Alexander Lobakin @ 2022-06-28 19:47 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
Cc: Alexander Lobakin, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Toke Hoiland-Jorgensen,
Lorenzo Bianconi, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jesse Brandeburg, John Fastabend, Yajun Deng,
Willem de Bruijn, bpf, netdev, linux-kernel, xdp-hints
Different cpumap tests showed that a couple of small careful
prefetches help the performance. The only thing is to not go crazy:
prefetch only one cacheline to the right of the frame start and one
to the left -- if there is metadata in front.
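To illustrate, the same cacheline check as a standalone sketch (the
helper name is made up, the logic mirrors the hunk below):

#include <linux/align.h>
#include <linux/cache.h>
#include <linux/minmax.h>
#include <linux/prefetch.h>
#include <linux/types.h>

/* Prefetch the first cacheline of the packet data plus the tail of
 * the metadata in front of it -- but only when that tail doesn't
 * share a cacheline with the data itself.
 */
static inline void prefetch_data_and_meta(const void *data, u32 metasize)
{
	u32 dist = min_t(u32, metasize, L1_CACHE_BYTES);

	if (dist && PTR_ALIGN_DOWN(data - dist, L1_CACHE_BYTES) !=
		    PTR_ALIGN_DOWN(data, L1_CACHE_BYTES))
		prefetch(data - dist);

	prefetch(data);
}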
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
---
net/bpf/core.c | 22 +++++++++++++++++++---
1 file changed, 19 insertions(+), 3 deletions(-)
diff --git a/net/bpf/core.c b/net/bpf/core.c
index a8685bcc6e00..775f9648e8cf 100644
--- a/net/bpf/core.c
+++ b/net/bpf/core.c
@@ -620,10 +620,26 @@ struct sk_buff *__xdp_build_skb_from_frame(struct xdp_frame *xdpf,
struct net_device *dev)
{
struct skb_shared_info *sinfo = xdp_get_shared_info_from_frame(xdpf);
+ u32 dist, metasize = xdpf->metasize;
unsigned int headroom, frame_size;
+ void *data = xdpf->data;
void *hard_start;
u8 nr_frags;
+ /* Bring the headers to the current CPU, as well as the
+ * metadata if present. This helps eth_type_trans() and
+ * xdp_populate_skb_meta_generic().
+ * The idea here is to prefetch no more than 2 cachelines:
+ * one to the left from the data start and one to the right.
+ */
+#define to_cl(ptr) PTR_ALIGN_DOWN(ptr, L1_CACHE_BYTES)
+ dist = min_t(typeof(dist), metasize, L1_CACHE_BYTES);
+ if (dist && to_cl(data - dist) != to_cl(data))
+ prefetch(data - dist);
+#undef to_cl
+
+ prefetch(data);
+
/* xdp frags frame */
if (unlikely(xdp_frame_has_frags(xdpf)))
nr_frags = sinfo->nr_frags;
@@ -636,15 +652,15 @@ struct sk_buff *__xdp_build_skb_from_frame(struct xdp_frame *xdpf,
*/
frame_size = xdpf->frame_sz;
- hard_start = xdpf->data - headroom;
+ hard_start = data - headroom;
skb = build_skb_around(skb, hard_start, frame_size);
if (unlikely(!skb))
return NULL;
skb_reserve(skb, headroom);
__skb_put(skb, xdpf->len);
- if (xdpf->metasize)
- skb_metadata_set(skb, xdpf->metasize);
+ if (metasize)
+ skb_metadata_set(skb, metasize);
if (unlikely(xdp_frame_has_frags(xdpf)))
xdp_update_skb_shared_info(skb, nr_frags,
--
2.36.1
* [xdp-hints] [PATCH RFC bpf-next 29/52] net, xdp: try to fill skb fields when converting from an &xdp_frame
2022-06-28 19:47 [xdp-hints] [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata Alexander Lobakin
` (27 preceding siblings ...)
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 28/52] net, xdp: prefetch data a bit when building an skb from an &xdp_frame Alexander Lobakin
@ 2022-06-28 19:47 ` Alexander Lobakin
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 30/52] net, gro: decouple GRO from the NAPI layer Alexander Lobakin
` (23 subsequent siblings)
52 siblings, 0 replies; 98+ messages in thread
From: Alexander Lobakin @ 2022-06-28 19:47 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
Cc: Alexander Lobakin, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Toke Hoiland-Jorgensen,
Lorenzo Bianconi, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jesse Brandeburg, John Fastabend, Yajun Deng,
Willem de Bruijn, bpf, netdev, linux-kernel, xdp-hints
In __xdp_build_skb_from_frame(), if there's metadata in front of
the data, check whether it's generic-compatible metadata and try to
populate the HW-originated skb fields: checksum status, hash etc.
As xdp_populate_skb_meta_generic() requires skb->mac_header to be
set and valid, call skb_reset_mac_header() first: skb->data at this
point is pointing (sic!) to the MAC header.
The two most obvious users are cpumap and veth.
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
---
net/bpf/core.c | 11 ++++-------
1 file changed, 4 insertions(+), 7 deletions(-)
diff --git a/net/bpf/core.c b/net/bpf/core.c
index 775f9648e8cf..d2d01b8e6441 100644
--- a/net/bpf/core.c
+++ b/net/bpf/core.c
@@ -659,8 +659,11 @@ struct sk_buff *__xdp_build_skb_from_frame(struct xdp_frame *xdpf,
skb_reserve(skb, headroom);
__skb_put(skb, xdpf->len);
- if (metasize)
+ if (metasize) {
+ skb_reset_mac_header(skb);
skb_metadata_set(skb, metasize);
+ xdp_populate_skb_meta_generic(skb);
+ }
if (unlikely(xdp_frame_has_frags(xdpf)))
xdp_update_skb_shared_info(skb, nr_frags,
@@ -671,12 +674,6 @@ struct sk_buff *__xdp_build_skb_from_frame(struct xdp_frame *xdpf,
/* Essential SKB info: protocol and skb->dev */
skb->protocol = eth_type_trans(skb, dev);
- /* Optional SKB info, currently missing:
- * - HW checksum info (skb->ip_summed)
- * - HW RX hash (skb_set_hash)
- * - RX ring dev queue index (skb_record_rx_queue)
- */
-
/* Until page_pool get SKB return path, release DMA here */
xdp_release_frame(xdpf);
--
2.36.1
* [xdp-hints] [PATCH RFC bpf-next 30/52] net, gro: decouple GRO from the NAPI layer
2022-06-28 19:47 [xdp-hints] [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata Alexander Lobakin
` (28 preceding siblings ...)
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 29/52] net, xdp: try to fill skb fields when converting " Alexander Lobakin
@ 2022-06-28 19:47 ` Alexander Lobakin
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 31/52] net, gro: expose some GRO API to use outside of NAPI Alexander Lobakin
` (22 subsequent siblings)
52 siblings, 0 replies; 98+ messages in thread
From: Alexander Lobakin @ 2022-06-28 19:47 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
Cc: Alexander Lobakin, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Toke Hoiland-Jorgensen,
Lorenzo Bianconi, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jesse Brandeburg, John Fastabend, Yajun Deng,
Willem de Bruijn, bpf, netdev, linux-kernel, xdp-hints
In fact, these two are not tied closely to each other. The only
requirements for GRO are to use it in BH context and to have some
sane limit on the packet batches, e.g. NAPI is limited by its
budget (64/8/etc.).
Factor out the purely GRO fields into a new structure, &gro_node.
Embed it into &napi_struct and adjust all the references. ::timer
was moved because it is more tied to GRO than to NAPI: the timer
drives the decision whether to do a full or a partial flush.
This does not make GRO ready to use outside of the NAPI context
yet.
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
---
drivers/net/ethernet/brocade/bna/bnad.c | 1 +
drivers/net/ethernet/cortina/gemini.c | 1 +
include/linux/netdevice.h | 19 ++++---
include/net/gro.h | 35 ++++++++----
net/core/dev.c | 75 +++++++++++--------------
net/core/gro.c | 63 ++++++++++-----------
6 files changed, 103 insertions(+), 91 deletions(-)
diff --git a/drivers/net/ethernet/brocade/bna/bnad.c b/drivers/net/ethernet/brocade/bna/bnad.c
index f6fe08df568b..8bcae1616b15 100644
--- a/drivers/net/ethernet/brocade/bna/bnad.c
+++ b/drivers/net/ethernet/brocade/bna/bnad.c
@@ -19,6 +19,7 @@
#include <linux/ip.h>
#include <linux/prefetch.h>
#include <linux/module.h>
+#include <net/gro.h>
#include "bnad.h"
#include "bna.h"
diff --git a/drivers/net/ethernet/cortina/gemini.c b/drivers/net/ethernet/cortina/gemini.c
index 9e6de2f968fa..6f208ce457dd 100644
--- a/drivers/net/ethernet/cortina/gemini.c
+++ b/drivers/net/ethernet/cortina/gemini.c
@@ -40,6 +40,7 @@
#include <linux/in.h>
#include <linux/ip.h>
#include <linux/ipv6.h>
+#include <net/gro.h>
#include "gemini.h"
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index bc2d82a3d0de..60df42b3f116 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -318,11 +318,19 @@ struct gro_list {
};
/*
- * size of gro hash buckets, must less than bit number of
- * napi_struct::gro_bitmask
+ * size of gro hash buckets, must be <= the number of bits in
+ * gro_node::bitmask
*/
#define GRO_HASH_BUCKETS 8
+struct gro_node {
+ unsigned long bitmask; /* Mask of used buckets */
+ struct gro_list hash[GRO_HASH_BUCKETS]; /* Pending GRO skbs */
+ struct list_head rx_list; /* Pending GRO_NORMAL skbs */
+ int rx_count; /* Length of rx_list */
+ struct hrtimer timer; /* Timer for deferred flush */
+};
+
/*
* Structure for NAPI scheduling similar to tasklet but with weighting
*/
@@ -338,17 +346,13 @@ struct napi_struct {
unsigned long state;
int weight;
int defer_hard_irqs_count;
- unsigned long gro_bitmask;
int (*poll)(struct napi_struct *, int);
#ifdef CONFIG_NETPOLL
int poll_owner;
#endif
struct net_device *dev;
- struct gro_list gro_hash[GRO_HASH_BUCKETS];
+ struct gro_node gro;
struct sk_buff *skb;
- struct list_head rx_list; /* Pending GRO_NORMAL skbs */
- int rx_count; /* length of rx_list */
- struct hrtimer timer;
struct list_head dev_list;
struct hlist_node napi_hash_node;
unsigned int napi_id;
@@ -3788,7 +3792,6 @@ int netif_receive_skb_core(struct sk_buff *skb);
void netif_receive_skb_list_internal(struct list_head *head);
void netif_receive_skb_list(struct list_head *head);
gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb);
-void napi_gro_flush(struct napi_struct *napi, bool flush_old);
struct sk_buff *napi_get_frags(struct napi_struct *napi);
gro_result_t napi_gro_frags(struct napi_struct *napi);
struct packet_offload *gro_find_receive_by_type(__be16 type);
diff --git a/include/net/gro.h b/include/net/gro.h
index 867656b0739c..75211ebd8765 100644
--- a/include/net/gro.h
+++ b/include/net/gro.h
@@ -421,26 +421,41 @@ static inline __wsum ip6_gro_compute_pseudo(struct sk_buff *skb, int proto)
}
int skb_gro_receive(struct sk_buff *p, struct sk_buff *skb);
+void __gro_flush(struct gro_node *gro, bool flush_old);
+
+static inline void gro_flush(struct gro_node *gro, bool flush_old)
+{
+ if (!gro->bitmask)
+ return;
+
+ __gro_flush(gro, flush_old);
+}
+
+static inline void napi_gro_flush(struct napi_struct *napi, bool flush_old)
+{
+ gro_flush(&napi->gro, flush_old);
+}
/* Pass the currently batched GRO_NORMAL SKBs up to the stack. */
-static inline void gro_normal_list(struct napi_struct *napi)
+static inline void gro_normal_list(struct gro_node *gro)
{
- if (!napi->rx_count)
+ if (!gro->rx_count)
return;
- netif_receive_skb_list_internal(&napi->rx_list);
- INIT_LIST_HEAD(&napi->rx_list);
- napi->rx_count = 0;
+ netif_receive_skb_list_internal(&gro->rx_list);
+ INIT_LIST_HEAD(&gro->rx_list);
+ gro->rx_count = 0;
}
/* Queue one GRO_NORMAL SKB up for list processing. If batch size exceeded,
* pass the whole batch up to the stack.
*/
-static inline void gro_normal_one(struct napi_struct *napi, struct sk_buff *skb, int segs)
+static inline void gro_normal_one(struct gro_node *gro, struct sk_buff *skb,
+ int segs)
{
- list_add_tail(&skb->list, &napi->rx_list);
- napi->rx_count += segs;
- if (napi->rx_count >= gro_normal_batch)
- gro_normal_list(napi);
+ list_add_tail(&skb->list, &gro->rx_list);
+ gro->rx_count += segs;
+ if (gro->rx_count >= gro_normal_batch)
+ gro_normal_list(gro);
}
diff --git a/net/core/dev.c b/net/core/dev.c
index 52b64d24c439..8b334aa974c2 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5765,7 +5765,7 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
return false;
if (work_done) {
- if (n->gro_bitmask)
+ if (n->gro.bitmask)
timeout = READ_ONCE(n->dev->gro_flush_timeout);
n->defer_hard_irqs_count = READ_ONCE(n->dev->napi_defer_hard_irqs);
}
@@ -5775,15 +5775,13 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
if (timeout)
ret = false;
}
- if (n->gro_bitmask) {
- /* When the NAPI instance uses a timeout and keeps postponing
- * it, we need to bound somehow the time packets are kept in
- * the GRO layer
- */
- napi_gro_flush(n, !!timeout);
- }
- gro_normal_list(n);
+ /* When the NAPI instance uses a timeout and keeps postponing
+ * it, we need to bound somehow the time packets are kept in
+ * the GRO layer
+ */
+ gro_flush(&n->gro, !!timeout);
+ gro_normal_list(&n->gro);
if (unlikely(!list_empty(&n->poll_list))) {
/* If n->poll_list is not empty, we need to mask irqs */
@@ -5815,7 +5813,7 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
}
if (timeout)
- hrtimer_start(&n->timer, ns_to_ktime(timeout),
+ hrtimer_start(&n->gro.timer, ns_to_ktime(timeout),
HRTIMER_MODE_REL_PINNED);
return ret;
}
@@ -5839,19 +5837,17 @@ static struct napi_struct *napi_by_id(unsigned int napi_id)
static void __busy_poll_stop(struct napi_struct *napi, bool skip_schedule)
{
if (!skip_schedule) {
- gro_normal_list(napi);
+ gro_normal_list(&napi->gro);
__napi_schedule(napi);
return;
}
- if (napi->gro_bitmask) {
- /* flush too old packets
- * If HZ < 1000, flush all packets.
- */
- napi_gro_flush(napi, HZ >= 1000);
- }
+ /* flush too old packets
+ * If HZ < 1000, flush all packets.
+ */
+ gro_flush(&napi->gro, HZ >= 1000);
+ gro_normal_list(&napi->gro);
- gro_normal_list(napi);
clear_bit(NAPI_STATE_SCHED, &napi->state);
}
@@ -5880,7 +5876,7 @@ static void busy_poll_stop(struct napi_struct *napi, void *have_poll_lock, bool
napi->defer_hard_irqs_count = READ_ONCE(napi->dev->napi_defer_hard_irqs);
timeout = READ_ONCE(napi->dev->gro_flush_timeout);
if (napi->defer_hard_irqs_count && timeout) {
- hrtimer_start(&napi->timer, ns_to_ktime(timeout), HRTIMER_MODE_REL_PINNED);
+ hrtimer_start(&napi->gro.timer, ns_to_ktime(timeout), HRTIMER_MODE_REL_PINNED);
skip_schedule = true;
}
}
@@ -5947,7 +5943,7 @@ void napi_busy_loop(unsigned int napi_id,
}
work = napi_poll(napi, budget);
trace_napi_poll(napi, work, budget);
- gro_normal_list(napi);
+ gro_normal_list(&napi->gro);
count:
if (work > 0)
__NET_ADD_STATS(dev_net(napi->dev),
@@ -6015,7 +6011,7 @@ static enum hrtimer_restart napi_watchdog(struct hrtimer *timer)
{
struct napi_struct *napi;
- napi = container_of(timer, struct napi_struct, timer);
+ napi = container_of(timer, struct napi_struct, gro.timer);
/* Note : we use a relaxed variant of napi_schedule_prep() not setting
* NAPI_STATE_MISSED, since we do not react to a device IRQ.
@@ -6034,10 +6030,10 @@ static void init_gro_hash(struct napi_struct *napi)
int i;
for (i = 0; i < GRO_HASH_BUCKETS; i++) {
- INIT_LIST_HEAD(&napi->gro_hash[i].list);
- napi->gro_hash[i].count = 0;
+ INIT_LIST_HEAD(&napi->gro.hash[i].list);
+ napi->gro.hash[i].count = 0;
}
- napi->gro_bitmask = 0;
+ napi->gro.bitmask = 0;
}
int dev_set_threaded(struct net_device *dev, bool threaded)
@@ -6109,12 +6105,12 @@ void netif_napi_add_weight(struct net_device *dev, struct napi_struct *napi,
INIT_LIST_HEAD(&napi->poll_list);
INIT_HLIST_NODE(&napi->napi_hash_node);
- hrtimer_init(&napi->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED);
- napi->timer.function = napi_watchdog;
+ hrtimer_init(&napi->gro.timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED);
+ napi->gro.timer.function = napi_watchdog;
init_gro_hash(napi);
napi->skb = NULL;
- INIT_LIST_HEAD(&napi->rx_list);
- napi->rx_count = 0;
+ INIT_LIST_HEAD(&napi->gro.rx_list);
+ napi->gro.rx_count = 0;
napi->poll = poll;
if (weight > NAPI_POLL_WEIGHT)
netdev_err_once(dev, "%s() called with weight %d\n", __func__,
@@ -6159,7 +6155,7 @@ void napi_disable(struct napi_struct *n)
break;
}
- hrtimer_cancel(&n->timer);
+ hrtimer_cancel(&n->gro.timer);
clear_bit(NAPI_STATE_DISABLE, &n->state);
}
@@ -6194,9 +6190,9 @@ static void flush_gro_hash(struct napi_struct *napi)
for (i = 0; i < GRO_HASH_BUCKETS; i++) {
struct sk_buff *skb, *n;
- list_for_each_entry_safe(skb, n, &napi->gro_hash[i].list, list)
+ list_for_each_entry_safe(skb, n, &napi->gro.hash[i].list, list)
kfree_skb(skb);
- napi->gro_hash[i].count = 0;
+ napi->gro.hash[i].count = 0;
}
}
@@ -6211,7 +6207,7 @@ void __netif_napi_del(struct napi_struct *napi)
napi_free_frags(napi);
flush_gro_hash(napi);
- napi->gro_bitmask = 0;
+ napi->gro.bitmask = 0;
if (napi->thread) {
kthread_stop(napi->thread);
@@ -6268,14 +6264,11 @@ static int __napi_poll(struct napi_struct *n, bool *repoll)
return work;
}
- if (n->gro_bitmask) {
- /* flush too old packets
- * If HZ < 1000, flush all packets.
- */
- napi_gro_flush(n, HZ >= 1000);
- }
-
- gro_normal_list(n);
+ /* flush too old packets
+ * If HZ < 1000, flush all packets.
+ */
+ gro_flush(&n->gro, HZ >= 1000);
+ gro_normal_list(&n->gro);
/* Some drivers may have called napi_schedule
* prior to exhausting their budget.
@@ -10396,7 +10389,7 @@ static struct hlist_head * __net_init netdev_create_hash(void)
static int __net_init netdev_init(struct net *net)
{
BUILD_BUG_ON(GRO_HASH_BUCKETS >
- 8 * sizeof_field(struct napi_struct, gro_bitmask));
+ BITS_PER_BYTE * sizeof_field(struct gro_node, bitmask));
INIT_LIST_HEAD(&net->dev_base_head);
diff --git a/net/core/gro.c b/net/core/gro.c
index b4190eb08467..67fd587a87c9 100644
--- a/net/core/gro.c
+++ b/net/core/gro.c
@@ -278,8 +278,7 @@ int skb_gro_receive(struct sk_buff *p, struct sk_buff *skb)
return 0;
}
-
-static void napi_gro_complete(struct napi_struct *napi, struct sk_buff *skb)
+static void gro_complete(struct gro_node *gro, struct sk_buff *skb)
{
struct packet_offload *ptype;
__be16 type = skb->protocol;
@@ -312,43 +311,42 @@ static void napi_gro_complete(struct napi_struct *napi, struct sk_buff *skb)
}
out:
- gro_normal_one(napi, skb, NAPI_GRO_CB(skb)->count);
+ gro_normal_one(gro, skb, NAPI_GRO_CB(skb)->count);
}
-static void __napi_gro_flush_chain(struct napi_struct *napi, u32 index,
- bool flush_old)
+static void __gro_flush_chain(struct gro_node *gro, u32 index, bool flush_old)
{
- struct list_head *head = &napi->gro_hash[index].list;
+ struct list_head *head = &gro->hash[index].list;
struct sk_buff *skb, *p;
list_for_each_entry_safe_reverse(skb, p, head, list) {
if (flush_old && NAPI_GRO_CB(skb)->age == jiffies)
return;
skb_list_del_init(skb);
- napi_gro_complete(napi, skb);
- napi->gro_hash[index].count--;
+ gro_complete(gro, skb);
+ gro->hash[index].count--;
}
- if (!napi->gro_hash[index].count)
- __clear_bit(index, &napi->gro_bitmask);
+ if (!gro->hash[index].count)
+ __clear_bit(index, &gro->bitmask);
}
-/* napi->gro_hash[].list contains packets ordered by age.
+/* gro->hash[].list contains packets ordered by age.
* youngest packets at the head of it.
* Complete skbs in reverse order to reduce latencies.
*/
-void napi_gro_flush(struct napi_struct *napi, bool flush_old)
+void __gro_flush(struct gro_node *gro, bool flush_old)
{
- unsigned long bitmask = napi->gro_bitmask;
+ unsigned long bitmask = gro->bitmask;
unsigned int i, base = ~0U;
while ((i = ffs(bitmask)) != 0) {
bitmask >>= i;
base += i;
- __napi_gro_flush_chain(napi, base, flush_old);
+ __gro_flush_chain(gro, base, flush_old);
}
}
-EXPORT_SYMBOL(napi_gro_flush);
+EXPORT_SYMBOL(__gro_flush);
static void gro_list_prepare(const struct list_head *head,
const struct sk_buff *skb)
@@ -449,7 +447,7 @@ static void gro_pull_from_frag0(struct sk_buff *skb, int grow)
}
}
-static void gro_flush_oldest(struct napi_struct *napi, struct list_head *head)
+static void gro_flush_oldest(struct gro_node *gro, struct list_head *head)
{
struct sk_buff *oldest;
@@ -465,13 +463,14 @@ static void gro_flush_oldest(struct napi_struct *napi, struct list_head *head)
* SKB to the chain.
*/
skb_list_del_init(oldest);
- napi_gro_complete(napi, oldest);
+ gro_complete(gro, oldest);
}
-static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
+static enum gro_result dev_gro_receive(struct gro_node *gro,
+ struct sk_buff *skb)
{
u32 bucket = skb_get_hash_raw(skb) & (GRO_HASH_BUCKETS - 1);
- struct gro_list *gro_list = &napi->gro_hash[bucket];
+ struct gro_list *gro_list = &gro->hash[bucket];
struct list_head *head = &offload_base;
struct packet_offload *ptype;
__be16 type = skb->protocol;
@@ -530,7 +529,7 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff
if (pp) {
skb_list_del_init(pp);
- napi_gro_complete(napi, pp);
+ gro_complete(gro, pp);
gro_list->count--;
}
@@ -541,7 +540,7 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff
goto normal;
if (unlikely(gro_list->count >= MAX_GRO_SKBS))
- gro_flush_oldest(napi, &gro_list->list);
+ gro_flush_oldest(gro, &gro_list->list);
else
gro_list->count++;
@@ -558,10 +557,10 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff
gro_pull_from_frag0(skb, grow);
ok:
if (gro_list->count) {
- if (!test_bit(bucket, &napi->gro_bitmask))
- __set_bit(bucket, &napi->gro_bitmask);
- } else if (test_bit(bucket, &napi->gro_bitmask)) {
- __clear_bit(bucket, &napi->gro_bitmask);
+ if (!test_bit(bucket, &gro->bitmask))
+ __set_bit(bucket, &gro->bitmask);
+ } else if (test_bit(bucket, &gro->bitmask)) {
+ __clear_bit(bucket, &gro->bitmask);
}
return ret;
@@ -599,13 +598,12 @@ struct packet_offload *gro_find_complete_by_type(__be16 type)
}
EXPORT_SYMBOL(gro_find_complete_by_type);
-static gro_result_t napi_skb_finish(struct napi_struct *napi,
- struct sk_buff *skb,
- gro_result_t ret)
+static gro_result_t gro_skb_finish(struct gro_node *gro, struct sk_buff *skb,
+ gro_result_t ret)
{
switch (ret) {
case GRO_NORMAL:
- gro_normal_one(napi, skb, 1);
+ gro_normal_one(gro, skb, 1);
break;
case GRO_MERGED_FREE:
@@ -628,6 +626,7 @@ static gro_result_t napi_skb_finish(struct napi_struct *napi,
gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
{
+ struct gro_node *gro = &napi->gro;
gro_result_t ret;
skb_mark_napi_id(skb, napi);
@@ -635,7 +634,7 @@ gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
skb_gro_reset_offset(skb, 0);
- ret = napi_skb_finish(napi, skb, dev_gro_receive(napi, skb));
+ ret = gro_skb_finish(gro, skb, dev_gro_receive(gro, skb));
trace_napi_gro_receive_exit(ret);
return ret;
@@ -695,7 +694,7 @@ static gro_result_t napi_frags_finish(struct napi_struct *napi,
__skb_push(skb, ETH_HLEN);
skb->protocol = eth_type_trans(skb, skb->dev);
if (ret == GRO_NORMAL)
- gro_normal_one(napi, skb, 1);
+ gro_normal_one(&napi->gro, skb, 1);
break;
case GRO_MERGED_FREE:
@@ -761,7 +760,7 @@ gro_result_t napi_gro_frags(struct napi_struct *napi)
trace_napi_gro_frags_entry(skb);
- ret = napi_frags_finish(napi, skb, dev_gro_receive(napi, skb));
+ ret = napi_frags_finish(napi, skb, dev_gro_receive(&napi->gro, skb));
trace_napi_gro_frags_exit(ret);
return ret;
--
2.36.1
* [xdp-hints] [PATCH RFC bpf-next 31/52] net, gro: expose some GRO API to use outside of NAPI
2022-06-28 19:47 [xdp-hints] [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata Alexander Lobakin
` (29 preceding siblings ...)
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 30/52] net, gro: decouple GRO from the NAPI layer Alexander Lobakin
@ 2022-06-28 19:47 ` Alexander Lobakin
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 32/52] bpf, cpumap: switch to GRO from netif_receive_skb_list() Alexander Lobakin
` (21 subsequent siblings)
52 siblings, 0 replies; 98+ messages in thread
From: Alexander Lobakin @ 2022-06-28 19:47 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
Cc: Alexander Lobakin, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Toke Hoiland-Jorgensen,
Lorenzo Bianconi, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jesse Brandeburg, John Fastabend, Yajun Deng,
Willem de Bruijn, bpf, netdev, linux-kernel, xdp-hints
Make several functions global to be able to use GRO without a NAPI
instance. This includes the init, cleanup and receive functions, as
well as a couple of inlines to start and stop the deferred flush
timer.
Together with the already-global gro_flush(), it is now fully
possible to maintain a GRO node without an auxiliary NAPI entity.
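A minimal out-of-NAPI user could then look as follows (the context
struct and the my_* names are hypothetical, only the gro_* calls are
introduced by this series):

#include <net/gro.h>

struct my_rx_ctx {
	struct gro_node gro;
};

static void my_rx_ctx_init(struct my_rx_ctx *ctx)
{
	/* No deferred flush timer, so no timer callback */
	gro_init(&ctx->gro, NULL);
}

/* Feed a batch of skbs to GRO; must run in BH context */
static void my_rx_process(struct my_rx_ctx *ctx, struct list_head *skbs)
{
	gro_receive_skb_list(&ctx->gro, skbs);
	/* No timer -> always do a full flush at the end of the batch */
	gro_flush(&ctx->gro, false);
	gro_normal_list(&ctx->gro);
}

static void my_rx_ctx_destroy(struct my_rx_ctx *ctx)
{
	gro_cleanup(&ctx->gro);
}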
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
---
include/net/gro.h | 18 +++++++++++++++
net/core/dev.c | 45 ++++++-------------------------------
net/core/gro.c | 57 +++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 82 insertions(+), 38 deletions(-)
diff --git a/include/net/gro.h b/include/net/gro.h
index 75211ebd8765..539f931e736f 100644
--- a/include/net/gro.h
+++ b/include/net/gro.h
@@ -421,6 +421,7 @@ static inline __wsum ip6_gro_compute_pseudo(struct sk_buff *skb, int proto)
}
int skb_gro_receive(struct sk_buff *p, struct sk_buff *skb);
+void gro_receive_skb_list(struct gro_node *gro, struct list_head *list);
void __gro_flush(struct gro_node *gro, bool flush_old);
static inline void gro_flush(struct gro_node *gro, bool flush_old)
@@ -458,5 +459,22 @@ static inline void gro_normal_one(struct gro_node *gro, struct sk_buff *skb,
gro_normal_list(gro);
}
+static inline void gro_timer_start(struct gro_node *gro, u64 timeout_ns)
+{
+ if (!timeout_ns)
+ return;
+
+ hrtimer_start(&gro->timer, ns_to_ktime(timeout_ns),
+ HRTIMER_MODE_REL_PINNED);
+}
+
+static inline void gro_timer_cancel(struct gro_node *gro)
+{
+ hrtimer_cancel(&gro->timer);
+}
+
+void gro_init(struct gro_node *gro,
+ enum hrtimer_restart (*timer_cb)(struct hrtimer *timer));
+void gro_cleanup(struct gro_node *gro);
#endif /* _NET_IPV6_GRO_H */
diff --git a/net/core/dev.c b/net/core/dev.c
index 8b334aa974c2..62bf6ee00741 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5812,9 +5812,8 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
return false;
}
- if (timeout)
- hrtimer_start(&n->gro.timer, ns_to_ktime(timeout),
- HRTIMER_MODE_REL_PINNED);
+ gro_timer_start(&n->gro, timeout);
+
return ret;
}
EXPORT_SYMBOL(napi_complete_done);
@@ -5876,7 +5875,7 @@ static void busy_poll_stop(struct napi_struct *napi, void *have_poll_lock, bool
napi->defer_hard_irqs_count = READ_ONCE(napi->dev->napi_defer_hard_irqs);
timeout = READ_ONCE(napi->dev->gro_flush_timeout);
if (napi->defer_hard_irqs_count && timeout) {
- hrtimer_start(&napi->gro.timer, ns_to_ktime(timeout), HRTIMER_MODE_REL_PINNED);
+ gro_timer_start(&napi->gro, timeout);
skip_schedule = true;
}
}
@@ -6025,17 +6024,6 @@ static enum hrtimer_restart napi_watchdog(struct hrtimer *timer)
return HRTIMER_NORESTART;
}
-static void init_gro_hash(struct napi_struct *napi)
-{
- int i;
-
- for (i = 0; i < GRO_HASH_BUCKETS; i++) {
- INIT_LIST_HEAD(&napi->gro.hash[i].list);
- napi->gro.hash[i].count = 0;
- }
- napi->gro.bitmask = 0;
-}
-
int dev_set_threaded(struct net_device *dev, bool threaded)
{
struct napi_struct *napi;
@@ -6105,12 +6093,8 @@ void netif_napi_add_weight(struct net_device *dev, struct napi_struct *napi,
INIT_LIST_HEAD(&napi->poll_list);
INIT_HLIST_NODE(&napi->napi_hash_node);
- hrtimer_init(&napi->gro.timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED);
- napi->gro.timer.function = napi_watchdog;
- init_gro_hash(napi);
+ gro_init(&napi->gro, napi_watchdog);
napi->skb = NULL;
- INIT_LIST_HEAD(&napi->gro.rx_list);
- napi->gro.rx_count = 0;
napi->poll = poll;
if (weight > NAPI_POLL_WEIGHT)
netdev_err_once(dev, "%s() called with weight %d\n", __func__,
@@ -6155,8 +6139,7 @@ void napi_disable(struct napi_struct *n)
break;
}
- hrtimer_cancel(&n->gro.timer);
-
+ gro_timer_cancel(&n->gro);
clear_bit(NAPI_STATE_DISABLE, &n->state);
}
EXPORT_SYMBOL(napi_disable);
@@ -6183,19 +6166,6 @@ void napi_enable(struct napi_struct *n)
}
EXPORT_SYMBOL(napi_enable);
-static void flush_gro_hash(struct napi_struct *napi)
-{
- int i;
-
- for (i = 0; i < GRO_HASH_BUCKETS; i++) {
- struct sk_buff *skb, *n;
-
- list_for_each_entry_safe(skb, n, &napi->gro.hash[i].list, list)
- kfree_skb(skb);
- napi->gro.hash[i].count = 0;
- }
-}
-
/* Must be called in process context */
void __netif_napi_del(struct napi_struct *napi)
{
@@ -6206,8 +6176,7 @@ void __netif_napi_del(struct napi_struct *napi)
list_del_rcu(&napi->dev_list);
napi_free_frags(napi);
- flush_gro_hash(napi);
- napi->gro.bitmask = 0;
+ gro_cleanup(&napi->gro);
if (napi->thread) {
kthread_stop(napi->thread);
@@ -10627,7 +10596,7 @@ static int __init net_dev_init(void)
INIT_CSD(&sd->defer_csd, trigger_rx_softirq, sd);
spin_lock_init(&sd->defer_lock);
- init_gro_hash(&sd->backlog);
+ gro_init(&sd->backlog.gro, NULL);
sd->backlog.poll = process_backlog;
sd->backlog.weight = weight_p;
}
diff --git a/net/core/gro.c b/net/core/gro.c
index 67fd587a87c9..424c812abe79 100644
--- a/net/core/gro.c
+++ b/net/core/gro.c
@@ -624,6 +624,18 @@ static gro_result_t gro_skb_finish(struct gro_node *gro, struct sk_buff *skb,
return ret;
}
+void gro_receive_skb_list(struct gro_node *gro, struct list_head *list)
+{
+ struct sk_buff *skb, *tmp;
+
+ list_for_each_entry_safe(skb, tmp, list, list) {
+ skb_list_del_init(skb);
+
+ skb_gro_reset_offset(skb, 0);
+ gro_skb_finish(gro, skb, dev_gro_receive(gro, skb));
+ }
+}
+
gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
{
struct gro_node *gro = &napi->gro;
@@ -792,3 +804,48 @@ __sum16 __skb_gro_checksum_complete(struct sk_buff *skb)
return sum;
}
EXPORT_SYMBOL(__skb_gro_checksum_complete);
+
+void gro_init(struct gro_node *gro,
+ enum hrtimer_restart (*timer_cb)(struct hrtimer *))
+{
+ u32 i;
+
+ for (i = 0; i < GRO_HASH_BUCKETS; i++) {
+ INIT_LIST_HEAD(&gro->hash[i].list);
+ gro->hash[i].count = 0;
+ }
+
+ gro->bitmask = 0;
+
+ INIT_LIST_HEAD(&gro->rx_list);
+ gro->rx_count = 0;
+
+ if (!timer_cb)
+ return;
+
+ hrtimer_init(&gro->timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL_PINNED);
+ gro->timer.function = timer_cb;
+}
+
+void gro_cleanup(struct gro_node *gro)
+{
+ struct sk_buff *skb, *n;
+ u32 i;
+
+ gro_timer_cancel(gro);
+ memset(&gro->timer, 0, sizeof(gro->timer));
+
+ for (i = 0; i < GRO_HASH_BUCKETS; i++) {
+ list_for_each_entry_safe(skb, n, &gro->hash[i].list, list)
+ kfree_skb(skb);
+
+ gro->hash[i].count = 0;
+ }
+
+ gro->bitmask = 0;
+
+ list_for_each_entry_safe(skb, n, &gro->rx_list, list)
+ kfree_skb(skb);
+
+ gro->rx_count = 0;
+}
--
2.36.1
* [xdp-hints] [PATCH RFC bpf-next 32/52] bpf, cpumap: switch to GRO from netif_receive_skb_list()
2022-06-28 19:47 [xdp-hints] [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata Alexander Lobakin
` (30 preceding siblings ...)
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 31/52] net, gro: expose some GRO API to use outside of NAPI Alexander Lobakin
@ 2022-06-28 19:47 ` Alexander Lobakin
2024-08-07 20:38 ` [xdp-hints] " Daniel Xu
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 33/52] bpf, cpumap: add option to set a timeout for deferred flush Alexander Lobakin
` (20 subsequent siblings)
52 siblings, 1 reply; 98+ messages in thread
From: Alexander Lobakin @ 2022-06-28 19:47 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
Cc: Alexander Lobakin, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Toke Hoiland-Jorgensen,
Lorenzo Bianconi, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jesse Brandeburg, John Fastabend, Yajun Deng,
Willem de Bruijn, bpf, netdev, linux-kernel, xdp-hints
cpumap has its own BH context based on a kthread. It has a sane
batch size of 8 frames per cycle.
GRO can now be used on its own, so adjust the cpumap calls to the
upper stack to use the GRO API instead of netif_receive_skb_list(),
which processes skbs in batches, but doesn't involve the GRO layer
at all.
It is most beneficial when the NIC the frames come from is XDP
generic metadata-enabled, but in plenty of tests GRO performs better
than listified receiving even given that it has to calculate full
frame checksums on the CPU.
As GRO passes the skbs to the upper stack in batches of
@gro_normal_batch, i.e. 8 by default, and @skb->dev points to the
device where the frame comes from, it is enough to disable the GRO
netdev feature on that device (e.g. "ethtool -K <iface> gro off") to
completely restore the original behaviour: untouched frames will be
bulked and passed to the upper stack by 8, as it was with
netif_receive_skb_list().
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
---
kernel/bpf/cpumap.c | 43 ++++++++++++++++++++++++++++++++++++++-----
1 file changed, 38 insertions(+), 5 deletions(-)
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index f4860ac756cd..2d0edf8f6a05 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -29,8 +29,8 @@
#include <trace/events/xdp.h>
#include <linux/btf_ids.h>
-#include <linux/netdevice.h> /* netif_receive_skb_list */
-#include <linux/etherdevice.h> /* eth_type_trans */
+#include <linux/netdevice.h>
+#include <net/gro.h>
/* General idea: XDP packets getting XDP redirected to another CPU,
* will maximum be stored/queued for one driver ->poll() call. It is
@@ -67,6 +67,8 @@ struct bpf_cpu_map_entry {
struct bpf_cpumap_val value;
struct bpf_prog *prog;
+ struct gro_node gro;
+
atomic_t refcnt; /* Control when this struct can be free'ed */
struct rcu_head rcu;
@@ -162,6 +164,7 @@ static void put_cpu_map_entry(struct bpf_cpu_map_entry *rcpu)
if (atomic_dec_and_test(&rcpu->refcnt)) {
if (rcpu->prog)
bpf_prog_put(rcpu->prog);
+ gro_cleanup(&rcpu->gro);
/* The queue should be empty at this point */
__cpu_map_ring_cleanup(rcpu->queue);
ptr_ring_cleanup(rcpu->queue, NULL);
@@ -295,6 +298,33 @@ static int cpu_map_bpf_prog_run(struct bpf_cpu_map_entry *rcpu, void **frames,
return nframes;
}
+static void cpu_map_gro_flush(struct bpf_cpu_map_entry *rcpu,
+ struct list_head *list)
+{
+ bool new = !list_empty(list);
+
+ if (likely(new))
+ gro_receive_skb_list(&rcpu->gro, list);
+
+ if (rcpu->gro.bitmask) {
+ bool flush_old = HZ >= 1000;
+
+ /* If the ring is not empty, there'll be a new iteration
+ * soon, and we only need to do a full flush if a tick is
+ * long (> 1 ms).
+ * If the ring is empty, to not hold GRO packets in the
+ * stack for too long, do a full flush.
+ * This is equivalent to how NAPI decides whether to perform
+ * a full flush (by batches of up to 64 frames though).
+ */
+ if (__ptr_ring_empty(rcpu->queue))
+ flush_old = false;
+
+ __gro_flush(&rcpu->gro, flush_old);
+ }
+
+ gro_normal_list(&rcpu->gro);
+}
static int cpu_map_kthread_run(void *data)
{
@@ -384,7 +414,7 @@ static int cpu_map_kthread_run(void *data)
list_add_tail(&skb->list, &list);
}
- netif_receive_skb_list(&list);
+ cpu_map_gro_flush(rcpu, &list);
/* Feedback loop via tracepoint */
trace_xdp_cpumap_kthread(rcpu->map_id, n, kmem_alloc_drops,
@@ -460,8 +490,10 @@ __cpu_map_entry_alloc(struct bpf_map *map, struct bpf_cpumap_val *value,
rcpu->map_id = map->id;
rcpu->value.qsize = value->qsize;
+ gro_init(&rcpu->gro, NULL);
+
if (fd > 0 && __cpu_map_load_bpf_program(rcpu, map, fd))
- goto free_ptr_ring;
+ goto free_gro;
/* Setup kthread */
rcpu->kthread = kthread_create_on_node(cpu_map_kthread_run, rcpu, numa,
@@ -482,7 +514,8 @@ __cpu_map_entry_alloc(struct bpf_map *map, struct bpf_cpumap_val *value,
free_prog:
if (rcpu->prog)
bpf_prog_put(rcpu->prog);
-free_ptr_ring:
+free_gro:
+ gro_cleanup(&rcpu->gro);
ptr_ring_cleanup(rcpu->queue, NULL);
free_queue:
kfree(rcpu->queue);
--
2.36.1
* [xdp-hints] Re: [PATCH RFC bpf-next 32/52] bpf, cpumap: switch to GRO from netif_receive_skb_list()
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 32/52] bpf, cpumap: switch to GRO from netif_receive_skb_list() Alexander Lobakin
@ 2024-08-07 20:38 ` Daniel Xu
2024-08-08 4:54 ` Lorenzo Bianconi
0 siblings, 1 reply; 98+ messages in thread
From: Daniel Xu @ 2024-08-07 20:38 UTC (permalink / raw)
To: Alexander Lobakin, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
Cc: Larysa Zaremba, Michal Swiatkowski, Jesper Dangaard Brouer,
Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
Jonathan Lemon, toke, Lorenzo Bianconi, David Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jesse Brandeburg,
John Fastabend, Yajun Deng, Willem de Bruijn, bpf, netdev,
linux-kernel, xdp-hints
Hi Alexander,
On Tue, Jun 28, 2022, at 12:47 PM, Alexander Lobakin wrote:
> cpumap has its own BH context based on kthread. It has a sane batch
> size of 8 frames per one cycle.
> GRO can be used on its own, adjust cpumap calls to the
> upper stack to use GRO API instead of netif_receive_skb_list() which
> processes skbs by batches, but doesn't involve GRO layer at all.
> It is most beneficial when a NIC which frame come from is XDP
> generic metadata-enabled, but in plenty of tests GRO performs better
> than listed receiving even given that it has to calculate full frame
> checksums on CPU.
> As GRO passes the skbs to the upper stack in the batches of
> @gro_normal_batch, i.e. 8 by default, and @skb->dev point to the
> device where the frame comes from, it is enough to disable GRO
> netdev feature on it to completely restore the original behaviour:
> untouched frames will be being bulked and passed to the upper stack
> by 8, as it was with netif_receive_skb_list().
>
> Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
> ---
> kernel/bpf/cpumap.c | 43 ++++++++++++++++++++++++++++++++++++++-----
> 1 file changed, 38 insertions(+), 5 deletions(-)
>
AFAICT the cpumap + GRO is a good standalone improvement. I think
cpumap is still missing this.
I have a production use case for this now. We want to do some intelligent
RX steering and I think GRO would help over list-ified receive in some cases.
We would prefer to steer in HW (and thus get existing GRO support) but not all
our NICs support it. So we need a software fallback.
Are you still interested in merging the cpumap + GRO patches?
Thanks,
Daniel
* [xdp-hints] Re: [PATCH RFC bpf-next 32/52] bpf, cpumap: switch to GRO from netif_receive_skb_list()
2024-08-07 20:38 ` [xdp-hints] " Daniel Xu
@ 2024-08-08 4:54 ` Lorenzo Bianconi
2024-08-08 11:57 ` Alexander Lobakin
2024-08-08 20:44 ` Daniel Xu
0 siblings, 2 replies; 98+ messages in thread
From: Lorenzo Bianconi @ 2024-08-08 4:54 UTC (permalink / raw)
To: Daniel Xu
Cc: Alexander Lobakin, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, toke, Lorenzo Bianconi,
David Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Jesse Brandeburg, John Fastabend, Yajun Deng, Willem de Bruijn,
bpf, netdev, linux-kernel, xdp-hints
> Hi Alexander,
>
> On Tue, Jun 28, 2022, at 12:47 PM, Alexander Lobakin wrote:
> > cpumap has its own BH context based on kthread. It has a sane batch
> > size of 8 frames per one cycle.
> > GRO can be used on its own, adjust cpumap calls to the
> > upper stack to use GRO API instead of netif_receive_skb_list() which
> > processes skbs by batches, but doesn't involve GRO layer at all.
> > It is most beneficial when a NIC which frame come from is XDP
> > generic metadata-enabled, but in plenty of tests GRO performs better
> > than listed receiving even given that it has to calculate full frame
> > checksums on CPU.
> > As GRO passes the skbs to the upper stack in the batches of
> > @gro_normal_batch, i.e. 8 by default, and @skb->dev point to the
> > device where the frame comes from, it is enough to disable GRO
> > netdev feature on it to completely restore the original behaviour:
> > untouched frames will be being bulked and passed to the upper stack
> > by 8, as it was with netif_receive_skb_list().
> >
> > Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
> > ---
> > kernel/bpf/cpumap.c | 43 ++++++++++++++++++++++++++++++++++++++-----
> > 1 file changed, 38 insertions(+), 5 deletions(-)
> >
>
> AFAICT the cpumap + GRO is a good standalone improvement. I think
> cpumap is still missing this.
>
> I have a production use case for this now. We want to do some intelligent
> RX steering and I think GRO would help over list-ified receive in some cases.
> We would prefer steer in HW (and thus get existing GRO support) but not all
> our NICs support it. So we need a software fallback.
>
> Are you still interested in merging the cpumap + GRO patches?
Hi Daniel and Alex,
Recently I worked on a PoC to add GRO support to cpumap codebase:
- https://github.com/LorenzoBianconi/bpf-next/commit/a4b8264d5000ecf016da5a2dd9ac302deaf38b3e
Here I added GRO support to cpumap through gro-cells.
- https://github.com/LorenzoBianconi/bpf-next/commit/da6cb32a4674aa72401c7414c9a8a0775ef41a55
Here I added GRO support to cpumap through napi-threaded APIs (with some
changes to them).
Please note I have not run any performance tests so far, just verified it does
not crash (I was planning to resume this work soon). Please let me know if it
works for you.
Regards,
Lorenzo
>
> Thanks,
> Daniel
>
* [xdp-hints] Re: [PATCH RFC bpf-next 32/52] bpf, cpumap: switch to GRO from netif_receive_skb_list()
2024-08-08 4:54 ` Lorenzo Bianconi
@ 2024-08-08 11:57 ` Alexander Lobakin
2024-08-08 17:22 ` Lorenzo Bianconi
` (3 more replies)
2024-08-08 20:44 ` Daniel Xu
1 sibling, 4 replies; 98+ messages in thread
From: Alexander Lobakin @ 2024-08-08 11:57 UTC (permalink / raw)
To: Lorenzo Bianconi, Daniel Xu
Cc: Alexander Lobakin, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, toke, Lorenzo Bianconi,
David Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Jesse Brandeburg, John Fastabend, Yajun Deng, Willem de Bruijn,
bpf, netdev, linux-kernel, xdp-hints
From: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Date: Thu, 8 Aug 2024 06:54:06 +0200
>> Hi Alexander,
>>
>> On Tue, Jun 28, 2022, at 12:47 PM, Alexander Lobakin wrote:
>>> cpumap has its own BH context based on kthread. It has a sane batch
>>> size of 8 frames per one cycle.
>>> GRO can be used on its own, adjust cpumap calls to the
>>> upper stack to use GRO API instead of netif_receive_skb_list() which
>>> processes skbs by batches, but doesn't involve GRO layer at all.
>>> It is most beneficial when a NIC which frame come from is XDP
>>> generic metadata-enabled, but in plenty of tests GRO performs better
>>> than listed receiving even given that it has to calculate full frame
>>> checksums on CPU.
>>> As GRO passes the skbs to the upper stack in the batches of
>>> @gro_normal_batch, i.e. 8 by default, and @skb->dev point to the
>>> device where the frame comes from, it is enough to disable GRO
>>> netdev feature on it to completely restore the original behaviour:
>>> untouched frames will be being bulked and passed to the upper stack
>>> by 8, as it was with netif_receive_skb_list().
>>>
>>> Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
>>> ---
>>> kernel/bpf/cpumap.c | 43 ++++++++++++++++++++++++++++++++++++++-----
>>> 1 file changed, 38 insertions(+), 5 deletions(-)
>>>
>>
>> AFAICT the cpumap + GRO is a good standalone improvement. I think
>> cpumap is still missing this.
The only concern for having GRO in cpumap without metadata from the NIC
descriptor was that when the checksum status is missing, GRO calculates
the checksum on CPU, which is not really fast.
But I remember sometimes GRO was faster despite that.
>>
>> I have a production use case for this now. We want to do some intelligent
>> RX steering and I think GRO would help over list-ified receive in some cases.
>> We would prefer steer in HW (and thus get existing GRO support) but not all
>> our NICs support it. So we need a software fallback.
>>
>> Are you still interested in merging the cpumap + GRO patches?
For sure I can revive this part. I was planning to get back to this
branch and pick patches which were not related to XDP hints and send
them separately.
>
> Hi Daniel and Alex,
>
> Recently I worked on a PoC to add GRO support to cpumap codebase:
> - https://github.com/LorenzoBianconi/bpf-next/commit/a4b8264d5000ecf016da5a2dd9ac302deaf38b3e
> Here I added GRO support to cpumap through gro-cells.
> - https://github.com/LorenzoBianconi/bpf-next/commit/da6cb32a4674aa72401c7414c9a8a0775ef41a55
> Here I added GRO support to cpumap trough napi-threaded APIs (with a some
> changes to them).
Hmm, when I was testing it, adding a whole NAPI to cpumap was sorta
overkill, that's why I separated GRO structure from &napi_struct.
Let me maybe find some free time, I would then test all 3 solutions
(mine, gro_cells, threaded NAPI) and pick/send the best?
>
> Please note I have not run any performance tests so far, just verified it does
> not crash (I was planning to resume this work soon). Please let me know if it
> works for you.
>
> Regards,
> Lorenzo
>
>>
>> Thanks,
>> Daniel
Thanks,
Olek
^ permalink raw reply [flat|nested] 98+ messages in thread
* [xdp-hints] Re: [PATCH RFC bpf-next 32/52] bpf, cpumap: switch to GRO from netif_receive_skb_list()
2024-08-08 11:57 ` Alexander Lobakin
@ 2024-08-08 17:22 ` Lorenzo Bianconi
2024-08-08 20:52 ` Daniel Xu
` (2 subsequent siblings)
3 siblings, 0 replies; 98+ messages in thread
From: Lorenzo Bianconi @ 2024-08-08 17:22 UTC (permalink / raw)
To: Alexander Lobakin
Cc: Daniel Xu, Alexander Lobakin, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, Larysa Zaremba,
Michal Swiatkowski, Jesper Dangaard Brouer,
Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
Jonathan Lemon, toke, Lorenzo Bianconi, David Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jesse Brandeburg,
John Fastabend, Yajun Deng, Willem de Bruijn, bpf, netdev,
linux-kernel, xdp-hints
> From: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
> Date: Thu, 8 Aug 2024 06:54:06 +0200
>
> >> Hi Alexander,
> >>
> >> On Tue, Jun 28, 2022, at 12:47 PM, Alexander Lobakin wrote:
> >>> cpumap has its own BH context based on kthread. It has a sane batch
> >>> size of 8 frames per one cycle.
> >>> GRO can be used on its own, adjust cpumap calls to the
> >>> upper stack to use GRO API instead of netif_receive_skb_list() which
> >>> processes skbs by batches, but doesn't involve GRO layer at all.
> >>> It is most beneficial when a NIC which frame come from is XDP
> >>> generic metadata-enabled, but in plenty of tests GRO performs better
> >>> than listed receiving even given that it has to calculate full frame
> >>> checksums on CPU.
> >>> As GRO passes the skbs to the upper stack in the batches of
> >>> @gro_normal_batch, i.e. 8 by default, and @skb->dev point to the
> >>> device where the frame comes from, it is enough to disable GRO
> >>> netdev feature on it to completely restore the original behaviour:
> >>> untouched frames will be being bulked and passed to the upper stack
> >>> by 8, as it was with netif_receive_skb_list().
> >>>
> >>> Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
> >>> ---
> >>> kernel/bpf/cpumap.c | 43 ++++++++++++++++++++++++++++++++++++++-----
> >>> 1 file changed, 38 insertions(+), 5 deletions(-)
> >>>
> >>
> >> AFAICT the cpumap + GRO is a good standalone improvement. I think
> >> cpumap is still missing this.
>
> The only concern for having GRO in cpumap without metadata from the NIC
> descriptor was that when the checksum status is missing, GRO calculates
> the checksum on CPU, which is not really fast.
> But I remember sometimes GRO was faster despite that.
>
> >>
> >> I have a production use case for this now. We want to do some intelligent
> >> RX steering and I think GRO would help over list-ified receive in some cases.
> >> We would prefer steer in HW (and thus get existing GRO support) but not all
> >> our NICs support it. So we need a software fallback.
> >>
> >> Are you still interested in merging the cpumap + GRO patches?
>
> For sure I can revive this part. I was planning to get back to this
> branch and pick patches which were not related to XDP hints and send
> them separately.
>
> >
> > Hi Daniel and Alex,
> >
> > Recently I worked on a PoC to add GRO support to cpumap codebase:
> > - https://github.com/LorenzoBianconi/bpf-next/commit/a4b8264d5000ecf016da5a2dd9ac302deaf38b3e
> > Here I added GRO support to cpumap through gro-cells.
> > - https://github.com/LorenzoBianconi/bpf-next/commit/da6cb32a4674aa72401c7414c9a8a0775ef41a55
> > Here I added GRO support to cpumap trough napi-threaded APIs (with a some
> > changes to them).
>
> Hmm, when I was testing it, adding a whole NAPI to cpumap was sorta
> overkill, that's why I separated GRO structure from &napi_struct.
if we consider the NAPI-threaded implementation, we have the same architecture
we have in the current cpumap codebase, a thread for each cpumap entry; the
only difference is we can rely on the GRO APIs.
Regards,
Lorenzo
>
> Let me maybe find some free time, I would then test all 3 solutions
> (mine, gro_cells, threaded NAPI) and pick/send the best?
>
> >
> > Please note I have not run any performance tests so far, just verified it does
> > not crash (I was planning to resume this work soon). Please let me know if it
> > works for you.
> >
> > Regards,
> > Lorenzo
> >
> >>
> >> Thanks,
> >> Daniel
>
> Thanks,
> Olek
>
* [xdp-hints] Re: [PATCH RFC bpf-next 32/52] bpf, cpumap: switch to GRO from netif_receive_skb_list()
2024-08-08 11:57 ` Alexander Lobakin
2024-08-08 17:22 ` Lorenzo Bianconi
@ 2024-08-08 20:52 ` Daniel Xu
2024-08-09 10:02 ` Jesper Dangaard Brouer
2024-08-09 12:20 ` Alexander Lobakin
2024-08-10 8:00 ` Lorenzo Bianconi
2024-08-13 14:09 ` Alexander Lobakin
3 siblings, 2 replies; 98+ messages in thread
From: Daniel Xu @ 2024-08-08 20:52 UTC (permalink / raw)
To: Alexander Lobakin, Lorenzo Bianconi
Cc: Alexander Lobakin, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, toke, Lorenzo Bianconi,
David Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Jesse Brandeburg, John Fastabend, Yajun Deng, Willem de Bruijn,
bpf, netdev, linux-kernel, xdp-hints
Hi,
On Thu, Aug 8, 2024, at 7:57 AM, Alexander Lobakin wrote:
> From: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
> Date: Thu, 8 Aug 2024 06:54:06 +0200
>
>>> Hi Alexander,
>>>
>>> On Tue, Jun 28, 2022, at 12:47 PM, Alexander Lobakin wrote:
>>>> cpumap has its own BH context based on kthread. It has a sane batch
>>>> size of 8 frames per one cycle.
>>>> GRO can be used on its own, adjust cpumap calls to the
>>>> upper stack to use GRO API instead of netif_receive_skb_list() which
>>>> processes skbs by batches, but doesn't involve GRO layer at all.
>>>> It is most beneficial when a NIC which frame come from is XDP
>>>> generic metadata-enabled, but in plenty of tests GRO performs better
>>>> than listed receiving even given that it has to calculate full frame
>>>> checksums on CPU.
>>>> As GRO passes the skbs to the upper stack in the batches of
>>>> @gro_normal_batch, i.e. 8 by default, and @skb->dev point to the
>>>> device where the frame comes from, it is enough to disable GRO
>>>> netdev feature on it to completely restore the original behaviour:
>>>> untouched frames will be being bulked and passed to the upper stack
>>>> by 8, as it was with netif_receive_skb_list().
>>>>
>>>> Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
>>>> ---
>>>> kernel/bpf/cpumap.c | 43 ++++++++++++++++++++++++++++++++++++++-----
>>>> 1 file changed, 38 insertions(+), 5 deletions(-)
>>>>
>>>
>>> AFAICT the cpumap + GRO is a good standalone improvement. I think
>>> cpumap is still missing this.
>
> The only concern for having GRO in cpumap without metadata from the NIC
> descriptor was that when the checksum status is missing, GRO calculates
> the checksum on CPU, which is not really fast.
> But I remember sometimes GRO was faster despite that.
Good to know, thanks. IIUC some kind of XDP hint support landed already?
My use case could also use HW RSS hash to avoid a rehash in XDP prog.
And HW RX timestamp to not break SO_TIMESTAMPING. These two
are on one of my TODO lists. But I can’t get to them for at least
a few weeks. So feel free to take it if you’d like.
>
>>>
>>> I have a production use case for this now. We want to do some intelligent
>>> RX steering and I think GRO would help over list-ified receive in some cases.
>>> We would prefer steer in HW (and thus get existing GRO support) but not all
>>> our NICs support it. So we need a software fallback.
>>>
>>> Are you still interested in merging the cpumap + GRO patches?
>
> For sure I can revive this part. I was planning to get back to this
> branch and pick patches which were not related to XDP hints and send
> them separately.
>
>>
>> Hi Daniel and Alex,
>>
>> Recently I worked on a PoC to add GRO support to cpumap codebase:
>> - https://github.com/LorenzoBianconi/bpf-next/commit/a4b8264d5000ecf016da5a2dd9ac302deaf38b3e
>> Here I added GRO support to cpumap through gro-cells.
>> - https://github.com/LorenzoBianconi/bpf-next/commit/da6cb32a4674aa72401c7414c9a8a0775ef41a55
>> Here I added GRO support to cpumap trough napi-threaded APIs (with a some
>> changes to them).
>
> Hmm, when I was testing it, adding a whole NAPI to cpumap was sorta
> overkill, that's why I separated GRO structure from &napi_struct.
>
> Let me maybe find some free time, I would then test all 3 solutions
> (mine, gro_cells, threaded NAPI) and pick/send the best?
Sounds good. Would be good to compare results.
[…]
Thanks,
Daniel
* [xdp-hints] Re: [PATCH RFC bpf-next 32/52] bpf, cpumap: switch to GRO from netif_receive_skb_list()
2024-08-08 20:52 ` Daniel Xu
@ 2024-08-09 10:02 ` Jesper Dangaard Brouer
2024-08-09 12:20 ` Alexander Lobakin
1 sibling, 0 replies; 98+ messages in thread
From: Jesper Dangaard Brouer @ 2024-08-09 10:02 UTC (permalink / raw)
To: Daniel Xu, Alexander Lobakin, Lorenzo Bianconi
Cc: Alexander Lobakin, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Larysa Zaremba, Michal Swiatkowski,
Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
Jonathan Lemon, toke, Lorenzo Bianconi, David Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jesse Brandeburg,
John Fastabend, Yajun Deng, Willem de Bruijn, bpf, netdev,
linux-kernel, xdp-hints, Stanislav Fomichev, kernel-team
On 08/08/2024 22.52, Daniel Xu wrote:
>
> On Thu, Aug 8, 2024, at 7:57 AM, Alexander Lobakin wrote:
>>
[...]
>> The only concern for having GRO in cpumap without metadata from the NIC
>> descriptor was that when the checksum status is missing, GRO calculates
>> the checksum on CPU, which is not really fast.
>> But I remember sometimes GRO was faster despite that.
>
> Good to know, thanks. IIUC some kind of XDP hint support landed already?
>
The XDP-hints ended up being called 'XDP RX metadata' in the kernel
docs [1], which makes it difficult to talk about without talking past
each other. The TX side only got implemented for AF_XDP [2].
[1] https://www.kernel.org/doc/html/latest/networking/xdp-rx-metadata.html
[2] https://www.kernel.org/doc/html/latest/networking/xsk-tx-metadata.html
What landed as 'XDP RX metadata' [1] is that we (via kfunc calls) get
access to reading hardware RX offloads/hints directly from the
RX-descriptor. This implies a limitation: we only have access to this
data in the running XDP-program, as the RX-descriptor is short-lived.
Thus, we need to store the RX-descriptor information somewhere to make
it available to 'cpumap' on the remote CPU. After failing to standardize
the formatting of the XDP metadata area, my "new" opinion is that we
should simply extend struct xdp_frame with the fields needed for SKB
creation. Then we can create some kfunc helpers that allow the XDP-prog
to store this info.
> My use case could also use HW RSS hash to avoid a rehash in XDP prog.
> And HW RX timestamp to not break SO_TIMESTAMPING. These two
> are on one of my TODO lists. But I can’t get to them for at least
> a few weeks. So free to take it if you’d like.
The kfuncs you need should be available:
HW RSS hash = bpf_xdp_metadata_rx_hash()
HW RX timestamp = bpf_xdp_metadata_rx_timestamp()
We just need to implement storing the information, such that it is
available to CPUMAP, and make it generic such that it also works for
veth when getting an XDP-redirected xdp_frame.
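For reference, reading one of these kfuncs from an XDP prog looks
roughly like this (the kfunc and its signature are real [1], the
program around it is just a sketch):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

/* Only callable while the XDP prog runs, i.e. while the Rx descriptor
 * is still alive -- exactly the limitation described above.
 */
extern int bpf_xdp_metadata_rx_timestamp(const struct xdp_md *ctx,
					 __u64 *timestamp) __ksym;

SEC("xdp")
int read_rx_tstamp(struct xdp_md *ctx)
{
	__u64 ts;

	/* Returns 0 on success, -errno when the device can't provide it */
	if (!bpf_xdp_metadata_rx_timestamp(ctx, &ts))
		bpf_printk("HW Rx tstamp: %llu", ts);

	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";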
Hoping someone can work on this soon,
--Jesper
* [xdp-hints] Re: [PATCH RFC bpf-next 32/52] bpf, cpumap: switch to GRO from netif_receive_skb_list()
2024-08-08 20:52 ` Daniel Xu
2024-08-09 10:02 ` Jesper Dangaard Brouer
@ 2024-08-09 12:20 ` Alexander Lobakin
2024-08-09 12:45 ` Toke Høiland-Jørgensen
2024-08-13 1:33 ` Jakub Kicinski
1 sibling, 2 replies; 98+ messages in thread
From: Alexander Lobakin @ 2024-08-09 12:20 UTC (permalink / raw)
To: Daniel Xu
Cc: Lorenzo Bianconi, Alexander Lobakin, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, Larysa Zaremba,
Michal Swiatkowski, Jesper Dangaard Brouer,
Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
Jonathan Lemon, toke, Lorenzo Bianconi, David Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jesse Brandeburg,
John Fastabend, Yajun Deng, Willem de Bruijn, bpf, netdev,
linux-kernel, xdp-hints
From: Daniel Xu <dxu@dxuuu.xyz>
Date: Thu, 08 Aug 2024 16:52:51 -0400
> Hi,
>
> On Thu, Aug 8, 2024, at 7:57 AM, Alexander Lobakin wrote:
>> From: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
>> Date: Thu, 8 Aug 2024 06:54:06 +0200
>>
>>>> Hi Alexander,
>>>>
>>>> On Tue, Jun 28, 2022, at 12:47 PM, Alexander Lobakin wrote:
>>>>> cpumap has its own BH context based on a kthread. It has a sane
>>>>> batch size of 8 frames per cycle.
>>>>> GRO can be used on its own; adjust the cpumap calls to the upper
>>>>> stack to use the GRO API instead of netif_receive_skb_list(), which
>>>>> processes skbs in batches but doesn't involve the GRO layer at all.
>>>>> It is most beneficial when the NIC the frames come from is XDP
>>>>> generic metadata-enabled, but in plenty of tests GRO performs better
>>>>> than listified receiving even given that it has to calculate full
>>>>> frame checksums on the CPU.
>>>>> As GRO passes the skbs to the upper stack in batches of
>>>>> @gro_normal_batch, i.e. 8 by default, and @skb->dev points to the
>>>>> device where the frame comes from, it is enough to disable the GRO
>>>>> netdev feature on it to completely restore the original behaviour:
>>>>> untouched frames will be bulked and passed to the upper stack by 8,
>>>>> as it was with netif_receive_skb_list().
>>>>>
>>>>> Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
>>>>> ---
>>>>> kernel/bpf/cpumap.c | 43 ++++++++++++++++++++++++++++++++++++++-----
>>>>> 1 file changed, 38 insertions(+), 5 deletions(-)
>>>>>
>>>>
>>>> AFAICT the cpumap + GRO is a good standalone improvement. I think
>>>> cpumap is still missing this.
>>
>> The only concern for having GRO in cpumap without metadata from the NIC
>> descriptor was that when the checksum status is missing, GRO calculates
>> the checksum on CPU, which is not really fast.
>> But I remember sometimes GRO was faster despite that.
>
> Good to know, thanks. IIUC some kind of XDP hint support landed already?
>
> My use case could also use HW RSS hash to avoid a rehash in XDP prog.
Unfortunately, for now it's impossible to get HW metadata such as RSS
hash and checksum status in cpumap. They're implemented via kfuncs
specific to a particular netdevice, and this info is available only
while the XDP prog is running.
But I think one solution could be:
1. We create some generic structure for cpumap, like
	struct cpumap_meta {
		u32 magic;
		u32 hash;
	};

2. We add such a check in the cpumap code (fleshed out in the sketch
after this list):

	if (xdpf->metalen == sizeof(struct cpumap_meta) &&
	    <here we check magic>)
		skb->hash = meta->hash;
3. In XDP prog, you call Rx hints kfuncs when they're available, obtain
RSS hash and then put it in the struct cpumap_meta as XDP frame metadata.
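(A fleshed-out, purely illustrative version of the check from step 2 —
struct cpumap_meta and the magic value are hypothetical, not existing
kernel API. The metadata an XDP prog wrote via bpf_xdp_adjust_meta()
sits directly in front of the frame payload; note the real struct
xdp_frame calls the length field 'metasize' rather than 'metalen':)

#include <linux/skbuff.h>
#include <net/xdp.h>

#define CPUMAP_META_MAGIC	0xcb9f0001U	/* hypothetical */

struct cpumap_meta {
	u32 magic;
	u32 hash;
};

/* called from cpumap's xdp_frame -> skb conversion path */
static void cpumap_apply_meta(const struct xdp_frame *xdpf,
			      struct sk_buff *skb)
{
	const struct cpumap_meta *meta = xdpf->data - sizeof(*meta);

	if (xdpf->metasize != sizeof(*meta) ||
	    meta->magic != CPUMAP_META_MAGIC)
		return;

	/* the hash type is unknown without a HW hint, be conservative */
	skb_set_hash(skb, meta->hash, PKT_HASH_TYPE_L3);
}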
> And HW RX timestamp to not break SO_TIMESTAMPING. These two
> are on one of my TODO lists. But I can’t get to them for at least
> a few weeks. So feel free to take it if you’d like.
>
>>
>>>>
>>>> I have a production use case for this now. We want to do some intelligent
>>>> RX steering and I think GRO would help over list-ified receive in some cases.
>>>> We would prefer steer in HW (and thus get existing GRO support) but not all
>>>> our NICs support it. So we need a software fallback.
>>>>
>>>> Are you still interested in merging the cpumap + GRO patches?
>>
>> For sure I can revive this part. I was planning to get back to this
>> branch and pick patches which were not related to XDP hints and send
>> them separately.
>>
>>>
>>> Hi Daniel and Alex,
>>>
>>> Recently I worked on a PoC to add GRO support to cpumap codebase:
>>> - https://github.com/LorenzoBianconi/bpf-next/commit/a4b8264d5000ecf016da5a2dd9ac302deaf38b3e
>>> Here I added GRO support to cpumap through gro-cells.
>>> - https://github.com/LorenzoBianconi/bpf-next/commit/da6cb32a4674aa72401c7414c9a8a0775ef41a55
>>> Here I added GRO support to cpumap through napi-threaded APIs (with
>>> some changes to them).
>>
>> Hmm, when I was testing it, adding a whole NAPI to cpumap was sorta
>> overkill, that's why I separated GRO structure from &napi_struct.
>>
>> Let me maybe find some free time, I would then test all 3 solutions
>> (mine, gro_cells, threaded NAPI) and pick/send the best?
>
> Sounds good. Would be good to compare results.
>
> […]
>
> Thanks,
> Daniel
Thanks,
Olek
^ permalink raw reply [flat|nested] 98+ messages in thread
* [xdp-hints] Re: [PATCH RFC bpf-next 32/52] bpf, cpumap: switch to GRO from netif_receive_skb_list()
2024-08-09 12:20 ` Alexander Lobakin
@ 2024-08-09 12:45 ` Toke Høiland-Jørgensen
2024-08-09 12:56 ` Alexander Lobakin
2024-08-13 1:33 ` Jakub Kicinski
1 sibling, 1 reply; 98+ messages in thread
From: Toke Høiland-Jørgensen @ 2024-08-09 12:45 UTC (permalink / raw)
To: Alexander Lobakin, Daniel Xu
Cc: Lorenzo Bianconi, Alexander Lobakin, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, Larysa Zaremba,
Michal Swiatkowski, Jesper Dangaard Brouer,
Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
Jonathan Lemon, Lorenzo Bianconi, David Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Jesse Brandeburg, John Fastabend,
Yajun Deng, Willem de Bruijn, bpf, netdev, linux-kernel,
xdp-hints
Alexander Lobakin <aleksander.lobakin@intel.com> writes:
> From: Daniel Xu <dxu@dxuuu.xyz>
> Date: Thu, 08 Aug 2024 16:52:51 -0400
>
>> [...]
>>
>> Good to know, thanks. IIUC some kind of XDP hint support landed already?
>>
>> My use case could also use HW RSS hash to avoid a rehash in XDP prog.
>
> Unfortunately, for now it's impossible to get HW metadata such as RSS
> hash and checksum status in cpumap. They're implemented via kfuncs
> specific to a particular netdevice and this info is available only when
> running XDP prog.
>
> But I think one solution could be:
>
> 1. We create some generic structure for cpumap, like
>
> struct cpumap_meta {
> u32 magic;
> u32 hash;
> }
>
> 2. We add such check in the cpumap code
>
> if (xdpf->metalen == sizeof(struct cpumap_meta) &&
> <here we check magic>)
> skb->hash = meta->hash;
>
> 3. In XDP prog, you call Rx hints kfuncs when they're available, obtain
> RSS hash and then put it in the struct cpumap_meta as XDP frame metadata.
Yes, except don't make this cpumap-specific, make it generic for kernel
consumption of the metadata. That way it doesn't even have to be stored
in the xdp metadata area, it can be anywhere we want (and hence not
subject to ABI issues), and we can use it for skb creation after
redirect in other places than cpumap as well (say, on veth devices).
So it'll be:
struct kernel_meta {
	u32 hash;
	u32 timestamp;
	...etc
};
and a kfunc:
void store_xdp_kernel_meta(struct kernel_meta *meta);
which the XDP program can call to populate the metadata area.
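(A usage sketch from the XDP prog side, keeping the names above —
store_xdp_kernel_meta() and struct kernel_meta are hypothetical exactly
as proposed here; bpf_xdp_metadata_rx_hash() is the documented Rx hint
kfunc and cpu_map is the CPUMAP declared in the sketch earlier in the
thread:)

/* hypothetical, exactly per the proposal above */
struct kernel_meta {
	__u32 hash;
	__u32 timestamp;
};

extern void store_xdp_kernel_meta(struct kernel_meta *meta) __ksym;

SEC("xdp")
int populate_kernel_meta(struct xdp_md *ctx)
{
	enum xdp_rss_hash_type type;
	struct kernel_meta meta = {};

	/* read the HW hint while the RX descriptor is still live */
	if (bpf_xdp_metadata_rx_hash(ctx, &meta.hash, &type))
		meta.hash = 0;

	/* the kernel decides where to keep the blob, so the layout
	 * never becomes UAPI
	 */
	store_xdp_kernel_meta(&meta);

	return bpf_redirect_map(&cpu_map, 0, 0);
}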
-Toke
^ permalink raw reply [flat|nested] 98+ messages in thread
* [xdp-hints] Re: [PATCH RFC bpf-next 32/52] bpf, cpumap: switch to GRO from netif_receive_skb_list()
2024-08-09 12:45 ` Toke Høiland-Jørgensen
@ 2024-08-09 12:56 ` Alexander Lobakin
2024-08-09 13:42 ` Toke Høiland-Jørgensen
0 siblings, 1 reply; 98+ messages in thread
From: Alexander Lobakin @ 2024-08-09 12:56 UTC (permalink / raw)
To: Toke Høiland-Jørgensen
Cc: Daniel Xu, Lorenzo Bianconi, Alexander Lobakin,
Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Larysa Zaremba, Michal Swiatkowski, Jesper Dangaard Brouer,
Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
Jonathan Lemon, Lorenzo Bianconi, David Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Jesse Brandeburg, John Fastabend,
Yajun Deng, Willem de Bruijn, bpf, netdev, linux-kernel,
xdp-hints
From: Toke Høiland-Jørgensen <toke@redhat.com>
Date: Fri, 09 Aug 2024 14:45:33 +0200
> Alexander Lobakin <aleksander.lobakin@intel.com> writes:
>
>> [...]
>
> Yes, except don't make this cpumap-specific, make it generic for kernel
> consumption of the metadata. That way it doesn't even have to be stored
> in the xdp metadata area, it can be anywhere we want (and hence not
> subject to ABI issues), and we can use it for skb creation after
> redirect in other places than cpumap as well (say, on veth devices).
>
> So it'll be:
>
> struct kernel_meta {
> u32 hash;
> u32 timestamp;
> ...etc
> }
>
> and a kfunc:
>
> void store_xdp_kernel_meta(struct kernel_meta *meta);
>
> which the XDP program can call to populate the metadata area.
Hmm, nice!
But where to store this info in case of cpumap if not in xdp->data_meta?
When you convert XDP frames to skbs in the cpumap code, you only have
&xdp_frame and that's it. XDP prog was already run earlier from the
driver code at that point.
But yes, in general we still need some generic structure, so that
generic consumers like cpumap (but not only it) can make use of this
info, not only XDP programs.
>
> -Toke
Thanks,
Olek
^ permalink raw reply [flat|nested] 98+ messages in thread
* [xdp-hints] Re: [PATCH RFC bpf-next 32/52] bpf, cpumap: switch to GRO from netif_receive_skb_list()
2024-08-09 12:56 ` Alexander Lobakin
@ 2024-08-09 13:42 ` Toke Høiland-Jørgensen
2024-08-10 0:54 ` Martin KaFai Lau
2024-08-10 8:02 ` Lorenzo Bianconi
0 siblings, 2 replies; 98+ messages in thread
From: Toke Høiland-Jørgensen @ 2024-08-09 13:42 UTC (permalink / raw)
To: Alexander Lobakin
Cc: Daniel Xu, Lorenzo Bianconi, Alexander Lobakin,
Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Larysa Zaremba, Michal Swiatkowski, Jesper Dangaard Brouer,
Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
Jonathan Lemon, Lorenzo Bianconi, David Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, John Fastabend, Yajun Deng,
Willem de Bruijn, bpf, netdev, linux-kernel, xdp-hints
Alexander Lobakin <aleksander.lobakin@intel.com> writes:
> From: Toke Høiland-Jørgensen <toke@redhat.com>
> Date: Fri, 09 Aug 2024 14:45:33 +0200
>
>> [...]
>
> Hmm, nice!
>
> But where to store this info in case of cpumap if not in xdp->data_meta?
> When you convert XDP frames to skbs in the cpumap code, you only have
> &xdp_frame and that's it. XDP prog was already run earlier from the
> driver code at that point.
Well, we could put it in skb_shared_info? IIRC, some of the metadata
(timestamps?) ends up there when building an skb anyway, so we won't
even have to copy it around...
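(A sketch of that direction — skb_hwtstamps() and skb_set_hash() are
existing kernel API, while struct kernel_meta remains the hypothetical
blob from above, with the timestamp widened to u64 nanoseconds:)

#include <linux/ktime.h>
#include <linux/skbuff.h>

struct kernel_meta {
	u32 hash;
	u64 timestamp;	/* ns, per the HW RX timestamp hint */
};

/* called when cpumap (or veth) builds the skb from an xdp_frame */
static void fill_skb_from_kernel_meta(struct sk_buff *skb,
				      const struct kernel_meta *meta)
{
	/* the HW RX timestamp already has a home in skb_shared_info */
	skb_hwtstamps(skb)->hwtstamp = ns_to_ktime(meta->timestamp);

	/* the hash lives in the skb itself */
	skb_set_hash(skb, meta->hash, PKT_HASH_TYPE_L4);
}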
-Toke
^ permalink raw reply [flat|nested] 98+ messages in thread
* [xdp-hints] Re: [PATCH RFC bpf-next 32/52] bpf, cpumap: switch to GRO from netif_receive_skb_list()
2024-08-09 13:42 ` Toke Høiland-Jørgensen
@ 2024-08-10 0:54 ` Martin KaFai Lau
2024-08-10 8:02 ` Lorenzo Bianconi
1 sibling, 0 replies; 98+ messages in thread
From: Martin KaFai Lau @ 2024-08-10 0:54 UTC (permalink / raw)
To: Toke Høiland-Jørgensen, Alexander Lobakin
Cc: Daniel Xu, Lorenzo Bianconi, Alexander Lobakin,
Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Larysa Zaremba, Michal Swiatkowski, Jesper Dangaard Brouer,
Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
Jonathan Lemon, Lorenzo Bianconi, David Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, John Fastabend, Yajun Deng,
Willem de Bruijn, bpf, netdev, linux-kernel, xdp-hints
On 8/9/24 6:42 AM, Toke Høiland-Jørgensen wrote:
> Alexander Lobakin <aleksander.lobakin@intel.com> writes:
>
>> From: Toke Høiland-Jørgensen <toke@redhat.com>
>> Date: Fri, 09 Aug 2024 14:45:33 +0200
>>
>>> [...]
>>
>> Hmm, nice!
>>
>> But where to store this info in case of cpumap if not in xdp->data_meta?
The cpumap has an xdp program. Instead of the kernel's cpumap code
building the skb, the cpumap's xdp prog could build the skb itself and
directly use xdp->data_meta to init the skb.
I recall there was discussion about doing GRO in a bpf prog (I may be
mis-remembering though). If that is possible, could the cpumap's xdp
prog also do the GRO?
>> When you convert XDP frames to skbs in the cpumap code, you only have
>> &xdp_frame and that's it. XDP prog was already run earlier from the
>> driver code at that point.
>
> Well, we could put it in skb_shared_info? IIRC, some of the metadata
> (timestamps?) ends up there when building an skb anyway, so we won't
> have to copy it around...
^ permalink raw reply [flat|nested] 98+ messages in thread
* [xdp-hints] Re: [PATCH RFC bpf-next 32/52] bpf, cpumap: switch to GRO from netif_receive_skb_list()
2024-08-09 13:42 ` Toke Høiland-Jørgensen
2024-08-10 0:54 ` Martin KaFai Lau
@ 2024-08-10 8:02 ` Lorenzo Bianconi
1 sibling, 0 replies; 98+ messages in thread
From: Lorenzo Bianconi @ 2024-08-10 8:02 UTC (permalink / raw)
To: Toke Høiland-Jørgensen
Cc: Alexander Lobakin, Daniel Xu, Alexander Lobakin,
Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Larysa Zaremba, Michal Swiatkowski, Jesper Dangaard Brouer,
Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
Jonathan Lemon, Lorenzo Bianconi, David Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, John Fastabend, Yajun Deng,
Willem de Bruijn, bpf, netdev, linux-kernel, xdp-hints
On Aug 09, Toke wrote:
> [...]
> >
> > Hmm, nice!
> >
> > But where to store this info in case of cpumap if not in xdp->data_meta?
> > When you convert XDP frames to skbs in the cpumap code, you only have
> > &xdp_frame and that's it. XDP prog was already run earlier from the
> > driver code at that point.
>
> Well, we could put it in skb_shared_info? IIRC, some of the metadata
> (timestamps?) ends up there when building an skb anyway, so we won't
> have to copy it around...
Before vacation I started looking into it a bit; I will resume this
work in a week or so.
Regards,
Lorenzo
>
> -Toke
>
^ permalink raw reply [flat|nested] 98+ messages in thread
* [xdp-hints] Re: [PATCH RFC bpf-next 32/52] bpf, cpumap: switch to GRO from netif_receive_skb_list()
2024-08-09 12:20 ` Alexander Lobakin
2024-08-09 12:45 ` Toke Høiland-Jørgensen
@ 2024-08-13 1:33 ` Jakub Kicinski
2024-08-13 9:51 ` Jesper Dangaard Brouer
1 sibling, 1 reply; 98+ messages in thread
From: Jakub Kicinski @ 2024-08-13 1:33 UTC (permalink / raw)
To: Alexander Lobakin
Cc: Daniel Xu, Lorenzo Bianconi, Alexander Lobakin,
Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Larysa Zaremba, Michal Swiatkowski, Jesper Dangaard Brouer,
Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
Jonathan Lemon, toke, Lorenzo Bianconi, David Miller,
Eric Dumazet, Paolo Abeni, Jesse Brandeburg, John Fastabend,
Yajun Deng, Willem de Bruijn, bpf, netdev, linux-kernel,
xdp-hints
On Fri, 9 Aug 2024 14:20:25 +0200 Alexander Lobakin wrote:
> But I think one solution could be:
>
> 1. We create some generic structure for cpumap, like
>
> struct cpumap_meta {
> u32 magic;
> u32 hash;
> }
>
> 2. We add such check in the cpumap code
>
> if (xdpf->metalen == sizeof(struct cpumap_meta) &&
> <here we check magic>)
> skb->hash = meta->hash;
>
> 3. In XDP prog, you call Rx hints kfuncs when they're available, obtain
> RSS hash and then put it in the struct cpumap_meta as XDP frame metadata.
I wonder what the overhead of skb metadata allocation is in practice.
With Eric's "return skb to the CPU of origin" we can feed the lockless
skb cache one the right CPU, and also feed the lockless page pool
cache. I wonder if batched RFS wouldn't be faster than the XDP thing
that requires all the groundwork.
^ permalink raw reply [flat|nested] 98+ messages in thread
* [xdp-hints] Re: [PATCH RFC bpf-next 32/52] bpf, cpumap: switch to GRO from netif_receive_skb_list()
2024-08-13 1:33 ` Jakub Kicinski
@ 2024-08-13 9:51 ` Jesper Dangaard Brouer
0 siblings, 0 replies; 98+ messages in thread
From: Jesper Dangaard Brouer @ 2024-08-13 9:51 UTC (permalink / raw)
To: Jakub Kicinski, Alexander Lobakin
Cc: Daniel Xu, Lorenzo Bianconi, Alexander Lobakin,
Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Larysa Zaremba, Michal Swiatkowski, Björn Töpel,
Magnus Karlsson, Maciej Fijalkowski, Jonathan Lemon, toke,
Lorenzo Bianconi, David Miller, Eric Dumazet, Paolo Abeni,
Jesse Brandeburg, John Fastabend, Yajun Deng, Willem de Bruijn,
bpf, netdev, linux-kernel, xdp-hints
On 13/08/2024 03.33, Jakub Kicinski wrote:
> On Fri, 9 Aug 2024 14:20:25 +0200 Alexander Lobakin wrote:
>> But I think one solution could be:
>>
>> 1. We create some generic structure for cpumap, like
>>
>> struct cpumap_meta {
>> u32 magic;
>> u32 hash;
>> }
>>
>> 2. We add such check in the cpumap code
>>
>> if (xdpf->metalen == sizeof(struct cpumap_meta) &&
>> <here we check magic>)
>> skb->hash = meta->hash;
>>
>> 3. In XDP prog, you call Rx hints kfuncs when they're available, obtain
>> RSS hash and then put it in the struct cpumap_meta as XDP frame metadata.
>
> I wonder what the overhead of skb metadata allocation is in practice.
> With Eric's "return skb to the CPU of origin" we can feed the lockless
> skb cache on the right CPU, and also feed the lockless page pool
> cache. I wonder if batched RFS wouldn't be faster than the XDP thing
> that requires all the groundwork.
I explicitly developed CPUMAP because I was benchmarking Receive Flow
Steering (RFS) and Receive Packet Steering (RPS), which I observed were
the bottleneck. The overhead on the RX-CPU was too large and became a
bottleneck, due to RFS and RPS maintaining data structures to avoid
out-of-order packets. The Flow Dissector step was also a limiting factor.
By bottleneck I mean it didn't scale: the RX-CPU's packet-per-second
processing speed was too low compared to the remote-CPU pps.
Digging in my old notes, I can see that RPS was limited to around 4.8
Mpps (and a weird result of 7.5 Mpps with part of it disabled). In [1]
the remote-CPU could process 2.7 Mpps (as a starting point) when
dropping UDP packets due to the UdpNoPorts configuration (and a baseline
of 3.3 Mpps if not remote), thus it only scales up to 1.78 remote-CPUs.
[1] shows how optimizations bring the remote-CPU to handle 3.2 Mpps
(close to the non-remote 3.3 Mpps baseline). In [2] those optimizations
bring the remote-CPU to 4 Mpps (for the UdpNoPorts case). XDP
RX-redirect in [1]+[2] was around 19 Mpps (which might be lower today
due to perf paper cuts).
[1]
https://github.com/xdp-project/xdp-project/blob/master/areas/cpumap/cpumap02-optimizations.org
[2]
https://github.com/xdp-project/xdp-project/blob/master/areas/cpumap/cpumap03-optimizations.org
The benefits Eric's "return skb to the CPU of origin" should help
improve the case for the remote-CPU, as I was seeing some bottlenecks in
how we returned the memory.
--Jesper
^ permalink raw reply [flat|nested] 98+ messages in thread
* [xdp-hints] Re: [PATCH RFC bpf-next 32/52] bpf, cpumap: switch to GRO from netif_receive_skb_list()
2024-08-08 11:57 ` Alexander Lobakin
2024-08-08 17:22 ` Lorenzo Bianconi
2024-08-08 20:52 ` Daniel Xu
@ 2024-08-10 8:00 ` Lorenzo Bianconi
2024-08-13 14:09 ` Alexander Lobakin
3 siblings, 0 replies; 98+ messages in thread
From: Lorenzo Bianconi @ 2024-08-10 8:00 UTC (permalink / raw)
To: Alexander Lobakin
Cc: Daniel Xu, Alexander Lobakin, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, Larysa Zaremba,
Michal Swiatkowski, Jesper Dangaard Brouer,
Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
Jonathan Lemon, toke, Lorenzo Bianconi, David Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jesse Brandeburg,
John Fastabend, Yajun Deng, Willem de Bruijn, bpf, netdev,
linux-kernel, xdp-hints
On Aug 08, Alexander Lobakin wrote:
> From: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
> Date: Thu, 8 Aug 2024 06:54:06 +0200
>
> >> [...]
> >>
> >> AFAICT the cpumap + GRO is a good standalone improvement. I think
> >> cpumap is still missing this.
>
> The only concern for having GRO in cpumap without metadata from the NIC
> descriptor was that when the checksum status is missing, GRO calculates
> the checksum on CPU, which is not really fast.
> But I remember sometimes GRO was faster despite that.
For the moment we could test it with UDP traffic with checksum disabled.
Regards,
Lorenzo
^ permalink raw reply [flat|nested] 98+ messages in thread
* [xdp-hints] Re: [PATCH RFC bpf-next 32/52] bpf, cpumap: switch to GRO from netif_receive_skb_list()
2024-08-08 11:57 ` Alexander Lobakin
` (2 preceding siblings ...)
2024-08-10 8:00 ` Lorenzo Bianconi
@ 2024-08-13 14:09 ` Alexander Lobakin
2024-08-13 14:54 ` Toke Høiland-Jørgensen
` (2 more replies)
3 siblings, 3 replies; 98+ messages in thread
From: Alexander Lobakin @ 2024-08-13 14:09 UTC (permalink / raw)
To: Lorenzo Bianconi, Daniel Xu
Cc: Alexander Lobakin, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, toke, Lorenzo Bianconi,
David Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Jesse Brandeburg, John Fastabend, Yajun Deng, Willem de Bruijn,
bpf, netdev, linux-kernel, xdp-hints
From: Alexander Lobakin <aleksander.lobakin@intel.com>
Date: Thu, 8 Aug 2024 13:57:00 +0200
> From: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
> Date: Thu, 8 Aug 2024 06:54:06 +0200
>
>>> [...]
>>>
>>> AFAICT the cpumap + GRO is a good standalone improvement. I think
>>> cpumap is still missing this.
>
> The only concern for having GRO in cpumap without metadata from the NIC
> descriptor was that when the checksum status is missing, GRO calculates
> the checksum on CPU, which is not really fast.
> But I remember sometimes GRO was faster despite that.
>
>>>
>>> I have a production use case for this now. We want to do some intelligent
>>> RX steering and I think GRO would help over list-ified receive in some cases.
>>> We would prefer steer in HW (and thus get existing GRO support) but not all
>>> our NICs support it. So we need a software fallback.
>>>
>>> Are you still interested in merging the cpumap + GRO patches?
>
> For sure I can revive this part. I was planning to get back to this
> branch and pick patches which were not related to XDP hints and send
> them separately.
>
>>
>> Hi Daniel and Alex,
>>
>> Recently I worked on a PoC to add GRO support to cpumap codebase:
>> - https://github.com/LorenzoBianconi/bpf-next/commit/a4b8264d5000ecf016da5a2dd9ac302deaf38b3e
>> Here I added GRO support to cpumap through gro-cells.
>> - https://github.com/LorenzoBianconi/bpf-next/commit/da6cb32a4674aa72401c7414c9a8a0775ef41a55
>> Here I added GRO support to cpumap through napi-threaded APIs (with
>> some changes to them).
>
> Hmm, when I was testing it, adding a whole NAPI to cpumap was sorta
> overkill, that's why I separated GRO structure from &napi_struct.
>
> Let me maybe find some free time, I would then test all 3 solutions
> (mine, gro_cells, threaded NAPI) and pick/send the best?
>
>>
>> Please note I have not run any performance tests so far, just verified it does
>> not crash (I was planning to resume this work soon). Please let me know if it
>> works for you.
I did tests on both threaded NAPI for cpumap and my old implementation
with a traffic generator and I have the following (in Kpps):
           direct Rx   direct GRO   cpumap   cpumap GRO
baseline      2900        5800       2700    2700 (N/A)
threaded                             2300       4000
old GRO                              2300       4000
IOW,
1. There are no differences in perf between Lorenzo's threaded NAPI
GRO implementation and my old implementation, but Lorenzo's is also
a very nice cleanup as it switches cpumap to threaded NAPI completely
and the final diffstat even removes more lines than it adds, while mine
adds a bunch of lines and refactors a couple hundred, so I'd go with
his variant.
2. After switching to NAPI, the performance without GRO decreases (2.3
Mpps vs 2.7 Mpps), but after enabling GRO the perf increases hugely
(4 Mpps vs 2.7 Mpps) even though the CPU needs to compute checksums
manually.
Note that the code is not polished to the top and I also have a good
improvement for allocating skb heads from the percpu NAPI cache in my
old tree which I'm planning to add to the series, so the final
improvement will be even bigger.
Plus, after we find out how to pass the checksum hint to cpumap, there
will be yet another big improvement for GRO (the current code doesn't
benefit from this at all).
To Lorenzo:
Would it be fine if I prepare a series containing your patch for
threaded NAPI for cpumap (I'd polish it and break it into 2 or 3
patches) + the skb allocation optimization and send it, or would you
rather send this on your own? I'm fine with either: in the first case,
everything would land within one series with the respective credits;
in the latter case, I'd need to send a followup :)
>>
>> Regards,
>> Lorenzo
>>
>>>
>>> Thanks,
>>> Daniel
Thanks,
Olek
^ permalink raw reply [flat|nested] 98+ messages in thread
* [xdp-hints] Re: [PATCH RFC bpf-next 32/52] bpf, cpumap: switch to GRO from netif_receive_skb_list()
2024-08-13 14:09 ` Alexander Lobakin
@ 2024-08-13 14:54 ` Toke Høiland-Jørgensen
2024-08-13 15:57 ` Jesper Dangaard Brouer
2024-08-13 16:14 ` Lorenzo Bianconi
2024-08-13 16:27 ` Lorenzo Bianconi
2 siblings, 1 reply; 98+ messages in thread
From: Toke Høiland-Jørgensen @ 2024-08-13 14:54 UTC (permalink / raw)
To: Alexander Lobakin, Lorenzo Bianconi, Daniel Xu
Cc: Alexander Lobakin, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Lorenzo Bianconi,
David Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Jesse Brandeburg, John Fastabend, Yajun Deng, Willem de Bruijn,
bpf, netdev, linux-kernel, xdp-hints
Alexander Lobakin <aleksander.lobakin@intel.com> writes:
> From: Alexander Lobakin <aleksander.lobakin@intel.com>
> Date: Thu, 8 Aug 2024 13:57:00 +0200
>
>> From: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
>> Date: Thu, 8 Aug 2024 06:54:06 +0200
>>
>>>> [...]
>
> I did tests on both threaded NAPI for cpumap and my old implementation
> with a traffic generator and I have the following (in Kpps):
>
>            direct Rx   direct GRO   cpumap   cpumap GRO
> baseline      2900        5800       2700    2700 (N/A)
> threaded                             2300       4000
> old GRO                              2300       4000
>
> IOW,
>
> 1. There are no differences in perf between Lorenzo's threaded NAPI
> GRO implementation and my old implementation, but Lorenzo's is also
> a very nice cleanup as it switches cpumap to threaded NAPI completely
> and the final diffstat even removes more lines than it adds, while mine
> adds a bunch of lines and refactors a couple hundred, so I'd go with
> his variant.
>
> 2. After switching to NAPI, the performance without GRO decreases (2.3
> Mpps vs 2.7 Mpps), but after enabling GRO the perf increases hugely
> (4 Mpps vs 2.7 Mpps) even though the CPU needs to compute checksums
> manually.
One question for this: IIUC, the benefit of GRO varies with the traffic
mix, depending on how much the GRO logic can actually aggregate. So did
you test the pathological case as well (spraying packets over so many
flows that there is basically no aggregation taking place)? Just to make
sure we don't accidentally screw up performance in that case while
optimising for the aggregating case :)
-Toke
^ permalink raw reply [flat|nested] 98+ messages in thread
* [xdp-hints] Re: [PATCH RFC bpf-next 32/52] bpf, cpumap: switch to GRO from netif_receive_skb_list()
2024-08-13 14:54 ` Toke Høiland-Jørgensen
@ 2024-08-13 15:57 ` Jesper Dangaard Brouer
2024-08-19 14:50 ` Alexander Lobakin
0 siblings, 1 reply; 98+ messages in thread
From: Jesper Dangaard Brouer @ 2024-08-13 15:57 UTC (permalink / raw)
To: Toke Høiland-Jørgensen, Alexander Lobakin,
Lorenzo Bianconi, Daniel Xu
Cc: Alexander Lobakin, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Larysa Zaremba, Michal Swiatkowski,
Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
Jonathan Lemon, Lorenzo Bianconi, David Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, John Fastabend, Yajun Deng,
Willem de Bruijn, bpf, netdev, linux-kernel, xdp-hints
On 13/08/2024 16.54, Toke Høiland-Jørgensen wrote:
> Alexander Lobakin <aleksander.lobakin@intel.com> writes:
>
>> From: Alexander Lobakin <aleksander.lobakin@intel.com>
>> Date: Thu, 8 Aug 2024 13:57:00 +0200
>>
>>> [...]
>>
>> I did tests on both threaded NAPI for cpumap and my old implementation
>> with a traffic generator and I have the following (in Kpps):
>>
What kind of traffic is the traffic generator sending?
E.g. is this a type of traffic that gets GRO aggregated?
>>            direct Rx   direct GRO   cpumap   cpumap GRO
>> baseline      2900        5800       2700    2700 (N/A)
>> threaded                             2300       4000
>> old GRO                              2300       4000
>>
Nice results. Just to confirm, the units are in Kpps.
>> IOW,
>>
>> 1. There are no differences in perf between Lorenzo's threaded NAPI
>> GRO implementation and my old implementation, but Lorenzo's is also
>> a very nice cleanup as it switches cpumap to threaded NAPI completely
>> and the final diffstat even removes more lines than it adds, while mine
>> adds a bunch of lines and refactors a couple hundred, so I'd go with
>> his variant.
>>
>> 2. After switching to NAPI, the performance without GRO decreases (2.3
>> Mpps vs 2.7 Mpps), but after enabling GRO the perf increases hugely
>> (4 Mpps vs 2.7 Mpps) even though the CPU needs to compute checksums
>> manually.
>
> One question for this: IIUC, the benefit of GRO varies with the traffic
> mix, depending on how much the GRO logic can actually aggregate. So did
> you test the pathological case as well (spraying packets over so many
> flows that there is basically no aggregation taking place)? Just to make
> sure we don't accidentally screw up performance in that case while
> optimising for the aggregating case :)
>
For the GRO use-case, I think a basic TCP stream throughput test (like
netperf) should show a benefit once cpumap enables GRO. Can you confirm
this? Or do the missing hardware RX-hash and RX-checksum cause TCP GRO
not to fully work yet?
Thanks A LOT for doing this benchmarking!
--Jesper
^ permalink raw reply [flat|nested] 98+ messages in thread
* [xdp-hints] Re: [PATCH RFC bpf-next 32/52] bpf, cpumap: switch to GRO from netif_receive_skb_list()
2024-08-13 15:57 ` Jesper Dangaard Brouer
@ 2024-08-19 14:50 ` Alexander Lobakin
2024-08-21 0:29 ` Daniel Xu
0 siblings, 1 reply; 98+ messages in thread
From: Alexander Lobakin @ 2024-08-19 14:50 UTC (permalink / raw)
To: Jesper Dangaard Brouer, Toke Høiland-Jørgensen,
Lorenzo Bianconi, Daniel Xu
Cc: Alexander Lobakin, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Larysa Zaremba, Michal Swiatkowski,
Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
Jonathan Lemon, Lorenzo Bianconi, David Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, John Fastabend, Yajun Deng,
Willem de Bruijn, bpf, netdev, linux-kernel, xdp-hints
From: Jesper Dangaard Brouer <hawk@kernel.org>
Date: Tue, 13 Aug 2024 17:57:44 +0200
>
>
> On 13/08/2024 16.54, Toke Høiland-Jørgensen wrote:
>> Alexander Lobakin <aleksander.lobakin@intel.com> writes:
>>
>>> From: Alexander Lobakin <aleksander.lobakin@intel.com>
>>> Date: Thu, 8 Aug 2024 13:57:00 +0200
>>>
>>>> From: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
>>>> Date: Thu, 8 Aug 2024 06:54:06 +0200
>>>>
>>>>>> Hi Alexander,
[...]
>>> I did tests on both threaded NAPI for cpumap and my old implementation
>>> with a traffic generator and I have the following (in Kpps):
>>>
>
> What kind of traffic is the traffic generator sending?
>
> E.g. is this a type of traffic that gets GRO aggregated?
Yes. It's UDP, with the UDP GRO enabled on the receiver.
>
>>>            direct Rx  direct GRO  cpumap  cpumap GRO
>>> baseline   2900       5800        2700    2700 (N/A)
>>> threaded                          2300    4000
>>> old GRO                           2300    4000
>>>
>
> Nice results. Just to confirm, the units are in Kpps.
Yes. I.e. cpumap was giving 2.7 Mpps without GRO, then 4.0 Mpps with it.
>
>
>>> IOW,
>>>
>>> 1. There are no differences in perf between Lorenzo's threaded NAPI
>>> GRO implementation and my old implementation, but Lorenzo's is also
>>> a very nice cleanup as it switches cpumap to threaded NAPI
>>> completely
>>> and the final diffstat even removes more lines than adds, while mine
>>> adds a bunch of lines and refactors a couple hundred, so I'd go with
>>> his variant.
>>>
>>> 2. After switching to NAPI, the performance without GRO decreases (2.3
>>> Mpps vs 2.7 Mpps), but after enabling GRO the perf increases hugely
>>> (4 Mpps vs 2.7 Mpps) even though the CPU needs to compute checksums
>>> manually.
>>
>> One question for this: IIUC, the benefit of GRO varies with the traffic
>> mix, depending on how much the GRO logic can actually aggregate. So did
>> you test the pathological case as well (spraying packets over so many
>> flows that there is basically no aggregation taking place)? Just to make
>> sure we don't accidentally screw up performance in that case while
>> optimising for the aggregating case :)
>>
>
> For the GRO use-case, I think a basic TCP stream throughput test (like
> netperf) should show a benefit once cpumap enables GRO. Can you confirm
> this?
Yes, TCP benefits as well.
> Or do the missing hardware RX-hash and RX-checksum cause TCP GRO not
> to fully work yet?
GRO works well for both TCP and UDP. The main bottleneck is that GRO
calculates the checksum manually on the CPU now, since there's no
checksum status from the NIC.
Also, missing Rx hash means GRO will place packets from every flow into
the same bucket, but it's not a big deal (they get compared layer by
layer anyway).
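For reference, a sketch of the bucketing I mean, assuming the mainline
GRO layer (this is how dev_gro_receive() picks a bucket, with
GRO_HASH_BUCKETS being 8 there):

	/* the low bits of the skb hash select the GRO hash bucket */
	u32 bucket = skb_get_hash_raw(skb) & (GRO_HASH_BUCKETS - 1);

Without an Rx hash hint, skb_get_hash_raw() returns the same value (0)
for every frame, hence one shared bucket and longer compare chains.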
>
> Thanks A LOT for doing this benchmarking!
I optimized the code a bit and picked my old patches for bulk NAPI skb
cache allocation and today I got 4.7 Mpps 🎉
IOW, the result of the series (7 patches in total, but 2 are not
networking-related) is 2.7 -> 4.7 Mpps == ~75%!
Daniel,
if you want, you can pick my tree[0], either full or just up to
"bpf: cpumap: switch to napi_skb_cache_get_bulk()"
(13 patches total: 6 for netdev_feature_t and 7 for the cpumap)
and test with your usecases. Would be nice to see some real world
results, not my synthetic tests :D
> --Jesper
[0]
https://github.com/alobakin/linux/compare/idpf-libie-new~52...idpf-libie-new/
Thanks,
Olek
* [xdp-hints] Re: [PATCH RFC bpf-next 32/52] bpf, cpumap: switch to GRO from netif_receive_skb_list()
2024-08-19 14:50 ` Alexander Lobakin
@ 2024-08-21 0:29 ` Daniel Xu
2024-08-21 13:16 ` Alexander Lobakin
0 siblings, 1 reply; 98+ messages in thread
From: Daniel Xu @ 2024-08-21 0:29 UTC (permalink / raw)
To: Alexander Lobakin
Cc: Jesper Dangaard Brouer, Toke Høiland-Jørgensen,
Lorenzo Bianconi, Alexander Lobakin, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, Larysa Zaremba,
Michal Swiatkowski, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Lorenzo Bianconi,
David Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
John Fastabend, Yajun Deng, Willem de Bruijn, bpf, netdev,
linux-kernel, xdp-hints
Hi Olek,
On Mon, Aug 19, 2024 at 04:50:52PM GMT, Alexander Lobakin wrote:
[..]
> > Thanks A LOT for doing this benchmarking!
>
> I optimized the code a bit and picked my old patches for bulk NAPI skb
> cache allocation and today I got 4.7 Mpps 🎉
> IOW, the result of the series (7 patches in total, but 2 are not
> networking-related) is 2.7 -> 4.7 Mpps == ~75%!
>
> Daniel,
>
> if you want, you can pick my tree[0], either full or just up to
>
> "bpf: cpumap: switch to napi_skb_cache_get_bulk()"
>
> (13 patches total: 6 for netdev_feature_t and 7 for the cpumap)
>
> and test with your usecases. Would be nice to see some real world
> results, not my synthetic tests :D
>
> > --Jesper
>
> [0]
> https://github.com/alobakin/linux/compare/idpf-libie-new~52...idpf-libie-new/
So it turns out keeping the workload in place while I update and reboot
the kernel is a Hard Problem. I'll put in some more effort and see if I
can get one of the workloads to stay still, but it'll be a somewhat
noisy test even if it works. So the following are synthetic tests
(neper) but on a real prod setup as far as container networking and
configuration is concerned.
I cherry-picked 586be610~1..ca22ac8e9de onto our 6.9-ish branch. Had to
skip some of the flag refactors b/c of conflicts - I didn't know the
code well enough to do fixups. So I had to apply this diff (FWIW not sure
the struct_size() here was right anyhow):
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index 089d19c62efe..359fbfaa43eb 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -110,7 +110,7 @@ static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
if (!cmap->cpu_map)
goto free_cmap;
- dev = bpf_map_area_alloc(struct_size(dev, priv, 0), NUMA_NO_NODE);
+ dev = bpf_map_area_alloc(sizeof(*dev), NUMA_NO_NODE);
if (!dev)
goto free_cpu_map;
=== Baseline ===
./tcp_rr -c -H $SERVER -p 50,90,99 -T4 -F8 -l30    ./tcp_stream -c -H $SERVER -T8 -F16 -l30

        Transactions  Latency P50 (s)  Latency P90 (s)  Latency P99 (s)           Throughput (Mbit/s)
Run 1   2578189       0.00008831       0.00010623       0.00013439       Run 1    15427.22
Run 2   2657923       0.00008575       0.00010239       0.00012927       Run 2    15272.12
Run 3   2700402       0.00008447       0.00010111       0.00013183       Run 3    14871.35
Run 4   2571739       0.00008575       0.00011519       0.00013823       Run 4    15344.72
Run 5   2476427       0.00008703       0.00013055       0.00016895       Run 5    15193.2
Average 2596936       0.000086262      0.000111094      0.000140534      Average  15221.722

=== cpumap NAPI patches ===
        Transactions  Latency P50 (s)  Latency P90 (s)  Latency P99 (s)           Throughput (Mbit/s)
Run 1   2554598       0.00008703       0.00011263       0.00013055       Run 1    17090.29
Run 2   2478905       0.00009087       0.00011391       0.00014463       Run 2    16742.27
Run 3   2418599       0.00009471       0.00011007       0.00014207       Run 3    17555.3
Run 4   2562463       0.00008959       0.00010367       0.00013055       Run 4    17892.3
Run 5   2716551       0.00008127       0.00010879       0.00013439       Run 5    17578.32
Average 2546223.2     0.000088694      0.000109814      0.000136438      Average  17371.696
Delta   -1.95%        2.82%            -1.15%           -2.91%                    14.12%
So it looks like the GRO patches work quite well out of the box. It's
curious that tcp_rr transactions go down a bit, though. I don't have any
intuition around that.
Lemme know if you wanna change some stuff and get a rerun.
Thanks,
Daniel
* [xdp-hints] Re: [PATCH RFC bpf-next 32/52] bpf, cpumap: switch to GRO from netif_receive_skb_list()
2024-08-21 0:29 ` Daniel Xu
@ 2024-08-21 13:16 ` Alexander Lobakin
2024-08-21 16:36 ` Daniel Xu
0 siblings, 1 reply; 98+ messages in thread
From: Alexander Lobakin @ 2024-08-21 13:16 UTC (permalink / raw)
To: Daniel Xu
Cc: Jesper Dangaard Brouer, Toke Høiland-Jørgensen,
Lorenzo Bianconi, Alexander Lobakin, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, Larysa Zaremba,
Michal Swiatkowski, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Lorenzo Bianconi,
David Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
John Fastabend, Yajun Deng, Willem de Bruijn, bpf, netdev,
linux-kernel, xdp-hints
From: Daniel Xu <dxu@dxuuu.xyz>
Date: Tue, 20 Aug 2024 17:29:45 -0700
> Hi Olek,
>
> On Mon, Aug 19, 2024 at 04:50:52PM GMT, Alexander Lobakin wrote:
> [..]
>>> Thanks A LOT for doing this benchmarking!
>>
>> I optimized the code a bit and picked my old patches for bulk NAPI skb
>> cache allocation and today I got 4.7 Mpps 🎉
>> IOW, the result of the series (7 patches in total, but 2 are not
>> networking-related) is 2.7 -> 4.7 Mpps == ~75%!
>>
>> Daniel,
>>
>> if you want, you can pick my tree[0], either full or just up to
>>
>> "bpf: cpumap: switch to napi_skb_cache_get_bulk()"
>>
>> (13 patches total: 6 for netdev_feature_t and 7 for the cpumap)
>>
>> and test with your usecases. Would be nice to see some real world
>> results, not my synthetic tests :D
>>
>>> --Jesper
>>
>> [0]
>> https://github.com/alobakin/linux/compare/idpf-libie-new~52...idpf-libie-new/
>
> So it turns out keeping the workload in place while I update and reboot
> the kernel is a Hard Problem. I'll put in some more effort and see if I
> can get one of the workloads to stay still, but it'll be a somewhat
> noisy test even if it works. So the following are synthetic tests
> (neper) but on a real prod setup as far as container networking and
> configuration is concerned.
>
> I cherry-picked 586be610~1..ca22ac8e9de onto our 6.9-ish branch. Had to
> skip some of the flag refactors b/c of conflicts - I didn't know the
> code well enough to do fixups. So I had to apply this diff (FWIW not sure
> the struct_size() here was right anyhow):
>
> diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
> index 089d19c62efe..359fbfaa43eb 100644
> --- a/kernel/bpf/cpumap.c
> +++ b/kernel/bpf/cpumap.c
> @@ -110,7 +110,7 @@ static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
> if (!cmap->cpu_map)
> goto free_cmap;
>
> - dev = bpf_map_area_alloc(struct_size(dev, priv, 0), NUMA_NO_NODE);
> + dev = bpf_map_area_alloc(sizeof(*dev), NUMA_NO_NODE);
Hmm, it will allocate the same amount of memory. Why do you need this?
Are you running these patches on some older kernel which doesn't have a
proper flex array at the end of &net_device?
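To illustrate why both expressions allocate the same amount, here is a
self-contained sketch with a hypothetical demo_dev stand-in and a
simplified struct_size() (the real kernel macro also does overflow
checking):

	#include <stdio.h>

	/* hypothetical stand-in for net_device with a trailing flex array */
	struct demo_dev {
		long cfg;
		char priv[];		/* flexible array member */
	};

	/* simplified struct_size(): the struct plus n trailing elements */
	#define struct_size(p, member, n) \
		(sizeof(*(p)) + (n) * sizeof(*(p)->member))

	int main(void)
	{
		struct demo_dev *dev = NULL;

		/* with n == 0, both print the same value (8 on LP64) */
		printf("%zu == %zu\n", struct_size(dev, priv, 0),
		       sizeof(*dev));
		return 0;
	}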
> if (!dev)
> goto free_cpu_map;
>
>
> === Baseline ===
> ./tcp_rr -c -H $SERVER -p 50,90,99 -T4 -F8 -l30    ./tcp_stream -c -H $SERVER -T8 -F16 -l30
>
>         Transactions  Latency P50 (s)  Latency P90 (s)  Latency P99 (s)           Throughput (Mbit/s)
> Run 1   2578189       0.00008831       0.00010623       0.00013439       Run 1    15427.22
> Run 2   2657923       0.00008575       0.00010239       0.00012927       Run 2    15272.12
> Run 3   2700402       0.00008447       0.00010111       0.00013183       Run 3    14871.35
> Run 4   2571739       0.00008575       0.00011519       0.00013823       Run 4    15344.72
> Run 5   2476427       0.00008703       0.00013055       0.00016895       Run 5    15193.2
> Average 2596936       0.000086262      0.000111094      0.000140534      Average  15221.722
>
> === cpumap NAPI patches ===
>         Transactions  Latency P50 (s)  Latency P90 (s)  Latency P99 (s)           Throughput (Mbit/s)
> Run 1   2554598       0.00008703       0.00011263       0.00013055       Run 1    17090.29
> Run 2   2478905       0.00009087       0.00011391       0.00014463       Run 2    16742.27
> Run 3   2418599       0.00009471       0.00011007       0.00014207       Run 3    17555.3
> Run 4   2562463       0.00008959       0.00010367       0.00013055       Run 4    17892.3
> Run 5   2716551       0.00008127       0.00010879       0.00013439       Run 5    17578.32
> Average 2546223.2     0.000088694      0.000109814      0.000136438      Average  17371.696
> Delta   -1.95%        2.82%            -1.15%           -2.91%                    14.12%
>
>
> So it looks like the GRO patches work quite well out of the box. It's
> curious that tcp_rr transactions go down a bit, though. I don't have any
> intuition around that.
14% is quite nice, I'd say. Is the first table taken from cpumap as
well, or just direct Rx?
>
> Lemme know if you wanna change some stuff and get a rerun.
>
> Thanks,
> Daniel
Thanks,
Olek
* [xdp-hints] Re: [PATCH RFC bpf-next 32/52] bpf, cpumap: switch to GRO from netif_receive_skb_list()
2024-08-21 13:16 ` Alexander Lobakin
@ 2024-08-21 16:36 ` Daniel Xu
0 siblings, 0 replies; 98+ messages in thread
From: Daniel Xu @ 2024-08-21 16:36 UTC (permalink / raw)
To: Alexander Lobakin
Cc: Jesper Dangaard Brouer, Toke Høiland-Jørgensen,
Lorenzo Bianconi, Alexander Lobakin, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, Larysa Zaremba,
Michal Swiatkowski, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Lorenzo Bianconi,
David Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
John Fastabend, Yajun Deng, Willem de Bruijn, bpf, netdev,
linux-kernel, xdp-hints
On Wed, Aug 21, 2024 at 03:16:51PM GMT, Alexander Lobakin wrote:
> From: Daniel Xu <dxu@dxuuu.xyz>
> Date: Tue, 20 Aug 2024 17:29:45 -0700
>
> > Hi Olek,
> >
> > On Mon, Aug 19, 2024 at 04:50:52PM GMT, Alexander Lobakin wrote:
> > [..]
> >>> Thanks A LOT for doing this benchmarking!
> >>
> >> I optimized the code a bit and picked my old patches for bulk NAPI skb
> >> cache allocation and today I got 4.7 Mpps 🎉
> >> IOW, the result of the series (7 patches in total, but 2 are not
> >> networking-related) is 2.7 -> 4.7 Mpps == ~75%!
> >>
> >> Daniel,
> >>
> >> if you want, you can pick my tree[0], either full or just up to
> >>
> >> "bpf: cpumap: switch to napi_skb_cache_get_bulk()"
> >>
> >> (13 patches total: 6 for netdev_feature_t and 7 for the cpumap)
> >>
> >> and test with your usecases. Would be nice to see some real world
> >> results, not my synthetic tests :D
> >>
> >>> --Jesper
> >>
> >> [0]
> >> https://github.com/alobakin/linux/compare/idpf-libie-new~52...idpf-libie-new/
> >
> > So it turns out keeping the workload in place while I update and reboot
> > the kernel is a Hard Problem. I'll put in some more effort and see if I
> > can get one of the workloads to stay still, but it'll be a somewhat
> > noisy test even if it works. So the following are synthetic tests
> > (neper) but on a real prod setup as far as container networking and
> > configuration is concerned.
> >
> > I cherry-picked 586be610~1..ca22ac8e9de onto our 6.9-ish branch. Had to
> > skip some of the flag refactors b/c of conflicts - I didn't know the
> > code well enough to do fixups. So I had to apply this diff (FWIW not sure
> > the struct_size() here was right anyhow):
> >
> > diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
> > index 089d19c62efe..359fbfaa43eb 100644
> > --- a/kernel/bpf/cpumap.c
> > +++ b/kernel/bpf/cpumap.c
> > @@ -110,7 +110,7 @@ static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
> > if (!cmap->cpu_map)
> > goto free_cmap;
> >
> > - dev = bpf_map_area_alloc(struct_size(dev, priv, 0), NUMA_NO_NODE);
> > + dev = bpf_map_area_alloc(sizeof(*dev), NUMA_NO_NODE);
>
> Hmm, it will allocate the same amount of memory. Why do you need this?
> Are you running these patches on some older kernel which doesn't have a
> proper flex array at the end of &net_device?
Ah my mistake, you're right. I probably looked at the 6.9 source without
the flex array and confused it with net-next. But yeah, the 6.9 kernel
I tested with does not have the flex array.
>
> > if (!dev)
> > goto free_cpu_map;
> >
> >
> > === Baseline ===
> > ./tcp_rr -c -H $SERVER -p 50,90,99 -T4 -F8 -l30    ./tcp_stream -c -H $SERVER -T8 -F16 -l30
> >
> >         Transactions  Latency P50 (s)  Latency P90 (s)  Latency P99 (s)           Throughput (Mbit/s)
> > Run 1   2578189       0.00008831       0.00010623       0.00013439       Run 1    15427.22
> > Run 2   2657923       0.00008575       0.00010239       0.00012927       Run 2    15272.12
> > Run 3   2700402       0.00008447       0.00010111       0.00013183       Run 3    14871.35
> > Run 4   2571739       0.00008575       0.00011519       0.00013823       Run 4    15344.72
> > Run 5   2476427       0.00008703       0.00013055       0.00016895       Run 5    15193.2
> > Average 2596936       0.000086262      0.000111094      0.000140534      Average  15221.722
> >
> > === cpumap NAPI patches ===
> >         Transactions  Latency P50 (s)  Latency P90 (s)  Latency P99 (s)           Throughput (Mbit/s)
> > Run 1   2554598       0.00008703       0.00011263       0.00013055       Run 1    17090.29
> > Run 2   2478905       0.00009087       0.00011391       0.00014463       Run 2    16742.27
> > Run 3   2418599       0.00009471       0.00011007       0.00014207       Run 3    17555.3
> > Run 4   2562463       0.00008959       0.00010367       0.00013055       Run 4    17892.3
> > Run 5   2716551       0.00008127       0.00010879       0.00013439       Run 5    17578.32
> > Average 2546223.2     0.000088694      0.000109814      0.000136438      Average  17371.696
> > Delta   -1.95%        2.82%            -1.15%           -2.91%                    14.12%
> >
> >
> > So it looks like the GRO patches work quite well out of the box. It's
> > curious that tcp_rr transactions go down a bit, though. I don't have any
> > intuition around that.
>
> 14% is quite nice, I'd say. Is the first table taken from cpumap as
> well, or just direct Rx?
Both cpumap. The only variable I changed was adding your patches.
Thanks,
Daniel
* [xdp-hints] Re: [PATCH RFC bpf-next 32/52] bpf, cpumap: switch to GRO from netif_receive_skb_list()
2024-08-13 14:09 ` Alexander Lobakin
2024-08-13 14:54 ` Toke Høiland-Jørgensen
@ 2024-08-13 16:14 ` Lorenzo Bianconi
2024-08-13 16:27 ` Lorenzo Bianconi
2 siblings, 0 replies; 98+ messages in thread
From: Lorenzo Bianconi @ 2024-08-13 16:14 UTC (permalink / raw)
To: Alexander Lobakin
Cc: Lorenzo Bianconi, Daniel Xu, Alexander Lobakin,
Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
Larysa Zaremba, Michal Swiatkowski, Jesper Dangaard Brouer,
Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
Jonathan Lemon, toke, David Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jesse Brandeburg, John Fastabend, Yajun Deng,
Willem de Bruijn, bpf, netdev, linux-kernel, xdp-hints
> From: Alexander Lobakin <aleksander.lobakin@intel.com>
> Date: Thu, 8 Aug 2024 13:57:00 +0200
>
> > From: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
> > Date: Thu, 8 Aug 2024 06:54:06 +0200
> >
> >>> Hi Alexander,
> >>>
> >>> On Tue, Jun 28, 2022, at 12:47 PM, Alexander Lobakin wrote:
> >>>> cpumap has its own BH context based on kthread. It has a sane batch
> >>>> size of 8 frames per cycle.
> >>>> GRO can be used on its own; adjust cpumap calls to the upper stack
> >>>> to use the GRO API instead of netif_receive_skb_list(), which
> >>>> processes skbs in batches but doesn't involve the GRO layer at all.
> >>>> It is most beneficial when the NIC the frames come from is XDP
> >>>> generic metadata-enabled, but in plenty of tests GRO performs better
> >>>> than listified receiving even given that it has to calculate full
> >>>> frame checksums on the CPU.
> >>>> As GRO passes the skbs to the upper stack in batches of
> >>>> @gro_normal_batch, i.e. 8 by default, and @skb->dev points to the
> >>>> device where the frame comes from, it is enough to disable the GRO
> >>>> netdev feature on it to completely restore the original behaviour:
> >>>> untouched frames will be bulked and passed to the upper stack by 8,
> >>>> as it was with netif_receive_skb_list().
> >>>>
> >>>> Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
> >>>> ---
> >>>> kernel/bpf/cpumap.c | 43 ++++++++++++++++++++++++++++++++++++++-----
> >>>> 1 file changed, 38 insertions(+), 5 deletions(-)
> >>>>
> >>>
> >>> AFAICT the cpumap + GRO is a good standalone improvement. I think
> >>> cpumap is still missing this.
> >
> > The only concern for having GRO in cpumap without metadata from the NIC
> > descriptor was that when the checksum status is missing, GRO calculates
> > the checksum on CPU, which is not really fast.
> > But I remember sometimes GRO was faster despite that.
> >
> >>>
> >>> I have a production use case for this now. We want to do some intelligent
> >>> RX steering and I think GRO would help over list-ified receive in some cases.
> >>> We would prefer steer in HW (and thus get existing GRO support) but not all
> >>> our NICs support it. So we need a software fallback.
> >>>
> >>> Are you still interested in merging the cpumap + GRO patches?
> >
> > For sure I can revive this part. I was planning to get back to this
> > branch and pick patches which were not related to XDP hints and send
> > them separately.
> >
> >>
> >> Hi Daniel and Alex,
> >>
> >> Recently I worked on a PoC to add GRO support to cpumap codebase:
> >> - https://github.com/LorenzoBianconi/bpf-next/commit/a4b8264d5000ecf016da5a2dd9ac302deaf38b3e
> >> Here I added GRO support to cpumap through gro-cells.
> >> - https://github.com/LorenzoBianconi/bpf-next/commit/da6cb32a4674aa72401c7414c9a8a0775ef41a55
> >> Here I added GRO support to cpumap through napi-threaded APIs (with
> >> some changes to them).
> >
> > Hmm, when I was testing it, adding a whole NAPI to cpumap was sorta
> > overkill, that's why I separated GRO structure from &napi_struct.
> >
> > Let me maybe find some free time, I would then test all 3 solutions
> > (mine, gro_cells, threaded NAPI) and pick/send the best?
> >
> >>
> >> Please note I have not run any performance tests so far, just verified it does
> >> not crash (I was planning to resume this work soon). Please let me know if it
> >> works for you.
>
> I did tests on both threaded NAPI for cpumap and my old implementation
> with a traffic generator and I have the following (in Kpps):
>
>            direct Rx  direct GRO  cpumap  cpumap GRO
> baseline   2900       5800        2700    2700 (N/A)
> threaded                          2300    4000
> old GRO                           2300    4000
cool, very nice improvement
>
> IOW,
>
> 1. There are no differences in perf between Lorenzo's threaded NAPI
> GRO implementation and my old implementation, but Lorenzo's is also
> a very nice cleanup as it switches cpumap to threaded NAPI completely
> and the final diffstat even removes more lines than adds, while mine
> adds a bunch of lines and refactors a couple hundred, so I'd go with
> his variant.
>
> 2. After switching to NAPI, the performance without GRO decreases (2.3
> Mpps vs 2.7 Mpps), but after enabling GRO the perf increases hugely
> (4 Mpps vs 2.7 Mpps) even though the CPU needs to compute checksums
> manually.
>
> Note that the code is not polished to the top and I also have a good
> improvement for allocating skb heads from the percpu NAPI cache in my
> old tree which I'm planning to add to the series, so the final
> improvement will be even bigger.
>
> + after we find how to pass checksum hint to cpumap, it will be yet
> another big improvement for GRO (current code won't benefit from
> this at all)
>
> To Lorenzo:
>
> Would it be fine if I prepare a series containing your patch for
> threaded NAPI for cpumap (I'd polish it and break into 2 or 3) +
> skb allocation optimization and send it OR you wanted to send this
> on your own? I'm fine with either, in the first case, everything
> would land within one series with the respective credits; in case
> of the latter, I'd need to send a followup :)
Sure, I am fine with my code going in as part of a bigger series.
Thanks a lot for testing :)
Regards,
Lorenzo
>
> >>
> >> Regards,
> >> Lorenzo
> >>
> >>>
> >>> Thanks,
> >>> Daniel
>
> Thanks,
> Olek
* [xdp-hints] Re: [PATCH RFC bpf-next 32/52] bpf, cpumap: switch to GRO from netif_receive_skb_list()
2024-08-13 14:09 ` Alexander Lobakin
2024-08-13 14:54 ` Toke Høiland-Jørgensen
2024-08-13 16:14 ` Lorenzo Bianconi
@ 2024-08-13 16:27 ` Lorenzo Bianconi
2024-08-13 16:31 ` Alexander Lobakin
2 siblings, 1 reply; 98+ messages in thread
From: Lorenzo Bianconi @ 2024-08-13 16:27 UTC (permalink / raw)
To: Alexander Lobakin
Cc: Daniel Xu, Alexander Lobakin, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, Larysa Zaremba,
Michal Swiatkowski, Jesper Dangaard Brouer,
Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
Jonathan Lemon, toke, Lorenzo Bianconi, David Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jesse Brandeburg,
John Fastabend, Yajun Deng, Willem de Bruijn, bpf, netdev,
linux-kernel, xdp-hints
On Aug 13, Alexander Lobakin wrote:
> From: Alexander Lobakin <aleksander.lobakin@intel.com>
> Date: Thu, 8 Aug 2024 13:57:00 +0200
>
> > From: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
> > Date: Thu, 8 Aug 2024 06:54:06 +0200
> >
> >>> Hi Alexander,
> >>>
> >>> On Tue, Jun 28, 2022, at 12:47 PM, Alexander Lobakin wrote:
> >>>> cpumap has its own BH context based on kthread. It has a sane batch
> >>>> size of 8 frames per cycle.
> >>>> GRO can be used on its own; adjust cpumap calls to the upper stack
> >>>> to use the GRO API instead of netif_receive_skb_list(), which
> >>>> processes skbs in batches but doesn't involve the GRO layer at all.
> >>>> It is most beneficial when the NIC the frames come from is XDP
> >>>> generic metadata-enabled, but in plenty of tests GRO performs better
> >>>> than listified receiving even given that it has to calculate full
> >>>> frame checksums on the CPU.
> >>>> As GRO passes the skbs to the upper stack in batches of
> >>>> @gro_normal_batch, i.e. 8 by default, and @skb->dev points to the
> >>>> device where the frame comes from, it is enough to disable the GRO
> >>>> netdev feature on it to completely restore the original behaviour:
> >>>> untouched frames will be bulked and passed to the upper stack by 8,
> >>>> as it was with netif_receive_skb_list().
> >>>>
> >>>> Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
> >>>> ---
> >>>> kernel/bpf/cpumap.c | 43 ++++++++++++++++++++++++++++++++++++++-----
> >>>> 1 file changed, 38 insertions(+), 5 deletions(-)
> >>>>
> >>>
> >>> AFAICT the cpumap + GRO is a good standalone improvement. I think
> >>> cpumap is still missing this.
> >
> > The only concern for having GRO in cpumap without metadata from the NIC
> > descriptor was that when the checksum status is missing, GRO calculates
> > the checksum on CPU, which is not really fast.
> > But I remember sometimes GRO was faster despite that.
> >
> >>>
> >>> I have a production use case for this now. We want to do some intelligent
> >>> RX steering and I think GRO would help over list-ified receive in some cases.
> >>> We would prefer steer in HW (and thus get existing GRO support) but not all
> >>> our NICs support it. So we need a software fallback.
> >>>
> >>> Are you still interested in merging the cpumap + GRO patches?
> >
> > For sure I can revive this part. I was planning to get back to this
> > branch and pick patches which were not related to XDP hints and send
> > them separately.
> >
> >>
> >> Hi Daniel and Alex,
> >>
> >> Recently I worked on a PoC to add GRO support to cpumap codebase:
> >> - https://github.com/LorenzoBianconi/bpf-next/commit/a4b8264d5000ecf016da5a2dd9ac302deaf38b3e
> >> Here I added GRO support to cpumap through gro-cells.
> >> - https://github.com/LorenzoBianconi/bpf-next/commit/da6cb32a4674aa72401c7414c9a8a0775ef41a55
> >> Here I added GRO support to cpumap through napi-threaded APIs (with
> >> some changes to them).
> >
> > Hmm, when I was testing it, adding a whole NAPI to cpumap was sorta
> > overkill, that's why I separated GRO structure from &napi_struct.
> >
> > Let me maybe find some free time, I would then test all 3 solutions
> > (mine, gro_cells, threaded NAPI) and pick/send the best?
> >
> >>
> >> Please note I have not run any performance tests so far, just verified it does
> >> not crash (I was planning to resume this work soon). Please let me know if it
> >> works for you.
>
> I did tests on both threaded NAPI for cpumap and my old implementation
> with a traffic generator and I have the following (in Kpps):
>
>            direct Rx  direct GRO  cpumap  cpumap GRO
> baseline   2900       5800        2700    2700 (N/A)
> threaded                          2300    4000
> old GRO                           2300    4000
out of curiosity, have you tested the gro_cells one as well?
Lorenzo
>
> IOW,
>
> 1. There are no differences in perf between Lorenzo's threaded NAPI
> GRO implementation and my old implementation, but Lorenzo's is also
> a very nice cleanup as it switches cpumap to threaded NAPI completely
> and the final diffstat even removes more lines than adds, while mine
> adds a bunch of lines and refactors a couple hundred, so I'd go with
> his variant.
>
> 2. After switching to NAPI, the performance without GRO decreases (2.3
> Mpps vs 2.7 Mpps), but after enabling GRO the perf increases hugely
> (4 Mpps vs 2.7 Mpps) even though the CPU needs to compute checksums
> manually.
>
> Note that the code is not polished to the top and I also have a good
> improvement for allocating skb heads from the percpu NAPI cache in my
> old tree which I'm planning to add to the series, so the final
> improvement will be even bigger.
>
> + after we find how to pass checksum hint to cpumap, it will be yet
> another big improvement for GRO (current code won't benefit from
> this at all)
>
> To Lorenzo:
>
> Would it be fine if I prepare a series containing your patch for
> threaded NAPI for cpumap (I'd polish it and break into 2 or 3) +
> skb allocation optimization and send it OR you wanted to send this
> on your own? I'm fine with either, in the first case, everything
> would land within one series with the respective credits; in case
> of the latter, I'd need to send a followup :)
>
> >>
> >> Regards,
> >> Lorenzo
> >>
> >>>
> >>> Thanks,
> >>> Daniel
>
> Thanks,
> Olek
>
* [xdp-hints] Re: [PATCH RFC bpf-next 32/52] bpf, cpumap: switch to GRO from netif_receive_skb_list()
2024-08-13 16:27 ` Lorenzo Bianconi
@ 2024-08-13 16:31 ` Alexander Lobakin
0 siblings, 0 replies; 98+ messages in thread
From: Alexander Lobakin @ 2024-08-13 16:31 UTC (permalink / raw)
To: Lorenzo Bianconi
Cc: Daniel Xu, Alexander Lobakin, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, Larysa Zaremba,
Michal Swiatkowski, Jesper Dangaard Brouer,
Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
Jonathan Lemon, toke, Lorenzo Bianconi, David Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jesse Brandeburg,
John Fastabend, Yajun Deng, Willem de Bruijn, bpf, netdev,
linux-kernel, xdp-hints
From: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Date: Tue, 13 Aug 2024 18:27:41 +0200
> On Aug 13, Alexander Lobakin wrote:
>> From: Alexander Lobakin <aleksander.lobakin@intel.com>
>> Date: Thu, 8 Aug 2024 13:57:00 +0200
>>
>>> From: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
>>> Date: Thu, 8 Aug 2024 06:54:06 +0200
>>>
>>>>> Hi Alexander,
>>>>>
>>>>> On Tue, Jun 28, 2022, at 12:47 PM, Alexander Lobakin wrote:
>>>>>> cpumap has its own BH context based on kthread. It has a sane batch
>>>>>> size of 8 frames per cycle.
>>>>>> GRO can be used on its own; adjust cpumap calls to the upper stack
>>>>>> to use the GRO API instead of netif_receive_skb_list(), which
>>>>>> processes skbs in batches but doesn't involve the GRO layer at all.
>>>>>> It is most beneficial when the NIC the frames come from is XDP
>>>>>> generic metadata-enabled, but in plenty of tests GRO performs better
>>>>>> than listified receiving even given that it has to calculate full
>>>>>> frame checksums on the CPU.
>>>>>> As GRO passes the skbs to the upper stack in batches of
>>>>>> @gro_normal_batch, i.e. 8 by default, and @skb->dev points to the
>>>>>> device where the frame comes from, it is enough to disable the GRO
>>>>>> netdev feature on it to completely restore the original behaviour:
>>>>>> untouched frames will be bulked and passed to the upper stack by 8,
>>>>>> as it was with netif_receive_skb_list().
>>>>>>
>>>>>> Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
>>>>>> ---
>>>>>> kernel/bpf/cpumap.c | 43 ++++++++++++++++++++++++++++++++++++++-----
>>>>>> 1 file changed, 38 insertions(+), 5 deletions(-)
>>>>>>
>>>>>
>>>>> AFAICT the cpumap + GRO is a good standalone improvement. I think
>>>>> cpumap is still missing this.
>>>
>>> The only concern for having GRO in cpumap without metadata from the NIC
>>> descriptor was that when the checksum status is missing, GRO calculates
>>> the checksum on CPU, which is not really fast.
>>> But I remember sometimes GRO was faster despite that.
>>>
>>>>>
>>>>> I have a production use case for this now. We want to do some intelligent
>>>>> RX steering and I think GRO would help over list-ified receive in some cases.
>>>>> We would prefer steer in HW (and thus get existing GRO support) but not all
>>>>> our NICs support it. So we need a software fallback.
>>>>>
>>>>> Are you still interested in merging the cpumap + GRO patches?
>>>
>>> For sure I can revive this part. I was planning to get back to this
>>> branch and pick patches which were not related to XDP hints and send
>>> them separately.
>>>
>>>>
>>>> Hi Daniel and Alex,
>>>>
>>>> Recently I worked on a PoC to add GRO support to cpumap codebase:
>>>> - https://github.com/LorenzoBianconi/bpf-next/commit/a4b8264d5000ecf016da5a2dd9ac302deaf38b3e
>>>> Here I added GRO support to cpumap through gro-cells.
>>>> - https://github.com/LorenzoBianconi/bpf-next/commit/da6cb32a4674aa72401c7414c9a8a0775ef41a55
>>>> Here I added GRO support to cpumap through napi-threaded APIs (with
>>>> some changes to them).
>>>
>>> Hmm, when I was testing it, adding a whole NAPI to cpumap was sorta
>>> overkill, that's why I separated GRO structure from &napi_struct.
>>>
>>> Let me maybe find some free time, I would then test all 3 solutions
>>> (mine, gro_cells, threaded NAPI) and pick/send the best?
>>>
>>>>
>>>> Please note I have not run any performance tests so far, just verified it does
>>>> not crash (I was planning to resume this work soon). Please let me know if it
>>>> works for you.
>>
>> I did tests on both threaded NAPI for cpumap and my old implementation
>> with a traffic generator and I have the following (in Kpps):
>>
>>            direct Rx  direct GRO  cpumap  cpumap GRO
>> baseline   2900       5800        2700    2700 (N/A)
>> threaded                          2300    4000
>> old GRO                           2300    4000
>
> out of curiosity, have you tested the gro_cells one as well?
I haven't. I mean I could, but I don't see how cpumap's kthread + a
separate NAPI could give better results than the merged NAPI + kthread.
>
> Lorenzo
Thanks,
Olek
* [xdp-hints] Re: [PATCH RFC bpf-next 32/52] bpf, cpumap: switch to GRO from netif_receive_skb_list()
2024-08-08 4:54 ` Lorenzo Bianconi
2024-08-08 11:57 ` Alexander Lobakin
@ 2024-08-08 20:44 ` Daniel Xu
2024-08-09 9:32 ` Jesper Dangaard Brouer
1 sibling, 1 reply; 98+ messages in thread
From: Daniel Xu @ 2024-08-08 20:44 UTC (permalink / raw)
To: Lorenzo Bianconi
Cc: Alexander Lobakin, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, toke, Lorenzo Bianconi,
David Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Jesse Brandeburg, John Fastabend, Yajun Deng, Willem de Bruijn,
bpf, netdev, linux-kernel, xdp-hints
Hi Lorenzo,
On Thu, Aug 8, 2024, at 12:54 AM, Lorenzo Bianconi wrote:
>> Hi Alexander,
>>
>> On Tue, Jun 28, 2022, at 12:47 PM, Alexander Lobakin wrote:
>> > cpumap has its own BH context based on kthread. It has a sane batch
>> > size of 8 frames per cycle.
>> > GRO can be used on its own; adjust cpumap calls to the upper stack
>> > to use the GRO API instead of netif_receive_skb_list(), which
>> > processes skbs in batches but doesn't involve the GRO layer at all.
>> > It is most beneficial when the NIC the frames come from is XDP
>> > generic metadata-enabled, but in plenty of tests GRO performs better
>> > than listified receiving even given that it has to calculate full
>> > frame checksums on the CPU.
>> > As GRO passes the skbs to the upper stack in batches of
>> > @gro_normal_batch, i.e. 8 by default, and @skb->dev points to the
>> > device where the frame comes from, it is enough to disable the GRO
>> > netdev feature on it to completely restore the original behaviour:
>> > untouched frames will be bulked and passed to the upper stack by 8,
>> > as it was with netif_receive_skb_list().
>> >
>> > Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
>> > ---
>> > kernel/bpf/cpumap.c | 43 ++++++++++++++++++++++++++++++++++++++-----
>> > 1 file changed, 38 insertions(+), 5 deletions(-)
>> >
>>
>> AFAICT the cpumap + GRO is a good standalone improvement. I think
>> cpumap is still missing this.
>>
>> I have a production use case for this now. We want to do some intelligent
>> RX steering and I think GRO would help over list-ified receive in some cases.
>> We would prefer steer in HW (and thus get existing GRO support) but not all
>> our NICs support it. So we need a software fallback.
>>
>> Are you still interested in merging the cpumap + GRO patches?
>
> Hi Daniel and Alex,
>
> Recently I worked on a PoC to add GRO support to cpumap codebase:
> - https://github.com/LorenzoBianconi/bpf-next/commit/a4b8264d5000ecf016da5a2dd9ac302deaf38b3e
> Here I added GRO support to cpumap through gro-cells.
> - https://github.com/LorenzoBianconi/bpf-next/commit/da6cb32a4674aa72401c7414c9a8a0775ef41a55
> Here I added GRO support to cpumap through napi-threaded APIs (with
> some changes to them).
Cool!
>
> Please note I have not run any performance tests so far, just verified it does
> not crash (I was planning to resume this work soon). Please let me know if it
> works for you.
I’ll try to run an A/B test on your two approaches as well as Alex’s. I’ve still
got some testbeds with production traffic going thru them.
Thanks,
Daniel
* [xdp-hints] Re: [PATCH RFC bpf-next 32/52] bpf, cpumap: switch to GRO from netif_receive_skb_list()
2024-08-08 20:44 ` Daniel Xu
@ 2024-08-09 9:32 ` Jesper Dangaard Brouer
0 siblings, 0 replies; 98+ messages in thread
From: Jesper Dangaard Brouer @ 2024-08-09 9:32 UTC (permalink / raw)
To: Daniel Xu, Lorenzo Bianconi
Cc: Alexander Lobakin, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Larysa Zaremba, Michal Swiatkowski,
Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
Jonathan Lemon, toke, Lorenzo Bianconi, David Miller,
Eric Dumazet, Jakub Kicinski, Paolo Abeni, Jesse Brandeburg,
John Fastabend, Yajun Deng, Willem de Bruijn, bpf, netdev,
linux-kernel, xdp-hints, kernel-team
On 08/08/2024 22.44, Daniel Xu wrote:
> Hi Lorenzo,
>
> On Thu, Aug 8, 2024, at 12:54 AM, Lorenzo Bianconi wrote:
>>> Hi Alexander,
>>>
>>> On Tue, Jun 28, 2022, at 12:47 PM, Alexander Lobakin wrote:
[...]
>>>
>>> AFAICT the cpumap + GRO is a good standalone improvement. I think
>>> cpumap is still missing this.
>>>
>>> I have a production use case for this now. We want to do some intelligent
>>> RX steering and I think GRO would help over list-ified receive in some cases.
>>> We would prefer steer in HW (and thus get existing GRO support) but not all
>>> our NICs support it. So we need a software fallback.
>>>
I want to state that Cloudflare is also planning to use cpumap in
production, and (one) blocker is that CPUMAP doesn't support GRO.
>>> Are you still interested in merging the cpumap + GRO patches?
>>
>> Hi Daniel and Alex,
>>
>> Recently I worked on a PoC to add GRO support to cpumap codebase:
>> - https://github.com/LorenzoBianconi/bpf-next/commit/a4b8264d5000ecf016da5a2dd9ac302deaf38b3e
>> Here I added GRO support to cpumap through gro-cells.
>> - https://github.com/LorenzoBianconi/bpf-next/commit/da6cb32a4674aa72401c7414c9a8a0775ef41a55
>> Here I added GRO support to cpumap through napi-threaded APIs (with
>> some changes to them).
>
> Cool!
>
>>
>> Please note I have not run any performance tests so far, just verified it does
>> not crash (I was planning to resume this work soon). Please let me know if it
>> works for you.
>
> I’ll try to run an A/B test on your two approaches as well as Alex’s. I’ve still
> got some testbeds with production traffic going thru them.
>
It is awesome that both Olek and you are stepping up for testing this.
(I'm currently too busy on cgroup rstat lock related work, but more
people will be joining my team this month and I hope they have interest
in contributing to this effort).
--Jesper
* [xdp-hints] [PATCH RFC bpf-next 33/52] bpf, cpumap: add option to set a timeout for deferred flush
2022-06-28 19:47 [xdp-hints] [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata Alexander Lobakin
` (31 preceding siblings ...)
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 32/52] bpf, cpumap: switch to GRO from netif_receive_skb_list() Alexander Lobakin
@ 2022-06-28 19:47 ` Alexander Lobakin
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 34/52] samples/bpf: add 'timeout' option to xdp_redirect_cpu Alexander Lobakin
` (19 subsequent siblings)
52 siblings, 0 replies; 98+ messages in thread
From: Alexander Lobakin @ 2022-06-28 19:47 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
Cc: Alexander Lobakin, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Toke Hoiland-Jorgensen,
Lorenzo Bianconi, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jesse Brandeburg, John Fastabend, Yajun Deng,
Willem de Bruijn, bpf, netdev, linux-kernel, xdp-hints
GRO efficiency depends a lot on the batch size. With a batch size of
8, it is less efficient than e.g. NAPI with its batch size of 64.
To lower the percentage of full flushes while not holding GRO packets
for too long, use the GRO hrtimer to wake up the kthread even if
there are no new frames in the ptr_ring. The timeout value is passed
from the user side inside the corresponding &bpf_cpumap_val on map
creation, in nanoseconds.
When the timeout is 0/unset, the behaviour is the same as it was
prior to the change.
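For reference, a hedged userspace sketch of setting the new field (the
names follow the uapi change below; the 1 ms cap comes from the
cpu_map_update_elem() check in this patch):

	/* request a 50 us deferred-flush timeout for CPU 2 */
	struct bpf_cpumap_val val = {
		.qsize	 = 2048,
		.timeout = 50 * 1000,	/* in ns; > 1 ms yields -ERANGE */
	};
	__u32 cpu = 2;

	bpf_map_update_elem(map_fd, &cpu, &val, 0);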
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
---
include/uapi/linux/bpf.h | 1 +
kernel/bpf/cpumap.c | 39 +++++++++++++++++++++++++++++-----
tools/include/uapi/linux/bpf.h | 1 +
3 files changed, 36 insertions(+), 5 deletions(-)
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index 1caaec1de625..097719ee2172 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -5989,6 +5989,7 @@ struct bpf_cpumap_val {
int fd; /* prog fd on map write */
__u32 id; /* prog id on map read */
} bpf_prog;
+ __u64 timeout; /* timeout to wait for new packets, in ns */
};
enum sk_action {
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index 2d0edf8f6a05..145f49de0931 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -95,7 +95,8 @@ static struct bpf_map *cpu_map_alloc(union bpf_attr *attr)
/* check sanity of attributes */
if (attr->max_entries == 0 || attr->key_size != 4 ||
(value_size != offsetofend(struct bpf_cpumap_val, qsize) &&
- value_size != offsetofend(struct bpf_cpumap_val, bpf_prog.fd)) ||
+ value_size != offsetofend(struct bpf_cpumap_val, bpf_prog.fd) &&
+ value_size != offsetofend(struct bpf_cpumap_val, timeout)) ||
attr->map_flags & ~BPF_F_NUMA_NODE)
return ERR_PTR(-EINVAL);
@@ -312,18 +313,42 @@ static void cpu_map_gro_flush(struct bpf_cpu_map_entry *rcpu,
/* If the ring is not empty, there'll be a new iteration
* soon, and we only need to do a full flush if a tick is
* long (> 1 ms).
- * If the ring is empty, to not hold GRO packets in the
- * stack for too long, do a full flush.
+ * If the ring is empty, and there were some new packets
+ * processed, either do a partial flush and spin up a timer
+ * to flush the rest if the timeout is set, or do a full
+ * flush otherwise.
+ * No new packets with non-zero gro_bitmask can mean that we
+ * probably came from the timer call and/or there's [almost]
+ * no activity here right now. To not hold GRO packets in
+ * the stack for too long, do a full flush.
* This is equivalent to how NAPI decides whether to perform
* a full flush (by batches of up to 64 frames tho).
*/
if (__ptr_ring_empty(rcpu->queue))
- flush_old = false;
+ flush_old = new ? !!rcpu->value.timeout : false;
__gro_flush(&rcpu->gro, flush_old);
}
gro_normal_list(&rcpu->gro);
+
+ /* Non-zero gro_bitmask at this point means that we have some packets
+ * held in the GRO engine after a partial flush. If we have a timeout
+ * set up, and there are no signs of a new kthread iteration, launch
+ * a timer to flush them as well.
+ */
+ if (rcpu->gro.bitmask && __ptr_ring_empty(rcpu->queue))
+ gro_timer_start(&rcpu->gro, rcpu->value.timeout);
+}
+
+static enum hrtimer_restart cpu_map_gro_watchdog(struct hrtimer *timer)
+{
+ const struct bpf_cpu_map_entry *rcpu;
+
+ rcpu = container_of(timer, typeof(*rcpu), gro.timer);
+ wake_up_process(rcpu->kthread);
+
+ return HRTIMER_NORESTART;
}
static int cpu_map_kthread_run(void *data)
@@ -489,8 +514,9 @@ __cpu_map_entry_alloc(struct bpf_map *map, struct bpf_cpumap_val *value,
rcpu->cpu = cpu;
rcpu->map_id = map->id;
rcpu->value.qsize = value->qsize;
+ rcpu->value.timeout = value->timeout;
- gro_init(&rcpu->gro, NULL);
+ gro_init(&rcpu->gro, cpu_map_gro_watchdog);
if (fd > 0 && __cpu_map_load_bpf_program(rcpu, map, fd))
goto free_gro;
@@ -606,6 +632,9 @@ static int cpu_map_update_elem(struct bpf_map *map, void *key, void *value,
return -EEXIST;
if (unlikely(cpumap_value.qsize > 16384)) /* sanity limit on qsize */
return -EOVERFLOW;
+ /* Don't allow timeout longer than 1 ms -- 1 tick on HZ == 1000 */
+ if (unlikely(cpumap_value.timeout > 1 * NSEC_PER_MSEC))
+ return -ERANGE;
/* Make sure CPU is a valid possible cpu */
if (key_cpu >= nr_cpumask_bits || !cpu_possible(key_cpu))
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 436b925adfb3..a3579cdb0225 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -5989,6 +5989,7 @@ struct bpf_cpumap_val {
int fd; /* prog fd on map write */
__u32 id; /* prog id on map read */
} bpf_prog;
+ __u64 timeout; /* timeout to wait for new packets, in ns */
};
enum sk_action {
--
2.36.1
* [xdp-hints] [PATCH RFC bpf-next 34/52] samples/bpf: add 'timeout' option to xdp_redirect_cpu
2022-06-28 19:47 [xdp-hints] [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata Alexander Lobakin
` (32 preceding siblings ...)
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 33/52] bpf, cpumap: add option to set a timeout for deferred flush Alexander Lobakin
@ 2022-06-28 19:47 ` Alexander Lobakin
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 35/52] net, skbuff: introduce napi_skb_cache_get_bulk() Alexander Lobakin
` (18 subsequent siblings)
52 siblings, 0 replies; 98+ messages in thread
From: Alexander Lobakin @ 2022-06-28 19:47 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
Cc: Alexander Lobakin, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Toke Hoiland-Jorgensen,
Lorenzo Bianconi, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jesse Brandeburg, John Fastabend, Yajun Deng,
Willem de Bruijn, bpf, netdev, linux-kernel, xdp-hints
Add the ability to specify a deferred flush timeout (in usec, not nsec!)
when setting up a cpumap in the xdp_redirect_cpu sample.
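A hedged usage sketch, assuming the option parsing added below (the
flag letters come from the sample's getopt string; the device, CPU and
qsize flags are from the existing sample):

	# ask for a 50 us deferred-flush timeout (converted to ns inside)
	./xdp_redirect_cpu -d eth0 -c 2 -q 2048 -t 50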
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
---
samples/bpf/xdp_redirect_cpu_user.c | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)
diff --git a/samples/bpf/xdp_redirect_cpu_user.c b/samples/bpf/xdp_redirect_cpu_user.c
index ca457c34eb0f..d184c3fcab53 100644
--- a/samples/bpf/xdp_redirect_cpu_user.c
+++ b/samples/bpf/xdp_redirect_cpu_user.c
@@ -34,6 +34,8 @@ static const char *__doc__ =
#include "xdp_sample_user.h"
#include "xdp_redirect_cpu.skel.h"
+#define NSEC_PER_USEC 1000UL
+
static int map_fd;
static int avail_fd;
static int count_fd;
@@ -61,6 +63,7 @@ static const struct option long_options[] = {
{ "redirect-device", required_argument, NULL, 'r' },
{ "redirect-map", required_argument, NULL, 'm' },
{ "meta-thresh", optional_argument, NULL, 'M' },
+ { "timeout", required_argument, NULL, 't'},
{}
};
@@ -128,9 +131,10 @@ static int create_cpu_entry(__u32 cpu, struct bpf_cpumap_val *value,
}
}
- printf("%s CPU: %u as idx: %u qsize: %d cpumap_prog_fd: %d (cpus_count: %u)\n",
+ printf("%s CPU: %u as idx: %u qsize: %d timeout: %llu cpumap_prog_fd: %d (cpus_count: %u)\n",
new ? "Add new" : "Replace", cpu, avail_idx,
- value->qsize, value->bpf_prog.fd, curr_cpus_count);
+ value->qsize, value->timeout, value->bpf_prog.fd,
+ curr_cpus_count);
return 0;
}
@@ -346,6 +350,7 @@ int main(int argc, char **argv)
* tuned-adm profile network-latency
*/
qsize = 2048;
+ value.timeout = 0; /* Defaults to 0 to mimic the previous behaviour. */
skel = xdp_redirect_cpu__open();
if (!skel) {
@@ -383,7 +388,7 @@ int main(int argc, char **argv)
}
prog = skel->progs.xdp_prognum5_lb_hash_ip_pairs;
- while ((opt = getopt_long(argc, argv, "d:si:Sxp:f:e:r:m:c:q:FMvh",
+ while ((opt = getopt_long(argc, argv, "d:si:Sxp:f:e:r:m:c:q:FMt:vh",
long_options, &longindex)) != -1) {
switch (opt) {
case 'd':
@@ -466,6 +471,10 @@ int main(int argc, char **argv)
opts.meta_thresh = optarg ? strtoul(optarg, NULL, 0) :
1;
break;
+ case 't':
+ value.timeout = strtoull(optarg, NULL, 0) *
+ NSEC_PER_USEC;
+ break;
case 'h':
error = false;
default:
--
2.36.1
* [xdp-hints] [PATCH RFC bpf-next 35/52] net, skbuff: introduce napi_skb_cache_get_bulk()
2022-06-28 19:47 [xdp-hints] [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata Alexander Lobakin
` (33 preceding siblings ...)
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 34/52] samples/bpf: add 'timeout' option to xdp_redirect_cpu Alexander Lobakin
@ 2022-06-28 19:47 ` Alexander Lobakin
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 36/52] bpf, cpumap: switch to napi_skb_cache_get_bulk() Alexander Lobakin
` (17 subsequent siblings)
52 siblings, 0 replies; 98+ messages in thread
From: Alexander Lobakin @ 2022-06-28 19:47 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
Cc: Alexander Lobakin, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Toke Hoiland-Jorgensen,
Lorenzo Bianconi, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jesse Brandeburg, John Fastabend, Yajun Deng,
Willem de Bruijn, bpf, netdev, linux-kernel, xdp-hints
Add a function to get an array of skbs from the NAPI percpu cache.
It's supposed to be a drop-in replacement for
kmem_cache_alloc_bulk(skbuff_head_cache, GFP_ATOMIC) and
xdp_alloc_skb_bulk(GFP_ATOMIC). The difference (apart from the
requirement to call it only from BH context) is that it tries to
use as many NAPI cache entries for skbs as possible, and allocates
new ones only if needed, and only as many as needed.
It can save significant amounts of CPU cycles if there are GRO
cycles and/or Tx completion cycles (anything that descends to
napi_skb_cache_put()) happening on this CPU. If the function is
not able to provide the requested number of entries due to an
allocation error, it returns as many as it managed to get.
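A hedged caller-side sketch (it mirrors what a later patch in this
series does in cpumap; all names come from this patch):

	void *skbs[CPUMAP_BATCH];
	size_t i, got;

	local_bh_disable();	/* the function is BH-only */
	got = napi_skb_cache_get_bulk(skbs, CPUMAP_BATCH);

	for (i = 0; i < got; i++) {
		/* skbs[i] is zeroed and ready for __build_skb_around() */
	}
	local_bh_enable();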
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
---
include/linux/skbuff.h | 1 +
net/core/skbuff.c | 43 ++++++++++++++++++++++++++++++++++++++++++
2 files changed, 44 insertions(+)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 0a95f753c1d9..0c1e5446653b 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1240,6 +1240,7 @@ struct sk_buff *build_skb_around(struct sk_buff *skb,
void skb_attempt_defer_free(struct sk_buff *skb);
struct sk_buff *napi_build_skb(void *data, unsigned int frag_size);
+size_t napi_skb_cache_get_bulk(void **skbs, size_t n);
/**
* alloc_skb - allocate a network buffer
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5b23fc7f1157..9b075f52d1fb 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -190,6 +190,49 @@ static struct sk_buff *napi_skb_cache_get(void)
return skb;
}
+/**
+ * napi_skb_cache_get_bulk - obtain a number of zeroed skb heads from the cache
+ * @skbs: a pointer to an at least @n-sized array to fill with skb pointers
+ * @n: the number of entries to provide
+ *
+ * Tries to obtain @n &sk_buff entries from the NAPI percpu cache and writes
+ * the pointers into the provided array @skbs. If there are less entries
+ * available, bulk-allocates the diff from the MM layer.
+ * The heads are being zeroed with either memset() or %__GFP_ZERO, so they are
+ * ready for {,__}build_skb_around() and don't have any data buffers attached.
+ * Must be called *only* from the BH context.
+ *
+ * Returns the number of successfully allocated skbs (@n if
+ * kmem_cache_alloc_bulk() didn't fail).
+ */
+size_t napi_skb_cache_get_bulk(void **skbs, size_t n)
+{
+ struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
+ size_t total = n;
+
+ if (nc->skb_count < n)
+ n -= kmem_cache_alloc_bulk(skbuff_head_cache,
+ GFP_ATOMIC | __GFP_ZERO,
+ n - nc->skb_count,
+ skbs + nc->skb_count);
+ if (unlikely(nc->skb_count < n)) {
+ total -= n - nc->skb_count;
+ n = nc->skb_count;
+ }
+
+ for (size_t i = 0; i < n; i++) {
+ skbs[i] = nc->skb_cache[nc->skb_count - n + i];
+
+ kasan_unpoison_object_data(skbuff_head_cache, skbs[i]);
+ memset(skbs[i], 0, offsetof(struct sk_buff, tail));
+ }
+
+ nc->skb_count -= n;
+
+ return total;
+}
+EXPORT_SYMBOL_GPL(napi_skb_cache_get_bulk);
+
/* Caller must provide SKB that is memset cleared */
static void __build_skb_around(struct sk_buff *skb, void *data,
unsigned int frag_size)
--
2.36.1
* [xdp-hints] [PATCH RFC bpf-next 36/52] bpf, cpumap: switch to napi_skb_cache_get_bulk()
2022-06-28 19:47 [xdp-hints] [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata Alexander Lobakin
` (34 preceding siblings ...)
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 35/52] net, skbuff: introduce napi_skb_cache_get_bulk() Alexander Lobakin
@ 2022-06-28 19:47 ` Alexander Lobakin
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 37/52] rcupdate: fix access helpers for incomplete struct pointers on GCC < 10 Alexander Lobakin
` (16 subsequent siblings)
52 siblings, 0 replies; 98+ messages in thread
From: Alexander Lobakin @ 2022-06-28 19:47 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
Cc: Alexander Lobakin, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Toke Hoiland-Jorgensen,
Lorenzo Bianconi, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jesse Brandeburg, John Fastabend, Yajun Deng,
Willem de Bruijn, bpf, netdev, linux-kernel, xdp-hints
Now that cpumap uses GRO, which drops unused skb heads to the NAPI
cache, use napi_skb_cache_get_bulk() to try to reuse cached entries
and lower the MM layer pressure.
In the situation when all 8 skbs from one cpumap batch go into one
GRO skb (so the remaining 7 go into the cache), there will now be
only 1 skb to allocate per cycle instead of 8. If there is some
other work happening in between the cycles, even all 8 might be
taken from the cache each cycle.
This makes the BH-disabled period per batch slightly longer --
previously, skb allocation was happening in the process context.
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
---
kernel/bpf/cpumap.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index 145f49de0931..1bb3ae570e6c 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -365,7 +365,6 @@ static int cpu_map_kthread_run(void *data)
while (!kthread_should_stop() || !__ptr_ring_empty(rcpu->queue)) {
struct xdp_cpumap_stats stats = {}; /* zero stats */
unsigned int kmem_alloc_drops = 0, sched = 0;
- gfp_t gfp = __GFP_ZERO | GFP_ATOMIC;
int i, n, m, nframes, xdp_n;
void *frames[CPUMAP_BATCH];
void *skbs[CPUMAP_BATCH];
@@ -416,8 +415,10 @@ static int cpu_map_kthread_run(void *data)
/* Support running another XDP prog on this CPU */
nframes = cpu_map_bpf_prog_run(rcpu, frames, xdp_n, &stats, &list);
+ local_bh_disable();
+
if (nframes) {
- m = kmem_cache_alloc_bulk(skbuff_head_cache, gfp, nframes, skbs);
+ m = napi_skb_cache_get_bulk(skbs, nframes);
if (unlikely(m == 0)) {
for (i = 0; i < nframes; i++)
skbs[i] = NULL; /* effect: xdp_return_frame */
@@ -425,7 +426,6 @@ static int cpu_map_kthread_run(void *data)
}
}
- local_bh_disable();
for (i = 0; i < nframes; i++) {
struct xdp_frame *xdpf = frames[i];
struct sk_buff *skb = skbs[i];
--
2.36.1
* [xdp-hints] [PATCH RFC bpf-next 37/52] rcupdate: fix access helpers for incomplete struct pointers on GCC < 10
2022-06-28 19:47 [xdp-hints] [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata Alexander Lobakin
` (35 preceding siblings ...)
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 36/52] bpf, cpumap: switch to napi_skb_cache_get_bulk() Alexander Lobakin
@ 2022-06-28 19:47 ` Alexander Lobakin
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 38/52] net, xdp: remove unused xdp_attachment_info::flags Alexander Lobakin
` (15 subsequent siblings)
52 siblings, 0 replies; 98+ messages in thread
From: Alexander Lobakin @ 2022-06-28 19:47 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
Cc: Alexander Lobakin, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Toke Hoiland-Jorgensen,
Lorenzo Bianconi, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jesse Brandeburg, John Fastabend, Yajun Deng,
Willem de Bruijn, bpf, netdev, linux-kernel, xdp-hints
It's been found that currently it is impossible to use RCU for
pointers to incomplete structs.
The RCU access helpers contain the following construct:
typeof(*p) *local = ...
GCC versions older than 10 don't look at the whole expression and
believe that there's a dereference happening inside the typeof(),
although there is none.
As RCU doesn't imply any dereference -- it's only a way to store
and access pointers -- this is not a valid case. Moreover, Clang
and GCC 10 onwards evaluate it with no issues.
Fix this by introducing a new macro, __rcutype(), which takes care
of pointer annotations inside the RCU access helpers in two
different ways depending on the compiler used. For sane compilers,
leave it as it is for now, as it ensures that the passed argument
is a pointer, and for the affected ones use
`typeof(0 ? (p) : (p))`. Given:
void fn(void) { }
...
pr_info("%d", __builtin_types_compatible_p(typeof(*fn) *, typeof(fn)));
pr_info("%d", __builtin_types_compatible_p(typeof(*fn) *, typeof(&fn)));
pr_info("%d", __builtin_types_compatible_p(typeof(*fn) *,
				typeof(0 ? (fn) : (fn))));
this emits:
011
and we can't use the second variant (typeof(&fn)) for non-functions:
for an object pointer, &p yields a pointer to a pointer instead.
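A minimal reproducer of the original problem (illustrative only,
not part of the patch):

/* Incomplete type, never defined anywhere */
struct opaque;

static struct opaque *global_p;

static void reproducer(void)
{
	/* GCC < 10: "dereferencing pointer to incomplete type",
	 * although nothing is actually dereferenced
	 */
	typeof(*global_p) *a = global_p;

	/* The ternary form decays to the very same pointer type and
	 * compiles everywhere
	 */
	typeof(0 ? (global_p) : (global_p)) b = global_p;

	(void)a;
	(void)b;
}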
Fixes: ca5ecddfa8fc ("rcu: define __rcu address space modifier for sparse")
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
---
include/linux/rcupdate.h | 37 ++++++++++++++++++++++++++-----------
1 file changed, 26 insertions(+), 11 deletions(-)
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 1a32036c918c..f5971fccf852 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -358,18 +358,33 @@ static inline void rcu_preempt_sleep_check(void) { }
* (e.g., __srcu), should this make sense in the future.
*/
+/*
+ * Unfortunately, GCC versions older than 10 don't look at the whole expression
+ * and treat `typeof(*(p)) *` as dereferencing although it is not. This makes
+ * it impossible to use those helpers with pointers to incomplete structures.
+ * Plain `typeof(p)` is not the same, as `typeof(func)` returns the type of a
+ * function, not a pointer to it, as `typeof(*(func)) *` does.
+ * `typeof(<anything> ? (func) : (func))` is silly; however, it works just
+ * like the original definition does.
+ */
+#if defined(CONFIG_CC_IS_GCC) && CONFIG_GCC_VERSION < 100000
+#define __rcutype(p, ...) typeof(0 ? (p) : (p)) __VA_ARGS__
+#else
+#define __rcutype(p, ...) typeof(*(p)) __VA_ARGS__ *
+#endif
+
#ifdef __CHECKER__
#define rcu_check_sparse(p, space) \
- ((void)(((typeof(*p) space *)p) == p))
+ ((void)((__rcutype(p, space))(p) == (p)))
#else /* #ifdef __CHECKER__ */
#define rcu_check_sparse(p, space)
#endif /* #else #ifdef __CHECKER__ */
#define __unrcu_pointer(p, local) \
({ \
- typeof(*p) *local = (typeof(*p) *__force)(p); \
+ __rcutype(p) local = (__rcutype(p, __force))(p); \
rcu_check_sparse(p, __rcu); \
- ((typeof(*p) __force __kernel *)(local)); \
+ ((__rcutype(p, __force __kernel))(local)); \
})
/**
* unrcu_pointer - mark a pointer as not being RCU protected
@@ -382,29 +397,29 @@ static inline void rcu_preempt_sleep_check(void) { }
#define __rcu_access_pointer(p, local, space) \
({ \
- typeof(*p) *local = (typeof(*p) *__force)READ_ONCE(p); \
+ __rcutype(p) local = (__rcutype(p, __force))READ_ONCE(p); \
rcu_check_sparse(p, space); \
- ((typeof(*p) __force __kernel *)(local)); \
+ ((__rcutype(p, __force __kernel))(local)); \
})
#define __rcu_dereference_check(p, local, c, space) \
({ \
/* Dependency order vs. p above. */ \
- typeof(*p) *local = (typeof(*p) *__force)READ_ONCE(p); \
+ __rcutype(p) local = (__rcutype(p, __force))READ_ONCE(p); \
RCU_LOCKDEP_WARN(!(c), "suspicious rcu_dereference_check() usage"); \
rcu_check_sparse(p, space); \
- ((typeof(*p) __force __kernel *)(local)); \
+ ((__rcutype(p, __force __kernel))(local)); \
})
#define __rcu_dereference_protected(p, local, c, space) \
({ \
RCU_LOCKDEP_WARN(!(c), "suspicious rcu_dereference_protected() usage"); \
rcu_check_sparse(p, space); \
- ((typeof(*p) __force __kernel *)(p)); \
+ ((__rcutype(p, __force __kernel))(p)); \
})
#define __rcu_dereference_raw(p, local) \
({ \
/* Dependency order vs. p above. */ \
- typeof(p) local = READ_ONCE(p); \
- ((typeof(*p) __force __kernel *)(local)); \
+ __rcutype(p) local = READ_ONCE(p); \
+ ((__rcutype(p, __force __kernel))(local)); \
})
#define rcu_dereference_raw(p) __rcu_dereference_raw(p, __UNIQUE_ID(rcu))
@@ -412,7 +427,7 @@ static inline void rcu_preempt_sleep_check(void) { }
* RCU_INITIALIZER() - statically initialize an RCU-protected global variable
* @v: The value to statically initialize with.
*/
-#define RCU_INITIALIZER(v) (typeof(*(v)) __force __rcu *)(v)
+#define RCU_INITIALIZER(v) (__rcutype(v, __force __rcu))(v)
/**
* rcu_assign_pointer() - assign to RCU-protected pointer
--
2.36.1
* [xdp-hints] [PATCH RFC bpf-next 38/52] net, xdp: remove unused xdp_attachment_info::flags
2022-06-28 19:47 [xdp-hints] [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata Alexander Lobakin
` (36 preceding siblings ...)
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 37/52] rcupdate: fix access helpers for incomplete struct pointers on GCC < 10 Alexander Lobakin
@ 2022-06-28 19:47 ` Alexander Lobakin
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 39/52] net, xdp: make &xdp_attachment_info a bit more useful in drivers Alexander Lobakin
` (14 subsequent siblings)
52 siblings, 0 replies; 98+ messages in thread
From: Alexander Lobakin @ 2022-06-28 19:47 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
Cc: Alexander Lobakin, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Toke Hoiland-Jorgensen,
Lorenzo Bianconi, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jesse Brandeburg, John Fastabend, Yajun Deng,
Willem de Bruijn, bpf, netdev, linux-kernel, xdp-hints
Since %XDP_QUERY_PROG was removed, the ::flags field is not used
anymore. It's still written by xdp_attachment_setup(), but never
read.
Remove it.
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
---
include/net/xdp.h | 1 -
net/bpf/core.c | 1 -
2 files changed, 2 deletions(-)
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 1663d0b3a05a..d1fd809655be 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -382,7 +382,6 @@ struct xdp_attachment_info {
struct bpf_prog *prog;
u64 btf_id;
u32 meta_thresh;
- u32 flags;
};
struct netdev_bpf;
diff --git a/net/bpf/core.c b/net/bpf/core.c
index d2d01b8e6441..65f25019493d 100644
--- a/net/bpf/core.c
+++ b/net/bpf/core.c
@@ -554,7 +554,6 @@ void xdp_attachment_setup(struct xdp_attachment_info *info,
info->prog = bpf->prog;
info->btf_id = bpf->btf_id;
info->meta_thresh = bpf->meta_thresh;
- info->flags = bpf->flags;
}
EXPORT_SYMBOL_GPL(xdp_attachment_setup);
--
2.36.1
* [xdp-hints] [PATCH RFC bpf-next 39/52] net, xdp: make &xdp_attachment_info a bit more useful in drivers
2022-06-28 19:47 [xdp-hints] [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata Alexander Lobakin
` (37 preceding siblings ...)
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 38/52] net, xdp: remove unused xdp_attachment_info::flags Alexander Lobakin
@ 2022-06-28 19:47 ` Alexander Lobakin
2022-06-28 19:48 ` [xdp-hints] [PATCH RFC bpf-next 40/52] net, xdp: add an RCU version of xdp_attachment_setup() Alexander Lobakin
` (13 subsequent siblings)
52 siblings, 0 replies; 98+ messages in thread
From: Alexander Lobakin @ 2022-06-28 19:47 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
Cc: Alexander Lobakin, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Toke Hoiland-Jorgensen,
Lorenzo Bianconi, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jesse Brandeburg, John Fastabend, Yajun Deng,
Willem de Bruijn, bpf, netdev, linux-kernel, xdp-hints
Add a new field to store an arbitrary 'driver cookie'. The closest
usage is to store an enum value there corresponding to the metadata
type supported by a driver, to shortcut metadata handling on the
hotpath. In fact, it just reuses the 4-byte padding at the end of
the structure.
Also, make it possible to store the BTF ID in LE rather than CPU
byteorder, so that drivers can save some cycles on [potential]
byteswapping on the hotpath.
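A rough sketch of the intended pattern (the helper names are made
up for illustration, only the field names come from the hunk
below):

/* Attach time: convert once, a no-op on LE hosts */
static void sketch_attach(struct xdp_attachment_info *info, u64 btf_id)
{
	info->btf_id_le = cpu_to_le64(btf_id);
}

/* Hotpath: copy the pre-swapped ID into an LE metadata field with
 * no per-frame byteswap at all
 */
static void sketch_fill_meta(__le64 *md_btf_id,
			     const struct xdp_attachment_info *info)
{
	*md_btf_id = info->btf_id_le;
}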
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
---
include/net/xdp.h | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/include/net/xdp.h b/include/net/xdp.h
index d1fd809655be..5762ce18885f 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -380,8 +380,12 @@ void xdp_unreg_mem_model(struct xdp_mem_info *mem);
struct xdp_attachment_info {
struct bpf_prog *prog;
- u64 btf_id;
+ union {
+ __le64 btf_id_le;
+ u64 btf_id;
+ };
u32 meta_thresh;
+ u32 drv_cookie;
};
struct netdev_bpf;
--
2.36.1
* [xdp-hints] [PATCH RFC bpf-next 40/52] net, xdp: add an RCU version of xdp_attachment_setup()
2022-06-28 19:47 [xdp-hints] [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata Alexander Lobakin
` (38 preceding siblings ...)
2022-06-28 19:47 ` [xdp-hints] [PATCH RFC bpf-next 39/52] net, xdp: make &xdp_attachment_info a bit more useful in drivers Alexander Lobakin
@ 2022-06-28 19:48 ` Alexander Lobakin
2022-06-28 19:48 ` [xdp-hints] [PATCH RFC bpf-next 41/52] net, xdp: replace net_device::xdp_prog pointer with &xdp_attachment_info Alexander Lobakin
` (12 subsequent siblings)
52 siblings, 0 replies; 98+ messages in thread
From: Alexander Lobakin @ 2022-06-28 19:48 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
Cc: Alexander Lobakin, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Toke Hoiland-Jorgensen,
Lorenzo Bianconi, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jesse Brandeburg, John Fastabend, Yajun Deng,
Willem de Bruijn, bpf, netdev, linux-kernel, xdp-hints
Currently, xdp_attachment_setup() uses plain assignments and puts
the previous BPF program before updating the pointer, which makes
it dangerous for program hot-swaps due to pointer tearing and
potential use-after-frees.
At the same time, &xdp_attachment_info comes in handy as the main
container to use in drivers, including on the hotpath -- the BTF ID
and meta threshold values are now used there as well, not to
mention the boilerplate code it saves.
Add an RCU-protected XDP program pointer to that structure and an
RCU version of xdp_attachment_setup(), which makes sure the values
don't get corrupted and the old BPF program is freed only after the
pointer has been updated. The only caveat left is that RCU read
critical sections might happen in between the assignments, but
since the relations between the XDP prog, BTF ID and meta threshold
are not vital, it's totally fine to allow this.
A caller must ensure it's being executed under the RTNL lock. Reader
sides must ensure they're being executed under the RCU read lock.
Once all the current users of xdp_attachment_setup() are switched to
the RCU-aware version (with appropriate adjustments), the "regular"
one will be removed.
Partially inspired by commit fe45386a2082 ("net/mlx5e: Use RCU to
protect rq->xdp_prog").
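For illustration, a minimal sketch of a reader side consuming the
program under the RCU read lock (sketch_run_prog() is made up, the
struct layout is from the hunks below):

static u32 sketch_run_prog(const struct xdp_attachment_info *info,
			   struct xdp_buff *xdp)
{
	struct bpf_prog *prog;
	u32 act = XDP_PASS;

	rcu_read_lock();

	/* The old prog can't be freed while we're inside the read-side
	 * critical section, see bpf_prog_put()
	 */
	prog = rcu_dereference(info->prog_rcu);
	if (prog)
		act = bpf_prog_run_xdp(prog, xdp);

	rcu_read_unlock();

	return act;
}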
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
---
include/net/xdp.h | 7 ++++++-
net/bpf/core.c | 28 ++++++++++++++++++++++++++++
2 files changed, 34 insertions(+), 1 deletion(-)
diff --git a/include/net/xdp.h b/include/net/xdp.h
index 5762ce18885f..49e562e4fcca 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -379,7 +379,10 @@ int xdp_reg_mem_model(struct xdp_mem_info *mem,
void xdp_unreg_mem_model(struct xdp_mem_info *mem);
struct xdp_attachment_info {
- struct bpf_prog *prog;
+ union {
+ struct bpf_prog __rcu *prog_rcu;
+ struct bpf_prog *prog;
+ };
union {
__le64 btf_id_le;
u64 btf_id;
@@ -391,6 +394,8 @@ struct xdp_attachment_info {
struct netdev_bpf;
void xdp_attachment_setup(struct xdp_attachment_info *info,
struct netdev_bpf *bpf);
+void xdp_attachment_setup_rcu(struct xdp_attachment_info *info,
+ struct netdev_bpf *bpf);
#define DEV_MAP_BULK_SIZE XDP_BULK_QUEUE_SIZE
diff --git a/net/bpf/core.c b/net/bpf/core.c
index 65f25019493d..d444d0555057 100644
--- a/net/bpf/core.c
+++ b/net/bpf/core.c
@@ -557,6 +557,34 @@ void xdp_attachment_setup(struct xdp_attachment_info *info,
}
EXPORT_SYMBOL_GPL(xdp_attachment_setup);
+/**
+ * xdp_attachment_setup_rcu - an RCU-powered version of xdp_attachment_setup()
+ * @info: pointer to the target container
+ * @bpf: pointer to the container passed to ::ndo_bpf()
+ *
+ * Protects sensitive values with RCU to allow program hot-swaps without
+ * stopping an interface. Write side (this) must be called under the RTNL lock
+ * and reader sides must fetch any data only under the RCU read lock -- old BPF
+ * program will be freed only after a critical section is finished (see
+ * bpf_prog_put()).
+ */
+void xdp_attachment_setup_rcu(struct xdp_attachment_info *info,
+ struct netdev_bpf *bpf)
+{
+ struct bpf_prog *old_prog;
+
+ ASSERT_RTNL();
+
+ old_prog = rcu_replace_pointer(info->prog_rcu, bpf->prog,
+ lockdep_rtnl_is_held());
+ WRITE_ONCE(info->btf_id, bpf->btf_id);
+ WRITE_ONCE(info->meta_thresh, bpf->meta_thresh);
+
+ if (old_prog)
+ bpf_prog_put(old_prog);
+}
+EXPORT_SYMBOL_GPL(xdp_attachment_setup_rcu);
+
struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp)
{
unsigned int metasize, totsize;
--
2.36.1
* [xdp-hints] [PATCH RFC bpf-next 41/52] net, xdp: replace net_device::xdp_prog pointer with &xdp_attachment_info
2022-06-28 19:47 [xdp-hints] [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata Alexander Lobakin
` (39 preceding siblings ...)
2022-06-28 19:48 ` [xdp-hints] [PATCH RFC bpf-next 40/52] net, xdp: add an RCU version of xdp_attachment_setup() Alexander Lobakin
@ 2022-06-28 19:48 ` Alexander Lobakin
2022-06-28 19:48 ` [xdp-hints] [PATCH RFC bpf-next 42/52] net, xdp: shortcut skb->dev in bpf_prog_run_generic_xdp() Alexander Lobakin
` (11 subsequent siblings)
52 siblings, 0 replies; 98+ messages in thread
From: Alexander Lobakin @ 2022-06-28 19:48 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
Cc: Alexander Lobakin, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Toke Hoiland-Jorgensen,
Lorenzo Bianconi, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jesse Brandeburg, John Fastabend, Yajun Deng,
Willem de Bruijn, bpf, netdev, linux-kernel, xdp-hints
To be able to access and store not only the BPF prog pointer, but
also auxiliary params on the Generic (skb) XDP path, replace
net_device::xdp_prog with an &xdp_attachment_info struct and use
xdp_attachment_setup_rcu() (since the Generic XDP code RCU-protects
the pointer already).
This slightly changes the struct &net_device cacheline layout, but
touches nothing performance-critical.
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
---
include/linux/netdevice.h | 7 +++----
net/bpf/dev.c | 11 ++++-------
net/core/dev.c | 4 +++-
net/core/rtnetlink.c | 2 +-
4 files changed, 11 insertions(+), 13 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 60df42b3f116..1c033c164257 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2168,7 +2168,7 @@ struct net_device {
unsigned int num_rx_queues;
unsigned int real_num_rx_queues;
- struct bpf_prog __rcu *xdp_prog;
+ struct xdp_attachment_info xdp_info;
unsigned long gro_flush_timeout;
int napi_defer_hard_irqs;
#define GRO_LEGACY_MAX_SIZE 65536u
@@ -2343,9 +2343,8 @@ struct net_device {
static inline bool netif_elide_gro(const struct net_device *dev)
{
- if (!(dev->features & NETIF_F_GRO) || dev->xdp_prog)
- return true;
- return false;
+ return !(dev->features & NETIF_F_GRO) ||
+ rcu_access_pointer(dev->xdp_info.prog_rcu);
}
#define NETDEV_ALIGN 32
diff --git a/net/bpf/dev.c b/net/bpf/dev.c
index 82948d0536c8..cc43f73929f3 100644
--- a/net/bpf/dev.c
+++ b/net/bpf/dev.c
@@ -242,19 +242,16 @@ static void dev_disable_gro_hw(struct net_device *dev)
static int generic_xdp_install(struct net_device *dev, struct netdev_bpf *xdp)
{
- struct bpf_prog *old = rtnl_dereference(dev->xdp_prog);
- struct bpf_prog *new = xdp->prog;
+ bool old = !!rtnl_dereference(dev->xdp_info.prog_rcu);
int ret = 0;
switch (xdp->command) {
case XDP_SETUP_PROG:
- rcu_assign_pointer(dev->xdp_prog, new);
- if (old)
- bpf_prog_put(old);
+ xdp_attachment_setup_rcu(&dev->xdp_info, xdp);
- if (old && !new) {
+ if (old && !xdp->prog) {
static_branch_dec(&generic_xdp_needed_key);
- } else if (new && !old) {
+ } else if (xdp->prog && !old) {
static_branch_inc(&generic_xdp_needed_key);
dev_disable_lro(dev);
dev_disable_gro_hw(dev);
diff --git a/net/core/dev.c b/net/core/dev.c
index 62bf6ee00741..e57ae87d619e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5055,10 +5055,12 @@ static int __netif_receive_skb_core(struct sk_buff **pskb, bool pfmemalloc,
__this_cpu_inc(softnet_data.processed);
if (static_branch_unlikely(&generic_xdp_needed_key)) {
+ struct bpf_prog *prog;
int ret2;
migrate_disable();
- ret2 = do_xdp_generic(rcu_dereference(skb->dev->xdp_prog), skb);
+ prog = rcu_dereference(skb->dev->xdp_info.prog_rcu);
+ ret2 = do_xdp_generic(prog, skb);
migrate_enable();
if (ret2 != XDP_PASS) {
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 500420d5017c..72f696b12df2 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1451,7 +1451,7 @@ static u32 rtnl_xdp_prog_skb(struct net_device *dev)
ASSERT_RTNL();
- generic_xdp_prog = rtnl_dereference(dev->xdp_prog);
+ generic_xdp_prog = rtnl_dereference(dev->xdp_info.prog_rcu);
if (!generic_xdp_prog)
return 0;
return generic_xdp_prog->aux->id;
--
2.36.1
* [xdp-hints] [PATCH RFC bpf-next 42/52] net, xdp: shortcut skb->dev in bpf_prog_run_generic_xdp()
2022-06-28 19:47 [xdp-hints] [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata Alexander Lobakin
` (40 preceding siblings ...)
2022-06-28 19:48 ` [xdp-hints] [PATCH RFC bpf-next 41/52] net, xdp: replace net_device::xdp_prog pointer with &xdp_attachment_info Alexander Lobakin
@ 2022-06-28 19:48 ` Alexander Lobakin
2022-06-28 19:48 ` [xdp-hints] [PATCH RFC bpf-next 43/52] net, xdp: build XDP generic metadata on Generic (skb) XDP path Alexander Lobakin
` (10 subsequent siblings)
52 siblings, 0 replies; 98+ messages in thread
From: Alexander Lobakin @ 2022-06-28 19:48 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
Cc: Alexander Lobakin, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Toke Hoiland-Jorgensen,
Lorenzo Bianconi, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jesse Brandeburg, John Fastabend, Yajun Deng,
Willem de Bruijn, bpf, netdev, linux-kernel, xdp-hints
skb->dev is used 3 times here already, with more uses to come.
Fetch it onto the stack to reduce jumping back and forth.
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
---
net/bpf/dev.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/net/bpf/dev.c b/net/bpf/dev.c
index cc43f73929f3..350ebdc783a0 100644
--- a/net/bpf/dev.c
+++ b/net/bpf/dev.c
@@ -31,6 +31,7 @@ u32 bpf_prog_run_generic_xdp(struct sk_buff *skb, struct xdp_buff *xdp,
struct bpf_prog *xdp_prog)
{
void *orig_data, *orig_data_end, *hard_start;
+ struct net_device *dev = skb->dev;
struct netdev_rx_queue *rxqueue;
bool orig_bcast, orig_host;
u32 mac_len, frame_sz;
@@ -57,7 +58,7 @@ u32 bpf_prog_run_generic_xdp(struct sk_buff *skb, struct xdp_buff *xdp,
orig_data_end = xdp->data_end;
orig_data = xdp->data;
eth = (struct ethhdr *)xdp->data;
- orig_host = ether_addr_equal_64bits(eth->h_dest, skb->dev->dev_addr);
+ orig_host = ether_addr_equal_64bits(eth->h_dest, dev->dev_addr);
orig_bcast = is_multicast_ether_addr_64bits(eth->h_dest);
orig_eth_type = eth->h_proto;
@@ -86,11 +87,11 @@ u32 bpf_prog_run_generic_xdp(struct sk_buff *skb, struct xdp_buff *xdp,
eth = (struct ethhdr *)xdp->data;
if ((orig_eth_type != eth->h_proto) ||
(orig_host != ether_addr_equal_64bits(eth->h_dest,
- skb->dev->dev_addr)) ||
+ dev->dev_addr)) ||
(orig_bcast != is_multicast_ether_addr_64bits(eth->h_dest))) {
__skb_push(skb, ETH_HLEN);
skb->pkt_type = PACKET_HOST;
- skb->protocol = eth_type_trans(skb, skb->dev);
+ skb->protocol = eth_type_trans(skb, dev);
}
/* Redirect/Tx gives L2 packet, code that will reuse skb must __skb_pull
--
2.36.1
* [xdp-hints] [PATCH RFC bpf-next 43/52] net, xdp: build XDP generic metadata on Generic (skb) XDP path
2022-06-28 19:47 [xdp-hints] [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata Alexander Lobakin
` (41 preceding siblings ...)
2022-06-28 19:48 ` [xdp-hints] [PATCH RFC bpf-next 42/52] net, xdp: shortcut skb->dev in bpf_prog_run_generic_xdp() Alexander Lobakin
@ 2022-06-28 19:48 ` Alexander Lobakin
2022-06-28 19:48 ` [xdp-hints] [PATCH RFC bpf-next 44/52] net, ice: allow XDP prog hot-swapping Alexander Lobakin
` (9 subsequent siblings)
52 siblings, 0 replies; 98+ messages in thread
From: Alexander Lobakin @ 2022-06-28 19:48 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
Cc: Alexander Lobakin, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Toke Hoiland-Jorgensen,
Lorenzo Bianconi, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jesse Brandeburg, John Fastabend, Yajun Deng,
Willem de Bruijn, bpf, netdev, linux-kernel, xdp-hints
Now that the core has a routine to build XDP generic metadata from
the skb fields and &net_device stores meta_thresh, provide XDP
generic metadata to BPF programs running on the Generic/skb XDP
path.
The skb fields are updated from the metadata after the BPF program
exits (if the metadata is still there).
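For illustration, a BPF-side sketch of how a program on this path
could detect whether the metadata was composed -- it's the standard
data_meta validation (struct xdp_meta_generic comes from earlier in
the series, vmlinux.h is assumed to provide it):

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

SEC("xdp")
int meta_reader(struct xdp_md *ctx)
{
	void *data = (void *)(long)ctx->data;
	void *data_meta = (void *)(long)ctx->data_meta;
	struct xdp_meta_generic *md = data_meta;

	/* Frame was shorter than meta_thresh -> no metadata built */
	if ((void *)(md + 1) > data)
		return XDP_PASS;

	/* ...read the md fields here... */

	return XDP_PASS;
}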
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
---
net/bpf/dev.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 47 insertions(+), 4 deletions(-)
diff --git a/net/bpf/dev.c b/net/bpf/dev.c
index 350ebdc783a0..f4187b357a0c 100644
--- a/net/bpf/dev.c
+++ b/net/bpf/dev.c
@@ -1,7 +1,20 @@
// SPDX-License-Identifier: GPL-2.0-only
+#include <net/xdp_meta.h>
#include <trace/events/xdp.h>
+enum {
+ GENERIC_XDP_META_GEN,
+
+ /* Must be last */
+ GENERIC_XDP_META_NONE,
+ __GENERIC_XDP_META_NUM,
+};
+
+static const char * const generic_xdp_meta_types[__GENERIC_XDP_META_NUM] = {
+ [GENERIC_XDP_META_GEN] = "struct xdp_meta_generic",
+};
+
DEFINE_STATIC_KEY_FALSE(generic_xdp_needed_key);
static struct netdev_rx_queue *netif_get_rxqueue(struct sk_buff *skb)
@@ -27,17 +40,33 @@ static struct netdev_rx_queue *netif_get_rxqueue(struct sk_buff *skb)
return rxqueue;
}
+static void generic_xdp_handle_meta(struct xdp_buff *xdp, struct sk_buff *skb,
+ const struct xdp_attachment_info *info)
+{
+ if (xdp->data_end - xdp->data < READ_ONCE(info->meta_thresh))
+ return;
+
+ switch (READ_ONCE(info->drv_cookie)) {
+ case GENERIC_XDP_META_GEN:
+ xdp_build_meta_generic_from_skb(skb);
+ xdp->data_meta = skb_metadata_end(skb) - skb_metadata_len(skb);
+ break;
+ default:
+ break;
+ }
+}
+
u32 bpf_prog_run_generic_xdp(struct sk_buff *skb, struct xdp_buff *xdp,
struct bpf_prog *xdp_prog)
{
void *orig_data, *orig_data_end, *hard_start;
struct net_device *dev = skb->dev;
struct netdev_rx_queue *rxqueue;
+ u32 metalen, orig_metalen, act;
bool orig_bcast, orig_host;
u32 mac_len, frame_sz;
__be16 orig_eth_type;
struct ethhdr *eth;
- u32 metalen, act;
int off;
/* The XDP program wants to see the packet starting at the MAC
@@ -62,6 +91,9 @@ u32 bpf_prog_run_generic_xdp(struct sk_buff *skb, struct xdp_buff *xdp,
orig_bcast = is_multicast_ether_addr_64bits(eth->h_dest);
orig_eth_type = eth->h_proto;
+ generic_xdp_handle_meta(xdp, skb, &dev->xdp_info);
+ orig_metalen = xdp->data - xdp->data_meta;
+
act = bpf_prog_run_xdp(xdp_prog, xdp);
/* check if bpf_xdp_adjust_head was used */
@@ -105,11 +137,15 @@ u32 bpf_prog_run_generic_xdp(struct sk_buff *skb, struct xdp_buff *xdp,
case XDP_REDIRECT:
case XDP_TX:
__skb_push(skb, mac_len);
- break;
+ fallthrough;
case XDP_PASS:
metalen = xdp->data - xdp->data_meta;
- if (metalen)
+ if (metalen != orig_metalen)
skb_metadata_set(skb, metalen);
+ if (metalen)
+ xdp_populate_skb_meta_generic(skb);
+ else if (orig_metalen)
+ skb_metadata_nocomp_clear(skb);
break;
}
@@ -244,10 +280,15 @@ static void dev_disable_gro_hw(struct net_device *dev)
static int generic_xdp_install(struct net_device *dev, struct netdev_bpf *xdp)
{
bool old = !!rtnl_dereference(dev->xdp_info.prog_rcu);
- int ret = 0;
+ int ret;
switch (xdp->command) {
case XDP_SETUP_PROG:
+ ret = xdp_meta_match_id(generic_xdp_meta_types, xdp->btf_id);
+ if (ret < 0)
+ return ret;
+
+ WRITE_ONCE(dev->xdp_info.drv_cookie, ret);
xdp_attachment_setup_rcu(&dev->xdp_info, xdp);
if (old && !xdp->prog) {
@@ -257,6 +298,8 @@ static int generic_xdp_install(struct net_device *dev, struct netdev_bpf *xdp)
dev_disable_lro(dev);
dev_disable_gro_hw(dev);
}
+
+ ret = 0;
break;
default:
--
2.36.1
* [xdp-hints] [PATCH RFC bpf-next 44/52] net, ice: allow XDP prog hot-swapping
2022-06-28 19:47 [xdp-hints] [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata Alexander Lobakin
` (42 preceding siblings ...)
2022-06-28 19:48 ` [xdp-hints] [PATCH RFC bpf-next 43/52] net, xdp: build XDP generic metadata on Generic (skb) XDP path Alexander Lobakin
@ 2022-06-28 19:48 ` Alexander Lobakin
2022-06-28 19:48 ` [xdp-hints] [PATCH RFC bpf-next 45/52] net, ice: consolidate all skb fields processing Alexander Lobakin
` (8 subsequent siblings)
52 siblings, 0 replies; 98+ messages in thread
From: Alexander Lobakin @ 2022-06-28 19:48 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
Cc: Alexander Lobakin, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Toke Hoiland-Jorgensen,
Lorenzo Bianconi, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jesse Brandeburg, John Fastabend, Yajun Deng,
Willem de Bruijn, bpf, netdev, linux-kernel, xdp-hints
Currently, an interface is always brought down on %XDP_SETUP_PROG,
no matter whether there is a global configuration change (no prog
-> prog, prog -> no prog) or just a hot-swap (prog -> prog). That
is suboptimal, especially when old_prog == new_prog, which should
be a complete no-op. Moreover, it makes it impossible to change
some aux XDP options on the fly, which could be designed to work
like that.
Store &xdp_attachment_info in just one copy inside the VSI
structure; RQs will only have pointers to it. This way we only need
to rewrite it once, and xdp_attachment_setup_rcu() can now be used.
Guard the NAPI poll routines with RCU read locks to make sure the
BPF prog won't get freed right in the middle of a cycle. Now the
old program is freed only when all of the rings are already using
the new one. Then do an ifdown->ifup cycle in ::ndo_bpf() only if
absolutely needed (mentioned above); the rest is completely safe to
do on the go.
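The resulting restart policy boils down to something like this (a
condensed sketch of the ice_xdp_setup_prog() hunk below, the helper
name is made up):

static bool sketch_needs_restart(struct ice_vsi *vsi,
				 const struct netdev_bpf *xdp)
{
	/* Restart only on a global config change (XDP being enabled
	 * or disabled), a prog -> prog hot-swap keeps the interface
	 * running
	 */
	return ice_is_xdp_ena_vsi(vsi) != !!xdp->prog &&
	       netif_running(vsi->netdev);
}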
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
---
drivers/net/ethernet/intel/ice/ice.h | 8 +--
drivers/net/ethernet/intel/ice/ice_lib.c | 4 +-
drivers/net/ethernet/intel/ice/ice_main.c | 61 ++++++++++-------------
drivers/net/ethernet/intel/ice/ice_txrx.c | 11 ++--
drivers/net/ethernet/intel/ice/ice_txrx.h | 2 +-
drivers/net/ethernet/intel/ice/ice_xsk.c | 2 +-
6 files changed, 40 insertions(+), 48 deletions(-)
diff --git a/drivers/net/ethernet/intel/ice/ice.h b/drivers/net/ethernet/intel/ice/ice.h
index 60453b3b8d23..402b71ab48e4 100644
--- a/drivers/net/ethernet/intel/ice/ice.h
+++ b/drivers/net/ethernet/intel/ice/ice.h
@@ -386,7 +386,7 @@ struct ice_vsi {
u16 num_tx_desc;
u16 qset_handle[ICE_MAX_TRAFFIC_CLASS];
struct ice_tc_cfg tc_cfg;
- struct bpf_prog *xdp_prog;
+ struct xdp_attachment_info xdp_info;
struct ice_tx_ring **xdp_rings; /* XDP ring array */
unsigned long *af_xdp_zc_qps; /* tracks AF_XDP ZC enabled qps */
u16 num_xdp_txq; /* Used XDP queues */
@@ -672,7 +672,7 @@ static inline struct ice_pf *ice_netdev_to_pf(struct net_device *netdev)
static inline bool ice_is_xdp_ena_vsi(struct ice_vsi *vsi)
{
- return !!READ_ONCE(vsi->xdp_prog);
+ return !!rcu_access_pointer(vsi->xdp_info.prog_rcu);
}
static inline void ice_set_ring_xdp(struct ice_tx_ring *ring)
@@ -857,8 +857,8 @@ int ice_down(struct ice_vsi *vsi);
int ice_vsi_cfg(struct ice_vsi *vsi);
struct ice_vsi *ice_lb_vsi_setup(struct ice_pf *pf, struct ice_port_info *pi);
int ice_vsi_determine_xdp_res(struct ice_vsi *vsi);
-int ice_prepare_xdp_rings(struct ice_vsi *vsi, struct bpf_prog *prog);
-int ice_destroy_xdp_rings(struct ice_vsi *vsi);
+int ice_prepare_xdp_rings(struct ice_vsi *vsi, struct netdev_bpf *xdp);
+int ice_destroy_xdp_rings(struct ice_vsi *vsi, struct netdev_bpf *xdp);
int
ice_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
u32 flags);
diff --git a/drivers/net/ethernet/intel/ice/ice_lib.c b/drivers/net/ethernet/intel/ice/ice_lib.c
index b28fb8eacffb..3db1271b5176 100644
--- a/drivers/net/ethernet/intel/ice/ice_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_lib.c
@@ -3200,7 +3200,7 @@ int ice_vsi_rebuild(struct ice_vsi *vsi, bool init_vsi)
/* return value check can be skipped here, it always returns
* 0 if reset is in progress
*/
- ice_destroy_xdp_rings(vsi);
+ ice_destroy_xdp_rings(vsi, NULL);
ice_vsi_put_qs(vsi);
ice_vsi_clear_rings(vsi);
ice_vsi_free_arrays(vsi);
@@ -3248,7 +3248,7 @@ int ice_vsi_rebuild(struct ice_vsi *vsi, bool init_vsi)
ret = ice_vsi_determine_xdp_res(vsi);
if (ret)
goto err_vectors;
- ret = ice_prepare_xdp_rings(vsi, vsi->xdp_prog);
+ ret = ice_prepare_xdp_rings(vsi, NULL);
if (ret)
goto err_vectors;
}
diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
index c1ac2f746714..7d049930a0a8 100644
--- a/drivers/net/ethernet/intel/ice/ice_main.c
+++ b/drivers/net/ethernet/intel/ice/ice_main.c
@@ -2603,32 +2603,14 @@ static int ice_xdp_alloc_setup_rings(struct ice_vsi *vsi)
return -ENOMEM;
}
-/**
- * ice_vsi_assign_bpf_prog - set or clear bpf prog pointer on VSI
- * @vsi: VSI to set the bpf prog on
- * @prog: the bpf prog pointer
- */
-static void ice_vsi_assign_bpf_prog(struct ice_vsi *vsi, struct bpf_prog *prog)
-{
- struct bpf_prog *old_prog;
- int i;
-
- old_prog = xchg(&vsi->xdp_prog, prog);
- if (old_prog)
- bpf_prog_put(old_prog);
-
- ice_for_each_rxq(vsi, i)
- WRITE_ONCE(vsi->rx_rings[i]->xdp_prog, vsi->xdp_prog);
-}
-
/**
* ice_prepare_xdp_rings - Allocate, configure and setup Tx rings for XDP
* @vsi: VSI to bring up Tx rings used by XDP
- * @prog: bpf program that will be assigned to VSI
+ * @xdp: &netdev_bpf with XDP program and additional data passed from the stack
*
* Return 0 on success and negative value on error
*/
-int ice_prepare_xdp_rings(struct ice_vsi *vsi, struct bpf_prog *prog)
+int ice_prepare_xdp_rings(struct ice_vsi *vsi, struct netdev_bpf *xdp)
{
u16 max_txqs[ICE_MAX_TRAFFIC_CLASS] = { 0 };
int xdp_rings_rem = vsi->num_xdp_txq;
@@ -2713,8 +2695,8 @@ int ice_prepare_xdp_rings(struct ice_vsi *vsi, struct bpf_prog *prog)
* this is not harmful as dev_xdp_install bumps the refcount
* before calling the op exposed by the driver;
*/
- if (!ice_is_xdp_ena_vsi(vsi))
- ice_vsi_assign_bpf_prog(vsi, prog);
+ if (xdp)
+ xdp_attachment_setup_rcu(&vsi->xdp_info, xdp);
return 0;
clear_xdp_rings:
@@ -2739,11 +2721,12 @@ int ice_prepare_xdp_rings(struct ice_vsi *vsi, struct bpf_prog *prog)
/**
* ice_destroy_xdp_rings - undo the configuration made by ice_prepare_xdp_rings
* @vsi: VSI to remove XDP rings
+ * @xdp: &netdev_bpf with XDP program and additional data passed from the stack
*
* Detach XDP rings from irq vectors, clean up the PF bitmap and free
* resources
*/
-int ice_destroy_xdp_rings(struct ice_vsi *vsi)
+int ice_destroy_xdp_rings(struct ice_vsi *vsi, struct netdev_bpf *xdp)
{
u16 max_txqs[ICE_MAX_TRAFFIC_CLASS] = { 0 };
struct ice_pf *pf = vsi->back;
@@ -2796,7 +2779,11 @@ int ice_destroy_xdp_rings(struct ice_vsi *vsi)
if (ice_is_reset_in_progress(pf->state) || !vsi->q_vectors[0])
return 0;
- ice_vsi_assign_bpf_prog(vsi, NULL);
+ /* Symmetrically to ice_prepare_xdp_rings(), touch XDP program only
+ * when called from ::ndo_bpf().
+ */
+ if (xdp)
+ xdp_attachment_setup_rcu(&vsi->xdp_info, xdp);
/* notify Tx scheduler that we destroyed XDP queues and bring
* back the old number of child nodes
@@ -2853,15 +2840,14 @@ int ice_vsi_determine_xdp_res(struct ice_vsi *vsi)
/**
* ice_xdp_setup_prog - Add or remove XDP eBPF program
* @vsi: VSI to setup XDP for
- * @prog: XDP program
- * @extack: netlink extended ack
+ * @xdp: &netdev_bpf with XDP program and additional data passed from the stack
*/
static int
-ice_xdp_setup_prog(struct ice_vsi *vsi, struct bpf_prog *prog,
- struct netlink_ext_ack *extack)
+ice_xdp_setup_prog(struct ice_vsi *vsi, struct netdev_bpf *xdp)
{
int frame_size = vsi->netdev->mtu + ICE_ETH_PKT_HDR_PAD;
- bool if_running = netif_running(vsi->netdev);
+ struct netlink_ext_ack *extack = xdp->extack;
+ bool restart = false, prog = !!xdp->prog;
int ret = 0, xdp_ring_err = 0;
if (frame_size > vsi->rx_buf_len) {
@@ -2870,12 +2856,15 @@ ice_xdp_setup_prog(struct ice_vsi *vsi, struct bpf_prog *prog,
}
/* need to stop netdev while setting up the program for Rx rings */
- if (if_running && !test_and_set_bit(ICE_VSI_DOWN, vsi->state)) {
+ if (ice_is_xdp_ena_vsi(vsi) != prog && netif_running(vsi->netdev) &&
+ !test_and_set_bit(ICE_VSI_DOWN, vsi->state)) {
ret = ice_down(vsi);
if (ret) {
NL_SET_ERR_MSG_MOD(extack, "Preparing device for XDP attach failed");
return ret;
}
+
+ restart = true;
}
if (!ice_is_xdp_ena_vsi(vsi) && prog) {
@@ -2883,24 +2872,24 @@ ice_xdp_setup_prog(struct ice_vsi *vsi, struct bpf_prog *prog,
if (xdp_ring_err) {
NL_SET_ERR_MSG_MOD(extack, "Not enough Tx resources for XDP");
} else {
- xdp_ring_err = ice_prepare_xdp_rings(vsi, prog);
+ xdp_ring_err = ice_prepare_xdp_rings(vsi, xdp);
if (xdp_ring_err)
NL_SET_ERR_MSG_MOD(extack, "Setting up XDP Tx resources failed");
}
} else if (ice_is_xdp_ena_vsi(vsi) && !prog) {
- xdp_ring_err = ice_destroy_xdp_rings(vsi);
+ xdp_ring_err = ice_destroy_xdp_rings(vsi, xdp);
if (xdp_ring_err)
NL_SET_ERR_MSG_MOD(extack, "Freeing XDP Tx resources failed");
} else {
- /* safe to call even when prog == vsi->xdp_prog as
+ /* safe to call even when prog == vsi->xdp_info.prog as
* dev_xdp_install in net/core/dev.c incremented prog's
* refcount so corresponding bpf_prog_put won't cause
* underflow
*/
- ice_vsi_assign_bpf_prog(vsi, prog);
+ xdp_attachment_setup_rcu(&vsi->xdp_info, xdp);
}
- if (if_running)
+ if (restart)
ret = ice_up(vsi);
if (!ret && prog)
@@ -2940,7 +2929,7 @@ static int ice_xdp(struct net_device *dev, struct netdev_bpf *xdp)
switch (xdp->command) {
case XDP_SETUP_PROG:
- return ice_xdp_setup_prog(vsi, xdp->prog, xdp->extack);
+ return ice_xdp_setup_prog(vsi, xdp);
case XDP_SETUP_XSK_POOL:
return ice_xsk_pool_setup(vsi, xdp->xsk.pool,
xdp->xsk.queue_id);
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
index 3f8b7274ed2f..25383bbf8245 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
@@ -454,7 +454,7 @@ void ice_free_rx_ring(struct ice_rx_ring *rx_ring)
if (rx_ring->vsi->type == ICE_VSI_PF)
if (xdp_rxq_info_is_reg(&rx_ring->xdp_rxq))
xdp_rxq_info_unreg(&rx_ring->xdp_rxq);
- rx_ring->xdp_prog = NULL;
+
if (rx_ring->xsk_pool) {
kfree(rx_ring->xdp_buf);
rx_ring->xdp_buf = NULL;
@@ -507,8 +507,7 @@ int ice_setup_rx_ring(struct ice_rx_ring *rx_ring)
rx_ring->next_to_use = 0;
rx_ring->next_to_clean = 0;
- if (ice_is_xdp_ena_vsi(rx_ring->vsi))
- WRITE_ONCE(rx_ring->xdp_prog, rx_ring->vsi->xdp_prog);
+ rx_ring->xdp_info = &rx_ring->vsi->xdp_info;
if (rx_ring->vsi->type == ICE_VSI_PF &&
!xdp_rxq_info_is_reg(&rx_ring->xdp_rxq))
@@ -1123,7 +1122,7 @@ int ice_clean_rx_irq(struct ice_rx_ring *rx_ring, int budget)
#endif
xdp_init_buff(&xdp, frame_sz, &rx_ring->xdp_rxq);
- xdp_prog = READ_ONCE(rx_ring->xdp_prog);
+ xdp_prog = rcu_dereference(rx_ring->xdp_info->prog_rcu);
if (xdp_prog)
xdp_ring = rx_ring->xdp_ring;
@@ -1489,6 +1488,8 @@ int ice_napi_poll(struct napi_struct *napi, int budget)
/* Max of 1 Rx ring in this q_vector so give it the budget */
budget_per_ring = budget;
+ rcu_read_lock();
+
ice_for_each_rx_ring(rx_ring, q_vector->rx) {
int cleaned;
@@ -1505,6 +1506,8 @@ int ice_napi_poll(struct napi_struct *napi, int budget)
clean_complete = false;
}
+ rcu_read_unlock();
+
/* If work not completed, return budget and polling will return */
if (!clean_complete) {
/* Set the writeback on ITR so partial completions of
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.h b/drivers/net/ethernet/intel/ice/ice_txrx.h
index ca902af54bb4..1fc31ab0bf33 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.h
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.h
@@ -290,7 +290,7 @@ struct ice_rx_ring {
struct rcu_head rcu; /* to avoid race on free */
/* CL4 - 3rd cacheline starts here */
struct ice_channel *ch;
- struct bpf_prog *xdp_prog;
+ const struct xdp_attachment_info *xdp_info;
struct ice_tx_ring *xdp_ring;
struct xsk_buff_pool *xsk_pool;
struct sk_buff *skb;
diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
index 49ba8bfdbf04..eb994cf68ff4 100644
--- a/drivers/net/ethernet/intel/ice/ice_xsk.c
+++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
@@ -597,7 +597,7 @@ int ice_clean_rx_irq_zc(struct ice_rx_ring *rx_ring, int budget)
/* ZC patch is enabled only when XDP program is set,
* so here it can not be NULL
*/
- xdp_prog = READ_ONCE(rx_ring->xdp_prog);
+ xdp_prog = rcu_dereference(rx_ring->xdp_info->prog_rcu);
xdp_ring = rx_ring->xdp_ring;
while (likely(total_rx_packets < (unsigned int)budget)) {
--
2.36.1
* [xdp-hints] [PATCH RFC bpf-next 45/52] net, ice: consolidate all skb fields processing
2022-06-28 19:47 [xdp-hints] [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata Alexander Lobakin
` (43 preceding siblings ...)
2022-06-28 19:48 ` [xdp-hints] [PATCH RFC bpf-next 44/52] net, ice: allow XDP prog hot-swapping Alexander Lobakin
@ 2022-06-28 19:48 ` Alexander Lobakin
2022-06-28 19:48 ` [xdp-hints] [PATCH RFC bpf-next 46/52] net, ice: use an onstack &xdp_meta_generic_rx to store HW frame info Alexander Lobakin
` (7 subsequent siblings)
52 siblings, 0 replies; 98+ messages in thread
From: Alexander Lobakin @ 2022-06-28 19:48 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
Cc: Alexander Lobakin, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Toke Hoiland-Jorgensen,
Lorenzo Bianconi, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jesse Brandeburg, John Fastabend, Yajun Deng,
Willem de Bruijn, bpf, netdev, linux-kernel, xdp-hints
For now, skb fields filling is scattered across the RQ / XSK RQ
polling functions. Make it consistent and do everything in
ice_process_skb_fields().
Obtaining @vlan_tag and @rx_ptype can be moved in there too; there
is no reason to do it outside. ice_receive_skb() now becomes just a
standard pair of eth_type_trans() + napi_gro_receive(), so make it
static inline to save a couple of redundant jumps.
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
---
drivers/net/ethernet/intel/ice/ice_txrx.c | 19 +----
drivers/net/ethernet/intel/ice/ice_txrx_lib.c | 81 +++++++++----------
drivers/net/ethernet/intel/ice/ice_txrx_lib.h | 25 +++++-
drivers/net/ethernet/intel/ice/ice_xsk.c | 11 +--
4 files changed, 65 insertions(+), 71 deletions(-)
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
index 25383bbf8245..ffea5138a7e8 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
@@ -949,11 +949,6 @@ ice_build_skb(struct ice_rx_ring *rx_ring, struct ice_rx_buf *rx_buf,
if (unlikely(!skb))
return NULL;
- /* must to record Rx queue, otherwise OS features such as
- * symmetric queue won't work
- */
- skb_record_rx_queue(skb, rx_ring->q_index);
-
/* update pointers within the skb to store the data */
skb_reserve(skb, xdp->data - xdp->data_hard_start);
__skb_put(skb, xdp->data_end - xdp->data);
@@ -995,7 +990,6 @@ ice_construct_skb(struct ice_rx_ring *rx_ring, struct ice_rx_buf *rx_buf,
if (unlikely(!skb))
return NULL;
- skb_record_rx_queue(skb, rx_ring->q_index);
/* Determine available headroom for copy */
headlen = size;
if (headlen > ICE_RX_HDR_SIZE)
@@ -1134,8 +1128,6 @@ int ice_clean_rx_irq(struct ice_rx_ring *rx_ring, int budget)
unsigned int size;
u16 stat_err_bits;
int rx_buf_pgcnt;
- u16 vlan_tag = 0;
- u16 rx_ptype;
/* get the Rx desc from Rx ring based on 'next_to_clean' */
rx_desc = ICE_RX_DESC(rx_ring, rx_ring->next_to_clean);
@@ -1238,8 +1230,6 @@ int ice_clean_rx_irq(struct ice_rx_ring *rx_ring, int budget)
continue;
}
- vlan_tag = ice_get_vlan_tag_from_rx_desc(rx_desc);
-
/* pad the skb if needed, to make a valid ethernet frame */
if (eth_skb_pad(skb)) {
skb = NULL;
@@ -1249,15 +1239,10 @@ int ice_clean_rx_irq(struct ice_rx_ring *rx_ring, int budget)
/* probably a little skewed due to removing CRC */
total_rx_bytes += skb->len;
- /* populate checksum, VLAN, and protocol */
- rx_ptype = le16_to_cpu(rx_desc->wb.ptype_flex_flags0) &
- ICE_RX_FLEX_DESC_PTYPE_M;
-
- ice_process_skb_fields(rx_ring, rx_desc, skb, rx_ptype);
+ ice_process_skb_fields(rx_ring, rx_desc, skb);
ice_trace(clean_rx_irq_indicate, rx_ring, rx_desc, skb);
- /* send completed skb up the stack */
- ice_receive_skb(rx_ring, skb, vlan_tag);
+ ice_receive_skb(rx_ring, skb);
skb = NULL;
/* update budget accounting */
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
index 7ee38d02d1e5..92c001baa2cc 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
@@ -40,16 +40,15 @@ void ice_release_rx_desc(struct ice_rx_ring *rx_ring, u16 val)
/**
* ice_ptype_to_htype - get a hash type
- * @ptype: the ptype value from the descriptor
+ * @decoded: the decoded ptype value from the descriptor
*
* Returns appropriate hash type (such as PKT_HASH_TYPE_L2/L3/L4) to be used by
* skb_set_hash based on PTYPE as parsed by HW Rx pipeline and is part of
* Rx desc.
*/
-static enum pkt_hash_types ice_ptype_to_htype(u16 ptype)
+static enum pkt_hash_types
+ice_ptype_to_htype(struct ice_rx_ptype_decoded decoded)
{
- struct ice_rx_ptype_decoded decoded = ice_decode_rx_desc_ptype(ptype);
-
if (!decoded.known)
return PKT_HASH_TYPE_NONE;
if (decoded.payload_layer == ICE_RX_PTYPE_PAYLOAD_LAYER_PAY4)
@@ -67,11 +66,11 @@ static enum pkt_hash_types ice_ptype_to_htype(u16 ptype)
* @rx_ring: descriptor ring
* @rx_desc: specific descriptor
* @skb: pointer to current skb
- * @rx_ptype: the ptype value from the descriptor
+ * @decoded: the decoded ptype value from the descriptor
*/
static void
ice_rx_hash(struct ice_rx_ring *rx_ring, union ice_32b_rx_flex_desc *rx_desc,
- struct sk_buff *skb, u16 rx_ptype)
+ struct sk_buff *skb, struct ice_rx_ptype_decoded decoded)
{
struct ice_32b_rx_flex_desc_nic *nic_mdid;
u32 hash;
@@ -84,7 +83,7 @@ ice_rx_hash(struct ice_rx_ring *rx_ring, union ice_32b_rx_flex_desc *rx_desc,
nic_mdid = (struct ice_32b_rx_flex_desc_nic *)rx_desc;
hash = le32_to_cpu(nic_mdid->rss_hash);
- skb_set_hash(skb, hash, ice_ptype_to_htype(rx_ptype));
+ skb_set_hash(skb, hash, ice_ptype_to_htype(decoded));
}
/**
@@ -92,23 +91,21 @@ ice_rx_hash(struct ice_rx_ring *rx_ring, union ice_32b_rx_flex_desc *rx_desc,
* @ring: the ring we care about
* @skb: skb currently being received and modified
* @rx_desc: the receive descriptor
- * @ptype: the packet type decoded by hardware
+ * @decoded: the decoded packet type parsed by hardware
*
* skb->protocol must be set before this function is called
*/
static void
ice_rx_csum(struct ice_rx_ring *ring, struct sk_buff *skb,
- union ice_32b_rx_flex_desc *rx_desc, u16 ptype)
+ union ice_32b_rx_flex_desc *rx_desc,
+ struct ice_rx_ptype_decoded decoded)
{
- struct ice_rx_ptype_decoded decoded;
u16 rx_status0, rx_status1;
bool ipv4, ipv6;
rx_status0 = le16_to_cpu(rx_desc->wb.status_error0);
rx_status1 = le16_to_cpu(rx_desc->wb.status_error1);
- decoded = ice_decode_rx_desc_ptype(ptype);
-
/* Start with CHECKSUM_NONE and by default csum_level = 0 */
skb->ip_summed = CHECKSUM_NONE;
skb_checksum_none_assert(skb);
@@ -170,12 +167,31 @@ ice_rx_csum(struct ice_rx_ring *ring, struct sk_buff *skb,
ring->vsi->back->hw_csum_rx_error++;
}
+static void ice_rx_vlan(struct sk_buff *skb,
+ const struct ice_rx_ring *rx_ring,
+ const union ice_32b_rx_flex_desc *rx_desc)
+{
+ netdev_features_t features = rx_ring->netdev->features;
+ bool non_zero_vlan;
+ u16 vlan_tag;
+
+ vlan_tag = ice_get_vlan_tag_from_rx_desc(rx_desc);
+ non_zero_vlan = !!(vlan_tag & VLAN_VID_MASK);
+
+ if (!non_zero_vlan)
+ return;
+
+ if ((features & NETIF_F_HW_VLAN_CTAG_RX))
+ __vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q), vlan_tag);
+ else if ((features & NETIF_F_HW_VLAN_STAG_RX))
+ __vlan_hwaccel_put_tag(skb, htons(ETH_P_8021AD), vlan_tag);
+}
+
/**
* ice_process_skb_fields - Populate skb header fields from Rx descriptor
* @rx_ring: Rx descriptor ring packet is being transacted on
* @rx_desc: pointer to the EOP Rx descriptor
* @skb: pointer to current skb being populated
- * @ptype: the packet type decoded by hardware
*
* This function checks the ring, descriptor, and packet information in
* order to populate the hash, checksum, VLAN, protocol, and
@@ -184,42 +200,25 @@ ice_rx_csum(struct ice_rx_ring *ring, struct sk_buff *skb,
void
ice_process_skb_fields(struct ice_rx_ring *rx_ring,
union ice_32b_rx_flex_desc *rx_desc,
- struct sk_buff *skb, u16 ptype)
+ struct sk_buff *skb)
{
- ice_rx_hash(rx_ring, rx_desc, skb, ptype);
+ struct ice_rx_ptype_decoded decoded;
+ u16 ptype;
- /* modifies the skb - consumes the enet header */
- skb->protocol = eth_type_trans(skb, rx_ring->netdev);
+ skb_record_rx_queue(skb, rx_ring->q_index);
- ice_rx_csum(rx_ring, skb, rx_desc, ptype);
+ ptype = le16_to_cpu(rx_desc->wb.ptype_flex_flags0) &
+ ICE_RX_FLEX_DESC_PTYPE_M;
+ decoded = ice_decode_rx_desc_ptype(ptype);
+
+ ice_rx_hash(rx_ring, rx_desc, skb, decoded);
+ ice_rx_csum(rx_ring, skb, rx_desc, decoded);
+ ice_rx_vlan(skb, rx_ring, rx_desc);
if (rx_ring->ptp_rx)
ice_ptp_rx_hwtstamp(rx_ring, rx_desc, skb);
}
-/**
- * ice_receive_skb - Send a completed packet up the stack
- * @rx_ring: Rx ring in play
- * @skb: packet to send up
- * @vlan_tag: VLAN tag for packet
- *
- * This function sends the completed packet (via. skb) up the stack using
- * gro receive functions (with/without VLAN tag)
- */
-void
-ice_receive_skb(struct ice_rx_ring *rx_ring, struct sk_buff *skb, u16 vlan_tag)
-{
- netdev_features_t features = rx_ring->netdev->features;
- bool non_zero_vlan = !!(vlan_tag & VLAN_VID_MASK);
-
- if ((features & NETIF_F_HW_VLAN_CTAG_RX) && non_zero_vlan)
- __vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q), vlan_tag);
- else if ((features & NETIF_F_HW_VLAN_STAG_RX) && non_zero_vlan)
- __vlan_hwaccel_put_tag(skb, htons(ETH_P_8021AD), vlan_tag);
-
- napi_gro_receive(&rx_ring->q_vector->napi, skb);
-}
-
/**
* ice_clean_xdp_irq - Reclaim resources after transmit completes on XDP ring
* @xdp_ring: XDP ring to clean
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx_lib.h b/drivers/net/ethernet/intel/ice/ice_txrx_lib.h
index c7d2954dc9ea..45dc5ef79e28 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx_lib.h
+++ b/drivers/net/ethernet/intel/ice/ice_txrx_lib.h
@@ -40,7 +40,7 @@ ice_build_ctob(u64 td_cmd, u64 td_offset, unsigned int size, u64 td_tag)
* one is found return the tag, else return 0 to mean no VLAN tag was found.
*/
static inline u16
-ice_get_vlan_tag_from_rx_desc(union ice_32b_rx_flex_desc *rx_desc)
+ice_get_vlan_tag_from_rx_desc(const union ice_32b_rx_flex_desc *rx_desc)
{
u16 stat_err_bits;
@@ -55,6 +55,24 @@ ice_get_vlan_tag_from_rx_desc(union ice_32b_rx_flex_desc *rx_desc)
return 0;
}
+/**
+ * ice_receive_skb - Send a completed packet up the stack
+ * @rx_ring: Rx ring in play
+ * @skb: packet to send up
+ *
+ * This function sends the completed packet (via skb) up the stack using
+ * gro receive functions
+ */
+static inline void ice_receive_skb(const struct ice_rx_ring *rx_ring,
+ struct sk_buff *skb)
+{
+ /* modifies the skb - consumes the enet header */
+ skb->protocol = eth_type_trans(skb, rx_ring->netdev);
+
+ /* send completed skb up the stack */
+ napi_gro_receive(&rx_ring->q_vector->napi, skb);
+}
+
/**
* ice_xdp_ring_update_tail - Updates the XDP Tx ring tail register
* @xdp_ring: XDP Tx ring
@@ -77,7 +95,6 @@ void ice_release_rx_desc(struct ice_rx_ring *rx_ring, u16 val);
void
ice_process_skb_fields(struct ice_rx_ring *rx_ring,
union ice_32b_rx_flex_desc *rx_desc,
- struct sk_buff *skb, u16 ptype);
-void
-ice_receive_skb(struct ice_rx_ring *rx_ring, struct sk_buff *skb, u16 vlan_tag);
+ struct sk_buff *skb);
+
#endif /* !_ICE_TXRX_LIB_H_ */
diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
index eb994cf68ff4..0a66128964e7 100644
--- a/drivers/net/ethernet/intel/ice/ice_xsk.c
+++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
@@ -606,8 +606,6 @@ int ice_clean_rx_irq_zc(struct ice_rx_ring *rx_ring, int budget)
struct xdp_buff *xdp;
struct sk_buff *skb;
u16 stat_err_bits;
- u16 vlan_tag = 0;
- u16 rx_ptype;
rx_desc = ICE_RX_DESC(rx_ring, rx_ring->next_to_clean);
@@ -675,13 +673,8 @@ int ice_clean_rx_irq_zc(struct ice_rx_ring *rx_ring, int budget)
total_rx_bytes += skb->len;
total_rx_packets++;
- vlan_tag = ice_get_vlan_tag_from_rx_desc(rx_desc);
-
- rx_ptype = le16_to_cpu(rx_desc->wb.ptype_flex_flags0) &
- ICE_RX_FLEX_DESC_PTYPE_M;
-
- ice_process_skb_fields(rx_ring, rx_desc, skb, rx_ptype);
- ice_receive_skb(rx_ring, skb, vlan_tag);
+ ice_process_skb_fields(rx_ring, rx_desc, skb);
+ ice_receive_skb(rx_ring, skb);
}
entries_to_alloc = ICE_DESC_UNUSED(rx_ring);
--
2.36.1
* [xdp-hints] [PATCH RFC bpf-next 46/52] net, ice: use an onstack &xdp_meta_generic_rx to store HW frame info
2022-06-28 19:47 [xdp-hints] [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata Alexander Lobakin
` (44 preceding siblings ...)
2022-06-28 19:48 ` [xdp-hints] [PATCH RFC bpf-next 45/52] net, ice: consolidate all skb fields processing Alexander Lobakin
@ 2022-06-28 19:48 ` Alexander Lobakin
2022-06-28 19:48 ` [xdp-hints] [PATCH RFC bpf-next 47/52] net, ice: build XDP generic metadata Alexander Lobakin
` (6 subsequent siblings)
52 siblings, 0 replies; 98+ messages in thread
From: Alexander Lobakin @ 2022-06-28 19:48 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
Cc: Alexander Lobakin, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Toke Hoiland-Jorgensen,
Lorenzo Bianconi, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jesse Brandeburg, John Fastabend, Yajun Deng,
Willem de Bruijn, bpf, netdev, linux-kernel, xdp-hints
To be able to pass HW-provided frame metadata, such as hash,
checksum status etc., to BPF and XSK programs, unify the container
used to store it regardless of an XDP program's presence or the
verdict returned by it. Use an intermediate onstack
&xdp_meta_generic_rx before filling the skb fields and switch the
descriptor parsing functions to use it instead of an &sk_buff.
This works the same way as an &xdp_buff is filled before forming
an skb. If metadata generation is enabled, the actual space in
front of a frame will be used in the upcoming changes.
Using &xdp_meta_generic_rx instead of the full-blown
&xdp_meta_generic reduces the text size by 32 bytes per function.
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
---
drivers/net/ethernet/intel/ice/ice_ptp.c | 19 ++--
drivers/net/ethernet/intel/ice/ice_ptp.h | 17 ++-
drivers/net/ethernet/intel/ice/ice_txrx.c | 4 +-
drivers/net/ethernet/intel/ice/ice_txrx.h | 1 +
drivers/net/ethernet/intel/ice/ice_txrx_lib.c | 105 ++++++++++--------
drivers/net/ethernet/intel/ice/ice_txrx_lib.h | 12 +-
drivers/net/ethernet/intel/ice/ice_xsk.c | 4 +-
7 files changed, 91 insertions(+), 71 deletions(-)
diff --git a/drivers/net/ethernet/intel/ice/ice_ptp.c b/drivers/net/ethernet/intel/ice/ice_ptp.c
index ef9344ef0d8e..d4d955152682 100644
--- a/drivers/net/ethernet/intel/ice/ice_ptp.c
+++ b/drivers/net/ethernet/intel/ice/ice_ptp.c
@@ -1795,24 +1795,22 @@ int ice_ptp_set_ts_config(struct ice_pf *pf, struct ifreq *ifr)
/**
* ice_ptp_rx_hwtstamp - Check for an Rx timestamp
- * @rx_ring: Ring to get the VSI info
* @rx_desc: Receive descriptor
- * @skb: Particular skb to send timestamp with
+ * @rx_ring: Ring to get the VSI info
+ * @md: Metadata to set timestamp in
*
* The driver receives a notification in the receive descriptor with timestamp.
* The timestamp is in ns, so we must convert the result first.
*/
-void
-ice_ptp_rx_hwtstamp(struct ice_rx_ring *rx_ring,
- union ice_32b_rx_flex_desc *rx_desc, struct sk_buff *skb)
+void ice_ptp_rx_hwtstamp(struct xdp_meta_generic *md,
+ const union ice_32b_rx_flex_desc *rx_desc,
+ const struct ice_rx_ring *rx_ring)
{
u32 ts_high;
u64 ts_ns;
- /* Populate timesync data into skb */
+ /* Populate timesync data into md */
if (rx_desc->wb.time_stamp_low & ICE_PTP_TS_VALID) {
- struct skb_shared_hwtstamps *hwtstamps;
-
/* Use ice_ptp_extend_32b_ts directly, using the ring-specific
* cached PHC value, rather than accessing the PF. This also
* allows us to simply pass the upper 32bits of nanoseconds
@@ -1822,9 +1820,8 @@ ice_ptp_rx_hwtstamp(struct ice_rx_ring *rx_ring,
ts_high = le32_to_cpu(rx_desc->wb.flex_ts.ts_high);
ts_ns = ice_ptp_extend_32b_ts(rx_ring->cached_phctime, ts_high);
- hwtstamps = skb_hwtstamps(skb);
- memset(hwtstamps, 0, sizeof(*hwtstamps));
- hwtstamps->hwtstamp = ns_to_ktime(ts_ns);
+ xdp_meta_rx_tstamp_present_set(md, 1);
+ xdp_meta_rx_tstamp_set(md, ts_ns);
}
}
diff --git a/drivers/net/ethernet/intel/ice/ice_ptp.h b/drivers/net/ethernet/intel/ice/ice_ptp.h
index 10e396abf130..488b6bb01605 100644
--- a/drivers/net/ethernet/intel/ice/ice_ptp.h
+++ b/drivers/net/ethernet/intel/ice/ice_ptp.h
@@ -228,8 +228,12 @@ struct ice_ptp {
#define N_EXT_TS_E810_NO_SMA 2
#define ETH_GLTSYN_ENA(_i) (0x03000348 + ((_i) * 4))
-#if IS_ENABLED(CONFIG_PTP_1588_CLOCK)
struct ice_pf;
+struct ice_rx_ring;
+struct xdp_meta_generic;
+union ice_32b_rx_flex_desc;
+
+#if IS_ENABLED(CONFIG_PTP_1588_CLOCK)
int ice_ptp_set_ts_config(struct ice_pf *pf, struct ifreq *ifr);
int ice_ptp_get_ts_config(struct ice_pf *pf, struct ifreq *ifr);
void ice_ptp_cfg_timestamp(struct ice_pf *pf, bool ena);
@@ -238,9 +242,9 @@ int ice_get_ptp_clock_index(struct ice_pf *pf);
s8 ice_ptp_request_ts(struct ice_ptp_tx *tx, struct sk_buff *skb);
void ice_ptp_process_ts(struct ice_pf *pf);
-void
-ice_ptp_rx_hwtstamp(struct ice_rx_ring *rx_ring,
- union ice_32b_rx_flex_desc *rx_desc, struct sk_buff *skb);
+void ice_ptp_rx_hwtstamp(struct xdp_meta_generic *md,
+ const union ice_32b_rx_flex_desc *rx_desc,
+ const struct ice_rx_ring *rx_ring);
void ice_ptp_reset(struct ice_pf *pf);
void ice_ptp_prepare_for_reset(struct ice_pf *pf);
void ice_ptp_init(struct ice_pf *pf);
@@ -271,8 +275,9 @@ ice_ptp_request_ts(struct ice_ptp_tx *tx, struct sk_buff *skb)
static inline void ice_ptp_process_ts(struct ice_pf *pf) { }
static inline void
-ice_ptp_rx_hwtstamp(struct ice_rx_ring *rx_ring,
- union ice_32b_rx_flex_desc *rx_desc, struct sk_buff *skb) { }
+ice_ptp_rx_hwtstamp(struct xdp_meta_generic *md,
+ const union ice_32b_rx_flex_desc *rx_desc,
+ const struct ice_rx_ring *rx_ring) { }
static inline void ice_ptp_reset(struct ice_pf *pf) { }
static inline void ice_ptp_prepare_for_reset(struct ice_pf *pf) { }
static inline void ice_ptp_init(struct ice_pf *pf) { }
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
index ffea5138a7e8..c679f7c30bdc 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
@@ -1123,6 +1123,7 @@ int ice_clean_rx_irq(struct ice_rx_ring *rx_ring, int budget)
/* start the loop to process Rx packets bounded by 'budget' */
while (likely(total_rx_pkts < (unsigned int)budget)) {
union ice_32b_rx_flex_desc *rx_desc;
+ struct xdp_meta_generic_rx md;
struct ice_rx_buf *rx_buf;
unsigned char *hard_start;
unsigned int size;
@@ -1239,7 +1240,8 @@ int ice_clean_rx_irq(struct ice_rx_ring *rx_ring, int budget)
/* probably a little skewed due to removing CRC */
total_rx_bytes += skb->len;
- ice_process_skb_fields(rx_ring, rx_desc, skb);
+ ice_xdp_build_meta(&md, rx_desc, rx_ring, 0);
+ __xdp_populate_skb_meta_generic(skb, &md);
ice_trace(clean_rx_irq_indicate, rx_ring, rx_desc, skb);
ice_receive_skb(rx_ring, skb);
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.h b/drivers/net/ethernet/intel/ice/ice_txrx.h
index 1fc31ab0bf33..a814709deb50 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.h
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.h
@@ -4,6 +4,7 @@
#ifndef _ICE_TXRX_H_
#define _ICE_TXRX_H_
+#include <net/xdp_meta.h>
#include "ice_type.h"
#define ICE_DFLT_IRQ_WORK 256
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
index 92c001baa2cc..7550e2ed8936 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx_lib.c
@@ -43,36 +43,37 @@ void ice_release_rx_desc(struct ice_rx_ring *rx_ring, u16 val)
* @decoded: the decoded ptype value from the descriptor
*
* Returns appropriate hash type (such as PKT_HASH_TYPE_L2/L3/L4) to be used by
- * skb_set_hash based on PTYPE as parsed by HW Rx pipeline and is part of
- * Rx desc.
+ * xdp_meta_rx_hash_type_set() based on PTYPE as parsed by HW Rx pipeline and
+ * is part of Rx desc.
*/
-static enum pkt_hash_types
+static u32
ice_ptype_to_htype(struct ice_rx_ptype_decoded decoded)
{
if (!decoded.known)
- return PKT_HASH_TYPE_NONE;
+ return XDP_META_RX_HASH_NONE;
if (decoded.payload_layer == ICE_RX_PTYPE_PAYLOAD_LAYER_PAY4)
- return PKT_HASH_TYPE_L4;
+ return XDP_META_RX_HASH_L4;
if (decoded.payload_layer == ICE_RX_PTYPE_PAYLOAD_LAYER_PAY3)
- return PKT_HASH_TYPE_L3;
+ return XDP_META_RX_HASH_L3;
if (decoded.outer_ip == ICE_RX_PTYPE_OUTER_L2)
- return PKT_HASH_TYPE_L2;
+ return XDP_META_RX_HASH_L2;
- return PKT_HASH_TYPE_NONE;
+ return XDP_META_RX_HASH_NONE;
}
/**
- * ice_rx_hash - set the hash value in the skb
+ * ice_rx_hash - set the hash value in the metadata
+ * @md: pointer to current metadata
* @rx_ring: descriptor ring
* @rx_desc: specific descriptor
- * @skb: pointer to current skb
* @decoded: the decoded ptype value from the descriptor
*/
-static void
-ice_rx_hash(struct ice_rx_ring *rx_ring, union ice_32b_rx_flex_desc *rx_desc,
- struct sk_buff *skb, struct ice_rx_ptype_decoded decoded)
+static void ice_rx_hash(struct xdp_meta_generic *md,
+ const struct ice_rx_ring *rx_ring,
+ const union ice_32b_rx_flex_desc *rx_desc,
+ struct ice_rx_ptype_decoded decoded)
{
- struct ice_32b_rx_flex_desc_nic *nic_mdid;
+ const struct ice_32b_rx_flex_desc_nic *nic_mdid;
u32 hash;
if (!(rx_ring->netdev->features & NETIF_F_RXHASH))
@@ -81,24 +82,24 @@ ice_rx_hash(struct ice_rx_ring *rx_ring, union ice_32b_rx_flex_desc *rx_desc,
if (rx_desc->wb.rxdid != ICE_RXDID_FLEX_NIC)
return;
- nic_mdid = (struct ice_32b_rx_flex_desc_nic *)rx_desc;
+ nic_mdid = (typeof(nic_mdid))rx_desc;
hash = le32_to_cpu(nic_mdid->rss_hash);
- skb_set_hash(skb, hash, ice_ptype_to_htype(decoded));
+
+ xdp_meta_rx_hash_type_set(md, ice_ptype_to_htype(decoded));
+ xdp_meta_rx_hash_set(md, hash);
}
/**
- * ice_rx_csum - Indicate in skb if checksum is good
+ * ice_rx_csum - Indicate in metadata if checksum is good
+ * @md: metadata currently being filled
* @ring: the ring we care about
- * @skb: skb currently being received and modified
* @rx_desc: the receive descriptor
* @decoded: the decoded packet type parsed by hardware
- *
- * skb->protocol must be set before this function is called
*/
-static void
-ice_rx_csum(struct ice_rx_ring *ring, struct sk_buff *skb,
- union ice_32b_rx_flex_desc *rx_desc,
- struct ice_rx_ptype_decoded decoded)
+static void ice_rx_csum(struct xdp_meta_generic *md,
+ const struct ice_rx_ring *ring,
+ const union ice_32b_rx_flex_desc *rx_desc,
+ struct ice_rx_ptype_decoded decoded)
{
u16 rx_status0, rx_status1;
bool ipv4, ipv6;
@@ -106,10 +107,6 @@ ice_rx_csum(struct ice_rx_ring *ring, struct sk_buff *skb,
rx_status0 = le16_to_cpu(rx_desc->wb.status_error0);
rx_status1 = le16_to_cpu(rx_desc->wb.status_error1);
- /* Start with CHECKSUM_NONE and by default csum_level = 0 */
- skb->ip_summed = CHECKSUM_NONE;
- skb_checksum_none_assert(skb);
-
/* check if Rx checksum is enabled */
if (!(ring->netdev->features & NETIF_F_RXCSUM))
return;
@@ -149,14 +146,14 @@ ice_rx_csum(struct ice_rx_ring *ring, struct sk_buff *skb,
* we are indicating we validated the inner checksum.
*/
if (decoded.tunnel_type >= ICE_RX_PTYPE_TUNNEL_IP_GRENAT)
- skb->csum_level = 1;
+ xdp_meta_rx_csum_level_set(md, 1);
/* Only report checksum unnecessary for TCP, UDP, or SCTP */
switch (decoded.inner_prot) {
case ICE_RX_PTYPE_INNER_PROT_TCP:
case ICE_RX_PTYPE_INNER_PROT_UDP:
case ICE_RX_PTYPE_INNER_PROT_SCTP:
- skb->ip_summed = CHECKSUM_UNNECESSARY;
+ xdp_meta_rx_csum_status_set(md, XDP_META_RX_CSUM_OK);
break;
default:
break;
@@ -167,7 +164,13 @@ ice_rx_csum(struct ice_rx_ring *ring, struct sk_buff *skb,
ring->vsi->back->hw_csum_rx_error++;
}
-static void ice_rx_vlan(struct sk_buff *skb,
+#define xdp_meta_rx_vlan_from_feat(feat) ({ \
+ ((feat) & NETIF_F_HW_VLAN_CTAG_RX) ? XDP_META_RX_CVID : \
+ ((feat) & NETIF_F_HW_VLAN_STAG_RX) ? XDP_META_RX_SVID : \
+ XDP_META_RX_VLAN_NONE; \
+})
+
+static void ice_rx_vlan(struct xdp_meta_generic *md,
const struct ice_rx_ring *rx_ring,
const union ice_32b_rx_flex_desc *rx_desc)
{
@@ -181,42 +184,48 @@ static void ice_rx_vlan(struct sk_buff *skb,
if (!non_zero_vlan)
return;
- if ((features & NETIF_F_HW_VLAN_CTAG_RX))
- __vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q), vlan_tag);
- else if ((features & NETIF_F_HW_VLAN_STAG_RX))
- __vlan_hwaccel_put_tag(skb, htons(ETH_P_8021AD), vlan_tag);
+ xdp_meta_rx_vlan_type_set(md, xdp_meta_rx_vlan_from_feat(features));
+ xdp_meta_rx_vid_set(md, vlan_tag);
}
/**
- * ice_process_skb_fields - Populate skb header fields from Rx descriptor
- * @rx_ring: Rx descriptor ring packet is being transacted on
+ * __ice_xdp_build_meta - Populate XDP generic metadata fields from Rx desc
+ * @rx_md: pointer to the metadata structure to be populated
* @rx_desc: pointer to the EOP Rx descriptor
- * @skb: pointer to current skb being populated
+ * @rx_ring: Rx descriptor ring packet is being transacted on
+ * @full_id: full ID (BTF ID + type ID) to fill in
*
* This function checks the ring, descriptor, and packet information in
* order to populate the hash, checksum, VLAN, protocol, and
- * other fields within the skb.
+ * other fields within the metadata.
*/
-void
-ice_process_skb_fields(struct ice_rx_ring *rx_ring,
- union ice_32b_rx_flex_desc *rx_desc,
- struct sk_buff *skb)
+void __ice_xdp_build_meta(struct xdp_meta_generic_rx *rx_md,
+ const union ice_32b_rx_flex_desc *rx_desc,
+ const struct ice_rx_ring *rx_ring,
+ __le64 full_id)
{
+ struct xdp_meta_generic *md = to_gen_md(rx_md);
struct ice_rx_ptype_decoded decoded;
u16 ptype;
- skb_record_rx_queue(skb, rx_ring->q_index);
+ xdp_meta_init(&md->id, full_id);
+ md->rx_hash = 0;
+ md->rx_csum = 0;
+ md->rx_flags = 0;
+
+ xdp_meta_rx_qid_present_set(md, 1);
+ xdp_meta_rx_qid_set(md, rx_ring->q_index);
ptype = le16_to_cpu(rx_desc->wb.ptype_flex_flags0) &
ICE_RX_FLEX_DESC_PTYPE_M;
decoded = ice_decode_rx_desc_ptype(ptype);
- ice_rx_hash(rx_ring, rx_desc, skb, decoded);
- ice_rx_csum(rx_ring, skb, rx_desc, decoded);
- ice_rx_vlan(skb, rx_ring, rx_desc);
+ ice_rx_hash(md, rx_ring, rx_desc, decoded);
+ ice_rx_csum(md, rx_ring, rx_desc, decoded);
+ ice_rx_vlan(md, rx_ring, rx_desc);
if (rx_ring->ptp_rx)
- ice_ptp_rx_hwtstamp(rx_ring, rx_desc, skb);
+ ice_ptp_rx_hwtstamp(md, rx_desc, rx_ring);
}
/**
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx_lib.h b/drivers/net/ethernet/intel/ice/ice_txrx_lib.h
index 45dc5ef79e28..b51e58b8e83d 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx_lib.h
+++ b/drivers/net/ethernet/intel/ice/ice_txrx_lib.h
@@ -92,9 +92,13 @@ void ice_finalize_xdp_rx(struct ice_tx_ring *xdp_ring, unsigned int xdp_res);
int ice_xmit_xdp_buff(struct xdp_buff *xdp, struct ice_tx_ring *xdp_ring);
int ice_xmit_xdp_ring(void *data, u16 size, struct ice_tx_ring *xdp_ring);
void ice_release_rx_desc(struct ice_rx_ring *rx_ring, u16 val);
-void
-ice_process_skb_fields(struct ice_rx_ring *rx_ring,
- union ice_32b_rx_flex_desc *rx_desc,
- struct sk_buff *skb);
+
+void __ice_xdp_build_meta(struct xdp_meta_generic_rx *rx_md,
+ const union ice_32b_rx_flex_desc *rx_desc,
+ const struct ice_rx_ring *rx_ring,
+ __le64 full_id);
+
+#define ice_xdp_build_meta(md, ...) \
+ __ice_xdp_build_meta(to_rx_md(md), ##__VA_ARGS__)
#endif /* !_ICE_TXRX_LIB_H_ */
diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
index 0a66128964e7..eade918723eb 100644
--- a/drivers/net/ethernet/intel/ice/ice_xsk.c
+++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
@@ -603,6 +603,7 @@ int ice_clean_rx_irq_zc(struct ice_rx_ring *rx_ring, int budget)
while (likely(total_rx_packets < (unsigned int)budget)) {
union ice_32b_rx_flex_desc *rx_desc;
unsigned int size, xdp_res = 0;
+ struct xdp_meta_generic_rx md;
struct xdp_buff *xdp;
struct sk_buff *skb;
u16 stat_err_bits;
@@ -673,7 +674,8 @@ int ice_clean_rx_irq_zc(struct ice_rx_ring *rx_ring, int budget)
total_rx_bytes += skb->len;
total_rx_packets++;
- ice_process_skb_fields(rx_ring, rx_desc, skb);
+ ice_xdp_build_meta(&md, rx_desc, rx_ring, 0);
+ __xdp_populate_skb_meta_generic(skb, &md);
ice_receive_skb(rx_ring, skb);
}
--
2.36.1
* [xdp-hints] [PATCH RFC bpf-next 47/52] net, ice: build XDP generic metadata
2022-06-28 19:47 [xdp-hints] [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata Alexander Lobakin
` (45 preceding siblings ...)
2022-06-28 19:48 ` [xdp-hints] [PATCH RFC bpf-next 46/52] net, ice: use an onstack &xdp_meta_generic_rx to store HW frame info Alexander Lobakin
@ 2022-06-28 19:48 ` Alexander Lobakin
2022-06-28 19:48 ` [xdp-hints] [PATCH RFC bpf-next 48/52] libbpf: compress Endianness ops with a macro Alexander Lobakin
` (5 subsequent siblings)
52 siblings, 0 replies; 98+ messages in thread
From: Alexander Lobakin @ 2022-06-28 19:48 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
Cc: Alexander Lobakin, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Toke Hoiland-Jorgensen,
Lorenzo Bianconi, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jesse Brandeburg, John Fastabend, Yajun Deng,
Willem de Bruijn, bpf, netdev, linux-kernel, xdp-hints
Now that the driver builds skbs from an onstack generic meta
structure, add the ability to configure the actual metadata format
to be provided to BPF and XSK programs (and other consumers like
cpumap).
The metadata is first built on the stack and then synchronized with
the buffer in front of the frame, and vice versa after the program
returns to the driver. When meta is disabled or the frame size is
below the threshold, the driver populates it only on %XDP_PASS,
right before building an skb, so there are no perf hits in that
case.
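Schematically, the per-frame flow looks as follows (a condensed
sketch of the two hot-path helpers added below; build_meta() here
stands in for __ice_xdp_build_meta() with its full argument list):

	/* before running the XDP prog */
	rx_md->rx_flags = 0;
	if (xdp->data_end - xdp->data >= info->meta_thresh) {
		build_meta(rx_md);	/* fill from the Rx desc */
		xdp->data_meta = xdp_meta_generic_ptr(xdp->data);
		memcpy(to_rx_md(xdp->data_meta), rx_md, sizeof(*rx_md));
	}

	/* after %XDP_PASS, right before creating an skb */
	if (rx_md->rx_flags)		/* meta has been built */
		memcpy(rx_md, to_rx_md(xdp_meta_generic_ptr(xdp->data)),
		       sizeof(*rx_md));	/* sync it back */
	else
		build_meta(rx_md);	/* build it only now */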
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
---
drivers/net/ethernet/intel/ice/ice.h | 8 +++
drivers/net/ethernet/intel/ice/ice_main.c | 18 ++++++-
drivers/net/ethernet/intel/ice/ice_txrx.c | 25 ++++++---
drivers/net/ethernet/intel/ice/ice_txrx_lib.h | 53 +++++++++++++++++++
drivers/net/ethernet/intel/ice/ice_xsk.c | 17 ++++--
5 files changed, 107 insertions(+), 14 deletions(-)
diff --git a/drivers/net/ethernet/intel/ice/ice.h b/drivers/net/ethernet/intel/ice/ice.h
index 402b71ab48e4..bd929bb1a359 100644
--- a/drivers/net/ethernet/intel/ice/ice.h
+++ b/drivers/net/ethernet/intel/ice/ice.h
@@ -490,6 +490,14 @@ enum ice_pf_flags {
ICE_PF_FLAGS_NBITS /* must be last */
};
+enum {
+ ICE_MD_GENERIC,
+
+ /* Must be last */
+ ICE_MD_NONE,
+ __ICE_MD_NUM,
+};
+
struct ice_switchdev_info {
struct ice_vsi *control_vsi;
struct ice_vsi *uplink_vsi;
diff --git a/drivers/net/ethernet/intel/ice/ice_main.c b/drivers/net/ethernet/intel/ice/ice_main.c
index 7d049930a0a8..62bd0d316873 100644
--- a/drivers/net/ethernet/intel/ice/ice_main.c
+++ b/drivers/net/ethernet/intel/ice/ice_main.c
@@ -48,6 +48,11 @@ static DEFINE_IDA(ice_aux_ida);
DEFINE_STATIC_KEY_FALSE(ice_xdp_locking_key);
EXPORT_SYMBOL(ice_xdp_locking_key);
+/* List of XDP metadata formats supported by the driver */
+static const char * const ice_supported_md[__ICE_MD_NUM] = {
+ [ICE_MD_GENERIC] = "struct xdp_meta_generic",
+};
+
/**
* ice_hw_to_dev - Get device pointer from the hardware structure
* @hw: pointer to the device HW structure
@@ -2848,13 +2853,19 @@ ice_xdp_setup_prog(struct ice_vsi *vsi, struct netdev_bpf *xdp)
int frame_size = vsi->netdev->mtu + ICE_ETH_PKT_HDR_PAD;
struct netlink_ext_ack *extack = xdp->extack;
bool restart = false, prog = !!xdp->prog;
- int ret = 0, xdp_ring_err = 0;
+ int pos, ret = 0, xdp_ring_err = 0;
if (frame_size > vsi->rx_buf_len) {
NL_SET_ERR_MSG_MOD(extack, "MTU too large for loading XDP");
return -EOPNOTSUPP;
}
+ pos = xdp_meta_match_id(ice_supported_md, xdp->btf_id);
+ if (pos < 0) {
+ NL_SET_ERR_MSG_MOD(extack, "Invalid or unsupported BTF ID");
+ return pos;
+ }
+
/* need to stop netdev while setting up the program for Rx rings */
if (ice_is_xdp_ena_vsi(vsi) != prog && netif_running(vsi->netdev) &&
!test_and_set_bit(ICE_VSI_DOWN, vsi->state)) {
@@ -2867,6 +2878,9 @@ ice_xdp_setup_prog(struct ice_vsi *vsi, struct netdev_bpf *xdp)
restart = true;
}
+ /* Paired with the READ_ONCE()'s in ice_clean_rx_irq{,_zc}() */
+ WRITE_ONCE(vsi->xdp_info.drv_cookie, ICE_MD_NONE);
+
if (!ice_is_xdp_ena_vsi(vsi) && prog) {
xdp_ring_err = ice_vsi_determine_xdp_res(vsi);
if (xdp_ring_err) {
@@ -2889,6 +2903,8 @@ ice_xdp_setup_prog(struct ice_vsi *vsi, struct netdev_bpf *xdp)
xdp_attachment_setup_rcu(&vsi->xdp_info, xdp);
}
+ WRITE_ONCE(vsi->xdp_info.drv_cookie, pos);
+
if (restart)
ret = ice_up(vsi);
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx.c b/drivers/net/ethernet/intel/ice/ice_txrx.c
index c679f7c30bdc..50de6d54e3b0 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx.c
+++ b/drivers/net/ethernet/intel/ice/ice_txrx.c
@@ -1103,10 +1103,10 @@ int ice_clean_rx_irq(struct ice_rx_ring *rx_ring, int budget)
unsigned int total_rx_bytes = 0, total_rx_pkts = 0, frame_sz = 0;
u16 cleaned_count = ICE_DESC_UNUSED(rx_ring);
unsigned int offset = rx_ring->rx_offset;
+ struct xdp_attachment_info xdp_info;
struct ice_tx_ring *xdp_ring = NULL;
unsigned int xdp_res, xdp_xmit = 0;
struct sk_buff *skb = rx_ring->skb;
- struct bpf_prog *xdp_prog = NULL;
struct xdp_buff xdp;
bool failure;
@@ -1116,9 +1116,16 @@ int ice_clean_rx_irq(struct ice_rx_ring *rx_ring, int budget)
#endif
xdp_init_buff(&xdp, frame_sz, &rx_ring->xdp_rxq);
- xdp_prog = rcu_dereference(rx_ring->xdp_info->prog_rcu);
- if (xdp_prog)
+ xdp_info.prog = rcu_dereference(rx_ring->xdp_info->prog_rcu);
+ if (xdp_info.prog) {
+ const struct xdp_attachment_info *info = rx_ring->xdp_info;
+
+ xdp_info.btf_id_le = cpu_to_le64(READ_ONCE(info->btf_id));
+ xdp_info.meta_thresh = READ_ONCE(info->meta_thresh);
+ xdp_info.drv_cookie = READ_ONCE(info->drv_cookie);
+
xdp_ring = rx_ring->xdp_ring;
+ }
/* start the loop to process Rx packets bounded by 'budget' */
while (likely(total_rx_pkts < (unsigned int)budget)) {
@@ -1182,10 +1189,12 @@ int ice_clean_rx_irq(struct ice_rx_ring *rx_ring, int budget)
xdp.frame_sz = ice_rx_frame_truesize(rx_ring, size);
#endif
- if (!xdp_prog)
+ if (!xdp_info.prog)
goto construct_skb;
- xdp_res = ice_run_xdp(rx_ring, &xdp, xdp_prog, xdp_ring);
+ ice_xdp_handle_meta(&xdp, &md, &xdp_info, rx_desc, rx_ring);
+
+ xdp_res = ice_run_xdp(rx_ring, &xdp, xdp_info.prog, xdp_ring);
if (!xdp_res)
goto construct_skb;
if (xdp_res & (ICE_XDP_TX | ICE_XDP_REDIR)) {
@@ -1240,8 +1249,8 @@ int ice_clean_rx_irq(struct ice_rx_ring *rx_ring, int budget)
/* probably a little skewed due to removing CRC */
total_rx_bytes += skb->len;
- ice_xdp_build_meta(&md, rx_desc, rx_ring, 0);
- __xdp_populate_skb_meta_generic(skb, &md);
+ ice_xdp_meta_populate_skb(skb, &md, xdp.data, rx_desc,
+ rx_ring);
ice_trace(clean_rx_irq_indicate, rx_ring, rx_desc, skb);
ice_receive_skb(rx_ring, skb);
@@ -1254,7 +1263,7 @@ int ice_clean_rx_irq(struct ice_rx_ring *rx_ring, int budget)
/* return up to cleaned_count buffers to hardware */
failure = ice_alloc_rx_bufs(rx_ring, cleaned_count);
- if (xdp_prog)
+ if (xdp_info.prog)
ice_finalize_xdp_rx(xdp_ring, xdp_xmit);
rx_ring->skb = skb;
diff --git a/drivers/net/ethernet/intel/ice/ice_txrx_lib.h b/drivers/net/ethernet/intel/ice/ice_txrx_lib.h
index b51e58b8e83d..a9d3f3adf86b 100644
--- a/drivers/net/ethernet/intel/ice/ice_txrx_lib.h
+++ b/drivers/net/ethernet/intel/ice/ice_txrx_lib.h
@@ -98,7 +98,60 @@ void __ice_xdp_build_meta(struct xdp_meta_generic_rx *rx_md,
const struct ice_rx_ring *rx_ring,
__le64 full_id);
+static inline void
+__ice_xdp_handle_meta(struct xdp_buff *xdp, struct xdp_meta_generic_rx *rx_md,
+ const struct xdp_attachment_info *info,
+ const union ice_32b_rx_flex_desc *rx_desc,
+ const struct ice_rx_ring *rx_ring)
+{
+ rx_md->rx_flags = 0;
+
+ if (xdp->data_end - xdp->data < info->meta_thresh)
+ return;
+
+ switch (info->drv_cookie) {
+ case ICE_MD_GENERIC:
+ __ice_xdp_build_meta(rx_md, rx_desc, rx_ring, info->btf_id_le);
+
+ xdp->data_meta = xdp_meta_generic_ptr(xdp->data);
+ memcpy(to_rx_md(xdp->data_meta), rx_md, sizeof(*rx_md));
+
+ /* Just zero Tx flags instead of zeroing the whole part */
+ to_gen_md(xdp->data_meta)->tx_flags = 0;
+ break;
+ default:
+ break;
+ }
+}
+
+static inline void
+__ice_xdp_meta_populate_skb(struct sk_buff *skb,
+ struct xdp_meta_generic_rx *rx_md,
+ const void *data,
+ const union ice_32b_rx_flex_desc *rx_desc,
+ const struct ice_rx_ring *rx_ring)
+{
+ /* __ice_xdp_build_meta() unconditionally sets Rx queue id. If it's
+ * not here, it means that metadata for this frame hasn't been built
+ * yet and we need to do this now. Otherwise, sync onstack metadata
+ * copy and mark meta as nocomp to ignore it on GRO layer.
+ */
+ if (rx_md->rx_flags && likely(xdp_meta_has_generic(data))) {
+ memcpy(rx_md, to_rx_md(xdp_meta_generic_ptr(data)),
+ sizeof(*rx_md));
+ skb_metadata_nocomp_set(skb);
+ } else {
+ __ice_xdp_build_meta(rx_md, rx_desc, rx_ring, 0);
+ }
+
+ __xdp_populate_skb_meta_generic(skb, rx_md);
+}
+
#define ice_xdp_build_meta(md, ...) \
__ice_xdp_build_meta(to_rx_md(md), ##__VA_ARGS__)
+#define ice_xdp_handle_meta(xdp, md, ...) \
+ __ice_xdp_handle_meta((xdp), to_rx_md(md), ##__VA_ARGS__)
+#define ice_xdp_meta_populate_skb(skb, md, ...) \
+ __ice_xdp_meta_populate_skb((skb), to_rx_md(md), ##__VA_ARGS__)
#endif /* !_ICE_TXRX_LIB_H_ */
diff --git a/drivers/net/ethernet/intel/ice/ice_xsk.c b/drivers/net/ethernet/intel/ice/ice_xsk.c
index eade918723eb..f5769f49e3c3 100644
--- a/drivers/net/ethernet/intel/ice/ice_xsk.c
+++ b/drivers/net/ethernet/intel/ice/ice_xsk.c
@@ -588,16 +588,20 @@ ice_run_xdp_zc(struct ice_rx_ring *rx_ring, struct xdp_buff *xdp,
int ice_clean_rx_irq_zc(struct ice_rx_ring *rx_ring, int budget)
{
unsigned int total_rx_bytes = 0, total_rx_packets = 0;
+ const struct xdp_attachment_info *rxi = rx_ring->xdp_info, xdp_info = {
+ .prog = rcu_dereference(rxi->prog_rcu),
+ .btf_id_le = cpu_to_le64(READ_ONCE(rxi->btf_id)),
+ .meta_thresh = READ_ONCE(rxi->meta_thresh),
+ .drv_cookie = READ_ONCE(rxi->drv_cookie),
+ };
struct ice_tx_ring *xdp_ring;
unsigned int xdp_xmit = 0;
- struct bpf_prog *xdp_prog;
bool failure = false;
int entries_to_alloc;
/* ZC patch is enabled only when XDP program is set,
* so here it can not be NULL
*/
- xdp_prog = rcu_dereference(rx_ring->xdp_info->prog_rcu);
xdp_ring = rx_ring->xdp_ring;
while (likely(total_rx_packets < (unsigned int)budget)) {
@@ -638,7 +642,10 @@ int ice_clean_rx_irq_zc(struct ice_rx_ring *rx_ring, int budget)
xsk_buff_set_size(xdp, size);
xsk_buff_dma_sync_for_cpu(xdp, rx_ring->xsk_pool);
- xdp_res = ice_run_xdp_zc(rx_ring, xdp, xdp_prog, xdp_ring);
+ ice_xdp_handle_meta(xdp, &md, &xdp_info, rx_desc, rx_ring);
+
+ xdp_res = ice_run_xdp_zc(rx_ring, xdp, xdp_info.prog,
+ xdp_ring);
if (likely(xdp_res & (ICE_XDP_TX | ICE_XDP_REDIR))) {
xdp_xmit |= xdp_res;
} else if (xdp_res == ICE_XDP_EXIT) {
@@ -674,8 +681,8 @@ int ice_clean_rx_irq_zc(struct ice_rx_ring *rx_ring, int budget)
total_rx_bytes += skb->len;
total_rx_packets++;
- ice_xdp_build_meta(&md, rx_desc, rx_ring, 0);
- __xdp_populate_skb_meta_generic(skb, &md);
+ ice_xdp_meta_populate_skb(skb, &md, xdp->data, rx_desc,
+ rx_ring);
ice_receive_skb(rx_ring, skb);
}
--
2.36.1
* [xdp-hints] [PATCH RFC bpf-next 48/52] libbpf: compress Endianness ops with a macro
2022-06-28 19:47 [xdp-hints] [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata Alexander Lobakin
` (46 preceding siblings ...)
2022-06-28 19:48 ` [xdp-hints] [PATCH RFC bpf-next 47/52] net, ice: build XDP generic metadata Alexander Lobakin
@ 2022-06-28 19:48 ` Alexander Lobakin
2022-06-28 19:48 ` [xdp-hints] [PATCH RFC bpf-next 49/52] libbpf: add LE <--> CPU conversion helpers Alexander Lobakin
` (4 subsequent siblings)
52 siblings, 0 replies; 98+ messages in thread
From: Alexander Lobakin @ 2022-06-28 19:48 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
Cc: Alexander Lobakin, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Toke Hoiland-Jorgensen,
Lorenzo Bianconi, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jesse Brandeburg, John Fastabend, Yajun Deng,
Willem de Bruijn, bpf, netdev, linux-kernel, xdp-hints
All of the Endianness helpers for BPF programs follow the same
pattern and can be defined using a compression macro, which also
protects against typos and copy-paste mistakes.
Not to mention saving lines of code, of course.
Ah, if only we could define macros inside other macros.
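For example, bpf_htons() defined via the new __bpf_endop() macro
still expands to exactly the same code as the old open-coded
definition:

	bpf_htons(x)
	-> __bpf_endop(htons, x)
	-> (__builtin_constant_p(x) ?
	    __bpf_constant_htons(x) : __bpf_htons(x))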
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
---
tools/lib/bpf/bpf_endian.h | 26 +++++++++-----------------
1 file changed, 9 insertions(+), 17 deletions(-)
diff --git a/tools/lib/bpf/bpf_endian.h b/tools/lib/bpf/bpf_endian.h
index ec9db4feca9f..b03db6aa3f14 100644
--- a/tools/lib/bpf/bpf_endian.h
+++ b/tools/lib/bpf/bpf_endian.h
@@ -77,23 +77,15 @@
# error "Fix your compiler's __BYTE_ORDER__?!"
#endif
-#define bpf_htons(x) \
+#define __bpf_endop(op, x) \
(__builtin_constant_p(x) ? \
- __bpf_constant_htons(x) : __bpf_htons(x))
-#define bpf_ntohs(x) \
- (__builtin_constant_p(x) ? \
- __bpf_constant_ntohs(x) : __bpf_ntohs(x))
-#define bpf_htonl(x) \
- (__builtin_constant_p(x) ? \
- __bpf_constant_htonl(x) : __bpf_htonl(x))
-#define bpf_ntohl(x) \
- (__builtin_constant_p(x) ? \
- __bpf_constant_ntohl(x) : __bpf_ntohl(x))
-#define bpf_cpu_to_be64(x) \
- (__builtin_constant_p(x) ? \
- __bpf_constant_cpu_to_be64(x) : __bpf_cpu_to_be64(x))
-#define bpf_be64_to_cpu(x) \
- (__builtin_constant_p(x) ? \
- __bpf_constant_be64_to_cpu(x) : __bpf_be64_to_cpu(x))
+ __bpf_constant_##op(x) : __bpf_##op(x))
+
+#define bpf_htons(x) __bpf_endop(htons, x)
+#define bpf_ntohs(x) __bpf_endop(ntohs, x)
+#define bpf_htonl(x) __bpf_endop(htonl, x)
+#define bpf_ntohl(x) __bpf_endop(ntohl, x)
+#define bpf_cpu_to_be64(x) __bpf_endop(cpu_to_be64, x)
+#define bpf_be64_to_cpu(x) __bpf_endop(be64_to_cpu, x)
#endif /* __BPF_ENDIAN__ */
--
2.36.1
* [xdp-hints] [PATCH RFC bpf-next 49/52] libbpf: add LE <--> CPU conversion helpers
2022-06-28 19:47 [xdp-hints] [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata Alexander Lobakin
` (47 preceding siblings ...)
2022-06-28 19:48 ` [xdp-hints] [PATCH RFC bpf-next 48/52] libbpf: compress Endianness ops with a macro Alexander Lobakin
@ 2022-06-28 19:48 ` Alexander Lobakin
2022-06-28 19:48 ` [xdp-hints] [PATCH RFC bpf-next 50/52] libbpf: introduce a couple memory access helpers Alexander Lobakin
` (3 subsequent siblings)
52 siblings, 0 replies; 98+ messages in thread
From: Alexander Lobakin @ 2022-06-28 19:48 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
Cc: Alexander Lobakin, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Toke Hoiland-Jorgensen,
Lorenzo Bianconi, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jesse Brandeburg, John Fastabend, Yajun Deng,
Willem de Bruijn, bpf, netdev, linux-kernel, xdp-hints
From: Larysa Zaremba <larysa.zaremba@intel.com>
The XDP generic metadata structure has fields of explicit
Endianness, all 16, 32 and 64 bits wide.
To make it easier to access them, define __le{16,32,64} <--> CPU
helpers the same way it's done for the BEs.
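For example, the selftest added later in this series uses them to
verify the metadata magic in an Endianness-agnostic way:

	if (md->magic_id != bpf_cpu_to_le16(XDP_META_GENERIC_MAGIC))
		return XDP_DROP;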
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
---
tools/lib/bpf/bpf_endian.h | 30 ++++++++++++++++++++++++++++++
1 file changed, 30 insertions(+)
diff --git a/tools/lib/bpf/bpf_endian.h b/tools/lib/bpf/bpf_endian.h
index b03db6aa3f14..35941e6f1d99 100644
--- a/tools/lib/bpf/bpf_endian.h
+++ b/tools/lib/bpf/bpf_endian.h
@@ -60,6 +60,18 @@
# define __bpf_cpu_to_be64(x) __builtin_bswap64(x)
# define __bpf_constant_be64_to_cpu(x) ___bpf_swab64(x)
# define __bpf_constant_cpu_to_be64(x) ___bpf_swab64(x)
+# define __bpf_le16_to_cpu(x) (x)
+# define __bpf_cpu_to_le16(x) (x)
+# define __bpf_constant_le16_to_cpu(x) (x)
+# define __bpf_constant_cpu_to_le16(x) (x)
+# define __bpf_le32_to_cpu(x) (x)
+# define __bpf_cpu_to_le32(x) (x)
+# define __bpf_constant_le32_to_cpu(x) (x)
+# define __bpf_constant_cpu_to_le32(x) (x)
+# define __bpf_le64_to_cpu(x) (x)
+# define __bpf_cpu_to_le64(x) (x)
+# define __bpf_constant_le64_to_cpu(x) (x)
+# define __bpf_constant_cpu_to_le64(x) (x)
#elif __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
# define __bpf_ntohs(x) (x)
# define __bpf_htons(x) (x)
@@ -73,6 +85,18 @@
# define __bpf_cpu_to_be64(x) (x)
# define __bpf_constant_be64_to_cpu(x) (x)
# define __bpf_constant_cpu_to_be64(x) (x)
+# define __bpf_le16_to_cpu(x) __builtin_bswap16(x)
+# define __bpf_cpu_to_le16(x) __builtin_bswap16(x)
+# define __bpf_constant_le16_to_cpu(x) ___bpf_swab16(x)
+# define __bpf_constant_cpu_to_le16(x) ___bpf_swab16(x)
+# define __bpf_le32_to_cpu(x) __builtin_bswap32(x)
+# define __bpf_cpu_to_le32(x) __builtin_bswap32(x)
+# define __bpf_constant_le32_to_cpu(x) ___bpf_swab32(x)
+# define __bpf_constant_cpu_to_le32(x) ___bpf_swab32(x)
+# define __bpf_le64_to_cpu(x) __builtin_bswap64(x)
+# define __bpf_cpu_to_le64(x) __builtin_bswap64(x)
+# define __bpf_constant_le64_to_cpu(x) ___bpf_swab64(x)
+# define __bpf_constant_cpu_to_le64(x) ___bpf_swab64(x)
#else
# error "Fix your compiler's __BYTE_ORDER__?!"
#endif
@@ -87,5 +111,11 @@
#define bpf_ntohl(x) __bpf_endop(ntohl, x)
#define bpf_cpu_to_be64(x) __bpf_endop(cpu_to_be64, x)
#define bpf_be64_to_cpu(x) __bpf_endop(be64_to_cpu, x)
+#define bpf_cpu_to_le16(x) __bpf_endop(cpu_to_le16, x)
+#define bpf_le16_to_cpu(x) __bpf_endop(le16_to_cpu, x)
+#define bpf_cpu_to_le32(x) __bpf_endop(cpu_to_le32, x)
+#define bpf_le32_to_cpu(x) __bpf_endop(le32_to_cpu, x)
+#define bpf_cpu_to_le64(x) __bpf_endop(cpu_to_le64, x)
+#define bpf_le64_to_cpu(x) __bpf_endop(le64_to_cpu, x)
#endif /* __BPF_ENDIAN__ */
--
2.36.1
* [xdp-hints] [PATCH RFC bpf-next 50/52] libbpf: introduce a couple memory access helpers
2022-06-28 19:47 [xdp-hints] [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata Alexander Lobakin
` (48 preceding siblings ...)
2022-06-28 19:48 ` [xdp-hints] [PATCH RFC bpf-next 49/52] libbpf: add LE <--> CPU conversion helpers Alexander Lobakin
@ 2022-06-28 19:48 ` Alexander Lobakin
2022-06-28 19:48 ` [xdp-hints] [PATCH RFC bpf-next 51/52] selftests/bpf: fix using test_xdp_meta BPF prog via skeleton infra Alexander Lobakin
` (2 subsequent siblings)
52 siblings, 0 replies; 98+ messages in thread
From: Alexander Lobakin @ 2022-06-28 19:48 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
Cc: Alexander Lobakin, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Toke Hoiland-Jorgensen,
Lorenzo Bianconi, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jesse Brandeburg, John Fastabend, Yajun Deng,
Willem de Bruijn, bpf, netdev, linux-kernel, xdp-hints
From: Larysa Zaremba <larysa.zaremba@intel.com>
In BPF programs, it is common to declare that we're going to do
a memory access via a snippet such as:
if (data + ETH_HLEN > data_end)
// bail out
Offsets can be variable:
if (VLAN_HLEN * vlan_count > SOME_ARBITRARY_MAX_OFFSET ||
ctx->data + VLAN_HLEN * vlan_count > data_end)
//
Or even calculated from the end:
if (ctx->data_end - ctx->data - ETH_FCS_LEN > SOME_ARB_MAX_OFF ||
ctx->data_end - ETH_FCS_LEN < ctx->data)
//
On top of that, LLVM sometimes has a hard time compiling sane C
code in a way that would pass the in-kernel verifier.
Add two new functions to sanitize memory accesses and get pointers
to the requested ranges: one taking an offset from the start and one
from the end (useful for metadata and different integrity check
headers). They are written in Asm, so the offset can be variable and
the code will pass the verifier. There are checks for the maximum
offset (backed by the verifier's own limit), for going out of
bounds, etc., so the pointer they return is ready to use (if it's
non-%NULL).
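For reference, the checks the Asm performs are roughly equivalent
to the following plain C (shown for illustration only; written like
this with a variable @off, the verifier might reject it, which is
the whole point of the Asm version):

	if (off > MAX_PACKET_OFF - len)
		return NULL;
	if (mem + off + len > mem_end)
		return NULL;
	return (void *)(unsigned long)(mem + off);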
So now all that is needed is:
iphdr = bpf_access_mem(ctx->data, ctx->data_end, ETH_HLEN,
sizeof(*iphdr));
if (!iphdr)
// bail out
or
some_meta_struct = bpf_access_mem_end(ctx->data_meta, ctx->data,
sizeof(*some_meta_struct),
sizeof(*some_meta_struct));
if (!some_meta_struct)
//
The Asm code was happily stolen from the Cilium project repo[0] and
then reworked.
[0] https://github.com/cilium/cilium/blob/master/bpf/include/bpf/ctx/xdp.h#L43
Suggested-by: Daniel Borkmann <daniel@iogearbox.net> # original helper
Suggested-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Co-developed-by: Alexander Lobakin <alexandr.lobakin@intel.com>
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
---
tools/lib/bpf/bpf_helpers.h | 64 +++++++++++++++++++++++++++++++++++++
1 file changed, 64 insertions(+)
diff --git a/tools/lib/bpf/bpf_helpers.h b/tools/lib/bpf/bpf_helpers.h
index fb04eaf367f1..cd16e3c9cd85 100644
--- a/tools/lib/bpf/bpf_helpers.h
+++ b/tools/lib/bpf/bpf_helpers.h
@@ -285,4 +285,68 @@ enum libbpf_tristate {
/* Helper macro to print out debug messages */
#define bpf_printk(fmt, args...) ___bpf_pick_printk(args)(fmt, ##args)
+/* Max offset as per kernel verifier */
+#define MAX_PACKET_OFF 0xffff
+
+/**
+ * bpf_access_mem - sanitize memory access to a range
+ * @mem: start of the memory segment
+ * @mem_end: end of the memory segment
+ * @off: offset from the start of the memory segment
+ * @len: length of the range to give access to
+ *
+ * Verifies that the memory operations we want to perform are sane and within
+ * bounds and gives pointer to the requested range. The checks are done in Asm,
+ * so that it is safe to pass variable offset (verifier might reject such code
+ * written in plain C).
+ * The intended way of using it is as follows:
+ *
+ * iphdr = bpf_access_mem(ctx->data, ctx->data_end, ETH_HLEN, sizeof(*iphdr));
+ *
+ * Returns pointer to the beginning of the range or %NULL.
+ */
+static __always_inline void *
+bpf_access_mem(__u64 mem, __u64 mem_end, __u64 off, const __u64 len)
+{
+ void *ret;
+
+ asm volatile("r1 = %[start]\n\t"
+ "r2 = %[end]\n\t"
+ "r3 = %[offmax] - %[len]\n\t"
+ "if %[off] > r3 goto +5\n\t"
+ "r1 += %[off]\n\t"
+ "%[ret] = r1\n\t"
+ "r1 += %[len]\n\t"
+ "if r1 > r2 goto +1\n\t"
+ "goto +1\n\t"
+ "%[ret] = %[null]\n\t"
+ : [ret]"=r"(ret)
+ : [start]"r"(mem), [end]"r"(mem_end), [off]"r"(off),
+ [len]"ri"(len), [offmax]"i"(MAX_PACKET_OFF),
+ [null]"i"(NULL)
+ : "r1", "r2", "r3");
+
+ return ret;
+}
+
+/**
+ * bpf_access_mem_end - sanitize memory access to a range at the end of segment
+ * @mem: start of the memory segment
+ * @mem_end: end of the memory segment
+ * @offend: offset from the end of the memory segment
+ * @len: length of the range to give access to
+ *
+ * Version of bpf_access_mem() which performs all needed calculations to
+ * access a memory segment from the end. E.g., to access FCS (if provided):
+ *
+ * cp = bpf_access_mem_end(ctx->data, ctx->data_end, ETH_FCS_LEN, ETH_FCS_LEN);
+ *
+ * Returns pointer to the beginning of the range or %NULL.
+ */
+static __always_inline void *
+bpf_access_mem_end(__u64 mem, __u64 mem_end, __u64 offend, const __u64 len)
+{
+ return bpf_access_mem(mem, mem_end, mem_end - mem - offend, len);
+}
+
#endif
--
2.36.1
* [xdp-hints] [PATCH RFC bpf-next 51/52] selftests/bpf: fix using test_xdp_meta BPF prog via skeleton infra
2022-06-28 19:47 [xdp-hints] [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata Alexander Lobakin
` (49 preceding siblings ...)
2022-06-28 19:48 ` [xdp-hints] [PATCH RFC bpf-next 50/52] libbpf: introduce a couple memory access helpers Alexander Lobakin
@ 2022-06-28 19:48 ` Alexander Lobakin
2022-06-28 19:48 ` [xdp-hints] [PATCH RFC bpf-next 52/52] selftests/bpf: add XDP Generic Hints selftest Alexander Lobakin
2022-06-29 6:15 ` [xdp-hints] Re: [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata John Fastabend
52 siblings, 0 replies; 98+ messages in thread
From: Alexander Lobakin @ 2022-06-28 19:48 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
Cc: Alexander Lobakin, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Toke Hoiland-Jorgensen,
Lorenzo Bianconi, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jesse Brandeburg, John Fastabend, Yajun Deng,
Willem de Bruijn, bpf, netdev, linux-kernel, xdp-hints
progs/test_xdp_meta works fine when loaded via iproute2, but the
skeleton infra can't load it, saying that the types of the BPF
programs present in the binary are not set.
This is because the convention is to place XDP progs in a section
named 'xdp' and TC BPF progs in a section named 'tc', so do it here
as well.
Fixes: 22c8852624fc ("bpf: improve selftests and add tests for meta pointer")
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
---
tools/testing/selftests/bpf/progs/test_xdp_meta.c | 4 ++--
tools/testing/selftests/bpf/test_xdp_meta.sh | 8 ++++----
2 files changed, 6 insertions(+), 6 deletions(-)
diff --git a/tools/testing/selftests/bpf/progs/test_xdp_meta.c b/tools/testing/selftests/bpf/progs/test_xdp_meta.c
index a7c4a7d49fe6..fe2d71ae0e71 100644
--- a/tools/testing/selftests/bpf/progs/test_xdp_meta.c
+++ b/tools/testing/selftests/bpf/progs/test_xdp_meta.c
@@ -8,7 +8,7 @@
#define round_up(x, y) ((((x) - 1) | __round_mask(x, y)) + 1)
#define ctx_ptr(ctx, mem) (void *)(unsigned long)ctx->mem
-SEC("t")
+SEC("tc")
int ing_cls(struct __sk_buff *ctx)
{
__u8 *data, *data_meta, *data_end;
@@ -28,7 +28,7 @@ int ing_cls(struct __sk_buff *ctx)
return diff ? TC_ACT_SHOT : TC_ACT_OK;
}
-SEC("x")
+SEC("xdp")
int ing_xdp(struct xdp_md *ctx)
{
__u8 *data, *data_meta, *data_end;
diff --git a/tools/testing/selftests/bpf/test_xdp_meta.sh b/tools/testing/selftests/bpf/test_xdp_meta.sh
index ea69370caae3..7232714e89b3 100755
--- a/tools/testing/selftests/bpf/test_xdp_meta.sh
+++ b/tools/testing/selftests/bpf/test_xdp_meta.sh
@@ -42,11 +42,11 @@ ip netns exec ${NS2} ip addr add 10.1.1.22/24 dev veth2
ip netns exec ${NS1} tc qdisc add dev veth1 clsact
ip netns exec ${NS2} tc qdisc add dev veth2 clsact
-ip netns exec ${NS1} tc filter add dev veth1 ingress bpf da obj test_xdp_meta.o sec t
-ip netns exec ${NS2} tc filter add dev veth2 ingress bpf da obj test_xdp_meta.o sec t
+ip netns exec ${NS1} tc filter add dev veth1 ingress bpf da obj test_xdp_meta.o sec tc
+ip netns exec ${NS2} tc filter add dev veth2 ingress bpf da obj test_xdp_meta.o sec tc
-ip netns exec ${NS1} ip link set dev veth1 xdp obj test_xdp_meta.o sec x
-ip netns exec ${NS2} ip link set dev veth2 xdp obj test_xdp_meta.o sec x
+ip netns exec ${NS1} ip link set dev veth1 xdp obj test_xdp_meta.o sec xdp
+ip netns exec ${NS2} ip link set dev veth2 xdp obj test_xdp_meta.o sec xdp
ip netns exec ${NS1} ip link set dev veth1 up
ip netns exec ${NS2} ip link set dev veth2 up
--
2.36.1
* [xdp-hints] [PATCH RFC bpf-next 52/52] selftests/bpf: add XDP Generic Hints selftest
2022-06-28 19:47 [xdp-hints] [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata Alexander Lobakin
` (50 preceding siblings ...)
2022-06-28 19:48 ` [xdp-hints] [PATCH RFC bpf-next 51/52] selftests/bpf: fix using test_xdp_meta BPF prog via skeleton infra Alexander Lobakin
@ 2022-06-28 19:48 ` Alexander Lobakin
2022-06-29 6:15 ` [xdp-hints] Re: [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata John Fastabend
52 siblings, 0 replies; 98+ messages in thread
From: Alexander Lobakin @ 2022-06-28 19:48 UTC (permalink / raw)
To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
Cc: Alexander Lobakin, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Toke Hoiland-Jorgensen,
Lorenzo Bianconi, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jesse Brandeburg, John Fastabend, Yajun Deng,
Willem de Bruijn, bpf, netdev, linux-kernel, xdp-hints
Add a new BPF selftest which checks whether XDP Generic metadata
works correctly on the generic/skb XDP path. That path is always
available on any interface, so the test must always succeed.
It uses a special BPF program which works as follows:
* tries to access metadata memory via bpf_access_mem_end();
* checks the frame size. For sizes < 128 bytes, drops packets with
metadata present, so that we can check that setting the
threshold works;
* for sizes >= 128 bytes, drops packets with no meta. Otherwise,
checks that the meta has the correct magic and that the BTF ID
matches the one written by the verifier;
* finally, passes packets with fully correct generic meta up the
stack.
And the test itself does the following:
1) attaches that XDP prog to veth interfaces with a threshold of
1, i.e. enables metadata generation for every packet;
2) ensures that the prog drops frames smaller than 128 bytes as
intended (see above);
3) raises the threshold to 128 bytes (testing that the parameters
can be updated without replacing the prog);
4) ensures that now no drops occur and that meta for frames >= 128
bytes is valid.
As it involves multiple invocations of the userspace program, it
pins the BPF link to make it freerunning. Since `ip netns exec`
creates a new mount namespace (including sysfs) on each execution,
the script now creates a temporary persistent BPF FS mountpoint in
the tests directory, so that pinned progs/links stay accessible
across the launches.
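For reference, the per-interface command sequence the script
performs boils down to (each line is run under `ip netns exec` for
the corresponding peer):

	./test_xdp_meta attach -d veth1 -f ${FS} -m skb -M	# thresh 1
	./test_xdp_meta update -d veth1 -f ${FS} -m skb -M 128
	./test_xdp_meta detach -d veth1 -f ${FS} -m skb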
Co-developed-by: Larysa Zaremba <larysa.zaremba@intel.com>
Signed-off-by: Larysa Zaremba <larysa.zaremba@intel.com>
Signed-off-by: Alexander Lobakin <alexandr.lobakin@intel.com>
---
tools/testing/selftests/bpf/.gitignore | 1 +
tools/testing/selftests/bpf/Makefile | 4 +-
.../selftests/bpf/progs/test_xdp_meta.c | 36 +++
tools/testing/selftests/bpf/test_xdp_meta.c | 294 ++++++++++++++++++
tools/testing/selftests/bpf/test_xdp_meta.sh | 51 +++
5 files changed, 385 insertions(+), 1 deletion(-)
create mode 100644 tools/testing/selftests/bpf/test_xdp_meta.c
diff --git a/tools/testing/selftests/bpf/.gitignore b/tools/testing/selftests/bpf/.gitignore
index ca2f47f45670..7d4de9d9002c 100644
--- a/tools/testing/selftests/bpf/.gitignore
+++ b/tools/testing/selftests/bpf/.gitignore
@@ -44,3 +44,4 @@ test_cpp
xdpxceiver
xdp_redirect_multi
xdp_synproxy
+/test_xdp_meta
diff --git a/tools/testing/selftests/bpf/Makefile b/tools/testing/selftests/bpf/Makefile
index 4fbd88a8ed9e..aca8867deb8c 100644
--- a/tools/testing/selftests/bpf/Makefile
+++ b/tools/testing/selftests/bpf/Makefile
@@ -82,7 +82,7 @@ TEST_PROGS_EXTENDED := with_addr.sh \
TEST_GEN_PROGS_EXTENDED = test_sock_addr test_skb_cgroup_id_user \
flow_dissector_load test_flow_dissector test_tcp_check_syncookie_user \
test_lirc_mode2_user xdping test_cpp runqslower bench bpf_testmod.ko \
- xdpxceiver xdp_redirect_multi xdp_synproxy
+ xdpxceiver xdp_redirect_multi xdp_synproxy test_xdp_meta
TEST_CUSTOM_PROGS = $(OUTPUT)/urandom_read
@@ -589,6 +589,8 @@ $(OUTPUT)/bench: $(OUTPUT)/bench.o \
$(call msg,BINARY,,$@)
$(Q)$(CC) $(CFLAGS) $(LDFLAGS) $(filter %.a %.o,$^) $(LDLIBS) -o $@
+$(OUTPUT)/test_xdp_meta: | $(OUTPUT)/test_xdp_meta.skel.h
+
EXTRA_CLEAN := $(TEST_CUSTOM_PROGS) $(SCRATCH_DIR) $(HOST_SCRATCH_DIR) \
prog_tests/tests.h map_tests/tests.h verifier/tests.h \
feature bpftool \
diff --git a/tools/testing/selftests/bpf/progs/test_xdp_meta.c b/tools/testing/selftests/bpf/progs/test_xdp_meta.c
index fe2d71ae0e71..0b05d1c3979b 100644
--- a/tools/testing/selftests/bpf/progs/test_xdp_meta.c
+++ b/tools/testing/selftests/bpf/progs/test_xdp_meta.c
@@ -2,6 +2,8 @@
#include <linux/if_ether.h>
#include <linux/pkt_cls.h>
+#include <bpf/bpf_core_read.h>
+#include <bpf/bpf_endian.h>
#include <bpf/bpf_helpers.h>
#define __round_mask(x, y) ((__typeof__(x))((y) - 1))
@@ -50,4 +52,38 @@ int ing_xdp(struct xdp_md *ctx)
return XDP_PASS;
}
+#define TEST_META_THRESH 128
+
+SEC("xdp")
+int ing_hints(struct xdp_md *ctx)
+{
+ const struct xdp_meta_generic *md;
+ __le64 genid;
+
+ md = bpf_access_mem_end(ctx->data_meta, ctx->data, sizeof(*md),
+ sizeof(*md));
+
+ /* Selftest enables metadata starting from 128 byte frame size, fail it
+ * if we receive a shorter frame with metadata
+ */
+ if (ctx->data_end - ctx->data < TEST_META_THRESH)
+ return md ? XDP_DROP : XDP_PASS;
+
+ if (!md)
+ return XDP_DROP;
+
+ if (md->magic_id != bpf_cpu_to_le16(XDP_META_GENERIC_MAGIC))
+ return XDP_DROP;
+
+ genid = bpf_cpu_to_le64(bpf_core_type_id_kernel(typeof(*md)));
+ if (md->full_id != genid)
+ return XDP_DROP;
+
+ /* Tx flags must be zeroed */
+ if (md->tx_flags)
+ return XDP_DROP;
+
+ return XDP_PASS;
+}
+
char _license[] SEC("license") = "GPL";
diff --git a/tools/testing/selftests/bpf/test_xdp_meta.c b/tools/testing/selftests/bpf/test_xdp_meta.c
new file mode 100644
index 000000000000..e5c147d19190
--- /dev/null
+++ b/tools/testing/selftests/bpf/test_xdp_meta.c
@@ -0,0 +1,294 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright (C) 2022, Intel Corporation. */
+
+#define _GNU_SOURCE /* asprintf() */
+
+#include <bpf/bpf.h>
+#include <getopt.h>
+#include <net/if.h>
+#include <uapi/linux/if_link.h>
+
+#include "test_xdp_meta.skel.h"
+
+struct test_meta_op_opts {
+ struct test_xdp_meta *skel;
+ const char *cmd;
+ char *path;
+ __u32 ifindex;
+ __u32 flags;
+ __u64 btf_id;
+ __u32 meta_thresh;
+};
+
+struct test_meta_opt_desc {
+ const char *arg;
+ const char *help;
+};
+
+#define OPT(n, a, s) { \
+ .name = #n, \
+ .has_arg = (a), \
+ .val = #s[0], \
+}
+
+#define DESC(a, h) { \
+ .arg = (a), \
+ .help = (h), \
+}
+
+static const struct option test_meta_opts[] = {
+ OPT(dev, required_argument, d),
+ OPT(fs, required_argument, f),
+ OPT(help, no_argument, h),
+ OPT(meta-thresh, optional_argument, M),
+ OPT(mode, required_argument, m),
+ { /* Sentinel */ },
+};
+
+static const struct test_meta_opt_desc test_meta_descs[] = {
+ DESC("= < IFNAME | IFINDEX >", "target interface name or index"),
+ DESC("= < MOUNTPOINT >", "BPF FS mountpoint"),
+ DESC(NULL, "display this text and exit"),
+ DESC("= [ THRESH ]", "enable Generic metadata generation (frame size)"),
+ DESC("= < skb | drv | hw >", "force particular XDP mode"),
+};
+
+static void test_meta_usage(char *argv[], bool err)
+{
+ FILE *out = err ? stderr : stdout;
+ __u32 i = 0;
+
+ fprintf(out,
+ "Usage:\n\t%s COMMAND < -d | --dev= > < IFNAME | IFINDEX > [ OPTIONS ]\n\n",
+ argv[0]);
+ fprintf(out, "OPTIONS:\n");
+
+ for (const struct option *opt = test_meta_opts; opt->name; opt++) {
+ fprintf(out, "\t-%c, --%s", opt->val, opt->name);
+ fprintf(out, "%s\t", test_meta_descs[i].arg ? : "\t\t");
+ fprintf(out, "%s\n", test_meta_descs[i++].help);
+ }
+}
+
+static int test_meta_link_attach(const struct test_meta_op_opts *opts)
+{
+ LIBBPF_OPTS(bpf_xdp_attach_opts, la_opts,
+ .flags = opts->flags,
+ .btf_id = opts->btf_id,
+ .meta_thresh = opts->meta_thresh);
+ struct bpf_link *link;
+ int ret;
+
+ link = bpf_program__attach_xdp_opts(opts->skel->progs.ing_hints,
+ opts->ifindex, &la_opts);
+ ret = libbpf_get_error(link);
+ if (ret) {
+ fprintf(stderr, "Failed to attach XDP program: %s (%d)\n",
+ strerror(-ret), ret);
+ return ret;
+ }
+
+ opts->skel->links.ing_hints = link;
+
+ ret = bpf_link__pin(link, opts->path);
+ if (ret)
+ fprintf(stderr, "Failed to pin XDP link at %s: %s (%d)\n",
+ opts->path, strerror(-ret), ret);
+
+ bpf_link__disconnect(link);
+
+ return ret;
+}
+
+static int test_meta_link_update(const struct test_meta_op_opts *opts)
+{
+ LIBBPF_OPTS(bpf_link_update_opts, lu_opts,
+ .xdp.new_btf_id = opts->btf_id,
+ .xdp.new_meta_thresh = opts->meta_thresh);
+ struct bpf_link *link;
+ int ret;
+
+ link = bpf_link__open(opts->path);
+ ret = libbpf_get_error(link);
+ if (ret) {
+ fprintf(stderr, "Failed to open XDP link at %s: %s (%d)\n",
+ opts->path, strerror(-ret), ret);
+ return ret;
+ }
+
+ opts->skel->links.ing_hints = link;
+
+ ret = bpf_link_update(bpf_link__fd(link),
+ bpf_program__fd(opts->skel->progs.ing_hints),
+ &lu_opts);
+ if (ret)
+ fprintf(stderr, "Failed to update XDP link: %s (%d)\n",
+ strerror(-ret), ret);
+
+ return ret;
+}
+
+static int test_meta_link_detach(const struct test_meta_op_opts *opts)
+{
+ struct bpf_link *link;
+ int ret;
+
+ link = bpf_link__open(opts->path);
+ ret = libbpf_get_error(link);
+ if (ret) {
+ fprintf(stderr, "Failed to open XDP link at %s: %s (%d)\n",
+ opts->path, strerror(-ret), ret);
+ return ret;
+ }
+
+ opts->skel->links.ing_hints = link;
+
+ ret = bpf_link__unpin(link);
+ if (ret) {
+ fprintf(stderr, "Failed to unpin XDP link: %s (%d)\n",
+ strerror(-ret), ret);
+ return ret;
+ }
+
+ ret = bpf_link__detach(link);
+ if (ret)
+ fprintf(stderr, "Failed to detach XDP link: %s (%d)\n",
+ strerror(-ret), ret);
+
+ return ret;
+}
+
+static int test_meta_parse_args(struct test_meta_op_opts *opts, int argc,
+ char *argv[])
+{
+ int opt, longidx, ret;
+
+ while (1) {
+ opt = getopt_long(argc, argv, "d:f:hM::m:", test_meta_opts,
+ &longidx);
+ if (opt < 0)
+ break;
+
+ switch (opt) {
+ case 'd':
+ opts->ifindex = if_nametoindex(optarg);
+ if (!opts->ifindex)
+ opts->ifindex = strtoul(optarg, NULL, 0);
+
+ break;
+ case 'f':
+ opts->path = optarg;
+ break;
+ case 'h':
+ test_meta_usage(argv, false);
+ return 0;
+ case 'M':
+ ret = libbpf_get_type_btf_id("struct xdp_meta_generic",
+ &opts->btf_id);
+ if (ret) {
+ fprintf(stderr,
+ "Failed to get BTF ID: %s (%d)\n",
+ strerror(-ret), ret);
+ return ret;
+ }
+
+ /* Allow both `-M64` and `-M 64` */
+ if (!optarg && optind < argc && argv[optind] &&
+ *argv[optind] >= '0' && *argv[optind] <= '9')
+ optarg = argv[optind];
+
+ opts->meta_thresh = strtoul(optarg ? : "1", NULL, 0);
+ break;
+ case 'm':
+ if (!strcmp(optarg, "skb"))
+ opts->flags = XDP_FLAGS_SKB_MODE;
+ else if (!strcmp(optarg, "drv"))
+ opts->flags = XDP_FLAGS_DRV_MODE;
+ else if (!strcmp(optarg, "hw"))
+ opts->flags = XDP_FLAGS_HW_MODE;
+
+ if (opts->flags)
+ break;
+
+ /* fallthrough */
+ default:
+ test_meta_usage(argv, true);
+ return -EINVAL;
+ }
+ }
+
+ if (optind >= argc || !argv[optind]) {
+ fprintf(stderr, "Command is required\n");
+ test_meta_usage(argv, true);
+
+ return -EINVAL;
+ }
+
+ opts->cmd = argv[optind];
+
+ return 0;
+}
+
+int main(int argc, char *argv[])
+{
+ struct test_meta_op_opts opts = { };
+ int ret;
+
+ libbpf_set_strict_mode(LIBBPF_STRICT_ALL);
+
+ if (argc < 3) {
+ test_meta_usage(argv, true);
+ return -EINVAL;
+ }
+
+ ret = test_meta_parse_args(&opts, argc, argv);
+ if (ret)
+ return ret;
+
+ if (!opts.ifindex) {
+ fprintf(stderr, "Invalid or missing device argument\n");
+ test_meta_usage(argv, true);
+
+ return -EINVAL;
+ }
+
+ opts.skel = test_xdp_meta__open_and_load();
+ ret = libbpf_get_error(opts.skel);
+ if (ret) {
+ fprintf(stderr, "Failed to load test_xdp_meta skeleton: %s (%d)\n",
+ strerror(-ret), ret);
+ return ret;
+ }
+
+ ret = asprintf(&opts.path, "%s/xdp/%s-%u", opts.path ? : "/sys/fs/bpf",
+ opts.skel->skeleton->name, opts.ifindex);
+ ret = ret < 0 ? -errno : 0;
+ if (ret) {
+ fprintf(stderr, "Failed to allocate path string: %s (%d)\n",
+ strerror(-ret), ret);
+ goto meta_destroy;
+ }
+
+ if (!strcmp(opts.cmd, "attach")) {
+ ret = test_meta_link_attach(&opts);
+ } else if (!strcmp(opts.cmd, "update")) {
+ ret = test_meta_link_update(&opts);
+ } else if (!strcmp(opts.cmd, "detach")) {
+ ret = test_meta_link_detach(&opts);
+ } else {
+ fprintf(stderr, "Invalid command '%s'\n", opts.cmd);
+ test_meta_usage(argv, true);
+
+ ret = -EINVAL;
+ }
+
+ if (ret)
+ fprintf(stderr, "Failed to execute command '%s': %s (%d)\n",
+ opts.cmd, strerror(-ret), ret);
+
+ free(opts.path);
+meta_destroy:
+ test_xdp_meta__destroy(opts.skel);
+
+ return ret;
+}
diff --git a/tools/testing/selftests/bpf/test_xdp_meta.sh b/tools/testing/selftests/bpf/test_xdp_meta.sh
index 7232714e89b3..79c2ccb68dda 100755
--- a/tools/testing/selftests/bpf/test_xdp_meta.sh
+++ b/tools/testing/selftests/bpf/test_xdp_meta.sh
@@ -5,6 +5,11 @@ readonly KSFT_SKIP=4
readonly NS1="ns1-$(mktemp -u XXXXXX)"
readonly NS2="ns2-$(mktemp -u XXXXXX)"
+# We need a persistent BPF FS mountpoint. `ip netns exec` prepares a different
+# temporary one on each invocation
+readonly FS="$(mktemp -d XXXXXX)"
+mount -t bpf bpffs ${FS}
+
cleanup()
{
if [ "$?" = "0" ]; then
@@ -14,9 +19,16 @@ cleanup()
fi
set +e
+
+ ip netns exec ${NS1} ./test_xdp_meta detach -d veth1 -f ${FS} -m skb 2> /dev/null
+ ip netns exec ${NS2} ./test_xdp_meta detach -d veth2 -f ${FS} -m skb 2> /dev/null
+
ip link del veth1 2> /dev/null
ip netns del ${NS1} 2> /dev/null
ip netns del ${NS2} 2> /dev/null
+
+ umount ${FS}
+ rm -fr ${FS}
}
ip link set dev lo xdp off 2>/dev/null > /dev/null
@@ -54,4 +66,43 @@ ip netns exec ${NS2} ip link set dev veth2 up
ip netns exec ${NS1} ping -c 1 10.1.1.22
ip netns exec ${NS2} ping -c 1 10.1.1.11
+#
+# Generic metadata part
+#
+
+# Cleanup
+ip netns exec ${NS1} ip link set dev veth1 xdp off
+ip netns exec ${NS2} ip link set dev veth2 xdp off
+
+ip netns exec ${NS1} tc filter del dev veth1 ingress
+ip netns exec ${NS2} tc filter del dev veth2 ingress
+
+# Enable metadata generation for every frame
+ip netns exec ${NS1} ./test_xdp_meta attach -d veth1 -f ${FS} -m skb -M
+ip netns exec ${NS2} ./test_xdp_meta attach -d veth2 -f ${FS} -m skb -M
+
+# Those two must fail: XDP prog drops packets < 128 bytes with metadata
+set +e
+
+ip netns exec ${NS1} ping -c 1 10.1.1.22 -W 0.2
+if [ "$?" = "0" ]; then
+ exit 1
+fi
+ip netns exec ${NS2} ping -c 1 10.1.1.11 -W 0.2
+if [ "$?" = "0" ]; then
+ exit 1
+fi
+
+set -e
+
+# Enable metadata only for frames >= 128 bytes
+ip netns exec ${NS1} ./test_xdp_meta update -d veth1 -f ${FS} -m skb -M 128
+ip netns exec ${NS2} ./test_xdp_meta update -d veth2 -f ${FS} -m skb -M 128
+
+# Must succeed
+ip netns exec ${NS1} ping -c 1 10.1.1.22
+ip netns exec ${NS2} ping -c 1 10.1.1.11
+ip netns exec ${NS1} ping -c 1 10.1.1.22 -s 128
+ip netns exec ${NS2} ping -c 1 10.1.1.11 -s 128
+
exit 0
--
2.36.1
* [xdp-hints] Re: [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata
2022-06-28 19:47 [xdp-hints] [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata Alexander Lobakin
` (51 preceding siblings ...)
2022-06-28 19:48 ` [xdp-hints] [PATCH RFC bpf-next 52/52] selftests/bpf: add XDP Generic Hints selftest Alexander Lobakin
@ 2022-06-29 6:15 ` John Fastabend
2022-06-29 13:43 ` Toke Høiland-Jørgensen
` (2 more replies)
52 siblings, 3 replies; 98+ messages in thread
From: John Fastabend @ 2022-06-29 6:15 UTC (permalink / raw)
To: Alexander Lobakin, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
Cc: Alexander Lobakin, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Toke Hoiland-Jorgensen,
Lorenzo Bianconi, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jesse Brandeburg, John Fastabend, Yajun Deng,
Willem de Bruijn, bpf, netdev, linux-kernel, xdp-hints
Alexander Lobakin wrote:
> This RFC is to give the whole picture. It will most likely be split
> onto several series, maybe even merge cycles. See the "table of
> contents" below.
Even for an RFC it's a bit much. Probably improve the summary
message here as well; I'm still not clear on the overall
architecture, so I'm not sure I want to dig into the patches.
>
> The series adds ability to pass different frame
> details/parameters/parameters used by most of NICs and the kernel
> stack (in skbs), not essential, but highly wanted, such as:
>
> * checksum value, status (Rx) or command (Tx);
> * hash value and type/level (Rx);
> * queue number (Rx);
> * timestamps;
> * and so on.
>
> As XDP structures used to represent frames are as small as possible
> and must stay like that, it is done by using the already existing
> concept of metadata, i.e. some space right before a frame where BPF
> programs can put arbitrary data.
OK so you stick attributes in the metadata. You can do this without
touching anything but your driver today. Why not push a patch to
ice to start doing this? People could start using it today and put
it in some feature flag.
I get everyone wants some grand theory around this, but again, one
patch would do it and your customers could start using it. Show
a benchmark with a 20% speedup or whatever with a small XDP prog
update and you win.
>
> Now, a NIC driver, or even a SmartNIC itself, can put those params
> there in a well-defined format. The format is fixed, but can be of
> several different types represented by structures, which definitions
> are available to the kernel, BPF programs and the userland.
I don't think in general the format needs to be fixed.
> It is fixed due to it being almost a UAPI, and the exact format can
> be determined by reading the last 10 bytes of metadata. They contain
> a 2-byte magic ID to not confuse it with a non-compatible meta and
> a 8-byte combined BTF ID + type ID: the ID of the BTF where this
> structure is defined and the ID of that definition inside that BTF.
> Users can obtain BTF IDs by structure types using helpers available
> in the kernel, BPF (written by the CO-RE/verifier) and the userland
> (libbpf -> kernel call) and then rely on those ID when reading data
> to make sure whether they support it and what to do with it.
> Why separate magic and ID? The idea is to make different formats
> always contain the basic/"generic" structure embedded at the end.
> This way we can still benefit in purely generic consumers (like
> cpumap) while providing some "extra" data to those who support it.
I don't follow this. If you have a struct in your driver, name it
something obvious, e.g. ice_xdp_metadata. If I understand things
correctly, just dump the BTF for the driver, extract the
struct, and done: you can use CO-RE reads. For the 'fixed' case
this looks easy. And I don't think you even need a patch for this.
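Something like this on the BPF side is what I mean (a rough
sketch, assuming the struct ends up in the ice BTF so CO-RE can
relocate the field offsets; the struct layout here is made up):

	/* hypothetical driver-defined layout */
	struct ice_xdp_metadata {
		__u32 rx_hash;
		/* ... */
	} __attribute__((preserve_access_index));

	__u32 hash = 0;
	const struct ice_xdp_metadata *md =
		(void *)(long)ctx->data_meta;

	if ((void *)(md + 1) <= (void *)(long)ctx->data)
		hash = md->rx_hash;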
>
> The enablement of this feature is controlled when attaching/replacing
> an XDP program on an interface, with two new parameters: that combined
> BTF+type ID and the metadata threshold.
> The threshold specifies the minimum frame size from which a driver (or
> NIC) should start composing metadata. It is introduced instead of just
> a true/false flag because it's often not worth spending cycles to
> fetch all that data for such small frames: let's say, it can even be
> faster to just calculate checksums for them on the CPU rather than
> touch the non-coherent DMA zone. The simple XDP_DROP case loses
> 15 Mpps on 64-byte frames with metadata enabled; the threshold can
> help mitigate that.
I would put this in the bonus category. Can you do the simple thing
above without these extra bits and then add them later. Just
pick some overly conservative threshold to start with.
>
> The RFC can be divided into 8 parts:
I'm missing something: why not do the simplest bit of work and
get this running in ice with a few smallish driver updates,
so we can all see it? No need for so many patches.
>
> 01-04: BTF ID hacking: here Larysa provides BPF programs with not
> only type ID, but the ID of the BTF as well by using the
> unused upper 32 bits.
> 05-10: this provides in-kernel mechanisms for taking the ID and
> threshold from the userspace and passing them to the drivers.
> 11-18: provides libbpf API to be able to specify those params from
> the userspace, plus some small selftest to verify that both
> the kernel and the userspace parts work.
> 19-29: here the actual structure is defined, then the in-kernel
> helpers, and finally here comes the first consumer: the function
> used to convert &xdp_frame to &sk_buff will now try to
> parse metadata. The affected users are cpumap and veth.
> 30-36: here I try to benefit from the metadata in cpumap even more
> by switching it to GRO. Now that we have checksums from NIC
> available... but even with no meta it gives some fair
> improvements.
> 37-43: enabling building generic metadata on the Generic/skb path.
> Since skbs already have all those fields, it's not a problem
> to do this here, plus it allows benefiting from it on
> interfaces not supporting meta yet.
> 44-47: ice driver part, including enabling prog hot-swap;
> 48-52: adds a complex selftest to verify everything works. Can be
> used as a sample as well, showing how to work with metadata
> in BPF programs and how to configure it from the userspace.
>
> Please refer to the actual commit messages where some precise
> implementation details might be explained.
> Nearly 20 of 52 are various cleanups and prereqs, as usual.
>
> Perf figures were taken on cpumap redirect from the ice interface
> (driver-side XDP), redirecting the traffic within the same node.
>
> Frame size /    64/42   128/20    256/8    512/4   1024/2   1532/1
> thread num
You'll have to remind me what's the production use case for
cpu_map on a modern NIC or even SmartNIC? Why are you not
just using hardware queues and redirecting to the right
queues in hardware to start with?
Also my understanding is if you do XDP_PASS up the stack
the skb is built with all the normal good stuff from the hw
descriptor. Sorry, going to need some extra context here
to understand.
Could you do a benchmark for AF_XDP? I thought this was
the troublesome use case, where the user space ring loses
the hardware info, e.g. timestamps and checksum values.
>
> meta off        30022    31350    21993    12144     6374     3610
> meta on         33059    28502    21503    12146     6380     3610
> GRO meta off    30020    31822    21970    12145     6384     3610
> GRO meta on     34736    28848    21566    12144     6381     3610
>
> Yes, redirect between the nodes plays awfully with the metadata
> composed by the driver:
Many production use cases use XDP exactly for this. If it
slows this basic use case down, it's going to be very hard
to use in many environments. Likely it won't be used.
>
> meta off        21449    18078    16897    11820     6383     3610
> meta on         16956    19004    14337     8228     5683     2822
> GRO meta off    22539    19129    16304    11659     6381     3592
> GRO meta on     17047    20366    15435     8878     5600     2753
Do you have hardware that can write the data into the
metadata region so you don't do it in software? Seems
like it should be doable without much trouble and would
make this more viable.
>
> Questions still open:
>
> * the actual generic structure: it must have all the fields used
> often and by the majority of NICs. It can always be expanded
> later on (note that the structure grows to the left), but the
> less often the UAPI is modified, the better (less compat pain);
I don't believe a generic structure is needed.
> * ability to specify the exact fields for the driver to fill, e.g. a
> flags bitmap passed from the userspace. In theory it can be more
> optimal to not spend cycles on data we don't need, but at the
> same time it increases the complexity of the whole concept (e.g. it
> will be more problematic to unify drivers' routines for collecting
> data from descriptors to metadata and to skbs);
> * there was an idea to be able to specify from the userspace the
> desired cacheline offset, so that [the wanted fields of] metadata
> and the packet headers would lie in the same cacheline. This can't
> be implemented in Generic/skb XDP, and ice has some troubles with
> it too;
> * lacks AF_XDP/XSk perf numbers and various other scenarios in
> general; is the current implementation optimal for them?
AF_XDP is the primary use case from my understanding.
> * metadata threshold and everything else present in this
> implementation.
I really think you're asking questions that are two or three
jumps away. Why not do the simplest bit first and kick
the driver into this mode with an on/off switch? But
I don't understand this cpumap use case, so maybe explain
that first.
And sorry, I didn't even look at your 50+ patches. Figure let's
get agreement on the goal first.
.John
* [xdp-hints] Re: [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata
2022-06-29 6:15 ` [xdp-hints] Re: [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata John Fastabend
@ 2022-06-29 13:43 ` Toke Høiland-Jørgensen
2022-07-04 15:44 ` Alexander Lobakin
2022-06-29 17:56 ` Zvi Effron
2022-07-04 15:31 ` Alexander Lobakin
2 siblings, 1 reply; 98+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-06-29 13:43 UTC (permalink / raw)
To: John Fastabend, Alexander Lobakin, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko
Cc: Alexander Lobakin, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Lorenzo Bianconi,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Jesse Brandeburg, John Fastabend, Yajun Deng, Willem de Bruijn,
bpf, netdev, linux-kernel, xdp-hints
John Fastabend <john.fastabend@gmail.com> writes:
> Alexander Lobakin wrote:
>> This RFC is to give the whole picture. It will most likely be split
>> into several series, maybe even merge cycles. See the "table of
>> contents" below.
>
> Even for an RFC it's a bit much. Probably improve the summary
> message here as well; I'm still not clear on the overall
> architecture, so I'm not sure I want to dig into the patches.
+1 on this, and piggybacking on your comment to chime in on the general
architecture.
>> Now, a NIC driver, or even a SmartNIC itself, can put those params
>> there in a well-defined format. The format is fixed, but can be of
>> several different types represented by structures, whose definitions
>> are available to the kernel, BPF programs and the userland.
>
> I don't think in general the format needs to be fixed.
No, that's the whole point of BTF: it's not supposed to be UAPI, we'll
use CO-RE to enable dynamic formats...
[...]
>> It is fixed due to it being almost a UAPI, and the exact format can
>> be determined by reading the last 10 bytes of metadata. They contain
>> a 2-byte magic ID to not confuse it with a non-compatible meta and
>> an 8-byte combined BTF ID + type ID: the ID of the BTF where this
>> structure is defined and the ID of that definition inside that BTF.
>> Users can obtain BTF IDs for structure types using helpers available
>> in the kernel, BPF (written by the CO-RE/verifier) and the userland
>> (libbpf -> kernel call) and then rely on those IDs when reading data
>> to check whether they support it and what to do with it.
>> Why separate magic and ID? The idea is to make different formats
>> always contain the basic/"generic" structure embedded at the end.
>> This way we can still benefit in purely generic consumers (like
>> cpumap) while providing some "extra" data to those who support it.
>
> I don't follow this. If you have a struct in your driver, name it
> something obvious, like ice_xdp_metadata. If I understand things
> correctly, just dump the BTF for the driver, extract the
> struct, and done: you can use CO-RE reads. For the 'fixed' case
> this looks easy. And I don't think you even need a patch for this.
...however as we've discussed previously, we do need a bit of
infrastructure around this. In particular, we need to embed the
BTF ID into the metadata itself so BPF can do runtime disambiguation
between different formats (and add the right CO-RE primitives to make
this easy). This is for two reasons:
- The metadata might be different per-packet (e.g., PTP packets with
timestamps interleaved with bulk data without them)
- With redirects we may end up processing packets from different devices
in a single XDP program (in devmap or cpumap, or on a veth) so we need
to be able to disambiguate at runtime.
So I think the part of the design that puts the BTF ID into the end of
the metadata struct is sound; however, the actual format doesn't have to
be fixed, we can use CO-RE to pick out the bits that a given BPF program
needs; we just need a convention for how drivers report which format(s)
they support. Which we should also agree on (and add core infrastructure
around) so each driver doesn't go around inventing their own
conventions.
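As a sketch of what such runtime disambiguation could look like on
the BPF side, assuming a hypothetical generic struct (exported via
kernel BTF) whose trailing member is the combined ID; none of the
names below are the series' actual API:

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_core_read.h>

    struct xdp_meta_generic {      /* assumed to exist in kernel BTF */
        __u32 rx_hash;
        __u16 magic;
        __u64 id;                  /* BTF obj ID << 32 | type ID */
    } __attribute__((preserve_access_index));

    SEC("xdp")
    int meta_demux(struct xdp_md *ctx)
    {
        void *data = (void *)(long)ctx->data;
        void *meta = (void *)(long)ctx->data_meta;
        /* the struct ends right where the frame begins */
        struct xdp_meta_generic *md = data - sizeof(*md);

        if ((void *)md < meta)
            return XDP_PASS;       /* no (or too small) metadata */

        /* does the packet carry the format we were built for?
         * (a full check would also match the BTF object ID held
         * in the upper 32 bits of md->id) */
        if ((__u32)md->id !=
            bpf_core_type_id_kernel(struct xdp_meta_generic))
            return XDP_PASS;

        bpf_printk("rx_hash 0x%x", md->rx_hash);
        return XDP_PASS;
    }

    char _license[] SEC("license") = "GPL";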
>> The enablement of this feature is controlled when attaching/replacing
>> an XDP program on an interface, with two new parameters: that combined
>> BTF+type ID and the metadata threshold.
>> The threshold specifies the minimum frame size from which a driver (or
>> NIC) should start composing metadata. It is introduced instead of just
>> a true/false flag because it's often not worth spending cycles to
>> fetch all that data for such small frames: let's say, it can even be
>> faster to just calculate checksums for them on the CPU rather than
>> touch the non-coherent DMA zone. The simple XDP_DROP case loses
>> 15 Mpps on 64-byte frames with metadata enabled; the threshold can
>> help mitigate that.
>
> I would put this in the bonus category. Can you do the simple thing
> above without these extra bits and then add them later. Just
> pick some overly conservative threshold to start with.
Yeah, I'd agree this kind of configuration is something that can be
added later, and also it's sort of orthogonal to the consumption of the
metadata itself.
Also, tying this configuration into the loading of an XDP program is a
terrible interface: these are hardware configuration options, let's just
put them into ethtool or 'ip link' like any other piece of device
configuration.
>> The RFC can be divided into 8 parts:
>
> I'm missing something: why not do the simplest bit of work and
> get this running in ice with a few smallish driver updates,
> so we can all see it? No need for so many patches.
Agreed. This incremental approach is basically what Jesper's
simultaneous series makes a start on, AFAICT? Would be nice if y'all
could converge the efforts :)
[...]
> I really think you're asking questions that are two or three
> jumps away. Why not do the simplest bit first and kick
> the driver into this mode with an on/off switch? But
> I don't understand this cpumap use case, so maybe explain
> that first.
>
> And sorry, I didn't even look at your 50+ patches. Figure let's
> get agreement on the goal first.
+1 on both of these :)
-Toke
* [xdp-hints] Re: [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata
2022-06-29 13:43 ` Toke Høiland-Jørgensen
@ 2022-07-04 15:44 ` Alexander Lobakin
2022-07-04 17:13 ` Jesper Dangaard Brouer
2022-07-04 17:14 ` Toke Høiland-Jørgensen
0 siblings, 2 replies; 98+ messages in thread
From: Alexander Lobakin @ 2022-07-04 15:44 UTC (permalink / raw)
To: Toke Høiland-Jørgensen
Cc: Alexander Lobakin, John Fastabend, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, Larysa Zaremba,
Michal Swiatkowski, Jesper Dangaard Brouer,
Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
Jonathan Lemon, Lorenzo Bianconi, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Jesse Brandeburg, Yajun Deng,
Willem de Bruijn, bpf, netdev, linux-kernel, xdp-hints
From: Toke Høiland-Jørgensen <toke@redhat.com>
Date: Wed, 29 Jun 2022 15:43:05 +0200
> John Fastabend <john.fastabend@gmail.com> writes:
>
> > Alexander Lobakin wrote:
> >> This RFC is to give the whole picture. It will most likely be split
> >> into several series, maybe even merge cycles. See the "table of
> >> contents" below.
> >
> > Even for an RFC it's a bit much. Probably improve the summary
> > message here as well; I'm still not clear on the overall
> > architecture, so I'm not sure I want to dig into the patches.
>
> +1 on this, and piggybacking on your comment to chime in on the general
> architecture.
>
> >> Now, a NIC driver, or even a SmartNIC itself, can put those params
> >> there in a well-defined format. The format is fixed, but can be of
> >> several different types represented by structures, whose definitions
> >> are available to the kernel, BPF programs and the userland.
> >
> > I don't think in general the format needs to be fixed.
>
> No, that's the whole point of BTF: it's not supposed to be UAPI, we'll
> use CO-RE to enable dynamic formats...
>
> [...]
>
> >> It is fixed due to it being almost a UAPI, and the exact format can
> >> be determined by reading the last 10 bytes of metadata. They contain
> >> a 2-byte magic ID to not confuse it with a non-compatible meta and
> >> an 8-byte combined BTF ID + type ID: the ID of the BTF where this
> >> structure is defined and the ID of that definition inside that BTF.
> >> Users can obtain BTF IDs for structure types using helpers available
> >> in the kernel, BPF (written by the CO-RE/verifier) and the userland
> >> (libbpf -> kernel call) and then rely on those IDs when reading data
> >> to check whether they support it and what to do with it.
> >> Why separate magic and ID? The idea is to make different formats
> >> always contain the basic/"generic" structure embedded at the end.
> >> This way we can still benefit in purely generic consumers (like
> >> cpumap) while providing some "extra" data to those who support it.
> >
> > I don't follow this. If you have a struct in your driver, name it
> > something obvious, like ice_xdp_metadata. If I understand things
> > correctly, just dump the BTF for the driver, extract the
> > struct, and done: you can use CO-RE reads. For the 'fixed' case
> > this looks easy. And I don't think you even need a patch for this.
>
> ...however as we've discussed previously, we do need a bit of
> infrastructure around this. In particular, we need to embed the
> BTF ID into the metadata itself so BPF can do runtime disambiguation
> between different formats (and add the right CO-RE primitives to make
> this easy). This is for two reasons:
>
> - The metadata might be different per-packet (e.g., PTP packets with
> timestamps interleaved with bulk data without them)
>
> - With redirects we may end up processing packets from different devices
> in a single XDP program (in devmap or cpumap, or on a veth) so we need
> to be able to disambiguate at runtime.
>
> So I think the part of the design that puts the BTF ID into the end of
> the metadata struct is sound; however, the actual format doesn't have to
> be fixed, we can use CO-RE to pick out the bits that a given BPF program
> needs; we just need a convention for how drivers report which format(s)
> they support. Which we should also agree on (and add core infrastructure
> around) so each driver doesn't go around inventing their own
> conventions.
>
> >> The enablement of this feature is controlled when attaching/replacing
> >> an XDP program on an interface, with two new parameters: that combined
> >> BTF+type ID and the metadata threshold.
> >> The threshold specifies the minimum frame size from which a driver (or
> >> NIC) should start composing metadata. It is introduced instead of just
> >> a true/false flag because it's often not worth spending cycles to
> >> fetch all that data for such small frames: let's say, it can even be
> >> faster to just calculate checksums for them on the CPU rather than
> >> touch the non-coherent DMA zone. The simple XDP_DROP case loses
> >> 15 Mpps on 64-byte frames with metadata enabled; the threshold can
> >> help mitigate that.
> >
> > I would put this in the bonus category. Can you do the simple thing
> > above without these extra bits and then add them later. Just
> > pick some overly conservative threshold to start with.
>
> Yeah, I'd agree this kind of configuration is something that can be
> added later, and also it's sort of orthogonal to the consumption of the
> metadata itself.
>
> Also, tying this configuration into the loading of an XDP program is a
> terrible interface: these are hardware configuration options, let's just
> put them into ethtool or 'ip link' like any other piece of device
> configuration.
I don't believe it fits there, especially Ethtool. Ethtool is for
hardware configuration, XDP/AF_XDP is 95% software stuff (apart from
offload bits which is purely NFP's for now).
I follow that way:
1) you pick a program you want to attach;
2) usually they are written for special needs and usecases;
3) so most likely that program will be tied with metadata/driver/etc
in some way;
4) so you want to enable Hints of a particular format primarily for
this program and usecase, same with threshold and everything
else.
Pls explain how you see it, I might be wrong for sure.
>
> >> The RFC can be divided into 8 parts:
> >
> > I'm missing something: why not do the simplest bit of work and
> > get this running in ice with a few smallish driver updates,
> > so we can all see it? No need for so many patches.
>
> Agreed. This incremental approach is basically what Jesper's
> simultaneous series makes a start on, AFAICT? Would be nice if y'all
> could converge the efforts :)
I don't know why at some point Jesper decided to go on his own as he
for sure was using our tree as a base for some time, dunno what
happened then. Regarding these two particular submissions, I didn't
see Jesper's RFC when sending mine, only afterwards when I went to read
some stuff.
>
> [...]
>
> > I really think you're asking questions that are two or three
> > jumps away. Why not do the simplest bit first and kick
> > the driver into this mode with an on/off switch? But
> > I don't understand this cpumap use case, so maybe explain
> > that first.
> >
> > And sorry, I didn't even look at your 50+ patches. Figure let's
> > get agreement on the goal first.
>
> +1 on both of these :)
I just thought most of the parts were already discussed previously and
the reason I marked it "RFC" was that there are lots of changes and
not everyone may agree with them... Like "here's overview of what
was discussed and what we decided previously, let's review it to see
if there are any questionable/debatable stuff and agree on those 3
questions, then I'll split it according to my taste or to how the
maintainers see it and apply it slow'n'steady".
>
> -Toke
Thanks,
Olek
* [xdp-hints] Re: [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata
2022-07-04 15:44 ` Alexander Lobakin
@ 2022-07-04 17:13 ` Jesper Dangaard Brouer
2022-07-05 14:38 ` Alexander Lobakin
2022-07-04 17:14 ` Toke Høiland-Jørgensen
1 sibling, 1 reply; 98+ messages in thread
From: Jesper Dangaard Brouer @ 2022-07-04 17:13 UTC (permalink / raw)
To: Alexander Lobakin, Toke Høiland-Jørgensen
Cc: brouer, John Fastabend, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Lorenzo Bianconi,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Jesse Brandeburg, Yajun Deng, Willem de Bruijn, bpf, netdev,
linux-kernel, xdp-hints
On 04/07/2022 17.44, Alexander Lobakin wrote:
>> Agreed. This incremental approach is basically what Jesper's
>> simultaneous series makes a start on, AFAICT? Would be nice if y'all
>> could converge the efforts :)
> I don't know why at some point Jesper decided to go on his own as he
> for sure was using our tree as a base for some time, dunno what
> happened then. Regarding these two particular submissions, I didn't
> see Jesper's RFC when sending mine, only afterwards when I went to read
> some stuff.
>
Well, I have written to you (offlist) that the git tree didn't compile,
so I had a hard time getting it into a working state. We had a
ping-pong of stuff to fix, but it wasn't resolved, and you basically
told me to switch to using LLVM to compile your kernel tree; I was
not interested in doing that.
I have looked at the code in your GitHub tree, and decided that it was
an over-engineered approach IMHO. Also, simply being 52 commits deep
without having posted this incrementally upstream was also a
non-starter for me, as this isn't the way to work upstream.
To get the ball rolling, I have implemented the base XDP-hints support
here[1] with only 9 patches (including support for two drivers).
IMHO we need to start out small and not intermix these huge refactoring
patches. E.g. I'm not convinced renaming net/{core/xdp.c => bpf/core.c}
is an improvement.
-Jesper
[1]
https://lore.kernel.org/bpf/165643378969.449467.13237011812569188299.stgit@firesoul/
* [xdp-hints] Re: [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata
2022-07-04 17:13 ` Jesper Dangaard Brouer
@ 2022-07-05 14:38 ` Alexander Lobakin
2022-07-05 19:08 ` Daniel Borkmann
0 siblings, 1 reply; 98+ messages in thread
From: Alexander Lobakin @ 2022-07-05 14:38 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Alexander Lobakin, Toke Høiland-Jørgensen, brouer,
John Fastabend, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Lorenzo Bianconi,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Jesse Brandeburg, Yajun Deng, Willem de Bruijn, bpf, netdev,
linux-kernel, xdp-hints
From: Jesper Dangaard Brouer <jbrouer@redhat.com>
Date: Mon, 4 Jul 2022 19:13:53 +0200
> On 04/07/2022 17.44, Alexander Lobakin wrote:
> >> Agreed. This incremental approach is basically what Jesper's
> >> simultaneous series makes a start on, AFAICT? Would be nice if y'all
> >> could converge the efforts :) >
> > I don't know why at some point Jesper decided to go on his own as he
> > for sure was using our tree as a base for some time, dunno what
> > happened then. Regarding these two particular submissions, I didn't
> > see Jesper's RFC when sending mine, only afterwards when I went to read
> > some stuff.
> >
>
> Well, I have written to you (offlist) that the git tree didn't compile,
> so I had a hard time getting it into a working state. We had a
> ping-pong of stuff to fix, but it wasn't resolved, and you basically
> told me to switch to using LLVM to compile your kernel tree; I was
> not interested in doing that.
Yes and no, I only told you that I missed those build issues due to
using LLVM as my primary compiler, but I didn't suggest you switch
to it. Then I fixed all of the issues in a couple of days and
wrote you an email on the 3rd of June saying that everything works
now =\
>
> I have looked at the code in your GitHub tree, and decided that it was
> an over-engineered approach IMHO. Also, simply being 52 commits deep
> without having posted this incrementally upstream was also a
> non-starter for me, as this isn't the way to work upstream.
So Ingo announced recently that he has a series of 2300+ patches
to try to fix include hell. Now he's preparing to submit them in
batches/series. Look at this RFC as an announcement: "Hey folks,
I have a bunch of stuff and will be submitting it soon, but I'm
posting the whole changeset here, so you could take a look or
give it a try before it actually starts being posted".
All this is mentioned in the cover letter as well. What is the
problem? OK, next time I can skip any announcements and just start
posting series, if that avoids such misunderstandings.
Anyway, I will post a demo version of the series in a couple of
weeks containing only the parts required to get Hints working on
ice, if folks prefer to look at the pencil draft instead of the
final painting (never thought I'd have to do that in the kernel
dev community :D).
>
> To get the ball rolling, I have implemented the base XDP-hints support
> here[1] with only 9 patches (including support for two drivers).
>
> IMHO we need to start out small and not intermix these huge refactoring
> patches. E.g. I'm not convinced renaming net/{core/xdp.c => bpf/core.c}
> is an improvement.
Those cleanup patches can easily be put in a standalone series as
a prerequisite. I even mentioned them in the cover letter.
File names are a matter of discussion; my intention there was mainly
to move XDP stuff out of the overburdened net/core/dev.c.
>
> -Jesper
>
> [1]
> https://lore.kernel.org/bpf/165643378969.449467.13237011812569188299.stgit@firesoul/
Thanks,
Olek
* [xdp-hints] Re: [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata
2022-07-05 14:38 ` Alexander Lobakin
@ 2022-07-05 19:08 ` Daniel Borkmann
0 siblings, 0 replies; 98+ messages in thread
From: Daniel Borkmann @ 2022-07-05 19:08 UTC (permalink / raw)
To: Alexander Lobakin, Jesper Dangaard Brouer
Cc: Toke Høiland-Jørgensen, brouer, John Fastabend,
Alexei Starovoitov, Andrii Nakryiko, Larysa Zaremba,
Michal Swiatkowski, Jesper Dangaard Brouer,
Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
Jonathan Lemon, Lorenzo Bianconi, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Jesse Brandeburg, Yajun Deng,
Willem de Bruijn, bpf, netdev, linux-kernel, xdp-hints
On 7/5/22 4:38 PM, Alexander Lobakin wrote:
> From: Jesper Dangaard Brouer <jbrouer@redhat.com>
> Date: Mon, 4 Jul 2022 19:13:53 +0200
[...]
>> I have looked at the code in your GitHub tree, and decided that it was
>> an over-engineered approach IMHO. Also, simply being 52 commits deep
>> without having posted this incrementally upstream was also a
>> non-starter for me, as this isn't the way to work upstream.
>
> So Ingo announced recently that he has a series of 2300+ patches
> to try to fix include hell. Now he's preparing to submit them in
> batches/series. Look at this RFC as an announcement: "Hey folks,
> I have a bunch of stuff and will be submitting it soon, but I'm
> posting the whole changeset here, so you could take a look or
> give it a try before it actually starts being posted".
> All this is mentioned in the cover letter as well. What is the
> problem? OK, next time I can skip any announcements and just start
> posting series, if that avoids such misunderstandings.
I would suggest to please calm down first. No offense, but the above
example with the 2300+ patches is not a great one. There is no way
any mortal would be able to review them, not even thinking about the
cycles spent around rebasing, merge conflict resolution or bugs they
may contain. Anyway, that aside...
Your series essentially starts out with ...
The series adds the ability to pass different frame
details/parameters used by most NICs and the kernel
stack (in skbs); not essential, but highly wanted, such as:
* checksum value, status (Rx) or command (Tx);
* hash value and type/level (Rx);
* queue number (Rx);
* timestamps;
* and so on.
... so my initial question would be whether, in this context, any
research / analysis has been done of how this can speed up /real
world/ production applications such as Katran L4LB [0], for example?
What is the speedup you observed with it by utilizing the fields
from the metadata?
Thanks,
Daniel
[0] https://github.com/facebookincubator/katran
* [xdp-hints] Re: [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata
2022-07-04 15:44 ` Alexander Lobakin
2022-07-04 17:13 ` Jesper Dangaard Brouer
@ 2022-07-04 17:14 ` Toke Høiland-Jørgensen
2022-07-05 15:41 ` Alexander Lobakin
1 sibling, 1 reply; 98+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-07-04 17:14 UTC (permalink / raw)
To: Alexander Lobakin
Cc: John Fastabend, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Lorenzo Bianconi,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Jesse Brandeburg, Yajun Deng, Willem de Bruijn, bpf, netdev,
linux-kernel, xdp-hints
Alexander Lobakin <alexandr.lobakin@intel.com> writes:
> From: Toke Høiland-Jørgensen <toke@redhat.com>
> Date: Wed, 29 Jun 2022 15:43:05 +0200
>
>> John Fastabend <john.fastabend@gmail.com> writes:
>>
>> > Alexander Lobakin wrote:
>> >> This RFC is to give the whole picture. It will most likely be split
>> >> into several series, maybe even merge cycles. See the "table of
>> >> contents" below.
>> >
>> > Even for an RFC it's a bit much. Probably improve the summary
>> > message here as well; I'm still not clear on the overall
>> > architecture, so I'm not sure I want to dig into the patches.
>>
>> +1 on this, and piggybacking on your comment to chime in on the general
>> architecture.
>>
>> >> Now, a NIC driver, or even a SmartNIC itself, can put those params
>> >> there in a well-defined format. The format is fixed, but can be of
>> >> several different types represented by structures, whose definitions
>> >> are available to the kernel, BPF programs and the userland.
>> >
>> > I don't think in general the format needs to be fixed.
>>
>> No, that's the whole point of BTF: it's not supposed to be UAPI, we'll
>> use CO-RE to enable dynamic formats...
>>
>> [...]
>>
>> >> It is fixed due to it being almost a UAPI, and the exact format can
>> >> be determined by reading the last 10 bytes of metadata. They contain
>> >> a 2-byte magic ID to not confuse it with a non-compatible meta and
>> >> an 8-byte combined BTF ID + type ID: the ID of the BTF where this
>> >> structure is defined and the ID of that definition inside that BTF.
>> >> Users can obtain BTF IDs for structure types using helpers available
>> >> in the kernel, BPF (written by the CO-RE/verifier) and the userland
>> >> (libbpf -> kernel call) and then rely on those IDs when reading data
>> >> to check whether they support it and what to do with it.
>> >> Why separate magic and ID? The idea is to make different formats
>> >> always contain the basic/"generic" structure embedded at the end.
>> >> This way we can still benefit in purely generic consumers (like
>> >> cpumap) while providing some "extra" data to those who support it.
>> >
>> > I don't follow this. If you have a struct in your driver, name it
>> > something obvious, like ice_xdp_metadata. If I understand things
>> > correctly, just dump the BTF for the driver, extract the
>> > struct, and done: you can use CO-RE reads. For the 'fixed' case
>> > this looks easy. And I don't think you even need a patch for this.
>>
>> ...however as we've discussed previously, we do need a bit of
>> infrastructure around this. In particular, we need to embed the
>> BTF ID into the metadata itself so BPF can do runtime disambiguation
>> between different formats (and add the right CO-RE primitives to make
>> this easy). This is for two reasons:
>>
>> - The metadata might be different per-packet (e.g., PTP packets with
>> timestamps interleaved with bulk data without them)
>>
>> - With redirects we may end up processing packets from different devices
>> in a single XDP program (in devmap or cpumap, or on a veth) so we need
>> to be able to disambiguate at runtime.
>>
>> So I think the part of the design that puts the BTF ID into the end of
>> the metadata struct is sound; however, the actual format doesn't have to
>> be fixed, we can use CO-RE to pick out the bits that a given BPF program
>> needs; we just need a convention for how drivers report which format(s)
>> they support. Which we should also agree on (and add core infrastructure
>> around) so each driver doesn't go around inventing their own
>> conventions.
>>
>> >> The enablement of this feature is controlled when attaching/replacing
>> >> an XDP program on an interface, with two new parameters: that combined
>> >> BTF+type ID and the metadata threshold.
>> >> The threshold specifies the minimum frame size from which a driver (or
>> >> NIC) should start composing metadata. It is introduced instead of just
>> >> a true/false flag because it's often not worth spending cycles to
>> >> fetch all that data for such small frames: let's say, it can even be
>> >> faster to just calculate checksums for them on the CPU rather than
>> >> touch the non-coherent DMA zone. The simple XDP_DROP case loses
>> >> 15 Mpps on 64-byte frames with metadata enabled; the threshold can
>> >> help mitigate that.
>> >
>> > I would put this in the bonus category. Can you do the simple thing
>> > above without these extra bits and then add them later. Just
>> > pick some overly conservative threshold to start with.
>>
>> Yeah, I'd agree this kind of configuration is something that can be
>> added later, and also it's sort of orthogonal to the consumption of the
>> metadata itself.
>>
>> Also, tying this configuration into the loading of an XDP program is a
>> terrible interface: these are hardware configuration options, let's just
>> put them into ethtool or 'ip link' like any other piece of device
>> configuration.
>
> I don't believe it fits there, especially Ethtool. Ethtool is for
> hardware configuration, XDP/AF_XDP is 95% software stuff (apart from
> offload bits which is purely NFP's for now).
But XDP-hints is about consuming hardware features. When you're
configuring which metadata items you want, you're saying "please provide
me with these (hardware) features". So ethtool is an excellent place to
do that :)
> I follow that way:
>
> 1) you pick a program you want to attach;
> 2) usually they are written for special needs and usecases;
> 3) so most likely that program will be tied with metadata/driver/etc
> in some way;
> 4) so you want to enable Hints of a particular format primarily for
> this program and usecase, same with threshold and everything
> else.
>
> Pls explain how you see it, I might be wrong for sure.
As above: XDP hints is about giving XDP programs (and AF_XDP consumers)
access to metadata that is not currently available. Tying the lifetime
of that hardware configuration (i.e., which information to provide) to
the lifetime of an XDP program is not a good interface: for one thing,
how will it handle multiple programs? What about when XDP is not used at
all but you still want to configure the same features?
In addition, in every other case where we do dynamic data access (with
CO-RE) the BPF program is a consumer that modifies itself to access the
data provided by the kernel. I get that this is harder to achieve for
AF_XDP, but then let's solve that instead of making a totally
inconsistent interface for XDP.
I'm as excited as you about the prospect of having totally programmable
hardware where you can just specify any arbitrary metadata format and
it'll provide that for you. But that is an orthogonal feature: let's
start with creating a dynamic interface for consuming the (static)
hardware features we already have, and then later we can have a separate
interface for configuring more dynamic hardware features. XDP-hints is
about adding this consumption feature in a way that's sufficiently
dynamic that we can do the other (programmable hardware) thing on top
later...
-Toke
* [xdp-hints] Re: [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata
2022-07-04 17:14 ` Toke Høiland-Jørgensen
@ 2022-07-05 15:41 ` Alexander Lobakin
2022-07-05 18:51 ` Toke Høiland-Jørgensen
0 siblings, 1 reply; 98+ messages in thread
From: Alexander Lobakin @ 2022-07-05 15:41 UTC (permalink / raw)
To: Toke Høiland-Jørgensen
Cc: Alexander Lobakin, John Fastabend, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, Larysa Zaremba,
Michal Swiatkowski, Jesper Dangaard Brouer,
Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
Jonathan Lemon, Lorenzo Bianconi, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Jesse Brandeburg, Yajun Deng,
Willem de Bruijn, bpf, netdev, linux-kernel, xdp-hints
From: Toke Høiland-Jørgensen <toke@redhat.com>
Date: Mon, 04 Jul 2022 19:14:04 +0200
> Alexander Lobakin <alexandr.lobakin@intel.com> writes:
>
> > From: Toke Høiland-Jørgensen <toke@redhat.com>
> > Date: Wed, 29 Jun 2022 15:43:05 +0200
> >
> >> John Fastabend <john.fastabend@gmail.com> writes:
> >>
> >> > Alexander Lobakin wrote:
> >> >> This RFC is to give the whole picture. It will most likely be split
> >> >> into several series, maybe even merge cycles. See the "table of
> >> >> contents" below.
> >> >
> >> > Even for an RFC it's a bit much. Probably improve the summary
> >> > message here as well; I'm still not clear on the overall
> >> > architecture, so I'm not sure I want to dig into the patches.
> >>
> >> +1 on this, and piggybacking on your comment to chime in on the general
> >> architecture.
> >>
> >> >> Now, a NIC driver, or even a SmartNIC itself, can put those params
> >> >> there in a well-defined format. The format is fixed, but can be of
> >> >> several different types represented by structures, whose definitions
> >> >> are available to the kernel, BPF programs and the userland.
> >> >
> >> > I don't think in general the format needs to be fixed.
> >>
> >> No, that's the whole point of BTF: it's not supposed to be UAPI, we'll
> >> use CO-RE to enable dynamic formats...
> >>
> >> [...]
> >>
> >> >> It is fixed due to it being almost a UAPI, and the exact format can
> >> >> be determined by reading the last 10 bytes of metadata. They contain
> >> >> a 2-byte magic ID to not confuse it with a non-compatible meta and
> >> >> an 8-byte combined BTF ID + type ID: the ID of the BTF where this
> >> >> structure is defined and the ID of that definition inside that BTF.
> >> >> Users can obtain BTF IDs for structure types using helpers available
> >> >> in the kernel, BPF (written by the CO-RE/verifier) and the userland
> >> >> (libbpf -> kernel call) and then rely on those IDs when reading data
> >> >> to check whether they support it and what to do with it.
> >> >> Why separate magic and ID? The idea is to make different formats
> >> >> always contain the basic/"generic" structure embedded at the end.
> >> >> This way we can still benefit in purely generic consumers (like
> >> >> cpumap) while providing some "extra" data to those who support it.
> >> >
> >> > I don't follow this. If you have a struct in your driver, name it
> >> > something obvious, like ice_xdp_metadata. If I understand things
> >> > correctly, just dump the BTF for the driver, extract the
> >> > struct, and done: you can use CO-RE reads. For the 'fixed' case
> >> > this looks easy. And I don't think you even need a patch for this.
> >>
> >> ...however as we've discussed previously, we do need a bit of
> >> infrastructure around this. In particular, we need to embed the
> >> BTF ID into the metadata itself so BPF can do runtime disambiguation
> >> between different formats (and add the right CO-RE primitives to make
> >> this easy). This is for two reasons:
> >>
> >> - The metadata might be different per-packet (e.g., PTP packets with
> >> timestamps interleaved with bulk data without them)
> >>
> >> - With redirects we may end up processing packets from different devices
> >> in a single XDP program (in devmap or cpumap, or on a veth) so we need
> >> to be able to disambiguate at runtime.
> >>
> >> So I think the part of the design that puts the BTF ID into the end of
> >> the metadata struct is sound; however, the actual format doesn't have to
> >> be fixed, we can use CO-RE to pick out the bits that a given BPF program
> >> needs; we just need a convention for how drivers report which format(s)
> >> they support. Which we should also agree on (and add core infrastructure
> >> around) so each driver doesn't go around inventing their own
> >> conventions.
> >>
> >> >> The enablement of this feature is controlled when attaching/replacing
> >> >> an XDP program on an interface, with two new parameters: that combined
> >> >> BTF+type ID and the metadata threshold.
> >> >> The threshold specifies the minimum frame size from which a driver (or
> >> >> NIC) should start composing metadata. It is introduced instead of just
> >> >> a true/false flag because it's often not worth spending cycles to
> >> >> fetch all that data for such small frames: let's say, it can even be
> >> >> faster to just calculate checksums for them on the CPU rather than
> >> >> touch the non-coherent DMA zone. The simple XDP_DROP case loses
> >> >> 15 Mpps on 64-byte frames with metadata enabled; the threshold can
> >> >> help mitigate that.
> >> >
> >> > I would put this in the bonus category. Can you do the simple thing
> >> > above without these extra bits and then add them later. Just
> >> > pick some overly conservative threshold to start with.
> >>
> >> Yeah, I'd agree this kind of configuration is something that can be
> >> added later, and also it's sort of orthogonal to the consumption of the
> >> metadata itself.
> >>
> >> Also, tying this configuration into the loading of an XDP program is a
> >> terrible interface: these are hardware configuration options, let's just
> >> put them into ethtool or 'ip link' like any other piece of device
> >> configuration.
> >
> > I don't believe it fits there, especially Ethtool. Ethtool is for
> > hardware configuration, XDP/AF_XDP is 95% software stuff (apart from
> > offload bits which is purely NFP's for now).
>
> But XDP-hints is about consuming hardware features. When you're
> configuring which metadata items you want, you're saying "please provide
> me with these (hardware) features". So ethtool is an excellent place to
> do that :)
With Ethtool you configure the hardware, e.g. it won't strip VLAN
tags if you disable rx-cvlan-stripping. When configuring metadata,
you only tell the driver what you want to see there, don't you?
>
> > I follow that way:
> >
> > 1) you pick a program you want to attach;
> > 2) usually they are written for special needs and usecases;
> > 3) so most likely that program will be tied with metadata/driver/etc
> > in some way;
> > 4) so you want to enable Hints of a particular format primarily for
> > this program and usecase, same with threshold and everything
> > else.
> >
> > Pls explain how you see it, I might be wrong for sure.
>
> As above: XDP hints is about giving XDP programs (and AF_XDP consumers)
> access to metadata that is not currently available. Tying the lifetime
> of that hardware configuration (i.e., which information to provide) to
> the lifetime of an XDP program is not a good interface: for one thing,
> how will it handle multiple programs? What about when XDP is not used at
Multiple progs is stuff I didn't cover, but will do later (as you
all say to me, "let's start with something simple" :)). Aaaand
multiple XDP progs (I'm not talking about attaching progs in
different modes) is not a kernel feature, rather a libbpf feature,
so I believe it should be handled there later...
> all but you still want to configure the same features?
What's the point of configuring metadata when there are no progs
attached? To configure it once and not on every prog attach? I'm
not saying I don't like it, just want to clarify.
Maybe I need opinions from some more people, just to have an
overview of how most folks see it and would like to configure
it. 'Cause I heard from at least one of the consumers that
the libbpf API is a perfect place for Hints to him :)
>
> In addition, in every other case where we do dynamic data access (with
> CO-RE) the BPF program is a consumer that modifies itself to access the
> data provided by the kernel. I get that this is harder to achieve for
> AF_XDP, but then let's solve that instead of making a totally
> inconsistent interface for XDP.
I also see CO-RE as a more fitting and convenient way to use them, but
didn't manage to solve two things:
1) AF_XDP programs, so what to do with them? Prepare patches for
LLVM to make it able to do CO-RE on AF_XDP program load? Or
just hardcode them for particular usecases and NICs? What about
"general-purpose" programs?
And if we hardcode, what's the point then of doing Generic Hints at
all? Then all it needs is making the driver build some meta in
front of frames via an on-off button, and that's it? Why a BTF ID in
the meta then, if consumers will access the meta hardcoded (via CO-RE
or literally hardcoded, doesn't matter)?
2) In-kernel metadata consumers? Also do CO-RE? Otherwise, with no
generic metadata structure they won't be able to benefit from
Hints. But I guess we still need to provide the kernel with meta?
Or no?
>
> I'm as excited as you about the prospect of having totally programmable
But I mostly care about current generation with no programmable
Hints...
> hardware where you can just specify any arbitrary metadata format and
> it'll provide that for you. But that is an orthogonal feature: let's
> start with creating a dynamic interface for consuming the (static)
> hardware features we already have, and then later we can have a separate
> interface for configuring more dynamic hardware features. XDP-hints is
> about adding this consumption feature in a way that's sufficiently
> dynamic that we can do the other (programmable hardware) thing on top
> later...
>
> -Toke
Thanks,
Olek
* [xdp-hints] Re: [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata
2022-07-05 15:41 ` Alexander Lobakin
@ 2022-07-05 18:51 ` Toke Høiland-Jørgensen
2022-07-06 13:50 ` Alexander Lobakin
0 siblings, 1 reply; 98+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-07-05 18:51 UTC (permalink / raw)
To: Alexander Lobakin
Cc: John Fastabend, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Lorenzo Bianconi,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Jesse Brandeburg, Yajun Deng, Willem de Bruijn, bpf, netdev,
linux-kernel, xdp-hints
Alexander Lobakin <alexandr.lobakin@intel.com> writes:
[... snipping a bit of context here ...]
>> >> Yeah, I'd agree this kind of configuration is something that can be
>> >> added later, and also it's sort of orthogonal to the consumption of the
>> >> metadata itself.
>> >>
>> >> Also, tying this configuration into the loading of an XDP program is a
>> >> terrible interface: these are hardware configuration options, let's just
>> >> put them into ethtool or 'ip link' like any other piece of device
>> >> configuration.
>> >
>> > I don't believe it fits there, especially Ethtool. Ethtool is for
>> > hardware configuration, XDP/AF_XDP is 95% software stuff (apart from
>> > offload bits which is purely NFP's for now).
>>
>> But XDP-hints is about consuming hardware features. When you're
>> configuring which metadata items you want, you're saying "please provide
>> me with these (hardware) features". So ethtool is an excellent place to
>> do that :)
>
> With Ethtool you configure the hardware, e.g. it won't strip VLAN
> tags if you disable rx-cvlan-stripping. When configuring metadata,
> you only tell the driver what you want to see there, don't you?
Ah, I think we may be getting closer to identifying the disconnect
between our way of thinking about this!
In my mind, there's no separate "configuration of the metadata" step.
You simply tell the hardware what features you want (say, "enable
timestamps and VLAN offload"), and the driver will then provide the
information related to these features in the metadata area
unconditionally. All XDP hints is about, then, is a way for the driver
to inform the rest of the system how that information is actually laid
out in the metadata area.
Having a separate configuration knob to tell the driver "please lay out
these particular bits of metadata this way" seems like a totally
unnecessary (and quite complicated) feature to have when we can just let
the driver decide and use CO-RE to consume it?
>> > I follow that way:
>> >
>> > 1) you pick a program you want to attach;
>> > 2) usually they are written for special needs and usecases;
>> > 3) so most likely that program will be tied with metadata/driver/etc
>> > in some way;
>> > 4) so you want to enable Hints of a particular format primarily for
>> > this program and usecase, same with threshold and everything
>> > else.
>> >
>> > Pls explain how you see it, I might be wrong for sure.
>>
>> As above: XDP hints is about giving XDP programs (and AF_XDP consumers)
>> access to metadata that is not currently available. Tying the lifetime
>> of that hardware configuration (i.e., which information to provide) to
>> the lifetime of an XDP program is not a good interface: for one thing,
>> how will it handle multiple programs? What about when XDP is not used at
>
> Multiple progs is stuff I didn't cover, but will do later (as you
> all say to me, "let's start with something simple" :)). Aaaand
> multiple XDP progs (I'm not talking about attaching progs in
> different modes) is not a kernel feature, rather a libbpf feature,
> so I believe it should be handled there later...
Right, but even if we don't *implement* it straight away we still need
to take it into consideration in the design. And expecting libxdp to
arbitrate between different XDP programs' metadata formats sounds like a
royal PITA :)
>> all but you still want to configure the same features?
>
> What's the point of configuring metadata when there are no progs
> attached? To configure it once and not on every prog attach? I'm
> not saying I don't like it, just want to clarify.
See above: you turn on the features because you want the stack to
consume them.
> Maybe I need opinions from some more people, just to have an
> overview of how most folks see it and would like to configure
> it. 'Cause I heard from at least one of the consumers that
> the libbpf API is a perfect place for Hints to him :)
Well, as a program author who wants to consume hints, you'd use
lib{bpf,xdp} APIs to do so (probably in the form of suitable CO-RE
macros)...
>> In addition, in every other case where we do dynamic data access (with
>> CO-RE) the BPF program is a consumer that modifies itself to access the
>> data provided by the kernel. I get that this is harder to achieve for
>> AF_XDP, but then let's solve that instead of making a totally
>> inconsistent interface for XDP.
>
> I also see CO-RE as a more fitting and convenient way to use them, but
> didn't manage to solve two things:
>
> 1) AF_XDP programs, so what to do with them? Prepare patches for
> LLVM to make it able to do CO-RE on AF_XDP program load? Or
> just hardcode them for particular usecases and NICs? What about
> "general-purpose" programs?
You provide a library to read the fields. Jesper actually already
implemented this, did you look at his code?
https://github.com/xdp-project/bpf-examples/tree/master/AF_XDP-interaction
It basically builds a lookup table at load-time using BTF information
from the kernel, keyed on BTF ID and field name, resolving them into
offsets. It's not quite the zero-overhead of CO-RE, but it's fairly
close and can be improved upon (CO-RE for userspace being one way of
doing that).
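For reference, that load-time lookup essentially boils down to walking
the struct's members in BTF; a rough userspace sketch using libbpf's
BTF API (not Jesper's exact code, and with error handling trimmed):

    #include <string.h>
    #include <bpf/btf.h>

    /* resolve a field's byte offset in a named struct, -1 if absent */
    static int meta_field_offset(const struct btf *btf, const char *tname,
                                 const char *fname)
    {
        __s32 tid = btf__find_by_name_kind(btf, tname, BTF_KIND_STRUCT);
        const struct btf_type *t;
        const struct btf_member *m;
        int i;

        if (tid < 0)
            return -1;
        t = btf__type_by_id(btf, tid);
        m = btf_members(t);
        for (i = 0; i < btf_vlen(t); i++, m++)
            if (!strcmp(btf__name_by_offset(btf, m->name_off), fname))
                return m->offset / 8;   /* bits -> bytes */
        return -1;
    }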
> And if we hardcode, what's the point then of doing Generic Hints at
> all? Then all it needs is making the driver build some meta in
> front of frames via an on-off button, and that's it? Why a BTF ID in
> the meta then, if consumers will access the meta hardcoded (via CO-RE
> or literally hardcoded, doesn't matter)?
You're quite right, we could probably implement all the access to
existing (fixed) metadata without using any BTF at all - just define a
common struct and some flags to designate which fields are set. In my
mind, there are a couple of reasons for going the BTF route instead:
- We can leverage CO-RE to get close to optimal efficiency in field
access.
and, more importantly:
- It's infinitely extensible. With the infrastructure in place to make
it really easy to consume metadata described by BTF, we lower the bar
for future innovation in hardware offloads. Both for just adding new
fixed-function stuff to hardware, but especially for fully
programmable hardware.
> 2) In-kernel metadata consumers? Also do CO-RE? Otherwise, with no
> generic metadata structure they won't be able to benefit from
> Hints. But I guess we still need to provide the kernel with meta?
> Or no?
In the short term, I think the "generic structure" approach is fine for
leveraging this in the stack. Both your and Jesper's series include
this, and I think that's totally fine. Longer term, if it turns out to
be useful to have something more dynamic for the stack consumption as
well, we could extend it to be CO-RE based as well (most likely by
having the stack load a "translator" BPF program or something along
those lines).
>> I'm as excited as you about the prospect of having totally programmable
>
> But I mostly care about current generation with no programmable
> Hints...
Well, see above; we should be able to support both :)
-Toke
* [xdp-hints] Re: [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata
2022-07-05 18:51 ` Toke Høiland-Jørgensen
@ 2022-07-06 13:50 ` Alexander Lobakin
2022-07-06 23:22 ` Toke Høiland-Jørgensen
0 siblings, 1 reply; 98+ messages in thread
From: Alexander Lobakin @ 2022-07-06 13:50 UTC (permalink / raw)
To: Toke Høiland-Jørgensen
Cc: Alexander Lobakin, John Fastabend, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, Larysa Zaremba,
Michal Swiatkowski, Jesper Dangaard Brouer,
Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
Jonathan Lemon, Lorenzo Bianconi, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Jesse Brandeburg, Yajun Deng,
Willem de Bruijn, bpf, netdev, linux-kernel, xdp-hints
From: Toke Høiland-Jørgensen <toke@redhat.com>
Date: Tue, 05 Jul 2022 20:51:14 +0200
> Alexander Lobakin <alexandr.lobakin@intel.com> writes:
>
> [... snipping a bit of context here ...]
>
> >> >> Yeah, I'd agree this kind of configuration is something that can be
> >> >> added later, and also it's sort of orthogonal to the consumption of the
> >> >> metadata itself.
> >> >>
> >> >> Also, tying this configuration into the loading of an XDP program is a
> >> >> terrible interface: these are hardware configuration options, let's just
> >> >> put them into ethtool or 'ip link' like any other piece of device
> >> >> configuration.
> >> >
> >> > I don't believe it fits there, especially Ethtool. Ethtool is for
> >> > hardware configuration, XDP/AF_XDP is 95% software stuff (apart from
> >> > offload bits which is purely NFP's for now).
> >>
> >> But XDP-hints is about consuming hardware features. When you're
> >> configuring which metadata items you want, you're saying "please provide
> >> me with these (hardware) features". So ethtool is an excellent place to
> >> do that :)
> >
> > With Ethtool you configure the hardware, e.g. it won't strip VLAN
> > tags if you disable rx-cvlan-stripping. When configuring metadata,
> > you only tell the driver what you want to see there, don't you?
>
> Ah, I think we may be getting closer to identifying the disconnect
> between our way of thinking about this!
>
> In my mind, there's no separate "configuration of the metadata" step.
> You simply tell the hardware what features you want (say, "enable
> timestamps and VLAN offload"), and the driver will then provide the
> information related to these features in the metadata area
> unconditionally. All XDP hints is about, then, is a way for the driver
> to inform the rest of the system how that information is actually laid
> out in the metadata area.
>
> Having a separate configuration knob to tell the driver "please lay out
> these particular bits of metadata this way" seems like a totally
> unnecessary (and quite complicated) feature to have when we can just let
> the driver decide and use CO-RE to consume it?
Magnus (he's currently on vacation) told me it would be useful for
AF_XDP to enable/disable particular metadata, at least from perf
perspective. Let's say, just fetching of one "checksum ok" bit in
the driver is faster than walking through all the descriptor words
and driver logic (i.e. there are several hundred LoC in ice which
just parse descriptor data and build an skb or metadata from it).
But if we would just enable/disable corresponding features through
Ethtool, that would hurt XDP_PASS. Maybe it's a bad example, but
what if I want to have only RSS hash in the metadata (and don't
want to spend cycles on parsing the rest), but at the same time
still want skb path to have checksum status to not die at CPU
checksum calculation?
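Roughly, the difference I mean, with completely made-up names (neither
the descriptor layout nor the helper below is real ice code):

	/* cheap path: test a single status bit in the Rx descriptor */
	u64 qw = le64_to_cpu(rx_desc->qword1);		/* hypothetical layout */
	bool csum_ok = qw & RX_DESC_STATUS_CSUM_OK;	/* hypothetical flag */

	/* full path: parse ptype/hash/VLAN/csum/timestamp and write each
	 * of them to the metadata area -- the several hundred LoC above */
	parse_rx_desc_to_hints(rx_ring, rx_desc, &hints);	/* hypothetical */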
>
> >> > I follow that way:
> >> >
> >> > 1) you pick a program you want to attach;
> >> > 2) usually they are written for special needs and usecases;
> >> > 3) so most likely that program will be tied with metadata/driver/etc
> >> > in some way;
> >> > 4) so you want to enable Hints of a particular format primarily for
> >> > this program and usecase, same with threshold and everything
> >> > else.
> >> >
> >> > Pls explain how you see it, I might be wrong for sure.
> >>
> >> As above: XDP hints is about giving XDP programs (and AF_XDP consumers)
> >> access to metadata that is not currently available. Tying the lifetime
> >> of that hardware configuration (i.e., which information to provide) to
> >> the lifetime of an XDP program is not a good interface: for one thing,
> >> how will it handle multiple programs? What about when XDP is not used at
> >
> > Multiple progs is stuff I didn't cover, but will do later (as you
> > all say to me, "let's start with something simple" :)). Aaaand
> > multiple XDP progs (I'm not talking about attaching progs in
> > different modes) is not a kernel feature, rather a libbpf feature,
> > so I believe it should be handled there later...
>
> Right, but even if we don't *implement* it straight away we still need
> to take it into consideration in the design. And expecting libxdp to
> arbitrate between different XDP programs' metadata formats sounds like a
> royal PITA :)
>
> >> all but you still want to configure the same features?
> >
> > What's the point of configuring metadata when there are no progs
> > attached? To configure it once and not on every prog attach? I'm
> > not saying I don't like it, just want to clarify.
>
> See above: you turn on the features because you want the stack to
> consume them.
>
> > Maybe I need opinions from some more people, just to have an
> > overview of how most of folks see it and would like to configure
> > it. 'Cause I heard from at least one of the consumers that
> > libbpf API is a perfect place for Hints to him :)
>
> Well, as a program author who wants to consume hints, you'd use
> lib{bpf,xdp} APIs to do so (probably in the form of suitable CO-RE
> macros)...
>
> >> In addition, in every other case where we do dynamic data access (with
> >> CO-RE) the BPF program is a consumer that modifies itself to access the
> >> data provided by the kernel. I get that this is harder to achieve for
> >> AF_XDP, but then let's solve that instead of making a totally
> >> inconsistent interface for XDP.
> >
> > I also see CO-RE more fitting and convenient way to use them, but
> > didn't manage to solve two things:
> >
> > 1) AF_XDP programs, so what to do with them? Prepare patches for
> > LLVM to make it able to do CO-RE on AF_XDP program load? Or
> > just hardcode them for particular usecases and NICs? What about
> > "general-purpose" programs?
>
> You provide a library to read the fields. Jesper actually already
> implemented this, did you look at his code?
>
> https://github.com/xdp-project/bpf-examples/tree/master/AF_XDP-interaction
>
> It basically builds a lookup table at load-time using BTF information
> from the kernel, keyed on BTF ID and field name, resolving them into
> offsets. It's not quite the zero-overhead of CO-RE, but it's fairly
> close and can be improved upon (CO-RE for userspace being one way of
> doing that).
Aaaah, sorry, I completely missed that. I thought of something
similar as well, but then thought "variable field offsets, that
would annihilate optimization and performance", and our Xsk team
is super concerned about performance hits when using Hints.
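For reference, the core of that lookup boils down to resolving member
offsets from the kernel-provided BTF once at startup; a minimal sketch
using libbpf's BTF API (error handling trimmed, and this is my rough
reconstruction, not the actual code from Jesper's repo):

	#include <string.h>
	#include <bpf/btf.h>

	/* Resolve the byte offset of a named member of a BTF struct type,
	 * so the hot path only does a cached (btf_id, field) -> offset
	 * lookup instead of re-walking the type info per packet. */
	static int member_offset(const struct btf *btf, __u32 type_id,
				 const char *name)
	{
		const struct btf_type *t = btf__type_by_id(btf, type_id);
		const struct btf_member *m;
		int i;

		if (!t || !btf_is_struct(t))
			return -1;

		m = btf_members(t);
		for (i = 0; i < btf_vlen(t); i++, m++) {
			const char *mname = btf__name_by_offset(btf, m->name_off);

			if (mname && !strcmp(mname, name))
				return btf_member_bit_offset(t, i) / 8;
		}

		return -1;
	}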
>
> > And if hardcode, what's the point then to do Generic Hints at
> > all? Then all it needs is making driver building some meta in
> > front of frames via on-off button and that's it? Why BTF ID in
> > the meta then if consumers will access meta hardcoded (via CO-RE
> > or literally hardcoded, doesn't matter)?
>
> You're quite right, we could probably implement all the access to
> existing (fixed) metadata without using any BTF at all - just define a
> common struct and some flags to designate which fields are set. In my
> mind, there are a couple of reasons for going the BTF route instead:
>
> - We can leverage CO-RE to get close to optimal efficiency in field
> access.
>
> and, more importantly:
>
> - It's infinitely extensible. With the infrastructure in place to make
> it really easy to consume metadata described by BTF, we lower the bar
> for future innovation in hardware offloads. Both for just adding new
> fixed-function stuff to hardware, but especially for fully
> programmable hardware.
Agree :) That libxdp lookup translator fixed lots of stuff in my
mind.
>
> > 2) In-kernel metadata consumers? Also do CO-RE? Otherwise, with no
> > generic metadata structure they won't be able to benefit from
> > Hints. But I guess we still need to provide kernel with meta?
> > Or no?
>
> In the short term, I think the "generic structure" approach is fine for
> leveraging this in the stack. Both your and Jesper's series include
> this, and I think that's totally fine. Longer term, if it turns out to
> be useful to have something more dynamic for the stack consumption as
> well, we could extend it to be CO-RE based as well (most likely by
> having the stack load a "translator" BPF program or something along
> those lines).
Oh, that translator prog sounds nice BTW!
>
> >> I'm as excited as you about the prospect of having totally programmable
> >
> > But I mostly care about current generation with no programmable
> > Hints...
>
> Well, see above; we should be able to support both :)
>
> -Toke
Thanks,
Olek
^ permalink raw reply [flat|nested] 98+ messages in thread
* [xdp-hints] Re: [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata
2022-07-06 13:50 ` Alexander Lobakin
@ 2022-07-06 23:22 ` Toke Høiland-Jørgensen
2022-07-07 11:41 ` Jesper Dangaard Brouer
2022-07-12 10:33 ` Magnus Karlsson
0 siblings, 2 replies; 98+ messages in thread
From: Toke Høiland-Jørgensen @ 2022-07-06 23:22 UTC (permalink / raw)
To: Alexander Lobakin
Cc: John Fastabend, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Lorenzo Bianconi,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Jesse Brandeburg, Yajun Deng, Willem de Bruijn, bpf, netdev,
linux-kernel, xdp-hints
Alexander Lobakin <alexandr.lobakin@intel.com> writes:
> From: Toke Høiland-Jørgensen <toke@redhat.com>
> Date: Tue, 05 Jul 2022 20:51:14 +0200
>
>> Alexander Lobakin <alexandr.lobakin@intel.com> writes:
>>
>> [... snipping a bit of context here ...]
>>
>> >> >> Yeah, I'd agree this kind of configuration is something that can be
>> >> >> added later, and also it's sort of orthogonal to the consumption of the
>> >> >> metadata itself.
>> >> >>
>> >> >> Also, tying this configuration into the loading of an XDP program is a
>> >> >> terrible interface: these are hardware configuration options, let's just
>> >> >> put them into ethtool or 'ip link' like any other piece of device
>> >> >> configuration.
>> >> >
>> >> > I don't believe it fits there, especially Ethtool. Ethtool is for
>> >> > hardware configuration, XDP/AF_XDP is 95% software stuff (apart from
>> >> > offload bits which is purely NFP's for now).
>> >>
>> >> But XDP-hints is about consuming hardware features. When you're
>> >> configuring which metadata items you want, you're saying "please provide
>> >> me with these (hardware) features". So ethtool is an excellent place to
>> >> do that :)
>> >
>> > With Ethtool you configure the hardware, e.g. it won't strip VLAN
>> > tags if you disable rx-cvlan-stripping. With configuring metadata
>> > you only tell what you want to see there, don't you?
>>
>> Ah, I think we may be getting closer to identifying the disconnect
>> between our way of thinking about this!
>>
>> In my mind, there's no separate "configuration of the metadata" step.
>> You simply tell the hardware what features you want (say, "enable
>> timestamps and VLAN offload"), and the driver will then provide the
>> information related to these features in the metadata area
>> unconditionally. All XDP hints is about, then, is a way for the driver
>> to inform the rest of the system how that information is actually laid
>> out in the metadata area.
>>
>> Having a separate configuration knob to tell the driver "please lay out
>> these particular bits of metadata this way" seems like a totally
>> unnecessary (and quite complicated) feature to have when we can just let
>> the driver decide and use CO-RE to consume it?
>
> Magnus (he's currently on vacation) told me it would be useful for
> AF_XDP to enable/disable particular metadata, at least from perf
> perspective. Let's say, just fetching of one "checksum ok" bit in
> the driver is faster than walking through all the descriptor words
> and driver logic (i.e. there are several hundred LoC in ice which
> just parse descriptor data and build an skb or metadata from it).
> But if we would just enable/disable corresponding features through
> Ethtool, that would hurt XDP_PASS. Maybe it's a bad example, but
> what if I want to have only RSS hash in the metadata (and don't
> want to spend cycles on parsing the rest), but at the same time
> still want skb path to have checksum status to not die at CPU
> checksum calculation?
Hmm, so this feels a little like a driver-specific optimisation? I.e.,
my guess is that not all drivers have a measurable overhead for pulling
out the metadata. Also, once the XDP metadata bits are in place, we can
move in the direction of building SKBs from the same source, so I'm not
sure it's a good idea to assume that the XDP metadata is separate from
what the stack consumes...
In any case, if such an optimisation does turn out to be useful, we can
add it later (backed by rigorous benchmarks, of course), so I think we
can still start with the simple case and iterate from there?
>> >> > I follow that way:
>> >> >
>> >> > 1) you pick a program you want to attach;
>> >> > 2) usually they are written for special needs and usecases;
>> >> > 3) so most likely that program will be tied with metadata/driver/etc
>> >> > in some way;
>> >> > 4) so you want to enable Hints of a particular format primarily for
>> >> > this program and usecase, same with threshold and everything
>> >> > else.
>> >> >
>> >> > Pls explain how you see it, I might be wrong for sure.
>> >>
>> >> As above: XDP hints is about giving XDP programs (and AF_XDP consumers)
>> >> access to metadata that is not currently available. Tying the lifetime
>> >> of that hardware configuration (i.e., which information to provide) to
>> >> the lifetime of an XDP program is not a good interface: for one thing,
>> >> how will it handle multiple programs? What about when XDP is not used at
>> >
>> > Multiple progs is stuff I didn't cover, but will do later (as you
>> > all say to me, "let's start with something simple" :)). Aaaand
>> > multiple XDP progs (I'm not talking about attaching progs in
>> > different modes) is not a kernel feature, rather a libbpf feature,
>> > so I believe it should be handled there later...
>>
>> Right, but even if we don't *implement* it straight away we still need
>> to take it into consideration in the design. And expecting libxdp to
>> arbitrate between different XDP programs' metadata formats sounds like a
>> royal PITA :)
>>
>> >> all but you still want to configure the same features?
>> >
>> > What's the point of configuring metadata when there are no progs
>> > attached? To configure it once and not on every prog attach? I'm
>> > not saying I don't like it, just want to clarify.
>>
>> See above: you turn on the features because you want the stack to
>> consume them.
>>
>> > Maybe I need opinions from some more people, just to have an
>> > overview of how most of folks see it and would like to configure
>> > it. 'Cause I heard from at least one of the consumers that
>> > libbpf API is a perfect place for Hints to him :)
>>
>> Well, as a program author who wants to consume hints, you'd use
>> lib{bpf,xdp} APIs to do so (probably in the form of suitable CO-RE
>> macros)...
>>
>> >> In addition, in every other case where we do dynamic data access (with
>> >> CO-RE) the BPF program is a consumer that modifies itself to access the
>> >> data provided by the kernel. I get that this is harder to achieve for
>> >> AF_XDP, but then let's solve that instead of making a totally
>> >> inconsistent interface for XDP.
>> >
>> > I also see CO-RE more fitting and convenient way to use them, but
>> > didn't manage to solve two things:
>> >
>> > 1) AF_XDP programs, so what to do with them? Prepare patches for
>> > LLVM to make it able to do CO-RE on AF_XDP program load? Or
>> > just hardcode them for particular usecases and NICs? What about
>> > "general-purpose" programs?
>>
>> You provide a library to read the fields. Jesper actually already
>> implemented this, did you look at his code?
>>
>> https://github.com/xdp-project/bpf-examples/tree/master/AF_XDP-interaction
>>
>> It basically builds a lookup table at load-time using BTF information
>> from the kernel, keyed on BTF ID and field name, resolving them into
>> offsets. It's not quite the zero-overhead of CO-RE, but it's fairly
>> close and can be improved upon (CO-RE for userspace being one way of
>> doing that).
>
> Aaaah, sorry, I completely missed that. I thought of something
> similar as well, but then thought "variable field offsets, that
> would annihilate optimization and performance", and our Xsk team
> is super concerned about performance hits when using Hints.
>
>>
>> > And if hardcode, what's the point then to do Generic Hints at
>> > all? Then all it needs is making driver building some meta in
>> > front of frames via on-off button and that's it? Why BTF ID in
>> > the meta then if consumers will access meta hardcoded (via CO-RE
>> > or literally hardcoded, doesn't matter)?
>>
>> You're quite right, we could probably implement all the access to
>> existing (fixed) metadata without using any BTF at all - just define a
>> common struct and some flags to designate which fields are set. In my
>> mind, there are a couple of reasons for going the BTF route instead:
>>
>> - We can leverage CO-RE to get close to optimal efficiency in field
>> access.
>>
>> and, more importantly:
>>
>> - It's infinitely extensible. With the infrastructure in place to make
>> it really easy to consume metadata described by BTF, we lower the bar
>> for future innovation in hardware offloads. Both for just adding new
>> fixed-function stuff to hardware, but especially for fully
>> programmable hardware.
>
> Agree :) That libxdp lookup translator fixed lots of stuff in my
> mind.
Great! Looks like we're slowly converging towards a shared
understanding, then! :)
>> > 2) In-kernel metadata consumers? Also do CO-RE? Otherwise, with no
>> > generic metadata structure they won't be able to benefit from
>> > Hints. But I guess we still need to provide kernel with meta?
>> > Or no?
>>
>> In the short term, I think the "generic structure" approach is fine for
>> leveraging this in the stack. Both your and Jesper's series include
>> this, and I think that's totally fine. Longer term, if it turns out to
>> be useful to have something more dynamic for the stack consumption as
>> well, we could extend it to be CO-RE based as well (most likely by
>> having the stack load a "translator" BPF program or something along
>> those lines).
>
> Oh, that translator prog sounds nice BTW!
Yeah, it's only a rough idea Jesper and I discussed at some point, but I
think it could have potential (see also point above re: making XDP hints
*the* source of metadata for the whole stack; wouldn't it be nice if
drivers didn't have to deal with the intricacies of assembling SKBs?).
-Toke
^ permalink raw reply [flat|nested] 98+ messages in thread
* [xdp-hints] Re: [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata
2022-07-06 23:22 ` Toke Høiland-Jørgensen
@ 2022-07-07 11:41 ` Jesper Dangaard Brouer
2022-07-12 10:33 ` Magnus Karlsson
1 sibling, 0 replies; 98+ messages in thread
From: Jesper Dangaard Brouer @ 2022-07-07 11:41 UTC (permalink / raw)
To: Toke Høiland-Jørgensen, Alexander Lobakin
Cc: brouer, John Fastabend, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Lorenzo Bianconi,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Jesse Brandeburg, Yajun Deng, Willem de Bruijn, bpf, netdev,
linux-kernel, xdp-hints
On 07/07/2022 01.22, Toke Høiland-Jørgensen wrote:
> Alexander Lobakin <alexandr.lobakin@intel.com> writes:
>
>> From: Toke Høiland-Jørgensen <toke@redhat.com>
>> Date: Tue, 05 Jul 2022 20:51:14 +0200
>>
>>> Alexander Lobakin <alexandr.lobakin@intel.com> writes:
>>>
>>> [... snipping a bit of context here ...]
>>>
>>>>>>> Yeah, I'd agree this kind of configuration is something that can be
>>>>>>> added later, and also it's sort of orthogonal to the consumption of the
>>>>>>> metadata itself.
>>>>>>>
>>>>>>> Also, tying this configuration into the loading of an XDP program is a
>>>>>>> terrible interface: these are hardware configuration options, let's just
>>>>>>> put them into ethtool or 'ip link' like any other piece of device
>>>>>>> configuration.
>>>>>>
>>>>>> I don't believe it fits there, especially Ethtool. Ethtool is for
>>>>>> hardware configuration, XDP/AF_XDP is 95% software stuff (apart from
>>>>>> offload bits which is purely NFP's for now).
>>>>>
>>>>> But XDP-hints is about consuming hardware features. When you're
>>>>> configuring which metadata items you want, you're saying "please provide
>>>>> me with these (hardware) features". So ethtool is an excellent place to
>>>>> do that :)
>>>>
>>>> With Ethtool you configure the hardware, e.g. it won't strip VLAN
>>>> tags if you disable rx-cvlan-stripping. With configuring metadata
>>>> you only tell what you want to see there, don't you?
>>>
>>> Ah, I think we may be getting closer to identifying the disconnect
>>> between our way of thinking about this!
>>>
>>> In my mind, there's no separate "configuration of the metadata" step.
>>> You simply tell the hardware what features you want (say, "enable
>>> timestamps and VLAN offload"), and the driver will then provide the
>>> information related to these features in the metadata area
>>> unconditionally. All XDP hints is about, then, is a way for the driver
>>> to inform the rest of the system how that information is actually laid
>>> out in the metadata area.
>>>
>>> Having a separate configuration knob to tell the driver "please lay out
>>> these particular bits of metadata this way" seems like a totally
>>> unnecessary (and quite complicated) feature to have when we can just let
>>> the driver decide and use CO-RE to consume it?
>>
>> Magnus (he's currently on vacation) told me it would be useful for
>> AF_XDP to enable/disable particular metadata, at least from perf
>> perspective.
I have recently talked to Magnus (in person at Kernel Recipes), where I
tried to convey my opinion, which is: At least for existing hardware
hints, we need to respect the existing Linux kernel's config interfaces,
and not invent yet-another-way to configure these.
(At least for now) the kernel-module-defined structs in C code are the
source of truth, and we consume these layouts via the BTF information
provided by the kernel for our XDP-hints.
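To make "consume via BTF" concrete, the usual pattern on the BPF side
is a local shadow struct relocated with CO-RE; a minimal sketch (the
struct name and fields are examples only, not the layout from either
patch series):

	#include <linux/bpf.h>
	#include <bpf/bpf_helpers.h>

	/* Shadow of a kernel-defined hints struct; preserve_access_index
	 * makes the field accesses relocatable against the kernel's BTF
	 * at load time. */
	struct xdp_hints_rx {
		__u32 rss_hash;
		__u32 btf_id;	/* last member, adjacent to packet data */
	} __attribute__((preserve_access_index));

	SEC("xdp")
	int read_hints(struct xdp_md *ctx)
	{
		void *data	= (void *)(long)ctx->data;
		void *data_meta	= (void *)(long)ctx->data_meta;
		struct xdp_hints_rx *hints = data_meta;

		/* the metadata area sits right before the packet data */
		if ((void *)(hints + 1) > data)
			return XDP_PASS;	/* no hints present */

		bpf_printk("rss hash %u", hints->rss_hash);
		return XDP_PASS;
	}

	char _license[] SEC("license") = "GPL";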
>> Let's say, just fetching of one "checksum ok" bit in
>> the driver is faster than walking through all the descriptor words
>> and driver logics (i.e. there's several hundred locs in ice which
>> just parse descriptor data and build an skb or metadata from it).
>> But if we would just enable/disable corresponding features through
>> Ethtool, that would hurt XDP_PASS. Maybe it's a bad example, but
>> what if I want to have only RSS hash in the metadata (and don't
>> want to spend cycles on parsing the rest), but at the same time
>> still want skb path to have checksum status to not die at CPU
>> checksum calculation?
>
> Hmm, so this feels a little like a driver-specific optimisation? I.e.,
> my guess is that not all drivers have a measurable overhead for pulling
> out the metadata. Also, once the XDP metadata bits are in place, we can
> move in the direction of building SKBs from the same source, so I'm not
> sure it's a good idea to assume that the XDP metadata is separate from
> what the stack consumes...
I agree.
> In any case, if such an optimisation does turn out to be useful, we can
> add it later (backed by rigorous benchmarks, of course), so I think we
> can still start with the simple case and iterate from there?
For every element in the generic hints data-structure, we already have
per-element enable/disable facilities, as they are already controlled
by ethtool (except the timestamping, which can be enabled via a sockopt).
I don't see a benefit in creating another layer (of if-statements) that
would also be required to get the HW hint written to the XDP-hints
metadata area.
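For completeness, the timestamping sockopt I mean is the standard
SO_TIMESTAMPING; a short sketch of the existing knobs:

	#include <sys/socket.h>
	#include <linux/net_tstamp.h>

	/* Request hardware RX timestamps on a socket; the NIC side is
	 * enabled separately via the SIOCSHWTSTAMP ioctl, and checksum/
	 * hash offloads are flipped with "ethtool -K". */
	static int enable_hw_rx_timestamps(int fd)
	{
		int val = SOF_TIMESTAMPING_RX_HARDWARE |
			  SOF_TIMESTAMPING_RAW_HARDWARE;

		return setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING,
				  &val, sizeof(val));
	}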
>>>>>> I follow that way:
>>>>>>
>>>>>> 1) you pick a program you want to attach;
>>>>>> 2) usually they are written for special needs and usecases;
>>>>>> 3) so most likely that program will be tied with metadata/driver/etc
>>>>>> in some way;
>>>>>> 4) so you want to enable Hints of a particular format primarily for
>>>>>> this program and usecase, same with threshold and everything
>>>>>> else.
>>>>>>
>>>>>> Pls explain how you see it, I might be wrong for sure.
>>>>>
>>>>> As above: XDP hints is about giving XDP programs (and AF_XDP consumers)
>>>>> access to metadata that is not currently available. Tying the lifetime
>>>>> of that hardware configuration (i.e., which information to provide) to
>>>>> the lifetime of an XDP program is not a good interface: for one thing,
>>>>> how will it handle multiple programs? What about when XDP is not used at
>>>>
>>>> Multiple progs is stuff I didn't cover, but will do later (as you
>>>> all say to me, "let's start with something simple" :)). Aaaand
>>>> multiple XDP progs (I'm not talking about attaching progs in
>>>> different modes) is not a kernel feature, rather a libbpf feature,
>>>> so I believe it should be handled there later...
>>>
>>> Right, but even if we don't *implement* it straight away we still need
>>> to take it into consideration in the design. And expecting libxdp to
>>> arbitrate between different XDP programs' metadata formats sounds like a
>>> royal PITA :)
>>>
>>>>> all but you still want to configure the same features?
>>>>
>>>> What's the point of configuring metadata when there are no progs
>>>> attached? To configure it once and not on every prog attach? I'm
>>>> not saying I don't like it, just want to clarify.
>>>
>>> See above: you turn on the features because you want the stack to
>>> consume them.
>>>
>>>> Maybe I need opinions from some more people, just to have an
>>>> overview of how most of folks see it and would like to configure
>>>> it. 'Cause I heard from at least one of the consumers that
>>>> libbpf API is a perfect place for Hints to him :)
>>>
>>> Well, as a program author who wants to consume hints, you'd use
>>> lib{bpf,xdp} APIs to do so (probably in the form of suitable CO-RE
>>> macros)...
>>>
>>>>> In addition, in every other case where we do dynamic data access (with
>>>>> CO-RE) the BPF program is a consumer that modifies itself to access the
>>>>> data provided by the kernel. I get that this is harder to achieve for
>>>>> AF_XDP, but then let's solve that instead of making a totally
>>>>> inconsistent interface for XDP.
>>>>
>>>> I also see CO-RE more fitting and convenient way to use them, but
>>>> didn't manage to solve two things:
>>>>
>>>> 1) AF_XDP programs, so what to do with them? Prepare patches for
>>>> LLVM to make it able to do CO-RE on AF_XDP program load? Or
>>>> just hardcode them for particular usecases and NICs? What about
>>>> "general-purpose" programs?
>>>
>>> You provide a library to read the fields. Jesper actually already
>>> implemented this, did you look at his code?
>>>
>>> https://github.com/xdp-project/bpf-examples/tree/master/AF_XDP-interaction
>>>
>>> It basically builds a lookup table at load-time using BTF information
>>> from the kernel, keyed on BTF ID and field name, resolving them into
>>> offsets. It's not quite the zero-overhead of CO-RE, but it's fairly
>>> close and can be improved upon (CO-RE for userspace being one way of
>>> doing that).
>>
>> Aaaah, sorry, I completely missed that. I thought of something
>> similar as well, but then thought "variable field offsets, that
>> would annihilate optimization and performance", and our Xsk team
>> is super concerned about performance hits when using Hints.
>>
>>>
>>>> And if hardcode, what's the point then to do Generic Hints at
>>>> all? Then all it needs is making driver building some meta in
>>>> front of frames via on-off button and that's it? Why BTF ID in
>>>> the meta then if consumers will access meta hardcoded (via CO-RE
>>>> or literally hardcoded, doesn't matter)?
>>>
>>> You're quite right, we could probably implement all the access to
>>> existing (fixed) metadata without using any BTF at all - just define a
>>> common struct and some flags to designate which fields are set. In my
>>> mind, there are a couple of reasons for going the BTF route instead:
>>>
>>> - We can leverage CO-RE to get close to optimal efficiency in field
>>> access.
>>>
>>> and, more importantly:
>>>
>>> - It's infinitely extensible. With the infrastructure in place to make
>>> it really easy to consume metadata described by BTF, we lower the bar
>>> for future innovation in hardware offloads. Both for just adding new
>>> fixed-function stuff to hardware, but especially for fully
>>> programmable hardware.
>>
>> Agree :) That libxdp lookup translator fixed lots of stuff in my
>> mind.
>
> Great! Looks like we're slowly converging towards a shared
> understanding, then! :)
>
>>>> 2) In-kernel metadata consumers? Also do CO-RE? Otherwise, with no
>>>> generic metadata structure they won't be able to benefit from
>>>> Hints. But I guess we still need to provide kernel with meta?
>>>> Or no?
>>>
>>> In the short term, I think the "generic structure" approach is fine for
>>> leveraging this in the stack. Both your and Jesper's series include
>>> this, and I think that's totally fine. Longer term, if it turns out to
>>> be useful to have something more dynamic for the stack consumption as
>>> well, we could extend it to be CO-RE based as well (most likely by
>>> having the stack load a "translator" BPF program or something along
>>> those lines).
>>
>> Oh, that translator prog sounds nice BTW!
>
> Yeah, it's only a rough idea Jesper and I discussed at some point, but I
> think it could have potential (see also point above re: making XDP hints
> *the* source of metadata for the whole stack; wouldn't it be nice if
> drivers didn't have to deal with the intricacies of assembling SKBs?).
Yes, this is the longer term goal, but we should take this in steps.
(Thus, my patchset[0] focuses on the existing xdp_hints_common).
Eventually (pipe-dream?), I would like to add a new BPF-hook that runs
in the step converting xdp_frame to SKB (today handled in function
__xdp_build_skb_from_frame). This "translator" BPF program should be
tied/loaded per net_device, which makes it easier to consume the
driver-specific/dynamic XDP-hints layouts, and the BPF code can be
smaller as it only needs to CO-RE-handle the xdp-hints structs known
for this driver. A default BPF prog should be provided and maintained
by the driver maintainers, but can be replaced by end-users.
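To sketch the shape I have in mind (nothing below exists today -- the
attach point, context type and field names are all made up):

	/* Hypothetical per-netdev program run while converting an
	 * xdp_frame into an SKB; it translates the driver's own hints
	 * layout into the generic fields the stack needs. */
	SEC("xdp_frame_to_skb")				/* made-up hook */
	int hints_translator(struct xdp_to_skb_ctx *ctx)	/* made-up ctx */
	{
		void *data	= (void *)(long)ctx->data;
		void *data_meta	= (void *)(long)ctx->data_meta;
		struct xdp_hints_ice *h = data_meta;	/* driver layout, CO-RE */

		if ((void *)(h + 1) > data)
			return 0;		/* no hints, build a plain SKB */

		ctx->hash	= h->rss_hash;	/* made-up writable fields */
		ctx->csum_level	= h->csum_level;

		return 0;
	}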
--Jesper
[0]
https://lore.kernel.org/bpf/165643378969.449467.13237011812569188299.stgit@firesoul/
^ permalink raw reply [flat|nested] 98+ messages in thread
* [xdp-hints] Re: [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata
2022-07-06 23:22 ` Toke Høiland-Jørgensen
2022-07-07 11:41 ` Jesper Dangaard Brouer
@ 2022-07-12 10:33 ` Magnus Karlsson
2022-07-12 14:14 ` Jesper Dangaard Brouer
1 sibling, 1 reply; 98+ messages in thread
From: Magnus Karlsson @ 2022-07-12 10:33 UTC (permalink / raw)
To: Toke Høiland-Jørgensen
Cc: Alexander Lobakin, John Fastabend, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, Larysa Zaremba,
Michal Swiatkowski, Jesper Dangaard Brouer,
Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
Jonathan Lemon, Lorenzo Bianconi, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Jesse Brandeburg, Yajun Deng,
Willem de Bruijn, bpf, Network Development, open list, xdp-hints
On Thu, Jul 7, 2022 at 1:25 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Alexander Lobakin <alexandr.lobakin@intel.com> writes:
>
> > From: Toke Høiland-Jørgensen <toke@redhat.com>
> > Date: Tue, 05 Jul 2022 20:51:14 +0200
> >
> >> Alexander Lobakin <alexandr.lobakin@intel.com> writes:
> >>
> >> [... snipping a bit of context here ...]
> >>
> >> >> >> Yeah, I'd agree this kind of configuration is something that can be
> >> >> >> added later, and also it's sort of orthogonal to the consumption of the
> >> >> >> metadata itself.
> >> >> >>
> >> >> >> Also, tying this configuration into the loading of an XDP program is a
> >> >> >> terrible interface: these are hardware configuration options, let's just
> >> >> >> put them into ethtool or 'ip link' like any other piece of device
> >> >> >> configuration.
> >> >> >
> >> >> > I don't believe it fits there, especially Ethtool. Ethtool is for
> >> >> > hardware configuration, XDP/AF_XDP is 95% software stuff (apart from
> >> >> > offload bits which is purely NFP's for now).
> >> >>
> >> >> But XDP-hints is about consuming hardware features. When you're
> >> >> configuring which metadata items you want, you're saying "please provide
> >> >> me with these (hardware) features". So ethtool is an excellent place to
> >> >> do that :)
> >> >
> >> > With Ethtool you configure the hardware, e.g. it won't strip VLAN
> >> > tags if you disable rx-cvlan-stripping. With configuring metadata
> >> > you only tell what you want to see there, don't you?
> >>
> >> Ah, I think we may be getting closer to identifying the disconnect
> >> between our way of thinking about this!
> >>
> >> In my mind, there's no separate "configuration of the metadata" step.
> >> You simply tell the hardware what features you want (say, "enable
> >> timestamps and VLAN offload"), and the driver will then provide the
> >> information related to these features in the metadata area
> >> unconditionally. All XDP hints is about, then, is a way for the driver
> >> to inform the rest of the system how that information is actually laid
> >> out in the metadata area.
> >>
> >> Having a separate configuration knob to tell the driver "please lay out
> >> these particular bits of metadata this way" seems like a totally
> >> unnecessary (and quite complicated) feature to have when we can just let
> >> the driver decide and use CO-RE to consume it?
> >
> > Magnus (he's currently on vacation) told me it would be useful for
> > AF_XDP to enable/disable particular metadata, at least from perf
> > perspective. Let's say, just fetching of one "checksum ok" bit in
> > the driver is faster than walking through all the descriptor words
> > and driver logic (i.e. there are several hundred LoC in ice which
> > just parse descriptor data and build an skb or metadata from it).
> > But if we would just enable/disable corresponding features through
> > Ethtool, that would hurt XDP_PASS. Maybe it's a bad example, but
> > what if I want to have only RSS hash in the metadata (and don't
> > want to spend cycles on parsing the rest), but at the same time
> > still want skb path to have checksum status to not die at CPU
> > checksum calculation?
>
> Hmm, so this feels a little like a driver-specific optimisation? I.e.,
> my guess is that not all drivers have a measurable overhead for pulling
> out the metadata. Also, once the XDP metadata bits are in place, we can
> move in the direction of building SKBs from the same source, so I'm not
> sure it's a good idea to assume that the XDP metadata is separate from
> what the stack consumes...
>
> In any case, if such an optimisation does turn out to be useful, we can
> add it later (backed by rigorous benchmarks, of course), so I think we
> can still start with the simple case and iterate from there?
Just to check if my intuition was correct or not I ran some benchmarks
around this. I ported Jesper's patch set to the zero-copy driver of
i40e, which was really simple thanks to Jesper's refactoring. One line
of code added to the data path of the zc driver and making
i40e_process_xdp_hints() a global function so it can be reached from
the zc driver. I also moved the prefetch Jesper added to after the
check if xdp_hints are available since it really degrades performance
in the xdp_hints off case.
First number is the throughput change with hints on, and the second
number is with hints off. All are compared to the performance without
Jesper's patch set applied. The application is xdpsock -r (which used
to be part of the samples/bpf directory).
Copy mode with all hints: -21% / -2%
Zero-copy mode with all hints: -29% / -9%
Copy mode rx timestamp only (the rest removed with an #if 0): -11%
Zero-copy mode rx timestamp only: -20%
So, if you only want rx timestamp, but can only enable every hint or
nothing, then you get a 10% performance degradation with copy mode and
9% with zero-copy mode, compared to being able to enable the rx
timestamp alone. With these rough numbers (a real implementation would
not have an #if 0) I would say it matters, but that does not mean we
should not start simple and just have a big switch to start with. But
as we add hints (to the same btfid), this will just get worse.
Here are some other numbers I got, in case someone is interested. They
are XDP numbers from xdp_rxq_info in samples/bpf.
hints on / hints off
XDP_DROP: -18% / -1.5%
XDP_TX: -10% / -2.5%
> >> >> > I follow that way:
> >> >> >
> >> >> > 1) you pick a program you want to attach;
> >> >> > 2) usually they are written for special needs and usecases;
> >> >> > 3) so most likely that program will be tied with metadata/driver/etc
> >> >> > in some way;
> >> >> > 4) so you want to enable Hints of a particular format primarily for
> >> >> > this program and usecase, same with threshold and everything
> >> >> > else.
> >> >> >
> >> >> > Pls explain how you see it, I might be wrong for sure.
> >> >>
> >> >> As above: XDP hints is about giving XDP programs (and AF_XDP consumers)
> >> >> access to metadata that is not currently available. Tying the lifetime
> >> >> of that hardware configuration (i.e., which information to provide) to
> >> >> the lifetime of an XDP program is not a good interface: for one thing,
> >> >> how will it handle multiple programs? What about when XDP is not used at
> >> >
> >> > Multiple progs is stuff I didn't cover, but will do later (as you
> >> > all say to me, "let's start with something simple" :)). Aaaand
> >> > multiple XDP progs (I'm not talking about attaching progs in
> >> > different modes) is not a kernel feature, rather a libbpf feature,
> >> > so I believe it should be handled there later...
> >>
> >> Right, but even if we don't *implement* it straight away we still need
> >> to take it into consideration in the design. And expecting libxdp to
> >> arbitrate between different XDP programs' metadata formats sounds like a
> >> royal PITA :)
> >>
> >> >> all but you still want to configure the same features?
> >> >
> >> > What's the point of configuring metadata when there are no progs
> >> > attached? To configure it once and not on every prog attach? I'm
> >> > not saying I don't like it, just want to clarify.
> >>
> >> See above: you turn on the features because you want the stack to
> >> consume them.
> >>
> >> > Maybe I need opinions from some more people, just to have an
> >> > overview of how most of folks see it and would like to configure
> >> > it. 'Cause I heard from at least one of the consumers that
> >> > libbpf API is a perfect place for Hints to him :)
> >>
> >> Well, as a program author who wants to consume hints, you'd use
> >> lib{bpf,xdp} APIs to do so (probably in the form of suitable CO-RE
> >> macros)...
> >>
> >> >> In addition, in every other case where we do dynamic data access (with
> >> >> CO-RE) the BPF program is a consumer that modifies itself to access the
> >> >> data provided by the kernel. I get that this is harder to achieve for
> >> >> AF_XDP, but then let's solve that instead of making a totally
> >> >> inconsistent interface for XDP.
> >> >
> >> > I also see CO-RE more fitting and convenient way to use them, but
> >> > didn't manage to solve two things:
> >> >
> >> > 1) AF_XDP programs, so what to do with them? Prepare patches for
> >> > LLVM to make it able to do CO-RE on AF_XDP program load? Or
> >> > just hardcode them for particular usecases and NICs? What about
> >> > "general-purpose" programs?
> >>
> >> You provide a library to read the fields. Jesper actually already
> >> implemented this, did you look at his code?
> >>
> >> https://github.com/xdp-project/bpf-examples/tree/master/AF_XDP-interaction
> >>
> >> It basically builds a lookup table at load-time using BTF information
> >> from the kernel, keyed on BTF ID and field name, resolving them into
> >> offsets. It's not quite the zero-overhead of CO-RE, but it's fairly
> >> close and can be improved upon (CO-RE for userspace being one way of
> >> doing that).
> >
> > Aaaah, sorry, I completely missed that. I thought of something
> > similar as well, but then thought "variable field offsets, that
> > would annihilate optimization and performance", and our Xsk team
> > is super concerned about performance hits when using Hints.
> >
> >>
> >> > And if hardcode, what's the point then to do Generic Hints at
> >> > all? Then all it needs is making driver building some meta in
> >> > front of frames via on-off button and that's it? Why BTF ID in
> >> > the meta then if consumers will access meta hardcoded (via CO-RE
> >> > or literally hardcoded, doesn't matter)?
> >>
> >> You're quite right, we could probably implement all the access to
> >> existing (fixed) metadata without using any BTF at all - just define a
> >> common struct and some flags to designate which fields are set. In my
> >> mind, there are a couple of reasons for going the BTF route instead:
> >>
> >> - We can leverage CO-RE to get close to optimal efficiency in field
> >> access.
> >>
> >> and, more importantly:
> >>
> >> - It's infinitely extensible. With the infrastructure in place to make
> >> it really easy to consume metadata described by BTF, we lower the bar
> >> for future innovation in hardware offloads. Both for just adding new
> >> fixed-function stuff to hardware, but especially for fully
> >> programmable hardware.
> >
> > Agree :) That libxdp lookup translator fixed lots of stuff in my
> > mind.
>
> Great! Looks like we're slowly converging towards a shared
> understanding, then! :)
>
> >> > 2) In-kernel metadata consumers? Also do CO-RE? Otherwise, with no
> >> > generic metadata structure they won't be able to benefit from
> >> > Hints. But I guess we still need to provide kernel with meta?
> >> > Or no?
> >>
> >> In the short term, I think the "generic structure" approach is fine for
> >> leveraging this in the stack. Both your and Jesper's series include
> >> this, and I think that's totally fine. Longer term, if it turns out to
> >> be useful to have something more dynamic for the stack consumption as
> >> well, we could extend it to be CO-RE based as well (most likely by
> >> having the stack load a "translator" BPF program or something along
> >> those lines).
> >
> > Oh, that translator prog sounds nice BTW!
>
> Yeah, it's only a rough idea Jesper and I discussed at some point, but I
> think it could have potential (see also point above re: making XDP hints
> *the* source of metadata for the whole stack; wouldn't it be nice if
> drivers didn't have to deal with the intricacies of assembling SKBs?).
>
> -Toke
>
^ permalink raw reply [flat|nested] 98+ messages in thread
* [xdp-hints] Re: [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata
2022-07-12 10:33 ` Magnus Karlsson
@ 2022-07-12 14:14 ` Jesper Dangaard Brouer
2022-07-15 11:11 ` Magnus Karlsson
0 siblings, 1 reply; 98+ messages in thread
From: Jesper Dangaard Brouer @ 2022-07-12 14:14 UTC (permalink / raw)
To: Magnus Karlsson, Toke Høiland-Jørgensen
Cc: brouer, Alexander Lobakin, John Fastabend, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, Larysa Zaremba,
Michal Swiatkowski, Jesper Dangaard Brouer,
Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
Jonathan Lemon, Lorenzo Bianconi, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Jesse Brandeburg, Yajun Deng,
Willem de Bruijn, bpf, Network Development, open list, xdp-hints
On 12/07/2022 12.33, Magnus Karlsson wrote:
> On Thu, Jul 7, 2022 at 1:25 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Alexander Lobakin <alexandr.lobakin@intel.com> writes:
>>
>>> From: Toke Høiland-Jørgensen <toke@redhat.com>
>>> Date: Tue, 05 Jul 2022 20:51:14 +0200
>>>
>>>> Alexander Lobakin <alexandr.lobakin@intel.com> writes:
>>>>
>>>> [... snipping a bit of context here ...]
>>>>
>>>>>>>> Yeah, I'd agree this kind of configuration is something that can be
>>>>>>>> added later, and also it's sort of orthogonal to the consumption of the
>>>>>>>> metadata itself.
>>>>>>>>
>>>>>>>> Also, tying this configuration into the loading of an XDP program is a
>>>>>>>> terrible interface: these are hardware configuration options, let's just
>>>>>>>> put them into ethtool or 'ip link' like any other piece of device
>>>>>>>> configuration.
>>>>>>>
>>>>>>> I don't believe it fits there, especially Ethtool. Ethtool is for
>>>>>>> hardware configuration, XDP/AF_XDP is 95% software stuff (apart from
>>>>>>> offload bits which is purely NFP's for now).
>>>>>>
>>>>>> But XDP-hints is about consuming hardware features. When you're
>>>>>> configuring which metadata items you want, you're saying "please provide
>>>>>> me with these (hardware) features". So ethtool is an excellent place to
>>>>>> do that :)
>>>>>
>>>>> With Ethtool you configure the hardware, e.g. it won't strip VLAN
>>>>> tags if you disable rx-cvlan-stripping. With configuring metadata
>>>>> you only tell what you want to see there, don't you?
>>>>
>>>> Ah, I think we may be getting closer to identifying the disconnect
>>>> between our way of thinking about this!
>>>>
>>>> In my mind, there's no separate "configuration of the metadata" step.
>>>> You simply tell the hardware what features you want (say, "enable
>>>> timestamps and VLAN offload"), and the driver will then provide the
>>>> information related to these features in the metadata area
>>>> unconditionally. All XDP hints is about, then, is a way for the driver
>>>> to inform the rest of the system how that information is actually laid
>>>> out in the metadata area.
>>>>
>>>> Having a separate configuration knob to tell the driver "please lay out
>>>> these particular bits of metadata this way" seems like a totally
>>>> unnecessary (and quite complicated) feature to have when we can just let
>>>> the driver decide and use CO-RE to consume it?
>>>
>>> Magnus (he's currently on vacation) told me it would be useful for
>>> AF_XDP to enable/disable particular metadata, at least from perf
>>> perspective. Let's say, just fetching of one "checksum ok" bit in
>>> the driver is faster than walking through all the descriptor words
>>> and driver logic (i.e. there are several hundred LoC in ice which
>>> just parse descriptor data and build an skb or metadata from it).
>>> But if we would just enable/disable corresponding features through
>>> Ethtool, that would hurt XDP_PASS. Maybe it's a bad example, but
>>> what if I want to have only RSS hash in the metadata (and don't
>>> want to spend cycles on parsing the rest), but at the same time
>>> still want skb path to have checksum status to not die at CPU
>>> checksum calculation?
>>
>> Hmm, so this feels a little like a driver-specific optimisation? I.e.,
>> my guess is that not all drivers have a measurable overhead for pulling
>> out the metadata. Also, once the XDP metadata bits are in place, we can
>> move in the direction of building SKBs from the same source, so I'm not
>> sure it's a good idea to assume that the XDP metadata is separate from
>> what the stack consumes...
>>
>> In any case, if such an optimisation does turn out to be useful, we can
>> add it later (backed by rigorous benchmarks, of course), so I think we
>> can still start with the simple case and iterate from there?
>
> Just to check if my intuition was correct or not I ran some benchmarks
> around this. I ported Jesper's patch set to the zero-copy driver of
> i40e, which was really simple thanks to Jesper's refactoring. One line
> of code added to the data path of the zc driver and making
> i40e_process_xdp_hints() a global function so it can be reached from
> the zc driver.
Happy to hear it was simple to extend this to AF_XDP in the driver.
Code-design-wise, I'm trying to keep it simple for drivers to add this.
I have a co-worker who has already extended ixgbe.
> I also moved the prefetch Jesper added to after the
> check if xdp_hints are available since it really degrades performance
> in the xdp_hints off case.
Good to know.
> First number is the throughput change with hints on, and the second
> number is with hints off. All are compared to the performance without
> Jesper's patch set applied. The application is xdpsock -r (which used
> to be part of the samples/bpf directory).
For reviewers to relate to these numbers, we need to understand/explain
the extreme numbers we are dealing with. On my system with i40e and
xdpsock --rx-drop, I can AF_XDP-drop packets at a rate of 33,633,761 pps.
This corresponds to a processing time per packet of 29.7 ns (nanosec):
- Calc: (1/33633761)*10^9
> Copy mode with all hints: -21% / -2%
The -21% for enabling all hints does sound like an excessive overhead,
but time-wise this is a reduction/overhead of 6.2 ns.
The real question: is the 6.2 ns overhead that gives us e.g.
RX-checksumming lower than the gain we can obtain from avoiding
RX-checksumming in software?
- A: My previous experiments[1] conclude that for 1500-byte frames we
can save 54 ns (or increase performance by 8% for the normal netstack).
I was going for zero overhead when disabling xdp-hints, which is almost
true as the -2% is time-wise a reduction/overhead of 0.59 ns.
[1]
https://github.com/xdp-project/xdp-project/blob/master/areas/core/xdp_frame01_checksum.org#measurements-compare-results--conclusion
> Zero-copy mode with all hints: -29% / -9%
I'm unsure why the percentages increase here, perhaps because zero-copy
is faster and thus the overhead becomes a larger percentage?
> Copy mode rx timestamp only (the rest removed with an #if 0): -11%
> Zero-copy mode rx timestamp only: -20%
>
> So, if you only want rx timestamp, but can only enable every hint or
> nothing, then you get a 10% performance degradation with copy mode and
> 9% with zero-copy mode, compared to being able to enable the rx
> timestamp alone. With these rough numbers (a real implementation would
> not have an #if 0) I would say it matters, but that does not mean we
> should not start simple and just have a big switch to start with. But
> as we add hints (to the same btfid), this will just get worse.
IMHO we *do* already have individual enable/disable hints features via
ethtool.
Have you tried to use the individual ethtool switches? E.g.:
ethtool -K i40e2 rx-checksumming off
The i40e code uses bitfields for extracting the descriptor, which causes
code that isn't optimal or fully optimized by the compiler. On my setup
I gained 4.2% (or 1.24 ns) by doing this.
> Here are some other numbers I got, in case someone is interested. They
> are XDP numbers from xdp_rxq_info in samples/bpf.
>
> hints on / hints off
> XDP_DROP: -18% / -1.5%
My xdp_rxq_info (no-touch XDP_DROP) nanosec numbers are:
hints on / hints off
XDP_DROP: 35.97ns / 29.80ns (diff 6.17 ns)
Maybe interesting: if I touch data (via option --read), the overhead
is reduced to 4.84 ns.
--Jesper
> XDP_TX: -10% / -2.5%
>
>>>>>>> I follow that way:
>>>>>>>
>>>>>>> 1) you pick a program you want to attach;
>>>>>>> 2) usually they are written for special needs and usecases;
>>>>>>> 3) so most likely that program will be tied with metadata/driver/etc
>>>>>>> in some way;
>>>>>>> 4) so you want to enable Hints of a particular format primarily for
>>>>>>> this program and usecase, same with threshold and everything
>>>>>>> else.
>>>>>>>
>>>>>>> Pls explain how you see it, I might be wrong for sure.
>>>>>>
>>>>>> As above: XDP hints is about giving XDP programs (and AF_XDP consumers)
>>>>>> access to metadata that is not currently available. Tying the lifetime
>>>>>> of that hardware configuration (i.e., which information to provide) to
>>>>>> the lifetime of an XDP program is not a good interface: for one thing,
>>>>>> how will it handle multiple programs? What about when XDP is not used at
>>>>>
>>>>> Multiple progs is stuff I didn't cover, but will do later (as you
>>>>> all say to me, "let's start with something simple" :)). Aaaand
>>>>> multiple XDP progs (I'm not talking about attaching progs in
>>>>> different modes) is not a kernel feature, rather a libbpf feature,
>>>>> so I believe it should be handled there later...
>>>>
>>>> Right, but even if we don't *implement* it straight away we still need
>>>> to take it into consideration in the design. And expecting libxdp to
>>>> arbitrate between different XDP programs' metadata formats sounds like a
>>>> royal PITA :)
>>>>
>>>>>> all but you still want to configure the same features?
>>>>>
>>>>> What's the point of configuring metadata when there are no progs
>>>>> attached? To configure it once and not on every prog attach? I'm
>>>>> not saying I don't like it, just want to clarify.
>>>>
>>>> See above: you turn on the features because you want the stack to
>>>> consume them.
>>>>
>>>>> Maybe I need opinions from some more people, just to have an
>>>>> overview of how most of folks see it and would like to configure
>>>>> it. 'Cause I heard from at least one of the consumers that
>>>>> libbpf API is a perfect place for Hints to him :)
>>>>
>>>> Well, as a program author who wants to consume hints, you'd use
>>>> lib{bpf,xdp} APIs to do so (probably in the form of suitable CO-RE
>>>> macros)...
>>>>
>>>>>> In addition, in every other case where we do dynamic data access (with
>>>>>> CO-RE) the BPF program is a consumer that modifies itself to access the
>>>>>> data provided by the kernel. I get that this is harder to achieve for
>>>>>> AF_XDP, but then let's solve that instead of making a totally
>>>>>> inconsistent interface for XDP.
>>>>>
>>>>> I also see CO-RE more fitting and convenient way to use them, but
>>>>> didn't manage to solve two things:
>>>>>
>>>>> 1) AF_XDP programs, so what to do with them? Prepare patches for
>>>>> LLVM to make it able to do CO-RE on AF_XDP program load? Or
>>>>> just hardcode them for particular usecases and NICs? What about
>>>>> "general-purpose" programs?
>>>>
>>>> You provide a library to read the fields. Jesper actually already
>>>> implemented this, did you look at his code?
>>>>
>>>> https://github.com/xdp-project/bpf-examples/tree/master/AF_XDP-interaction
>>>>
>>>> It basically builds a lookup table at load-time using BTF information
>>>> from the kernel, keyed on BTF ID and field name, resolving them into
>>>> offsets. It's not quite the zero-overhead of CO-RE, but it's fairly
>>>> close and can be improved upon (CO-RE for userspace being one way of
>>>> doing that).
>>>
>>> Aaaah, sorry, I completely missed that. I thought of something
>>> similar as well, but then thought "variable field offsets, that
>>> would annihilate optimization and performance", and our Xsk team
>>> is super concerned about performance hits when using Hints.
>>>
>>>>
>>>>> And if hardcode, what's the point then to do Generic Hints at
>>>>> all? Then all it needs is making driver building some meta in
>>>>> front of frames via on-off button and that's it? Why BTF ID in
>>>>> the meta then if consumers will access meta hardcoded (via CO-RE
>>>>> or literally hardcoded, doesn't matter)?
>>>>
>>>> You're quite right, we could probably implement all the access to
>>>> existing (fixed) metadata without using any BTF at all - just define a
>>>> common struct and some flags to designate which fields are set. In my
>>>> mind, there are a couple of reasons for going the BTF route instead:
>>>>
>>>> - We can leverage CO-RE to get close to optimal efficiency in field
>>>> access.
>>>>
>>>> and, more importantly:
>>>>
>>>> - It's infinitely extensible. With the infrastructure in place to make
>>>> it really easy to consume metadata described by BTF, we lower the bar
>>>> for future innovation in hardware offloads. Both for just adding new
>>>> fixed-function stuff to hardware, but especially for fully
>>>> programmable hardware.
>>>
>>> Agree :) That libxdp lookup translator fixed lots of stuff in my
>>> mind.
>>
>> Great! Looks like we're slowly converging towards a shared
>> understanding, then! :)
>>
>>>>> 2) In-kernel metadata consumers? Also do CO-RE? Otherwise, with no
>>>>> generic metadata structure they won't be able to benefit from
>>>>> Hints. But I guess we still need to provide kernel with meta?
>>>>> Or no?
>>>>
>>>> In the short term, I think the "generic structure" approach is fine for
>>>> leveraging this in the stack. Both your and Jesper's series include
>>>> this, and I think that's totally fine. Longer term, if it turns out to
>>>> be useful to have something more dynamic for the stack consumption as
>>>> well, we could extend it to be CO-RE based as well (most likely by
>>>> having the stack load a "translator" BPF program or something along
>>>> those lines).
>>>
>>> Oh, that translator prog sounds nice BTW!
>>
>> Yeah, it's only a rough idea Jesper and I discussed at some point, but I
>> think it could have potential (see also point above re: making XDP hints
>> *the* source of metadata for the whole stack; wouldn't it be nice if
>> drivers didn't have to deal with the intricacies of assembling SKBs?).
>>
>> -Toke
>>
>
^ permalink raw reply [flat|nested] 98+ messages in thread
* [xdp-hints] Re: [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata
2022-07-12 14:14 ` Jesper Dangaard Brouer
@ 2022-07-15 11:11 ` Magnus Karlsson
0 siblings, 0 replies; 98+ messages in thread
From: Magnus Karlsson @ 2022-07-15 11:11 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Toke Høiland-Jørgensen, Jesper Dangaard Brouer,
Alexander Lobakin, John Fastabend, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, Larysa Zaremba,
Michal Swiatkowski, Jesper Dangaard Brouer,
Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
Jonathan Lemon, Lorenzo Bianconi, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Jesse Brandeburg, Yajun Deng,
Willem de Bruijn, bpf, Network Development, open list, xdp-hints
On Tue, Jul 12, 2022 at 4:15 PM Jesper Dangaard Brouer
<jbrouer@redhat.com> wrote:
>
>
>
> On 12/07/2022 12.33, Magnus Karlsson wrote:
> > On Thu, Jul 7, 2022 at 1:25 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >>
> >> Alexander Lobakin <alexandr.lobakin@intel.com> writes:
> >>
> >>> From: Toke Høiland-Jørgensen <toke@redhat.com>
> >>> Date: Tue, 05 Jul 2022 20:51:14 +0200
> >>>
> >>>> Alexander Lobakin <alexandr.lobakin@intel.com> writes:
> >>>>
> >>>> [... snipping a bit of context here ...]
> >>>>
> >>>>>>>> Yeah, I'd agree this kind of configuration is something that can be
> >>>>>>>> added later, and also it's sort of orthogonal to the consumption of the
> >>>>>>>> metadata itself.
> >>>>>>>>
> >>>>>>>> Also, tying this configuration into the loading of an XDP program is a
> >>>>>>>> terrible interface: these are hardware configuration options, let's just
> >>>>>>>> put them into ethtool or 'ip link' like any other piece of device
> >>>>>>>> configuration.
> >>>>>>>
> >>>>>>> I don't believe it fits there, especially Ethtool. Ethtool is for
> >>>>>>> hardware configuration, XDP/AF_XDP is 95% software stuff (apart from
> >>>>>>> offload bits, which are purely NFP's for now).
> >>>>>>
> >>>>>> But XDP-hints is about consuming hardware features. When you're
> >>>>>> configuring which metadata items you want, you're saying "please provide
> >>>>>> me with these (hardware) features". So ethtool is an excellent place to
> >>>>>> do that :)
> >>>>>
> >>>>> With Ethtool you configure the hardware, e.g. it won't strip VLAN
> >>>>> tags if you disable rx-cvlan-stripping. With configuring metadata
> >>>>> you only tell what you want to see there, don't you?
> >>>>
> >>>> Ah, I think we may be getting closer to identifying the disconnect
> >>>> between our way of thinking about this!
> >>>>
> >>>> In my mind, there's no separate "configuration of the metadata" step.
> >>>> You simply tell the hardware what features you want (say, "enable
> >>>> timestamps and VLAN offload"), and the driver will then provide the
> >>>> information related to these features in the metadata area
> >>>> unconditionally. All XDP hints is about, then, is a way for the driver
> >>>> to inform the rest of the system how that information is actually laid
> >>>> out in the metadata area.
> >>>>
> >>>> Having a separate configuration knob to tell the driver "please lay out
> >>>> these particular bits of metadata this way" seems like a totally
> >>>> unnecessary (and quite complicated) feature to have when we can just let
> >>>> the driver decide and use CO-RE to consume it?
> >>>
> >>> Magnus (he's currently on vacation) told me it would be useful for
> >>> AF_XDP to enable/disable particular metadata, at least from a perf
> >>> perspective. Let's say, just fetching one "checksum ok" bit in
> >>> the driver is faster than walking through all the descriptor words
> >>> and driver logic (i.e. there are several hundred LoC in ice which
> >>> just parse descriptor data and build an skb or metadata from it).
> >>> But if we just enabled/disabled the corresponding features through
> >>> Ethtool, that would hurt XDP_PASS. Maybe it's a bad example, but
> >>> what if I want to have only the RSS hash in the metadata (and don't
> >>> want to spend cycles on parsing the rest), but at the same time
> >>> still want the skb path to have the checksum status so it doesn't
> >>> die doing CPU checksum calculation?
> >>
> >> Hmm, so this feels a little like a driver-specific optimisation? I.e.,
> >> my guess is that not all drivers have a measurable overhead for pulling
> >> out the metadata. Also, once the XDP metadata bits are in place, we can
> >> move in the direction of building SKBs from the same source, so I'm not
> >> sure it's a good idea to assume that the XDP metadata is separate from
> >> what the stack consumes...
> >>
> >> In any case, if such an optimisation does turn out to be useful, we can
> >> add it later (backed by rigorous benchmarks, of course), so I think we
> >> can still start with the simple case and iterate from there?
> >
> > Just to check if my intuition was correct or not I ran some benchmarks
> > around this. I ported Jesper's patch set to the zero-copy driver of
> > i40e, which was really simple thanks to Jesper's refactoring. One line
> > of code added to the data path of the zc driver and making
> > i40e_process_xdp_hints() a global function so it can be reached from
> > the zc driver.
>
> Happy to hear it was simple to extend this to AF_XDP in the driver.
> Code-design-wise, I'm trying to keep it simple for drivers to add this.
> I have a co-worker who has already extended ixgbe.
>
> > I also moved the prefetch Jesper added to after the
> > check of whether xdp_hints are available, since it really degrades
> > performance in the xdp_hints-off case.
>
> Good to know.
>
> > First number is the throughput change with hints on, and the second
> > number is with hints off. All are compared to the performance without
> > Jesper's patch set applied. The application is xdpsock -r (which used
> > to be part of the samples/bpf directory).
>
> For reviewers to relate to these numbers, we need to understand/explain
> the extreme numbers we are dealing with. On my system with i40e and
> xdpsock --rx-drop, I can drop packets via AF_XDP at a rate of
> 33,633,761 pps.
>
> This corresponds to a processing time per packet: 29.7 ns (nanosec)
> - Calc: (1/33633761)*10^9
>
> > Copy mode with all hints: -21% / -2%
On my system, the overhead is 66 cycles/packet or 31 ns/packet (2.1
GHz CPU with TurboBoost disabled). Copy-mode only drops packets at a
rate of 8.5 Mpps or 118 ns/packet on my system. The rate you quote
must be for zero-copy as I see something similar there if I enable
TurboBoost on my system.
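
For reference, the unit conversions behind these numbers, as a minimal
standalone sketch (values are taken from the discussion above; nothing
here is from the patch set):

#include <stdio.h>

int main(void)
{
	double pps = 33633761.0;   /* AF_XDP drop rate quoted above */
	double cpu_ghz = 2.1;      /* CPU clock, TurboBoost disabled */
	double cycles = 66.0;      /* copy-mode overhead, cycles/packet */

	/* packets per second -> nanoseconds per packet */
	printf("%.1f ns/packet\n", 1e9 / pps);                  /* ~29.7 */

	/* cycles per packet -> nanoseconds per packet */
	printf("%.1f ns/packet overhead\n", cycles / cpu_ghz);  /* ~31.4 */

	return 0;
}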
> The -21% for enabling all hints does sound like an excessive overhead,
> but time-wise this is a reduction/overhead of 6.2 ns.
>
> The real question: Is this 6.2 ns overhead that gives us e.g.
> RX-checksumming lower than the gain we can obtain from avoiding doing
> RX-checksumming in software?
> - A: My previous experiments conclude[1] that for 1500-byte frames we
> can save 54 ns (or increase performance by 8% for the normal netstack).
Rx-checksumming alone pays off for packets bigger than roughly 500
bytes, if you use copy mode. This is a very rough estimate since I
cannot mix your numbers with mine, but there is a substantial window
where it pays off for sure. For ZC, this window is even larger, see below.
>
> I was going for zero overhead when disabling xdp-hints, which is almost
> true as the -2% is time-wise a reduction/overhead of 0.59 ns.
>
> [1]
> https://github.com/xdp-project/xdp-project/blob/master/areas/core/xdp_frame01_checksum.org#measurements-compare-results--conclusion
>
>
> > Zero-copy mode with all hints: -29% / -9%
>
> I'm unsure why the percentages increase here, perhaps because zero-copy
> is faster and thus the overhead becomes a larger percentage?
For zero-copy, the overhead is 31 cycles/packet or 15 ns/packet on my
system. I would have expected the cycles/packet overhead for copy-mode
and zero-copy mode to be about the same since they use the same hints
code, but it is roughly half for zero-copy. Have not examined why. The
packet processing time without your patches on my system is 36
ns/packet or 27.65 Mpps for zero-copy.
>
> > Copy mode rx timestamp only (the rest removed with an #if 0): -11%
> > Zero-copy mode rx timestamp only: -20%
> >
> > So, if you only want the rx timestamp, but can only enable every hint
> > or nothing, then you get a 10% performance degradation with copy mode
> > and 9% with zero-copy mode compared to being able to enable the rx
> > timestamp alone. With these rough numbers (a real implementation would
> > not have an #if 0) I would say it matters, but that does not mean we
> > should not start simple and just have a big switch to start with. But
> > as we add hints (to the same btfid), this will just get worse.
>
> IMHO we *do* already have individual enable/disable hints features via
> ethtool.
> Have you tried to use the individual ethtool switches? E.g.:
>
> ethtool -K i40e2 rx-checksumming off
>
> The i40e code uses bitfields for extracting the descriptor, which causes
> code that isn't optimal or fully optimized by the compiler. On my setup
> I gained 4.2% (or 1.24 ns) by doing this.
Forgot about that one. Will replace the bitfields and rerun the
experiments to get the overhead down.
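
For context, the difference being discussed looks roughly like this;
the field layout below is made up for illustration and is not the real
i40e descriptor:

#include <stdint.h>

/* Bitfield style: each field read is a separate extract, and the
 * compiler cannot combine checks across fields. */
struct desc_bf {
	uint64_t dd	: 1;
	uint64_t eop	: 1;
	uint64_t l3l4p	: 1;
	uint64_t rsvd	: 35;
	uint64_t length	: 26;
};

/* Open-coded style: load the raw qword once, then use masks;
 * several status bits can be tested with a single AND. */
#define DESC_DD		(1ULL << 0)
#define DESC_EOP	(1ULL << 1)
#define DESC_LEN_S	38
#define DESC_LEN_M	(0x3ffffffULL << DESC_LEN_S)

static inline int desc_done(uint64_t qw)
{
	return (qw & (DESC_DD | DESC_EOP)) == (DESC_DD | DESC_EOP);
}

static inline uint32_t desc_len(uint64_t qw)
{
	return (qw & DESC_LEN_M) >> DESC_LEN_S;
}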
>
> > Here are some other numbers I got, in case someone is interested. They
> > are XDP numbers from xdp_rxq_info in samples/bpf.
> >
> > hints on / hints off
> > XDP_DROP: -18% / -1.5%
>
> My xdp_rxq_info (no-touch XDP_DROP) nanosec numbers are:
>
> hints on / hints off
> XDP_DROP: 35.97ns / 29.80ns (diff 6.17 ns)
>
> Maybe interesting: if I touch data (via option --read), the overhead
> is reduced to 4.84 ns.
Good point. We should always touch the data. Will include that in the
next set of experiments.
> --Jesper
>
> > XDP_TX: -10% / -2.5%
> >
> >>>>>>> I follow that way:
> >>>>>>>
> >>>>>>> 1) you pick a program you want to attach;
> >>>>>>> 2) usually they are written for special needs and usecases;
> >>>>>>> 3) so most likely that program will be tied with metadata/driver/etc
> >>>>>>> in some way;
> >>>>>>> 4) so you want to enable Hints of a particular format primarily for
> >>>>>>> this program and usecase, same with threshold and everything
> >>>>>>> else.
> >>>>>>>
> >>>>>>> Pls explain how you see it, I might be wrong for sure.
> >>>>>>
> >>>>>> As above: XDP hints is about giving XDP programs (and AF_XDP consumers)
> >>>>>> access to metadata that is not currently available. Tying the lifetime
> >>>>>> of that hardware configuration (i.e., which information to provide) to
> >>>>>> the lifetime of an XDP program is not a good interface: for one thing,
> >>>>>> how will it handle multiple programs? What about when XDP is not used at
> >>>>>
> >>>>> Multiple progs is stuff I didn't cover, but will do later (as you
> >>>>> all say to me, "let's start with something simple" :)). Aaaand
> >>>>> multiple XDP progs (I'm not talking about attaching progs in
> >>>>> different modes) is not a kernel feature, rather a libbpf feature,
> >>>>> so I believe it should be handled there later...
> >>>>
> >>>> Right, but even if we don't *implement* it straight away we still need
> >>>> to take it into consideration in the design. And expecting libxdp to
> >>>> arbitrate between different XDP programs' metadata formats sounds like a
> >>>> royal PITA :)
> >>>>
> >>>>>> all but you still want to configure the same features?
> >>>>>
> >>>>> What's the point of configuring metadata when there are no progs
> >>>>> attached? To configure it once and not on every prog attach? I'm
> >>>>> not saying I don't like it, just want to clarify.
> >>>>
> >>>> See above: you turn on the features because you want the stack to
> >>>> consume them.
> >>>>
> >>>>> Maybe I need opinions from some more people, just to have an
> >>>>> overview of how most of folks see it and would like to configure
> >>>>> it. 'Cause I heard from at least one of the consumers that
> >>>>> the libbpf API is a perfect place for Hints to him :)
> >>>>
> >>>> Well, as a program author who wants to consume hints, you'd use
> >>>> lib{bpf,xdp} APIs to do so (probably in the form of suitable CO-RE
> >>>> macros)...
> >>>>
> >>>>>> In addition, in every other case where we do dynamic data access (with
> >>>>>> CO-RE) the BPF program is a consumer that modifies itself to access the
> >>>>>> data provided by the kernel. I get that this is harder to achieve for
> >>>>>> AF_XDP, but then let's solve that instead of making a totally
> >>>>>> inconsistent interface for XDP.
> >>>>>
> >>>>> I also see CO-RE as a more fitting and convenient way to use them, but
> >>>>> didn't manage to solve two things:
> >>>>>
> >>>>> 1) AF_XDP programs, so what to do with them? Prepare patches for
> >>>>> LLVM to make it able to do CO-RE on AF_XDP program load? Or
> >>>>> just hardcode them for particular usecases and NICs? What about
> >>>>> "general-purpose" programs?
> >>>>
> >>>> You provide a library to read the fields. Jesper actually already
> >>>> implemented this, did you look at his code?
> >>>>
> >>>> https://github.com/xdp-project/bpf-examples/tree/master/AF_XDP-interaction
> >>>>
> >>>> It basically builds a lookup table at load-time using BTF information
> >>>> from the kernel, keyed on BTF ID and field name, resolving them into
> >>>> offsets. It's not quite the zero-overhead of CO-RE, but it's fairly
> >>>> close and can be improved upon (CO-RE for userspace being one way of
> >>>> doing that).
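
(For the curious: the core of that lookup-table idea can be sketched
with libbpf's BTF API along the lines below; error handling is trimmed,
the function name is made up, and it assumes type_id names a struct:)

#include <string.h>
#include <bpf/btf.h>

/* Resolve a metadata field's byte offset at load time, keyed on
 * BTF object ID + type ID + field name. Sketch only. */
static int meta_field_offset(__u32 btf_obj_id, __u32 type_id,
			     const char *field)
{
	struct btf *btf = btf__load_from_kernel_by_id(btf_obj_id);
	const struct btf_type *t;
	const struct btf_member *m;
	int i, off = -1;

	if (!btf)
		return -1;

	t = btf__type_by_id(btf, type_id);
	m = btf_members(t);
	for (i = 0; i < btf_vlen(t); i++, m++) {
		const char *name = btf__name_by_offset(btf, m->name_off);

		if (name && !strcmp(name, field)) {
			off = m->offset / 8; /* bits -> bytes; bitfields ignored here */
			break;
		}
	}

	btf__free(btf);
	return off;
}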
> >>>
> >>> Aaaah, sorry, I completely missed that. I thought of something
> >>> similar as well, but then thought "variable field offsets, that
> >>> would annihilate optimization and performance", and our Xsk team
> >>> is super concerned about performance hits when using Hints.
> >>>
> >>>>
> >>>>> And if we hardcode, what's the point of doing Generic Hints at
> >>>>> all? Then all that's needed is making the driver build some meta in
> >>>>> front of frames via an on-off button, and that's it? Why put a BTF
> >>>>> ID in the meta then if consumers will access the meta hardcoded
> >>>>> (via CO-RE or literally hardcoded, doesn't matter)?
> >>>>
> >>>> You're quite right, we could probably implement all the access to
> >>>> existing (fixed) metadata without using any BTF at all - just define a
> >>>> common struct and some flags to designate which fields are set. In my
> >>>> mind, there are a couple of reasons for going the BTF route instead:
> >>>>
> >>>> - We can leverage CO-RE to get close to optimal efficiency in field
> >>>> access.
> >>>>
> >>>> and, more importantly:
> >>>>
> >>>> - It's infinitely extensible. With the infrastructure in place to make
> >>>> it really easy to consume metadata described by BTF, we lower the bar
> >>>> for future innovation in hardware offloads. Both for just adding new
> >>>> fixed-function stuff to hardware, but especially for fully
> >>>> programmable hardware.
> >>>
> >>> Agree :) That libxdp lookup translator fixed lots of stuff in my
> >>> mind.
> >>
> >> Great! Looks like we're slowly converging towards a shared
> >> understanding, then! :)
> >>
> >>>>> 2) In-kernel metadata consumers? Also do CO-RE? Otherwise, with no
> >>>>> generic metadata structure they won't be able to benefit from
> >>>>> Hints. But I guess we still need to provide the kernel with meta?
> >>>>> Or no?
> >>>>
> >>>> In the short term, I think the "generic structure" approach is fine for
> >>>> leveraging this in the stack. Both your and Jesper's series include
> >>>> this, and I think that's totally fine. Longer term, if it turns out to
> >>>> be useful to have something more dynamic for the stack consumption as
> >>>> well, we could extend it to be CO-RE based as well (most likely by
> >>>> having the stack load a "translator" BPF program or something along
> >>>> those lines).
> >>>
> >>> Oh, that translator prog sounds nice BTW!
> >>
> >> Yeah, it's only a rough idea Jesper and I discussed at some point, but I
> >> think it could have potential (see also point above re: making XDP hints
> >> *the* source of metadata for the whole stack; wouldn't it be nice if
> >> drivers didn't have to deal with the intricacies of assembling SKBs?).
> >>
> >> -Toke
> >>
> >
>
^ permalink raw reply [flat|nested] 98+ messages in thread
* [xdp-hints] Re: [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata
2022-06-29 6:15 ` [xdp-hints] Re: [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata John Fastabend
2022-06-29 13:43 ` Toke Høiland-Jørgensen
@ 2022-06-29 17:56 ` Zvi Effron
2022-06-30 7:39 ` Magnus Karlsson
2022-07-04 15:31 ` Alexander Lobakin
2 siblings, 1 reply; 98+ messages in thread
From: Zvi Effron @ 2022-06-29 17:56 UTC (permalink / raw)
To: John Fastabend
Cc: Alexander Lobakin, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Toke Hoiland-Jorgensen,
Lorenzo Bianconi, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jesse Brandeburg, Yajun Deng, Willem de Bruijn, bpf,
netdev, linux-kernel, xdp-hints
On Tue, Jun 28, 2022 at 11:15 PM John Fastabend
<john.fastabend@gmail.com> wrote:
>
> Alexander Lobakin wrote:
> > This RFC is to give the whole picture. It will most likely be split
> > into several series, maybe even across merge cycles. See the "table
> > of contents" below.
>
> Even for an RFC it's a bit much. Probably improve the summary
> message here as well; I'm still not clear on the overall
> architecture, so I'm not sure I want to dig into the patches.
>
> >
> > The series adds the ability to pass different frame
> > details/parameters used by most NICs and the kernel
> > stack (in skbs), not essential, but highly wanted, such as:
> >
> > * checksum value, status (Rx) or command (Tx);
> > * hash value and type/level (Rx);
> > * queue number (Rx);
> > * timestamps;
> > * and so on.
> >
> > As XDP structures used to represent frames are as small as possible
> > and must stay like that, it is done by using the already existing
> > concept of metadata, i.e. some space right before a frame where BPF
> > programs can put arbitrary data.
>
> OK so you stick attributes in the metadata. You can do this without
> touching anything but your driver today. Why not push a patch to
> ice to start doing this? People could start using it today and put
> it in some feature flag.
>
> I get everyone wants some grand theory around this but again one
> patch would do it and your customers could start using it. Show
> a benchmark with 20% speedup or whatever with small XDP prog
> update and you win.
>
> >
> > Now, a NIC driver, or even a SmartNIC itself, can put those params
> > there in a well-defined format. The format is fixed, but can be of
> > several different types represented by structures, whose definitions
> > are available to the kernel, BPF programs and the userland.
>
> I don't think in general the format needs to be fixed.
>
> > It is fixed due to it being almost a UAPI, and the exact format can
> > be determined by reading the last 10 bytes of metadata. They contain
> > a 2-byte magic ID to not confuse it with a non-compatible meta and
> > an 8-byte combined BTF ID + type ID: the ID of the BTF where this
> > structure is defined and the ID of that definition inside that BTF.
> > Users can obtain BTF IDs by structure types using helpers available
> > in the kernel, BPF (written by the CO-RE/verifier) and the userland
> > (libbpf -> kernel call) and then rely on those IDs when reading data
> > to make sure whether they support it and what to do with it.
> > Why separate magic and ID? The idea is to make different formats
> > always contain the basic/"generic" structure embedded at the end.
> > This way we can still benefit in purely generic consumers (like
> > cpumap) while providing some "extra" data to those who support it.
>
> I don't follow this. If you have a struct in your driver name it
> something obvious, ice_xdp_metadata. If I understand things
> correctly just dump the BTF for the driver, extract the
> struct, and done: you can use CO-RE reads. For the 'fixed' case
> this looks easy. And I don't think you even need a patch for this.
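
For reference, the trailing-meta layout the cover letter describes
works out to roughly the following; the struct/field names and the
field order are illustrative assumptions, not taken from the patches:

#include <linux/types.h>

/* The last 10 bytes of the metadata area: a 2-byte magic plus an
 * 8-byte combined ID, with the BTF object ID in the upper 32 bits
 * and the type ID in the lower 32 (per the cover letter). */
struct xdp_meta_tail_example {
	__u16 magic;		/* guards against non-compatible meta */
	__u64 full_btf_id;	/* (btf_obj_id << 32) | btf_type_id */
} __attribute__((packed));

A consumer would check the magic first, then compare full_btf_id
against the IDs it knows how to parse.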
>
> >
> > The enablement of this feature is controlled on attaching/replacing
> > XDP program on an interface with two new parameters: that combined
> > BTF+type ID and metadata threshold.
> > The threshold specifies the minimum frame size from which a driver
> > (or NIC) should start composing metadata. It is introduced instead
> > of just a false/true flag because it's often not worth spending
> > cycles to fetch all that data for such small frames: it can even be
> > faster to just calculate checksums for them on the CPU rather than
> > touch the non-coherent DMA zone. The simple XDP_DROP case loses
> > 15 Mpps on 64-byte frames with metadata enabled; the threshold can
> > help mitigate that.
>
> I would put this in the bonus category. Can you do the simple thing
> above without these extra bits and then add them later? Just
> pick some overly conservative threshold to start with.
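
As a rough illustration of what such a knob boils down to on the
driver's Rx path (hypothetical names, not from the actual ice patches):

#include <stdbool.h>
#include <linux/types.h>

struct rx_meta_cfg {
	__u32 thresh;	/* 0 == hints disabled entirely */
};

/* Hints are only composed for frames of at least `thresh` bytes,
 * so small frames keep the cheap no-metadata path. */
static inline bool xdp_meta_wanted(const struct rx_meta_cfg *cfg,
				   __u32 pkt_len)
{
	return cfg->thresh && pkt_len >= cfg->thresh;
}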
>
> >
> > The RFC can be divided into 8 parts:
>
> I'm missing something: why not do the simplest bit of work and
> get this running in ice with a few smallish driver updates
> so we can all see it. No need for so many patches.
>
> >
> > 01-04: BTF ID hacking: here Larysa provides BPF programs with not
> > only type ID, but the ID of the BTF as well by using the
> > unused upper 32 bits.
> > 05-10: this provides in-kernel mechanisms for taking ID and
> > threshold from the userspace and passing it to the drivers.
> > 11-18: provides libbpf API to be able to specify those params from
> > the userspace, plus some small selftest to verify that both
> > the kernel and the userspace parts work.
> > 19-29: here the actual structure is defined, then the in-kernel
> > helpers and finally here comes the first consumer: function
> > used to convert &xdp_frame to &sk_buff now will be trying
> > to parse metadata. The affected users are cpumap and veth.
> > 30-36: here I try to benefit from the metadata in cpumap even more
> > by switching it to GRO. Now that we have checksums from NIC
> > available... but even with no meta it gives some fair
> > improvements.
> > 37-43: enabling building generic metadata on Generic/skb path. Since
> > skbs already have all those fields, it's not a problem to do
> > this in here, plus allows to benefit from it on interfaces
> > not supporting meta yet.
> > 44-47: ice driver part, including enabling prog hot-swap;
> > 48-52: adds a complex selftest to verify everything works. Can be
> > used as a sample as well, showing how to work with metadata
> > in BPF programs and how to configure it from the userspace.
> >
> > Please refer to the actual commit messages where some precise
> > implementation details might be explained.
> > Nearly 20 of 52 are various cleanups and prereqs, as usual.
> >
> > Perf figures were taken on cpumap redirect from the ice interface
> > (driver-side XDP), redirecting the traffic within the same node.
> >
> > Frame size / 64/42 128/20 256/8 512/4 1024/2 1532/1
> > thread num
>
> You'll have to remind me what's the production use case for
> cpu_map on a modern NIC or even a SmartNIC? Why are you not
> just using hardware queues and redirecting to the right
> queues in hardware to start with?
>
> Also my understanding is if you do XDP_PASS up the stack
> the skb is built with all the normal good stuff from hw
> descriptor. Sorry going to need some extra context here
> to understand.
>
> Could you do a benchmark for AF_XDP? I thought this was
> the troublesome use case where the user space ring lost
> the hardware info e.g. timestamps and checksum values.
>
> >
> > meta off 30022 31350 21993 12144 6374 3610
> > meta on 33059 28502 21503 12146 6380 3610
> > GRO meta off 30020 31822 21970 12145 6384 3610
> > GRO meta on 34736 28848 21566 12144 6381 3610
> >
> > Yes, redirect between the nodes plays awfully with the metadata
> > composed by the driver:
>
> Many production use cases use XDP exactly for this. If it
> slows this basic use case down it's going to be very hard
> to use in many environments. Likely it won't be used.
>
> >
> > meta off 21449 18078 16897 11820 6383 3610
> > meta on 16956 19004 14337 8228 5683 2822
> > GRO meta off 22539 19129 16304 11659 6381 3592
> > GRO meta on 17047 20366 15435 8878 5600 2753
>
> Do you have hardware that can write the data into the
> metadata region so you don't do it in software? Seems
> like it should be doable without much trouble and would
> make this more viable.
>
> >
> > Questions still open:
> >
> > * the actual generic structure: it must have all the fields used
> > often and by the majority of NICs. It can always be expanded
> > later on (note that the structure grows to the left), but the
> > less often the UAPI is modified, the better (less compat pain);
>
> I don't believe a generic structure is needed.
>
> > * ability to specify the exact fields to fill by the driver, e.g.
> > flags bitmap passed from the userspace. In theory it can be more
> > optimal to not spend cycles on data we don't need, but at the
> > same time increases the complexity of the whole concept (e.g. it
> > will be more problematic to unify drivers' routines for collecting
> > data from descriptors to metadata and to skbs);
> > * there was an idea to be able to specify from the userspace the
> > desired cacheline offset, so that [the wanted fields of] metadata
> > and the packet headers would lie in the same CL. Can't be
> > implemented in Generic/skb XDP and ice has some troubles with it
> > too;
> > * lacks AF_XDP/XSk perf numbers and various other scenarios in
> > general; is the current implementation optimal for them?
>
> AF_XDP is the primary use case from my understanding.
>
AF_XDP is a use case, and might be the primary, but we work with pure XDP and
have been waiting for the ability to take advantage of the hardware checksums
for years. It would be a very large performance boost for us (in theory) as
we're currently having to verify the checksums ourselves in software, and
recompute them on modifications (since we can't use hardware TX checksums).
Also, if I understand correctly, if the functionality is available to pure XDP,
AF_XDP could benefit from it by having the XDP program that redirects to AF_XDP
copy it into metadata where AF_XDP can find it, thanks to the user-defined
contract between the XDP program and the userspace program? (Not as efficient,
obviously, and duplicative, but would work, I think.)
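
A minimal sketch of such a contract on the XDP side; the struct layout
and map name are made-up examples, while bpf_xdp_adjust_meta() and
bpf_redirect_map() are the existing helpers:

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct my_meta {		/* the user-defined contract */
	__u32 rx_hash;
	__u32 csum_ok;
};

struct {
	__uint(type, BPF_MAP_TYPE_XSKMAP);
	__uint(max_entries, 64);
	__type(key, __u32);
	__type(value, __u32);
} xsks SEC(".maps");

SEC("xdp")
int xdp_meta_to_af_xdp(struct xdp_md *ctx)
{
	struct my_meta *meta;
	void *data;

	/* grow the metadata area in front of the frame */
	if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(*meta)))
		return XDP_PASS;

	data = (void *)(long)ctx->data;
	meta = (void *)(long)ctx->data_meta;
	if ((void *)(meta + 1) > data)	/* verifier bounds check */
		return XDP_PASS;

	meta->rx_hash = 0;	/* would come from the hints area */
	meta->csum_ok = 1;

	return bpf_redirect_map(&xsks, ctx->rx_queue_index, XDP_PASS);
}

char _license[] SEC("license") = "GPL";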
> > * metadata threshold and everything else present in this
> > implementation.
>
> I really think you're asking questions that are two or three
> jumps away. Why not do the simplest bit first and kick
> the driver with an on/off switch into this mode. But
> I don't understand this cpumap use case so maybe explain
> that first.
>
> And sorry, didn't even look at your 50+ patches. Figure let's
> get agreement on the goal first.
>
> .John
^ permalink raw reply [flat|nested] 98+ messages in thread
* [xdp-hints] Re: [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata
2022-06-29 17:56 ` Zvi Effron
@ 2022-06-30 7:39 ` Magnus Karlsson
0 siblings, 0 replies; 98+ messages in thread
From: Magnus Karlsson @ 2022-06-30 7:39 UTC (permalink / raw)
To: Zvi Effron
Cc: John Fastabend, Alexander Lobakin, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, Larysa Zaremba,
Michal Swiatkowski, Jesper Dangaard Brouer,
Björn Töpel, Magnus Karlsson, Maciej Fijalkowski,
Jonathan Lemon, Toke Hoiland-Jorgensen, Lorenzo Bianconi,
David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Jesse Brandeburg, Yajun Deng, Willem de Bruijn, bpf,
Network Development, open list, xdp-hints
On Wed, Jun 29, 2022 at 7:57 PM Zvi Effron <zeffron@riotgames.com> wrote:
>
> On Tue, Jun 28, 2022 at 11:15 PM John Fastabend
> <john.fastabend@gmail.com> wrote:
> >
> > Alexander Lobakin wrote:
> > > This RFC is to give the whole picture. It will most likely be split
> > > into several series, maybe even across merge cycles. See the "table
> > > of contents" below.
> >
> > Even for an RFC it's a bit much. Probably improve the summary
> > message here as well; I'm still not clear on the overall
> > architecture, so I'm not sure I want to dig into the patches.
> >
> > >
> > > The series adds the ability to pass different frame
> > > details/parameters used by most NICs and the kernel
> > > stack (in skbs), not essential, but highly wanted, such as:
> > >
> > > * checksum value, status (Rx) or command (Tx);
> > > * hash value and type/level (Rx);
> > > * queue number (Rx);
> > > * timestamps;
> > > * and so on.
> > >
> > > As XDP structures used to represent frames are as small as possible
> > > and must stay like that, it is done by using the already existing
> > > concept of metadata, i.e. some space right before a frame where BPF
> > > programs can put arbitrary data.
> >
> > OK so you stick attributes in the metadata. You can do this without
> > touching anything but your driver today. Why not push a patch to
> > ice to start doing this? People could start using it today and put
> > it in some feature flag.
> >
> > I get everyone wants some grand theory around this but again one
> > patch would do it and your customers could start using it. Show
> > a benchmark with 20% speedup or whatever with small XDP prog
> > update and you win.
> >
> > >
> > > Now, a NIC driver, or even a SmartNIC itself, can put those params
> > > there in a well-defined format. The format is fixed, but can be of
> > > several different types represented by structures, whose definitions
> > > are available to the kernel, BPF programs and the userland.
> >
> > I don't think in general the format needs to be fixed.
> >
> > > It is fixed due to it being almost a UAPI, and the exact format can
> > > be determined by reading the last 10 bytes of metadata. They contain
> > > a 2-byte magic ID to not confuse it with a non-compatible meta and
> > > an 8-byte combined BTF ID + type ID: the ID of the BTF where this
> > > structure is defined and the ID of that definition inside that BTF.
> > > Users can obtain BTF IDs by structure types using helpers available
> > > in the kernel, BPF (written by the CO-RE/verifier) and the userland
> > > (libbpf -> kernel call) and then rely on those IDs when reading data
> > > to make sure whether they support it and what to do with it.
> > > Why separate magic and ID? The idea is to make different formats
> > > always contain the basic/"generic" structure embedded at the end.
> > > This way we can still benefit in purely generic consumers (like
> > > cpumap) while providing some "extra" data to those who support it.
> >
> > I don't follow this. If you have a struct in your driver name it
> > something obvious, ice_xdp_metadata. If I understand things
> > correctly just dump the BTF for the driver, extract the
> > struct, and done: you can use CO-RE reads. For the 'fixed' case
> > this looks easy. And I don't think you even need a patch for this.
> >
> > >
> > > The enablement of this feature is controlled on attaching/replacing
> > > XDP program on an interface with two new parameters: that combined
> > > BTF+type ID and metadata threshold.
> > > The threshold specifies the minimum frame size from which a driver
> > > (or NIC) should start composing metadata. It is introduced instead
> > > of just a false/true flag because it's often not worth spending
> > > cycles to fetch all that data for such small frames: it can even be
> > > faster to just calculate checksums for them on the CPU rather than
> > > touch the non-coherent DMA zone. The simple XDP_DROP case loses
> > > 15 Mpps on 64-byte frames with metadata enabled; the threshold can
> > > help mitigate that.
> >
> > I would put this in the bonus category. Can you do the simple thing
> > above without these extra bits and then add them later? Just
> > pick some overly conservative threshold to start with.
> >
> > >
> > > The RFC can be divided into 8 parts:
> >
> > I'm missing something: why not do the simplest bit of work and
> > get this running in ice with a few smallish driver updates
> > so we can all see it. No need for so many patches.
> >
> > >
> > > 01-04: BTF ID hacking: here Larysa provides BPF programs with not
> > > only type ID, but the ID of the BTF as well by using the
> > > unused upper 32 bits.
> > > 05-10: this provides in-kernel mechanisms for taking ID and
> > > threshold from the userspace and passing it to the drivers.
> > > 11-18: provides libbpf API to be able to specify those params from
> > > the userspace, plus some small selftest to verify that both
> > > the kernel and the userspace parts work.
> > > 19-29: here the actual structure is defined, then the in-kernel
> > > helpers and finally here comes the first consumer: function
> > > used to convert &xdp_frame to &sk_buff now will be trying
> > > to parse metadata. The affected users are cpumap and veth.
> > > 30-36: here I try to benefit from the metadata in cpumap even more
> > > by switching it to GRO. Now that we have checksums from NIC
> > > available... but even with no meta it gives some fair
> > > improvements.
> > > 37-43: enabling building generic metadata on Generic/skb path. Since
> > > skbs already have all those fields, it's not a problem to do
> > > this in here, plus allows to benefit from it on interfaces
> > > not supporting meta yet.
> > > 44-47: ice driver part, including enabling prog hot-swap;
> > > 48-52: adds a complex selftest to verify everything works. Can be
> > > used as a sample as well, showing how to work with metadata
> > > in BPF programs and how to configure it from the userspace.
> > >
> > > Please refer to the actual commit messages where some precise
> > > implementation details might be explained.
> > > Nearly 20 of 52 are various cleanups and prereqs, as usual.
> > >
> > > Perf figures were taken on cpumap redirect from the ice interface
> > > (driver-side XDP), redirecting the traffic within the same node.
> > >
> > > Frame size / 64/42 128/20 256/8 512/4 1024/2 1532/1
> > > thread num
> >
> > You'll have to remind me what's the production use case for
> > cpu_map on a modern NIC or even a SmartNIC? Why are you not
> > just using hardware queues and redirecting to the right
> > queues in hardware to start with?
> >
> > Also my understanding is if you do XDP_PASS up the stack
> > the skb is built with all the normal good stuff from hw
> > descriptor. Sorry going to need some extra context here
> > to understand.
> >
> > Could you do a benchmark for AF_XDP? I thought this was
> > the troublesome use case where the user space ring lost
> > the hardware info e.g. timestamps and checksum values.
> >
> > >
> > > meta off 30022 31350 21993 12144 6374 3610
> > > meta on 33059 28502 21503 12146 6380 3610
> > > GRO meta off 30020 31822 21970 12145 6384 3610
> > > GRO meta on 34736 28848 21566 12144 6381 3610
> > >
> > > Yes, redirect between the nodes plays awfully with the metadata
> > > composed by the driver:
> >
> > Many production use cases use XDP exactly for this. If it
> > slows this basic use case down it's going to be very hard
> > to use in many environments. Likely it won't be used.
> >
> > >
> > > meta off 21449 18078 16897 11820 6383 3610
> > > meta on 16956 19004 14337 8228 5683 2822
> > > GRO meta off 22539 19129 16304 11659 6381 3592
> > > GRO meta on 17047 20366 15435 8878 5600 2753
> >
> > Do you have hardware that can write the data into the
> > metadata region so you don't do it in software? Seems
> > like it should be doable without much trouble and would
> > make this more viable.
> >
> > >
> > > Questions still open:
> > >
> > > * the actual generic structure: it must have all the fields used
> > > often and by the majority of NICs. It can always be expanded
> > > later on (note that the structure grows to the left), but the
> > > less often the UAPI is modified, the better (less compat pain);
> >
> > I don't believe a generic structure is needed.
> >
> > > * ability to specify the exact fields to fill by the driver, e.g.
> > > flags bitmap passed from the userspace. In theory it can be more
> > > optimal to not spend cycles on data we don't need, but at the
> > > same time increases the complexity of the whole concept (e.g. it
> > > will be more problematic to unify drivers' routines for collecting
> > > data from descriptors to metadata and to skbs);
> > > * there was an idea to be able to specify from the userspace the
> > > desired cacheline offset, so that [the wanted fields of] metadata
> > > and the packet headers would lie in the same CL. Can't be
> > > implemented in Generic/skb XDP and ice has some troubles with it
> > > too;
> > > * lacks AF_XDP/XSk perf numbers and various other scenarios in
> > > general; is the current implementation optimal for them?
> >
> > AF_XDP is the primary use case from my understanding.
> >
>
> AF_XDP is a use case, and might be the primary, but we work with pure XDP and
> have been waiting for the ability to take advantage of the hardware checksums
> for years. It would be a very large performance boost for us (in theory) as
> we're currently having to verify the checksums ourselves in software, and
> recompute them on modifications (since we can't use hardware TX checksums).
Yes, there are plenty of people out there that want that boost both
for XDP and AF_XDP, me included. So this work is important.
> Also, if I understand correctly, if the functionality is available to pure XDP,
> AF_XDP could benefit from it by having the XDP program that redirects to AF_XDP
> copy it into metadata where AF_XDP can find it, thanks to the user-defined
> contract between the XDP program and the userspace program? (Not as efficient,
> obviously, and duplicative, but would work, I think.)
Correct, AF_XDP can benefit from this directly. The metadata is
already put before the packet, so in zero-copy mode it will be
available to user-space at no extra cost. In copy mode though, it has
to be copied out together with the packet data. But that code is
already there since you can use the metadata section today for
communicating information between XDP and user-space via AF_XDP.
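
On the userspace side, reading that metadata is just pointer
arithmetic in front of the packet; a sketch reusing the same made-up
struct my_meta contract as in the XDP program sketch above:

#include <linux/types.h>
#include <xdp/xsk.h>	/* libxdp; <bpf/xsk.h> in older libbpf */

struct my_meta {
	__u32 rx_hash;
	__u32 csum_ok;
};

/* addr comes from the RX descriptor and points at the packet
 * start; the XDP-written metadata sits directly in front of it. */
static struct my_meta *pkt_meta(void *umem_area, __u64 addr)
{
	return xsk_umem__get_data(umem_area,
				  addr - sizeof(struct my_meta));
}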
> > > * metadata threshold and everything else present in this
> > > implementation.
> >
> > I really think you're asking questions that are two or three
> > jumps away. Why not do the simplest bit first and kick
> > the driver with an on/off switch into this mode. But
> > I don't understand this cpumap use case so maybe explain
> > that first.
> >
> > And sorry, didn't even look at your 50+ patches. Figure let's
> > get agreement on the goal first.
> >
> > .John
^ permalink raw reply [flat|nested] 98+ messages in thread
* [xdp-hints] Re: [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata
2022-06-29 6:15 ` [xdp-hints] Re: [PATCH RFC bpf-next 00/52] bpf, xdp: introduce and use Generic Hints/metadata John Fastabend
2022-06-29 13:43 ` Toke Høiland-Jørgensen
2022-06-29 17:56 ` Zvi Effron
@ 2022-07-04 15:31 ` Alexander Lobakin
2 siblings, 0 replies; 98+ messages in thread
From: Alexander Lobakin @ 2022-07-04 15:31 UTC (permalink / raw)
To: John Fastabend
Cc: Alexander Lobakin, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Larysa Zaremba, Michal Swiatkowski,
Jesper Dangaard Brouer, Björn Töpel, Magnus Karlsson,
Maciej Fijalkowski, Jonathan Lemon, Toke Hoiland-Jorgensen,
Lorenzo Bianconi, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Jesse Brandeburg, Yajun Deng, Willem de Bruijn, bpf,
netdev, linux-kernel, xdp-hints
From: John Fastabend <john.fastabend@gmail.com>
Date: Tue, 28 Jun 2022 23:15:12 -0700
> Alexander Lobakin wrote:
> > This RFC is to give the whole picture. It will most likely be split
> > into several series, maybe even across merge cycles. See the "table
> > of contents" below.
>
> Even for an RFC it's a bit much. Probably improve the summary
> message here as well; I'm still not clear on the overall
> architecture, so I'm not sure I want to dig into the patches.
I wanted to give an overview of the whole code that I have for now.
We have several series of 30+, even 60+, patches next to this one;
I'm not sure I've seen a single "TL;DR" there. The same was true of
Matt's folio series: it was 165+ patches, OK, he didn't send it as
is to LKML, but his words were "this is the first part, please take
a look at the whole concept here[link]".
I provided a "table of contents", so that anyone could go and
review only stuff he's interested in, not touching the rest.
My intention is that what if I submit the first part, then second,
and then at the third someone says he doesn't like the idea and
NACKs it, what then, if it all was supposed to work together as one?
>
> >
> > The series adds the ability to pass different frame
> > details/parameters used by most NICs and the kernel
> > stack (in skbs), not essential, but highly wanted, such as:
> >
> > * checksum value, status (Rx) or command (Tx);
> > * hash value and type/level (Rx);
> > * queue number (Rx);
> > * timestamps;
> > * and so on.
> >
> > As XDP structures used to represent frames are as small as possible
> > and must stay like that, it is done by using the already existing
> > concept of metadata, i.e. some space right before a frame where BPF
> > programs can put arbitrary data.
>
> OK so you stick attributes in the metadata. You can do this without
> touching anything but your driver today. Why not push a patch to
> ice to start doing this? People could start using it today and put
> it in some feature flag.
We have to configure it somehow, don't we? Do a feature flag for
ice? Why do anything "generic" then if I could just go with one
driver and send it through our Intel mailing list?
>
> I get everyone wants some grand theory around this but again one
> patch would do it and your customers could start using it. Show
> a benchmark with 20% speedup or whatever with small XDP prog
> update and you win.
One patch would mean hardwiring/hardcoding everything onto one
button; what's the point, if there have been several such examples?
This is an RFC because the whole thing needs to be discussed, not
because I have some drafts and want to show them. It's finished
and polished production-quality code which any vendor or customer
could start using without re-hardcoding it for their own
driver/needs/etc.
I'm not following this "TL;DR" stuff; one can just apply the series
and see how it goes/works for their needs (and then get back and
report) even if they don't feel like reviewing it.
>
> >
> > Now, a NIC driver, or even a SmartNIC itself, can put those params
> > there in a well-defined format. The format is fixed, but can be of
> > several different types represented by structures, whose definitions
> > are available to the kernel, BPF programs and the userland.
>
> I don't think in general the format needs to be fixed.
We discussed it previously as well, not only in regard to this
stuff, but in general. For BPF programs, for sure we can CO-RE
everything, but we also have: a) in-kernel users not hardcoded to
a particular vendor/driver which just want to have generic fields
in one format for every driver; b) AF_XDP/XSk programs which you
can't CO-RE. There was a proposal from Alexei to patch LLVM to be
able to apply CO-RE for AF_XDP (I mean, for ARM64/x86_64/etc.
binaries) as well, but it's a whole different story with way more
caveats.
>
> > It is fixed due to it being almost a UAPI, and the exact format can
> > be determined by reading the last 10 bytes of metadata. They contain
> > a 2-byte magic ID to not confuse it with a non-compatible meta and
> > an 8-byte combined BTF ID + type ID: the ID of the BTF where this
> > structure is defined and the ID of that definition inside that BTF.
> > Users can obtain BTF IDs by structure types using helpers available
> > in the kernel, BPF (written by the CO-RE/verifier) and the userland
> > (libbpf -> kernel call) and then rely on those IDs when reading data
> > to make sure whether they support it and what to do with it.
> > Why separate magic and ID? The idea is to make different formats
> > always contain the basic/"generic" structure embedded at the end.
> > This way we can still benefit in purely generic consumers (like
> > cpumap) while providing some "extra" data to those who support it.
>
> I don't follow this. If you have a struct in your driver name it
> something obvious, ice_xdp_metadata. If I understand things
> correctly just dump the BTF for the driver, extract the
> struct, and done: you can use CO-RE reads. For the 'fixed' case
> this looks easy. And I don't think you even need a patch for this.
>
> >
> > The enablement of this feature is controlled on attaching/replacing
> > XDP program on an interface with two new parameters: that combined
> > BTF+type ID and metadata threshold.
> > The threshold specifies the minimum frame size from which a driver
> > (or NIC) should start composing metadata. It is introduced instead
> > of just a false/true flag because it's often not worth spending
> > cycles to fetch all that data for such small frames: it can even be
> > faster to just calculate checksums for them on the CPU rather than
> > touch the non-coherent DMA zone. The simple XDP_DROP case loses
> > 15 Mpps on 64-byte frames with metadata enabled; the threshold can
> > help mitigate that.
>
> I would put this in the bonus category. Can you do the simple thing
> above without these extra bits and then add them later? Just
> pick some overly conservative threshold to start with.
It's as simple as adding an on/off button; there's no reason to
leave it for later. Or is there?
>
> >
> > The RFC can be divided into 8 parts:
>
> I'm missing something: why not do the simplest bit of work and
> get this running in ice with a few smallish driver updates
> so we can all see it. No need for so many patches.
OK, I should've written this down in the cover: it's not a draft or
some hardcode just to show a PoC...
>
> >
> > 01-04: BTF ID hacking: here Larysa provides BPF programs with not
> > only type ID, but the ID of the BTF as well by using the
> > unused upper 32 bits.
> > 05-10: this provides in-kernel mechanisms for taking ID and
> > threshold from the userspace and passing it to the drivers.
> > 11-18: provides libbpf API to be able to specify those params from
> > the userspace, plus some small selftest to verify that both
> > the kernel and the userspace parts work.
> > 19-29: here the actual structure is defined, then the in-kernel
> > helpers and finally here comes the first consumer: function
> > used to convert &xdp_frame to &sk_buff now will be trying
> > to parse metadata. The affected users are cpumap and veth.
> > 30-36: here I try to benefit from the metadata in cpumap even more
> > by switching it to GRO. Now that we have checksums from NIC
> > available... but even with no meta it gives some fair
> > improvements.
> > 37-43: enabling building generic metadata on Generic/skb path. Since
> > skbs already have all those fields, it's not a problem to do
> > this in here, plus allows to benefit from it on interfaces
> > not supporting meta yet.
> > 44-47: ice driver part, including enabling prog hot-swap;
> > 48-52: adds a complex selftest to verify everything works. Can be
> > used as a sample as well, showing how to work with metadata
> > in BPF programs and how to configure it from the userspace.
> >
> > Please refer to the actual commit messages where some precise
> > implementation details might be explained.
> > Nearly 20 of 52 are various cleanups and prereqs, as usual.
> >
> > Perf figures were taken on cpumap redirect from the ice interface
> > (driver-side XDP), redirecting the traffic within the same node.
> >
> > Frame size / 64/42 128/20 256/8 512/4 1024/2 1532/1
> > thread num
>
> You'll have to remind me what's the production use case for
> cpu_map on a modern NIC or even a SmartNIC? Why are you not
> just using hardware queues and redirecting to the right
> queues in hardware to start with?
Load balancing: you can distribute packets not only by flow, but
however you wish, as you have full access to the frames. Also, with
RSS/RFS you serve interrupts and push a frame through the networking
stack on the same CPU; with cpumap you can do the former on one CPU
and the latter on another, which is obviously faster.
>
> Also my understanding is if you do XDP_PASS up the stack
> the skb is built with all the normal good stuff from hw
> descriptor. Sorry going to need some extra context here
> to understand.
Correct, so this series makes cpumap on par with (probably even
better than) just %XDP_PASS.
>
> Could you do a benchmark for AF_XDP? I thought this was
> the troublesome use case where the user space ring lost
> the hardware info e.g. timestamps and checksum values.
OK, sure, a bit later. I wasn't focusing on AF_XDP, but it's in
the near-term plans.
>
> >
> > meta off 30022 31350 21993 12144 6374 3610
> > meta on 33059 28502 21503 12146 6380 3610
> > GRO meta off 30020 31822 21970 12145 6384 3610
> > GRO meta on 34736 28848 21566 12144 6381 3610
> >
> > Yes, redirect between the nodes plays awfully with the metadata
> > composed by the driver:
>
> Many production use cases use XDP exactly for this. If it
> slows this basic use case down it's going to be very hard
> to use in many environments. Likely it won't be used.
Redirecting between nodes is not a good idea in general, as you will
be working with remote memory on each redirect. Not sure it's widely
used.
And yes, SmartNICs don't have that problem if they're capable of
composing arbitrary meta themselves.
>
> >
> > meta off 21449 18078 16897 11820 6383 3610
> > meta on 16956 19004 14337 8228 5683 2822
> > GRO meta off 22539 19129 16304 11659 6381 3592
> > GRO meta on 17047 20366 15435 8878 5600 2753
>
> Do you have hardware that can write the data into the
> metadata region so you don't do it in software? Seems
> like it should be doable without much trouble and would
> make this more viable.
For now I personally don't, but: a) some people do; b) I will in
some time. IIRC, we were catering for both SmartNICs and "current
generation" NICs, not giving favour to either of them.
>
> >
> > Questions still open:
> >
> > * the actual generic structure: it must have all the fields used
> > often and by the majority of NICs. It can always be expanded
> > later on (note that the structure grows to the left), but the
> > less often the UAPI is modified, the better (less compat pain);
>
> I don't believe a generic structure is needed.
Please see above.
>
> > * ability to specify the exact fields to fill by the driver, e.g.
> > flags bitmap passed from the userspace. In theory it can be more
> > optimal to not spend cycles on data we don't need, but at the
> > same time increases the complexity of the whole concept (e.g. it
> > will be more problematic to unify drivers' routines for collecting
> > data from descriptors to metadata and to skbs);
> > * there was an idea to be able to specify from the userspace the
> > desired cacheline offset, so that [the wanted fields of] metadata
> > and the packet headers would lie in the same CL. Can't be
> > implemented in Generic/skb XDP and ice has some troubles with it
> > too;
> > * lacks AF_XDP/XSk perf numbers and various other scenarios in
> > general; is the current implementation optimal for them?
>
> AF_XDP is the primary use case from my understanding.
Not really, but is one of them.
>
> > * metadata threshold and everything else present in this
> > implementation.
>
> I really think you're asking questions that are two or three
> jumps away. Why not do the simplest bit first and kick
> the driver with an on/off switch into this mode. But
> I don't understand this cpumap use case so maybe explain
> that first.
>
> And sorry, didn't even look at your 50+ patches. Figure let's
> get agreement on the goal first.
"TL;DR" will kill open source once ._. As I said, you could just
pick whatever you want to look at, I never said "you, go and
review cpumap GRO stuff" to anyone.
>
> .John
Thanks,
Olek
^ permalink raw reply [flat|nested] 98+ messages in thread