[PATCH 00/32] bcachefs - a new COW filesystem

source link: https://lore.kernel.org/lkml/[email protected]/T/

* [PATCH 00/32] bcachefs - a new COW filesystem
@ 2023-05-09 16:56 Kent Overstreet
  2023-05-09 16:56 ` [PATCH 01/32] Compiler Attributes: add __flatten Kent Overstreet
                   ` (31 more replies)
  0 siblings, 32 replies; 73+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-block, linux-mm, linux-bcachefs
  Cc: Kent Overstreet, viro, akpm, boqun.feng, brauner, hch, colyli,
	djwong, mingo, jack, axboe, willy, ojeda, ming.lei, ndesaulniers,
	peterz, phillip, urezki, longman, will

I'm submitting the bcachefs filesystem for review and inclusion.

Included in this patch series are all the non-fs/bcachefs/ patches. The
entire tree, based on v6.3, may be found at:

  http://evilpiepirate.org/git/bcachefs.git bcachefs-for-upstream

----------------------------------------------------------------

bcachefs overview, status:

Features:
 - too many to list

Known bugs:
 - too many to list

Status:
 - Snapshots have been declared stable; one serious bug report remains
   outstanding to look into, and most users report it working well.

   These are RW btrfs-style snapshots, but with far better scalability:
   thanks to key-level versioning, sparse snapshots cause no scalability
   issues.

 - Erasure coding is getting really close; I hope to have it ready for
   users to beat on by this summer. This is a novel RAID/erasure coding
   design with no write hole, and no fragmentation of writes (as happens
   with e.g. RAIDZ).

 - Tons of scalability work finished over the past year; users are
   running it on 100 TB filesystems without complaint, waiting for the
   first 1 PB user. The next thing to address re: scalability is
   fsck/recovery memory usage.

 - Test infrastructure! Major project milestone, check out our test
   dashboard at
     https://evilpiepirate.org/~testdashboard/ci?branch=bcachefs

Other project notes:

irc://irc.oftc.net/bcache is where most activity happens; I'm always
there, and most code review happens there - I find the conversational
format more productive.

------------------------------------------------

patches in this series:

Christopher James Halse Rogers (1):
  stacktrace: Export stack_trace_save_tsk

Daniel Hill (1):
  lib: add mean and variance module.

Dave Chinner (3):
  vfs: factor out inode hash head calculation
  hlist-bl: add hlist_bl_fake()
  vfs: inode cache conversion to hash-bl

Kent Overstreet (27):
  Compiler Attributes: add __flatten
  locking/lockdep: lock_class_is_held()
  locking/lockdep: lockdep_set_no_check_recursion()
  locking: SIX locks (shared/intent/exclusive)
  MAINTAINERS: Add entry for six locks
  sched: Add task_struct->faults_disabled_mapping
  mm: Bring back vmalloc_exec
  fs: factor out d_mark_tmpfile()
  block: Add some exports for bcachefs
  block: Allow bio_iov_iter_get_pages() with bio->bi_bdev unset
  block: Bring back zero_fill_bio_iter
  block: Rework bio_for_each_segment_all()
  block: Rework bio_for_each_folio_all()
  block: Don't block on s_umount from __invalidate_super()
  bcache: move closures to lib/
  MAINTAINERS: Add entry for closures
  closures: closure_wait_event()
  closures: closure_nr_remaining()
  closures: Add a missing include
  iov_iter: copy_folio_from_iter_atomic()
  MAINTAINERS: Add entry for generic-radix-tree
  lib/generic-radix-tree.c: Don't overflow in peek()
  lib/generic-radix-tree.c: Add a missing include
  lib/generic-radix-tree.c: Add peek_prev()
  lib/string_helpers: string_get_size() now returns characters wrote
  lib: Export errname
  MAINTAINERS: Add entry for bcachefs

 MAINTAINERS                                   |  39 +
 block/bdev.c                                  |   2 +-
 block/bio.c                                   |  57 +-
 block/blk-core.c                              |   1 +
 block/blk-map.c                               |  38 +-
 block/blk.h                                   |   1 -
 block/bounce.c                                |  12 +-
 drivers/md/bcache/Kconfig                     |  10 +-
 drivers/md/bcache/Makefile                    |   4 +-
 drivers/md/bcache/bcache.h                    |   2 +-
 drivers/md/bcache/btree.c                     |   8 +-
 drivers/md/bcache/super.c                     |   1 -
 drivers/md/bcache/util.h                      |   3 +-
 drivers/md/dm-crypt.c                         |  10 +-
 drivers/md/raid1.c                            |   4 +-
 fs/btrfs/disk-io.c                            |   4 +-
 fs/btrfs/extent_io.c                          |  50 +-
 fs/btrfs/raid56.c                             |  14 +-
 fs/crypto/bio.c                               |   9 +-
 fs/dcache.c                                   |  12 +-
 fs/erofs/zdata.c                              |   4 +-
 fs/ext4/page-io.c                             |   8 +-
 fs/ext4/readpage.c                            |   4 +-
 fs/f2fs/data.c                                |  20 +-
 fs/gfs2/lops.c                                |  10 +-
 fs/gfs2/meta_io.c                             |   8 +-
 fs/inode.c                                    | 218 +++--
 fs/iomap/buffered-io.c                        |  14 +-
 fs/mpage.c                                    |   4 +-
 fs/squashfs/block.c                           |  48 +-
 fs/squashfs/lz4_wrapper.c                     |  17 +-
 fs/squashfs/lzo_wrapper.c                     |  17 +-
 fs/squashfs/xz_wrapper.c                      |  19 +-
 fs/squashfs/zlib_wrapper.c                    |  18 +-
 fs/squashfs/zstd_wrapper.c                    |  19 +-
 fs/super.c                                    |  40 +-
 fs/verity/verify.c                            |   9 +-
 include/linux/bio.h                           | 132 +--
 include/linux/blkdev.h                        |   1 +
 include/linux/bvec.h                          |  70 +-
 .../md/bcache => include/linux}/closure.h     |  46 +-
 include/linux/compiler_attributes.h           |   5 +
 include/linux/dcache.h                        |   1 +
 include/linux/fs.h                            |  10 +-
 include/linux/generic-radix-tree.h            |  68 +-
 include/linux/list_bl.h                       |  22 +
 include/linux/lockdep.h                       |  10 +
 include/linux/lockdep_types.h                 |   2 +-
 include/linux/mean_and_variance.h             | 219 +++++
 include/linux/sched.h                         |   1 +
 include/linux/six.h                           | 210 +++++
 include/linux/string_helpers.h                |   4 +-
 include/linux/uio.h                           |   2 +
 include/linux/vmalloc.h                       |   1 +
 init/init_task.c                              |   1 +
 kernel/Kconfig.locks                          |   3 +
 kernel/locking/Makefile                       |   1 +
 kernel/locking/lockdep.c                      |  46 ++
 kernel/locking/six.c                          | 779 ++++++++++++++++++
 kernel/module/main.c                          |   4 +-
 kernel/stacktrace.c                           |   2 +
 lib/Kconfig                                   |   3 +
 lib/Kconfig.debug                             |  18 +
 lib/Makefile                                  |   2 +
 {drivers/md/bcache => lib}/closure.c          |  36 +-
 lib/errname.c                                 |   1 +
 lib/generic-radix-tree.c                      |  76 +-
 lib/iov_iter.c                                |  53 +-
 lib/math/Kconfig                              |   3 +
 lib/math/Makefile                             |   2 +
 lib/math/mean_and_variance.c                  | 136 +++
 lib/math/mean_and_variance_test.c             | 155 ++++
 lib/string_helpers.c                          |   8 +-
 mm/nommu.c                                    |  18 +
 mm/vmalloc.c                                  |  21 +
 75 files changed, 2485 insertions(+), 445 deletions(-)
 rename {drivers/md/bcache => include/linux}/closure.h (93%)
 create mode 100644 include/linux/mean_and_variance.h
 create mode 100644 include/linux/six.h
 create mode 100644 kernel/locking/six.c
 rename {drivers/md/bcache => lib}/closure.c (88%)
 create mode 100644 lib/math/mean_and_variance.c
 create mode 100644 lib/math/mean_and_variance_test.c

-- 
2.40.1



* [PATCH 01/32] Compiler Attributes: add __flatten
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-05-09 17:04   ` Miguel Ojeda
  2023-05-09 16:56 ` [PATCH 02/32] locking/lockdep: lock_class_is_held() Kent Overstreet
                   ` (30 subsequent siblings)
  31 siblings, 1 reply; 73+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs
  Cc: Kent Overstreet, Miguel Ojeda, Nick Desaulniers, Kent Overstreet

From: Kent Overstreet <[email protected]>

This makes __attribute__((flatten)) available, which is used by
bcachefs.
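
To illustrate (not code from this series), a minimal sketch of how the new
attribute might be used, with made-up helpers:

#include <linux/compiler_attributes.h>

static int add(int a, int b)
{
	return a + b;
}

/* __flatten asks the compiler to inline every call made from sum3() */
static __flatten int sum3(int a, int b, int c)
{
	return add(add(a, b), c);
}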

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Miguel Ojeda <[email protected]> (maintainer:COMPILER ATTRIBUTES)
Cc: Nick Desaulniers <[email protected]> (reviewer:COMPILER ATTRIBUTES)
Signed-off-by: Kent Overstreet <[email protected]>
---
 include/linux/compiler_attributes.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/include/linux/compiler_attributes.h b/include/linux/compiler_attributes.h
index e659cb6fde..e56793bc08 100644
--- a/include/linux/compiler_attributes.h
+++ b/include/linux/compiler_attributes.h
@@ -366,4 +366,9 @@
  */
 #define __fix_address noinline __noclone
 
+/*
+ *   gcc: https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-flatten-function-attribute
+ */
+#define __flatten __attribute__((flatten))
+
 #endif /* __LINUX_COMPILER_ATTRIBUTES_H */
-- 
2.40.1



* [PATCH 02/32] locking/lockdep: lock_class_is_held()
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
  2023-05-09 16:56 ` [PATCH 01/32] Compiler Attributes: add __flatten Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-05-09 19:30   ` Peter Zijlstra
  2023-05-09 16:56 ` [PATCH 03/32] locking/lockdep: lockdep_set_no_check_recursion() Kent Overstreet
                   ` (29 subsequent siblings)
  31 siblings, 1 reply; 73+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs
  Cc: Kent Overstreet, Kent Overstreet, Peter Zijlstra, Ingo Molnar,
	Will Deacon, Waiman Long, Boqun Feng

From: Kent Overstreet <[email protected]>

This patch adds lock_class_is_held(), which can be used to assert that a
particular type of lock is not held.
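
A minimal sketch of the intended usage pattern (the lock class name and
helper below are hypothetical, not from this series):

#include <linux/bug.h>
#include <linux/lockdep.h>

/* hypothetical lock class, e.g. the key used for a filesystem's node locks */
static struct lock_class_key my_node_lock_key;

/* assert that the calling thread holds no lock of that class */
static void assert_no_node_locks_held(void)
{
	WARN_ON(lock_class_is_held(&my_node_lock_key));
}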

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
---
 include/linux/lockdep.h  |  4 ++++
 kernel/locking/lockdep.c | 20 ++++++++++++++++++++
 2 files changed, 24 insertions(+)

diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
index 1023f349af..e858c288c7 100644
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -339,6 +339,8 @@ extern void lock_unpin_lock(struct lockdep_map *lock, struct pin_cookie);
 #define lockdep_repin_lock(l,c)	lock_repin_lock(&(l)->dep_map, (c))
 #define lockdep_unpin_lock(l,c)	lock_unpin_lock(&(l)->dep_map, (c))
 
+int lock_class_is_held(struct lock_class_key *key);
+
 #else /* !CONFIG_LOCKDEP */
 
 static inline void lockdep_init_task(struct task_struct *task)
@@ -427,6 +429,8 @@ extern int lockdep_is_held(const void *);
 #define lockdep_repin_lock(l, c)		do { (void)(l); (void)(c); } while (0)
 #define lockdep_unpin_lock(l, c)		do { (void)(l); (void)(c); } while (0)
 
+static inline int lock_class_is_held(struct lock_class_key *key) { return 0; }
+
 #endif /* !LOCKDEP */
 
 enum xhlock_context_t {
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 50d4863974..e631464070 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -6487,6 +6487,26 @@ void debug_check_no_locks_held(void)
 }
 EXPORT_SYMBOL_GPL(debug_check_no_locks_held);
 
+#ifdef CONFIG_LOCKDEP
+int lock_class_is_held(struct lock_class_key *key)
+{
+	struct task_struct *curr = current;
+	struct held_lock *hlock;
+
+	if (unlikely(!debug_locks))
+		return 0;
+
+	for (hlock = curr->held_locks;
+	     hlock < curr->held_locks + curr->lockdep_depth;
+	     hlock++)
+		if (hlock->instance->key == key)
+			return 1;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(lock_class_is_held);
+#endif
+
 #ifdef __KERNEL__
 void debug_show_all_locks(void)
 {
-- 
2.40.1



* [PATCH 03/32] locking/lockdep: lockdep_set_no_check_recursion()
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
  2023-05-09 16:56 ` [PATCH 01/32] Compiler Attributes: add __flatten Kent Overstreet
  2023-05-09 16:56 ` [PATCH 02/32] locking/lockdep: lock_class_is_held() Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-05-09 19:31   ` Peter Zijlstra
  2023-05-09 16:56 ` [PATCH 04/32] locking: SIX locks (shared/intent/exclusive) Kent Overstreet
                   ` (28 subsequent siblings)
  31 siblings, 1 reply; 73+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs
  Cc: Kent Overstreet, Peter Zijlstra, Ingo Molnar, Will Deacon,
	Waiman Long, Boqun Feng

This adds a method to tell lockdep not to check lock ordering within a
lock class - but to still check lock ordering w.r.t. other lock types.

This is for bcachefs, where for btree node locks we have our own
deadlock avoidance strategy w.r.t. other btree node locks (cycle
detection), but we still want lockdep to check lock ordering w.r.t.
other lock types.
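
A hedged sketch of how a caller might opt a lock class out of same-class
ordering checks; the struct and names are illustrative, and ->dep_map is
only present under CONFIG_DEBUG_LOCK_ALLOC:

#include <linux/lockdep.h>
#include <linux/mutex.h>

struct node {
	struct mutex lock;	/* illustrative; any lockdep-tracked lock works */
};

static void node_lock_init(struct node *n)
{
	mutex_init(&n->lock);
	/* skip ordering checks against other locks of this class only */
	lockdep_set_no_check_recursion(&n->lock.dep_map);
}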

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
---
 include/linux/lockdep.h       |  6 ++++++
 include/linux/lockdep_types.h |  2 +-
 kernel/locking/lockdep.c      | 26 ++++++++++++++++++++++++++
 3 files changed, 33 insertions(+), 1 deletion(-)

diff --git a/include/linux/lockdep.h b/include/linux/lockdep.h
index e858c288c7..f6cc8709e2 100644
--- a/include/linux/lockdep.h
+++ b/include/linux/lockdep.h
@@ -665,4 +665,10 @@ lockdep_rcu_suspicious(const char *file, const int line, const char *s)
 }
 #endif
 
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+void lockdep_set_no_check_recursion(struct lockdep_map *map);
+#else
+static inline void lockdep_set_no_check_recursion(struct lockdep_map *map) {}
+#endif
+
 #endif /* __LINUX_LOCKDEP_H */
diff --git a/include/linux/lockdep_types.h b/include/linux/lockdep_types.h
index d22430840b..506e769b4a 100644
--- a/include/linux/lockdep_types.h
+++ b/include/linux/lockdep_types.h
@@ -128,7 +128,7 @@ struct lock_class {
 	u8				wait_type_inner;
 	u8				wait_type_outer;
 	u8				lock_type;
-	/* u8				hole; */
+	u8				no_check_recursion;
 
 #ifdef CONFIG_LOCK_STAT
 	unsigned long			contention_point[LOCKSTAT_POINTS];
diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index e631464070..f022b58dfa 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -3024,6 +3024,9 @@ check_deadlock(struct task_struct *curr, struct held_lock *next)
 		if ((next->read == 2) && prev->read)
 			continue;
 
+		if (hlock_class(next)->no_check_recursion)
+			continue;
+
 		/*
 		 * We're holding the nest_lock, which serializes this lock's
 		 * nesting behaviour.
@@ -3085,6 +3088,10 @@ check_prev_add(struct task_struct *curr, struct held_lock *prev,
 		return 2;
 	}
 
+	if (hlock_class(prev) == hlock_class(next) &&
+	    hlock_class(prev)->no_check_recursion)
+		return 2;
+
 	/*
 	 * Prove that the new <prev> -> <next> dependency would not
 	 * create a circular dependency in the graph. (We do this by
@@ -6620,3 +6627,22 @@ void lockdep_rcu_suspicious(const char *file, const int line, const char *s)
 	warn_rcu_exit(rcu);
 }
 EXPORT_SYMBOL_GPL(lockdep_rcu_suspicious);
+
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+void lockdep_set_no_check_recursion(struct lockdep_map *lock)
+{
+	struct lock_class *class = lock->class_cache[0];
+	unsigned long flags;
+
+	raw_local_irq_save(flags);
+	lockdep_recursion_inc();
+
+	if (!class)
+		class = register_lock_class(lock, 0, 0);
+	if (class)
+		class->no_check_recursion = true;
+	lockdep_recursion_finish();
+	raw_local_irq_restore(flags);
+}
+EXPORT_SYMBOL_GPL(lockdep_set_no_check_recursion);
+#endif
-- 
2.40.1



* [PATCH 04/32] locking: SIX locks (shared/intent/exclusive)
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
                   ` (2 preceding siblings ...)
  2023-05-09 16:56 ` [PATCH 03/32] locking/lockdep: lockdep_set_no_check_recursion() Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-05-11 12:14   ` Jan Engelhardt
  2023-05-09 16:56 ` [PATCH 05/32] MAINTAINERS: Add entry for six locks Kent Overstreet
                   ` (27 subsequent siblings)
  31 siblings, 1 reply; 73+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs
  Cc: Kent Overstreet, Kent Overstreet, Peter Zijlstra, Ingo Molnar,
	Will Deacon, Waiman Long, Boqun Feng

From: Kent Overstreet <[email protected]>

New lock for bcachefs, like read/write locks but with a third state,
intent.

Intent locks conflict with each other, but not with read locks; taking a
write lock requires first holding an intent lock.
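
A minimal sketch of that pattern using the API added below (struct foo and
foo_update() are hypothetical; the should_sleep_fn/argument pair is passed
as NULL):

#include <linux/six.h>

struct foo {
	struct six_lock lock;
};

static void foo_init(struct foo *f)
{
	six_lock_init(&f->lock);
}

static void foo_update(struct foo *f)
{
	six_lock_intent(&f->lock, NULL, NULL);	/* excludes other intent/write */
	six_lock_write(&f->lock, NULL, NULL);	/* now also excludes readers */
	/* ... modify f ... */
	six_unlock_write(&f->lock);
	six_unlock_intent(&f->lock);
}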

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Waiman Long <[email protected]>
Cc: Boqun Feng <[email protected]>
---
 include/linux/six.h     | 210 +++++++++++
 kernel/Kconfig.locks    |   3 +
 kernel/locking/Makefile |   1 +
 kernel/locking/six.c    | 779 ++++++++++++++++++++++++++++++++++++++++
 4 files changed, 993 insertions(+)
 create mode 100644 include/linux/six.h
 create mode 100644 kernel/locking/six.c

diff --git a/include/linux/six.h b/include/linux/six.h
new file mode 100644
index 0000000000..41ddf63b74
--- /dev/null
+++ b/include/linux/six.h
@@ -0,0 +1,210 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+
+#ifndef _LINUX_SIX_H
+#define _LINUX_SIX_H
+
+/*
+ * Shared/intent/exclusive locks: sleepable read/write locks, much like rw
+ * semaphores, except with a third intermediate state, intent. Basic operations
+ * are:
+ *
+ * six_lock_read(&foo->lock);
+ * six_unlock_read(&foo->lock);
+ *
+ * six_lock_intent(&foo->lock);
+ * six_unlock_intent(&foo->lock);
+ *
+ * six_lock_write(&foo->lock);
+ * six_unlock_write(&foo->lock);
+ *
+ * Intent locks block other intent locks, but do not block read locks, and you
+ * must have an intent lock held before taking a write lock, like so:
+ *
+ * six_lock_intent(&foo->lock);
+ * six_lock_write(&foo->lock);
+ * six_unlock_write(&foo->lock);
+ * six_unlock_intent(&foo->lock);
+ *
+ * Other operations:
+ *
+ *   six_trylock_read()
+ *   six_trylock_intent()
+ *   six_trylock_write()
+ *
+ *   six_lock_downgrade():	convert from intent to read
+ *   six_lock_tryupgrade():	attempt to convert from read to intent
+ *
+ * Locks also embed a sequence number, which is incremented when the lock is
+ * locked or unlocked for write. The current sequence number can be grabbed
+ * while a lock is held from lock->state.seq; then, if you drop the lock you can
+ * use six_relock_(read|intent|write)(lock, seq) to attempt to retake the lock
+ * iff it hasn't been locked for write in the meantime.
+ *
+ * There are also operations that take the lock type as a parameter, where the
+ * type is one of SIX_LOCK_read, SIX_LOCK_intent, or SIX_LOCK_write:
+ *
+ *   six_lock_type(lock, type)
+ *   six_unlock_type(lock, type)
+ *   six_relock(lock, type, seq)
+ *   six_trylock_type(lock, type)
+ *   six_trylock_convert(lock, from, to)
+ *
+ * A lock may be held multiple times by the same thread (for read or intent,
+ * not write). However, the six locks code does _not_ implement the actual
+ * recursive checks itself - rather, if your code (e.g. btree iterator
+ * code) knows that the current thread already has a lock held, and for the
+ * correct type, six_lock_increment() may be used to bump up the counter for
+ * that type - the only effect is that one more call to unlock will be required
+ * before the lock is unlocked.
+ */
+
+#include <linux/lockdep.h>
+#include <linux/osq_lock.h>
+#include <linux/sched.h>
+#include <linux/types.h>
+
+#define SIX_LOCK_SEPARATE_LOCKFNS
+
+union six_lock_state {
+	struct {
+		atomic64_t	counter;
+	};
+
+	struct {
+		u64		v;
+	};
+
+	struct {
+		/* for waitlist_bitnr() */
+		unsigned long	l;
+	};
+
+	struct {
+		unsigned	read_lock:27;
+		unsigned	write_locking:1;
+		unsigned	intent_lock:1;
+		unsigned	waiters:3;
+		/*
+		 * seq works much like in seqlocks: it's incremented every time
+		 * we lock and unlock for write.
+		 *
+		 * If it's odd write lock is held, even unlocked.
+		 *
+		 * Thus readers can unlock, and then lock again later iff it
+		 * hasn't been modified in the meantime.
+		 */
+		u32		seq;
+	};
+};
+
+enum six_lock_type {
+	SIX_LOCK_read,
+	SIX_LOCK_intent,
+	SIX_LOCK_write,
+};
+
+struct six_lock {
+	union six_lock_state	state;
+	unsigned		intent_lock_recurse;
+	struct task_struct	*owner;
+	struct optimistic_spin_queue osq;
+	unsigned __percpu	*readers;
+
+	raw_spinlock_t		wait_lock;
+	struct list_head	wait_list[2];
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	struct lockdep_map	dep_map;
+#endif
+};
+
+typedef int (*six_lock_should_sleep_fn)(struct six_lock *lock, void *);
+
+static __always_inline void __six_lock_init(struct six_lock *lock,
+					    const char *name,
+					    struct lock_class_key *key)
+{
+	atomic64_set(&lock->state.counter, 0);
+	raw_spin_lock_init(&lock->wait_lock);
+	INIT_LIST_HEAD(&lock->wait_list[SIX_LOCK_read]);
+	INIT_LIST_HEAD(&lock->wait_list[SIX_LOCK_intent]);
+#ifdef CONFIG_DEBUG_LOCK_ALLOC
+	debug_check_no_locks_freed((void *) lock, sizeof(*lock));
+	lockdep_init_map(&lock->dep_map, name, key, 0);
+#endif
+}
+
+#define six_lock_init(lock)						\
+do {									\
+	static struct lock_class_key __key;				\
+									\
+	__six_lock_init((lock), #lock, &__key);				\
+} while (0)
+
+#define __SIX_VAL(field, _v)	(((union six_lock_state) { .field = _v }).v)
+
+#define __SIX_LOCK(type)						\
+bool six_trylock_##type(struct six_lock *);				\
+bool six_relock_##type(struct six_lock *, u32);				\
+int six_lock_##type(struct six_lock *, six_lock_should_sleep_fn, void *);\
+void six_unlock_##type(struct six_lock *);
+
+__SIX_LOCK(read)
+__SIX_LOCK(intent)
+__SIX_LOCK(write)
+#undef __SIX_LOCK
+
+#define SIX_LOCK_DISPATCH(type, fn, ...)			\
+	switch (type) {						\
+	case SIX_LOCK_read:					\
+		return fn##_read(__VA_ARGS__);			\
+	case SIX_LOCK_intent:					\
+		return fn##_intent(__VA_ARGS__);		\
+	case SIX_LOCK_write:					\
+		return fn##_write(__VA_ARGS__);			\
+	default:						\
+		BUG();						\
+	}
+
+static inline bool six_trylock_type(struct six_lock *lock, enum six_lock_type type)
+{
+	SIX_LOCK_DISPATCH(type, six_trylock, lock);
+}
+
+static inline bool six_relock_type(struct six_lock *lock, enum six_lock_type type,
+				   unsigned seq)
+{
+	SIX_LOCK_DISPATCH(type, six_relock, lock, seq);
+}
+
+static inline int six_lock_type(struct six_lock *lock, enum six_lock_type type,
+				six_lock_should_sleep_fn should_sleep_fn, void *p)
+{
+	SIX_LOCK_DISPATCH(type, six_lock, lock, should_sleep_fn, p);
+}
+
+static inline void six_unlock_type(struct six_lock *lock, enum six_lock_type type)
+{
+	SIX_LOCK_DISPATCH(type, six_unlock, lock);
+}
+
+void six_lock_downgrade(struct six_lock *);
+bool six_lock_tryupgrade(struct six_lock *);
+bool six_trylock_convert(struct six_lock *, enum six_lock_type,
+			 enum six_lock_type);
+
+void six_lock_increment(struct six_lock *, enum six_lock_type);
+
+void six_lock_wakeup_all(struct six_lock *);
+
+void six_lock_pcpu_free_rcu(struct six_lock *);
+void six_lock_pcpu_free(struct six_lock *);
+void six_lock_pcpu_alloc(struct six_lock *);
+
+struct six_lock_count {
+	unsigned read;
+	unsigned intent;
+};
+
+struct six_lock_count six_lock_counts(struct six_lock *);
+
+#endif /* _LINUX_SIX_H */
diff --git a/kernel/Kconfig.locks b/kernel/Kconfig.locks
index 4198f0273e..b2abd9a5d9 100644
--- a/kernel/Kconfig.locks
+++ b/kernel/Kconfig.locks
@@ -259,3 +259,6 @@ config ARCH_HAS_MMIOWB
 config MMIOWB
 	def_bool y if ARCH_HAS_MMIOWB
 	depends on SMP
+
+config SIXLOCKS
+	bool
diff --git a/kernel/locking/Makefile b/kernel/locking/Makefile
index 0db4093d17..a095dbbf01 100644
--- a/kernel/locking/Makefile
+++ b/kernel/locking/Makefile
@@ -32,3 +32,4 @@ obj-$(CONFIG_QUEUED_RWLOCKS) += qrwlock.o
 obj-$(CONFIG_LOCK_TORTURE_TEST) += locktorture.o
 obj-$(CONFIG_WW_MUTEX_SELFTEST) += test-ww_mutex.o
 obj-$(CONFIG_LOCK_EVENT_COUNTS) += lock_events.o
+obj-$(CONFIG_SIXLOCKS) += six.o
diff --git a/kernel/locking/six.c b/kernel/locking/six.c
new file mode 100644
index 0000000000..5b2d92c6e9
--- /dev/null
+++ b/kernel/locking/six.c
@@ -0,0 +1,779 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/export.h>
+#include <linux/log2.h>
+#include <linux/percpu.h>
+#include <linux/preempt.h>
+#include <linux/rcupdate.h>
+#include <linux/sched.h>
+#include <linux/sched/rt.h>
+#include <linux/six.h>
+#include <linux/slab.h>
+
+#ifdef DEBUG
+#define EBUG_ON(cond)		BUG_ON(cond)
+#else
+#define EBUG_ON(cond)		do {} while (0)
+#endif
+
+#define six_acquire(l, t)	lock_acquire(l, 0, t, 0, 0, NULL, _RET_IP_)
+#define six_release(l)		lock_release(l, _RET_IP_)
+
+struct six_lock_vals {
+	/* Value we add to the lock in order to take the lock: */
+	u64			lock_val;
+
+	/* If the lock has this value (used as a mask), taking the lock fails: */
+	u64			lock_fail;
+
+	/* Value we add to the lock in order to release the lock: */
+	u64			unlock_val;
+
+	/* Mask that indicates lock is held for this type: */
+	u64			held_mask;
+
+	/* Waitlist we wakeup when releasing the lock: */
+	enum six_lock_type	unlock_wakeup;
+};
+
+#define __SIX_LOCK_HELD_read	__SIX_VAL(read_lock, ~0)
+#define __SIX_LOCK_HELD_intent	__SIX_VAL(intent_lock, ~0)
+#define __SIX_LOCK_HELD_write	__SIX_VAL(seq, 1)
+
+#define LOCK_VALS {							\
+	[SIX_LOCK_read] = {						\
+		.lock_val	= __SIX_VAL(read_lock, 1),		\
+		.lock_fail	= __SIX_LOCK_HELD_write + __SIX_VAL(write_locking, 1),\
+		.unlock_val	= -__SIX_VAL(read_lock, 1),		\
+		.held_mask	= __SIX_LOCK_HELD_read,			\
+		.unlock_wakeup	= SIX_LOCK_write,			\
+	},								\
+	[SIX_LOCK_intent] = {						\
+		.lock_val	= __SIX_VAL(intent_lock, 1),		\
+		.lock_fail	= __SIX_LOCK_HELD_intent,		\
+		.unlock_val	= -__SIX_VAL(intent_lock, 1),		\
+		.held_mask	= __SIX_LOCK_HELD_intent,		\
+		.unlock_wakeup	= SIX_LOCK_intent,			\
+	},								\
+	[SIX_LOCK_write] = {						\
+		.lock_val	= __SIX_VAL(seq, 1),			\
+		.lock_fail	= __SIX_LOCK_HELD_read,			\
+		.unlock_val	= __SIX_VAL(seq, 1),			\
+		.held_mask	= __SIX_LOCK_HELD_write,		\
+		.unlock_wakeup	= SIX_LOCK_read,			\
+	},								\
+}
+
+static inline void six_set_owner(struct six_lock *lock, enum six_lock_type type,
+				 union six_lock_state old)
+{
+	if (type != SIX_LOCK_intent)
+		return;
+
+	if (!old.intent_lock) {
+		EBUG_ON(lock->owner);
+		lock->owner = current;
+	} else {
+		EBUG_ON(lock->owner != current);
+	}
+}
+
+static inline unsigned pcpu_read_count(struct six_lock *lock)
+{
+	unsigned read_count = 0;
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		read_count += *per_cpu_ptr(lock->readers, cpu);
+	return read_count;
+}
+
+struct six_lock_waiter {
+	struct list_head	list;
+	struct task_struct	*task;
+};
+
+/* This is probably up there with the more evil things I've done */
+#define waitlist_bitnr(id) ilog2((((union six_lock_state) { .waiters = 1 << (id) }).l))
+
+static inline void six_lock_wakeup(struct six_lock *lock,
+				   union six_lock_state state,
+				   unsigned waitlist_id)
+{
+	if (waitlist_id == SIX_LOCK_write) {
+		if (state.write_locking && !state.read_lock) {
+			struct task_struct *p = READ_ONCE(lock->owner);
+			if (p)
+				wake_up_process(p);
+		}
+	} else {
+		struct list_head *wait_list = &lock->wait_list[waitlist_id];
+		struct six_lock_waiter *w, *next;
+
+		if (!(state.waiters & (1 << waitlist_id)))
+			return;
+
+		clear_bit(waitlist_bitnr(waitlist_id),
+			  (unsigned long *) &lock->state.v);
+
+		raw_spin_lock(&lock->wait_lock);
+
+		list_for_each_entry_safe(w, next, wait_list, list) {
+			list_del_init(&w->list);
+
+			if (wake_up_process(w->task) &&
+			    waitlist_id != SIX_LOCK_read) {
+				if (!list_empty(wait_list))
+					set_bit(waitlist_bitnr(waitlist_id),
+						(unsigned long *) &lock->state.v);
+				break;
+			}
+		}
+
+		raw_spin_unlock(&lock->wait_lock);
+	}
+}
+
+static __always_inline bool do_six_trylock_type(struct six_lock *lock,
+						enum six_lock_type type,
+						bool try)
+{
+	const struct six_lock_vals l[] = LOCK_VALS;
+	union six_lock_state old, new;
+	bool ret;
+	u64 v;
+
+	EBUG_ON(type == SIX_LOCK_write && lock->owner != current);
+	EBUG_ON(type == SIX_LOCK_write && (lock->state.seq & 1));
+
+	EBUG_ON(type == SIX_LOCK_write && (try != !(lock->state.write_locking)));
+
+	/*
+	 * Percpu reader mode:
+	 *
+	 * The basic idea behind this algorithm is that you can implement a lock
+	 * between two threads without any atomics, just memory barriers:
+	 *
+	 * For two threads you'll need two variables, one variable for "thread a
+	 * has the lock" and another for "thread b has the lock".
+	 *
+	 * To take the lock, a thread sets its variable indicating that it holds
+	 * the lock, then issues a full memory barrier, then reads from the
+	 * other thread's variable to check if the other thread thinks it has
+	 * the lock. If we raced, we backoff and retry/sleep.
+	 */
+
+	if (type == SIX_LOCK_read && lock->readers) {
+retry:
+		preempt_disable();
+		this_cpu_inc(*lock->readers); /* signal that we own lock */
+
+		smp_mb();
+
+		old.v = READ_ONCE(lock->state.v);
+		ret = !(old.v & l[type].lock_fail);
+
+		this_cpu_sub(*lock->readers, !ret);
+		preempt_enable();
+
+		/*
+		 * If we failed because a writer was trying to take the
+		 * lock, issue a wakeup because we might have caused a
+		 * spurious trylock failure:
+		 */
+		if (old.write_locking) {
+			struct task_struct *p = READ_ONCE(lock->owner);
+
+			if (p)
+				wake_up_process(p);
+		}
+
+		/*
+		 * If we failed from the lock path and the waiting bit wasn't
+		 * set, set it:
+		 */
+		if (!try && !ret) {
+			v = old.v;
+
+			do {
+				new.v = old.v = v;
+
+				if (!(old.v & l[type].lock_fail))
+					goto retry;
+
+				if (new.waiters & (1 << type))
+					break;
+
+				new.waiters |= 1 << type;
+			} while ((v = atomic64_cmpxchg(&lock->state.counter,
+						       old.v, new.v)) != old.v);
+		}
+	} else if (type == SIX_LOCK_write && lock->readers) {
+		if (try) {
+			atomic64_add(__SIX_VAL(write_locking, 1),
+				     &lock->state.counter);
+			smp_mb__after_atomic();
+		}
+
+		ret = !pcpu_read_count(lock);
+
+		/*
+		 * On success, we increment lock->seq; also we clear
+		 * write_locking unless we failed from the lock path:
+		 */
+		v = 0;
+		if (ret)
+			v += __SIX_VAL(seq, 1);
+		if (ret || try)
+			v -= __SIX_VAL(write_locking, 1);
+
+		if (try && !ret) {
+			old.v = atomic64_add_return(v, &lock->state.counter);
+			six_lock_wakeup(lock, old, SIX_LOCK_read);
+		} else {
+			atomic64_add(v, &lock->state.counter);
+		}
+	} else {
+		v = READ_ONCE(lock->state.v);
+		do {
+			new.v = old.v = v;
+
+			if (!(old.v & l[type].lock_fail)) {
+				new.v += l[type].lock_val;
+
+				if (type == SIX_LOCK_write)
+					new.write_locking = 0;
+			} else if (!try && type != SIX_LOCK_write &&
+				   !(new.waiters & (1 << type)))
+				new.waiters |= 1 << type;
+			else
+				break; /* waiting bit already set */
+		} while ((v = atomic64_cmpxchg_acquire(&lock->state.counter,
+					old.v, new.v)) != old.v);
+
+		ret = !(old.v & l[type].lock_fail);
+
+		EBUG_ON(ret && !(lock->state.v & l[type].held_mask));
+	}
+
+	if (ret)
+		six_set_owner(lock, type, old);
+
+	EBUG_ON(type == SIX_LOCK_write && (try || ret) && (lock->state.write_locking));
+
+	return ret;
+}
+
+__always_inline __flatten
+static bool __six_trylock_type(struct six_lock *lock, enum six_lock_type type)
+{
+	if (!do_six_trylock_type(lock, type, true))
+		return false;
+
+	if (type != SIX_LOCK_write)
+		six_acquire(&lock->dep_map, 1);
+	return true;
+}
+
+__always_inline __flatten
+static bool __six_relock_type(struct six_lock *lock, enum six_lock_type type,
+			      unsigned seq)
+{
+	const struct six_lock_vals l[] = LOCK_VALS;
+	union six_lock_state old;
+	u64 v;
+
+	EBUG_ON(type == SIX_LOCK_write);
+
+	if (type == SIX_LOCK_read &&
+	    lock->readers) {
+		bool ret;
+
+		preempt_disable();
+		this_cpu_inc(*lock->readers);
+
+		smp_mb();
+
+		old.v = READ_ONCE(lock->state.v);
+		ret = !(old.v & l[type].lock_fail) && old.seq == seq;
+
+		this_cpu_sub(*lock->readers, !ret);
+		preempt_enable();
+
+		/*
+		 * Similar to the lock path, we may have caused a spurious write
+		 * lock fail and need to issue a wakeup:
+		 */
+		if (old.write_locking) {
+			struct task_struct *p = READ_ONCE(lock->owner);
+
+			if (p)
+				wake_up_process(p);
+		}
+
+		if (ret)
+			six_acquire(&lock->dep_map, 1);
+
+		return ret;
+	}
+
+	v = READ_ONCE(lock->state.v);
+	do {
+		old.v = v;
+
+		if (old.seq != seq || old.v & l[type].lock_fail)
+			return false;
+	} while ((v = atomic64_cmpxchg_acquire(&lock->state.counter,
+				old.v,
+				old.v + l[type].lock_val)) != old.v);
+
+	six_set_owner(lock, type, old);
+	if (type != SIX_LOCK_write)
+		six_acquire(&lock->dep_map, 1);
+	return true;
+}
+
+#ifdef CONFIG_LOCK_SPIN_ON_OWNER
+
+static inline int six_can_spin_on_owner(struct six_lock *lock)
+{
+	struct task_struct *owner;
+	int retval = 1;
+
+	if (need_resched())
+		return 0;
+
+	rcu_read_lock();
+	owner = READ_ONCE(lock->owner);
+	if (owner)
+		retval = owner->on_cpu;
+	rcu_read_unlock();
+	/*
+	 * if lock->owner is not set, the mutex owner may have just acquired
+	 * it and not set the owner yet or the mutex has been released.
+	 */
+	return retval;
+}
+
+static inline bool six_spin_on_owner(struct six_lock *lock,
+				     struct task_struct *owner)
+{
+	bool ret = true;
+
+	rcu_read_lock();
+	while (lock->owner == owner) {
+		/*
+		 * Ensure we emit the owner->on_cpu, dereference _after_
+		 * checking lock->owner still matches owner. If that fails,
+		 * owner might point to freed memory. If it still matches,
+		 * the rcu_read_lock() ensures the memory stays valid.
+		 */
+		barrier();
+
+		if (!owner->on_cpu || need_resched()) {
+			ret = false;
+			break;
+		}
+
+		cpu_relax();
+	}
+	rcu_read_unlock();
+
+	return ret;
+}
+
+static inline bool six_optimistic_spin(struct six_lock *lock, enum six_lock_type type)
+{
+	struct task_struct *task = current;
+
+	if (type == SIX_LOCK_write)
+		return false;
+
+	preempt_disable();
+	if (!six_can_spin_on_owner(lock))
+		goto fail;
+
+	if (!osq_lock(&lock->osq))
+		goto fail;
+
+	while (1) {
+		struct task_struct *owner;
+
+		/*
+		 * If there's an owner, wait for it to either
+		 * release the lock or go to sleep.
+		 */
+		owner = READ_ONCE(lock->owner);
+		if (owner && !six_spin_on_owner(lock, owner))
+			break;
+
+		if (do_six_trylock_type(lock, type, false)) {
+			osq_unlock(&lock->osq);
+			preempt_enable();
+			return true;
+		}
+
+		/*
+		 * When there's no owner, we might have preempted between the
+		 * owner acquiring the lock and setting the owner field. If
+		 * we're an RT task that will live-lock because we won't let
+		 * the owner complete.
+		 */
+		if (!owner && (need_resched() || rt_task(task)))
+			break;
+
+		/*
+		 * The cpu_relax() call is a compiler barrier which forces
+		 * everything in this loop to be re-loaded. We don't need
+		 * memory barriers as we'll eventually observe the right
+		 * values at the cost of a few extra spins.
+		 */
+		cpu_relax();
+	}
+
+	osq_unlock(&lock->osq);
+fail:
+	preempt_enable();
+
+	/*
+	 * If we fell out of the spin path because of need_resched(),
+	 * reschedule now, before we try-lock again. This avoids getting
+	 * scheduled out right after we obtained the lock.
+	 */
+	if (need_resched())
+		schedule();
+
+	return false;
+}
+
+#else /* CONFIG_LOCK_SPIN_ON_OWNER */
+
+static inline bool six_optimistic_spin(struct six_lock *lock, enum six_lock_type type)
+{
+	return false;
+}
+
+#endif
+
+noinline
+static int __six_lock_type_slowpath(struct six_lock *lock, enum six_lock_type type,
+				    six_lock_should_sleep_fn should_sleep_fn, void *p)
+{
+	union six_lock_state old;
+	struct six_lock_waiter wait;
+	int ret = 0;
+
+	if (type == SIX_LOCK_write) {
+		EBUG_ON(lock->state.write_locking);
+		atomic64_add(__SIX_VAL(write_locking, 1), &lock->state.counter);
+		smp_mb__after_atomic();
+	}
+
+	ret = should_sleep_fn ? should_sleep_fn(lock, p) : 0;
+	if (ret)
+		goto out_before_sleep;
+
+	if (six_optimistic_spin(lock, type))
+		goto out_before_sleep;
+
+	lock_contended(&lock->dep_map, _RET_IP_);
+
+	INIT_LIST_HEAD(&wait.list);
+	wait.task = current;
+
+	while (1) {
+		set_current_state(TASK_UNINTERRUPTIBLE);
+		if (type == SIX_LOCK_write)
+			EBUG_ON(lock->owner != current);
+		else if (list_empty_careful(&wait.list)) {
+			raw_spin_lock(&lock->wait_lock);
+			list_add_tail(&wait.list, &lock->wait_list[type]);
+			raw_spin_unlock(&lock->wait_lock);
+		}
+
+		if (do_six_trylock_type(lock, type, false))
+			break;
+
+		ret = should_sleep_fn ? should_sleep_fn(lock, p) : 0;
+		if (ret)
+			break;
+
+		schedule();
+	}
+
+	__set_current_state(TASK_RUNNING);
+
+	if (!list_empty_careful(&wait.list)) {
+		raw_spin_lock(&lock->wait_lock);
+		list_del_init(&wait.list);
+		raw_spin_unlock(&lock->wait_lock);
+	}
+out_before_sleep:
+	if (ret && type == SIX_LOCK_write) {
+		old.v = atomic64_sub_return(__SIX_VAL(write_locking, 1),
+					    &lock->state.counter);
+		six_lock_wakeup(lock, old, SIX_LOCK_read);
+	}
+
+	return ret;
+}
+
+__always_inline
+static int __six_lock_type(struct six_lock *lock, enum six_lock_type type,
+			   six_lock_should_sleep_fn should_sleep_fn, void *p)
+{
+	int ret;
+
+	if (type != SIX_LOCK_write)
+		six_acquire(&lock->dep_map, 0);
+
+	ret = do_six_trylock_type(lock, type, true) ? 0
+		: __six_lock_type_slowpath(lock, type, should_sleep_fn, p);
+
+	if (ret && type != SIX_LOCK_write)
+		six_release(&lock->dep_map);
+	if (!ret)
+		lock_acquired(&lock->dep_map, _RET_IP_);
+
+	return ret;
+}
+
+__always_inline __flatten
+static void __six_unlock_type(struct six_lock *lock, enum six_lock_type type)
+{
+	const struct six_lock_vals l[] = LOCK_VALS;
+	union six_lock_state state;
+
+	EBUG_ON(type == SIX_LOCK_write &&
+		!(lock->state.v & __SIX_LOCK_HELD_intent));
+
+	if (type != SIX_LOCK_write)
+		six_release(&lock->dep_map);
+
+	if (type == SIX_LOCK_intent) {
+		EBUG_ON(lock->owner != current);
+
+		if (lock->intent_lock_recurse) {
+			--lock->intent_lock_recurse;
+			return;
+		}
+
+		lock->owner = NULL;
+	}
+
+	if (type == SIX_LOCK_read &&
+	    lock->readers) {
+		smp_mb(); /* unlock barrier */
+		this_cpu_dec(*lock->readers);
+		smp_mb(); /* between unlocking and checking for waiters */
+		state.v = READ_ONCE(lock->state.v);
+	} else {
+		EBUG_ON(!(lock->state.v & l[type].held_mask));
+		state.v = atomic64_add_return_release(l[type].unlock_val,
+						      &lock->state.counter);
+	}
+
+	six_lock_wakeup(lock, state, l[type].unlock_wakeup);
+}
+
+#define __SIX_LOCK(type)						\
+bool six_trylock_##type(struct six_lock *lock)				\
+{									\
+	return __six_trylock_type(lock, SIX_LOCK_##type);		\
+}									\
+EXPORT_SYMBOL_GPL(six_trylock_##type);					\
+									\
+bool six_relock_##type(struct six_lock *lock, u32 seq)			\
+{									\
+	return __six_relock_type(lock, SIX_LOCK_##type, seq);		\
+}									\
+EXPORT_SYMBOL_GPL(six_relock_##type);					\
+									\
+int six_lock_##type(struct six_lock *lock,				\
+		    six_lock_should_sleep_fn should_sleep_fn, void *p)	\
+{									\
+	return __six_lock_type(lock, SIX_LOCK_##type, should_sleep_fn, p);\
+}									\
+EXPORT_SYMBOL_GPL(six_lock_##type);					\
+									\
+void six_unlock_##type(struct six_lock *lock)				\
+{									\
+	__six_unlock_type(lock, SIX_LOCK_##type);			\
+}									\
+EXPORT_SYMBOL_GPL(six_unlock_##type);
+
+__SIX_LOCK(read)
+__SIX_LOCK(intent)
+__SIX_LOCK(write)
+
+#undef __SIX_LOCK
+
+/* Convert from intent to read: */
+void six_lock_downgrade(struct six_lock *lock)
+{
+	six_lock_increment(lock, SIX_LOCK_read);
+	six_unlock_intent(lock);
+}
+EXPORT_SYMBOL_GPL(six_lock_downgrade);
+
+bool six_lock_tryupgrade(struct six_lock *lock)
+{
+	union six_lock_state old, new;
+	u64 v = READ_ONCE(lock->state.v);
+
+	do {
+		new.v = old.v = v;
+
+		if (new.intent_lock)
+			return false;
+
+		if (!lock->readers) {
+			EBUG_ON(!new.read_lock);
+			new.read_lock--;
+		}
+
+		new.intent_lock = 1;
+	} while ((v = atomic64_cmpxchg_acquire(&lock->state.counter,
+				old.v, new.v)) != old.v);
+
+	if (lock->readers)
+		this_cpu_dec(*lock->readers);
+
+	six_set_owner(lock, SIX_LOCK_intent, old);
+
+	return true;
+}
+EXPORT_SYMBOL_GPL(six_lock_tryupgrade);
+
+bool six_trylock_convert(struct six_lock *lock,
+			 enum six_lock_type from,
+			 enum six_lock_type to)
+{
+	EBUG_ON(to == SIX_LOCK_write || from == SIX_LOCK_write);
+
+	if (to == from)
+		return true;
+
+	if (to == SIX_LOCK_read) {
+		six_lock_downgrade(lock);
+		return true;
+	} else {
+		return six_lock_tryupgrade(lock);
+	}
+}
+EXPORT_SYMBOL_GPL(six_trylock_convert);
+
+/*
+ * Increment read/intent lock count, assuming we already have it read or intent
+ * locked:
+ */
+void six_lock_increment(struct six_lock *lock, enum six_lock_type type)
+{
+	const struct six_lock_vals l[] = LOCK_VALS;
+
+	six_acquire(&lock->dep_map, 0);
+
+	/* XXX: assert already locked, and that we don't overflow: */
+
+	switch (type) {
+	case SIX_LOCK_read:
+		if (lock->readers) {
+			this_cpu_inc(*lock->readers);
+		} else {
+			EBUG_ON(!lock->state.read_lock &&
+				!lock->state.intent_lock);
+			atomic64_add(l[type].lock_val, &lock->state.counter);
+		}
+		break;
+	case SIX_LOCK_intent:
+		EBUG_ON(!lock->state.intent_lock);
+		lock->intent_lock_recurse++;
+		break;
+	case SIX_LOCK_write:
+		BUG();
+		break;
+	}
+}
+EXPORT_SYMBOL_GPL(six_lock_increment);
+
+void six_lock_wakeup_all(struct six_lock *lock)
+{
+	struct six_lock_waiter *w;
+
+	raw_spin_lock(&lock->wait_lock);
+
+	list_for_each_entry(w, &lock->wait_list[0], list)
+		wake_up_process(w->task);
+	list_for_each_entry(w, &lock->wait_list[1], list)
+		wake_up_process(w->task);
+
+	raw_spin_unlock(&lock->wait_lock);
+}
+EXPORT_SYMBOL_GPL(six_lock_wakeup_all);
+
+struct free_pcpu_rcu {
+	struct rcu_head		rcu;
+	void __percpu		*p;
+};
+
+static void free_pcpu_rcu_fn(struct rcu_head *_rcu)
+{
+	struct free_pcpu_rcu *rcu =
+		container_of(_rcu, struct free_pcpu_rcu, rcu);
+
+	free_percpu(rcu->p);
+	kfree(rcu);
+}
+
+void six_lock_pcpu_free_rcu(struct six_lock *lock)
+{
+	struct free_pcpu_rcu *rcu = kzalloc(sizeof(*rcu), GFP_KERNEL);
+
+	if (!rcu)
+		return;
+
+	rcu->p = lock->readers;
+	lock->readers = NULL;
+
+	call_rcu(&rcu->rcu, free_pcpu_rcu_fn);
+}
+EXPORT_SYMBOL_GPL(six_lock_pcpu_free_rcu);
+
+void six_lock_pcpu_free(struct six_lock *lock)
+{
+	BUG_ON(lock->readers && pcpu_read_count(lock));
+	BUG_ON(lock->state.read_lock);
+
+	free_percpu(lock->readers);
+	lock->readers = NULL;
+}
+EXPORT_SYMBOL_GPL(six_lock_pcpu_free);
+
+void six_lock_pcpu_alloc(struct six_lock *lock)
+{
+#ifdef __KERNEL__
+	if (!lock->readers)
+		lock->readers = alloc_percpu(unsigned);
+#endif
+}
+EXPORT_SYMBOL_GPL(six_lock_pcpu_alloc);
+
+/*
+ * Returns lock held counts, for both read and intent
+ */
+struct six_lock_count six_lock_counts(struct six_lock *lock)
+{
+	struct six_lock_count ret = { 0, lock->state.intent_lock };
+
+	if (!lock->readers)
+		ret.read += lock->state.read_lock;
+	else {
+		int cpu;
+
+		for_each_possible_cpu(cpu)
+			ret.read += *per_cpu_ptr(lock->readers, cpu);
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(six_lock_counts);
-- 
2.40.1



* [PATCH 05/32] MAINTAINERS: Add entry for six locks
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
                   ` (3 preceding siblings ...)
  2023-05-09 16:56 ` [PATCH 04/32] locking: SIX locks (shared/intent/exclusive) Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-05-09 16:56 ` [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping Kent Overstreet
                   ` (26 subsequent siblings)
  31 siblings, 0 replies; 73+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs; +Cc: Kent Overstreet

SIX locks are a new locking primitive, shared/intent/exclusive,
currently used by bcachefs but available for other uses. Mark them as
maintained.

Signed-off-by: Kent Overstreet <[email protected]>
---
 MAINTAINERS | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index c6545eb541..3fc37de3d6 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -19166,6 +19166,14 @@ S:	Maintained
 W:	http://www.winischhofer.at/linuxsisusbvga.shtml
 F:	drivers/usb/misc/sisusbvga/
 
+SIX LOCKS
+M:	Kent Overstreet <[email protected]>
+L:	[email protected]
+S:	Supported
+C:	irc://irc.oftc.net/bcache
+F:	include/linux/six.h
+F:	kernel/locking/six.c
+
 SL28 CPLD MFD DRIVER
 M:	Michael Walle <[email protected]>
 S:	Maintained
-- 
2.40.1



* [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
                   ` (4 preceding siblings ...)
  2023-05-09 16:56 ` [PATCH 05/32] MAINTAINERS: Add entry for six locks Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-05-10  1:07   ` Jan Kara
  2023-05-09 16:56 ` [PATCH 07/32] mm: Bring back vmalloc_exec Kent Overstreet
                   ` (25 subsequent siblings)
  31 siblings, 1 reply; 73+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs
  Cc: Kent Overstreet, Kent Overstreet, Jan Kara, Darrick J . Wong

From: Kent Overstreet <[email protected]>

This is used by bcachefs to fix a page cache coherency issue with
O_DIRECT writes.

Also relevant: mapping->invalidate_lock, see below.

O_DIRECT writes (and other filesystem operations that modify file data
while bypassing the page cache) need to shoot down ranges of the page
cache - and additionally, need locking to prevent those pages from being
pulled back in.

But O_DIRECT writes invoke the page fault handler (via get_user_pages),
and the page fault handler will need to take that same lock - this is a
classic recursive deadlock if userspace has mmaped the file they're DIO
writing to and uses those pages for the buffer to write from, and it's a
lock ordering deadlock in general.

Thus we need a way to signal from the dio code to the page fault handler
when we already are holding the pagecache add lock on an address space -
this patch just adds a member to task_struct for this purpose. For now
only bcachefs is implementing this locking, though it may be moved out
of bcachefs and made available to other filesystems in the future.

---------------------------------

The closest current VFS equivalent is mapping->invalidate_lock, which
comes from XFS. However, it's not used by direct IO.  Instead, direct IO
paths shoot down the page cache twice - before starting the IO and at
the end, and they're still technically racy w.r.t. page cache coherency.

This is a more complete approach: in the future we might consider
replacing mapping->invalidate_lock with the bcachefs code.
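
A hedged sketch of how a filesystem's fault path might consult the new
field (bcachefs's actual dio and fault code is not part of this patch; the
fallback shown is illustrative):

#include <linux/mm.h>
#include <linux/sched.h>

static vm_fault_t my_fs_fault(struct vm_fault *vmf)
{
	struct address_space *mapping = vmf->vma->vm_file->f_mapping;

	/* the dio write path sets this before get_user_pages() on buffers
	 * that may be mmapped from the same mapping: */
	if (current->faults_disabled_mapping == mapping)
		return VM_FAULT_SIGBUS;	/* caller falls back to a safe path */

	/* normal fault path, which may take the pagecache add lock */
	return filemap_fault(vmf);
}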

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Darrick J. Wong <[email protected]>
Cc: [email protected]
---
 include/linux/sched.h | 1 +
 init/init_task.c      | 1 +
 2 files changed, 2 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 63d242164b..f2a56f64f7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -869,6 +869,7 @@ struct task_struct {
 
 	struct mm_struct		*mm;
 	struct mm_struct		*active_mm;
+	struct address_space		*faults_disabled_mapping;
 
 	int				exit_state;
 	int				exit_code;
diff --git a/init/init_task.c b/init/init_task.c
index ff6c4b9bfe..f703116e05 100644
--- a/init/init_task.c
+++ b/init/init_task.c
@@ -85,6 +85,7 @@ struct task_struct init_task
 	.nr_cpus_allowed= NR_CPUS,
 	.mm		= NULL,
 	.active_mm	= &init_mm,
+	.faults_disabled_mapping = NULL,
 	.restart_block	= {
 		.fn = do_no_restart_syscall,
 	},
-- 
2.40.1



* [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
                   ` (5 preceding siblings ...)
  2023-05-09 16:56 ` [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-05-09 18:19   ` Lorenzo Stoakes
                     ` (3 more replies)
  2023-05-09 16:56 ` [PATCH 08/32] fs: factor out d_mark_tmpfile() Kent Overstreet
                   ` (24 subsequent siblings)
  31 siblings, 4 replies; 73+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs
  Cc: Kent Overstreet, Kent Overstreet, Andrew Morton,
	Uladzislau Rezki, Christoph Hellwig, linux-mm

From: Kent Overstreet <[email protected]>

This is needed for bcachefs, which dynamically generates per-btree node
unpack functions.
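
A hedged sketch of the intended use - copying generated machine code into
an executable mapping; the names are illustrative, and error handling plus
architecture-specific icache maintenance are omitted:

#include <linux/string.h>
#include <linux/vmalloc.h>

typedef u64 (*unpack_fn)(const void *packed);

static unpack_fn install_generated_code(const void *code, size_t len)
{
	void *mem = vmalloc_exec(len, GFP_KERNEL);

	if (!mem)
		return NULL;
	memcpy(mem, code, len);
	return (unpack_fn) mem;
}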

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Uladzislau Rezki <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: [email protected]
---
 include/linux/vmalloc.h |  1 +
 kernel/module/main.c    |  4 +---
 mm/nommu.c              | 18 ++++++++++++++++++
 mm/vmalloc.c            | 21 +++++++++++++++++++++
 4 files changed, 41 insertions(+), 3 deletions(-)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 69250efa03..ff147fe115 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -145,6 +145,7 @@ extern void *vzalloc(unsigned long size) __alloc_size(1);
 extern void *vmalloc_user(unsigned long size) __alloc_size(1);
 extern void *vmalloc_node(unsigned long size, int node) __alloc_size(1);
 extern void *vzalloc_node(unsigned long size, int node) __alloc_size(1);
+extern void *vmalloc_exec(unsigned long size, gfp_t gfp_mask) __alloc_size(1);
 extern void *vmalloc_32(unsigned long size) __alloc_size(1);
 extern void *vmalloc_32_user(unsigned long size) __alloc_size(1);
 extern void *__vmalloc(unsigned long size, gfp_t gfp_mask) __alloc_size(1);
diff --git a/kernel/module/main.c b/kernel/module/main.c
index d3be89de70..9eaa89e84c 100644
--- a/kernel/module/main.c
+++ b/kernel/module/main.c
@@ -1607,9 +1607,7 @@ static void dynamic_debug_remove(struct module *mod, struct _ddebug_info *dyndbg
 
 void * __weak module_alloc(unsigned long size)
 {
-	return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
-			GFP_KERNEL, PAGE_KERNEL_EXEC, VM_FLUSH_RESET_PERMS,
-			NUMA_NO_NODE, __builtin_return_address(0));
+	return vmalloc_exec(size, GFP_KERNEL);
 }
 
 bool __weak module_init_section(const char *name)
diff --git a/mm/nommu.c b/mm/nommu.c
index 57ba243c6a..8d9ab19e39 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -280,6 +280,24 @@ void *vzalloc_node(unsigned long size, int node)
 }
 EXPORT_SYMBOL(vzalloc_node);
 
+/**
+ *	vmalloc_exec  -  allocate virtually contiguous, executable memory
+ *	@size:		allocation size
+ *
+ *	Kernel-internal function to allocate enough pages to cover @size
+ *	from the page level allocator and map them into contiguous and
+ *	executable kernel virtual space.
+ *
+ *	For tight control over page level allocator and protection flags
+ *	use __vmalloc() instead.
+ */
+
+void *vmalloc_exec(unsigned long size, gfp_t gfp_mask)
+{
+	return __vmalloc(size, gfp_mask);
+}
+EXPORT_SYMBOL_GPL(vmalloc_exec);
+
 /**
  * vmalloc_32  -  allocate virtually contiguous memory (32bit addressable)
  *	@size:		allocation size
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 31ff782d36..2ebb9ea7f0 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3401,6 +3401,27 @@ void *vzalloc_node(unsigned long size, int node)
 }
 EXPORT_SYMBOL(vzalloc_node);
 
+/**
+ * vmalloc_exec - allocate virtually contiguous, executable memory
+ * @size:	  allocation size
+ *
+ * Kernel-internal function to allocate enough pages to cover @size
+ * from the page level allocator and map them into contiguous and
+ * executable kernel virtual space.
+ *
+ * For tight control over page level allocator and protection flags
+ * use __vmalloc() instead.
+ *
+ * Return: pointer to the allocated memory or %NULL on error
+ */
+void *vmalloc_exec(unsigned long size, gfp_t gfp_mask)
+{
+	return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
+			gfp_mask, PAGE_KERNEL_EXEC, VM_FLUSH_RESET_PERMS,
+			NUMA_NO_NODE, __builtin_return_address(0));
+}
+EXPORT_SYMBOL_GPL(vmalloc_exec);
+
 #if defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA32)
 #define GFP_VMALLOC32 (GFP_DMA32 | GFP_KERNEL)
 #elif defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA)
-- 
2.40.1



* [PATCH 08/32] fs: factor out d_mark_tmpfile()
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
                   ` (6 preceding siblings ...)
  2023-05-09 16:56 ` [PATCH 07/32] mm: Bring back vmalloc_exec Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-05-09 16:56 ` [PATCH 09/32] block: Add some exports for bcachefs Kent Overstreet
                   ` (23 subsequent siblings)
  31 siblings, 0 replies; 73+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs
  Cc: Kent Overstreet, Kent Overstreet, Alexander Viro, Christian Brauner

From: Kent Overstreet <[email protected]>

New helper for bcachefs - bcachefs doesn't want the
inode_dec_link_count() call that d_tmpfile() does; it handles i_nlink on
its own, atomically with other btree updates.
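
A hedged sketch of a ->tmpfile() implementation using the new helper;
my_fs_new_inode() is hypothetical and assumed to account for i_nlink
itself as part of its own transaction:

static int my_fs_tmpfile(struct mnt_idmap *idmap, struct inode *dir,
			 struct file *file, umode_t mode)
{
	struct inode *inode = my_fs_new_inode(dir, mode);

	if (IS_ERR(inode))
		return PTR_ERR(inode);

	d_mark_tmpfile(file, inode);		/* no inode_dec_link_count() */
	d_instantiate(file->f_path.dentry, inode);
	return finish_open_simple(file, 0);
}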

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: [email protected]
---
 fs/dcache.c            | 12 ++++++++++--
 include/linux/dcache.h |  1 +
 2 files changed, 11 insertions(+), 2 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 52e6d5fdab..dbdafa2617 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -3249,11 +3249,10 @@ void d_genocide(struct dentry *parent)
 
 EXPORT_SYMBOL(d_genocide);
 
-void d_tmpfile(struct file *file, struct inode *inode)
+void d_mark_tmpfile(struct file *file, struct inode *inode)
 {
 	struct dentry *dentry = file->f_path.dentry;
 
-	inode_dec_link_count(inode);
 	BUG_ON(dentry->d_name.name != dentry->d_iname ||
 		!hlist_unhashed(&dentry->d_u.d_alias) ||
 		!d_unlinked(dentry));
@@ -3263,6 +3262,15 @@ void d_tmpfile(struct file *file, struct inode *inode)
 				(unsigned long long)inode->i_ino);
 	spin_unlock(&dentry->d_lock);
 	spin_unlock(&dentry->d_parent->d_lock);
+}
+EXPORT_SYMBOL(d_mark_tmpfile);
+
+void d_tmpfile(struct file *file, struct inode *inode)
+{
+	struct dentry *dentry = file->f_path.dentry;
+
+	inode_dec_link_count(inode);
+	d_mark_tmpfile(file, inode);
 	d_instantiate(dentry, inode);
 }
 EXPORT_SYMBOL(d_tmpfile);
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 6b351e009f..3da2f0545d 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -251,6 +251,7 @@ extern struct dentry * d_make_root(struct inode *);
 /* <clickety>-<click> the ramfs-type tree */
 extern void d_genocide(struct dentry *);
 
+extern void d_mark_tmpfile(struct file *, struct inode *);
 extern void d_tmpfile(struct file *, struct inode *);
 
 extern struct dentry *d_find_alias(struct inode *);
-- 
2.40.1



* [PATCH 09/32] block: Add some exports for bcachefs
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
                   ` (7 preceding siblings ...)
  2023-05-09 16:56 ` [PATCH 08/32] fs: factor out d_mark_tmpfile() Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-05-09 16:56 ` [PATCH 10/32] block: Allow bio_iov_iter_get_pages() with bio->bi_bdev unset Kent Overstreet
                   ` (22 subsequent siblings)
  31 siblings, 0 replies; 73+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs
  Cc: Kent Overstreet, linux-block, Jens Axboe, Kent Overstreet

From: Kent Overstreet <[email protected]>

 - bio_set_pages_dirty(), bio_check_pages_dirty() - dio path
 - blk_status_to_str() - error messages
 - bio_add_folio() - this should definitely be exported for everyone;
   it's the modern version of bio_add_page() (see the sketch below)
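
As a rough usage sketch (not code from this series), a filesystem might
combine bio_add_folio() and blk_status_to_str() along these lines;
example_read_endio()/example_submit_read() are made-up names:

#include <linux/bio.h>
#include <linux/blkdev.h>

/* Completion handler: turn the status into a readable error string. */
static void example_read_endio(struct bio *bio)
{
	if (bio->bi_status)
		pr_err("example: read error: %s\n",
		       blk_status_to_str(bio->bi_status));
	bio_put(bio);
}

/* Build and submit a single-folio read with the folio-based helper. */
static void example_submit_read(struct block_device *bdev,
				struct folio *folio, sector_t sector)
{
	struct bio *bio = bio_alloc(bdev, 1, REQ_OP_READ, GFP_NOFS);

	bio->bi_iter.bi_sector = sector;
	bio->bi_end_io = example_read_endio;

	if (!bio_add_folio(bio, folio, folio_size(folio), 0)) {
		bio_put(bio);
		return;
	}
	submit_bio(bio);
}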

Signed-off-by: Kent Overstreet <[email protected]>
Cc: [email protected]
Cc: Jens Axboe <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
---
 block/bio.c            | 3 +++
 block/blk-core.c       | 1 +
 block/blk.h            | 1 -
 include/linux/blkdev.h | 1 +
 4 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/block/bio.c b/block/bio.c
index fd11614bba..1e75840d17 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1159,6 +1159,7 @@ bool bio_add_folio(struct bio *bio, struct folio *folio, size_t len,
 		return false;
 	return bio_add_page(bio, &folio->page, len, off) > 0;
 }
+EXPORT_SYMBOL(bio_add_folio);
 
 void __bio_release_pages(struct bio *bio, bool mark_dirty)
 {
@@ -1480,6 +1481,7 @@ void bio_set_pages_dirty(struct bio *bio)
 			set_page_dirty_lock(bvec->bv_page);
 	}
 }
+EXPORT_SYMBOL_GPL(bio_set_pages_dirty);
 
 /*
  * bio_check_pages_dirty() will check that all the BIO's pages are still dirty.
@@ -1539,6 +1541,7 @@ void bio_check_pages_dirty(struct bio *bio)
 	spin_unlock_irqrestore(&bio_dirty_lock, flags);
 	schedule_work(&bio_dirty_work);
 }
+EXPORT_SYMBOL_GPL(bio_check_pages_dirty);
 
 static inline bool bio_remaining_done(struct bio *bio)
 {
diff --git a/block/blk-core.c b/block/blk-core.c
index 42926e6cb8..f19bcc684b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -205,6 +205,7 @@ const char *blk_status_to_str(blk_status_t status)
 		return "<null>";
 	return blk_errors[idx].name;
 }
+EXPORT_SYMBOL_GPL(blk_status_to_str);
 
 /**
  * blk_sync_queue - cancel any pending callbacks on a queue
diff --git a/block/blk.h b/block/blk.h
index cc4e8873df..cc04dc73e9 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -259,7 +259,6 @@ static inline void blk_integrity_del(struct gendisk *disk)
 
 unsigned long blk_rq_timeout(unsigned long timeout);
 void blk_add_timer(struct request *req);
-const char *blk_status_to_str(blk_status_t status);
 
 bool blk_attempt_plug_merge(struct request_queue *q, struct bio *bio,
 		unsigned int nr_segs);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 941304f174..7cac183112 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -867,6 +867,7 @@ extern const char *blk_op_str(enum req_op op);
 
 int blk_status_to_errno(blk_status_t status);
 blk_status_t errno_to_blk_status(int errno);
+const char *blk_status_to_str(blk_status_t status);
 
 /* only poll the hardware once, don't continue until a completion was found */
 #define BLK_POLL_ONESHOT		(1 << 0)
-- 
2.40.1



* [PATCH 10/32] block: Allow bio_iov_iter_get_pages() with bio->bi_bdev unset
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
                   ` (8 preceding siblings ...)
  2023-05-09 16:56 ` [PATCH 09/32] block: Add some exports for bcachefs Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-05-09 16:56 ` [PATCH 11/32] block: Bring back zero_fill_bio_iter Kent Overstreet
                   ` (21 subsequent siblings)
  31 siblings, 0 replies; 73+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs
  Cc: Kent Overstreet, Jens Axboe, linux-block

bio_iov_iter_get_pages() trims the IO based on the block size of the
block device the IO will be issued to.

However, bcachefs is a multi-device filesystem; when we're creating the
bio we don't yet know which block device it will be submitted to, so we
have to handle the alignment checks elsewhere.

Thus, only trim when bio->bi_bdev is set, to avoid a NULL pointer deref.
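
A hedged sketch of the intended usage (not taken from bcachefs; the
device-selection step is left out entirely):

#include <linux/bio.h>

/* Pull user pages into a bio before the target device is known;
 * bio->bi_bdev stays NULL, so no per-device trimming happens here and
 * the filesystem performs its own alignment checks later. */
static int example_start_dio_write(struct iov_iter *iter)
{
	struct bio *bio = bio_alloc_bioset(NULL, BIO_MAX_VECS, REQ_OP_WRITE,
					   GFP_NOFS, &fs_bio_set);
	int ret;

	ret = bio_iov_iter_get_pages(bio, iter);
	if (ret) {
		bio_put(bio);
		return ret;
	}

	/* ...pick replica/device(s), set bio->bi_bdev, then submit... */
	return 0;
}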

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: [email protected]
---
 block/bio.c | 10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 1e75840d17..e74a04ea14 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1245,7 +1245,7 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 	struct page **pages = (struct page **)bv;
 	ssize_t size, left;
 	unsigned len, i = 0;
-	size_t offset, trim;
+	size_t offset;
 	int ret = 0;
 
 	/*
@@ -1274,10 +1274,12 @@ static int __bio_iov_iter_get_pages(struct bio *bio, struct iov_iter *iter)
 
 	nr_pages = DIV_ROUND_UP(offset + size, PAGE_SIZE);
 
-	trim = size & (bdev_logical_block_size(bio->bi_bdev) - 1);
-	iov_iter_revert(iter, trim);
+	if (bio->bi_bdev) {
+		size_t trim = size & (bdev_logical_block_size(bio->bi_bdev) - 1);
+		iov_iter_revert(iter, trim);
+		size -= trim;
+	}
 
-	size -= trim;
 	if (unlikely(!size)) {
 		ret = -EFAULT;
 		goto out;
-- 
2.40.1



* [PATCH 11/32] block: Bring back zero_fill_bio_iter
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
                   ` (9 preceding siblings ...)
  2023-05-09 16:56 ` [PATCH 10/32] block: Allow bio_iov_iter_get_pages() with bio->bi_bdev unset Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-05-09 16:56 ` [PATCH 12/32] block: Rework bio_for_each_segment_all() Kent Overstreet
                   ` (20 subsequent siblings)
  31 siblings, 0 replies; 73+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs
  Cc: Kent Overstreet, Kent Overstreet, Jens Axboe, linux-block

From: Kent Overstreet <[email protected]>

This reverts the commit that deleted it; it's used by bcachefs.
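
For illustration (not part of this patch), zero_fill_bio_iter() lets a
caller zero only part of a bio's payload, starting from an arbitrary
iterator position - e.g. to clear whatever a short read didn't fill:

#include <linux/bio.h>

/* Sketch: assumes bio->bi_iter still describes the full payload (for
 * instance a saved copy), and zeroes everything past filled_bytes. */
static void example_zero_unfilled(struct bio *bio, unsigned filled_bytes)
{
	struct bvec_iter iter = bio->bi_iter;

	bio_advance_iter(bio, &iter, filled_bytes);
	zero_fill_bio_iter(bio, iter);
}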

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: [email protected]
---
 block/bio.c         | 6 +++---
 include/linux/bio.h | 7 ++++++-
 2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index e74a04ea14..70b5c987bc 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -606,15 +606,15 @@ struct bio *bio_kmalloc(unsigned short nr_vecs, gfp_t gfp_mask)
 }
 EXPORT_SYMBOL(bio_kmalloc);
 
-void zero_fill_bio(struct bio *bio)
+void zero_fill_bio_iter(struct bio *bio, struct bvec_iter start)
 {
 	struct bio_vec bv;
 	struct bvec_iter iter;
 
-	bio_for_each_segment(bv, bio, iter)
+	__bio_for_each_segment(bv, bio, iter, start)
 		memzero_bvec(&bv);
 }
-EXPORT_SYMBOL(zero_fill_bio);
+EXPORT_SYMBOL(zero_fill_bio_iter);
 
 /**
  * bio_truncate - truncate the bio to small size of @new_size
diff --git a/include/linux/bio.h b/include/linux/bio.h
index d766be7152..3536f28c05 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -484,7 +484,12 @@ extern void bio_copy_data_iter(struct bio *dst, struct bvec_iter *dst_iter,
 extern void bio_copy_data(struct bio *dst, struct bio *src);
 extern void bio_free_pages(struct bio *bio);
 void guard_bio_eod(struct bio *bio);
-void zero_fill_bio(struct bio *bio);
+void zero_fill_bio_iter(struct bio *bio, struct bvec_iter iter);
+
+static inline void zero_fill_bio(struct bio *bio)
+{
+	zero_fill_bio_iter(bio, bio->bi_iter);
+}
 
 static inline void bio_release_pages(struct bio *bio, bool mark_dirty)
 {
-- 
2.40.1



* [PATCH 12/32] block: Rework bio_for_each_segment_all()
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
                   ` (10 preceding siblings ...)
  2023-05-09 16:56 ` [PATCH 11/32] block: Bring back zero_fill_bio_iter Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-05-09 16:56 ` [PATCH 13/32] block: Rework bio_for_each_folio_all() Kent Overstreet
                   ` (19 subsequent siblings)
  31 siblings, 0 replies; 73+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs
  Cc: Kent Overstreet, Jens Axboe, linux-block, Ming Lei, Phillip Lougher

This patch reworks bio_for_each_segment_all() to be more in line with how
the other bio iterators work:

 - bio_iter_all_peek() now returns a synthesized bio_vec instead of
   stashing one in the iterator and handing out a pointer to it. This
   makes it clearer what is a constructed value vs. a reference to
   something pre-existing, and it will also help with cleaning up and
   consolidating code with bio_for_each_folio_all().

 - We now provide bio_for_each_segment_all_continue() for squashfs,
   which makes its code clearer (see the sketch after this list).
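
As a sketch of the resulting calling convention (not part of the diff
below), an endio handler now iterates over by-value bio_vecs, and the
_continue variant picks up from wherever the iterator was left:

#include <linux/bio.h>
#include <linux/highmem.h>
#include <linux/pagemap.h>

/* bvec is a value synthesized by the iterator, not a pointer into the
 * bio's bvec array. */
static void example_write_endio(struct bio *bio)
{
	struct bvec_iter_all iter;
	struct bio_vec bvec;

	bio_for_each_segment_all(bvec, bio, iter)
		end_page_writeback(bvec.bv_page);

	bio_put(bio);
}

/* Skip the first @skip bytes, then walk the remaining segments - the
 * same pattern the squashfs conversion below uses. */
static void example_skip_then_walk(struct bio *bio, unsigned skip)
{
	struct bvec_iter_all iter;
	struct bio_vec bvec;

	bvec_iter_all_init(&iter);
	bio_iter_all_advance(bio, &iter, skip);

	bio_for_each_segment_all_continue(bvec, bio, iter)
		flush_dcache_page(bvec.bv_page);
}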

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: [email protected]
Cc: Ming Lei <[email protected]>
Cc: Phillip Lougher <[email protected]>
---
 block/bio.c                | 38 ++++++++++++------------
 block/blk-map.c            | 38 ++++++++++++------------
 block/bounce.c             | 12 ++++----
 drivers/md/bcache/btree.c  |  8 ++---
 drivers/md/dm-crypt.c      | 10 +++----
 drivers/md/raid1.c         |  4 +--
 fs/btrfs/disk-io.c         |  4 +--
 fs/btrfs/extent_io.c       | 50 +++++++++++++++----------------
 fs/btrfs/raid56.c          | 14 ++++-----
 fs/erofs/zdata.c           |  4 +--
 fs/ext4/page-io.c          |  8 ++---
 fs/ext4/readpage.c         |  4 +--
 fs/f2fs/data.c             | 20 ++++++-------
 fs/gfs2/lops.c             | 10 +++----
 fs/gfs2/meta_io.c          |  8 ++---
 fs/mpage.c                 |  4 +--
 fs/squashfs/block.c        | 48 +++++++++++++++++-------------
 fs/squashfs/lz4_wrapper.c  | 17 ++++++-----
 fs/squashfs/lzo_wrapper.c  | 17 ++++++-----
 fs/squashfs/xz_wrapper.c   | 19 ++++++------
 fs/squashfs/zlib_wrapper.c | 18 ++++++-----
 fs/squashfs/zstd_wrapper.c | 19 ++++++------
 include/linux/bio.h        | 34 ++++++++++++++++-----
 include/linux/bvec.h       | 61 ++++++++++++++++++++++----------------
 24 files changed, 256 insertions(+), 213 deletions(-)

diff --git a/block/bio.c b/block/bio.c
index 70b5c987bc..f2845d4e47 100644
--- a/block/bio.c
+++ b/block/bio.c
@@ -1163,13 +1163,13 @@ EXPORT_SYMBOL(bio_add_folio);
 
 void __bio_release_pages(struct bio *bio, bool mark_dirty)
 {
-	struct bvec_iter_all iter_all;
-	struct bio_vec *bvec;
+	struct bvec_iter_all iter;
+	struct bio_vec bvec;
 
-	bio_for_each_segment_all(bvec, bio, iter_all) {
-		if (mark_dirty && !PageCompound(bvec->bv_page))
-			set_page_dirty_lock(bvec->bv_page);
-		put_page(bvec->bv_page);
+	bio_for_each_segment_all(bvec, bio, iter) {
+		if (mark_dirty && !PageCompound(bvec.bv_page))
+			set_page_dirty_lock(bvec.bv_page);
+		put_page(bvec.bv_page);
 	}
 }
 EXPORT_SYMBOL_GPL(__bio_release_pages);
@@ -1436,11 +1436,11 @@ EXPORT_SYMBOL(bio_copy_data);
 
 void bio_free_pages(struct bio *bio)
 {
-	struct bio_vec *bvec;
-	struct bvec_iter_all iter_all;
+	struct bvec_iter_all iter;
+	struct bio_vec bvec;
 
-	bio_for_each_segment_all(bvec, bio, iter_all)
-		__free_page(bvec->bv_page);
+	bio_for_each_segment_all(bvec, bio, iter)
+		__free_page(bvec.bv_page);
 }
 EXPORT_SYMBOL(bio_free_pages);
 
@@ -1475,12 +1475,12 @@ EXPORT_SYMBOL(bio_free_pages);
  */
 void bio_set_pages_dirty(struct bio *bio)
 {
-	struct bio_vec *bvec;
-	struct bvec_iter_all iter_all;
+	struct bvec_iter_all iter;
+	struct bio_vec bvec;
 
-	bio_for_each_segment_all(bvec, bio, iter_all) {
-		if (!PageCompound(bvec->bv_page))
-			set_page_dirty_lock(bvec->bv_page);
+	bio_for_each_segment_all(bvec, bio, iter) {
+		if (!PageCompound(bvec.bv_page))
+			set_page_dirty_lock(bvec.bv_page);
 	}
 }
 EXPORT_SYMBOL_GPL(bio_set_pages_dirty);
@@ -1524,12 +1524,12 @@ static void bio_dirty_fn(struct work_struct *work)
 
 void bio_check_pages_dirty(struct bio *bio)
 {
-	struct bio_vec *bvec;
+	struct bvec_iter_all iter;
+	struct bio_vec bvec;
 	unsigned long flags;
-	struct bvec_iter_all iter_all;
 
-	bio_for_each_segment_all(bvec, bio, iter_all) {
-		if (!PageDirty(bvec->bv_page) && !PageCompound(bvec->bv_page))
+	bio_for_each_segment_all(bvec, bio, iter) {
+		if (!PageDirty(bvec.bv_page) && !PageCompound(bvec.bv_page))
 			goto defer;
 	}
 
diff --git a/block/blk-map.c b/block/blk-map.c
index 9137d16cec..5774a9e467 100644
--- a/block/blk-map.c
+++ b/block/blk-map.c
@@ -46,21 +46,21 @@ static struct bio_map_data *bio_alloc_map_data(struct iov_iter *data,
  */
 static int bio_copy_from_iter(struct bio *bio, struct iov_iter *iter)
 {
-	struct bio_vec *bvec;
-	struct bvec_iter_all iter_all;
+	struct bvec_iter_all bv_iter;
+	struct bio_vec bvec;
 
-	bio_for_each_segment_all(bvec, bio, iter_all) {
+	bio_for_each_segment_all(bvec, bio, bv_iter) {
 		ssize_t ret;
 
-		ret = copy_page_from_iter(bvec->bv_page,
-					  bvec->bv_offset,
-					  bvec->bv_len,
+		ret = copy_page_from_iter(bvec.bv_page,
+					  bvec.bv_offset,
+					  bvec.bv_len,
 					  iter);
 
 		if (!iov_iter_count(iter))
 			break;
 
-		if (ret < bvec->bv_len)
+		if (ret < bvec.bv_len)
 			return -EFAULT;
 	}
 
@@ -77,21 +77,21 @@ static int bio_copy_from_iter(struct bio *bio, struct iov_iter *iter)
  */
 static int bio_copy_to_iter(struct bio *bio, struct iov_iter iter)
 {
-	struct bio_vec *bvec;
-	struct bvec_iter_all iter_all;
+	struct bvec_iter_all bv_iter;
+	struct bio_vec bvec;
 
-	bio_for_each_segment_all(bvec, bio, iter_all) {
+	bio_for_each_segment_all(bvec, bio, bv_iter) {
 		ssize_t ret;
 
-		ret = copy_page_to_iter(bvec->bv_page,
-					bvec->bv_offset,
-					bvec->bv_len,
+		ret = copy_page_to_iter(bvec.bv_page,
+					bvec.bv_offset,
+					bvec.bv_len,
 					&iter);
 
 		if (!iov_iter_count(&iter))
 			break;
 
-		if (ret < bvec->bv_len)
+		if (ret < bvec.bv_len)
 			return -EFAULT;
 	}
 
@@ -442,12 +442,12 @@ static void bio_copy_kern_endio(struct bio *bio)
 static void bio_copy_kern_endio_read(struct bio *bio)
 {
 	char *p = bio->bi_private;
-	struct bio_vec *bvec;
-	struct bvec_iter_all iter_all;
+	struct bvec_iter_all iter;
+	struct bio_vec bvec;
 
-	bio_for_each_segment_all(bvec, bio, iter_all) {
-		memcpy_from_bvec(p, bvec);
-		p += bvec->bv_len;
+	bio_for_each_segment_all(bvec, bio, iter) {
+		memcpy_from_bvec(p, &bvec);
+		p += bvec.bv_len;
 	}
 
 	bio_copy_kern_endio(bio);
diff --git a/block/bounce.c b/block/bounce.c
index 7cfcb242f9..e701832d76 100644
--- a/block/bounce.c
+++ b/block/bounce.c
@@ -102,18 +102,18 @@ static void copy_to_high_bio_irq(struct bio *to, struct bio *from)
 static void bounce_end_io(struct bio *bio)
 {
 	struct bio *bio_orig = bio->bi_private;
-	struct bio_vec *bvec, orig_vec;
+	struct bio_vec bvec, orig_vec;
 	struct bvec_iter orig_iter = bio_orig->bi_iter;
-	struct bvec_iter_all iter_all;
+	struct bvec_iter_all iter;
 
 	/*
 	 * free up bounce indirect pages used
 	 */
-	bio_for_each_segment_all(bvec, bio, iter_all) {
+	bio_for_each_segment_all(bvec, bio, iter) {
 		orig_vec = bio_iter_iovec(bio_orig, orig_iter);
-		if (bvec->bv_page != orig_vec.bv_page) {
-			dec_zone_page_state(bvec->bv_page, NR_BOUNCE);
-			mempool_free(bvec->bv_page, &page_pool);
+		if (bvec.bv_page != orig_vec.bv_page) {
+			dec_zone_page_state(bvec.bv_page, NR_BOUNCE);
+			mempool_free(bvec.bv_page, &page_pool);
 		}
 		bio_advance_iter(bio_orig, &orig_iter, orig_vec.bv_len);
 	}
diff --git a/drivers/md/bcache/btree.c b/drivers/md/bcache/btree.c
index 147c493a98..98ce12b239 100644
--- a/drivers/md/bcache/btree.c
+++ b/drivers/md/bcache/btree.c
@@ -373,12 +373,12 @@ static void do_btree_node_write(struct btree *b)
 		       bset_sector_offset(&b->keys, i));
 
 	if (!bch_bio_alloc_pages(b->bio, __GFP_NOWARN|GFP_NOWAIT)) {
-		struct bio_vec *bv;
+		struct bio_vec bv;
 		void *addr = (void *) ((unsigned long) i & ~(PAGE_SIZE - 1));
-		struct bvec_iter_all iter_all;
+		struct bvec_iter_all iter;
 
-		bio_for_each_segment_all(bv, b->bio, iter_all) {
-			memcpy(page_address(bv->bv_page), addr, PAGE_SIZE);
+		bio_for_each_segment_all(bv, b->bio, iter) {
+			memcpy(page_address(bv.bv_page), addr, PAGE_SIZE);
 			addr += PAGE_SIZE;
 		}
 
diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c
index 3ba53dc3cc..166bb4fdb4 100644
--- a/drivers/md/dm-crypt.c
+++ b/drivers/md/dm-crypt.c
@@ -1713,12 +1713,12 @@ static struct bio *crypt_alloc_buffer(struct dm_crypt_io *io, unsigned int size)
 
 static void crypt_free_buffer_pages(struct crypt_config *cc, struct bio *clone)
 {
-	struct bio_vec *bv;
-	struct bvec_iter_all iter_all;
+	struct bvec_iter_all iter;
+	struct bio_vec bv;
 
-	bio_for_each_segment_all(bv, clone, iter_all) {
-		BUG_ON(!bv->bv_page);
-		mempool_free(bv->bv_page, &cc->page_pool);
+	bio_for_each_segment_all(bv, clone, iter) {
+		BUG_ON(!bv.bv_page);
+		mempool_free(bv.bv_page, &cc->page_pool);
 	}
 }
 
diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index 68a9e2d998..4f58cae37e 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -2188,7 +2188,7 @@ static void process_checks(struct r1bio *r1_bio)
 		blk_status_t status = sbio->bi_status;
 		struct page **ppages = get_resync_pages(pbio)->pages;
 		struct page **spages = get_resync_pages(sbio)->pages;
-		struct bio_vec *bi;
+		struct bio_vec bi;
 		int page_len[RESYNC_PAGES] = { 0 };
 		struct bvec_iter_all iter_all;
 
@@ -2198,7 +2198,7 @@ static void process_checks(struct r1bio *r1_bio)
 		sbio->bi_status = 0;
 
 		bio_for_each_segment_all(bi, sbio, iter_all)
-			page_len[j++] = bi->bv_len;
+			page_len[j++] = bi.bv_len;
 
 		if (!status) {
 			for (j = vcnt; j-- ; ) {
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index 9e1596bb20..92b3396c15 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -3804,12 +3804,12 @@ ALLOW_ERROR_INJECTION(open_ctree, ERRNO);
 static void btrfs_end_super_write(struct bio *bio)
 {
 	struct btrfs_device *device = bio->bi_private;
-	struct bio_vec *bvec;
+	struct bio_vec bvec;
 	struct bvec_iter_all iter_all;
 	struct page *page;
 
 	bio_for_each_segment_all(bvec, bio, iter_all) {
-		page = bvec->bv_page;
+		page = bvec.bv_page;
 
 		if (bio->bi_status) {
 			btrfs_warn_rl_in_rcu(device->fs_info,
diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
index 40300e8e5f..5796c99ea1 100644
--- a/fs/btrfs/extent_io.c
+++ b/fs/btrfs/extent_io.c
@@ -581,34 +581,34 @@ static void end_bio_extent_writepage(struct btrfs_bio *bbio)
 {
 	struct bio *bio = &bbio->bio;
 	int error = blk_status_to_errno(bio->bi_status);
-	struct bio_vec *bvec;
+	struct bio_vec bvec;
 	u64 start;
 	u64 end;
 	struct bvec_iter_all iter_all;
 
 	ASSERT(!bio_flagged(bio, BIO_CLONED));
 	bio_for_each_segment_all(bvec, bio, iter_all) {
-		struct page *page = bvec->bv_page;
+		struct page *page = bvec.bv_page;
 		struct inode *inode = page->mapping->host;
 		struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 		const u32 sectorsize = fs_info->sectorsize;
 
 		/* Our read/write should always be sector aligned. */
-		if (!IS_ALIGNED(bvec->bv_offset, sectorsize))
+		if (!IS_ALIGNED(bvec.bv_offset, sectorsize))
 			btrfs_err(fs_info,
 		"partial page write in btrfs with offset %u and length %u",
-				  bvec->bv_offset, bvec->bv_len);
-		else if (!IS_ALIGNED(bvec->bv_len, sectorsize))
+				  bvec.bv_offset, bvec.bv_len);
+		else if (!IS_ALIGNED(bvec.bv_len, sectorsize))
 			btrfs_info(fs_info,
 		"incomplete page write with offset %u and length %u",
-				   bvec->bv_offset, bvec->bv_len);
+				   bvec.bv_offset, bvec.bv_len);
 
-		start = page_offset(page) + bvec->bv_offset;
-		end = start + bvec->bv_len - 1;
+		start = page_offset(page) + bvec.bv_offset;
+		end = start + bvec.bv_len - 1;
 
 		end_extent_writepage(page, error, start, end);
 
-		btrfs_page_clear_writeback(fs_info, page, start, bvec->bv_len);
+		btrfs_page_clear_writeback(fs_info, page, start, bvec.bv_len);
 	}
 
 	bio_put(bio);
@@ -736,7 +736,7 @@ static struct extent_buffer *find_extent_buffer_readpage(
 static void end_bio_extent_readpage(struct btrfs_bio *bbio)
 {
 	struct bio *bio = &bbio->bio;
-	struct bio_vec *bvec;
+	struct bio_vec bvec;
 	struct processed_extent processed = { 0 };
 	/*
 	 * The offset to the beginning of a bio, since one bio can never be
@@ -749,7 +749,7 @@ static void end_bio_extent_readpage(struct btrfs_bio *bbio)
 	ASSERT(!bio_flagged(bio, BIO_CLONED));
 	bio_for_each_segment_all(bvec, bio, iter_all) {
 		bool uptodate = !bio->bi_status;
-		struct page *page = bvec->bv_page;
+		struct page *page = bvec.bv_page;
 		struct inode *inode = page->mapping->host;
 		struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
 		const u32 sectorsize = fs_info->sectorsize;
@@ -769,19 +769,19 @@ static void end_bio_extent_readpage(struct btrfs_bio *bbio)
 		 * for unaligned offsets, and an error if they don't add up to
 		 * a full sector.
 		 */
-		if (!IS_ALIGNED(bvec->bv_offset, sectorsize))
+		if (!IS_ALIGNED(bvec.bv_offset, sectorsize))
 			btrfs_err(fs_info,
 		"partial page read in btrfs with offset %u and length %u",
-				  bvec->bv_offset, bvec->bv_len);
-		else if (!IS_ALIGNED(bvec->bv_offset + bvec->bv_len,
+				  bvec.bv_offset, bvec.bv_len);
+		else if (!IS_ALIGNED(bvec.bv_offset + bvec.bv_len,
 				     sectorsize))
 			btrfs_info(fs_info,
 		"incomplete page read with offset %u and length %u",
-				   bvec->bv_offset, bvec->bv_len);
+				   bvec.bv_offset, bvec.bv_len);
 
-		start = page_offset(page) + bvec->bv_offset;
-		end = start + bvec->bv_len - 1;
-		len = bvec->bv_len;
+		start = page_offset(page) + bvec.bv_offset;
+		end = start + bvec.bv_len - 1;
+		len = bvec.bv_len;
 
 		mirror = bbio->mirror_num;
 		if (uptodate && !is_data_inode(inode) &&
@@ -1993,7 +1993,7 @@ static void end_bio_subpage_eb_writepage(struct btrfs_bio *bbio)
 {
 	struct bio *bio = &bbio->bio;
 	struct btrfs_fs_info *fs_info;
-	struct bio_vec *bvec;
+	struct bio_vec bvec;
 	struct bvec_iter_all iter_all;
 
 	fs_info = btrfs_sb(bio_first_page_all(bio)->mapping->host->i_sb);
@@ -2001,12 +2001,12 @@ static void end_bio_subpage_eb_writepage(struct btrfs_bio *bbio)
 
 	ASSERT(!bio_flagged(bio, BIO_CLONED));
 	bio_for_each_segment_all(bvec, bio, iter_all) {
-		struct page *page = bvec->bv_page;
-		u64 bvec_start = page_offset(page) + bvec->bv_offset;
-		u64 bvec_end = bvec_start + bvec->bv_len - 1;
+		struct page *page = bvec.bv_page;
+		u64 bvec_start = page_offset(page) + bvec.bv_offset;
+		u64 bvec_end = bvec_start + bvec.bv_len - 1;
 		u64 cur_bytenr = bvec_start;
 
-		ASSERT(IS_ALIGNED(bvec->bv_len, fs_info->nodesize));
+		ASSERT(IS_ALIGNED(bvec.bv_len, fs_info->nodesize));
 
 		/* Iterate through all extent buffers in the range */
 		while (cur_bytenr <= bvec_end) {
@@ -2050,14 +2050,14 @@ static void end_bio_subpage_eb_writepage(struct btrfs_bio *bbio)
 static void end_bio_extent_buffer_writepage(struct btrfs_bio *bbio)
 {
 	struct bio *bio = &bbio->bio;
-	struct bio_vec *bvec;
+	struct bio_vec bvec;
 	struct extent_buffer *eb;
 	int done;
 	struct bvec_iter_all iter_all;
 
 	ASSERT(!bio_flagged(bio, BIO_CLONED));
 	bio_for_each_segment_all(bvec, bio, iter_all) {
-		struct page *page = bvec->bv_page;
+		struct page *page = bvec.bv_page;
 
 		eb = (struct extent_buffer *)page->private;
 		BUG_ON(!eb);
diff --git a/fs/btrfs/raid56.c b/fs/btrfs/raid56.c
index 642828c1b2..39d8101541 100644
--- a/fs/btrfs/raid56.c
+++ b/fs/btrfs/raid56.c
@@ -1388,7 +1388,7 @@ static struct sector_ptr *find_stripe_sector(struct btrfs_raid_bio *rbio,
 static void set_bio_pages_uptodate(struct btrfs_raid_bio *rbio, struct bio *bio)
 {
 	const u32 sectorsize = rbio->bioc->fs_info->sectorsize;
-	struct bio_vec *bvec;
+	struct bio_vec bvec;
 	struct bvec_iter_all iter_all;
 
 	ASSERT(!bio_flagged(bio, BIO_CLONED));
@@ -1397,9 +1397,9 @@ static void set_bio_pages_uptodate(struct btrfs_raid_bio *rbio, struct bio *bio)
 		struct sector_ptr *sector;
 		int pgoff;
 
-		for (pgoff = bvec->bv_offset; pgoff - bvec->bv_offset < bvec->bv_len;
+		for (pgoff = bvec.bv_offset; pgoff - bvec.bv_offset < bvec.bv_len;
 		     pgoff += sectorsize) {
-			sector = find_stripe_sector(rbio, bvec->bv_page, pgoff);
+			sector = find_stripe_sector(rbio, bvec.bv_page, pgoff);
 			ASSERT(sector);
 			if (sector)
 				sector->uptodate = 1;
@@ -1453,7 +1453,7 @@ static void verify_bio_data_sectors(struct btrfs_raid_bio *rbio,
 {
 	struct btrfs_fs_info *fs_info = rbio->bioc->fs_info;
 	int total_sector_nr = get_bio_sector_nr(rbio, bio);
-	struct bio_vec *bvec;
+	struct bio_vec bvec;
 	struct bvec_iter_all iter_all;
 
 	/* No data csum for the whole stripe, no need to verify. */
@@ -1467,8 +1467,8 @@ static void verify_bio_data_sectors(struct btrfs_raid_bio *rbio,
 	bio_for_each_segment_all(bvec, bio, iter_all) {
 		int bv_offset;
 
-		for (bv_offset = bvec->bv_offset;
-		     bv_offset < bvec->bv_offset + bvec->bv_len;
+		for (bv_offset = bvec.bv_offset;
+		     bv_offset < bvec.bv_offset + bvec.bv_len;
 		     bv_offset += fs_info->sectorsize, total_sector_nr++) {
 			u8 csum_buf[BTRFS_CSUM_SIZE];
 			u8 *expected_csum = rbio->csum_buf +
@@ -1479,7 +1479,7 @@ static void verify_bio_data_sectors(struct btrfs_raid_bio *rbio,
 			if (!test_bit(total_sector_nr, rbio->csum_bitmap))
 				continue;
 
-			ret = btrfs_check_sector_csum(fs_info, bvec->bv_page,
+			ret = btrfs_check_sector_csum(fs_info, bvec.bv_page,
 				bv_offset, csum_buf, expected_csum);
 			if (ret < 0)
 				set_bit(total_sector_nr, rbio->error_bitmap);
diff --git a/fs/erofs/zdata.c b/fs/erofs/zdata.c
index f1708c77a9..1fd0f01d11 100644
--- a/fs/erofs/zdata.c
+++ b/fs/erofs/zdata.c
@@ -1651,11 +1651,11 @@ static void z_erofs_decompressqueue_endio(struct bio *bio)
 {
 	struct z_erofs_decompressqueue *q = bio->bi_private;
 	blk_status_t err = bio->bi_status;
-	struct bio_vec *bvec;
+	struct bio_vec bvec;
 	struct bvec_iter_all iter_all;
 
 	bio_for_each_segment_all(bvec, bio, iter_all) {
-		struct page *page = bvec->bv_page;
+		struct page *page = bvec.bv_page;
 
 		DBG_BUGON(PageUptodate(page));
 		DBG_BUGON(z_erofs_page_is_invalidated(page));
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index 1e4db96a04..81a1cc4518 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -99,15 +99,15 @@ static void buffer_io_error(struct buffer_head *bh)
 
 static void ext4_finish_bio(struct bio *bio)
 {
-	struct bio_vec *bvec;
+	struct bio_vec bvec;
 	struct bvec_iter_all iter_all;
 
 	bio_for_each_segment_all(bvec, bio, iter_all) {
-		struct page *page = bvec->bv_page;
+		struct page *page = bvec.bv_page;
 		struct page *bounce_page = NULL;
 		struct buffer_head *bh, *head;
-		unsigned bio_start = bvec->bv_offset;
-		unsigned bio_end = bio_start + bvec->bv_len;
+		unsigned bio_start = bvec.bv_offset;
+		unsigned bio_end = bio_start + bvec.bv_len;
 		unsigned under_io = 0;
 		unsigned long flags;
 
diff --git a/fs/ext4/readpage.c b/fs/ext4/readpage.c
index c61dc8a7c0..ce42b3d5c9 100644
--- a/fs/ext4/readpage.c
+++ b/fs/ext4/readpage.c
@@ -69,11 +69,11 @@ struct bio_post_read_ctx {
 static void __read_end_io(struct bio *bio)
 {
 	struct page *page;
-	struct bio_vec *bv;
+	struct bio_vec bv;
 	struct bvec_iter_all iter_all;
 
 	bio_for_each_segment_all(bv, bio, iter_all) {
-		page = bv->bv_page;
+		page = bv.bv_page;
 
 		if (bio->bi_status)
 			ClearPageUptodate(page);
diff --git a/fs/f2fs/data.c b/fs/f2fs/data.c
index 06b552a0ab..e44bd8586f 100644
--- a/fs/f2fs/data.c
+++ b/fs/f2fs/data.c
@@ -139,12 +139,12 @@ struct bio_post_read_ctx {
  */
 static void f2fs_finish_read_bio(struct bio *bio, bool in_task)
 {
-	struct bio_vec *bv;
+	struct bio_vec bv;
 	struct bvec_iter_all iter_all;
 	struct bio_post_read_ctx *ctx = bio->bi_private;
 
 	bio_for_each_segment_all(bv, bio, iter_all) {
-		struct page *page = bv->bv_page;
+		struct page *page = bv.bv_page;
 
 		if (f2fs_is_compressed_page(page)) {
 			if (ctx && !ctx->decompression_attempted)
@@ -189,11 +189,11 @@ static void f2fs_verify_bio(struct work_struct *work)
 	 * as those were handled separately by f2fs_end_read_compressed_page().
 	 */
 	if (may_have_compressed_pages) {
-		struct bio_vec *bv;
+		struct bio_vec bv;
 		struct bvec_iter_all iter_all;
 
 		bio_for_each_segment_all(bv, bio, iter_all) {
-			struct page *page = bv->bv_page;
+			struct page *page = bv.bv_page;
 
 			if (!f2fs_is_compressed_page(page) &&
 			    !fsverity_verify_page(page)) {
@@ -241,13 +241,13 @@ static void f2fs_verify_and_finish_bio(struct bio *bio, bool in_task)
 static void f2fs_handle_step_decompress(struct bio_post_read_ctx *ctx,
 		bool in_task)
 {
-	struct bio_vec *bv;
+	struct bio_vec bv;
 	struct bvec_iter_all iter_all;
 	bool all_compressed = true;
 	block_t blkaddr = ctx->fs_blkaddr;
 
 	bio_for_each_segment_all(bv, ctx->bio, iter_all) {
-		struct page *page = bv->bv_page;
+		struct page *page = bv.bv_page;
 
 		if (f2fs_is_compressed_page(page))
 			f2fs_end_read_compressed_page(page, false, blkaddr,
@@ -327,7 +327,7 @@ static void f2fs_read_end_io(struct bio *bio)
 static void f2fs_write_end_io(struct bio *bio)
 {
 	struct f2fs_sb_info *sbi;
-	struct bio_vec *bvec;
+	struct bio_vec bvec;
 	struct bvec_iter_all iter_all;
 
 	iostat_update_and_unbind_ctx(bio);
@@ -337,7 +337,7 @@ static void f2fs_write_end_io(struct bio *bio)
 		bio->bi_status = BLK_STS_IOERR;
 
 	bio_for_each_segment_all(bvec, bio, iter_all) {
-		struct page *page = bvec->bv_page;
+		struct page *page = bvec.bv_page;
 		enum count_type type = WB_DATA_TYPE(page);
 
 		if (page_private_dummy(page)) {
@@ -583,7 +583,7 @@ static void __submit_merged_bio(struct f2fs_bio_info *io)
 static bool __has_merged_page(struct bio *bio, struct inode *inode,
 						struct page *page, nid_t ino)
 {
-	struct bio_vec *bvec;
+	struct bio_vec bvec;
 	struct bvec_iter_all iter_all;
 
 	if (!bio)
@@ -593,7 +593,7 @@ static bool __has_merged_page(struct bio *bio, struct inode *inode,
 		return true;
 
 	bio_for_each_segment_all(bvec, bio, iter_all) {
-		struct page *target = bvec->bv_page;
+		struct page *target = bvec.bv_page;
 
 		if (fscrypt_is_bounce_page(target)) {
 			target = fscrypt_pagecache_page(target);
diff --git a/fs/gfs2/lops.c b/fs/gfs2/lops.c
index 1902413d5d..7f62fe8eb7 100644
--- a/fs/gfs2/lops.c
+++ b/fs/gfs2/lops.c
@@ -202,7 +202,7 @@ static void gfs2_end_log_write_bh(struct gfs2_sbd *sdp,
 static void gfs2_end_log_write(struct bio *bio)
 {
 	struct gfs2_sbd *sdp = bio->bi_private;
-	struct bio_vec *bvec;
+	struct bio_vec bvec;
 	struct page *page;
 	struct bvec_iter_all iter_all;
 
@@ -217,9 +217,9 @@ static void gfs2_end_log_write(struct bio *bio)
 	}
 
 	bio_for_each_segment_all(bvec, bio, iter_all) {
-		page = bvec->bv_page;
+		page = bvec.bv_page;
 		if (page_has_buffers(page))
-			gfs2_end_log_write_bh(sdp, bvec, bio->bi_status);
+			gfs2_end_log_write_bh(sdp, &bvec, bio->bi_status);
 		else
 			mempool_free(page, gfs2_page_pool);
 	}
@@ -395,11 +395,11 @@ static void gfs2_log_write_page(struct gfs2_sbd *sdp, struct page *page)
 static void gfs2_end_log_read(struct bio *bio)
 {
 	struct page *page;
-	struct bio_vec *bvec;
+	struct bio_vec bvec;
 	struct bvec_iter_all iter_all;
 
 	bio_for_each_segment_all(bvec, bio, iter_all) {
-		page = bvec->bv_page;
+		page = bvec.bv_page;
 		if (bio->bi_status) {
 			int err = blk_status_to_errno(bio->bi_status);
 
diff --git a/fs/gfs2/meta_io.c b/fs/gfs2/meta_io.c
index 924361fa51..832572784e 100644
--- a/fs/gfs2/meta_io.c
+++ b/fs/gfs2/meta_io.c
@@ -193,15 +193,15 @@ struct buffer_head *gfs2_meta_new(struct gfs2_glock *gl, u64 blkno)
 
 static void gfs2_meta_read_endio(struct bio *bio)
 {
-	struct bio_vec *bvec;
+	struct bio_vec bvec;
 	struct bvec_iter_all iter_all;
 
 	bio_for_each_segment_all(bvec, bio, iter_all) {
-		struct page *page = bvec->bv_page;
+		struct page *page = bvec.bv_page;
 		struct buffer_head *bh = page_buffers(page);
-		unsigned int len = bvec->bv_len;
+		unsigned int len = bvec.bv_len;
 
-		while (bh_offset(bh) < bvec->bv_offset)
+		while (bh_offset(bh) < bvec.bv_offset)
 			bh = bh->b_this_page;
 		do {
 			struct buffer_head *next = bh->b_this_page;
diff --git a/fs/mpage.c b/fs/mpage.c
index 22b9de5ddd..49505456ba 100644
--- a/fs/mpage.c
+++ b/fs/mpage.c
@@ -45,11 +45,11 @@
  */
 static void mpage_end_io(struct bio *bio)
 {
-	struct bio_vec *bv;
+	struct bio_vec bv;
 	struct bvec_iter_all iter_all;
 
 	bio_for_each_segment_all(bv, bio, iter_all) {
-		struct page *page = bv->bv_page;
+		struct page *page = bv.bv_page;
 		page_endio(page, bio_op(bio),
 			   blk_status_to_errno(bio->bi_status));
 	}
diff --git a/fs/squashfs/block.c b/fs/squashfs/block.c
index bed3bb8b27..83e8b44518 100644
--- a/fs/squashfs/block.c
+++ b/fs/squashfs/block.c
@@ -35,30 +35,33 @@ static int copy_bio_to_actor(struct bio *bio,
 			     int offset, int req_length)
 {
 	void *actor_addr;
-	struct bvec_iter_all iter_all = {};
-	struct bio_vec *bvec = bvec_init_iter_all(&iter_all);
+	struct bvec_iter_all iter;
+	struct bio_vec bvec;
 	int copied_bytes = 0;
 	int actor_offset = 0;
+	int bytes_to_copy;
 
 	squashfs_actor_nobuff(actor);
 	actor_addr = squashfs_first_page(actor);
 
-	if (WARN_ON_ONCE(!bio_next_segment(bio, &iter_all)))
-		return 0;
+	bvec_iter_all_init(&iter);
+	bio_iter_all_advance(bio, &iter, offset);
 
-	while (copied_bytes < req_length) {
-		int bytes_to_copy = min_t(int, bvec->bv_len - offset,
+	while (copied_bytes < req_length &&
+	       iter.idx < bio->bi_vcnt) {
+		bvec = bio_iter_all_peek(bio, &iter);
+
+		bytes_to_copy = min_t(int, bvec.bv_len,
 					  PAGE_SIZE - actor_offset);
 
 		bytes_to_copy = min_t(int, bytes_to_copy,
 				      req_length - copied_bytes);
 		if (!IS_ERR(actor_addr))
-			memcpy(actor_addr + actor_offset, bvec_virt(bvec) +
-					offset, bytes_to_copy);
+			memcpy(actor_addr + actor_offset, bvec_virt(&bvec),
+			       bytes_to_copy);
 
 		actor_offset += bytes_to_copy;
 		copied_bytes += bytes_to_copy;
-		offset += bytes_to_copy;
 
 		if (actor_offset >= PAGE_SIZE) {
 			actor_addr = squashfs_next_page(actor);
@@ -66,11 +69,8 @@ static int copy_bio_to_actor(struct bio *bio,
 				break;
 			actor_offset = 0;
 		}
-		if (offset >= bvec->bv_len) {
-			if (!bio_next_segment(bio, &iter_all))
-				break;
-			offset = 0;
-		}
+
+		bio_iter_all_advance(bio, &iter, bytes_to_copy);
 	}
 	squashfs_finish_page(actor);
 	return copied_bytes;
@@ -159,8 +159,10 @@ int squashfs_read_data(struct super_block *sb, u64 index, int length,
 		 * Metadata block.
 		 */
 		const u8 *data;
-		struct bvec_iter_all iter_all = {};
-		struct bio_vec *bvec = bvec_init_iter_all(&iter_all);
+		struct bvec_iter_all iter;
+		struct bio_vec bvec;
+
+		bvec_iter_all_init(&iter);
 
 		if (index + 2 > msblk->bytes_used) {
 			res = -EIO;
@@ -170,21 +172,25 @@ int squashfs_read_data(struct super_block *sb, u64 index, int length,
 		if (res)
 			goto out;
 
-		if (WARN_ON_ONCE(!bio_next_segment(bio, &iter_all))) {
+		bvec = bio_iter_all_peek(bio, &iter);
+
+		if (WARN_ON_ONCE(!bvec.bv_len)) {
 			res = -EIO;
 			goto out_free_bio;
 		}
 		/* Extract the length of the metadata block */
-		data = bvec_virt(bvec);
+		data = bvec_virt(&bvec);
 		length = data[offset];
-		if (offset < bvec->bv_len - 1) {
+		if (offset < bvec.bv_len - 1) {
 			length |= data[offset + 1] << 8;
 		} else {
-			if (WARN_ON_ONCE(!bio_next_segment(bio, &iter_all))) {
+			bio_iter_all_advance(bio, &iter, bvec.bv_len);
+
+			if (WARN_ON_ONCE(!bvec.bv_len)) {
 				res = -EIO;
 				goto out_free_bio;
 			}
-			data = bvec_virt(bvec);
+			data = bvec_virt(&bvec);
 			length |= data[0] << 8;
 		}
 		bio_free_pages(bio);
diff --git a/fs/squashfs/lz4_wrapper.c b/fs/squashfs/lz4_wrapper.c
index 49797729f1..bd0dd787d2 100644
--- a/fs/squashfs/lz4_wrapper.c
+++ b/fs/squashfs/lz4_wrapper.c
@@ -92,20 +92,23 @@ static int lz4_uncompress(struct squashfs_sb_info *msblk, void *strm,
 	struct bio *bio, int offset, int length,
 	struct squashfs_page_actor *output)
 {
-	struct bvec_iter_all iter_all = {};
-	struct bio_vec *bvec = bvec_init_iter_all(&iter_all);
+	struct bvec_iter_all iter;
+	struct bio_vec bvec;
 	struct squashfs_lz4 *stream = strm;
 	void *buff = stream->input, *data;
 	int bytes = length, res;
 
-	while (bio_next_segment(bio, &iter_all)) {
-		int avail = min(bytes, ((int)bvec->bv_len) - offset);
+	bvec_iter_all_init(&iter);
+	bio_iter_all_advance(bio, &iter, offset);
 
-		data = bvec_virt(bvec);
-		memcpy(buff, data + offset, avail);
+	bio_for_each_segment_all_continue(bvec, bio, iter) {
+		unsigned avail = min_t(unsigned, bytes, bvec.bv_len);
+
+		memcpy(buff, bvec_virt(&bvec), avail);
 		buff += avail;
 		bytes -= avail;
-		offset = 0;
+		if (!bytes)
+			break;
 	}
 
 	res = LZ4_decompress_safe(stream->input, stream->output,
diff --git a/fs/squashfs/lzo_wrapper.c b/fs/squashfs/lzo_wrapper.c
index d216aeefa8..bccfcfa12e 100644
--- a/fs/squashfs/lzo_wrapper.c
+++ b/fs/squashfs/lzo_wrapper.c
@@ -66,21 +66,24 @@ static int lzo_uncompress(struct squashfs_sb_info *msblk, void *strm,
 	struct bio *bio, int offset, int length,
 	struct squashfs_page_actor *output)
 {
-	struct bvec_iter_all iter_all = {};
-	struct bio_vec *bvec = bvec_init_iter_all(&iter_all);
+	struct bvec_iter_all iter;
+	struct bio_vec bvec;
 	struct squashfs_lzo *stream = strm;
 	void *buff = stream->input, *data;
 	int bytes = length, res;
 	size_t out_len = output->length;
 
-	while (bio_next_segment(bio, &iter_all)) {
-		int avail = min(bytes, ((int)bvec->bv_len) - offset);
+	bvec_iter_all_init(&iter);
+	bio_iter_all_advance(bio, &iter, offset);
 
-		data = bvec_virt(bvec);
-		memcpy(buff, data + offset, avail);
+	bio_for_each_segment_all_continue(bvec, bio, iter) {
+		unsigned avail = min_t(unsigned, bytes, bvec.bv_len);
+
+		memcpy(buff, bvec_virt(&bvec), avail);
 		buff += avail;
 		bytes -= avail;
-		offset = 0;
+		if (!bytes)
+			break;
 	}
 
 	res = lzo1x_decompress_safe(stream->input, (size_t)length,
diff --git a/fs/squashfs/xz_wrapper.c b/fs/squashfs/xz_wrapper.c
index 6c49481a2f..6cf0e11e3b 100644
--- a/fs/squashfs/xz_wrapper.c
+++ b/fs/squashfs/xz_wrapper.c
@@ -120,8 +120,7 @@ static int squashfs_xz_uncompress(struct squashfs_sb_info *msblk, void *strm,
 	struct bio *bio, int offset, int length,
 	struct squashfs_page_actor *output)
 {
-	struct bvec_iter_all iter_all = {};
-	struct bio_vec *bvec = bvec_init_iter_all(&iter_all);
+	struct bvec_iter_all iter;
 	int total = 0, error = 0;
 	struct squashfs_xz *stream = strm;
 
@@ -136,26 +135,28 @@ static int squashfs_xz_uncompress(struct squashfs_sb_info *msblk, void *strm,
 		goto finish;
 	}
 
+	bvec_iter_all_init(&iter);
+	bio_iter_all_advance(bio, &iter, offset);
+
 	for (;;) {
 		enum xz_ret xz_err;
 
 		if (stream->buf.in_pos == stream->buf.in_size) {
-			const void *data;
-			int avail;
+			struct bio_vec bvec = bio_iter_all_peek(bio, &iter);
+			unsigned avail = min_t(unsigned, length, bvec.bv_len);
 
-			if (!bio_next_segment(bio, &iter_all)) {
+			if (iter.idx >= bio->bi_vcnt) {
 				/* XZ_STREAM_END must be reached. */
 				error = -EIO;
 				break;
 			}
 
-			avail = min(length, ((int)bvec->bv_len) - offset);
-			data = bvec_virt(bvec);
 			length -= avail;
-			stream->buf.in = data + offset;
+			stream->buf.in = bvec_virt(&bvec);
 			stream->buf.in_size = avail;
 			stream->buf.in_pos = 0;
-			offset = 0;
+
+			bio_iter_all_advance(bio, &iter, avail);
 		}
 
 		if (stream->buf.out_pos == stream->buf.out_size) {
diff --git a/fs/squashfs/zlib_wrapper.c b/fs/squashfs/zlib_wrapper.c
index cbb7afe7bc..981ca5e410 100644
--- a/fs/squashfs/zlib_wrapper.c
+++ b/fs/squashfs/zlib_wrapper.c
@@ -53,8 +53,7 @@ static int zlib_uncompress(struct squashfs_sb_info *msblk, void *strm,
 	struct bio *bio, int offset, int length,
 	struct squashfs_page_actor *output)
 {
-	struct bvec_iter_all iter_all = {};
-	struct bio_vec *bvec = bvec_init_iter_all(&iter_all);
+	struct bvec_iter_all iter;
 	int zlib_init = 0, error = 0;
 	z_stream *stream = strm;
 
@@ -67,25 +66,28 @@ static int zlib_uncompress(struct squashfs_sb_info *msblk, void *strm,
 		goto finish;
 	}
 
+	bvec_iter_all_init(&iter);
+	bio_iter_all_advance(bio, &iter, offset);
+
 	for (;;) {
 		int zlib_err;
 
 		if (stream->avail_in == 0) {
-			const void *data;
+			struct bio_vec bvec = bio_iter_all_peek(bio, &iter);
 			int avail;
 
-			if (!bio_next_segment(bio, &iter_all)) {
+			if (iter.idx >= bio->bi_vcnt) {
 				/* Z_STREAM_END must be reached. */
 				error = -EIO;
 				break;
 			}
 
-			avail = min(length, ((int)bvec->bv_len) - offset);
-			data = bvec_virt(bvec);
+			avail = min_t(unsigned, length, bvec.bv_len);
 			length -= avail;
-			stream->next_in = data + offset;
+			stream->next_in = bvec_virt(&bvec);
 			stream->avail_in = avail;
-			offset = 0;
+
+			bio_iter_all_advance(bio, &iter, avail);
 		}
 
 		if (stream->avail_out == 0) {
diff --git a/fs/squashfs/zstd_wrapper.c b/fs/squashfs/zstd_wrapper.c
index 0e407c4d8b..658e5d462a 100644
--- a/fs/squashfs/zstd_wrapper.c
+++ b/fs/squashfs/zstd_wrapper.c
@@ -68,8 +68,7 @@ static int zstd_uncompress(struct squashfs_sb_info *msblk, void *strm,
 	int error = 0;
 	zstd_in_buffer in_buf = { NULL, 0, 0 };
 	zstd_out_buffer out_buf = { NULL, 0, 0 };
-	struct bvec_iter_all iter_all = {};
-	struct bio_vec *bvec = bvec_init_iter_all(&iter_all);
+	struct bvec_iter_all iter;
 
 	stream = zstd_init_dstream(wksp->window_size, wksp->mem, wksp->mem_size);
 
@@ -85,25 +84,27 @@ static int zstd_uncompress(struct squashfs_sb_info *msblk, void *strm,
 		goto finish;
 	}
 
+	bvec_iter_all_init(&iter);
+	bio_iter_all_advance(bio, &iter, offset);
+
 	for (;;) {
 		size_t zstd_err;
 
 		if (in_buf.pos == in_buf.size) {
-			const void *data;
-			int avail;
+			struct bio_vec bvec = bio_iter_all_peek(bio, &iter);
+			unsigned avail = min_t(unsigned, length, bvec.bv_len);
 
-			if (!bio_next_segment(bio, &iter_all)) {
+			if (iter.idx >= bio->bi_vcnt) {
 				error = -EIO;
 				break;
 			}
 
-			avail = min(length, ((int)bvec->bv_len) - offset);
-			data = bvec_virt(bvec);
 			length -= avail;
-			in_buf.src = data + offset;
+			in_buf.src = bvec_virt(&bvec);
 			in_buf.size = avail;
 			in_buf.pos = 0;
-			offset = 0;
+
+			bio_iter_all_advance(bio, &iter, avail);
 		}
 
 		if (out_buf.pos == out_buf.size) {
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 3536f28c05..f86c7190c3 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -78,22 +78,40 @@ static inline void *bio_data(struct bio *bio)
 	return NULL;
 }
 
-static inline bool bio_next_segment(const struct bio *bio,
-				    struct bvec_iter_all *iter)
+static inline struct bio_vec bio_iter_all_peek(const struct bio *bio,
+					       struct bvec_iter_all *iter)
 {
-	if (iter->idx >= bio->bi_vcnt)
-		return false;
+	if (WARN_ON(iter->idx >= bio->bi_vcnt))
+		return (struct bio_vec) { NULL };
 
-	bvec_advance(&bio->bi_io_vec[iter->idx], iter);
-	return true;
+	return bvec_iter_all_peek(bio->bi_io_vec, iter);
+}
+
+static inline void bio_iter_all_advance(const struct bio *bio,
+					struct bvec_iter_all *iter,
+					unsigned bytes)
+{
+	bvec_iter_all_advance(bio->bi_io_vec, iter, bytes);
+
+	WARN_ON(iter->idx > bio->bi_vcnt ||
+		(iter->idx == bio->bi_vcnt && iter->done));
 }
 
+#define bio_for_each_segment_all_continue(bvl, bio, iter)		\
+	for (;								\
+	     iter.idx < bio->bi_vcnt &&					\
+		((bvl = bio_iter_all_peek(bio, &iter)), true);		\
+	     bio_iter_all_advance((bio), &iter, bvl.bv_len))
+
 /*
  * drivers should _never_ use the all version - the bio may have been split
  * before it got to the driver and the driver won't own all of it
  */
-#define bio_for_each_segment_all(bvl, bio, iter) \
-	for (bvl = bvec_init_iter_all(&iter); bio_next_segment((bio), &iter); )
+#define bio_for_each_segment_all(bvl, bio, iter)			\
+	for (bvec_iter_all_init(&iter);					\
+	     iter.idx < (bio)->bi_vcnt &&				\
+		((bvl = bio_iter_all_peek((bio), &iter)), true);		\
+	     bio_iter_all_advance((bio), &iter, bvl.bv_len))
 
 static inline void bio_advance_iter(const struct bio *bio,
 				    struct bvec_iter *iter, unsigned int bytes)
diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index 555aae5448..635fb54143 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -85,12 +85,6 @@ struct bvec_iter {
 						   current bvec */
 } __packed;
 
-struct bvec_iter_all {
-	struct bio_vec	bv;
-	int		idx;
-	unsigned	done;
-};
-
 /*
  * various member access, note that bio_data should of course not be used
  * on highmem page vectors
@@ -184,7 +178,10 @@ static inline void bvec_iter_advance_single(const struct bio_vec *bv,
 		((bvl = bvec_iter_bvec((bio_vec), (iter))), 1);	\
 	     bvec_iter_advance_single((bio_vec), &(iter), (bvl).bv_len))
 
-/* for iterating one bio from start to end */
+/*
+ * bvec_iter_all: for advancing over a bio as it was originally created, but
+ * with the usual bio_for_each_segment interface - nonstandard, do not use:
+ */
 #define BVEC_ITER_ALL_INIT (struct bvec_iter)				\
 {									\
 	.bi_sector	= 0,						\
@@ -193,33 +190,45 @@ static inline void bvec_iter_advance_single(const struct bio_vec *bv,
 	.bi_bvec_done	= 0,						\
 }
 
-static inline struct bio_vec *bvec_init_iter_all(struct bvec_iter_all *iter_all)
+/*
+ * bvec_iter_all: for advancing over individual pages in a bio, as it was when
+ * it was first created:
+ */
+struct bvec_iter_all {
+	int		idx;
+	unsigned	done;
+};
+
+static inline void bvec_iter_all_init(struct bvec_iter_all *iter_all)
 {
 	iter_all->done = 0;
 	iter_all->idx = 0;
+}
 
-	return &iter_all->bv;
+static inline struct bio_vec bvec_iter_all_peek(const struct bio_vec *bvec,
+						struct bvec_iter_all *iter)
+{
+	struct bio_vec bv = bvec[iter->idx];
+
+	bv.bv_offset	+= iter->done;
+	bv.bv_len	-= iter->done;
+
+	bv.bv_page	+= bv.bv_offset >> PAGE_SHIFT;
+	bv.bv_offset	&= ~PAGE_MASK;
+	bv.bv_len	= min_t(unsigned, PAGE_SIZE - bv.bv_offset, bv.bv_len);
+
+	return bv;
 }
 
-static inline void bvec_advance(const struct bio_vec *bvec,
-				struct bvec_iter_all *iter_all)
+static inline void bvec_iter_all_advance(const struct bio_vec *bvec,
+					 struct bvec_iter_all *iter,
+					 unsigned bytes)
 {
-	struct bio_vec *bv = &iter_all->bv;
-
-	if (iter_all->done) {
-		bv->bv_page++;
-		bv->bv_offset = 0;
-	} else {
-		bv->bv_page = bvec->bv_page + (bvec->bv_offset >> PAGE_SHIFT);
-		bv->bv_offset = bvec->bv_offset & ~PAGE_MASK;
-	}
-	bv->bv_len = min_t(unsigned int, PAGE_SIZE - bv->bv_offset,
-			   bvec->bv_len - iter_all->done);
-	iter_all->done += bv->bv_len;
+	iter->done += bytes;
 
-	if (iter_all->done == bvec->bv_len) {
-		iter_all->idx++;
-		iter_all->done = 0;
+	while (iter->done && iter->done >= bvec[iter->idx].bv_len) {
+		iter->done -= bvec[iter->idx].bv_len;
+		iter->idx++;
 	}
 }
 
-- 
2.40.1



* [PATCH 13/32] block: Rework bio_for_each_folio_all()
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
                   ` (11 preceding siblings ...)
  2023-05-09 16:56 ` [PATCH 12/32] block: Rework bio_for_each_segment_all() Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-05-09 16:56 ` [PATCH 14/32] block: Don't block on s_umount from __invalidate_super() Kent Overstreet
                   ` (18 subsequent siblings)
  31 siblings, 0 replies; 73+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs
  Cc: Kent Overstreet, Matthew Wilcox, linux-block

This reimplements bio_for_each_folio_all() on top of the newly reworked
bvec_iter_all, and since it's now trivial we also provide
bio_for_each_folio().
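
A short sketch of the new calling convention (not part of the diff;
it mirrors how the converted users look):

#include <linux/bio.h>
#include <linux/pagemap.h>

/* fv is a synthesized struct folio_vec; iter tracks the position. */
static void example_read_endio(struct bio *bio)
{
	int error = blk_status_to_errno(bio->bi_status);
	struct bvec_iter_all iter;
	struct folio_vec fv;

	bio_for_each_folio_all(fv, bio, iter) {
		if (!error)
			folio_mark_uptodate(fv.fv_folio);
		folio_unlock(fv.fv_folio);
	}
	bio_put(bio);
}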

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: [email protected]
---
 fs/crypto/bio.c        |  9 +++--
 fs/iomap/buffered-io.c | 14 ++++---
 fs/verity/verify.c     |  9 +++--
 include/linux/bio.h    | 91 +++++++++++++++++++++---------------------
 include/linux/bvec.h   | 15 +++++--
 5 files changed, 75 insertions(+), 63 deletions(-)

diff --git a/fs/crypto/bio.c b/fs/crypto/bio.c
index d57d0a020f..6469861add 100644
--- a/fs/crypto/bio.c
+++ b/fs/crypto/bio.c
@@ -30,11 +30,12 @@
  */
 bool fscrypt_decrypt_bio(struct bio *bio)
 {
-	struct folio_iter fi;
+	struct bvec_iter_all iter;
+	struct folio_vec fv;
 
-	bio_for_each_folio_all(fi, bio) {
-		int err = fscrypt_decrypt_pagecache_blocks(fi.folio, fi.length,
-							   fi.offset);
+	bio_for_each_folio_all(fv, bio, iter) {
+		int err = fscrypt_decrypt_pagecache_blocks(fv.fv_folio, fv.fv_len,
+							   fv.fv_offset);
 
 		if (err) {
 			bio->bi_status = errno_to_blk_status(err);
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 6f4c97a6d7..60661c87d5 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -187,10 +187,11 @@ static void iomap_finish_folio_read(struct folio *folio, size_t offset,
 static void iomap_read_end_io(struct bio *bio)
 {
 	int error = blk_status_to_errno(bio->bi_status);
-	struct folio_iter fi;
+	struct bvec_iter_all iter;
+	struct folio_vec fv;
 
-	bio_for_each_folio_all(fi, bio)
-		iomap_finish_folio_read(fi.folio, fi.offset, fi.length, error);
+	bio_for_each_folio_all(fv, bio, iter)
+		iomap_finish_folio_read(fv.fv_folio, fv.fv_offset, fv.fv_len, error);
 	bio_put(bio);
 }
 
@@ -1328,7 +1329,8 @@ iomap_finish_ioend(struct iomap_ioend *ioend, int error)
 	u32 folio_count = 0;
 
 	for (bio = &ioend->io_inline_bio; bio; bio = next) {
-		struct folio_iter fi;
+		struct bvec_iter_all iter;
+		struct folio_vec fv;
 
 		/*
 		 * For the last bio, bi_private points to the ioend, so we
@@ -1340,8 +1342,8 @@ iomap_finish_ioend(struct iomap_ioend *ioend, int error)
 			next = bio->bi_private;
 
 		/* walk all folios in bio, ending page IO on them */
-		bio_for_each_folio_all(fi, bio) {
-			iomap_finish_folio_write(inode, fi.folio, fi.length,
+		bio_for_each_folio_all(fv, bio, iter) {
+			iomap_finish_folio_write(inode, fv.fv_folio, fv.fv_len,
 					error);
 			folio_count++;
 		}
diff --git a/fs/verity/verify.c b/fs/verity/verify.c
index e250822275..b111ab0102 100644
--- a/fs/verity/verify.c
+++ b/fs/verity/verify.c
@@ -340,7 +340,8 @@ void fsverity_verify_bio(struct bio *bio)
 	struct inode *inode = bio_first_page_all(bio)->mapping->host;
 	struct fsverity_info *vi = inode->i_verity_info;
 	struct ahash_request *req;
-	struct folio_iter fi;
+	struct bvec_iter_all iter;
+	struct folio_vec fv;
 	unsigned long max_ra_pages = 0;
 
 	/* This allocation never fails, since it's mempool-backed. */
@@ -359,9 +360,9 @@ void fsverity_verify_bio(struct bio *bio)
 		max_ra_pages = bio->bi_iter.bi_size >> (PAGE_SHIFT + 2);
 	}
 
-	bio_for_each_folio_all(fi, bio) {
-		if (!verify_data_blocks(inode, vi, req, fi.folio, fi.length,
-					fi.offset, max_ra_pages)) {
+	bio_for_each_folio_all(fv, bio, iter) {
+		if (!verify_data_blocks(inode, vi, req, fv.fv_folio, fv.fv_len,
+					fv.fv_offset, max_ra_pages)) {
 			bio->bi_status = BLK_STS_IOERR;
 			break;
 		}
diff --git a/include/linux/bio.h b/include/linux/bio.h
index f86c7190c3..7ced281734 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -169,6 +169,42 @@ static inline void bio_advance(struct bio *bio, unsigned int nbytes)
 #define bio_for_each_segment(bvl, bio, iter)				\
 	__bio_for_each_segment(bvl, bio, iter, (bio)->bi_iter)
 
+struct folio_vec {
+	struct folio	*fv_folio;
+	size_t		fv_offset;
+	size_t		fv_len;
+};
+
+static inline struct folio_vec biovec_to_foliovec(struct bio_vec bv)
+{
+
+	struct folio *folio	= page_folio(bv.bv_page);
+	size_t offset		= (folio_page_idx(folio, bv.bv_page) << PAGE_SHIFT) +
+		bv.bv_offset;
+	size_t len = min_t(size_t, folio_size(folio) - offset, bv.bv_len);
+
+	return (struct folio_vec) {
+		.fv_folio	= folio,
+		.fv_offset	= offset,
+		.fv_len		= len,
+	};
+}
+
+static inline struct folio_vec bio_iter_iovec_folio(struct bio *bio,
+						    struct bvec_iter iter)
+{
+	return biovec_to_foliovec(bio_iter_iovec(bio, iter));
+}
+
+#define __bio_for_each_folio(bvl, bio, iter, start)			\
+	for (iter = (start);						\
+	     (iter).bi_size &&						\
+		((bvl = bio_iter_iovec_folio((bio), (iter))), 1);	\
+	     bio_advance_iter_single((bio), &(iter), (bvl).fv_len))
+
+#define bio_for_each_folio(bvl, bio, iter)				\
+	__bio_for_each_folio(bvl, bio, iter, (bio)->bi_iter)
+
 #define __bio_for_each_bvec(bvl, bio, iter, start)		\
 	for (iter = (start);						\
 	     (iter).bi_size &&						\
@@ -277,59 +313,22 @@ static inline struct bio_vec *bio_last_bvec_all(struct bio *bio)
 	return &bio->bi_io_vec[bio->bi_vcnt - 1];
 }
 
-/**
- * struct folio_iter - State for iterating all folios in a bio.
- * @folio: The current folio we're iterating.  NULL after the last folio.
- * @offset: The byte offset within the current folio.
- * @length: The number of bytes in this iteration (will not cross folio
- *	boundary).
- */
-struct folio_iter {
-	struct folio *folio;
-	size_t offset;
-	size_t length;
-	/* private: for use by the iterator */
-	struct folio *_next;
-	size_t _seg_count;
-	int _i;
-};
-
-static inline void bio_first_folio(struct folio_iter *fi, struct bio *bio,
-				   int i)
-{
-	struct bio_vec *bvec = bio_first_bvec_all(bio) + i;
-
-	fi->folio = page_folio(bvec->bv_page);
-	fi->offset = bvec->bv_offset +
-			PAGE_SIZE * (bvec->bv_page - &fi->folio->page);
-	fi->_seg_count = bvec->bv_len;
-	fi->length = min(folio_size(fi->folio) - fi->offset, fi->_seg_count);
-	fi->_next = folio_next(fi->folio);
-	fi->_i = i;
-}
-
-static inline void bio_next_folio(struct folio_iter *fi, struct bio *bio)
+static inline struct folio_vec bio_folio_iter_all_peek(const struct bio *bio,
+						       const struct bvec_iter_all *iter)
 {
-	fi->_seg_count -= fi->length;
-	if (fi->_seg_count) {
-		fi->folio = fi->_next;
-		fi->offset = 0;
-		fi->length = min(folio_size(fi->folio), fi->_seg_count);
-		fi->_next = folio_next(fi->folio);
-	} else if (fi->_i + 1 < bio->bi_vcnt) {
-		bio_first_folio(fi, bio, fi->_i + 1);
-	} else {
-		fi->folio = NULL;
-	}
+	return biovec_to_foliovec(__bvec_iter_all_peek(bio->bi_io_vec, iter));
 }
 
 /**
  * bio_for_each_folio_all - Iterate over each folio in a bio.
- * @fi: struct folio_iter which is updated for each folio.
+ * @fi: struct bio_folio_iter_all which is updated for each folio.
  * @bio: struct bio to iterate over.
  */
-#define bio_for_each_folio_all(fi, bio)				\
-	for (bio_first_folio(&fi, bio, 0); fi.folio; bio_next_folio(&fi, bio))
+#define bio_for_each_folio_all(fv, bio, iter)				\
+	for (bvec_iter_all_init(&iter);					\
+	     iter.idx < bio->bi_vcnt &&					\
+		((fv = bio_folio_iter_all_peek(bio, &iter)), true);	\
+	     bio_iter_all_advance((bio), &iter, fv.fv_len))
 
 enum bip_flags {
 	BIP_BLOCK_INTEGRITY	= 1 << 0, /* block layer owns integrity data */
diff --git a/include/linux/bvec.h b/include/linux/bvec.h
index 635fb54143..d238f959e3 100644
--- a/include/linux/bvec.h
+++ b/include/linux/bvec.h
@@ -205,18 +205,27 @@ static inline void bvec_iter_all_init(struct bvec_iter_all *iter_all)
 	iter_all->idx = 0;
 }
 
-static inline struct bio_vec bvec_iter_all_peek(const struct bio_vec *bvec,
-						struct bvec_iter_all *iter)
+static inline struct bio_vec __bvec_iter_all_peek(const struct bio_vec *bvec,
+						  const struct bvec_iter_all *iter)
 {
 	struct bio_vec bv = bvec[iter->idx];
 
+	BUG_ON(iter->done >= bv.bv_len);
+
 	bv.bv_offset	+= iter->done;
 	bv.bv_len	-= iter->done;
 
 	bv.bv_page	+= bv.bv_offset >> PAGE_SHIFT;
 	bv.bv_offset	&= ~PAGE_MASK;
-	bv.bv_len	= min_t(unsigned, PAGE_SIZE - bv.bv_offset, bv.bv_len);
+	return bv;
+}
+
+static inline struct bio_vec bvec_iter_all_peek(const struct bio_vec *bvec,
+						const struct bvec_iter_all *iter)
+{
+	struct bio_vec bv = __bvec_iter_all_peek(bvec, iter);
 
+	bv.bv_len = min_t(unsigned, PAGE_SIZE - bv.bv_offset, bv.bv_len);
 	return bv;
 }
 
-- 
2.40.1



* [PATCH 14/32] block: Don't block on s_umount from __invalidate_super()
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
                   ` (12 preceding siblings ...)
  2023-05-09 16:56 ` [PATCH 13/32] block: Rework bio_for_each_folio_all() Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-05-09 16:56 ` [PATCH 15/32] bcache: move closures to lib/ Kent Overstreet
                   ` (17 subsequent siblings)
  31 siblings, 0 replies; 73+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs; +Cc: Kent Overstreet

__invalidate_super() is used to flush any filesystem mounted on a
device, generally on some sort of media change event.

However, when unmounting a filesystem and closing the underlying block
devices, we can deadlock if the block driver then calls
__invalidate_device() (e.g. because the block device goes away when it
is no longer in use).

This happens with bcachefs on top of loopback, and can be triggered by
fstests generic/042:

  put_super
    -> blkdev_put
    -> lo_release
    -> disk_force_media_change
    -> __invalidate_device
    -> get_super

This isn't inherently specific to bcachefs - it hasn't shown up with
other filesystems before because most other filesystems use the sget()
mechanism for opening/closing block devices (and enforcing exclusion);
however, sget() has its own downsides and weird/sketchy behaviour w.r.t.
block device open lifetime - if that ever gets fixed, more code will run
into this issue.

The __invalidate_device() call here is really a best effort "I just
yanked the device for a mounted filesystem, please try not to lose my
data" - if it's ever actually needed the user has already done something
crazy, and we probably shouldn't make things worse by deadlocking.
Switching to a trylock seems in keeping with what the code is trying to
do.

If we ever get revoke() at the block layer, perhaps we would look at
rearchitecting to use that instead.
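
As a rough sketch of the resulting pattern (names and error handling
elided; the real change is in the diff below), callers that might race
with unmount simply treat a NULL return as "nothing to flush":

	struct super_block *sb = try_get_super(bdev);

	if (sb) {
		/* s_umount held shared; safe to invalidate inodes etc. */
		...
		drop_super(sb);
	}
	/* else: s_umount is contended (e.g. unmount in progress) - skip */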

Signed-off-by: Kent Overstreet <[email protected]>
---
 block/bdev.c       |  2 +-
 fs/super.c         | 40 +++++++++++++++++++++++++++++++---------
 include/linux/fs.h |  1 +
 3 files changed, 33 insertions(+), 10 deletions(-)

diff --git a/block/bdev.c b/block/bdev.c
index 1795c7d4b9..743e969b7b 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -922,7 +922,7 @@ EXPORT_SYMBOL(lookup_bdev);
 
 int __invalidate_device(struct block_device *bdev, bool kill_dirty)
 {
-	struct super_block *sb = get_super(bdev);
+	struct super_block *sb = try_get_super(bdev);
 	int res = 0;
 
 	if (sb) {
diff --git a/fs/super.c b/fs/super.c
index 04bc62ab7d..a2decce02f 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -791,14 +791,7 @@ void iterate_supers_type(struct file_system_type *type,
 
 EXPORT_SYMBOL(iterate_supers_type);
 
-/**
- * get_super - get the superblock of a device
- * @bdev: device to get the superblock for
- *
- * Scans the superblock list and finds the superblock of the file system
- * mounted on the device given. %NULL is returned if no match is found.
- */
-struct super_block *get_super(struct block_device *bdev)
+static struct super_block *__get_super(struct block_device *bdev, bool try)
 {
 	struct super_block *sb;
 
@@ -813,7 +806,12 @@ struct super_block *get_super(struct block_device *bdev)
 		if (sb->s_bdev == bdev) {
 			sb->s_count++;
 			spin_unlock(&sb_lock);
-			down_read(&sb->s_umount);
+
+			if (!try)
+				down_read(&sb->s_umount);
+			else if (!down_read_trylock(&sb->s_umount))
+				return NULL;
+
 			/* still alive? */
 			if (sb->s_root && (sb->s_flags & SB_BORN))
 				return sb;
@@ -828,6 +826,30 @@ struct super_block *get_super(struct block_device *bdev)
 	return NULL;
 }
 
+/**
+ * get_super - get the superblock of a device
+ * @bdev: device to get the superblock for
+ *
+ * Scans the superblock list and finds the superblock of the file system
+ * mounted on the device given. %NULL is returned if no match is found.
+ */
+struct super_block *get_super(struct block_device *bdev)
+{
+	return __get_super(bdev, false);
+}
+
+/**
+ * try_get_super - get the superblock of a device, using trylock on sb->s_umount
+ * @bdev: device to get the superblock for
+ *
+ * Scans the superblock list and finds the superblock of the file system
+ * mounted on the device given. %NULL is returned if no match is found.
+ */
+struct super_block *try_get_super(struct block_device *bdev)
+{
+	return __get_super(bdev, true);
+}
+
 /**
  * get_active_super - get an active reference to the superblock of a device
  * @bdev: device to get the superblock for
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c85916e9f7..1a6f951942 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2878,6 +2878,7 @@ extern struct file_system_type *get_filesystem(struct file_system_type *fs);
 extern void put_filesystem(struct file_system_type *fs);
 extern struct file_system_type *get_fs_type(const char *name);
 extern struct super_block *get_super(struct block_device *);
+extern struct super_block *try_get_super(struct block_device *);
 extern struct super_block *get_active_super(struct block_device *bdev);
 extern void drop_super(struct super_block *sb);
 extern void drop_super_exclusive(struct super_block *sb);
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH 15/32] bcache: move closures to lib/
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
                   ` (13 preceding siblings ...)
  2023-05-09 16:56 ` [PATCH 14/32] block: Don't block on s_umount from __invalidate_super() Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-05-10  1:10   ` Randy Dunlap
  2023-05-09 16:56 ` [PATCH 16/32] MAINTAINERS: Add entry for closures Kent Overstreet
                   ` (16 subsequent siblings)
  31 siblings, 1 reply; 73+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs
  Cc: Kent Overstreet, Kent Overstreet, Coly Li

From: Kent Overstreet <[email protected]>

Prep work for bcachefs: being a fork of bcache, it also uses closures.
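
For reference, a minimal sketch of how closures are typically used once
they live in lib/ (submit_my_io() and my_io_done() are hypothetical; the
closure_* calls are the existing API):

	static void my_io_done(void *arg)
	{
		struct closure *cl = arg;

		closure_put(cl);		/* drop the ref owned by the async op */
	}

	static void wait_for_my_io(void)
	{
		struct closure cl;

		closure_init_stack(&cl);	/* on-stack closure, refcount = 1 */
		closure_get(&cl);		/* ref handed to the async op */
		submit_my_io(my_io_done, &cl);	/* hypothetical async submission */
		closure_sync(&cl);		/* sleep until only our ref remains */
	}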

Signed-off-by: Kent Overstreet <[email protected]>
Acked-by: Coly Li <[email protected]>
---
 drivers/md/bcache/Kconfig                     | 10 +-----
 drivers/md/bcache/Makefile                    |  4 +--
 drivers/md/bcache/bcache.h                    |  2 +-
 drivers/md/bcache/super.c                     |  1 -
 drivers/md/bcache/util.h                      |  3 +-
 .../md/bcache => include/linux}/closure.h     | 17 +++++----
 lib/Kconfig                                   |  3 ++
 lib/Kconfig.debug                             |  9 +++++
 lib/Makefile                                  |  2 ++
 {drivers/md/bcache => lib}/closure.c          | 35 +++++++++----------
 10 files changed, 43 insertions(+), 43 deletions(-)
 rename {drivers/md/bcache => include/linux}/closure.h (97%)
 rename {drivers/md/bcache => lib}/closure.c (88%)

diff --git a/drivers/md/bcache/Kconfig b/drivers/md/bcache/Kconfig
index 529c9d04e9..b2d10063d3 100644
--- a/drivers/md/bcache/Kconfig
+++ b/drivers/md/bcache/Kconfig
@@ -4,6 +4,7 @@ config BCACHE
 	tristate "Block device as cache"
 	select BLOCK_HOLDER_DEPRECATED if SYSFS
 	select CRC64
+	select CLOSURES
 	help
 	Allows a block device to be used as cache for other devices; uses
 	a btree for indexing and the layout is optimized for SSDs.
@@ -19,15 +20,6 @@ config BCACHE_DEBUG
 	Enables extra debugging tools, allows expensive runtime checks to be
 	turned on.
 
-config BCACHE_CLOSURES_DEBUG
-	bool "Debug closures"
-	depends on BCACHE
-	select DEBUG_FS
-	help
-	Keeps all active closures in a linked list and provides a debugfs
-	interface to list them, which makes it possible to see asynchronous
-	operations that get stuck.
-
 config BCACHE_ASYNC_REGISTRATION
 	bool "Asynchronous device registration"
 	depends on BCACHE
diff --git a/drivers/md/bcache/Makefile b/drivers/md/bcache/Makefile
index 5b87e59676..054e8a33a7 100644
--- a/drivers/md/bcache/Makefile
+++ b/drivers/md/bcache/Makefile
@@ -2,6 +2,6 @@
 
 obj-$(CONFIG_BCACHE)	+= bcache.o
 
-bcache-y		:= alloc.o bset.o btree.o closure.o debug.o extents.o\
-	io.o journal.o movinggc.o request.o stats.o super.o sysfs.o trace.o\
+bcache-y		:= alloc.o bset.o btree.o debug.o extents.o io.o\
+	journal.o movinggc.o request.o stats.o super.o sysfs.o trace.o\
 	util.o writeback.o features.o
diff --git a/drivers/md/bcache/bcache.h b/drivers/md/bcache/bcache.h
index aebb7ef10e..c8b4914ad8 100644
--- a/drivers/md/bcache/bcache.h
+++ b/drivers/md/bcache/bcache.h
@@ -179,6 +179,7 @@
 #define pr_fmt(fmt) "bcache: %s() " fmt, __func__
 
 #include <linux/bio.h>
+#include <linux/closure.h>
 #include <linux/kobject.h>
 #include <linux/list.h>
 #include <linux/mutex.h>
@@ -192,7 +193,6 @@
 #include "bcache_ondisk.h"
 #include "bset.h"
 #include "util.h"
-#include "closure.h"
 
 struct bucket {
 	atomic_t	pin;
diff --git a/drivers/md/bcache/super.c b/drivers/md/bcache/super.c
index ba3909bb6b..31b68a1b87 100644
--- a/drivers/md/bcache/super.c
+++ b/drivers/md/bcache/super.c
@@ -2912,7 +2912,6 @@ static int __init bcache_init(void)
 		goto err;
 
 	bch_debug_init();
-	closure_debug_init();
 
 	bcache_is_reboot = false;
 
diff --git a/drivers/md/bcache/util.h b/drivers/md/bcache/util.h
index 6f3cb7c921..f61ab1bada 100644
--- a/drivers/md/bcache/util.h
+++ b/drivers/md/bcache/util.h
@@ -4,6 +4,7 @@
 #define _BCACHE_UTIL_H
 
 #include <linux/blkdev.h>
+#include <linux/closure.h>
 #include <linux/errno.h>
 #include <linux/kernel.h>
 #include <linux/sched/clock.h>
@@ -13,8 +14,6 @@
 #include <linux/workqueue.h>
 #include <linux/crc64.h>
 
-#include "closure.h"
-
 struct closure;
 
 #ifdef CONFIG_BCACHE_DEBUG
diff --git a/drivers/md/bcache/closure.h b/include/linux/closure.h
similarity index 97%
rename from drivers/md/bcache/closure.h
rename to include/linux/closure.h
index c88cdc4ae4..0ec9e7bc8d 100644
--- a/drivers/md/bcache/closure.h
+++ b/include/linux/closure.h
@@ -155,7 +155,7 @@ struct closure {
 
 	atomic_t		remaining;
 
-#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
+#ifdef CONFIG_DEBUG_CLOSURES
 #define CLOSURE_MAGIC_DEAD	0xc054dead
 #define CLOSURE_MAGIC_ALIVE	0xc054a11e
 
@@ -184,15 +184,13 @@ static inline void closure_sync(struct closure *cl)
 		__closure_sync(cl);
 }
 
-#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
+#ifdef CONFIG_DEBUG_CLOSURES
 
-void closure_debug_init(void);
 void closure_debug_create(struct closure *cl);
 void closure_debug_destroy(struct closure *cl);
 
 #else
 
-static inline void closure_debug_init(void) {}
 static inline void closure_debug_create(struct closure *cl) {}
 static inline void closure_debug_destroy(struct closure *cl) {}
 
@@ -200,21 +198,21 @@ static inline void closure_debug_destroy(struct closure *cl) {}
 
 static inline void closure_set_ip(struct closure *cl)
 {
-#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
+#ifdef CONFIG_DEBUG_CLOSURES
 	cl->ip = _THIS_IP_;
 #endif
 }
 
 static inline void closure_set_ret_ip(struct closure *cl)
 {
-#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
+#ifdef CONFIG_DEBUG_CLOSURES
 	cl->ip = _RET_IP_;
 #endif
 }
 
 static inline void closure_set_waiting(struct closure *cl, unsigned long f)
 {
-#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
+#ifdef CONFIG_DEBUG_CLOSURES
 	cl->waiting_on = f;
 #endif
 }
@@ -243,6 +241,7 @@ static inline void closure_queue(struct closure *cl)
 	 */
 	BUILD_BUG_ON(offsetof(struct closure, fn)
 		     != offsetof(struct work_struct, func));
+
 	if (wq) {
 		INIT_WORK(&cl->work, cl->work.func);
 		BUG_ON(!queue_work(wq, &cl->work));
@@ -255,7 +254,7 @@ static inline void closure_queue(struct closure *cl)
  */
 static inline void closure_get(struct closure *cl)
 {
-#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
+#ifdef CONFIG_DEBUG_CLOSURES
 	BUG_ON((atomic_inc_return(&cl->remaining) &
 		CLOSURE_REMAINING_MASK) <= 1);
 #else
@@ -271,7 +270,7 @@ static inline void closure_get(struct closure *cl)
  */
 static inline void closure_init(struct closure *cl, struct closure *parent)
 {
-	memset(cl, 0, sizeof(struct closure));
+	cl->fn = NULL;
 	cl->parent = parent;
 	if (parent)
 		closure_get(parent);
diff --git a/lib/Kconfig b/lib/Kconfig
index ce2abffb9e..1aa1c15a83 100644
--- a/lib/Kconfig
+++ b/lib/Kconfig
@@ -504,6 +504,9 @@ config ASSOCIATIVE_ARRAY
 
 	  for more information.
 
+config CLOSURES
+	bool
+
 config HAS_IOMEM
 	bool
 	depends on !NO_IOMEM
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 39d1d93164..3dba7a9aff 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1618,6 +1618,15 @@ config DEBUG_NOTIFIERS
 	  This is a relatively cheap check but if you care about maximum
 	  performance, say N.
 
+config DEBUG_CLOSURES
+	bool "Debug closures (bcache async widgets)"
+	depends on CLOSURES
+	select DEBUG_FS
+	help
+	Keeps all active closures in a linked list and provides a debugfs
+	interface to list them, which makes it possible to see asynchronous
+	operations that get stuck.
+
 config BUG_ON_DATA_CORRUPTION
 	bool "Trigger a BUG when data corruption is detected"
 	select DEBUG_LIST
diff --git a/lib/Makefile b/lib/Makefile
index baf2821f7a..fd13ca6e0e 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -245,6 +245,8 @@ obj-$(CONFIG_ATOMIC64_SELFTEST) += atomic64_test.o
 
 obj-$(CONFIG_CPU_RMAP) += cpu_rmap.o
 
+obj-$(CONFIG_CLOSURES) += closure.o
+
 obj-$(CONFIG_DQL) += dynamic_queue_limits.o
 
 obj-$(CONFIG_GLOB) += glob.o
diff --git a/drivers/md/bcache/closure.c b/lib/closure.c
similarity index 88%
rename from drivers/md/bcache/closure.c
rename to lib/closure.c
index d8d9394a6b..b38ded00b9 100644
--- a/drivers/md/bcache/closure.c
+++ b/lib/closure.c
@@ -6,13 +6,12 @@
  * Copyright 2012 Google, Inc.
  */
 
+#include <linux/closure.h>
 #include <linux/debugfs.h>
-#include <linux/module.h>
+#include <linux/export.h>
 #include <linux/seq_file.h>
 #include <linux/sched/debug.h>
 
-#include "closure.h"
-
 static inline void closure_put_after_sub(struct closure *cl, int flags)
 {
 	int r = flags & CLOSURE_REMAINING_MASK;
@@ -45,6 +44,7 @@ void closure_sub(struct closure *cl, int v)
 {
 	closure_put_after_sub(cl, atomic_sub_return(v, &cl->remaining));
 }
+EXPORT_SYMBOL(closure_sub);
 
 /*
  * closure_put - decrement a closure's refcount
@@ -53,6 +53,7 @@ void closure_put(struct closure *cl)
 {
 	closure_put_after_sub(cl, atomic_dec_return(&cl->remaining));
 }
+EXPORT_SYMBOL(closure_put);
 
 /*
  * closure_wake_up - wake up all closures on a wait list, without memory barrier
@@ -74,6 +75,7 @@ void __closure_wake_up(struct closure_waitlist *wait_list)
 		closure_sub(cl, CLOSURE_WAITING + 1);
 	}
 }
+EXPORT_SYMBOL(__closure_wake_up);
 
 /**
  * closure_wait - add a closure to a waitlist
@@ -93,6 +95,7 @@ bool closure_wait(struct closure_waitlist *waitlist, struct closure *cl)
 
 	return true;
 }
+EXPORT_SYMBOL(closure_wait);
 
 struct closure_syncer {
 	struct task_struct	*task;
@@ -127,8 +130,9 @@ void __sched __closure_sync(struct closure *cl)
 
 	__set_current_state(TASK_RUNNING);
 }
+EXPORT_SYMBOL(__closure_sync);
 
-#ifdef CONFIG_BCACHE_CLOSURES_DEBUG
+#ifdef CONFIG_DEBUG_CLOSURES
 
 static LIST_HEAD(closure_list);
 static DEFINE_SPINLOCK(closure_list_lock);
@@ -144,6 +148,7 @@ void closure_debug_create(struct closure *cl)
 	list_add(&cl->all, &closure_list);
 	spin_unlock_irqrestore(&closure_list_lock, flags);
 }
+EXPORT_SYMBOL(closure_debug_create);
 
 void closure_debug_destroy(struct closure *cl)
 {
@@ -156,8 +161,7 @@ void closure_debug_destroy(struct closure *cl)
 	list_del(&cl->all);
 	spin_unlock_irqrestore(&closure_list_lock, flags);
 }
-
-static struct dentry *closure_debug;
+EXPORT_SYMBOL(closure_debug_destroy);
 
 static int debug_show(struct seq_file *f, void *data)
 {
@@ -181,7 +185,7 @@ static int debug_show(struct seq_file *f, void *data)
 			seq_printf(f, " W %pS\n",
 				   (void *) cl->waiting_on);
 
-		seq_printf(f, "\n");
+		seq_puts(f, "\n");
 	}
 
 	spin_unlock_irq(&closure_list_lock);
@@ -190,18 +194,11 @@ static int debug_show(struct seq_file *f, void *data)
 
 DEFINE_SHOW_ATTRIBUTE(debug);
 
-void  __init closure_debug_init(void)
+static int __init closure_debug_init(void)
 {
-	if (!IS_ERR_OR_NULL(bcache_debug))
-		/*
-		 * it is unnecessary to check return value of
-		 * debugfs_create_file(), we should not care
-		 * about this.
-		 */
-		closure_debug = debugfs_create_file(
-			"closures", 0400, bcache_debug, NULL, &debug_fops);
+	debugfs_create_file("closures", 0400, NULL, NULL, &debug_fops);
+	return 0;
 }
-#endif
+late_initcall(closure_debug_init)
 
-MODULE_AUTHOR("Kent Overstreet <[email protected]>");
-MODULE_LICENSE("GPL");
+#endif
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH 16/32] MAINTAINERS: Add entry for closures
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
                   ` (14 preceding siblings ...)
  2023-05-09 16:56 ` [PATCH 15/32] bcache: move closures to lib/ Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-05-09 17:05   ` Coly Li
  2023-05-09 21:03   ` Randy Dunlap
  2023-05-09 16:56 ` [PATCH 17/32] closures: closure_wait_event() Kent Overstreet
                   ` (15 subsequent siblings)
  31 siblings, 2 replies; 73+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs; +Cc: Kent Overstreet, Coly Li

closures, from bcache, are async widgets with a variety of uses.
bcachefs also uses them, so they're being moved to lib/; mark them as
maintained.

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Coly Li <[email protected]>
---
 MAINTAINERS | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 3fc37de3d6..5d76169140 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5044,6 +5044,14 @@ T:	git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git timers/core
 F:	Documentation/devicetree/bindings/timer/
 F:	drivers/clocksource/
 
+CLOSURES:
+M:	Kent Overstreet <[email protected]>
+L:	[email protected]
+S:	Supported
+C:	irc://irc.oftc.net/bcache
+F:	include/linux/closure.h
+F:	lib/closure.c
+
 CMPC ACPI DRIVER
 M:	Thadeu Lima de Souza Cascardo <[email protected]>
 M:	Daniel Oliveira Nascimento <[email protected]>
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH 17/32] closures: closure_wait_event()
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
                   ` (15 preceding siblings ...)
  2023-05-09 16:56 ` [PATCH 16/32] MAINTAINERS: Add entry for closures Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-05-09 16:56 ` [PATCH 18/32] closures: closure_nr_remaining() Kent Overstreet
                   ` (14 subsequent siblings)
  31 siblings, 0 replies; 73+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs
  Cc: Kent Overstreet, Coly Li, Kent Overstreet

From: Kent Overstreet <[email protected]>

Like wait_event(), except that because it uses closures and closure
waitlists it doesn't have wait_event()'s restriction on modifying task
state inside the condition check.
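
A usage sketch (the waitlist and counter names are hypothetical):

	/* wait until all in-flight writes have drained; unlike wait_event(),
	 * the condition here is free to sleep or change the task state */
	closure_wait_event(&c->write_wait,
			   atomic_read(&c->writes_in_flight) == 0);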

Signed-off-by: Kent Overstreet <[email protected]>
Acked-by: Coly Li <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
---
 include/linux/closure.h | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/include/linux/closure.h b/include/linux/closure.h
index 0ec9e7bc8d..36b4a83f9b 100644
--- a/include/linux/closure.h
+++ b/include/linux/closure.h
@@ -374,4 +374,26 @@ static inline void closure_call(struct closure *cl, closure_fn fn,
 	continue_at_nobarrier(cl, fn, wq);
 }
 
+#define __closure_wait_event(waitlist, _cond)				\
+do {									\
+	struct closure cl;						\
+									\
+	closure_init_stack(&cl);					\
+									\
+	while (1) {							\
+		closure_wait(waitlist, &cl);				\
+		if (_cond)						\
+			break;						\
+		closure_sync(&cl);					\
+	}								\
+	closure_wake_up(waitlist);					\
+	closure_sync(&cl);						\
+} while (0)
+
+#define closure_wait_event(waitlist, _cond)				\
+do {									\
+	if (!(_cond))							\
+		__closure_wait_event(waitlist, _cond);			\
+} while (0)
+
 #endif /* _LINUX_CLOSURE_H */
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH 18/32] closures: closure_nr_remaining()
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
                   ` (16 preceding siblings ...)
  2023-05-09 16:56 ` [PATCH 17/32] closures: closure_wait_event() Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-05-09 16:56 ` [PATCH 19/32] closures: Add a missing include Kent Overstreet
                   ` (13 subsequent siblings)
  31 siblings, 0 replies; 73+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs; +Cc: Kent Overstreet

Factor out a new helper, which returns the number of events outstanding.

Signed-off-by: Kent Overstreet <[email protected]>
---
 include/linux/closure.h | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/linux/closure.h b/include/linux/closure.h
index 36b4a83f9b..722a586bb2 100644
--- a/include/linux/closure.h
+++ b/include/linux/closure.h
@@ -172,6 +172,11 @@ void __closure_wake_up(struct closure_waitlist *list);
 bool closure_wait(struct closure_waitlist *list, struct closure *cl);
 void __closure_sync(struct closure *cl);
 
+static inline unsigned closure_nr_remaining(struct closure *cl)
+{
+	return atomic_read(&cl->remaining) & CLOSURE_REMAINING_MASK;
+}
+
 /**
  * closure_sync - sleep until a closure has nothing left to wait on
  *
@@ -180,7 +185,7 @@ void __closure_sync(struct closure *cl);
  */
 static inline void closure_sync(struct closure *cl)
 {
-	if ((atomic_read(&cl->remaining) & CLOSURE_REMAINING_MASK) != 1)
+	if (closure_nr_remaining(cl) != 1)
 		__closure_sync(cl);
 }
 
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH 19/32] closures: Add a missing include
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
                   ` (17 preceding siblings ...)
  2023-05-09 16:56 ` [PATCH 18/32] closures: closure_nr_remaining() Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-05-09 16:56 ` [PATCH 20/32] vfs: factor out inode hash head calculation Kent Overstreet
                   ` (12 subsequent siblings)
  31 siblings, 0 replies; 73+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs; +Cc: Kent Overstreet

Fixes building in userspace.

Signed-off-by: Kent Overstreet <[email protected]>
---
 lib/closure.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/lib/closure.c b/lib/closure.c
index b38ded00b9..0855e698ce 100644
--- a/lib/closure.c
+++ b/lib/closure.c
@@ -9,6 +9,7 @@
 #include <linux/closure.h>
 #include <linux/debugfs.h>
 #include <linux/export.h>
+#include <linux/rcupdate.h>
 #include <linux/seq_file.h>
 #include <linux/sched/debug.h>
 
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH 20/32] vfs: factor out inode hash head calculation
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
                   ` (18 preceding siblings ...)
  2023-05-09 16:56 ` [PATCH 19/32] closures: Add a missing include Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-05-09 16:56 ` [PATCH 21/32] hlist-bl: add hlist_bl_fake() Kent Overstreet
                   ` (11 subsequent siblings)
  31 siblings, 0 replies; 73+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs
  Cc: Dave Chinner, Alexander Viro, Christian Brauner, Kent Overstreet

From: Dave Chinner <[email protected]>

In preparation for changing the inode hash table implementation.

Signed-off-by: Dave Chinner <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: [email protected]
Signed-off-by: Kent Overstreet <[email protected]>
---
 fs/inode.c | 44 +++++++++++++++++++++++++-------------------
 1 file changed, 25 insertions(+), 19 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 4558dc2f13..41a10bcda1 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -60,6 +60,22 @@ static unsigned int i_hash_shift __read_mostly;
 static struct hlist_head *inode_hashtable __read_mostly;
 static __cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_hash_lock);
 
+static unsigned long hash(struct super_block *sb, unsigned long hashval)
+{
+	unsigned long tmp;
+
+	tmp = (hashval * (unsigned long)sb) ^ (GOLDEN_RATIO_PRIME + hashval) /
+			L1_CACHE_BYTES;
+	tmp = tmp ^ ((tmp ^ GOLDEN_RATIO_PRIME) >> i_hash_shift);
+	return tmp & i_hash_mask;
+}
+
+static inline struct hlist_head *i_hash_head(struct super_block *sb,
+		unsigned int hashval)
+{
+	return inode_hashtable + hash(sb, hashval);
+}
+
 /*
  * Empty aops. Can be used for the cases where the user does not
  * define any of the address_space operations.
@@ -506,16 +522,6 @@ static inline void inode_sb_list_del(struct inode *inode)
 	}
 }
 
-static unsigned long hash(struct super_block *sb, unsigned long hashval)
-{
-	unsigned long tmp;
-
-	tmp = (hashval * (unsigned long)sb) ^ (GOLDEN_RATIO_PRIME + hashval) /
-			L1_CACHE_BYTES;
-	tmp = tmp ^ ((tmp ^ GOLDEN_RATIO_PRIME) >> i_hash_shift);
-	return tmp & i_hash_mask;
-}
-
 /**
  *	__insert_inode_hash - hash an inode
  *	@inode: unhashed inode
@@ -1163,7 +1169,7 @@ struct inode *inode_insert5(struct inode *inode, unsigned long hashval,
 			    int (*test)(struct inode *, void *),
 			    int (*set)(struct inode *, void *), void *data)
 {
-	struct hlist_head *head = inode_hashtable + hash(inode->i_sb, hashval);
+	struct hlist_head *head = i_hash_head(inode->i_sb, hashval);
 	struct inode *old;
 
 again:
@@ -1267,7 +1273,7 @@ EXPORT_SYMBOL(iget5_locked);
  */
 struct inode *iget_locked(struct super_block *sb, unsigned long ino)
 {
-	struct hlist_head *head = inode_hashtable + hash(sb, ino);
+	struct hlist_head *head = i_hash_head(sb, ino);
 	struct inode *inode;
 again:
 	spin_lock(&inode_hash_lock);
@@ -1335,7 +1341,7 @@ EXPORT_SYMBOL(iget_locked);
  */
 static int test_inode_iunique(struct super_block *sb, unsigned long ino)
 {
-	struct hlist_head *b = inode_hashtable + hash(sb, ino);
+	struct hlist_head *b = i_hash_head(sb, ino);
 	struct inode *inode;
 
 	hlist_for_each_entry_rcu(inode, b, i_hash) {
@@ -1422,7 +1428,7 @@ EXPORT_SYMBOL(igrab);
 struct inode *ilookup5_nowait(struct super_block *sb, unsigned long hashval,
 		int (*test)(struct inode *, void *), void *data)
 {
-	struct hlist_head *head = inode_hashtable + hash(sb, hashval);
+	struct hlist_head *head = i_hash_head(sb, hashval);
 	struct inode *inode;
 
 	spin_lock(&inode_hash_lock);
@@ -1477,7 +1483,7 @@ EXPORT_SYMBOL(ilookup5);
  */
 struct inode *ilookup(struct super_block *sb, unsigned long ino)
 {
-	struct hlist_head *head = inode_hashtable + hash(sb, ino);
+	struct hlist_head *head = i_hash_head(sb, ino);
 	struct inode *inode;
 again:
 	spin_lock(&inode_hash_lock);
@@ -1526,7 +1532,7 @@ struct inode *find_inode_nowait(struct super_block *sb,
 					     void *),
 				void *data)
 {
-	struct hlist_head *head = inode_hashtable + hash(sb, hashval);
+	struct hlist_head *head = i_hash_head(sb, hashval);
 	struct inode *inode, *ret_inode = NULL;
 	int mval;
 
@@ -1571,7 +1577,7 @@ EXPORT_SYMBOL(find_inode_nowait);
 struct inode *find_inode_rcu(struct super_block *sb, unsigned long hashval,
 			     int (*test)(struct inode *, void *), void *data)
 {
-	struct hlist_head *head = inode_hashtable + hash(sb, hashval);
+	struct hlist_head *head = i_hash_head(sb, hashval);
 	struct inode *inode;
 
 	RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
@@ -1609,7 +1615,7 @@ EXPORT_SYMBOL(find_inode_rcu);
 struct inode *find_inode_by_ino_rcu(struct super_block *sb,
 				    unsigned long ino)
 {
-	struct hlist_head *head = inode_hashtable + hash(sb, ino);
+	struct hlist_head *head = i_hash_head(sb, ino);
 	struct inode *inode;
 
 	RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
@@ -1629,7 +1635,7 @@ int insert_inode_locked(struct inode *inode)
 {
 	struct super_block *sb = inode->i_sb;
 	ino_t ino = inode->i_ino;
-	struct hlist_head *head = inode_hashtable + hash(sb, ino);
+	struct hlist_head *head = i_hash_head(sb, ino);
 
 	while (1) {
 		struct inode *old = NULL;
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH 21/32] hlist-bl: add hlist_bl_fake()
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
                   ` (19 preceding siblings ...)
  2023-05-09 16:56 ` [PATCH 20/32] vfs: factor out inode hash head calculation Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-05-10  4:48   ` Dave Chinner
  2023-05-09 16:56 ` [PATCH 22/32] vfs: inode cache conversion to hash-bl Kent Overstreet
                   ` (10 subsequent siblings)
  31 siblings, 1 reply; 73+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs; +Cc: Dave Chinner, Kent Overstreet

From: Dave Chinner <[email protected]>

In preparation for switching the VFS inode cache over to hlist_bl
lists, we need to be able to fake a list node that looks like it is
hashed, for correct operation of filesystems that don't directly use
the VFS inode cache.
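
A minimal sketch of what this enables, using the two helpers added
below plus existing list_bl primitives:

	struct hlist_bl_node n;

	INIT_HLIST_BL_NODE(&n);
	hlist_bl_add_fake(&n);		/* n now appears hashed, but is on no list */

	WARN_ON(!hlist_bl_fake(&n));	/* detectable as a fake node */
	hlist_bl_del_init(&n);		/* deletion still behaves correctly */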

Signed-off-by: Dave Chinner <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
---
 include/linux/list_bl.h | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/include/linux/list_bl.h b/include/linux/list_bl.h
index ae1b541446..8ee2bf5af1 100644
--- a/include/linux/list_bl.h
+++ b/include/linux/list_bl.h
@@ -143,6 +143,28 @@ static inline void hlist_bl_del_init(struct hlist_bl_node *n)
 	}
 }
 
+/**
+ * hlist_bl_add_fake - create a fake list consisting of a single headless node
+ * @n: Node to make a fake list out of
+ *
+ * This makes @n appear to be its own predecessor on a headless hlist.
+ * The point of this is to allow things like hlist_bl_del() to work correctly
+ * in cases where there is no list.
+ */
+static inline void hlist_bl_add_fake(struct hlist_bl_node *n)
+{
+	n->pprev = &n->next;
+}
+
+/**
+ * hlist_bl_fake - Is this node a fake hlist_bl?
+ * @n: Node to check for being a self-referential fake hlist.
+ */
+static inline bool hlist_bl_fake(struct hlist_bl_node *n)
+{
+	return n->pprev == &n->next;
+}
+
 static inline void hlist_bl_lock(struct hlist_bl_head *b)
 {
 	bit_spin_lock(0, (unsigned long *)b);
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH 22/32] vfs: inode cache conversion to hash-bl
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
                   ` (20 preceding siblings ...)
  2023-05-09 16:56 ` [PATCH 21/32] hlist-bl: add hlist_bl_fake() Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-05-10  4:45   ` Dave Chinner
  2023-05-09 16:56 ` [PATCH 23/32] iov_iter: copy_folio_from_iter_atomic() Kent Overstreet
                   ` (9 subsequent siblings)
  31 siblings, 1 reply; 73+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs
  Cc: Dave Chinner, Alexander Viro, Christian Brauner, Kent Overstreet

From: Dave Chinner <[email protected]>

Because scalability of the global inode_hash_lock really, really
sucks.

32-way concurrent create on a couple of different filesystems
before:

-   52.13%     0.04%  [kernel]            [k] ext4_create
   - 52.09% ext4_create
      - 41.03% __ext4_new_inode
         - 29.92% insert_inode_locked
            - 25.35% _raw_spin_lock
               - do_raw_spin_lock
                  - 24.97% __pv_queued_spin_lock_slowpath

-   72.33%     0.02%  [kernel]            [k] do_filp_open
   - 72.31% do_filp_open
      - 72.28% path_openat
         - 57.03% bch2_create
            - 56.46% __bch2_create
               - 40.43% inode_insert5
                  - 36.07% _raw_spin_lock
                     - do_raw_spin_lock
                          35.86% __pv_queued_spin_lock_slowpath
                    4.02% find_inode

Convert the inode hash table to a RCU-aware hash-bl table just like
the dentry cache. Note that we need to store a pointer to the
hlist_bl_head the inode has been added to in the inode so that when
it comes to unhash the inode we know what list to lock. We need to
do this because the hash value that is used to hash the inode is
generated from the inode itself - filesystems can provide this
themselves so we have to either store the hash or the head pointer
in the inode to be able to find the right list head for removal...
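
Condensed, the unhash side this enables looks like the following (the
full version in __remove_inode_hash() below also loops to handle racing
rehashes):

	struct hlist_bl_head *b = inode->i_hash_head;

	hlist_bl_lock(b);
	spin_lock(&inode->i_lock);
	if (b == inode->i_hash_head)
		hlist_bl_del_rcu(&inode->i_hash);
	spin_unlock(&inode->i_lock);
	hlist_bl_unlock(b);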

Same workload after:

Signed-off-by: Dave Chinner <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: [email protected]
Signed-off-by: Kent Overstreet <[email protected]>
---
 fs/inode.c         | 200 ++++++++++++++++++++++++++++-----------------
 include/linux/fs.h |   9 +-
 2 files changed, 132 insertions(+), 77 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index 41a10bcda1..d446b054ec 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -57,8 +57,7 @@
 
 static unsigned int i_hash_mask __read_mostly;
 static unsigned int i_hash_shift __read_mostly;
-static struct hlist_head *inode_hashtable __read_mostly;
-static __cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_hash_lock);
+static struct hlist_bl_head *inode_hashtable __read_mostly;
 
 static unsigned long hash(struct super_block *sb, unsigned long hashval)
 {
@@ -70,7 +69,7 @@ static unsigned long hash(struct super_block *sb, unsigned long hashval)
 	return tmp & i_hash_mask;
 }
 
-static inline struct hlist_head *i_hash_head(struct super_block *sb,
+static inline struct hlist_bl_head *i_hash_head(struct super_block *sb,
 		unsigned int hashval)
 {
 	return inode_hashtable + hash(sb, hashval);
@@ -433,7 +432,7 @@ EXPORT_SYMBOL(address_space_init_once);
 void inode_init_once(struct inode *inode)
 {
 	memset(inode, 0, sizeof(*inode));
-	INIT_HLIST_NODE(&inode->i_hash);
+	INIT_HLIST_BL_NODE(&inode->i_hash);
 	INIT_LIST_HEAD(&inode->i_devices);
 	INIT_LIST_HEAD(&inode->i_io_list);
 	INIT_LIST_HEAD(&inode->i_wb_list);
@@ -522,6 +521,17 @@ static inline void inode_sb_list_del(struct inode *inode)
 	}
 }
 
+/*
+ * Ensure that we store the hash head in the inode when we insert the inode into
+ * the hlist_bl_head...
+ */
+static inline void
+__insert_inode_hash_head(struct inode *inode, struct hlist_bl_head *b)
+{
+	hlist_bl_add_head_rcu(&inode->i_hash, b);
+	inode->i_hash_head = b;
+}
+
 /**
  *	__insert_inode_hash - hash an inode
  *	@inode: unhashed inode
@@ -532,13 +542,13 @@ static inline void inode_sb_list_del(struct inode *inode)
  */
 void __insert_inode_hash(struct inode *inode, unsigned long hashval)
 {
-	struct hlist_head *b = inode_hashtable + hash(inode->i_sb, hashval);
+	struct hlist_bl_head *b = i_hash_head(inode->i_sb, hashval);
 
-	spin_lock(&inode_hash_lock);
+	hlist_bl_lock(b);
 	spin_lock(&inode->i_lock);
-	hlist_add_head_rcu(&inode->i_hash, b);
+	__insert_inode_hash_head(inode, b);
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_hash_lock);
+	hlist_bl_unlock(b);
 }
 EXPORT_SYMBOL(__insert_inode_hash);
 
@@ -550,11 +560,44 @@ EXPORT_SYMBOL(__insert_inode_hash);
  */
 void __remove_inode_hash(struct inode *inode)
 {
-	spin_lock(&inode_hash_lock);
-	spin_lock(&inode->i_lock);
-	hlist_del_init_rcu(&inode->i_hash);
-	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_hash_lock);
+	struct hlist_bl_head *b = inode->i_hash_head;
+
+	/*
+	 * There are some callers that come through here without synchronisation
+	 * and potentially with multiple references to the inode. Hence we have
+	 * to handle the case that we might race with a remove and insert to a
+	 * different list. Coda, in particular, seems to have a userspace API
+	 * that can directly trigger "unhash/rehash to different list" behaviour
+	 * without any serialisation at all.
+	 *
+	 * Hence we have to handle the situation where the inode->i_hash_head
+	 * might point to a different list than what we expect, indicating that
+	 * we raced with another unhash and potentially a new insertion. This
+	 * means we have to retest the head once we have everything locked up
+	 * and loop again if it doesn't match.
+	 */
+	while (b) {
+		hlist_bl_lock(b);
+		spin_lock(&inode->i_lock);
+		if (b != inode->i_hash_head) {
+			hlist_bl_unlock(b);
+			b = inode->i_hash_head;
+			spin_unlock(&inode->i_lock);
+			continue;
+		}
+		/*
+		 * Need to set the pprev pointer to NULL after list removal so
+		 * that both RCU traversals and hlist_bl_unhashed() work
+		 * correctly at this point.
+		 */
+		hlist_bl_del_rcu(&inode->i_hash);
+		inode->i_hash.pprev = NULL;
+		inode->i_hash_head = NULL;
+		spin_unlock(&inode->i_lock);
+		hlist_bl_unlock(b);
+		break;
+	}
+
 }
 EXPORT_SYMBOL(__remove_inode_hash);
 
@@ -904,26 +947,28 @@ long prune_icache_sb(struct super_block *sb, struct shrink_control *sc)
 	return freed;
 }
 
-static void __wait_on_freeing_inode(struct inode *inode);
+static void __wait_on_freeing_inode(struct hlist_bl_head *b,
+				struct inode *inode);
 /*
  * Called with the inode lock held.
  */
 static struct inode *find_inode(struct super_block *sb,
-				struct hlist_head *head,
+				struct hlist_bl_head *b,
 				int (*test)(struct inode *, void *),
 				void *data)
 {
+	struct hlist_bl_node *node;
 	struct inode *inode = NULL;
 
 repeat:
-	hlist_for_each_entry(inode, head, i_hash) {
+	hlist_bl_for_each_entry(inode, node, b, i_hash) {
 		if (inode->i_sb != sb)
 			continue;
 		if (!test(inode, data))
 			continue;
 		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
-			__wait_on_freeing_inode(inode);
+			__wait_on_freeing_inode(b, inode);
 			goto repeat;
 		}
 		if (unlikely(inode->i_state & I_CREATING)) {
@@ -942,19 +987,20 @@ static struct inode *find_inode(struct super_block *sb,
  * iget_locked for details.
  */
 static struct inode *find_inode_fast(struct super_block *sb,
-				struct hlist_head *head, unsigned long ino)
+				struct hlist_bl_head *b, unsigned long ino)
 {
+	struct hlist_bl_node *node;
 	struct inode *inode = NULL;
 
 repeat:
-	hlist_for_each_entry(inode, head, i_hash) {
+	hlist_bl_for_each_entry(inode, node, b, i_hash) {
 		if (inode->i_ino != ino)
 			continue;
 		if (inode->i_sb != sb)
 			continue;
 		spin_lock(&inode->i_lock);
 		if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
-			__wait_on_freeing_inode(inode);
+			__wait_on_freeing_inode(b, inode);
 			goto repeat;
 		}
 		if (unlikely(inode->i_state & I_CREATING)) {
@@ -1162,25 +1208,25 @@ EXPORT_SYMBOL(unlock_two_nondirectories);
  * return it locked, hashed, and with the I_NEW flag set. The file system gets
  * to fill it in before unlocking it via unlock_new_inode().
  *
- * Note both @test and @set are called with the inode_hash_lock held, so can't
- * sleep.
+ * Note both @test and @set are called with the inode hash chain lock held,
+ * so can't sleep.
  */
 struct inode *inode_insert5(struct inode *inode, unsigned long hashval,
 			    int (*test)(struct inode *, void *),
 			    int (*set)(struct inode *, void *), void *data)
 {
-	struct hlist_head *head = i_hash_head(inode->i_sb, hashval);
+	struct hlist_bl_head *b = i_hash_head(inode->i_sb, hashval);
 	struct inode *old;
 
 again:
-	spin_lock(&inode_hash_lock);
-	old = find_inode(inode->i_sb, head, test, data);
+	hlist_bl_lock(b);
+	old = find_inode(inode->i_sb, b, test, data);
 	if (unlikely(old)) {
 		/*
 		 * Uhhuh, somebody else created the same inode under us.
 		 * Use the old inode instead of the preallocated one.
 		 */
-		spin_unlock(&inode_hash_lock);
+		hlist_bl_unlock(b);
 		if (IS_ERR(old))
 			return NULL;
 		wait_on_inode(old);
@@ -1202,7 +1248,7 @@ struct inode *inode_insert5(struct inode *inode, unsigned long hashval,
 	 */
 	spin_lock(&inode->i_lock);
 	inode->i_state |= I_NEW;
-	hlist_add_head_rcu(&inode->i_hash, head);
+	__insert_inode_hash_head(inode, b);
 	spin_unlock(&inode->i_lock);
 
 	/*
@@ -1212,7 +1258,7 @@ struct inode *inode_insert5(struct inode *inode, unsigned long hashval,
 	if (list_empty(&inode->i_sb_list))
 		inode_sb_list_add(inode);
 unlock:
-	spin_unlock(&inode_hash_lock);
+	hlist_bl_unlock(b);
 
 	return inode;
 }
@@ -1273,12 +1319,12 @@ EXPORT_SYMBOL(iget5_locked);
  */
 struct inode *iget_locked(struct super_block *sb, unsigned long ino)
 {
-	struct hlist_head *head = i_hash_head(sb, ino);
+	struct hlist_bl_head *b = i_hash_head(sb, ino);
 	struct inode *inode;
 again:
-	spin_lock(&inode_hash_lock);
-	inode = find_inode_fast(sb, head, ino);
-	spin_unlock(&inode_hash_lock);
+	hlist_bl_lock(b);
+	inode = find_inode_fast(sb, b, ino);
+	hlist_bl_unlock(b);
 	if (inode) {
 		if (IS_ERR(inode))
 			return NULL;
@@ -1294,17 +1340,17 @@ struct inode *iget_locked(struct super_block *sb, unsigned long ino)
 	if (inode) {
 		struct inode *old;
 
-		spin_lock(&inode_hash_lock);
+		hlist_bl_lock(b);
 		/* We released the lock, so.. */
-		old = find_inode_fast(sb, head, ino);
+		old = find_inode_fast(sb, b, ino);
 		if (!old) {
 			inode->i_ino = ino;
 			spin_lock(&inode->i_lock);
 			inode->i_state = I_NEW;
-			hlist_add_head_rcu(&inode->i_hash, head);
+			__insert_inode_hash_head(inode, b);
 			spin_unlock(&inode->i_lock);
 			inode_sb_list_add(inode);
-			spin_unlock(&inode_hash_lock);
+			hlist_bl_unlock(b);
 
 			/* Return the locked inode with I_NEW set, the
 			 * caller is responsible for filling in the contents
@@ -1317,7 +1363,7 @@ struct inode *iget_locked(struct super_block *sb, unsigned long ino)
 		 * us. Use the old inode instead of the one we just
 		 * allocated.
 		 */
-		spin_unlock(&inode_hash_lock);
+		hlist_bl_unlock(b);
 		destroy_inode(inode);
 		if (IS_ERR(old))
 			return NULL;
@@ -1341,10 +1387,11 @@ EXPORT_SYMBOL(iget_locked);
  */
 static int test_inode_iunique(struct super_block *sb, unsigned long ino)
 {
-	struct hlist_head *b = i_hash_head(sb, ino);
+	struct hlist_bl_head *b = i_hash_head(sb, ino);
+	struct hlist_bl_node *node;
 	struct inode *inode;
 
-	hlist_for_each_entry_rcu(inode, b, i_hash) {
+	hlist_bl_for_each_entry_rcu(inode, node, b, i_hash) {
 		if (inode->i_ino == ino && inode->i_sb == sb)
 			return 0;
 	}
@@ -1428,12 +1475,12 @@ EXPORT_SYMBOL(igrab);
 struct inode *ilookup5_nowait(struct super_block *sb, unsigned long hashval,
 		int (*test)(struct inode *, void *), void *data)
 {
-	struct hlist_head *head = i_hash_head(sb, hashval);
+	struct hlist_bl_head *b = i_hash_head(sb, hashval);
 	struct inode *inode;
 
-	spin_lock(&inode_hash_lock);
-	inode = find_inode(sb, head, test, data);
-	spin_unlock(&inode_hash_lock);
+	hlist_bl_lock(b);
+	inode = find_inode(sb, b, test, data);
+	hlist_bl_unlock(b);
 
 	return IS_ERR(inode) ? NULL : inode;
 }
@@ -1483,12 +1530,12 @@ EXPORT_SYMBOL(ilookup5);
  */
 struct inode *ilookup(struct super_block *sb, unsigned long ino)
 {
-	struct hlist_head *head = i_hash_head(sb, ino);
+	struct hlist_bl_head *b = i_hash_head(sb, ino);
 	struct inode *inode;
 again:
-	spin_lock(&inode_hash_lock);
-	inode = find_inode_fast(sb, head, ino);
-	spin_unlock(&inode_hash_lock);
+	hlist_bl_lock(b);
+	inode = find_inode_fast(sb, b, ino);
+	hlist_bl_unlock(b);
 
 	if (inode) {
 		if (IS_ERR(inode))
@@ -1532,12 +1579,13 @@ struct inode *find_inode_nowait(struct super_block *sb,
 					     void *),
 				void *data)
 {
-	struct hlist_head *head = i_hash_head(sb, hashval);
+	struct hlist_bl_head *b = i_hash_head(sb, hashval);
+	struct hlist_bl_node *node;
 	struct inode *inode, *ret_inode = NULL;
 	int mval;
 
-	spin_lock(&inode_hash_lock);
-	hlist_for_each_entry(inode, head, i_hash) {
+	hlist_bl_lock(b);
+	hlist_bl_for_each_entry(inode, node, b, i_hash) {
 		if (inode->i_sb != sb)
 			continue;
 		mval = match(inode, hashval, data);
@@ -1548,7 +1596,7 @@ struct inode *find_inode_nowait(struct super_block *sb,
 		goto out;
 	}
 out:
-	spin_unlock(&inode_hash_lock);
+	hlist_bl_unlock(b);
 	return ret_inode;
 }
 EXPORT_SYMBOL(find_inode_nowait);
@@ -1577,13 +1625,14 @@ EXPORT_SYMBOL(find_inode_nowait);
 struct inode *find_inode_rcu(struct super_block *sb, unsigned long hashval,
 			     int (*test)(struct inode *, void *), void *data)
 {
-	struct hlist_head *head = i_hash_head(sb, hashval);
+	struct hlist_bl_head *b = i_hash_head(sb, hashval);
+	struct hlist_bl_node *node;
 	struct inode *inode;
 
 	RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
 			 "suspicious find_inode_rcu() usage");
 
-	hlist_for_each_entry_rcu(inode, head, i_hash) {
+	hlist_bl_for_each_entry_rcu(inode, node, b, i_hash) {
 		if (inode->i_sb == sb &&
 		    !(READ_ONCE(inode->i_state) & (I_FREEING | I_WILL_FREE)) &&
 		    test(inode, data))
@@ -1615,13 +1664,14 @@ EXPORT_SYMBOL(find_inode_rcu);
 struct inode *find_inode_by_ino_rcu(struct super_block *sb,
 				    unsigned long ino)
 {
-	struct hlist_head *head = i_hash_head(sb, ino);
+	struct hlist_bl_head *b = i_hash_head(sb, ino);
+	struct hlist_bl_node *node;
 	struct inode *inode;
 
 	RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
 			 "suspicious find_inode_by_ino_rcu() usage");
 
-	hlist_for_each_entry_rcu(inode, head, i_hash) {
+	hlist_bl_for_each_entry_rcu(inode, node, b, i_hash) {
 		if (inode->i_ino == ino &&
 		    inode->i_sb == sb &&
 		    !(READ_ONCE(inode->i_state) & (I_FREEING | I_WILL_FREE)))
@@ -1635,39 +1685,42 @@ int insert_inode_locked(struct inode *inode)
 {
 	struct super_block *sb = inode->i_sb;
 	ino_t ino = inode->i_ino;
-	struct hlist_head *head = i_hash_head(sb, ino);
+	struct hlist_bl_head *b = i_hash_head(sb, ino);
 
 	while (1) {
-		struct inode *old = NULL;
-		spin_lock(&inode_hash_lock);
-		hlist_for_each_entry(old, head, i_hash) {
-			if (old->i_ino != ino)
+		struct hlist_bl_node *node;
+		struct inode *old = NULL, *t;
+
+		hlist_bl_lock(b);
+		hlist_bl_for_each_entry(t, node, b, i_hash) {
+			if (t->i_ino != ino)
 				continue;
-			if (old->i_sb != sb)
+			if (t->i_sb != sb)
 				continue;
-			spin_lock(&old->i_lock);
-			if (old->i_state & (I_FREEING|I_WILL_FREE)) {
-				spin_unlock(&old->i_lock);
+			spin_lock(&t->i_lock);
+			if (t->i_state & (I_FREEING|I_WILL_FREE)) {
+				spin_unlock(&t->i_lock);
 				continue;
 			}
+			old = t;
 			break;
 		}
 		if (likely(!old)) {
 			spin_lock(&inode->i_lock);
 			inode->i_state |= I_NEW | I_CREATING;
-			hlist_add_head_rcu(&inode->i_hash, head);
+			__insert_inode_hash_head(inode, b);
 			spin_unlock(&inode->i_lock);
-			spin_unlock(&inode_hash_lock);
+			hlist_bl_unlock(b);
 			return 0;
 		}
 		if (unlikely(old->i_state & I_CREATING)) {
 			spin_unlock(&old->i_lock);
-			spin_unlock(&inode_hash_lock);
+			hlist_bl_unlock(b);
 			return -EBUSY;
 		}
 		__iget(old);
 		spin_unlock(&old->i_lock);
-		spin_unlock(&inode_hash_lock);
+		hlist_bl_unlock(b);
 		wait_on_inode(old);
 		if (unlikely(!inode_unhashed(old))) {
 			iput(old);
@@ -2192,17 +2245,18 @@ EXPORT_SYMBOL(inode_needs_sync);
  * wake_up_bit(&inode->i_state, __I_NEW) after removing from the hash list
  * will DTRT.
  */
-static void __wait_on_freeing_inode(struct inode *inode)
+static void __wait_on_freeing_inode(struct hlist_bl_head *b,
+				struct inode *inode)
 {
 	wait_queue_head_t *wq;
 	DEFINE_WAIT_BIT(wait, &inode->i_state, __I_NEW);
 	wq = bit_waitqueue(&inode->i_state, __I_NEW);
 	prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
 	spin_unlock(&inode->i_lock);
-	spin_unlock(&inode_hash_lock);
+	hlist_bl_unlock(b);
 	schedule();
 	finish_wait(wq, &wait.wq_entry);
-	spin_lock(&inode_hash_lock);
+	hlist_bl_lock(b);
 }
 
 static __initdata unsigned long ihash_entries;
@@ -2228,7 +2282,7 @@ void __init inode_init_early(void)
 
 	inode_hashtable =
 		alloc_large_system_hash("Inode-cache",
-					sizeof(struct hlist_head),
+					sizeof(struct hlist_bl_head),
 					ihash_entries,
 					14,
 					HASH_EARLY | HASH_ZERO,
@@ -2254,7 +2308,7 @@ void __init inode_init(void)
 
 	inode_hashtable =
 		alloc_large_system_hash("Inode-cache",
-					sizeof(struct hlist_head),
+					sizeof(struct hlist_bl_head),
 					ihash_entries,
 					14,
 					HASH_ZERO,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 1a6f951942..db8d49cbf7 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -647,7 +647,8 @@ struct inode {
 	unsigned long		dirtied_when;	/* jiffies of first dirtying */
 	unsigned long		dirtied_time_when;
 
-	struct hlist_node	i_hash;
+	struct hlist_bl_node	i_hash;
+	struct hlist_bl_head	*i_hash_head;
 	struct list_head	i_io_list;	/* backing dev IO list */
 #ifdef CONFIG_CGROUP_WRITEBACK
 	struct bdi_writeback	*i_wb;		/* the associated cgroup wb */
@@ -713,7 +714,7 @@ static inline unsigned int i_blocksize(const struct inode *node)
 
 static inline int inode_unhashed(struct inode *inode)
 {
-	return hlist_unhashed(&inode->i_hash);
+	return hlist_bl_unhashed(&inode->i_hash);
 }
 
 /*
@@ -724,7 +725,7 @@ static inline int inode_unhashed(struct inode *inode)
  */
 static inline void inode_fake_hash(struct inode *inode)
 {
-	hlist_add_fake(&inode->i_hash);
+	hlist_bl_add_fake(&inode->i_hash);
 }
 
 /*
@@ -2695,7 +2696,7 @@ static inline void insert_inode_hash(struct inode *inode)
 extern void __remove_inode_hash(struct inode *);
 static inline void remove_inode_hash(struct inode *inode)
 {
-	if (!inode_unhashed(inode) && !hlist_fake(&inode->i_hash))
+	if (!inode_unhashed(inode) && !hlist_bl_fake(&inode->i_hash))
 		__remove_inode_hash(inode);
 }
 
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH 23/32] iov_iter: copy_folio_from_iter_atomic()
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
                   ` (21 preceding siblings ...)
  2023-05-09 16:56 ` [PATCH 22/32] vfs: inode cache conversion to hash-bl Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-05-10  2:20   ` kernel test robot
  2023-05-11  2:08   ` kernel test robot
  2023-05-09 16:56 ` [PATCH 24/32] MAINTAINERS: Add entry for generic-radix-tree Kent Overstreet
                   ` (8 subsequent siblings)
  31 siblings, 2 replies; 73+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs
  Cc: Kent Overstreet, Alexander Viro, Matthew Wilcox

Add a foliated version of copy_page_from_iter_atomic()
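
A usage sketch for a buffered-write style caller (folio, pos, bytes and
iter are placeholders; offset_in_folio() is the existing helper):

	size_t copied = copy_folio_from_iter_atomic(folio,
					offset_in_folio(folio, pos),
					bytes, iter);

	if (copied < bytes) {
		/* partial copy: fault in the rest and retry, as callers of
		 * copy_page_from_iter_atomic() already do */
	}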

Signed-off-by: Kent Overstreet <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: Matthew Wilcox <[email protected]>
---
 include/linux/uio.h |  2 ++
 lib/iov_iter.c      | 53 ++++++++++++++++++++++++++++++++++++---------
 2 files changed, 45 insertions(+), 10 deletions(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index 27e3fd9429..b2c281cb10 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -154,6 +154,8 @@ static inline struct iovec iov_iter_iovec(const struct iov_iter *iter)
 
 size_t copy_page_from_iter_atomic(struct page *page, unsigned offset,
 				  size_t bytes, struct iov_iter *i);
+size_t copy_folio_from_iter_atomic(struct folio *folio, size_t offset,
+				   size_t bytes, struct iov_iter *i);
 void iov_iter_advance(struct iov_iter *i, size_t bytes);
 void iov_iter_revert(struct iov_iter *i, size_t bytes);
 size_t fault_in_iov_iter_readable(const struct iov_iter *i, size_t bytes);
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index 274014e4ea..27ba7e9f9e 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -800,18 +800,10 @@ size_t iov_iter_zero(size_t bytes, struct iov_iter *i)
 }
 EXPORT_SYMBOL(iov_iter_zero);
 
-size_t copy_page_from_iter_atomic(struct page *page, unsigned offset, size_t bytes,
-				  struct iov_iter *i)
+static inline size_t __copy_page_from_iter_atomic(struct page *page, unsigned offset,
+						  size_t bytes, struct iov_iter *i)
 {
 	char *kaddr = kmap_atomic(page), *p = kaddr + offset;
-	if (!page_copy_sane(page, offset, bytes)) {
-		kunmap_atomic(kaddr);
-		return 0;
-	}
-	if (WARN_ON_ONCE(!i->data_source)) {
-		kunmap_atomic(kaddr);
-		return 0;
-	}
 	iterate_and_advance(i, bytes, base, len, off,
 		copyin(p + off, base, len),
 		memcpy(p + off, base, len)
@@ -819,8 +811,49 @@ size_t copy_page_from_iter_atomic(struct page *page, unsigned offset, size_t byt
 	kunmap_atomic(kaddr);
 	return bytes;
 }
+
+size_t copy_page_from_iter_atomic(struct page *page, unsigned offset, size_t bytes,
+				  struct iov_iter *i)
+{
+	if (!page_copy_sane(page, offset, bytes))
+		return 0;
+	if (WARN_ON_ONCE(!i->data_source))
+		return 0;
+	return __copy_page_from_iter_atomic(page, offset, bytes, i);
+}
 EXPORT_SYMBOL(copy_page_from_iter_atomic);
 
+size_t copy_folio_from_iter_atomic(struct folio *folio, size_t offset,
+				   size_t bytes, struct iov_iter *i)
+{
+	size_t ret = 0;
+
+	if (WARN_ON(offset + bytes > folio_size(folio)))
+		return 0;
+	if (WARN_ON_ONCE(!i->data_source))
+		return 0;
+
+#ifdef CONFIG_HIGHMEM
+	while (bytes) {
+		struct page *page = folio_page(folio, offset >> PAGE_SHIFT);
+		unsigned b = min(bytes, PAGE_SIZE - offset_in_page(offset));
+		unsigned r = __copy_page_from_iter_atomic(page, offset_in_page(offset), b, i);
+
+		offset	+= r;
+		bytes	-= r;
+		ret	+= r;
+
+		if (r != b)
+			break;
+	}
+#else
+	ret = __copy_page_from_iter_atomic(&folio->page, offset, bytes, i);
+#endif
+
+	return ret;
+}
+EXPORT_SYMBOL(copy_folio_from_iter_atomic);
+
 static void pipe_advance(struct iov_iter *i, size_t size)
 {
 	struct pipe_inode_info *pipe = i->pipe;
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH 24/32] MAINTAINERS: Add entry for generic-radix-tree
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
                   ` (22 preceding siblings ...)
  2023-05-09 16:56 ` [PATCH 23/32] iov_iter: copy_folio_from_iter_atomic() Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-05-09 21:03   ` Randy Dunlap
  2023-05-09 16:56 ` [PATCH 25/32] lib/generic-radix-tree.c: Don't overflow in peek() Kent Overstreet
                   ` (7 subsequent siblings)
  31 siblings, 1 reply; 73+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs; +Cc: Kent Overstreet

lib/generic-radix-tree.c is a simple radix tree that supports storing
arbitrary types. Add a maintainers entry for it.
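
For context, a minimal usage sketch of the interface this file provides
(struct foo and idx are placeholders):

	static GENRADIX(struct foo) foos;

	struct foo *p = genradix_ptr_alloc(&foos, idx, GFP_KERNEL);	/* allocate slot idx */
	struct foo *q = genradix_ptr(&foos, idx);	/* NULL if slot was never allocated */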

Signed-off-by: Kent Overstreet <[email protected]>
---
 MAINTAINERS | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 5d76169140..c550f5909e 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8615,6 +8615,13 @@ F:	Documentation/devicetree/bindings/power/power?domain*
 F:	drivers/base/power/domain*.c
 F:	include/linux/pm_domain.h
 
+GENERIC RADIX TREE:
+M:	Kent Overstreet <[email protected]>
+S:	Supported
+C:	irc://irc.oftc.net/bcache
+F:	include/linux/generic-radix-tree.h
+F:	lib/generic-radix-tree.c
+
 GENERIC RESISTIVE TOUCHSCREEN ADC DRIVER
 M:	Eugen Hristev <[email protected]>
 L:	[email protected]
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH 25/32] lib/generic-radix-tree.c: Don't overflow in peek()
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
                   ` (23 preceding siblings ...)
  2023-05-09 16:56 ` [PATCH 24/32] MAINTAINERS: Add entry for generic-radix-tree Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-05-09 16:56 ` [PATCH 26/32] lib/generic-radix-tree.c: Add a missing include Kent Overstreet
                   ` (6 subsequent siblings)
  31 siblings, 0 replies; 73+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs
  Cc: Kent Overstreet, Kent Overstreet

From: Kent Overstreet <[email protected]>

When we started spreading new inode numbers throughout most of the 64
bit inode space, that triggered some corner case bugs, in particular
some integer overflows related to the radix tree code. Oops.
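
For context, the checks added below use the standard unsigned wraparound
test - for size_t values a and b, (a + b < a) is true exactly when a + b
overflowed:

	size_t a = SIZE_MAX - 4, b = 16;

	/* a + b wraps around to 11, which is less than a, so the iterator
	 * can detect that it would have walked past the end */
	BUG_ON(!(a + b < a));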

Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
---
 include/linux/generic-radix-tree.h |  6 ++++++
 lib/generic-radix-tree.c           | 17 ++++++++++++++---
 2 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/include/linux/generic-radix-tree.h b/include/linux/generic-radix-tree.h
index 107613f7d7..63080822dc 100644
--- a/include/linux/generic-radix-tree.h
+++ b/include/linux/generic-radix-tree.h
@@ -184,6 +184,12 @@ void *__genradix_iter_peek(struct genradix_iter *, struct __genradix *, size_t);
 static inline void __genradix_iter_advance(struct genradix_iter *iter,
 					   size_t obj_size)
 {
+	if (iter->offset + obj_size < iter->offset) {
+		iter->offset	= SIZE_MAX;
+		iter->pos	= SIZE_MAX;
+		return;
+	}
+
 	iter->offset += obj_size;
 
 	if (!is_power_of_2(obj_size) &&
diff --git a/lib/generic-radix-tree.c b/lib/generic-radix-tree.c
index f25eb111c0..7dfa88282b 100644
--- a/lib/generic-radix-tree.c
+++ b/lib/generic-radix-tree.c
@@ -166,6 +166,10 @@ void *__genradix_iter_peek(struct genradix_iter *iter,
 	struct genradix_root *r;
 	struct genradix_node *n;
 	unsigned level, i;
+
+	if (iter->offset == SIZE_MAX)
+		return NULL;
+
 restart:
 	r = READ_ONCE(radix->root);
 	if (!r)
@@ -184,10 +188,17 @@ void *__genradix_iter_peek(struct genradix_iter *iter,
 			(GENRADIX_ARY - 1);
 
 		while (!n->children[i]) {
+			size_t objs_per_ptr = genradix_depth_size(level);
+
+			if (iter->offset + objs_per_ptr < iter->offset) {
+				iter->offset	= SIZE_MAX;
+				iter->pos	= SIZE_MAX;
+				return NULL;
+			}
+
 			i++;
-			iter->offset = round_down(iter->offset +
-					   genradix_depth_size(level),
-					   genradix_depth_size(level));
+			iter->offset = round_down(iter->offset + objs_per_ptr,
+						  objs_per_ptr);
 			iter->pos = (iter->offset >> PAGE_SHIFT) *
 				objs_per_page;
 			if (i == GENRADIX_ARY)
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH 26/32] lib/generic-radix-tree.c: Add a missing include
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
                   ` (24 preceding siblings ...)
  2023-05-09 16:56 ` [PATCH 25/32] lib/generic-radix-tree.c: Don't overflow in peek() Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-05-09 16:56 ` [PATCH 27/32] lib/generic-radix-tree.c: Add peek_prev() Kent Overstreet
                   ` (5 subsequent siblings)
  31 siblings, 0 replies; 73+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs
  Cc: Kent Overstreet, Kent Overstreet

From: Kent Overstreet <[email protected]>

We now need linux/limits.h for SIZE_MAX.

Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
---
 include/linux/generic-radix-tree.h | 1 +
 1 file changed, 1 insertion(+)

diff --git a/include/linux/generic-radix-tree.h b/include/linux/generic-radix-tree.h
index 63080822dc..f6cd0f909d 100644
--- a/include/linux/generic-radix-tree.h
+++ b/include/linux/generic-radix-tree.h
@@ -38,6 +38,7 @@
 
 #include <asm/page.h>
 #include <linux/bug.h>
+#include <linux/limits.h>
 #include <linux/log2.h>
 #include <linux/math.h>
 #include <linux/types.h>
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH 27/32] lib/generic-radix-tree.c: Add peek_prev()
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
                   ` (25 preceding siblings ...)
  2023-05-09 16:56 ` [PATCH 26/32] lib/generic-radix-tree.c: Add a missing include Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-05-09 16:56 ` [PATCH 28/32] stacktrace: Export stack_trace_save_tsk Kent Overstreet
                   ` (4 subsequent siblings)
  31 siblings, 0 replies; 73+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs
  Cc: Kent Overstreet, Kent Overstreet

From: Kent Overstreet <[email protected]>

This patch adds genradix_peek_prev(), genradix_iter_rewind(), and
genradix_for_each_reverse(), for iterating backwards over a generic
radix tree.
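
As a usage sketch (hypothetical caller; struct foo is just a stand-in for
whatever type the genradix holds), reverse iteration mirrors the forward
variant but starts from genradix_last_pos():

	GENRADIX(struct foo) r;
	struct genradix_iter iter;
	struct foo *p;

	/* walk all allocated entries from the highest index down to 0 */
	genradix_for_each_reverse(&r, iter, p)
		pr_info("entry at index %zu\n", iter.pos);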

Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
---
 include/linux/generic-radix-tree.h | 61 +++++++++++++++++++++++++++++-
 lib/generic-radix-tree.c           | 59 +++++++++++++++++++++++++++++
 2 files changed, 119 insertions(+), 1 deletion(-)

diff --git a/include/linux/generic-radix-tree.h b/include/linux/generic-radix-tree.h
index f6cd0f909d..c74b737699 100644
--- a/include/linux/generic-radix-tree.h
+++ b/include/linux/generic-radix-tree.h
@@ -117,6 +117,11 @@ static inline size_t __idx_to_offset(size_t idx, size_t obj_size)
 
 #define __genradix_cast(_radix)		(typeof((_radix)->type[0]) *)
 #define __genradix_obj_size(_radix)	sizeof((_radix)->type[0])
+#define __genradix_objs_per_page(_radix)			\
+	(PAGE_SIZE / sizeof((_radix)->type[0]))
+#define __genradix_page_remainder(_radix)			\
+	(PAGE_SIZE % sizeof((_radix)->type[0]))
+
 #define __genradix_idx_to_offset(_radix, _idx)			\
 	__idx_to_offset(_idx, __genradix_obj_size(_radix))
 
@@ -180,7 +185,25 @@ void *__genradix_iter_peek(struct genradix_iter *, struct __genradix *, size_t);
 #define genradix_iter_peek(_iter, _radix)			\
 	(__genradix_cast(_radix)				\
 	 __genradix_iter_peek(_iter, &(_radix)->tree,		\
-			      PAGE_SIZE / __genradix_obj_size(_radix)))
+			__genradix_objs_per_page(_radix)))
+
+void *__genradix_iter_peek_prev(struct genradix_iter *, struct __genradix *,
+				size_t, size_t);
+
+/**
+ * genradix_iter_peek_prev - get first entry at or below iterator's current
+ *			position
+ * @_iter:	a genradix_iter
+ * @_radix:	genradix being iterated over
+ *
+ * If no more entries exist at or below @_iter's current position, returns NULL
+ */
+#define genradix_iter_peek_prev(_iter, _radix)			\
+	(__genradix_cast(_radix)				\
+	 __genradix_iter_peek_prev(_iter, &(_radix)->tree,	\
+			__genradix_objs_per_page(_radix),	\
+			__genradix_obj_size(_radix) +		\
+			__genradix_page_remainder(_radix)))
 
 static inline void __genradix_iter_advance(struct genradix_iter *iter,
 					   size_t obj_size)
@@ -203,6 +226,25 @@ static inline void __genradix_iter_advance(struct genradix_iter *iter,
 #define genradix_iter_advance(_iter, _radix)			\
 	__genradix_iter_advance(_iter, __genradix_obj_size(_radix))
 
+static inline void __genradix_iter_rewind(struct genradix_iter *iter,
+					  size_t obj_size)
+{
+	if (iter->offset == 0 ||
+	    iter->offset == SIZE_MAX) {
+		iter->offset = SIZE_MAX;
+		return;
+	}
+
+	if ((iter->offset & (PAGE_SIZE - 1)) == 0)
+		iter->offset -= PAGE_SIZE % obj_size;
+
+	iter->offset -= obj_size;
+	iter->pos--;
+}
+
+#define genradix_iter_rewind(_iter, _radix)			\
+	__genradix_iter_rewind(_iter, __genradix_obj_size(_radix))
+
 #define genradix_for_each_from(_radix, _iter, _p, _start)	\
 	for (_iter = genradix_iter_init(_radix, _start);	\
 	     (_p = genradix_iter_peek(&_iter, _radix)) != NULL;	\
@@ -220,6 +262,23 @@ static inline void __genradix_iter_advance(struct genradix_iter *iter,
 #define genradix_for_each(_radix, _iter, _p)			\
 	genradix_for_each_from(_radix, _iter, _p, 0)
 
+#define genradix_last_pos(_radix)				\
+	(SIZE_MAX / PAGE_SIZE * __genradix_objs_per_page(_radix) - 1)
+
+/**
+ * genradix_for_each_reverse - iterate over entries in a genradix, reverse order
+ * @_radix:	genradix to iterate over
+ * @_iter:	a genradix_iter to track current position
+ * @_p:		pointer to genradix entry type
+ *
+ * On every iteration, @_p will point to the current entry, and @_iter.pos
+ * will be the current entry's index.
+ */
+#define genradix_for_each_reverse(_radix, _iter, _p)		\
+	for (_iter = genradix_iter_init(_radix,	genradix_last_pos(_radix));\
+	     (_p = genradix_iter_peek_prev(&_iter, _radix)) != NULL;\
+	     genradix_iter_rewind(&_iter, _radix))
+
 int __genradix_prealloc(struct __genradix *, size_t, gfp_t);
 
 /**
diff --git a/lib/generic-radix-tree.c b/lib/generic-radix-tree.c
index 7dfa88282b..41f1bcdc44 100644
--- a/lib/generic-radix-tree.c
+++ b/lib/generic-radix-tree.c
@@ -1,4 +1,5 @@
 
+#include <linux/atomic.h>
 #include <linux/export.h>
 #include <linux/generic-radix-tree.h>
 #include <linux/gfp.h>
@@ -212,6 +213,64 @@ void *__genradix_iter_peek(struct genradix_iter *iter,
 }
 EXPORT_SYMBOL(__genradix_iter_peek);
 
+void *__genradix_iter_peek_prev(struct genradix_iter *iter,
+				struct __genradix *radix,
+				size_t objs_per_page,
+				size_t obj_size_plus_page_remainder)
+{
+	struct genradix_root *r;
+	struct genradix_node *n;
+	unsigned level, i;
+
+	if (iter->offset == SIZE_MAX)
+		return NULL;
+
+restart:
+	r = READ_ONCE(radix->root);
+	if (!r)
+		return NULL;
+
+	n	= genradix_root_to_node(r);
+	level	= genradix_root_to_depth(r);
+
+	if (ilog2(iter->offset) >= genradix_depth_shift(level)) {
+		iter->offset = genradix_depth_size(level);
+		iter->pos = (iter->offset >> PAGE_SHIFT) * objs_per_page;
+
+		iter->offset -= obj_size_plus_page_remainder;
+		iter->pos--;
+	}
+
+	while (level) {
+		level--;
+
+		i = (iter->offset >> genradix_depth_shift(level)) &
+			(GENRADIX_ARY - 1);
+
+		while (!n->children[i]) {
+			size_t objs_per_ptr = genradix_depth_size(level);
+
+			iter->offset = round_down(iter->offset, objs_per_ptr);
+			iter->pos = (iter->offset >> PAGE_SHIFT) * objs_per_page;
+
+			if (!iter->offset)
+				return NULL;
+
+			iter->offset -= obj_size_plus_page_remainder;
+			iter->pos--;
+
+			if (!i)
+				goto restart;
+			--i;
+		}
+
+		n = n->children[i];
+	}
+
+	return &n->data[iter->offset & (PAGE_SIZE - 1)];
+}
+EXPORT_SYMBOL(__genradix_iter_peek_prev);
+
 static void genradix_free_recurse(struct genradix_node *n, unsigned level)
 {
 	if (level) {
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH 28/32] stacktrace: Export stack_trace_save_tsk
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
                   ` (26 preceding siblings ...)
  2023-05-09 16:56 ` [PATCH 27/32] lib/generic-radix-tree.c: Add peek_prev() Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-05-09 16:56 ` [PATCH 29/32] lib/string_helpers: string_get_size() now returns characters written Kent Overstreet
                   ` (3 subsequent siblings)
  31 siblings, 0 replies; 73+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs
  Cc: Christopher James Halse Rogers, Kent Overstreet

From: Christopher James Halse Rogers <[email protected]>

The bcachefs module wants it, and there doesn't seem to be any
reason it shouldn't be exported like the other functions.
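
For context, a minimal sketch of the kind of module-side caller this
enables (hypothetical; assumes a struct task_struct *task is in hand):

	unsigned long entries[16];
	unsigned int nr;

	/* capture another task's stack from module code */
	nr = stack_trace_save_tsk(task, entries, ARRAY_SIZE(entries), 0);
	stack_trace_print(entries, nr, 0);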

Signed-off-by: Christopher James Halse Rogers <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
---
 kernel/stacktrace.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/stacktrace.c b/kernel/stacktrace.c
index 9ed5ce9894..4f65824879 100644
--- a/kernel/stacktrace.c
+++ b/kernel/stacktrace.c
@@ -151,6 +151,7 @@ unsigned int stack_trace_save_tsk(struct task_struct *tsk, unsigned long *store,
 	put_task_stack(tsk);
 	return c.len;
 }
+EXPORT_SYMBOL_GPL(stack_trace_save_tsk);
 
 /**
  * stack_trace_save_regs - Save a stack trace based on pt_regs into a storage array
@@ -301,6 +302,7 @@ unsigned int stack_trace_save_tsk(struct task_struct *task,
 	save_stack_trace_tsk(task, &trace);
 	return trace.nr_entries;
 }
+EXPORT_SYMBOL_GPL(stack_trace_save_tsk);
 
 /**
  * stack_trace_save_regs - Save a stack trace based on pt_regs into a storage array
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH 29/32] lib/string_helpers: string_get_size() now returns characters written
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
                   ` (27 preceding siblings ...)
  2023-05-09 16:56 ` [PATCH 28/32] stacktrace: Export stack_trace_save_tsk Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-05-09 16:56 ` [PATCH 30/32] lib: Export errname Kent Overstreet
                   ` (2 subsequent siblings)
  31 siblings, 0 replies; 73+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs
  Cc: Kent Overstreet, Kent Overstreet

From: Kent Overstreet <[email protected]>

printbuf now needs to know the number of characters that would have been
written if the buffer was too small, like snprintf(); this changes
string_get_size() to return the return value of snprintf().
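
For illustration, a sketch of a caller acting on the new return value
(values here are made up):

	char buf[16];
	int len;

	/* 4096 blocks of 512 bytes = 2 MiB */
	len = string_get_size(4096, 512, STRING_UNITS_2, buf, sizeof(buf));
	/* len is snprintf()'s return: the length the full string would need */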

Signed-off-by: Kent Overstreet <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
---
 include/linux/string_helpers.h | 4 ++--
 lib/string_helpers.c           | 8 ++++----
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/include/linux/string_helpers.h b/include/linux/string_helpers.h
index fae6beaaa2..44148f8feb 100644
--- a/include/linux/string_helpers.h
+++ b/include/linux/string_helpers.h
@@ -23,8 +23,8 @@ enum string_size_units {
 	STRING_UNITS_2,		/* use binary powers of 2^10 */
 };
 
-void string_get_size(u64 size, u64 blk_size, enum string_size_units units,
-		     char *buf, int len);
+int string_get_size(u64 size, u64 blk_size, enum string_size_units units,
+		    char *buf, int len);
 
 int parse_int_array_user(const char __user *from, size_t count, int **array);
 
diff --git a/lib/string_helpers.c b/lib/string_helpers.c
index 230020a2e0..ca36ceba0e 100644
--- a/lib/string_helpers.c
+++ b/lib/string_helpers.c
@@ -32,8 +32,8 @@
  * at least 9 bytes and will always be zero terminated.
  *
  */
-void string_get_size(u64 size, u64 blk_size, const enum string_size_units units,
-		     char *buf, int len)
+int string_get_size(u64 size, u64 blk_size, const enum string_size_units units,
+		    char *buf, int len)
 {
 	static const char *const units_10[] = {
 		"B", "kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"
@@ -126,8 +126,8 @@ void string_get_size(u64 size, u64 blk_size, const enum string_size_units units,
 	else
 		unit = units_str[units][i];
 
-	snprintf(buf, len, "%u%s %s", (u32)size,
-		 tmp, unit);
+	return snprintf(buf, len, "%u%s %s", (u32)size,
+			tmp, unit);
 }
 EXPORT_SYMBOL(string_get_size);
 
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH 30/32] lib: Export errname
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
                   ` (28 preceding siblings ...)
  2023-05-09 16:56 ` [PATCH 29/32] lib/string_helpers: string_get_size() now returns characters written Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-05-09 16:56 ` [PATCH 31/32] lib: add mean and variance module Kent Overstreet
  2023-05-09 16:56 ` [PATCH 32/32] MAINTAINERS: Add entry for bcachefs Kent Overstreet
  31 siblings, 0 replies; 73+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs
  Cc: Kent Overstreet, Christopher James Halse Rogers

The bcachefs module now wants this and it seems sensible.
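
A minimal usage sketch (hypothetical call site; ret is some errno-style
return code): errname() maps an error to its symbolic name, returning NULL
for values it doesn't know about.

	const char *name = errname(ret);

	pr_err("operation failed: %s\n", name ?: "(unknown error)");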

Signed-off-by: Christopher James Halse Rogers <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
---
 lib/errname.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/lib/errname.c b/lib/errname.c
index 67739b174a..dd1b998552 100644
--- a/lib/errname.c
+++ b/lib/errname.c
@@ -228,3 +228,4 @@ const char *errname(int err)
 
 	return err > 0 ? name + 1 : name;
 }
+EXPORT_SYMBOL(errname);
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH 31/32] lib: add mean and variance module.
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
                   ` (29 preceding siblings ...)
  2023-05-09 16:56 ` [PATCH 30/32] lib: Export errname Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-05-09 16:56 ` [PATCH 32/32] MAINTAINERS: Add entry for bcachefs Kent Overstreet
  31 siblings, 0 replies; 73+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs; +Cc: Daniel Hill, Kent Overstreet

From: Daniel Hill <[email protected]>

This module provides a fast 64-bit implementation of basic statistics
functions, including mean, variance and standard deviation in both
weighted and unweighted variants; the unweighted variant has a 32-bit
limitation per sample to prevent overflow when squaring.
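
As a usage sketch (sample values chosen for illustration), the unweighted
API is a pure-functional fold over samples:

	struct mean_and_variance s = {};

	s = mean_and_variance_update(s, 2);
	s = mean_and_variance_update(s, 4);

	/* mean == 3, variance == 1 */
	pr_info("mean %lld variance %llu\n",
		mean_and_variance_get_mean(s),
		mean_and_variance_get_variance(s));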

Signed-off-by: Daniel Hill <[email protected]>
Signed-off-by: Kent Overstreet <[email protected]>
---
 MAINTAINERS                       |   9 ++
 include/linux/mean_and_variance.h | 219 ++++++++++++++++++++++++++++++
 lib/Kconfig.debug                 |   9 ++
 lib/math/Kconfig                  |   3 +
 lib/math/Makefile                 |   2 +
 lib/math/mean_and_variance.c      | 136 +++++++++++++++++++
 lib/math/mean_and_variance_test.c | 155 +++++++++++++++++++++
 7 files changed, 533 insertions(+)
 create mode 100644 include/linux/mean_and_variance.h
 create mode 100644 lib/math/mean_and_variance.c
 create mode 100644 lib/math/mean_and_variance_test.c

diff --git a/MAINTAINERS b/MAINTAINERS
index c550f5909e..dbf3c33c31 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -12767,6 +12767,15 @@ F:	Documentation/devicetree/bindings/net/ieee802154/mcr20a.txt
 F:	drivers/net/ieee802154/mcr20a.c
 F:	drivers/net/ieee802154/mcr20a.h
 
+MEAN AND VARIANCE LIBRARY
+M:	Daniel B. Hill <[email protected]>
+M:	Kent Overstreet <[email protected]>
+S:	Maintained
+T:	git https://github.com/YellowOnion/linux/
+F:	include/linux/mean_and_variance.h
+F:	lib/math/mean_and_variance.c
+F:	lib/math/mean_and_variance_test.c
+
 MEASUREMENT COMPUTING CIO-DAC IIO DRIVER
 M:	William Breathitt Gray <[email protected]>
 L:	[email protected]
diff --git a/include/linux/mean_and_variance.h b/include/linux/mean_and_variance.h
new file mode 100644
index 0000000000..89540628e8
--- /dev/null
+++ b/include/linux/mean_and_variance.h
@@ -0,0 +1,219 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef MEAN_AND_VARIANCE_H_
+#define MEAN_AND_VARIANCE_H_
+
+#include <linux/types.h>
+#include <linux/limits.h>
+#include <linux/math64.h>
+
+#define SQRT_U64_MAX 4294967295ULL
+
+
+#if defined(CONFIG_ARCH_SUPPORTS_INT128) && defined(__SIZEOF_INT128__)
+
+typedef unsigned __int128 u128;
+
+static inline u128 u64_to_u128(u64 a)
+{
+	return (u128)a;
+}
+
+static inline u64 u128_to_u64(u128 a)
+{
+	return (u64)a;
+}
+
+static inline u64 u128_shr64_to_u64(u128 a)
+{
+	return (u64)(a >> 64);
+}
+
+static inline u128 u128_add(u128 a, u128 b)
+{
+	return a + b;
+}
+
+static inline u128 u128_sub(u128 a, u128 b)
+{
+	return a - b;
+}
+
+static inline u128 u128_shl(u128 i, s8 shift)
+{
+	return i << shift;
+}
+
+static inline u128 u128_shl64_add(u64 a, u64 b)
+{
+	return ((u128)a << 64) + b;
+}
+
+static inline u128 u128_square(u64 i)
+{
+	return i*i;
+}
+
+#else
+
+typedef struct {
+	u64 hi, lo;
+} u128;
+
+static inline u128 u64_to_u128(u64 a)
+{
+	return (u128){ .lo = a };
+}
+
+static inline u64 u128_to_u64(u128 a)
+{
+	return a.lo;
+}
+
+static inline u64 u128_shr64_to_u64(u128 a)
+{
+	return a.hi;
+}
+
+static inline u128 u128_add(u128 a, u128 b)
+{
+	u128 c;
+
+	c.lo = a.lo + b.lo;
+	c.hi = a.hi + b.hi + (c.lo < a.lo);
+	return c;
+}
+
+static inline u128 u128_sub(u128 a, u128 b)
+{
+	u128 c;
+
+	c.lo = a.lo - b.lo;
+	c.hi = a.hi - b.hi - (c.lo > a.lo);
+	return c;
+}
+
+static inline u128 u128_shl(u128 i, s8 shift)
+{
+	u128 r;
+
+	r.lo = i.lo << shift;
+	if (shift < 64)
+		r.hi = (i.hi << shift) | (i.lo >> (64 - shift));
+	else {
+		r.hi = i.lo << (shift - 64);
+		r.lo = 0;
+	}
+	return r;
+}
+
+static inline u128 u128_shl64_add(u64 a, u64 b)
+{
+	return u128_add(u128_shl(u64_to_u128(a), 64), u64_to_u128(b));
+}
+
+static inline u128 u128_square(u64 i)
+{
+	u128 r;
+	u64  h = i >> 32, l = i & (u64)U32_MAX;
+
+	r =             u128_shl(u64_to_u128(h*h), 64);
+	r = u128_add(r, u128_shl(u64_to_u128(h*l), 32));
+	r = u128_add(r, u128_shl(u64_to_u128(l*h), 32));
+	r = u128_add(r,          u64_to_u128(l*l));
+	return r;
+}
+
+#endif
+
+static inline u128 u128_div(u128 n, u64 d)
+{
+	u128 r;
+	u64 rem;
+	u64 hi = u128_shr64_to_u64(n);
+	u64 lo = u128_to_u64(n);
+	u64  h =  hi & ((u64)U32_MAX  << 32);
+	u64  l = (hi &  (u64)U32_MAX) << 32;
+
+	r =             u128_shl(u64_to_u128(div64_u64_rem(h,                d, &rem)), 64);
+	r = u128_add(r, u128_shl(u64_to_u128(div64_u64_rem(l  + (rem << 32), d, &rem)), 32));
+	r = u128_add(r,          u64_to_u128(div64_u64_rem(lo + (rem << 32), d, &rem)));
+	return r;
+}
+
+struct mean_and_variance {
+	s64 n;
+	s64 sum;
+	u128 sum_squares;
+};
+
+/* exponentially weighted variant */
+struct mean_and_variance_weighted {
+	bool init;
+	u8 w;
+	s64 mean;
+	u64 variance;
+};
+
+/**
+ * fast_divpow2() - fast approximation for n / (1 << d)
+ * @n: numerator
+ * @d: the power of 2 denominator.
+ *
+ * note: this rounds towards 0.
+ */
+static inline s64 fast_divpow2(s64 n, u8 d)
+{
+	return (n + ((n < 0) ? ((1 << d) - 1) : 0)) >> d;
+}
+
+static inline struct mean_and_variance
+mean_and_variance_update_inlined(struct mean_and_variance s1, s64 v1)
+{
+	struct mean_and_variance s2;
+	u64 v2 = abs(v1);
+
+	s2.n           = s1.n + 1;
+	s2.sum         = s1.sum + v1;
+	s2.sum_squares = u128_add(s1.sum_squares, u128_square(v2));
+	return s2;
+}
+
+static inline struct mean_and_variance_weighted
+mean_and_variance_weighted_update_inlined(struct mean_and_variance_weighted s1, s64 x)
+{
+	struct mean_and_variance_weighted s2;
+	// previous weighted variance.
+	u64 var_w0 = s1.variance;
+	u8 w = s2.w = s1.w;
+	// new value weighted.
+	s64 x_w = x << w;
+	s64 diff_w = x_w - s1.mean;
+	s64 diff = fast_divpow2(diff_w, w);
+	// new mean weighted.
+	s64 u_w1     = s1.mean + diff;
+
+	BUG_ON(w % 2 != 0);
+
+	if (!s1.init) {
+		s2.mean = x_w;
+		s2.variance = 0;
+	} else {
+		s2.mean = u_w1;
+		s2.variance = ((var_w0 << w) - var_w0 + ((diff_w * (x_w - u_w1)) >> w)) >> w;
+	}
+	s2.init = true;
+
+	return s2;
+}
+
+struct mean_and_variance mean_and_variance_update(struct mean_and_variance s1, s64 v1);
+       s64		 mean_and_variance_get_mean(struct mean_and_variance s);
+       u64		 mean_and_variance_get_variance(struct mean_and_variance s1);
+       u32		 mean_and_variance_get_stddev(struct mean_and_variance s);
+
+struct mean_and_variance_weighted mean_and_variance_weighted_update(struct mean_and_variance_weighted s1, s64 v1);
+       s64			  mean_and_variance_weighted_get_mean(struct mean_and_variance_weighted s);
+       u64			  mean_and_variance_weighted_get_variance(struct mean_and_variance_weighted s);
+       u32			  mean_and_variance_weighted_get_stddev(struct mean_and_variance_weighted s);
+
+#endif // MEAN_AND_VARIANCE_H_
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 3dba7a9aff..9ca88e0027 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -2101,6 +2101,15 @@ config CPUMASK_KUNIT_TEST
 
 	  If unsure, say N.
 
+config MEAN_AND_VARIANCE_UNIT_TEST
+	tristate "mean_and_variance unit tests" if !KUNIT_ALL_TESTS
+	depends on KUNIT
+	select MEAN_AND_VARIANCE
+	default KUNIT_ALL_TESTS
+	help
+	  This option enables the kunit tests for the mean_and_variance module.
+	  If unsure, say N.
+
 config TEST_LIST_SORT
 	tristate "Linked list sorting test" if !KUNIT_ALL_TESTS
 	depends on KUNIT
diff --git a/lib/math/Kconfig b/lib/math/Kconfig
index 0634b428d0..7530ae9a35 100644
--- a/lib/math/Kconfig
+++ b/lib/math/Kconfig
@@ -15,3 +15,6 @@ config PRIME_NUMBERS
 
 config RATIONAL
 	tristate
+
+config MEAN_AND_VARIANCE
+	tristate
diff --git a/lib/math/Makefile b/lib/math/Makefile
index bfac26ddfc..2ef1487e01 100644
--- a/lib/math/Makefile
+++ b/lib/math/Makefile
@@ -4,6 +4,8 @@ obj-y += div64.o gcd.o lcm.o int_pow.o int_sqrt.o reciprocal_div.o
 obj-$(CONFIG_CORDIC)		+= cordic.o
 obj-$(CONFIG_PRIME_NUMBERS)	+= prime_numbers.o
 obj-$(CONFIG_RATIONAL)		+= rational.o
+obj-$(CONFIG_MEAN_AND_VARIANCE) += mean_and_variance.o
 
 obj-$(CONFIG_TEST_DIV64)	+= test_div64.o
 obj-$(CONFIG_RATIONAL_KUNIT_TEST) += rational-test.o
+obj-$(CONFIG_MEAN_AND_VARIANCE_UNIT_TEST)   += mean_and_variance_test.o
diff --git a/lib/math/mean_and_variance.c b/lib/math/mean_and_variance.c
new file mode 100644
index 0000000000..6e315d3a13
--- /dev/null
+++ b/lib/math/mean_and_variance.c
@@ -0,0 +1,136 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Functions for incremental mean and variance.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published by
+ * the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ *
+ * Copyright © 2022 Daniel B. Hill
+ *
+ * Author: Daniel B. Hill <[email protected]>
+ *
+ * Description:
+ *
+ * This includes some incremental algorithms for mean and variance calculation.
+ *
+ * Derived from the paper: https://fanf2.user.srcf.net/hermes/doc/antiforgery/stats.pdf
+ *
+ * Create a struct, and if it's the weighted variant, set the w field (weight = 2^k).
+ *
+ * Use mean_and_variance[_weighted]_update() on the struct to update its state.
+ *
+ * Use the mean_and_variance[_weighted]_get_* functions to calculate the mean and variance;
+ * some computation is deferred to these functions for performance reasons.
+ *
+ * see lib/math/mean_and_variance_test.c for examples of usage.
+ *
+ * DO NOT access the mean and variance fields of the weighted variants directly.
+ * DO NOT change the weight after calling update.
+ */
+
+#include <linux/bug.h>
+#include <linux/compiler.h>
+#include <linux/export.h>
+#include <linux/limits.h>
+#include <linux/math.h>
+#include <linux/math64.h>
+#include <linux/mean_and_variance.h>
+#include <linux/module.h>
+
+/**
+ * mean_and_variance_update() - update a mean_and_variance struct @s1 with a new sample @v1
+ * and return it.
+ * @s1: the mean_and_variance to update.
+ * @v1: the new sample.
+ *
+ * see linked pdf equation 12.
+ */
+struct mean_and_variance mean_and_variance_update(struct mean_and_variance s1, s64 v1)
+{
+	return mean_and_variance_update_inlined(s1, v1);
+}
+EXPORT_SYMBOL_GPL(mean_and_variance_update);
+
+/**
+ * mean_and_variance_get_mean() - get mean from @s
+ */
+s64 mean_and_variance_get_mean(struct mean_and_variance s)
+{
+	return div64_u64(s.sum, s.n);
+}
+EXPORT_SYMBOL_GPL(mean_and_variance_get_mean);
+
+/**
+ * mean_and_variance_get_variance() -  get variance from @s1
+ *
+ * see linked pdf equation 12.
+ */
+u64 mean_and_variance_get_variance(struct mean_and_variance s1)
+{
+	u128 s2 = u128_div(s1.sum_squares, s1.n);
+	u64  s3 = abs(mean_and_variance_get_mean(s1));
+
+	return u128_to_u64(u128_sub(s2, u128_square(s3)));
+}
+EXPORT_SYMBOL_GPL(mean_and_variance_get_variance);
+
+/**
+ * mean_and_variance_get_stddev() - get standard deviation from @s
+ */
+u32 mean_and_variance_get_stddev(struct mean_and_variance s)
+{
+	return int_sqrt64(mean_and_variance_get_variance(s));
+}
+EXPORT_SYMBOL_GPL(mean_and_variance_get_stddev);
+
+/**
+ * mean_and_variance_weighted_update() - exponentially weighted variant of mean_and_variance_update()
+ * @s1: the mean_and_variance_weighted to update.
+ * @x:  the new sample.
+ *
+ * see linked pdf: function derived from equations 140-143 where alpha = 2^w.
+ * values are stored bitshifted for performance and added precision.
+ */
+struct mean_and_variance_weighted mean_and_variance_weighted_update(struct mean_and_variance_weighted s1,
+								    s64 x)
+{
+	return mean_and_variance_weighted_update_inlined(s1, x);
+}
+EXPORT_SYMBOL_GPL(mean_and_variance_weighted_update);
+
+/**
+ * mean_and_variance_weighted_get_mean() - get mean from @s
+ */
+s64 mean_and_variance_weighted_get_mean(struct mean_and_variance_weighted s)
+{
+	return fast_divpow2(s.mean, s.w);
+}
+EXPORT_SYMBOL_GPL(mean_and_variance_weighted_get_mean);
+
+/**
+ * mean_and_variance_weighted_get_variance() - get variance from @s
+ */
+u64 mean_and_variance_weighted_get_variance(struct mean_and_variance_weighted s)
+{
+	// always positive, no need for fast_divpow2
+	return s.variance >> s.w;
+}
+EXPORT_SYMBOL_GPL(mean_and_variance_weighted_get_variance);
+
+/**
+ * mean_and_variance_weighted_get_stddev() - get standard deviation from @s
+ */
+u32 mean_and_variance_weighted_get_stddev(struct mean_and_variance_weighted s)
+{
+	return int_sqrt64(mean_and_variance_weighted_get_variance(s));
+}
+EXPORT_SYMBOL_GPL(mean_and_variance_weighted_get_stddev);
+
+MODULE_AUTHOR("Daniel B. Hill");
+MODULE_LICENSE("GPL");
diff --git a/lib/math/mean_and_variance_test.c b/lib/math/mean_and_variance_test.c
new file mode 100644
index 0000000000..79a96d7307
--- /dev/null
+++ b/lib/math/mean_and_variance_test.c
@@ -0,0 +1,155 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <kunit/test.h>
+#include <linux/mean_and_variance.h>
+
+#define MAX_SQR (SQRT_U64_MAX*SQRT_U64_MAX)
+
+static void mean_and_variance_basic_test(struct kunit *test)
+{
+	struct mean_and_variance s = {};
+
+	s = mean_and_variance_update(s, 2);
+	s = mean_and_variance_update(s, 2);
+
+	KUNIT_EXPECT_EQ(test, mean_and_variance_get_mean(s), 2);
+	KUNIT_EXPECT_EQ(test, mean_and_variance_get_variance(s), 0);
+	KUNIT_EXPECT_EQ(test, s.n, 2);
+
+	s = mean_and_variance_update(s, 4);
+	s = mean_and_variance_update(s, 4);
+
+	KUNIT_EXPECT_EQ(test, mean_and_variance_get_mean(s), 3);
+	KUNIT_EXPECT_EQ(test, mean_and_variance_get_variance(s), 1);
+	KUNIT_EXPECT_EQ(test, s.n, 4);
+}
+
+/*
+ * Test values computed using a spreadsheet from the pseudocode at the bottom:
+ * https://fanf2.user.srcf.net/hermes/doc/antiforgery/stats.pdf
+ */
+
+static void mean_and_variance_weighted_test(struct kunit *test)
+{
+	struct mean_and_variance_weighted s = {};
+
+	s.w = 2;
+
+	s = mean_and_variance_weighted_update(s, 10);
+	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_mean(s), 10);
+	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_variance(s), 0);
+
+	s = mean_and_variance_weighted_update(s, 20);
+	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_mean(s), 12);
+	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_variance(s), 18);
+
+	s = mean_and_variance_weighted_update(s, 30);
+	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_mean(s), 16);
+	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_variance(s), 72);
+
+	s = (struct mean_and_variance_weighted){};
+	s.w = 2;
+
+	s = mean_and_variance_weighted_update(s, -10);
+	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_mean(s), -10);
+	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_variance(s), 0);
+
+	s = mean_and_variance_weighted_update(s, -20);
+	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_mean(s), -12);
+	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_variance(s), 18);
+
+	s = mean_and_variance_weighted_update(s, -30);
+	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_mean(s), -16);
+	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_variance(s), 72);
+
+}
+
+static void mean_and_variance_weighted_advanced_test(struct kunit *test)
+{
+	struct mean_and_variance_weighted s = {};
+	s64 i;
+
+	s.w = 8;
+	for (i = 10; i <= 100; i += 10)
+		s = mean_and_variance_weighted_update(s, i);
+
+	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_mean(s), 11);
+	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_variance(s), 107);
+
+	s = (struct mean_and_variance_weighted){};
+
+	s.w = 8;
+	for (i = -10; i >= -100; i -= 10)
+		s = mean_and_variance_weighted_update(s, i);
+
+	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_mean(s), -11);
+	KUNIT_EXPECT_EQ(test, mean_and_variance_weighted_get_variance(s), 107);
+
+}
+
+static void mean_and_variance_fast_divpow2(struct kunit *test)
+{
+	s64 i;
+	u8 d;
+
+	for (i = 0; i < 100; i++) {
+		d = 0;
+		KUNIT_EXPECT_EQ(test, fast_divpow2(i, d), div_u64(i, 1LLU << d));
+		KUNIT_EXPECT_EQ(test, abs(fast_divpow2(-i, d)), div_u64(i, 1LLU << d));
+		for (d = 1; d < 32; d++) {
+			KUNIT_EXPECT_EQ_MSG(test, abs(fast_divpow2(i, d)),
+					    div_u64(i, 1 << d), "%lld %u", i, d);
+			KUNIT_EXPECT_EQ_MSG(test, abs(fast_divpow2(-i, d)),
+					    div_u64(i, 1 << d), "%lld %u", -i, d);
+		}
+	}
+}
+
+static void mean_and_variance_u128_basic_test(struct kunit *test)
+{
+	u128 a = u128_shl64_add(0, U64_MAX);
+	u128 a1 = u128_shl64_add(0, 1);
+	u128 b = u128_shl64_add(1, 0);
+	u128 c = u128_shl64_add(0, 1LLU << 63);
+	u128 c2 = u128_shl64_add(U64_MAX, U64_MAX);
+
+	KUNIT_EXPECT_EQ(test, u128_shr64_to_u64(u128_add(a, a1)), 1);
+	KUNIT_EXPECT_EQ(test, u128_to_u64(u128_add(a, a1)), 0);
+	KUNIT_EXPECT_EQ(test, u128_shr64_to_u64(u128_add(a1, a)), 1);
+	KUNIT_EXPECT_EQ(test, u128_to_u64(u128_add(a1, a)), 0);
+
+	KUNIT_EXPECT_EQ(test, u128_to_u64(u128_sub(b, a1)), U64_MAX);
+	KUNIT_EXPECT_EQ(test, u128_shr64_to_u64(u128_sub(b, a1)), 0);
+
+	KUNIT_EXPECT_EQ(test, u128_shr64_to_u64(u128_shl(c, 1)), 1);
+	KUNIT_EXPECT_EQ(test, u128_to_u64(u128_shl(c, 1)), 0);
+
+	KUNIT_EXPECT_EQ(test, u128_shr64_to_u64(u128_square(U64_MAX)), U64_MAX - 1);
+	KUNIT_EXPECT_EQ(test, u128_to_u64(u128_square(U64_MAX)), 1);
+
+	KUNIT_EXPECT_EQ(test, u128_to_u64(u128_div(b, 2)), 1LLU << 63);
+
+	KUNIT_EXPECT_EQ(test, u128_shr64_to_u64(u128_div(c2, 2)), U64_MAX >> 1);
+	KUNIT_EXPECT_EQ(test, u128_to_u64(u128_div(c2, 2)), U64_MAX);
+
+	KUNIT_EXPECT_EQ(test, u128_shr64_to_u64(u128_div(u128_shl(u64_to_u128(U64_MAX), 32), 2)), U32_MAX >> 1);
+	KUNIT_EXPECT_EQ(test, u128_to_u64(u128_div(u128_shl(u64_to_u128(U64_MAX), 32), 2)), U64_MAX << 31);
+}
+
+static struct kunit_case mean_and_variance_test_cases[] = {
+	KUNIT_CASE(mean_and_variance_fast_divpow2),
+	KUNIT_CASE(mean_and_variance_u128_basic_test),
+	KUNIT_CASE(mean_and_variance_basic_test),
+	KUNIT_CASE(mean_and_variance_weighted_test),
+	KUNIT_CASE(mean_and_variance_weighted_advanced_test),
+	{}
+};
+
+static struct kunit_suite mean_and_variance_test_suite = {
+	.name		= "mean and variance tests",
+	.test_cases	= mean_and_variance_test_cases
+};
+
+kunit_test_suite(mean_and_variance_test_suite);
+
+MODULE_AUTHOR("Daniel B. Hill");
+MODULE_LICENSE("GPL");
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* [PATCH 32/32] MAINTAINERS: Add entry for bcachefs
  2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
                   ` (30 preceding siblings ...)
  2023-05-09 16:56 ` [PATCH 31/32] lib: add mean and variance module Kent Overstreet
@ 2023-05-09 16:56 ` Kent Overstreet
  2023-05-09 21:04   ` Randy Dunlap
  31 siblings, 1 reply; 73+ messages in thread
From: Kent Overstreet @ 2023-05-09 16:56 UTC (permalink / raw)
  To: linux-kernel, linux-fsdevel, linux-bcachefs; +Cc: Kent Overstreet

bcachefs is a new copy-on-write filesystem; add a MAINTAINERS entry for
it.

Signed-off-by: Kent Overstreet <[email protected]>
---
 MAINTAINERS | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index dbf3c33c31..0ac2b432f0 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -3509,6 +3509,13 @@ W:	http://bcache.evilpiepirate.org
 C:	irc://irc.oftc.net/bcache
 F:	drivers/md/bcache/
 
+BCACHEFS:
+M:	Kent Overstreet <[email protected]>
+L:	[email protected]
+S:	Supported
+C:	irc://irc.oftc.net/bcache
+F:	fs/bcachefs/
+
 BDISP ST MEDIA DRIVER
 M:	Fabien Dessenne <[email protected]>
 L:	[email protected]
-- 
2.40.1


^ permalink raw reply related	[flat|nested] 73+ messages in thread

* Re: [PATCH 01/32] Compiler Attributes: add __flatten
  2023-05-09 16:56 ` [PATCH 01/32] Compiler Attributes: add __flatten Kent Overstreet
@ 2023-05-09 17:04   ` Miguel Ojeda
  2023-05-09 17:24     ` Kent Overstreet
  0 siblings, 1 reply; 73+ messages in thread
From: Miguel Ojeda @ 2023-05-09 17:04 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-kernel, linux-fsdevel, linux-bcachefs, Kent Overstreet,
	Miguel Ojeda, Nick Desaulniers

On Tue, May 9, 2023 at 6:57 PM Kent Overstreet
<[email protected]> wrote:
>
> This makes __attribute__((flatten)) available, which is used by
> bcachefs.

We already have it in mainline, so I think it is one less patch you
need to care about! :)

Cheers,
Miguel

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 16/32] MAINTAINERS: Add entry for closures
  2023-05-09 16:56 ` [PATCH 16/32] MAINTAINERS: Add entry for closures Kent Overstreet
@ 2023-05-09 17:05   ` Coly Li
  2023-05-09 21:03   ` Randy Dunlap
  1 sibling, 0 replies; 73+ messages in thread
From: Coly Li @ 2023-05-09 17:05 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-kernel, linux-fsdevel, linux-bcachefs



> On May 10, 2023, at 00:56, Kent Overstreet <[email protected]> wrote:
> 
> closures, from bcache, are async widgets with a variety of uses.
> bcachefs also uses them, so they're being moved to lib/; mark them as
> maintained.
> 
> Signed-off-by: Kent Overstreet <[email protected]>
> Cc: Coly Li <[email protected]>

Acked-by: Coly Li <[email protected]>

Thanks.

Coly Li

> ---
> MAINTAINERS | 8 ++++++++
> 1 file changed, 8 insertions(+)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 3fc37de3d6..5d76169140 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -5044,6 +5044,14 @@ T: git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git timers/core
> F: Documentation/devicetree/bindings/timer/
> F: drivers/clocksource/
> 
> +CLOSURES:
> +M: Kent Overstreet <[email protected]>
> +L: [email protected]
> +S: Supported
> +C: irc://irc.oftc.net/bcache
> +F: include/linux/closure.h
> +F: lib/closure.c
> +
> CMPC ACPI DRIVER
> M: Thadeu Lima de Souza Cascardo <[email protected]>
> M: Daniel Oliveira Nascimento <[email protected]>
> -- 
> 2.40.1
> 


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 01/32] Compiler Attributes: add __flatten
  2023-05-09 17:04   ` Miguel Ojeda
@ 2023-05-09 17:24     ` Kent Overstreet
  0 siblings, 0 replies; 73+ messages in thread
From: Kent Overstreet @ 2023-05-09 17:24 UTC (permalink / raw)
  To: Miguel Ojeda
  Cc: linux-kernel, linux-fsdevel, linux-bcachefs, Kent Overstreet,
	Miguel Ojeda, Nick Desaulniers

On Tue, May 09, 2023 at 07:04:43PM +0200, Miguel Ojeda wrote:
> On Tue, May 9, 2023 at 6:57 PM Kent Overstreet
> <[email protected]> wrote:
> >
> > This makes __attribute__((flatten)) available, which is used by
> > bcachefs.
> 
> We already have it in mainline, so I think it is one less patch you
> need to care about! :)
> 
> Cheers,
> Miguel

Wonderful :)

Cheers,
Kent

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-09 16:56 ` [PATCH 07/32] mm: Bring back vmalloc_exec Kent Overstreet
@ 2023-05-09 18:19   ` Lorenzo Stoakes
  2023-05-09 20:15     ` Kent Overstreet
  2023-05-09 20:46   ` Christoph Hellwig
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 73+ messages in thread
From: Lorenzo Stoakes @ 2023-05-09 18:19 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-kernel, linux-fsdevel, linux-bcachefs, Kent Overstreet,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig, linux-mm

On Tue, May 09, 2023 at 12:56:32PM -0400, Kent Overstreet wrote:
> From: Kent Overstreet <[email protected]>
>
> This is needed for bcachefs, which dynamically generates per-btree node
> unpack functions.

Small nits -

Would be good to refer to the original patch that removed it,
i.e. 7a0e27b2a0ce ("mm: remove vmalloc_exec") something like 'patch
... folded vmalloc_exec() into its one user, however bcachefs requires this
as well so revert'.

Would also be good to mention that you are now exporting the function which
the original didn't appear to do.

>
> Signed-off-by: Kent Overstreet <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Uladzislau Rezki <[email protected]>
> Cc: Christoph Hellwig <[email protected]>
> Cc: [email protected]

Another nit: I'm a vmalloc reviewer so would be good to get cc'd too :)
(forgivable mistake as very recent change!)

> ---
>  include/linux/vmalloc.h |  1 +
>  kernel/module/main.c    |  4 +---
>  mm/nommu.c              | 18 ++++++++++++++++++
>  mm/vmalloc.c            | 21 +++++++++++++++++++++
>  4 files changed, 41 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> index 69250efa03..ff147fe115 100644
> --- a/include/linux/vmalloc.h
> +++ b/include/linux/vmalloc.h
> @@ -145,6 +145,7 @@ extern void *vzalloc(unsigned long size) __alloc_size(1);
>  extern void *vmalloc_user(unsigned long size) __alloc_size(1);
>  extern void *vmalloc_node(unsigned long size, int node) __alloc_size(1);
>  extern void *vzalloc_node(unsigned long size, int node) __alloc_size(1);
> +extern void *vmalloc_exec(unsigned long size, gfp_t gfp_mask) __alloc_size(1);
>  extern void *vmalloc_32(unsigned long size) __alloc_size(1);
>  extern void *vmalloc_32_user(unsigned long size) __alloc_size(1);
>  extern void *__vmalloc(unsigned long size, gfp_t gfp_mask) __alloc_size(1);
> diff --git a/kernel/module/main.c b/kernel/module/main.c
> index d3be89de70..9eaa89e84c 100644
> --- a/kernel/module/main.c
> +++ b/kernel/module/main.c
> @@ -1607,9 +1607,7 @@ static void dynamic_debug_remove(struct module *mod, struct _ddebug_info *dyndbg
>
>  void * __weak module_alloc(unsigned long size)
>  {
> -	return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
> -			GFP_KERNEL, PAGE_KERNEL_EXEC, VM_FLUSH_RESET_PERMS,
> -			NUMA_NO_NODE, __builtin_return_address(0));
> +	return vmalloc_exec(size, GFP_KERNEL);
>  }
>
>  bool __weak module_init_section(const char *name)
> diff --git a/mm/nommu.c b/mm/nommu.c
> index 57ba243c6a..8d9ab19e39 100644
> --- a/mm/nommu.c
> +++ b/mm/nommu.c
> @@ -280,6 +280,24 @@ void *vzalloc_node(unsigned long size, int node)
>  }
>  EXPORT_SYMBOL(vzalloc_node);
>
> +/**
> + *	vmalloc_exec  -  allocate virtually contiguous, executable memory
> + *	@size:		allocation size
> + *
> + *	Kernel-internal function to allocate enough pages to cover @size
> + *	the page level allocator and map them into contiguous and
> + *	executable kernel virtual space.
> + *
> + *	For tight control over page level allocator and protection flags
> + *	use __vmalloc() instead.
> + */
> +
> +void *vmalloc_exec(unsigned long size, gfp_t gfp_mask)
> +{
> +	return __vmalloc(size, gfp_mask);
> +}
> +EXPORT_SYMBOL_GPL(vmalloc_exec);
> +
>  /**
>   * vmalloc_32  -  allocate virtually contiguous memory (32bit addressable)
>   *	@size:		allocation size
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 31ff782d36..2ebb9ea7f0 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3401,6 +3401,27 @@ void *vzalloc_node(unsigned long size, int node)
>  }
>  EXPORT_SYMBOL(vzalloc_node);
>
> +/**
> + * vmalloc_exec - allocate virtually contiguous, executable memory
> + * @size:	  allocation size
> + *
> + * Kernel-internal function to allocate enough pages to cover @size
> + * the page level allocator and map them into contiguous and
> + * executable kernel virtual space.
> + *
> + * For tight control over page level allocator and protection flags
> + * use __vmalloc() instead.
> + *
> + * Return: pointer to the allocated memory or %NULL on error
> + */
> +void *vmalloc_exec(unsigned long size, gfp_t gfp_mask)
> +{
> +	return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
> +			gfp_mask, PAGE_KERNEL_EXEC, VM_FLUSH_RESET_PERMS,
> +			NUMA_NO_NODE, __builtin_return_address(0));
> +}
> +EXPORT_SYMBOL_GPL(vmalloc_exec);
> +
>  #if defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA32)
>  #define GFP_VMALLOC32 (GFP_DMA32 | GFP_KERNEL)
>  #elif defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA)
> --
> 2.40.1
>

Otherwise lgtm, feel free to add:

Acked-by: Lorenzo Stoakes <[email protected]>

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 02/32] locking/lockdep: lock_class_is_held()
  2023-05-09 16:56 ` [PATCH 02/32] locking/lockdep: lock_class_is_held() Kent Overstreet
@ 2023-05-09 19:30   ` Peter Zijlstra
  2023-05-09 20:11     ` Kent Overstreet
  0 siblings, 1 reply; 73+ messages in thread
From: Peter Zijlstra @ 2023-05-09 19:30 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-kernel, linux-fsdevel, linux-bcachefs, Kent Overstreet,
	Ingo Molnar, Will Deacon, Waiman Long, Boqun Feng

On Tue, May 09, 2023 at 12:56:27PM -0400, Kent Overstreet wrote:
> From: Kent Overstreet <[email protected]>
> 
> This patch adds lock_class_is_held(), which can be used to assert that a
> particular type of lock is not held.

How is lock_is_held_type() not sufficient? Which is what's used to
implement lockdep_assert_held*().


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 03/32] locking/lockdep: lockdep_set_no_check_recursion()
  2023-05-09 16:56 ` [PATCH 03/32] locking/lockdep: lockdep_set_no_check_recursion() Kent Overstreet
@ 2023-05-09 19:31   ` Peter Zijlstra
  2023-05-09 19:57     ` Kent Overstreet
  2023-05-09 20:18     ` Kent Overstreet
  0 siblings, 2 replies; 73+ messages in thread
From: Peter Zijlstra @ 2023-05-09 19:31 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-kernel, linux-fsdevel, linux-bcachefs, Ingo Molnar,
	Will Deacon, Waiman Long, Boqun Feng

On Tue, May 09, 2023 at 12:56:28PM -0400, Kent Overstreet wrote:
> This adds a method to tell lockdep not to check lock ordering within a
> lock class - but to still check lock ordering w.r.t. other lock types.
> 
> This is for bcachefs, where for btree node locks we have our own
> deadlock avoidance strategy w.r.t. other btree node locks (cycle
> detection), but we still want lockdep to check lock ordering w.r.t.
> other lock types.
> 

ISTR you had a much nicer version of this where you gave a custom order
function -- what happened to that?

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 03/32] locking/lockdep: lockdep_set_no_check_recursion()
  2023-05-09 19:31   ` Peter Zijlstra
@ 2023-05-09 19:57     ` Kent Overstreet
  2023-05-09 20:18     ` Kent Overstreet
  1 sibling, 0 replies; 73+ messages in thread
From: Kent Overstreet @ 2023-05-09 19:57 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-fsdevel, linux-bcachefs, Ingo Molnar,
	Will Deacon, Waiman Long, Boqun Feng

On Tue, May 09, 2023 at 09:31:47PM +0200, Peter Zijlstra wrote:
> On Tue, May 09, 2023 at 12:56:28PM -0400, Kent Overstreet wrote:
> > This adds a method to tell lockdep not to check lock ordering within a
> > lock class - but to still check lock ordering w.r.t. other lock types.
> > 
> > This is for bcachefs, where for btree node locks we have our own
> > deadlock avoidance strategy w.r.t. other btree node locks (cycle
> > detection), but we still want lockdep to check lock ordering w.r.t.
> > other lock types.
> > 
> 
> ISTR you had a much nicer version of this where you gave a custom order
> function -- what happened to that?

Probably in the other branch that I was meaning to re-mail you separately,
clearly I hadn't pulled the latest versions back into here... expect
that shortly :)

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 02/32] locking/lockdep: lock_class_is_held()
  2023-05-09 19:30   ` Peter Zijlstra
@ 2023-05-09 20:11     ` Kent Overstreet
  0 siblings, 0 replies; 73+ messages in thread
From: Kent Overstreet @ 2023-05-09 20:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-fsdevel, linux-bcachefs, Kent Overstreet,
	Ingo Molnar, Will Deacon, Waiman Long, Boqun Feng

On Tue, May 09, 2023 at 09:30:39PM +0200, Peter Zijlstra wrote:
> On Tue, May 09, 2023 at 12:56:27PM -0400, Kent Overstreet wrote:
> > From: Kent Overstreet <[email protected]>
> > 
> > This patch adds lock_class_is_held(), which can be used to assert that a
> > particular type of lock is not held.
> 
> How is lock_is_held_type() not sufficient? Which is what's used to
> implement lockdep_assert_held*().

I should've looked at that before - it returns a tristate, so it's
closer than I thought, but this is used in contexts where we don't have
a lock or lockdep_map to pass and need to pass the lock_class_key
instead.

e.g., when initializing a btree_trans, or waiting on btree node IO, we
need to assert that no btree node locks are held.
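
Roughly, as a sketch (assuming the patch 02 interface takes a struct
lock_class_key pointer; the key name below is only illustrative):

	static struct lock_class_key bch2_btree_node_lock_key;

	/* e.g. at btree_trans init: assert no btree node locks are held */
	BUG_ON(lock_class_is_held(&bch2_btree_node_lock_key));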

Looking at the code, __lock_is_held() -> match_held_lock() has to care
about a bunch of stuff related to subclasses that doesn't seem relevant
to lock_class_is_held() - lock_class_is_held() is practically no code in
comparison, so I'm inclined to think they should just be separate.

But I'm not the lockdep expert :) Thoughts?

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-09 18:19   ` Lorenzo Stoakes
@ 2023-05-09 20:15     ` Kent Overstreet
  0 siblings, 0 replies; 73+ messages in thread
From: Kent Overstreet @ 2023-05-09 20:15 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: linux-kernel, linux-fsdevel, linux-bcachefs, Kent Overstreet,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig, linux-mm

On Tue, May 09, 2023 at 11:19:38AM -0700, Lorenzo Stoakes wrote:
> On Tue, May 09, 2023 at 12:56:32PM -0400, Kent Overstreet wrote:
> > From: Kent Overstreet <[email protected]>
> >
> > This is needed for bcachefs, which dynamically generates per-btree node
> > unpack functions.
> 
> Small nits -
> 
> Would be good to refer to the original patch that removed it,
> i.e. 7a0e27b2a0ce ("mm: remove vmalloc_exec") something like 'patch
> ... folded vmalloc_exec() into its one user, however bcachefs requires this
> as well so revert'.
> 
> Would also be good to mention that you are now exporting the function which
> the original didn't appear to do.
> 
> >
> > Signed-off-by: Kent Overstreet <[email protected]>
> > Cc: Andrew Morton <[email protected]>
> > Cc: Uladzislau Rezki <[email protected]>
> > Cc: Christoph Hellwig <[email protected]>
> > Cc: [email protected]
> 
> Another nit: I'm a vmalloc reviewer so would be good to get cc'd too :)
> (forgivable mistake as very recent change!)

Thanks - folded your suggestions into the commit message, and added you
for the next posting :)

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 03/32] locking/lockdep: lockdep_set_no_check_recursion()
  2023-05-09 19:31   ` Peter Zijlstra
  2023-05-09 19:57     ` Kent Overstreet
@ 2023-05-09 20:18     ` Kent Overstreet
  2023-05-09 20:27       ` Waiman Long
  2023-05-10  8:59       ` Peter Zijlstra
  1 sibling, 2 replies; 73+ messages in thread
From: Kent Overstreet @ 2023-05-09 20:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-fsdevel, linux-bcachefs, Ingo Molnar,
	Will Deacon, Waiman Long, Boqun Feng

On Tue, May 09, 2023 at 09:31:47PM +0200, Peter Zijlstra wrote:
> On Tue, May 09, 2023 at 12:56:28PM -0400, Kent Overstreet wrote:
> > This adds a method to tell lockdep not to check lock ordering within a
> > lock class - but to still check lock ordering w.r.t. other lock types.
> > 
> > This is for bcachefs, where for btree node locks we have our own
> > deadlock avoidance strategy w.r.t. other btree node locks (cycle
> > detection), but we still want lockdep to check lock ordering w.r.t.
> > other lock types.
> > 
> 
> ISTR you had a much nicer version of this where you gave a custom order
> function -- what happened to that?

Actually, I spoke too soon; this patch and the other series with the
comparison function solve different problems.

For bcachefs btree node locks, we don't have a defined lock ordering at
all - we do full runtime cycle detection, so we don't want lockdep
checking for self deadlock because we're handling that but we _do_ want
lockdep checking lock ordering of btree node locks w.r.t. other locks in
the system.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 03/32] locking/lockdep: lockdep_set_no_check_recursion()
  2023-05-09 20:18     ` Kent Overstreet
@ 2023-05-09 20:27       ` Waiman Long
  2023-05-09 20:35         ` Kent Overstreet
  2023-05-10  8:59       ` Peter Zijlstra
  1 sibling, 1 reply; 73+ messages in thread
From: Waiman Long @ 2023-05-09 20:27 UTC (permalink / raw)
  To: Kent Overstreet, Peter Zijlstra
  Cc: linux-kernel, linux-fsdevel, linux-bcachefs, Ingo Molnar,
	Will Deacon, Boqun Feng


On 5/9/23 16:18, Kent Overstreet wrote:
> On Tue, May 09, 2023 at 09:31:47PM +0200, Peter Zijlstra wrote:
>> On Tue, May 09, 2023 at 12:56:28PM -0400, Kent Overstreet wrote:
>>> This adds a method to tell lockdep not to check lock ordering within a
>>> lock class - but to still check lock ordering w.r.t. other lock types.
>>>
>>> This is for bcachefs, where for btree node locks we have our own
>>> deadlock avoidance strategy w.r.t. other btree node locks (cycle
>>> detection), but we still want lockdep to check lock ordering w.r.t.
>>> other lock types.
>>>
>> ISTR you had a much nicer version of this where you gave a custom order
>> function -- what happened to that?
> Actually, I spoke too soon; this patch and the other series with the
> comparison function solve different problems.
>
> For bcachefs btree node locks, we don't have a defined lock ordering at
> all - we do full runtime cycle detection, so we don't want lockdep
> checking for self deadlock because we're handling that but we _do_ want
> lockdep checking lock ordering of btree node locks w.r.t. other locks in
> the system.

Maybe you can use lock_set_novalidate_class() instead.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 03/32] locking/lockdep: lockdep_set_no_check_recursion()
  2023-05-09 20:27       ` Waiman Long
@ 2023-05-09 20:35         ` Kent Overstreet
  2023-05-09 21:37           ` Waiman Long
  0 siblings, 1 reply; 73+ messages in thread
From: Kent Overstreet @ 2023-05-09 20:35 UTC (permalink / raw)
  To: Waiman Long
  Cc: Peter Zijlstra, linux-kernel, linux-fsdevel, linux-bcachefs,
	Ingo Molnar, Will Deacon, Boqun Feng

On Tue, May 09, 2023 at 04:27:46PM -0400, Waiman Long wrote:
> 
> On 5/9/23 16:18, Kent Overstreet wrote:
> > On Tue, May 09, 2023 at 09:31:47PM +0200, Peter Zijlstra wrote:
> > > On Tue, May 09, 2023 at 12:56:28PM -0400, Kent Overstreet wrote:
> > > > This adds a method to tell lockdep not to check lock ordering within a
> > > > lock class - but to still check lock ordering w.r.t. other lock types.
> > > > 
> > > > This is for bcachefs, where for btree node locks we have our own
> > > > deadlock avoidance strategy w.r.t. other btree node locks (cycle
> > > > detection), but we still want lockdep to check lock ordering w.r.t.
> > > > other lock types.
> > > > 
> > > ISTR you had a much nicer version of this where you gave a custom order
> > > function -- what happened to that?
> > Actually, I spoke too soon; this patch and the other series with the
> > comparison function solve different problems.
> > 
> > For bcachefs btree node locks, we don't have a defined lock ordering at
> > all - we do full runtime cycle detection, so we don't want lockdep
> > checking for self deadlock because we're handling that but we _do_ want
> > lockdep checking lock ordering of btree node locks w.r.t. other locks in
> > the system.
> 
> Maybe you can use lock_set_novalidate_class() instead.

No, we want that to go away, this is the replacement.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-09 16:56 ` [PATCH 07/32] mm: Bring back vmalloc_exec Kent Overstreet
  2023-05-09 18:19   ` Lorenzo Stoakes
@ 2023-05-09 20:46   ` Christoph Hellwig
  2023-05-09 21:12     ` Lorenzo Stoakes
  2023-05-10 14:18   ` Christophe Leroy
  2023-05-10 15:05   ` Johannes Thumshirn
  3 siblings, 1 reply; 73+ messages in thread
From: Christoph Hellwig @ 2023-05-09 20:46 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-kernel, linux-fsdevel, linux-bcachefs, Kent Overstreet,
	Andrew Morton, Uladzislau Rezki, Christoph Hellwig, linux-mm

On Tue, May 09, 2023 at 12:56:32PM -0400, Kent Overstreet wrote:
> From: Kent Overstreet <[email protected]>
> 
> This is needed for bcachefs, which dynamically generates per-btree node
> unpack functions.

No, we will never add back a way for random code to allocate executable
memory in kernel space.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 16/32] MAINTAINERS: Add entry for closures
  2023-05-09 16:56 ` [PATCH 16/32] MAINTAINERS: Add entry for closures Kent Overstreet
  2023-05-09 17:05   ` Coly Li
@ 2023-05-09 21:03   ` Randy Dunlap
  1 sibling, 0 replies; 73+ messages in thread
From: Randy Dunlap @ 2023-05-09 21:03 UTC (permalink / raw)
  To: Kent Overstreet, linux-kernel, linux-fsdevel, linux-bcachefs; +Cc: Coly Li



On 5/9/23 09:56, Kent Overstreet wrote:
> closures, from bcache, are async widgets with a variety of uses.
> bcachefs also uses them, so they're being moved to lib/; mark them as
> maintained.
> 
> Signed-off-by: Kent Overstreet <[email protected]>
> Cc: Coly Li <[email protected]>
> ---
>  MAINTAINERS | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 3fc37de3d6..5d76169140 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -5044,6 +5044,14 @@ T:	git git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git timers/core
>  F:	Documentation/devicetree/bindings/timer/
>  F:	drivers/clocksource/
>  
> +CLOSURES:

No colon at the end of the line.

> +M:	Kent Overstreet <[email protected]>
> +L:	[email protected]
> +S:	Supported
> +C:	irc://irc.oftc.net/bcache
> +F:	include/linux/closure.h
> +F:	lib/closure.c
> +
>  CMPC ACPI DRIVER
>  M:	Thadeu Lima de Souza Cascardo <[email protected]>
>  M:	Daniel Oliveira Nascimento <[email protected]>

-- 
~Randy

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 24/32] MAINTAINERS: Add entry for generic-radix-tree
  2023-05-09 16:56 ` [PATCH 24/32] MAINTAINERS: Add entry for generic-radix-tree Kent Overstreet
@ 2023-05-09 21:03   ` Randy Dunlap
  0 siblings, 0 replies; 73+ messages in thread
From: Randy Dunlap @ 2023-05-09 21:03 UTC (permalink / raw)
  To: Kent Overstreet, linux-kernel, linux-fsdevel, linux-bcachefs



On 5/9/23 09:56, Kent Overstreet wrote:
> lib/generic-radix-tree.c is a simple radix tree that supports storing
> arbitrary types. Add a maintainers entry for it.
> 
> Signed-off-by: Kent Overstreet <[email protected]>
> ---
>  MAINTAINERS | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 5d76169140..c550f5909e 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -8615,6 +8615,13 @@ F:	Documentation/devicetree/bindings/power/power?domain*
>  F:	drivers/base/power/domain*.c
>  F:	include/linux/pm_domain.h
>  
> +GENERIC RADIX TREE:

No colon at the end of the line.

> +M:	Kent Overstreet <[email protected]>
> +S:	Supported
> +C:	irc://irc.oftc.net/bcache
> +F:	include/linux/generic-radix-tree.h
> +F:	lib/generic-radix-tree.c
> +
>  GENERIC RESISTIVE TOUCHSCREEN ADC DRIVER
>  M:	Eugen Hristev <[email protected]>
>  L:	[email protected]

-- 
~Randy

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 32/32] MAINTAINERS: Add entry for bcachefs
  2023-05-09 16:56 ` [PATCH 32/32] MAINTAINERS: Add entry for bcachefs Kent Overstreet
@ 2023-05-09 21:04   ` Randy Dunlap
  2023-05-09 21:07     ` Kent Overstreet
  0 siblings, 1 reply; 73+ messages in thread
From: Randy Dunlap @ 2023-05-09 21:04 UTC (permalink / raw)
  To: Kent Overstreet, linux-kernel, linux-fsdevel, linux-bcachefs



On 5/9/23 09:56, Kent Overstreet wrote:
> bcachefs is a new copy-on-write filesystem; add a MAINTAINERS entry for
> it.
> 
> Signed-off-by: Kent Overstreet <[email protected]>
> ---
>  MAINTAINERS | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index dbf3c33c31..0ac2b432f0 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -3509,6 +3509,13 @@ W:	http://bcache.evilpiepirate.org
>  C:	irc://irc.oftc.net/bcache
>  F:	drivers/md/bcache/
>  
> +BCACHEFS:

No colon at the end of the line.


> +M:	Kent Overstreet <[email protected]>
> +L:	[email protected]
> +S:	Supported
> +C:	irc://irc.oftc.net/bcache
> +F:	fs/bcachefs/
> +
>  BDISP ST MEDIA DRIVER
>  M:	Fabien Dessenne <[email protected]>
>  L:	[email protected]

-- 
~Randy

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 32/32] MAINTAINERS: Add entry for bcachefs
  2023-05-09 21:04   ` Randy Dunlap
@ 2023-05-09 21:07     ` Kent Overstreet
  0 siblings, 0 replies; 73+ messages in thread
From: Kent Overstreet @ 2023-05-09 21:07 UTC (permalink / raw)
  To: Randy Dunlap; +Cc: linux-kernel, linux-fsdevel, linux-bcachefs

On Tue, May 09, 2023 at 02:04:00PM -0700, Randy Dunlap wrote:
> 
> 
> On 5/9/23 09:56, Kent Overstreet wrote:
> > bcachefs is a new copy-on-write filesystem; add a MAINTAINERS entry for
> > it.
> > 
> > Signed-off-by: Kent Overstreet <[email protected]>
> > ---
> >  MAINTAINERS | 7 +++++++
> >  1 file changed, 7 insertions(+)
> > 
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index dbf3c33c31..0ac2b432f0 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -3509,6 +3509,13 @@ W:	http://bcache.evilpiepirate.org
> >  C:	irc://irc.oftc.net/bcache
> >  F:	drivers/md/bcache/
> >  
> > +BCACHEFS:
> 
> No colon at the end of the line.

Thanks, updated.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-09 20:46   ` Christoph Hellwig
@ 2023-05-09 21:12     ` Lorenzo Stoakes
  2023-05-09 21:29       ` Kent Overstreet
  2023-05-09 21:43       ` Darrick J. Wong
  0 siblings, 2 replies; 73+ messages in thread
From: Lorenzo Stoakes @ 2023-05-09 21:12 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Kent Overstreet, linux-kernel, linux-fsdevel, linux-bcachefs,
	Kent Overstreet, Andrew Morton, Uladzislau Rezki, linux-mm

On Tue, May 09, 2023 at 01:46:09PM -0700, Christoph Hellwig wrote:
> On Tue, May 09, 2023 at 12:56:32PM -0400, Kent Overstreet wrote:
> > From: Kent Overstreet <[email protected]>
> >
> > This is needed for bcachefs, which dynamically generates per-btree node
> > unpack functions.
>
> No, we will never add back a way for random code allocating executable
> memory in kernel space.

Yeah I think I glossed over this aspect a bit as it looks ostensibly like simply
reinstating a helper function because the code is now used in more than one
place (at lsf/mm so a little distracted :)

But it being exported is a problem. Perhaps there's another way of achieving the
same aim without having to do so?

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-09 21:12     ` Lorenzo Stoakes
@ 2023-05-09 21:29       ` Kent Overstreet
  2023-05-10  6:48         ` Eric Biggers
  2023-05-10 11:56         ` David Laight
  2023-05-09 21:43       ` Darrick J. Wong
  1 sibling, 2 replies; 73+ messages in thread
From: Kent Overstreet @ 2023-05-09 21:29 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Christoph Hellwig, linux-kernel, linux-fsdevel, linux-bcachefs,
	Kent Overstreet, Andrew Morton, Uladzislau Rezki, linux-mm

On Tue, May 09, 2023 at 02:12:41PM -0700, Lorenzo Stoakes wrote:
> On Tue, May 09, 2023 at 01:46:09PM -0700, Christoph Hellwig wrote:
> > On Tue, May 09, 2023 at 12:56:32PM -0400, Kent Overstreet wrote:
> > > From: Kent Overstreet <[email protected]>
> > >
> > > This is needed for bcachefs, which dynamically generates per-btree node
> > > unpack functions.
> >
> > No, we will never add back a way for random code allocating executable
> > memory in kernel space.
> 
> Yeah I think I glossed over this aspect a bit as it looks ostensibly like simply
> reinstating a helper function because the code is now used in more than one
> place (at lsf/mm so a little distracted :)
> 
> But it being exported is a problem. Perhaps there's another way of acheving the
> same aim without having to do so?

None that I see.

The background is that bcachefs generates a per btree node unpack
function, based on the packed format for that btree node, for unpacking
keys within that node. The unpack function is only ~50 bytes, and for
locality we want it to be located with the btree node's other in-memory
lookup tables so they can be prefetched all at once.

Here's the codegen:

https://evilpiepirate.org/git/bcachefs.git/tree/fs/bcachefs/bkey.c#n727
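
A rough, hand-written C sketch of what one of those generated unpackers
computes, for readers who don't want to chase the link. The field layout below
is made up for illustration (it is not bcachefs's actual key format); the point
is that for a fixed per-node format the unpacker reduces to a few branchless
shifts, masks and adds:

  #include <stdint.h>

  struct unpacked_key {
          uint64_t inode;
          uint64_t offset;
  };

  /*
   * Hypothetical packed layout for one btree node:
   *   bits  0..23 - inode,  stored relative to field_offset[0]
   *   bits 24..63 - offset, stored relative to field_offset[1]
   */
  static struct unpacked_key unpack_key_example(const uint64_t *packed,
                                                const uint64_t *field_offset)
  {
          uint64_t w = packed[0];

          return (struct unpacked_key) {
                  .inode  = field_offset[0] + (w & ((1ULL << 24) - 1)),
                  .offset = field_offset[1] + (w >> 24),
          };
  }

The generated versions do the same thing, but with the shift amounts, masks and
base offsets baked in for the specific format of that node.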

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 03/32] locking/lockdep: lockdep_set_no_check_recursion()
  2023-05-09 20:35         ` Kent Overstreet
@ 2023-05-09 21:37           ` Waiman Long
  0 siblings, 0 replies; 73+ messages in thread
From: Waiman Long @ 2023-05-09 21:37 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Peter Zijlstra, linux-kernel, linux-fsdevel, linux-bcachefs,
	Ingo Molnar, Will Deacon, Boqun Feng

On 5/9/23 16:35, Kent Overstreet wrote:
> On Tue, May 09, 2023 at 04:27:46PM -0400, Waiman Long wrote:
>> On 5/9/23 16:18, Kent Overstreet wrote:
>>> On Tue, May 09, 2023 at 09:31:47PM +0200, Peter Zijlstra wrote:
>>>> On Tue, May 09, 2023 at 12:56:28PM -0400, Kent Overstreet wrote:
>>>>> This adds a method to tell lockdep not to check lock ordering within a
>>>>> lock class - but to still check lock ordering w.r.t. other lock types.
>>>>>
>>>>> This is for bcachefs, where for btree node locks we have our own
>>>>> deadlock avoidance strategy w.r.t. other btree node locks (cycle
>>>>> detection), but we still want lockdep to check lock ordering w.r.t.
>>>>> other lock types.
>>>>>
>>>> ISTR you had a much nicer version of this where you gave a custom order
>>>> function -- what happend to that?
>>> Actually, I spoke too soon; this patch and the other series with the
>>> comparison function solve different problems.
>>>
>>> For bcachefs btree node locks, we don't have a defined lock ordering at
>>> all - we do full runtime cycle detection, so we don't want lockdep
>>> checking for self deadlock because we're handling that but we _do_ want
>>> lockdep checking lock ordering of btree node locks w.r.t. other locks in
>>> the system.
>> Maybe you can use lock_set_novalidate_class() instead.
> No, we want that to go away, this is the replacement.

OK, you can mention that in the commit log then.

Cheers,
Longman


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-09 21:12     ` Lorenzo Stoakes
  2023-05-09 21:29       ` Kent Overstreet
@ 2023-05-09 21:43       ` Darrick J. Wong
  2023-05-09 21:54         ` Kent Overstreet
  1 sibling, 1 reply; 73+ messages in thread
From: Darrick J. Wong @ 2023-05-09 21:43 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Christoph Hellwig, Kent Overstreet, linux-kernel, linux-fsdevel,
	linux-bcachefs, Kent Overstreet, Andrew Morton, Uladzislau Rezki,
	linux-mm

On Tue, May 09, 2023 at 02:12:41PM -0700, Lorenzo Stoakes wrote:
> On Tue, May 09, 2023 at 01:46:09PM -0700, Christoph Hellwig wrote:
> > On Tue, May 09, 2023 at 12:56:32PM -0400, Kent Overstreet wrote:
> > > From: Kent Overstreet <[email protected]>
> > >
> > > This is needed for bcachefs, which dynamically generates per-btree node
> > > unpack functions.
> >
> > No, we will never add back a way for random code allocating executable
> > memory in kernel space.
> 
> Yeah I think I glossed over this aspect a bit as it looks ostensibly like simply
> reinstating a helper function because the code is now used in more than one
> place (at lsf/mm so a little distracted :)
> 
> But it being exported is a problem. Perhaps there's another way of acheving the
> same aim without having to do so?

I already trolled Kent with this on IRC, but for the parts of bcachefs
that want better assembly code than whatever gcc generates from the C
source, could you compile code to BPF and then let the BPF JIT engines
turn that into machine code for you?

(also distracted by LSFMM)

--D

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-09 21:43       ` Darrick J. Wong
@ 2023-05-09 21:54         ` Kent Overstreet
  2023-05-11  5:33           ` Theodore Ts'o
  0 siblings, 1 reply; 73+ messages in thread
From: Kent Overstreet @ 2023-05-09 21:54 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Lorenzo Stoakes, Christoph Hellwig, linux-kernel, linux-fsdevel,
	linux-bcachefs, Kent Overstreet, Andrew Morton, Uladzislau Rezki,
	linux-mm

On Tue, May 09, 2023 at 02:43:19PM -0700, Darrick J. Wong wrote:
> On Tue, May 09, 2023 at 02:12:41PM -0700, Lorenzo Stoakes wrote:
> > On Tue, May 09, 2023 at 01:46:09PM -0700, Christoph Hellwig wrote:
> > > On Tue, May 09, 2023 at 12:56:32PM -0400, Kent Overstreet wrote:
> > > > From: Kent Overstreet <[email protected]>
> > > >
> > > > This is needed for bcachefs, which dynamically generates per-btree node
> > > > unpack functions.
> > >
> > > No, we will never add back a way for random code allocating executable
> > > memory in kernel space.
> > 
> > Yeah I think I glossed over this aspect a bit as it looks ostensibly like simply
> > reinstating a helper function because the code is now used in more than one
> > place (at lsf/mm so a little distracted :)
> > 
> > But it being exported is a problem. Perhaps there's another way of acheving the
> > same aim without having to do so?
> 
> I already trolled Kent with this on IRC, but for the parts of bcachefs
> that want better assembly code than whatever gcc generates from the C
> source, could you compile code to BPF and then let the BPF JIT engines
> turn that into machine code for you?

It's an intriguing idea, but it'd be a _lot_ of work and this is old
code that's never had a single bug - I'm not in a hurry to rewrite it.

And there would still be the issue that we've got lots of little
unpack functions that go with other tables; we can't just burn a full
page per unpack function - that would waste way too much memory - and
if we put them together then we're stuck writing a whole other
allocator - nope. And then we're also mucking with the memory layout
of the data structures used in the very hottest paths in the
filesystem - I'm very wary of introducing performance regressions
there.

I think it'd be much more practical to find some way of making
vmalloc_exec() more palatable. What are the exact concerns?

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping
  2023-05-09 16:56 ` [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping Kent Overstreet
@ 2023-05-10  1:07   ` Jan Kara
  2023-05-10  6:18     ` Kent Overstreet
  0 siblings, 1 reply; 73+ messages in thread
From: Jan Kara @ 2023-05-10  1:07 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-kernel, linux-fsdevel, linux-bcachefs, Kent Overstreet,
	Jan Kara, Darrick J . Wong

On Tue 09-05-23 12:56:31, Kent Overstreet wrote:
> From: Kent Overstreet <[email protected]>
> 
> This is used by bcachefs to fix a page cache coherency issue with
> O_DIRECT writes.
> 
> Also relevant: mapping->invalidate_lock, see below.
> 
> O_DIRECT writes (and other filesystem operations that modify file data
> while bypassing the page cache) need to shoot down ranges of the page
> cache - and additionally, need locking to prevent those pages from
> pulled back in.
> 
> But O_DIRECT writes invoke the page fault handler (via get_user_pages),
> and the page fault handler will need to take that same lock - this is a
> classic recursive deadlock if userspace has mmaped the file they're DIO
> writing to and uses those pages for the buffer to write from, and it's a
> lock ordering deadlock in general.
> 
> Thus we need a way to signal from the dio code to the page fault handler
> when we already are holding the pagecache add lock on an address space -
> this patch just adds a member to task_struct for this purpose. For now
> only bcachefs is implementing this locking, though it may be moved out
> of bcachefs and made available to other filesystems in the future.

It would be nice to have at least a link to the code that's actually using
the field you are adding.

Also I think we were already through this discussion [1] and we ended up
agreeing that your scheme actually solves only the AA deadlock but a
malicious userspace can easily create AB BA deadlock by running direct IO
to file A using mapped file B as a buffer *and* direct IO to file B using
mapped file A as a buffer.

[1] https://lore.kernel.org/all/[email protected]

> ---------------------------------
> 
> The closest current VFS equivalent is mapping->invalidate_lock, which
> comes from XFS. However, it's not used by direct IO.  Instead, direct IO
> paths shoot down the page cache twice - before starting the IO and at
> the end, and they're still technically racy w.r.t. page cache coherency.
> 
> This is a more complete approach: in the future we might consider
> replacing mapping->invalidate_lock with the bcachefs code.

Yes, and this is because we never provided 100% consistent buffered VS
direct IO behavior on the same file exactly because we never found the
complexity worth the usefulness...

								Honza

> 
> Signed-off-by: Kent Overstreet <[email protected]>
> Cc: Jan Kara <[email protected]>
> Cc: Darrick J. Wong <[email protected]>
> Cc: [email protected]
> ---
>  include/linux/sched.h | 1 +
>  init/init_task.c      | 1 +
>  2 files changed, 2 insertions(+)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 63d242164b..f2a56f64f7 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -869,6 +869,7 @@ struct task_struct {
>  
>  	struct mm_struct		*mm;
>  	struct mm_struct		*active_mm;
> +	struct address_space		*faults_disabled_mapping;
>  
>  	int				exit_state;
>  	int				exit_code;
> diff --git a/init/init_task.c b/init/init_task.c
> index ff6c4b9bfe..f703116e05 100644
> --- a/init/init_task.c
> +++ b/init/init_task.c
> @@ -85,6 +85,7 @@ struct task_struct init_task
>  	.nr_cpus_allowed= NR_CPUS,
>  	.mm		= NULL,
>  	.active_mm	= &init_mm,
> +	.faults_disabled_mapping = NULL,
>  	.restart_block	= {
>  		.fn = do_no_restart_syscall,
>  	},
> -- 
> 2.40.1
> 
-- 
Jan Kara <[email protected]>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 15/32] bcache: move closures to lib/
  2023-05-09 16:56 ` [PATCH 15/32] bcache: move closures to lib/ Kent Overstreet
@ 2023-05-10  1:10   ` Randy Dunlap
  0 siblings, 0 replies; 73+ messages in thread
From: Randy Dunlap @ 2023-05-10  1:10 UTC (permalink / raw)
  To: Kent Overstreet, linux-kernel, linux-fsdevel, linux-bcachefs
  Cc: Kent Overstreet, Coly Li



On 5/9/23 09:56, Kent Overstreet wrote:
> diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
> index 39d1d93164..3dba7a9aff 100644
> --- a/lib/Kconfig.debug
> +++ b/lib/Kconfig.debug
> @@ -1618,6 +1618,15 @@ config DEBUG_NOTIFIERS
>  	  This is a relatively cheap check but if you care about maximum
>  	  performance, say N.
>  
> +config DEBUG_CLOSURES
> +	bool "Debug closures (bcache async widgits)"
> +	depends on CLOSURES
> +	select DEBUG_FS
> +	help
> +	Keeps all active closures in a linked list and provides a debugfs
> +	interface to list them, which makes it possible to see asynchronous
> +	operations that get stuck.

According to coding-style.rst, the help text (3 lines above) should be
indented with 2 additional spaces.

> +	help
> +	  Keeps all active closures in a linked list and provides a debugfs
> +	  interface to list them, which makes it possible to see asynchronous
> +	  operations that get stuck.

-- 
~Randy

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 23/32] iov_iter: copy_folio_from_iter_atomic()
  2023-05-09 16:56 ` [PATCH 23/32] iov_iter: copy_folio_from_iter_atomic() Kent Overstreet
@ 2023-05-10  2:20   ` kernel test robot
  2023-05-11  2:08   ` kernel test robot
  1 sibling, 0 replies; 73+ messages in thread
From: kernel test robot @ 2023-05-10  2:20 UTC (permalink / raw)
  To: Kent Overstreet, linux-kernel, linux-fsdevel, linux-bcachefs
  Cc: llvm, oe-kbuild-all, Kent Overstreet, Alexander Viro, Matthew Wilcox

Hi Kent,

kernel test robot noticed the following build warnings:

[auto build test WARNING on tip/locking/core]
[cannot apply to axboe-block/for-next akpm-mm/mm-everything kdave/for-next linus/master v6.4-rc1 next-20230509]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Kent-Overstreet/Compiler-Attributes-add-__flatten/20230510-010302
base:   tip/locking/core
patch link:    https://lore.kernel.org/r/20230509165657.1735798-24-kent.overstreet%40linux.dev
patch subject: [PATCH 23/32] iov_iter: copy_folio_from_iter_atomic()
config: i386-randconfig-a002 (https://download.01.org/0day-ci/archive/20230510/[email protected]/config)
compiler: clang version 14.0.6 (https://github.com/llvm/llvm-project f28c006a5895fc0e329fe15fead81e37457cb1d1)
reproduce (this is a W=1 build):
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # https://github.com/intel-lab-lkp/linux/commit/0e5d4229f5e7671dabba56ea36583b1ca20a9a18
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Kent-Overstreet/Compiler-Attributes-add-__flatten/20230510-010302
        git checkout 0e5d4229f5e7671dabba56ea36583b1ca20a9a18
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=i386 olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=i386 SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <[email protected]>
| Link: https://lore.kernel.org/oe-kbuild-all/[email protected]/

All warnings (new ones prefixed by >>):

>> lib/iov_iter.c:839:16: warning: comparison of distinct pointer types ('typeof (bytes) *' (aka 'unsigned int *') and 'typeof (((1UL) << 12) - (offset & (~(((1UL) << 12) - 1)))) *' (aka 'unsigned long *')) [-Wcompare-distinct-pointer-types]
                   unsigned b = min(bytes, PAGE_SIZE - (offset & PAGE_MASK));
                                ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   include/linux/minmax.h:67:19: note: expanded from macro 'min'
   #define min(x, y)       __careful_cmp(x, y, <)
                           ^~~~~~~~~~~~~~~~~~~~~~
   include/linux/minmax.h:36:24: note: expanded from macro '__careful_cmp'
           __builtin_choose_expr(__safe_cmp(x, y), \
                                 ^~~~~~~~~~~~~~~~
   include/linux/minmax.h:26:4: note: expanded from macro '__safe_cmp'
                   (__typecheck(x, y) && __no_side_effects(x, y))
                    ^~~~~~~~~~~~~~~~~
   include/linux/minmax.h:20:28: note: expanded from macro '__typecheck'
           (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
                      ~~~~~~~~~~~~~~ ^  ~~~~~~~~~~~~~~
   1 warning generated.


vim +839 lib/iov_iter.c

   825	
   826	size_t copy_folio_from_iter_atomic(struct folio *folio, size_t offset,
   827					   size_t bytes, struct iov_iter *i)
   828	{
   829		size_t ret = 0;
   830	
   831		if (WARN_ON(offset + bytes > folio_size(folio)))
   832			return 0;
   833		if (WARN_ON_ONCE(!i->data_source))
   834			return 0;
   835	
   836	#ifdef CONFIG_HIGHMEM
   837		while (bytes) {
   838			struct page *page = folio_page(folio, offset >> PAGE_SHIFT);
 > 839			unsigned b = min(bytes, PAGE_SIZE - (offset & PAGE_MASK));
   840			unsigned r = __copy_page_from_iter_atomic(page, offset, b, i);
   841	
   842			offset	+= r;
   843			bytes	-= r;
   844			ret	+= r;
   845	
   846			if (r != b)
   847				break;
   848		}
   849	#else
   850		ret = __copy_page_from_iter_atomic(&folio->page, offset, bytes, i);
   851	#endif
   852	
   853		return ret;
   854	}
   855	EXPORT_SYMBOL(copy_folio_from_iter_atomic);
   856	
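
The warning is min()'s strict type check firing: bytes is a size_t while
PAGE_SIZE - (offset & PAGE_MASK) evaluates to unsigned long on this config.
The usual fix (sketched here, untested) is to force a common type with
min_t():

  unsigned b = min_t(size_t, bytes, PAGE_SIZE - (offset & PAGE_MASK));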

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 22/32] vfs: inode cache conversion to hash-bl
  2023-05-09 16:56 ` [PATCH 22/32] vfs: inode cache conversion to hash-bl Kent Overstreet
@ 2023-05-10  4:45   ` Dave Chinner
  0 siblings, 0 replies; 73+ messages in thread
From: Dave Chinner @ 2023-05-10  4:45 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-kernel, linux-fsdevel, linux-bcachefs, Dave Chinner,
	Alexander Viro, Christian Brauner

On Tue, May 09, 2023 at 12:56:47PM -0400, Kent Overstreet wrote:
> From: Dave Chinner <[email protected]>
> 
> Because scalability of the global inode_hash_lock really, really
> sucks.
> 
> 32-way concurrent create on a couple of different filesystems
> before:
> 
> -   52.13%     0.04%  [kernel]            [k] ext4_create
>    - 52.09% ext4_create
>       - 41.03% __ext4_new_inode
>          - 29.92% insert_inode_locked
>             - 25.35% _raw_spin_lock
>                - do_raw_spin_lock
>                   - 24.97% __pv_queued_spin_lock_slowpath
> 
> -   72.33%     0.02%  [kernel]            [k] do_filp_open
>    - 72.31% do_filp_open
>       - 72.28% path_openat
>          - 57.03% bch2_create
>             - 56.46% __bch2_create
>                - 40.43% inode_insert5
>                   - 36.07% _raw_spin_lock
>                      - do_raw_spin_lock
>                           35.86% __pv_queued_spin_lock_slowpath
>                     4.02% find_inode
> 
> Convert the inode hash table to a RCU-aware hash-bl table just like
> the dentry cache. Note that we need to store a pointer to the
> hlist_bl_head the inode has been added to in the inode so that when
> it comes to unhash the inode we know what list to lock. We need to
> do this because the hash value that is used to hash the inode is
> generated from the inode itself - filesystems can provide this
> themselves so we have to either store the hash or the head pointer
> in the inode to be able to find the right list head for removal...
> 
> Same workload after:
> 
> Signed-off-by: Dave Chinner <[email protected]>
> Cc: Alexander Viro <[email protected]>
> Cc: Christian Brauner <[email protected]>
> Cc: [email protected]
> Signed-off-by: Kent Overstreet <[email protected]>

I have been maintaining this patchset up to date in my own local trees
and the code in this patch looks the same. The commit message above,
however, has been mangled. The full commit message should be:

vfs: inode cache conversion to hash-bl

Because scalability of the global inode_hash_lock really, really
sucks and prevents me from doing scalability characterisation and
analysis of bcachefs algorithms.

Profiles of a 32-way concurrent create of 51.2m inodes with fsmark
on a couple of different filesystems on a 5.10 kernel:

-   52.13%     0.04%  [kernel]            [k] ext4_create
   - 52.09% ext4_create
      - 41.03% __ext4_new_inode
         - 29.92% insert_inode_locked
            - 25.35% _raw_spin_lock
               - do_raw_spin_lock
                  - 24.97% __pv_queued_spin_lock_slowpath


-   72.33%     0.02%  [kernel]            [k] do_filp_open
   - 72.31% do_filp_open
      - 72.28% path_openat
         - 57.03% bch2_create
            - 56.46% __bch2_create
               - 40.43% inode_insert5
                  - 36.07% _raw_spin_lock
                     - do_raw_spin_lock
                          35.86% __pv_queued_spin_lock_slowpath
                    4.02% find_inode

btrfs was tested but it is limited by internal lock contention at
>=2 threads on this workload, so never hammers the inode cache lock
hard enough for this change to matter to its performance.

However, both bcachefs and ext4 demonstrate poor scaling at >=8
threads on concurrent lookup or create workloads.

Hence convert the inode hash table to a RCU-aware hash-bl table just
like the dentry cache. Note that we need to store a pointer to the
hlist_bl_head the inode has been added to in the inode so that when
it comes to unhash the inode we know what list to lock. We need to
do this because, unlike the dentry cache, the hash value that is
used to hash the inode is not generated from the inode itself. i.e.
filesystems can provide this themselves so we have to either store
the hashval or the hlist head pointer in the inode to be able to
find the right list head for removal...

Concurrent create with varying thread count (files/s):

                ext4                    bcachefs
threads         vanilla  patched        vanilla patched
2               117k     112k            80k     85k
4               185k     190k           133k    145k
8               303k     346k           185k    255k
16              389k     465k           190k    420k
32              360k     437k           142k    481k

CPU usage for both bcachefs and ext4 at 16 and 32 threads has been
halved on the patched kernel, while performance has increased
marginally on ext4 and massively on bcachefs. Internal filesystem
algorithms now limit performance on these workloads, not the global
inode_hash_lock.

Profile of the workloads on the patched kernels:

-   35.94%     0.07%  [kernel]                  [k] ext4_create
   - 35.87% ext4_create
      - 20.45% __ext4_new_inode
...
           3.36% insert_inode_locked

   - 78.43% do_filp_open
      - 78.36% path_openat
         - 53.95% bch2_create
            - 47.99% __bch2_create
....
              - 7.57% inode_insert5
                    6.94% find_inode

Spinlock contention is largely gone from the inode hash operations
and the filesystems are limited by contention in their internal
algorithms.

Signed-off-by: Dave Chinner <[email protected]>
---

Other than that, the diffstat is the same and I don't see any obvious
differences in the code compared to what I've been running locally.

-Dave.
-- 
Dave Chinner
[email protected]
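
For readers skimming the thread, a rough sketch of the structural idea in the
patch (this is not the actual fs/inode.c code, just an illustration of why the
inode needs to remember its hash bucket): each bucket of the hash-bl table is
locked by a bit spinlock in the bucket head, so removal has to know which
bucket to lock, and the hash value may have come from filesystem-private state
that isn't available at unhash time.

  #include <linux/list_bl.h>
  #include <linux/rculist_bl.h>

  struct inode_example {
          struct hlist_bl_node    i_hash;
          struct hlist_bl_head    *i_hash_head;   /* bucket we were added to */
  };

  static void hash_insert_example(struct inode_example *inode,
                                  struct hlist_bl_head *b)
  {
          hlist_bl_lock(b);                       /* per-bucket bit spinlock */
          hlist_bl_add_head_rcu(&inode->i_hash, b);
          inode->i_hash_head = b;
          hlist_bl_unlock(b);
  }

  static void hash_remove_example(struct inode_example *inode)
  {
          struct hlist_bl_head *b = inode->i_hash_head;

          hlist_bl_lock(b);
          hlist_bl_del_init_rcu(&inode->i_hash);
          inode->i_hash_head = NULL;
          hlist_bl_unlock(b);
  }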

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 21/32] hlist-bl: add hlist_bl_fake()
  2023-05-09 16:56 ` [PATCH 21/32] hlist-bl: add hlist_bl_fake() Kent Overstreet
@ 2023-05-10  4:48   ` Dave Chinner
  0 siblings, 0 replies; 73+ messages in thread
From: Dave Chinner @ 2023-05-10  4:48 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-kernel, linux-fsdevel, linux-bcachefs, Dave Chinner

On Tue, May 09, 2023 at 12:56:46PM -0400, Kent Overstreet wrote:
> From: Dave Chinner <[email protected]>
> 
> in preparation for switching the VFS inode cache over the hlist_bl
  In

> lists, we nee dto be able to fake a list node that looks like it is
            need to

> hased for correct operation of filesystems that don't directly use
  hashed

> the VFS indoe cache.
          inode cache hash index.

-Dave.

-- 
Dave Chinner
[email protected]

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping
  2023-05-10  1:07   ` Jan Kara
@ 2023-05-10  6:18     ` Kent Overstreet
  0 siblings, 0 replies; 73+ messages in thread
From: Kent Overstreet @ 2023-05-10  6:18 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-kernel, linux-fsdevel, linux-bcachefs, Kent Overstreet,
	Darrick J . Wong, dhowells

On Wed, May 10, 2023 at 03:07:37AM +0200, Jan Kara wrote:
> On Tue 09-05-23 12:56:31, Kent Overstreet wrote:
> > From: Kent Overstreet <[email protected]>
> > 
> > This is used by bcachefs to fix a page cache coherency issue with
> > O_DIRECT writes.
> > 
> > Also relevant: mapping->invalidate_lock, see below.
> > 
> > O_DIRECT writes (and other filesystem operations that modify file data
> > while bypassing the page cache) need to shoot down ranges of the page
> > cache - and additionally, need locking to prevent those pages from
> > pulled back in.
> > 
> > But O_DIRECT writes invoke the page fault handler (via get_user_pages),
> > and the page fault handler will need to take that same lock - this is a
> > classic recursive deadlock if userspace has mmaped the file they're DIO
> > writing to and uses those pages for the buffer to write from, and it's a
> > lock ordering deadlock in general.
> > 
> > Thus we need a way to signal from the dio code to the page fault handler
> > when we already are holding the pagecache add lock on an address space -
> > this patch just adds a member to task_struct for this purpose. For now
> > only bcachefs is implementing this locking, though it may be moved out
> > of bcachefs and made available to other filesystems in the future.
> 
> It would be nice to have at least a link to the code that's actually using
> the field you are adding.

Bit of a trick to link to a _later_ patch in the series from a commit
message, but...

https://evilpiepirate.org/git/bcachefs.git/tree/fs/bcachefs/fs-io.c#n975
https://evilpiepirate.org/git/bcachefs.git/tree/fs/bcachefs/fs-io.c#n2454

> Also I think we were already through this discussion [1] and we ended up
> agreeing that your scheme actually solves only the AA deadlock but a
> malicious userspace can easily create AB BA deadlock by running direct IO
> to file A using mapped file B as a buffer *and* direct IO to file B using
> mapped file A as a buffer.

No, that's definitely handled (and you can see it in the code I linked),
and I wrote a torture test for fstests as well.

David Howells was also just running into a strange locking situation with
iov_iters and recursive gups - I don't recall all the details, but it
sounded like this might be a solution for that. David, did you have
thoughts on that?
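
To make the scheme concrete, a rough sketch of the idea described in the commit
message - the lock helpers below are illustrative stand-ins, and the real
bcachefs code is at the two links above:

  /* DIO write path: take the pagecache-add lock, then note it in the task. */
  static ssize_t dio_write_sketch(struct kiocb *iocb, struct iov_iter *iter)
  {
          struct address_space *mapping = iocb->ki_filp->f_mapping;
          ssize_t ret;

          pagecache_block_get(mapping);               /* illustrative lock */
          current->faults_disabled_mapping = mapping; /* new task_struct field */

          ret = do_dio_write(iocb, iter);             /* gup may fault on the user buffer */

          current->faults_disabled_mapping = NULL;
          pagecache_block_put(mapping);
          return ret;
  }

  /* Page fault path: check whether this task already holds that lock. */
  static vm_fault_t page_fault_sketch(struct vm_fault *vmf)
  {
          struct address_space *mapping = vmf->vma->vm_file->f_mapping;

          if (current->faults_disabled_mapping == mapping) {
                  /*
                   * Recursive fault into the mapping we're already doing DIO
                   * to: handle it without retaking the lock (or fail the
                   * fault), instead of deadlocking.
                   */
                  return VM_FAULT_SIGBUS;
          }

          pagecache_add_get(mapping);                 /* illustrative lock */
          /* ... fill and map the page as usual ... */
          pagecache_add_put(mapping);
          return 0;
  }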

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-09 21:29       ` Kent Overstreet
@ 2023-05-10  6:48         ` Eric Biggers
  2023-05-10 11:56         ` David Laight
  1 sibling, 0 replies; 73+ messages in thread
From: Eric Biggers @ 2023-05-10  6:48 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Lorenzo Stoakes, Christoph Hellwig, linux-kernel, linux-fsdevel,
	linux-bcachefs, Kent Overstreet, Andrew Morton, Uladzislau Rezki,
	linux-mm

On Tue, May 09, 2023 at 05:29:10PM -0400, Kent Overstreet wrote:
> On Tue, May 09, 2023 at 02:12:41PM -0700, Lorenzo Stoakes wrote:
> > On Tue, May 09, 2023 at 01:46:09PM -0700, Christoph Hellwig wrote:
> > > On Tue, May 09, 2023 at 12:56:32PM -0400, Kent Overstreet wrote:
> > > > From: Kent Overstreet <[email protected]>
> > > >
> > > > This is needed for bcachefs, which dynamically generates per-btree node
> > > > unpack functions.
> > >
> > > No, we will never add back a way for random code allocating executable
> > > memory in kernel space.
> > 
> > Yeah I think I glossed over this aspect a bit as it looks ostensibly like simply
> > reinstating a helper function because the code is now used in more than one
> > place (at lsf/mm so a little distracted :)
> > 
> > But it being exported is a problem. Perhaps there's another way of acheving the
> > same aim without having to do so?
> 
> None that I see.
> 
> The background is that bcachefs generates a per btree node unpack
> function, based on the packed format for that btree node, for unpacking
> keys within that node. The unpack function is only ~50 bytes, and for
> locality we want it to be located with the btree node's other in-memory
> lookup tables so they can be prefetched all at once.
> 
> Here's the codegen:
> 
> https://evilpiepirate.org/git/bcachefs.git/tree/fs/bcachefs/bkey.c#n727

Well, it's a cool trick, but it's not clear that it actually belongs in
production kernel code.  What else in the kernel actually does dynamic codegen?
Just BPF, I think?

Among other issues, this is entirely architecture-specific, and it may cause
interoperability issues with various other features, including security
features.  Is it really safe to leave a W&X page around, for example?

What seems to be missing is any explanation for what we're actually getting from
this extremely unusual solution that cannot be gained any other way.  What is
unique about bcachefs that it really needs something like this?

- Eric

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 03/32] locking/lockdep: lockdep_set_no_check_recursion()
  2023-05-09 20:18     ` Kent Overstreet
  2023-05-09 20:27       ` Waiman Long
@ 2023-05-10  8:59       ` Peter Zijlstra
  2023-05-10 20:38         ` Kent Overstreet
  1 sibling, 1 reply; 73+ messages in thread
From: Peter Zijlstra @ 2023-05-10  8:59 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-kernel, linux-fsdevel, linux-bcachefs, Ingo Molnar,
	Will Deacon, Waiman Long, Boqun Feng

On Tue, May 09, 2023 at 04:18:59PM -0400, Kent Overstreet wrote:
> On Tue, May 09, 2023 at 09:31:47PM +0200, Peter Zijlstra wrote:
> > On Tue, May 09, 2023 at 12:56:28PM -0400, Kent Overstreet wrote:
> > > This adds a method to tell lockdep not to check lock ordering within a
> > > lock class - but to still check lock ordering w.r.t. other lock types.
> > > 
> > > This is for bcachefs, where for btree node locks we have our own
> > > deadlock avoidance strategy w.r.t. other btree node locks (cycle
> > > detection), but we still want lockdep to check lock ordering w.r.t.
> > > other lock types.
> > > 
> > 
> > ISTR you had a much nicer version of this where you gave a custom order
> > function -- what happend to that?
> 
> Actually, I spoke too soon; this patch and the other series with the
> comparison function solve different problems.
> 
> For bcachefs btree node locks, we don't have a defined lock ordering at
> all - we do full runtime cycle detection, so we don't want lockdep
> checking for self deadlock because we're handling that but we _do_ want
> lockdep checking lock ordering of btree node locks w.r.t. other locks in
> the system.

Have you read the ww_mutex code? If not, please do so, it does similar
things.

The way it gets around the self-nesting check is by using the nest_lock
annotation, the acquire context itself also has a dep_map for this
purpose.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* RE: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-09 21:29       ` Kent Overstreet
  2023-05-10  6:48         ` Eric Biggers
@ 2023-05-10 11:56         ` David Laight
  1 sibling, 0 replies; 73+ messages in thread
From: David Laight @ 2023-05-10 11:56 UTC (permalink / raw)
  To: 'Kent Overstreet', Lorenzo Stoakes
  Cc: Christoph Hellwig, linux-kernel, linux-fsdevel, linux-bcachefs,
	Kent Overstreet, Andrew Morton, Uladzislau Rezki, linux-mm

From: Kent Overstreet
> Sent: 09 May 2023 22:29
...
> The background is that bcachefs generates a per btree node unpack
> function, based on the packed format for that btree node, for unpacking
> keys within that node. The unpack function is only ~50 bytes, and for
> locality we want it to be located with the btree node's other in-memory
> lookup tables so they can be prefetched all at once.

Loading data into the d-cache isn't going to load code into
the i-cache.
Indeed you don't want to be mixing code and data in the same
cache line - because it just wastes space in the cache.

Looks to me like you could have a few different unpack
functions and pick the correct one based on the packed format.
Quite likely the code would be just as fast (if longer)
when you allow for parallel execution on modern CPUs.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
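
What David is suggesting could look roughly like the following - a small set of
statically compiled unpackers, selected once per btree node based on its packed
format, instead of generating code at runtime. Purely illustrative; it reuses
the made-up key layout from the sketch earlier in the thread:

  typedef struct unpacked_key (*unpack_fn)(const uint64_t *packed,
                                           const uint64_t *field_offset);

  static struct unpacked_key unpack_24_40(const uint64_t *p, const uint64_t *off)
  {
          return (struct unpacked_key) {
                  .inode  = off[0] + (p[0] & ((1ULL << 24) - 1)),
                  .offset = off[1] + (p[0] >> 24),
          };
  }

  static struct unpacked_key unpack_32_32(const uint64_t *p, const uint64_t *off)
  {
          return (struct unpacked_key) {
                  .inode  = off[0] + (uint32_t)p[0],
                  .offset = off[1] + (p[0] >> 32),
          };
  }

  /* Chosen once, when the node's format is read off disk. */
  static unpack_fn choose_unpacker(unsigned int inode_bits)
  {
          return inode_bits <= 24 ? unpack_24_40 : unpack_32_32;
  }

Whether a fixed set like this can match fully specialized, runtime-generated
code is exactly the question being debated in the thread.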


^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-09 16:56 ` [PATCH 07/32] mm: Bring back vmalloc_exec Kent Overstreet
  2023-05-09 18:19   ` Lorenzo Stoakes
  2023-05-09 20:46   ` Christoph Hellwig
@ 2023-05-10 14:18   ` Christophe Leroy
  2023-05-10 15:05   ` Johannes Thumshirn
  3 siblings, 0 replies; 73+ messages in thread
From: Christophe Leroy @ 2023-05-10 14:18 UTC (permalink / raw)
  To: Kent Overstreet, linux-kernel, linux-fsdevel, linux-bcachefs
  Cc: Kent Overstreet, Andrew Morton, Uladzislau Rezki,
	Christoph Hellwig, linux-mm



Le 09/05/2023 à 18:56, Kent Overstreet a écrit :
> From: Kent Overstreet <[email protected]>
> 
> This is needed for bcachefs, which dynamically generates per-btree node
> unpack functions.
> 
> Signed-off-by: Kent Overstreet <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Uladzislau Rezki <[email protected]>
> Cc: Christoph Hellwig <[email protected]>
> Cc: [email protected]
> ---
>   include/linux/vmalloc.h |  1 +
>   kernel/module/main.c    |  4 +---
>   mm/nommu.c              | 18 ++++++++++++++++++
>   mm/vmalloc.c            | 21 +++++++++++++++++++++
>   4 files changed, 41 insertions(+), 3 deletions(-)
> 
> diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
> index 69250efa03..ff147fe115 100644
> --- a/include/linux/vmalloc.h
> +++ b/include/linux/vmalloc.h
> @@ -145,6 +145,7 @@ extern void *vzalloc(unsigned long size) __alloc_size(1);
>   extern void *vmalloc_user(unsigned long size) __alloc_size(1);
>   extern void *vmalloc_node(unsigned long size, int node) __alloc_size(1);
>   extern void *vzalloc_node(unsigned long size, int node) __alloc_size(1);
> +extern void *vmalloc_exec(unsigned long size, gfp_t gfp_mask) __alloc_size(1);
>   extern void *vmalloc_32(unsigned long size) __alloc_size(1);
>   extern void *vmalloc_32_user(unsigned long size) __alloc_size(1);
>   extern void *__vmalloc(unsigned long size, gfp_t gfp_mask) __alloc_size(1);
> diff --git a/kernel/module/main.c b/kernel/module/main.c
> index d3be89de70..9eaa89e84c 100644
> --- a/kernel/module/main.c
> +++ b/kernel/module/main.c
> @@ -1607,9 +1607,7 @@ static void dynamic_debug_remove(struct module *mod, struct _ddebug_info *dyndbg
>   
>   void * __weak module_alloc(unsigned long size)
>   {
> -	return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
> -			GFP_KERNEL, PAGE_KERNEL_EXEC, VM_FLUSH_RESET_PERMS,
> -			NUMA_NO_NODE, __builtin_return_address(0));
> +	return vmalloc_exec(size, GFP_KERNEL);
>   }
>   
>   bool __weak module_init_section(const char *name)
> diff --git a/mm/nommu.c b/mm/nommu.c
> index 57ba243c6a..8d9ab19e39 100644
> --- a/mm/nommu.c
> +++ b/mm/nommu.c
> @@ -280,6 +280,24 @@ void *vzalloc_node(unsigned long size, int node)
>   }
>   EXPORT_SYMBOL(vzalloc_node);
>   
> +/**
> + *	vmalloc_exec  -  allocate virtually contiguous, executable memory
> + *	@size:		allocation size
> + *
> + *	Kernel-internal function to allocate enough pages to cover @size
> + *	the page level allocator and map them into contiguous and
> + *	executable kernel virtual space.
> + *
> + *	For tight control over page level allocator and protection flags
> + *	use __vmalloc() instead.
> + */
> +
> +void *vmalloc_exec(unsigned long size, gfp_t gfp_mask)
> +{
> +	return __vmalloc(size, gfp_mask);
> +}
> +EXPORT_SYMBOL_GPL(vmalloc_exec);
> +
>   /**
>    * vmalloc_32  -  allocate virtually contiguous memory (32bit addressable)
>    *	@size:		allocation size
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 31ff782d36..2ebb9ea7f0 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3401,6 +3401,27 @@ void *vzalloc_node(unsigned long size, int node)
>   }
>   EXPORT_SYMBOL(vzalloc_node);
>   
> +/**
> + * vmalloc_exec - allocate virtually contiguous, executable memory
> + * @size:	  allocation size
> + *
> + * Kernel-internal function to allocate enough pages to cover @size
> + * the page level allocator and map them into contiguous and
> + * executable kernel virtual space.
> + *
> + * For tight control over page level allocator and protection flags
> + * use __vmalloc() instead.
> + *
> + * Return: pointer to the allocated memory or %NULL on error
> + */
> +void *vmalloc_exec(unsigned long size, gfp_t gfp_mask)
> +{
> +	return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
> +			gfp_mask, PAGE_KERNEL_EXEC, VM_FLUSH_RESET_PERMS,
> +			NUMA_NO_NODE, __builtin_return_address(0));
> +}

That cannot work. The VMALLOC space is mapped non-exec on powerpc/32. 
You have to allocate between MODULES_VADDR and MODULES_END if you want 
something executable, so you must use module_alloc(); see 
https://elixir.bootlin.com/linux/v6.4-rc1/source/arch/powerpc/kernel/module.c#L108

> +EXPORT_SYMBOL_GPL(vmalloc_exec);
> +
>   #if defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA32)
>   #define GFP_VMALLOC32 (GFP_DMA32 | GFP_KERNEL)
>   #elif defined(CONFIG_64BIT) && defined(CONFIG_ZONE_DMA)

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-09 16:56 ` [PATCH 07/32] mm: Bring back vmalloc_exec Kent Overstreet
                     ` (2 preceding siblings ...)
  2023-05-10 14:18   ` Christophe Leroy
@ 2023-05-10 15:05   ` Johannes Thumshirn
  3 siblings, 0 replies; 73+ messages in thread
From: Johannes Thumshirn @ 2023-05-10 15:05 UTC (permalink / raw)
  To: Kent Overstreet, linux-kernel, linux-fsdevel, linux-bcachefs, Kees Cook
  Cc: Kent Overstreet, Andrew Morton, Uladzislau Rezki, hch, linux-mm,
	linux-hardening

On 09.05.23 18:56, Kent Overstreet wrote:
> +/**
> + * vmalloc_exec - allocate virtually contiguous, executable memory
> + * @size:	  allocation size
> + *
> + * Kernel-internal function to allocate enough pages to cover @size
> + * the page level allocator and map them into contiguous and
> + * executable kernel virtual space.
> + *
> + * For tight control over page level allocator and protection flags
> + * use __vmalloc() instead.
> + *
> + * Return: pointer to the allocated memory or %NULL on error
> + */
> +void *vmalloc_exec(unsigned long size, gfp_t gfp_mask)
> +{
> +	return __vmalloc_node_range(size, 1, VMALLOC_START, VMALLOC_END,
> +			gfp_mask, PAGE_KERNEL_EXEC, VM_FLUSH_RESET_PERMS,
> +			NUMA_NO_NODE, __builtin_return_address(0));
> +}
> +EXPORT_SYMBOL_GPL(vmalloc_exec);

Uh, W+X memory regions.
The 90s called, they want their shellcode back.

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 03/32] locking/lockdep: lockdep_set_no_check_recursion()
  2023-05-10  8:59       ` Peter Zijlstra
@ 2023-05-10 20:38         ` Kent Overstreet
  2023-05-11  8:25           ` Peter Zijlstra
  0 siblings, 1 reply; 73+ messages in thread
From: Kent Overstreet @ 2023-05-10 20:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-fsdevel, linux-bcachefs, Ingo Molnar,
	Will Deacon, Waiman Long, Boqun Feng

On Wed, May 10, 2023 at 10:59:05AM +0200, Peter Zijlstra wrote:
> On Tue, May 09, 2023 at 04:18:59PM -0400, Kent Overstreet wrote:
> > On Tue, May 09, 2023 at 09:31:47PM +0200, Peter Zijlstra wrote:
> > > On Tue, May 09, 2023 at 12:56:28PM -0400, Kent Overstreet wrote:
> > > > This adds a method to tell lockdep not to check lock ordering within a
> > > > lock class - but to still check lock ordering w.r.t. other lock types.
> > > > 
> > > > This is for bcachefs, where for btree node locks we have our own
> > > > deadlock avoidance strategy w.r.t. other btree node locks (cycle
> > > > detection), but we still want lockdep to check lock ordering w.r.t.
> > > > other lock types.
> > > > 
> > > 
> > > ISTR you had a much nicer version of this where you gave a custom order
> > > function -- what happend to that?
> > 
> > Actually, I spoke too soon; this patch and the other series with the
> > comparison function solve different problems.
> > 
> > For bcachefs btree node locks, we don't have a defined lock ordering at
> > all - we do full runtime cycle detection, so we don't want lockdep
> > checking for self deadlock because we're handling that but we _do_ want
> > lockdep checking lock ordering of btree node locks w.r.t. other locks in
> > the system.
> 
> Have you read the ww_mutex code? If not, please do so, it does similar
> things.
> 
> The way it gets around the self-nesting check is by using the nest_lock
> annotation, the acquire context itself also has a dep_map for this
> purpose.

This might work.

I was confused for a good bit when reading the code to figure out how
it works - nest_lock seems to be a pretty bad name, it's really not a
lock. acquire_ctx?

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 23/32] iov_iter: copy_folio_from_iter_atomic()
  2023-05-09 16:56 ` [PATCH 23/32] iov_iter: copy_folio_from_iter_atomic() Kent Overstreet
  2023-05-10  2:20   ` kernel test robot
@ 2023-05-11  2:08   ` kernel test robot
  1 sibling, 0 replies; 73+ messages in thread
From: kernel test robot @ 2023-05-11  2:08 UTC (permalink / raw)
  To: Kent Overstreet, linux-kernel, linux-fsdevel, linux-bcachefs
  Cc: oe-kbuild-all, Kent Overstreet, Alexander Viro, Matthew Wilcox

Hi Kent,

kernel test robot noticed the following build warnings:

[auto build test WARNING on tip/locking/core]
[cannot apply to axboe-block/for-next akpm-mm/mm-everything kdave/for-next linus/master v6.4-rc1 next-20230510]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url:    https://github.com/intel-lab-lkp/linux/commits/Kent-Overstreet/Compiler-Attributes-add-__flatten/20230510-010302
base:   tip/locking/core
patch link:    https://lore.kernel.org/r/20230509165657.1735798-24-kent.overstreet%40linux.dev
patch subject: [PATCH 23/32] iov_iter: copy_folio_from_iter_atomic()
config: powerpc-randconfig-s042-20230509 (https://download.01.org/0day-ci/archive/20230511/[email protected]/config)
compiler: powerpc-linux-gcc (GCC) 12.1.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # apt-get install sparse
        # sparse version: v0.6.4-39-gce1a6720-dirty
        # https://github.com/intel-lab-lkp/linux/commit/0e5d4229f5e7671dabba56ea36583b1ca20a9a18
        git remote add linux-review https://github.com/intel-lab-lkp/linux
        git fetch --no-tags linux-review Kent-Overstreet/Compiler-Attributes-add-__flatten/20230510-010302
        git checkout 0e5d4229f5e7671dabba56ea36583b1ca20a9a18
        # save the config file
        mkdir build_dir && cp config build_dir/.config
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=powerpc olddefconfig
        COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.1.0 make.cross C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=powerpc SHELL=/bin/bash

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <[email protected]>
| Link: https://lore.kernel.org/oe-kbuild-all/[email protected]/

sparse warnings: (new ones prefixed by >>)
>> lib/iov_iter.c:839:30: sparse: sparse: incompatible types in comparison expression (different type sizes):
>> lib/iov_iter.c:839:30: sparse:    unsigned int *
>> lib/iov_iter.c:839:30: sparse:    unsigned long *

vim +839 lib/iov_iter.c

   825	
   826	size_t copy_folio_from_iter_atomic(struct folio *folio, size_t offset,
   827					   size_t bytes, struct iov_iter *i)
   828	{
   829		size_t ret = 0;
   830	
   831		if (WARN_ON(offset + bytes > folio_size(folio)))
   832			return 0;
   833		if (WARN_ON_ONCE(!i->data_source))
   834			return 0;
   835	
   836	#ifdef CONFIG_HIGHMEM
   837		while (bytes) {
   838			struct page *page = folio_page(folio, offset >> PAGE_SHIFT);
 > 839			unsigned b = min(bytes, PAGE_SIZE - (offset & PAGE_MASK));
   840			unsigned r = __copy_page_from_iter_atomic(page, offset, b, i);
   841	
   842			offset	+= r;
   843			bytes	-= r;
   844			ret	+= r;
   845	
   846			if (r != b)
   847				break;
   848		}
   849	#else
   850		ret = __copy_page_from_iter_atomic(&folio->page, offset, bytes, i);
   851	#endif
   852	
   853		return ret;
   854	}
   855	EXPORT_SYMBOL(copy_folio_from_iter_atomic);
   856	

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-09 21:54         ` Kent Overstreet
@ 2023-05-11  5:33           ` Theodore Ts'o
  2023-05-11  5:44             ` Kent Overstreet
  0 siblings, 1 reply; 73+ messages in thread
From: Theodore Ts'o @ 2023-05-11  5:33 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: Darrick J. Wong, Lorenzo Stoakes, Christoph Hellwig,
	linux-kernel, linux-fsdevel, linux-bcachefs, Kent Overstreet,
	Andrew Morton, Uladzislau Rezki, linux-mm

On Tue, May 09, 2023 at 05:54:26PM -0400, Kent Overstreet wrote:
> 
> I think it'd be much more practical to find some way of making
> vmalloc_exec() more palatable. What are the exact concerns?

Having a vmalloc_exec() function (whether it is not exported at all,
or exported as a GPL symbol) makes it *much* easier for an exploit
writer, since it's a super convenient gadget for use with
return-oriented programming[1] to create a writeable, executable
space that could then be filled with arbitrary code of the exploit
author's arbitrary desire.

[1] https://en.wikipedia.org/wiki/Return-oriented_programming

The other thing I'll note from examining the code generator, is that
it appears that bcachefs *only* has support for x86_64.  This brings
me back to the days of my youth when all the world was a Vax[2].  :-)

   10.  Thou shalt foreswear, renounce, and abjure the vile heresy
        which claimeth that ``All the world's a VAX'', and have no commerce
	with the benighted heathens who cling to this barbarous belief, that
 	the days of thy program may be long even though the days of thy
	current machine be short.

	[ This particular heresy bids fair to be replaced by ``All the
	world's a Sun'' or ``All the world's a 386'' (this latter
	being a particularly revolting invention of Satan), but the
	words apply to all such without limitation. Beware, in
	particular, of the subtle and terrible ``All the world's a
	32-bit machine'', which is almost true today but shall cease
	to be so before thy resume grows too much longer.]

[2] The Ten Commandments for C Programmers (Annotated Edition)
    https://www.lysator.liu.se/c/ten-commandments.html

Seriously, does this mean that bcachefs won't work on Arm systems
(arm32 or arm64)?  Or Risc V systems?  Or S/390's?  Or Power
architectures?  Or Itanium or PA-RISC systems?  (OK, I really don't
care all that much about those last two.  :-)


When people ask me why file systems are so hard to make enterprise
ready, I tell them to recall the general advice given to people to
write secure, robust systems: (a) avoid premature optimization, (b)
avoid fine-grained, multi-threaded programming, as much as possible,
because locking bugs are a b*tch, and (c) avoid unnecessary global
state as much as possible.

File systems tend to violate all of these precepts: (a) people chase
benchmark optimizations to the exclusion of all else, because people
have an unhealthy obsession with Phoronix benchmark articles, (b) file
systems tend to be inherently multi-threaded, with lots of locks, and
(c) file systems are all about managing global state in the form of
files, directories, etc.

However, hiding a miniature architecture-specific compiler inside a
file system seems to be a rather blatant example of "premature
optimization".

							- Ted

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 07/32] mm: Bring back vmalloc_exec
  2023-05-11  5:33           ` Theodore Ts'o
@ 2023-05-11  5:44             ` Kent Overstreet
  0 siblings, 0 replies; 73+ messages in thread
From: Kent Overstreet @ 2023-05-11  5:44 UTC (permalink / raw)
  To: Theodore Ts'o
  Cc: Darrick J. Wong, Lorenzo Stoakes, Christoph Hellwig,
	linux-kernel, linux-fsdevel, linux-bcachefs, Kent Overstreet,
	Andrew Morton, Uladzislau Rezki, linux-mm

On Thu, May 11, 2023 at 01:33:12AM -0400, Theodore Ts'o wrote:
> Seriously, does this mean that bcachefs won't work on Arm systems
> (arm32 or arm64)?  Or Risc V systems?  Or S/390's?  Or Power
> architectuers?  Or Itanium or PA-RISC systems?  (OK, I really don't
> care all that much about those last two.  :-)

No :)

My CI servers are arm64 servers. There's a bch2_bkey_unpack_key()
written in C, that works on any architecture. But specializing for a
particular format is a not-insignificant performance improvement, so
writing an arm64 version has been on my todo list.

> When people ask me why file systems are so hard to make enterprise
> ready, I tell them to recall the general advice given to people to
> write secure, robust systems: (a) avoid premature optimization, (b)
> avoid fine-grained, multi-threaded programming, as much as possible,
> because locking bugs are a b*tch, and (c) avoid unnecessary global
> state as much as possible.
> 
> File systems tend to violate all of these precepts: (a) people chase
> benchmark optimizations to the exclusion of all else, because people
> have an unhealthy obsession with Phornix benchmark articles, (b) file
> systems tend to be inherently multi-threaded, with lots of locks, and
> (c) file systems are all about managing global state in the form of
> files, directories, etc.
> 
> However, hiding a miniature architecture-specific compiler inside a
> file system seems to be a rather blatent example of "premature
> optimization".

Ted, this project is _15_ years old.

I'm getting ready to write a full explanation of what this is for and
why it's important, I've just been busy with the conference - and I want
to write something good, that provides all the context.

I've also been mulling over fallback options, but I don't see any good
ones. The unspecialized, C version of unpack has branches (the absolute
minimum, I took my time when I was writing that code too); the
specialized versions are branchless and _much_ smaller, and the only way
to do that specialization is with some form of dynamic codegen.

But I do owe you all a detailed walkthrough of what this is all about,
so you'll get it in the next day or so.

Cheers,
Kent

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 03/32] locking/lockdep: lockdep_set_no_check_recursion()
  2023-05-10 20:38         ` Kent Overstreet
@ 2023-05-11  8:25           ` Peter Zijlstra
  2023-05-11  9:32             ` Kent Overstreet
  0 siblings, 1 reply; 73+ messages in thread
From: Peter Zijlstra @ 2023-05-11  8:25 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-kernel, linux-fsdevel, linux-bcachefs, Ingo Molnar,
	Will Deacon, Waiman Long, Boqun Feng

On Wed, May 10, 2023 at 04:38:15PM -0400, Kent Overstreet wrote:
> On Wed, May 10, 2023 at 10:59:05AM +0200, Peter Zijlstra wrote:

> > Have you read the ww_mutex code? If not, please do so, it does similar
> > things.
> > 
> > The way it gets around the self-nesting check is by using the nest_lock
> > annotation, the acquire context itself also has a dep_map for this
> > purpose.
> 
> This might work.
> 
> I was confused for a good bit when reading the code to figure out how
> it works - nest_lock seems to be a pretty bad name, it's really not a
> lock. acquire_ctx?

That's just how ww_mutex uses it, the annotation itself comes from
mm_take_all_locks() where mm->mmap_lock (the lock formerly known as
mmap_sem) is used to serialize multi acquisition of vma locks.

That is, no other code takes multiple vma locks (be it i_mmap_rwsem or
anonvma->root->rwsem) in any order. These locks nest inside mmap_lock
and therefore by holding mmap_lock you serialize the whole thing and can
take them in any order you like.

Perhaps, now, all these many years later another name would've made more
sense, but I don't think it's worth the hassle of the tree-wide rename
(there's a few other users since).
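
As a rough illustration (made-up structures, not the actual
mm_take_all_locks() code), the annotation gets used like this - lockdep
is told that every child lock is only ever taken while the parent is
held, so taking any number of them, in any order, is fine:

#include <linux/mutex.h>

struct parent {
        struct mutex lock;
};

struct child {
        struct mutex lock;
};

static void lock_all_children(struct parent *p, struct child **c, int nr)
{
        int i;

        /* The outer lock is what actually serializes everything. */
        mutex_lock(&p->lock);

        /*
         * nest_lock annotation: "I hold p->lock, so however many child
         * locks I take, and in whatever order, it can't deadlock against
         * anyone else doing the same."
         */
        for (i = 0; i < nr; i++)
                mutex_lock_nest_lock(&c[i]->lock, &p->lock);
}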

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 03/32] locking/lockdep: lockdep_set_no_check_recursion()
  2023-05-11  8:25           ` Peter Zijlstra
@ 2023-05-11  9:32             ` Kent Overstreet
  0 siblings, 0 replies; 73+ messages in thread
From: Kent Overstreet @ 2023-05-11  9:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-fsdevel, linux-bcachefs, Ingo Molnar,
	Will Deacon, Waiman Long, Boqun Feng

On Thu, May 11, 2023 at 10:25:44AM +0200, Peter Zijlstra wrote:
> On Wed, May 10, 2023 at 04:38:15PM -0400, Kent Overstreet wrote:
> > On Wed, May 10, 2023 at 10:59:05AM +0200, Peter Zijlstra wrote:
> 
> > > Have you read the ww_mutex code? If not, please do so, it does similar
> > > things.
> > > 
> > > The way it gets around the self-nesting check is by using the nest_lock
> > > annotation, the acquire context itself also has a dep_map for this
> > > purpose.
> > 
> > This might work.
> > 
> > I was confused for a good bit when reading the code to figure out how
> > it works - nest_lock seems to be a pretty bad name, it's really not a
> > lock. acquire_ctx?
> 
> That's just how ww_mutex uses it, the annotation itself comes from
> mm_take_all_locks() where mm->mmap_lock (the lock formerly known as
> mmap_sem) is used to serialize multi acquisition of vma locks.
> 
> That is, no other code takes multiple vma locks (be it i_mmap_rwsem or
> anonvma->root->rwsem) in any order. These locks nest inside mmap_lock
> and therefore by holding mmap_lock you serialize the whole thing and can
> take them in any order you like.
> 
> Perhaps, now, all these many years later another name would've made more
> sense, but I don't think it's worth the hassle of the tree-wide rename
> (there's a few other users since).

Thanks for the history lesson :)

^ permalink raw reply	[flat|nested] 73+ messages in thread

* Re: [PATCH 04/32] locking: SIX locks (shared/intent/exclusive)
  2023-05-09 16:56 ` [PATCH 04/32] locking: SIX locks (shared/intent/exclusive) Kent Overstreet
@ 2023-05-11 12:14   ` Jan Engelhardt
  0 siblings, 0 replies; 73+ messages in thread
From: Jan Engelhardt @ 2023-05-11 12:14 UTC (permalink / raw)
  To: Kent Overstreet
  Cc: linux-kernel, linux-fsdevel, linux-bcachefs, Kent Overstreet,
	Peter Zijlstra, Ingo Molnar, Will Deacon, Waiman Long,
	Boqun Feng


On Tuesday 2023-05-09 18:56, Kent Overstreet wrote:
>--- /dev/null
>+++ b/include/linux/six.h
>@@ -0,0 +1,210 @@
>+ * There are also operations that take the lock type as a parameter, where the
>+ * type is one of SIX_LOCK_read, SIX_LOCK_intent, or SIX_LOCK_write:
>+ *
>+ *   six_lock_type(lock, type)
>+ *   six_unlock_type(lock, type)
>+ *   six_relock(lock, type, seq)
>+ *   six_trylock_type(lock, type)
>+ *   six_trylock_convert(lock, from, to)
>+ *
>+ * A lock may be held multiple types by the same thread (for read or intent,

"multiple times"

>+// SPDX-License-Identifier: GPL-2.0

The current SPDX list only knows "GPL-2.0-only" or "GPL-2.0-or-later",
please edit.
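
That is, the tag would need to read something like:

  // SPDX-License-Identifier: GPL-2.0-only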

^ permalink raw reply	[flat|nested] 73+ messages in thread

end of thread, other threads:[~2023-05-11 12:15 UTC | newest]

Thread overview: 73+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-09 16:56 [PATCH 00/32] bcachefs - a new COW filesystem Kent Overstreet
2023-05-09 16:56 ` [PATCH 01/32] Compiler Attributes: add __flatten Kent Overstreet
2023-05-09 17:04   ` Miguel Ojeda
2023-05-09 17:24     ` Kent Overstreet
2023-05-09 16:56 ` [PATCH 02/32] locking/lockdep: lock_class_is_held() Kent Overstreet
2023-05-09 19:30   ` Peter Zijlstra
2023-05-09 20:11     ` Kent Overstreet
2023-05-09 16:56 ` [PATCH 03/32] locking/lockdep: lockdep_set_no_check_recursion() Kent Overstreet
2023-05-09 19:31   ` Peter Zijlstra
2023-05-09 19:57     ` Kent Overstreet
2023-05-09 20:18     ` Kent Overstreet
2023-05-09 20:27       ` Waiman Long
2023-05-09 20:35         ` Kent Overstreet
2023-05-09 21:37           ` Waiman Long
2023-05-10  8:59       ` Peter Zijlstra
2023-05-10 20:38         ` Kent Overstreet
2023-05-11  8:25           ` Peter Zijlstra
2023-05-11  9:32             ` Kent Overstreet
2023-05-09 16:56 ` [PATCH 04/32] locking: SIX locks (shared/intent/exclusive) Kent Overstreet
2023-05-11 12:14   ` Jan Engelhardt
2023-05-09 16:56 ` [PATCH 05/32] MAINTAINERS: Add entry for six locks Kent Overstreet
2023-05-09 16:56 ` [PATCH 06/32] sched: Add task_struct->faults_disabled_mapping Kent Overstreet
2023-05-10  1:07   ` Jan Kara
2023-05-10  6:18     ` Kent Overstreet
2023-05-09 16:56 ` [PATCH 07/32] mm: Bring back vmalloc_exec Kent Overstreet
2023-05-09 18:19   ` Lorenzo Stoakes
2023-05-09 20:15     ` Kent Overstreet
2023-05-09 20:46   ` Christoph Hellwig
2023-05-09 21:12     ` Lorenzo Stoakes
2023-05-09 21:29       ` Kent Overstreet
2023-05-10  6:48         ` Eric Biggers
2023-05-10 11:56         ` David Laight
2023-05-09 21:43       ` Darrick J. Wong
2023-05-09 21:54         ` Kent Overstreet
2023-05-11  5:33           ` Theodore Ts'o
2023-05-11  5:44             ` Kent Overstreet
2023-05-10 14:18   ` Christophe Leroy
2023-05-10 15:05   ` Johannes Thumshirn
2023-05-09 16:56 ` [PATCH 08/32] fs: factor out d_mark_tmpfile() Kent Overstreet
2023-05-09 16:56 ` [PATCH 09/32] block: Add some exports for bcachefs Kent Overstreet
2023-05-09 16:56 ` [PATCH 10/32] block: Allow bio_iov_iter_get_pages() with bio->bi_bdev unset Kent Overstreet
2023-05-09 16:56 ` [PATCH 11/32] block: Bring back zero_fill_bio_iter Kent Overstreet
2023-05-09 16:56 ` [PATCH 12/32] block: Rework bio_for_each_segment_all() Kent Overstreet
2023-05-09 16:56 ` [PATCH 13/32] block: Rework bio_for_each_folio_all() Kent Overstreet
2023-05-09 16:56 ` [PATCH 14/32] block: Don't block on s_umount from __invalidate_super() Kent Overstreet
2023-05-09 16:56 ` [PATCH 15/32] bcache: move closures to lib/ Kent Overstreet
2023-05-10  1:10   ` Randy Dunlap
2023-05-09 16:56 ` [PATCH 16/32] MAINTAINERS: Add entry for closures Kent Overstreet
2023-05-09 17:05   ` Coly Li
2023-05-09 21:03   ` Randy Dunlap
2023-05-09 16:56 ` [PATCH 17/32] closures: closure_wait_event() Kent Overstreet
2023-05-09 16:56 ` [PATCH 18/32] closures: closure_nr_remaining() Kent Overstreet
2023-05-09 16:56 ` [PATCH 19/32] closures: Add a missing include Kent Overstreet
2023-05-09 16:56 ` [PATCH 20/32] vfs: factor out inode hash head calculation Kent Overstreet
2023-05-09 16:56 ` [PATCH 21/32] hlist-bl: add hlist_bl_fake() Kent Overstreet
2023-05-10  4:48   ` Dave Chinner
2023-05-09 16:56 ` [PATCH 22/32] vfs: inode cache conversion to hash-bl Kent Overstreet
2023-05-10  4:45   ` Dave Chinner
2023-05-09 16:56 ` [PATCH 23/32] iov_iter: copy_folio_from_iter_atomic() Kent Overstreet
2023-05-10  2:20   ` kernel test robot
2023-05-11  2:08   ` kernel test robot
2023-05-09 16:56 ` [PATCH 24/32] MAINTAINERS: Add entry for generic-radix-tree Kent Overstreet
2023-05-09 21:03   ` Randy Dunlap
2023-05-09 16:56 ` [PATCH 25/32] lib/generic-radix-tree.c: Don't overflow in peek() Kent Overstreet
2023-05-09 16:56 ` [PATCH 26/32] lib/generic-radix-tree.c: Add a missing include Kent Overstreet
2023-05-09 16:56 ` [PATCH 27/32] lib/generic-radix-tree.c: Add peek_prev() Kent Overstreet
2023-05-09 16:56 ` [PATCH 28/32] stacktrace: Export stack_trace_save_tsk Kent Overstreet
2023-05-09 16:56 ` [PATCH 29/32] lib/string_helpers: string_get_size() now returns characters wrote Kent Overstreet
2023-05-09 16:56 ` [PATCH 30/32] lib: Export errname Kent Overstreet
2023-05-09 16:56 ` [PATCH 31/32] lib: add mean and variance module Kent Overstreet
2023-05-09 16:56 ` [PATCH 32/32] MAINTAINERS: Add entry for bcachefs Kent Overstreet
2023-05-09 21:04   ` Randy Dunlap
2023-05-09 21:07     ` Kent Overstreet

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).
