=======================================================
Configfs - Userspace-driven Kernel Object Configuration
=======================================================

Joel Becker <joel.becker@oracle.com>

Updated: 31 March 2005

Copyright (c) 2005 Oracle Corporation,
	Joel Becker <joel.becker@oracle.com>


What is configfs?
=================

configfs is a ram-based filesystem that provides the converse of
sysfs's functionality. Where sysfs is a filesystem-based view of
kernel objects, configfs is a filesystem-based manager of kernel
objects, or config_items.

With sysfs, an object is created in kernel (for example, when a device
is discovered) and it is registered with sysfs. Its attributes then
appear in sysfs, allowing userspace to read the attributes via
readdir(3)/read(2). It may allow some attributes to be modified via
write(2). The important point is that the object is created and
destroyed in kernel, the kernel controls the lifecycle of the sysfs
representation, and sysfs is merely a window on all this.

A configfs config_item is created via an explicit userspace operation:
mkdir(2). It is destroyed via rmdir(2). The attributes appear at
mkdir(2) time, and can be read or modified via read(2) and write(2).
As with sysfs, readdir(3) queries the list of items and/or attributes.
symlink(2) can be used to group items together. Unlike sysfs, the
lifetime of the representation is completely driven by userspace. The
kernel modules backing the items must respond to this.

Both sysfs and configfs can and should exist together on the same
system. One is not a replacement for the other.

Using configfs
==============

configfs can be compiled as a module or into the kernel. You can access
it by doing::

	mount -t configfs none /config

The configfs tree will be empty unless client modules are also loaded.
These are modules that register their item types with configfs as
subsystems. Once a client subsystem is loaded, it will appear as a
subdirectory (or more than one) under /config. Like sysfs, the
configfs tree is always there, whether mounted on /config or not.

An item is created via mkdir(2). The item's attributes will also
appear at this time. readdir(3) can determine what the attributes are,
read(2) can query their default values, and write(2) can store new
values. Don't mix more than one attribute in one attribute file.

There are two types of configfs attributes:

* Normal attributes, which, similar to sysfs attributes, are small ASCII text
  files, with a maximum size of one page (PAGE_SIZE, 4096 on i386). Preferably
  only one value per file should be used, and the same caveats from sysfs apply.
  Configfs expects write(2) to store the entire buffer at once. When writing to
  normal configfs attributes, userspace processes should first read the entire
  file, modify the portions they wish to change, and then write the entire
  buffer back.

* Binary attributes, which are somewhat similar to sysfs binary attributes,
  but with a few slight changes to semantics. The PAGE_SIZE limitation does not
  apply, but the whole binary item must fit in a single kernel vmalloc'ed
  buffer. The write(2) calls from user space are buffered, and the attribute's
  ->write() method will be invoked on the final close; therefore it is
  imperative for user space to check the return code of close(2) in order to
  verify that the operation finished successfully.
  To prevent a malicious user from OOMing the kernel, each binary attribute
  has its own maximum buffer size.
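
The two attribute rules above translate into a small amount of userspace
code. The following is a minimal sketch, not part of any configfs API: the
helper names attr_read_all(), attr_write_all() and bin_attr_write() are
hypothetical, and because configfs may not be mounted where you try it, the
helpers accept any regular file path. They read a normal attribute in full,
store a value with a single write(2), and, for a binary attribute, treat
close(2) as the point where the kernel reports success or failure.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Read a whole attribute into buf (NUL-terminated, at most cap - 1 bytes).
 * Returns the byte count or -errno. */
static ssize_t attr_read_all(const char *path, char *buf, size_t cap)
{
	size_t total = 0;
	int fd;

	if (cap == 0)
		return -EINVAL;
	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -errno;
	while (total < cap - 1) {
		ssize_t n = read(fd, buf + total, cap - 1 - total);
		if (n < 0) {
			int err = errno;
			close(fd);
			return -err;
		}
		if (n == 0)
			break;
		total += (size_t)n;
	}
	buf[total] = '\0';
	close(fd);
	return (ssize_t)total;
}

/* Store a complete value with one write(2), truncating like the shell's
 * ">" redirection. A short write is an error: configfs expects the
 * entire buffer at once. */
static int attr_write_all(const char *path, const char *buf, size_t len)
{
	int err = 0;
	int fd = open(path, O_WRONLY | O_TRUNC);
	ssize_t n;

	if (fd < 0)
		return -errno;
	n = write(fd, buf, len);
	if (n < 0)
		err = -errno;
	else if ((size_t)n != len)
		err = -EIO;
	if (close(fd) < 0 && err == 0)
		err = -errno;
	return err;
}

/* Write a binary blob. On configfs the writes are only buffered and the
 * attribute's handler runs at close, so close(2)'s result is the status. */
static int bin_attr_write(const char *path, const void *blob, size_t len)
{
	const char *p = blob;
	int fd = open(path, O_WRONLY);

	if (fd < 0)
		return -errno;
	while (len > 0) {
		ssize_t n = write(fd, p, len);
		if (n < 0) {
			int err = errno;
			close(fd);
			return -err;
		}
		p += n;
		len -= (size_t)n;
	}
	if (close(fd) < 0)
		return -errno;
	return 0;
}
```

To change part of a normal attribute, read it with attr_read_all(), edit the
buffer in memory, and hand the whole buffer to attr_write_all().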

When an item needs to be destroyed, remove it with rmdir(2). An
item cannot be destroyed if any other item has a link to it (via
symlink(2)). Links can be removed via unlink(2).

Configuring FakeNBD: an Example
===============================

Imagine there's a Network Block Device (NBD) driver that allows you to
access remote block devices. Call it FakeNBD. FakeNBD uses configfs
for its configuration. Obviously, there will be a nice program that
sysadmins use to configure FakeNBD, but somehow that program has to tell
the driver about it. Here's where configfs comes in.

When the FakeNBD driver is loaded, it registers itself with configfs.
readdir(3) sees this just fine::

	# ls /config
	fakenbd

A fakenbd connection can be created with mkdir(2). The name is
arbitrary, but likely the tool will make some use of the name. Perhaps
it is a uuid or a disk name::

	# mkdir /config/fakenbd/disk1
	# ls /config/fakenbd/disk1
	target device rw

The target attribute contains the IP address of the server FakeNBD will
connect to. The device attribute is the device on the server.
Predictably, the rw attribute determines whether the connection is
read-only or read-write::

	# echo 10.0.0.1 > /config/fakenbd/disk1/target
	# echo /dev/sda1 > /config/fakenbd/disk1/device
	# echo 1 > /config/fakenbd/disk1/rw

That's it. That's all there is. Now the device is configured, via the
shell no less.

Coding With configfs
====================

Every object in configfs is a config_item. A config_item reflects an
object in the subsystem. It has attributes that match values on that
object. configfs handles the filesystem representation of that object
and its attributes, allowing the subsystem to ignore all but the
basic show/store interaction.

Items are created and destroyed inside a config_group. A group is a
collection of items that share the same attributes and operations.
Items are created by mkdir(2) and removed by rmdir(2), but configfs
handles the filesystem side of that. The group has a set of operations
to perform these tasks.

A subsystem is the top level of a client module. During initialization,
the client module registers the subsystem with configfs, and the
subsystem appears as a directory at the top of the configfs filesystem.
A subsystem is also a config_group, and can do everything a config_group
can.

struct config_item
==================

::

	struct config_item {
		char                    *ci_name;
		char                    ci_namebuf[UOBJ_NAME_LEN];
		struct kref             ci_kref;
		struct list_head        ci_entry;
		struct config_item      *ci_parent;
		struct config_group     *ci_group;
		struct config_item_type *ci_type;
		struct dentry           *ci_dentry;
	};

	void config_item_init(struct config_item *);
	void config_item_init_type_name(struct config_item *,
					const char *name,
					struct config_item_type *type);
	struct config_item *config_item_get(struct config_item *);
	void config_item_put(struct config_item *);

Generally, struct config_item is embedded in a container structure, a
structure that actually represents what the subsystem is doing. The
config_item portion of that structure is how the object interacts with
configfs.

Whether statically defined in a source file or created by a parent
config_group, a config_item must have one of the _init() functions
called on it. This initializes the reference count and sets up the
appropriate fields.

All users of a config_item should hold a reference on it via
config_item_get(), and drop the reference when they are done via
config_item_put().
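
In the kernel, a subsystem usually recovers its container from a
struct config_item pointer with container_of(). The sketch below reproduces
that pattern in plain userspace C so it compiles and runs on its own;
struct model_item, model_container_of(), model_item_init() and the FakeNBD
container are hypothetical stand-ins, not the configfs types.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for struct config_item, trimmed to two fields. */
struct model_item {
	char ci_namebuf[20];
	int refcount;
};

/* Same arithmetic as the kernel's container_of(): step back from the
 * embedded member to the start of the enclosing structure. */
#define model_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Roughly what the _init() helpers do: name the item, refcount = 1. */
static void model_item_init(struct model_item *item, const char *name)
{
	strncpy(item->ci_namebuf, name, sizeof(item->ci_namebuf) - 1);
	item->ci_namebuf[sizeof(item->ci_namebuf) - 1] = '\0';
	item->refcount = 1;
}

/* Hypothetical FakeNBD container. The item is deliberately not the
 * first member, so the offset arithmetic actually matters. */
struct fakenbd_disk {
	char target[64];
	char device[64];
	int rw;
	struct model_item item;
};

/* The usual to_<container>() helper a subsystem writes. */
static struct fakenbd_disk *to_fakenbd_disk(struct model_item *item)
{
	return item ? model_container_of(item, struct fakenbd_disk, item)
		    : NULL;
}
```

Callbacks that receive only the item pointer, such as ->show() or
->release(), typically start with a helper like to_fakenbd_disk().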

By itself, a config_item cannot do much more than appear in configfs.
Usually a subsystem wants the item to display and/or store attributes,
among other things. For that, it needs a type.

struct config_item_type
=======================

::

	struct configfs_item_operations {
		void (*release)(struct config_item *);
		int (*allow_link)(struct config_item *src,
				  struct config_item *target);
		void (*drop_link)(struct config_item *src,
				  struct config_item *target);
	};

	struct config_item_type {
		struct module                           *ct_owner;
		struct configfs_item_operations         *ct_item_ops;
		struct configfs_group_operations        *ct_group_ops;
		struct configfs_attribute               **ct_attrs;
		struct configfs_bin_attribute           **ct_bin_attrs;
	};

The most basic function of a config_item_type is to define what
operations can be performed on a config_item. All items that have been
allocated dynamically will need to provide the ct_item_ops->release()
method. This method is called when the config_item's reference count
reaches zero.
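
The reference-counting contract can be modeled the same way. This is a
single-threaded userspace sketch that assumes a plain int is an acceptable
stand-in for struct kref (the real kref is atomic); all names are
hypothetical. The last put calls the type's release method, which frees the
container.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct model_item;

/* Hypothetical, trimmed stand-ins for configfs_item_operations and
 * config_item_type. */
struct model_item_ops {
	void (*release)(struct model_item *item);
};

struct model_item_type {
	const struct model_item_ops *ct_item_ops;
};

struct model_item {
	int refcount;                      /* not atomic, unlike kref */
	const struct model_item_type *ci_type;
};

static void model_item_init(struct model_item *item,
			    const struct model_item_type *type)
{
	item->refcount = 1;                /* the allocation's reference */
	item->ci_type = type;
}

static struct model_item *model_item_get(struct model_item *item)
{
	if (item)
		item->refcount++;
	return item;
}

/* Drop a reference; the last one hands the item to ->release(). */
static void model_item_put(struct model_item *item)
{
	const struct model_item_ops *ops;

	if (!item || --item->refcount > 0)
		return;
	ops = item->ci_type ? item->ci_type->ct_item_ops : NULL;
	if (ops && ops->release)
		ops->release(item);
}

/* Hypothetical dynamically allocated container. */
struct widget {
	int value;
	struct model_item item;
};

static int widgets_released;

static void widget_release(struct model_item *item)
{
	struct widget *w = (struct widget *)((char *)item -
					     offsetof(struct widget, item));

	free(w);
	widgets_released++;
}

static const struct model_item_ops widget_item_ops = {
	.release = widget_release,
};

static const struct model_item_type widget_type = {
	.ct_item_ops = &widget_item_ops,
};

static struct widget *widget_new(void)
{
	struct widget *w = calloc(1, sizeof(*w));

	if (w)
		model_item_init(&w->item, &widget_type);
	return w;
}
```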

struct configfs_attribute
=========================

::

	struct configfs_attribute {
		char                    *ca_name;
		struct module           *ca_owner;
		umode_t                  ca_mode;
		ssize_t (*show)(struct config_item *, char *);
		ssize_t (*store)(struct config_item *, const char *, size_t);
	};

When a config_item wants an attribute to appear as a file in the item's
configfs directory, it must define a configfs_attribute describing it.
It then adds the attribute to the NULL-terminated array
config_item_type->ct_attrs. When the item appears in configfs, the
attribute file will appear with the configfs_attribute->ca_name
filename. configfs_attribute->ca_mode specifies the file permissions.

If an attribute is readable and provides a ->show method, that method will
be called whenever userspace asks for a read(2) on the attribute. If an
attribute is writable and provides a ->store method, that method will be
called whenever userspace asks for a write(2) on the attribute.
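
The path from attribute file to ->show/->store can be sketched in userspace
as follows. struct model_attr is a trimmed, hypothetical stand-in for
struct configfs_attribute; MODEL_PAGE_SIZE assumes 4 KiB pages, and the
FakeNBD "rw" handlers are illustrative only.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

#define MODEL_PAGE_SIZE 4096    /* assumption: 4 KiB pages */

/* Hypothetical item state: just FakeNBD's "rw" flag. */
struct model_item {
	int rw;
};

/* Hypothetical, trimmed stand-in for struct configfs_attribute. */
struct model_attr {
	const char *ca_name;
	ssize_t (*show)(struct model_item *item, char *page);
	ssize_t (*store)(struct model_item *item, const char *page,
			 size_t count);
};

/* ->show(): format the value into a page-sized buffer. */
static ssize_t rw_show(struct model_item *item, char *page)
{
	return snprintf(page, MODEL_PAGE_SIZE, "%d\n", item->rw);
}

/* ->store(): parse the buffer; returning count means "all consumed". */
static ssize_t rw_store(struct model_item *item, const char *page,
			size_t count)
{
	if (count == 0 || (page[0] != '0' && page[0] != '1'))
		return -EINVAL;
	item->rw = page[0] - '0';
	return (ssize_t)count;
}

static struct model_attr rw_attr = {
	.ca_name = "rw",
	.show = rw_show,
	.store = rw_store,
};

/* NULL-terminated, like config_item_type->ct_attrs. */
static struct model_attr *fakenbd_attrs[] = { &rw_attr, NULL };

/* Map a file name back to its attribute, as if configfs had created one
 * file per ct_attrs entry. */
static struct model_attr *find_attr(const char *name)
{
	for (struct model_attr **a = fakenbd_attrs; *a; a++)
		if (strcmp((*a)->ca_name, name) == 0)
			return *a;
	return NULL;
}
```

A read(2) of the "rw" file then corresponds to rw_show(), and
`echo 1 > rw` to rw_store() with the buffer "1\n".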

struct configfs_bin_attribute
=============================

::

	struct configfs_bin_attribute {
		struct configfs_attribute cb_attr;
		void                      *cb_private;
		size_t                    cb_max_size;
		ssize_t (*read)(struct config_item *, void *, size_t);
		ssize_t (*write)(struct config_item *, const void *, size_t);
	};

A binary attribute is used when a binary blob needs to appear as the
contents of a file in the item's configfs directory. To do so, add the
binary attribute to the NULL-terminated array
config_item_type->ct_bin_attrs. When the item appears in configfs, the
attribute file will appear with the configfs_bin_attribute->cb_attr.ca_name
filename. configfs_bin_attribute->cb_attr.ca_mode specifies the file
permissions.
The cb_private member is provided for use by the driver, while the
cb_max_size member specifies the maximum amount of vmalloc buffer
to be used.

If a binary attribute is readable and provides a ->read() method, that
method will be called whenever userspace asks for a read(2) on the
attribute. The converse will happen for write(2) with the ->write()
method. The reads/writes are buffered, so only a single read/write will
occur; the attribute's methods need not concern themselves with it.
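
Because binary writes are buffered, the handler sees the whole blob exactly
once. The userspace sketch below models that behavior with hypothetical
names; max_size plays the role of cb_max_size, and returning -EFBIG when the
cap would be exceeded is this sketch's assumption rather than a documented
guarantee.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical model of one open binary-attribute file. */
struct bin_buffer {
	char *data;
	size_t len;
	size_t max_size;   /* plays the role of cb_max_size */
	ssize_t (*write)(const void *blob, size_t len);  /* like ->write() */
};

/* write(2): only append to the private buffer; ->write() is not called. */
static ssize_t bin_buffer_write(struct bin_buffer *b, const void *src,
				size_t count)
{
	char *grown;

	if (count == 0)
		return 0;
	if (count > b->max_size - b->len)
		return -EFBIG;          /* would exceed the per-attribute cap */
	grown = realloc(b->data, b->len + count);
	if (!grown)
		return -ENOMEM;
	memcpy(grown + b->len, src, count);
	b->data = grown;
	b->len += count;
	return (ssize_t)count;
}

/* Final close: one ->write() call with the whole blob (this sketch calls
 * it unconditionally); its status is what close(2) would report. */
static int bin_buffer_close(struct bin_buffer *b)
{
	ssize_t ret = b->write(b->data, b->len);

	free(b->data);
	b->data = NULL;
	b->len = 0;
	return ret < 0 ? (int)ret : 0;
}

/* Example handler: records how many bytes it was given. */
static size_t last_write_len;

static ssize_t record_write(const void *blob, size_t len)
{
	(void)blob;
	last_write_len = len;
	return (ssize_t)len;
}
```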

struct config_group
===================

A config_item cannot live in a vacuum. The only way one can be created
is via mkdir(2) on a config_group. This will trigger creation of a
child item::

	struct config_group {
		struct config_item              cg_item;
		struct list_head                cg_children;
		struct configfs_subsystem       *cg_subsys;
		struct list_head                default_groups;
		struct list_head                group_entry;
	};

	void config_group_init(struct config_group *group);
	void config_group_init_type_name(struct config_group *group,
					 const char *name,
					 struct config_item_type *type);

The config_group structure contains a config_item. Properly configuring
that item means that a group can behave as an item in its own right.
However, it can do more: it can create child items or groups. This is
accomplished via the group operations specified on the group's
config_item_type::

	struct configfs_group_operations {
		struct config_item *(*make_item)(struct config_group *group,
						 const char *name);
		struct config_group *(*make_group)(struct config_group *group,
						   const char *name);
		int (*commit_item)(struct config_item *item);
		void (*disconnect_notify)(struct config_group *group,
					  struct config_item *item);
		void (*drop_item)(struct config_group *group,
				  struct config_item *item);
	};

A group creates child items by providing the
ct_group_ops->make_item() method. If provided, this method is called from
mkdir(2) in the group's directory. The subsystem allocates a new
config_item (or more likely, its container structure), initializes it,
and returns it to configfs. Configfs will then populate the filesystem
tree to reflect the new item.

If the subsystem wants the child to be a group itself, the subsystem
provides ct_group_ops->make_group(). Everything else behaves the same,
using the group _init() functions on the group.
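
The mkdir(2) path can be modeled as follows: configfs asks the group's
make_item() for a new item, then links it into the group. This userspace
sketch uses hypothetical stand-in types, and returning -EPERM for a group
without make_item() is an assumption of the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

struct model_group;

/* Hypothetical, trimmed stand-in for struct config_item. */
struct model_item {
	char ci_namebuf[20];
	struct model_item *next;          /* sibling link, like ci_entry */
	struct model_group *ci_parent_group;
};

/* Hypothetical stand-in for a config_group plus its make_item() op. */
struct model_group {
	struct model_item *children;      /* like cg_children */
	struct model_item *(*make_item)(struct model_group *group,
					const char *name);
};

/* Hypothetical container for one FakeNBD connection. */
struct fakenbd_disk {
	struct model_item item;           /* first member: simplifies free() */
	int rw;
};

/* ->make_item(): allocate the container, initialize the embedded item,
 * and return it. Linking into the tree is left to the caller. */
static struct model_item *fakenbd_make_item(struct model_group *group,
					    const char *name)
{
	struct fakenbd_disk *disk = calloc(1, sizeof(*disk));

	(void)group;
	if (!disk)
		return NULL;
	strncpy(disk->item.ci_namebuf, name,
		sizeof(disk->item.ci_namebuf) - 1);
	return &disk->item;
}

/* The configfs side of mkdir(2): ask the group, then link the result. */
static int model_mkdir(struct model_group *group, const char *name)
{
	struct model_item *item;

	if (!group->make_item)
		return -EPERM;            /* group cannot create children */
	item = group->make_item(group, name);
	if (!item)
		return -ENOMEM;
	item->ci_parent_group = group;
	item->next = group->children;
	group->children = item;
	return 0;
}
```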

Finally, when userspace calls rmdir(2) on the item or group,
ct_group_ops->drop_item() is called. As a config_group is also a
config_item, a separate drop_group() method is not necessary.
The subsystem must config_item_put() the reference that was initialized
upon item allocation. If a subsystem has no work to do, it may omit
the ct_group_ops->drop_item() method, and configfs will call
config_item_put() on the item on behalf of the subsystem.

Important:
   drop_item() is void, and as such cannot fail. When rmdir(2)
   is called, configfs WILL remove the item from the filesystem tree
   (assuming that it has no children to keep it busy). The subsystem is
   responsible for responding to this. If the subsystem has references to
   the item in other threads, the memory is safe, since those references
   keep it allocated. It may take some time for the item to actually
   disappear from the subsystem's usage. But it is gone from configfs.

When drop_item() is called, the item's linkage has already been torn
down. It no longer has a reference on its parent and has no place in
the item hierarchy. If a client needs to do some cleanup before this
teardown happens, the subsystem can implement the
ct_group_ops->disconnect_notify() method. The method is called after
configfs has removed the item from the filesystem view but before the
item is removed from its parent group. Like drop_item(),
disconnect_notify() is void and cannot fail. Client subsystems should
not drop any references here, as they still must do it in drop_item().

A config_group cannot be removed while it still has child items. This
is implemented in the configfs rmdir(2) code. ->drop_item() will not be
called, as the item has not been dropped. rmdir(2) will fail, as the
directory is not empty.
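
The rmdir(2) sequence just described (refuse a non-empty group, call
->disconnect_notify() while the item is still linked, tear down the
linkage, then call ->drop_item() or put the reference on the subsystem's
behalf) is sketched below as a userspace model with hypothetical names. The
trace hooks exist only to make the ordering checkable.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct model_group;

/* Hypothetical stand-in for a config_item (or a group acting as one). */
struct model_item {
	struct model_group *parent;
	int refcount;
	int nchildren;            /* > 0 only for a group with children */
};

/* The two group operations that matter for rmdir(2). */
struct model_group_ops {
	void (*disconnect_notify)(struct model_group *g, struct model_item *it);
	void (*drop_item)(struct model_group *g, struct model_item *it);
};

struct model_group {
	int nchildren;
	const struct model_group_ops *ops;
};

static int model_rmdir(struct model_group *parent, struct model_item *item)
{
	if (item->nchildren > 0)
		return -ENOTEMPTY;        /* drop_item() is never reached */

	/* 1. Gone from the filesystem view, but still linked. */
	if (parent->ops && parent->ops->disconnect_notify)
		parent->ops->disconnect_notify(parent, item);

	/* 2. Linkage torn down. */
	item->parent = NULL;
	parent->nchildren--;

	/* 3. The subsystem drops its reference, or configfs does it. */
	if (parent->ops && parent->ops->drop_item)
		parent->ops->drop_item(parent, item);
	else
		item->refcount--;         /* stands in for config_item_put() */
	return 0;
}

/* Test hooks that record the order in which they ran. */
static char trace[4];
static int ntrace;

static void note_disconnect(struct model_group *g, struct model_item *it)
{
	(void)g;
	trace[ntrace++] = it->parent ? 'D' : '?';   /* parent still set */
}

static void note_drop(struct model_group *g, struct model_item *it)
{
	(void)g;
	trace[ntrace++] = it->parent ? '?' : 'X';   /* already unlinked */
	it->refcount--;               /* the subsystem's config_item_put() */
}
```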

struct configfs_subsystem
=========================

A subsystem must register itself, usually at module_init time. This
tells configfs to make the subsystem appear in the file tree::

	struct configfs_subsystem {
		struct config_group     su_group;
		struct mutex            su_mutex;
	};

	int configfs_register_subsystem(struct configfs_subsystem *subsys);
	void configfs_unregister_subsystem(struct configfs_subsystem *subsys);

A subsystem consists of a top-level config_group and a mutex.
The group is where child config_items are created. For a subsystem,
this group is usually defined statically. Before calling
configfs_register_subsystem(), the subsystem must have initialized the
group via the usual group _init() functions, and it must also have
initialized the mutex.

When the register call returns, the subsystem is live, and it
will be visible via configfs. At that point, mkdir(2) can be called and
the subsystem must be ready for it.

An Example
==========

The best example of these basic concepts is the simple_children
subsystem/group and the simple_child item in
samples/configfs/configfs_sample.c. It shows a trivial object displaying
and storing an attribute, and a simple group creating and destroying
these children.

Hierarchy Navigation and the Subsystem Mutex
============================================

There is an extra bonus that configfs provides. The config_groups and
config_items are arranged in a hierarchy due to the fact that they
appear in a filesystem. A subsystem is NEVER to touch the filesystem
parts, but the subsystem might be interested in this hierarchy. For
this reason, the hierarchy is mirrored via the config_group->cg_children
and config_item->ci_parent structure members.

A subsystem can navigate the cg_children list and the ci_parent pointer
to see the tree created by the subsystem. This can race with configfs'
management of the hierarchy, so configfs uses the subsystem mutex to
protect modifications. Whenever a subsystem wants to navigate the
hierarchy, it must do so under the protection of the subsystem
mutex.

A subsystem will be prevented from acquiring the mutex while a newly
allocated item has not been linked into this hierarchy. Similarly, it
will not be able to acquire the mutex while a dropping item has not
yet been unlinked. This means that an item's ci_parent pointer will
never be NULL while the item is in configfs, and that an item will only
be in its parent's cg_children list for the same duration. This allows
a subsystem to trust ci_parent and cg_children while it holds the
mutex.
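
As an example of hierarchy navigation, the sketch below builds a path such
as fakenbd/disk1 by walking ci_parent pointers up to the root. It is a
hypothetical userspace model; in a real subsystem the whole walk would run
with su_mutex held, which the comment marks but the code does not simulate.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define MODEL_MAX_DEPTH 32

/* Hypothetical stand-in: only the name and the ci_parent pointer. */
struct model_item {
	const char *ci_name;
	struct model_item *ci_parent;
};

/* Write "a/b/c" for item into buf. Returns the length, or -1 if the tree
 * is too deep or buf is too small. In the kernel, this entire function
 * would run under the subsystem's su_mutex so ci_parent stays valid. */
static int model_item_path(const struct model_item *item, char *buf,
			   size_t size)
{
	const struct model_item *stack[MODEL_MAX_DEPTH];
	int depth = 0;
	size_t used = 0;

	for (; item && depth < MODEL_MAX_DEPTH; item = item->ci_parent)
		stack[depth++] = item;
	if (item || size == 0)
		return -1;
	buf[0] = '\0';
	while (depth-- > 0) {       /* root first */
		int n = snprintf(buf + used, size - used, "%s%s",
				 used ? "/" : "", stack[depth]->ci_name);
		if (n < 0 || (size_t)n >= size - used)
			return -1;
		used += (size_t)n;
	}
	return (int)used;
}
```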

Item Aggregation Via symlink(2)
===============================

configfs provides simple grouping via the group->item parent/child
relationship. Often, however, a larger environment requires aggregation
outside of the parent/child connection. This is implemented via
symlink(2).

A config_item may provide the ct_item_ops->allow_link() and
ct_item_ops->drop_link() methods. If the ->allow_link() method exists,
symlink(2) may be called with the config_item as the source of the link.
These links are only allowed between configfs config_items. Any
symlink(2) attempt outside the configfs filesystem will be denied.

When symlink(2) is called, the source config_item's ->allow_link()
method is called with itself and a target item. If the source item
allows linking to the target item, it returns 0. A source item may wish
to reject a link if it only wants links to a certain type of object (say,
in its own subsystem).

When unlink(2) is called on the symbolic link, the source item is
notified via the ->drop_link() method. Like the ->drop_item() method,
this is a void function and cannot return failure. The subsystem is
responsible for responding to the change.

A config_item cannot be removed while it links to any other item, nor
can it be removed while an item links to it. Dangling symlinks are not
allowed in configfs.
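
A link policy of the kind described above could look like the following
userspace model. The types and the same-type policy are hypothetical;
returning -EPERM for a source item without ->allow_link() is this sketch's
assumption, and the link counters illustrate why neither end of a link can
be removed while the link exists.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct model_item;

/* Hypothetical, trimmed stand-in for configfs_item_operations. */
struct model_item_ops {
	int (*allow_link)(struct model_item *src, struct model_item *target);
	void (*drop_link)(struct model_item *src, struct model_item *target);
};

struct model_item_type {
	const struct model_item_ops *ct_item_ops;
};

struct model_item {
	const struct model_item_type *ci_type;
	int links_out;     /* symlinks this item is the source of */
	int links_in;      /* symlinks pointing at this item */
};

/* Hypothetical policy: only accept targets of the source's own type. */
static int same_type_allow_link(struct model_item *src,
				struct model_item *target)
{
	return target->ci_type == src->ci_type ? 0 : -EINVAL;
}

/* symlink(2): without ->allow_link() there are no links at all. */
static int model_symlink(struct model_item *src, struct model_item *target)
{
	const struct model_item_ops *ops = src->ci_type->ct_item_ops;
	int ret;

	if (!ops || !ops->allow_link)
		return -EPERM;
	ret = ops->allow_link(src, target);
	if (ret)
		return ret;
	src->links_out++;
	target->links_in++;
	return 0;
}

/* unlink(2): ->drop_link() is void, so the source is told, not asked. */
static void model_unlink(struct model_item *src, struct model_item *target)
{
	const struct model_item_ops *ops = src->ci_type->ct_item_ops;

	if (ops && ops->drop_link)
		ops->drop_link(src, target);
	src->links_out--;
	target->links_in--;
}

/* No dangling links: removal needs zero links in both directions. */
static int model_rmdir_allowed(const struct model_item *item)
{
	return item->links_out == 0 && item->links_in == 0;
}
```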

Automatically Created Subgroups
===============================

A new config_group may want to have two types of child config_items.
While this could be codified by magic names in ->make_item(), it is much
more explicit to have a method whereby userspace sees this divergence.

Rather than have a group where some items behave differently than
others, configfs provides a method whereby one or many subgroups are
automatically created inside the parent at its creation. Thus,
mkdir("parent") results in "parent", "parent/subgroup1", up through
"parent/subgroupN". Items of type 1 can now be created in
"parent/subgroup1", and items of type N can be created in
"parent/subgroupN".

These automatic subgroups, or default groups, do not preclude other
children of the parent group. If ct_group_ops->make_group() exists,
other child groups can be created on the parent group directly.

A configfs subsystem specifies default groups by adding them using the
configfs_add_default_group() function to the parent config_group
structure. Each added group is populated in the configfs tree at the same
time as the parent group. Similarly, they are removed at the same time
as the parent. No extra notification is provided. When a ->drop_item()
method call notifies the subsystem that the parent group is going away,
it also means that every default group associated with that parent
group is going away.

As a consequence of this, default groups cannot be removed directly via
rmdir(2). They also are not considered when rmdir(2) on the parent
group is checking for children.
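
Default groups can be modeled as a second child list that the rmdir(2)
emptiness check skips. This hypothetical userspace sketch loosely mirrors
configfs_add_default_group(); it also assumes that items a user created
inside a default group still keep the parent busy, since removing the parent
would otherwise orphan them.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_KIDS 8

/* Hypothetical, trimmed stand-in for a config_group. */
struct model_group {
	const char *name;
	struct model_group *children[MODEL_MAX_KIDS];  /* made by mkdir(2) */
	int nchildren;
	struct model_group *defaults[MODEL_MAX_KIDS];  /* default_groups */
	int ndefaults;
};

/* Loosely mirrors configfs_add_default_group(sub, parent). */
static int model_add_default_group(struct model_group *parent,
				   struct model_group *sub)
{
	if (parent->ndefaults == MODEL_MAX_KIDS)
		return -1;
	parent->defaults[parent->ndefaults++] = sub;
	return 0;
}

/* rmdir(2)'s emptiness check: default groups themselves do not count,
 * but (by this sketch's assumption) user-created children inside them
 * still keep the parent busy. */
static int model_rmdir_ok(const struct model_group *g)
{
	if (g->nchildren > 0)
		return 0;
	for (int i = 0; i < g->ndefaults; i++)
		if (!model_rmdir_ok(g->defaults[i]))
			return 0;
	return 1;
}
```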

Dependent Subsystems
====================

Sometimes other drivers depend on particular configfs items. For
example, ocfs2 mounts depend on a heartbeat region item. If that
region item is removed with rmdir(2), the ocfs2 mount must BUG or go
read-only. Not happy.

configfs provides two additional API calls: configfs_depend_item() and
configfs_undepend_item(). A client driver can call
configfs_depend_item() on an existing item to tell configfs that it is
depended on. configfs will then return -EBUSY from rmdir(2) for that
item. When the item is no longer depended on, the client driver calls
configfs_undepend_item() on it.

These API calls cannot be made from within any configfs callback, as
they will conflict. They can block and allocate. A client driver
probably shouldn't call them on its own initiative. Rather, it should
provide an API that external subsystems call.

How does this work? Imagine the ocfs2 mount process. When it mounts,
it asks for a heartbeat region item. This is done via a call into the
heartbeat code. Inside the heartbeat code, the region item is looked
up. Here, the heartbeat code calls configfs_depend_item(). If it
succeeds, then heartbeat knows the region is safe to give to ocfs2.
If it fails, it was being torn down anyway, and heartbeat can gracefully
pass up an error.
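
The depend/undepend protocol reduces to a counter that rmdir(2) consults.
This userspace model uses hypothetical names: hb_pin_region() stands in for
the heartbeat-side API that ocfs2 would call, and the -ENOENT returned for an
item that is already being dropped is the sketch's own choice.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a depended-on item, e.g. a heartbeat region. */
struct model_item {
	int depend_count;
	bool dropping;      /* rmdir(2) has already started tearing it down */
};

/* Like configfs_depend_item(): refuse items that are going away. */
static int model_depend_item(struct model_item *item)
{
	if (item->dropping)
		return -ENOENT;     /* error code is this sketch's choice */
	item->depend_count++;
	return 0;
}

/* Like configfs_undepend_item(). */
static void model_undepend_item(struct model_item *item)
{
	item->depend_count--;
}

/* rmdir(2) on the item: -EBUSY while anyone depends on it. */
static int model_rmdir(struct model_item *item)
{
	if (item->depend_count > 0)
		return -EBUSY;
	item->dropping = true;
	return 0;
}

/* Hypothetical heartbeat-side API that ocfs2 would call at mount time;
 * any failure is simply passed up to the caller. */
static int hb_pin_region(struct model_item *region)
{
	return model_depend_item(region);
}
```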

Committable Items
=================

Note:
   Committable items are currently unimplemented.

Some config_items cannot have a valid initial state. That is, no
default values can be specified for the item's attributes such that the
item can do its work. Userspace must configure one or more attributes,
after which the subsystem can start whatever entity this item
represents.

Consider the FakeNBD device from above. Without a target address *and*
a target device, the subsystem has no idea what block device to import.
The simple example assumes that the subsystem merely waits until all the
appropriate attributes are configured, and then connects. This will,
indeed, work, but now every attribute store must check if the attributes
are initialized. Every attribute store must fire off the connection if
that condition is met.

Far better would be an explicit action notifying the subsystem that the
config_item is ready to go. More importantly, an explicit action allows
the subsystem to provide feedback as to whether the attributes are
initialized in a way that makes sense. configfs provides this as
committable items.

configfs still uses only normal filesystem operations. An item is
committed via rename(2). The item is moved from a directory where it
can be modified to a directory where it cannot.

Any group that provides the ct_group_ops->commit_item() method has
committable items. When this group appears in configfs, mkdir(2) will
not work directly in the group. Instead, the group will have two
subdirectories: "live" and "pending". The "live" directory does not
support mkdir(2) or rmdir(2) either. It only allows rename(2). The
"pending" directory does allow mkdir(2) and rmdir(2). An item is
created in the "pending" directory. Its attributes can be modified at
will. Userspace commits the item by renaming it into the "live"
directory. At this point, the subsystem receives the ->commit_item()
callback. If all required attributes are filled to satisfaction, the
method returns zero and the item is moved to the "live" directory.
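
Since committable items are unimplemented, the following only sketches the
validation a FakeNBD ->commit_item() might perform. It is a hypothetical
userspace model that takes the container directly rather than a
struct config_item.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/* Hypothetical FakeNBD container state; "" means "not configured". */
struct fakenbd_disk {
	char target[64];
	char device[64];
	int live;
};

/* What a ->commit_item() for FakeNBD might check before going live.
 * A nonzero return would leave the item in "pending". */
static int fakenbd_commit_item(struct fakenbd_disk *disk)
{
	if (disk->target[0] == '\0' || disk->device[0] == '\0')
		return -EINVAL;
	/* A real driver would start the connection here. */
	disk->live = 1;
	return 0;
}
```

With this design, the individual ->store methods only record values; the
single commit point is where the "target *and* device" rule is enforced.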

As rmdir(2) does not work in the "live" directory, an item must be
shut down, or "uncommitted". Again, this is done via rename(2), this
time from the "live" directory back to the "pending" one. The subsystem
is notified by the ct_group_ops->uncommit_object() method.