=============
dm-log-writes
=============

This target takes two devices: one that all IO passes through normally, and one
to which all of the write operations are logged.  It is intended for file system
developers who want to verify the integrity of metadata or data as the file
system is written to.  A log_write_entry is written for every WRITE request, and
the target can also take arbitrary data from userspace and insert it into the
log.  The data in each WRITE request is copied into the log so that the replay
happens exactly as the original writes did.

Log Ordering
============

We log things in order of completion once we are sure the write is no longer in
cache.  This means that normal WRITE requests are not actually logged until the
next REQ_PREFLUSH request.  This makes it easier for userspace to replay the log
in a way that correlates to what is on disk rather than what is in cache, and
therefore easier to detect improper waiting/flushing.

This works by attaching all WRITE requests to a list once the write completes.
Once we see a REQ_PREFLUSH request, we splice this list onto the request, and
once the FLUSH request completes we log all of the WRITEs and then the FLUSH.
Only WRITEs that have completed at the time the REQ_PREFLUSH is issued are
added, in order to simulate the worst case scenario with regard to power
failures.  Consider the following example (W means write, C means complete)::

        W1,W2,W3,C3,C2,Wflush,C1,Cflush

The log would show the following::

        W3,W2,flush,W1....

Again, this is to simulate what is actually on disk.  It allows us to detect
cases where a power failure at a particular point in time would create an
inconsistent file system.
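The ordering rule can be modelled in a few lines of shell.  This is only a toy
sketch, not the kernel implementation: the function name ``simulate_log`` and
the event names are made up for illustration, and it models plain WRITEs and
flushes only (no FUA or DISCARD).

```shell
# Toy model of the log ordering rule (hypothetical, for illustration only).
# Events: W<n> = write n submitted, C<n> = write n completed,
#         Wflush = flush submitted, Cflush = flush completed.
simulate_log() {
    completed="" spliced="" log=""
    for ev in "$@"; do
        case "$ev" in
            # Flush issued: splice the completed writes onto the flush.
            Wflush) spliced=$completed; completed="" ;;
            # Flush completed: log the spliced writes, then the flush.
            Cflush) log="$log$spliced flush"; spliced="" ;;
            # Write completed: queue it until the next flush is issued.
            C*)     completed="$completed W${ev#C}" ;;
            # Write submitted: nothing is logged yet.
            W*)     : ;;
        esac
    done
    echo "logged:$log"
    echo "pending:$completed"
}

simulate_log W1 W2 W3 C3 C2 Wflush C1 Cflush
# prints:
#   logged: W3 W2 flush
#   pending: W1
```

W1 completes after the flush was issued, so in this model it stays pending
until a later flush, which matches the trailing ``W1....`` in the log above.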

Any REQ_FUA requests bypass this flushing mechanism and are logged as soon as
they complete, since those requests bypass the device cache.

Any REQ_OP_DISCARD requests are treated like WRITE requests.  Otherwise we would
have all the DISCARD requests, and then the WRITE requests and then the FLUSH
request.  Consider the following example::

        WRITE block 1, DISCARD block 1, FLUSH

If we logged DISCARD when it completed, the replay would look like this::

        DISCARD 1, WRITE 1, FLUSH

which isn't quite what happened and wouldn't be caught during the log replay.

Target interface
================

i) Constructor

   log-writes <dev_path> <log_dev_path>

   ============= ==============================================
   dev_path      Device that all of the IO will go to normally.
   log_dev_path  Device where the log entries are written to.
   ============= ==============================================
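In a device-mapper table line, the constructor arguments follow the start
sector, the length in 512-byte sectors and the target name.  A minimal sketch
of assembling such a line is shown here; the helper name ``log_writes_table``
is hypothetical, and the size of 2048 sectors is an arbitrary illustrative
value (in practice it would come from ``blockdev --getsz``).

```shell
# Hypothetical helper: print a log-writes table line.
# $1 = device size in 512-byte sectors, $2 = dev_path, $3 = log_dev_path
log_writes_table() {
    printf '0 %s log-writes %s %s\n' "$1" "$2" "$3"
}

log_writes_table 2048 /dev/sdb /dev/sdc
# prints: 0 2048 log-writes /dev/sdb /dev/sdc
```

The resulting string is what would be passed to ``dmsetup create`` with the
``--table`` option.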

ii) Status

    <#logged entries> <highest allocated sector>

    =========================== ========================
    #logged entries             Number of logged entries
    highest allocated sector    Highest allocated sector
    =========================== ========================
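When queried with ``dmsetup status <name>``, these two fields should appear
after the usual start sector, length and target name.  The following sketch
picks them out of one such line; the helper name ``log_writes_status`` and the
sample line are invented for illustration and are not real output.

```shell
# Hypothetical helper: extract the log-writes fields from one status line.
# $1 = a line of the form "<start> <length> log-writes <entries> <sector>"
log_writes_status() {
    set -- $1        # split the line into whitespace-separated fields
    echo "entries=$4 highest_sector=$5"
}

# Sample line with made-up numbers.
log_writes_status "0 2048 log-writes 12 4096"
# prints: entries=12 highest_sector=4096
```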

iii) Messages

    mark <description>

        You can use a dmsetup message to set an arbitrary mark in a log.
        For example, say you want to fsck a file system after every
        write, but first you need to replay up to the mkfs to make sure
        you are fsck'ing something reasonable.  You would do something
        like this::

          mkfs.btrfs -f /dev/mapper/log
          dmsetup message log 0 mark mkfs
          <run test>

        This would allow you to replay the log up to the mkfs mark and
        then replay from that point on, doing the fsck check in the
        interval that you want.

        Every log has a mark at the end labeled "dm-log-writes-end".
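If a test script sets many marks, the pattern of "run a step, then mark it"
can be wrapped in a small function.  This is a sketch under assumptions: the
helper ``run_and_mark`` and the ``DMSETUP`` override variable are hypothetical
and exist only so that the example can be dry-run without a real device.

```shell
# Hypothetical helper: run a command and, if it succeeds, set a mark.
# $1 = dm device name, $2 = mark description, remaining args = command.
# DMSETUP may be overridden (for example DMSETUP=echo) for a dry run.
run_and_mark() {
    dev=$1 mark=$2
    shift 2
    "$@" || return
    ${DMSETUP:-dmsetup} message "$dev" 0 mark "$mark"
}

# Dry run: print the dmsetup arguments instead of executing them.
DMSETUP=echo run_and_mark log mkfs true
# prints: message log 0 mark mkfs
```

With ``DMSETUP`` unset, the same call would run
``dmsetup message log 0 mark mkfs`` after the command succeeds.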

Userspace component
===================

There is a userspace tool that will replay the log for you in various ways.
It can be found here: https://github.com/josefbacik/log-writes

Example usage
=============

Say you want to test fsync on your file system.  You would do something like
this::

  TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
  dmsetup create log --table "$TABLE"
  mkfs.btrfs -f /dev/mapper/log
  dmsetup message log 0 mark mkfs

  mount /dev/mapper/log /mnt/btrfs-test
  <some test that does fsync at the end>
  dmsetup message log 0 mark fsync
  md5sum /mnt/btrfs-test/foo
  umount /mnt/btrfs-test

  dmsetup remove log
  replay-log --log /dev/sdc --replay /dev/sdb --end-mark fsync
  mount /dev/sdb /mnt/btrfs-test
  md5sum /mnt/btrfs-test/foo
  <verify md5sums are correct>
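The final verification step can be scripted by saving the two ``md5sum``
outputs and comparing their digests.  A minimal sketch follows; the helper
name ``same_digest`` is hypothetical, and the digest shown is only a
placeholder (it is the MD5 of empty input), not a result from a real test.

```shell
# Hypothetical helper: succeed if two lines of md5sum output carry the
# same digest, ignoring the file name that follows it.
same_digest() {
    [ "${1%% *}" = "${2%% *}" ]
}

# Placeholder values standing in for the md5sum output captured before
# the umount and after the replay.
before="d41d8cd98f00b204e9800998ecf8427e  /mnt/btrfs-test/foo"
after="d41d8cd98f00b204e9800998ecf8427e  /mnt/btrfs-test/foo"
same_digest "$before" "$after" && echo "md5sums match"
# prints: md5sums match
```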

Another option is to do a complicated file system operation and verify the file
system is consistent during the entire operation.  You could do this with::

  TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
  dmsetup create log --table "$TABLE"
  mkfs.btrfs -f /dev/mapper/log
  dmsetup message log 0 mark mkfs

  mount /dev/mapper/log /mnt/btrfs-test
  <fsstress to dirty the fs>
  btrfs filesystem balance /mnt/btrfs-test
  umount /mnt/btrfs-test
  dmsetup remove log

  replay-log --log /dev/sdc --replay /dev/sdb --end-mark mkfs
  btrfsck /dev/sdb
  replay-log --log /dev/sdc --replay /dev/sdb --start-mark mkfs \
        --fsck "btrfsck /dev/sdb" --check fua

This replays the log until it sees a FUA request and then runs the fsck command.
If the fsck passes, it replays to the next FUA, and so on until the replay is
complete or the fsck command exits abnormally.