Jupiter Broadcasting

For the love of ZFS | BSD Now 203

RSS Feeds:

MP3 Feed | iTunes Feed | HD Vid Feed | HD Torrent Feed

Become a supporter on Patreon:

– Show Notes: –

Headlines

ZFS is the best file system (for now)

ZFS should have been great, but I kind of hate it: ZFS seems to be trapped in the past, before it was sidelined it as the cool storage project of choice; it’s inflexible; it lacks modern flash integration; and it’s not directly supported by most operating systems. But I put all my valuable data on ZFS because it simply offers the best level of data protection in a small office/home office (SOHO) environment. Here’s why.
When ZFS first appeared in 2005, it was absolutely with the times, but it’s remained stuck there ever since. The ZFS engineers did a lot right when they combined the best features of a volume manager with a “zettabyte-scale” filesystem in Solaris 10
The skies first darkened in 2007, as NetApp sued Sun, claiming that their WAFL patents were infringed by ZFS. Sun counter-sued later that year, and the legal issues dragged on.

By then, Sun was hitting hard times and Oracle swooped in to purchase the company. This sowed further doubt about the future of ZFS, since Oracle did not enjoy wide support from open source advocates.

the CDDL license Sun applied to the ZFS code was [https://sfconservancy.org/blog/2016/feb/25/zfs-and-linux/](judged incompatible) with the GPLv2 that covers Linux, making it a non-starter for inclusion in the world’s server operating system.

Although OpenSolaris continued after the Oracle acquisition, and FreeBSD embraced ZFS, this was pretty much the extent of its impact outside the enterprise. Sure, NexentaStor and https://blog.fosketts.net/2008/09/15/greenbytes-embraces-extends-zfs/ helped push ZFS forward in the enterprise, but Oracle’s lackluster commitment to Sun in the datacenter started having an impact.

OpenZFS remains little-changed from what we had a decade ago.

Many remain skeptical of deduplication, which hogs expensive RAM in the best-case scenario.

And I do mean expensive: Pretty much every ZFS FAQ flatly declares that ECC RAM is a must-have and 8 GB is the bare minimum. In my own experience with FreeNAS, 32 GB is a nice amount for an active small ZFS server, and this costs $200-$300 even at today’s prices.

And ZFS never really adapted to today’s world of widely-available flash storage: Although flash can be used to support the ZIL and L2ARC caches, these are of dubious value in a system with sufficient RAM, and ZFS has no true hybrid storage capability. It’s laughable that the ZFS documentation obsesses over a few GB of SLC flash when multi-TB 3D NAND drives are on the market. And no one is talking about NVMe even though it’s everywhere in performance PC’s.

Then there’s the question of flexibility, or lack thereof. Once you build a ZFS volume, it’s pretty much fixed for life. There are only three ways to expand a storage pool:

  1. Replace each and every drive in the pool with a larger one (which is great but limiting and expensive)
  1. Add a stripe on another set of drives (which can lead to imbalanced performance and redundancy and a whole world of potential stupid stuff)
  1. Build a new pool and “zfs send” your datasets to it (which is what I do, even though it’s kind of tricky)

Apart from option 3 above, you can’t shrink a ZFS pool.

I’ve probably made ZFS sound pretty unappealing right about now. It was revolutionary but now it’s startlingly limiting and out of touch with the present solid-state-dominated storage world.

After all, reliably storing data is the only thing a storage system really has to do. All my important data goes on ZFS, from photos to music and movies to office files. It’s going to be a long time before I trust anything other than ZFS!
+ I agree.
+ ZFS has a great track record of doing its most important job, keeping your data safe.
+ Work is ongoing to make ZFS more performance, and more flexible. The import thing is that this work is never allowed to compromise job #1, keeping your data safe.
+ Hybrid/tiered storage features, re-RAID-ing, are coming
+ There is a lot going on with OpenZFS, check out the notes from the last two OpenZFS Developer Summits just to get an idea of what some of those things are:

2015 & 2016


Writing a NetBSD kernel module

Kernel modules are object files used to extend an operating system’s kernel functionality at run time.
In this post, we’ll look at implementing a simple character device driver as a kernel module in NetBSD. Once it is loaded, userspace processes will be able to write an arbitrary byte string to the device, and on every successive read expect a cryptographically-secure pseudorandom permutation of the original byte string.


DragonFlyBSD Desktop!

If you read my last post, you know that I set up a machine (Thinkpad x230) with UEFI and four operating systems on it. One, I had no experience with – DragonFlyBSD (other than using Matthew Dillon’s C compiler for the Amiga back in the day!) and so it was uncharted territory for me. After getting the install working, I started playing around inside of DragonFlyBSD and discovered to my delight that it was a great operating system with some really unique features – all with that BSD commitment to good documentation and a solid coupling of kernel and userland that doesn’t exist (by design) in Linux.
So my goal for my DragonFlyBSD desktop experience was to be as BSD as I possibly could. Given that (and since I’m the maintainer of the port on OpenBSD ), I went with Lumina as the desktop environment and XDM as the graphical login manager. I have to confess that I really like the xfce terminal application so I wanted to make sure I had that as well. Toss in Firefox, libreOffice and ownCloud sync client and I’m good to go!
OK. So where to start. First, we need to get WiFi and wired networking happening for the console at login. To do that, I added the following to /etc/rc.conf:

wlans_iwn0=”wlan0″
ifconfig_wlan0=”WPA DHCP”
ifconfig_em0=”DHCP”

I then edited /etc/wpa_supplicant.conf to put in the details of my WiFi network:

network={
ssid=”MY-NETWORK-NAME”
psk=”my-super-secret-password”
}

A quick reboot showed that both wired and wireless networking were functional and automatically were assigned IP addresses via DHCP. Next up is to try getting into X with whatever DragonFlyBSD uses for its default window manager. A straight up “startx” met with, shall we say, less than stellar results. Therefore, I used the following command to generate a simple /etc/X11/xorg.conf file:

# Xorg -configure
# cp /root/xorg.conf.new /etc/X11/xorg.conf

With that file in place, I could get into the default window manager, but I had no mouse. After some searching and pinging folks on the mailing list, I was able to figure out what I needed to do. I added the following to my /etc/rc.conf file:

moused_enable=”YES”
moused_type=”auto”
moused_port=”/dev/psm0″

I rebooted (I’m sure there is an easier way to get the changes but I don’t know it… yet) and was able to get into a basic X session and have a functional mouse. Next up, installing and configuring Lumina! To do that, I went through the incredibly torturous process of installing Lumina:

# pkg install lumina

Wow! That was really, really hard. I might need to pause here to catch my breath.

Next up, jumping into Lumina from the console. To do that, I created a .xinitrc file in my home directory with the following:

exec start-lumina-desktop

From there, I could “startx” until my heart was content and bounce into Lumina. That wasn’t good enough though! I want a graphical login (specifically xdm). To do that, I had to do a little research. The trick on DragonFlyBSD is not to add anything to /etc/rc.conf like you do in other BSDs, it’s a bit more old school. Basically you need to edit the /etc/ttys file and update ttyv8 to turn on the xdm daemon:

ttyv8 “/usr/X11R6/bin/xdm -nodaemon” xterm on secure

The other thing you need to do is set it up to use your desktop environment of choice. In my case, that’s Lumina. To do that, I needed to edit /usr/local/lib/X11/xdm/Xsession and change the next to the last line in the file to launch Lumina:

exec /usr/local/bin/start-lumina-desktop

I then crossed my fingers, rebooted and lo and behold had a graphical login that, when I actually didn’t fat finger my password from excitement, put me into the Lumina desktop environment!
Next up – I need a cool desktop wallpaper. Of course that’s way more important that installing application or other stuff! After some searching, I found this one that met my needs. I downloaded it to a local ~/Pictures directory and then used the Lumina wallpaper preference application to add the directory containing the picture and set it to automatic layout. Voila! I had a cool DragonFlyBSD wallpaper.
Next I installed the xfce4 terminal program by doing:

# pkg install xfce4-terminal

I then went into the Lumina “All Desktop Settings” preferences, found the applet for the “Menu” under “Interface Configuration” and swapped out “terminal” for “Xfce Terminal”. I then configured Lumina further to have a 26 pixel thick, 99% length bottom center panel with the following gadgets in it (in this order):

Start Menu
Task Manager (No Groups)
Spacer
System Tray
Time/Date
Battery Monitor

I then went into my Appearance | Window Manager gadget and set my Window Theme to “bora_blue” (my favorite out of the defaults supplied). I then installed my remaining applications that I needed in order to have a functioning desktop:

# pkg install owncloudclient qtkeychain evolution evolution-ews firefox libreoffice

After that, I really had a nicely functioning desktop environment! By the way, the performance of DragonFlyBSD is pretty impressive in terms of its day to day usage. Keep in mind I’m not doing any official benchmarking or anything, but it sure feels to me to be just as fast (if not faster) than OpenBSD and FreeBSD. I know that the kernel team has done a lot to unlock things (which FreeBSD has done and we are starting to do on OpenBSD) so perhaps I can attribute the “snappiness” to that?
As you can see, although there isn’t as much documentation on the Internet for this BSD, you can get a really nice, functional desktop out of it with some simple (and intuitive) configuration. I’m really looking forward to living in this system for a while and learning about it. Probably the first thing I’ll do is ring up the port maintainer for Lumina and see if we can’t collaborate on getting Lumina 1.3 moved over to it! Give this one a try – I think you’ll find that its a very nice operating system with some very cool features (the HAMMER filesystem for one!).


News Roundup

Porting NetBSD to Alwinner H3 SoCs

A new SUNXI evbarm kernel has appeared recently in NetBSD -current with support for boards based on the Allwinner H3 system on a chip (SoC). The H3 SoC is a quad-core Cortex-A7 SoC designed primarily for set-top boxes, but has managed to find its way into many single-board computers (SBC). This is one of the first evbarm ports built from the ground up with device tree support, which helps us to use a single kernel config to support many different boards.
To get these boards up and running, first we need to deal with low-level startup code. For the SUNXI kernel this currently lives in sys/arch/evbarm/sunxi/. The purpose of this code is fairly simple; initialize the boot CPU and initialize the MMU so we can jump to the kernel. The initial MMU configuration needs to cover a few things — early on we need to be able to access the kernel, UART debug console, and the device tree blob (DTB) passed in from U-Boot. We wrap the kernel in a U-Boot header that claims to be a Linux kernel; this is no accident! This tells U-Boot to use the Linux boot protocol when loading the kernel, which ensures that the DTB (loaded by U-Boot) is processed and passed to us in r2.
Once the CPU and MMU are ready, we jump to the generic ARM FDT implementation of initarm in sys/arch/evbarm/fdt/fdt_machdep.c. The first thing this code does is validate and relocate the DTB data. After it has been relocated, we compare the compatible property of the root node in the device tree with the list of ARM platforms compiled into the kernel. The Allwinner sunxi platform code lives in sys/arch/arm/sunxi/sunxi_platform.c. The sunxi platform code provides SoC-specific versions of code needed early at boot. We need to know how to initialize the debug console, spin up application CPUs, reset the board, etc.
With a bit of luck, we’re now booting and enumerating devices. Apart from a few devices, almost nothing works yet as we are missing a driver for the CCU. The CCU in the Allwinner H3 SoC controls PLLs and most of the clock generation, division, muxing, and gating. Since there are many similarities between Allwinner SoCs, I opted to write generic CCU code and then SoC-specific frontends. The resulting code lives in sys/arch/arm/sunxi/; generic code as sunxi_ccu.c and H3-specific code in sun8i_h3_ccu.c.


TrueOS Community Spotlight: Reset ZFS Replication using the command line

We’d like to spotlight TrueOS community member Brad Alexander for documenting his experience repairing ZFS dataset replication with TrueOS. Thank you! His notes are posted here, and they’ve been added to the TrueOS handbook Troubleshooting section for later reference.

The SysAdm Client tray icon was pulsing red. Right-clicking on the icon and clicking Messages would show the message:

FAILED replication task on NCC74602 -> 192.168.47.20: LOGFILE: /var/log/lpreserver/lpreserver_failed.log

/var/log/lpreserver/lastrep-send.log shows very little information:

send from @auto-2017-07-12-01-00-00 to NCC74602/ROOT/12.0-CURRENT-up-20170623_120331@auto-2017-07-14-01-00-00
total estimated size is 0
TIME SENT SNAPSHOT

And no useful errors were being written to the lpreserver_failed.log.

sudo lpreserver replicate init NCC74602 192.168.47.20
sudo lpreserver replicate run NCC74602 192.168.47.20

Unfortunately, the replication failed again. I got these messages in the logs:

Fri Jul 14 09:03:34 EDT 2017: Removing NX80101/archive/yukon.sonsofthunder.nanobit.org/ROOT - re-created locally
cannot unmount '/mnt/NX80101/archive/yukon.sonsofthunder.nanobit.org/ROOT': Operation not permitted
Failed creating remote dataset!
cannot create 'NX80101/archive/yukon.sonsofthunder.nanobit.org/ROOT': dataset already exists

It turned out there were a number of children. I logged into luna (the FreeNAS) and issued this command as root:

zfs destroy -r NX80101/archive/defiant.sonsofthunder.nanobit.org

I then ran the replicate init and replicate run commands again from the TrueOS host, and replication worked! It has continued to work too, at least until the next fiddly bit breaks.


Kernel relinking status from Theo de Raadt

As you may have heard (and as was mentioned in an earlier article), on recent OpenBSD snapshots we have KARL, which means that the kernel is relinked so each boot comes with a new kernel where all .o files are linked in random order and with random offsets. Theo de Raadt summarized the status in a message to the tech@ mailing list, subject kernel relinking as follows:

5 weeks ago at d2k17 I started work on randomized kernels. I’ve been having conversations with other developers for nearly 5 years on the topic… but never got off to a good start, probably because I was trying to pawn the work off on others.
Having done this, I really had no idea we’d end up where we are today.
Here are the behaviours:
The base set has grown, it now includes a link-kit
At install time, a unique kernel is built from this link-kit
At upgrade time, a unique kernel is built
At boot time, a unique kernel is built and installed for the next boot
If someone compiles their own kernel and uses ‘make install’, the link-kit is also updated, so that the next boot can do the work
If a developer cp’s a kernel to /, the mechanism dis-engages until a ‘make install” or upgrade is performed in the future. That may help debugging.
A unique kernel is linked such that the startup assembly code is kept in the same place, followed by randomly-sized gapping, followed by all the other .o files randomly re-organized. As a result the distances between functions and variables are entirely new. An info leak of a pointer will not disclose other pointers or objects. This may also help reduce gadgets on variable-sized architectures, because polymorphism in the instruction stream is damaged by nested offsets changing.
At runtime, the kernel can unmap it’s startup code. On architectures where an unmap isn’t possible due to large-PTE use, the code can be trashed instead.
I did most of the kernel work on amd64 first, then i386. I explained what needed to be done to visa and patrick, who handled the arm and mips platforms. Then I completed all the rest of the architectures. Some architecture are missing the unmap/trashing of startup code, that’s a step we need to get back to.
The next part was tricky and I received assistance from tb and rpe. We had to script the behaviours at build time, snapshot time, relink time, boot time, and later on install/upgrade time also.
While they helped, I also worked to create the “gap.o” file without use of a C compiler so that relinks can occur without the “comp” set installed. This uses the linkscript feature of ld. I don’t think anyone has done this before. It is little bit fragile, and the various linkers may need to receive some fixes due to what I’ve encountered.
To ensure this security feature works great in the 6.2 release, please test snapshots. By working well, I mean it should work invisibly, without any glitch such as a broken kernel or anything. If there are ugly glitches we would like to know before the release.

You heard the man, now is the time to test and get a preview of the new awsomeness that will be in OpenBSD 6.2.


Beastie Bits


Feedback/Questions


Final Note