Jupiter Broadcasting

Bowling in the LimeLight | BSD Now 241

Headlines

Other big ZFS improvements you might have missed


**Digital Ocean**

PostgreSQL developers find that every operating system other than FreeBSD and Illumos might corrupt your data

Some time ago I ran into an issue where a user encountered data corruption after a storage error. PostgreSQL played a part in that corruption by allowing a checkpoint to complete after what should’ve been a fatal error.
TL;DR: Pg should PANIC on fsync() EIO return. Retrying fsync() is not OK at least on Linux. When fsync() returns success it means “all writes since the last fsync have hit disk” but we assume it means “all writes since the last SUCCESSFUL fsync have hit disk”.
Pg wrote some blocks, which went to OS dirty buffers for writeback. Writeback failed due to an underlying storage error. The block I/O layer and XFS marked the writeback page as failed (AS_EIO), but had no way to tell the app about the failure. When Pg called fsync() on the FD during the next checkpoint, fsync() returned EIO because of the flagged page, to tell Pg that a previous async write failed. Pg treated the checkpoint as failed and didn’t advance the redo start position in the control file.
+ All good so far.
But then we retried the checkpoint, which retried the fsync(). The retry succeeded, because the prior fsync() *cleared the AS_EIO bad page flag*.
The write never made it to disk, but we completed the checkpoint, and merrily carried on our way. Whoops, data loss.
The clear-error-and-continue behaviour of fsync is not documented as far as I can tell. Nor is fsync() returning EIO, unless you have a very new Linux man-pages with the patch I wrote to add it. But from what I can see in the POSIX standard we are not given any guarantees about what happens on fsync() failure at all, so we’re probably wrong to assume that retrying fsync() is safe.
We already PANIC on fsync() failure for WAL segments. We just need to do the same for data forks at least for EIO. This isn’t as bad as it seems because AFAICS fsync only returns EIO in cases where we should be stopping the world anyway, and many FSes will do that for us.
+ Upon further looking, it turns out it is not just Linux brain damage:
Apparently I was too optimistic. I had looked only at FreeBSD, which keeps the page around and dirties it so we can retry, but the other BSDs apparently don’t (FreeBSD changed that in 1999).
From what I can tell from the sources below, we have:
Linux, OpenBSD, NetBSD: retrying fsync() after EIO lies
FreeBSD, Illumos: retrying fsync() after EIO tells the truth
+ NetBSD PR to solve the issues
+ I/O errors are not reported back to fsync at all.
+ Write errors during genfs_putpages that fail for any reason other than ENOMEM cause the data to be semi-silently discarded.
+ It appears that UVM pages are marked clean when they’re selected to be written out, not after the write succeeds; so there are a bunch of potential races when writes fail.
+ It appears that write errors for buffercache buffers are semi-silently discarded as well.
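
The "PANIC instead of retry" policy from the TL;DR above can be sketched in a few lines. This is only an illustration in Python, not PostgreSQL's C code: the names `checkpoint_file` and `StorageFailure` are hypothetical, and the `fsync` parameter exists only so the failure path can be exercised without a failing disk.

```python
import errno
import os


class StorageFailure(RuntimeError):
    """Raised when fsync() fails; the caller should stop and recover from its log."""


def checkpoint_file(fd, fsync=os.fsync):
    """Flush fd once and never retry on failure.

    `fsync` is injectable for testing only; in real use it is os.fsync.
    """
    try:
        fsync(fd)
    except OSError as exc:
        # Per the discussion above, a second fsync() may report success on
        # Linux, OpenBSD and NetBSD even though the dirty pages were dropped,
        # so a successful retry would prove nothing.
        name = errno.errorcode.get(exc.errno, str(exc.errno))
        raise StorageFailure(f"fsync failed ({name}); not retrying") from exc
```

The caller is expected to treat `StorageFailure` as fatal for the checkpoint and replay from the write-ahead log, rather than looping until fsync() stops complaining.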


Interview – Kevin Bowling: Senior Manager Engineering of LimeLight Networks – kbowling@llnw.com / @kevinbowling1


News Roundup

BSDCan 2018 Selected Talks


iXsystems

Cryptographic Right Answers

There are, in the literature and in the most sophisticated modern systems, “better” answers for many of these items. If you’re building for low-footprint embedded systems, you can use STROBE and a sound, modern, authenticated encryption stack entirely out of a single SHA-3-like sponge construction. You can use NOISE to build a secure transport protocol with its own AKE. Speaking of AKEs, there are, like, 30 different password AKEs you could choose from.

But if you’re a developer and not a cryptography engineer, you shouldn’t do any of that. You should keep things simple and conventional and easy to analyze; “boring”, as the Google TLS people would say.

Encrypting data:
Percival, 2009: AES-CTR with HMAC.
Ptacek, 2015: (1) NaCl/libsodium’s default, (2) ChaCha20-Poly1305, or (3) AES-GCM.
Latacora, 2018: KMS or XSalsa20+Poly1305

Symmetric key length:
Percival, 2009: Use 256-bit keys.
Ptacek, 2015: Use 256-bit keys.
Latacora, 2018: Go ahead and use 256-bit keys.

Symmetric “signatures”:
Percival, 2009: Use HMAC.
Ptacek, 2015: Yep, use HMAC.
Latacora, 2018: Still HMAC.
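
As a rough illustration of the HMAC advice above, here is a sketch using only Python's standard library. The helper names `sign` and `verify` are made up for this example, and pairing HMAC with SHA-256 is an assumption based on the SHA-2 recommendation elsewhere in the list.

```python
import hashlib
import hmac
import secrets


def sign(key: bytes, message: bytes) -> bytes:
    """Return an HMAC-SHA256 tag for message under key."""
    return hmac.new(key, message, hashlib.sha256).digest()


def verify(key: bytes, message: bytes, tag: bytes) -> bool:
    """Check a tag in constant time to avoid leaking where a mismatch occurs."""
    return hmac.compare_digest(sign(key, message), tag)


key = secrets.token_bytes(32)  # 256-bit key, as recommended above
tag = sign(key, b"hello")
```

Comparing tags with `hmac.compare_digest` rather than `==` is the part that is easiest to get wrong by hand.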

Hashing algorithm:
Percival, 2009: Use SHA256 (SHA-2).
Ptacek, 2015: Use SHA-2.
Latacora, 2018: Still SHA-2.

Random IDs:
Percival, 2009: Use 256-bit random numbers.
Ptacek, 2015: Use 256-bit random numbers.
Latacora, 2018: Use 256-bit random numbers.
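
A 256-bit random identifier, as recommended above, could be generated like this in Python. The function name `new_id` is hypothetical, and hex encoding is just one convenient way to represent the value.

```python
import secrets


def new_id() -> str:
    """Return 256 bits from the OS CSPRNG as a 64-character hex string."""
    return secrets.token_hex(32)  # 32 bytes = 256 bits
```

The `secrets` module draws from the operating system's cryptographic random source, unlike the general-purpose `random` module.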

Password handling:
Percival, 2009: scrypt or PBKDF2.
Ptacek, 2015: In order of preference, use scrypt, bcrypt, and then if nothing else is available PBKDF2.
Latacora, 2018: In order of preference, use scrypt, argon2, bcrypt, and then if nothing else is available PBKDF2.
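
For the first choice in the list above, scrypt, Python's standard library exposes `hashlib.scrypt` when built against a recent OpenSSL. The sketch below is an assumption-laden example: the helper names are hypothetical and the cost parameters (`n`, `r`, `p`) are commonly seen illustrative values, not a recommendation from the article.

```python
import hashlib
import hmac
import secrets

# Assumed cost parameters for illustration; tune them for your hardware.
SCRYPT_PARAMS = dict(n=2**14, r=8, p=1, maxmem=64 * 1024 * 1024, dklen=32)


def hash_password(password: str):
    """Return (salt, digest) for storage alongside the user record."""
    salt = secrets.token_bytes(16)  # fresh random salt per password
    digest = hashlib.scrypt(password.encode(), salt=salt, **SCRYPT_PARAMS)
    return salt, digest


def check_password(password: str, salt: bytes, expected: bytes) -> bool:
    """Recompute the digest with the stored salt and compare in constant time."""
    candidate = hashlib.scrypt(password.encode(), salt=salt, **SCRYPT_PARAMS)
    return hmac.compare_digest(candidate, expected)
```

Storing the parameters next to the salt and digest makes it possible to raise the cost later without invalidating existing hashes.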

Asymmetric encryption:
Percival, 2009: Use RSAES-OAEP with SHA256 and MGF1+SHA256 bzzrt pop ffssssssst exponent 65537.
Ptacek, 2015: Use NaCl/libsodium (box / crypto_box).
Latacora, 2018: Use NaCl/libsodium (box / crypto_box).

Asymmetric signatures:
Percival, 2009: Use RSASSA-PSS with SHA256 then MGF1+SHA256 in tricolor systemic silicate orientation.
Ptacek, 2015: Use NaCl, Ed25519, or RFC6979.
Latacora, 2018: Use NaCl or Ed25519.

Diffie-Hellman:
Percival, 2009: Operate over the 2048-bit Group #14 with a generator of 2.
Ptacek, 2015: Probably still DH-2048, or NaCl.
Latacora, 2018: Probably nothing. Or use Curve25519.

Website security:
Percival, 2009: Use OpenSSL.
Ptacek, 2015: Remains: OpenSSL, or BoringSSL if you can. Or just use AWS ELBs
Latacora, 2018: Use AWS ALB/ELB or OpenSSL, with LetsEncrypt

Client-server application security:
Percival, 2009: Distribute the server’s public RSA key with the client code, and do not use SSL.
Ptacek, 2015: Use OpenSSL, or BoringSSL if you can. Or just use AWS ELBs
Latacora, 2018: Use AWS ALB/ELB or OpenSSL, with LetsEncrypt

Online backups:
Percival, 2009: Use Tarsnap.
Ptacek, 2015: Use Tarsnap.
Latacora, 2018: Store PMAC-SIV-encrypted arc files to S3 and save fingerprints of your backups to an ERC20-compatible blockchain. Just kidding. You should still use Tarsnap.


Adding IPv6 to an existing server

I am adding IPv6 addresses to each of my servers. This post assumes the server is up and running FreeBSD 11.1 and you already have an IPv6 address block. This does not cover the creation of an IPv6 tunnel, such as that provided by HE.net. This assumes native IPv6.

In this post, I am using the IPv6 addresses from the IPv6 Address Prefix Reserved for Documentation (i.e. 2001:DB8::/32). You should use your own addresses.

The IPv6 block I have been assigned is 2001:DB8:1001:8d00::/64.

I added this to /etc/rc.conf:


```
ipv6_activate_all_interfaces="YES"
ipv6_defaultrouter="2001:DB8:1001:8d00::1"
ifconfig_em1_ipv6="inet6 2001:DB8:1001:8d00:d389:119c:9b57:396b prefixlen 64 accept_rtadv" # ns1
```

The IPv6 address I have assigned to this host is completely random (within the given block). I found a random IPv6 address generator and used it to select d389:119c:9b57:396b as the address for this service within my address block.

I don’t have the reference, but I did read that randomly selecting addresses within your block is a better approach.
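
If you would rather not rely on an online generator, picking a random host address inside a prefix can be sketched with Python's standard library. The function name `random_host` is hypothetical, and the prefix is the documentation block used in this post.

```python
import ipaddress
import secrets


def random_host(prefix: str) -> ipaddress.IPv6Address:
    """Pick a uniformly random address inside the given IPv6 prefix."""
    net = ipaddress.IPv6Network(prefix)
    host_bits = 128 - net.prefixlen  # 64 bits for a /64
    # Indexing a network returns network_address + offset.
    return net[secrets.randbits(host_bits)]


print(random_host("2001:db8:1001:8d00::/64"))
```

In principle this can return the all-zeros host part of the prefix, but with 64 random bits that is vanishingly unlikely.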

In order to invoke these changes without rebooting, I issued these commands:

```
[dan@tallboy:~] $ sudo ifconfig em1 inet6 2001:DB8:1001:8d00:d389:119c:9b57:396b prefixlen 64 accept_rtadv
[dan@tallboy:~] $

[dan@tallboy:~] $ sudo route add -inet6 default 2001:DB8:1001:8d00::1
add net default: gateway 2001:DB8:1001:8d00::1
```

If you do the route add first, you will get this error:


```
[dan@tallboy:~] $ sudo route add -inet6 default 2001:DB8:1001:8d00::1
route: writing to routing socket: Network is unreachable
add net default: gateway 2001:DB8:1001:8d00::1 fib 0: Network is unreachable
```


Beastie Bits


Tarsnap

Feedback/Questions