Recovering from the Contreras Fire
Benjamin Weaver, NOIRLab
January 24, 2023
DESI collaborators have previously reported on the Contreras Fire and how it initially affected DESI operations, along with an update once personnel were able to return to Kitt Peak National Observatory (KPNO). Even as we were cleaning up KPNO and the DESI instrument in August 2022, we knew that it might be several months before the Tohono O’odham Utility Authority (TOUA) could replace the infrastructure bringing electrical power and line internet (a fiber-optic link) to KPNO. In fact, line internet was not restored until December 7, 2022, almost exactly six months after the start of the fire.
The story of restoring electrical power, and of the generators used while restoration was in progress, is definitely worth telling, but it is far longer and more detailed than the story I’d like to tell now: recovering from the missing internet connection.
The DESI Data Management team, working closely with the on-site NOIRLab personnel, developed a plan to work around the missing internet, even as DESI returned to full-night operations. It wasn’t a new plan; in fact, it was a very old plan: “Sneakernet”. The idea is that a human being (presumably wearing sneakers) can hand-carry data on physical media from one location to another, and in certain circumstances, the effective bandwidth of that transfer can be much, much higher than the bandwidth of an internet connection or other communication channel. This is even more true now than it was in the past, because the storage density of physical media has grown exponentially, much like the growth Moore’s law describes for processors. However, while Moore’s law has been showing signs of reaching fundamental physical limits, storage density continues to rise.
With that in mind, we purchased six 2 TB SanDisk solid-state drives (SSDs) and a few fast USB cables. There is nothing exotic about the hardware; these items are available in many consumer electronics or department stores. With all six disks initially carried to KPNO, we would place an entire night of data (or sometimes several nights) on an SSD. One of the personnel working at KPNO would be designated to carry the SSD to Tucson at the end of the work day. Then we would pick up the disk, either in person or from a drop box, and copy the data to NERSC, where it could be processed with the usual DESI data pipeline. Because we purchased six SSDs, we always had several on standby at KPNO in case an already-transferred SSD could not be brought back from Tucson to KPNO the next day. On weekends, no one was available to carry the SSDs, so Monday evening’s SSD simply carried several nights of data. From the point of view of the pipeline, the only difference was that an entire night (or several nights) would arrive all at once, after a delay of 1 to 3 days, instead of one exposure at a time, almost immediately after each exposure is completed, as in normal operations.
The DESI instrument resumed taking data starting with night 20220825 (August 25, 2022). The first SSD was brought to Tucson on September 2, 2022. This process continued until line internet was restored on December 7, 2022. The original data transfer system was ready to go at that point and was reactivated almost immediately after line internet was restored. Thus the last night of data transferred by sneakernet was 20221206 (December 6, 2022). In total, 16,726 exposures over 100(!) nights were hand-carried to Tucson. This amounts to 6.1 TB of data.
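To put those totals in perspective, here is a rough back-of-the-envelope sketch in Python. The exposure count, night count, total volume, and SSD capacity come from the numbers above; the 2-hour transit time is purely a hypothetical placeholder (the actual door-to-door time is not stated here), and terabytes are assumed to be decimal (10^12 bytes).

```python
# Back-of-the-envelope sneakernet arithmetic.
# From the text: 6.1 TB total, 16,726 exposures, 100 nights, 2 TB SSDs.
# Assumptions (not from the text): decimal terabytes and a 2-hour transit time.

TB = 10**12  # bytes per terabyte, assuming decimal units


def effective_bandwidth_mbps(payload_bytes, transit_hours):
    """Effective bandwidth, in megabits per second, of carrying a payload by hand."""
    return payload_bytes * 8 / (transit_hours * 3600) / 1e6


total_bytes = 6.1 * TB
nights = 100
exposures = 16726

per_night_gb = total_bytes / nights / 1e9
per_exposure_mb = total_bytes / exposures / 1e6

# Hypothetical case: one completely full 2 TB SSD, 2 hours in transit.
full_ssd_mbps = effective_bandwidth_mbps(2 * TB, transit_hours=2.0)

print(f"Average per night:    {per_night_gb:.1f} GB")
print(f"Average per exposure: {per_exposure_mb:.1f} MB")
print(f"Full SSD, 2 h trip:   {full_ssd_mbps:.0f} Mbit/s")
```

Under these assumptions a single full SSD in transit is equivalent to a link of a couple of gigabits per second. A typical night, at roughly 61 GB, fills only a small fraction of one SSD, so the figure that matters in practice is the delay before the data arrive rather than the raw capacity.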
The bandwidth of hand-carried SSDs is impressive, but so is the bandwidth of modern USB cables. After some experimentation, we discovered that we could transfer data directly from the external SSD to NERSC via Globus Online, with bandwidth performance indistinguishable from the case where the data were first copied to a workstation, then transferred via Globus. In other words, the limiting bandwidth was the internet itself between Tucson and NERSC, not the internal systems of the receiving workstation.
While we were transferring DESI data with a very old system (but with the latest SSDs), we were also experimenting with a new one: Starlink. In fact, there were at least two separate Starlink installations active at KPNO during the recovery. One, which we’ll call the “DESI Starlink”, was a strictly limited command and control channel for use by instrument experts. The DESI Starlink was not used for any significant data transfer. The other, which we’ll call the “NOIRLab Starlink”, was used for more general access to all of KPNO. We did use the NOIRLab Starlink for some data transfer, with mixed results.
At least currently, Starlink works best with a point-to-point style networking connection, optimized for individuals or residences, much like a residential cable modem. Since the uplink is currently 10 to 20 times slower than the downlink, it is less well suited as a two-way connection between two substantially-sized networks, such as between KPNO and the Tucson base facilities. This manifested as a significant, recurring routing problem. Although bandwidth was adequate for interactive use and for the transfer of small data files, the connection frequently went down and the routing systems had to be monitored constantly. Nevertheless, we were able to use the NOIRLab Starlink to maintain an off-site copy of the DESI operations database. This copy would have been impractical to maintain using daily SSD transfers, because it is a real-time mirror of a live database; it is more of a stream than a set of files. There were occasions when the database mirror fell behind due to interruptions in the Starlink connection. In these cases, files created by the database were carried on the SSDs and, once in Tucson, manually added to the database mirror.
Transferring data over sneakernet was an interesting and valuable experience. It was also a lot of work and a lot of hours. I want to give credit to the big team who all contributed to this effort: Keith Blaine, Paul Demmer, Matthew Evatt, Steve Grandi, Bob Marshall, Rod Rutland, David Sprayberry, Christopher Stone, Bob Stupak (NOIRLab), Martin Landriau (LBNL), and Klaus Honscheid (OSU). Here’s one job we hope we never have to do again!