TECHunplugged had the opportunity to talk with Excelero on the topic of tail latency, as Excelero announced on 2-Oct-19 that they were awarded a US patent for a proprietary technology that addresses tail latency.
Tail latency refers to unpredictable latency spikes that can occur when storage media is temporarily unavailable. Besides failed flash drives, the cause can be internal activity handled by the SSD controller, such as garbage collection in progress, block erasure or metadata movement.
Impact of Tail Latency on Distributed Systems
Tail latency may seem innocuous at first sight. It can however become a serious challenge when operating at scale. Big Data Analytics platforms, Machine Learning and AI Training systems are typically severely impacted by tail latency.
Those systems perform heavy-duty tasks; each node may request huge amounts of information, and the systems will typically read data spread across a very large number of drives.
In distributed systems, data processing depends heavily on consistent data throughput rates, and tail latency on a single component is likely to adversely affect the entire environment. Database applications can be impacted by tail latency as well.
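To put a rough number on why scale matters, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that each drive is independently "slow" for a given read with the same small probability; real drives are not that tidy, and the function name and the 1% figure are ours, not Excelero's.

```python
def prob_request_hits_tail(p_slow: float, fan_out: int) -> float:
    """Chance that at least one of `fan_out` independent reads is slow.

    p_slow  -- assumed probability that a single drive read lands in the tail
    fan_out -- number of drives a single request has to wait for
    """
    return 1.0 - (1.0 - p_slow) ** fan_out


if __name__ == "__main__":
    # Illustrative assumption: 1% of reads on any one drive are slow.
    for drives in (1, 10, 100, 1000):
        chance = prob_request_hits_tail(0.01, drives)
        print(f"{drives:>5} drives: {chance:.1%} of requests wait on a slow read")
```

Under these assumptions, a request that has to wait for 100 drives has roughly a 63% chance of being held up by at least one slow read, even though each individual drive is fast 99% of the time.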
“The industry needs a better answer to the tail latency issue, where entire workflows can get hung up by the slowest element’s completion time”
Yaniv Romem, CTO and co-founder of Excelero
Fail Fast And Carry On
First of all, let’s recap on Excelero, in case you haven’t listened to our May 2019 podcast episode with Josh Goldenhar.
Excelero NVMesh is a distributed scale-out elastic storage architecture leveraging NVMe drives. The platform has no storage controllers and uses initiators on clients (servers); storage targets are passive channels to the individual NVMe drives.
To address tail latency, Excelero patented a proprietary solution that allows developers to “tag” I/O operations that are latency-critical. This “tag” defines a timeout period that needs to be respected, allowing for an I/O operation to “fail fast” in case the I/O cannot be served within the allotted timeout limit.
If the tagged I/O cannot be fulfilled within that timeframe, mechanisms are triggered to take the next course of action, such as reading the data from another drive or location.
Similar standards already exist, but according to Excelero only their solution allows a timeout to be set for an I/O operation at the moment the I/O is invoked, instead of having to wait for a timeout to happen at an undefined time.
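The fail-fast idea can be illustrated with a small application-level sketch. To be clear, this is not Excelero's implementation or the NVMesh API: the patented mechanism works inside the storage stack, whereas the code below only mimics the behaviour with Python's asyncio. The drive names, delays and both function names are hypothetical stand-ins.

```python
import asyncio


async def read_from_drive(name: str, delay_s: float) -> str:
    """Hypothetical stand-in for a drive read; sleeps to simulate service time."""
    await asyncio.sleep(delay_s)
    return f"data from {name}"


async def fail_fast_read(replicas, deadline_s: float) -> str:
    """Try each location in order, abandoning any read that exceeds deadline_s."""
    for name, delay_s in replicas:
        try:
            # The deadline is supplied when the read is issued, not discovered later.
            return await asyncio.wait_for(
                read_from_drive(name, delay_s), timeout=deadline_s
            )
        except asyncio.TimeoutError:
            continue  # fail fast and carry on with the next location
    raise TimeoutError("no location answered within the deadline")


def demo() -> str:
    # Assumed scenario: drive-a is busy (e.g. garbage collection), drive-b is healthy.
    replicas = [("drive-a", 2.0), ("drive-b", 0.01)]
    return asyncio.run(fail_fast_read(replicas, deadline_s=0.2))


if __name__ == "__main__":
    print(demo())
```

In this toy scenario the caller gives up on the busy drive after 0.2 seconds instead of waiting the full 2 seconds, and gets its data from the second location.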
TECHunplugged’s Opinion
As we’ve seen, tail latency can have a significant negative impact on distributed computing systems requiring constant read performance, both in throughput and latency.
Organizations implementing Big Data Analytics, ML and AI training at scale understand not only the challenges of tail latency but, perhaps even more acutely, its impact on costs and project timelines.
Excelero’s “fail fast” technology is changing the game for those organizations.
The only downside we can find is that the technology will be limited to the NVMesh architecture, whereas it would be awesome to see it become an industry standard. It will be interesting to hear what Excelero’s plans are in this regard.