Edge file replication and data synchronisation are essential for wide-area distributed processing. However, edge replication faces several problems and challenges [3]:
- Data is growing at different locations, including the Edge
- Analytics, decision-making, and product development all need data
- Network capacity is not keeping pace with the data growth rate
Massive Data Generation at the Edge
With over 20 billion devices continuously generating data, the volume of data at the Edge is skyrocketing. This surge in data raises the critical question of what data we should forward to the cloud or the data center [3]. Should the selection be based on access patterns, requirements, or user needs?
With bandwidth this constrained, the question becomes whether we can push an elephant through the eye of a needle.
Moving all data from the Edge to the data center may be impossible due to the volume, high latency, and constrained bandwidth, in addition to regulations. Furthermore, applications at the data center may not need all data from the Edge [3].
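To make the bandwidth constraint concrete, the following minimal sketch estimates how long one day of edge data takes to transfer and how fast a backlog builds up. The site figures (10 TB per day, a 100 Mbit/s uplink) are hypothetical, and the calculation assumes the link runs at full, sustained throughput with no protocol overhead.

```python
def transfer_seconds(data_bytes: float, link_bits_per_second: float) -> float:
    """Time to push data_bytes through a link at full, sustained throughput."""
    return data_bytes * 8 / link_bits_per_second


def daily_backlog_bytes(generated_bytes_per_day: float,
                        link_bits_per_second: float) -> float:
    """Bytes left behind each day when generation outpaces the link (0 if it keeps up)."""
    link_bytes_per_day = link_bits_per_second / 8 * 86_400
    return max(0.0, generated_bytes_per_day - link_bytes_per_day)


# Hypothetical site: 10 TB generated per day, 100 Mbit/s uplink.
TB = 10**12
hours = transfer_seconds(10 * TB, 100e6) / 3600
backlog_tb = daily_backlog_bytes(10 * TB, 100e6) / TB
print(f"{hours:.1f} hours to move one day of data; "
      f"backlog grows by {backlog_tb:.2f} TB per day")
```

Under these assumed numbers, a single day of data needs more than a day of link time, so the backlog can only grow unless the site forwards a subset of its data.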
Time is a further constraint: even if we use data reduction techniques to shrink the volume of data transmitted from the Edge to the data center, the induced lag may narrow the window for timely decision-making.
Some, if not most, valuable data has a time value: it is valid and useful for decision-making only within a limited window (e.g., life-saving data, surveillance, and weather, flood, or tornado warnings).
An issue to consider is how to reduce the replication window, the costs, and the time between when systems generate or capture data and when it is available and consumed for decision-making.
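One way to reason about the time value of data is to schedule transfers by deadline and skip anything that would arrive after it stops being useful. The sketch below is illustrative only: the `Item` fields, the earliest-deadline-first ordering, and the single shared link are assumptions, not a method described in the cited work.

```python
from dataclasses import dataclass


@dataclass
class Item:
    name: str
    size_bytes: int
    created_at: float   # seconds, on any common clock
    valid_for: float    # seconds the data stays useful after creation


def plan_transfers(items, now, link_bytes_per_second):
    """Order items earliest-deadline-first and skip those that would arrive too late."""
    plan, clock = [], now
    for item in sorted(items, key=lambda i: i.created_at + i.valid_for):
        arrival = clock + item.size_bytes / link_bytes_per_second
        if arrival <= item.created_at + item.valid_for:
            plan.append((item.name, arrival))
            clock = arrival  # the link is busy until this item lands
    return plan


MB = 10**6
items = [
    Item("tornado_warning", 1 * MB, created_at=0, valid_for=60),
    Item("camera_clip", 500 * MB, created_at=0, valid_for=30),
    Item("weather_batch", 10 * MB, created_at=0, valid_for=600),
]
plan = plan_transfers(items, now=0, link_bytes_per_second=1 * MB)
print(plan)
```

In this hypothetical run the large clip cannot reach the data center inside its 30-second window, so it stays at the Edge and the link is kept free for data that can still arrive in time.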
When we have the following:
- Data stored at the Edge
- Replication sends data to the cloud or the data center
- Queries access data from the cloud or remote data centers
what should we replicate from the Edge?
Replication cost is a function of:
- f(file system size)
- f(size of files to replicate)
- f(size of file block changes and the rate of change)
- f(number of files)
- f(total size of the data)
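The factors above can be combined into a rough cost estimate. The text names the inputs but not the form of the function, so the weighted sum below, the field names, and every weight are assumptions chosen for illustration; real weights would have to be measured on the target systems.

```python
from dataclasses import dataclass


@dataclass
class ReplicationJob:
    file_system_bytes: int       # size of the file system to scan
    bytes_to_replicate: int      # size of the files selected for replication
    changed_bytes_per_hour: int  # size of block changes times their rate
    file_count: int              # number of files
    total_data_bytes: int        # total size of the data


# Hypothetical per-unit weights (arbitrary cost units), not measured values.
WEIGHTS = {
    "file_system_bytes": 1e-12,
    "bytes_to_replicate": 1e-9,
    "changed_bytes_per_hour": 1e-9,
    "file_count": 1e-4,
    "total_data_bytes": 1e-12,
}


def replication_cost(job: ReplicationJob, weights=WEIGHTS) -> float:
    """Weighted sum of the cost factors; the linear form is an assumption."""
    return sum(weight * getattr(job, name) for name, weight in weights.items())


job = ReplicationJob(
    file_system_bytes=2 * 10**12,
    bytes_to_replicate=50 * 10**9,
    changed_bytes_per_hour=10**9,
    file_count=1_000_000,
    total_data_bytes=10**12,
)
cost = replication_cost(job)
print(f"estimated cost: {cost:.1f} units")
```

Even a crude model like this makes trade-offs visible: with these made-up weights, a million small files cost more than the bytes they contain, which argues for batching or filtering before replication.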
We should also consider the consistency of distributed objects, the cache, latency, and whether we can satisfy requests for data locally from the Edge using edge services[1].
Locating data on the Edge, near users’ applications, reduces latency and may provide faster access. Examples include gaming, video streaming, and other bandwidth-intensive applications, as well as web applications with intensive access or burst traffic and the large data volumes generated by autonomous vehicles, smart cities, and sensors.
In these examples, locating data in several locations near the user can speed up access. In addition, hybrid cloud/Edge and microservices that can be migrated with their preferred data stores can help by keeping data replicas in the cloud and at the Edge [2].
However, some locations may be able to store only limited amounts of data. In this case, we should create specific replication sets that allow us to select what data to replicate, using real-time or scheduled replication. Services that operate worldwide, or across several locations within one country, can benefit from geo-replication, in which edge nodes are distributed globally under a weak consistency model.
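As an illustration of building a replication set for a location with limited storage, the sketch below picks files by accesses per byte until the capacity is used up. The tuple layout, the access counts, and the greedy ranking are all assumptions; the heuristic is simple rather than optimal, and it expects every file size to be greater than zero.

```python
def build_replication_set(files, capacity_bytes):
    """Greedily choose files for a capacity-limited edge node.

    files: iterable of (path, size_bytes, accesses_per_day), with size_bytes > 0.
    Returns the chosen paths and the bytes they occupy.
    """
    chosen, used = [], 0
    # Rank by accesses per byte so small, frequently read files go first.
    ranked = sorted(files, key=lambda f: f[2] / f[1], reverse=True)
    for path, size, _accesses in ranked:
        if used + size <= capacity_bytes:
            chosen.append(path)
            used += size
    return chosen, used


# Hypothetical catalogue of files with sizes and daily access counts.
catalogue = [
    ("/data/catalog.db", 400, 900),
    ("/data/video.mp4", 700, 300),
    ("/data/logs.tar", 300, 30),
    ("/data/index.bin", 100, 500),
]
selected, used_bytes = build_replication_set(catalogue, capacity_bytes=800)
print(selected, used_bytes)
```

The same selection could run on a schedule or be re-evaluated as access patterns change, which matches the choice between scheduled and real-time replication described above.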
According to the CAP (or Brewer’s) theorem, a distributed system cannot guarantee both availability and consistency when a network partition occurs, and partitions are unavoidable. The authors of [2] therefore suggest giving preference to availability over consistency, so that replicas at the Edge offer better response times [2].
With EnduraData replication, the nodes in a configuration do not all communicate with one another. Each receiving node communicates only with the sending node. Hence, each replica is isolated from all the other nodes and is not involved in replication between the sending node and any other receiving node [4].
In EnduraData real-time mode, each time a user or process mutates a file or directory (writes, metadata changes, symbolic links), the operation is relayed to one or more nodes to create an exact mirror of the source. Administrators can configure whether replication propagates deletes [4].
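The behaviour described above can be modelled with a small in-memory simulation: one sender relays each operation to its receivers, receivers never talk to each other, and each receiver decides whether to apply deletes. This is a conceptual sketch only. The `Sender` and `Receiver` classes and the `propagate_deletes` flag are hypothetical names and do not represent EnduraData's actual API or configuration syntax.

```python
from dataclasses import dataclass, field


@dataclass
class Receiver:
    name: str
    propagate_deletes: bool = True
    files: dict = field(default_factory=dict)  # path -> content

    def apply(self, op, path, content=None):
        if op == "write":
            self.files[path] = content
        elif op == "delete" and self.propagate_deletes:
            self.files.pop(path, None)


@dataclass
class Sender:
    receivers: list
    files: dict = field(default_factory=dict)

    def write(self, path, content):
        self.files[path] = content
        self._relay("write", path, content)

    def delete(self, path):
        self.files.pop(path, None)
        self._relay("delete", path)

    def _relay(self, op, path, content=None):
        # Each receiver hears only from the sender, never from another receiver.
        for receiver in self.receivers:
            receiver.apply(op, path, content)


mirror = Receiver("mirror")                            # exact mirror of the source
archive = Receiver("archive", propagate_deletes=False) # keeps deleted files
sender = Sender([mirror, archive])

sender.write("/reports/a.txt", b"v1")
sender.delete("/reports/a.txt")
print(mirror.files, archive.files)
```

After the delete, the mirror matches the source exactly, while the archive still holds the file, which is the kind of outcome a "do not propagate deletes" setting is meant to give.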
References
[1] L. Gao, M. Dahlin, A. Nayate, J. Zheng, A. Iyengar (2003): “Application Specific Data Replication for Edge Services”. WWW2003, May 20-23, 2003, Budapest, Hungary, p. 449-450.
[2] D. Mealha, N. Preguica, M. Gomes, J. Leitao (2019): “Data Replication on the Cloud/Edge”. PaPoC’19, March 25, 2019, Dresden, Germany.
[3] N. Semmler, G. Smaragdakis, M. Rost, A. Feldmann (2020): “Edge Replication Strategies for Wide Area Distributed Processing”. EdgeSys ’20, April 27, 2020, Heraklion, Greece.
[4] enduradata.com
Author: A. A. El Haddi (https://www.linkedin.com/in/aelhaddi/)