For some time now I've been enthralled with the idea that the physical locality of data should be descriptive rather than prescriptive.

That is to say: data locality should only ever be determined by its origin and its utility. Who is emitting it, and who wants it? The physical locality of data within a given volume of space should be based on this, and nothing else. Of course, this is a very abstract way of looking at things, but it's very important for the future of computing that we get this right.

Today, we tend to take a very prescriptive approach to assigning data locality. In the old days of on-premises servers, we would prescribe that location to be down the hall in the server room. We hardly cached anything* because round-tripping 50 feet is no big deal. These days, more often than not, there's some distant and elaborate cache hierarchy rooted in a cloud provider's storage system or a database's persistent storage. In more avant-garde systems, such as the blockchain flavor-of-the-month or other peer-to-peer networks, this should be mitigated, or so one would think. Unfortunately not: pretty much all of these have some sort of prescriptive data-locality assignment. In the case of most public blockchains, that prescribed location is: everywhere (modulo some half-baked trusted-full-node scheme). In other distributed-hash-table (DHT) based storage systems like IPFS, the prescribed location (at least of the index) is some redundant set of logical addresses, which map arbitrarily to physical localities.

( *We've always used memory cache hierarchies on the CPU and memory bus of the client machine. It's easy to think of that cache hierarchy as fundamentally separate from the server's cache hierarchy, but in actuality it isn't. Rather, it's a single overarching hierarchy built from a composition of the two. From pixel/keyboard/mouse, through the network, and all the way to the persistent storage medium, this is a single hierarchical composition of cache consistency models. More on this later.)

The big issue with prescriptive data-locality assignment is that the prescribed location may or may not be well aligned in physical space with the parties who want the data (or those who are authorized to have it!). In the case of DHT schemes, if you start with a very small physical volume, say your office, which is fully connected with a reliable network, the logical address which contains the data you want maps to somewhere within the building. Not bad in terms of latency, and not a worry with regard to connectivity.

Now, imagine the physical volume is a lot larger… for instance a planet. Earth, let's say. If we have singly-assigned logical addresses under the DHT scheme, then that location is some random host somewhere on the globe. That host might be offline, or might be on the other side of the world, or worse still, might have had a backhoe cut its last-mile network cable five minutes ago. Not to worry though: most sensible DHT networks assign many redundant copies of this data at differing logical addresses, usually with the number of replicas vaguely commensurate with the overall network size. This way, you can ask several of these logical addresses for the information. The one with the backhoe cut won't respond to your query, but others will, and you can just discard the responses after the first. Case closed!
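To make that prescription concrete, here's a minimal sketch in Rust (with entirely hypothetical names, not any particular DHT's API) of how the key alone determines the redundant logical addresses at which replicas must live:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Derive `replicas` redundant logical addresses for a key by hashing the key
/// together with a replica index. Real DHTs (Kademlia, Chord, etc.) differ in
/// the details, but the shape is the same: the key alone prescribes where the
/// data lives.
fn replica_addresses(key: &str, replicas: u32) -> Vec<u64> {
    (0..replicas)
        .map(|i| {
            let mut h = DefaultHasher::new();
            key.hash(&mut h);
            i.hash(&mut h);
            h.finish()
        })
        .collect()
}

fn main() {
    // Each logical address maps to whichever host happens to own that slice of
    // the keyspace -- a prescription that knows nothing about where the data's
    // consumers actually are.
    for addr in replica_addresses("alice-and-bob-document", 5) {
        println!("replica prescribed at logical address {:016x}", addr);
    }
}
```

Nothing in this derivation knows or cares where the data's consumers are; the placement is prescribed by the hash, full stop.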

The problem is: even with a massive number of redundant copies of the data at predictably differing logical/physical addresses, the probability that any of the corresponding physical addresses falls within your office is rather low. Of course, you could make some server in your office a "full node", meaning that it's guaranteed to have the data for all possible queries, but now you need to receive a copy of all worldwide traffic. That's going to get expensive.
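A back-of-the-envelope calculation shows just how low. If we assume replicas land uniformly at random across the network (a simplification, and the numbers below are made up), the odds that any copy is already in the building are vanishing:

```rust
/// Rough odds that at least one of `k` replicas, placed uniformly at random
/// across `n_total` hosts, lands on one of the `n_local` hosts in your office.
/// Treats placements as independent, which is close enough for intuition.
fn p_local_copy(n_local: f64, n_total: f64, k: i32) -> f64 {
    1.0 - (1.0 - n_local / n_total).powi(k)
}

fn main() {
    // Hypothetical numbers: 20 hosts in the office, 10 million on the network,
    // 30 replicas per record.
    let p = p_local_copy(20.0, 10_000_000.0, 30);
    println!("chance a copy is already down the hall: ~{:.4}%", p * 100.0);
    // prints roughly 0.0060% -- effectively every query leaves the building
}
```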

So, why is this such a big deal? So what if most of my queries need to reach out to some public full node or hosting provider? Well, in the short term, you want system availability. If your internet connection gets cut or degraded somehow, Bob still wants to keep collaborating on that document with Alice down the hall. Does that mean they have to use an on-premises system? For now I guess it does. More likely, they'll just put up with the outages until they become unbearable. One can construct all manner of pathological situations where Bob and Alice are in different cities or on different continents, with network partitions occurring someplace that is not between them, yet still affecting their ability to collaborate or communicate. This will only worsen as the IoT revolution gradually morphs into the shitty-IoT everyday reality.

In the medium and long terms, I believe we are going to run into serious limitations: data size versus available bandwidth within various spatial volumes, (dis)trust in our service providers, new demands placed on our cache hierarchies by federated and decentralized systems, and ever-increasing expectations of decreasing latency for various applications.

For this reason, I have imagined unba.se: a highly speculative endeavor, and one that will be nigh impossible to monetize (at least directly, as a database company) even in the unlikely event that it ever turns into a fully working product. Yes, there's a whole other conversation to be had about cryptocurrency-based computational frameworks. It's not wholly impossible to take that approach, but I'll leave my ethical grievances with modern cryptocurrencies for another post. The main objective of unbase, as imagined, is to provide a strong causal consistency model with zero a priori planning of data locality: only behavioral localization based on utility and availability.

The point of such a system is to approach the upper bound of performance which is physically possible for a strong causal consistency model within an arbitrary spatial volume. Of course, vastly more technical definition is warranted, but for now let's just say that we intend to have NO full nodes, NO shards/masters, and a monotonically progressing worldview for all parties. One's local node may or may not have the information necessary to service a given query, but at a minimum it must detect this staleness if it has heard about any causally descendant operations, even indirectly. I call this "infectious knowledge". Implementations of such a strategy should match or even surpass the performance possible under eventually consistent systems, but with stronger guarantees. (If you want to talk about linearizability or the CAP theorem, Martin Kleppmann's "A Critique of the CAP Theorem" is an excellent place to start.)
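To illustrate what infectious knowledge might look like mechanically, here's a toy sketch; the names and structure are hypothetical, not unbase's actual design. A node tracks the operations it has merely heard of separately from those it has applied, and declines to answer locally whenever the two diverge:

```rust
use std::collections::{HashMap, HashSet};

/// Toy illustration of "infectious knowledge": a node tracks which operation
/// ids it has *heard of* for a record (even via indirect gossip) alongside the
/// ones it has actually *applied*. Hearing about a causally descendant
/// operation is enough to know a local answer would be stale.
#[derive(Default)]
struct LocalNode {
    applied: HashMap<String, HashSet<u64>>,  // record -> ops materialized locally
    heard_of: HashMap<String, HashSet<u64>>, // record -> ops referenced by any peer
}

impl LocalNode {
    fn gossip(&mut self, record: &str, op: u64) {
        self.heard_of.entry(record.into()).or_default().insert(op);
    }
    fn apply(&mut self, record: &str, op: u64) {
        self.applied.entry(record.into()).or_default().insert(op);
        self.gossip(record, op);
    }
    /// A query may be served locally only if every op we have heard of is
    /// already reflected in our materialized state.
    fn can_answer_locally(&self, record: &str) -> bool {
        let heard = self.heard_of.get(record).cloned().unwrap_or_default();
        let have = self.applied.get(record).cloned().unwrap_or_default();
        heard.is_subset(&have)
    }
}

fn main() {
    let mut node = LocalNode::default();
    node.apply("doc-42", 1);
    assert!(node.can_answer_locally("doc-42"));

    // A peer mentions op 2 in passing; we haven't seen its contents yet.
    node.gossip("doc-42", 2);
    assert!(!node.can_answer_locally("doc-42")); // stale: must fetch first
    println!("staleness detected without holding the data itself");
}
```

The important property is that merely hearing a reference to a causally descendant operation is enough to force a fetch; the node never silently serves a stale answer.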

My conjecture here is that for any given set of agents (originators and consumers) of immutable, copyable data within a physical volume, there exists some optimal distribution of the copies of said data, and a continuous transformation thereto, such that latency, durability, energy, and storage volume are collectively minimized.
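Purely as an illustration of what "collectively minimized" might mean (the fields, numbers, and weights below are invented, and not a claim about the real trade-off surface), one could score a candidate placement of copies by a weighted sum of the quantities in tension:

```rust
/// A purely illustrative way to score a candidate placement of copies: a
/// weighted sum of the quantities the conjecture says are in tension.
struct PlacementCost {
    expected_latency_ms: f64, // averaged over the agents actually asking
    loss_risk: f64,           // probability of losing all copies (durability)
    transfer_energy_j: f64,   // energy spent moving and maintaining copies
    bytes_stored: f64,        // total storage consumed across the volume
}

fn score(c: &PlacementCost, w: [f64; 4]) -> f64 {
    w[0] * c.expected_latency_ms
        + w[1] * c.loss_risk
        + w[2] * c.transfer_energy_j
        + w[3] * c.bytes_stored
}

fn main() {
    // A placement biased toward the office vs. a uniformly random DHT placement.
    let office_biased = PlacementCost {
        expected_latency_ms: 3.0,
        loss_risk: 0.02,
        transfer_energy_j: 5.0,
        bytes_stored: 4e9,
    };
    let dht_random = PlacementCost {
        expected_latency_ms: 180.0,
        loss_risk: 0.001,
        transfer_energy_j: 40.0,
        bytes_stored: 30e9,
    };
    let weights = [1.0, 1000.0, 1.0, 1e-9];
    println!(
        "office-biased: {:.1}  dht-random: {:.1}",
        score(&office_biased, weights),
        score(&dht_random, weights)
    );
}
```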

If we had infinite bandwidth and infinite storage density, we could simply broadcast all the data to all the places, but of course we do not. Especially in a large-scale system, density, locality, energy, latency, and durability are all in tension with one another. I believe that stasis or a priori decision-making in the selection of data locality, while usually chosen in the interest of complexity reduction, is necessarily a sub-optimal solution, for the reasons described above. A better approach, in my view, is to continually transform the locality of that data based on a series of simple behaviors (one such behavior is sketched below). The downside is that this requires a more finely-grained bookkeeping system than we are accustomed to today in order to traverse references and retrieve the desired data. That is a nontrivial thing if you want to achieve reasonably low latency and avoid choking out all useful bandwidth and compute capacity with the bookkeeping traffic needed to do it.
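As an example of the kind of simple behavior I have in mind (a sketch with made-up thresholds and structure, not unbase's actual mechanism): replicate a record toward a region once that region has had to fetch it remotely often enough.

```rust
use std::collections::HashMap;

/// One candidate "simple behavior": count how often each region asks for a
/// record it doesn't hold locally, and project a copy toward that region once
/// demand crosses a threshold. The threshold and the notion of a "region" are
/// stand-ins for illustration.
#[derive(Default)]
struct RecordPlacement {
    holders: Vec<String>,               // regions currently holding a copy
    remote_reads: HashMap<String, u32>, // region -> reads it had to serve remotely
}

impl RecordPlacement {
    fn on_read(&mut self, region: &str) {
        if self.holders.iter().any(|r| r == region) {
            return; // the asker already has a local copy; nothing to do
        }
        let n = self.remote_reads.entry(region.to_string()).or_insert(0);
        *n += 1;
        // Demand-driven replication: the data follows its consumers.
        if *n >= 3 {
            self.holders.push(region.to_string());
            self.remote_reads.remove(region);
        }
    }
}

fn main() {
    let mut doc = RecordPlacement {
        holders: vec!["distant-datacenter".into()],
        ..Default::default()
    };
    for _ in 0..3 {
        doc.on_read("nairobi-office");
    }
    assert!(doc.holders.iter().any(|r| r == "nairobi-office"));
    println!("a copy has migrated toward where the record is actually read");
}
```

A companion behavior would let unused copies lapse, so that the distribution keeps tracking demand rather than accumulating copies forever.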

I believe there is likely some profound information-theoretic insight waiting to be discovered about the relationship between the geometry (in a higher-dimensional sense) of bookkeeping traffic about the locality of data, and the geometry of the physical space in which the data resides. By achieving a better understanding of this relationship, I feel we can make better decisions about these tradeoffs in future systems. With that in mind, through experimentation I am searching for a function which describes the lower bound of latency and energy required to recursively retrieve information distributed within a physical volume. This means understanding the properties of both non-optimal/arbitrary distributions of data within the physical volume and the limits of optimal distributions; and perhaps most importantly, the latency and energy required to transform the one into the other under different loads, and how well that aligns with the underlying physical phenomena we are ultimately trying to model. By probing the properties of optimal and non-optimal distributions of data, we may better understand the tractability of such a system, and identify the right set of simple behaviors for creating emergent data-locality optima. In the end, we hope to create the best-case distribution of data as often and as reliably as possible, and, for that matter, to understand mathematically what is possible under which constraints.

Why all this silliness, you might ask? Well, in practical terms, we in CS tend to think of our memory bus as having a logical distance of zero from our locality of computation, of our systems functioning in lockstep, with a grand unifying clock in the sky. The hardware industry knows full well that this isn't the case, however, as does anyone familiar with Meltdown and Spectre. Even those of us who are intimately familiar with the causal consistency models that actually underpin our microprocessor implementations enjoy a certain selective amnesia about this fact. We do so in the interest of reducing complexity, undertaking herculean efforts to abstract it away and maintain the illusion of computation outside of time and space. While this might sound like a nice abstraction, I believe it comes at great cost in overall complexity and reliability, and it breeds perverse control dynamics. Though not immediately obvious, this is more than a mere theoretical concern. It is the functioning of information systems in a changing world, and the ethics of privacy and control, that are at stake.

As perhaps a bit of a presumptuous digression, and as IANA physicist: it is probably fair to say that no more than one bit of data can ever occupy a single point in spacetime. There is always some system of indirection at hand. In the degenerate case, your data is only "there" if you want exactly one bit of information; any more bits require traversal of some system of indirection, be it sequential or referential. In most real-world cases it's a dozen or more hops: registers, L1–L3 caches, the memory controller, the bus, the network controller, B-tree index nodes, the disk allocation table, and so on. We have various ways to spread the work around, but as a general matter the principle holds. I think it's reciprocally fair to say that every bit of information must occupy nonzero space.

Today, the applicability of this principle on the microscopic scale may be of debatable utility beyond idle academic curiosity, but I think there is a much stronger argument for applying it in the macroscopic sense. That is to say: among today's networks of computers. Improving the manner in which they locate, organize, and retrieve data, and perform computation, would be of tremendous value, particularly if such a scheme were able to converge toward these optima not just once, but continuously.

And here lie the ragged ends of my intuitions. No doubt there's a strong connection to information theory and physics, and substantial prior art exists here, and yet all of our present foundational data storage technologies, whether database, DHT, blockchain, or hashgraph, seem to run fundamentally counter to this principle of descriptive, rather than prescriptive, data locality assignment. Through experimentation and the recruitment of skilled collaborators, I seek at least to ask the right questions; to dream of a revolution in computer science and information technology, and, just perhaps, the death of the database, at least in the conventional sense.

Daniel Norman
