How to Use Git as an Offline-First Database?

Hidekazu Kubota

Creator of GitDocumentDB

The Git ecosystem can be an excellent knowledge management platform, but it is fundamentally driven by human decisions.

I discussed and implemented deterministic ways for computers to synchronize and resolve conflicts in the Git ecosystem automatically.

https://betterprogramming.pub/how-to-use-git-as-an-offline-first-database-dca7f9604142

It was much more difficult than I had previously thought.

Git turned out to need some constraints to work deterministically, and I had to make explicit the implicit rules that humans rely on in automated workflows like CI/CD.

Is the Git data model a CRDT?

Hidekazu Kubota

Creator of GitDocumentDB

Git is an excellent distributed version control system, but there are few examples of using it as a distributed database. Why is that?

The reason is that Git is a human-oriented tool, so you need to set up rules before it can be processed automatically. This article series will discuss how close Git is to a distributed database and define the rules needed to turn it into one.

Is Git a CRDT?

Git can be regarded as a kind of conflict-free replicated data type (CRDT), a model for updating replicated data in distributed databases. A CRDT is not only a data structure but also a procedure for processing data.

Git isn't commonly referred to as a CRDT, but it has aspects of one from multiple perspectives. I have occasionally seen this discussed on social networking sites, but those discussions made no specific comparison between Git and existing CRDTs (*).

[Figure: Commit graph of Git]

First, Git has the characteristics of a multi-value register (MV-Register). An MV-Register is a type of CRDT in which one piece of data can hold multiple revisions, so editing it in several places at the same time does not destroy the data.
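The MV-Register idea can be sketched in TypeScript as follows. This is only an illustration that assumes version vectors are used to detect concurrent edits; the names `Versioned`, `compare`, and `mergeRegisters` are hypothetical and are not part of Git or GitDocumentDB.

```typescript
// Hypothetical sketch of a multi-value register; not Git's or GitDocumentDB's actual code.
type VersionVector = Record<string, number>;

interface Versioned<T> {
  value: T;
  clock: VersionVector; // how many edits from each replica this revision has seen
}

type Order = 'equal' | 'before' | 'after' | 'concurrent';

// Compare two version vectors replica by replica.
function compare(a: VersionVector, b: VersionVector): Order {
  let aAhead = false;
  let bAhead = false;
  for (const replica of Object.keys(a).concat(Object.keys(b))) {
    const av = a[replica] ?? 0;
    const bv = b[replica] ?? 0;
    if (av > bv) aAhead = true;
    if (av < bv) bAhead = true;
  }
  if (aAhead && bAhead) return 'concurrent';
  if (aAhead) return 'after';
  if (bAhead) return 'before';
  return 'equal';
}

// Merge keeps every revision that no other revision supersedes,
// so concurrent edits survive side by side instead of overwriting each other.
function mergeRegisters<T>(left: Versioned<T>[], right: Versioned<T>[]): Versioned<T>[] {
  const all = left.concat(right);
  const result: Versioned<T>[] = [];
  for (const candidate of all) {
    const superseded = all.some(other => compare(other.clock, candidate.clock) === 'after');
    const duplicate = result.some(kept => compare(kept.clock, candidate.clock) === 'equal');
    if (!superseded && !duplicate) result.push(candidate);
  }
  return result;
}
```

In this sketch, two revisions written independently on replicas `x` and `y` are both kept after a merge, much like two divergent branch heads, while a later revision that has seen both replaces them.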

Second, Git has the characteristics of an add-only monotonic DAG. The commit graph of Git is a directed acyclic graph (DAG). The merge operation is designed not to destroy the graph structure, even if multiple users update the same commit graph in different locations. Such a graph is an add-only monotonic DAG, which is a kind of CRDT.
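A minimal sketch of why an add-only graph converges, written in TypeScript. The `CommitNode` shape and the `mergeGraphs` function are hypothetical simplifications; the sketch assumes, as with Git's content-derived commit ids, that the same id always refers to the same node.

```typescript
// Hypothetical model of a commit graph as a grow-only DAG; shapes are illustrative only.
interface CommitNode {
  id: string;        // assumed to be derived from the content, so equal ids mean equal nodes
  parents: string[]; // edges only point at commits that already exist
}

type CommitGraph = Record<string, CommitNode>;

// Merging is a union keyed by commit id: nodes are only added, never removed or rewritten.
function mergeGraphs(a: CommitGraph, b: CommitGraph): CommitGraph {
  return { ...a, ...b };
}

// Helper for comparing graphs regardless of insertion order.
function commitIds(graph: CommitGraph): string[] {
  return Object.keys(graph).sort();
}
```

Because the merge is a plain union, applying it in either order, or applying it twice, yields the same set of commits. That commutativity and idempotence is what makes the structure behave like a CRDT.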

How to bring Git closer to a distributed database?

Git has good features for a distributed database because of the eventual consistency model of the CRDTs described above. The bad part is that computers cannot resolve conflicts deterministically, because Git assumes that humans resolve them manually. Other problems are that Git has no database-like APIs and carries a lot of overhead when viewed as a DB; I briefly discussed these in a previous article.

If I could solve these problems, the benefits of using Git as a distributed database would be even greater.

GitDocumentDB uses Git-CRDT, a deterministic solution to conflicts in Git, to achieve automatic synchronization. I will discuss the details in the next update.

(*) The following research report introduces many CRDTs, including the MV-Register and the add-only monotonic DAG: Marc Shapiro, Nuno Preguiça, Carlos Baquero, Marek Zawirski. A comprehensive study of Convergent and Commutative Replicated Data Types. [Research Report] RR-7506, Inria – Centre Paris-Rocquencourt; INRIA. 2011, pp.50. inria-00555588

Database-side debounce

Hidekazu Kubota

Creator of GitDocumentDB

Database-side debounce has been experimentally implemented in GitDocumentDB v0.4.5.

Debounce is a commonly used feature in UI development, and both Lodash and RxJS provide it. Debounce executes an enqueued task only when a debounce time has passed without any other task being enqueued. If a new task is enqueued before the debounce time passes, the previous task is dropped without being executed, and a new debounce time is scheduled.
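The behavior described above can be sketched in TypeScript with an explicit clock instead of real timers, which makes it easy to trace. The `Debouncer` class and its `enqueue` and `tick` methods are hypothetical names for illustration; they are not the Lodash or RxJS API.

```typescript
// Minimal debounce sketch with an explicit clock (times are in milliseconds).
// Hypothetical helper for illustration; not the Lodash or RxJS API.
class Debouncer<T> {
  private readonly waitMs: number;
  private pending: { task: T; dueAt: number } | undefined;

  constructor(waitMs: number) {
    this.waitMs = waitMs;
  }

  // Enqueuing replaces (drops) any pending task and schedules a new debounce time.
  enqueue(task: T, now: number): void {
    this.pending = { task, dueAt: now + this.waitMs };
  }

  // Returns the pending task once the debounce time has passed, otherwise undefined.
  tick(now: number): T | undefined {
    if (this.pending !== undefined && now >= this.pending.dueAt) {
      const { task } = this.pending;
      this.pending = undefined;
      return task;
    }
    return undefined;
  }
}
```

With a 100 ms debounce time, enqueuing `'a'` at time 0 and `'b'` at time 50 drops `'a'`; `tick(100)` returns nothing because the timer was rescheduled, and `tick(150)` returns `'b'`.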

[Figure: debounce]

Such a feature can reduce the processing load when continuously retrieving data from a server or persisting data from a (redux) store.

In the same way, debounce in GitDocumentDB drops consecutive 'put' or 'update' tasks that target the document with the same id.
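As a rough illustration of per-document debouncing, the following TypeScript sketch filters a task queue. It is not GitDocumentDB's actual implementation; the `WriteTask` shape and the `dropDebounced` function are assumptions made for this example.

```typescript
// Hypothetical task shape; GitDocumentDB's internal queue may look different.
interface WriteTask {
  id: string;             // id of the target document
  kind: 'put' | 'update';
  enqueuedAt: number;     // milliseconds
}

// Drops a task when a later task for the same document id was enqueued
// before the debounce time elapsed. Tasks are assumed to be sorted by enqueuedAt.
function dropDebounced(tasks: WriteTask[], debounceMs: number): WriteTask[] {
  return tasks.filter((task, index) => {
    const next = tasks.slice(index + 1).find(later => later.id === task.id);
    return next === undefined || next.enqueuedAt - task.enqueuedAt >= debounceMs;
  });
}
```

Tasks for different documents never affect each other in this sketch, which is the point of keying the debounce by document id rather than applying it to the whole queue.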

Database-side debouncing is intended to simplify the data store persistence procedure on the application side, especially in data flows involving DB synchronization and exclusive locking on the application side.

The data flow between a DB with an eventual consistency sync model, such as GitDocumentDB, and an app's Redux store forms a cyclic graph and causes a race condition. A UI event updates both the store and a document in the database. A sync event also updates both the same document in the database and the same store.

[Figure: App-side debounce]

The middleware needs to be locked exclusively to decide whether an update from a UI event or an update from a synchronization event should be written to the store. Tasks in the diagram are enqueued in the order T1, T2, T3. The middleware can resolve the race condition by comparing the synchronization time (T1) with the scheduled execution time (T3), dropping the data with the older time, and writing the data with the latest time to the store.
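The comparison step can be sketched as a small TypeScript function. The `StoreUpdate` shape and `pickLatest` are hypothetical names, and letting the second argument win a tie is an assumption of this sketch rather than something the design above specifies.

```typescript
// Hypothetical update record handled by the middleware inside its critical section.
interface StoreUpdate<T> {
  source: 'ui' | 'sync';
  time: number; // synchronization time or scheduled execution time, in milliseconds
  data: T;
}

// Keeps the update with the latest time and drops the older one.
// Assumption: on a tie, the update passed second wins.
function pickLatest<T>(first: StoreUpdate<T>, second: StoreUpdate<T>): StoreUpdate<T> {
  return second.time >= first.time ? second : first;
}
```

For example, if a sync event carries time 100 and a UI update is scheduled for time 300, the UI data is written to the store and the sync data is dropped, regardless of the order in which the two arrive at the middleware.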

Here, if the middleware uses debounce, the debounce queue and timer are also included in the critical section, complicating exclusive locking.

However, with debounce on the database side, locking the middleware becomes simple.

[Figure: Database-side debounce]

Debouncing on the database side may sound strange, but I think it can be considered for a document database that is updated on a per-document basis.

It is an experimental implementation and may be changed in the future.

Powered by Git ecosystem

Hidekazu Kubota

Creator of GitDocumentDB

There is a growing Git ecosystem where tools are interconnected around Git's data storage. GitDocumentDB is a Git-based database API developed to make it easier to integrate the Git ecosystem into your apps.

Powered by Git

Git is increasingly being applied to software development and website building (CMS), design, collaborative editing, journals, and more because it helps manage files and their change history in a distributed manner.

At least 60% of GitHub users link their repositories with external tools, which means that Git repositories have become a central hub for data collaboration.

Collection

Hidekazu Kubota

Creator of GitDocumentDB

GitDocumentDB's collection API is based on the Git directory structure. You can use the Collection class to put, get, delete and search for documents under a specified directory.

Collection API

A Collection is hierarchical like directories, and attributes of a Collection are inherited from its parent Collection.

Collections are independent with respect to synchronization: each collection receives only the Git sync events for documents under its own directory.
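The directory-scoped behavior can be pictured with a small TypeScript sketch that routes changed files to a collection by path prefix. This is not the GitDocumentDB API; `ChangedFile`, `changesForCollection`, and the example paths are made up for illustration.

```typescript
// Hypothetical description of one file changed by a sync; not a GitDocumentDB type.
interface ChangedFile {
  path: string; // path relative to the repository root
  operation: 'insert' | 'update' | 'delete';
}

// Returns only the changes that live under the collection's directory.
function changesForCollection(changes: ChangedFile[], collectionPath: string): ChangedFile[] {
  // Normalize to a trailing slash so that 'nara' does not match 'narabi/...'.
  const prefix = collectionPath.endsWith('/') ? collectionPath : collectionPath + '/';
  return changes.filter(change => change.path.startsWith(prefix));
}
```

Under this model a nested collection such as `nara/flower` sees a subset of what its parent `nara` sees, which mirrors the hierarchical, directory-like structure of collections.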

Pros and cons

Hidekazu Kubota

Creator of GitDocumentDB

GitDocumentDB is compatible with Git. It inherits distributed multi-primary operation and efficient CI/CD from Git.

[Figure: fully automated]

Pros:

  • It automates Git synchronization workflows by resolving transactional and consistency conflicts as well as revision conflicts, using typical synchronization patterns and diff-and-patch strategies.
  • Accessible CRUD and collection APIs for working with JSON reduce the work needed on the application side.

Cons:

  • The throughput of GitDocumentDB is about the same as that of Git. It is not as fast as typical databases.
  • Storage size grows when managing many revisions of a document.

These are the trade-offs for Git's features.

Integrated NodeGit and isomorphic-git

Hidekazu Kubota

Creator of GitDocumentDB

Great release!

GitDocumentDB v0.4.3 integrates NodeGit and isomorphic-git by using a plugin system.

[Figure: git-engines]

NodeGit and isomorphic-git are widely known Git implementations for JavaScript/TypeScript. The development of GitDocumentDB started with NodeGit, which met most of GitDocumentDB's needs. However, it also suffers from memory leaks, incomplete documentation, and awkward integration with Electron. Most of these problems stem from the native libgit2 code in NodeGit.

isomorphic-git is a pure JavaScript implementation of Git. It's well documented and easy to integrate with Electron. So, I have rewritten all NodeGit code with isomorphic-git code.

Still, NodeGit has the advantage of supporting SSH key authentication, so I have made it available as a remote connection plugin.

GitDocumentDB now uses isomorphic-git as its Git implementation. If you like, you can use NodeGit instead of isomorphic-git for remote connections.

The plugin system is intended to accommodate other Git implementations in the future.

Offline capable

Hidekazu Kubota

Creator of GitDocumentDB

[Figure: Unplugged Tanuki]

He is Unplugged Tanuki. He is usually offline and occasionally online. He is always having fun, no matter where he is.

The above is my drawing that illustrates an offline-first concept. Offline-first is a design paradigm that ensures your app works offline as well as online.

A multi-primary database is an essential component of this paradigm. In a multi-primary database, the data you hold is not a cache of the server's data but the original data. So, you do not need to get your initial data from a server.

You do not depend on an internet connection.
You can store your data in distributed places.
Moreover, you can synchronize your distributed copies with each other when online.

GitDocumentDB is an offline-first database that realizes these concepts.

Hello!

Hidekazu Kubota

Creator of GitDocumentDB

Welcome to the GitDocumentDB blog.

In this blog, I will write about the GitDocumentDB technology and its releases.