Timestamped Row Pattern
Insights, November 26, 2010
Here is how to conform to the pattern:
- Every row should have a “last updated timestamp”.
- Some form of “soft deletes” or “archiving” must be used rather than full deletes.
This is analogous to a file system where each file has a last updated field and deleted files go to the “trash”.
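As a minimal sketch, a table conforming to the pattern might look like this in SQLite (the `players` table and its column names are illustrative assumptions, not part of the pattern itself):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE players (
    id           INTEGER PRIMARY KEY,
    name         TEXT NOT NULL,
    score        INTEGER DEFAULT 0,
    last_updated TEXT NOT NULL DEFAULT (datetime('now')),  -- "last updated timestamp"
    is_deleted   INTEGER NOT NULL DEFAULT 0                -- soft-delete flag
);

-- Keep last_updated current on every change, so a later synchronization
-- can find everything modified since a given moment.
CREATE TRIGGER players_touch AFTER UPDATE ON players
BEGIN
    UPDATE players SET last_updated = datetime('now') WHERE id = NEW.id;
END;
""")
```

A "delete" under this scheme is just `UPDATE players SET is_deleted = 1 WHERE id = ?`, which also refreshes `last_updated` via the trigger.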
The timestamped row pattern has two main benefits:
- Any two datastores can be synchronized without downtime
- The timestamp dates are helpful for various kinds of analytics
Here are some examples of situations where we would like to synchronize datastores:
- migrating into and out of Amazon’s Relational Database Service (RDS)
- migrating into and out of a non-relational datastore
- keeping a transactional datastore in sync with a data warehouse
- keeping caches and datastores in sync
The synchronization itself is a simple loop:
1. Copy rows that have changed since the last synchronization
2. Delete rows that were added to the “dead rows” table (or flagged as dead) since the last synchronization
3. If there were no such rows in step 1 or 2, then you’re done
4. Otherwise, go back to step 1
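The loop above can be sketched against SQLite as follows. Everything here is an illustrative assumption: the `players` schema, the function names, and the use of second-granularity text timestamps (a real system would want finer resolution and an overlap margin at the cutoff):

```python
import sqlite3

SCHEMA = """CREATE TABLE players (
    id INTEGER PRIMARY KEY, name TEXT, score INTEGER,
    last_updated TEXT NOT NULL, is_deleted INTEGER NOT NULL DEFAULT 0)"""

def sync_once(source, target, since):
    """Steps 1 and 2: copy changed live rows, remove rows flagged as dead.
    Returns True if any such rows were found, meaning another pass is needed."""
    changed = source.execute(
        "SELECT id, name, score, last_updated FROM players "
        "WHERE last_updated > ? AND is_deleted = 0", (since,)).fetchall()
    for row in changed:
        target.execute("INSERT OR REPLACE INTO players "
                       "(id, name, score, last_updated) VALUES (?,?,?,?)", row)
    dead = source.execute(
        "SELECT id FROM players WHERE last_updated > ? AND is_deleted = 1",
        (since,)).fetchall()
    target.executemany("DELETE FROM players WHERE id = ?", dead)
    target.commit()
    return bool(changed or dead)

def synchronize(source, target, last_sync="1970-01-01 00:00:00"):
    """Steps 3 and 4: repeat until a pass finds nothing to do.
    The cutoff is read *before* each pass so rows changed mid-pass
    are picked up by the next one."""
    while True:
        cutoff = source.execute("SELECT datetime('now')").fetchone()[0]
        if not sync_once(source, target, last_sync):
            return cutoff  # new "last synchronization" marker
        last_sync = cutoff
```

Because both datastores stay online throughout, and any rows modified during a pass are swept up by the following pass, this is what lets two datastores converge without downtime.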
Alternative Change Data Tracking Strategies
The most popular competing synchronization strategy is the “transaction log” model. This model is higher fidelity in its playback and history, but more complicated to implement. It also requires more database writes, which are hard to scale.
With our timestamped row solution, one gets only a subset of Change Data Capture: not a full history, but just enough to synchronize two databases based on updated or deleted rows since the last synchronization. For Ayogo, that’s usually the right amount of data.
On Soft Deletes
Soft deletes are a convention where deleted rows do not get “totally” deleted but are archived or marked as “deleted” in-place.
Typically this can be done by moving deleted rows to an “archive table”. Alternatively, a “this row should be treated as deleted” flag can be used.
The “delete flag” solution minimizes the problem of “dangling pointers”: references to rows that were deleted without those references also being properly cleaned up. Your code should not produce such references, but bugs do happen. It also allows you to “schedule” your true deletions, which may be a useful operational property: “true” row deletion may trigger index rebalancing or compaction, which might slow your transaction-processing throughput.
On the other hand, the “delete flag” invites queries that “accidentally” return zombie rows that have already been deleted. One way to minimize this problem is to have a view that excludes deleted rows: you just need to be disciplined about when you use the “raw” and “filtered” views.
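A sketch of the “filtered view” approach in SQLite (the `players` table and the `live_players` view name are illustrative assumptions):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE players (
    id INTEGER PRIMARY KEY, name TEXT,
    last_updated TEXT, is_deleted INTEGER NOT NULL DEFAULT 0);

-- The "filtered" view: day-to-day queries go here and never see zombies.
-- Synchronization code still reads the "raw" players table directly.
CREATE VIEW live_players AS
    SELECT id, name, last_updated FROM players WHERE is_deleted = 0;
""")
conn.execute("INSERT INTO players VALUES (1, 'alice', datetime('now'), 0)")
conn.execute("INSERT INTO players VALUES (2, 'bob',   datetime('now'), 1)")

# The soft-deleted row is invisible through the view.
live = [name for (_, name, _) in conn.execute("SELECT * FROM live_players")]
```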