Hello, my name is 👨🏻 David Shekunts 👴🏿 and I'm tired of debugging at night, I'm tired of unmaintainable code, I'm tired of technologies that break. Here I collect best practices that help me rest more. I hope they will help you too.
📚 Table of Contents
- 📖 Dictionary
- 🍔 App structure
  - i. Monorepo
  - ii. Monolith
  - iii. Microservices
  - iv. Modular monolith
  - v. What to choose?
  - vi. How to divide microservices
    - a. Separate team
    - b. Security
    - c. Geo-dependence
    - d. Stateful
    - e. Separate feature set
    - f. Dedicated resources
    - g. Interferes with others
    - h. Deployment independence
    - i. Business logic independence
    - j. Mini-product
  - vii. FALSe - Best Project Structure
  - viii. Graceful-shutdown
  - ix. Separate your Cron
- 🎛️ API
  - i. RPC
  - ii. CQRS
  - iii. (a)Sync Communication
  - iv. Schema
  - v. Push | Pull
  - vi. Create IDs on the client
- 👴🏿 Type it
  - i. Branded & Nominal Types and validation on type
  - ii. Algebraic Data Types (ADT) and Invariants
  - iii. implements – incredible evil
  - I prefix – evil
- 🐞 Error Handling
  - i. Classified Errors
  - ii. Error Dictionary
  - iii. Return, not throw
- 🧠 Architecture
  - i. Functionally Oriented Programming (FOP)
  - ii. Data Oriented Architecture
  - iii. Dirty vs Clean Architecture
  - iv. Vertical Slices
  - v. Event Driven Architecture (EDA)
  - vi. Say "NO" to master-master
  - vii. Say "yes" to master-slave
  - viii. Horizontal scaling
- 💾 Databases
  - i. ORM or not
  - ii. Migration first
  - iii. Optimistic & Pessimistic Concurrency Control
  - iv. Transactions
  - v. Distributed Transactions
  - vi. Drop Relations
  - vii. Drop Constraints
  - viii. How to Choose a Database
  - ix. Use UUID
  - x. INSERT-s / UPDATE-s / DELETE-s must be batched
  - xi. "Storages like Onions"
  - xii. Event bus
  - xiii. SQLite
- 🔎 Testing
  - i. General
  - ii. Unit, Integration, E2E tests
  - iii. Integration tests
- 🌎 Logging, metrics, tracing
  - i. Metrics are your spider sense, Traces are your map, Logs are your eyes
  - ii. Meta-info of logs
  - iii. Technical
- 👨👧👦 Leading
  - i. Triptych – the ideal structure of a technical team
- 💻 Programming
  - i. Everything is concurrent
  - ii. Avoid Mutexes
  - iii. Program as if everything is already broken
  - iv. And much, much more
- 👨🏻 About the Author
📖 Dictionary
Terms that will appear in the text:
- "Data source" – database, cache, third-party API, basically anywhere we get data from or send data to.
- Application Interface (API) – a way to interact with a running application (stdin, stdout, TCP, UDP, etc.)
- "Endpoint" – one of the API methods (e.g., HTTP API URL, gRPC command description, mutation/query/subscription in GQL, etc.)
- Message Broker (MB) – ability to publish messages (usually by "topics") and subscribe to their appearance
- Message Queue (MQ) – same as MB, but guarantees delivery sequence
- Event Driven Architecture (EDA) – we emit "Events" (id, name in past tense, data) and consume them with any number of services.
- CI / CD – automated processes required for building and deploying services
- "Infrastructure" – databases, CI/CD, servers, orchestration, network, and everything else
- Orphan data – data that only made sense when the data it was tied to existed (essentially, any data we want to make ON DELETE CASCADE)
- "Projection" – readonly data calculated based on other data. For example, we save all messages from a device to one table, and then calculate projections of its current and historical temperature readings / battery status / location / etc.
- System degradation – when new functionality breaks old functionality, considered one of the worst types of errors.
- Feature flag – boolean value (true | false) that you pass into code (via env) and depending on whether it's on or off, you enable or disable functionality
- Real-time – systems that involve processing and responding in milliseconds (maximum 1-3 seconds)
- Application (app) – code running as a process
- Instance – a unit of a running application
- Transport – any way two processes interact, involving sending/receiving data (TCP, HTTP, MQ, 2IP, stdin, etc.)
- Internal communication – calling business logic within one application by calling a function from within code
- External communication – calling business logic from another application using some Transport
🍔 App structure
Let's structure applications:
i. Monorepo
There's a lot to discuss about this, but these 3 properties are why I choose monorepo in 9 out of 10 cases:
- Atomic deployments – backend, frontend, and infrastructure code are deployed to production at once
- Everything in one place – even with 5 repositories, problems start with finding the right things
- Shared code – ability to use local code between services without publishing
It's important that with a monorepo you need to take care of:
- Development and staging environments, so that day-to-day work happens in the first, while changes headed for production go through the second
- If you need to publish libraries, be sure to add a CI/CD stage with manual control for their building and publishing (main releases the main version, development releases beta, other branches release alpha)
ii. Monolith
!ATTENTION! the following "Pros" and "Cons" are in comparison between Monoliths, Microservices, and Modular monoliths.
- One large codebase
- Runs in approximately 1 instance
Pros:
- Avoids distributed problems (cross-communication and state synchronization)
- Fast deployment
- Less hassle with Infrastructure
- Easier to debug
Cons:
- All the above pros work up to a certain size, after which they completely disappear
- Difficulty of horizontal scaling
- Single point of failure
- Without code writing rules, it turns into spaghetti
iii. Microservices
Essence:
- Codebase separated from each other
- Ideally, their own resources (different servers, databases, caches, etc.)
Pros:
- Can isolate codebase in a separate place (security, management)
- Theoretically maximum horizontal scaling
Cons:
- Mistakes in microservice boundaries are much worse than anything in a monolith, no matter how big it gets
- Complexity of communication and state synchronization at maximum
- Difficult to deploy
- Difficult to monitor infrastructure
iv. Modular monolith
Essence:
- Reuse the same codebase but run in different instances
- Use the same data sources
Pros:
- All the advantages of monoliths, without the point of irreversible bloat
- All the advantages of microservices, minus the complexity of cross-communication and state synchronization
Cons:
- Sharing resources (e.g., database queries) can cause them to overload
- More difficult to deploy and maintain than a monolith
- Without code writing rules, it turns into spaghetti
v. What to choose?
Within one team:
- Start with a modular monolith, trying to avoid cross-service communication as much as possible
- Extract something into a microservice only out of NECESSITY, meaning you can clearly see there's really no other way (this point may never even arrive)
But each separate team should make their own separate modular monolith and microservices, because working on one shared codebase is difficult if you're not part of one team.
vi. How to divide microservices
Dividing microservices by "responsibility" is the biggest mistake.
Humans are very poor at classification and categorization, and if your methodology requires strict grouping (e.g., OOP or "division by responsibility"), you will never be able to do it right.
Why? Because even if you could identify a specific set of features by responsibility and break them down into microservices/classes that satisfy business requirements, the world doesn't stand still. New requirements will constantly emerge that blur the boundaries of this "responsibility," and the right architecture today becomes the wrong one tomorrow through no fault of your own.
The second problem: "responsibility" is a very subjective concept. Ask several people in a medium+ system: "what responsibilities would you identify?" – each person's variant will be 50% different from the others. And increasing the degree of "subjectivity" is a path to endless errors.
This "invented responsibility" by developers can be called "artificial responsibility," but we need to focus on "natural responsibility," and here's a checklist of examples of this "natural responsibility":
a. Separate team
If there are two teams of people who need to solve independent parts of the system and don't communicate on a regular basis (weekly, daily, their own management, etc.), it's better for each to make their own set of microservices/modules and agree on APIs.
Even shared libraries are dangerous (there should be 1 specific maintainer of this library), because any intervention in the code by a developer from an outside team is very poorly controlled and can degrade half the system
b. Security
You're making a module that handles transactions, and you want as few developers as possible to develop it, have access to it, or even see its code (to reduce the risk of a breach).
Then you create a separate microservice, with its own repository, and only expose a secure API.
c. Geo-dependence
For example, you have the task of collecting data from your client's devices, but these devices live within the internal network of a specific warehouse. Then you create a separate microservice that will live within a server on this warehouse and communicate with it through transport.
The same applies if you need to place some service only within one of the regions (for example, only RU or EU), then you can make a separate module just for this region.
d. Stateful
For example, you need to keep open Web Socket or TCP connections, while you want to redeploy and generally touch this application as rarely as possible for maximum uptime, then you create a separate microservice/module that will store the state (socket) and communicate with it through transport
e. Separate feature set
Your partners need a stripped-down/different variation of your main API, so you create a separate module in which you expose only what's needed + replace the authentication method (for example, with OAuth) and so on.
f. Dedicated resources
Your application requires heavy work with the file system, or image compression, or streaming video processing, or CPU-intensive calculations – in short, something that requires dedicated resources. Then you create a microservice.
g. Interferes with others
A particular endpoint is especially heavily loaded (for example, you collect every click from the client's browser) and it generates too much traffic, the processing of which interferes with processing of the remaining requests.
In this case, you can take exactly this piece and isolate it into a separate module/microservice and deploy it on a separate machine.
h. Deployment independence
When we want different services to be deployed independently of each other (for example, so that redeployment doesn't affect applications whose code wasn't changed in a PR)
Then we can create a module (modules have their own independent Docker images, so after building we can check if the hash has changed to make a redeployment decision) or a microservice
i. Business logic independence
This is the MOST difficult aspect, because the independence of one microservice from another is often a temporary concept (remember that requirements always change), but if one module is truly capable of operating independently of another (for example, a reward calculation system and an authentication system), then it's a candidate for extraction into a microservice.
BUT I recommend doing this as a second step, after you've understood that these modules have truly become independent of each other and for the convenience of further development, you want to separate the codebases.
ONCE AGAIN, only time will show the real independence of codebases
j. Mini-product
When you need to build a demo/prototype product, and you're not sure whether it will remain in this form, it's often easier to create a separate microservice. This gives freedom to experiment without worrying about side effects on the primary system.
vii. FALSe - Best Project Structure
4 main folders:
- Features - code with business logic of the application, divided by domains.
- Apps - code that launches the application in different configurations.
- Libs - code that could become private or public libraries.
- Scripts - code that we occasionally want to run locally or from CI/CD.
Example:
```
/features
  /auth
    login.ts
    register.ts
  /user-management
    get-user.ts
    delete-user.ts
/apps
  /main-http
    /http-api
      schema.ts
    config.ts
    index.ts
  /cron
    index.ts
/libs
  /@my-company
    /specific-lib
      index.ts
  /logger
    index.ts
/scripts
  /inactivate-stale-user
    index.ts
```
Currently, this structure covers 100% of the cases I've encountered.
viii. Graceful-shutdown
Always close your applications carefully:
- Start a timer that will kill the application even if it hasn't finished
- Close APIs (HTTP, MQ, etc.)
- Turn off all cron jobs and intervals
- Wait for current business logic processes to complete
- Close external connections (DB, MQ, etc.)
Most often, you should respond to SIGINT and SIGTERM.
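A minimal Node.js/TypeScript sketch of this shutdown order; `httpServer`, `cronJobs`, and `db` are stand-ins for your real handles:

```ts
import http from "node:http";

// Hypothetical handles – replace with your real server, jobs, and DB pool
const httpServer = http.createServer((_req, res) => res.end("ok"));
const cronJobs: { stop(): void }[] = [];
const db = { close: async () => {} };

let shuttingDown = false;

async function shutdown(signal: string) {
  if (shuttingDown) return; // ignore repeated signals
  shuttingDown = true;
  console.log(`${signal} received, shutting down`);

  // 1. Timer that kills the process even if cleanup hangs
  const killTimer = setTimeout(() => process.exit(1), 30_000);
  killTimer.unref();

  // 2. Stop accepting new work (HTTP, MQ consumers, etc.)
  await new Promise<void>((resolve) => httpServer.close(() => resolve()));

  // 3. Turn off cron jobs and intervals
  for (const job of cronJobs) job.stop();

  // 4. Wait for in-flight business logic here (e.g. drain a counter of active jobs)

  // 5. Close external connections (DB, MQ, etc.)
  await db.close();

  process.exit(0);
}

process.on("SIGINT", () => void shutdown("SIGINT"));
process.on("SIGTERM", () => void shutdown("SIGTERM"));
```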
ix. Separate your Cron
Always create separate applications for cron jobs to run them separately and not interfere with horizontal scaling of other applications.
BUT I don't recommend making a separate application with just 1 instance for this, because then there's no guarantee of cron execution.
The best option is to use a cron scheduler, for example, the cron scheduler built into k8s.
This way, even if a server is unavailable, the scheduler will run the cron job where and when possible.
It's also useful to make the cron job simply emit an event indicating that some operation needs to be performed, and then one (or several) of the already active applications react to it and do what's needed.
This way you get more control: (1) within the executor application you can better distribute the load, (2) it's easier to write logic for distributing this cron calculation, (3) react to and debug the cron job, (4) not run more calculations than needed.
🎛️ API
i. RPC
Remote Procedure Call – one of the most convenient ways to structure an API.
First, it's well-suited for request-response:
```ts
// # Request
{
  id: string
  name: "GetUser",
  params: {
    userId: string
  },
  meta: {
    ts: Date
    requesterId: UUID
    traceId: UUID
  }
}

// # Success Response
{
  id: string // same as in request
  result: {
    case: "success",
    success: {
      email: string
      avatar: string
    }
  }
}

// # Failure Response
{
  id: string // same as in request
  result: {
    case: "failure",
    failure: {
      code: number
      message: string
    }
  }
}
```
But Request can also be used as an Event:
```ts
{
  id: string
  name: "UserCreated",
  payload: {
    userId: string
  },
  meta: {
    ts: Date
    requesterId: UUID
    traceId: UUID
  }
}
```
Second, it allows communication through just one bi-directional channel (for example, as with WebSockets), which means it can be used with absolutely any protocol/communication interface.
Famous RPC implementations: gRPC, GraphQL, Pg Wire Protocol, JSON RPC.
ii. CQRS
Command Query Responsibility Segregation – to simplify it maximally, there are 2 main rules:
- If you're returning data, you have no right to change the system state (write anything to DB, cache, change variable values)
- If you're changing the system state, you can only respond with "OK" or "Error"
This way of building any API/Interface guarantees the possibility of (1) convenient horizontal scaling of the API, (2) fewer problems when writing and maintaining business logic, (3) the ability to use eventual-consistency.
!IMPORTANT! It's impossible to follow it in 100% of cases, for example, as with JWT authentication: you'll most likely need to create a token, save it, and then return it to the user.
Therefore, you shouldn't implement CQRS everywhere; you should simply maximize its use.
Lifehack 1: for CQRS to work at full capacity, I recommend making all entity IDs UUIDs so the client can create and pass them.
Lifehack 2: combines perfectly with RPC.
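A minimal sketch of both rules, using an in-memory map as a stand-in for a real database (names are illustrative):

```ts
// Tiny in-memory stand-in for a real database
const posts = new Map<string, { id: string; authorId: string; text: string }>();

// Command: changes state, responds only with "OK" or "Error"
type CreatePostCommand = { postId: string; authorId: string; text: string };

const createPost = async (
  cmd: CreatePostCommand
): Promise<{ ok: true } | { ok: false; error: string }> => {
  if (posts.has(cmd.postId)) return { ok: false, error: "already_exists" };
  posts.set(cmd.postId, { id: cmd.postId, authorId: cmd.authorId, text: cmd.text });
  return { ok: true }; // nothing beyond OK/Error is returned
};

// Query: returns data and is not allowed to change any state
type GetPostQuery = { postId: string };

const getPost = async (q: GetPostQuery) => posts.get(q.postId) ?? null;
```

Note that the client supplies `postId` itself, which is exactly Lifehack 1 in action.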
iii. (a)Sync Communication
Synchronous communication (sync) - we send a request to a "channel" and block it until a response returns (HTTP, gRPC).
Asynchronous communication (async) - we send a request to a "channel," and we'll receive a response sometime later from it or another channel (UDP, WS, Message Queue, Message Broker, etc.)
Pros of sync: security, reliability, simplicity, speed
Cons of sync: manual routing, lots of blocked channels
Pros of async: eventual consistency, using 1 channel, ability to use with queues, EDA, works well with RPC
Cons of async: slower and less reliable
iv. Schema
Always start describing your API from the schema, and only then write the implementation code.
The most convenient schemas:
- OpenAPI or GQL for HTTP / MQ / MB / WS
- Protobuf for gRPC / MQ / MB / WS
Lifehack 1: use Union types more often, such as `oneof` from gRPC, or `allOf` / `oneOf` from OpenAPI.
v. Push | Pull
Push Model – we send something somewhere.
+ realtime
+- smart producer
- requires a backpressure mechanism to control load on the reading side
Pull Model – we get something from somewhere.
+ absolute control over the consumption process
+- smart consumer
- true real-time is impossible
- doesn't work with non-streaming data (for example, HTTP requests are difficult to implement using the Pull model)
This applies to MQ, storage systems (Prometheus and InfluxDB), and code implementations (Event Emitter vs Async Iterator).
In practice, I try to use the Pull model more often, and Push only where Pull simply won't work.
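A small illustration of the two models in code, using Node's EventEmitter for Push and an async iterator for Pull:

```ts
import { EventEmitter } from "node:events";

// Push: the producer decides when the consumer runs
const pushSource = new EventEmitter();
pushSource.on("data", (msg) => {
  // the consumer has no built-in way to say "slow down" – backpressure must be added explicitly
  console.log("pushed:", msg);
});
pushSource.emit("data", 1);

// Pull: the consumer decides when to take the next item
async function* pullSource() {
  let i = 0;
  while (i < 3) yield i++;
}

for await (const msg of pullSource()) {
  // the loop body finishes before the next item is pulled – natural backpressure
  console.log("pulled:", msg);
}
```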
vi. Create IDs on the client
The entity identifier should be created by the client:
- This allows the client to independently request the necessary data in case of success
- Easier to implement streaming / real-time communication
- Allows the use of Eventual Consistency
- The unique identifier will also act as an idempotency key
- On the backend, you can create entities, link them with this UUID, and only then write to the database (instead of making a record, getting a SERIAL, and only then creating the next related entity)
Use UUID v7 or a similar unique identifier with an embedded date for this purpose.
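A simplified sketch of generating a UUID v7 on the client (48-bit millisecond timestamp + random bits); in a real project you'd likely reach for a library such as the `uuid` package instead:

```ts
import { randomBytes } from "node:crypto";

// Simplified UUID v7: first 6 bytes are the big-endian Unix-ms timestamp,
// the rest is random, with version/variant bits set per the UUID layout.
const uuidV7 = (): string => {
  const bytes = randomBytes(16);
  const ts = BigInt(Date.now());

  for (let i = 0; i < 6; i++) {
    bytes[i] = Number((ts >> BigInt(8 * (5 - i))) & 0xffn);
  }
  bytes[6] = (bytes[6] & 0x0f) | 0x70; // version 7
  bytes[8] = (bytes[8] & 0x3f) | 0x80; // RFC 4122 variant

  const hex = bytes.toString("hex");
  return `${hex.slice(0, 8)}-${hex.slice(8, 12)}-${hex.slice(12, 16)}-${hex.slice(16, 20)}-${hex.slice(20)}`;
};

// The client creates the id up front, so it can (1) request the entity later
// by this id and (2) retry the call safely – the id doubles as an idempotency key.
const postId = uuidV7();
// send { name: "CreatePost", params: { postId, text: "..." } } to the backend
```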
👴🏿 Type it
Advanced typing techniques:
i. Branded & Nominal Types and validation on type
This chapter has been moved to the book λ Functional Oriented Programming:
ii. Algebraic Data Types (ADT) and Invariants
This chapter has been moved to the book λ Functional Oriented Programming:
iii. `implements` – incredible evil
`implements` completely destroys the purpose of interfaces. The essence of an interface is to separate a specific implementation from the set of methods a specific function needs, in order to reduce code coupling.
If you want to explicitly declare that some "class" should include a set of methods, then use abstract classes, that's what they were created for.
An interface is needed to declare what you need in one place, implement it in another, and pass the implementation to a request in a third.
```ts
// business-logic/some-fn.ts
interface User {
  id: string
  email: string
}

interface UserDataSource {
  getUserById(id: string): User
}

function someLogic(uds: UserDataSource) {
  // ...
}

// databases/pgsql.ts
interface UserTable {
  id: UUID
  email: string
}

const UserTableService = (conn: PgConnection) => {
  return {
    getUserById: (id: string): UserTable => {
      // ...
    }
  }
}

// app/main.ts
const pg = PgConnection()
const userTable = UserTableService(pg)
someLogic(userTable)
```
Only in such a situation do you reduce code coupling, which means you're correctly using interfaces.
`I` prefix – evil
The most important problem: by using the `I` prefix, you are structuring your code incorrectly.
- It's a semantic error – you don't name all classes with a `C` prefix or all numbers with an `Int` postfix
- With this naming convention, you tie the interface to the implementation, but the essence of an interface is precisely to decouple one from the other
```ts
interface IPaymentService {
  // ...
}

function extendSubscription(ps: IPaymentService) {
  // ...
}
```
Now we need to implement Stripe and Tinkoff as payment services. Based on this code, a person would name them `StripePaymentService` and `TinkoffPaymentService`, but the question is: what if somewhere else in the code there's an `ISubscriptionService`, and our Stripe and Tinkoff completely match it and need to be used there too? Would we have to add something like `StripePaymentSubscriptionService` to the name? No, that's absolute nonsense.
It's sufficient to simply name the interfaces `PaymentService` and `SubscriptionService`, and the implementations `Stripe` and `Tinkoff` – then everything becomes absolutely logical.
And here are more details:
To begin with, I'll list some links that explain this topic in great detail, and then I'll share my own perspective:
Now, here's what I think about it:
First of all, it's a textbook example of Hungarian Notation. If you're naming an `interface` with the `I` prefix, then why aren't you naming a `class` with a `C` prefix or a `string` with an `S` prefix? It makes logical sense not to add a type prefix to its name.
Secondly, the first link provides an excellent explanation:
If you stop to think about it, you'll see that an interface really isn't semantically much different from an abstract class:
- Both have methods and/or properties (behaviour);
- Neither should have non-private fields (data);
- Neither can be instantiated directly;
- Deriving from one means implementing any abstract methods it has, unless the derived type is also abstract.
In fact, the most important distinctions between classes and interfaces are:
- Interfaces cannot have private data;
- Interface members cannot have access modifiers (all members are "public");
- A class can implement multiple interfaces (as opposed to generally being able to inherit from only one base class).
Since the only particularly meaningful distinctions between classes and interfaces revolve around (a) private data and (b) type hierarchy - neither of which make the slightest bit of difference to a caller - it's generally not necessary to know if a type is an interface or a class. You certainly don't need the visual indication.
I'll explain this thought a bit differently:
```ts
class User {
  constructor(
    public id: string,
    public email: string
  ) {}
}

// When you use the User class in typing, you're actually using
// its interface
const someFn = (user: User) => {
  // ...
}

// That is, in fact, the User class consists of 2 parts:

// 1. The class type
interface User {
  id: string
  email: string
}

// 2. The class runtime (conditionally)
const User = {
  new: (id: string, email: string): User => {
    return { id, email }
  }
}
```
The only important difference is that the class interface also includes private properties.
BUT from the perspective of the `someFn` function, are its private properties important? No, because this function can only call its public properties.
Consequently, we always use an `interface` anyway, even when we specify a class in the type, so why then, when writing our own `interface`, should we need to know that it's an interface by adding the `I` prefix?
Thirdly, it's much better if your interface is named without a prefix, but its implementations have a postfix:
```ts
interface UserRepository {
  insert(user: User): void
}

// PSQL implementation
class UserRepositoryPostgreSQL {
  insert(user: User): void {}
}

// Mongo implementation
class UserRepositoryMongoDB {
  insert(user: User): void {}
}
```
Because implementation is a specification of the interface, which should be reflected in the naming.
Fifthly, this is a side effect, of course, but I've seen many times how developers who don't understand the meaning of interfaces would create an interface with an `I` prefix next to each implementation class...
Folks, these interfaces are only needed for (1) the ability to substitute implementations (that is, you should have at least 2 classes that meet the same interface) or (2) to reduce code coupling (but then we should describe the interface not next to the implementation, but where this decoupling occurs).
🐞 Error Handling
i. Classified Errors
Create a layered classification of errors to make it easier to check their types:
```ts
// 1. First the base error
type BaseError = {
  _isBaseError: true // for simple checking
  statusCode: number // always use HTTP Status Code
  type: string // unique keys of the error dictionary (more on this below)
  message: string // internal error description
  data: JSON // additional data
}

// 2. The first layer is usually about whether they are available internally or externally
type InternalError = BaseError & {
  _isInternalError: true
}

type PublicError = BaseError & {
  _isPublicError: true
  publicMessage: string // this error message will be shown to users
}

// 3. Then standard types
type ForbiddenError = PublicError & {
  _isForbiddenError: true
  statusCode: 403
}

type NotFoundError = PublicError & {
  _isNotFoundError: true
  statusCode: 404
}

// and so on

// 4. Then your specific ones
type SubscriptionEndedError = ForbiddenError & {
  _isSubscriptionEndedError: true
  publicMessage: "Your subscription ended"
}

type PostsNotFoundError = NotFoundError & {
  _isPostsNotFoundError: true
  publicMessage: "No posts"
}
```
This way, you can check the error for the needed type anywhere in the program.
For example, if an error reaches the highest level (like a user's HTTP API), you can check and return only what's needed:
```ts
const errorHandler = (request: Request, error: Error) => {
  if (error._isBaseError) {
    if (error._isPublicError) {
      return request.status(error.statusCode).message(error.publicMessage)
    } else {
      return request.status(500).message("Internal error")
    }
  }
}
```
ii. Error Dictionary
Use a separate field and add a unique value to it, for example:
```jsonc
// errors.json
{
  // key is unique, and the value is any more or less understandable string
  "user_email_validation_error": "Email is incorrect",
  "value_too_long": "Value too long: ",
  "you_are_not_owner": "You are not owner"
}
```
Put unique error keys into a localization system (e.g., Lokalize, i18n, etc.), and this way both on the backend and frontend you will, first, better understand what error was sent, and second, be able to automatically translate them into the needed languages.
If special values are needed, send them as an object or always add them to the end of the string.
iii. Return, not throw
Almost always I prefer returning an error instead of throwing it. You can return either the error itself or an entity that can contain either a response or an error (Either monad).
- Program clarity increases tenfold
- You can type the returning errors
- Debugging is simplified
- Code for error handling looks cleaner
- It's harder to forget to handle all the necessary cases
Go, Rust, Zig – modern languages for which returning errors is the norm.
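A minimal Result type sketch (a simplified Either) showing what returned, typed errors look like:

```ts
type Ok<T> = { ok: true; value: T }
type Err<E> = { ok: false; error: E }
type Result<T, E> = Ok<T> | Err<E>

type UserNotFoundError = { type: "user_not_found"; userId: string }

// In-memory stand-in for a real lookup
const users = new Map<string, { email: string }>()

const findUser = (userId: string): Result<{ email: string }, UserNotFoundError> => {
  const user = users.get(userId)
  if (!user) {
    return { ok: false, error: { type: "user_not_found", userId } }
  }
  return { ok: true, value: user }
}

// The caller is forced to handle both branches explicitly
const res = findUser("42")
if (!res.ok) {
  // handle the typed error (res.error.type is known at compile time)
} else {
  // use res.value.email
}
```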
🧠 Architecture
i. Functionally Oriented Programming (FOP)
FOP is a functional alternative to OOP.
I am convinced that OOP in modern backend applications is an atavism that needs to be actively eliminated.
Why and what to replace it with, you will learn in the online FOP book:
ii. Data Oriented Architecture
I have moved this chapter to the book λ Functionally Oriented Programming:
iii. Dirty vs Clean Architecture
Clean / Hexagonal / Onion Architecture advocate for separating business entities (User, Order, Product, etc.) from infrastructure (DB, controllers, etc.).
This is a good approach, BUT ONLY IN ONE SPECIFIC CASE, namely, when the infrastructure MUST be replaceable (for example, you have technology that assumes someone else can deploy it with a database of their choice). People try to use it absolutely everywhere.
I can tell you 1000% that the more straightforward I act (almost putting SQL queries in http controllers), the more reliable, scalable, optimized, and flexible programs I get.
I call this Dirty Architecture.
I'm ready to bet I'll catch 10,000 insults for this message, but I'm betting my ass on it, because I've used Clean Architecture and DDD in production, felt them on my own skin, and only after that did I realize in practice how unrealistic their promises are.
In short:
- Any abstractions of business logic increase Artificial complexity (link), and in CA and DDD, this is maximized.
- Code is the servant of data (DB), not vice versa, so the fewer abstractions over data in your code, the easier it will be to work with it (link)
- If you create a program so that you can change, for example, PG to MongoDB, then your program will work poorly with both databases
- Abstraction over the database immediately assumes that (1) you won't be able to use important features of a specific database, (2) you won't be able to optimize for a specific database, (3) you will work unoptimized and very quickly reach the ceiling of resources you need
- The more reusable code (for example, the Entities layer), the more danger there is of breaking something that works when adding a new feature / changing an old one (over time, this probability reaches 100%)
iv. Vertical Slices
The concept of Vertical Slices fits well within the context of Dirty Architecture.
To simplify, many applications are divided into folders like: use-cases, db-queries, models
When a request comes in, it's first processed by use-cases code, which calls the database, which uses models.
This means that a single feature (request processing) is scattered across multiple folders containing code for other features. This structure is called Horizontal Slices.
The idea of Vertical Slices is to have a folder for each feature, and within each folder to include both the controller, use-case logic, database query descriptions, and model code used in that feature.
One feature is not allowed to use another feature's code directly, only call its controller / use case.
- This gives us true independence of code bases, and the chance of affecting something else when changing a feature approaches zero
- Logic related to a specific feature doesn't pollute the common codebase but stays where it's needed, which incredibly helps in understanding the codebase
- Testing such code is much easier
- We can easily enable/disable/move features anywhere
- Each feature can use its own technologies (you can even change the Query builder, database, library, etc.)
It's important to add a couple more conditions:
- Keeping 1000 features in one folder is inconvenient, so feature folders should be distributed across domain area folders (auth, shop, reports, blog-system, etc.)
- Within a domain area, you can have reusable code for that domain area
- And common reusable code (like a DB schema) can always be organized as a "library" / SDK that individual features will use
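A hypothetical layout under these rules (all folder and file names are illustrative):

```
/features
  /shop                      # domain area
    /create-order            # one vertical slice
      http.controller.ts
      create-order.use-case.ts
      order-queries.ts
      order-model.ts
    /cancel-order            # another slice with its own controller/queries/models
      ...
    shop-shared.ts           # code reusable only within the "shop" domain
  /blog-system
    /create-post
      ...
```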
v. Event Driven Architecture (EDA)
Honestly, I can't imagine almost any backend application without some variation of EDA.
The essence of EDA is to give us a mechanism where we can announce some Event so that other processes can be triggered, processes that we shouldn't need to know about at the publication point.
Some useful points:
- An Event consists of (see the sketch after this list):
  - `id: uuid`
  - `name: string` – in past tense (`UserRegistered`, `PostCreated`, etc.)
  - `payload: JSON` – content
  - `timestamp: UnixTimestamp` – creation time
  - `traceId: uuid` – an identifier passed from the very first call through all subsequent calls and events to build a complete picture
- Events have sizes
- S – name + entity id
- M – name + useful data
- L – name + all data
- XL – name + new and old data
- Each case requires its own "size," experiment
- If some data affecting the event or handlers might change over time, put it in the event
- If it's possible to reproduce some logic using events (i.e., when you don't need to wait for a response), it's better to do so
- Use queue systems with delivery guarantees and replication (like Kafka)
- It's better if events are persisted (like in Kafka) so they can be re-read
- Learn the Transaction Outbox pattern for very complex and dangerous operations
- For state synchronization, use Saga or Event-based Orchestrator
- Never expect an event to execute in a constant time
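Putting the event shape from the list above into code; the bus interface here is a hypothetical stand-in for Kafka/NATS/etc.:

```ts
import { randomUUID } from "node:crypto";

type AppEvent = {
  id: string;
  name: string;       // past tense: "UserRegistered", "PostCreated", ...
  payload: unknown;
  timestamp: number;  // unix timestamp
  traceId: string;
};

// Hypothetical bus interface – in production this would be backed by Kafka, NATS, etc.
interface EventBus {
  publish(topic: string, event: AppEvent): Promise<void>;
}

const publishUserRegistered = async (
  bus: EventBus,
  traceId: string,
  user: { id: string; email: string }
) => {
  const event: AppEvent = {
    id: randomUUID(),
    name: "UserRegistered",
    payload: { userId: user.id, email: user.email }, // "M" size: name + useful data
    timestamp: Date.now(),
    traceId, // passed through from the original request
  };
  await bus.publish("user-events", event);
};
```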
vi. Say "NO" to master-master
Don't use master-master technologies.
It seems like a "silver bullet," but in reality, these are technologies that should be used only when you've tried all other options:
- Too much data?
- Try separating cold and hot data, where cold data is archived and set aside for the future
- If data is only needed for projection, move it to horizontally scalable OLAP databases (Clickhouse) and remove it from the main database
- INSERT not fast enough? Use batching layers, like Kafka, where you dump all records that will later be batched and put into final storage
- UPDATEs too slow? Often you can change UPDATE to INSERT or use Event Sourcing / EDA – save events about changes, then aggregate and calculate projections from the other end
- DELETEs too slow? Mark data for deletion (deletedAt) and clean them once via cron
If you need m-m clustering, it means you have a complex task. If you have a complex task, it means it will be difficult to maintain reliability and even more difficult to debug.
m-m clustering significantly reduces reliability and speed while increasing infrastructure complexity and debugging difficulty.
If you want to use master-master, ideally your data should not have any kind of Constraints (and minimize UNIQUE), you should only use INSERT and SELECT (essentially, these are time-series, event sourcing, or OLAP data).
If you still need m-m, use technologies that are based on (or better yet, cannot operate without) CRDTs (Conflict-free Replicated Data Types).
That is, Redis / MySQL have optional m-m clustering, so you definitely shouldn't use them for this purpose.
On the other hand, etcd / cockroach / clickhouse are initially designed to work in a cluster, which means they can be trusted. But when transitioning to them, you will still pay a price, so you should be confident in your decision.
vii. Say "yes" to master-slave
Only for cache do I grant the right not to have slave replication, but only because cache should be treated as data that has the right to disappear at any moment.
In all other cases, I always choose technologies and configure them to have a slave, which will be (1) a read replica, (2) a fallback in case the master fails, (3) a backup node.
If you need strict slave synchronization, then choose technologies with RAFT.
When you need to grow, deploy N master-slave nodes and manage data between them at the application logic level (microservices, logical shards, actors, domains, etc.)
viii. Horizontal scaling
Write horizontally scalable applications.
First, absolutely any code is always concurrent, even within the most single-threaded language. When you write with horizontal scaling in mind, you remember this more often and make fewer racing errors.
Second, in modern realities, it's very easy to hit the ceiling of a single-instance application, and transitioning from vertical scaling to horizontal is incredibly difficult, while in the opposite direction, no problems arise.
With horizontal scaling, you need to consider:
- You'll need external storage for state synchronization (Redis-like)
- Absolutely all processes become concurrent, which means you need to either know how to distribute (for example, round-robin on queues) or know how to lock (Redis-like / etcd-like)
- Problems can occur on only some instances, so in resource monitoring, you need to separate each individual instance
IMPORTANT! Don't confuse horizontal scalability with "microservices" - you can horizontally scale a monolith as well (especially a distributed one).
💾 Databases
i. ORM or not
If you're writing a library/service that can be used with different databases, then you can use an ORM.
In all other cases (which means, almost always) use libraries that are as close as possible to the query language, meaning either pure SQL/CQL/Dynamo API, or a Query Builder.
ii. Migration first
In 90% of cases, this is a much more convenient and reliable approach:
- Write migrations
- Apply them to the database
- Perform introspection – export the table schema into your language's type system and constants (table names, column names, etc.)
iii. Optimistic & Pessimistic Concurrency Control
Pessimistic Concurrency Control (PCC) – lock the data when retrieving it, make changes, write it back, remove the lock (essentially a Mutex).
+ operations are reliable
- slow and there's a chance of deadlocks
Optimistic Concurrency Control (OCC) – retrieve data, modify it, when trying to write it back, check that no one else has changed it before us (for example, when writing, check that the same updated_at or version remains).
+ fast, simple
- the more competition, the slower the system will work or not work at all
We can immediately conclude that if your algorithm involves competition (multiple processes should sequentially UPDATE the same data), then OCC is definitely not suitable. But if you only have a probability of competition (two processes decided to write to the same entity), OCC can significantly speed up the system with minimal cost.
(O|P)CC are applicable to absolutely any data sources and to building code logic.
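For example, a sketch of OCC over a relational table; the `db.query` client is a hypothetical pg-style driver, and the table/columns are illustrative:

```ts
// Hypothetical pg-style client with query(sql, params) -> { rowCount }
declare const db: { query(sql: string, params: unknown[]): Promise<{ rowCount: number }> };

const updateOrderStatus = async (orderId: string, expectedVersion: number, status: string) => {
  // The UPDATE only applies if nobody has bumped the version since we read the row
  const res = await db.query(
    `UPDATE orders
        SET status = $1, version = version + 1
      WHERE id = $2 AND version = $3`,
    [status, orderId, expectedVersion]
  );

  if (res.rowCount === 0) {
    // Lost the race: re-read the row and retry, or surface a conflict error
    return { ok: false as const, error: "version_conflict" };
  }
  return { ok: true as const };
};
```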
iv. Transactions
If you've never used different transaction isolation levels, it means you've never written even medium-level applications or you did it incorrectly.
On average, it will be like this:
- Read uncommitted – we read data that is not yet committed (maximum speed, minimum reliability, suitable when the data we read cannot fail to be written, for example, because they have no Constraints)
- Read committed – we read only committed data (reliable, simple, works)
- Repeatable read – within a transaction when reading, we will always get the same data as if reading from a snapshot (more complex, but a common case, such as when using a sub-query)
- Serializable – we put all transactions in a queue (minimum speed, maximum reliability, prepare for deadlocks)
Also, be sure to study what gets locked and when (current/related rows, table, schema, or database) and know how to control the locking level (e.g., `SELECT ... FOR UPDATE / FOR SHARE / SKIP LOCKED`).
Tips:
- Design your architecture to minimize or avoid using transactions (and if you do use them, not higher than Repeatable read). And yes, architectural solutions can allow working without transactions.
- Never delay transactions – slowing down 1 transaction can lead to cascading growth and errors throughout the system.
- If you had to use Serializable, then most likely, you were just lazy / don't have time to do it differently.
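For reference, a short sketch of explicitly controlling both the isolation level and row locks (hypothetical pg-style client, illustrative table); note how the transaction stays as short as possible:

```ts
declare const client: { query(sql: string, params?: unknown[]): Promise<{ rows: any[] }> };

const withdraw = async (accountId: string, amount: number) => {
  // Repeatable Read + an explicit row lock; no timers or external calls inside
  await client.query("BEGIN ISOLATION LEVEL REPEATABLE READ");
  try {
    const { rows } = await client.query(
      "SELECT balance FROM accounts WHERE id = $1 FOR UPDATE",
      [accountId]
    );
    await client.query("UPDATE accounts SET balance = $1 WHERE id = $2", [
      rows[0].balance - amount,
      accountId,
    ]);
    await client.query("COMMIT");
  } catch (e) {
    await client.query("ROLLBACK");
    throw e;
  }
};
```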
v. Distributed Transactions
Never delay a transaction: don't use timers, don't make external calls.
Delaying a transaction geometrically increases the operation time of the entire system.
But what if it's necessary, for example, we depend on a third-party API?
Change the application logic so that it works without transactions.
If you approach the problem exactly this way, you will discover that there are ways to solve the issue.
And in those places where you need to know whether a set of actions has completed and in what state, create state machines, for example, "Jobs" (N-phase commits) or more reliable but complex "Sagas".
vi. Drop Relations
Most likely, you don't need relations.
This is surprising and counterintuitive, but at a certain volume of work, you will start to notice this yourself:
- ON DELETE CASCADE is an incredibly dangerous construct that (1) can delete necessary data, (2) significantly slows down the database, (3) is very difficult to control and debug. It can only be used with one-to-one relationships; in other cases, it's better to delete orphaned data with a cron job during periods of low load.
- Even more often, you'll notice that when deleting data, you don't actually need to delete related data—in fact, it's harmful—making ON DELETE CASCADE completely pointless.
- ON DELETE / UPDATE ... transfers business logic to the database, which leads to tons of debugging and not understanding "why it doesn't work."
- FOREIGN KEY for checking the existence of an entity is often meaningless—if you received some id, then most often you either have to verify its existence in advance or it definitely exists. And even if not, the cron job from the first point will eventually delete this data.
- Orphaned data most often won't interfere with your queries because if we've deleted the linking data, they simply won't appear in standard queries (possible collisions only in OLAP queries, but there you need to monitor many aspects anyway).
- Sooner or later, you'll need to store some data in one database and some in another, and at that point, you'll lose relations anyway. If you reject them from the beginning, you'll be able to use as many different databases as you want and horizontally scale your storage 100 times easier.
vii. Drop Constraints
In addition to relations, if you also stop using constraints (like UNIQUE) and build application logic and architecture around this, you can more easily transition to using horizontally scalable databases while maintaining high system performance.
viii. How to Choose a Database
Well, besides articles, community, selling experience, etc., it's also important to check:
- Will it handle the required QPS
- Look at the required transaction levels and atomicity
- Check the number of available connections
- Check replication availability
- Check if it's suitable for many small inserts or only large ones
- Is there Update functionality
- Is there Upsert functionality
- Are there bulk loading methods (e.g., COPY in PG)
- How MVCC and GC are structured (to understand the complexity of Insert/Update/Delete, as well as the causes of pauses)
- What mechanisms provide Optimistic/Pessimistic Concurrency Control
- Check types of numbers, dates, arrays, and the presence of JSON/unstructured types, as well as different methods of working with them
- Check for normal libraries: adapter, query builder, migrator, and introspection
- Row-based or column-based
- OLAP or OLTP
- Is there CDC
- Eventual or Strong consistency
- If master-master, what consensus algorithm
- Is something additional required for clustering (e.g., zookeeper, etcd)
(soon there will be a comparison table of PostgreSQL, MySQL, MongoDB, Clickhouse, CockroachDB, TimescaleDB, etc.)
ix. Use UUID
If there are no specific nuances, use UUID as the primary key:
- You can prepare multiple entities in code that are related to each other by id and insert them into the database at once (in the case of serial, you would have to insert, get the id, and only then insert the next entity)
- Allows the client to send the entity id and after a successful operation request it through separate endpoints/mechanisms (for example, receive it from WS)
- Allows for delayed insertion (for example, if you want to batch entities and insert them later, but related entities can already appear in the database)
- If you sum up the 3 points above: allows building Eventual Consistency systems
- In emergency situations, allows you to go through almost all tables to find which entity this id belongs to
BUT use UUIDs that start with a timestamp (e.g., UUID v7), this increases insertion speed significantly.
x. INSERT-s / UPDATE-s / DELETE-s must be batched
Most databases don't like individual inserts. Most likely, a database will process 1 or 1,000 records in roughly the same time (and some, like Clickhouse, insist on batches of 100,000 or even 1,000,000 records).
If you're creating a high-load system, design the architecture with the understanding that for optimization, you'll need to batch.
The main side effects of batching:
- If you don't persist messages in a third-party system (e.g., Kafka), you can lose them
- Batched messages won't be available at the moment they actually appear
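A minimal in-memory batcher sketch; the flush size, interval, and `bulkInsert` call are all illustrative:

```ts
type Row = { id: string; payload: string };

// Hypothetical bulk insert – e.g. a multi-row INSERT or COPY under the hood
declare const bulkInsert: (rows: Row[]) => Promise<void>;

const createBatcher = (maxSize = 1000, maxWaitMs = 500) => {
  let buffer: Row[] = [];
  let timer: ReturnType<typeof setTimeout> | null = null;

  const flush = async () => {
    if (timer) { clearTimeout(timer); timer = null; }
    if (buffer.length === 0) return;
    const rows = buffer;
    buffer = [];
    await bulkInsert(rows); // one round-trip instead of rows.length round-trips
  };

  const add = (row: Row) => {
    buffer.push(row); // note: rows live only in memory until flush (side effect 1 above)
    if (buffer.length >= maxSize) void flush();
    else if (!timer) timer = setTimeout(() => void flush(), maxWaitMs);
  };

  return { add, flush };
};
```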
xi. "Storages like Onions"
I tried to abstract the processes occurring in a bunch of different storages, including PostgreSQL, TimescaleDB, MySQL, MongoDB, Redis, Clickhouse, CockroachDB, YDB, Amazon Aurora, TiDB, RMQ, Kafka, RedPanda.
Undoubtedly, I forgot/don't know a lot, so your comments are needed.
- Connection establishment – establishing a connection with the client
- Processor allocation – allocating a unit that will process the request (process in PG, thread in MySQL, goroutine in CockroachDB, thread per core in RedPanda)
- Query processing – parsing the received message into logical steps (for example, parsing SQL)
- Schema validation – checking that the sent data corresponds to the schema
- Execution planning – determining if, in what sequence, and from where to retrieve data
- Indexes
- Verification – ensuring we don't violate existing indexes
- Creation – creating new ones (especially worth noting the atomicity of UNIQUE indexes)
- Concurrent access – for example, MVCC (creating new record versions or rollback journal)
- Transactions – transaction management
- Commit – confirming that the operation will succeed regardless of circumstances (for example, WAL record or RAFT commit)
- Respond with metadata – sometimes it's necessary to respond to the client with metadata in advance, for example, about the type of returned data
- Persistence
- Communication with the storage layer – sometimes this is part of the processing instance (PG), and sometimes it exists separately (YDB, TiDB)
- Compression – compressing data
- Storage optimization – enables efficient storage
- Batching – aggregating data in memory for subsequent flush
- Flush – unloading from memory to disk
- Cleanup – most storage systems will have one mechanism or another for cleanup, for example, VACUUM in PG or sector deletion in Kafka
- Clustering
- Membership – discovering and joining a cluster
- Leadership – determining leaders
- Health-checks – checking the availability of cluster components
- Anti split-brain – preventing Split-brain
- Recovery – recovering after separation from the cluster
- Rehydration – restoring data to the required level
- Configuration synchronization – synchronizing the final state of cluster configuration and individual nodes
- Index synchronization – synchronizing indexes
- Data synchronization – synchronizing the data itself
- Master-Slave replication – replicating data for further reading
- Partitioning
- Local – dividing large master tables into smaller ones by key within 1 instance to optimize IO operations
- Sharding – dividing and storing master tables on different instances
- Backups – some storages are capable of automatic backup of hot/cold data, for example, to S3 (TimescaleDB, Redpanda)
xii. Event bus
There are 2 options here:
- Almost always, first of all, you will need a persistent storage like Kafka:
- This gives you the ability to process messages in batches
- Re-read messages when invalidation is necessary
- Write in average constant time
- Have horizontal scaling
- If you are willing to sacrifice delivery reliability and persistence for the sake of speed, then use a Message Broker (NATS.io / EMQX)
And nothing prevents you from mixing both approaches.
xiii. SQLite
SQLite is essentially a library that allows operating on database files directly from a programming language.
Advantages
- Since SQLite's performance depends on language speed + processor power + IO throughput, theoretically it's one of the fastest databases, at minimum because it completely lacks all the network complexity of standard databases
- While other embedded storages are just key-value (rocksdb, leveldb, badger), or NoSQL with unique SDKs (couchdb-like), SQLite is a full-fledged SQL database, comparable in SQL capabilities to PostgreSQL: schema, indexes, transactions, locks, joins, constraints, everything is there
- Consequently, experience with any other database will be relevant, making SQLite very attractive to developers
- This also means you can switch from SQLite to PostgreSQL / MySQL almost painlessly when/if the time comes
- Due to its simplicity, SQLite is either already integrated (browsers, mobile devices, native apps) or easily added (I've seen hardware with minimal Linux that uses SQLite)
- You can work with one instance from multiple processes simultaneously
- To backup the database, you just need to upload files to S3
- Super simple integration testing because you can have a separate SQLite instance for each test (e.g., in-memory)
- If you create a SaaS without multitenancy and distribute it as a boxed solution, you can easily open an instance per client
- As free as possible
Disadvantages
- Classic network file systems don't allow multiple servers/containers to work with the same SQLite database (I only found information about VFS, but I don't yet understand how viable this option is)
- If something happens to the file system during data writing, there's a chance it will break beyond recovery
Use Cases
- Local database for applications with frontend (web, mobile, desktop)
- Local database for remote agents (applications that collect and send data to the cloud from devices, from a server in a warehouse)
- Cache for a single instance (for example, for actors)
- Database for startups/projects requiring work with large volumes of data, but without wanting to pay a lot for PostgreSQL/MySQL instances
Solving Disadvantages
We need 2 things:
- Sync Read replicas, so you can switch when the master fails
- WAL Streaming Backup for reliable backups
Optionally, async read replicas for delayed reading or even CRDT to transform it into a distributed multi-master p2p database
Ideally, all of this should be embedded in SQLite itself, meaning embedded in the language or as a sidecar process. Otherwise, I think using SQLite loses its purpose
Interesting Projects
- Pocketbase (https://pocketbase.io/) – admin panel, Firebase-like HTTP API, email, auth, file storage, logs and much more out of the box
- Turso (https://turso.tech/) – distributed SQLite in an Edge environment
- Electric (https://electric-sql.com/) – SQLite on the client side that synchronizes with PostgreSQL using CRDT, turning SQLite into a multi-master edge database
- LiteFS (https://github.com/superfly/litefs) – SQLite database replication at the file system level
- libSQL (https://github.com/tursodatabase/libsql/) – SQLite fork on which Turso is built, with the ability to deploy servers, replicas, auto-sync WAL to S3, and so on
- rqlite (https://github.com/rqlite/rqlite) – turning SQLite into a full-fledged database with read replicas, written in Go
- dqlite (https://github.com/canonical/dqlite) – roughly the same as rqlite, but in C
The Future of SQLite
- First, it's clear that SQLite is evolving into a database for Edge environments because it can operate with minimal resources
- Also, it's definitely destined to become a p2p database (in the style of couchbase) because it's already integrated/easily integrates with any client
- Adding built-in Sync Read replicas will make it more likely to be used in production applications
Additional Thoughts
- If I had to develop a standard web application for business now, I wouldn't overthink it and would use PostgreSQL
- For my own projects, I would happily use it
- If LiteFS or libSQL works well, I would more seriously consider using it in production applications
🔎 Testing
i. General
- Business logic is tested absolutely always, everything else is optional
- If a function doesn't use infrastructure, write unit tests
- In all other cases, always write integration tests
- Tests are simple (you just need to learn how to set them up)
- Tests are an extra hour of coding that saves you 10 hours of sleep
- How many tests should you write for a function? One, then more if desired
- When you find a bug, first write a test to reproduce it, then fix the bug
ii. Unit, Integration, E2E tests
- Unit tests – testing without environment (DB, caches, API, etc.), meaning all integrations are replaced with mocks, tests are part of the code.
- Integration tests – we expose integrations (DB, caches, API, etc.) for tests, tests are part of the code.
- E2E – we expose the full application with all integrations and call its API, tests are a separate codebase.
iii. Integration tests
- The launch script can be written in a Makefile
- Before running tests, set up a local environment (preferably through docker-compose)
- Wait for startup
- Apply migrations and fixtures
- Take a DB snapshot before each test
- Create a separate connection for each test
- After each test, delete the snapshot and close the connection
- At the end of tests, DO NOT close the environment
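A sketch of that lifecycle with a vitest-style runner; `createTestDb` and its snapshot/restore helpers are hypothetical and depend on your database:

```ts
import { beforeAll, beforeEach, afterEach, test, expect } from "vitest";

type Connection = { query(sql: string): Promise<unknown>; close(): Promise<void> };

// Hypothetical helpers around your database
declare const createTestDb: () => Promise<{
  connect(): Promise<Connection>;
  snapshot(): Promise<string>;
  restore(id: string): Promise<void>;
}>;

let db: Awaited<ReturnType<typeof createTestDb>>;
let conn: Connection;
let snapshotId: string;

beforeAll(async () => {
  // docker-compose is already up; migrations and fixtures were applied by the launch script
  db = await createTestDb();
});

beforeEach(async () => {
  snapshotId = await db.snapshot(); // snapshot before each test
  conn = await db.connect();        // separate connection per test
});

afterEach(async () => {
  await conn.close();
  await db.restore(snapshotId);     // roll the DB back and drop the snapshot
});

test("creates a user", async () => {
  await conn.query("INSERT INTO users (id, email) VALUES ('u1', 'a@b.c')");
  expect(await conn.query("SELECT * FROM users WHERE id = 'u1'")).toBeTruthy();
});
```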
🌎 Logging, metrics, tracing
i. Metrics are your spider sense, Traces are your map, Logs are your eyes
- At large volume, the only way to see if everything is (not) ok is through metrics
- The secondary tool is tracing
- And only as a last resort do we use logging to clarify details
Accordingly, by priority we should primarily add metrics, secondarily tracing, and tertiarily logs.
If you have a small project, the priority goes in the opposite direction.
ii. Meta-info of logs
- Add commit hash
- Level
- Service name
- Unique id
- Service start time
- Function call stack
- Request ID
- Trace ID
iii. Technical
- Use zero-allocation loggers
- Write to stdout
- Verify that the logger writes asynchronously
👨👧👦 Leading
i. Triptych – the ideal structure of a technical team
Once again, I come to the conclusion that technical teams should consist of Team Lead + Senior + a set of Middle developers, and not otherwise.
- Team lead (ears and mouth) – a single entry point for business, responsible for delivery timelines, technical backlog, and team condition; creates working conditions for Senior and Middle
- Senior (brain) – responsible for the quality and functionality of the entire system, and therefore: makes technical decisions, has veto power, puts out fires, creates technical conditions for Middle work
- Middle (hands) – responsible for the functionality of the code they write: aligns decisions with Senior, builds features and ensures they work
Notes
- I called them "Team lead", "Senior" and "Middle" because there are no more appropriate words (except those in parentheses); in reality, a person at the "Senior" level can be in the "Middle" role
- These are ROLES, which means they can be combined by the same person
- There should be a maximum of one Team Lead and one Senior, but there can be many Middles, ideally within the limit of 10
- Team Lead may not have technical skills (I call these PMs, but this doesn't remove the responsibilities described above)
- I deliberately didn't include QA, DevOps, CTO, Architect, etc. because they either can reuse what's written above, or are more hyperbolic versions (for example, CTO is a Team Lead who is also responsible for payroll)
Why exactly this team structure
- Decisions become (in)correct only after you've applied them, so someone simply needs to take responsibility for choosing a path and waiting for the result. If several people try to approve such a decision, they will have big problems; therefore, there should be only one captain (the Senior).
- Communication with business is painful. There can be an incredible amount of it. Both incoming to the team and outgoing. At the same time, business often doesn't know how to communicate with developers and vice versa, so letting them interact on a regular basis is definitely not worth it. They can be introduced and left for some time (feature creation), but all communication and responsibility should be in one person's hands (Team Lead)
- Seniors complement Middles, and Middles complement Seniors: Middles can get all the knowledge they need, Seniors can realize themselves as "senseis" and at the same time learn by structuring their knowledge while transferring it to Middle developers. Such an Ouroboros allows both to grow and enjoy the process.
- Two Senior developers can initially have conflicts (I wrote about this above), and in this situation they may also ask each other too few questions (for example, out of embarrassment) and build something incredibly absurd simply because they didn't dare to discuss it in advance (this is really about maturity, but that's a separate topic)
How to create such a team
- Openly and explicitly discuss who takes on which role, ensuring that Seniors clearly transfer decision rights on business matters to the Team Lead, while Middles explicitly transfer rights for technical decisions to Seniors
- Establish good, open, and constant communication between all these links
- Give Seniors the ability to make technical decisions and hire people in accordance with them
What are the dangers
- If the Team Lead has poorly developed soft skills and lacks the "steel balls" to say "no" to both developers and business, everyone will burn out, business will suffer, and turnover will begin
- If you choose a bad Senior, they will lead the entire team into the abyss, but if you don't give Seniors the right to make risky decisions, good Seniors won't be able to accomplish anything either
- If Seniors aren't ready to listen to Middles, or Middles aren't willing to agree with Seniors' final decisions, the scheme won't work, which is why it should be the Senior who builds the team
💻 Programming
i. Everything is concurrent
Even within single-threaded languages, as soon as you create long-lived objects (for example, in-memory cache), ALWAYS treat them as entities that can be modified from different points in the program in a concurrent manner.
And for this:
ii. Avoid Mutexes
Mutexes allow us to synchronize in a concurrent environment, BUT mutexes always have a huge chance of getting stuck forever or turning into a cascade of interconnected mutexes.
Almost always, I prefer structuring operations in a sequence to avoid mutexes, and I recommend you do the same.
Examples of patterns with sequential processing: Actor Model, Serializable Transactions, RAFT.
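A tiny actor-style sketch of replacing a lock with sequential processing: all writes go through one mailbox and are applied one at a time, so the state itself needs no mutex:

```ts
type Message = { type: "deposit"; amount: number } | { type: "withdraw"; amount: number };

const createAccountActor = () => {
  let balance = 0;               // state owned by the actor only
  const mailbox: Message[] = [];
  let processing = false;

  const process = async () => {
    if (processing) return;      // only one processing loop at a time
    processing = true;
    while (mailbox.length > 0) {
      const msg = mailbox.shift()!;
      // ... await side effects here; `balance` is still touched by one loop only
      balance += msg.type === "deposit" ? msg.amount : -msg.amount;
    }
    processing = false;
  };

  return {
    send: (msg: Message) => {
      mailbox.push(msg);
      void process();
    },
    getBalance: () => balance,
  };
};
```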
iii. Program as if everything is already broken
Applications crash. And they do it in the most unpleasant and dangerous places.
Most often, it's your fault, but in 10% of cases, it's due to external circumstances.
Always program as if the application could crash at any moment:
- If you have a state machine, don't forget to write a cron job that will move hung state machines to some final state.
- If there's a set of data that must be written together or not written at all, then use: (1) prepare all operations on entities in advance and write them in one atomic operation at the end, (2) combine this data into one table/message in a queue (and if you need to separate them later, do it as a separate process), (3) create and run a state machine (sooner or later it will do what's needed), (4) use transactions.
- If you need data at a specific moment in time, write it as soon as you can gather it together, BUT make sure that the write operation is idempotent (meaning that if the same input data enters the same code logic again, it won't create a duplicate record).
- You might be lucky and the application will crash with SIGINT or SIGTERM, so take care to implement graceful shutdown.
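Continuing the third point above (idempotent writes), a sketch where the client-provided id makes a retried write a no-op; the `db.query` client and table are illustrative:

```ts
// Hypothetical pg-style client; `m.id` arrives with the input data (e.g. a client-generated UUID)
declare const db: { query(sql: string, params: unknown[]): Promise<unknown> };

const saveMeasurement = async (m: { id: string; deviceId: string; value: number }) => {
  // If the process crashed after the first write and the same input is replayed,
  // this INSERT simply does nothing instead of creating a duplicate record
  await db.query(
    `INSERT INTO measurements (id, device_id, value)
     VALUES ($1, $2, $3)
     ON CONFLICT (id) DO NOTHING`,
    [m.id, m.deviceId, m.value]
  );
};
```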
iv. And much, much more
In general, all the philosophical foundations that I try to follow are described in the "Pillars" chapter of the FOP book, so I'll leave a link to it here:
👨🏻 About the Author
Hi! My name is David Shekunts and I'm a Golang / Node.ts Tech Lead & mustache owner 👨🏻
Github: https://github.com/Dionid
Telegram: t.me/davidshekunts
Wishing everyone powerful growth 💪