Hello, my name is 👨🏻 David Shekunts 👴🏿 and I'm tired of debugging at night, I'm tired of unmaintainable code, I'm tired of technologies that break. Here I collect best practices that help me rest more. I hope they will help you too.
📚 Table of Contents
- 📖 Dictionary
- 🍔 App structure
  - i. Monorepo
  - ii. Monolith
  - iii. Microservices
  - iv. Modular monolith
  - v. What to choose?
  - vi. How to divide microservices
    - a. Separate team
    - b. Security
    - c. Geo-dependence
    - d. Stateful
    - e. Separate feature set
    - f. Dedicated resources
    - g. Interferes with others
    - h. Deployment independence
    - i. Business logic independence
    - j. Mini-product
  - vii. FALSe - Best Project Structure
  - viii. Graceful-shutdown
  - ix. Separate your Cron
- 🎛️ API
  - i. RPC
  - ii. CQRS
  - iii. (a)Sync Communication
  - iv. Schema
  - v. Push | Pull
  - vi. Create IDs on the client
- 👴🏿 Type it
  - i. Branded & Nominal Types and validation on type
  - ii. Algebraic Data Types (ADT) and Invariants
  - iii. implements – incredible evil
  - I prefix – evil
- 🐞 Error Handling
  - i. Classified Errors
  - ii. Error Dictionary
  - iii. Return, not throw
- 🧠 Architecture
  - i. Functionally Oriented Programming (FOP)
  - ii. Data Oriented Architecture
  - iii. Dirty vs Clean Architecture
  - iv. Vertical Slices
  - v. Event Driven Architecture (EDA)
  - vi. Say "NO" to master-master
  - vii. Say "yes" to master-slave
  - viii. Horizontal scaling
- 💾 Databases
  - i. ORM or not
  - ii. Migration first
  - iii. Optimistic & Pessimistic Concurrency Control
  - iv. Transactions
  - v. Distributed Transactions
  - vi. Drop Relations
  - vii. Drop Constraints
  - viii. How to Choose a Database
  - ix. Use UUID
  - x. INSERT-s / UPDATE-s / DELETE-s must be batched
  - xi. "Storages like Onions"
  - xii. Event bus
  - xiii. SQLite
- 🔎 Testing
  - i. General
  - ii. Unit, Integration, E2E tests
  - iii. Integration tests
- 🌎 Logging, metrics, tracing
  - i. Metrics are your spider sense, Traces are your map, Logs are your eyes
  - ii. Meta-info of logs
  - iii. Technical
- 👨👧👦 Leading
  - i. Triptych – the ideal structure of a technical team
- 💻 Programming
  - i. Everything is concurrent
  - ii. Avoid Mutexes
  - iii. Program as if everything is already broken
  - iv. And much, much more
- 👨🏻 About the Author
📖 Dictionary
Terms that will appear in the text:
- "Data source" – database, cache, third-party API, basically anywhere we get data from or send data to.
- Application Interface (API) – a way to interact with a running application (stdin, stdout, TCP, UDP, etc.)
- "Endpoint" – one of the API methods (e.g., HTTP API URL, gRPC command description, mutation/query/subscription in GQL, etc.)
- Message Broker (MB) – ability to publish messages (usually by "topics") and subscribe to their appearance
- Message Queue (MQ) – same as MB, but guarantees delivery sequence
- Event Driven Architecture (EDA) – we emit "Events" (id, name in past tense, data) and consume them with any number of services.
- CI / CD – automated processes required for building and deploying services
- "Infrastructure" – databases, CI/CD, servers, orchestration, network, and everything else
- Orphan data – data that only made sense when the data it was tied to existed (essentially, any data we want to make ON DELETE CASCADE)
- "Projection" – readonly data calculated based on other data. For example, we save all messages from a device to one table, and then calculate projections of its current and historical temperature readings / battery status / location / etc.
- System degradation – when new functionality breaks old functionality, considered one of the worst types of errors.
- Feature flag – boolean value (true | false) that you pass into code (via env) and depending on whether it's on or off, you enable or disable functionality
- Real-time – systems that involve processing and responding in milliseconds (maximum 1-3 seconds)
- Application (app) – code running as a process
- Instance – a unit of a running application
- Transport – any way two processes interact, involving sending/receiving data (TCP, HTTP, MQ, 2IP, stdin, etc.)
- Internal communication – calling business logic within one application by calling a function from within code
- External communication – calling business logic from another application using some Transport
🍔 App structure
Let's structure applications:
i. Monorepo
There's a lot to discuss about this, but these 3 properties are why I choose monorepo in 9 out of 10 cases:
- Atomic deployments – backend, frontend, and infrastructure code are deployed to production at once
- Everything in one place – even with 5 repositories, problems start with finding the right things
- Shared code – ability to use local code between services without publishing
It's important that with a monorepo you need to take care of:
- Development and staging environments, so that day-to-day work happens in the first, while changes headed for production go through the second
- If you need to publish libraries, be sure to add a CI/CD stage with manual control for their building and publishing (main releases the main version, development releases beta, other branches release alpha)
ii. Monolith
!ATTENTION! the following "Pros" and "Cons" are in comparison between Monoliths, Microservices, and Modular monoliths.
- One large codebase
- Runs in approximately 1 instance
Pros:
- Avoids distributed problems (cross-communication and state synchronization)
- Fast deployment
- Less hassle with Infrastructure
- Easier to debug
Cons:
- All the above pros work up to a certain size, after which they completely disappear
- Difficulty of horizontal scaling
- Single point of failure
- Without code writing rules, it turns into spaghetti
iii. Microservices
Essence:
- Codebase separated from each other
- Ideally, their own resources (different servers, databases, caches, etc.)
Pros:
- Can isolate codebase in a separate place (security, management)
- Theoretically maximum horizontal scaling
Cons:
- Mistakes in microservice boundaries are much worse than anything in a monolith, no matter how big it gets
- Complexity of communication and state synchronization at maximum
- Difficult to deploy
- Difficult to monitor infrastructure
iv. Modular monolith
Essence:
- Reuse the same codebase but run in different instances
- Use the same data sources
Pros:
- All the advantages of monoliths, without the point of irreversible bloat
- All the advantages of microservices, minus the complexity of cross-communication and state synchronization
Cons:
- Sharing resources (e.g., database queries) can cause them to overload
- More difficult to deploy and maintain than a monolith
- Without code writing rules, it turns into spaghetti
v. What to choose?
Within one team:
- Start with a modular monolith, trying to avoid cross-service communication as much as possible
- Extract something into a microservice only out of NECESSITY, meaning you can clearly see there's really no other way (this point may never even arrive)
But each separate team should make their own separate modular monolith and microservices, because working on one shared codebase is difficult if you're not part of one team.
vi. How to divide microservices
Dividing microservices by "responsibility" is the biggest mistake.
Humans are very poor at classification and categorization, and if your methodology requires strict grouping (e.g., OOP or "division by responsibility"), you will never be able to do it right.
Why? Because even if you could identify a specific set of features by responsibility and break them down into microservices/classes that satisfy business requirements, the world doesn't stand still. New requirements will constantly emerge that blur the boundaries of this "responsibility," and the right architecture today becomes the wrong one tomorrow through no fault of your own.
The second problem: "responsibility" is a very subjective concept. Ask several people in a medium+ system: "what responsibilities would you identify?" – each person's variant will be 50% different from the others. And increasing the degree of "subjectivity" is a path to endless errors.
This "invented responsibility" by developers can be called "artificial responsibility," but we need to focus on "natural responsibility," and here's a checklist of examples of this "natural responsibility":
a. Separate team
If there are two teams of people who need to solve independent parts of the system and don't communicate on a regular basis (weekly, daily, their own management, etc.), it's better for each to make their own set of microservices/modules and agree on APIs.
Even shared libraries are dangerous (there should be 1 specific maintainer of this library), because any intervention in the code by a developer from an outside team is very poorly controlled and can degrade half the system
b. Security
You're making a module that handles transactions, and you want as few developers as possible to develop it, have access to it, or even see its code (to reduce the risk of a breach).
Then you create a separate microservice, with its own repository, and only expose a secure API.
c. Geo-dependence
For example, you have the task of collecting data from your client's devices, but these devices live within the internal network of a specific warehouse. Then you create a separate microservice that will live within a server on this warehouse and communicate with it through transport.
The same applies if you need to place some service only within one of the regions (for example, only RU or EU), then you can make a separate module just for this region.
d. Stateful
For example, you need to keep open Web Socket or TCP connections, while you want to redeploy and generally touch this application as rarely as possible for maximum uptime, then you create a separate microservice/module that will store the state (socket) and communicate with it through transport
e. Separate feature set
Your partners need a stripped-down/different variation of your main API, so you create a separate module in which you expose only what's needed + replace the authentication method (for example, with OAuth) and so on.
f. Dedicated resources
Your application requires heavy work with the file system, or image compression, or streaming video processing, or CPU-intensive calculations – in short, something that requires dedicated resources. Then you create a microservice.
g. Interferes with others
A particular endpoint is especially heavily loaded (for example, you collect every click from the client's browser) and it generates too much traffic, the processing of which interferes with processing of the remaining requests.
In this case, you can take exactly this piece and isolate it into a separate module/microservice and deploy it on a separate machine.
h. Deployment independence
When we want different services to be deployed independently of each other (for example, so that redeployment doesn't affect applications whose code wasn't changed in a PR)
Then we can create a module (modules have their own independent Docker images, so after building we can check if the hash has changed to make a redeployment decision) or a microservice
i. Business logic independence
This is the MOST difficult aspect, because the independence of one microservice from another is often a temporary concept (remember that requirements always change), but if one module is truly capable of operating independently of another (for example, a reward calculation system and an authentication system), then it's a candidate for extraction into a microservice.
BUT I recommend doing this as a second step, after you've understood that these modules have truly become independent of each other and for the convenience of further development, you want to separate the codebases.
ONCE AGAIN, only time will show the real independence of codebases
j. Mini-product
When you need to build a demo/prototype product, and you're not sure whether it will remain in this form, it's often easier to create a separate microservice. This gives freedom to experiment without worrying about side effects on the primary system.
vii. FALSe - Best Project Structure
4 main folders:
- Features - code with business logic of the application, divided by domains.
- Apps - code that launches the application in different configurations.
- Libs - code that could become private or public libraries.
- Scripts - code that we occasionally want to run locally or from CI/CD.
Example:
```
/features
  /auth
    login.ts
    register.ts
  /user-management
    get-user.ts
    delete-user.ts
/apps
  /main-http
    /http-api
      schema.ts
    config.ts
    index.ts
  /cron
    index.ts
/libs
  /@my-company
    /specific-lib
      index.ts
  /logger
    index.ts
/scripts
  /inactivate-stale-user
    index.ts
```
Currently, this structure covers 100% of the cases I've encountered.
viii. Graceful-shutdown
Always close your applications carefully:
- Start a timer that will kill the application even if it hasn't finished
- Close APIs (HTTP, MQ, etc.)
- Turn off all cron jobs and intervals
- Wait for current business logic processes to complete
- Close external connections (DB, MQ, etc.)
Most often, you should respond to SIGINT and SIGTERM.
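A minimal Node.js/TypeScript sketch of this shutdown order; `httpServer`, `cronJobs`, and `db` are stand-ins for your real handles:

```ts
import http from "node:http";

// Hypothetical handles – replace with your real server, jobs, and DB pool
const httpServer = http.createServer((_req, res) => res.end("ok"));
const cronJobs: { stop(): void }[] = [];
const db = { close: async () => {} };

let shuttingDown = false;

async function shutdown(signal: string) {
  if (shuttingDown) return; // ignore repeated signals
  shuttingDown = true;
  console.log(`${signal} received, shutting down`);

  // 1. Timer that kills the process even if cleanup hangs
  const killTimer = setTimeout(() => process.exit(1), 30_000);
  killTimer.unref();

  // 2. Stop accepting new work (HTTP, MQ consumers, etc.)
  await new Promise<void>((resolve) => httpServer.close(() => resolve()));

  // 3. Turn off cron jobs and intervals
  for (const job of cronJobs) job.stop();

  // 4. Wait for in-flight business logic here (e.g. drain a counter of active jobs)

  // 5. Close external connections (DB, MQ, etc.)
  await db.close();

  process.exit(0);
}

process.on("SIGINT", () => void shutdown("SIGINT"));
process.on("SIGTERM", () => void shutdown("SIGTERM"));
```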
ix. Separate your Cron
Always create separate applications for cron jobs to run them separately and not interfere with horizontal scaling of other applications.
BUT I don't recommend making a separate application with just 1 instance for this, because then there's no guarantee of cron execution.
The best option is to use a cron scheduler, for example, the cron scheduler built into k8s.
This way, even if a server is unavailable, the scheduler will run the cron job where and when possible.
It's also useful to make the cron job simply emit an event indicating that some operation needs to be performed, and then one (or several) of the already active applications react to it and do what's needed.
This way you get more control: (1) within the executor application you can better distribute the load, (2) it's easier to write logic for distributing this cron calculation, (3) react to and debug the cron job, (4) not run more calculations than needed.
🎛️ API
i. RPC
Remote Procedure Call – one of the most convenient ways to structure an API.
First, it's well-suited for request-response:
```ts
// # Request
{
  id: string
  name: "GetUser",
  params: {
    userId: string
  },
  meta: {
    ts: Date
    requesterId: UUID
    traceId: UUID
  }
}

// # Success Response
{
  id: string // same as in request
  result: {
    case: "success",
    success: {
      email: string
      avatar: string
    }
  }
}

// # Failure Response
{
  id: string // same as in request
  result: {
    case: "failure",
    failure: {
      code: number
      message: string
    }
  }
}
```
But Request can also be used as an Event:
```ts
{
  id: string
  name: "UserCreated",
  payload: {
    userId: string
  },
  meta: {
    ts: Date
    requesterId: UUID
    traceId: UUID
  }
}
```
Second, it allows communication through just one bi-directional channel (for example, as with WebSockets), which means it can be used with absolutely any protocol/communication interface.
Famous RPC implementations: gRPC, GraphQL, Pg Wire Protocol, JSON RPC.
ii. CQRS
Command Query Responsibility Segregation – to simplify it maximally, there are 2 main rules:
- If you're returning data, you have no right to change the system state (write anything to DB, cache, change variable values)
- If you're changing the system state, you can only respond with "OK" or "Error"
This way of building any API/Interface guarantees the possibility of (1) convenient horizontal scaling of the API, (2) fewer problems when writing and maintaining business logic, (3) the ability to use eventual-consistency.
!IMPORTANT! It's impossible to follow it in 100% of cases, for example, as with JWT authentication: you'll most likely need to create a token, save it, and then return it to the user.
Therefore, you shouldn't implement CQRS everywhere; you should simply maximize its use.
Lifehack 1: for CQRS to work at full capacity, I recommend making all entity IDs UUIDs so the client can create and pass them.
Lifehack 2: combines perfectly with RPC.
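A minimal sketch of both rules, using an in-memory map as a stand-in for a real database (names are illustrative):

```ts
// Tiny in-memory stand-in for a real database
const posts = new Map<string, { id: string; authorId: string; text: string }>();

// Command: changes state, responds only with "OK" or "Error"
type CreatePostCommand = { postId: string; authorId: string; text: string };

const createPost = async (
  cmd: CreatePostCommand
): Promise<{ ok: true } | { ok: false; error: string }> => {
  if (posts.has(cmd.postId)) return { ok: false, error: "already_exists" };
  posts.set(cmd.postId, { id: cmd.postId, authorId: cmd.authorId, text: cmd.text });
  return { ok: true }; // nothing beyond OK/Error is returned
};

// Query: returns data and is not allowed to change any state
type GetPostQuery = { postId: string };

const getPost = async (q: GetPostQuery) => posts.get(q.postId) ?? null;
```

Note that the client supplies `postId` itself, which is exactly Lifehack 1 in action.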
iii. (a)Sync Communication
Synchronous communication (sync) - we send a request to a "channel" and block it until a response returns (HTTP, gRPC).
Asynchronous communication (async) - we send a request to a "channel," and we'll receive a response sometime later from it or another channel (UDP, WS, Message Queue, Message Broker, etc.)
Pros of sync: security, reliability, simplicity, speed
Cons of sync: manual routing, lots of blocked channels
Pros of async: eventual consistency, using 1 channel, ability to use with queues, EDA, works well with RPC
Cons of async: slower and less reliable
iv. Schema
Always start describing your API from the schema, and only then write the implementation code.
The most convenient schemas:
- OpenAPI or GQL for HTTP / MQ / MB / WS
- Protobuf for gRPC / MQ / MB / WS
Lifehack 1: use Union types more often, such as `oneof` from gRPC, or `allOf` / `oneOf` from OpenAPI.
v. Push | Pull
Push Model – we send something somewhere.
+ realtime
+- smart producer
- requires a backpressure mechanism to control load on the reading side
Pull Model – we get something from somewhere.
+ absolute control over the consumption process
+- smart consumer
- true real-time is impossible
- doesn't work with non-streaming data (for example, HTTP requests are difficult to implement using the Pull model)
This applies to MQ, storage systems (Prometheus and InfluxDB), and code implementations (Event Emitter vs Async Iterator).
In practice, I try to use the Pull model more often, and Push only where Pull simply won't work.
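A small illustration of the two models in code, using Node's EventEmitter for Push and an async iterator for Pull:

```ts
import { EventEmitter } from "node:events";

// Push: the producer decides when the consumer runs
const pushSource = new EventEmitter();
pushSource.on("data", (msg) => {
  // the consumer has no built-in way to say "slow down" – backpressure must be added explicitly
  console.log("pushed:", msg);
});
pushSource.emit("data", 1);

// Pull: the consumer decides when to take the next item
async function* pullSource() {
  let i = 0;
  while (i < 3) yield i++;
}

for await (const msg of pullSource()) {
  // the loop body finishes before the next item is pulled – natural backpressure
  console.log("pulled:", msg);
}
```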
vi. Create IDs on the client
The entity identifier should be created by the client:
- This allows the client to independently request the necessary data in case of success
- Easier to implement streaming / real-time communication
- Allows the use of Eventual Consistency
- The unique identifier will also act as an idempotency key
- On the backend, you can create entities, link them with this UUID, and only then write to the database (instead of making a record, getting a SERIAL, and only then creating the next related entity)
Use UUID v7 or a similar unique identifier with an embedded date for this purpose.
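A simplified sketch of generating a UUID v7 on the client (48-bit millisecond timestamp + random bits); in a real project you'd likely reach for a library such as the `uuid` package instead:

```ts
import { randomBytes } from "node:crypto";

// Simplified UUID v7: first 6 bytes are the big-endian Unix-ms timestamp,
// the rest is random, with version/variant bits set per the UUID layout.
const uuidV7 = (): string => {
  const bytes = randomBytes(16);
  const ts = BigInt(Date.now());

  for (let i = 0; i < 6; i++) {
    bytes[i] = Number((ts >> BigInt(8 * (5 - i))) & 0xffn);
  }
  bytes[6] = (bytes[6] & 0x0f) | 0x70; // version 7
  bytes[8] = (bytes[8] & 0x3f) | 0x80; // RFC 4122 variant

  const hex = bytes.toString("hex");
  return `${hex.slice(0, 8)}-${hex.slice(8, 12)}-${hex.slice(12, 16)}-${hex.slice(16, 20)}-${hex.slice(20)}`;
};

// The client creates the id up front, so it can (1) request the entity later
// by this id and (2) retry the call safely – the id doubles as an idempotency key.
const postId = uuidV7();
// send { name: "CreatePost", params: { postId, text: "..." } } to the backend
```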
👴🏿 Type it
Advanced typing techniques:
i. Branded & Nominal Types and validation on type
This chapter has been moved to the book λ Functional Oriented Programming:
ii. Algebraic Data Types (ADT) and Invariants
This chapter has been moved to the book λ Functional Oriented Programming:
iii. `implements` – incredible evil
`implements` completely destroys the purpose of interfaces. The essence of an interface is to separate a specific implementation from the set of methods a specific function needs, in order to reduce code coupling.
If you want to explicitly declare that some "class" should include a set of methods, then use abstract classes, that's what they were created for.
An interface is needed to declare what you need in one place, implement it in another, and pass the implementation to a request in a third.
```ts
// business-logic/some-fn.ts
interface User {
  id: string
  email: string
}

interface UserDataSource {
  getUserById(id: string): User
}

function someLogic(uds: UserDataSource) {
  // ...
}

// databases/pgsql.ts
interface UserTable {
  id: UUID
  email: string
}

const UserTableService = (conn: PgConnection) => {
  return {
    getUserById: (id: string): UserTable => {
      // ...
    }
  }
}

// app/main.ts
const pg = PgConnection()
const userTable = UserTableService(pg)
someLogic(userTable)
```
Only in such a situation do you reduce code coupling, which means you're correctly using interfaces.
`I` prefix – evil
The most important problem: by using the `I` prefix, you are structuring your code incorrectly.
- It's a semantic error – you don't name all classes with a `C` prefix or all numbers with an `Int` postfix
- With this naming convention, you tie the interface to the implementation, but the essence of an interface is precisely to decouple one from the other
```ts
interface IPaymentService {
  // ...
}

function extendSubscription(ps: IPaymentService) {
  // ...
}
```
Now we need to implement Stripe and Tinkoff as payment services. Based on this code, a person would name them `StripePaymentService` and `TinkoffPaymentService`, but the question is: what if somewhere else in the code there's an `ISubscriptionService`, and our Stripe and Tinkoff completely match it and need to be used there too? Would we have to add something like `StripePaymentSubscriptionService` to the name? No, that's absolute nonsense.
It's sufficient to simply name the interfaces `PaymentService` and `SubscriptionService`, and the implementations `Stripe` and `Tinkoff` – then everything becomes absolutely logical.
And here are more details:
To begin with, I'll list some links that explain this topic in great detail, and then I'll share my own perspective:
Now, here's what I think about it:
First of all, it's a textbook example of Hungarian Notation. If you're naming an `interface` with the `I` prefix, then why aren't you naming a `class` with a `C` prefix or a `string` with an `S` prefix? It makes logical sense not to add a type prefix to its name.
Secondly, the first link provides an excellent explanation:
If you stop to think about it, you'll see that an interface really isn't semantically much different from an abstract class:
- Both have methods and/or properties (behaviour);
- Neither should have non-private fields (data);
- Neither can be instantiated directly;
- Deriving from one means implementing any abstract methods it has, unless the derived type is also abstract.
In fact, the most important distinctions between classes and interfaces are:
- Interfaces cannot have private data;
- Interface members cannot have access modifiers (all members are "public");
- A class can implement multiple interfaces (as opposed to generally being able to inherit from only one base class).
Since the only particularly meaningful distinctions between classes and interfaces revolve around (a) private data and (b) type hierarchy - neither of which make the slightest bit of difference to a caller - it's generally not necessary to know if a type is an interface or a class. You certainly don't need the visual indication.
I'll explain this thought a bit differently:
```ts
class User {
  constructor(
    public id: string,
    public email: string
  ) {}
}

// When you use the User class in typing, you're actually using
// its interface
const someFn = (user: User) => {
  // ...
}

// That is, in fact, the User class consists of 2 parts:

// 1. The class type
interface User {
  id: string
  email: string
}

// 2. The class runtime (conditionally)
const User = {
  new: (id: string, email: string): User => {
    return { id, email }
  }
}
```
The only important difference is that the class interface also includes private properties.
BUT from the perspective of the `someFn` function, are its private properties important? No, because this function can only call its public properties.
Consequently, we always use an `interface` anyway, even when we specify a class in the type, so why then, when writing our own `interface`, should we need to know that it's an interface by adding the `I` prefix?
Thirdly, it's much better if your interface is named without a prefix, but its implementations have a postfix:
```ts
interface UserRepository {
  insert(user: User): void
}

// PSQL implementation
class UserRepositoryPostgreSQL {
  insert(user: User): void {}
}

// Mongo implementation
class UserRepositoryMongoDB {
  insert(user: User): void {}
}
```
Because implementation is a specification of the interface, which should be reflected in the naming.
Fifthly, this is a side effect, of course, but I've seen many times how developers who don't understand the meaning of interfaces would create an interface with an `I` prefix next to each implementation class...
Folks, these interfaces are only needed for (1) the ability to substitute implementations (that is, you should have at least 2 classes that meet the same interface) or (2) to reduce code coupling (but then we should describe the interface not next to the implementation, but where this decoupling occurs).
🐞 Error Handling
i. Classified Errors
Create a layered classification of errors to make it easier to check their types:
```ts
// 1. First the base error
type BaseError = {
  _isBaseError: true // for simple checking
  statusCode: number // always use HTTP Status Code
  type: string // unique keys of the error dictionary (more on this below)
  message: string // internal error description
  data: JSON // additional data
}

// 2. The first layer is usually about whether they are available internally or externally
type InternalError = BaseError & {
  _isInternalError: true
}

type PublicError = BaseError & {
  _isPublicError: true
  publicMessage: string // this error message will be shown to users
}

// 3. Then standard types
type ForbiddenError = PublicError & {
  _isForbiddenError: true
  statusCode: 403
}

type NotFoundError = PublicError & {
  _isNotFoundError: true
  statusCode: 404
}

// and so on

// 4. Then your specific ones
type SubscriptionEndedError = ForbiddenError & {
  _isSubscriptionEndedError: true
  publicMessage: "Your subscription ended"
}

type PostsNotFoundError = NotFoundError & {
  _isPostsNotFoundError: true
  publicMessage: "No posts"
}
```
This way, you can check the error for the needed type anywhere in the program.
For example, if an error reaches the highest level (like a user's HTTP API), you can check and return only what's needed:
```ts
const errorHandler = (request: Request, error: Error) => {
  if (error._isBaseError) {
    if (error._isPublicError) {
      return request.status(error.statusCode).message(error.publicMessage)
    } else {
      return request.status(500).message("Internal error")
    }
  }
}
```
ii. Error Dictionary
Use a separate field and add a unique value to it, for example:
```jsonc
// errors.json
{
  // key is unique, and the value is any more or less understandable string
  "user_email_validation_error": "Email is incorrect",
  "value_too_long": "Value too long: ",
  "you_are_not_owner": "You are not owner"
}
```
Put unique error keys into a localization system (e.g., Lokalize, i18n, etc.), and this way both on the backend and frontend you will, first, better understand what error was sent, and second, be able to automatically translate them into the needed languages.
If special values are needed, send them as an object or always add them to the end of the string.
iii. Return, not throw
Almost always I prefer returning an error instead of throwing it. You can return either the error itself or an entity that can contain either a response or an error (Either monad).
- Program clarity increases tenfold
- You can type the returning errors
- Debugging is simplified
- Code for error handling looks cleaner
- It's harder to forget to handle all the necessary cases
Go, Rust, Zig – modern languages for which returning errors is the norm.
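A minimal Result type sketch (a simplified Either) showing what returned, typed errors look like:

```ts
type Ok<T> = { ok: true; value: T }
type Err<E> = { ok: false; error: E }
type Result<T, E> = Ok<T> | Err<E>

type UserNotFoundError = { type: "user_not_found"; userId: string }

// In-memory stand-in for a real lookup
const users = new Map<string, { email: string }>()

const findUser = (userId: string): Result<{ email: string }, UserNotFoundError> => {
  const user = users.get(userId)
  if (!user) {
    return { ok: false, error: { type: "user_not_found", userId } }
  }
  return { ok: true, value: user }
}

// The caller is forced to handle both branches explicitly
const res = findUser("42")
if (!res.ok) {
  // handle the typed error (res.error.type is known at compile time)
} else {
  // use res.value.email
}
```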
🧠 Architecture
i. Functionally Oriented Programming (FOP)
FOP is a functional alternative to OOP.
I am convinced that OOP in modern backend applications is an atavism that needs to be actively eliminated.
Why and what to replace it with, you will learn in the online FOP book:
ii. Data Oriented Architecture
I have moved this chapter to the book λ Functionally Oriented Programming:
iii. Dirty vs Clean Architecture
Clean / Hexagonal / Onion Architecture advocate for separating business entities (User, Order, Product, etc.) from infrastructure (DB, controllers, etc.).
This is a good approach, BUT ONLY IN ONE SPECIFIC CASE, namely, when the infrastructure MUST be replaceable (for example, you have technology that assumes someone else can deploy it with a database of their choice). People try to use it absolutely everywhere.
I can tell you 1000% that the more straightforward I act (almost putting SQL queries in http controllers), the more reliable, scalable, optimized, and flexible programs I get.
I call this Dirty Architecture.
I'm ready to bet I'll catch 10,000 insults for this message, but I'm betting my ass on it, because I've used Clean Architecture and DDD in production, felt them on my own skin, and only after that did I realize in practice how unrealistic their promises are.
In short:
- Any abstractions of business logic increase Artificial complexity (link), and in CA and DDD, this is maximized.
- Code is the servant of data (DB), not vice versa, so the fewer abstractions over data in your code, the easier it will be to work with it (link)
- If you create a program so that you can change, for example, PG to MongoDB, then your program will work poorly with both databases
- Abstraction over the database immediately assumes that (1) you won't be able to use important features of a specific database, (2) you won't be able to optimize for a specific database, (3) you will work unoptimized and very quickly reach the ceiling of resources you need
- The more reusable code (for example, the Entities layer), the more danger there is of breaking something that works when adding a new feature / changing an old one (over time, this probability reaches 100%)
iv. Vertical Slices
The concept of Vertical Slices fits well within the context of Dirty Architecture.
To simplify, many applications are divided into folders like: use-cases, db-queries, models
When a request comes in, it's first processed by use-cases code, which calls the database, which uses models.
This means that a single feature (request processing) is scattered across multiple folders containing code for other features. This structure is called Horizontal Slices.
The idea of Vertical Slices is to have a folder for each feature, and within each folder to include both the controller, use-case logic, database query descriptions, and model code used in that feature.
One feature is not allowed to use another feature's code directly, only call its controller / use case.
- This gives us true independence of code bases, and the chance of affecting something else when changing a feature approaches zero
- Logic related to a specific feature doesn't pollute the common codebase but stays where it's needed, which incredibly helps in understanding the codebase
- Testing such code is much easier
- We can easily enable/disable/move features anywhere
- Each feature can use its own technologies (you can even change the Query builder, database, library, etc.)
It's important to add a couple more conditions:
- Keeping 1000 features in one folder is inconvenient, so feature folders should be distributed across domain area folders (auth, shop, reports, blog-system, etc.)
- Within a domain area, you can have reusable code for that domain area
- And common reusable code (like a DB schema) can always be organized as a "library" / SDK that individual features will use
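A hypothetical layout under these rules (all folder and file names are illustrative):

```
/features
  /shop                      # domain area
    /create-order            # one vertical slice
      http.controller.ts
      create-order.use-case.ts
      order-queries.ts
      order-model.ts
    /cancel-order            # another slice with its own controller/queries/models
      ...
    shop-shared.ts           # code reusable only within the "shop" domain
  /blog-system
    /create-post
      ...
```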
v. Event Driven Architecture (EDA)
Honestly, I can't imagine almost any backend application without some variation of EDA.
The essence of EDA is to give us a mechanism where we can announce some Event so that other processes can be triggered, processes that we shouldn't need to know about at the publication point.
Some useful points:
- An Event consists of (see the sketch after this list):
  - `id: uuid`
  - `name: string` – in past tense (`UserRegistered`, `PostCreated`, etc.)
  - `payload: JSON` – content
  - `timestamp: UnixTimestamp` – creation time
  - `traceId: uuid` – an identifier passed from the very first call through all subsequent calls and events to build a complete picture
- Events have sizes
- S – name + entity id
- M – name + useful data
- L – name + all data
- XL – name + new and old data
- Each case requires its own "size," experiment
- If some data affecting the event or handlers might change over time, put it in the event
- If it's possible to reproduce some logic using events (i.e., when you don't need to wait for a response), it's better to do so
- Use queue systems with delivery guarantees and replication (like Kafka)
- It's better if events are persisted (like in Kafka) so they can be re-read
- Learn the Transaction Outbox pattern for very complex and dangerous operations
- For state synchronization, use Saga or Event-based Orchestrator
- Never expect an event to execute in a constant time
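Putting the event shape from the list above into code; the bus interface here is a hypothetical stand-in for Kafka/NATS/etc.:

```ts
import { randomUUID } from "node:crypto";

type AppEvent = {
  id: string;
  name: string;       // past tense: "UserRegistered", "PostCreated", ...
  payload: unknown;
  timestamp: number;  // unix timestamp
  traceId: string;
};

// Hypothetical bus interface – in production this would be backed by Kafka, NATS, etc.
interface EventBus {
  publish(topic: string, event: AppEvent): Promise<void>;
}

const publishUserRegistered = async (
  bus: EventBus,
  traceId: string,
  user: { id: string; email: string }
) => {
  const event: AppEvent = {
    id: randomUUID(),
    name: "UserRegistered",
    payload: { userId: user.id, email: user.email }, // "M" size: name + useful data
    timestamp: Date.now(),
    traceId, // passed through from the original request
  };
  await bus.publish("user-events", event);
};
```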
vi. Say "NO" to master-master
Don't use master-master technologies.
It seems like a "silver bullet," but in reality, these are technologies that should be used only when you've tried all other options:
- Too much data?
- Try separating cold and hot data, where cold data is archived and set aside for the future
- If data is only needed for projection, move it to horizontally scalable OLAP databases (Clickhouse) and remove it from the main database
- INSERT not fast enough? Use batching layers, like Kafka, where you dump all records that will later be batched and put into final storage
- UPDATEs too slow? Often you can change UPDATE to INSERT or use Event Sourcing / EDA – save events about changes, then aggregate and calculate projections from the other end
- DELETEs too slow? Mark data for deletion (deletedAt) and clean them once via cron
If you need m-m clustering, it means you have a complex task. If you have a complex task, it means it will be difficult to maintain reliability and even more difficult to debug.
m-m clustering significantly reduces reliability and speed while increasing infrastructure complexity and debugging difficulty.
If you want to use master-master, ideally your data should not have any kind of Constraints (and minimize UNIQUE), you should only use INSERT and SELECT (essentially, these are time-series, event sourcing, or OLAP data).
If you still need m-m, use technologies that are based on (or better yet, cannot operate without) CRDTs (Conflict-free Replicated Data Types).
That is, Redis / MySQL have optional m-m clustering, so you definitely shouldn't use them for this purpose.
On the other hand, etcd / cockroach / clickhouse are initially designed to work in a cluster, which means they can be trusted. But when transitioning to them, you will still pay a price, so you should be confident in your decision.
vii. Say "yes" to master-slave
Only for cache do I grant the right not to have slave replication, but only because cache should be treated as data that has the right to disappear at any moment.
In all other cases, I always choose technologies and configure them to have a slave, which will be (1) a read replica, (2) a fallback in case the master fails, (3) a backup node.
If you need strict slave synchronization, then choose technologies with RAFT.
When you need to grow, deploy N master-slave nodes and manage data between them at the application logic level (microservices, logical shards, actors, domains, etc.)
viii. Horizontal scaling
Write horizontally scalable applications.
First, absolutely any code is always concurrent, even within the most single-threaded language. When you write with horizontal scaling in mind, you remember this more often and make fewer racing errors.
Second, in modern realities, it's very easy to hit the ceiling of a single-instance application, and transitioning from vertical scaling to horizontal is incredibly difficult, while in the opposite direction, no problems arise.
With horizontal scaling, you need to consider:
- You'll need external storage for state synchronization (Redis-like)
- Absolutely all processes become concurrent, which means you need to either know how to distribute (for example, round-robin on queues) or know how to lock (Redis-like / etcd-like)
- Problems can occur on only some instances, so in resource monitoring, you need to separate each individual instance
IMPORTANT! Don't confuse horizontal scalability with "microservices" - you can horizontally scale a monolith as well (especially a distributed one).
💾 Databases
i. ORM or not
If you're writing a library/service that can be used with different databases, then you can use an ORM.
In all other cases (which means, almost always) use libraries that are as close as possible to the query language, meaning either pure SQL/CQL/Dynamo API, or a Query Builder.
ii. Migration first
In 90% of cases, this is a much more convenient and reliable approach:
- Write migrations
- Apply them to the database
- Perform introspection – export the table schema into your language's type system and constants (table names, column names, etc.)
iii. Optimistic & Pessimistic Concurrency Control
Pessimistic Concurrency Control (PCC) – lock the data when retrieving it, make changes, write it back, remove the lock (essentially a Mutex).
+ operations are reliable
- slow and there's a chance of deadlocks
Optimistic Concurrency Control (OCC) – retrieve data, modify it, when trying to write it back, check that no one else has changed it before us (for example, when writing, check that the same updated_at or version remains).
+ fast, simple
- the more competition, the slower the system will work or not work at all
We can immediately conclude that if your algorithm involves competition (multiple processes should sequentially UPDATE the same data), then OCC is definitely not suitable. But if you only have a probability of competition (two processes decided to write to the same entity), OCC can significantly speed up the system with minimal cost.
(O|P)CC are applicable to absolutely any data sources and to building code logic.
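For example, a sketch of OCC over a relational table; the `db.query` client is a hypothetical pg-style driver, and the table/columns are illustrative:

```ts
// Hypothetical pg-style client with query(sql, params) -> { rowCount }
declare const db: { query(sql: string, params: unknown[]): Promise<{ rowCount: number }> };

const updateOrderStatus = async (orderId: string, expectedVersion: number, status: string) => {
  // The UPDATE only applies if nobody has bumped the version since we read the row
  const res = await db.query(
    `UPDATE orders
        SET status = $1, version = version + 1
      WHERE id = $2 AND version = $3`,
    [status, orderId, expectedVersion]
  );

  if (res.rowCount === 0) {
    // Lost the race: re-read the row and retry, or surface a conflict error
    return { ok: false as const, error: "version_conflict" };
  }
  return { ok: true as const };
};
```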
iv. Transactions
If you've never used different transaction isolation levels, it means you've never written even medium-level applications or you did it incorrectly.
On average, it will be like this:
- Read uncommitted – we read data that is not yet committed (maximum speed, minimum reliability, suitable when the data we read cannot fail to be written, for example, because they have no Constraints)
- Read committed – we read only committed data (reliable, simple, works)
- Repeatable read – within a transaction when reading, we will always get the same data as if reading from a snapshot (more complex, but a common case, such as when using a sub-query)
- Serializable – we put all transactions in a queue (minimum speed, maximum reliability, prepare for deadlocks)
Also, be sure to study what gets locked and when (current/related rows, table, schema, or database) and know how to control the locking level (e.g., `SELECT ... FOR UPDATE / FOR SHARE / SKIP LOCKED`).
Tips:
- Design your architecture to minimize or avoid using transactions (and if you do use them, not higher than Repeatable read). And yes, architectural solutions can allow working without transactions.
- Never delay transactions – slowing down 1 transaction can lead to cascading growth and errors throughout the system.
- If you had to use Serializable, then most likely, you were just lazy / don't have time to do it differently.
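For reference, a short sketch of explicitly controlling both the isolation level and row locks (hypothetical pg-style client, illustrative table); note how the transaction stays as short as possible:

```ts
declare const client: { query(sql: string, params?: unknown[]): Promise<{ rows: any[] }> };

const withdraw = async (accountId: string, amount: number) => {
  // Repeatable Read + an explicit row lock; no timers or external calls inside
  await client.query("BEGIN ISOLATION LEVEL REPEATABLE READ");
  try {
    const { rows } = await client.query(
      "SELECT balance FROM accounts WHERE id = $1 FOR UPDATE",
      [accountId]
    );
    await client.query("UPDATE accounts SET balance = $1 WHERE id = $2", [
      rows[0].balance - amount,
      accountId,
    ]);
    await client.query("COMMIT");
  } catch (e) {
    await client.query("ROLLBACK");
    throw e;
  }
};
```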
v. Distributed Transactions
Never delay a transaction: don't use timers, don't make external calls.
Delaying a transaction geometrically increases the operation time of the entire system.
But what if it's necessary, for example, we depend on a third-party API?
Change the application logic so that it works without transactions.
If you approach the problem exactly this way, you will discover that there are ways to solve the issue.
And in those places where you need to know whether a set of actions has completed and in what state, create state machines, for example, "Jobs" (N-phase commits) or more reliable but complex "Sagas".
vi. Drop Relations
Most likely, you don't need relations.
This is surprising and counterintuitive, but at a certain volume of work, you will start to notice this yourself:
- ON DELETE CASCADE is an incredibly dangerous construct that (1) can delete necessary data, (2) significantly slows down the database, (3) is very difficult to control and debug. It can only be used with one-to-one relationships; in other cases, it's better to delete orphaned data with a cron job during periods of low load.
- Even more often, you'll notice that when deleting data, you don't actually need to delete related data—in fact, it's harmful—making ON DELETE CASCADE completely pointless.
- ON DELETE / UPDATE ... transfers business logic to the database, which leads to tons of debugging and not understanding "why it doesn't work."
- FOREIGN KEY for checking the existence of an entity is often meaningless—if you received some id, then most often you either have to verify its existence in advance or it definitely exists. And even if not, the cron job from the first point will eventually delete this data.
- Orphaned data most often won't interfere with your queries because if we've deleted the linking data, they simply won't appear in standard queries (possible collisions only in OLAP queries, but there you need to monitor many aspects anyway).
- Sooner or later, you'll need to store some data in one database and some in another, and at that point, you'll lose relations anyway. If you reject them from the beginning, you'll be able to use as many different databases as you want and horizontally scale your storage 100 times easier.
vii. Drop Constraints
In addition to relations, if you also stop using constraints (like UNIQUE) and build application logic and architecture around this, you can more easily transition to using horizontally scalable databases while maintaining high system performance.
viii. How to Choose a Database
Well, besides articles, community, selling experience, etc., it's also important to check:
- Will it handle the required QPS
- Look at the required transaction levels and atomicity
- Check the number of available connections
- Check replication availability
- Check if it's suitable for many small inserts or only large ones
- Is there Update functionality
- Is there Upsert functionality
- Are there bulk loading methods (e.g., COPY in PG)
- How MVCC and GC are structured (to understand the complexity of Insert/Update/Delete, as well as the causes of pauses)
- What mechanisms provide Optimistic/Pessimistic Concurrency Control
- Check types of numbers, dates, arrays, and the presence of JSON/unstructured types, as well as different methods of working with them
- Check for normal libraries: adapter, query builder, migrator, and introspection
- Row-based or column-based
- OLAP or OLTP
- Is there CDC
- Eventual or Strong consistency
- If master-master, what consensus algorithm
- Is something additional required for clustering (e.g., zookeeper, etcd)
(soon there will be a comparison table of PostgreSQL, MySQL, MongoDB, Clickhouse, CockroachDB, TimescaleDB, etc.)
ix. Use UUID
If there are no specific nuances, use UUID as the primary key:
- You can prepare multiple entities in code that are related to each other by id and insert them into the database at once (in the case of serial, you would have to insert, get the id, and only then insert the next entity)
- Allows the client to send the entity id and after a successful operation request it through separate endpoints/mechanisms (for example, receive it from WS)
- Allows for delayed insertion (for example, if you want to batch entities and insert them later, but related entities can already appear in the database)
- If you sum up the 3 points above: allows building Eventual Consistency systems
- In emergency situations, allows you to go through almost all tables to find which entity this id belongs to
BUT use UUIDs that start with a timestamp (e.g., UUID v7), this increases insertion speed significantly.
x. INSERT-s / UPDATE-s / DELETE-s must be batched
Most databases don't like individual inserts. Most likely, a database will process 1 or 1,000 records in roughly the same time (and some, like Clickhouse, insist on batches of 100,000 or even 1,000,000 records).
If you're creating a high-load system, design the architecture with the understanding that for optimization, you'll need to batch.
The main side effects of batching:
- If you don't persist messages in a third-party system (e.g., Kafka), you can lose them
- Batched messages won't be available at the moment they actually appear
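A minimal in-memory batcher sketch; the flush size, interval, and `bulkInsert` call are all illustrative:

```ts
type Row = { id: string; payload: string };

// Hypothetical bulk insert – e.g. a multi-row INSERT or COPY under the hood
declare const bulkInsert: (rows: Row[]) => Promise<void>;

const createBatcher = (maxSize = 1000, maxWaitMs = 500) => {
  let buffer: Row[] = [];
  let timer: ReturnType<typeof setTimeout> | null = null;

  const flush = async () => {
    if (timer) { clearTimeout(timer); timer = null; }
    if (buffer.length === 0) return;
    const rows = buffer;
    buffer = [];
    await bulkInsert(rows); // one round-trip instead of rows.length round-trips
  };

  const add = (row: Row) => {
    buffer.push(row); // note: rows live only in memory until flush (side effect 1 above)
    if (buffer.length >= maxSize) void flush();
    else if (!timer) timer = setTimeout(() => void flush(), maxWaitMs);
  };

  return { add, flush };
};
```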
xi. "Storages like Onions"
I tried to abstract the processes occurring in a bunch of different storages, including PostgreSQL, TimescaleDB, MySQL, MongoDB, Redis, Clickhouse, CockroachDB, YDB, Amazon Aurora, TiDB, RMQ, Kafka, RedPanda.
Undoubtedly, I forgot/don't know a lot, so your comments are needed.
- Connection establishment – establishing a connection with the client
- Processor allocation – allocating a unit that will process the request (process in PG, thread in MySQL, goroutine in CockroachDB, thread per core in RedPanda)
- Query processing – parsing the received message into logical steps (for example, parsing SQL)
- Schema validation – checking that the sent data corresponds to the schema
- Execution planning – determining if, in what sequence, and from where to retrieve data
- Indexes
- Verification – ensuring we don't violate existing indexes
- Creation – creating new ones (especially worth noting the atomicity of UNIQUE indexes)
- Concurrent access – for example, MVCC (creating new record versions or rollback journal)
- Transactions – transaction management
- Commit – confirming that the operation will succeed regardless of circumstances (for example, WAL record or RAFT commit)
- Respond with metadata – sometimes it's necessary to respond to the client with metadata in advance, for example, about the type of returned data
- Persistence
- Communication with the storage layer – sometimes this is part of the processing instance (PG), and sometimes it exists separately (YDB, TiDB)
- Compression – compressing data
- Storage optimization – enables efficient storage
- Batching – aggregating data in memory for subsequent flush
- Flush – unloading from memory to disk
- Cleanup – most storage systems will have one mechanism or another for cleanup, for example, VACUUM in PG or sector deletion in Kafka
- Clustering
- Membership – discovering and joining a cluster
- Leadership – determining leaders
- Health-checks – checking the availability of cluster components
- Anti split-brain – preventing Split-brain
- Recovery – recovering after separation from the cluster
- Rehydration – restoring data to the required level
- Configuration synchronization – synchronizing the final state of cluster configuration and individual nodes
- Index synchronization – synchronizing indexes
- Data synchronization – synchronizing the data itself
- Master-Slave replication – replicating data for further reading
- Partitioning
- Local – dividing large master tables into smaller ones by key within 1 instance to optimize IO operations
- Sharding – dividing and storing master tables on different instances
- Backups – some storages are capable of automatic backup of hot/cold data, for example, to S3 (TimescaleDB, Redpanda)
xii. Event bus
There are 2 options here:
- Almost always, first of all, you will need a persistent storage like Kafka:
- This gives you the ability to process messages in batches
- Re-read messages when invalidation is necessary
- Write in average constant time
- Have horizontal scaling
- If you are willing to sacrifice delivery reliability and persistence for the sake of speed, then use a Message Broker (NATS.io / EMQX)
And nothing prevents you from mixing both approaches.
xiii. SQLite
SQLite is essentially a library that allows operating on database files directly from a programming language.
Advantages
- Since SQLite's performance depends on language speed + processor power + IO throughput, theoretically it's one of the fastest databases, at minimum because it completely lacks all the network complexity of standard databases
- While other embedded storages are just key-value (rocksdb, leveldb, badger), or NoSQL with unique SDKs (couchdb-like), SQLite is a full-fledged SQL database, comparable in SQL capabilities to PostgreSQL: schema, indexes, transactions, locks, joins, constraints, everything is there
- Consequently, experience with any other database will be relevant, making SQLite very attractive to developers
- This also means you can switch from SQLite to PostgreSQL / MySQL almost painlessly when/if the time comes
- Due to its simplicity, SQLite is either already integrated (browsers, mobile devices, native apps) or easily added (I've seen hardware with minimal Linux that uses SQLite)
- You can work with one instance from multiple processes simultaneously
- To backup the database, you just need to upload files to S3
- Super simple integration testing because you can have a separate SQLite instance for each test (e.g., in-memory)
- If you create a SaaS without multitenancy and distribute it as a boxed solution, you can easily open an instance per client
- As free as possible
Disadvantages
- Classic network file systems don't allow multiple servers/containers to work with the same SQLite database (I only found information about VFS, but I don't yet understand how viable this option is)
- If something happens to the file system during data writing, there's a chance it will break beyond recovery
Use Cases
- Local database for applications with frontend (web, mobile, desktop)
- Local database for remote agents (applications that collect and send data to the cloud from devices, from a server in a warehouse)
- Cache for a single instance (for example, for actors)
- Database for startups/projects requiring work with large volumes of data, but without wanting to pay a lot for PostgreSQL/MySQL instances
Solving Disadvantages
We need 2 things:
- Sync Read replicas, so you can switch when the master fails
- WAL Streaming Backup for reliable backups
Optionally, async read replicas for delayed reading or even CRDT to transform it into a distributed multi-master p2p database
Ideally, all of this should be embedded in SQLite itself, meaning embedded in the language or as a sidecar process. Otherwise, I think using SQLite loses its purpose
Interesting Projects
- Pocketbase (https://pocketbase.io/) – admin panel, Firebase-like HTTP API, email, auth, file storage, logs and much more out of the box
- Turso (https://turso.tech/) – distributed SQLite in an Edge environment
- Electric (https://electric-sql.com/) – SQLite on the client side that synchronizes with PostgreSQL using CRDT, turning SQLite into a multi-master edge database
- LiteFS (https://github.com/superfly/litefs) – SQLite database replication at the file system level
- libSQL (https://github.com/tursodatabase/libsql/) – SQLite fork on which Turso is built, with the ability to deploy servers, replicas, auto-sync WAL to S3, and so on
- rqlite (https://github.com/rqlite/rqlite) – turning SQLite into a full-fledged database with read replicas, written in Go
- dqlite (https://github.com/canonical/dqlite) – roughly the same as rqlite, but in C
The Future of SQLite
- First, it's clear that SQLite is evolving into a database for Edge environments because it can operate with minimal resources
- Also, it's definitely destined to become a p2p database (in the style of couchbase) because it's already integrated/easily integrates with any client
- Adding built-in Sync Read replicas will make it more likely to be used in production applications
Additional Thoughts
- If I had to develop a standard web application for business now, I wouldn't overthink it and would use PostgreSQL
- For my own projects, I would happily use it
- If LiteFS or libSQL works well, I would more seriously consider using it in production applications
🔎 Testing
i. General
- Business logic is tested absolutely always, everything else is optional
- If a function doesn't use infrastructure, write unit tests
- In all other cases, always write integration tests
- Tests are simple (you just need to learn how to set them up)
- Tests are an extra hour of coding that saves you 10 hours of sleep
- How many tests should you write for a function? One, then more if desired
- When you find a bug, first write a test to reproduce it, then fix the bug
ii. Unit, Integration, E2E tests
- Unit tests – testing without environment (DB, caches, API, etc.), meaning all integrations are replaced with mocks, tests are part of the code.
- Integration tests – we expose integrations (DB, caches, API, etc.) for tests, tests are part of the code.
- E2E – we expose the full application with all integrations and call its API, tests are a separate codebase.
iii. Integration tests
- The launch script can be written in a Makefile
- Before running tests, set up a local environment (preferably through docker-compose)
- Wait for startup
- Apply migrations and fixtures
- Take a DB snapshot before each test
- Create a separate connection for each test
- After each test, delete the snapshot and close the connection
- At the end of tests, DO NOT close the environment
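A sketch of that lifecycle with a vitest-style runner; `createTestDb` and its snapshot/restore helpers are hypothetical and depend on your database:

```ts
import { beforeAll, beforeEach, afterEach, test, expect } from "vitest";

type Connection = { query(sql: string): Promise<unknown>; close(): Promise<void> };

// Hypothetical helpers around your database
declare const createTestDb: () => Promise<{
  connect(): Promise<Connection>;
  snapshot(): Promise<string>;
  restore(id: string): Promise<void>;
}>;

let db: Awaited<ReturnType<typeof createTestDb>>;
let conn: Connection;
let snapshotId: string;

beforeAll(async () => {
  // docker-compose is already up; migrations and fixtures were applied by the launch script
  db = await createTestDb();
});

beforeEach(async () => {
  snapshotId = await db.snapshot(); // snapshot before each test
  conn = await db.connect();        // separate connection per test
});

afterEach(async () => {
  await conn.close();
  await db.restore(snapshotId);     // roll the DB back and drop the snapshot
});

test("creates a user", async () => {
  await conn.query("INSERT INTO users (id, email) VALUES ('u1', 'a@b.c')");
  expect(await conn.query("SELECT * FROM users WHERE id = 'u1'")).toBeTruthy();
});
```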
🌎 Logging, metrics, tracing
i. Metrics are your spider sense, Traces are your map, Logs are your eyes
- At large volume, the only way to see if everything is (not) ok is through metrics
- The secondary tool is tracing
- And only as a last resort do we use logging to clarify details
Accordingly, by priority we should primarily add metrics, secondarily tracing, and tertiarily logs.
If you have a small project, the priority goes in the opposite direction.
ii. Meta-info of logs
- Add commit hash
- Level
- Service name
- Unique id
- Service start time
- Function call stack
- Request ID
- Trace ID
iii. Technical
- Use zero-allocation loggers
- Write to stdout
- Verify that the logger writes asynchronously
👨👧👦 Leading
i. Triptych – the ideal structure of a technical team
Once again, I come to the conclusion that technical teams should consist of Team Lead + Senior + a set of Middle developers, and not otherwise.
- Team lead (ears and mouth) – a single entry point for business, responsible for delivery timelines, technical backlog, and team condition; creates working conditions for Senior and Middle
- Senior (brain) – responsible for the quality and functionality of the entire system, and therefore: makes technical decisions, has veto power, puts out fires, creates technical conditions for Middle work
- Middle (hands) – responsible for the functionality of the code they write: aligns decisions with Senior, builds features and ensures they work
Notes
- I called them "Team lead", "Senior" and "Middle" because there are no more appropriate words (except those in parentheses); in reality, a person at the "Senior" level can be in the "Middle" role
- These are ROLES, which means they can be combined by the same person
- There should be a maximum of one Team Lead and one Senior, but there can be many Middles, ideally within the limit of 10
- Team Lead may not have technical skills (I call these PMs, but this doesn't remove the responsibilities described above)
- I deliberately didn't include QA, DevOps, CTO, Architect, etc. because they either can reuse what's written above, or are more hyperbolic versions (for example, CTO is a Team Lead who is also responsible for payroll)
Why exactly this team structure
- Decisions become (in)correct only after you've applied them, so someone simply needs to take responsibility for choosing a path and waiting for the result. If several people try to approve such a decision, they will have big problems; therefore, there should be only one captain (the Senior).
- Communication with business is painful. There can be an incredible amount of it. Both incoming to the team and outgoing. At the same time, business often doesn't know how to communicate with developers and vice versa, so letting them interact on a regular basis is definitely not worth it. They can be introduced and left for some time (feature creation), but all communication and responsibility should be in one person's hands (Team Lead)
- Seniors complement Middles, and Middles complement Seniors: Middles can get all the knowledge they need, Seniors can realize themselves as "senseis" and at the same time learn by structuring their knowledge while transferring it to Middle developers. Such an Ouroboros allows both to grow and enjoy the process.
- Two Senior developers can initially have conflicts (I wrote about this above), and in this situation they may also ask each other too few questions (for example, out of embarrassment) and build something incredibly absurd simply because they didn't dare to discuss it in advance (this is really about maturity, but that's a separate topic)
How to create such a team
- Openly and explicitly discuss who takes on which role, ensuring that Seniors clearly transfer decision rights on business matters to the Team Lead, while Middles explicitly transfer rights for technical decisions to Seniors
- Establish good, open, and constant communication between all these links
- Give Seniors the ability to make technical decisions and hire people in accordance with them
What are the dangers
- If the Team Lead has poorly developed soft skills and lacks the "steel balls" to say "no" to both developers and business, everyone will burn out, business will suffer, and turnover will begin
- If you choose a bad Senior, they will lead the entire team into the abyss, but if you don't give Seniors the right to make risky decisions, good Seniors won't be able to accomplish anything either
- If Seniors aren't ready to listen to Middles, or Middles aren't willing to agree with Seniors' final decisions, the scheme won't work, which is why it should be the Senior who builds the team
💻 Programming
i. Everything is concurrent
Even within single-threaded languages, as soon as you create long-lived objects (for example, in-memory cache), ALWAYS treat them as entities that can be modified from different points in the program in a concurrent manner.
And for this:
ii. Avoid Mutexes
Mutexes allow us to synchronize in a concurrent environment, BUT mutexes always have a huge chance of getting stuck forever or turning into a cascade of interconnected mutexes.
Almost always, I prefer structuring operations in a sequence to avoid mutexes, and I recommend you do the same.
Examples of patterns with sequential processing: Actor Model, Serializable Transactions, RAFT.
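A tiny actor-style sketch of replacing a lock with sequential processing: all writes go through one mailbox and are applied one at a time, so the state itself needs no mutex:

```ts
type Message = { type: "deposit"; amount: number } | { type: "withdraw"; amount: number };

const createAccountActor = () => {
  let balance = 0;               // state owned by the actor only
  const mailbox: Message[] = [];
  let processing = false;

  const process = async () => {
    if (processing) return;      // only one processing loop at a time
    processing = true;
    while (mailbox.length > 0) {
      const msg = mailbox.shift()!;
      // ... await side effects here; `balance` is still touched by one loop only
      balance += msg.type === "deposit" ? msg.amount : -msg.amount;
    }
    processing = false;
  };

  return {
    send: (msg: Message) => {
      mailbox.push(msg);
      void process();
    },
    getBalance: () => balance,
  };
};
```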
iii. Program as if everything is already broken
Applications crash. And they do it in the most unpleasant and dangerous places.
Most often, it's your fault, but in 10% of cases, it's due to external circumstances.
Always program as if the application could crash at any moment:
- If you have a state machine, don't forget to write a cron job that will move hung state machines to some final state.
- If there's a set of data that must be written together or not written at all, then use: (1) prepare all operations on entities in advance and write them in one atomic operation at the end, (2) combine this data into one table/message in a queue (and if you need to separate them later, do it as a separate process), (3) create and run a state machine (sooner or later it will do what's needed), (4) use transactions.
- If you need data at a specific moment in time, write it as soon as you can gather it together, BUT make sure that the write operation is idempotent (meaning that if the same input data enters the same code logic again, it won't create a duplicate record).
- You might be lucky and the application will crash with SIGINT or SIGTERM, so take care to implement graceful shutdown.
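Continuing the third point above (idempotent writes), a sketch where the client-provided id makes a retried write a no-op; the `db.query` client and table are illustrative:

```ts
// Hypothetical pg-style client; `m.id` arrives with the input data (e.g. a client-generated UUID)
declare const db: { query(sql: string, params: unknown[]): Promise<unknown> };

const saveMeasurement = async (m: { id: string; deviceId: string; value: number }) => {
  // If the process crashed after the first write and the same input is replayed,
  // this INSERT simply does nothing instead of creating a duplicate record
  await db.query(
    `INSERT INTO measurements (id, device_id, value)
     VALUES ($1, $2, $3)
     ON CONFLICT (id) DO NOTHING`,
    [m.id, m.deviceId, m.value]
  );
};
```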
iv. And much, much more
In general, all the philosophical foundations that I try to follow are described in the "Pillars" chapter of the FOP book, so I'll leave a link to it here:
👨🏻 About the Author
Hi! My name is David Shekunts and I'm a Golang / Node.ts Tech Lead & mustache owner 👨🏻
Github: https://github.com/Dionid
Telegram: t.me/davidshekunts
Wishing everyone powerful growth 💪