Finally, we are entering the nitty-gritty of the design of the system. It is time to make choices regarding components and technology.
We focus on the internal design of the system, ensuring that each component serves its purpose efficiently and integrates seamlessly with others. At this stage, we do not discuss infrastructure aspects such as cloud deployment, load balancing, or CI/CD pipelines in depth; these will be addressed in a later chapter.
There is, however, one important aspect of the platform that we need to settle here. As we discussed in the third installment of this series, the constraints chapter, we opted for a cloud solution instead of on-premises hosting or a VPS.
From the technical constraints chapter:
Decision Point: Given the need for scalability and low upfront cost for a Discord clone, cloud deployment is the preferred option.
Justification: Cloud infrastructure allows us to leverage auto-scaling, managed services, and global CDNs to minimize latency and optimize costs.
The cloud provider of choice for our example is MS Azure. Lately I have been working a lot with this platform, and designing this system is a good way to get some more practice.
The components described in this chapter will, where possible, leverage the capabilities of MS Azure. It makes little sense to deploy an application on a cloud provider and then implement everything from scratch, at least not at the beginning. There is always time to find out that we are an edge case in terms of performance or need some very specific functionality; such problems, even if difficult to solve, are normally a consequence of a successful launch of an application.
So let's hope we end up with one of these "standard cloud component" kinds of problems. The three major cloud providers (Azure, AWS, and GCP) all offer equivalent tools to design and implement our Discord clone example application. In case you want to follow my structure with another cloud provider, please post your component list of choice in the comments.
Core Components Overview
A well-designed system is composed of clearly defined components, each responsible for specific functionality. In our practical example, we have chosen specific technologies to implement these components while following a modular monolith architectural approach. The system comprises several distinct components, but it would be extremely pretentious to design an example app using a microservice architecture. The key system components include:
Client Applications – Interfaces through which users interact with the system.
API Gateway – Centralized entry point for client requests.
Authentication & Authorization Service – Manages user identity and permissions.
Real-Time Messaging Service – Handles real-time communication.
User & Channel Management Service – Maintains user data and communication channels.
Media Processing Service – Manages audio and video streaming.
Storage Services – Handles persistent and ephemeral data.
Notification Service – Sends real-time notifications.
Analytics & Monitoring Service – Collects and processes system metrics.
Each of these components is designed with scalability, maintainability, and performance in mind. The goals and constraints were decided in the previous chapters.
Client Applications
Client applications provide the interface for end users to interact with the system. In our practical example, the constraints we defined in Chapter 3.1 led me to choose a web-only client built with a TypeScript framework. To challenge my knowledge I chose React. It is a major framework in the landscape, and it can certainly cover any requirement I could have in terms of UI. The client will communicate with the backend via REST and WebSocket APIs.
Responsibilities:
Display user interface and handle user interactions.
Communicate with backend services through API requests.
Maintain session state and caching for performance optimization.
API Gateway
The API Gateway acts as the single entry point for all client requests, handling routing, load balancing, and security enforcement.
Responsibilities:
Forward requests to appropriate backend services.
Implement rate limiting and security policies.
Handle authentication tokens and request validation.
As defined in the constraints, we will use Azure API Management as our API Gateway solution, ensuring a scalable and secure request-handling mechanism.
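In Azure API Management, rate limiting is configured declaratively through policies rather than in application code. To illustrate the underlying idea, here is a minimal token-bucket sketch in plain Java; the class and its parameters are hypothetical, purely to show the concept the gateway applies per client:

```java
// Token-bucket rate limiter: each client gets a bucket of tokens that
// refills at a fixed rate; a request is allowed only if a token is left.
// Hypothetical sketch; Azure API Management does this via its policy engine.
class TokenBucket {
    private final long capacity;
    private final double refillPerMillis;
    private double tokens;
    private long lastRefill;

    TokenBucket(long capacity, double refillPerSecond) {
        this.capacity = capacity;
        this.refillPerMillis = refillPerSecond / 1000.0;
        this.tokens = capacity;               // start full
        this.lastRefill = System.currentTimeMillis();
    }

    synchronized boolean tryAcquire() {
        long now = System.currentTimeMillis();
        // Add tokens earned since the last call, capped at capacity.
        tokens = Math.min(capacity, tokens + (now - lastRefill) * refillPerMillis);
        lastRefill = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;                      // request allowed
        }
        return false;                         // over the limit, reject (HTTP 429)
    }
}
```

In the managed gateway the same behavior is expressed as a policy with a call limit and renewal period, which is exactly why we do not implement this ourselves.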
Authentication & Authorization Service
This service is responsible for user authentication, session management, and role-based access control.
Responsibilities:
Authenticate users using OAuth2 and JWT.
Manage permissions and access control lists (ACLs).
Support third-party authentication (e.g., Google, GitHub login).
In our practical example, we have chosen Azure Active Directory (Azure AD) as our identity provider, integrating it with the API Gateway and other backend services.
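To make the JWT flow concrete, the sketch below decodes the payload segment of a token. This is an illustration of the token structure only: the class name is hypothetical, and in the real system the signature must be verified against Azure AD's published signing keys (via a proper JWT library) before any claim is trusted.

```java
import java.util.Base64;

// Decodes the payload segment of a JWT (header.payload.signature) for
// inspection. Hypothetical sketch: it does NOT validate the signature,
// which a real service must always do before trusting the claims.
class JwtInspector {
    static String payloadJson(String jwt) {
        String[] parts = jwt.split("\\.");
        if (parts.length < 2) {
            throw new IllegalArgumentException("not a JWT");
        }
        // The payload is Base64URL-encoded JSON with the claims.
        return new String(Base64.getUrlDecoder().decode(parts[1]));
    }
}
```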
Backend Service
The backend is responsible for handling business logic and processing client requests. As I have said in the past, I am a Java guy. Lately I have been using Quarkus and "standard" JEE more, but my home base for Java backends is still Spring Boot, so this is the backend framework of choice.
Responsibilities:
Expose REST and WebSocket endpoints.
Handle business logic for messaging, user management, and channels.
Interact with databases, messaging systems, and external services.
Spring Boot was chosen due to its extensive ecosystem, strong community support, and seamless integration with other Java-based tools and frameworks. But mostly, as said, it is the framework I know best.
Real-Time Messaging Service
The messaging service handles the real-time exchange of messages between users, ensuring low latency and high throughput.
Responsibilities:
Manage WebSocket connections for live messaging.
Implement message delivery guarantees (at-least-once, at-most-once, exactly-once).
Support presence and typing indicators.
As defined in the constraints, we will use Azure Event Hubs for event-driven communication, ensuring scalability and resilience in message delivery. This is another case where the cloud provider already offers an out-of-the-box solution. My experience would make me lean toward Apache Kafka or RabbitMQ for simplicity, but Azure Event Hubs does exactly what we need without us having to manage the installation ourselves.
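At-least-once delivery, the guarantee we realistically get from an event-driven broker, means consumers must tolerate redeliveries. The usual answer is idempotent consumption: track the IDs of messages already processed and drop repeats. A minimal plain-Java sketch with hypothetical names (in production the seen-ID set would live in a shared store with a retention window, not in process memory):

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Idempotent consumer sketch for at-least-once delivery: a message is
// applied only the first time its ID is seen; redeliveries are dropped.
class DedupingReceiver {
    private final Set<String> seenIds = new HashSet<>();
    private final List<String> delivered = new ArrayList<>();

    // Returns true when the message is delivered for the first time.
    boolean onMessage(String messageId, String body) {
        if (!seenIds.add(messageId)) {
            return false;          // duplicate redelivery: ignore
        }
        delivered.add(body);       // apply the message exactly once
        return true;
    }

    List<String> delivered() {
        return delivered;
    }
}
```

This is why message IDs must be assigned by the producer before the first send attempt: a retry must carry the same ID, or deduplication is impossible.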
User & Channel Management Service
This service is responsible for maintaining user profiles and managing communication channels.
Responsibilities:
Store and retrieve user profile information.
Manage chat rooms, groups, and user membership.
Provide search and discovery features for channels.
I generally prefer to use relational databases when I can. I'm a big fan of DDD, but I know the struggle of designing aggregates that can be managed easily in the real world. This part of the application will certainly need some hierarchical structure and relations between stored data objects. For this scenario, relational databases are a straightforward choice for me.
We will use Azure SQL Database as our relational database for structured data storage. This choice provides a fully managed, scalable solution with built-in high availability, automatic backups, and advanced security features. Compared to PostgreSQL and MySQL, Azure SQL Database offers seamless integration with other Azure services, supports serverless compute and hyperscale storage, and includes intelligent performance tuning capabilities. While PostgreSQL is known for its advanced indexing and JSON support, and MySQL for its widespread adoption and simplicity, Azure SQL Database provides a balanced option optimized for enterprise-grade workloads in an Azure-native environment.
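The hierarchical structure mentioned above (servers containing categories containing channels) maps naturally onto parent-child rows in a relational schema, the classic adjacency list. The plain-Java sketch below models the same idea in memory just to illustrate the traversal we would otherwise express with a recursive SQL query; all names are hypothetical:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Adjacency-list view of the channel hierarchy: each node keeps a list
// of its direct children, mirroring a parent_id column in SQL.
class ChannelTree {
    private final Map<String, List<String>> children = new HashMap<>();

    void add(String parent, String child) {
        children.computeIfAbsent(parent, k -> new ArrayList<>()).add(child);
    }

    // Depth-first listing of everything under a server or category,
    // the in-memory analogue of a recursive CTE over parent_id.
    List<String> descendants(String root) {
        List<String> out = new ArrayList<>();
        for (String c : children.getOrDefault(root, List.of())) {
            out.add(c);
            out.addAll(descendants(c));
        }
        return out;
    }
}
```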
Storage Services
Storage services handle both persistent and temporary data storage needs.
Responsibilities:
Store chat messages and media files.
Maintain ephemeral storage for temporary messages.
Provide backup and disaster recovery mechanisms.
This is the second major storage system. After user data, channels, and so on, we need a place to persist the messages and the files exchanged.
For file storage, we will use Azure Blob Storage, which provides a highly available, durable, and scalable solution for unstructured data. This service is ideal for storing chat media such as images, voice messages, and attachments. It supports automatic replication, ensuring data resilience, and integrates seamlessly with other Azure services, such as Azure Content Delivery Network (CDN), to optimize content delivery performance. Given the need for efficient media storage and retrieval in our system, Azure Blob Storage aligns well with our requirements for scalability and reliability.
Notification Service
This service ensures users receive notifications for relevant events.
Responsibilities:
Send push notifications to mobile and web clients.
Deliver email and in-app notifications.
Manage notification preferences for users.
We will integrate Azure Notification Hubs for push notifications, as it provides a scalable, cross-platform solution for sending notifications to various client applications. Azure Notification Hubs supports high-throughput push notifications, offering integration with both APNs (Apple Push Notification service) and FCM (Firebase Cloud Messaging), ensuring reliable and efficient message delivery. Its ability to segment audiences and personalize notifications makes it an ideal choice for our system's real-time engagement needs.
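Before a notification ever reaches Azure Notification Hubs, the service has to decide who should receive it according to their preferences. A minimal sketch of that filtering step, with hypothetical names; in the real system the resulting audience would be expressed as Notification Hubs tags for delivery:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Per-user notification preferences: a user receives only the event
// types they have opted into. Hypothetical sketch of the filtering
// logic that runs before handing the audience to the push service.
class NotificationFilter {
    private final Map<String, Set<String>> prefs = new HashMap<>();

    void enable(String user, String eventType) {
        prefs.computeIfAbsent(user, k -> new HashSet<>()).add(eventType);
    }

    void disable(String user, String eventType) {
        Set<String> p = prefs.get(user);
        if (p != null) {
            p.remove(eventType);
        }
    }

    boolean shouldNotify(String user, String eventType) {
        return prefs.getOrDefault(user, Set.of()).contains(eventType);
    }
}
```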
Analytics & Monitoring Service
To maintain system health and performance, we need a dedicated service for collecting metrics and logs.
Responsibilities:
Track system performance (e.g., latency, request throughput).
Monitor user engagement and feature usage.
Provide alerts for anomalies and failures.
As defined in the constraints, we will use Azure Monitor for monitoring and Azure Monitor Metrics for performance tracking and visualization. Azure Monitor provides centralized logging, metric collection, and alerting capabilities, allowing us to gain real-time insights into application performance and system health. Compared to Prometheus and Grafana, which require self-managed setups and manual scaling, Azure Monitor offers a fully managed solution with seamless integration into the Azure ecosystem. It supports automatic scaling, intelligent anomaly detection, and deep integration with Azure security and compliance features, making it a robust choice for our monitoring needs.
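The custom application metrics we would push to Azure Monitor (request latency being the obvious one) boil down to simple aggregates maintained in the service and flushed periodically. A hypothetical sketch of such an aggregate; the actual ingestion would go through the Azure Monitor SDK or agent:

```java
// Aggregates request latencies between metric flushes. Hypothetical
// sketch: shows the kind of aggregate worth exporting, not the Azure
// Monitor ingestion API itself.
class LatencyTracker {
    private long count;
    private long totalMillis;
    private long maxSeenMillis;

    synchronized void record(long millis) {
        count++;
        totalMillis += millis;
        maxSeenMillis = Math.max(maxSeenMillis, millis);
    }

    synchronized double averageMillis() {
        return count == 0 ? 0.0 : (double) totalMillis / count;
    }

    synchronized long maxMillis() {
        return maxSeenMillis;
    }
}
```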
Media Processing Service - OPTIONAL
This is an optional component that would be nice to implement but can be kept out of an MVP.
I will nevertheless describe its responsibilities and the choices made on the topic.
The media processing service enables users to send and receive voice and video streams, supporting live calls and media playback.
Responsibilities:
Encode and decode audio/video streams.
Provide media storage and retrieval.
Enable live streaming and conferencing.
For media handling, we will leverage Azure Communication Services, which provides built-in support for real-time voice, video, and chat communication. This service is fully managed, allowing us to integrate seamless peer-to-peer and group communication capabilities without handling complex infrastructure. It supports SDKs for multiple platforms, ensuring interoperability between web and mobile clients. Additionally, Azure Communication Services integrates natively with other Azure solutions, improving scalability and security for our media processing requirements.
By defining these components in detail, we establish a solid foundation for our application’s design. In the next chapters, we will explore how these components interact within an infrastructure, addressing scalability, deployment strategies, and cloud considerations.
This chapter should also describe the basic architecture of the software components themselves, but I think that would make it too long and convoluted. So stay tuned for part two of the design process, where we will do a component drill-down and decide what will be implemented where, along with the common structure of the applications.
Articles I enjoyed this week
For the friends who speak Portuguese: Por que nos deixamos definir pela nossa profissão? by Paula Alves (@ondeestapaula)
8 Key Protocols You Can’t Ignore, 5 Steps to Understand How VPN Tunneling Works, and 7 Smart Strategies to Reduce Latency – Sketech #18 by Nina