Build it Right: 9 Fundamental Principles of Data Architecture.
- Jainam Shroff
- Mar 5
- 8 min read
Are you tired of making data architecture decisions in the dark? Get ready to supercharge your decision-making process! This robust framework acts as your trusted compass, illuminating the path to building strong, scalable data systems that stand the test of time.
This article draws inspiration from Joe Reis and Matt Housley's seminal book Fundamentals of Data Engineering: Plan and Build Robust Data Systems. This essential reference book for data engineers provides a comprehensive overview of data engineering principles and fundamentals.
I'll present these principles in an engaging way, using real-world examples to bring the concepts to life. As the CTO of a tech startup, I use these principles to guide all my data architecture decisions.

Choose Common Data Architecture Components Wisely
Let's first define what we mean by a component. In data architecture, components encompass monitoring tools, orchestration systems, object storage, version control systems, and processing systems.
Here are some real-world examples to clarify each type:
Monitoring Tools: Grafana for visualizing metrics, Prometheus for collecting and storing metrics, and DataDog for comprehensive infrastructure monitoring.
Orchestration Systems: Apache Airflow for workflow management and Kubernetes for container orchestration.
Object Storage: Amazon S3 for cloud storage and Google Cloud Storage for scalable object storage.
Version Control Systems: Git with platforms like GitHub or GitLab for code versioning, and DVC (Data Version Control) for managing data versions.
Processing Systems: Apache Spark for large-scale data processing, Apache Kafka for real-time stream processing, and Apache Flink for stateful computations.
Components encompass everything involved in data engineering work, and choosing them wisely means selecting tools that benefit both individual projects and the organization.
While components should primarily serve project-specific needs, they must integrate smoothly with organizational workflows. This balance promotes better interoperability and team collaboration. However, avoid forcing a one-size-fits-all approach, as this can hamper productivity.
Here's a real-world example to illustrate this principle:
Imagine a large tech company where different teams handle various data products. The infrastructure team decided to standardize all data processing on Apache Spark because it's effective and well-known. However, this decision might not work well for all teams:
The real-time analytics team needs to process streaming data in milliseconds - Spark's batch processing approach might be too slow
The ML team works primarily with Python and small datasets - Spark's overhead might be unnecessary for their use case
The data warehousing team deals with large-scale batch processing - Spark would be perfect for them
Instead of forcing all teams to use Spark, a better approach would be to:
Allow the real-time team to use Apache Flink for stream processing
Let the ML team use pandas for their Python-based workflows
Keep Spark for the data warehousing team where it makes sense
This way, each team can use tools that best fit their specific needs while maintaining some standardization where appropriate.
Again, it's all about finding the right balance!
Plan For Failure
"Anything that can go wrong, will go wrong." - Murphy's Law
As a data architect, you must understand four critical terms: availability, reliability, recovery time objective, and recovery point objective. Let's explore these concepts using an e-commerce website example:
Availability
This measures how often your system is operational and accessible. If your e-commerce website has 99.9% availability, it may be down for up to 8.76 hours annually. During peak times like Black Friday, even minutes of downtime can result in substantial revenue losses.
Reliability
This indicates how consistently your system performs its intended functions. In e-commerce, reliability means the system accurately maintains cart contents, calculates totals, and processes payments without errors. A system that randomly fails during checkout or displays incorrect prices would be considered unreliable.
Recovery Time Objective (RTO)
This defines the maximum acceptable downtime after a failure. With a one-hour RTO for your e-commerce platform, your team must restore the website within an hour of a crash. While customers can't make purchases during this time, you've determined that this recovery window meets your business needs.
Recovery Point Objective (RPO)
This specifies the acceptable amount of data loss during a failure. A 15-minute RPO means backing up data every 15 minutes. If your system crashes, you might lose up to 15 minutes of customer orders, cart updates, and other transactions since the last backup.
Together, these four metrics determine your system's resilience. For example, Amazon likely targets 99.99% availability, near-perfect reliability, minute-level RTO, and second-level RPO—because even brief disruptions could result in millions in lost sales.
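To make these availability numbers concrete, here's a minimal sketch that converts an availability target into an annual downtime budget (the 99.9% figure and its 8.76-hour budget are the ones from the e-commerce example above):
# Convert an availability target into an annual downtime budget
HOURS_PER_YEAR = 365 * 24  # 8,760 hours

def annual_downtime_hours(availability_pct):
    """Maximum hours of downtime per year for a given availability percentage."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)

for target in (99.0, 99.9, 99.99):
    print(f"{target}% availability -> up to {annual_downtime_hours(target):.2f} hours of downtime per year")

# 99.0%  -> 87.60 hours
# 99.9%  -> 8.76 hours (the e-commerce example above)
# 99.99% -> 0.88 hours (roughly 53 minutes)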
The requirements for these four metrics vary depending on your company's context and industry. While e-commerce and similar domains require strict metrics due to their high-stakes nature, other domains—like internal data analysis tools or NGO systems—can take a more relaxed approach.
As a data engineer, you must consider these metrics while building data architecture. Run simulated failure scenarios repeatedly to verify whether you meet or exceed the company's targets.
Pro tip: Always aim to exceed performance expectations, even when company requirements are lenient.
Architect for Scalability
When discussing scalability, people often focus on scaling up—the ability of systems to automatically handle increased load and traffic spikes. However, scalability works both ways. Effective systems should also scale down during periods of low demand, conserving resources and reducing costs.
As a tech startup, our system currently handles modest loads. However, we've designed our data architecture to accommodate significant growth over the next several years. This forward-thinking approach ensures that when our startup experiences exponential growth (a likely scenario), our architecture will seamlessly scale to support the business.
Let's look at a practical example of scalability in action:
Imagine a food delivery app that typically processes 1,000 orders per day. During normal operations, it runs on two servers that handle the load efficiently. However, during special events like Super Bowl Sunday, order volume can spike to 10,000+ orders. Here's how the system scales:
Scaling Up: When order volume increases, the system automatically spins up additional servers (auto-scaling). The load balancer distributes requests across these servers to maintain performance.
Scaling Down: After the event ends and order volume returns to normal, the extra servers are automatically decommissioned to reduce costs.
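Here's a minimal, illustrative sketch of that scaling logic. The thresholds (orders per server, minimum and maximum server counts) are assumptions chosen for this food-delivery scenario, not real limits from any particular cloud provider:
# Illustrative auto-scaling decision logic for the food delivery example
# All thresholds are assumptions chosen for this scenario
import math

MIN_SERVERS = 2          # baseline capacity for ~1,000 orders/day
MAX_SERVERS = 20         # hard ceiling to keep costs bounded
ORDERS_PER_SERVER = 500  # assumed daily capacity of a single server

def desired_server_count(current_daily_orders):
    """Scale up when load spikes and back down when it returns to normal."""
    needed = math.ceil(current_daily_orders / ORDERS_PER_SERVER)
    return max(MIN_SERVERS, min(MAX_SERVERS, needed))

print(desired_server_count(1_000))   # 2  -> a normal day
print(desired_server_count(10_000))  # 20 -> Super Bowl Sunday spike
print(desired_server_count(800))     # 2  -> scaled back down after the event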
Architecture Is Leadership
This principle serves as a roadmap for data engineers, encouraging them to develop the leadership and critical thinking skills needed to transition into data architecture.
Think of it this way: data architect = data engineer + leadership skills.
Data engineers should actively develop their leadership skills and learn from their organization's data architects. This growth path often leads to transitioning from a data engineering role to a data architect position. As a data architect, you'll make key decisions about your organization's data strategy while maintaining a high-level perspective. Rather than handling development work directly, you'll delegate these tasks to your team and focus on overseeing data initiatives from a strategic viewpoint.
As a data architect, you'll apply these remaining principles while mentoring team members and selecting the most appropriate technologies for each team's projects.
Always Be Architecting
If I were to name this principle myself, I would call it “aim high all the time.”
This principle emphasizes that data architects must continuously evaluate their current infrastructure and systems while maintaining clear goals for the future. Once they establish both the current state and the desired outcome, architects can determine the optimal steps for their team to advance their data systems to the next level of sophistication.
Here's a real example to illustrate this principle:
Consider a small startup that currently processes data analytics reports monthly. They know they'll need to process reports daily or hourly as they grow. Instead of waiting for this need to arise, they:
Document their current manual reporting process
Identify areas that will need automation in the future
Plan the transition from manual to automated reporting
Begin implementing basic automation while maintaining manual oversight
This proactive approach ensures they're continuously architecting their system to meet current and future needs rather than waiting for problems to arise before making changes.
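As a sketch of that last step, imagine wrapping the existing manual report logic in a small script that can run on a schedule but still requires a human sign-off before anything is published. Every name here (the functions, the --approve flag) is hypothetical:
# Hypothetical first step toward automated reporting with manual oversight
import argparse

def build_report():
    """Placeholder for the existing manual reporting logic."""
    return "report.csv"

def publish(report_path):
    print(f"Published {report_path}")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Generate the analytics report.")
    parser.add_argument("--approve", action="store_true",
                        help="Publish automatically; omit to require manual review.")
    args = parser.parse_args()

    report = build_report()
    if args.approve:
        publish(report)
    else:
        print(f"Report generated at {report}. Review it, then rerun with --approve.")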
Build Loosely Coupled Systems
Think of code as a complex machine with many interconnected parts. In a tightly coupled system, these parts are so intertwined that changing one component often requires modifying many others - like replacing a single gear in a clock where all gears are welded together.
Let's look at a practical example:
# Tightly Coupled System
def process_data():
    data = get_data_from_database()
    clean_data = remove_duplicates(data)
    transformed_data = apply_business_rules(clean_data)
    save_to_file(transformed_data)
    send_email_notification()
    update_dashboard()
In this tightly coupled system, all operations are bundled together. You'd need to modify the entire function to change how data is cleaned or add a new transformation step.
# Loosely Coupled System
def get_data():
    return get_data_from_database()

def clean_data(data):
    return remove_duplicates(data)

def transform_data(data):
    return apply_business_rules(data)

def save_data(data):
    save_to_file(data)

def notify_stakeholders():
    send_email_notification()

def update_metrics():
    update_dashboard()

def process_data():
    data = get_data()
    cleaned_data = clean_data(data)  # renamed to avoid shadowing the clean_data function
    transformed_data = transform_data(cleaned_data)
    save_data(transformed_data)
    notify_stakeholders()
    update_metrics()
In this loosely coupled version, each function has a single responsibility. This design offers several advantages:
Easy Testing: Each function can be tested independently
Simple Maintenance: Bugs can be isolated and fixed in specific functions
Greater Flexibility: Functions can be modified or replaced without affecting others
Better Reusability: Functions can be used in different contexts or projects
This principle of loose coupling is crucial for building maintainable, scalable data architectures that can evolve with your organization's needs.
Similarly, you need to build loosely coupled systems for your data needs. Let's explore this idea further with this 2002 email from Jeff Bezos to his internal teams:
All teams will henceforth expose their data and functionality through service interfaces.
Teams must communicate with each other through these interfaces.
No other form of interprocess communication will be allowed: no direct linking, no direct reads of another team's data store, no shared-memory model, no backdoors whatsoever. The only communication allowed is via service interface calls over the network.
It doesn't matter what technology they use—HTTP, CORBA, Pub/Sub, or custom protocols.
All service interfaces, without exception, must be designed from the ground up to be externalizable. The team must plan and design to expose the interface to developers in the outside world. No exceptions.
We can leverage this idea in our own organizations (a short code sketch follows this list):
Define Clear Service Boundaries: Each data team should operate as an independent service provider with well-defined data access and manipulation interfaces.
Implement API-First Architecture: All data interactions should occur through documented APIs, eliminating direct database access and reducing system coupling.
Enable Self-Service Analytics: Teams should build interfaces that allow other departments to access and analyze data independently without requiring direct technical support.
Mandate Documentation: Every data service must be thoroughly documented, making it possible for external teams to understand and utilize it without direct communication.
Implement Access Controls: Each service should handle its authentication and authorization, ensuring data security at the service level.
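To illustrate the API-first idea, here's a minimal sketch of a data service that exposes a documented endpoint instead of letting other teams read its database directly. The endpoint, tokens, and data are hypothetical, and Flask is just one of many frameworks you could use:
# Hypothetical API-first data service: other teams call this endpoint
# instead of reading this team's data store directly
from flask import Flask, abort, jsonify, request

app = Flask(__name__)

VALID_TOKENS = {"analytics-team-token", "ml-team-token"}  # stand-in for real auth
ORDERS = {42: {"order_id": 42, "status": "shipped", "total": 99.50}}  # stand-in data store

@app.route("/v1/orders/<int:order_id>", methods=["GET"])
def get_order(order_id):
    # Each service handles its own authentication and authorization
    token = request.headers.get("Authorization", "").replace("Bearer ", "")
    if token not in VALID_TOKENS:
        abort(401)
    order = ORDERS.get(order_id)
    if order is None:
        abort(404)
    return jsonify(order)

if __name__ == "__main__":
    app.run(port=8080)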
Make Reversible Decisions
While we often wish to reverse our past decisions, life doesn't offer that option. However, as data engineers, we can—and should—ensure our data architecture decisions are reversible.
Data engineering is a rapidly evolving field with constant changes. Don't become overly dependent on any technology that seems attractive today—it may become irrelevant when better alternatives emerge. Instead, leverage the power of modularization (loosely coupled systems) to ensure changes can be easily reversed.
This principle can be better understood by the phrase "always choose two-way doors"—doors that allow you to walk back out of a room rather than one-way doors that lock you in.
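One concrete way to keep a decision reversible is to hide the technology choice behind a small interface, so you can swap the implementation later without touching downstream code. A minimal sketch, assuming a hypothetical need to store exported reports either on local disk or in object storage:
# Keeping a storage decision reversible: downstream code depends on the
# interface, not on any particular backend. Names and paths are illustrative.
from typing import Protocol

class ObjectStore(Protocol):
    def save(self, key: str, data: bytes) -> None: ...

class LocalDiskStore:
    def save(self, key: str, data: bytes) -> None:
        with open(key, "wb") as f:
            f.write(data)

class CloudStore:
    def __init__(self, bucket: str):
        self.bucket = bucket
    def save(self, key: str, data: bytes) -> None:
        # A real implementation would call the cloud SDK (e.g., boto3 for S3)
        print(f"Uploading {key} ({len(data)} bytes) to {self.bucket}")

def export_report(store: ObjectStore, data: bytes) -> None:
    # This code never changes, no matter which backend is chosen
    store.save("daily_report.csv", data)

export_report(LocalDiskStore(), b"order_id,total\n42,99.50\n")
export_report(CloudStore("my-data-bucket"), b"order_id,total\n42,99.50\n")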
Prioritize Security
All data engineers must also act as security engineers. In this modern world, it is essential to integrate security right from the start in the design of your data architecture.
Both perimeter security and zero-trust security models are essential. Consider a company that depends only on a firewall and VPN for protection. When an employee falls for a phishing email disguised as IT support and enters their credentials on a fake login page, the attacker gains legitimate access to the internal network, rendering those perimeter defenses useless.
Cloud providers like GCP and AWS secure their services, but the security of an organization's data architecture ultimately depends on how its end users implement and configure those services. This is known as the shared responsibility model.
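For example, AWS secures the S3 service itself, but whether a bucket is left publicly readable is your responsibility. A minimal sketch using boto3 to block public access (the bucket name is hypothetical):
# Shared responsibility in practice: the provider runs S3, but blocking
# public access to your bucket is on you. Bucket name is hypothetical.
import boto3

s3 = boto3.client("s3")
s3.put_public_access_block(
    Bucket="my-company-data-lake",
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)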
Embrace FinOps
In the past, when organizations primarily used on-premises data systems, managing costs was a relatively straightforward process that involved budgeting for hardware purchases, maintenance contracts, and staffing needs. Technology teams could predict expenses with reasonable accuracy since infrastructure costs remained relatively stable over extended periods.
However, the cost management landscape has transformed dramatically with the widespread adoption of cloud-based data tools and services. Organizations now face the complex challenge of continuously monitoring and optimizing costs, as cloud services operate on usage-based pricing models that fluctuate significantly based on data volume, processing requirements, and service utilization patterns.
As a data engineer, you should understand cloud providers' cost structures in detail and work to maximize your organization's ROI from these tools and services.
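As a simple illustration, usage-based pricing means your bill is a function of consumption, so even a rough cost model helps you spot expensive workloads early. The rates below are placeholders, not any provider's actual pricing:
# Rough usage-based cost model; rates are illustrative placeholders,
# not actual cloud provider pricing
PRICE_PER_TB_SCANNED = 5.00   # dollars per TB of data scanned by queries
PRICE_PER_GB_STORED = 0.02    # dollars per GB-month of storage

def estimate_monthly_cost(tb_scanned, gb_stored):
    return tb_scanned * PRICE_PER_TB_SCANNED + gb_stored * PRICE_PER_GB_STORED

# A team running heavy unfiltered queries vs. one that prunes partitions
print(f"${estimate_monthly_cost(tb_scanned=400, gb_stored=10_000):,.2f}")  # $2,200.00
print(f"${estimate_monthly_cost(tb_scanned=40, gb_stored=10_000):,.2f}")   # $400.00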
Overview
The next time you plan to build data architecture from scratch or modify your existing architecture, use these nine principles as your compass. They will minimize the risk of making the wrong decisions.
Buy the Book: https://amzn.to/4bqnIyE