Startup tech stack

May 6, 2021 · 25 minute read

Tommy Dang

Engineering

TLDR

Your tech stack will evolve over time. We are sharing our tech stack at Mage to help other companies get started and to get feedback from the community.

Table of contents

SaaS tools we integrate into our tech stack

  • Monitoring

  • Alerting

  • Continuous integration/Continuous delivery (CI/CD)

  • Security

  • Notifications

  • Utilities

  • Payments

Services we operate internally

  • Front-end

  • Backend

  • Data/Airflow

  • API

  • AI

Infrastructure we run on

  • Vercel

  • AWS

  • Astronomer

SaaS tools we integrate into our tech stack

Monitoring

Organize your space and see all of your data on one platform. (source: FXX)

Monitoring tools help us make sure our servers are running normally and detect if anything is going wrong.

What we use: Datadog, Effx

Alternatives: New Relic

Datadog gives users greater insight into their cloud stack by bringing together data from servers, containers, and databases. The platform provides observability to better monitor performance data, infrastructure, and key performance indicators (KPIs), giving your team fast feedback and support.

We use Datadog to keep an eye on our services and infrastructure. If something breaks, chances are we’ve already got a Datadog alert in Slack about the issue. We make extensive use of Datadog’s metrics API to track everything from data import errors to cache misses. This allows us to quickly surface activity that is outside the norm and resolve issues before they cause service disruptions. In conjunction with Application Performance Monitoring (APM) and log analytics, Datadog is a core part of our stack that provides a comprehensive overview and detailed insight into all the moving parts in our evolving infrastructure. We rely on it to move fast while maintaining high availability.
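In practice we send custom metrics through Datadog’s client libraries, which under the hood emit DogStatsD datagrams over UDP to the local Datadog agent. A minimal stdlib sketch of that wire format (the metric names and tags are hypothetical examples, not our actual metrics):

```python
import socket

def format_dogstatsd(name, value, metric_type="c", tags=None):
    # DogStatsD wire format: <name>:<value>|<type>[|#<tag1>,<tag2>]
    # where type is e.g. "c" (count), "g" (gauge), or "ms" (timing).
    datagram = f"{name}:{value}|{metric_type}"
    if tags:
        datagram += "|#" + ",".join(tags)
    return datagram

def send_metric(name, value, metric_type="c", tags=None,
                host="127.0.0.1", port=8125):
    # The Datadog agent listens for DogStatsD datagrams on UDP port 8125.
    payload = format_dogstatsd(name, value, metric_type, tags)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(payload.encode("utf-8"), (host, port))
    sock.close()

# e.g. count a cache miss, tagged by service:
# send_metric("cache.miss", 1, "c", ["service:api"])
```

The client libraries add batching, buffering, and error handling on top of this, which is why we use them rather than raw sockets.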

We are currently integrating Effx for better service management and service quality. Effx allows us to catalogue all our services in one place; assign owners to contact for help; and associate documentation, runbooks, dashboards, and more with each service. All of this information is critical to operating a service effectively. Effx also provides an activity feed that aggregates all events coming from our services, such as turning on feature flags, experimentation delivery, infrastructure events, logs from other SaaS tools (e.g. Datadog, AWS CloudWatch), and more.

Alerting

Alerting tools tell your team when incidents arise and help you fix them. When problems emerge, restoring your service as quickly as possible improves customer confidence and mitigates the negative impact on your business operations.

What we use: PagerDuty, Sentry

Alternatives: Dynatrace, Rollbar

PagerDuty makes it easy to set up oncall schedules with multiple tiers and flexible escalation policies. We use PagerDuty to set up our oncall schedules on a weekly basis with primary and secondary oncalls. We create alerts based on Datadog metrics and send the notifications to PagerDuty. Once PagerDuty receives the alert notification, it’ll page the oncall person by phone call, text message, and push notification. The oncall person is responsible for troubleshooting and mitigating the issue ASAP. With PagerDuty, we can quickly get notified when latency increases, errors spike, or any metric in our system behaves abnormally.

Sentry is an error tracking tool that lets developers track and fix errors in real time.

We use Sentry because it’s super easy to set up using only a few lines of code. Sentry works in all the languages we use: JavaScript and Python. It catches all the errors across each of our services and logs a ton of metadata regarding the error: the stack trace, URL parameters if present, and more. We also set up Sentry to post errors to our internal Slack channel. This way, we can immediately investigate any unexpected errors.

CI/CD

Speed up the rate of improvements with the right CI/CD tool. (source: Tendor.com)

Continuous integration (CI) and continuous delivery (CD) enable developers to deploy software and code changes more frequently and reliably.

What we use: CircleCI

Alternatives: Jenkins, Travis, Buildkite

CircleCI is a cloud-based CI/CD tool. It allows development teams to speed up development with less risk of errors thanks to its build control, debugging, and built-in Docker support.

We use CircleCI for all our services (except our front-end service) to automatically run tests and deploy to staging and production environments. We connect CircleCI to GitHub, so depending on what branch we are pushing code to, certain tests will get run. Updates are only deployed to production when code is committed to the master branch.

We previously used Travis CI but switched to CircleCI as it allowed more control in how we build our CI/CD pipeline. For example, CircleCI allows you to set manual approvals for certain jobs in a pipeline before the next job can run, which can be useful for testing on staging environments before deploying to production. In addition, our CircleCI plan ended up costing five times less than Travis CI.
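The manual-approval gate mentioned above is expressed in CircleCI’s workflow configuration as a job of `type: approval`. A sketch (the job names `test` and `deploy-production` are hypothetical, not our actual pipeline):

```yaml
workflows:
  build-and-deploy:
    jobs:
      - test
      - hold-for-approval:
          type: approval          # pause here until a human approves in the CircleCI UI
          requires:
            - test
          filters:
            branches:
              only: master        # only gate deploys from the master branch
      - deploy-production:
          requires:
            - hold-for-approval   # runs only after the manual approval
```

Jobs downstream of the approval job stay blocked until someone clicks approve, which gives the team time to verify changes on staging.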

Stay tuned for a more in-depth article about how we implemented CI/CD at Mage.

Security

Keep your secrets a secret. (source: PARAMOUNT PICTURES)

Startups are fast moving and hyper-focused on delivering the best product to their customers. In an effort to ship products quickly, security and secrets (or digital authentication credentials) can be compromised. Using a secrets management tool gives developers the headspace to focus entirely on applications without compromising security.

What we use: Doppler

Doppler is a secrets manager that allows developers to safely work with and distribute secrets, such as environment variables, across a number of projects. Changes to secrets can be made through the Doppler UI, which synchronizes the change across all environments.

We use a shared development environment to streamline setup and collaboration. Doppler makes this easy by providing shared configurations scoped to a project and environment. In addition to development, staging, and production configurations, we use branched configs to allow individual engineers to customize their own variables that override the root configuration for the environment. Changes to the root configuration are also reflected in branched configurations.
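Because Doppler injects secrets as environment variables when an app is launched through its CLI (e.g. `doppler run -- python app.py`), application code can stay free of hard-coded credentials. A small sketch of the consuming side (the secret names are hypothetical):

```python
import os

def require_secret(name):
    # Doppler injects secrets into the process environment when the app
    # is launched via `doppler run`; fail fast with a clear error if a
    # required secret was not injected.
    value = os.environ.get(name)
    if value is None:
        raise RuntimeError(f"missing secret {name}; launch via `doppler run`")
    return value

# hypothetical usage:
# DATABASE_URL = require_secret("DATABASE_URL")
```

Failing fast at startup beats discovering a missing credential deep inside a request handler.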

Notifications

(source: Melissa Castaneda)

Notifications give your team and customers up-to-date information about your company and its products. Notifications can be a great marketing tool to re-engage users and deliver timely information to your customers.

What we use: Courier

Alternatives: Omnisend, MessageBird

Courier is a no-code API platform that helps developers add notifications across multiple channels: email, Slack, messaging apps, etc. Its editor and orchestration engine let users create on-brand, aesthetically pleasing notifications.

Without a SaaS notification service, companies are left to build their own. At Airbnb, our team was responsible for an internal service called Rookery. This service sent all types of notifications (e.g. email, text messages, push notifications) by providing an API for internal clients to call. Rookery would then transform the internal request and use 3rd party vendors (e.g. Sendgrid, Twilio, Sparkpost) to deliver the notification to users. Maintaining this service was very challenging, and adding new features to it was costly and time consuming. If a tool that could do this for us had existed early on at Airbnb, we would’ve used it instead.

Luckily, that exists today: Courier. We are integrating Courier to handle all our notifications to users, such as alerting users when their data is finished uploading, when their model is done training, or when a teammate invites them to join a workspace. We also use Courier to send email newsletters giving our users regular product updates.

Utilities

Get your customers their data quicker. (source: memegenerator.net)

Importing data can be a complicated and time consuming process. Using a data onboarding tool speeds up the integration process by allowing businesses to automate data transfers from a customer’s local computer to your servers.

What we use: Flatfile

Alternatives: Delphix

Flatfile uses AI-assisted data onboarding to learn how data should be structured and cleaned, providing quick and reliable access to your customer data.

There are several ways our users add data into Mage for training models: connect to an external service like Amplitude, stream data via API, or upload a file. When users upload a file, we use Flatfile to handle the data onboarding process. Processing files can be complicated and error prone; however, leveraging Flatfile makes it really easy for our users to upload files, fix problems in their files, review their data before uploading, and more.
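To give a sense of why file onboarding is error-prone without a tool like Flatfile: even a naive validator needs header checks, per-row type checks, and useful error reporting. A minimal stdlib sketch (the schema and column names are hypothetical):

```python
import csv
import io

REQUIRED_COLUMNS = {"id", "age", "country"}  # hypothetical schema

def validate_csv(text):
    """Return (valid_rows, errors) for a naive header and type check."""
    reader = csv.DictReader(io.StringIO(text))
    missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        return [], [f"missing columns: {sorted(missing)}"]
    rows, errors = [], []
    for line_no, row in enumerate(reader, start=2):  # header is line 1
        if not row["id"].isdigit():
            errors.append(f"line {line_no}: id must be an integer")
        else:
            rows.append(row)
    return rows, errors
```

Real uploads add encodings, delimiter sniffing, fuzzy column matching, and in-browser fixes, which is exactly the surface area Flatfile covers for us.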

Payments

When paying for SaaS, recurring payments can complicate transactions: different pricing plans, annual and monthly memberships, add-ons, etc. Through SaaS billing systems, customers first choose the subscription level they want. They then send their payment information to a secure payment gateway, which routes the funds to the merchant’s account. Finally, a subscription management platform charges your customer’s account when necessary.

What we use: Stripe, RevOps

Alternatives: Braintree

Stripe is an all-in-one payment tool that acts as the payment gateway, merchant account, and subscription manager. Stripe offers a simple setup with white-label customization and integration compatibility.

RevOps streamlines the quote-to-cash process by building platforms that help sales teams guide customers through deals.

Currently, our pricing is a subscription-based model where each company pays a fixed amount per month for a certain number of seats. This subscription provides the customer with access to our suite of tools that helps with connecting data to Mage, transforming data, building training data, creating models, predictive analytics, and more. As our pricing model becomes more complex and when we start providing options for annual contracts, we will use RevOps. They provide a really easy no-code solution for building invoices and deals. Their API is simple and intuitive to use. Their product is among the easiest and quickest to get started with while maintaining a ton of flexibility.

Services we operate internally

Front-end

Deliver magical experiences to your customers through simple and intuitive designs. (source: Wallpaperup.com)

Our front-end service is responsible for the entire customer facing experience. When our users come to Mage, they are interacting with a web app that makes API calls to our backend server to persist data and to retrieve information for users to read and update.

What we use: React, Next.js, TypeScript, hosted on Vercel (see section on Infrastructure)

Alternatives: Angular, Vue.js, GatsbyJS

React is a JavaScript library used to build websites and application interfaces. We use React to build user interfaces and components for single-page web applications. React allows us to efficiently create reusable components and has a very active community of developers. It’s also used by many large tech companies like Facebook, Instagram, and Netflix.

Next.js is a web framework built on top of React that enables several additional features like server-side rendering, built-in routing, and more. We use Next.js to optimize our code and make development faster, since code-splitting and caching pages is either done automatically or easy to implement. React Router is commonly used for routing in React apps, but Next.js can create routes for you automatically based on your project’s folder structure.

TypeScript is a superset of JavaScript that adds optional static type definitions. We use it to further enhance the development experience by catching common errors as we type code (if enabled in the IDE) and to enforce coding best practices.

Backend

(source: DesignQuote)

Our front-end service makes authenticated API calls to our backend service to fetch data or to write data to our MySQL database hosted on AWS RDS (Relational Database Service). Our backend service also handles complex business logic, such as automatically cleaning data before it’s stored in our data lake (AWS S3), loading data into memory for low latency feature fetching while making real-time predictions, and more. We also use a copy of our backend service as a background worker that subscribes to a queue on AWS SQS (Simple Queue Service) and processes asynchronous jobs such as recording experiment treatment group assignments, initiating summary statistics calculations after a feature set is ingested, and more.
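The worker pattern described above can be sketched with the standard library. In production the loop would poll AWS SQS (e.g. via boto3’s `receive_message`) rather than a local queue, and the job types here are hypothetical stand-ins for ours:

```python
import json
import queue
import threading

# Local stand-in for the SQS queue the worker subscribes to.
jobs = queue.Queue()

def handle_job(job):
    # Dispatch on job type; these types are illustrative, not our real schema.
    if job["type"] == "record_assignment":
        return f"recorded assignment for user {job['user_id']}"
    if job["type"] == "summary_stats":
        return f"computing stats for feature set {job['feature_set_id']}"
    raise ValueError(f"unknown job type: {job['type']}")

def worker(results):
    # Consume messages until a None sentinel signals shutdown.
    while True:
        message = jobs.get()
        if message is None:
            break
        results.append(handle_job(json.loads(message)))
        jobs.task_done()
```

Running the same codebase as both the web server and the queue consumer keeps business logic in one place; only the entry point differs.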

What we use: Python, Django

Alternatives: PHP, Ruby on Rails

Our backend is powered by Django. Using a framework like Django allows us to focus on Mage’s core business logic and accelerate our development process. Django was a clear choice for us because it’s written in Python, has a strong developer community, and comes with a full-featured admin interface out of the box.

We chose to use Python because our AI code is written in Python, Airflow is written in Python, and our founding team has extensive experience using Python at Airbnb and previous startups.

Data/Airflow

When developers don’t use Airflow to run their data. (source: Reddit)

We have a data repo that contains Airflow code and PySpark (Spark in Python) scripts that we run in AWS EMR (Elastic MapReduce). Our data repository is responsible for building training sets, training models, deploying models, retraining models, calculating experimentation results, performing batch predictions, and more.

What we use: Python, Airflow, Spark, Docker

Alternatives: AWS Glue, Dataform

Both our Airflow and data processing code are written in Python and heavily use the PySpark library. We chose Airflow instead of other tools like Temporal, AWS Step Functions, etc. because we were very familiar with Airflow, having worked with it for almost 6 years at Airbnb. Airflow is also written in Python, which is our company’s language of choice, and it has a proven track record for handling thousands of data pipelines. We’ve also used Airflow in the past to generate thousands of dynamic tasks; we knew this was a requirement for us when we built Mage.
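Dynamic task generation in Airflow boils down to building tasks in a loop over data-driven inputs. In a real DAG each entry would be an Operator (e.g. a `PythonOperator`) registered on the DAG object; plain dicts stand in here so the pattern is visible on its own, and the feature set names are hypothetical:

```python
# Sketch of the dynamic task generation pattern used in Airflow DAGs.
FEATURE_SETS = ["user_profiles", "purchases", "page_views"]  # hypothetical

def build_tasks(feature_sets):
    tasks = []
    for name in feature_sets:
        ingest_id = f"ingest_{name}"
        tasks.append({"task_id": ingest_id, "upstream": None})
        # each summarize task depends on its matching ingest task
        tasks.append({"task_id": f"summarize_{name}", "upstream": ingest_id})
    return tasks
```

Because the loop runs at DAG parse time, adding a new feature set automatically yields its ingest and summarize tasks with no pipeline code changes.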

For more information on our Airflow setup, check out our recent article on “Migrating Airflow from Amazon ECS to Astronomer”.

API

(source: Devpost.com)

Once a product developer connects their data to Mage and trains a model without writing any code, there are 3 ways for them to get predictions from their model:

  1. Download the predictions to a file

  2. Automatically export the predictions to a data warehouse (e.g. AWS Redshift) on a recurring basis

  3. Make real-time predictions via API request

When our users choose option 3, they are making API calls to our API service. This service is responsible for handling all public facing functionality outside our front-end service. Here is a sample API request:

curl \
  --request POST 'https://api.mage.ai/v1/predict' \
  --header 'Content-Type: application/json' \
  --data-raw \
'{
  "api_key": "...",
  "model": "email_unsubscribe",
  "features": [
    {
      "user": {
        "id": 612
      }
    }
  ]
}'

Here is a sample response from the API service:

[
  {
    "model_used": true,
    "prediction": "yes",
    "uuid": "..."
  }
]

There are 4 ways for our users to connect their data to Mage:

  1. Upload a file

  2. Connect Mage to a 3rd party API like Segment or Amplitude

  3. Connect Mage to a data warehouse (e.g. AWS Redshift, GCP BigQuery) or data lake (e.g. Snowflake)

  4. Stream data to Mage via API request

When our users choose option 4, they are making API calls to our API service. Here is a sample API request:

curl \
  --request POST 'https://api.mage.ai/v1/stream' \
  --header 'Content-Type: application/json' \
  --data-raw \
'{
  "api_key": "...",
  "feature_set": "user/profile_attributes/5",
  "features": [
    {
      "age": "[...]",
      "country": "[...]",
      "date_joined": "[...]",
      "gender": "[...]",
      "id": "[...]",
      "purchases": "[...]",
      "words_in_description": "[...]"
    }
  ]
}'

Here is a sample response from the API service:

{
  "success": 1
}

What we use: Python, Flask, Docker, Insomnia

Alternatives: CherryPy (server)

Flask is a lightweight Python web framework that gives developers control over how they access data. Its microframework design makes for an easier learning curve and better integration with applications and platforms.

Docker makes creating and running applications easier because all the dependencies and libraries needed to run the application are bundled into the Docker image. The resulting image runs on any server that can run Docker.

Our API service is powered by a Flask app deployed in AWS ECS (Elastic Container Service). This allows us to quickly prototype new functionality and scale as needed. The Insomnia API client is an important part of our development process — we use it to formulate API requests and test endpoints. We use the built-in sharing functionality to keep things in sync between team members.

AI

We have a repository on GitHub containing our AI code pipeline. This repository also includes a lot of our data preparation pipelines, data cleaning pipelines, feature engineering pipelines, model training, evaluation, and model explainability functionality.

What we use: Python

Alternatives: C++, Scala

Python has a vast set of libraries specific to AI development: Keras, TensorFlow, PyTorch, scikit-learn, and many more, which can increase productivity by not re-inventing the wheel. We use Jupyter Notebooks to quickly prototype new state-of-the-art machine learning models and techniques that we can then make available to all our users.

We build our data preparation libraries and training pipelines on top of common Python libraries, including NumPy, Pandas, scikit-learn, XGBoost, etc. These libraries provide clean and simple interfaces for complex data operations and machine learning algorithms.

We use Dask to run computationally expensive and time consuming operations across entire data sets in parallel. In a future blog post, we will share how we use Dask and a guide to becoming proficient quickly.

In our CI/CD build and deploy pipeline, we build a Docker image from our AI code. We use this image to train models and as the base image for our API service, which ensures code consistency across our various services.
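The shared-base-image idea can be sketched in a Dockerfile. The image names and paths below are hypothetical placeholders, not our actual build configuration:

```dockerfile
# The AI image built in CI doubles as the base image for the API service,
# so both run the exact same AI code and dependencies.
FROM mage/ai-code:latest

# Layer the API service on top of the shared AI base.
COPY api/ /app/api
WORKDIR /app/api
CMD ["python", "server.py"]
```

Pinning the API service to the AI image produced by the same build removes a whole class of "works in training, breaks at serving time" version-skew bugs.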

Infrastructure

Infrastructure platforms are the building blocks for development. They provide access to the virtual hardware and data storage upon which applications can be built.

What we use: Vercel, AWS, Astronomer

Vercel is a platform for front-end developers to host websites and web applications with automated deployments. Next.js apps integrate well with Vercel.

Vercel comes with automatic CI/CD built in. Once you push a commit to a remote branch or merge into master, Vercel will start a build process and deploy your web app. Vercel integrates with GitHub, as well as GitLab and Bitbucket, and also has staging (preview) deployments for non-master branches. Being able to immediately check your changes after committing code is very helpful in making sure the changes you made in your development environment match what is on production or staging.

Amazon Web Services (AWS) provides all the cloud infrastructure required to run almost any application. AWS offers a wide variety of services; here are the ones we use:

  • ElasticBeanstalk/EC2: our Backend service runs on this

  • ECS: We host our API service in ECS. Customers can make public requests to the API service to get prediction results or stream features to Mage.

  • Route53: We use Route53 to manage traffic routing policies.

  • ECR: We store the versioned service Docker images and training Docker images in ECR.

  • S3: We store feature sets, training sets, models, etc. here.

  • RDS: We use MySQL to store relational data models.

  • ElastiCache (Redis): We use ElastiCache to cache data in memory for fast access.

  • DynamoDB: We log experiment assignments, prediction results and streamed features in DynamoDB.

  • EMR/Spark: We run Spark jobs on EMR clusters for big data processing.

  • SQS: We use SQS queues to perform asynchronous operations and message handling.

  • Redshift: We currently use this as our data warehouse. It’s not the best, but it was fast to get set up. We’re open to suggestions.

We will be sharing more details on our AWS infrastructure setup in a future article, so stay tuned.

Astronomer is a cloud-agnostic Airflow management platform. It gives you the visibility and control to maintain as many Airflow environments as your company needs. We use Astronomer Cloud to manage our Airflow deployments. Read more about how Mage utilizes Airflow here.

Closing thoughts

Your tech stack will evolve over time, so be open to rapid changes. However, spending some time upfront to choose the technologies and software you use will save you time and headaches in the near future. Get feedback from your peers and read what others do to collect more information. Please send us your thoughts, feedback, questions, or comments at eng@mage.ai. We’d love to hear from you!

Hang out with us

Join our community and chat about startups, AI/ML, and product development.

Like what you see? Join the team.

Mage is making AI and ML accessible to product developers. Join us and build beautiful and intuitive devtools.

Want to give us feedback or ask questions?

Please chat with us live by joining our Discord channel or send us an email.