<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:media="http://search.yahoo.com/mrss/"><channel><title><![CDATA[Willis Tech]]></title><description><![CDATA[Full Stack & Cloud]]></description><link>https://www.willischou.com/</link><image><url>https://www.willischou.com/favicon.png</url><title>Willis Tech</title><link>https://www.willischou.com/</link></image><generator>Ghost 2.9</generator><lastBuildDate>Sun, 31 Mar 2024 10:18:40 GMT</lastBuildDate><atom:link href="https://www.willischou.com/rss/" rel="self" type="application/rss+xml"/><ttl>60</ttl><item><title><![CDATA[Serverless Framework with custom authoriser locally (episode 2)]]></title><description><![CDATA[Develop API Gateway with Custom Authoriser locally - Serverless Framework]]></description><link>https://www.willischou.com/aws-serverless-using-serverless-framework-with-custom-authoriser-locally/</link><guid isPermaLink="false">Ghost__Post__65c7c3c4f3784850027031d7</guid><category><![CDATA[Serverless]]></category><category><![CDATA[Serverless Framework]]></category><category><![CDATA[API Gateway]]></category><category><![CDATA[Lambda]]></category><dc:creator><![CDATA[Willis]]></dc:creator><pubDate>Sun, 11 Feb 2024 00:55:19 GMT</pubDate><media:content url="https://www.willischou.com/images/2024/02/serverless-framework-cover.png" medium="image"/><content:encoded><![CDATA[<h1 id="background">Background</h1><img src="https://www.willischou.com/images/2024/02/serverless-framework-cover.png" alt="Serverless Framework with custom authoriser locally (episode 2)"/><p>Booyakasha! &#x1F91C;</p><p>This is the second episode of the AWS Serverless series. 
In the <a href="https://www.willischou.com/aws-serverless-using-sam-with-custom-authoriser-locally/?ref=willis-tech">first episode</a>, we walked through an example of AWS SAM and a custom authorizer.</p><p>Today, we are going to explore another approach - Serverless Framework!<br>Serverless Framework has <a href="https://www.serverless.com/plugins/serverless-offline?ref=willis-tech">serverless-offline</a> and <a href="https://www.serverless.com/plugins/serverless-esbuild?ref=willis-tech">serverless-esbuild</a> plugins, which help you write Lambdas in TypeScript and run them with a custom authoriser locally.</br></p><h1 id="tldr">TL;DR</h1><ul><li>Using the Serverless Framework with the <a href="https://www.serverless.com/plugins/serverless-offline?ref=willis-tech">serverless-offline</a> plugin, you can simulate Lambda and API Gateway locally.</li><li>Using Serverless (version 1.72.0 or later), you can write Infrastructure as Code in TypeScript.</li><li>Grab the working example here<br><a href="https://github.com/Willis0826/serverless-local-authorizer-example?ref=willis-tech">https://github.com/Willis0826/serverless-local-authorizer-example</a> (Using TypeScript)</br></li></ul><h1 id="lets-walk-through-it">Let&apos;s walk through it</h1><p>We are going to implement a <code>/ping</code> API behind an API Gateway and Custom Authoriser using TypeScript.</p><p>The architecture diagram is as follows:</p><!--kg-card-begin: markdown--><p><img src="https://www.willischou.com/images/2023/08/sam-api-gateway-custom-authoriser-example.drawio.png" alt="Serverless Framework with custom authoriser locally (episode 2)" loading="lazy"/></p>
<!--kg-card-end: markdown--><p>If you have read the <a href="https://www.willischou.com/aws-serverless-using-sam-with-custom-authoriser-locally/?ref=willis-tech">first episode</a>, you will notice that we are doing the same thing here, but with the Serverless Framework.</p><p>We have five steps to implement for this:</p><ol><li>Setting Up Serverless Plugins</li><li>Creating a Basic API Gateway</li><li>Creating a Ping Lambda</li><li>Creating an Authoriser Lambda</li><li>Verifying the Result</li></ol><h2 id="setting-up-serverless-plugins">Setting Up Serverless Plugins</h2><p>We are going to use the following two plugins to create a seamless development experience:</p><ul><li><a href="https://www.serverless.com/plugins/serverless-offline?ref=willis-tech">serverless-offline</a> - Simulate API Gateway and Lambdas locally</li><li><a href="https://www.serverless.com/plugins/serverless-esbuild?ref=willis-tech">serverless-esbuild</a> - Compile TypeScript to JavaScript</li></ul><p>You can use the serverless CLI to generate a new project. However, I recommend cloning the repository <a href="https://github.com/Willis0826/serverless-local-authorizer-example?ref=willis-tech">serverless-local-authorizer-example</a> to get a workable example.</p><p>Once you have a Serverless project, you can install these packages.</p><pre><code class="language-Shell">$ npm install --save-dev @serverless/typescript 
$ npm install --save-dev serverless-esbuild
$ npm install --save-dev serverless-offline
$ npm install --save-dev ts-node tsconfig-paths typescript</code></pre><p>All set, you can now create a <code>{project_root_path}/serverless.ts</code> file with the following content.</p><pre><code class="language-TypeScript">import type { AWS } from &apos;@serverless/typescript&apos;;

const serverlessConfiguration: AWS = {
  service: &apos;api&apos;,
  frameworkVersion: &apos;3&apos;,
  plugins: [&apos;serverless-esbuild&apos;, &apos;serverless-offline&apos;],
  provider: {},
  functions: {},
  package: { individually: true },
  custom: {
    esbuild: {
      bundle: true,
      minify: true,
      sourcemap: true,
      exclude: [&apos;aws-sdk&apos;],
      target: &apos;node20&apos;,
      define: { &apos;require.resolve&apos;: undefined },
      platform: &apos;node&apos;,
      concurrency: 10,
    },
  },
};

module.exports = serverlessConfiguration;
</code></pre><h2 id="creating-a-basic-api-gateway">Creating a Basic API Gateway</h2><p>Let&apos;s define a very simple API Gateway and some general settings in the <code>provider</code> field of <code>{project_root_path}/serverless.ts</code>.</p><pre><code>...

const serverlessConfiguration: AWS = {
  service: &apos;api&apos;,
  frameworkVersion: &apos;3&apos;,
  plugins: [ ... ],
  provider: {
    name: &apos;aws&apos;,
    region: &quot;ap-southeast-1&quot;,
    runtime: &apos;nodejs20.x&apos;,
    apiGateway: {
      shouldStartNameWithService: true,
    },
    environment: {},
  },
  functions: {},
  package: { ... },
  custom: {
    ...
  },
};

...</code></pre><h2 id="creating-a-ping-lambda">Creating a Ping Lambda</h2><p>Now, we are going to add the first Ping Lambda to this API Gateway.</p><p>We need the following three files for the Ping Lambda.</p><ul><li><code>src/functions/ping/handler.ts</code> a handler that returns 200 and a <code>pong</code> message.</li></ul><pre><code class="language-TypeScript">import { APIGatewayProxyHandler } from &apos;aws-lambda&apos;;

export const lambdaHandler: APIGatewayProxyHandler = async (_event, _context) =&gt; {
  return {
    statusCode: 200,
    body: &quot;pong&quot;,
  }
};</code></pre><ul><li><code>src/functions/ping/index.ts</code> a serverless definition for the Lambda.</li></ul><pre><code class="language-TypeScript">import { handlerPath } from &apos;@libs/handler-resolver&apos;;

export default {
  handler: `${handlerPath(__dirname)}/handler.lambdaHandler`,
  events: [
    {
      http: {
        method: &apos;get&apos;,
        path: &apos;ping&apos;,
        authorizer: &apos;authorizer&apos;,
      },
    },
  ],
};</code></pre><ul><li><code>src/libs/handler-resolver.ts</code> a helper function to get the current path.</li></ul><pre><code class="language-TypeScript">export const handlerPath = (context: string) =&gt; {
  return `${context.split(process.cwd())[1].substring(1).replace(/\\/g, &apos;/&apos;)}`
};</code></pre><p>Now, we&apos;ve created the Ping Lambda and set the authorizer to a function named <code>authorizer</code>.</p><p>Next, we need to reference this Lambda in <code>serverless.ts</code>.</p><pre><code class="language-TypeScript">import type { AWS } from &apos;@serverless/typescript&apos;;

import ping from &apos;@functions/ping&apos;;

...
const serverlessConfiguration: AWS = {
  ...
  functions: {
    ping,
  },
  ...
}
...</code></pre><p>You may have noticed that we use <code>@libs</code> and <code>@serverless</code> while importing stuff. Please follow the <a href="https://github.com/Willis0826/serverless-local-authorizer-example/blob/main/tsconfig.json?ref=willis-tech">tsconfig.json</a> configuration if you fancy this setup.</p><h2 id="creating-an-authoriser-lambda">Creating an Authoriser Lambda</h2><p>Alright, we are almost there!</p><p>We can create an authoriser that does nothing but return a policy allowing the <code>GET/ping</code> request.</p><ul><li><code>src/functions/authorizer/handler.ts</code> a handler that acts as a Custom Authoriser.</li></ul><pre><code class="language-TypeScript">import { APIGatewayTokenAuthorizerHandler, APIGatewayAuthorizerResult } from &apos;aws-lambda&apos;;

export const lambdaHandler: APIGatewayTokenAuthorizerHandler = async (_event, _context) =&gt; {
  return generateAdminPolicy();
};

const generateAdminPolicy = () =&gt; {
  const authResponse: APIGatewayAuthorizerResult = {
    principalId: `systemadmin`, // you can assign principalId if you want
    policyDocument: {
      Version: &apos;2012-10-17&apos;,
      Statement: [
        {
          Action: &apos;execute-api:Invoke&apos;,
          Effect: &apos;Allow&apos;,
          Resource: &apos;arn:aws:execute-api:*:*:*/*/GET/ping&apos;, // allow access GET/ping
        },
      ],
    },
  };
  return authResponse;
}
</code></pre><ul><li><code>src/functions/authorizer/index.ts</code> a serverless definition for the Lambda.</li></ul><pre><code class="language-TypeScript">import { handlerPath } from &apos;@libs/handler-resolver&apos;;

export default {
  handler: `${handlerPath(__dirname)}/handler.lambdaHandler`,
};
</code></pre><p>Don&apos;t forget to reference this Authorizer Lambda in the <code>serverless.ts</code>.</p><pre><code class="language-TypeScript">import type { AWS } from &apos;@serverless/typescript&apos;;

import authorizer from &apos;@functions/authorizer&apos;;
import ping from &apos;@functions/ping&apos;;

...
const serverlessConfiguration: AWS = {
  ...
  functions: {
    authorizer,
    ping,
  },
  ...
}
...</code></pre><p>Wicked! You have set up everything. &#x270C;&#xFE0F;</p><h2 id="verifying-the-result">Verifying the Result</h2><p>Let&apos;s run the following command to spin up the API Gateway and Lambda locally:</p><pre><code class="language-Shell"># If you haven&apos;t got the Serverless CLI installed
npm install -g serverless

serverless offline start</code></pre><!--kg-card-begin: markdown--><p><img src="https://www.willischou.com/images/2024/02/serverless-offline.png" alt="Serverless Framework with custom authoriser locally (episode 2)" loading="lazy"/></p>
<!--kg-card-end: markdown--><p>Your API Gateway is listening on port 3000 with the path prefix <code>/dev</code>. Use the following command to send a request and see it in action.</p><pre><code class="language-Shell">curl -H &quot;Authorization: abc&quot; http://127.0.0.1:3000/dev/ping</code></pre><!--kg-card-begin: markdown--><p><img src="https://www.willischou.com/images/2024/02/serverless-offline-response.png" alt="Serverless Framework with custom authoriser locally (episode 2)" loading="lazy"/></p>
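<p>The example authoriser above allows every request. For local experiments you may want it to actually reject bad tokens. A minimal sketch (not part of the example repository; the expected token value and the local result type are illustrative) that keys the policy effect off <code>event.authorizationToken</code>:</p>

```typescript
// Hedged sketch: a TOKEN authorizer receives the Authorization header as
// event.authorizationToken. Here we allow only a hypothetical token "abc"
// (matching the curl example above) and deny everything else.
type AuthorizerResult = {
  principalId: string;
  policyDocument: {
    Version: string;
    Statement: { Action: string; Effect: 'Allow' | 'Deny'; Resource: string }[];
  };
};

const buildPolicy = (effect: 'Allow' | 'Deny'): AuthorizerResult => ({
  principalId: 'systemadmin',
  policyDocument: {
    Version: '2012-10-17',
    Statement: [
      {
        Action: 'execute-api:Invoke',
        Effect: effect,
        Resource: 'arn:aws:execute-api:*:*:*/*/GET/ping',
      },
    ],
  },
});

// Decide based on the raw token; a real authoriser would verify a signature
// or look the token up instead of comparing against a constant.
const decide = (authorizationToken?: string): AuthorizerResult =>
  buildPolicy(authorizationToken === 'abc' ? 'Allow' : 'Deny');
```

<p>With a check like this in place, the <code>curl</code> request above should still succeed, while any other token should come back as forbidden from the simulated gateway.</p>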
<!--kg-card-end: markdown--><p>Cheers! &#x1F37A; You can develop AWS Lambda locally now.</p><h1 id="conclusion">Conclusion</h1><p>We have now covered AWS SAM and Serverless Framework. Both tools support local development and deployment. The remaining question is, how do you choose between these two options?</p><p>Personally, I select the options based on these conditions.</p><p>&#x1F31F; Serverless Framework</p><ul><li>Fast local development experience.</li><li>Ability to write infrastructure as code in TypeScript.</li><li>Support for other cloud providers.</li></ul><p>&#x1F31F; AWS SAM</p><ul><li>Comfort with CloudFormation.</li><li>Focus on AWS and the ability to configure detailed settings.</li></ul>]]></content:encoded></item><item><title><![CDATA[SAM with custom authoriser locally (episode 1)]]></title><description><![CDATA[Develop API Gateway with Custom Authoriser locally - AWS SAM]]></description><link>https://www.willischou.com/aws-serverless-using-sam-with-custom-authoriser-locally/</link><guid isPermaLink="false">Ghost__Post__64d58b76f74a6d7912440a06</guid><category><![CDATA[Serverless]]></category><category><![CDATA[AWS SAM]]></category><category><![CDATA[API Gateway]]></category><category><![CDATA[Lambda]]></category><dc:creator><![CDATA[Willis]]></dc:creator><pubDate>Fri, 11 Aug 2023 02:04:22 GMT</pubDate><media:content url="https://www.willischou.com/images/2024/02/aws-sam-cover.png" medium="image"/><content:encoded><![CDATA[<h1 id="background">Background</h1><img src="https://www.willischou.com/images/2024/02/aws-sam-cover.png" alt="SAM with custom authoriser locally (episode 1)"/><p>When you want to develop an AWS Serverless application, you may have heard of this powerful tool - <a href="https://github.com/aws/aws-sam-cli?ref=willis-tech">SAM CLI</a> (Serverless Application Model). You can use SAM to create a classic serverless application (including API Gateway and Lambda) on AWS. 
It can also simulate the AWS environment locally.</p><p>Recently, SAM CLI <a href="https://github.com/aws/aws-sam-cli/issues/137?ref=willis-tech">added a useful functionality to support simulating API Gateway Custom Authoriser locally</a>. Before this feature was implemented, it could be painful to develop an API Gateway with Custom Authoriser. This post will walk you through an example, and help you use this new feature with ease!</p><h1 id="tldr">TL;DR</h1><ul><li>Grab the working example and play around here <a href="https://github.com/Willis0826/sam-local-authorizer-example?ref=willis-tech">https://github.com/Willis0826/sam-local-authorizer-example</a> (Using TypeScript)</li></ul><h1 id="lets-walk-through-it">Let&apos;s walk through it</h1><p>We are going to create an API called <code>Ping</code> with TypeScript. When you invoke this API with the path <code>/ping</code>, your request goes through a custom authoriser called <code>Authorizer</code>. If the custom authoriser returns a valid response, your request can reach the Lambda of the <code>Ping</code> API.</p><p>The architecture diagram is as follows:</p><!--kg-card-begin: markdown--><p><img src="https://www.willischou.com/images/2023/08/sam-api-gateway-custom-authoriser-example.drawio.png" alt="SAM with custom authoriser locally (episode 1)" loading="lazy"/></p>
<!--kg-card-end: markdown--><p>We have four steps to implement this example:</p><ol><li>Creating a Basic API Gateway</li><li>Creating a Ping Lambda</li><li>Creating an Authoriser Lambda</li><li>Verifying the Result</li></ol><h2 id="creating-a-basic-api-gateway">Creating a Basic API Gateway</h2><p>You can create a brand new project with <code>sam init</code> and remove the stuff you don&apos;t need. Or, if you want a minimal working project with TypeScript, you can clone <a href="https://github.com/Willis0826/sam-local-authorizer-example?ref=willis-tech">sam-local-authorizer-example</a>.</p><p>Once you have a SAM project, the first thing you need to do is define an API Gateway in <code>{project_root_path}/template.yaml</code>.</p><pre><code class="language-YAML">AWSTemplateFormatVersion: 2010-09-09
Description: &gt;-
  sam-local-authorizer-example
Transform:
- AWS::Serverless-2016-10-31

Resources:
  # API Gateway
  ApiGateway:
    Type: AWS::Serverless::Api
    Properties:
      StageName: dev
      Auth:
        # CORS setting
        AddDefaultAuthorizerToCorsPreflight: false
        ResourcePolicy:
          CustomStatements: [
              {
                &quot;Effect&quot;: &quot;Allow&quot;,
                &quot;Principal&quot;: &quot;*&quot;,
                &quot;Action&quot;: &quot;execute-api:Invoke&quot;,
                &quot;Resource&quot;: &quot;execute-api:/*/OPTIONS/*&quot;,
              },
            ]</code></pre><h2 id="creating-a-ping-lambda">Creating a Ping Lambda</h2><p>We are going to create a Ping Lambda which responds with <code>pong</code> when it&apos;s invoked. Also, we are going to explore how to develop Lambdas with TypeScript.</p><p>Let&apos;s add the Ping Lambda first!</p><p>Create a new file <code>{project_root_path}/src/handlers/ping.ts</code> with the following code.</p><pre><code class="language-TypeScript">import { APIGatewayProxyHandler } from &apos;aws-lambda&apos;;

export const lambdaHandler: APIGatewayProxyHandler = async (event, context) =&gt; {
    return {
        statusCode: 200,
        body: &quot;pong&quot;,
    }
};</code></pre><p>In order to expose the Ping Lambda we created, we need to create another file <code>{project_root_path}/src/app.ts</code> with the following code.</p><pre><code class="language-TypeScript">import { lambdaHandler as PingHandler } from &apos;./handlers/ping&apos;;


export {
    PingHandler,
}</code></pre><p>Now we have a function ready to handle the Lambda invocation event. Since this is a TypeScript Lambda, we also need a <code>{project_root_path}/package.json</code> file. The following code is an example of <code>package.json</code>.</p><pre><code class="language-JSON">{
  &quot;name&quot;: &quot;sam_local_authorizer_example&quot;,
  &quot;version&quot;: &quot;1.0.0&quot;,
  &quot;description&quot;: &quot;lambda&quot;,
  &quot;main&quot;: &quot;app.js&quot;,
  &quot;repository&quot;: &quot;&quot;,
  &quot;author&quot;: &quot;&quot;,
  &quot;license&quot;: &quot;MIT&quot;,
  &quot;dependencies&quot;: {},
  &quot;scripts&quot;: {
    &quot;build&quot;: &quot;sam build&quot;,
    &quot;deploy&quot;: &quot;sam deploy&quot;
  },
  &quot;devDependencies&quot;: {
    &quot;@tsconfig/node18&quot;: &quot;^1.0.0&quot;,
    &quot;@types/aws-lambda&quot;: &quot;^8.10.73&quot;,
    &quot;@types/node&quot;: &quot;^18.0.0&quot;,
    &quot;typescript&quot;: &quot;^4.2.3&quot;
  }
}</code></pre><p>Finally, let&apos;s define this Ping Lambda by adding a new <code>AWS::Serverless::Function</code> resource under <code>Resources</code> section in <code>{project_root_path}/template.yaml</code>.</p><pre><code class="language-YAML">... Others
Resources:
  ... API Gateway
  # Protected API Lambda
  Ping:
    Type: AWS::Serverless::Function
    Metadata:
      BuildMethod: esbuild
      BuildProperties:
        Minify: true
        Target: es2020
        SourceMap: false
        External:
          - node_modules
        EntryPoints:
          - src/app.ts
    Properties:
      Runtime: nodejs18.x
      CodeUri: ./
      Handler: app.PingHandler
      Events:
        Api:
          Type: Api
          Properties:
            Auth:
              ApiKeyRequired: true
            RestApiId:
              Ref: ApiGateway
            Path: /ping
            Method: get</code></pre><p>As you can see, the <code>Ping</code> resource has a <code>Metadata</code> section that tells SAM to use <code>esbuild</code> for your Lambda. Before you run <code>sam build</code>, make sure you have <code>esbuild</code> installed.</p><p>You can use the following command to install <code>esbuild</code> and try to build the Ping Lambda.</p><pre><code class="language-Shell">npm install -g esbuild
sam build</code></pre><p>Now, you should be able to see a new file <code>{project_root_path}/.aws-sam/build/Ping/app.js</code> containing the transpiled JavaScript.</p><h2 id="creating-an-authoriser-lambda">Creating an Authoriser Lambda</h2><p>We are going to create a simple Authoriser Lambda in <code>{project_root_path}/src/handlers/authorizer.ts</code>. This Lambda allows all requests to access the <code>GET/ping</code> API.</p><pre><code class="language-TypeScript">import { APIGatewayTokenAuthorizerHandler, APIGatewayAuthorizerResult } from &apos;aws-lambda&apos;;

export const lambdaHandler: APIGatewayTokenAuthorizerHandler = async (event, context) =&gt; {
  return generateAdminPolicy();
};

const generateAdminPolicy = () =&gt; {
  const authResponse: APIGatewayAuthorizerResult = {
    principalId: `systemadmin`,
    policyDocument: {
      Version: &apos;2012-10-17&apos;,
      Statement: [
        {
          Action: &apos;execute-api:Invoke&apos;,
          Effect: &apos;Allow&apos;,
          Resource: &apos;arn:aws:execute-api:*:*:*/*/GET/ping&apos;,
        },
      ],
    },
  };
  return authResponse;
}
</code></pre><p>After we created the Authoriser Lambda, we need to expose this Lambda in <code>{project_root_path}/src/app.ts</code> as well.</p><pre><code class="language-TypeScript">import { lambdaHandler as AuthorizerHandler } from &apos;./handlers/authorizer&apos;;
import { lambdaHandler as PingHandler } from &apos;./handlers/ping&apos;;


export {
    AuthorizerHandler,
    PingHandler,
}</code></pre><p>We are almost there! &#x1F6A9;</p><p>Let&apos;s define the Authoriser Lambda in <code>{project_root_path}/template.yaml</code>. In order to make SAM work with the Authoriser Lambda locally, the definition in <code>template.yaml</code> is crucial.</p><p>We need to add two attributes, <code>DefaultAuthorizer</code> and <code>Authorizers</code>, to the <code>AWS::Serverless::Api</code> resource.</p><p>Then, we need to define the <code>Authorizer</code> Lambda under the <code>Resources</code> section.</p><pre><code class="language-YAML">... Others
Resources:
  # API Gateway
  ApiGateway:
    Type: AWS::Serverless::Api
    Properties:
      StageName: dev
      Auth:
        ... Others
        DefaultAuthorizer: LambdaTokenAuthorizer
        Authorizers:
          LambdaTokenAuthorizer:
            FunctionPayloadType: TOKEN
            FunctionArn: !GetAtt Authorizer.Arn
            Identity:
              Header: Authorization
              ReauthorizeEvery: 300
  # Auth Lambda
  Authorizer:
    Type: AWS::Serverless::Function
    Metadata:
      BuildMethod: esbuild
      BuildProperties:
        Minify: true
        Target: es2020
        SourceMap: false
        External:
          - node_modules
        EntryPoints:
          - src/app.ts
    Properties:
      Runtime: nodejs18.x
      CodeUri: ./
      Handler: app.AuthorizerHandler
      Timeout: 5
   ... Ping Lambda</code></pre><p>Done!&#x1F389; You are all set. Let&apos;s jump to the next step to verify the result.</p><h2 id="verifying-the-result">Verifying the Result</h2><p>Now, you can run the following command to simulate API Gateway and Lambda Authoriser locally.</p><pre><code class="language-Shell">sam build
sam local start-api
</code></pre><!--kg-card-begin: markdown--><p><img src="https://www.willischou.com/images/2023/08/Screenshot-2023-08-11-220411.png" alt="SAM with custom authoriser locally (episode 1)" loading="lazy"/></p>
<!--kg-card-end: markdown--><p>Your API Gateway is listening on port 3000. Using the following command to send a request and see it in action.</p><pre><code class="language-Shell">curl -H &quot;Authorization: abc&quot; http://127.0.0.1:3000/ping</code></pre><!--kg-card-begin: markdown--><p><img src="https://www.willischou.com/images/2023/08/Screenshot-2023-08-11-220803.png" alt="SAM with custom authoriser locally (episode 1)" loading="lazy"/></p>
<!--kg-card-end: markdown--><p>Cheers! &#x1F37A; That&apos;s all. Enjoy the API Gateway and Custom Authoriser locally.</p><h1 id="what-is-next">What Is Next?</h1><p>Check the <a href="https://www.willischou.com/aws-serverless-using-serverless-framework-with-custom-authoriser-locally/?ref=willis-tech">AWS Serverless - Using Serverless Framework with custom authoriser locally (episode 2)</a> to find out another amazing tool.</p>]]></content:encoded></item><item><title><![CDATA[AWS Serverless - Feature Flags]]></title><description><![CDATA[Implement Feature flags with AWS serverless architecture]]></description><link>https://www.willischou.com/feature-flags/</link><guid isPermaLink="false">Ghost__Post__64c634d349177b6ceb270808</guid><category><![CDATA[AWS]]></category><category><![CDATA[DynamoDB]]></category><category><![CDATA[API Gateway]]></category><category><![CDATA[Serverless]]></category><dc:creator><![CDATA[Willis]]></dc:creator><pubDate>Sun, 30 Jul 2023 21:43:38 GMT</pubDate><media:content url="https://www.willischou.com/images/2023/08/ff-cover.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><h2 id="background-of-this-project">Background of this project</h2>
<img src="https://www.willischou.com/images/2023/08/ff-cover.jpg" alt="AWS Serverless - Feature Flags"/><p>Lately, I was implementing feature flag functionality to let us selectively deliver new features to users. The whole system uses an AWS serverless architecture, including API Gateway, Lambda, and DynamoDB.</p>
<p>I assume you are familiar with the <a href="https://docs.aws.amazon.com/apigateway/latest/developerguide/apigateway-use-lambda-authorizer.html?ref=willis-tech">API Gateway custom authoriser</a> and <a href="https://aws.amazon.com/pm/dynamodb/?trk=8576349e-b904-471e-ad2e-136548085999&amp;sc_channel=ps&amp;ef_id=CjwKCAjwlJimBhAsEiwA1hrp5mm5UeDBuTrB5Q0isTo_6QTxcW0yiXNUcMuO2RNVBJR6ZTZM3AKN9xoCAQ0QAvD_BwE%3AG%3As&amp;s_kwcid=AL%214422%213%21639556471883%21e%21%21g%21%21aws+dynamodb%2119153973926%21149787895568&amp;ref=willis-tech">DynamoDB</a>.</p>
<p>There are many ways to implement feature flags; we explore two approaches in this article and compare them.</p>
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><h2 id="solutions">Solutions</h2>
<p>We have two approaches to implementing feature flags. Let&apos;s start with the easiest one.</p>
<h3 id="1-retrieve-and-verify-in-custom-authroiser">1. Retrieve and verify in the custom authoriser</h3>
<p>The simplest solution is to retrieve and verify the user&apos;s feature flags in the custom authoriser. With this approach, each time a user accesses an API, we validate the request against the latest feature flags. That means when we turn on a feature flag for a user, that user can access the new feature instantly.</p>
<p>The architecture is as follows:<br>
<img src="https://www.willischou.com/images/2023/07/ff-general.png" alt="AWS Serverless - Feature Flags" loading="lazy"/></br></p>
<p>Pros:</p>
<ul>
<li>Turning feature flags on or off always takes effect instantly.</li>
<li>The implementation is straightforward.</li>
</ul>
<p>Cons:</p>
<ul>
<li>Retrieving feature flags adds extra latency in the custom authoriser.</li>
<li>The latency in the custom authoriser impacts all APIs.</li>
</ul>
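<p>The first approach can be sketched as follows. The <code>FlagStore</code> callback stands in for a DynamoDB query keyed on <code>user_id</code> (the function and flag names here are illustrative, not from the production system); because the authoriser awaits it on every request, this lookup is exactly where the extra latency comes from.</p>

```typescript
// Hedged sketch of approach 1: fetch the user's flags on every request and
// decide inside the custom authoriser. The store is abstracted so the same
// logic works against DynamoDB or an in-memory map.
type FlagStore = (userId: string) => Promise<Record<string, boolean>>;

const authorize = async (
  userId: string,
  feature: string,
  fetchFlags: FlagStore, // e.g. a DynamoDB Query on the user_id partition key
): Promise<'Allow' | 'Deny'> => {
  const flags = await fetchFlags(userId); // per-request round trip = added latency
  return flags[feature] === true ? 'Allow' : 'Deny';
};
```
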
<h3 id="2-retrieve-from-jwt-and-verify-in-customer-authroiser">2. Retrieve from JWT and verify in the custom authoriser</h3>
<p>Wicked! This is the most interesting part. In order to overcome the extra latency in the custom authoriser, we embed the feature flags in the JWT that users receive when they log in. When a user accesses an API, we validate the feature flags in the JWT. By doing so, we only need to retrieve the feature flags from the database when a user logs in, which removes the extra latency from the custom authoriser. The disadvantage is that if we turn a feature flag on or off and want it to take effect immediately, we need to invalidate the user&apos;s JWT.</p>
<p>The architecture is as follows:<br>
<img src="https://www.willischou.com/images/2023/07/ff-with-jwt.png" alt="AWS Serverless - Feature Flags" loading="lazy"><br>
Pros:</br></img></br></p>
<ul>
<li>No extra latency in the custom authoriser.</li>
<li>Less performance overhead.</li>
</ul>
<p>Cons:</p>
<ul>
<li>The implementation is slightly more complicated, since you need to invalidate a user&apos;s JWT when you want to refresh their feature flags.</li>
<li>Turning feature flags on or off doesn&apos;t take effect immediately; the user must log in again to get a new JWT.</li>
</ul>
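<p>The JWT variant can be sketched like this. The <code>flags</code> claim name is an illustrative assumption, and signature verification is deliberately omitted for brevity; a real authoriser must verify the token before trusting any of its claims.</p>

```typescript
// Hedged sketch of approach 2: the login service embeds the user's flags as a
// custom claim, and the authoriser reads them straight from the token payload
// instead of querying the database on every request.
const flagsFromJwt = (token: string): Record<string, boolean> => {
  const payload = token.split('.')[1]; // JWT layout: header.payload.signature
  const json = Buffer.from(payload, 'base64url').toString('utf8');
  return JSON.parse(json).flags ?? {}; // missing claim -> no flags
};
```
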
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><h2 id="database-design">Database Design</h2>
<p>We use DynamoDB as our database to store feature flags. Below is the schema of the <code>FeatureFlags</code> DynamoDB table.</p>
<table>
<thead>
<tr>
<th>user_id (PK)</th>
<th>feature_flag(SK)</th>
<th>is_enabled</th>
<th>updated_at</th>
<th>created_at</th>
</tr>
</thead>
<tbody>
<tr>
<td>system_global_admin</td>
<td>new_feature_A</td>
<td>false</td>
<td>1687882000</td>
<td>1687882000</td>
</tr>
<tr>
<td>user_A</td>
<td>new_feature_A</td>
<td>true</td>
<td>1687882010</td>
<td>1687882010</td>
</tr>
</tbody>
</table>
<br>
<p>The above table contains two feature flags. The one with <code>system_global_admin</code> is a global feature flag, which applies to all users by default. The one with <code>user_A</code> is a user-specific feature flag. A user-specific feature flag has higher priority.</p>
<p>In this article, we use a dedicated table for feature flags. However, it can also be good practice to store your feature flags alongside the other data you need. You can read through <a href="https://www.alexdebrie.com/posts/dynamodb-single-table/?ref=willis-tech#what-is-single-table-design">The What, Why, and When of Single-Table Design with DynamoDB</a> to find out more.</p>
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><h2 id="api-design">API Design</h2>
<p>You probably want to create an API for the frontend to retrieve feature flags. This API should combine the global feature flags and user-specific feature flags in one response.</p>
<p>In the following example, if a user <code>user_A</code> has a feature flag:</p>
<pre><code>{
  &quot;new_feature_A&quot;: true
}
</code></pre>
<p>and we have two global feature flags:</p>
<pre><code>{
  &quot;new_feature_A&quot;: false,
  &quot;new_feature_B&quot;: false
}
</code></pre>
<p>the API of retrieving <code>user_A</code> feature flags should return the following response:</p>
<pre><code>{
  &quot;new_feature_A&quot;: true,
  &quot;new_feature_B&quot;: false
}
</code></pre>
<p>That means we can toggle feature flags at the global scope or the per-user scope!</p>
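<p>The combination rule described above comes down to a small merge in which user-specific flags win over global defaults; a sketch (flag names follow the example):</p>

```typescript
// User-specific flags override global defaults: spreading the user's flags
// last makes them take priority over the global ones.
const mergeFlags = (
  globalFlags: Record<string, boolean>,
  userFlags: Record<string, boolean>,
): Record<string, boolean> => ({ ...globalFlags, ...userFlags });
```

<p>Merging <code>{ "new_feature_A": false, "new_feature_B": false }</code> with <code>{ "new_feature_A": true }</code> yields <code>{ "new_feature_A": true, "new_feature_B": false }</code>, the response shown above.</p>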
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><h2 id="conclusion">Conclusion</h2>
<p>In the early stage, it should be fine to go with the first solution. You implement the straightforward solution and get the benefit of feature flags as soon as possible. However, if you want to reduce the latency, it&apos;s worth looking into the second solution.</p>
<p>Embedding feature flags in JWT is an inspiring idea for me. We use the nature of JWT to carry the information we want. Hope this inspires you as well.</p>
<p>The feature flags can include two scopes: global and user-specific. This can give you more control over the whole system.</p>
<!--kg-card-end: markdown--></br>]]></content:encoded></item><item><title><![CDATA[FinTech - Global latency sensitive service on EKS]]></title><description><![CDATA[Architecture for a global latency sensitive service.]]></description><link>https://www.willischou.com/fintech-global-latency-sensitive-service-on-eks/</link><guid isPermaLink="false">Ghost__Post__63fb86df83025200f28d53e4</guid><category><![CDATA[EKS]]></category><category><![CDATA[EC2]]></category><category><![CDATA[Blockchain]]></category><category><![CDATA[FluxCD]]></category><category><![CDATA[HelmRelease]]></category><dc:creator><![CDATA[Willis]]></dc:creator><pubDate>Sun, 26 Feb 2023 16:28:20 GMT</pubDate><media:content url="https://www.willischou.com/images/2023/08/global-latency-cover.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><h2 id="background-of-the-service">Background of the service</h2>
<ul>
<li>A global latency sensitive service is running on AWS across multiple regions.</li>
<li>Some components are stateful. We make sure the stateful components have a backup node which can take over the workload when there is a failure.</li>
<li>Some components hold long-lived TCP connections. We need to close these connections gracefully before we do a release.</li>
<li>To reduce the latency as much as possible, we deploy our services in six regions.</li>
</ul>
<img src="https://www.willischou.com/images/2023/08/global-latency-cover.jpg" alt="FinTech - Global latency sensitive service on EKS"/><p><img src="https://www.willischou.com/images/2023/03/global-arch.png" alt="FinTech - Global latency sensitive service on EKS" loading="lazy"><br>
Architecture for a latency sensitive service.</br></img></p>
<h3 id="route-53-traffic-policy">Route 53 Traffic Policy</h3>
<p><a href="https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/traffic-flow.html?ref=willis-tech">Traffic policy</a> provides geo-based DNS resolution. This can be useful to route traffic between different cloud providers.</p>
<h3 id="global-accelerator">Global Accelerator</h3>
<p>An endpoint group provides an entry point for your service in a specific AWS region. You can route traffic between different AWS regions.</p>
<h3 id="data-layer">Data Layer</h3>
<p>In the general case, we can put the database in the same region as the service to get the lowest latency. However, this means you need an application to aggregate the data from many databases in order to present the aggregated data to users.</p>
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><h2 id="how-do-we-deploy-services-in-eks">How do we deploy services in EKS?</h2>
<ul>
<li>Used StatefulSet for stateful components. Some components also have a MySQL sidecar container, which needs <code>volumeClaimTemplates</code> to persist its data.</li>
<li>Used EKS with EBS volumes to serve components that need large disks. We use the <strong>AWS Backup</strong> service to <strong>make sure the data in each EBS volume has a daily snapshot.</strong></li>
<li>Used EKS with EFS to share a disk between specific services. For example, service A parses the logs generated by service B. In this case, we use EFS to share the log files between service A and service B.</li>
</ul>
<pre><code class="language-yaml">apiVersion: helm.fluxcd.io/v1
kind: HelmRelease
metadata:
  name: example-app
  namespace: app
  annotations:
    fluxcd.io/automated: &quot;true&quot;
spec:
  releaseName: example-app
  chart:
    git: ssh://git@github.com/my-org/helm-charts
    ref: app_1.0.0
    path: charts/app
  values:
    image:
      repository: xxxxxxxxxxx.dkr.ecr.ap-northeast-1.amazonaws.com/app
      tag: &quot;xxxxxxxxxxxxxxxxxxxxxx&quot;
    targetgroupbinding:
      ... truncated ...
    serviceAccount:
      enabled: true
      irsaRoleArn: &quot;arn:aws:iam::xxxxxxxxxxxxx:role/app_irsa_role&quot;
    configuration:
      environment: &quot;prod&quot; # use variables to retrieve specific configs
      region: &quot;frankfurt&quot;
</code></pre>
<p>This is an example of deploying a service in EKS across many regions.<br>
The Helm chart covers all the details and lets you easily scale your service.</br></p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[FinTech - How to monitor API error rate?]]></title><description><![CDATA[Build a API path-level monitoring solution with exporter and prometheus.]]></description><link>https://www.willischou.com/fintech-how-our-api-error-rate-monitoring-was-implemented/</link><guid isPermaLink="false">Ghost__Post__63fb825983025200f28d53bd</guid><category><![CDATA[prometheus-nginxlog-exporter]]></category><category><![CDATA[Grafana]]></category><category><![CDATA[Prometheus]]></category><dc:creator><![CDATA[Willis]]></dc:creator><pubDate>Sun, 26 Feb 2023 16:04:41 GMT</pubDate><media:content url="https://www.willischou.com/images/2023/08/monitor-api-error-rate-cover.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><h3 id="high-level-of-the-api-traffic">High level of the API traffic</h3>
<img src="https://www.willischou.com/images/2023/08/monitor-api-error-rate-cover.jpg" alt="FinTech - How to monitor API error rate?"/><p><img src="https://www.willischou.com/images/2023/03/api-traffic.png" alt="FinTech - How to monitor API error rate?" loading="lazy"><br>
This diagram shows the route of one API request. We can collect request information at the &#x201C;Nginx&#x201D; stage, and that information includes the response time and status code.</br></img></p>
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><h3 id="what-happen-inside-the-nginx">What happens inside Nginx?</h3>
<p><img src="https://www.willischou.com/images/2023/03/nginxlog-exporter.png" alt="FinTech - How to monitor API error rate?" loading="lazy"/></p>
<p>The Nginx instance runs two processes: Nginx and nginxlog-exporter.</p>
<p>Nginx writes the request information to a log file, and nginxlog-exporter watches that log file to aggregate the information based on URI.</p>
<p>Here is an example of the <code>exporter.access.log.string</code> log file:</p>
<pre><code class="language-jsx">app.com /api/blog/1 GET 200 0.002
app.com /api/blog/2 GET 200 0.008
app.com /api/blog/3 GET 200 0.027
app.com /api/user/1 GET 200 0.003
app.com /api/user/2 GET 404 0.003
</code></pre>
<p>These logs can tell us:</p>
<ul>
<li>How long did the request take?</li>
<li>Did the request succeed with status code 200?</li>
</ul>
<p>Log format:</p>
<pre><code class="language-bash">${domain} ${uri} ${http_method} ${http_status_code} ${request_time}
</code></pre>
<p>Finally, we need to transform the logs into metrics in order to monitor them.</p>
<p>nginxlog-exporter helps us achieve this! We just give it a config file like the one below.</p>
<pre><code class="language-yaml">namespaces:
  - name: dispatcher_nginx
    format: &quot;$domain $uri $http_method $http_status_code $request_time&quot;
    source_files:
      - /var/log/nginx/exporter.access.log.string
    ...
    aggregrateuri:
      # blog related api
      - /api/blog/
      # user related api
      - /api/user/
      # others api
      - /
</code></pre>
<p>By accessing the metrics that nginxlog-exporter provides,<br>
we can learn, for example, that:</br></p>
<ul>
<li>GET <a href="http://app.com/?ref=willis-tech">app.com</a> for <code>/api/blog/*</code> has succeeded 55,984 times.</li>
</ul>
<pre><code class="language-bash">curl &quot;0.0.0.0:4040/metrics&quot;

dispatcher_nginx_http_response_count_total{app=&quot;nginx&quot;,environment=&quot;prod&quot;,host=&quot;api.com&quot;,method=&quot;GET&quot;,status=&quot;200&quot;,uri=&quot;/api/blog&quot;} 55984
</code></pre>
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><h3 id="finally-how-to-collect-these-metrics-and-evaluate-it">Finally, how do we collect and evaluate these metrics?</h3>
<p><img src="https://www.willischou.com/images/2023/03/notify.png" alt="FinTech - How to monitor API error rate?" loading="lazy"/></p>
<p>nginxlog-exporter exposes metrics in the Prometheus format, so we use Prometheus to scrape it. Done! Now we have all the metrics in a time series database (Prometheus) &#x1F389;</p>
<p>We can add alert rules based on our requirements, for example an alert for the 4xx error rate:</p>
<pre><code class="language-yaml">- alert: &quot;path /api/blog/ 4xx error rate exceed 1%&quot;
  expr: sum(increase(dispatcher_nginx_http_response_count_total{uri=&quot;/api/blog&quot;, status=~&quot;4..&quot;}[1m])) / sum(increase(dispatcher_nginx_http_response_count_total{uri=&quot;/api/blog&quot;}[1m])) &gt; 0.01
  for: 1m
  labels:
    severity: warning
    app: api
  annotations:
    description: &quot;Nginx blog related api (path /api/blog) 4xx error rate exceed 1%&quot;
</code></pre>
<p>This alert rule uses <code>expr</code>, written in <code>PromQL</code> syntax, to define how to evaluate the metrics.</p>
<p>Because the metric we get, <code>dispatcher_nginx_http_response_count_total</code>, is an incremental counter, we need to use <code>increase</code> to find the difference between two data points.</p>
<p>Since we want a <code>4xx error rate &gt; 1% alert</code> here, we use <code>increase(4xx)/increase(all)&gt;0.01</code> &#x2728;</p>
<p>Also, we can use Grafana to create some dashboards.<br>
<img src="https://www.willischou.com/images/2023/03/---2022-11-27---10.43.27-redacted_dot_app.png" alt="FinTech - How to monitor API error rate?" loading="lazy"/></br></p>
<p>The End.</p>
<h2 id="future-plan">Future Plan?</h2>
<p>nginxlog-exporter can also aggregate the response times of requests. If we want to monitor P99 or P90 latency, we can try that!</p>
<p><strong><strong><a href="https://www.robustperception.io/how-does-a-prometheus-histogram-work?ref=willis-tech">How does a Prometheus Histogram work?</a></strong></strong></p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[FinTech - EC2 deployment]]></title><description><![CDATA[Deploy EC2 based application with Blue/Green deployment.]]></description><link>https://www.willischou.com/fintech-ec2-deployment-nginx/</link><guid isPermaLink="false">Ghost__Post__63fb528383025200f28d53a3</guid><category><![CDATA[Ansible]]></category><category><![CDATA[GitHub Action]]></category><category><![CDATA[Typescript]]></category><category><![CDATA[Unit Test]]></category><category><![CDATA[Blue/Green Deployment]]></category><category><![CDATA[Prow]]></category><dc:creator><![CDATA[Willis]]></dc:creator><pubDate>Sun, 26 Feb 2023 12:41:31 GMT</pubDate><media:content url="https://www.willischou.com/images/2023/08/ec2-deployment-cover.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><h2 id="background-of-this-solution">Background of this solution</h2>
<ul>
<li>We need a method to deploy EC2 (VM) based applications safely.</li>
<li>I came up with two approaches:
<ul>
<li>Progressive deployment, which splits the instances into many batches.</li>
<li>Blue/Green deployment, which switches traffic between two Auto Scaling Groups.</li>
</ul>
</li>
</ul>
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><h2 id="version-1progressive-deployment">Version 1 - progressive deployment</h2>
<h3 id="high-level-of-workflow">High level of workflow</h3>
<img src="https://www.willischou.com/images/2023/08/ec2-deployment-cover.jpg" alt="FinTech - EC2 deployment"/><p><img src="https://www.willischou.com/images/2023/02/deployment1.png" alt="FinTech - EC2 deployment" loading="lazy"/></p>
<p>Users submit a change via a PR. After the PR is merged, the workflow analyzes the change and finds the corresponding Ansible Inventory. It then separates the Ansible Inventory into multiple deployment groups by Auto Scaling Group.</p>
<p>The workflow deploys each deployment group in parallel. Within each deployment group, it splits the instances in the Auto Scaling Group into multiple batches. By doing so, we can deploy each batch with a configurable interval.</p>
<p>The flow looks like:</p>
<ol>
<li>Get the Auto Scaling Groups in the specific environment.</li>
<li>Get the instances in the specific Auto Scaling Group.</li>
<li>Split the instances into multiple batches.</li>
<li>Deploy each batch.</li>
</ol>
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><h3 id="how-to-use-this-workflow">How to use this workflow?</h3>
<ul>
<li>
<p>Users can use the PR title to specify the interval between batches.<br>
If you specify <code>[interval: 300, 180, 60, 10]</code> in the PR title, the workflow deploys your instances in 5 batches. It starts with the first batch and waits for 300 seconds, then moves to the second batch and waits for 180 seconds, and so on.</br></p>
</li>
<li>
<p>Users can add <code>[revert]</code> or <code>[hotfix]</code> to the title, which skips progressive deployment in an emergency.</p>
</li>
<li>
<p>After the PR is merged to the master branch, the deployment is triggered.</p>
</li>
</ul>
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><h3 id="use-unit-test-to-ensure-the-workflow-result">Use unit tests to ensure the workflow result</h3>
<p>I used JavaScript and TypeScript to implement most of the workflow&apos;s logic.</p>
<p>By doing so, we can easily use <strong>jest</strong> to write unit tests. High coverage is what lets me release new workflow features with confidence.<br>
<img src="https://www.willischou.com/images/2023/02/workflow-result.png" alt="FinTech - EC2 deployment" loading="lazy"/></br></p>
<p>I use jest and mock <code>fs</code> and <code>child_process</code> to simulate the desired output for each test case. An example jest test file looks like this:</p>
<pre><code class="language-jsx">describe(&apos;group_inventory&apos;, () =&gt; {
  // Because JavaScript in GitHub Action workflow uses environment to pass variables
  // we&apos;ll set environment variables for each test case too.
  const OLD_ENV = process.env;

  beforeEach(() =&gt; {
    jest.resetModules(); // clears the cache
    process.env = { ...OLD_ENV }; // Make a copy
  });

  afterAll(() =&gt; {
    process.env = OLD_ENV; // Restore old environment
  });

  // test case 1
  test(&apos;createInventoryGroup read a asg with 10 hosts, hosts per group is 10%,10%,20%,20%,40%&apos;, async () =&gt; {
    // set up
    jest.mock(&apos;fs&apos;, () =&gt; {
      ... truncated ...
    });
    jest.mock(&apos;child_process&apos;, () =&gt; {
      ... truncated ...
    });
    process.env.ANSIBLE_INVENTORY = &quot;inventory/aws_ec2.yml&quot;;
    process.env.ANSIBLE_HOSTS_PER_GROUP = &apos;10%,10%,20%,20%,40%&apos;;
    ... truncated ...

    // import
    // here is the function we want to test
    const group_inventory = require(&apos;../group_inventory&apos;);

    // invoke
    const resp = await group_inventory({ github: {}, context: {} });
    const shouldBe = [
      {
        asg_group: &apos;nginx_asg&apos;,
        asg_group_with_ranges: [
          &apos;nginx_asg[9:9]&apos;,
          &apos;nginx_asg[8:8]&apos;,
          &apos;nginx_asg[6:7]&apos;,
          &apos;nginx_asg[4:5]&apos;,
          &apos;nginx_asg[0:3]&apos;,
        ],
      },
    ];
    expect(resp).toStrictEqual(shouldBe);
  });
});
</code></pre>
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><h2 id="version-2bluegreen-deployment">Version 2 - Blue/Green deployment</h2>
<h3 id="high-level-of-workflow">High level of workflow</h3>
<p><img src="https://www.willischou.com/images/2023/03/nginx-blue-green.png" alt="FinTech - EC2 deployment" loading="lazy"/></p>
<p>After I implemented the version 1 workflow, I thought the solution could be even better. For an EC2-based deployment, Blue/Green deployment provides the ability to roll back the service within seconds. That&apos;s why I implemented the version 2 workflow.</p>
<p>This workflow uses the Blue/Green deployment approach to deploy changes to instances.</p>
<p>We create one additional ASG alongside the existing ASG, and we group these two ASGs into one Blue/Green deployment group. Because every ASG has its own Target Group, we can control how much traffic goes to each Target Group via the weights of an ALB rule.</p>
<p>This approach improves on the version 1 workflow in the following ways:</p>
<ul>
<li>We can update the EC2 AMI automatically.</li>
<li>We can roll back the service within one minute.</li>
</ul>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Project - E-Commerce]]></title><description><![CDATA[E-Commerce platform for a large online retailer.]]></description><link>https://www.willischou.com/side-project-matthew-e-commerce-keyword-php-laravel-angular-e-commerce/</link><guid isPermaLink="false">Ghost__Post__63fb520483025200f28d5386</guid><category><![CDATA[PHP]]></category><category><![CDATA[Laravel]]></category><category><![CDATA[Angular]]></category><category><![CDATA[E-Commerce]]></category><category><![CDATA[Dev Container]]></category><dc:creator><![CDATA[Willis]]></dc:creator><pubDate>Sun, 26 Feb 2023 12:35:36 GMT</pubDate><media:content url="https://www.willischou.com/images/2023/08/project-e-commerce-cover.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><h2 id="background-of-this-project">Background of this project</h2>
<ul>
<li>An Asian online retailer needed a discount system to meet their marketing team&apos;s requirements.</li>
<li>The discount system supports three types of promotions: A+B, total pieces, and total price.</li>
<li>The discount system includes many functionalities such as coupons, product groups, and a shopping cart.</li>
</ul>
<h3 id="checkout-flow">Checkout Flow</h3>
<img src="https://www.willischou.com/images/2023/08/project-e-commerce-cover.jpg" alt="Project - E-Commerce"/><p><img src="https://www.willischou.com/images/2023/02/---2023-02-04---4.34.43.png" alt="Project - E-Commerce" loading="lazy"><br>
Users can browse products on the site.</br></img></p>
<p><img src="https://www.willischou.com/images/2023/02/---2023-02-28---11.22.17.png" alt="Project - E-Commerce" loading="lazy"><br>
Once the user has selected products, they can view the price of each product after promotions are applied.</br></img></p>
<p><img src="https://www.willischou.com/images/2023/02/---2023-02-28---11.22.25.png" alt="Project - E-Commerce" loading="lazy"><br>
The user can select different payment methods and view the total price.</br></img></p>
<h3 id="coupons-management">Coupons management</h3>
<p><img src="https://www.willischou.com/images/2023/02/---2023-02-04---4.35.53.png" alt="Project - E-Commerce" loading="lazy"><br>
Users can manage their coupons on the member center page.</br></img></p>
<p><img src="https://www.willischou.com/images/2023/02/---2023-02-04---4.33.18.png" alt="Project - E-Commerce" loading="lazy"><br>
Staff can manage the promotion rules in the Admin portal.</br></img></p>
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><h2 id="what-i-did-in-this-project">What I did in this project</h2>
<ul>
<li>Used <strong>PHP and Laravel</strong> in the backend to implement the E-Commerce platform functionalities.</li>
<li>Used <strong>Angular</strong> with <a href="https://akveo.github.io/nebular/docs/getting-started/what-is-nebular?ref=willis-tech#what-is-nebular">nebular</a> UI library.</li>
<li>Used <a href="https://code.visualstudio.com/docs/devcontainers/containers?ref=willis-tech">dev container</a> to build a delightful development environment locally.</li>
<li>Used Docker container to deploy services.</li>
</ul>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Project - Appointment match platform]]></title><description><![CDATA[Serverless backend in AWS.]]></description><link>https://www.willischou.com/side-project-appointment-match-platform/</link><guid isPermaLink="false">Ghost__Post__63fb51d483025200f28d5379</guid><category><![CDATA[AWS]]></category><category><![CDATA[Serverless]]></category><category><![CDATA[Aurora]]></category><category><![CDATA[Cognito]]></category><category><![CDATA[API Gateway]]></category><category><![CDATA[Lambda]]></category><category><![CDATA[Chime]]></category><category><![CDATA[GitHub Action]]></category><category><![CDATA[SES]]></category><category><![CDATA[AWS SAM]]></category><dc:creator><![CDATA[Willis]]></dc:creator><pubDate>Sun, 26 Feb 2023 12:35:00 GMT</pubDate><media:content url="https://www.willischou.com/images/2023/08/appointment-platform-cover.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><h2 id="background-of-this-project">Background of this project</h2>
<ul>
<li>I worked with friends to build a business appointment platform according to a Japanese team&#x2019;s requirements.</li>
<li>Users can arrange meetings with other companies on this platform.</li>
<li>This platform aims to increase the visibility of registered companies.</li>
</ul>
<img src="https://www.willischou.com/images/2023/08/appointment-platform-cover.jpg" alt="Project - Appointment match platform"/><p><img src="https://www.willischou.com/images/2023/02/---2023-01-27---2.26.37.png" alt="Project - Appointment match platform" loading="lazy"><br>
Users can browse company information.</br></img></p>
<p><img src="https://www.willischou.com/images/2023/02/---2023-01-27---2.38.52.png" alt="Project - Appointment match platform" loading="lazy"><br>
Users can arrange a meeting with each company.</br></img></p>
<p><img src="https://www.willischou.com/images/2023/02/---2023-01-27---2.31.31-1.png" alt="Project - Appointment match platform" loading="lazy"><br>
Users can manage received meeting invitations.</br></img></p>
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><h2 id="what-i-did-in-this-project">What I did in this project</h2>
<ul>
<li>
<p>The whole project is in <strong>serverless</strong> architecture. We used API Gateway, Lambda, Cognito, Aurora Serverless, SQS and SES to build the backend.</p>
</li>
<li>
<p>Used <strong>AWS SAM</strong> (CloudFormation) to deploy whole infrastructure.</p>
</li>
<li>
<p>Used <strong>GitHub Action</strong> to implement CI/CD.</p>
</li>
<li>
<p>Used <strong>Nodejs and Prisma</strong> to implement 50+ Lambda functions to fulfill all the API requirements.</p>
</li>
<li>
<p>Deploying Lambda with AWS SAM also gives us built-in canary deployment functionality we can use.</p>
<pre><code class="language-yaml">...
Type: AWS::Serverless::Function
Properties:
  DeploymentPreference:
    # Start with 10 percent traffic to the new version. After 10 mins, it switches all traffic to the new one.
    Type: Canary10Percent10Minutes
</code></pre>
</li>
<li>
<p>Built an <strong>email notification microservice</strong> with SQS, Lambda and SES. Also, we used EventBridge to generate daily reports.</p>
<p><img src="https://www.willischou.com/images/2023/02/---2023-01-27---3.19.39.png" alt="Project - Appointment match platform" loading="lazy"/></p>
<p>After a user registers, we send them a welcome email.</p>
</li>
<li>
<p>Built a PPT-to-PDF converter with <a href="https://github.com/shelfio/aws-lambda-libreoffice?ref=willis-tech">aws-lambda-libreoffice</a>. Users can upload their PPTs and share them on the website.</p>
</li>
</ul>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Project - Petio]]></title><description><![CDATA[iOS and Android apps for pet owners.]]></description><link>https://www.willischou.com/side-project-petio-pet-friendly-map/</link><guid isPermaLink="false">Ghost__Post__63fb512883025200f28d5358</guid><category><![CDATA[AKS]]></category><category><![CDATA[Azure DevOps]]></category><category><![CDATA[Nodejs]]></category><category><![CDATA[Nestjs]]></category><category><![CDATA[React Native]]></category><category><![CDATA[Angular]]></category><category><![CDATA[Terraform]]></category><dc:creator><![CDATA[Willis]]></dc:creator><pubDate>Sun, 26 Feb 2023 12:34:19 GMT</pubDate><media:content url="https://www.willischou.com/images/2023/08/pet-app-cover.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><h2 id="background-of-this-project">Background of this project</h2>
<ul>
<li>More and more people own pets these days. We built an iOS and Android app for them.</li>
<li>You can use this app to search for the nearest pet clinics and pet-friendly restaurants.</li>
<li>You can use this app to report incidents on the map. For example, you can mark the position on the map when you find a stray dog.</li>
</ul>
<h2 id="what-i-did-in-this-project">What I did in this project</h2>
<ul>
<li>
<img src="https://www.willischou.com/images/2023/08/pet-app-cover.jpg" alt="Project - Petio"/><p>Used <strong>Nodejs with <a href="https://nestjs.com/?ref=willis-tech">NestJS</a></strong> framework to develop backend service.</p>
</li>
<li>
<p>Used <strong>Azure</strong> Blob, AKS, Azure Redis, MongoDB for our backend.</p>
</li>
<li>
<p>Used <strong>React Native</strong> to develop the iOS and Android apps.</p>
</li>
<li>
<p>Used <strong>Angular with <a href="https://ng.ant.design/docs/introduce/en?ref=willis-tech">ng-zorro</a></strong> (antd) framework to develop admin portal.</p>
</li>
<li>
<p>Built the whole infrastructure with <strong>Terraform</strong>. We can re-build all services to another Azure account within one hour.</p>
</li>
<li>
<p>Used <strong>Azure DevOps</strong> to implement CI/CD pipeline. In this side project, we spent 80% of our time in development.</p>
<p><img src="https://www.willischou.com/images/2023/02/azure-devops.png" alt="Project - Petio" loading="lazy"><br>
CI/CD in <a href="https://azure.microsoft.com/en-us/products/devops?ref=willis-tech">Azure DevOps</a>. Most of our services have unit test.</br></img></p>
<p><img src="https://www.willischou.com/images/2023/02/azure-cd.png" alt="Project - Petio" loading="lazy"><br>
The CD includes 4 steps: dev infra &#x2192; dev k8s &#x2192; prod infra &#x2192; prod k8s.<br>
The pipeline applies infrastructure changes and application release.</br></br></img></p>
</li>
</ul>
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><h3 id="features-and-screenshots">Features and screenshots</h3>
<p><img src="https://www.willischou.com/images/2023/02/petio1.jpeg" alt="Project - Petio" loading="lazy"><br>
Users can report incidents with photos on the map.</br></img></p>
<p><img src="https://www.willischou.com/images/2023/02/petio2.jpeg" alt="Project - Petio" loading="lazy"><br>
The map shows the incidents. Users can reply to each incident and mark it as resolved.</br></img></p>
<p><img src="https://www.willischou.com/images/2023/02/petio3.jpeg" alt="Project - Petio" loading="lazy"><br>
Users receive a push notification when a new incident is reported near them.</br></img></p>
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[Full Stack - Monitoring Platform and Chaos Engineering]]></title><description><![CDATA[Incident prevention platform with chaos engineering.]]></description><link>https://www.willischou.com/software-monitoring-platform-and-chaos-engineering/</link><guid isPermaLink="false">Ghost__Post__63fb50c183025200f28d5347</guid><category><![CDATA[AWS]]></category><category><![CDATA[Golang]]></category><category><![CDATA[React]]></category><category><![CDATA[TestCafe]]></category><category><![CDATA[gRPC]]></category><category><![CDATA[Chaos Engineering]]></category><category><![CDATA[Kinesis]]></category><category><![CDATA[Time Series Database]]></category><category><![CDATA[AIOps]]></category><dc:creator><![CDATA[Willis]]></dc:creator><pubDate>Sun, 26 Feb 2023 12:31:24 GMT</pubDate><media:content url="https://www.willischou.com/images/2023/08/chaos-engineering-platform-cover.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><h2 id="background-of-monitoring-platform-ipp">Background of monitoring platform (IPP)</h2>
<ul>
<li>Incident prevention platform (IPP) is an in-house monitoring and chaos engineering platform for our services.</li>
<li>It provides a UI portal to help users set up a daemon on their EC2 instances to collect logs and metrics.</li>
<li>It provides built-in dashboards to monitor EC2 resources, HTTP requests (Apdex) and AWS resources.</li>
<li>It provides chaos engineering functionality including CPU pressure, Memory pressure, Disk pressure and network connectivity blockade.</li>
</ul>
<h2 id="in-this-project-i-contributed-to-those-items">In this project, I contributed to those items</h2>
<ul>
<li>Introduced <strong>Terraform</strong> to the team and used it to deploy AWS infrastructure including Step Function, VPC Endpoint Service, Lambda, S3 Bucket, IAM Role.</li>
<li>Developed a <strong>Step Function</strong> to help new customers integrate with our platform. This Step Function creates a VPC Endpoint in the customer&#x2019;s AWS account and uses an SSM Run Command to set up the Telegraf agent.</li>
<li>Developed backend service with <strong>Golang</strong> and <strong>gRPC</strong>.</li>
<li>Developed <strong>Lambda</strong> functions to aggregate data and interact with the API using gRPC.<br>
The aggregated data includes Apdex, P99 and P50, to better monitor service performance.</br></li>
<li>Developed Grafana dashboard templating engine. Our backend service can create certain dashboards for customers based on their requirements.</li>
<li>Developed frontend with <strong>React</strong>. Users can execute a chaos engineering task by using UI.</li>
<li>Used <strong>TestCafe</strong> to write frontend E2E testing.</li>
<li>Implemented CI/CD in <strong>GitLab Pipeline</strong>.</li>
</ul>
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><h3 id="architecture">Architecture</h3>
<img src="https://www.willischou.com/images/2023/08/chaos-engineering-platform-cover.jpg" alt="Full Stack - Monitoring Platform and Chaos Engineering"/><p><img src="https://www.willischou.com/images/2023/03/IPP-monitoring-platform.png" alt="Full Stack - Monitoring Platform and Chaos Engineering" loading="lazy"/></p>
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><h3 id="logs-and-metrics">Logs and Metrics</h3>
<ul>
<li>Used <strong>InfluxDB</strong>, <strong>Prometheus</strong> and <strong>CloudWatch Metrics</strong> to store service metrics.</li>
<li>Used <strong>Elasticsearch</strong>, <strong>S3</strong> and <strong>Kinesis</strong> to store service logs.</li>
</ul>
<h3 id="chaos-engineering">Chaos Engineering</h3>
<ul>
<li>Used <code>tc</code> and <code>stress</code> to simulate packet loss, high network latency, CPU pressure, Memory pressure, etc.</li>
</ul>
<h3 id="eks-monitoring-solution">EKS Monitoring solution</h3>
<ul>
<li>Used Prometheus Operator to monitor services. We also used Prometheus Remote Write / Read to collect customers&#x2019; metrics into our Prometheus, and then used Prometheus Federation to aggregate the data and serve user metrics from our aggregated Prometheus.</li>
</ul>
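<p>For reference, a federation scrape job of the kind described might look like the following (a sketch; the job name, match rule and target are assumptions, while the <code>/federate</code> path and <code>honor_labels</code> setting follow the standard Prometheus federation setup):</p>

```yaml
scrape_configs:
  - job_name: "federate"
    honor_labels: true          # keep the original labels of federated series
    metrics_path: "/federate"
    params:
      "match[]":
        - '{job="node"}'        # pull only the series we want to aggregate
    static_configs:
      - targets: ["prometheus-regional:9090"]  # a downstream Prometheus
```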
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[E-commerce Platform - staging env in EKS for 200+ stacks]]></title><description><![CDATA[Large staging environment in EKS.]]></description><link>https://www.willischou.com/ecommerce-platform-staging-env-in-eks-for-200-stacks/</link><guid isPermaLink="false">Ghost__Post__63fb501783025200f28d5329</guid><category><![CDATA[Kubed]]></category><category><![CDATA[External DNS]]></category><category><![CDATA[Cert Manager]]></category><category><![CDATA[Nginx Ingress Controller]]></category><category><![CDATA[Gomplate]]></category><category><![CDATA[Git-crypt]]></category><category><![CDATA[Samson]]></category><category><![CDATA[Bitbucket Pipeline]]></category><dc:creator><![CDATA[Willis]]></dc:creator><pubDate>Sun, 26 Feb 2023 12:28:14 GMT</pubDate><media:content url="https://www.willischou.com/images/2023/08/e-staging-cover.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><h2 id="whats-the-challenge">What&apos;s the challenge?</h2>
<ul>
<li>The staging environment needs 200 stacks.<br>
Each stack contains three services (api, storefront, admin).<br>
You can access stack 1&apos;s services at <code>api.stg1.service.com</code>, <code>shop.stg1.service.com</code> and <code>admin.stg1.service.com</code>.</br></br></li>
</ul>
<img src="https://www.willischou.com/images/2023/08/e-staging-cover.jpg" alt="E-commerce Platform - staging env in EKS for 200+ stacks"/><p><img src="https://www.willischou.com/images/2023/02/---2023-01-27---2.47.29-redacted_dot_app.png" alt="E-commerce Platform - staging env in EKS for 200+ stacks" loading="lazy"/></p>
<ul>
<li>The 200 stacks have many configurations and secrets to manage.</li>
<li>We need a platform for DEV, PM and QA to deploy.</li>
<li>All services in staging must support TLS.</li>
</ul>
<h2 id="why-we-choose-eks-for-staging-env">Why did we choose EKS for the staging env?</h2>
<ul>
<li>We use EKS in our production environment, so we need a staging environment in EKS too. By doing so, we can make sure the architectures of the different environments are the same and testable.</li>
<li>50+ developers need to verify results in the staging environment in parallel. Using EKS lets us isolate the environments and benefit from the Kubernetes community.</li>
</ul>
<h2 id="in-this-project-i-contributed-to-those-items">In this project, I contributed to those items</h2>
<ul>
<li>
<p>Built staging environment in EKS to host 200 stacks and all other e-commerce services (eg: open-api, third-party-api).</p>
</li>
<li>
<p>Created documentation and hosted a training session to teach DEV and QA how to use EKS.<br>
After two months, most of our developers could work with EKS efficiently.</br></p>
</li>
<li>
<p>Used <strong>Gomplate (templating tool)</strong> to generate the ConfigMap files for all 200 stacks, so we only need to manage one template file.<br>
For example, the following template renders a ConfigMap for the desired <code>STACK_NO</code>:</br></p>
<pre><code class="language-yaml">apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-env
data:
  APP_CONFIG_HOST: api.stg{{.Env.STACK_NO}}.service.com
</code></pre>
</li>
<li>
<p>Used External DNS, Cert Manager and Nginx Ingress Controller to serve all staging services over HTTPS.</p>
</li>
<li>
<p>Used Kubed to copy the TLS certificate across namespaces within the EKS cluster, because we sign 100 domains into one certificate to reduce the number of requests to Let&#x2019;s Encrypt.</p>
</li>
<li>
<p>Used git-crypt to encrypt secrets before committing them to version control. The concept is similar to <a href="https://github.com/bitnami-labs/sealed-secrets?ref=willis-tech">Sealed Secrets</a>.</p>
</li>
<li>
<p>Set up <a href="https://github.com/zendesk/samson?ref=willis-tech">Samson</a> CI/CD platform for users to deploy services.</p>
</li>
<li>
<p>Our staging cluster has 120+ worker nodes. We use different node types for different workloads, for example System, Memory and CPU nodes, so we can use computing resources efficiently.</p>
</li>
</ul>
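<p>The per-stack rendering idea above can be sketched in plain Python (the template string, key names and hostnames here are illustrative, not our real config):</p>

```python
# Render one ConfigMap manifest per staging stack, substituting the stack
# number into the hostnames -- the same idea as gomplate's {{.Env.STACK_NO}}.
TEMPLATE = """\
apiVersion: v1
kind: ConfigMap
metadata:
  name: custom-env-stg{stack_no}
data:
  APP_CONFIG_API_HOST: api.stg{stack_no}.service.com
  APP_CONFIG_SHOP_HOST: shop.stg{stack_no}.service.com
"""

def render_configmap(stack_no: int) -> str:
    """Fill the template for a single staging stack."""
    return TEMPLATE.format(stack_no=stack_no)

def render_all(count: int) -> list[str]:
    """One manifest per stack, e.g. 200 of them."""
    return [render_configmap(n) for n in range(1, count + 1)]

if __name__ == "__main__":
    print(render_configmap(7))
```

<p>With gomplate itself, the same result comes from rendering the template once per value of the <code>STACK_NO</code> environment variable.</p>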
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[E-commerce Platform - MongoDB self-host to Atlas]]></title><description><![CDATA[In charge of high volume transactions MongoDB.]]></description><link>https://www.willischou.com/ec-platform-mongodb-self-host-to-atlas/</link><guid isPermaLink="false">Ghost__Post__63fb4fde83025200f28d531a</guid><category><![CDATA[MongoDB]]></category><category><![CDATA[RoR]]></category><category><![CDATA[Atlas]]></category><dc:creator><![CDATA[Willis]]></dc:creator><pubDate>Sun, 26 Feb 2023 12:26:47 GMT</pubDate><media:content url="https://www.willischou.com/images/2023/08/e-mongo-cover.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><h2 id="what-is-this-mongodb-cluster">What is this MongoDB cluster</h2>
<ul>
<li>Core database for the e-commerce platform.</li>
<li>Holds data for 500,000 shops, including orders, customers, etc.</li>
<li>Self-hosted on AWS EC2 with 3 shards; each shard has 4 nodes (2 secondaries, 1 primary and 1 arbiter).</li>
</ul>
<h2 id="why-we-did-this">Why we did this</h2>
<ul>
<li>Moved our self-hosted MongoDB cluster to MongoDB Atlas (a fully managed service).</li>
<li>MongoDB Atlas provides permission control and near real-time monitoring.</li>
</ul>
<h2 id="in-this-project-i-contributed-to-those-items">In this project, I contributed the following</h2>
<ul>
<li>Made a migration plan and completed the migration within three months.</li>
<li>Created a 4 TB cluster with 2 shards and 8 nodes (1 read-only, 2 secondaries and 1 primary per shard).</li>
<li>Managed the Atlas cluster with Terraform.</li>
<li>Upgraded the MongoDB driver and Ruby versions as required by the MongoDB upgrade.</li>
<li>Made a plan to restrict traffic during the migration.</li>
<li>Took charge of core MongoDB stability.</li>
<li>Sharded MongoDB collections to ensure high performance.</li>
</ul>
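<p>The even-spread effect of a hashed shard key can be illustrated with a toy Python sketch (MongoDB uses its own 64-bit hash and chunk ranges internally; the hash function and shard count here are only for illustration):</p>

```python
import hashlib

def route_to_shard(shop_id: str, num_shards: int = 2) -> int:
    """Toy illustration of hashed shard-key routing: hash the shard key
    value and map it onto one of the shards, so writes for different
    shops spread evenly instead of piling onto a single shard."""
    digest = hashlib.md5(shop_id.encode()).hexdigest()
    return int(digest, 16) % num_shards

if __name__ == "__main__":
    counts = [0, 0]
    for i in range(10_000):
        counts[route_to_shard(f"shop-{i}")] += 1
    print(counts)  # roughly an even split across the two shards
```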
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><h3 id="architecture">Architecture</h3>
<img src="https://www.willischou.com/images/2023/08/e-mongo-cover.jpg" alt="E-commerce Platform - MongoDB self-host to Atlas"/><p><img src="https://www.willischou.com/images/2023/03/mongo_arch.png" alt="E-commerce Platform - MongoDB self-host to Atlas" loading="lazy"/></p>
<!--kg-card-end: markdown--><h2 id="what-benefits-mongodb-atlas-brings-to">What benefits does MongoDB Atlas bring?</h2><ul><li>Fine-grained Access Control: <br>You can control who can access your data and who can manage your cluster.</br></li><li>MongoDB Atlas Search: <br>Natively supports text search functionality. You don&apos;t need to run a separate search engine anymore.</br></li><li>Monitoring: <br>Atlas provides a near real-time monitoring dashboard. You can see which queries are running on the cluster.</br></li></ul>]]></content:encoded></item><item><title><![CDATA[E-commerce Platform - Data Pipelines with Debezium and monitoring]]></title><description><![CDATA[Data Pipeline implementations and monitors.]]></description><link>https://www.willischou.com/ecommerce-platform-data-pipeline-with-debezium-and-monitoring/</link><guid isPermaLink="false">Ghost__Post__63fb4f7e83025200f28d5305</guid><category><![CDATA[Prometheus]]></category><category><![CDATA[Debezium]]></category><dc:creator><![CDATA[Willis]]></dc:creator><pubDate>Sun, 26 Feb 2023 12:25:54 GMT</pubDate><media:content url="https://www.willischou.com/images/2023/08/e-data-pipeline-cover.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><h2 id="what-are-these-data-pipelines">What are these Data Pipelines</h2>
<ul>
<li>The analysis tasks of the e-commerce platform rely on a Postgres database. We have data pipelines that sync our data from MongoDB and MySQL to Postgres.</li>
<li>The MongoDB-to-Postgres pipeline is built on <a href="https://www.mongodb.com/docs/manual/changeStreams/?ref=willis-tech">Change Streams</a>.</li>
<li>The MySQL-to-Postgres pipeline is built on <a href="https://debezium.io/?ref=willis-tech">Debezium</a>.</li>
</ul>
<h2 id="why-we-did-this">Why we did this</h2>
<ul>
<li>We need to ensure these data pipelines stay in sync and functional.</li>
<li>We need to build a CDC solution to stream MySQL changes to Postgres.</li>
</ul>
<h2 id="in-this-project-i-contributed-to-those-items">In this project, I contributed the following</h2>
<ul>
<li>
<img src="https://www.willischou.com/images/2023/08/e-data-pipeline-cover.jpg" alt="E-commerce Platform - Data Pipelines with Debezium and monitoring"/><p>Used Node.js to implement a custom Prometheus Exporter that collects the latest timestamp of a specific MongoDB collection and the latest timestamp of the records we wrote to Postgres.</p>
</li>
<li>
<p>Used Prometheus and Grafana to build a dashboard and alerts for the data pipeline. Once the data in Postgres falls behind MongoDB by more than 5 minutes, it sends out an alert.</p>
</li>
<li>
<p>Deployed Debezium in EKS to fulfill the MySQL CDC requirement.</p>
</li>
<li>
<p>Developed a custom Helm Chart to help us add Service Monitors more easily.</p>
<pre><code class="language-yaml">{{- range $serviceMonitorName, $ref := .Values.serviceMonitors }}
---
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: {{ $serviceMonitorName }}
  labels:
    release: prometheus-operator
  {{- if hasKey $ref &quot;labels&quot; }}
    {{- range $key, $value := $ref.labels }}
    {{ $key }}: {{ $value | quote }}
    {{- end }}
  {{- end }}
spec:
  namespaceSelector:
    matchNames:
    {{- range $namespace := $ref.namespaceSelector }}
      - {{ $namespace }}
    {{- end }}
  selector:
    matchLabels:
      {{- range $key, $value := $ref.selector.matchLabels }}
      {{ $key }}: {{ $value | quote }}
      {{- end }}
  endpoints: {{- toYaml $ref.endpoints | nindent 4 }}
{{- end}}
</code></pre>
</li>
</ul>
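<p>For reference, a hypothetical <code>values.yaml</code> entry that the template above would render into a ServiceMonitor (all names are illustrative):</p>

```yaml
serviceMonitors:
  debezium:
    labels:
      team: data-pipeline        # merged under metadata.labels
    namespaceSelector:           # becomes spec.namespaceSelector.matchNames
      - data
    selector:
      matchLabels:
        app: debezium
    endpoints:                   # passed through via toYaml
      - port: metrics
        interval: 30s
```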
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><h3 id="architecture">Architecture</h3>
<p><img src="https://www.willischou.com/images/2023/03/Data-Pipeline-and-Monitoring.png" alt="E-commerce Platform - Data Pipelines with Debezium and monitoring" loading="lazy"/></p>
<p>A high-level view of the solution.</p>
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><h3 id="result">Result</h3>
<p><img src="https://www.willischou.com/images/2023/02/alert.png" alt="E-commerce Platform - Data Pipelines with Debezium and monitoring" loading="lazy"/></p>
<p>We can monitor the delay of the data pipeline in near real time.</p>
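<p>The delay alert can be expressed as a Prometheus alerting rule along these lines (the metric names are hypothetical, standing in for whatever the custom exporter publishes):</p>

```yaml
groups:
  - name: data-pipeline
    rules:
      - alert: DataPipelineLagging
        # Fires when the newest record seen in Postgres is more than
        # 5 minutes (300 s) behind the newest record seen in MongoDB.
        expr: mongodb_latest_record_timestamp_seconds - postgres_latest_record_timestamp_seconds > 300
        labels:
          severity: warning
        annotations:
          summary: "Postgres is more than 5 minutes behind MongoDB"
```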
<!--kg-card-end: markdown-->]]></content:encoded></item><item><title><![CDATA[E-commerce Platform - Image Service GCP to AWS]]></title><description><![CDATA[The journey of migrating a image service from GCP to AWS.]]></description><link>https://www.willischou.com/image-service-gcp-to-aws/</link><guid isPermaLink="false">Ghost__Post__63fb4dec83025200f28d52c4</guid><category><![CDATA[AWS]]></category><category><![CDATA[GCP]]></category><category><![CDATA[Pub/Sub]]></category><category><![CDATA[Functions]]></category><category><![CDATA[GKE]]></category><category><![CDATA[Lambda]]></category><category><![CDATA[Lambda Layer]]></category><category><![CDATA[CloudFront]]></category><category><![CDATA[EKS]]></category><category><![CDATA[Elasticache Redis]]></category><category><![CDATA[Keda]]></category><category><![CDATA[Terraform]]></category><dc:creator><![CDATA[Willis]]></dc:creator><pubDate>Sun, 26 Feb 2023 12:24:00 GMT</pubDate><media:content url="https://www.willischou.com/images/2023/08/e-image-service-cover.jpg" medium="image"/><content:encoded><![CDATA[<!--kg-card-begin: markdown--><h3 id="what-is-image-service">What is Image Service:</h3>
<img src="https://www.willischou.com/images/2023/08/e-image-service-cover.jpg" alt="E-commerce Platform - Image Service GCP to AWS"/><p>It&#x2019;s an asynchronous image resize service. When you access a product image in the e-commerce platform, you request a URL like <code>https://image-service.com/image/1?size=100x200</code>. The service tries to find the image with id <code>1</code> and size <code>100x200</code>; if it can&#x2019;t find an existing one, it sends out a job to resize the image in the background.</p>
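<p>The find-or-enqueue flow can be sketched as follows (the in-memory <code>storage</code> dict and <code>queue</code> list are stand-ins for the real object store and job queue, not the actual implementation):</p>

```python
def handle_image_request(image_id: str, size: str,
                         storage: dict, queue: list) -> dict:
    """Serve a pre-resized image if it exists; otherwise enqueue a
    background resize job and fall back to the original image."""
    key = f"{image_id}/{size}"
    if key in storage:
        return {"status": 200, "body": storage[key]}
    # Cache miss: ask a background worker to produce this size.
    queue.append({"image_id": image_id, "size": size})
    return {"status": 202, "body": storage.get(f"{image_id}/original")}

if __name__ == "__main__":
    storage = {"1/original": b"raw", "1/100x200": b"resized"}
    jobs: list = []
    handle_image_request("1", "100x200", storage, jobs)  # existing size: served directly
    handle_image_request("1", "50x50", storage, jobs)    # missing size: resize job queued
```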
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><h3 id="why-we-did-this">Why we did this:</h3>
<ul>
<li>Reduce the data transfer fee between two cloud providers.</li>
<li>Focus on AWS, as the team mainly uses AWS.</li>
<li>Replace manual certificate renewal with AWS Certificate Manager.</li>
</ul>
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><h3 id="in-this-project-i-contributed-to-those-items">In this project, I contributed the following:</h3>
<ul>
<li>
<p>Moved the architecture from <strong>GCP to AWS</strong> (<strong>GKE to EKS</strong>). Modified the RoR application to use <strong>AWS Web Identity</strong> to get SQS permissions and interact with <strong>SQS</strong> in EKS.</p>
</li>
<li>
<p>Moved the GCP Cloud Function to AWS Lambda. This includes using a <strong>Lambda Layer</strong> to store the image processing library and implementing the <strong>Lambda function</strong> that processes images.</p>
</li>
<li>
<p>Solved the Lambda disk space limitation. Image Service uses disk to store temporary artifacts, but Lambda&apos;s <code>/tmp</code> has a 250 MB limitation and using <strong>EFS would cost greatly</strong>. Therefore, we used Terraform to deploy multiple Lambda functions, because the <code>/tmp</code> <strong>disk is only shared within the same Lambda function</strong>.</p>
</li>
<li>
<p>Used <code>k8s-cloudwatch-adapter</code> to scale our service out and in, and later replaced it with <code>Keda</code>. The auto scaling mechanism can handle a workload of avg. 50k rpm and max. 80k rpm.</p>
</li>
<li>
<p>Optimized the CloudFront cache rules. In the beginning, our application relied on the <code>User-Agent</code> header to handle requests, which greatly hurt the cache. I changed our application to <strong>get the required data from the Query String instead of the Header</strong>, increasing the cache hit rate from 50% to 90%.</p>
</li>
<li>
<p>Optimized memory usage. I observed the application and found a Memory Fragmentation issue, so I changed our Memory Allocator to <code>jemalloc</code>. By doing so, we reduced memory usage by 80%.</p>
</li>
</ul>
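<p>With Keda, the SQS-driven scaling can be declared along these lines (a sketch with hypothetical names, queue URL and thresholds, using Keda&apos;s <code>aws-sqs-queue</code> scaler):</p>

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: image-service-worker
spec:
  scaleTargetRef:
    name: image-service-worker   # the Deployment to scale
  minReplicaCount: 2
  maxReplicaCount: 50
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.ap-northeast-1.amazonaws.com/123456789012/image-resize-jobs
        queueLength: "100"       # target messages per replica
        awsRegion: ap-northeast-1
```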
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><h3 id="architecture">Architecture</h3>
<p><img src="https://www.willischou.com/images/2023/03/Image-Service-GCP-to-AWS.png" alt="E-commerce Platform - Image Service GCP to AWS" loading="lazy"/></p>
<!--kg-card-end: markdown--><!--kg-card-begin: markdown--><h3 id="result">Result</h3>
<p>After we disabled CloudFront header forwarding, the cache started to work.<br>
<img src="https://www.willischou.com/images/2023/02/image-1.png" alt="E-commerce Platform - Image Service GCP to AWS" loading="lazy"/></br></p>
<p>After we changed the Memory Allocator, memory usage decreased by 80%.<br>
<img src="https://www.willischou.com/images/2023/02/---2022-02-02---11.14.46.png" alt="E-commerce Platform - Image Service GCP to AWS" loading="lazy"/></br></p>
<!--kg-card-end: markdown-->]]></content:encoded></item></channel></rss>