by Daniel Li July 10, 2020

Let's say you've just built an amazing Node.js application and want to distribute it as a Docker image. You write a Dockerfile, run docker build, and distribute the generated image on a Docker registry like Docker Hub.

You pat yourself on the back and say to yourself, "Not too shabby!" But being the perfectionist that you are, you want to make sure that your Docker image and Dockerfile are as optimized as possible.

Well, that's exactly what we'll cover in this article! But because there are a lot of techniques for optimizing your Docker image, we've split this article into two parts. The first part (this one) will cover:

  • Reducing the number of running processes
  • Handling signals properly
  • Making use of the build cache
  • Using ENTRYPOINT
  • Using EXPOSE to document exposed ports

In the second part, we will cover:

  • Reducing the Docker Image file size by:
    • Removing Obsolete Files
    • Using a lighter base image
  • Using labels (LABEL)
  • Adding Semantics to Labels
  • Linting your Dockerfile

Later on, we will publish a dedicated article on securing our Docker image, where we will cover:

  • Following the Principle of Least Privilege
  • Signing and verifying Docker Images
  • Using .dockerignore to ignore sensitive files
  • Vulnerability Scanning

Although this article deals with a Node.js application, the principles outlined here apply to applications written in other languages and frameworks too!

Background

For this article, we will work to optimize the Dockerfile associated with a basic front-end application. Start by cloning the repository at github.com/d4nyll/docker-demo-frontend. Specifically, we want to use the docker/basic branch.

$ git clone -b docker/basic https://github.com/d4nyll/docker-demo-frontend.git

Next, open up the Dockerfile to see what instructions are already there.

FROM node
WORKDIR /root/
COPY . .
RUN npm install
RUN npm run build
CMD npm run serve

Each line inside a Dockerfile is called an instruction. You can find all valid instructions at the Dockerfile reference page.

Pretty basic stuff. But for those new to Docker, here's a brief overview:

  • FROM node
    • we want to build our Docker image based on the node base image
  • WORKDIR /root/
    • we want all subsequent instructions in this Dockerfile to be carried out inside the specified directory. It's similar to running cd /root/ on your terminal.
  • COPY . .
    • copy everything from the build context to the current WORKDIR. Don't know what a build context is? Check out the documentation on docker build.
  • RUN npm install
    • run npm install to install all the application's dependencies, as specified inside the dependencies property of the package.json, as well as the package-lock.json file
  • RUN npm run build
    • run the build npm script inside package.json, which simply uses Webpack to build the application
  • CMD npm run serve
    • whilst all the previous instructions are executed during the docker build process, the CMD command is executed when you run docker run. It specifies which process should run inside the container as the first process.

Try running docker build to build our image.

$ docker build -t demo-frontend:basic .
...
Removing intermediate container a3d5032b851b
 ---> 703e723acecf
Successfully built 703e723acecf
Successfully tagged demo-frontend:basic

You should now be able to see the demo-frontend:basic image when you run docker images.

$ docker images
REPOSITORY     TAG     IMAGE ID      SIZE
demo-frontend  basic   703e723acecf  939MB
node           latest  b18afbdfc458  908MB

Next, run docker run to run our application.

$ docker run --name demo-frontend demo-frontend:basic

> frontend@1.0.0 serve /root
> http-server ./dist/

Starting up http-server, serving ./dist/
Available on:
  http://127.0.0.1:8080
Hit CTRL-C to stop the server

If you see the (rather rudimentary) interface at the URL printed by npm run serve (127.0.0.1:8080 in our example), then the application is running successfully.

Reducing the Number of Processes

With our Docker container running, we can run docker exec on another terminal to see what processes are running inside our container.

$ docker exec demo-frontend ps -eo pid,ppid,user,args --sort pid
  PID  PPID USER  COMMAND
    1     0 root  /bin/sh -c npm run serve
    6     1 root  npm
   17     6 root  sh -c http-server ./dist/
   18    17 root  node /root/node_modules/.bin/http-server ./dist/
   25     0 root  ps -eo pid,ppid,user,args --sort pid

We can see that a /bin/sh shell (with a process ID (PID) of 1) is invoked to execute npm (PID 6), which invokes another sh shell (PID 17) to run our serve npm script, which then executes the node command (PID 18) that we actually want.

The ps command is the same one that we are running with docker exec. It would not normally be running inside the container, and we can ignore it here.

That's a lot of processes that are not needed to run our application, and each one takes up a large amount of memory relative to the total memory usage of the container. It would be ideal if we could just run the node /root/node_modules/.bin/http-server ./dist/ command and nothing else.

Avoid using npm scripts

It's best not to use npm as the CMD command because, as you saw above, npm will invoke a sub-shell and execute the script inside that sub-shell, yielding a redundant process. Instead, you should specify the command directly as the value of our CMD instruction.

Update the CMD instruction inside your Dockerfile to invoke our node process directly.

FROM node
WORKDIR /root/
COPY . .
RUN npm install
RUN npm run build
CMD node /root/node_modules/.bin/http-server ./dist/

Next, try stopping our existing http-server instance by pressing Ctrl + C. Hmmm, it seems like it's not working! We will explain the reason shortly, but for now, run docker stop demo-frontend and docker rm demo-frontend on a separate terminal to stop and remove the container.

With a clean slate, let's build and run our image again.

$ docker build -t demo-frontend:no-npm .
$ docker run --name demo-frontend demo-frontend:no-npm

Once again, run docker exec on a separate terminal. This time, the number of processes has been reduced from 4 to 2.

$ docker exec demo-frontend ps -eo pid,ppid,user,args --sort pid
  PID  PPID USER  COMMAND
    1     0 root  /bin/sh -c node /root/node_modules/.bin/http-server ./dist/
    6     1 root  node /root/node_modules/.bin/http-server ./dist/
   13     0 root  ps -eo pid,ppid,user,args --sort pid

If we calculate the 'real memory' used by the container before and after the change, we find that we've saved ~16MB, just by removing the superfluous npm and sh processes.

If you are interested in how to calculate the 'real memory' usage, have a read around the topic of proportional set size (PSS).
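As a rough starting point, here is a minimal sketch of how PSS could be added up on Linux, assuming the kernel exposes a Pss: line for each mapping in /proc/<pid>/smaps. The function names sumPssKb and totalPssKb are made up for this illustration, and this is not necessarily how the ~16MB figure above was measured.

```javascript
const fs = require('fs');

// Sums every "Pss:" line found in the text of an smaps file, in kB.
// Lines such as "Pss_Anon:" or "SwapPss:" are deliberately not matched.
function sumPssKb(smapsText) {
  let total = 0;
  for (const line of smapsText.split('\n')) {
    const match = /^Pss:\s+(\d+) kB/.exec(line);
    if (match) total += Number(match[1]);
  }
  return total;
}

// Adds up the PSS of every process visible under /proc (Linux only).
function totalPssKb(procDir = '/proc') {
  let total = 0;
  for (const entry of fs.readdirSync(procDir)) {
    if (!/^\d+$/.test(entry)) continue; // only PID directories
    try {
      total += sumPssKb(fs.readFileSync(`${procDir}/${entry}/smaps`, 'utf8'));
    } catch (err) {
      // The process exited or is not readable by us; skip it.
    }
  }
  return total;
}

// Example (uncomment to run on Linux):
// console.log(`${totalPssKb()} kB`);
```

Bear in mind that running this inside the container adds a node process of its own to the total, so compare like with like when measuring before and after.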

However, our node command is still being run inside a /bin/sh shell. How do we get rid of that shell and invoke node as the first and only process inside our container? To answer that, we must understand and use the exec form syntax in our Dockerfile.

Using the Exec Form

Docker supports two different syntaxes for specifying instructions inside your Dockerfile - the shell form, which is what we've been using, and the exec form.

The exec form specifies the command and its options and arguments in the form of a JSON array, rather than a simple string. Our Dockerfile using the exec form would look like this:

FROM node
WORKDIR /root/
COPY . .
RUN ["npm", "install"]
RUN ["npm", "run", "build"]
CMD ["node", "/root/node_modules/.bin/http-server" , "./dist/"]

Shell vs. Exec Form

The practical difference is that with the shell form, Docker will implicitly invoke a shell and run the CMD command inside that shell (this is what we saw earlier). With the exec form, the command we specified is run directly, without first invoking a shell.
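To make the difference concrete, here is a small model of what ends up being executed on Linux for each form. It is only a sketch, not Docker's actual code, and toArgv is a name invented for this example.

```javascript
// Simplified model of how a CMD value becomes the argument vector that is
// executed inside a Linux container. Not Docker's real implementation.
function toArgv(instruction) {
  if (Array.isArray(instruction)) {
    // Exec form: the JSON array is used as-is, so element 0 becomes PID 1.
    return instruction;
  }
  // Shell form: the whole string is handed to /bin/sh -c, so the shell
  // becomes PID 1 and our command runs as its child.
  return ['/bin/sh', '-c', instruction];
}

console.log(toArgv('node /root/node_modules/.bin/http-server ./dist/'));
console.log(toArgv(['node', '/root/node_modules/.bin/http-server', './dist/']));
```

The first result matches the /bin/sh -c ... entry we saw at PID 1 in the earlier ps output.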

Again, stop and remove the existing demo-frontend container, update your Dockerfile to use the exec form, build it, run it, and run docker exec to query the container's process(es).

$ docker stop demo-frontend && docker rm demo-frontend
$ docker build -t demo-frontend:exec .
$ docker run --name demo-frontend demo-frontend:exec
$ docker exec demo-frontend ps -eo pid,ppid,user,args --sort pid
  PID  PPID USER     COMMAND
    1     0 root     node /root/node_modules/.bin/http-server ./dist/
   12     0 root     ps -eo pid,ppid,user,args --sort pid

Great, now the only process running inside our container is the node process we care about! We have successfully reduced the number of running processes to just one!

Signal Handling

However, saving a single process is not the reason why we prefer the exec form over the shell form. The real reason is signal handling.

On Linux, different processes can communicate with each other through inter-process communication (IPC). One method of IPC is signalling. If you use the command line, you've probably used signals without realizing it. For example, when you press Ctrl + C, you're actually instructing the kernel to send a SIGINT signal to the process, requesting it to stop.

Remember previously, when we tried to stop our container by pressing Ctrl + C, it didn't work. But now, let's try that again. With the demo-frontend:exec image running, try pressing Ctrl + C on the terminal running http-server. This time, the http-server stops successfully.

$ docker run --name demo-frontend demo-frontend:exec
Starting up http-server, serving ./dist/
Available on:
  http://127.0.0.1:8080
  http://172.17.0.2:8080
Hit CTRL-C to stop the server
^Chttp-server stopped.

So why did it work this time, but not earlier? This is because when we send the SIGINT signal from our terminal, we are actually sending it to the first process run inside the container. This process is known as the init process, and has the PID of 1.

Therefore, the init process must have the ability to listen for the SIGINT signal. When it receives the signal, it must try to shut down gracefully. For example, for a web server, the server must stop accepting any new requests, wait for any remaining requests to finish, and then exit.

With the shell form, the init process is /bin/sh. When /bin/sh receives the SIGINT signal, it'll simply ignore it. Therefore, our container and the http-server process won't be stopped.

When we run docker stop demo-frontend, the Docker daemon similarly sends a SIGTERM signal to the container's init process, but again, /bin/sh ignores it. After a time period of around 10 seconds, the Docker daemon realizes the container is not responding to the SIGTERM signal, and issues a SIGKILL signal, which forcefully kills the process. The SIGKILL signal cannot be handled; this means processes within the container do not get a chance to shut down gracefully. For a web server, it might mean that existing requests won't have a chance to run to completion, and your client might have to retry them.

If we measure the time it takes to stop a container where the init process is /bin/sh, you can see that it takes just over 10 seconds, which is the timeout period Docker will wait before sending a SIGKILL.

$ time docker stop demo-frontend

real  0m10.443s
user  0m0.072s
sys      0m0.022s

In comparison, when we use the exec form, node is the init process and receives the SIGINT and SIGTERM signals directly. Be aware that Linux does not apply the default signal actions to PID 1, so a handler has to be registered, for example with process.on('SIGINT'); http-server registers its own, which is why it printed "http-server stopped." earlier. The point is, with node as the first command, you have the ability to catch signals and handle them.
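If you are running your own server rather than http-server, a handler along the following lines is one way to shut down gracefully. Treat it as a minimal sketch: shutdownOnSignal and its proc parameter are hypothetical names chosen for this example, and proc exists only so the function can be exercised without sending real signals.

```javascript
// Registers handlers so that SIGINT and SIGTERM close the server gracefully.
// `proc` defaults to the real process object.
function shutdownOnSignal(server, proc = process) {
  for (const signal of ['SIGINT', 'SIGTERM']) {
    proc.on(signal, () => {
      console.log(`${signal} received, closing server`);
      // close() stops the server from accepting new connections and calls
      // back once the existing connections have ended.
      server.close(() => proc.exit(0));
    });
  }
}

// Usage with a hypothetical server (not the demo's http-server):
// const http = require('http');
// const server = http.createServer((req, res) => res.end('ok')).listen(8080);
// shutdownOnSignal(server);
```

Note that idle keep-alive connections can delay the close callback, so a real application may also want a timeout that forces an exit before Docker's own 10 second limit is reached.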

To demonstrate, with the new image built using the exec form Dockerfile, the container can be stopped in under half a second.

$ time docker stop demo-frontend

real  0m0.420s
user  0m0.053s
sys      0m0.026s

If the application you are running cannot handle signals, you should run docker run with the --init flag, which will execute tini as its first process. Unlike sh, tini is a minimalistic init system that can handle and propagate signals.

Caching Layers

So far, we've looked at techniques that improve the function of our Docker image whilst it's running. In this section, we'll look at how we can use Docker's build cache to make the build process faster.

When we run docker build, Docker runs the base image as a container, executes each instruction sequentially on top of it, saves the resulting state of the container as a layer, and uses that layer as the base for the next instruction. The final image is built this way, layer by layer.

You can conceptualize a layer as a diff from the previous layer.

(Taken from the About images, containers, and storage drivers page)

However, pulling or building an image from scratch every single time can be time-consuming. This is why Docker will try to use an existing, cached layer whenever possible. If Docker determines that the next instruction will yield the same result as an existing layer, it will use the cached layer.

For example, let's say we've updated something inside the src directory; when we run docker build again, Docker will use the cached layer associated with the FROM node and WORKDIR /root/ instructions.

When it gets to the COPY instruction, it will notice that the source code has changed, invalidate the cached layer, and build it from scratch. This also invalidates every layer that comes after it, so every instruction after the COPY instruction must be run again. In this instance, the build process takes ~10 seconds.

$ time docker build -t demo-frontend:exec .
Sending build context to Docker daemon    511kB
Step 1/6 : FROM node
 ---> a9c1445cbd52
Step 2/6 : WORKDIR /root/
 ---> Using cache
 ---> 7ac595062ce2
Step 3/6 : COPY . .
 ---> 3c2f3cfb6f92
Step 4/6 : RUN ["npm", "install"]

...
Successfully built 326bf48a8488
Successfully tagged demo-frontend:exec

real  0m10.387s
user  0m0.187s
sys   0m0.089s

However, making a small change in our source code (e.g. fixing a typo) shouldn't affect the dependencies of our application, so there's really no need to run npm install again. But because the cache is invalidated in an earlier step, every subsequent step must be re-run from scratch.

To optimize this, we should copy only what is needed for the next immediate step. This means if the next step is npm install, we should COPY only the package.json and package-lock.json, and nothing else.

Update our Dockerfile to copy only what is needed for the next immediate step:

FROM node
WORKDIR /root/
COPY ["package.json", "package-lock.json", "./"]
RUN ["npm", "install"]
COPY ["webpack.config.js", "./"]
COPY ["src/", "./src/"]
RUN ["npm", "run", "build"]
CMD ["node", "/root/node_modules/.bin/http-server" , "./dist/"]

By COPYing only what is needed immediately, we allow more layers of the image to be cached. Now, if we update the src directory again, every instruction and layer before COPY ["src/", "./src/"] is served from the cache.

$ time docker build -t demo-frontend:cache .
Step 1/8 : FROM node
Step 2/8 : WORKDIR /root/
 ---> Using cache
Step 3/8 : COPY ["package.json", "package-lock.json", "./"]
 ---> Using cache
Step 4/8 : RUN ["npm", "install"]
 ---> Using cache
Step 5/8 : COPY ["webpack.config.js", "./"]
 ---> Using cache
Step 6/8 : COPY ["src/", "./src/"]
Step 7/8 : RUN ["npm", "run", "build"]
...
Successfully tagged demo-frontend:cache

real    0m3.175s
user    0m0.193s
sys    0m0.132s

Now, instead of taking ~10 seconds to build, it takes only ~3 seconds (your mileage may vary, but a build that reuses cached layers is generally faster).

You can find more details on caching, including how Docker determines when a cache is invalidated, on the Dockerfile Best Practices page.
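To see why the order of the COPY instructions matters, here is a toy model of the cache. It assumes a layer's cache key depends on its parent layer, the instruction text, and the content of any copied files; this is a simplification of what Docker really checks, and the file name src/index.js and the function names are hypothetical.

```javascript
const crypto = require('crypto');

// A cache key is derived from the parent key, the instruction text, and
// (for COPY) the content of the files being copied.
function sha(...parts) {
  return crypto.createHash('sha256').update(parts.join('\n')).digest('hex');
}

// steps: [{ instruction, files?: [name] }], contents: { name: content }
function layerKeys(steps, contents) {
  const keys = [];
  let parent = '';
  for (const { instruction, files = [] } of steps) {
    const inputs = files.map((name) => `${name}=${contents[name]}`);
    parent = sha(parent, instruction, ...inputs);
    keys.push(parent);
  }
  return keys;
}

// A step is a cache hit only if its key is identical in both builds.
function cacheHits(steps, before, after) {
  const oldKeys = layerKeys(steps, before);
  const newKeys = layerKeys(steps, after);
  return oldKeys.map((key, i) => key === newKeys[i]);
}

const naive = [
  { instruction: 'COPY . .', files: ['package.json', 'src/index.js'] },
  { instruction: 'RUN npm install' },
  { instruction: 'RUN npm run build' },
];
const optimized = [
  { instruction: 'COPY package.json ./', files: ['package.json'] },
  { instruction: 'RUN npm install' },
  { instruction: 'COPY src/ ./src/', files: ['src/index.js'] },
  { instruction: 'RUN npm run build' },
];

// Only the source file changes between the two builds.
const before = { 'package.json': '{}', 'src/index.js': 'v1' };
const after = { 'package.json': '{}', 'src/index.js': 'v2' };

console.log(cacheHits(naive, before, after));     // [ false, false, false ]
console.log(cacheHits(optimized, before, after)); // [ true, true, false, false ]
```

In the naive ordering the source change sits in the very first key, so nothing after it can be reused; in the optimized ordering npm install keeps its cached layer.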

Using ENTRYPOINT and CMD together

Right now, the command that's run by docker run is specified by the CMD instruction. This command can be overridden by the user of the image (the one executing docker run). For example, if I want to use a different port (e.g. 4567) rather than the default (8080), then I can run:

$ docker run --name demo-frontend demo-frontend:cache node /root/node_modules/.bin/http-server ./dist/ -p 4567

Starting up http-server, serving ./dist/
Available on:
  http://127.0.0.1:4567
  http://172.17.0.2:4567
Hit CTRL-C to stop the server

However, we have to specify the whole command in its entirety. This requires the user of the image to know where the executable is located within the container (i.e. /root/node_modules/.bin/http-server). We should make it as easy as possible for the user to run our application. Wouldn't it be nice if they could run the containerized application in the same way as the non-containerized application?

You guessed it! We can!

Instead of using the CMD instruction only, we can use the ENTRYPOINT instruction to specify the default command and options to run, and use the CMD instruction to specify any additional options that are commonly overridden.

Update our Dockerfile to make use of the ENTRYPOINT instruction.

FROM node
WORKDIR /root/
COPY ["package.json", "package-lock.json", "./"]
RUN ["npm", "install"]
COPY ["webpack.config.js", "./"]
COPY ["src/", "./src/"]
RUN ["npm", "run", "build"]
ENTRYPOINT ["node", "/root/node_modules/.bin/http-server" , "./dist/"]

And build the image.

$ docker build -t demo-frontend:entrypoint .

Using this method, the user can run the image as if it were the http-server command, and does not need to know the underlying file structure of the container.

$ docker run --name demo-frontend demo-frontend:entrypoint -p 4567
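The reason the trailing -p 4567 ends up as an argument to http-server can be modelled as follows. This is a sketch of the exec-form behaviour only; resolveCommand is a made-up name, and the default CMD of -p 8080 is a hypothetical addition that the Dockerfile above does not contain.

```javascript
// Simplified model of how `docker run IMAGE [ARG...]` builds the final
// command when ENTRYPOINT and CMD both use the exec form.
function resolveCommand({ entrypoint = [], cmd = [] }, runArgs = []) {
  // Arguments given after the image name replace CMD entirely...
  const args = runArgs.length > 0 ? runArgs : cmd;
  // ...and are appended to ENTRYPOINT, which stays fixed.
  return [...entrypoint, ...args];
}

const image = {
  entrypoint: ['node', '/root/node_modules/.bin/http-server', './dist/'],
  cmd: ['-p', '8080'], // hypothetical default, not in the article's Dockerfile
};

// docker run demo-frontend:entrypoint
console.log(resolveCommand(image).join(' '));
// docker run demo-frontend:entrypoint -p 4567
console.log(resolveCommand(image, ['-p', '4567']).join(' '));
```

With a CMD like the hypothetical one here, the image would still start with sensible defaults when no arguments are given, while anything the user types after the image name takes its place.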

The command specified by the ENTRYPOINT instruction can also be overridden using the --entrypoint flag of docker run. For example, if we want to run a /bin/sh shell inside the container to explore, we can run:

$ docker run --name demo-frontend -it --entrypoint /bin/sh demo-frontend:entrypoint

# hostname
1b64852541eb

Using EXPOSE to document exposed ports

Lastly, let's finish up the first part of this article with some documentation. By default, our http-server listens on port 8080; however, a user of our image won't know this without looking up the documentation for http-server. Likewise, if we are running our own application, the user would have to look inside our implementation code to know which port the application listens on.

We can make it easier for the users by using an EXPOSE instruction to document which ports and protocol (TCP or UDP) the application expects to listen on. This way, the user can easily figure out which ports need to be published.

FROM node
WORKDIR /root/
COPY ["package.json", "package-lock.json", "./"]
RUN ["npm", "install"]
COPY ["webpack.config.js", "./"]
COPY ["src/", "./src/"]
RUN ["npm", "run", "build"]
ENTRYPOINT ["node", "/root/node_modules/.bin/http-server" , "./dist/"]
EXPOSE 8080/tcp

Once again, build the image using docker build.

$ docker build -t demo-frontend:expose .

Now a user can see which ports are exposed either by looking at the Dockerfile, or by using docker inspect on the image.

$ docker inspect --format '{{range $key, $value := .Config.ExposedPorts}}{{ $key }}{{end}}' demo-frontend:expose
8080/tcp

Note that the EXPOSE instruction does not publish the port. If the user wishes to publish the port, they would have to either:

  • use the -p flag on docker run to individually specify each host-to-container port mapping, or
  • use the -P flag to automatically map all exposed container ports to ephemeral high-order host ports

Summary

By following the 5 techniques outlined above, we have improved our Dockerfile and Docker image. However, this is only the beginning! Keep an eye out for the next part of this article, where we will reduce the size of our Docker image, learn to use labels, and lint our Dockerfile!

Author: Daniel Li

Daniel Li is a DevOps Engineer and Fullstack Node.js Developer, working with AWS, Ansible, Terraform, Docker, Kubernetes, and Node.js. He is the author of the book Building Enterprise JavaScript Applications, published by Packt.