
Twine: A Unified Cluster Management System for Shared Infrastructure

What is Twine

Paper

Why does Facebook need Twine

Limitations of existing systems (K8S)

Twine architecture

[Figure: Twine architecture]

Task (Pod)

One instance of an application deployed in a container.

Job (Deployment)

A group of tasks of the same application.
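The relationship between a task and a job can be sketched as a small data model. This is only an illustration of the two definitions above, not Twine's actual schema: the class fields (`image`, `index`) and the `resize` method are hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """One instance of an application in a container (roughly a K8S Pod)."""
    job_name: str
    index: int
    image: str  # hypothetical field: container image reference


@dataclass
class Job:
    """A group of tasks of the same application (roughly a K8S Deployment)."""
    name: str
    image: str
    tasks: list[Task] = field(default_factory=list)

    def resize(self, count: int) -> None:
        """Grow or shrink the job so it holds exactly `count` tasks."""
        if count < 0:
            raise ValueError("count must be non-negative")
        # Drop surplus tasks from the end, then create any missing ones.
        del self.tasks[count:]
        for i in range(len(self.tasks), count):
            self.tasks.append(Task(self.name, i, self.image))


job = Job(name="web", image="web:v1")
job.resize(3)
```

Every task in the job shares the job's name and image, which is what makes the job a group of tasks of the same application.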

Entitlement (Cluster)

[Figure: entitlement]

Agent (Kubelet)

Runs on every machine to manage tasks, like the kubelet.

Scheduler (Controller manager)

Allocator (Scheduler)

TaskController (no K8S equivalent)

ReBalancer

Resource Broker

Health Check Service

Sidekick

Service Resource Manager (HPA/VPA)

Auto-scale jobs in response to load changes.

How is a job deployed

How is a task redeployed

Machine failure, machine maintenance, or a rolling update can cause a task to be redeployed.

The redeployment is caused by machine unavailability

The redeployment is caused by rolling update

An example is updating the container image of a job: a new task is created and an old task is terminated.
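A rolling update of this kind can be sketched as replacing tasks in small batches. This is a minimal illustration, not how Twine implements it: the `max_unavailable` parameter and the dict-based task representation are assumptions made for the example, and terminating an old task and creating its replacement is modelled as swapping the image in place.

```python
def rolling_update(tasks, new_image, max_unavailable=1):
    """Replace task images in batches of at most `max_unavailable`.

    `tasks` is a list of dicts like {"id": 0, "image": "web:v1"}.
    Returns the batches of task ids in the order they were replaced.
    """
    if max_unavailable < 1:
        raise ValueError("max_unavailable must be at least 1")
    # Only tasks still running an older image need to be replaced.
    stale = [t for t in tasks if t["image"] != new_image]
    batches = []
    for start in range(0, len(stale), max_unavailable):
        batch = stale[start:start + max_unavailable]
        for task in batch:
            # Old task terminated, new task created with the new image.
            task["image"] = new_image
        batches.append([t["id"] for t in batch])
    return batches


tasks = [{"id": i, "image": "web:v1"} for i in range(5)]
batches = rolling_update(tasks, "web:v2", max_unavailable=2)
```

With five tasks and `max_unavailable=2`, the update proceeds in three batches, so at most two tasks are down at any point.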

How does a machine get moved from one entitlement to another

Note: This is from my personal understanding.

How does auto scaling work

Service Resource Manager is the component that uses a service's historical data to predict its traffic, so that it can automatically send job-resize requests to the front end and thereby achieve horizontal auto scaling.
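The predict-then-resize idea can be sketched as follows. This is a simplified illustration under stated assumptions, not Service Resource Manager's actual algorithm: the moving-average predictor, the `capacity_per_task` and `headroom` parameters, and both function names are hypothetical.

```python
import math


def predict_load(history, window=3):
    """Predict the next load value as the mean of the last `window` samples."""
    if not history:
        raise ValueError("history must not be empty")
    recent = history[-window:]
    return sum(recent) / len(recent)


def desired_task_count(history, capacity_per_task, headroom=0.2, min_tasks=1):
    """Number of tasks needed to serve the predicted load plus headroom."""
    predicted = predict_load(history)
    needed = math.ceil(predicted * (1 + headroom) / capacity_per_task)
    return max(min_tasks, needed)


# Hypothetical traffic history, in requests per second.
requests_per_second = [800, 900, 1000, 1100, 1200]
target = desired_task_count(requests_per_second, capacity_per_task=100)
```

Here the predicted load is the mean of the last three samples (1100 requests per second); with 20% headroom and 100 requests per second per task, `target` is 14. A resize request for that task count would then be sent to the front end.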

How does Twine control plane manage one million machines

Shard the core components

[Figure: Twine core components sharding]

Application level scheduler to offload core scheduler

FB Sharding vs. K8S Federation

[Figure: K8S federation]

[Figure: Twine sharding]

Comparison | Twine | K8S
Machine allocation | Dynamic | Static
Job metadata management | Within the same Scheduler and Allocator | Split and distributed

Availability

A single regional control plane does not handle high availability (HA) well. Twine has the following design principles: