Giving my Keynote Speech at CloudNative Days Tokyo 2023!

Presentation materials of the day

*These are the presentation materials as of the time of the event.

I am Aoki, Senior Manager of the Infrastructure Technology Department, PayPay Product Group.
The event CloudNative Days Tokyo 2023 was held on December 11 and 12, 2023, at Ariake Central Tower Hall & Conference.

On the second day, I gave a keynote speech titled “Infrastructure in PayPay” about the infrastructure behind PayPay’s services.
In this article, I would like to go over my keynote speech and share it both with those who were at the event and with those who were unfortunately unable to attend or watch it.

Almost all of PayPay’s infrastructure has been built on Amazon Web Services (AWS) since Day 1.
We mainly use Kubernetes as the application infrastructure, Kafka for asynchronous processing, and Amazon Aurora and Amazon DynamoDB—managed services provided by AWS—as data stores.
A distinctive characteristic of our system is that we build and run TiDB, a distributed database, on EC2.
TiDB is also available as a managed cloud service (TiDB Cloud), but at PayPay, our engineers handle everything from building to operating it as a self-hosted cluster.

This slide shows a simple timeline of the changes from Day 1 to the present, focusing on three points: regions, Kubernetes, and databases.
For some time after Day 1, we provided services from the AWS Tokyo Region and stored backup data remotely in the AWS Osaka Region. Then, from around 2021 to 2022, we started running applications in the AWS Osaka Region as well and moved to a multi-region configuration.
At the same time, we changed Kubernetes to a redundant multi-cluster configuration.
For relational databases, we used RDS MySQL on Day 1 but have since migrated to Aurora MySQL and now also use TiDB for some workloads.
In this presentation, I introduced two points: 1) TiDB and 2) multi-region, multi-cluster.

Here is the background to adopting TiDB. In 2019, Aurora could handle our data without problems, but when we considered the future growth of our services, concerns emerged about Aurora’s cluster size limit, table size limit, and write throughput under high load.
TiDB was less widely known then than it is today, but we compared several options and selected TiDB based mainly on the points listed here.

We first adopted TiDB for the transaction history database. As we accumulated know-how in building and operating it, we gradually expanded its deployment to payment flow and balance management.
As this shows, TiDB is used in mission-critical areas at PayPay.
Since then, rather than extending the use of TiDB to more areas, we have focused on operating large clusters more stably and on multi-region support for self-hosted clusters.

Here, I outlined some of the good points and challenges of using TiDB.
The upsides are that TiDB resolved our initial concerns, required fewer man-hours for application modifications than we had estimated, and mitigated the impact of instance failures.
The challenges are the difficulty of physically distributing workloads, which comes from running the system self-hosted, and controlling costs, which is the flip side of the great advantage of unlimited scaling.

We moved to multi-region because the AWS Osaka Local Region was expanded to a full Region in March 2021, which coincided with PayPay’s search for additional availability.
Although Kubernetes was a single cluster, it had been running in a multiple availability zone (multi-AZ) configuration since Day 1.
Deploying applications in the Osaka Region inevitably led to a multi-cluster configuration, so we decided to make the Tokyo side multi-cluster as well for availability and operational flexibility.
However, a multi-region, multi-cluster configuration means higher operational cost and complexity, so we must operate more efficiently and control costs.

This slide shows how PayPay has tried to maintain operational and cost efficiency in a multi-region, multi-cluster environment.
For multi-region, we have addressed the issue by leveraging Terraform modules, and for multi-cluster, we have created a mechanism to control jobs and the number of pods in each cluster.
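To give a rough idea of what controlling the number of pods across clusters can look like, here is a minimal sketch in Python. It is not PayPay's actual mechanism, which the keynote did not describe in detail. The function name split_replicas, the cluster names, and the weights are all hypothetical, and the sketch simply assumes that a total pod count is divided across clusters in proportion to integer weights, using the largest-remainder method so that the per-cluster counts always add up to the total.

```python
def split_replicas(total, weights):
    """Split `total` pods across clusters in proportion to integer `weights`.

    Hypothetical helper for illustration only. Uses the largest-remainder
    method: each cluster gets the floor of its proportional share, and any
    pods left over go to the clusters with the largest remainders.
    """
    if total < 0:
        raise ValueError("total must be non-negative")
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    weight_sum = sum(weights.values())
    if weight_sum <= 0:
        raise ValueError("at least one weight must be positive")

    alloc = {}
    remainders = {}
    for name, weight in weights.items():
        # Integer arithmetic avoids floating-point rounding surprises.
        alloc[name], remainders[name] = divmod(total * weight, weight_sum)

    # Hand out the pods that were lost to rounding down.
    leftover = total - sum(alloc.values())
    by_remainder = sorted(weights, key=lambda n: remainders[n], reverse=True)
    for name in by_remainder[:leftover]:
        alloc[name] += 1
    return alloc


if __name__ == "__main__":
    # Hypothetical cluster names and weights.
    print(split_replicas(10, {"tokyo-a": 2, "tokyo-b": 2, "osaka-a": 1}))
    # prints {'tokyo-a': 4, 'tokyo-b': 4, 'osaka-a': 2}
```

Setting a cluster's weight to 0 in this sketch moves all of its pods to the other clusters, which is one simple way to picture draining a cluster for maintenance.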

Five years have passed since the launch of PayPay, and the infrastructure has grown along with the service.
In particular, the two changes I introduced here, adopting TiDB and moving to a multi-cluster, multi-region, and multi-AZ infrastructure for high availability, significantly advanced PayPay’s infrastructure.

In closing

Due to time constraints, I was only able to provide an overview of the significant events in PayPay’s infrastructure to date rather than a detailed account of them.
For more details, Ninomiya-san from the Database Team talked about TiDB at HTAP Summit 2023 hosted by PingCAP, and Nishinaka-san from the Cloud Solutions Team spoke about multi-region at AWS Summit Tokyo 2023 hosted by Amazon Web Services.
We hope you will check them out as well.

As I wrote on the last slide, there are still many challenges in PayPay’s infrastructure.
If you are interested in our infrastructure and in services that support over 61 million users and deliver even greater convenience to the public, please apply.
We are actively looking for people willing to take on the infrastructure challenges to join our team!

Current job openings

Platform Engineer (DB Specialist)

*Job openings and employee affiliations are current as of the time of the interview.