- Khan Academy is a US-based nonprofit education provider that teaches math, art, and software engineering for free, similar to Coursera. It is aimed at middle school and high school students.
- It started as a monolith written in Python 2. Python 2 reached end of life on January 1, 2020, so the company faced an existential threat: the language its whole codebase was written in would no longer be supported.
- This influenced the decision to break the monolith into microservices. There were other issues as well: the Python monolith was becoming more expensive to operate.
- GraphQL was introduced in 2017. This really helped the migration to microservices because the client-facing interface is separated from the backend REST APIs: a field of data can be served from any number of backend services.
- I recall using Khan Academy around 2013; the website seems a lot faster now.
- The new architecture looks like below:

Choosing the New Programming Language
- The three options were Python 3, Kotlin, and Golang. They chose Golang for its performance, lower memory usage, and better support across different editors. Porting to Python 3 would have made for a faster migration, but the goal was to be able to scale for many years, so they chose a much more performant language. It was a deliberate trade-off in favor of long-term cost savings.
Planning Work To Be Done
- Work was initially estimated using heuristics: count the lines of Python code behind each service, and use that count to estimate the effort of creating the corresponding new service.
- Broke down work not by MVP (minimum viable product), but by MVE (minimum viable experience). The experiences were things like progress tracking, user management, content publishing, and more.
- Some tooling that had been built for Python had to be changed for Go.
- About 1 million lines of Python code and 40+ services had to be migrated.
- Balancing engineers' time between new product work and the migration was a challenge, as the migration was seen as “tech debt”.
- Product and design features often had to be pushed back by Engineering because of the migration.
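
The line-count heuristic above can be sketched in a few lines of Go. This is only an illustration of the arithmetic: the service names, line counts, and the `linesPerEngineerWeek` porting rate are all made-up assumptions, not figures from Khan Academy.

```go
package main

import (
	"fmt"
	"sort"
)

// linesPerEngineerWeek is a hypothetical porting rate, chosen only for illustration.
const linesPerEngineerWeek = 1500.0

// estimateWeeks converts a Python line count into engineer-weeks at the assumed rate.
func estimateWeeks(pythonLines int) float64 {
	return float64(pythonLines) / linesPerEngineerWeek
}

func main() {
	// Hypothetical services and line counts; the real breakdown is not in these notes.
	services := map[string]int{
		"progress": 60000,
		"content":  45000,
		"users":    30000,
	}

	// Sort names so the report is printed in a stable order.
	names := make([]string, 0, len(services))
	for name := range services {
		names = append(names, name)
	}
	sort.Strings(names)

	total := 0.0
	for _, name := range names {
		weeks := estimateWeeks(services[name])
		total += weeks
		fmt.Printf("%-10s %6d lines  ~%.1f engineer-weeks\n", name, services[name], weeks)
	}
	fmt.Printf("total ~%.1f engineer-weeks\n", total)
}
```

A single flat rate is the crudest possible model. In practice the rate would probably be calibrated against the first few services that were actually ported.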

Migration Strategy
- The exact behavior of the monolith is migrated to Golang. If there are bugs, they are migrated as well, to prevent scope creep.
- Dual services are running at the same time.
- A portion of the traffic goes to the new services to test the waters (a canary migration).
- Both Python and Go responses are logged. If the responses differ, only the Python response is returned.
- Remove the old Python service once the migration is seen as complete.
- Wrote automated tests using a BDD (behavior-driven development) approach.
- The migration had a fixed scope and a fixed timeline. In hindsight, they saw this as the right approach. When certain migrations were not meeting the timeline, engineers were moved around.
- The massive burn-down chart was a good way to gauge progress. They ended up finishing the migration 4 days before the fixed deadline. That’s crazy. Work with a fixed timeline is no longer considered an “agile” approach, but it drastically helped in ensuring ownership of the end goal and in communicating priorities and deadlines across teams.
- Internal tools were not ported one by one. Sometimes things were rebuilt from scratch, which ended up taking more time.
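
To make the side-by-side step above concrete, here is a minimal Go sketch, not Khan Academy's actual code. The `Backend` type, the `SideBySide` struct, and its fields are hypothetical names invented for this example. As a simplification, the Python response is always returned, whether or not the two responses match.

```go
package main

import (
	"bytes"
	"fmt"
	"log"
	"math/rand"
)

// Backend is a hypothetical stand-in for a call to one implementation of a service.
type Backend func(request string) ([]byte, error)

// SideBySide sends a sampled fraction of requests to both backends, logs any
// mismatch, and always returns the Python response.
type SideBySide struct {
	Python   Backend
	Go       Backend
	Fraction float64                      // share of requests also sent to Go, 0.0 to 1.0
	Sample   func() float64               // returns a value in [0, 1); injectable for tests
	Logf     func(string, ...interface{}) // injectable logger
}

// Handle serves one request from Python and, for sampled requests, compares it with Go.
func (s SideBySide) Handle(request string) ([]byte, error) {
	pyResp, pyErr := s.Python(request)
	if s.Sample() >= s.Fraction {
		// Not sampled: only the old service sees this request.
		return pyResp, pyErr
	}
	goResp, goErr := s.Go(request)
	// A mismatch is either one side failing alone or differing response bodies.
	mismatch := (pyErr == nil) != (goErr == nil) || !bytes.Equal(pyResp, goResp)
	if mismatch {
		s.Logf("mismatch for %q: python=%q (err=%v) go=%q (err=%v)",
			request, pyResp, pyErr, goResp, goErr)
	}
	return pyResp, pyErr
}

func main() {
	s := SideBySide{
		Python:   func(r string) ([]byte, error) { return []byte(`{"streak":3}`), nil },
		Go:       func(r string) ([]byte, error) { return []byte(`{"streak":4}`), nil },
		Fraction: 0.1, // hypothetical: compare roughly 10% of requests
		Sample:   rand.Float64,
		Logf:     log.Printf,
	}
	resp, err := s.Handle("GET /progress")
	fmt.Println(string(resp), err)
}
```

Comparing raw bytes is the strictest possible check. A real comparison would likely need to ignore harmless differences such as JSON key order or timestamps.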
Takeaways:
- The migration took about 3.5 years.
- Porting by MVE, a way of categorising work into self-contained experiences, worked well.
- Porting as directly as possible meant timeline estimates slipped very little.
- Hard deadlines motivated people to work around them and coordinate better. They also helped in finding the critical path.
- A big component of the success was planning: work estimates, resources, skill sets, time off, dependencies, and parallelism. Detailed measuring helped identify the points where they had to go slow in order to go fast.
- No scope creep: no new features were added during the migration.
- Engineers were swapped around frequently to meet timelines.
- Side by side testing.
References:
- https://blog.khanacademy.org/go-services-one-goliath-project/
- https://blog.khanacademy.org/the-great-python-refactor-of-2017-and-also-2018/
- https://blog.khanacademy.org/slicker-a-tool-for-moving-things-in-python/
- https://blog.khanacademy.org/untangling-our-python-code/
- https://newsletter.pragmaticengineer.com/p/real-world-eng-8
- https://userpilot.com/blog/minimum-viable-experience-mve/#:~:text=By%20Minimum%20Viable%20Product%20(MVP,happen%20at%20the%20same%20time. (MVE vs MVP)
