
What is DeepSeek? Explain Training Algorithm
Have you heard about the "Sputnik moment" in AI? 20 January 2025 is the date that some people in the AI industry believe marks that moment.
Sputnik, as you might know, was the first artificial Earth satellite. It was launched into an elliptical low Earth orbit by the Soviet Union on 4 October 1957 as part of the Soviet space program. This launch was significant because it began what we now know as the space race between the two superpowers of the time: the USA and the Soviet Union.
Similarly, on 20 January 2025, a Chinese artificial intelligence (AI) startup released an open-source large language model (LLM) called DeepSeek-R1 to the general public. There has been a growing feeling since then that this launch will begin an AI race between two modern-day superpowers. Obviously, as AI technocrats, we don't know whether the DeepSeek-R1 launch will spark the same kind of race for AI that played out over 50 years ago between the Soviet Union and the United States for space; we will leave that topic to the politicians and instead focus on the underlying technology behind DeepSeek in this article.
Jump in... let's explore what DeepSeek is and the underlying technology used to power DeepSeek-R1.
Introduction
Basics first. What is DeepSeek?
Well, simply put, it is a Chinese startup working in the artificial intelligence (AI) space. It is a relatively new company, established in 2023 by Liang Wenfeng, who co-founded it and also serves as its CEO. When you visit DeepSeek's website, you will find that they are also doing research in the following areas:
- DeepSeek-LLM is an advanced language model comprising 67 billion parameters. This model has been trained from scratch on a vast dataset of 2 trillion tokens in both English and Chinese.
- DeepSeek Coder is composed of a series of code language models, each trained from scratch on 2T tokens, with a composition of 87% code and 13% natural language in both English and Chinese. As the name suggests, DeepSeek Coder can generate code snippets and complete partial code segments across multiple programming languages. It supports popular programming languages such as C, C++, Java, Python, JavaScript, and many more. So if you are using any of these programming languages, DeepSeek Coder can be super helpful to you by facilitating rapid development and prototyping.
- DeepSeek Math, as you might guess, is an AI system designed to solve complex mathematical problems. It is based on a large language model (LLM) that has been trained on a massive dataset of mathematical text and code.
Besides the above research areas, the DeepSeek startup is also working on a few other AI research fields, which illustrates its commitment to AI innovation and research.
In this article, we will take a deep dive into the DeepSeek model offering that made all the headlines in January 2025: the DeepSeek-R1 model.
Foundation model and DeepSeek-R1
Foundation models are the backbone of AI-powered systems. So, for example, if you want to develop a chatbot for banking needs, you would need an appropriate text-based foundation model, such as OpenAI's GPT-4.
But if you want to identify objects in pictures and videos, do facial recognition, or generate images and thumbnails for your YouTube videos, a vision model is the more appropriate choice. You might want to use models such as DALL-E, or similar, which are trained to understand and work with visual data.
OK, so now that we know the foundation model's core purpose, let's understand DeepSeek-R1 and its technical capabilities.
DeepSeek-R1 is a reasoning model (the R in DeepSeek-R1 stands for reasoning). Hmm... what is a reasoning model, and how is it different from a text-generation model such as GPT-4 or an image-generation model such as DALL-E?
Well, as the name might hint, a reasoning model is a specialized AI system explicitly designed to perform structured, logical, multi-step problem-solving. This is very different from general-purpose foundation models such as GPT-4, which are superb at pattern recognition and new content generation but lack the same reasoning prowess.
Reasoning models focus on tasks that require the following characteristics:
- Step-by-step deduction: similar to solving a logical puzzle, where you break down the bigger goal into smaller steps to reach the final answer.
- Causal understanding, such as "if X happens, then Y occurs because of these reasons."
- Planning and decision-making, where models focus on optimizing the workflow by using various strategies.
- Explainability of the response, where the model focuses on providing transparent reasoning behind the generated answers.
Now that we know that DeepSeek-R1 is a reasoning model, let's go deeper and understand its underlying training algorithm.
Learning approach behind DeepSeek-R1
To understand and appreciate the true significance of DeepSeek-R1, we need to know a few key concepts and terms.
The first and foremost is how foundation models are trained. As you can imagine, there are multiple ways to train foundation models. Some of the most common ones are mentioned below:
- Supervised Learning - In this kind of learning, the foundation model is trained on labeled data, where each input has a corresponding correct output. This approach is effective for tasks like classification and translation but requires a large amount of annotated data.
- Unsupervised Learning - In this kind of learning the model is trained on raw, unlabeled data, learning patterns and structures without explicit guidance. This is common in large language models (LLMs) like GPT, which predict missing text segments using next-token or masked-token predictions.
- Self-Supervised Learning (SSL) - This is a subset of unsupervised learning where the model generates its own labels from the data and hence the name self-supervised learning.
- Reinforcement Learning (RL) - In this type of learning, the model learns by interacting with an environment and receiving feedback in the form of rewards. Reinforcement Learning focuses on trial-and-error exploration: the model discovers optimal behaviors by experimenting with different actions and learning from their consequences (a toy sketch contrasting this with supervised learning follows this list).
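To make the contrast concrete, here is a toy sketch in Python (PyTorch). It is purely illustrative and not DeepSeek's code: the vocabulary size, tensor shapes, and placeholder reward are assumptions, but it shows the essential difference between a supervised next-token loss (every position has a known label) and a reinforcement-style loss (only a scalar reward for the sampled sequence guides the update).

```python
# Toy sketch (illustrative only, not DeepSeek's code): supervised next-token loss
# versus a REINFORCE-style, reward-weighted loss on the same toy logits.
import torch
import torch.nn.functional as F

vocab_size = 8
logits = torch.randn(1, 5, vocab_size, requires_grad=True)  # toy model output: (batch, seq_len, vocab)

# Supervised learning: cross-entropy against known target tokens at every position.
targets = torch.randint(0, vocab_size, (1, 5))
supervised_loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))

# Reinforcement learning: sample a sequence, score it with a single scalar reward,
# and weight the sequence's log-probability by that reward.
probs = F.softmax(logits, dim=-1)
sampled = torch.multinomial(probs.view(-1, vocab_size), 1).view(1, 5)
log_probs = F.log_softmax(logits, dim=-1).gather(-1, sampled.unsqueeze(-1)).squeeze(-1)
reward = 1.0  # placeholder: e.g. 1.0 if the sampled answer passes a correctness check
rl_loss = -(reward * log_probs.sum())

print(supervised_loss.item(), rl_loss.item())
```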
Now, each of these learning approaches has its own pros and cons, and there is no one-size-fits-all. You might be wondering which of these is used by DeepSeek-R1, allowing it to perform better and still remain more cost-effective than some of the other models. The short answer is that DeepSeek-R1 uses a multi-stage training process, which combines reinforcement learning with supervised learning as a crucial initial step.
In other words, the large language model behind DeepSeek-R1 is first pre-trained using supervised learning on a massive dataset. DeepSeek-R1 also leverages Group Relative Policy Optimization (GRPO), a reinforcement learning-based algorithm, and this combination is what allows the model to address issues like readability and language coherence.
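The central trick in GRPO is to sample a group of answers for the same prompt, score them, and normalize each reward against the group's own mean and standard deviation, instead of training a separate value (critic) network. Below is a minimal sketch of that advantage computation; the helper name and reward values are my assumptions for illustration, not DeepSeek's implementation.

```python
# Minimal sketch (simplified, not DeepSeek's implementation) of the group-relative
# advantage at the heart of GRPO: each sampled answer's reward is normalized
# against the statistics of its own group.
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: shape (group_size,), one scalar reward per sampled answer."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four answers sampled for the same prompt, scored by a rule-based checker
# (1.0 = correct, 0.0 = incorrect).
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
print(group_relative_advantages(rewards))  # correct answers get a positive advantage

# During training, each answer's token log-probabilities are then scaled by its
# advantage inside a PPO-style clipped objective (not shown in this sketch).
```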
Training algorithm of DeepSeek-R1 in depth
The key intuition behind DeepSeek-R1 can be summarized as follows:
The foundation model's reasoning capabilities can be significantly improved through large-scale reinforcement learning (RL), even without using supervised fine-tuning (SFT) as a cold start. Furthermore, the model's performance can be enhanced with the inclusion of a small amount of supervised cold-start data.
Let's break it down further.
Many other foundation models have heavily relied on large amounts of supervised data to enhance model performance. It is well known that gathering high-quality supervised data can be a time-intensive and costly affair. Labeled data requires real human experts in the field to annotate, verify, and curate datasets, which obviously takes time and financial resources.
The predecessor of DeepSeek-R1 was DeepSeek-R1-Zero, which was the initial experiment by the researchers at DeepSeek.
The researchers trained DeepSeek-R1-Zero using only reinforcement learning (RL), with no supervised fine-tuning at the beginning. Although DeepSeek-R1-Zero used Group Relative Policy Optimization (GRPO) to refine reasoning and language understanding, it still faced challenges like poor readability and language inconsistency due to the lack of initial supervision.
Over multiple attempts, the researchers working on DeepSeek-R1-Zero realized that adding a small amount of supervised data at the beginning helped the model learn better initial representations before RL took over. This supervised "cold-start" data helped establish a stronger foundation, making RL more efficient and improving overall model performance.
Oh wow! So adding cold-start data is promising, but how do you prepare such data?
Again, they used multiple approaches like few-shot prompting with a long Chain of Thought, directly prompting models to generate detailed answers with reflection and verification, gathering DeepSeek-R1-Zero outputs in a readable format, and refining the results through post-processing by human annotators.
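As a concrete illustration of the first of these approaches, the sketch below assembles a few-shot prompt whose worked example carries its chain of thought, so the queried model imitates that reasoning style. The example data and helper names are hypothetical, chosen only to illustrate the idea.

```python
# Illustrative sketch (hypothetical data, not the actual DeepSeek pipeline) of
# few-shot prompting with a chain of thought: each worked example shows its
# reasoning before the answer, and the new question is appended at the end.
FEW_SHOT_EXAMPLES = [
    {
        "question": "What is 17 * 24?",
        "reasoning": "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408.",
        "answer": "408",
    },
]

def build_cot_prompt(question: str) -> str:
    """Prepend worked examples (question, reasoning, answer) before the new question."""
    parts = [
        f"Question: {ex['question']}\nReasoning: {ex['reasoning']}\nAnswer: {ex['answer']}"
        for ex in FEW_SHOT_EXAMPLES
    ]
    parts.append(f"Question: {question}\nReasoning:")
    return "\n\n".join(parts)

print(build_cot_prompt("What is 23 * 19?"))
```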
In a nutshell, they collected thousands of cold-start examples to fine-tune the base model as the starting point for RL. This approach placed a clear emphasis on two key areas:
- Readability: A key limitation found with DeepSeek-R1-Zero was that its content was often not very suitable for reading. But why? It was because DeepSeek-R1-Zero's responses would mix languages and lack proper formatting (like markdown) to clearly present the answer, making it harder for end users to fully understand the output. To overcome this, the researchers, when creating cold-start data for DeepSeek-R1, designed a readable pattern that includes a summary at the end of each response and filters out responses that are not reader-friendly. They defined the output format as |special_token|<reasoning_process>|special_token|<summary> (a small sketch of this pattern follows this list).
- Potential: By carefully designing the pattern for cold-start data with human priors, the researchers observed far better performance compared to DeepSeek-R1-Zero. They believed that training in stages, refining the model over time, is a more effective approach for developing reasoning models.
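To make that readable pattern a bit more tangible, here is a minimal sketch of how a cold-start example could be assembled and filtered in that format. The helper names and the toy filter are my assumptions; only the |special_token|<reasoning_process>|special_token|<summary> layout comes from the description above.

```python
# Minimal sketch (helper names and filter are assumptions) of the readable cold-start
# pattern: the reasoning process and the final summary are wrapped with a special
# token so both sections are explicit and easy to separate.
SPECIAL_TOKEN = "|special_token|"

def format_cold_start_example(reasoning: str, summary: str) -> str:
    """Lay out a reasoning trace and its summary in the readable pattern."""
    return f"{SPECIAL_TOKEN}{reasoning}{SPECIAL_TOKEN}{summary}"

def is_reader_friendly(text: str) -> bool:
    """Toy filter: keep only examples that contain both sections and a non-empty summary."""
    parts = text.split(SPECIAL_TOKEN)
    return len(parts) == 3 and parts[2].strip() != ""

example = format_cold_start_example(
    reasoning="Step 1: restate the problem. Step 2: work through the algebra carefully.",
    summary="The final answer is 42.",
)
print(example)
print(is_reader_friendly(example))
```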
Thus, the DeepSeek-R1 training approach highlights the importance of readability in AI-generated content and describes a method for improving it by structuring the training data with a clear, readable pattern that includes both the reasoning process and a summary.
Conclusion
In this article, we covered DeepSeek's DeepSeek-R1, a reasoning model, and how its reasoning abilities are improved by applying a multi-stage training process that combines reinforcement learning with supervised learning as a crucial initial step.
The older version, DeepSeek-R1-Zero, represents a pure RL approach without relying on cold-start data, but it suffers from readability issues.
DeepSeek-R1 is more powerful, leveraging supervised cold-start data alongside iterative RL; it is far more cost-effective and achieves performance comparable to OpenAI-o1-1217 on a range of tasks.