AI Doesn't Just Read Texts, It "Sees" Them

Deepseek’s new OCR system processes text as images and compresses it by up to 10 times. This technology, capable of analyzing 33 million pages a day, allows AI to read much longer documents.

Deepseek, a Chinese artificial intelligence company, is attracting attention with its new OCR (Optical Character Recognition) system, developed for more efficient processing of text-based documents. The system compresses image-based text, enabling AI models to process much longer documents without hitting their memory limits.

Processing Text as Visual Data

According to Deepseek’s technical report, the system analyzes text data in image format instead of processing it directly. This approach significantly reduces the computational load. The new OCR system can compress text by up to 10 times while retaining 97% of the information.
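To make the 10x figure concrete, here is a minimal sketch that treats the compression ratio as the number of text tokens divided by the number of vision tokens used to represent the same page. That reading of "compression" is an assumption, as are the function names and the 1,000-token example page; none of them come from Deepseek's code.

```python
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """Ratio of text tokens to the vision tokens that stand in for them."""
    if vision_tokens <= 0:
        raise ValueError("vision_tokens must be positive")
    return text_tokens / vision_tokens


def text_tokens_covered(vision_tokens: int, ratio: float = 10.0) -> int:
    """Text tokens that a given vision-token budget could cover at a given ratio."""
    return int(vision_tokens * ratio)


# Illustrative page: 1,000 text tokens represented by 100 vision tokens.
print(compression_ratio(text_tokens=1000, vision_tokens=100))  # 10.0
print(text_tokens_covered(vision_tokens=100))                   # 1000
```

At the article's upper bound of 10x, a budget of 100 vision tokens would stand in for roughly 1,000 text tokens, with about 97% of the information retained according to the report.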

As is well known, large language models represent text as tokens, with each token containing several characters. Researchers are working to develop models that can process long documents and conversations exceeding tens of millions of tokens, thereby expanding the context window. However, as the number of tokens that can be processed at once increases, so do the computational costs. A large token capacity therefore keeps the model’s memory from filling up even with long documents, but it raises the cost. Deepseek’s OCR solution, by contrast, processes very long content as if it were an image, effectively viewing the content as pixels.

Seeing Long Texts as Pixels

The core of the system consists of two main components: DeepEncoder and Deepseek3B-MoE. DeepEncoder, which handles the image processing, operates with 380 million parameters. Deepseek3B-MoE, responsible for text generation, has 570 million active parameters. DeepEncoder combines Meta’s 80-million-parameter SAM (Segment Anything Model) and OpenAI’s 300-million-parameter CLIP model. An intermediate 16x compressor significantly reduces the image data, increasing processing speed. For example, the 4,096 tokens of a 1,024 × 1,024 pixel image are reduced to only 256 tokens after compression.
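The token arithmetic in this example can be checked with a short sketch. The 16-pixel patch size is an assumption chosen because it reproduces the article's figure of 4,096 tokens for a 1,024 × 1,024 image; the function name is hypothetical.

```python
def vision_token_counts(width: int, height: int,
                        patch_size: int = 16,   # assumed patch size, not stated in the article
                        compression: int = 16   # the 16x compressor described above
                        ) -> tuple[int, int]:
    """Return (patch tokens before compression, tokens after compression)."""
    patch_tokens = (width // patch_size) * (height // patch_size)
    return patch_tokens, patch_tokens // compression


before, after = vision_token_counts(1024, 1024)
print(before, after)  # 4096 256
```

With these assumptions, the image splits into a 64 × 64 grid of patches, and dividing the 4,096 patch tokens by 16 gives the 256 tokens that reach the text-generating model.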

Deepseek OCR can operate using between 64 and 400 “vision tokens,” depending on the resolution. This significantly lightens operations that typically require thousands of tokens in classic OCR systems. In OmniDocBench tests, the system outperformed GOT-OCR 2.0 using only 100 vision tokens. It also surpassed the performance of MinerU 2.0, which required over 6,000 tokens, while using fewer than 800 tokens.

The system, optimized for different document types, uses 64 tokens for simple presentations, 100 tokens for books and reports, and 800 tokens, in a special mode called “Gundam mode,” for complex newspapers.

Deepseek OCR can process not only text but also complex visual elements such as diagrams, chemical formulas, and geometric shapes. In addition, it works in roughly 100 languages, can preserve formatting, and can generate plain text or general image descriptions if desired.
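The per-document token budgets mentioned above can be summarized as a simple lookup. This is only an illustrative sketch: the dictionary keys and the function name are hypothetical labels, not options exposed by Deepseek's software; only the token counts come from the article.

```python
# Vision-token budgets per document type, as reported in the article.
# The keys are hypothetical labels chosen for this sketch.
TOKEN_BUDGETS = {
    "presentation": 64,      # simple presentations
    "book_or_report": 100,   # books and reports
    "newspaper": 800,        # complex newspapers ("Gundam mode")
}


def vision_token_budget(doc_type: str) -> int:
    """Look up the vision-token budget for a document type."""
    try:
        return TOKEN_BUDGETS[doc_type]
    except KeyError:
        raise ValueError(f"unknown document type: {doc_type!r}") from None


for doc_type in TOKEN_BUDGETS:
    print(doc_type, vision_token_budget(doc_type))
```

Even the largest budget here, 800 tokens, is well under the more than 6,000 tokens the article attributes to MinerU 2.0.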

Processes 33 Million Pages a Day

Roughly 30 million PDF pages were used to train the system. About 25 million of these were English and Chinese documents. The training data also included 10 million synthetic diagrams, 5 million chemical formulas, and 1 million geometric shapes.

In real-world use, Deepseek OCR achieves very high throughput. The system can process over 200,000 pages a day on a single Nvidia A100 GPU. With 20 servers, each housing eight A100 GPUs, this capacity rises to 33 million pages per day. This speed could greatly facilitate the production of training data for new AI models. Both the code and the model weights are publicly available (accessible via the source section).
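The cluster figure follows from the per-GPU figure by simple multiplication, as the sketch below shows. It assumes throughput scales linearly with the number of GPUs, which the article implies but does not state; the constant and function names are hypothetical.

```python
PAGES_PER_GPU_PER_DAY = 200_000  # "over 200,000", so this is a lower bound
GPUS_PER_SERVER = 8
SERVERS = 20


def cluster_pages_per_day(pages_per_gpu: int, gpus_per_server: int, servers: int) -> int:
    """Daily throughput, assuming perfectly linear scaling across GPUs."""
    return pages_per_gpu * gpus_per_server * servers


total_gpus = GPUS_PER_SERVER * SERVERS
lower_bound = cluster_pages_per_day(PAGES_PER_GPU_PER_DAY, GPUS_PER_SERVER, SERVERS)
implied_per_gpu = 33_000_000 / total_gpus  # per-GPU rate implied by the 33 million figure

print(total_gpus)       # 160
print(lower_bound)      # 32000000
print(implied_per_gpu)  # 206250.0
```

The 160 GPUs give at least 32 million pages a day at exactly 200,000 pages per GPU. The quoted 33 million corresponds to about 206,250 pages per GPU, consistent with "over 200,000."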