Commit 5ccb3d9

Update to Tesseract.js Version 4 (#691)
See #662 for explanation of Tesseract.js Version 4 changes. List below is auto-generated from commits.

* Added image preprocessing functions (rotate + save images)
* Updated createWorker to be async
* Reworked createWorker to be async and throw errors per #654
* Reworked createWorker to be async and throw errors per #654
* Edited detect to return null when detection fails rather than throwing error per #526
* Updated types per #606 and #580 (#663) (#664)
* Removed unused files
* Added savePDF option to recognize per #488; cleaned up code for linter
* Updated download-pdf example for node to use new savePDF option
* Added OutputFormats option/interface for setting output
* Allowed for Tesseract parameters to be set through recognition options per #665
* Updated docs
* Edited loadLanguage to no longer overwrite cache with data from cache per #666
* Added interface for setting 'init only' options per #613
* Wrapped caching in try block per #609
* Fixed unit tests
* Updated setImage to resolve memory leak per #678
* Added debug output option per #681
* Fixed bug with saving images per #588
* Updated examples
* Updated readme and Tesseract.js-core version
1 parent 80aef15 commit 5ccb3d9

33 files changed, with 2068 additions and 2076 deletions

.eslintrc

Lines changed: 4 additions & 1 deletion
@@ -12,6 +12,9 @@
     "no-console": 0,
     "global-require": 0,
     "camelcase": 0,
-    "no-control-regex": 0
+    "no-control-regex": 0,
+    // Airbnb disallows ForOfStatement based on the bizarre belief that loops are not readable
+    // https://github.com/airbnb/javascript/issues/1271
+    "no-restricted-syntax": ["error", "ForInStatement", "LabeledStatement", "WithStatement"]
   }
 }
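
The override above re-declares `no-restricted-syntax` without `ForOfStatement`, so `for...of` loops should pass linting while `for...in`, labeled statements, and `with` remain errors. A minimal sketch of the kind of loop this permits; `countByLanguage` and the shape of `jobs` are hypothetical and do not come from this commit:

```javascript
// Hypothetical helper (not part of this commit): tallies OCR jobs per language.
// The `for...of` loop is what the Airbnb preset restricts by default and what
// the override in .eslintrc allows again.
function countByLanguage(jobs) {
  const counts = {};
  for (const job of jobs) {
    counts[job.lang] = (counts[job.lang] || 0) + 1;
  }
  return counts;
}
```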

README.md

Lines changed: 11 additions & 2 deletions
@@ -46,12 +46,11 @@ Or more imperative
 ```javascript
 import { createWorker } from 'tesseract.js';
 
-const worker = createWorker({
+const worker = await createWorker({
   logger: m => console.log(m)
 });
 
 (async () => {
-  await worker.load();
   await worker.loadLanguage('eng');
   await worker.initialize('eng');
   const { data: { text } } = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png');
@@ -62,6 +61,16 @@ const worker = createWorker({
 
 [Check out the docs](#documentation) for a full explanation of the API.
 
+## Major changes in v4
+Version 4 includes many new features and bug fixes--see [this issue](https://github.com/naptha/tesseract.js/issues/662) for a full list. Several highlights are below.
+
+- Added rotation preprocessing options (including auto-rotate) for significantly better accuracy
+- Processed images (rotated, grayscale, binary) can now be retrieved
+- Improved support for parallel processing (schedulers)
+- Breaking changes:
+    - `createWorker` is now async
+    - `getPDF` function replaced by `pdf` recognize option
+
 ## Major changes in v3
 - Significantly faster performance
     - Runtime reduction of 84% for Browser and 96% for Node.js when recognizing the [example images](./examples/data)
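
The `createWorker` change in this README diff affects every worker setup. A minimal migration sketch, assuming the v4 behaviour shown here (`createWorker` returns a promise and `worker.load()` is no longer called). `recognizeText` is a hypothetical helper, and `createWorker` is passed in as an argument so the sketch does not depend on how the library is imported:

```javascript
// Hypothetical helper, not part of Tesseract.js.
// `createWorker` is expected to be the factory exported by tesseract.js v4.
async function recognizeText(createWorker, image, lang = 'eng') {
  // v3: const worker = createWorker(); await worker.load();
  // v4: the factory itself is awaited and there is no separate load() step.
  const worker = await createWorker();
  try {
    await worker.loadLanguage(lang);
    await worker.initialize(lang);
    const { data: { text } } = await worker.recognize(image);
    return text;
  } finally {
    // Release the worker even if recognition throws.
    await worker.terminate();
  }
}
```

In a CommonJS script (one that uses `require`), top-level `await` is not available, so the call needs an enclosing async function or a `.then(...)` handler, for example `recognizeText(createWorker, url).then(console.log)`. In an ES module the call can be awaited directly.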

docs/api.md

Lines changed: 10 additions & 34 deletions
@@ -1,7 +1,6 @@
 # API
 
 - [createWorker()](#create-worker)
-  - [Worker.load](#worker-load)
   - [Worker.writeText](#worker-writeText)
   - [Worker.readText](#worker-readText)
   - [Worker.removeFile](#worker-removeFile)
@@ -53,7 +52,7 @@ createWorker is a factory function that creates a tesseract worker, a worker is
 
 ```javascript
 const { createWorker } = Tesseract;
-const worker = createWorker({
+const worker = await createWorker({
   langPath: '...',
   logger: m => console.log(m),
 });
@@ -63,7 +62,6 @@ const worker = createWorker({
 
 A Worker helps you to do the OCR related tasks, it takes few steps to setup Worker before it is fully functional. The full flow is:
 
-- load
 - FS functions // optional
 - loadLanguauge
 - initialize
@@ -82,23 +80,6 @@ Each function is async, so using async/await or Promise is required. When it is
 
 jobId is generated by Tesseract.js, but you can put your own when calling any of the function above.
 
-<a name="worker-load"></a>
-### Worker.load(jobId): Promise
-
-Worker.load() loads tesseract.js-core scripts (download from remote if not presented), it makes Web Worker/Child Process ready for next action.
-
-**Arguments:**
-
-- `jobId` Please see details above
-
-**Examples:**
-
-```javascript
-(async () => {
-  await worker.load();
-})();
-```
-
 <a name="worker-writeText"></a>
 ### Worker.writeText(path, text, jobId): Promise
 
@@ -225,7 +206,7 @@ Worker.setParameters() set parameters for Tesseract API (using SetVariable()), i
 - `params` an object with key and value of the parameters
 - `jobId` Please see details above
 
-**Supported Paramters:**
+**Useful Paramters:**
 
 | name | type | default value | description |
 | --------------------------- | ------ | ----------------- | ------------------------------------------------------------------------------------------------------------------------------- |
@@ -234,11 +215,8 @@ Worker.setParameters() set parameters for Tesseract API (using SetVariable()), i
 | tessedit\_char\_whitelist | string | '' | setting white list characters makes the result only contains these characters, useful the content in image is limited |
 | preserve\_interword\_spaces | string | '0' | '0' or '1', keeps the space between words |
 | user\_defined\_dpi | string | '' | Define custom dpi, use to fix **Warning: Invalid resolution 0 dpi. Using 70 instead.** |
-| tessjs\_create\_hocr | string | '1' | only 2 values, '0' or '1', when the value is '1', tesseract.js includes hocr in the result |
-| tessjs\_create\_tsv | string | '1' | only 2 values, '0' or '1', when the value is '1', tesseract.js includes tsv in the result |
-| tessjs\_create\_box | string | '0' | only 2 values, '0' or '1', when the value is '1', tesseract.js includes box in the result |
-| tessjs\_create\_unlv | string | '0' | only 2 values, '0' or '1', when the value is '1', tesseract.js includes unlv in the result |
-| tessjs\_create\_osd | string | '0' | only 2 values, '0' or '1', when the value is '1', tesseract.js includes osd in the result |
+
+This list is incomplete. As Tesseract.js passes parameters to the Tesseract engine, all parameters supported by the underlying version of Tesseract should also be supported by Tesseract.js. (Note that parameters marked as “init only” in Tesseract documentation cannot be set by `setParameters` or `recognize`.)
 
 **Examples:**
 
@@ -262,8 +240,9 @@ Figures out what words are in `image`, where the words are in `image`, etc.
 **Arguments:**
 
 - `image` see [Image Format](./image-format.md) for more details.
-- `options` a object of customized options
+- `options` an object of customized options
   - `rectangle` an object to specify the regions you want to recognized in the image, should contain top, left, width and height, see example below.
+- `output` an object specifying which output formats to return (by default `text`, `blocks`, `hocr`, and `tsv` are returned)
 - `jobId` Please see details above
 
 **Output:**
@@ -273,8 +252,7 @@ Figures out what words are in `image`, where the words are in `image`, etc.
 ```javascript
 const { createWorker } = Tesseract;
 (async () => {
-  const worker = createWorker();
-  await worker.load();
+  const worker = await createWorker();
   await worker.loadLanguage('eng');
   await worker.initialize('eng');
   const { data: { text } } = await worker.recognize(image);
@@ -287,8 +265,7 @@ With rectangle
 ```javascript
 const { createWorker } = Tesseract;
 (async () => {
-  const worker = createWorker();
-  await worker.load();
+  const worker = await createWorker();
   await worker.loadLanguage('eng');
   await worker.initialize('eng');
   const { data: { text } } = await worker.recognize(image, {
@@ -313,8 +290,7 @@ Worker.detect() does OSD (Orientation and Script Detection) to the image instead
 ```javascript
 const { createWorker } = Tesseract;
 (async () => {
-  const worker = createWorker();
-  await worker.load();
+  const worker = await createWorker();
   await worker.loadLanguage('eng');
   await worker.initialize('eng');
   const { data } = await worker.detect(image);
@@ -361,7 +337,7 @@ Scheduler.addWorker() adds a worker into the worker pool inside scheduler, it is
 ```javascript
 const { createWorker, createScheduler } = Tesseract;
 const scheduler = createScheduler();
-const worker = createWorker();
+const worker = await createWorker();
 scheduler.addWorker(worker);
 ```
 
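
The `output` argument added to the `recognize` documentation in this diff selects which result formats are returned. A hedged sketch, assuming `output` is passed as the third argument to `worker.recognize` (after `options`) and that the format keys `text`, `blocks`, `hocr`, and `tsv` take boolean values. `recognizeTextOnly` is a hypothetical helper, and the worker is assumed to be already initialized:

```javascript
// Hypothetical helper, not part of Tesseract.js.
// Assumption: recognize(image, options, output) with boolean format flags.
async function recognizeTextOnly(worker, image, rectangle) {
  // Only pass a rectangle when the caller supplied one.
  const options = rectangle ? { rectangle } : {};
  // Request plain text and switch off the other default formats.
  const output = { text: true, blocks: false, hocr: false, tsv: false };
  const { data } = await worker.recognize(image, options, output);
  return data.text;
}
```

Whether disabling unused formats changes recognition time depends on the library version, so it is worth measuring before relying on it.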

docs/examples.md

Lines changed: 15 additions & 26 deletions
@@ -7,10 +7,9 @@ You can also check [examples](../examples) folder.
 ```javascript
 const { createWorker } = require('tesseract.js');
 
-const worker = createWorker();
+const worker = await createWorker();
 
 (async () => {
-  await worker.load();
   await worker.loadLanguage('eng');
   await worker.initialize('eng');
   const { data: { text } } = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png');
@@ -24,12 +23,11 @@ const worker = createWorker();
 ```javascript
 const { createWorker } = require('tesseract.js');
 
-const worker = createWorker({
+const worker = await createWorker({
   logger: m => console.log(m), // Add logger here
 });
 
 (async () => {
-  await worker.load();
   await worker.loadLanguage('eng');
   await worker.initialize('eng');
   const { data: { text } } = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png');
@@ -43,26 +41,24 @@ const worker = createWorker({
 ```javascript
 const { createWorker } = require('tesseract.js');
 
-const worker = createWorker();
+const worker = await createWorker();
 
 (async () => {
-  await worker.load();
   await worker.loadLanguage('eng+chi_tra');
   await worker.initialize('eng+chi_tra');
   const { data: { text } } = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png');
   console.log(text);
   await worker.terminate();
 })();
 ```
-### with whitelist char (^2.0.0-beta.1)
+### with whitelist char
 
 ```javascript
 const { createWorker } = require('tesseract.js');
 
-const worker = createWorker();
+const worker = await createWorker();
 
 (async () => {
-  await worker.load();
   await worker.loadLanguage('eng');
   await worker.initialize('eng');
   await worker.setParameters({
@@ -74,17 +70,16 @@ const worker = createWorker();
 })();
 ```
 
-### with different pageseg mode (^2.0.0-beta.1)
+### with different pageseg mode
 
 Check here for more details of pageseg mode: https://github.com/tesseract-ocr/tesseract/blob/4.0.0/src/ccstruct/publictypes.h#L163
 
 ```javascript
 const { createWorker, PSM } = require('tesseract.js');
 
-const worker = createWorker();
+const worker = await createWorker();
 
 (async () => {
-  await worker.load();
   await worker.loadLanguage('eng');
   await worker.initialize('eng');
   await worker.setParameters({
@@ -96,7 +91,7 @@ const worker = createWorker();
 })();
 ```
 
-### with pdf output (^2.0.0-beta.1)
+### with pdf output
 
 Please check **examples** folder for details.
 
@@ -110,11 +105,10 @@ Node: [download-pdf.js](../examples/node/download-pdf.js)
 ```javascript
 const { createWorker } = require('tesseract.js');
 
-const worker = createWorker();
+const worker = await createWorker();
 const rectangle = { left: 0, top: 0, width: 500, height: 250 };
 
 (async () => {
-  await worker.load();
   await worker.loadLanguage('eng');
   await worker.initialize('eng');
   const { data: { text } } = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png', { rectangle });
@@ -128,7 +122,7 @@ const rectangle = { left: 0, top: 0, width: 500, height: 250 };
 ```javascript
 const { createWorker } = require('tesseract.js');
 
-const worker = createWorker();
+const worker = await createWorker();
 const rectangles = [
   {
     left: 0,
@@ -145,7 +139,6 @@ const rectangles = [
 ];
 
 (async () => {
-  await worker.load();
   await worker.loadLanguage('eng');
   await worker.initialize('eng');
   const values = [];
@@ -164,8 +157,8 @@ const rectangles = [
 const { createWorker, createScheduler } = require('tesseract.js');
 
 const scheduler = createScheduler();
-const worker1 = createWorker();
-const worker2 = createWorker();
+const worker1 = await createWorker();
+const worker2 = await createWorker();
 const rectangles = [
   {
     left: 0,
@@ -182,8 +175,6 @@ const rectangles = [
 ];
 
 (async () => {
-  await worker1.load();
-  await worker2.load();
   await worker1.loadLanguage('eng');
   await worker2.loadLanguage('eng');
   await worker1.initialize('eng');
@@ -198,18 +189,16 @@ const rectangles = [
 })();
 ```
 
-### with multiple workers to speed up (^2.0.0-beta.1)
+### with multiple workers to speed up
 
 ```javascript
 const { createWorker, createScheduler } = require('tesseract.js');
 
 const scheduler = createScheduler();
-const worker1 = createWorker();
-const worker2 = createWorker();
+const worker1 = await createWorker();
+const worker2 = await createWorker();
 
 (async () => {
-  await worker1.load();
-  await worker2.load();
   await worker1.loadLanguage('eng');
   await worker2.loadLanguage('eng');
   await worker1.initialize('eng');

docs/faq.md

Lines changed: 9 additions & 31 deletions
@@ -1,6 +1,13 @@
 FAQ
 ===
 
+# Project
+## What is the scope of this project?
+Tesseract.js is the JavaScript/Webassembly port of the Tesseract OCR engine. We do not edit the underlying Tesseract recognition engine in any way. Therefore, if you encounter bugs caused by the Tesseract engine you may open an issue here for the purposes of raising awareness to other users, but fixing is outside the scope of this repository.
+
+If you encounter a Tesseract bug you would like to see fixed you should confirm the behavior is the same in the [main (CLI) version](https://github.com/tesseract-ocr/tesseract) of Tesseract and then open a Git Issue in that repository.
+
+# Trained Data
 ## How does tesseract.js download and keep \*.traineddata?
 
 The language model is downloaded by `worker.loadLanguage()` and you need to pass the langs to `worker.initialize()`.
@@ -9,34 +16,5 @@ During the downloading of language model, Tesseract.js will first check if \*.tr
 
 ## How can I train my own \*.traineddata?
 
-For tesseract.js v2, check [TrainingTesseract 4.00](https://tesseract-ocr.github.io/tessdoc/TrainingTesseract-4.00)
-
-For tesseract.js v1, check [Training Tesseract 3.03–3.05](https://tesseract-ocr.github.io/tessdoc/Training-Tesseract-3.03%E2%80%933.05)
-
-## How can I get HOCR, TSV, Box, UNLV, OSD?
-
-Starting from 2.0.0-beta.1, you can get all these information in the final result.
-
-```javascript
-import { createWorker } from 'tesseract.js';
-const worker = createWorker({
-  logger: m => console.log(m)
-});
-
-(async () => {
-  await worker.load();
-  await worker.loadLanguage('eng');
-  await worker.initialize('eng');
-  await worker.setParameters({
-    tessedit_create_box: '1',
-    tessedit_create_unlv: '1',
-    tessedit_create_osd: '1',
-  });
-  const { data: { text, hocr, tsv, box, unlv } } = await worker.recognize('https://tesseract.projectnaptha.com/img/eng_bw.png');
-  console.log(text);
-  console.log(hocr);
-  console.log(tsv);
-  console.log(box);
-  console.log(unlv);
-})();
-```
+See the documentation from the main [Tesseract project](https://tesseract-ocr.github.io/tessdoc/) for training instructions.
+

docs/local-installation.md

Lines changed: 2 additions & 2 deletions
@@ -19,7 +19,7 @@ Tesseract.recognize(image, langs, {
 Or
 
 ```javascript
-const worker = createWorker({
+const worker = await createWorker({
   workerPath: 'https://unpkg.com/tesseract.js@v2.0.0/dist/worker.min.js',
   langPath: 'https://tessdata.projectnaptha.com/4.0.0',
   corePath: 'https://unpkg.com/tesseract.js-core@v2.0.0/tesseract-core.wasm.js',
@@ -33,6 +33,6 @@ A string specifying the location of the [worker.js](./dist/worker.min.js) file.
 A string specifying the location of the tesseract language files, with default value 'https://tessdata.projectnaptha.com/4.0.0'. Language file URLs are calculated according to the formula `langPath + langCode + '.traineddata.gz'`.
 
 ### corePath
-A string specifying the location of the [tesseract.js-core library](https://github.com/naptha/tesseract.js-core), with default value 'https://unpkg.com/tesseract.js-core@v2.0.0/tesseract-core.wasm.js' (fallback to tesseract-core.asm.js when WebAssembly is not available).
+A string specifying the location of the [tesseract.js-core library](https://github.com/naptha/tesseract.js-core), with default value 'https://unpkg.com/tesseract.js-core@v2.0.0/tesseract-core.wasm.js'.
 
 Another WASM option is 'https://unpkg.com/tesseract.js-core@v2.0.0/tesseract-core.js' which is a script that loads 'https://unpkg.com/tesseract.js-core@v2.0.0/tesseract-core.wasm'. But it fails to fetch at this moment.
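
The language-file URL rule quoted in the `langPath` context line of this diff can be sketched as a small function. This assumes a `/` separator is inserted between `langPath` and the language code, which the formula as written in the docs omits. `languageFileUrl` is a hypothetical name, not a Tesseract.js export:

```javascript
// Hypothetical helper illustrating langPath + '/' + langCode + '.traineddata.gz'.
function languageFileUrl(langPath, langCode) {
  // Drop trailing slashes so the result never has '//' before the file name.
  const base = langPath.replace(/\/+$/, '');
  return `${base}/${langCode}.traineddata.gz`;
}

console.log(languageFileUrl('https://tessdata.projectnaptha.com/4.0.0', 'eng'));
// prints "https://tessdata.projectnaptha.com/4.0.0/eng.traineddata.gz"
```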
